Installation and Execution Guide for QDNAseq.snakemake Pipeline

public public 1yr ago 0 bookmarks

For the installation of this pipeline any Python install compatable Conda is required.

The pipeline itself will run on Python 3.8.5 and R 3.6.3. For exact dependencies view environment.yaml and r-dependencies.R .

Using Conda/Mamba

for easy installation you need (Mini)Conda.

Miniconda installation from folder where you want to install Miniconda:

cd </path/to/files/dir/>
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

follow the instructions of the installation process, give the location where you want Miniconda to be installed and answer YES to add Miniconda to your path.

go to the directory where the analysis need to be performed

cd </path/to/analysis/dir>
git clone https://github.com/tgac-vumc/QDNAseq.snakemake/
cd QDNAseq.snakemake

install Mamba as drop-in replacement for Conda with Mamba's improved installation-performance:

conda install -c conda-forge mamba

create the environment using Mamba:

mamba env create --name QDNAseq-snakemake --file environment.yaml

activate the environment by:

conda activate QDNAseq-snakemake

Then run the R-script r-dependencies.R in the terminal to install the non-conda R dependencies in the environment:

Rscript r-dependencies.R

Using Singularity

Under development

Preparing analysis

Prepare the data

go to analysis dir and prepare analysis by copy or create links to fastq.gz files:

cd </path/to/analysis/dir>
mkdir fastq
cd fastq

to link a single file:

ln -s <path/to/file>

to link all files from a folder:

for file in <path/to/fastq/files>/*.fastq.gz
do ln -s $file
done

Prepare the snakemake settings

Open the configuration file config.yaml to check the settings that snakemake will use and change according to your needs. For providing service-analysis, set setting to 'service' . For research purposes, set setting to 'research' . For all settings set setting to 'all' .

One of the options in the configfile is dewaving , if set to 'true' QNDAseq objects will be dewaved before segmentation.

These options change the rules performed in the pipeline, see the rule-graph in the next section.

Running analysis

Make sure that snakemake is able to find the excecutive file Snakefile by performing a dry-run:

cd ../QDNAseq.snakemake
snakemake -n

Check the rules that are planned to be performed, conform the rule-graph.

An visualization of the order of rules to be performed can be viewed by running the following command and opening the DAG-file

snakemake --forceall --rulegraph | dot -Tsvg > DAG.svg

Rulegraphs for the intial settings 'service' , 'research' and 'all' are commited to this repro in the files DAG_<setting>.svg .

When ready, run the analysis

snakemake

Useful snakemake options

-j , --cores, --jobs : Use at most N cores in parallel (default: 1). If N is omitted, the limit is set to the number of available cores.

-n , --dryrun : Do not execute anything. but show rules which are planned to be performed.

-k , --keep-going : Go on with independent jobs if a job fails.

-f , --force : Force the execution of the selected target or the first rule regardless of already created output.

-R , --forcerun : Force the re-execution or creation of the given rules or files. Use this option if you changed a rule and want to have all its output in your workflow updated.

-U , --until : Runs the pipeline until it reaches the specified rules or files. Only runs jobs that are dependencies of the specified rule or files, does not run sibling DAGs.

for all options go to https://snakemake.readthedocs.io/en/v5.31.1/executing/cli.html#all-options

Code Snippets

 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(QDNAseq))
suppressMessages(library(stringr))
} else{
library(QDNAseq)
library(stringr)
}

source('scripts/ACE.R', echo = !msg)

ploidies<-as.integer(snakemake@wildcards[["ploidy"]])
inputfile <-snakemake@input[["segmented"]]
outputdir<-snakemake@params[["outputdir"]]
failed <- snakemake@params[["failed"]]
log<-snakemake@log[[1]]
fitpicker <- snakemake@output[["fitpicker"]] # Insert by Erik 2021-01-27

imagetype <- snakemake@config[["ACE"]][["imagetype"]]
method<-snakemake@config[["ACE"]][["method"]]
penalty<-as.numeric(snakemake@config[["ACE"]][["penalty"]])
cap<-as.integer(snakemake@config[["ACE"]][["cap"]])
trncname<- snakemake@config[["ACE"]][["trncname"]]
printsummaries<- snakemake@config[["ACE"]][["printsummaries"]]

copyNumbersSegmented <- readRDS(inputfile)

parameters <- data.frame(options = c("ploidies","imagetype","method","penalty","cap","trncname","printsummaries"),
                         values = c(paste0(ploidies,collapse=", "),imagetype,method,penalty,cap,trncname,printsummaries))

#write.table(parameters, file=log, quote = FALSE, sep = "\t", na = "", row.names = FALSE)
write.table(parameters, file=log, quote = FALSE, sep = "\t", na = "", col.names = NA)

ploidyplotloop(copyNumbersSegmented ,outputdir , ploidies,imagetype,method,penalty,cap,trncname,printsummaries) # 1 to 3 dir.create warnings and 4 to 15  removed rows warnings, output gives NULL

#create output for failed samples - for snakemake compatibility.
failed_samples<-read.table(failed, stringsAsFactors=FALSE, header=TRUE)
if(length(failed_samples[,1]>0)){
  for(file in failed_samples[,1]){file.create(paste(outputdir, ploidies,"N/", file,"/summary_",file,".",imagetype,sep=""))}
}
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(QDNAseq))
suppressMessages(library(Biobase))
suppressMessages(library(CGHcall))
suppressMessages(library(CGHregions))
suppressMessages(library(CGHtest))
} else{
library(QDNAseq)
library(Biobase)
library(CGHcall)
library(CGHregions)
library(CGHtest)
}


source('scripts/CGHcallPlus.R', echo = FALSE)

recalled <- snakemake@input[["recalled"]]
RegionsCGH<-snakemake@output[["RegionsCGH"]]
profiles <- snakemake@output[["profiles"]]

averr <- snakemake@params[["averr"]]  #0.0075	 # default = 0.01

log<-snakemake@log[[1]]
log<-file(log, open="wt")
sink(log, append=T, split=FALSE)
##############################################################################################################
# CGH regions & frequency plot regions
##############################################################################################################

QCN.reCalled <- readRDS(recalled)

# Make CGH
CGH_reCalledRCs <- makeCgh(QCN.reCalled)

# perform CGHregions
reCalledRegionsCGH <- CGHregions(CGH_reCalledRCs, averror=averr)

# save data:
saveRDS(reCalledRegionsCGH, RegionsCGH)

# frequency plot reCalled (CGH) regions
#filename <- paste('profiles/freqPlot/freqPlotREGIONS_', bin , 'kbp.png', sep='')
png(profiles, res=300, width=14, height=7, unit='in')
frequencyPlot(reCalledRegionsCGH)
dev.off()
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(CGHtest))
suppressMessages(library(readxl))
} else{
library(CGHtest)
library(readxl)
}

source("scripts/CGHtest_functions.R", echo = FALSE)

##########
# input file paths
CGHregions_RDS_file_path = '../../../output-CGH/100kbp/data/100kbp-RegionsCGH.rds'
CGHregions_RDS_file_path <- snakemake@input[["RegionsCGH"]]

freqPlotCompare_output <- snakemake@output[["freqPlotCompare"]]
CGHtest_output <- snakemake@output[["CGHtest"]]
plotPFDR_output <- snakemake@output[["plotPFDR"]]
combined_output <- snakemake@output[["combined"]]

# params path
outputdir <- snakemake@params[["outputdir"]]
clinicaldataPath <- 'clinical.xlsx'
clinicaldataPath <- snakemake@params[["clinicaldataPath"]]

clinicaldata_col_snames <- 1
clinicaldata_col_snames <- snakemake@params[["columnSampleNames"]]

set_clinicaldata_class_samples <- c('short','long')
set_clinicaldata_class_samples <- snakemake@params[["ClassSamples"]]
clinicaldata_col_class_samples <- 3
clinicaldata_col_class_samples <- snakemake@params[["columnClassSamples"]]


log<-snakemake@log[[1]]
log<-file(log, open="wt")
sink(log, append=T, split=FALSE)
##################

if(!dir.exists(outputdir)){dir.create(outputdir)}

# read CGHregions RDS
CGHregions_RDS_file <- readRDS(CGHregions_RDS_file_path)

# get sample names from CGHregions
snames <- sampleNames(CGHregions_RDS_file)

# read clinical data
clinicaldata <- as.data.frame(read_excel(clinicaldataPath))
clinicaldata_snames  <- clinicaldata[,clinicaldata_col_snames]

if (!isTRUE(all.equal(sort(snames),sort(clinicaldata_snames)))){
    print(data.frame('Clinical Sample Names' = clinicaldata_snames, 'CGHregions Sample Names' = snames))
    stop('Sample Names in clinical data and CGHregions RDS file do not match!')
}
# match sample names from CGHregions with sample names in clinical data
snames_i  <- match(clinicaldata_snames, snames)

# redefine clinical data matching the CGHregions names
clinicaldata <- clinicaldata[snames_i,]

# define clinical data classes and match with set classes
clinicaldata_class_samples_all <- clinicaldata[,clinicaldata_col_class_samples]
clinicaldata_class_samples_unique <- unique(clinicaldata_class_samples_all)

tempa <- clinicaldata_class_samples_unique
tempb <- set_clinicaldata_class_samples
if (!isTRUE(all.equal(sort(tempa),sort(tempb)))){
    print(data.frame('Classes in clinical data' = tempa, 'Classes defined by user in config' = tempb))
    stop('Classes in the clinical data do not match the classes defined for the CGHtest-setting in the configuration-file!')
}

# define groups to compare
group1_index <- which(clinicaldata_class_samples_all == set_clinicaldata_class_samples[1]) # c(4,5,8,11,14,17) for clinical.xlsx
group2_index <- which(clinicaldata_class_samples_all == set_clinicaldata_class_samples[2]) # c(1,2,3,6,7,9,10,12,13,15,16,18) for clinical.xlsx

# subset data
group1 <- CGHregions_RDS_file[,group1_index]
group2 <- CGHregions_RDS_file[,group2_index]

# Compare 2 cohorts
teststat <- "Chi-square"
compareCNAs2cohorts(x = CGHregions_RDS_file,
                    data1 = group1,
                    data2 = group2,
                    cohort1_ID = "group1",
                    cohort2_ID = "group2",
                    teststat,
                    output_freqPlotCompare = freqPlotCompare_output,
                    output_CGHtest = CGHtest_output,
                    output_plotPFDR = plotPFDR_output,
                    output_combined = combined_output)
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(QDNAseq))
suppressMessages(library(Biobase))
suppressMessages(library(CGHcall))
suppressMessages(library(CGHtest))
suppressMessages(library(denstrip))
} else{
library(QDNAseq)
library(Biobase)
library(CGHcall)
library(CGHtest)
library(denstrip)
}


source('scripts/CGHcallPlus.R', echo = !msg)
source('scripts/plotQDNAseq.R', echo = !msg)

called <- snakemake@input[["called"]]
fitpickertable<- snakemake@input[["fitpicker"]]

recalled <- snakemake@output[["recalled"]]
#copynumbers<-snakemake@output[["copynumbers"]]#moved to QDNAseq-segment
#segments<-snakemake@output[["segments"]]  #moved to QDNAseq-segment
calls<-snakemake@output[["calls"]]
profiles <- snakemake@params[["profiles"]]
#copynumbersbed<-snakemake@params[["copynumbersbed"]]  #moved to QDNAseq-segment
#segmentsbed<-snakemake@params[["segmentsbed"]] #moved to QDNAseq-segment
failed <- snakemake@params[["failed"]]
#bedfolder <- snakemake@params[["bedfolder"]]
freqplot<- snakemake@output[["freqplot"]]

minimum_cellularity <- snakemake@params[["minimum_cellularity"]]

log<-snakemake@log[[1]]
log<-file(log, open="wt")
sink(log, append=TRUE, split=FALSE)

##############################################################################################################
# Call data & frequency plot calls
##############################################################################################################
#First part is to call with altered cellularity using chgCall.
# load data
QCN.fcnsdsn <- readRDS(called)
fitpicker<- read.delim(fitpickertable, stringsAsFactors=FALSE)
fitpicker$likely_fit[fitpicker$likely_fit < minimum_cellularity] <- minimum_cellularity

#perform functions
# Call with correction for cellularity
reCalled <- callBins(QCN.fcnsdsn, nclass=5, cellularity=fitpicker$likely_fit)
saveRDS(reCalled, recalled)

##############################################################################################################
# Create profiles and frequency plot
##############################################################################################################
fitpicker$likely_fit<-paste0("cellularity: ",fitpicker$likely_fit )

if(!dir.exists(profiles)){dir.create(profiles)}
plotQDNAseq(reCalled, profiles , subtitle=fitpicker$likely_fit)

# frequency plot calls
png(freqplot, res=300, width=14, height=7, unit='in')
frequencyPlot(reCalled)
dev.off()

##############################################################################################################
# Create IGV objects from readcounts
##############################################################################################################

#exportBins(reCalled, copynumbers, format="igv", type="copynumber") moved to CNA segments
#exportBins(reCalled, segments, format="igv", type="segments") moved to CNA segments
exportBins(reCalled, calls, format="igv", logTransform=FALSE, type="calls")
#exportBins(reCalled, file=copynumbersbed, format="bed", logTransform=TRUE, type="copynumber")
#exportBins(reCalled, file=segmentsbed, format="bed", type="segments")

#create output for failed samples - for snakemake compatibility.
failed_samples<-read.table(failed, stringsAsFactors=FALSE, header=TRUE)
if(length(failed_samples[,1]>0)){for(file in failed_samples[,1]){file.create(paste(profiles, file,".png",sep=""))
#file.create(paste(bedfolder, file,"-copynumbers.bed",sep=""))
#file.create(paste(bedfolder, file,"-segments.bed",sep=""))
}}
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(QDNAseq))
suppressMessages(library(Biobase))
suppressMessages(library(CGHcall))
suppressMessages(library(CGHtest))
} else{
library(QDNAseq)
library(Biobase)
library(CGHcall)
library(CGHtest)
}


source('scripts/CGHcallPlus.R', echo = !msg)
source('scripts/plotQDNAseq.R', echo = !msg)

segmented <- snakemake@input[["segmented"]]
bin <- as.integer(snakemake@wildcards[["binSize"]])
called <- snakemake@output[["called"]]
profiles <- snakemake@params[["profiles"]]
freqplots <- snakemake@output[["freqplot"]]
failed <- snakemake@params[["failed"]]
calls<-snakemake@output[["calls"]]

log<-snakemake@log[[1]]
log<-file(log, open="wt")
sink(log, append=TRUE, split=FALSE)

##############################################################################################################
# Call data & frequency plot calls
##############################################################################################################

# load data
QCN.fcnsdsn <- readRDS(segmented)

# perform functions
# Call withouth correction for cellularity
QCN.fcnsdsnc <- callBins(QCN.fcnsdsn, nclass=5)
saveRDS(QCN.fcnsdsnc, called)

# Plot called profiles
plotQDNAseq(QCN.fcnsdsnc, profiles)

# frequency plot calls
png(freqplots, res=300, width=14, height=7, unit='in')
frequencyPlot(QCN.fcnsdsnc, losscol='blue', gaincol='red', delcol="darkblue", ampcol="darkred")
dev.off()

#create output for failed samples - for snakemake compatibility.
failed_samples<-read.table(failed, stringsAsFactors=FALSE, header=TRUE)
if(length(failed_samples[,1]>0)){for(file in failed_samples[,1]){file.create(paste(profiles, file,".png",sep=""))
}}

##############################################################################################################
# Create IGV objects
##############################################################################################################

exportBins(QCN.fcnsdsnc, calls, format="igv", logTransform=FALSE, type="calls")
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(QDNAseq))
suppressMessages(library(Biobase))

# for dewaving:
suppressMessages(library(limma))
suppressMessages(library(gtools))
suppressMessages(library(impute))
suppressMessages(library(MASS))
} else{
library(QDNAseq)
library(Biobase)

# for dewaving:
library(limma)
library(gtools)
library(impute)
library(MASS)
}


source("scripts/functions.R", echo = !msg)
source("scripts/plotQDNAseq.R", echo = !msg)
sourceDir(snakemake@config[["QDNAseq"]][["dewave_dir"]], echo = !msg)
load(snakemake@params[["dewave_data"]], verbose = !msg)

bin <- as.integer(snakemake@wildcards[["binSize"]])
corrected <- snakemake@input[["corrected"]]
profiles <- snakemake@params[["profiles"]]
dewaved <- snakemake@output[["dewaved"]]

log<-snakemake@log[[1]]
log<-file(log, open="wt")
sink(log, append=T , split=FALSE)
##############################################################################################################
# Correct, Normalize & Dewave raw data
##############################################################################################################

QCN.fcns <- readRDS(corrected)

if(bin!=1000){
QCN.fcns <- dewaveBins(QCN.fcns)
}

saveRDS(QCN.fcns, dewaved)
plotQDNAseq(QCN.fcns, profiles)
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(QDNAseq))
suppressMessages(library(CGHregions))
} else{
library(QDNAseq)
library(CGHregions)
}

source("scripts/addCytobands.R", echo = !msg)
source("scripts/CGHcallPlus.R", echo = !msg)

RegionsCGH<-snakemake@input[["RegionsCGH"]]
max.focal.size.mb<-snakemake@params[["max_focal_size_mb"]]
min.freq.focal<-snakemake@params[["min_freq_focal"]]
allRegions<-snakemake@output[["allRegions"]]
allFocalRegions<-snakemake@output[["allFocalRegions"]]
cytoband_data<-snakemake@params[["cytobands"]]

log<-snakemake@log[[1]]
log<-file(log, open="wt")
sink(log, append=T, split=FALSE)
##############################################################################################################
# Create CGHregion BED file + Annotate
##############################################################################################################

# load data
QCN.Regions <- readRDS(RegionsCGH)

# create BED file from CGHregions

makeCGHregionsTable <- function(CGHregions, max.focal.size.mb=3, min.freq.focal=25, output_allRegions, output_allFocalRegions,  cytoband_data){

	# all regions in BED format
	regionsBED <- fData(CGHregions)[,1:3]

	# region sizes
	regionSize <- fData(CGHregions)[,3] - fData(CGHregions)[,2]
	regionsBED <- cbind(regionsBED, regionSize)

	# add cytoband info
	regionsBED <- addCytobands(regionsBED, cytoband_data)

	# frequencies of gains and losses
	calls <- regions(CGHregions)
	loss.freq <- rowMeans(calls < 0)
	gain.freq <- rowMeans(calls > 0)

	# add to 1 table
	regionData <- cbind(regionsBED, loss.freq, gain.freq)
	colnames(regionData) <- c('chromo', 'start', 'end', 'Region size (Mb)', 'Cytoband', 'loss (%)', 'gain (%)')

	# round integers:
	regionData[,4] <- round(regionData[,4]/1000000, digits=2)
	regionData[,c(6,7)] <- round(regionData[,c(6,7)]*100, digits=2)

	# save
	write.table(regionData, output_allRegions, sep='\t', row.names=F, col.names=T, quote=F)

	# make separate table for focals only
	focal.ix <- which(regionData[,4] <= max.focal.size.mb)
	focalRegions <- regionData[focal.ix, ]

	# and in more than 25% of samples
	loss.cutoff.ix <- which(focalRegions[,6] > min.freq.focal)
	gain.cutoff.ix <- which(focalRegions[,7] > min.freq.focal)
	cutoff.ix <- sort( c(loss.cutoff.ix, gain.cutoff.ix ))

	focalRegions <- focalRegions[cutoff.ix, ]
	if(nrow(focalRegions) != 0 ){rownames(focalRegions) <- c(1:nrow(focalRegions))}

	write.table(focalRegions, output_allFocalRegions, sep='\t', row.names=F, col.names=F, quote=F)
}

makeCGHregionsTable(QCN.Regions, max.focal.size.mb, min.freq.focal, allRegions, allFocalRegions,  cytoband_data)
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(QDNAseq))
suppressMessages(library(denstrip))
suppressMessages(library(CGHcall))
} else{
library(QDNAseq)
library(denstrip)
library(CGHcall)
}

source("scripts/functions.R", echo = !msg)
source("scripts/addCytobands.R", echo = !msg)
source("scripts/CGHcallPlus.R", echo = !msg)

recalled <- snakemake@input[["recalled"]]
beddir <- snakemake@params[["beddir"]]
cytoband_data<-snakemake@params[["cytobands"]]
max.focal.size.mb<-snakemake@params[["max_focal_size_bed"]]
failed <- snakemake@params[["failed"]]

log<-snakemake@log[[1]]
log<-file(log, open="wt")
sink(log, append=TRUE, split=FALSE)

##############################################################################################################

QCN.reCalled <- readRDS(recalled)

# create BED files

makeCNAbedFile <- function(CalledQDNAseqReadCounts, max.focal.size.mb=3, beddir,cytoband_data){
  for (i in 1:length(sampleNames(CalledQDNAseqReadCounts))){

		#
		sample <- CalledQDNAseqReadCounts[,i]
		sname <- sampleNames(sample)

		# print
		print( paste('Creating BED files of all CNAs in sample: ', sname, sep='') )

		# get all call and segment values of CNAs in the sample
		allCNA.ix <- which(calls(sample)!=0)
		allCNAs <- calls(sample)[ allCNA.ix ]
		seg.allCNA <- segmented(sample)[ allCNA.ix ]

		# coordinates of all CNAs
		reg.allCNAs <- fData(sample)[ allCNA.ix, 1:3]

		# Put together in BED format
		allCNAsPerBin <- cbind(reg.allCNAs, allCNAs, seg.allCNA)
		colnames(allCNAsPerBin)[4] <- 'call'
		colnames(allCNAsPerBin)[5] <- 'segment'

    bedfile<-paste(beddir, sname, '_allCNAsPerBin.bed', sep="")
		# save as BED file; with or without column names..?
		write.table(allCNAsPerBin, bedfile, sep='\t', row.names=F, col.names=T, quote=F)

		# create BED file for all called regions:
		# collapse all adjacent bins based on unique segment values
		coll.calls.bed <- c()
		#all_segments <- uniquecreate.file()(allCNAsPerBin$segment)
    if(length(unique(allCNAsPerBin$segment)) !=0){
			for (j in 1:length(unique(allCNAsPerBin$segment))){
			# first segment

			SEG.val <- as.numeric( unique(allCNAsPerBin$segment)[j] )

			# if the unique segment value occurs more often, then for each one create an entry
			intervals <- seqToIntervals(which(allCNAsPerBin$segment== SEG.val ))
			allCNAsPerBin[allCNAsPerBin$chromosome=="X","chromosome"]<-23
			allCNAsPerBin[allCNAsPerBin$chromosome=="Y","chromosome"]<-24
				for (z in 1:nrow(intervals)){

					start.ix 	<- intervals[z,1]
					end.ix 		<- intervals[z,2]

					SEG.start 	<- as.numeric( allCNAsPerBin[start.ix,2] )
					SEG.end 	<- as.numeric( allCNAsPerBin[end.ix,3] )
					SEG.size 	<- as.numeric( allCNAsPerBin[end.ix,3] - allCNAsPerBin[start.ix,2] )

					SEG.chromo 	<- as.numeric(allCNAsPerBin[end.ix,1])
					SEG.call 	<- as.numeric( allCNAsPerBin[end.ix,4] )

					# check if chromosome is the same; if not the bin spans more chromos and should be separated
					if(allCNAsPerBin[start.ix, 1] != allCNAsPerBin[end.ix, 1]){
					       print(paste(j, 'WARNING: Segment ',  SEG.val, ' spans more than 1 chromosome', sep=''))}

					SEG.coll 	<- cbind(SEG.chromo, SEG.start, SEG.end, SEG.size, SEG.val, SEG.call)

					coll.calls.bed <- rbind(coll.calls.bed, SEG.coll)
				}
			}

			# reorder dataframe
			CNAbed <- coll.calls.bed
			CNAbed <- CNAbed [ order(CNAbed[,1], CNAbed[,2]), ]
			# if there is only 1 CNA in sample, transform the dataframe
			if	(nrow(coll.calls.bed )==1)  {
			CNAbed <- as.data.frame(t(CNAbed))
			}	else {
			CNAbed <- as.data.frame(CNAbed)
			}

			CNAbed[,4] <- round( CNAbed[,4]/1000000, digits = 2)
			CNAbed[,5] <- round( CNAbed[,5], digits=2)


			# add cytoband info
			CNAbed <- addCytobands(CNAbed, cytoband_data)
			CNAbed <- CNAbed[, c(1:4,7,5,6)]
    }else{CNAbed<- data.frame(matrix(ncol = 7, nrow = 0))}
      colnames(CNAbed) <- c('chromo', 'start', 'end', 'cytobands','sizeInMb', 'segment', 'call')
      fileName <- paste(beddir, sname, '_CNAs.bed', sep='')
      write.table(CNAbed, fileName, sep='\t', row.names=F, col.names=T, quote=F)

			# make a separate BED file with only focal CNAs (< 3 Mb)

			focal.ix <- which(CNAbed[,4] <= max.focal.size.mb)
			focalCNAs <- CNAbed[focal.ix, ]

			focalfileName <- paste(beddir, sname, '_focalCNAs.bed', sep='')
			write.table(focalCNAs, focalfileName, sep='\t', row.names=F, col.names=F, quote=F)
	}

}

makeCNAbedFile(QCN.reCalled, max.focal.size.mb, beddir, cytoband_data)

#create output for failed samples - for snakemake compatibility.
failed_samples<-read.table(failed, stringsAsFactors=FALSE, header=TRUE)
if(length(failed_samples[,1]>0)){for(file in failed_samples[,1]){
    file.create(paste(beddir, file, '_allCNAsPerBin.bed', sep=""))
    file.create(paste(beddir, file, '_CNAs.bed', sep=''))
    file.create(paste(beddir, file, '_focalCNAs.bed', sep=''))
}}
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(QDNAseq))
} else{
library(QDNAseq)
}

source('scripts/ACE.R', echo = !msg)

## Erik
#copyNumbersSegmented <- snakemake@output[["ACE_post"]]
##

inputfile <-snakemake@input[["segmented"]]
outputdir<-snakemake@params[["outputdir"]]
failed <- snakemake@params[["failed"]]
fitpickertable<-snakemake@input[["fitpicker"]]
ploidies<-as.integer(snakemake@wildcards[["ploidy"]])

imagetype <- snakemake@config[["ACE"]][["imagetype"]]
trncname<- snakemake@config[["ACE"]][["trncname"]]

copyNumbersSegmented <- readRDS(inputfile)

postanalysisloop(copyNumbersSegmented, modelsfile=fitpickertable, imagetype=imagetype, outputdir=outputdir, trncname=trncname)

# if Rplots.pdf exists rename from QDNAseq-snakemake directory to outputdir
if (file.exists("./Rplots.pdf")){
    file.rename(from = "./Rplots.pdf", to = paste0(outputdir,"all_samples.pdf"))
}

failed_samples<-read.table(failed, stringsAsFactors=FALSE, header=TRUE)
if(length(failed_samples[,1]>0)){
  for(file in failed_samples[,1]){file.create(paste(outputdir,"segmentfiles/",file,"_segments.tsv",sep=""))}
}
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(QDNAseq))
suppressMessages(library(Biobase))
suppressMessages(library(R.cache))
} else{
library(QDNAseq)
library(Biobase)
library(R.cache)
}


setCacheRootPath(path="../.Rcache")

bam <- snakemake@input[["bams"]]
bin <- as.integer(snakemake@wildcards[["binSize"]])
genome <- snakemake@params[["genome"]]
binReadCounts <- snakemake@output[["binReadCounts"]]

##############################################################################################################
# Get bin annotations and bin read counts
##############################################################################################################
#custom bin annotations for chrX used - did this whole script by hand for 100kbp See creationofblacklist.R
bins <- getBinAnnotations(bin, genome=genome)


QRC <- binReadCounts(bins, bamfiles=bam, cache=TRUE)

sub("(_[ACGT]+)?(_S\\d+)?(_L\\d{3})?_R\\d{1}_\\d{3}(\\.f(ast)?q\\.gz)?$", "", sampleNames(QRC)) -> samples

if (sum(duplicated(samples)) > 0) {
        QRC <- poolRuns(QRC, samples=samples, force=TRUE)
}
saveRDS(QRC, binReadCounts)
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(QDNAseq))
suppressMessages(library(Biobase))
} else{
library(QDNAseq)
library(Biobase)
}

source("scripts/functions.R", echo = !msg)
source("scripts/plotQDNAseq.R", echo = !msg)

binReadCounts <- snakemake@input[["binReadCounts"]]
bin <- as.integer(snakemake@wildcards[["binSize"]])
corrected <- snakemake@output[["corrected"]]
profiles <- snakemake@params[["profiles"]]
chrom_filter <- c(snakemake@params[["chrom_filter"]])

log<-snakemake@log[[1]]
log<-file(log, open="wt")
sink(log, append=T , split=FALSE)
##############################################################################################################
# Correct, Normalize & Dewave raw data
##############################################################################################################
QRC <- readRDS(binReadCounts)

#did these steps manualy as is described in second part creationofblacklist.R for samples with chrX.

QRC.f <- applyFilters(QRC, residual=TRUE, blacklist=TRUE, mappability=FALSE, bases=FALSE , chromosomes=chrom_filter)

#featureData(QRC.f)$use[c(22232:22237)]<-F  # segment 14:22400001-22500000 t/m 14:22900001-23000000 on chr14
QRC.f <- estimateCorrection(QRC.f)
#QRC.f <- applyFilters(QRC, residual=TRUE, blacklist=TRUE, mappability=FALSE, bases=FALSE)  - used for bia-ALCL (2019-07-09 not)
#QRC.f <- estimateCorrection(QRC.f, residual=TRUE, blacklist=TRUE, mappability=FALSE, bases=FALSE) - used for bia-ALCL(2019-07-09 not)
#QRC.f <- applyFilters(QRC.f, residual=TRUE, blacklist=TRUE, mappability=FALSE, bases=FALSE, chromosome="Y") - used for bia-ALCL(2019-07-09 not)

QCN.fc <- correctBins(QRC.f)
QCN.fcn <- normalizeBins(QCN.fc)
QCN.fcns <- smoothOutlierBins(QCN.fcn)

saveRDS(QCN.fcns, corrected)
plotQDNAseq(QCN.fcns, profiles)
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(QDNAseq))
suppressMessages(library(Biobase))
} else{
library(QDNAseq)
library(Biobase)
}

source("scripts/plotQDNAseq.R", echo  = !msg)

dewaved<- snakemake@input[["dewaved"]]
bin <- as.integer(snakemake@wildcards[["binSize"]])
segmented <- snakemake@output[["segmented"]]
profiles <- snakemake@params[["profiles"]]
failed<- snakemake@params[["failed"]]
min_used_reads<-snakemake@params[["minimal_used_reads"]]

copynumbersbed<-snakemake@params[["copynumbersbed"]]
segmentsbed<-snakemake@params[["segmentsbed"]]
copynumbers<-snakemake@output[["copynumbers"]]
segments<-snakemake@output[["segments"]]
bedfolder <- snakemake@params[["bedfolder"]]

log<-snakemake@log[[1]]
log<-file(log, open="wt")
sink(log, append=TRUE , split=FALSE)

# Adjust segmentation settings based on binsize
if (bin==15) {SDundo=0.75; alph=1e-15}
if (bin==30) {SDundo=0.75; alph=1e-15}
if (bin==100) {SDundo=0.10; alph=1e-20} # default = 1e-20 # 0.01 used for PELLL_FS8_a0.01
if (bin==1000) {SDundo=0.10; alph=1e-20}
# TODO: mogelijk hogere alpha nodig, om meer segmenten te krijgen: 0.01 (1e-2). Deze setting used in tmp_100kbp

# load data
QCN.fcnsd <- readRDS(dewaved)

QCN.fcnsds <- segmentBins(QCN.fcnsd[,QCN.fcnsd$used.reads > min_used_reads ], undo.splits='sdundo', undo.SD=SDundo, alpha=alph, transformFun="sqrt") # gives 'Performing segmentation: NA
QCN.fcnsdsn <- normalizeSegmentedBins(QCN.fcnsds)
saveRDS(QCN.fcnsdsn, segmented)

##############################################################################################################
# Create profiles
##############################################################################################################

plotQDNAseq(QCN.fcnsdsn, profiles)

#create output for failed samples - for snakemake compatibility.
littledata<-colnames(QCN.fcnsd[,QCN.fcnsd$used.reads <= min_used_reads ])
if(length(littledata>0)){for(file in littledata){file.create(paste(profiles, file,".png",sep=""))
file.create(paste(bedfolder, file,"-copynumbers.bed",sep=""))
file.create(paste(bedfolder, file,"-segments.bed",sep=""))
}}

write.table(littledata, file=failed )

##############################################################################################################
# Create IGV objects and bedfiles from readcounts
##############################################################################################################

exportBins(QCN.fcnsdsn, copynumbers, format="igv", type="copynumber")
exportBins(QCN.fcnsdsn, segments, format="igv", type="segments")
exportBins(QCN.fcnsdsn, file=copynumbersbed, format="bed", logTransform=TRUE, type="copynumber")
exportBins(QCN.fcnsdsn, file=segmentsbed, format="bed", type="segments")
74
75
76
77
shell:
    "bwa aln -n {params.n} -t {threads} -q {params.q} {params.ref} {input} > {output.sai} 2> {log}; "
    "bwa samse -f {output.samse} -r '@RG\\tID:{wildcards.sample}\\tSM:{wildcards.sample}'" 
    " {params.ref} {output.sai} {input} 2>> {log}"
88
89
shell:
    "samtools view -uS {input.samse} 2> {log}| samtools sort - -o {output} 2>> {log}"
100
101
102
103
shell:
    "picard MarkDuplicates I={input} O={output.bam} M={output.metrics_file}"
    " ASSUME_SORTED=true CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT USE_JDK_DEFLATER=true USE_JDK_INFLATER=true &> {log}; "
    "ln -s {output.bai} {output.bambai}"
113
114
shell:
    "{input.script} {input.bam} {wildcards.sample} {params.outdir} {output}"
SnakeMake From line 113 of master/Snakefile
125
126
script:
    "scripts/Run_QDNAseq_binReadCounts.R"
SnakeMake From line 125 of master/Snakefile
139
140
script:
    "scripts/Run_QDNAseq_normalize.R"
SnakeMake From line 139 of master/Snakefile
153
154
script:
    "scripts/Run_deWave.R"
SnakeMake From line 153 of master/Snakefile
175
176
script:
    "scripts/Run_QDNAseq_segment.R"
SnakeMake From line 175 of master/Snakefile
191
192
script:
    "scripts/Run_CNA_call.R"
SnakeMake From line 191 of master/Snakefile
209
210
script:
    "scripts/Run_CNA_call_cellularity_based.R"
SnakeMake From line 209 of master/Snakefile
227
228
script:
    "scripts/Run_makeCNAbedFile.R"
SnakeMake From line 227 of master/Snakefile
241
242
shell:
    "{input.script} {input.bedfile} {wildcards.sample} {params.outdir} {output} {log} 2> {log}"
SnakeMake From line 241 of master/Snakefile
254
255
script:
    "scripts/Run_CGHregions.R"
SnakeMake From line 254 of master/Snakefile
269
270
script:
    "scripts/Run_makeCGHregionstable.R"
SnakeMake From line 269 of master/Snakefile
282
283
shell:
    "{input.script} {input.bedfile} {params.filename} {params.outdir} {output} 2> {log}"
SnakeMake From line 282 of master/Snakefile
294
295
shell:
    "{input.script} {params.profiles} {params.lb2dir} > {output.index}"
SnakeMake From line 294 of master/Snakefile
312
313
shell:
    "{input.script} {wildcards.binSize}kpb {params.bamfolder} {params.statsfolder} {params.qcFastqfolder} {params.qcBamfolder}> {output}"
SnakeMake From line 312 of master/Snakefile
328
329
shell:
    "fastqc {input.fastq} --outdir {params.qcfastqdir} -t {threads} 2>> {log}"
341
342
shell:
    "fastqc {input.bam} --format bam_mapped --outdir {params.qcbamdir} -t {threads} 2>> {log}"
355
356
script:
    "scripts/Run_ACE.R"
SnakeMake From line 355 of master/Snakefile
369
370
script:
    "scripts/Run_postanalysisloop_ACE.R"
SnakeMake From line 369 of master/Snakefile
388
389
script:
    "scripts/Run_CGHtest.R"
SnakeMake From line 388 of master/Snakefile
ShowHide 24 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/tgac-vumc/QDNAseq.snakemake
Name: qdnaseq-snakemake
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...