Installation and Execution Guide for QDNAseq.snakemake Pipeline

public 1yr ago 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

For the installation of this pipeline any Python install compatable Conda is required.

The pipeline itself will run on Python 3.8.5 and R 3.6.3. For exact dependencies view environment.yaml and r-dependencies.R .

Using Conda/Mamba

for easy installation you need (Mini)Conda.

Miniconda installation from folder where you want to install Miniconda:

cd </path/to/files/dir/>
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

follow the instructions of the installation process, give the location where you want Miniconda to be installed and answer YES to add Miniconda to your path.

go to the directory where the analysis need to be performed

cd </path/to/analysis/dir>
git clone https://github.com/tgac-vumc/QDNAseq.snakemake/
cd QDNAseq.snakemake

install Mamba as drop-in replacement for Conda with Mamba's improved installation-performance:

conda install -c conda-forge mamba

create the environment using Mamba:

mamba env create --name QDNAseq-snakemake --file environment.yaml

activate the environment by:

conda activate QDNAseq-snakemake

Then run the R-script r-dependencies.R in the terminal to install the non-conda R dependencies in the environment:

Rscript r-dependencies.R

Using Singularity

Under development

Preparing analysis

Prepare the data

go to analysis dir and prepare analysis by copy or create links to fastq.gz files:

cd </path/to/analysis/dir>
mkdir fastq
cd fastq

to link a single file:

ln -s <path/to/file>

to link all files from a folder:

for file in <path/to/fastq/files>/*.fastq.gz
do ln -s $file
done

Prepare the snakemake settings

Open the configuration file config.yaml to check the settings that snakemake will use and change according to your needs. For providing service-analysis, set setting to 'service' . For research purposes, set setting to 'research' . For all settings set setting to 'all' .

One of the options in the configfile is dewaving , if set to 'true' QNDAseq objects will be dewaved before segmentation.

These options change the rules performed in the pipeline, see the rule-graph in the next section.

Running analysis

Make sure that snakemake is able to find the excecutive file Snakefile by performing a dry-run:

cd ../QDNAseq.snakemake
snakemake -n

Check the rules that are planned to be performed, conform the rule-graph.

An visualization of the order of rules to be performed can be viewed by running the following command and opening the DAG-file

snakemake --forceall --rulegraph | dot -Tsvg > DAG.svg

Rulegraphs for the intial settings 'service' , 'research' and 'all' are commited to this repro in the files DAG_<setting>.svg .

When ready, run the analysis

snakemake

Useful snakemake options

-j , --cores, --jobs : Use at most N cores in parallel (default: 1). If N is omitted, the limit is set to the number of available cores.

-n , --dryrun : Do not execute anything. but show rules which are planned to be performed.

-k , --keep-going : Go on with independent jobs if a job fails.

-f , --force : Force the execution of the selected target or the first rule regardless of already created output.

-R , --forcerun : Force the re-execution or creation of the given rules or files. Use this option if you changed a rule and want to have all its output in your workflow updated.

-U , --until : Runs the pipeline until it reaches the specified rules or files. Only runs jobs that are dependencies of the specified rule or files, does not run sibling DAGs.

for all options go to https://snakemake.readthedocs.io/en/v5.31.1/executing/cli.html#all-options

Code Snippets

msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(QDNAseq))
suppressMessages(library(stringr))
} else{
library(QDNAseq)
library(stringr)
}

source('scripts/ACE.R', echo = !msg)

ploidies<-as.integer(snakemake@wildcards[["ploidy"]])
inputfile <-snakemake@input[["segmented"]]
outputdir<-snakemake@params[["outputdir"]]
failed <- snakemake@params[["failed"]]
log<-snakemake@log[[1]]
fitpicker <- snakemake@output[["fitpicker"]] # Insert by Erik 2021-01-27

imagetype <- snakemake@config[["ACE"]][["imagetype"]]
method<-snakemake@config[["ACE"]][["method"]]
penalty<-as.numeric(snakemake@config[["ACE"]][["penalty"]])
cap<-as.integer(snakemake@config[["ACE"]][["cap"]])
trncname<- snakemake@config[["ACE"]][["trncname"]]
printsummaries<- snakemake@config[["ACE"]][["printsummaries"]]

copyNumbersSegmented <- readRDS(inputfile)

parameters <- data.frame(options = c("ploidies","imagetype","method","penalty","cap","trncname","printsummaries"),
                         values = c(paste0(ploidies,collapse=", "),imagetype,method,penalty,cap,trncname,printsummaries))

#write.table(parameters, file=log, quote = FALSE, sep = "\t", na = "", row.names = FALSE)
write.table(parameters, file=log, quote = FALSE, sep = "\t", na = "", col.names = NA)

ploidyplotloop(copyNumbersSegmented ,outputdir , ploidies,imagetype,method,penalty,cap,trncname,printsummaries) # 1 to 3 dir.create warnings and 4 to 15  removed rows warnings, output gives NULL

#create output for failed samples - for snakemake compatibility.
failed_samples<-read.table(failed, stringsAsFactors=FALSE, header=TRUE)
if(length(failed_samples[,1]>0)){
  for(file in failed_samples[,1]){file.create(paste(outputdir, ploidies,"N/", file,"/summary_",file,".",imagetype,sep=""))}
}

R Snakemake stringr From line 9 of scripts/Run_ACE.R

msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(QDNAseq))
suppressMessages(library(Biobase))
suppressMessages(library(CGHcall))
suppressMessages(library(CGHregions))
suppressMessages(library(CGHtest))
} else{
library(QDNAseq)
library(Biobase)
library(CGHcall)
library(CGHregions)
library(CGHtest)
}


source('scripts/CGHcallPlus.R', echo = FALSE)

recalled <- snakemake@input[["recalled"]]
RegionsCGH<-snakemake@output[["RegionsCGH"]]
profiles <- snakemake@output[["profiles"]]

averr <- snakemake@params[["averr"]]  #0.0075	 # default = 0.01

log<-snakemake@log[[1]]
log<-file(log, open="wt")
sink(log, append=T, split=FALSE)
##############################################################################################################
# CGH regions & frequency plot regions
##############################################################################################################

QCN.reCalled <- readRDS(recalled)

# Make CGH
CGH_reCalledRCs <- makeCgh(QCN.reCalled)

# perform CGHregions
reCalledRegionsCGH <- CGHregions(CGH_reCalledRCs, averror=averr)

# save data:
saveRDS(reCalledRegionsCGH, RegionsCGH)

# frequency plot reCalled (CGH) regions
#filename <- paste('profiles/freqPlot/freqPlotREGIONS_', bin , 'kbp.png', sep='')
png(profiles, res=300, width=14, height=7, unit='in')
frequencyPlot(reCalledRegionsCGH)
dev.off()

R Biobase CGHcall CGHregions From line 8 of scripts/Run_CGHregions.R

msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(CGHtest))
suppressMessages(library(readxl))
} else{
library(CGHtest)
library(readxl)
}

source("scripts/CGHtest_functions.R", echo = FALSE)

##########
# input file paths
CGHregions_RDS_file_path = '../../../output-CGH/100kbp/data/100kbp-RegionsCGH.rds'
CGHregions_RDS_file_path <- snakemake@input[["RegionsCGH"]]

freqPlotCompare_output <- snakemake@output[["freqPlotCompare"]]
CGHtest_output <- snakemake@output[["CGHtest"]]
plotPFDR_output <- snakemake@output[["plotPFDR"]]
combined_output <- snakemake@output[["combined"]]

# params path
outputdir <- snakemake@params[["outputdir"]]
clinicaldataPath <- 'clinical.xlsx'
clinicaldataPath <- snakemake@params[["clinicaldataPath"]]

clinicaldata_col_snames <- 1
clinicaldata_col_snames <- snakemake@params[["columnSampleNames"]]

set_clinicaldata_class_samples <- c('short','long')
set_clinicaldata_class_samples <- snakemake@params[["ClassSamples"]]
clinicaldata_col_class_samples <- 3
clinicaldata_col_class_samples <- snakemake@params[["columnClassSamples"]]


log<-snakemake@log[[1]]
log<-file(log, open="wt")
sink(log, append=T, split=FALSE)
##################

if(!dir.exists(outputdir)){dir.create(outputdir)}

# read CGHregions RDS
CGHregions_RDS_file <- readRDS(CGHregions_RDS_file_path)

# get sample names from CGHregions
snames <- sampleNames(CGHregions_RDS_file)

# read clinical data
clinicaldata <- as.data.frame(read_excel(clinicaldataPath))
clinicaldata_snames  <- clinicaldata[,clinicaldata_col_snames]

if (!isTRUE(all.equal(sort(snames),sort(clinicaldata_snames)))){
    print(data.frame('Clinical Sample Names' = clinicaldata_snames, 'CGHregions Sample Names' = snames))
    stop('Sample Names in clinical data and CGHregions RDS file do not match!')
}
# match sample names from CGHregions with sample names in clinical data
snames_i  <- match(clinicaldata_snames, snames)

# redefine clinical data matching the CGHregions names
clinicaldata <- clinicaldata[snames_i,]

# define clinical data classes and match with set classes
clinicaldata_class_samples_all <- clinicaldata[,clinicaldata_col_class_samples]
clinicaldata_class_samples_unique <- unique(clinicaldata_class_samples_all)

tempa <- clinicaldata_class_samples_unique
tempb <- set_clinicaldata_class_samples
if (!isTRUE(all.equal(sort(tempa),sort(tempb)))){
    print(data.frame('Classes in clinical data' = tempa, 'Classes defined by user in config' = tempb))
    stop('Classes in the clinical data do not match the classes defined for the CGHtest-setting in the configuration-file!')
}

# define groups to compare
group1_index <- which(clinicaldata_class_samples_all == set_clinicaldata_class_samples[1]) # c(4,5,8,11,14,17) for clinical.xlsx
group2_index <- which(clinicaldata_class_samples_all == set_clinicaldata_class_samples[2]) # c(1,2,3,6,7,9,10,12,13,15,16,18) for clinical.xlsx

# subset data
group1 <- CGHregions_RDS_file[,group1_index]
group2 <- CGHregions_RDS_file[,group2_index]

# Compare 2 cohorts
teststat <- "Chi-square"
compareCNAs2cohorts(x = CGHregions_RDS_file,
                    data1 = group1,
                    data2 = group2,
                    cohort1_ID = "group1",
                    cohort2_ID = "group2",
                    teststat,
                    output_freqPlotCompare = freqPlotCompare_output,
                    output_CGHtest = CGHtest_output,
                    output_plotPFDR = plotPFDR_output,
                    output_combined = combined_output)

R readxl From line 5 of scripts/Run_CGHtest.R

msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(QDNAseq))
suppressMessages(library(Biobase))
suppressMessages(library(CGHcall))
suppressMessages(library(CGHtest))
suppressMessages(library(denstrip))
} else{
library(QDNAseq)
library(Biobase)
library(CGHcall)
library(CGHtest)
library(denstrip)
}


source('scripts/CGHcallPlus.R', echo = !msg)
source('scripts/plotQDNAseq.R', echo = !msg)

called <- snakemake@input[["called"]]
fitpickertable<- snakemake@input[["fitpicker"]]

recalled <- snakemake@output[["recalled"]]
#copynumbers<-snakemake@output[["copynumbers"]]#moved to QDNAseq-segment
#segments<-snakemake@output[["segments"]]  #moved to QDNAseq-segment
calls<-snakemake@output[["calls"]]
profiles <- snakemake@params[["profiles"]]
#copynumbersbed<-snakemake@params[["copynumbersbed"]]  #moved to QDNAseq-segment
#segmentsbed<-snakemake@params[["segmentsbed"]] #moved to QDNAseq-segment
failed <- snakemake@params[["failed"]]
#bedfolder <- snakemake@params[["bedfolder"]]
freqplot<- snakemake@output[["freqplot"]]

minimum_cellularity <- snakemake@params[["minimum_cellularity"]]

log<-snakemake@log[[1]]
log<-file(log, open="wt")
sink(log, append=TRUE, split=FALSE)

##############################################################################################################
# Call data & frequency plot calls
##############################################################################################################
#First part is to call with altered cellularity using chgCall.
# load data
QCN.fcnsdsn <- readRDS(called)
fitpicker<- read.delim(fitpickertable, stringsAsFactors=FALSE)
fitpicker$likely_fit[fitpicker$likely_fit < minimum_cellularity] <- minimum_cellularity

#perform functions
# Call with correction for cellularity
reCalled <- callBins(QCN.fcnsdsn, nclass=5, cellularity=fitpicker$likely_fit)
saveRDS(reCalled, recalled)

##############################################################################################################
# Create profiles and frequency plot
##############################################################################################################
fitpicker$likely_fit<-paste0("cellularity: ",fitpicker$likely_fit )

if(!dir.exists(profiles)){dir.create(profiles)}
plotQDNAseq(reCalled, profiles , subtitle=fitpicker$likely_fit)

# frequency plot calls
png(freqplot, res=300, width=14, height=7, unit='in')
frequencyPlot(reCalled)
dev.off()

##############################################################################################################
# Create IGV objects from readcounts
##############################################################################################################

#exportBins(reCalled, copynumbers, format="igv", type="copynumber") moved to CNA segments
#exportBins(reCalled, segments, format="igv", type="segments") moved to CNA segments
exportBins(reCalled, calls, format="igv", logTransform=FALSE, type="calls")
#exportBins(reCalled, file=copynumbersbed, format="bed", logTransform=TRUE, type="copynumber")
#exportBins(reCalled, file=segmentsbed, format="bed", type="segments")

#create output for failed samples - for snakemake compatibility.
failed_samples<-read.table(failed, stringsAsFactors=FALSE, header=TRUE)
if(length(failed_samples[,1]>0)){for(file in failed_samples[,1]){file.create(paste(profiles, file,".png",sep=""))
#file.create(paste(bedfolder, file,"-copynumbers.bed",sep=""))
#file.create(paste(bedfolder, file,"-segments.bed",sep=""))
}}

R Snakemake Biobase CGHcall denstrip From line 8 of scripts/Run_CNA_call_cellularity_based.R

msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(QDNAseq))
suppressMessages(library(Biobase))
suppressMessages(library(CGHcall))
suppressMessages(library(CGHtest))
} else{
library(QDNAseq)
library(Biobase)
library(CGHcall)
library(CGHtest)
}


source('scripts/CGHcallPlus.R', echo = !msg)
source('scripts/plotQDNAseq.R', echo = !msg)

segmented <- snakemake@input[["segmented"]]
bin <- as.integer(snakemake@wildcards[["binSize"]])
called <- snakemake@output[["called"]]
profiles <- snakemake@params[["profiles"]]
freqplots <- snakemake@output[["freqplot"]]
failed <- snakemake@params[["failed"]]
calls<-snakemake@output[["calls"]]

log<-snakemake@log[[1]]
log<-file(log, open="wt")
sink(log, append=TRUE, split=FALSE)

##############################################################################################################
# Call data & frequency plot calls
##############################################################################################################

# load data
QCN.fcnsdsn <- readRDS(segmented)

# perform functions
# Call withouth correction for cellularity
QCN.fcnsdsnc <- callBins(QCN.fcnsdsn, nclass=5)
saveRDS(QCN.fcnsdsnc, called)

# Plot called profiles
plotQDNAseq(QCN.fcnsdsnc, profiles)

# frequency plot calls
png(freqplots, res=300, width=14, height=7, unit='in')
frequencyPlot(QCN.fcnsdsnc, losscol='blue', gaincol='red', delcol="darkblue", ampcol="darkred")
dev.off()

#create output for failed samples - for snakemake compatibility.
failed_samples<-read.table(failed, stringsAsFactors=FALSE, header=TRUE)
if(length(failed_samples[,1]>0)){for(file in failed_samples[,1]){file.create(paste(profiles, file,".png",sep=""))
}}

##############################################################################################################
# Create IGV objects
##############################################################################################################

exportBins(QCN.fcnsdsnc, calls, format="igv", logTransform=FALSE, type="calls")

R Snakemake Biobase CGHcall From line 8 of scripts/Run_CNA_call.R

msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(QDNAseq))
suppressMessages(library(Biobase))

# for dewaving:
suppressMessages(library(limma))
suppressMessages(library(gtools))
suppressMessages(library(impute))
suppressMessages(library(MASS))
} else{
library(QDNAseq)
library(Biobase)

# for dewaving:
library(limma)
library(gtools)
library(impute)
library(MASS)
}


source("scripts/functions.R", echo = !msg)
source("scripts/plotQDNAseq.R", echo = !msg)
sourceDir(snakemake@config[["QDNAseq"]][["dewave_dir"]], echo = !msg)
load(snakemake@params[["dewave_data"]], verbose = !msg)

bin <- as.integer(snakemake@wildcards[["binSize"]])
corrected <- snakemake@input[["corrected"]]
profiles <- snakemake@params[["profiles"]]
dewaved <- snakemake@output[["dewaved"]]

log<-snakemake@log[[1]]
log<-file(log, open="wt")
sink(log, append=T , split=FALSE)
##############################################################################################################
# Correct, Normalize & Dewave raw data
##############################################################################################################

QCN.fcns <- readRDS(corrected)

if(bin!=1000){
QCN.fcns <- dewaveBins(QCN.fcns)
}

saveRDS(QCN.fcns, dewaved)
plotQDNAseq(QCN.fcns, profiles)

R Biobase MASS GTools impute From line 8 of scripts/Run_deWave.R

msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(QDNAseq))
suppressMessages(library(CGHregions))
} else{
library(QDNAseq)
library(CGHregions)
}

source("scripts/addCytobands.R", echo = !msg)
source("scripts/CGHcallPlus.R", echo = !msg)

RegionsCGH<-snakemake@input[["RegionsCGH"]]
max.focal.size.mb<-snakemake@params[["max_focal_size_mb"]]
min.freq.focal<-snakemake@params[["min_freq_focal"]]
allRegions<-snakemake@output[["allRegions"]]
allFocalRegions<-snakemake@output[["allFocalRegions"]]
cytoband_data<-snakemake@params[["cytobands"]]

log<-snakemake@log[[1]]
log<-file(log, open="wt")
sink(log, append=T, split=FALSE)
##############################################################################################################
# Create CGHregion BED file + Annotate
##############################################################################################################

# load data
QCN.Regions <- readRDS(RegionsCGH)

# create BED file from CGHregions

makeCGHregionsTable <- function(CGHregions, max.focal.size.mb=3, min.freq.focal=25, output_allRegions, output_allFocalRegions,  cytoband_data){

	# all regions in BED format
	regionsBED <- fData(CGHregions)[,1:3]

	# region sizes
	regionSize <- fData(CGHregions)[,3] - fData(CGHregions)[,2]
	regionsBED <- cbind(regionsBED, regionSize)

	# add cytoband info
	regionsBED <- addCytobands(regionsBED, cytoband_data)

	# frequencies of gains and losses
	calls <- regions(CGHregions)
	loss.freq <- rowMeans(calls < 0)
	gain.freq <- rowMeans(calls > 0)

	# add to 1 table
	regionData <- cbind(regionsBED, loss.freq, gain.freq)
	colnames(regionData) <- c('chromo', 'start', 'end', 'Region size (Mb)', 'Cytoband', 'loss (%)', 'gain (%)')

	# round integers:
	regionData[,4] <- round(regionData[,4]/1000000, digits=2)
	regionData[,c(6,7)] <- round(regionData[,c(6,7)]*100, digits=2)

	# save
	write.table(regionData, output_allRegions, sep='\t', row.names=F, col.names=T, quote=F)

	# make separate table for focals only
	focal.ix <- which(regionData[,4] <= max.focal.size.mb)
	focalRegions <- regionData[focal.ix, ]

	# and in more than 25% of samples
	loss.cutoff.ix <- which(focalRegions[,6] > min.freq.focal)
	gain.cutoff.ix <- which(focalRegions[,7] > min.freq.focal)
	cutoff.ix <- sort( c(loss.cutoff.ix, gain.cutoff.ix ))

	focalRegions <- focalRegions[cutoff.ix, ]
	if(nrow(focalRegions) != 0 ){rownames(focalRegions) <- c(1:nrow(focalRegions))}

	write.table(focalRegions, output_allFocalRegions, sep='\t', row.names=F, col.names=F, quote=F)
}

makeCGHregionsTable(QCN.Regions, max.focal.size.mb, min.freq.focal, allRegions, allFocalRegions,  cytoband_data)

R CGHregions From line 9 of scripts/Run_makeCGHregionstable.R

msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(QDNAseq))
suppressMessages(library(denstrip))
suppressMessages(library(CGHcall))
} else{
library(QDNAseq)
library(denstrip)
library(CGHcall)
}

source("scripts/functions.R", echo = !msg)
source("scripts/addCytobands.R", echo = !msg)
source("scripts/CGHcallPlus.R", echo = !msg)

recalled <- snakemake@input[["recalled"]]
beddir <- snakemake@params[["beddir"]]
cytoband_data<-snakemake@params[["cytobands"]]
max.focal.size.mb<-snakemake@params[["max_focal_size_bed"]]
failed <- snakemake@params[["failed"]]

log<-snakemake@log[[1]]
log<-file(log, open="wt")
sink(log, append=TRUE, split=FALSE)

##############################################################################################################

QCN.reCalled <- readRDS(recalled)

# create BED files

makeCNAbedFile <- function(CalledQDNAseqReadCounts, max.focal.size.mb=3, beddir,cytoband_data){
  for (i in 1:length(sampleNames(CalledQDNAseqReadCounts))){

		#
		sample <- CalledQDNAseqReadCounts[,i]
		sname <- sampleNames(sample)

		# print
		print( paste('Creating BED files of all CNAs in sample: ', sname, sep='') )

		# get all call and segment values of CNAs in the sample
		allCNA.ix <- which(calls(sample)!=0)
		allCNAs <- calls(sample)[ allCNA.ix ]
		seg.allCNA <- segmented(sample)[ allCNA.ix ]

		# coordinates of all CNAs
		reg.allCNAs <- fData(sample)[ allCNA.ix, 1:3]

		# Put together in BED format
		allCNAsPerBin <- cbind(reg.allCNAs, allCNAs, seg.allCNA)
		colnames(allCNAsPerBin)[4] <- 'call'
		colnames(allCNAsPerBin)[5] <- 'segment'

    bedfile<-paste(beddir, sname, '_allCNAsPerBin.bed', sep="")
		# save as BED file; with or without column names..?
		write.table(allCNAsPerBin, bedfile, sep='\t', row.names=F, col.names=T, quote=F)

		# create BED file for all called regions:
		# collapse all adjacent bins based on unique segment values
		coll.calls.bed <- c()
		#all_segments <- uniquecreate.file()(allCNAsPerBin$segment)
    if(length(unique(allCNAsPerBin$segment)) !=0){
			for (j in 1:length(unique(allCNAsPerBin$segment))){
			# first segment

			SEG.val <- as.numeric( unique(allCNAsPerBin$segment)[j] )

			# if the unique segment value occurs more often, then for each one create an entry
			intervals <- seqToIntervals(which(allCNAsPerBin$segment== SEG.val ))
			allCNAsPerBin[allCNAsPerBin$chromosome=="X","chromosome"]<-23
			allCNAsPerBin[allCNAsPerBin$chromosome=="Y","chromosome"]<-24
				for (z in 1:nrow(intervals)){

					start.ix 	<- intervals[z,1]
					end.ix 		<- intervals[z,2]

					SEG.start 	<- as.numeric( allCNAsPerBin[start.ix,2] )
					SEG.end 	<- as.numeric( allCNAsPerBin[end.ix,3] )
					SEG.size 	<- as.numeric( allCNAsPerBin[end.ix,3] - allCNAsPerBin[start.ix,2] )

					SEG.chromo 	<- as.numeric(allCNAsPerBin[end.ix,1])
					SEG.call 	<- as.numeric( allCNAsPerBin[end.ix,4] )

					# check if chromosome is the same; if not the bin spans more chromos and should be separated
					if(allCNAsPerBin[start.ix, 1] != allCNAsPerBin[end.ix, 1]){
					       print(paste(j, 'WARNING: Segment ',  SEG.val, ' spans more than 1 chromosome', sep=''))}

					SEG.coll 	<- cbind(SEG.chromo, SEG.start, SEG.end, SEG.size, SEG.val, SEG.call)

					coll.calls.bed <- rbind(coll.calls.bed, SEG.coll)
				}
			}

			# reorder dataframe
			CNAbed <- coll.calls.bed
			CNAbed <- CNAbed [ order(CNAbed[,1], CNAbed[,2]), ]
			# if there is only 1 CNA in sample, transform the dataframe
			if	(nrow(coll.calls.bed )==1)  {
			CNAbed <- as.data.frame(t(CNAbed))
			}	else {
			CNAbed <- as.data.frame(CNAbed)
			}

			CNAbed[,4] <- round( CNAbed[,4]/1000000, digits = 2)
			CNAbed[,5] <- round( CNAbed[,5], digits=2)


			# add cytoband info
			CNAbed <- addCytobands(CNAbed, cytoband_data)
			CNAbed <- CNAbed[, c(1:4,7,5,6)]
    }else{CNAbed<- data.frame(matrix(ncol = 7, nrow = 0))}
      colnames(CNAbed) <- c('chromo', 'start', 'end', 'cytobands','sizeInMb', 'segment', 'call')
      fileName <- paste(beddir, sname, '_CNAs.bed', sep='')
      write.table(CNAbed, fileName, sep='\t', row.names=F, col.names=T, quote=F)

			# make a separate BED file with only focal CNAs (< 3 Mb)

			focal.ix <- which(CNAbed[,4] <= max.focal.size.mb)
			focalCNAs <- CNAbed[focal.ix, ]

			focalfileName <- paste(beddir, sname, '_focalCNAs.bed', sep='')
			write.table(focalCNAs, focalfileName, sep='\t', row.names=F, col.names=F, quote=F)
	}

}

makeCNAbedFile(QCN.reCalled, max.focal.size.mb, beddir, cytoband_data)

#create output for failed samples - for snakemake compatibility.
failed_samples<-read.table(failed, stringsAsFactors=FALSE, header=TRUE)
if(length(failed_samples[,1]>0)){for(file in failed_samples[,1]){
    file.create(paste(beddir, file, '_allCNAsPerBin.bed', sep=""))
    file.create(paste(beddir, file, '_CNAs.bed', sep=''))
    file.create(paste(beddir, file, '_focalCNAs.bed', sep=''))
}}

R Snakemake CGHcall denstrip From line 9 of scripts/Run_makeCNAbedFile.R

msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(QDNAseq))
} else{
library(QDNAseq)
}

source('scripts/ACE.R', echo = !msg)

## Erik
#copyNumbersSegmented <- snakemake@output[["ACE_post"]]
##

inputfile <-snakemake@input[["segmented"]]
outputdir<-snakemake@params[["outputdir"]]
failed <- snakemake@params[["failed"]]
fitpickertable<-snakemake@input[["fitpicker"]]
ploidies<-as.integer(snakemake@wildcards[["ploidy"]])

imagetype <- snakemake@config[["ACE"]][["imagetype"]]
trncname<- snakemake@config[["ACE"]][["trncname"]]

copyNumbersSegmented <- readRDS(inputfile)

postanalysisloop(copyNumbersSegmented, modelsfile=fitpickertable, imagetype=imagetype, outputdir=outputdir, trncname=trncname)

# if Rplots.pdf exists rename from QDNAseq-snakemake directory to outputdir
if (file.exists("./Rplots.pdf")){
    file.rename(from = "./Rplots.pdf", to = paste0(outputdir,"all_samples.pdf"))
}

failed_samples<-read.table(failed, stringsAsFactors=FALSE, header=TRUE)
if(length(failed_samples[,1]>0)){
  for(file in failed_samples[,1]){file.create(paste(outputdir,"segmentfiles/",file,"_segments.tsv",sep=""))}
}

R From line 9 of scripts/Run_postanalysisloop_ACE.R

msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(QDNAseq))
suppressMessages(library(Biobase))
suppressMessages(library(R.cache))
} else{
library(QDNAseq)
library(Biobase)
library(R.cache)
}


setCacheRootPath(path="../.Rcache")

bam <- snakemake@input[["bams"]]
bin <- as.integer(snakemake@wildcards[["binSize"]])
genome <- snakemake@params[["genome"]]
binReadCounts <- snakemake@output[["binReadCounts"]]

##############################################################################################################
# Get bin annotations and bin read counts
##############################################################################################################
#custom bin annotations for chrX used - did this whole script by hand for 100kbp See creationofblacklist.R
bins <- getBinAnnotations(bin, genome=genome)


QRC <- binReadCounts(bins, bamfiles=bam, cache=TRUE)

sub("(_[ACGT]+)?(_S\\d+)?(_L\\d{3})?_R\\d{1}_\\d{3}(\\.f(ast)?q\\.gz)?$", "", sampleNames(QRC)) -> samples

if (sum(duplicated(samples)) > 0) {
        QRC <- poolRuns(QRC, samples=samples, force=TRUE)
}
saveRDS(QRC, binReadCounts)

R Biobase R.cache From line 8 of scripts/Run_QDNAseq_binReadCounts.R

msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(QDNAseq))
suppressMessages(library(Biobase))
} else{
library(QDNAseq)
library(Biobase)
}

source("scripts/functions.R", echo = !msg)
source("scripts/plotQDNAseq.R", echo = !msg)

binReadCounts <- snakemake@input[["binReadCounts"]]
bin <- as.integer(snakemake@wildcards[["binSize"]])
corrected <- snakemake@output[["corrected"]]
profiles <- snakemake@params[["profiles"]]
chrom_filter <- c(snakemake@params[["chrom_filter"]])

log<-snakemake@log[[1]]
log<-file(log, open="wt")
sink(log, append=T , split=FALSE)
##############################################################################################################
# Correct, Normalize & Dewave raw data
##############################################################################################################
QRC <- readRDS(binReadCounts)

#did these steps manualy as is described in second part creationofblacklist.R for samples with chrX.

QRC.f <- applyFilters(QRC, residual=TRUE, blacklist=TRUE, mappability=FALSE, bases=FALSE , chromosomes=chrom_filter)

#featureData(QRC.f)$use[c(22232:22237)]<-F  # segment 14:22400001-22500000 t/m 14:22900001-23000000 on chr14
QRC.f <- estimateCorrection(QRC.f)
#QRC.f <- applyFilters(QRC, residual=TRUE, blacklist=TRUE, mappability=FALSE, bases=FALSE)  - used for bia-ALCL (2019-07-09 not)
#QRC.f <- estimateCorrection(QRC.f, residual=TRUE, blacklist=TRUE, mappability=FALSE, bases=FALSE) - used for bia-ALCL(2019-07-09 not)
#QRC.f <- applyFilters(QRC.f, residual=TRUE, blacklist=TRUE, mappability=FALSE, bases=FALSE, chromosome="Y") - used for bia-ALCL(2019-07-09 not)

QCN.fc <- correctBins(QRC.f)
QCN.fcn <- normalizeBins(QCN.fc)
QCN.fcns <- smoothOutlierBins(QCN.fcn)

saveRDS(QCN.fcns, corrected)
plotQDNAseq(QCN.fcns, profiles)

R Biobase From line 8 of scripts/Run_QDNAseq_normalize.R

msg <- snakemake@params[["suppressMessages"]]
if (msg){
suppressMessages(library(QDNAseq))
suppressMessages(library(Biobase))
} else{
library(QDNAseq)
library(Biobase)
}

source("scripts/plotQDNAseq.R", echo  = !msg)

dewaved<- snakemake@input[["dewaved"]]
bin <- as.integer(snakemake@wildcards[["binSize"]])
segmented <- snakemake@output[["segmented"]]
profiles <- snakemake@params[["profiles"]]
failed<- snakemake@params[["failed"]]
min_used_reads<-snakemake@params[["minimal_used_reads"]]

copynumbersbed<-snakemake@params[["copynumbersbed"]]
segmentsbed<-snakemake@params[["segmentsbed"]]
copynumbers<-snakemake@output[["copynumbers"]]
segments<-snakemake@output[["segments"]]
bedfolder <- snakemake@params[["bedfolder"]]

log<-snakemake@log[[1]]
log<-file(log, open="wt")
sink(log, append=TRUE , split=FALSE)

# Adjust segmentation settings based on binsize
if (bin==15) {SDundo=0.75; alph=1e-15}
if (bin==30) {SDundo=0.75; alph=1e-15}
if (bin==100) {SDundo=0.10; alph=1e-20} # default = 1e-20 # 0.01 used for PELLL_FS8_a0.01
if (bin==1000) {SDundo=0.10; alph=1e-20}
# TODO: mogelijk hogere alpha nodig, om meer segmenten te krijgen: 0.01 (1e-2). Deze setting used in tmp_100kbp

# load data
QCN.fcnsd <- readRDS(dewaved)

QCN.fcnsds <- segmentBins(QCN.fcnsd[,QCN.fcnsd$used.reads > min_used_reads ], undo.splits='sdundo', undo.SD=SDundo, alpha=alph, transformFun="sqrt") # gives 'Performing segmentation: NA
QCN.fcnsdsn <- normalizeSegmentedBins(QCN.fcnsds)
saveRDS(QCN.fcnsdsn, segmented)

##############################################################################################################
# Create profiles
##############################################################################################################

plotQDNAseq(QCN.fcnsdsn, profiles)

#create output for failed samples - for snakemake compatibility.
littledata<-colnames(QCN.fcnsd[,QCN.fcnsd$used.reads <= min_used_reads ])
if(length(littledata>0)){for(file in littledata){file.create(paste(profiles, file,".png",sep=""))
file.create(paste(bedfolder, file,"-copynumbers.bed",sep=""))
file.create(paste(bedfolder, file,"-segments.bed",sep=""))
}}

write.table(littledata, file=failed )

##############################################################################################################
# Create IGV objects and bedfiles from readcounts
##############################################################################################################

exportBins(QCN.fcnsdsn, copynumbers, format="igv", type="copynumber")
exportBins(QCN.fcnsdsn, segments, format="igv", type="segments")
exportBins(QCN.fcnsdsn, file=copynumbersbed, format="bed", logTransform=TRUE, type="copynumber")
exportBins(QCN.fcnsdsn, file=segmentsbed, format="bed", type="segments")

R Snakemake Biobase From line 8 of scripts/Run_QDNAseq_segment.R

shell:
    "bwa aln -n {params.n} -t {threads} -q {params.q} {params.ref} {input} > {output.sai} 2> {log}; "
    "bwa samse -f {output.samse} -r '@RG\\tID:{wildcards.sample}\\tSM:{wildcards.sample}'" 
    " {params.ref} {output.sai} {input} 2>> {log}"

SnakeMake BWA From line 74 of master/Snakefile

shell:
    "samtools view -uS {input.samse} 2> {log}| samtools sort - -o {output} 2>> {log}"

SnakeMake SAMtools From line 88 of master/Snakefile

shell:
    "picard MarkDuplicates I={input} O={output.bam} M={output.metrics_file}"
    " ASSUME_SORTED=true CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT USE_JDK_DEFLATER=true USE_JDK_INFLATER=true &> {log}; "
    "ln -s {output.bai} {output.bambai}"

SnakeMake Picard From line 100 of master/Snakefile

shell:
    "{input.script} {input.bam} {wildcards.sample} {params.outdir} {output}"

SnakeMake From line 113 of master/Snakefile

script:
    "scripts/Run_QDNAseq_binReadCounts.R"

SnakeMake From line 125 of master/Snakefile

script:
    "scripts/Run_QDNAseq_normalize.R"

SnakeMake From line 139 of master/Snakefile

script:
    "scripts/Run_deWave.R"

SnakeMake From line 153 of master/Snakefile

script:
    "scripts/Run_QDNAseq_segment.R"

SnakeMake From line 175 of master/Snakefile

script:
    "scripts/Run_CNA_call.R"

SnakeMake From line 191 of master/Snakefile

script:
    "scripts/Run_CNA_call_cellularity_based.R"

SnakeMake From line 209 of master/Snakefile

script:
    "scripts/Run_makeCNAbedFile.R"

SnakeMake From line 227 of master/Snakefile

shell:
    "{input.script} {input.bedfile} {wildcards.sample} {params.outdir} {output} {log} 2> {log}"

SnakeMake From line 241 of master/Snakefile

script:
    "scripts/Run_CGHregions.R"

SnakeMake From line 254 of master/Snakefile

script:
    "scripts/Run_makeCGHregionstable.R"

SnakeMake From line 269 of master/Snakefile

shell:
    "{input.script} {input.bedfile} {params.filename} {params.outdir} {output} 2> {log}"

SnakeMake From line 282 of master/Snakefile

shell:
    "{input.script} {params.profiles} {params.lb2dir} > {output.index}"

SnakeMake From line 294 of master/Snakefile

shell:
    "{input.script} {wildcards.binSize}kpb {params.bamfolder} {params.statsfolder} {params.qcFastqfolder} {params.qcBamfolder}> {output}"

SnakeMake From line 312 of master/Snakefile

shell:
    "fastqc {input.fastq} --outdir {params.qcfastqdir} -t {threads} 2>> {log}"

SnakeMake FastQC From line 328 of master/Snakefile

shell:
    "fastqc {input.bam} --format bam_mapped --outdir {params.qcbamdir} -t {threads} 2>> {log}"

SnakeMake FastQC From line 341 of master/Snakefile

script:
    "scripts/Run_ACE.R"

SnakeMake From line 355 of master/Snakefile

script:
    "scripts/Run_postanalysisloop_ACE.R"

SnakeMake From line 369 of master/Snakefile

script:
    "scripts/Run_CGHtest.R"

SnakeMake From line 388 of master/Snakefile

ShowHide 24 more snippets with no or duplicated tags.

Comments

Support

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Created: 1yr ago

Updated: 1yr ago

Maitainers: public

URL: https://github.com/tgac-vumc/QDNAseq.snakemake

Name: qdnaseq-snakemake

Version: 1

Badge:

Insert copied code into your website to add a link to this workflow.

License: None

Keywords:

Biobase BWA CGHcall CGHregions FastQC GTools impute limma Picard QDNAseq SAMtools Snakemake denstrip MASS R.cache readxl stringr DNA

Future updates

Related Workflows

psychip_snakemake — Show Details View Workflow

ENCODE pipeline for histone marks developed for the psychENCODE project

public

psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project. The o...

raw sequence reads Alignment Sequence alignment report macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

Near-real time tracking of SARS-CoV-2 in Connecticut

public

Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

JSON nextclade Augur Biopython FOCUS Pandas Snakemake bs4 epiweeks geopy matplotlib numpy pycountry pycountry-convert uszipcode

Free

cellranger-snakemake-gke — Show Details View Workflow

snakemake workflow to run cellranger on a given bucket using gke.

public

A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

ATLAS - Three commands to start analyzing your metagenome data

public

Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...

raw sequence reads Genome assembly Annotation track checkm2 gunc prodigal snakemake-wrapper-utils MEGAHIT Atlas BBMap Biopython BioRuby Bwa-mem2 cd-hit CheckM DAS Diamond eggNOG-mapper v2 MetaBAT 2 Minimap2 MMseqs MultiQC Pandas Picard pyfastx SAMtools SemiBin Snakemake SPAdes SqueezeMeta TADpole VAMB CONCOCT ete3 gtdbtk h5py networkx numpy plotly psutil utils metagenomics

Free

175

rna-seq-star-deseq2 — Show Details View Workflow

RNA-seq workflow using STAR and DESeq2

public

This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...

Free

dna-seq-gatk-variant-calling — Show Details View Workflow

This Snakemake pipeline implements the GATK best-practices workflow

public

This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...

VCF raw sequence reads Variant calling genetic variants gatk rust-bio-tools snakemake-wrapper-utils tabix BCFtools BWA FastQC MultiQC Pandas Picard SAMtools Snakemake Trimmomatic Variant Effect Predictor (VEP) common matplotlib numpy seaborn DNA

Free