Snakemake pipeline for calling CNAs from Affymetrix (now Thermo Fisher) Cytoscan and Oncoscan arrays


This pipeline has been adapted from https://github.com/gustaveroussy/EaCoN. It leverages the EaCoN R package to conduct preprocessing, normalization, segmentation and copy number estimation from raw Cytoscan HD .CEL files. The EaCoN package also supports copy number estimation from Oncoscan and SNP6 arrays as well as from WES data, but these features have not yet been implemented in this pipeline.

Snakemake

This pipeline leverages the snakemake Python package for workflow management. As a result the pipeline and its dependencies are easily installed from this repository, allowing quick setup, configuration and deployment.

For more information on Snakemake, please see: https://snakemake.readthedocs.io/en/stable/.
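A Snakemake workflow is defined as a set of rules, each declaring its inputs, outputs, and the script or command that produces them; Snakemake chains rules together by matching outputs to inputs. A minimal, hypothetical rule is sketched below (the rule, file, and script names are illustrative, not taken from this pipeline's Snakefile):

```
rule example_rule:
    input:
        "rawdata/{sample}.CEL"
    output:
        "procdata/{sample}_processed.RDS"
    script:
        "scripts/process.R"
```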

Requirements

Dependency management for this pipeline is handled via conda for Python and renv for R. To get started with setup you can install miniconda3 using the instructions available here: https://docs.conda.io/en/latest/miniconda.html. If you do not currently have R installed, you can install it via conda using the command: conda install -c conda-forge r-base=3.6. Please note that the EaCoN package has not been updated to work with R >= 4.0.

Alternatively you can install it directly from CRAN as described here: https://cran.r-project.org/.

Setting Up Your Software Environment

The first step to deploying an analysis pipeline is to install the various software packages it depends on. We have included the envs/affymetrix.yml and renv.lock files here to easily accomplish this.

All commands should be executed from the top level directory of this repository.

Python and System Dependencies

Conda can be used to install all Python and most OS system dependencies using:

conda env create --file envs/affymetrix.yml

This will take some time to run as it gathers and installs the correct package versions. The environment it creates will be called affymetrix.

If it is not automatically activated after installation please run conda activate affymetrix before proceeding to the next step.

R Dependencies

The renv package can be used to install all R dependencies (both CRAN and Bioconductor). R version 3.6.3 and renv are included as dependencies in the envs/affymetrix.yml file and should be installed automatically when setting up your conda environment.

To initialize this project with renv run:

Rscript -e 'library(renv); renv::init()'

If you wish to isolate the R dependencies from your Conda environment R libraries, you can use this command instead:

Rscript -e 'library(renv); renv::isolate(); renv::init(bare=TRUE)'

If initialization doesn't trigger dependency installation, you can do so manually using:

Rscript -e 'renv::restore()'

For more information on renv and how it can be used to manage dependencies in your project, please see: https://rstudio.github.io/renv/articles/renv.html.

Configuring the Pipeline

This pipeline assumes the following directory structure:

.
├── envs
├── metadata
├── procdata
├── rawdata
├── renv
├── results
└── scripts

Please create at minimum the rawdata and metadata directories, as they are assumed to hold the raw Cytoscan HD plate data (.CEL files) and the pairs file, respectively. For more information on the correct formatting of your pairs file, please see https://github.com/gustaveroussy/EaCoN. The remaining directories will be created automatically as the pipeline runs.
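The two required directories can be created from the top level of the repository:

```shell
# Create the directories the pipeline expects to find:
# rawdata holds the .CEL files, metadata holds the pairs file
mkdir -p rawdata metadata
```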

config.yaml

This file holds the pipeline configuration. Here you can specify the paths and parameters for your current use case; documentation on what each field should contain is provided in the config.yaml file itself.
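Based on the parameters the pipeline scripts reference (array_type, pairs_file, procdata, results, analysis_name, segmenter, smooth.k, gamma_range, nthreads), a config.yaml might look roughly like the sketch below. The field names and values here are illustrative assumptions only; consult the config.yaml shipped with the repository for the authoritative fields and documentation.

```yaml
# Illustrative values only -- see the repository's config.yaml
array_type: cytoscan          # one of: cytoscan, oncoscan, snp6, wes
pairs_file: metadata/pairs.tsv
procdata: procdata
results: results
analysis_name: my_analysis
segmenter: ASCAT
gamma_range: [0.35, 0.95]     # range of gamma values to evaluate
nthreads: 2
```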

Using the Pipeline

Batch Processing and Normalization

snakemake --cores 2 batch_process_rawdata

Segmentation

snakemake --cores 2 segment_processed_data

Copy Number Calling

snakemake --cores 2

Code Snippets

renv::activate()

library(EaCoN)

# 0 -- Parse Snakemake arguments
input <- snakemake@input
params <- snakemake@params
nthreads <- snakemake@threads
output <- snakemake@output

# 1 -- Preprocess and normalize the raw data
if (grepl('cytoscan', params$array_type, ignore.case=TRUE)) {
    CS.Process.Batch(input$pairs_file, 
        nthread=nthreads, out.dir=params$procdata, force=TRUE)
} else if (grepl('oncoscan', params$array_type, ignore.case=TRUE)) {
    OS.Process.Batch(input$pairs_file, 
        nthread=nthreads, out.dir=params$procdata, force=TRUE)
} else if (grepl('snp6', params$array_type, ignore.case=TRUE)) {
    SNP6.Process.Batch(input$pairs_file, out.dir=params$procdata)
} else if (grepl('wes', params$array_type)) {
    stop("WES has not been implemented in this pipeline yet, please see 
        https://github.com/gustaveroussy/EaCoN for information on
        setting up your own analysis script.")
} else {
    stop("Supported array families are wes, cytoscan, oncoscan and snp6")
}
renv::activate()

library(EaCoN)

# 0 -- Parse Snakemake arguments
input <- snakemake@input
params <- snakemake@params
nthreads <- snakemake@threads
output <- snakemake@output

# 1 -- L2R and BAF joint segment the preprocessed data from the previous rule
EaCoN:::Segment.ff.Batch(unlist(input), nthread=nthreads, segmenter=params$segmenter, 
    smooth.k=params$smoothk, force=TRUE)
renv::activate()

library(EaCoN)

# 0 -- Parse Snakemake arguments
input <- snakemake@input
params <- snakemake@params
nthreads <- snakemake@threads
output <- snakemake@output

# 1 -- Make copy number calls based on L2R segmentation results
EaCoN:::ASCN.ff.Batch(unlist(input), nthread=nthreads, force=TRUE, 
    gammaRange=unlist(params$gamma_range))
renv::activate()

library(EaCoN)
library(data.table)
library(qs)
library(GenomicRanges)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(S4Vectors)

# -- 0.2 Parse snakemake arguments
input <- snakemake@input
params <- snakemake@params
output <- snakemake@output

# -- 1. Read in gamma files

gammaFiles <- list.files(input$out_dir, '.*gammaEval.*txt', recursive=TRUE, 
    full.names=TRUE)
print(gammaFiles)


# -- 2. Load pancanPloidy data as reference
data(pancanPloidy.noSegs)
pancan.ploidy <- pancan.obj.segless$breaks$ploidy

all.fits <- lapply(gammaFiles, function(file) {
    list(fit = fread(file, sep = '\t'),
        sample = strsplit(file, '/')[[1]][2]
    )
})
names(all.fits) <- sapply(all.fits, function(x) x$sample)

.fitsToVector <- function(fit.list) {
    newList <- list()
    nameList <- list()
    for (i in seq_along(fit.list)) {
        fit <- as.list(fit.list[[i]]$fit[[2]])
        names(fit) <- tolower(unlist(fit.list[[i]]$fit[[1]]))
        print(fit)
        newList[[i]] <- fit
        nameList <- c(nameList, fit.list[[i]][[2]])
    }
    names(newList) <- unlist(nameList)
    return(newList)
}

vec.fits <- .fitsToVector(all.fits)

# -- 3. Annotate the RDS data associated with the gamma files

# Change to fix error in annotaRDS.Batch
setwd('procdata')

print('Starting annotation...')
gr.cnv <- EaCoN:::annotateRDS.Batch(
    all.fits,
    'ASCAT',
    nthread = params$nthreads,
    gamma.method = 'score',
    pancan.ploidy = pancan.ploidy
)
print('finished annotation')

setwd('..')

# Save raw results object to disk
qsave(gr.cnv, file = file.path(input$out_dir, 
    paste0(params$analysis_name, 'optimal_gamma_list.qs')), 
    nthread=params$nthreads)

## ---- Create GRangesList of segmentation results and save them to disk

# Extract segmentation data.frames from the results object
segmentation_df_list <- lapply(gr.cnv, function(x) x$seg)

# Convert all data.frames to GRanges and return in a list
list_of_gRanges <- lapply(segmentation_df_list, function(x) 
    makeGRangesFromDataFrame(x, keep.extra.columns = TRUE))

# Convert list of GRanges objects into GRangesList object
cnv_grList <- GRangesList(list_of_gRanges)

# Save GRangesList to disk for downstream analysis
qsave(cnv_grList, file = file.path(params$results, 
    paste0(params$analysis_name, '_grList.qs')), nthread=params$nthreads)
renv::activate()

library(EaCoN)
library(Biobase)
library(SummarizedExperiment)
library(data.table)
library(qs)

# 0.2 -- Parse snakemake parameters
input <- snakemake@input
params <- snakemake@params
output <- snakemake@output

# 1 -- Load input data
gr.cnv <- qread(input$gr_cnv, nthread=params$nthreads)

# 2 -- Build ExpressionSets
pairs_file <- fread(input$pairs_file)
setDF(pairs_file)
EaCoN:::buildPSetOut(gr.cnv, params$analysis_name, file.path(params$results),
    meta=pairs_file[, -c(1)])

# 3 -- Convert ExpressionSet objects to SummarizedExperiments

esetFiles <- list.files(params$results, 
    paste0(params$analysis_name, '.*ESet..RData'), full.names=TRUE)

esetNames <- list()
for (file in esetFiles) esetNames[file] <- load(file)

esets <- lapply(esetNames, get)
names(esets) <- gsub('.*/|_ESet..RData', '', esetFiles)

SEs <- lapply(esets, FUN=as, 'SummarizedExperiment')

# 4 -- Save results to disk
for (i in seq_along(SEs)) {
    qsave(SEs[[i]], 
        file=file.path(params$results, paste0(names(SEs)[i], '_SumExp.qs')))
}
Each rule in the Snakefile invokes one of the following scripts:

    scripts/1_batchProcessRawdataFiles.R
    scripts/2_segmentProcessedData.R
    scripts/3_estimateCopyNumber.R
    scripts/4_selectOptimalGamma.R
    scripts/5_buildSummarizedExperiments.R

URL: https://github.com/bhklab/affymetrix_cnv_pipeline
Name: affymetrix_cnv_pipeline
Version: 1