Easy Copy Number Analysis (EaCoN) Pipeline

public public 1yr ago 0 bookmarks

This pipeline has been adapted from https://github.com/gustaveroussy/EaCoN. It leverages the EaCoN R package to conduct preprocessing and normalization, segmentation and copy number estimation from specific microarray .CEL files. The EaCoN package supports copy number estimation from Cytoscan, Oncoscan, SNP6 arrays as well as WES data, however, the current pipeline has only implemented support for microarray data.

Snakemake

This pipeline leverages the snakemake Python package for workflow management. As a result the pipeline and its dependencies are easily installed from this repository, allowing quick setup, configuration and deployment.

For more information on Snakemake, please see: https://snakemake.readthedocs.io/en/stable/.

Software Environment

Dependency management for this pipeline is handled via conda for Python and renv for R. To get started with setup you can install miniconda3 using the instructions available here: https://docs.conda.io/en/latest/miniconda.html.

Alternatively you can install it directly from CRAN as described here: https://cran.r-project.org/.

Setting Up Your Software Environment

The first step to deploying an analysis pipeline is to install the various software packages it depends on. We have included the env/eacon.yml and renv.lock files here to easily accomplish this.

All commands should be executed from the top level directory of this repository unless otherwise indicated.

Python and Snakemake

The first step to deploying an analysis pipeline is to install Python, Snakemake and Singularity via conda . We have included the envs/eacon.yml which specifies all the requisite dependencies to use snakemake for this pipeline, including R.

You can use conda to install all Python and OS system dependencies using:

conda env create --file env/eacon.yml

This will take some time to run as it gathers and installs the correct package versions. The environment it creates should be called eacon .

If it is not automatically activated after installation please run conda activate eacon before proceeding to the next step.

R Dependencies

R dependencies are handled via renv and all rules in this pipeline will use the local R package cache stored in the renv directory.

To install all R dependencies run:

Rscript -e 'library(renv); renv::init();'

If you wish to isolate the R dependencies from your Conda environment R libraries, you can use this command instead:

Rscript -e 'library(renv); renv::isolate(); renv::init(bare=TRUE)'

If intialization doesn't trigger dependency installation, you can do so manually using:

Rscript -e 'renv::restore()'

For more information on renv and how it can be used to manage dependencies in your project, please see: https://rstudio.github.io/renv/articles/renv.html.

Configuring the Pipeline

This pipeline assumes the following directory structure:

.
├── env
├── metadata
├── procdata
├── rawdata
├── renv
├── results
└── scripts

Please at minimum create the rawdata and metadata directories, as they are assumed to hold the raw microarray plate data (.CEL) and the pairs file, respectively. For more information on the correct formatting for your pairs file, please see https://github.com/gustaveroussy/EaCoN. The remaining missing directories will be created automatically as the pipeline runs.

config.yaml

This file hold the relevant pipeline documentation. Here you can specify the paths to all the parameters for your current pipeline use case. Documentation is provided in the config.yaml file on what each field should contain.

Using the Pipeline

Deployment

To run the pipeline end-to-end:

snakemake --cores <n_cores> --use-conda

Where <n_cores> is the number of cores to parallelize over.

For details on deploying this pipleine via your local HPC cluster or in the cloud, please consult the Snakemake documentation. Deployment to these platforms requires minimal additional configuration.

Individual Rules

The pipeline can also be run rule by rule using the rule names.

Batch Processing and Normalization

snakemake --cores 2 --use-conda batch_process_rawdata

Segmentation

snakemake --cores 2 --use-conda segment_processed_data

Copy Number Calling

snakemake --cores 2 --use-conda estimate_copy_number

Determine Optimal Value for Gamma

snakemake --cores 2 --use-conda select_optimal_gamma

Build Bioconductor RaggedExperiment Object

snakemake --cores 2 --use-conda build_ragged_experiment

Filter Samples Based on QC Criteria

snakemake --cores 2 --use-conda sample_quality_control

Code Snippets

 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
renv::activate()
library(EaCoN)


# 0.2 -- Parse Snakemake arguments
input <- snakemake@input
params <- snakemake@params
nthreads <- snakemake@threads
output <- snakemake@output


# 0.3 -- Load platform specific dependencies
if (grepl("snp|cytoscan|oncoscan", params$array_type, ignore.case=TRUE))
    library(affy.CN.norm.data)

if (grepl('snp', params$array_type)) {
        library(apt.snp6.1.20.0)
        library(rcnorm)
        library(GenomeWideSNP.6.na35.r1)
}
## FIXME:: Add conditional dependencies for other platforms

switch(params$reference,
    'BSgenome.Hsapiens.UCSC.hg19'=library(BSgenome.Hsapiens.UCSC.hg19),
    'BSgenome.Hsapiens.UCSC.hg38'={
        if (grepl("snp", params$array_type, ignore.case=TRUE))
            stop("Must use BSgenome.Hsapiens.UCSC.hg19 for GenomeWide SNP6 arrays!")
        library(BSgenome.Hsapiens.UCSC.hg38)
    }
)


# 1 -- Load or create the metadata file specifying CEL paths
if (file.exists(input$pairs_file)) {
    pairs_df <- read.table(input$pairs_file, sep="\t", header=TRUE,
        stringsAsFactors=FALSE)
} else {
    # find all CEL files in the raw data
    cel_file_paths <- list.files(params$rawdata, pattern="*.CEL$",
        recursive=TRUE, full.names=TRUE)
    pairs_df <- data.frame(
        cel_files=cel_file_paths,
        # assumes the second element in path is the sample name
        SampleName=vapply(cel_file_paths, FUN=function(x) strsplit(x)[[1]][2],
            FUN.VALUE=character(1))
    )
    if (!is.null(input$pairs_file) || input$paris_file == "") {
        # create path if it doesn't exist
        pairs_path <- dirname(input$pairs_file)
        if (!file.exists(pairs_path)) dir.create(pairs_path, recursive=TRUE)
        # write out a pairs file
        write.table(pairs_df, input$pairs_file)
    }
}


# 2 -- Format the paths in the pairs_file to match this project directory
#   structure from config.yaml
if (grepl('cytocscan|oncoscan', params$array_type, ignore.case=TRUE)) {
    stopifnot(c("ATChannelCel", "GCChannelCel", "SampleName") %in% colnames(pairs_df))
    # remove existing path, if there is one
    pairs_df$ATChannelCel <- gsub("^.*\\/", "", pairs_df$ATChannelCel)
    pairs_df$GCChannelCel <- gsub("^.*\\/", "", pairs_df$GCChannelCel)
    # create a new path relative to specified rawdata directory
    pairs_df$GCChannelCel <- file.path(getwd(), params$rawdata,
        pairs_df$GCChannelCel)
} else if (grepl('snp6', params$array_type, ignore.case=TRUE)) {
    stopifnot(all(c("cel_files", "SampleName") %in% colnames(pairs_df)))
    # pairs_df$cel_files <- gsub("^.*\\/", "", pairs_df$cel_files)
    # pairs_df$cel_files <- file.path(getwd(), params$rawdata, pairs_df$cel_files)
}
# output the file to temporary storage so it can be read by EaCoN
pairs_file <- file.path(tempdir(), "CEL_pairs_file.csv")
write.table(pairs_df, file=pairs_file, sep="\t")


# 3 -- Preprocess and normalize the raw data; does
if (grepl('cytoscan', params$array_type, ignore.case=TRUE)) {
    EaCoN:::CS.Process.Batch(pairs_file,
        nthread=nthreads, out.dir=params$procdata, force=TRUE,
        cluter.type=params$cluster_type)
} else if (grepl('oncoscan', params$array_type, ignore.case=TRUE)) {
    EaCoN:::OS.Process.Batch(pairs_file,
        nthread=nthreads, out.dir=params$procdata, force=TRUE,
        cluster.type=params$cluster_type)
} else if (grepl('snp', params$array_type, ignore.case=TRUE)) {
    EaCoN:::SNP6.Process.Batch(pairs_file, out.dir=params$procdata, force=TRUE,
        nthread=nthreads, cluster.type=params$cluster_type)
} else if (grepl('wes', params$array_type, ignore.case=TRUE)) {
    stop("WES has not been implemented in this pipeline yet, please see
        https://github.com/gustaveroussy/EaCoN for information on
        setting up your own analysis script.")
} else {
    stop("Supported assay families are wes, cytoscan, oncoscan and snp6")
}
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
renv::activate()
library(EaCoN)

# 0 -- Parse Snakemake arguments
input <- snakemake@input
params <- snakemake@params
nthreads <- snakemake@threads
output <- snakemake@output

# 1 -- L2R and BAF joint segment the preprocessed data from the previous rule
EaCoN:::Segment.ff.Batch(RDS.files=unlist(input), nthread=nthreads,
    segmenter=params$segmenter, smooth.k=params$smoothk,
    BAF.filter=params$BAF_filter, nrf=params$nrf, SER.pen=params$SER_pen,
    force=TRUE)
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
renv::activate()
library(EaCoN)

# 0 -- Parse Snakemake arguments
input <- snakemake@input
params <- snakemake@params
nthreads <- snakemake@threads
output <- snakemake@output

# 2 -- Generate .html qc reports for

# 1 -- Make copy number calls based on L2R segmentation results
EaCoN:::ASCN.ff.Batch(unlist(input), nthread=nthreads, force=TRUE,
    gammaRange=unlist(params$gamma_range))
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
renv::activate()
library(data.table)


# -- 0.2 Parse snakemake arguments
input <- snakemake@input
params <- snakemake@params
output <- snakemake@output

# -- 0.3 Load local utility functions
source(file.path("scripts", "utils.R"))

# -- 1. Read in gamma files

gammaFiles <- list.files(params$out_dir, '.*gammaEval.*txt', recursive=TRUE,
    full.names=TRUE)

gamma_df <- rbindlist(
    setNames(
        lapply(gammaFiles, FUN=fread),
        gsub(".gammaEval.txt", "", basename(gammaFiles))
    ),
    idcol="sample_name"
)

# -- 2. Select the best fits
best_fits <- gamma_df[, .SD[which.max(GoF)], by="sample_name"]

# -- 3. Save the best fit data.frame to csv
fwrite(best_fits, file=output[[1]])
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
renv::activate()
library(EaCoN)
library(data.table)
library(qs)
library(GenomicRanges)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)
library(BiocParallel)
library(RaggedExperiment)

# -- 0.2 Parse snakemake parameters
input <- snakemake@input
params <- snakemake@params
output <- snakemake@output

# -- 0.3 Load utity functions
source(file.path("scripts", "utils.R"))

# -- 1. Load the optimal gamma for each sample
best_fits <- fread(input[[1]])

# -- 2. Find the .RDS files associated with the best fits
best_fit_files <- Map(
    function(x, y)
        grep(pattern=y, list.files(x, recursive=TRUE, full.names=TRUE), value=TRUE),
    x=file.path(params$out_dir, best_fits$sample_name),
    y=paste0(".*gamma", sprintf("%.2f", best_fits$gamma), "/.*RDS$")
)
l2r_files <- Map(
    function(x, y)
        grep(pattern=y, list.files(x, recursive=TRUE, full.names=TRUE), value=TRUE),
    x=file.path(params$out_dir, best_fits$sample_name),
    y=".*L2R/.*RDS$"
)

# -- 3. Load the best fit ASCN and L2R data and build GRanges objects

.build_granges_from_cnv <- function(ascn, l2r) {
    ascn_data <- readRDS(ascn)
    l2r_data <- readRDS(l2r)
    buildGRangesFromASCNAndL2R(ascn_data, l2r_data)
}

BPPARAM <- BiocParallel::bpparam()
BiocParallel::bpworkers(BPPARAM) <- params$nthreads
gr_list <- BiocParallel::bpmapply(.build_granges_from_cnv,
    best_fit_files, l2r_files,
    SIMPLIFY=FALSE, USE.NAMES=TRUE, BPPARAM=BPPARAM
)

# removing directory paths
names(gr_list) <- basename(names(gr_list))

# -- 4. Construct a RaggedExperiment object
ragged_exp <- as(GRangesList(gr_list), "RaggedExperiment")

# include annotated bins to summarize the RaggedExperiment with
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
genome_bins <- binReferenceGenome()
annotated_bins <- annotateGRangesWithTxDB(genome_bins, txdb=txdb)

# include annotated genes to summarized RaggedExperiment with
gene_granges <- genes(txdb)
annotated_genes <- annotateGRangesWithTxDB(gene_granges, txdb=txdb)

metadata(ragged_exp) <- list(
    annotated_genome_bins=annotated_bins,
    annotated_genes=annotated_genes,
    simplifyReduce=function(scores, ranges, qranges) {
        if (is.numeric(scores)) {
            x <- mean(scores, na.rm=TRUE)
        } else {
            count_list <- as.list(table(scores))
            x <- paste0(
                paste0(names(count_list), ":", unlist(count_list)),
                collapse=","
            )
        }
        return(x)
    }
)

# -- Save files to disk
qsave(ragged_exp, file=output[[1]], nthreads=params$nthread)
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
renv::activate()
library(EaCoN)
library(Biobase)
library(RaggedExperiment)
library(data.table)
library(qs)

# 0.2 -- Parse snakemake parameters
input <- snakemake@input
params <- snakemake@params
output <- snakemake@output

# 1 -- Collect and parse QC data
qc_files <- list.files(params$procdata, ".*qc.txt", full.names=TRUE,
    recursive=TRUE)
dt_list <- lapply(qc_files, FUN=fread)
qc_dt <- rbindlist(dt_list)
qc_dt$sample_name <- gsub("_.*$", "", basename(qc_files))

# 2 -- Output the QC results
fwrite(qc_dt, output$qc_csv)

# 3 -- Load the RaggedExperiment and filter on QC criteria
ragged_exp <- qread(input$ragged_exp)

# 4 -- Identify samples passing QC
qc_dt <- qc_dt[
    MAPD <= params$mapd & `waviness-sd` <= params$ndwavinesssd &
        SNPQC >= params$snpqc,
]
if (params$cellularity > 0) {
    if ("TUSCAN-cellularity" %in% colnames(qcdt))
        qc_dt <- qc_dt[`TUSCAN-cellularity` >= params$cellularity]
}
keep_samples <- intersect(colnames(ragged_exp), qc_dt$sample_name)
print(keep_samples)

# 5 -- Subset RaggedExperiment
ragged_exp <- ragged_exp[, keep_samples]
metadata(ragged_exp)$qc_parameters <- params[-c(1, 2)]

# 6 -- Write RaggedExperiment to disk
qsave(ragged_exp, file=output$ragged_exp, nthreads=params$nthreads)
73
74
script:
    "scripts/1_batchProcessRawdataFiles.R"
94
95
script:
    "scripts/2_segmentProcessedData.R"
109
110
script:
    "scripts/3_estimateCopyNumber.R"
SnakeMake From line 109 of master/Snakefile
124
125
script:
    "scripts/4_selectOptimalGamma.R"
SnakeMake From line 124 of master/Snakefile
139
140
script:
    "scripts/5_buildRaggedExperiment.R"
SnakeMake From line 139 of master/Snakefile
157
158
script:
    "scripts/6_sampleQualityControl.R"
SnakeMake From line 157 of master/Snakefile
170
171
script:
    "scripts/7_customTotalCopyCalls.R"
SnakeMake From line 170 of master/Snakefile
ShowHide 9 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/bhklab/eacon_cnv_pipeline
Name: eacon_cnv_pipeline
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...