Whole exome somatic copy number analysis with Sequenza and CNVKit

public public 1yr ago 0 bookmarks

Snakemake workflow to analyze somatic copy number alterations from whole exome data using Sequenza and CNVkit

Sample File Format

samples.csv should be a CSV file with the following columns:

  • patient - patient identifier

  • sample_type - either normal or tumor

  • bam - filepath for BAM for that sample

All patients are required to have both normal and tumor BAMs.

Installation

  1. Install Snakemake

  2. Install singularity or all dependencies (see Dependencies section)

  3. Clone repository

  4. Create samples.csv file with sample information (see Samples File Format section)

  5. Modify config.yaml

  6. If using a SLURM cluster for execution, edit cluster.json and run_pipeline.sh

Dependencies

Singularity is recommended to ensure reproducibility and avoid having to manually manage installation of Sequenza, CNVkit, and R. If you would prefer to not use Singularity, make sure the following programs are installed and accessible from your command line:

  • sequenza-utils command line tool

  • CNVkit

  • R

The following R libraries are also needed:

  • sequenza

  • readr

Running

NOTE: This workflow saves all intermediate and output files to the same directory where Snakefile is located. Ensure you have the appropriate storage before running

If running locally

snakemake -j [# of cores] [--use-singularity]

If running on a SLURM cluster (after completing cluster.json and run_pipeline.sh

# This needs to be run from the directory where Snakefile is located
sbatch run_pipeline.sh

Output

  • results/sequenza_info.csv - sequenza purity and ploidy estimates for each tumor

  • cnvkit-results - CNVKit segmentation files and call files

  • sequenza/[patient] - results/ - sequenza segmentation files for that [patient]

  • access.5kb.hg38.bed - accessible regions in the reference fasta at 5kb windows

sequenza_info.csv and cnvkit-results/ will be the most interesting to users. I primarily use Sequenza for purity/ploidy estimation and CNVKit for copy number segmentation.

See this paper for a discussion of copy number caller performance

Code Snippets

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
Sys.setenv("TZDIR"=paste0(Sys.getenv("CONDA_PREFIX"), "/share/zoneinfo"))
library(readr)

files <- snakemake@input

dfs <- list()
for (i in 1:length(files)) {
    fp <- files[[i]]
    idx <- gregexpr('\\.', fp)[[1]][1]
    sid <- basename(substr(fp, 1, idx-1))
    df <- read_tsv(fp)
    df$sample_id <- sid
    df <- df[2, ] #only get point estimate
    dfs[[i]] <- df
}

df <- do.call("rbind", dfs)

print(df)
print(summary(df))

write_csv(df, snakemake@output[[1]])
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
Sys.setenv("TZDIR"=paste0(Sys.getenv("CONDA_PREFIX"), "/share/zoneinfo"))

library(sequenza)

fp <- snakemake@input[[1]]
data <- sequenza.extract(fp)
CP <- sequenza.fit(data)
idx <- gregexpr('\\.', snakemake@input[[1]])[[1]][1]
sid = substr(basename(snakemake@input[[1]]), 1, idx-1)
print(paste("SampleID:", sid))
print(paste("Output dir:", snakemake@output[["outdir"]]))
sequenza.results(sequenza.extract = data, cp.table = CP, 
                 sample.id = sid, out.dir = snakemake@output[[1]])
R From line 1 of scripts/sequenza.R
63
64
65
66
shell:
    """
    sequenza-utils gc_wiggle --fasta {input} -w 50 -o {output}
    """
81
82
83
84
85
86
shell:
    """
    sequenza-utils bam2seqz -n {input.normal} -t {input.tumor} --fasta {input.fasta} \
        -gc {input.GC} -o {output.seqz} {params.chroms}
    sequenza-utils seqz_binning --seqz {output.seqz} -w 50 -o {output.small}
    """
96
97
script:
    "scripts/sequenza.R"
SnakeMake From line 96 of main/Snakefile
105
106
script:
    "scripts/gather_info.R"
SnakeMake From line 105 of main/Snakefile
114
115
116
117
shell:
    """
    cnvkit.py access {input} -o {output}
    """
SnakeMake From line 114 of main/Snakefile
130
131
132
133
134
135
136
137
138
shell:
    """
    cnvkit.py batch {input.tumors} --normal {input.normals} \
        --targets {input.targets} --fasta {input.ref} \
        --access {input.access} \
        --output-dir {output} \
        --output-reference cnvkit_reference.cnn \
        -p {threads}
    """
SnakeMake From line 130 of main/Snakefile
ShowHide 6 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/tjbencomo/cna-pipeline
Name: cna-pipeline
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...