Whole exome somatic copy number analysis with Sequenza and CNVKit
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Snakemake workflow to analyze somatic copy number alterations from whole exome data using Sequenza and CNVkit
Sample File Format
samples.csv
should be a CSV file with the following columns:
-
patient
- patient identifier -
sample_type
- eithernormal
ortumor
-
bam
- filepath for BAM for that sample
All patients are required to have both normal and tumor BAMs.
Installation
-
Install Snakemake
-
Install singularity or all dependencies (see Dependencies section)
-
Clone repository
-
Create
samples.csv
file with sample information (see Samples File Format section) -
Modify
config.yaml
-
If using a SLURM cluster for execution, edit
cluster.json
andrun_pipeline.sh
Dependencies
Singularity is recommended to ensure reproducibility and avoid having to manually manage installation of Sequenza, CNVkit, and R. If you would prefer to not use Singularity, make sure the following programs are installed and accessible from your command line:
-
sequenza-utils
command line tool -
CNVkit
-
R
The following R libraries are also needed:
-
sequenza
-
readr
Running
NOTE:
This workflow saves all intermediate and output files to the same
directory where
Snakefile
is located. Ensure you have the appropriate storage
before running
If running locally
snakemake -j [# of cores] [--use-singularity]
If running on a SLURM cluster (after completing
cluster.json
and
run_pipeline.sh
# This needs to be run from the directory where Snakefile is located
sbatch run_pipeline.sh
Output
-
results/sequenza_info.csv
- sequenza purity and ploidy estimates for each tumor -
cnvkit-results
- CNVKit segmentation files and call files -
sequenza/[patient] - results/
- sequenza segmentation files for that[patient]
-
access.5kb.hg38.bed
- accessible regions in the reference fasta at 5kb windows
sequenza_info.csv
and
cnvkit-results/
will be the most interesting to users.
I primarily use Sequenza for purity/ploidy estimation and CNVKit for copy number segmentation.
See this paper for a discussion of copy number caller performance
Code Snippets
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | Sys.setenv("TZDIR"=paste0(Sys.getenv("CONDA_PREFIX"), "/share/zoneinfo")) library(readr) files <- snakemake@input dfs <- list() for (i in 1:length(files)) { fp <- files[[i]] idx <- gregexpr('\\.', fp)[[1]][1] sid <- basename(substr(fp, 1, idx-1)) df <- read_tsv(fp) df$sample_id <- sid df <- df[2, ] #only get point estimate dfs[[i]] <- df } df <- do.call("rbind", dfs) print(df) print(summary(df)) write_csv(df, snakemake@output[[1]]) |
1 2 3 4 5 6 7 8 9 10 11 12 13 | Sys.setenv("TZDIR"=paste0(Sys.getenv("CONDA_PREFIX"), "/share/zoneinfo")) library(sequenza) fp <- snakemake@input[[1]] data <- sequenza.extract(fp) CP <- sequenza.fit(data) idx <- gregexpr('\\.', snakemake@input[[1]])[[1]][1] sid = substr(basename(snakemake@input[[1]]), 1, idx-1) print(paste("SampleID:", sid)) print(paste("Output dir:", snakemake@output[["outdir"]])) sequenza.results(sequenza.extract = data, cp.table = CP, sample.id = sid, out.dir = snakemake@output[[1]]) |
63 64 65 66 | shell: """ sequenza-utils gc_wiggle --fasta {input} -w 50 -o {output} """ |
81 82 83 84 85 86 | shell: """ sequenza-utils bam2seqz -n {input.normal} -t {input.tumor} --fasta {input.fasta} \ -gc {input.GC} -o {output.seqz} {params.chroms} sequenza-utils seqz_binning --seqz {output.seqz} -w 50 -o {output.small} """ |
96 97 | script: "scripts/sequenza.R" |
105 106 | script: "scripts/gather_info.R" |
114 115 116 117 | shell: """ cnvkit.py access {input} -o {output} """ |
130 131 132 133 134 135 136 137 138 | shell: """ cnvkit.py batch {input.tumors} --normal {input.normals} \ --targets {input.targets} --fasta {input.ref} \ --access {input.access} \ --output-dir {output} \ --output-reference cnvkit_reference.cnn \ -p {threads} """ |
Support
- Future updates