WES analysis pipeline


Somatic variant calling pipeline for whole exome and whole genome sequencing (WES & WGS). The pipeline uses Mutect2 to identify variants and mostly follows GATK Best Practices. SLURM execution functionality allows the workflow to run on Stanford's Sherlock computing cluster.

Getting Help

If you encounter problems using the pipeline, find a bug, or would like to request a new feature, please file an issue.

Prerequisites

References

You'll need a reference genome. The GRCh38 (hg38) genome is available on the Broad's GATK website.

You'll also need the cache files for the Variant Effect Predictor (VEP). Follow the tutorial here to download the data files. By default, the pipeline uses a Docker container with VEP v104 and vcf2maf for variant annotation, so it is suggested you use the v104 cache files. You can use your own version of VEP/vcf2maf by modifying the vep_env field in config.yaml.
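As an illustration only (the exact URL and cache layout may differ, so follow the linked tutorial for the authoritative steps), fetching the v104 GRCh38 cache typically looks something like this:

    # download and unpack the VEP cache into the default cache directory
    cd $HOME/.vep
    curl -O https://ftp.ensembl.org/pub/release-104/variation/indexed_vep_cache/homo_sapiens_vep_104_GRCh38.tar.gz
    tar xzf homo_sapiens_vep_104_GRCh38.tar.gz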

Input Files

The pipeline takes paired-end (PE) FASTQ files as input. Samples can be normal-tumor pairs or tumor-only individual samples. Samples can be divided into multiple read groups (see units.csv) or supplied as one pair of FASTQ files per sample.

You'll also need a BED file listing the target regions. For WES this is usually the capture kit regions. For WGS this can be the entire genome or a whitelist that excludes problematic regions. A WGS calling regions file is available in the GATK Resource Bundle (it will need to be converted from interval_list format to BED format; see the example below).
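If you start from the Broad's interval_list file, the conversion can be done with Picard's IntervalListToBed (bundled with GATK4); the file names below are illustrative:

    # convert the interval_list to BED for use as the target regions file
    gatk IntervalListToBed \
        -I wgs_calling_regions.hg38.interval_list \
        -O wgs_calling_regions.hg38.bed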

NOTE: If you have both WES and WGS samples to analyze, create two separate instances of the workflow and run them separately.

Software

Snakemake is required to run the pipeline. It is recommended that users have Singularity installed to take advantage of the preconfigured Docker containers for full reproducibility. If you don't want to use Singularity, you should install all required software (see the workflow files) and ensure it is on your PATH.
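Snakemake itself is commonly installed through conda (conda-forge/Bioconda); a minimal sketch, with an arbitrary environment name, is:

    # create and activate an environment containing Snakemake
    conda create -n ngs-pipeline -c conda-forge -c bioconda snakemake
    conda activate ngs-pipeline
    # confirm Singularity is available if you plan to run with --use-singularity
    singularity --version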

Setup

  1. Clone this repository or create a new repository using this workflow as a template

  2. Edit patients.csv and units.csv with the details for your analysis. See the schemas/ directory for details about each file.

  3. Configure config.yaml . See schemas/config.schema.yaml for info about each required field. Multiplexed samples should be differentiated with the readgroup column in units.csv . Sequencing data must be paired end, so both fq1 and fq2 are required.

Usage

After finishing the setup, do a dry run from the repo's base directory (the one containing the Snakefile) to check for errors:

snakemake -n

Once you're ready to run the analysis, type:

snakemake --use-singularity -j [cores]

This will run the workflow locally. It is recommended that you have at least 20 GB of storage available, as the Singularity containers take up around 8 GB and the output files can be very large depending on the number of samples and sequencing depth.

The pipeline produces two key files: mafs/variants.maf and qc/multiqc_report.html.

variants.maf includes somatic variants from all samples that passed Mutect2 filtering. They have been annotated with VEP, and a single effect has been chosen by vcf2maf using the Ensembl database. Ensembl uses its canonical isoforms for effect selection. Other isoforms can be specified with the alternate_isoforms field in config.yaml. See the cBioPortal override isoforms file for formatting.

multiqc_report.html includes quality metrics, such as coverage, for the fully processed BAM files. Individual VCF files for each sample prior to VEP annotation are found at vcfs/{patient}.vcf. VEP-annotated VCFs are found at vcfs/{patient}.vep.vcf. qc/depths.svg shows the sequencing depth distribution for normal and tumor samples.

Mutect2 Parallelism

Mutect2 can be scattered across many workers, processing different regions of each sample in parallel, via the num_workers setting in config.yaml. This divides the genomic_regions file into small subregion files (one per worker). Sets of intervals are split between files, and individual intervals are never broken. When using the Broad's WGS calling regions interval file, GATK does not seem to support more than 24 workers (it won't create more than 24 subregions). For WES targets such as exome capture kits, GATK will split the regions into more subsets (I have tested up to 50 workers, but it can probably do more).
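Under the hood, the scattering is done with GATK SplitIntervals (the rule appears in the Code Snippets section below). With num_workers set to 24, the equivalent standalone command is roughly the following, with illustrative paths:

    gatk SplitIntervals -R ref.fasta -L wgs_calling_regions.hg38.interval_list \
        --scatter-count 24 -O interval-files/ \
        --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION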

Cluster Execution

If you're using a compute cluster, you can take advantage of massively parallel computation to speed up the analysis. Only SLURM clusters are currently supported, but if you work with another cluster system (SGE, etc.), Snakemake makes it relatively easy to add support for your cluster.

Follow these instructions to use SLURM execution:

  1. Modify cluster.wes.json if you are analyzing WES data or cluster.wgs.json for WGS data. Set the account field to your SLURM account (for Sherlock, your SUNet ID). For the out field (the SLURM log path), the recommended format is /path/slurm-{jobid}.out. All directories in the path must exist before launch; Snakemake will not create any directories.

  2. Edit run_pipeline.sh. Specify the SLURM directives and set Snakemake to use cluster.wes.json or cluster.wgs.json depending on your needs (a sketch is shown after this list).

  3. When ready, launch the workflow

sbatch run_pipeline.sh [path to Snakemake directory]
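For step 2, a minimal run_pipeline.sh sketch might look like the following; the SBATCH directives are placeholders to adapt, and the --cluster string mirrors the fields referenced in the cluster config (account, time, mem, partition, ncpus, out):

    #!/bin/bash
    #SBATCH --job-name=ngs-pipeline
    #SBATCH --time=2-00:00:00
    #SBATCH --mem=4G
    # move into the Snakemake directory passed as the first argument to sbatch
    cd $1
    snakemake --use-singularity --cluster-config cluster.wes.json \
        --cluster 'sbatch -A {cluster.account} -t {cluster.time} --mem {cluster.mem} -p {cluster.partition} -c {cluster.ncpus} -o {cluster.out}' \
        --jobs 100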

If for some reason you can't leave the master Snakemake process running, Snakemake offers the ability to launch all jobs at once using --immediate-submit. This approach submits every job to the queue immediately and then finishes the master Snakemake process. The downside of this method is that temporary files are not supported, so the results directories will be very cluttered. To use --immediate-submit, follow these steps:

  1. Give parseJobID.sh permission to run
chmod +x parseJobID.sh
  2. Submit the Snakemake jobs
snakemake --cluster-config cluster.json --cluster 'sbatch $(./parseJobID.sh {dependencies}) -t {cluster.time} --mem {cluster.mem} -p {cluster.partition} -c {cluster.ncpus} -o {cluster.out}' --jobs 100 --notemp --immediate-submit

Deviations from Broad Pipeline

ngs-pipeline differs from the Broad's Somatic Variant pipeline in the following ways:

  • By default, FilterMutectCalls is run with additional flags; this can be changed in config.yaml. The default configuration requires that all variants have at least 1 read from each direction to be included in the final callset (see the example after this list).

  • We provide the option for more stringent variant filtering criteria with the stringent_filtering setting in config.yaml. This is turned off by default.
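On the FilterMutectCalls point above, GATK exposes a strand-support threshold that matches the described behavior; an illustrative invocation (not necessarily the pipeline's literal defaults, which live in config.yaml) would be:

    gatk FilterMutectCalls -V sample.vcf.gz -R ref.fasta \
        --min-reads-per-strand 1 \
        -O sample.filtered.vcf.gz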

Test Dataset

A small sample of chr21 reads from the Texas Cancer Research Biobank is supplied for end-to-end pipeline tests. The data are made available as open access with minimal privacy restrictions. Please read the Conditions of Use before using the data.

Tips

FASTQ Formatting

The first line of each read (called the sequence identifier) should consist of a single string with no spaces. An example of a properly formatted sequence identifier is:

@MGILLUMINA4_74:7:1101:10000:100149

Sequence identifiers that consist of multiple space-separated words will cause problems: GATK captures the entire string and uses it as the read name, but bwa only keeps the first word when it writes aligned reads to BAM files. This sequence identifier would break the pipeline:

@MGILLUMINA4_74:7:1101:10000:100149 RG:Z:A470018/1

Check the sequence identifiers if you encounter an "Aligned record iterator is behind the unmapped reads" error.
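If your FASTQs do have multi-word identifiers, one way to inspect and trim them before running the pipeline (file names here are placeholders) is:

    # look at the first sequence identifier
    zcat sample_R1.fastq.gz | head -n 1
    # keep only the first word of every header line (line 1 of each 4-line record)
    zcat sample_R1.fastq.gz | awk 'NR % 4 == 1 {print $1; next} {print}' | gzip > sample_R1.fixed.fastq.gz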

Citations

This pipeline is based on dna-seq-gatk-variant-calling by Johannes Köster. If you use ngs-pipeline, please use citations.md to cite the necessary software tools.

Code Snippets

shell:
    """
    gatk Mutect2 -R {input.ref} -I {input.tumor} \
        -tumor {params.tumor} -O {output.vcf} \
        --germline-resource {input.germ_res} \
        --f1r2-tar-gz {output.f1r2tar} \
        -L {input.interval} \
        -ip 100 \
        {params.normal_input} {params.normalname} \
        {params.pon} {params.extra}
    """
shell:
    """
    gatk LearnReadOrientationModel {params.i} -O {output}
    """
shell:
    """
    gatk GetPileupSummaries -I {input.bam} -V {input.germ_res} \
        -L {input.germ_res} -O {output}
    """
shell:
    """
    gatk CalculateContamination -I {input.tumor}  \
        {params.matched} \
        -O {output}
    """
shell:
    """
    gatk MergeVcfs {params.i} -O {output.vcf}
    """
shell:
    """
    gatk MergeMutectStats {params.i} -O {output.stats} 
    """
shell:
    """
    gatk FilterMutectCalls -V {input.vcf} -R {input.ref} \
        --contamination-table {input.contamination} \
        --stats {input.stats} \
        -ob-priors {input.f1r2model} \
        -O {output.vcf} \
        {params.extra}
    """
shell:
    """
    gatk SelectVariants -V {input.vcf} \
        -R {input.ref} \
        -O {output.vcf} \
        --exclude-filtered \
        -OVI
    """
shell:
    """
    vep_fp=`which vep`
    vep_path=$(dirname "$vep_fp")
    vcf2maf.pl --input-vcf {input.vcf} --output-maf {output.maf} \
        --tumor-id {wildcards.patient}.tumor \
        --ref-fasta {input.fasta} \
        --vep-data {input.vep_dir} \
        --ncbi-build {params.assembly} \
        --vep-path $vep_path \
        --maf-center {params.center} \
        {params.normalid} \
        {params.isoforms}
    """
script:
    "../scripts/combine_mafs.py"
shell:
    """
    gatk Mutect2 \
        -I {input.bam} \
        -R {input.ref} \
        -O {output.vcf} \
        -L {input.intervals} \
        -ip 100 \
        --max-mnp-distance 0
    """
shell:
    """
    gatk GenomicsDBImport \
        -R {input.ref} \
        --genomicsdb-workspace-path {output} \
        -V {params.vcfs} \
        -L {input.intervals} \
        -ip 100
    """
shell:
    """
    gatk CreateSomaticPanelOfNormals -R {input.ref} -V gendb://{input.var} -O {output.vcf}
    """
shell:
    """
    gatk FastqToSam -F1 {input.r1} -F2 {input.r2} \
        -O {output} \
        -SM {wildcards.patient}.{wildcards.sample_type} \
        -RG {wildcards.patient}.{wildcards.sample_type}.{wildcards.readgroup} \
        --TMP_DIR {params.tmp} \
        -PL {params.pl}
    """
shell:
    """
    bwa index {input}
    """
shell:
    """
    gatk SamToFastq -I {input.bam} -F /dev/stdout -INTER true -NON_PF true \
    | \
    bwa mem -p -v 3 -t {threads} -T 0 \
        {input.ref} /dev/stdin - 2> >(tee {log} >&2) \
    | \
    samtools view -Shb -o {output}
    """
shell:
    """
    gatk MergeBamAlignment \
        -R {input.ref} \
        -UNMAPPED {input.unaligned} \
        -ALIGNED {input.aligned} \
        -O {output} \
        --SORT_ORDER queryname \
        --TMP_DIR {params.tmp}
    """
shell:
    """
    gatk MarkDuplicatesSpark \
        -I {params.inbams} \
        -O {output.bam} \
        --tmp-dir {params.tmp} \
        --conf 'spark.executor.cores={threads}' \
        --conf 'spark.local.dir={params.tmp}'
    """
shell:
    """
    gatk BaseRecalibrator -I {input.bam} -R {input.ref} -O {output.recal} \
        {params.ks} --tmp-dir {params.tmp}
    """
shell:
    """
    gatk ApplyBQSR \
        -I {input.bam} \
        -R {input.ref} \
        -O {output.bam} \
        -bqsr {input.recal} \
        --add-output-sam-program-record \
        --tmp-dir {params.tmp}
    """
shell:
    """
    mosdepth --by {params.by} -t {threads} qc/{wildcards.patient}.{wildcards.sample_type} \
        {input.bam}
    """
shell:
    """
    samtools flagstat {input} > {output}
    """
shell:
    """
    tmpdir=qc/fastqc/{wildcards.patient}-{wildcards.sample_type}.tmp 
    mkdir $tmpdir 
    fastqc --outdir $tmpdir {input} 
    mv $tmpdir/{wildcards.patient}.{wildcards.sample_type}_fastqc.html {output.html} 
    mv $tmpdir/{wildcards.patient}.{wildcards.sample_type}_fastqc.zip {output.zipdata} 
    rm -r $tmpdir
    """
wrapper:
    "0.50.4/bio/multiqc"
script:
    "../scripts/gather_depths.py"
script:
    "../scripts/plot_depth.R"
shell:
    """
    gatk SplitIntervals -R {input.ref} -L {input.intervals} \
        --scatter-count {params.N} -O {params.d} \
        --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION
    """
shell:
    """
    gatk BedToIntervalList \
        -I {input.bed} \
        -SD {input.d} \
        -O {output}
    """
import os
import sys
import pandas as pd

def concat_mafs(files):
    mafs = []
    for f in files:
        df = pd.read_csv(f, sep = "\t", skiprows=1)
        mafs.append(df)
    all_mafs = pd.concat(mafs)
    all_mafs['patient'] = all_mafs['Tumor_Sample_Barcode'].str.replace(".tumor", "", regex=False)
    if all_mafs.shape[0] == 0:
        print("WARNING: There are no variants in your MAF file!")
    return all_mafs 

def main():
    files = snakemake.input
    maf = concat_mafs(files)
    if snakemake.params[0] is True:
        maf = maf.query('t_depth >= 15 & t_alt_count >= 5')
    maf.to_csv(snakemake.output[0], sep="\t", index=False, header=True)

if __name__ == '__main__':
    main()
import os
import pandas as pd

col_names = ['chr', 'start', 'end', 'depth']

def parse_summaries(files):
    depths = []
    for f in files:
        df = pd.read_csv(f, sep='\t')
        s = os.path.basename(f).split('.')
        patient = s[0]
        sample_type = s[1]
        depth = df.query("chrom == 'total_region'")
        depth['patient'] = patient
        depth['sample_type'] = sample_type
        depths.append(depth)
    return pd.concat(depths, ignore_index = True)

def main():
    files = snakemake.input
    df = parse_summaries(files)
    df = df[['patient', 'sample_type', 'chrom', 'length', 'bases', 'mean', 'min', 'max']]
    df.to_csv(snakemake.output[0], index=False, header=True)

if __name__ == '__main__':
    main()
Sys.setenv("TZDIR"=paste0(Sys.getenv("CONDA_PREFIX"), "/share/zoneinfo"))

library(readr)
library(dplyr)
library(ggplot2)

df <- read_csv(snakemake@input[[1]])
df <- df %>%
    mutate(sample_type = factor(sample_type, levels = c("normal", "tumor")))
p <- df %>%
    ggplot(aes(sample_type, mean)) +
    geom_boxplot(aes(fill = sample_type)) +
    labs(x = "", y = "Average Depth") +
    guides(fill = guide_legend(title="")) +
    ggtitle("Sequencing Depths")
ggsave(snakemake@output[[1]], p)
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/2eding/wes-pipeline
Name: wes-pipeline
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...