An adaptable Snakemake workflow which uses GATKs best practice recommendations to perform germline mutation calling starting with BAM files

public public 1yr ago 0 bookmarks

This Snakemake workflow follows the GATK best-practice recommandations to call small germline variants.

The pipeline requires as inputs aligned BAM files (e.g. with BWA ) where the duplicates are already marked (e.g. with Picard or sambamba ). It then performed Base Quality Score Recalibration and joint genotyping of multiple samples , which is automatically parallized over user defined intervals (for examples see intervals.txt ) and chromosomes.

Filtering is performed using GATKs state-of-the-art Variant Quality Score Recalibration

At the end of the worklow, the Variant Effect Predictor is used to annotate the identified germline mutations.

A high level overview of the performed steps can be seen below:

DAG

As seen by the execution graph, an arbitrary number of samples/BAM files can be processed in parallel up to the joint variant calling.

Installation

Required tools:

The majority of the listed tools can be quite easily installed with conda which is recommanded.

Usage

First, modify the config_wgs.yaml and resources.yaml files. Both files contain detailed description what is expected. The config_wgs.yaml also contains links to some reference resources. Be careful, they are all specific for the GRCh37/hg19/b37 genome assembly.

After setting up all the config files and installing all tools, you can simply run:

snakemake --latency-wait 300 -j 5 --cluster "sbatch --mem={resources.mem_mb} --time {resources.runtime_min} --cpus-per-task {threads} --job-name={rule}.%j --output snakemake_cluster_submit_log/{rule}.%j.out --mail-type=FAIL"

This assumes that the cluster you are using is running SLURM . If this is not the case, you have to adjust the command after --cluster . The log information of each job will be safed in the snakemake_cluster_submit_log directory. This directory will not be created automatically.

-j specifies the number of jobs/rules should be submitted in parallel.

I recommand running this command in a detached session with tmux or screen .

Output

Below is the output of the tree command, after the workflow has finished for one patient H005-00ML. Usually you would include many patients simultaneously (>50). This is just to illustrate the created output files.

.
├── cohort
│ ├── benchmark
│  ├── ApplyVQSR_indel.txt
│  ├── ApplyVQSR_snp.txt
│  ├── CombineGVCFs.txt
│  ├── GenotypeGVCFs.txt
│  ├── MergeCohortVCFs.txt
│  ├── SelectVariants.txt
│  ├── VEP.txt
│  ├── VQSR_indel.txt
│  └── VQSR_snp.txt
│ ├── cohort.recalibrated.pass.vep.vcf.gz
│ ├── cohort.recalibrated.pass.vep.vcf.gz_summary.html
│ ├── cohort.recalibrated.vcf.gz
│ ├── cohort.recalibrated.vcf.gz.tbi
│ └── logs
│ ├── ApplyVQSR_indel.out
│ ├── ApplyVQSR_snp.out
│ ├── CombineGVCFs
│ ├── CombineGVCFs.1.out
│ ├── CombineGVCFs.2.out
│ ├── ...
│ ├── ...
│ ├── CombineGVCFs.Y.out
│ ├── GenotypeGVCFs.1.out
│ ├── GenotypeGVCFs.2.out
│ ├── ...
│ ├── ...
│ ├── GenotypeGVCFs.Y.out
│ ├── MakeSitesOnly.out
│ ├── MergeCohortVCFs.out
│ ├── SelectVariants.err
│ ├── VEP.out
│ ├── VQSR_indel.out
│ └── VQSR_snp.out
├── config
│ ├── config_wgs.yaml
│ └── resources.yaml
├── H005-00ML
│ ├── benchmark
│  ├── ApplyBQSR.txt
│  ├── BaseRecalibrator.txt
│  ├── GatherBQSRReports.txt
│  ├── GatherRecalBamFiles.txt
│  ├── HaplotypeCaller.txt
│  ├── IndexBam.txt
│  ├── MergeHaplotypeCaller.txt
│  └── SortBam.txt
│ ├── H005-00ML.germline.merged.g.vcf.gz
│ ├── H005-00ML.germline.merged.g.vcf.gz.tbi
│ └── logs
│ ├── ApplyBQSR
│ ├── ApplyBQSR.0000-scattered.interval_list.out
│ ├── ApplyBQSR.0001-scattered.interval_list.out
│ ├── ...
│ ├── ...
│ ├── ApplyBQSR.0049-scattered.interval_list.out
│ ├── BaseRecalibrator
│ ├── BaseRecalibrator.0000-scattered.interval_list.out
│ ├── BaseRecalibrator.0001-scattered.interval_list.out
│ ├── ...
│ ├── ...
│ ├── BaseRecalibrator.0049-scattered.interval_list.out
│ ├── GatherBQSRReports.out
│ ├── GatherRecalBamFiles.out
│ ├── HaplotypeCaller
│ ├── HaplotypeCaller.0000-scattered.interval_list.out
│ ├── HaplotypeCaller.0001-scattered.interval_list.out
│ ├── ...
│ ├── ...
│ ├── HaplotypeCaller.0049-scattered.interval_list.out
│ ├── IndexBam.out
│ ├── MergeHaplotypeCaller.out
│ └── SortBam.out
├── rules
│ ├── BaseQualityScoreRecalibration.smk
│ ├── JointGenotyping.smk
│ ├── VEP.smk
│ └── VQSR.smk
├── Snakefile
├── snakemake_cluster_submit_log
│ ├── ApplyBQSR.24720887.out
│ ├── ApplyVQSR_snp.24777265.out
│ ├── BaseRecalibrator.24710227.out
│ ├── CombineGVCFs.24772984.out
│ ├── GatherBQSRReports.24715726.out
│ ├── GatherRecalBamFiles.24722478.out
│ ├── GenotypeGVCFs.24773026.out
│ ├── HaplotypeCaller.24769848.out
│ ├── IndexBam.24768728.out
│ ├── MergeCohortVCFs.24776018.out
│ ├── MergeHaplotypeCaller.24772183.out
│ ├── SelectVariants.24777733.out
│ ├── SortBam.24768066.out
│ ├── VEP.24777739.out
│ ├── VQSR_indel.24776035.out
│ └── VQSR_snp.24776036.out

For each analyzed patient, a seperate directory gets created. Along with the patient specific gvcf file, this directory contains log files for all the processing steps that were performed for that patient ( log directory) as well as benchmarks for each rule, e.g. how long the step took or how much CPU/RAM was used ( benchmark directory).

The cohort directory contains the multi-sample VCF file, which gets created after performing the joint variant calling. The cohort.recalibrated.vcf.gz is the product of GATKs Variant Quality Score Recalibration . The cohort.recalibrated.pass.vep.vcf.gz is the filtered and VEP annotated version of cohort.recalibrated.vcf.gz (only variants with PASS are kept).

For most applications, the cohort.recalibrated.pass.vep.vcf.gz file, is the file you want to continue working with.

Code Snippets

68
69
70
71
72
73
74
75
76
77
78
79
shell:
     "mem_per_job=$(expr {resources.mem_mb} / {threads}) \n"
     "export mem_per_job \n"

     "cat {input.intervals} | parallel --plus -j {threads} '{input.GATK} --java-options -Xmx${{mem_per_job}}m BaseRecalibrator "
     "-I {input.marked_dup_bam} "
     "-R {input.reference} "
     "--known-sites {input.dbSNP} "
     "-L {{}} "
     "-O {params.recal_data}.{{/}}.table &> {log.out_recal}.{{/}}.out' \n"

     "touch {output.decoy}"
108
109
110
111
112
113
shell:
     "{input.GATK} GatherBQSRReports "
     "{params.recal_tables} "
     "-O {output.recal_data} &> {log.out_GatherBQSRReports} \n"

     "rm {params.tmp_dir}/{wildcards.patient}/normal_recal_bqsr.0*table"
142
143
144
145
146
147
148
149
150
151
152
shell:
     "mem_per_job=$(expr {resources.mem_mb} / {threads}) \n"
     "export mem_per_job \n"
     "cat {input.intervals} | parallel --plus -j {threads} '{input.GATK} --java-options -Xmx${{mem_per_job}}m ApplyBQSR "
     "-R {input.reference} "
     "-L {{}} "
     "-I {input.marked_dup_bam} "
     "--bqsr-recal-file {input.recal_data} "
     "-O {params.recal_bam}.{{/}}.bam &> {log.out_ApplyBQSR}.{{/}}.out' \n"

     "touch {output.decoy}"
181
182
183
184
shell:
     "{input.GATK} GatherBamFiles "
     "{params.bams} "
     "-O {output.final_recal_bam} &> {log.out_GatherRecalBamFiles} \n"
211
212
shell:
     "{input.samtools} sort -@ {threads} -o {output.recal_sorted} {input.final_recal_bam} &> {log.out_SortBam}"
236
237
shell:
     "{input.sambamba} index -p -t {threads} {input.recal_sorted} {output.recal_sorted_index} &> {log.out_IndexBam}"
131
132
133
134
shell:
     "{input.GATK} MergeVcfs {params.gvcfs} -O {output.merged_gvcf} &> {log.MergeHaplotypeCaller} \n"
     "rm -rf {params.tmp_dir}/{wildcards.patient} \n"
     "touch {output.decoy}"
165
166
167
168
169
170
171
172
173
shell:
     "mem_per_job=$(expr {resources.mem_mb} / {threads}) \n"
     "export mem_per_job \n"

     "cat {input.chromosomes} | parallel -j {threads} '{input.GATK} CombineGVCFs --java-options -Xmx${{mem_per_job}}m "
     "{params.annotation} -R {input.reference} {params.input_files} "
     "-O {params.cohort_dir}/{{}}.cohort.g.vcf.gz -L {{}} &> {log.CombineGVCFs_log}.{{}}.out' \n"

     "touch {output.decoy}"
200
201
202
203
204
205
206
207
208
209
shell:
     "mem_per_job=$(expr {resources.mem_mb} / {threads}) \n"
     "export mem_per_job \n"

     "cat {input.chromosomes} | parallel -j {threads} '{input.GATK} --java-options -Xmx${{mem_per_job}}m GenotypeGVCFs  "
     "{params.annotation} -R {input.reference} -V {params.cohort_dir}/{{}}.cohort.g.vcf.gz "
     "-O {params.cohort_dir}/{{}}.cohort.vcf.gz &> {log.GenotypeGVCFs_log}.{{}}.out' \n"

     "rm {params.cohort_dir}/*g.vcf* \n"
     "touch {output.decoy}"
239
240
241
242
shell:
     "{input.GATK} MergeVcfs {params.cohort_gvcfs} -O {output.merged_cohort_vcf} &> {log.MergeCohortVCFs_log} \n"
     "{input.GATK} MakeSitesOnlyVcf -I {output.merged_cohort_vcf} -O {output.sites_only_vcf} &> {log.MakeSitesOnlyVcf_log} \n"
     "rm {params.cohort_dir}/{{1..22}}.coh* {params.cohort_dir}/X.coh* {params.cohort_dir}/Y.coh*"
31
32
33
34
shell:
     "{input.vep} --cache --dir_cache {params.vep_cache} --offline --fasta {input.reference} --vcf --buffer_size 55000 "
     "--pick --fork {threads} --sift b --variant_class --regulatory --check_existing --af_1kg --compress_output bgzip "
     "-i {input.pass_only_vcf} -o {output.vep_vcf} &> {log.out_VEP} \n"
SnakeMake From line 31 of rules/VEP.smk
38
39
40
41
42
43
44
45
46
47
48
shell:
     "{input.GATK} VariantRecalibrator -V {input.sites_only_vcf} --trust-all-polymorphic "
     "{params.truth_sensitivity_tranches} -an FS -an ReadPosRankSum -an MQRankSum -an QD -an SOR -an DP "
     "-mode INDEL "
     "--max-gaussians 4 "
     "-AS "
     "--resource:mills,known=false,training=true,truth=true,prior=12 {input.mills_indels} "
     "--resource:axiomPoly,known=false,training=true,truth=false,prior=10.0 {input.axiom_poly} "
     "--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 {input.dbSNP} "
     "-O {output.cohort_indel_recal} "
     "--tranches-file {output.tranche_files} &> {log.VQSR_indel} \n"
SnakeMake From line 38 of rules/VQSR.smk
81
82
83
84
85
86
87
88
89
90
shell:
     "{input.GATK} VariantRecalibrator -V {input.sites_only_vcf} --trust-all-polymorphic "
     "{params.truth_sensitivity_tranches} -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an SOR -an DP "
     "-mode SNP "
     "--max-gaussians 6 "
     "-AS "
     "--resource:hapmap,known=false,training=true,truth=true,prior=15.0 {input.hapmap} "
     "--resource:omni,known=false,training=true,truth=false,prior=12.0 {input.omni} "
     "--resource:1000G,known=false,training=true,truth=false,prior=10.0 {input.thousandG_phase1_snps_high_confidence} "
     "--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 {input.dbSNP} "
SnakeMake From line 81 of rules/VQSR.smk
120
121
122
123
124
125
126
shell:
     "{input.GATK} ApplyVQSR -V {input.merged_cohort_vcf} --recal-file {input.cohort_indel_recal} "
     "--tranches-file {input.tranche_files} --truth-sensitivity-filter-level 97 "
     "--create-output-variant-index true -mode INDEL "
     "-AS "
     "-O {output.indel_recalibrated_cohort} &> {log.ApplyVQSR_indel} \n"
     "rm {cohort_dir}/cohort.sites.only.vcf.gz.tbi"
SnakeMake From line 120 of rules/VQSR.smk
150
151
152
153
154
155
shell:
     "{input.GATK} ApplyVQSR -V {input.indel_recalibrated_cohort} --recal-file {input.cohort_snp_recal} "
     "--tranches-file {input.tranche_files} --truth-sensitivity-filter-level 97 "
     "--create-output-variant-index true -mode SNP "
     "-AS "
     "-O {output.recalibrated_cohort} &> {log.ApplyVQSR_snp}"
SnakeMake From line 150 of rules/VQSR.smk
178
179
180
181
182
shell:
     "{input.bcftools} view -f PASS {input.filtered_vcf} -o {output.pass_only_vcf} -O z &> {log.err_SelectVariants} \n"
     "rm -rf {params.cohort_dir}/filtration \n"
     "rm {params.cohort_dir}/cohort.indel.recalibrated.vcf.gz.tbi \n"
     "rm {params.cohort_dir}/cohort.vcf.gz.tbi \n"
SnakeMake From line 178 of rules/VQSR.smk
ShowHide 15 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/nickhir/GermlineMutationCalling
Name: germlinemutationcalling
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...