Variant Calling Pipeline using GATK4

public public 1yr ago 0 bookmarks

Identifying genomic variants using next-generation sequencing data, such as single nucleotide polymorphisms (SNPs) and DNA insertions and deletions (indels), is an important aspect of scientific discovery. This pipeline was created to help researchers reliably and quickly discover and annotate sequence variants. The pipeline is based on the Broad Institute's recommended best practices for variant discovery analysis and uses the Genome Analysis Toolkit 4 (GATK4) to perform variant calling. SnpEff is employed to annotate and predict the effects of variants after SNPs have been detected. This pipeline is designed for calling variants in clonal samples, or samples from a single person. In these samples, the frequency of variations is estimated to be 1 (for haploids or homozygous diploids) or 0.5 (for heterozygous diploids).

Code Snippets

48
49
50
51
52
53
54
55
56
57
58
59
    """
    bwa mem \
	-K 100000000 \
	-v 3 \
	-t ${task.cpus} \
	-Y \
	-R \"${readGroup}\" \
	$ref \
	${reads[0]} \
	${reads[1]} \
	> ${pair_id}_aligned_reads.sam
    """
82
83
84
85
86
87
88
89
90
    """
    mkdir -p ${params.tmpdir}/${workflow.runName}/${pair_id}
    gatk --java-options "-Djava.io.tmpdir=${params.tmpdir}/${workflow.runName}/${pair_id}" \
	 MarkDuplicatesSpark \
	-I $aligned_reads \
	-M ${pair_id}_dedup_metrics.txt \
	-O ${pair_id}_sorted_dedup.bam 
    rm -r ${params.tmpdir}/${workflow.runName}/${pair_id}
    """ 
111
112
113
114
115
116
117
118
119
120
121
122
123
    """
    java -jar \$PICARD_JAR \
        CollectAlignmentSummaryMetrics \
	R=${params.ref} \
        I=${sorted_dedup_reads} \
	O=${pair_id}_alignment_metrics.txt
    java -jar \$PICARD_JAR \
        CollectInsertSizeMetrics \
        INPUT=${sorted_dedup_reads} \
	OUTPUT=${pair_id}_insert_metrics.txt \
        HISTOGRAM_FILE=${pair_id}_insert_size_histogram.pdf 
    samtools depth -a ${sorted_dedup_reads} > ${pair_id}_depth_out.txt
    """
147
148
149
150
151
152
    """
    gatk HaplotypeCaller \
	-R $ref \
	-I $input_bam \
	-O ${pair_id}_raw_variants_${round}.vcf \
    """
173
174
175
176
177
178
179
180
181
182
183
184
    """
    gatk SelectVariants \
	-R $ref \
	-V $raw_variants \
	-select-type SNP \
	-O ${pair_id}_raw_snps_${round}.vcf
    gatk SelectVariants \
        -R $ref \
        -V $raw_variants \
        -select-type INDEL \
        -O ${pair_id}_raw_indels_${round}.vcf
    """
203
204
205
206
207
208
209
210
211
212
213
214
    """
    gatk VariantFiltration \
	-R $ref \
	-V $raw_snps \
	-O ${pair_id}_filtered_snps_${round}.vcf \
	-filter-name "QD_filter" -filter "QD < 2.0" \
	-filter-name "FS_filter" -filter "FS > 60.0" \
	-filter-name "MQ_filter" -filter "MQ < 40.0" \
	-filter-name "SOR_filter" -filter "SOR > 4.0" \
	-filter-name "MQRankSum_filter" -filter "MQRankSum < -12.5" \
	-filter-name "ReadPosRankSum_filter" -filter "ReadPosRankSum < -8.0"
    """
233
234
235
236
237
238
239
240
241
    """
    gatk VariantFiltration \
        -R $ref \
        -V $raw_indels \
        -O ${pair_id}_filtered_indels_${round}.vcf \
	-filter-name "QD_filter" -filter "QD < 2.0" \
	-filter-name "FS_filter" -filter "FS > 200.0" \
	-filter-name "SOR_filter" -filter "SOR > 10.0"
    """
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
    """
    echo "New Round: " $new_round
    gatk SelectVariants \
	--exclude-filtered \
	-V $filtered_snps \
	-O ${pair_id}_bqsr_snps.vcf
    gatk SelectVariants \
        --exclude-filtered \
        -V $filtered_indels \
        -O ${pair_id}_bqsr_indels.vcf
    gatk BaseRecalibrator \
	-R $ref \
	-I $input_bam \
	--known-sites ${pair_id}_bqsr_snps.vcf \
	--known-sites ${pair_id}_bqsr_indels.vcf \
	-O ${pair_id}_recal_data.table
    gatk ApplyBQSR \
        -R $ref \
        -I $input_bam \
        -bqsr ${pair_id}_recal_data.table \
        -O ${pair_id}_recal.bam
    gatk BaseRecalibrator \
        -R $ref \
	-I ${pair_id}_recal.bam \
        --known-sites ${pair_id}_bqsr_snps.vcf \
	--known-sites ${pair_id}_bqsr_indels.vcf \
	-O ${pair_id}_post_recal_data.table
    """	
329
330
331
332
333
334
    """
    gatk AnalyzeCovariates \
	-before $recal_table \
	-after $post_recal_table \
	-plots ${pair_id}_recalibration_plots.pdf
    """
351
352
353
354
355
356
    """
    java -jar \$SNPEFF_JAR -v \
	-dataDir $params.snpeff_data \
	$params.snpeff_db \
	$filtered_snps > ${pair_id}_filtered_snps.ann.vcf
    """
NextFlow From line 351 of master/main.nf
382
383
384
"""
parse_metrics.sh ${pair_id} > ${pair_id}_report.csv
"""
NextFlow From line 382 of master/main.nf
ShowHide 8 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://gencore.bio.nyu.edu/variant-calling-pipeline-gatk4/
Name: variant-calling-pipeline-gatk4
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: GNU General Public License v3.0
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...