A Snakemake workflow to process single samples or cohorts of Illumina paired-end sequencing data (WGS or WES) using Trim Galore, BWA, GATK4 and Parabricks.


A Snakemake workflow to process single samples (unrelated individuals) or cohort samples (related individuals) of paired-end sequencing data (WGS or WES) using BWA and GATK4. Quality control checks are also undertaken. The fastq files can be optionally trimmed with Trim Galore, and the pipeline can run on NVIDIA GPUs where NVIDIA Clara Parabricks software is available, significantly reducing analysis times. This workflow is designed to follow the GATK best practice workflow for germline short variant discovery (SNPs + indels). It is designed to be followed by vcf_annotation_pipeline, with the data then ingested into scout for clinical interpretation. However, the pipeline also stands on its own, taking the data from fastq to vcf (raw sequencing data to called variants). It has been developed with human genetic data in mind, but it was designed to be species agnostic: genetic data from other species can be analysed by setting a species-specific reference genome and variant databases in the configuration file (although not all situations have been tested).
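
The workflow is driven by Snakemake and a configuration file. As a minimal sketch of a typical invocation (the repository layout and config file path shown here are assumptions; the docs describe the supported run scripts):

    # Clone the workflow and preview the jobs that would run (dry run)
    git clone https://github.com/ESR-NZ/human_genomics_pipeline
    cd human_genomics_pipeline
    snakemake -n --use-conda --configfile config/config.yaml

    # Run the pipeline, e.g. on 8 CPU cores
    snakemake --cores 8 --use-conda --configfile config/config.yaml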

Pipeline summary - single samples

  1. Raw read QC (FastQC and MultiQC)

  2. Adapter trimming (Trim Galore) (optional)

  3. Alignment against reference genome (Burrows-Wheeler Aligner)

  4. Mark duplicates (GATK MarkDuplicates)

  5. Base recalibration (GATK BaseRecalibrator and GATK ApplyBQSR)

  6. Haplotype calling (GATK HaplotypeCaller)
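
The single-sample CPU path above corresponds roughly to the following commands (a minimal sketch only; file and resource names such as ref.fasta and dbsnp.vcf.gz are placeholders, and the workflow's actual rules, including extra options, are shown under Code Snippets below):

    # 3. Align reads, add a read group and coordinate-sort
    #    (the workflow pipes bwa mem into GATK SortSam)
    bwa mem -t 8 -R "@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA" ref.fasta sample1_R1.fastq.gz sample1_R2.fastq.gz \
        | gatk SortSam -I /dev/stdin -SO coordinate -O sample1_sorted.bam

    # 4. Mark duplicate reads
    gatk MarkDuplicates -I sample1_sorted.bam -O sample1_mkdup.bam -M sample1_duplicate_metrics.txt

    # 5. Model and apply base quality score recalibration
    gatk BaseRecalibrator -I sample1_mkdup.bam -R ref.fasta --known-sites dbsnp.vcf.gz -O sample1_recal.table
    gatk ApplyBQSR -I sample1_mkdup.bam -R ref.fasta --bqsr-recal-file sample1_recal.table -O sample1_recalibrated.bam

    # 6. Call SNPs and indels for the sample
    gatk HaplotypeCaller -I sample1_recalibrated.bam -R ref.fasta -D dbsnp.vcf.gz -O sample1_raw_snps_indels.vcf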

Pipeline summary - single samples - GPU accelerated

  1. Raw read QC (FastQC and MultiQC)

  2. Adapter trimming (Trim Galore) (optional)

  3. Alignment against reference genome, mark duplicates, base recalibration and haplotype calling (Parabricks germline pipeline)
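
On a GPU host, steps 3-6 collapse into a single Parabricks call. A minimal sketch (file names are placeholders; --knownSites as the recalibration resource flag is an assumption, and the exact options used by the workflow are in the pbrun germline snippet under Code Snippets below):

    # Alignment, duplicate marking, BQSR and haplotype calling in one GPU-accelerated step
    pbrun germline --ref ref.fasta \
        --in-fq sample1_R1.fastq.gz sample1_R2.fastq.gz \
        --knownSites dbsnp.vcf.gz \
        --out-bam sample1_recalibrated.bam \
        --out-variants sample1_raw_snps_indels.vcf \
        --out-recal-file sample1_recal.txt \
        --num-gpus 1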

Pipeline summary - cohort samples

  1. Raw read QC (FastQC and MultiQC)

  2. Adapter trimming (Trim Galore) (optional)

  3. Alignment against reference genome (Burrows-Wheeler Aligner)

  4. Mark duplicates (GATK MarkDuplicates)

  5. Base recalibration (GATK BaseRecalibrator and GATK ApplyBQSR)

  6. Haplotype calling (GATK HaplotypeCaller)

  7. Combine GVCFs into a multi-sample GVCF (GATK CombineGVCFs)

  8. Genotyping (GATK GenotypeGVCFs)
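
For cohorts, the per-sample GVCFs are then merged and jointly genotyped. A minimal sketch of steps 7-8 (sample and file names are placeholders; the workflow's actual rules are under Code Snippets below):

    # 7. Merge per-sample GVCFs into one multi-sample GVCF
    gatk CombineGVCFs -R ref.fasta \
        -V sample1.g.vcf.gz -V sample2.g.vcf.gz -V sample3.g.vcf.gz \
        -O proband1_combined.g.vcf.gz

    # 8. Joint genotyping of the combined GVCF
    gatk GenotypeGVCFs -R ref.fasta -V proband1_combined.g.vcf.gz -O proband1_raw_snps_indels.vcf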

Pipeline summary - cohort samples - GPU accelerated

  1. Raw read QC (FastQC and MultiQC)

  2. Adapter trimming (Trim Galore) (optional)

  3. Alignment against reference genome, mark duplicates, base recalibration and haplotype calling (Parabricks germline pipeline)

  4. Combine GVCFs into a multi-sample GVCF (Parabricks triocombinegvcf)

  5. Genotyping (GATK GenotypeGVCFs)
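
On a GPU host the GVCFs are combined with Parabricks instead, while joint genotyping still runs with GATK. A minimal sketch (file names are placeholders; the --in-gvcf flag is an assumption, as the workflow builds that part of the command from a parameter):

    # 4. Combine per-sample GVCFs with Parabricks
    pbrun triocombinegvcf --ref ref.fasta \
        --in-gvcf sample1.g.vcf --in-gvcf sample2.g.vcf --in-gvcf sample3.g.vcf \
        --out-variants proband1_combined.g.vcf

    # 5. Joint genotyping on the CPU, as in the non-GPU cohort path
    gatk GenotypeGVCFs -R ref.fasta -V proband1_combined.g.vcf -O proband1_raw_snps_indels.vcf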

Main output files

Single samples:

  • results/qc/multiqc_report.html

  • results/mapped/sample1_recalibrated.bam

  • results/called/sample1_raw_snps_indels.vcf

Cohort samples:

  • results/qc/multiqc_report.html

  • results/mapped/sample1_recalibrated.bam

  • results/mapped/sample2_recalibrated.bam

  • results/mapped/sample3_recalibrated.bam

  • results/called/proband1_raw_snps_indels.vcf
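
A couple of quick sanity checks on these outputs (samtools and bcftools are not part of the pipeline; any equivalent tools will do):

    # Confirm the recalibrated BAM is intact and summarise the called variants
    samtools flagstat results/mapped/sample1_recalibrated.bam
    bcftools stats results/called/proband1_raw_snps_indels.vcf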

Prerequisites

Test human_genomics_pipeline

The provided test dataset can be used to test running this pipeline on a new machine, or test pipeline developments/releases.
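
For example, a dry run against the test configuration is a quick way to confirm the environment before a full run (the config file path is an assumption; point --configfile at whichever configuration ships with the test dataset):

    # Preview the jobs the test run would execute without running anything
    snakemake -n --use-conda --configfile config/config.yaml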

Run human_genomics_pipeline

See the docs for a walkthrough guide for running human_genomics_pipeline.

Contribute back!

Contributions and feedback are always welcome! :blush:

Code Snippets

Align reads, add a read group and coordinate-sort (bwa mem piped into GATK SortSam):

shell:
    "bwa mem -R {params.readgroup} -t {threads} -K 10000000 {input.refgenome} {input.fastq} | gatk SortSam --java-options {params.maxmemory} {params.sortsam} -O={output} --TMP_DIR={params.tdir} &> {log}"

Raw read QC with FastQC:

shell:
    "fastqc {input} -o ../results/qc/fastqc/ -t {threads} &> {log}"

Apply base quality score recalibration:

shell:
    "gatk ApplyBQSR --java-options {params.maxmemory} -I {input.bam} -bqsr {input.recal} -R {input.refgenome} -O {output} {params.padding} {params.intervals}"

Build the base recalibration model:

shell:
    "gatk BaseRecalibrator --java-options {params.maxmemory} -I {input.bams} -R {input.refgenome} -O {output} --tmp-dir {params.tdir} {params.padding} {params.intervals} {params.recalibration_resources} &> {log}"

Combine per-sample GVCFs (cohorts):

shell:
    "gatk CombineGVCFs -R {input.refgenome} {params.command} -O {output.vcf} --tmp-dir {params.tdir} {params.other} &> {log}"

Joint genotyping of the combined GVCF (cohorts):

shell:
    "gatk GenotypeGVCFs --java-options {params.maxmemory} -R {input.refgenome} -V {input.gvcf} -O {output} {params.padding} {params.intervals} {params.other} &> {log}"

Call variants with HaplotypeCaller (the workflow has two variants of this rule):

shell:
    "gatk HaplotypeCaller --java-options {params.maxmemory} -I {input.bams} -R {input.refgenome} -D {input.dbsnp} -O {output.vcf} --tmp-dir {params.tdir} {params.padding} {params.intervals} {params.other} &> {log}"

shell:
    "gatk HaplotypeCaller --java-options {params.maxmemory} -I {input.bams} -R {input.refgenome} -D {input.dbsnp} -O {output} --tmp-dir {params.tdir} {params.padding} {params.intervals} &> {log}"

Mark duplicate reads:

shell:
    "gatk MarkDuplicates --java-options {params.maxmemory} -I {input} -O {output.bam} -M {output.metrics} --TMP_DIR {params.tdir} &> {log}"

Aggregate QC reports with MultiQC:

shell:
    "multiqc {input} -o ../results/qc/ &> {log}"

Aggregate QC and trimming reports with MultiQC (when trimming is enabled):

shell:
    "multiqc {input.fastqc} {input.trimming1} {input.trimming2} -o ../results/qc/ &> {log}"

Run the Parabricks germline pipeline (GPU):

shell:
    "pbrun germline --ref {input.refgenome} --in-fq {input.fastq} {params.recalibration_resources} --out-bam {output.bam} --out-variants {output.vcf} --out-recal-file {output.recal} --num-gpus {resources.gpu} {params.readgroup} --tmp-dir {params.tdir} {params.padding} {params.intervals} {params.other_params} --num-cpu-threads {threads} &> {log}"

Combine per-sample GVCFs with Parabricks (GPU, cohorts):

shell:
    "pbrun triocombinegvcf --ref {input.refgenome} {params.command} --out-variants {output} &> {log}"

Trim adapters with Trim Galore:

shell:
    "trim_galore {input} -o ../results/trimmed/ {params.adapters} {params.other} -j {threads} &> {log}"

Name: human_genomics_pipeline
Version: v2.1.0
URL: https://github.com/ESR-NZ/human_genomics_pipeline
Copyright: Public Domain
License: None
