A Snakemake workflow for whole-exome sequencing analysis and VCF generation


Note

This code/pipeline was written by me in 2018-2019 while working at the RayaGen startup. It automates whole-exome sequencing data analysis from raw reads (FASTQ) to variant call format (VCF); the resulting VCF can then be passed to a variant annotation tool of your choice. The pipeline uses the Snakemake workflow management system to run the analysis.

The Snakemake workflow management system is a tool for creating reproducible and scalable data analyses. Workflows are described in a human-readable, Python-based language and can be seamlessly scaled to server, cluster, grid and cloud environments without modifying the workflow definition. Snakemake workflows can also include a description of the required software, which is automatically deployed to any execution environment. For full documentation and installation instructions, please refer to the original website.
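
As a rough illustration of what Snakemake's `glob_wildcards` (used at the top of the Snakefile below) does, here is a plain-Python emulation; the file names and the `_R1` suffix are hypothetical, and the real call matches whatever pattern you give it:

```python
import re

# Hypothetical raw files, as a glob over raw/ might return them.
files = ["raw/sampleA_R1.fastq.gz", "raw/sampleB_R1.fastq.gz"]

# glob_wildcards("raw/{sample}_R1.fastq.gz") captures the {sample} part
# of every matching file name, much like this regular expression does.
pattern = re.compile(r"raw/(?P<sample>.+)_R1\.fastq\.gz")
SAMPLES = [m.group("sample") for f in files if (m := pattern.fullmatch(f))]
print(SAMPLES)  # ['sampleA', 'sampleB']
```

In the real Snakefile the resulting SAMPLES list then drives every expand() call.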

In this workflow, several paths and settings are defined at the top of the Snakefile:

SAMPLES, = glob_wildcards("raw/{sample}_R1.fastq.gz") # raw FASTQ files are expected under "<BASE_DIR>/raw"
READS=["R1", "R2"]
BASE_DIR = "/path/to/wes"
ADAPTORS = BASE_DIR + "/software/trimmomatic-0.38/adapters/TruSeq3-PE.fa"
REF_GENOME = "/path/to/wes/ref/human_g1k_v37_decoy.fasta"
BWA_INDEX = BASE_DIR + "/ref"
KNOWEN_SITE_SNP = BASE_DIR + "/ref/dbsnp_138.b37.vcf"
KNOWEN_SITE_INDEL = BASE_DIR + "/ref/Mills_and_1000G_gold_standard.indels.b37.vcf"
GATK3 = "/path/to/wes/software/GenomeAnalysisTK.jar"
PICARD = "/path/to/wes/software/picard.jar"
rule all:
 input: expand("VCF/{sample}_filtered_final.vcf",sample=SAMPLES),
 #expand("VCF/{sample}_filtered_INDEL.vcf",sample=SAMPLES),
 #expand("VCF/{sample}_filtered_SNP.vcf",sample=SAMPLES),
 #expand("VCF/{sample}_raw_INDEL.vcf",sample=SAMPLES),
 #expand("VCF/{sample}_raw_SNP.vcf",sample=SAMPLES),
 #expand("VCF/{sample}_raw.vcf",sample=SAMPLES),
 #expand("BQSR/recal_reads_{sample}.bam",sample=SAMPLES),
 #"BQSR/recal_data.table",
 #expand("alignment/marked_duplicates_{sample}.bam",sample=SAMPLES),
 #expand("alignment/sorted_{sample}.bam",sample=SAMPLES),
 # expand("alignment/{sample}.bam",sample=SAMPLES),
 #expand("interleaved_fastq/{sample}_interleaved.fastq.gz", sample=SAMPLES),
 expand("multi_qc/{sample}.multiqc.html", sample=SAMPLES),
 #expand("seqs_trimmed/{sample}_trimmed_paired_{read}.fastq.gz", sample=SAMPLES, read=READS),
 expand("qc_trimmed/{sample}_trimmed_paired_{read}_fastqc.html", sample=SAMPLES, read=READS),
 expand("qc_trimmed/{sample}_trimmed_paired_{read}_fastqc.zip", sample=SAMPLES, read=READS),
 #expand("seqs_trimmed/{sample}_trimmed_paired_R1.fastq.gz", sample=SAMPLES),
 #expand("seqs_trimmed/{sample}_trimmed_unpaired_R1.fastq.gz", sample=SAMPLES),
 #expand("seqs_trimmed/{sample}_trimmed_paired_R2.fastq.gz", sample=SAMPLES),
 #expand("seqs_trimmed/{sample}_trimmed_unpaired_R2.fastq.gz", sample=SAMPLES),
 expand("qc_raw/{sample}_{read}_fastqc.html", sample=SAMPLES, read=READS),
 expand("qc_raw/{sample}_{read}_fastqc.zip", sample=SAMPLES, read=READS),
rule fastqc_raw:
 input: expand("raw/{sample}_{read}.fastq.gz", sample=SAMPLES, read=READS)
 output: expand("qc_raw/{sample}_{read}_fastqc.html", sample=SAMPLES, read=READS),
 expand("qc_raw/{sample}_{read}_fastqc.zip", sample=SAMPLES, read=READS)
 log: expand("logs/fastqc/{sample}_{read}_raw.fastqc.err", sample=SAMPLES, read=READS)
 priority:1
 message: "Executing Quality check (FASTQC) on the following files {input}."
 shell: """
 mkdir -p logs/fastqc
 mkdir -p qc_raw
 fastqc --outdir qc_raw --threads 50 --nogroup {input} 2>>{log}
 """
rule trimming:
 input:
 fwd = "raw/{sample}_R1.fastq.gz",
 rev = "raw/{sample}_R2.fastq.gz",
 output:
 fwd_paired = "seqs_trimmed/{sample}_trimmed_paired_R1.fastq.gz",
 fwd_unpaired = "seqs_trimmed/{sample}_trimmed_unpaired_R1.fastq.gz",
 rev_paired = "seqs_trimmed/{sample}_trimmed_paired_R2.fastq.gz",
 rev_unpaired = "seqs_trimmed/{sample}_trimmed_unpaired_R2.fastq.gz"
 params:
 TRIMMOMATIC_JAR = BASE_DIR + "/software/trimmomatic-0.38/trimmomatic-0.38.jar",
 PAIR = "PE SE".split(),
 threads=50,
 TRIM_OPTS = "-phred33 ILLUMINACLIP:" + ADAPTORS + ":2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36"
 log: "logs/trimmomatic/{sample}.paired.trimomatic.log"
 priority:2
 message: "Executing Trimming (trimmomatic) on the following files {input.fwd} and {input.rev}."
 shell: """
 mkdir -p logs/trimmomatic
 java -jar {params.TRIMMOMATIC_JAR} \
 {params.PAIR[0]} -threads {params.threads} {input.fwd} {input.rev} \
 {output.fwd_paired} {output.fwd_unpaired} \
 {output.rev_paired} {output.rev_unpaired} \
 {params.TRIM_OPTS} 2>{log}
 """
rule fastqc_trimmed:
 input: expand("seqs_trimmed/{sample}_trimmed_paired_{read}.fastq.gz", sample=SAMPLES, read=READS)
 output: expand("qc_trimmed/{sample}_trimmed_paired_{read}_fastqc.html", sample=SAMPLES, read=READS),
 expand("qc_trimmed/{sample}_trimmed_paired_{read}_fastqc.zip", sample=SAMPLES, read=READS)
 log: expand("logs/fastqc/{sample}_{read}_trimmed.fastqc.err", sample=SAMPLES, read=READS)
 priority:3
 message: "Executing Quality Checks (FASTQC) on the following files {input}."
 shell: """
 mkdir -p qc_trimmed
 fastqc --outdir qc_trimmed --threads 80 --nogroup {input} \
 >> qc_trimmed/fastqc.log 2>>{log}
 """
rule multiqc:
 input: raw_fwd= expand("qc_raw/{sample}_R1_fastqc.zip", sample= SAMPLES ),
 raw_rev= expand("qc_raw/{sample}_R2_fastqc.zip", sample= SAMPLES ),
 trimed_fwd= expand("qc_trimmed/{sample}_trimmed_paired_R1_fastqc.zip", sample= SAMPLES ),
 trimed_rev= expand("qc_trimmed/{sample}_trimmed_paired_R2_fastqc.zip", sample= SAMPLES ),
 output: expand("multi_qc/{sample}.multiqc.html", sample=SAMPLES)
 log: expand("logs/multi_qc/{sample}.log", sample=SAMPLES)
 priority:4
 shell:"""
 mkdir -p multi_qc
 multiqc {input.raw_fwd} {input.raw_rev} {input.trimed_fwd} {input.trimed_rev} 2>{log}
 mv multiqc_report.html ./multi_qc
 mv multiqc_data multi_qc
 """
rule seqtk:
 input: fwd= expand("seqs_trimmed/{sample}_trimmed_paired_R1.fastq.gz", sample= SAMPLES ),
 rev= expand("seqs_trimmed/{sample}_trimmed_paired_R2.fastq.gz", sample= SAMPLES ),
 output: expand("interleaved_fastq/{sample}_interleaved.fastq.gz", sample=SAMPLES),
 log: expand("logs/seqtk/{sample}.log", sample=SAMPLES),
 priority:5
 shell:"""
 mkdir -p interleaved_fastq
 seqtk mergepe {input.fwd} {input.rev} 2>{log} | gzip > {output}
 """
rule bwa_map:
 input:
 ref=REF_GENOME,
 read=expand("interleaved_fastq/{sample}_interleaved.fastq.gz", sample=SAMPLES),
 output: bam = expand("alignment/{sample}.bam",sample=SAMPLES),
 #sorted = expand("alignment/sorted_{sample}.bam",sample=SAMPLES),
 #bai = expand("alignment/sorted_{sample}.bam.bai",sample=SAMPLES),
 log:expand("logs/alignment/{sample}.log", sample=SAMPLES),
 params:
 rg = "'@RG\\tID:C6C0TANXX_2\\tSM:ZW177\\tLB:ZW177lib\\tPL:ILLUMINA'" # note: read-group fields are hard-coded for one sample/lane
 priority:6
 shell:
 r"""
 mkdir -p alignment
 mkdir -p logs/alignment
 bwa mem -M -t 50 -p {input.ref} {input.read} -R {params.rg} \
 | samtools view -Sb - > {output.bam} 2>>{log}
 """
rule bam_sort_index:
 input:expand("alignment/{sample}.bam",sample=SAMPLES),
 output:sorted = expand("alignment/sorted_{sample}.bam",sample=SAMPLES),
 bai = expand("alignment/sorted_{sample}.bam.bai",sample=SAMPLES),
 shell:
 r"""
 samtools sort -o {output.sorted} {input}
 samtools index {output.sorted}
 """
rule mark_dup:
 input: expand("alignment/sorted_{sample}.bam",sample=SAMPLES),
 output: bam = expand("alignment/marked_duplicates_{sample}.bam",sample=SAMPLES),
 metrics ="alignment/marked_dup_metrics.txt",
 bai =expand("alignment/marked_duplicates_{sample}.bam.bai",sample=SAMPLES),
 log:expand("logs/mark_dup/{sample}.log", sample=SAMPLES),
 params:
 PICARD_JAR = PICARD,
 threads: 8
 priority:7
 shell:
 r"""
 mkdir -p logs/mark_dup
 java -jar -Xmx64G {params.PICARD_JAR} MarkDuplicates I={input} O={output.bam} M={output.metrics} TMP_DIR=/tmp
 samtools index {output.bam}
 """
rule BQSR_1:
 input: bam = expand("alignment/marked_duplicates_{sample}.bam",sample=SAMPLES),
 ref= REF_GENOME,
 known_site_snp= KNOWEN_SITE_SNP,
 known_site_indel = KNOWEN_SITE_INDEL,
 output:"BQSR/recal_data.table",
 log:expand("logs/BQSR/{sample}.log", sample=SAMPLES),
 params:
 GATK3_JAR = GATK3,
 priority:8
 shell:
 r"""
 mkdir -p logs/BQSR
 mkdir -p BQSR
 java -jar -Xmx64G {params.GATK3_JAR} -nct 32 -T BaseRecalibrator \
 -R {input.ref} -I {input.bam} \
 -knownSites {input.known_site_snp} \
 -knownSites {input.known_site_indel} \
 -o {output} 2>{log}
 """
rule BQSR_2:
 input: bam = expand("alignment/marked_duplicates_{sample}.bam",sample=SAMPLES),
 ref= REF_GENOME,
 recal_table = "BQSR/recal_data.table",
 output:expand("BQSR/recal_reads_{sample}.bam",sample=SAMPLES),
 log:expand("logs/BQSR/{sample}_2.log", sample=SAMPLES),
 params:
 GATK3_JAR = GATK3,
 priority:9
 shell:
 r"""
 java -jar -Xmx64G {params.GATK3_JAR} -nct 64 -T PrintReads \
 -R {input.ref} \
 -I {input.bam} \
 -BQSR {input.recal_table} \
 -o {output} 2>{log}
 """
rule Variant_calling:
 input: bam = expand("BQSR/recal_reads_{sample}.bam",sample=SAMPLES),
 ref= REF_GENOME,
 known_site_snp= KNOWEN_SITE_SNP,
 recal_table = "BQSR/recal_data.table",
 known_site_indel = KNOWEN_SITE_INDEL
 output:expand("VCF/{sample}_raw.vcf",sample=SAMPLES),
 log:expand("logs/VCF/{sample}_raw_vcf.log", sample=SAMPLES),
 params:
 GATK3_JAR = GATK3,
 priority:10
 shell:
 r"""
 mkdir -p VCF
 mkdir -p logs/VCF
 java -jar -Xmx80G {params.GATK3_JAR} -nct 64 -T HaplotypeCaller \
 -R {input.ref} \
 -I {input.bam} \
 -stand_call_conf 30 \
 --genotyping_mode DISCOVERY \
 -o {output} 2>{log}
 """
rule Select_SNP:
 input: vcf = expand("VCF/{sample}_raw.vcf",sample=SAMPLES),
 ref= REF_GENOME,
 output:expand("VCF/{sample}_raw_SNP.vcf",sample=SAMPLES),
 log:expand("logs/VCF/{sample}_raw_SNP_vcf.log", sample=SAMPLES),
 params:
 GATK3_JAR = GATK3,
 threads: 8
 priority:11
 shell:
 r"""
 java -jar -Xmx80G {params.GATK3_JAR} -T SelectVariants \
 -R {input.ref} \
 -V {input.vcf} \
 -selectType SNP \
 -o {output} 2>{log}
 """
rule Select_INDEL:
 input: vcf = expand("VCF/{sample}_raw.vcf",sample=SAMPLES),
 ref= REF_GENOME,
 output:expand("VCF/{sample}_raw_INDEL.vcf",sample=SAMPLES),
 log:expand("logs/VCF/{sample}_raw_INDEL_vcf.log", sample=SAMPLES),
 params:
 GATK3_JAR = GATK3,
 threads: 8
 priority:12
 shell:
 r"""
 java -jar -Xmx80G {params.GATK3_JAR} -T SelectVariants \
 -R {input.ref} \
 -V {input.vcf} \
 -selectType INDEL \
 -o {output} 2>{log}
 """
rule SNP_FILTER:
 input: vcf = expand("VCF/{sample}_raw_SNP.vcf",sample=SAMPLES),
 ref= REF_GENOME,
 output:expand("VCF/{sample}_filtered_SNP.vcf",sample=SAMPLES),
 log:expand("logs/VCF/{sample}_filtered_SNP_vcf.log", sample=SAMPLES),
 params:
 GATK3_JAR = GATK3,
 threads: 8
 priority:13
 shell:
 r"""
 java -jar -Xmx80G {params.GATK3_JAR} -T VariantFiltration \
 -R {input.ref} \
 -V {input.vcf} \
 --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
 --filterName "Low_Qual" \
 -o {output} 2>{log}
 """
rule INDEL_FILTER:
 input: vcf = expand("VCF/{sample}_raw_INDEL.vcf",sample=SAMPLES),
 ref= REF_GENOME,
 output:expand("VCF/{sample}_filtered_INDEL.vcf",sample=SAMPLES),
 log:expand("logs/VCF/{sample}_filtered_INDEL_vcf.log", sample=SAMPLES),
 params:
 GATK3_JAR = GATK3,
 threads: 8
 priority:13
 shell:
 r"""
 java -jar -Xmx80G {params.GATK3_JAR} -T VariantFiltration \
 -R {input.ref} \
 -V {input.vcf} \
 --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \
 --filterName "Low_Qual" \
 -o {output} 2>{log}
 """
rule combine_variants:
 input:snp = expand("VCF/{sample}_filtered_SNP.vcf",sample=SAMPLES),
 indel = expand("VCF/{sample}_filtered_INDEL.vcf",sample=SAMPLES),
 ref= REF_GENOME,
 output:expand("VCF/{sample}_filtered_final.vcf",sample=SAMPLES),
 log:expand("logs/VCF/{sample}_filtered_final_vcf.log", sample=SAMPLES),
 params:
 GATK3_JAR = GATK3,
 priority:13
 shell:
 r"""
 java -jar -Xmx80G {params.GATK3_JAR} -T CombineVariants \
 -R {input.ref} \
 -V {input.snp} \
 -V {input.indel} \
 --genotypemergeoption UNSORTED \
 -o {output} 2>{log}
 """
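
The GATK hard-filter expressions used in SNP_FILTER and INDEL_FILTER can be illustrated with a small plain-Python sketch. This is a rough emulation for a single record, not GATK's JEXL evaluator; the defaults for missing annotations are chosen so that a missing field does not trigger the filter, mirroring GATK's default handling of missing annotations:

```python
# Rough emulation of the SNP hard filter
# "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0".
# `info` is a dict of VCF INFO annotations; a record failing any condition
# would be tagged with the filter name "Low_Qual", otherwise it passes.
def snp_filter_status(info):
    fails = (
        info.get("QD", float("inf")) < 2.0
        or info.get("FS", 0.0) > 60.0
        or info.get("MQ", float("inf")) < 40.0
        or info.get("MQRankSum", 0.0) < -12.5
        or info.get("ReadPosRankSum", 0.0) < -8.0
    )
    return "Low_Qual" if fails else "PASS"

print(snp_filter_status({"QD": 25.0, "FS": 1.2, "MQ": 60.0}))  # PASS
print(snp_filter_status({"QD": 1.5, "FS": 1.2, "MQ": 60.0}))   # Low_Qual
```

The indel filter works the same way with its own thresholds (FS > 200.0, ReadPosRankSum < -20.0).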

A conda environment YAML file is included in this repository; it can be used to recreate the same Python environment, named wes, that I had configured.
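
A minimal sketch of recreating the environment with conda (the file name wes.yml is an assumption; use the actual YAML file name in the repository):

```shell
# Build the "wes" environment from the repository's YAML file
# (the file name below is a placeholder for the one shipped in the repo).
conda env create -f wes.yml
conda activate wes
```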

Pipeline steps

This pipeline consists of the following steps:

1. Quality control of raw reads (FastQC)
2. Adapter and quality trimming (Trimmomatic)
3. Quality control of trimmed reads (FastQC) and aggregated reporting (MultiQC)
4. Interleaving of paired reads (seqtk mergepe)
5. Alignment to the reference genome (BWA-MEM), followed by sorting and indexing (samtools)
6. Duplicate marking (Picard MarkDuplicates)
7. Base quality score recalibration (GATK3 BaseRecalibrator and PrintReads)
8. Variant calling (GATK3 HaplotypeCaller)
9. Selection and hard filtering of SNPs and indels (SelectVariants and VariantFiltration), merged into a final VCF (CombineVariants)

Final note: the repository contains a Snakefile. Its content is duplicated in Snakefile.sh for inspection purposes; to run the pipeline, however, the Snakefile itself should be downloaded and placed in your wes directory.
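
With the Snakefile in place and the environment activated, a typical invocation looks like this (the core count is illustrative; start with a dry run to inspect the planned jobs):

```shell
cd /path/to/wes
snakemake -n             # dry run: list the jobs that would be executed
snakemake --cores 50 -p  # run the pipeline, printing each shell command
```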


URL: https://github.com/hamidghaedi/pipeline_whole_exome_sequencing_analysis
Name: pipeline_whole_exome_sequencing_analysis
Version: 1
License: None
Copyright: Public Domain