A Snakemake workflow for whole-exome sequencing analysis and VCF generation
Note
This code/pipeline was written by me in 2018-2019 while working at the RayaGen startup. Its main application is automating whole-exome sequencing data analysis from raw reads (FASTQ) to variant call format (VCF). The resulting VCF can then be passed to a variant annotation tool of your choice. The pipeline uses the snakemake workflow management system to run the analysis.
The Snakemake workflow management system is a tool to create reproducible and scalable data analyses. Workflows are described via a human-readable, Python-based language. They can be seamlessly scaled to server, cluster, grid, and cloud environments without the need to modify the workflow definition. Finally, Snakemake workflows can entail a description of required software, which will be automatically deployed to any execution environment. For full documentation and installation instructions, please refer to the original website.
In this workflow, several paths and file locations are defined at the top of the snakefile:
SAMPLES, = glob_wildcards("raw/{sample}_1.fastq.gz") # your file should be in a folder in this way: "/wes/raw"
READS=["1", "2"]
BASE_DIR = "/path/to/wes"
ADAPTORS = BASE_DIR + "/software/trimmomatic-0.38/adapters/TruSeq3-PE.fa"
REF_GENOME = "/path/to/wes/ref/human_g1k_v37_decoy.fasta"
BWA_INDEX = BASE_DIR + "/ref"
KNOWEN_SITE_SNP = BASE_DIR + "/ref/dbsnp_138.b37.vcf"
KNOWEN_SITE_INDEL = BASE_DIR + "/ref/Mills_and_1000G_gold_standard.indels.b37.vcf"
GATK3 = "/path/to/wes/software/GenomeAnalysisTK.jar"
PICARD = "/path/to/wes/software/picard.jar"
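The `glob_wildcards` call above derives `SAMPLES` from the forward-read filenames in `raw/`. As a rough sketch of what it does, the hypothetical helper below (not part of the workflow) extracts sample names with a regular expression; it sorts the result for determinism, which `glob_wildcards` itself does not guarantee:

```python
import re
from pathlib import Path

def discover_samples(raw_dir):
    """Mimic SAMPLES, = glob_wildcards("raw/{sample}_1.fastq.gz"):
    take the part of each forward-read filename before _1.fastq.gz."""
    pattern = re.compile(r"^(?P<sample>.+)_1\.fastq\.gz$")
    samples = []
    for path in sorted(Path(raw_dir).iterdir()):
        match = pattern.match(path.name)
        if match:  # reverse reads (_2.fastq.gz) simply do not match
            samples.append(match.group("sample"))
    return samples
```

With `raw/` containing `patientA_1.fastq.gz`, `patientA_2.fastq.gz`, and `patientB_1.fastq.gz`, this would yield `["patientA", "patientB"]`.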
rule all:
input: expand("VCF/{sample}_filtered_final.vcf",sample=SAMPLES),
#expand("VCF/{sample}_filtered_INDEL.vcf",sample=SAMPLES),
#expand("VCF/{sample}_filtered_SNP.vcf",sample=SAMPLES),
#expand("VCF/{sample}_raw_INDEL.vcf",sample=SAMPLES),
#expand("VCF/{sample}_raw_SNP.vcf",sample=SAMPLES),
#expand("VCF/{sample}_raw.vcf",sample=SAMPLES),
#expand("BQSR/recal_reads_{sample}.bam",sample=SAMPLES),
#"BQSR/recal_data.table",
#expand("alignment/marked_duplicates_{sample}.bam",sample=SAMPLES),
#expand("alignment/sorted_{sample}.bam",sample=SAMPLES),
# expand("alignment/{sample}.bam",sample=SAMPLES),
#expand("interleaved_fastq/{sample}_interleaved.fastq.gz", sample=SAMPLES),
expand("multi_qc/{sample}.multiqc.html", sample=SAMPLES),
#expand("seqs_trimmed/{sample}_trimmed_paired_{read}.fastq.gz", sample=SAMPLES, read=READS),
expand("qc_trimmed/{sample}_trimmed_paired_{read}_fastqc.html", sample=SAMPLES, read=READS),
expand("qc_trimmed/{sample}_trimmed_paired_{read}_fastqc.zip", sample=SAMPLES, read=READS),
#expand("seqs_trimmed/{sample}_trimmed_paired_R1.fastq.gz", sample=SAMPLES),
#expand("seqs_trimmed/{sample}_trimmed_unpaired_R1.fastq.gz", sample=SAMPLES),
#expand("seqs_trimmed/{sample}_trimmed_paired_R2.fastq.gz", sample=SAMPLES),
#expand("seqs_trimmed/{sample}_trimmed_unpaired_R2.fastq.gz", sample=SAMPLES),
expand("qc_raw/{sample}_{read}_fastqc.html", sample=SAMPLES, read=READS),
expand("qc_raw/{sample}_{read}_fastqc.zip", sample=SAMPLES, read=READS),
rule fastqc_raw:
input: expand("raw/{sample}_{read}.fastq.gz", sample=SAMPLES, read=READS)
output: expand("qc_raw/{sample}_{read}_fastqc.html", sample=SAMPLES, read=READS),
expand("qc_raw/{sample}_{read}_fastqc.zip", sample=SAMPLES, read=READS)
log: expand("logs/fastqc/{sample}_{read}_raw.fastqc.err", sample=SAMPLES, read=READS)
priority:1
message: "Executing Quality check (FASTQC) on the following files {input}."
shell: """
mkdir -p logs/fastqc
mkdir -p qc_raw
fastqc --outdir qc_raw --threads 50 --nogroup {input} 2>>{log}
"""
rule trimming:
input:
fwd = "raw/{sample}_R1.fastq.gz",
rev = "raw/{sample}_R2.fastq.gz",
output:
fwd_paired = "seqs_trimmed/{sample}_trimmed_paired_R1.fastq.gz",
fwd_unpaired = "seqs_trimmed/{sample}_trimmed_unpaired_R1.fastq.gz",
rev_paired = "seqs_trimmed/{sample}_trimmed_paired_R2.fastq.gz",
rev_unpaired = "seqs_trimmed/{sample}_trimmed_unpaired_R2.fastq.gz"
params:
TRIMMOMATIC_JAR = "~/RayaGene/wes/software/trimmomatic-0.38/trimmomatic-0.38.jar",
PAIR = "PE SE".split(),
threads=50,
TRIM_OPTS = "-phred33 ILLUMINACLIP:" + ADAPTORS + ":2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36"
log: "logs/trimmomatic/{sample}.paired.trimomatic.log"
priority:2
message: "Executing Trimming (trimmomatic) on the following files {input.fwd} and {input.rev}."
shell: """
mkdir -p logs/trimmomatic
java -jar {params.TRIMMOMATIC_JAR} \
{params.PAIR[0]} -threads {params.threads} {input.fwd} {input.rev} \
{output.fwd_paired} {output.fwd_unpaired} \
{output.rev_paired} {output.rev_unpaired} \
{params.TRIM_OPTS} 2>{log}
"""
rule fastqc_trimmed:
input: expand("seqs_trimmed/{sample}_trimmed_paired_{read}.fastq.gz", sample=SAMPLES, read=READS)
output: expand("qc_trimmed/{sample}_trimmed_paired_{read}_fastqc.html", sample=SAMPLES, read=READS),
expand("qc_trimmed/{sample}_trimmed_paired_{read}_fastqc.zip", sample=SAMPLES, read=READS)
log: expand("logs/fastqc/{sample}_{read}_trimmed.fastqc.err", sample=SAMPLES, read=READS)
priority:3
message: "Executing Quality Checks (FASTQC) on the following files {input}."
shell: """
fastqc --outdir qc_trimmed --threads 80 --nogroup {input} \
>> qc_trimmed/fastqc.log 2>>{log}
"""
rule multiqc:
input: raw_fwd= expand("qc_raw/{sample}_R1_fastqc.zip", sample= SAMPLES ),
raw_rev= expand("qc_raw/{sample}_R2_fastqc.zip", sample= SAMPLES ),
trimed_fwd= expand("qc_trimmed/{sample}_trimmed_paired_R1_fastqc.zip", sample= SAMPLES ),
trimed_rev= expand("qc_trimmed/{sample}_trimmed_paired_R2_fastqc.zip", sample= SAMPLES ),
output: expand("multi_qc/{sample}.multiqc.html", sample=SAMPLES)
log: expand("logs/multi_qc/{sample}.log", sample=SAMPLES)
priority:4
shell:"""
mkdir -p multi_qc
multiqc {input.raw_fwd} {input.raw_rev} {input.trimed_fwd} {input.trimed_rev} > {output} 2>{log}
mv multiqc_report.html ./multi_qc
mv multiqc_data multi_qc
"""
rule seqtk:
input: fwd= expand("seqs_trimmed/{sample}_trimmed_paired_R1.fastq.gz", sample= SAMPLES ),
rev= expand("seqs_trimmed/{sample}_trimmed_paired_R2.fastq.gz", sample= SAMPLES ),
output: expand("interleaved_fastq/{sample}_interleaved.fastq.gz", sample=SAMPLES),
log: expand("logs/seqtk/{sample}.log", sample=SAMPLES),
priority:5
shell:"""
mkdir -p interleaved_fastq
seqtk mergepe {input.fwd} {input.rev} > {output} 2>{log}
"""
rule bwa_map:
input:
ref=REF_GENOME,
read=expand("interleaved_fastq/{sample}_interleaved.fastq.gz", sample=SAMPLES),
output: bam = expand("alignment/{sample}.bam",sample=SAMPLES),
#sorted = expand("alignment/sorted_{sample}.bam",sample=SAMPLES),
#bai = expand("alignment/sorted_{sample}.bam.bai",sample=SAMPLES),
log:expand("logs/alignment/{sample}.log", sample=SAMPLES),
params:
rg = "'@RG\\tID:C6C0TANXX_2\\tSM:ZW177\\tLB:ZW177lib\\tPL:ILLUMINA'"
priority:6
shell:
r"""
mkdir -p alignment
mkdir -p logs/alignment
bwa mem -M -t 50 -p {input.ref} {input.read} -R {params.rg} \
| samtools view -Sb - > {output.bam} 2>{log}
"""
rule bam_sort_index:
input:expand("alignment/{sample}.bam",sample=SAMPLES),
output:sorted = expand("alignment/sorted_{sample}.bam",sample=SAMPLES),
bai = expand("alignment/sorted_{sample}.bam.bai",sample=SAMPLES),
shell:
r"""
samtools sort -o {output.sorted} {input}
samtools index {output.sorted}
"""
rule mark_dup:
input: expand("alignment/sorted_{sample}.bam",sample=SAMPLES),
output: bam = expand("alignment/marked_duplicates_{sample}.bam",sample=SAMPLES),
metrics ="alignment/marked_dup_metrics.txt",
bai =expand("alignment/marked_duplicates_{sample}.bam.bai",sample=SAMPLES),
log:expand("logs/mark_dup/{sample}.log", sample=SAMPLES),
params:
PICARD_JAR = PICARD,
threads: 8
priority:7
shell:
r"""
mkdir -p logs/mark_dup
java -jar -Xmx64G {params.PICARD_JAR} MarkDuplicates I={input} O={output.bam} M={output.metrics} TMP_DIR=/tmp
samtools index {output.bam}
"""
rule BQSR_1:
input: bam = expand("alignment/marked_duplicates_{sample}.bam",sample=SAMPLES),
ref= REF_GENOME,
known_site_snp= KNOWEN_SITE_SNP,
known_site_indel = KNOWEN_SITE_INDEL,
output:"BQSR/recal_data.table",
log:expand("logs/BQSR/{sample}.log", sample=SAMPLES),
params:
GATK3_JAR = GATK3,
priority:8
shell:
r"""
mkdir -p logs/BQSR
java -jar -Xmx64G {params.GATK3_JAR} -nct 32 -T BaseRecalibrator \
-R {input.ref} -I {input.bam} \
-knownSites {input.known_site_snp} \
-knownSites {input.known_site_indel} \
-o {output}
"""
rule BQSR_2:
input: bam = expand("alignment/marked_duplicates_{sample}.bam",sample=SAMPLES),
ref= REF_GENOME,
recal_table = "BQSR/recal_data.table",
output:expand("BQSR/recal_reads_{sample}.bam",sample=SAMPLES),
log:expand("logs/BQSR/{sample}_2.log", sample=SAMPLES),
params:
GATK3_JAR = GATK3,
priority:9
shell:
r"""
java -jar -Xmx64G {params.GATK3_JAR} -nct 64 -T PrintReads \
-R {input.ref} \
-I {input.bam} \
-BQSR {input.recal_table} \
-o {output} 2>{log}
"""
rule Variant_calling:
input: bam = expand("BQSR/recal_reads_{sample}.bam",sample=SAMPLES),
ref= REF_GENOME,
known_site_snp= KNOWEN_SITE_SNP,
recal_table = "BQSR/recal_data.table",
known_site_indel = KNOWEN_SITE_INDEL
output:expand("VCF/{sample}_raw.vcf",sample=SAMPLES),
log:expand("logs/VCF/{sample}_raw_vcf.log", sample=SAMPLES),
params:
GATK3_JAR = GATK3,
priority:10
shell:
r"""
java -jar -Xmx80G {params.GATK3_JAR} -nct 64 -T HaplotypeCaller \
-R {input.ref} \
-I {input.bam} \
-stand_call_conf 30 \
--genotyping_mode DISCOVERY \
-o {output} 2>{log}
"""
rule Select_SNP:
input: vcf = expand("VCF/{sample}_raw.vcf",sample=SAMPLES),
ref= REF_GENOME,
output:expand("VCF/{sample}_raw_SNP.vcf",sample=SAMPLES),
log:expand("logs/VCF/{sample}_raw_SNP_vcf.log", sample=SAMPLES),
params:
GATK3_JAR = GATK3,
threads: 8
priority:11
shell:
r"""
java -jar -Xmx80G {params.GATK3_JAR} -T SelectVariants \
-R {input.ref} \
-V {input.vcf} \
-selectType SNP \
-o {output} 2>{log}
"""
rule Select_INDEL:
input: vcf = expand("VCF/{sample}_raw.vcf",sample=SAMPLES),
ref= REF_GENOME,
output:expand("VCF/{sample}_raw_INDEL.vcf",sample=SAMPLES),
log:expand("logs/VCF/{sample}_raw_INDEL_vcf.log", sample=SAMPLES),
params:
GATK3_JAR = GATK3,
threads: 8
priority:12
shell:
r"""
java -jar -Xmx80G {params.GATK3_JAR} -T SelectVariants \
-R {input.ref} \
-V {input.vcf} \
-selectType INDEL \
-o {output} 2>{log}
"""
rule SNP_FILTER:
input: vcf = expand("VCF/{sample}_raw_SNP.vcf",sample=SAMPLES),
ref= REF_GENOME,
output:expand("VCF/{sample}_filtered_SNP.vcf",sample=SAMPLES),
log:expand("logs/VCF/{sample}_filtered_SNP_vcf.log", sample=SAMPLES),
params:
GATK3_JAR = GATK3,
threads: 8
priority:13
shell:
r"""
java -jar -Xmx80G {params.GATK3_JAR} -T VariantFiltration \
-R {input.ref} \
-V {input.vcf} \
--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filterName "Low_Qual" \
-o {output} 2>{log}
"""
rule INDEL_FILTER:
input: vcf = expand("VCF/{sample}_raw_INDEL.vcf",sample=SAMPLES),
ref= REF_GENOME,
output:expand("VCF/{sample}_filtered_INDEL.vcf",sample=SAMPLES),
log:expand("logs/VCF/{sample}_filtered_INDEL_vcf.log", sample=SAMPLES),
params:
GATK3_JAR = GATK3,
threads: 8
priority:13
shell:
r"""
java -jar -Xmx80G {params.GATK3_JAR} -T VariantFiltration \
-R {input.ref} \
-V {input.vcf} \
--filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \
--filterName "Low_Qual" \
-o {output} 2>{log}
"""
rule combine_variants:
input:snp = expand("VCF/{sample}_filtered_SNP.vcf",sample=SAMPLES),
indel = expand("VCF/{sample}_filtered_INDEL.vcf",sample=SAMPLES),
ref= REF_GENOME,
output:expand("VCF/{sample}_filtered_final.vcf",sample=SAMPLES),
log:expand("logs/VCF/{sample}_filtered_final_vcf.log", sample=SAMPLES),
params:
GATK3_JAR = GATK3,
priority:13
shell:
r"""
java -jar -Xmx80G {params.GATK3_JAR} -T CombineVariants \
-R {input.ref} \
-V {input.snp} \
-V {input.indel} \
--genotypemergeoption UNSORTED \
-o {output} 2>{log}
"""
There is a configuration yaml file in this directory that can help you recreate the same Python virtual environment I had configured, named wes. With conda you can build an identical wes environment.
Pipeline steps
This pipeline consists of several steps:
- QC by FastQC
- read trimming by Trimmomatic
- summarizing QC parameters by MultiQC
- read mapping by BWA
- marking duplicates by Picard
- base quality score recalibration by GATK
- variant calling by GATK
Final note:
In the repository there is a Snakefile. Its content is copied into Snakefile.sh for inspection purposes; however, to run the pipeline, the Snakefile itself should be downloaded and placed in your wes directory.