Variant-calling-pipeline for identifying strain specific variants

public public 1yr ago 0 bookmarks

A reproducible snakemake workflow for identifying strain specific variants in Danio Rerio (Zebrafish). Variants called would be uploaded to European Variation Archive, https://www.ebi.ac.uk/ena/browser/home . This workflow requires two Snakefiles and a

Code Snippets

 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
set -o nounset
set -o errexit
set -o pipefail

if [[ $# -ne 3 ]] ; then
  echo "Incorrect number of arguments"
  exit 1
fi

output_fastq1=$1
output_fastq2=$2
units=$3

# Iterate over each sequencing run (i.e. unit)
count=0
for unit in $units; do
  ((count+=1))
  urls=`echo "$unit" | awk -F":" '{ print $1 }'`
  md5s=`echo "$unit" | awk -F":" '{ print $2 }'`
  url1=`echo "$urls" | awk -F";" '{ print $1 }'`
  url2=`echo "$urls" | awk -F";" '{ print $2 }'`
  md51=`echo "$md5s" | awk -F";" '{ print $1 }'`
  md52=`echo "$md5s" | awk -F";" '{ print $2 }'`

  # Get first FASTQ file and check checksum
  wget -q --timeout 60 --tries 10   -O "${output_fastq1}_tmp_$count.gz" "$url1"
  md5=`md5sum "${output_fastq1}_tmp_$count.gz" | awk '{ print $1 }'`
  if [ "$md51" != "$md5" ]; then
    echo "Checksum wrong for '$url1': '$md5' not '$md51'"
    exit 1
  fi

  gunzip "${output_fastq1}_tmp_$count.gz"


  # Get second FASTQ file and check checksum
  wget -q --timeout 60 --tries 10  -O "${output_fastq2}_tmp_$count.gz" "$url2"
  md5=`md5sum "${output_fastq2}_tmp_$count.gz" | awk '{ print $1 }'`
  if [ "$md52" != "$md5" ]; then
    echo "Checksum wrong for '$url2': '$md5' not '$md52'"
    exit 1
  fi

  gunzip "${output_fastq2}_tmp_$count.gz"

done

# Merge unzipped FASTQ files
cat `ls ${output_fastq1}_tmp_* | sort` > $output_fastq1
cat `ls ${output_fastq2}_tmp_* | sort` > $output_fastq2
rm ${output_fastq1}_tmp_* ${output_fastq2}_tmp_*
39
40
shell:
    "scripts/download_merge_fastq.sh {output.fq1} {output.fq2} '{params.fqs}' &> {log}"
SnakeMake From line 39 of main/Snakefile
53
54
55
56
57
shell:
    "(mkdir {wildcards.sample}/chunks; "
    "split --suffix-length=3 --additional-suffix=.fastq --numeric-suffixes --lines={params.chunksize} {input.fq1} {wildcards.sample}/chunks/1.; "
    "split --suffix-length=3 --additional-suffix=.fastq --numeric-suffixes --lines={params.chunksize} {input.fq2} {wildcards.sample}/chunks/2.) "
    "&> {log}"
SnakeMake From line 53 of main/Snakefile
73
74
shell:
   "bwa aln  {params.genome} {input} > {output}  "
91
92
shell:
    "bwa aln  {params.genome} {input} > {output} "
111
112
113
shell:
    "(bwa sampe {params[0]} {input.sai1} {input.sai2} {input.fq1} {input.fq2} "
    "| samtools view -bT {params.genome} -o {output} -)  "
127
128
129
130
shell:
  "samtools merge -o {output} {input} &&  "
  "rm -rf {wildcards.sample}/chunks  &&  "
  "rm -f {wildcards.sample}/1.fastq {wildcards.sample}/2.fastq "
143
144
shell:
  "samtools sort -m {resources.mem}G -n {input} -O BAM -o {output} "
157
158
shell:
  "samtools fixmate {input} - | bammarkduplicates O={output}"
172
173
shell:
  "samtools sort -m {resources.mem}G --threads {resources.cpus} -o {output} {input} "
187
188
189
190
shell:
  "  gatk  AddOrReplaceReadGroups -I {input} -O {output.addrg} -LB {wildcards.sample} "
  " -PL {wildcards.sample} -PU {wildcards.sample} -SM {wildcards.sample} --VALIDATION_STRINGENCY SILENT ; "
  "samtools index {wildcards.sample}/{wildcards.sample}_addreadgroups.bam  "
210
211
212
213
shell:
  "bcftools mpileup --gvcf 10,20  -f {input.ref} {input.bam} |  "
  "bcftools call --threads {resources.cpus}   -m  -Oz -o {output.vcf} &> {log} ;  "
  "tabix {wildcards.sample}/{wildcards.sample}_bcftools.g.vcf.gz  "
231
232
233
234
235
236
237
shell:
  "strelka-wrapper configureStrelkaGermlineWorkflow.py --bam={input.bam} "
  "--referenceFasta={input.ref} --runDir={wildcards.sample}  --exome ; "
  "strelka-wrapper {wildcards.sample}/runWorkflow.py -m local --quiet  -j {resources.cpus} -g {resources.mem} ; "
  "less {wildcards.sample}/results/variants/genome.S1.vcf.gz > {output.vcf} ; " 
  "rm -r -f {wildcards.sample}/workspace {wildcards.sample}/results ; "
  "rm {wildcards.sample}/workflow*  {wildcards.sample}/runWorkflow* "
SnakeMake From line 231 of main/Snakefile
256
257
258
shell:
  " gatk HaplotypeCaller -R {input.ref} -I {input.bam} -O {output} "
  " -L {input.intervals} -VS SILENT -ERC GVCF --QUIET --create-output-variant-index false "
299
300
301
302
303
304
305
shell:
  "gatk MergeVcfs -I {input.chr1} -I {input.chr2} -I {input.chr3} -I {input.chr4} -I {input.chr5} -I {input.chr6} "
  "-I {input.chr7} -I {input.chr8} -I {input.chr9} -I {input.chr10} -I {input.chr11} -I {input.chr12} -I {input.chr13} " 
  "-I {input.chr14} -I {input.chr15} -I {input.chr16} -I {input.chr17} -I {input.chr18} -I {input.chr19} -I {input.chr20} "
  "-I {input.chr21} -I {input.chr22} -I {input.chr23} -I {input.chr24} -I {input.chr25} -I {input.other} "
  "-O {output.vcf} --VALIDATION_STRINGENCY SILENT --QUIET &> {log} ;"
  "rm {wildcards.sample}/gatkchr_* "
ShowHide 11 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/AmanahLW/Variant-calling-pipeline
Name: variant-calling-pipeline
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...