Reproducible Whole Genome Sequencing (WGS) Analysis Pipeline Using Snakemake

public public 1yr ago 0 bookmarks

This is still under review and subject to change.

We're working to ensure consistent file structure for reproducing the analysis and figures in our paper.

WGS analysis

Analysis scripts to process Pool-seq data is provided as a snakemake workflow.

To run on a local machine, run the following from the project's directory:

snakemake --use-conda

We ran this pipeline on the University of Cambridge HPC server, as:

snakemake --use-conda --jobs 5 \
 --cluster "sbatch -A LEYSER-SL2-CPU -J {rulename} -o logs/slurm/{rulename}-job#%j.log \
 -p skylake -c {threads} --mem-per-cpu=5980MB -t {params.runtime}"

Code Snippets

 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
RELEASE="$1" # ENSEMBL genome release number
OUTDIR="$2"  # output directory name


#### prepare directories ####

mkdir -p "$OUTDIR"
cd "$OUTDIR"


#### Download reference genome ####

# Download genome from ENSEMBL
echo "Downloading reference genome..."
wget -O genome.fa.gz ftp://ftp.ensemblgenomes.org/pub/plants/release-${RELEASE}/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz

# Decompress
echo "Decompressing reference fasta..."
gunzip genome.fa.gz


#### Indexing ####

# samtools
echo "Indexing with samtools"
samtools faidx genome.fa

# picard
echo "Indexing with picard"
picard CreateSequenceDictionary REFERENCE=genome.fa OUTPUT=genome.dict

# bwa
echo "Indexing with bwa"
bwa index genome.fa


#### Centromere annotation ####

# Centromere region is annotated in TAIR9 release
# see https://www.biostars.org/p/18782/
wget -O temp_centromeres.txt ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR9_genome_release/TAIR9_gff3/Assembly_GFF/TAIR9_GFF3_assemblies.gff

# extract centromere annotations
printf "chrom\tstart\tend\n" > centromeres.tsv
grep "CEN" temp_centromeres.txt | cut -f 1,4,5 >> centromeres.tsv
rm temp_centromeres.txt
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
INPUT="$1"   # input file
OUTPUT="$2"  # output file
REF="$3"     # reference genome file

# Make sure that it doesn't stop if the exit code is != 0
set +e

# Run picard
picard ValidateSamFile \
  INPUT="$INPUT" \
  OUTPUT="$OUTPUT" \
  REFERENCE_SEQUENCE="$REF" \
  MODE=SUMMARY

# Issue exit code
exitcode=$?
if [ $exitcode -eq 1 ]
then
    exit 1
else
    exit 0
fi
67
68
shell:
  "Rscript --vanilla workflow/scripts/cleanPhenotypes.R"
86
87
shell:
  "bash workflow/scripts/getReference.sh {params.release} data/external/reference/ >{log} 2>&1"
102
103
104
shell:
  "Rscript --vanilla workflow/scripts/data_processing_founders/getAccessionGenotypes.R --outdir data/external/founder_genotypes "
  ">{log} 2>&1"
119
120
121
shell:
  "Rscript --vanilla workflow/scripts/data_processing_founders/getMagicMosaics.R"
  ">{log} 2>&1"
138
139
140
shell:
  "Rscript --vanilla workflow/scripts/data_processing_founders/imputeMagicGeno.R"
  ">{log} 2>&1"
162
163
164
shell:
  "cutadapt {params.adapters} {params.others} --cores {threads} -o {output.fastq1} -p {output.fastq2} {input} "
  ">{log} 2>&1"
196
197
198
199
200
201
202
203
  shell:
    "(bwa mem -M -t {threads} "
    "{params.extra} "
	  "{input.ref} {input.reads} "
    " | "
    "samtools sort -T data/intermediate/mapped/.samtools_{wildcards.name}.{wildcards.unit} "
    "-o {output} - ) "
    "2>{log}"
226
227
228
229
shell:
  "(samtools merge -@ $(({threads} - 1)) {output.bam} {input}; "
  "samtools index -@ $(({threads} - 1)) {output.bam} {output.bai}) "
  ">{log} 2>&1"
254
255
256
257
258
259
shell:
  "(picard MarkDuplicates "
  "INPUT={input} "
  "OUTPUT={output.bam} "
  "METRICS_FILE={output.metrics} "
  "REMOVE_DUPLICATES=true; "
304
305
306
307
308
309
shell:
  "gatk3 -T RealignerTargetCreator "
  "-nt {threads} -R {input.ref} "
  "-o {output} "
  "-I {input.bam} "
  ">{log} 2>&1"
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
  shell:
    "gatk3 -T IndelRealigner "
    "-R {input.ref} "
    "--baq CALCULATE_AS_NECESSARY "
    "-I {input.bam} "
    "-o {output.bam} "
    "-targetIntervals {input.intervals} "
    "-noTags "
    ">{log} 2>&1; "
    "touch {output.bai}" # this is just to ensure the .bai has newer date than .bam


rule picard_ValidateSamFile:
  input:
    "data/intermediate/realigned/{name}.bam"
351
352
shell:
  "bash workflow/scripts/ValidateSamFile.sh {input} {output} {params.ref} >{log} 2>&1"
374
375
376
377
378
379
380
381
382
383
384
385
shell:
  '''
  END=$(cat {input.ref}.fai | awk -F'\\t' '$1=={wildcards.chrom}' | cut -f2)
  harp like \
    --bam {input.bam} \
    --refseq {input.ref} \
    --region "{wildcards.chrom}:1-$END" \
    --snps {input.snps} \
    --out logs/harp_like/{wildcards.name}_chrom{wildcards.chrom} \
    --stem data/intermediate/harp_like/{wildcards.name}_chrom{wildcards.chrom} \
    >{log} 2>&1
  '''
404
405
406
407
408
409
410
411
412
413
414
415
416
417
shell:
  '''
  END=$(cat {input.ref}.fai | awk -F'\\t' '$1=={wildcards.chrom}' | cut -f2)
  echo "Starting harp"
  harp freq \
    --hlk {input.hlk} \
    --region "{wildcards.chrom}:1-$END" \
    --window_width {params.winsize} \
    --window_step {params.winstep} \
    --stem {params.stem} \
    >{log} 2>&1
  echo "Finished harp"
  rm -r {params.stem}.output
  '''
433
434
435
436
437
438
439
440
shell:
  '''
  Rscript --vanilla workflow/scripts/compilePoolseqFreqs.R \
    --indir data/intermediate/harp_freq/ \
    --outdir data/processed/poolseq/ \
    --window {wildcards.window} \
    >{log} 2>&1
  '''
493
494
495
496
shell:
  '''
  Rscript --vanilla workflow/scripts/simulations/compileSimulations.R >{log} 2>&1
  '''
ShowHide 15 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/tavareshugo/publication_Tavares2022_Nselection
Name: publication_tavares2022_nselection
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...