GBS Data Analysis Workflow: Variant Calling, Phasing, and Population Genetics

public public 1yr ago 0 bookmarks

This workflow operates on a merged bam file of sample data, uses Freebayes to create a VCF file, performs filtering with bcftools/vcftools, performs phasing and imputation with beagle, then calculates popgen stats with Plink

Usage

Step 1: Install workflow

clone this workflow to your local computer

Step 2: Configure workflow

Configure the workflow according to your needs via editing the file config.yaml

Step 3: Execute workflow

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores $N

using $N cores or run it in a cluster environment via

snakemake --use-conda --cluster qsub --jobs 100

Code Snippets

64
65
wrapper:
    "0.73.0/bio/sambamba/markdup"
SnakeMake From line 64 of main/Snakefile
75
76
wrapper:
    "0.73.0/bio/samtools/index"
SnakeMake From line 75 of main/Snakefile
93
94
wrapper:
    "0.73.0/bio/freebayes"
111
112
wrapper:
    "0.73.0/bio/freebayes"
124
125
shell:
    "bcftools stats {input} > {output}"
137
138
shell:
    "bcftools stats {input} > {output}"
152
153
shell:
    "vcftools --vcf {input} --max-missing 0.5 --mac 3 --minQ 20 --maf 0.03 --recode --recode-INFO-all --out {params} 2> {log}"
167
168
shell:
    "vcftools --vcf {input} --max-missing 0.75 --mac 3 --minQ 20 --maf 0.05 --recode --recode-INFO-all --out {params} 2> {log}"
182
183
shell:
    "vcftools --vcf {input} --max-missing 0.70 --mac 3 --minQ 20 --maf 0.05 --recode --recode-INFO-all --out {params} 2> {log}"
197
198
shell:
    "vcftools --vcf {input} --max-missing 0.25 --mac 3 --minQ 20 --maf 0.05 --recode --recode-INFO-all --out {params} 2> {log}"
212
213
shell:
    "vcftools --vcf {input} --max-missing 0.90 --mac 3 --minQ 20 --maf 0.05 --recode --recode-INFO-all --out {params} 2> {log}"
227
228
shell:
    "vcftools --vcf {input} --max-missing 0.10 --mac 3 --minQ 20 --maf 0.05 --recode --recode-INFO-all --out {params} 2> {log}"
242
243
shell:
    "vcftools --vcf {input} --max-missing 1 --mac 3 --minQ 20 --maf 0.05 --recode --recode-INFO-all --out {params} 2> {log}"
255
256
shell:
    "vcftools --vcf {input} --missing-indv --stdout > {output} 2> {log}"
268
269
shell:
    "vcftools --vcf {input} --missing-indv --stdout > {output} 2> {log}"
281
282
shell:
    "vcftools --vcf {input} --missing-indv --stdout > {output} 2> {log}"
292
293
shell:
    "mawk '$5 > 0.5' {input} | cut -f1 > {output}"
303
304
shell:
    "mawk '$5 > 0.5' {input} | cut -f1 > {output}"
314
315
shell:
    "mawk '$5 > 0.5' {input} | cut -f1 > {output}"
330
331
shell:
    "vcftools --vcf {input.vcf} --remove {input.miss} --recode --recode-INFO-all --out {params} 2> {log}"
345
346
shell:
    "vcftools --vcf {input} --remove-indels --recode --recode-INFO-all --out {params} 2> {log}"
354
355
shell:
    "grep -v super {input} |  perl -pe 's/\s\.:/\t.\/.:/g'  > {output}"
SnakeMake From line 354 of main/Snakefile
363
364
shell:
    "grep -v super {input} |  perl -pe 's/\s\.:/\t.\/.:/g'  > {output}"
SnakeMake From line 363 of main/Snakefile
379
380
shell:
    "vcftools --vcf {input.vcf} --remove {input.miss} --recode --recode-INFO-all --out {params} 2> {log}"
395
396
shell:
    "vcftools --vcf {input.vcf} --remove {input.miss} --recode --recode-INFO-all --out {params} 2> {log}"
408
409
shell:
    "grep -v super {input} |  perl -pe 's/\s\.:/\t.\/.:/g'  > {output}"
SnakeMake From line 408 of main/Snakefile
421
422
shell:
    "java -jar beagle/beagle.28Jun21.220.jar gt={input} out={params} > {log} 2>&1"
SnakeMake From line 421 of main/Snakefile
430
431
shell:
    "grep -v super {input} |  perl -pe 's/\s\.:/\t.\/.:/g'  > {output}"
SnakeMake From line 430 of main/Snakefile
443
444
shell:
    "java -jar beagle/beagle.28Jun21.220.jar gt={input} out={params} > {log} 2>&1"
SnakeMake From line 443 of main/Snakefile
458
459
shell:
    "vcftools --gzvcf {input} --remove-indels --recode --recode-INFO-all --out {params} 2> {log}"
473
474
shell:
    "java -jar beagle/beagle.28Jun21.220.jar gt={input} out={params} > {log} 2>&1"
SnakeMake From line 473 of main/Snakefile
487
488
shell:
    "vcftools --gzvcf {input} --missing-indv --stdout > {output} 2> {log}"
500
501
shell:
    "vcftools --gzvcf {input} --missing-indv --stdout > {output} 2> {log}"
511
512
shell:
    "cut -f1 {input} > {output}"
SnakeMake From line 511 of main/Snakefile
522
523
shell:
    "cut -f1 {input} > {output}"
SnakeMake From line 522 of main/Snakefile
539
540
shell:
    "vcftools --vcf {input.vcf} --remove {input.miss} --recode --recode-INFO-all --out {params} 2> {log}"
556
557
shell:
    "vcftools --vcf {input.vcf} --remove {input.miss} --recode --recode-INFO-all --out {params} 2> {log}"
570
571
shell:
    "bcftools stats {input} > {output}"
582
583
shell:
    "bcftools stats {input} > {output}"
594
595
shell:
    "bcftools stats {input} > {output}"
606
607
shell:
    "bcftools stats {input} > {output}"
620
621
shell:
    "tabix -p vcf {input} && bcftools annotate --set-id +'%CHROM\_%POS\_%REF\_%FIRST_ALT' {input} > {output} 2> {log}"
632
633
shell:
    "mkdir -p {input}"
SnakeMake From line 632 of main/Snakefile
641
642
shell:
    "mkdir -p {input}"
SnakeMake From line 641 of main/Snakefile
654
655
shell:
    "plink --vcf {input} --genome 2> {log} && mv plink.genome plink/fullset/"
668
669
shell:
    "plink --vcf {input[0]} --genome 2> {log} && mv plink.genome plink/noindels/"
682
683
shell:
    "plink --threads 5 --vcf {input} --double-id --recode --out myplink 2> {log} && mv myplink* plink/"
695
696
shell:
    "cd plink/ && plink --file {params} --recode HV --out ssr_progeny_population && mv ssr_progeny_population* fullset/"
708
709
shell:
    "cd plink/ && plink --file {params} --recode HV --snps-only just-acgt --out ssr_progeny_population_noindels  && mv ssr_progeny_population_noindels* noindels/"
722
723
shell:
    "cd plink/ && plink --file {params} --indep-pairphase 50 5 0.5 && mv plink.prune* fullset/"
735
736
shell:
    "cd plink/ && plink --file {params} --r2 --ld-window-r2 0 && mv plink.ld fullset/"
749
750
shell:
    "cd plink/ && plink --file {params} --blocks no-pheno-req && mv plink.blocks* fullset/"
763
764
shell:
    "cd plink/ && plink --file {params} --indep-pairphase 50 5 0.5 && mv plink.prune* noindels/"
776
777
shell:
    "cd plink/ && plink --file {params} --r2 --ld-window-r2 0 && mv plink.ld noindels/"
790
791
shell:
    "cd plink/ && plink --file {params} --blocks no-pheno-req && mv plink.blocks* noindels/"
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
__author__ = "Johannes Köster, Felix Mölder, Christopher Schröder"
__copyright__ = "Copyright 2017, Johannes Köster"
__email__ = "[email protected], [email protected]"
__license__ = "MIT"


from snakemake.shell import shell

shell.executable("bash")

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

params = snakemake.params.get("extra", "")
norm = snakemake.params.get("normalize", False)
assert norm in [True, False]

pipe = ""
if snakemake.output[0].endswith(".bcf"):
    if norm:
        pipe = "| bcftools norm -Ob -"
    else:
        pipe = "| bcftools view -Ob -"
elif norm:
    pipe = "| bcftools norm -"

if snakemake.threads == 1:
    freebayes = "freebayes"
else:
    chunksize = snakemake.params.get("chunksize", 100000)
    regions = (
        "<(fasta_generate_regions.py {snakemake.input.ref}.fai {chunksize})".format(
            snakemake=snakemake, chunksize=chunksize
        )
    )
    if snakemake.input.get("regions", ""):
        regions = (
            "<(bedtools intersect -a "
            r"<(sed 's/:\([0-9]*\)-\([0-9]*\)$/\t\1\t\2/' "
            "{regions}) -b {snakemake.input.regions} | "
            r"sed 's/\t\([0-9]*\)\t\([0-9]*\)$/:\1-\2/')"
        ).format(regions=regions, snakemake=snakemake)
    freebayes = ("freebayes-parallel {regions} {snakemake.threads}").format(
        snakemake=snakemake, regions=regions
    )

shell(
    "({freebayes} {params} -f {snakemake.input.ref}"
    " {snakemake.input.samples} {pipe} > {snakemake.output[0]}) {log}"
)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
__author__ = "Jan Forster"
__copyright__ = "Copyright 2021, Jan Forster"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "sambamba markdup {snakemake.params.extra} -t {snakemake.threads} "
    "{snakemake.input[0]} {snakemake.output[0]} "
    "{log}"
)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]} {log}"
)
ShowHide 49 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/Nevada-Bioinformatics-Center/snakemake_gbs_phasing_workflow
Name: snakemake_gbs_phasing_workflow
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...