Clone of the Snakemake workflow for variant calling with the GATK pipeline

public public 7mo ago Version: 3 0 bookmarks

This Snakemake pipeline implements the GATK best-practices workflow for calling small genomic variants.

Authors

  • Johannes Köster (https://koesterlab.github.io)

Usage

In any case, if you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).

Step 1: Obtain a copy of this workflow

  1. Create a new github repository using this workflow as a template .

  2. Clone the newly created repository to your local system, into the place where you want to perform the data analysis.

Step 2: Configure workflow

Configure the workflow according to your needs via editing the file config.yaml .

Step 3: Execute workflow

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores $N

using $N cores or run it in a cluster environment via

snakemake --use-conda --cluster qsub --jobs 100

or

snakemake --use-conda --drmaa --jobs 100

If you not only want to fix the software stack but also the underlying OS, use

snakemake --use-conda --use-singularity

in combination with any of the modes above. See the Snakemake documentation for further details.

Step 4: Investigate results

After successful execution, you can create a self-contained interactive HTML report with all results via:

snakemake --report report.html

This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here .

Step 5: Commit changes

Whenever you change something, don't forget to commit the changes back to your github copy of the repository:

git commit -a
git push

Step 6: Contribute back

In case you have also changed or added steps, please consider contributing them back to the original repository:

  1. Fork the original repo to a personal or lab account.

  2. Clone the fork to your local system, to a different place than where you ran your analysis.

  3. Copy the modified files from your analysis to the clone of your fork, e.g., cp -r envs rules scripts path/to/fork . Make sure to not accidentally copy config file contents or sample sheets.

  4. Commit and push your changes to your fork.

  5. Create a pull request against the original repository.

Testing

Test cases are in the subfolder .test . They are automtically executed via continuous integration with Github actions.

Code Snippets

12
13
wrapper:
    "0.27.1/bio/snpeff"
 9
10
shell:
    "bedextract {wildcards.contig} {input} > {output}"
25
26
wrapper:
    "0.27.1/bio/gatk/haplotypecaller"
37
38
wrapper:
    "0.27.1/bio/gatk/combinegvcfs"
51
52
wrapper:
    "0.27.1/bio/gatk/genotypegvcfs"
63
64
wrapper:
    "0.40.2/bio/picard/mergevcfs"
16
17
wrapper:
    "0.27.1/bio/gatk/selectvariants"
36
37
wrapper:
    "0.27.1/bio/gatk/variantfiltration"
49
50
wrapper:
    "0.27.1/bio/gatk/variantrecalibrator"
64
65
wrapper:
    "0.27.1/bio/picard/mergevcfs"
11
12
wrapper:
    "0.30.0/bio/trimmomatic/se"
29
30
wrapper:
    "0.30.0/bio/trimmomatic/pe"
46
47
wrapper:
    "0.27.1/bio/bwa/mem"
60
61
wrapper:
    "0.26.1/bio/picard/markduplicates"
76
77
wrapper:
    "0.27.1/bio/gatk/baserecalibrator"
85
86
wrapper:
    "0.27.1/bio/samtools/index"
7
8
wrapper:
    "0.27.1/bio/fastqc"
18
19
wrapper:
    "0.27.1/bio/samtools/stats"
SnakeMake From line 18 of rules/qc.smk
33
34
wrapper:
    "0.27.1/bio/multiqc"
 8
 9
10
11
shell:
    "bcftools view --apply-filters PASS --output-type u {input} | "
    "rbt vcf-to-txt -g --fmt DP AD --info ANN | "
    "gzip > {output}"
22
23
script:
    "../scripts/plot-depths.py"
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
import common
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

calls = pd.read_table(snakemake.input[0], header=[0, 1])
samples = [name for name in calls.columns.levels[0] if name != "VARIANT"]
sample_info = calls.loc[:, samples].stack([0, 1]).unstack().reset_index(1, drop=False)
sample_info = sample_info.rename({"level_1": "sample"}, axis=1)

sample_info = sample_info[sample_info["DP"] > 0]
sample_info["freq"] = sample_info["AD"] / sample_info["DP"]
sample_info.index = np.arange(sample_info.shape[0])

plt.figure()

sns.stripplot(x="sample", y="freq", data=sample_info, jitter=True)
plt.ylabel("allele frequency")
plt.xticks(rotation="vertical")

plt.savefig(snakemake.output.freqs)

plt.figure()

sns.stripplot(x="sample", y="DP", data=sample_info, jitter=True)
plt.ylabel("read depth")
plt.xticks(rotation="vertical")

plt.savefig(snakemake.output.depths)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell("picard MarkDuplicates {snakemake.params} INPUT={snakemake.input} "
      "OUTPUT={snakemake.output.bam} METRICS_FILE={snakemake.output.metrics} "
      "&> {snakemake.log}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
__author__ = "Johannes Köster, Julian de Ruiter"
__copyright__ = "Copyright 2016, Johannes Köster and Julian de Ruiter"
__email__ = "[email protected], [email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


# Extract arguments.
extra = snakemake.params.get("extra", "")

sort = snakemake.params.get("sort", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {1, 2}:
    raise ValueError("input must have 1 (single-end) or "
                     "2 (paired-end) elements")

if sort_order not in {"coordinate", "queryname"}:
    raise ValueError("Unexpected value for sort_order ({})".format(sort_order))

# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":

    # Simply convert to bam using samtools view.
    pipe_cmd = "samtools view -Sbh -o {snakemake.output[0]} -"

elif sort == "samtools":

    # Sort alignments using samtools sort.
    pipe_cmd = "samtools sort {sort_extra} -o {snakemake.output[0]} -"

    # Add name flag if needed.
    if sort_order == "queryname":
        sort_extra += " -n"

    prefix = path.splitext(snakemake.output[0])[0]
    sort_extra += " -T " + prefix + ".tmp"

elif sort == "picard":

    # Sort alignments using picard SortSam.
    pipe_cmd = ("picard SortSam {sort_extra} INPUT=/dev/stdin"
                " OUTPUT={snakemake.output[0]} SORT_ORDER={sort_order}")

else:
    raise ValueError("Unexpected value for params.sort ({})".format(sort))

shell(
    "(bwa mem"
    " -t {snakemake.threads}"
    " {extra}"
    " {snakemake.params.index}"
    " {snakemake.input.reads}"
    " | " + pipe_cmd + ") {log}")
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)

    split_ind = 2 if base.endswith(".gz") else 1
    base = ".".join(base.split(".")[:-split_ind])

    return base


# Run fastqc, since there can be race conditions if multiple jobs 
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell("fastqc {snakemake.params} --quiet "
          "--outdir {tempdir} {snakemake.input[0]}"
          " {log}")

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path} {snakemake.output.html}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path} {snakemake.output.zip}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from tempfile import TemporaryDirectory
import os

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")

with TemporaryDirectory() as tmpdir:
    recal_table = os.path.join(tmpdir, "recal_table.grp")
    log = snakemake.log_fmt_shell(stdout=True, stderr=True)
    shell("gatk BaseRecalibrator {extra} "
          "-R {snakemake.input.ref} -I {snakemake.input.bam} "
          "-O {recal_table} --known-sites {snakemake.input.known} {log}")

    log = snakemake.log_fmt_shell(stdout=True, stderr=True, append=True)
    shell("gatk ApplyBQSR -R {snakemake.input.ref} -I {snakemake.input.bam} "
          "--bqsr-recal-file {recal_table} "
          "-O {snakemake.output.bam} {log}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os

from snakemake.shell import shell


extra = snakemake.params.get("extra", "")
gvcfs = list(map("-V {}".format, snakemake.input.gvcfs))

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell("gatk CombineGVCFs {extra} "
      "{gvcfs} "
      "-R {snakemake.input.ref} "
      "-O {snakemake.output.gvcf} {log}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os

from snakemake.shell import shell


extra = snakemake.params.get("extra", "")

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell("gatk GenotypeGVCFs {extra} "
      "-V {snakemake.input.gvcf} "
      "-R {snakemake.input.ref} "
      "-O {snakemake.output.vcf} {log}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os

from snakemake.shell import shell

known = snakemake.input.get("known", "")
if known:
    known = "--dbsnp " + known

extra = snakemake.params.get("extra", "")
bams = snakemake.input.bam
if isinstance(bams, str):
    bams = [bams]
bams = list(map("-I {}".format, bams))

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell("gatk HaplotypeCaller {extra} "
      "-R {snakemake.input.ref} {bams} "
      "-ERC GVCF "
      "-O {snakemake.output.gvcf} {known} {log}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell("gatk SelectVariants -R {snakemake.input.ref} -V {snakemake.input.vcf} "
      "{extra} -O {snakemake.output.vcf} {log}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
filters = ["--filter-name {} --filter-expression '{}'".format(name, expr.replace("'", "\\'"))
           for name, expr in snakemake.params.filters.items()]

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell("gatk VariantFiltration -R {snakemake.input.ref} -V {snakemake.input.vcf} "
      "{extra} {filters} -O {snakemake.output.vcf} {log}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os

from snakemake.shell import shell


extra = snakemake.params.get("extra", "")

def fmt_res(resname, resparams):
    fmt_bool = lambda b: str(b).lower()
    try:
        f = snakemake.input.get(resname)
    except KeyError:
        raise RuntimeError("There must be a named input file for every resource (missing: {})".format(resname))
    return "{},known={},training={},truth={},prior={}:{}".format(
           resname, fmt_bool(resparams["known"]), fmt_bool(resparams["training"]),
           fmt_bool(resparams["truth"]), resparams["prior"], f)

resources = ["--resource {}".format(fmt_res(resname, resparams))
             for resname, resparams in snakemake.params["resources"].items()]
annotation = list(map("-an {}".format, snakemake.params.annotation))
tranches = ""
if snakemake.output.tranches:
    tranches = "--tranches-file " + snakemake.output.tranches

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell("gatk VariantRecalibrator {extra} {resources} "
      "-R {snakemake.input.ref} -V {snakemake.input.vcf} "
      "-mode {snakemake.params.mode} "
      "--output {snakemake.output.vcf} "
      "{tranches} {annotation} {log}")
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}")
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


inputs = " ".join("INPUT={}".format(f) for f in snakemake.input)
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params.get("extra", "")

shell(
    "picard"
    " MergeVcfs"
    " {extra}"
    " {inputs}"
    " OUTPUT={snakemake.output[0]}"
    " {log}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]}")
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


extra = snakemake.params.get("extra", "")
region = snakemake.params.get("region", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)


shell("samtools stats {extra} {snakemake.input}"
      " {region} > {snakemake.output} {log}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
__author__ = "Bradford Powell"
__copyright__ = "Copyright 2018, Bradford Powell"
__email__ = "[email protected]"
__license__ = "BSD"


from snakemake.shell import shell
from os import path
import shutil
import tempfile

shell.executable("bash")
shell_command =("(snpEff {data_dir} {stats_opt} {csvstats_opt} {extra}"
                " {snakemake.params.reference} {snakemake.input}"
                " > {snakemake.output.vcf}) {log}")

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

extra = snakemake.params.get("extra", "")

data_dir = snakemake.params.get("data_dir", "")
if data_dir:
    data_dir = '-dataDir "%s"'%data_dir

stats = snakemake.output.get("stats", "")
csvstats = snakemake.output.get("csvstats", "")
csvstats_opt = '' if not csvstats else '-csvStats {}'.format(csvstats)
stats_opt = '-noStats' if not stats  else '-stats {}'.format(stats)

shell(shell_command)
    #if stats:
    #    shutil.copy(path.join(stats_tempdir, 'stats'), stats)
    #if genes:
    #    shutil.copy(path.join(stats_tempdir, 'stats.genes.txt'), genes)
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
__author__ = "Johannes Köster, Jorge Langa"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


def compose_input_gz(filename):
    if filename.endswith(".gz"):
        filename = "<(pigz --decompress --stdout {filename})".format(
            filename=filename
        )
    return filename


def compose_output_gz(filename, compression_level="-5"):
    if filename.endswith(".gz"):
        return ">(pigz {compression_level} > {filename})".format(
            compression_level=compression_level,
            filename=filename
        )
    return filename


extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
compression_level = snakemake.params.get("compression_level", "-5")
trimmer = " ".join(snakemake.params.trimmer)


# Collect files
input_r1 = compose_input_gz(snakemake.input.r1)
input_r2 = compose_input_gz(snakemake.input.r2)

output_r1 = compose_output_gz(snakemake.output.r1, compression_level)
output_r1_unp = compose_output_gz(snakemake.output.r1_unpaired, compression_level)
output_r2 = compose_output_gz(snakemake.output.r2, compression_level)
output_r2_unp = compose_output_gz(snakemake.output.r2_unpaired, compression_level)

shell(
    "trimmomatic PE {extra} "
    "{input_r1} {input_r2} "
    "{output_r1} {output_r1_unp} "
    "{output_r2} {output_r2_unp} "
    "{trimmer} "
    "{log}"
)
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
__author__ = "Johannes Köster, Jorge Langa"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


def compose_input_gz(filename):
    if filename.endswith(".gz"):
        filename = "<(pigz --decompress --stdout {filename})".format(
            filename=filename
        )
    return filename


def compose_output_gz(filename, compression_level="-5"):
    if filename.endswith(".gz"):
        return ">(pigz {compression_level} > {filename})".format(
            compression_level=compression_level,
            filename=filename
        )
    return filename


extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
compression_level = snakemake.params.get("compression_level", "-5")
trimmer = " ".join(snakemake.params.trimmer)

# Collect files
input = compose_input_gz(snakemake.input[0])
output = compose_output_gz(snakemake.output[0], compression_level)

shell("trimmomatic SE {extra} {input} {output} {trimmer} {log}")
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


inputs = " ".join("INPUT={}".format(f) for f in snakemake.input.vcfs)
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params.get("extra", "")

shell(
    "picard"
    " MergeVcfs"
    " {extra}"
    " {inputs}"
    " OUTPUT={snakemake.output[0]}"
    " {log}"
)
ShowHide 32 more snippets with no or duplicated tags.

Free

Created: 7mo ago
Updated: 7mo ago
Maitainers: public
URL: https://github.com/mptrsen/gatk-variant-calling
Name: gatk-variant-calling
Version: 3
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Other Versions:
Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...