Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants.
Authors
- Johannes Köster (https://koesterlab.github.io)
Usage
In any case, if you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).
Step 1: Obtain a copy of this workflow
-
Create a new github repository using this workflow as a template .
-
Clone the newly created repository to your local system, into the place where you want to perform the data analysis.
Step 2: Configure workflow
Configure the workflow according to your needs via editing the files
config.yaml
,
samples.tsv
and
units.tsv
.
Step 3: Execute workflow
Test your configuration by performing a dry-run via
snakemake --use-conda -n
Execute the workflow locally via
snakemake --use-conda --cores $N
using
$N
cores or run it in a cluster environment via
snakemake --use-conda --cluster qsub --jobs 100
or
snakemake --use-conda --drmaa --jobs 100
If you not only want to fix the software stack but also the underlying OS, use
snakemake --use-conda --use-singularity
in combination with any of the modes above. See the Snakemake documentation for further details.
Step 4: Investigate results
After successful execution, you can create a self-contained interactive HTML report with all results via:
snakemake --report report.html
This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here .
Step 5: Commit changes
Whenever you change something, don't forget to commit the changes back to your github copy of the repository:
git commit -a
git push
Step 6: Obtain updates from upstream
Whenever you want to synchronize your workflow copy with new developments from upstream, do the following.
-
Once, register the upstream repository in your local copy:
git remote add -f upstream [email protected]:snakemake-workflows/dna-seq-gatk-variant-calling.git
orgit remote add -f upstream https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling.git
if you do not have setup ssh keys. -
Update the upstream version:
git fetch upstream
. -
Create a diff with the current version:
git diff HEAD upstream/master workflow > upstream-changes.diff
. -
Investigate the changes:
vim upstream-changes.diff
. -
Apply the modified diff via:
git apply upstream-changes.diff
. -
Carefully check whether you need to update the config files:
git diff HEAD upstream/master config
. If so, do it manually, and only where necessary, since you would otherwise likely overwrite your settings and samples.
Step 7: Contribute back
In case you have also changed or added steps, please consider contributing them back to the original repository:
-
Fork the original repo to a personal or lab account.
-
Clone the fork to your local system, to a different place than where you ran your analysis.
-
Copy the modified files from your analysis to the clone of your fork, e.g.,
cp -r workflow path/to/fork
. Make sure to not accidentally copy config file contents or sample sheets. Instead, manually update the example config files if necessary. -
Commit and push your changes to your fork.
-
Create a pull request against the original repository.
Testing
Test cases are in the subfolder
.test
. They are automtically executed via continuous integration with Github actions.
Code Snippets
17 18 | wrapper: "0.55.1/bio/vep/annotate" |
9 10 | shell: "bedextract {wildcards.contig} {input} > {output}" |
27 28 | wrapper: "0.52.0/bio/gatk/haplotypecaller" |
39 40 | wrapper: "0.52.0/bio/gatk/combinegvcfs" |
53 54 | wrapper: "0.52.0/bio/gatk/genotypegvcfs" |
64 65 | wrapper: "0.52.0/bio/picard/mergevcfs" |
16 17 | wrapper: "0.52.0/bio/gatk/selectvariants" |
36 37 | wrapper: "0.52.0/bio/gatk/variantfiltration" |
49 50 | wrapper: "0.52.0/bio/gatk/variantrecalibrator" |
64 65 | wrapper: "0.52.0/bio/picard/mergevcfs" |
11 12 | wrapper: "0.52.0/bio/trimmomatic/se" |
29 30 | wrapper: "0.52.0/bio/trimmomatic/pe" |
47 48 | wrapper: "0.52.0/bio/bwa/mem" |
61 62 | wrapper: "0.52.0/bio/picard/markduplicates" |
79 80 | wrapper: "0.52.0/bio/gatk/baserecalibrator" |
88 89 | wrapper: "0.52.0/bio/samtools/index" |
7 8 | wrapper: "0.52.0/bio/fastqc" |
18 19 | wrapper: "0.52.0/bio/samtools/stats" |
32 33 | wrapper: "0.52.0/bio/multiqc" |
12 13 | wrapper: "0.53.0/bio/reference/ensembl-sequence" |
24 25 | wrapper: "0.53.0/bio/samtools/faidx" |
38 39 | shell: "samtools dict {input} > {output} 2> {log} " |
56 57 | wrapper: "0.57.0/bio/reference/ensembl-variation" |
70 71 | shell: "rbt vcf-fix-iupac-alleles < {input} | bcftools view -Oz > {output}" |
84 85 | wrapper: "0.53.0/bio/tabix" |
98 99 | wrapper: "0.53.0/bio/bwa/index" |
111 112 | wrapper: "0.55.0/bio/vep/cache" |
120 121 | wrapper: "0.55.0/bio/vep/plugins" |
8 9 10 11 | shell: "bcftools view --apply-filters PASS --output-type u {input} | " "rbt vcf-to-txt -g --fmt DP AD --info ANN | " "gzip > {output}" |
22 23 | script: "../scripts/plot-depths.py" |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | import common import matplotlib.pyplot as plt import pandas as pd import numpy as np import seaborn as sns calls = pd.read_table(snakemake.input[0], header=[0, 1]) samples = [name for name in calls.columns.levels[0] if name != "VARIANT"] sample_info = calls.loc[:, samples].stack([0, 1]).unstack().reset_index(1, drop=False) sample_info = sample_info.rename({"level_1": "sample"}, axis=1) sample_info = sample_info[sample_info["DP"] > 0] sample_info["freq"] = sample_info["AD"] / sample_info["DP"] sample_info.index = np.arange(sample_info.shape[0]) plt.figure() sns.stripplot(x="sample", y="freq", data=sample_info, jitter=True) plt.ylabel("allele frequency") plt.xticks(rotation="vertical") plt.savefig(snakemake.output.freqs) plt.figure() sns.stripplot(x="sample", y="DP", data=sample_info, jitter=True) plt.ylabel("read depth") plt.xticks(rotation="vertical") plt.savefig(snakemake.output.depths) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 | __author__ = "Johannes Köster, Julian de Ruiter" __copyright__ = "Copyright 2016, Johannes Köster and Julian de Ruiter" __email__ = "[email protected], [email protected]" __license__ = "MIT" from os import path from snakemake.shell import shell # Extract arguments. extra = snakemake.params.get("extra", "") sort = snakemake.params.get("sort", "none") sort_order = snakemake.params.get("sort_order", "coordinate") sort_extra = snakemake.params.get("sort_extra", "") log = snakemake.log_fmt_shell(stdout=False, stderr=True) # Check inputs/arguments. if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in { 1, 2, }: raise ValueError("input must have 1 (single-end) or " "2 (paired-end) elements") if sort_order not in {"coordinate", "queryname"}: raise ValueError("Unexpected value for sort_order ({})".format(sort_order)) # Determine which pipe command to use for converting to bam or sorting. if sort == "none": # Simply convert to bam using samtools view. pipe_cmd = "samtools view -Sbh -o {snakemake.output[0]} -" elif sort == "samtools": # Sort alignments using samtools sort. pipe_cmd = "samtools sort {sort_extra} -o {snakemake.output[0]} -" # Add name flag if needed. if sort_order == "queryname": sort_extra += " -n" prefix = path.splitext(snakemake.output[0])[0] sort_extra += " -T " + prefix + ".tmp" elif sort == "picard": # Sort alignments using picard SortSam. pipe_cmd = ( "picard SortSam {sort_extra} INPUT=/dev/stdin" " OUTPUT={snakemake.output[0]} SORT_ORDER={sort_order}" ) else: raise ValueError("Unexpected value for params.sort ({})".format(sort)) shell( "(bwa mem" " -t {snakemake.threads}" " {extra}" " {snakemake.params.index}" " {snakemake.input.reads}" " | " + pipe_cmd + ") {log}" ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "[email protected]" __license__ = "MIT" from os import path from tempfile import TemporaryDirectory from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=False, stderr=True) def basename_without_ext(file_path): """Returns basename of file path, without the file extension.""" base = path.basename(file_path) split_ind = 2 if base.endswith(".fastq.gz") else 1 base = ".".join(base.split(".")[:-split_ind]) return base # Run fastqc, since there can be race conditions if multiple jobs # use the same fastqc dir, we create a temp dir. with TemporaryDirectory() as tempdir: shell( "fastqc {snakemake.params} --quiet " "--outdir {tempdir} {snakemake.input[0]}" " {log}" ) # Move outputs into proper position. output_base = basename_without_ext(snakemake.input[0]) html_path = path.join(tempdir, output_base + "_fastqc.html") zip_path = path.join(tempdir, output_base + "_fastqc.zip") if snakemake.output.html != html_path: shell("mv {html_path} {snakemake.output.html}") if snakemake.output.zip != zip_path: shell("mv {zip_path} {snakemake.output.zip}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2018, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" from tempfile import TemporaryDirectory import os from snakemake.shell import shell extra = snakemake.params.get("extra", "") java_opts = snakemake.params.get("java_opts", "") with TemporaryDirectory() as tmpdir: recal_table = os.path.join(tmpdir, "recal_table.grp") log = snakemake.log_fmt_shell(stdout=True, stderr=True) known = snakemake.input.get("known", "") if known: known = "--known-sites {}".format(known) shell( "gatk --java-options '{java_opts}' BaseRecalibrator {extra} " "-R {snakemake.input.ref} -I {snakemake.input.bam} " "-O {recal_table} {known} {log}" ) log = snakemake.log_fmt_shell(stdout=True, stderr=True, append=True) shell( "gatk --java-options '{java_opts}' ApplyBQSR -R {snakemake.input.ref} -I {snakemake.input.bam} " "--bqsr-recal-file {recal_table} " "-O {snakemake.output.bam} {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2018, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" import os from snakemake.shell import shell extra = snakemake.params.get("extra", "") java_opts = snakemake.params.get("java_opts", "") gvcfs = list(map("-V {}".format, snakemake.input.gvcfs)) log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "gatk --java-options '{java_opts}' CombineGVCFs {extra} " "{gvcfs} " "-R {snakemake.input.ref} " "-O {snakemake.output.gvcf} {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2018, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" import os from snakemake.shell import shell extra = snakemake.params.get("extra", "") java_opts = snakemake.params.get("java_opts", "") log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "gatk --java-options '{java_opts}' GenotypeGVCFs {extra} " "-V {snakemake.input.gvcf} " "-R {snakemake.input.ref} " "-O {snakemake.output.vcf} {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2018, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" import os from snakemake.shell import shell known = snakemake.input.get("known", "") if known: known = "--dbsnp " + known extra = snakemake.params.get("extra", "") java_opts = snakemake.params.get("java_opts", "") bams = snakemake.input.bam if isinstance(bams, str): bams = [bams] bams = list(map("-I {}".format, bams)) log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "gatk --java-options '{java_opts}' HaplotypeCaller {extra} " "-R {snakemake.input.ref} {bams} " "-ERC GVCF " "-O {snakemake.output.gvcf} {known} {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2018, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell extra = snakemake.params.get("extra", "") java_opts = snakemake.params.get("java_opts", "") log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "gatk --java-options '{java_opts}' SelectVariants -R {snakemake.input.ref} -V {snakemake.input.vcf} " "{extra} -O {snakemake.output.vcf} {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2018, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell extra = snakemake.params.get("extra", "") java_opts = snakemake.params.get("java_opts", "") filters = [ "--filter-name {} --filter-expression '{}'".format(name, expr.replace("'", "\\'")) for name, expr in snakemake.params.filters.items() ] log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "gatk --java-options '{java_opts}' VariantFiltration -R {snakemake.input.ref} -V {snakemake.input.vcf} " "{extra} {filters} -O {snakemake.output.vcf} {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2018, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" import os from snakemake.shell import shell extra = snakemake.params.get("extra", "") java_opts = snakemake.params.get("java_opts", "") def fmt_res(resname, resparams): fmt_bool = lambda b: str(b).lower() try: f = snakemake.input.get(resname) except KeyError: raise RuntimeError( "There must be a named input file for every resource (missing: {})".format( resname ) ) return "{},known={},training={},truth={},prior={}:{}".format( resname, fmt_bool(resparams["known"]), fmt_bool(resparams["training"]), fmt_bool(resparams["truth"]), resparams["prior"], f, ) resources = [ "--resource {}".format(fmt_res(resname, resparams)) for resname, resparams in snakemake.params["resources"].items() ] annotation = list(map("-an {}".format, snakemake.params.annotation)) tranches = "" if snakemake.output.tranches: tranches = "--tranches-file " + snakemake.output.tranches log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "gatk --java-options '{java_opts}' VariantRecalibrator {extra} {resources} " "-R {snakemake.input.ref} -V {snakemake.input.vcf} " "-mode {snakemake.params.mode} " "--output {snakemake.output.vcf} " "{tranches} {annotation} {log}" ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "[email protected]" __license__ = "MIT" from os import path from snakemake.shell import shell input_dirs = set(path.dirname(fp) for fp in snakemake.input) output_dir = path.dirname(snakemake.output[0]) output_name = path.basename(snakemake.output[0]) log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "multiqc" " {snakemake.params}" " --force" " -o {output_dir}" " -n {output_name}" " {input_dirs}" " {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "picard MarkDuplicates {snakemake.params} INPUT={snakemake.input} " "OUTPUT={snakemake.output.bam} METRICS_FILE={snakemake.output.metrics} " "{log}" ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2018, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell inputs = " ".join("INPUT={}".format(f) for f in snakemake.input.vcfs) log = snakemake.log_fmt_shell(stdout=False, stderr=True) extra = snakemake.params.get("extra", "") shell( "picard" " MergeVcfs" " {extra}" " {inputs}" " OUTPUT={snakemake.output[0]}" " {log}" ) |
1 2 3 4 5 6 7 8 9 10 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell shell("samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]}") |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell extra = snakemake.params.get("extra", "") region = snakemake.params.get("region", "") log = snakemake.log_fmt_shell(stdout=False, stderr=True) shell("samtools stats {extra} {snakemake.input}" " {region} > {snakemake.output} {log}") |
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 | __author__ = "Johannes Köster, Jorge Langa" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell # Distribute available threads between trimmomatic itself and any potential pigz instances def distribute_threads(input_files, output_files, available_threads): gzipped_input_files = sum(1 for file in input_files if file.endswith(".gz")) gzipped_output_files = sum(1 for file in output_files if file.endswith(".gz")) potential_threads_per_process = available_threads // ( 1 + gzipped_input_files + gzipped_output_files ) if potential_threads_per_process > 0: # decompressing pigz creates at most 4 threads pigz_input_threads = ( min(4, potential_threads_per_process) if gzipped_input_files != 0 else 0 ) pigz_output_threads = ( (available_threads - pigz_input_threads * gzipped_input_files) // (1 + gzipped_output_files) if gzipped_output_files != 0 else 0 ) trimmomatic_threads = ( available_threads - pigz_input_threads * gzipped_input_files - pigz_output_threads * gzipped_output_files ) else: # not enough threads for pigz pigz_input_threads = 0 pigz_output_threads = 0 trimmomatic_threads = available_threads return trimmomatic_threads, pigz_input_threads, pigz_output_threads def compose_input_gz(filename, threads): if filename.endswith(".gz") and threads > 0: return "<(pigz -p {threads} --decompress --stdout {filename})".format( threads=threads, filename=filename ) return filename def compose_output_gz(filename, threads, compression_level): if filename.endswith(".gz") and threads > 0: return ">(pigz -p {threads} {compression_level} > {filename})".format( threads=threads, compression_level=compression_level, filename=filename ) return filename extra = snakemake.params.get("extra", "") log = snakemake.log_fmt_shell(stdout=True, stderr=True) compression_level = snakemake.params.get("compression_level", "-5") trimmer = " ".join(snakemake.params.trimmer) # Distribute threads input_files = [snakemake.input.r1, snakemake.input.r2] output_files = [ snakemake.output.r1, snakemake.output.r1_unpaired, snakemake.output.r2, snakemake.output.r2_unpaired, ] trimmomatic_threads, input_threads, output_threads = distribute_threads( input_files, output_files, snakemake.threads ) input_r1, input_r2 = [ compose_input_gz(filename, input_threads) for filename in input_files ] output_r1, output_r1_unp, output_r2, output_r2_unp = [ compose_output_gz(filename, output_threads, compression_level) for filename in output_files ] shell( "trimmomatic PE -threads {trimmomatic_threads} {extra} " "{input_r1} {input_r2} " "{output_r1} {output_r1_unp} " "{output_r2} {output_r2_unp} " "{trimmer} " "{log}" ) |
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 | __author__ = "Johannes Köster, Jorge Langa" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell # Distribute available threads between trimmomatic itself and any potential pigz instances def distribute_threads(input_file, output_file, available_threads): gzipped_input_files = 1 if input_file.endswith(".gz") else 0 gzipped_output_files = 1 if output_file.endswith(".gz") else 0 potential_threads_per_process = available_threads // ( 1 + gzipped_input_files + gzipped_output_files ) if potential_threads_per_process > 0: # decompressing pigz creates at most 4 threads pigz_input_threads = ( min(4, potential_threads_per_process) if gzipped_input_files != 0 else 0 ) pigz_output_threads = ( (available_threads - pigz_input_threads * gzipped_input_files) // (1 + gzipped_output_files) if gzipped_output_files != 0 else 0 ) trimmomatic_threads = ( available_threads - pigz_input_threads * gzipped_input_files - pigz_output_threads * gzipped_output_files ) else: # not enough threads for pigz pigz_input_threads = 0 pigz_output_threads = 0 trimmomatic_threads = available_threads return trimmomatic_threads, pigz_input_threads, pigz_output_threads def compose_input_gz(filename, threads): if filename.endswith(".gz") and threads > 0: return "<(pigz -p {threads} --decompress --stdout {filename})".format( threads=threads, filename=filename ) return filename def compose_output_gz(filename, threads, compression_level): if filename.endswith(".gz") and threads > 0: return ">(pigz -p {threads} {compression_level} > {filename})".format( threads=threads, compression_level=compression_level, filename=filename ) return filename extra = snakemake.params.get("extra", "") log = snakemake.log_fmt_shell(stdout=True, stderr=True) compression_level = snakemake.params.get("compression_level", "-5") trimmer = " ".join(snakemake.params.trimmer) # Distribute threads trimmomatic_threads, input_threads, output_threads = distribute_threads( snakemake.input[0], snakemake.output[0], snakemake.threads ) # Collect files input = compose_input_gz(snakemake.input[0], input_threads) output = compose_output_gz(snakemake.output[0], output_threads, compression_level) shell( "trimmomatic SE -threads {trimmomatic_threads} {extra} {input} {output} {trimmer} {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | __author__ = "Patrik Smeds" __copyright__ = "Copyright 2016, Patrik Smeds" __email__ = "[email protected]" __license__ = "MIT" from os import path from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=False, stderr=True) # Check inputs/arguments. if len(snakemake.input) == 0: raise ValueError("A reference genome has to be provided!") elif len(snakemake.input) > 1: raise ValueError("Only one reference genome can be inputed!") # Prefix that should be used for the database prefix = snakemake.params.get("prefix", "") if len(prefix) > 0: prefix = "-p " + prefix # Contrunction algorithm that will be used to build the database, default is bwtsw construction_algorithm = snakemake.params.get("algorithm", "") if len(construction_algorithm) != 0: construction_algorithm = "-a " + construction_algorithm shell( "bwa index" " {prefix}" " {construction_algorithm}" " {snakemake.input[0]}" " {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2019, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" import subprocess as sp import sys from itertools import product from snakemake.shell import shell species = snakemake.params.species.lower() release = int(snakemake.params.release) build = snakemake.params.build branch = "" if release >= 81 and build == "GRCh37": # use the special grch37 branch for new releases branch = "grch37/" log = snakemake.log_fmt_shell(stdout=False, stderr=True) spec = ("{build}" if int(release) > 75 else "{build}.{release}").format( build=build, release=release ) suffixes = "" datatype = snakemake.params.get("datatype", "") if datatype == "dna": suffixes = ["dna.primary_assembly.fa.gz", "dna.toplevel.fa.gz"] elif datatype == "cdna": suffixes = ["cdna.all.fa.gz"] elif datatype == "cds": suffixes = ["cds.all.fa.gz"] elif datatype == "ncrna": suffixes = ["ncrna.fa.gz"] elif datatype == "pep": suffixes = ["pep.all.fa.gz"] else: raise ValueError("invalid datatype, must be one of dna, cdna, cds, ncrna, pep") success = False for suffix in suffixes: url = "ftp://ftp.ensembl.org/pub/{branch}release-{release}/fasta/{species}/{datatype}/{species_cap}.{spec}.{suffix}".format( release=release, species=species, datatype=datatype, spec=spec.format(build=build, release=release), suffix=suffix, species_cap=species.capitalize(), branch=branch, ) try: shell("curl -sSf {url} > /dev/null 2> /dev/null") except sp.CalledProcessError: continue shell("(curl -L {url} | gzip -d > {snakemake.output[0]}) {log}") success = True break if not success: print( "Unable to download requested sequence data from Ensembl. " "Did you check that this combination of species, build, and release is actually provided?", file=sys.stderr, ) exit(1) |
1 2 3 4 5 6 7 8 9 10 | __author__ = "Michael Chambers" __copyright__ = "Copyright 2019, Michael Chambers" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell shell("samtools faidx {snakemake.params} {snakemake.input[0]} > {snakemake.output[0]}") |
1 2 3 4 5 6 7 8 9 10 11 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=False, stderr=True) shell("tabix {snakemake.params} {snakemake.input[0]} {log}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2020, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" from pathlib import Path from snakemake.shell import shell extra = snakemake.params.get("extra", "") log = snakemake.log_fmt_shell(stdout=False, stderr=True) shell( "vep_install --AUTO cf " "--SPECIES {snakemake.params.species} " "--ASSEMBLY {snakemake.params.build} " "--CACHE_VERSION {snakemake.params.release} " "--CACHEDIR {snakemake.output} " "--CONVERT {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2020, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" import sys from pathlib import Path from urllib.request import urlretrieve from zipfile import ZipFile from tempfile import NamedTemporaryFile if snakemake.log: sys.stderr = open(snakemake.log[0], "w") outdir = Path(snakemake.output[0]) outdir.mkdir() with NamedTemporaryFile() as tmp: urlretrieve( "https://github.com/Ensembl/VEP_plugins/archive/release/{release}.zip".format( release=snakemake.params.release ), tmp.name, ) with ZipFile(tmp.name) as f: for member in f.infolist(): memberpath = Path(member.filename) if len(memberpath.parts) == 1: # skip root dir continue targetpath = outdir / memberpath.relative_to(memberpath.parts[0]) if member.is_dir(): targetpath.mkdir() else: with open(targetpath, "wb") as out: out.write(f.read(member.filename)) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2020, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" from pathlib import Path from snakemake.shell import shell def get_only_child_dir(path): children = [child for child in path.iterdir() if child.is_dir()] assert ( len(children) == 1 ), "Invalid VEP cache directory, only a single entry is allowed, make sure that cache was created with the snakemake VEP cache wrapper" return children[0] extra = snakemake.params.get("extra", "") log = snakemake.log_fmt_shell(stdout=False, stderr=True) fork = "--fork {}".format(snakemake.threads) if snakemake.threads > 1 else "" stats = snakemake.output.stats cache = snakemake.input.cache plugins = snakemake.input.plugins entrypath = get_only_child_dir(get_only_child_dir(Path(cache))) species = entrypath.parent.name release, build = entrypath.name.split("_") load_plugins = " ".join(map("--plugin {}".format, snakemake.params.plugins)) if snakemake.output.calls.endswith(".vcf.gz"): fmt = "z" elif snakemake.output.calls.endswith(".bcf"): fmt = "b" else: fmt = "v" shell( "(bcftools view {snakemake.input.calls} | " "vep {extra} {fork} " "--format vcf " "--vcf " "--cache " "--cache_version {release} " "--species {species} " "--assembly {build} " "--dir_cache {cache} " "--dir_plugins {plugins} " "{load_plugins} " "--output_file STDOUT " "--stats_file {stats} | " "bcftools view -O{fmt} > {snakemake.output.calls}) {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2019, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" import tempfile import subprocess import sys import os from snakemake.shell import shell from snakemake.exceptions import WorkflowError species = snakemake.params.species.lower() release = int(snakemake.params.release) build = snakemake.params.build type = snakemake.params.type if release < 98: print("Ensembl releases <98 are unsupported.", file=open(snakemake.log[0], "w")) exit(1) branch = "" if release >= 81 and build == "GRCh37": # use the special grch37 branch for new releases branch = "grch37/" log = snakemake.log_fmt_shell(stdout=False, stderr=True) if type == "all": if species == "homo_sapiens" and release >= 93: suffixes = [ "-chr{}".format(chrom) for chrom in list(range(1, 23)) + ["X", "Y", "MT"] ] else: suffixes = [""] elif type == "somatic": suffixes = ["_somatic"] elif type == "structural_variations": suffixes = ["_structural_variations"] else: raise ValueError( "Unsupported type {} (only all, somatic, structural_variations are allowed)".format( type ) ) species_filename = species if release >= 91 else species.capitalize() urls = [ "ftp://ftp.ensembl.org/pub/{branch}release-{release}/variation/vcf/{species}/{species_filename}{suffix}.{ext}".format( release=release, species=species, suffix=suffix, species_filename=species_filename, branch=branch, ext=ext, ) for suffix in suffixes for ext in ["vcf.gz", "vcf.gz.csi"] ] names = [os.path.basename(url) for url in urls if url.endswith(".gz")] try: gather = "curl {urls}".format(urls=" ".join(map("-O {}".format, urls))) workdir = os.getcwd() with tempfile.TemporaryDirectory() as tmpdir: if snakemake.input.get("fai"): shell( "(cd {tmpdir}; {gather} && " "bcftools concat -Oz --naive {names} > concat.vcf.gz && " "bcftools reheader --fai {workdir}/{snakemake.input.fai} concat.vcf.gz " "> {workdir}/{snakemake.output}) {log}" ) else: shell( "(cd {tmpdir}; {gather} && " "bcftools concat -Oz --naive {names} " "> {workdir}/{snakemake.output}) {log}" ) except subprocess.CalledProcessError as e: if snakemake.log: sys.stderr = open(snakemake.log[0], "a") print( "Unable to download variation data from Ensembl. " "Did you check that this combination of species, build, and release is actually provided? ", file=sys.stderr, ) exit(1) |
Support
- Future updates