This Snakemake pipeline implements the GATK best-practices workflow
This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants.
Usage
The usage of this workflow is described in the Snakemake Workflow Catalog .
If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and its DOI (see above).
Code Snippets
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2018, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" import os from snakemake.shell import shell known = snakemake.input.get("known", "") if known: known = "--dbsnp " + known extra = snakemake.params.get("extra", "") java_opts = snakemake.params.get("java_opts", "") bams = snakemake.input.bam if isinstance(bams, str): bams = [bams] bams = list(map("-I {}".format, bams)) log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "gatk --java-options '{java_opts}' HaplotypeCaller {extra} " "-R {snakemake.input.ref} {bams} " "-ERC GVCF " "-O {snakemake.output.gvcf} {known} {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2018, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell extra = snakemake.params.get("extra", "") java_opts = snakemake.params.get("java_opts", "") log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "gatk --java-options '{java_opts}' SelectVariants -R {snakemake.input.ref} -V {snakemake.input.vcf} " "{extra} -O {snakemake.output.vcf} {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | __author__ = "Patrik Smeds" __copyright__ = "Copyright 2016, Patrik Smeds" __email__ = "[email protected]" __license__ = "MIT" from os import path from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=False, stderr=True) # Check inputs/arguments. if len(snakemake.input) == 0: raise ValueError("A reference genome has to be provided!") elif len(snakemake.input) > 1: raise ValueError("Only one reference genome can be inputed!") # Prefix that should be used for the database prefix = snakemake.params.get("prefix", "") if len(prefix) > 0: prefix = "-p " + prefix # Contrunction algorithm that will be used to build the database, default is bwtsw construction_algorithm = snakemake.params.get("algorithm", "") if len(construction_algorithm) != 0: construction_algorithm = "-a " + construction_algorithm shell( "bwa index" " {prefix}" " {construction_algorithm}" " {snakemake.input[0]}" " {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 | __author__ = "Johannes Köster, Julian de Ruiter" __copyright__ = "Copyright 2016, Johannes Köster and Julian de Ruiter" __email__ = "[email protected], [email protected]" __license__ = "MIT" from os import path from snakemake.shell import shell # Extract arguments. extra = snakemake.params.get("extra", "") sort = snakemake.params.get("sort", "none") sort_order = snakemake.params.get("sort_order", "coordinate") sort_extra = snakemake.params.get("sort_extra", "") log = snakemake.log_fmt_shell(stdout=False, stderr=True) # Check inputs/arguments. if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in { 1, 2, }: raise ValueError("input must have 1 (single-end) or " "2 (paired-end) elements") if sort_order not in {"coordinate", "queryname"}: raise ValueError("Unexpected value for sort_order ({})".format(sort_order)) # Determine which pipe command to use for converting to bam or sorting. if sort == "none": # Simply convert to bam using samtools view. pipe_cmd = "samtools view -Sbh -o {snakemake.output[0]} -" elif sort == "samtools": # Sort alignments using samtools sort. pipe_cmd = "samtools sort {sort_extra} -o {snakemake.output[0]} -" # Add name flag if needed. if sort_order == "queryname": sort_extra += " -n" prefix = path.splitext(snakemake.output[0])[0] sort_extra += " -T " + prefix + ".tmp" elif sort == "picard": # Sort alignments using picard SortSam. pipe_cmd = ( "picard SortSam {sort_extra} INPUT=/dev/stdin" " OUTPUT={snakemake.output[0]} SORT_ORDER={sort_order}" ) else: raise ValueError("Unexpected value for params.sort ({})".format(sort)) shell( "(bwa mem" " -t {snakemake.threads}" " {extra}" " {snakemake.params.index}" " {snakemake.input.reads}" " | " + pipe_cmd + ") {log}" ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "[email protected]" __license__ = "MIT" from os import path import re from tempfile import TemporaryDirectory from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=True, stderr=True) def basename_without_ext(file_path): """Returns basename of file path, without the file extension.""" base = path.basename(file_path) # Remove file extension(s) (similar to the internal fastqc approach) base = re.sub("\\.gz$", "", base) base = re.sub("\\.bz2$", "", base) base = re.sub("\\.txt$", "", base) base = re.sub("\\.fastq$", "", base) base = re.sub("\\.fq$", "", base) base = re.sub("\\.sam$", "", base) base = re.sub("\\.bam$", "", base) return base # Run fastqc, since there can be race conditions if multiple jobs # use the same fastqc dir, we create a temp dir. with TemporaryDirectory() as tempdir: shell( "fastqc {snakemake.params} -t {snakemake.threads} " "--outdir {tempdir:q} {snakemake.input[0]:q}" " {log}" ) # Move outputs into proper position. output_base = basename_without_ext(snakemake.input[0]) html_path = path.join(tempdir, output_base + "_fastqc.html") zip_path = path.join(tempdir, output_base + "_fastqc.zip") if snakemake.output.html != html_path: shell("mv {html_path:q} {snakemake.output.html:q}") if snakemake.output.zip != zip_path: shell("mv {zip_path:q} {snakemake.output.zip:q}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | __author__ = "Christopher Schröder" __copyright__ = "Copyright 2020, Christopher Schröder" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell from snakemake_wrapper_utils.java import get_java_opts extra = snakemake.params.get("extra", "") java_opts = get_java_opts(snakemake) log = snakemake.log_fmt_shell(stdout=True, stderr=True, append=True) shell( "gatk --java-options '{java_opts}' ApplyBQSR {extra} " "-R {snakemake.input.ref} -I {snakemake.input.bam} " "--bqsr-recal-file {snakemake.input.recal_table} " "-O {snakemake.output.bam} {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | __author__ = "Christopher Schröder" __copyright__ = "Copyright 2020, Christopher Schröder" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell from snakemake_wrapper_utils.java import get_java_opts extra = snakemake.params.get("extra", "") java_opts = get_java_opts(snakemake) log = snakemake.log_fmt_shell(stdout=True, stderr=True) known = snakemake.input.get("known", "") if known: if isinstance(known, str): known = [known] known = list(map("--known-sites {}".format, known)) shell( "gatk --java-options '{java_opts}' BaseRecalibrator {extra} " "-R {snakemake.input.ref} -I {snakemake.input.bam} " "-O {snakemake.output.recal_table} {known} {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2018, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" import os from snakemake.shell import shell from snakemake_wrapper_utils.java import get_java_opts extra = snakemake.params.get("extra", "") java_opts = get_java_opts(snakemake) gvcfs = list(map("-V {}".format, snakemake.input.gvcfs)) log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "gatk --java-options '{java_opts}' CombineGVCFs {extra} " "{gvcfs} " "-R {snakemake.input.ref} " "-O {snakemake.output.gvcf} {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2018, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" import os from snakemake.shell import shell from snakemake_wrapper_utils.java import get_java_opts extra = snakemake.params.get("extra", "") java_opts = get_java_opts(snakemake) interval_file = snakemake.input.get("interval_file", "") if interval_file: interval_file = "-L {}".format(interval_file) dbsnp = snakemake.input.get("known", "") if dbsnp: dbsnp = "-D {}".format(dbsnp) # Allow for either an input gvcf or GenomicsDB gvcf = snakemake.input.get("gvcf", "") genomicsdb = snakemake.input.get("genomicsdb", "") if gvcf: if genomicsdb: raise Exception("Only input.gvcf or input.genomicsdb expected, got both.") input_string = gvcf else: if genomicsdb: input_string = "gendb://{}".format(genomicsdb) else: raise Exception("Expected input.gvcf or input.genomicsdb.") tmp_dir = snakemake.params.get("tmp_dir", "") if tmp_dir: tmp_dir = "--tmp-dir={}".format(tmp_dir) log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "gatk --java-options '{java_opts}' GenotypeGVCFs {extra} " "-V {input_string} " "-R {snakemake.input.ref} " "{dbsnp} " "{interval_file} " "{tmp_dir} " "-O {snakemake.output.vcf} {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2018, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell from snakemake_wrapper_utils.java import get_java_opts extra = snakemake.params.get("extra", "") java_opts = get_java_opts(snakemake) filters = [ "--filter-name {} --filter-expression '{}'".format(name, expr.replace("'", "\\'")) for name, expr in snakemake.params.filters.items() ] log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "gatk --java-options '{java_opts}' VariantFiltration -R {snakemake.input.ref} -V {snakemake.input.vcf} " "{extra} {filters} -O {snakemake.output.vcf} {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2018, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" import os from snakemake.shell import shell from snakemake_wrapper_utils.java import get_java_opts extra = snakemake.params.get("extra", "") java_opts = get_java_opts(snakemake) def fmt_res(resname, resparams): fmt_bool = lambda b: str(b).lower() try: f = snakemake.input.get(resname) except KeyError: raise RuntimeError( "There must be a named input file for every resource (missing: {})".format( resname ) ) return "{},known={},training={},truth={},prior={} {}".format( resname, fmt_bool(resparams["known"]), fmt_bool(resparams["training"]), fmt_bool(resparams["truth"]), resparams["prior"], f, ) resources = [ "--resource:{}".format(fmt_res(resname, resparams)) for resname, resparams in snakemake.params["resources"].items() ] annotation = list(map("-an {}".format, snakemake.params.annotation)) tranches = "" if snakemake.output.tranches: tranches = "--tranches-file " + snakemake.output.tranches log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "gatk --java-options '{java_opts}' VariantRecalibrator {extra} {resources} " "-R {snakemake.input.ref} -V {snakemake.input.vcf} " "-mode {snakemake.params.mode} " "--output {snakemake.output.vcf} " "{tranches} {annotation} {log}" ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "[email protected]" __license__ = "MIT" from os import path from snakemake.shell import shell input_dirs = set(path.dirname(fp) for fp in snakemake.input) output_dir = path.dirname(snakemake.output[0]) output_name = path.basename(snakemake.output[0]) log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "multiqc" " {snakemake.params}" " --force" " -o {output_dir}" " -n {output_name}" " {input_dirs}" " {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell from snakemake_wrapper_utils.java import get_java_opts log = snakemake.log_fmt_shell(stdout=True, stderr=True) extra = snakemake.params.get("extra", "") java_opts = get_java_opts(snakemake) bams = snakemake.input if isinstance(bams, str): bams = [bams] bams = list(map("INPUT={}".format, bams)) shell( "picard MarkDuplicates " # Tool and its subcommand "{java_opts} " # Automatic java option "{extra} " # User defined parmeters "{bams} " # Input bam(s) "OUTPUT={snakemake.output.bam} " # Output bam "METRICS_FILE={snakemake.output.metrics} " # Output metrics "{log}" # Logging ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2018, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell from snakemake_wrapper_utils.java import get_java_opts inputs = " ".join("INPUT={}".format(f) for f in snakemake.input.vcfs) log = snakemake.log_fmt_shell(stdout=False, stderr=True) extra = snakemake.params.get("extra", "") java_opts = get_java_opts(snakemake) shell( "picard" " MergeVcfs" " {java_opts}" " {extra}" " {inputs}" " OUTPUT={snakemake.output[0]}" " {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2019, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" import subprocess as sp import sys from itertools import product from snakemake.shell import shell species = snakemake.params.species.lower() release = int(snakemake.params.release) build = snakemake.params.build branch = "" if release >= 81 and build == "GRCh37": # use the special grch37 branch for new releases branch = "grch37/" log = snakemake.log_fmt_shell(stdout=False, stderr=True) spec = ("{build}" if int(release) > 75 else "{build}.{release}").format( build=build, release=release ) suffixes = "" datatype = snakemake.params.get("datatype", "") chromosome = snakemake.params.get("chromosome", "") if datatype == "dna": if chromosome: suffixes = ["dna.chromosome.{}.fa.gz".format(chromosome)] else: suffixes = ["dna.primary_assembly.fa.gz", "dna.toplevel.fa.gz"] elif datatype == "cdna": suffixes = ["cdna.all.fa.gz"] elif datatype == "cds": suffixes = ["cds.all.fa.gz"] elif datatype == "ncrna": suffixes = ["ncrna.fa.gz"] elif datatype == "pep": suffixes = ["pep.all.fa.gz"] else: raise ValueError("invalid datatype, must be one of dna, cdna, cds, ncrna, pep") if chromosome: if not datatype == "dna": raise ValueError( "invalid datatype, to select a single chromosome the datatype must be dna" ) success = False for suffix in suffixes: url = "ftp://ftp.ensembl.org/pub/{branch}release-{release}/fasta/{species}/{datatype}/{species_cap}.{spec}.{suffix}".format( release=release, species=species, datatype=datatype, spec=spec.format(build=build, release=release), suffix=suffix, species_cap=species.capitalize(), branch=branch, ) try: shell("curl -sSf {url} > /dev/null 2> /dev/null") except sp.CalledProcessError: continue shell("(curl -L {url} | gzip -d > {snakemake.output[0]}) {log}") success = True break if not success: print( "Unable to download requested sequence data from Ensembl. " "Did you check that this combination of species, build, and release is actually provided?", file=sys.stderr, ) exit(1) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2019, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" import tempfile import subprocess import sys import os from snakemake.shell import shell from snakemake.exceptions import WorkflowError species = snakemake.params.species.lower() release = int(snakemake.params.release) build = snakemake.params.build type = snakemake.params.type if release < 98: print("Ensembl releases <98 are unsupported.", file=open(snakemake.log[0], "w")) exit(1) branch = "" if release >= 81 and build == "GRCh37": # use the special grch37 branch for new releases branch = "grch37/" log = snakemake.log_fmt_shell(stdout=False, stderr=True) if type == "all": if species == "homo_sapiens" and release >= 93: suffixes = [ "-chr{}".format(chrom) for chrom in list(range(1, 23)) + ["X", "Y", "MT"] ] else: suffixes = [""] elif type == "somatic": suffixes = ["_somatic"] elif type == "structural_variations": suffixes = ["_structural_variations"] else: raise ValueError( "Unsupported type {} (only all, somatic, structural_variations are allowed)".format( type ) ) species_filename = species if release >= 91 else species.capitalize() urls = [ "ftp://ftp.ensembl.org/pub/{branch}release-{release}/variation/vcf/{species}/{species_filename}{suffix}.{ext}".format( release=release, species=species, suffix=suffix, species_filename=species_filename, branch=branch, ext=ext, ) for suffix in suffixes for ext in ["vcf.gz", "vcf.gz.csi"] ] names = [os.path.basename(url) for url in urls if url.endswith(".gz")] try: gather = "curl {urls}".format(urls=" ".join(map("-O {}".format, urls))) workdir = os.getcwd() with tempfile.TemporaryDirectory() as tmpdir: if snakemake.input.get("fai"): shell( "(cd {tmpdir}; {gather} && " "bcftools concat -Oz --naive {names} > concat.vcf.gz && " "bcftools reheader --fai {workdir}/{snakemake.input.fai} concat.vcf.gz " "> {workdir}/{snakemake.output}) {log}" ) else: shell( "(cd {tmpdir}; {gather} && " "bcftools concat -Oz --naive {names} " "> {workdir}/{snakemake.output}) {log}" ) except subprocess.CalledProcessError as e: if snakemake.log: sys.stderr = open(snakemake.log[0], "a") print( "Unable to download variation data from Ensembl. " "Did you check that this combination of species, build, and release is actually provided? ", file=sys.stderr, ) exit(1) |
1 2 3 4 5 6 7 8 9 10 11 12 13 | __author__ = "Michael Chambers" __copyright__ = "Copyright 2019, Michael Chambers" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=False, stderr=True) shell( "samtools faidx {snakemake.params} {snakemake.input[0]} > {snakemake.output[0]} {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]} {log}" ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell extra = snakemake.params.get("extra", "") region = snakemake.params.get("region", "") log = snakemake.log_fmt_shell(stdout=False, stderr=True) shell("samtools stats {extra} {snakemake.input} {region} > {snakemake.output} {log}") |
1 2 3 4 5 6 7 8 9 10 11 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=False, stderr=True) shell("tabix {snakemake.params} {snakemake.input[0]} {log}") |
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 | __author__ = "Johannes Köster, Jorge Langa" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell from snakemake_wrapper_utils.java import get_java_opts # Distribute available threads between trimmomatic itself and any potential pigz instances def distribute_threads(input_files, output_files, available_threads): gzipped_input_files = sum(1 for file in input_files if file.endswith(".gz")) gzipped_output_files = sum(1 for file in output_files if file.endswith(".gz")) potential_threads_per_process = available_threads // ( 1 + gzipped_input_files + gzipped_output_files ) if potential_threads_per_process > 0: # decompressing pigz creates at most 4 threads pigz_input_threads = ( min(4, potential_threads_per_process) if gzipped_input_files != 0 else 0 ) pigz_output_threads = ( (available_threads - pigz_input_threads * gzipped_input_files) // (1 + gzipped_output_files) if gzipped_output_files != 0 else 0 ) trimmomatic_threads = ( available_threads - pigz_input_threads * gzipped_input_files - pigz_output_threads * gzipped_output_files ) else: # not enough threads for pigz pigz_input_threads = 0 pigz_output_threads = 0 trimmomatic_threads = available_threads return trimmomatic_threads, pigz_input_threads, pigz_output_threads def compose_input_gz(filename, threads): if filename.endswith(".gz") and threads > 0: return "<(pigz -p {threads} --decompress --stdout {filename})".format( threads=threads, filename=filename ) return filename def compose_output_gz(filename, threads, compression_level): if filename.endswith(".gz") and threads > 0: return ">(pigz -p {threads} {compression_level} > {filename})".format( threads=threads, compression_level=compression_level, filename=filename ) return filename extra = snakemake.params.get("extra", "") java_opts = get_java_opts(snakemake) log = snakemake.log_fmt_shell(stdout=True, stderr=True) compression_level = snakemake.params.get("compression_level", "-5") trimmer = " ".join(snakemake.params.trimmer) # Distribute threads input_files = [snakemake.input.r1, snakemake.input.r2] output_files = [ snakemake.output.r1, snakemake.output.r1_unpaired, snakemake.output.r2, snakemake.output.r2_unpaired, ] trimmomatic_threads, input_threads, output_threads = distribute_threads( input_files, output_files, snakemake.threads ) input_r1, input_r2 = [ compose_input_gz(filename, input_threads) for filename in input_files ] output_r1, output_r1_unp, output_r2, output_r2_unp = [ compose_output_gz(filename, output_threads, compression_level) for filename in output_files ] shell( "trimmomatic PE -threads {trimmomatic_threads} {java_opts} {extra} " "{input_r1} {input_r2} " "{output_r1} {output_r1_unp} " "{output_r2} {output_r2_unp} " "{trimmer} " "{log}" ) |
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 | __author__ = "Johannes Köster, Jorge Langa" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell from snakemake_wrapper_utils.java import get_java_opts # Distribute available threads between trimmomatic itself and any potential pigz instances def distribute_threads(input_file, output_file, available_threads): gzipped_input_files = 1 if input_file.endswith(".gz") else 0 gzipped_output_files = 1 if output_file.endswith(".gz") else 0 potential_threads_per_process = available_threads // ( 1 + gzipped_input_files + gzipped_output_files ) if potential_threads_per_process > 0: # decompressing pigz creates at most 4 threads pigz_input_threads = ( min(4, potential_threads_per_process) if gzipped_input_files != 0 else 0 ) pigz_output_threads = ( (available_threads - pigz_input_threads * gzipped_input_files) // (1 + gzipped_output_files) if gzipped_output_files != 0 else 0 ) trimmomatic_threads = ( available_threads - pigz_input_threads * gzipped_input_files - pigz_output_threads * gzipped_output_files ) else: # not enough threads for pigz pigz_input_threads = 0 pigz_output_threads = 0 trimmomatic_threads = available_threads return trimmomatic_threads, pigz_input_threads, pigz_output_threads def compose_input_gz(filename, threads): if filename.endswith(".gz") and threads > 0: return "<(pigz -p {threads} --decompress --stdout {filename})".format( threads=threads, filename=filename ) return filename def compose_output_gz(filename, threads, compression_level): if filename.endswith(".gz") and threads > 0: return ">(pigz -p {threads} {compression_level} > {filename})".format( threads=threads, compression_level=compression_level, filename=filename ) return filename extra = snakemake.params.get("extra", "") java_opts = get_java_opts(snakemake) log = snakemake.log_fmt_shell(stdout=True, stderr=True) compression_level = snakemake.params.get("compression_level", "-5") trimmer = " ".join(snakemake.params.trimmer) # Distribute threads trimmomatic_threads, input_threads, output_threads = distribute_threads( snakemake.input[0], snakemake.output[0], snakemake.threads ) # Collect files input = compose_input_gz(snakemake.input[0], input_threads) output = compose_output_gz(snakemake.output[0], output_threads, compression_level) shell( "trimmomatic SE -threads {trimmomatic_threads} " "{java_opts} {extra} {input} {output} {trimmer} {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2020, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" from pathlib import Path from snakemake.shell import shell def get_only_child_dir(path): children = [child for child in path.iterdir() if child.is_dir()] assert ( len(children) == 1 ), "Invalid VEP cache directory, only a single entry is allowed, make sure that cache was created with the snakemake VEP cache wrapper" return children[0] extra = snakemake.params.get("extra", "") log = snakemake.log_fmt_shell(stdout=False, stderr=True) fork = "--fork {}".format(snakemake.threads) if snakemake.threads > 1 else "" stats = snakemake.output.stats cache = snakemake.input.get("cache", "") plugins = snakemake.input.plugins load_plugins = " ".join(map("--plugin {}".format, snakemake.params.plugins)) if snakemake.output.calls.endswith(".vcf.gz"): fmt = "z" elif snakemake.output.calls.endswith(".bcf"): fmt = "b" else: fmt = "v" fasta = snakemake.input.get("fasta", "") if fasta: fasta = "--fasta {}".format(fasta) gff = snakemake.input.get("gff", "") if gff: gff = "--gff {}".format(gff) if cache: entrypath = get_only_child_dir(get_only_child_dir(Path(cache))) species = entrypath.parent.name release, build = entrypath.name.split("_") cache = ( "--offline --cache --dir_cache {cache} --cache_version {release} --species {species} --assembly {build}" ).format(cache=cache, release=release, build=build, species=species) shell( "(bcftools view {snakemake.input.calls} | " "vep {extra} {fork} " "--format vcf " "--vcf " "{cache} " "{gff} " "{fasta} " "--dir_plugins {plugins} " "{load_plugins} " "--output_file STDOUT " "--stats_file {stats} | " "bcftools view -O{fmt} > {snakemake.output.calls}) {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2020, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" from pathlib import Path from snakemake.shell import shell extra = snakemake.params.get("extra", "") log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "vep_install --AUTO cf " "--SPECIES {snakemake.params.species} " "--ASSEMBLY {snakemake.params.build} " "--VERSION {snakemake.params.release} " "--CACHEDIR {snakemake.output} " "--CONVERT " "--NO_UPDATE " "{extra} {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2020, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" import sys from pathlib import Path from urllib.request import urlretrieve from zipfile import ZipFile from tempfile import NamedTemporaryFile if snakemake.log: sys.stderr = open(snakemake.log[0], "w") outdir = Path(snakemake.output[0]) outdir.mkdir() with NamedTemporaryFile() as tmp: urlretrieve( "https://github.com/Ensembl/VEP_plugins/archive/release/{release}.zip".format( release=snakemake.params.release ), tmp.name, ) with ZipFile(tmp.name) as f: for member in f.infolist(): memberpath = Path(member.filename) if len(memberpath.parts) == 1: # skip root dir continue targetpath = outdir / memberpath.relative_to(memberpath.parts[0]) if member.is_dir(): targetpath.mkdir() else: with open(targetpath, "wb") as out: out.write(f.read(member.filename)) |
25 26 | wrapper: "0.74.0/bio/vep/annotate" |
10 11 | shell: "bedextract {wildcards.contig} {input} > {output}" |
32 33 | wrapper: "0.59.0/bio/gatk/haplotypecaller" |
46 47 | wrapper: "0.74.0/bio/gatk/combinegvcfs" |
60 61 | wrapper: "0.74.0/bio/gatk/genotypegvcfs" |
73 74 | wrapper: "0.74.0/bio/picard/mergevcfs" |
11 12 | wrapper: "0.59.0/bio/gatk/selectvariants" |
25 26 | wrapper: "0.74.0/bio/gatk/variantfiltration" |
38 39 | wrapper: "0.74.0/bio/gatk/variantrecalibrator" |
55 56 | wrapper: "0.74.0/bio/picard/mergevcfs" |
11 12 | wrapper: "0.74.0/bio/trimmomatic/se" |
29 30 | wrapper: "0.74.0/bio/trimmomatic/pe" |
47 48 | wrapper: "0.74.0/bio/bwa/mem" |
61 62 | wrapper: "0.74.0/bio/picard/markduplicates" |
81 82 | wrapper: "0.74.0/bio/gatk/baserecalibrator" |
100 101 | wrapper: "0.74.0/bio/gatk/applybqsr" |
111 112 | wrapper: "0.74.0/bio/samtools/index" |
9 10 | wrapper: "0.74.0/bio/fastqc" |
20 21 | wrapper: "0.74.0/bio/samtools/stats" |
42 43 | wrapper: "0.74.0/bio/multiqc" |
12 13 | wrapper: "0.74.0/bio/reference/ensembl-sequence" |
24 25 | wrapper: "0.74.0/bio/samtools/faidx" |
38 39 | shell: "samtools dict {input} > {output} 2> {log} " |
56 57 | wrapper: "0.74.0/bio/reference/ensembl-variation" |
70 71 | shell: "rbt vcf-fix-iupac-alleles < {input} | bcftools view -Oz > {output}" |
84 85 | wrapper: "0.74.0/bio/tabix" |
98 99 | wrapper: "0.74.0/bio/bwa/index" |
111 112 | wrapper: "0.74.0/bio/vep/cache" |
122 123 | wrapper: "0.74.0/bio/vep/plugins" |
14 15 16 17 | shell: "(bcftools view --apply-filters PASS --output-type u {input} | " "rbt vcf-to-txt -g --fmt DP AD --info ANN | " "gzip > {output}) 2> {log}" |
36 37 | script: "../scripts/plot-depths.py" |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 | import sys sys.stderr = open(snakemake.log[0], "w") import common import matplotlib.pyplot as plt import pandas as pd import numpy as np import seaborn as sns calls = pd.read_table(snakemake.input[0], header=[0, 1]) samples = [name for name in calls.columns.levels[0] if name != "VARIANT"] sample_info = calls.loc[:, samples].stack([0, 1]).unstack().reset_index(1, drop=False) sample_info = sample_info.rename({"level_1": "sample"}, axis=1) sample_info = sample_info[sample_info["DP"] > 0] sample_info["freq"] = sample_info["AD"] / sample_info["DP"] sample_info.index = np.arange(sample_info.shape[0]) plt.figure() sns.stripplot(x="sample", y="freq", data=sample_info, jitter=True) plt.ylabel("allele frequency") plt.xticks(rotation="vertical") plt.savefig(snakemake.output.freqs) plt.figure() sns.stripplot(x="sample", y="DP", data=sample_info, jitter=True) plt.ylabel("read depth") plt.xticks(rotation="vertical") plt.savefig(snakemake.output.depths) |
Support
Do you know this workflow well? If so, you can
request seller status , and start supporting this workflow.
Created: 10mo ago
Updated: 10mo ago
Maitainers:
public
URL:
https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling
Name:
dna-seq-gatk-variant-calling
Version:
v2.1.1
Other Versions:
Accessed: 4
Downloaded:
0
Copyright:
Public Domain
License:
MIT License
Keywords:
- Future updates
Related Workflows
![psychip_snakemake](https://media.bioworkflows.com/bioworkflows/public_preview/791b56ccab3630bcb1746d66097f813640394655ed21d3c39cb591b7016d2994.jpg)
ENCODE pipeline for histone marks developed for the psychENCODE project
psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project.
The o...
![ncov_2](https://media.bioworkflows.com/bioworkflows/public_preview/16bd34a67b114b98942f09d5bcfb47c01d0e12aaccac38e15c9bbcc5856d6139.jpg)
Near-real time tracking of SARS-CoV-2 in Connecticut
Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...
![cellranger-snakemake-gke](https://media.bioworkflows.com/bioworkflows/public_preview/1045115046015ac39a849801172555df3b41ec5023f7f7c8230f0640b8f89a2b.jpg)
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...
![atlas](https://media.bioworkflows.com/bioworkflows/public_preview/2e9488df62bdb57adbd8f8a137263f57f970f5571566588f6242d63f73c67151.jpg)
ATLAS - Three commands to start analyzing your metagenome data
Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...
raw sequence reads
Genome assembly
Annotation track
checkm2
gunc
prodigal
snakemake-wrapper-utils
MEGAHIT
Atlas
BBMap
Biopython
BioRuby
Bwa-mem2
cd-hit
CheckM
DAS
Diamond
eggNOG-mapper v2
MetaBAT 2
Minimap2
MMseqs
MultiQC
Pandas
Picard
pyfastx
SAMtools
SemiBin
Snakemake
SPAdes
SqueezeMeta
TADpole
VAMB
CONCOCT
ete3
gtdbtk
h5py
networkx
numpy
plotly
psutil
utils
metagenomics
![rna-seq-star-deseq2](https://media.bioworkflows.com/bioworkflows/public_preview/2b57d357da619c31943bbffc57e2cc955325669933c9b5b91e33b81fd78c18d4.jpg)
RNA-seq workflow using STAR and DESeq2
This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...
![armor](https://media.bioworkflows.com/bioworkflows/public_preview/cc7665c55e38e46bd0aba5fbf5b6c09f9be831437a05f091d0e4ec7b56ab42a9.jpg)
Light-weight Snakemake workflow for preprocessing and statistical analysis of RNA-seq data
ARMOR ( A utomated R eproducible MO dular R NA-seq) is a Snakemake workflow , aimed at performing a ty...