Snakemake workflow for somatic mutation detection without matched normal samples

public 1yr ago Version: v1.2 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

TOSCA ( T umor O nly S omatic CA lling) is a Snakemake workflow , aimed at performing a somatic variant calling (without matched normal samples) analysis in a reproducible, automated, and open source workflow.

TOSCA consists of a Snakefile , a set of conda environment files ( envs/*.yaml ) a configuration file ( config/config.yaml ) and a set of R scripts. It is able to perform an end-to-end analysis, from raw read files, via quality checks, alignment and variant calling, to functional annotation, databases filtering, tumor purity and ploidy estimation and finally variant classification in whole exome / target sequencing data.

By default, the pipeline performs only mandatory steps shown in the diagram below. However, you can turn on optional rules in the config/config.yaml file.

Advanced use : If you wish to custom your analysis by adding or removing some steps (e.g. VarScan over Mutect2 or Bowtie2 over BWA), you can use the software of your preference provided you have your own script(s), and change some lines within the Snakefile . If you think your "custom rule" might be of interest to a broader audience, let us know by opening an issue.

Using the TOSCA workflow

We assume that you already have conda and Snakemake installed, otherwise you can easily install them with the following commands:

To install conda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html

To install Snakemake via conda: conda install -c conda-forge -c bioconda snakemake snakemake-wrapper-utils mamba

To use TOSCA:
git clone https://github.com/mdelcorvo/TOSCA.git
#edit config and meta file
cd TOSCA && snakemake --use-conda --configfile config/config.yaml

More details about TOSCA workflow can be found in the wiki .

Workflow graph

DAG

Simplified directed acyclic graph (DAG) of the TOSCA workflow. Oval shapes are referred to data (input/output and intermediate files) while rectangles to processes. Input and output files are depicted as green while files independently downloaded by TOSCA are depicted as orange ovals. Red blocks are mandatory rules, while dashed lines and blue blocks are optional rules, controlled in the config.yaml . By default only mandatory rules are executed.

Contributors

Current contributors include:

Marcello Del Corvo

Paper

Main paper describing the tool:

Marcello Del Corvo, Saveria Mazzara, Stefano A Pileri, TOSCA: an automated Tumor Only Somatic CAlling workflow for somatic mutation detection without matched normal samples, Bioinformatics Advances, Volume 2, Issue 1, 2022, vbac070, https://doi.org/10.1093/bioadv/vbac070

Code Snippets

shell:
    "java -jar {params.gatk3} -T CallableLoci -R {input.ref} -I {input.bam} --minDepth {params.minD} -summary {output.out_sum} -o {output.out_bed} -U ALLOW_SEQ_DICT_INCOMPATIBILITY -L {input.bed};" 
    "grep 'CALLABLE' {output.out_bed} > {output.out_filt};"      
    "Rscript --vanilla {input.script} {output.out_filt} {output.out_final}"

SnakeMake From line 20 of rules/calling.smk

shell:
    "gatk Mutect2 -R {input.ref} -I {input.bam} --germline-resource {input.exac} --panel-of-normals {input.pon} -L {input.bed} --f1r2-tar-gz {output.f1r2} -O {output.vcf_raw};"
    "gatk LearnReadOrientationModel -I {output.f1r2} -O {output.rom};"
    "gatk GetPileupSummaries -I {input.bam} -V {input.exac} -L {input.bed} -O {output.getpileupsum};"
    "gatk CalculateContamination -I {output.getpileupsum} -O {output.contamination};"
    "gatk FilterMutectCalls -R {input.ref} --contamination-table {output.contamination} --ob-priors {output.rom} -V {output.vcf_raw} -O {output.vcf_filt};"
    "gatk VariantsToTable -V {output.vcf_raw} -F CHROM -F POS -F TYPE -GF AF --show-filtered true -O {output.af_table};"

SnakeMake gatk From line 48 of rules/calling.smk

shell:
    "curl -k -L  '{params.Cosmic}' > resources/database/{params.build}/COSMIC.vcf.gz; "
    "curl -k -L  '{params.Gen1K}' > resources/database/{params.build}/1000G-phase_3.vcf.gz; "
    "curl -k -L  '{params.ESP}' > resources/database/{params.build}/temp_ESP6500SI.vcf.tar.gz; "
    "curl -k -L  '{params.Clinvar}' > resources/database/{params.build}/ClinVar.vcf.gz; "
    "curl -k -L  '{params.Clinvar}.tbi' > resources/database/{params.build}/ClinVar.vcf.gz.tbi; "

SnakeMake From line 12 of rules/database.smk

run:
    if config["ref"]["build"]=='GRCh38': 
      shell("curl -L  '{params.mappability}_{params.kmer}.bw' > resources/database/{params.build}/raw_mappability_{params.kmer}.bw; ")
    else:
      shell("curl -L  '{params.mappability}{params.kmer}mer.bigWig' > resources/database/{params.build}/raw_mappability_{params.kmer}.bw; ")

SnakeMake From line 26 of rules/database.smk

shell:
    "tar -xzvf resources/database/{params.build}/temp_ESP6500SI.vcf.tar.gz -C resources/database/{params.build}; "
    "Rscript --vanilla {input.script} {params.build}; "
    "vcf-concat resources/database/{params.build}/*.vcf | sort -k1,1V -k2,2n | bgzip -c > resources/database/{params.build}/ESP6500SI.vcf.gz; "
    "rm resources/database/{params.build}/temp_ESP6500SI.vcf.tar.gz; "
    "rm resources/database/{params.build}/*.vcf; "

SnakeMake From line 42 of rules/database.smk

wrapper:
    "v1.23.3/bio/tabix/index" 

SnakeMake From line 56 of rules/database.smk

wrapper:
    "v1.23.3/bio/tabix/index" 

SnakeMake From line 66 of rules/database.smk

wrapper:
    "v1.23.3/bio/tabix/index"     

SnakeMake From line 76 of rules/database.smk

shell:
    "bigWigToBedGraph {input.raw_map} {output.raw_bed}; "
    "Rscript {input.script} {output.raw_bed} {output.bed}; "
    "faidx {input.ref} -i chromsizes > {output.chromsizes}; "
    "bedGraphToBigWig {output.bed} {output.chromsizes} {output.map}; "

SnakeMake bedGraphToBigWig ucsc-bigwigtobedgraph From line 91 of rules/database.smk

shell:
    "snpEff -Xmx{params.mem_gb}g -dataDir $(pwd)/resources/database/snpEff -v -csvStats {output.csv} -s {output.html} {params.ref} {input.vcf} > {output.call}; "

SnakeMake snpEff From line 16 of rules/filtering.smk

wrapper:
    "v1.23.3/bio/snpsift/annotate"

SnakeMake From line 30 of rules/filtering.smk

wrapper:
    "v1.23.3/bio/snpsift/annotate"

SnakeMake From line 44 of rules/filtering.smk

wrapper:
    "v1.23.3/bio/snpsift/annotate"

SnakeMake From line 57 of rules/filtering.smk

wrapper:
    "v1.23.3/bio/snpsift/annotate"

SnakeMake From line 71 of rules/filtering.smk

shell:
    "Rscript --vanilla {input.script} {input.call} {input.database} {output.call} "

SnakeMake From line 86 of rules/filtering.smk

wrapper:
    "v1.23.3/bio/snpsift/annotate"

SnakeMake From line 99 of rules/filtering.smk

shell:   
    "java -jar {params} -B {input.call} {input.bam} > {output.cov}"

SnakeMake From line 110 of rules/filtering.smk

shell:
    "Rscript --vanilla {input.script} {input.vcf} {input.snpeff} {input.gen1k} {input.esp} {input.exac} {input.dbsnp} {input.cosmic} {input.clinvar} {input.cov} {input.af_table} {input.gtf} {params.ref} {params.depth} {params.vaf} {output.call} "

SnakeMake From line 137 of rules/filtering.smk

shell:
    "ls {input.call} > {output.list};ls {input.ps} > {output.list_ps};Rscript {input.script} {output.list} {output.res} {output.list_ps} " if "meta_N" in config else "ls {input.call} > {output.list};Rscript {input.script} {output.list} {output.res} {output.list_ps} "

SnakeMake From line 151 of rules/filtering.smk

wrapper:
    "v1.23.3/bio/reference/ensembl-sequence"

SnakeMake From line 12 of rules/genome.smk

wrapper:
    "v1.23.3/bio/reference/ensembl-annotation"

SnakeMake From line 23 of rules/genome.smk

shell:
    "Rscript --vanilla {input.script} {input.gtf} {output.rdata} "

SnakeMake From line 36 of rules/genome.smk

shell:
    "samtools dict {input} > {output} "

SnakeMake SAMtools From line 46 of rules/genome.smk

wrapper:
    "v1.23.3/bio/bwa/index"        

SnakeMake From line 58 of rules/genome.smk

wrapper:
    "v1.23.3/bio/samtools/faidx"

SnakeMake From line 67 of rules/genome.smk

wrapper:
    "v1.23.3/bio/reference/ensembl-variation"

SnakeMake From line 82 of rules/genome.smk

wrapper:
    "v1.23.3/bio/tabix/index"   

SnakeMake From line 92 of rules/genome.smk

shell:
    "snpEff download -v {params.db} -dataDir $(pwd)/resources/database/snpEff; "

SnakeMake snpEff From line 102 of rules/genome.smk

wrapper:
    "v1.23.3/bio/trimmomatic/pe"

SnakeMake From line 17 of rules/mapping.smk

wrapper:
    "v1.23.3/bio/bwa/mem"

SnakeMake From line 37 of rules/mapping.smk

wrapper:
    "v1.23.3/bio/picard/markduplicates"

SnakeMake From line 54 of rules/mapping.smk

wrapper:
    "v1.23.3/bio/trimmomatic/pe"

SnakeMake From line 13 of rules/normal.smk

wrapper:
    "v1.23.3/bio/bwa/mem"

SnakeMake From line 31 of rules/normal.smk

wrapper:
    "v1.23.3/bio/picard/markduplicates"

SnakeMake From line 46 of rules/normal.smk

shell:
    "gatk Mutect2 -R {input.ref} -I {input.bam} --germline-resource {input.exac} --panel-of-normals {input.pon} -O {output.vcf_raw}; "

SnakeMake gatk From line 61 of rules/normal.smk

shell:
    "bgzip -c {input.vcf} > {output.vcf};"
    "tabix -p vcf {output.vcf}"

SnakeMake tabix From line 71 of rules/normal.smk

shell:
    "ls {input.vcf} > {output.list};"
    "bcftools merge --file-list {output.list} | bcftools view --min-ac=5 > {output.norm};"       

SnakeMake BCFtools From line 83 of rules/normal.smk

shell:
    "bgzip -c {input.vcf} > {output.vcf};"
    "tabix -p vcf {output.vcf}"

SnakeMake tabix From line 94 of rules/normal.smk

shell:
    "Rscript {input.script} {input.baits} {input.ref} {input.mappability} {input.gtf} {params.min_target} {params.min_off_target} {output.intervals} "

SnakeMake From line 15 of rules/purecn.smk

shell:
    "Rscript {input.script} {input.intervals} {input.bam} {output.coverage_raw} {output.coverage_loess} " 

SnakeMake From line 28 of rules/purecn.smk

shell:
    "Rscript {input.script} {input.intervals} {input.bam} {output.coverage_raw} {output.coverage_loess} " 

SnakeMake From line 41 of rules/purecn.smk

shell:
    "ls {input.cov_norm} > {output.list};"
    "Rscript {input.script} {input.vcf} {output.list} {output.rds} {output.rds_bias} {params.ref};"               

SnakeMake From line 57 of rules/purecn.smk

shell:
    "Rscript {input.script} {input.rds} {input.cov_tum} {input.vcf_tum} {input.intervals} {input.rds_bias} {params.ref} {params.sample} {output.result_rds} {output.ps} {output.plot_output} "   

SnakeMake From line 81 of rules/purecn.smk

wrapper:
    "v1.23.3/bio/fastqc"

SnakeMake FastQC From line 12 of rules/qc.smk

wrapper:
    "v1.23.3/bio/samtools/stats"

SnakeMake From line 24 of rules/qc.smk

wrapper:
    "v1.23.3/bio/mosdepth"

SnakeMake mosdepth From line 41 of rules/qc.smk

wrapper:
    "v1.23.3/bio/multiqc"

SnakeMake MultiQC From line 58 of rules/qc.smk

__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2016, Patrik Smeds"
__email__ = "[email protected]"
__license__ = "MIT"

from os.path import splitext

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if len(snakemake.input) == 0:
    raise ValueError("A reference genome has to be provided!")
elif len(snakemake.input) > 1:
    raise ValueError("Only one reference genome can be inputed!")

# Prefix that should be used for the database
prefix = snakemake.params.get("prefix", splitext(snakemake.output.idx[0])[0])

if len(prefix) > 0:
    prefix = "-p " + prefix

# Contrunction algorithm that will be used to build the database, default is bwtsw
construction_algorithm = snakemake.params.get("algorithm", "")

if len(construction_algorithm) != 0:
    construction_algorithm = "-a " + construction_algorithm

shell(
    "bwa index" " {prefix}" " {construction_algorithm}" " {snakemake.input[0]}" " {log}"
)

Python Snakemake BWA From line 1 of index/wrapper.py

__author__ = "Johannes Köster, Julian de Ruiter"
__copyright__ = "Copyright 2016, Johannes Köster and Julian de Ruiter"
__email__ = "[email protected], [email protected]"
__license__ = "MIT"


import tempfile
from os import path
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
from snakemake_wrapper_utils.samtools import get_samtools_opts


# Extract arguments.
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
sort = snakemake.params.get("sorting", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")
samtools_opts = get_samtools_opts(snakemake)
java_opts = get_java_opts(snakemake)


index = snakemake.input.idx
if isinstance(index, str):
    index = path.splitext(snakemake.input.idx)[0]
else:
    index = path.splitext(snakemake.input.idx[0])[0]


# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
    1,
    2,
}:
    raise ValueError("input must have 1 (single-end) or 2 (paired-end) elements")


if sort_order not in {"coordinate", "queryname"}:
    raise ValueError("Unexpected value for sort_order ({})".format(sort_order))


# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":
    # Simply convert to bam using samtools view.
    pipe_cmd = "samtools view {samtools_opts}"

elif sort == "samtools":
    # Add name flag if needed.
    if sort_order == "queryname":
        sort_extra += " -n"

    # Sort alignments using samtools sort.
    pipe_cmd = "samtools sort {samtools_opts} {sort_extra} -T {tmpdir}"

elif sort == "picard":
    # Sort alignments using picard SortSam.
    pipe_cmd = "picard SortSam {java_opts} {sort_extra} --INPUT /dev/stdin --TMP_DIR {tmpdir} --SORT_ORDER {sort_order} --OUTPUT {snakemake.output[0]}"

else:
    raise ValueError(f"Unexpected value for params.sort ({sort})")

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "(bwa mem"
        " -t {snakemake.threads}"
        " {extra}"
        " {index}"
        " {snakemake.input.reads}"
        " | " + pipe_cmd + ") {log}"
    )

Python Snakemake SAMtools Picard snakemake-wrapper-utils From line 1 of mem/wrapper.py

__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path
import re
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)
    # Remove file extension(s) (similar to the internal fastqc approach)
    base = re.sub("\\.gz$", "", base)
    base = re.sub("\\.bz2$", "", base)
    base = re.sub("\\.txt$", "", base)
    base = re.sub("\\.fastq$", "", base)
    base = re.sub("\\.fq$", "", base)
    base = re.sub("\\.sam$", "", base)
    base = re.sub("\\.bam$", "", base)

    return base


# Run fastqc, since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")

Python Snakemake FastQC From line 3 of fastqc/wrapper.py

__author__ = "William Rowell"
__copyright__ = "Copyright 2020, William Rowell"
__email__ = "[email protected]"
__license__ = "MIT"

import sys
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

bed = snakemake.input.get("bed", "")
by = snakemake.params.get("by", "")
if by:
    if bed:
        sys.exit(
            "Either provide a bed input file OR a window size via params.by, not both."
        )
    else:
        by = f"--by {by}"
if bed:
    by = f"--by {bed}"

quantize_out = False
thresholds_out = False
regions_bed_out = False
region_dist_out = False
for file in snakemake.output:
    if ".per-base." in file and "--no-per-base" in extra:
        sys.exit(
            "You asked not to generate per-base output (--no-per-base), but your rule specifies a '.per-base.' output file. Remove one of the two."
        )
    if ".quantized.bed.gz" in file:
        quantize_out = True
    if ".thresholds.bed.gz" in file:
        thresholds_out = True
    if ".mosdepth.region.dist.txt" in file:
        region_dist_out = True
    if ".regions.bed.gz" in file:
        regions_bed_out = True


if by and not regions_bed_out:
    sys.exit(
        "You ask for by-region output. Please also specify *.regions.bed.gz as a rule output."
    )

if by and not region_dist_out:
    sys.exit(
        "You ask for by-region output. Please also specify *.mosdepth.region.dist.txt as a rule output."
    )

if (region_dist_out or regions_bed_out) and not by:
    sys.exit(
        "You specify *.regions.bed.gz and/or *.mosdepth.region.dist.txt as a rule output. You also need to ask for by-region output via 'input.bed' or 'params.by'."
    )

quantize = snakemake.params.get("quantize", "")
if quantize:
    if not quantize_out:
        sys.exit(
            "You ask for quantized output via params.quantize. Please also specify *.quantized.bed.gz as a rule output."
        )
    quantize = f"--quantize {quantize}"

if not quantize and quantize_out:
    sys.exit(
        "The rule has output *.quantized.bed.gz specified. Please also specify params.quantize to actually generate it."
    )


thresholds = snakemake.params.get("thresholds", "")
if thresholds:
    if not thresholds_out:
        sys.exit(
            "You ask for --thresholds output via params.thresholds. Please also specify *.thresholds.bed.gz as a rule output."
        )
    thresholds = f"--thresholds {thresholds}"

if not thresholds and thresholds_out:
    sys.exit(
        "The rule has output *.thresholds.bed.gz specified. Please also specify params.thresholds to actually generate it."
    )


precision = snakemake.params.get("precision", "")
if precision:
    precision = f"MOSDEPTH_PRECISION={precision}"


fasta = snakemake.input.get("fasta", "")
if fasta:
    fasta = f"--fasta {fasta}"


# mosdepth takes additional threads through its option --threads
# One thread for mosdepth
# Other threads are *additional* decompression threads passed to the '--threads' argument
threads = "" if snakemake.threads <= 1 else "--threads {}".format(snakemake.threads - 1)


# named output summary = "*.mosdepth.summary.txt" is required
prefix = snakemake.output.summary.replace(".mosdepth.summary.txt", "")


shell(
    "({precision} mosdepth {threads} {fasta} {by} {quantize} {thresholds} {extra} {prefix} {snakemake.input.bam}) {log}"
)

Python Snakemake mosdepth From line 1 of mosdepth/wrapper.py

__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


extra = snakemake.params.get("extra", "")
# Set this to False if multiqc should use the actual input directly
# instead of parsing the folders where the provided files are located
use_input_files_only = snakemake.params.get("use_input_files_only", False)

if not use_input_files_only:
    input_data = set(path.dirname(fp) for fp in snakemake.input)
else:
    input_data = set(snakemake.input)

output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {extra}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_data}"
    " {log}"
)

Python Snakemake MultiQC From line 3 of multiqc/wrapper.py

__author__ = "Johannes Köster, Christopher Schröder"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import tempfile
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

log = snakemake.log_fmt_shell()

extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)

bams = snakemake.input.bams
if isinstance(bams, str):
    bams = [bams]
bams = list(map("--INPUT {}".format, bams))

if snakemake.output.bam.endswith(".cram"):
    output = "/dev/stdout"
    if snakemake.params.embed_ref:
        view_options = "-O cram,embed_ref"
    else:
        view_options = "-O cram"
    convert = f" | samtools view -@ {snakemake.threads} {view_options} --reference {snakemake.input.ref} -o {snakemake.output.bam}"
else:
    output = snakemake.output.bam
    convert = ""

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "(picard MarkDuplicates"  # Tool and its subcommand
        " {java_opts}"  # Automatic java option
        " {extra}"  # User defined parmeters
        " {bams}"  # Input bam(s)
        " --TMP_DIR {tmpdir}"
        " --OUTPUT {output}"  # Output bam
        " --METRICS_FILE {snakemake.output.metrics}"  # Output metrics
        " {convert} ) {log}"  # Logging
    )

Python Snakemake SAMtools snakemake-wrapper-utils From line 1 of markduplicates/wrapper.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2019, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"

import subprocess
import sys
from pathlib import Path
from snakemake.shell import shell


log = snakemake.log_fmt_shell(stdout=False, stderr=True)


species = snakemake.params.species.lower()
build = snakemake.params.build
release = int(snakemake.params.release)
out_fmt = Path(snakemake.output[0]).suffixes
out_gz = (out_fmt.pop() and True) if out_fmt[-1] == ".gz" else False
out_fmt = out_fmt.pop().lstrip(".")


branch = ""
if release >= 81 and build == "GRCh37":
    # use the special grch37 branch for new releases
    branch = "grch37/"
elif snakemake.params.get("branch"):
    branch = snakemake.params.branch + "/"


flavor = snakemake.params.get("flavor", "")
if flavor:
    flavor += "."


suffix = ""
if out_fmt == "gtf":
    suffix = "gtf.gz"
elif out_fmt == "gff3":
    suffix = "gff3.gz"
else:
    raise ValueError(
        "invalid format specified. Only 'gtf[.gz]' and 'gff3[.gz]' are currently supported."
    )


url = "ftp://ftp.ensembl.org/pub/{branch}release-{release}/{out_fmt}/{species}/{species_cap}.{build}.{release}.{flavor}{suffix}".format(
    release=release,
    build=build,
    species=species,
    out_fmt=out_fmt,
    species_cap=species.capitalize(),
    suffix=suffix,
    flavor=flavor,
    branch=branch,
)


try:
    if out_gz:
        shell("curl -L {url} > {snakemake.output[0]} {log}")
    else:
        shell("(curl -L {url} | gzip -d > {snakemake.output[0]}) {log}")
except subprocess.CalledProcessError as e:
    if snakemake.log:
        sys.stderr = open(snakemake.log[0], "a")
    print(
        "Unable to download annotation data from Ensembl. "
        "Did you check that this combination of species, build, and release is actually provided?",
        file=sys.stderr,
    )
    exit(1)

Python Snakemake From line 1 of ensembl-annotation/wrapper.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2019, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"

import subprocess as sp
import sys
from itertools import product
from snakemake.shell import shell

species = snakemake.params.species.lower()
release = int(snakemake.params.release)
build = snakemake.params.build

branch = ""
if release >= 81 and build == "GRCh37":
    # use the special grch37 branch for new releases
    branch = "grch37/"
elif snakemake.params.get("branch"):
    branch = snakemake.params.branch + "/"

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

spec = ("{build}" if int(release) > 75 else "{build}.{release}").format(
    build=build, release=release
)

suffixes = ""
datatype = snakemake.params.get("datatype", "")
chromosome = snakemake.params.get("chromosome", "")
if datatype == "dna":
    if chromosome:
        suffixes = ["dna.chromosome.{}.fa.gz".format(chromosome)]
    else:
        suffixes = ["dna.primary_assembly.fa.gz", "dna.toplevel.fa.gz"]
elif datatype == "cdna":
    suffixes = ["cdna.all.fa.gz"]
elif datatype == "cds":
    suffixes = ["cds.all.fa.gz"]
elif datatype == "ncrna":
    suffixes = ["ncrna.fa.gz"]
elif datatype == "pep":
    suffixes = ["pep.all.fa.gz"]
else:
    raise ValueError("invalid datatype, must be one of dna, cdna, cds, ncrna, pep")

if chromosome:
    if not datatype == "dna":
        raise ValueError(
            "invalid datatype, to select a single chromosome the datatype must be dna"
        )

spec = spec.format(build=build, release=release)
url_prefix = f"ftp://ftp.ensembl.org/pub/{branch}release-{release}/fasta/{species}/{datatype}/{species.capitalize()}.{spec}"

success = False
for suffix in suffixes:
    url = f"{url_prefix}.{suffix}"

    try:
        shell("curl -sSf {url} > /dev/null 2> /dev/null")
    except sp.CalledProcessError:
        continue

    shell("(curl -L {url} | gzip -d > {snakemake.output[0]}) {log}")
    success = True
    break

if not success:
    if len(suffixes) > 1:
        url = f"{url_prefix}.[{'|'.join(suffixes)}]"
    else:
        url = f"{url_prefix}.{suffixes[0]}"
    print(
        f"Unable to download requested sequence data from Ensembl ({url}). "
        "Please check whether above URL is currently available (might be a temporal server issue). "
        "Apart from that, did you check that this combination of species, build, and release is actually provided?",
        file=sys.stderr,
    )
    exit(1)

Python Snakemake From line 1 of ensembl-sequence/wrapper.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2019, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"

import tempfile
import subprocess
import sys
import os
from snakemake.shell import shell
from snakemake.exceptions import WorkflowError

species = snakemake.params.species.lower()
release = int(snakemake.params.release)
build = snakemake.params.build
type = snakemake.params.type
chromosome = snakemake.params.get("chromosome", "")


branch = ""
if release >= 81 and build == "GRCh37":
    # use the special grch37 branch for new releases
    branch = "grch37/"
elif snakemake.params.get("branch"):
    branch = snakemake.params.branch + "/"

if release < 98 and not branch:
    print("Ensembl releases <98 are unsupported.", file=open(snakemake.log[0], "w"))
    exit(1)

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

if chromosome and type != "all":
    raise ValueError(
        "Parameter chromosome given but chromosome-wise download"
        "is only implemented for type='all'."
    )

if type == "all":
    if species == "homo_sapiens" and release >= 93:
        chroms = (
            list(range(1, 23)) + ["X", "Y", "MT"] if not chromosome else [chromosome]
        )
        suffixes = ["-chr{}".format(chrom) for chrom in chroms]
    else:
        if chromosome:
            raise ValueError(
                "Parameter chromosome given but chromosome-wise download"
                "is only implemented for homo_sapiens in releases >=93."
            )
        suffixes = [""]
elif type == "somatic":
    suffixes = ["_somatic"]
elif type == "structural_variations":
    suffixes = ["_structural_variations"]
else:
    raise ValueError(
        "Unsupported type {} (only all, somatic, structural_variations are allowed)".format(
            type
        )
    )

species_filename = species if release >= 91 else species.capitalize()

urls = [
    "ftp://ftp.ensembl.org/pub/{branch}release-{release}/variation/vcf/{species}/{species_filename}{suffix}.{ext}".format(
        release=release,
        species=species,
        suffix=suffix,
        species_filename=species_filename,
        branch=branch,
        ext=ext,
    )
    for suffix in suffixes
    for ext in ["vcf.gz", "vcf.gz.csi"]
]
names = [os.path.basename(url) for url in urls if url.endswith(".gz")]

try:
    gather = "curl {urls}".format(urls=" ".join(map("-O {}".format, urls)))
    workdir = os.getcwd()
    with tempfile.TemporaryDirectory() as tmpdir:
        if snakemake.input.get("fai"):
            shell(
                "(cd {tmpdir}; {gather} && "
                "bcftools concat -Oz --naive {names} > concat.vcf.gz && "
                "bcftools reheader --fai {workdir}/{snakemake.input.fai} concat.vcf.gz "
                "> {workdir}/{snakemake.output}) {log}"
            )
        else:
            shell(
                "(cd {tmpdir}; {gather} && "
                "bcftools concat -Oz --naive {names} "
                "> {workdir}/{snakemake.output}) {log}"
            )
except subprocess.CalledProcessError as e:
    if snakemake.log:
        sys.stderr = open(snakemake.log[0], "a")
    print(
        "Unable to download variation data from Ensembl. "
        "Did you check that this combination of species, build, and release is actually provided? ",
        file=sys.stderr,
    )
    exit(1)

Python Snakemake BCFtools From line 1 of ensembl-variation/wrapper.py

__author__ = "Michael Chambers"
__copyright__ = "Copyright 2019, Michael Chambers"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell
from snakemake_wrapper_utils.samtools import get_samtools_opts

samtools_opts = get_samtools_opts(
    snakemake, parse_threads=False, parse_write_index=False, parse_output_format=False
)
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell("samtools faidx {samtools_opts} {extra} {snakemake.input[0]} {log}")

Python Snakemake SAMtools snakemake-wrapper-utils From line 1 of faidx/wrapper.py

__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell
from snakemake_wrapper_utils.samtools import get_samtools_opts


bed = snakemake.input.get("bed", "")
if bed:
    bed = "-t " + bed

samtools_opts = get_samtools_opts(
    snakemake, parse_write_index=False, parse_output=False, parse_output_format=False
)


extra = snakemake.params.get("extra", "")
region = snakemake.params.get("region", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)


shell(
    "samtools stats {samtools_opts} {extra} {snakemake.input.bam} {bed} {region} > {snakemake.output} {log}"
)

Python Snakemake SAMtools snakemake-wrapper-utils From line 3 of stats/wrapper.py

__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2020, Dayris Thibault"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

java_opts = get_java_opts(snakemake)

log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params.get("extra", "")
min_threads = 1

incall = snakemake.input["call"]
if snakemake.input["call"].endswith("bcf"):
    min_threads += 1
    incall = "< <(bcftools view {})".format(incall)
elif snakemake.input["call"].endswith("gz"):
    min_threads += 1
    incall = "< <(gunzip -c {})".format(incall)

outcall = snakemake.output["call"]
if snakemake.output["call"].endswith("gz"):
    min_threads += 1
    outcall = "| gzip -c > {}".format(outcall)
elif snakemake.output["call"].endswith("bcf"):
    min_threads += 1
    outcall = "| bcftools view > {}".format(outcall)
else:
    outcall = "> {}".format(outcall)

if snakemake.threads < min_threads:
    raise ValueError(
        "At least {} threads required, {} provided".format(
            min_threads, snakemake.threads
        )
    )

shell(
    "SnpSift annotate"  # Tool and its subcommand
    " {java_opts} {extra}"  # Extra parameters
    " {snakemake.input.database}"  # Path to annotation vcf file
    " {incall} "  # Path to input vcf file
    " {outcall} "  # Path to output vcf file
    " {log}"  # Logging behaviour
)

Python Snakemake BCFtools snakemake-wrapper-utils SnpSift From line 3 of annotate/wrapper.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell("tabix {snakemake.params} {snakemake.input[0]} {log}")

Python Snakemake tabix From line 1 of index/wrapper.py

__author__ = "Johannes Köster, Jorge Langa"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts


# Distribute available threads between trimmomatic itself and any potential pigz instances
def distribute_threads(input_files, output_files, available_threads):
    gzipped_input_files = sum(1 for file in input_files if file.endswith(".gz"))
    gzipped_output_files = sum(1 for file in output_files if file.endswith(".gz"))
    potential_threads_per_process = available_threads // (
        1 + gzipped_input_files + gzipped_output_files
    )
    if potential_threads_per_process > 0:
        # decompressing pigz creates at most 4 threads
        pigz_input_threads = (
            min(4, potential_threads_per_process) if gzipped_input_files != 0 else 0
        )
        pigz_output_threads = (
            (available_threads - pigz_input_threads * gzipped_input_files)
            // (1 + gzipped_output_files)
            if gzipped_output_files != 0
            else 0
        )
        trimmomatic_threads = (
            available_threads
            - pigz_input_threads * gzipped_input_files
            - pigz_output_threads * gzipped_output_files
        )
    else:
        # not enough threads for pigz
        pigz_input_threads = 0
        pigz_output_threads = 0
        trimmomatic_threads = available_threads
    return trimmomatic_threads, pigz_input_threads, pigz_output_threads


def compose_input_gz(filename, threads):
    if filename.endswith(".gz") and threads > 0:
        return "<(pigz -p {threads} --decompress --stdout {filename})".format(
            threads=threads, filename=filename
        )
    return filename


def compose_output_gz(filename, threads, compression_level):
    if filename.endswith(".gz") and threads > 0:
        return ">(pigz -p {threads} {compression_level} > {filename})".format(
            threads=threads, compression_level=compression_level, filename=filename
        )
    return filename


extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
compression_level = snakemake.params.get("compression_level", "-5")
trimmer = " ".join(snakemake.params.trimmer)

# Distribute threads
input_files = [snakemake.input.r1, snakemake.input.r2]
output_files = [
    snakemake.output.r1,
    snakemake.output.r1_unpaired,
    snakemake.output.r2,
    snakemake.output.r2_unpaired,
]

trimmomatic_threads, input_threads, output_threads = distribute_threads(
    input_files, output_files, snakemake.threads
)

input_r1, input_r2 = [
    compose_input_gz(filename, input_threads) for filename in input_files
]

output_r1, output_r1_unp, output_r2, output_r2_unp = [
    compose_output_gz(filename, output_threads, compression_level)
    for filename in output_files
]

shell(
    "trimmomatic PE -threads {trimmomatic_threads} {java_opts} {extra} "
    "{input_r1} {input_r2} "
    "{output_r1} {output_r1_unp} "
    "{output_r2} {output_r2_unp} "
    "{trimmer} "
    "{log}"
)