Multiple Sequence Alignment (MSA) and Phylogeny Construction Workflow with Snakemake

public public 1yr ago 0 bookmarks

This is the template for a new Snakemake workflow. Replace this text with a comprehensive description covering the purpose and domain. Insert your code into the respective folders, i.e. scripts , rules , and envs . Define the entry point of the workflow in the Snakefile and the main configuration in the config.yaml file.

Usage

If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).

Step 1: Obtain a copy of this workflow

  1. Create a new github repository using this workflow as a template .

  2. Clone the newly created repository to your local system, into the place where you want to perform the data analysis.

Step 2: Configure workflow

Configure the workflow according to your needs via editing the files in the config/ folder. Adjust config.yaml to configure the workflow execution, and samples.tsv to specify your sample setup.

Step 3: Install Snakemake

Install Snakemake using conda :

conda create -c bioconda -c conda-forge -n snakemake snakemake

For installation details, see the instructions in the Snakemake documentation .

Step 4: Execute workflow

Activate the conda environment:

conda activate snakemake

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores $N

using $N cores or run it in a cluster environment via

snakemake --use-conda --cluster qsub --jobs 100

Step 5: Investigate results

After successful execution, you can create a self-contained interactive HTML report with all results via:

snakemake --report report.html

This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here .

Step 6: Commit changes

Whenever you change something, don't forget to commit the changes back to your github copy of the repository:

git commit -a
git push

Step 7: Obtain updates from upstream

Whenever you want to synchronize your workflow copy with new developments from upstream, do the following.

  1. Once, register the upstream repository in your local copy: git remote add -f upstream [email protected]:snakemake-workflows/MSA-phylogeny.git or git remote add -f upstream https://github.com/snakemake-workflows/MSA-phylogeny.git if you do not have setup ssh keys.

  2. Update the upstream version: git fetch upstream .

  3. Create a diff with the current version: git diff HEAD upstream/master workflow > upstream-changes.diff .

  4. Investigate the changes: vim upstream-changes.diff .

  5. Apply the modified diff via: git apply upstream-changes.diff .

  6. Carefully check whether you need to update the config files: git diff HEAD upstream/master config . If so, do it manually, and only where necessary, since you would otherwise likely overwrite your settings and samples.

Code Snippets

10
11
12
shell:
    "gff2bed < <(zcat {input.gff} | "
    "awk '{{if ($3 ~ /gene/) print $0}}') > {output.bed} 2> {log}"
25
26
shell:
    "seqtk subseq {input.sequence} {input.bed} > {output.regions} 2> {log}"
39
40
script:
    "../scripts/rename-regions.py"
SnakeMake From line 39 of rules/MSA.smk
53
54
shell:
    "minimap2 -a {input.target} {input.query} -o {output} 2> {log}"
67
68
script:
    "../scripts/extract-sequences.py"
SnakeMake From line 67 of rules/MSA.smk
80
81
shell:
    "cat {input} > {output}"
SnakeMake From line 80 of rules/MSA.smk
93
94
shell:
    "mafft {input} > {output} 2> {log}"
108
109
shell:
    "CIAlign --infile {input} --outfile_stem {params.output_stem} --clean --visualise --interpret"
14
15
16
shell:
    "cp {input.fasta} {output.inter_fasta} && "
    "iqtree -s {output.inter_fasta} > {log} 2>&1"
8
9
script:
    "../scripts/update-sample-sheet.py"
12
13
14
shell:
    "((esearch -db nucleotide -query '{params.accession}' | "
    "efetch -format fasta > {output}) && [ -s {output} ]) 2> {log}"
SnakeMake From line 12 of rules/ref.smk
8
9
wrapper:
    "0.70.0/bio/samtools/index"
19
20
wrapper:
    "0.70.0/bio/samtools/faidx"
32
33
shell:
    "samtools sort -o {output} -O BAM {input}"
45
46
wrapper:
    "0.70.0/bio/tabix"
23
24
wrapper:
    "v1.12.2/bio/bcftools/mpileup"
38
39
wrapper:
    "v1.12.2/bio/bcftools/call"
49
50
wrapper:
    "v1.12.2/bio/vep/plugins"
72
73
wrapper:
    "v1.12.2/bio/vep/annotate"
85
86
script:
    "../scripts/plot-variants.py"
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
import pysam
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

with pysam.AlignmentFile(snakemake.input.bam, "rb") as samfile, open(snakemake.output.seq, "w") as out_fasta:
    for record in samfile.fetch():
        if record.reference_name.endswith(snakemake.wildcards.region):
            out_seq = SeqRecord(
                Seq(record.query_sequence),
                id="%s_%s" %(record.query_name, record.reference_name.split("_")[-1]),
                description=""
            )
            # print(record.reference_name, record.reference_start, record.query_alignment_start, record.query_alignment_end)
            SeqIO.write(out_seq, out_fasta, "fasta")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
from Bio import SeqIO
import pandas as pd

bed = pd.read_csv(snakemake.input.bed, delimiter="\t", header=None)

bed[1] += 1
bed["name"] = bed[0] + ":" + bed[1].astype(str) + "-" + bed[2].astype(str)
bed["feature"] = bed[9].str.extract(r';Name=([A-Z]{1,3}\d*[a-z]{0,2});')[0]
bed.set_index("name", inplace=True)

with open(snakemake.input.regions) as input_handle, open(snakemake.output.renamed_regions, "w") as output_handle:
    for record in SeqIO.parse(input_handle, "fasta"):
        record.id = record.id + "_" + bed.loc[str(record.id)]["feature"]
        record.name = ""
        record.description = ""
        SeqIO.write(record, output_handle, "fasta")
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
import os
from datetime import date, datetime
from os import getcwd, listdir, mkdir, path
from shutil import copy2, move
from Bio import SeqIO

import pandas as pd
from ruamel import yaml  # conda install -c conda-forge ruamel.yaml

# define location of sample sheet and workflow config

def update_sample_sheet():
    print("""
    This function updates the samples.csv file in your config with all .fasta files
    included in data/query and data/reference. Please make sure all query files are
    single fastas per sample. Reference fasta files can be multiple sequenence fasta 
    files. All sequences have to be named properly (≤ ten letters) as some applications
    cutoff sequence names.

    Please make sure to define a tag after the sample sheet is generated
    """)

    config = snakemake.config

    QUERY_PATH = str(config["data-handling"]["data-query"])
    REFERENCE_PATH = str(config["data-handling"]["data-reference"])

    incoming_files = [f for f in listdir(QUERY_PATH)]

    if not incoming_files:
        print("No files in data/query")
        new_files_reference = pd.DataFrame(columns=["sample_name", "file", "type", "tag"])
    else:
        print("Updating sample sheet")
        # create dataframe
        new_files_query = pd.DataFrame(incoming_files, columns=["file"])

        # get id of sample by splitting the file handle
        new_files_query["sample_name"] = new_files_query["file"].apply(
            lambda x: (x.rsplit(".", 1)[0])
        )

        new_files_query["type"] = "query"
        new_files_query["tag"] = "your_tag"
        # add path of file
        new_files_query["file"] = QUERY_PATH + "/" + new_files_query["file"]

        print(new_files_query)
        print("\t{} query samples added".format(len(new_files_query)))

    incoming_files = [f for f in listdir(REFERENCE_PATH)]

    if not incoming_files:
        print("No files in data/reference")
        new_files_reference = pd.DataFrame(columns=["sample_name", "file", "type"])
    else:
        print("Updating sample sheet")
        # create dataframe
        new_files_reference = pd.DataFrame(incoming_files, columns=["file"])

        # get id of sample by splitting the file handle
        new_files_reference["sample_name"] = new_files_reference["file"].apply(
            lambda x: (x.rsplit(".", 1)[0])
        )

        new_files_reference["type"] = "reference"
        new_files_reference["tag"] = "your_tag"
        # add path of file
        new_files_reference["file"] = REFERENCE_PATH + "/" + new_files_reference["file"]

        print(new_files_reference)
        print("\t{} reference file added".format(len(new_files_reference)))

        sample_sheet = pd.concat([new_files_query, new_files_reference])

        sample_sheet.to_csv(snakemake.input[0], index=False, columns=["sample_name", "file", "type", "tag"])


update_sample_sheet()
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
__author__ = "Michael Chambers"
__copyright__ = "Copyright 2019, Michael Chambers"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools faidx {snakemake.params} {snakemake.input[0]} > {snakemake.output[0]}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell("tabix {snakemake.params} {snakemake.input[0]} {log}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell
from snakemake_wrapper_utils.bcftools import get_bcftools_opts


bcftools_opts = get_bcftools_opts(snakemake, parse_ref=False, parse_memory=False)
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)


class CallerOptionError(Exception):
    pass


valid_caller_opts = {"-c", "--consensus-caller", "-m", "--multiallelic-caller"}

caller_opt = snakemake.params.get("caller", "")
if caller_opt.strip() not in valid_caller_opts:
    raise CallerOptionError(
        "bcftools call expects either -m/--multiallelic-caller or "
        "-c/--consensus-caller as caller option."
    )


shell(
    "bcftools call"
    " {bcftools_opts}"
    " {caller_opt}"
    " {extra}"
    " {snakemake.input[0]}"
    " {log}"
)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
__author__ = "Michael Hall"
__copyright__ = "Copyright 2020, Michael Hall"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell
from snakemake_wrapper_utils.bcftools import get_bcftools_opts


extra = snakemake.params.get("extra", "")
bcftools_opts = get_bcftools_opts(
    snakemake, parse_ref=("--no-reference" not in extra), parse_memory=False
)
log = snakemake.log_fmt_shell(stdout=True, stderr=True)


class MissingReferenceError(Exception):
    pass


shell("bcftools mpileup {bcftools_opts} {extra} {snakemake.input[0]} {log}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2020, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"

import os
from pathlib import Path
from snakemake.shell import shell


def get_only_child_dir(path):
    children = [child for child in path.iterdir() if child.is_dir()]
    assert (
        len(children) == 1
    ), "Invalid VEP cache directory, only a single entry is allowed, make sure that cache was created with the snakemake VEP cache wrapper"
    return children[0]


extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

fork = "--fork {}".format(snakemake.threads) if snakemake.threads > 1 else ""
stats = snakemake.output.stats
cache = snakemake.input.get("cache", "")
plugins = snakemake.input.plugins
plugin_aux_files = {"LoFtool": "LoFtool_scores.txt", "ExACpLI": "ExACpLI_values.txt"}

load_plugins = []
for plugin in snakemake.params.plugins:
    if plugin in plugin_aux_files.keys():
        aux_path = os.path.join(plugins, plugin_aux_files[plugin])
        load_plugins.append(",".join([plugin, aux_path]))
    else:
        load_plugins.append(",".join([plugin, snakemake.input.get(plugin.lower(), "")]))
load_plugins = " ".join(map("--plugin {}".format, load_plugins))

if snakemake.output.calls.endswith(".vcf.gz"):
    fmt = "z"
elif snakemake.output.calls.endswith(".bcf"):
    fmt = "b"
else:
    fmt = "v"

fasta = snakemake.input.get("fasta", "")
if fasta:
    fasta = "--fasta {}".format(fasta)

gff = snakemake.input.get("gff", "")
if gff:
    gff = "--gff {}".format(gff)

if cache:
    entrypath = get_only_child_dir(get_only_child_dir(Path(cache)))
    species = (
        entrypath.parent.name[:-7]
        if entrypath.parent.name.endswith("_refseq")
        else entrypath.parent.name
    )
    release, build = entrypath.name.split("_")
    cache = (
        "--offline --cache --dir_cache {cache} --cache_version {release} --species {species} --assembly {build}"
    ).format(cache=cache, release=release, build=build, species=species)

shell(
    "(bcftools view '{snakemake.input.calls}' | "
    "vep {extra} {fork} "
    "--format vcf "
    "--vcf "
    "{cache} "
    "{gff} "
    "{fasta} "
    "--dir_plugins {plugins} "
    "{load_plugins} "
    "--output_file STDOUT "
    "--stats_file {stats} | "
    "bcftools view -O{fmt} > {snakemake.output.calls}) {log}"
)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2020, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"

import sys
from pathlib import Path
from urllib.request import urlretrieve
from zipfile import ZipFile
from tempfile import NamedTemporaryFile

if snakemake.log:
    sys.stderr = open(snakemake.log[0], "w")

outdir = Path(snakemake.output[0])
outdir.mkdir()

with NamedTemporaryFile() as tmp:
    urlretrieve(
        "https://github.com/Ensembl/VEP_plugins/archive/release/{release}.zip".format(
            release=snakemake.params.release
        ),
        tmp.name,
    )

    with ZipFile(tmp.name) as f:
        for member in f.infolist():
            memberpath = Path(member.filename)
            if len(memberpath.parts) == 1:
                # skip root dir
                continue
            targetpath = outdir / memberpath.relative_to(memberpath.parts[0])
            if member.is_dir():
                targetpath.mkdir()
            else:
                with open(targetpath, "wb") as out:
                    out.write(f.read(member.filename))
ShowHide 17 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/IKIM-Essen/MSA-phylogeny
Name: msa-phylogeny
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...