Nextstrain build for mpox virus


This is the Nextstrain build for MPXV (mpox virus). Output from this build is visible at nextstrain.org/monkeypox. The lineages within the recent mpox outbreaks in humans are defined in a separate lineage-designation repository.

Usage

Provision input data

Input sequences and metadata can be retrieved from data.nextstrain.org.

Note that these data are generously shared by many labs around the world. If you analyze these data and plan to publish, please contact these labs first.

Within the analysis pipeline, these data are fetched from data.nextstrain.org and written to data/ with:

nextstrain build --docker . data/sequences.fasta data/metadata.tsv

Run analysis pipeline

Run the pipeline to produce the "overview" tree for /monkeypox/mpxv with:

nextstrain build --docker --cpus 1 . --configfile config/config_mpxv.yaml

Run the pipeline to produce the "outbreak" tree for /monkeypox/hmpxv1 with:

nextstrain build --docker --cpus 1 . --configfile config/config_hmpxv1.yaml

Adjust the number of CPUs to what your machine has available to speed up alignment and tree building.

Deploying

Run the Python script scripts/deploy.py to deploy the staging build to production.

This will also automatically create a dated build where each node has a unique (random) ID so it can be targeted in shared links/narratives.

python scripts/deploy.py --build-names hmpxv1 mpxv

If a dated build already exists, it is not overwritten by default. To overwrite, pass -f.

To deploy a local build to staging, use the --staging flag.

To skip deploying a dated build to production, add the --no-dated flag.
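
For example, to push a local build to staging without creating a dated production build, the flags documented above can be combined:

python scripts/deploy.py --build-names hmpxv1 mpxv --staging --no-dated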

Visualize results

View results with:

nextstrain view auspice/

Configuration

Configuration takes place in config/config.yml by default. The analysis pipeline is contained in workflow/snakemake_rules/core.smk. It can be read top to bottom: each rule specifies its file inputs and outputs and pulls its parameters from config. There is little indirection, so each rule can be reasoned about on its own.
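
As a rough sketch of that pattern (the keys below are illustrative, not the build's actual schema), a rule parameter resolves from the YAML config like this:

import yaml  # pip install pyyaml

# Stand-in for config/config_mpxv.yaml; keys here are invented for illustration.
config = yaml.safe_load("""
min_date: 2017-01-01
min_length: 100000
""")

# Inside a rule, `params: min_length=config["min_length"]` evaluates to:
print(config["min_length"])  # 100000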

Data use

We gratefully acknowledge the authors, originating and submitting laboratories of the genetic sequences and metadata for sharing their work. Please note that although data generators have generously shared data in an open fashion, that does not mean there should be free license to publish on this data. Data generators should be cited where possible and collaborations should be sought in some circumstances. Please try to avoid scooping someone else's work. Reach out if uncertain.

Installation

Follow the standard installation instructions for Nextstrain's suite of software tools.

If you don't use the Nextstrain CLI but a custom conda environment, make sure tsv-utils and seqkit are installed, e.g. using:

conda install -c bioconda tsv-utils seqkit

Nextstrain build vs Snakemake

The above commands use the Nextstrain CLI (nextstrain build) with Docker to run Nextalign v2. Alternatively, if you install Nextalign/Nextclade v2 locally, you can run the pipeline directly with Snakemake:

snakemake -j 1 -p --configfile config/config_mpxv.yaml
snakemake -j 1 -p --configfile config/config_hmpxv1.yaml

Update colors to include new countries

Update colors_hmpxv1.tsv to group countries by region, based on the countries present in its metadata.tsv:

python3 scripts/update_colours.py --colors config/colors_hmpxv1.tsv \
 --metadata results/hmpxv1/metadata.tsv --output config/colors_hmpxv1.tsv

and similarly update colors_mpxv.tsv:

python3 scripts/update_colours.py --colors config/colors_mpxv.tsv \
 --metadata results/mpxv/metadata.tsv --output config/colors_mpxv.tsv

Update example data

Example data is used by CI. It can also be used as a small subset of real-world data.

Example data should be updated every time the metadata schema changes or a new clade/lineage emerges. To update, run:

nextstrain build --docker . update_example_data -F

Code Snippets

import argparse
import pandas as pd

# Forced colours MUST NOT appear in the ordering TSV
forced_colors = {
}

if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description="Assign colors based on ordering",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )

    parser.add_argument('--ordering', type=str, required=True, help="input ordering file")
    parser.add_argument('--color-schemes', type=str, required=True, help="input color schemes file")
    parser.add_argument('--metadata', type=str, help="if provided, restrict colors to only those found in metadata")
    parser.add_argument('--output', type=str, required=True, help="output colors tsv")
    args = parser.parse_args()

    assignment = {}
    with open(args.ordering) as f:
        for line in f.readlines():
            array = line.lstrip().rstrip().split("\t")
            if len(array) == 2:
                name = array[0]
                trait = array[1]
                if name not in assignment:
                    assignment[name] = [trait]
                else:
                    assignment[name].append(trait)

    # if metadata supplied, go through and
    # 1. remove assignments that don't exist in metadata
    # 2. remove assignments that have 'focal' set to 'False' in metadata
    if args.metadata:
        metadata = pd.read_csv(args.metadata, delimiter='\t')
        for name, trait in assignment.items():
            # Items not to exclude if not (yet) present in metadata to solve bootstrapping issue
            if name in metadata and name not in ['clade_membership', 'outbreak', 'lineage']:
                subset_present = [x for x in assignment[name] if x in metadata[name].unique()]
                assignment[name] = subset_present
            if name in metadata and 'focal' in metadata:
                focal_list = metadata.loc[metadata['focal'] == True, name].unique()
                subset_focal = [x for x in assignment[name] if x in focal_list]
                assignment[name] = subset_focal

    schemes = {}
    counter = 0
    with open(args.color_schemes) as f:
        for line in f.readlines():
            counter += 1
            array = line.lstrip().rstrip().split("\t")
            schemes[counter] = array

    with open(args.output, 'w') as f:
        for trait_name, trait_array in assignment.items():
            if len(trait_array)==0:
                print(f"No traits found for {trait_name}")
                continue
            if len(schemes) < len(trait_array):
                print(f"WARNING: insufficient colours available for trait {trait_name} - reusing colours!")
                remain = len(trait_array)
                color_array = []
                while remain > 0:
                    if remain > len(schemes):
                        color_array = [*color_array, *schemes[len(schemes)]]
                        remain -= len(schemes)
                    else:
                        color_array = [*color_array, *schemes[remain]]
                        remain = 0
            else:
                color_array = schemes[len(trait_array)]
            extra_trait_values = list(forced_colors.get(trait_name, {}).keys())
            extra_color_values = list(forced_colors.get(trait_name, {}).values())

            zipped = list(zip(trait_array+extra_trait_values, color_array+extra_color_values))
            for trait_value, color in zipped:
                f.write(trait_name + "\t" + trait_value + "\t" + color + "\n")
            f.write("\n")
import json, argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="remove time info",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "--input-node-data", type=str, required=True, help="input data"
    )
    parser.add_argument(
        "--output-node-data",
        type=str,
        metavar="JSON",
        required=True,
        help="output Auspice JSON",
    )
    args = parser.parse_args()
    with open(args.input_node_data) as fh:
        data = json.load(fh)
    new_node_data = {}
    for name, node in data["nodes"].items():
        old_clade_name = node["clade_membership"]
        outbreak_name = ""
        lineage_name = ""

        # if it starts with clade -> it's a clade
        # if it starts with outbreak -> it's outbreak, need to look up clade
        # if it starts with lineage -> it's clade IIb, outbreak hMPXV-1
        if old_clade_name.startswith("clade"):
            clade_name = old_clade_name.split()[1]
        # Need to set up clade dictionary for when we have other outbreaks
        # if old_clade_name.startswith('outbreak'):
        #     outbreak_name = old_clade_name.split()[1]
        #     clade_name = clade[outbreak_name]
        elif old_clade_name.startswith("outgroup"):
            clade_name = "outgroup"
        else:
            clade_name = "IIb"
            outbreak_name = "hMPXV-1"
            lineage_name = old_clade_name

        new_node_data[name] = {
            "clade_membership": clade_name,
            "outbreak": outbreak_name,
            "lineage": lineage_name,
        }
        if "clade_annotation" in node:
            new_node_data[name]["clade_annotation"] = node["clade_annotation"]
            if node["clade_annotation"] == "A":
                new_node_data[name]["clade_annotation"] = "hMPXV-1 A"

    data["nodes"] = new_node_data
    with open(args.output_node_data, "w") as fh:
        json.dump(data, fh)
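
For a few illustrative clade_membership values (names here are examples only), the branching above produces:

# clade_membership value -> new node data
examples = {
    "clade IIa": {"clade_membership": "IIa",      "outbreak": "",        "lineage": ""},
    "outgroup":  {"clade_membership": "outgroup", "outbreak": "",        "lineage": ""},
    "B.1":       {"clade_membership": "IIb",      "outbreak": "hMPXV-1", "lineage": "B.1"},
}
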
import argparse
from datetime import datetime
from augur.io import read_metadata
import json

## Script originally from https://github.com/nextstrain/ncov/blob/master/scripts/construct-recency-from-submission-date.py

def get_recency(date_str, ref_date):
    date_submitted = datetime.strptime(date_str, '%Y-%m-%d').toordinal()
    ref_day = ref_date.toordinal()

    delta_days = ref_day - date_submitted
    if delta_days<=0:
        return 'New'
    elif delta_days<3:
        return '1-2 days ago'
    elif delta_days<8:
        return '3-7 days ago'
    elif delta_days<15:
        return 'One week ago'
    elif delta_days<31:
        return 'One month ago'
    elif delta_days < 121:
        return '1-3 months ago'
    elif delta_days < 365:
        return '3-12 months ago'
    elif delta_days < 365*4:
        return '1-3 years ago'
    elif delta_days < 365*16:
        return '3-15 years ago'
    else:
        return 'Older than 15 years'

if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description="Assign each sequence a field that specifies when it was added",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )

    parser.add_argument('--metadata', type=str, required=True, help="metadata file")
    parser.add_argument('--metadata-id-columns', nargs="+", help="names of possible metadata columns containing identifier information, ordered by priority. Only one ID column will be inferred.")
    parser.add_argument('--output', type=str, required=True, help="output json")
    args = parser.parse_args()

    meta = read_metadata(args.metadata, id_columns=args.metadata_id_columns).to_dict(orient="index")

    node_data = {'nodes':{}}
    ref_date = datetime.now()

    for strain, d in meta.items():
        if 'date_submitted' in d and d['date_submitted'] and d['date_submitted'] != "undefined":
            node_data['nodes'][strain] = {'recency': get_recency(d['date_submitted'], ref_date)}

    with open(args.output, 'wt') as fh:
        json.dump(node_data, fh)
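
A quick sanity check of the bucketing, a sketch assuming get_recency from the script above is in scope:

from datetime import datetime, timedelta

ref = datetime(2023, 1, 31)
for days in (0, 2, 7, 14, 30, 120, 300, 2*365, 10*365, 20*365):
    date_str = (ref - timedelta(days=days)).strftime('%Y-%m-%d')
    print(days, get_recency(date_str, ref))
# 0 -> New, 2 -> 1-2 days ago, ..., 7300 -> Older than 15 years
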
from collections import defaultdict
import argparse
from treetime import TreeAnc
from Bio import Phylo

if __name__=="__main__":
    parser = argparse.ArgumentParser(
        description="remove time info",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )

    parser.add_argument('--alignment', type=str, required=True, help="input sequences")
    parser.add_argument('--input-tree', type=str, required=True, help="input nwk")
    parser.add_argument('--root', type=str, required=False, help="root node")
    parser.add_argument('--output', type=str, required=True, help="output nwk")
    args = parser.parse_args()

    T = Phylo.read(args.input_tree, 'newick')

    if args.root:
        T.root_with_outgroup(args.root)
    else:
        T.root_at_midpoint()

    tt = TreeAnc(tree=T, aln=args.alignment, gtr='JC69')
    tt.optimize_tree(prune_short=True)

    # make list of mutations that are phylogenetically informative (not gaps or N)
    for n in T.find_clades():
        n.relevant_mutations = set()
        for mut in n.mutations:
            if (mut[0] in 'ACGT') and (mut[2] in 'ACGT'):
                n.relevant_mutations.add(mut)

    print(f"### Checking for immediate reversions\n")
    reversions = list()
    for clade in T.find_clades():
        for child in clade.clades:
            if child.is_terminal():
                continue
            for grandchild in child.clades:
                if grandchild.is_terminal():
                    continue
                # Check if one of grandchild mutation reverts one of child
                for mut_child in child.relevant_mutations:
                    for mut_grandchild in grandchild.relevant_mutations:
                        if mut_child[1] == mut_grandchild[1] and mut_child[2] == mut_grandchild[0]:
                            reversions.append(
                                {
                                    "parent": clade,
                                    "child": child,
                                    "grandchild": grandchild,
                                    "mut_child": mut_child,
                                    "mut_grandchild": mut_grandchild
                                }
                            )
                            print(f"Below {clade}: {mut_child} in {child.name} reverted in {grandchild.name}")

    for reversion in reversions:
        # Remove reversion from grandchild
        reversion["grandchild"].relevant_mutations.remove(reversion["mut_grandchild"])
        # Remove grandchild from child
        reversion["child"].clades.remove(reversion["grandchild"])
        # If there are mutations, add grandchild as child of parent
        if reversion["grandchild"].relevant_mutations != reversion["parent"].relevant_mutations:
            reversion["parent"].clades.append(reversion["grandchild"])
        else:
            # Otherwise add grandchild clades to parent
            reversion["parent"].clades.extend(reversion["grandchild"].clades)

    # find mutations that occur multiple times in branches leading to children of a node.
    # use these mutations to group clades to merge later.
    max_iter = 5
    for ii in range(max_iter):
        print(f"###\nIteration: {ii+1}\n")
        nodes_to_merge = defaultdict(list)
        for n in T.get_nonterminals():
            shared_mutations = defaultdict(list)
            for c in n:
                for mut in c.relevant_mutations:
                    shared_mutations[mut].append(c)

            for mut in shared_mutations:
                if len(shared_mutations[mut])>1:
                    nodes_to_merge[(n,tuple(shared_mutations[mut]))].append(mut)

        if len(nodes_to_merge)==0:
            print("No more shared mutations -- breaking out of loop.")
            break

        already_touched = set()
        for (parent, children), mutations in sorted(nodes_to_merge.items(), key=lambda x:len(x[1]), reverse=True):
            if any([c in already_touched for c in children]):
                continue

            print("####\nmerging clades:\n\t", '\n\t'.join([f"{c.name} with mutations {c.relevant_mutations} and {c.count_terminals()} tips" for c in children]))
            print("shared mutations:", mutations)
            print("\n")

            parent.clades = [c for c in parent if c not in children]
            new_clade = Phylo.BaseTree.Clade(branch_length=tt.one_mutation*len(mutations))
            new_clade.relevant_mutations = set(mutations)
            for c in children:
                left_over_mutations = c.relevant_mutations.difference(mutations)
                if len(left_over_mutations):
                    c.relevant_mutations = left_over_mutations
                    c.branch_length = tt.one_mutation*len(c.relevant_mutations)
                    new_clade.clades.append(c)
                else:
                    new_clade.clades.extend(c.clades)
                already_touched.add(c)

            parent.clades.append(new_clade)

    Phylo.write(T, args.output, 'newick')
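
To illustrate just the rooting step in isolation, a minimal Biopython sketch on a toy tree (no alignment or TreeAnc involved):

from io import StringIO
from Bio import Phylo

T = Phylo.read(StringIO("((A:1,B:1):1,(C:1,D:1):1);"), "newick")

T.root_with_outgroup("A")   # what passing --root A does
# T.root_at_midpoint()      # the fallback when --root is not given
Phylo.draw_ascii(T)
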
from collections import defaultdict
import json, argparse




if __name__=="__main__":
    parser = argparse.ArgumentParser(
        description="calculate mutation context json",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )

    parser.add_argument('--tree', type=str, required=True, help="tree file")
    parser.add_argument('--mutations', type=str, required=True, help="mutations")
    parser.add_argument('--output', type=str, metavar="JSON", required=True, help="output Auspice JSON")
    args = parser.parse_args()


    with open(args.mutations) as fh:
        data = json.load(fh)['nodes']


    terminal_muts = defaultdict(lambda: defaultdict(int))
    internal_muts = defaultdict(lambda: defaultdict(int))

    node_data = {}

    for name, node in data.items():
        GA_count = 0
        CT_count = 0
        total_muts = 0
        for mut in node["muts"]:
            a, pos, d = mut[0], int(mut[1:-1]), mut[-1]
            if a in 'ACGT' and d in 'ACGT':
                total_muts += 1
                if a+d == 'GA':
                    GA_count += 1
                elif a+d == 'CT':
                    CT_count += 1
        GA_CT_count = GA_count + CT_count
        if total_muts:
            node_data[name] = {"GA_CT_fraction": GA_CT_count/total_muts}
        else:
            node_data[name] = {"GA_CT_fraction": None }


        dinuc_count = 0
        if GA_CT_count:
            #node_data[name]["CT_fraction"] = CT_count/GA_CT_count
            for mut in node["muts"]:
                a, pos, d = mut[0], int(mut[1:-1]), mut[-1]
                if a in 'ACGT' and d in 'ACGT':
                    if a+d == 'GA' and node['sequence'][pos]=='A':
                        dinuc_count+=1
                    elif a+d == 'CT' and node['sequence'][pos-2]=='T':
                        dinuc_count+=1
            node_data[name]["dinuc_context_fraction"] = dinuc_count/GA_CT_count
        else:
            node_data[name]["dinuc_context_fraction"] = None
            #node_data[name]["CT_fraction"] = None

    with open(args.output, 'w') as fh:
        json.dump({"nodes":node_data}, fh)
import argparse
import json
from collections import defaultdict


def sample_date(node):
    """
    Returns the sample date in numeric form.
    In the future, we could examine the 'raw_date' attr here to decide whether to ignore
    some sequences, as 'numdate' is the inferred (timetree) date which can hide
    uncertainty in the actual sampling date.
    """
    if "raw_date" not in node: # internal node or tip with no date info
        return
    return node['numdate']



if __name__=="__main__":
    parser = argparse.ArgumentParser(
        description="remove time info",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )

    parser.add_argument('--input-node-data', type=str, required=True, help="input data")
    parser.add_argument('--output-node-data', type=str, metavar="JSON", required=True, help="output Auspice JSON")
    args = parser.parse_args()


    with open(args.input_node_data) as fh:
        data = json.load(fh)

    new_node_data = {}

    for name, node in data["nodes"].items():
        try:
            new_node_data[name] = {
                "mutation_length": node["mutation_length"],
                "branch_length": node["branch_length"]
            }
            sdate = sample_date(node)
            if sdate:
                new_node_data[name]["sample_date"] = sdate
        except KeyError:
            # internal node or tip with no date info
            pass

    data["nodes"] = new_node_data
    with open(args.output_node_data, 'w') as fh:
        json.dump(data, fh)
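
For a single made-up tip, the filtering above reduces the node data to:

node = {"mutation_length": 4, "branch_length": 0.0002,
        "numdate": 2022.45, "raw_date": "2022-06-12"}

kept = {"mutation_length": node["mutation_length"],
        "branch_length": node["branch_length"]}
if "raw_date" in node:              # only tips with a known sampling date
    kept["sample_date"] = node["numdate"]
print(kept)                         # mutation_length, branch_length, sample_date
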
import pandas as pd
import argparse
from Bio import SeqIO

if __name__=="__main__":
    parser = argparse.ArgumentParser(
        description="Reverse-complement reverse-complemented sequence",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )

    parser.add_argument('--metadata', type=str, required=True, help="input metadata")
    parser.add_argument('--sequences', type=str, required=True, help="input sequences")
    parser.add_argument('--output', type=str, required=True, help="output sequences")
    args = parser.parse_args()

    metadata = pd.read_csv(args.metadata, sep='\t')

    # Read in fasta file
    with open(args.sequences, 'r') as f_in:
        with open(args.output, 'w') as f_out:
            for seq in SeqIO.parse(f_in, 'fasta'):
                # Reverse-complement the sequence if metadata flags it as reversed
                try:
                    if metadata.loc[metadata['strain'] == seq.id, 'reverse'].values[0] == True:
                        seq.seq = seq.seq.reverse_complement()
                        print("Reverse-complementing sequence:", seq.id)
                except (IndexError, KeyError):
                    # strain missing from metadata, or no 'reverse' column
                    print("No reverse-complement information for:", seq.id)

                # Write sequences to file
                SeqIO.write(seq, f_out, 'fasta')
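
The same flagging logic on in-memory toy data (the script itself reads FASTA and metadata from disk):

from io import StringIO
import pandas as pd
from Bio import SeqIO

metadata = pd.DataFrame({"strain": ["s1", "s2"], "reverse": [True, False]})
fasta = StringIO(">s1\nAACG\n>s2\nGGCC\n")

for seq in SeqIO.parse(fasta, "fasta"):
    if metadata.loc[metadata["strain"] == seq.id, "reverse"].values[0]:
        seq.seq = seq.seq.reverse_complement()
    print(seq.id, seq.seq)
# s1 CGTT   (reverse-complemented)
# s2 GGCC   (left as-is)
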
import pandas as pd
import json, argparse
from augur.io import read_metadata

def replace_name_recursive(node, lookup):
    if node["name"] in lookup:
        node["name"] = lookup[node["name"]]

    if "children" in node:
        for child in node["children"]:
            replace_name_recursive(child, lookup)

if __name__=="__main__":
    parser = argparse.ArgumentParser(
        description="Swaps out the strain names in the Auspice JSON with the final strain name",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )

    parser.add_argument('--input-auspice-json', type=str, required=True, help="input auspice_json")
    parser.add_argument('--metadata', type=str, required=True, help="input data")
    parser.add_argument('--metadata-id-columns', nargs="+", help="names of possible metadata columns containing identifier information, ordered by priority. Only one ID column will be inferred.")
    parser.add_argument('--display-strain-name', type=str, required=True, help="field to use as strain name in auspice")
    parser.add_argument('--output', type=str, metavar="JSON", required=True, help="output Auspice JSON")
    args = parser.parse_args()

    metadata = read_metadata(args.metadata, id_columns=args.metadata_id_columns)
    name_lookup = {}
    for ri, row in metadata.iterrows():
        strain_id = row.name
        # fall back to the existing ID when the display-name field is empty
        name_lookup[strain_id] = strain_id if pd.isna(row[args.display_strain_name]) else row[args.display_strain_name]

    with open(args.input_auspice_json, 'r') as fh:
        data = json.load(fh)

    replace_name_recursive(data['tree'], name_lookup)

    with open(args.output, 'w') as fh:
        json.dump(data, fh)
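
A toy demo of replace_name_recursive on an Auspice-like tree dict (names invented; the function is defined in the script above):

tree = {"name": "NODE_0000", "children": [
    {"name": "ACC_1"},
    {"name": "ACC_2"},
]}
lookup = {"ACC_1": "isolate-1/2022"}

replace_name_recursive(tree, lookup)
print(tree)
# {'name': 'NODE_0000', 'children': [{'name': 'isolate-1/2022'}, {'name': 'ACC_2'}]}
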
shell:
    """
    cp {input.auspice_json} {output.auspice_json}
    cp {input.root_sequence} {output.root_sequence_json}
    """
shell:
    "rm -rfv {params}"
shell:
    "rm -rfv {params}"
shell:
    """
    augur filter \
        --metadata {input.metadata} \
        --metadata-id-columns {params.strain_id} \
        --sequences {input.sequences} \
        --include-where strain=MK783032 strain=MK783030 \
        --group-by clade lineage \
        --subsample-max-sequences 50 \
        --subsample-seed 0 \
        --output-metadata {output.metadata} \
        --output-sequences {output.sequences}
    """
shell:
    """
    augur filter \
        --sequences {input.sequences} \
        --metadata {input.metadata} \
        --metadata-id-columns {params.strain_id} \
        --exclude {input.exclude} \
        --output-sequences {output.sequences} \
        --output-metadata {output.metadata} \
        --min-date {params.min_date} \
        --min-length {params.min_length} \
        --exclude-where QC_rare_mutations=bad \
        --output-log {output.log}
    """
shell:
    """
    augur filter \
        --sequences {input.sequences} \
        --metadata {input.metadata} \
        --metadata-id-columns {params.strain_id} \
        --output-strains {output.strains} \
        {params.group_by} \
        {params.sequences_per_group} \
        {params.exclude} \
        {params.other_filters} \
        --output-log {output.log}
    """
shell:
    """
    augur filter \
        --metadata-id-columns {params.strain_id} \
        --sequences {input.sequences} \
        --metadata {input.metadata} \
        --exclude-all \
        --include {input.strains} {input.include} \
        --output-sequences {output.sequences} \
        --output-metadata {output.metadata}
    """
shell:
    """
    python3 scripts/reverse_reversed_sequences.py \
        --metadata {input.metadata} \
        --sequences {input.sequences} \
        --output {output}
    """
shell:
    """
    nextalign run \
        --jobs {threads} \
        --reference {input.reference} \
        --genemap {input.genemap} \
        --max-indel {params.max_indel} \
        --seed-spacing {params.seed_spacing} \
        --retry-reverse-complement \
        --output-fasta - \
        --output-insertions {output.insertions} \
        {input.sequences} | seqkit seq -i > {output.alignment}
    """
shell:
    """
    augur mask \
        --sequences {input.sequences} \
        --mask {input.mask} \
        --mask-from-beginning {params.from_start} \
        --mask-from-end {params.from_end} --output {output}
    """
shell:
    """
    augur tree \
        --alignment {input.alignment} \
        --exclude-sites {input.tree_mask} \
        --tree-builder-args="-redo" \
        --output {output.tree} \
        --nthreads {threads}
    """
shell:
    """
    python3 scripts/fix_tree.py \
        --alignment {input.alignment} \
        --input-tree {input.tree} \
        {params.root} \
        --output {output.tree}
    """
shell:
    """
    augur refine \
        --tree {input.tree} \
        --alignment {input.alignment} \
        --metadata {input.metadata} \
        --metadata-id-columns {params.strain_id} \
        --output-tree {output.tree} \
        --timetree \
        --root {params.root} \
        --precision 3 \
        --keep-polytomies \
        {params.clock_rate} \
        {params.clock_std_dev} \
        --output-node-data {output.node_data} \
        --coalescent {params.coalescent} \
        --date-inference {params.date_inference} \
        --date-confidence \
        --clock-filter-iqd {params.clock_filter_iqd}
    """
shell:
    """
    augur ancestral \
        --tree {input.tree} \
        --alignment {input.alignment} \
        --output-node-data {output.node_data} \
        --inference {params.inference}
    """
shell:
    """
    augur translate \
        --tree {input.tree} \
        --ancestral-sequences {input.node_data} \
        --reference-sequence {input.genemap} \
        --output {output.node_data}
    """
shell:
    """
    augur traits \
        --tree {input.tree} \
        --metadata {input.metadata} \
        --metadata-id-columns {params.strain_id} \
        --output {output.node_data} \
        --columns {params.columns} \
        --confidence \
        --sampling-bias-correction {params.sampling_bias_correction}
    """
shell:
    """
    augur clades \
        --tree {input.tree} \
        --mutations {input.nuc_muts} {input.aa_muts} \
        --clades {input.clades} \
        --output-node-data {output.node_data} 2>&1 | tee {log}
    """
shell:
    """
    python scripts/clades_renaming.py \
    --input-node-data {input} \
    --output-node-data {output.node_data}
    """
shell:
    """
    python3 scripts/mutation_context.py \
        --tree {input.tree} \
        --mutations {input.node_data} \
        --output {output.node_data}
    """
shell:
    """
    python3 scripts/remove_timeinfo.py --input-node-data {input} --output-node-data {output}
    """
shell:
    """
    python3 scripts/construct-recency-from-submission-date.py \
        --metadata {input.metadata} \
        --metadata-id-columns {params.strain_id} \
        --output {output} 2>&1
    """
shell:
    """
    python3 scripts/assign-colors.py \
        --ordering {input.ordering} \
        --color-schemes {input.color_schemes} \
        --output {output.colors} \
        --metadata {input.metadata} 2>&1
    """
shell:
    """
    augur export v2 \
        --tree {input.tree} \
        --metadata {input.metadata} \
        --metadata-id-columns {params.strain_id} \
        --node-data {input.branch_lengths} {input.nt_muts} {input.aa_muts} {input.mutation_context} {input.clades} {input.recency}\
        --colors {input.colors} \
        --lat-longs {input.lat_longs} \
        --description {input.description} \
        --auspice-config {input.auspice_config} \
        --include-root-sequence \
        --output {output.auspice_json}
    """
shell:
    """
    python3 scripts/set_final_strain_name.py --metadata {input.metadata} \
            --metadata-id-columns {params.strain_id} \
            --input-auspice-json {input.auspice_json} \
            --display-strain-name {params.display_strain_field} \
            --output {output.auspice_json}
    cp {input.root_sequence} {output.root_sequence}
    """
shell:
    """
    curl https://mpox-lapis.genspectrum.org/v1/sample/fasta --output {output.sequences}
    """
shell:
    """
    curl https://mpox-lapis.genspectrum.org/v1/sample/details?dataFormat=csv | \
        tr -d "\r" |
        sed -E 's/("([^"]*)")?,/\\2\\t/g' > {output.metadata}
    """
shell:
    """
    nextstrain remote upload {params.deploy_url} {input}
    """
shell:
    """
    ./bin/notify-on-deploy {params.deploy_url} {params.slack_ts}
    """
shell:
    """
    curl -fsSL --compressed {params.sequences_url:q} --output {output.sequences}
    curl -fsSL --compressed {params.metadata_url:q} --output {output.metadata}
    """
shell:
    """
    gzip --decompress --keep {input.metadata}
    xz --decompress --keep {input.sequences}
    """
