NSGTree: A Fast and Simplified Phylogenetic Tree Construction Pipeline


v0.4.3 (August 19, 2023)

Why use NSGTree (over SGTree)

New Simple Genome Tree (NSGTree) is a computational pipeline for fast and easy construction of phylogenetic trees from a set of user-provided genomes and a set of phylogenetic markers. NSGTree builds a species tree from a concatenated alignment of all proteins identified with the provided marker HMMs, and it also builds a single protein tree for every marker protein that was found. It is a highly simplified version of SGTree; in contrast to SGTree, NSGTree does not currently support marker selection. Initial benchmarking showed NSGTree to be roughly 20-50% faster than SGTree, depending on the number of query faa files and HMMs.

How to run it

Workflow

  • Snakemake in a conda environment

  • Input is a dir with faa files and a single file that contains a set of HMMs to identify markers

snakemake -j 24 --use-conda --config qfaadir="example" models="resources/models/rnapol.hmm" --configfile user_config.yml

or, more generally:

snakemake -j <number of processes> --use-conda --config qfaadir="<dir with query faa>" models="<single file that contains all marker HMMs>" --configfile <user_config.yml with analysis settings>
  • Alternatively, inputs can be a dir with query faa files and a dir with reference faa files; the output dir will be created in the dir with query faa files:
snakemake -j 24 --use-conda --config rfaadir="example_r" qfaadir="example_q" models="resources/models/rnapol.hmm" --configfile user_config.yml
  • Settings for alignment, trimming, and tree building can be changed in user_config.yml (see the example sketch after this list)

  • Several sets of marker HMMs are provided in subdir resources/models/

  • Results can be found in a subdir of the query faa dir, <query faa dir>/nsgt_<analysis name>
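A minimal sketch of such a config; the key names below are inferred from the rule parameters and script arguments shown under Code Snippets, so both names and values are illustrative; consult the user_config.yml shipped with the repository for the authoritative settings:

# illustrative user_config.yml sketch, not the shipped file
minmarker: 0.2        # min fraction of marker HMMs a genome must hit to be kept
maxsdup: 5            # max copy number tolerated for any single marker
maxdupl: 0.5          # max fraction of markers that may occur in multiple copies
mafft: "--auto"       # options passed to mafft for per-marker alignments
trimal_gt: "-gt 0.1"  # gap threshold options passed to trimal
proteintrees: "-lg"   # options for the single protein trees
speciestree: "-lg"    # options for the species tree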

Apptainer / Shifter

  • Apptainer: save the following into a run.sh script; paths to the dirs and the config file are passed as command line arguments. Edit further to specify a different set of HMMs, and adjust the -j flag depending on available resources:
LOCAL_FOLDER_WITH_QFAA_FILES=$1
LOCAL_FOLDER_WITH_RFAA_FILES=$2
USER_CONFIG=$3
apptainer run \
  docker://docker.io/fschulzjgi/nsgtree:0.4.3 \
  snakemake -j 16 \
  --snakefile "/nsgtree/workflow/Snakefile" \
  --use-conda \
  --config \
  qfaadir="$LOCAL_FOLDER_WITH_QFAA_FILES" \
  rfaadir="$LOCAL_FOLDER_WITH_RFAA_FILES" \
  models="/nsgtree/resources/models/rnapol.hmm" \
  --conda-prefix "/nsgtree/.snakemake/conda" \
  --configfile $USER_CONFIG
  • Shifter on NERSC Perlmutter
shifterimg pull fschulzjgi/nsgtree:0.4.3
  • Mount the working directory that contains the model file, query dir, and config file with shifter:
shifter \
  --volume=$(pwd):/nsgtree/example \
  --image=fschulzjgi/nsgtree:0.4.3 \
  bash -c \
  "snakemake --snakefile /nsgtree/workflow/Snakefile \
  -j 24 \
  --use-conda \
  --config \
  qfaadir=/nsgtree/example/test \
  models=/nsgtree/example/rnapol.hmm \
  --conda-prefix /nsgtree/.snakemake/conda \
  --configfile /nsgtree/example/user_config.yml"
  • Specify the ref dir, query dir, and config file with shifter, using the UNI56 models provided by NSGTree:
qfaa=$1
rfaa=$2
configf=$3
shifter \
  --volume=$(pwd):/nsgtree/example \
  --image=fschulzjgi/nsgtree:0.4.3 \
  bash -c \
  "snakemake --snakefile /nsgtree/workflow/Snakefile \
  -j 24 \
  --use-conda \
  --config \
  qfaadir=/nsgtree/example/$qfaa \
  rfaadir=/nsgtree/example/$rfaa \
  models=/nsgtree/resources/models/UNI56.hmm \
  --conda-prefix /nsgtree/.snakemake/conda \
  --configfile /nsgtree/example/$configf"
  • For Docker, the paths to the models dir, query dir, and config file can be mounted separately with the -v flag:
docker pull fschulzjgi/nsgtree:0.4.3
docker run -t -i \
  -v $(pwd):/nsgtree/modelsdir \
  -v $(pwd)/test:/nsgtree/querydir \
  --user $(id -u):$(id -g) \
  fschulzjgi/nsgtree:0.4.3 \
  snakemake --use-conda -j 16 --config qfaadir="querydir" models="/nsgtree/modelsdir/rnapol.hmm"

Acknowledgements

NSGTree was developed by the New Lineages of Life Group at the DOE Joint Genome Institute supported by the Office of Science of the U.S. Department of Energy under contract no. DE-AC02-05CH11231.

Code Snippets

import sys
import tarfile
import os.path
import shutil

resultsdir = sys.argv[1]  # results directory to archive
tarout = sys.argv[2]      # name of the gzipped tar archive to write

def make_tarfile(tarout, resultsdir):
    """Archive resultsdir as tar.gz, then remove the original directory."""
    with tarfile.open(tarout, "w:gz") as tar:
        tar.add(resultsdir, arcname=os.path.basename(resultsdir))
    # remove the directory only after the archive has been written and closed
    try:
        shutil.rmtree(resultsdir)
    except OSError as e:
        print("Error: %s : %s" % (resultsdir, e.strerror))

make_tarfile(tarout, resultsdir)
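This is the cleanup helper invoked at the very end of the workflow as workflow/scripts/cleanup.py (see the final shell rule below); it tars up the results dir and then deletes it. A standalone call, with placeholder arguments, would be:

python workflow/scripts/cleanup.py <resultsdir> <results.tar.gz>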
import glob
import sys
from Bio import SeqIO
from collections import defaultdict

alndir = sys.argv[1]      # dir with aligned multifasta files
aln_concat = sys.argv[2]  # name of the concatenated alignment

def get_aln_len(aln):
    """Length of the first sequence; all sequences in an alignment share it."""
    for seq_record in SeqIO.parse(aln, "fasta"):
        return len(seq_record.seq)

def get_taxa(alndir):
    """Collect all taxon ids (prefix before '|') across all alignments."""
    taxa = set()
    for aln in glob.glob(alndir + "/*.*"):
        for seq_record in SeqIO.parse(aln, "fasta"):
            taxa.add(seq_record.id.split("|")[0])
    return list(taxa)

def get_aln(aln, taxa, final_dict, aln_len):
    """Append each taxon's sequence; pad taxa missing from this alignment with '?'."""
    for taxon in taxa:
        seq = "?" * aln_len
        for seq_record in SeqIO.parse(aln, "fasta"):
            if taxon == seq_record.id.split("|")[0]:
                seq = str(seq_record.seq)
                break
        final_dict[taxon].append(seq)
    return final_dict

final_dict = defaultdict(list)
taxa = get_taxa(alndir)  # compute once instead of per alignment
for aln in glob.glob(alndir + "/*"):
    try:
        aln_len = get_aln_len(aln)
        final_dict = get_aln(aln, taxa, final_dict, aln_len)
    except Exception:
        if aln.split(".")[-1] not in ["faa", "fa", "fasta", "fna", "mafft", "mafft01", "aln", "afa"]:
            print("Check content of dir, all files need to be aligned multifasta with a typical suffix: fna, fasta, fa, faa...")
        else:
            print("Something else is wrong")

# create the concatenated alignment
concat_dict = {x: "".join(y) for x, y in final_dict.items()}
with open(aln_concat, "w") as outfile:
    for seqid, seq in concat_dict.items():
        outfile.write(">" + seqid + "\n" + seq + "\n")
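The species-tree rules below call this script as workflow/scripts/concat.py; it merges the per-marker alignments into one supermatrix, padding taxa that miss a marker with '?' characters. A standalone call, with placeholder arguments:

python workflow/scripts/concat.py <dir with trimmed alignments> <concat.aln>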
import sys
import os.path
import warnings
from ete3 import Tree
warnings.filterwarnings("ignore", ".*is with a literal.*", SyntaxWarning)

treein = sys.argv[1]  # tree in newick format
queryids_or_pattern = sys.argv[2]  # file with leaf names or pattern to look for leaves
hexcolor = sys.argv[3]  # color for identified branches (hex code without '#')
itolout = sys.argv[4]  # name of outfile

def get_alltaxa(t, queryids_or_pattern):
    """Collect target leaves, either from a file of names or by name prefix."""
    target = []
    if os.path.isfile(queryids_or_pattern):
        with open(queryids_or_pattern, "r") as infile:
            target = [line.strip() for line in infile]
    else:
        for node in t.traverse():
            if node.is_leaf() and node.name.startswith(queryids_or_pattern):
                target.append(node.name)
    return target


def unpack_family(parentnode, leaves):
    """
    get all leaves under a parental node
    """
    for child in parentnode.get_children():
        if child.is_leaf() and child.name not in leaves:
            leaves.append(child.name)
        else:
            unpack_family(child, leaves)
    return leaves


def check_monophyly(node, target, mono_leaves):
    """Walk up the tree for as long as all leaves under the parent are targets."""
    # move up one node in the tree
    parentnode = node.up
    # collect all terminal leaves under the parent node
    leaves = unpack_family(parentnode, [])
    if len(leaves) == len([x for x in leaves if x in target]):
        mono_leaves.extend(leaves)
        check_monophyly(parentnode, target, mono_leaves)
    return mono_leaves

def main():
    t = Tree(treein)
    target = get_alltaxa(t, queryids_or_pattern)
    mono_leaves_all = []
    for node in t.traverse():
        if node.name in target and node.name not in [x for sublist in mono_leaves_all for x in sublist]:
            try:
                mono_leaves_all.append(check_monophyly(node, target, []))
            except Exception:
                # node.up is None at the root; treat the query as a singleton
                mono_leaves_all.append([node.name])
    # write iTOL annotation: clades for monophyletic groups, branches for singletons
    queriesinclades = [x for sublist in mono_leaves_all for x in sublist]
    singletons = [x for x in target if x not in queriesinclades]
    with open(itolout, "w") as outfile:
        for monoclade in mono_leaves_all:
            if len(monoclade) > 1:
                outfile.write("|".join(list(set(monoclade))) + ",clade,#" + hexcolor + ",normal,2\n")
        for singleton in singletons:
            outfile.write(singleton + ",branch,#" + hexcolor + ",normal,2\n")

if __name__ == '__main__':
    main()
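The final rule below runs this script as workflow/scripts/ete3_clademembers.py to generate an iTOL annotation that colors monophyletic query clades; mirroring that call, with a placeholder tree name (note that the hex color is passed without the leading '#'):

python workflow/scripts/ete3_clademembers.py <species.treefile> queries.txt cc0000 clademembers_colored.itol.txt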
import sys
from ete3 import Tree
import warnings
warnings.filterwarnings("ignore", ".*is with a literal.*", SyntaxWarning)

def main():
    tree_in = sys.argv[1]     # tree in newick format
    queries_in = sys.argv[2]  # file with query leaf names
    outname = sys.argv[3]     # prefix for the .pairs output

    def clean_name(node):
        """Strip the ASCII-art decorations ete3 adds when stringifying a node."""
        return str(node).replace("--", "").replace("/-", "").replace("\\-", "").replace("\n", "")

    def unpack_family(node, leaves, query, seen, queries):
        """
        Get all non-query leaves under a parental node; if none are found,
        move one node further up and try again.
        """
        for child in node.get_children():
            childname = clean_name(child)
            if child.is_leaf() and childname not in queries:
                seen.append(childname)
                leaves.append(childname)
            elif not child.is_leaf() and childname not in seen:
                seen.append(childname)
                unpack_family(child, leaves, query, seen, queries)
            elif child.is_leaf() and childname == query:
                seen.append(childname)
        if len(leaves) == 0:
            unpack_family(node.up, leaves, query, seen, queries)
        return leaves


    def get_closestrelative(query, leaves, tree):
        """Among the candidate leaves, find the smallest patristic distance to the query."""
        distance = float("inf")  # the original used a fixed cap of 10
        closestrelative = "nd"
        for child in leaves:
            if child != query:
                newdist = tree.get_distance(child, query, topology_only=False)
                if newdist < distance:
                    distance = newdist
                    closestrelative = child
        return closestrelative, distance


    def get_neighbor(t, query, queries):
        for node in t.traverse():
            if node.name == query:
                # collect all terminal non-query leaves under the parent node
                leaves = unpack_family(node, [], query, [], queries)
                closestrelative, distance = get_closestrelative(node.name, leaves, t)
                return [closestrelative, distance]

    # get queries
    with open(queries_in) as infile:
        queries = infile.read().splitlines()

    # closest non-query relative and distance for every query
    t = Tree(tree_in)
    tree_dict_all = {query: get_neighbor(t, query, queries) for query in queries}
    with open(outname + ".pairs", "w") as outfile:
        for query, bestref in tree_dict_all.items():
            if bestref:
                outfile.write(f"{query}\t{bestref[0]}\t{bestref[1]}\n")
            else:
                outfile.write(f"{query}\tnd\tnd\n")


if __name__ == "__main__":
    main()
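The same final rule runs this script as workflow/scripts/ete3_nntree.py to report, for each query leaf, the closest non-query leaf and its patristic distance; results are written to <outname>.pairs. Mirroring that call, with a placeholder tree name:

python workflow/scripts/ete3_nntree.py <species.treefile> queries.txt nn.itol.txt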
import os
from itertools import repeat
import sys
import multiprocessing as mp
from Bio import SeqIO

hmmout = sys.argv[1]        # hmmsearch domtblout
faadir = sys.argv[2]        # dir with reformatted faa files
threads = int(sys.argv[3])
blacklist = sys.argv[4]     # taxa removed during quality filtering
outfaa = sys.argv[5]        # output faa; the file name contains the model name

def export_hits_faa(hit, faadir):
    """Look up the hit in its genome's faa file (genome id is the '|' prefix)."""
    hit_genome_id = hit.split("|")[0]
    faa_file = os.path.join(faadir, hit_genome_id + ".faa")
    for seq_record in SeqIO.parse(faa_file, "fasta"):
        if seq_record.description == hit:
            return (seq_record.description, seq_record.seq)

def get_hits(hmmout, model, removed_taxa):
    """Collect unique hit ids for this model, skipping blacklisted taxa."""
    hits_list = []
    with open(hmmout, "r") as infile:
        for line in infile:
            if not line.startswith("#") and line.split()[0] not in hits_list and line.split()[3] == model:
                if line.split("|")[0] not in removed_taxa:
                    hits_list.append(line.split()[0])
    return hits_list

def get_removed_taxa(blacklist):
    with open(blacklist, "r") as infile:
        return [line.split()[0] for line in infile]

def main():
    removed_taxa = get_removed_taxa(blacklist)
    model = outfaa.split("/")[-1].split(".faa")[0]
    hits_list = get_hits(hmmout, model, removed_taxa)

    with mp.Pool(processes=threads) as pool:
        results = pool.starmap(export_hits_faa, zip(hits_list, repeat(faadir)))

    with open(outfaa, "w") as outfile:
        for hit in results:
            if hit:  # skip hits whose sequence could not be found
                outfile.write(">" + hit[0] + "\n" + str(hit[1]) + "\n")

if __name__ == '__main__':
    main()
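The extraction rule below calls this script as workflow/scripts/extract_qhits.py, once per marker; the output faa name carries the model name, which the script parses back out. A standalone call with placeholder arguments:

python workflow/scripts/extract_qhits.py <hmmsearch.domtblout> <faa dir> <threads> <taxa_removed.txt> <model>.faa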
from scipy.cluster import hierarchy
import sys
import pandas as pd
import glob
from collections import defaultdict
import os

os.environ['QT_QPA_PLATFORM'] = 'offscreen'

queryfaadir = sys.argv[1]
modelscombined = sys.argv[2]  # e.g. pathtofiles + "models/GVOGuni9.hmm"
hmmout = sys.argv[3]  # e.g. pathtofiles + "Fadolivirus_GVOGuni9.out"
countout = sys.argv[4]  # summarized hit counts per model from hmmout
removedtaxa = sys.argv[5]  # list of removed taxa with info on why they were removed
minmarker = float(sys.argv[6])  # fraction of markers that must be present for a genome to be included downstream
maxsdup = int(sys.argv[7])  # max copy number of any marker; if exceeded, the genome is excluded
maxdupl = float(sys.argv[8])  # fraction of markers that may be present in multiple copies
outitol = sys.argv[9]


def get_models(modelsin):
    """
    Generate a list of all model names.
    Some models might have no hits in hmmout, but they should still appear in the count matrix.
    """
    with open(modelsin, "r") as f:
        models = [line.split()[-1].rstrip() for line in f if line.startswith("NAME")]
    return models


def get_taxa(queryfaadir):
    taxa = [taxon.split("/")[-1].split(".")[0] for taxon in glob.glob(queryfaadir + "/*.faa")]
    return taxa


def get_markercompleteness(models, hmmout, query):
    """
    Get copy numbers for each marker.
    """
    # add 0s to include models that are not in hmmout
    count_dict = {x: 0 for x in models}
    seen = []
    with open(hmmout, "r") as f:
        lines = [line.rstrip() for line in f if not line.startswith("#")]
        for line in lines:
            if line.split()[0] not in seen and line.split("|")[0] == query:
                count_dict[line.split()[3]] += 1
                seen.append(line.split()[0])
    return count_dict


def filter_taxa(df, models, minmarker, maxsdup, maxdupl):
    """Quality filtering; note that this caps the counts in df at 1 and adds a 'sum' column."""
    filteredtaxa_dict = defaultdict(list)
    # single-marker copy number threshold
    df['max'] = df.max(axis=1)
    filtered_taxa = df[df['max'] > maxsdup].index
    filteredtaxa_dict.update({taxon: ["maxsdup:{}".format(df.at[taxon, 'max'])] for taxon in filtered_taxa})
    df.drop(['max'], inplace=True, axis=1)
    # fraction of duplicated markers threshold
    df['duplications'] = (df > 1).sum(axis=1)
    filtered_taxa = df[df['duplications'] / len(models) > maxdupl].index
    filteredtaxa_dict.update({taxon: ["maxdupl:{:.4f}".format(df.at[taxon, 'duplications'] / len(models))] for taxon in filtered_taxa})
    df.drop(['duplications'], inplace=True, axis=1)
    # marker completeness threshold
    df[df > 1] = 1
    df['sum'] = df.sum(axis=1)
    filtered_taxa = df[df['sum'] < len(models) * minmarker].index
    filteredtaxa_dict.update({taxon: ["completeness:{:.4f}".format(df.at[taxon, 'sum'] / len(models))] for taxon in filtered_taxa})
    return filteredtaxa_dict


def convert2itol(flabels, outfile):
    with open(outfile, "w") as outfile:
        outfile.write("DATASET_HEATMAP\n"
                      "SEPARATOR COMMA\n"
                      "DATASET_LABEL,Count\n"
                      "COLOR,#ff0000\n"
                      "COLOR_MIN,#ff0000\n"
                      "COLOR_MAX,#0000ff\n"
                      "FIELD_LABELS," + ",".join(flabels) + "\n"
                      "DATA\n"
                      )


def reorder_cols(df):
    """Reorder columns in the count matrix using hierarchical clustering"""
    Z = hierarchy.linkage(df.T, method='ward')
    reordered_ind = hierarchy.leaves_list(Z)
    df = df[df.columns[reordered_ind]]
    return df


def main():
    models = get_models(modelscombined)
    taxa = get_taxa(queryfaadir)

    count_dict = {query: get_markercompleteness(models, hmmout, query) for query in taxa}
    df = pd.DataFrame.from_dict(count_dict).T
    df.to_csv(countout, sep='\t', index=True)

    filteredtaxa_dict = filter_taxa(df, models, minmarker, maxsdup, maxdupl)
    with open(removedtaxa, 'w') as outfile:
        for taxon, reasons in filteredtaxa_dict.items():
            outfile.write(f'{taxon}\t{";".join(reasons)}\n')

    df.drop('sum', axis=1, inplace=True)
    # write the iTOL heatmap with hierarchically clustered columns
    reordered_df = reorder_cols(df)
    convert2itol(list(reordered_df.columns), outitol)
    with open(outitol, "a") as outitolA:
        reordered_df.to_csv(outitolA, header=False, sep=",")

if __name__ == '__main__':
    main()
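The filtering rule below calls this script as workflow/scripts/hmmsearch_count_filter.py; mirroring that call, with placeholder paths and illustrative threshold values for minmarker, maxsdup, and maxdupl:

python workflow/scripts/hmmsearch_count_filter.py <faa dir> <models.hmm> <hmmsearch.domtblout> <hitcounts.tsv> <taxa_removed.txt> 0.2 5 0.5 <counts.itol.txt>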
from Bio import SeqIO
import sys
import os
import glob


def main():
    faaindir = sys.argv[1]
    faaoutdir = sys.argv[2]

    # rewrite headers to the format >{filebasename}|{proteinid}

    for faain in glob.glob(faaindir + "/*.faa"):
        faaout = os.path.join(faaoutdir, os.path.basename(faain))
        querybase = os.path.splitext(os.path.basename(faain))[0]
        reformatted = []
        for seq_record in SeqIO.parse(faain, "fasta"):
            if "|" in seq_record.id and seq_record.id.split("|")[0] == querybase:
                # header already carries the file basename as prefix, keep it
                seq_record.id = f"{seq_record.id.split()[0]}"
            else:
                seq_record.id = f"{querybase}|{seq_record.id.split()[0]}"
            seq_record.description = ""
            reformatted.append(seq_record)
        SeqIO.write(reformatted, faaout, "fasta")


if __name__ == "__main__":
    main()
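The setup rules below call this script as workflow/scripts/reformat.py, once for the reference dir and once for the query dir; for a file Genome1.faa, a header such as '>prot_001 hypothetical protein' becomes '>Genome1|prot_001'. Standalone, with placeholder arguments:

python workflow/scripts/reformat.py <input faa dir> <reformatted faa dir>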
shell:
  """
  (sed -n 's/^##//p' {input}
  date
  echo "################################"
  echo "Selected parameters for nsgtree"
  echo "models: {params.models}"
  echo "query dir: {params.qfaa}"
  cat {params.configf} | awk -F "#" '{{print $1}}') &> {output}
  echo "################################" >> {output}
  """
shell:
  """
  mkdir -p {output.reformatted_faa_dir}
  #cp {input.qdir}/*.faa {input.rdir}/*.faa {output.reformatted_faa_dir}
  echo "TREE_COLORS\nSEPARATOR COMMA\nDATA" > {output.q_colors_itol}
  ls {input.qdir} | grep -v "nsgt" | awk -F "." '{{print $1",branch,#cc0000,normal,1"}}' >> {output.q_colors_itol}
  ls {input.qdir} | grep -v "nsgt" | awk -F "." '{{print $1}}' > {output.queries}
  python workflow/scripts/reformat.py {input.rdir} {output.reformatted_faa_dir}
  python workflow/scripts/reformat.py {input.qdir} {output.reformatted_faa_dir}
  """
shell:
  """
  mkdir -p {output.reformatted_faa_dir}
  python workflow/scripts/reformat.py {input.qdir} {output.reformatted_faa_dir}
  """
shell:
  """
  cat {params.faadirf}/*.faa > {output.faacombined}
  echo "Number of genomes in the analysis: $(find {params.faadirf} -type f -name '*.faa' | wc -l)" >> {workflow_log}
  echo "################################" >> {workflow_log}
  """
shell:
  """
  (echo "hmmsearch --noali {params.hmmsearch_cutoff} --domtblout  {output.hmmsearchout} --cpu {params.hmmsearch_cpu} {input.modelshmm} {input.faacombined}"
  hmmsearch --noali {params.hmmsearch_cutoff} --domtblout  {output.hmmsearchout} --cpu {params.hmmsearch_cpu} {input.modelshmm} {input.faacombined}) &> {log}
  """
shell:
  """
  (echo "python workflow/scripts/hmmsearch_count_filter.py {params.faadirf} {params.modelshmm} {input.hmmsearchout} {output.hitcounts} {output.taxa_removed} {params.minmarker} {params.maxsdup} {params.maxdupl} {output.itolcounts}"
  python workflow/scripts/hmmsearch_count_filter.py {params.faadirf} {params.modelshmm} {input.hmmsearchout} {output.hitcounts} {output.taxa_removed} {params.minmarker} {params.maxsdup} {params.maxdupl} {output.itolcounts}) &> {log}
  echo "Number of genomes that were removed due to filtering thresholds: $(wc -l <{output.taxa_removed}) " >> {workflow_log}
  cat {output.taxa_removed} >> {workflow_log}
  """
shell:
  """
  (echo "python workflow/scripts/extract_qhits.py {input.hmmsearchout} {params.faadirf} {params.extract_processes} {input.taxa_removed} {output}"
  python workflow/scripts/extract_qhits.py {input.hmmsearchout} {params.faadirf} {params.extract_processes} {input.taxa_removed} {output}) &> {log}
  declare -i protein_count
  if [ ! -s {output} ]; then
    protein_count=0
  else
    protein_count=$(grep -c ">" {output})
  fi
  echo "Extracted proteins from $protein_count different taxa for {params.modelname}" >> {log}
  touch {output}
  """
shell:
  """
  (echo "mafft{params.mafftv} {params.mafft} --quiet {params.mafft_thread} {input.hitsfaa} > {output.aln}"
  mafft{params.mafftv} {params.mafft} --quiet {params.mafft_thread} {input.hitsfaa} > {output.aln} || touch {output.aln}
  echo "trimal {params.trimal_gt} -in {output.aln} -out {output.trimmedaln}"
  trimal {params.trimal_gt} -in {output.aln} -out {output.trimmedaln} || touch {output.trimmedaln} ) >& {log}
  """
shell:
  """
  (echo "fasttree {params.proteintrees} {input.trimmedaln} > {output.tree}"
  mkdir -p {params.outdirt}
  fasttree {params.proteintrees} {input.trimmedaln} > {params.outpath}.treefile || touch {output.tree}
  cp {params.outpath}.treefile {output.tree}) >& {log}
  """
shell:
  """
  (echo "iqtree --prefix {params.outpath} -m {params.proteintrees} -T 1 -fast -s {input.trimmedaln}"
  mkdir -p {params.outdirt}
  iqtree --prefix {params.outpath} -m {params.proteintrees} -T 1 -fast -s {input.trimmedaln} || touch {params.outpath}.treefile
  cp {params.outpath}.treefile {output.tree}
  touch {output.tree}) >& {log}
  """
shell:
  """
  (python workflow/scripts/concat.py {params.alndir} {output.concataln} || touch {output.concataln}) >& {log}
  echo "################################" >> {workflow_log}
  echo "Number of genomes in the final alignment: $(grep -c '>' {output.concataln})" >> {workflow_log}
  """
shell:
  """
  declare -i genome_count
  if [ ! -s {input.concataln} ]; then
    genome_count=0
  else
    genome_count=$(grep -c ">" {input.concataln})
  fi

  if [ $genome_count -le 3 ]; then
    error_message="Too few genomes in the supermatrix alignment. Check filtering thresholds to increase numbers of genomes to be retained."
    echo $error_message >> {log}
    echo $error_message >&2
    echo $error_message >> {workflow_log}
    exit 1
  fi
  echo "fasttree {params.speciestree} {input.concataln} > {output[0]}" >> {log}
  fasttree {params.speciestree} {input.concataln} > {output[0]} 
  cp {output[0]} {output[1]}
  """
shell:
  """
  declare -i genome_count
  if [ ! -s {input.concataln} ]; then
    genome_count=0
  else
    genome_count=$(grep -c ">" {input.concataln})
  fi

  if [ $genome_count -le 3 ]; then
    error_message="Too few genomes in the supermatrix alignment. Check filtering thresholds to increase numbers of genomes to be retained."
    echo $error_message >> {log}
    echo $error_message >&2
    echo $error_message >> {workflow_log}
    exit 1
  fi
  echo "iqtree --quiet --prefix {params.outpath} -m {params.speciestree} -fast -s {input.concataln}" >> {log}
  iqtree --quiet --prefix {params.outpath} -m {params.speciestree} -fast -s {input.concataln} >> {log} 2>&1
  cp {output[0]} {output[1]}
  """
shell:
  """
  if [ -e {params.itoldir}/queries.txt ]; then python workflow/scripts/ete3_nntree.py {input[0]} {params.itoldir}/queries.txt {params.itoldir}/nn.itol.txt ; fi
  if [ -e {params.itoldir}/queries.txt ]; then python workflow/scripts/ete3_clademembers.py {input[0]} {params.itoldir}/queries.txt cc0000 {params.itoldir}/clademembers_colored.itol.txt
  sed -i '1 i\DATA' {params.itoldir}/clademembers_colored.itol.txt
  sed -i '1 i\SEPARATOR COMMA' {params.itoldir}/clademembers_colored.itol.txt
  sed -i '1 i\TREE_COLORS' {params.itoldir}/clademembers_colored.itol.txt ; fi
  rm -r {params.q_faa_f_dir}
  rm -r {params.tmp_dir}
  find {params.fdir} -type f -empty -print -delete 
  python workflow/scripts/cleanup.py {params.fdir} {output}
  """

URL: https://github.com/NeLLi-team/nsgtree
License: GNU General Public License v2.0