NSGTree: A Fast and Simplified Phylogenetic Tree Construction Pipeline


v0.4.3 (August 19, 2023)

Why use NSGTree (over SGTree)

New Simple Genome Tree (NSGTree) is a computational pipeline for fast and easy construction of phylogenetic trees from a set of user-provided genomes and a set of phylogenetic markers. NSGTree builds a species tree from a concatenated alignment of all proteins identified with the provided marker HMMs, and it also builds a single protein tree for every marker protein that was found. It is a highly simplified version of SGTree; in contrast to SGTree, NSGTree does not currently support marker selection. Initial benchmarking showed NSGTree to be roughly 20-50% faster than SGTree, depending on the number of query faa files and HMMs.

How to run it

Workflow

  • Snakemake in a conda environment

  • Input is a dir with faa files and a single file that contains a set of HMMs to identify markers

snakemake -j 24 --use-conda --config qfaadir="example" models="resources/models/rnapol.hmm" --configfile user_config.yml

or, more generally:

snakemake -j <number of processes> --use-conda --config qfaadir="<dir with query faa>" models="<single file that contains all marker HMMs>" --configfile <user_config.yml with analysis settings>
  • Alternatively, inputs can be a dir with query faa files and a dir with reference faa files; the output dir will be created in the dir with query faa files:
snakemake -j 24 --use-conda --config rfaadir="example_r" qfaadir="example_q" models="resources/models/rnapol.hmm" --configfile user_config.yml
  • Settings for alignment, trimming, and tree building can be changed in user_config.yml (see the example sketch after this list)

  • Several sets of marker HMMs are provided in subdir resources/models/

  • Results can be found in a subdir of the query faa dir, <query faa dir>/nsgt_<analysis name>
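A minimal sketch of such a config; the key names below are inferred from the rule parameters and script arguments shown under Code Snippets, so both names and values are illustrative; consult the user_config.yml shipped with the repository for the authoritative settings:

# illustrative user_config.yml sketch, not the shipped file
minmarker: 0.2        # min fraction of marker HMMs a genome must hit to be kept
maxsdup: 5            # max copy number tolerated for any single marker
maxdupl: 0.5          # max fraction of markers that may occur in multiple copies
mafft: "--auto"       # options passed to mafft for per-marker alignments
trimal_gt: "-gt 0.1"  # gap threshold options passed to trimal
proteintrees: "-lg"   # options for the single protein trees
speciestree: "-lg"    # options for the species tree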

Apptainer / Shifter

  • Apptainer: save the following into a run.sh script; paths to the dirs and the config file are passed as command line arguments. Edit further to specify a different set of HMMs, and adjust the -j flag depending on available resources:
LOCAL_FOLDER_WITH_QFAA_FILES=$1
LOCAL_FOLDER_WITH_RFAA_FILES=$2
USER_CONFIG=$3
apptainer run \
  docker://docker.io/fschulzjgi/nsgtree:0.4.3 \
  snakemake -j 16 \
  --snakefile "/nsgtree/workflow/Snakefile" \
  --use-conda \
  --config \
  qfaadir="$LOCAL_FOLDER_WITH_QFAA_FILES" \
  rfaadir="$LOCAL_FOLDER_WITH_RFAA_FILES" \
  models="/nsgtree/resources/models/rnapol.hmm" \
  --conda-prefix "/nsgtree/.snakemake/conda" \
  --configfile $USER_CONFIG
  • Shifter on NERSC Perlmutter
shifterimg pull fschulzjgi/nsgtree:0.4.3
  • Mount the working directory that contains the model file, query dir, and config file with shifter:
shifter \
  --volume=$(pwd):/nsgtree/example \
  --image=fschulzjgi/nsgtree:0.4.3 \
  bash -c \
  "snakemake --snakefile /nsgtree/workflow/Snakefile \
  -j 24 \
  --use-conda \
  --config \
  qfaadir=/nsgtree/example/test \
  models=/nsgtree/example/rnapol.hmm \
  --conda-prefix /nsgtree/.snakemake/conda \
  --configfile /nsgtree/example/user_config.yml"
  • Specify the ref dir, query dir, and config file with shifter, using the UNI56 models provided by NSGTree:
qfaa=$1
rfaa=$2
configf=$3
shifter \
  --volume=$(pwd):/nsgtree/example \
  --image=fschulzjgi/nsgtree:0.4.3 \
  bash -c \
  "snakemake --snakefile /nsgtree/workflow/Snakefile \
  -j 24 \
  --use-conda \
  --config \
  qfaadir=/nsgtree/example/$qfaa \
  rfaadir=/nsgtree/example/$rfaa \
  models=/nsgtree/resources/models/UNI56.hmm \
  --conda-prefix /nsgtree/.snakemake/conda \
  --configfile /nsgtree/example/$configf"
  • For Docker, the paths to the models dir, query dir, and config file can be mounted separately with the -v flag:
docker pull fschulzjgi/nsgtree:0.4.3
docker run -t -i \
  -v $(pwd):/nsgtree/modelsdir \
  -v $(pwd)/test:/nsgtree/querydir \
  --user $(id -u):$(id -g) \
  fschulzjgi/nsgtree:0.4.3 \
  snakemake --use-conda -j 16 --config qfaadir="querydir" models="/nsgtree/modelsdir/rnapol.hmm"

Acknowledgements

NSGTree was developed by the New Lineages of Life Group at the DOE Joint Genome Institute supported by the Office of Science of the U.S. Department of Energy under contract no. DE-AC02-05CH11231.

Code Snippets

import sys
import tarfile
import os.path
import shutil

resultsdir = sys.argv[1]  # results directory to archive
tarout = sys.argv[2]      # name of the gzipped tar archive to write

def make_tarfile(tarout, resultsdir):
    """Archive resultsdir as tar.gz, then remove the original directory."""
    with tarfile.open(tarout, "w:gz") as tar:
        tar.add(resultsdir, arcname=os.path.basename(resultsdir))
    # remove the directory only after the archive has been written and closed
    try:
        shutil.rmtree(resultsdir)
    except OSError as e:
        print("Error: %s : %s" % (resultsdir, e.strerror))

make_tarfile(tarout, resultsdir)
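This is the cleanup helper invoked at the very end of the workflow as workflow/scripts/cleanup.py (see the final shell rule below); it tars up the results dir and then deletes it. A standalone call, with placeholder arguments, would be:

python workflow/scripts/cleanup.py <resultsdir> <results.tar.gz>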
import glob
import sys
from Bio import SeqIO
from collections import defaultdict

alndir = sys.argv[1]      # dir with aligned multifasta files
aln_concat = sys.argv[2]  # name of the concatenated alignment

def get_aln_len(aln):
    """Length of the first sequence; all sequences in an alignment share it."""
    for seq_record in SeqIO.parse(aln, "fasta"):
        return len(seq_record.seq)

def get_taxa(alndir):
    """Collect all taxon ids (prefix before '|') across all alignments."""
    taxa = set()
    for aln in glob.glob(alndir + "/*.*"):
        for seq_record in SeqIO.parse(aln, "fasta"):
            taxa.add(seq_record.id.split("|")[0])
    return list(taxa)

def get_aln(aln, taxa, final_dict, aln_len):
    """Append each taxon's sequence; pad taxa missing from this alignment with '?'."""
    for taxon in taxa:
        seq = "?" * aln_len
        for seq_record in SeqIO.parse(aln, "fasta"):
            if taxon == seq_record.id.split("|")[0]:
                seq = str(seq_record.seq)
                break
        final_dict[taxon].append(seq)
    return final_dict

final_dict = defaultdict(list)
taxa = get_taxa(alndir)  # compute once instead of per alignment
for aln in glob.glob(alndir + "/*"):
    try:
        aln_len = get_aln_len(aln)
        final_dict = get_aln(aln, taxa, final_dict, aln_len)
    except Exception:
        if aln.split(".")[-1] not in ["faa", "fa", "fasta", "fna", "mafft", "mafft01", "aln", "afa"]:
            print("Check content of dir, all files need to be aligned multifasta with a typical suffix: fna, fasta, fa, faa...")
        else:
            print("Something else is wrong")

# create the concatenated alignment
concat_dict = {x: "".join(y) for x, y in final_dict.items()}
with open(aln_concat, "w") as outfile:
    for seqid, seq in concat_dict.items():
        outfile.write(">" + seqid + "\n" + seq + "\n")
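The species-tree rules below call this script as workflow/scripts/concat.py; it merges the per-marker alignments into one supermatrix, padding taxa that miss a marker with '?' characters. A standalone call, with placeholder arguments:

python workflow/scripts/concat.py <dir with trimmed alignments> <concat.aln>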
import sys
import os.path
import warnings
from ete3 import Tree
warnings.filterwarnings("ignore", ".*is with a literal.*", SyntaxWarning)

treein = sys.argv[1]  # tree in newick format
queryids_or_pattern = sys.argv[2]  # file with leaf names or pattern to look for leaves
hexcolor = sys.argv[3]  # color for identified branches (hex code without '#')
itolout = sys.argv[4]  # name of outfile

def get_alltaxa(t, queryids_or_pattern):
    """Collect target leaves, either from a file of names or by name prefix."""
    target = []
    if os.path.isfile(queryids_or_pattern):
        with open(queryids_or_pattern, "r") as infile:
            target = [line.strip() for line in infile]
    else:
        for node in t.traverse():
            if node.is_leaf() and node.name.startswith(queryids_or_pattern):
                target.append(node.name)
    return target


def unpack_family(parentnode, leaves):
    """
    get all leaves under a parental node
    """
    for child in parentnode.get_children():
        if child.is_leaf() and child.name not in leaves:
            leaves.append(child.name)
        else:
            unpack_family(child, leaves)
    return leaves


def check_monophyly(node, target, mono_leaves):
    """Walk up the tree for as long as all leaves under the parent are targets."""
    # move up one node in the tree
    parentnode = node.up
    # collect all terminal leaves under the parent node
    leaves = unpack_family(parentnode, [])
    if len(leaves) == len([x for x in leaves if x in target]):
        mono_leaves.extend(leaves)
        check_monophyly(parentnode, target, mono_leaves)
    return mono_leaves

def main():
    t = Tree(treein)
    target = get_alltaxa(t, queryids_or_pattern)
    mono_leaves_all = []
    for node in t.traverse():
        if node.name in target and node.name not in [x for sublist in mono_leaves_all for x in sublist]:
            try:
                mono_leaves_all.append(check_monophyly(node, target, []))
            except Exception:
                # node.up is None at the root; treat the query as a singleton
                mono_leaves_all.append([node.name])
    # write iTOL annotation: clades for monophyletic groups, branches for singletons
    queriesinclades = [x for sublist in mono_leaves_all for x in sublist]
    singletons = [x for x in target if x not in queriesinclades]
    with open(itolout, "w") as outfile:
        for monoclade in mono_leaves_all:
            if len(monoclade) > 1:
                outfile.write("|".join(list(set(monoclade))) + ",clade,#" + hexcolor + ",normal,2\n")
        for singleton in singletons:
            outfile.write(singleton + ",branch,#" + hexcolor + ",normal,2\n")

if __name__ == '__main__':
    main()
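The final rule below runs this script as workflow/scripts/ete3_clademembers.py to generate an iTOL annotation that colors monophyletic query clades; mirroring that call, with a placeholder tree name (note that the hex color is passed without the leading '#'):

python workflow/scripts/ete3_clademembers.py <species.treefile> queries.txt cc0000 clademembers_colored.itol.txt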
import sys
from ete3 import Tree
import warnings
warnings.filterwarnings("ignore", ".*is with a literal.*", SyntaxWarning)

def main():
    tree_in = sys.argv[1]     # tree in newick format
    queries_in = sys.argv[2]  # file with query leaf names
    outname = sys.argv[3]     # prefix for the .pairs output

    def clean_name(node):
        """Strip the ASCII-art decorations ete3 adds when stringifying a node."""
        return str(node).replace("--", "").replace("/-", "").replace("\\-", "").replace("\n", "")

    def unpack_family(node, leaves, query, seen, queries):
        """
        Get all non-query leaves under a parental node; if none are found,
        move one node further up and try again.
        """
        for child in node.get_children():
            childname = clean_name(child)
            if child.is_leaf() and childname not in queries:
                seen.append(childname)
                leaves.append(childname)
            elif not child.is_leaf() and childname not in seen:
                seen.append(childname)
                unpack_family(child, leaves, query, seen, queries)
            elif child.is_leaf() and childname == query:
                seen.append(childname)
        if len(leaves) == 0:
            unpack_family(node.up, leaves, query, seen, queries)
        return leaves


    def get_closestrelative(query, leaves, tree):
        """Among the candidate leaves, find the smallest patristic distance to the query."""
        distance = float("inf")  # the original used a fixed cap of 10
        closestrelative = "nd"
        for child in leaves:
            if child != query:
                newdist = tree.get_distance(child, query, topology_only=False)
                if newdist < distance:
                    distance = newdist
                    closestrelative = child
        return closestrelative, distance


    def get_neighbor(t, query, queries):
        for node in t.traverse():
            if node.name == query:
                # collect all terminal non-query leaves under the parent node
                leaves = unpack_family(node, [], query, [], queries)
                closestrelative, distance = get_closestrelative(node.name, leaves, t)
                return [closestrelative, distance]

    # get queries
    with open(queries_in) as infile:
        queries = infile.read().splitlines()

    # closest non-query relative and distance for every query
    t = Tree(tree_in)
    tree_dict_all = {query: get_neighbor(t, query, queries) for query in queries}
    with open(outname + ".pairs", "w") as outfile:
        for query, bestref in tree_dict_all.items():
            if bestref:
                outfile.write(f"{query}\t{bestref[0]}\t{bestref[1]}\n")
            else:
                outfile.write(f"{query}\tnd\tnd\n")


if __name__ == "__main__":
    main()
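The same final rule runs this script as workflow/scripts/ete3_nntree.py to report, for each query leaf, the closest non-query leaf and its patristic distance; results are written to <outname>.pairs. Mirroring that call, with a placeholder tree name:

python workflow/scripts/ete3_nntree.py <species.treefile> queries.txt nn.itol.txt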
import os
from itertools import repeat
import sys
import multiprocessing as mp
from Bio import SeqIO

hmmout = sys.argv[1]        # hmmsearch domtblout
faadir = sys.argv[2]        # dir with reformatted faa files
threads = int(sys.argv[3])
blacklist = sys.argv[4]     # taxa removed during quality filtering
outfaa = sys.argv[5]        # output faa; the file name contains the model name

def export_hits_faa(hit, faadir):
    """Look up the hit in its genome's faa file (genome id is the '|' prefix)."""
    hit_genome_id = hit.split("|")[0]
    faa_file = os.path.join(faadir, hit_genome_id + ".faa")
    for seq_record in SeqIO.parse(faa_file, "fasta"):
        if seq_record.description == hit:
            return (seq_record.description, seq_record.seq)

def get_hits(hmmout, model, removed_taxa):
    """Collect unique hit ids for this model, skipping blacklisted taxa."""
    hits_list = []
    with open(hmmout, "r") as infile:
        for line in infile:
            if not line.startswith("#") and line.split()[0] not in hits_list and line.split()[3] == model:
                if line.split("|")[0] not in removed_taxa:
                    hits_list.append(line.split()[0])
    return hits_list

def get_removed_taxa(blacklist):
    with open(blacklist, "r") as infile:
        return [line.split()[0] for line in infile]

def main():
    removed_taxa = get_removed_taxa(blacklist)
    model = outfaa.split("/")[-1].split(".faa")[0]
    hits_list = get_hits(hmmout, model, removed_taxa)

    with mp.Pool(processes=threads) as pool:
        results = pool.starmap(export_hits_faa, zip(hits_list, repeat(faadir)))

    with open(outfaa, "w") as outfile:
        for hit in results:
            if hit:  # skip hits whose sequence could not be found
                outfile.write(">" + hit[0] + "\n" + str(hit[1]) + "\n")

if __name__ == '__main__':
    main()
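The extraction rule below calls this script as workflow/scripts/extract_qhits.py, once per marker; the output faa name carries the model name, which the script parses back out. A standalone call with placeholder arguments:

python workflow/scripts/extract_qhits.py <hmmsearch.domtblout> <faa dir> <threads> <taxa_removed.txt> <model>.faa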
from scipy.cluster import hierarchy
import sys
import pandas as pd
import glob
from collections import defaultdict
import os

os.environ['QT_QPA_PLATFORM'] = 'offscreen'

queryfaadir = sys.argv[1]
modelscombined = sys.argv[2]  # e.g. pathtofiles + "models/GVOGuni9.hmm"
hmmout = sys.argv[3]  # e.g. pathtofiles + "Fadolivirus_GVOGuni9.out"
countout = sys.argv[4]  # summarized hit counts per model from hmmout
removedtaxa = sys.argv[5]  # list of removed taxa with info on why they were removed
minmarker = float(sys.argv[6])  # fraction of markers that must be present for a genome to be included downstream
maxsdup = int(sys.argv[7])  # max copy number of any marker; if exceeded, the genome is excluded
maxdupl = float(sys.argv[8])  # fraction of markers that may be present in multiple copies
outitol = sys.argv[9]


def get_models(modelsin):
    """
    Generate a list of all model names.
    Some models might have no hits in hmmout, but they should still appear in the count matrix.
    """
    with open(modelsin, "r") as f:
        models = [line.split()[-1].rstrip() for line in f if line.startswith("NAME")]
    return models


def get_taxa(queryfaadir):
    taxa = [taxon.split("/")[-1].split(".")[0] for taxon in glob.glob(queryfaadir + "/*.faa")]
    return taxa


def get_markercompleteness(models, hmmout, query):
    """
    Get copy numbers for each marker.
    """
    # add 0s to include models that are not in hmmout
    count_dict = {x: 0 for x in models}
    seen = []
    with open(hmmout, "r") as f:
        lines = [line.rstrip() for line in f if not line.startswith("#")]
        for line in lines:
            if line.split()[0] not in seen and line.split("|")[0] == query:
                count_dict[line.split()[3]] += 1
                seen.append(line.split()[0])
    return count_dict


def filter_taxa(df, models, minmarker, maxsdup, maxdupl):
    """Quality filtering; note that this caps the counts in df at 1 and adds a 'sum' column."""
    filteredtaxa_dict = defaultdict(list)
    # single-marker copy number threshold
    df['max'] = df.max(axis=1)
    filtered_taxa = df[df['max'] > maxsdup].index
    filteredtaxa_dict.update({taxon: ["maxsdup:{}".format(df.at[taxon, 'max'])] for taxon in filtered_taxa})
    df.drop(['max'], inplace=True, axis=1)
    # fraction of duplicated markers threshold
    df['duplications'] = (df > 1).sum(axis=1)
    filtered_taxa = df[df['duplications'] / len(models) > maxdupl].index
    filteredtaxa_dict.update({taxon: ["maxdupl:{:.4f}".format(df.at[taxon, 'duplications'] / len(models))] for taxon in filtered_taxa})
    df.drop(['duplications'], inplace=True, axis=1)
    # marker completeness threshold
    df[df > 1] = 1
    df['sum'] = df.sum(axis=1)
    filtered_taxa = df[df['sum'] < len(models) * minmarker].index
    filteredtaxa_dict.update({taxon: ["completeness:{:.4f}".format(df.at[taxon, 'sum'] / len(models))] for taxon in filtered_taxa})
    return filteredtaxa_dict


def convert2itol(flabels, outfile):
    with open(outfile, "w") as outfile:
        outfile.write("DATASET_HEATMAP\n"
                      "SEPARATOR COMMA\n"
                      "DATASET_LABEL,Count\n"
                      "COLOR,#ff0000\n"
                      "COLOR_MIN,#ff0000\n"
                      "COLOR_MAX,#0000ff\n"
                      "FIELD_LABELS," + ",".join(flabels) + "\n"
                      "DATA\n"
                      )


def reorder_cols(df):
    """Reorder columns in the count matrix using hierarchical clustering"""
    Z = hierarchy.linkage(df.T, method='ward')
    reordered_ind = hierarchy.leaves_list(Z)
    df = df[df.columns[reordered_ind]]
    return df


def main():
    models = get_models(modelscombined)
    taxa = get_taxa(queryfaadir)

    count_dict = {query: get_markercompleteness(models, hmmout, query) for query in taxa}
    df = pd.DataFrame.from_dict(count_dict).T
    df.to_csv(countout, sep='\t', index=True)

    filteredtaxa_dict = filter_taxa(df, models, minmarker, maxsdup, maxdupl)
    with open(removedtaxa, 'w') as outfile:
        for taxon, reasons in filteredtaxa_dict.items():
            outfile.write(f'{taxon}\t{";".join(reasons)}\n')

    df.drop('sum', axis=1, inplace=True)
    # write the iTOL heatmap with hierarchically clustered columns
    reordered_df = reorder_cols(df)
    convert2itol(list(reordered_df.columns), outitol)
    with open(outitol, "a") as outitolA:
        reordered_df.to_csv(outitolA, header=False, sep=",")

if __name__ == '__main__':
    main()
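The filtering rule below calls this script as workflow/scripts/hmmsearch_count_filter.py; mirroring that call, with placeholder paths and illustrative threshold values for minmarker, maxsdup, and maxdupl:

python workflow/scripts/hmmsearch_count_filter.py <faa dir> <models.hmm> <hmmsearch.domtblout> <hitcounts.tsv> <taxa_removed.txt> 0.2 5 0.5 <counts.itol.txt>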
from Bio import SeqIO
import sys
import os
import glob


def main():
    faaindir = sys.argv[1]
    faaoutdir = sys.argv[2]

    # rewrite headers to the format >{filebasename}|{proteinid}

    for faain in glob.glob(faaindir + "/*.faa"):
        faaout = os.path.join(faaoutdir, os.path.basename(faain))
        querybase = os.path.splitext(os.path.basename(faain))[0]
        reformatted = []
        for seq_record in SeqIO.parse(faain, "fasta"):
            if "|" in seq_record.id and seq_record.id.split("|")[0] == querybase:
                # header already carries the file basename as prefix, keep it
                seq_record.id = f"{seq_record.id.split()[0]}"
            else:
                seq_record.id = f"{querybase}|{seq_record.id.split()[0]}"
            seq_record.description = ""
            reformatted.append(seq_record)
        SeqIO.write(reformatted, faaout, "fasta")


if __name__ == "__main__":
    main()
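The setup rules below call this script as workflow/scripts/reformat.py, once for the reference dir and once for the query dir; for a file Genome1.faa, a header such as '>prot_001 hypothetical protein' becomes '>Genome1|prot_001'. Standalone, with placeholder arguments:

python workflow/scripts/reformat.py <input faa dir> <reformatted faa dir>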
shell:
  """
  (sed -n 's/^##//p' {input}
  date
  echo "################################"
  echo "Selected parameters for nsgtree"
  echo "models: {params.models}"
  echo "query dir: {params.qfaa}"
  cat {params.configf} | awk -F "#" '{{print $1}}') &> {output}
  echo "################################" >> {output}
  """
shell:
  """
  mkdir -p {output.reformatted_faa_dir}
  #cp {input.qdir}/*.faa {input.rdir}/*.faa {output.reformatted_faa_dir}
  echo "TREE_COLORS\nSEPARATOR COMMA\nDATA" > {output.q_colors_itol}
  ls {input.qdir} | grep -v "nsgt" | awk -F "." '{{print $1",branch,#cc0000,normal,1"}}' >> {output.q_colors_itol}
  ls {input.qdir} | grep -v "nsgt" | awk -F "." '{{print $1}}' > {output.queries}
  python workflow/scripts/reformat.py {input.rdir} {output.reformatted_faa_dir}
  python workflow/scripts/reformat.py {input.qdir} {output.reformatted_faa_dir}
  """
shell:
  """
  mkdir -p {output.reformatted_faa_dir}
  python workflow/scripts/reformat.py {input.qdir} {output.reformatted_faa_dir}
  """
shell:
  """
  cat {params.faadirf}/*.faa > {output.faacombined}
  echo "Number of genomes in the analysis: $(find {params.faadirf} -type f -name '*.faa' | wc -l)" >> {workflow_log}
  echo "################################" >> {workflow_log}
  """
shell:
  """
  (echo "hmmsearch --noali {params.hmmsearch_cutoff} --domtblout  {output.hmmsearchout} --cpu {params.hmmsearch_cpu} {input.modelshmm} {input.faacombined}"
  hmmsearch --noali {params.hmmsearch_cutoff} --domtblout  {output.hmmsearchout} --cpu {params.hmmsearch_cpu} {input.modelshmm} {input.faacombined}) &> {log}
  """
shell:
  """
  (echo "python workflow/scripts/hmmsearch_count_filter.py {params.faadirf} {params.modelshmm} {input.hmmsearchout} {output.hitcounts} {output.taxa_removed} {params.minmarker} {params.maxsdup} {params.maxdupl} {output.itolcounts}"
  python workflow/scripts/hmmsearch_count_filter.py {params.faadirf} {params.modelshmm} {input.hmmsearchout} {output.hitcounts} {output.taxa_removed} {params.minmarker} {params.maxsdup} {params.maxdupl} {output.itolcounts}) &> {log}
  echo "Number of genomes that were removed due to filtering thresholds: $(wc -l <{output.taxa_removed}) " >> {workflow_log}
  cat {output.taxa_removed} >> {workflow_log}
  """
shell:
  """
  (echo "python workflow/scripts/extract_qhits.py {input.hmmsearchout} {params.faadirf} {params.extract_processes} {input.taxa_removed} {output}"
  python workflow/scripts/extract_qhits.py {input.hmmsearchout} {params.faadirf} {params.extract_processes} {input.taxa_removed} {output}) &> {log}
  declare -i protein_count
  if [ ! -s {output} ]; then
    protein_count=0
  else
    protein_count=$(grep -c ">" {output})
  fi
  echo "Extracted proteins from $protein_count different taxa for {params.modelname}" >> {log}
  touch {output}
  """
shell:
  """
  (echo "mafft{params.mafftv} {params.mafft} --quiet {params.mafft_thread} {input.hitsfaa} > {output.aln}"
  mafft{params.mafftv} {params.mafft} --quiet {params.mafft_thread} {input.hitsfaa} > {output.aln} || touch {output.aln}
  echo "trimal {params.trimal_gt} -in {output.aln} -out {output.trimmedaln}"
  trimal {params.trimal_gt} -in {output.aln} -out {output.trimmedaln} || touch {output.trimmedaln} ) >& {log}
  """
shell:
  """
  (echo "fasttree {params.proteintrees} {input.trimmedaln} > {output.tree}"
  mkdir -p {params.outdirt}
  fasttree {params.proteintrees} {input.trimmedaln} > {params.outpath}.treefile || touch {output.tree}
  cp {params.outpath}.treefile {output.tree}) >& {log}
  """
shell:
  """
  (echo "iqtree --prefix {params.outpath} -m {params.proteintrees} -T 1 -fast -s {input.trimmedaln}"
  mkdir -p {params.outdirt}
  iqtree --prefix {params.outpath} -m {params.proteintrees} -T 1 -fast -s {input.trimmedaln} || touch {params.outpath}.treefile
  cp {params.outpath}.treefile {output.tree}
  touch {output.tree}) >& {log}
  """
shell:
  """
  (python workflow/scripts/concat.py {params.alndir} {output.concataln} || touch {output.concataln}) >& {log}
  echo "################################" >> {workflow_log}
  echo "Number of genomes in the final alignment: $(grep -c '>' {output.concataln})" >> {workflow_log}
  """
shell:
  """
  declare -i genome_count
  if [ ! -s {input.concataln} ]; then
    genome_count=0
  else
    genome_count=$(grep -c ">" {input.concataln})
  fi

  if [ $genome_count -le 3 ]; then
    error_message="Too few genomes in the supermatrix alignment. Check filtering thresholds to increase numbers of genomes to be retained."
    echo $error_message >> {log}
    echo $error_message >&2
    echo $error_message >> {workflow_log}
    exit 1
  fi
  echo "fasttree {params.speciestree} {input.concataln} > {output[0]}" >> {log}
  fasttree {params.speciestree} {input.concataln} > {output[0]} 
  cp {output[0]} {output[1]}
  """
shell:
  """
  declare -i genome_count
  if [ ! -s {input.concataln} ]; then
    genome_count=0
  else
    genome_count=$(grep -c ">" {input.concataln})
  fi

  if [ $genome_count -le 3 ]; then
    error_message="Too few genomes in the supermatrix alignment. Check filtering thresholds to increase numbers of genomes to be retained."
    echo $error_message >> {log}
    echo $error_message >&2
    echo $error_message >> {workflow_log}
    exit 1
  fi
  echo "iqtree --quiet --prefix {params.outpath} -m {params.speciestree} -fast -s {input.concataln}" >> {log}
  iqtree --quiet --prefix {params.outpath} -m {params.speciestree} -fast -s {input.concataln} >> {log} 2>&1
  cp {output[0]} {output[1]}
  """
shell:
  """
  if [ -e {params.itoldir}/queries.txt ]; then python workflow/scripts/ete3_nntree.py {input[0]} {params.itoldir}/queries.txt {params.itoldir}/nn.itol.txt ; fi
  if [ -e {params.itoldir}/queries.txt ]; then python workflow/scripts/ete3_clademembers.py {input[0]} {params.itoldir}/queries.txt cc0000 {params.itoldir}/clademembers_colored.itol.txt
  sed -i '1 i\DATA' {params.itoldir}/clademembers_colored.itol.txt
  sed -i '1 i\SEPARATOR COMMA' {params.itoldir}/clademembers_colored.itol.txt
  sed -i '1 i\TREE_COLORS' {params.itoldir}/clademembers_colored.itol.txt ; fi
  rm -r {params.q_faa_f_dir}
  rm -r {params.tmp_dir}
  find {params.fdir} -type f -empty -print -delete 
  python workflow/scripts/cleanup.py {params.fdir} {output}
  """

URL: https://github.com/NeLLi-team/nsgtree
License: GNU General Public License v2.0