NSGTree: A Fast and Simplified Phylogenetic Tree Construction Pipeline
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
v0.4.3 August 19 2023
Why use NSGTree (over SGTree)
New Simple Genome Tree (NSGTree) is a computational pipeline for fast and easy construction of phylogenetic trees from a set of user provided genomes and a set of phylogenetic markers. NSGTree builds a species tree based on a concatenated alignment of all proteins that were identified with the provided markers HMMs. NSGTree also builds single protein trees for every marker protein that was found. It is a highly simplified version of SGTree. In contrast to SGTree, NSGTree does currently not support marker selection. Initial benchmarking showed that NSGTree is faster than SGTree (~20-50% depending on number of query faa and HMMs)
How to run it
Workflow
-
Snakemake in a conda environment
-
Input is a dir with faa files and a single file that contains a set of HMMs to identify makers
snakemake -j 24 --use-conda --config qfaadir="example" models="resources/models/rnapol.hmm" --configfile user_config.yml
or more general
snakemake -j <number of processes> --use-conda --config qfaadir="<dir with query faa>" models="<single file that contains all marker HMMs" --configfile <settings specified in user_config.yml>
- Alternatively, inputs can be a dir with query faa files and a dir with reference faa files, output dir will be created in the dir with query faa files
snakemake -j 24 --use-conda --config rfaadir="example_r" qfaadir="example_q" models="resources/models/rnapol.hmm" --configfile user_config.yml
-
Setting for alignment, trimming and tree building can be changed in user_config.yml
-
Several sets of marker HMMs are provided in subdir resources/models/
-
Results can be found in a subdir in <query faa dir>/nsgt_<analysis name>"
Apptainer / Shifter
- Apptainer: save the following into a run.sh script, specify path to dirs and config file as command line arguments, potentially edit further to specify different set of HMMs, adjust -j flag depending on available resources
LOCAL_FOLDER_WITH_QFAA_FILES=$1
LOCAL_FOLDER_WITH_RFAA_FILES=$2
USER_CONFIG=$3
apptainer run \ docker://docker.io/fschulzjgi/nsgtree:0.4.3 \ snakemake -j 16 \ --snakefile "/nsgtree/workflow/Snakefile" \ --use-conda \ --config \ qfaadir="$LOCAL_FOLDER_WITH_QFAA_FILES" \ rfaadir="$LOCAL_FOLDER_WITH_RFAA_FILES" \ models="/nsgtree/resources/models/rnapol.hmm" \ --conda-prefix "/nsgtree/.snakemake/conda" \ --configfile $USER_CONFIG
- Shifter on NERSC Perlmutter
shifterimg pull fschulzjgi/nsgtree:0.4.3
- Load the working directory that contains files with models, querydir and config file with shifter
shifter \ --volume=$(pwd):/nsgtree/example \ --image=fschulzjgi/nsgtree:0.4.3 \ bash -c \ "snakemake --snakefile /nsgtree/workflow/Snakefile \ -j 24 \ --use-conda \ --config \ qfaadir="/nsgtree/example/test" \ models="/nsgtree/example/rnapol.hmm" \ --conda-prefix /nsgtree/.snakemake/conda \ --configfile /nsgtree/example/user_config.yml"
- Specify ref dir and query dir and config file with shifter, using the UNI56 models provided by nsgtree
qfaa=$1
rfaa=$2
configf=$3
shifter \ --volume=$(pwd):/nsgtree/example \ --image=fschulzjgi/nsgtree:0.4.3 \ bash -c \ "snakemake --snakefile /nsgtree/workflow/Snakefile \ -j 24 \ --use-conda \ --config \ qfaadir="/nsgtree/example/$qfaa" \ rfaadir="/nsgtree/example/$rfaa" \ models="/nsgtree/resources/models/UNI56.hmm" \ --conda-prefix /nsgtree/.snakemake/conda \ --configfile /nsgtree/example/$configf"
- For docker paths to modeldir, querydir, configfile can be loaded separately with the -v flag
docker pull fschulzjgi/nsgtree:0.4.3
docker run -t -i -v $(pwd):/nsgtree/modelsdir -v $(pwd)/test:/nsgtree/querydir --user $(id -u):$(id -g) fschulzjgi/nsgtree:0.4.3 snakemake --use-conda -j 16 --config qfaadir="querydir" models="/nsgtree/modelsdir/rnapol.hmm"
Acknowledgements
NSGTree was developed by the New Lineages of Life Group at the DOE Joint Genome Institute supported by the Office of Science of the U.S. Department of Energy under contract no. DE-AC02-05CH11231.
Code Snippets
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | import sys import tarfile import os.path import shutil resultsdir = sys.argv[1] tarout = sys.argv[2] def make_tarfile(tarout, resultsdir): with tarfile.open(tarout, "w:gz") as tar: tar.add(resultsdir, arcname=os.path.basename(resultsdir)) try: shutil.rmtree(resultsdir) except OSError as e: print("Error: %s : %s" % (resultsdir, e.strerror)) make_tarfile(tarout, resultsdir) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 | import glob import sys from Bio import SeqIO from collections import defaultdict alndir = sys.argv[1] # dir with aligned multifa aln_concat = sys.argv[2] # name of concat aln def get_aln_len(aln): for seq_record in SeqIO.parse(aln, "fasta"): return len(seq_record.seq) def get_taxa(alndir): taxa = set() for aln in glob.glob(alndir + "/*.*"): for seq_record in SeqIO.parse(aln, "fasta"): taxa.add(seq_record.id.split("|")[0]) return list(taxa) def get_aln(aln, taxa, final_dict, aln_len): for taxon in taxa: seq = "?" * aln_len for seq_record in SeqIO.parse(aln, "fasta"): if taxon == seq_record.id.split("|")[0]: seq = str(seq_record.seq) break final_dict[taxon].append(seq) return final_dict final_dict = defaultdict(list) for aln in glob.glob(alndir + "/*"): try: aln_len = get_aln_len(aln) final_dict = get_aln(aln, get_taxa(alndir), final_dict, aln_len) except: if aln.split(".")[-1] not in ["faa", "fa", "fasta", "fna", "mafft", "mafft01", "aln", "afa"]: print("Check content of dir, all need to be aligned multifasta with typical suffix: fna, fasta, fa, faa...") else: print("Something else is wrong") # create concatenated aln concat_dict = {x:"".join(y) for x,y in final_dict.items()} with open(aln_concat, "w") as outfile: for seqid, seq in concat_dict.items(): outfile.write(">" + seqid + "\n" + seq + "\n") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 | import sys from ete3 import Tree import ete3 as ete import os.path import warnings warnings.filterwarnings("ignore", ".*is with a literal.*", SyntaxWarning) treein = sys.argv[1] # tree in newick format queryids_or_pattern = sys.argv[2] # file with leaf names or pattern to look for leaves hexcolor = sys.argv[3] # color for identified branches itolout = sys.argv[4] # name of outfile def get_alltaxa(t, queryids_or_pattern): target = [] if os.path.isfile(queryids_or_pattern): with open(queryids_or_pattern, "r") as infile: target = [line.strip() for line in infile] else: for node in t.traverse(): if node.is_leaf() and node.name.startswith(queryids_or_pattern): target.append(node.name) return target def unpack_family(parentnode, leaves, target): """ get all leaves under a parental node """ children = parentnode.get_children() for child in children: if child.is_leaf() and child.name not in leaves: leaves.append(child.name) else: unpack_family(child, leaves, target) return leaves def check_monophyly(node, target, mono_leaves): # move up one node in the tree parentnode = node.up # collect all terminal leaves under the parent node leaves = unpack_family(parentnode, [], target) if len(leaves) == len([x for x in leaves if x in target]): mono_leaves.extend(leaves) check_monophyly(parentnode, target, mono_leaves) return mono_leaves def main(): t = Tree(treein) target = get_alltaxa(t, queryids_or_pattern) mono_leaves_all = [] for node in t.traverse(): if node.name in target and node.name not in [x for sublist in mono_leaves_all for x in sublist]: try: mono_leaves_all.append(check_monophyly(node, target, [])) except: mono_leaves_all.append([node.name]) #write to output queriesinclades = [x for sublist in mono_leaves_all for x in sublist] singletons = [x for x in target if x not in queriesinclades] with open(itolout, "w") as outfile: for monotclade in mono_leaves_all: if len(monotclade) > 1: outfile.write("|".join(list(set(monotclade))) + ",clade,#"+hexcolor+",normal,2\n") for singleton in singletons: outfile.write(singleton + ",branch,#"+hexcolor+",normal,2\n") if __name__ == '__main__': main() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 | import glob import sys from ete3 import Tree from collections import defaultdict from collections import Counter import pandas as pd import warnings warnings.filterwarnings("ignore", ".*is with a literal.*", SyntaxWarning) def main(): tree_in = sys.argv[1] queries_in = sys.argv[2] outname = sys.argv[3] def unpack_family(node, leaves, query, seen, queries): """ get all leaves under a parental node """ children = node.get_children() for child in children: childname = str(child).replace("--","").replace("/-", "").replace("\-", "").replace("\n","") if child.is_leaf() and childname not in queries: seen.append(childname) leaves.append(childname) elif not child.is_leaf() and childname not in seen: seen.append(childname) unpack_family(child, leaves, query, seen, queries) elif child.is_leaf() and childname == query: seen.append(childname) else: pass if len(leaves)==0: parentnode = node.up unpack_family(parentnode, leaves, query, seen, queries) return leaves def get_closestrelative(query, leaves, node, tree): distance = 10 closestrelative = "nd" for child in leaves: if child != query: newdist = t.get_distance(child, query, topology_only = False) if newdist < distance: distance = newdist closestrelative = child else: pass return closestrelative, distance def get_neighbor(t, query, queries): for node in t.traverse(): if node.name == query: child = str(node.name).replace("--","").replace("/-", "").replace("\-", "").replace("\n","") queryintree = node.name # move up one node in the tree # collect all terminal leaves under the parent node leaves = [] seen = [] leaves = unpack_family(node, leaves, query, seen, queries) closestrelative, distance = get_closestrelative(queryintree, leaves, node, t) return [closestrelative, distance] # get queries with open(queries_in) as infile: queries = infile.read().splitlines() # all distances tree_dict_all = {} t = Tree(tree_in) for query in queries: tree_dict_all[query] = get_neighbor(t, query, queries) df = pd.DataFrame.from_dict(tree_dict_all).fillna("nd").T with open(outname + ".pairs", "w") as outfile: for query, bestref in tree_dict_all.items(): if bestref: outfile.write(f"{query}\t{bestref[0]}\t{bestref[1]}\n") else: outfile.write(f"{query}\tnd\tnd\n") if __name__ == "__main__": main() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 | import os from itertools import repeat import sys import glob import multiprocessing as mp from Bio import SeqIO hmmout = sys.argv[1] faadir = sys.argv[2] threads = int(sys.argv[3]) blacklist = sys.argv[4] outfaa = sys.argv[5] # contains the model name def export_hits_faa(hit, faadir): hit_genome_id = hit.split("|")[0] faa_file = os.path.join(faadir, hit_genome_id + ".faa") for seq_record in SeqIO.parse(faa_file, "fasta"): if seq_record.description == hit: return (seq_record.description, seq_record.seq) def get_hits(hmmout, model, removed_taxa): hits_list = [] with open(hmmout, "r") as infile: for line in infile: if not line.startswith("#") and line.split()[0] not in hits_list and line.split()[3] == model and line.split("|")[0]: if line.split("|")[0] not in removed_taxa: hits_list.append(line.split()[0]) return hits_list def get_removed_taxa(blacklist): removed_taxa = [line.split()[0] for line in open(blacklist, "r")] return removed_taxa def main(): removed_taxa = get_removed_taxa(blacklist) model = outfaa.split("/")[-1].split(".faa")[0] hits_list = get_hits(hmmout, model, removed_taxa) with mp.Pool(processes=threads) as pool: results = pool.starmap(export_hits_faa, zip(hits_list, repeat(faadir))) with open(outfaa, "w") as outfile: for hit in results: outfile.write(">" + hit[0] + "\n" + str(hit[1]) + "\n") if __name__ == '__main__': main() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 | from scipy.cluster import hierarchy import numpy as np import subprocess import sys import pandas as pd import glob from collections import defaultdict import os os.environ['QT_QPA_PLATFORM'] = 'offscreen' queryfaadir = sys.argv[1] modelscombined = sys.argv[2] # e.g. pathtofiles + "models/GVOGuni9.hmm" hmmout = sys.argv[3] # e.g. pathtofiles + "Fadolivirus_GVOGuni9.out" countout = sys.argv[4] # summarized hit counts per model from hmmout removedtaxa = sys.argv[5] # list of removed taxa with info on why removed these were removed minmarker = float(sys.argv[6]) # percentage of markers to be present for a genome to be included in downstream analysis$ maxsdup = int(sys.argv[7]) # max copy number of any marker, if exceeded respective genome will be excluded maxdupl = float(sys.argv[8]) # percentage of markers that can be present in multiple copies outitol = sys.argv[9] def get_models (modelsin): """ Generate list of all model names Some models might have no hits in hmmout, but they should be displayed in count matrix """ with open(modelsin, "r") as f: models = [line.split()[-1].rstrip() for line in f if line.startswith("NAME")] return models def get_taxa (queryfaadir): taxa = [taxon.split("/")[-1].split(".")[0] for taxon in glob.glob(queryfaadir + "/*.faa")] return taxa def get_markercompleteness (models, hmmout, query): """ Get copy numbers for each marker """ # add 0s to include models that are not in hmmout count_dict = { x:0 for x in models } seen = [] with open(hmmout, "r") as f: lines = [line.rstrip() for line in f if not line.startswith("#")] for line in lines: if line.split()[0] not in seen and line.split("|")[0]==query: count_dict[line.split()[3]] += 1 seen.append(line.split()[0]) return count_dict def filter_taxa(df, models, minmarker, maxsdup, maxdupl): """ Quality filtering""" filteredtaxa_dict = defaultdict(list) df['max'] = df.max(axis=1) filtered_taxa = df[df['max'] > maxsdup].index filteredtaxa_dict.update({taxon: ["maxsdup:{}".format(df.at[taxon, 'max'])] for taxon in filtered_taxa}) df.drop(['max'], inplace=True, axis=1) df['duplications'] = (df > 1).sum(axis=1) filtered_taxa = df[df['duplications'] / len(models) > maxdupl].index filteredtaxa_dict.update({taxon: ["maxdupl:{:.4f}".format(df.at[taxon, 'duplications'] / len(models))] for taxon in filtered_taxa}) df.drop(['duplications'], inplace=True, axis=1) df[df > 1] = 1 df['sum'] = df.sum(axis=1) filtered_taxa = df[df['sum'] < len(models) * minmarker].index filteredtaxa_dict.update({taxon: ["completeness:{:.4f}".format(df.at[taxon, 'sum'] / len(models))] for taxon in filtered_taxa}) return filteredtaxa_dict def convert2itol(flabels, outfile): with open(outfile, "w") as outfile: outfile.write("DATASET_HEATMAP\n" "SEPARATOR COMMA\n" "DATASET_LABEL,Count\n" "COLOR,#ff0000\n" "COLOR_MIN,#ff0000\n" "COLOR_MAX,#0000ff\n" "FIELD_LABELS," + ",".join(flabels) + "\n" "DATA\n" ) def reorder_cols(df): """Reorder columns in countmatrix using hierarchical clustering""" Z = hierarchy.linkage(df.T, method='ward') reordered_ind = hierarchy.leaves_list(Z) df = df[df.columns[reordered_ind]] return df def main(): models = get_models(modelscombined) taxa = get_taxa(queryfaadir) count_dict = {query: get_markercompleteness(models, hmmout, query) for query in taxa} df = pd.DataFrame.from_dict(count_dict).T df.to_csv(countout, sep='\t', index=True) filteredtaxa_dict = filter_taxa(df, models, minmarker, maxsdup, maxdupl) with open(removedtaxa, 'w') as outfile: for taxon, reasons in filteredtaxa_dict.items(): outfile.write(f'{taxon}\t{";".join(reasons)}\n') df.drop('sum', axis=1, inplace=True) reordered_df = reorder_cols(df) convert2itol(list(df.columns), outitol) with open(outitol, "a") as outitolA: df.to_csv(outitolA, header = False, sep = ",") if __name__ == '__main__': main() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 | from Bio import SeqIO import pandas as pd import sys import os import glob def main(): faaindir = sys.argv[1] faaoutdir = sys.argv[2] # header in format ><filebasename|<proteinid> reformatted = [] for faain in glob.glob(faaindir+"/*.faa"): faaout = os.path.join(faaoutdir, os.path.basename(faain)) reformatted = [] for seq_record in SeqIO.parse(faain, "fasta"): querybase = os.path.splitext(os.path.basename(faain))[0] if "|" in seq_record.id: if seq_record.id.split("|")[0] == querybase: seq_record.id = f"{seq_record.id.split()[0]}" seq_record.description = "" reformatted.append(seq_record) else: seq_record.id = f"{querybase}|{seq_record.id.split()[0]}" seq_record.description = "" reformatted.append(seq_record) else: seq_record.id = f"{querybase}|{seq_record.id.split()[0]}" seq_record.description = "" reformatted.append(seq_record) SeqIO.write(reformatted, faaout, "fasta") if __name__ == "__main__": main() |
60 61 62 63 64 65 66 67 68 69 70 | shell: """ (sed -n 's/^##//p' {input} date echo "################################" echo "Selected parameters for nsgtree" echo "models: {params.models}" echo "query dir: {params.qfaa}" cat {params.configf} | awk -F "#" '{{print $1}}') &> {output} echo "################################" >> {output} """ |
92 93 94 95 96 97 98 99 100 101 | shell: """ mkdir -p {output.reformatted_faa_dir} #cp {input.qdir}/*.faa {input.rdir}/*.faa {output.reformatted_faa_dir} echo "TREE_COLORS\nSEPARATOR COMMA\nDATA" > {output.q_colors_itol} ls {input.qdir} | grep -v "nsgt" | awk -F "." '{{print $1",branch,#cc0000,normal,1"}}' >> {output.q_colors_itol} ls {input.qdir} | grep -v "nsgt" | awk -F "." '{{print $1}}' > {output.queries} python workflow/scripts/reformat.py {input.rdir} {output.reformatted_faa_dir} python workflow/scripts/reformat.py {input.qdir} {output.reformatted_faa_dir} """ |
117 118 119 120 121 | shell: """ mkdir -p {output.reformatted_faa_dir} python workflow/scripts/reformat.py {input.qdir} {output.reformatted_faa_dir} """ |
132 133 134 135 136 137 | shell: """ cat {params.faadirf}/*.faa > {output.faacombined} echo "Number of genomes in the analysis: $(find {params.faadirf} -type f -name '*.faa' | wc -l)" >> {workflow_log} echo "################################" >> {workflow_log} """ |
156 157 158 159 160 | shell: """ (echo "hmmsearch --noali {params.hmmsearch_cutoff} --domtblout {output.hmmsearchout} --cpu {params.hmmsearch_cpu} {input.modelshmm} {input.faacombined}" hmmsearch --noali {params.hmmsearch_cutoff} --domtblout {output.hmmsearchout} --cpu {params.hmmsearch_cpu} {input.modelshmm} {input.faacombined}) &> {log} """ |
183 184 185 186 187 188 189 | shell: """ (echo "python workflow/scripts/hmmsearch_count_filter.py {params.faadirf} {params.modelshmm} {input.hmmsearchout} {output.hitcounts} {output.taxa_removed} {params.minmarker} {params.maxsdup} {params.maxdupl} {output.itolcounts}" python workflow/scripts/hmmsearch_count_filter.py {params.faadirf} {params.modelshmm} {input.hmmsearchout} {output.hitcounts} {output.taxa_removed} {params.minmarker} {params.maxsdup} {params.maxdupl} {output.itolcounts}) &> {log} echo "Number of genomes that were removed due to filtering thresholds: $(wc -l <{output.taxa_removed}) " >> {workflow_log} cat {output.taxa_removed} >> {workflow_log} """ |
209 210 211 212 213 214 215 216 217 218 219 220 221 | shell: """ (echo "python workflow/scripts/extract_qhits.py {input.hmmsearchout} {params.faadirf} {params.extract_processes} {input.taxa_removed} {output}" python workflow/scripts/extract_qhits.py {input.hmmsearchout} {params.faadirf} {params.extract_processes} {input.taxa_removed} {output}) &> {log} declare -i protein_count if [ ! -s {output} ]; then protein_count=0 else protein_count=$(grep -c ">" {output}) fi echo "Extracted proteins from $protein_count different taxa for {params.modelname}" >> {log} touch {output} """ |
242 243 244 245 246 247 248 | shell: """ (echo "mafft{params.mafftv} {params.mafft} --quiet {params.mafft_thread} {input.hitsfaa} > {output.aln}" mafft{params.mafftv} {params.mafft} --quiet {params.mafft_thread} {input.hitsfaa} > {output.aln} || touch {output.aln} echo "trimal {params.trimal_gt} -in {output.aln} -out {output.trimmedaln}" trimal {params.trimal_gt} -in {output.aln} -out {output.trimmedaln} || touch {output.trimmedaln} ) >& {log} """ |
266 267 268 269 270 271 272 | shell: """ (echo "fasttree {params.proteintrees} {input.trimmedaln} > {output.tree}" mkdir -p {params.outdirt} fasttree {params.proteintrees} {input.trimmedaln} > {params.outpath}.treefile || touch {output.tree} cp {params.outpath}.treefile {output.tree}) >& {log} """ |
290 291 292 293 294 295 296 297 | shell: """ (echo "iqtree --prefix {params.outpath} -m {params.proteintrees} -T 1 -fast -s {input.trimmedaln}" mkdir -p {params.outdirt} iqtree --prefix {params.outpath} -m {params.proteintrees} -T 1 -fast -s {input.trimmedaln} || touch {params.outpath}.treefile cp {params.outpath}.treefile {output.tree} touch {output.tree}) >& {log} """ |
311 312 313 314 315 316 | shell: """ (python workflow/scripts/concat.py {params.alndir} {output.concataln} || touch {output.concataln}) >& {log} echo "################################" >> {workflow_log} echo "Number of genomes in the final alignment: $(grep -c '>' {output.concataln})" >> {workflow_log} """ |
333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 | shell: """ declare -i genome_count if [ ! -s {input.concataln} ]; then genome_count=0 else genome_count=$(grep -c ">" {input.concataln}) fi if [ $genome_count -le 3 ]; then error_message="Too few genomes in the supermatrix alignment. Check filtering thresholds to increase numbers of genomes to be retained." echo $error_message >> {log} echo $error_message >&2 echo $error_message >> {workflow_log} exit 1 fi echo "fasttree {params.speciestree} {input.concataln} > {output[0]}" >> {log} fasttree {params.speciestree} {input.concataln} > {output[0]} cp {output[0]} {output[1]} """ |
367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 | shell: """ declare -i genome_count if [ ! -s {input.concataln} ]; then genome_count=0 else genome_count=$(grep -c ">" {input.concataln}) fi if [ $genome_count -le 3 ]; then error_message="Too few genomes in the supermatrix alignment. Check filtering thresholds to increase numbers of genomes to be retained." echo $error_message >> {log} echo $error_message >&2 echo $error_message >> {workflow_log} exit 1 fi echo "iqtree --quiet --prefix {params.outpath} -m {params.speciestree} -fast -s {input.concataln}" >> {log} iqtree --quiet --prefix {params.outpath} -m {params.speciestree} -fast -s {input.concataln} >> {log} 2>&1 cp {output[0]} {output[1]} """ |
404 405 406 407 408 409 410 411 412 413 414 415 | shell: """ if [ -e {params.itoldir}/queries.txt ]; then python workflow/scripts/ete3_nntree.py {input[0]} {params.itoldir}/queries.txt {params.itoldir}/nn.itol.txt ; fi if [ -e {params.itoldir}/queries.txt ]; then python workflow/scripts/ete3_clademembers.py {input[0]} {params.itoldir}/queries.txt cc0000 {params.itoldir}/clademembers_colored.itol.txt sed -i '1 i\DATA' {params.itoldir}/clademembers_colored.itol.txt sed -i '1 i\SEPARATOR COMMA' {params.itoldir}/clademembers_colored.itol.txt sed -i '1 i\TREE_COLOR' {params.itoldir}/clademembers_colored.itol.txt ; fi rm -r {params.q_faa_f_dir} rm -r {params.tmp_dir} find {params.fdir} -type f -empty -print -delete python workflow/scripts/cleanup.py {params.fdir} {output} """ |
Support
- Future updates