Phylogenetic pipeline
Author:
Aleksandra Cupriak
General Description
A phylogenetic pipeline that calculates a set of gene trees and a genome tree based on proteomes.
Requirements
General Information
- The pipeline consists of the following steps: proteome download, clustering, cluster selection, multiple alignment calculation, neighbor-joining tree building, bootstrapping (optional), supertree building, and consensus tree building.
- The correct path to the input file for running the entire pipeline is species/species.txt.
- Each line of the input file (species/species.txt) must contain a UniProt Proteome ID and an organism name, separated by a tab (see the example below this list).
- Running the entire pipeline calculates the genome tree(s) based on the proteomes downloaded for the species given in species/species.txt.
- Specifying the output file of any step when running snakemake makes it possible to run only part of the pipeline, provided the corresponding input files are present.
- Parameters can be passed to the pipeline through the configuration file.
- The pipeline should be run from the main directory of the project.
- The fasturec binary should be located in the usr/bin directory.
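For illustration, species/species.txt could look as follows, with the UniProt Proteome ID and the organism name on each line separated by a single tab. The IDs below are real UniProt reference proteomes, but the species set is only an example:

UP000005640	Homo sapiens
UP000000589	Mus musculus
UP000002311	Saccharomyces cerevisiae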
Configuration file
- The correct template of the configuration file is presented in the file 'configuration.yaml'.
- Parameter specification (a hypothetical configuration sketch follows this list):
- names: output file name fragments
- results: specification of the methods used for generating genome trees
- to_remove: list of organisms to be excluded from the analysis (before clustering)
- c: the c (coverage) parameter for MMseqs2 clustering
- min_cluster_size: int, minimum acceptable cluster size (cluster selection)
- max_cluster_size: int, maximum acceptable cluster size (cluster selection); can be set to False to ignore
- max_repeat: float, the maximum acceptable frequency of the most common organism in a given cluster (cluster selection)
- orthological: bool, whether the cluster must be orthologous (cluster selection)
- choose_random: bool, whether paralogous clusters will be used for the artificial construction of orthologous clusters (cluster selection)
- excluded_species: list of organisms to be excluded from the analysis (after clustering)
- bootstrap: bool, whether bootstrapping will be executed
- bootstrap_thresh: float, the rejection threshold for NJ trees after bootstrapping
- p: float, the p parameter for constructing the majority consensus tree
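For orientation, a hypothetical configuration sketch is shown below. The groupings selecting_clusters, save_proteomes and get_consensus, and the top-level species_file key, are taken from the snakemake.config lookups in the code snippets further down; all values and any remaining structure are illustrative assumptions. The authoritative template is configuration.yaml.

species_file: species/species.txt
save_proteomes:
  to_remove: []            # organisms excluded before clustering
selecting_clusters:
  min_cluster_size: 30     # illustrative value
  max_cluster_size: False  # False disables the upper bound
  max_repeat: 0.04
  orthological: True
  choose_random: True
  excluded_species: []     # organisms excluded after clustering
get_consensus:
  p: 0.5                   # majority-rule threshold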
Usage
To run the entire pipeline:
% snakemake -c [arg] [snakemake parameters]
To run the pipeline for a specific output file:
% snakemake -c [arg] [output] [snakemake parameters]
To learn more about the Snakemake workflow management system, see the Snakemake documentation.
Examples
Running the entire pipeline:
> snakemake -c8 --configfile configuration.yaml
Running the pipeline only up to the clustering step:
> snakemake -c8 "results/mmseq_res/mmseq_result_allorganisms_all_seqs.fasta" --configfile configuration.yaml
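To preview which jobs a given target would trigger without executing anything, snakemake's standard dry-run flag can be combined with either invocation (this is generic snakemake behavior, not specific to this workflow):
> snakemake -n -c8 --configfile configuration.yaml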
Code Snippets
library(ape)
library(phangorn)

# Bootstraps a single NJ tree; returns its path if the mean bootstrap
# support is at least 50, otherwise NULL.
bootstrap <- function(msa_dir, njt_dir, msa_path, njt_path){
    njt_path <- paste(njt_dir, njt_path, sep="/")
    msa_path <- paste(msa_dir, msa_path, sep="/")
    tree <- ape::read.tree(njt_path)
    msa <- phangorn::read.phyDat(msa_path, format='clustal')
    fit <- phangorn::pml(tree, msa)
    tryCatch({
        bs <- bootstrap.pml(fit, bs=100, optNni=TRUE, multicore=TRUE, mc.cores=8,
                            control=pml.control(trace=0))
        tb <- plotBS(fit$tree, bs, type="p")
        tb.node.label <- tb$node.label[!is.na(tb$node.label)]
        mean_support <- mean(tb.node.label)
        if(mean_support >= 50){
            return(njt_path)
        }
    }, error=function(e){})
}

# Creates the output directory (and any missing parents) one level at a time.
create_dir <- function(directory){
    dirs <- strsplit(directory, '/')
    for(i in 1:length(dirs[[1]])){
        to_check <- paste(dirs[[1]][1:i], collapse="/")
        if(!dir.exists(to_check)){
            dir.create(to_check)
        }
    }
}

# Bootstraps every NJ tree and keeps only those passing the support filter.
run <- function(msa_dir, njt_dir){
    nj_paths <- list.files(njt_dir)
    msa_paths <- list.files(msa_dir)
    bs_trees <- mapply(bootstrap, msa_dir=msa_dir, njt_dir=njt_dir,
                       msa_path=msa_paths, njt_path=nj_paths)
    bs_trees <- Filter(Negate(is.null), bs_trees)
    return(bs_trees)
}

outdir <- snakemake@output[[1]]
res <- run(snakemake@input[[1]], snakemake@input[[2]])
create_dir(outdir)
for(file in res){
    print(file)
    file.copy(file, outdir)
}
from Bio import Phylo, AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Align import Applications
from joblib import Parallel, delayed
import multiprocessing
import os
import sys


def create_dirs(path: str) -> None:
    '''Creates directories based on the specified path.'''
    elements = path.split('/')
    for i in range(len(elements)):
        tocheck = '/'.join(elements[0:i+1])
        if not os.path.exists(tocheck):
            try:
                os.mkdir(tocheck)
            except OSError:
                print('ERROR: An invalid path to the directory for output files is given.',
                      file=sys.stderr)
                sys.exit(1)


def run_parallel(func, input_dict: dict, output_dir: str) -> dict:
    '''
    Runs the specified function in parallel for each element of the input
    dictionary. Returns a dictionary with the results of the function.
    '''
    create_dirs(output_dir)
    num_cores = multiprocessing.cpu_count()
    results = Parallel(n_jobs=num_cores)(delayed(func)(obj, name, output_dir)
                                         for name, obj in input_dict.items())
    return dict(results)


def get_multialignment(input_file: str, cluster_name: str, output_dir: str) -> tuple:
    '''Creates a multiple alignment of the protein sequences in the input file.'''
    # Clustal Omega is driven through Biopython's command-line wrapper
    clustalw_cline = Applications.ClustalwCommandline(
        'clustalo', infile=input_file, outfile=f'{output_dir}/MSA_{cluster_name}.aln')
    clustalw_cline()
    proteins_msa = AlignIO.read(f'{output_dir}/MSA_{cluster_name}.aln', format='clustal')
    return (cluster_name, proteins_msa)


def tree_construction(MSA, name: str, output_dir: str) -> tuple:
    '''
    Builds a phylogenetic tree from the multiple alignment using the
    neighbor-joining method and saves it in newick format.
    '''
    cal = DistanceCalculator()
    constr = DistanceTreeConstructor()
    proteins = cal.get_distance(MSA)
    njtree = constr.nj(proteins)
    Phylo.write(njtree, f'{output_dir}/njtree_{name}.nwk', 'newick')
    return (name, f'{output_dir}/njtree_{name}.nwk')


def run(input_dir: str, msa_dir: str, trees_dir: str) -> None:
    '''Runs the pipeline for building phylogenetic trees.'''
    paths = os.listdir(input_dir)
    paths = [path for path in paths if path[0] != '.']
    cluster_names = [input_file.split('.')[:-1] for input_file in paths]
    cluster_names = ['.'.join(name) for name in cluster_names]
    paths = {cluster_name: input_dir + '/' + path
             for cluster_name, path in zip(cluster_names, paths)}
    msa_dict = run_parallel(get_multialignment, paths, msa_dir)
    run_parallel(tree_construction, msa_dict, trees_dir)


if __name__ == '__main__':
    run(snakemake.input[0], snakemake.output[0], snakemake.output[1])
import requests


def download_proteome(entry: str, output_path: str):
    """Downloads one proteome from the UniProt database."""
    url = f'https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=%28%28proteome%3A{entry}%29%29'
    with open(output_path, 'w') as f:
        f.write(requests.get(url).text)


def download_all(species_file: str, output_dir: str):
    """Downloads all proteomes specified in species_file from the UniProt database."""
    # The snakemake output is a file path; keep only its directory part
    output_dir = '/'.join(output_dir.split('/')[:-1])
    with open(species_file, 'r') as file:
        lines = file.readlines()
    proteomes_list = [f'{output_dir}/' + line.split('\t')[0] + '.fasta' for line in lines]
    # Recover the proteome IDs from the target file names and download each one
    entries = [path.split('/')[-1].split('.') for path in proteomes_list]
    entries = ['.'.join(entry[:-1]) for entry in entries]
    for i, entry in enumerate(entries):
        download_proteome(entry.strip(), proteomes_list[i])


if __name__ == '__main__':
    download_all(snakemake.input[0], snakemake.output[0])
library(ape)

# Builds a majority-rule consensus tree from a set of newick trees.
get_consensus <- function(input, output, p) {
    trees <- ape::read.tree(input)
    consensus <- ape::consensus(trees, p=p, rooted=FALSE)
    ape::write.tree(consensus, output)
}

get_consensus(snakemake@input[[1]], snakemake@output[[1]],
              snakemake@config[['get_consensus']][['p']])
def proteomes2onefile(input_file: str, proteome_files: list, output_file: str,
                      excluded_species: list = []):
    '''Saves all proteomes to one fasta file.'''
    with open(input_file, 'r') as f:
        entries = f.readlines()
    ids = [e.split('\t')[0].strip() for e in entries]
    species = [e.split('\t')[1].strip() for e in entries]
    # Drop the proteome files of excluded species before merging
    if excluded_species:
        directory = '/'.join(proteome_files[0].split('/')[:-1])
        to_rm = [directory + '/' + id + '.fasta'
                 for i, id in enumerate(ids) if species[i] in excluded_species]
        for file in to_rm:
            proteome_files.remove(file)
    # Concatenate the remaining proteome files into a single fasta file
    with open(output_file, 'w') as f:
        for proteome in proteome_files:
            try:
                with open(proteome, 'r') as file:
                    f.write(file.read())
            except FileNotFoundError:
                print(f'ERROR: File {proteome} not found!')


if __name__ == '__main__':
    proteomes2onefile(snakemake.input[0], snakemake.params[0], snakemake.output[0],
                      snakemake.config['save_proteomes']['to_remove'])
import os
import random
import sys


def clusters2dict(mmseq_file: str) -> dict:
    """Saves clusters from mmseqs2 to a dictionary."""
    with open(mmseq_file, 'r') as f:
        content = f.read()
    clusters = {}
    key = ''
    for part in content.split('>'):
        if part == '':
            continue
        # Cluster keys are bare accession headers (6 or 10 characters)
        if len(part.strip()) == 6 or len(part.strip()) == 10:
            key = part.strip()
            clusters[part.strip()] = []
        else:
            clusters[key].append(part.strip())
    return clusters


def get_species_name(description: str) -> str:
    """Returns the species name from a fasta description."""
    return description.split('OS=')[1].split('OX=')[0].strip()


def split_desc_seq(protein: str) -> tuple:
    """Returns the description and sequence of a protein entry."""
    return protein.split('\n')[0], protein.split('\n')[1]


def ref_spec_names(species_file: str) -> set:
    """Returns the set of species names from species_file."""
    with open(species_file, 'r') as f:
        species_names = f.readlines()
    species_names = [name.split('\t')[1].strip() for name in species_names]
    return set(species_names)


def check_names(ref_name: str, name: str) -> bool:
    """Checks whether name matches ref_name."""
    ref_name = ref_name.split(' ')
    name = name.split(' ')
    name = [el.replace('(', '') for el in name]
    name = [el.replace(')', '') for el in name]
    # Genus and species epithet must match; any further words of the
    # reference name must also be present
    if ref_name[0] != name[0]:
        return False
    if ref_name[1] != name[1]:
        return False
    if len(ref_name) > 2 and len(name) > 2:
        for ref_el in ref_name[2:]:
            if ref_el not in name:
                return False
    return True


def change_names(clusters: dict, species_names: set) -> None:
    """Changes species names in clusters to the reference names from species_names."""
    change_spec_name = dict()
    for clust in clusters.values():
        to_remove = []
        for i, protein in enumerate(clust):
            try:
                description, seq = split_desc_seq(protein)
                spec_name = get_species_name(description)
                if spec_name not in change_spec_name.keys():
                    for ref_name in species_names:
                        if check_names(ref_name, spec_name):
                            change_spec_name[spec_name] = ref_name
                            break
                clust[i] = (change_spec_name[spec_name], seq)
            except (IndexError, KeyError):
                # Entries without a recognizable species name are dropped
                to_remove.append(protein)
        for protein in to_remove:
            clust.remove(protein)


def select_clusters(clusters: dict, min_cluster_size: int = 30,
                    max_cluster_size: int = 100000, max_repeat: float = 1/30,
                    orthological: bool = False, choose_random: bool = False,
                    excluded_species=set()) -> dict:
    """
    Selects clusters from the clusters dictionary.

    Args:
        clusters (dict): dictionary with clusters
        min_cluster_size (int): minimal size of a cluster
        max_cluster_size (int): maximal size of a cluster
        max_repeat (float): maximal frequency of one species in a cluster
        orthological (bool): if True, only orthological clusters are selected
        choose_random (bool): if True, a random seq per species is selected
        excluded_species (set): set of species to exclude
    Returns:
        selected_clusters (dict): dictionary with selected clusters
    """
    selected_clusters = {}
    for name, clust in clusters.items():
        if len(clust) < min_cluster_size:
            continue
        if max_cluster_size and len(clust) > max_cluster_size:
            continue
        species_in_cluster = [element[0] for element in clust]
        if excluded_species:
            if len([i for i in species_in_cluster if i in excluded_species]) > 0:
                continue
        if not choose_random:
            most_common = max(set(species_in_cluster), key=species_in_cluster.count)
            if species_in_cluster.count(most_common) / len(clust) > max_repeat:
                continue
        if orthological and not choose_random:
            # Reject clusters in which any species occurs more than once
            used_species = set()
            if_next = False
            for spec, _ in clust:
                if spec in used_species:
                    if_next = True
                    break
                used_species.add(spec)
            if if_next:
                continue
        if orthological and choose_random:
            # Build an artificial orthologous cluster by picking one random
            # sequence per species
            species_dict = {}
            new_clust = []
            for spec, seq in clust:
                if spec in species_dict.keys():
                    species_dict[spec].append(seq)
                else:
                    species_dict[spec] = [seq]
            for spec, seqs in species_dict.items():
                random_seq = random.choice(seqs)
                new_clust.append((spec, random_seq))
            if len(new_clust) < min_cluster_size:
                continue
            clust = new_clust
        selected_clusters[name] = clust
    print(f'select_clusters: {len(selected_clusters)} clusters have been selected.')
    return selected_clusters


def check_unique_names(cluster: list) -> list:
    '''
    Checks name uniqueness. If a name is not unique, appends a paralog
    number to it.
    '''
    names = {}
    for i, protein in enumerate(cluster):
        species, seq = protein
        if species in names.keys():
            names[species] += 1
            species += f'_paralog{names[species]+1}'
        else:
            names[species] = 0
        protein = (species, seq)
        cluster[i] = protein
    return cluster


def cluster2fasta(cluster: list, file: str) -> None:
    """Saves a cluster to a fasta file."""
    with open(file, 'w') as f:
        for protein in cluster:
            species, seq = protein
            species = species.replace(' ', '_')
            f.write(f'>{species}\n{seq}\n')


def create_dirs(path: str) -> None:
    """Creates directories for output files."""
    elements = path.split('/')
    for i in range(len(elements)):
        tocheck = '/'.join(elements[0:i+1])
        if not os.path.exists(tocheck):
            try:
                os.mkdir(tocheck)
            except OSError:
                print('ERROR: An invalid path to the directory for output files is given.',
                      file=sys.stderr)
                sys.exit(1)


def run(mmseq_file: str, species_file: str, output_dir: str, min_cluster_size: int,
        max_cluster_size: int, max_repeat: float, orthological: bool,
        choose_random: bool, excluded_species: set) -> None:
    """
    Runs cluster selection.

    Args:
        mmseq_file (str): path to the mmseqs2 result file
        species_file (str): path to the file with species names
        output_dir (str): path to the directory for output files
        min_cluster_size (int): minimal size of a cluster
        max_cluster_size (int): maximal size of a cluster
        max_repeat (float): maximal frequency of one species in a cluster
        orthological (bool): if True, only orthological clusters are selected
        choose_random (bool): if True, a random seq per species is selected
        excluded_species (set): set of species to exclude
    """
    create_dirs(output_dir)
    clusters = clusters2dict(mmseq_file)
    species_names = ref_spec_names(species_file)
    change_names(clusters, species_names)
    selected_clusters = select_clusters(clusters, min_cluster_size, max_cluster_size,
                                        max_repeat, orthological, choose_random,
                                        excluded_species)
    for name, cluster in selected_clusters.items():
        cluster = check_unique_names(cluster)
        cluster2fasta(cluster, f'{output_dir}/{name}.fasta')


if __name__ == '__main__':
    min_cluster_size = snakemake.config['selecting_clusters']['min_cluster_size']
    max_cluster_size = snakemake.config['selecting_clusters']['max_cluster_size']
    max_repeat = snakemake.config['selecting_clusters']['max_repeat']
    orthological = snakemake.config['selecting_clusters']['orthological']
    choose_random = snakemake.config['selecting_clusters']['choose_random']
    excluded_species = set(snakemake.config['selecting_clusters']['excluded_species'])
    run(snakemake.input[0], snakemake.config['species_file'], snakemake.output[0],
        min_cluster_size, max_cluster_size, max_repeat, orthological,
        choose_random, excluded_species)
import os
import re


def replace(string: str) -> str:
    '''
    Removes branch lengths, colons, Inner labels, bootstrap values and
    _paralog suffixes from a tree string.
    '''
    string = re.sub(r'\d+\.\d+', '', string)
    string = re.sub(r':', '', string)
    string = string.replace('Inner', '')
    string = string.replace(';', '')
    string = re.sub(r'\)\d+', ')', string)
    # Strip the _paralogN suffix, keeping the leading name characters
    string = re.sub(r'\w{4}_paralog\d+', lambda x: x.group()[:4], string)
    return string


def save_trees2onefile(input_dir: str, output: str) -> None:
    '''Writes all trees from the input directory to a single newick file.'''
    paths_list = os.listdir(input_dir)
    paths_list = [input_dir + '/' + path for path in paths_list]
    trees_nwk = []
    for t in paths_list:
        with open(t, 'r') as f:
            tree = f.read()
        tree = replace(tree)
        tree = tree.strip()
        trees_nwk.append(tree)
    # Sort trees from the longest to the shortest
    trees_nwk.sort(key=lambda x: len(x), reverse=True)
    with open(output, 'w') as one_file:
        for i, t in enumerate(trees_nwk):
            t = t.replace('-', '')
            if i < len(trees_nwk)-1:
                one_file.write(t + ';\n')
            else:
                one_file.write(t)


if __name__ == '__main__':
    save_trees2onefile(snakemake.input[0], snakemake.output[0])
script:
    "scripts/download_proteomes.py"
script:
    "scripts/save_proteomes.py"
shell:
    "mmseqs easy-cluster {input} results/mmseq_res/mmseq_result_{wildcards.subset} tmp -c {params.c}"
script:
    "scripts/select_clusters.py"
script:
    "scripts/build_trees.py"
script:
    "scripts/bootstrap.R"
script:
    "scripts/trees2onefile.py"
shell:
    """
    # fasturec expects trees without trailing semicolons
    sed 's/;//' {input} > {input}_temp
    # run fasturec and recover the name of its output file from the last log line
    var=$(fasturec -G {input}_temp -Z | tail -1 | cut -d"-" -f2 | cut -d" " -f1)
    # keep only the best (first) supertree and restore the trailing semicolon
    head -n 1 $var | cut -d' ' -f2 > {output}
    sed -i 's/$/;/' {output}
    rm $var
    rm {input}_temp
    """
script:
    "scripts/get_consensus.R"
Support
- Future updates