Phylogenetic Pipeline for Gene and Genome Trees

public 1yr ago 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

Phylogenetic pipeline

Author:

Aleksandra Cupriak

General Description
Requirments
General Information
Usage
Examples

General Description

Phylogenetic pipeline leading to the calculation of a set of gene trees and a genome tree based on proteomes

Requirments

General Information

Pipeline consists of the following steps: proteomes downloading, clustering, clusters selecting, multialignments calculation, neighbor joining trees building, bootsrapping (optionally), supertree building, consensus tree building
The correct path to the input file for running the entire pipeline: species/species.txt
Input file for the entire pipeline (species/species.txt) in each line must contain the UniProt Proteome ID and organism name (separated by tab)
Running the entire pipeline will lead to the calculation of the genome tree/s based on the downloaded proteomes for the species given in the input file species/species.txt
Specifying the output file for any step in the pipeline when run snakemake allows to run only a part of the pipeline if the corresponding input files are present
Parameters to pipeline can be passed through the configuration file
Pipeline should be run from the main directory of the project
Binary for fasturec should be located in usr/bin directory

Configuration file

The correct template of the configuration file is presented in the file 'configuration.yaml'
Parameters specification:
- names: output file name fragments
- results: specification of methods for generating a genome trees
- to_remove: list of organisms that will be excluded from the analysis (before clustering)
- c: c parameter for MMseq2 clustering ()
- min_cluster_size: int, minimum acceptable cluster size (cluster selection)
- max_cluster_size: int, maximum acceptable cluster size (cluster selection); to ignore can be set as False
- max_repeat: float, the maximum acceptable frequency of the most common organism in a given cluster (cluster selection)
- orthological: bool, whether the cluster must be otological (cluster selection)
- choose_random: bool, whether paralogous clusters will be used for the artificial construction of orthologous clusters (cluster selection)
- excluded_species: list of organisms that will be excluded from the analysis (after clustering)
- bootstrap: bool, whether bootsrap will be executed
- bootstrap_thresh: float, the rejection threshold of NJ trees after bootstrap
- p: float, parameter p to construct the majority consensus tree

Usage

Usage:

To run the entire pipeline:

% snakemake -c [arg] [snakemake parameters]

To run the pipline for specific output file:

% snakemake -c [arg] [output] [snakemake parameters]

To know more about usage of snakemake workflow management system read snakemake documentation.

Examples

Running the entire pipeline:
> snakemake -c8 --configfile configuration.yaml
Running piepelin only to the clustering step:
> snakemake -c8 "results/mmseq_res/mmseq_result_allorganisms_all_seqs.fasta" --configfile configuration.yaml

Code Snippets

library(ape)
library(phangorn)


bootstrap <- function(msa_dir, njt_dir, msa_path, njt_path){
  njt_path <- paste(njt_dir, njt_path, sep="/")
  msa_path <- paste(msa_dir, msa_path, sep="/")
  tree <- ape::read.tree(njt_path)
  msa <- phangorn::read.phyDat(msa_path, format='clustal')
  fit <- phangorn::pml(tree, msa)
  tryCatch({
    bs <- bootstrap.pml(fit, bs=100, optNni=TRUE, multicore=TRUE, mc.cores=8, control = pml.control(trace=0))
    tb <- plotBS(fit$tree, bs, type="p")
    tb.node.label <- tb$node.label[!is.na(tb$node.label)]
    mean <- mean(tb.node.label)
    if(mean>=50){
      return(njt_path)
    }
  }, error=function(e){})
}

create_dir <- function(directory){
  dirs <- strsplit(directory, '/')
  for(i in 1:length(dirs[[1]])){
    to_check <- paste(dirs[[1]][1:i], collapse="/")
    if(!dir.exists(to_check)){
      dir.create(to_check)
    }
  }
}

run <- function(msa_dir, njt_dir){
  nj_paths <- list.files(njt_dir)
  msa_paths <- list.files(msa_dir)
  bs_trees <- mapply(bootstrap, msa_dir=msa_dir, njt_dir=njt_dir, msa_path=msa_paths, njt_path=nj_paths)
  bs_trees <- Filter(Negate(is.null), bs_trees)
  return(bs_trees)
}

outdir <- snakemake@output[[1]]
res <- run(snakemake@input[[1]], snakemake@input[[2]])
create_dir(outdir)
for(file in res){
  print(file)
  file.copy(file, outdir)
}

R APE phangorn From line 1 of scripts/bootstrap.R

from Bio import Phylo, AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Align import Applications
from joblib import Parallel, delayed
import multiprocessing
import os
import sys



def create_dirs(path: str) -> None:
    '''
    Creates directories based on the specified path
    '''
    elements = path.split('/')
    for i in range(len(elements)):
        tocheck = '/'.join(elements[0:i+1])
        if not os.path.exists(tocheck):
            try:
                os.mkdir(tocheck)
            except:
                sys.stderr = print('ERROR: An invalid path to the directory for output files is given.')
                sys.exit(0)


def run_parallel(func, input_dict: dict, output_dir: str) -> dict:
    '''
    Runs the specified function in parallel for each element of the input dictionary.
    Returns a dictionary with the results of the function.
    '''
    create_dirs(output_dir)
    num_cores = multiprocessing.cpu_count()
    results = Parallel(n_jobs=num_cores)(delayed(func)(object, name, output_dir) for name, object in input_dict.items())
    # make dict
    results_dict = dict()
    for result in results:
        results_dict[result[0]] = result[1]
    return results_dict


def get_multialignment(input_file: str, cluster_name: str, output_dir: str) -> tuple:
    '''
    Creates multialignment based on protein sequences passed in the input file.
    '''
    clustalw_cline = Applications.ClustalwCommandline('clustalo', infile=input_file,
                                                    outfile=f'{output_dir}/MSA_{cluster_name}.aln')
    clustalw_cline()
    proteins_msa = AlignIO.read(f'{output_dir}/MSA_{cluster_name}.aln', format='clustal')

    return (cluster_name, proteins_msa)


def tree_construction(MSA, name: str, output_dir: str) -> tuple:
    '''
    Based on multilineage, generates phylogenetic trees using the neighbor-joining method for the given blosum matrix. 
    Creates images and saves in newick files.
    '''
    cal = DistanceCalculator()
    constr = DistanceTreeConstructor()
    proteins = cal.get_distance(MSA)

    njtree = constr.nj(proteins)
    Phylo.write(njtree, f'{output_dir}/njtree_{name}.nwk', 'newick')

    return (name, f'{output_dir}/njtree_{name}.nwk')


def run(input_dir: str, msa_dir: str, trees_dir: str) -> None:
    '''
    Runs the pipeline for building phylogenetic trees.
    '''
    paths = os.listdir(input_dir)
    paths = [path for path in paths if path[0] != '.']
    cluster_names = [input_file.split('.')[:-1] for input_file in paths]
    cluster_names = ['.'.join(name) for name in cluster_names]
    paths = {cluster_name: input_dir + '/' + path for cluster_name, path in zip(cluster_names, paths)}
    msa_dict = run_parallel(get_multialignment, paths, msa_dir)
    run_parallel(tree_construction, msa_dict, trees_dir)


if __name__ == '__main__':
    run(snakemake.input[0], snakemake.output[0], snakemake.output[1])

Python Biopython joblib clustalo From line 1 of scripts/build_trees.py

import os
import requests


def download_proteome(entery: str, output_path: str):
    """
    Downloads one proteome from uniprot database.
    """
    url = f'https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=%28%28proteome%3A{entery}%29%29'
    with open(output_path, 'w') as f:
        f.write(requests.get(url).text)


def download_all(species_file: str, output_dir: str):
    """
    Downloads all proteomes specified in species_file from uniprot database.
    """
    output_dir = output_dir.split('/')[:-1]
    output_dir = '/'.join(output_dir)
    with open(species_file, 'r') as file:
        lines = file.readlines()
        proteomes_list = [f'{output_dir}/'+line.split('\t')[0]+'.fasta' for line in lines]
    # Download
    entries = [path.split('/')[-1].split('.') for path in proteomes_list]
    entries = ['.'.join(entry[:-1]) for entry in entries]
    for i, entry in enumerate(entries):
        download_proteome(entry.strip(), proteomes_list[i])



if __name__ == '__main__':
    download_all(snakemake.input[0], snakemake.output[0])

Python requests From line 1 of scripts/download_proteomes.py

library(ape)

get_consensus <- function(input, output, p) {
  trees <- ape::read.tree(input)
  consensus <- ape::consensus(trees, p=p, rooted = FALSE)
  ape::write.tree(consensus, output)
}

get_consensus(snakemake@input[[1]], snakemake@output[[1]], snakemake@config[['get_consensus']]['p'])

R Consensus APE From line 1 of scripts/get_consensus.R

import os


def proteomes2onefile(input_file: str, proteome_files: list, 
                      output_file: str, excluded_species: list=[]):
    '''
    Saves all proteomes to one fasta file.
    '''
    with open(input_file, 'r') as f:
        entries = f.readlines()
        ids = [e.split('\t')[0].strip() for e in entries]
        species = [e.split('\t')[1].strip() for e in entries]   
    # Saves all proteomes to one fasta file
    if excluded_species:
        directory = proteome_files[0].split('/')[:-1]
        directory = ('/').join(directory)
        to_rm = [directory+'/'+id+'.fasta' for i, id in enumerate(ids) if species[i] in excluded_species]
        for file in to_rm:
            proteome_files.remove(file)
    with open(output_file, 'w') as f:
        for proteome in proteome_files:
            try:
                with open(proteome, 'r') as file:
                    f.write(file.read())
            except FileNotFoundError:
                print(f'ERROR: File {proteome} not found!')


if __name__ == '__main__':
    proteomes2onefile(snakemake.input[0], snakemake.params[0], snakemake.output[0], snakemake.config['save_proteomes']['to_remove'])

Python From line 1 of scripts/save_proteomes.py

import os
import random
import sys


def clusters2dict(mmseq_file: str) -> dict:
    """
    Saves clusters from mmseqs2 to dictionary.
    """
    with open(mmseq_file  , 'r') as f:
        contet = f.read()
    clusters = {}
    key = ''
    for part in contet.split('>'):
        if part == '': continue
        if len(part.strip()) == 6 or len(part.strip()) == 10:
            key = part.strip()
            clusters[part.strip()] = []
        else:
            clusters[key].append(part.strip())
    return clusters


def get_species_name(description: str) -> str:
    """
    Returns species name from description.
    """
    return description.split('OS=')[1].split('OX=')[0].strip()


def split_desc_seq(protein: str) -> str:
    """
    Returns description and sequence from protein.
    """
    return protein.split('\n')[0], protein.split('\n')[1]


def ref_spec_names(species_file: str) -> set:
    """
    Returns set of species names from species_file.
    """
    with open(species_file, 'r') as f:
        species_names = f.readlines()
        species_names = [name.split('\t')[1].strip() for name in species_names]
    return set(species_names)


def check_names(ref_name: str, name: str) -> bool:
    """
    Checks if name is the same as ref_name.
    """
    ref_name = ref_name.split(' ')
    name = name.split(' ')
    name = [el.replace('(', '') for el in name]
    name = [el.replace(')', '') for el in name]
    # check
    if ref_name[0] != name[0]: return False
    if ref_name[1] != name[1]: return False
    if len(ref_name) > 2 and len(name) > 2:
        for ref_el in ref_name[2:]:
            if ref_el not in name: return False
    return True


def change_names(clusters: dict, species_names: set) -> None:
    """
    Changes species names in clusters to names from species_names.
    """
    change_spec_name = dict()
    for clust in clusters.values():
        to_remove = []
        for i, protein in enumerate(clust):
            try:
                description, seq = split_desc_seq(protein)
                spec_name = get_species_name(description)
                if spec_name not in change_spec_name.keys():
                    for ref_name in species_names:
                        if check_names(ref_name, spec_name):
                            change_spec_name[spec_name] = ref_name
                            break
                clust[i] = (change_spec_name[spec_name], seq)
            except:
                to_remove.append(protein)
        for protein in to_remove:
            clust.remove(protein)

def select_clusters(clusters: dict, min_cluster_size: int = 30, max_cluster_size: int = 100000, 
            max_repeat: float = 1/30, orthological: bool = False, choose_random: bool = False,
            excluded_species=set()) -> dict:
    """
    Selects clusters from clusters dictionary.
    Args:
        clusters (dict): dictionary with clusters
        min_cluster_size (int): minimal size of cluster
        max_cluster_size (int): maximal size of cluster
        max_repeat (float): maximal repeat of one species in cluster
        orthological (bool): if True, only orthological clusters are selected
        choose_random (bool): if True, random seq from species is selected
        excluded_species (set): set of species to exclude
    Returns:
        selected_clusters (dict): dictionary with selected clusters
    """
    selected_clusters = {}
    for name, clust in clusters.items():
        if len(clust) < min_cluster_size: continue
        if max_cluster_size and len(clust) > max_cluster_size: continue
        species_in_cluster = [element[0] for element in clust]
        if excluded_species:
            if len([i for i in species_in_cluster if i in excluded_species]) > 0: continue

        if not choose_random:
            most_common = max(set(species_in_cluster), key=species_in_cluster.count)
            if species_in_cluster.count(most_common) / len(clust) > max_repeat:
                continue

        if orthological and not choose_random:
            used_species = set()
            if_next = False
            for spec, _ in clust:
                if spec in used_species:
                    if_next = True
                    break
                used_species.add(spec)
            if if_next: continue

        if orthological and choose_random:
            species_dict = {}
            new_clust = []
            for spec, seq in clust:
                if spec in species_dict.keys():
                    species_dict[spec].append(seq)
                else:
                    species_dict[spec] = [seq]
            for spec, seqs in species_dict.items():
                # random choose of one seq from seqs
                random_seq = random.choice(seqs)
                new_clust.append((spec, random_seq))
            if len(new_clust) < min_cluster_size: continue
            clust = new_clust

        selected_clusters[name] = clust
    print(f'select_clusters: {len(selected_clusters)} clusters have been selected.')
    return selected_clusters


def check_unique_names(cluster: list) -> list:
    '''
    Checks name uniqueness.
    If the name is not unique, adds number to the end of name.
    '''
    names = {}
    for i, protein in enumerate(cluster):
        species, seq = protein
        if species in names.keys():
            names[species] += 1
            species += f'_paralog{names[species]+1}'
        else:
            names[species] = 0
        protein = (species, seq)
        cluster[i] = protein
    return cluster


def cluster2fasta(cluster: list, file: str) -> None:
    """
    Saves cluster to fasta file.
    """
    with open(file, 'w') as f:
        for protein in cluster:
            species, seq = protein
            species = species.replace(' ', '_')
            f.write(f'>{species}\n{seq}\n')


def create_dirs(path: str) -> None:
    """
    Creates directories for output files.
    """
    elements = path.split('/')
    for i in range(len(elements)):
        tocheck = '/'.join(elements[0:i+1])
        if not os.path.exists(tocheck):
            try:
                os.mkdir(tocheck)
            except:
                sys.stderr = print('ERROR: An invalid path to the directory for output files is given.')
                sys.exit(0)



def run(mmseq_file: str, species_file: str, output_dir: str,  
        min_cluster_size: int, max_cluster_size: int, max_repeat: float, 
        orthological: bool, choose_random: bool, excluded_species: set) -> None:
    """
    Runs selecting clusters.
    Args:
        mmseq_file (str): path to mmseq file
        species_file (str): path to file with species names
        output_dir (str): path to directory for output files
        min_cluster_size (int): minimal size of cluster
        max_cluster_size (int): maximal size of cluster
        max_repeat (float): maximal repeat of one species in cluster
        orthological (bool): if True, only orthological clusters are selected
        choose_random (bool): if True, random seq from species is selected
        excluded_species (set): set of species to exclude
    """
    create_dirs(output_dir)
    clusters = clusters2dict(mmseq_file)
    species_names = ref_spec_names(species_file)
    change_names(clusters, species_names)
    selected_clusters = select_clusters(clusters, min_cluster_size, max_cluster_size, 
                                        max_repeat, orthological, choose_random, excluded_species)
    for name, cluster in selected_clusters.items():
        cluster = check_unique_names(cluster)
        cluster2fasta(cluster, f'{output_dir}/{name}.fasta')


if __name__ == '__main__':
    min_cluster_size = snakemake.config['selecting_clusters']['min_cluster_size']
    max_cluster_size = snakemake.config['selecting_clusters']['max_cluster_size']
    max_repeat = snakemake.config['selecting_clusters']['max_repeat']
    orthological = snakemake.config['selecting_clusters']['orthological']
    choose_random = snakemake.config['selecting_clusters']['choose_random']
    excluded_species = snakemake.config['selecting_clusters']['excluded_species']
    excluded_species = set(excluded_species)
    run(snakemake.input[0], snakemake.config['species_file'], snakemake.output[0],
        min_cluster_size, max_cluster_size, max_repeat, orthological, choose_random, excluded_species)

Python From line 1 of scripts/select_clusters.py

import os
import re



def replace(string: str) -> str:
    '''
    Remove floats and : from tree string. Shorten names to 4 letters. Remove _paralog and number
    '''
    string = re.sub(r'\d+\.\d+', '', string)
    string = re.sub(r':', '', string)
    string = re.sub(r'\w{4}\w+', lambda x: x.group(), string)
    string = string.replace('Inner', '')
    string = string.replace(';', '')
    string = re.sub(r'\)\d+', ')', string)
    string = re.sub(r'\w{4}_paralog\d+', lambda x: x.group()[:4], string)

    return string


def save_trees2onefile(input_dir: dict, output: str) -> None:
    '''
    Writes all sequences from the input file to a single fasta file
    '''
    paths_list = os.listdir(input_dir)
    paths_list = [input_dir+'/'+path for path in paths_list]
    trees_nwk = []
    for t in paths_list:
        with open(t, 'r') as f:
            tree = f.read()
        tree = replace(tree)
        tree = tree.strip()
        trees_nwk.append(tree)
    # sort trees from the longest to the shortest
    trees_nwk.sort(key=lambda x: len(x), reverse=True)
    with open(output, 'w') as one_file:
        for i, t in enumerate(trees_nwk):
            t = t.replace('-', '')
            if i < len(trees_nwk)-1:
                one_file.write(t+';\n')
            else:
                one_file.write(t)

if __name__ == '__main__':
    save_trees2onefile(snakemake.input[0], snakemake.output[0])

Python From line 1 of scripts/trees2onefile.py

script:
    "scripts/download_proteomes.py"

SnakeMake From line 46 of main/Snakefile

script:
    "scripts/save_proteomes.py"

SnakeMake From line 58 of main/Snakefile

shell:
    "mmseqs easy-cluster {input} results/mmseq_res/mmseq_result_{wildcards.subset} tmp -c {params.c}"

SnakeMake MMseqs From line 69 of main/Snakefile

script:
    "scripts/select_clusters.py"

SnakeMake From line 78 of main/Snakefile

script:
    "scripts/build_trees.py"

SnakeMake From line 88 of main/Snakefile

script:
    "scripts/bootstrap.R"

SnakeMake From line 98 of main/Snakefile

script:
    "scripts/trees2onefile.py"

SnakeMake From line 120 of main/Snakefile

shell:
    """
    sed 's/;//' {input} > {input}_temp
    var=$(fasturec -G {input}_temp -Z | tail -1 | cut -d"-" -f2 | cut -d" " -f1)
    head -n 1 $var | cut -d' ' -f2 > {output}
    sed -i 's/$/;/' {output}
    rm $var
    rm {input}_temp
    """