Phylogenetic Pipeline for Gene and Genome Trees

public public 1yr ago 0 bookmarks

Phylogenetic pipeline

Author:

Aleksandra Cupriak

Contents

General Description

Phylogenetic pipeline leading to the calculation of a set of gene trees and a genome tree based on proteomes

Requirments

General Information

  • Pipeline consists of the following steps: proteomes downloading, clustering, clusters selecting, multialignments calculation, neighbor joining trees building, bootsrapping (optionally), supertree building, consensus tree building

  • The correct path to the input file for running the entire pipeline: species/species.txt

  • Input file for the entire pipeline (species/species.txt) in each line must contain the UniProt Proteome ID and organism name (separated by tab)

  • Running the entire pipeline will lead to the calculation of the genome tree/s based on the downloaded proteomes for the species given in the input file species/species.txt

  • Specifying the output file for any step in the pipeline when run snakemake allows to run only a part of the pipeline if the corresponding input files are present

  • Parameters to pipeline can be passed through the configuration file

  • Pipeline should be run from the main directory of the project

  • Binary for fasturec should be located in usr/bin directory

Configuration file

  • The correct template of the configuration file is presented in the file 'configuration.yaml'

  • Parameters specification:

    • names: output file name fragments

    • results: specification of methods for generating a genome trees

    • to_remove: list of organisms that will be excluded from the analysis (before clustering)

    • c: c parameter for MMseq2 clustering ()

    • min_cluster_size: int, minimum acceptable cluster size (cluster selection)

    • max_cluster_size: int, maximum acceptable cluster size (cluster selection); to ignore can be set as False

    • max_repeat: float, the maximum acceptable frequency of the most common organism in a given cluster (cluster selection)

    • orthological: bool, whether the cluster must be otological (cluster selection)

    • choose_random: bool, whether paralogous clusters will be used for the artificial construction of orthologous clusters (cluster selection)

    • excluded_species: list of organisms that will be excluded from the analysis (after clustering)

    • bootstrap: bool, whether bootsrap will be executed

    • bootstrap_thresh: float, the rejection threshold of NJ trees after bootstrap

    • p: float, parameter p to construct the majority consensus tree

Usage

Usage:

To run the entire pipeline:

% snakemake -c [arg] [snakemake parameters]

To run the pipline for specific output file:

% snakemake -c [arg] [output] [snakemake parameters]

To know more about usage of snakemake workflow management system read snakemake documentation.

Examples

Running the entire pipeline:
> snakemake -c8 --configfile configuration.yaml
Running piepelin only to the clustering step:
> snakemake -c8 "results/mmseq_res/mmseq_result_allorganisms_all_seqs.fasta" --configfile configuration.yaml

Code Snippets

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
library(ape)
library(phangorn)


bootstrap <- function(msa_dir, njt_dir, msa_path, njt_path){
  njt_path <- paste(njt_dir, njt_path, sep="/")
  msa_path <- paste(msa_dir, msa_path, sep="/")
  tree <- ape::read.tree(njt_path)
  msa <- phangorn::read.phyDat(msa_path, format='clustal')
  fit <- phangorn::pml(tree, msa)
  tryCatch({
    bs <- bootstrap.pml(fit, bs=100, optNni=TRUE, multicore=TRUE, mc.cores=8, control = pml.control(trace=0))
    tb <- plotBS(fit$tree, bs, type="p")
    tb.node.label <- tb$node.label[!is.na(tb$node.label)]
    mean <- mean(tb.node.label)
    if(mean>=50){
      return(njt_path)
    }
  }, error=function(e){})
}

create_dir <- function(directory){
  dirs <- strsplit(directory, '/')
  for(i in 1:length(dirs[[1]])){
    to_check <- paste(dirs[[1]][1:i], collapse="/")
    if(!dir.exists(to_check)){
      dir.create(to_check)
    }
  }
}

run <- function(msa_dir, njt_dir){
  nj_paths <- list.files(njt_dir)
  msa_paths <- list.files(msa_dir)
  bs_trees <- mapply(bootstrap, msa_dir=msa_dir, njt_dir=njt_dir, msa_path=msa_paths, njt_path=nj_paths)
  bs_trees <- Filter(Negate(is.null), bs_trees)
  return(bs_trees)
}

outdir <- snakemake@output[[1]]
res <- run(snakemake@input[[1]], snakemake@input[[2]])
create_dir(outdir)
for(file in res){
  print(file)
  file.copy(file, outdir)
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
from Bio import Phylo, AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Align import Applications
from joblib import Parallel, delayed
import multiprocessing
import os
import sys



def create_dirs(path: str) -> None:
    '''
    Creates directories based on the specified path
    '''
    elements = path.split('/')
    for i in range(len(elements)):
        tocheck = '/'.join(elements[0:i+1])
        if not os.path.exists(tocheck):
            try:
                os.mkdir(tocheck)
            except:
                sys.stderr = print('ERROR: An invalid path to the directory for output files is given.')
                sys.exit(0)


def run_parallel(func, input_dict: dict, output_dir: str) -> dict:
    '''
    Runs the specified function in parallel for each element of the input dictionary.
    Returns a dictionary with the results of the function.
    '''
    create_dirs(output_dir)
    num_cores = multiprocessing.cpu_count()
    results = Parallel(n_jobs=num_cores)(delayed(func)(object, name, output_dir) for name, object in input_dict.items())
    # make dict
    results_dict = dict()
    for result in results:
        results_dict[result[0]] = result[1]
    return results_dict


def get_multialignment(input_file: str, cluster_name: str, output_dir: str) -> tuple:
    '''
    Creates multialignment based on protein sequences passed in the input file.
    '''
    clustalw_cline = Applications.ClustalwCommandline('clustalo', infile=input_file,
                                                    outfile=f'{output_dir}/MSA_{cluster_name}.aln')
    clustalw_cline()
    proteins_msa = AlignIO.read(f'{output_dir}/MSA_{cluster_name}.aln', format='clustal')

    return (cluster_name, proteins_msa)


def tree_construction(MSA, name: str, output_dir: str) -> tuple:
    '''
    Based on multilineage, generates phylogenetic trees using the neighbor-joining method for the given blosum matrix. 
    Creates images and saves in newick files.
    '''
    cal = DistanceCalculator()
    constr = DistanceTreeConstructor()
    proteins = cal.get_distance(MSA)

    njtree = constr.nj(proteins)
    Phylo.write(njtree, f'{output_dir}/njtree_{name}.nwk', 'newick')

    return (name, f'{output_dir}/njtree_{name}.nwk')


def run(input_dir: str, msa_dir: str, trees_dir: str) -> None:
    '''
    Runs the pipeline for building phylogenetic trees.
    '''
    paths = os.listdir(input_dir)
    paths = [path for path in paths if path[0] != '.']
    cluster_names = [input_file.split('.')[:-1] for input_file in paths]
    cluster_names = ['.'.join(name) for name in cluster_names]
    paths = {cluster_name: input_dir + '/' + path for cluster_name, path in zip(cluster_names, paths)}
    msa_dict = run_parallel(get_multialignment, paths, msa_dir)
    run_parallel(tree_construction, msa_dict, trees_dir)


if __name__ == '__main__':
    run(snakemake.input[0], snakemake.output[0], snakemake.output[1])
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
import os
import requests


def download_proteome(entery: str, output_path: str):
    """
    Downloads one proteome from uniprot database.
    """
    url = f'https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=%28%28proteome%3A{entery}%29%29'
    with open(output_path, 'w') as f:
        f.write(requests.get(url).text)


def download_all(species_file: str, output_dir: str):
    """
    Downloads all proteomes specified in species_file from uniprot database.
    """
    output_dir = output_dir.split('/')[:-1]
    output_dir = '/'.join(output_dir)
    with open(species_file, 'r') as file:
        lines = file.readlines()
        proteomes_list = [f'{output_dir}/'+line.split('\t')[0]+'.fasta' for line in lines]
    # Download
    entries = [path.split('/')[-1].split('.') for path in proteomes_list]
    entries = ['.'.join(entry[:-1]) for entry in entries]
    for i, entry in enumerate(entries):
        download_proteome(entry.strip(), proteomes_list[i])



if __name__ == '__main__':
    download_all(snakemake.input[0], snakemake.output[0])
1
2
3
4
5
6
7
8
9
library(ape)

get_consensus <- function(input, output, p) {
  trees <- ape::read.tree(input)
  consensus <- ape::consensus(trees, p=p, rooted = FALSE)
  ape::write.tree(consensus, output)
}

get_consensus(snakemake@input[[1]], snakemake@output[[1]], snakemake@config[['get_consensus']]['p'])
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
import os


def proteomes2onefile(input_file: str, proteome_files: list, 
                      output_file: str, excluded_species: list=[]):
    '''
    Saves all proteomes to one fasta file.
    '''
    with open(input_file, 'r') as f:
        entries = f.readlines()
        ids = [e.split('\t')[0].strip() for e in entries]
        species = [e.split('\t')[1].strip() for e in entries]   
    # Saves all proteomes to one fasta file
    if excluded_species:
        directory = proteome_files[0].split('/')[:-1]
        directory = ('/').join(directory)
        to_rm = [directory+'/'+id+'.fasta' for i, id in enumerate(ids) if species[i] in excluded_species]
        for file in to_rm:
            proteome_files.remove(file)
    with open(output_file, 'w') as f:
        for proteome in proteome_files:
            try:
                with open(proteome, 'r') as file:
                    f.write(file.read())
            except FileNotFoundError:
                print(f'ERROR: File {proteome} not found!')


if __name__ == '__main__':
    proteomes2onefile(snakemake.input[0], snakemake.params[0], snakemake.output[0], snakemake.config['save_proteomes']['to_remove'])
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
import os
import random
import sys


def clusters2dict(mmseq_file: str) -> dict:
    """
    Saves clusters from mmseqs2 to dictionary.
    """
    with open(mmseq_file  , 'r') as f:
        contet = f.read()
    clusters = {}
    key = ''
    for part in contet.split('>'):
        if part == '': continue
        if len(part.strip()) == 6 or len(part.strip()) == 10:
            key = part.strip()
            clusters[part.strip()] = []
        else:
            clusters[key].append(part.strip())
    return clusters


def get_species_name(description: str) -> str:
    """
    Returns species name from description.
    """
    return description.split('OS=')[1].split('OX=')[0].strip()


def split_desc_seq(protein: str) -> str:
    """
    Returns description and sequence from protein.
    """
    return protein.split('\n')[0], protein.split('\n')[1]


def ref_spec_names(species_file: str) -> set:
    """
    Returns set of species names from species_file.
    """
    with open(species_file, 'r') as f:
        species_names = f.readlines()
        species_names = [name.split('\t')[1].strip() for name in species_names]
    return set(species_names)


def check_names(ref_name: str, name: str) -> bool:
    """
    Checks if name is the same as ref_name.
    """
    ref_name = ref_name.split(' ')
    name = name.split(' ')
    name = [el.replace('(', '') for el in name]
    name = [el.replace(')', '') for el in name]
    # check
    if ref_name[0] != name[0]: return False
    if ref_name[1] != name[1]: return False
    if len(ref_name) > 2 and len(name) > 2:
        for ref_el in ref_name[2:]:
            if ref_el not in name: return False
    return True


def change_names(clusters: dict, species_names: set) -> None:
    """
    Changes species names in clusters to names from species_names.
    """
    change_spec_name = dict()
    for clust in clusters.values():
        to_remove = []
        for i, protein in enumerate(clust):
            try:
                description, seq = split_desc_seq(protein)
                spec_name = get_species_name(description)
                if spec_name not in change_spec_name.keys():
                    for ref_name in species_names:
                        if check_names(ref_name, spec_name):
                            change_spec_name[spec_name] = ref_name
                            break
                clust[i] = (change_spec_name[spec_name], seq)
            except:
                to_remove.append(protein)
        for protein in to_remove:
            clust.remove(protein)

def select_clusters(clusters: dict, min_cluster_size: int = 30, max_cluster_size: int = 100000, 
            max_repeat: float = 1/30, orthological: bool = False, choose_random: bool = False,
            excluded_species=set()) -> dict:
    """
    Selects clusters from clusters dictionary.
    Args:
        clusters (dict): dictionary with clusters
        min_cluster_size (int): minimal size of cluster
        max_cluster_size (int): maximal size of cluster
        max_repeat (float): maximal repeat of one species in cluster
        orthological (bool): if True, only orthological clusters are selected
        choose_random (bool): if True, random seq from species is selected
        excluded_species (set): set of species to exclude
    Returns:
        selected_clusters (dict): dictionary with selected clusters
    """
    selected_clusters = {}
    for name, clust in clusters.items():
        if len(clust) < min_cluster_size: continue
        if max_cluster_size and len(clust) > max_cluster_size: continue
        species_in_cluster = [element[0] for element in clust]
        if excluded_species:
            if len([i for i in species_in_cluster if i in excluded_species]) > 0: continue

        if not choose_random:
            most_common = max(set(species_in_cluster), key=species_in_cluster.count)
            if species_in_cluster.count(most_common) / len(clust) > max_repeat:
                continue

        if orthological and not choose_random:
            used_species = set()
            if_next = False
            for spec, _ in clust:
                if spec in used_species:
                    if_next = True
                    break
                used_species.add(spec)
            if if_next: continue

        if orthological and choose_random:
            species_dict = {}
            new_clust = []
            for spec, seq in clust:
                if spec in species_dict.keys():
                    species_dict[spec].append(seq)
                else:
                    species_dict[spec] = [seq]
            for spec, seqs in species_dict.items():
                # random choose of one seq from seqs
                random_seq = random.choice(seqs)
                new_clust.append((spec, random_seq))
            if len(new_clust) < min_cluster_size: continue
            clust = new_clust

        selected_clusters[name] = clust
    print(f'select_clusters: {len(selected_clusters)} clusters have been selected.')
    return selected_clusters


def check_unique_names(cluster: list) -> list:
    '''
    Checks name uniqueness.
    If the name is not unique, adds number to the end of name.
    '''
    names = {}
    for i, protein in enumerate(cluster):
        species, seq = protein
        if species in names.keys():
            names[species] += 1
            species += f'_paralog{names[species]+1}'
        else:
            names[species] = 0
        protein = (species, seq)
        cluster[i] = protein
    return cluster


def cluster2fasta(cluster: list, file: str) -> None:
    """
    Saves cluster to fasta file.
    """
    with open(file, 'w') as f:
        for protein in cluster:
            species, seq = protein
            species = species.replace(' ', '_')
            f.write(f'>{species}\n{seq}\n')


def create_dirs(path: str) -> None:
    """
    Creates directories for output files.
    """
    elements = path.split('/')
    for i in range(len(elements)):
        tocheck = '/'.join(elements[0:i+1])
        if not os.path.exists(tocheck):
            try:
                os.mkdir(tocheck)
            except:
                sys.stderr = print('ERROR: An invalid path to the directory for output files is given.')
                sys.exit(0)



def run(mmseq_file: str, species_file: str, output_dir: str,  
        min_cluster_size: int, max_cluster_size: int, max_repeat: float, 
        orthological: bool, choose_random: bool, excluded_species: set) -> None:
    """
    Runs selecting clusters.
    Args:
        mmseq_file (str): path to mmseq file
        species_file (str): path to file with species names
        output_dir (str): path to directory for output files
        min_cluster_size (int): minimal size of cluster
        max_cluster_size (int): maximal size of cluster
        max_repeat (float): maximal repeat of one species in cluster
        orthological (bool): if True, only orthological clusters are selected
        choose_random (bool): if True, random seq from species is selected
        excluded_species (set): set of species to exclude
    """
    create_dirs(output_dir)
    clusters = clusters2dict(mmseq_file)
    species_names = ref_spec_names(species_file)
    change_names(clusters, species_names)
    selected_clusters = select_clusters(clusters, min_cluster_size, max_cluster_size, 
                                        max_repeat, orthological, choose_random, excluded_species)
    for name, cluster in selected_clusters.items():
        cluster = check_unique_names(cluster)
        cluster2fasta(cluster, f'{output_dir}/{name}.fasta')


if __name__ == '__main__':
    min_cluster_size = snakemake.config['selecting_clusters']['min_cluster_size']
    max_cluster_size = snakemake.config['selecting_clusters']['max_cluster_size']
    max_repeat = snakemake.config['selecting_clusters']['max_repeat']
    orthological = snakemake.config['selecting_clusters']['orthological']
    choose_random = snakemake.config['selecting_clusters']['choose_random']
    excluded_species = snakemake.config['selecting_clusters']['excluded_species']
    excluded_species = set(excluded_species)
    run(snakemake.input[0], snakemake.config['species_file'], snakemake.output[0],
        min_cluster_size, max_cluster_size, max_repeat, orthological, choose_random, excluded_species)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
import os
import re



def replace(string: str) -> str:
    '''
    Remove floats and : from tree string. Shorten names to 4 letters. Remove _paralog and number
    '''
    string = re.sub(r'\d+\.\d+', '', string)
    string = re.sub(r':', '', string)
    string = re.sub(r'\w{4}\w+', lambda x: x.group(), string)
    string = string.replace('Inner', '')
    string = string.replace(';', '')
    string = re.sub(r'\)\d+', ')', string)
    string = re.sub(r'\w{4}_paralog\d+', lambda x: x.group()[:4], string)

    return string


def save_trees2onefile(input_dir: dict, output: str) -> None:
    '''
    Writes all sequences from the input file to a single fasta file
    '''
    paths_list = os.listdir(input_dir)
    paths_list = [input_dir+'/'+path for path in paths_list]
    trees_nwk = []
    for t in paths_list:
        with open(t, 'r') as f:
            tree = f.read()
        tree = replace(tree)
        tree = tree.strip()
        trees_nwk.append(tree)
    # sort trees from the longest to the shortest
    trees_nwk.sort(key=lambda x: len(x), reverse=True)
    with open(output, 'w') as one_file:
        for i, t in enumerate(trees_nwk):
            t = t.replace('-', '')
            if i < len(trees_nwk)-1:
                one_file.write(t+';\n')
            else:
                one_file.write(t)

if __name__ == '__main__':
    save_trees2onefile(snakemake.input[0], snakemake.output[0])
46
47
script:
    "scripts/download_proteomes.py"
SnakeMake From line 46 of main/Snakefile
58
59
script:
    "scripts/save_proteomes.py"
SnakeMake From line 58 of main/Snakefile
69
70
shell:
    "mmseqs easy-cluster {input} results/mmseq_res/mmseq_result_{wildcards.subset} tmp -c {params.c}"
78
79
script:
    "scripts/select_clusters.py"
SnakeMake From line 78 of main/Snakefile
88
89
script:
    "scripts/build_trees.py"
SnakeMake From line 88 of main/Snakefile
98
99
script:
    "scripts/bootstrap.R"
SnakeMake From line 98 of main/Snakefile
120
121
script:
    "scripts/trees2onefile.py"
SnakeMake From line 120 of main/Snakefile
129
130
131
132
133
134
135
136
137
shell:
    """
    sed 's/;//' {input} > {input}_temp
    var=$(fasturec -G {input}_temp -Z | tail -1 | cut -d"-" -f2 | cut -d" " -f1)
    head -n 1 $var | cut -d' ' -f2 > {output}
    sed -i 's/$/;/' {output}
    rm $var
    rm {input}_temp
    """
SnakeMake From line 129 of main/Snakefile
144
145
script:
    "scripts/get_consensus.R"
SnakeMake From line 144 of main/Snakefile
ShowHide 11 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/ola-cupriak/genomics_project
Name: genomics_project
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...