Multi-species Coalescent Model for Phylogenetic Tree Inference with ASTRAL
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Multi-species Coalescent Model Species Tree with ASTRAL
In order to calculate phylogenetic trees, different methods can be used. One method concatenates multiple gene sequences into a single supermatrix alignment in order to calculate a species tree, while the other infers a tree for each gene, the so-called gene trees, to then create a consensus tree to represent the subject species. Among other things, incomplete lineage sorting may result in gene trees that are different from one another and from the species tree. Programs using the multi-species coalescent (MSC) model have been shown to be statistically consistent when facing these issues.
This Snakemake workflow calculates gene trees based on the maximum- likelihood using IQ-TREE . Then, 100 bootstrap trees are generated for each gene. A consensus tree is generated for each gene. Lastly, the gene trees are combined into a single file to be used for species tree inference under the multi-species-coalescent model implemented with ASTRAL .
System requirements
Local machine
I recommend running the workflow on a HPC system, as the analyses are resource and time consuming.
-
If you don't have it yet, it is necessary to have conda or miniconda in your machine. Follow these instructions.
-
After you are all set with conda, I highly ( highly! ) recommend installing a much much faster package manager to replace conda, mamba
-
First activate your conda base
conda activate base
- Then, type:
conda install -n base -c conda-forge mamba
-
-
Likewise, follow this tutorial to install Git if you don't have it.
HPC system
Follow the instructions from your cluster administrator regarding loading of modules, such as loading a root distribution from Conda. For example, with the cluster I work with, we use modules to set up environmental variables, which have to first be loaded within the jobscripts. They modify the $PATH variable after loading the module.
e.g.:
module load anaconda3/2022.05
You usually don't have sudo rights to install anything to the root of the cluster. So, as I wanted to work with a more updated distribution of conda and especially use mamba to replace conda as a package manager, I had to first create my own "local" conda, i.e. I first loaded the module and then created a new environment I called localconda
-
module load anaconda3/2022.05
-
conda create -n localconda -c conda-forge conda=22.9.0
-
conda install -n localconda -c conda-forge mamba
-
conda activate localconda
If you run
conda env list
you'll probably see something like this:
/home/myusername/.conda/envs/localconda/
Data requirements
Multiple sequence alingment (MSA) files on the amino acid level . The files need to be in fasta format and have the suffix .fas or .fasta, otherwise it will not work.
The workflow will be soon modified to work with difference sequence types and substitution models.
Create a folder within
resources/
and add all of your MSA files to it.
e.g.
resources/mollusca_astral
The workflow automatically recognizes the names of the files and processes them accordingly. The name of the subdirectory will be used to name the output directories and files from the workflow.
e.g.:
resources/mollusca_astral/12345at6447.fas
One of the output files will be named like this:
results/mollusca_astral/ml_bs_trees/12345at6447_mollusca_astral.treefile
The final species tree, for example, would be:
results/mollusca_astral/final_species_tree_mollusca_astral.treefile
Because of how I structured the workflow, you are able to run several different analyses in parallel :)
e.g.:
resources/mollusca_astral/
,
resources/arthropoda
, etc
Installation
- Clone this repository
git clone https://gitlab.leibniz-lib.de/jwiggeshoff/astral-species-tree.git
- Activate your conda base
conda activate base
- If you are working on a cluster or have your own "local", isolated environment you want to activate instead (see here ), use its name to activate it
conda activate localconda
- Install astral-species-tree into an isolated software environment by navigating to the directory where this repo is and run:
conda env create --file environment.yaml
If you followed what I recommended in the System requirements , run this instead:
mamba env create --file environment.yaml
The environment from astral-species-tree is created
- Always activate the environment before running the workflow
On a local machine:
conda activate astral-species-tree
If you are on a cluster and/or created the environment "within" another environment, you want to run this first:
conda env list
You will probably see something like this among your enviornments:
home/myusername/.conda/envs/localconda/envs/astral-species-tree
From now own, you have to give this full path when activating the environment prior to running the workflow
conda activate /home/myusername/.conda/envs/localconda/envs/astral-species-tree
Running the workflow
Remember to always activate the environment first
conda activate astral-species-tree
or
conda activate /home/myusername/.conda/envs/localconda/envs/astral-species-tree
Local machine
Not recommended unless you have a lot of storage and CPUs available (and time to wait...). Nevertheless, you can simply run like this:
nohup snakemake --keep-going --use-conda --verbose --printshellcmds --reason --nolock --cores 11 > nohup_astral-species-tree_$(date +"%F_%H").out &
Modify number of cores accordingly.
HPC system
Two working options were tested to run the workflow in HPC clusters using the Sun Grid Engine (SGE) queue scheduler system.
For other systems, read more here .
Before the first execution of the workflow
Run this to create the environments from the rules:
snakemake --cores 8 --use-conda --conda-create-envs-only
Option 1:
mkdir snakejob_logs
nohup snakemake --keep-going --use-conda --verbose --printshellcmds --reason --nolock --rerun-incomplete --cores 51 --max-threads 25 --cluster "qsub -terse -V -b y -j y -o snakejob_logs/ -cwd -pe smp {threads} -q fast.q,small.q,medium.q,large.q -M [email protected] -m be" > nohup_astral-species-tree_$(date +"%F_%H_%M_%S").out &
Remember to:
-
Modify [email protected]
-
Change values for --cores and --max-threads accordingly
-
Change environment for -pe as needed (e.g. smp)
Option 2:
A template jobscript
template_run_astral-species-tree.sh
is found under
misc/
Important: Please, modify the qsub options according to your system! Features to modify:
-
E-mail address:
-M [email protected]
-
Mailing settings, if needed:
-m be
-
If you want to split stderr to stdout, use
-j n
instead and add the line#$ -e cluster_logs/
-
If you want to, the name of the jobscript:
-N astral-species-tree
-
Name of parallel environment (e.g. smp) and number of threads (e.g. 61):
-pe smp 61
-
Queue name! (extremely unique to your system):
-q small.q,medium.q,large.q
Ater modifying the template, copy it (while also modifying its name) to the working directory:
If you are within the folder
misc/
:
cp template_run_astral-species-tree.sh ../run_astral-species-tree.sh
You should see
run_astral-species-tree.sh
within the path where the folders resources/, results/, and workflow/ are, together with files README.md and environment.yaml
Remember to
mkdir cluster_logs
before running for the first time
Finally, run:
qsub run_astral-species-tree.sh
Finishing the workflow: report.html
Upon successfully finishing the analyses, Snakemake will
automatically
generate a compressed report in the working directory,
report.html
It describes the used software versions, the commands, and paths to in and output files.
To be released: Summary of main results, drawn trees
To know more about report files, see the documentation from Snakemake here .
Done :)
Code Snippets
38 39 40 | shell: "(iqtree -s {input.MSA_file} -st AA -pre {params.outdir}/{wildcards.sample}_{wildcards.project} " "-nt {threads} -m MFP -msub nuclear -mrate E,I,G,I+G,R -cmin 2 -cmax 15 -madd LG4X,LG4M -safe -merit AICc -b 100 --redo) &> {log}" |
55 56 57 58 | shell: "(iqtree -sup {input.ML_tree} -t {input.bs_trees} " "-pre {params.outdir}/{wildcards.sample}_{wildcards.project} -nt {threads} --redo; " "echo >> {output.suptree}) &> {log}" |
65 66 | shell: "ls {input}; echo {output.flag_file}" |
84 85 86 87 | shell: "(cat {input.suptrees_group_by_project} > {output.genetrees}; " "sed -i '/^$/d' {output.genetrees}; " "echo {params.all_flags}) &> {log}" |
98 99 | shell: "(java -jar $CONDA_PREFIX/share/astral-tree-5.7.8-0/astral.5.7.8.jar -i {input.genetrees} -o {output.speciestree}) &> {log}" |
Support
- Future updates