Multi-species Coalescent Model for Phylogenetic Tree Inference with ASTRAL

Multi-species Coalescent Model Species Tree with ASTRAL

In order to calculate phylogenetic trees, different methods can be used. One method concatenates multiple gene sequences into a single supermatrix alignment in order to calculate a species tree, while the other infers a tree for each gene, the so-called gene trees, to then create a consensus tree to represent the subject species. Among other things, incomplete lineage sorting may result in gene trees that are different from one another and from the species tree. Programs using the multi-species coalescent (MSC) model have been shown to be statistically consistent when facing these issues.

This Snakemake workflow calculates maximum-likelihood gene trees using IQ-TREE. Then, 100 bootstrap trees are generated for each gene, and a consensus tree is generated for each gene. Lastly, the gene trees are combined into a single file to be used for species tree inference under the multi-species coalescent model as implemented in ASTRAL.

System requirements

Local machine

I recommend running the workflow on an HPC system, as the analyses are resource- and time-consuming.

If you don't have it yet, you need conda or miniconda on your machine. Follow these instructions.
- Once conda is set up, I highly (highly!) recommend installing mamba, a much faster package manager that can replace conda
- First activate your conda base
conda activate base
- Then, type:
conda install -n base -c conda-forge mamba
Likewise, follow this tutorial to install Git if you don't have it.

HPC system

Follow the instructions from your cluster administrator regarding the loading of modules, such as loading a root distribution of conda. For example, on the cluster I work with, we use modules to set up environment variables. A module has to be loaded within the jobscript first, and loading it modifies the $PATH variable.

e.g.: module load anaconda3/2022.05

You usually don't have sudo rights to install anything to the root of the cluster. As I wanted to work with a more recent distribution of conda, and especially to use mamba instead of conda as the package manager, I first had to create my own "local" conda: I loaded the module and then created a new environment that I called localconda:

module load anaconda3/2022.05
conda create -n localconda -c conda-forge conda=22.9.0
conda install -n localconda -c conda-forge mamba
conda activate localconda

If you run conda env list you'll probably see something like this: /home/myusername/.conda/envs/localconda/

Data requirements

Multiple sequence alignment (MSA) files at the amino acid level. The files need to be in FASTA format and have the suffix .fas or .fasta, otherwise the workflow will not work.
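Before starting a run, it can help to check which files would be accepted. The following is a minimal sketch, not code from the workflow itself: the function name split_by_suffix and the example file names are made up for illustration, and it assumes the suffix check is case-sensitive.

```python
from pathlib import Path

# Suffixes the workflow accepts, as stated above.
ACCEPTED_SUFFIXES = {".fas", ".fasta"}


def split_by_suffix(filenames):
    """Return (accepted, rejected) file names based on their suffix."""
    accepted, rejected = [], []
    for name in filenames:
        if Path(name).suffix in ACCEPTED_SUFFIXES:
            accepted.append(name)
        else:
            rejected.append(name)
    return accepted, rejected


# Hypothetical file names, for illustration only.
example = ["12345at6447.fas", "67890at6447.fasta", "11111at6447.fa", "notes.txt"]
accepted, rejected = split_by_suffix(example)
print("accepted:", accepted)
print("rejected:", rejected)
```

Files listed as rejected would need to be renamed to .fas or .fasta before they can be used.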

The workflow will soon be modified to work with different sequence types and substitution models.

Create a folder within resources/ and add all of your MSA files to it.

e.g. resources/mollusca_astral

The workflow automatically recognizes the names of the files and processes them accordingly. The name of the subdirectory will be used to name the output directories and files from the workflow.

e.g.:

resources/mollusca_astral/12345at6447.fas

One of the output files will be named like this:

results/mollusca_astral/ml_bs_trees/12345at6447_mollusca_astral.treefile

The final species tree, for example, would be:

results/mollusca_astral/final_species_tree_mollusca_astral.treefile
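The naming scheme above can be sketched in a few lines of Python. This is only an illustration of the pattern visible in the example paths, not the workflow's own code; the function name expected_outputs and the dictionary keys are assumptions.

```python
from pathlib import PurePosixPath


def expected_outputs(msa_path):
    """Map resources/<project>/<sample>.fas to the result paths shown above."""
    msa = PurePosixPath(msa_path)
    project = msa.parent.name  # name of the subdirectory, e.g. "mollusca_astral"
    sample = msa.stem          # file name without suffix, e.g. "12345at6447"
    results = PurePosixPath("results") / project
    return {
        "gene_tree": str(results / "ml_bs_trees" / f"{sample}_{project}.treefile"),
        "species_tree": str(results / f"final_species_tree_{project}.treefile"),
    }


paths = expected_outputs("resources/mollusca_astral/12345at6447.fas")
print(paths["gene_tree"])
print(paths["species_tree"])
```

Because the project name is part of every output path, results from different subdirectories of resources/ do not overwrite each other.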

Because of how I structured the workflow, you are able to run several different analyses in parallel :)

e.g.: resources/mollusca_astral/ , resources/arthropoda , etc

Installation

Clone this repository

git clone https://gitlab.leibniz-lib.de/jwiggeshoff/astral-species-tree.git

Activate your conda base

conda activate base

If you are working on a cluster or have your own "local", isolated environment that you want to activate instead (see System requirements), use its name to activate it

conda activate localconda

To install astral-species-tree into an isolated software environment, navigate to the directory where this repo is and run:

conda env create --file environment.yaml

If you followed what I recommended in the System requirements, run this instead:

mamba env create --file environment.yaml

This creates the astral-species-tree environment.

Always activate the environment before running the workflow

On a local machine:

conda activate astral-species-tree

If you are on a cluster and/or created the environment "within" another environment, you want to run this first:

conda env list

You will probably see something like this among your environments:

/home/myusername/.conda/envs/localconda/envs/astral-species-tree

From now on, you have to give this full path when activating the environment prior to running the workflow

conda activate /home/myusername/.conda/envs/localconda/envs/astral-species-tree
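If you prefer not to search the list by eye, the path can be picked out of the conda env list output with a short script. This is a minimal sketch: the function name find_env_path is made up, the sample text below is a hypothetical listing (yours will differ), and it assumes the path is the last column of each line and contains no spaces.

```python
def find_env_path(env_list_output, env_name="astral-species-tree"):
    """Return the first listed path whose last component is env_name, else None."""
    for line in env_list_output.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        path = line.split()[-1]  # assumption: the path is the last column
        if path.rstrip("/").split("/")[-1] == env_name:
            return path
    return None


# Hypothetical `conda env list` output, for illustration only.
sample_output = """\
# conda environments:
#
base                     /opt/anaconda3
localconda            *  /home/myusername/.conda/envs/localconda
                         /home/myusername/.conda/envs/localconda/envs/astral-species-tree
"""

env_path = find_env_path(sample_output)
print("conda activate", env_path)
```

On a real system, the text passed to the function would be the output of conda env list rather than the sample string.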

Running the workflow

Remember to always activate the environment first

conda activate astral-species-tree

or, if the environment was created "within" another environment:

conda activate /home/myusername/.conda/envs/localconda/envs/astral-species-tree

Local machine

Not recommended unless you have a lot of storage and CPUs available (and time to wait...). Nevertheless, you can simply run like this:

nohup snakemake --keep-going --use-conda --verbose --printshellcmds --reason --nolock --cores 11 > nohup_astral-species-tree_$(date +"%F_%H").out &

Modify the number of cores accordingly.
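One way to pick a core count is to derive it from the machine. The sketch below only assembles the argument list for the command shown above and prints it; it does not run Snakemake. The function name local_snakemake_command and the rule of leaving one CPU free are assumptions, not part of the workflow.

```python
import os
import shlex


def local_snakemake_command(cores=None):
    """Build the local snakemake call shown above as an argument list."""
    if cores is None:
        # Assumption: leave one CPU free for the rest of the system.
        cores = max(1, (os.cpu_count() or 1) - 1)
    return [
        "snakemake", "--keep-going", "--use-conda", "--verbose",
        "--printshellcmds", "--reason", "--nolock", "--cores", str(cores),
    ]


cmd = local_snakemake_command()
print(shlex.join(cmd))
```

The printed line can be pasted after nohup, with the output redirection added as in the command above.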

HPC system

Two working options were tested to run the workflow in HPC clusters using the Sun Grid Engine (SGE) queue scheduler system.

For other systems, see the Snakemake documentation on cluster execution.

Before the first execution of the workflow

Run this to create the environments from the rules:

snakemake --cores 8 --use-conda --conda-create-envs-only

Option 1:

mkdir snakejob_logs

nohup snakemake --keep-going --use-conda --verbose --printshellcmds --reason --nolock --rerun-incomplete --cores 51 --max-threads 25 --cluster "qsub -terse -V -b y -j y -o snakejob_logs/ -cwd -pe smp {threads} -q fast.q,small.q,medium.q,large.q -M [email protected] -m be" > nohup_astral-species-tree_$(date +"%F_%H_%M_%S").out &

Remember to:

Modify the e-mail address ([email protected])
Change the values for --cores and --max-threads accordingly
Change the parallel environment for -pe as needed (e.g. smp)

Option 2:

A template jobscript template_run_astral-species-tree.sh is found under misc/

Important: Please, modify the qsub options according to your system! Features to modify:

E-mail address: -M [email protected]
Mailing settings, if needed: -m be
To keep stderr separate from stdout, use -j n instead and add the line #$ -e cluster_logs/
Optionally, the name of the job: -N astral-species-tree
Name of the parallel environment (e.g. smp) and number of threads (e.g. 61): -pe smp 61
Queue name (highly specific to your system!): -q small.q,medium.q,large.q
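To see how these options fit together, the sketch below renders them as #$ directive lines. It is not the template from misc/ and covers only the options listed above; the function name sge_header, its parameters, and the placeholder address you@example.org are hypothetical.

```python
def sge_header(email, job_name="astral-species-tree", pe="smp", threads=61,
               queues=("small.q", "medium.q", "large.q"), join_logs=True):
    """Render the qsub options listed above as '#$' directive lines."""
    lines = [
        f"#$ -N {job_name}",           # job name
        f"#$ -M {email}",              # e-mail address
        "#$ -m be",                    # mail at beginning and end of the job
        f"#$ -pe {pe} {threads}",      # parallel environment and threads
        f"#$ -q {','.join(queues)}",   # queue names, specific to your system
    ]
    if join_logs:
        lines.append("#$ -j y")        # merge stderr into stdout
    else:
        lines += ["#$ -j n", "#$ -e cluster_logs/"]  # separate stderr log
    return "\n".join(lines)


# Placeholder address, for illustration only.
print(sge_header("you@example.org", join_logs=False))
```

Comparing the printed lines with the header of your copy of the template is a quick way to spot an option you forgot to adapt.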

After modifying the template, copy it (while also modifying its name) to the working directory:

If you are within the folder misc/ :

cp template_run_astral-species-tree.sh ../run_astral-species-tree.sh

You should see run_astral-species-tree.sh within the path where the folders resources/, results/, and workflow/ are, together with files README.md and environment.yaml

Remember to mkdir cluster_logs before running for the first time

Finally, run:

qsub run_astral-species-tree.sh

Finishing the workflow: report.html

Upon successfully finishing the analyses, Snakemake will automatically generate a compressed report, report.html, in the working directory.

It describes the software versions used, the commands, and the paths to input and output files.

To be released: Summary of main results, drawn trees

To learn more about report files, see the Snakemake documentation.

Done :)

Code Snippets

shell:
    "(iqtree -s {input.MSA_file} -st AA -pre {params.outdir}/{wildcards.sample}_{wildcards.project} "
    "-nt {threads} -m MFP -msub nuclear -mrate E,I,G,I+G,R -cmin 2 -cmax 15 -madd LG4X,LG4M -safe -merit AICc -b 100 --redo) &> {log}"

SnakeMake From line 38 of workflow/Snakefile

shell:
    "(iqtree -sup {input.ML_tree} -t {input.bs_trees} "
    "-pre {params.outdir}/{wildcards.sample}_{wildcards.project} -nt {threads} --redo; "
    "echo >> {output.suptree}) &> {log}"

SnakeMake From line 55 of workflow/Snakefile

shell:
    "ls {input}; echo {output.flag_file}"

SnakeMake From line 65 of workflow/Snakefile

shell:
    "(cat {input.suptrees_group_by_project} > {output.genetrees}; "
    "sed -i '/^$/d' {output.genetrees}; "
    "echo {params.all_flags}) &> {log}"

SnakeMake From line 84 of workflow/Snakefile
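The snippet above concatenates the per-gene support trees and then deletes empty lines with sed. A rough Python equivalent is sketched below for readers who want to check a combined file by hand. The function name combine_gene_trees, the file names, and the toy Newick trees are made up for illustration; this is not code from the workflow.

```python
import tempfile
from pathlib import Path


def combine_gene_trees(tree_files, combined_file):
    """Concatenate tree files into one file, skipping empty lines.

    Returns the number of lines written (one tree per line is assumed).
    """
    count = 0
    with open(combined_file, "w") as out:
        for tree_file in tree_files:
            for line in Path(tree_file).read_text().splitlines():
                if line == "":  # same effect as sed '/^$/d'
                    continue
                out.write(line + "\n")
                count += 1
    return count


# Toy input: two Newick trees, each followed by an empty line.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "gene1.suptree").write_text("((A,B),(C,D));\n\n")
    (tmp / "gene2.suptree").write_text("((A,C),(B,D));\n\n")
    combined = tmp / "genetrees.tre"
    n_trees = combine_gene_trees([tmp / "gene1.suptree", tmp / "gene2.suptree"], combined)
    combined_text = combined.read_text()

print(n_trees)
print(combined_text, end="")
```

The returned count should match the number of input alignments; a mismatch would suggest that a gene tree is missing or that two trees ended up on the same line.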

shell:
    "(java -jar $CONDA_PREFIX/share/astral-tree-5.7.8-0/astral.5.7.8.jar -i {input.genetrees} -o {output.speciestree}) &> {log}"

SnakeMake From line 98 of workflow/Snakefile

Created: 1yr ago

Updated: 1yr ago

Maintainers: public

URL: https://gitlab.leibniz-lib.de/jwiggeshoff/astral-species-tree

Name: astral-species-tree

Version: 1

License: None

Keywords:

astral-tree ASTRAL IQ-TREE Snakemake Phylogeny
