quantify-virome: identify and quantify viruses from metagenomic shotgun sequences
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Snakemake implementation of VirusSeeker Virome workflow .
Install miniconda
Download and install miniconda https://conda.io/docs/user-guide/install/index.html. In case of Linux, following should work:
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Install environment
Create conda environment with snakemake . There are two options:
- If you want to upload your results to Zenodo , then you need snakemake Zenodo remote provider, which is currently implemented in zenodo-simple branch in my forked snakemake repo.
First, clone snakemake repo and checkout zenodo-simple branch:
git clone https://tpall@bitbucket.org/tpall/snakemake.git
cd snakemake
git checkout zenodo-simple
Then, create conda environment, install prerequisites and snakemake:
conda env create -f environment.yml -n snakemake
source activate snakemake
pip install -e .
- Alternatively, if you don't want to upload your results to Zenodo, you can create conda environment and install snakemake 'normally':
conda create -n snakemake -c bioconda -c conda-forge snakemake
source activate snakemake
Setup databases
Together all databases will occupy ~250GB+ from your HD.
BLAST databases
- Download BLAST version 5 databases
Download version 5 BLAST databases using these instructions https://ftp.ncbi.nlm.nih.gov/blast/db/v5/blastdbv5.pdf
Briefly, you can use
update_blastdb.pl
script from BLAST+ software bundle to update/download BLAST databases.
To get BLAST, you can start by creating conda environment with blast+ like so:
conda env create -n blastenv
conda blastenv activate
conda install -c bioconda blast
Change working directory to location where you want BLAST databases to be installed, e.g.
$HOME/databases/blast
.
mkdir -p $HOME/databases/blast
cd $HOME/databases/blast
Use update_blastdb.pl (included with the BLAST+ package) to check available version 5 databases, use the --blastdb_version flag:
update_blastdb.pl --blastdb_version 5 --showall
Download nt_v5 and nr_v5 databases (takes time and might need restarting if connection drops):
update_blastdb.pl --blastdb_version 5 nt_v5 --decompress
update_blastdb.pl --blastdb_version 5 nr_v5 --decompress
- Setup BLASTDB environment variable Edit $HOME/.bashrc file to permanently add BLASTDB variable to your shell environment
echo 'export BLASTDB=$HOME/databases/blast' >> $HOME/.bashrc
source $HOME/.bashrc
echo $BLASTDB
Download reference genome databases
- Human reference genome.
First, create a directory for the reference genome sequence file, e.g
mkdir -p $HOME/databases/ref_genomes && cd $HOME/databases/ref_genomes
.
Then, human refgenome human_g1k_v37.fasta.gz sequences file can be obtained like so:
wget --continue ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
- Bacterial reference genome sequences.
Create a directory for the bacteria reference sequence files. Download all *genomic.fna.gz files to the directory by using command.
wget --recursive --continue ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/*genomic.fna.gz
Unzip the files and concatenate all the files into a single file. Use "bwa index" command to create index for the BWA algorithm.
-
Add paths to
human_g1k_v37.fasta
andBacteria_ref_genome.fna
to environment variables.
echo 'export REF_GENOME_HUMAN=$HOME/databases/ref_genomes/human_g1k_v37.fasta' >> $HOME/.bashrc
echo 'export REF_BACTERIA=$HOME/databases/bacteria_ref_sequence/Bacteria_ref_genome.fna' >> $HOME/.bashrc
source $HOME/.bashrc
Install workflow
Clone this repo and cd to repo (Change URL accordingly if using HTTPS)
git clone git@github.com:avilab/quantify-virome.git
cd quantify-virome
Example
Dry run
snakemake -n
Create workflow graph
snakemake -d .test --dag | dot -Tsvg > graph/dag.svg
Run workflow
This workflow is designed to be run in cluster.
cluster.json
configuration file may need some customisation, for example partition name. Memory nad maximum runtime limits are optimised for 100 splits. Number of splits can be specified in
config.yaml
file with n_files option (currently n_files is 2). Installation of software dependencies is taken care by conda, hence there is software installation overhead when you run this workflow for the first time in new environment.
Example workflow submission script for slurm cluster, where values for job name, cluster partition name, time and memory constraints, and slurm log path (output) are taken from cluster.json:
snakemake -j --use-conda --cluster-config cluster.json \ --cluster "sbatch -J {cluster.name} \ -p {cluster.partition} \ -t {cluster.time} \ --mem {cluster.mem} \ --output {cluster.output}"
You may want to use also following flags when running this workflow in cluster:
--max-jobs-per-second 1 --max-status-checks-per-second 10 --rerun-incomplete --keep-going
All other possible
snakemake execution
options can be printed by calling
snakemake -h
.
Exit/deactivate environment
Conda environment can be closed with the following command when work is finished:
source deactivate
Workflow graph
For technical reasons, workflow is split into two parts, virome and taxonomy, that can be run separately, but taxonomy depends on the output of virome. Virome subworkflow (virome.snakefile) munges, masks, and blasts input sequences. Taxonomy subworkflow (Snakefile) merges blast results with taxonomy data and generates report.
Figure 1. Virome workflow graph with test sample split into two (default = 20) subfiles for parallel processing. Outputs parsed BLAST results.
Code Snippets
27 28 | shell: "get_species_taxids.sh -t {params.taxid} > {output}" |
39 40 | shell: "get_species_taxids.sh -t {params.taxid} > {output}" |
50 51 | shell: "cat {input} > {output}" |
336 337 338 339 340 341 | run: tab = concatenate_tables(input, sep = ",", cols_to_integer = params.ranks) vir = tab[tab.superkingdom == 10239] non_vir = tab[tab.superkingdom != 10239] vir.to_csv(output.viral, index = False) non_vir.to_csv(output.non_viral, index = False) |
352 353 | shell: "cat {input} > {output}" |
365 366 | shell: "tar czvf {output} {input}" |
215 216 | shell: "cat {input} > {output}" |
417 418 | wrapper: "0.27.1/bio/fastqc" |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from os import path from tempfile import TemporaryDirectory from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=False, stderr=True) def basename_without_ext(file_path): """Returns basename of file path, without the file extension.""" base = path.basename(file_path) split_ind = 2 if base.endswith(".gz") else 1 base = ".".join(base.split(".")[:-split_ind]) return base # Run fastqc, since there can be race conditions if multiple jobs # use the same fastqc dir, we create a temp dir. with TemporaryDirectory() as tempdir: shell("fastqc {snakemake.params} --quiet " "--outdir {tempdir} {snakemake.input[0]}" " {log}") # Move outputs into proper position. output_base = basename_without_ext(snakemake.input[0]) html_path = path.join(tempdir, output_base + "_fastqc.html") zip_path = path.join(tempdir, output_base + "_fastqc.zip") if snakemake.output.html != html_path: shell("mv {html_path} {snakemake.output.html}") if snakemake.output.zip != zip_path: shell("mv {zip_path} {snakemake.output.zip}") |
Support
- Future updates
Related Workflows





