Snakemake workflow to detect and classify viruses in metagenome assemblies.

public public 1yr ago 0 bookmarks

Snakemake workflow to detect and classify viruses in metagenome assemblies.

It first detects viral sequences in assemblies ( .fa files) with VirSorter2 , VIBRANT and DeepVirFinder . Predictions are strictly quality controlled with CheckV , followed by clustering with CD-HIT and taxonomic classification with Demovir .

Installation

  1. Install conda , snakemake (tested v6.3.0) and USEARCH .

  2. Clone repository

git clone --recursive https://github.com/alexmsalmeida/virsearch.git
  1. Download and extract necessary databases (uncompressed directory will require a total of 30 GB).
wget http://ftp.ebi.ac.uk/pub/databases/metagenomics/genome_sets/virsearch_db.tar.gz
tar -xzvf virsearch_db.tar.gz

How to run

  1. Edit config.yml file to point to the input , output and databases directories, as well as the USEARCH binary location ( usearch_binary ). Input directory should contain the .fa assemblies to analyse.

  2. Install the necessary conda environments through snakemake

snakemake --use-conda --conda-create-envs-only --cores 1
  1. (option 1) Run the pipeline locally (adjust -j based on the number of available cores)
snakemake --use-conda -k -j 4
  1. (option 2) Run the pipeline on a cluster (e.g., SLURM)
snakemake --use-conda -k -j 100 --cluster-config cluster.yml --cluster 'sbatch -A ALMEIDA-SL3-CPU -p icelake-himem --time=12:00:00 --ntasks={cluster.nCPU} --mem={cluster.mem} -o {cluster.output}'

Output

The main output files generated per input FASTA are the final_predictions.fa and final_predictions_tax.tsv files, which contain the viral sequences in FASTA format and their taxonomic annotation, respectively. If these files are empty it likely means that no high-confidence viral sequences were detected (check individual logs of the tools to confirm no other issues arose).

Code Snippets

52
53
54
55
shell:
    """
    virsorter setup -d {params.outdir} -j 4
    """
69
70
71
72
73
74
shell:  
    """
    rm -rf {params.outdir}
    virsorter run -j 8 -i {input.fa} --db-dir {params.database} -w {params.outdir} --min-length 10000 all || touch {params.raw}
    tools/rename_multifasta_prefix.py -f {params.raw} -p {wildcards.id}_VS2 > {output}
    """
87
88
89
90
91
shell:
    """
    rm -rf {params.outdir}
    tools/DeepVirFinder/dvf.py -i {input} -m {params.database} -o {params.outdir} -l 10000 -c 4 || touch {output}
    """
SnakeMake From line 87 of main/Snakefile
104
105
106
107
108
109
110
shell:
    """
    awk '{{if($3 > 0.9 && $4 < 0.01)print$1}}' {input.dvf} > {params.contigs}
    tools/select_seqs_by_IDs.py -i {input.fa} -d {params.contigs} -o {params.fa_ori}
    tools/rename_multifasta_prefix.py -f {params.fa_ori} -p {wildcards.id}_DVF > {output}
    rm {params.fa_ori} {params.contigs}
    """
SnakeMake From line 104 of main/Snakefile
125
126
127
128
129
130
131
132
133
134
shell:
    """
    rm -rf {params.outdir}
    VIBRANT_run.py -t 4 -d {params.database} -m {params.files} -i {input} -folder {params.outdir} -l 10000 -no_plot || true
    if [[ ! -f {params.phages} ]]; then
        mkdir -p {params.outdir} && touch {output}
    else
        tools/rename_multifasta_prefix.py -f {params.phages} -p {wildcards.id}_VIBRANT > {output}
    fi
    """
SnakeMake From line 125 of main/Snakefile
144
145
shell:
    "cat {input.vs2} {input.dvf} {input.vibrant} > {output}"
SnakeMake From line 144 of main/Snakefile
153
154
shell:
    "echo 'No high-confidence viruses detected, generating empty files'"
SnakeMake From line 153 of main/Snakefile
166
167
168
169
170
shell:
    """
    rm -rf {params.outdir}
    checkv end_to_end -t 8 -d {params.database} {input} {params.outdir}
    """
SnakeMake From line 166 of main/Snakefile
181
182
183
184
185
186
shell:
    """
    tools/filter_checkv.py {params.indir}proviruses.fna {input} {params.indir}proviruses_filt.fna pro
    tools/filter_checkv.py {params.indir}viruses.fna {input} {params.indir}viruses_filt.fna vir
    cat {params.indir}proviruses_filt.fna {params.indir}viruses_filt.fna > {output}
    """
SnakeMake From line 181 of main/Snakefile
198
199
200
201
202
shell:
    """
    cd-hit-est -c 0.99 -i {input} -o {params.clst_dir}/filtered_predictions_nr -T 0 -M 0 -d 0
    mv {params.clst_dir}/filtered_predictions_nr {output.pred}
    """
217
218
219
220
221
222
223
224
shell:
    """
    prodigal-gv -a {params.demovir_dir}/proteins.faa -i {input} -p meta &> /dev/null
    {params.usearch} -ublast {params.demovir_dir}/proteins.faa -db {params.database}/uniprot_trembl.viral.udb -evalue 1e-5 -trunclabels -blast6out {params.demovir_dir}/trembl_ublast.viral.txt -threads 4 &> /dev/null
    sort -u -k1,1 {params.demovir_dir}/trembl_ublast.viral.txt > {params.demovir_dir}/trembl_ublast.viral.u.txt
    cut -f 1,2 {params.demovir_dir}/trembl_ublast.viral.u.txt | sed 's/_[0-9]\+\t/\t/' | cut -f 1 | paste {params.demovir_dir}/trembl_ublast.viral.u.txt - > {output.contig_ids}
    tools/demovir.R {params.demovir_dir} {params.database}
    """
233
234
235
run:
    shutil.move(input[0], output.fa)
    shutil.move(input[1], output.tax)
SnakeMake From line 233 of main/Snakefile
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
import sys
import Bio
from Bio import SeqIO
from Bio import SeqRecord

if len(sys.argv) < 3:
    print("usage: script.py in.fa checkv_summary.tsv out.fa pro")
    sys.exit(1)

contigs = set()
with open(sys.argv[2]) as f:
    for line in f:
        line = line.rstrip()
        cols = line.split("\t")
        if cols[0] != "contig_id":
            if cols[9] != "NA":
                if int(cols[5]) > int(cols[6]) and int(cols[1]) >= 10000 and float(cols[9]) >= 50 and float(cols[12]) <= 1.0 and "contig >1.5x" not in line:
                    contigs.add(cols[0])

with open(sys.argv[1]) as f, open(sys.argv[3], "w") as fout:
    for record in SeqIO.parse(f, "fasta"):
        if sys.argv[4] == "pro":
            contig_id = "_".join(record.id.split("_")[:-1])
        elif sys.argv[4] == "vir":
            contig_id = record.id.split()[0]
        if contig_id in contigs and len(record.seq) >= 10000:
            SeqIO.write(record, fout, "fasta")
ShowHide 9 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/alexmsalmeida/virsearch
Name: virsearch
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...