Hecatomb: Enhancing Viral Read Identification in Metagenomes
A hecatomb is a great sacrifice or an extensive loss. Hecatomb, the software, empowers an analyst to make data-driven decisions to 'sacrifice' false-positive viral reads from metagenomes and enrich for true-positive viral reads. This process frequently results in a great loss of suspected viral sequences/contigs.
Documentation
Complete documentation is hosted at Read the Docs
Citation
Hecatomb is currently on bioRxiv!
Quick start guide
Running on HPC
Hecatomb is powered by Snakemake and benefits greatly from Snakemake profiles on HPC clusters. More information and examples for setting up Snakemake profiles for Hecatomb are available in the documentation.
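Snakemake looks for profiles in ~/.config/snakemake/<profileName>/config.yaml. Below is a minimal sketch of a SLURM profile; the sbatch flags and resource values are illustrative assumptions, so see the documentation for a tested example.
# create a profile directory named 'slurm' (values below are assumptions)
mkdir -p ~/.config/snakemake/slurm
cat > ~/.config/snakemake/slurm/config.yaml <<'EOF'
cluster: "sbatch --cpus-per-task={threads} --mem={resources.mem_mb} --time=1-00:00:00"
jobs: 100
latency-wait: 60
EOF
# any hecatomb command can then submit its jobs via the profile, e.g.
hecatomb test --profile slurm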
Install option 1: PIP
# Optional: create a virtual env with conda
conda create -n hecatomb python=3.10
# activate
conda activate hecatomb
# Install
pip install hecatomb
Install option 2: Conda
# Create the conda env and install hecatomb in one step
conda create -n hecatomb -c conda-forge -c bioconda hecatomb
# activate
conda activate hecatomb
Check the installation
hecatomb --help
Install the databases
# locally: using 8 threads (default is 32 threads)
hecatomb install --threads 8
# HPC: using a snakemake profile named 'slurm'
hecatomb install --profile slurm
Run the test dataset
# locally: using 32 threads and 64 GB RAM by default
hecatomb test
# HPC: using a profile named 'slurm'
hecatomb test --profile slurm
Inputs
Hecatomb can process paired-end or single-end short-read sequencing, long-read sequencing, and paired-end sequencing from the round A/B library protocol.
hecatomb run --library paired
hecatomb run --library single
hecatomb run --library longread
hecatomb run --library roundAB
When you specify a directory of reads with --reads for paired-end sequencing, Hecatomb expects the reads in the format sampleName_R1/R2.fastq(.gz), e.g.
sample1_R1.fastq.gz
sample1_R2.fastq.gz
sample2_R1.fastq.gz
sample2_R2.fastq.gz
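With a directory laid out like this, a paired-end run might look as follows; the directory name fastq/ is illustrative.
# illustrative: fastq/ contains the sampleName_R1/R2.fastq.gz files listed above
hecatomb run --reads fastq/ --library paired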
When you specify a TSV file with --reads, Hecatomb expects a 2- or 3-column tab-separated file (depending on the preprocessing method), with the first column specifying the sample name and the remaining columns the relative or full paths to the forward (and reverse) read files, e.g.
sample1 /path/to/reads/sample1.1.fastq.gz /path/to/reads/sample1.2.fastq.gz
sample2 /path/to/reads/sample2.1.fastq.gz /path/to/reads/sample2.2.fastq.gz
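With a sample sheet, the run is the same except that --reads points at the TSV; the filename samples.tsv is illustrative.
# illustrative: samples.tsv is the 3-column sample sheet shown above
hecatomb run --reads samples.tsv --library paired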
Dependencies
The only dependency you need to get up and running with Hecatomb is conda or the Python package manager pip. Hecatomb relies on conda (and mamba) to ensure portability and ease of installation of its dependencies. All of Hecatomb's dependencies are installed during installation or at runtime, so you don't have to worry about a thing!
Code Snippets
shell:
    """
    # Map the shredded sequences against the reference with BBMap (min identity 0.90)
    bbmap.sh ref={input.ref} in={input.shred} \
        outm={output} path=tmp/ \
        minid=0.90 maxindel=2 ow=t \
        threads={threads} -Xmx{resources.mem_mb}m &> {log}
    """
shell:
    """
    # Mask low-entropy regions (and regions covered in the SAM) with bbmask, then compress
    bbmask.sh in={input.ref} out={output.fa} \
        entropy={params.entropy} sam={input.sam} ow=t \
        threads={threads} -Xmx{resources.mem_mb}m &> {log}
    gzip -c {output.fa} > {output.gz}
    """
run:
    # Write a two-column table of sample names and their associated lengths
    # across all result directories
    with open(output[0], 'w') as o:
        for oDir in allDirSmplLen.keys():
            for smpl in allDirSmplLen[oDir].keys():
                o.write(f'{smpl}\t{allDirSmplLen[oDir][smpl]}\n')
run:
    # Combine the per-directory bigtable.tsv files
    combineResultDirOutput(output[0], 'bigtable.tsv')
run:
    # Combine the per-directory seqtable.properties.tsv files
    combineResultDirOutput(output[0], 'seqtable.properties.tsv', sampleCol=0)
run:
    # Combine the seqtable.fasta files, keeping only sequences whose sample
    # name is known for that results directory
    with open(output[0], 'w') as o:
        for oDir in allDirSmplLen.keys():
            with open(os.path.join(oDir, 'results', 'seqtable.fasta'), 'r') as f:
                p = True
                for line in f:
                    if line.startswith('>'):
                        s = line.replace('>', '').split(':')
                        p = s[0] in allDirSmplLen[oDir]
                        if p:
                            o.write(line)
                    elif p:
                        o.write(line)
shell:
    """
    # Give each input assembly's contigs a unique name prefix, then merge with Flye
    n=0
    for i in {input}; do
        cat $i | sed "s/>/>$n/"
        n=0$n
    done > {output.contigs}
    flye --subassemblies {output.contigs} -t {threads} --plasmids -o {output.flye} -g 1g
    mv {params} {output.assembly}
    """
run:
    # Download the database file from mirror 1, falling back to mirror 2
    import os
    import shutil
    import urllib.parse
    import urllib.request
    dlUrl1 = urllib.parse.urljoin(config.dbs.mirror1, os.path.join(wildcards.path, wildcards.file))
    dlUrl2 = urllib.parse.urljoin(config.dbs.mirror2, os.path.join(wildcards.path, wildcards.file))
    try:
        with urllib.request.urlopen(dlUrl1) as r, open(output[0], 'wb') as o:
            shutil.copyfileobj(r, o)
    except Exception:
        with urllib.request.urlopen(dlUrl2) as r, open(output[0], 'wb') as o:
            shutil.copyfileobj(r, o)
shell:
    """
    # Download the tarball, verify its md5 checksum, and extract it
    curl {params.url} -o {params.tar}
    curl {params.md5} | md5sum -c
    mkdir -p {params.dir}
    tar xvf {params.tar} -C {params.dir}
    """
shell:
    "samtools faidx {input} > {output} 2> {log} && rm {log}"
shell:
    "samtools index -@ {threads} {input} {output} 2> {log} && rm {log}"
shell:
    """
    # Calculate GC content with BBTools' countgc.sh
    countgc.sh in={input} format=2 ow=t > {output} 2> {log}
    rm {log}
    """
shell:
    """
    # Calculate tetramer frequencies, dropping the header line
    {{ tetramerfreq.sh in={input} w=0 ow=t -Xmx{resources.mem_mb}m \
        | tail -n+2; }} > {output} 2>> {log}
    rm {log}
    """