Hecatomb: Enhancing Viral Read Identification in Metagenomes


A hecatomb is a great sacrifice or an extensive loss. Hecatomb the software empowers an analyst to make data-driven decisions to 'sacrifice' false-positive viral reads from metagenomes and enrich for true-positive viral reads. This process frequently results in a great loss of suspected viral sequences/contigs.


Documentation

Complete documentation is hosted at Read the Docs

Citation

Hecatomb is currently on bioRxiv!

Quick start guide

Running on HPC

Hecatomb is powered by Snakemake and greatly benefits from the use of Snakemake profiles for HPC clusters. More information and examples for setting up Snakemake profiles for Hecatomb are available in the documentation.
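
For example, a minimal profile sketch for a SLURM cluster (the profile name, submission flags, and job limits below are illustrative placeholders, not Hecatomb defaults):

# Snakemake looks for profiles in ~/.config/snakemake/<profile-name>/config.yaml
mkdir -p ~/.config/snakemake/slurm
cat > ~/.config/snakemake/slurm/config.yaml <<'EOF'
cluster: "sbatch --cpus-per-task={threads} --mem={resources.mem_mb}"
jobs: 100
latency-wait: 60
EOF
# then pass the profile name to any Hecatomb command, e.g.
hecatomb run --profile slurm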

Install option 1: PIP

# Optional: create a virtual env with conda
conda create -n hecatomb python=3.10
# activate
conda activate hecatomb
# Install
pip install hecatomb

Install option 2: Conda

# Create the conda env and install hecatomb in one step
conda create -n hecatomb -c conda-forge -c bioconda hecatomb
# activate
conda activate hecatomb

Check the installation

hecatomb --help

Install the databases

# locally: using 8 threads (default is 32 threads)
hecatomb install --threads 8
# HPC: using a snakemake profile named 'slurm'
hecatomb install --profile slurm

Run the test dataset

# locally: using 32 threads and 64 GB RAM by default
hecatomb test
# HPC: using a profile named 'slurm'
hecatomb test --profile slurm

Inputs

Hecatomb can process paired-end or single-end short-read sequencing, long-read sequencing, and paired-end sequencing from the round A/B library protocol.

hecatomb run --library paired
hecatomb run --library single
hecatomb run --library longread
hecatomb run --library roundAB

When you specify a directory of reads with --reads for paired-end sequencing, Hecatomb expects paired-end sequencing reads named in the format sampleName_R1/R2.fastq(.gz), e.g. (an example command is shown after the listing):

sample1_R1.fastq.gz
sample1_R2.fastq.gz
sample2_R1.fastq.gz
sample2_R2.fastq.gz
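
For example, assuming the files above live in a directory called fastq/ (the directory name is illustrative):

hecatomb run --reads fastq/ --library paired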

When you specify a TSV file with --reads, Hecatomb expects a 2- or 3-column tab-separated file (depending on the preprocessing method), with the first column specifying a sample name and the remaining columns the relative or full paths to the forward (and reverse) read files, e.g. (an example command is shown after the listing):

sample1 /path/to/reads/sample1.1.fastq.gz /path/to/reads/sample1.2.fastq.gz
sample2 /path/to/reads/sample2.1.fastq.gz /path/to/reads/sample2.2.fastq.gz
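
For example, with the sample sheet saved as samples.tsv (the file name is illustrative):

hecatomb run --reads samples.tsv --library paired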

Dependencies

The only dependency you need to get up and running with Hecatomb is conda or the Python package manager pip. Hecatomb relies on conda (and mamba) to ensure portability and ease of installation of its dependencies. All of Hecatomb's dependencies are installed during installation or at runtime, so you don't have to worry about a thing!

Links

Hecatomb @ PyPI

Hecatomb @ bioconda

Hecatomb @ bio.tools

Hecatomb @ WorkflowHub

Code Snippets

shell:
    """
    # Map the shredded sequences against the reference at >=90% identity,
    # allowing at most 2 indels; keep only the mapped reads (outm)
    bbmap.sh ref={input.ref} in={input.shred} \
        outm={output} path=tmp/ \
        minid=0.90 maxindel=2 ow=t \
        threads={threads} -Xmx{resources.mem_mb}m &> {log}
    """
shell:
    """
    # Mask low-entropy regions of the reference, plus any regions
    # covered by the alignments in the sam file
    bbmask.sh in={input.ref} out={output.fa} \
        entropy={params.entropy} sam={input.sam} ow=t \
        threads={threads} -Xmx{resources.mem_mb}m &> {log}
    gzip -c {output.fa} > {output.gz}
    """
run:
    # Write a two-column TSV of sample name and length for every
    # sample in every result directory
    with open(output[0],'w') as o:
        for oDir in allDirSmplLen.keys():
            for smpl in allDirSmplLen[oDir].keys():
                o.write(f'{smpl}\t{allDirSmplLen[oDir][smpl]}\n')
run:
    # Merge the bigtable.tsv from each result directory into one table
    combineResultDirOutput(output[0],'bigtable.tsv')
run:
    # Merge the seqtable.properties.tsv from each result directory,
    # using column 0 as the sample column
    combineResultDirOutput(output[0],'seqtable.properties.tsv',sampleCol=0)
run:
    # Combine the per-directory seqtable.fasta files, keeping only records
    # whose sample name (the header field before ':') is known for that directory
    with open(output[0],'w') as o:
        for oDir in allDirSmplLen.keys():
            with open(os.path.join(oDir,'results','seqtable.fasta'),'r') as f:
                p = True    # whether the current record's sequence lines get written
                for line in f:
                    if line.startswith('>'):
                        s = line.replace('>','').split(':')
                        try:
                            allDirSmplLen[oDir][s[0]]   # sample known? keep the record
                            p = True
                            o.write(line)
                        except KeyError:
                            p = False
                    else:
                        if p:
                            o.write(line)
shell:
    """
    # Concatenate the input assemblies, prefixing contig names with a growing
    # run of zeros so that names stay unique across assemblies
    n=0
    for i in {input}; do
      cat $i | sed "s/>/>$n/"
      n=0$n
    done > {output.contigs}
    # Merge the combined contigs with Flye's subassemblies mode
    flye --subassemblies {output.contigs} -t {threads} --plasmids -o {output.flye} -g 1g
    mv {params} {output.assembly}
    """
run:
    # Try downloading from the first mirror; fall back to the second on failure
    import urllib.request
    import urllib.parse
    import shutil
    dlUrl1 = urllib.parse.urljoin(config.dbs.mirror1, os.path.join(wildcards.path, wildcards.file))
    dlUrl2 = urllib.parse.urljoin(config.dbs.mirror2, os.path.join(wildcards.path, wildcards.file))
    try:
        with urllib.request.urlopen(dlUrl1) as r, open(output[0],'wb') as o:
            shutil.copyfileobj(r,o)
    except Exception:
        with urllib.request.urlopen(dlUrl2) as r, open(output[0],'wb') as o:
            shutil.copyfileobj(r,o)
shell:
    """
    # Download the tarball, verify its md5 checksum, then extract it
    curl {params.url} -o {params.tar}
    curl {params.md5} | md5sum -c
    mkdir -p {params.dir}
    tar xvf {params.tar} -C {params.dir}
    """
shell:
    "samtools faidx {input} > {output} 2> {log} && rm {log}"
shell:
    "samtools index -@ {threads} {input} {output} 2> {log} && rm {log}"
shell:
    """
    countgc.sh in={input} format=2 ow=t > {output} 2> {log}
    rm {log}
    """
shell:
    """
    {{
    tetramerfreq.sh in={input} w=0 ow=t -Xmx{resources.mem_mb}m \
        | tail -n+2;
    }} > {output} 2>> {log}
    rm {log}
    """

URL: https://github.com/shandley/hecatomb
License: MIT License