Hecatomb: Enhancing Viral Read Identification in Metagenomes


A hecatomb is a great sacrifice or an extensive loss. Hecatomb the software empowers an analyst to make data-driven decisions to 'sacrifice' false-positive viral reads from metagenomes and enrich for true-positive viral reads. This process frequently results in a great loss of suspected viral sequences/contigs.


Documentation

Complete documentation is hosted at Read the Docs

Citation

Hecatomb is currently on bioRxiv!

Quick start guide

Running on HPC

Hecatomb is powered by Snakemake and greatly benefits from the use of Snakemake profiles for HPC clusters. More information and examples for setting up Snakemake profiles for Hecatomb are available in the documentation.
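
For example, a minimal profile sketch for a SLURM cluster (the profile name, submission flags, and job limits below are illustrative placeholders, not Hecatomb defaults):

# Snakemake looks for profiles in ~/.config/snakemake/<profile-name>/config.yaml
mkdir -p ~/.config/snakemake/slurm
cat > ~/.config/snakemake/slurm/config.yaml <<'EOF'
cluster: "sbatch --cpus-per-task={threads} --mem={resources.mem_mb}"
jobs: 100
latency-wait: 60
EOF
# then pass the profile name to any Hecatomb command, e.g.
hecatomb run --profile slurm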

Install option 1: PIP

# Optional: create a virtual env with conda
conda create -n hecatomb python=3.10
# activate
conda activate hecatomb
# Install
pip install hecatomb

Install option 2: Conda

# Create the conda env and install hecatomb in one step
conda create -n hecatomb -c conda-forge -c bioconda hecatomb
# activate
conda activate hecatomb

Check the installation

hecatomb --help

Install the databases

# locally: using 8 threads (default is 32 threads)
hecatomb install --threads 8
# HPC: using a snakemake profile named 'slurm'
hecatomb install --profile slurm

Run the test dataset

# locally: using 32 threads and 64 GB RAM by default
hecatomb test
# HPC: using a profile named 'slurm'
hecatomb test --profile slurm

Inputs

Hecatomb can process paired-end or single-end short-read sequencing, long-read sequencing, and paired-end sequencing from the round A/B library protocol.

hecatomb run --library paired
hecatomb run --library single
hecatomb run --library longread
hecatomb run --library roundAB

When you specify a directory of reads with --reads for paired-end sequencing, Hecatomb expects paired-end sequencing reads named in the format sampleName_R1/R2.fastq(.gz), e.g. (an example command is shown after the listing):

sample1_R1.fastq.gz
sample1_R2.fastq.gz
sample2_R1.fastq.gz
sample2_R2.fastq.gz
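
For example, assuming the files above live in a directory called fastq/ (the directory name is illustrative):

hecatomb run --reads fastq/ --library paired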

When you specify a TSV file with --reads, Hecatomb expects a 2- or 3-column tab-separated file (depending on the preprocessing method), with the first column specifying a sample name and the remaining columns the relative or full paths to the forward (and reverse) read files, e.g. (an example command is shown after the listing):

sample1 /path/to/reads/sample1.1.fastq.gz /path/to/reads/sample1.2.fastq.gz
sample2 /path/to/reads/sample2.1.fastq.gz /path/to/reads/sample2.2.fastq.gz
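
For example, with the sample sheet saved as samples.tsv (the file name is illustrative):

hecatomb run --reads samples.tsv --library paired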

Dependencies

The only dependency you need to get up and running with Hecatomb is conda or the Python package manager pip. Hecatomb relies on conda (and mamba) to ensure portability and ease of installation of its dependencies. All of Hecatomb's dependencies are installed during installation or at runtime, so you don't have to worry about a thing!

Links

Hecatomb @ PyPI

Hecatomb @ bioconda

Hecatomb @ bio.tools

Hecatomb @ WorkflowHub

Code Snippets

shell:
    """
    # Map the shredded sequences against the reference at >=90% identity,
    # allowing at most 2 indels; keep only the mapped reads (outm)
    bbmap.sh ref={input.ref} in={input.shred} \
        outm={output} path=tmp/ \
        minid=0.90 maxindel=2 ow=t \
        threads={threads} -Xmx{resources.mem_mb}m &> {log}
    """
shell:
    """
    # Mask low-entropy regions of the reference, plus any regions
    # covered by the alignments in the sam file
    bbmask.sh in={input.ref} out={output.fa} \
        entropy={params.entropy} sam={input.sam} ow=t \
        threads={threads} -Xmx{resources.mem_mb}m &> {log}
    gzip -c {output.fa} > {output.gz}
    """
run:
    # Write a two-column TSV of sample name and length for every
    # sample in every result directory
    with open(output[0],'w') as o:
        for oDir in allDirSmplLen.keys():
            for smpl in allDirSmplLen[oDir].keys():
                o.write(f'{smpl}\t{allDirSmplLen[oDir][smpl]}\n')
run:
    # Merge the bigtable.tsv from each result directory into one table
    combineResultDirOutput(output[0],'bigtable.tsv')
run:
    # Merge the seqtable.properties.tsv from each result directory,
    # using column 0 as the sample column
    combineResultDirOutput(output[0],'seqtable.properties.tsv',sampleCol=0)
run:
    # Combine the per-directory seqtable.fasta files, keeping only records
    # whose sample name (the header field before ':') is known for that directory
    with open(output[0],'w') as o:
        for oDir in allDirSmplLen.keys():
            with open(os.path.join(oDir,'results','seqtable.fasta'),'r') as f:
                p = True    # whether the current record's sequence lines get written
                for line in f:
                    if line.startswith('>'):
                        s = line.replace('>','').split(':')
                        try:
                            allDirSmplLen[oDir][s[0]]   # sample known? keep the record
                            p = True
                            o.write(line)
                        except KeyError:
                            p = False
                    else:
                        if p:
                            o.write(line)
shell:
    """
    # Concatenate the input assemblies, prefixing contig names with a growing
    # run of zeros so that names stay unique across assemblies
    n=0
    for i in {input}; do
      cat $i | sed "s/>/>$n/"
      n=0$n
    done > {output.contigs}
    # Merge the combined contigs with Flye's subassemblies mode
    flye --subassemblies {output.contigs} -t {threads} --plasmids -o {output.flye} -g 1g
    mv {params} {output.assembly}
    """
run:
    # Try downloading from the first mirror; fall back to the second on failure
    import urllib.request
    import urllib.parse
    import shutil
    dlUrl1 = urllib.parse.urljoin(config.dbs.mirror1, os.path.join(wildcards.path, wildcards.file))
    dlUrl2 = urllib.parse.urljoin(config.dbs.mirror2, os.path.join(wildcards.path, wildcards.file))
    try:
        with urllib.request.urlopen(dlUrl1) as r, open(output[0],'wb') as o:
            shutil.copyfileobj(r,o)
    except Exception:
        with urllib.request.urlopen(dlUrl2) as r, open(output[0],'wb') as o:
            shutil.copyfileobj(r,o)
shell:
    """
    # Download the tarball, verify its md5 checksum, then extract it
    curl {params.url} -o {params.tar}
    curl {params.md5} | md5sum -c
    mkdir -p {params.dir}
    tar xvf {params.tar} -C {params.dir}
    """
shell:
    "samtools faidx {input} > {output} 2> {log} && rm {log}"
shell:
    "samtools index -@ {threads} {input} {output} 2> {log} && rm {log}"
shell:
    """
    countgc.sh in={input} format=2 ow=t > {output} 2> {log}
    rm {log}
    """
shell:
    """
    {{
    tetramerfreq.sh in={input} w=0 ow=t -Xmx{resources.mem_mb}m \
        | tail -n+2;
    }} > {output} 2>> {log}
    rm {log}
    """

URL: https://github.com/shandley/hecatomb
License: MIT License