Workflow module for metagenome assembly


A Snakemake workflow for metagenome assembly using metaSPAdes and MEGAHIT. It performs pre-assembly read merging and error correction to improve the assembly and reduce the memory footprint.

It is part of metagenome-atlas. For citation information, see the atlas repository.

Usage

The usage of this workflow is described in the Snakemake Workflow Catalog.
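
A typical catalog-style deployment looks like this (a minimal sketch assuming snakedeploy is installed; the tag v1 and the config path are illustrative):

    # Deploy the workflow module into a fresh project directory
    snakedeploy deploy-workflow \
        https://github.com/metagenome-atlas/metagenome-assembly . --tag v1

    # Adjust config/config.yaml for your samples, then run with conda envs
    snakemake --cores all --use-conda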

If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) metagenome-assembly repository and its DOI (see above).

Code Snippets

Predicting genes on the contigs with Prodigal:
shell:
    """
    prodigal -i {input} -o {output.gff} -d {output.fna} \
        -a {output.faa} -p meta -f gff 2> {log}
    """
Rewriting the Prodigal headers into a tab-separated gene table:
run:
    header = [
        "gene_id",
        "Contig",
        "Gene_nr",
        "Start",
        "Stop",
        "Strand",
        "Annotation",
    ]
    with open(output.tsv, "w") as tsv:
        tsv.write("\t".join(header) + "\n")
        with open(input.faa) as fin:
            for line in fin:
                if line[0] == ">":
                    # Prodigal headers look like
                    # ">{sample}_{contig_nr}_{gene_nr} # start # stop # strand # annotation"
                    text = line[1:].strip().split(" # ")
                    old_gene_name = text[0]
                    text = text[1:]
                    old_gene_name_split = old_gene_name.split("_")
                    gene_nr = old_gene_name_split[-1]
                    contig_nr = old_gene_name_split[-2]
                    sample = "_".join(old_gene_name_split[:-2])
                    tsv.write(
                        "{gene_id}\t{sample}_{contig_nr}\t{gene_nr}\t{text}\n".format(
                            text="\t".join(text),
                            gene_id=old_gene_name,
                            sample=sample,
                            gene_nr=gene_nr,
                            contig_nr=contig_nr,
                        )
                    )
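
As a worked example (sample name and coordinates are made up), a Prodigal header such as

    >sample1_5_2 # 313 # 1028 # 1 # ID=5_2;partial=00

is written to the gene table as the tab-separated row

    sample1_5_2    sample1_5    2    313    1028    1    ID=5_2;partial=00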
Concatenating the input files:
shell:
    "cat {input} > {output}"
Assembling with MEGAHIT:
shell:
    "rm -r {params.outdir} 2> {log} "
    " ;\n "
    " megahit "
    " {params.inputs} "
    " --tmp-dir {resources.tmpdir} "
    " --num-cpu-threads {threads} "
    " --k-min {params.k_min} "
    " --k-max {params.k_max} "
    " --k-step {params.k_step} "
    " --out-dir {params.outdir} "
    " --out-prefix {wildcards.sample}_prefilter "
    " --min-contig-len {params.min_contig_len} "
    " --min-count {params.min_count} "
    " --merge-level {params.merge_level} "
    " --prune-level {params.prune_level} "
    " --low-local-ratio {params.low_local_ratio} "
    " --memory {resources.mem}000000000  "
    " {params.preset} &>> {log} "
Mapping reads to the contigs with the minimap2 wrapper:
wrapper:
    "v1.19.0/bio/minimap2/aligner"
Per-contig coverage with BBTools pileup.sh:
shell:
    "pileup.sh ref={input.fasta} in={input.bam} "
    " threads={threads} "
    " -Xmx{resources.java_mem}G "
    " covstats={output.covstats} "
    " concise=t "
    " secondary={params.pileup_secondary} "
    " 2> {log}"
Filtering contigs by coverage with filterbycoverage.sh:
shell:
    "filterbycoverage.sh "
    " in={input.fasta} "
    " cov={input.covstats} "
    " out={output.fasta} "
    " minc={params.minc} "
    " minp={params.minp} "
    " minr={params.minr} "
    " minl={params.minl} "
    " trim={params.trim} "
    " -Xmx{resources.java_mem}G "
    " 2> {log}"
Renaming and length-filtering contigs with rename.sh:
shell:
    "rename.sh "
    " in={input} out={output} ow=t "
    " prefix={wildcards.sample} "
    " minscaf={params.minlength} &> {log} "
Computing contig statistics with stats.sh:
shell:
    "stats.sh in={input} format=3 out={output} &> {log}"
Collecting the per-step contig statistics for one sample:
run:
    import os
    import pandas as pd

    c = pd.DataFrame()
    for f in input:
        # Each stats file holds a single summary row; index it by the
        # assembly step encoded in the file name.
        df = pd.read_csv(f, sep="\t")
        assembly_step = (
            os.path.basename(f)
            .replace("_contig_stats.txt", "")
            .replace(wildcards.sample + "_", "")
        )
        c.loc[assembly_step] = df.iloc[0]

    c.to_csv(output[0], sep="\t")
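
The resulting table has one row per assembly step (e.g. the prefilter assembly and the coverage-filtered contigs), so changes in contig counts and N50 across the filtering steps can be compared side by side.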
Mapping reads to the final contigs, again via the minimap2 wrapper:
wrapper:
    "v1.19.0/bio/minimap2/aligner"
Coverage statistics, histogram, and binned coverage for the final contigs with pileup.sh:
shell:
    "pileup.sh "
    " ref={input.fasta} "
    " in={input.bam} "
    " threads={threads} "
    " -Xmx{resources.java_mem}G "
    " covstats={output.covstats} "
    " hist={output.covhist} "
    " concise=t "
    " secondary={params.pileup_secondary} "
    " bincov={output.bincov} "
    " 2> {log} "
Indexing the BAM file with samtools:
shell:
    "samtools index {input}"
Combining statistics across all samples (the HTML report rule below is commented out in the source):
    script:
        "../scripts/combine_contig_stats.py"


"""
localrules:
    build_assembly_report,

rule build_assembly_report:
    input:
        combined_contig_stats="stats/combined_contig_stats.tsv",
    output:
        report="reports/assembly_report.html",
    conda:
        "../envs/report.yaml"
    log:
        "logs/assembly/report.log",
    script:
        "../report/assembly_report.py"
"""
Standardizing the input reads with reformat.sh:
shell:
    " reformat.sh "
    " {params.inputs} "
    " interleaved={params.interleaved} "
    " {params.outputs} "
    " iupacToN=t "
    " touppercase=t "
    " qout=33 "
    " overwrite=true "
    " verifypaired={params.verifypaired} "
    " addslash=t "
    " trimreaddescription=t "
    " threads={threads} "
    " pigz=t unpigz=t "
    " -Xmx{resources.java_mem}G "
    " 2> {log} "
Error-correcting reads with Tadpole:
shell:
    "tadpole.sh -Xmx{resources.java_mem}G "
    " prefilter={params.prefilter} "
    " prealloc=1 "
    " {params.inputs} "
    " {params.outputs} "
    " mode=correct "
    " aggressive={params.aggressive} "
    " tossjunk={params.tossjunk} "
    " lowdepthfraction={params.lowdepthfraction}"
    " tossdepth={params.tossdepth} "
    " merge=t "
    " shave={params.shave} rinse={params.shave} "
    " threads={threads} "
    " pigz=t unpigz=t "
    " ecc=t ecco=t "
    "&> {log} "
Merging overlapping read pairs with BBMerge:
shell:
    " bbmerge.sh "
    " -Xmx{resources.java_mem}G threads={threads} "
    " in1={input.R1} in2={input.R2} "
    " outmerged={output[2]} "
    " outu={output[0]} outu2={output[1]} "
    " {params.flags} k={params.kmer} "
    " pigz=t unpigz=t "
    " extend2={params.extend2} 2> {log} "
scripts/combine_contig_stats.py, which merges contig stats, gene counts, and mapping rates per sample:
import os, sys
import logging, traceback

logging.basicConfig(
    filename=snakemake.log[0],
    level=logging.INFO,
    format="%(asctime)s %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)


def handle_exception(exc_type, exc_value, exc_traceback):
    if issubclass(exc_type, KeyboardInterrupt):
        sys.__excepthook__(exc_type, exc_value, exc_traceback)
        return

    logging.error(
        "".join(
            [
                "Uncaught exception: ",
                *traceback.format_exception(exc_type, exc_value, exc_traceback),
            ]
        )
    )


# Install exception handler
sys.excepthook = handle_exception


import pandas as pd
from utils.parsers_bbmap import parse_pileup_log_file


def parse_map_stats(sample_data, out_tsv):
    stats_df = pd.DataFrame()
    for sample in sample_data.keys():
        df = pd.read_csv(sample_data[sample]["contig_stats"], sep="\t")
        assert df.shape[0] == 1, "Assumed only one row in file {}; found {}".format(
            sample_data[sample]["contig_stats"], df.shape[0]
        )
        df = df.iloc[0]
        df.name = sample
        genes_df = pd.read_csv(sample_data[sample]["gene_table"], index_col=0, sep="\t")
        df["N_Predicted_Genes"] = genes_df.shape[0]

        mapping_stats = parse_pileup_log_file(sample_data[sample]["mapping_log"])

        df["Assembled_Reads"] = mapping_stats["Mapped reads"]
        df["Percent_Assembled_Reads"] = mapping_stats["Percent mapped"]

        # DataFrame.append was removed in pandas 2.0; concatenate the one-row frame instead
        stats_df = pd.concat([stats_df, df.to_frame().T])
    stats_df = stats_df.loc[:, ~stats_df.columns.str.startswith("scaf_")]
    stats_df.columns = stats_df.columns.str.replace("ctg_", "")
    stats_df.to_csv(out_tsv, sep="\t")
    return stats_df


def main(samples, contig_stats, gene_tables, mapping_logs, combined_stats):
    sample_data = {}
    for sample in samples:
        sample_data[sample] = {}
        for c_stat in contig_stats:
            # underscore version was for simplified local testing
            # if "%s_" % sample in c_stat:
            if "%s/" % sample in c_stat:
                sample_data[sample]["contig_stats"] = c_stat
        for g_table in gene_tables:
            # if "%s_" % sample in g_table:
            if "%s/" % sample in g_table:
                sample_data[sample]["gene_table"] = g_table
        for mapping_log in mapping_logs:
            # if "%s_" % sample in mapping_log:
            if "%s/" % sample in mapping_log:
                sample_data[sample]["mapping_log"] = mapping_log

    parse_map_stats(sample_data, combined_stats)


if __name__ == "__main__":
    main(
        samples=snakemake.params.samples,
        contig_stats=snakemake.input.contig_stats,
        gene_tables=snakemake.input.gene_tables,
        mapping_logs=snakemake.input.mapping_logs,
        combined_stats=snakemake.output.combined_contig_stats,
    )
The minimap2 aligner wrapper (v1.19.0/bio/minimap2/aligner):
__author__ = "Tom Poorten"
__copyright__ = "Copyright 2017, Tom Poorten"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path
from snakemake.shell import shell
from snakemake_wrapper_utils.samtools import infer_out_format
from snakemake_wrapper_utils.samtools import get_samtools_opts


samtools_opts = get_samtools_opts(snakemake, parse_output=False)
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
sort = snakemake.params.get("sorting", "none")
sort_extra = snakemake.params.get("sort_extra", "")

out_ext = infer_out_format(snakemake.output[0])

pipe_cmd = ""
if out_ext != "PAF":
    # Add option for SAM output
    extra += " -a"

    # Determine which pipe command to use for converting to bam or sorting.
    if sort == "none":

        if out_ext != "SAM":
            # Simply convert to output format using samtools view.
            pipe_cmd = f"| samtools view -h {samtools_opts}"

    elif sort in ["coordinate", "queryname"]:

        # Add name flag if needed.
        if sort == "queryname":
            sort_extra += " -n"

        # Sort alignments.
        pipe_cmd = f"| samtools sort {sort_extra} {samtools_opts}"

    else:
        raise ValueError(f"Unexpected value for params.sort: {sort}")


shell(
    "(minimap2"
    " -t {snakemake.threads}"
    " {extra} "
    " {snakemake.input.target}"
    " {snakemake.input.query}"
    " {pipe_cmd}"
    " > {snakemake.output[0]}"
    ") {log}"
)
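
For reference, a rule calling this wrapper might look as follows (a minimal sketch: the rule name, file paths, and the -x sr preset are illustrative; input.target, input.query, and the params keys follow the wrapper code above):

rule minimap2_bam:
    input:
        target="{sample}/contigs.fasta",  # reference to map against
        query=["{sample}_R1.fastq.gz", "{sample}_R2.fastq.gz"],
    output:
        "{sample}/reads_mapped.bam",  # .bam extension selects BAM output
    params:
        extra="-x sr",  # minimap2 short-read preset (illustrative)
        sorting="coordinate",  # piped through samtools sort
    threads: 8
    log:
        "logs/minimap2/{sample}.log",
    wrapper:
        "v1.19.0/bio/minimap2/aligner"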


Maintainers: public
URL: https://github.com/metagenome-atlas/metagenome-assembly
Name: metagenome-assembly
Version: 1
License: MIT License