Metagenomic Taxonomic Classification Benchmarking Workflow

metax

Metagenomic taxonomic classification benchmarking

Overview

This repository contains snakemake workflows, a jupyter notebook, and a python package. This directory can be used directly as the snakemake workflow directory (the easiest setup), but you can also symlink the Snakefile, rules/, and config.yaml paths into a separate directory to keep the workflow apart from the data.

Organization

  • The fastq/ folder should contain input fastq datasets compressed in .bz2 format.

  • The db/ folder contains metagenomic databases, which can be downloaded from gs://metax-bakeoff-2019 using gsutil .

  • The src/ folder contains code for classifiers that don't have a conda package available. src/local/ is a local install prefix akin to /usr/local/ and contains compiled code (compatible with Ubuntu 16.04). Packages that aren't installed directly to that prefix are installed under src/local/stow. Packages installed here may need to be recompiled on a different OS. All remaining packages are installed via conda. There are two conda environments: metax, based on python3, serves as the default environment, while metax_py2, based on python2, is needed by some packages and is only used to run specific classifiers.

  • The envs/ folder contains environment yaml files that fully specify these two conda environments, while the conda_environment*.txt files are their human-editable counterparts, similar to requirements.txt files: only a subset of critical packages are version-pinned, and transitive dependencies are not included.

  • The rules/ directory contains snakemake rules for each method.

  • The tmp/ folder is the default local folder for storing temporary files needed by various methods.

  • The log/ folder contains stdout/stderr logs.

  • The time/ folder contains /usr/bin/time -v logs.

  • The benchmark/ folder contains snakemake benchmarking tsvs.

  • The data/ folder is intended for storing the raw outputs of classifiers, typically with a filename like <sample_name>.<classifier>.<database>.tsv.gz (see the filename-parsing sketch after this list). These raw outputs are typically individual read classifications, so they are O(n) in the size of the input fastqs, and compressed.

  • The reports/ folder is intended for storing the taxonomy reports, which typically contain one line per taxon with its abundance or number of classified reads per sample. These files are much smaller, O(1) in size relative to the inputs, and follow a filename format similar to data/ .

  • The info/ and summary/ folders contain summary data for datasets, such as the *.total_reads.txt files giving the number of reads in the input fastqs, and files like summary/classified_count.tsv giving the number of classified reads when it cannot be determined from the output reports.

  • The plotting/ folder contains the compiled_reports.tsv master tsv of all classification results generated by metax compile as well as a few other summary files. It also contains the jupyter notebook that will produce the final plots.
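To make the data/ and reports/ naming convention concrete, here is a minimal Python sketch (illustrative only, not part of the package) that splits each report filename into its sample, classifier, and database components:

import pathlib

# Illustrative: filenames follow '<sample_name>.<classifier>.<database>.tsv.gz'
for path in pathlib.Path('reports').glob('*.tsv.gz'):
    stem = path.name[:-len('.tsv.gz')]
    sample, classifier, database = stem.rsplit('.', 2)
    print(sample, classifier, database)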

Configuration

A default config exists at config/config.yaml . It assumes that everything in the workflow is placed relative to this git directory root, including all of the databases, inputs, outputs, code, and non-conda software. The config also specifies the locations of the various databases, so databases already on disk don't have to be downloaded from Google Cloud Storage; this also means that downloading databases and running classifiers are separate snakemake invocations. The most practical setting to change is TMPDIR, which defaults to tmp/ but can be pointed at another physical drive to improve performance.
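A trimmed config might look like the following sketch; the keys shown here are illustrative, so consult config/config.yaml for the authoritative names and layout:

TMPDIR: tmp/              # change to a directory on a fast physical drive
TAXONOMY_DB: db/taxonomy  # shared NCBI taxonomy used by all methods
samples_pe:               # paired-end sample names (see "Adding FASTQ datasets")
  - sample_a
samples_se:               # single-ended sample names
  - sample_b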

Install

  1. Install miniconda ( https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html ). Install gsutil ( https://cloud.google.com/storage/docs/gsutil_install ) for downloading the already-hosted databases.

  2. Create the two conda environments from the envs/ directory: conda env create -n metax --file envs/py3.yaml and conda env create -n metax_py2 --file envs/py2.yaml .

  3. Activate the metax conda environment.

  4. Install this package itself as a python package: pip install .

  5. Install remaining software into src/ using snakemake download_src . These packages may need to be recompiled. A consolidated install sequence is sketched below.
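Assuming miniconda and gsutil are already on your PATH, steps 2-5 amount to:

conda env create -n metax --file envs/py3.yaml
conda env create -n metax_py2 --file envs/py2.yaml
conda activate metax
pip install .
snakemake download_src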

Execution

  1. Download any desired prerequisite databases using a snakemake rule. To download all databases, use snakemake download_db_all . To download a specific database, use the general form snakemake db/[default|refseqc]/[method] , such as snakemake db/refseqc/kraken or snakemake db/default/kraken .

  2. Download existing fastq datasets using snakemake download_fastq_all .

  3. Run snakemake rules to classify samples using rules such as snakemake kraken_default_all . Alternatively all classifiers can be run on all samples by running snakemake all , but note the caveats. [^1]

  4. Run the snakemake rule to aggregate reports, read counts, and classified counts snakemake compile_reports . This will generate the plotting/compiled_reports.tsv file with all of the samples and classifications.

  5. Run the jupyter notebook plotting/Metagenomics Bench.ipynb to generate all plots. An end-to-end command sketch follows this list.
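Put together, a minimal end-to-end run (using kraken with the refseqc database as the example method) looks like:

snakemake db/refseqc/kraken      # or snakemake download_db_all
snakemake download_fastq_all
snakemake kraken_refseqc_all     # one snakemake invocation per method; see footnote [^1]
snakemake compile_reports
jupyter notebook "plotting/Metagenomics Bench.ipynb"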

Adding Classifiers

To add a new classifier to compare against the existing results, the easiest way is to append the new method's classification results to the master plotting/compiled_reports.tsv file and then rerun the jupyter notebook. Note that this will not add the compute time/resource benchmarks. The columns in this file are:

  • sample : Sample name

  • classifier : Classifier name

  • database : Name of the database, usually 'default' or 'refseqc'.

  • taxid : NCBI taxonomy ID of the identified taxon.

  • cum_abundance : Cumulative abundance (out of 1) of classified reads assigned to this taxid and its children.

  • cls_cum_abundance : Cumulative abundance (out of total classified abundance).

  • unique_abundance : Specific abundance classified to this taxid, not including children.

  • cls_unique_abundance : Specific abundance (out of total classified abundance).

  • classrank : The specific taxonomy rank profile level setting of the classifier. Either 'species', 'genus', or 'all' if the classifier assigns abundances to all levels of the taxonomy simultaneously.

  • rank : Taxonomy rank e.g. 'species', 'genus' etc.

Because the NCBI taxonomy tree and ids are constantly changing, you should use the same taxonomy database as all other methods, which can be acquired with snakemake db/taxonomy .
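As a sketch of this route, the following appends rows for a hypothetical 'newmethod' classifier to the master tsv using pandas (all values are illustrative):

import pandas as pd

# Illustrative rows for a hypothetical 'newmethod' classifier.
new_rows = pd.DataFrame([{
    'sample': 'sample_a', 'classifier': 'newmethod', 'database': 'default',
    'taxid': 562, 'cum_abundance': 0.12, 'cls_cum_abundance': 0.15,
    'unique_abundance': 0.10, 'cls_unique_abundance': 0.13,
    'classrank': 'all', 'rank': 'species',
}])
compiled = pd.read_csv('plotting/compiled_reports.tsv', sep='\t')
pd.concat([compiled, new_rows], ignore_index=True).to_csv(
    'plotting/compiled_reports.tsv', sep='\t', index=False)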

To add additional classifier methods, the best way is to mimic an existing method. Create a new snakemake file rules/newmethod.smk and add include: 'rules/newmethod.smk' to the master Snakefile . For examples, see the other method snakemake files like kaiju.smk . Ideally you will create rules for snakemake newmethod_default_all and snakemake newmethod_refseqc_all .
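A minimal rules/newmethod.smk could follow this shape; the rule bodies, paths, and the newmethod executable are hypothetical, so copy the real structure from an existing .smk file such as kaiju.smk:

rule newmethod_default:
    input:
        'fastq/{sample}_1.fastq.bz2',
        'fastq/{sample}_2.fastq.bz2'
    output:
        data='data/{sample}.newmethod.default.tsv.gz',
        report='reports/{sample}.newmethod.default.tsv.gz'
    shell:
        # 'newmethod' is a hypothetical classifier CLI.
        'newmethod -1 {input[0]} -2 {input[1]} -o {output.data} -r {output.report}'

rule newmethod_default_all:
    input:
        expand('reports/{sample}.newmethod.default.tsv.gz',
               sample=config['samples_pe'])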

Parsing Taxonomic Reports

The parser for the raw classified reports is part of the metax python package, in the command metax compile . This command processes all of the heterogeneous report files in the reports/ folder and works by matching filenames to parser classes. This allows custom report-parsing code for different classifiers, but ultimately each parser class has a parse() method that yields python dicts for each taxon identified in a fastq file. These dicts contain mandatory keys like classifier , rank , taxid etc., and are eventually turned into tsv rows, so that all of the rows (taxa × samples) are aggregated into a single harmonized output tsv for ingestion by the jupyter notebook.
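Conceptually, each parser is just a generator over report lines. A simplified sketch, where the '<taxid>\t<rank>\t<abundance>' report format is invented purely for illustration:

import gzip

class SimpleTsvParser:
    CLASSIFIER = 'newmethod'  # hypothetical classifier name

    def parse(self, path):
        # Yield one dict per taxon line of an invented
        # '<taxid>\t<rank>\t<abundance>' report format.
        with gzip.open(path, 'rt') as f:
            for line in f:
                taxid, rank, abundance = line.rstrip('\n').split('\t')
                yield {
                    'classifier': self.CLASSIFIER,
                    'taxid': int(taxid),
                    'rank': rank,
                    'cum_abundance': float(abundance),
                }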

In metax/reporter.py , create an additional class NewmethodParser() . The easiest approach is to subclass an existing parser if your report output has a similar enough format, such as Kraken-like outputs or simple tsvs. For more complicated formats you will need specialized parsing code, which may require a full read of the report if things like abundance cannot be calculated via streaming. For example, here is the MegablastParser and the corresponding line where it is registered with the Matchers object.

class MegablastParser(KrakenParser):
    MATCH_GLOB = '*.megablast.*.txt'
    MATCH_RE = re.compile(r'(?P<fn_prefix>.*)\.megablast\.(?P<database>[^.]*)\.txt')
    CLASSIFIER = 'megablast'
    is_paired_count = False

class Matchers:
    @classmethod
    def create(cls, processor):
        ...
        m.add('megablast', MegablastParser(processor))
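By analogy, a hypothetical NewmethodParser that reuses the Kraken-style parsing would only need new match patterns, plus a registration line inside Matchers.create():

class NewmethodParser(KrakenParser):
    # Hypothetical patterns for reports/<sample>.newmethod.<database>.tsv
    MATCH_GLOB = '*.newmethod.*.tsv'
    MATCH_RE = re.compile(r'(?P<fn_prefix>.*)\.newmethod\.(?P<database>[^.]*)\.tsv')
    CLASSIFIER = 'newmethod'

# inside Matchers.create():
#     m.add('newmethod', NewmethodParser(processor))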

Jupyter Notebook Plots

After the snakemake compile_reports step, the Metagenomics Bench.ipynb notebook also needs to be modified to incorporate additional methods. The classifiers global variable in the notebook needs an additional entry mapping your method's <classifier_codename> to its <formal_classifier_name> . Similarly, add it to refseqc_classifiers if it has an appropriate refseqc database, and keep it in main_classifiers if it should be included on all plots.
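For example (the exact container types may differ in the notebook, so treat this as a sketch):

# In Metagenomics Bench.ipynb
classifiers['newmethod'] = 'NewMethod'     # codename -> formal display name
refseqc_classifiers.append('newmethod')    # only if it has a refseqc database
main_classifiers.append('newmethod')       # include it on all plots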

Creating refseqc database

The source fastas for creating the refseqc databases are located at gs://metax-bakeoff-2019/refseqc/fasta/genomic.fna.gz and gs://metax-bakeoff-2019/refseqc/fasta/protein.faa.gz for DNA and protein databases respectively. Precomputed mappings of the accession to taxonomy IDs are located at gs://metax-bakeoff-2019/refseqc/fasta/genomic.taxids and gs://metax-bakeoff-2019/refseqc/fasta/protein.taxids .
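For example, all four files can be fetched with gsutil; the db/refseqc/fasta/ destination here is an assumed local layout:

gsutil cp \
    gs://metax-bakeoff-2019/refseqc/fasta/genomic.fna.gz \
    gs://metax-bakeoff-2019/refseqc/fasta/genomic.taxids \
    gs://metax-bakeoff-2019/refseqc/fasta/protein.faa.gz \
    gs://metax-bakeoff-2019/refseqc/fasta/protein.taxids \
    db/refseqc/fasta/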

Adding FASTQ datasets

Compress and place the fastq as <sample_name>_1.fastq.bz2 and <sample_name>_2.fastq.bz2 for the two paired ends, and add the sample name to the config.yaml under samples_pe . If the file is single-ended, place it as <sample_name>.fastq.bz2 instead and add the sample name under samples_se . Also add an entry to config/compile_config.yaml under samples , mapping the <sample_name> to a dict of metadata. The only required keys are sample: , indicating the desired output name in compiled_reports.tsv , and paired: True if the sample is paired-end; otherwise it is assumed to be single-ended.
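For a hypothetical paired-end sample named mysample, the files and config entries would look like:

fastq/mysample_1.fastq.bz2
fastq/mysample_2.fastq.bz2

# config.yaml
samples_pe:
  - mysample

# config/compile_config.yaml
samples:
  mysample:
    sample: mysample
    paired: True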

[^1]: Snakemake doesn't easily manage batching of jobs that share a common resource, such as a large metagenomic database whose dependent jobs should run serially so the database doesn't have to be reloaded into memory. To alleviate this and improve performance, run separate snakemake executions for each method, such as snakemake kraken_default_all; snakemake kraken_refseqc_all; snakemake diamond_default_all etc.

Code Snippets

run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(AC_DIAMOND_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output}')
shell:
    '''
    export TMPDIR={TMPDIR}
    metax blast-report --total-reads $(cat {input.total_reads}) --blast-report {output.report} --tax-dir {params.db} <(zstd -dc {input.data})
    '''
shell:
    '''
    export LC_ALL=C TMPDIR={TMPDIR}
    echo {input}

    env PICARD="{PICARD}" ~/efs/src/metax/scripts/bam_to_fastq_paired.sh {output[0]} {output[1]} {input}
    '''
shell:
    '''
    export LC_ALL=C TMPDIR={TMPDIR}
    picard -Xmx100g -Djava.io.tmpdir={TMPDIR} FastqToSam {params.inargs} OUTPUT={output} SAMPLE_NAME={wildcards.seq} QUALITY_FORMAT=Standard MAX_RECORDS_IN_RAM=5000000
    '''
run:
    shell('{DROPCACHE}')
    shell('makeblastdb -in {input.fna} -dbtype nucl -out {params.out} -taxid_map <(cut -f1-2 {input.taxid_map}) -parse_seqids',
          bench_record=bench_record)
shell:
    '''
    metax blast-report --total-reads $(cat {input.total_reads}) --blast-lca {output.lca} --blast-report {output.report} --tax-dir {params.db} {input.data}
    '''
shell:
    '''
    export PS1=
    source activate metax_py2
    echo {input}
    /usr/bin/time -v -o {log.time} \
    {params.exe} -i {input} -k {params.db} -o {output} -l G 2>&1 | tee {log.log}
    '''
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(BRACKEN_SHELL, bench_record=bench_record)
run:
    shell('''\
    {DROPCACHE}
    ''')
    shell(r'''\
    export PS1=
    source activate metax_py2
    /usr/bin/time -v -o {log.time} \
    kraken --db {params.kraken_dir} --fasta-input --threads 32 <( find -L {params.kraken_dir}/library \( -name "*.fna" -o -name "*.fa" -o -name "*.fasta" \) -exec cat {{}} + )  > {params.dir}/database.kraken
    /usr/bin/time -v -a -o {log.time} \
    perl ~/efs/metax/Bracken/src/count-kmer-abundances.pl --db {params.kraken_dir} --read-length 150 {params.dir}/database.kraken > {params.dir}/database150mers.kraken_cnts
    /usr/bin/time -v -a -o {log.time} \
    python ~/efs/metax/Bracken/src/generate_kmer_distribution.py -i {params.dir}/database150mers.kraken_cnts -o {params.dir}/KMER_DISTR.150.txt
    ''', bench_record=bench_record)
shell:
    '''
    export PS1=
    source activate metax_py2
    {params.exe} -x {params.db} <(pigz -dc {input}) > {output.kreport}
    '''
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(CENTRIFUGE_SHELL + CENTRIFUGE_REPORT_SHELL, bench_record=bench_record)
run:
    # shell('''\
    # metax create-centrifuge-map {NCBI_GENOMES}

    # cat all-*.map > all.map
    # ''')
    shell('''\
    {DROPCACHE}
    ''')
    shell('''\
    export TMPDIR={TMPDIR}
    TMP_DIR=$(mktemp -d)
shell:
    '''
    : > {output.aux_input_se}
    FILES=({input.se})
    for (( i=0; i<${{#FILES[@]}} ; i+=1 )) ; do
        echo "${{FILES[i]}}" >> {output.aux_input_se}
    done

    : > {output.aux_output}
    for i in {params.output_files_se}; do
      echo "$i" >> {output.aux_output}
    done

    /usr/bin/time -v -o {log.time} \
    {params.exe} {params.db} -n {threads} -O {output.aux_input_se} -R {output.aux_output} 2>&1 | tee {log.log}

    : > {output.aux_input_pe[0]}
    : > {output.aux_input_pe[1]}
    FILES=({input.pe})
    for (( i=0; i<${{#FILES[@]}} ; i+=2 )) ; do
        echo "${{FILES[i]}}" >> {output.aux_input_pe[0]}
        echo "${{FILES[i+1]}}" >> {output.aux_input_pe[1]}
    done

    : > {output.aux_output}
    for i in {params.output_files_pe}; do
        echo "$i" >> {output.aux_output}
    done


    /usr/bin/time -av -o {log.time} \
    {params.exe} {params.db} -n {threads} -P {output.aux_input_pe[0]} {output.aux_input_pe[1]} -R {output.aux_output} 2>&1 | tee -a {log.log}
    '''
shell:
    '''
    {params.exe} -F {input} {params.db} > {output}
    '''
shell:
    '''
    {params.exe} -F <(zcat {input}) {params.db} > {output}
    '''
From line 149 of rules/clark.smk
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(CLARK_SHELL + CLARK_BENCHMARK_ABUNDANCE_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output}')
From line 173 of rules/clark.smk
run:
    shell('{DROPCACHE}')
    shell('''\
    /usr/bin/time -v -o {log.time}  \
    {config[CLARK_OPT]}/set_targets.sh /mnt/metax/workflow/db/refseqc/clark custom
    {config[CLARK]} -D {params.dir}/custom_0 -T {params.dir}/targets.txt -O /mnt/metax/workflow/tmp/empty.fastq -R empty.results
    ''', bench_record=bench_record)
From line 186 of rules/clark.smk
run:
    shell('{DROPCACHE}')
    shell('''\
    echo "$(readlink -f {params.dir}/custom_0)" > {config[CLARK_OPT]}/.dbAddress
    # Must be in the CLARK install directory
    cd {config[CLARK_OPT]}
    {config[CLARK_OPT]}/buildSpacedDB.sh
    ''', bench_record=bench_record)
From line 201 of rules/clark.smk
shell:
    '''
    /usr/bin/time --verbose --append -o {log.time} \
    {DIAMOND} blastx -q {input} --verbose -p {threads} -d {params.dmnd}{params.tmpdir}{params.blocksize}{params.index_chunks} --taxonmap {params.taxonmap} --taxonnodes {params.taxonnodes} --outfmt 102 -o {params.tax} 2>&1 | tee {log.log}
    {PIGZ} -9 {params.tax}
    '''
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell('''\
    {params.input_args} | \
    /usr/bin/time --verbose --append -o {log.time} \
    {DIAMOND} blastx --verbose --log -p {threads} -d {params.dmnd}{params.tmpdir}{params.blocksize}{params.index_chunks} --taxonmap {params.taxonmap} --taxonnodes {params.taxonnodes} --outfmt 102 -o {output.tax} 2>&1 | tee {log.log}

    ''', bench_record=bench_record)
    shell('''\
    metax taxid-report --output {output.report} --tax-dir {config[TAXONOMY_DB]} {output.tax}
    ''')
shell:
    '''
    metax taxid-report --output {output} --tax-dir {config[TAXONOMY_DB]} <({PIGZ} -dc {input})
    '''
run:
    shell('{DROPCACHE}')
    shell('''\
    {DIAMOND} makedb --in {input.faa} -d db/refseqc/diamond/refseqc
    ''', bench_record=bench_record)
shell:
    '''
    mkdir {output}
    gsutil cp {input} - | pigz -d | tar x -C {output}
    '''
shell:
    '''
    mkdir {output}
    gsutil cp {input} - | zstd -d | tar x -C {output}
    '''
shell:
    '''
    gsutil cp {input} {output}
    '''
shell:
    '''
    gsutil cp {input} - | zstd -d | tar x
    '''
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(GOTTCHA_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output}')
shell:
    '''
    {params.kaiju_table} -r species -e -p -v -t {params.db_dir}/nodes.dmp -n {params.db_dir}/names.dmp -o {output} <(pigz -dc {input})
    '''
shell:
    '''
    {params.kaiju_table} -r genus -e -p -v -t {params.db_dir}/nodes.dmp -n {params.db_dir}/names.dmp -o {output} <(pigz -dc {input})
    '''
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(KAIJU_SHELL + KAIJU_REPORT_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output}')
From line 104 of rules/kaiju.smk
run:
    shell(r'''\
    export LC_ALL=C

    join -t$'\t' <(zcat {input.faa} | fasta_formatter -t | sed 's/ /\t/' ) {input.a2t} | \
    awk 'BEGIN {{FS=OFS="\t"}} {{printf(">%s\n%s\n", $4, $2)}}' > protein.faa
    ''')
    shell('''\
    /mnt/metax/src/kaiju/bin/mkbwt -n {threads} -l 100000 -a protein -o proteins protein.faa
    /mnt/metax/src/kaiju/bin/mkfmi proteins
    ''', bench_record=bench_record)
    shell('rm proteins.sa proteins.bwt')
From line 118 of rules/kaiju.smk
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(KARP_SHELL, bench_record=bench_record)
From line 50 of rules/karp.smk
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(KRAKEN2_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output.reads}')
run:
    shell('''\
    mkdir {params.dir}/taxonomy {params.dir}/library
    ln -sf {params.tax_db}/names.dmp {params.tax_db}/nodes.dmp {params.dir}/taxonomy
    {PIGZ} -fdc {params.tax_db}/accession2taxid/nucl_gb.accession2taxid.gz > {params.dir}/taxonomy/nucl_gb.accession2taxid
    {DROPCACHE}
    ''')

    shell('''\
    kraken2-build --db {params.dir} --add-to-library {input.fna}
    kraken2-build --db {params.dir} --build --threads {threads}
    ''', bench_record=bench_record)
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(KRAKENHLL_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output.reads}')
run:
    shell('''\
    mkdir {params.dir}/taxonomy {params.dir}/library
    ln -s {TAXONOMY_DB}/names.dmp {TAXONOMY_DB}/nodes.dmp {params.dir}/taxonomy
    {PIGZ} -dc {TAXONOMY_DB}/accession2taxid/nucl_gb.accession2taxid.gz > {params.dir}/taxonomy/nucl_gb.accession2taxid
    cp {input.seqmap} {params.dir}/library/
    {DROPCACHE}
    ''')

    shell('''\
    {KRAKENHLL_BUILD} --db {params.dir} --add-to-library {input.fna} > {log.log} || true  # For some reason returns 255
    /usr/bin/time -v -o {log.time} \
    {KRAKENHLL_BUILD} --db {params.dir} --build --threads {threads} --jellyfish-hash-size 20000000000 >> {log.log}
    ''', bench_record=bench_record)
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(KRAKEN_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output}')
run:
    shell('''\
    mkdir {params.dir}/taxonomy {params.dir}/library
    ln -sf {params.tax_db}/names.dmp {params.tax_db}/nodes.dmp {params.dir}/taxonomy
    {PIGZ} -fdc {params.tax_db}/accession2taxid/nucl_gb.accession2taxid.gz > {params.dir}/taxonomy/nucl_gb.accession2taxid
    {DROPCACHE}
    ''')

    shell('''\
    kraken-build --db {params.dir} --add-to-library {input.fna}
    kraken-build --db {params.dir} --build --threads {threads}
    # --jellyfish-hash-size
    ''', bench_record=bench_record)
From line 104 of rules/kraken.smk
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(KSLAM_SHELL, bench_record=bench_record)
shell:
    '''
    echo "$(zcat {input} | awk '$2 != 0' | wc -l) * {params.paired}" | bc -l > {output}
    '''
run:
    shell('''\
    mkdir -p db/refseqc/kslam
    {DROPCACHE}
    ''')
    shell('''\
    /usr/bin/time -v -o {log.time} \
      {KSLAM} --parse-taxonomy {params.tax_db}/names.dmp {params.tax_db}/nodes.dmp --output-file taxDB | tee {log.log}
    /usr/bin/time -v -a -o {log.time} \
      {KSLAM} --output-file {output.db} --parse-genbank {input.gbff} | tee -a {log.log}
    ''', bench_record=bench_record)
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(METAOTHELLO_SHELL, bench_record=bench_record)
shell:
    '''
    metax metaothello-report --tax-dir {params.db} <({PIGZ} -dc {input}) {output}
    '''
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(METAPHLAN2_SHELL, bench_record=bench_record)
shell:
    '''
    cut -f1 {input} | uniq | wc -l > {output}
    '''
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    else:
        shell('rm {params.query_lca_db} {params.query_db}')
    shell(MMSEQS2_SHELL, bench_record=bench_record)
shell:
    '''
    metax mmseqs-report {input.data} {output} --total-reads $(cat {input.total_reads}) --tax-dir {params.db}
    '''
run:
  shell('''\
  /usr/bin/time -v -o {log.time} \
  {MMSEQS2} createdb {input.faa} {params.dir}/refseqc.db
  ''', bench_record=bench_record)
  # shell('''
  # metax create-mmseqs-taxonomy -o {params.dir}/refseqc.tsv --tax-db {TAXONOMY_DB} --accession2taxid {input.prot} {input.dead_prot} -- {params.dir}/refseqc.db_h
  # ''')
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(MOTUS_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output}')
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(PATHSEQ_SHELL, bench_record=bench_record)
run:
    if params.is_paired:
        shell('expr "$(samtools view -c -f 0x41 -F 0xf04 {input})" + "$(samtools view -c -f 0x81 -F 0xf04 {input})" > {output}')
    else:
        shell('samtools view -c -F 0xf04 {input} > {output}')
shell:
    '''
    lbzip2 -dc {input} | sed -e '/^[^>]/s/X/N/g' | lbzip2 > {output}
    '''
run:
  if fastq_pairing == 'interleaved':
      shell("""
      echo {input[0]}
      echo {input[1]}
      paste <(paste - - - - < <(lbzcat {input[0]})) \
            <(paste - - - - < <(lbzcat {input[1]})) \
            | tr '\t' '\n' \
            | lbzip2 > {output}
      """)
  elif fastq_pairing == 'interleaved_add_suffix':
      shell(r"""
      echo {input[0]}
      echo {input[1]}
      paste <(paste - - - - < <(awk '{{if (NR % 4 == 1) {{printf "%s/1\n", $1}} \
                                       else if (NR % 4 == 3){{print "+"}} else {{print $0}} }}' <(lbzcat {input[0]}))) \
            <(paste - - - - < <(awk '{{if (NR % 4 == 1) {{printf "%s/2\n", $1}} \
                                       else if (NR % 4 == 3){{print "+"}} else {{print $0}} }}' <(lbzcat {input[1]}))) \
            | tr '\t' '\n' \
            | lbzip2 > {output}
      """)
  elif fastq_pairing == 'cat':
      shell("""
      echo {input[0]}
      echo {input[1]}
      lbzcat {input[0]} {input[1]} | lbzip2 > {output}
      """)
shell:
    """
    echo {input}
    lbzip2 --decompress --force --keep {input[0]}
    # If original file is empty
    touch {output}
    """
shell:
  """
  echo {input}
  lbzip2 --decompress --force --keep {input[0]}
  # If original file is empty
  touch {output}
  """
shell:
    '''
    fastqc -o qc -f fastq <(zcat {input})
    '''
shell:
    '''
    seqtk seq -a {input} > {output}
    '''
shell:
  '''
  seqtk seq -F F {input} > {output}
  '''
shell:
    '''
    /usr/bin/time -v -o {log.time} \
    {params.exe} {params.args} 2>&1 | tee {log}
    rm -f {params.ofastq1} {params.ofastq2}
    '''
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(PROPHYLE_SHELL)
shell:
    '''
    metax prophyle-report{params.paired} --total-reads $(cat {input.total_reads}) --output {output.report} --tax-dir {params.db} {input.data}
    '''
run:
    shell('''
    mkdir -p db/refseqc/build_prophyle

    python -c "from ete3 import ncbi_taxonomy; ncbi_taxonomy.NCBITaxa(taxdump_file='{params.tax_db}/taxdump.tar.gz')"
    {DROPCACHE}
    ''')

    shell('''\
    awk -F "\t" -v OFS="\t" '$12=="Complete Genome" && $11=="latest" {{print $1, $6, $20}}' {input.assembly_summary} > ftpselection.tsv
    cut -f 3 ftpselection.tsv | awk 'BEGIN{{FS=OFS="/";filesuffix="genomic.fna.gz"}} {{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}}' > ftpfilepaths.tsv
    cut -f 1,2 ftpselection.tsv | sed 's/\.[0-9]*//' > acc2taxid.tsv

    prophyle_ncbi_tree.py archaea {NCBI_GENOMES}/refseq archaea.nw acc2taxid.tsv -l archaea.log
    prophyle_ncbi_tree.py bacteria {NCBI_GENOMES}/refseq bacteria.nw acc2taxid.tsv -l bacteria.log
    prophyle_ncbi_tree.py viral {NCBI_GENOMES}/refseq viral.nw acc2taxid.tsv -l viral.log

    ln -s {NCBI_GENOMES}/refseq/archaea .
    ln -s {NCBI_GENOMES}/refseq/bacteria .
    ln -s {NCBI_GENOMES}/refseq/viral .
    /usr/bin/time -v -o {log.time} \
      prophyle index -k 31 archaea.nw bacteria.nw viral.nw refseqc | tee {log.log}
    ''' , bench_record=bench_record)
    shell('mv refseqc ../prophyle')
shell:
  '''
  ncbi-genome-download -F fasta,protein-fasta,genbank archaea,bacteria,viral
  touch {output}
  '''
shell:
  '''
  CWD=$(pwd)
  : > {output}

  cd {NCBI_GENOMES}/refseq

  DOMAINS=(archaea bacteria viral)
  for DOMAIN in ${{DOMAINS[@]}}; do
    pushd $DOMAIN
    awk -F "\t" '$12=="Complete Genome" && $11=="latest" {{print $1}}' assembly_summary.txt > ${{DOMAIN}}_selection.txt

    for i in $(awk -F "\t" '$12=="Complete Genome" {{print $1}}' assembly_summary.txt); do
      cat $i/*_genomic.fna.gz >> $CWD/{output.fna}
      cat $i/*_protein.faa.gz >> $CWD/{output.faa} || true
      {PIGZ} -dc $i/*_genomic.gbff.gz >> $CWD/{output.gbff}
    done
    popd
  done
  '''
shell:
    '''
    {PIGZ} -dk {input}
    '''
shell:
  '''
  export LC_ALL=C
  cat <({PIGZ} -dc {input.prot} | tail -n +2) <({PIGZ} -dc {input.dead_prot} | tail -n +2) \
  | cut -f2,3 | sort -S {resources.mem}G -k1,1 -t$'\t' > {output}
  '''
shell:
  '''
  export LC_ALL=C
  cat <({PIGZ} -dc {input.nucl} | tail -n +2) <({PIGZ} -dc {input.dead_nucl} | tail -n +2) \
  | cut -f2,3,4 | sort -S {resources.mem}G -k1,1 -t$'\t' > {output}
  '''
shell:
    '''
    export LC_ALL=C
    zcat {input} | fasta_formatter -t | cut -f1 | sed 's/ /\t/' | sort -k1,1 -s -S {resources.mem}G -t $'\t' > {output}
    '''
shell:
    '''
    export LC_ALL=C
    join -t$'\t' <(sed 's/ /\t/' {input.headers}) {input.a2t} | cut -f1,3,2 > {output}
    '''
shell:
    '''
    export LC_ALL=C
    join -t$'\t' {input.a2t} <({PIGZ} -dc {input.fna} | fasta_formatter -t | sed 's/ /\t/' | sort -k1,1 -S {resources.mem}G -t $'\t') | \
    awk 'BEGIN {{FS=OFS="\t"}} {{printf(">gi|%s|%s %s\n%s\n", $3, $1, $4, $5)}}' > {output}
    '''
run:
    shell('echo {input}')
    if wildcards.seq in samples_fastq:
        shell('''\
        TOTAL_READS=$(expr $(wc -l <(lbzcat {input}) | cut -f1 -d' ') / 4)
        echo $TOTAL_READS > {output}
        ''')
    elif wildcards.seq in samples_fasta:
        shell('''\
        TOTAL_READS=$(fasta_formatter -i <(lbzcat {input}) -t | wc -l)
        echo $TOTAL_READS > {output}
        ''')
shell:
    '''
    for i in {input}; do
        NAME="$(echo $(basename $i) | sed 's/.total_reads.txt//' )"
        echo -e "$NAME\t$(cat $i)" >> {output}
    done
    '''
shell:
    '''
    cat {input} > {output}
    '''
shell:
    '''
    OUTPUT=$(readlink -f {output})
    pushd db/refseqc
    cd blast
    echo -e "megablast\trefseqc\t$(du -c genomic* taxdb* | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../bracken
    echo -e "bracken\trefseqc\t$(du -c KMER_DISTR.150* database150* | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../centrifuge
    echo -e "centrifuge\trefseqc\t$(du -c refseqc.*.cf | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../clark
    echo -e "clark\trefseqc\t$(du -c custom_0/db_central* | tail -n 1 | cut -f1)" >> "$OUTPUT"
    echo -e "clark_s\trefseqc\t$(du -c custom_0/T** | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../diamond
    echo -e "diamond\trefseqc\t$(du -c * | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../kaiju
    echo -e "kaiju\trefseqc\t$(du -c * | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../karp
    echo -e "karp\trefseqc\t$(du -c karp.tax karp.fasta rrna_karp.index | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../kraken
    echo -e "kraken\trefseqc\t$(du -c database.idx database.kdb taxonomy | tail -n 1 | cut -f1)" >> "$OUTPUT"

    cd ../kraken2
    echo -e "kraken2\trefseqc\t$(du -c *.k2d | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../krakenhll
    echo -e "krakenhll\trefseqc\t$(du -c database.idx database.kdb taxDB | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../kslam
    echo -e "kslam\trefseqc\t$(du -c database taxDB | tail -n 1 | cut -f1)" >> "$OUTPUT"
    # cd ../metaothello
    # echo -e "metaothello\trefseqc\t$(du -c id2tax names output.index | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../mmseqs2
    echo -e "mmseqs2\trefseqc\t$(du -c * | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../prophyle
    echo -e "prophyle\trefseqc\t$(du -c abv | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../taxmaps
    echo -e "taxmaps\trefseqc\t$(du -c refseqc.gem refseqc.len taxonomy.tbl | tail -n 1 | cut -f1)" >> "$OUTPUT"
    popd

    pushd db/gottcha
    echo -e "gottcha\tdefault\t$(du -c GOTTCHA_BACTERIA_c4937_k24_u30_xHUMAN3x.species.* GOTTCHA_VIRUSES_c5900_k24_u30_xHUMAN3x.species.* *.dmp | tail -n 1 | cut -f1)" >> "$OUTPUT"
    popd

    pushd db/metaphlan2
    echo -e "metaphlan2\tdefault\t$(du -c db_v20 | tail -n 1 | cut -f1)" >> "$OUTPUT"
    popd

    pushd ~/efs/miniconda3/envs/metax/share/motus-2.0.1
    echo -e "motus2\tdefault\t$(du -c db_mOTU | tail -n 1 | cut -f1)" >> "$OUTPUT"
    popd

    pushd db/pathseq
    echo -e "pathseq\tdefault\t$(du -c pathseq_microbe* pathseq_taxonomy.db | tail -n 1 | cut -f1)" >> "$OUTPUT"
    popd

    pushd db/metaothello
    echo -e "metaothello\tdefault\t$(du -c 20161108 | tail -n 1 | cut -f1)" >> "$OUTPUT"
    popd
    '''
shell:
    '''
    for i in {input}; do
       NAME="$(echo $(basename $i) | sed 's/.txt//' )"
       echo -e "$NAME\t$(cat $i)" >> {output}
    done
    '''
shell:
    '''
    metax compile --tax-dir {config[TAXONOMY_DB]} --classified-counts {input.classified_counts} --read-counts {input.total_reads} --dir reports --config config/compile_config.yaml --rank-abundances {output.ranksum} {output.reports}
    '''
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(taxmaps_rule, bench_record=bench_record)
run:
    shell('{DROPCACHE}')
    shell('''\
    /usr/bin/time -v -o {log.time}  \
    env PATH={GEMPATH}:$PATH \
    {TAXMAPS_TBL} -n {TAXONOMY_DB}/names.dmp -t {TAXONOMY_DB}/nodes.dmp > {params.dir}/taxonomy.tbl 2>&1 | tee {log.log}
    /usr/bin/time -a -v -o {log.time}  \
    env PATH={GEMPATH}:$PATH \
    {TAXMAPS_INDEX} -i {input.fna} -c {TAXONOMY_DB}/gi_taxid_nucl.dmp -t taxonomy.tbl -p {params.dir}/refseqc | tee -a {log.log}
    ''', bench_record=bench_record)

URL: https://github.com/yesimon/metax_bakeoff_2019
License: MIT