Metagenomic Taxonomic Classification Benchmarking Workflow

metax

Metagenomic taxonomic classification benchmarking

Overview

This repository contains snakemake workflows, a jupyter notebook, and a python package. This directory can be used directly as the snakemake workflow directory (the easiest setup), but you can also symlink the Snakefile, rules/, and config.yaml paths into a separate directory to keep the workflow apart from the data.

Organization

  • The fastq/ folder should contain input fastq datasets compressed in .bz2 format.

  • The db/ folder contains metagenomic databases, which can be downloaded from gs://metax-bakeoff-2019 using gsutil .

  • The src/ folder contains code for classifiers that don't have a conda package available. src/local/ is a local install prefix akin to /usr/local/ and contains compiled code (compatible with Ubuntu 16.04). Packages that aren't installed directly to that prefix are installed under src/local/stow. Packages installed here may need to be recompiled on a different OS. All remaining packages are installed via conda. There are two conda environments: metax, based on python3, serves as the default environment, while metax_py2, based on python2, is needed by some packages and is only used to run specific classifiers.

  • The envs/ folder contains environment yaml files that fully specify these two conda environments, while the conda_environment*.txt files are their human-editable counterparts, similar to requirements.txt files: only a subset of critical packages are version-pinned, and transitive dependencies are not included.

  • The rules/ directory contains snakemake rules for each method.

  • The tmp/ folder is the default local folder for storing temporary files needed by various methods.

  • The log/ folder contains stdout/stderr logs.

  • The time/ folder contains /usr/bin/time -v logs.

  • The benchmark/ folder contains snakemake benchmarking tsvs.

  • The data/ folder is intended for storing the raw outputs of classifiers, typically with a filename like <sample_name>.<classifier>.<database>.tsv.gz (see the filename-parsing sketch after this list). These raw outputs are typically individual read classifications, so they are O(n) in the size of the input fastqs, and compressed.

  • The reports/ folder is intended for storing the taxonomy reports, which typically contain one line per taxon with its abundance or number of classified reads per sample. These files are much smaller, O(1) in size relative to the inputs, and follow a filename format similar to data/ .

  • The info/ and summary/ folders contain summary data for datasets, such as the *.total_reads.txt files giving the number of reads in the input fastqs, and files like summary/classified_count.tsv giving the number of classified reads when it cannot be determined from the output reports.

  • The plotting/ folder contains the compiled_reports.tsv master tsv of all classification results generated by metax compile as well as a few other summary files. It also contains the jupyter notebook that will produce the final plots.
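To make the data/ and reports/ naming convention concrete, here is a minimal Python sketch (illustrative only, not part of the package) that splits each report filename into its sample, classifier, and database components:

import pathlib

# Illustrative: filenames follow '<sample_name>.<classifier>.<database>.tsv.gz'
for path in pathlib.Path('reports').glob('*.tsv.gz'):
    stem = path.name[:-len('.tsv.gz')]
    sample, classifier, database = stem.rsplit('.', 2)
    print(sample, classifier, database)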

Configuration

A default config exists at config/config.yaml . It assumes that everything in the workflow is placed relative to this git directory root, including all of the databases, inputs, outputs, code, and non-conda software. The config also specifies the locations of the various databases, so databases already on disk don't have to be downloaded from Google Cloud Storage; this also means that downloading databases and running classifiers are separate snakemake invocations. The most practical setting to change is TMPDIR, which defaults to tmp/ but can be pointed at another physical drive to improve performance.
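A trimmed config might look like the following sketch; the keys shown here are illustrative, so consult config/config.yaml for the authoritative names and layout:

TMPDIR: tmp/              # change to a directory on a fast physical drive
TAXONOMY_DB: db/taxonomy  # shared NCBI taxonomy used by all methods
samples_pe:               # paired-end sample names (see "Adding FASTQ datasets")
  - sample_a
samples_se:               # single-ended sample names
  - sample_b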

Install

  1. Install miniconda ( https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html ). Install gsutil ( https://cloud.google.com/storage/docs/gsutil_install ) for downloading the already-hosted databases.

  2. Create the two conda environments from the envs/ directory: conda env create -n metax --file envs/py3.yaml and conda env create -n metax_py2 --file envs/py2.yaml .

  3. Activate the metax conda environment.

  4. Install this package itself as a python package: pip install .

  5. Install remaining software into src/ using snakemake download_src . These packages may need to be recompiled. A consolidated install sequence is sketched below.
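Assuming miniconda and gsutil are already on your PATH, steps 2-5 amount to:

conda env create -n metax --file envs/py3.yaml
conda env create -n metax_py2 --file envs/py2.yaml
conda activate metax
pip install .
snakemake download_src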

Execution

  1. Download any desired prerequisite databases using a snakemake rule. To download all databases, use snakemake download_db_all . To download a specific database, use the general form snakemake db/[default|refseqc]/[method] , such as snakemake db/refseqc/kraken or snakemake db/default/kraken .

  2. Download existing fastq datasets using snakemake download_fastq_all .

  3. Run snakemake rules to classify samples using rules such as snakemake kraken_default_all . Alternatively all classifiers can be run on all samples by running snakemake all , but note the caveats. [^1]

  4. Run the snakemake rule to aggregate reports, read counts, and classified counts snakemake compile_reports . This will generate the plotting/compiled_reports.tsv file with all of the samples and classifications.

  5. Run the jupyter notebook plotting/Metagenomics Bench.ipynb to generate all plots. An end-to-end command sketch follows this list.
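Put together, a minimal end-to-end run (using kraken with the refseqc database as the example method) looks like:

snakemake db/refseqc/kraken      # or snakemake download_db_all
snakemake download_fastq_all
snakemake kraken_refseqc_all     # one snakemake invocation per method; see footnote [^1]
snakemake compile_reports
jupyter notebook "plotting/Metagenomics Bench.ipynb"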

Adding Classifiers

To add a new classifier to compare against the existing results, the easiest way is to append the new method's classification results to the master plotting/compiled_reports.tsv file and then rerun the jupyter notebook. Note that this will not add the compute time/resource benchmarks. The columns in this file are:

  • sample : Sample name

  • classifier : Classifier name

  • database : Name of the database, usually 'default' or 'refseqc'.

  • taxid : NCBI taxonomy ID of the identified taxon.

  • cum_abundance : Cumulative abundance (out of 1) of classified reads assigned to this taxid and its children.

  • cls_cum_abundance : Cumulative abundance (out of total classified abundance).

  • unique_abundance : Specific abundance classified to this taxid, not including children.

  • cls_unique_abundance : Specific abundance (out of total classified abundance).

  • classrank : The specific taxonomy rank profile level setting of the classifier. Either 'species', 'genus', or 'all' if the classifier assigns abundances to all levels of the taxonomy simultaneously.

  • rank : Taxonomy rank e.g. 'species', 'genus' etc.

Because the NCBI taxonomy tree and ids are constantly changing, you should use the same taxonomy database as all other methods, which can be acquired with snakemake db/taxonomy .
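As a sketch of this route, the following appends rows for a hypothetical 'newmethod' classifier to the master tsv using pandas (all values are illustrative):

import pandas as pd

# Illustrative rows for a hypothetical 'newmethod' classifier.
new_rows = pd.DataFrame([{
    'sample': 'sample_a', 'classifier': 'newmethod', 'database': 'default',
    'taxid': 562, 'cum_abundance': 0.12, 'cls_cum_abundance': 0.15,
    'unique_abundance': 0.10, 'cls_unique_abundance': 0.13,
    'classrank': 'all', 'rank': 'species',
}])
compiled = pd.read_csv('plotting/compiled_reports.tsv', sep='\t')
pd.concat([compiled, new_rows], ignore_index=True).to_csv(
    'plotting/compiled_reports.tsv', sep='\t', index=False)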

To add additional classifier methods, the best way is to mimic an existing method. Create a new snakemake file rules/newmethod.smk and add include: 'rules/newmethod.smk' to the master Snakefile . For examples, see the other method snakemake files like kaiju.smk . Ideally you will create rules for snakemake newmethod_default_all and snakemake newmethod_refseqc_all .
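A minimal rules/newmethod.smk could follow this shape; the rule bodies, paths, and the newmethod executable are hypothetical, so copy the real structure from an existing .smk file such as kaiju.smk:

rule newmethod_default:
    input:
        'fastq/{sample}_1.fastq.bz2',
        'fastq/{sample}_2.fastq.bz2'
    output:
        data='data/{sample}.newmethod.default.tsv.gz',
        report='reports/{sample}.newmethod.default.tsv.gz'
    shell:
        # 'newmethod' is a hypothetical classifier CLI.
        'newmethod -1 {input[0]} -2 {input[1]} -o {output.data} -r {output.report}'

rule newmethod_default_all:
    input:
        expand('reports/{sample}.newmethod.default.tsv.gz',
               sample=config['samples_pe'])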

Parsing Taxonomic Reports

The parser for the raw classified reports is part of the metax python package, in the command metax compile . This command processes all of the heterogeneous report files in the reports/ folder and works by matching filenames to parser classes. This allows custom report-parsing code for different classifiers, but ultimately each parser class has a parse() method that yields python dicts for each taxon identified in a fastq file. These dicts contain mandatory keys like classifier , rank , taxid etc., and are eventually turned into tsv rows, so that all of the rows (taxa × samples) are aggregated into a single harmonized output tsv for ingestion by the jupyter notebook.
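Conceptually, each parser is just a generator over report lines. A simplified sketch, where the '<taxid>\t<rank>\t<abundance>' report format is invented purely for illustration:

import gzip

class SimpleTsvParser:
    CLASSIFIER = 'newmethod'  # hypothetical classifier name

    def parse(self, path):
        # Yield one dict per taxon line of an invented
        # '<taxid>\t<rank>\t<abundance>' report format.
        with gzip.open(path, 'rt') as f:
            for line in f:
                taxid, rank, abundance = line.rstrip('\n').split('\t')
                yield {
                    'classifier': self.CLASSIFIER,
                    'taxid': int(taxid),
                    'rank': rank,
                    'cum_abundance': float(abundance),
                }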

In metax/reporter.py , create an additional class NewmethodParser() . The easiest approach is to subclass an existing parser if your report output has a similar enough format, such as Kraken-like outputs or simple tsvs. For more complicated formats you will need specialized parsing code, which may require a full read of the report if things like abundance cannot be calculated via streaming. For example, here is the MegablastParser and the corresponding line where it is registered with the Matchers object.

class MegablastParser(KrakenParser):
    MATCH_GLOB = '*.megablast.*.txt'
    MATCH_RE = re.compile(r'(?P<fn_prefix>.*)\.megablast\.(?P<database>[^.]*)\.txt')
    CLASSIFIER = 'megablast'
    is_paired_count = False

class Matchers:
    @classmethod
    def create(cls, processor):
        ...
        m.add('megablast', MegablastParser(processor))
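By analogy, a hypothetical NewmethodParser that reuses the Kraken-style parsing would only need new match patterns, plus a registration line inside Matchers.create():

class NewmethodParser(KrakenParser):
    # Hypothetical patterns for reports/<sample>.newmethod.<database>.tsv
    MATCH_GLOB = '*.newmethod.*.tsv'
    MATCH_RE = re.compile(r'(?P<fn_prefix>.*)\.newmethod\.(?P<database>[^.]*)\.tsv')
    CLASSIFIER = 'newmethod'

# inside Matchers.create():
#     m.add('newmethod', NewmethodParser(processor))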

Jupyter Notebook Plots

After the snakemake compile_reports step, the Metagenomics Bench.ipynb notebook also needs to be modified to incorporate additional methods. The classifiers global variable in the notebook needs an additional entry mapping your method's <classifier_codename> to its <formal_classifier_name> . Similarly, add it to refseqc_classifiers if it has an appropriate refseqc database, and keep it in main_classifiers if it should be included on all plots.
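For example (the exact container types may differ in the notebook, so treat this as a sketch):

# In Metagenomics Bench.ipynb
classifiers['newmethod'] = 'NewMethod'     # codename -> formal display name
refseqc_classifiers.append('newmethod')    # only if it has a refseqc database
main_classifiers.append('newmethod')       # include it on all plots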

Creating refseqc database

The source fastas for creating the refseqc databases are located at gs://metax-bakeoff-2019/refseqc/fasta/genomic.fna.gz and gs://metax-bakeoff-2019/refseqc/fasta/protein.faa.gz for DNA and protein databases respectively. Precomputed mappings of the accession to taxonomy IDs are located at gs://metax-bakeoff-2019/refseqc/fasta/genomic.taxids and gs://metax-bakeoff-2019/refseqc/fasta/protein.taxids .
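For example, all four files can be fetched with gsutil; the db/refseqc/fasta/ destination here is an assumed local layout:

gsutil cp \
    gs://metax-bakeoff-2019/refseqc/fasta/genomic.fna.gz \
    gs://metax-bakeoff-2019/refseqc/fasta/genomic.taxids \
    gs://metax-bakeoff-2019/refseqc/fasta/protein.faa.gz \
    gs://metax-bakeoff-2019/refseqc/fasta/protein.taxids \
    db/refseqc/fasta/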

Adding FASTQ datasets

Compress and place the fastq as <sample_name>_1.fastq.bz2 and <sample_name>_2.fastq.bz2 for the two paired ends, and add the sample name to the config.yaml under samples_pe . If the file is single-ended, place it as <sample_name>.fastq.bz2 instead and add the sample name under samples_se . Also add an entry to config/compile_config.yaml under samples , mapping the <sample_name> to a dict of metadata. The only required keys are sample: , indicating the desired output name in compiled_reports.tsv , and paired: True if the sample is paired-end; otherwise it is assumed to be single-ended.
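For a hypothetical paired-end sample named mysample, the files and config entries would look like:

fastq/mysample_1.fastq.bz2
fastq/mysample_2.fastq.bz2

# config.yaml
samples_pe:
  - mysample

# config/compile_config.yaml
samples:
  mysample:
    sample: mysample
    paired: True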

[^1]: Snakemake doesn't easily manage batching of jobs that share a common resource, such as a large metagenomic database whose dependent jobs should run serially so the database doesn't have to be reloaded into memory. To alleviate this and improve performance, run separate snakemake executions for each method, such as snakemake kraken_default_all; snakemake kraken_refseqc_all; snakemake diamond_default_all etc.

Code Snippets

run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(AC_DIAMOND_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output}')
shell:
    '''
    export TMPDIR={TMPDIR}
    metax blast-report --total-reads $(cat {input.total_reads}) --blast-report {output.report} --tax-dir {params.db} <(zstd -dc {input.data})
    '''
shell:
    '''
    export LC_ALL=C TMPDIR={TMPDIR}
    echo {input}

    env PICARD="{PICARD}" ~/efs/src/metax/scripts/bam_to_fastq_paired.sh {output[0]} {output[1]} {input}
    '''
shell:
    '''
    export LC_ALL=C TMPDIR={TMPDIR}
    picard -Xmx100g -Djava.io.tmpdir={TMPDIR} FastqToSam {params.inargs} OUTPUT={output} SAMPLE_NAME={wildcards.seq} QUALITY_FORMAT=Standard MAX_RECORDS_IN_RAM=5000000
    '''
run:
    shell('{DROPCACHE}')
    shell('makeblastdb -in {input.fna} -dbtype nucl -out {params.out} -taxid_map <(cut -f1-2 {input.taxid_map}) -parse_seqids',
          bench_record=bench_record)
shell:
    '''
    metax blast-report --total-reads $(cat {input.total_reads}) --blast-lca {output.lca} --blast-report {output.report} --tax-dir {params.db} {input.data}
    '''
shell:
    '''
    export PS1=
    source activate metax_py2
    echo {input}
    /usr/bin/time -v -o {log.time} \
    {params.exe} -i {input} -k {params.db} -o {output} -l G 2>&1 | tee {log.log}
    '''
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(BRACKEN_SHELL, bench_record=bench_record)
run:
    shell('''\
    {DROPCACHE}
    ''')
    shell(r'''\
    export PS1=
    source activate metax_py2
    /usr/bin/time -v -o {log.time} \
    kraken --db {params.kraken_dir} --fasta-input --threads 32 <( find -L {params.kraken_dir}/library \( -name "*.fna" -o -name "*.fa" -o -name "*.fasta" \) -exec cat {{}} + )  > {params.dir}/database.kraken
    /usr/bin/time -v -a -o {log.time} \
    perl ~/efs/metax/Bracken/src/count-kmer-abundances.pl --db {params.kraken_dir} --read-length 150 {params.dir}/database.kraken > {params.dir}/database150mers.kraken_cnts
    /usr/bin/time -v -a -o {log.time} \
    python ~/efs/metax/Bracken/src/generate_kmer_distribution.py -i {params.dir}/database150mers.kraken_cnts -o {params.dir}/KMER_DISTR.150.txt
    ''', bench_record=bench_record)
shell:
    '''
    export PS1=
    source activate metax_py2
    {params.exe} -x {params.db} <(pigz -dc {input}) > {output.kreport}
    '''
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(CENTRIFUGE_SHELL + CENTRIFUGE_REPORT_SHELL, bench_record=bench_record)
run:
    # shell('''\
    # metax create-centrifuge-map {NCBI_GENOMES}

    # cat all-*.map > all.map
    # ''')
    shell('''\
    {DROPCACHE}
    ''')
    shell('''\
    export TMPDIR={TMPDIR}
    TMP_DIR=$(mktemp -d)
shell:
    '''
    : > {output.aux_input_se}
    FILES=({input.se})
    for (( i=0; i<${{#FILES[@]}} ; i+=1 )) ; do
        echo "${{FILES[i]}}" >> {output.aux_input_se}
    done

    : > {output.aux_output}
    for i in {params.output_files_se}; do
      echo "$i" >> {output.aux_output}
    done

    /usr/bin/time -v -o {log.time} \
    {params.exe} {params.db} -n {threads} -O {output.aux_input_se} -R {output.aux_output} 2>&1 | tee {log.log}

    : > {output.aux_input_pe[0]}
    : > {output.aux_input_pe[1]}
    FILES=({input.pe})
    for (( i=0; i<${{#FILES[@]}} ; i+=2 )) ; do
        echo "${{FILES[i]}}" >> {output.aux_input_pe[0]}
        echo "${{FILES[i+1]}}" >> {output.aux_input_pe[1]}
    done

    : > {output.aux_output}
    for i in {params.output_files_pe}; do
        echo "$i" >> {output.aux_output}
    done


    /usr/bin/time -av -o {log.time} \
    {params.exe} {params.db} -n {threads} -P {output.aux_input_pe[0]} {output.aux_input_pe[1]} -R {output.aux_output} 2>&1 | tee -a {log.log}
    '''
shell:
    '''
    {params.exe} -F {input} {params.db} > {output}
    '''
shell:
    '''
    {params.exe} -F <(zcat {input}) {params.db} > {output}
    '''
From line 149 of rules/clark.smk
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(CLARK_SHELL + CLARK_BENCHMARK_ABUNDANCE_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output}')
From line 173 of rules/clark.smk
run:
    shell('{DROPCACHE}')
    shell('''\
    /usr/bin/time -v -o {log.time}  \
    {config[CLARK_OPT]}/set_targets.sh /mnt/metax/workflow/db/refseqc/clark custom
    {config[CLARK]} -D {params.dir}/custom_0 -T {params.dir}/targets.txt -O /mnt/metax/workflow/tmp/empty.fastq -R empty.results
    ''', bench_record=bench_record)
From line 186 of rules/clark.smk
run:
    shell('{DROPCACHE}')
    shell('''\
    echo "$(readlink -f {params.dir}/custom_0)" > {config[CLARK_OPT]}/.dbAddress
    # Must be in the CLARK install directory
    cd {config[CLARK_OPT]}
    {config[CLARK_OPT]}/buildSpacedDB.sh
    ''', bench_record=bench_record)
From line 201 of rules/clark.smk
shell:
    '''
    /usr/bin/time --verbose --append -o {log.time} \
    {DIAMOND} blastx -q {input} --verbose -p {threads} -d {params.dmnd}{params.tmpdir}{params.blocksize}{params.index_chunks} --taxonmap {params.taxonmap} --taxonnodes {params.taxonnodes} --outfmt 102 -o {params.tax} 2>&1 | tee {log.log}
    {PIGZ} -9 {params.tax}
    '''
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell('''\
    {params.input_args} | \
    /usr/bin/time --verbose --append -o {log.time} \
    {DIAMOND} blastx --verbose --log -p {threads} -d {params.dmnd}{params.tmpdir}{params.blocksize}{params.index_chunks} --taxonmap {params.taxonmap} --taxonnodes {params.taxonnodes} --outfmt 102 -o {output.tax} 2>&1 | tee {log.log}

    ''', bench_record=bench_record)
    shell('''\
    metax taxid-report --output {output.report} --tax-dir {config[TAXONOMY_DB]} {output.tax}
    ''')
shell:
    '''
    metax taxid-report --output {output} --tax-dir {config[TAXONOMY_DB]} <({PIGZ} -dc {input})
    '''
run:
    shell('{DROPCACHE}')
    shell('''\
    {DIAMOND} makedb --in {input.faa} -d db/refseqc/diamond/refseqc
    ''', bench_record=bench_record)
shell:
    '''
    mkdir {output}
    gsutil cp {input} - | pigz -d | tar x -C {output}
    '''
shell:
    '''
    mkdir {output}
    gsutil cp {input} - | zstd -d | tar x -C {output}
    '''
shell:
    '''
    gsutil cp {input} {output}
    '''
shell:
    '''
    gsutil cp {input} - | zstd -d | tar x
    '''
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(GOTTCHA_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output}')
shell:
    '''
    {params.kaiju_table} -r species -e -p -v -t {params.db_dir}/nodes.dmp -n {params.db_dir}/names.dmp -o {output} <(pigz -dc {input})
    '''
shell:
    '''
    {params.kaiju_table} -r genus -e -p -v -t {params.db_dir}/nodes.dmp -n {params.db_dir}/names.dmp -o {output} <(pigz -dc {input})
    '''
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(KAIJU_SHELL + KAIJU_REPORT_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output}')
From line 104 of rules/kaiju.smk
run:
    shell(r'''\
    export LC_ALL=C

    join -t$'\t' <(zcat {input.faa} | fasta_formatter -t | sed 's/ /\t/' ) {input.a2t} | \
    awk 'BEGIN {{FS=OFS="\t"}} {{printf(">%s\n%s\n", $4, $2)}}' > protein.faa
    ''')
    shell('''\
    /mnt/metax/src/kaiju/bin/mkbwt -n {threads} -l 100000 -a protein -o proteins protein.faa
    /mnt/metax/src/kaiju/bin/mkfmi proteins
    ''', bench_record=bench_record)
    shell('rm proteins.sa proteins.bwt')
From line 118 of rules/kaiju.smk
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(KARP_SHELL, bench_record=bench_record)
From line 50 of rules/karp.smk
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(KRAKEN2_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output.reads}')
run:
    shell('''\
    mkdir {params.dir}/taxonomy {params.dir}/library
    ln -sf {params.tax_db}/names.dmp {params.tax_db}/nodes.dmp {params.dir}/taxonomy
    {PIGZ} -fdc {params.tax_db}/accession2taxid/nucl_gb.accession2taxid.gz > {params.dir}/taxonomy/nucl_gb.accession2taxid
    {DROPCACHE}
    ''')

    shell('''\
    kraken2-build --db {params.dir} --add-to-library {input.fna}
    kraken2-build --db {params.dir} --build --threads {threads}
    ''', bench_record=bench_record)
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(KRAKENHLL_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output.reads}')
run:
    shell('''\
    mkdir {params.dir}/taxonomy {params.dir}/library
    ln -s {TAXONOMY_DB}/names.dmp {TAXONOMY_DB}/nodes.dmp {params.dir}/taxonomy
    {PIGZ} -dc {TAXONOMY_DB}/accession2taxid/nucl_gb.accession2taxid.gz > {params.dir}/taxonomy/nucl_gb.accession2taxid
    cp {input.seqmap} {params.dir}/library/
    {DROPCACHE}
    ''')

    shell('''\
    {KRAKENHLL_BUILD} --db {params.dir} --add-to-library {input.fna} > {log.log} || true  # For some reason returns 255
    /usr/bin/time -v -o {log.time} \
    {KRAKENHLL_BUILD} --db {params.dir} --build --threads {threads} --jellyfish-hash-size 20000000000 >> {log.log}
    ''', bench_record=bench_record)
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(KRAKEN_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output}')
run:
    shell('''\
    mkdir {params.dir}/taxonomy {params.dir}/library
    ln -sf {params.tax_db}/names.dmp {params.tax_db}/nodes.dmp {params.dir}/taxonomy
    {PIGZ} -fdc {params.tax_db}/accession2taxid/nucl_gb.accession2taxid.gz > {params.dir}/taxonomy/nucl_gb.accession2taxid
    {DROPCACHE}
    ''')

    shell('''\
    kraken-build --db {params.dir} --add-to-library {input.fna}
    kraken-build --db {params.dir} --build --threads {threads}
    # --jellyfish-hash-size
    ''', bench_record=bench_record)
From line 104 of rules/kraken.smk
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(KSLAM_SHELL, bench_record=bench_record)
shell:
    '''
    echo "$(zcat {input} | awk '$2 != 0' | wc -l) * {params.paired}" | bc -l > {output}
    '''
run:
    shell('''\
    mkdir -p db/refseqc/kslam
    {DROPCACHE}
    ''')
    shell('''\
    /usr/bin/time -v -o {log.time} \
      {KSLAM} --parse-taxonomy {params.tax_db}/names.dmp {params.tax_db}/nodes.dmp --output-file taxDB | tee {log.log}
    /usr/bin/time -v -a -o {log.time} \
      {KSLAM} --output-file {output.db} --parse-genbank {input.gbff} | tee -a {log.log}
    ''', bench_record=bench_record)
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(METAOTHELLO_SHELL, bench_record=bench_record)
shell:
    '''
    metax metaothello-report --tax-dir {params.db} <({PIGZ} -dc {input}) {output}
    '''
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(METAPHLAN2_SHELL, bench_record=bench_record)
shell:
    '''
    cut -f1 {input} | uniq | wc -l > {output}
    '''
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    else:
        shell('rm {params.query_lca_db} {params.query_db}')
    shell(MMSEQS2_SHELL, bench_record=bench_record)
shell:
    '''
    metax mmseqs-report {input.data} {output} --total-reads $(cat {input.total_reads}) --tax-dir {params.db}
    '''
run:
  shell('''\
  /usr/bin/time -v -o {log.time} \
  {MMSEQS2} createdb {input.faa} {params.dir}/refseqc.db
  ''', bench_record=bench_record)
  # shell('''
  # metax create-mmseqs-taxonomy -o {params.dir}/refseqc.tsv --tax-db {TAXONOMY_DB} --accession2taxid {input.prot} {input.dead_prot} -- {params.dir}/refseqc.db_h
  # ''')
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(MOTUS_SHELL, bench_record=bench_record)
    shell('truncate -s 0 {output}')
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(PATHSEQ_SHELL, bench_record=bench_record)
run:
    if params.is_paired:
        shell('expr "$(samtools view -c -f 0x41 -F 0xf04 {input})" + "$(samtools view -c -f 0x81 -F 0xf04 {input})" > {output}')
    else:
        shell('samtools view -c -F 0xf04 {input} > {output}')
shell:
    '''
    lbzip2 -dc {input} | sed -e '/^[^>]/s/X/N/g' | lbzip2 > {output}
    '''
run:
  if fastq_pairing == 'interleaved':
      shell("""
      echo {input[0]}
      echo {input[1]}
      paste <(paste - - - - < <(lbzcat {input[0]})) \
            <(paste - - - - < <(lbzcat {input[1]})) \
            | tr '\t' '\n' \
            | lbzip2 > {output}
      """)
  elif fastq_pairing == 'interleaved_add_suffix':
      shell(r"""
      echo {input[0]}
      echo {input[1]}
      paste <(paste - - - - < <(awk '{{if (NR % 4 == 1) {{printf "%s/1\n", $1}} \
                                       else if (NR % 4 == 3){{print "+"}} else {{print $0}} }}' <(lbzcat {input[0]}))) \
            <(paste - - - - < <(awk '{{if (NR % 4 == 1) {{printf "%s/2\n", $1}} \
                                       else if (NR % 4 == 3){{print "+"}} else {{print $0}} }}' <(lbzcat {input[1]}))) \
            | tr '\t' '\n' \
            | lbzip2 > {output}
      """)
  elif fastq_pairing == 'cat':
      shell("""
      echo {input[0]}
      echo {input[1]}
      lbzcat {input[0]} {input[1]} | lbzip2 > {output}
      """)
shell:
    """
    echo {input}
    lbzip2 --decompress --force --keep {input[0]}
    # If original file is empty
    touch {output}
    """
shell:
  """
  echo {input}
  lbzip2 --decompress --force --keep {input[0]}
  # If original file is empty
  touch {output}
  """
shell:
    '''
    fastqc -o qc -f fastq <(zcat {input})
    '''
shell:
    '''
    seqtk seq -a {input} > {output}
    '''
shell:
  '''
  seqtk seq -F F {input} > {output}
  '''
shell:
    '''
    /usr/bin/time -v -o {log.time} \
    {params.exe} {params.args} 2>&1 | tee {log}
    rm -f {params.ofastq1} {params.ofastq2}
    '''
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(PROPHYLE_SHELL)
shell:
    '''
    metax prophyle-report{params.paired} --total-reads $(cat {input.total_reads}) --output {output.report} --tax-dir {params.db} {input.data}
    '''
run:
    shell('''
    mkdir -p db/refseqc/build_prophyle

    python -c "from ete3 import ncbi_taxonomy; ncbi_taxonomy.NCBITaxa(taxdump_file='{params.tax_db}/taxdump.tar.gz')"
    {DROPCACHE}
    ''')

    shell('''\
    awk -F "\t" -v OFS="\t" '$12=="Complete Genome" && $11=="latest" {{print $1, $6, $20}}' {input.assembly_summary} > ftpselection.tsv
    cut -f 3 ftpselection.tsv | awk 'BEGIN{{FS=OFS="/";filesuffix="genomic.fna.gz"}} {{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}}' > ftpfilepaths.tsv
    cut -f 1,2 ftpselection.tsv | sed 's/\.[0-9]*//' > acc2taxid.tsv

    prophyle_ncbi_tree.py archaea {NCBI_GENOMES}/refseq archaea.nw acc2taxid.tsv -l archaea.log
    prophyle_ncbi_tree.py bacteria {NCBI_GENOMES}/refseq bacteria.nw acc2taxid.tsv -l bacteria.log
    prophyle_ncbi_tree.py viral {NCBI_GENOMES}/refseq viral.nw acc2taxid.tsv -l viral.log

    ln -s {NCBI_GENOMES}/refseq/archaea .
    ln -s {NCBI_GENOMES}/refseq/bacteria .
    ln -s {NCBI_GENOMES}/refseq/viral .
    /usr/bin/time -v -o {log.time} \
      prophyle index -k 31 archaea.nw bacteria.nw viral.nw refseqc | tee {log.log}
    ''' , bench_record=bench_record)
    shell('mv refseqc ../prophyle')
shell:
  '''
  ncbi-genome-download -F fasta,protein-fasta,genbank archaea,bacteria,viral
  touch {output}
  '''
shell:
  '''
  CWD=$(pwd)
  : > {output}

  cd {NCBI_GENOMES}/refseq

  DOMAINS=(archaea bacteria viral)
  for DOMAIN in ${{DOMAINS[@]}}; do
    pushd $DOMAIN
    awk -F "\t" '$12=="Complete Genome" && $11=="latest" {{print $1}}' assembly_summary.txt > ${{DOMAIN}}_selection.txt

    for i in $(awk -F "\t" '$12=="Complete Genome" {{print $1}}' assembly_summary.txt); do
      cat $i/*_genomic.fna.gz >> $CWD/{output.fna}
      cat $i/*_protein.faa.gz >> $CWD/{output.faa} || true
      {PIGZ} -dc $i/*_genomic.gbff.gz >> $CWD/{output.gbff}
    done
    popd
  done
  '''
shell:
    '''
    {PIGZ} -dk {input}
    '''
shell:
  '''
  export LC_ALL=C
  cat <({PIGZ} -dc {input.prot} | tail -n +2) <({PIGZ} -dc {input.dead_prot} | tail -n +2) \
  | cut -f2,3 | sort -S {resources.mem}G -k1,1 -t$'\t' > {output}
  '''
shell:
  '''
  export LC_ALL=C
  cat <({PIGZ} -dc {input.nucl} | tail -n +2) <({PIGZ} -dc {input.dead_nucl} | tail -n +2) \
  | cut -f2,3,4 | sort -S {resources.mem}G -k1,1 -t$'\t' > {output}
  '''
shell:
    '''
    export LC_ALL=C
    zcat {input} | fasta_formatter -t | cut -f1 | sed 's/ /\t/' | sort -k1,1 -s -S {resources.mem}G -t $'\t' > {output}
    '''
shell:
    '''
    export LC_ALL=C
    join -t$'\t' <(sed 's/ /\t/' {input.headers}) {input.a2t} | cut -f1,3,2 > {output}
    '''
shell:
    '''
    export LC_ALL=C
    join -t$'\t' {input.a2t} <({PIGZ} -dc {input.fna} | fasta_formatter -t | sed 's/ /\t/' | sort -k1,1 -S {resources.mem}G -t $'\t') | \
    awk 'BEGIN {{FS=OFS="\t"}} {{printf(">gi|%s|%s %s\n%s\n", $3, $1, $4, $5)}}' > {output}
    '''
run:
    shell('echo {input}')
    if wildcards.seq in samples_fastq:
        shell('''\
        TOTAL_READS=$(expr $(wc -l <(lbzcat {input}) | cut -f1 -d' ') / 4)
        echo $TOTAL_READS > {output}
        ''')
    elif wildcards.seq in samples_fasta:
        shell('''\
        TOTAL_READS=$(fasta_formatter -i <(lbzcat {input}) -t | wc -l)
        echo $TOTAL_READS > {output}
        ''')
shell:
    '''
    for i in {input}; do
        NAME="$(echo $(basename $i) | sed 's/.total_reads.txt//' )"
        echo -e "$NAME\t$(cat $i)" >> {output}
    done
    '''
shell:
    '''
    cat {input} > {output}
    '''
shell:
    '''
    OUTPUT=$(readlink -f {output})
    pushd db/refseqc
    cd blast
    echo -e "megablast\trefseqc\t$(du -c genomic* taxdb* | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../bracken
    echo -e "bracken\trefseqc\t$(du -c KMER_DISTR.150* database150* | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../centrifuge
    echo -e "centrifuge\trefseqc\t$(du -c refseqc.*.cf | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../clark
    echo -e "clark\trefseqc\t$(du -c custom_0/db_central* | tail -n 1 | cut -f1)" >> "$OUTPUT"
    echo -e "clark_s\trefseqc\t$(du -c custom_0/T** | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../diamond
    echo -e "diamond\trefseqc\t$(du -c * | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../kaiju
    echo -e "kaiju\trefseqc\t$(du -c * | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../karp
    echo -e "karp\trefseqc\t$(du -c karp.tax karp.fasta rrna_karp.index | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../kraken
    echo -e "kraken\trefseqc\t$(du -c database.idx database.kdb taxonomy | tail -n 1 | cut -f1)" >> "$OUTPUT"

    cd ../kraken2
    echo -e "kraken2\trefseqc\t$(du -c *.k2d | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../krakenhll
    echo -e "krakenhll\trefseqc\t$(du -c database.idx database.kdb taxDB | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../kslam
    echo -e "kslam\trefseqc\t$(du -c database taxDB | tail -n 1 | cut -f1)" >> "$OUTPUT"
    # cd ../metaothello
    # echo -e "metaothello\trefseqc\t$(du -c id2tax names output.index | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../mmseqs2
    echo -e "mmseqs2\trefseqc\t$(du -c * | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../prophyle
    echo -e "prophyle\trefseqc\t$(du -c abv | tail -n 1 | cut -f1)" >> "$OUTPUT"
    cd ../taxmaps
    echo -e "taxmaps\trefseqc\t$(du -c refseqc.gem refseqc.len taxonomy.tbl | tail -n 1 | cut -f1)" >> "$OUTPUT"
    popd

    pushd db/gottcha
    echo -e "gottcha\tdefault\t$(du -c GOTTCHA_BACTERIA_c4937_k24_u30_xHUMAN3x.species.* GOTTCHA_VIRUSES_c5900_k24_u30_xHUMAN3x.species.* *.dmp | tail -n 1 | cut -f1)" >> "$OUTPUT"
    popd

    pushd db/metaphlan2
    echo -e "metaphlan2\tdefault\t$(du -c db_v20 | tail -n 1 | cut -f1)" >> "$OUTPUT"
    popd

    pushd ~/efs/miniconda3/envs/metax/share/motus-2.0.1
    echo -e "motus2\tdefault\t$(du -c db_mOTU | tail -n 1 | cut -f1)" >> "$OUTPUT"
    popd

    pushd db/pathseq
    echo -e "pathseq\tdefault\t$(du -c pathseq_microbe* pathseq_taxonomy.db | tail -n 1 | cut -f1)" >> "$OUTPUT"
    popd

    pushd db/metaothello
    echo -e "metaothello\tdefault\t$(du -c 20161108 | tail -n 1 | cut -f1)" >> "$OUTPUT"
    popd
    '''
shell:
    '''
    for i in {input}; do
       NAME="$(echo $(basename $i) | sed 's/.txt//' )"
       echo -e "$NAME\t$(cat $i)" >> {output}
    done
    '''
shell:
    '''
    metax compile --tax-dir {config[TAXONOMY_DB]} --classified-counts {input.classified_counts} --read-counts {input.total_reads} --dir reports --config config/compile_config.yaml --rank-abundances {output.ranksum} {output.reports}
    '''
run:
    if benchmark_i == 0:
        shell('{DROPCACHE}')
    shell(taxmaps_rule, bench_record=bench_record)
run:
    shell('{DROPCACHE}')
    shell('''\
    /usr/bin/time -v -o {log.time}  \
    env PATH={GEMPATH}:$PATH \
    {TAXMAPS_TBL} -n {TAXONOMY_DB}/names.dmp -t {TAXONOMY_DB}/nodes.dmp > {params.dir}/taxonomy.tbl 2>&1 | tee {log.log}
    /usr/bin/time -a -v -o {log.time}  \
    env PATH={GEMPATH}:$PATH \
    {TAXMAPS_INDEX} -i {input.fna} -c {TAXONOMY_DB}/gi_taxid_nucl.dmp -t taxonomy.tbl -p {params.dir}/refseqc | tee -a {log.log}
    ''', bench_record=bench_record)

URL: https://github.com/yesimon/metax_bakeoff_2019
License: MIT