Snakemake workflow for metagenomic classification with Kraken2

public 1yr ago Version: v2.0 0 bookmarks

View Workflow

A Snakemake pipeline wrapper of the Kraken2 short read metagenomic classification software, with additional tools for analysis, plots, compositional data and differential abundance calculation. Designed and maintained by Ben Siranosian in Ami Bhatt's lab at Stanford University.

Introduction

Kraken2 is a short read classification system that is fast and memory efficient. It assigns a taxonomic identification to each sequencing read, by using the lowest common ancestor (LCA) of matching genomes in the database. Using Bracken provides accurate estimates of proportions of different species. This guide will cover some of the basics, but the full Kraken2 manual has much more detail.

Quickstart

Install

If you're in the Bhatt lab, most of this work will take place on the SCG cluster. External users should set this pipeline up on their infrastructure of choice: cloud, HPC, or even a laptop will work for processing small datasets. You will have to download or build a database , set the database options, and create the sample input files . All steps of this pipeline are containerized, meaning only snakemake and singularity are required to run all tools.

If you're in the Bhatt lab, use these instructions to set up snakemake and set up a profile to submit jobs to the SCG cluster. External users should follow these instructions:

Install mambaforge .
Create a fresh environment with snakemake or add it to an existing environment. Activate this environment for any step using this pipeline:

mamba create --name snakemake --channel conda-forge --channel bioconda snakemake
conda activate snakemake

Install singularity .

Then, clone this repo in a convenient location.

git clone https://github.com/bhattlab/kraken2_classification.git
cd kraken2_classification

Run with test data

A Kraken2 database is required to use this pipeline. Pre-built database can be downloaded from Ben Langmead's site . As an example, we download the standard database limited to 8GB memory use, and unpack it into a folder to use with the tests:

cd kraken2_classification/tests
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20230605.tar.gz
mkdir db
tar -C db -xvf k2_standard_08gb_20230605.tar.gz

A small test dataset from Yassour et. al (2018) is included in this repo. 10,000 reads from several timepoints from a mother-infant pair are used. Even with such low coverage, the differences in microbiome composition are apparent in clustering and taxonomic barplots. Launch and end-to-end test run with a command like so:

# Launch this from the kraken2_classification directory
snakemake -s Snakefile --configfile tests/test_config/config_pe.yaml -j1 --use-singularity

The script tests/run_tests.sh ensures basic functionality of the pipeline executes as expected.

Run with real-world data

Copy the config.yaml file into the working directory for your samples. Change the options to suit your project. The main input is the sample_reads_file which defines the mapping from sample names to sequencing reads. See Usage for more detail.

On the Bhatt lab SCG cluster, you can then launch the workflow with a snakemake command like so:

# Snakemake workflow - change options in config.yaml first
snakemake -s path/to/Snakefile --configfile config.yaml --use-singularity --singularity-args '--bind /oak/,/labs/,/home' --profile scg --jobs 99

If you're not in the Bhatt lab, a more general command should be sufficient, but you might need to add singularity bind arguments or a profile for SLURM job submission depending on your configuration. This example uses 8 cores, but that can be changed to reflect available resources.

snakemake -s path/to/Snakefile --configfile config.yaml --use-singularity --jobs 8 --cores 8

After running the workflow and you're satisfied the results, run the cleanup command to remove temporary files that are not needed anymore.

snakemake cleanup -s path/to/Snakefile --configfile config.yaml

Run analysis on existing data

If you have a collection of kraken/bracken reports and just want to run the downstream analysis in this pipeline, you can provide the sample_reports_file in the config, which is a map from sample names to kraken and bracken report files. See tests/test_config/config_downstream_only_bracken.yaml as an example. Then, launch the pipeline with Snakefile_downstream_only . Tune the filtering and job submission parameters to meet your needs.

snakemake -s Snakefile_downstream_only --configfile tests/test_config/config_downstream_only_bracken.yaml -j1 --use-singularity

Parsing output reports

The Kraken reports classification/sample.krak.report , bracken reports sample.krak_bracken.report , and data matrices or GCTx objects in the processed_results folder are the best for downstream analysis. See Downstream processing and plotting for details on using the data in R.

Example output

The pipeline outputs data, results and figures in the structure below.

| classification
- sample.krak Kraken results - classification of each read. These files
can get very large and are unnecessary if you only want the reports.
- sample.krak.report Kraken report - reads and percentages at each taxonomic level.
- sample.krak.report.bracken Standard bracken report at species level. Not useful, use the one below.
- sample.krak_bracken.report Most useful format of the the Bracken results.
| processed_results
- diversity.txt Diversity of each sample at each taxonomic level
| ALDEX2_differential_abuncance Compositional data analysis done with the ALDEx2 package.
Only done if you have 2 groups in the sample_groups file.
- aldex_result_[].tsv Differential abundance at the given taxonomic level.
- aldex_scatter_[].pdf Scatterplot of effect vs dispersion with significant hits highlighted
- aldex_significant_boxplots_[].pdf Boxplot of any significant hits
| braycurtis_matrices
- bravcurtis_distance_[].tsv Matrix of braycurtis distance between samples at each taxonomic level
| plots Lots of plots!
- classified_taxonomy_barplot_[].pdf Barplot at each taxonomic level.
- compositional_PCA_plot.pdf PCA done on clr values
- diversity_allsamples.pdf Diversity barplot
- diversity_by_group.pdf Diversity barplot, stratified by sample group
- PCoA_2D_plot_labels.pdf Principal coordinates analysis, calculated on braycurtis distances
- PCoA_2D_plot_nolabels.pdf Same as above without the point labels
- rarefaction_curve.pdf Rarefaction "collectors" curve plot
| taxonomy_gctx_classified_only
- bracken_[]_reads.gctx GCTx file with matrix containing reads classified at each level
- bracken_[]_percentage.gctx Same, but percentage of classified reads
| taxonomy_matrices_classified_only
- bracken_[]_reads.txt Matrix of taxa by sample classified reads
- bracken_[]_percentage.txt Same, but percentage of classified reads
- clr_values_[].txt From compositional data analysis, centered log-ratio values at each level
| processed_results_krakenonly
| Same as above, but using the results without Bracken. Also includes taxonomy matrices
| that have unclassified reads in them (as bracken no longer reports unclassified reads)
| unmapped reads
- sample_unmapped_1.fq Only present if selected in the config file; reads that are not
- sample_unmapped_2.fq classified, as paired end fastq.

Taxonomic barplot default_barplot

PCoA plot example pcoa_plot

Changelog

2023-06-07

v2.0 (Breaking changes introduced to to configuration files and the ways parameters are used). This set of changes did a bit to modernize the pipeline:

All steps are now available with containerized execution
Created separate pipeline Snakefile_downstream_only which works from a list of report files to only un the downstream analysis steps.
Included a small test dataset and better test execution
Various code and README/manual changes
License file added

2019-09-01

The outputs of this pipeline have been vastly improved! Both internally and saved data now use the GCTx data format, from the CMapR package. Basically, a GCT object is a data matrix that has associated row and column metadata. This allows for consistent metadata to live with the classification data, for both the rows (taxonomy information) and columns (sample metadata). See section 8. GCTx data processing for more information and tools for working with the new implementation.

Also as of this update, the NCBI taxonomy information used by Kraken is filtered and improved some before saving any data or figures. For example, there were previously many taxonomy levels simply labeled "environmental samples" that are now named with their pared taxa name to remove ambiguity. Also, levels without a proper rank designation (listed with an abbreviation and a number in the kraken report) have been forced into a specific rank when nothing was below them. This makes the taxonomy "technically incorrect", but much more practically useful in these cases. Contact me with any questions. The full list of changes is described in Additional considerations

Code Snippets

import os
import pandas as pd
from itertools import compress
import argparse



'''
    parser = argparse.ArgumentParser(description="Convert a kraken2 \
        standard report into the mpa-style  .")
    parser.add_argument('-i',
                        required=True, 
                        action='store',
                        dest='input_report',
                        help='Input kraken2 report.')
    parser.add_argument('-o',
                        required=True, 
                        action='store',
                        dest='output',
                        help='Output mpa style report')
    parser.add_argument('--no_corrections',
                        required=False, 
                        action='store_true',
                        dest='no_corrections',
                        help='Some taxonomy levels \
                        are bugged in the mpa report because \
                        this script doesnt use the full ncbi taxonomy. \
                        Normally these are corrected based on a database \
                        built by comparing reports. This option allows \
                        you to skip the corrections.')
    args = parser.parse_args()
'''
def main():

    taxonomy_levels = ["D","P","C","O","F","G","S"]
    taxonomy_lower = [t.lower() for t in taxonomy_levels]

    report = pd.read_csv(snakemake.input[0], delimiter='\t', header=None)

    # filter to rows at taxonomy_levels
    report_filtered = report.loc[report[3].isin(taxonomy_levels)]
    # filter taxonomy names
    report[5] = [a.strip() for a in report[5]]

    # tax_dict keeps track of where we are in the taxonomy
    tax_dict = {t:"" for t in taxonomy_levels}

    # build report strings
    mpa_strings = []
    for i in range(report.shape[0]):
        # reset tax dict appropriately
        # what an awful way to do this
        if report.iloc[i, 3][0] == "D":
            tax_dict.update({"P":"", "C":"", "O":"", "F":"", "G":"", "S":""})
        elif report.iloc[i, 3][0] == "P":
            tax_dict.update({"C":"", "O":"", "F":"", "G":"", "S":""})
        elif report.iloc[i, 3][0] == "C":
            tax_dict.update({"O":"", "F":"", "G":"", "S":""})
        elif report.iloc[i, 3][0] == "O":
            tax_dict.update({"F":"", "G":"", "S":""})
        elif report.iloc[i, 3][0] == "F":
            tax_dict.update({"G":"", "S":""})
        elif report.iloc[i, 3][0] == "G":
            tax_dict.update({"S":""})
        elif report.iloc[i, 3][0] == "S":
            tax_dict.update({})
        else:
            continue

        if report.iloc[i, 3] in taxonomy_levels:
            tax_dict[report.iloc[i, 3]] = report.iloc[i, 5]
            mpa_strings.append(tax_dict_to_string(tax_dict))

    # correct names if desired
    #if not args.no_corrections:
    mpa_strings = correct_mpa_strings(mpa_strings)

    # build data frame of report
    mpa_data = {"col1":mpa_strings, "col2":list(report_filtered[1])}  
    mpa_df = pd.DataFrame(mpa_data)

    # write out
    mpa_df.to_csv(snakemake.output[0], sep='\t', header=False, index=False)

def tax_dict_to_string(tax_dict):
    taxonomy_levels = ["D","P","C","O","F","G","S"]
    taxonomy_lower = [t.lower() for t in taxonomy_levels]
    level_list = [tax_dict[t] for t in taxonomy_levels if tax_dict[t] != ""]
    add_letters = list(compress(taxonomy_lower, [tax_dict[t] != "" for t in taxonomy_levels]))

    mpa_string = ""
    for letter, level in zip(add_letters, level_list):
        # print(letter)
        # print(level)
        if letter != "d": 
            mpa_string = mpa_string + "|" + letter + "__" + level
        else: 
            mpa_string = mpa_string + letter + "__" + level

    return(mpa_string)

def correct_mpa_strings(mpa_strings):
    # database of strings to correct
    # shouldn't store this in this script...
    correction_dict =  {
    "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Candidatus Hamiltonella|s__secondary endosymbiont of Ctenarytaina eucalypti" : "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|s__secondary endosymbiont of Ctenarytaina eucalypti",
    "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Candidatus Hamiltonella|s__secondary endosymbiont of Heteropsylla cubana" : "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|s__secondary endosymbiont of Heteropsylla cubana",
    "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Ruthia|s__Calyptogena okutanii thioautotrophic gill symbiont" : "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Calyptogena okutanii thioautotrophic gill symbiont",
    "d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Pelagibacterales|f__Pelagibacteraceae|g__Candidatus Fonsibacter" : "d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Pelagibacterales|g__Candidatus Fonsibacter",
    "d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Pelagibacterales|f__Pelagibacteraceae|g__Candidatus Fonsibacter|s__Candidatus Fonsibacter ubiquis" : "d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Pelagibacterales|g__Candidatus Fonsibacter|s__Candidatus Fonsibacter ubiquis",
    "d__Bacteria|p__Proteobacteria|c__Deltaproteobacteria|o__Desulfurellales|f__Candidatus Desulfofervidaceae" : "d__Bacteria|p__Proteobacteria|c__Deltaproteobacteria|f__Candidatus Desulfofervidaceae",
    "d__Bacteria|p__Proteobacteria|c__Deltaproteobacteria|o__Desulfurellales|f__Candidatus Desulfofervidaceae|g__Candidatus Desulfofervidus" : "d__Bacteria|p__Proteobacteria|c__Deltaproteobacteria|f__Candidatus Desulfofervidaceae|g__Candidatus Desulfofervidus",
    "d__Bacteria|p__Proteobacteria|c__Deltaproteobacteria|o__Desulfurellales|f__Candidatus Desulfofervidaceae|g__Candidatus Desulfofervidus|s__Candidatus Desulfofervidus auxilii" : "d__Bacteria|p__Proteobacteria|c__Deltaproteobacteria|f__Candidatus Desulfofervidaceae|g__Candidatus Desulfofervidus|s__Candidatus Desulfofervidus auxilii",
    "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Mogibacterium|s__[Eubacterium] minutum" : "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] minutum",
    "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Mogibacterium|s__[Eubacterium] sulci" : "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] sulci",
    "d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Chroococcidiopsidales" : "d__Bacteria|p__Cyanobacteria|o__Chroococcidiopsidales",
    "d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Chroococcidiopsidales|f__Chroococcidiopsidaceae" : "d__Bacteria|p__Cyanobacteria|o__Chroococcidiopsidales|f__Chroococcidiopsidaceae",
    "d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Chroococcidiopsidales|f__Chroococcidiopsidaceae|g__Chroococcidiopsis" : "d__Bacteria|p__Cyanobacteria|o__Chroococcidiopsidales|f__Chroococcidiopsidaceae|g__Chroococcidiopsis",
    "d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Chroococcidiopsidales|f__Chroococcidiopsidaceae|g__Chroococcidiopsis|s__Chroococcidiopsis thermalis" : "d__Bacteria|p__Cyanobacteria|o__Chroococcidiopsidales|f__Chroococcidiopsidaceae|g__Chroococcidiopsis|s__Chroococcidiopsis thermalis",
    "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|f__Dehalococcoidaceae|g__Dehalogenimonas" : "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas",
    "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|f__Dehalococcoidaceae|g__Dehalogenimonas|s__Dehalogenimonas formicexedens" : "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas formicexedens",
    "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|f__Dehalococcoidaceae|g__Dehalogenimonas|s__Dehalogenimonas sp. WBC-2" : "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas sp. WBC-2",
    "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|f__Dehalococcoidaceae|g__Dehalogenimonas|s__Dehalogenimonas lykanthroporepellens" : "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas lykanthroporepellens",
    "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis",
    "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae",
    "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter",
    "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter ruber" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter ruber",
    "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium",
    "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium RA" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium RA",
    "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus",
    "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus|s__Rhodothermus marinus" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus|s__Rhodothermus marinus",
    "d__Bacteria|p__Candidatus Saccharibacteria|g__Candidatus Saccharimonas|s__Candidatus Saccharibacteria oral taxon TM7x" : "d__Bacteria|p__Candidatus Saccharibacteria|s__Candidatus Saccharibacteria oral taxon TM7x",
    "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Candidatus Regiella|s__secondary endosymbiont of Ctenarytaina eucalypti": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|s__secondary endosymbiont of Ctenarytaina eucalypti",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Candidatus Regiella|s__secondary endosymbiont of Heteropsylla cubana": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|s__secondary endosymbiont of Heteropsylla cubana",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thiodiazotropha|s__Solemya velum gill symbiont": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Solemya velum gill symbiont",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thiodiazotropha|s__endosymbiont of Galathealinum brachiosum": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__endosymbiont of Galathealinum brachiosum",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thiodiazotropha|s__Solemya velesiana gill symbiont": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Solemya velesiana gill symbiont",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thiodiazotropha|s__Bathymodiolus thermophilus thioautotrophic gill symbiont": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Bathymodiolus thermophilus thioautotrophic gill symbiont",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thiodiazotropha|s__enodsymbiont of Escarpia spicata": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__enodsymbiont of Escarpia spicata",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thiodiazotropha|s__Solemya elarraichensis gill symbiont": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Solemya elarraichensis gill symbiont",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Ruthia|s__Bathymodiolus azoricus thioautotrophic gill symbiont": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Bathymodiolus azoricus thioautotrophic gill symbiont",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Ruthia|s__Solemya pervernicosa gill symbiont": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Solemya pervernicosa gill symbiont",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Ruthia|s__endosymbiont of Ridgeia piscesae": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__endosymbiont of Ridgeia piscesae",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Ruthia|s__Bathymodiolus septemdierum thioautotrophic gill symbiont": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Bathymodiolus septemdierum thioautotrophic gill symbiont",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__endosymbiont of Tevnia jerichonana": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__endosymbiont of Tevnia jerichonana",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__endosymbiont of Riftia pachyptila": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__endosymbiont of Riftia pachyptila",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__endosymbiont of Seepiophila jonesi": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__endosymbiont of Seepiophila jonesi",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__endosymbiont of Lamellibrachia luymesi": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__endosymbiont of Lamellibrachia luymesi",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC13013_P4": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC13013_P4",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC13017_P7": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC13017_P7",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC14_002_19_P1": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC14_002_19_P1",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC004_P11": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC004_P11",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC13016_P6": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC13016_P6",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC006_P13": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC006_P13",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC13018_P8": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC13018_P8",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC005_P12": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC005_P12",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC003_P10": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC003_P10",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC001_P9": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC001_P9",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC13015_P5": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC13015_P5",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Sedimenticola": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Sedimenticola",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Sedimenticola|s__Sedimenticola selenatireducens": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Sedimenticola|s__Sedimenticola selenatireducens",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Sedimenticola|s__Sedimenticola thiotaurini": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Sedimenticola|s__Sedimenticola thiotaurini",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Sedimenticola|s__Sedimenticola sp.": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Sedimenticola|s__Sedimenticola sp.",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Wohlfahrtiimonas": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Wohlfahrtiimonas",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas chitiniclastica": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas chitiniclastica",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas populi": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas populi",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas larvae": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas larvae",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas sp. G9077": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas sp. G9077",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Gallaecimonas": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Gallaecimonas",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Gallaecimonas|s__Gallaecimonas pentaromativorans": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Gallaecimonas|s__Gallaecimonas pentaromativorans",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Gallaecimonas|s__Gallaecimonas sp. HK-28": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Gallaecimonas|s__Gallaecimonas sp. HK-28",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Gallaecimonas|s__Gallaecimonas xiamenensis": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Gallaecimonas|s__Gallaecimonas xiamenensis",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus singularis": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus singularis",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus perditus": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus perditus",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus autotrophicus": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus autotrophicus",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. MED-G25": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. MED-G25",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. REDSEA-S14_B12": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. REDSEA-S14_B12",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__uncultured SUP05 cluster bacterium": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__uncultured SUP05 cluster bacterium",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. REDSEA-S12_B1": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. REDSEA-S12_B1",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. REDSEA-S03_B1": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. REDSEA-S03_B1",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp.": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp.",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. TMED218": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. TMED218",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. MED-G23": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. MED-G23",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Ignatzschineria": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Ignatzschineria",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Ignatzschineria|s__Ignatzschineria cameli": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Ignatzschineria|s__Ignatzschineria cameli",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Ignatzschineria|s__Ignatzschineria larvae": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Ignatzschineria|s__Ignatzschineria larvae",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Ignatzschineria|s__Ignatzschineria ureiclastica": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Ignatzschineria|s__Ignatzschineria ureiclastica",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Ignatzschineria|s__Ignatzschineria indica": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Ignatzschineria|s__Ignatzschineria indica",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Ignatzschineria|s__Ignatzschineria sp. F8392": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Ignatzschineria|s__Ignatzschineria sp. F8392",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Pseudohongiella": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Pseudohongiella",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Pseudohongiella|s__Pseudohongiella acticola": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Pseudohongiella|s__Pseudohongiella acticola",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Pseudohongiella|s__Pseudohongiella spirulinae": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Pseudohongiella|s__Pseudohongiella spirulinae",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Pseudohongiella|s__Pseudohongiella nitratireducens": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Pseudohongiella|s__Pseudohongiella nitratireducens",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Pseudohongiella|s__Pseudohongiella sp.": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Pseudohongiella|s__Pseudohongiella sp.",
"d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Rhizobiales|f__Salinarimonadaceae|g__Salinarimonas|s__Salinarimonadaceae bacterium HL-109": "d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Rhizobiales|f__Salinarimonadaceae|s__Salinarimonadaceae bacterium HL-109",
"d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|g__Candidatus Micropelagos|s__PS1 clade bacterium": "d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|s__PS1 clade bacterium",
"d__Bacteria|p__Proteobacteria|c__Zetaproteobacteria|o__Mariprofundales|f__Mariprofundaceae|g__Ghiorsea": "d__Bacteria|p__Proteobacteria|c__Zetaproteobacteria|g__Ghiorsea",
"d__Bacteria|p__Proteobacteria|c__Zetaproteobacteria|o__Mariprofundales|f__Mariprofundaceae|g__Ghiorsea|s__Ghiorsea bivora": "d__Bacteria|p__Proteobacteria|c__Zetaproteobacteria|g__Ghiorsea|s__Ghiorsea bivora",
"d__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Microbacteriaceae|g__Rhodoluna|s__Microbacteriaceae bacterium MWH-Ta3": "d__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Microbacteriaceae|s__Microbacteriaceae bacterium MWH-Ta3",
"d__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|g__Candidatus Carbobacillus|s__[Flavobacterium] thermophilum": "d__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|s__[Flavobacterium] thermophilum",
"d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|g__Monoglobus|s__[Bacteroides] pectinophilus": "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|s__[Bacteroides] pectinophilus",
"d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Anaerovorax|s__[Eubacterium] infirmum": "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] infirmum",
"d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Mobilibacterium|s__[Eubacterium] minutum": "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] minutum",
"d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Mobilibacterium|s__[Eubacterium] sulci": "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] sulci",
"d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Mobilibacterium|s__[Eubacterium] nodatum": "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] nodatum",
"d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Mobilibacterium|s__[Eubacterium] brachy": "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] brachy",
"d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Ileibacterium|s__[Eubacterium] saphenum": "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] saphenum",
"d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|f__Tissierellaceae|g__Anaerosalibacter|s__[Clostridium] ultunense": "d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|f__Tissierellaceae|s__[Clostridium] ultunense",
"d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|f__Gottschalkiaceae|g__Sporanaerobacter": "d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|g__Sporanaerobacter",
"d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|f__Gottschalkiaceae|g__Sporanaerobacter|s__Sporanaerobacter acetigenes": "d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|g__Sporanaerobacter|s__Sporanaerobacter acetigenes",
"d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|f__Gottschalkiaceae|g__Sporanaerobacter|s__Sporanaerobacter sp. PP17-6a": "d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|g__Sporanaerobacter|s__Sporanaerobacter sp. PP17-6a",
"d__Bacteria|p__Firmicutes|c__Tissierellia|g__Sedimentibacter|s__[Bacteroides] coagulans": "d__Bacteria|p__Firmicutes|c__Tissierellia|s__[Bacteroides] coagulans",
"d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Gloeoemargaritales": "d__Bacteria|p__Cyanobacteria|o__Gloeoemargaritales",
"d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Gloeoemargaritales|f__Gloeomargaritaceae": "d__Bacteria|p__Cyanobacteria|o__Gloeoemargaritales|f__Gloeomargaritaceae",
"d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Gloeoemargaritales|f__Gloeomargaritaceae|g__Gloeomargarita": "d__Bacteria|p__Cyanobacteria|o__Gloeoemargaritales|f__Gloeomargaritaceae|g__Gloeomargarita",
"d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Gloeoemargaritales|f__Gloeomargaritaceae|g__Gloeomargarita|s__Gloeomargarita lithophora": "d__Bacteria|p__Cyanobacteria|o__Gloeoemargaritales|f__Gloeomargaritaceae|g__Gloeomargarita|s__Gloeomargarita lithophora",
"d__Bacteria|p__Candidatus Margulisbacteria|c__Candidatus Sericytochromatia": "d__Bacteria|c__Candidatus Sericytochromatia",
"d__Bacteria|p__Candidatus Margulisbacteria|c__Candidatus Sericytochromatia|s__Candidatus Sericytochromatia bacterium GL2-53 LSPB_72": "d__Bacteria|c__Candidatus Sericytochromatia|s__Candidatus Sericytochromatia bacterium GL2-53 LSPB_72",
"d__Bacteria|p__Candidatus Margulisbacteria|c__Candidatus Sericytochromatia|s__Candidatus Sericytochromatia bacterium S15B-MN24 CBMW_12": "d__Bacteria|c__Candidatus Sericytochromatia|s__Candidatus Sericytochromatia bacterium S15B-MN24 CBMW_12",
"d__Bacteria|p__Candidatus Margulisbacteria|c__Candidatus Sericytochromatia|s__Candidatus Sericytochromatia bacterium S15B-MN24 RAAC_196": "d__Bacteria|c__Candidatus Sericytochromatia|s__Candidatus Sericytochromatia bacterium S15B-MN24 RAAC_196",
"d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|g__Dehalogenimonas": "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas",
"d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|g__Dehalogenimonas|s__Dehalogenimonas sp. GP": "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas sp. GP",
"d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|g__Dehalogenimonas|s__Dehalogenimonas alkenigignens": "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas alkenigignens",
"d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|g__Dehalogenimonas|s__Dehalogenimonas formicexedens": "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas formicexedens",
"d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|g__Dehalogenimonas|s__Dehalogenimonas sp. WBC-2": "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas sp. WBC-2",
"d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|g__Dehalogenimonas|s__Dehalogenimonas lykanthroporepellens": "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas lykanthroporepellens",
"d__Bacteria|p__Chloroflexi|c__Thermomicrobia|g__Thermorudis|s__Thermomicrobia bacterium": "d__Bacteria|p__Chloroflexi|c__Thermomicrobia|s__Thermomicrobia bacterium",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter ruber": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter ruber",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter altiplanensis": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter altiplanensis",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter sp. 10B": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter sp. 10B",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter sp. J07SB67": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter sp. J07SB67",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium bin80": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium bin80",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium RA": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium RA",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium Tc-Br11_B2g6_7": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium Tc-Br11_B2g6_7",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium TMED105": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium TMED105",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus|s__Rhodothermus marinus": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus|s__Rhodothermus marinus",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus|s__Rhodothermus profundi": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus|s__Rhodothermus profundi",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longibacter": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longibacter",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longibacter|s__Longibacter salinarum": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longibacter|s__Longibacter salinarum",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longimonas": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longimonas",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longimonas|s__Longimonas halophila": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longimonas|s__Longimonas halophila",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta|s__Salisaeta longa": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta|s__Salisaeta longa",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta|s__Bacteroidetes bacterium UBA7368": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|s__Bacteroidetes bacterium UBA7368",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta|s__Bacteroidetes bacterium UBA2364": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|s__Bacteroidetes bacterium UBA2364",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta|s__Bacteroidetes bacterium UBA2024": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|s__Bacteroidetes bacterium UBA2024",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta|s__Bacteroidetes bacterium UBA2412": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|s__Bacteroidetes bacterium UBA2412",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta|s__Bacteroidetes bacterium UBA5586": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|s__Bacteroidetes bacterium UBA5586",
"d__Bacteria|p__Candidatus Kryptonia|g__Candidatus Aegiribacteria": "d__Bacteria|g__Candidatus Aegiribacteria",
"d__Bacteria|p__Candidatus Kryptonia|g__Candidatus Aegiribacteria|s__Candidatus Aegiribacteria bacterium MLS_C": "d__Bacteria|g__Candidatus Aegiribacteria|s__Candidatus Aegiribacteria bacterium MLS_C",
"d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|o__Chthoniobacterales|g__Terrimicrobium": "d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|g__Terrimicrobium",
"d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|o__Chthoniobacterales|g__Terrimicrobium|s__Terrimicrobium sacchariphilum": "d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|g__Terrimicrobium|s__Terrimicrobium sacchariphilum",
"d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|o__Chthoniobacterales|g__Candidatus Xiphinematobacter": "d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|g__Candidatus Xiphinematobacter",
"d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|o__Chthoniobacterales|g__Candidatus Xiphinematobacter|s__Candidatus Xiphinematobacter sp. Idaho Grape": "d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|g__Candidatus Xiphinematobacter|s__Candidatus Xiphinematobacter sp. Idaho Grape",
"d__Bacteria|p__Candidatus Saccharibacteria|g__Candidatus Saccharimonas|s__candidate division TM7 genomosp. GTL1": "d__Bacteria|p__Candidatus Saccharibacteria|s__candidate division TM7 genomosp. GTL1",
"d__Bacteria|p__Candidatus Riflebacteria|c__Candidatus Ozemobacteria|f__Candidatus Ozemobacteraceae|g__Candidatus Ozemobacter|s__Candidatus Riflebacteria bacterium HGW-Riflebacteria-2": "d__Bacteria|p__Candidatus Riflebacteria|s__Candidatus Riflebacteria bacterium HGW-Riflebacteria-2",
"d__Bacteria|p__Candidatus Riflebacteria|c__Candidatus Ozemobacteria|f__Candidatus Ozemobacteraceae|g__Candidatus Ozemobacter|s__Candidatus Riflebacteria bacterium GWC2_50_8": "d__Bacteria|p__Candidatus Riflebacteria|s__Candidatus Riflebacteria bacterium GWC2_50_8",
"d__Bacteria|p__Candidatus Riflebacteria|c__Candidatus Ozemobacteria|f__Candidatus Ozemobacteraceae|g__Candidatus Ozemobacter|s__Candidatus Riflebacteria bacterium HGW-Riflebacteria-1": "d__Bacteria|p__Candidatus Riflebacteria|s__Candidatus Riflebacteria bacterium HGW-Riflebacteria-1",
"d__Bacteria|p__Candidatus Riflebacteria|c__Candidatus Ozemobacteria|f__Candidatus Ozemobacteraceae|g__Candidatus Ozemobacter|s__Candidatus Riflebacteria bacterium": "d__Bacteria|p__Candidatus Riflebacteria|s__Candidatus Riflebacteria bacterium",
"d__Bacteria|p__Candidatus Riflebacteria|c__Candidatus Ozemobacteria|f__Candidatus Ozemobacteraceae|g__Candidatus Ozemobacter|s__Candidatus Riflebacteria bacterium RBG_13_59_9": "d__Bacteria|p__Candidatus Riflebacteria|s__Candidatus Riflebacteria bacterium RBG_13_59_9",
"d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|g__Candidatus Syntrophoarchaeum|s__Methanosarcinales archaeon UBA261": "d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|s__Methanosarcinales archaeon UBA261",
"d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|g__Candidatus Syntrophoarchaeum|s__Methanosarcinales archaeon UBA203": "d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|s__Methanosarcinales archaeon UBA203",
"d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|g__Candidatus Syntrophoarchaeum|s__Methanosarcinales archaeon Methan_03": "d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|s__Methanosarcinales archaeon Methan_03",
"d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|g__Candidatus Syntrophoarchaeum|s__Methanosarcinales archaeon Methan_02": "d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|s__Methanosarcinales archaeon Methan_02",
"d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|g__Candidatus Syntrophoarchaeum|s__Methanosarcinales archaeon Methan_01": "d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|s__Methanosarcinales archaeon Methan_01",
"d__Archaea|p__Thaumarchaeota|c__Nitrososphaeria|o__Cenarchaeales": "d__Archaea|p__Thaumarchaeota|o__Cenarchaeales",
"d__Archaea|p__Thaumarchaeota|c__Nitrososphaeria|o__Cenarchaeales|f__Cenarchaeaceae": "d__Archaea|p__Thaumarchaeota|o__Cenarchaeales|f__Cenarchaeaceae",
"d__Archaea|p__Thaumarchaeota|c__Nitrososphaeria|o__Cenarchaeales|f__Cenarchaeaceae|g__Cenarchaeum": "d__Archaea|p__Thaumarchaeota|o__Cenarchaeales|f__Cenarchaeaceae|g__Cenarchaeum",
"d__Archaea|p__Thaumarchaeota|c__Nitrososphaeria|o__Cenarchaeales|f__Cenarchaeaceae|g__Cenarchaeum|s__Cenarchaeum symbiosum": "d__Archaea|p__Thaumarchaeota|o__Cenarchaeales|f__Cenarchaeaceae|g__Cenarchaeum|s__Cenarchaeum symbiosum",
"d__Archaea|p__Candidatus Korarchaeota|g__Candidatus Korarchaeum|s__Candidatus Korarchaeota archaeon NZ13-K": "d__Archaea|p__Candidatus Korarchaeota|s__Candidatus Korarchaeota archaeon NZ13-K",
    }

    mpa_strings_corrected = [correction_dict[s] if s in correction_dict.keys() else s for s in mpa_strings]
    return(mpa_strings_corrected)

if __name__ == '__main__':
    main()

Python Pandas kraken2 From line 1 of scripts/convert_report_mpa_style.py

suppressMessages(library(ggplot2, quietly = TRUE, warn.conflicts = FALSE))
suppressMessages(library(rafalib, quietly = TRUE, warn.conflicts = FALSE))
suppressMessages(library(vegan, quietly = TRUE, warn.conflicts = FALSE))
suppressMessages(library(reshape2, quietly = TRUE, warn.conflicts = FALSE))
suppressMessages(library(RColorBrewer, quietly = TRUE, warn.conflicts = FALSE))
suppressMessages(library(cmapR, quietly = TRUE, warn.conflicts = FALSE))
suppressMessages(library(compositions, quietly = TRUE, warn.conflicts = FALSE))
suppressMessages(library(zCompositions, quietly = TRUE, warn.conflicts = FALSE))
suppressMessages(library(ALDEx2, quietly = TRUE, warn.conflicts = FALSE))
suppressMessages(library(ggpubr, quietly = TRUE, warn.conflicts = FALSE))
options(stringsAsFactors = F)

# options we need from snakemake
sample_reads_file <- snakemake@params[['sample_reads_file']]
sample_reports_file <- snakemake@params[['sample_reports_file']]
sample_groups_file <- snakemake@params[['sample_groups_file']]
workflow.outdir <- snakemake@params[['workflow_outdir']]
result.dir <- snakemake@params[['result_dir']]
use.bracken.report <- snakemake@params[['use_bracken_report']]
remove.chordata <- as.logical(snakemake@params[['remove_chordata']])
if (length(remove.chordata)==0){
    remove.chordata <- F
}

# Use locations in the working directory for scripts and taxonomy array
scripts.folder <- file.path(workflow.outdir, "scripts")
tax.array.file <- file.path(workflow.outdir, "taxonomy_array.tsv")

# Downstream processing for filtering OTUs
## This removes any classification result where all samples are below this threshold,
## which serves to remove the very long tail of lowly abundant species in Kraken2/Bracken results.
## This parameter can be tuned, but I recommend you keep something here to reduce the long tail.
min_otu_percentage <- as.numeric(snakemake@config[['min_otu_percentage']])
# Filtering for compositional data analysis that gets applied for each taxonomic level
## Keep only samples with > codata_min_reads (default 5000)
codata_min_reads <- as.numeric(snakemake@config[['codata_min_reads']])
## Keep only OTUs with an FRACTION of at least 0.001 by default
## This is equivalent to a PERCENTAGE of 0.1
codata_min_prop = as.numeric(snakemake@config[['codata_min_prop']])
## Keep OTUs that are found in at least codata_otu_cutoff fraction of samples 
## Default is 0.3 or 30% of samples
codata_otu_cutoff = as.numeric(snakemake@config[['codata_otu_cutoff']])

# Set up directories and create those that don't exist
classification.folder <- file.path(workflow.outdir, 'classification')
outfolder.matrices.taxonomy <- file.path(result.dir, 'taxonomy_matrices')
outfolder.matrices.taxonomy.classified <- file.path(result.dir, 'taxonomy_matrices_classified_only')
outfolder.gctx.taxonomy <- file.path(result.dir, 'taxonomy_gctx')
outfolder.gctx.taxonomy.classified <- file.path(result.dir, 'taxonomy_gctx_classified_only')
outfolder.matrices.bray <- file.path(result.dir, 'braycurtis_matrices')
outfolder.plots <- file.path(result.dir, 'plots')
outfolder.aldex <- file.path(result.dir, 'ALDEx2_differential_abundance')
for (f in c(result.dir, outfolder.matrices.bray, outfolder.matrices.taxonomy.classified,
            outfolder.gctx.taxonomy.classified, outfolder.plots, outfolder.aldex)){
    if (!dir.exists(f)){ dir.create(f, recursive = T)}
}
# dont create dirs we dont need from bracken
if (!(use.bracken.report)){
    for (f in c(outfolder.matrices.taxonomy, outfolder.gctx.taxonomy)){
        if (!dir.exists(f)){ dir.create(f, recursive = T)}
    }
}
# Load other data processing and plotting scripts
source.script.process <- file.path(scripts.folder, 'process_classification_gctx.R')
source.script.plot <- file.path(scripts.folder, 'plotting_classification.R')
source.script.codaseq <- file.path(scripts.folder, 'CoDaSeq_functions.R')
if (!(file.exists(source.script.process) & file.exists(source.script.plot) & file.exists(source.script.codaseq))) {
    stop('processing scripts do not exist at any of the attempted directories. Exiting. ')
}
suppressMessages(source(source.script.process))
suppressMessages(source(source.script.plot))
suppressMessages(source(source.script.codaseq))

# read from sample_reads_file or sample_reports_file as input
if(sample_reads_file != ""){
    sample.reads <- read.table(sample_reads_file, sep='\t', quote='', header=F, comment.char = "#", colClasses = 'character')
    colnames(sample.reads) <- c('sample', 'r1', 'r2')[1:ncol(sample.reads)]
    # ensure we skip first row if it's a comment
    if ((tolower(sample.reads[1,1]) == 'sample') | (substr(sample.reads[1,1], 1,1) == "#")){
        sample.reads <- sample.reads[2:nrow(sample.reads), ]
    }
    # create dataframe of output files that would be in sample.reports
    sample.reports <- data.frame(sample = sample.reads$sample, 
                                 kraken_report = sapply(sample.reads$sample, function(x) file.path(classification.folder, paste(x, '.krak.report', sep=''))),
                                 bracken_report = sapply(sample.reads$sample, function(x) file.path(classification.folder, paste(x, '.krak_bracken_species.report', sep=''))))                    
} else if(sample_reports_file != ""){
    sample.reports <- read.table(sample_reports_file, sep='\t', quote='', header=F, comment.char = "#", colClasses = 'character')
    colnames(sample.reports) <- c('sample', 'kraken_report', 'bracken_report')[1:ncol(sample.reports)]
    # ensure we skip first row if it's a comment
    if ((tolower(sample.reports[1,1]) == 'sample') | (substr(sample.reports[1,1], 1,1) == "#")){
        sample.reports <- sample.reports[2:nrow(sample.reports), ]
    }
}
sample.names <- sample.reports$sample
sample.number <- nrow(sample.reports)

# Define files for loading in based on sample.
if(use.bracken.report){
    flist <- sample.reports[,3]
} else {
    flist <- sample.reports[,2]
}
names(flist) <- sample.names

# Ensure all expected results files exist
if (!(all(file.exists(flist)))){
    # print which files don't exist
    message('The follwing files do not exist:')
    print(flist[!sapply(flist, file.exists)])
    stop("Some classification files do not exist!")
}

# if a groups file is specified, read it. Otherwise assign everything to one group
if (sample_groups_file != '') {
    sample.groups <- read.table(sample_groups_file, sep='\t', quote='', header=F, comment.char = "#", colClasses = 'character')
    colnames(sample.groups) <- c('sample', 'group')
    # if first sample is sample, discard the line
    if(tolower(sample.groups[1,'sample']) == 'sample'){
        sample.groups <- sample.groups[2:nrow(sample.groups),]
    }
} else {
    sample.groups <- data.frame(sample=sample.names, group='All')
}

# get taxonomy array
# classification at each level for each entry in the database.
# simplified after processing krakens default taxonomy
tax.array <- read.table(tax.array.file, sep='\t', quote='', header=F, comment.char = '', colClasses = 'character')
colnames(tax.array) <- c('id', 'taxid', 'root', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species', 'subspecies')[1:ncol(tax.array)]
# Bug in generation code gave muliple zero taxids. Eliminate that here now
tax.array <- tax.array[!duplicated(tax.array$taxid),]
rownames(tax.array) <- tax.array$taxid
# some duplicate names in this. if so, change them to include the tax level
dup.ids <- tax.array$id[duplicated(tax.array$id)]
dup.inds <- which(tax.array$id %in% dup.ids)
# fix these by adding taxid to the name
tax.array[tax.array$id %in% dup.ids, "id"] <-
    paste(tax.array[tax.array$id %in% dup.ids, "id"], ' (', tax.array[tax.array$id %in% dup.ids, "taxid"], ')', sep='')

# ensure reads and groups have the same data
if (!(all(sample.groups$sample %in% sample.names) & all(sample.names %in% sample.groups$sample))){
    message('sample.reads:')
    print(sample.reads)
    message('sample.groups:')
    print(sample.groups)
    message('sample.groups$sample[!(sample.groups$sample %in% sample.names)]')
    print(sample.groups$sample[!(sample.groups$sample %in% sample.names)])
    message('sample.names[!(sample.names %in% sample.groups$sample)]')
    print(sample.names[!(sample.names %in% sample.groups$sample)])
    stop('Sample reads and sample groups dont contain the same samples... check inputs')
}

# special case for UHGG and MAG databases.
## If reading from UHGG, the taxonomy levels go R, R1-R7
test.df <- kraken_file_to_df(flist[1])
## UHGG database will have this structure
uhgg <- all(paste0('R', 1:7) %in% test.df$tax.level)
# MAG database will have this structure
## Number of genus classifications is zero
segata <- !uhgg & sum(sum(test.df$tax.level=='G')) == 0

# load classification results from each sample and process into gct format
message(paste('Loading data from ', length(flist), ' kraken/bracken results.', sep=''))
df.list <- lapply(flist, function(x) kraken_file_to_df(x))
# merge classification matrix
merge.mat <- merge_kraken_df_list(df.list)
# sample metadata just has groups for now
sample.metadata <- data.frame(id=sample.groups$sample, group=sample.groups$group)
# construct gct object from reads matrix, sample metadata, row metadata
kgct <- make_gct_from_kraken(merge.mat, sample.metadata, tax.array)

# remove Chordata reads if desired
if(remove.chordata){
    kgct <- subset_gct(kgct, rid=kgct@rid[kgct@rdesc$phylum != "Chordata"])
}
# filter to each of the taxonomy levels
if (segata){
    filter.levels <- c('species')
    kgct.filtered.list <- list(species=subset_gct(kgct, rid=kgct@rid[kgct@rid != 'root']))
} else {
    filter.levels <-  c('kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species')
    kgct.filtered.list <- lapply(filter.levels, function(level) {
        subset_kgct_to_level(kgct, level)
        })
    names(kgct.filtered.list) <- filter.levels
}
# print("kgct.filtered.list[['species']]@mat")
# print(kgct.filtered.list[['species']]@mat)

# subset to classified taxa only
unclassified.rownames <- c('unclassified', 'classified at a higher level')
kgct.filtered.classified.list <- lapply(kgct.filtered.list, function(x) subset_gct(x, rid=x@rid[!(x@rid %in% unclassified.rownames)]))

# normalize to percentages
kgct.filtered.percentage.list <- lapply(kgct.filtered.list, function(x) normalize_kgct(x, min_otu_percentage=min_otu_percentage))
kgct.filtered.classified.percentage.list <- lapply(kgct.filtered.classified.list, function(x) normalize_kgct(x, min_otu_percentage=min_otu_percentage))

message('Saving classification matrices...')
# save matrices and gctx objects
if (use.bracken.report){mat.name <- 'bracken'} else {mat.name <- 'kraken'}
for (tn in filter.levels){
    outf.mat.reads <- file.path(outfolder.matrices.taxonomy, paste(mat.name, tolower(tn), 'reads.txt', sep='_'))
    outf.mat.percentage <- file.path(outfolder.matrices.taxonomy, paste(mat.name, tolower(tn), 'percentage.txt', sep='_'))
    outf.mat.reads.classified <- file.path(outfolder.matrices.taxonomy.classified, paste(mat.name, tolower(tn), 'reads.txt', sep='_'))
    outf.mat.percentage.classified <- file.path(outfolder.matrices.taxonomy.classified, paste(mat.name, tolower(tn), 'percentage.txt', sep='_'))
    # gctx names
    outf.gctx.reads <- file.path(outfolder.gctx.taxonomy, paste(mat.name, tolower(tn), 'reads.gctx', sep='_'))
    outf.gctx.percentage <- file.path(outfolder.gctx.taxonomy, paste(mat.name, tolower(tn), 'percentage.gctx', sep='_'))
    outf.gctx.reads.classified <- file.path(outfolder.gctx.taxonomy.classified, paste(mat.name, tolower(tn), 'reads.gctx', sep='_'))
    outf.gctx.percentage.classified <- file.path(outfolder.gctx.taxonomy.classified, paste(mat.name, tolower(tn), 'percentage.gctx', sep='_'))

    # save matrices
    if (!use.bracken.report){
        write.table(kgct.filtered.list[[tn]]@mat, outf.mat.reads, sep='\t', quote=F, row.names = T, col.names = T)
    }
    write.table(kgct.filtered.classified.list[[tn]]@mat, outf.mat.reads.classified, sep='\t', quote=F, row.names = T, col.names = T)

    mat.percentage <- kgct.filtered.percentage.list[[tn]]@mat
    # order rows by mean abundance across samples
    row.order <- rownames(mat.percentage)[order(rowMeans(mat.percentage), decreasing = T)]
    row.order <- row.order[!(row.order %in% unclassified.rownames)]
    row.order <- c(unclassified.rownames[unclassified.rownames %in% rownames(mat.percentage)], row.order)
    mat.percentage <- mat.percentage[row.order,]
    # classified only matrix
    mat.percentage.classified <- kgct.filtered.classified.percentage.list[[tn]]@mat
    row.order.classified <- order(rowMeans(mat.percentage.classified), decreasing = T)
    mat.percentage.classified <- mat.percentage.classified[row.order.classified,]

    if (!use.bracken.report){
        write.table(mat.percentage, outf.mat.percentage, sep='\t', quote=F, row.names = T, col.names = T)
    }
    write.table(mat.percentage.classified, outf.mat.percentage.classified, sep='\t', quote=F, row.names = T, col.names = T)

    # save gctx 
    # only save with unclassified if not using Bracken
    if (!use.bracken.report){
        suppressMessages(write_gctx(kgct.filtered.list[[tn]], outf.gctx.reads, appenddim = F))
        suppressMessages(write_gctx(kgct.filtered.percentage.list[[tn]], outf.gctx.percentage, appenddim = F))
    }
    suppressMessages(write_gctx(kgct.filtered.classified.list[[tn]], outf.gctx.reads.classified, appenddim = F))
    suppressMessages(write_gctx(kgct.filtered.classified.percentage.list[[tn]], outf.gctx.percentage.classified, appenddim = F))
}


#################################################################################
## Diversity calculation and plots ##############################################
#################################################################################
message('Doing diversity calculations and saving figures...')

# DISABLE rarefaction curve as it's useless with so many species
# Wait actually people want it so Im going to re-enable it
plot_rarefaction_curve(kgct.filtered.classified.list$species@mat,
                       file.path(outfolder.plots, 'rarefaction_curve.pdf'))

# diversity calculations
div.methods <- c('shannon', 'simpson')
# calculate for each tax level
div.level.method <- lapply(filter.levels, function(x) {
    use.matrix <- kgct.filtered.classified.list[[x]]@mat
    # if not enough valid rows, skip it
    if (nrow(use.matrix) <3){
        dl <- lapply(div.methods, function(x) rep(0, times=ncol(use.matrix)))
    } else {
        dl <- lapply(div.methods, function(y) diversity(use.matrix, index=y, MARGIN = 2))
    }
    names(dl) <- div.methods
    dl
})
names(div.level.method) <- filter.levels

# coerce forcefully into a dataframe
# there's definitely a better way to do this...
div.df <- as.data.frame(div.level.method)
rownames(div.df) <- sample.names
div.df <- cbind(data.frame(sample=rownames(div.df)), div.df)
div.df <- melt(div.df, id.vars = 'sample')
div.df$tax.level <- sapply(div.df$variable, function(x) strsplit(as.character(x), split="\\.")[[1]][1])
div.df$method <- sapply(div.df$variable, function(x) strsplit(as.character(x), split="\\.")[[1]][2])
div.df <- div.df[,c('sample', 'tax.level','method','value')]
# round to a sensible number
div.df$value <- round(div.df$value, 3)
# write out dataframe
out.div <- file.path(result.dir, 'diversity.txt')
write.table(div.df, out.div, sep='\t', quote=F, row.names=F, col.names=T)

# barplot of diversity under different methods
# one with everything, one separated by sample group
div.df$sample <- factor(div.df$sample, levels = sample.groups$sample)
div.plot.list <- list()
for (g in unique(sample.groups$group)){
    plot.samples <- sample.groups[sample.groups$group==g, "sample"]
    plot.df <- div.df[div.df$sample %in% plot.samples, ]
    plot.df$sample <- factor(plot.df$sample, levels = sample.groups$sample)
    p <- ggplot(plot.df, aes(x=sample, y=value)) +
        geom_bar(stat='identity') +
        facet_grid(rows = vars(tax.level), cols= vars(method), scales = 'fixed') +
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
        labs(title=paste('Diversity for group:', g),
             y='Diversity', x='Sample')
    div.plot.list[[g]] <- p
}

# save one page for each group
div.group.pdf <- file.path(outfolder.plots, 'diversity_by_group.pdf')
pdf(div.group.pdf, height=8, width = 10)
for(p in div.plot.list){print(p)}
trash <- dev.off()

# also a big figure with one page for each method, all samples
div.plot.all.list <- list()
for (m in unique(div.df$method)){
    plot.df <- div.df[div.df$method == m, ]
    p <- ggplot(plot.df, aes(x=sample, y=value)) +
            geom_bar(stat='identity') +
            facet_grid(rows = vars(tax.level), cols= vars(method), scales = 'fixed') +
            theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
            labs(title=paste('Diversity for all samples:', m),
                 y='Diversity', x='Sample')
    div.plot.all.list[[m]] <- p
}

# width of plot = 9 + quarter inch for each sample over 10
max.samps <- max(table(sample.groups$group))
pdf.width <- max(9, (9 + (0.25 * (max.samps-10))))
div.all.pdf <- file.path(outfolder.plots, 'diversity_allsamples.pdf')
pdf(div.all.pdf, height=5, width = pdf.width)
for(p in div.plot.all.list){print(p)}
trash <- dev.off()

#################################################################################
## Taxonomic Barplots ###########################################################
#################################################################################
message('Generating taxonomic barplots...')
# page in the pdf for each sample group
taxlevel.plots <- list()
taxlevel.plots.classified <- list()
for (tn in filter.levels){
    group.plots <- list()
    group.plots.classified <- list()
    for (g in unique(sample.groups$group)){
        plot.samples <- sample.groups[sample.groups$group==g, "sample"]
        div.df.sub <- div.df[div.df$tax.level==tn & div.df$method=='shannon' & div.df$sample %in% plot.samples,]
        div.df.plot <- div.df.sub[, c('sample', 'value')]
        rownames(div.df.plot) <- div.df.plot$sample
        plot.title <- paste('Taxonomy and diversity: ', g, ', ', tn, sep='')
        plot.mat <- kgct.filtered.percentage.list[[tn]]@mat[,plot.samples, drop=FALSE]

        # no longer doing include unclassifed with Bracken report
        # because new version of report doesn't include this data
        if (!use.bracken.report){
        group.plots[[g]] <- plot_many_samples_with_diversity_barplot(plot.mat,
                            div.df.plot, plot.title = plot.title, include.unclassified=T, tax.level.name = tn)
        }
        group.plots.classified[[g]] <- plot_many_samples_with_diversity_barplot(plot.mat,
                            div.df.plot, plot.title = plot.title, include.unclassified=F, tax.level.name = tn)
    }
    taxlevel.plots[[tn]] <- group.plots
    taxlevel.plots.classified[[tn]] <- group.plots.classified
}

# save pdfs
# width of plot = 9 + quarter inch for each sample over 10
max.samps <- max(table(sample.groups$group))
pdf.width <- max(9, (9 + (0.25 * (max.samps-10))))
for (tn in filter.levels){
    if (!use.bracken.report){
        tax.pdf <- file.path(outfolder.plots, paste('taxonomy_barplot_', tolower(tn), '.pdf', sep=''))
        pdf(tax.pdf, height=6, width=pdf.width)
        for (p in taxlevel.plots[[tn]]){print(p)}
        trash <- dev.off()
    }
    tax.pdf.classified <- file.path(outfolder.plots, paste('classified_taxonomy_barplot_', tolower(tn), '.pdf', sep=''))
    pdf(tax.pdf.classified, height=6, width=pdf.width)
    for (p in taxlevel.plots.classified[[tn]]){print(p)}
    trash <- dev.off()
}

#################################################################################
## PCoA plots ###################################################################
#################################################################################
message('Doing PCoA calculations...')
# only do this if we have >=3 samples
if (sample.number >=3){
    # do for each tax level
    plotlist.nolabels <- list()
    plotlist.labels <- list()
    # don't do for kingdom
    do.tn <- filter.levels[filter.levels!='kingdom']
    for (tn in do.tn){
        print(paste('    for:', tn))
        fraction.mat <- kgct.filtered.classified.percentage.list[[tn]]@mat
        if(nrow(fraction.mat) > 2){
            pcoa.res <- capscale(t(fraction.mat)~1, distance='bray')
            pcoa.df <- data.frame(sample.groups, scores(pcoa.res)$sites)
            pcoa.df.melt <- melt(pcoa.df, id.vars = c('sample','group'))
            pcoa.variance <- summary(pcoa.res)$cont$importance[2,1:2]
            # set colors based on number of groups
            ng <- length(unique(sample.groups$group))
            if (ng <10){cols <- brewer.pal(max(ng,3), 'Set1')} else {cols <- colorRampPalette(brewer.pal(9,'Set1'))(ng)}

            plotlist.nolabels[[tn]] <- ggplot(pcoa.df, aes(x=MDS1, y=MDS2, color=group, label=sample)) +
                geom_point(size=3) +
                # scale_color_brewer(palette='Set1') +
                scale_color_manual(values = cols) +
                theme_bw() +
                labs(title=paste('Microbiome beta diversity, Bray-Curtis, ', tn, sep=''),
                     subtitle='Principal Coordinates Analysis plot',
                     x = paste('PC1 (', round(pcoa.variance[1], 3) * 100, '% of variance)', sep=''),
                     y = paste('PC2 (', round(pcoa.variance[2], 3) * 100, '% of variance)', sep='')) +
                xlim(c(min(pcoa.df$MDS1), max(pcoa.df$MDS1) * 1.25)) +
                ylim(c(min(pcoa.df$MDS2), max(pcoa.df$MDS2) * 1.15)) +
                guides(color=guide_legend(title='Group'))

            # add a version with labels above the points
            nudge.x <- sum(abs(range(pcoa.df$MDS1))) / 12
            nudge.y <- sum(abs(range(pcoa.df$MDS2))) / 30
            plotlist.labels[[tn]] <- plotlist.nolabels[[tn]] + geom_text(nudge_x = nudge.x, nudge_y = nudge.y)
        } else{
            plotlist.nolabels[[tn]] <- plot(1,1)
            plotlist.labels[[tn]] <- plot(1,1)
        }
    }

    # save plot
    pcoa.pdf.nolabels <- file.path(outfolder.plots, 'PCoA_2D_plot_nolabels.pdf')
    pdf(pcoa.pdf.nolabels, height=6.5, width=9)
    for (tn in rev(do.tn)){
        print(plotlist.nolabels[[tn]])
    }
    trash <- dev.off()
    pcoa.pdf.labels <- file.path(outfolder.plots, 'PCoA_2D_plot_labels.pdf')
    pdf(pcoa.pdf.labels, height=6.5, width=9)
    for (tn in rev(do.tn)){
        print(plotlist.labels[[tn]])
    }
    trash <- dev.off()
} else {
    # warn the user and make fake plots
    warning('Less than 3 samples, not doing PCoA plots')
    pcoa.pdf.nolabels <- file.path(outfolder.plots, 'PCoA_2D_plot_nolabels.pdf')
    pdf(pcoa.pdf.nolabels, height=6.5, width=9)
    plot(0,0, col='white')
    text(0,0, 'Not enough samples for PCoA plot', col='firebrick')
    trash <- dev.off()
    pcoa.pdf.labels <- file.path(outfolder.plots, 'PCoA_2D_plot_labels.pdf')
    pdf(pcoa.pdf.labels, height=6.5, width=9)
    plot(0,0, col='white')
    text(0,0, 'Not enough samples for PCoA plot', col='firebrick')
    trash <- dev.off()
}

# simple bray curtis distance metrics
for (tn in filter.levels){
    bray.dist <- as.matrix(vegdist(t(kgct.filtered.classified.percentage.list[[tn]]@mat)))
    out.bray <- file.path(outfolder.matrices.bray, paste('braycurtis_distance_', tolower(tn), '.txt', sep=''))
    write.table(bray.dist, out.bray, sep='\t', quote=F, row.names = T, col.names = T)
}

#################################################################################
## Compositional data analysis and plots ########################################
#################################################################################
if (length(unique(sample.groups$group)) != 2){
    warning('Will not do ALDEx2 differential abundance with !=2 groups')}

if (nrow(sample.groups) < 3){
    warning('Will not do compositional data analysis with < 3 samples')
} else {
    pca.plot.list <- list()
    if(segata){
        do.tax.levels <- c('species')
    } else {
        do.tax.levels <- c('kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species')
    }

    print('Compositional data analysis....')
    for (tax.level in do.tax.levels[!(do.tax.levels %in% c('kingdom'))]){
        print(paste('.....', tax.level))
        use.mat <- kgct.filtered.classified.list[[tax.level]]@mat
        if(nrow(use.mat) > 2){
            reads.filtered <- tryCatch(codaSeq.filter(use.mat,
                min.reads=codata_min_reads, min.occurrence=codata_otu_cutoff, min.prop=codata_min_prop, samples.by.row=FALSE),
            error=function(e) {
                warning(e)
                matrix(0)
                })
            } else {
                reads.filtered <- matrix(0)
            }
        if(nrow(reads.filtered) > 2){
            # replace 0 values with an estimate of the probability that the zero is not 0
            # only if necessary
            # at this point samples are by row, and variables are by column
            # we can use the GBM or CZM methods. Function from zCompositions.
            if (sum(reads.filtered==0)>0){
                rfz <- t(cmultRepl(t(reads.filtered),  label=0, method="CZM", output="p-counts"))
                # round to integers and round anything low to 1
                rfz <- round(rfz)
                rfz[rfz==0] <- 1
            } else {
                rfz <- reads.filtered
            }
            # convert to CLR
            clr.aldex <- aldex.clr(rfz, conds=rep(NA, ncol(rfz)), mc.samples=128, denom = 'all', verbose = F)
            # get clr from mean of the MC instances
            rfz.clr <- t(sapply(clr.aldex@analysisData , function(x) rowMeans(x)))
            # save CLR matrix
            outf <- file.path(outfolder.matrices.taxonomy.classified, paste('clr_values_', tax.level, '.tsv', sep=''))
            write.table(round(rfz.clr, 4), outf, sep='\t', quote=F, row.names = T, col.names = T)

            rfz.clr.pca <- prcomp(rfz.clr)
            rfz.clr.mvar <- mvar(rfz.clr)

            # PCA biplot
            plot.df <- data.frame(rfz.clr.pca$x[,1:2], group=sample.groups[sample.groups$sample %in% rownames(rfz.clr), 'group'])
            pc1.var <- round(sum(rfz.clr.pca$sdev[1]^2)/rfz.clr.mvar * 100, 1)
            pc2.var <- round(sum(rfz.clr.pca$sdev[2]^2)/rfz.clr.mvar * 100, 1)
            pca.plot <- ggplot(plot.df, aes(x=PC1, y=PC2, col=group)) +
                geom_point(size=3) +
                geom_text(label=rownames(plot.df),
                          nudge_x = sum(abs(range(plot.df$PC1)))/ 50,
                          nudge_y = sum(abs(range(plot.df$PC2)))/ 50) +
                theme_bw() +
                labs(x=paste("PC1: ", pc1.var, "% of variace", sep=""),
                     y=paste("PC2: ", pc2.var, "% of variace", sep=""),
                     title=paste('Compositional PCA,', tax.level)) +
                scale_color_brewer(palette='Set1')
            pca.plot.list[[tax.level]] <- pca.plot


            # need to limit to two groups for aldex
            if (length(unique(sample.groups$group)) == 2){
                # get two groups
                group.pos <- unique(sample.groups$group)[1]
                group.neg <- unique(sample.groups$group)[2]
                s.pos <- sample.groups$sample[sample.groups$group==group.pos]
                s.neg <- sample.groups$sample[sample.groups$group==group.neg]

                # aldex differential expression between groups
                aldex.res <- aldex(rfz[,c(s.pos, s.neg)],
                    conditions = c(rep(group.pos, length(s.pos)), rep(group.neg, length(s.neg))),
                    mc.samples=128, test="t", effect=TRUE, include.sample.summary=FALSE,
                    denom="all", verbose=FALSE)

                # scatterplot
                outf.scatter <- file.path(outfolder.aldex, paste('aldex_scatter_', tax.level, '.pdf', sep=''))
                pdf(outf.scatter, width = 7, height = 4)
                mypar(1,2, mar = c(3.5, 3.5, 2.5, 1.1))
                aldex.plot(aldex.res, type="MW", test="wilcox", xlab="Dispersion",ylab="Difference")
                mtext(paste(tax.level, ', wilcox', sep=''), line=0.25, cex=1.25)
                aldex.plot(aldex.res, type="MW", test="welch", xlab="Dispersion",ylab="Difference")
                mtext(paste(tax.level, ', welch', sep=''), line=0.25, cex=1.25)
                dev.off()

                # improve dataframe
                aldex.res <- signif(aldex.res, 4)
                # add sample numbers
                aldex.res$n.pos <- length(s.pos)
                aldex.res$n.neg <- length(s.neg)
                # add in name
                aldex.res$taxa <- rownames(aldex.res)
                # reorder
                aldex.res <- aldex.res[, c('taxa', 'n.pos', 'n.neg', colnames(aldex.res)[1:11])]
                aldex.res <- aldex.res[order(aldex.res$we.eBH),]
                # add abs effect
                aldex.res$abs.effect <- abs(aldex.res$effect)

                # save dataframe of differential results
                outf.res <- file.path(outfolder.aldex, paste('aldex_result_', tax.level, '.tsv', sep=''))
                write.table(aldex.res, outf.res, sep='\t', quote=F, row.names = F, col.names = T)

                # make boxplots
                # make plots from everything with this value or less
                plot.thresh <- 0.1
                plot.thresh.col <- 'we.eBH'
                aldex.res.plot <- aldex.res[aldex.res[, plot.thresh.col] <= plot.thresh, ]
                outf.boxplot <- file.path(outfolder.aldex, paste('aldex_significant_boxplots_', tax.level, '.pdf', sep=''))
                if (nrow(aldex.res.plot) >0 ){
                    # make clr values
                    m.prop <-  apply(rfz, 2, function(x) x/sum(x))
                    clr.aldex <- aldex.clr(rfz[,c(s.pos, s.neg)],
                        conds = c(rep(group.pos, length(s.pos)), rep(group.neg, length(s.neg))),
                        mc.samples=128, denom = 'all', verbose = F)
                    # get clr from mean of the MC instances
                    rfz.clr <- sapply(clr.aldex@analysisData , function(x) rowMeans(x))

                    # save all significant in one boxplot
                    pdf(outf.boxplot, height=5, width = 6, onefile=TRUE)
                    toplot <- min(nrow(aldex.res.plot), 50)
                    for ( i in 1:toplot){
                        g <- aldex.res.plot[i, 'taxa']
                        test.df <- data.frame(group = c(rep(group.pos, length(s.pos)), rep(group.neg, length(s.neg))),
                          sample = c(s.pos, s.neg),
                          proportion = c(m.prop[g, s.pos], m.prop[g, s.neg]),
                          clr = c(rfz.clr[g, s.pos], rfz.clr[g, s.neg]))


                        g1 <- ggplot(test.df, aes(x=group, y=proportion, fill=group)) +
                        geom_boxplot() +
                        theme_bw() +
                        labs(subtitle=g) +
                        ylim(c(0,max(test.df$proportion)*1.2)) +
                        stat_compare_means(method = 'wilcox') +
                        stat_compare_means(method = 't.test', label.y.npc = 0.95) + guides(fill=FALSE)

                        g2 <- ggplot(test.df, aes(x=group, y=clr, fill=group)) +
                        geom_boxplot() +
                        theme_bw() +
                        labs(subtitle=g) +
                        ylim(c(min(test.df$clr), max(test.df$clr)*1.2)) +
                        stat_compare_means(method = 'wilcox') +
                        stat_compare_means(method = 't.test', label.y.npc = 0.95)

                        print(ggarrange(g1,g2,widths = c(1,1), common.legend = T))
                    }
                    dev.off()
                }
            }
        }
    }

    # save pca plots from each level
    pdf(file.path(outfolder.plots, 'compositional_PCA_plot.pdf'), height=6.5, width=9)
    for (tn in rev(do.tax.levels)){
        print(pca.plot.list[[tn]])
    }
    trash <- dev.off()
}



message('Done! :D')

R Snakemake ggplot2 RColorBrewer ggpubr reshape2 bracken Kraken vegan ALDEx2 cmapR From line 6 of scripts/post_classification_workflow.R

shell: """
    python {params.improve_taxonomy_script} {params.db}
"""

SnakeMake From line 34 of master/Snakefile

shell: """
cp -r {params.scriptdir} {outdir}
cp {input.tax_array} {outdir}
"""

SnakeMake From line 48 of master/Snakefile

shell: """
    time kraken2 --db {params.db} --threads {threads} --output {output.krak} \
    --report {output.krak_report} {params.paired_string} {input.reads} \
    --confidence {params.confidence_threshold} --use-names
    """

SnakeMake kraken2 From line 69 of master/Snakefile

shell: """
    bracken -d {params.db} -i {input.krak_report} -o {params.outspec} -r {params.readlen} \
    -l {params.level} -t {params.threshold}
    """

SnakeMake bracken From line 93 of master/Snakefile

script:
    'scripts/post_classification_workflow.R'

SnakeMake From line 116 of master/Snakefile

script:
    'scripts/post_classification_workflow.R'

SnakeMake From line 135 of master/Snakefile

shell: """
rm -rf {params.workflow_outdir}/scripts
rm -f {params.workflow_outdir}/taxonomy_array.tsv
touch {output}
"""

SnakeMake From line 147 of master/Snakefile

shell: """
    ktImportTaxonomy -m 3 -s 0 -q 0 -t 5 -i {input} -o {output} \
    -tax $(which kraken2 | sed 's/envs\/classification2.*$//g')/envs/classification2/bin/taxonomy
    """

SnakeMake kraken2 Krona From line 156 of master/Snakefile

shell: """
    awk '$1=="U" {{ print }}' {input.krak} | cut -f 2 > {params.tempfile}
    filterbyname.sh in={input.r1} in2={input.r2} names={params.tempfile} include=true out={output.r1} out2={output.r2}
    rm {params.tempfile}
"""

SnakeMake From line 177 of master/Snakefile

shell: """
    awk '$1=="U" {{ print }}' {input.krak} | cut -f 2 > {params.tempfile}
    filterbyname.sh in={input.r1} names={params.tempfile} include=true out={output.r1}
    # rm {params.tempfile}
"""

SnakeMake From line 193 of master/Snakefile

shell: """
    rm -f {params.rmdir_1}/*.krak
    touch {output}
"""

SnakeMake From line 208 of master/Snakefile

script:
    "scripts/convert_report_mpa_style.py"

SnakeMake From line 221 of master/Snakefile

shell:
    """
    sum=$(grep -vP "\\|" {input} | cut -f 2 | awk '{{sum += $1}} END {{printf ("%.2f\\n", sum/100)}}')
    awk -v sum="$sum" 'BEGIN {{FS="\\t"}} {{OFS="\\t"}} {{print $1,$2/sum}}' {input} > {output}
    """

SnakeMake From line 230 of master/Snakefile

shell:
    """
    source activate biobakery2
    merge_metaphlan_tables.py {input} >  {output.merge}
    grep -E "(s__)|(^ID)"  {output.merge} | grep -v "t__" | sed 's/^.*s__//g' >  {output.merge_species}
    """

SnakeMake From line 242 of master/Snakefile

shell:
    """
    source activate biobakery2
    metaphlan_hclust_heatmap.py --in {input} --top 25 --minv 0.1 -s log --out {output.heatmap1} -f braycurtis -d braycurtis -c viridis
    metaphlan_hclust_heatmap.py --in {input} --top 150 --minv 0.1 -s log --out {output.heatmap2} -f braycurtis -d braycurtis -c viridis
    """

SnakeMake From line 255 of master/Snakefile

    shell:
        """
        kraken-biom outputs/*_bracken.report -o {output}
        """
'''

SnakeMake kraken-biom From line 268 of master/Snakefile

ShowHide 14 more snippets with no or duplicated tags.

Comments

Support

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Created: 1yr ago

Updated: 1yr ago

Maitainers: public

URL: https://github.com/bhattlab/kraken2_classification

Name: kraken2_classification

Version: v2.0

Badge:

Insert copied code into your website to add a link to this workflow.

License: None

Keywords:

Database ID Taxonomic classification Plot bracken kraken-biom cmapR ALDEx2 Kraken kraken2 Krona Pandas Snakemake ggplot2 ggpubr RColorBrewer reshape2 vegan metagenomics

Future updates

Related Workflows

psychip_snakemake — Show Details View Workflow

ENCODE pipeline for histone marks developed for the psychENCODE project

public

psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project. The o...

raw sequence reads Alignment Sequence alignment report macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

Near-real time tracking of SARS-CoV-2 in Connecticut

public

Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

JSON nextclade Augur Biopython FOCUS Pandas Snakemake bs4 epiweeks geopy matplotlib numpy pycountry pycountry-convert uszipcode

Free

cellranger-snakemake-gke — Show Details View Workflow

snakemake workflow to run cellranger on a given bucket using gke.

public

A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

ATLAS - Three commands to start analyzing your metagenome data

public

Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...

raw sequence reads Genome assembly Annotation track checkm2 gunc prodigal snakemake-wrapper-utils MEGAHIT Atlas BBMap Biopython BioRuby Bwa-mem2 cd-hit CheckM DAS Diamond eggNOG-mapper v2 MetaBAT 2 Minimap2 MMseqs MultiQC Pandas Picard pyfastx SAMtools SemiBin Snakemake SPAdes SqueezeMeta TADpole VAMB CONCOCT ete3 gtdbtk h5py networkx numpy plotly psutil utils metagenomics

Free

175

rna-seq-star-deseq2 — Show Details View Workflow

RNA-seq workflow using STAR and DESeq2

public

This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...

Free

dna-seq-gatk-variant-calling — Show Details View Workflow

This Snakemake pipeline implements the GATK best-practices workflow

public

This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...

VCF raw sequence reads Variant calling genetic variants gatk rust-bio-tools snakemake-wrapper-utils tabix BCFtools BWA FastQC MultiQC Pandas Picard SAMtools Snakemake Trimmomatic Variant Effect Predictor (VEP) common matplotlib numpy seaborn DNA

Free

Snakemake workflow for metagenomic classification with Kraken2

Introduction

Table of contents

Quickstart

Install

Run with test data

Run with real-world data

Run analysis on existing data

Parsing output reports

Example output

Changelog

2023-06-07

2019-09-01

Code Snippets

Comments

Support

Free

Related Workflows

public

public

public

public

public

public