Snakemake workflow for metagenomic classification with Kraken2

public public 1yr ago Version: v2.0 0 bookmarks

A Snakemake pipeline wrapper of the Kraken2 short read metagenomic classification software, with additional tools for analysis, plots, compositional data and differential abundance calculation. Designed and maintained by Ben Siranosian in Ami Bhatt's lab at Stanford University.

Introduction

Kraken2 is a short read classification system that is fast and memory efficient. It assigns a taxonomic identification to each sequencing read, by using the lowest common ancestor (LCA) of matching genomes in the database. Using Bracken provides accurate estimates of proportions of different species. This guide will cover some of the basics, but the full Kraken2 manual has much more detail.

Table of contents

  1. Installation

  2. Usage

  3. Available databases

  4. Downstream processing and plotting

  5. Additional considerations

  6. Expanded database construction

  7. Using metagenome assembled genomes as a database

  8. GCTx data parsing

Quickstart

Install

If you're in the Bhatt lab, most of this work will take place on the SCG cluster. External users should set this pipeline up on their infrastructure of choice: cloud, HPC, or even a laptop will work for processing small datasets. You will have to download or build a database , set the database options, and create the sample input files . All steps of this pipeline are containerized, meaning only snakemake and singularity are required to run all tools.

If you're in the Bhatt lab, use these instructions to set up snakemake and set up a profile to submit jobs to the SCG cluster. External users should follow these instructions:

  1. Install mambaforge .

  2. Create a fresh environment with snakemake or add it to an existing environment. Activate this environment for any step using this pipeline:

mamba create --name snakemake --channel conda-forge --channel bioconda snakemake
conda activate snakemake
  1. Install singularity .

Then, clone this repo in a convenient location.

git clone https://github.com/bhattlab/kraken2_classification.git
cd kraken2_classification

Run with test data

A Kraken2 database is required to use this pipeline. Pre-built database can be downloaded from Ben Langmead's site . As an example, we download the standard database limited to 8GB memory use, and unpack it into a folder to use with the tests:

cd kraken2_classification/tests
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20230605.tar.gz
mkdir db
tar -C db -xvf k2_standard_08gb_20230605.tar.gz

A small test dataset from Yassour et. al (2018) is included in this repo. 10,000 reads from several timepoints from a mother-infant pair are used. Even with such low coverage, the differences in microbiome composition are apparent in clustering and taxonomic barplots. Launch and end-to-end test run with a command like so:

# Launch this from the kraken2_classification directory
snakemake -s Snakefile --configfile tests/test_config/config_pe.yaml -j1 --use-singularity

The script tests/run_tests.sh ensures basic functionality of the pipeline executes as expected.

Run with real-world data

Copy the config.yaml file into the working directory for your samples. Change the options to suit your project. The main input is the sample_reads_file which defines the mapping from sample names to sequencing reads. See Usage for more detail.

On the Bhatt lab SCG cluster, you can then launch the workflow with a snakemake command like so:

# Snakemake workflow - change options in config.yaml first
snakemake -s path/to/Snakefile --configfile config.yaml --use-singularity --singularity-args '--bind /oak/,/labs/,/home' --profile scg --jobs 99

If you're not in the Bhatt lab, a more general command should be sufficient, but you might need to add singularity bind arguments or a profile for SLURM job submission depending on your configuration. This example uses 8 cores, but that can be changed to reflect available resources.

snakemake -s path/to/Snakefile --configfile config.yaml --use-singularity --jobs 8 --cores 8

After running the workflow and you're satisfied the results, run the cleanup command to remove temporary files that are not needed anymore.

snakemake cleanup -s path/to/Snakefile --configfile config.yaml

Run analysis on existing data

If you have a collection of kraken/bracken reports and just want to run the downstream analysis in this pipeline, you can provide the sample_reports_file in the config, which is a map from sample names to kraken and bracken report files. See tests/test_config/config_downstream_only_bracken.yaml as an example. Then, launch the pipeline with Snakefile_downstream_only . Tune the filtering and job submission parameters to meet your needs.

snakemake -s Snakefile_downstream_only --configfile tests/test_config/config_downstream_only_bracken.yaml -j1 --use-singularity

Parsing output reports

The Kraken reports classification/sample.krak.report , bracken reports sample.krak_bracken.report , and data matrices or GCTx objects in the processed_results folder are the best for downstream analysis. See Downstream processing and plotting for details on using the data in R.

Example output

The pipeline outputs data, results and figures in the structure below.

 | classification 
 - sample.krak Kraken results - classification of each read. These files
 can get very large and are unnecessary if you only want the reports. 
 - sample.krak.report Kraken report - reads and percentages at each taxonomic level.
 - sample.krak.report.bracken Standard bracken report at species level. Not useful, use the one below.
 - sample.krak_bracken.report Most useful format of the the Bracken results.
 | processed_results
 - diversity.txt Diversity of each sample at each taxonomic level
 | ALDEX2_differential_abuncance Compositional data analysis done with the ALDEx2 package.
 Only done if you have 2 groups in the sample_groups file.
 - aldex_result_[].tsv Differential abundance at the given taxonomic level. 
 - aldex_scatter_[].pdf Scatterplot of effect vs dispersion with significant hits highlighted
 - aldex_significant_boxplots_[].pdf Boxplot of any significant hits
 | braycurtis_matrices
 - bravcurtis_distance_[].tsv Matrix of braycurtis distance between samples at each taxonomic level
 | plots Lots of plots!
 - classified_taxonomy_barplot_[].pdf Barplot at each taxonomic level.
 - compositional_PCA_plot.pdf PCA done on clr values
 - diversity_allsamples.pdf Diversity barplot
 - diversity_by_group.pdf Diversity barplot, stratified by sample group
 - PCoA_2D_plot_labels.pdf Principal coordinates analysis, calculated on braycurtis distances
 - PCoA_2D_plot_nolabels.pdf Same as above without the point labels
 - rarefaction_curve.pdf Rarefaction "collectors" curve plot
 | taxonomy_gctx_classified_only 
 - bracken_[]_reads.gctx GCTx file with matrix containing reads classified at each level 
 - bracken_[]_percentage.gctx Same, but percentage of classified reads
 | taxonomy_matrices_classified_only
 - bracken_[]_reads.txt Matrix of taxa by sample classified reads
 - bracken_[]_percentage.txt Same, but percentage of classified reads
 - clr_values_[].txt From compositional data analysis, centered log-ratio values at each level
 | processed_results_krakenonly
 | Same as above, but using the results without Bracken. Also includes taxonomy matrices
 | that have unclassified reads in them (as bracken no longer reports unclassified reads)
 | unmapped reads
 - sample_unmapped_1.fq Only present if selected in the config file; reads that are not 
 - sample_unmapped_2.fq classified, as paired end fastq.

Taxonomic barplot default_barplot

PCoA plot example pcoa_plot

Changelog

2023-06-07

v2.0 (Breaking changes introduced to to configuration files and the ways parameters are used). This set of changes did a bit to modernize the pipeline:

  • All steps are now available with containerized execution

  • Created separate pipeline Snakefile_downstream_only which works from a list of report files to only un the downstream analysis steps.

  • Included a small test dataset and better test execution

  • Various code and README/manual changes

  • License file added

2019-09-01

The outputs of this pipeline have been vastly improved! Both internally and saved data now use the GCTx data format, from the CMapR package. Basically, a GCT object is a data matrix that has associated row and column metadata. This allows for consistent metadata to live with the classification data, for both the rows (taxonomy information) and columns (sample metadata). See section 8. GCTx data processing for more information and tools for working with the new implementation.

Also as of this update, the NCBI taxonomy information used by Kraken is filtered and improved some before saving any data or figures. For example, there were previously many taxonomy levels simply labeled "environmental samples" that are now named with their pared taxa name to remove ambiguity. Also, levels without a proper rank designation (listed with an abbreviation and a number in the kraken report) have been forced into a specific rank when nothing was below them. This makes the taxonomy "technically incorrect", but much more practically useful in these cases. Contact me with any questions. The full list of changes is described in Additional considerations

Code Snippets

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
import os
import pandas as pd
from itertools import compress
import argparse



'''
    parser = argparse.ArgumentParser(description="Convert a kraken2 \
        standard report into the mpa-style  .")
    parser.add_argument('-i',
                        required=True, 
                        action='store',
                        dest='input_report',
                        help='Input kraken2 report.')
    parser.add_argument('-o',
                        required=True, 
                        action='store',
                        dest='output',
                        help='Output mpa style report')
    parser.add_argument('--no_corrections',
                        required=False, 
                        action='store_true',
                        dest='no_corrections',
                        help='Some taxonomy levels \
                        are bugged in the mpa report because \
                        this script doesnt use the full ncbi taxonomy. \
                        Normally these are corrected based on a database \
                        built by comparing reports. This option allows \
                        you to skip the corrections.')
    args = parser.parse_args()
'''
def main():

    taxonomy_levels = ["D","P","C","O","F","G","S"]
    taxonomy_lower = [t.lower() for t in taxonomy_levels]

    report = pd.read_csv(snakemake.input[0], delimiter='\t', header=None)

    # filter to rows at taxonomy_levels
    report_filtered = report.loc[report[3].isin(taxonomy_levels)]
    # filter taxonomy names
    report[5] = [a.strip() for a in report[5]]

    # tax_dict keeps track of where we are in the taxonomy
    tax_dict = {t:"" for t in taxonomy_levels}

    # build report strings
    mpa_strings = []
    for i in range(report.shape[0]):
        # reset tax dict appropriately
        # what an awful way to do this
        if report.iloc[i, 3][0] == "D":
            tax_dict.update({"P":"", "C":"", "O":"", "F":"", "G":"", "S":""})
        elif report.iloc[i, 3][0] == "P":
            tax_dict.update({"C":"", "O":"", "F":"", "G":"", "S":""})
        elif report.iloc[i, 3][0] == "C":
            tax_dict.update({"O":"", "F":"", "G":"", "S":""})
        elif report.iloc[i, 3][0] == "O":
            tax_dict.update({"F":"", "G":"", "S":""})
        elif report.iloc[i, 3][0] == "F":
            tax_dict.update({"G":"", "S":""})
        elif report.iloc[i, 3][0] == "G":
            tax_dict.update({"S":""})
        elif report.iloc[i, 3][0] == "S":
            tax_dict.update({})
        else:
            continue

        if report.iloc[i, 3] in taxonomy_levels:
            tax_dict[report.iloc[i, 3]] = report.iloc[i, 5]
            mpa_strings.append(tax_dict_to_string(tax_dict))

    # correct names if desired
    #if not args.no_corrections:
    mpa_strings = correct_mpa_strings(mpa_strings)

    # build data frame of report
    mpa_data = {"col1":mpa_strings, "col2":list(report_filtered[1])}  
    mpa_df = pd.DataFrame(mpa_data)

    # write out
    mpa_df.to_csv(snakemake.output[0], sep='\t', header=False, index=False)

def tax_dict_to_string(tax_dict):
    taxonomy_levels = ["D","P","C","O","F","G","S"]
    taxonomy_lower = [t.lower() for t in taxonomy_levels]
    level_list = [tax_dict[t] for t in taxonomy_levels if tax_dict[t] != ""]
    add_letters = list(compress(taxonomy_lower, [tax_dict[t] != "" for t in taxonomy_levels]))

    mpa_string = ""
    for letter, level in zip(add_letters, level_list):
        # print(letter)
        # print(level)
        if letter != "d": 
            mpa_string = mpa_string + "|" + letter + "__" + level
        else: 
            mpa_string = mpa_string + letter + "__" + level

    return(mpa_string)

def correct_mpa_strings(mpa_strings):
    # database of strings to correct
    # shouldn't store this in this script...
    correction_dict =  {
    "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Candidatus Hamiltonella|s__secondary endosymbiont of Ctenarytaina eucalypti" : "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|s__secondary endosymbiont of Ctenarytaina eucalypti",
    "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Candidatus Hamiltonella|s__secondary endosymbiont of Heteropsylla cubana" : "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|s__secondary endosymbiont of Heteropsylla cubana",
    "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Ruthia|s__Calyptogena okutanii thioautotrophic gill symbiont" : "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Calyptogena okutanii thioautotrophic gill symbiont",
    "d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Pelagibacterales|f__Pelagibacteraceae|g__Candidatus Fonsibacter" : "d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Pelagibacterales|g__Candidatus Fonsibacter",
    "d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Pelagibacterales|f__Pelagibacteraceae|g__Candidatus Fonsibacter|s__Candidatus Fonsibacter ubiquis" : "d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Pelagibacterales|g__Candidatus Fonsibacter|s__Candidatus Fonsibacter ubiquis",
    "d__Bacteria|p__Proteobacteria|c__Deltaproteobacteria|o__Desulfurellales|f__Candidatus Desulfofervidaceae" : "d__Bacteria|p__Proteobacteria|c__Deltaproteobacteria|f__Candidatus Desulfofervidaceae",
    "d__Bacteria|p__Proteobacteria|c__Deltaproteobacteria|o__Desulfurellales|f__Candidatus Desulfofervidaceae|g__Candidatus Desulfofervidus" : "d__Bacteria|p__Proteobacteria|c__Deltaproteobacteria|f__Candidatus Desulfofervidaceae|g__Candidatus Desulfofervidus",
    "d__Bacteria|p__Proteobacteria|c__Deltaproteobacteria|o__Desulfurellales|f__Candidatus Desulfofervidaceae|g__Candidatus Desulfofervidus|s__Candidatus Desulfofervidus auxilii" : "d__Bacteria|p__Proteobacteria|c__Deltaproteobacteria|f__Candidatus Desulfofervidaceae|g__Candidatus Desulfofervidus|s__Candidatus Desulfofervidus auxilii",
    "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Mogibacterium|s__[Eubacterium] minutum" : "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] minutum",
    "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Mogibacterium|s__[Eubacterium] sulci" : "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] sulci",
    "d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Chroococcidiopsidales" : "d__Bacteria|p__Cyanobacteria|o__Chroococcidiopsidales",
    "d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Chroococcidiopsidales|f__Chroococcidiopsidaceae" : "d__Bacteria|p__Cyanobacteria|o__Chroococcidiopsidales|f__Chroococcidiopsidaceae",
    "d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Chroococcidiopsidales|f__Chroococcidiopsidaceae|g__Chroococcidiopsis" : "d__Bacteria|p__Cyanobacteria|o__Chroococcidiopsidales|f__Chroococcidiopsidaceae|g__Chroococcidiopsis",
    "d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Chroococcidiopsidales|f__Chroococcidiopsidaceae|g__Chroococcidiopsis|s__Chroococcidiopsis thermalis" : "d__Bacteria|p__Cyanobacteria|o__Chroococcidiopsidales|f__Chroococcidiopsidaceae|g__Chroococcidiopsis|s__Chroococcidiopsis thermalis",
    "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|f__Dehalococcoidaceae|g__Dehalogenimonas" : "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas",
    "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|f__Dehalococcoidaceae|g__Dehalogenimonas|s__Dehalogenimonas formicexedens" : "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas formicexedens",
    "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|f__Dehalococcoidaceae|g__Dehalogenimonas|s__Dehalogenimonas sp. WBC-2" : "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas sp. WBC-2",
    "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|f__Dehalococcoidaceae|g__Dehalogenimonas|s__Dehalogenimonas lykanthroporepellens" : "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas lykanthroporepellens",
    "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis",
    "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae",
    "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter",
    "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter ruber" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter ruber",
    "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium",
    "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium RA" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium RA",
    "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus",
    "d__Bacteria|p__Bacteroidetes|c__Chitinophagia|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus|s__Rhodothermus marinus" : "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus|s__Rhodothermus marinus",
    "d__Bacteria|p__Candidatus Saccharibacteria|g__Candidatus Saccharimonas|s__Candidatus Saccharibacteria oral taxon TM7x" : "d__Bacteria|p__Candidatus Saccharibacteria|s__Candidatus Saccharibacteria oral taxon TM7x",
    "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Candidatus Regiella|s__secondary endosymbiont of Ctenarytaina eucalypti": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|s__secondary endosymbiont of Ctenarytaina eucalypti",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Candidatus Regiella|s__secondary endosymbiont of Heteropsylla cubana": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|s__secondary endosymbiont of Heteropsylla cubana",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thiodiazotropha|s__Solemya velum gill symbiont": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Solemya velum gill symbiont",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thiodiazotropha|s__endosymbiont of Galathealinum brachiosum": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__endosymbiont of Galathealinum brachiosum",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thiodiazotropha|s__Solemya velesiana gill symbiont": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Solemya velesiana gill symbiont",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thiodiazotropha|s__Bathymodiolus thermophilus thioautotrophic gill symbiont": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Bathymodiolus thermophilus thioautotrophic gill symbiont",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thiodiazotropha|s__enodsymbiont of Escarpia spicata": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__enodsymbiont of Escarpia spicata",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thiodiazotropha|s__Solemya elarraichensis gill symbiont": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Solemya elarraichensis gill symbiont",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Ruthia|s__Bathymodiolus azoricus thioautotrophic gill symbiont": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Bathymodiolus azoricus thioautotrophic gill symbiont",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Ruthia|s__Solemya pervernicosa gill symbiont": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Solemya pervernicosa gill symbiont",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Ruthia|s__endosymbiont of Ridgeia piscesae": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__endosymbiont of Ridgeia piscesae",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Ruthia|s__Bathymodiolus septemdierum thioautotrophic gill symbiont": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Bathymodiolus septemdierum thioautotrophic gill symbiont",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__endosymbiont of Tevnia jerichonana": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__endosymbiont of Tevnia jerichonana",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__endosymbiont of Riftia pachyptila": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__endosymbiont of Riftia pachyptila",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__endosymbiont of Seepiophila jonesi": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__endosymbiont of Seepiophila jonesi",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__endosymbiont of Lamellibrachia luymesi": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__endosymbiont of Lamellibrachia luymesi",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC13013_P4": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC13013_P4",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC13017_P7": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC13017_P7",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC14_002_19_P1": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC14_002_19_P1",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC004_P11": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC004_P11",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC13016_P6": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC13016_P6",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC006_P13": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC006_P13",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC13018_P8": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC13018_P8",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC005_P12": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC005_P12",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC003_P10": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC003_P10",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC001_P9": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC001_P9",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Endoriftia|s__Gammaproteobacteria bacterium LUC13015_P5": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|s__Gammaproteobacteria bacterium LUC13015_P5",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Sedimenticola": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Sedimenticola",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Sedimenticola|s__Sedimenticola selenatireducens": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Sedimenticola|s__Sedimenticola selenatireducens",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Sedimenticola|s__Sedimenticola thiotaurini": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Sedimenticola|s__Sedimenticola thiotaurini",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Sedimenticola|s__Sedimenticola sp.": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Sedimenticola|s__Sedimenticola sp.",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Wohlfahrtiimonas": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Wohlfahrtiimonas",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas chitiniclastica": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas chitiniclastica",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas populi": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas populi",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas larvae": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas larvae",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas sp. G9077": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Wohlfahrtiimonas|s__Wohlfahrtiimonas sp. G9077",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Gallaecimonas": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Gallaecimonas",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Gallaecimonas|s__Gallaecimonas pentaromativorans": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Gallaecimonas|s__Gallaecimonas pentaromativorans",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Gallaecimonas|s__Gallaecimonas sp. HK-28": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Gallaecimonas|s__Gallaecimonas sp. HK-28",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Gallaecimonas|s__Gallaecimonas xiamenensis": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Gallaecimonas|s__Gallaecimonas xiamenensis",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus singularis": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus singularis",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus perditus": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus perditus",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus autotrophicus": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus autotrophicus",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. MED-G25": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. MED-G25",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. REDSEA-S14_B12": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. REDSEA-S14_B12",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__uncultured SUP05 cluster bacterium": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__uncultured SUP05 cluster bacterium",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. REDSEA-S12_B1": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. REDSEA-S12_B1",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. REDSEA-S03_B1": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. REDSEA-S03_B1",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp.": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp.",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. TMED218": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. TMED218",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. MED-G23": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Candidatus Thioglobus|s__Candidatus Thioglobus sp. MED-G23",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Ignatzschineria": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Ignatzschineria",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Ignatzschineria|s__Ignatzschineria cameli": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Ignatzschineria|s__Ignatzschineria cameli",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Ignatzschineria|s__Ignatzschineria larvae": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Ignatzschineria|s__Ignatzschineria larvae",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Ignatzschineria|s__Ignatzschineria ureiclastica": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Ignatzschineria|s__Ignatzschineria ureiclastica",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Ignatzschineria|s__Ignatzschineria indica": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Ignatzschineria|s__Ignatzschineria indica",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Ignatzschineria|s__Ignatzschineria sp. F8392": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Ignatzschineria|s__Ignatzschineria sp. F8392",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Pseudohongiella": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Pseudohongiella",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Pseudohongiella|s__Pseudohongiella acticola": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Pseudohongiella|s__Pseudohongiella acticola",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Pseudohongiella|s__Pseudohongiella spirulinae": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Pseudohongiella|s__Pseudohongiella spirulinae",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Pseudohongiella|s__Pseudohongiella nitratireducens": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Pseudohongiella|s__Pseudohongiella nitratireducens",
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|f__Candidatus Competibacteraceae|g__Pseudohongiella|s__Pseudohongiella sp.": "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|g__Pseudohongiella|s__Pseudohongiella sp.",
"d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Rhizobiales|f__Salinarimonadaceae|g__Salinarimonas|s__Salinarimonadaceae bacterium HL-109": "d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Rhizobiales|f__Salinarimonadaceae|s__Salinarimonadaceae bacterium HL-109",
"d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|g__Candidatus Micropelagos|s__PS1 clade bacterium": "d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|s__PS1 clade bacterium",
"d__Bacteria|p__Proteobacteria|c__Zetaproteobacteria|o__Mariprofundales|f__Mariprofundaceae|g__Ghiorsea": "d__Bacteria|p__Proteobacteria|c__Zetaproteobacteria|g__Ghiorsea",
"d__Bacteria|p__Proteobacteria|c__Zetaproteobacteria|o__Mariprofundales|f__Mariprofundaceae|g__Ghiorsea|s__Ghiorsea bivora": "d__Bacteria|p__Proteobacteria|c__Zetaproteobacteria|g__Ghiorsea|s__Ghiorsea bivora",
"d__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Microbacteriaceae|g__Rhodoluna|s__Microbacteriaceae bacterium MWH-Ta3": "d__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Microbacteriaceae|s__Microbacteriaceae bacterium MWH-Ta3",
"d__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|g__Candidatus Carbobacillus|s__[Flavobacterium] thermophilum": "d__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|s__[Flavobacterium] thermophilum",
"d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|g__Monoglobus|s__[Bacteroides] pectinophilus": "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|s__[Bacteroides] pectinophilus",
"d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Anaerovorax|s__[Eubacterium] infirmum": "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] infirmum",
"d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Mobilibacterium|s__[Eubacterium] minutum": "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] minutum",
"d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Mobilibacterium|s__[Eubacterium] sulci": "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] sulci",
"d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Mobilibacterium|s__[Eubacterium] nodatum": "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] nodatum",
"d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Mobilibacterium|s__[Eubacterium] brachy": "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] brachy",
"d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|g__Ileibacterium|s__[Eubacterium] saphenum": "d__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiales Family XIII. Incertae Sedis|s__[Eubacterium] saphenum",
"d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|f__Tissierellaceae|g__Anaerosalibacter|s__[Clostridium] ultunense": "d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|f__Tissierellaceae|s__[Clostridium] ultunense",
"d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|f__Gottschalkiaceae|g__Sporanaerobacter": "d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|g__Sporanaerobacter",
"d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|f__Gottschalkiaceae|g__Sporanaerobacter|s__Sporanaerobacter acetigenes": "d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|g__Sporanaerobacter|s__Sporanaerobacter acetigenes",
"d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|f__Gottschalkiaceae|g__Sporanaerobacter|s__Sporanaerobacter sp. PP17-6a": "d__Bacteria|p__Firmicutes|c__Tissierellia|o__Tissierellales|g__Sporanaerobacter|s__Sporanaerobacter sp. PP17-6a",
"d__Bacteria|p__Firmicutes|c__Tissierellia|g__Sedimentibacter|s__[Bacteroides] coagulans": "d__Bacteria|p__Firmicutes|c__Tissierellia|s__[Bacteroides] coagulans",
"d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Gloeoemargaritales": "d__Bacteria|p__Cyanobacteria|o__Gloeoemargaritales",
"d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Gloeoemargaritales|f__Gloeomargaritaceae": "d__Bacteria|p__Cyanobacteria|o__Gloeoemargaritales|f__Gloeomargaritaceae",
"d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Gloeoemargaritales|f__Gloeomargaritaceae|g__Gloeomargarita": "d__Bacteria|p__Cyanobacteria|o__Gloeoemargaritales|f__Gloeomargaritaceae|g__Gloeomargarita",
"d__Bacteria|p__Cyanobacteria|c__Gloeobacteria|o__Gloeoemargaritales|f__Gloeomargaritaceae|g__Gloeomargarita|s__Gloeomargarita lithophora": "d__Bacteria|p__Cyanobacteria|o__Gloeoemargaritales|f__Gloeomargaritaceae|g__Gloeomargarita|s__Gloeomargarita lithophora",
"d__Bacteria|p__Candidatus Margulisbacteria|c__Candidatus Sericytochromatia": "d__Bacteria|c__Candidatus Sericytochromatia",
"d__Bacteria|p__Candidatus Margulisbacteria|c__Candidatus Sericytochromatia|s__Candidatus Sericytochromatia bacterium GL2-53 LSPB_72": "d__Bacteria|c__Candidatus Sericytochromatia|s__Candidatus Sericytochromatia bacterium GL2-53 LSPB_72",
"d__Bacteria|p__Candidatus Margulisbacteria|c__Candidatus Sericytochromatia|s__Candidatus Sericytochromatia bacterium S15B-MN24 CBMW_12": "d__Bacteria|c__Candidatus Sericytochromatia|s__Candidatus Sericytochromatia bacterium S15B-MN24 CBMW_12",
"d__Bacteria|p__Candidatus Margulisbacteria|c__Candidatus Sericytochromatia|s__Candidatus Sericytochromatia bacterium S15B-MN24 RAAC_196": "d__Bacteria|c__Candidatus Sericytochromatia|s__Candidatus Sericytochromatia bacterium S15B-MN24 RAAC_196",
"d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|g__Dehalogenimonas": "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas",
"d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|g__Dehalogenimonas|s__Dehalogenimonas sp. GP": "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas sp. GP",
"d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|g__Dehalogenimonas|s__Dehalogenimonas alkenigignens": "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas alkenigignens",
"d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|g__Dehalogenimonas|s__Dehalogenimonas formicexedens": "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas formicexedens",
"d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|g__Dehalogenimonas|s__Dehalogenimonas sp. WBC-2": "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas sp. WBC-2",
"d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|o__Dehalococcoidales|g__Dehalogenimonas|s__Dehalogenimonas lykanthroporepellens": "d__Bacteria|p__Chloroflexi|c__Dehalococcoidia|g__Dehalogenimonas|s__Dehalogenimonas lykanthroporepellens",
"d__Bacteria|p__Chloroflexi|c__Thermomicrobia|g__Thermorudis|s__Thermomicrobia bacterium": "d__Bacteria|p__Chloroflexi|c__Thermomicrobia|s__Thermomicrobia bacterium",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter ruber": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter ruber",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter altiplanensis": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter altiplanensis",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter sp. 10B": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter sp. 10B",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter sp. J07SB67": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salinibacter|s__Salinibacter sp. J07SB67",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium bin80": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium bin80",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium RA": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium RA",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium Tc-Br11_B2g6_7": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium Tc-Br11_B2g6_7",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium TMED105": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|s__Rhodothermaceae bacterium TMED105",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus|s__Rhodothermus marinus": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus|s__Rhodothermus marinus",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus|s__Rhodothermus profundi": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Rhodothermus|s__Rhodothermus profundi",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longibacter": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longibacter",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longibacter|s__Longibacter salinarum": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longibacter|s__Longibacter salinarum",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longimonas": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longimonas",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longimonas|s__Longimonas halophila": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Longimonas|s__Longimonas halophila",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta|s__Salisaeta longa": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta|s__Salisaeta longa",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta|s__Bacteroidetes bacterium UBA7368": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|s__Bacteroidetes bacterium UBA7368",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta|s__Bacteroidetes bacterium UBA2364": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|s__Bacteroidetes bacterium UBA2364",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta|s__Bacteroidetes bacterium UBA2024": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|s__Bacteroidetes bacterium UBA2024",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta|s__Bacteroidetes bacterium UBA2412": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|s__Bacteroidetes bacterium UBA2412",
"d__Bacteria|p__Bacteroidetes|c__Saprospiria|o__Bacteroidetes Order II. Incertae sedis|f__Rhodothermaceae|g__Salisaeta|s__Bacteroidetes bacterium UBA5586": "d__Bacteria|p__Bacteroidetes|o__Bacteroidetes Order II. Incertae sedis|s__Bacteroidetes bacterium UBA5586",
"d__Bacteria|p__Candidatus Kryptonia|g__Candidatus Aegiribacteria": "d__Bacteria|g__Candidatus Aegiribacteria",
"d__Bacteria|p__Candidatus Kryptonia|g__Candidatus Aegiribacteria|s__Candidatus Aegiribacteria bacterium MLS_C": "d__Bacteria|g__Candidatus Aegiribacteria|s__Candidatus Aegiribacteria bacterium MLS_C",
"d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|o__Chthoniobacterales|g__Terrimicrobium": "d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|g__Terrimicrobium",
"d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|o__Chthoniobacterales|g__Terrimicrobium|s__Terrimicrobium sacchariphilum": "d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|g__Terrimicrobium|s__Terrimicrobium sacchariphilum",
"d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|o__Chthoniobacterales|g__Candidatus Xiphinematobacter": "d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|g__Candidatus Xiphinematobacter",
"d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|o__Chthoniobacterales|g__Candidatus Xiphinematobacter|s__Candidatus Xiphinematobacter sp. Idaho Grape": "d__Bacteria|p__Verrucomicrobia|c__Spartobacteria|g__Candidatus Xiphinematobacter|s__Candidatus Xiphinematobacter sp. Idaho Grape",
"d__Bacteria|p__Candidatus Saccharibacteria|g__Candidatus Saccharimonas|s__candidate division TM7 genomosp. GTL1": "d__Bacteria|p__Candidatus Saccharibacteria|s__candidate division TM7 genomosp. GTL1",
"d__Bacteria|p__Candidatus Riflebacteria|c__Candidatus Ozemobacteria|f__Candidatus Ozemobacteraceae|g__Candidatus Ozemobacter|s__Candidatus Riflebacteria bacterium HGW-Riflebacteria-2": "d__Bacteria|p__Candidatus Riflebacteria|s__Candidatus Riflebacteria bacterium HGW-Riflebacteria-2",
"d__Bacteria|p__Candidatus Riflebacteria|c__Candidatus Ozemobacteria|f__Candidatus Ozemobacteraceae|g__Candidatus Ozemobacter|s__Candidatus Riflebacteria bacterium GWC2_50_8": "d__Bacteria|p__Candidatus Riflebacteria|s__Candidatus Riflebacteria bacterium GWC2_50_8",
"d__Bacteria|p__Candidatus Riflebacteria|c__Candidatus Ozemobacteria|f__Candidatus Ozemobacteraceae|g__Candidatus Ozemobacter|s__Candidatus Riflebacteria bacterium HGW-Riflebacteria-1": "d__Bacteria|p__Candidatus Riflebacteria|s__Candidatus Riflebacteria bacterium HGW-Riflebacteria-1",
"d__Bacteria|p__Candidatus Riflebacteria|c__Candidatus Ozemobacteria|f__Candidatus Ozemobacteraceae|g__Candidatus Ozemobacter|s__Candidatus Riflebacteria bacterium": "d__Bacteria|p__Candidatus Riflebacteria|s__Candidatus Riflebacteria bacterium",
"d__Bacteria|p__Candidatus Riflebacteria|c__Candidatus Ozemobacteria|f__Candidatus Ozemobacteraceae|g__Candidatus Ozemobacter|s__Candidatus Riflebacteria bacterium RBG_13_59_9": "d__Bacteria|p__Candidatus Riflebacteria|s__Candidatus Riflebacteria bacterium RBG_13_59_9",
"d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|g__Candidatus Syntrophoarchaeum|s__Methanosarcinales archaeon UBA261": "d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|s__Methanosarcinales archaeon UBA261",
"d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|g__Candidatus Syntrophoarchaeum|s__Methanosarcinales archaeon UBA203": "d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|s__Methanosarcinales archaeon UBA203",
"d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|g__Candidatus Syntrophoarchaeum|s__Methanosarcinales archaeon Methan_03": "d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|s__Methanosarcinales archaeon Methan_03",
"d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|g__Candidatus Syntrophoarchaeum|s__Methanosarcinales archaeon Methan_02": "d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|s__Methanosarcinales archaeon Methan_02",
"d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|g__Candidatus Syntrophoarchaeum|s__Methanosarcinales archaeon Methan_01": "d__Archaea|p__Euryarchaeota|c__Methanomicrobia|o__Methanosarcinales|s__Methanosarcinales archaeon Methan_01",
"d__Archaea|p__Thaumarchaeota|c__Nitrososphaeria|o__Cenarchaeales": "d__Archaea|p__Thaumarchaeota|o__Cenarchaeales",
"d__Archaea|p__Thaumarchaeota|c__Nitrososphaeria|o__Cenarchaeales|f__Cenarchaeaceae": "d__Archaea|p__Thaumarchaeota|o__Cenarchaeales|f__Cenarchaeaceae",
"d__Archaea|p__Thaumarchaeota|c__Nitrososphaeria|o__Cenarchaeales|f__Cenarchaeaceae|g__Cenarchaeum": "d__Archaea|p__Thaumarchaeota|o__Cenarchaeales|f__Cenarchaeaceae|g__Cenarchaeum",
"d__Archaea|p__Thaumarchaeota|c__Nitrososphaeria|o__Cenarchaeales|f__Cenarchaeaceae|g__Cenarchaeum|s__Cenarchaeum symbiosum": "d__Archaea|p__Thaumarchaeota|o__Cenarchaeales|f__Cenarchaeaceae|g__Cenarchaeum|s__Cenarchaeum symbiosum",
"d__Archaea|p__Candidatus Korarchaeota|g__Candidatus Korarchaeum|s__Candidatus Korarchaeota archaeon NZ13-K": "d__Archaea|p__Candidatus Korarchaeota|s__Candidatus Korarchaeota archaeon NZ13-K",
    }

    mpa_strings_corrected = [correction_dict[s] if s in correction_dict.keys() else s for s in mpa_strings]
    return(mpa_strings_corrected)

if __name__ == '__main__':
    main()
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
suppressMessages(library(ggplot2, quietly = TRUE, warn.conflicts = FALSE))
suppressMessages(library(rafalib, quietly = TRUE, warn.conflicts = FALSE))
suppressMessages(library(vegan, quietly = TRUE, warn.conflicts = FALSE))
suppressMessages(library(reshape2, quietly = TRUE, warn.conflicts = FALSE))
suppressMessages(library(RColorBrewer, quietly = TRUE, warn.conflicts = FALSE))
suppressMessages(library(cmapR, quietly = TRUE, warn.conflicts = FALSE))
suppressMessages(library(compositions, quietly = TRUE, warn.conflicts = FALSE))
suppressMessages(library(zCompositions, quietly = TRUE, warn.conflicts = FALSE))
suppressMessages(library(ALDEx2, quietly = TRUE, warn.conflicts = FALSE))
suppressMessages(library(ggpubr, quietly = TRUE, warn.conflicts = FALSE))
options(stringsAsFactors = F)

# options we need from snakemake
sample_reads_file <- snakemake@params[['sample_reads_file']]
sample_reports_file <- snakemake@params[['sample_reports_file']]
sample_groups_file <- snakemake@params[['sample_groups_file']]
workflow.outdir <- snakemake@params[['workflow_outdir']]
result.dir <- snakemake@params[['result_dir']]
use.bracken.report <- snakemake@params[['use_bracken_report']]
remove.chordata <- as.logical(snakemake@params[['remove_chordata']])
if (length(remove.chordata)==0){
    remove.chordata <- F
}

# Use locations in the working directory for scripts and taxonomy array
scripts.folder <- file.path(workflow.outdir, "scripts")
tax.array.file <- file.path(workflow.outdir, "taxonomy_array.tsv")

# Downstream processing for filtering OTUs
## This removes any classification result where all samples are below this threshold,
## which serves to remove the very long tail of lowly abundant species in Kraken2/Bracken results.
## This parameter can be tuned, but I recommend you keep something here to reduce the long tail.
min_otu_percentage <- as.numeric(snakemake@config[['min_otu_percentage']])
# Filtering for compositional data analysis that gets applied for each taxonomic level
## Keep only samples with > codata_min_reads (default 5000)
codata_min_reads <- as.numeric(snakemake@config[['codata_min_reads']])
## Keep only OTUs with an FRACTION of at least 0.001 by default
## This is equivalent to a PERCENTAGE of 0.1
codata_min_prop = as.numeric(snakemake@config[['codata_min_prop']])
## Keep OTUs that are found in at least codata_otu_cutoff fraction of samples 
## Default is 0.3 or 30% of samples
codata_otu_cutoff = as.numeric(snakemake@config[['codata_otu_cutoff']])

# Set up directories and create those that don't exist
classification.folder <- file.path(workflow.outdir, 'classification')
outfolder.matrices.taxonomy <- file.path(result.dir, 'taxonomy_matrices')
outfolder.matrices.taxonomy.classified <- file.path(result.dir, 'taxonomy_matrices_classified_only')
outfolder.gctx.taxonomy <- file.path(result.dir, 'taxonomy_gctx')
outfolder.gctx.taxonomy.classified <- file.path(result.dir, 'taxonomy_gctx_classified_only')
outfolder.matrices.bray <- file.path(result.dir, 'braycurtis_matrices')
outfolder.plots <- file.path(result.dir, 'plots')
outfolder.aldex <- file.path(result.dir, 'ALDEx2_differential_abundance')
for (f in c(result.dir, outfolder.matrices.bray, outfolder.matrices.taxonomy.classified,
            outfolder.gctx.taxonomy.classified, outfolder.plots, outfolder.aldex)){
    if (!dir.exists(f)){ dir.create(f, recursive = T)}
}
# dont create dirs we dont need from bracken
if (!(use.bracken.report)){
    for (f in c(outfolder.matrices.taxonomy, outfolder.gctx.taxonomy)){
        if (!dir.exists(f)){ dir.create(f, recursive = T)}
    }
}
# Load other data processing and plotting scripts
source.script.process <- file.path(scripts.folder, 'process_classification_gctx.R')
source.script.plot <- file.path(scripts.folder, 'plotting_classification.R')
source.script.codaseq <- file.path(scripts.folder, 'CoDaSeq_functions.R')
if (!(file.exists(source.script.process) & file.exists(source.script.plot) & file.exists(source.script.codaseq))) {
    stop('processing scripts do not exist at any of the attempted directories. Exiting. ')
}
suppressMessages(source(source.script.process))
suppressMessages(source(source.script.plot))
suppressMessages(source(source.script.codaseq))

# read from sample_reads_file or sample_reports_file as input
if(sample_reads_file != ""){
    sample.reads <- read.table(sample_reads_file, sep='\t', quote='', header=F, comment.char = "#", colClasses = 'character')
    colnames(sample.reads) <- c('sample', 'r1', 'r2')[1:ncol(sample.reads)]
    # ensure we skip first row if it's a comment
    if ((tolower(sample.reads[1,1]) == 'sample') | (substr(sample.reads[1,1], 1,1) == "#")){
        sample.reads <- sample.reads[2:nrow(sample.reads), ]
    }
    # create dataframe of output files that would be in sample.reports
    sample.reports <- data.frame(sample = sample.reads$sample, 
                                 kraken_report = sapply(sample.reads$sample, function(x) file.path(classification.folder, paste(x, '.krak.report', sep=''))),
                                 bracken_report = sapply(sample.reads$sample, function(x) file.path(classification.folder, paste(x, '.krak_bracken_species.report', sep=''))))                    
} else if(sample_reports_file != ""){
    sample.reports <- read.table(sample_reports_file, sep='\t', quote='', header=F, comment.char = "#", colClasses = 'character')
    colnames(sample.reports) <- c('sample', 'kraken_report', 'bracken_report')[1:ncol(sample.reports)]
    # ensure we skip first row if it's a comment
    if ((tolower(sample.reports[1,1]) == 'sample') | (substr(sample.reports[1,1], 1,1) == "#")){
        sample.reports <- sample.reports[2:nrow(sample.reports), ]
    }
}
sample.names <- sample.reports$sample
sample.number <- nrow(sample.reports)

# Define files for loading in based on sample.
if(use.bracken.report){
    flist <- sample.reports[,3]
} else {
    flist <- sample.reports[,2]
}
names(flist) <- sample.names

# Ensure all expected results files exist
if (!(all(file.exists(flist)))){
    # print which files don't exist
    message('The follwing files do not exist:')
    print(flist[!sapply(flist, file.exists)])
    stop("Some classification files do not exist!")
}

# if a groups file is specified, read it. Otherwise assign everything to one group
if (sample_groups_file != '') {
    sample.groups <- read.table(sample_groups_file, sep='\t', quote='', header=F, comment.char = "#", colClasses = 'character')
    colnames(sample.groups) <- c('sample', 'group')
    # if first sample is sample, discard the line
    if(tolower(sample.groups[1,'sample']) == 'sample'){
        sample.groups <- sample.groups[2:nrow(sample.groups),]
    }
} else {
    sample.groups <- data.frame(sample=sample.names, group='All')
}

# get taxonomy array
# classification at each level for each entry in the database.
# simplified after processing krakens default taxonomy
tax.array <- read.table(tax.array.file, sep='\t', quote='', header=F, comment.char = '', colClasses = 'character')
colnames(tax.array) <- c('id', 'taxid', 'root', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species', 'subspecies')[1:ncol(tax.array)]
# Bug in generation code gave muliple zero taxids. Eliminate that here now
tax.array <- tax.array[!duplicated(tax.array$taxid),]
rownames(tax.array) <- tax.array$taxid
# some duplicate names in this. if so, change them to include the tax level
dup.ids <- tax.array$id[duplicated(tax.array$id)]
dup.inds <- which(tax.array$id %in% dup.ids)
# fix these by adding taxid to the name
tax.array[tax.array$id %in% dup.ids, "id"] <-
    paste(tax.array[tax.array$id %in% dup.ids, "id"], ' (', tax.array[tax.array$id %in% dup.ids, "taxid"], ')', sep='')

# ensure reads and groups have the same data
if (!(all(sample.groups$sample %in% sample.names) & all(sample.names %in% sample.groups$sample))){
    message('sample.reads:')
    print(sample.reads)
    message('sample.groups:')
    print(sample.groups)
    message('sample.groups$sample[!(sample.groups$sample %in% sample.names)]')
    print(sample.groups$sample[!(sample.groups$sample %in% sample.names)])
    message('sample.names[!(sample.names %in% sample.groups$sample)]')
    print(sample.names[!(sample.names %in% sample.groups$sample)])
    stop('Sample reads and sample groups dont contain the same samples... check inputs')
}

# special case for UHGG and MAG databases.
## If reading from UHGG, the taxonomy levels go R, R1-R7
test.df <- kraken_file_to_df(flist[1])
## UHGG database will have this structure
uhgg <- all(paste0('R', 1:7) %in% test.df$tax.level)
# MAG database will have this structure
## Number of genus classifications is zero
segata <- !uhgg & sum(sum(test.df$tax.level=='G')) == 0

# load classification results from each sample and process into gct format
message(paste('Loading data from ', length(flist), ' kraken/bracken results.', sep=''))
df.list <- lapply(flist, function(x) kraken_file_to_df(x))
# merge classification matrix
merge.mat <- merge_kraken_df_list(df.list)
# sample metadata just has groups for now
sample.metadata <- data.frame(id=sample.groups$sample, group=sample.groups$group)
# construct gct object from reads matrix, sample metadata, row metadata
kgct <- make_gct_from_kraken(merge.mat, sample.metadata, tax.array)

# remove Chordata reads if desired
if(remove.chordata){
    kgct <- subset_gct(kgct, rid=kgct@rid[kgct@rdesc$phylum != "Chordata"])
}
# filter to each of the taxonomy levels
if (segata){
    filter.levels <- c('species')
    kgct.filtered.list <- list(species=subset_gct(kgct, rid=kgct@rid[kgct@rid != 'root']))
} else {
    filter.levels <-  c('kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species')
    kgct.filtered.list <- lapply(filter.levels, function(level) {
        subset_kgct_to_level(kgct, level)
        })
    names(kgct.filtered.list) <- filter.levels
}
# print("kgct.filtered.list[['species']]@mat")
# print(kgct.filtered.list[['species']]@mat)

# subset to classified taxa only
unclassified.rownames <- c('unclassified', 'classified at a higher level')
kgct.filtered.classified.list <- lapply(kgct.filtered.list, function(x) subset_gct(x, rid=x@rid[!(x@rid %in% unclassified.rownames)]))

# normalize to percentages
kgct.filtered.percentage.list <- lapply(kgct.filtered.list, function(x) normalize_kgct(x, min_otu_percentage=min_otu_percentage))
kgct.filtered.classified.percentage.list <- lapply(kgct.filtered.classified.list, function(x) normalize_kgct(x, min_otu_percentage=min_otu_percentage))

message('Saving classification matrices...')
# save matrices and gctx objects
if (use.bracken.report){mat.name <- 'bracken'} else {mat.name <- 'kraken'}
for (tn in filter.levels){
    outf.mat.reads <- file.path(outfolder.matrices.taxonomy, paste(mat.name, tolower(tn), 'reads.txt', sep='_'))
    outf.mat.percentage <- file.path(outfolder.matrices.taxonomy, paste(mat.name, tolower(tn), 'percentage.txt', sep='_'))
    outf.mat.reads.classified <- file.path(outfolder.matrices.taxonomy.classified, paste(mat.name, tolower(tn), 'reads.txt', sep='_'))
    outf.mat.percentage.classified <- file.path(outfolder.matrices.taxonomy.classified, paste(mat.name, tolower(tn), 'percentage.txt', sep='_'))
    # gctx names
    outf.gctx.reads <- file.path(outfolder.gctx.taxonomy, paste(mat.name, tolower(tn), 'reads.gctx', sep='_'))
    outf.gctx.percentage <- file.path(outfolder.gctx.taxonomy, paste(mat.name, tolower(tn), 'percentage.gctx', sep='_'))
    outf.gctx.reads.classified <- file.path(outfolder.gctx.taxonomy.classified, paste(mat.name, tolower(tn), 'reads.gctx', sep='_'))
    outf.gctx.percentage.classified <- file.path(outfolder.gctx.taxonomy.classified, paste(mat.name, tolower(tn), 'percentage.gctx', sep='_'))

    # save matrices
    if (!use.bracken.report){
        write.table(kgct.filtered.list[[tn]]@mat, outf.mat.reads, sep='\t', quote=F, row.names = T, col.names = T)
    }
    write.table(kgct.filtered.classified.list[[tn]]@mat, outf.mat.reads.classified, sep='\t', quote=F, row.names = T, col.names = T)

    mat.percentage <- kgct.filtered.percentage.list[[tn]]@mat
    # order rows by mean abundance across samples
    row.order <- rownames(mat.percentage)[order(rowMeans(mat.percentage), decreasing = T)]
    row.order <- row.order[!(row.order %in% unclassified.rownames)]
    row.order <- c(unclassified.rownames[unclassified.rownames %in% rownames(mat.percentage)], row.order)
    mat.percentage <- mat.percentage[row.order,]
    # classified only matrix
    mat.percentage.classified <- kgct.filtered.classified.percentage.list[[tn]]@mat
    row.order.classified <- order(rowMeans(mat.percentage.classified), decreasing = T)
    mat.percentage.classified <- mat.percentage.classified[row.order.classified,]

    if (!use.bracken.report){
        write.table(mat.percentage, outf.mat.percentage, sep='\t', quote=F, row.names = T, col.names = T)
    }
    write.table(mat.percentage.classified, outf.mat.percentage.classified, sep='\t', quote=F, row.names = T, col.names = T)

    # save gctx 
    # only save with unclassified if not using Bracken
    if (!use.bracken.report){
        suppressMessages(write_gctx(kgct.filtered.list[[tn]], outf.gctx.reads, appenddim = F))
        suppressMessages(write_gctx(kgct.filtered.percentage.list[[tn]], outf.gctx.percentage, appenddim = F))
    }
    suppressMessages(write_gctx(kgct.filtered.classified.list[[tn]], outf.gctx.reads.classified, appenddim = F))
    suppressMessages(write_gctx(kgct.filtered.classified.percentage.list[[tn]], outf.gctx.percentage.classified, appenddim = F))
}


#################################################################################
## Diversity calculation and plots ##############################################
#################################################################################
message('Doing diversity calculations and saving figures...')

# DISABLE rarefaction curve as it's useless with so many species
# Wait actually people want it so Im going to re-enable it
plot_rarefaction_curve(kgct.filtered.classified.list$species@mat,
                       file.path(outfolder.plots, 'rarefaction_curve.pdf'))

# diversity calculations
div.methods <- c('shannon', 'simpson')
# calculate for each tax level
div.level.method <- lapply(filter.levels, function(x) {
    use.matrix <- kgct.filtered.classified.list[[x]]@mat
    # if not enough valid rows, skip it
    if (nrow(use.matrix) <3){
        dl <- lapply(div.methods, function(x) rep(0, times=ncol(use.matrix)))
    } else {
        dl <- lapply(div.methods, function(y) diversity(use.matrix, index=y, MARGIN = 2))
    }
    names(dl) <- div.methods
    dl
})
names(div.level.method) <- filter.levels

# coerce forcefully into a dataframe
# there's definitely a better way to do this...
div.df <- as.data.frame(div.level.method)
rownames(div.df) <- sample.names
div.df <- cbind(data.frame(sample=rownames(div.df)), div.df)
div.df <- melt(div.df, id.vars = 'sample')
div.df$tax.level <- sapply(div.df$variable, function(x) strsplit(as.character(x), split="\\.")[[1]][1])
div.df$method <- sapply(div.df$variable, function(x) strsplit(as.character(x), split="\\.")[[1]][2])
div.df <- div.df[,c('sample', 'tax.level','method','value')]
# round to a sensible number
div.df$value <- round(div.df$value, 3)
# write out dataframe
out.div <- file.path(result.dir, 'diversity.txt')
write.table(div.df, out.div, sep='\t', quote=F, row.names=F, col.names=T)

# barplot of diversity under different methods
# one with everything, one separated by sample group
div.df$sample <- factor(div.df$sample, levels = sample.groups$sample)
div.plot.list <- list()
for (g in unique(sample.groups$group)){
    plot.samples <- sample.groups[sample.groups$group==g, "sample"]
    plot.df <- div.df[div.df$sample %in% plot.samples, ]
    plot.df$sample <- factor(plot.df$sample, levels = sample.groups$sample)
    p <- ggplot(plot.df, aes(x=sample, y=value)) +
        geom_bar(stat='identity') +
        facet_grid(rows = vars(tax.level), cols= vars(method), scales = 'fixed') +
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
        labs(title=paste('Diversity for group:', g),
             y='Diversity', x='Sample')
    div.plot.list[[g]] <- p
}

# save one page for each group
div.group.pdf <- file.path(outfolder.plots, 'diversity_by_group.pdf')
pdf(div.group.pdf, height=8, width = 10)
for(p in div.plot.list){print(p)}
trash <- dev.off()

# also a big figure with one page for each method, all samples
div.plot.all.list <- list()
for (m in unique(div.df$method)){
    plot.df <- div.df[div.df$method == m, ]
    p <- ggplot(plot.df, aes(x=sample, y=value)) +
            geom_bar(stat='identity') +
            facet_grid(rows = vars(tax.level), cols= vars(method), scales = 'fixed') +
            theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
            labs(title=paste('Diversity for all samples:', m),
                 y='Diversity', x='Sample')
    div.plot.all.list[[m]] <- p
}

# width of plot = 9 + quarter inch for each sample over 10
max.samps <- max(table(sample.groups$group))
pdf.width <- max(9, (9 + (0.25 * (max.samps-10))))
div.all.pdf <- file.path(outfolder.plots, 'diversity_allsamples.pdf')
pdf(div.all.pdf, height=5, width = pdf.width)
for(p in div.plot.all.list){print(p)}
trash <- dev.off()

#################################################################################
## Taxonomic Barplots ###########################################################
#################################################################################
message('Generating taxonomic barplots...')
# page in the pdf for each sample group
taxlevel.plots <- list()
taxlevel.plots.classified <- list()
for (tn in filter.levels){
    group.plots <- list()
    group.plots.classified <- list()
    for (g in unique(sample.groups$group)){
        plot.samples <- sample.groups[sample.groups$group==g, "sample"]
        div.df.sub <- div.df[div.df$tax.level==tn & div.df$method=='shannon' & div.df$sample %in% plot.samples,]
        div.df.plot <- div.df.sub[, c('sample', 'value')]
        rownames(div.df.plot) <- div.df.plot$sample
        plot.title <- paste('Taxonomy and diversity: ', g, ', ', tn, sep='')
        plot.mat <- kgct.filtered.percentage.list[[tn]]@mat[,plot.samples, drop=FALSE]

        # no longer doing include unclassifed with Bracken report
        # because new version of report doesn't include this data
        if (!use.bracken.report){
        group.plots[[g]] <- plot_many_samples_with_diversity_barplot(plot.mat,
                            div.df.plot, plot.title = plot.title, include.unclassified=T, tax.level.name = tn)
        }
        group.plots.classified[[g]] <- plot_many_samples_with_diversity_barplot(plot.mat,
                            div.df.plot, plot.title = plot.title, include.unclassified=F, tax.level.name = tn)
    }
    taxlevel.plots[[tn]] <- group.plots
    taxlevel.plots.classified[[tn]] <- group.plots.classified
}

# save pdfs
# width of plot = 9 + quarter inch for each sample over 10
max.samps <- max(table(sample.groups$group))
pdf.width <- max(9, (9 + (0.25 * (max.samps-10))))
for (tn in filter.levels){
    if (!use.bracken.report){
        tax.pdf <- file.path(outfolder.plots, paste('taxonomy_barplot_', tolower(tn), '.pdf', sep=''))
        pdf(tax.pdf, height=6, width=pdf.width)
        for (p in taxlevel.plots[[tn]]){print(p)}
        trash <- dev.off()
    }
    tax.pdf.classified <- file.path(outfolder.plots, paste('classified_taxonomy_barplot_', tolower(tn), '.pdf', sep=''))
    pdf(tax.pdf.classified, height=6, width=pdf.width)
    for (p in taxlevel.plots.classified[[tn]]){print(p)}
    trash <- dev.off()
}

#################################################################################
## PCoA plots ###################################################################
#################################################################################
message('Doing PCoA calculations...')
# only do this if we have >=3 samples
if (sample.number >=3){
    # do for each tax level
    plotlist.nolabels <- list()
    plotlist.labels <- list()
    # don't do for kingdom
    do.tn <- filter.levels[filter.levels!='kingdom']
    for (tn in do.tn){
        print(paste('    for:', tn))
        fraction.mat <- kgct.filtered.classified.percentage.list[[tn]]@mat
        if(nrow(fraction.mat) > 2){
            pcoa.res <- capscale(t(fraction.mat)~1, distance='bray')
            pcoa.df <- data.frame(sample.groups, scores(pcoa.res)$sites)
            pcoa.df.melt <- melt(pcoa.df, id.vars = c('sample','group'))
            pcoa.variance <- summary(pcoa.res)$cont$importance[2,1:2]
            # set colors based on number of groups
            ng <- length(unique(sample.groups$group))
            if (ng <10){cols <- brewer.pal(max(ng,3), 'Set1')} else {cols <- colorRampPalette(brewer.pal(9,'Set1'))(ng)}

            plotlist.nolabels[[tn]] <- ggplot(pcoa.df, aes(x=MDS1, y=MDS2, color=group, label=sample)) +
                geom_point(size=3) +
                # scale_color_brewer(palette='Set1') +
                scale_color_manual(values = cols) +
                theme_bw() +
                labs(title=paste('Microbiome beta diversity, Bray-Curtis, ', tn, sep=''),
                     subtitle='Principal Coordinates Analysis plot',
                     x = paste('PC1 (', round(pcoa.variance[1], 3) * 100, '% of variance)', sep=''),
                     y = paste('PC2 (', round(pcoa.variance[2], 3) * 100, '% of variance)', sep='')) +
                xlim(c(min(pcoa.df$MDS1), max(pcoa.df$MDS1) * 1.25)) +
                ylim(c(min(pcoa.df$MDS2), max(pcoa.df$MDS2) * 1.15)) +
                guides(color=guide_legend(title='Group'))

            # add a version with labels above the points
            nudge.x <- sum(abs(range(pcoa.df$MDS1))) / 12
            nudge.y <- sum(abs(range(pcoa.df$MDS2))) / 30
            plotlist.labels[[tn]] <- plotlist.nolabels[[tn]] + geom_text(nudge_x = nudge.x, nudge_y = nudge.y)
        } else{
            plotlist.nolabels[[tn]] <- plot(1,1)
            plotlist.labels[[tn]] <- plot(1,1)
        }
    }

    # save plot
    pcoa.pdf.nolabels <- file.path(outfolder.plots, 'PCoA_2D_plot_nolabels.pdf')
    pdf(pcoa.pdf.nolabels, height=6.5, width=9)
    for (tn in rev(do.tn)){
        print(plotlist.nolabels[[tn]])
    }
    trash <- dev.off()
    pcoa.pdf.labels <- file.path(outfolder.plots, 'PCoA_2D_plot_labels.pdf')
    pdf(pcoa.pdf.labels, height=6.5, width=9)
    for (tn in rev(do.tn)){
        print(plotlist.labels[[tn]])
    }
    trash <- dev.off()
} else {
    # warn the user and make fake plots
    warning('Less than 3 samples, not doing PCoA plots')
    pcoa.pdf.nolabels <- file.path(outfolder.plots, 'PCoA_2D_plot_nolabels.pdf')
    pdf(pcoa.pdf.nolabels, height=6.5, width=9)
    plot(0,0, col='white')
    text(0,0, 'Not enough samples for PCoA plot', col='firebrick')
    trash <- dev.off()
    pcoa.pdf.labels <- file.path(outfolder.plots, 'PCoA_2D_plot_labels.pdf')
    pdf(pcoa.pdf.labels, height=6.5, width=9)
    plot(0,0, col='white')
    text(0,0, 'Not enough samples for PCoA plot', col='firebrick')
    trash <- dev.off()
}

# simple bray curtis distance metrics
for (tn in filter.levels){
    bray.dist <- as.matrix(vegdist(t(kgct.filtered.classified.percentage.list[[tn]]@mat)))
    out.bray <- file.path(outfolder.matrices.bray, paste('braycurtis_distance_', tolower(tn), '.txt', sep=''))
    write.table(bray.dist, out.bray, sep='\t', quote=F, row.names = T, col.names = T)
}

#################################################################################
## Compositional data analysis and plots ########################################
#################################################################################
if (length(unique(sample.groups$group)) != 2){
    warning('Will not do ALDEx2 differential abundance with !=2 groups')}

if (nrow(sample.groups) < 3){
    warning('Will not do compositional data analysis with < 3 samples')
} else {
    pca.plot.list <- list()
    if(segata){
        do.tax.levels <- c('species')
    } else {
        do.tax.levels <- c('kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species')
    }

    print('Compositional data analysis....')
    for (tax.level in do.tax.levels[!(do.tax.levels %in% c('kingdom'))]){
        print(paste('.....', tax.level))
        use.mat <- kgct.filtered.classified.list[[tax.level]]@mat
        if(nrow(use.mat) > 2){
            reads.filtered <- tryCatch(codaSeq.filter(use.mat,
                min.reads=codata_min_reads, min.occurrence=codata_otu_cutoff, min.prop=codata_min_prop, samples.by.row=FALSE),
            error=function(e) {
                warning(e)
                matrix(0)
                })
            } else {
                reads.filtered <- matrix(0)
            }
        if(nrow(reads.filtered) > 2){
            # replace 0 values with an estimate of the probability that the zero is not 0
            # only if necessary
            # at this point samples are by row, and variables are by column
            # we can use the GBM or CZM methods. Function from zCompositions.
            if (sum(reads.filtered==0)>0){
                rfz <- t(cmultRepl(t(reads.filtered),  label=0, method="CZM", output="p-counts"))
                # round to integers and round anything low to 1
                rfz <- round(rfz)
                rfz[rfz==0] <- 1
            } else {
                rfz <- reads.filtered
            }
            # convert to CLR
            clr.aldex <- aldex.clr(rfz, conds=rep(NA, ncol(rfz)), mc.samples=128, denom = 'all', verbose = F)
            # get clr from mean of the MC instances
            rfz.clr <- t(sapply(clr.aldex@analysisData , function(x) rowMeans(x)))
            # save CLR matrix
            outf <- file.path(outfolder.matrices.taxonomy.classified, paste('clr_values_', tax.level, '.tsv', sep=''))
            write.table(round(rfz.clr, 4), outf, sep='\t', quote=F, row.names = T, col.names = T)

            rfz.clr.pca <- prcomp(rfz.clr)
            rfz.clr.mvar <- mvar(rfz.clr)

            # PCA biplot
            plot.df <- data.frame(rfz.clr.pca$x[,1:2], group=sample.groups[sample.groups$sample %in% rownames(rfz.clr), 'group'])
            pc1.var <- round(sum(rfz.clr.pca$sdev[1]^2)/rfz.clr.mvar * 100, 1)
            pc2.var <- round(sum(rfz.clr.pca$sdev[2]^2)/rfz.clr.mvar * 100, 1)
            pca.plot <- ggplot(plot.df, aes(x=PC1, y=PC2, col=group)) +
                geom_point(size=3) +
                geom_text(label=rownames(plot.df),
                          nudge_x = sum(abs(range(plot.df$PC1)))/ 50,
                          nudge_y = sum(abs(range(plot.df$PC2)))/ 50) +
                theme_bw() +
                labs(x=paste("PC1: ", pc1.var, "% of variace", sep=""),
                     y=paste("PC2: ", pc2.var, "% of variace", sep=""),
                     title=paste('Compositional PCA,', tax.level)) +
                scale_color_brewer(palette='Set1')
            pca.plot.list[[tax.level]] <- pca.plot


            # need to limit to two groups for aldex
            if (length(unique(sample.groups$group)) == 2){
                # get two groups
                group.pos <- unique(sample.groups$group)[1]
                group.neg <- unique(sample.groups$group)[2]
                s.pos <- sample.groups$sample[sample.groups$group==group.pos]
                s.neg <- sample.groups$sample[sample.groups$group==group.neg]

                # aldex differential expression between groups
                aldex.res <- aldex(rfz[,c(s.pos, s.neg)],
                    conditions = c(rep(group.pos, length(s.pos)), rep(group.neg, length(s.neg))),
                    mc.samples=128, test="t", effect=TRUE, include.sample.summary=FALSE,
                    denom="all", verbose=FALSE)

                # scatterplot
                outf.scatter <- file.path(outfolder.aldex, paste('aldex_scatter_', tax.level, '.pdf', sep=''))
                pdf(outf.scatter, width = 7, height = 4)
                mypar(1,2, mar = c(3.5, 3.5, 2.5, 1.1))
                aldex.plot(aldex.res, type="MW", test="wilcox", xlab="Dispersion",ylab="Difference")
                mtext(paste(tax.level, ', wilcox', sep=''), line=0.25, cex=1.25)
                aldex.plot(aldex.res, type="MW", test="welch", xlab="Dispersion",ylab="Difference")
                mtext(paste(tax.level, ', welch', sep=''), line=0.25, cex=1.25)
                dev.off()

                # improve dataframe
                aldex.res <- signif(aldex.res, 4)
                # add sample numbers
                aldex.res$n.pos <- length(s.pos)
                aldex.res$n.neg <- length(s.neg)
                # add in name
                aldex.res$taxa <- rownames(aldex.res)
                # reorder
                aldex.res <- aldex.res[, c('taxa', 'n.pos', 'n.neg', colnames(aldex.res)[1:11])]
                aldex.res <- aldex.res[order(aldex.res$we.eBH),]
                # add abs effect
                aldex.res$abs.effect <- abs(aldex.res$effect)

                # save dataframe of differential results
                outf.res <- file.path(outfolder.aldex, paste('aldex_result_', tax.level, '.tsv', sep=''))
                write.table(aldex.res, outf.res, sep='\t', quote=F, row.names = F, col.names = T)

                # make boxplots
                # make plots from everything with this value or less
                plot.thresh <- 0.1
                plot.thresh.col <- 'we.eBH'
                aldex.res.plot <- aldex.res[aldex.res[, plot.thresh.col] <= plot.thresh, ]
                outf.boxplot <- file.path(outfolder.aldex, paste('aldex_significant_boxplots_', tax.level, '.pdf', sep=''))
                if (nrow(aldex.res.plot) >0 ){
                    # make clr values
                    m.prop <-  apply(rfz, 2, function(x) x/sum(x))
                    clr.aldex <- aldex.clr(rfz[,c(s.pos, s.neg)],
                        conds = c(rep(group.pos, length(s.pos)), rep(group.neg, length(s.neg))),
                        mc.samples=128, denom = 'all', verbose = F)
                    # get clr from mean of the MC instances
                    rfz.clr <- sapply(clr.aldex@analysisData , function(x) rowMeans(x))

                    # save all significant in one boxplot
                    pdf(outf.boxplot, height=5, width = 6, onefile=TRUE)
                    toplot <- min(nrow(aldex.res.plot), 50)
                    for ( i in 1:toplot){
                        g <- aldex.res.plot[i, 'taxa']
                        test.df <- data.frame(group = c(rep(group.pos, length(s.pos)), rep(group.neg, length(s.neg))),
                          sample = c(s.pos, s.neg),
                          proportion = c(m.prop[g, s.pos], m.prop[g, s.neg]),
                          clr = c(rfz.clr[g, s.pos], rfz.clr[g, s.neg]))


                        g1 <- ggplot(test.df, aes(x=group, y=proportion, fill=group)) +
                        geom_boxplot() +
                        theme_bw() +
                        labs(subtitle=g) +
                        ylim(c(0,max(test.df$proportion)*1.2)) +
                        stat_compare_means(method = 'wilcox') +
                        stat_compare_means(method = 't.test', label.y.npc = 0.95) + guides(fill=FALSE)

                        g2 <- ggplot(test.df, aes(x=group, y=clr, fill=group)) +
                        geom_boxplot() +
                        theme_bw() +
                        labs(subtitle=g) +
                        ylim(c(min(test.df$clr), max(test.df$clr)*1.2)) +
                        stat_compare_means(method = 'wilcox') +
                        stat_compare_means(method = 't.test', label.y.npc = 0.95)

                        print(ggarrange(g1,g2,widths = c(1,1), common.legend = T))
                    }
                    dev.off()
                }
            }
        }
    }

    # save pca plots from each level
    pdf(file.path(outfolder.plots, 'compositional_PCA_plot.pdf'), height=6.5, width=9)
    for (tn in rev(do.tax.levels)){
        print(pca.plot.list[[tn]])
    }
    trash <- dev.off()
}



message('Done! :D')
34
35
36
shell: """
    python {params.improve_taxonomy_script} {params.db}
"""
48
49
50
51
shell: """
cp -r {params.scriptdir} {outdir}
cp {input.tax_array} {outdir}
"""
69
70
71
72
73
shell: """
    time kraken2 --db {params.db} --threads {threads} --output {output.krak} \
    --report {output.krak_report} {params.paired_string} {input.reads} \
    --confidence {params.confidence_threshold} --use-names
    """
93
94
95
96
shell: """
    bracken -d {params.db} -i {input.krak_report} -o {params.outspec} -r {params.readlen} \
    -l {params.level} -t {params.threshold}
    """
116
117
script:
    'scripts/post_classification_workflow.R'
SnakeMake From line 116 of master/Snakefile
135
136
script:
    'scripts/post_classification_workflow.R'
SnakeMake From line 135 of master/Snakefile
147
148
149
150
151
shell: """
rm -rf {params.workflow_outdir}/scripts
rm -f {params.workflow_outdir}/taxonomy_array.tsv
touch {output}
"""
SnakeMake From line 147 of master/Snakefile
156
157
158
159
shell: """
    ktImportTaxonomy -m 3 -s 0 -q 0 -t 5 -i {input} -o {output} \
    -tax $(which kraken2 | sed 's/envs\/classification2.*$//g')/envs/classification2/bin/taxonomy
    """
177
178
179
180
181
shell: """
    awk '$1=="U" {{ print }}' {input.krak} | cut -f 2 > {params.tempfile}
    filterbyname.sh in={input.r1} in2={input.r2} names={params.tempfile} include=true out={output.r1} out2={output.r2}
    rm {params.tempfile}
"""
SnakeMake From line 177 of master/Snakefile
193
194
195
196
197
shell: """
    awk '$1=="U" {{ print }}' {input.krak} | cut -f 2 > {params.tempfile}
    filterbyname.sh in={input.r1} names={params.tempfile} include=true out={output.r1}
    # rm {params.tempfile}
"""
SnakeMake From line 193 of master/Snakefile
208
209
210
211
shell: """
    rm -f {params.rmdir_1}/*.krak
    touch {output}
"""
SnakeMake From line 208 of master/Snakefile
221
222
script:
    "scripts/convert_report_mpa_style.py"
SnakeMake From line 221 of master/Snakefile
230
231
232
233
234
shell:
    """
    sum=$(grep -vP "\\|" {input} | cut -f 2 | awk '{{sum += $1}} END {{printf ("%.2f\\n", sum/100)}}')
    awk -v sum="$sum" 'BEGIN {{FS="\\t"}} {{OFS="\\t"}} {{print $1,$2/sum}}' {input} > {output}
    """
SnakeMake From line 230 of master/Snakefile
242
243
244
245
246
247
shell:
    """
    source activate biobakery2
    merge_metaphlan_tables.py {input} >  {output.merge}
    grep -E "(s__)|(^ID)"  {output.merge} | grep -v "t__" | sed 's/^.*s__//g' >  {output.merge_species}
    """
SnakeMake From line 242 of master/Snakefile
255
256
257
258
259
260
shell:
    """
    source activate biobakery2
    metaphlan_hclust_heatmap.py --in {input} --top 25 --minv 0.1 -s log --out {output.heatmap1} -f braycurtis -d braycurtis -c viridis
    metaphlan_hclust_heatmap.py --in {input} --top 150 --minv 0.1 -s log --out {output.heatmap2} -f braycurtis -d braycurtis -c viridis
    """
SnakeMake From line 255 of master/Snakefile
268
269
270
271
272
    shell:
        """
        kraken-biom outputs/*_bracken.report -o {output}
        """
'''
ShowHide 14 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/bhattlab/kraken2_classification
Name: kraken2_classification
Version: v2.0
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...