Workflow to rapidly quantify taxa from all domains of life, directly from short-read human gut metagenomes


Phanta

Rapidly quantify taxa from all domains of life, directly from short-read human gut metagenomes

The foundation of this workflow is a comprehensive, virus-inclusive database of genomes with integrated taxonomic information

[Workflow figure, created with BioRender.com]

Citation

Pinto, Y., Chakraborty, M., Jain, N. et al. Phage-inclusive profiling of human gut microbiomes with Phanta. Nat Biotechnol (2023). https://doi.org/10.1038/s41587-023-01799-4 https://www.nature.com/articles/s41587-023-01799-4

Updates

  • 18 Aug 2023 - Thanks to Phanta user @telatin, we have now added a runner script (run_phanta.py) to the root of this repository. This runner script may simplify Phanta execution, especially if you use Phanta regularly. For usage instructions, please see RUNNER.md. Please note that we have not tested the runner script extensively, and we recommend reading through the information below before trying to execute Phanta with it.

Table of contents

  1. Quick Start

  2. Advanced

  3. Troubleshooting

Quick Start

Installation

Step One - Clone the repository

Clone the repository to the desired location on your system using the following command:

git clone https://github.com/bhattlab/phanta.git

Step Two - Install conda, if not already installed

If you do not already have conda installed, please install using the instructions provided here .

Step Three - Create and activate a new conda environment

Create a new conda environment via the following command, replacing /full/path/to/cloned/repo with the appropriate path to your cloned repository:

 # If you use strict or flexible channel priority, first remove the setting
conda config --set channel_priority false
conda env create -n phanta_env --file /full/path/to/cloned/repo/phanta_env.yaml

Activate the environment by executing:

conda activate phanta_env

If you have trouble creating the environment using the above commands, you can alternatively follow the instructions here .

Step Four - Download Phanta's default database

Create a directory to store Phanta's default database of genomes. For example:

mkdir phanta_dbs
cd phanta_dbs

Then execute the following commands:

wget http://ab_phanta.os.scg.stanford.edu/Phanta_DBs/database_V1.tar.gz
tar xvzf database_V1.tar.gz

The list of files that should be extracted is detailed here. Note that alternative databases are also available; please see our databases list.

Test Your Installation

To test that you are ready to run Phanta on your data, navigate to your cloned repository using cd and download the four .fq.gz files required for testing via the following commands:

wget http://ab_phanta.os.scg.stanford.edu/Phanta_DBs/test_dataset.tar.gz
tar xvzf test_dataset.tar.gz

Then edit two files contained in the testing subdirectory of your cloned repository.

  1. Edit samp_file.txt by replacing /full/path/to/cloned/repo in the four locations indicated with the full path to your cloned repository.

  2. Edit config_test.yaml by replacing:

  • /full/path/to/cloned/repo in the three locations indicated with the full path to your cloned repository.

  • /full/path/to/downloaded/database in the one location indicated with the full path to the database you downloaded during the install.
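If you prefer to script these substitutions rather than edit the two files by hand, a minimal Python sketch is shown below. This helper is not part of Phanta, and the two paths at the top are assumptions you must adjust to your own system:

from pathlib import Path

repo = "/home/user/phanta"       # assumption: full path to your cloned repository
db = "/home/user/phanta_dbs"     # assumption: full path to your downloaded database

# substitute the placeholder paths in the two test files
for fname in ("testing/samp_file.txt", "testing/config_test.yaml"):
    path = Path(repo) / fname
    text = path.read_text()
    text = text.replace("/full/path/to/cloned/repo", repo)
    text = text.replace("/full/path/to/downloaded/database", db)
    path.write_text(text)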

Finally, execute the Snakemake command below, after replacing:

  1. /full/path/to/cloned/repo with the path to your cloned repository

  2. The number provided to --max-threads with the maximum number of threads you have available. Note that if this number is greater than 16, you can (but don't need to) also increase the argument to class_threads in config_test.yaml.

Note: At least 32 GB of memory is required. Also, you may have to replace the --cores and --max-threads arguments with a profile for Snakemake execution, depending on your setup (e.g., replace them with --profile slurm).

snakemake -s /full/path/to/cloned/repo/Snakefile \
--configfile /full/path/to/cloned/repo/testing/config_test.yaml \
--jobs 99 --cores 1 --max-threads 16

When execution has completed, please check that your test_dataset directory has an empty file called pipeline_completed.txt . You should also have two new subdirectories in test_dataset - classification and final_merged_outputs - which should have identical contents to the corresponding subdirectories in the testing subdirectory of your cloned repository. You will also have two additional files ending in *krak within test_dataset/classification/intermediate .

Basic Usage

For basic usage, replace the four paths at the top of the provided config.yaml file as appropriate.

You do not need to make any additional changes unless: 1) you did not conduct 150 bp sequencing, or 2) your read files are not gzipped. In those cases, you may also need to change the read_length or gzipped parameters, which are described under Advanced Usage.

After you have finished editing your config file, execute a Snakemake command similar to the one you used to Test Your Installation, for example:

snakemake -s /full/path/to/cloned/repo/Snakefile \
--configfile /full/path/to/cloned/repo/config.yaml \
--jobs 99 --cores 1 --max-threads 16

When the pipeline is done, you will have an empty file in your designated output directory called pipeline_completed.txt .

Main Outputs

The main outputs are merged tables that list the abundance of each taxon, in each sample. Rows are taxa and columns are samples.

  • final_merged_outputs/counts.txt : provides the number of read (pairs) assigned to each taxon

  • final_merged_outputs/relative_read_abundance.txt : same as counts.txt but normalized out of the total number of reads in each sample that were ultimately assigned to any taxon during abundance estimation.

  • final_merged_outputs/relative_taxonomic_abundance.txt : similar to relative_read_abundance.txt but abundances are corrected for genome length. In addition, only species (and not higher taxonomic levels) are included in this report.

For examples of the above outputs, please see the testing/final_merged_outputs subdirectory.

Note : To filter counts.txt or relative_read_abundance.txt to a specific taxonomic level (e.g., species, genus), or to change relative_taxonomic_abundance.txt to a higher taxonomic level than species, please refer to Filtering Merged Tables to a Specific Taxonomic Level under Provided Postprocessing Scripts .

Advanced

Advanced Usage

This section contains a description of the additional parameters in the config file that were not mentioned under Basic Usage but can be altered if desired.

Parameters under Specifications for step one - classification of metagenomic reads

  • confidence_threshold (default 0.1 ). This parameter can range from 0 to 1. Higher values yield more confident classifications but reduce sensitivity. Please see this link from the Kraken2 documentation for more details.

  • gzipped (default True ). This parameter should be True if your read files are gzipped, otherwise False .

  • class_mem_mb (default 32768 ). This parameter specifies the memory in megabytes used for the classification step.

  • class_threads (default 16 ). This parameter specifies the number of threads used for the classification step.

Parameters under Specifications for step two - filtering false positive species

  • cov_thresh_viral (default 0.1 ). This parameter can range from 0 to 1 and specifies a genome coverage requirement for a viral species to be considered a "true positive" in a sample. For example, if this parameter is 0.1, that means that for a viral species to be considered a true positive in a sample, at least one genome of the species must be at least 10% covered by sample reads.

    • Each genome's coverage is estimated by dividing:

      • The number of unique minimizers in the genome that are covered by sample reads, by

      • The total number of unique minimizers in the genome.

    • Terminology note - minimizers are very similar to k-mers; for a more detailed description of what they are, please see the Kraken2 paper.

  • minimizer_thresh_viral (default 0 ). This parameter can take any value >= 0 and specifies an additional requirement for a viral species to be considered true positive in a sample. E.g., if this parameter is 10, that means at least one genome of the species must have 10+ of its unique minimizers covered by sample reads.

  • cov_thresh_bacterial and minimizer_thresh_bacterial are the analogous parameters for filtering bacterial species.

  • Sample sequencing depth may affect the choice of bacterial/viral coverage thresholds - please see our suggested coverage thresholds by sequencing depth, below. These suggestions were based on evaluating the sensitivity/specificity (TPR/FDR) tradeoff at various coverage thresholds and sequencing depths, as described in greater detail in our manuscript. [Figure: suggested coverage thresholds by sequencing depth]

  • As of release 1.1, upon user request we have also added coverage and minimizer parameters for archaeal and eukaryotic species ( cov_thresh_arc , cov_thresh_euk , minimizer_thresh_arc , minimizer_thresh_euk ).
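To make the step-two filtering rule concrete, here is a minimal Python sketch of the per-species decision, assuming the per-genome coverage values and covered-minimizer counts have already been extracted from the Kraken2 report (the pipeline's own implementation is pipeline_scripts/filter_kraken_reports.py, invoked by the Snakefile rule shown under Code Snippets below; the defaults shown here are the viral ones):

def keep_species(genome_stats, cov_thresh=0.1, minimizer_thresh=0):
    """genome_stats: list of (coverage, covered_unique_minimizers), one per genome
    of the species. The species is kept when the best coverage and the best
    covered-minimizer count (maxima taken independently) both clear their thresholds."""
    max_cov = max(cov for cov, _ in genome_stats)
    max_minimizers = max(minim for _, minim in genome_stats)
    return max_cov >= cov_thresh and max_minimizers >= minimizer_thresh

# e.g., a viral species with one genome 12% covered (80 covered minimizers)
# and another genome barely covered passes the default thresholds
print(keep_species([(0.12, 80), (0.003, 2)]))  # True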

Parameters under Specifications for step three - per-species abundance estimation

  • read_length (default 150 ). This parameter specifies the read length. Currently the default database is compatible with 75, 100, 120, or 150, and the alternative databases are compatible with 100 or 150. Additional read lengths can be requested here .

  • filter_thresh (default 10 ). This parameter specifies one last false positive species filter - how many sample reads must have been classified to species X (in step one) for it to be considered truly present in the sample? This parameter is specific to the Bracken tool used for abundance estimation and is equivalent to the threshold parameter described in the original Bracken documentation . Note that this filter is uniform across all types of species (e.g., viral, bacterial).

Additional parameters

  • database . Phanta is typically run with the default database linked above under Step Four of Installation . However, as described in our manuscript, an alternative version of Phanta's default database was also created, in which prophage sequences have been "masked" in prokaryotic genomes. The download link for this database is: http://ab_phanta.os.scg.stanford.edu/Phanta_DBs/masked_db_v1.tar.gz . All available databases are described here .

  • delete_intermediate (default True ). Specify True if you would like intermediate outputs to be deleted, otherwise False . Intermediate outputs are per-sample outputs generated during the execution of Steps 1 and 2. Examples of these intermediate files can be found within the testing/classification/intermediate subdirectory of the cloned repository.

Additional Outputs

In addition to the merged tables provided in the final_merged_outputs subdirectory (see Main Outputs ), the pipeline provides per-sample outputs in the classification subdirectory. Specifically:

  • The files ending with .krak.report_bracken_species.filtered correspond to the Kraken-style report outputted by Bracken and specify the per-sample abundances that underlie the creation of the final merged tables counts.txt and relative_read_abundance.txt . Unlike in the merged tables, taxa that are not present in the sample are not included.

  • The files ending with .krak.report.filtered.bracken.scaled essentially correspond to per-sample versions of final_merged_outputs/relative_taxonomic_abundance.txt . Specifically see the rel_taxon_abundance column. Unlike in the merged table, taxa that are not present in the sample are not included.

    • Note : additional normalizations beyond relative taxonomic abundance are also provided - e.g., reads_per_million_bases , reads_per_million_reads , reads_per_million_bases_per_million_reads (RPMPM) , copies_per_million_reads .

There are two final outputs worth noting:

  1. samples_that_failed_bracken.txt in the classification subdirectory. This file contains names of samples that did not ultimately have any reads directly assigned to the species level during step three of the workflow. Please note that for these samples, the .krak.report.filtered.bracken.scaled file will be empty and there may be additional associated files in the classification subdirectory that end with .temp .

  2. total_reads.tsv in the final_merged_outputs subdirectory. This file contains information about the total number of reads assigned to taxa in each sample, at various steps of the pipeline. Specifically:

  • Tot_Samp_Reads : total number of reads in the original sample

  • Unassigned_Step_One : number of reads that were not assigned a classification by Kraken2 during the classification step

  • Assigned_Step_Three : sum total of reads assigned to a taxon after filtering false positive species and estimating species-level abundances

  • Unassigned_Step_Three : difference between Tot_Samp_Reads and Assigned_Step_Three

Note that the normalization used to create final_merged_outputs/relative_read_abundance.txt from final_merged_outputs/counts.txt utilizes the Assigned_Step_Three column of total_reads.tsv .
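As an illustration, the same normalization can be reproduced with a few lines of pandas. This is a sketch only, assuming the layouts described above: two lineage columns followed by one column per sample in counts.txt, and a Samp_Name column in total_reads.tsv.

import pandas as pd

counts = pd.read_csv("final_merged_outputs/counts.txt", sep="\t")
totals = pd.read_csv("final_merged_outputs/total_reads.tsv", sep="\t")

# divide each sample's counts by that sample's Assigned_Step_Three total
denominators = totals.set_index("Samp_Name")["Assigned_Step_Three"]
relative = counts.copy()
for sample in counts.columns[2:]:   # skip the two lineage columns
    relative[sample] = counts[sample] / denominators[sample]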

Provided Postprocessing Scripts

These scripts are provided within the post_pipeline_scripts subdirectory of the cloned repository.

Filtering Merged Tables to a Specific Taxonomic Level

There are two scripts provided for this purpose.

Script one

post_pipeline_scripts/reduce_merged_table_to_specific_rank.py is a Python script that can be utilized to filter final_merged_outputs/counts.txt or final_merged_outputs/relative_read_abundance.txt to a given taxonomic level.

The necessary command-line arguments to the script are, in order:

  1. Full path to the merged table, and

  2. The taxonomic level of interest (e.g., species, genus, superkingdom...)

An example command is:

python reduce_merged_table_to_specific_rank.py /full/path/to/counts.txt genus

The output of the above command would be a file called counts_genus.txt in the same directory as the original counts.txt .

Script Two

post_pipeline_scripts/sum_corrected_relab_to_higher_level.py is a Python script that can be used to sum up the species-level, genome length-corrected relative abundances provided in final_merged_outputs/relative_taxonomic_abundance.txt to a higher taxonomic level of interest (e.g., genus, superkingdom).

The necessary command-line arguments to the script are, in order:

  1. Full path to the downloaded database of genomes

  2. Full path to final_merged_outputs/relative_taxonomic_abundance.txt

  3. Full path of desired output file, including the desired file name

  4. Taxonomic level of interest

An example command is:

python sum_corrected_relab_to_higher_level.py \
/full/path/to/downloaded/database \
/full/path/to/final_merged_outputs/relative_taxonomic_abundance.txt \
summed.txt genus

Filtering Merged Tables by Prevalence of Taxa

To filter any of Phanta's merged tables by the prevalence of taxa (i.e., the fraction of samples in which a taxon is detected), you may use the post_pipeline_scripts/filter_by_prevalence.py script.

Example command:

python post_pipeline_scripts/filter_by_prevalence.py counts.txt counts_filtered.txt 0.2

Modifying Merged Tables to OTU Format

Many tools require a classic "OTU" format for input tables - taxa as rows and samples as columns. To convert any of the merged output tables generated by Phanta to this classic format, you may use the post_pipeline_scripts/make_otu.py script.

Example command:

python post_pipeline_scripts/make_otu.py counts.txt counts.otu

Collapse Viral Abundances by Predicted Host

post_pipeline_scripts/collapse_viral_abundances_by_host.py is a script that collapses viral abundances in each sample by predicted host.

Expected arguments are:

  1. merged abundance file generated by Phanta (e.g., final_merged_outputs/counts.txt )

  2. host assignment file (provided with the default database at /full/path/to/downloaded/database/host_prediction_to_genus.tsv ). Note: the host assignment file has two columns: 1) taxonomy ID of viral species, 2) predicted lineage of host genus, in the following format: d_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;o_Enterobacterales;f_Enterobacteriaceae;g_Escherichia

  3. full desired path to output file (including file name)

As of preprint posting, the script collapses the viral abundances in the input merged abundance file by predicted host genus. So, the output file is a table where predicted host genera are rows, columns are samples, and each cell provides the abundance of viruses with a particular predicted host genus in a particular sample.

Usage:

python collapse_viral_abundances_by_host.py <merged abundance table> <host assignment file> <path to outfile>
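For orientation, the core of this collapse can be sketched in a few lines of pandas. This is an illustration only, under the assumptions that the counts table uses the lineage columns described above and that the host assignment file has no header row; the provided script remains the supported route.

import pandas as pd

counts = pd.read_csv("final_merged_outputs/counts.txt", sep="\t")
host = pd.read_csv("host_prediction_to_genus.tsv", sep="\t", header=None,
                   names=["viral_species_taxid", "host_genus_lineage"], dtype=str)

# keep species-level rows and pull the species taxid off the end of the ID lineage
last = counts["Taxon_Lineage_with_IDs"].str.split("|").str[-1]
species = counts[last.str.startswith("species_")].copy()
species["taxid"] = species["Taxon_Lineage_with_IDs"].str.split("|").str[-1].str.slice(len("species_"))

# merging on the host file keeps only viral species; then sum per predicted host genus
sample_cols = list(counts.columns[2:])
merged = species.merge(host, left_on="taxid", right_on="viral_species_taxid")
by_host = merged.groupby("host_genus_lineage")[sample_cols].sum()
by_host.to_csv("viral_abundances_by_host_genus.txt", sep="\t")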

Calculate Viral Lifestyle Statistics

post_pipeline_scripts/calculate_lifestyle_stats/lifestyle_stats.R is an R script that can be used to calculate overall statistics about the lifestyles of the viruses present in each sample (e.g., abundance ratio of temperate to virulent phages). Calculations are based on per-species lifestyle predictions that were made using the tool BACPHLIP, for the default database described in the preprint. These per-species predictions are provided with our database at /database/species_name_to_vir_score.txt .

The necessary command-line arguments to the R script are, in order:

  1. The BACPHLIP-predicted virulence score above which a virus should be considered 'virulent' (suggestion: 0.5 ).

  2. The full path to the species_name_to_vir_score.txt file

  3. The full path to a species-level version of final_merged_outputs/counts.txt (please see Filtering Merged Tables to a Specific Taxonomic Level )

  4. The full path to final_merged_outputs/relative_taxonomic_abundance.txt (or a species-level version of final_merged_outputs/relative_read_abundance.txt )

  5. The desired name for the output file (full path)

An example command is:

Rscript lifestyle_stats.R \
0.5 /full/path/to/downloaded/database/species_name_to_vir_score.txt \
final_merged_outputs/counts_species.txt \
final_merged_outputs/relative_taxonomic_abundance.txt \
lifestyle_stats.txt

An example output file, based on the testing dataset, is located in the same directory as the R script ( post_pipeline_scripts/calculate_lifestyle_stats/example_lifestyle_stats.txt ).

Calculation of Cross-Domain Correlations

This module calculates cross-domain abundance correlations. A later step of this module requires fastspar (https://github.com/scwatts/fastspar - please see below). Please note that fastspar may not be compatible with your phanta_env environment, so you may need to create a new environment for the step that requires it.

Fastspar calculates all-vs-all correlations and can be memory- and disk-intensive.

We first recommend filtering your merged counts table to a specific rank (e.g., genus). Please see Filtering Merged Tables to a Specific Taxonomic Level.

Next, we recommend filtering by prevalence. Please see Filtering Merged Tables by Prevalence of Taxa . Example command for this use case:

python post_pipeline_scripts/filter_by_prevalence.py counts_genus.txt counts_genus_filtered.txt 0.2

To make the output compatible with fastspar, please use the script that modifies our merged output to a more standard OTU table - please see Modifying Merged Tables to OTU Format . Example command for this use case:

python post_pipeline_scripts/make_otu.py counts_genus_filtered.txt counts_genus_filtered.otu

The next step actually correlates abundances. This module requires fastspar (https://github.com/scwatts/fastspar) to be installed in your environment. If fastspar is not compatible with your phanta environment, you can create a new environment for that step.

bash post_pipeline_scripts/correl.sh counts_genus_filtered.otu <outdir>

This script calculates all-vs-all correlations, written to <prefix>_correl.txt, and covariances, written to <prefix>_cov.txt, within the specified output directory. P-values for the correlations can be found in <prefix>_pvalues.tsv in the same directory.

Optional positional arguments to the correlation script above:

  1. threads - number of threads (default: 10 )

  2. permutations - number of permutations of the data to estimate p-value (default: 100 )

Now we can filter for correlations between viruses and bacteria, below a maximal p-value, using the following command:

python post_pipeline_scripts/filter_cross_domain.py <pref_correl.txt> <pref_pvalues.tsv> <maximal p-value> <cross_domain_correlations.txt>

Determine Which Detected Viruses are Likely Integrated Prophages

post_pipeline_scripts/integrated_prophage_detector.py is a Python script that can be used to identify which viruses detected in each sample are likely integrated prophages, based on detecting "junctions" between viruses and bacteria. Specifically, the script leverages paired-end information (when available) to find cases where one end (e.g., forward read) would be independently classified to a viral species, while the other end (e.g., reverse read) would be independently classified to a bacterial species.

To enable the usage of this script, please specify the single_end_krak parameter in the config file as True before running Phanta. This will create a directory that stores separate Kraken2 classifications for forward and reverse reads. This does not affect other Phanta processes.

If you've already run Phanta for a set of samples, don't worry! You can simply change the single_end_krak parameter in the original config to True and rerun the Snakemake command that you used to run Phanta.

Then you can run the post-processing script as follows:

python integrated_prophages_detector.py \
/full/path/to/sample_file/used/for/Phanta \
/full/path/to/Phanta/database \
/full/path/to/Phanta/classification/outdir \
<desired_output_directory>

The third argument is the full path to the classification subdirectory of the Phanta results.

The output file will be called integrated_prophages_detection_results.txt and will have one line per pair of viral and bacterial species that had "chimeric" reads detected. Example line:

sample2 4034362 1680 48

This line means that in sample2, there were 48 read pairs for which one end (e.g., forward read) would be independently classified to viral species 4034362, while the other end (e.g., reverse read) would be independently classified to bacterial species 1680. Thus, 4034362 is likely one of the viral species in sample2 that can be found as an integrated prophage, at least to some extent.

Troubleshooting

Environment Creation

If you have trouble creating the environment specified by phanta_env.yaml following the instructions above , you can alternatively install the required packages stepwise, into a fresh conda environment, using the instructions below.

Package list:

  1. bracken v2.7

  2. kraken2 v2.1.2

  3. pandas (pipeline developed with v1.4.3 but anything higher should also work)

  4. numpy (pipeline developed with v1.23.2 but anything higher should also work)

  5. r-base (pipeline developed with v4.1.3 but anything higher should also work)

  6. r-stringr (pipeline developed with v1.4.0 but anything higher should also work)

  7. r-dplyr (pipeline developed with v1.0.9 but anything higher should also work)

  8. snakemake (pipeline developed with v7.12.1 and we have heard of issues with v7.3+)

Example set of commands that should install all of the above (including the dependencies) into a fresh environment:

conda create -n phanta_env_alternate
conda activate phanta_env_alternate
conda install -c bioconda bracken=2.7
conda install pandas
conda install -c conda-forge r-base r-stringr r-dplyr
conda install -c bioconda snakemake=7.12

Stalled Snakemake Pipeline

Occasionally, Snakemake stalls when submitting SLURM jobs (e.g., see link ). If it appears that the Snakemake command is still running but new jobs have long stopped being submitted, please cancel all current jobs using the scancel command and post a GitHub issue. We can help you determine how to proceed to ensure that the pipeline completes both: 1) without error and 2) accurately. This is an easily fixable problem but there is no universal solution. The exact steps will depend on when the pipeline stalled. The Snakemake log file (in the hidden .snakemake directory) will be of use.

Code Snippets

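# Likely pipeline_scripts/calc_total_reads.sh, inferred from the Snakemake rule further
# below that calls it with a sample name, the classification directory, and an output file.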
if ! command -v bc &> /dev/null; then
    echo "Error: 'bc' calculator is not installed."
    exit 1
fi

# first - calculate total reads in the sample - add total reads classified under root
# by Kraken2 and total reads unclassified by Kraken2

root=$(grep 'root' $2/$1.krak.report | head -n 1 | cut -f 2 | xargs)
unclassified_krak=$(grep 'unclassified' $2/$1.krak.report | head -n 1 | cut -f 2 | xargs)
if [[ -z $root ]] && [[ -z $unclassified_krak ]]
then
  echo "No reads in sample"
  exit 64
elif [[ -z $root ]]
then
  tot_reads=$unclassified_krak
elif [[ -z $unclassified_krak ]]
then
  tot_reads=$root
else
  tot_reads=$(($root+$unclassified_krak))
fi

# next - calculate total BRACKEN-CLASSIFIED reads in the sample
tot_reads_bracken=$(cut -f 6 $2/$1.krak.report.filtered.bracken | tail -n +2 | paste -sd+ | bc)

# how many reads were left out by Bracken?
unclassified_brack="$((tot_reads-tot_reads_bracken))"

echo -e "$1\t${tot_reads}\t${unclassified_krak}\t${tot_reads_bracken}\t${unclassified_brack}" > $3
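# Likely pipeline_scripts/filter_kraken_reports.py, inferred from the Snakemake rule
# further below whose ten arguments match the ones parsed here.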
import sys
import pandas as pd
import numpy as np

kraken_report, kraken_db, max_cov_bacteria, max_cov_virus, \
max_minimizers_bacteria, max_minimizers_virus, max_cov_arc, \
max_cov_euk, max_minimizers_arc, max_minimizers_euk = \
sys.argv[1], sys.argv[2], float(sys.argv[3]), float(sys.argv[4]), \
int(sys.argv[5]), int(sys.argv[6]), float(sys.argv[7]), float(sys.argv[8]), \
int(sys.argv[9]), int(sys.argv[10])

"""
STEP TWO
Define some functions that will be required for STEP THREE
"""

def make_dicts(nodes_file):
  with open(nodes_file, 'r') as infile:
    # make a child to parent dictionary
    # and a taxid to rank dictionary
    child_to_parent = {}
    taxid_to_rank = {}
    for line in infile:
      line=line.rstrip('\n').split('\t')
      child, parent, rank = line[0], line[2], line[4]
      child_to_parent[child] = parent
      taxid_to_rank[child] = rank
  return child_to_parent, taxid_to_rank

def taxid_to_desired_rank(taxid, desired_rank, child_parent, taxid_rank):
  # look up the specific taxid,
  # build the lineage using the dictionaries
  # stop at the desired rank and return the taxid
  lineage = [[taxid, taxid_rank[taxid]]]
  if taxid_rank[taxid] == desired_rank:
    return taxid
  child, parent = taxid, None
  if child == '0':
    return 'unclassified'
  while not parent == '1':
    # print(child, parent)
    # look up child, add to lineage
    parent = child_parent[child]
    rank = taxid_rank[parent]
    lineage.append([parent, rank])
    if rank == desired_rank:
      return parent
    child = parent # needed for recursion
  return 'error - taxid above desired rank, or not annotated at desired rank'

"""
STEP THREE
Adapted from /labs/asbhatt/yishay/scripts/kraken_species_filter_low_rank.py
Make a modified version of the Kraken report that:
1) Only includes lines representing species and strains
2) Has two extra columns: 1) coverage, 2) species level taxid
"""
# read in Kraken report, merge with inspect file to get info to calculate coverage

kraken_file = pd.read_csv(kraken_report, sep="\t", names=['fraction','fragments', 'assigned','minimizers','uniqminimizers', 'classification_rank','ncbi_taxa','scientific name'])
inspect_fname = kraken_db + '/inspect.out'
inspect_file = pd.read_csv(inspect_fname, sep="\t", names=['frac','minimizers_clade', 'minimizers_taxa', 'rank','ncbi_taxonomy','sci_name'])

kraken_file = kraken_file.merge(inspect_file,left_on='ncbi_taxa', right_on='ncbi_taxonomy')

# calculate coverage
kraken_file['cov'] = kraken_file['uniqminimizers']/kraken_file['minimizers_taxa']

# assign coverage as NA if there are fewer than 5 minimizers assigned to this taxon
kraken_file.loc[kraken_file.minimizers_taxa < 5,'cov']=np.nan

# filter kraken_file to species only
species_kraken = kraken_file.copy()[kraken_file['classification_rank'].str.startswith('S', na=False)]

# add a column for species level taxid - use functions defined above
child_parent, taxid_rank = make_dicts(kraken_db + '/taxonomy/nodes.dmp')
species_kraken['species_level_taxa'] = species_kraken.apply(lambda x: taxid_to_desired_rank(str(x['ncbi_taxonomy']), 'species', child_parent, taxid_rank), axis=1)
species_kraken.to_csv(kraken_report + ".species", sep="\t", index=False)

"""
STEP FOUR
Read in the modified Kraken report and make two dictionaries.
1) species_to_superkingdom: species taxid to superkingdom
2) species_genome_info:
species taxid to list of: [taxid, cov, rank, unique_minimizers] for every taxon
found by Kraken and falling under that species taxid
"""
with open(kraken_report + '.species', 'r') as infile:
  # figure out the column number for all the features of interest

  header=infile.readline().rstrip('\n').split('\t')

  unique_minimizers_col, cov_col, species_col, rank_col, taxid_col = \
  None, None, None, None, None

  for i in range(len(header)):
    if header[i] == 'cov':
      cov_col = i
    elif header[i] == 'species_level_taxa':
      species_col = i
    elif header[i] == 'rank':
      rank_col = i
    elif header[i] == 'uniqminimizers':
      unique_minimizers_col = i
    elif header[i] == 'ncbi_taxa':
      taxid_col = i

  assert unique_minimizers_col != None
  assert cov_col != None
  assert species_col != None
  assert rank_col != None
  assert taxid_col != None

  # now go through report!
  # first initialize the dictionaries
  species_to_superkingdom = {}
  species_genome_info = {}

  i = 0
  for line in infile: # skipped header already
    i += 1
    line=line.rstrip('\n').split('\t')

    unique_minimizers, cov, species_taxid, rank, taxid = \
    int(line[unique_minimizers_col]), line[cov_col], line[species_col], line[rank_col], \
    line[taxid_col]

    if cov == '':
      cov = 0
    else:
      cov = float(cov)

    # add to dictionaries
    if rank == 'S':
      # can add to the species_to_superkingdom dictionary
      # look up superkingdom
      superkingdom = taxid_to_desired_rank(species_taxid, 'superkingdom', child_parent, taxid_rank)
      species_to_superkingdom[species_taxid] = superkingdom
      # if i > 100 and i < 200:
        # print('superkingdom', i)
        # print(species_to_superkingdom[species_taxid])

    # regardless, add to the species_genome_info dictionary
    if not species_taxid in species_genome_info:
      # initialize
      species_genome_info[species_taxid] = [[taxid, cov, rank, unique_minimizers]]
    else:
      species_genome_info[species_taxid].append([taxid, cov, rank, unique_minimizers])

#    if i > 100 and i < 200:
      # print('other', i)
      # print(species_genome_info[species_taxid])

"""
STEP FIVE
Go through species_genome_info and figure out - using species_to_superkingdom as well -
which species should be "kept" in the Kraken report.
Output a file containing info about these decisions.
Here we will use the parameters that were passed in:
max_cov_bacteria, max_cov_virus, max_minimizers_bacteria, max_minimizers_virus
Recall: if max_cov is passed, definitely keep. If not, keep if max_minimizers threshold is passed.
Recall - we only care about the coverages/minimizer values for the GENOMES in
each species - i.e., the entries in species_genome_info that don't have a 'lower rank'
right afterwards.
Note - will also make a dictionary that stores this info - taxid_to_lowest_rank.
"""
species_to_keep = set()
taxid_to_lowest_rank = {}

out_fname = kraken_report + '.filtering_decisions.txt'

with open(out_fname, 'w') as outfile:

  # write out a header first
  header = ['species_taxid', 'superkingdom', 'max_cov', 'max_uniq_minimizers', 'kept']
  outfile.write('\t'.join(header) + '\n')

  for species in species_genome_info:
    lines_to_consider = species_genome_info[species]
    relevant_coverages, relevant_unique_minimizer_vals = [], []
    for i in range(len(lines_to_consider)):

      taxid_to_lowest_rank[lines_to_consider[i][0]] = 'False'

      if i == (len(lines_to_consider) - 1): # special case - last line in the Kraken report for this species taxid
        relevant_coverages.append(lines_to_consider[i][1])
        relevant_unique_minimizer_vals.append(lines_to_consider[i][3])
        # update taxid_to_lowest_rank
        taxid_to_lowest_rank[lines_to_consider[i][0]] = 'True'
      else: # not the last line in the Kraken report for this species taxid
        this_rank = lines_to_consider[i][2]
        next_rank = lines_to_consider[i+1][2]
        if this_rank == 'S':
          continue # since it's not the last line in the Kraken report, next_rank must be "lower"
          # sanity check
          assert len(next_rank) > len(this_rank)
        else: # we're dealing with S1, S2, etc.
          if int(this_rank[1:]) < int(next_rank[1:]): # e.g., S1 < S2
            continue
          else: # e.g., S3 > S2
            relevant_coverages.append(lines_to_consider[i][1])
            relevant_unique_minimizer_vals.append(lines_to_consider[i][3])
            # update taxid_to_lowest_rank
            taxid_to_lowest_rank[lines_to_consider[i][0]] = 'True'

    # now we can get the maximum coverage from the relevant coverages
    max_cov = max(relevant_coverages)
    # also get the maximum "# unique minimizers"
    max_minimizers = max(relevant_unique_minimizer_vals)

    # should we keep this species?
    # figure out based on superkingdom
    superkingdom = species_to_superkingdom[species]

    if superkingdom == '2': # bacteria
      if (max_cov >= max_cov_bacteria) and (max_minimizers >= max_minimizers_bacteria):
        species_to_keep.add(species)

    elif superkingdom == '2157': # archaea
      if (max_cov >= max_cov_arc) and (max_minimizers >= max_minimizers_arc):
        species_to_keep.add(species)

    elif superkingdom == '2759': # eukaryotes
      if (max_cov >= max_cov_euk) and (max_minimizers >= max_minimizers_euk):
        species_to_keep.add(species)

    elif superkingdom == '10239': # viruses
      if (max_cov >= max_cov_virus) and (max_minimizers >= max_minimizers_virus):
        species_to_keep.add(species)

    else:
      species_to_keep.add(species)

    if species in species_to_keep:
      keep = 'True'
    else:
      keep = 'False'
    outfile.write('\t'.join([species, superkingdom, str(max_cov), str(max_minimizers), keep]) + '\n')

"""
STEP SIX
Go through the original Kraken report and make a new filtered version,
where lines from species to filter out are removed.
"""
with open(kraken_report, 'r') as infile:
  with open(kraken_report + '.filtered', 'w') as outfile:
    for line in infile:
      line=line.rstrip('\n').split('\t')
      if line[5].startswith('S'):
        # look up the taxonomy id - what is the species level taxid?
        species_taxid = taxid_to_desired_rank(line[6], 'species', child_parent, taxid_rank)
        # do we care about this species?
        if species_taxid in species_to_keep:
          outfile.write('\t'.join(line) + '\n')
      else:
        outfile.write('\t'.join(line) + '\n')
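# Likely pipeline_scripts/merging_bracken_tables.R, inferred from the Snakemake rule further
# below that calls it with the classification dir, final output dir, failed-samples file,
# and the two table lists; it writes counts.txt and relative_read_abundance.txt.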
library(stringr)

# need to read in arguments from command line
args <- commandArgs(trailingOnly=TRUE)
indir <- args[1]
outdir <- args[2]
failed <- args[3]

# First merge counts tables

counts_tables_list <- args[4]
count_tables <- as.character(read.csv(counts_tables_list, header=FALSE)[,1])

failed_samples <- tryCatch(
  expr = {
    as.character(read.csv(failed, header=FALSE)[,1])
  },
  error = function(e){
    return(c())
  })

#print(count_tables)

# figure out what column names we'll want for the merged data frame
# find locations of '.' in the file names
locations <- str_locate(count_tables, '\\.krak.report')[,1]
sampnames <- substr(count_tables, 1, locations-1)
desired_colnames <- c("Taxon_Lineage_with_Names", "Taxon_Lineage_with_IDs", sampnames)

#print(desired_colnames)

# initialize data frame
sample <- desired_colnames[3]
if (sample %in% failed_samples) {
  merged_table <- data.frame(matrix(ncol = 2, nrow = 0))
  colnames(merged_table) <- desired_colnames[1:2]
} else {
  merged_table <- read.csv(paste0(indir, '/', count_tables[1]), sep='\t', header=FALSE)
  colnames(merged_table) <- desired_colnames[1:3]
}

# loop through files and keep merging
if (length(count_tables) > 1) {
for (i in seq(2,length(count_tables))) {
  sample <- desired_colnames[i+2]
  if (sample %in% failed_samples) {
    # do nothing
  } else {
  f <- paste0(indir, '/', count_tables[i])
  table_to_merge <- read.csv(f, sep='\t', header=FALSE)
  colnames(table_to_merge) <- c("Taxon_Lineage_with_Names", "Taxon_Lineage_with_IDs", sample)
  merged_table <- merge(merged_table, table_to_merge, by = c("Taxon_Lineage_with_Names", "Taxon_Lineage_with_IDs"), all=TRUE)
  # replace NA with 0
  merged_table[is.na(merged_table)] <- 0
}}}

for (sample in failed_samples) {
  merged_table[[sample]] <- rep(NA, nrow(merged_table))
}

# make the first two columns character vectors, not factors
merged_table[,1] <- as.character(merged_table[,1])
merged_table[,2] <- as.character(merged_table[,2])

# now output
outpath <- paste0(outdir, '/', 'counts.txt')
write.table(merged_table, outpath, quote=FALSE, row.names = FALSE, sep = '\t')

# Next merge tables normalized out of total Bracken-classified reads

norm_brack_tables_list <- args[5]
norm_brack_tables <- as.character(read.csv(norm_brack_tables_list, header=FALSE)[,1])

# the samples should be in the same order - do a sanity check
locations <- str_locate(norm_brack_tables, '\\.krak.report')[,1]
sampnames <- substr(norm_brack_tables, 1, locations-1)
desired_colnames_test <- c("Taxon_Lineage_with_Names", "Taxon_Lineage_with_IDs", sampnames)
stopifnot(desired_colnames == desired_colnames_test)

# initialize data frame
sample <- desired_colnames[3]
if (sample %in% failed_samples) {
  merged_table <- data.frame(matrix(ncol = 2, nrow = 0))
  colnames(merged_table) <- desired_colnames[1:2]
} else {
  merged_table <- read.csv(paste0(indir, '/', norm_brack_tables[1]), sep='\t', header=FALSE)
  colnames(merged_table) <- desired_colnames[1:3]
}

# loop through files and keep merging
if (length(norm_brack_tables) > 1) {
for (i in seq(2,length(norm_brack_tables))) {
  sample <- desired_colnames[i+2]
  if (sample %in% failed_samples) {
  } else {
  f <- paste0(indir, '/', norm_brack_tables[i])
  table_to_merge <- read.csv(f, sep='\t', header=FALSE)
  colnames(table_to_merge) <- c("Taxon_Lineage_with_Names", "Taxon_Lineage_with_IDs", sample)
  merged_table <- merge(merged_table, table_to_merge, by = c("Taxon_Lineage_with_Names", "Taxon_Lineage_with_IDs"), all=TRUE)
  # replace NA with 0
  merged_table[is.na(merged_table)] <- 0
}}}

for (sample in failed_samples) {
  merged_table[[sample]] <- rep(NA, nrow(merged_table))
}

# make the first two columns character vectors, not factors
merged_table[,1] <- as.character(merged_table[,1])
merged_table[,2] <- as.character(merged_table[,2])

# now output
outpath <- paste0(outdir, '/', 'relative_read_abundance.txt')
write.table(merged_table, outpath, quote=FALSE, row.names = FALSE, sep = '\t')
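# Likely pipeline_scripts/merging_scaled_bracken.py, inferred from the Snakemake rule further
# below; it takes the database, the temporary merged table, and the final output dir,
# and writes relative_taxonomic_abundance.txt.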
import sys

db, temp_file, outdir = sys.argv[1], sys.argv[2], sys.argv[3]

# make a few helpful dictionaries

with open(db + '/taxonomy/nodes.dmp', 'r') as infile:

  # make a child to parent dictionary
  # and a taxid to rank dictionary
  child_to_parent = {}
  taxid_to_rank = {}
  for line in infile:
    line=line.rstrip('\n').split('\t')
    child, parent, rank = line[0], line[2], line[4]
    child_to_parent[child] = parent
    taxid_to_rank[child] = rank

# also want a taxid_to_name dictionary
# use the inspect.out file

with open(db + '/inspect.out', 'r') as infile:
  taxid_to_name = {}
  for line in infile:
    line=line.rstrip('\n').split('\t')
    taxid, name = line[4], line[5].strip()
    taxid_to_name[taxid] = name

def taxid_to_lineages(taxid):
  # look up the specific taxid,
  # build the lineages using the dictionaries

  # first deal with the special case of unclassified
  if taxid == '0':
    return 'unclassified'

  # not unclassified - proceed
  rank = taxid_to_rank[taxid]
  lineage_taxids = rank + '_' + taxid
  lineage_names = rank + '_' + taxid_to_name[taxid]
  child, parent = taxid, None

  while not parent == '1':
    # look up child, add to lineages
    parent = child_to_parent[child]
    parent_rank = taxid_to_rank[parent]
    # look up the name of the parent too
    parent_name = taxid_to_name[parent]
    lineage_taxids = parent_rank + '_' + parent + '|' + lineage_taxids
    lineage_names = parent_rank + '_' + parent_name + '|' + lineage_names
    child = parent # needed for recursion
  return lineage_names, lineage_taxids

with open(temp_file, 'r') as infile:
  with open(outdir + '/relative_taxonomic_abundance.txt', 'w') as outfile:
    # first fix the header
    orig_header = infile.readline().rstrip('\n').split('\t')
    new_header = '\t'.join(['Taxon_Lineage_with_Names', 'Taxon_Lineage_with_IDs'] + orig_header[1:]) + '\n'
    outfile.write(new_header)
    for line in infile:
      line=line.rstrip('\n').split('\t')
      # use the function to figure out what we want to call this line
      taxid = line[0]
      lin_names, lin_ids = taxid_to_lineages(taxid)
      outlist = [lin_names, lin_ids] + line[1:]
      outfile.write('\t'.join(outlist) + '\n')
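# Likely pipeline_scripts/merging_scaled_bracken.R, inferred from the Snakemake rule further
# below; it merges the per-sample .scaled reports into relative_taxonomic_abundance_temp.txt.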
library(stringr)

# need to read in arguments from command line
args <- commandArgs(trailingOnly=TRUE)
outdir <- args[1]

scaled_reports_list <- args[2]
scaled_reports <- as.character(read.csv(scaled_reports_list, header=FALSE)[,1])

# figure out what column names we'll want for the merged data frame
# find locations of '.' in the file names
locations <- str_locate(scaled_reports, '\\.krak.report')[,1]
sampnames <- substr(scaled_reports, 1, locations-1)
desired_colnames <- c("TaxID", sampnames)

# also figure out which samples failed
failed <- args[3]
failed_samples <- tryCatch(
  expr = {
    as.character(read.csv(failed, header=FALSE)[,1])
  },
  error = function(e){
    return(c())
  })

# initialize data frame
sample <- desired_colnames[2]
if (sample %in% failed_samples) {
  merged_table <- data.frame(matrix(ncol = 1, nrow = 0))
  colnames(merged_table) <- desired_colnames[1]
} else {
  merged_table <- read.csv(paste0(outdir, '/', scaled_reports[1]), sep='\t', header=TRUE)
  taxid_col <- which(colnames(merged_table) == 'taxonomy_id')
  rel_taxon_col <- which(colnames(merged_table) == 'rel_taxon_abundance')
  merged_table <- merged_table[,c(taxid_col, rel_taxon_col)]
  colnames(merged_table) <- desired_colnames[1:2]
}

# loop through files and keep merging
if (length(scaled_reports) > 1) {
for (i in seq(2,length(scaled_reports))) {
  sample <- desired_colnames[i+1]
  if (sample %in% failed_samples) {
  } else {
  f <- paste0(outdir, '/', scaled_reports[i])
  table_to_merge <- read.csv(f, sep='\t', header=TRUE)
  taxid_col <- which(colnames(table_to_merge) == 'taxonomy_id')
  rel_taxon_col <- which(colnames(table_to_merge) == 'rel_taxon_abundance')
  table_to_merge <- table_to_merge[,c(taxid_col, rel_taxon_col)]
  colnames(table_to_merge) <- c("TaxID", sample)
  merged_table <- merge(merged_table, table_to_merge, by = c("TaxID"), all=TRUE)
  # replace NA with 0
  merged_table[is.na(merged_table)] <- 0
  }}}

for (sample in failed_samples) {
  merged_table[[sample]] <- rep(NA, nrow(merged_table))
}

# make the first column a character vector, not a factor
merged_table[,1] <- as.character(merged_table[,1])

# now output
outpath <- paste0(outdir, '/', 'relative_taxonomic_abundance_temp.txt')
write.table(merged_table, outpath, quote=FALSE, row.names = FALSE, sep = '\t')
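# Likely pipeline_scripts/prep_to_merge_counts.py, inferred from the Snakemake rule further
# below; it adds lineage strings to each per-sample Bracken report, producing the .to_merge files.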
import sys

report, kraken_db, failed_file, sample = \
sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4]

# make a few helpful dictionaries

with open(kraken_db + '/taxonomy/nodes.dmp', 'r') as infile:

  # make a child to parent dictionary
  # and a taxid to rank dictionary
  child_to_parent = {}
  taxid_to_rank = {}
  for line in infile:
    line=line.rstrip('\n').split('\t')
    child, parent, rank = line[0], line[2], line[4]
    child_to_parent[child] = parent
    taxid_to_rank[child] = rank

# also want a taxid_to_name dictionary
# use the inspect.out file

with open(kraken_db + '/inspect.out', 'r') as infile:
  taxid_to_name = {}
  for line in infile:
    line=line.rstrip('\n').split('\t')
    taxid, name = line[4], line[5].strip()
    taxid_to_name[taxid] = name

# also make a set of the samples that failed Bracken
with open(failed_file, 'r') as infile:
  failed_samples = set()
  for line in infile:
    failed_samples.add(line.rstrip('\n'))

def taxid_to_lineages(taxid):
  # look up the specific taxid,
  # build the lineages using the dictionaries

  # first deal with the special case of unclassified
  if taxid == '0':
    return 'unclassified'

  # not unclassified - proceed
  rank = taxid_to_rank[taxid]
  lineage_taxids = rank + '_' + taxid
  lineage_names = rank + '_' + taxid_to_name[taxid]
  child, parent = taxid, None

  while not parent == '1':
    # look up child, add to lineages
    parent = child_to_parent[child]
    parent_rank = taxid_to_rank[parent]
    # look up the name of the parent too
    parent_name = taxid_to_name[parent]
    lineage_taxids = parent_rank + '_' + parent + '|' + lineage_taxids
    lineage_names = parent_rank + '_' + parent_name + '|' + lineage_names
    child = parent # needed for recursion
  return lineage_taxids, lineage_names

with open(report, 'r') as infile:
  with open(report + '.to_merge', 'w') as outfile:
    if sample in failed_samples:
      pass
    else:
      for line in infile:
        line=line.rstrip('\n').split('\t')
        # use the function to figure out what we want to call this line
        taxid = line[4]
        lin_taxids, lin_names = taxid_to_lineages(taxid)
        num_reads = line[1]
        outfile.write('\t'.join([lin_names, lin_taxids, num_reads]) + '\n')
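# Likely pipeline_scripts/prep_to_merge_normed.py, inferred from the Snakemake rule further
# below; it normalizes each .to_merge file by the sample's Bracken-classified read count.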
import sys

tot_reads_file, indir, failed = \
sys.argv[1], sys.argv[2], sys.argv[3]

# make a dictionary from sample name to number of Bracken-classified reads
with open(tot_reads_file, 'r') as infile:
  samp_name_to_reads = {}
  header=infile.readline() # skip
  for line in infile:
    line=line.rstrip('\n').split('\t')
    samp, brack_reads = line[0], line[3]
    samp_name_to_reads[samp] = brack_reads

sample_names = samp_name_to_reads.keys()

# make a set of failed samples
with open(failed, 'r') as infile:
  failed_samples = set()
  for line in infile:
    failed_samples.add(line.rstrip('\n'))

for samp in sample_names:
  # file to normalize
  counts_table = indir + '/' + samp + '.krak.report_bracken_species.filtered.to_merge'

  # file to produce
  outf = counts_table + '.norm_brack'

  with open(counts_table, 'r') as infile:
    with open(outf, 'w') as outfile:
      if samp in failed_samples:
        pass
      else:
        # look up bracken_classified_reads
        brack_classified_reads = int(samp_name_to_reads[samp])
        for line in infile:
          line=line.rstrip('\n').split('\t')
          lin_names, lin_taxids, num_reads = line[0], line[1], float(line[2])
          frac_reads_brack = str(round(num_reads/brack_classified_reads, 10))
          outfile.write('\t'.join([lin_names, lin_taxids, frac_reads_brack]) + '\n')
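# Likely pipeline_scripts/process_filtered_kraken.py, inferred from the Snakemake rule further
# below; it creates empty placeholder (.temp) outputs when Bracken would fail for a sample.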
import sys

kraken_filtered, threshold, kraken_db = \
sys.argv[1], int(sys.argv[2]), sys.argv[3]

# make a dictionary that will be later useful
nodes_file = kraken_db + '/taxonomy/nodes.dmp'

def child_to_parent(nodes):
  with open(nodes, 'r') as infile:
    child_to_parent = {}
    for line in infile:
      line=line.rstrip('\n').split('\t')
      child, parent = line[0], line[2]
      child_to_parent[child] = parent
  return child_to_parent

child_parent = child_to_parent(nodes_file)

# parse filtered Kraken report, figure out whether Bracken will error
with open(kraken_filtered, 'r') as infile:
  bracken_will_fail = True
  for line in infile:
    line = line.rstrip('\n').split('\t')
    rank, assigned = line[5], int(line[1])
    if rank == 'S' and assigned >= threshold:
      bracken_will_fail = False
      break

if bracken_will_fail:

  # output an empty Bracken species report
  with open(kraken_filtered + '.bracken.temp', 'w') as outfile:
    pass

  # output an empty Kraken-style Bracken report
  out_fname = kraken_filtered[:kraken_filtered.rfind('.')] + '_bracken_species.filtered.temp'
  with open(out_fname, 'w') as outfile:
    pass

  # output an empty scaled Bracken species report
  with open(kraken_filtered + '.bracken.scaled.temp', 'w') as outfile:
    pass
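# Likely pipeline_scripts/scale_bracken.py, inferred from the Snakemake rule further below;
# it computes genome length-corrected abundances and other normalizations for one sample.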
import pandas as pd
import sys

def main():
    ###step 1- input
    kraken_db = sys.argv[1]
    bracken_report = sys.argv[2]
    kraken_filtering_decisions =  sys.argv[3]
    genome_length_path = kraken_db + "/library/species_genome_size.txt"
    read_length = int(sys.argv[4])
    paired = sys.argv[5]
    if str(paired) == 'True':
        read_length = read_length*2

    ##step two- reading files to DFs
    bracken_df = pd.read_csv(bracken_report, sep="\t")
    bracken_df['taxonomy_id'] = bracken_df['taxonomy_id'].astype('str')
    kraken_df = pd.read_csv(kraken_filtering_decisions, sep="\t")
    kraken_df['species_taxid'] = kraken_df['species_taxid'].astype('str')
    length_df = pd.read_csv(genome_length_path, sep="\t")

    ##step 3- calculating stats
    tmp_merged_df = pd.merge(bracken_df,length_df[['species_level_taxa','max','mean','median']], left_on='taxonomy_id', right_on='species_level_taxa')
    per_million_scaling = (tmp_merged_df['new_est_reads'].sum() / 1000000) # calculate how many millions of reads there are
    tmp_merged_df['fraction_total_reads'] = tmp_merged_df['new_est_reads']/tmp_merged_df['new_est_reads'].sum()
    tmp_merged_df['tmp_scaling'] = tmp_merged_df['fraction_total_reads']*(tmp_merged_df['median'].sum()/tmp_merged_df['median']) # scale fraction_total_reads to reflect difference in genome length - smaller genome should be given greater weight
    tmp_merged_df['rel_taxon_abundance'] = tmp_merged_df['tmp_scaling']/tmp_merged_df['tmp_scaling'].sum() # make sure that the scaling adds up to 1
    tmp_merged_df['genetic_abundance'] = tmp_merged_df['fraction_total_reads']
    tmp_merged_df['depth'] = (tmp_merged_df['new_est_reads']*read_length)/tmp_merged_df['median'] #which is estimation of copies
    tmp_merged_df['reads_per_million_bases'] = (tmp_merged_df['new_est_reads'] / (tmp_merged_df['median'] / 1000000)).round(4) # equivalent to RPK but with megabases instead of kilobases
    tmp_merged_df['reads_per_million_reads'] = (tmp_merged_df['new_est_reads'] / per_million_scaling).round(4) # reads per million total reads
    tmp_merged_df['RPMPM'] =  tmp_merged_df['new_est_reads'] / ((tmp_merged_df['median'] / 1000000)* (per_million_scaling)) #reads per million base pairs per million reads
    tmp_merged_df['copies_per_million_reads'] = tmp_merged_df['depth'] / (per_million_scaling).round(4)
    tmp_merged_df = tmp_merged_df.merge(kraken_df, how='left', left_on='taxonomy_id', right_on='species_taxid')

    ## step 4 - output
    tmp_merged_df.rename(columns={'median': 'median_genome_length', 'max_cov':'breadth'}, inplace=True)
    processed_report = tmp_merged_df.drop(['tmp_scaling', 'species_level_taxa', 'max','mean','species_taxid','superkingdom','max_uniq_minimizers','kept'], axis=1)
    processed_report.to_csv(bracken_report+'.scaled', sep="\t", index=False)

if __name__ == "__main__":
    main()
shell: """
  kraken2 --db {params.db} --threads {threads} --output {output.krak} \
  --report {output.krak_report} --report-minimizer-data {params.paired_string} \
  {params.gzipped_string} {input.reads} --confidence {params.confidence_threshold}
  """
shell: """
  python {params.repo_dir}/pipeline_scripts/filter_kraken_reports.py {input.krak_report} {params.db} \
  {params.cov_thresh_bacterial} {params.cov_thresh_viral} {params.minimizer_thresh_bacterial} \
  {params.minimizer_thresh_viral} \
  {params.cov_thresh_arc} {params.cov_thresh_euk} {params.minimizer_thresh_arc} \
  {params.minimizer_thresh_euk} 
"""
SnakeMake From line 124 of main/Snakefile
shell: """
  python {params.repo_dir}/pipeline_scripts/process_filtered_kraken.py {input.krak_report_filtered} \
  {params.threshold} {params.db}
  touch {output.completed}
  """
SnakeMake From line 143 of main/Snakefile
shell: """
  kraken2 --db {params.db} --threads {threads} --output {output.fwd_krak} \
  {params.gzipped_string} {input.fwd_reads} --confidence {params.confidence_threshold}
  kraken2 --db {params.db} --threads {threads} --output {output.rev_krak} \
  {params.gzipped_string} {input.rev_reads} --confidence {params.confidence_threshold}
  """
shell: """
  # protection against Bracken error
  [ -f {params.possible_1} ] && \
  ( cp {params.possible_1} {output.brack_report_1};
  cp {params.possible_2} {output.brack_report_2} ) \
  || bracken -d {params.db} -i {input.krak_report} \
  -o {output.brack_report_1} -r {params.readlen} \
  -l 'S' -t {params.threshold}
  """
shell: """
  count=`find {params.classdir} -maxdepth 1 -name "*.krak.report.filtered.bracken.scaled.temp" | wc -l`
  if [[ $count != 0 ]]
  then
    ls {params.classdir}/*.krak.report.filtered.bracken.scaled.temp | rev | \
    cut -d'/' -f 1 | rev | cut -d'.' -f 1 > {output.failed}
  else
    touch {output.failed}
  fi
  """
SnakeMake From line 208 of main/Snakefile
shell: """
  python {params.repo_dir}/pipeline_scripts/prep_to_merge_counts.py \
  {input.brack_report} {params.db} {input.failed_file} {wildcards.samp}
  """
SnakeMake From line 230 of main/Snakefile
shell: """
  bash {params.repo_dir}/pipeline_scripts/calc_total_reads.sh {wildcards.samp} \
  {params.krak_brack_dir} {output.tot_reads_file}
  """
SnakeMake From line 246 of main/Snakefile
shell: """
  echo -e 'Samp_Name\tTot_Samp_Reads\tUnassigned_Step_One\tAssigned_Step_Three\tUnassigned_Step_Three' > {output.outf}
  cat {params.classdir}/*total_reads.txt >> {output.outf}
  """
SnakeMake From line 258 of main/Snakefile
shell: """
  python {params.repo_dir}/pipeline_scripts/prep_to_merge_normed.py \
  {input.tot_reads_file} {params.classdir} {input.failed_file}
  """
SnakeMake From line 274 of main/Snakefile
shell: """
  ls {params.classdir}/*to_merge | rev | cut -d'/' -f 1 | rev > {params.classdir}/counts_tables.txt
  ls {params.classdir}/*norm_brack | rev | cut -d'/' -f 1 | rev > {params.classdir}/norm_brack_tables.txt
  Rscript {params.repo_dir}/pipeline_scripts/merging_bracken_tables.R {params.classdir} {params.finaldir} \
  {input.failed_file} {output.list1} {output.list2}
  """
SnakeMake From line 293 of main/Snakefile
shell: """
  [ -f {params.possible} ] && \
  cp {params.possible} {output.scaled_bracken} || \
  python {params.repo_dir}/pipeline_scripts/scale_bracken.py {params.db} {input.bracken_report} \
  {input.filtering_decisions} {params.readlen} {params.paired}
  """
SnakeMake From line 315 of main/Snakefile
shell: """
  ls {params.classdir}/*scaled | rev | cut -d'/' -f 1 | rev > {params.classdir}/scaled_reports.txt
  Rscript {params.repo_dir}/pipeline_scripts/merging_scaled_bracken.R {params.classdir} {output.list} {input.failed_file}
  python {params.repo_dir}/pipeline_scripts/merging_scaled_bracken.py {params.db} {output.merged_temp} {params.finaldir}
  """
SnakeMake From line 335 of main/Snakefile
shell: """
  mkdir {params.intdir}
  mv {params.classdir}/*krak {params.intdir}
  mv {params.classdir}/*krak.report {params.intdir}
  mv {params.classdir}/*krak.report.filtered {params.intdir}
  mv {params.classdir}/*krak.report.filtered.bracken {params.intdir}
  mv {params.classdir}/*krak.report.filtering_decisions.txt {params.intdir}
  if [[ {params.delete} == 'True' ]]; then
    rm -r {params.intdir}
  fi
  touch {output.completed}
  """
SnakeMake From line 355 of main/Snakefile
