NanoClass: A Taxonomic Meta-Classifier for Oxford Nanopore Amplicon Sequencing Data

Version: v0.3.1-beta

NanoClass is a taxonomic meta-classifier for 16S/18S amplicon sequencing data generated with the Oxford Nanopore MinION. With a single command, you can run up to 11 popular classification tools on multiple samples in parallel, including BLASTN, Centrifuge, Kraken2, IDTAXA, MegaBLAST, dcMegaBLAST, Minimap2, Mothur, QIIME2, RDP and SPINGO. Optional read preparation steps, such as quality trimming, length filtering and sub-sampling, are an integral part of the pipeline.

NanoClass automatically installs all software packages and dependencies, downloads and builds required taxonomic databases and runs the analysis on your samples.

Getting started

Requirements

The entire NanoClass workflow can be run on a powerful desktop computer, but for many applications a laptop will do. Most classification tools implemented in NanoClass run in a matter of minutes to hours. Prerequisites are Conda and Snakemake.

NanoClass automatically installs all other software packages and dependencies.

Installation

You can either clone NanoClass, like so:

git clone https://github.com/ejongepier/NanoClass.git

Or download and extract the zip archive from https://github.com/ejongepier/NanoClass.

NanoClass is immediately ready for use. See also the Documentation.

Usage

Quick start

Enter your samples and the paths to your fastq.gz files in samples.csv. Sample labels should be unique, and both sample and run labels should contain letters and numbers only. The barcode column should be left empty, meaning your input files should already be demultiplexed. For an example, see the samples.csv file.
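
For illustration only, a minimal samples.csv could look like the following (labels and paths here are hypothetical; the samples.csv shipped with NanoClass is authoritative for the column layout):

run,sample,barcode,path
run01,soil1,,data/run01/soil1.fastq.gz
run01,soil2,,data/run01/soil2.fastq.gz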

After editing the samples.csv, the entire pipeline can be run with a single command:

snakemake --use-conda --cores <ncores>

where <ncores> is the number of CPU cores/jobs that Snakemake may run in parallel on your system.
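
To preview the jobs Snakemake would schedule before committing compute time, use the standard dry-run flag, e.g.:

snakemake --use-conda --cores 8 -n

Dropping -n starts the actual run with up to 8 jobs in parallel.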

Customizing

You can customize the pipeline through the config.yaml, e.g. by running only a subset of the reads or classification tools, or by changing the default Silva 16S taxonomic database.
For details on how to customize NanoClass, see the Documentation.
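
As a sketch of the idea only (the keys below are illustrative, not the pipeline's actual schema; the authoritative options live in the shipped config.yaml):

# config.yaml -- hypothetical excerpt
subsample: 100                             # classify only a subset of reads per sample
methods: [kraken2, centrifuge, minimap2]   # run only these classifiers
database: silva-16S                        # swap the default Silva 16S database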

Report

After successful execution, you can create an interactive HTML report with:

snakemake --report report/NanoClass.zip

Authors

Evelien Jongepier

Code Snippets

shell:
    """
    makeblastdb -in {input} -parse_seqids \
        -dbtype nucl 2>&1 | tee -a {log}
    """
shell:
    """
    fasta-splitter --n-parts {params.n_chunks} {input} \
        --nopad --out-dir {params.out_dir}
    """
shell:
    """
    blastn -task 'blastn' -db {input.db} -evalue {params.eval} \
        -perc_identity {params.pident} -max_target_seqs {params.nseqs} \
        -query {input.fasta} -out {output.all} -outfmt 6 \
        2> {log}
    """
shell:
    """
    cat {input.out} | awk -F '\\t' -v l={params.alnlen} \
        '{{if ($4 > l) print $0}}' \
        > {output.out}
    cat {input.benchm} | grep -P "^[0-9]" | \
      awk '{{sum+=$1}} END {{print "s"; print sum/{params.threads}}}' \
      > {output.benchm} 2> {log}
    """
shell:
    """
    scripts/tolca.py -b {input.blast} -t {input.db} \
         -l {output} -c {params.lcacons} > {log}
    """
From line 113 of rules/blastn.smk
shell:
    "scripts/tomat.py -l {input} 2> {log}"
From line 132 of rules/blastn.smk
shell:
    """
    centrifuge-download -o db/centrifuge/taxonomy taxonomy > {log} 2>&1
    wget {params.map_url} -q -O - | gzip -d -c - | \
        awk '{{print $1\".\"$2\".\"$3\"\t\"$(NF)}}' \
        > {output.tax_map} 2>> {log}
    scripts/todb.py -s {input.seq} -t {input.tax} -m centrifuge \
        -S {output.ref_seqs} -T {output.ref_tax} 2>> {log}
    """
shell:
    """
    centrifuge-build \
      --threads {threads} \
      --conversion-table {input.conversion_table} \
      --taxonomy-tree {input.tax_tree} \
      --name-table {input.name_table} \
      {input.ref_seqs} {params.prefix} > {log} 2>&1
    """
shell:
    """
    centrifuge -x {params.index_prefix} \
      -U {input.fastq} \
      --threads {threads} \
      --report-file {output.report} \
      -S  {output.classification} \
      --met-stderr > {log} 2>&1
    """
shell:
    "scripts/tomat.py -c {input.out} -f {input.ref_seqs} 2> {log}"
shell:
    """
    wget -q -O db/common/db.zip {params.url}
    unzip -q -p -j db/common/db.zip \
        */taxonomy/{params.ssu}_only/99/majority_taxonomy_7_levels.txt \
        > {output.ref_tax} 2> {log}
    unzip -q -p -j db/common/db.zip \
        */rep_set/rep_set_{params.ssu}_only/99/silva_*_99_{params.ssu}.fna \
        > {output.ref_seqs} 2>> {log}
    unzip -q -p -j db/common/db.zip \
        */rep_set_aligned/99/99_alignment.fna.zip > tmp.aln.zip
    unzip -q -p tmp.aln.zip > {output.ref_aln} 2>> {log}
    rm db/common/db.zip tmp.aln.zip 2>> {log}
    """
shell:
    """
    mkdir -p ./plots
    Rscript scripts/barplot.R {input} 2> {log}
    """
shell:
    "scripts/toconsensus.py -l {input} 2> {log}"
shell:
    """
    mkdir -p ./plots
    Rscript scripts/lineplot.R {input} 2> {log}
    """
From line 106 of rules/common.smk
shell:
    """
    mkdir -p ./plots
    mkdir -p ./tables
    Rscript scripts/timeplot.R {input} 2> {log}
    """
From line 132 of rules/common.smk
shell:
    """
    fasta-splitter --n-parts {params.n_chunks} {input} \
        --nopad --out-dir {params.out_dir}
    """
shell:
    """
    blastn -task 'dc-megablast' -db {input.db} -evalue {params.eval} \
        -perc_identity {params.pident} -max_target_seqs {params.nseqs} \
        -query {input.fasta} -out {output.all} -outfmt 6 \
        2> {log} 
    """
shell:
    """
    cat {input.out} | awk -F '\\t' -v l={params.alnlen} \
        '{{if ($4 > l) print $0}}' \
        > {output.out}
    cat {input.benchm} | grep -P "^[0-9]" | \
      awk '{{sum+=$1}} END {{print "s"; print sum / {params.threads}}}' \
      > {output.benchm} 2> {log}
    """
shell:
  	"""
    scripts/tolca.py -b {input.blast} -t {input.db} \
         -l {output} -c {params.lcacons} > {log}
    """
shell:
    "scripts/tomat.py -l {input} 2> {log}"
shell:
    "Rscript scripts/learntaxa.R {input.ref_seqs} {input.ref_tax} {output} 2> {log}"
shell:
    """
    Rscript scripts/idtaxa.R {input.db} {input.query} {output.tmp} {threads} {params} 2> {log}
    awk 'BEGIN{{FS=OFS="\\t"}} {{for (i=1; i<=7; i++) if ($i ~ /^ *$/) $i="NA"}} 1' {output.tmp} > {output.out}
    """
shell:
    "scripts/tomat.py -l {input.list} 2> {log}"
shell:
    """
    kraken2-build --db {params.db} --special {params.db_type} \
      --threads {threads} > {log} 2>&1
    """
shell:
    """
    kraken2 --db {params.db_dir} \
        --output {output.out} \
        --report {output.report} \
        --gzip-compressed \
        --threads {threads} {input.fastq} 2> {log}
    """
shell:
    """
    scripts/tomat.py -k {input.kraken_out} -f {input.silva_seqs} \
      -m {input.kraken_map} 2> {log}
    """
shell:
    """
    mapseq -nthreads {threads} {input.query} {input.ref_seqs} \
       {input.ref_tax}  > {output} 2> {log}
    """
shell:
    "scripts/tomat.py -b {input.out} -t {input.db} 2> {log}"
shell:
    """
    fasta-splitter --n-parts {params.n_chunks} {input} \
        --nopad --out-dir {params.out_dir}
    """
shell:
    """
    blastn -task 'megablast' -db {input.db} -evalue {params.eval} \
        -perc_identity {params.pident} -max_target_seqs {params.nseqs} \
        -query {input.fasta} -out {output.all} -outfmt 6 \
        2> {log} 
    """
shell:
    """
    cat {input.out} | awk -F '\\t' -v l={params.alnlen} \
        '{{if ($4 > l) print $0}}' \
        > {output.out}
    cat {input.benchm} | grep -P "^[0-9]" | \
      awk '{{sum+=$1}} END {{print "s"; print sum / {params.threads}}}' \
      > {output.benchm} 2> {log}
    """
shell:
  	"""
    scripts/tolca.py -b {input.blast} -t {input.db} \
         -l {output} -c {params.lcacons} > {log}
    """
shell:
    "scripts/tomat.py -l {input} 2> {log}"
shell:
    """
    minimap2 {params.extra} -t {threads} -N {params.nseqs} \
      {input.target} {input.query} -o {output} 2> {log}
    """
shell:
    """
    samtools sort -@ {threads} {input} | \
    samtools view -@ {threads} | \
    cut -f 1,3 > {output} 2> {log}
    """
shell:
    """
    scripts/tolca.py -b {input.mm} -t {input.db} \
         -l {output} -c {params.lcacons} > {log}
    """
shell:
    "scripts/tomat.py -l {input} 2> {log}"
shell:
    """
    scripts/todb.py -s {input.aln} -t {input.tax} -m mothur \
        -S {output.aln} -T {output.tax} 2> {log}
    """
shell:
    """
    mkdir -p {output.dir}
    cp {input} {output.dir}
    queryf=$(basename {input.query})
    alnf=$(basename {input.aln})
    echo \"align.seqs(candidate={output.dir}/$queryf, template={output.dir}/$alnf, \
        processors={threads}, ksize=6, align=needleman); -q\" > {output.dir}/mothur.cmd
    mothur {output.dir}/mothur.cmd 2> {log}
    awk -F '\\t' -v OFS='\\t' '{{if (NR==1) printf "%s","#"; print $1, $3}}' \
      {output.dir}/{params.file} > {output.out} 2>> {log} 
    """
shell:
    "scripts/tomat.py -b {input.out} -t {input.tax} 2> {log}"
shell:
    """
    porechop --input {input} \
      --output {output} \
      --threads {threads} \
      --check_reads {params.check_reads} \
      --discard_middle > {log}
    """
shell:
    """
    gzip -d -c {input} | \
      NanoFilt --quality {params.quality} --length {params.min_len} --maxlength {params.max_len} | \
      gzip > {output} 2> {log}
    """
shell:
    """
    seqtk sample -s {params.seed} {input} {params.n} | \
        gzip -c > {output}
    """
shell:
    "zcat < {input} | sed -n '1~4s/^@/>/p;2~4p' > {output}"
shell:
    """
    pistis --fastq {input} --output {output} \
      --downsample {params.downsample} 2> {log}
    """
shell:
    """
    NanoStat --fastq {input} --name {output} \
      --threads {threads} 2> {log}
    """
shell:
    """
    qiime tools import \
        --type 'FeatureData[Sequence]' \
        --input-path {input.fasta} \
        --output-path {output} \
        2>&1 | tee -a {log}
    """
shell:
    """
    qiime tools import \
        --type 'FeatureData[Sequence]' \
        --input-path {input} \
        --output-path {output} \
        2>&1 | tee -a {log}
    """
shell:
    """
    qiime tools import \
        --type FeatureData[Taxonomy] \
        --input-format HeaderlessTSVTaxonomyFormat \
        --input-path {input} \
        --output-path {output} \
        2>&1 | tee -a {log}
    """
shell:
    """
    qiime feature-classifier classify-consensus-vsearch \
        --i-query {input.query} \
        --i-reference-reads {input.ref_seqs} \
        --i-reference-taxonomy {input.taxo} \
        --o-classification {output.qza} \
        --p-threads {threads} \
        --p-perc-identity {params.pident} \
        --p-maxaccepts {params.nseqs} \
        --p-min-consensus {params.lcacons} \
        2>&1 | tee -a {log}
    qiime tools export \
        --input-path {output.qza} \
        --output-path {output.path} 2>&1 | tee -a {log}
    mv {output.path}/taxonomy.tsv {output.out} 2>> {log}
    """
shell:
    """
    cut -f 1,2 {input} | grep -P -v '\\tUnassigned$' | \
    awk -F '\\t|;' -v OFS='\\t' '{{NF=7}}1' | \
    awk 'BEGIN {{ FS = OFS = "\\t" }} {{ for(i=1; i<=NF; i++) if($i == "") $i = "NA" }}; 1' | \
    sed "s/D_[[:digit:]]__//g" | \
    sed "s/^Feature ID.*/\#readid\\tDomain\\tPhylum\\tClass\\tOrder\\tFamily\\tGenus/" \
    > {output} 2> {log}
    """
From line 118 of rules/qiime.smk
shell:
    "scripts/tomat.py -l {input.list} 2> {log}"
From line 139 of rules/qiime.smk
shell:
    """
    scripts/todb.py -s {input.seq} -t {input.tax} -m rdp -S tmp.seq -T {output.tax} 2> {log}
    gzip -c tmp.seq > {output.seq} && rm tmp.seq 2>> {log}
    """
From line 15 of rules/rdp.smk
shell:
    "Rscript scripts/assigntaxonomy.R {input.db} {input.query} {output} {threads} {params} 2> {log}"
From line 40 of rules/rdp.smk
shell:
    "scripts/tomat.py -l {input.list} 2> {log}"
From line 57 of rules/rdp.smk
shell:
    """
    scripts/todb.py -s {input.seq} -t {input.tax} -m spingo \
        -S {output.seq} -T {output.tax} 2> {log}
    spindex -k 8 -p {threads} -d {output.seq} 2>> {log}
    """
shell:
    """
    spingo -d {input.db} -k 8 -a \
      -p {threads} -i {input.fasta} \
      > {output.tmp} 2> {log}
    sed 's/ /\\t/' {output.tmp} | \
    awk -F '\\t' -v OFS='\\t' '{{
        if (NR == 1)
            print "#readid","Domain","Phylum","Class","Order","Family","Genus";
        else
            print $1, $4, $6, $8, $10, $12, $14
    }}' > {output.out} 2> {log}
    """
shell:
    "scripts/tomat.py -l {input.list} 2> {log}"
rm(list=ls())

suppressPackageStartupMessages(library(dada2))
suppressPackageStartupMessages(library(seqinr))

args = commandArgs(trailingOnly=TRUE)
threads = as.numeric(args[4])
pident = as.numeric(args[5])

query <- read.fasta(args[2], as.string = T, forceDNAtolower = F, set.attributes = F)

set.seed(100)
taxann <- assignTaxonomy(seqs =  unlist(query), refFasta = args[1], 
          tryRC = T, outputBootstraps=F, minBoot = pident, multithread = threads, verbose = F, 
          taxLevels = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus")
          )

# assignTaxonomy() keeps rows in input order and uses the sequences as row
# names, so the read ids can be recovered from the named query vector
queryid <- names(query)
taxodf <- as.data.frame(taxann)
for (i in 1:length(taxodf)) taxodf[,i] <- gsub('_', ' ', taxodf[,i])
names(taxodf)[1] <- "Domain"
taxodf$"#readid" <- queryid
write.table(taxodf[,c(7,1:6)], file = args[3], row.names=F, col.names=T, quote=F, sep='\t')
suppressPackageStartupMessages(library(phyloseq))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(vroom))
suppressPackageStartupMessages(library(tidyr))

args = commandArgs(trailingOnly=TRUE)

taxmat <- as.data.frame(unique(vroom(delim = '\t', args)))
taxmat <- taxmat[!is.na(taxmat$Domain),]

taxmat <- unique(taxmat)
names(taxmat)[1] <- "taxid"
write.table(taxmat, file="tables/taxonomy-table.tsv", row.names=F, col.names=T, sep='\t', quote=F)

rownames(taxmat) <- taxmat$taxid
taxmat$taxid <- NULL
taxmat <- as.matrix(taxmat)

file <- gsub(".taxmat$", ".otumat", args)
sam <- data.frame(run = rep(NA, length(file)), sample = rep(NA, length(file)), method = rep(NA, length(file)))

for (i in 1:length(file)){
  otumatje <- read.table(file[i], header = T, sep = '\t', comment = "")
  names(otumatje)[1] <- "taxid"
  names(otumatje)[2] <- paste0(strsplit(file[i], "/")[[1]][2],"_",names(otumatje)[2])
  ifelse (i == 1, otumat <- otumatje, otumat <- merge(otumat, otumatje, all=TRUE))

  lab = strsplit(file[i], "/")[[1]][4]
  sam$run[i] = strsplit(file[i], "/")[[1]][2]
  sam$sample[i] = strsplit(lab, "[.]")[[1]][1]
  sam$method[i] = strsplit(lab, "[.]")[[1]][2]
} 

otumat[is.na(otumat)] <- 0

## for some custom DB like BOLD local copy, the same taxonomic lineage may occur 2x.
## This causes an error when they are used as rownames, which should be unique --> fix: aggregate
otumat <- aggregate(otumat[2:length(otumat)], by=list(otumat$taxid), sum)
names(otumat)[1] <- "taxid"
write.table(otumat, file="tables/otu-table.tsv", row.names=F, col.names=T, sep='\t', quote=F)

rownames(otumat) <- otumat$taxid
otumat$taxid <- NULL
otumat <- as.matrix(otumat)

rownames(sam) <- colnames(otumat)


OTU = otu_table(otumat, taxa_are_rows = TRUE)
TAX = tax_table(taxmat)
SAM = sample_data(sam)

physeq = phyloseq(OTU, TAX, SAM)
pphyseq  = transform_sample_counts(physeq, function(x) x / sum(x) )

theme_set(theme_bw())

for (level in colnames(taxmat)){
    top.taxa <- tax_glom(physeq, level)
    TopNOTUs <- names(sort(taxa_sums(top.taxa), TRUE)[1:20])
    BottumNOTUs <- names(taxa_sums(top.taxa))[which(!names(taxa_sums(top.taxa)) %in% TopNOTUs)]
    merged_physeq = merge_taxa(top.taxa, BottumNOTUs, 2)

    mdf = psmelt(merged_physeq); names(mdf)[names(mdf) == level] <- "level"
    mdf$OTU[which(is.na(mdf$level))] <- "aaaOther"
    mdf$level[which(is.na(mdf$level))] <- "aaaOther"
    aggr_mdf <- aggregate(Abundance ~ sample + run + method + level, data = mdf, sum)

    labs = aggr_mdf$level; labs[labs=="aaaOther"] <- "Other"
    cols = scales::hue_pal()(length(unique(labs))); cols[unique(labs) == "Other"] <- "#CCCCCC"

    p = ggplot(aggr_mdf, aes_string(x = "method", y = "Abundance", fill = "level"))
    p = p + scale_fill_manual(name = level, labels = unique(labs), values = cols)
    p = p + facet_grid(paste0("", aggr_mdf$run) ~ paste0("", aggr_mdf$sample))
    p = p + geom_bar(stat = "identity", position = "stack",  color = "black", size = 0.1)
    p = p + guides(fill=guide_legend(ncol=1))
    p = p + labs(x = "Method", y = "Absolute abundance")
    p = p + theme(axis.text.x = element_text(angle = -90, hjust = 0, size = 5))

    ggsave(paste0("plots/aabund-", level, "-by-sample.pdf"), plot=p, device="pdf")

    p = ggplot(aggr_mdf, aes_string(x = "sample", y = "Abundance", fill = "level")) 
    p = p + scale_fill_manual(name = level, labels = unique(labs), values = cols)
    p = p + facet_grid(paste0("", aggr_mdf$run) ~ paste0("", aggr_mdf$method))
    p = p + geom_bar(stat = "identity", position = "stack",  color = "black", size = 0.1)
    p =	p + guides(fill=guide_legend(ncol=1))  
    p = p + labs(x = "Sample", y = "Absolute abundance")
    p = p + theme(axis.text.x = element_text(angle = -90, hjust = 0, size = 5))

    ggsave(paste0("plots/aabund-", level, "-by-method.pdf"), plot=p, device="pdf")
}


for (level in colnames(taxmat)){
    top.taxa <- tax_glom(pphyseq, level)
    TopNOTUs <- names(sort(taxa_sums(top.taxa), TRUE)[1:20])
    BottumNOTUs <- names(taxa_sums(top.taxa))[which(!names(taxa_sums(top.taxa)) %in% TopNOTUs)]
    merged_physeq = merge_taxa(top.taxa, BottumNOTUs, 2)

    mdf = psmelt(merged_physeq); names(mdf)[names(mdf) == level] <- "level"
    mdf$OTU[which(is.na(mdf$level))] <- "aaaOther"
    mdf$level[which(is.na(mdf$level))] <- "aaaOther"
    aggr_mdf <- aggregate(Abundance ~ sample + run + method + level, data = mdf, sum)

    labs = aggr_mdf$level; labs[labs=="aaaOther"] <- "Other"
    cols = scales::hue_pal()(length(unique(labs))); cols[unique(labs) == "Other"] <- "#CCCCCC"

    p = ggplot(aggr_mdf, aes_string(x = "method", y = "Abundance", fill = "level"))
    p = p + scale_fill_manual(name = level, labels = unique(labs), values = cols)
    p = p + facet_grid(paste0("", aggr_mdf$run) ~ paste0("", aggr_mdf$sample))
    p = p + geom_bar(stat = "identity", position = "stack",  color = "black", size = 0.1)
    p =	p + guides(fill=guide_legend(ncol=1))  
    p = p + labs(x = "Method", y = "Absolute abundance")
    p = p + theme(axis.text.x = element_text(angle = -90, hjust = 0, size = 5))

    ggsave(paste0("plots/rabund-", level, "-by-sample.pdf"), plot=p, device="pdf")

    p = ggplot(aggr_mdf, aes_string(x = "sample", y = "Abundance", fill = "level"))
    p = p + scale_fill_manual(name = level, labels = unique(labs), values = cols)
    p = p + facet_grid(paste0("", aggr_mdf$run) ~ paste0("", aggr_mdf$method))
    p = p + geom_bar(stat = "identity", position = "stack",  color = "black", size = 0.1)
    p =	p + guides(fill=guide_legend(ncol=1))  
    p = p + labs(x = "Sample", y = "Relative abundance")
    p = p + theme(axis.text.x = element_text(angle = -90, hjust = 0, size = 5))

    ggsave(paste0("plots/rabund-", level, "-by-method.pdf"), plot=p, device="pdf")
}
rm(list=ls())

suppressPackageStartupMessages(library("DECIPHER"))

args = commandArgs(trailingOnly=TRUE)

threads = as.numeric(args[4])

pctth = as.numeric(args[5])

query <- readDNAStringSet(args[2])
# load() restores the trained classifier under the name it was saved with
# in learntaxa.R (trained_classifier)
load(args[1])

taxann <- IdTaxa(query, trained_classifier, strand = "both", threshold = pctth, processors = threads, verbose = F)

assignment <- sapply(taxann, function(x) paste(x$taxon, collapse="\t"))
names(assignment) <- sapply(strsplit(names(assignment)," "), `[`, 1)

dl <- lapply(assignment, function (x) {gsub("D_[0-6]__", "", x)})
dl <- lapply(dl, function (x) {gsub("Root\t", "", x)})

df <- setNames(as.data.frame(unlist(dl)),"Domain\tPhylum\tClass\tOrder\tFamily\tGenus")
df$'#readid' <- rownames(df)

write.table(df[,c(2,1)], file=args[3], row.names=F, col.names=T, quote=F, sep='\t')
rm(list=ls())

suppressPackageStartupMessages(library("DECIPHER"))

args = commandArgs(trailingOnly=TRUE)

# train the classifier
refseqs <- readDNAStringSet(args[1])
taxo = read.table(args[2], sep = "\t", comment.char = "", quote="" )

taxonomy <- paste0(taxo$V1, " Root;", taxo$V2)
trained_classifier <- LearnTaxa(train = refseqs, taxonomy = taxonomy, verbose = T)

save(trained_classifier, file = args[3])
suppressPackageStartupMessages(library(ggplot2))

args = commandArgs(trailingOnly=TRUE)

dl = list()
for (i in 1:length(args)){
  # Get metadata
  lab = strsplit(args[i], "/")[[1]][4]
  run = strsplit(args[i], "/")[[1]][2]
  sample = strsplit(lab, "[.]")[[1]][1]
  method = strsplit(lab, "[.]")[[1]][2]
  # Read data and compute total prop precision
  tbl <- read.table(args[i], header = T, sep = '\t', 
                    comment = "", row.names = 1)
  dt <- as.data.frame(colSums(tbl, na.rm=T)/
                      colSums(!is.na(tbl)))
  names(dt) <- "precision"
  # Add meta data columns
  dt$level <- row.names(dt)
  dt$rank <- factor(dt$level, ordered=T, 
      levels=c("Domain","Phylum","Class",
               "Order","Family","Genus"))
  dt$run <- rep(run, nrow(dt))
  dt$sample <- rep(sample, nrow(dt))
  dt$method <- rep(method, nrow(dt))
  # Add to list
  dl[[i]] <- dt
}

df <- do.call(rbind, dl)

theme_set(theme_bw())
p = ggplot(df, aes(x = rank, y = precision, 
           group = method, col = method)) + 
    geom_line() + geom_point() +
    labs(x = "", y = "Precision") + ylim(0,1) +
    theme(axis.text.x = element_text(
        angle = 90, vjust = 0.5, hjust=1))
p = p + facet_grid(paste0("Run: ",run) ~ 
                   paste0("",sample))
ggsave("plots/precision.pdf", plot=p, device="pdf") 
suppressPackageStartupMessages(library(ggplot2))

args = commandArgs(trailingOnly=TRUE)

dl = list()
for (i in 1:length(args)){
  # Get metadata
  lab = strsplit(args[i], "/")[[1]][3]
  run = strsplit(args[i], "/")[[1]][2]
  sample = strsplit(strsplit(lab, "[_]")[[1]][3], "[.]")[[1]][1]
  method = strsplit(lab, "[_]")[[1]][1]
  # Read data
  dt <- as.data.frame(read.table(args[i], header = T, sep = '\t',
                    comment = "")[,1])
  # Add meta data columns
  dt$run <- rep(run, nrow(dt))
  dt$sample <- rep(sample, nrow(dt))
  dt$method <- rep(method, nrow(dt))
  # Add to list
  dl[[i]] <- dt
}

df <- do.call(rbind, dl)
names(df)[1] <- "s"

theme_set(theme_bw())

p1 = ggplot(df, aes(x = method, y = s, fill=method)) +
    geom_bar(stat="identity", color="black") +
    labs(x = "", y = "Runtime (s)") +
    theme(axis.text.x = element_text(
        angle = 90, vjust = 0.5, hjust=1))
p1 = p1 + facet_grid(paste0("Run: ",run) ~
                   paste0("Sample: ",sample))
p1 = p1 + theme(legend.position="none")
ggsave("plots/runtime-by-sample.pdf", plot=p1, device="pdf")

q1 = ggplot(df, aes(x = method, y = s, fill=method)) +
    geom_bar(stat="identity", color="black") +
    scale_y_continuous(trans = "log10") +
    labs(x = "", y = "Runtime (s)") +
    theme(axis.text.x = element_text(
        angle = 90, vjust = 0.5, hjust=1))
q1 = q1 + facet_grid(paste0("Run: ",run) ~
                   paste0("Sample: ",sample))
q1 = q1 + theme(legend.position="none")
ggsave("plots/runtime_log-by-sample.pdf", plot=q1, device="pdf")


p2 = ggplot(df, aes(x = sample, y = s, fill=sample)) +
    geom_bar(stat="identity", color="black") +
    labs(x = "", y = "Runtime (s)") +
    theme(axis.text.x = element_text(
        angle = 90, vjust = 0.5, hjust=1))
p2 = p2 + facet_grid(paste0("Run: ",run) ~
                   paste0("Method: ",method))
p2 = p2 + theme(legend.position="none")
ggsave("plots/runtime-by-method.pdf", plot=p2, device="pdf")

q2 = ggplot(df, aes(x = sample, y = s, fill=sample)) +
    geom_bar(stat="identity", color="black") +
    scale_y_continuous(trans = "log10") +
    labs(x = "", y = "Runtime (s)") +
    theme(axis.text.x = element_text(
        angle = 90, vjust = 0.5, hjust=1))
q2 = q2 + facet_grid(paste0("Run: ",run) ~
                   paste0("Method: ",method))
q2 = q2 + theme(legend.position="none")
ggsave("plots/runtime_log-by-method.pdf", plot=q2, device="pdf")
__author__ = "Evelien Jongepier"
__copyright__ = "Copyright 2020, Evelien Jongepier"
__email__ = "[email protected]"
__license__ = "MIT"

import sys
import re
import argparse
import collections


class ToConsensus(object):

    def __init__(self, taxlists, consensustax, pconsensus):
        self.taxlists = taxlists
        self.consensustax = consensustax
        self.pconsensus = pconsensus
        return


    def taxlist_per_sample_per_run_dict(self):
        '''
        Example output: {run : {sample1 : [path1, path2, ...],
          sample2 : [path3, path4, ...], ...}
        Stores the path to each of the input taxlists for each
        of the samples for each of the runs.
        '''
        # a lambda gives defaultdict a proper nested default factory
        sampledict = collections.defaultdict(
            lambda: collections.defaultdict(list))
        for taxlist in self.taxlists:
            run = taxlist.split('/')[1]
            sample = taxlist.split('/')[3].split('.')[0]
            if run not in sampledict:
                sampledict[run] = {sample:[taxlist]}
            elif sample not in sampledict[run]:
                sampledict[run][sample] = [taxlist]
            else:
                sampledict[run][sample].append(taxlist)
        return sampledict

    def run_all_runs_n_samples(self):
        '''
        Compute the consensus taxonomy per run and sample, then
        score each tool's taxlist against that consensus.
        '''
        sampledict = self.taxlist_per_sample_per_run_dict()
        for run in sampledict:
            for sample in sampledict[run]:
                ftaxlists = sampledict[run][sample]
                samplereaddict = self.get_read_dict_per_sample(ftaxlists)
                nlists = len(ftaxlists)
                consensusreaddict = self.comp_consensus_tax(samplereaddict, nlists)
                for ftaxlist in ftaxlists:
                    precisiondict = self.comp_precision_per_list(consensusreaddict, ftaxlist)
                    path = self.get_output_path(ftaxlist)
                    self.write_precision_tax(precisiondict, path)
                # the consensus itself could be written with write_consensus_tax()
        return


    def get_output_path(self, ftaxlist):
        dir = '/'.join(ftaxlist.split('/')[:3])
        method = ftaxlist.split('/')[2]
        sample = ftaxlist.split('/')[3].split('.')[0]
        outpath = dir + '/' + sample + '.' + method + '.precision'
        return outpath


    def get_read_dict_per_list(self, taxlist):
        '''
        Example output: readdict = {read_id: {Domain:tool1},
           {Phylum:tool1},{Class:tool1},...}
        Stores the tax classification of a single taxlist (i.e. the
        taxonomic classification produced by a single tool) in a dict
        of dicts. The outer dict has readid as key and a dict of tax
        as value. The inner / nested dicts have the tax level as key
        and the tax annotation at that level for that read in that
        taxlist as str.
        '''
        readdict = collections.defaultdict(dict)
        with open(taxlist) as f:
            for l in f.readlines():
                if l.startswith("#"):
                    headers = l.strip().split('\t')[1:]
                else:
                    readid = l.strip().split('\t')[0]
                    # avoid shadowing the taxlist argument
                    taxa = l.strip().split('\t')[1:7]
                    readdict[readid] = dict(
                        (headers[idx], tax) for idx, tax in enumerate(taxa))
        return readdict

    def get_read_dict_per_sample(self, ftaxlists):
        '''
        Example output: readdict = {read_id: {Domain:[tool1, tool2,
           tool3,...]},{Phylum:[tool1, tool2,tool3,...]},...}
        Stores the tax classification of all taxlists (i.e. the
        taxonomic classification based on all tools) in a dict
        of dicts of lists. The outer dict has readid as key and a
        dict of tax as value. The inner / nested dicts have the tax
        level as key and the tax annotation at that level for that
        read in each of the taxlists (thus tools) as a list.
        '''
        # a lambda gives defaultdict a proper nested default factory
        samplereaddict = collections.defaultdict(
            lambda: collections.defaultdict(list))
        for taxlist in ftaxlists:
            readdict = self.get_read_dict_per_list(taxlist)
            for readid in readdict:
                if readid not in samplereaddict:
                    samplereaddict[readid] = readdict[readid]
                    # change tax into list so all taxlists can be appended
                    for level in readdict[readid]:
                        samplereaddict[readid][level] = [samplereaddict[readid][level]]
                else:
                    for level in readdict[readid]:
                        tax = readdict[readid][level]
                        samplereaddict[readid][level].append(tax)
        return samplereaddict


    def comp_consensus_tax(self, samplereaddict, nlists):
        '''
        Example output: {readid = {Domain:cons},{Phylum:cons},...}
        Compute consensus for each read at each taxonomic level
        based on user defined level (0.5 = majority) and the taxo-
        nomic classifications of each tool.
        '''
        consensusreaddict = collections.defaultdict(dict)

        for readid in samplereaddict:
            for level in samplereaddict[readid]:
                freqdict = collections.Counter(samplereaddict[readid][level])
                maxkey = max(freqdict, key=freqdict.get)
                if (freqdict[maxkey] / float(nlists)) > self.pconsensus:
                    consensusreaddict[readid][level] = maxkey
                else:
                    consensusreaddict[readid][level] = "NA"
        return consensusreaddict


    def comp_precision_per_list(self, consensusreaddict, ftaxlist):
        '''
        Example output: {readid = {Domain:bool}, {Phylum:bool}, ...}
        For each read and each taxonomic level, a boolean indicating
        whether the classification matches the consensus classification
        or not.
        '''
        readdict = self.get_read_dict_per_list(ftaxlist)
        precisiondict = collections.defaultdict(dict)
        for readid in readdict:
            for level in readdict[readid]:
                if readdict[readid][level] == 'NA' or consensusreaddict[readid][level] == 'NA':
                    precisiondict[readid][level] =  'NA'
                elif readdict[readid][level] == consensusreaddict[readid][level]:
                    precisiondict[readid][level] = '1'
                else:
                    precisiondict[readid][level] = '0'
        return precisiondict


    def write_consensus_tax(self, consensusreaddict):
        '''
        Example out: "readid \t Domain \t Phylum ..."
        Write the consensus to file, incl. header line. Not called by
        run_all_runs_n_samples(); the consensus dict is passed in.
        '''
        levels = ['Domain','Phylum','Class','Order','Family','Genus']
        with open(self.consensustax, 'w') as f:
            f.write('#readid' + '\t' + '\t'.join(levels) + '\n')
            for readid in consensusreaddict:
                tax = []
                for level in levels:
                    tax.append(consensusreaddict[readid][level])
                f.write(readid + '\t' + '\t'.join(tax) + '\n')
        return


    def write_precision_tax(self, precisiondict, path):
        '''
        Example out: "readid \t Domain \t Phylum ..."
        Write precision (0/1) to file, incl. header line.
        '''
        levels = ['Domain','Phylum','Class','Order','Family','Genus']
        with open(path, 'w') as f:
            f.write('#readid' + '\t' + '\t'.join(levels) + '\n')
            for readid in precisiondict:
                tax = []
                for level in levels:
                    tax.append(precisiondict[readid][level])
                f.write(readid + '\t' + '\t'.join(tax) + '\n')
        return


def main(taxlists, consensustax, pconsensus):
    m = ToConsensus(
        taxlists = taxlists,
        consensustax = consensustax,
        pconsensus = pconsensus
    )
    m.run_all_runs_n_samples()
    return


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Obtain consensus and '
                        'majority taxonomic classifications based on a '
                        'list of taxlists as generated by various classi'
                        'fication tools. Taxlists can be obtained with '
                        'tomat.py')
    parser.add_argument('-l', '--taxlists',  nargs='+', default=None,
                        help='List of paths to taxlists, which contain '
                        'a read identifier in the first column followed '
                        'by tab-delimited 6-level taxonomic classifications '
                        'sensu Silva.')
    parser.add_argument('-o', '--consensustax', default=None,
                        help='Paths to where output consensus taxonomy '
                        'should be written.')
    parser.add_argument('-c', '--pconsensus', type=float, default=0.5,
                        help='Consensus threshold ranging between 0.5 and 1. '
                        'Minimum fraction of classification tools that need to '
                        'agree on each taxonomic classification at each '
                        'taxonomic level. A majority vote is obtained with 0.5. '
                        'Taxonomic annotations for which there is no consensus '
                        'are returned as NA.')
    args = parser.parse_args()

    assert (len(args.taxlists) > 2), '''
        \n Please provide at least three taxlists to determine
        consensus taxonomy. Read the help (--help) for further information.
        '''

    main(
        taxlists = args.taxlists,
        consensustax = args.consensustax,
        pconsensus = args.pconsensus
        )
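
A hypothetical invocation (paths are illustrative, following the run/method/sample layout the pipeline uses):

scripts/toconsensus.py -l results/run01/*/soil1.*.taxlist -c 0.5

This writes one .precision file per input taxlist, scoring each tool's per-read calls against the majority consensus across tools.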
__author__ = "Evelien Jongepier"
__copyright__ = "Copyright 2020, Evelien Jongepier"
__email__ = "[email protected]"
__license__ = "MIT"

import sys
import re
import argparse
import collections
from Bio import SeqIO


class ToDb(object):

    def __init__(self, refseq, reftax, method, outseq, outtax):
        self.refseq = refseq
        self.reftax = reftax
        self.method = method
        self.outseq = outseq
        self.outtax = outtax
        return


    def get_ref_dict(self):
        '''
        Example out: {id1:"domain1;phylum1;...", id2:"domain2;phylum2;..."}
        Create a dictionary with sequence id as key and taxonomic string as
        value. The exact formatting of the taxonomic string differs between
        methods.
        '''
        refdict = {}
        with open(self.reftax) as t:
            for line in t.readlines():
                id, tax = line.strip().split('\t')[:2]
                tax = re.sub(r'D_[0-6]__', '', tax)
                tax = re.sub(r' ', '_', tax)
                if self.method == 'centrifuge':
                    refdict[id] = tax
                else:
                    tax = tax.split(';')[:6]
                    if self.method == 'mothur':
                        refdict[id] = ';'.join(tax)
                    if self.method == 'rdp':
                        refdict[id] = ';'.join(tax)
                    if self.method == 'spingo':
                        tax.reverse()
                        refdict[id] = '\t'.join(tax)
        return refdict


    def get_out_files(self):
        '''
        Reformat fasta header to make it compatible with the selected method.
        '''
        with open(self.outseq, "w") as outs, open(self.outtax, "w") as outt:
            refdict = self.get_ref_dict()
            for record in SeqIO.parse(open(self.refseq), 'fasta'):
                if record.id in refdict:
                    record.description = ''
                    outt.write(record.id + '\t' + refdict[record.id] + '\n')
                    if self.method == 'mothur':
                        record.id = record.id + '\tNA\t' + refdict[record.id]
                    if self.method == 'rdp':
                        record.id = refdict[record.id] + ';'
                    if self.method == 'spingo':
                        record.id = record.id + "]\t" + refdict[record.id]
                    if self.method == 'centrifuge':
                        record.id = record.id + ' ' + refdict[record.id]
                    SeqIO.write(record, outs, "fasta")



def main(refseq, reftax, method, outseq, outtax):
    m = ToDb(
        refseq = refseq,
        reftax = reftax,
        method = method,
        outseq = outseq,
        outtax = outtax
    )
    m.get_out_files()
    return


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Convert qiime2-formatted '
                        'database, consisting of a sequence file and a 7-level '
                        'taxonomy file, into database file(s) compatible with '
                        'other taxonomic classification methods.')
    parser.add_argument('-s', '--refseq',  default=None,
                        help='Path to fasta file of reference nucleotide sequences. '
                        'Headers should contain only the sequence identifier.')
    parser.add_argument('-t', '--reftax', default=None,
                        help='Path to tab-separated table with reference taxonomy. '
                        'The first column should contain the sequence identifier '
                        'as given in "refseq", the second column a semi-colon-'
                        'separated 7-level taxonomy.')
    parser.add_argument('-m', '--method', default=None,
                        help='A string specifying for which taxonomic classi'
                        'fication tool the output db files should be re-formatted. '
                        'Valid values are "mothur", "centrifuge", "rdp" or "spingo".')
    parser.add_argument('-S', '--outseq',  default=None,
                        help='Path to output fasta file with re-formatted sequences.')
    parser.add_argument('-T', '--outtax', default=None,
                        help='Path to tab separated output table with re-formatted '
                        'taxonomy.')


    args = parser.parse_args()

    main(
        refseq = args.refseq,
        reftax = args.reftax,
        method = args.method,
        outseq = args.outseq,
        outtax = args.outtax
        )
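
A hypothetical invocation (file names are illustrative):

scripts/todb.py -s silva_seqs.fna -t silva_tax.txt -m rdp -S rdp_seqs.fna -T rdp_tax.txt

which mirrors how the rdp and spingo database rules above call the script.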
__author__ = "Evelien Jongepier"
__copyright__ = "Copyright 2021, Evelien Jongepier"
__email__ = "[email protected]"
__license__ = "MIT"

import sys
import re
import argparse
import collections



class ToLCA(object):

    def __init__(self, blast, taxonomy, lca, pconsensus):
        self.blast = blast
        self.taxonomy = taxonomy
        self.lca = lca
        self.pconsensus = pconsensus
        self.levels = ["Domain","Phylum","Class","Order","Family","Genus"]
        return


    def get_blastdict(self):
        '''
        Example output: readdict = {read_id: [hitid1, hitid2, 
            hitid3],...}
        Stores list of blast hit id for each read in dict.
        '''
        readdict = collections.defaultdict(list)
        with open (self.blast) as f:
            for l in f.readlines():
                readid = l.strip().split('\t')[0]
                hitid = l.strip().split('\t')[1]
                if readid not in readdict:
                    readdict[readid] = [hitid]
                else:
                    readdict[readid].append(hitid)
        return readdict


    def get_dbdict(self):
        '''
        Example output: dbdict = {dbid: [domain, order,
            class, ...], ...}
        Stores list of taxonomic annotations (i.e. 7 levels)
        for each reference in the db.
        '''
        dbdict = collections.defaultdict(list)
        with open (self.taxonomy) as f:
            for l in f.readlines():
                dbid, dbtaxstr = l.strip().split('\t')[:2]
                dbtaxstr = re.sub(r'D_[0-6]__', '', dbtaxstr)
                dbtax = dbtaxstr.split(';')[:6]
                dbdict[dbid] = dbtax
        return dbdict


    def get_classdict(self):
        '''
        Example output: classdict = {readid: {domain: [hit1, 
            hit2, hit3, ...], ...}, ...}
        Stores list of classifications for each of the 7
        levels of the taxonomy for each dbhit of each read.
        Redundant hits, not yet consensus.
        '''

        # a lambda gives defaultdict a proper nested default factory
        classdict = collections.defaultdict(
            lambda: collections.defaultdict(list)
        )

        readdict = self.get_blastdict()
        dbdict = self.get_dbdict()

        for readid in readdict:
            for hitid in readdict[readid]:
                dbtaxs = dbdict[hitid]

                if readid not in classdict:
                    classdict[readid] = dict(
                        (self.levels[idx], [tax]) for idx, tax in enumerate(dbtaxs))
                else:
                    for idx, tax in enumerate(dbtaxs):
                        classdict[readid][self.levels[idx]].append(tax)
        return classdict


    def comp_lca(self):
        '''
        Example output: {readid = {Domain:cons},{Phylum:cons},...}
        Compute consensus for each read at each taxonomic level
        based on user defined level (0.5 = majority) and the taxo-
        nomic classifications of each hit.
        '''

        lcadict = collections.defaultdict(dict)
        classdict = self.get_classdict()

        for readid in classdict:
            for level in classdict[readid]:
                freqdict = collections.Counter(classdict[readid][level])
                totfreq = sum(freqdict.values())
                maxkey = max(freqdict, key=freqdict.get)
                if (freqdict[maxkey] / float(totfreq)) > self.pconsensus:
                    lcadict[readid][level] = maxkey
                else:
                    lcadict[readid][level] = "NA"
        return lcadict


    def write_lca(self):
        '''
        Example out: "readid \t Domain \t Phylum ..."
        Write consensus to file, incl. header line.
        '''
        lcadict = self.comp_lca()
        with open(self.lca, 'w') as f:
            f.write('#readid' + '\t' + '\t'.join(self.levels) + '\n')
            for readid in lcadict:
                tax = []
                for level in self.levels:
                    tax.append(lcadict[readid][level])
                f.write(readid + '\t' + '\t'.join(tax) + '\n')
        return


def main(blast, taxonomy, lca, pconsensus):
    m = ToLCA(
        blast = blast,
        taxonomy = taxonomy,
        lca = lca,
        pconsensus = pconsensus
    )
    m.write_lca()
    return      


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Obtain LCA based on blast '
                        'tabular output and a user-defined consensus threshold.')
    parser.add_argument('-b', '--blast', default=None,
                        help='Path to blast tabular with read id and reference '
                        'hit id.')
    parser.add_argument('-t', '--taxonomy',  default=None,
                        help='Path to taxonomy with reference id and 7-level '
                        'taxonomy sensu Silva.')
    parser.add_argument('-l', '--lca', default=None,
                        help='Path to where output lca consensus taxonomy '
                        'should be written.')
    parser.add_argument('-c', '--pconsensus', type=float, default=0.5,
                        help='Consensus threshold ranging between 0.5 and 0.99. '
                        'Minimum fraction of hits that need to agree on each '
                        'taxonomic classification at each taxonomic level. '
                        'A majority vote is obtained with 0.5. Taxonomic '
                        'annotations for which there is no consensus are '
                        'returned as NA.')
    args = parser.parse_args()

    main(
        blast = args.blast,
        taxonomy = args.taxonomy,
        lca = args.lca,
        pconsensus = args.pconsensus
        )
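
A hypothetical invocation (file names are illustrative):

scripts/tolca.py -b soil1.blastn.tab -t reference.tax -l soil1.blastn.taxlist -c 0.5

With -c 0.5, a taxon is accepted at a given rank only when more than half of a read's hits agree; otherwise NA is reported at that rank.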
__author__ = "Evelien Jongepier"
__copyright__ = "Copyright 2020, Evelien Jongepier"
__email__ = "[email protected]"
__license__ = "MIT"

import sys
import re
import argparse
from collections import defaultdict


class ToMat(object):

    def __init__(self, taxlist, bblastlike, centrifugelike, krakenlike, dbtaxonomy, dbfasta, dbmap):
        self.taxlist = taxlist
        self.bblastlike = bblastlike
        self.centrifugelike = centrifugelike
        self.krakenlike = krakenlike
        self.dbtaxonomy = dbtaxonomy
        self.dbfasta = dbfasta
        self.dbmap = dbmap

        if taxlist is not None:
            self.pathfile = taxlist
        elif bblastlike is not None:
            self.pathfile = bblastlike
        elif centrifugelike is not None:
            self.pathfile = centrifugelike
        elif krakenlike is not None:
            self.pathfile = krakenlike
        self.dir = '/'.join(self.pathfile.split('/')[:3])
        self.run = self.pathfile.split('/')[1]
        self.method = self.pathfile.split('/')[2]
        self.sample = self.pathfile.split('/')[3].split('.')[0]
        self.outpath = self.dir + '/' + self.sample + '.' + self.method
        ## if input is bblastlike then create otumat and taxmat from taxlist created here
        if any([bblastlike, centrifugelike, krakenlike]):
            self.taxlist = self.outpath + ".taxlist"
        return


    def get_reftax_dict(self):
        '''
        Example output: {refid : [Domain, Phylum, ...]}
        Create a dict of taxdb identifiers and taxdb taxonomy.
        The input file can be a taxonomy table or a fasta with
        taxonomy in the header, depending on which one is provided.
        '''
        refdict = {}
        if self.dbtaxonomy is not None:
            with open(self.dbtaxonomy) as reftax:
                for l in reftax.readlines():
                    refid, taxstr = l.strip().split('\t')[:2]
                    taxstr = re.sub(r'D_[0-6]__', '', taxstr)
                    refdict[refid] = taxstr.split(';')[:6]
        elif self.dbfasta is not None:
            with open(self.dbfasta) as fasta:
                for l in fasta.readlines():
                    if l.startswith('>'):
                        line = l.strip().split('>')
                        refid = line[1].split(' ')[0]
                        taxstr = line[1].split(' ')[1]
                        refdict[refid] = taxstr.split(';')[:6]
        return refdict


    def get_krakentax_dict(self):
        '''
        Example output: {ncbiid : [silvaid1,silvaid2,...]}
        For krakenlike input the ncbi refids should first be
        translated into silva refids. This creates a dictionary
        from ncbiid to silvaids. Each silvaid has only one ncbiid,
        but the same ncbiid can have multiple silvaids.
        '''
        krakendict = defaultdict(list)
        with open (self.dbmap) as map:
            for l in map.readlines():
                line = l.strip().split('\t')
                ncbiid = line[1]
                krakendict[ncbiid].append(line[0])
        return krakendict


    def get_krakenlike_read_dict(self):
        '''
        Example output: {readid : [refid1, refid2, ...]}
        For krakenlike input the ncbi refids in the input table
        should first be replaced by the silva refids.
        There are multiple silva ref ids associated with each
        readid, hence a dict of lists.
        '''
        with open(self.krakenlike) as f:
            readdict = defaultdict(list)
            krakendict = self.get_krakentax_dict()
            for l in f.readlines():
                if l.startswith("#") or not l.strip():
                    continue
                else:
                    readid, refrefid = l.strip().split('\t')[1:3]
                    if refrefid in krakendict:
                        readdict[readid].extend(krakendict[refrefid])
        return readdict


    def get_read_dict(self):
        '''
        Example output: {readid : [refid1, refid2, ...]}
        Read a bblastlike file and store reads as keys in a
        dictionary with refids as a value list. Some classifiers
        return multiple refids per read, hence the list (e.g.
        centrifuge, but not bestblast).
        '''
        readdict = defaultdict(list)
        with open(self.pathfile) as f:
            for l in f.readlines():
                if l.startswith("#") or not l.strip():
                    continue
                else:
                    readid, refid = l.strip().split('\t')[:2]
                    readdict[readid].append(refid)
        return readdict


    def parse_read_dict(self):
        '''
        Parse correct read dict depending on input data type.
        '''
        if self.krakenlike is not None:
            readdict = self.get_krakenlike_read_dict()
        else:
            readdict = self.get_read_dict()
        return readdict


    def get_taxlist_dict(self):
        '''
        Example output: {readid : [Domain, Phylum, Class..]}
        Read a bblast-like file and link taxdb identifiers
        to taxdb taxonomy in refdict to store in the taxlist dictionary.
        '''
        readdict = self.parse_read_dict()
        refdict = self.get_reftax_dict()
        taxdict = defaultdict(list)
        for readid in readdict:
            refidlist = readdict[readid]
            for refid in refidlist:
                if refid in refdict and len(refdict[refid]) > 5:
                    ''' If already in dict, compare for each tax
                    level whether there is consensus. If not, replace
                    with NA. '''
                    if readid in taxdict:
                        for level in range(5, -1, -1):
                            if taxdict[readid][level] == refdict[refid][level]:
                                break
                            else:
                                taxdict[readid][level] = "NA"
                    else:
                        taxdict[readid] = refdict[refid]
        return taxdict


    def write_taxlist(self):
        '''
        Example out: "readid \t Domain \t Phylum ..."
        Write taxlist to file, incl. header line.
        '''
        taxlist = self.get_taxlist_dict()
        with open(self.outpath + ".taxlist", 'w') as f:
            f.write('#readid' + '\t' + 'Domain' + '\t' +
                    'Phylum' + '\t' + 'Class' + '\t' +
                    'Order' + '\t' + 'Family' + '\t' +
                    'Genus' + '\n')
            for readid in taxlist:
                tax = taxlist[readid]
                f.write(readid + '\t' + '\t'.join(tax) + '\n')


    def get_taxmat_dict(self):
        '''
        Example out: {taxstr: [Domain,Phylum,...]}
        Read taxlist, concatenate taxonomy into taxid
        and create lib with key taxid and value taxonomy.
        '''
        with open(self.taxlist) as f:
            taxdict = {}
            for l in f.readlines():
                if l.startswith("#") or not l.strip():
                    continue
                else:
                    line = l.strip().split('\t')
                    tax = line[1:7]
                    taxid = '_'.join(tax)
                    taxdict[taxid] = tax
        return taxdict


    def write_taxmat(self):
        '''
        Example out: "taxstr \t Domain \t Phylum ..."
        Write taxmat to file, incl. header line.
        '''
        with open(self.outpath + ".taxmat", 'w') as f:
            f.write('#taxid' + '\t' + 'Domain' + '\t' +
                    'Phylum' + '\t' + 'Class' + '\t' +
                    'Order' + '\t' + 'Family' + '\t' +
                    'Genus' + '\n')
            # parse the taxlist once instead of re-reading it for every taxid
            taxdict = self.get_taxmat_dict()
            for taxid in taxdict:
                tax = taxdict[taxid]
                taxlab = re.sub(r'[_NA]*_NA$', '', taxid)
                taxstr = '\t'.join(tax)
                taxstr = re.sub(r'[\tNA]*\tNA$', '', taxstr)
                taxstr = re.sub(r'_', ' ', taxstr)
                f.write(taxlab + '\t' + taxstr + '\n')


    def get_otumat_dict(self):
        '''
        Example out: {taxstr: count}
        Read taxlist, concatenate taxonomy into taxid
        and create a dict with key taxid and value count,
        where count is the frequency with which each taxid
        occurred in the taxlist.
        '''
        with open(self.taxlist) as f:
            otudict = defaultdict(int)
            for l in f.readlines():
                if l.startswith("#") or not l.strip():
                    continue
                else:
                    line = l.strip().split('\t')
                    tax = line[1:7]
                    taxid = '_'.join(tax)
                    otudict[taxid] += 1
        return otudict


    def write_otumat(self):
        '''
        Example output: "taxstr \t count"
        Write otumat to file, incl. header line.
        '''
        with open(self.outpath + ".otumat", 'w') as f:
            f.write('#taxid' + '\t' + self.method + '_' + self.sample + '\n')
            # parse the taxlist once instead of re-reading it for every taxid
            otudict = self.get_otumat_dict()
            for taxid in otudict:
                count = otudict[taxid]
                taxlab = re.sub(r'[_NA]*_NA$', '', taxid)
                f.write(taxlab + '\t' + str(count) + '\n')


def main(taxlist, bblastlike, centrifugelike, krakenlike, dbtaxonomy, dbfasta, dbmap):
    m = ToMat(
        taxlist = taxlist,
        bblastlike = bblastlike,
        centrifugelike = centrifugelike,
        krakenlike = krakenlike,
        dbtaxonomy = dbtaxonomy,
        dbfasta = dbfasta,
        dbmap = dbmap
    )
    if any([bblastlike, centrifugelike, krakenlike]):
        m.write_taxlist()
    m.write_taxmat()
    m.write_otumat()
    return


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Convert output tables '
                        'from classifiers in the NanoClass snakemake pipeline '
                        'to a taxonomy and otu matrix. These serve to '
                        'generate taxonomic barplots. If no taxlist is given, '
                        'the script first produces one, which serves as input '
                        'to assign consensus and majority classifications.')
    parser.add_argument('-l', '--taxlist', default=None,
                        help='Path to a file with read IDs in the '
                        'first column, followed by at least 6 columns '
                        'with tab-delimited 7-level taxonomic classifications '
                        'sensu Silva.')
    parser.add_argument('-b', '--bblastlike', default=None,
                        help='Path to a best-blast-like output file '
                        'with the first column listing the readID and '
                        'the second column the taxonomic identifier.')
    parser.add_argument('-c', '--centrifugelike', default=None,
                        help='Path to a centrifuge-like output file. '
                        'First column should contain read ID and second '
                        'the taxonomic identifier.')
    parser.add_argument('-k', '--krakenlike', default=None,
                        help='Path to a kraken-like output file. '
                        'Second column should contain read ID and third '
                        'the ncbi taxonomic identifier.')
    parser.add_argument('-t', '--dbtaxonomy', default=None,
                        help='Path to the taxonomy of the database used '
                        'to get taxonomic classifications. Only required '
                        'in combination with --bblastlike.')
    parser.add_argument('-f', '--dbfasta', default=None,
                        help='Path to the fasta of database that was used '
                        'to get taxonomic classifications. Fasta headers '
                        'should contain taxids and taxo string. Only required '
                        'in combination with --centrifugelike.')
    parser.add_argument('-m', '--dbmap', default=None,
                        help='Path to the map of database that was used '
                        'to get taxonomic classifications. Map contains '
                        'link between taxonomy in third column of '
                        '--krakenlike and taxids in --dbfasta. Only '
                        'required in combination with --krakenlike.')
    args = parser.parse_args()

    assert any([args.taxlist, args.bblastlike, args.centrifugelike,
                args.krakenlike]), '''
        \n Please provide the path to the taxlist or bblastlike.
        Read the help (--help) for further information.
        '''

    main(
        taxlist = args.taxlist,
        bblastlike = args.bblastlike,
        centrifugelike = args.centrifugelike,
        krakenlike = args.krakenlike,
        dbtaxonomy = args.dbtaxonomy,
        dbfasta = args.dbfasta,
        dbmap = args.dbmap
        )
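
A hypothetical invocation (the path is illustrative; tomat.py derives run, method and sample from a results/<run>/<method>/<sample>.<method>.taxlist layout):

scripts/tomat.py -l results/run01/kraken2/soil1.kraken2.taxlist

Given a taxlist this emits the matching .taxmat and .otumat tables; with -b, -c or -k input the script first derives the taxlist itself.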