Snakemake workflow: species-quantification using kraken2 and bracken

public public 1yr ago Version: v1.0.0 0 bookmarks

A Snakemake workflow for quantification of species in a sequencing sample, using kraken2 and bracken . Optionally, it also offers to do benchmarking of Illumina and ONT samples that are generated to contain human + desired bacterial species.

The required configuration of these usages can be seen in config/README.md .

Code Snippets

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2019, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"

import subprocess as sp
import sys
from itertools import product
from snakemake.shell import shell

species = snakemake.params.species.lower()
release = int(snakemake.params.release)
build = snakemake.params.build

branch = ""
if release >= 81 and build == "GRCh37":
    # use the special grch37 branch for new releases
    branch = "grch37/"

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

spec = ("{build}" if int(release) > 75 else "{build}.{release}").format(
    build=build, release=release
)

suffixes = ""
datatype = snakemake.params.get("datatype", "")
chromosome = snakemake.params.get("chromosome", "")
if datatype == "dna":
    if chromosome:
        suffixes = ["dna.chromosome.{}.fa.gz".format(chromosome)]
    else:
        suffixes = ["dna.primary_assembly.fa.gz", "dna.toplevel.fa.gz"]
elif datatype == "cdna":
    suffixes = ["cdna.all.fa.gz"]
elif datatype == "cds":
    suffixes = ["cds.all.fa.gz"]
elif datatype == "ncrna":
    suffixes = ["ncrna.fa.gz"]
elif datatype == "pep":
    suffixes = ["pep.all.fa.gz"]
else:
    raise ValueError("invalid datatype, must be one of dna, cdna, cds, ncrna, pep")

if chromosome:
    if not datatype == "dna":
        raise ValueError(
            "invalid datatype, to select a single chromosome the datatype must be dna"
        )

success = False
for suffix in suffixes:
    url = "ftp://ftp.ensembl.org/pub/{branch}release-{release}/fasta/{species}/{datatype}/{species_cap}.{spec}.{suffix}".format(
        release=release,
        species=species,
        datatype=datatype,
        spec=spec.format(build=build, release=release),
        suffix=suffix,
        species_cap=species.capitalize(),
        branch=branch,
    )

    try:
        shell("curl -sSf {url} > /dev/null 2> /dev/null")
    except sp.CalledProcessError:
        continue

    shell("(curl -L {url} | gzip -d > {snakemake.output[0]}) {log}")
    success = True
    break

if not success:
    print(
        "Unable to download requested sequence data from Ensembl. "
        "Did you check that this combination of species, build, and release is actually provided?",
        file=sys.stderr,
    )
    exit(1)
16
17
18
19
20
shell:
    "kraken2-build --download-taxonomy --skip-maps --db {output.db} && "
    "kraken2-build {params.dbtype} --threads {threads} --db {output.db} && kraken2-build --build --db {output.db} --threads {threads} && "
    "bracken-build -d {output.db} && touch {output.mock} && "
    "kraken2-build --clean --db {output.db}"
37
38
39
shell:
    "kraken2 --use-names --threads {threads} --db {input.db} --fastq-input {params.paired} {input.fq} "
    " --report {output.rep} > {output.kraken} 2> {log}"
53
54
shell:
    "bracken -d {input.db} -i {input.rep} -l S -o {output} 2> {log}"
12
13
wrapper:
    "0.72.0/bio/reference/ensembl-sequence"
29
30
31
shell:
    "art_illumina -ss HS25 -i results/refs/hs_genome.fasta -p -l {params.length} -c {params.n_reads_per_seq} -m 200 -s 10 -o"
    " results/art/hum/Sample{wildcards.art_hum} --noALN 2> {log}"
48
49
shell:
    "art_illumina -ss HS25 -i {input} -p -l {params.length} -c {params.n_reads_per_seq} -m 200 -s 10 -o results/art/bac/{wildcards.bac_ref}_ --noALN 2> {log}"
67
68
69
shell:
    "simulator.py genome -rg {input.ref} -c {input.model}/training -b guppy --num_threads {threads}"
    " --fastq -o results/nanosim/hum/{wildcards.n} -n {params.long_nreads} 2> {log}"
83
84
85
shell:
    "read_analysis.py genome -i {input.read} -rg {input.r} --num_threads {threads}"
    " -o results/nanosim_train/GCF_000008865.2_ASM886v2/GCF_000008865.2_ASM886v2 2> {log}"
103
104
105
shell:
    "simulator.py genome -rg {input.r} -c results/nanosim_train/GCF_000008865.2_ASM886v2/GCF_000008865.2_ASM886v2"
    " -b guppy --num_threads {threads} --fastq -o results/nanosim/bac/{wildcards.bac_ref} -dna_type circular -n {params.long_nreads} 2> {log}"
121
122
123
shell:
    "seqtk sample -s100 {input.bac_fq1} {params} > {output.out_fq1}; "
    "seqtk sample -s100 {input.bac_fq2} {params} > {output.out_fq2} 2> {log}"
143
144
145
shell:
    "cat {input.hum_fq1} {input.bac_fq1} > {output.out_fq1}; "
    "cat {input.hum_fq2} {input.bac_fq2} > {output.out_fq2}"
159
160
shell:
    "seqtk sample -s100 {input.bac_fq} {params} > {output.out_fq} 2> {log}"
175
176
shell:
    "cat {input.hum_fq} {input.bac_fq} > {output.out_fq}"
193
194
195
196
197
shell:
    "kraken2-build --download-taxonomy --skip-maps --db {output.db} && "
    "kraken2-build {params.dbtype} --threads {threads} --db {output.db} && kraken2-build --build --db {output.db} --threads {threads} && "
    "bracken-build -d {output.db} && touch {output.mock} && "
    "kraken2-build --clean --db {output.db}"
213
214
215
shell:
    "kraken2 --use-names --threads {threads} --db {input.db} --fastq-input --report "
    " {output.rep} --paired {input.fq1} {input.fq2} > {output.kraken} 2> {log}"
229
230
shell:
    "bracken -d {input.db} -i {input.rep} -l S -o {output} 2> {log}"
241
242
shell:
    "cd results/sourmash_lca_db && wget {params.link} -O {params.name}.gz && gunzip {params.name}.gz"
253
254
shell:
    "cd results/sourmash_lca_db && wget {params.link} -O {params.name}.gz && gunzip {params.name}.gz"
268
269
shell:
    "sourmash compute --scaled {params} {input} -o {output} -k=51 2> {log}"
282
283
shell:
    "sourmash lca summarize --query {input.sig} --db {input.db} -o {output} 2> {log}"
298
299
300
shell:
    "kraken2 --use-names --threads {threads} --db {input.db} --fastq-input {input.fq} "
    " --report {output.rep} > {output.kraken} 2> {log}"
314
315
shell:
    "bracken -d {input.db} -i {input.rep} -l S -o {output} 2> {log}"
329
330
shell:
    "sourmash compute --scaled {params} {input} -o {output} -k=21 2> {log}"
343
344
shell:
    "sourmash lca summarize --query {input.sig} --db {input.db} -o {output} 2> {log}"
385
386
script:
    "../scripts/method_comparison_sr.R"
427
428
script:
    "../scripts/method_comparison_lr.R"
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
library(ggplot2)

#this script creates an abundance plot for all samples and compares 3 quantification methods; sourmash, kraken2, bracken.

sp <- snakemake@params[["species"]]
fractions <- snakemake@params[["fractions"]]

#find the number of reads in samples
number_of_reads <- c()
t <- 1
for (n in snakemake@input[["fq"]]){
  number_of_reads[t] <-  as.numeric(system(paste("echo $(cat", n ," |wc -l)/4|bc"), intern = TRUE))
  t <- t+1
}

#sourmash#

fin.s <- data.frame()
for (i in snakemake@input[["sourmash"]]){

  #read and match the genera
  sour <- read.csv(i)
  sour$species <- substr(sour$species,start = 4, stop = 1000000L)
  sel.sour <- sour[sour$species %in% sp,]
  sel.sour <- sel.sour[!duplicated(sel.sour[,"species"]),]
  sour.df <- sel.sour[,c("count", "species")]

  #if the genus doesn't have a hit, get rid of NAs
  if (any(!sp %in% sour.df$species)) {
    add <- data.frame("count" = 0,
                      species = sp[!sp %in% sour.df$species])
    sour.fin <- rbind(sour.df, add)
    sour.fin$method <- "sourmash_k21"
  } else {
    sour.fin <- sour.df
    sour.fin$method <- "sourmash_k21"
  }

  #match the total number of reads with samples
  total_n <- setNames(number_of_reads, snakemake@input[["sourmash"]]) #suspicous, if matched or not?

  #calculate observed fraction
  sour.fin$o_fraction <- sour.fin$count/total_n[i]

  #calculate real fraction
  r_fraction <- unlist(strsplit(i, split = "_"))[5] 

  #divide the total n reads by the fraction
  sour.fin$r_fraction <- as.numeric(r_fraction)/total_n[i]

  #add sample names
  sour.fin$sample <- unlist(strsplit(i, split = "_"))[5]
  sour.fin$sample <- paste0("fraction", sour.fin$sample)

  #create the final table for sourmash
  fin.s <- rbind(fin.s, sour.fin)
}

#kraken2#

fin.k <- data.frame()
for (j in snakemake@input[["kraken2"]]){

  #read and match the genera
  kra <- read.table(j, header = F, sep = "\t", strip.white = T)
  kra.s <- kra[kra$V4 == "S",]
  sel.kra <- kra.s[kra.s$V6 %in% sp,]#v2 is the count
  kra.df <- sel.kra[,c("V2", "V6")]
  colnames(kra.df) <- c("count", "species")

  #if the genus doesn't have a hit, get rid of NAs
  if (any(!sp %in% kra.df$species)) {
    add <- data.frame("count" = 0,
                        species = sp[!sp %in% kra.df$species])
    kra.fin <- rbind(kra.df, add)
    kra.fin$method <- "kraken2"
  } else {
    kra.fin <- kra.df
    kra.fin$method <- "kraken2"
  }

  #match the total number of reads with samples
  total_n <- setNames(number_of_reads, snakemake@input[["kraken2"]])

  #calculate the observed fraction
  kra.fin$o_fraction <- kra.fin$count/total_n[j]

  #calculate the real fraction
  tmp <- unlist(strsplit(j, split = "_"))[3] 
  r_fraction <- unlist(strsplit(tmp, split = "fraction"))[2]

  #divide the total n reads by the fraction
  kra.fin$r_fraction <- as.numeric(r_fraction)/total_n[j] #divide the total n reads by the fraction

  #add sample names
  kra.fin$sample <- unlist(strsplit(j, split = "_"))[3]

  #create the final table for kraken2
  fin.k <- rbind(fin.k, kra.fin)
}  

#bracken#

fin.b <- data.frame()
for (k in snakemake@input[["bracken"]]){

  #read and match the genera
  bra <- read.table(k, header = T, sep = "\t", strip.white = T)
  sel.bra <- bra[bra[,"name"] %in% sp,] 
  bra.df <- sel.bra[,c("name", "new_est_reads")]
  colnames(bra.df) <- c("species", "count")

  #if the genus doesn't have a hit, get rid of NAs
  if (any(!sp %in% bra.df$species)) {
    add <- data.frame("count" = 0,
                      species = sp[!sp %in% bra.df$species])
    bra.fin <- rbind(bra.df, add)
    bra.fin$method <- "bracken"
  } else {
    bra.fin <- bra.df
    bra.fin$method <- "bracken"
  }

  #match the total number of reads with samples
  total_n <- setNames(number_of_reads, snakemake@input[["bracken"]])

  #calculate the observed fraction
  bra.fin$o_fraction <- bra.fin$count/total_n[k]

  #calculate the real fraction
  tmp <- unlist(strsplit(k, split = "_"))[3] 
  r_fraction <- unlist(strsplit(tmp, split = "fraction"))[2]
  r_fraction <- gsub( ".bracken", "", r_fraction)

  #divide the total n reads by the fraction
  bra.fin$r_fraction <- as.numeric(r_fraction)/total_n[k] #divide the total n reads by the fraction

  #add sample names
  bra.fin$sample <- unlist(strsplit(k, split = "_"))[3]

  #have the final table for kraken2
  fin.b <- rbind(fin.b, bra.fin)
  fin.b$sample <- gsub( ".bracken", "", fin.b$sample)

}  

#final table
fin <- rbind(fin.s, fin.k)
fin <- rbind(fin, fin.b)

#modify sample names
levels <- paste("fraction", fractions, sep="")
fin$sample_f <- factor(fin$sample, levels=levels)

#scatter plot 
p <- ggplot(fin, aes(x=r_fraction, y=o_fraction, shape=species, color=species))+
  geom_point()+
  facet_grid(method~sample_f) + coord_fixed() + geom_abline(linetype="dashed")+
  scale_shape_manual(values = seq(0,6))+
  theme_classic() + annotate("segment", x=-Inf, xend=Inf, y=-Inf, yend=-Inf)+
  annotate("segment", x=-Inf, xend=-Inf, y=-Inf, yend=Inf)

#save the pdf file containing the scatter plot
ggsave(p, filename = paste0("results/final_abundance/scatter_plot/lr/lr_final_abundance_all_samples_coord_fixed.pdf"),
       width = 11,
       height = 8.5,)

#output the table
write.csv(fin, "results/final_abundance/scatter_plot/lr/lr_final_abundance_all_samples.csv", row.names = F)
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
library(ggplot2)

#this script creates an abundance plot for all samples and compares 3 quantification methods; sourmash, kraken2, bracken.

sp <- snakemake@params[["species"]]
fractions <- snakemake@params[["fractions"]]

#find the number of reads in samples
number_of_reads <- c()
t <- 1
for (n in snakemake@input[["fq"]]){
  number_of_reads[t] <-  as.numeric(system(paste("echo $(cat", n ," |wc -l)/4|bc"), intern = TRUE))
  t <- t+1
}

#sourmash#

fin.s <- data.frame()
for (i in snakemake@input[["sourmash"]]){

  #read and match the genera
  sour <- read.csv(i)
  sour$species <- substr(sour$species,start = 4, stop = 1000000L)
  sel.sour <- sour[sour$species %in% sp,]
  sel.sour <- sel.sour[!duplicated(sel.sour[,"species"]),]
  sour.df <- sel.sour[,c("count", "species")]

  #if the genus doesn't have a hit, get rid of NAs
  if (any(!sp %in% sour.df$species)) {
    add <- data.frame("count" = 0,
                      species = sp[!sp %in% sour.df$species])
    sour.fin <- rbind(sour.df, add)
    sour.fin$method <- "sourmash_k51"
  } else {
    sour.fin <- sour.df
    sour.fin$method <- "sourmash_k51"
  }

  #match the total number of reads with samples
  total_n <- setNames(number_of_reads, snakemake@input[["sourmash"]]) #suspicous, if matched or not?

  #calculate observed fraction
  sour.fin$o_fraction <- sour.fin$count/total_n[i]

  #calculate real fraction
  r_fraction <- unlist(strsplit(i, split = "_"))[5] 

  #divide the total n reads by the fraction
  sour.fin$r_fraction <- as.numeric(r_fraction)/total_n[i]

  #add sample names
  sour.fin$sample <- unlist(strsplit(i, split = "_"))[5]
  sour.fin$sample <- paste0("fraction", sour.fin$sample)

  #create the final table for sourmash
  fin.s <- rbind(fin.s, sour.fin)
}

#kraken2#

fin.k <- data.frame()
for (j in snakemake@input[["kraken2"]]){

  #read and match the genera
  kra <- read.table(j, header = F, sep = "\t", strip.white = T)
  kra.s <- kra[kra$V4 == "S",]
  sel.kra <- kra.s[kra.s$V6 %in% sp,] #v2 is the count

  kra.df <- sel.kra[,c("V2", "V6")]
  colnames(kra.df) <- c("count", "species")

  #if the genus doesn't have a hit, get rid of NAs
  if (any(!sp %in% kra.df$species)) {
    add <- data.frame("count" = 0,
                      species = sp[!sp %in% kra.df$species])
    kra.fin <- rbind(kra.df, add)
    kra.fin$method <- "kraken2"
  } else {
    kra.fin <- kra.df
    kra.fin$method <- "kraken2"
  }

  #match the total number of reads with samples
  total_n <- setNames(number_of_reads, snakemake@input[["kraken2"]])

  #calculate the observed fraction
  kra.fin$o_fraction <- kra.fin$count/total_n[j]

  #calculate the real fraction
  tmp <- unlist(strsplit(j, split = "_"))[3] 
  r_fraction <- unlist(strsplit(tmp, split = "fraction"))[2]

  #divide the total n reads by the fraction
  kra.fin$r_fraction <- as.numeric(r_fraction)/total_n[j] #divide the total n reads by the fraction

  #add sample names
  kra.fin$sample <- unlist(strsplit(j, split = "_"))[3]

  #create the final table for kraken2
  fin.k <- rbind(fin.k, kra.fin)
}  

#bracken#

fin.b <- data.frame()
for (k in snakemake@input[["bracken"]]){

  #read and match the genera
  bra <- read.table(k, header = T, sep = "\t", strip.white = T)
  sel.bra <- bra[bra[,"name"] %in% sp,] 
  bra.df <- sel.bra[,c("name", "new_est_reads")]
  colnames(bra.df) <- c("species", "count")

  #if the genus doesn't have a hit, get rid of NAs
  if (any(!sp %in% bra.df$species)) {
    add <- data.frame("count" = 0,
                      species = sp[!sp %in% bra.df$species])
    bra.fin <- rbind(bra.df, add)
    bra.fin$method <- "bracken"
  } else {
    bra.fin <- bra.df
    bra.fin$method <- "bracken"
  }

  #match the total number of reads with samples
  total_n <- setNames(number_of_reads, snakemake@input[["bracken"]])

  #calculate the observed fraction
  bra.fin$o_fraction <- bra.fin$count/total_n[k]

  #calculate the real fraction
  tmp <- unlist(strsplit(k, split = "_"))[3] 
  r_fraction <- unlist(strsplit(tmp, split = "fraction"))[2]
  r_fraction <- gsub( ".bracken", "", r_fraction)

  #divide the total n reads by the fraction
  bra.fin$r_fraction <- as.numeric(r_fraction)/total_n[k] #divide the total n reads by the fraction

  #add sample names
  bra.fin$sample <- unlist(strsplit(k, split = "_"))[3]

  #have the final table for kraken2
  fin.b <- rbind(fin.b, bra.fin)
  fin.b$sample <- gsub( ".bracken", "", fin.b$sample)
}  

#final table
fin <- rbind(fin.s, fin.k)
fin <- rbind(fin, fin.b)

#modify sample names
levels <- paste("fraction", fractions, sep="")
fin$sample_f <- factor(fin$sample, levels=levels)

#scatter plot 
p <- ggplot(fin, aes(x=r_fraction, y=o_fraction, shape=species, color=species))+
  geom_point()+
  facet_grid(method~sample_f) + coord_fixed() + geom_abline(linetype="dashed")+
  scale_shape_manual(values = seq(0,6))+
  theme_classic() + annotate("segment", x=-Inf, xend=Inf, y=-Inf, yend=-Inf)+
  annotate("segment", x=-Inf, xend=-Inf, y=-Inf, yend=Inf)

#save the pdf file containing the scatter plot
ggsave(p, filename = paste0("results/final_abundance/scatter_plot/sr/sr_final_abundance_all_samples_coord_fixed.pdf"),
       width = 11,
       height = 8.5,)

#output the table
write.csv(fin, "results/final_abundance/scatter_plot/sr/sr_final_abundance_all_samples.csv", row.names = F)
ShowHide 24 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/snakemake-workflows/species-quantification
Name: species-quantification
Version: v1.0.0
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...