Snakemake workflow: species-quantification using kraken2 and bracken

public 1yr ago Version: v1.0.0 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

A Snakemake workflow for quantification of species in a sequencing sample, using kraken2 and bracken . Optionally, it also offers to do benchmarking of Illumina and ONT samples that are generated to contain human + desired bacterial species.

The required configuration of these usages can be seen in config/README.md .

Code Snippets

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2019, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"

import subprocess as sp
import sys
from itertools import product
from snakemake.shell import shell

species = snakemake.params.species.lower()
release = int(snakemake.params.release)
build = snakemake.params.build

branch = ""
if release >= 81 and build == "GRCh37":
    # use the special grch37 branch for new releases
    branch = "grch37/"

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

spec = ("{build}" if int(release) > 75 else "{build}.{release}").format(
    build=build, release=release
)

suffixes = ""
datatype = snakemake.params.get("datatype", "")
chromosome = snakemake.params.get("chromosome", "")
if datatype == "dna":
    if chromosome:
        suffixes = ["dna.chromosome.{}.fa.gz".format(chromosome)]
    else:
        suffixes = ["dna.primary_assembly.fa.gz", "dna.toplevel.fa.gz"]
elif datatype == "cdna":
    suffixes = ["cdna.all.fa.gz"]
elif datatype == "cds":
    suffixes = ["cds.all.fa.gz"]
elif datatype == "ncrna":
    suffixes = ["ncrna.fa.gz"]
elif datatype == "pep":
    suffixes = ["pep.all.fa.gz"]
else:
    raise ValueError("invalid datatype, must be one of dna, cdna, cds, ncrna, pep")

if chromosome:
    if not datatype == "dna":
        raise ValueError(
            "invalid datatype, to select a single chromosome the datatype must be dna"
        )

success = False
for suffix in suffixes:
    url = "ftp://ftp.ensembl.org/pub/{branch}release-{release}/fasta/{species}/{datatype}/{species_cap}.{spec}.{suffix}".format(
        release=release,
        species=species,
        datatype=datatype,
        spec=spec.format(build=build, release=release),
        suffix=suffix,
        species_cap=species.capitalize(),
        branch=branch,
    )

    try:
        shell("curl -sSf {url} > /dev/null 2> /dev/null")
    except sp.CalledProcessError:
        continue

    shell("(curl -L {url} | gzip -d > {snakemake.output[0]}) {log}")
    success = True
    break

if not success:
    print(
        "Unable to download requested sequence data from Ensembl. "
        "Did you check that this combination of species, build, and release is actually provided?",
        file=sys.stderr,
    )
    exit(1)

Python Snakemake From line 1 of ensembl-sequence/wrapper.py

shell:
    "kraken2-build --download-taxonomy --skip-maps --db {output.db} && "
    "kraken2-build {params.dbtype} --threads {threads} --db {output.db} && kraken2-build --build --db {output.db} --threads {threads} && "
    "bracken-build -d {output.db} && touch {output.mock} && "
    "kraken2-build --clean --db {output.db}"

SnakeMake kraken2 bracken From line 16 of rules/abundances.smk

shell:
    "kraken2 --use-names --threads {threads} --db {input.db} --fastq-input {params.paired} {input.fq} "
    " --report {output.rep} > {output.kraken} 2> {log}"

SnakeMake kraken2 From line 37 of rules/abundances.smk

shell:
    "bracken -d {input.db} -i {input.rep} -l S -o {output} 2> {log}"

SnakeMake bracken From line 53 of rules/abundances.smk

wrapper:
    "0.72.0/bio/reference/ensembl-sequence"

SnakeMake From line 12 of rules/benchmarking.smk

shell:
    "art_illumina -ss HS25 -i results/refs/hs_genome.fasta -p -l {params.length} -c {params.n_reads_per_seq} -m 200 -s 10 -o"
    " results/art/hum/Sample{wildcards.art_hum} --noALN 2> {log}"

SnakeMake From line 29 of rules/benchmarking.smk

shell:
    "art_illumina -ss HS25 -i {input} -p -l {params.length} -c {params.n_reads_per_seq} -m 200 -s 10 -o results/art/bac/{wildcards.bac_ref}_ --noALN 2> {log}"

SnakeMake From line 48 of rules/benchmarking.smk

shell:
    "simulator.py genome -rg {input.ref} -c {input.model}/training -b guppy --num_threads {threads}"
    " --fastq -o results/nanosim/hum/{wildcards.n} -n {params.long_nreads} 2> {log}"

SnakeMake From line 67 of rules/benchmarking.smk

shell:
    "read_analysis.py genome -i {input.read} -rg {input.r} --num_threads {threads}"
    " -o results/nanosim_train/GCF_000008865.2_ASM886v2/GCF_000008865.2_ASM886v2 2> {log}"

SnakeMake From line 83 of rules/benchmarking.smk

shell:
    "simulator.py genome -rg {input.r} -c results/nanosim_train/GCF_000008865.2_ASM886v2/GCF_000008865.2_ASM886v2"
    " -b guppy --num_threads {threads} --fastq -o results/nanosim/bac/{wildcards.bac_ref} -dna_type circular -n {params.long_nreads} 2> {log}"

SnakeMake From line 103 of rules/benchmarking.smk

shell:
    "seqtk sample -s100 {input.bac_fq1} {params} > {output.out_fq1}; "
    "seqtk sample -s100 {input.bac_fq2} {params} > {output.out_fq2} 2> {log}"

SnakeMake seqtk From line 121 of rules/benchmarking.smk

shell:
    "cat {input.hum_fq1} {input.bac_fq1} > {output.out_fq1}; "
    "cat {input.hum_fq2} {input.bac_fq2} > {output.out_fq2}"

SnakeMake From line 143 of rules/benchmarking.smk

shell:
    "seqtk sample -s100 {input.bac_fq} {params} > {output.out_fq} 2> {log}"

SnakeMake seqtk From line 159 of rules/benchmarking.smk

shell:
    "cat {input.hum_fq} {input.bac_fq} > {output.out_fq}"

SnakeMake From line 175 of rules/benchmarking.smk

shell:
    "kraken2-build --download-taxonomy --skip-maps --db {output.db} && "
    "kraken2-build {params.dbtype} --threads {threads} --db {output.db} && kraken2-build --build --db {output.db} --threads {threads} && "
    "bracken-build -d {output.db} && touch {output.mock} && "
    "kraken2-build --clean --db {output.db}"

SnakeMake kraken2 bracken From line 193 of rules/benchmarking.smk

shell:
    "kraken2 --use-names --threads {threads} --db {input.db} --fastq-input --report "
    " {output.rep} --paired {input.fq1} {input.fq2} > {output.kraken} 2> {log}"

SnakeMake kraken2 From line 213 of rules/benchmarking.smk

shell:
    "bracken -d {input.db} -i {input.rep} -l S -o {output} 2> {log}"

SnakeMake bracken From line 229 of rules/benchmarking.smk

shell:
    "cd results/sourmash_lca_db && wget {params.link} -O {params.name}.gz && gunzip {params.name}.gz"

SnakeMake sourmash From line 241 of rules/benchmarking.smk

shell:
    "cd results/sourmash_lca_db && wget {params.link} -O {params.name}.gz && gunzip {params.name}.gz"

SnakeMake sourmash From line 253 of rules/benchmarking.smk

shell:
    "sourmash compute --scaled {params} {input} -o {output} -k=51 2> {log}"

SnakeMake From line 268 of rules/benchmarking.smk

shell:
    "sourmash lca summarize --query {input.sig} --db {input.db} -o {output} 2> {log}"

SnakeMake From line 282 of rules/benchmarking.smk

shell:
    "kraken2 --use-names --threads {threads} --db {input.db} --fastq-input {input.fq} "
    " --report {output.rep} > {output.kraken} 2> {log}"

SnakeMake kraken2 From line 298 of rules/benchmarking.smk

shell:
    "bracken -d {input.db} -i {input.rep} -l S -o {output} 2> {log}"

SnakeMake bracken From line 314 of rules/benchmarking.smk

shell:
    "sourmash compute --scaled {params} {input} -o {output} -k=21 2> {log}"

SnakeMake From line 329 of rules/benchmarking.smk

shell:
    "sourmash lca summarize --query {input.sig} --db {input.db} -o {output} 2> {log}"

SnakeMake From line 343 of rules/benchmarking.smk

script:
    "../scripts/method_comparison_sr.R"

SnakeMake From line 385 of rules/benchmarking.smk

script:
    "../scripts/method_comparison_lr.R"

SnakeMake From line 427 of rules/benchmarking.smk

library(ggplot2)

#this script creates an abundance plot for all samples and compares 3 quantification methods; sourmash, kraken2, bracken.

sp <- snakemake@params[["species"]]
fractions <- snakemake@params[["fractions"]]

#find the number of reads in samples
number_of_reads <- c()
t <- 1
for (n in snakemake@input[["fq"]]){
  number_of_reads[t] <-  as.numeric(system(paste("echo $(cat", n ," |wc -l)/4|bc"), intern = TRUE))
  t <- t+1
}

#sourmash#

fin.s <- data.frame()
for (i in snakemake@input[["sourmash"]]){

  #read and match the genera
  sour <- read.csv(i)
  sour$species <- substr(sour$species,start = 4, stop = 1000000L)
  sel.sour <- sour[sour$species %in% sp,]
  sel.sour <- sel.sour[!duplicated(sel.sour[,"species"]),]
  sour.df <- sel.sour[,c("count", "species")]

  #if the genus doesn't have a hit, get rid of NAs
  if (any(!sp %in% sour.df$species)) {
    add <- data.frame("count" = 0,
                      species = sp[!sp %in% sour.df$species])
    sour.fin <- rbind(sour.df, add)
    sour.fin$method <- "sourmash_k21"
  } else {
    sour.fin <- sour.df
    sour.fin$method <- "sourmash_k21"
  }

  #match the total number of reads with samples
  total_n <- setNames(number_of_reads, snakemake@input[["sourmash"]]) #suspicous, if matched or not?

  #calculate observed fraction
  sour.fin$o_fraction <- sour.fin$count/total_n[i]

  #calculate real fraction
  r_fraction <- unlist(strsplit(i, split = "_"))[5] 

  #divide the total n reads by the fraction
  sour.fin$r_fraction <- as.numeric(r_fraction)/total_n[i]

  #add sample names
  sour.fin$sample <- unlist(strsplit(i, split = "_"))[5]
  sour.fin$sample <- paste0("fraction", sour.fin$sample)

  #create the final table for sourmash
  fin.s <- rbind(fin.s, sour.fin)
}

#kraken2#

fin.k <- data.frame()
for (j in snakemake@input[["kraken2"]]){

  #read and match the genera
  kra <- read.table(j, header = F, sep = "\t", strip.white = T)
  kra.s <- kra[kra$V4 == "S",]
  sel.kra <- kra.s[kra.s$V6 %in% sp,]#v2 is the count
  kra.df <- sel.kra[,c("V2", "V6")]
  colnames(kra.df) <- c("count", "species")

  #if the genus doesn't have a hit, get rid of NAs
  if (any(!sp %in% kra.df$species)) {
    add <- data.frame("count" = 0,
                        species = sp[!sp %in% kra.df$species])
    kra.fin <- rbind(kra.df, add)
    kra.fin$method <- "kraken2"
  } else {
    kra.fin <- kra.df
    kra.fin$method <- "kraken2"
  }

  #match the total number of reads with samples
  total_n <- setNames(number_of_reads, snakemake@input[["kraken2"]])

  #calculate the observed fraction
  kra.fin$o_fraction <- kra.fin$count/total_n[j]

  #calculate the real fraction
  tmp <- unlist(strsplit(j, split = "_"))[3] 
  r_fraction <- unlist(strsplit(tmp, split = "fraction"))[2]

  #divide the total n reads by the fraction
  kra.fin$r_fraction <- as.numeric(r_fraction)/total_n[j] #divide the total n reads by the fraction

  #add sample names
  kra.fin$sample <- unlist(strsplit(j, split = "_"))[3]

  #create the final table for kraken2
  fin.k <- rbind(fin.k, kra.fin)
}  

#bracken#

fin.b <- data.frame()
for (k in snakemake@input[["bracken"]]){

  #read and match the genera
  bra <- read.table(k, header = T, sep = "\t", strip.white = T)
  sel.bra <- bra[bra[,"name"] %in% sp,] 
  bra.df <- sel.bra[,c("name", "new_est_reads")]
  colnames(bra.df) <- c("species", "count")

  #if the genus doesn't have a hit, get rid of NAs
  if (any(!sp %in% bra.df$species)) {
    add <- data.frame("count" = 0,
                      species = sp[!sp %in% bra.df$species])
    bra.fin <- rbind(bra.df, add)
    bra.fin$method <- "bracken"
  } else {
    bra.fin <- bra.df
    bra.fin$method <- "bracken"
  }

  #match the total number of reads with samples
  total_n <- setNames(number_of_reads, snakemake@input[["bracken"]])

  #calculate the observed fraction
  bra.fin$o_fraction <- bra.fin$count/total_n[k]

  #calculate the real fraction
  tmp <- unlist(strsplit(k, split = "_"))[3] 
  r_fraction <- unlist(strsplit(tmp, split = "fraction"))[2]
  r_fraction <- gsub( ".bracken", "", r_fraction)

  #divide the total n reads by the fraction
  bra.fin$r_fraction <- as.numeric(r_fraction)/total_n[k] #divide the total n reads by the fraction

  #add sample names
  bra.fin$sample <- unlist(strsplit(k, split = "_"))[3]

  #have the final table for kraken2
  fin.b <- rbind(fin.b, bra.fin)
  fin.b$sample <- gsub( ".bracken", "", fin.b$sample)

}  

#final table
fin <- rbind(fin.s, fin.k)
fin <- rbind(fin, fin.b)

#modify sample names
levels <- paste("fraction", fractions, sep="")
fin$sample_f <- factor(fin$sample, levels=levels)

#scatter plot 
p <- ggplot(fin, aes(x=r_fraction, y=o_fraction, shape=species, color=species))+
  geom_point()+
  facet_grid(method~sample_f) + coord_fixed() + geom_abline(linetype="dashed")+
  scale_shape_manual(values = seq(0,6))+
  theme_classic() + annotate("segment", x=-Inf, xend=Inf, y=-Inf, yend=-Inf)+
  annotate("segment", x=-Inf, xend=-Inf, y=-Inf, yend=Inf)

#save the pdf file containing the scatter plot
ggsave(p, filename = paste0("results/final_abundance/scatter_plot/lr/lr_final_abundance_all_samples_coord_fixed.pdf"),
       width = 11,
       height = 8.5,)

#output the table
write.csv(fin, "results/final_abundance/scatter_plot/lr/lr_final_abundance_all_samples.csv", row.names = F)

R ggplot2 kraken2 bracken From line 1 of scripts/method_comparison_lr.R

library(ggplot2)

#this script creates an abundance plot for all samples and compares 3 quantification methods; sourmash, kraken2, bracken.

sp <- snakemake@params[["species"]]
fractions <- snakemake@params[["fractions"]]

#find the number of reads in samples
number_of_reads <- c()
t <- 1
for (n in snakemake@input[["fq"]]){
  number_of_reads[t] <-  as.numeric(system(paste("echo $(cat", n ," |wc -l)/4|bc"), intern = TRUE))
  t <- t+1
}

#sourmash#

fin.s <- data.frame()
for (i in snakemake@input[["sourmash"]]){

  #read and match the genera
  sour <- read.csv(i)
  sour$species <- substr(sour$species,start = 4, stop = 1000000L)
  sel.sour <- sour[sour$species %in% sp,]
  sel.sour <- sel.sour[!duplicated(sel.sour[,"species"]),]
  sour.df <- sel.sour[,c("count", "species")]

  #if the genus doesn't have a hit, get rid of NAs
  if (any(!sp %in% sour.df$species)) {
    add <- data.frame("count" = 0,
                      species = sp[!sp %in% sour.df$species])
    sour.fin <- rbind(sour.df, add)
    sour.fin$method <- "sourmash_k51"
  } else {
    sour.fin <- sour.df
    sour.fin$method <- "sourmash_k51"
  }

  #match the total number of reads with samples
  total_n <- setNames(number_of_reads, snakemake@input[["sourmash"]]) #suspicous, if matched or not?

  #calculate observed fraction
  sour.fin$o_fraction <- sour.fin$count/total_n[i]

  #calculate real fraction
  r_fraction <- unlist(strsplit(i, split = "_"))[5] 

  #divide the total n reads by the fraction
  sour.fin$r_fraction <- as.numeric(r_fraction)/total_n[i]

  #add sample names
  sour.fin$sample <- unlist(strsplit(i, split = "_"))[5]
  sour.fin$sample <- paste0("fraction", sour.fin$sample)

  #create the final table for sourmash
  fin.s <- rbind(fin.s, sour.fin)
}

#kraken2#

fin.k <- data.frame()
for (j in snakemake@input[["kraken2"]]){

  #read and match the genera
  kra <- read.table(j, header = F, sep = "\t", strip.white = T)
  kra.s <- kra[kra$V4 == "S",]
  sel.kra <- kra.s[kra.s$V6 %in% sp,] #v2 is the count

  kra.df <- sel.kra[,c("V2", "V6")]
  colnames(kra.df) <- c("count", "species")

  #if the genus doesn't have a hit, get rid of NAs
  if (any(!sp %in% kra.df$species)) {
    add <- data.frame("count" = 0,
                      species = sp[!sp %in% kra.df$species])
    kra.fin <- rbind(kra.df, add)
    kra.fin$method <- "kraken2"
  } else {
    kra.fin <- kra.df
    kra.fin$method <- "kraken2"
  }

  #match the total number of reads with samples
  total_n <- setNames(number_of_reads, snakemake@input[["kraken2"]])

  #calculate the observed fraction
  kra.fin$o_fraction <- kra.fin$count/total_n[j]

  #calculate the real fraction
  tmp <- unlist(strsplit(j, split = "_"))[3] 
  r_fraction <- unlist(strsplit(tmp, split = "fraction"))[2]

  #divide the total n reads by the fraction
  kra.fin$r_fraction <- as.numeric(r_fraction)/total_n[j] #divide the total n reads by the fraction

  #add sample names
  kra.fin$sample <- unlist(strsplit(j, split = "_"))[3]

  #create the final table for kraken2
  fin.k <- rbind(fin.k, kra.fin)
}  

#bracken#

fin.b <- data.frame()
for (k in snakemake@input[["bracken"]]){

  #read and match the genera
  bra <- read.table(k, header = T, sep = "\t", strip.white = T)
  sel.bra <- bra[bra[,"name"] %in% sp,] 
  bra.df <- sel.bra[,c("name", "new_est_reads")]
  colnames(bra.df) <- c("species", "count")

  #if the genus doesn't have a hit, get rid of NAs
  if (any(!sp %in% bra.df$species)) {
    add <- data.frame("count" = 0,
                      species = sp[!sp %in% bra.df$species])
    bra.fin <- rbind(bra.df, add)
    bra.fin$method <- "bracken"
  } else {
    bra.fin <- bra.df
    bra.fin$method <- "bracken"
  }

  #match the total number of reads with samples
  total_n <- setNames(number_of_reads, snakemake@input[["bracken"]])

  #calculate the observed fraction
  bra.fin$o_fraction <- bra.fin$count/total_n[k]

  #calculate the real fraction
  tmp <- unlist(strsplit(k, split = "_"))[3] 
  r_fraction <- unlist(strsplit(tmp, split = "fraction"))[2]
  r_fraction <- gsub( ".bracken", "", r_fraction)

  #divide the total n reads by the fraction
  bra.fin$r_fraction <- as.numeric(r_fraction)/total_n[k] #divide the total n reads by the fraction

  #add sample names
  bra.fin$sample <- unlist(strsplit(k, split = "_"))[3]

  #have the final table for kraken2
  fin.b <- rbind(fin.b, bra.fin)
  fin.b$sample <- gsub( ".bracken", "", fin.b$sample)
}  

#final table
fin <- rbind(fin.s, fin.k)
fin <- rbind(fin, fin.b)

#modify sample names
levels <- paste("fraction", fractions, sep="")
fin$sample_f <- factor(fin$sample, levels=levels)

#scatter plot 
p <- ggplot(fin, aes(x=r_fraction, y=o_fraction, shape=species, color=species))+
  geom_point()+
  facet_grid(method~sample_f) + coord_fixed() + geom_abline(linetype="dashed")+
  scale_shape_manual(values = seq(0,6))+
  theme_classic() + annotate("segment", x=-Inf, xend=Inf, y=-Inf, yend=-Inf)+
  annotate("segment", x=-Inf, xend=-Inf, y=-Inf, yend=Inf)

#save the pdf file containing the scatter plot
ggsave(p, filename = paste0("results/final_abundance/scatter_plot/sr/sr_final_abundance_all_samples_coord_fixed.pdf"),
       width = 11,
       height = 8.5,)

#output the table
write.csv(fin, "results/final_abundance/scatter_plot/sr/sr_final_abundance_all_samples.csv", row.names = F)