A pipeline for processing 16S / ITS data


This workflow is an implementation of the popular DADA2 tool. I followed the steps in the DADA2 tutorial, and used Kraken2 for ASV sequence classification instead of IDTaxa.


The pipeline was inspired by Silas Kieser's (@silask) dada2 snakemake pipeline.

Authors

  • Rene Welch (@ReneWelch)

Install workflow

There are two steps to install the pipeline:

  1. Install snakemake and dependencies: Assuming conda is already installed and a copy of this repository has been downloaded, the pipeline can be installed with:

    # replace {env_name} with a name of your choice; the rest of this README assumes "microbiome"
    conda env create -n {env_name} --file dependencies.yml
    conda env create -n microbiome --file dependencies.yml
    

    Alternatively, mamba could be used:

    mamba env create -n {env_name} --file dependencies.yml
    mamba env create -n microbiome --file dependencies.yml
    
  2. Install R packages: To install the R packages used by this pipeline, run:

    R CMD BATCH --vanilla ./install_r_packages.R
    

Troubleshooting conda and environment variables

If you have other versions of R and R user libraries elsewhere, you might encounter some problems with environment variables and conda. You may need to provide a local .bashrc file to place the conda path at the beginning of your $PATH environment variable.

You may not need to do this step.

source /etc/bash_completion.d/git
__conda_setup="$('/path/to/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/path/to/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/path/to/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/path/to/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
alias R='R --vanilla'
export PATH="/path/to/miniconda3/envs/microbiome/bin:$PATH"

Set up data and metadata

  1. Set up a data/ directory with one subdirectory per batch. The pipeline will look in the data/ directory to find subdirectories that contain fastq.gz files. You can specify different filtering and trimming parameters per batch, and dada2 will learn error rates per batch. So, you may wish to separate your files by sequencing batch and/or by sample type.

     mkdir data/
     mkdir data/batch01 [data/batch02 ... ]
    

    Next, populate the subdirectories with your fastq.gz (or fastq ) files, making sure to include both R1 and R2. If your data are already located elsewhere in your file system, you can use symbolic links (symlinks) to avoid duplicating data:

     fqdir=/path/to/existing_fastq_files/
     batchdir=data/batch01
     ln -s ${fqdir}/*.fastq.gz ${batchdir} 
    
  2. Generate sample table. The input for the pipeline is the set of sequencing files separated by batch in the data/ directory. Running:

    Rscript ./prepare_sample_table.R
    

    will generate the samples.tsv file, which contains 4 tab-separated columns (the pipes below are for illustration only and will not appear in the file):

    | batch | key | end1 | end2 |
    
  3. Set up sample metadata file. This is where you link sample IDs to variables of interest to your study (condition, body site, time point, etc.). As currently implemented, this is only used to augment the final TreeSummarizedExperiment file, which you will use for your own downstream analysis. You need to have a column key that matches the sample table. For example:

    | key | subject_id | body_site | time_point | ...
    

    The default path for this file is data/meta.tsv. If you have it named otherwise, edit the variable metadata in config/config.yaml to point to your file.

  4. Set up negative control mapping file. If you have negative control samples, you should make a table linking each study sample to its negative control kit(s). As currently implemented, this needs to be a qs file containing a tibble of this format:

    # A tibble: 3 x 3
    batch key kits 
    <chr> <chr> <list> 
    1 batch01 sample_1 <chr [3]>
    2 batch01 sample_2 <chr [1]>
    3 batch02 sample_3 <chr [1]>
    

    Each element of key is a study sample ID matching the sample table. Each element of kits is a character vector of sample keys for negative control samples, e.g. c("sample_141", "sample_142", "sample_143").

    The default path for this file is data/negcontrols.qs. If you have it named otherwise, edit the variable negcontroltable in config/config.yaml to point to your file. A minimal sketch for creating this file is shown below.
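
    Because the negative control mapping is a serialized R object rather than a plain-text table, here is a minimal sketch (not part of the pipeline) of how such a file could be created. The batch, sample, and kit keys below are placeholders; substitute the keys from your own sample table.

     # build a tibble with a list-column of negative control keys, then save it with qs
     library(tibble)
     library(qs)

     neg_controls <- tibble(
       batch = c("batch01", "batch01", "batch02"),
       key   = c("sample_1", "sample_2", "sample_3"),
       kits  = list(
         c("sample_141", "sample_142", "sample_143"),
         c("sample_141"),
         c("sample_150")
       )
     )

     qs::qsave(neg_controls, "data/negcontrols.qs")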

Set up configuration file

Now you will edit config.yaml to set up running parameters. At a minimum, you will need to make these edits:

  1. Confirm general parameters and file paths. Check that these variables are set to your satisfaction:

    # general parameters for run
    asv_prefix: "asv" # this will be the prefix for the ASV IDs
    sample_table: "samples.tsv" # this file is generated by prepare_sample_table.R
    negcontroltable: "data/negcontrols.qs" # manually generated
    metadata: "data/meta.tsv" # manually generated
    threads: 8 # max number of threads for a multithreaded snakemake rule, eg fastqc in quality_control
    
  2. Run sequence QC. If you have not already looked at your data, the first thing you will want to do is run FastQC to generate quality profiles.

    First, run snakemake -j{cores} sequence_qc to run FastQC, plot quality profiles, and build a multiqc report. You can find the individual plots in workflow/report/quality_profiles and the full report in output/quality_control/multiqc.

    Next, inspect those profiles. You may want to use them to decide where you want to trim R1 and R2 and how to set minimum overlap.

  3. Edit parameters for each batch. Check the config.yaml file for the default filter_and_trim, merge_pairs, and qc_parameters specifications for each batch. Keep in mind the length of your 16S region (e.g., the V4 region is ~252 bp long) and whether you will have enough overlap after trimming. Look up other dada2 resources for more advice on how to choose these parameters.

    Here is an example for a 2x150 NextSeq run targeting the V4 region. This set of parameters keeps only full-length reads and requires an overlap of 25 base pairs. Because the V4 region is only ~252 base pairs long, we suspect that ASVs that are outside of that range may be bimeras.

    filter_and_trim:
      batch01:
        truncQ: 0
        trimLeft: 0
        trimRight: 0
        maxLen: Inf
        minLen: 150
        maxN: 0
        minQ: 0
        maxEE: [2, 2]
    ...
    merge_pairs:
      batch01:
        minOverlap: 25
        maxMismatch: 0
    ...
    qc_parameters:
      negctrl_prop: 0.5
      max_length_variation: 5
      low_abundance_perc: 0.0001
    

    The defaults provided in the pipeline are from a 2x300 run targeting V3-V4. Here we truncate the ends of both reads because we noticed a significant loss of quality. Because we have longer reads, we can require more overlap (a quick overlap check is sketched after these examples).

    filter_and_trim:
      dust_dec2018:
        truncQ: 12
        truncLen: [280, 250]
        trimLeft: 0
        trimRight: 0
        maxLen: Inf
        minLen: 100
        maxN: 0
        minQ: 0
        maxEE: [2, 2]
    ...
    merge_pairs:
      dust_dec2018:
        minOverlap: 50
        maxMismatch: 0
    ...
    qc_parameters:
      negctrl_prop: 0.5
      max_length_variation: 50
      low_abundance_perc: 0.0001
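
    When choosing truncLen and minOverlap, it can help to check the arithmetic explicitly. Below is a minimal sketch in R; the amplicon length of roughly 460 bp is an assumption for V3-V4, so substitute the expected length of your own target region.

     # rough overlap check: truncated read lengths minus the assumed amplicon length
     amplicon_len <- 460          # assumed V3-V4 amplicon length in bp
     trunc_len <- c(280, 250)     # truncLen for R1 and R2 from the defaults above
     overlap <- sum(trunc_len) - amplicon_len
     overlap                      # ~70 bp, above the minOverlap of 50 set above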
    

Download references for sequence classification

Databases need to be downloaded from https://benlangmead.github.io/aws-indexes/k2. The four used by default in the config file are downloaded in the example below.

Here is an example of doing this using wget :

# download references
wdir="https://genome-idx.s3.amazonaws.com/kraken"
refdir="data/kraken_dbs"
mkdir -p $refdir
greengenes="16S_Greengenes13.5_20200326"
rdp="16S_RDP11.5_20200326"
silva="16S_Silva138_20200326"
minikraken="minikraken2_v2_8GB_201904"
for db in $greengenes $rdp $silva $minikraken;
do
 remote=${wdir}/${db}.tgz
 local=${refdir}/${db}.tgz
 wget $remote -O $local
 tar -xzf ${local} -C ${refdir}
done

Workflow commands

Replace {cores} with the maximum number of cores that you want used at one time. A typical end-to-end invocation is sketched after the list below.

The rules taxonomy and phylotree will take a little longer the first time they are run because they install conda environments.

  • snakemake -j{cores} runs everything

  • snakemake -j{cores} sequence_qc plots quality profiles and builds a multiqc report

  • snakemake -j{cores} [--nt] dada2 computes the ASV matrix from the different batches (use --nt option to keep intermediate files for troubleshooting)

  • snakemake -j{cores} taxonomy --use-conda labels the ASV sequences with kraken2. Databases need to be downloaded in advance from https://benlangmead.github.io/aws-indexes/k2

  • snakemake -j{cores} phylotree --use-conda computes the phylogenetic tree using qiime2's FastTree

  • snakemake -j{cores} mia prepares the TreeSummarizedExperiment containing all the generated data
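
For reference, a hypothetical end-to-end run with 8 cores could look like the following. It assumes the conda environment created during installation is named microbiome and that the kraken2 databases have already been downloaded.

# activate the environment, then run the targets in order
conda activate microbiome
snakemake -j8 sequence_qc               # FastQC, quality profiles, multiqc report
snakemake -j8 dada2                     # filter/trim, error rates, ASV matrix
snakemake -j8 --use-conda taxonomy      # kraken2 classification of ASV sequences
snakemake -j8 --use-conda phylotree     # phylogenetic tree
snakemake -j8 mia                       # final TreeSummarizedExperiment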


Cite

dada2

Callahan, B., McMurdie, P., Rosen, M. et al. DADA2: High-resolution sample inference from Illumina amplicon data. Nat Methods 13, 581–583 (2016). https://doi.org/10.1038/nmeth.3869

Kraken2

Wood, D., Lu, J., Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biology 20, 257 (2019). https://doi.org/10.1186/s13059-019-1891-0

phyloseq

McMurdie, P., Holmes, S. phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLOS One 8, 4 (2013). https://doi.org/10.1371/journal.pone.0061217

qiime2

Bolyen, Evan, Jai Ram Rideout, Matthew R. Dillon, Nicholas A. Bokulich, Christian Abnet, Gabriel A. Al-Ghalith, Harriet Alexander, et al. 2018. “QIIME 2: Reproducible, Interactive, Scalable, and Extensible Microbiome Data Science.” e27295v2. PeerJ Preprints. https://doi.org/10.7287/peerj.preprints.27295v2 .

FastTree

Price, Morgan N., Paramvir S. Dehal, and Adam P. Arkin. 2010. “FastTree 2--Approximately Maximum-Likelihood Trees for Large Alignments.” PloS One 5 (3): e9490.

TreeSummarizedExperiment

Huang, Ruizhu, Charlotte Soneson, Felix G. M. Ernst, Kevin C. Rue-Albrecht, Guangchuang Yu, Stephanie C. Hicks, and Mark D. Robinson. 2020. “TreeSummarizedExperiment: A S4 Class for Data with Hierarchical Structure.” F1000Research 9 (October): 1246.

Code Snippets

shell:
  """Rscript workflow/scripts/dada2/filter_and_trim.R \
    {output.end1} {output.end2} {output.summary} \
    {params.sample_name} --end1={input.end1} --end2={input.end2} \
    --batch={params.batch} --log={log} --config={params.config}

    if [[ -s {output.summary} && ! -s {output.end1} && ! -s {output.end2} ]]; then
      echo "Filtering succeeded but all reads were removed. Creating temp placeholder files."
      touch {output.end1}
      touch {output.end2}
    fi
    """
shell:
  """Rscript workflow/scripts/dada2/learn_error_rates.R \
    --error_rates={output.mat} --plot_file={output.plot} \
    {input.filtered} --log={log} --batch={params.batch} \
    --config={params.config} --cores={threads}"""
shell:
  """
    if [[ -e {input.filt_end1} && -e {input.filt_end2} && ! -s {input.filt_end1} && ! -s {input.filt_end2} ]]; then
        touch {output.merge} 
        echo "Filtering succeeded, but no reads were left after filtering:"
        ls -l {input.filt_end1} {input.filt_end2}
        echo "Generating placeholder output file: {output.merge}"
    else 
      Rscript workflow/scripts/dada2/dereplicate_one_sample_pair.R \
      {output.merge} {params.sample_name} \
      {input.filt_end1} {input.filt_end2} \
      --end1_err={input.mat_end1} --end2_err={input.mat_end2} \
      --log={log} --batch={params.batch} --config={params.config}
    fi
    """
shell:
  """Rscript workflow/scripts/dada2/gather_derep_seqtab.R \
    {output.asv} {output.summary} {input.derep} \
    --batch={params.batch} --log={log} --config={params.config}"""
Snakemake, from line 106 of rules/dada2.smk
shell:
  """Rscript workflow/scripts/dada2/remove_chimeras.R \
  {output.asv} {output.summary} {output.matspy} \
  {input.asv} \
  --log={log} --config={params.config} --cores={threads}"""
Snakemake, from line 123 of rules/dada2.smk
shell:
  """Rscript workflow/scripts/dada2/filter_asvs.R \
    {output.seqtab_filt} \
    {output.plot_seqlength} {output.plot_seqabundance} \
    {input.seqtab} {input.negcontrol} \
    --log={log} --config={params.config} --cores={threads}"""
Snakemake, from line 142 of rules/dada2.smk
shell:
  """Rscript workflow/scripts/dada2/collect_tsv_files.R \
    {output.filt} {input.filt} \
    --log={log}"""
Snakemake, from line 157 of rules/dada2.smk
shell:
  """Rscript workflow/scripts/dada2/collect_tsv_files.R \
    {output.merg} {input.merg} \
    --log={log}"""
Snakemake, from line 171 of rules/dada2.smk
shell:
  """Rscript workflow/scripts/dada2/summarize_nreads.R \
    {output.nreads} {output.fig_step} {output.fig_step_rel} \
    {input.filt} {input.merg} {input.asvs} \
    --log={log}"""
Snakemake, from line 187 of rules/dada2.smk
script:
  """../scripts/phylo/compute_tree.R"""
shell:
  """Rscript workflow/scripts/dada2/plot_quality_profiles.R \
    {output} --end1={input.end1} --end2={input.end2} \
    --log={log}"""
shell:
  """fastqc -o output/quality_control/fastqc -t {params.threads} {input.R1} {input.R2}"""
shell:
  """multiqc output/quality_control/fastqc -o output/quality_control/multiqc"""
shell:
  """Rscript workflow/scripts/taxonomy/extract_fasta.R \
    {output.fasta} {input.asv} \
    --prefix={params.prefix} --log={log}"""
shell:
  """kraken2 --db {input.ref} --threads {threads} \
    --output {output.out} --use-names \
    --report {output.summary} --use-mpa-style \
    --confidence {params.confidence} \
    --classified-out {output.classified} \
    --memory-mapping \
    --unclassified-out {output.unclassified} {input.fasta}"""
shell:
  """blastn -db {params.db} -query {input.fasta} \
    -out {output.blast} -outfmt {params.fmt} \
    -perc_identity {params.perc} \
    -num_threads {threads}"""
shell:
  """Rscript workflow/scripts/taxonomy/clean_kraken_wblast.R \
    {output.taxa} {output.hits} \
    --kraken={input.kraken} --summary={input.kraken_summary} \
    --blast={input.blast} --fasta={input.fasta} \
    --log={log} --cores={threads}
  """
script:
  """../scripts/taxonomy/merge_dblabels_wblast.R"""
"Collect tsv files into one

Usage:
collect_tsv_files.R [<outfile>] [<infiles> ...] [--log=<logfile>]
collect_tsv_files.R (-h|--help)
collect_tsv_files.R --version

Options:
--log=<logfile>    name of the log file [default: collect.log]" -> doc

library(docopt)

my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args,
  version = "collect tsv files V1")

if (!interactive()) {
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {

  arguments$outfile <- "out.tsv"
  arguments$infiles <- list.files("output/dada2/summary",
    full.names = TRUE)

}

info <- Sys.info();

print(arguments)
print(stringr::str_c(names(info), " : ", info, "\n"))

message("loading packages")

library(magrittr)
library(tidyverse)
library(vroom)

infiles <- purrr::map(arguments$infiles, vroom::vroom)

out <- bind_rows(infiles)

fs::dir_create(dirname(arguments$outfile))
readr::write_tsv(out, file = arguments$outfile)
"Dereplicate one sample pair

Usage:
dereplicate_one_sample_pair.R [<sample_merge_file>] [<sample_name> <end1_file> <end2_file> --end1_err=<end1_err> --end2_err=<end2_err>] [--log=<logfile> --batch=<batch> --config=<cfile>]
dereplicate_one_sample_pair.R (-h|--help)
dereplicate_one_sample_pair.R --version

Options:
--end1_err=<end1>    name of the R1 error rates matrix
--end2_err=<end2>    name of the R2 error rates matrix
--log=<logfile>    name of the log file [default: dereplicate.log]
--batch=<batch>    name of the batch if any to get the filter and trim parameters
--config=<cfile>    name of the yaml file with the parameters [default: ./config/config.yaml]" -> doc

library(docopt)

my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args,
  version = "dereplicate one sample V1")

if (!interactive()) {
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {

  arguments$end1_file <-
    "output/dada2/filtered/sample_545_filtered_R1.fastq.gz"
  arguments$end2_file <-
    "output/dada2/filtered/sample_545_filtered_R2.fastq.gz"
  arguments$end1_err <-
    "output/dada2/model/dust_jun2021_error_rates_R1.qs"
  arguments$end2_err <-
    "output/dada2/model/dust_jun2021_error_rates_R2.qs"
  arguments$batch <- "dust_jun2021"
  arguments$sample_name <- "sample_545"

}


stopifnot(
  file.exists(arguments$end1_file),
  file.exists(arguments$end2_file),
  file.exists(arguments$end1_err),
  file.exists(arguments$end2_err))

if (!is.null(arguments$config)) stopifnot(file.exists(arguments$config))


info <- Sys.info();

print(arguments)
print(stringr::str_c(names(info), " : ", info, "\n"))

message("loading packages")

library(magrittr)
library(dada2)
library(qs)
library(yaml)

filter_fwd <- arguments$end1_file
filter_bwd <- arguments$end2_file

message("loading error rates")
err_fwd <- qs::qread(arguments$end1_err)
err_bwd <- qs::qread(arguments$end2_err)

message("dereplicating filtered files")

dereplicate_merge <- function(filtfwd, filtbwd, err_fwd, err_bwd,
  min_overlap = 12, max_mismatch = 0) {

  message("de-replicating end1 file")
  derep_fwd <- dada2::derepFastq(filtfwd)
  dd_fwd <- dada2::dada(derep_fwd, err = err_fwd)

  message("de-replicating end2 file")
  derep_bwd <- dada2::derepFastq(filtbwd)
  dd_bwd <- dada2::dada(derep_bwd, err = err_bwd)

  message("merging pairs")
  merge <- dada2::mergePairs(dd_fwd, derep_fwd, dd_bwd, derep_bwd,
      minOverlap = min_overlap, maxMismatch = max_mismatch)

  out <- list(
    "dada_fwd" = dd_fwd,
    "dada_bwd" = dd_bwd,
    "merge" = merge)

  return(out)
}

config <- yaml::read_yaml(arguments$config)$merge_pairs

if (!is.null(arguments$batch)) {
  stopifnot(arguments$batch %in% names(config))
  config <- config[[arguments$batch]]
} else {
  nms <- c("minOverlap", "maxMismatch")
  if (any(names(config) %in% nms)) {
    warning("will use first element instead")
    config <- config[[1]]
  }
}

merge <- dereplicate_merge(filter_fwd, filter_bwd, err_fwd, err_bwd,
  min_overlap = as.numeric(config[["minOverlap"]]),
  max_mismatch = as.numeric(config[["maxMismatch"]]))

qs::qsave(merge, arguments$sample_merge_file)
"Filter and trim

Usage:
filter_and_trim.R [<filter_end1> <filter_end2> <summary_file>] [<sample_name> --end1=<end1> --end2=<end2>] [--batch=<batch> --log=<logfile> --config=<cfile>]
filter_and_trim.R (-h|--help)
filter_and_trim.R --version

Options:
-h --help    show this screen
--end1=<end1>    name of the R1 end fastq.gz file
--end2=<end2>    name of the R2 end fastq.gz file
--log=<logfile>    name of the log file [default: logs/filter_and_trim.log]
--batch=<batch>    name of the batch if any to get the filter and trim parameters
--config=<cfile>    name of the yaml file with the parameters [default: ./config/config.yaml]" -> doc

library(docopt)

my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args, version = "filter_and_trim V1")

if (!interactive()) {
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {

  arguments$batch <- "dust_dec2018"
  arguments$sample_name <- "sample_20"
  arguments$end1 <-
    "data/dust_dec2018/190114_C75PR/203_S19_L001_R1_001.fastq.gz"
  arguments$end2 <-
    "data/dust_dec2018/190114_C75PR/203_S19_L001_R2_001.fastq.gz"

  arguments$filter_end1 <-
    "filtered_L001_R1_001.fastq.gz"
  arguments$filter_end2 <-
    "filtered_L001_R2_001.fastq.gz"
}

print(arguments)

info <- Sys.info();

message("loading dada2")
library(magrittr)
library(dada2)
library(qs)
library(yaml)
library(fs)

stopifnot(file.exists(arguments$config),
  file.exists(arguments$end1), file.exists(arguments$end2))

print(stringr::str_c(names(info), " : ", info, "\n"))
config <- yaml::read_yaml(arguments$config)
config <- config[["filter_and_trim"]]
print(config)

if (!is.null(arguments$batch)) {
  stopifnot(arguments$batch %in% names(config))
  config <- config[[arguments$batch]]
} else {
  nms <- c("truncQ", "truncLen", "trimLeft", "trimRight",
    "maxLen", "minLen", "maxN", "minQ", "maxEE")
  if (any(names(config) %in% nms)) {
    warning("will use first element instead")
    config <- config[[1]]
  }

}

fs::dir_create(unique(dirname(arguments$filter_end1)))
fs::dir_create(unique(dirname(arguments$filter_end2)))

track_filt <- dada2::filterAndTrim(
  arguments$end1, arguments$filter_end1,
  arguments$end2, arguments$filter_end2,
  truncQ =    as.numeric(config[["truncQ"]]),
  truncLen =  as.numeric(config[["truncLen"]]),
  trimLeft =  as.numeric(config[["trimLeft"]]),
  trimRight = as.numeric(config[["trimRight"]]),
  maxLen =    as.numeric(config[["maxLen"]]),
  minLen =    as.numeric(config[["minLen"]]),
  maxN =      as.numeric(config[["maxN"]]),
  minQ =      as.numeric(config[["minQ"]]),
  maxEE =     as.numeric(config[["maxEE"]]),
  rm.phix = TRUE,
  compress = TRUE,
  multithread = FALSE)

row.names(track_filt) <- arguments$sample_name
colnames(track_filt) <- c("raw", "filtered")

track_filt %>%
  as.data.frame() %>%
  tibble::as_tibble(rownames = "samples") %>%
  dplyr::mutate(
    end1 = arguments$end1,
    end2 = arguments$end2) %>%
  readr::write_tsv(arguments$summary_file)
"Filters the low quality ASVs

Usage:
filter_asvs.R [<asv_matrix_qc> <seqlength_fig> <abundance_fig>] [<asv_matrix_file> <negcontrol_file>] [--lysis --log=<logfile> --config=<cfile> --cores=<cores>]
filter_asvs.R (-h|--help)
filter_asvs.R --version

Options:
-h --help    show this screen
--log=<logfile>    name of the log file [default: filter_and_trim.log]
--config=<cfile>    name of the yaml file with the parameters [default: ./config/config.yaml]
--cores=<cores>    number of CPUs for parallel processing [default: 24]" -> doc

library(docopt)

my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args,
  version = "filter ASVs V1")

if (!interactive()) {
  fs::dir_create(dirname(arguments$log))
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {

  arguments$asv_matrix_qc <- "asv.qs"
  arguments$seqlength_fig <- "sl.png"
  arguments$abundance_fig <- "sa.png"
  arguments$asv_matrix_file <- "output/dada2/remove_chim/asv_mat_wo_chim.qs"
  arguments$negcontrol_file <- "output/predada2/negcontrols.qs"

}


message("arguments")
print(arguments)

info <- Sys.info();
print(stringr::str_c(names(info), " : ", info, "\n"))


message("loading packages")
library(magrittr)
library(tidyverse)
library(dada2)
library(qs)
library(yaml)

stopifnot(
  file.exists(arguments$asv_matrix_file),
  file.exists(arguments$config),
  file.exists(arguments$negcontrol_file))

message("starting with asvs in ", arguments$asv_matrix_file)

seqtab <- qs::qread(arguments$asv_matrix_file)
config <- yaml::read_yaml(arguments$config)[["qc_parameters"]]

message("filtering samples by negative controls")
message("using neg. control file ", arguments$negcontrol_file)
message("removing counts >= ", config[["negctrl_prop"]],
  " sum(neg_controls)")

neg_controls <- qs::qread(arguments$negcontrol_file)

if (arguments$lysis) {

  neg_controls %<>%
    dplyr::mutate(
      kits = map2(kits, lysis, c))

}

neg_controls %<>%
  dplyr::select(batch, key, kits)

subtract_neg_control <- function(name, neg_controls, seqtab, prop) {

  negs <- neg_controls

  negs <- negs[negs %in% rownames(seqtab)]
  out_vec <- seqtab[name, ]

  if (length(negs) > 1) {

    negs <- seqtab[negs, ]
    neg_vec <- colSums(negs)
    out_vec <- out_vec - prop * neg_vec

  } else if (length(negs) == 1) {

    neg_vec <- seqtab[negs, ]
    out_vec <- out_vec - prop * neg_vec
  }

  if (any(out_vec < 0)) {
    out_vec[out_vec < 0] <- 0
  }
  return(out_vec)

}

neg_controls %<>%
  dplyr::mutate(sample_vec = purrr::map2(key, kits,
    safely(subtract_neg_control),
    seqtab, config[["negctrl_prop"]]))

to_remove <- dplyr::filter(neg_controls,
  purrr::map_lgl(sample_vec, ~ !is.null(.$error)))

if (nrow(to_remove) > 0) {
  message("removed the samples:\n",
    stringr::str_c(to_remove$key, collapse = "\n"))
}

neg_controls %<>%
  dplyr::filter(purrr::map_lgl(sample_vec, ~ is.null(.$error))) %>%
  dplyr::mutate(
    sample_vec = purrr::map(sample_vec, "result"))

sample_names <- neg_controls %>%
  dplyr::pull(key)

seqtab_samples <- purrr::reduce(
  dplyr::pull(neg_controls, sample_vec), dplyr::bind_rows)
seqtab_samples %<>%
  as.matrix() %>%
  set_rownames(sample_names)

seqtab_nc <- seqtab[! rownames(seqtab) %in% sample_names, ]
seqtab_new <- floor(rbind(seqtab_samples, seqtab_nc))

# Length of sequences
message("filtering sequences by length")

seq_lengths <- nchar(dada2::getSequences(seqtab_new))

l_hist <- as.data.frame(table(seq_lengths)) %>%
  tibble::as_tibble() %>%
  rlang::set_names("length", "freq")

l_hist <- l_hist %>%
  ggplot(aes(x = length, y = freq)) +
    geom_col() +
    labs(title = "Sequence Lengths by SEQ Count") +
    theme_bw() +
    theme(
      axis.text = element_text(size = 6),
      axis.text.x = element_text(angle = 90, hjust = 1,
        vjust = 0.5, size = 10),
      axis.text.y = element_text(size = 10))

ggsave(
  filename = arguments$seqlength_fig,
  plot = l_hist,
  width = 8, height = 4, units = "in")


table2 <- tapply(colSums(seqtab_new), seq_lengths, sum)
table2 <- tibble::tibble(
  seq_length = names(table2),
  abundance = table2)

most_common_length <- dplyr::top_n(table2, 1, abundance) %>%
  dplyr::pull(seq_length) %>%
  as.numeric()

table2 <- table2 %>%
  ggplot(aes(x = seq_length, y = log1p(abundance))) +
    geom_col() +
    labs(title = "Sequence Lengths by SEQ Abundance") +
    theme_bw() +
    theme(
      axis.text = element_text(size = 6),
      axis.text.x = element_text(angle = 90, hjust = 1,
        vjust = 0.5, size = 10),
      axis.text.y = element_text(size = 10))

ggsave(
  filename = arguments$abundance_fig,
  plot = table2,
  width = 8, height = 4, units = "in")


max_diff <- config[["max_length_variation"]]

message("most common length: ", most_common_length)
message("removing sequences outside range ",
  " < ", most_common_length - max_diff, " or > ",
  most_common_length + max_diff)

right_length <- abs(seq_lengths - most_common_length) <= max_diff
seqtab_new <- seqtab_new[, right_length]


total_abundance <- sum(colSums(seqtab_new))
min_reads_per_asv <- ceiling(config[["low_abundance_perc"]] / 100 *
  total_abundance)

seqtab_abundance <- colSums(seqtab_new)

message("removing ASV with < ", min_reads_per_asv, " reads")
message("in total ", sum(seqtab_abundance < min_reads_per_asv))
seqtab_new <- seqtab_new[, seqtab_abundance >= min_reads_per_asv]


message("saving files in ", arguments$asv_matrix_qc)
qs::qsave(seqtab_new, arguments$asv_matrix_qc)
"Gather ASV table

Usage:
gather_derep_seqtab.R [<asv_file> <summary_file>] [<derep_file> ...] [--batch=<batch> --log=<logfile> --config=<cfile>]
gather_derep_seqtab.R (-h|--help)
gather_derep_seqtab.R --version

Options:
-h --help    show this screen
--log=<logfile>    name of the log file [default: gather_derep_seqtab.log]
--config=<cfile>    name of the yaml file with the parameters [default: ./config/config.yaml]" -> doc

library(docopt)

my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args,
  version = "gather dereplicated sequence table V1")

if (!interactive()) {
  fs::dir_create(dirname(arguments$log))
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {

  arguments$batch <- "batch2018"
  arguments$asv_file <- "batch2018_asv.qs"
  arguments$derep_file <- list.files(file.path("output", "dada2",
    "merge", arguments$batch), full.names = TRUE)

}

message("arguments")
print(arguments)

message("info")
info <- Sys.info();
print(stringr::str_c(names(info), " : ", info, "\n"))

message("loading packages")
library(magrittr)
library(tidyverse)
library(dada2)
library(qs)
library(yaml)

derep_files <- arguments$derep_file

# stopifnot(any(file.exists(derep_files)), file.exists(arguments$config))
stopifnot(all(file.exists(derep_files)), file.exists(arguments$config))

# remove the files for samples where all reads were filtered out
derep_files <- derep_files[file.exists(derep_files)]

# Identify and remove empty files.
# These are placeholder files that were created
# to appease snakemake after all reads were
# filtered out from the input file back in filter_and_trim.
# stop if we have no files left after checking for empty.
derep_tb <- tibble(derep_file = derep_files) %>%
  dplyr::mutate(
    size = purrr::map_dbl(derep_file, ~ file.info(.x)$size),
    key = purrr::map(derep_file, basename),
    key = stringr::str_remove(key, "_asv.qs"))

count_nonempty <- derep_tb %>%
  dplyr::filter(size > 0) %>%
  nrow()

stopifnot(count_nonempty > 0)

# read ASV files
read_merger <- function(filename, size) {
  if (size == 0) {
    NULL
  } else {
    qs::qread(filename)
  }
}

# changed the commands to be all inside the same mutate to improve readability
derep_tb %<>%
  dplyr::mutate(
    derep_merger = purrr::map2(derep_file, size, read_merger),
    mergers = purrr::map(derep_merger, "merge"),
    dada_fwd = purrr::map(derep_merger, "dada_fwd"),
    dada_bwd = purrr::map(derep_merger, "dada_bwd"))


config <- yaml::read_yaml(arguments$config)

stopifnot(file.exists(config$samples_file))

sample_tb <- readr::read_tsv(config$samples_file) %>%
  dplyr::left_join(derep_tb)

# now do some checking to make sure we only have
# the correct batch in the input
check_batches <- sample_tb %>%
  dplyr::filter(!is.na(derep_file)) %>%
  dplyr::count(batch)

if (nrow(check_batches) > 1) {
  message("User supplied input derep files from multiple batch(es):")
  message(paste(sprintf("%s: %d", check_batches$batch, check_batches$n),
    collapse = "\n"))
  message("We will only look at the requested batch: ", arguments$batch)
}

# remove other batches, but keep empty files for summary
sample_tb %<>%
  dplyr::filter(batch == arguments$batch)
nonempty_tb <- sample_tb %>%
  dplyr::filter(size > 0)
mergers <- pluck(nonempty_tb, "mergers") %>%
  setNames(pluck(nonempty_tb, "key"))
dada_fwd <- pluck(nonempty_tb, "dada_fwd")
dada_bwd <- pluck(nonempty_tb, "dada_bwd")

message("creating sequence table")
seqtab <- dada2::makeSequenceTable(mergers)

qs::qsave(seqtab, arguments$asv_file)

message("summarizing results")

## get N reads
## include lost samples with 0s
get_nreads <- function(x) {

  if (!is.null(x)) {
    sum(dada2::getUniques(x))
  } else {
    NA
  }
}

track <- sample_tb %>%
  dplyr::mutate(
    denoised = purrr::map(dada_fwd, get_nreads),
    merged = purrr::map(mergers, get_nreads)) %>%
  tidyr::unnest(c(denoised, merged)) %>%
  dplyr::transmute(samples = key, denoised, merged) %>%
  dplyr::mutate(across(c(denoised, merged), ~replace_na(.x, 0)))

track %>%
  readr::write_tsv(arguments$summary_file)

message("Done! summary file at ", arguments$summary_file)
"Learn error rates

Usage:
learn_error_rates.R [--error_rates=<matrix_file> --plot_file=<pfile>] [<filtered> ...] [--log=<logfile> --batch=<batch> --config=<cfile> --cores=<cores>]
learn_error_rates.R (-h|--help)
learn_error_rates.R --version

Options:
-h --help    show this screen
--error_rates=<matrix_file>    name of the file with the learned error rates matrix
--plot_file=<pfile>    name of the file to save the diagnostic plot
--log=<logfile>    name of the log file [default: error_rates.log]
--batch=<batch>    name of the batch if any to get the filter and trim parameters
--config=<cfile>    name of the yaml file with the parameters [default: ./config/config.yaml]
--cores=<cores>    number of CPUs for parallel processing [default: 4]" -> doc

library(docopt)
library(fs)


my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args, version = "error_rates V1")

if (!interactive()) {
  fs::dir_create(dirname(arguments$log))
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {

  library(magrittr)
  library(tidyverse)

  arguments$error_rates <- "error_rate_matrix.qs"
  arguments$plot_file <- "error_rates.png"
  arguments$batch <- "batch2018"
  arguments$filtered <- readr::read_tsv("samples.tsv") %>%
    filter(batch == arguments$batch) %>%
    pull(key)
  arguments$filtered <- as.character(glue::glue(
    "output/dada2/filtered/{sample}_filtered_R1.fastq.gz",
    sample = arguments$filtered))

}

stopifnot(any(file.exists(arguments$filtered)))
stopifnot(file.exists(arguments$config))


info <- Sys.info();
print(stringr::str_c(names(info), " : ", info, "\n"))
print(arguments)


# Identify and remove empty files.
# These are placeholder files that were created
# to appease snakemake after all reads were
# filtered out from the input file.
# stop if we have no files left after checking for empty.
sizes <- lapply(arguments$filtered, FUN = function(x) file.info(x)$size)
sizes <- setNames(sizes, arguments$filtered)
nonempty <- names(sizes[sizes > 0])
stopifnot(length(nonempty) > 0)

message("loading packages")
library(dada2)
library(ggplot2)
library(qs)
library(yaml)

config <- yaml::read_yaml(arguments$config)$error_rates
print(config)

if (!is.null(arguments$batch)) {
  stopifnot(arguments$batch %in% names(config))
  config <- config[[arguments$batch]]
} else {
  nms <- c("learn_nbases")
  if (any(names(config) %in% nms)) {
    warning("will use first element instead")
    config <- config[[1]]
  }

}

print("computing error rates")
errs <- dada2::learnErrors(
  nonempty, #arguments$filtered,
  nbases = as.numeric(config$learn_nbases),
  multithread = as.numeric(arguments$cores), randomize = TRUE)

fs::dir_create(dirname(arguments$error_rates))
qs::qsave(errs, file = arguments$error_rates)

print("plotting diagnostics")
err_plot <- dada2::plotErrors(errs, nominalQ = TRUE)

fs::dir_create(dirname(arguments$plot_file))
ggplot2::ggsave(
  filename = arguments$plot_file,
  plot = err_plot,
  width = 20,
  height = 20,
  units = "cm")
"Plot quality profiles

Usage:
  plot_quality_profiles.R <plot_file> [--end1=<end1> --end2=<end2> --log=<logfile>]
  plot_quality_profiles.R (-h|--help)
  plot_quality_profiles.R --version

Options:
-h --help    show this screen
--log=<logfile>    name of the log file [default: logs/plot_qc_profiles.log]" -> doc

my_args <- commandArgs(trailingOnly = TRUE)

if (interactive()) {
  my_args <- c("qc_profile.png", "--end1=end1.fastq.gz",
    "--end2=end2.fastq.gz")
}

library(docopt, quietly = TRUE)

arguments <- docopt(doc, args = my_args, version = "plot_quality_profiles V1")
print(arguments)

if (!interactive()) {
  log_file <- file(arguments$log, open = "wt")
  sink(log_file)
  sink(log_file, type = "message")
}

message("system info")
print(Sys.info())

message("arguments")
print(arguments)

stopifnot(file.exists(arguments$end1), file.exists(arguments$end2))

message("loading R packages")
library(dada2, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(purrr, quietly = TRUE)
library(cowplot, quietly = TRUE)

message("making plots")

plot_fun <- purrr::safely(dada2::plotQualityProfile)

r1_plot <- plot_fun(arguments$end1)
r2_plot <- plot_fun(arguments$end2)

if (is.null(r1_plot$error)) {
  r1_plot <- r1_plot$result
} else {
  message(r1_plot$error)
  r1_plot <- ggplot()
}

if (is.null(r2_plot$error)) {
  r2_plot <- r2_plot$result
} else {
  message(r2_plot$error)
  r2_plot <- ggplot()
}

r1_plot <- r1_plot +
  cowplot::theme_minimal_grid() +
  theme(strip.background = element_blank())

r2_plot <- r2_plot +
  cowplot::theme_minimal_grid() +
  theme(strip.background = element_blank())


message("saving plots")
final_plot <- cowplot::plot_grid(r1_plot, r2_plot, nrow = 1)

ggsave(filename = arguments$plot_file, plot = final_plot, width = 8,
  height = 4, units = "in", bg = "white")
"Merge sequence tables and remove chimeras

Usage:
remove_chimeras.R [<asv_merged_file> <summary_file> <fig_file>] [<asv_file> ...] [--log=<logfile> --config=<cfile> --cores=<cores>]
remove_chimeras.R (-h|--help)
remove_chimeras.R --version

Options:
-h --help    show this screen
--log=<logfile>    name of the log file [default: remove_chimeras.log]
--config=<cfile>    name of the yaml file with the parameters [default: ./config/config.yaml]
--cores=<cores>    number of CPUs for parallel processing [default: 24]" -> doc

library(docopt)

my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args,
  version = "remove chimeras V1")

if (!interactive()) {
  fs::dir_create(dirname(arguments$log))
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {

  arguments$fig_file <- "matspy.png"
  arguments$asv_merged_file <- "asv.qs"
  arguments$summary_file <- "summary.qs"
  arguments$asv_file <- list.files(file.path("output", "dada2",
    "asv_batch"), full.names = TRUE, pattern = "asv")

}


message("arguments")
print(arguments)

## merges different ASV tables, and removes chimeras

message("info")
info <- Sys.info();
print(stringr::str_c(names(info), " : ", info, "\n"))

message("loading packages")
library(magrittr)
library(tidyverse)
library(dada2)
library(qs)
library(yaml)
library(ComplexHeatmap)

stopifnot(any(file.exists(arguments$asv_file)))

if (!is.null(arguments$config)) stopifnot(file.exists(arguments$config))

config <- yaml::read_yaml(arguments$config)
sample_table <- readr::read_tsv(config$sample)
config <- config$remove_chimeras

seqtab_list <- purrr::map(arguments$asv_file, qs::qread)

if (length(seqtab_list) > 1) {
  seqtab_all <- dada2::mergeSequenceTables(tables = seqtab_list)
} else {
  seqtab_all <- seqtab_list[[1]]
}

# Remove chimeras
message("removing chimeras")
seqtab <- dada2::removeBimeraDenovo(
  seqtab_all,
  method = config[["chimera_method"]],
  minSampleFraction = config[["minSampleFraction"]],
  ignoreNNegatives = config[["ignoreNNegatives"]],
  minFoldParentOverAbundance = config[["minFoldParentOverAbundance"]],
  allowOneOf = config[["allowOneOf"]],
  minOneOffParentDistance = config[["minOneOffParentDistance"]],
  maxShift = config[["maxShift"]],
  multithread = as.numeric(arguments$cores))

fs::dir_create(dirname(arguments$asv_merged_file))
qs::qsave(seqtab, arguments$asv_merged_file)

out <- tibble::tibble(samples = row.names(seqtab),
  nonchim = rowSums(seqtab))

fs::dir_create(dirname(arguments$summary_file))
out %>%
  readr::write_tsv(arguments$summary_file)

fs::dir_create(dirname(arguments$fig_file))

outmat <- seqtab[, seq_len(min(1e4, ncol(seqtab)))]
outmat <- ifelse(outmat > 0, 1, 0)
cols <- structure(c("black", "white"), names = c("1", "0"))


# put row annotation in same order as matrix
# will remove samples that were completely empty
# after filter_and_trim
anno_df <- sample_table %>%
    select(batch, key) %>%
    as.data.frame() %>%
    column_to_rownames("key")
anno_df <- anno_df[rownames(outmat), , drop = F]

annot <- ComplexHeatmap::rowAnnotation(
  df = anno_df,
  annotation_legend_param = list(
    batch = list(direction = "horizontal")))

png(filename = arguments$fig_file, width = 10, height = 8, units = "in",
  res = 1200)
hm <- ComplexHeatmap::Heatmap(outmat,
  left_annotation = annot,
  col = cols, name = "a", show_row_dend = FALSE, show_row_names = FALSE,
  show_column_dend = FALSE, show_column_names = FALSE,
  cluster_columns = FALSE, cluster_rows = FALSE,
  show_heatmap_legend = FALSE, use_raster = TRUE,
  column_title = "Top 10K ASVs",
  heatmap_legend_param = list(direction = "horizontal"))
draw(hm, annotation_legend_side = "top", heatmap_legend_side = "top",
  merge_legend = TRUE)
dev.off()
"Summarized the # of reads per processing step

Usage:
summarize_nreads.R [<nreads_file> <nreads_fig> <preads_fig>] [<filt_summary_file> <derep_summary_file> <final_asv_mat>] [--log=<logfile>]
summarize_nreads.R (-h|--help)
summarize_nreads.R --version

Options:
-h --help    show this screen
--log=<logfile>    name of the log file [default: summarize_nreads.log]" -> doc

library(docopt)

my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args,
  version = "summarize # reads V1")

if (!interactive()) {
  fs::dir_create(dirname(arguments$log))
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {

  arguments$filt_summary_file <-
    "output/dada2/filtered/all_sample_summary.tsv"
  arguments$derep_summary_file <-
    "output/dada2/asv_batch/all_sample_summary.tsv"
  arguments$final_asv_mat <-
    "output/dada2/after_qc/asv_mat_wo_chim.qs"

  arguments$nreads_file <-
    "output/dada2/stats/Nreads_dada2.txt"
  arguments$nreads_fig <-
    "workflow/report/dada2qc/dada2steps_vs_abundance.png"
  arguments$preads_fig <-
    "workflow/report/dada2qc/dada2steps_vs_relabundance.png"

}

# Summarize numbers of reads per step, makes some plots

info <- Sys.info();

message(stringr::str_c(names(info), " : ", info, "\n"))

message("loading packages")
library(magrittr)
library(tidyverse)
library(qs)

stats <- list()
stats[[1]] <- readr::read_tsv(arguments$filt_summary_file) %>%
  select(-end1, -end2)
stats[[2]] <- readr::read_tsv(arguments$derep_summary_file)
stats[[3]] <- qs::qread(arguments$final_asv_mat) %>%
  rowSums() %>%
  tibble::tibble(samples = names(.), nreads = .)

# change to full join so we see input files that were lost during any step
stats <- purrr::reduce(stats, purrr::partial(dplyr::full_join, by = "samples"))
# fill in 0s for final nreads for samples that were previously lost
stats %<>%
  dplyr::mutate(nreads = replace_na(nreads, 0))

stats %>%
  readr::write_tsv(arguments$nreads_file)

message("making figures")

order_steps <- stats %>%
  dplyr::select(-samples) %>%
  names()

rel_stats <- stats %>%
  dplyr::mutate(
    dplyr::across(
      -samples, list(~ . / raw), .names = "{.col}"))

# remove empties from relative change plot
# they don't get plotted anyway
rel_stats %<>%
  dplyr::filter(!is.nan(nreads))

make_plot <- function(stats, summary_fun = median, ...) {

  order_steps <- stats %>%
    dplyr::select(-samples) %>%
    names()

  stats %<>%
    tidyr::pivot_longer(-samples, names_to = "step", values_to = "val") %>%
    dplyr::mutate(step = factor(step, levels = order_steps))

  summary <- stats %>%
    dplyr::group_by(step) %>%
    dplyr::summarize(
      val = summary_fun(val, ...), .groups = "drop")

  stats %>%
    ggplot(aes(step, val)) + geom_boxplot() +
    geom_point(alpha = 0.25, shape = 21) +
    geom_line(aes(group = samples), alpha = 0.25) +
    theme_classic() +
    theme(
      legend.position = "none",
      strip.text.y = ggplot2::element_text(angle = -90, size = 10),
      panel.grid.minor.x = ggplot2::element_blank(),
      panel.grid.major.x = ggplot2::element_blank()) +
    geom_line(data = summary, aes(group = 1), linetype = 2, colour = "red") +
    labs(x = "step")

}

ggsave(
  filename = arguments$nreads_fig,
  plot = make_plot(stats) + labs(y = "# reads") +
   scale_y_continuous(labels = scales::comma_format(1)),
  width = 6,
  height = 4,
  units = "in")

ggsave(
  filename = arguments$preads_fig,
  plot = make_plot(rel_stats) + labs(y = "relative change") +
    scale_y_continuous(labels = scales::percent_format(1)),
  width = 6,
  height = 4,
  units = "in")
"Prepare mia object

Usage:
prepare_mia_object.R [<mia_file>] [--asv=<asv_file> --taxa=<taxa_file> --tree=<tree_file> --meta=<meta_file>] [--asv_prefix=<prefix> --log=<logfile> --config=<config> --cores=<cores>]
prepare_mia_object.R (-h|--help)
prepare_mia_object.R --version

Options:
-h --help    show this screen
--asv=<asv_file>    ASV matrix file
--taxa=<taxa_file>    Taxa file
--tree=<tree_file>    Tree file
--meta=<meta_file>    Metadata file
--log=<logfile>    name of the log file [default: logs/filter_and_trim.log]
--config=<config>    name of the config file [default: config/config.yaml]
--cores=<cores>    number of parallel CPUs [default: 8]" -> doc

library(docopt)

my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args,
  version = "prepare mia file V1")

if (!interactive()) {
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {

  arguments$asv <- "output/dada2/remove_chim/asv_mat_wo_chim.qs"
  arguments$taxa <- "output/taxa/kraken_merged/kraken_rdata.qs"
  arguments$tree <- "output/phylotree/newick/tree.nwk"
  arguments$meta <- "output/init/metadata.qs"
  arguments$asv_prefix <- "asv"

}

print(arguments)

info <- Sys.info();

message("loading packages")
library(magrittr)
library(tidyverse)
library(TreeSummarizedExperiment)
library(Biostrings)
library(BiocParallel)
library(ape)
library(mia)
library(qs)
library(scater)
library(yaml)
library(parallelDist)

stopifnot(
  file.exists(arguments$asv),
  file.exists(arguments$taxa),
  file.exists(arguments$tree),
  file.exists(arguments$meta)
)

if (!is.null(arguments$config)) stopifnot(file.exists(arguments$config))

asv <- qs::qread(arguments$asv)
taxa <- qs::qread(arguments$taxa)
tree <- ape::read.tree(arguments$tree)

if (tools::file_ext(arguments$meta) == "tsv") {
  meta <- readr::read_tsv(arguments$meta)
} else if (tools::file_ext(arguments$meta) == "qs") {
  meta <- qs::qread(arguments$meta)
}

cdata <- meta %>%
  as.data.frame() %>%
  tibble::column_to_rownames("key")

# clean samples
sample_names <- intersect(
  rownames(cdata),
  rownames(asv))

# clean ASVs
asv_sequences <- colnames(asv)
colnames(asv) <- str_c(arguments$asv_prefix, seq_along(asv_sequences),
  sep = "_")
names(asv_sequences) <- colnames(asv)
asv_sequences <- Biostrings::DNAStringSet(asv_sequences)

taxa %<>%
  dplyr::add_count(asv) %>%
  dplyr::filter(n == 1) %>%
  dplyr::select(asv, taxa, taxid, domain, phylum, class, order, family,
    genus, species, kraken_db)

asv_names <- intersect(names(asv_sequences), taxa$asv)

asv <- asv[, asv_names]
asv_sequences <- asv_sequences[asv_names, ]


tree$tip.label %<>%
  stringr::str_sub(2, nchar(.) - 1) # either qiime2 or ape is adding "'" at the 
  # start and end of ASV names

tree <- ape::keep.tip(tree, intersect(asv_names, tree$tip.label))

rdata <- taxa %>%
    as.data.frame() %>%
    tibble::column_to_rownames("asv")

out <- TreeSummarizedExperiment::TreeSummarizedExperiment(
  assays = list(counts = t(asv[sample_names,])),
  colData = cdata[sample_names, ],
  rowData = rdata,
  rowTree = tree)

metadata(out)[["date_processed"]] <- Sys.Date()
metadata(out)[["sequences"]] <- asv_sequences

fs::dir_create(dirname(arguments$mia_file))
qs::qsave(out, arguments$mia_file)
message("done!")
library(fastreeR)
library(ape)

fasta <- snakemake@input[["fasta"]]

message("computes the distance matrix between fasta sequences")
fasta_dist <- fastreeR::fasta2dist(fasta, kmer = 6)

message("converting distance to tree")
tree <- dist2tree(fasta_dist)
tree <- ape::read.tree(text = tree)

write.tree(tree, file = snakemake@output[["tree"]])
"Clean kraken2 results with blast

Usage:
clean_kraken_wblast.R [<taxa_table> <hits_table>] [--kraken=<kraken_file> --summary=<summary_file> --blast=<blast_file> --fasta=<fasta_file>] [--log=<logfile> --cores=<cores>]
clean_kraken_wblast.R (-h|--help)
clean_kraken_wblast.R --version

Options:
--log=<logfile>    name of the log file [default: ./parse_kraken.log]
--cores=<cores>    number of parallel CPUs [default: 8]" -> doc

library(docopt)

my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args,
  version = "clean kraken2 results using blast V1")

if (!interactive()) {
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {

  arguments$kraken <- here::here("output/taxa/kraken/k2_std/kraken_results.out")
  arguments$summary <- here::here(
      "output/taxa/kraken/k2_std/kraken_summary.out")
  arguments$blast <- here::here("output/taxa/blast/16S_ribosomal_RNA/blast_results.tsv")
  arguments$fasta <- here::here("output/taxa/fasta//asv_sequences.fa")

}



info <- Sys.info();
print(stringr::str_c(names(info), " : ", info, "\n"))

message("loading packages")
library(magrittr)
library(tidyverse)
# library(BiocParallel)

# bpp <- BiocParallel::MulticoreParam(workers = as.numeric(arguments$cores))

tax_order <- c("domain", "phylum",
    "class", "order", "family", "genus", "species")

kraken <- vroom::vroom(arguments$kraken,
  col_names = c("classified", "asv", "taxa", "length", "map"))

ksummary <- vroom::vroom(arguments$summary, col_names = c("taxa", "taxid")) %>%
  dplyr::select(-taxid) %>%
  tidyr::separate(taxa, into = tax_order, sep = "\\|", remove = FALSE,
    fill = "right")

blast <- vroom::vroom(arguments$blast,
  col_names = c("qseqid", "sseqid", "evalue", "bitscore",
    "score", "mismatch", "positive", "stitle", "qframe",
    "sframe", "length", "pident"))

# first remove sequences with width outside fasta sequences
fa <- ShortRead::readFasta(arguments$fasta)
width_vec <- BiocGenerics::width(fa)
width_range <- range(width_vec)

blast %<>%
  dplyr::filter(length >= width_range[1] & length <= width_range[2])

med_mismatches <- median(blast$mismatch)

blast %<>%
  dplyr::filter(mismatch <= med_mismatches) %>%
  dplyr::select(-qframe, -sframe)

kraken_taxids <- kraken %>%
  dplyr::count(taxa) %>%
  dplyr::mutate(
    taxid = stringr::str_split(taxa, "taxid"),
    only_taxa = purrr::map_chr(taxid, 1),
    only_taxa = stringr::str_remove(only_taxa, regex("\\(")),
    only_taxa = stringr::str_trim(only_taxa),
    taxid = purrr::map_chr(taxid, 2),
    taxid = stringr::str_trim(taxid),
    taxid = stringr::str_remove(taxid, "\\)"),
    taxid = as.numeric(taxid))

kraken %<>%
  dplyr::inner_join(
    dplyr::select(kraken_taxids, taxa, taxid), by = "taxa")

# map_to_tibble <- function(map) {

#   map %<>% stringr::str_split("\\:")
#   tibble::tibble(
#     taxid = as.numeric(purrr::map_chr(map, 1)),
#     bps = as.numeric(purrr::map_chr(map, 2)))

# }

# kraken %<>%
#   dplyr::mutate(
#     map_tib = stringr::str_split(map, " "),
#     map_tib = furrr::future_map(map_tib, map_to_tibble))

get_taxid <- function(taxa, kraken_taxids) {

  last_taxa <- stringr::str_split(taxa, "\\|")[[1]]
  last_taxa <- last_taxa[length(last_taxa)]
  taxa_word <- stringr::str_remove(last_taxa, regex("^[d|p|c|o|f|g|s]__"))
  out <- kraken_taxids %>%
    dplyr::filter(only_taxa == taxa_word)
  if (nrow(out) == 0) NA_real_ else min(out$taxid)

}

ksummary %<>%
  dplyr::mutate(
    taxid = purrr::map_dbl(taxa, get_taxid, kraken_taxids)) %>%
  dplyr::select(starts_with("tax"), everything())

# for a single ASV, collapse its blast hits into per-genus probabilities, match
# those genera to the kraken2 summary by approximate string distance, and
# return the fraction of blast-supported hits that agree with the kraken2
# assignment at each rank from domain to genus
check_hits_blast <- function(asv, taxa, taxid,
  blast, ksummary, kraken_taxids, min_prob = .1, max_dist = 2) {

  print(asv)

  if (is.null(blast)) {
    blast <- tibble::tibble()
  } else {
    blast %<>%
      dplyr::mutate(
        taxa = stringr::str_split(stitle, " "),
        taxa = purrr::map_chr(taxa, 1))

    nmatches <- nrow(blast)
    blast %<>%
      dplyr::count(taxa) %>%
      dplyr::mutate(prob = n / nmatches) %>%
      dplyr::filter(prob >= min_prob)
  }

  if (nrow(blast) > 0) {
    # get kraken id
    tid <- taxid
    ksummary_id <- ksummary %>%
      dplyr::filter(taxid == tid)

    # get blast id based on the genus
    ksummary_genus <- ksummary %>%
      dplyr::filter(!is.na(genus))
    ksummary_genus2 <- stringr::str_remove(ksummary_genus$genus, "g__")

    mmat <- stringdist::stringdistmatrix(blast$taxa,
      ksummary_genus2, method = "osa")
    rownames(mmat) <- blast$taxa

    mmat_search <- apply(mmat, 1, function(x)which(x < max_dist),
      simplify = FALSE)

    if (is.list(mmat_search)) {
      blast %<>%
        dplyr::mutate(
          taxonomy = mmat_search,
          taxonomy = purrr::map(taxonomy, ~ ksummary_genus[., ]),
          taxonomy = purrr::map(taxonomy, dplyr::select, -taxa, -species,
            -taxid),
          taxonomy = purrr::map(taxonomy, dplyr::distinct))
    } else {

      blast %<>%
        dplyr::mutate(
          taxonomy = list(ksummary_genus[mmat_search, ]),
          taxonomy = purrr::map(taxonomy, dplyr::select, -taxa, -species,
            -taxid),
          taxonomy = purrr::map(taxonomy, dplyr::distinct))

    }

    blast %<>%
      dplyr::mutate(
        prob = prob / sum(prob),
        d_hit = purrr::map_lgl(taxonomy,
          ~ ifelse(nrow(.) == 0, FALSE, .$domain == ksummary_id$domain)),
        p_hit = purrr::map_lgl(taxonomy,
          ~ ifelse(nrow(.) == 0, FALSE, .$phylum == ksummary_id$phylum)),
        c_hit = purrr::map_lgl(taxonomy,
          ~ ifelse(nrow(.) == 0, FALSE, .$class == ksummary_id$class)),
        o_hit = purrr::map_lgl(taxonomy,
          ~ ifelse(nrow(.) == 0, FALSE, .$order == ksummary_id$order)),
        f_hit = purrr::map_lgl(taxonomy,
          ~ ifelse(nrow(.) == 0, FALSE, .$family == ksummary_id$family)),
        g_hit = purrr::map_lgl(taxonomy,
          ~ ifelse(nrow(.) == 0, FALSE, .$genus == ksummary_id$genus)),
        across(
          where(is.logical),
          list(~ ifelse(is.na(.), 0, as.numeric(.))), .names = "{.col}"))

    blast %>%
      dplyr::select(ends_with("hit")) %>%
      colMeans()

  } else {

    rep(0, 6) %>%
      rlang::set_names(c("d_hit", "p_hit", "c_hit", "o_hit", "f_hit", "g_hit"))

  }

}


blast %<>%
  tidyr::nest(blast_pred = -c(qseqid)) %>%
  dplyr::rename(asv = qseqid)

kraken %<>%
  dplyr::left_join(blast, by = "asv")

kraken %<>%
  dplyr::mutate(
    hit_vec = purrr::pmap(list(asv, taxa, taxid, blast_pred),
      check_hits_blast, ksummary, kraken_taxids))

# outputs

# - accuracy hits
hits <- bind_rows(kraken$hit_vec) %>%
  dplyr::mutate(
    asv = kraken$asv,
    taxa = kraken$taxa,
    taxid = kraken$taxid) %>%
  dplyr::select(asv, taxid, tidyselect::everything())

# - rowData - asv and ksummary
rdata <- kraken %>%
  dplyr::select(asv, taxid) %>%
  dplyr::inner_join(ksummary, by = "taxid") %>%
  dplyr::select(asv, taxa, taxid, tidyselect::everything())


fs::dir_create(dirname(arguments$taxa_table))
rdata %>%
  qs::qsave(arguments$taxa_table)

fs::dir_create(dirname(arguments$hits_table))
hits %>%
  qs::qsave(arguments$hits_table)
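
A minimal sketch of a command-line invocation, assuming the default kraken, blast, and fasta paths from the interactive block above; the script location under workflow/scripts/ and the output .qs paths are illustrative:

    Rscript workflow/scripts/clean_kraken_wblast.R \
        output/taxa/kraken/k2_std/taxa_table.qs \
        output/taxa/kraken/k2_std/hits_table.qs \
        --kraken=output/taxa/kraken/k2_std/kraken_results.out \
        --summary=output/taxa/kraken/k2_std/kraken_summary.out \
        --blast=output/taxa/blast/16S_ribosomal_RNA/blast_results.tsv \
        --fasta=output/taxa/fasta/asv_sequences.fa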
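
extract_fasta.R writes the ASV sequences (the column names of the chimera-filtered ASV matrix) to a fasta file, naming each sequence with the supplied prefix and its column index; these are the ASV sequences that the kraken2 and blast steps classify.
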
"Extract fasta sequences

Usage:
extract_fasta.R [<fasta_file>] [<asv_mat_file>] [--prefix=<pref> --log=<logfile>]
extract_fasta.R (-h|--help)
extract_fasta.R --version

Options:
--prefix=<pref>    prefix to name the sequences [default: asv]
--log=<logfile>    name of the log file [default: extract_fa.log]" -> doc

library(docopt)

my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args,
  version = "extract fasta V1")

if (!interactive()) {
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {

  arguments$fasta_file <- "out.fasta"
  arguments$asv_mat_file <-
    "output/dada2/after_qc/asv_mat_wo_chim.qs"

}

info <- Sys.info();

print(arguments)
print(stringr::str_c(names(info), " : ", info, "\n"))

stopifnot(file.exists(arguments$asv_mat_file))

message("loading packages")
library(magrittr)
library(tidyverse)
library(Biostrings)
library(qs)

asvs <- qs::qread(arguments$asv_mat_file)
sequences <- colnames(asvs)
names(sequences) <- stringr::str_c(arguments$prefix,
  seq_along(sequences), sep = "_")

fs::dir_create(dirname(arguments$fasta_file))

sequences <- Biostrings::DNAStringSet(sequences)
Biostrings::writeXStringSet(sequences, filepath = arguments$fasta_file)
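
For example, to regenerate the ASV fasta from the chimera-filtered matrix (the script location is illustrative; the other paths match the defaults used above):

    Rscript workflow/scripts/extract_fasta.R \
        output/taxa/fasta/asv_sequences.fa \
        output/dada2/after_qc/asv_mat_wo_chim.qs \
        --prefix=asv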
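
The next script runs under Snakemake's script: directive rather than docopt, so it reads its inputs, params, and outputs from the snakemake object. It merges the per-database kraken2 hit and label tables with the filtered blast results and, following the kraken_db_merge hierarchy in the config, writes a single merged hits table and a merged taxa table.
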
stopifnot(
  all(file.exists(snakemake@input[["taxa"]])),
  all(file.exists(snakemake@input[["hits"]])),
  all(file.exists(snakemake@input[["blast"]])))

library(magrittr)
library(tidyverse)
library(vroom)
library(microbiome.misc)
library(progress)

# snakemake
rdata_files <- snakemake@input[["taxa"]]
hits_files <- snakemake@input[["hits"]]
blast_files <- snakemake@input[["blast"]]

config <- here::here(snakemake@params[["config"]]) %>%
  yaml::read_yaml()


# the blast database name is the parent directory of each blast results file
blast_dbs <- basename(dirname(blast_files))

# the configured blast format string starts with the outfmt code; drop it,
# strip the quotes, and rename the query id column to asv
format <- stringr::str_split(config[["blast"]][["format"]], " ")
format <- format[[1]][-1]
format <- stringr::str_remove_all(format, "\\'")
format[1] <- "asv"

blast_list <- tibble::tibble(blast_db = blast_dbs,
  blast = purrr::map(blast_files, vroom::vroom,
  col_names = format))

# the config stores the identity threshold as a fraction, while blast reports
# pident as a percentage
kraken_blast_min_perc <- config[["kraken_blast_min_perc"]] * 100

blast_list %<>%
  dplyr::mutate(
    blast = purrr::map(blast, dplyr::filter, pident >= kraken_blast_min_perc),
    blast = purrr::map(blast, tidyr::nest, blast = -c(asv)))

get_db <- microbiome.misc:::get_db

all_hits <- tibble::tibble(
  blast_db = get_db(hits_files, "blast"),
  kraken_db = get_db(hits_files, "kraken"),
  hits = purrr::map(hits_files, qs::qread))

all_labels <- tibble::tibble(
  blast_db = get_db(rdata_files, "blast"),
  kraken_db = get_db(rdata_files, "kraken"),
  hits = purrr::map(rdata_files, qs::qread))

blast_list %<>%
  tidyr::unnest(cols = c(blast))

db_hierarchy <- config[["kraken_db_merge"]]

message("subsetting kraken dbs: ",
  stringr::str_c(db_hierarchy, collapse = ", "))

all_hits %<>%
  dplyr::filter(kraken_db %in% db_hierarchy) %>%
  tidyr::unnest(cols = c(hits)) %>%
  tidyr::nest(hits = -c(blast_db, asv))

all_labels %<>%
  dplyr::filter(kraken_db %in% db_hierarchy) %>%
  tidyr::unnest(cols = c(hits)) %>%
  tidyr::nest(labels = -c(blast_db, asv))

merge_labels <- purrr::reduce(
  list(all_hits, all_labels, blast_list), dplyr::full_join,
    by = c("blast_db", "asv"))

pb <- progress::progress_bar$new(total = nrow(merge_labels))

match_asvs_blast_pbar <- function(hits, labels, blast, hierarchy) {
  pb$tick()
  microbiome.misc::match_asvs_blast(hits, labels, blast, hierarchy)

}


merge_labels %<>%
  dplyr::mutate(
    merged = purrr::pmap(list(hits, labels, blast),
      match_asvs_blast_pbar, config[["kraken_db_merge"]]))

if (length(unique(merge_labels$blast_db)) > 1) {
  stop("more than one blast database detected; merging labels across ",
    "multiple blast databases is not supported yet")
}

merge_labels %<>%
  dplyr::mutate(
    hits_merged = purrr::map(merged, "hits"),
    labels_merged = purrr::map(merged, "labels"))

merge_hits <- merge_labels %>%
  dplyr::select(asv, hits_merged) %>%
  tidyr::unnest(cols = c(hits_merged)) %>%
  dplyr::select(kraken_db, tidyselect::everything())

merge_hits %>%
  qs::qsave(snakemake@output[["hits"]])

merge_lbs <- merge_labels %>%
  dplyr::select(asv, labels_merged) %>%
  tidyr::unnest(cols = c(labels_merged)) %>%
  dplyr::select(kraken_db, tidyselect::everything())

merge_lbs %>%
  qs::qsave(snakemake@output[["taxa"]])
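
The remaining snippets are rule bodies from the workflow's Snakemake rules (the rule headers are not shown here): the clean-up rules that remove the quality-control, dada2, taxonomy, and phylogenetic-tree outputs, the mia rule that assembles the final experiment object with prepare_mia_object.R, and the rule that renders the rule-graph DAGs for the report.
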
shell:
  "rm -fr output/quality_control workflow/report/quality_profiles"
shell:
  "rm -fr output/dada2 workflow/report/dada2qc workflow/report/model"
shell:
  "rm -fr output/taxa"
shell:
  "rm -fr output/phylotree"
shell:
  """Rscript workflow/scripts/mia/prepare_mia_object.R \
    {output.mia} --asv={input.asv} --taxa={input.taxa} \
    --tree={input.tree} --meta={input.meta} \
    --asv_prefix={params.prefix} --log={log} \
    --config={params.config} --cores={threads}"""
shell:
  """
  snakemake --forceall --rulegraph mia | dot -Tpng > {output.mia_dag}
  snakemake --rulegraph dada2 | dot -Tpng > {output.dada2_dag}
  snakemake --rulegraph taxonomy | dot -Tpng > {output.taxonomy_dag}
  snakemake --rulegraph phylotree | dot -Tpng > {output.phylo_dag}
  """

Source repository: https://github.com/Ong-Research/microbiome-pipeline (microbiome-pipeline, version 1, MIT License)