A pipeline for processing 16S / ITS data


This workflow is an implementation of the popular DADA2 tool. I followed the steps in the DADA2 tutorial, but I used Kraken2 for ASV sequence classification instead of IDTaxa.

[figure: dada2]

The pipeline was inspired by Silas Kieser's (@silask) dada2 snakemake pipeline.

Authors

  • Rene Welch (@ReneWelch)

Install workflow

There are two steps to install the pipeline:

  1. Install snakemake and dependencies. Assuming conda is already installed and a copy of this repository has been downloaded, the pipeline can be installed with:

    conda env create -n {env_name} --file dependencies.yml
    # for example:
    conda env create -n microbiome --file dependencies.yml
    
  2. Install R packages. To install the R packages used by this pipeline, run:

    R CMD BATCH --vanilla ./install_r_packages.R
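
After both steps complete, activate the environment before running the workflow (assuming it was created with the name microbiome, as in the example above):

conda activate microbiome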
    

Troubleshooting conda and environment variables

If you have other versions of R and R user libraries elsewhere, you might encounter some problems with environment variables and conda. You may need to provide a local .bashrc file to place the conda path at the beginning of your $PATH environment variable.

You may not need to do this step.

source /etc/bash_completion.d/git
__conda_setup="$('/path/to/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/path/to/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/path/to/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/path/to/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
alias R='R --vanilla'
export PATH="/path/to/miniconda3/envs/microbiome/bin:$PATH"

Set up data and metadata

  1. Set up a data/ directory with one subdirectory per batch. The pipeline will look in the data/ directory to find subdirectories that contain fastq.gz files. You can specify different filtering and trimming parameters per batch, and dada2 will learn error rates per batch. So, you may wish to separate your files by sequencing batch and/or by sample type.

     mkdir data/
     mkdir data/batch01 [data/batch02 ... ]
    

    Next, populate the subdirectories with your fastq.gz (or fastq) files, making sure to include both R1 and R2. If your data are already located elsewhere in your file system, you can use symbolic links (symlinks) to avoid duplicating data:

     fqdir=/path/to/existing_fastq_files/
     batchdir=data/batch01
     ln -s ${fqdir}/*.fastq.gz ${batchdir} 
    
  2. Generate sample table. The input for the pipeline is the set of sequencing files, separated by batch, in the data/ directory. Running:

    Rscript ./prepare_sample_table.R
    

    will generate the samples.tsv file, which contains 4 tab-separated columns (the file will not actually contain the | characters shown here; the rows below the header are a hypothetical example):

    | batch | key | end1 | end2 |
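    | batch01 | sample_1 | data/batch01/sample_1_R1.fastq.gz | data/batch01/sample_1_R2.fastq.gz |
    | batch01 | sample_2 | data/batch01/sample_2_R1.fastq.gz | data/batch01/sample_2_R2.fastq.gz |
    | batch02 | sample_3 | data/batch02/sample_3_R1.fastq.gz | data/batch02/sample_3_R2.fastq.gz |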
    
  3. Set up sample metadata file. This is where you link sample IDs to variables of interest for your study (condition, body site, time point, etc.). As currently implemented, this is only used to augment the final TreeSummarizedExperiment file, which you will use for your own downstream analysis. You need to have a column key that matches the sample table. For example:

    | key | subject_id | body_site | time_point | ...
    

    The default path for this file is data/meta.tsv. If you have it named otherwise, edit the variable metadata in config/config.yaml to point to your file.

  4. Set up negative control mapping file. If you have negative control samples, you should make a table linking each study sample to its negative control kit(s). As currently implemented, this needs to be a qs file containing a tibble in this format:

    # A tibble: 3 x 3
      batch   key      kits
      <chr>   <chr>    <list>
    1 batch01 sample_1 <chr [3]>
    2 batch01 sample_2 <chr [1]>
    3 batch02 sample_3 <chr [1]>
    

    Each element of key is a study sample ID matching the sample table. Each element of kits is a character vector of sample keys for that sample's negative controls, e.g. c("sample_141", "sample_142", "sample_143").

    The default path for this file is data/negcontrols.qs. If you have it named otherwise, edit the variable negcontroltable in config/config.yaml to point to your file.
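
    For example, here is a minimal sketch of how this file could be built in R (using the tibble and qs packages from the pipeline's dependencies; the sample and kit IDs are hypothetical):

     library(tibble)
     library(qs)

     # one row per study sample; 'kits' is a list column holding the keys
     # of the negative control sample(s) associated with that sample
     negcontrols <- tibble(
       batch = c("batch01", "batch01", "batch02"),
       key   = c("sample_1", "sample_2", "sample_3"),
       kits  = list(
         c("sample_141", "sample_142", "sample_143"),
         c("sample_141"),
         c("sample_142")))

     qs::qsave(negcontrols, "data/negcontrols.qs")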

Set up configuration file

Now you will edit config.yaml to set up run parameters. At a minimum, you will need to make these edits:

  1. Confirm general parameters and file paths. Check that these variables are set to your satisfaction:

    # general parameters for run
    asv_prefix: "asv" # this will be the prefix for the ASV IDs
    sample_table: "samples.tsv" # this file is generated by prepare_sample_table.R
    negcontroltable: "data/negcontrols.qs" # manually generated
    metadata: "data/meta.tsv" # manually generated
    threads: 8 # max number of threads for a multithreaded snakemake rule, eg fastqc in quality_control
    
  2. Run sequence QC. If you have not already looked at your data, the first thing you will want to do is run FastQC to generate quality profiles.

    First, run snakemake -j{cores} sequence_qc to run FastQC, plot quality profiles, and build a MultiQC report. You can find the individual plots in workflow/report/quality_profiles and the full report in output/quality_control/multiqc.

    Next, inspect those profiles. You may want to use them to decide where you want to trim R1 and R2 and how to set minimum overlap.

  3. Edit parameters for each batch. Check the config.yaml file for the default filter_and_trim, merge_pairs, and qc_parameters specifications for each batch. Keep in mind the length of your 16S region (e.g., the V4 region is ~252 bp long) and whether you will have enough overlap after trimming. Look up other dada2 resources for more advice on how to choose these parameters.

    Here is an example for a 2x150 NextSeq run targeting the V4 region. This set of parameters keeps only full-length reads and requires an overlap of 25 base pairs; with 2x150 reads spanning a ~252 bp amplicon, about 300 - 252 = 48 bp of overlap is expected, so a 25 bp minimum is comfortable. Because the V4 region is only ~252 base pairs long, we suspect that ASVs outside that length range may be bimeras.

    filter_and_trim:
      batch01:
        truncQ: 0
        trimLeft: 0
        trimRight: 0
        maxLen: Inf
        minLen: 150
        maxN: 0
        minQ: 0
        maxEE: [2, 2]
    ...
    merge_pairs:
      batch01:
        minOverlap: 25
        maxMismatch: 0
    ...
    qc_parameters:
      negctrl_prop: 0.5
      max_length_variation: 5
      low_abundance_perc: 0.0001

    The defaults provided in the pipeline are from a 2x300 run targeting V3-V4. Here we truncate the ends of both reads because we noticed a significant loss of quality. Because we have longer reads, we can require more overlap: the V3-V4 amplicon is roughly 440-465 bp, so reads truncated to 280 and 250 bp still overlap by about 65-90 bp, comfortably above the 50 bp minimum.

    filter_and_trim:
      dust_dec2018:
        truncQ: 12
        truncLen: [280, 250]
        trimLeft: 0
        trimRight: 0
        maxLen: Inf
        minLen: 100
        maxN: 0
        minQ: 0
        maxEE: [2, 2]
    ...
    merge_pairs:
      dust_dec2018:
        minOverlap: 50
        maxMismatch: 0
    ...
    qc_parameters:
      negctrl_prop: 0.5
      max_length_variation: 50
      low_abundance_perc: 0.0001

Download references for sequence classification

Databases need to be downloaded from https://benlangmead.github.io/aws-indexes/k2. The four below are the ones used by default in the config file.

Here is an example of doing this using wget :

# download references
wdir="https://genome-idx.s3.amazonaws.com/kraken"
refdir="data/kraken_dbs"
mkdir -p $refdir
greengenes="16S_Greengenes13.5_20200326"
rdp="16S_RDP11.5_20200326"
silva="16S_Silva138_20200326"
minikraken="minikraken2_v2_8GB_201904"
for db in $greengenes $rdp $silva $minikraken;
do
 remote=${wdir}/${db}.tgz
 local=${refdir}/${db}.tgz
 wget $remote -O $local
 tar -xzf ${local} -C ${refdir}
done

Workflow commands

Replace {cores} with the maximum number of cores that you want to be used at one time.

The rules taxonomy and phylotree will take a little longer the first time they are run because they install conda environments.

  • snakemake -j{cores} runs everything

  • snakemake -j{cores} sequence_qc plots quality profiles and builds a multiqc report

  • snakemake -j{cores} [--nt] dada2 computes the ASV matrix from the different batches (use --nt option to keep intermediate files for troubleshooting)

  • snakemake -j{cores} taxonomy --use-conda labels the ASV sequences with kraken2. Databases need to be downloaded in advance from https://benlangmead.github.io/aws-indexes/k2

  • snakemake -j{cores} phylotree --use-conda computes the phylogenetic tree using qiime2's FastTree

  • snakemake -j{cores} mia prepares the TreeSummarizedExperiment containing all the data generated
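
For example, assuming 8 cores and that the Kraken2 databases above have already been downloaded, the steps can also be run one at a time in this order:

snakemake -j8 sequence_qc
snakemake -j8 dada2
snakemake -j8 taxonomy --use-conda
snakemake -j8 phylotree --use-conda
snakemake -j8 mia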

[figure: microbiome_pipeline]

Cite

dada2

Callahan, B., McMurdie, P., Rosen, M. et al. DADA2: High-resolution sample inference from Illumina amplicon data. Nat Methods 13, 581–583 (2016). https://doi.org/10.1038/nmeth.3869

Kraken2

Wood, D., Lu, J., Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biology 20, 257 (2019). https://doi.org/10.1186/s13059-019-1891-0

phyloseq

McMurdie, P., Holmes, S. phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLOS One 8, 4 (2013). https://doi.org/10.1371/journal.pone.0061217

qiime2

Bolyen, Evan, Jai Ram Rideout, Matthew R. Dillon, Nicholas A. Bokulich, Christian Abnet, Gabriel A. Al-Ghalith, Harriet Alexander, et al. 2018. “QIIME 2: Reproducible, Interactive, Scalable, and Extensible Microbiome Data Science.” e27295v2. PeerJ Preprints. https://doi.org/10.7287/peerj.preprints.27295v2 .

FastTree

Price, Morgan N., Paramvir S. Dehal, and Adam P. Arkin. 2010. “FastTree 2--Approximately Maximum-Likelihood Trees for Large Alignments.” PloS One 5 (3): e9490.

TreeSummarizedExperiment

Huang, Ruizhu, Charlotte Soneson, Felix G. M. Ernst, Kevin C. Rue-Albrecht, Guangchuang Yu, Stephanie C. Hicks, and Mark D. Robinson. 2020. “TreeSummarizedExperiment: A S4 Class for Data with Hierarchical Structure.” F1000Research 9 (October): 1246.

Code Snippets

shell:
  """Rscript workflow/scripts/dada2/filter_and_trim.R \
    {output.end1} {output.end2} {output.summary} \
    {params.sample_name} --end1={input.end1} --end2={input.end2} \
    --batch={params.batch} --log={log} --config={params.config}

    if [[ -s {output.summary} && ! -s {output.end1} && ! -s {output.end2} ]]; then
      echo "Filtering succeeded but all reads were removed. Creating temp placeholder files."
      touch {output.end1}
      touch {output.end2}
    fi
    """
shell:
  """Rscript workflow/scripts/dada2/learn_error_rates.R \
    --error_rates={output.mat} --plot_file={output.plot} \
    {input.filtered} --log={log} --batch={params.batch} \
    --config={params.config} --cores={threads}"""
shell:
  """
    if [[ -e {input.filt_end1} && -e {input.filt_end2} && ! -s {input.filt_end1} && ! -s {input.filt_end2} ]]; then
        touch {output.merge} 
        echo "Filtering succeeded, but no reads were left after filtering:"
        ls -l {input.filt_end1} {input.filt_end2}
        echo "Generating placeholder output file: {output.merge}"
    else 
      Rscript workflow/scripts/dada2/dereplicate_one_sample_pair.R \
      {output.merge} {params.sample_name} \
      {input.filt_end1} {input.filt_end2} \
      --end1_err={input.mat_end1} --end2_err={input.mat_end2} \
      --log={log} --batch={params.batch} --config={params.config}
    fi
    """
shell:
  """Rscript workflow/scripts/dada2/gather_derep_seqtab.R \
    {output.asv} {output.summary} {input.derep} \
    --batch={params.batch} --log={log} --config={params.config}"""
From line 106 of rules/dada2.smk
shell:
  """Rscript workflow/scripts/dada2/remove_chimeras.R \
  {output.asv} {output.summary} {output.matspy} \
  {input.asv} \
  --log={log} --config={params.config} --cores={threads}"""
From line 123 of rules/dada2.smk
shell:
  """Rscript workflow/scripts/dada2/filter_asvs.R \
    {output.seqtab_filt} \
    {output.plot_seqlength} {output.plot_seqabundance} \
    {input.seqtab} {input.negcontrol} \
    --log={log} --config={params.config} --cores={threads}"""
From line 142 of rules/dada2.smk
shell:
  """Rscript workflow/scripts/dada2/collect_tsv_files.R \
    {output.filt} {input.filt} \
    --log={log}"""
From line 157 of rules/dada2.smk
shell:
  """Rscript workflow/scripts/dada2/collect_tsv_files.R \
    {output.merg} {input.merg} \
    --log={log}"""
From line 171 of rules/dada2.smk
shell:
  """Rscript workflow/scripts/dada2/summarize_nreads.R \
    {output.nreads} {output.fig_step} {output.fig_step_rel} \
    {input.filt} {input.merg} {input.asvs} \
    --log={log}"""
From line 187 of rules/dada2.smk
shell:
  """qiime tools import \
    --input-path {input.fa} \
    --output-path {output.qza} \
    --type 'FeatureData[Sequence]'"""
shell:
  """qiime phylogeny align-to-tree-mafft-fasttree \
--i-sequences {input.qza} \
--o-alignment {output.alignment} \
--o-masked-alignment {output.mask_alignment} \
--o-tree {output.unroot_tree} \
--o-rooted-tree {output.root_tree}"""
shell:
  """qiime tools export \
    --input-path {input.root_tree} \
    --output-path output/phylotree/newick
  """
shell:
  """Rscript workflow/scripts/dada2/plot_quality_profiles.R \
    {output} --end1={input.end1} --end2={input.end2} \
    --log={log}"""
shell:
  """fastqc -o output/quality_control/fastqc -t {params.threads} {input.R1} {input.R2}"""
shell:
  """multiqc output/quality_control/fastqc -o output/quality_control/multiqc"""
shell:
  """Rscript workflow/scripts/taxonomy/extract_fasta.R \
    {output.fasta} {input.asv} \
    --prefix={params.prefix} --log={log}"""
shell:
  """kraken2 --db {input.ref} --threads {threads} \
    --output {output.out} --report {output.summary} \
    --confidence {params.confidence} \
    --classified-out {output.classified} \
    --unclassified-out {output.unclassified} {input.fasta}"""
shell:
  """Rscript workflow/scripts/taxonomy/parse_kraken_summary.R \
    {output.standard} {output.full} {input.kraken} \
   --log={log}"""
shell:
  """Rscript workflow/scripts/taxonomy/parse_kraken.R \
    {output.taxa} {output.summary} {input.kraken} \
    --map {input.mapfile} \
    --log={log} --cores={threads}"""
"Collect tsv files into one

Usage:
collect_tsv_files.R [<outfile>] [<infiles> ...] [--log=<logfile>]
collect_tsv_files.R (-h|--help)
collect_tsv_files.R --version

Options:
--log=<logfile>    name of the log file [default: collect.log]" -> doc

library(docopt)

my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args,
  version = "collect tsv files V1")

if (!interactive()) {
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {

  arguments$outfile <- "out.tsv"
  arguments$infiles <- list.files("output/dada2/summary",
    full.names = TRUE)

}

info <- Sys.info();

print(arguments)
print(stringr::str_c(names(info), " : ", info, "\n"))

message("loading packages")

library(magrittr)
library(tidyverse)
library(vroom)

infiles <- purrr::map(arguments$infiles, vroom::vroom)

out <- bind_rows(infiles)

fs::dir_create(dirname(arguments$outfile))
readr::write_tsv(out, file = arguments$outfile)
"Dereplicate one sample pair

Usage:
dereplicate_one_sample_pair.R [<sample_merge_file>] [<sample_name> <end1_file> <end2_file> --end1_err=<end1_err> --end2_err=<end2_err>] [--log=<logfile> --batch=<batch> --config=<cfile>]
dereplicate_one_sample_pair.R (-h|--help)
dereplicate_one_sample_pair.R --version

Options:
--end1_err=<end1>    name of the R1 error rates matrix
--end2_err=<end2>    name of the R2 error rates matrix
--log=<logfile>    name of the log file [default: dereplicate.log]
--batch=<batch>    name of the batch if any to get the filter and trim parameters
--config=<cfile>    name of the yaml file with the parameters [default: ./config/config.yaml]" -> doc

library(docopt)

my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args,
  version = "dereplicate one sample V1")

if (!interactive()) {
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {

  arguments$end1_file <-
    "output/dada2/filtered/sample_545_filtered_R1.fastq.gz"
  arguments$end2_file <-
    "output/dada2/filtered/sample_545_filtered_R2.fastq.gz"
  arguments$end1_err <-
    "output/dada2/model/dust_jun2021_error_rates_R1.qs"
  arguments$end2_err <-
    "output/dada2/model/dust_jun2021_error_rates_R2.qs"
  arguments$batch <- "dust_jun2021"
  arguments$sample_name <- "sample_545"

}


stopifnot(
  file.exists(arguments$end1_file),
  file.exists(arguments$end2_file),
  file.exists(arguments$end1_err),
  file.exists(arguments$end2_err))

if (!is.null(arguments$config)) stopifnot(file.exists(arguments$config))


info <- Sys.info();

print(arguments)
print(stringr::str_c(names(info), " : ", info, "\n"))

message("loading packages")

library(magrittr)
library(dada2)
library(qs)
library(yaml)

filter_fwd <- arguments$end1_file
filter_bwd <- arguments$end2_file

message("loading error rates")
err_fwd <- qs::qread(arguments$end1_err)
err_bwd <- qs::qread(arguments$end2_err)

message("dereplicating filtered files")

dereplicate_merge <- function(filtfwd, filtbwd, err_fwd, err_bwd,
  min_overlap = 12, max_mismatch = 0) {

  message("de-replicating end1 file")
  derep_fwd <- dada2::derepFastq(filtfwd)
  dd_fwd <- dada2::dada(derep_fwd, err = err_fwd)

  message("de-replicating end2 file")
  derep_bwd <- dada2::derepFastq(filtbwd)
  dd_bwd <- dada2::dada(derep_bwd, err = err_bwd)

  message("merging pairs")
  merge <- dada2::mergePairs(dd_fwd, derep_fwd, dd_bwd, derep_bwd,
      minOverlap = min_overlap, maxMismatch = max_mismatch)

  out <- list(
    "dada_fwd" = dd_fwd,
    "dada_bwd" = dd_bwd,
    "merge" = merge)

  return(out)
}

config <- yaml::read_yaml(arguments$config)$merge_pairs

if (!is.null(arguments$batch)) {
  stopifnot(arguments$batch %in% names(config))
  config <- config[[arguments$batch]]
} else {
  nms <- c("minOverlap", "maxMismatch")
  if (any(names(config) %in% nms)) {
    warning("will use first element instead")
    config <- config[[1]]
  }
}

merge <- dereplicate_merge(filter_fwd, filter_bwd, err_fwd, err_bwd,
  min_overlap = as.numeric(config[["minOverlap"]]),
  max_mismatch = as.numeric(config[["maxMismatch"]]))

qs::qsave(merge, arguments$sample_merge_file)
"Filter and trim

Usage:
filter_and_trim.R [<filter_end1> <filter_end2> <summary_file>] [<sample_name> --end1=<end1> --end2=<end2>] [--batch=<batch> --log=<logfile> --config=<cfile>]
filter_and_trim.R (-h|--help)
filter_and_trim.R --version

Options:
-h --help    show this screen
--end1=<end1>    name of the R1 end fastq.gz file
--end2=<end2>    name of the R2 end fastq.gz file
--log=<logfile>    name of the log file [default: logs/filter_and_trim.log]
--batch=<batch>    name of the batch if any to get the filter and trim parameters
--config=<cfile>    name of the yaml file with the parameters [default: ./config/config.yaml]" -> doc

library(docopt)

my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args, version = "filter_and_trim V1")

if (!interactive()) {
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {

  arguments$batch <- "dust_dec2018"
  arguments$sample_name <- "sample_20"
  arguments$end1 <-
    "data/dust_dec2018/190114_C75PR/203_S19_L001_R1_001.fastq.gz"
  arguments$end2 <-
    "data/dust_dec2018/190114_C75PR/203_S19_L001_R2_001.fastq.gz"

  arguments$filter_end1 <-
    "filtered_L001_R1_001.fastq.gz"
  arguments$filter_end2 <-
    "filtered_L001_R2_001.fastq.gz"
}

print(arguments)

info <- Sys.info();

message("loading dada2")
library(magrittr)
library(dada2)
library(qs)
library(yaml)
library(fs)

stopifnot(file.exists(arguments$config),
  file.exists(arguments$end1), file.exists(arguments$end2))

print(stringr::str_c(names(info), " : ", info, "\n"))
config <- yaml::read_yaml(arguments$config)
config <- config[["filter_and_trim"]]
print(config)

if (!is.null(arguments$batch)) {
  stopifnot(arguments$batch %in% names(config))
  config <- config[[arguments$batch]]
} else {
  nms <- c("truncQ", "truncLen", "trimLeft", "trimRight",
    "maxLen", "minLen", "maxN", "minQ", "maxEE")
  if (any(names(config) %in% nms)) {
    warning("will use first element instead")
    config <- config[[1]]
  }

}

fs::dir_create(unique(dirname(arguments$filter_end1)))
fs::dir_create(unique(dirname(arguments$filter_end2)))

track_filt <- dada2::filterAndTrim(
  arguments$end1, arguments$filter_end1,
  arguments$end2, arguments$filter_end2,
  truncQ =    as.numeric(config[["truncQ"]]),
  truncLen =  as.numeric(config[["truncLen"]]),
  trimLeft =  as.numeric(config[["trimLeft"]]),
  trimRight = as.numeric(config[["trimRight"]]),
  maxLen =    as.numeric(config[["maxLen"]]),
  minLen =    as.numeric(config[["minLen"]]),
  maxN =      as.numeric(config[["maxN"]]),
  minQ =      as.numeric(config[["minQ"]]),
  maxEE =     as.numeric(config[["maxEE"]]),
  rm.phix = TRUE,
  compress = TRUE,
  multithread = FALSE)

row.names(track_filt) <- arguments$sample_name
colnames(track_filt) <- c("raw", "filtered")

track_filt %>%
  as.data.frame() %>%
  tibble::as_tibble(rownames = "samples") %>%
  dplyr::mutate(
    end1 = arguments$end1,
    end2 = arguments$end2) %>%
  readr::write_tsv(arguments$summary_file)
"Filters the low quality ASVs

Usage:
filter_asvs.R [<asv_matrix_qc> <seqlength_fig> <abundance_fig>] [<asv_matrix_file> <negcontrol_file>] [--lysis --log=<logfile> --config=<cfile> --cores=<cores>]
filter_asvs.R (-h|--help)
filter_asvs.R --version

Options:
-h --help    show this screen
--log=<logfile>    name of the log file [default: filter_and_trim.log]
--config=<cfile>    name of the yaml file with the parameters [default: ./config/config.yaml]
--cores=<cores>    number of CPUs for parallel processing [default: 24]" -> doc

library(docopt)

my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args,
  version = "filter ASVs V1")

if (!interactive()) {
  fs::dir_create(dirname(arguments$log))
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {

  arguments$asv_matrix_qc <- "asv.qs"
  arguments$seqlength_fig <- "sl.png"
  arguments$abundance_fig <- "sa.png"
  arguments$asv_matrix_file <- "output/dada2/remove_chim/asv_mat_wo_chim.qs"
  arguments$negcontrol_file <- "data/negcontrols.qs"

}


message("arguments")
print(arguments)

info <- Sys.info();
print(stringr::str_c(names(info), " : ", info, "\n"))


message("loading packages")
library(magrittr)
library(tidyverse)
library(dada2)
library(qs)
library(yaml)

stopifnot(
  file.exists(arguments$asv_matrix_file),
  file.exists(arguments$config),
  file.exists(arguments$negcontrol_file))

message("starting with asvs in ", arguments$asv_matrix_file)

seqtab <- qs::qread(arguments$asv_matrix_file)
config <- yaml::read_yaml(arguments$config)[["qc_parameters"]]

message("filtering samples by negative controls")
message("using neg. control file ", arguments$negcontrol_file)
message("removing counts >= ", config[["negctrl_prop"]],
  " sum(neg_controls)")

neg_controls <- qs::qread(arguments$negcontrol_file)

if (arguments$lysis) {

  neg_controls %<>%
    dplyr::mutate(
      kits = map2(kits, lysis, c))

}

neg_controls %<>%
  dplyr::select(batch, key, kits)

subtract_neg_control <- function(name, neg_controls, seqtab, prop) {

  negs <- neg_controls

  negs <- negs[negs %in% rownames(seqtab)]
  out_vec <- seqtab[name, ]

  if (length(negs) > 1) {

    negs <- seqtab[negs, ]
    neg_vec <- colSums(negs)
    out_vec <- out_vec - prop * neg_vec

  } else if (length(negs) == 1) {

    neg_vec <- seqtab[negs, ]
    out_vec <- out_vec - prop * neg_vec
  }

  if (any(out_vec < 0)) {
    out_vec[out_vec < 0] <- 0
  }
  return(out_vec)

}

neg_controls %<>%
  dplyr::mutate(sample_vec = purrr::map2(key, kits,
    safely(subtract_neg_control),
    seqtab, config[["negctrl_prop"]]))

to_remove <- dplyr::filter(neg_controls,
  purrr::map_lgl(sample_vec, ~ !is.null(.$error)))

if (nrow(to_remove) > 0) {
  message("removed the samples:\n",
    stringr::str_c(to_remove$key, collapse = "\n"))
}

neg_controls %<>%
  dplyr::filter(purrr::map_lgl(sample_vec, ~ is.null(.$error))) %>%
  dplyr::mutate(
    sample_vec = purrr::map(sample_vec, "result"))

sample_names <- neg_controls %>%
  dplyr::pull(key)

seqtab_samples <- purrr::reduce(
  dplyr::pull(neg_controls, sample_vec), dplyr::bind_rows)
seqtab_samples %<>%
  as.matrix() %>%
  set_rownames(sample_names)

seqtab_nc <- seqtab[! rownames(seqtab) %in% sample_names, ]
seqtab_new <- floor(rbind(seqtab_samples, seqtab_nc))

# Length of sequences
message("filtering sequences by length")

seq_lengths <- nchar(dada2::getSequences(seqtab_new))

l_hist <- as.data.frame(table(seq_lengths)) %>%
  tibble::as_tibble() %>%
  rlang::set_names("length", "freq")

l_hist <- l_hist %>%
  ggplot(aes(x = length, y = freq)) +
    geom_col() +
    labs(title = "Sequence Lengths by SEQ Count") +
    theme_bw() +
    theme(
      axis.text = element_text(size = 6),
      axis.text.x = element_text(angle = 90, hjust = 1,
        vjust = 0.5, size = 10),
      axis.text.y = element_text(size = 10))

ggsave(
  filename = arguments$seqlength_fig,
  plot = l_hist,
  width = 8, height = 4, units = "in")


table2 <- tapply(colSums(seqtab_new), seq_lengths, sum)
table2 <- tibble::tibble(
  seq_length = names(table2),
  abundance = table2)

most_common_length <- dplyr::top_n(table2, 1, abundance) %>%
  dplyr::pull(seq_length) %>%
  as.numeric()

table2 <- table2 %>%
  ggplot(aes(x = seq_length, y = log1p(abundance))) +
    geom_col() +
    labs(title = "Sequence Lengths by SEQ Abundance") +
    theme_bw() +
    theme(
      axis.text = element_text(size = 6),
      axis.text.x = element_text(angle = 90, hjust = 1,
        vjust = 0.5, size = 10),
      axis.text.y = element_text(size = 10))

ggsave(
  filename = arguments$abundance_fig,
  plot = table2,
  width = 8, height = 4, units = "in")


max_diff <- config[["max_length_variation"]]

message("most common length: ", most_common_length)
message("removing sequences outside range ",
  " < ", most_common_length - max_diff, " or > ",
  most_common_length + max_diff)

right_length <- abs(seq_lengths - most_common_length) <= max_diff
seqtab_new <- seqtab_new[, right_length]


total_abundance <- sum(colSums(seqtab_new))
min_reads_per_asv <- ceiling(config[["low_abundance_perc"]] / 100 *
  total_abundance)

seqtab_abundance <- colSums(seqtab_new)

message("removing ASV with < ", min_reads_per_asv, " reads")
message("in total ", sum(seqtab_abundance < min_reads_per_asv))
seqtab_new <- seqtab_new[, seqtab_abundance >= min_reads_per_asv]


message("saving files in ", arguments$asv_matrix_qc)
qs::qsave(seqtab_new, arguments$asv_matrix_qc)
"Gather ASV table

Usage:
gather_derep_seqtab.R [<asv_file> <summary_file>] [<derep_file> ...] [--batch=<batch> --log=<logfile> --config=<cfile>]
gather_derep_seqtab.R (-h|--help)
gather_derep_seqtab.R --version

Options:
-h --help    show this screen
--log=<logfile>    name of the log file [default: gather_derep_seqtab.log]
--config=<cfile>    name of the yaml file with the parameters [default: ./config/config.yaml]" -> doc

library(docopt)

my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args,
  version = "gather dereplicated sequence table V1")

if (!interactive()) {
  fs::dir_create(dirname(arguments$log))
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {

  arguments$batch <- "batch2018"
  arguments$asv_file <- "batch2018_asv.qs"
  arguments$derep_file <- list.files(file.path("output", "dada2",
    "merge", arguments$batch), full.names = TRUE)

}

message("arguments")
print(arguments)

message("info")
info <- Sys.info();
print(stringr::str_c(names(info), " : ", info, "\n"))

message("loading packages")
library(magrittr)
library(tidyverse)
library(dada2)
library(qs)
library(yaml)

derep_files <- arguments$derep_file

stopifnot(any(file.exists(derep_files)), file.exists(arguments$config))

# Identify and remove empty files.
# These are placeholder files that were created
# to appease snakemake after all reads were
# filtered out from the input file back in filter_and_trim.
# stop if we have no files left after checking for empty.
derep_tb <- tibble(derep_file = derep_files) %>%
  dplyr::mutate(
    size = purrr::map_dbl(derep_file, ~ file.info(.x)$size),
    key = purrr::map_chr(derep_file, basename),
    key = stringr::str_remove(key, "_asv.qs"))

count_nonempty <- derep_tb %>%
  dplyr::filter(size > 0) %>%
  nrow()

stopifnot(count_nonempty > 0)

# read ASV files
read_merger <- function(filename, size) {
  if (size == 0) {
    NULL
  } else {
    qs::qread(filename)
  }
}

# changed the commands to be all inside the same mutate to improve readability
derep_tb %<>%
  dplyr::mutate(
    derep_merger = purrr::map2(derep_file, size, read_merger),
    mergers = purrr::map(derep_merger, "merge"),
    dada_fwd = purrr::map(derep_merger, "dada_fwd"),
    dada_bwd = purrr::map(derep_merger, "dada_bwd"))


config <- yaml::read_yaml(arguments$config)

stopifnot(file.exists(config$sample_table))

sample_tb <- readr::read_tsv(config$sample_table) %>%
  dplyr::left_join(derep_tb)

# now do some checking to make sure we only have
# the correct batch in the input
check_batches <- sample_tb %>%
  dplyr::filter(!is.na(derep_file)) %>%
  dplyr::count(batch)

if (nrow(check_batches) > 1) {
  message("User supplied input derep files from multiple batch(es):")
  message(paste(sprintf("%s: %d", check_batches$batch, check_batches$n),
    collapse = "\n"))
  message("We will only look at the requested batch: ", arguments$batch)
}

# remove other batches, but keep empty files for summary
sample_tb %<>%
  dplyr::filter(batch == arguments$batch)
nonempty_tb <- sample_tb %>%
  dplyr::filter(size > 0)
mergers <- pluck(nonempty_tb, "mergers") %>%
  setNames(pluck(nonempty_tb, "key"))
dada_fwd <- pluck(nonempty_tb, "dada_fwd")
dada_bwd <- pluck(nonempty_tb, "dada_bwd")

message("creating sequence table")
seqtab <- dada2::makeSequenceTable(mergers)

qs::qsave(seqtab, arguments$asv_file)

message("summarizing results")

## get N reads
## include lost samples with 0s
get_nreads <- function(x) {

  if (!is.null(x)) {
    sum(dada2::getUniques(x))
  } else {
    NA
  }
}

track <- sample_tb %>%
  dplyr::mutate(
    denoised = purrr::map(dada_fwd, get_nreads),
    merged = purrr::map(mergers, get_nreads)) %>%
  tidyr::unnest(c(denoised, merged)) %>%
  dplyr::transmute(samples = key, denoised, merged) %>%
  dplyr::mutate(across(c(denoised, merged), ~replace_na(.x, 0)))

track %>%
  readr::write_tsv(arguments$summary_file)

message("Done! summary file at ", arguments$summary_file)
"Learn error rates

Usage:
learn_error_rates.R [--error_rates=<matrix_file> --plot_file=<pfile>] [<filtered> ...] [--log=<logfile> --batch=<batch> --config=<cfile> --cores=<cores>]
learn_error_rates.R (-h|--help)
learn_error_rates.R --version

Options:
-h --help    show this screen
--error_rates=<matrix_file>    name of the file with the learned error rates matrix
--plot_file=<pfile>    name of the file to save the diagnostic plot
--log=<logfile>    name of the log file [default: error_rates.log]
--batch=<batch>    name of the batch if any to get the filter and trim parameters
--config=<cfile>    name of the yaml file with the parameters [default: ./config/config.yaml]
--cores=<cores>    number of CPUs for parallel processing [default: 4]" -> doc

library(docopt)
library(fs)


my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args, version = "error_rates V1")

if (!interactive()) {
  fs::dir_create(dirname(arguments$log))
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {

  library(magrittr)
  library(tidyverse)

  arguments$error_rates <- "error_rate_matrix.qs"
  arguments$plot_file <- "error_rates.png"
  arguments$batch <- "batch2018"
  arguments$filtered <- readr::read_tsv("samples.tsv") %>%
    filter(batch == arguments$batch) %>%
    pull(key)
  arguments$filtered <- as.character(glue::glue(
    "output/dada2/filtered/{sample}_filtered_R1.fastq.gz",
    sample = arguments$filtered))

}

stopifnot(any(file.exists(arguments$filtered)))
stopifnot(file.exists(arguments$config))


info <- Sys.info();
print(stringr::str_c(names(info), " : ", info, "\n"))
print(arguments)


# Identify and remove empty files.
# These are placeholder files that were created
# to appease snakemake after all reads were
# filtered out from the input file.
# stop if we have no files left after checking for empty.
sizes <- lapply(arguments$filtered, FUN = function(x) file.info(x)$size)
sizes <- setNames(sizes, arguments$filtered)
nonempty <- names(sizes[sizes > 0])
stopifnot(length(nonempty) > 0)

message("loading packages")
library(dada2)
library(ggplot2)
library(qs)
library(yaml)

config <- yaml::read_yaml(arguments$config)$error_rates
print(config)

if (!is.null(arguments$batch)) {
  stopifnot(arguments$batch %in% names(config))
  config <- config[[arguments$batch]]
} else {
  nms <- c("learn_nbases")
  if (any(names(config) %in% nms)) {
    warning("will use first element instead")
    config <- config[[1]]
  }

}

print("computing error rates")
errs <- dada2::learnErrors(
  nonempty, #arguments$filtered,
  nbases = as.numeric(config$learn_nbases),
  multithread = as.numeric(arguments$cores), randomize = TRUE)

fs::dir_create(dirname(arguments$error_rates))
qs::qsave(errs, file = arguments$error_rates)

print("plotting diagnostics")
err_plot <- dada2::plotErrors(errs, nominalQ = TRUE)

fs::dir_create(dirname(arguments$plot_file))
ggplot2::ggsave(
  filename = arguments$plot_file,
  plot = err_plot,
  width = 20,
  height = 20,
  units = "cm")
"Plot quality profiles

Usage:
  plot_quality_profiles.R <plot_file> [--end1=<end1> --end2=<end2> --log=<logfile>]
  plot_quality_profiles.R (-h|--help)
  plot_quality_profiles.R --version

Options:
-h --help    show this screen
--log=<logfile>    name of the log file [default: logs/plot_qc_profiles.log]" -> doc

my_args <- commandArgs(trailingOnly = TRUE)

if (interactive()) {
  my_args <- c("qc_profile.png", "--end1=end1.fastq.gz",
    "--end2=end2.fastq.gz")
}

library(docopt, quietly = TRUE)

arguments <- docopt(doc, args = my_args, version = "plot_quality_profiles V1")
print(arguments)

if (!interactive()) {
  log_file <- file(arguments$log, open = "wt")
  sink(log_file)
  sink(log_file, type = "message")
}

message("system info")
print(Sys.info())

message("arguments")
print(arguments)

stopifnot(file.exists(arguments$end1), file.exists(arguments$end2))

message("loading R packages")
library(dada2, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(purrr, quietly = TRUE)
library(cowplot, quietly = TRUE)

message("making plots")

plot_fun <- purrr::safely(dada2::plotQualityProfile)

r1_plot <- plot_fun(arguments$end1)
r2_plot <- plot_fun(arguments$end2)

if (is.null(r1_plot$error)) {
  r1_plot <- r1_plot$result
} else {
  message(r1_plot$error)
  r1_plot <- ggplot()
}

if (is.null(r2_plot$error)) {
  r2_plot <- r2_plot$result
} else {
  message(r2_plot$error)
  r2_plot <- ggplot()
}

r1_plot <- r1_plot +
  cowplot::theme_minimal_grid() +
  theme(strip.background = element_blank())

r2_plot <- r2_plot +
  cowplot::theme_minimal_grid() +
  theme(strip.background = element_blank())


message("saving plots")
final_plot <- cowplot::plot_grid(r1_plot, r2_plot, nrow = 1)

ggsave(filename = arguments$plot_file, plot = final_plot, width = 8,
  height = 4, units = "in", bg = "white")
"Merge sequence tables and remove chimeras

Usage:
remove_chimeras.R [<asv_merged_file> <summary_file> <fig_file>] [<asv_file> ...] [--log=<logfile> --config=<cfile> --cores=<cores>]
remove_chimeras.R (-h|--help)
remove_chimeras.R --version

Options:
-h --help    show this screen
--log=<logfile>    name of the log file [default: remove_chimeras.log]
--config=<cfile>    name of the yaml file with the parameters [default: ./config/config.yaml]
--cores=<cores>    number of CPUs for parallel processing [default: 24]" -> doc

library(docopt)

my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args,
  version = "remove chimeras V1")

if (!interactive()) {
  fs::dir_create(dirname(arguments$log))
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {

  arguments$fig_file <- "matspy.png"
  arguments$asv_merged_file <- "asv.qs"
  arguments$summary_file <- "summary.qs"
  arguments$asv_file <- list.files(file.path("output", "dada2",
    "asv_batch"), full.names = TRUE, pattern = "asv")

}


message("arguments")
print(arguments)

## merges different ASV tables, and removes chimeras

message("info")
info <- Sys.info();
print(stringr::str_c(names(info), " : ", info, "\n"))

message("loading packages")
library(magrittr)
library(tidyverse)
library(dada2)
library(qs)
library(yaml)
library(ComplexHeatmap)

stopifnot(any(file.exists(arguments$asv_file)))

if (!is.null(arguments$config)) stopifnot(file.exists(arguments$config))

config <- yaml::read_yaml(arguments$config)
sample_table <- readr::read_tsv(config$sample)
config <- config$remove_chimeras

seqtab_list <- purrr::map(arguments$asv_file, qs::qread)

if (length(seqtab_list) > 1) {
  seqtab_all <- dada2::mergeSequenceTables(tables = seqtab_list)
} else {
  seqtab_all <- seqtab_list[[1]]
}

# Remove chimeras
message("removing chimeras")
seqtab <- dada2::removeBimeraDenovo(
  seqtab_all,
  method = config[["chimera_method"]],
  minSampleFraction = config[["minSampleFraction"]],
  ignoreNNegatives = config[["ignoreNNegatives"]],
  minFoldParentOverAbundance = config[["minFoldParentOverAbundance"]],
  allowOneOf = config[["allowOneOf"]],
  minOneOffParentDistance = config[["minOneOffParentDistance"]],
  maxShift = config[["maxShift"]],
  multithread = as.numeric(arguments$cores))

fs::dir_create(dirname(arguments$asv_merged_file))
qs::qsave(seqtab, arguments$asv_merged_file)

out <- tibble::tibble(samples = row.names(seqtab),
  nonchim = rowSums(seqtab))

fs::dir_create(dirname(arguments$summary_file))
out %>%
  readr::write_tsv(arguments$summary_file)

fs::dir_create(dirname(arguments$fig_file))

outmat <- seqtab[, seq_len(min(1e4, ncol(seqtab)))]
outmat <- ifelse(outmat > 0, 1, 0)
cols <- structure(c("black", "white"), names = c("1", "0"))


# put row annotation in same order as matrix
# will remove samples that were completely empty
# after filter_and_trim
anno_df <- sample_table %>%
    select(batch, key) %>%
    as.data.frame() %>%
    column_to_rownames("key")
anno_df <- anno_df[rownames(outmat), , drop = F]

annot <- ComplexHeatmap::rowAnnotation(
  df = anno_df,
  annotation_legend_param = list(
    batch = list(direction = "horizontal")))

png(filename = arguments$fig_file, width = 10, height = 8, units = "in",
  res = 1200)
hm <- ComplexHeatmap::Heatmap(outmat,
  left_annotation = annot,
  col = cols, name = "a", show_row_dend = FALSE, show_row_names = FALSE,
  show_column_dend = FALSE, show_column_names = FALSE,
  cluster_columns = FALSE, cluster_rows = FALSE,
  show_heatmap_legend = FALSE, use_raster = TRUE,
  column_title = "Top 10K ASVs",
  heatmap_legend_param = list(direction = "horizontal"))
draw(hm, annotation_legend_side = "top", heatmap_legend_side = "top",
  merge_legend = TRUE)
dev.off()
"Summarized the # of reads per processing step

Usage:
summarize_nreads.R [<nreads_file> <nreads_fig> <preads_fig>] [<filt_summary_file> <derep_summary_file> <final_asv_mat>] [--log=<logfile>]
summarize_nreads.R (-h|--help)
summarize_nreads.R --version

Options:
-h --help    show this screen
--log=<logfile>    name of the log file [default: summarize_nreads.log]" -> doc

library(docopt)

my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args,
  version = "summarize # reads V1")

if (!interactive()) {
  fs::dir_create(dirname(arguments$log))
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {

  arguments$filt_summary_file <-
    "output/dada2/filtered/all_sample_summary.tsv"
  arguments$derep_summary_file <-
    "output/dada2/asv_batch/all_sample_summary.tsv"
  arguments$final_asv_mat <-
    "output/dada2/after_qc/asv_mat_wo_chim.qs"

  arguments$nreads_file <-
    "output/dada2/stats/Nreads_dada2.txt"
  arguments$nreads_fig <-
    "workflow/report/dada2qc/dada2steps_vs_abundance.png"
  arguments$preads_fig <-
    "workflow/report/dada2qc/dada2steps_vs_relabundance.png"

}

# Summarize numbers of reads per step, makes some plots

info <- Sys.info();

message(stringr::str_c(names(info), " : ", info, "\n"))

message("loading packages")
library(magrittr)
library(tidyverse)
library(qs)

stats <- list()
stats[[1]] <- readr::read_tsv(arguments$filt_summary_file) %>%
  select(-end1, -end2)
stats[[2]] <- readr::read_tsv(arguments$derep_summary_file)
stats[[3]] <- qs::qread(arguments$final_asv_mat) %>%
  rowSums() %>%
  tibble::tibble(samples = names(.), nreads = .)

# change to full join so we see input files that were lost during any step
stats <- purrr::reduce(stats, purrr::partial(dplyr::full_join, by = "samples"))
# fill in 0s for final nreads for samples that were previously lost
stats %<>%
  dplyr::mutate(nreads = replace_na(nreads, 0))

stats %>%
  readr::write_tsv(arguments$nreads_file)

message("making figures")

order_steps <- stats %>%
  dplyr::select(-samples) %>%
  names()

rel_stats <- stats %>%
  dplyr::mutate(
    dplyr::across(
      -samples, list(~ . / raw), .names = "{.col}"))

# remove empties from relative change plot
# they don't get plotted anyway
rel_stats %<>%
  dplyr::filter(!is.nan(nreads))

make_plot <- function(stats, summary_fun = median, ...) {

  order_steps <- stats %>%
    dplyr::select(-samples) %>%
    names()

  stats %<>%
    tidyr::pivot_longer(-samples, names_to = "step", values_to = "val") %>%
    dplyr::mutate(step = factor(step, levels = order_steps))

  summary <- stats %>%
    dplyr::group_by(step) %>%
    dplyr::summarize(
      val = summary_fun(val, ...), .groups = "drop")

  stats %>%
    ggplot(aes(step, val)) + geom_boxplot() +
    geom_point(alpha = 0.25, shape = 21) +
    geom_line(aes(group = samples), alpha = 0.25) +
    theme_classic() +
    theme(
      legend.position = "none",
      strip.text.y = ggplot2::element_text(angle = -90, size = 10),
      panel.grid.minor.x = ggplot2::element_blank(),
      panel.grid.major.x = ggplot2::element_blank()) +
    geom_line(data = summary, aes(group = 1), linetype = 2, colour = "red") +
    labs(x = "step")

}

ggsave(
  filename = arguments$nreads_fig,
  plot = make_plot(stats) + labs(y = "# reads") +
   scale_y_continuous(labels = scales::comma_format(1)),
  width = 6,
  height = 4,
  units = "in")

ggsave(
  filename = arguments$preads_fig,
  plot = make_plot(rel_stats) + labs(y = "relative change") +
    scale_y_continuous(labels = scales::percent_format(1)),
  width = 6,
  height = 4,
  units = "in")
"Prepare mia object

Usage:
prepare_mia_object.R [<mia_file>] [--asv=<asv_file> --taxa=<taxa_file> --tree=<tree_file> --meta=<meta_file>] [--asv_prefix=<prefix> --log=<logfile> --config=<config> --cores=<cores>]
prepare_mia_object.R (-h|--help)
prepare_mia_object.R --version

Options:
-h --help    show this screen
--asv=<asv_file>    ASV matrix file
--taxa=<taxa_file>    Taxa file
--tree=<tree_file>    Tree file
--meta=<meta_file>    Metadata file
--log=<logfile>    name of the log file [default: logs/filter_and_trim.log]
--config=<config>    name of the config file [default: config/config.yaml]
--cores=<cores>    number of parallel CPUs [default: 8]" -> doc

library(docopt)

my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args,
  version = "prepare mia file V1")

if (!interactive()) {
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {

  arguments$asv <- "output/dada2/after_qc/asv_mat_wo_chim.qs"
  arguments$taxa <- "output/taxa/kraken/minikraken/kraken_taxatable.qs"
  arguments$tree <- "output/phylotree/newick/tree.nwk"
  arguments$meta <- "data/meta.tsv"
  arguments$asv_prefix <- "HSD2M"

}

print(arguments)

info <- Sys.info();

message("loading packages")
library(magrittr)
library(tidyverse)
library(TreeSummarizedExperiment)
library(Biostrings)
library(BiocParallel)
library(ape)
library(mia)
library(qs)
library(scater)
library(yaml)
library(parallelDist)

stopifnot(
  file.exists(arguments$asv),
  file.exists(arguments$taxa),
  file.exists(arguments$tree),
  file.exists(arguments$meta)
)

if (!is.null(arguments$config)) stopifnot(file.exists(arguments$config))

asv <- qs::qread(arguments$asv)
taxa <- qs::qread(arguments$taxa)
tree <- ape::read.tree(arguments$tree)
meta <- readr::read_tsv(arguments$meta)

asv_sequences <- colnames(asv)
colnames(asv) <- str_c(arguments$asv_prefix, seq_along(asv_sequences),
  sep = "_")
names(asv_sequences) <- colnames(asv)
asv_sequences <- Biostrings::DNAStringSet(asv_sequences)

asv_aux <- tibble::tibble(asv = colnames(asv))


# need to fix parse_taxa to return a tibble of length 
# equal to all the sequences and not only the sequences with known information

cdata <- meta %>%
  as.data.frame() %>%
  tibble::column_to_rownames("key")

out <- TreeSummarizedExperiment::TreeSummarizedExperiment(
  assays = list(counts = t(asv)),
  colData = cdata[rownames(asv), ], # need to make sure correct order
  rowData = asv_aux %>%
    dplyr::left_join(taxa, by = "asv") %>%
    as.data.frame() %>%
    tibble::column_to_rownames("asv"),
  rowTree = tree)

bpp <- BiocParallel::MulticoreParam(workers = as.numeric(arguments$cores))

message("estimate diversity")
out <- mia::estimateDiversity(out, abund_values = "counts", BPPARAM = bpp)

message("estimate richness")
out <- mia::estimateRichness(out, abund_values = "counts", BPPARAM = bpp)

if (!is.null(arguments$config)) {

  config <- yaml::read_yaml(arguments$config)[["beta"]]

}

compute_mds_wrap <- function(mia, dist_name, altexp_name, ncomps) {

  message("computing beta diversity for ", dist_name)

  if (dist_name %in% c("bray", "euclidean", "hellinger", "mahalanobis",
    "manhattan", "bhjattacharyya", "canberra", "chord")) {

    mia <- scater::runMDS(
      mia, FUN = parallelDist::parDist,
      method = dist_name, threads = as.numeric(arguments$cores),
      name = altexp_name, ncomponents = ncomps, exprs_values = "counts",
      keep_dist = FALSE)

  } else if (dist_name == "fjaccard") {

    mia <- scater::runMDS(
      mia, FUN = parallelDist::parDist,
      method = "fJaccard", threads = as.numeric(arguments$cores),
      name = altexp_name, ncomponents = ncomps, exprs_values = "counts",
      keep_dist = FALSE)

  } else if (dist_name == "unifrac") {

    unifrac <- mia::calculateUniFrac(
      x = t(counts(mia)), tree = rowTree(mia),
      weighted = FALSE, normalized = FALSE, BPPARAM = bpp)

    pcoa <- stats::cmdscale(unifrac, k = ncomps)
    SingleCellExperiment::reducedDim(mia, altexp_name) <- pcoa


  } else if (dist_name == "w_unifrac") {
    unifrac <- mia::calculateUniFrac(
      x = t(counts(mia)), tree = rowTree(mia),
      weighted = TRUE, normalized = FALSE, BPPARAM = bpp)

    pcoa <- stats::cmdscale(unifrac, k = ncomps)
    SingleCellExperiment::reducedDim(mia, altexp_name) <- pcoa
  } else if (dist_name == "w_unifrac_norm") {

    unifrac <- mia::calculateUniFrac(
      x = t(counts(mia)), tree = rowTree(mia),
      weighted = TRUE, normalized = TRUE, BPPARAM = bpp)
    pcoa <- stats::cmdscale(unifrac, k = ncomps)
    SingleCellExperiment::reducedDim(mia, altexp_name) <- pcoa

  } else {
    stop(dist_name, " distance not available")
  }
  mia

}



if (!is.null(config$beta_div)) {
  for (div in config$beta_div) {
    out <- compute_mds_wrap(out, div, str_c("all_", div), config$comps)
  }
}

if (any(config$beta_group != "all")) {

  beta_group <- config$beta_group
  beta_group <- beta_group[beta_group != "all"]
  grouped <- purrr::map(beta_group,
    ~ mia::agglomerateByRank(out, rank = ., na.rm = FALSE))
  names(grouped) <- beta_group

  config$beta_div <- config$beta_div[!str_detect(config$beta_div, "unifrac")]

  for (gg in names(grouped)) {
  message(gg)
    for (div in config$beta_div) {
      grouped[[gg]] <- compute_mds_wrap(grouped[[gg]], div,
        str_c(gg, div, sep = "_"), config$comps)
    }
  }

  SingleCellExperiment::altExps(out) <- grouped
}

metadata(out)[["date_processed"]] <- Sys.Date()
metadata(out)[["sequences"]] <- asv_sequences

fs::dir_create(dirname(arguments$mia_file))
qs::qsave(out, arguments$mia_file)
message("done!")
"Extract fasta sequences

Usage:
extract_fasta.R [<fasta_file>] [<asv_mat_file>] [--prefix=<pref> --log=<logfile>]
extract_fasta.R (-h|--help)
extract_fasta.R --version

Options:
--prefix=<pref>    prefix to name the sequences [default: asv]
--log=<logfile>    name of the log file [default: extract_fa.log]" -> doc

library(docopt)

my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args,
  version = "extract fasta V1")

if (!interactive()) {
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {

  arguments$fasta_file <- "out.fasta"
  arguments$asv_mat_file <-
    "output/dada2/after_qc/asv_mat_wo_chim.qs"

}

info <- Sys.info();

print(arguments)
print(stringr::str_c(names(info), " : ", info, "\n"))

stopifnot(file.exists(arguments$asv_mat_file))

message("loading packages")
library(magrittr)
library(tidyverse)
library(Biostrings)
library(qs)

asvs <- qs::qread(arguments$asv_mat_file)
sequences <- colnames(asvs)
names(sequences) <- stringr::str_c(arguments$prefix,
  seq_along(sequences), sep = "_")

fs::dir_create(dirname(arguments$fasta_file))

sequences <- Biostrings::DNAStringSet(sequences)
Biostrings::writeXStringSet(sequences, filepath = arguments$fasta_file)
"Parse kraken2 results

Usage:
parse_kraken.R [<taxa_table> <taxa_summary>] [<kraken_file>] [--map=<map_file> --log=<logfile> --cores=<cores>]
parse_kraken.R (-h|--help)
parse_kraken.R --version

Options:
--log=<logfile>    name of the log file [default: ./parse_kraken.log]
--map=<map_file>   name of file with map from taxid to taxonomy [default: NULL]
--cores=<cores>    number of parallel CPUs [default: 8]" -> doc

library(docopt)

my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args,
  version = "parse kraken2 results V1")

if (!interactive()) {
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {

  db = "silva"
  arguments$taxa_table <- "parse_output"
  arguments$taxa_summary <- "parse_output_summary"
  arguments$kraken_file <- sprintf("output/taxa/kraken/%s/kraken_results.out", db)
  arguments$map <- "output_test"
}


info <- Sys.info()
print(stringr::str_c(names(info), " : ", info, "\n"))

message("loading packages")
library(magrittr)
library(tidyverse)
library(taxizedb)
library(BiocParallel)

bpp <- BiocParallel::MulticoreParam(workers = as.numeric(arguments$cores))

tax_order <- c("domain","superkingdom", "kingdom", "phylum", 
    "class", "order", "family", "genus", "species")

# mapping file provided (docopt returns the literal string "NULL" when the default is used)
if (!is.null(arguments$map) && arguments$map != "NULL") {

  message("reading labels from mapping file ", arguments$map)

  # taxonomy info
  map_tb <- readr::read_tsv(arguments$map)
  # ASV assignments
  kraken <- readr::read_tsv(arguments$kraken_file, 
    col_names = c("status", "asv", "taxid", "seq_len", "lca_mapping"))
  labels <- kraken %>%
    dplyr::select(asv, taxid) %>%
    dplyr::left_join(map_tb) %>%
    dplyr::select(-taxname, -taxid)

} else {
  message("getting ncbi database")
  ncbi_db <- taxizedb::db_download_ncbi()

  message("reading labels from kraken")
  kraken <- readr::read_tsv(arguments$kraken_file, 
    col_names = c("status", "asv", "id", "seq_length", "id_bp"))

  get_taxa_from_id <- function(results) {
    query <- taxizedb::classification(results$id, db = "ncbi")

    results %>%
      dplyr::mutate(
        id_taxa = map(query, list),
        id_taxa = purrr::map(id_taxa, ~ unique(.[[1]])),
        is_na = ! purrr::map_lgl(id_taxa, is.data.frame))

  }

  clean_id_taxa <- function(id_taxa, taxa) {

    name <- NULL

    if (is.data.frame(id_taxa)) {
      id_taxa %<>%
        dplyr::filter(rank %in% taxa) %>%
        dplyr::select(-id)
      id_taxa %<>%
        tidyr::pivot_wider(names_from = rank, values_from = name)
    }
    id_taxa
  }

  clean_id_taxa_wrap <- function(labels, taxa, bpp) {

    labels %<>%
      dplyr::mutate(
        id_taxa_clean = BiocParallel::bplapply(id_taxa, clean_id_taxa, taxa,
          BPPARAM = bpp))

    # any_of will allow for missing taxonomic levels (eg domain)
    labels %>%
      dplyr::select(asv, id_taxa_clean) %>%
      tidyr::unnest(cols = c(id_taxa_clean)) %>%
      dplyr::select(asv, tidyselect::any_of(taxa)) 

  }

  message("parsing labels")
  labels <- get_taxa_from_id(kraken) 
  labels <- clean_id_taxa_wrap(labels, tax_order, bpp)  
  # after the cleaning, we have a table like this:
  #    asv    phylum         class           order             family  genus species
  #   <chr>  <chr>          <chr>           <chr>             <chr>   <chr> <chr>  
  #  1 asv_1  Chordata       Mammalia        Artiodactyla      Cervid… Cerv… Cervus…
  #  2 asv_2  Chordata       Mammalia        Artiodactyla      Cervid… Cerv… Cervus…
  #  3 asv_3  Actinobacteria Actinomycetia   Pseudonocardiales Pseudo… Sacc… Saccha…
}

taxa_summary <- labels %>%
  tidyr::pivot_longer(tidyselect::any_of(tax_order), 
    names_to = "taxa", values_to = "value") %>%
  dplyr::mutate(
    taxa = factor(taxa, levels = tax_order)) %>%
  dplyr::group_by(taxa) %>%
  dplyr::summarize(
    total = length(value),
    label = sum(!is.na(value)), .groups = "drop") %>%
  dplyr::mutate(perc = label / total)

message("saving results")

fs::dir_create(dirname(arguments$taxa_table))
fs::dir_create(dirname(arguments$taxa_summary))

labels %>%
  qs::qsave(arguments$taxa_table)

taxa_summary %>%
  readr::write_tsv(arguments$taxa_summary)
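
`parse_kraken.R` saves the per-ASV taxonomy table as a `qs` file and the per-rank labeling summary as a TSV. A sketch of how these outputs might be inspected; the paths are illustrative and stand in for whatever was passed as `<taxa_table>` and `<taxa_summary>`:

    # inspect the parsed kraken2 assignments (paths are illustrative)
    library(qs)
    library(readr)
    library(dplyr)

    labels <- qs::qread("output/taxa/kraken/silva/taxa_table.qs")
    dplyr::glimpse(labels)    # one row per ASV, one column per taxonomic rank

    taxa_summary <- readr::read_tsv("output/taxa/kraken/silva/taxa_summary.tsv")
    taxa_summary              # fraction of ASVs labeled at each rank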
"Parse kraken2 summary into taxonomy table

Usage:
parse_kraken_summary.R [<outfile_standard> <outfile_full> <kraken_summary>] [--log=<logfile>]
parse_kraken_summary.R (-h|--help)
parse_kraken_summary.R --version

Options:
--log=<logfile>    name of the log file [default: ./parse_kraken_summary.log]" -> doc

library(docopt)

my_args <- commandArgs(trailingOnly = TRUE)

arguments <- docopt::docopt(doc, args = my_args,
  version = "parse kraken2 summary file V1")

if (!interactive()) {
  log_file <- file(arguments$log, open = "wt")
  sink(log_file, type = "output")
  sink(log_file, type = "message")
}

if (interactive()) {
  db = "minikraken"
  arguments$outfile_standard <- "output_test"
  arguments$outfile_full <- "output_full_table"
  arguments$kraken_summary <- sprintf("output/taxa/kraken/%s/kraken_summary.out", db)
}

info <- Sys.info()
print(stringr::str_c(names(info), " : ", info, "\n"))

message("loading packages")
library(tidyverse)
library(magrittr)

stopifnot(file.exists(arguments$kraken_summary))


# for assembling taxonomy strings in mpa style we only keep the main
# rank headers, and relabel the domain prefix "d" as "k" (kingdom)
# to match the mpa convention
mpa_levels <- c("D", "P", "C", "O", "F", "G", "S")
new_colname <- c("domain", "phylum", "class", 
    "order", "family", "genus", "species")
mpa_tib <- tibble(colname = mpa_levels, 
            new_colname = new_colname,
            prefix = tolower(mpa_levels)) %>%
        dplyr::mutate(prefix = ifelse(prefix=="d", "k", prefix))


# read summary file
summ_tb <- readr::read_tsv(arguments$kraken_summary,
    col_names = c("frac_asvs", "n_asvs_below",
        "n_asvs_assigned", "taxlevel", "taxid",
        "taxname"),
    trim_ws = FALSE)

# Count leading spaces in taxname: kraken2 indents the report to encode
# the hierarchy, so the indentation depth gives the parent/child relationships
summ_tb %<>%
    dplyr::mutate(nspace =
        purrr::map(taxname, ~ str_match(.x, "(^\\s*)")[, 1] %>% nchar())) %>%
    unnest(nspace) %>%
    dplyr::mutate(taxid = as.character(taxid)) %>%
    dplyr::mutate(taxname = purrr::map(taxname, str_trim)) %>%
    unnest(taxname)

#' Process the summary file into a list of taxonomy strings,
#' indexed by taxid, by walking the indented report row by row.
search_tree <- function(summ_tb, curr = 1, prev = 0,
    curr_string = list(tmp = "tmp"), curr_nspace = c(-1), debug = FALSE) {

    # check to make sure in synch
    stopifnot(length(curr_string) == length(curr_nspace))

    cl <- summ_tb[curr,] %>% as.list()

    # previous space level is at the end of the list
    prev_space <- curr_nspace[length(curr_nspace)]

    # the downstream strings that we will collect
    strings = list()

    # stop if we ran off the end
    if (curr > nrow(summ_tb)) {
        #message("ran off the end")
        return(strings)
    }

    if (debug) {
        message("curr ", curr, " prev ", prev)
        message("prev nspace ", prev_space, " curr nspace ", cl$nspace)
        message("\t\tlength check OK?",  length(curr_nspace) == length(curr_string))
    }
    res <- "whoa"
    if (cl$nspace > prev_space) {

        # if we are still proceeding through the tree, keep going.
        curr_string[[cl$taxlevel]] = cl$taxname        
        curr_nspace <- c(curr_nspace, cl$nspace)

        if (debug) {
            message("moving forward")
            message(sprintf("%s_%s", names(curr_string), curr_string) %>%
                paste(collapse="|"))
            message("\t", paste(curr_nspace, collapse = "|"))
        }
    } else if (cl$nspace <= prev_space) {
        # the current level is at or above the previous one, so back up
        # the accumulated string to the level above the current one,
        # then append the current level and keep walking
        if (debug) {
            message("backing up. prev was: ", paste(curr_string, collapse = "|"))
            message("\t\tnspace: ", paste(curr_nspace, collapse = "|"))
        }
        curr_nspace %<>% .[. < cl$nspace]
        curr_string %<>% .[seq_along(curr_nspace)]

        if (debug) {
            message("now prev is: ", paste(curr_string, collapse = "|"))
            message("\t\tnspace: ", paste(curr_nspace, collapse = "|"))
        }
        curr_string[[cl$taxlevel]] = cl$taxname
        curr_nspace <- c(curr_nspace, cl$nspace)
    }

    if (debug) {
        message(paste(curr_string, collapse = "|"))
        message("\t", paste(curr_nspace, collapse = "|"))
    }

    strings[[cl$taxid]] <- curr_string
    res <- search_tree(summ_tb, 
            curr = curr+1, prev = curr,
            curr_string, curr_nspace, debug = debug)

    return(c(strings, res))
}
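
# Note (illustrative): the result of search_tree() is a named list keyed by
# taxid; each element holds the rank labels accumulated on the path from the
# top of the report down to that taxid, keyed by the kraken2 rank code,
# e.g. roughly list(D = "Bacteria", P = "Firmicutes", ...)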

# function to pivot tax strings into a tibble
tax_tibble <- function(strings) {
    tmp <- tibble(taxid = names(strings),
        taxlist = strings) %>%
        dplyr::mutate(taxonomy = 
            purrr::map(taxlist, 
                ~ tibble(taxlevel = names(.x), name = .x) %>%
                unnest(name))) %>%
        unnest(taxonomy)
    tmp %>%
        pivot_wider(names_from = taxlevel, values_from = name) 
}

#' function to convert a taxonomy list to an mpa-style string
#' like k__domain|p__phylum|c__class
mpa_string <- function(taxlist, mpa_tib) {
    sub_mpa <- mpa_tib %>%
        dplyr::filter(colname %in% names(taxlist))
    if ("U" %in% names(taxlist) && !is.na(taxlist[["U"]])) {
        taxlist[["U"]]
    } else {
        taxlist[sub_mpa$colname] %>%
            sprintf("%s__%s", sub_mpa$prefix, .) %>%
            paste(collapse="|") %>%
            gsub(" ", "_", .)
    }
}
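
# For illustration (not executed by the pipeline), mpa_string() turns a
# named taxonomy vector into a single pipe-delimited string, e.g.:
#   mpa_string(c(D = "Bacteria", P = "Firmicutes", G = "Lactobacillus"), mpa_tib)
#   #> "k__Bacteria|p__Firmicutes|g__Lactobacillus"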


# obtain taxonomy strings from the summary table
strings <- summ_tb %>%
    search_tree(debug = TRUE)

# order of tax levels, including numbered sub-levels (e.g. D1-D5)
all_levels <- c("U", "R", "D", "P", "C", "O", "F", "G", "S")
all_levels <- purrr::map(all_levels,
    ~ c(.x, sprintf("%s%s", .x, 1:5))) %>%
    unlist()

# Make table with full taxonomy
# with all levels from original file
# put in order of taxonomy
taxtib <- tax_tibble(strings) %>%
    dplyr::mutate(tax_string = 
        purrr::map(taxlist, 
            ~ mpa_string(unlist(.x), mpa_tib))) %>%
    unnest(tax_string) %>%
    dplyr::select(-taxlist) %>%
    dplyr::select(taxid, tidyselect::any_of(all_levels), tax_string)

# check that every row of the summary received a taxonomy string, in order
stopifnot(nrow(taxtib) == nrow(summ_tb))
stopifnot(sum(summ_tb$taxid != taxtib$taxid) == 0)


# Select only subset of levels for the standardized
# output table
mpa_tib %<>%
    dplyr::filter(colname %in% colnames(taxtib))
nmap <- setNames(mpa_tib$new_colname, mpa_tib$colname)
sub_tax <- taxtib %>%
    dplyr::select(taxid, any_of(mpa_tib$colname), tax_string) 
sub_tax %<>%
    dplyr::rename_at(names(nmap), ~ nmap[.])

summ_tb %>%
    dplyr::select(taxname, taxid) %>%
    dplyr::left_join(sub_tax) %>%
    write_tsv(arguments$outfile_standard)

summ_tb %>%
    dplyr::select(taxname, taxid) %>%
    dplyr::left_join(taxtib) %>%
    readr::write_tsv(arguments$outfile_full)
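
`parse_kraken_summary.R` writes two TSVs: a standardized table restricted to the main ranks and a full table with every level present in the kraken2 report. A sketch of reading the standardized output back; the path is illustrative:

    # read the standardized taxonomy table (path is illustrative)
    library(readr)

    tax <- readr::read_tsv("output/taxa/kraken/minikraken/taxonomy_standard.tsv")

    # each row links a kraken2 taxname/taxid to per-rank labels and an
    # mpa-style tax_string such as "k__Bacteria|p__Firmicutes|..."
    head(tax)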
shell:
  "rm -fr output/quality_control workflow/report/quality_profiles"
shell:
  "rm -fr output/dada2 workflow/report/dada2qc workflow/report/model"
shell:
  "rm -fr output/taxa"
shell:
  "rm -fr output/phylotree"
shell:
  """Rscript workflow/scripts/mia/prepare_mia_object.R \
    {output.mia} --asv={input.asv} --taxa={input.taxa} \
    --tree={input.tree} --meta={input.meta} \
    --asv_prefix={params.prefix} --log={log} \
    --config={params.config} --cores={threads}"""
Repository: https://github.com/welch16/microbiome-pipeline
License: MIT License
