This workflow is an implementation of the popular DADA2 tool. I followed the steps in the DADA2 tutorial, but used Kraken2 for ASV sequence classification instead of IDTaxa.
The pipeline was inspired by Silas Kieser's (@silask) dada2 snakemake pipeline.
Authors
- Rene Welch (@ReneWelch)
Install workflow
There are two steps to install the pipeline:
- Install snakemake and dependencies. Assuming conda is already installed and a copy of this repository has been downloaded, the pipeline can be installed with:

  ```bash
  conda env create -n {env_name} --file dependencies.yml
  # for example:
  conda env create -n microbiome --file dependencies.yml
  ```

  Alternatively, mamba can be used:

  ```bash
  mamba env create -n {env_name} --file dependencies.yml
  # for example:
  mamba env create -n microbiome --file dependencies.yml
  ```
- Install R packages. To install the R packages used by this pipeline, run:

  ```bash
  R CMD BATCH --vanilla ./install_r_packages.R
  ```
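
  If you created the conda environment as above, you will likely want to activate it first so that the packages are installed into that environment's R library (a minimal sketch, assuming the environment is named `microbiome`):

  ```bash
  conda activate microbiome
  R CMD BATCH --vanilla ./install_r_packages.R
  ```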
Troubleshooting conda and environment variables
If you have other versions of R and R user libraries elsewhere, you might encounter some problems with environment variables and conda. You may need to provide a local `.bashrc` file to place the conda path at the beginning of your `$PATH` environment variable. You may not need to do this step. For example:
```bash
source /etc/bash_completion.d/git

__conda_setup="$('/path/to/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/path/to/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/path/to/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/path/to/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup

alias R='R --vanilla'
export PATH="/path/to/miniconda3/envs/microbiome/bin:$PATH"
```
Set up data and metadata
- Set up a data/ directory with one subdirectory per batch. The pipeline will look in the `data/` directory to find subdirectories that contain `fastq.gz` files. You can specify different filtering and trimming parameters per batch, and `dada2` will learn error rates per batch, so you may wish to separate your files by sequencing batch and/or by sample type.

  ```bash
  mkdir data/
  mkdir data/batch01 [data/batch02 ...]
  ```

  Next, populate the subdirectories with your `fastq.gz` (or `fastq`) files, making sure to include both R1 and R2. If your data are already located elsewhere in your file system, you can use symbolic links (symlinks) to avoid duplicating data:

  ```bash
  fqdir=/path/to/existing_fastq_files/
  batchdir=data/batch01
  ln -s ${fqdir}/*.fastq.gz ${batchdir}
  ```
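
  For example, after this step the layout might look like the following (batch and file names here are hypothetical):

  ```
  data/
  ├── batch01/
  │   ├── sample_1_R1.fastq.gz
  │   ├── sample_1_R2.fastq.gz
  │   └── ...
  └── batch02/
      ├── sample_3_R1.fastq.gz
      └── sample_3_R2.fastq.gz
  ```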
- Generate the sample table. The input for the pipeline is the set of sequencing files, separated by batch, in the `data/` directory. Running

  ```bash
  Rscript ./prepare_sample_table.R
  ```

  will generate the `samples.tsv` file, which contains 4 columns separated by tabs (the file will not actually contain the | characters):

  | batch | key | end1 | end2 |
- Set up the sample metadata file. This is where you link sample IDs to variables of interest to your study (condition, body site, time point, etc.). As currently implemented, this is only used to augment the final TreeSummarizedExperiment object, which you will use for your own downstream analysis. You need to have a column `key` that matches the sample table. For example:

  | key | subject_id | body_site | time_point | ...

  The default path for this file is `data/meta.tsv`. If you have it named otherwise, edit the variable `metadata` in `config/config.yaml` to point to your file.
- Set up the negative control mapping file. If you have negative control samples, you should make a table linking each study sample to its negative control kit(s). As currently implemented, this needs to be a `qs` file containing a tibble of this format:

  ```
  # A tibble: 3 x 3
    batch   key      kits
    <chr>   <chr>    <list>
  1 batch01 sample_1 <chr [3]>
  2 batch01 sample_2 <chr [1]>
  3 batch02 sample_3 <chr [1]>
  ```

  Each element of `key` is a study sample ID matching the sample table. Each element of `kits` is a character vector of sample keys for negative control samples, e.g. `c("sample_141", "sample_142", "sample_143")`. The default path for this file is `data/negcontrols.qs`. If you have it named otherwise, edit the variable `negcontroltable` in `config/config.yaml` to point to your file.
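
  A minimal sketch of building this file from R, assuming hypothetical sample and kit keys (adjust the batches, `key` values, and negative-control keys to your study):

  ```r
  library(tibble)
  library(qs)

  # one row per study sample; `kits` is a list-column of negative-control sample keys
  negcontrols <- tibble::tibble(
    batch = c("batch01", "batch01", "batch02"),
    key   = c("sample_1", "sample_2", "sample_3"),
    kits  = list(
      c("sample_141", "sample_142", "sample_143"),
      "sample_144",
      "sample_150"
    )
  )

  qs::qsave(negcontrols, "data/negcontrols.qs")
  ```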
Set up configuration file
Now you will edit `config.yaml` to set up the running parameters. At a minimum, you will need to make these edits:
- Confirm general parameters and file paths. Check that these variables are set to your satisfaction:

  ```yaml
  # general parameters for run
  asv_prefix: "asv"                        # this will be the prefix for the ASV IDs
  sample_table: "samples.tsv"              # this file is generated by prepare_sample_table.R
  negcontroltable: "data/negcontrols.qs"   # manually generated
  metadata: "data/meta.tsv"                # manually generated
  threads: 8  # max number of threads for a multithreaded snakemake rule, eg fastqc in quality_control
  ```
- Run sequence QC. If you have not already looked at your data, the first thing you will want to do is run FastQC to generate quality profiles.

  First, run

  ```bash
  snakemake -j{cores} sequence_qc
  ```

  to run FastQC, plot quality profiles, and build a `multiqc` report. You can find the individual plots in `workflow/report/quality_profiles` and the full report in `output/quality_control/multiqc`.

  Next, inspect those profiles. You may want to use them to decide where you want to trim R1 and R2 and how to set the minimum overlap.
- Edit parameters for each batch. Check the `config.yaml` file for the default filter_and_trim, merge_pairs, and qc_parameters specifications for each batch. Keep in mind the length of your 16S region (e.g., the V4 region is ~252 bp long) and whether you will have enough overlap after trimming. Look up other `dada2` resources for more advice on how to choose these parameters.

Here is an example for a 2x150 NextSeq run targeting the V4 region. This set of parameters keeps only full-length reads and requires an overlap of 25 base pairs; with 2x150 bp reads spanning a ~252 bp amplicon, roughly 300 - 252 = 48 bp of overlap is expected, so 25 bp is a conservative minimum. Because the V4 region is only ~252 base pairs long, we suspect that ASVs outside of that range may be bimeras.
```yaml
filter_and_trim:
  batch01:
    truncQ: 0
    trimLeft: 0
    trimRight: 0
    maxLen: Inf
    minLen: 150
    maxN: 0
    minQ: 0
    maxEE: [2, 2]
  ...
merge_pairs:
  batch01:
    minOverlap: 25
    maxMismatch: 0
  ...
qc_parameters:
  negctrl_prop: 0.5
  max_length_variation: 5
  low_abundance_perc: 0.0001
```
The defaults provided in the pipeline are from a 2x300 run targeting V3-V4. Here we are truncating the ends of both reads because we noticed a significant loss of quality. Because we have longer reads, we can require more overlap.
```yaml
filter_and_trim:
  dust_dec2018:
    truncQ: 12
    truncLen: [280, 250]
    trimLeft: 0
    trimRight: 0
    maxLen: Inf
    minLen: 100
    maxN: 0
    minQ: 0
    maxEE: [2, 2]
  ...
merge_pairs:
  dust_dec2018:
    minOverlap: 50
    maxMismatch: 0
  ...
qc_parameters:
  negctrl_prop: 0.5
  max_length_variation: 50
  low_abundance_perc: 0.0001
```
Download references for sequence classification
Databases need to be downloaded from https://benlangmead.github.io/aws-indexes/k2. The four below are the ones used by default in the config file. Here is an example of downloading them using `wget`:
```bash
# download references
wdir="https://genome-idx.s3.amazonaws.com/kraken"
refdir="data/kraken_dbs"
mkdir -p $refdir

greengenes="16S_Greengenes13.5_20200326"
rdp="16S_RDP11.5_20200326"
silva="16S_Silva138_20200326"
minikraken="minikraken2_v2_8GB_201904"

for db in $greengenes $rdp $silva $minikraken;
do
    remote=${wdir}/${db}.tgz
    local=${refdir}/${db}.tgz
    wget $remote -O $local
    tar -xzf ${local} -C ${refdir}
done
```
Workflow commands
Replace `{cores}` with the maximum number of cores that you want to be used at one time.

The rules `taxonomy` and `phylotree` will take a little longer the first time they are run because they install conda environments.
- `snakemake -j{cores}` runs everything.
- `snakemake -j{cores} sequence_qc` plots quality profiles and builds a `multiqc` report.
- `snakemake -j{cores} [--nt] dada2` computes the ASV matrix from the different batches (use the `--nt` option to keep intermediate files for troubleshooting).
- `snakemake -j{cores} taxonomy --use-conda` labels the ASV sequences with kraken2. Databases need to be downloaded in advance from https://benlangmead.github.io/aws-indexes/k2.
- `snakemake -j{cores} phylotree --use-conda` computes the phylogenetic tree using `qiime2`'s FastTree.
- `snakemake -j{cores} mia` prepares the `TreeSummarizedExperiment` containing all the data generated.
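
For example, a typical sequence of runs on 8 cores, following the order above, might look like:

```bash
snakemake -j8 sequence_qc            # FastQC, quality profiles, multiqc report
snakemake -j8 dada2                  # ASV matrix from the different batches
snakemake -j8 taxonomy --use-conda   # label ASVs with kraken2
snakemake -j8 phylotree --use-conda  # phylogenetic tree
snakemake -j8 mia                    # TreeSummarizedExperiment with all results
```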
Cite
dada2
Callahan, B., McMurdie, P., Rosen, M. et al. DADA2: High-resolution sample inference from Illumina amplicon data. Nat Methods 13, 581–583 (2016). https://doi.org/10.1038/nmeth.3869
Kraken2
Wood, D., Lu, J., Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biology 20, 257 (2019). https://doi.org/10.1186/s13059-019-1891-0
phyloseq
McMurdie, P., Holmes, S. phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLOS One 8, 4 (2013). https://doi.org/10.1371/journal.pone.0061217
qiime2
Bolyen, Evan, Jai Ram Rideout, Matthew R. Dillon, Nicholas A. Bokulich, Christian Abnet, Gabriel A. Al-Ghalith, Harriet Alexander, et al. 2018. “QIIME 2: Reproducible, Interactive, Scalable, and Extensible Microbiome Data Science.” e27295v2. PeerJ Preprints. https://doi.org/10.7287/peerj.preprints.27295v2 .
FastTree
Price, Morgan N., Paramvir S. Dehal, and Adam P. Arkin. 2010. “FastTree 2--Approximately Maximum-Likelihood Trees for Large Alignments.” PloS One 5 (3): e9490.
TreeSummarizedExperiment
Huang, Ruizhu, Charlotte Soneson, Felix G. M. Ernst, Kevin C. Rue-Albrecht, Guangchuang Yu, Stephanie C. Hicks, and Mark D. Robinson. 2020. “TreeSummarizedExperiment: A S4 Class for Data with Hierarchical Structure.” F1000Research 9 (October): 1246.
Code Snippets
```
shell:
    """Rscript workflow/scripts/dada2/filter_and_trim.R \
        {output.end1} {output.end2} {output.summary} \
        {params.sample_name} --end1={input.end1} --end2={input.end2} \
        --batch={params.batch} --log={log} --config={params.config}

    if [[ -s {output.summary} && ! -s {output.end1} && ! -s {output.end2} ]]; then
        echo "Filtering succeeded but all reads were removed. Creating temp placeholder files."
        touch {output.end1}
        touch {output.end2}
    fi
    """
```
```
shell:
    """Rscript workflow/scripts/dada2/learn_error_rates.R \
        --error_rates={output.mat} --plot_file={output.plot} \
        {input.filtered} --log={log} --batch={params.batch} \
        --config={params.config} --cores={threads}"""
```
```
shell:
    """
    if [[ -e {input.filt_end1} && -e {input.filt_end2} && ! -s {input.filt_end1} && ! -s {input.filt_end2} ]]; then
        touch {output.merge}
        echo "Filtering succeeded, but no reads were left after filtering:"
        ls -l {input.filt_end1} {input.filt_end2}
        echo "Generating placeholder output file: {output.merge}"
    else
        Rscript workflow/scripts/dada2/dereplicate_one_sample_pair.R \
            {output.merge} {params.sample_name} \
            {input.filt_end1} {input.filt_end2} \
            --end1_err={input.mat_end1} --end2_err={input.mat_end2} \
            --log={log} --batch={params.batch} --config={params.config}
    fi
    """
```
```
shell:
    """Rscript workflow/scripts/dada2/gather_derep_seqtab.R \
        {output.asv} {output.summary} {input.derep} \
        --batch={params.batch} --log={log} --config={params.config}"""
```
```
shell:
    """Rscript workflow/scripts/dada2/remove_chimeras.R \
        {output.asv} {output.summary} {output.matspy} \
        {input.asv} \
        --log={log} --config={params.config} --cores={threads}"""
```
```
shell:
    """Rscript workflow/scripts/dada2/filter_asvs.R \
        {output.seqtab_filt} \
        {output.plot_seqlength} {output.plot_seqabundance} \
        {input.seqtab} {input.negcontrol} \
        --log={log} --config={params.config} --cores={threads}"""
```
```
shell:
    """Rscript workflow/scripts/dada2/collect_tsv_files.R \
        {output.filt} {input.filt} \
        --log={log}"""
```
```
shell:
    """Rscript workflow/scripts/dada2/collect_tsv_files.R \
        {output.merg} {input.merg} \
        --log={log}"""
```
```
shell:
    """Rscript workflow/scripts/dada2/summarize_nreads.R \
        {output.nreads} {output.fig_step} {output.fig_step_rel} \
        {input.filt} {input.merg} {input.asvs} \
        --log={log}"""
```
```
script:
    """../scripts/phylo/compute_tree.R"""
```
```
shell:
    """Rscript workflow/scripts/dada2/plot_quality_profiles.R \
        {output} --end1={input.end1} --end2={input.end2} \
        --log={log}"""
```
```
shell:
    """fastqc -o output/quality_control/fastqc -t {params.threads} {input.R1} {input.R2}"""
```
```
shell:
    """multiqc output/quality_control/fastqc -o output/quality_control/multiqc"""
```
```
shell:
    """Rscript workflow/scripts/taxonomy/extract_fasta.R \
        {output.fasta} {input.asv} \
        --prefix={params.prefix} --log={log}"""
```
```
shell:
    """kraken2 --db {input.ref} --threads {threads} \
        --output {output.out} --use-names \
        --report {output.summary} --use-mpa-style \
        --confidence {params.confidence} \
        --classified-out {output.classified} \
        --memory-mapping \
        --unclassified-out {output.unclassified} {input.fasta}"""
```
```
shell:
    """blastn -db {params.db} -query {input.fasta} \
        -out {output.blast} -outfmt {params.fmt} \
        -perc_identity {params.perc} \
        -num_threads {threads}"""
```
```
shell:
    """Rscript workflow/scripts/taxonomy/clean_kraken_wblast.R \
        {output.taxa} {output.hits} \
        --kraken={input.kraken} --summary={input.kraken_summary} \
        --blast={input.blast} --fasta={input.fasta} \
        --log={log} --cores={threads}
    """
```
```
script:
    """../scripts/taxonomy/merge_dblabels_wblast.R"""
```
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 | "Collect tsv files into one Usage: collect_tsv_files.R [<outfile>] [<infiles> ...] [--log=<logfile>] collect_tsv_files.R (-h|--help) collect_tsv_files.R --version Options: --log=<logfile> name of the log file [default: collect.log]" -> doc library(docopt) my_args <- commandArgs(trailingOnly = TRUE) arguments <- docopt::docopt(doc, args = my_args, version = "collect tsv files V1") if (!interactive()) { log_file <- file(arguments$log, open = "wt") sink(log_file, type = "output") sink(log_file, type = "message") } if (interactive()) { arguments$outfile <- "out.tsv" arguments$infiles <- list.files("output/dada2/summary", full.names = TRUE) } info <- Sys.info(); print(arguments) print(stringr::str_c(names(info), " : ", info, "\n")) message("loading packages") library(magrittr) library(tidyverse) library(vroom) infiles <- purrr::map(arguments$infiles, vroom::vroom) out <- bind_rows(infiles) fs::dir_create(dirname(arguments$outfile)) readr::write_tsv(out, file = arguments$outfile) |
22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 | "Dereplicate one sample pair Usage: dereplicate_one_sample_pair.R [<sample_merge_file>] [<sample_name> <end1_file> <end2_file> --end1_err=<end1_err> --end2_err=<end2_err>] [--log=<logfile> --batch=<batch> --config=<cfile>] dereplicate_one_sample_pair.R (-h|--help) dereplicate_one_sample_pair.R --version Options: --end1_err=<end1> name of the R1 error rates matrix --end2_err=<end2> name of the R2 error rates matrix --log=<logfile> name of the log file [default: dereplicate.log] --batch=<batch> name of the batch if any to get the filter and trim parameters --config=<cfile> name of the yaml file with the parameters [default: ./config/config.yaml]" -> doc library(docopt) my_args <- commandArgs(trailingOnly = TRUE) arguments <- docopt::docopt(doc, args = my_args, version = "dereplicate one sample V1") if (!interactive()) { log_file <- file(arguments$log, open = "wt") sink(log_file, type = "output") sink(log_file, type = "message") } if (interactive()) { arguments$end1_file <- "output/dada2/filtered/sample_545_filtered_R1.fastq.gz" arguments$end2_file <- "output/dada2/filtered/sample_545_filtered_R2.fastq.gz" arguments$end1_err <- "output/dada2/model/dust_jun2021_error_rates_R1.qs" arguments$end2_err <- "output/dada2/model/dust_jun2021_error_rates_R2.qs" arguments$batch <- "dust_jun2021" arguments$sample_name <- "sample_545" } stopifnot( file.exists(arguments$end1_file), file.exists(arguments$end2_file), file.exists(arguments$end1_err), file.exists(arguments$end2_err)) if (!is.null(arguments$config)) stopifnot(file.exists(arguments$config)) info <- Sys.info(); print(arguments) print(stringr::str_c(names(info), " : ", info, "\n")) message("loading packages") library(magrittr) library(dada2) library(qs) library(yaml) filter_fwd <- arguments$end1_file filter_bwd <- arguments$end2_file message("loading error rates") err_fwd <- qs::qread(arguments$end1_err) err_bwd <- qs::qread(arguments$end2_err) message("dereplicating filtered files") dereplicate_merge <- function(filtfwd, filtbwd, err_fwd, err_bwd, min_overlap = 12, max_mismatch = 0) { message("de-replicating end1 file") derep_fwd <- dada2::derepFastq(filtfwd) dd_fwd <- dada2::dada(derep_fwd, err = err_fwd) message("de-replicating end2 file") derep_bwd <- dada2::derepFastq(filtbwd) dd_bwd <- dada2::dada(derep_bwd, err = err_bwd) message("merging pairs") merge <- dada2::mergePairs(dd_fwd, derep_fwd, dd_bwd, derep_bwd, minOverlap = min_overlap, maxMismatch = max_mismatch) out <- list( "dada_fwd" = dd_fwd, "dada_bwd" = dd_bwd, "merge" = merge) return(out) } config <- yaml::read_yaml(arguments$config)$merge_pairs if (!is.null(arguments$batch)) { stopifnot(arguments$batch %in% names(config)) config <- config[[arguments$batch]] } else { nms <- c("minOverlap", "maxMismatch") if (any(names(config) %in% nms)) { warning("will use first element instead") config <- config[[1]] } } merge <- dereplicate_merge(filter_fwd, filter_bwd, err_fwd, err_bwd, min_overlap = as.numeric(config[["minOverlap"]]), max_mismatch = as.numeric(config[["maxMismatch"]])) qs::qsave(merge, arguments$sample_merge_file) |
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 | "Filter and trim Usage: filter_and_trim.R [<filter_end1> <filter_end2> <summary_file>] [<sample_name> --end1=<end1> --end2=<end2>] [--batch=<batch> --log=<logfile> --config=<cfile>] filter_and_trim.R (-h|--help) filter_and_trim.R --version Options: -h --help show this screen --end1=<end1> name of the R1 end fastq.gz file --end2=<end2> name of the R2 end fastq.gz file --log=<logfile> name of the log file [default: logs/filter_and_trim.log] --batch=<batch> name of the batch if any to get the filter and trim parameters --config=<cfile> name of the yaml file with the parameters [default: ./config/config.yaml]" -> doc library(docopt) my_args <- commandArgs(trailingOnly = TRUE) arguments <- docopt::docopt(doc, args = my_args, version = "filter_and_trim V1") if (!interactive()) { log_file <- file(arguments$log, open = "wt") sink(log_file, type = "output") sink(log_file, type = "message") } if (interactive()) { arguments$batch <- "dust_dec2018" arguments$sample_name <- "sample_20" arguments$end1 <- "data/dust_dec2018/190114_C75PR/203_S19_L001_R1_001.fastq.gz" arguments$end2 <- "data/dust_dec2018/190114_C75PR/203_S19_L001_R2_001.fastq.gz" arguments$filter_end1 <- "filtered_L001_R1_001.fastq.gz" arguments$filter_end2 <- "filtered_L001_R2_001.fastq.gz" } print(arguments) info <- Sys.info(); message("loading dada2") library(magrittr) library(dada2) library(qs) library(yaml) library(fs) stopifnot(file.exists(arguments$config), file.exists(arguments$end1), file.exists(arguments$end2)) print(stringr::str_c(names(info), " : ", info, "\n")) config <- yaml::read_yaml(arguments$config) config <- config[["filter_and_trim"]] print(config) if (!is.null(arguments$batch)) { stopifnot(arguments$batch %in% names(config)) config <- config[[arguments$batch]] } else { nms <- c("truncQ", "truncLen", "trimLeft", "trimRight", "maxLen", "minLen", "maxN", "minQ", "maxEE") if (any(names(config) %in% nms)) { warning("will use first element instead") config <- config[[1]] } } fs::dir_create(unique(dirname(arguments$filter_end1))) fs::dir_create(unique(dirname(arguments$filter_end2))) track_filt <- dada2::filterAndTrim( arguments$end1, arguments$filter_end1, arguments$end2, arguments$filter_end2, truncQ = as.numeric(config[["truncQ"]]), truncLen = as.numeric(config[["truncLen"]]), trimLeft = as.numeric(config[["trimLeft"]]), trimRight = as.numeric(config[["trimRight"]]), maxLen = as.numeric(config[["maxLen"]]), minLen = as.numeric(config[["minLen"]]), maxN = as.numeric(config[["maxN"]]), minQ = as.numeric(config[["minQ"]]), maxEE = as.numeric(config[["maxEE"]]), rm.phix = TRUE, compress = TRUE, multithread = FALSE) row.names(track_filt) <- arguments$sample_name colnames(track_filt) <- c("raw", "filtered") track_filt %>% as.data.frame() %>% tibble::as_tibble(rownames = "samples") %>% dplyr::mutate( end1 = arguments$end1, end2 = arguments$end2) %>% readr::write_tsv(arguments$summary_file) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 | "Filters the low quality ASVs Usage: filter_asvs.R [<asv_matrix_qc> <seqlength_fig> <abundance_fig>] [<asv_matrix_file> <negcontrol_file>] [--lysis --log=<logfile> --config=<cfile> --cores=<cores>] filter_asvs.R (-h|--help) filter_asvs.R --version Options: -h --help show this screen --log=<logfile> name of the log file [default: filter_and_trim.log] --config=<cfile> name of the yaml file with the parameters [default: ./config/config.yaml] --cores=<cores> number of CPUs for parallel processing [default: 24]" -> doc library(docopt) my_args <- commandArgs(trailingOnly = TRUE) arguments <- docopt::docopt(doc, args = my_args, version = "filter ASVs V1") if (!interactive()) { fs::dir_create(dirname(arguments$log)) log_file <- file(arguments$log, open = "wt") sink(log_file, type = "output") sink(log_file, type = "message") } if (interactive()) { arguments$asv_matrix_qc <- "asv.qs" arguments$seqlength_fig <- "sl.png" arguments$abundance_fig <- "sa.png" arguments$asv_matrix_file <- "output/dada2/remove_chim/asv_mat_wo_chim.qs" arguments$negcontrol_file <- "output/predada2/negcontrols.qs" } message("arguments") print(arguments) info <- Sys.info(); print(stringr::str_c(names(info), " : ", info, "\n")) message("loading packages") library(magrittr) library(tidyverse) library(dada2) library(qs) library(yaml) stopifnot( file.exists(arguments$asv_matrix_file), file.exists(arguments$config), file.exists(arguments$negcontrol_file)) message("starting with asvs in ", arguments$asv_matrix_file) seqtab <- qs::qread(arguments$asv_matrix_file) config <- yaml::read_yaml(arguments$config)[["qc_parameters"]] message("filtering samples by negative controls") message("using neg. 
control file ", arguments$negcontrol_file) message("removing counts >= ", config[["negctrl_prop"]], " sum(neg_controls)") neg_controls <- qs::qread(arguments$negcontrol_file) if (arguments$lysis) { neg_controls %<>% dplyr::mutate( kits = map2(kits, lysis, c)) } neg_controls %<>% dplyr::select(batch, key, kits) subtract_neg_control <- function(name, neg_controls, seqtab, prop) { negs <- neg_controls negs <- negs[negs %in% rownames(seqtab)] out_vec <- seqtab[name, ] if (length(negs) > 1) { negs <- seqtab[negs, ] neg_vec <- colSums(negs) out_vec <- out_vec - prop * neg_vec } else if (length(negs) == 1) { neg_vec <- seqtab[negs, ] out_vec <- out_vec - prop * neg_vec } if (any(out_vec < 0)) { out_vec[out_vec < 0] <- 0 } return(out_vec) } neg_controls %<>% dplyr::mutate(sample_vec = purrr::map2(key, kits, safely(subtract_neg_control), seqtab, config[["negctrl_prop"]])) to_remove <- dplyr::filter(neg_controls, purrr::map_lgl(sample_vec, ~ !is.null(.$error))) if (nrow(to_remove) > 0) { message("removed the samples:\n", stringr::str_c(to_remove$key, collapse = "\n")) } neg_controls %<>% dplyr::filter(purrr::map_lgl(sample_vec, ~ is.null(.$error))) %>% dplyr::mutate( sample_vec = purrr::map(sample_vec, "result")) sample_names <- neg_controls %>% dplyr::pull(key) seqtab_samples <- purrr::reduce( dplyr::pull(neg_controls, sample_vec), dplyr::bind_rows) seqtab_samples %<>% as.matrix() %>% set_rownames(sample_names) seqtab_nc <- seqtab[! rownames(seqtab) %in% sample_names, ] seqtab_new <- floor(rbind(seqtab_samples, seqtab_nc)) # Length of sequences message("filtering sequences by length") seq_lengths <- nchar(dada2::getSequences(seqtab_new)) l_hist <- as.data.frame(table(seq_lengths)) %>% tibble::as_tibble() %>% rlang::set_names("length", "freq") l_hist <- l_hist %>% ggplot(aes(x = length, y = freq)) + geom_col() + labs(title = "Sequence Lengths by SEQ Count") + theme_bw() + theme( axis.text = element_text(size = 6), axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 10), axis.text.y = element_text(size = 10)) ggsave( filename = arguments$seqlength_fig, plot = l_hist, width = 8, height = 4, units = "in") table2 <- tapply(colSums(seqtab_new), seq_lengths, sum) table2 <- tibble::tibble( seq_length = names(table2), abundance = table2) most_common_length <- dplyr::top_n(table2, 1, abundance) %>% dplyr::pull(seq_length) %>% as.numeric() table2 <- table2 %>% ggplot(aes(x = seq_length, y = log1p(abundance))) + geom_col() + labs(title = "Sequence Lengths by SEQ Abundance") + theme_bw() + theme( axis.text = element_text(size = 6), axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 10), axis.text.y = element_text(size = 10)) ggsave( filename = arguments$abundance_fig, plot = table2, width = 8, height = 4, units = "in") max_diff <- config[["max_length_variation"]] message("most common length: ", most_common_length) message("removing sequences outside range ", " < ", most_common_length - max_diff, " or > ", most_common_length + max_diff) right_length <- abs(seq_lengths - most_common_length) <= max_diff seqtab_new <- seqtab_new[, right_length] total_abundance <- sum(colSums(seqtab_new)) min_reads_per_asv <- ceiling(config[["low_abundance_perc"]] / 100 * total_abundance) seqtab_abundance <- colSums(seqtab_new) message("removing ASV with < ", min_reads_per_asv, " reads") message("in total ", sum(seqtab_abundance < min_reads_per_asv)) seqtab_new <- seqtab_new[, seqtab_abundance >= min_reads_per_asv] message("saving files in ", arguments$asv_matrix_qc) qs::qsave(seqtab_new, 
arguments$asv_matrix_qc) |
11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 | "Gather ASV table Usage: gather_derep_seqtab.R [<asv_file> <summary_file>] [<derep_file> ...] [--batch=<batch> --log=<logfile> --config=<cfile>] gather_derep_seqtab.R (-h|--help) gather_derep_seqtab.R --version Options: -h --help show this screen --log=<logfile> name of the log file [default: gather_derep_seqtab.log] --config=<cfile> name of the yaml file with the parameters [default: ./config/config.yaml]" -> doc library(docopt) my_args <- commandArgs(trailingOnly = TRUE) arguments <- docopt::docopt(doc, args = my_args, version = "gather dereplicated sequence table V1") if (!interactive()) { fs::dir_create(dirname(arguments$log)) log_file <- file(arguments$log, open = "wt") sink(log_file, type = "output") sink(log_file, type = "message") } if (interactive()) { arguments$batch <- "batch2018" arguments$asv_file <- "batch2018_asv.qs" arguments$derep_file <- list.files(file.path("output", "dada2", "merge", arguments$batch), full.names = TRUE) } message("arguments") print(arguments) message("info") info <- Sys.info(); print(stringr::str_c(names(info), " : ", info, "\n")) message("loading packages") library(magrittr) library(tidyverse) library(dada2) library(qs) library(yaml) derep_files <- arguments$derep_file # stopifnot(any(file.exists(derep_files)), file.exists(arguments$config)) stopifnot(all(file.exists(derep_files)), file.exists(arguments$config)) # remove the files for samples where all reads were filtered out derep_files <- derep_files[file.exists(derep_files)] # Identify and remove empty files. # These are placeholder files that were created # to appease snakemake after all reads were # filtered out from the input file back in filter_and_trim. # stop if we have no files left after checking for empty. 
derep_tb <- tibble(derep_file = derep_files) %>% dplyr::mutate( size = purrr::map_dbl(derep_file, ~ file.info(.x)$size), key = purrr::map(derep_file, basename), key = stringr::str_remove(key, "_asv.qs")) count_nonempty <- derep_tb %>% dplyr::filter(size > 0) %>% nrow() stopifnot(count_nonempty > 0) # read ASV files read_merger <- function(filename, size) { if (size == 0) { NULL } else { qs::qread(filename) } } # changed the commands to be all inside the same mutate to improve readability derep_tb %<>% dplyr::mutate( derep_merger = purrr::map2(derep_file, size, read_merger), mergers = purrr::map(derep_merger, "merge"), dada_fwd = purrr::map(derep_merger, "dada_fwd"), dada_bwd = purrr::map(derep_merger, "dada_bwd")) config <- yaml::read_yaml(arguments$config) stopifnot(file.exists(config$samples_file)) sample_tb <- readr::read_tsv(config$samples_file) %>% dplyr::left_join(derep_tb) # now do some checking to make sure we only have # the correct batch in the input check_batches <- sample_tb %>% dplyr::filter(!is.na(derep_file)) %>% dplyr::count(batch) if (nrow(check_batches) > 1) { message("User supplied input derep files from multiple batch(es):") message(paste(sprintf("%s: %d", check_batches$batch, check_batches$n), collapse = "\n")) message("We will only look at the requested batch: ", arguments$batch) } # remove other batches, but keep empty files for summary sample_tb %<>% dplyr::filter(batch == arguments$batch) nonempty_tb <- sample_tb %>% dplyr::filter(size > 0) mergers <- pluck(nonempty_tb, "mergers") %>% setNames(pluck(nonempty_tb, "key")) dada_fwd <- pluck(nonempty_tb, "dada_fwd") dada_fwd <- pluck(nonempty_tb, "dada_bwd") message("creating sequence table") seqtab <- dada2::makeSequenceTable(mergers) qs::qsave(seqtab, arguments$asv_file) message("summarizing results") ## get N reads ## include lost samples with 0s get_nreads <- function(x) { if (!is.null(x)) { sum(dada2::getUniques(x)) } else { NA } } track <- sample_tb %>% dplyr::mutate( denoised = purrr::map(dada_fwd, get_nreads), merged = purrr::map(mergers, get_nreads)) %>% tidyr::unnest(c(denoised, merged)) %>% dplyr::transmute(samples = key, denoised, merged) %>% dplyr::mutate(across(c(denoised, merged), ~replace_na(.x, 0))) track %>% readr::write_tsv(arguments$summary_file) message("Done! summary file at ", arguments$summary_file) |
9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 | "Learn error rates Usage: learn_error_rates.R [--error_rates=<matrix_file> --plot_file=<pfile>] [<filtered> ...] [--log=<logfile> --batch=<batch> --config=<cfile> --cores=<cores>] learn_error_rates.R (-h|--help) learn_error_rates.R --version Options: -h --help show this screen --error_rates=<matrix_file> name of the file with the learned error rates matrix --plot_file=<pfile> name of the file to save the diagnostic plot --log=<logfile> name of the log file [default: error_rates.log] --batch=<batch> name of the batch if any to get the filter and trim parameters --config=<cfile> name of the yaml file with the parameters [default: ./config/config.yaml] --cores=<cores> number of CPUs for parallel processing [default: 4]" -> doc library(docopt) library(fs) my_args <- commandArgs(trailingOnly = TRUE) arguments <- docopt::docopt(doc, args = my_args, version = "error_rates V1") if (!interactive()) { fs::dir_create(dirname(arguments$log)) log_file <- file(arguments$log, open = "wt") sink(log_file, type = "output") sink(log_file, type = "message") } if (interactive()) { library(magrittr) library(tidyverse) arguments$error_rates <- "error_rate_matrix.qs" arguments$plot_file <- "error_rates.png" arguments$batch <- "batch2018" arguments$filtered <- readr::read_tsv("samples.tsv") %>% filter(batch == arguments$batch) %>% pull(key) arguments$filtered <- as.character(glue::glue( "output/dada2/filtered/{sample}_filtered_R1.fastq.gz", sample = arguments$filtered)) } stopifnot(any(file.exists(arguments$filtered))) stopifnot(file.exists(arguments$config)) info <- Sys.info(); print(stringr::str_c(names(info), " : ", info, "\n")) print(arguments) # Identify and remove empty files. # These are placeholder files that were created # to appease snakemake after all reads were # filtered out from the input file. # stop if we have no files left after checking for empty. sizes <- lapply(arguments$filtered, FUN = function(x) file.info(x)$size) sizes <- setNames(sizes, arguments$filtered) nonempty <- names(sizes[sizes > 0]) stopifnot(length(nonempty) > 0) message("loading packages") library(dada2) library(ggplot2) library(qs) library(yaml) config <- yaml::read_yaml(arguments$config)$error_rates print(config) if (!is.null(arguments$batch)) { stopifnot(arguments$batch %in% names(config)) config <- config[[arguments$batch]] } else { nms <- c("learn_nbases") if (any(names(config) %in% nms)) { warning("will use first element instead") config <- config[[1]] } } print("computing error rates") errs <- dada2::learnErrors( nonempty, #arguments$filtered, nbases = as.numeric(config$learn_nbases), multithread = as.numeric(arguments$cores), randomize = TRUE) fs::dir_create(dirname(arguments$error_rates)) qs::qsave(errs, file = arguments$error_rates) print("plotting diagnostics") err_plot <- dada2::plotErrors(errs, nominalQ = TRUE) fs::dir_create(dirname(arguments$plot_file)) ggplot2::ggsave( filename = arguments$plot_file, plot = err_plot, width = 20, height = 20, units = "cm") |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 | "Plot quality profiles Usage: plot_quality_profiles.R <plot_file> [--end1=<end1> --end2=<end2> --log=<logfile>] plot_quality_profiles.R (-h|--help) plot_quality_profiles.R --version Options: -h --help show this screen --log=<logfile> name of the log file [default: logs/plot_qc_profiles.log]" -> doc my_args <- commandArgs(trailingOnly = TRUE) if (interactive()) { my_args <- c("qc_profile.png", "--end1=end1.fastq.gz", "--end2=end2.fastq.gz") } library(docopt, quietly = TRUE) arguments <- docopt(doc, args = my_args, version = "plot_quality_profiles V1") print(arguments) if (!interactive()) { log_file <- file(arguments$log, open = "wt") sink(log_file) sink(log_file, type = "message") } message("system info") print(Sys.info()) message("arguments") print(arguments) stopifnot(file.exists(arguments$end1), file.exists(arguments$end2)) message("loading R packages") library(dada2, quietly = TRUE) library(ggplot2, quietly = TRUE) library(purrr, quietly = TRUE) library(cowplot, quietly = TRUE) message("making plots") plot_fun <- purrr::safely(dada2::plotQualityProfile) r1_plot <- plot_fun(arguments$end1) r2_plot <- plot_fun(arguments$end2) if (is.null(r1_plot$error)) { r1_plot <- r1_plot$result } else { message(r1_plot$error) r1_plot <- ggplot() } if (is.null(r2_plot$error)) { r2_plot <- r2_plot$result } else { message(r2_plot$error) r2_plot <- ggplot() } r1_plot <- r1_plot + cowplot::theme_minimal_grid() + theme(strip.background = element_blank()) r2_plot <- r2_plot + cowplot::theme_minimal_grid() + theme(strip.background = element_blank()) message("saving plots") final_plot <- cowplot::plot_grid(r1_plot, r2_plot, nrow = 1) ggsave(filename = arguments$plot_file, final_plot, ggsave, width = 8, height = 4, units = "in", bg = "white") |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 | "Merge sequence tables and remove chimeras Usage: remove_chimeras.R [<asv_merged_file> <summary_file> <fig_file>] [<asv_file> ...] [--log=<logfile> --config=<cfile> --cores=<cores>] remove_chimeras.R (-h|--help) remove_chimeras.R --version Options: -h --help show this screen --log=<logfile> name of the log file [default: remove_chimeras.log] --config=<cfile> name of the yaml file with the parameters [default: ./config/config.yaml] --cores=<cores> number of CPUs for parallel processing [default: 24]" -> doc library(docopt) my_args <- commandArgs(trailingOnly = TRUE) arguments <- docopt::docopt(doc, args = my_args, version = "remove chimeras V1") if (!interactive()) { fs::dir_create(dirname(arguments$log)) log_file <- file(arguments$log, open = "wt") sink(log_file, type = "output") sink(log_file, type = "message") } if (interactive()) { arguments$fig_file <- "matspy.png" arguments$asv_merged_file <- "asv.qs" arguments$summary_file <- "summary.qs" arguments$asv_file <- list.files(file.path("output", "dada2", "asv_batch"), full.names = TRUE, pattern = "asv") } message("arguments") print(arguments) ## merges different ASV tables, and removes chimeras message("info") info <- Sys.info(); print(stringr::str_c(names(info), " : ", info, "\n")) message("loading packages") library(magrittr) library(tidyverse) library(dada2) library(qs) library(yaml) library(ComplexHeatmap) stopifnot(any(file.exists(arguments$asv_file))) if (!is.null(arguments$config)) stopifnot(file.exists(arguments$config)) config <- yaml::read_yaml(arguments$config) sample_table <- readr::read_tsv(config$sample) config <- config$remove_chimeras seqtab_list <- purrr::map(arguments$asv_file, qs::qread) if (length(seqtab_list) > 1) { seqtab_all <- dada2::mergeSequenceTables(tables = seqtab_list) } else { seqtab_all <- seqtab_list[[1]] } # Remove chimeras message("removing chimeras") seqtab <- dada2::removeBimeraDenovo( seqtab_all, method = config[["chimera_method"]], minSampleFraction = config[["minSampleFraction"]], ignoreNNegatives = config[["ignoreNNegatives"]], minFoldParentOverAbundance = config[["minFoldParentOverAbundance"]], allowOneOf = config[["allowOneOf"]], minOneOffParentDistance = config[["minOneOffParentDistance"]], maxShift = config[["maxShift"]], multithread = as.numeric(arguments$cores)) fs::dir_create(dirname(arguments$asv_merged_file)) qs::qsave(seqtab, arguments$asv_merged_file) out <- tibble::tibble(samples = row.names(seqtab), nonchim = rowSums(seqtab)) fs::dir_create(dirname(arguments$summary_file)) out %>% readr::write_tsv(arguments$summary_file) fs::dir_create(dirname(arguments$fig_file)) outmat <- seqtab[, seq_len(min(1e4, ncol(seqtab)))] outmat <- ifelse(outmat > 0, 1, 0) cols <- structure(c("black", "white"), names = c("1", "0")) # put row annotation in same order as matrix # will remove samples that were completely empty # after filter_and_trim anno_df <- sample_table %>% select(batch, key) %>% as.data.frame() %>% column_to_rownames("key") anno_df <- anno_df[rownames(outmat), , drop = F] annot <- ComplexHeatmap::rowAnnotation( df = anno_df, annotation_legend_param = list( batch = list(direction = 
"horizontal"))) png(filename = arguments$fig_file, width = 10, height = 8, units = "in", res = 1200) hm <- ComplexHeatmap::Heatmap(outmat, left_annotation = annot, col = cols, name = "a", show_row_dend = FALSE, show_row_names = FALSE, show_column_dend = FALSE, show_column_names = FALSE, cluster_columns = FALSE, cluster_rows = FALSE, show_heatmap_legend = FALSE, use_raster = TRUE, column_title = "Top 10K ASVs", heatmap_legend_param = list(direction = "horizontal")) draw(hm, annotation_legend_side = "top", heatmap_legend_side = "top", merge_legend = TRUE) dev.off() |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 | "Summarized the # of reads per processing step Usage: summarize_nreads.R [<nreads_file> <nreads_fig> <preads_fig>] [<filt_summary_file> <derep_summary_file> <final_asv_mat>] [--log=<logfile>] summarize_nreads.R (-h|--help) summarize_nreads.R --version Options: -h --help show this screen --log=<logfile> name of the log file [default: summarize_nreads.log]" -> doc library(docopt) my_args <- commandArgs(trailingOnly = TRUE) arguments <- docopt::docopt(doc, args = my_args, version = "summarize # reads V1") if (!interactive()) { fs::dir_create(dirname(arguments$log)) log_file <- file(arguments$log, open = "wt") sink(log_file, type = "output") sink(log_file, type = "message") } if (interactive()) { arguments$filt_summary_file <- "output/dada2/filtered/all_sample_summary.tsv" arguments$derep_summary_file <- "output/dada2/asv_batch/all_sample_summary.tsv" arguments$final_asv_mat <- "output/dada2/after_qc/asv_mat_wo_chim.qs" arguments$nreads_file <- "output/dada2/stats/Nreads_dada2.txt" arguments$nreads_fig <- "workflow/report/dada2qc/dada2steps_vs_abundance.png" arguments$preads_fig <- "workflow/report/dada2qc/dada2steps_vs_relabundance.png" } # Summarize numbers of reads per step, makes some plots info <- Sys.info(); message(stringr::str_c(names(info), " : ", info, "\n")) message("loading packages") library(magrittr) library(tidyverse) library(qs) stats <- list() stats[[1]] <- readr::read_tsv(arguments$filt_summary_file) %>% select(-end1, -end2) stats[[2]] <- readr::read_tsv(arguments$derep_summary_file) stats[[3]] <- qs::qread(arguments$final_asv_mat) %>% rowSums() %>% tibble::tibble(samples = names(.), nreads = .) # change to full join so we see input files that were lost during any step stats <- purrr::reduce(stats, purrr::partial(dplyr::full_join, by = "samples")) # fill in 0s for final nreads for samples that were previously lost stats %<>% dplyr::mutate(nreads = replace_na(nreads, 0)) stats %>% readr::write_tsv(arguments$nreads_file) message("making figures") order_steps <- stats %>% dplyr::select(-samples) %>% names() rel_stats <- stats %>% dplyr::mutate( dplyr::across( -samples, list(~ . / raw), .names = "{.col}")) # remove empties from relative change plot # they don't get plotted anyway rel_stats %<>% dplyr::filter(!is.nan(nreads)) make_plot <- function(stats, summary_fun = median, ...) 
{ order_steps <- stats %>% dplyr::select(-samples) %>% names() stats %<>% tidyr::pivot_longer(-samples, names_to = "step", values_to = "val") %>% dplyr::mutate(step = factor(step, levels = order_steps)) summary <- stats %>% dplyr::group_by(step) %>% dplyr::summarize( val = summary_fun(val, ...), .groups = "drop") stats %>% ggplot(aes(step, val)) + geom_boxplot() + geom_point(alpha = 0.25, shape = 21) + geom_line(aes(group = samples), alpha = 0.25) + theme_classic() + theme( legend.position = "none", strip.text.y = ggplot2::element_text(angle = -90, size = 10), panel.grid.minor.x = ggplot2::element_blank(), panel.grid.major.x = ggplot2::element_blank()) + geom_line(data = summary, aes(group = 1), linetype = 2, colour = "red") + labs(x = "step") } ggsave( filename = arguments$nreads_fig, plot = make_plot(stats) + labs("# reads") + scale_y_continuous(labels = scales::comma_format(1)), width = 6, height = 4, units = "in") ggsave( filename = arguments$preads_fig, plot = make_plot(rel_stats) + labs(y = "relative change") + scale_y_continuous(labels = scales::percent_format(1)), width = 6, height = 4, units = "in") |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 | "Prepare mia object Usage: prepare_mia_object.R [<mia_file>] [--asv=<asv_file> --taxa=<taxa_file> --tree=<tree_file> --meta=<meta_file>] [--asv_prefix=<prefix> --log=<logfile> --config=<config> --cores=<cores>] prepare_mia_object.R (-h|--help) prepare_mia_object.R --version Options: -h --help show this screen --asv=<asv_file> ASV matrix file --taxa=<taxa_file> Taxa file --tree=<tree_file> Tree file --meta=<meta_file> Metadata file --log=<logfile> name of the log file [default: logs/filter_and_trim.log] --config=<config> name of the config file [default: config/config.yaml] --cores=<cores> number of parallel CPUs [default: 8]" -> doc library(docopt) my_args <- commandArgs(trailingOnly = TRUE) arguments <- docopt::docopt(doc, args = my_args, version = "prepare mia file V1") if (!interactive()) { log_file <- file(arguments$log, open = "wt") sink(log_file, type = "output") sink(log_file, type = "message") } if (interactive()) { arguments$asv <- "output/dada2/remove_chim/asv_mat_wo_chim.qs" arguments$taxa <- "output/taxa/kraken_merged/kraken_rdata.qs" arguments$tree <- "output/phylotree/newick/tree.nwk" arguments$meta <- "output/init/metadata.qs" arguments$asv_prefix <- "asv" } print(arguments) info <- Sys.info(); message("loading packages") library(magrittr) library(tidyverse) library(TreeSummarizedExperiment) library(Biostrings) library(BiocParallel) library(ape) library(mia) library(qs) library(scater) library(yaml) library(parallelDist) stopifnot( file.exists(arguments$asv), file.exists(arguments$taxa), file.exists(arguments$tree), file.exists(arguments$meta) ) if (!is.null(arguments$config)) stopifnot(file.exists(arguments$config)) asv <- qs::qread(arguments$asv) taxa <- qs::qread(arguments$taxa) tree <- ape::read.tree(arguments$tree) if (tools::file_ext(arguments$meta) == "tsv") { meta <- readr::read_tsv(arguments$meta) } else if (tools::file_ext(arguments$meta) == "qs") { meta <- qs::qread(arguments$meta) } cdata <- meta %>% as.data.frame() %>% tibble::column_to_rownames("key") # clean samples sample_names <- intersect( rownames(cdata), rownames(asv)) # clean ASVs asv_sequences <- colnames(asv) colnames(asv) <- str_c(arguments$asv_prefix, seq_along(asv_sequences), sep = "_") names(asv_sequences) <- colnames(asv) asv_sequences <- Biostrings::DNAStringSet(asv_sequences) taxa %<>% dplyr::add_count(asv) %>% dplyr::filter(n == 1) %>% dplyr::select(asv, taxa, taxid, domain, phylum, class, order, family, genus, species, kraken_db) asv_names <- intersect(names(asv_sequences), taxa$asv) asv <- asv[, asv_names] asv_sequences <- asv_sequences[asv_names, ] tree$tip.label %<>% stringr::str_sub(2, nchar(.) 
- 1) # either qiime2 or ape is adding "'" at the # start and end of ASV names tree <- ape::keep.tip(tree, intersect(asv_names, tree$tip.label)) rdata <- taxa %>% as.data.frame() %>% tibble::column_to_rownames("asv") out <- TreeSummarizedExperiment::TreeSummarizedExperiment( assays = list(counts = t(asv[sample_names,])), colData = cdata[sample_names, ], rowData = rdata, rowTree = tree) metadata(out)[["date_processed"]] <- Sys.Date() metadata(out)[["sequences"]] <- asv_sequences fs::dir_create(dirname(arguments$mia_file)) qs::qsave(out, arguments$mia_file) message("done!") |
```r
library(fastreeR)
library(ape)

fasta <- snakemake@input[["fasta"]]

message("computes the distance matrix between fasta sequences")
fasta_dist <- fastreeR::fasta2dist(fasta, kmer = 6)

message("converting distance to tree")
tree <- dist2tree(fasta_dist)
tree <- ape::read.tree(text = tree)

write.tree(tree, file = snakemake@output[["tree"]])
```
10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 | "Clean kraken2 results with blast Usage: clean_kraken_wblast.R [<taxa_table> <hits_table>] [--kraken=<kraken_file> --summary=<summary_file> --blast=<blast_file> --fasta=<fasta_file>] [--log=<logfile> --cores=<cores>] clean_kraken_wblast.R (-h|--help) clean_kraken_wblast.R --version Options: --log=<logfile> name of the log file [default: ./parse_kraken.log] --cores=<cores> number of parallel CPUs [default: 8]" -> doc library(docopt) my_args <- commandArgs(trailingOnly = TRUE) arguments <- docopt::docopt(doc, args = my_args, version = "clean kraken2 results using blast V1") if (!interactive()) { log_file <- file(arguments$log, open = "wt") sink(log_file, type = "output") sink(log_file, type = "message") } if (interactive()) { arguments$kraken <- here::here("output/taxa/kraken/k2_std/kraken_results.out") arguments$summary <- here::here( "output/taxa/kraken/k2_std/kraken_summary.out") arguments$blast <- here::here("output/taxa/blast/16S_ribosomal_RNA/blast_results.tsv") arguments$fasta <- here::here("output/taxa/fasta//asv_sequences.fa") } info <- Sys.info(); print(stringr::str_c(names(info), " : ", info, "\n")) message("loading packages") library(magrittr) library(tidyverse) # library(BiocParallel) # bpp <- BiocParallel::MulticoreParam(workers = as.numeric(arguments$cores)) tax_order <- c("domain", "phylum", "class", "order", "family", "genus", "species") kraken <- vroom::vroom(arguments$kraken, col_names = c("classified", "asv", "taxa", "length", "map")) ksummary <- vroom::vroom(arguments$summary, col_names = c("taxa", "taxid")) %>% dplyr::select(-taxid) %>% tidyr::separate(taxa, into = tax_order, sep = "\\|", remove = FALSE, fill = "right") blast <- vroom::vroom(arguments$blast, col_names = c("qseqid", "sseqid", "evalue", "bitscore", "score", "mismatch", "positive", "stitle", "qframe", "sframe", "length", "pident")) # first remove sequences with width outside fasta sequences fa <- ShortRead::readFasta(arguments$fasta) width_vec <- BiocGenerics::width(fa) width_range <- range(width_vec) blast %<>% dplyr::filter(length >= width_range[1] & length <= width_range[2]) med_mismatches <- median(blast$mismatch) blast %<>% dplyr::filter(mismatch <= med_mismatches) %>% dplyr::select(-qframe, -sframe) kraken_taxids <- kraken %>% dplyr::count(taxa) %>% dplyr::mutate( taxid = stringr::str_split(taxa, "taxid"), only_taxa = purrr::map_chr(taxid, 1), only_taxa = stringr::str_remove(only_taxa, regex("\\(")), only_taxa = stringr::str_trim(only_taxa), taxid = purrr::map_chr(taxid, 2), taxid = stringr::str_trim(taxid), taxid = stringr::str_remove(taxid, "\\)"), taxid = as.numeric(taxid)) kraken %<>% dplyr::inner_join( 
dplyr::select(kraken_taxids, taxa, taxid), by = "taxa") # map_to_tibble <- function(map) { # map %<>% stringr::str_split("\\:") # tibble::tibble( # taxid = as.numeric(purrr::map_chr(map, 1)), # bps = as.numeric(purrr::map_chr(map, 2))) # } # kraken %<>% # dplyr::mutate( # map_tib = stringr::str_split(map, " "), # map_tib = furrr::future_map(map_tib, map_to_tibble)) get_taxid <- function(taxa, kraken_taxids) { last_taxa <- stringr::str_split(taxa, "\\|")[[1]] last_taxa <- last_taxa[length(last_taxa)] taxa_word <- stringr::str_remove(last_taxa, regex("^[d|p|c|o|f|g|s]__")) out <- kraken_taxids %>% dplyr::filter(only_taxa == taxa_word) if (nrow(out) == 0) NA_real_ else min(out$taxid) } ksummary %<>% dplyr::mutate( taxid = purrr::map_dbl(taxa, get_taxid, kraken_taxids)) %>% dplyr::select(starts_with("tax"), everything()) check_hits_blast <- function(asv, taxa, taxid, blast, ksummary, kraken_taxids, min_prob = .1, max_dist = 2) { print(asv) if (is.null(blast)) { blast <- tibble::tibble() } else { blast %<>% dplyr::mutate( taxa = stringr::str_split(stitle, " "), taxa = purrr::map_chr(taxa, 1)) nmatches <- nrow(blast) blast %<>% dplyr::count(taxa) %>% dplyr::mutate(prob = n / nmatches) %>% dplyr::filter(prob >= min_prob) } if (nrow(blast) > 0) { # get kraken id tid <- taxid ksummary_id <- ksummary %>% dplyr::filter(taxid == tid) # get blast id based on the genus ksummary_genus <- ksummary %>% dplyr::filter(!is.na(genus)) ksummary_genus2 <- stringr::str_remove(ksummary_genus$genus, "g__") mmat <- stringdist::stringdistmatrix(blast$taxa, ksummary_genus2, method = "osa") rownames(mmat) <- blast$taxa mmat_search <- apply(mmat, 1, function(x)which(x < max_dist), simplify = FALSE) if (is.list(mmat_search)) { blast %<>% dplyr::mutate( taxonomy = mmat_search, taxonomy = purrr::map(taxonomy, ~ ksummary_genus[., ]), taxonomy = purrr::map(taxonomy, dplyr::select, -taxa, -species, -taxid), taxonomy = purrr::map(taxonomy, dplyr::distinct)) } else { blast %<>% dplyr::mutate( taxonomy = list(ksummary_genus[mmat_search, ]), taxonomy = purrr::map(taxonomy, dplyr::select, -taxa, -species, -taxid), taxonomy = purrr::map(taxonomy, dplyr::distinct)) } blast %<>% dplyr::mutate( prob = prob / sum(prob), d_hit = purrr::map_lgl(taxonomy, ~ ifelse(nrow(.) == 0, FALSE, .$domain == ksummary_id$domain)), p_hit = purrr::map_lgl(taxonomy, ~ ifelse(nrow(.) == 0, FALSE, .$phylum == ksummary_id$phylum)), c_hit = purrr::map_lgl(taxonomy, ~ ifelse(nrow(.) == 0, FALSE, .$class == ksummary_id$class)), o_hit = purrr::map_lgl(taxonomy, ~ ifelse(nrow(.) == 0, FALSE, .$order == ksummary_id$order)), f_hit = purrr::map_lgl(taxonomy, ~ ifelse(nrow(.) == 0, FALSE, .$family == ksummary_id$family)), g_hit = purrr::map_lgl(taxonomy, ~ ifelse(nrow(.) 
== 0, FALSE, .$genus == ksummary_id$genus)), across( where(is.logical), list(~ ifelse(is.na(.), 0, as.numeric(.))), .names = "{.col}")) blast %>% dplyr::select(ends_with("hit")) %>% colMeans() } else { rep(0, 6) %>% rlang::set_names(c("d_hit", "p_hit", "c_hit", "o_hit", "f_hit", "g_hit")) } } blast %<>% tidyr::nest(blast_pred = -c(qseqid)) %>% dplyr::rename(asv = qseqid) kraken %<>% dplyr::left_join(blast, by = "asv") kraken %<>% dplyr::mutate( hit_vec = purrr::pmap(list(asv, taxa, taxid, blast_pred), check_hits_blast, ksummary, kraken_taxids)) # outputs # - accuracy hits hits <- bind_rows(kraken$hit_vec) %>% dplyr::mutate( asv = kraken$asv, taxa = kraken$taxa, taxid = kraken$taxid) %>% dplyr::select(asv, taxid, tidyselect::everything()) # - rowData - asv and ksummary rdata <- kraken %>% dplyr::select(asv, taxid) %>% dplyr::inner_join(ksummary, by = "taxid") %>% dplyr::select(asv, taxa, taxid, tidyselect::everything()) fs::dir_create(dir(arguments$taxa_table)) rdata %>% qs::qsave(arguments$taxa_table) fs::dir_create(dir(arguments$hits_table)) hits %>% qs::qsave(arguments$hits_table) |
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 | "Extract fasta sequences Usage: extract_fasta.R [<fasta_file>] [<asv_mat_file>] [--prefix=<pref> --log=<logfile>] extract_fasta.R (-h|--help) extract_fasta.R --version Options: --prefix=<pref> prefix to name the sequences [default: asv] --log=<logfile> name of the log file [default: extract_fa.log]" -> doc library(docopt) my_args <- commandArgs(trailingOnly = TRUE) arguments <- docopt::docopt(doc, args = my_args, version = "extract fasta V1") if (!interactive()) { log_file <- file(arguments$log, open = "wt") sink(log_file, type = "output") sink(log_file, type = "message") } if (interactive()) { arguments$fasta_file <- "out.fasta" arguments$asv_mat_file <- "output/dada2/after_qc/asv_mat_wo_chim.qs" } info <- Sys.info(); print(arguments) print(stringr::str_c(names(info), " : ", info, "\n")) stopifnot(file.exists(arguments$asv_mat_file)) message("loading packages") library(magrittr) library(tidyverse) library(Biostrings) library(qs) asvs <- qs::qread(arguments$asv_mat_file) sequences <- colnames(asvs) names(sequences) <- stringr::str_c(arguments$prefix, seq_along(sequences), sep = "_") fs::dir_create(dirname(arguments$fasta_file)) sequences <- Biostrings::DNAStringSet(sequences) Biostrings::writeXStringSet(sequences, filepath = arguments$fasta_file) |
15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 | stopifnot( all(file.exists(snakemake@input[["taxa"]])), all(file.exists(snakemake@input[["hits"]])), all(file.exists(snakemake@input[["blast"]]))) library(magrittr) library(tidyverse) library(vroom) library(microbiome.misc) library(progress) # snakemake rdata_files <- snakemake@input[["taxa"]] hits_files <- snakemake@input[["hits"]] blast_files <- snakemake@input[["blast"]] config <- here::here(snakemake@params[["config"]]) %>% yaml::read_yaml() blast_dbs <- basename(dirname(blast_files)) format <- stringr::str_split(config[["blast"]][["format"]], " ") format <- format[[1]][-1] format <- stringr::str_remove_all(format, "\\'") format[1] <- "asv" blast_list <- tibble::tibble(blast_db = blast_dbs, blast = purrr::map(blast_files, vroom::vroom, col_names = format)) kraken_blast_min_perc <- config[["kraken_blast_min_perc"]] * 100 blast_list %<>% dplyr::mutate( blast = purrr::map(blast, dplyr::filter, pident >= kraken_blast_min_perc), blast = purrr::map(blast, tidyr::nest, blast = -c(asv))) get_db <- microbiome.misc:::get_db all_hits <- tibble::tibble( blast_db = get_db(hits_files, "blast"), kraken_db = get_db(hits_files, "kraken"), hits = purrr::map(hits_files, qs::qread)) all_labels <- tibble::tibble( blast_db = get_db(rdata_files, "blast"), kraken_db = get_db(rdata_files, "kraken"), hits = purrr::map(rdata_files, qs::qread)) blast_list %<>% tidyr::unnest(cols = c(blast)) db_hierarchy <- config[["kraken_db_merge"]] message("subsetting kraken dbs: ", stringr::str_c(db_hierarchy, collapse = ", ")) all_hits %<>% dplyr::filter(kraken_db %in% db_hierarchy) %>% tidyr::unnest(cols = c(hits)) %>% tidyr::nest(hits = -c(blast_db, asv)) all_labels %<>% dplyr::filter(kraken_db %in% db_hierarchy) %>% tidyr::unnest(cols = c(hits)) %>% tidyr::nest(labels = -c(blast_db, asv)) merge_labels <- purrr::reduce( list(all_hits, all_labels, blast_list), dplyr::full_join, by = c("blast_db", "asv")) pb <- progress::progress_bar$new(total = nrow(merge_labels)) match_asvs_blast_pbar <- function(hits, labels, blast, hierarchy) { pb$tick() microbiome.misc::match_asvs_blast(hits, labels, blast, hierarchy) } merge_labels %<>% dplyr::mutate( merged = purrr::pmap(list(hits, labels, blast), match_asvs_blast_pbar, config[["kraken_db_merge"]])) if (length(unique(merge_labels$blast_db)) > 1) { stop("more than 1 db, fix for next time") } merge_labels %<>% dplyr::mutate( hits_merged = purrr::map(merged, "hits"), labels_merged = purrr::map(merged, "labels")) merge_hits <- merge_labels %>% dplyr::select(asv, hits_merged) %>% tidyr::unnest(cols = c(hits_merged)) %>% dplyr::select(kraken_db, tidyselect::everything()) merge_hits %>% qs::qsave(snakemake@output[["hits"]]) merge_lbs <- merge_labels %>% dplyr::select(asv, labels_merged) %>% tidyr::unnest(cols = c(labels_merged)) %>% dplyr::select(kraken_db, tidyselect::everything()) merge_lbs %>% qs::qsave(snakemake@output[["taxa"]]) |
```
shell:
    "rm -fr output/quality_control workflow/report/quality_profiles"
```
```
shell:
    "rm -fr output/dada2 workflow/report/dada2qc workflow/report/model"
```
```
shell:
    "rm -fr output/taxa"
```
```
shell:
    "rm -fr output/phylotree"
```
```
shell:
    """Rscript workflow/scripts/mia/prepare_mia_object.R \
        {output.mia} --asv={input.asv} --taxa={input.taxa} \
        --tree={input.tree} --meta={input.meta} \
        --asv_prefix={params.prefix} --log={log} \
        --config={params.config} --cores={threads}"""
```
```
shell:
    """
    snakemake --forceall --rulegraph mia | dot -Tpng > {output.mia_dag}
    snakemake --rulegraph dada2 | dot -Tpng > {output.dada2_dag}
    snakemake --rulegraph taxonomy | dot -Tpng > {output.taxonomy_dag}
    snakemake --rulegraph phylotree | dot -Tpng > {output.phylo_dag}
    """
```
Support
- Future updates