The manuscript accompanying the OptiFit algorithm in mothur

an improved method for fitting amplicon sequences to existing OTUs

This repository contains the complete analysis workflow used to benchmark the OptiFit algorithm in mothur and produce the accompanying manuscript. Find details on how to use OptiFit and descriptions of the parameter options on the mothur wiki: https://mothur.org/wiki/cluster.fit/.

Citation

Sovacool KL, Westcott SL, Mumphrey MB, Dotson GA, Schloss PD. 2022. OptiFit: An Improved Method for Fitting Amplicon Sequences to Existing OTUs. mSphere. http://dx.doi.org/10.1128/msphere.00916-21

A bibtex entry for LaTeX users:

@article{sovacool_optifit_2022,
  author = {Kelly L. Sovacool and Sarah L. Westcott and M. Brodie Mumphrey and Gabrielle A. Dotson and Patrick D. Schloss},
  title = {OptiFit: an Improved Method for Fitting Amplicon Sequences to Existing OTUs},
  journal = {mSphere},
  year = {2022},
  doi = {10.1128/msphere.00916-21},
  url = {https://journals.asm.org/doi/10.1128/msphere.00916-21}
}

The Workflow

The workflow is split into five subworkflows:

0_prep_db — download & preprocess reference databases.
1_prep_samples — download, preprocess, & de novo cluster the sample datasets.
2_fit_reference_db — fit datasets to reference databases.
3_fit_sample_split — split datasets; cluster one fraction de novo and fit the remaining sequences to the de novo OTUs.
4_vsearch — run vsearch clustering for comparison.

The main workflow (Snakefile) creates plots from the results of the subworkflows and renders the paper.
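The layout described above can be sketched in Python. This is a minimal sketch, assuming each subworkflow keeps its Snakefile at subworkflows/<name>/Snakefile as shown in the Directory Structure section; the helper name `subworkflow_snakefiles` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Subworkflow names in the numbered order listed above.
SUBWORKFLOWS = [
    "0_prep_db",
    "1_prep_samples",
    "2_fit_reference_db",
    "3_fit_sample_split",
    "4_vsearch",
]


def subworkflow_snakefiles(repo_root="."):
    """Return the expected Snakefile path of each subworkflow (hypothetical helper)."""
    root = Path(repo_root)
    return [root / "subworkflows" / name / "Snakefile" for name in SUBWORKFLOWS]


if __name__ == "__main__":
    # Report which Snakefiles are present when run from the repository root.
    for path in subworkflow_snakefiles():
        status = "found" if path.exists() else "missing"
        print(f"{path}: {status}")
```

Running it from the repository root is a quick way to check that the clone is complete before starting the pipeline.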

Quickstart

Before cloning, configure git symlinks:
```
 git config --global core.symlinks true
```
Otherwise, git will create text files in place of symlinks.

Clone this repository.

 git clone https://github.com/SchlossLab/Sovacool_OptiFit_mSphere_2022
 cd Sovacool_OptiFit_mSphere_2022

Install the dependencies.

Almost all of them are listed in the conda environment file, config/env.simple.yaml, which covers everything needed to run the analysis workflow.
```
conda env create -f config/env.simple.yaml
conda activate optifit
```
Additionally, I used a custom version of ggraph for the algorithm figure. You can install it with devtools from R:
```
devtools::install_github('kelly-sovacool/ggraph', ref = 'iss-297_ggtext')
```
If you do not have LaTeX already, you'll need to install a LaTeX distribution before rendering the manuscript as a PDF. You can use tinytex to do so:
```
tinytex::install_tinytex()
```
I also used latexdiffr to create a PDF with changes tracked prior to submitting revisions to the journal.
```
devtools::install_github("hughjonesd/latexdiffr")
```
Run the entire pipeline.

Locally:
```
snakemake --cores 4
```
Or on an HPC running slurm:
```
sbatch code/slurm/submit_all.sh
```
(You will first need to edit your email and slurm account info in the submission script and cluster config.)

Directory Structure

.
├── OptiFit.Rproj
├── README.md
├── Snakefile
├── code
│   ├── R
│   ├── bash
│   ├── py
│   ├── slurm
│   └── tests
├── config
│   ├── cluster.json
│   ├── config.yaml
│   ├── config_test.yaml
│   ├── env.export.yaml
│   ├── env.simple.yaml
│   └── slurm
│       └── config.yaml
├── docs
│   ├── paper.md
│   ├── paper.pdf
│   └── slides
├── exploratory
│   ├── 2018_fall_rotation
│   ├── 2019_winter_rotation
│   ├── 2020-05_May-Oct
│   ├── 2020-11_Nov-Dec
│   ├── 2021
│   │   ├── figures
│   │   ├── plots.Rmd
│   │   ├── plots.md
│   ├── AnalysisRoadmap.md
│   └── DeveloperNotes.md
├── figures
├── log
├── paper
│   ├── figures.yaml
│   ├── head.tex
│   ├── msphere.csl
│   ├── paper.Rmd
│   ├── preamble.tex
│   └── references.bib
├── results
│   ├── aggregated.tsv
│   ├── stats.RData
│   └── summarized.tsv
└── subworkflows
    ├── 0_prep_db
    │   ├── README.md
    │   └── Snakefile
    ├── 1_prep_samples
    │   ├── README.md
    │   ├── Snakefile
    │   ├── data
    │   │   ├── human
    │   │   │   └── SRR_Acc_List.txt
    │   │   ├── marine
    │   │   │   └── SRR_Acc_List.txt
    │   │   ├── mouse
    │   │   │   └── SRR_Acc_List.txt
    │   │   └── soil
    │   │       └── SRR_Acc_List.txt
    │   └── results
    │       ├── dataset_sizes.tsv
    │       └── opticlust_results.tsv
    ├── 2_fit_reference_db
    │   ├── README.md
    │   ├── Snakefile
    │   └── results
    │       ├── denovo_dbs.tsv
    │       ├── optifit_dbs_results.tsv
    │       └── ref_sizes.tsv
    ├── 3_fit_sample_split
    │   ├── README.md
    │   ├── Snakefile
    │   └── results
    │       ├── optifit_crit_check.tsv
    │       └── optifit_split_results.tsv
    └── 4_vsearch
        ├── README.md
        ├── Snakefile
        └── results
            └── vsearch_results.tsv

Code Snippets

def word_count(infilename, starter, stopper):
    with open(infilename, "r") as infile:
        words = []
        line = next(infile)
        while starter != line:
            line = next(infile)
        line = next(infile)  # make sure the first parsed line is not the starter
        while stopper != line:
            words += [word for word in line.strip().split()]
            line = next(infile)
    return len(words)


def check_wc(section_name, num_words, word_limit):
    if num_words > word_limit:
        raise ValueError(
            f"The {section_name} section is {num_words} words. You need to cut {num_words - word_limit} words."
        )


def main(src_filename, log_filename):
    with open(log_filename, "w") as outfile:
        outfile.write("section\tword_count\n")
        for section, word_limit, starter, stopper in zip(
            ["abstract", "importance"],
            [250, 150],
            ["## Abstract\n", "### Importance\n"],
            ["### Importance\n", "\\newpage\n"],
        ):
            wc = word_count(src_filename, starter, stopper)
            check_wc(section, wc, word_limit)
            outfile.write(f"{section}\t{wc}\n")


if __name__ == "__main__":
    if "snakemake" in locals() or "snakemake" in globals():
        main(snakemake.input.src, snakemake.output.txt)
    else:
        main("paper/paper.Rmd", "log/count_words_abstract.log")

Python Snakemake From line 1 of py/abstract_word_count.py
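To illustrate what the script above counts, here is a minimal sketch that applies the same marker-bounded word count to an in-memory list of lines. The name `count_between` and the sample text are invented for illustration. Unlike the original, this version returns whatever it has counted instead of raising `StopIteration` when a marker is missing.

```python
def count_between(lines, starter, stopper):
    """Count whitespace-separated words strictly between two marker lines."""
    it = iter(lines)
    for line in it:  # skip everything up to and including the starter line
        if line == starter:
            break
    n_words = 0
    for line in it:  # count words until the stopper line is reached
        if line == stopper:
            break
        n_words += len(line.split())
    return n_words


# Invented sample text laid out like the sections the script looks for.
sample = [
    "## Abstract\n",
    "OptiFit fits new sequences to existing OTUs.\n",
    "\n",
    "### Importance\n",
    "Stable OTUs help comparisons.\n",
    "\\newpage\n",
]
print(count_between(sample, "## Abstract\n", "### Importance\n"))
print(count_between(sample, "### Importance\n", "\\newpage\n"))
```

The markers must match whole lines, including the trailing newline, which is why the script passes "## Abstract\n" and not "## Abstract".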

def main(uc1_filename, uc2_filename, list_filename):
    otus1 = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
    otus2 = {1: ["f", "g"], 2: ["h"]}
    for filename, otus in ((uc1_filename, otus1), (uc2_filename, otus2)):
        with open(filename, "w") as uc_file:
            for otu_id, seqs in otus.items():
                for seq_id in seqs:
                    uc_file.write(
                        f"H\t{otu_id}\t1\t100\t+\t-\t-\t=\t{seq_id}\t{otu_id}\n"
                    )
    combined = [",".join(seqs) for otus in [otus1, otus2] for seqs in otus.values()]
    with open(list_filename, "w") as listfile:
        listfile.write(f"userLabel\t{str(len(combined))}")
        for otu in combined:
            listfile.write(f"\t{otu}")
        listfile.write("\n")


if __name__ == "__main__":
    main(
        "code/tests/data/closed.uc",
        "code/tests/data/denovo.uc",
        "code/tests/data/oracle_open.list",
    )

Python From line 1 of py/create_test_uc_files.py
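The test files written above can be read back with a few lines of Python. This is a minimal sketch that relies only on the ten-column, tab-separated layout this script writes (OTU ID in the second column, sequence ID in the ninth); it is not a parser for the full VSEARCH .uc format, and `read_test_uc` is a hypothetical helper name.

```python
from collections import defaultdict


def read_test_uc(lines):
    """Group sequence IDs by OTU ID from lines in the layout written above."""
    otus = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        otu_id, seq_id = fields[1], fields[8]  # columns 2 and 9 of the test layout
        otus[otu_id].append(seq_id)
    return dict(otus)


# Rebuild the lines exactly as the script formats them for its first OTU set.
lines = [
    f"H\t{otu_id}\t1\t100\t+\t-\t-\t=\t{seq_id}\t{otu_id}\n"
    for otu_id, seqs in {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}.items()
    for seq_id in seqs
]
print(read_test_uc(lines))
```

Note that the OTU IDs come back as strings because they are read from text.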

library(here)
library(tidyverse)

rel_diff <- function(final, init, percent = TRUE) {
  mult <- if (isTRUE(percent)) 100 else 1
  return((final - init) / init * mult)
}
coeff_var <- function(x) {
  return(sd(x) / mean(x))
}


dat <- read_tsv(here("results", "summarized.tsv"))
agg <- read_tsv(here("results", "aggregated.tsv"))
################################################################################
# de novo datasets
opticlust_mcc <- agg %>%
  filter(
    method == "de_novo",
    tool == "mothur"
  ) %>%
  pull(mcc) %>%
  median()
opticlust_sec <- agg %>%
  filter(
    method == "de_novo",
    tool == "mothur"
  ) %>%
  pull(sec) %>%
  median()
opticlust_mem <- agg %>%
  filter(
    strategy == "de_novo",
    tool == "mothur"
  ) %>%
  pull(mem_gb) %>%
  median()
dn_vsearch_mcc <- agg %>%
  filter(strategy == "de_novo", tool == "vsearch") %>%
  pull(mcc) %>%
  median()
dn_vsearch_sec <- agg %>%
  filter(strategy == "de_novo", tool == "vsearch") %>%
  pull(sec) %>%
  median()
mcc_opticlust_vs_vsearch <- rel_diff(opticlust_mcc, dn_vsearch_mcc)
sec_opticlust_vs_vsearch <- abs(rel_diff(dn_vsearch_sec, opticlust_sec))

################################################################################
# de novo ref dbs

dn_dbs <- read_tsv("subworkflows/2_fit_reference_db/results/denovo_dbs.tsv") %>%
  group_by(ref) %>%
  summarize(med_mcc = median(mcc)) %>%
  full_join(read_tsv(here(
    "subworkflows", "0_prep_db", "data",
    "seq_counts.tsv"
  )),
  by = "ref"
  ) %>%
  mutate(refname = case_when(
    ref == "gg" ~ "Greengenes",
    TRUE ~ toupper(ref)
  ))

################################################################################
# ref db open
open_fit_db_mcc <- agg %>%
  filter(
    method == "open",
    strategy == "database",
    tool == "mothur"
  ) %>%
  pull(mcc) %>%
  median()

mcc_open_fit_db_vs_clust <- rel_diff(opticlust_mcc, open_fit_db_mcc)

open_fit_gg_mcc <- agg %>%
  filter(
    method == "open",
    strategy == "database",
    tool == "mothur",
    ref == "gg"
  ) %>%
  pull(mcc) %>%
  median()

open_fit_silva_mcc <- agg %>%
  filter(
    method == "open",
    strategy == "database",
    tool == "mothur",
    ref == "silva"
  ) %>%
  pull(mcc) %>%
  median()

open_fit_rdp_mcc <- agg %>%
  filter(
    method == "open",
    strategy == "database",
    tool == "mothur",
    ref == "rdp"
  ) %>%
  pull(mcc) %>%
  median()

open_vsearch_mcc <- agg %>%
  filter(
    method == "open",
    strategy == "database",
    tool == "vsearch",
    ref == "gg"
  ) %>%
  pull(mcc) %>%
  median()
mcc_open_fit_db_vs_vsearch <- rel_diff(open_fit_gg_mcc, open_vsearch_mcc)

open_vsearch_sec <- agg %>%
  filter(
    method == "open",
    strategy == "database",
    tool == "vsearch",
    ref == "gg"
  ) %>%
  pull(sec) %>%
  median()
open_fit_db_sec <- agg %>%
  filter(
    method == "open",
    strategy == "database",
    tool == "mothur"
  ) %>%
  pull(sec) %>%
  median()
sec_vsearch_vs_open_fit_db <- rel_diff(open_vsearch_sec, open_fit_db_sec)
sec_opticlust_vs_open_fit_db <- rel_diff(opticlust_sec, open_fit_db_sec) %>% abs()

# human dataset to silva
open_fit_silva_human_sec <- agg %>%
  filter(
    method == "open",
    strategy == "database",
    tool == "mothur",
    ref == "silva",
    dataset == "human"
  ) %>%
  pull(sec) %>%
  median()
closed_fit_silva_human_sec <- agg %>%
  filter(
    method == "closed",
    strategy == "database",
    tool == "mothur",
    ref == "silva",
    dataset == "human"
  ) %>%
  pull(sec) %>%
  median()
opticlust_human_sec <- agg %>%
  filter(
    method == "de_novo",
    tool == "mothur",
    dataset == "human"
  ) %>%
  pull(sec) %>%
  median()

################################################################################
# ref db closed
closed_fit_db_mcc <- agg %>%
  filter(
    method == "closed",
    strategy == "database",
    tool == "mothur"
  ) %>%
  pull(mcc) %>%
  median()
mcc_closed_fit_db_vs_clust <- rel_diff(closed_fit_db_mcc, opticlust_mcc) %>% abs()

closed_fit_gg_mcc <- agg %>%
  filter(
    method == "closed",
    strategy == "database",
    tool == "mothur",
    ref == "gg"
  ) %>%
  pull(mcc) %>%
  median()
closed_fit_silva_mcc <- agg %>%
  filter(
    method == "closed",
    strategy == "database",
    tool == "mothur",
    ref == "silva"
  ) %>%
  pull(mcc) %>%
  median()
closed_fit_rdp_mcc <- agg %>%
  filter(
    method == "closed",
    strategy == "database",
    tool == "mothur",
    ref == "rdp"
  ) %>%
  pull(mcc) %>%
  median()

frac_fit_db <- agg %>%
  filter(
    method == "closed",
    strategy == "database",
    tool == "mothur"
  ) %>%
  pull(fraction_mapped) %>%
  median() * 100
frac_fit_gg <- agg %>%
  filter(
    method == "closed",
    strategy == "database",
    tool == "mothur",
    ref == "gg"
  ) %>%
  pull(fraction_mapped) %>%
  median() * 100
frac_fit_silva <- agg %>%
  filter(
    method == "closed",
    strategy == "database",
    tool == "mothur",
    ref == "silva"
  ) %>%
  pull(fraction_mapped) %>%
  median() * 100
frac_fit_rdp <- agg %>%
  filter(
    method == "closed",
    strategy == "database",
    tool == "mothur",
    ref == "rdp"
  ) %>%
  pull(fraction_mapped) %>%
  median() * 100
frac_vsearch <- agg %>%
  filter(
    method == "closed",
    strategy == "database",
    tool == "vsearch",
    ref == "gg"
  ) %>%
  pull(fraction_mapped) %>%
  median() * 100
frac_vsearch_vs_fit <- rel_diff(frac_vsearch, frac_fit_gg)

closed_fit_db_sec <- agg %>%
  filter(
    method == "closed",
    strategy == "database",
    tool == "mothur"
  ) %>%
  pull(sec) %>%
  median()
sec_closed_fit_db_vs_clust <- rel_diff(closed_fit_db_sec, opticlust_sec) %>% abs()

closed_vsearch_sec <- agg %>%
  filter(
    method == "closed",
    strategy == "database",
    tool == "vsearch"
  ) %>%
  pull(sec) %>%
  median()
sec_closed_fit_db_vs_vsearch <- rel_diff(closed_fit_db_sec, closed_vsearch_sec) %>% abs()

closed_vsearch_mcc <- agg %>%
  filter(
    method == "closed",
    strategy == "database",
    tool == "vsearch"
  ) %>%
  pull(mcc) %>%
  median()
mcc_closed_fit_db_vs_vsearch <- rel_diff(closed_fit_db_mcc, closed_vsearch_mcc) %>% abs()

################################################################################
# fit split
cv_fit_split_mcc <- agg %>%
  filter(
    strategy == "self-split",
    tool == "mothur",
    ref_weight == "simple",
    ref_frac == 0.5
  ) %>%
  pull(mcc) %>%
  coeff_var()

frac_fit_split <- agg %>%
  filter(
    strategy == "self-split",
    tool == "mothur",
    ref_weight == "simple",
    method == "closed",
    ref_frac == 0.5
  ) %>%
  pull(fraction_mapped) %>%
  median() * 100

closed_fit_split_sec <- agg %>%
  filter(
    method == "closed",
    strategy == "self-split",
    tool == "mothur",
    ref_weight == "simple",
    ref_frac == 0.5
  ) %>%
  pull(sec) %>%
  median()
open_fit_split_sec <- agg %>%
  filter(
    method == "open",
    strategy == "self-split",
    tool == "mothur",
    ref_weight == "simple",
    ref_frac == 0.5
  ) %>%
  pull(sec) %>%
  median()

sec_closed_fit_split_vs_clust <- rel_diff(opticlust_sec, closed_fit_split_sec) %>% abs()
sec_open_fit_split_vs_clust <- rel_diff(opticlust_sec, open_fit_split_sec) %>% abs()
sec_open_fit_split_vs_db <- rel_diff(open_fit_db_sec, open_fit_split_sec) %>% abs()
sec_closed_fit_split_vs_db <- rel_diff(closed_fit_db_sec, closed_fit_split_sec) %>% abs()

closed_fit_split_mem <- agg %>%
  filter(
    method == "closed",
    strategy == "self-split",
    tool == "mothur",
    ref_weight == "simple",
    ref_frac == 0.5
  ) %>%
  pull(mem_gb) %>%
  median()
open_fit_split_mem <- agg %>%
  filter(
    method == "open",
    strategy == "self-split",
    tool == "mothur",
    ref_weight == "simple",
    ref_frac == 0.5
  ) %>%
  pull(mem_gb) %>%
  median()
mem_closed_fit_split_vs_clust <- rel_diff(closed_fit_split_mem, opticlust_mem)
mem_open_fit_split_vs_clust <- rel_diff(open_fit_split_mem, opticlust_mem)

cv_fit_split_mcc_human_simple <- agg %>%
  filter(
    strategy == "self-split",
    tool == "mothur",
    dataset == "human",
    ref_weight == "simple"
  ) %>%
  pull(mcc) %>%
  coeff_var()

cv_fit_split_mem_human_simple <- agg %>%
  filter(
    strategy == "self-split",
    tool == "mothur",
    dataset == "human",
    ref_weight == "simple"
  ) %>%
  pull(mem_gb) %>%
  coeff_var()

sec_fit_split_human_simple_1 <- agg %>%
  filter(
    strategy == "self-split",
    tool == "mothur",
    dataset == "human",
    ref_weight == "simple",
    ref_frac == 0.1
  ) %>%
  pull(sec) %>%
  median()

sec_fit_split_human_simple_9 <- agg %>%
  filter(
    strategy == "self-split",
    tool == "mothur",
    dataset == "human",
    ref_weight == "simple",
    ref_frac == 0.9
  ) %>%
  pull(sec) %>%
  median()

frac_fit_split_human_simple_1 <- agg %>%
  filter(
    strategy == "self-split",
    tool == "mothur",
    dataset == "human",
    ref_weight == "simple",
    method == "closed",
    ref_frac == 0.1
  ) %>%
  pull(fraction_mapped) %>%
  median()

frac_fit_split_human_simple_9 <- agg %>%
  filter(
    strategy == "self-split",
    tool == "mothur",
    dataset == "human",
    ref_weight == "simple",
    method == "closed",
    ref_frac == 0.9
  ) %>%
  pull(fraction_mapped) %>%
  median()

mcc_fit_split_simple <- agg %>%
  filter(
    strategy == "self-split",
    tool == "mothur",
    ref_weight == "simple",
    ref_frac == 0.5
  ) %>%
  pull(mcc) %>%
  median()
mcc_opticlust_vs_fit_split_simple <- rel_diff(opticlust_mcc, mcc_fit_split_simple)

#####
# fit split at ref_frac 0.5

mcc_fit_split_abun <- agg %>%
  filter(
    strategy == "self-split",
    tool == "mothur",
    ref_weight == "abundance",
    ref_frac == 0.5
  ) %>%
  pull(mcc) %>%
  median()

mcc_fit_split_dist <- agg %>%
  filter(
    strategy == "self-split",
    tool == "mothur",
    ref_weight == "distance",
    ref_frac == 0.5
  ) %>%
  pull(mcc) %>%
  median()

frac_fit_split_simple <- agg %>%
  filter(
    strategy == "self-split",
    tool == "mothur",
    ref_weight == "simple",
    method == "closed",
    ref_frac == 0.5
  ) %>%
  pull(fraction_mapped) %>%
  median()

frac_fit_split_abun <- agg %>%
  filter(
    strategy == "self-split",
    tool == "mothur",
    ref_weight == "abundance",
    method == "closed",
    ref_frac == 0.5
  ) %>%
  pull(fraction_mapped) %>%
  median()

frac_fit_split_dist <- agg %>%
  filter(
    strategy == "self-split",
    tool == "mothur",
    ref_weight == "distance",
    method == "closed",
    ref_frac == 0.5
  ) %>%
  pull(fraction_mapped) %>%
  median()

sec_fit_split_simple <- agg %>%
  filter(
    strategy == "self-split",
    tool == "mothur",
    ref_weight == "simple",
    ref_frac == 0.5
  ) %>%
  pull(sec) %>%
  median()

sec_fit_split_abun <- agg %>%
  filter(
    strategy == "self-split",
    tool == "mothur",
    ref_weight == "abundance",
    ref_frac == 0.5
  ) %>%
  pull(sec) %>%
  median()

sec_fit_split_dist <- agg %>%
  filter(
    strategy == "self-split",
    tool == "mothur",
    ref_weight == "distance",
    ref_frac == 0.5
  ) %>%
  pull(sec) %>%
  median()

##########

frac_fit_split_1 <- agg %>%
  filter(
    strategy == "self-split",
    tool == "mothur",
    method == "closed",
    ref_frac == 0.1
  ) %>%
  pull(fraction_mapped) %>%
  median() * 100

frac_fit_split_9 <- agg %>%
  filter(
    strategy == "self-split",
    tool == "mothur",
    method == "closed",
    ref_frac == 0.9
  ) %>%
  pull(fraction_mapped) %>%
  median() * 100


################################################################################
# save results
save.image(file = here("results", "stats.RData"))

R tidyverse VSEARCH mothur here From line 1 of R/calc_results_stats.R
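The two helper functions at the top of this script are small enough to check by hand. The following Python sketch mirrors `rel_diff` and `coeff_var`; it is an illustrative translation, not part of the repository, and the input numbers are invented.

```python
import statistics


def rel_diff(final, init, percent=True):
    """Relative difference of `final` versus `init`, as a percentage by default."""
    mult = 100 if percent else 1
    return (final - init) / init * mult


def coeff_var(values):
    """Coefficient of variation: sample standard deviation divided by the mean."""
    # statistics.stdev uses the n - 1 denominator, as R's sd() does.
    return statistics.stdev(values) / statistics.mean(values)


print(rel_diff(120, 100))         # invented values: 120 compared against a baseline of 100
print(rel_diff(120, 100, False))  # the same comparison as a proportion
print(coeff_var([2, 4, 6]))       # invented values with mean 4 and sample sd 2
```

Because `rel_diff` divides by `init`, the order of the arguments matters. The script wraps several comparisons in `abs()` so that only the magnitude is reported.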

devtools::load_all("../ggraph")
library(cowplot)
library(glue)
library(gridExtra)
library(ggtext)
library(here)
library(patchwork)
library(reticulate)
library(tidygraph)
library(tidyverse)
set.seed(20200308)
# use_python('/usr/local/bin/python3')
source_python(here("code", "py", "algorithm_diagram.py"))
optifit <- create_optifit()
optifit_iters <- optifit$iterate %>%
  lapply(function(x) {
    return(list(
      nodes = x[["nodes"]] %>% py_to_r(),
      edges = x[["edges"]] %>% py_to_r() %>%
        bind_rows(data.frame(from = 1, to = 1, mcc = NA)) %>%
        mutate(
          is_loop = from == to,
          loop_dir = ifelse(from == 1, 270, 90)
        )
    ))
  })

plot_optifit_graph <- function(graph, title = "",
                               hide_loops = FALSE) {
  loop_dir <- 90
  loop_color <- ifelse(hide_loops, "white", "black")
  create_layout(graph, "linear", sort.by = id) %>%
    ggraph() +
    geom_edge_arc(aes(
      label = mcc,
      start_cap = label_rect(node1.name),
      end_cap = label_rect(node2.name,
        padding = margin(1, 1, 2.8, 1, "mm")
      )
    ),
    arrow = arrow(
      length = unit(3, "mm"),
      angle = 35,
      type = "closed"
    ),
    edge_colour = "gray",
    angle_calc = "along",
    label_dodge = unit(-2, "mm")
    ) +
    geom_edge_loop(aes(
      span = 1,
      direction = loop_dir,
      strength = 0.5,
      color = is_loop
    )) +
    geom_node_label(aes(label = name)) +
    scale_edge_color_manual(values = c(loop_color)) +
    labs(title = title) +
    theme_void() +
    theme(
      plot.margin = unit(x = c(0, 0, 0, 0), units = "pt"),
      legend.position = "none"
    )
}

i <- 0
optifit_graphs <- lapply(optifit_iters, function(x) {
  i <<- i + 1
  tbl_graph(nodes = x$nodes, edges = x$edges) %>%
    plot_optifit_graph(
      title = glue("{i}. MCC = {x$edges %>% filter(is_loop) %>% pull(mcc)}"),
      hide_loops = TRUE
    )
})


base_color <- "#000000"
ref_color <- "#D95F02"
query_color <- "#1B9E77"
ref_seqs <- LETTERS[1:17]
query_seqs <- LETTERS[23:26]

dist_dat <- get_dists() %>%
  arrange(seq1, seq2) %>%
  mutate(
    color1 = ifelse(seq1 %in% ref_seqs, ref_color, query_color),
    color2 = ifelse(seq2 %in% ref_seqs, ref_color, query_color),
  )
dist_dat[["color3"]] <- rep.int("black", nrow(dist_dat))
dist_dat[["dist"]] <- runif(nrow(dist_dat), 1.0, 2.9) %>%
  format(digits = 2) %>%
  as.character()
table_colors <- dist_dat %>%
  select(color1, color2, color3) %>%
  as.matrix() %>%
  t()

table_plot <- plot_grid(ggdraw() +
  draw_label("0. List of sequence pairs within the distance threshold",
    x = 0,
    hjust = 0
  ) +
  theme(plot.margin = margin(5, 0, 5, 0)),
tableGrob(dist_dat %>%
  select(seq1, seq2, dist) %>%
  rename(
    `% distance` = dist,
    ` ` = seq1,
    `  ` = seq2
  ) %>%
  t(),
theme = ttheme_default(
  base_size = 10,
  padding = unit(c(4, 4), "pt"),
  core = list(
    fg_params = list(col = table_colors),
    bg_params = list(col = "white")
  ),
  rowhead = list(bg_params = list(col = NA)),
  colhead = list(bg_params = list(col = NA))
)
),
ncol = 1, rel_heights = c(0.1, 1)
)

plot_diagram <- table_plot /
  optifit_graphs +
  plot_layout(heights = c(0.75, 1, 1.5, 1, 0.3))


dims <- eval(parse(text = snakemake@params[["dim"]]))
ggsave(snakemake@output[["tiff"]],
  device = "tiff", dpi = 300,
  width = dims[1], height = dims[2], units = "in"
)

R tidyverse cowplot patcHwork GLUE gridExtra here ggtext reticulate tidygraph From line 1 of R/plot_algorithm_diagram.R

set.seed(20210509)
library(cowplot)
library(ggtext)
library(glue)
library(here)
library(knitr)
library(tidyverse)

mutate_perf <- function(dat) {
  dat %>%
    mutate(
      mem_mb = max_rss,
      mem_gb = mem_mb / 1024
    ) %>%
    rename(sec = s)
}
select_cols <- function(dat) {
  dat %>%
    select(
      dataset, strategy, method, tool, mcc, sec, mem_gb, fraction_mapped,
      ref_frac, ref_weight
    )
}

opticlust <- read_tsv(here("subworkflows/1_prep_samples/results/opticlust_results.tsv")) %>%
  full_join(read_tsv(here("subworkflows/1_prep_samples/results/dataset_sizes.tsv"))) %>%
  mutate_perf() %>%
  mutate(
    strategy = method,
    fraction_mapped = NA,
    ref_frac = 0,
    ref_weight = "NA"
  )
optifit_split <- read_tsv(here("subworkflows/3_fit_sample_split/results/optifit_split_results.tsv")) %>%
  mutate_perf() %>%
  mutate(strategy = "self-split")

dat <- list(optifit_split, opticlust) %>%
  lapply(select_cols) %>%
  reduce(bind_rows) %>%
  mutate(
    method = as.character(method),
    strategy = as.character(strategy)
  ) %>%
  mutate(fraction_mapped = case_when(
    method %>% as.character() != "closed" ~ NA_real_,
    TRUE ~ fraction_mapped
  )) %>%
  pivot_longer(c(mcc, fraction_mapped, sec),
    names_to = "metric"
  ) %>%
  mutate(
    metric = factor(
      case_when(
        metric == "mcc" ~ "MCC",
        metric == "fraction_mapped" ~ "Fraction Mapped",
        metric == "sec" ~ "Runtime (sec)",
        TRUE ~ metric
      ),
      levels = c("MCC", "Fraction Mapped", "Runtime (sec)")
    ),
    strategy = factor(
      case_when(
        strategy == "de_novo" ~ "_de novo_",
        strategy == "database_rdp" ~ "db: RDP",
        strategy == "database_silva" ~ "db: SILVA",
        strategy == "database_gg" ~ "db: Greengenes",
        TRUE ~ strategy
      ),
      levels = c(
        "db: RDP", "db: SILVA", "db: Greengenes",
        "self-split", "_de novo_"
      )
    ),
    method = factor(
      case_when(
        method == "de_novo" ~ "_de novo_",
        TRUE ~ method
      ),
      levels = c("open", "closed", "_de novo_")
    ),
    ref_weight = factor(
      case_when(
        ref_weight == "distance" ~ "similarity",
        TRUE ~ ref_weight
      ),
      levels = c("simple", "abundance", "similarity", "NA")
    )
  )

med_iqr <- function(x) {
  return(data.frame(
    y = median(x),
    ymin = quantile(x)[2],
    ymax = quantile(x)[4]
  ))
}
color_breaks <- list(
  simple = "#FF8C00",
  abundance = "#9932CC",
  similarity = "#008B8B"
)
color_labels <- lapply(
  names(color_breaks),
  function(name) {
    glue("<span style = 'color:{color_breaks[[name]]};'>{name}</span>")
  }
) %>% unlist()
color_values <- append(color_breaks, list(`NA` = "#000000"))
plot_results_split <- dat %>%
  filter(((ref_weight == "simple" | ref_frac == 0.5) | method == "_de novo_") & !is.na(value)) %>%
  ggplot(aes(ref_frac, value, color = ref_weight, shape = method)) +
  coord_flip() +
  stat_summary(
    geom = "point",
    fun = median,
    size = 2,
    position = position_dodge(width = 0.07)
  ) +
  facet_grid(dataset ~ metric, scales = "free", switch = "x") +
  scale_shape_manual(values = list(open = 1, closed = 19, `_de novo_` = 17)) +
  scale_color_manual(
    values = color_values,
    breaks = names(color_breaks),
    labels = color_labels
  ) +
  scale_x_continuous(
    breaks = seq(0, 1, 0.1),
    labels = c("NA", seq(0.1, 1, 0.1))
  ) +
  labs(x = "reference fraction", y = "") +
  theme_bw() +
  theme(
    legend.text = element_markdown(),
    legend.title = element_blank(),
    legend.position = "top",
    legend.margin = margin(t = 0, r = 0, b = 0, l = 0, unit = "pt"),
    legend.spacing.x = unit(0.5, "pt"),
    plot.margin = unit(x = c(0, 0, 0, 0), units = "pt"),
    panel.grid.minor.y = element_blank(),
    axis.title.x = element_blank(),
    strip.placement = "outside",
    strip.background = element_blank()
  ) +
  guides(
    shape = guide_legend(order = 1),
    colour = guide_legend(
      override.aes = list(size = -1)
    )
  )

dim <- eval(parse(text = snakemake@params[["dim"]]))
ggsave(snakemake@output[["tiff"]],
  device = "tiff", dpi = 300,
  width = dim[1], height = dim[2], units = "in"
)

R ggplot2 tidyverse cowplot GLUE knitr here ggtext From line 3 of R/plot_results_split.R
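The `mutate_perf` helper in this script only converts units and renames a column. A minimal Python sketch of the same transformation on a single row follows; the row values are invented, and the assumption that `max_rss` is reported in megabytes is taken from how the R code names `mem_mb`.

```python
def mutate_perf(row):
    """Return a copy of a benchmark row with memory in GB and `s` renamed to `sec`."""
    out = dict(row)
    out["mem_mb"] = out["max_rss"]  # the R code treats max_rss as megabytes
    out["mem_gb"] = out["mem_mb"] / 1024
    out["sec"] = out.pop("s")  # rename the runtime column
    return out


row = {"dataset": "human", "s": 12.5, "max_rss": 2048.0}  # invented values
print(mutate_perf(row))
```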

set.seed(20210509)
library(cowplot)
library(ggtext)
library(glue)
library(here)
library(knitr)
library(tidyverse)
mutate_perf <- function(dat) {
  dat %>%
    mutate(
      mem_mb = max_rss,
      mem_gb = mem_mb / 1024
    ) %>%
    rename(sec = s)
}
select_cols <- function(dat) {
  dat %>%
    select(dataset, strategy, method, tool, mcc, sec, mem_gb, fraction_mapped)
}

opticlust <- read_tsv(here("subworkflows/1_prep_samples/results/opticlust_results.tsv")) %>%
  full_join(read_tsv(here("subworkflows/1_prep_samples/results/dataset_sizes.tsv"))) %>%
  mutate_perf() %>%
  mutate(strategy = method, fraction_mapped = NA)
optifit_dbs <- read_tsv(here("subworkflows/2_fit_reference_db/results/optifit_dbs_results.tsv")) %>%
  mutate_perf()
optifit_split <- read_tsv(here("subworkflows/3_fit_sample_split/results/optifit_split_results.tsv")) %>%
  filter(ref_frac == 0.5, ref_weight == "simple") %>%
  mutate_perf()
optifit_all <- list(
  optifit_dbs %>%
    mutate(strategy = glue("database_{ref}")),
  optifit_split %>%
    mutate(strategy = "self-split")
) %>%
  reduce(full_join)
vsearch <- read_tsv(here("subworkflows/4_vsearch/results/vsearch_results.tsv")) %>%
  mutate_perf() %>%
  mutate(strategy = case_when(
    method == "de_novo" ~ method,
    TRUE ~ as.character(glue("database_{ref}"))
  ))
mothur_vsearch <- list(optifit_all, opticlust, vsearch) %>%
  lapply(select_cols) %>%
  reduce(bind_rows) %>%
  mutate(
    method = as.character(method),
    strategy = as.character(strategy)
  ) %>%
  mutate(fraction_mapped = case_when(
    method %>% as.character() != "closed" ~ NA_real_,
    TRUE ~ fraction_mapped
  )) %>%
  pivot_longer(c(mcc, fraction_mapped, sec),
    names_to = "metric"
  ) %>%
  mutate(
    metric = factor(
      case_when(
        metric == "mcc" ~ "MCC",
        metric == "fraction_mapped" ~ "Fraction Mapped",
        metric == "sec" ~ "Runtime (sec)",
        TRUE ~ metric
      ),
      levels = c("MCC", "Fraction Mapped", "Runtime (sec)")
    ),
    strategy = factor(
      case_when(
        strategy == "de_novo" ~ "_de novo_",
        strategy == "database_rdp" ~ "db: RDP",
        strategy == "database_silva" ~ "db: SILVA",
        strategy == "database_gg" ~ "db: Greengenes",
        TRUE ~ strategy
      ),
      levels = c(
        "db: RDP", "db: SILVA", "db: Greengenes",
        "self-split", "_de novo_"
      )
    ),
    method = factor(
      case_when(
        method == "de_novo" ~ "_de novo_",
        TRUE ~ method
      ),
      levels = c("open", "closed", "_de novo_")
    )
  )

med_iqr <- function(x) {
  return(data.frame(
    y = median(x),
    ymin = quantile(x)[2],
    ymax = quantile(x)[4]
  ))
}

color_list <- list(
  `OptiClust (_de novo_) or OptiFit` = RColorBrewer::brewer.pal(3, "Set1")[1],
  VSEARCH = RColorBrewer::brewer.pal(3, "Set1")[2]
)
color_labels <- lapply(
  names(color_list),
  function(name) {
    glue("<span style = 'color:{color_list[[name]]};'>{name}</span>")
  }
) %>% unlist()

plot_results_sum <- mothur_vsearch %>%
  mutate(tool = case_when(
    tool == "vsearch" ~ "VSEARCH",
    tool == "mothur" ~ "OptiClust (_de novo_) or OptiFit"
  )) %>%
  ggplot(aes(value, strategy, color = tool, shape = method)) +
  # stat_summary(geom = "linerange",
  #              fun.data = med_iqr,
  #              position = position_dodge(width = 0.4)) +
  stat_summary(
    geom = "point",
    fun = median,
    size = 2,
    position = position_dodge(width = 0.4)
  ) +
  facet_grid(dataset ~ metric, scales = "free", switch = "x") +
  scale_shape_manual(values = list(open = 1, closed = 19, `_de novo_` = 17)) +
  scale_color_manual(
    values = color_list,
    labels = color_labels
  ) +
  labs(x = "", y = "") +
  theme_bw() +
  theme(
    strip.placement = "outside",
    strip.background = element_blank(),
    axis.text.y = element_markdown(),
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    legend.title = element_blank(),
    legend.text = element_markdown(),
    legend.position = "top",
    legend.margin = margin(t = 0, r = 0, b = 0, l = 0, unit = "pt"),
    legend.spacing.x = unit(0.5, "pt"),
    plot.margin = unit(x = c(0, 0, 0, 0), units = "pt")
  ) +
  guides(
    shape = guide_legend(order = 1),
    colour = guide_legend(
      override.aes = list(size = -1)
    )
  )

dims <- eval(parse(text = snakemake@params[["dim"]]))
ggsave(snakemake@output[["tiff"]],
  device = "tiff", dpi = 300,
  width = dims[1], height = dims[2], units = "in"
)

R ggplot2 tidyverse cowplot VSEARCH GLUE knitr mothur here ggtext From line 3 of R/plot_results_sum.R
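The `med_iqr` helper defined in this script returns the median together with the first and third quartiles. Here is a minimal Python sketch of the same summary. It assumes that `method="inclusive"` in `statistics.quantiles` corresponds to the default interpolation of R's `quantile()`; the input values are invented.

```python
import statistics


def med_iqr(values):
    """Median with first and third quartiles, mirroring the R helper's y/ymin/ymax."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"y": statistics.median(values), "ymin": q1, "ymax": q3}


print(med_iqr([1, 2, 3, 4, 5]))  # invented values
```

In the plotting script this summary is only referenced from the commented-out `linerange` layer, so the published figure shows medians without interquartile ranges.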

rmarkdown::render(
  here::here(snakemake@input[["Rmd"]]),
  params = list(include_figures = snakemake@params[["include_figures"]]),
  output_format = snakemake@params[["format"]],
  output_file = here::here(snakemake@output[1])
)

R From line 1 of R/render.R

library(tidyverse)
library(glue)
library(here)

mutate_perf <- function(dat) {
  dat %>%
    mutate(
      mem_mb = max_rss,
      mem_gb = mem_mb / 1024,
      label = as.character(label)
    ) %>%
    rename(
      sec = s
    )
}

opticlust <- read_tsv(here("subworkflows/1_prep_samples/results/opticlust_results.tsv")) %>%
  mutate_perf() %>%
  mutate(strategy = method)
optifit_db <- read_tsv(here("subworkflows/2_fit_reference_db/results/optifit_dbs_results.tsv")) %>%
  mutate_perf() %>%
  mutate(strategy = "database")
optifit_split <- read_tsv(here("subworkflows/3_fit_sample_split/results/optifit_split_results.tsv")) %>%
  mutate_perf() %>%
  mutate(strategy = "self-split")
vsearch <- read_tsv(here("subworkflows/4_vsearch/results/vsearch_results.tsv")) %>%
  rename(label = label...10) %>%
  select(-label...30) %>%
  mutate_perf() %>%
  mutate(strategy = case_when(
    method == "de_novo" ~ method,
    TRUE ~ "database"
  ))
results_agg <- list(opticlust, optifit_db, optifit_split, vsearch) %>%
  reduce(full_join)

results_sum <- results_agg %>%
  group_by(tool, strategy, method, dataset, ref, ref_frac, ref_weight) %>%
  summarize(
    n = n(),
    mcc_median = median(mcc), # TODO: tidy way to avoid this repetitiveness?
    sec_median = median(sec),
    mem_gb_median = median(mem_gb),
    frac_map_median = median(fraction_mapped)
  )
#
# write_tsv(results_agg, snakemake@output[['agg']])
# write_tsv(results_sum, snakemake@output[['sum']])
write_tsv(results_agg, "results/aggregated.tsv")
write_tsv(results_sum, "results/summarized.tsv")

vsearch %>%
  select(dataset, method, mcc, fraction_mapped, sec) %>%
  knitr::kable() %>%
  write(file = "subworkflows/4_vsearch/results/vsearch_abbr.md")

R tidyverse VSEARCH GLUE here From line 1 of R/summarize_results.R
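
The grouped-median step in `summarize_results.R` can be illustrated outside of R. The following is a minimal sketch in standard-library Python, not code from the repository: the `summarize_medians` helper and the example records are hypothetical, and only two of the grouping columns (`tool`, `dataset`) are used for brevity.

```python
from collections import defaultdict
from statistics import median


def summarize_medians(rows, group_keys, value_keys):
    """Group dict records by group_keys and take the median of each value key."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in group_keys)].append(row)
    summary = []
    for key, members in groups.items():
        out = dict(zip(group_keys, key))
        out["n"] = len(members)  # number of records in the group
        for v in value_keys:
            out[f"{v}_median"] = median(m[v] for m in members)
        summary.append(out)
    return summary


# Hypothetical records; the real values come from the subworkflow TSV files.
rows = [
    {"tool": "mothur", "dataset": "soil", "mcc": 0.80, "sec": 10.0},
    {"tool": "mothur", "dataset": "soil", "mcc": 0.90, "sec": 30.0},
    {"tool": "mothur", "dataset": "soil", "mcc": 0.85, "sec": 20.0},
    {"tool": "vsearch", "dataset": "soil", "mcc": 0.70, "sec": 5.0},
]
summary = summarize_medians(rows, ["tool", "dataset"], ["mcc", "sec"])
```

Generating the `<column>_median` names in a loop is one way to avoid repeating a `median()` call per column, which is the repetition the `TODO` in the R script points at.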

library(here)
library(testthat)
test_dir(here("code", "tests", "testthat"), stop_on_failure = TRUE)

R here testthat From line 1 of tests/testthat.R

script:
    'code/R/summarize_results.R'

SnakeMake From line 69 of main/Snakefile

script:
    'code/R/calc_results_stats.R'

SnakeMake From line 79 of main/Snakefile

script:
    'code/R/plot_algorithm_diagram.R'

SnakeMake From line 90 of main/Snakefile

shell:
    """
    dot -T tiff -Gsize={params.dim}\! -Gdpi=300 {input.gv} > {params.tmp}
    convert {params.tmp} -gravity center \
                         -background white \
                         -extent {params.width}x{params.height} \
                         {output.tiff}
    rm {params.tmp}
    """

SnakeMake From line 103 of main/Snakefile

script:
    'code/R/plot_results_sum.R'

SnakeMake From line 121 of main/Snakefile

script:
    'code/R/plot_results_split.R'

SnakeMake From line 132 of main/Snakefile

script:
    'code/R/render.R'

SnakeMake From line 148 of main/Snakefile

shell:
    """
    cp -r paper/figures/blanks/ figures/blanks/
    R -e "latexdiffr::latexdiff('{input.draft}', '{input.final}')"
    mv {params.diff} {output.diff}
    rm diff.log
    """

SnakeMake From line 165 of main/Snakefile

script:
    'code/R/render.R'

SnakeMake From line 185 of main/Snakefile

script:
    'code/R/render.R'

SnakeMake From line 206 of main/Snakefile

script:
    'code/R/render.R'

SnakeMake From line 227 of main/Snakefile

script:
    'code/py/abstract_word_count.py'

SnakeMake From line 236 of main/Snakefile

script:
    'code/py/create_test_uc_files.py'

SnakeMake From line 246 of main/Snakefile

script:
    'code/tests/testthat.R'

SnakeMake From line 253 of main/Snakefile

shell:
    'python -m code.tests.test_python'

SnakeMake From line 261 of main/Snakefile

run:
    for i, fig in enumerate(input):
        i += 1
        print(i, fig)
        shutil.copyfile(fig, f'paper/figures/Figure{i}.tiff')

SnakeMake From line 272 of main/Snakefile
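
The figure-numbering logic of the `run:` block above can be tried outside of Snakemake. This is a hedged standalone sketch: the `copy_numbered_figures` helper, the temporary directory, and the file names are assumptions made for illustration, and `enumerate(..., start=1)` stands in for the manual `i += 1`.

```python
import shutil
import tempfile
from pathlib import Path


def copy_numbered_figures(figures, dest_dir):
    """Copy each figure to dest_dir as Figure1.tiff, Figure2.tiff, ... in input order."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for i, fig in enumerate(figures, start=1):
        target = dest_dir / f"Figure{i}.tiff"
        shutil.copyfile(fig, target)
        copied.append(target)
    return copied


# Demonstration with throwaway files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    sources = []
    for name in ["algorithm.tiff", "results_sum.tiff"]:
        path = Path(tmp) / name
        path.write_bytes(name.encode())
        sources.append(path)
    copied = copy_numbered_figures(sources, Path(tmp) / "paper" / "figures")
    copied_names = [p.name for p in copied]
    first_bytes = copied[0].read_bytes()
```

The figure number depends only on the order of the inputs, so the order in which the rule lists its figures determines which file becomes `Figure1.tiff`.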

shell:
    """
    R -e '
    rmarkdown::render(here::here("{input.Rmd}"),
                      output_format = "{params.format}"
                      )
    '
    """

SnakeMake From line 286 of main/Snakefile

shell:
    """
    zip -j {output} {input}
    rm -f paper/paper*.tex paper/paper*.log
    """

SnakeMake From line 304 of main/Snakefile

Created: 1yr ago

Updated: 1yr ago

Maintainers: public

URL: http://www.schlosslab.org/Sovacool_OptiFit_mSphere_2022

Name: sovacool_optifit_msphere_2022

Version: v1.0.0

License: None

Keywords:

GLUE mothur patcHwork Snakemake VSEARCH cowplot ggplot2 ggtext gridExtra here knitr reticulate testthat tidygraph tidyverse
