A single cell RNA-seq workflow, including highly variable gene analysis, cell type assignment and differential expression analysis.

public 7mo ago Version: v1.1.0 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

A single cell RNA-seq workflow following Lun, McCarthy and Marioni 2016 and Soneson and Robinson 2018 , with added more recent functionality.

Authors

Johannes Köster, https://koesterlab.github.io

Usage

In any case, if you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and, if already available, its DOI (see above).

Step 1: Obtain a copy of this workflow

Create a new github repository using this workflow as a template .
Clone the newly created repository to your local system, into the place where you want to perform the data analysis.

Step 2: Configure workflow

Configure the workflow according to your needs via editing the file config.yaml .

Step 3: Execute workflow

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores $N

using $N cores or run it in a cluster environment via

snakemake --use-conda --cluster qsub --jobs 100

snakemake --use-conda --drmaa --jobs 100

If you not only want to fix the software stack but also the underlying OS, use

snakemake --use-conda --use-singularity

in combination with any of the modes above. See the Snakemake documentation for further details.

Step 4: Investigate results

After successful execution, you can create a self-contained interactive HTML report with all results via:

snakemake --report report.html

This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here .

Step 5: Commit changes

Whenever you change something, don't forget to commit the changes back to your github copy of the repository:

git commit -a
git push

Step 6: Obtain updates from upstream

Whenever you want to synchronize your workflow copy with new developments from upstream, do the following.

Once, register the upstream repository in your local copy: git remote add -f upstream [email protected]:snakemake-workflows/single-cell-rna-seq.git or git remote add -f upstream https://github.com/snakemake-workflows/single-cell-rna-seq.git if you do not have setup ssh keys.
Update the upstream version: git fetch upstream .
Create a diff with the current version: git diff HEAD upstream/master workflow > upstream-changes.diff .
Investigate the changes: vim upstream-changes.diff .
Apply the modified diff via: git apply upstream-changes.diff .
Carefully check whether you need to update the config files: git diff HEAD upstream/master config . If so, do it manually, and only where necessary, since you would otherwise likely overwrite your settings and samples.

Step 7: Contribute back

In case you have also changed or added steps, please consider contributing them back to the original repository:

Fork the original repo to a personal or lab account.
Clone the fork to your local system, to a different place than where you ran your analysis.
Copy the modified files from your analysis to the clone of your fork, e.g., cp -r workflow path/to/fork . Make sure to not accidentally copy config file contents or sample sheets. Instead, manually update the example config files if necessary.
Commit and push your changes to your fork.
Create a pull request against the original repository.

Testing

Test cases are in the subfolder .test . They are automtically executed via continuous integration with Travis CI.

Code Snippets

script:
    "../scripts/cell-cycle.R"

SnakeMake From line 18 of rules/cell-cycle.smk

script:
    "../scripts/cell-cycle-scores.R"

SnakeMake From line 36 of rules/cell-cycle.smk

script:
    "../scripts/cellassign.R"

SnakeMake From line 25 of rules/cell-type.smk

script:
    "../scripts/plot-cellassign.R"

SnakeMake From line 42 of rules/cell-type.smk

script:
    "../scripts/celltype-tsne.R"

SnakeMake From line 67 of rules/cell-type.smk

script:
    "../scripts/plot-gene-expression.R"

SnakeMake From line 89 of rules/cell-type.smk

script:
    "../scripts/load-counts.R"

SnakeMake From line 18 of rules/counts.smk

script:
    "../scripts/diffexp.R"

SnakeMake From line 35 of rules/diffexp.smk

script:
    "../scripts/plot-differential-expression.R"

SnakeMake From line 55 of rules/diffexp.smk

script:
    "../scripts/filter-cells.R"

SnakeMake From line 21 of rules/filtration.smk

script:
    "../scripts/normalize.R"

SnakeMake From line 23 of rules/normalization.smk

script:
    "../scripts/batch-effect-removal.R"

SnakeMake From line 40 of rules/normalization.smk

script:
    "../scripts/qc.R"

SnakeMake From line 35 of rules/qc.smk

script:
    "../scripts/explained-variance.R"

SnakeMake From line 53 of rules/qc.smk

script:
    "../scripts/plot-gene-gene-expression.R"

SnakeMake From line 77 of rules/qc.smk

script:
    "../scripts/gene-tsne.R"

SnakeMake From line 96 of rules/qc.smk

script:
    "../scripts/hvg.R"

SnakeMake From line 38 of rules/variance.smk

script:
    "../scripts/hvg-correlation.R"

SnakeMake From line 69 of rules/variance.smk

script:
    "../scripts/hvg-pca.R"

SnakeMake From line 89 of rules/variance.smk

script:
    "../scripts/hvg-tsne.R"

SnakeMake From line 111 of rules/variance.smk

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(scater)
library(limma)

sce <- readRDS(snakemake@input[["sce"]])
cycle.assignments <- readRDS(snakemake@input[["cycles"]])
colData(sce)$G1 <- cycle.assignments$scores$G1
colData(sce)$G2M <- cycle.assignments$scores$G2M

# setup design matrix
model <- snakemake@params[["model"]]
design <- model.matrix(as.formula(model), data=colData(sce))

# store design matrix
saveRDS(design, file=snakemake@output[["design_matrix"]])

# remove batch effects based on variables
batch.removed <- removeBatchEffect(logcounts(sce), covariates=design)
logcounts(sce) <- batch.removed
saveRDS(sce, file=snakemake@output[["sce"]])

R scater From line 6 of scripts/batch-effect-removal.R

Sys.setenv(MKL_THREADING_LAYER = "GNU")

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(SingleCellExperiment)
library(tensorflow)
library(cellassign)
library(ComplexHeatmap)
library(viridis)
library(ggsci)

is.float <- function(x) {
    (typeof(x) == "double") && (x %% 1 != 0)
}

sce <- readRDS(snakemake@input[["sce"]])
parent <- snakemake@wildcards[["parent"]]

# get parent fit and filter sce to those cells
parent_fit <- snakemake@input[["fit"]]
if(length(parent_fit) > 0) {
    parent_fit <- readRDS(parent_fit)$cell_type
    is_parent_type <- rownames(parent_fit)[parent_fit$cell_type == parent]
    sce <- sce[, is_parent_type]
}

markers <- read.table(snakemake@input[["markers"]], row.names = NULL, header = TRUE, sep="\t", stringsAsFactors = FALSE, na.strings = "")
markers[is.na(markers$parent), "parent"] <- "root"
markers <- markers[markers$parent == parent, ]

# convert markers into something cellAssign understands
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
get_genes <- function (x) {
    if(is.na(x)) {
        vector()
    } else {
        sapply(strsplit(x, ","), trim)
    }
}
genes <- vector()
for(g in markers$genes) {
    genes <- c(genes, get_genes(g))
}
genes <- sort(unique(genes))

if(length(genes) < 1) {
    stop("Markers have to contain at least two different genes in union.")
}

marker_mat <- matrix(0, nrow = nrow(markers), ncol = length(genes))
colnames(marker_mat) <- genes
rownames(marker_mat) <- markers$name
for(i in 1:nrow(markers)) {
    cell_type <- markers[i, "name"]
    marker_mat[cell_type, ] <- genes %in% get_genes(markers[i, "genes"])
}
marker_mat <- t(marker_mat)
marker_mat <- marker_mat[rownames(marker_mat) %in% rownames(sce),, drop=FALSE]


# apply cellAssign
sce <- sce[rownames(marker_mat), ]
# remove genes with 0 counts in all cells and cells with 0 counts in all genes
sce <- sce[rowSums(counts(sce)) != 0, colSums(counts(sce)) != 0]
# obtain batch effect model
model <- readRDS(snakemake@input[["design_matrix"]])
# constrain to selected cells and remove intercept (not allowed for cellassign)
model <- model[colnames(sce), colnames(model) != "(Intercept)"]
# normalize float columns (as recommended in cellassign manual)
float_cols <- apply(model, 2, is.float)
model[, float_cols] <- apply(model[, float_cols], 2, scale)
if(nrow(sce) == 0) {
    stop("Markers do not match any gene names in the count matrix.")
}
# fit
fit <- cellassign(exprs_obj = sce, marker_gene_info = marker_mat, s = sizeFactors(sce), learning_rate = 1e-2, B = 20, shrinkage = TRUE, X = model)

# add cell names to results
cells <- colnames(sce)
rownames(fit$mle_params$gamma) <- cells
fit$cell_type <- data.frame(cell_type = fit$cell_type)
rownames(fit$cell_type) <- cells

saveRDS(fit, file = snakemake@output[["fit"]])

save.image()
# plot heatmap
source(file.path(snakemake@scriptdir, "common.R"))
sce <- assign_celltypes(fit, sce, snakemake@params[["min_gamma"]])

pdf(file = snakemake@output[["heatmap"]])
pal <- pal_d3("category20")(ncol(marker_mat))
names(pal) <- colnames(marker_mat)
celltype <- HeatmapAnnotation(df = data.frame(celltype = colData(sce)$celltype), col = list(celltype = pal))
Heatmap(logcounts(sce), col = viridis(100), clustering_distance_columns = "canberra", clustering_distance_rows = "canberra", use_raster = TRUE, show_row_dend = FALSE, show_column_dend = FALSE, show_column_names = FALSE, top_annotation = celltype, name = "logcounts")
dev.off()

R viridis ComplexHeatmap SingleCellExperiment ggsci CellAssign From line 2 of scripts/cellassign.R

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(scater)
library(scran)

sce <- readRDS(snakemake@input[[1]])

# determine cell cycle
species = snakemake@params[["species"]]
if (species == "mouse") {
    markers <- "mouse_cycle_markers.rds"
} else if (species == "human") {
    markers <- "human_cycle_markers.rds"
} else {
    stop("Unsupported species. Only mouse and human are supported.")
}

# read trained data
pairs <- readRDS(system.file("exdata", markers, package="scran"))

# obtain assignments
assignments <- cyclone(sce, pairs, gene.names=rowData(sce)$ensembl_gene_id)

# store assignments
saveRDS(assignments, file=snakemake@output[[1]])

R scater scran From line 6 of scripts/cell-cycle.R

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")


library(RColorBrewer)

assignments <- readRDS(snakemake@input[["rds"]])
annotation <- read.table(snakemake@input[["cells"]], header=TRUE, row.names=1)
covariate <- gsub("-", ".", snakemake@wildcards[["covariate"]])
covariate.values <- unique(annotation[, covariate])
n <- length(covariate.values)

# plot
pdf(file=snakemake@output[[1]])
# set color palette
palette(brewer.pal(n=n, name="Dark2"))
plot(assignments$score$G1, assignments$score$G2M,
     xlab="G1 score", ylab="G2/M score", pch=16,
     col=annotation[, covariate])
legend("topright", legend=covariate.values, col=palette()[covariate.values], pch=16)
dev.off()

R RColorBrewer From line 6 of scripts/cell-cycle-scores.R

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(scater)
library(scran)
library(ggsci)
source(file.path(snakemake@scriptdir, "common.R"))

seed <- as.integer(snakemake@wildcards[["seed"]])
target_parent <- snakemake@wildcards[["parent"]]
parents <- snakemake@params[["parents"]]
fits <- snakemake@input[["fits"]]

sce <- readRDS(snakemake@input[["sce"]])

for(i in 1:length(fits)) {
    cellassign_fit <- fits[i]
    parent <- parents[i]
    cellassign_fit <- readRDS(cellassign_fit)
    sce <- assign_celltypes(cellassign_fit, sce, snakemake@params[["min_gamma"]])
    if(parent == target_parent) {
        break
    }
}

style <- theme(
    axis.text=element_text(size=12),
    axis.title=element_text(size=16))

# plot t-SNE
pdf(file=snakemake@output[[1]], width = 12)
set.seed(seed)
plotTSNE(sce, colour_by="celltype") + scale_fill_d3(alpha = 1.0) + scale_colour_d3(alpha = 1.0) + style
dev.off()

R scater scran ggsci From line 6 of scripts/celltype-tsne.R

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(tidyverse)
library(SingleCellExperiment)
library(scran)
library(edgeR)
library(dplyr)

source(file.path(snakemake@scriptdir, "common.R"))

sce <- readRDS(snakemake@input[["sce"]])
for(cellassign_fit in snakemake@input[["cellassign_fits"]]) {
    cellassign_fit <- readRDS(cellassign_fit)
    sce <- assign_celltypes(cellassign_fit, sce, snakemake@params[["min_gamma"]])
}
# handle constrain celltypes
constrain_celltypes <- snakemake@params[["constrain_celltypes"]]
if(!is.null(constrain_celltypes)) {
    celltypes <- constrain_celltypes[["celltypes"]]
    common_var <- constrain_celltypes[["common"]]

    sce <- sce[, colData(sce)$celltype %in% celltypes]

    if(!is.null(common_var)) {
        if(!(common_var %in% colnames(colData(sce)))) {
            stop(paste("covariate", common_var, "not found in cell metadata"))
        }
        is_common_in_all <- apply(table(colData(sce)[, c(common_var, "celltype")]) > 0, 1, all)
	common_in_all <- names(is_common_in_all)[is_common_in_all]
	sce <- sce[, colData(sce)[, common_var] %in% common_in_all]
        colData(sce) <- droplevels(colData(sce))
    }
}
# only keep the requested cells
if(length(snakemake@params[["celltypes"]]) > 0) {
    sce <- sce[, colData(sce)$celltype %in% snakemake@params[["celltypes"]]]
}
colData(sce)$celltype <- factor(colData(sce)$celltype)
colData(sce)$detection_rate <- cut(colData(sce)$detection_rate, 10)


# convert to edgeR input
y <- convertTo(sce, type = "edgeR", col.fields = colnames(colData(sce)))

# obtain design matrix
design <- model.matrix(as.formula(snakemake@params[["design"]]), data=y$samples)

y <- calcNormFactors(y)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = snakemake@params[["coef"]])

saveRDS(y, file = snakemake@output[["edger_dge"]])

write_tsv(rownames_to_column(topTags(qlf, n = 100000)$table, var = "gene"), snakemake@output[["table"]])

pdf(file = snakemake@output[["bcv"]])
plotBCV(y)
dev.off()

pdf(file = snakemake@output[["md"]])
plotMD(qlf)
abline(h = c(-1,1), col = "grey")
dev.off()

pdf(file = snakemake@output[["disp"]])
plotQLDisp(fit)
dev.off()

R tidyverse dplyr scran SingleCellExperiment From line 1 of scripts/diffexp.R

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(scater)
library(scran)

sce <- readRDS(snakemake@input[["rds"]])
annotation <- read.table(snakemake@input[["cells"]], row.names=1, header=TRUE)

pdf(file=snakemake@output[[1]])
plotExplanatoryVariables(sce,
    variables=c("total_counts_Spike",
                "log10_total_counts_Spike",
                colnames(annotation))) + theme(
    axis.text=element_text(size=12),
    axis.title=element_text(size=16)
)
dev.off()

R scater scran From line 6 of scripts/explained-variance.R

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(scater)

sce <- readRDS(snakemake@input[[1]])

# drop cells with too few counts or expressed features
libsize.drop <- isOutlier(sce$total_counts, nmads=3, type="lower", log=TRUE)
feature.drop <- isOutlier(sce$total_features_by_counts, nmads=3, type="lower", log=TRUE)

# drop cells with too high proportion of mito genes or spike-ins expressed
mito.drop <- isOutlier(sce$pct_counts_Mt, nmads=3, type="higher")
spike.drop <- isOutlier(sce$pct_counts_Spike, nmads=3, type="higher")

# write stats
stats <- data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop),
     ByMito=sum(mito.drop), BySpike=sum(spike.drop), Remaining=ncol(sce))
write.table(stats, file=snakemake@output[["stats"]], sep='\t',
            row.names=FALSE, quote=FALSE)

# filter sce
sce <- sce[,!(libsize.drop | feature.drop | mito.drop | spike.drop)]

saveRDS(sce, file=snakemake@output[["rds"]])

R scater From line 6 of scripts/filter-cells.R

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(scater)
library(scran)
library(ggsci)
library(viridis)

seed <- as.integer(snakemake@wildcards[["seed"]])
gene <- snakemake@wildcards[["gene"]]

sce <- readRDS(snakemake@input[["sce"]])

colData(sce)[, gene] <- logcounts(sce)[gene, ]

style <- theme(
    axis.text=element_text(size=12),
    axis.title=element_text(size=16))

# plot t-SNE
pdf(file=snakemake@output[[1]], width = 12)
set.seed(seed)
plotTSNE(sce, colour_by=gene) + scale_fill_viridis(alpha = 1.0) + scale_colour_viridis(alpha = 1.0) + style
dev.off()

R scater viridis scran ggsci From line 6 of scripts/gene-tsne.R

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(scater)
library(scran)
library(RBGL)
library(Rgraphviz)
library(gplots)


fdr <- snakemake@params[["fdr"]]
sce <- readRDS(snakemake@input[["sce"]])
hvgs <- read.table(snakemake@input[["hvg"]], row.names=1)

topn <- min(nrow(hvgs), snakemake@params[["top_n"]])
hvgs <- hvgs[1:topn, ]

# find correlated pairs
set.seed(100)
var.cor <- correlatePairs(sce, subset.row=rownames(hvgs))
sig.cor <- var.cor$FDR <= fdr
write.table(file=snakemake@output[["corr"]], var.cor, sep="\t", quote=FALSE, row.names=FALSE)


# define graph with significant correlation as edges
g <- ftM2graphNEL(cbind(var.cor$gene1, var.cor$gene2)[sig.cor,],
     W=NULL, V=NULL, edgemode="undirected")
# find highly connected clusters
cl <- highlyConnSG(g)$clusters
cl <- cl[order(lengths(cl), decreasing=TRUE)]


# plot correlation graph
pdf(file=snakemake@output[["graph"]])
plot(g, "neato", attrs=list(node=list(fillcolor="lightblue", color="lightblue")))
dev.off()


# choose significantly correlated genes
chosen <- unique(c(var.cor$gene1[sig.cor], var.cor$gene2[sig.cor]))
# get normalized expressions
norm.exprs <- logcounts(sce)[chosen,,drop=FALSE]


# plot heatmap
pdf(file=snakemake@output[["heatmap"]])
# z-score
heat.vals <- norm.exprs - rowMeans(norm.exprs)
heat.out <- heatmap.2(heat.vals, col=bluered, symbreak=TRUE, trace="none", cexRow=0.6)
dev.off()

R scater scran gplots Rgraphviz From line 6 of scripts/hvg-correlation.R

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(scater)

fdr <- snakemake@params[["fdr"]]
covariate <- gsub("-", ".", snakemake@wildcards[["covariate"]])
sce <- readRDS(snakemake@input[["sce"]])
var.cor <- read.table(snakemake@input[["var_cor"]], header=TRUE)
sig.cor <- var.cor$FDR <= fdr

# choose significantly correlated genes
chosen <- unique(c(var.cor$gene1[sig.cor], var.cor$gene2[sig.cor]))

style <- theme(
    axis.text=element_text(size=12),
    axis.title=element_text(size=16))

# plot PCA
pdf(file=snakemake@output[[1]])
plotPCA(sce[chosen, ], colour_by=covariate,
        ncomponents=3) + style
dev.off()

R PCAtools scater From line 6 of scripts/hvg-pca.R

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(scater)
library(scran)

use.spikes <- snakemake@params[["use_spikes"]]
fdr <- snakemake@params[["fdr"]]
min.bio.comp <- snakemake@params[["min_bio_comp"]]

sce <- readRDS(snakemake@input[["sce_norm"]])
design <- readRDS(snakemake@input[["design_matrix"]])

# apply trend model to obtain per-gene variance
# batch effects are modeled in design matrix
var.fit <- trendVar(sce, use.spikes=use.spikes, design=design)
var.out <- decomposeVar(sce, var.fit)


# from now on use expressions with removed batch effects
sce <- readRDS(snakemake@input[["sce_batch"]])


# plot mean vs. variance (spikes are red)
pdf(file=snakemake@output[["mean_vs_variance"]])
plot(var.out$mean, var.out$total, pch=16, cex=0.6, xlab="Mean log-expression",
     ylab="Variance of log-expression")
o <- order(var.out$mean)
lines(var.out$mean[o], var.out$tech[o], col="dodgerblue", lwd=2)
cur.spike <- isSpike(sce)
points(var.out$mean[cur.spike], var.out$total[cur.spike], col="red", pch=16)
dev.off()


# determine HVGs (highly variable genes)
hvg.out <- var.out[which(var.out$FDR <= fdr & var.out$bio >= min.bio.comp),]
# sort
hvg.out <- hvg.out[order(hvg.out$bio, decreasing=TRUE),]


# store HVGs in table on disk
write.table(file=snakemake@output[["hvg"]], hvg.out, sep="\t", quote=FALSE, col.names=NA)


# plot expression distributions
pdf(file=snakemake@output[["hvg_expr_dist"]])
print(hvg.out)
plotExpression(sce, rownames(hvg.out)[1:snakemake@params[["show_n"]]]) + theme(
    axis.text=element_text(size=12),
    axis.title=element_text(size=16)
)
dev.off()


# store variance estimates
saveRDS(var.out, file=snakemake@output[["var"]])

R scater scran From line 6 of scripts/hvg.R

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(scater)
library(scran)

seed <- as.integer(snakemake@wildcards[["seed"]])
fdr <- snakemake@params[["fdr"]]
covariate <- gsub("-", ".", snakemake@wildcards[["covariate"]])

sce <- readRDS(snakemake@input[["sce"]])
var.cor <- read.table(snakemake@input[["var_cor"]], header=TRUE)
sig.cor <- var.cor$FDR <= fdr

# choose significantly correlated genes
chosen <- unique(c(var.cor$gene1[sig.cor], var.cor$gene2[sig.cor]))

style <- theme(
    axis.text=element_text(size=12),
    axis.title=element_text(size=16))

# plot t-SNE
pdf(file=snakemake@output[[1]])
set.seed(seed)
plotTSNE(sce[chosen, ], colour_by=covariate) + style
dev.off()

R scater scran From line 6 of scripts/hvg-tsne.R

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(scater)
library(scran)


all.counts <- as.matrix(read.table(snakemake@input[["counts"]], row.names=1, header=TRUE))
all.annotation <- read.table(snakemake@input[["cells"]], row.names=1, header=TRUE)
sce <- SingleCellExperiment(
  assays = list(counts = all.counts),
  colData = all.annotation
)

# get feature annotation
species = snakemake@params[["species"]]
if (species == "mouse") {
    dataset <- "mmusculus_gene_ensembl"
    symbol <- "mgi_symbol"
} else if (species == "human") {
    dataset <- "hsapiens_gene_ensembl"
    symbol <- "hgnc_symbol"
} else {
    stop("Unsupported species. Only mouse and human are supported.")
}

sce <- getBMFeatureAnnos(sce, filters=c(snakemake@params[["feature_ids"]]), attributes = c("ensembl_gene_id", symbol, "chromosome_name", "gene_biotype", "start_position", "end_position"), dataset=dataset, host=snakemake@config[["counts"]][["biomart"]])
rowData(sce)[, "gene_symbol"] <- rowData(sce)[, symbol]
rownames(sce) <- uniquifyFeatureNames(rownames(sce), rowData(sce)$gene_symbol)

# get mitochondrial genes
is.mito <- colData(sce)$chromosome_name == "MT"

# annotate spike-ins
is.spike <- grepl(snakemake@params[["spike_pattern"]], rownames(sce))
isSpike(sce, "Spike") <- is.spike

# calculate metrics
sce <- calculateQCMetrics(sce, feature_controls=list(Spike=is.spike, Mt=is.mito))
colData(sce)[, "detection_rate"] <- sce$total_features_by_counts / nrow(sce)

saveRDS(sce, file=snakemake@output[[1]])

R scater scran From line 6 of scripts/load-counts.R

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(scater)
library(scran)

sce <- readRDS(snakemake@input[[1]])

# pre-cluster in order to group similar cells
# this avoids a violation of the non-DE assumption
preclusters <- quickCluster(sce)

# compute size factors by pooling cells according to the clustering
sce <- computeSumFactors(sce, clusters=preclusters, min.mean=snakemake@params[["min_count"]])

pdf(file=snakemake@output[["scatter"]])
plot(sizeFactors(sce), sce$total_counts/1e6, log="xy",
     ylab="Library size (millions)", xlab="Size factor")
dev.off()

# use size factors to calculate normalized counts in log space (available under logcounts(sce)) 
sce <- normalize(sce)

saveRDS(sce, file=snakemake@output[["rds"]])

R scater scran From line 6 of scripts/normalize.R

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(ComplexHeatmap)
library(viridis)

set.seed(562374)

fit <- readRDS(snakemake@input[[1]])

pdf(file = snakemake@output[[1]])
Heatmap(fit$mle_params$gamma, col = viridis(100), clustering_distance_rows = "canberra", use_raster = TRUE, show_row_dend = FALSE, show_column_dend = FALSE, show_row_names = FALSE)
dev.off()

R viridis ComplexHeatmap From line 1 of scripts/plot-cellassign.R

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(tidyverse)
library(edgeR)

gene_of_interest <- snakemake@wildcards[["gene"]]
coef <- snakemake@params[["coef"]]

diffexp <- read_tsv(snakemake@input[["diffexp"]])
y <- readRDS(snakemake@input[["edger_dge"]])

log_cpm <- as_tibble(cpm(y, log=TRUE), rownames="gene") %>% gather(key = "cell", value = "cpm", -gene) %>% filter(gene == gene_of_interest)

if (nrow(log_cpm) == 0) {
    stop(paste("Error: gene not found:", gene_of_interest))
}

log_cpm <- log_cpm %>% mutate(condition = as.factor(y$design[, coef]))

diffexp <- diffexp %>% filter(gene == gene_of_interest)

fmt_float <- function(x) {
    formatC(x, digits=2, format="g")
}
coef_name <- colnames(y$design)[coef]
pdf(file = snakemake@output[[1]])
ggplot(log_cpm, aes(x = condition, y = cpm, fill = condition)) + geom_violin() + geom_jitter(alpha = 0.5, size = 1) + labs(fill = coef_name, title = gene_of_interest, subtitle = paste("p =", fmt_float(diffexp$PValue), ", FDR =",  fmt_float(diffexp$FDR), ", log2fc =", fmt_float(diffexp$logFC))) + theme_classic()
dev.off()

R ggplot2 tidyverse From line 1 of scripts/plot-differential-expression.R

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(scater)
library(scran)
library(ggsci)
source(file.path(snakemake@scriptdir, "common.R"))

seed <- as.integer(snakemake@wildcards[["seed"]])

sce <- readRDS(snakemake@input[["sce"]])

cellassign_fit <- snakemake@input[["fit"]]
cellassign_fit <- readRDS(cellassign_fit)
sce <- assign_celltypes(cellassign_fit, sce, snakemake@params[["min_gamma"]])


# plot gene expression
pdf(file=snakemake@output[[1]], width = 12, height = 12)
plotExpression(sce, snakemake@params[["genes"]], snakemake@params[["feature"]])
dev.off()

R scater scran ggsci From line 6 of scripts/plot-gene-expression.R

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(scater)
library(scran)
library(ggsci)
library(ggpubr)
source(file.path(snakemake@scriptdir, "common.R"))

gene_a <- snakemake@wildcards[["gene_a"]]
gene_b <- snakemake@wildcards[["gene_b"]]

aes_name <- function(name) {
    paste0("`", name, "`")
}

sce <- readRDS(snakemake@input[["sce"]])
constrain_celltypes <- snakemake@params[["constrain_celltypes"]]

for(cellassign_fit in snakemake@input[["fits"]]) {
    cellassign_fit <- readRDS(cellassign_fit)
    sce <- assign_celltypes(cellassign_fit, sce, snakemake@params[["min_gamma"]], constrain_celltypes)
}

# handle constrain celltypes
if(!is.null(constrain_celltypes)) {
    celltypes <- constrain_celltypes
    print(celltypes)
    sce <- sce[, colData(sce)$celltype %in% celltypes]
}

pdf(file=snakemake@output[[1]], width = 4, height = 4)
data <- as.data.frame(t(logcounts(sce[c(gene_a, gene_b), ])))
vmin <- min(data)
vmax <- max(data)

regression <- snakemake@params[["regression"]]
correlation <- snakemake@params[["correlation"]]
dropout_threshold <- snakemake@params[["dropout_threshold"]]

data[, "dropout"] <- apply(data, 1, min) < dropout_threshold
non_dropout_data <- data[!data$dropout, ]

p <- ggplot(data, aes_string(x=aes_name(gene_a), y=aes_name(gene_b), color=aes_name("dropout"))) +
    geom_point(shape=1) + 
    scale_color_manual(values = c("#000000", "#666666")) +
    theme_classic() + 
    theme(legend.position = "none")

if(regression != FALSE) {
    regression <- as.formula(regression)
    p = p + geom_smooth(method="lm", color = "red", formula = formula, data = non_dropout_data) +
	    stat_regline_equation(data = non_dropout_data, label.x.npc = "left", label.y.npc = "top", formula = formula,
				  aes(label = paste(..eq.label.., ..rr.label.., ..adj.rr.label.., sep = "~~~~")))
}
if(correlation != FALSE) {
    p = p + stat_cor(method = correlation, data = non_dropout_data) + geom_smooth(method="lm", data = non_dropout_data, color = "red")
}

# pass plot to PDF
p

dev.off()

R ggplot2 ggpubr scater scran ggsci From line 6 of scripts/plot-gene-gene-expression.R

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(scater)

sce <- readRDS(snakemake@input[[1]])

# plot library sizes
pdf(file=snakemake@output[["libsizes"]])
hist(sce$total_counts/1e6, xlab="Library sizes (millions)", main="",
     breaks=20, col="grey80", ylab="Number of cells")
dev.off()

# plot number of expressed genes
pdf(file=snakemake@output[["expressed"]])
hist(sce$total_features_by_counts, xlab="Number of expressed genes", main="",
     breaks=20, col="grey80", ylab="Number of cells")
dev.off()

# plot mitochondrial proportion
pdf(file=snakemake@output[["mito_proportion"]])
hist(sce$pct_counts_Mt, xlab="Mitochondrial proportion (%)",
     ylab="Number of cells", breaks=20, main="", col="grey80")
dev.off()


# plot spike-in proportion
pdf(file=snakemake@output[["spike_proportion"]])
hist(sce$pct_counts_Spike, xlab="ERCC proportion (%)",
     ylab="Number of cells", breaks=20, main="", col="grey80")
dev.off()