Mixture Model for Zero-Inflated scRNA-seq Data Analysis

public public 1yr ago 0 bookmarks

A Bayesian mixture model approach to zero-inflation of counts for scRNA-seq data

--- Table of Contents

---

About

Single cell RNA-seq data exhibit large numbers of zero count values, that as demonstrated in our paper , for a subset of transcripts, be better modelled by a zero inflated negative binomial distribution. We develop a novel Dirichlet process mixture model which employs both a mixture at the cell level to model multiple cell types, and a mixture of single cell RNA-seq counts at the transcript level to model the transcript specific zero-inflation of counts. It is shown that this approach outperforms previous approaches that applied multinomial distributions to model single cell RNA-seq counts, and also performs better or comparably to existing top performing methods. By taking a Bayesian approach we are able to build interpretable models of expression within clusters, and to quantify uncertainty in cluster assignments. Applied to a publicly available data set of single cell RNA-seq counts of multiple cell types from the mouse cortex and hippocampus, we demonstrate how our approach can be used to distinguish sub-populations of cells as clusters in the data, and to identify gene sets that are indicative of membership of a sub-population. The methodology is implemented as an open source Snakemake pipeline hosted on this repository.

Identifying sub-populations of cells in single cell transcriptomic data – a Bayesian mixture model approach to zero-inflation of counts

---

Features

After running the pipeline, the following graphs are produced allowing the results of clustering by the Bayesian nonparametric zero-inflated negative binomial model to be interpreted: UMAP

All cells are projected onto a 2D plane through UMAP embeddings of their gene expression allowing the clusters assigned by the model to be visualised.

This graph is created under the directory results/plots


Cluster Means

For each cluster, an interactive graph is produced plotting the mean gene expression within that cluster against the mean expression across all clusters. This allows the identification of genes with unusual expression within the discovered sub-populations of cells.

These graphs are created under the directory results/{data}_exp

Average Gene Expression of All Clusters Against Cluster 1

g:Profiler

For the up and down regulated genes within each cluster, the results of gene set enrichment analysis with gprofiler2 are presented through interactive Manhattan plots. This enables the identification of significantly enriched biological functions and pathways from well established sources such as Gene Ontology (GO) and KEGG.

This graph is created under the directory results/plots

---

Running the Code

Dependencies

About Prior Dependencies Installation

Anaconda

Anaconda is a distribution of the Python and R programming languages for scientific computing, that aims to simplify package management and deployment.
  • N/A
Download from the Anaconda site

Singularity

Singularity is a container platform. Within a container, required programs, libraries, data and scripts can be packaged together and distributed to create a reproducable runtime environment.
  • Singularity runs on Linux, though can also be run on Windows and Mac through a virtual machine such as VirtualBox
  • Anaconda
See Singularity Installation

Snakemake

Snakemake is a workflow management system for reproducible and scalable data analysis through the creation of pipelines. The workflow is defined in a ‘Snakefile’ consisting of rules that represent how to create output files from input files.
  • Anaconda
See Snakemake Installation

Instructions

Set Up

1. Download the code from the main brach

  • Either directly from GitHub by pressing Code ➜ Download ZIP

  • Or through a Linux terminal using the command below
(base) user@terminal:~$ wget https://github.com/tt104/scmixture/archive/refs/heads/main.zip

2. Extract the contents of zip file

  • Either locate the file and pressing "Extract"

  • Or if using the terminal, use the command below, where scmixture.zip is the file path of the zip file and ~/new_file_path is the location where you want to save the file
(base) user@terminal:~$ unzip scmixture.zip -d ~/new_file_path

3. Once the project has been unzipped, navigate to the project folder, where scmixture is the file path

(base) user@terminal:~$ cd scmixture

Configuration **5.** Place one or more scRNA-seq csv dataset(s) into folder `scmixture/data`

  • If the folder does not exist, you can create it using the command below
(base) user@terminal:~$ mkdir data

6. Optionally edit config/config.yaml to change the algorithm's settings

# K-means initialisation - no. of clusters
kmeans: 40
# Controls the number of transcripts selected to consider in clustering
filter: 1000
# Burn-in, total number of MCMC chain iterations, thinning out interval and total number of chains
burnin: 500
iter: 1000
thin: 10
chains: 4
# model - nb or mult
model: "nb"

Running the Code **7.** Activate the snakemake environment through conda

(base) user@terminal:~/scmixture$ conda activate snakemake

8. Run the code through the command below:

(snakemake) user@terminal:~/scmixture$ snakemake --use-singularity --cores n --config data=DATASET organism=ORGANISM_ID
  • Replace n in --cores n (or -cn ) with the number of cores you wish to use

  • Replace DATASET with the name of your dataset, ignoring the .csv file extension, e.g. --config data=GBM

  • If running g:Profiler, replace ORGANISM_ID with the g:Profiler id for the data's subject (for example hsapiens for human or mmusculus for mouse), otherwise organism=ORGANISM_ID can be omitted from the command

Code Snippets

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
dir.create(snakemake@output[[1]])

# Fetch cluster means and dispersions
clusterMu<-as.matrix(read.csv(snakemake@input[[1]],header=TRUE,row.names=1))
clusterOmega<-as.matrix(read.csv(snakemake@input[[2]],header=TRUE,row.names=1))

nc<-nrow(clusterMu)
names<-colnames(clusterMu)

avmu<-colMeans(clusterMu)
avom<-colMeans(clusterOmega)
qup<-qnbinom(0.95,size=avom,mu=avmu)
qlo<-qnbinom(0.05,size=avom,mu=avmu)

for(cl in c(1:nc))
{
	clmu<-clusterMu[cl,]
	clom<-clusterOmega[cl,]

	up<-which(clmu>qup)
	down<-which(clmu<qlo)
	upgenes<-names[up]
	downgenes<-names[down]

	if(length(upgenes)!=0)
	{
	write.table(upgenes,paste(snakemake@output[[1]],"/Cluster_",cl,"_up.csv",sep=''),sep=',',row.names=FALSE,col.names=FALSE,quote=FALSE)
	}
	if(length(downgenes)!=0)
	{
	write.table(downgenes,paste(snakemake@output[[1]],"/Cluster_",cl,"_down.csv",sep=''),sep=',',row.names=FALSE,col.names=FALSE,quote=FALSE)
	}
}
R From line 1 of scripts/genes.R
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
library(gprofiler2)
library(htmlwidgets)

# Iterate through up and down genes
for (i in 1:length(snakemake@input[])) {
  genelist <- read.csv(snakemake@input[[i]], header=FALSE)[,1]

  if(length(genelist)>0)
  {
    gostres <- gost(query = genelist ,organism = snakemake@config[["organism"]])
    if(!is.null(gostres))
    {
      p <- gostplot(gostres, interactive = TRUE)
      htmlwidgets::saveWidget(p,snakemake@output[[i]])
    }
    else
    {
      system(paste("echo No enrichment > ",snakemake@output[[i]],sep=''))
    }
  }
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
logsum<-function(a,b)
{
	m<-pmax(a,b)
	log(exp(a-m)+exp(b-m))+m
}

maxlikeNB2<-function(xs,scales)
{
	mu<-NULL
	os<-NULL
	ws<-NULL

	for(j in 1:ncol(xs))
	{
		mask<-which(xs[,j]>0)
		zmask<-which(xs[,j]==0)
		nblikeZI<-function(ps)
		{
			if(length(zmask>0))
			{
				l0<- -sum(logsum(log(ps[3])+dnbinom(xs[zmask,j],size=ps[1],mu=scales[zmask]*ps[2],log=TRUE),rep(log(1-ps[3]),length(zmask))))
			}
			else
			{
				l0<-0.0
			}
			l<- -sum(log(ps[3])+dnbinom(xs[mask,j],size=ps[1],mu=scales[mask]*ps[2],log=TRUE))
			p<- -dgamma(ps[1],shape=1,rate=0.01,log=TRUE)
			l0+l+p
		}
		mlzi<-optim(c(1,mean(xs[,j]),0.5),nblikeZI,method="L-BFGS-B",lower=c(0.001,0.001,0.001),upper=c(Inf,Inf,0.9999))
		mu<-c(mu,mlzi$par[2])
		os<-c(os,mlzi$par[1])
		ws<-c(ws,mlzi$par[3])
	}
	list(mu=mu,os=os,ws=ws)
}

d<-as.matrix(read.csv(snakemake@input[[1]],header=TRUE,row.names=1))

dd<-t(d)

ldata<-log(dd+1)
vs<-apply(ldata,2,var)
cv<-sqrt(exp(vs-1))
ix<-order(cv,decreasing=TRUE)[1:1000]

xs<-t(d)[,ix[1:1000]]

scales<-as.matrix(read.csv(snakemake@input[[2]],header=FALSE))

res<-maxlikeNB2(xs,scales)

ms<-res$mu
os<-res$os
ws<-res$ws

gmulike<-function(ps)
{
	        -sum(dgamma(ms,shape=ps[1],rate=ps[2],log=TRUE))
}

mug<-optim(c(mean(ms),1),gmulike,method="L-BFGS-B",lower=c(0.00001,0.00001),upper=c(Inf,Inf))

gomegalike<-function(ps)
{
	        -sum(dgamma(os,shape=ps[1],rate=ps[2],log=TRUE))
}

omegag<-optim(c(mean(os),1),gomegalike,method="L-BFGS-B",lower=c(0.00001,0.00001),upper=c(Inf,Inf))

hp<-c(mug$par,omegag$par)

write.table(hp,snakemake@output[[1]],row.names=FALSE,col.names=FALSE)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
library(plotly)
library(htmlwidgets)

# Open as cluster means as data.frame
cluster_data <- read.csv(snakemake@input[[1]], header=TRUE)

# Find the mean gene expression for each gene across all clusters
mean_values <- colMeans(cluster_data)
# Get the gene names from the columns
gene_names <- colnames(cluster_data)
# Create list of column numbers representing genes
gene_ids <- c(1:ncol(cluster_data))

create_plot <- function(cluster_num) {
  # Get the gene counts for the selected cluster
  cluster_values <- unname(cluster_data[cluster_num,])
  # Plot the mean gene expression across clusters
  fig <- plot_ly(cluster_data, x = ~gene_names, y = ~mean_values, type = 'bar', 
                 name = 'Mean Expression of All Clusters',
                 hoverlabel = list(namelength = -1),
                 marker = list(color = 'grey'))
  # Plot the gene expression of the selected cluster
  fig <- fig %>% add_markers(y = ~cluster_values, 
                             name = paste('Cluster', cluster_num, 'Mean Expression'),
                             marker = list(
                               color = ~cluster_values,
                               colorscale = 'Jet',
                               line = list(color = 'black', width = 1)))
  # Name the graph and hide the x-axis
  fig <- fig %>% layout(title = paste("Average Gene Expression of All Clusters Against Cluster", cluster_num),
                        xaxis = list(title = "Gene", zeroline = FALSE, showline = FALSE, 
                                     showticklabels = FALSE, showgrid = FALSE),
                        yaxis = list(title = "Gene Expression"),
                        plot_bgcolor='rgb(245, 245, 245)',
                        hovermode = "x unified")
  # Save the graph to HTML
  htmlwidgets::saveWidget(fig, file.path(gsub(" ", "", paste(snakemake@output[[1]], "/Cluster", cluster_num, "Means.html"))))
}

# Create the output directory if it does not already exist
dir.create(file.path(snakemake@output[[1]]), showWarnings = FALSE)

# Make a graph for each cluster
print(paste('Creating mean expression graphs for', nrow(cluster_data), 'clusters'))
for (cluster in 1:nrow(cluster_data)) {
  create_plot(cluster)
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
library(umap)
library(ggplot2)

# Original data
D<-as.matrix(read.csv(snakemake@input[[1]],header=TRUE,row.names=1))

clusters<-as.matrix(read.csv(snakemake@input[[2]],header=TRUE,row.names=1))

uD<-umap(log(t(D)+1))

pdf(snakemake@output[[1]])
udf<-data.frame(x=uD$layout[,1],y=uD$layout[,2],Cluster=as.factor(clusters[,1]))
p<-ggplot(udf)+geom_point(aes(x,y,color=Cluster))+xlab("UMAP 1")+ylab("UMAP 2")
print(p)
dev.off()
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
D<-as.matrix(read.csv(snakemake@input[[1]],header=TRUE,row.names=1))
cellnames<-colnames(D)

# Gene names selected
names<-as.matrix(read.csv(snakemake@input[[5]],header=FALSE))[,1]
ngene<-length(names)

# Re-map clusters
clustersRaw<-as.matrix(read.csv(snakemake@input[[2]],header=FALSE))
clist<-unique(clustersRaw)

nc<-length(clist)

clusters<-rep(0,length(clustersRaw))
clustersDict<-list()

for(i in 1:nc)
{
	t<-clist[i]
	clusters[clustersRaw==t]<-i
	clustersDict[[t]]<-i
}

# Write counts table
write.table(clusters,snakemake@output[[1]],sep=',',row.names=cellnames,col.names=c("Cluster"),quote=FALSE)
write.table(table(clusters),snakemake@output[[2]],sep=',')

# Fetch cluster means and dispersions
clusterMuRaw<-as.matrix(read.csv(snakemake@input[[3]],header=FALSE))
clusterOmegaRaw<-as.matrix(read.csv(snakemake@input[[4]],header=FALSE))


clusterMu<-matrix(0,nrow=nc,ncol=ngene)
colnames(clusterMu)<-names
rownames(clusterMu)<-c(1:nc)

clusterOmega<-matrix(0,nrow=nc,ncol=ngene)
colnames(clusterOmega)<-names
rownames(clusterOmega)<-c(1:nc)

for(i in 1:nc)
{
	t<-clist[i]
	s<-clustersDict[[t]]
	clusterMu[s,]<-clusterMuRaw[t,]
	clusterOmega[s,]<-clusterOmegaRaw[t,]
}

write.table(clusterMu,snakemake@output[[3]],sep=',',quote=FALSE)
write.table(clusterOmega,snakemake@output[[4]],sep=',',quote=FALSE)
R From line 2 of scripts/post.R
1
2
3
4
5
6
7
8
library(scran)

D<-as.matrix(read.csv(snakemake@input[[1]],header=TRUE,row.names=1))

clusters <- quickCluster(D)
sce <- calculateSumFactors(D, clusters=clusters)

write.table(sce,snakemake@output[[1]],sep=',',row.names=FALSE,col.names=FALSE)
36
37
38
39
shell:
	'''
	rm finished.txt
	'''
48
script: "scripts/scales.R"
56
script: "scripts/hyperparameters.R"
70
script: "scripts/scmix.jl"
85
script: "scripts/post.R"
96
script: "scripts/plots.R"
105
106
script: 
	"scripts/genes.R"
113
script: "scripts/gprof.R"
121
script: "scripts/meanplots.R"
ShowHide 12 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/tt104/scmixture
Name: scmixture
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...