RNA-Seq pipeline for bulk, paired-end data

public public 1yr ago 0 bookmarks

rna_seq

This SnakeMake pipeline processes aligns bulk RNA-Seq datasets using STAR and finds differential gene expression between conditions using DESeq2.

1. Prepare you work environment

Code Snippets

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
import pandas as pd


"""Function accepts a STAR output directory and compiles all sample information from ReadsPerGene.out.tab

Args:
    snakemake.input (list): list of globbed wildcards STAR ReadsPerGene.out.tab
    project_title (str): Project title for compiled STAR counts
Returns:
    Compiled STAR counts as tab delimited file.
"""

colnames = snakemake.params.samples
tables = [pd.read_csv(fh, header=None, skiprows=4, usecols=[0,3], index_col=0, sep = '\t',  names = ['Genes',fh.split('/')[-2].split('_')[0]]) for fh in snakemake.input]
joined_table = pd.concat(tables, axis=1)
joined_table.columns = colnames
joined_table_sorted = joined_table.reindex(sorted(joined_table.columns), axis = 1)
joined_table_sorted.to_csv(snakemake.output[0], sep='\t')
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
sink(file(snakemake@log[[1]], open = "wt"), type = "message")

library(DESeq2)
library(ggplot2)
library(tibble)
library(dplyr)
library(viridisLite)
library(pheatmap)
library(yaml)

parallel <- FALSE
if (snakemake@threads > 1) {
	library("BiocParallel")
	register(MulticoreParam(snakemake@threads))
	parallel <- TRUE
}

if (!dir.exists(snakemake@output$outdir)) {dir.create(snakemake@output$outdir)}

# import data -------------------------------------------------------------------------------------\

# import counts
counts = read.table(snakemake@input$counts, sep = "\t", header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)
gene_names = counts |> pull(Genes)

# import metdata
md = read.table(snakemake@input$md, header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)
rownames(md) = md$SampleID

# subset md by config/GROUPS.txt if provided
# allow user to upload a contrast combination file for easier interpretation of the output
if (file.exists(snakemake@config$GROUPS)) {

	# import contrast combination file
	groups = read.table(snakemake@config$GROUPS) |> pull(1)

	# subset conditions in md with the conditions in config/GROUPS.txt
	md = md |> filter(Condition %in% groups)
}

# make the md reflect the available sample names
md = md[md$SampleID %in% colnames(counts),]

# make sure md rownames and counts colnames in the same order.
rownames(md) = md$SampleID
counts = counts[, rownames(md)]
stopifnot(rownames(md) == colnames(counts))

# DESeq2 ------------------------------------------------------------------------------------------

# differential expression
dds = DESeqDataSetFromMatrix(countData = counts, colData = md, design = as.formula("~Condition"))
dds.lrt <- DESeq(dds, test="LRT", reduced=~1, parallel = parallel)
res.lrt <- results(dds.lrt, cooksCutoff = Inf, independentFiltering=FALSE)

# Obtain normalized counts
# vsd = vst(dds.lrt, blind = FALSE)
rld <- rlog(dds.lrt, blind=FALSE)

# visualization -----------------------------------------------------------------------------------

colors_fwd <- viridis(100)
colors_rev <- viridis(100, direction = -1)

### sample-sample distances
sampleDists = dist(t(assay(rld)), diag = TRUE)

sampleDistsMatrix = sampleDists |> as.matrix()
rownames(sampleDistsMatrix) <- rld$SampleID
colnames(sampleDistsMatrix) <- NULL

distance_plot = paste0(snakemake@output$outdir, "/", snakemake@config$PROJECT_ID, "-group-sample-dists.pdf")
pdf(distance_plot, width = 12, height = 12)
pheatmap(sampleDistsMatrix,
	fontsize = 12,
	clustering_distance_rows = sampleDists,
	clustering_distance_cols = sampleDists,
	cluster_rows = TRUE,
	cluster_cols = TRUE,
	col = colors_rev
)
dev.off()

### PCA plot
pca_plot = paste0(snakemake@output$outdir, "/", snakemake@config$PROJECT_ID, "-group-pca.pdf")
pdf(pca_plot, width = 12, height = 12)
plotPCA(rld, intgroup=c("Condition"))
dev.off()

### Gene expression - top 50 genes post DESeq2 normalization
# top_ge = assay(rld)[top_genes,]
top_gene_indices <- head(order(res.lrt$padj), 50)
top_ge = counts(dds.lrt, normalized=TRUE)[top_gene_indices, ]
rownames(top_ge) = gene_names[top_gene_indices]

top_50 = paste0(snakemake@output$outdir, "/", snakemake@config$PROJECT_ID, "-top-50-genes-heatmap.pdf")
pdf(top_50, width = 12, height = 12)
pheatmap(top_ge,
	main = "Top 50 differential gene expression (DESeq2-normalized counts)",
	scale = "row",
	fontsize = 12,
	cluster_rows = TRUE,
	cluster_cols = TRUE
)
dev.off()

### Gene expression - all genes by individual replicates
all_ge = counts(dds.lrt, normalized=TRUE)
rownames(all_ge) = gene_names
all_ge = all_ge |>
	as.data.frame() |>
	distinct() |>
	as.matrix()
all_ge = all_ge[apply(all_ge, 1, function(x) sd(x) != 0), ]

all_genes_heatmap_individual = paste0(snakemake@output$outdir, "/", snakemake@config$PROJECT_ID, "-all-genes-heatmap-individual.pdf")
pdf(all_genes_heatmap_individual, width = 12, height = 12)
pheatmap(all_ge,
	main = "All gene expression (DESeq2-normalized counts) by individual samples",
	scale = "row",
	fontsize = 12,
	cluster_rows = TRUE,
	cluster_cols = TRUE,
	show_rownames = FALSE
)
dev.off()

### Gene expression - all genes by merged replicates

if (snakemake@config$MERGE_REPLICATES) {

	message("Merging replicates...")

	# read YAML file
	merge_scheme = read_yaml(snakemake@config$REPLICATES)
	for (name in names(merge_scheme)) {

		reps = merge_scheme[[name]]

		if (snakemake@config$MERGE_METHOD == "mean") {
			merged_counts = all_ge[, reps] |> apply(MARGIN = 1, FUN = mean) # same as rowMeans(counts[, reps])
		} else if (snakemake@config$MERGE_METHOD == "median") {
			merged_counts = all_ge[, reps] |> apply(MARGIN = 1, FUN = median)
		} else {
			stop("ERROR: MERGE_METHOD must be either 'mean' or 'median'")
		}

		all_ge = all_ge |>
			as.data.frame() |>
			mutate(!!enexpr(name) := merged_counts) |>
			select(!all_of(reps)) |>
			as.matrix()
		}

	# remove rows with sd=0 one more time.
	all_ge = all_ge[apply(all_ge, 1, function(x) sd(x) != 0), ]

	# define output
	all_genes_heatmap_merged = paste0(snakemake@output$outdir, "/", snakemake@config$PROJECT_ID, "-all-genes-heatmap-merged.pdf")
	pdf(all_genes_heatmap_merged, width = 12, height = 12)
	pheatmap(all_ge,
		main = "All gene expression (DESeq2-normalized counts) by merged samples",
		scale = "row",
		fontsize = 12,
		cluster_rows = TRUE,
		cluster_cols = FALSE, # this respects the order of replicates.yaml
		show_rownames = FALSE
	)
	dev.off()

}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
library(DESeq2)

# import data -------------------------------------------------------------------------------------

# import counts
counts = read.table(snakemake@input$counts, header = TRUE, stringsAsFactors = FALSE, check.names = FALSE, row.names = "Genes")

# import metdata
md = read.table(snakemake@input$md, header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)
rownames(md) = md$SampleID

# make the md reflect the available sample names
md = md[md$SampleID %in% colnames(counts),]

# make sure md rownames and counts colnames in the same order.
rownames(md) = md$SampleID
counts = counts[, rownames(md)]
stopifnot(rownames(md) == colnames(counts))

# create DESeq2 objects ---------------------------------------------------------------------------
ddsMat = DESeqDataSetFromMatrix(countData = counts, colData = md, design = as.formula(snakemake@config$MODEL))
dds = DESeq(ddsMat)

message('calculating deseq...')

# Export DESeq2-normalized counts -----------------------------------------------------------------
normCounts = counts(dds, normalized=TRUE)
normCounts = 0.1 + normCounts # ensure nonzero
log2normCounts = log2(normCounts)
log2normCounts = as.data.frame(log2normCounts)
write.table(normCounts, snakemake@output$norm_counts, sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)
write.table(log2normCounts, snakemake@output$log2_norm_counts, sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
sink(file(snakemake@log[[1]], open = "wt"), type = "message")

library(DESeq2)
library(ggplot2)
library(tibble)
library(dplyr)
library(viridisLite)
library(pheatmap)
library(yaml)

parallel <- FALSE
if (snakemake@threads > 1) {
	library("BiocParallel")
	register(MulticoreParam(snakemake@threads))
	parallel <- TRUE
}

c1 = snakemake@params$c1
c2 = snakemake@params$c2
outdir = snakemake@params$outdir
contrast_name = paste0(snakemake@params$c1, '-vs-', snakemake@params$c2)

# import data -------------------------------------------------------------------------------------

# import counts
counts = read.table(snakemake@input$counts, sep = "\t", header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)
gene_names = counts |> pull(Genes)

# import metdata
md = read.table(snakemake@input$md, header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)

# subset and counts matrix by contrast groups
md = md |> filter(Condition %in% c(c1, c2))
rownames(md) = md$SampleID

# make sure md rownames and counts colnames in the same order.
counts = counts[, rownames(md)]
stopifnot(rownames(md) == colnames(counts))

# DESeq2 ------------------------------------------------------------------------------------------

# differential expression
ddsMatrix = DESeqDataSetFromMatrix(countData = counts, colData = md, design = as.formula("~Condition"))
dds <- DESeq(ddsMatrix, parallel = parallel)

# Obtain normalized counts
# vsd = vst(dds.lrt, blind = FALSE)
rld <- rlog(dds, blind=FALSE)

# write outputs -----------------------------------------------------------------------------------

# define outputs
all_file = paste0(outdir, "/", contrast_name, "-all.txt") # contain all gene results for this pw comparison
sig_file = paste0(outdir, "/", contrast_name, "-", snakemake@params$padj, ".txt") # contain significant gene results
pw_pca = paste0(outdir, "/", contrast_name, "-pca.pdf") # pca of a subset of samples
pw_ma = paste0(outdir, "/", contrast_name, "-MA.pdf") # MA of a subset of samples

# extract results from DESeq object
results = results(dds, contrast = c("Condition", c1, c2), cooksCutoff = FALSE)
results_df = results |>
	as.data.frame() |>
	rownames_to_column("Genes") |>
	arrange(padj)
all_sig = results_df |> filter(padj <= snakemake@params$padj)

# export tables
write.table(results_df, all_file, quote = FALSE, sep = "\t", row.names = FALSE)
write.table(all_sig, sig_file, quote = FALSE, sep = "\t", row.names = FALSE)

# output PCA plot
pdf(pw_pca, width = 12, height = 12)
plotPCA(rld, intgroup=c("Condition"))
dev.off()

# plot MA
pdf(pw_ma, width = 12, height = 12)
plotMA(results, intgroup=c("Condition"))
dev.off()
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
import yaml
import pandas as pd
import os

# identify the annotation provided by the STAR index (user-defined)
# filter_col will be used for filtering columns later on.
if snakemake.config["ANNOTATION_STYLE"] == "Genes":
	filter_col = "Gene name lower"
elif snakemake.config["ANNOTATION_STYLE"] == "UCSC":
	filter_col = "UCSC Stable ID"
elif snakemake.config["ANNOTATION_STYLE"] == "ENSEMBL_ID":
	filter_col = "Gene stable ID"
elif snakemake.config["ANNOTATION_STYLE"] == "REFSEQ":
	filter_col = "RefSeq mRNA ID"
elif snakemake.config["ANNOTATION_STYLE"] == "NCBI":
	filter_col = "NCBI gene (formerly Entrezgene) ID"
else:
	print("ERROR: ANNOTATION_STYLE is undefined. Please select from one of the following options: ['Genes', 'UCSC', 'ENSEMBL_ID', 'REFSEQ', 'NCBI']")
	os.exit()

# annotation file -------------------------------------------------------------

# open annotation file
if snakemake.input["annotation"]:
	annotation = pd.read_csv(snakemake.config["ANNOTATION"], sep = "\t")
else:
	print("ERROR: ANNOTATION key in config.yaml is undefined")
	os.exit()

# edit content of the annotation file

# turn gene type and names to lower-case to ensure string matching downstream.
annotation["Gene type lower"] = annotation["Gene type"].transform(lambda x: str.lower(x))
annotation["Gene name lower"] = annotation["Gene name"].transform(lambda x: str(x).lower())

# import biotypes. depending on their values, get them into a list.

biotypes = snakemake.config["BIOTYPES"]
biotype_type = type(biotypes)

if biotype_type == str:
	biotypes = [ snakemake.config["BIOTYPES"] ]
elif biotype_type == list:
	biotypes = snakemake.config["BIOTYPES"]
else:
	print("ERROR: BIOTYPES must be a single entry <string> or multiple entries <list of strings>")

biotypes = [ i.lower() for i in biotypes ]

# filter annotation file contents by biotype
annotation = annotation[annotation["Gene type lower"].isin(biotypes)]
filter_features = list(annotation[filter_col]) # use these features to filter counts table

# open counts file ------------------------------------------------------------

# import counts
counts = pd.read_csv(snakemake.input["counts"], sep = "\t")

print("PREFILTER: Counts table rows = {}".format(counts.shape[0]))
print("PREFILTER: Counts table columns = {}".format(counts.shape[1]))

# filter features
counts["Gene lower"] = counts["Genes"].transform(lambda x: str.lower(x))
counts = counts[counts["Gene lower"].isin(filter_features)]

print("POSTFILTER: Counts table rows = {}".format(counts.shape[0]))
print("POSTFILTER: Counts table columns = {}".format(counts.shape[1] - 1)) # minus one to adjust for lower-case gene names col

counts = counts.drop("Gene lower", axis = 1)

# export filtered counts
counts.to_csv(str(snakemake.output), sep = "\t", index = False)
104
105
106
107
108
109
110
111
112
113
shell:
	"fastp "
	"-i {input.r1} "
	"-I {input.r2} "
	"-o {output.r1} "
	"-O {output.r2} "
	"--detect_adapter_for_pe "
	"--thread {threads} "
	"-j {log} "
	"-h /dev/null"
125
126
shell:
	"fastqc -t {threads} --outdir data/fastqc {input} > {log} 2>&1"
139
140
shell:
	"fastq_screen --aligner bowtie2 --threads {threads} --outdir data/fastq_screen --conf {input.config} --force {input.fastq} > {log} 2>&1"
158
159
160
161
162
163
164
165
166
167
168
169
170
shell:
	"STAR "
	"--runThreadN {threads} "
	"--runMode alignReads "
	"--genomeDir {params.genome_index} "
	"--readFilesIn {input.fwd} {input.rev} "
	"--outFileNamePrefix data/star/{wildcards.sample}_bam/ "
	"--sjdbGTFfile {params.gtf} "
	"--quantMode GeneCounts "
	"--sjdbGTFtagExonParentGene gene_name "
	"--outSAMtype BAM SortedByCoordinate "
	"--readFilesCommand zcat "
	"--twopassMode Basic"
180
181
shell:
	"samtools index -@ {threads} {input}"
195
196
shell:
	"preseq c_curve -B {resources.defect_mode} -l 1000000000 -P -o {output} {input} > {log} 2>&1"
210
211
shell:
	"preseq lc_extrap -B {resources.defect_mode} -l 1000000000 -P -e 1000000000 -o {output} {input} > {log} 2>&1"
222
223
shell:
	"bamCoverage -b {input[0]} -o {output} -p {threads} --normalizeUsing CPM --binSize 10 --smoothLength 50"
239
240
241
shell:
	"if [ '{params.singularity}' == 'true' ]; then export LC_ALL=C.UTF-8; export LANG=C.UTF-8; fi && "
	"multiqc data -f --ignore data/tmp -o data/multiqc 2>&1"
252
253
script:
	"scripts/compile_star_counts.py"
SnakeMake From line 252 of main/Snakefile
261
262
script:
	"scripts/filter_counts.py"
SnakeMake From line 261 of main/Snakefile
276
277
script:
	"scripts/deseq2-norm.R"
SnakeMake From line 276 of main/Snakefile
299
300
script:
	"scripts/deseq2-pairwise.R"
SnakeMake From line 299 of main/Snakefile
317
318
script:
	"scripts/deseq2-group.R"
SnakeMake From line 317 of main/Snakefile
ShowHide 9 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/maxsonBraunLab/rna_seq
Name: rna_seq
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...