RNA-Seq pipeline for bulk, paired-end data

public 1yr ago 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

rna_seq

This SnakeMake pipeline processes aligns bulk RNA-Seq datasets using STAR and finds differential gene expression between conditions using DESeq2.

1. Prepare you work environment

Code Snippets

import pandas as pd


"""Function accepts a STAR output directory and compiles all sample information from ReadsPerGene.out.tab

Args:
    snakemake.input (list): list of globbed wildcards STAR ReadsPerGene.out.tab
    project_title (str): Project title for compiled STAR counts
Returns:
    Compiled STAR counts as tab delimited file.
"""

colnames = snakemake.params.samples
tables = [pd.read_csv(fh, header=None, skiprows=4, usecols=[0,3], index_col=0, sep = '\t',  names = ['Genes',fh.split('/')[-2].split('_')[0]]) for fh in snakemake.input]
joined_table = pd.concat(tables, axis=1)
joined_table.columns = colnames
joined_table_sorted = joined_table.reindex(sorted(joined_table.columns), axis = 1)
joined_table_sorted.to_csv(snakemake.output[0], sep='\t')

Python Pandas STAR From line 1 of scripts/compile_star_counts.py

sink(file(snakemake@log[[1]], open = "wt"), type = "message")

library(DESeq2)
library(ggplot2)
library(tibble)
library(dplyr)
library(viridisLite)
library(pheatmap)
library(yaml)

parallel <- FALSE
if (snakemake@threads > 1) {
	library("BiocParallel")
	register(MulticoreParam(snakemake@threads))
	parallel <- TRUE
}

if (!dir.exists(snakemake@output$outdir)) {dir.create(snakemake@output$outdir)}

# import data -------------------------------------------------------------------------------------\

# import counts
counts = read.table(snakemake@input$counts, sep = "\t", header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)
gene_names = counts |> pull(Genes)

# import metdata
md = read.table(snakemake@input$md, header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)
rownames(md) = md$SampleID

# subset md by config/GROUPS.txt if provided
# allow user to upload a contrast combination file for easier interpretation of the output
if (file.exists(snakemake@config$GROUPS)) {

	# import contrast combination file
	groups = read.table(snakemake@config$GROUPS) |> pull(1)

	# subset conditions in md with the conditions in config/GROUPS.txt
	md = md |> filter(Condition %in% groups)
}

# make the md reflect the available sample names
md = md[md$SampleID %in% colnames(counts),]

# make sure md rownames and counts colnames in the same order.
rownames(md) = md$SampleID
counts = counts[, rownames(md)]
stopifnot(rownames(md) == colnames(counts))

# DESeq2 ------------------------------------------------------------------------------------------

# differential expression
dds = DESeqDataSetFromMatrix(countData = counts, colData = md, design = as.formula("~Condition"))
dds.lrt <- DESeq(dds, test="LRT", reduced=~1, parallel = parallel)
res.lrt <- results(dds.lrt, cooksCutoff = Inf, independentFiltering=FALSE)

# Obtain normalized counts
# vsd = vst(dds.lrt, blind = FALSE)
rld <- rlog(dds.lrt, blind=FALSE)

# visualization -----------------------------------------------------------------------------------

colors_fwd <- viridis(100)
colors_rev <- viridis(100, direction = -1)

### sample-sample distances
sampleDists = dist(t(assay(rld)), diag = TRUE)

sampleDistsMatrix = sampleDists |> as.matrix()
rownames(sampleDistsMatrix) <- rld$SampleID
colnames(sampleDistsMatrix) <- NULL

distance_plot = paste0(snakemake@output$outdir, "/", snakemake@config$PROJECT_ID, "-group-sample-dists.pdf")
pdf(distance_plot, width = 12, height = 12)
pheatmap(sampleDistsMatrix,
	fontsize = 12,
	clustering_distance_rows = sampleDists,
	clustering_distance_cols = sampleDists,
	cluster_rows = TRUE,
	cluster_cols = TRUE,
	col = colors_rev
)
dev.off()

### PCA plot
pca_plot = paste0(snakemake@output$outdir, "/", snakemake@config$PROJECT_ID, "-group-pca.pdf")
pdf(pca_plot, width = 12, height = 12)
plotPCA(rld, intgroup=c("Condition"))
dev.off()

### Gene expression - top 50 genes post DESeq2 normalization
# top_ge = assay(rld)[top_genes,]
top_gene_indices <- head(order(res.lrt$padj), 50)
top_ge = counts(dds.lrt, normalized=TRUE)[top_gene_indices, ]
rownames(top_ge) = gene_names[top_gene_indices]

top_50 = paste0(snakemake@output$outdir, "/", snakemake@config$PROJECT_ID, "-top-50-genes-heatmap.pdf")
pdf(top_50, width = 12, height = 12)
pheatmap(top_ge,
	main = "Top 50 differential gene expression (DESeq2-normalized counts)",
	scale = "row",
	fontsize = 12,
	cluster_rows = TRUE,
	cluster_cols = TRUE
)
dev.off()

### Gene expression - all genes by individual replicates
all_ge = counts(dds.lrt, normalized=TRUE)
rownames(all_ge) = gene_names
all_ge = all_ge |>
	as.data.frame() |>
	distinct() |>
	as.matrix()
all_ge = all_ge[apply(all_ge, 1, function(x) sd(x) != 0), ]

all_genes_heatmap_individual = paste0(snakemake@output$outdir, "/", snakemake@config$PROJECT_ID, "-all-genes-heatmap-individual.pdf")
pdf(all_genes_heatmap_individual, width = 12, height = 12)
pheatmap(all_ge,
	main = "All gene expression (DESeq2-normalized counts) by individual samples",
	scale = "row",
	fontsize = 12,
	cluster_rows = TRUE,
	cluster_cols = TRUE,
	show_rownames = FALSE
)
dev.off()

### Gene expression - all genes by merged replicates

if (snakemake@config$MERGE_REPLICATES) {

	message("Merging replicates...")

	# read YAML file
	merge_scheme = read_yaml(snakemake@config$REPLICATES)
	for (name in names(merge_scheme)) {

		reps = merge_scheme[[name]]

		if (snakemake@config$MERGE_METHOD == "mean") {
			merged_counts = all_ge[, reps] |> apply(MARGIN = 1, FUN = mean) # same as rowMeans(counts[, reps])
		} else if (snakemake@config$MERGE_METHOD == "median") {
			merged_counts = all_ge[, reps] |> apply(MARGIN = 1, FUN = median)
		} else {
			stop("ERROR: MERGE_METHOD must be either 'mean' or 'median'")
		}

		all_ge = all_ge |>
			as.data.frame() |>
			mutate(!!enexpr(name) := merged_counts) |>
			select(!all_of(reps)) |>
			as.matrix()
		}

	# remove rows with sd=0 one more time.
	all_ge = all_ge[apply(all_ge, 1, function(x) sd(x) != 0), ]

	# define output
	all_genes_heatmap_merged = paste0(snakemake@output$outdir, "/", snakemake@config$PROJECT_ID, "-all-genes-heatmap-merged.pdf")
	pdf(all_genes_heatmap_merged, width = 12, height = 12)
	pheatmap(all_ge,
		main = "All gene expression (DESeq2-normalized counts) by merged samples",
		scale = "row",
		fontsize = 12,
		cluster_rows = TRUE,
		cluster_cols = FALSE, # this respects the order of replicates.yaml
		show_rownames = FALSE
	)
	dev.off()

}

R ggplot2 dplyr DESeq2 BiocParallel PCAtools pheatmap tibble yaml viridisLite From line 1 of scripts/deseq2-group.R

library(DESeq2)

# import data -------------------------------------------------------------------------------------

# import counts
counts = read.table(snakemake@input$counts, header = TRUE, stringsAsFactors = FALSE, check.names = FALSE, row.names = "Genes")

# import metdata
md = read.table(snakemake@input$md, header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)
rownames(md) = md$SampleID

# make the md reflect the available sample names
md = md[md$SampleID %in% colnames(counts),]

# make sure md rownames and counts colnames in the same order.
rownames(md) = md$SampleID
counts = counts[, rownames(md)]
stopifnot(rownames(md) == colnames(counts))

# create DESeq2 objects ---------------------------------------------------------------------------
ddsMat = DESeqDataSetFromMatrix(countData = counts, colData = md, design = as.formula(snakemake@config$MODEL))
dds = DESeq(ddsMat)

message('calculating deseq...')

# Export DESeq2-normalized counts -----------------------------------------------------------------
normCounts = counts(dds, normalized=TRUE)
normCounts = 0.1 + normCounts # ensure nonzero
log2normCounts = log2(normCounts)
log2normCounts = as.data.frame(log2normCounts)
write.table(normCounts, snakemake@output$norm_counts, sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)
write.table(log2normCounts, snakemake@output$log2_norm_counts, sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)

R DESeq2 From line 1 of scripts/deseq2-norm.R

sink(file(snakemake@log[[1]], open = "wt"), type = "message")

library(DESeq2)
library(ggplot2)
library(tibble)
library(dplyr)
library(viridisLite)
library(pheatmap)
library(yaml)

parallel <- FALSE
if (snakemake@threads > 1) {
	library("BiocParallel")
	register(MulticoreParam(snakemake@threads))
	parallel <- TRUE
}

c1 = snakemake@params$c1
c2 = snakemake@params$c2
outdir = snakemake@params$outdir
contrast_name = paste0(snakemake@params$c1, '-vs-', snakemake@params$c2)

# import data -------------------------------------------------------------------------------------

# import counts
counts = read.table(snakemake@input$counts, sep = "\t", header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)
gene_names = counts |> pull(Genes)

# import metdata
md = read.table(snakemake@input$md, header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)

# subset and counts matrix by contrast groups
md = md |> filter(Condition %in% c(c1, c2))
rownames(md) = md$SampleID

# make sure md rownames and counts colnames in the same order.
counts = counts[, rownames(md)]
stopifnot(rownames(md) == colnames(counts))

# DESeq2 ------------------------------------------------------------------------------------------

# differential expression
ddsMatrix = DESeqDataSetFromMatrix(countData = counts, colData = md, design = as.formula("~Condition"))
dds <- DESeq(ddsMatrix, parallel = parallel)

# Obtain normalized counts
# vsd = vst(dds.lrt, blind = FALSE)
rld <- rlog(dds, blind=FALSE)

# write outputs -----------------------------------------------------------------------------------

# define outputs
all_file = paste0(outdir, "/", contrast_name, "-all.txt") # contain all gene results for this pw comparison
sig_file = paste0(outdir, "/", contrast_name, "-", snakemake@params$padj, ".txt") # contain significant gene results
pw_pca = paste0(outdir, "/", contrast_name, "-pca.pdf") # pca of a subset of samples
pw_ma = paste0(outdir, "/", contrast_name, "-MA.pdf") # MA of a subset of samples

# extract results from DESeq object
results = results(dds, contrast = c("Condition", c1, c2), cooksCutoff = FALSE)
results_df = results |>
	as.data.frame() |>
	rownames_to_column("Genes") |>
	arrange(padj)
all_sig = results_df |> filter(padj <= snakemake@params$padj)

# export tables
write.table(results_df, all_file, quote = FALSE, sep = "\t", row.names = FALSE)
write.table(all_sig, sig_file, quote = FALSE, sep = "\t", row.names = FALSE)

# output PCA plot
pdf(pw_pca, width = 12, height = 12)
plotPCA(rld, intgroup=c("Condition"))
dev.off()

# plot MA
pdf(pw_ma, width = 12, height = 12)
plotMA(results, intgroup=c("Condition"))
dev.off()

R ggplot2 dplyr DESeq2 BiocParallel PCAtools tibble yaml viridisLite From line 1 of scripts/deseq2-pairwise.R

import yaml
import pandas as pd
import os

# identify the annotation provided by the STAR index (user-defined)
# filter_col will be used for filtering columns later on.
if snakemake.config["ANNOTATION_STYLE"] == "Genes":
	filter_col = "Gene name lower"
elif snakemake.config["ANNOTATION_STYLE"] == "UCSC":
	filter_col = "UCSC Stable ID"
elif snakemake.config["ANNOTATION_STYLE"] == "ENSEMBL_ID":
	filter_col = "Gene stable ID"
elif snakemake.config["ANNOTATION_STYLE"] == "REFSEQ":
	filter_col = "RefSeq mRNA ID"
elif snakemake.config["ANNOTATION_STYLE"] == "NCBI":
	filter_col = "NCBI gene (formerly Entrezgene) ID"
else:
	print("ERROR: ANNOTATION_STYLE is undefined. Please select from one of the following options: ['Genes', 'UCSC', 'ENSEMBL_ID', 'REFSEQ', 'NCBI']")
	os.exit()

# annotation file -------------------------------------------------------------

# open annotation file
if snakemake.input["annotation"]:
	annotation = pd.read_csv(snakemake.config["ANNOTATION"], sep = "\t")
else:
	print("ERROR: ANNOTATION key in config.yaml is undefined")
	os.exit()

# edit content of the annotation file

# turn gene type and names to lower-case to ensure string matching downstream.
annotation["Gene type lower"] = annotation["Gene type"].transform(lambda x: str.lower(x))
annotation["Gene name lower"] = annotation["Gene name"].transform(lambda x: str(x).lower())

# import biotypes. depending on their values, get them into a list.

biotypes = snakemake.config["BIOTYPES"]
biotype_type = type(biotypes)

if biotype_type == str:
	biotypes = [ snakemake.config["BIOTYPES"] ]
elif biotype_type == list:
	biotypes = snakemake.config["BIOTYPES"]
else:
	print("ERROR: BIOTYPES must be a single entry <string> or multiple entries <list of strings>")

biotypes = [ i.lower() for i in biotypes ]

# filter annotation file contents by biotype
annotation = annotation[annotation["Gene type lower"].isin(biotypes)]
filter_features = list(annotation[filter_col]) # use these features to filter counts table

# open counts file ------------------------------------------------------------

# import counts
counts = pd.read_csv(snakemake.input["counts"], sep = "\t")

print("PREFILTER: Counts table rows = {}".format(counts.shape[0]))
print("PREFILTER: Counts table columns = {}".format(counts.shape[1]))

# filter features
counts["Gene lower"] = counts["Genes"].transform(lambda x: str.lower(x))
counts = counts[counts["Gene lower"].isin(filter_features)]

print("POSTFILTER: Counts table rows = {}".format(counts.shape[0]))
print("POSTFILTER: Counts table columns = {}".format(counts.shape[1] - 1)) # minus one to adjust for lower-case gene names col

counts = counts.drop("Gene lower", axis = 1)

# export filtered counts
counts.to_csv(str(snakemake.output), sep = "\t", index = False)

Python Pandas STAR PyYAML From line 1 of scripts/filter_counts.py

shell:
	"fastp "
	"-i {input.r1} "
	"-I {input.r2} "
	"-o {output.r1} "
	"-O {output.r2} "
	"--detect_adapter_for_pe "
	"--thread {threads} "
	"-j {log} "
	"-h /dev/null"

SnakeMake fastp From line 104 of main/Snakefile

shell:
	"fastqc -t {threads} --outdir data/fastqc {input} > {log} 2>&1"

SnakeMake FastQC From line 125 of main/Snakefile

shell:
	"fastq_screen --aligner bowtie2 --threads {threads} --outdir data/fastq_screen --conf {input.config} --force {input.fastq} > {log} 2>&1"

SnakeMake Bowtie 2 From line 139 of main/Snakefile

shell:
	"STAR "
	"--runThreadN {threads} "
	"--runMode alignReads "
	"--genomeDir {params.genome_index} "
	"--readFilesIn {input.fwd} {input.rev} "
	"--outFileNamePrefix data/star/{wildcards.sample}_bam/ "
	"--sjdbGTFfile {params.gtf} "
	"--quantMode GeneCounts "
	"--sjdbGTFtagExonParentGene gene_name "
	"--outSAMtype BAM SortedByCoordinate "
	"--readFilesCommand zcat "
	"--twopassMode Basic"

SnakeMake STAR From line 158 of main/Snakefile

shell:
	"samtools index -@ {threads} {input}"

SnakeMake SAMtools From line 180 of main/Snakefile

shell:
	"preseq c_curve -B {resources.defect_mode} -l 1000000000 -P -o {output} {input} > {log} 2>&1"

SnakeMake preseq From line 195 of main/Snakefile

shell:
	"preseq lc_extrap -B {resources.defect_mode} -l 1000000000 -P -e 1000000000 -o {output} {input} > {log} 2>&1"

SnakeMake preseq From line 210 of main/Snakefile

shell:
	"bamCoverage -b {input[0]} -o {output} -p {threads} --normalizeUsing CPM --binSize 10 --smoothLength 50"

SnakeMake DeepTools From line 222 of main/Snakefile

shell:
	"if [ '{params.singularity}' == 'true' ]; then export LC_ALL=C.UTF-8; export LANG=C.UTF-8; fi && "
	"multiqc data -f --ignore data/tmp -o data/multiqc 2>&1"

SnakeMake MultiQC From line 239 of main/Snakefile

script:
	"scripts/compile_star_counts.py"

SnakeMake From line 252 of main/Snakefile

script:
	"scripts/filter_counts.py"

SnakeMake From line 261 of main/Snakefile

script:
	"scripts/deseq2-norm.R"

SnakeMake From line 276 of main/Snakefile

script:
	"scripts/deseq2-pairwise.R"

SnakeMake From line 299 of main/Snakefile

script:
	"scripts/deseq2-group.R"

SnakeMake From line 317 of main/Snakefile

ShowHide 9 more snippets with no or duplicated tags.

Comments

Support

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Created: 1yr ago

Updated: 1yr ago

Maitainers: public

URL: https://github.com/maxsonBraunLab/rna_seq

Name: rna_seq

Version: 1

Badge:

Insert copied code into your website to add a link to this workflow.

License: MIT License

Future updates

Related Workflows

psychip_snakemake — Show Details View Workflow

ENCODE pipeline for histone marks developed for the psychENCODE project

public

psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project. The o...

raw sequence reads Alignment Sequence alignment report macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

Near-real time tracking of SARS-CoV-2 in Connecticut

public

Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

JSON nextclade Augur Biopython FOCUS Pandas Snakemake bs4 epiweeks geopy matplotlib numpy pycountry pycountry-convert uszipcode

Free

cellranger-snakemake-gke — Show Details View Workflow

snakemake workflow to run cellranger on a given bucket using gke.

public

A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

ATLAS - Three commands to start analyzing your metagenome data

public

Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...

raw sequence reads Genome assembly Annotation track checkm2 gunc prodigal snakemake-wrapper-utils MEGAHIT Atlas BBMap Biopython BioRuby Bwa-mem2 cd-hit CheckM DAS Diamond eggNOG-mapper v2 MetaBAT 2 Minimap2 MMseqs MultiQC Pandas Picard pyfastx SAMtools SemiBin Snakemake SPAdes SqueezeMeta TADpole VAMB CONCOCT ete3 gtdbtk h5py networkx numpy plotly psutil utils metagenomics

Free

175

rna-seq-star-deseq2 — Show Details View Workflow

RNA-seq workflow using STAR and DESeq2

public

This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...

Free

dna-seq-gatk-variant-calling — Show Details View Workflow

This Snakemake pipeline implements the GATK best-practices workflow

public

This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...

VCF raw sequence reads Variant calling genetic variants gatk rust-bio-tools snakemake-wrapper-utils tabix BCFtools BWA FastQC MultiQC Pandas Picard SAMtools Snakemake Trimmomatic Variant Effect Predictor (VEP) common matplotlib numpy seaborn DNA

Free