Processing sorted cells from BC1 that have been bulkrnaseq'd
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Simple workflow to quantify gene-level RNA abundance and detect differentially expressed genes (DEGs)
from bulk RNAseq samples. The pipeline uses
kallisto
or
salmon
to quantify transcript level abundance and
DESeq2
to normalize counts and detect DEGs.
Installation
-
Install Anaconda or Miniconda and then
conda install snakemake
interactive Sherlock session:
sdev
ml python/3.9.0
pip3 install --user snakemake
pip3 install --user mamba
-
Download the appropriate
kallisto
orsalmon
references or build your own -
Clone the repository
-
Describe your samples in
samples.csv
-
Modify the settings in
config.yaml
-
(Optional) If you plan on using a SLURM cluster, fill out the
#SBATCH
directives inrun_pipeline.sh
and theout
andaccount
fields incluster.json
-
(Optional) If you want to run the pipeline in a Singularity environment for full reproducibility, install Singularity.
run_pipeline
assumes Singularity is installed. Delete the--use-singularity
flag if you want to skip using a Singularity environment. The same goes for usingconda
environments, although it is recommended to use both for the best reproducibility.
Sample File Configuration
bulk-rnaseq
requires you specify a CSV file with a metadata about samples to analyze.
There are several required columns:
-
patient
-
condition
-
fq1
-
fq2
-
strandedness
(required bykallisto
; optional if usingsalmon
)
The
patient
and
condition
columns are used to create a sample specific ID for downstream processing and identification.
The values specified in the these columns should create unique values when combined as
{patient}-{condition}
.
condition
should specify the experimental condition of the sample (
normal
or
tumor
,
primary
or
metastatic
etc).
NOTE
salmon
will automatically infer the library stranding and choose the best option.
kallisto
requires you specify
the stranding info. See
check-strand
to determine the proper stranding if you prefer to
run
kallisto
.
Running the pipeline
Run jobs locally
snakemake -j [cores] --use-singularity --use-conda
Run jobs on a SLURM cluster
sbatch run_pipeline.sh [path to directory with Snakefile]
Output
Results are stored in
results/
and include:
-
normalized_counts.rds
- DESeq2 object with VST normalized log2 counts. Useful for downstream visualization, clustering, and machine learning -
pca_plot.svg
- Plot of the samples described by the first two principal components. Useful to check for batch effects and identify bad samples
Each contrast has its own folder in
results/
with the following files:
-
mle_foldchanges.csv
- List of log2 fold changes estimated by DESeq2 using maximum likelihood estimation (MLE) -
map_foldchanges.csv
- List of log2 fold changes estimated by DESeq2 via shrinkage using theapeglm
package -
mle_ma.svg
- MA plot showing the MLE fold changes -
map_ma.svg
- MA plot showing the MAP fold changes
The full DESeq2 object used for differential expression analysis can be found in
deseq2/
.
A MultiQC html report with
kallisto
statistics for each sample is located in
qc/
.
Check this to verify a reasonable proportion of reads were pseudoaligned.
Code Snippets
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(tximport) library(DESeq2) library(readr) library(stringr) res_dirs <- snakemake@input[['cts']] tx2g_fp <- snakemake@input[['tx2g']] samples_fp <- snakemake@input[['samples']] quant_program <- snakemake@params[['aligner']] design_formula <- snakemake@params[['formula']] if (snakemake@threads > 1) { library("BiocParallel") parallel <- TRUE register(MulticoreParam(snakemake@threads)) } else { parallel <- FALSE } if (quant_program == 'kallisto') { files <- file.path(res_dirs, "abundance.h5") } else { files <- file.path(res_dirs, "quant.sf") } names(files) <- basename(dirname(files)) tx2g <- read_table2(tx2g_fp, col_names = c("tx", "ensgene", "symbol")) txi <- tximport(files, type = quant_program, txOut = FALSE, tx2g = tx2g[, 1:2]) # txi <- tximport(files, type = "kallisto", txOut = FALSE, tx2g = tx2g[, 1:2]) samples <- read.csv(samples_fp) print("First") print(samples) samples$id <- paste(samples$patient, "-", samples$condition, sep = "") print("Second") print(samples) # Reorder rows so they match files order samples <- samples[match(names(files), samples$id),] print("Third") print(files) print(samples) ## Ensure factor ordering based on config specifications vars <- snakemake@params[['levels']] var_levels <- str_split(vars, ';', simplify=T) for (var in var_levels) { print(paste("Variable:", var)) s <- str_split(var, '=|,', simplify=T) col <- s[1, 1] level_order = s[1, 2:dim(s)[2]] samples[, col] <- factor(samples[, col], level_order) } f <- as.formula(design_formula) ddsTxi <- DESeqDataSetFromTximport(txi, colData = samples, design = f) keep <- rowSums(counts(ddsTxi)) >= 1 ddsTxi <- ddsTxi[keep, ] dds <- DESeq(ddsTxi, parallel=parallel) print(dds) vst_cts <- vst(dds, blind=FALSE) print(vst_cts) saveRDS(dds, file=snakemake@output[['deseq']]) saveRDS(vst_cts, file=snakemake@output[['cts']]) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(DESeq2) library(IHW) library(ggplot2) library(dplyr) library(stringr) if (snakemake@threads > 1) { library("BiocParallel") parallel <- TRUE register(MulticoreParam(snakemake@threads)) } else { parallel <- FALSE } t2g_fp <- snakemake@input[[2]] t2g <- readr::read_table2(t2g_fp, col_names = c("tx", "ensgene", "symbol")) t2g <- t2g %>% dplyr::distinct(ensgene, symbol) dds <- readRDS(snakemake@input[[1]]) print(dds) # Grab last variable at end of formula for contrasts design_formula <- snakemake@params[['formula']] s <- str_remove_all(design_formula, " |~") s <- str_split(s, "\\+") vars <- s[[1]] var <- vars[length(vars)] print(snakemake@params[['contrast']]) print("Creating results for the following variable:") print(var) print(resultsNames(dds)) contrast_coef <- paste(c(var, snakemake@params[['contrast']][1], "vs", snakemake@params[['contrast']][2]), collapse="_") de_contrast <- c(var, snakemake@params[['contrast']][1], snakemake@params[['contrast']][2]) mle_res <- results(dds, contrast=de_contrast, filterFun=ihw, alpha = .05, parallel = parallel) map_res <- lfcShrink(dds, coef=contrast_coef, type = "apeglm", parallel = parallel) print("MLE LFC:") print(mle_res) print(summary(mle_res)) print("MAP LFC:") print(map_res) print(summary(map_res)) mle_df <- mle_res %>% data.frame() %>% tibble::rownames_to_column(var = "ensgene") %>% as_tibble() %>% left_join(t2g) %>% dplyr::select(symbol, ensgene, everything()) %>% dplyr::arrange(padj) %>% dplyr::distinct(symbol, .keep_all = TRUE) map_df <- map_res %>% data.frame() %>% tibble::rownames_to_column(var = "ensgene") %>% as_tibble() %>% left_join(t2g) %>% dplyr::select(symbol, ensgene, everything()) %>% dplyr::arrange(padj) %>% dplyr::distinct(symbol, .keep_all = TRUE) print("MLE dataframe") print(mle_df) print("MAP dataframe") print(map_df) mleplot <- mle_df %>% dplyr::mutate( significant = ifelse(padj < .05, "padj < .05", "padj >= .05"), direction = ifelse(log2FoldChange > 0, "Upregulated", "Downregulated") ) %>% ggplot(aes(log10(baseMean),log2FoldChange)) + geom_point(aes(color = significant, shape = direction)) + geom_hline(yintercept = 0, linetype = "dashed", color = "red") + labs(x = "log10 Expression", y = "MLE Log2FoldChange") + theme_bw() mapplot <- map_df %>% dplyr::mutate( significant = ifelse(padj < .05, "padj < .05", "padj >= .05"), direction = ifelse(log2FoldChange > 0, "Upregulated", "Downregulated") ) %>% ggplot(aes(log10(baseMean),log2FoldChange)) + geom_point(aes(color = significant, shape = direction)) + geom_hline(yintercept = 0, linetype = "dashed", color = "red") + labs(x = "log10 Expression", y = "MAP Log2FoldChange") + theme_bw() readr::write_csv(mle_df, snakemake@output[['mleres']]) readr::write_csv(map_df, snakemake@output[['mapres']]) ggsave(snakemake@output[['mlema']], plot = mleplot, width = 10, height = 7) ggsave(snakemake@output[['mapma']], plot = mapplot, width = 10, height = 7) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(DESeq2) library(ggplot2) library(stringr) vsd <- readRDS(snakemake@input[[1]]) print(vsd) label_vars <- str_split(snakemake@params[['label_vars']], ',', simplify=T)[1, ] pcaplot <- plotPCA(vsd, intgroup=label_vars) ggsave(snakemake@output[[1]], pcaplot) |
120 121 122 123 124 125 | shell: """ kallisto quant -i {input.idx} -o {output} -t {threads} \ {params.strand} {params.single} {params.frag_length} {params.frag_sd} \ {params.fqs} &> >(tee {log}) """ |
140 141 142 143 | shell: """ salmon quant -i {input.idx} -l A {params.fqs} -p {threads} -o {output} """ |
162 163 | script: "scripts/deseq2.R" |
182 183 | script: "scripts/diffexp.R" |
196 197 | wrapper: "0.50.4/bio/multiqc" |
210 211 | script: "scripts/plot_pca.R" |
222 223 224 225 226 227 228 229 | shell: """ tmpdir=qc/fastqc/.{wildcards.sample_id}.tmp mkdir $tmpdir mkdir {output} fastqc {input} -o {output} &> >(tee {log}) -d $tmpdir rm -r $tmpdir """ |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "[email protected]" __license__ = "MIT" from os import path from snakemake.shell import shell input_dirs = set(path.dirname(fp) for fp in snakemake.input) output_dir = path.dirname(snakemake.output[0]) output_name = path.basename(snakemake.output[0]) log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "multiqc" " {snakemake.params}" " --force" " -o {output_dir}" " -n {output_name}" " {input_dirs}" " {log}" ) |
Support
- Future updates