Workflow to process bulk RNASeq samples using Salmon and DESeq2
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Simple workflow to quantify gene-level RNA abundance and detect differentially expressed genes (DEGs)
from bulk RNAseq samples. The pipeline uses
kallisto
or
salmon
to quantify transcript level abundance and
DESeq2
to normalize counts and detect DEGs.
Resource Requirements
Salmon usually works with 8GB to 12GB of RAM. 16GB is recommended for DESeq2 with small datasets. Larger amounts of memory will be required for larger datasets.
Installation
-
Install Singularity and
snakemake
-
Download the
salmon
references or build your own -
Clone this repository or a create new repository from this template
-
Describe your samples in
samples.csv
-
Modify the settings in
config.yaml
-
(Optional) If you plan on using a SLURM cluster, fill out the
#SBATCH
directives inrun_pipeline.sh
and theaccount
field incluster.json
.
Salmon vs Kallisto
As of April 2022,
bulk-rnaseq
only supports Salmon quantification. Older versions supported Kallisto,
but this was dropped due to better Salmon integration with tximeta
Sample File Configuration
bulk-rnaseq
requires you specify a CSV file with a metadata about samples to analyze.
There are several required columns:
-
patient
-
condition
-
fq1
-
fq2
The
patient
and
condition
columns are used to create a sample specific ID for downstream processing and identification.
The values specified in the these columns should create unique values when combined as
{patient}-{condition}
.
condition
should specify the experimental condition of the sample (
normal
or
tumor
,
primary
or
metastatic
etc).
NOTE
salmon
will automatically infer the library stranding and choose the best option.
Running the pipeline
Run jobs locally
snakemake -j [cores] --use-singularity
Run jobs on a SLURM cluster
# Must run within the bulk-rnaseq directory
sbatch run_pipeline.sh
Output
Results are stored in
results/[gene|transcript]-level
and include:
-
normalized_counts.rds
- DESeq2 object with VST normalized log2 counts. Useful for downstream visualization, clustering, and machine learning -
pca_plot.svg
- Plot of the samples described by the first two principal components. Useful to check for batch effects and identify bad samples
Each contrast has its own folder in
results/
with the following files:
-
mle_foldchanges.csv
- List of log2 fold changes estimated by DESeq2 using maximum likelihood estimation (MLE) -
map_foldchanges.csv
- List of log2 fold changes estimated by DESeq2 via shrinkage using theapeglm
package -
mle_ma.svg
- MA plot showing the MLE fold changes -
map_ma.svg
- MA plot showing the MAP fold changes
The full DESeq2 object used for differential expression analysis can be found in
deseq2/
.
A MultiQC html report with quantification statistics and FASTQC reports for each sample is located in
qc/
.
Check this to verify a reasonable proportion of reads were pseudoaligned.
validateFastq_summary.csv
has the results of running
qc/validateFastq
on each sample's input FASTQ files.
This checks for proper FASTQ formatting and properly sorted read headers.
Code Snippets
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "[email protected]" __license__ = "MIT" from os import path from snakemake.shell import shell input_dirs = set(path.dirname(fp) for fp in snakemake.input) output_dir = path.dirname(snakemake.output[0]) output_name = path.basename(snakemake.output[0]) log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "multiqc" " {snakemake.params}" " --force" " -o {output_dir}" " -n {output_name}" " {input_dirs}" " {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") # Handle readr tzdata bug when using singularity mambaforge container Sys.setenv("TZDIR"=paste0(Sys.getenv("CONDA_PREFIX"), "/share/zoneinfo")) library(tximeta) library(DESeq2) library(readr) library(stringr) library(org.Hs.eg.db) res_dirs <- snakemake@input[['cts']] samples_fp <- snakemake@input[['samples']] quant_program <- snakemake@params[['aligner']] design_formula <- snakemake@params[['formula']] if (snakemake@threads > 1) { library("BiocParallel") parallel <- TRUE register(MulticoreParam(snakemake@threads)) } else { parallel <- FALSE } if (quant_program == 'salmon') { files <- file.path(res_dirs, "quant.sf") } else{ print("quant_program wasn't Salmon!") stop() } names(files) <- basename(dirname(files)) samples <- read.csv(samples_fp) samples$id <- paste(samples$patient, "-", samples$condition, sep = "") print("First") print(samples) # Reorder rows so they match files order samples <- samples[match(names(files), samples$id),] print("Second") print(files) print(samples) # Check that matching worked samples$names <- names(files) samples$files <- files stopifnot(all(samples$id == samples$names)) stopifnot(all(samples$names == basename(dirname(samples$files)))) print(samples) # Save SummarizedExperiment with tx-level data for future use se <- tximeta(samples) se <- addIds(se, "SYMBOL", gene = T) saveRDS(se, snakemake@output[['se_tx']]) # Collapse to gene-level for DE analysis gse <- summarizeToGene(se) gse <- addIds(gse, "SYMBOL", gene = T) f <- as.formula(design_formula) # No longer need se object - remove from memory rm(se); gc() ## Gene-level print("Building DESeq object for gene-level features") dds <- DESeqDataSet(gse, design = f) print("Old ordering for condition") print(colData(dds)$condition) ## Ensure factor ordering based on config specifications vars <- snakemake@params[['levels']] var_levels <- str_split(vars, ';', simplify=T) for (var in var_levels) { print(paste("Variable:", var)) s <- str_split(var, '=|,', simplify=T) col <- s[1, 1] level_order = s[1, 2:dim(s)[2]] colData(dds)[, col] <- factor(colData(dds)[, col], level_order) print(paste("Ordering for", col)) print(levels(colData(dds)[, col])) } dds <- DESeq(dds, parallel=parallel) print(dds) vst_cts <- vst(dds, blind=FALSE) print(vst_cts) saveRDS(dds, file=snakemake@output[['deseq']]) saveRDS(vst_cts, file=snakemake@output[['cts']]) |
1
of
scripts/deseq2.R
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") # Handle readr tzdata bug when using singularity mambaforge container Sys.setenv("TZDIR"=paste0(Sys.getenv("CONDA_PREFIX"), "/share/zoneinfo")) library(DESeq2) library(IHW) library(ggplot2) library(dplyr) library(stringr) if (snakemake@threads > 1) { library("BiocParallel") parallel <- TRUE register(MulticoreParam(snakemake@threads)) } else { parallel <- FALSE } dds <- readRDS(snakemake@input[[1]]) print(dds) # Grab last variable at end of formula for contrasts design_formula <- snakemake@params[['formula']] s <- str_remove_all(design_formula, " |~") s <- str_split(s, "\\+") vars <- s[[1]] var <- vars[length(vars)] print(snakemake@params[['contrast']]) print("Creating results for the following variable:") print(var) print(resultsNames(dds)) contrast_coef <- paste(c(var, snakemake@params[['contrast']][1], "vs", snakemake@params[['contrast']][2]), collapse="_") de_contrast <- c(var, snakemake@params[['contrast']][1], snakemake@params[['contrast']][2]) mle_res <- results(dds, contrast=de_contrast, filterFun=ihw, alpha = .05, parallel = parallel) map_res <- lfcShrink(dds, coef=contrast_coef, type = "apeglm", parallel = parallel) # Add gene symbols from rowData print("res and dds ensgenes match?") stopifnot(all(rownames(mle_res) == rownames(rowData(dds)))) stopifnot(all(rownames(map_res) == rownames(rowData(dds)))) print("Ensgenes match. Adding symbols to results dataframes") mle_res$symbol <- rowData(dds)$SYMBOL map_res$symbol <- rowData(dds)$SYMBOL print("MLE LFC:") print(mle_res) print(summary(mle_res)) print("MAP LFC:") print(map_res) print(summary(map_res)) mle_df <- mle_res %>% data.frame() %>% tibble::rownames_to_column(var = "ensgene") %>% as_tibble() %>% dplyr::select(symbol, ensgene, everything()) %>% dplyr::arrange(padj) map_df <- map_res %>% data.frame() %>% tibble::rownames_to_column(var = "ensgene") %>% as_tibble() %>% dplyr::select(symbol, ensgene, everything()) %>% dplyr::arrange(padj) print("MLE dataframe") print(mle_df) print("MAP dataframe") print(map_df) mleplot <- mle_df %>% dplyr::mutate( significant = ifelse(padj < .05, "padj < .05", "padj >= .05"), direction = ifelse(log2FoldChange > 0, "Upregulated", "Downregulated") ) %>% ggplot(aes(log10(baseMean),log2FoldChange)) + geom_point(aes(color = significant, shape = direction)) + geom_hline(yintercept = 0, linetype = "dashed", color = "red") + labs(x = "log10 Expression", y = "MLE Log2FoldChange") + theme_bw() mapplot <- map_df %>% dplyr::mutate( significant = ifelse(padj < .05, "padj < .05", "padj >= .05"), direction = ifelse(log2FoldChange > 0, "Upregulated", "Downregulated") ) %>% ggplot(aes(log10(baseMean),log2FoldChange)) + geom_point(aes(color = significant, shape = direction)) + geom_hline(yintercept = 0, linetype = "dashed", color = "red") + labs(x = "log10 Expression", y = "MAP Log2FoldChange") + theme_bw() readr::write_csv(mle_df, snakemake@output[['mleres']]) readr::write_csv(map_df, snakemake@output[['mapres']]) ggsave(snakemake@output[['mlema']], plot = mleplot, width = 10, height = 7) ggsave(snakemake@output[['mapma']], plot = mapplot, width = 10, height = 7) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(DESeq2) library(ggplot2) library(stringr) vsd <- readRDS(snakemake@input[[1]]) print(vsd) label_vars <- str_split(snakemake@params[['label_vars']], ',', simplify=T)[1, ] pcaplot <- plotPCA(vsd, intgroup=label_vars) ggsave(snakemake@output[[1]], pcaplot) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 | import os import pandas as pd def check_validation(fp): sampleid = os.path.basename(fp) sampleid = sampleid.replace(".txt", "") with open(fp, 'r') as f: line = f.readline() while line: if 'Error found' in line: return sampleid, 'Failed' elif 'no errors found' in line: return sampleid, 'Passed' line = f.readline() return sampleid, 'No log data found in file' def main(): files = snakemake.input results = [check_validation(f) for f in files] df = pd.DataFrame.from_dict({'patient' : [t[0] for t in results], 'result' : [t[1] for t in results]}) print(df) df.to_csv(snakemake.output[0], index = False) if __name__ == '__main__': main() |
158 159 160 161 | shell: """ salmon quant -i {input.idx} -l A {params.fqs} -p {threads} -o {output} {params.reps} """ |
179 180 | script: "scripts/deseq2.R" |
197 198 | script: "scripts/diffexp.R" |
208 209 | wrapper: "0.50.4/bio/multiqc" |
221 222 | script: "scripts/plot_pca.R" |
232 233 234 235 236 237 238 239 | shell: """ tmpdir=qc/fastqc/.{wildcards.sample_id}.tmp mkdir $tmpdir mkdir {output} fastqc {input} -o {output} &> >(tee {log}) -d $tmpdir rm -r $tmpdir """ |
250 251 252 253 | shell: """ biopet-validatefastq {params} 2>&1 | tee {output} """ |
260 261 | script: "scripts/summarize_fastqValidation.py" |
Support
- Future updates