Processing sorted cells from BC1 profiled with bulk RNA-seq


Simple workflow to quantify gene-level RNA abundance and detect differentially expressed genes (DEGs) from bulk RNA-seq samples. The pipeline uses kallisto or salmon to quantify transcript-level abundance and DESeq2 to normalize counts and detect DEGs.

Installation

  1. Install Anaconda or Miniconda and then run conda install snakemake. Alternatively, on Sherlock, install Snakemake inside an interactive session:
sdev
ml python/3.9.0
pip3 install --user snakemake
pip3 install --user mamba
  2. Download the appropriate kallisto or salmon references or build your own

  3. Clone the repository

  4. Describe your samples in samples.csv

  5. Modify the settings in config.yaml

  6. (Optional) If you plan on using a SLURM cluster, fill out the #SBATCH directives in run_pipeline.sh and the out and account fields in cluster.json

  7. (Optional) If you want to run the pipeline in a Singularity environment for full reproducibility, install Singularity. run_pipeline.sh assumes Singularity is installed; delete the --use-singularity flag if you want to skip the Singularity environment. The same applies to conda environments, although using both is recommended for the best reproducibility.

Sample File Configuration

bulk-rnaseq requires you to specify a CSV file with metadata about the samples to analyze. There are several required columns:

  • patient

  • condition

  • fq1

  • fq2

  • strandedness (required by kallisto; optional if using salmon)

The patient and condition columns are used to create a sample-specific ID for downstream processing and identification. The values in these columns must be unique when combined as {patient}-{condition}. condition should specify the experimental condition of the sample (normal or tumor, primary or metastatic, etc.). An example samples table is sketched after the note below.

NOTE: salmon automatically infers the library stranding and chooses the best option. kallisto requires you to specify the stranding; see check-strand to determine the proper setting if you prefer to run kallisto.
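For illustration only, here is a minimal sketch of a valid samples table built in R and written to samples.csv. The column names come from the list above; the patient, condition, FASTQ paths, and strandedness values are hypothetical placeholders (use check-strand and your own data to fill in real values).

# Hypothetical samples table; all values below are placeholders, not shipped with the pipeline
samples <- data.frame(
    patient      = c("P01", "P01", "P02", "P02"),
    condition    = c("normal", "tumor", "normal", "tumor"),
    fq1          = c("fastq/P01_normal_R1.fastq.gz", "fastq/P01_tumor_R1.fastq.gz",
                     "fastq/P02_normal_R1.fastq.gz", "fastq/P02_tumor_R1.fastq.gz"),
    fq2          = c("fastq/P01_normal_R2.fastq.gz", "fastq/P01_tumor_R2.fastq.gz",
                     "fastq/P02_normal_R2.fastq.gz", "fastq/P02_tumor_R2.fastq.gz"),
    strandedness = "reverse"   # placeholder; only needed when running kallisto
)

# {patient}-{condition} must be unique across rows
stopifnot(!anyDuplicated(paste(samples$patient, samples$condition, sep = "-")))

write.csv(samples, "samples.csv", row.names = FALSE, quote = FALSE)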

Running the pipeline

Run jobs locally

snakemake -j [cores] --use-singularity --use-conda

Run jobs on a SLURM cluster

sbatch run_pipeline.sh [path to directory with Snakefile]

Output

Results are stored in results/ and include:

  • normalized_counts.rds - DESeq2 object with VST-normalized (approximately log2-scale) counts. Useful for downstream visualization, clustering, and machine learning (see the sketch at the end of this section)

  • pca_plot.svg - Plot of the samples projected onto the first two principal components. Useful for checking batch effects and identifying problematic samples

Each contrast has its own folder in results/ with the following files:

  • mle_foldchanges.csv - List of log2 fold changes estimated by DESeq2 using maximum likelihood estimation (MLE)

  • map_foldchanges.csv - List of log2 fold changes estimated by DESeq2 via shrinkage using the apeglm package

  • mle_ma.svg - MA plot showing the MLE fold changes

  • map_ma.svg - MA plot showing the MAP fold changes

The full DESeq2 object used for differential expression analysis can be found in deseq2/.

A MultiQC HTML report with kallisto statistics for each sample is located in qc/. Check this to verify that a reasonable proportion of reads were pseudoaligned.
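As a sketch of the downstream use of normalized_counts.rds mentioned above (not part of the pipeline), the VST-normalized object can be loaded in R and used directly for sample-level clustering; only the results/normalized_counts.rds path comes from the pipeline output.

library(DESeq2)

# Load the VST-normalized DESeq2 object written by the pipeline
vsd <- readRDS("results/normalized_counts.rds")

# Genes x samples matrix of VST counts (roughly log2 scale)
mat <- assay(vsd)

# Hierarchical clustering of samples on Euclidean distances between VST profiles
sample_dist <- dist(t(mat))
plot(hclust(sample_dist), main = "Sample clustering on VST counts")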

Code Snippets

scripts/deseq2.R:
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(tximport)
library(DESeq2)
library(readr)
library(stringr)

res_dirs <- snakemake@input[['cts']]
tx2g_fp <- snakemake@input[['tx2g']]
samples_fp <- snakemake@input[['samples']]
quant_program <- snakemake@params[['aligner']]

design_formula <- snakemake@params[['formula']]

if (snakemake@threads > 1) {
    library("BiocParallel")
    parallel <- TRUE
    register(MulticoreParam(snakemake@threads))
} else {
    parallel <- FALSE
}

if (quant_program == 'kallisto') {
    files <- file.path(res_dirs, "abundance.h5")
} else {
    files <- file.path(res_dirs, "quant.sf")
}
names(files) <- basename(dirname(files))
tx2g <- read_table2(tx2g_fp, col_names = c("tx", "ensgene", "symbol"))
# Summarize transcript-level abundances to gene level using the transcript -> gene mapping
txi <- tximport(files, type = quant_program, txOut = FALSE, tx2gene = tx2g[, 1:2])

samples <- read.csv(samples_fp)
print("First")
print(samples)
samples$id <- paste(samples$patient, "-", samples$condition, sep = "")
print("Second")
print(samples)
# Reorder rows so they match files order
samples <- samples[match(names(files), samples$id),]
print("Third")
print(files)
print(samples)

## Ensure factor ordering based on config specifications
vars <- snakemake@params[['levels']]
var_levels <- str_split(vars, ';', simplify=T)
for (var in var_levels) {
    print(paste("Variable:", var))
    s <- str_split(var, '=|,', simplify=T)
    col <- s[1, 1]
    level_order = s[1, 2:dim(s)[2]]
    samples[, col] <- factor(samples[, col], level_order)
}

f <- as.formula(design_formula)
ddsTxi <- DESeqDataSetFromTximport(txi, colData = samples, design = f)

keep <- rowSums(counts(ddsTxi)) >= 1
ddsTxi <- ddsTxi[keep, ]

dds <- DESeq(ddsTxi, parallel=parallel)
print(dds)

vst_cts <- vst(dds, blind=FALSE)
print(vst_cts)

saveRDS(dds, file=snakemake@output[['deseq']])
saveRDS(vst_cts, file=snakemake@output[['cts']])
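The factor-releveling loop in the script above expects the levels parameter to be a single string of variable=level1,level2 entries separated by semicolons, with the first level becoming the reference level for DESeq2. A small sketch with hypothetical values showing how that string is parsed:

library(stringr)

# Hypothetical value of snakemake@params[['levels']]
vars <- "condition=normal,tumor;patient=P01,P02"

# Same parsing as in the script: split entries on ';', then each entry on '=' or ','
for (var in str_split(vars, ';', simplify = TRUE)) {
    s <- str_split(var, '=|,', simplify = TRUE)
    print(s[1, 1])             # column name, e.g. "condition"
    print(s[1, 2:ncol(s)])     # level order, e.g. "normal" "tumor" (first = reference level)
}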
scripts/diffexp.R:
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(DESeq2)
library(IHW)
library(ggplot2)
library(dplyr)
library(stringr)

if (snakemake@threads > 1) {
    library("BiocParallel")
    parallel <- TRUE
    register(MulticoreParam(snakemake@threads))
} else {
    parallel <- FALSE
}

t2g_fp <- snakemake@input[[2]]
t2g <- readr::read_table2(t2g_fp, col_names = c("tx", "ensgene", "symbol"))
t2g <- t2g %>%
    dplyr::distinct(ensgene, symbol)

dds <- readRDS(snakemake@input[[1]])

print(dds)

# Grab last variable at end of formula for contrasts
design_formula <- snakemake@params[['formula']]
s <- str_remove_all(design_formula, " |~")
s <- str_split(s, "\\+")
vars <- s[[1]]
var <- vars[length(vars)]

print(snakemake@params[['contrast']])
print("Creating results for the following variable:")
print(var)

print(resultsNames(dds))

contrast_coef <- paste(c(var, snakemake@params[['contrast']][1], "vs", snakemake@params[['contrast']][2]), collapse="_")
de_contrast <- c(var, snakemake@params[['contrast']][1], snakemake@params[['contrast']][2])

mle_res <- results(dds, contrast=de_contrast, filterFun=ihw, alpha = .05, parallel = parallel)
map_res <- lfcShrink(dds, coef=contrast_coef, type = "apeglm", parallel = parallel)

print("MLE LFC:")
print(mle_res)
print(summary(mle_res))

print("MAP LFC:")
print(map_res)
print(summary(map_res))

mle_df <- mle_res %>%
  data.frame() %>%
  tibble::rownames_to_column(var = "ensgene") %>%
  as_tibble() %>%
  left_join(t2g) %>%
  dplyr::select(symbol, ensgene, everything()) %>%
  dplyr::arrange(padj) %>%
  dplyr::distinct(symbol, .keep_all = TRUE)

map_df <- map_res %>%
    data.frame() %>%
    tibble::rownames_to_column(var = "ensgene") %>%
    as_tibble() %>%
    left_join(t2g) %>%
    dplyr::select(symbol, ensgene, everything()) %>%
    dplyr::arrange(padj) %>%
    dplyr::distinct(symbol, .keep_all = TRUE)

print("MLE dataframe")
print(mle_df)
print("MAP dataframe")
print(map_df)


mleplot <- mle_df %>%
    dplyr::mutate(
        significant = ifelse(padj < .05, "padj < .05", "padj >= .05"),
        direction = ifelse(log2FoldChange > 0, "Upregulated", "Downregulated")
    ) %>%
    ggplot(aes(log10(baseMean),log2FoldChange)) +
    geom_point(aes(color = significant, shape = direction)) +
    geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
    labs(x = "log10 Expression", y = "MLE Log2FoldChange") +
    theme_bw()

mapplot <- map_df %>%
    dplyr::mutate(
        significant = ifelse(padj < .05, "padj < .05", "padj >= .05"),
        direction = ifelse(log2FoldChange > 0, "Upregulated", "Downregulated")
    ) %>%
    ggplot(aes(log10(baseMean),log2FoldChange)) +
    geom_point(aes(color = significant, shape = direction)) +
    geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
    labs(x = "log10 Expression", y = "MAP Log2FoldChange") +
    theme_bw()

readr::write_csv(mle_df, snakemake@output[['mleres']])
readr::write_csv(map_df, snakemake@output[['mapres']])
ggsave(snakemake@output[['mlema']], plot = mleplot, width = 10, height = 7)
ggsave(snakemake@output[['mapma']], plot = mapplot, width = 10, height = 7)
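The lfcShrink call above relies on DESeq2's coefficient naming convention, so the configured contrast must refer to the last variable of the design formula. A minimal sketch with hypothetical contrast values showing how the coefficient name is built:

# Hypothetical values: last design variable and configured contrast (numerator first)
var <- "condition"
contrast <- c("tumor", "normal")

# Same construction as in the script; must match an entry of resultsNames(dds)
contrast_coef <- paste(c(var, contrast[1], "vs", contrast[2]), collapse = "_")
print(contrast_coef)   # "condition_tumor_vs_normal"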
scripts/plot_pca.R:
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(DESeq2)
library(ggplot2)
library(stringr)

vsd <- readRDS(snakemake@input[[1]])

print(vsd)

label_vars <- str_split(snakemake@params[['label_vars']], ',', simplify=T)[1, ]

pcaplot <- plotPCA(vsd, intgroup=label_vars)

ggsave(snakemake@output[[1]], pcaplot)
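The label_vars parameter is expected to be a comma-separated list of colData column names; a small sketch with a hypothetical value showing how it is turned into plotPCA's intgroup argument:

library(stringr)

# Hypothetical value of snakemake@params[['label_vars']]
label_vars_param <- "patient,condition"

# Same parsing as in the script: split on ',' to get colData column names
label_vars <- str_split(label_vars_param, ',', simplify = TRUE)[1, ]
print(label_vars)   # "patient" "condition" -- must be columns of colData(vsd)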
main/Snakefile, lines 120-125 (kallisto quantification):
shell:
    """
    kallisto quant -i {input.idx} -o {output} -t {threads} \
        {params.strand} {params.single} {params.frag_length} {params.frag_sd} \
        {params.fqs} &> >(tee {log})
    """
main/Snakefile, lines 140-143 (salmon quantification):
shell:
    """
    salmon quant -i {input.idx} -l A {params.fqs} -p {threads} -o {output}
    """
main/Snakefile, lines 162-163 (DESeq2 script rule):
script:
    "scripts/deseq2.R"
main/Snakefile, lines 182-183 (differential expression script rule):
script:
    "scripts/diffexp.R"
main/Snakefile, lines 196-197 (MultiQC rule):
wrapper:
    "0.50.4/bio/multiqc"
main/Snakefile, lines 210-211 (PCA plot rule):
script:
    "scripts/plot_pca.R"
main/Snakefile, lines 222-229 (FastQC rule):
shell:
    """
    tmpdir=qc/fastqc/.{wildcards.sample_id}.tmp
    mkdir $tmpdir
    mkdir {output}
    fastqc {input} -o {output} &> >(tee {log}) -d $tmpdir
    rm -r $tmpdir
    """
Wrapper script for 0.50.4/bio/multiqc (lines 3-27):
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)

Repository: https://github.com/akirosingh/BC1bulk-rnaseq