RNA-Seq Analysis Pipeline Development with Snakemake

public public 1yr ago 0 bookmarks

We will use the workflow manager snakemake to develop a RNAseq pipeline. One big part of developing successful pipelines is making them reproducible and transferable, i.e. we should get the same results using the pipeline on the same data, irrespective of which compute/system we analysed them on. To achieve this, we will work with contained software environments, using conda (please install Anaconda if you haven't yet, so we can make use of this feature).

During the lectures, we will build our pipeline from scratch, starting with how to write snakemake rules. In the end, we will have built an analysis pipeline for RNAseq including alignment, quality control and differential expression analysis. There is also a homework assignment that will ask you to extend the pipeline, evaluate your results and learn how to generate an analysis report.

Directory structure

The files you see in this directory are either part of the pipeline ( Snakemake , scripts , envs ), or contain the example data to run the analysis ( genome , reads , samples.txt ).

Set-up

To get started, please install the required software packages before the lectures. The set-up is described below. If you run into issues with the installation, please let me know BEFORE the first session, so we can resolve any problems before the lectures.

(Don't worry if the setup does not make sense to you, or if you are not familiar with the commands yet; we will cover it all in the lectures)

1. Install a text editor Please make sure you have a text editor installed on your computer; if you do not have one, give Atom a try!

2. Install Snakemake Following the recommodations by the snakemake developers , we will first install mamba , a robuster and faster version of the default conda package manager that is shipped with Anaconda.

  • Open your terminal app (aka commad line, shell)

  • Type the following command:

conda install -n base -c conda-forge mamba

NB: this might take a while. Initially you should see:

Collecting package metadata (current_repodata.json): done
Collecting package metadata (repodata.json): done
Solving environment: done

At some point it will ask you Proceed ([y]/n)? . Type y and hit enter.

If it successfully finishes, you will see this on your screen:

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

We will then use mamba to create a software environment specific for our analysis. At the moment, all it needs to contain is snakemake itself. To create this environment type:

mamba create -c conda-forge -c bioconda -n snakemake-rnaseq snakemake

As above, this might take a while and it will ask you again Proceed ([y]/n)? . Type y and hit enter. Once it finishes (if successful) you will see:

# To activate this environment, use
#
# $ conda activate snakemake-rnaseq
#
# To deactivate an active environment, use
#
# $ conda deactivate

3. Activate the snakemake environment

  • type the following to activate the environment
conda activate snakemake-rnaseq
  • test the environment by typing
snakemake --version

You should see: 7.18.2

4. Move into the rnaseq-tutorial directory

  • enter the directory using the command line cd /path/2/rnaseq-tutorial (where path/2/ is the path to the directory where you saved the folder)

5. Test the setup We currently are in a software environment that contains snakemake. We can use this environment for any analysis that makes use of snakemake as a workflow manager. That also means, we might use it for different pipelines that require different software. To make everything transparent and reproducible, we can tell snakemake itself that for each analysis type it should automatically create a conda environment and handle the activation/deactivation, so we do not have to worry about this. Here, we are testing this setup and the way I have written it, all that snakemake will do here is create these environments and use them to write some dummy text. If that works, we know we will not have software related issues during the lecture and can instead focus on the analysis, so let's do that!

  • type: snakemake --use-conda --cores 1

  • hit enter

  • you should see something like this on your screen:

Building DAG of jobs...
Creating conda environment envs/pandas.yaml...
Downloading and installing remote packages.
...

(7 times for the different conda environments that snakemake will create) followed by

Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job count min threads max threads
--------------- ------- ------------- -------------
all 1 1 1
count_matrix 1 1 1
cutadapt 1 1 1
deseq2 1 1 1
generate_genome 1 1 1
index 1 1 1
multiqc 1 1 1
rseqc_coverage 1 1 1
total 8 1 1
  • if all went well you should see this at the end:
Finished job 0.
8 of 8 steps (100%) done

If you do not see this, or get stuck anywhere before that, let me know and we can have a look at it together - again, BEFORE class, so we can spend the lecture time most efficiently.

Code Snippets

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
import pandas as pd

print(snakemake.params.strand)
def get_column(strandedness):
    if pd.isnull(strandedness) or strandedness == "none":
        return 1 #non stranded protocol
    elif strandedness == "yes":
        return 2 #3rd column
    elif strandedness == "reverse":
        return 3 #4th column, usually for Illumina truseq
    else:
        raise ValueError(("'strandedness' column should be empty or have the "
                          "value 'none', 'yes' or 'reverse', instead has the "
                          "value {}").format(repr(strandedness)))

counts = [pd.read_table(f, index_col=0, usecols=[0, get_column(snakemake.params.strand)],
          header=None, skiprows=4) for f in snakemake.input]

for t, sample in zip(counts, snakemake.params.samples):
    t.columns = [sample]

matrix = pd.concat(counts, axis=1)
matrix.index.name = "gene"
# collapse technical replicates
if len(set(matrix.columns)) != len(matrix.columns):
    matrix = matrix.groupby(matrix.columns, axis=1).sum()
matrix.to_csv(snakemake.output[0], sep="\t")
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

#################
## libraries ####
#################
library("DESeq2")
library("tidyverse")


#################
## functions ####
#################
genes_de <- function(deset, thrP=0.05, thrLog2FC=log2(1.5),
                     direction=c('up', 'down', 'any')) {
    tmp <- deset %>%
        as.data.frame %>%
        rownames_to_column(var="gene_id")
    if (direction == "up") {
        tmp <- tmp %>%
            dplyr::filter(padj < thrP,
                          log2FoldChange > thrLog2FC)
    } else if (direction == "down") {
        tmp <- tmp %>%
            dplyr::filter(padj < thrP,
                          log2FoldChange < thrLog2FC)
    } else if (direction == "any") {
        tmp <- tmp %>%
            dplyr::filter(padj < thrP)
    } else {
        stop(direction, "is not a valid option for direction")
    }
    tmp
}


save_up_down <- function(res, snakemake) {
    up  <- genes_de(res, direction="up")
    down  <- genes_de(res, direction="down")
    write_csv(up, snakemake@output[['up']])
    write_csv(down, snakemake@output[['down']])
    return(list(up=up$gene_id, down=down$gene_id))
}

############
## data ####
############

dds <- readRDS(snakemake@input[["dds"]])

################
## analysis ####
################

## 1. Model fit ####
# Generate named coefficients need for apeglm lfcShrink
comparison <- snakemake@wildcards[["contrast"]]
coef <- paste("condition", comparison, sep="_")

# Relevel for reference to second element in contrasts
dds$condition <- relevel(dds$condition, "control")
dds <- nbinomWaldTest(dds)

## 2. Process results ####
# Extract coefficient specific results
res <- results(dds, name=coef, parallel=parallel)

# shrink fold changes for lowly expressed genes
res <- lfcShrink(dds, coef=coef, res=res, type='apeglm')

# add gene names to results object
res$gene_name <- mcols(dds)$symbol

# sort by p-value
res <- res[order(res$padj),]

## 3. Summarise results ####
## a) All genes for all groups ####
res_format <- res %>%
    as.data.frame() %>%
    rownames_to_column(var="gene_id") %>%
    as_tibble() %>%
    rename_at(vars(-gene_id, -gene_name), ~ paste0(., "_", comparison))

# normalised expression values
rld <- rlog(dds, blind = FALSE)
deg_genes <- assay(rld)

combined <- deg_genes %>%
    as.data.frame() %>%
    rownames_to_column(var="gene_id") %>%
    as_tibble() %>%
    right_join(res_format, by="gene_id") %>%
    dplyr::select(gene_id, gene_name, everything())

write_delim(combined, snakemake@output[["table"]], delim="\t")

## b) Up/Down genes ####
genes_up_down <- save_up_down(res=res, snakemake=snakemake)

## 4. Visualise results ####
# ma plot
pdf(snakemake@output[["ma_plot"]])
plotMA(res, ylim=c(-2,2))
dev.off()
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

#################
## libraries ####
#################
library("DESeq2")
library("biomaRt")
library("tidyverse")

############
## data ####
############
message("Reading counts")
cts <- read.table(snakemake@input[["counts"]], header=TRUE, row.names="gene",
                  check.names=FALSE)

message("Reading sample file")
coldata <- read.table(snakemake@input[["samples"]], header=TRUE,
                      row.names="sample", check.names=FALSE, sep="\t")

message("Reading annotation file")
annotation <- read.table(snakemake@input[["annotation"]], header=TRUE,
                       check.names=FALSE, sep="\t")

message("Getting experimental design")
design <- as.formula(snakemake@params[["design"]])

# colData and countData must have the same sample order
if (nrow(coldata) != ncol(cts)) {
    stop("Number of samples in sample sheet and number of samples in counts",
         "matrix is not the same")
}
cts <- cts[,match(rownames(coldata),colnames(cts))]
if (any(c("control", "Control", "CONTROL") %in% levels(coldata$condition))) {
    if ("control" %in% levels(coldata$condition)) {
        coldata$condition <- relevel(coldata$condition, "control" )
    } else if ("Control" %in% levels(coldata$condition)) {
        coldata$condition <- relevel(coldata$condition, "Control" )
    } else {
        coldata$condition <- relevel(coldata$condition, "CONTROL" )
    }
}

################
## analysis ####
################

## 1. Generate and annotate DESeq object ####
dds <- DESeqDataSetFromMatrix(countData=cts,
                              colData=coldata,
                              design=design)

# remove uninformative columns
dds <- dds[rowSums(counts(dds)) > 1,]

# normalization and preprocessing
dds <- DESeq(dds)

# Remove build number on ENS gene id
rownames(dds) <- gsub("\\.\\d*", "", rownames(dds))

genemap <- annotation %>%
    distinct()

featureData <- tibble(gene_id=rownames(dds)) %>%
    left_join(genemap, by="gene_id") %>%
    mutate(symbol=case_when(is.na(symbol) ~ gene_id,
                            TRUE ~ symbol)) %>%
    select(symbol)

mcols(dds) <- DataFrame(mcols(dds), featureData)

saveRDS(dds, snakemake@output[["dds"]])
31
32
33
34
35
36
37
38
39
40
41
shell:
    """
    STAR \
        --runMode genomeGenerate \
        --runThreadN 1 \
        --genomeFastaFiles {input.genome} \
        --sjdbGTFfile {input.gtf} \
        --genomeDir genome/STARINDEX \
        --genomeSAindexNbases 11 \
        --sjdbOverhang 75
    """
57
58
59
60
61
62
63
64
65
66
shell:
    """
    cutadapt \
        -a {params.adapter} \
        -o {output.fastq1} \
        -p {output.fastq2} \
        -j {threads} \
        {input} \
    > {output.qc}
    """
85
86
87
88
89
90
91
92
93
shell:
    """
    multiqc \
        --force \
        --export \
        --outdir qc \
        --filename multiqc_report.html \
        trimmed star qc> {log}
    """
115
116
117
118
119
120
121
122
123
124
125
shell:
    """
    STAR \
        --runMode alignReads \
        --runThreadN {threads} \
        --genomeDir {params.indexdir} \
        --readFilesIn {input.fastq1} {input.fastq2} \
        --outFileNamePrefix star/{wildcards.sample}-{wildcards.unit}. \
        --quantMode GeneCounts \
        --outSAMtype BAM SortedByCoordinate
    """
138
139
shell:
    "samtools index {input}"
155
156
157
158
159
160
161
shell:
    """
    geneBody_coverage.py \
        -r {input.bed} \
        -i {input.bam} \
        -o qc/rseqc/{wildcards.sample}-{wildcards.unit} 2> {log}
    """
SnakeMake From line 155 of main/Snakefile
179
180
script:
    "scripts/count-matrix.py"
SnakeMake From line 179 of main/Snakefile
200
201
script:
    "scripts/setup_deseq2.R"
SnakeMake From line 200 of main/Snakefile
217
218
script:
    "scripts/deseq2.R"
ShowHide 5 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/meyer-lab-cshl/rnaseq-tutorial
Name: rnaseq-tutorial
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...