Snakemake workflow for generating gene expression counts from RNA-sequencing data.


Introduction

Author: Anže Lovše (@AnzeLovse)

This is a Snakemake workflow for generating gene expression counts from RNA-sequencing data; it can optionally run differential expression analysis as well. The workflow handles both single-end and paired-end sequencing data. I used it to analyze sequencing data from Bacillus thuringiensis serovar israelensis as part of my bachelor's thesis.

If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository.

Overview

The standard workflow performs the following steps:

  1. Trim reads with Cutadapt.

  2. Build a genome index and align reads with Bowtie2.

  3. Sort and index reads with Samtools.

  4. Generate gene expression counts with featureCounts.

  5. Merge all counts into a single file.

  6. Normalize counts (CPM, TPM) with rnanorm.

  7. (Optionally) run differential expression analysis with DESeq2.

Additionally, the workflow includes quality control steps:

  • Read quality reports before and after trimming, using FastQC.

  • Read alignment statistics based on Samtools and Bowtie2 reports.

  • featureCounts statistics report.

  • An aggregated report made with MultiQC.

and visualization helper functions:

  • Genome coverage tracks made with genomeCoverageBed.

  • FASTA index calculation.

  • A JSON file with coverage data.

  • GTF-to-BED conversion.

  • A JSON file of genomic features.

The steps are shown in the following graph. Each rectangle represents a step of the analysis, with rule names written in italics. Blue rectangles represent the main steps, yellow ones the QC steps, and gray ones the helper functions. (Figure: steps.)

How to use

Prerequisites

Download Docker or install Miniconda. If you use Miniconda, choose Python 3.8 (or higher) for your operating system and follow the installation instructions.

conda --version # ideally 4.9.xx or higher
python --version # ideally 3.8.xx or higher

Step 1: Obtain a copy of this workflow

If you simply want to use this workflow, download and extract the latest release or clone it with:

git clone https://github.com/AnzeLovse/mag.git

If you intend to modify and further develop this workflow, fork this repository. Please consider providing any generally applicable modifications via a pull request.

Step 2: Configure workflow

Configure the workflow according to your needs by editing the files in the config/ folder: adjust config.yaml to configure the workflow execution, samples.tsv to specify your experiment setup, and units.tsv to specify sequencing runs.
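
The exact column layout is defined by the example sheets in config/; as a rough, tab-separated sketch (column names inferred from how scripts/deseq2.R consumes the sheets, so treat them as illustrative and check the bundled examples):

# samples.tsv: one row per biological sample
sample    condition
G         control
G_M       mitomycin

# units.tsv: one row per sequencing run; leave fq2 empty for single-end data
sample    replicate    time    fq1                       fq2
G         1            0       data/G-t0-1_R1.fastq      data/G-t0-1_R2.fastq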

Step 3: Pull Docker image (recommended) or create Conda environment

For Docker, pull the latest (currently 0.1.5) image from Docker Hub:

docker pull alovse/rnaseq-mag

The Conda environment only supports quantification, not differential expression analysis. To set it up, use:

conda install -y -c conda-forge mamba
mamba create -q -y -c conda-forge -c bioconda -n snakemake snakemake python=3.8
conda env update -n snakemake --file /var/cache/build/environment.yaml # path from the Docker build; locally, point --file at the repository's environment file
conda run -n snakemake python -m pip install rnanorm
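
As a quick, optional sanity check that the environment resolved correctly:

conda activate snakemake
snakemake --version
rnanorm --help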

Step 4: Execute workflow

Create a results directory and copy the configuration into it:

mkdir results_timecourse # create the directory
cp -r config results_timecourse # copy the configuration (directories need -r)

Transfer the reference genome and reads into a subdirectory of the results directory, which should follow the structure below. Please note that the 0.72.0/bio/fastqc wrapper only supports .fastq files, not .fq files.

├── config
│   ├── config.yaml
│   ├── samples.tsv
│   └── units.tsv
└── data
    ├── reference.fasta
    ├── annotation.gtf
    └── reads.fastq
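
For example, continuing from the commands above (file names are illustrative):

mkdir results_timecourse/data
cp /path/to/reference.fasta /path/to/annotation.gtf /path/to/reads.fastq results_timecourse/data/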

To run the workflow with Docker use:

cd mag # open the repository directory
docker run -m $M -v $(pwd):/data alovse/rnaseq-mag /bin/bash -c "source activate snakemake; cd data; snakemake --cores $N $STEP"

using $N cores, $M of memory (e.g. 16g) and the desired step $STEP (e.g. all or all_with_de).
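
For example (illustrative values), a full run with 8 cores and 16 GB of memory:

docker run -m 16g -v $(pwd):/data alovse/rnaseq-mag /bin/bash -c "source activate snakemake; cd data; snakemake --cores 8 all"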

Alternatively, you can use Conda.

Activate the conda environment:

conda activate snakemake

Test your configuration by performing a dry-run via

snakemake -n all

Execute the workflow locally via

snakemake --cores $N $STEP

using $N cores and the desired step $STEP (e.g. all or all_with_de).
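
Other standard Snakemake options work as usual, for example:

snakemake --cores 4 --rerun-incomplete all   # resume an interrupted run
snakemake --cores 4 all_with_de              # quantification plus differential expression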

See the Snakemake documentation for further details.

Step 5: Investigate results

After successful execution, you can inspect the interactive HTML report made with MultiQC, located at results_timecourse/qc/multiqc.html.

For more information about MultiQC, please see the official webpage and tool documentation.

The directory structure of the results folder is shown below.

├── bedgraphs
├── config
├── counts
├── data
├── deseq2
├── index
├── logs
│   ├── bedgraphs
│   ├── bowtie2
│   ├── bowtie2_build
│   ├── cutadapt
│   ├── deseq2
│   ├── fastqc
│   ├── fastqc_posttrim
│   ├── feature_counts
│   ├── multiqc.log
│   └── samtools_stats
├── mapped_reads
├── qc
│   ├── fastqc
│   ├── fastqc_posttrim
│   ├── feature_counts
│   ├── multiqc.html
│   ├── multiqc_data
│   └── samtools_stats
├── sorted_reads
├── trimmed
└── visualisations

Step 6: Commit changes

Whenever you change something, don't forget to commit the changes back to your GitHub copy of the repository:

git commit -a
git push

Step 7: Obtain updates from upstream

Whenever you want to synchronize your workflow copy with new developments from upstream, do the following.

  1. Once, register the upstream repository in your local copy: git remote add -f upstream git@github.com:AnzeLovse/mag.git, or git remote add -f upstream https://github.com/AnzeLovse/mag.git if you have not set up SSH keys.

  2. Update the upstream version: git fetch upstream.

  3. Create a diff with the current version: git diff HEAD upstream/master workflow > upstream-changes.diff.

  4. Investigate the changes: vim upstream-changes.diff.

  5. Apply the modified diff via: git apply upstream-changes.diff.

  6. Carefully check whether you need to update the config files: git diff HEAD upstream/master config. If so, do it manually, and only where necessary, since you would otherwise likely overwrite your settings and samples.

Step 8: Contribute back

In case you have also changed or added steps, please consider contributing them back to the original repository:

  1. Fork the original repo to a personal or lab account.

  2. Clone the fork to your local system, to a different place than where you ran your analysis.

  3. Copy the modified files from your analysis to the clone of your fork, e.g., cp -r workflow path/to/fork. Make sure not to accidentally copy config file contents or sample sheets; instead, manually update the example config files if necessary.

  4. Commit and push your changes to your fork.

  5. Create a pull request against the original repository.

Code Snippets

The key rule bodies and wrapper sources of the workflow are reproduced below.


Read alignment:
wrapper:
    "0.72.0/bio/bowtie2/align"

Sort alignments:
shell:
    "samtools sort -@ {params.threads} -T sorted_reads/{wildcards.sample}-t{wildcards.time}-{wildcards.rep} -O bam -o {output}  {input}"

Index alignments:
shell:
    "samtools index -@ {params.threads} {input}"

Count reads per feature with featureCounts:
shell:
    "featureCounts "
    "{params.extra} "
    "-a {params.annotation} "
    "-o {output.counts} "
    "-T {threads} "
    "{input.bam} > {log} 2>&1; "
    "mv \"{output.counts}.summary\" {output.summary}"

Merge per-sample counts (a run block; pandas and pathlib.Path are expected to be imported at the top of the Snakefile):
run:
    indices = ["Geneid", "Chr", "Start", "End", "Strand", "Length"]
    frames = [
        pd.read_csv(fp, sep="\t", skiprows=1, index_col=indices)
        for fp in input
    ]
    merged = pd.concat(frames, axis=1)

    merged = merged.rename(columns=lambda c: Path(c).stem)
    merged.to_csv(output.complete, sep="\t", index=True)

    # Save counts compatible with rnanorm
    merged = merged.reset_index(level="Geneid")
    merged = merged.rename(columns={"Geneid":"FEATURE_ID"})
    merged.to_csv(output.raw, sep="\t", index=False)

    lengths = merged.reset_index(level="Length")
    lengths[["FEATURE_ID", "Length"]].to_csv(output.lengths, sep="\t", index=False)

Normalize counts with rnanorm:
shell:
    "rnanorm {input.counts} --cpm-output={output.cpm} --tpm-output={output.tpm} --gene-lengths={input.lengths}"

Run differential expression analysis:
script:
    "../scripts/deseq2.R"

Build the Bowtie2 index:
wrapper:
    "0.72.0-5-gd0d054c/bio/bowtie2/build"

Read quality control with FastQC, before and after trimming:
    wrapper:
        "0.72.0/bio/fastqc" # TODO update version to correctly remove .fq.gz ending

rule fastqc_post_trim:
    input:
        "trimmed/{sample}-t{time}-{rep}_{pair}.fastq.gz",
    output:
        html="qc/fastqc_posttrim/{sample}-t{time}-{rep}_R{pair}.html",
        zip="qc/fastqc_posttrim/{sample}-t{time}-{rep}_R{pair}_fastqc.zip"
    wrapper:
        "0.72.0/bio/fastqc" # TODO update version to correctly remove .fq.gz ending

rule samtools_stats:
    input:
        bam="sorted_reads/{sample}-t{time}-{rep}.bam",
        idx="sorted_reads/{sample}-t{time}-{rep}.bam.bai"
shell:
    "samtools stats {input.bam} 1> {output} 2> {log}"

Aggregate QC reports with MultiQC:
wrapper:
    "0.72.0/bio/multiqc"

Trim paired-end reads:
wrapper:
    "0.72.0/bio/cutadapt/pe"
(from line 15 of rules/trim.smk)

Trim single-end reads:
wrapper:
    "0.72.0/bio/cutadapt/se"
(from line 30 of rules/trim.smk)

Convert GTF to BED:
shell:
    "cat {input.annotation} | "
    "awk 'OFS=\"\\t\" {{if ($3==\"gene\") {{print $1,$4-1,$5,$10,$16,$7}}}}' | "
    "tr -d '\";' > {output.complete}"

Build the genomic-features JSON (a run block; prepare_annotation is a helper defined in the workflow):
run:
    annotation = pd.read_csv(
        input.annotation,
        sep="\t",
        header=None,
        names=["chr", "start", "end", "name", "type", "strand"]
    )
    annotation = annotation.rename(columns={"chr": "block_id"})

    pos = prepare_annotation(annotation, "+", "rgb(255,51,51)")
    neg = prepare_annotation(annotation, "-", "rgb(255,51,51)")
    joined = {"positive": pos, "negative": neg}
    with open(output.json, "w") as handle:
        json.dump(joined, handle, indent=4)

Compute strand-specific genome coverage:
shell:
    "genomeCoverageBed "
    "-ibam {input.bam} "
    "{params.paired} "
    "-bga "
    "-strand {wildcards.strand} "
    "> {output.coverage}"

Index the reference FASTA:
shell:
    "samtools faidx {input}"

Build the coverage JSON (a run block):
run:
    fai = pd.read_csv(
        input.fai,
        sep="\t",
        header=None,
        names=["chr", "len"],
        usecols=[0,1],
        index_col=0
    )
    lengths = dict(zip(fai.index, fai.len))

    coverage = {"length":lengths, "positive":{}, "negative":{}}
    coverage_pos = pd.read_csv(input.coverage_pos, sep="\t", header=None, names=["block_id", "start", "end", "value"])
    coverage_neg = pd.read_csv(input.coverage_neg, sep="\t", header=None, names=["block_id", "start", "end", "value"])

    for chrom in lengths.keys():
        pos = coverage_pos.loc[coverage_pos["block_id"] == chrom]
        neg = coverage_neg.loc[coverage_neg["block_id"] == chrom]

        coverage["positive"][chrom] = json.loads(pos.to_json(orient="records"))
        coverage["negative"][chrom] = json.loads(neg.to_json(orient="records"))

    with open(output.json, "w") as handle:
        json.dump(coverage, handle, indent=4)
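
The resulting JSON structure follows directly from the code above (sequence names and values are illustrative):

{
    "length": {"chr1": 5207356},
    "positive": {"chr1": [{"block_id": "chr1", "start": 0, "end": 120, "value": 0}]},
    "negative": {"chr1": [{"block_id": "chr1", "start": 0, "end": 87, "value": 3}]}
}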

scripts/deseq2.R:
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("DESeq2")

counts <- read.csv(snakemake@input[["counts"]], sep="\t", row.names="Geneid")
counts <- subset(counts, select = -c(Chr, Start, End, Strand, Length))

samples <- read.csv(snakemake@params[["samples"]], sep="\t", stringsAsFactors = TRUE)
units <- read.csv(snakemake@params[["units"]], sep="\t", stringsAsFactors = TRUE)

coldata <- merge(unique(samples), units, by='sample')
coldata <- subset(coldata, select=-c(fq1, fq2))
coldata[] <- lapply(coldata, as.factor)
rownames(coldata) <- paste(coldata$sample, paste("t",coldata$time,sep=""), coldata$replicate, sep=".")

# Specific steps needed for our analysis
#--------------------------------------------------
counts$G_M.t0.1 <- counts$G.t0.1
counts$G_M.t0.2 <- counts$G.t0.2

zero_rows <- data.frame(rbind(
  c("G_M", "mitomycin", 1, 0),
  c("G_M", "mitomycin", 2, 0)
))
colnames(zero_rows) <- colnames(coldata)
rownames(zero_rows) <- c("G_M.t0.1", "G_M.t0.2")
coldata <- rbind(coldata, zero_rows)
#--------------------------------------------------

# Keep rows with more than 5 counts in at least 2 samples
filter <- apply(counts, 1, function(x) length(x[x>5])>=2)
filtered <- counts[filter,]

# Create a DESeq2 data set with the full design.
ddsFull <- DESeqDataSetFromMatrix(
    countData = filtered,
    colData = coldata,
    design = as.formula(snakemake@params[["model"]]))

# Use Likelihood-ratio test on a reduced model.
dds <- DESeq(ddsFull, test="LRT", reduced = as.formula(snakemake@params[["reduced_model"]]))
saveRDS(dds, file = snakemake@output[["rds"]])

alpha <- as.numeric(snakemake@params[["alpha"]])
result <- results(dds, alpha=alpha)

# Construct a list of shrunken LFC for the coefficients of the model.
logfc.list <- list()
logfc.list[[1]] <- data.frame(result)
for (i in 2:length(resultsNames(dds))) {
    coeff = resultsNames(dds)[i]
    res.t <- results(dds, name=coeff, test="Wald", alpha = alpha)
    out.t <- as.data.frame(res.t)
    out.t <- cbind(geneid = rownames(out.t), out.t)
    write.table(
        out.t,
        file=paste("deseq2/", coeff, ".tsv", sep=""),
        sep="\t",
        quote=FALSE,
        row.names=FALSE
    )

    res_shrunken <- lfcShrink(dds, coef=coeff, type="apeglm", res = res.t)
    out.table <- as.data.frame(res_shrunken)
    out.table <- cbind(geneid = rownames(out.table), out.table)
    write.table(
        out.table,
        file=paste("deseq2/", coeff, "_shrunken.tsv", sep=""),
        sep="\t",
        quote=FALSE,
        row.names=FALSE
    )

    svg(paste("deseq2/", coeff, "_ma_plot.svg", sep=""))
    plotMA(res_shrunken, ylim=c(-2,2)) # MA plot of the shrunken results for this coefficient
    dev.off()

    # Add the shrunken log2FC to the list.
    lfc <- res_shrunken[, "log2FoldChange", drop=FALSE]
    colnames(lfc) <- c(paste(coeff, "sLFC", sep="."))
    logfc.list[[i]] <- lfc
}

# Write combined results in the main output.
out.result <- do.call(cbind, logfc.list)
out.result <- out.result[order(out.result$padj),]
out.result <- cbind(geneid = rownames(out.result), out.result)
write.table(
    out.result,
    file=snakemake@output[["table"]],
    sep="\t",
    quote=FALSE,
    row.names=FALSE
)

# Use variance stabilising transformation and then plot PCA.
vsd <- vst(ddsFull, blind=FALSE)
svg(snakemake@output[["pca_plot"]])
plotPCA(vsd, intgroup=c("condition", "time"), ntop = 500)
dev.off()

Wrapper source (0.72.0-5-gd0d054c/bio/bowtie2/build):
__author__ = "Daniel Standage"
__copyright__ = "Copyright 2020, Daniel Standage"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
indexbase = snakemake.output[0].replace(".1.bt2", "")
shell(
    "bowtie2-build --threads {snakemake.threads} {extra} "
    "{snakemake.input.reference} {indexbase} {log}"
)

Wrapper source (0.72.0/bio/bowtie2/align):
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

n = len(snakemake.input.sample)
assert (
    n == 1 or n == 2
), "input->sample must have 1 (single-end) or 2 (paired-end) elements."

if n == 1:
    reads = "-U {}".format(*snakemake.input.sample)
else:
    reads = "-1 {} -2 {}".format(*snakemake.input.sample)

shell(
    "(bowtie2 --threads {snakemake.threads} {extra} "
    "-x {snakemake.params.index} {reads} "
    "| samtools view -Sbh -o {snakemake.output[0]} -) {log}"
)

Wrapper source (0.72.0/bio/cutadapt/pe):
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


n = len(snakemake.input)
assert n == 2, "Input must contain 2 (paired-end) elements."

extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

assert (
    extra != "" or adapters != ""
), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='."

shell(
    "cutadapt"
    " {adapters}"
    " {extra}"
    " -o {snakemake.output.fastq1}"
    " -p {snakemake.output.fastq2}"
    " -j {snakemake.threads}"
    " {snakemake.input}"
    " > {snakemake.output.qc} {log}"
)

Wrapper source (0.72.0/bio/cutadapt/se):
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


n = len(snakemake.input)
assert n == 1, "Input must contain 1 (single-end) element."

extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

assert (
    extra != "" or adapters != ""
), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='."

shell(
    "cutadapt"
    " {adapters}"
    " {extra}"
    " -j {snakemake.threads}"
    " -o {snakemake.output.fastq}"
    " {snakemake.input[0]}"
    " > {snakemake.output.qc} {log}"
)

Wrapper source (0.72.0/bio/multiqc):
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)