Snakemake workflow for generating gene expression counts from RNA-sequencing data.


Introduction

Author: Anže Lovše (@AnzeLovse)

This is a Snakemake workflow for generating gene expression counts from RNA-sequencing data; it can optionally run differential expression analysis as well. The workflow handles both single-end and paired-end sequencing data. I used it to analyze sequencing data from Bacillus thuringiensis serovar israelensis as part of my bachelor's thesis.

If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository.

Overview

The standard workflow performs the following steps:

  1. Trim reads with Cutadapt.

  2. Build a genome index and align reads with Bowtie2.

  3. Sort and index reads with Samtools.

  4. Generate gene expression counts with featureCounts.

  5. Merge all counts into a single file.

  6. Normalize counts (CPM, TPM) with rnanorm.

  7. (Optionally) run differential expression analysis with DESeq2.

Additionally, the workflow includes quality control steps:

  • Read quality reports before and after trimming, using FastQC.

  • Read alignment statistics based on Samtools and Bowtie2 reports.

  • featureCounts statistics report.

  • An aggregated report made with MultiQC.

and visualization helper functions:

  • Genome coverage tracks made with genomeCoverageBed.

  • FASTA index calculation.

  • A JSON file with coverage data.

  • GTF-to-BED conversion.

  • A JSON file of genomic features.

The steps are shown in the following graph. Each rectangle represents a step of the analysis, with rule names written in italics. Blue rectangles represent the main steps, yellow ones the QC steps, and gray ones the helper functions. (Figure: steps.)

How to use

Prerequisites

Download Docker or install Miniconda. If you use Miniconda, choose Python 3.8 (or higher) for your operating system and follow the installation instructions.

conda --version # ideally 4.9.xx or higher
python --version # ideally 3.8.xx or higher

Step 1: Obtain a copy of this workflow

If you simply want to use this workflow, download and extract the latest release or clone it with:

git clone https://github.com/AnzeLovse/mag.git

If you intend to modify and further develop this workflow, fork this repository. Please consider providing any generally applicable modifications via a pull request.

Step 2: Configure workflow

Configure the workflow according to your needs by editing the files in the config/ folder: adjust config.yaml to configure the workflow execution, samples.tsv to specify your experiment setup, and units.tsv to specify sequencing runs.
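
The exact column layout is defined by the example sheets in config/; as a rough, tab-separated sketch (column names inferred from how scripts/deseq2.R consumes the sheets, so treat them as illustrative and check the bundled examples):

# samples.tsv: one row per biological sample
sample    condition
G         control
G_M       mitomycin

# units.tsv: one row per sequencing run; leave fq2 empty for single-end data
sample    replicate    time    fq1                       fq2
G         1            0       data/G-t0-1_R1.fastq      data/G-t0-1_R2.fastq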

Step 3: Pull Docker image (recommended) or create Conda environment

For Docker, pull the latest (currently 0.1.5) image from Docker Hub:

docker pull alovse/rnaseq-mag

The Conda environment only supports quantification, not differential expression analysis. To set it up, use:

conda install -y -c conda-forge mamba
mamba create -q -y -c conda-forge -c bioconda -n snakemake snakemake python=3.8
conda env update -n snakemake --file /var/cache/build/environment.yaml # path from the Docker build; locally, point --file at the repository's environment file
conda run -n snakemake python -m pip install rnanorm
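
As a quick, optional sanity check that the environment resolved correctly:

conda activate snakemake
snakemake --version
rnanorm --help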

Step 4: Execute workflow

Create a results directory and copy the configuration into it:

mkdir results_timecourse # create the directory
cp -r config results_timecourse # copy the configuration (directories need -r)

Transfer the reference genome and reads into a subdirectory of the results directory, which should follow the structure below. Please note that the 0.72.0/bio/fastqc wrapper only supports .fastq files, not .fq files.

├── config
│   ├── config.yaml
│   ├── samples.tsv
│   └── units.tsv
└── data
    ├── reference.fasta
    ├── annotation.gtf
    └── reads.fastq
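
For example, continuing from the commands above (file names are illustrative):

mkdir results_timecourse/data
cp /path/to/reference.fasta /path/to/annotation.gtf /path/to/reads.fastq results_timecourse/data/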

To run the workflow with Docker use:

cd mag # open the repository directory
docker run -m $M -v $(pwd):/data alovse/rnaseq-mag /bin/bash -c "source activate snakemake; cd data; snakemake --cores $N $STEP"

using $N cores, $M of memory (e.g. 16g) and the desired step $STEP (e.g. all or all_with_de).
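
For example (illustrative values), a full run with 8 cores and 16 GB of memory:

docker run -m 16g -v $(pwd):/data alovse/rnaseq-mag /bin/bash -c "source activate snakemake; cd data; snakemake --cores 8 all"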

Alternatively, you can use Conda.

Activate the conda environment:

conda activate snakemake

Test your configuration by performing a dry-run via

snakemake -n all

Execute the workflow locally via

snakemake --cores $N $STEP

using $N cores and the desired step $STEP (e.g. all or all_with_de).
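
Other standard Snakemake options work as usual, for example:

snakemake --cores 4 --rerun-incomplete all   # resume an interrupted run
snakemake --cores 4 all_with_de              # quantification plus differential expression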

See the Snakemake documentation for further details.

Step 5: Investigate results

After successful execution, you can inspect the interactive HTML report made with MultiQC, located at results_timecourse/qc/multiqc.html.

For more information about MultiQC, please see the official webpage and tool documentation.

The directory structure of the results folder is shown below.

├── bedgraphs
├── config
├── counts
├── data
├── deseq2
├── index
├── logs
│   ├── bedgraphs
│   ├── bowtie2
│   ├── bowtie2_build
│   ├── cutadapt
│   ├── deseq2
│   ├── fastqc
│   ├── fastqc_posttrim
│   ├── feature_counts
│   ├── multiqc.log
│   └── samtools_stats
├── mapped_reads
├── qc
│   ├── fastqc
│   ├── fastqc_posttrim
│   ├── feature_counts
│   ├── multiqc.html
│   ├── multiqc_data
│   └── samtools_stats
├── sorted_reads
├── trimmed
└── visualisations

Step 6: Commit changes

Whenever you change something, don't forget to commit the changes back to your GitHub copy of the repository:

git commit -a
git push

Step 7: Obtain updates from upstream

Whenever you want to synchronize your workflow copy with new developments from upstream, do the following.

  1. Once, register the upstream repository in your local copy: git remote add -f upstream git@github.com:AnzeLovse/mag.git, or git remote add -f upstream https://github.com/AnzeLovse/mag.git if you have not set up SSH keys.

  2. Update the upstream version: git fetch upstream.

  3. Create a diff with the current version: git diff HEAD upstream/master workflow > upstream-changes.diff.

  4. Investigate the changes: vim upstream-changes.diff.

  5. Apply the modified diff via: git apply upstream-changes.diff.

  6. Carefully check whether you need to update the config files: git diff HEAD upstream/master config. If so, do it manually, and only where necessary, since you would otherwise likely overwrite your settings and samples.

Step 8: Contribute back

In case you have also changed or added steps, please consider contributing them back to the original repository:

  1. Fork the original repo to a personal or lab account.

  2. Clone the fork to your local system, to a different place than where you ran your analysis.

  3. Copy the modified files from your analysis to the clone of your fork, e.g., cp -r workflow path/to/fork. Make sure not to accidentally copy config file contents or sample sheets; instead, manually update the example config files if necessary.

  4. Commit and push your changes to your fork.

  5. Create a pull request against the original repository.

Code Snippets

The key rule bodies and wrapper sources of the workflow are reproduced below.


Read alignment:
wrapper:
    "0.72.0/bio/bowtie2/align"

Sort alignments:
shell:
    "samtools sort -@ {params.threads} -T sorted_reads/{wildcards.sample}-t{wildcards.time}-{wildcards.rep} -O bam -o {output}  {input}"

Index alignments:
shell:
    "samtools index -@ {params.threads} {input}"

Count reads per feature with featureCounts:
shell:
    "featureCounts "
    "{params.extra} "
    "-a {params.annotation} "
    "-o {output.counts} "
    "-T {threads} "
    "{input.bam} > {log} 2>&1; "
    "mv \"{output.counts}.summary\" {output.summary}"

Merge per-sample counts (a run block; pandas and pathlib.Path are expected to be imported at the top of the Snakefile):
run:
    indices = ["Geneid", "Chr", "Start", "End", "Strand", "Length"]
    frames = [
        pd.read_csv(fp, sep="\t", skiprows=1, index_col=indices)
        for fp in input
    ]
    merged = pd.concat(frames, axis=1)

    merged = merged.rename(columns=lambda c: Path(c).stem)
    merged.to_csv(output.complete, sep="\t", index=True)

    # Save counts compatible with rnanorm
    merged = merged.reset_index(level="Geneid")
    merged = merged.rename(columns={"Geneid":"FEATURE_ID"})
    merged.to_csv(output.raw, sep="\t", index=False)

    lengths = merged.reset_index(level="Length")
    lengths[["FEATURE_ID", "Length"]].to_csv(output.lengths, sep="\t", index=False)

Normalize counts with rnanorm:
shell:
    "rnanorm {input.counts} --cpm-output={output.cpm} --tpm-output={output.tpm} --gene-lengths={input.lengths}"

Run differential expression analysis:
script:
    "../scripts/deseq2.R"

Build the Bowtie2 index:
wrapper:
    "0.72.0-5-gd0d054c/bio/bowtie2/build"

Read quality control with FastQC, before and after trimming:
    wrapper:
        "0.72.0/bio/fastqc" # TODO update version to correctly remove .fq.gz ending

rule fastqc_post_trim:
    input:
        "trimmed/{sample}-t{time}-{rep}_{pair}.fastq.gz",
    output:
        html="qc/fastqc_posttrim/{sample}-t{time}-{rep}_R{pair}.html",
        zip="qc/fastqc_posttrim/{sample}-t{time}-{rep}_R{pair}_fastqc.zip"
    wrapper:
        "0.72.0/bio/fastqc" # TODO update version to correctly remove .fq.gz ending

rule samtools_stats:
    input:
        bam="sorted_reads/{sample}-t{time}-{rep}.bam",
        idx="sorted_reads/{sample}-t{time}-{rep}.bam.bai"
shell:
    "samtools stats {input.bam} 1> {output} 2> {log}"

Aggregate QC reports with MultiQC:
wrapper:
    "0.72.0/bio/multiqc"

Trim paired-end reads:
wrapper:
    "0.72.0/bio/cutadapt/pe"
(from line 15 of rules/trim.smk)

Trim single-end reads:
wrapper:
    "0.72.0/bio/cutadapt/se"
(from line 30 of rules/trim.smk)

Convert GTF to BED:
shell:
    "cat {input.annotation} | "
    "awk 'OFS=\"\\t\" {{if ($3==\"gene\") {{print $1,$4-1,$5,$10,$16,$7}}}}' | "
    "tr -d '\";' > {output.complete}"

Build the genomic-features JSON (a run block; prepare_annotation is a helper defined in the workflow):
run:
    annotation = pd.read_csv(
        input.annotation,
        sep="\t",
        header=None,
        names=["chr", "start", "end", "name", "type", "strand"]
    )
    annotation = annotation.rename(columns={"chr": "block_id"})

    pos = prepare_annotation(annotation, "+", "rgb(255,51,51)")
    neg = prepare_annotation(annotation, "-", "rgb(255,51,51)")
    joined = {"positive": pos, "negative": neg}
    with open(output.json, "w") as handle:
        json.dump(joined, handle, indent=4)

Compute strand-specific genome coverage:
shell:
    "genomeCoverageBed "
    "-ibam {input.bam} "
    "{params.paired} "
    "-bga "
    "-strand {wildcards.strand} "
    "> {output.coverage}"

Index the reference FASTA:
shell:
    "samtools faidx {input}"

Build the coverage JSON (a run block):
run:
    fai = pd.read_csv(
        input.fai,
        sep="\t",
        header=None,
        names=["chr", "len"],
        usecols=[0,1],
        index_col=0
    )
    lengths = dict(zip(fai.index, fai.len))

    coverage = {"length":lengths, "positive":{}, "negative":{}}
    coverage_pos = pd.read_csv(input.coverage_pos, sep="\t", header=None, names=["block_id", "start", "end", "value"])
    coverage_neg = pd.read_csv(input.coverage_neg, sep="\t", header=None, names=["block_id", "start", "end", "value"])

    for chrom in lengths.keys():
        pos = coverage_pos.loc[coverage_pos["block_id"] == chrom]
        neg = coverage_neg.loc[coverage_neg["block_id"] == chrom]

        coverage["positive"][chrom] = json.loads(pos.to_json(orient="records"))
        coverage["negative"][chrom] = json.loads(neg.to_json(orient="records"))

    with open(output.json, "w") as handle:
        json.dump(coverage, handle, indent=4)
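
The resulting JSON structure follows directly from the code above (sequence names and values are illustrative):

{
    "length": {"chr1": 5207356},
    "positive": {"chr1": [{"block_id": "chr1", "start": 0, "end": 120, "value": 0}]},
    "negative": {"chr1": [{"block_id": "chr1", "start": 0, "end": 87, "value": 3}]}
}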

scripts/deseq2.R:
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("DESeq2")

counts <- read.csv(snakemake@input[["counts"]], sep="\t", row.names="Geneid")
counts <- subset(counts, select = -c(Chr, Start, End, Strand, Length))

samples <- read.csv(snakemake@params[["samples"]], sep="\t", stringsAsFactors = TRUE)
units <- read.csv(snakemake@params[["units"]], sep="\t", stringsAsFactors = TRUE)

coldata <- merge(unique(samples), units, by='sample')
coldata <- subset(coldata, select=-c(fq1, fq2))
coldata[] <- lapply(coldata, as.factor)
rownames(coldata) <- paste(coldata$sample, paste("t",coldata$time,sep=""), coldata$replicate, sep=".")

# Specific steps needed for our analysis
#--------------------------------------------------
counts$G_M.t0.1 <- counts$G.t0.1
counts$G_M.t0.2 <- counts$G.t0.2

zero_rows <- data.frame(rbind(
  c("G_M", "mitomycin", 1, 0),
  c("G_M", "mitomycin", 2, 0)
))
colnames(zero_rows) <- colnames(coldata)
rownames(zero_rows) <- c("G_M.t0.1", "G_M.t0.2")
coldata <- rbind(coldata, zero_rows)
#--------------------------------------------------

# Keep rows with more than 5 counts in at least 2 samples
filter <- apply(counts, 1, function(x) length(x[x>5])>=2)
filtered <- counts[filter,]

# Create a DESeq2 data set with the full design.
ddsFull <- DESeqDataSetFromMatrix(
    countData = filtered,
    colData = coldata,
    design = as.formula(snakemake@params[["model"]]))

# Use Likelihood-ratio test on a reduced model.
dds <- DESeq(ddsFull, test="LRT", reduced = as.formula(snakemake@params[["reduced_model"]]))
saveRDS(dds, file = snakemake@output[["rds"]])

alpha <- as.numeric(snakemake@params[["alpha"]])
result <- results(dds, alpha=alpha)

# Construct a list of shrunken LFC for the coefficients of the model.
logfc.list <- list()
logfc.list[[1]] <- data.frame(result)
for (i in 2:length(resultsNames(dds))) {
    coeff = resultsNames(dds)[i]
    res.t <- results(dds, name=coeff, test="Wald", alpha = alpha)
    out.t <- as.data.frame(res.t)
    out.t <- cbind(geneid = rownames(out.t), out.t)
    write.table(
        out.t,
        file=paste("deseq2/", coeff, ".tsv", sep=""),
        sep="\t",
        quote=FALSE,
        row.names=FALSE
    )

    res_shrunken <- lfcShrink(dds, coef=coeff, type="apeglm", res = res.t)
    out.table <- as.data.frame(res_shrunken)
    out.table <- cbind(geneid = rownames(out.table), out.table)
    write.table(
        out.table,
        file=paste("deseq2/", coeff, "_shrunken.tsv", sep=""),
        sep="\t",
        quote=FALSE,
        row.names=FALSE
    )

    svg(paste("deseq2/", coeff, "_ma_plot.svg", sep=""))
    plotMA(res_shrunken, ylim=c(-2,2)) # MA plot of the shrunken results for this coefficient
    dev.off()

    # Add the shrunken log2FC to the list.
    lfc <- res_shrunken[, "log2FoldChange", drop=FALSE]
    colnames(lfc) <- c(paste(coeff, "sLFC", sep="."))
    logfc.list[[i]] <- lfc
}

# Write combined results in the main output.
out.result <- do.call(cbind, logfc.list)
out.result <- out.result[order(out.result$padj),]
out.result <- cbind(geneid = rownames(out.result), out.result)
write.table(
    out.result,
    file=snakemake@output[["table"]],
    sep="\t",
    quote=FALSE,
    row.names=FALSE
)

# Use variance stabilising transformation and then plot PCA.
vsd <- vst(ddsFull, blind=FALSE)
svg(snakemake@output[["pca_plot"]])
plotPCA(vsd, intgroup=c("condition", "time"), ntop = 500)
dev.off()

Wrapper source (0.72.0-5-gd0d054c/bio/bowtie2/build):
__author__ = "Daniel Standage"
__copyright__ = "Copyright 2020, Daniel Standage"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
indexbase = snakemake.output[0].replace(".1.bt2", "")
shell(
    "bowtie2-build --threads {snakemake.threads} {extra} "
    "{snakemake.input.reference} {indexbase} {log}"
)

Wrapper source (0.72.0/bio/bowtie2/align):
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

n = len(snakemake.input.sample)
assert (
    n == 1 or n == 2
), "input->sample must have 1 (single-end) or 2 (paired-end) elements."

if n == 1:
    reads = "-U {}".format(*snakemake.input.sample)
else:
    reads = "-1 {} -2 {}".format(*snakemake.input.sample)

shell(
    "(bowtie2 --threads {snakemake.threads} {extra} "
    "-x {snakemake.params.index} {reads} "
    "| samtools view -Sbh -o {snakemake.output[0]} -) {log}"
)

Wrapper source (0.72.0/bio/cutadapt/pe):
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


n = len(snakemake.input)
assert n == 2, "Input must contain 2 (paired-end) elements."

extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

assert (
    extra != "" or adapters != ""
), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='."

shell(
    "cutadapt"
    " {adapters}"
    " {extra}"
    " -o {snakemake.output.fastq1}"
    " -p {snakemake.output.fastq2}"
    " -j {snakemake.threads}"
    " {snakemake.input}"
    " > {snakemake.output.qc} {log}"
)

Wrapper source (0.72.0/bio/cutadapt/se):
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


n = len(snakemake.input)
assert n == 1, "Input must contain 1 (single-end) element."

extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

assert (
    extra != "" or adapters != ""
), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='."

shell(
    "cutadapt"
    " {adapters}"
    " {extra}"
    " -j {snakemake.threads}"
    " -o {snakemake.output.fastq}"
    " {snakemake.input[0]}"
    " > {snakemake.output.qc} {log}"
)

Wrapper source (0.72.0/bio/multiqc):
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)