Snakemake workflow: rna-seq-star-deseq2
This workflow performs a differential expression analysis with STAR and DESeq2.
Authors
- Johannes Köster (@johanneskoester), https://koesterlab.github.io
- Sebastian Schmeier (@sschmeier), https://sschmeier.com
- Jose Maturana (@matrs)
Usage
Simple
Step 1: Install workflow
If you simply want to use this workflow, download and extract the latest release. If you intend to modify and further extend this workflow or want to work under version control, fork this repository as outlined in Advanced. The latter way is recommended.
In any case, if you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this repository and, if available, its DOI (see above).
Step 2: Configure workflow
Configure the workflow according to your needs by editing the file `config.yaml`.
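Which entries you need to touch depends on your data. As a rough sketch (the sheet names `samples.tsv` and `units.tsv` below are assumptions; check which paths your copy of `config.yaml` actually references):

```bash
# Sketch only: adapt the shipped configuration to your project.
$EDITOR config.yaml   # reference/annotation paths, diffexp contrasts, PCA labels
$EDITOR samples.tsv   # hypothetical sample sheet: one row per sample and its condition
$EDITOR units.tsv     # hypothetical unit sheet: one row per sequencing unit / FASTQ pair
```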
Step 3: Execute workflow
Test your configuration by performing a dry-run via

```bash
snakemake --use-conda -n
```

Execute the workflow locally via

```bash
snakemake --use-conda --cores $N
```
using `$N` cores, or run it in a cluster environment via

```bash
snakemake --use-conda --cluster qsub --jobs 100
```

or

```bash
snakemake --use-conda --drmaa --jobs 100
```
See the Snakemake documentation for further details.
If you want to fix not only the software stack but also the underlying OS, use

```bash
snakemake --use-conda --use-singularity
```

in combination with any of the modes above.
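For example, a local run that pins both could combine the flags shown above:

```bash
# Combining the local execution mode with the Singularity option;
# $N is your core count, as above.
snakemake --use-conda --use-singularity --cores $N
```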
Step 4: Investigate results
After successful execution, you can create a self-contained interactive HTML report with all results via:
```bash
snakemake --report report.html
```
This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here.
Advanced
The following recipe provides established best practices for running and extending this workflow in a reproducible way; a command-line sketch follows the list.
- Fork the repo to a personal or lab account.
- Clone the fork to the desired working directory for the concrete project/run on your machine.
- Create a new branch (the project-branch) within the clone and switch to it. The branch will contain any project-specific modifications (e.g. to configuration, but also to code).
- Modify the config, and any necessary sheets (and probably the workflow) as needed.
- Commit any changes and push the project-branch to your fork on GitHub.
- Run the analysis.
- Optional: Merge back any valuable and generalizable changes to the upstream repo via a pull request. This would be greatly appreciated.
- Optional: Push results (plots/tables) to the remote branch on your fork.
- Optional: Create a self-contained workflow archive for publication along with the paper (`snakemake --archive`).
- Optional: Delete the local clone/workdir to free space.
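A minimal sketch of the fork/clone/branch/commit/run cycle in git terms (the account, directory, and branch names are placeholders):

```bash
# Fork on GitHub first, then clone your fork (placeholders throughout)
git clone https://github.com/<your-account>/rna-seq-star-deseq2.git my-project
cd my-project

# Create and switch to the project-branch
git checkout -b my-project-run

# ...edit config.yaml, sheets, and (if needed) rules, then record and publish...
git commit -am "Configure workflow for my-project"
git push -u origin my-project-run

# Run the analysis
snakemake --use-conda --cores $N
```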
Testing
Test cases are in the subfolder `.test`. They are automatically executed via continuous integration with Travis CI.
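To run them locally before pushing, an invocation along the following lines should work, assuming (as the CI setup does) that `.test` contains a self-contained test configuration:

```bash
# Run the workflow against the bundled test data from the repo root
snakemake --use-conda --directory .test --cores 1
```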
Code Snippets
The following `script:` directives are excerpted from the workflow's rules; each delegates to one of the scripts reproduced below.

- `script: "../scripts/count-matrix.py"`
- `script: "../scripts/deseq2-init.R"`
- `script: "../scripts/plot-pca.R"`
- `script: "../scripts/deseq2.R"`
- `script: "../scripts/gtf2bed.py"`
The RSeQC quality-control rules run the following `shell:` commands:

- `shell: "junction_annotation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} > {log[0]} 2>&1"`
- `shell: "junction_saturation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} > {log} 2>&1"`
- `shell: "bam_stat.py -i {input} > {output} 2> {log}"`
- `shell: "infer_experiment.py -r {input.bed} -i {input.bam} > {output} 2> {log}"`
- `shell: "inner_distance.py -r {input.bed} -i {input.bam} -o {params.prefix} > {log} 2>&1"`
- `shell: "read_distribution.py -r {input.bed} -i {input.bam} > {output} 2> {log}"`
- `shell: "read_duplication.py -i {input} -o {params.prefix} > {log} 2>&1"`
- `shell: "read_GC.py -i {input} -o {params.prefix} > {log} 2>&1"`
`../scripts/count-matrix.py`:

```python
import pandas as pd


def get_column(strandedness):
    if pd.isnull(strandedness) or strandedness == "none":
        return 1  # non stranded protocol
    elif strandedness == "yes":
        return 2  # 3rd column
    elif strandedness == "reverse":
        return 3  # 4th column, usually for Illumina truseq
    else:
        raise ValueError(("'strandedness' column should be empty or have the "
                          "value 'none', 'yes' or 'reverse', instead has the "
                          "value {}").format(repr(strandedness)))


counts = [pd.read_table(f, index_col=0, usecols=[0, get_column(strandedness)],
                        header=None, skiprows=4)
          for f, strandedness in zip(snakemake.input, snakemake.params.strand)]

for t, sample in zip(counts, snakemake.params.samples):
    t.columns = [sample]

matrix = pd.concat(counts, axis=1)
matrix.index.name = "gene"
# collapse technical replicates
matrix = matrix.groupby(matrix.columns, axis=1).sum()
matrix.to_csv(snakemake.output[0], sep="\t")
```
`../scripts/deseq2-init.R`:

```r
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("DESeq2")

parallel <- FALSE
if (snakemake@threads > 1) {
    library("BiocParallel")
    # setup parallelization
    register(MulticoreParam(snakemake@threads))
    parallel <- TRUE
}

# colData and countData must have the same sample order, but this is ensured
# by the way we create the count matrix
cts <- read.table(snakemake@input[["counts"]], header=TRUE, row.names="gene", check.names=FALSE)
coldata <- read.table(snakemake@params[["samples"]], header=TRUE, row.names="sample", check.names=FALSE)
dds <- DESeqDataSetFromMatrix(countData=cts,
                              colData=coldata,
                              design=~ condition)

# remove uninformative columns
dds <- dds[rowSums(counts(dds)) > 1, ]
# normalization and preprocessing
dds <- DESeq(dds, parallel=parallel)

saveRDS(dds, file=snakemake@output[[1]])
```
`../scripts/deseq2.R`:

```r
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("DESeq2")

parallel <- FALSE
if (snakemake@threads > 1) {
    library("BiocParallel")
    # setup parallelization
    register(MulticoreParam(snakemake@threads))
    parallel <- TRUE
}

dds <- readRDS(snakemake@input[[1]])

contrast <- c("condition", snakemake@params[["contrast"]])
res <- results(dds, contrast=contrast, parallel=parallel)
# shrink fold changes for lowly expressed genes
res <- lfcShrink(dds, contrast=contrast, res=res)
# sort by p-value
res <- res[order(res$padj),]
# TODO explore IHW usage

# store results
svg(snakemake@output[["ma_plot"]])
plotMA(res, ylim=c(-2,2))
dev.off()

write.table(as.data.frame(res), file=snakemake@output[["table"]])
```
`../scripts/gtf2bed.py`:

```python
import gffutils

db = gffutils.create_db(snakemake.input[0],
                        dbfn=snakemake.output.db,
                        force=True,
                        keep_order=True,
                        merge_strategy='merge',
                        sort_attribute_values=True,
                        disable_infer_genes=True,
                        disable_infer_transcripts=True)

with open(snakemake.output.bed, 'w') as outfileobj:
    for tx in db.features_of_type('transcript', order_by='start'):
        bed = [s.strip() for s in db.bed12(tx).split('\t')]
        bed[3] = tx.id
        outfileobj.write('{}\n'.format('\t'.join(bed)))
```
`../scripts/plot-pca.R`:

```r
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("DESeq2")

# load deseq2 data
dds <- readRDS(snakemake@input[[1]])

# obtain normalized counts
counts <- rlog(dds, blind=FALSE)
svg(snakemake@output[[1]])
plotPCA(counts, intgroup=snakemake@params[["pca_labels"]])
dev.off()
```