Snakemake workflow: rna-seq-star-deseq2
This workflow performs a differential expression analysis with STAR and DESeq2.
Authors
- Johannes Köster (@johanneskoester), https://koesterlab.github.io
- Sebastian Schmeier (@sschmeier), https://sschmeier.com
- Jose Maturana (@matrs)
Usage
Simple
Step 1: Install workflow
If you simply want to use this workflow, download and extract the latest release. If you intend to modify and further extend this workflow or want to work under version control, fork this repository as outlined in Advanced. The latter way is recommended.
In any case, if you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this repository and, if available, its DOI (see above).
Step 2: Configure workflow
Configure the workflow according to your needs by editing the file `config.yaml`.
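Which entries you need to touch depends on your data. As a rough sketch (the sheet names `samples.tsv` and `units.tsv` below are assumptions; check which paths your copy of `config.yaml` actually references):

```bash
# Sketch only: adapt the shipped configuration to your project.
$EDITOR config.yaml   # reference/annotation paths, diffexp contrasts, PCA labels
$EDITOR samples.tsv   # hypothetical sample sheet: one row per sample and its condition
$EDITOR units.tsv     # hypothetical unit sheet: one row per sequencing unit / FASTQ pair
```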
Step 3: Execute workflow
Test your configuration by performing a dry-run via

```bash
snakemake --use-conda -n
```

Execute the workflow locally via

```bash
snakemake --use-conda --cores $N
```
using `$N` cores, or run it in a cluster environment via

```bash
snakemake --use-conda --cluster qsub --jobs 100
```

or

```bash
snakemake --use-conda --drmaa --jobs 100
```
See the Snakemake documentation for further details.
If you want to fix not only the software stack but also the underlying OS, use

```bash
snakemake --use-conda --use-singularity
```

in combination with any of the modes above.
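For example, a local run that pins both could combine the flags shown above:

```bash
# Combining the local execution mode with the Singularity option;
# $N is your core count, as above.
snakemake --use-conda --use-singularity --cores $N
```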
Step 4: Investigate results
After successful execution, you can create a self-contained interactive HTML report with all results via:
```bash
snakemake --report report.html
```
This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here.
Advanced
The following recipe provides established best practices for running and extending this workflow in a reproducible way; a command-line sketch follows the list.
- Fork the repo to a personal or lab account.
- Clone the fork to the desired working directory for the concrete project/run on your machine.
- Create a new branch (the project-branch) within the clone and switch to it. The branch will contain any project-specific modifications (e.g. to configuration, but also to code).
- Modify the config, and any necessary sheets (and probably the workflow) as needed.
- Commit any changes and push the project-branch to your fork on GitHub.
- Run the analysis.
- Optional: Merge back any valuable and generalizable changes to the upstream repo via a pull request. This would be greatly appreciated.
- Optional: Push results (plots/tables) to the remote branch on your fork.
- Optional: Create a self-contained workflow archive for publication along with the paper (`snakemake --archive`).
- Optional: Delete the local clone/workdir to free space.
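A minimal sketch of the fork/clone/branch/commit/run cycle in git terms (the account, directory, and branch names are placeholders):

```bash
# Fork on GitHub first, then clone your fork (placeholders throughout)
git clone https://github.com/<your-account>/rna-seq-star-deseq2.git my-project
cd my-project

# Create and switch to the project-branch
git checkout -b my-project-run

# ...edit config.yaml, sheets, and (if needed) rules, then record and publish...
git commit -am "Configure workflow for my-project"
git push -u origin my-project-run

# Run the analysis
snakemake --use-conda --cores $N
```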
Testing
Test cases are in the subfolder `.test`. They are automatically executed via continuous integration with Travis CI.
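To run them locally before pushing, an invocation along the following lines should work, assuming (as the CI setup does) that `.test` contains a self-contained test configuration:

```bash
# Run the workflow against the bundled test data from the repo root
snakemake --use-conda --directory .test --cores 1
```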
Code Snippets
The following `script:` directives are excerpted from the workflow's rules; each delegates to one of the scripts reproduced below.

- `script: "../scripts/count-matrix.py"`
- `script: "../scripts/deseq2-init.R"`
- `script: "../scripts/plot-pca.R"`
- `script: "../scripts/deseq2.R"`
- `script: "../scripts/gtf2bed.py"`
The RSeQC quality-control rules run the following `shell:` commands:

- `shell: "junction_annotation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} > {log[0]} 2>&1"`
- `shell: "junction_saturation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} > {log} 2>&1"`
- `shell: "bam_stat.py -i {input} > {output} 2> {log}"`
- `shell: "infer_experiment.py -r {input.bed} -i {input.bam} > {output} 2> {log}"`
- `shell: "inner_distance.py -r {input.bed} -i {input.bam} -o {params.prefix} > {log} 2>&1"`
- `shell: "read_distribution.py -r {input.bed} -i {input.bam} > {output} 2> {log}"`
- `shell: "read_duplication.py -i {input} -o {params.prefix} > {log} 2>&1"`
- `shell: "read_GC.py -i {input} -o {params.prefix} > {log} 2>&1"`
`../scripts/count-matrix.py`:

```python
import pandas as pd


def get_column(strandedness):
    if pd.isnull(strandedness) or strandedness == "none":
        return 1  # non stranded protocol
    elif strandedness == "yes":
        return 2  # 3rd column
    elif strandedness == "reverse":
        return 3  # 4th column, usually for Illumina truseq
    else:
        raise ValueError(("'strandedness' column should be empty or have the "
                          "value 'none', 'yes' or 'reverse', instead has the "
                          "value {}").format(repr(strandedness)))


counts = [pd.read_table(f, index_col=0, usecols=[0, get_column(strandedness)],
                        header=None, skiprows=4)
          for f, strandedness in zip(snakemake.input, snakemake.params.strand)]

for t, sample in zip(counts, snakemake.params.samples):
    t.columns = [sample]

matrix = pd.concat(counts, axis=1)
matrix.index.name = "gene"
# collapse technical replicates
matrix = matrix.groupby(matrix.columns, axis=1).sum()
matrix.to_csv(snakemake.output[0], sep="\t")
```
`../scripts/deseq2-init.R`:

```r
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("DESeq2")

parallel <- FALSE
if (snakemake@threads > 1) {
    library("BiocParallel")
    # setup parallelization
    register(MulticoreParam(snakemake@threads))
    parallel <- TRUE
}

# colData and countData must have the same sample order, but this is ensured
# by the way we create the count matrix
cts <- read.table(snakemake@input[["counts"]], header=TRUE, row.names="gene", check.names=FALSE)
coldata <- read.table(snakemake@params[["samples"]], header=TRUE, row.names="sample", check.names=FALSE)
dds <- DESeqDataSetFromMatrix(countData=cts,
                              colData=coldata,
                              design=~ condition)

# remove uninformative columns
dds <- dds[rowSums(counts(dds)) > 1, ]
# normalization and preprocessing
dds <- DESeq(dds, parallel=parallel)

saveRDS(dds, file=snakemake@output[[1]])
```
`../scripts/deseq2.R`:

```r
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("DESeq2")

parallel <- FALSE
if (snakemake@threads > 1) {
    library("BiocParallel")
    # setup parallelization
    register(MulticoreParam(snakemake@threads))
    parallel <- TRUE
}

dds <- readRDS(snakemake@input[[1]])

contrast <- c("condition", snakemake@params[["contrast"]])
res <- results(dds, contrast=contrast, parallel=parallel)
# shrink fold changes for lowly expressed genes
res <- lfcShrink(dds, contrast=contrast, res=res)
# sort by p-value
res <- res[order(res$padj),]
# TODO explore IHW usage

# store results
svg(snakemake@output[["ma_plot"]])
plotMA(res, ylim=c(-2,2))
dev.off()

write.table(as.data.frame(res), file=snakemake@output[["table"]])
```
`../scripts/gtf2bed.py`:

```python
import gffutils

db = gffutils.create_db(snakemake.input[0],
                        dbfn=snakemake.output.db,
                        force=True,
                        keep_order=True,
                        merge_strategy='merge',
                        sort_attribute_values=True,
                        disable_infer_genes=True,
                        disable_infer_transcripts=True)

with open(snakemake.output.bed, 'w') as outfileobj:
    for tx in db.features_of_type('transcript', order_by='start'):
        bed = [s.strip() for s in db.bed12(tx).split('\t')]
        bed[3] = tx.id
        outfileobj.write('{}\n'.format('\t'.join(bed)))
```
`../scripts/plot-pca.R`:

```r
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("DESeq2")

# load deseq2 data
dds <- readRDS(snakemake@input[[1]])

# obtain normalized counts
counts <- rlog(dds, blind=FALSE)
svg(snakemake@output[[1]])
plotPCA(counts, intgroup=snakemake@params[["pca_labels"]])
dev.off()
```