Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Snakemake workflow: rna-seq-star-deseq2
This workflow performs a differential expression analysis with STAR and Deseq2.
Authors
-
Johannes Köster (@johanneskoester), https://koesterlab.github.io
-
Sebastian Schmeier (@sschmeier), https://sschmeier.com
-
Jose Maturana (@matrs)
Usage
Simple
Step 1: Install workflow
If you simply want to use this workflow, download and extract the latest release . If you intend to modify and further extend this workflow or want to work under version control, fork this repository as outlined in Advanced . The latter way is recommended.
In any case, if you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this repository and, if available, its DOI (see above).
Step 2: Configure workflow
Configure the workflow according to your needs via editing the file
config.yaml
.
Step 3: Execute workflow
Test your configuration by performing a dry-run via
snakemake --use-conda -n
Execute the workflow locally via
snakemake --use-conda --cores $N
using
$N
cores or run it in a cluster environment via
snakemake --use-conda --cluster qsub --jobs 100
or
snakemake --use-conda --drmaa --jobs 100
See the Snakemake documentation for further details.
If you not only want to fix the software stack but also the underlying OS, use
snakemake --use-conda --use-singularity
in combination with any of the modes above.
Step 4: Investigate results
After successful execution, you can create a self-contained interactive HTML report with all results via:
snakemake --report report.html
This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here .
Advanced
The following recipe provides established best practices for running and extending this workflow in a reproducible way.
-
Fork the repo to a personal or lab account.
-
Clone the fork to the desired working directory for the concrete project/run on your machine.
-
Create a new branch (the project-branch) within the clone and switch to it. The branch will contain any project-specific modifications (e.g. to configuration, but also to code).
-
Modify the config, and any necessary sheets (and probably the workflow) as needed.
-
Commit any changes and push the project-branch to your fork on github.
-
Run the analysis.
-
Optional: Merge back any valuable and generalizable changes to the upstream repo via a pull request . This would be greatly appreciated .
-
Optional: Push results (plots/tables) to the remote branch on your fork.
-
Optional: Create a self-contained workflow archive for publication along with the paper (snakemake --archive).
-
Optional: Delete the local clone/workdir to free space.
Testing
Tests cases are in the subfolder
.test
. They are automtically executed via continuous integration with Travis CI.
Code Snippets
31 32 | wrapper: "0.19.4/bio/star/align" |
18 19 | script: "../scripts/count-matrix.py" |
40 41 | script: "../scripts/deseq2-init.R" |
55 56 | script: "../scripts/plot-pca.R" |
76 77 | script: "../scripts/deseq2.R" |
13 14 | script: "../scripts/gtf2bed.py" |
31 32 33 | shell: "junction_annotation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} " "> {log[0]} 2>&1" |
50 51 52 | shell: "junction_saturation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} " "> {log} 2>&1" |
65 66 | shell: "bam_stat.py -i {input} > {output} 2> {log}" |
80 81 | shell: "infer_experiment.py -r {input.bed} -i {input.bam} > {output} 2> {log}" |
97 98 | shell: "inner_distance.py -r {input.bed} -i {input.bam} -o {params.prefix} > {log} 2>&1" |
112 113 | shell: "read_distribution.py -r {input.bed} -i {input.bam} > {output} 2> {log}" |
128 129 | shell: "read_duplication.py -i {input} -o {params.prefix} > {log} 2>&1" |
144 145 | shell: "read_GC.py -i {input} -o {params.prefix} > {log} 2>&1" |
164 165 | wrapper: "0.31.1/bio/multiqc" |
16 17 | wrapper: "0.17.4/bio/cutadapt/pe" |
30 31 | wrapper: "0.17.4/bio/cutadapt/se" |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | import pandas as pd def get_column(strandedness): if pd.isnull(strandedness) or strandedness == "none": return 1 #non stranded protocol elif strandedness == "yes": return 2 #3rd column elif strandedness == "reverse": return 3 #4th column, usually for Illumina truseq else: raise ValueError(("'strandedness' column should be empty or have the " "value 'none', 'yes' or 'reverse', instead has the " "value {}").format(repr(strandedness))) counts = [pd.read_table(f, index_col=0, usecols=[0, get_column(strandedness)], header=None, skiprows=4) for f, strandedness in zip(snakemake.input, snakemake.params.strand)] for t, sample in zip(counts, snakemake.params.samples): t.columns = [sample] matrix = pd.concat(counts, axis=1) matrix.index.name = "gene" # collapse technical replicates matrix = matrix.groupby(matrix.columns, axis=1).sum() matrix.to_csv(snakemake.output[0], sep="\t") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("DESeq2") parallel <- FALSE if (snakemake@threads > 1) { library("BiocParallel") # setup parallelization register(MulticoreParam(snakemake@threads)) parallel <- TRUE } # colData and countData must have the same sample order, but this is ensured # by the way we create the count matrix cts <- read.table(snakemake@input[["counts"]], header=TRUE, row.names="gene", check.names=FALSE) coldata <- read.table(snakemake@params[["samples"]], header=TRUE, row.names="sample", check.names=FALSE) dds <- DESeqDataSetFromMatrix(countData=cts, colData=coldata, design=~ condition) # remove uninformative columns dds <- dds[ rowSums(counts(dds)) > 1, ] # normalization and preprocessing dds <- DESeq(dds, parallel=parallel) saveRDS(dds, file=snakemake@output[[1]]) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("DESeq2") parallel <- FALSE if (snakemake@threads > 1) { library("BiocParallel") # setup parallelization register(MulticoreParam(snakemake@threads)) parallel <- TRUE } dds <- readRDS(snakemake@input[[1]]) contrast <- c("condition", snakemake@params[["contrast"]]) res <- results(dds, contrast=contrast, parallel=parallel) # shrink fold changes for lowly expressed genes res <- lfcShrink(dds, contrast=contrast, res=res) # sort by p-value res <- res[order(res$padj),] # TODO explore IHW usage # store results svg(snakemake@output[["ma_plot"]]) plotMA(res, ylim=c(-2,2)) dev.off() write.table(as.data.frame(res), file=snakemake@output[["table"]]) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | import gffutils db = gffutils.create_db(snakemake.input[0], dbfn=snakemake.output.db, force=True, keep_order=True, merge_strategy='merge', sort_attribute_values=True, disable_infer_genes=True, disable_infer_transcripts=True) with open(snakemake.output.bed, 'w') as outfileobj: for tx in db.features_of_type('transcript', order_by='start'): bed = [s.strip() for s in db.bed12(tx).split('\t')] bed[3] = tx.id outfileobj.write('{}\n'.format('\t'.join(bed))) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("DESeq2") # load deseq2 data dds <- readRDS(snakemake@input[[1]]) # obtain normalized counts counts <- rlog(dds, blind=FALSE) svg(snakemake@output[[1]]) plotPCA(counts, intgroup=snakemake@params[["pca_labels"]]) dev.off() |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell n = len(snakemake.input) assert n == 2, "Input must contain 2 (paired-end) elements." log = snakemake.log_fmt_shell(stdout=False, stderr=True) shell( "cutadapt" " {snakemake.params}" " -o {snakemake.output.fastq1}" " -p {snakemake.output.fastq2}" " {snakemake.input}" " > {snakemake.output.qc} {log}") |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "[email protected]" __license__ = "MIT" from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=False, stderr=True) shell( "cutadapt" " {snakemake.params}" " -o {snakemake.output.fastq}" " {snakemake.input[0]}" " > {snakemake.output.qc} {log}") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "[email protected]" __license__ = "MIT" import os from snakemake.shell import shell extra = snakemake.params.get("extra", "") log = snakemake.log_fmt_shell(stdout=True, stderr=True) sample = [snakemake.input.sample] if isinstance(snakemake.input.sample, str) else snakemake.input.sample n = len(sample) assert n == 1 or n == 2, "input->sample must have 1 (single-end) or 2 (paired-end) elements." if sample[0].endswith(".gz"): readcmd = "--readFilesCommand zcat" else: readcmd = "" outprefix = os.path.dirname(snakemake.output[0]) + "/" shell( "STAR " "{snakemake.params.extra} " "--runThreadN {snakemake.threads} " "--genomeDir {snakemake.params.index} " "--readFilesIn {snakemake.input.sample} " "{readcmd} " "--outSAMtype BAM Unsorted " "--outFileNamePrefix {outprefix} " "--outStd Log " "{log}") |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "[email protected]" __license__ = "MIT" from os import path from snakemake.shell import shell input_dirs = set(path.dirname(fp) for fp in snakemake.input) output_dir = path.dirname(snakemake.output[0]) output_name = path.basename(snakemake.output[0]) log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "multiqc" " {snakemake.params}" " --force" " -o {output_dir}" " -n {output_name}" " {input_dirs}" " {log}") |
Support
- Future updates