Snakemake workflow: rna-seq-star-deseq2
This workflow performs a differential expression analysis with STAR and DESeq2.
Authors
- Johannes Köster (@johanneskoester), https://koesterlab.github.io
- Sebastian Schmeier (@sschmeier), https://sschmeier.com
- Jose Maturana (@matrs)
Usage
Simple
Step 1: Install workflow
If you simply want to use this workflow, download and extract the latest release. If you intend to modify and further extend this workflow or want to work under version control, fork this repository as outlined in Advanced. The latter way is recommended.
In any case, if you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this repository and, if available, its DOI (see above).
Step 2: Configure workflow
Configure the workflow according to your needs by editing the file config.yaml.
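The exact contents depend on your samples and comparisons. As a rough, hypothetical sketch (the key names below are assumptions inferred from the rules and scripts shown further down, such as the diffexp contrast referenced in deseq2.R, and not a verbatim copy of the shipped config), config.yaml might look like:

```yaml
# Hypothetical sketch -- key names are assumptions inferred from the scripts
# in this repository, not the actual shipped config.yaml.
samples: samples.tsv              # sample sheet with "sample" and "condition" columns
diffexp:
  contrasts:
    # contrast name referenced by deseq2.R: the two condition levels to compare
    MJC_cambium-vs-MJ_cambium:
      - MJC_cambium
      - MJ_cambium
pca:
  labels:
    - condition                   # column(s) used to label the PCA plot
```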
Step 3: Execute workflow
Test your configuration by performing a dry-run via
snakemake --use-conda -n
Execute the workflow locally via
snakemake --use-conda --cores $N
using $N cores, or run it in a cluster environment via
snakemake --use-conda --cluster qsub --jobs 100
or
snakemake --use-conda --drmaa --jobs 100
See the Snakemake documentation for further details.
If you want to fix not only the software stack but also the underlying OS, use
snakemake --use-conda --use-singularity
in combination with any of the modes above.
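For example, a containerized local run that combines these options can be started via
snakemake --use-conda --use-singularity --cores $N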
Step 4: Investigate results
After successful execution, you can create a self-contained interactive HTML report with all results via:
snakemake --report report.html
This report can, for example, be forwarded to your collaborators. An example report (generated from some trivial test data) can be seen here.
Advanced
The following recipe provides established best practices for running and extending this workflow in a reproducible way; a command-line sketch of the git steps is given after the list.
- Fork the repo to a personal or lab account.
- Clone the fork to the desired working directory for the concrete project/run on your machine.
- Create a new branch (the project-branch) within the clone and switch to it. The branch will contain any project-specific modifications (e.g. to configuration, but also to code).
- Modify the config, and any necessary sheets (and probably the workflow) as needed.
- Commit any changes and push the project-branch to your fork on GitHub.
- Run the analysis.
- Optional: Merge back any valuable and generalizable changes to the upstream repo via a pull request. This would be greatly appreciated.
- Optional: Push results (plots/tables) to the remote branch on your fork.
- Optional: Create a self-contained workflow archive for publication along with the paper (snakemake --archive).
- Optional: Delete the local clone/workdir to free space.
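As a command-line sketch of the git steps above (the account name, branch name, and commit message are placeholders rather than values from this repository):

```bash
# Hypothetical sketch of the fork/branch recipe above; <your-account> and
# "my-project" are placeholders.
git clone git@github.com:<your-account>/rna-seq-star-deseq2.git
cd rna-seq-star-deseq2
git checkout -b my-project          # the project-branch

# edit config.yaml, sample sheets, and (if needed) the workflow code, then:
git add -A
git commit -m "configure workflow for my project"
git push -u origin my-project

snakemake --use-conda --cores 4     # run the analysis
```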
Testing
Test cases are in the subfolder .test. They are automatically executed via continuous integration with Travis CI.
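To try the test case locally, one possible invocation is a dry-run pointed at that folder (this assumes the test data works with Snakemake's standard --directory option, which is not documented here):
snakemake --use-conda -n --directory .test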
Code Snippets
```
script:
    "../scripts/percent_aligned_report.py"
```
```
script:
    "../scripts/percent_aligned_report.py"
```
```
script:
    "../scripts/count-matrix-bams.py"
```
```
run:
    print("Input is:\t")
    print(input[0])
    spike_in_counts = pd.read_table(input[0]).set_index("gene", drop=True)
    # for each sample, get the coefficient of variation (CoD) across all genes
    sample_names = spike_in_counts.columns
    coeff_of_deviations = []
    for curr_sample in sample_names:
        coeff_of_deviations.append(variation(spike_in_counts[curr_sample]))
    df = pd.DataFrame({
        "sample": sample_names,
        "coeff_of_deviations": coeff_of_deviations
    })
    df.to_csv(output[0], sep="\t")
```
```
shell:
    "unset TMPDIR; module load star/2.7.5c; set -euo pipefail; " +
    "STAR --readFilesCommand zcat {params.extra} --runThreadN {threads} --genomeDir {params.index} " +
    "--readFilesIn {input.fastq1} {input.fastq2} --outSAMtype BAM SortedByCoordinate " +
    "--outFileNamePrefix {params.star_prefix} --outStd Log > {log} 2>&1"
```
```
shell:
    "unset TMPDIR; module load star/2.7.5c; set -euo pipefail; " +
    "STAR --readFilesCommand zcat {params.extra} --runThreadN {threads} --genomeDir {params.index} " +
    "--readFilesIn {input.fastq1} {input.fastq2} --outSAMtype BAM SortedByCoordinate " +
    "--outFileNamePrefix {params.star_prefix} --outStd Log > {log} 2>&1"
```
```
script:
    "../scripts/count-matrix-bams.py"
```
```
script:
    "../scripts/count-matrix-bams.py"
```
```
shell:
    """
    module load R;
    Rscript scripts/deseq2-init.R {input.counts} {output} {params.samples} {log} {threads} &> {log}
    """
```
```
shell:
    """
    module load R;
    Rscript scripts/plot-pca.R {input} {output} {params.pca_labels} {log}
    """
```
```
shell:
    """
    module load R;
    Rscript scripts/deseq2.R {input} {output.table} {output.ma_plot} {params.config_yaml} {log} {threads}
    """
```
```
shell:
    """
    module load R;
    Rscript scripts/normalized_counts.R {input} {output} {log}
    """
```
```
shell:
    "unset TMPDIR; module load samtools; samtools index {input}"
```
```
shell:
    "unset TMPDIR; module load samtools; samtools index {input}"
```
```
shell:
    "unset TMPDIR; module load cutadapt; cutadapt {params} -o {output.fastq1} -p {output.fastq2} {input} > {output.qc}"
```
```python
import pysam
import pandas as pd
import numpy as np


def get_sample_counts(bam_file, reference):
    # count the reads aligned to each reference sequence listed in the
    # STAR index's chrName.txt
    bamfile_obj = pysam.AlignmentFile(bam_file, 'rb')
    ref = open(reference + "/chrName.txt", 'r')
    curr_file_counts = []
    ref_list = []
    for seq in ref:
        seq = seq.strip()
        curr_seq_reads = bamfile_obj.fetch(seq)
        read_counter = 0
        for curr_read in curr_seq_reads:
            read_counter = read_counter + 1
        curr_file_counts.extend([read_counter])
        ref_list.extend([seq])
    counts = pd.DataFrame(curr_file_counts)
    counts = counts.transpose()
    counts.columns = ref_list
    counts = counts.transpose()
    return counts


############################################################

bam_files = snakemake.input.bams
counts = [get_sample_counts(f, snakemake.params.ref) for f in bam_files]

samples = snakemake.params.samples
print("checking sample order in the python script")
print(samples)

for t, sample in zip(counts, samples):
    t.columns = [sample]

print("checking sample order in the python script after the zip")
for t in counts:
    print(t.columns)

matrix = pd.concat(counts, axis=1)
matrix.index.name = "gene"
print("Checking columns in matrix before groupby")
print(matrix.columns)
matrix = matrix.groupby(matrix.columns, axis=1).sum()
print("Checking columns in matrix")
print(matrix.columns)
matrix.to_csv(snakemake.output[0], sep="\t")
```
```r
args = commandArgs(trailingOnly=TRUE)
counts_file = args[1]
output_file = args[2]
sample_file = args[3]
log_file = args[4]
threads = as.integer(args[5])  # coerce so the comparison and MulticoreParam() receive a number

log <- file(log_file, open="wt")
sink(log)
sink(log, type="message")

library("DESeq2")
library(dplyr)

parallel <- FALSE
if (threads > 1) {
    library("BiocParallel")
    # setup parallelization
    register(MulticoreParam(threads))
    parallel <- TRUE
}

# colData and countData must have the same sample order, but this is ensured
# by the way we create the count matrix
cts <- read.table(counts_file, header=TRUE, row.names="gene", check.names=FALSE)
coldata <- read.table(sample_file, header=TRUE, row.names="sample", check.names=FALSE)

sorted_cts <- cts[order(names(cts))]
dds <- DESeqDataSetFromMatrix(countData=sorted_cts,
                              colData=coldata[order(row.names(coldata)), ],
                              design=~ condition)

# remove uninformative columns
dds <- dds[rowSums(counts(dds)) > 1, ]
# normalization and preprocessing
dds <- DESeq(dds, parallel=parallel)

saveRDS(dds, file=output_file)
```
```r
args = commandArgs(trailingOnly=TRUE)
rds = args[1]
output_table = args[2]
output_ma_plot = args[3]
params.config_yaml = args[4]
log = args[5]
threads = as.integer(args[6])  # coerce so the comparison and MulticoreParam() receive a number

log <- file(log, open="wt")
sink(log)
sink(log, type="message")

library("DESeq2")

parallel <- FALSE
if (threads > 1) {
    library("BiocParallel")
    # setup parallelization
    register(MulticoreParam(threads))
    parallel <- TRUE
}

dds <- readRDS(rds)

library("yaml")
config_yaml <- read_yaml(params.config_yaml)
contrast <- c("condition", as.vector(config_yaml$diffexp$contrasts$"MJC_cambium-vs-MJ_cambium"))
res <- results(dds, contrast=contrast, parallel=parallel)
# shrink fold changes for lowly expressed genes
res <- lfcShrink(dds, contrast=contrast, res=res, type="normal")
# sort by p-value
res <- res[order(res$padj), ]
# TODO explore IHW usage

# store results
svg(output_ma_plot)
plotMA(res, ylim=c(-8, 8))
dev.off()
write.table(as.data.frame(res), file=output_table)
```
```r
args = commandArgs(trailingOnly=TRUE)
rds = args[1]
output_table = args[2]
log = args[3]

log <- file(log, open="wt")
sink(log)
sink(log, type="message")

library("DESeq2")

dds <- readRDS(rds)
normalized_counts <- counts(dds, normalized=TRUE)

# write the normalized counts to the output path passed on the command line
write.table(normalized_counts, file=output_table, sep="\t", quote=FALSE, col.names=NA)
```
```python
import pysam
import pandas as pd
import numpy as np
from scipy.stats import variation
import re


def get_number_of_reads_in_cat(star_log_file, target_string):
    # parse the read count from the STAR log line that contains target_string
    star_log = open(star_log_file, 'r')
    for line in star_log:
        if target_string in line:
            regex = r"\s\|\s(\d+)"
            match_list = re.findall(regex, line, flags=0)
            # return as an integer so the downstream arithmetic works
            return int(match_list[0])


def get_sample_counts(star_log_file):
    # collect the read counts for each STAR log category of interest
    curr_file_counts = []
    target_strings = ["Number of input reads",
                      "Uniquely mapped reads number",
                      "Number of reads mapped to multiple loci",
                      "Number of reads mapped to too many loci",
                      "Number of reads unmapped: too many mismatches",
                      "Number of reads unmapped: too short",
                      "Number of reads unmapped: other"]
    for string in target_strings:
        curr_file_counts.append(get_number_of_reads_in_cat(star_log_file, string))
    counts = pd.DataFrame(curr_file_counts)
    counts = counts.transpose()
    counts.columns = target_strings
    counts = counts.transpose()
    return counts


############################################################

star_logs = np.unique(snakemake.input)
counts = [get_sample_counts(f) for f in star_logs]

samples = snakemake.params.samples
for t, sample in zip(counts, samples):
    t.columns = [sample]

matrix = pd.concat(counts, axis=1)
matrix.index.name = "gene"
matrix = matrix.groupby(matrix.columns, axis=1).sum()
matrix = matrix.transpose()

matrix["proportion_of_library_reads_out_of_all_reads"] = \
    matrix["Number of input reads"] / matrix["Number of input reads"].sum()
matrix["percent_of_library_reads_mapped"] = \
    (matrix["Uniquely mapped reads number"] + matrix["Number of reads mapped to multiple loci"]) / matrix["Number of input reads"]

matrix.to_csv(snakemake.output[0], sep="\t")
```