Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
rna_seq
This SnakeMake pipeline processes aligns bulk RNA-Seq datasets using STAR and finds differential gene expression between conditions using DESeq2.
1. Prepare you work environment
Code Snippets
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | import pandas as pd """Function accepts a STAR output directory and compiles all sample information from ReadsPerGene.out.tab Args: snakemake.input (list): list of globbed wildcards STAR ReadsPerGene.out.tab project_title (str): Project title for compiled STAR counts Returns: Compiled STAR counts as tab delimited file. """ colnames = snakemake.params.samples tables = [pd.read_csv(fh, header=None, skiprows=4, usecols=[0,3], index_col=0, sep = '\t', names = ['Genes',fh.split('/')[-2].split('_')[0]]) for fh in snakemake.input] joined_table = pd.concat(tables, axis=1) joined_table.columns = colnames joined_table_sorted = joined_table.reindex(sorted(joined_table.columns), axis = 1) joined_table_sorted.to_csv(snakemake.output[0], sep='\t') |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 | sink(file(snakemake@log[[1]], open = "wt"), type = "message") library(DESeq2) library(ggplot2) library(tibble) library(dplyr) library(viridisLite) library(pheatmap) library(yaml) parallel <- FALSE if (snakemake@threads > 1) { library("BiocParallel") register(MulticoreParam(snakemake@threads)) parallel <- TRUE } if (!dir.exists(snakemake@output$outdir)) {dir.create(snakemake@output$outdir)} # import data -------------------------------------------------------------------------------------\ # import counts counts = read.table(snakemake@input$counts, sep = "\t", header = TRUE, stringsAsFactors = FALSE, check.names = FALSE) gene_names = counts |> pull(Genes) # import metdata md = read.table(snakemake@input$md, header = TRUE, stringsAsFactors = FALSE, check.names = FALSE) rownames(md) = md$SampleID # subset md by config/GROUPS.txt if provided # allow user to upload a contrast combination file for easier interpretation of the output if (file.exists(snakemake@config$GROUPS)) { # import contrast combination file groups = read.table(snakemake@config$GROUPS) |> pull(1) # subset conditions in md with the conditions in config/GROUPS.txt md = md |> filter(Condition %in% groups) } # make the md reflect the available sample names md = md[md$SampleID %in% colnames(counts),] # make sure md rownames and counts colnames in the same order. rownames(md) = md$SampleID counts = counts[, rownames(md)] stopifnot(rownames(md) == colnames(counts)) # DESeq2 ------------------------------------------------------------------------------------------ # differential expression dds = DESeqDataSetFromMatrix(countData = counts, colData = md, design = as.formula("~Condition")) dds.lrt <- DESeq(dds, test="LRT", reduced=~1, parallel = parallel) res.lrt <- results(dds.lrt, cooksCutoff = Inf, independentFiltering=FALSE) # Obtain normalized counts # vsd = vst(dds.lrt, blind = FALSE) rld <- rlog(dds.lrt, blind=FALSE) # visualization ----------------------------------------------------------------------------------- colors_fwd <- viridis(100) colors_rev <- viridis(100, direction = -1) ### sample-sample distances sampleDists = dist(t(assay(rld)), diag = TRUE) sampleDistsMatrix = sampleDists |> as.matrix() rownames(sampleDistsMatrix) <- rld$SampleID colnames(sampleDistsMatrix) <- NULL distance_plot = paste0(snakemake@output$outdir, "/", snakemake@config$PROJECT_ID, "-group-sample-dists.pdf") pdf(distance_plot, width = 12, height = 12) pheatmap(sampleDistsMatrix, fontsize = 12, clustering_distance_rows = sampleDists, clustering_distance_cols = sampleDists, cluster_rows = TRUE, cluster_cols = TRUE, col = colors_rev ) dev.off() ### PCA plot pca_plot = paste0(snakemake@output$outdir, "/", snakemake@config$PROJECT_ID, "-group-pca.pdf") pdf(pca_plot, width = 12, height = 12) plotPCA(rld, intgroup=c("Condition")) dev.off() ### Gene expression - top 50 genes post DESeq2 normalization # top_ge = assay(rld)[top_genes,] top_gene_indices <- head(order(res.lrt$padj), 50) top_ge = counts(dds.lrt, normalized=TRUE)[top_gene_indices, ] rownames(top_ge) = gene_names[top_gene_indices] top_50 = paste0(snakemake@output$outdir, "/", snakemake@config$PROJECT_ID, "-top-50-genes-heatmap.pdf") pdf(top_50, width = 12, height = 12) pheatmap(top_ge, main = "Top 50 differential gene expression (DESeq2-normalized counts)", scale = "row", fontsize = 12, cluster_rows = TRUE, cluster_cols = TRUE ) dev.off() ### Gene expression - all genes by individual replicates all_ge = counts(dds.lrt, normalized=TRUE) rownames(all_ge) = gene_names all_ge = all_ge |> as.data.frame() |> distinct() |> as.matrix() all_ge = all_ge[apply(all_ge, 1, function(x) sd(x) != 0), ] all_genes_heatmap_individual = paste0(snakemake@output$outdir, "/", snakemake@config$PROJECT_ID, "-all-genes-heatmap-individual.pdf") pdf(all_genes_heatmap_individual, width = 12, height = 12) pheatmap(all_ge, main = "All gene expression (DESeq2-normalized counts) by individual samples", scale = "row", fontsize = 12, cluster_rows = TRUE, cluster_cols = TRUE, show_rownames = FALSE ) dev.off() ### Gene expression - all genes by merged replicates if (snakemake@config$MERGE_REPLICATES) { message("Merging replicates...") # read YAML file merge_scheme = read_yaml(snakemake@config$REPLICATES) for (name in names(merge_scheme)) { reps = merge_scheme[[name]] if (snakemake@config$MERGE_METHOD == "mean") { merged_counts = all_ge[, reps] |> apply(MARGIN = 1, FUN = mean) # same as rowMeans(counts[, reps]) } else if (snakemake@config$MERGE_METHOD == "median") { merged_counts = all_ge[, reps] |> apply(MARGIN = 1, FUN = median) } else { stop("ERROR: MERGE_METHOD must be either 'mean' or 'median'") } all_ge = all_ge |> as.data.frame() |> mutate(!!enexpr(name) := merged_counts) |> select(!all_of(reps)) |> as.matrix() } # remove rows with sd=0 one more time. all_ge = all_ge[apply(all_ge, 1, function(x) sd(x) != 0), ] # define output all_genes_heatmap_merged = paste0(snakemake@output$outdir, "/", snakemake@config$PROJECT_ID, "-all-genes-heatmap-merged.pdf") pdf(all_genes_heatmap_merged, width = 12, height = 12) pheatmap(all_ge, main = "All gene expression (DESeq2-normalized counts) by merged samples", scale = "row", fontsize = 12, cluster_rows = TRUE, cluster_cols = FALSE, # this respects the order of replicates.yaml show_rownames = FALSE ) dev.off() } |
R
ggplot2
dplyr
DESeq2
BiocParallel
PCAtools
pheatmap
tibble
yaml
viridisLite
From
line
1
of
scripts/deseq2-group.R
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | library(DESeq2) # import data ------------------------------------------------------------------------------------- # import counts counts = read.table(snakemake@input$counts, header = TRUE, stringsAsFactors = FALSE, check.names = FALSE, row.names = "Genes") # import metdata md = read.table(snakemake@input$md, header = TRUE, stringsAsFactors = FALSE, check.names = FALSE) rownames(md) = md$SampleID # make the md reflect the available sample names md = md[md$SampleID %in% colnames(counts),] # make sure md rownames and counts colnames in the same order. rownames(md) = md$SampleID counts = counts[, rownames(md)] stopifnot(rownames(md) == colnames(counts)) # create DESeq2 objects --------------------------------------------------------------------------- ddsMat = DESeqDataSetFromMatrix(countData = counts, colData = md, design = as.formula(snakemake@config$MODEL)) dds = DESeq(ddsMat) message('calculating deseq...') # Export DESeq2-normalized counts ----------------------------------------------------------------- normCounts = counts(dds, normalized=TRUE) normCounts = 0.1 + normCounts # ensure nonzero log2normCounts = log2(normCounts) log2normCounts = as.data.frame(log2normCounts) write.table(normCounts, snakemake@output$norm_counts, sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE) write.table(log2normCounts, snakemake@output$log2_norm_counts, sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 | sink(file(snakemake@log[[1]], open = "wt"), type = "message") library(DESeq2) library(ggplot2) library(tibble) library(dplyr) library(viridisLite) library(pheatmap) library(yaml) parallel <- FALSE if (snakemake@threads > 1) { library("BiocParallel") register(MulticoreParam(snakemake@threads)) parallel <- TRUE } c1 = snakemake@params$c1 c2 = snakemake@params$c2 outdir = snakemake@params$outdir contrast_name = paste0(snakemake@params$c1, '-vs-', snakemake@params$c2) # import data ------------------------------------------------------------------------------------- # import counts counts = read.table(snakemake@input$counts, sep = "\t", header = TRUE, stringsAsFactors = FALSE, check.names = FALSE) gene_names = counts |> pull(Genes) # import metdata md = read.table(snakemake@input$md, header = TRUE, stringsAsFactors = FALSE, check.names = FALSE) # subset and counts matrix by contrast groups md = md |> filter(Condition %in% c(c1, c2)) rownames(md) = md$SampleID # make sure md rownames and counts colnames in the same order. counts = counts[, rownames(md)] stopifnot(rownames(md) == colnames(counts)) # DESeq2 ------------------------------------------------------------------------------------------ # differential expression ddsMatrix = DESeqDataSetFromMatrix(countData = counts, colData = md, design = as.formula("~Condition")) dds <- DESeq(ddsMatrix, parallel = parallel) # Obtain normalized counts # vsd = vst(dds.lrt, blind = FALSE) rld <- rlog(dds, blind=FALSE) # write outputs ----------------------------------------------------------------------------------- # define outputs all_file = paste0(outdir, "/", contrast_name, "-all.txt") # contain all gene results for this pw comparison sig_file = paste0(outdir, "/", contrast_name, "-", snakemake@params$padj, ".txt") # contain significant gene results pw_pca = paste0(outdir, "/", contrast_name, "-pca.pdf") # pca of a subset of samples pw_ma = paste0(outdir, "/", contrast_name, "-MA.pdf") # MA of a subset of samples # extract results from DESeq object results = results(dds, contrast = c("Condition", c1, c2), cooksCutoff = FALSE) results_df = results |> as.data.frame() |> rownames_to_column("Genes") |> arrange(padj) all_sig = results_df |> filter(padj <= snakemake@params$padj) # export tables write.table(results_df, all_file, quote = FALSE, sep = "\t", row.names = FALSE) write.table(all_sig, sig_file, quote = FALSE, sep = "\t", row.names = FALSE) # output PCA plot pdf(pw_pca, width = 12, height = 12) plotPCA(rld, intgroup=c("Condition")) dev.off() # plot MA pdf(pw_ma, width = 12, height = 12) plotMA(results, intgroup=c("Condition")) dev.off() |
R
ggplot2
dplyr
DESeq2
BiocParallel
PCAtools
tibble
yaml
viridisLite
From
line
1
of
scripts/deseq2-pairwise.R
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 | import yaml import pandas as pd import os # identify the annotation provided by the STAR index (user-defined) # filter_col will be used for filtering columns later on. if snakemake.config["ANNOTATION_STYLE"] == "Genes": filter_col = "Gene name lower" elif snakemake.config["ANNOTATION_STYLE"] == "UCSC": filter_col = "UCSC Stable ID" elif snakemake.config["ANNOTATION_STYLE"] == "ENSEMBL_ID": filter_col = "Gene stable ID" elif snakemake.config["ANNOTATION_STYLE"] == "REFSEQ": filter_col = "RefSeq mRNA ID" elif snakemake.config["ANNOTATION_STYLE"] == "NCBI": filter_col = "NCBI gene (formerly Entrezgene) ID" else: print("ERROR: ANNOTATION_STYLE is undefined. Please select from one of the following options: ['Genes', 'UCSC', 'ENSEMBL_ID', 'REFSEQ', 'NCBI']") os.exit() # annotation file ------------------------------------------------------------- # open annotation file if snakemake.input["annotation"]: annotation = pd.read_csv(snakemake.config["ANNOTATION"], sep = "\t") else: print("ERROR: ANNOTATION key in config.yaml is undefined") os.exit() # edit content of the annotation file # turn gene type and names to lower-case to ensure string matching downstream. annotation["Gene type lower"] = annotation["Gene type"].transform(lambda x: str.lower(x)) annotation["Gene name lower"] = annotation["Gene name"].transform(lambda x: str(x).lower()) # import biotypes. depending on their values, get them into a list. biotypes = snakemake.config["BIOTYPES"] biotype_type = type(biotypes) if biotype_type == str: biotypes = [ snakemake.config["BIOTYPES"] ] elif biotype_type == list: biotypes = snakemake.config["BIOTYPES"] else: print("ERROR: BIOTYPES must be a single entry <string> or multiple entries <list of strings>") biotypes = [ i.lower() for i in biotypes ] # filter annotation file contents by biotype annotation = annotation[annotation["Gene type lower"].isin(biotypes)] filter_features = list(annotation[filter_col]) # use these features to filter counts table # open counts file ------------------------------------------------------------ # import counts counts = pd.read_csv(snakemake.input["counts"], sep = "\t") print("PREFILTER: Counts table rows = {}".format(counts.shape[0])) print("PREFILTER: Counts table columns = {}".format(counts.shape[1])) # filter features counts["Gene lower"] = counts["Genes"].transform(lambda x: str.lower(x)) counts = counts[counts["Gene lower"].isin(filter_features)] print("POSTFILTER: Counts table rows = {}".format(counts.shape[0])) print("POSTFILTER: Counts table columns = {}".format(counts.shape[1] - 1)) # minus one to adjust for lower-case gene names col counts = counts.drop("Gene lower", axis = 1) # export filtered counts counts.to_csv(str(snakemake.output), sep = "\t", index = False) |
104 105 106 107 108 109 110 111 112 113 | shell: "fastp " "-i {input.r1} " "-I {input.r2} " "-o {output.r1} " "-O {output.r2} " "--detect_adapter_for_pe " "--thread {threads} " "-j {log} " "-h /dev/null" |
125 126 | shell: "fastqc -t {threads} --outdir data/fastqc {input} > {log} 2>&1" |
139 140 | shell: "fastq_screen --aligner bowtie2 --threads {threads} --outdir data/fastq_screen --conf {input.config} --force {input.fastq} > {log} 2>&1" |
158 159 160 161 162 163 164 165 166 167 168 169 170 | shell: "STAR " "--runThreadN {threads} " "--runMode alignReads " "--genomeDir {params.genome_index} " "--readFilesIn {input.fwd} {input.rev} " "--outFileNamePrefix data/star/{wildcards.sample}_bam/ " "--sjdbGTFfile {params.gtf} " "--quantMode GeneCounts " "--sjdbGTFtagExonParentGene gene_name " "--outSAMtype BAM SortedByCoordinate " "--readFilesCommand zcat " "--twopassMode Basic" |
180 181 | shell: "samtools index -@ {threads} {input}" |
195 196 | shell: "preseq c_curve -B {resources.defect_mode} -l 1000000000 -P -o {output} {input} > {log} 2>&1" |
210 211 | shell: "preseq lc_extrap -B {resources.defect_mode} -l 1000000000 -P -e 1000000000 -o {output} {input} > {log} 2>&1" |
222 223 | shell: "bamCoverage -b {input[0]} -o {output} -p {threads} --normalizeUsing CPM --binSize 10 --smoothLength 50" |
239 240 241 | shell: "if [ '{params.singularity}' == 'true' ]; then export LC_ALL=C.UTF-8; export LANG=C.UTF-8; fi && " "multiqc data -f --ignore data/tmp -o data/multiqc 2>&1" |
252 253 | script: "scripts/compile_star_counts.py" |
261 262 | script: "scripts/filter_counts.py" |
276 277 | script: "scripts/deseq2-norm.R" |
299 300 | script: "scripts/deseq2-pairwise.R" |
317 318 | script: "scripts/deseq2-group.R" |
Support
Do you know this workflow well? If so, you can
request seller status , and start supporting this workflow.
Created: 1yr ago
Updated: 1yr ago
Maitainers:
public
URL:
https://github.com/maxsonBraunLab/rna_seq
Name:
rna_seq
Version:
1
Downloaded:
0
Copyright:
Public Domain
License:
MIT License
Keywords:
- Future updates
Related Workflows
ENCODE pipeline for histone marks developed for the psychENCODE project
psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project.
The o...
Near-real time tracking of SARS-CoV-2 in Connecticut
Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...
ATLAS - Three commands to start analyzing your metagenome data
Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...
raw sequence reads
Genome assembly
Annotation track
checkm2
gunc
prodigal
snakemake-wrapper-utils
MEGAHIT
Atlas
BBMap
Biopython
BioRuby
Bwa-mem2
cd-hit
CheckM
DAS
Diamond
eggNOG-mapper v2
MetaBAT 2
Minimap2
MMseqs
MultiQC
Pandas
Picard
pyfastx
SAMtools
SemiBin
Snakemake
SPAdes
SqueezeMeta
TADpole
VAMB
CONCOCT
ete3
gtdbtk
h5py
networkx
numpy
plotly
psutil
utils
metagenomics
RNA-seq workflow using STAR and DESeq2
This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...
This Snakemake pipeline implements the GATK best-practices workflow
This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...