Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
- Lack of a description for a new keyword DESeq2 .
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
This workflow performs a differential expression analysis using STAR and DESeq2 . It performs a wide range of quality control steps and bundles there results in a qc report via MultiQC . Reported results include:
-
PCA plot of all samples
-
differentially up and down regulated genes for each analysis (ie treatment vs control)
-
MA plot for each analysis
-
heatmap of counts and log2 Fold Changes
-
lollipop plot of enriched gene sets
Usage
Step 1: Configure workflow
You will likely run these analyses on the HPCC, but you will want access to the results locally. My strategy is to have the analysis repository on the local machine and on the HPCC, in the same file path location (relative to the home directory).
We will start by creating the local repository (on your computer):
-
Click on 'Use this template', which will copy the content of this repository to a new remote repositiory on github. Find a good name for your analysis and use it as the name for the repository;
-
Copy git link from this new repository, move to the parent folder for your RNA-seq analysis (ie where you want to save your new analyses/results) and create a local repository on your computer using git clone;
-
Move into the newly created directory (repository);
-
Configure the workflow by editing the file
config.yaml
;-
Choose mouse or human for the STAR Index;
-
Modify the pca columns, design for deseq2, organism of interest, the contrasts, gene sets for (optional) enrichment analysis;
-
Add gene set files to gene_sets directory
-
-
Configure the
samples.txt
andunits.txt
files with project specific annotations and file;- Hint: Check for rogue leading/trailing spaces in your file paths/sample names etc;
-
Git push changes to master.
We will now set up the analysis on the HPCC:
-
Copy the git link again and create a repository on the cluster (same path, see above) using git clone;
-
Move into the newly created directory (repository);
-
Create a reads directory and copy over the reads (fastq.gz files) from the sequencing facility.
To run the analysis, activate your conda snakemake environment:
conda activate snakemake
Step 2: Execute workflow
Test your configuration by performing a dry-run via
snakemake --use-conda -n
Execute the workflow locally via
snakemake --use-conda --cores $N
using
$N
cores or run it on the
UGE cluster environment
snakemake --use-conda --profile uge
Step 3: Investigate results
After successful execution, you can create a self-contained interactive HTML report via:
snakemake --report report.html
-
to look at the results, use secure copy
scp
to transfer thereport.html
and the quality control reportqc/multiqc.html
from the HPCC to your local computer (this is where storing them in the same location relative to the home directory comes in handy); -
explore the results;
-
forward to your collaborators.
Code Snippets
21 22 23 24 25 26 27 28 29 30 31 32 | shell: """ STAR \ {params.extra} \ --runThreadN {threads} \ --runMode alignReads \ --genomeDir {params.index} \ --readFilesIn {input.fq1} {input.fq2} {params.readcmd} \ --outReadsUnmapped Fastq \ --outSAMtype BAM Unsorted \ --outFileNamePrefix star/{wildcards.sample}-{wildcards.unit}/ """ |
46 47 48 49 | shell: """ samtools sort {input} -o {output} -@ {threads} -m 20G """ |
63 64 65 66 | shell: """ samtools index {input} -@ {threads} """ |
16 17 | script: "../scripts/count-matrix.py" |
34 35 | script: "../scripts/deseq2-init.R" |
53 54 | script: "../scripts/plot-pca.R" |
73 74 | script: "../scripts/deseq2.R" |
96 97 | script: "../scripts/plot-heatmap.R" |
124 125 | script: "../scripts/fgsea.R" |
137 138 | script: "../scripts/plot-volcano.R" |
11 12 | script: "../scripts/gtf2bed.py" |
30 31 32 33 34 35 36 37 | shell: """ junction_annotation.py {params.extra} \ -i {input.bam} \ -r {input.bed} \ -o {params.prefix} > {log[0]} 2>&1 """ |
55 56 57 58 59 60 61 62 | shell: """ junction_saturation.py {params.extra} \ -i {input.bam} \ -r {input.bed} \ -o {params.prefix} > {log} 2>&1 """ |
76 77 | shell: "bam_stat.py -i {input} > {output} 2> {log}" |
92 93 | shell: "infer_experiment.py -r {input.bed} -i {input.bam} > {output} 2> {log}" |
110 111 112 113 114 115 116 | shell: """ inner_distance.py \ -r {input.bed} \ -i {input.bam} \ -o {params.prefix} > {log} 2>&1 """ |
131 132 | shell: "read_distribution.py -r {input.bed} -i {input.bam} > {output} 2> {log}" |
148 149 | shell: "read_duplication.py -i {input} -o {params.prefix} > {log} 2>&1" |
165 166 | shell: "read_GC.py -i {input} -o {params.prefix} > {log} 2>&1" |
196 197 198 199 200 201 202 203 204 205 | shell: """ multiqc \ --force \ --export \ --outdir qc \ --filename multiqc_report.html \ logs/rseqc/rseqc_junction_annotation qc/rseqc star > {log} #{input} > {log} """ |
18 19 20 21 22 23 24 25 26 27 28 | shell: """ cutadapt \ {params.adapters} \ {params.others} \ -o {output.fastq1} \ -p {output.fastq2} \ -j {threads} \ {input} \ > {output.qc} """ |
46 47 48 49 50 51 52 53 54 55 | shell: """ cutadapt \ {params.adapters} \ {params.others} \ -o {output.fastq} \ -j {threads} \ {input} \ > {output.qc} """ |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | import pandas as pd def get_column(strandedness): if pd.isnull(strandedness) or strandedness == "none": return 1 #non stranded protocol elif strandedness == "yes": return 2 #3rd column elif strandedness == "reverse": return 3 #4th column, usually for Illumina truseq else: raise ValueError(("'strandedness' column should be empty or have the " "value 'none', 'yes' or 'reverse', instead has the " "value {}").format(repr(strandedness))) counts = [pd.read_table(f, index_col=0, usecols=[0, get_column(strandedness)], header=None, skiprows=4) for f, strandedness in zip(snakemake.input, snakemake.params.strand)] for t, sample in zip(counts, snakemake.params.samples): t.columns = [sample] matrix = pd.concat(counts, axis=1) matrix.index.name = "gene" # collapse technical replicates if len(set(matrix.columns)) != len(matrix.columns): matrix = matrix.groupby(matrix.columns, axis=1).sum() matrix.to_csv(snakemake.output[0], sep="\t") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("tidyverse") library("AnnotationHub") library("DESeq2") library("biomaRt") parallel <- FALSE if (snakemake@threads > 1) { library("BiocParallel") # setup parallelization register(MulticoreParam(snakemake@threads)) parallel <- TRUE } message("Reading counts") cts <- read.table(snakemake@input[["counts"]], header=TRUE, row.names="gene", check.names=FALSE) message("Reading sample file") coldata <- read.table(snakemake@params[["samples"]], header=TRUE, row.names="sample", check.names=FALSE, sep="\t") coldata$sample <- rownames(coldata) message("Getting experimental design") design <- as.formula(snakemake@params[["design"]]) # colData and countData must have the same sample order if (nrow(coldata) != ncol(cts)) { stop("Number of samples in sample sheet and number of samples in counts", "matrix is not the same") } cts <- cts[,match(rownames(coldata),colnames(cts))] if (any(c("control", "Control", "CONTROL") %in% levels(coldata$condition))) { if ("control" %in% levels(coldata$condition)) { coldata$condition <- relevel(coldata$condition, "control" ) } else if ("Control" %in% levels(coldata$condition)) { coldata$condition <- relevel(coldata$condition, "Control" ) } else { coldata$condition <- relevel(coldata$condition, "CONTROL" ) } } dds <- DESeqDataSetFromMatrix(countData=cts, colData=coldata, design=design) #Collapse replicates if specified in config if (snakemake@params[["collapse"]] == "yes") { dds <- collapseReplicates(dds, dds$replicate) } # remove uninformative columns dds <- dds[rowSums(counts(dds)) > 1,] # normalization and preprocessing dds <- DESeq(dds, parallel=parallel) # Remove build number on ENS gene id rownames(dds) <- gsub("\\.\\d*", "", rownames(dds)) # Annotate by gene names if (snakemake@params[["species"]] == "mouse") { ensembl <- useMart("ensembl", dataset="mmusculus_gene_ensembl") } else if(snakemake@params[["species"]] == "human") { ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl") } else { stop("No annotation (biomart) specified for organism:", snakemake@params[["species"]]) } genemap <- getBM(attributes=c('ensembl_gene_id', "external_gene_name"), filters = 'ensembl_gene_id', values = rownames(dds), mart = ensembl) %>% dplyr::rename(gene_id = ensembl_gene_id, symbol = external_gene_name) %>% distinct() featureData <- tibble(gene_id=rownames(dds)) %>% left_join(genemap, by="gene_id") %>% mutate(symbol=case_when(is.na(symbol) ~ gene_id, TRUE ~ symbol)) %>% dplyr::select(symbol) mcols(dds) <- DataFrame(mcols(dds), featureData) saveRDS(dds, file=snakemake@output[[1]]) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") ################# ## libraries #### ################# library("DESeq2") library("tidyverse") ################# ## functions #### ################# genes_de <- function(deset, thrP=0.05, thrLog2FC=log2(1.5), direction=c('up', 'down', 'any')) { tmp <- deset %>% as.data.frame %>% rownames_to_column(var="gene_id") if (direction == "up") { tmp <- tmp %>% dplyr::filter(padj < thrP, log2FoldChange > thrLog2FC) } else if (direction == "down") { tmp <- tmp %>% dplyr::filter(padj < thrP, log2FoldChange < thrLog2FC) } else if (direction == "any") { tmp <- tmp %>% dplyr::filter(padj < thrP) } else { stop(direction, "is not a valid option for direction") } tmp } save_up_down <- function(res, setup) { up <- genes_de(res, direction="up") down <- genes_de(res, direction="down") write_csv(up, snakemake@output[['up']]) write_csv(down, snakemake@output[['down']]) return(list(up=up$gene_id, down=down$gene_id)) } ############ ## data #### ############ parallel <- FALSE if (snakemake@threads > 1) { library("BiocParallel") # setup parallelization register(MulticoreParam(snakemake@threads)) parallel <- TRUE } dds <- readRDS(snakemake@input[[1]]) directory <- "deseq2" ################ ## analysis #### ################ ## 1. Model fit #### # Generate named coefficients need for apeglm lfcShrink elements <- snakemake@params[["contrast"]] comparison <- paste(elements[1], "vs", elements[2], sep="_") coef <- paste("condition", elements[1], "vs", elements[2], sep="_") # Relevel for reference to second element in contrasts dds$condition <- relevel(dds$condition, elements[2]) dds <- nbinomWaldTest(dds) ## 2. Process results #### # Extract coefficient specific results res <- results(dds, name=coef, parallel=parallel) # shrink fold changes for lowly expressed genes res <- lfcShrink(dds, coef=coef, res=res, type='apeglm') # add gene names to results object res$gene_name <- mcols(dds)$symbol # sort by p-value res <- res[order(res$padj),] ## 3. Summarise results #### ## a) All genes for all groups #### res_format <- res %>% as.data.frame() %>% rownames_to_column(var="gene_id") %>% as_tibble() %>% rename_at(vars(-gene_id, -gene_name), ~ paste0(., "_", comparison)) # normalised expression values rld <- rlog(dds, blind = FALSE) deg_genes <- assay(rld) combined <- deg_genes %>% as.data.frame() %>% rownames_to_column(var="gene_id") %>% as_tibble() %>% right_join(res_format, by="gene_id") %>% dplyr::select(gene_id, gene_name, everything()) write_csv(combined, snakemake@output[["table"]]) ## b) Up/Down genes #### genes_up_down <- save_up_down(res=res, setup=coef) ## 4. Visualise results #### svg(snakemake@output[["ma_plot"]]) plotMA(res, ylim=c(-2,2)) dev.off() pdf(snakemake@output[["ma_pdf"]]) plotMA(res, ylim=c(-2,2)) dev.off() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("DESeq2") library("ggrepel") library("tidyverse") library("circlize") library("EnhancedVolcano") library("readr") library("janitor") library("GSA") library("msigdbr") library("fgsea") library("shades") library("biomaRt") #Load contrast elements_contrast <- snakemake@params[["contrast"]] contrast <- paste(elements_contrast[1], "vs", elements_contrast[2], sep="-") #Load gene set file names, species name, msigdb categories, descriptions option elements <- snakemake@params[["gene_sets"]] species <- snakemake@params[["species"]] msigdb_categories <- snakemake@params[["msigdb_categories"]] descriptions <- snakemake@params[["descriptions"]] #Determine required columns if (descriptions == "yes") { required <- c("entrez_gene", "gs_name", "gs_description") } else { required <- c("entrez_gene", "gs_name") } #Function to retrieve msigdbr species parameter if (any(msigdb_categories != "")) { if (species == "mouse") { species_msigdb <- "Mus musculus" } else if(species == "human") { species_msigdb <- "Homo sapiens" } else { stop("No msigdb entries for species: ", species) } } #Retrieve msigdb gene sets if specified if (any(msigdb_categories != "")) { gene_sets <- lapply(msigdb_categories, msigdbr, species = species_msigdb) %>% bind_rows() %>% dplyr::select(all_of(required)) } else { gene_sets <- setNames(data.frame(matrix(ncol = length(required), nrow = 0)), required) } #Check to see if uploaded gene sets have required columns ##Function to check gene sets check_gene_list <- function(filename, required_columns) { gene_set <-read.csv(filename) if (any(! required_columns %in% colnames(gene_set))) { missing <- required_columns[!required_columns %in% colnames(gene_set)] %>% paste(., collapse=",") stop(paste0("Dataframe of gene sets does not have ", missing, " column(s)")) } return(gene_set) } ##If gene sets are uploaded, run check; if passes combine with earlier gene sets if (any(elements != "")) { custom_gene_sets <- lapply(elements, check_gene_list, required) %>% bind_rows() %>% dplyr::select(all_of(required)) gene_sets <- gene_sets %>% rbind(custom_gene_sets) } gene_set_list <- split(x = gene_sets$entrez_gene, f = gene_sets$gs_name) #Load full diffexp output and extract log2FoldChange diff_exp <- read.csv(snakemake@input[['table']])%>% na.omit(cols=startsWith("padj")) lfc <- dplyr::select(diff_exp, starts_with("gene_id"), starts_with("log2FoldChange")) #Convert ensembl ID to entrez ID species <- "mouse" if (species == "mouse") { ensembl <- useMart("ensembl", dataset="mmusculus_gene_ensembl") } else if(species == "human") { ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl") } else { stop("No annotation (biomart) specified for organism:", species) } lfc_gene_ids <- as.vector(lfc$gene_id) #Get entrez ID's for our input genes and remove duplicates lfc_names <- getBM(attributes=c('entrezgene_id', 'ensembl_gene_id'), values = lfc_gene_ids, mart = ensembl) %>% group_by(ensembl_gene_id) %>% filter(!any(row_number() > 1)) %>% na.omit() #Select only the input genes that have corresponding entrez ID's lfc_both_names <- lfc %>% subset(gene_id %in% lfc_names$ensembl_gene_id) %>% arrange(gene_id) #Make sure all of the entrez ID's from biomaRt have ensembl ID's lfc_names <- lfc_names %>% subset(ensembl_gene_id %in% lfc_both_names$gene_id) %>% arrange(ensembl_gene_id) lfc_entrez_list <- cbind(lfc_both_names, lfc_names) %>% dplyr::select(starts_with("entrezgene_id"), starts_with("log2FoldChange")) %>% deframe() #Run fgsea fgseaRes <- fgsea(pathways = gene_set_list, stats = lfc_entrez_list, eps = 0.00, minSize = 15, maxSize = 500) #Collect results res <- fgseaRes[order(padj), ] #If gene sets have descriptions, add to results if (descriptions == "yes") { names_list <- as.vector(res$pathway) gene_des <- gene_sets %>% dplyr::select(gs_name, gs_description) %>% distinct() %>% filter(gs_name %in% names_list) %>% dplyr::rename(pathway = gs_name) res <- list (res, gene_des) %>% reduce(full_join, by = "pathway") } #Save results as RDS object saveRDS(res, file = snakemake@output[['rds']]) #Remove leadingEdge list column res_no_le <- res %>% subset(select = -c(leadingEdge)) #Write results to csv write_csv(res_no_le, snakemake@output[["table"]]) #Select the significant gene sets (if any) and add ensembl ID's for genes in those gene sets if (any(res$padj < 0.05)) { unpack_le <- function(res_sig) { genes <- as.vector(res_sig$leadingEdge) return(genes) } get_BM_vector <- function(genes) { genes_ids_df <- getBM(attributes=c('entrezgene_id','ensembl_gene_id'), filters = 'entrezgene_id', values = genes, mart = ensembl) genes_ids <- genes_ids_df$ensembl_gene_id return(genes_ids) } res_sig <- filter(res, padj < 0.05) genes <- apply(res_sig, 1, unpack_le) genes_ids <- sapply(genes, get_BM_vector) res_sig$leadingEdgeEnsembl <- genes_ids #Save significant results as RDS object saveRDS(res_sig, snakemake@output[["rds_sig"]]) #Remove leadingEdge list column res_sig <- res_sig %>% subset(select = -c(leadingEdge, leadingEdgeEnsembl)) %>% arrange(padj) #Write significant results to csv write_csv(res_sig, snakemake@output[["table_sig"]]) } #Visualize results ##Collect up to the top 5 pathways if (nrow(res) >= 5) { n <- 5 } else { n <- nrow(res) } subset_pathway <- res[1:n,] top_pathway_names <- subset_pathway$pathway ##Create enrichment plots pdf(snakemake@output[["top_gene_sets_pdf"]]) enrichment_plots <- function(top_pathway_names) { plotEnrichment(pathway = gene_set_list[[top_pathway_names]], lfc_entrez_list) + labs(title = top_pathway_names) + theme(plot.title = element_text(hjust = 0.5, face="bold")) } plt <- lapply(top_pathway_names,enrichment_plots) print(plt) dev.off() #Create top pathway dotplot ##Plot up to top 25 pathways and save to pdf if (nrow(res) >= 25) { m <- 25 } else { m <- nrow(res) } selected_res <- res[1:m,] %>% na.omit() ##Order results by NES selected_res <- selected_res %>% arrange(NES) %>% mutate(pathway=fct_inorder(pathway)) gg <- ggplot(data=selected_res, aes(x=x, y=y)) + geom_segment(aes(x=0, xend=NES, y=pathway, yend=pathway)) + geom_point(aes(size=size, color= -log10(padj), x=NES, y=pathway)) + theme_light() + ggtitle("Enriched Pathways") + lightness(scale_colour_distiller(palette = "YlGnBu", direction=1), scalefac(0.90)) + xlab("Normalized Enrichment Score") + ylab("Pathway") + expand_limits(x=0) + theme( panel.grid.major.y = element_blank(), panel.border = element_blank(), axis.ticks.y = element_blank() ) ggsave(plot=gg, scale = 2, filename = snakemake@output[["pathways_pdf"]], dpi = 600, width = 8, height = 6, units = "in") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | import gffutils db = gffutils.create_db(snakemake.input[0], dbfn=snakemake.output.db, force=True, keep_order=True, merge_strategy='merge', sort_attribute_values=True, disable_infer_genes=True, disable_infer_transcripts=True) with open(snakemake.output.bed, 'w') as outfileobj: for tx in db.features_of_type('transcript', order_by='start'): bed = [s.strip() for s in db.bed12(tx).split('\t')] bed[3] = tx.id outfileobj.write('{}\n'.format('\t'.join(bed))) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") ## libraries #### library("DESeq2") library("cowplot") library("limma") library("ggrepel") library("tidyverse") library("scales") library("ggnewscale") library("ComplexHeatmap") library("circlize") library("EnhancedVolcano") library("readr") library("janitor") library("biomaRt") ## functions #### theme_set(theme_half_open(font_family = 'sans' )) title_size <- 10 text_size <- 8 point_size <- 0.8 ## settings #### map_color_range <- function(matrix, color_vec) { matrix %>% range(., na.rm=TRUE) %>% abs %>% max %>% seq(-., ., length.out = length(color_vec)) %>% colorRamp2(., color_vec) } ## data #### dds <- readRDS(snakemake@input[[1]]) elements <- snakemake@params[["contrast"]] contrast <- paste(elements[1], "vs", elements[2], sep="-") adjust_batch <- snakemake@params[['batch']] ## analysis #### # keep contrast specific samples dds <- dds[,colData(dds)$condition %in% elements] # obtain normalized counts counts <- rlog(dds, blind=FALSE) if (adjust_batch != "") { if (!adjust_batch %in% colnames(colData(counts))) { stop("Column selected for batch correction prior to PCA not present") } assay(counts) <- limma::removeBatchEffect(assay(counts), colData(counts)[, adjust_batch]) } # Reformat and scale counts counts_scaled <- t(apply(assay(counts), 1, scale)) colnames(counts_scaled) <- colnames(assay(counts)) # Load up and downregulated genes and extract gene list up_reg <- read.csv(snakemake@input[['up']]) down_reg <- read.csv(snakemake@input[['down']]) sig_genes <- rbind(up_reg, down_reg) fc <- sig_genes %>% dplyr::select(gene_id, gene_name, log2FoldChange) se <- sig_genes %>% dplyr::select(gene_id, lfcSE) # Extract counts and fc values for significant genes and order them by gene_id combined <- counts_scaled %>% as.data.frame %>% rownames_to_column(var = "gene_id") %>% inner_join(fc, by="gene_id") %>% inner_join(se, by="gene_id") %>% dplyr::select(gene_id, gene_name, log2FoldChange, lfcSE,everything()) %>% arrange(gene_id) %>% as.data.frame # Set heatmap categories and condition type if (!"labels" %in% colnames(colData(dds))) { colData(dds)$labels <- colData(dds)$condition } # Set heatmap colors, themes, and sizes cond_type_vals <- c('#d95f02','#7570b3') names(cond_type_vals) <- unique(colData(dds)$labels) color <- list(cond_type = cond_type_vals) counts_color <- c('#a6611a','#dfc27d','#f5f5f5','#80cdc1','#018571') fc_color <- c('#0571b0','#92c5de','#f7f7f7','#f4a582','#ca0020') z_stat_color <- c('#e66101','#fdb863','#f7f7f7','#b2abd2','#5e3c99') # Create column annotations for rnaseq data coldata <- colData(dds) %>% as_tibble columnAnno <- columnAnnotation( cond_type = coldata$labels, col = list(cond_type = color$cond_type), annotation_legend_param = list(cond_type = list(title = "Condition")), show_annotation_name = FALSE, annotation_name_gp = gpar(fontsize = text_size) ) # Create the heatmap counts_matrix <- combined %>% dplyr::select(-c(gene_id, log2FoldChange, lfcSE)) %>% column_to_rownames("gene_name") %>% as.matrix fc_matrix <- combined %>% dplyr::select(gene_name, log2FoldChange) %>% column_to_rownames("gene_name") %>% as.matrix z_stat_matrix <- combined %>% mutate(z_stat = log2FoldChange / lfcSE) %>% dplyr::select(gene_name, z_stat) %>% column_to_rownames("gene_name") %>% as.matrix hm_counts_sig <- Heatmap(counts_matrix, name="Counts", col=map_color_range(counts_matrix, counts_color), cluster_rows=TRUE, cluster_columns=FALSE, show_row_dend = FALSE, show_row_names=FALSE, show_column_names=TRUE, top_annotation = columnAnno, row_names_gp = gpar(fontsize = text_size), column_names_gp = gpar(fontsize = text_size), heatmap_legend_param = list(direction = "horizontal", title_gp = gpar(fontsize = text_size, fontface = "bold"), labels_gp = gpar(fontsize = text_size)), use_raster=TRUE, row_title=NULL, column_title=NULL) hm_fc <- Heatmap(fc_matrix, col=map_color_range(fc_matrix, fc_color), use_raster=TRUE, show_row_names=FALSE, show_row_dend = FALSE, show_column_names = FALSE, row_title = NULL, column_title = NULL, cluster_columns=FALSE, cluster_rows=TRUE, row_names_gp = gpar(fontsize = text_size), column_names_gp = gpar(fontsize = text_size), heatmap_legend_param = list(direction = "horizontal", title_gp = gpar(fontsize = text_size, fontface = "bold"), labels_gp = gpar(fontsize = text_size)), name = "log2(FC)", width = unit(3, "cm")) hm_z <- Heatmap(z_stat_matrix, col=map_color_range(z_stat_matrix, z_stat_color), use_raster=TRUE, show_row_names=FALSE, show_row_dend = FALSE, show_column_names = FALSE, row_title = NULL, column_title = NULL, cluster_columns=FALSE, cluster_rows=TRUE, row_names_gp = gpar(fontsize = text_size), column_names_gp = gpar(fontsize = text_size), heatmap_legend_param = list(direction = "horizontal", title_gp = gpar(fontsize = text_size, fontface = "bold"), labels_gp = gpar(fontsize = text_size)), name = "z_statistic", width = unit(3, "cm")) pdf(snakemake@output[['pdf']], width=190/25.4, height =150/25.4) draw(hm_counts_sig + hm_fc, heatmap_legend_side = "bottom", merge_legend = TRUE) dev.off() #Save counts heatmap saveRDS(hm_counts_sig, file = snakemake@output[['counts_heatmap_data']]) saveRDS(hm_fc, file = snakemake@output[['lfc_heatmap_data']]) saveRDS(hm_z, file = snakemake@output[['z-stat_heatmap_data']]) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library("DESeq2") library("ggplot2") library("cowplot") library("limma") library("ggrepel") dds <- readRDS(snakemake@input[[1]]) pca_color <- snakemake@params[['color']] pca_fill <- snakemake@params[['fill']] pca_labels <- snakemake@params[['labels']] pca_adjust_batch <- snakemake@params[['batch']] if (all(c(pca_color, pca_fill) != "")) { intgroup <- c(pca_color, pca_fill) } else if (pca_color != "") { intgroup <- pca_color } else if (pca_fill != "") { intgroup <- pca_fill } else { stop("At least one of fill or color have to be specified") } if (pca_labels != "") { if (!pca_labels %in% intgroup) { intgroup <- c(intgroup, pca_labels) } } head(colnames(colData(dds))) if (!all(intgroup %in% colnames(colData(dds)))) { missing <- intgroup[!intgroup %in% colnames(colData(dds))] stop(paste(missing, collapse=","), "columns selected for PCA highlighting are missing") } # obtain normalized counts counts <- rlog(dds, blind=FALSE) if (pca_adjust_batch != "") { if (! pca_adjust_batch %in% colnames(colData(counts))) { stop("Column selected for batch correction prior to PCA not present") } assay(counts) <- limma::removeBatchEffect(assay(counts), colData(counts)[, pca_adjust_batch]) } pcaData <- plotPCA(counts, intgroup = intgroup, returnData = TRUE) percentVar <- round(100 * attr(pcaData, "percentVar")) color_sym <- sym(pca_color) p <- ggplot(pcaData, aes(x = PC1, y = PC2)) if (all(c(pca_color, pca_fill) != "")) { fill_sym <- sym(pca_fill) p <- p + geom_point(aes(color = !!color_sym, fill = !!fill_sym), pch=22, size=3, stroke=2) } else { p <- p + geom_point(aes(color = !!color_sym)) } if (pca_labels != "") { label_sym <- sym(pca_labels) p <- p + geom_text_repel(aes(label=!!label_sym), size = 3, color="black", segment.size = 0.3, segment.color = "black", min.segment.length = 0.2, point.padding = 0.3) } p <- p + scale_color_brewer(type="qual", palette="Dark2") + labs(x=paste0("PC1: ", percentVar[1], "% variance"), y=paste0("PC2: ", percentVar[2], "% variance")) + coord_fixed() + theme_cowplot() + theme(legend.position = "bottom", legend.justification = 0) ggsave(plot=p, height=4.5, width=7.5, filename = "results/pca.pdf") ggsave(plot=p, height=4.5, width=7.5, filename = "results/pca.svg") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 | library("ggrepel") library("tidyverse") #Load full diffexp output and extract log2FoldChange and padj diffexp <- read.csv(snakemake@input[['table']]) fc_p_adj <- dplyr::select(diffexp, starts_with("gene_id"), starts_with("log2FoldChange"), starts_with("padj"), starts_with("gene_name")) %>% na.omit %>% rename( log2FoldChange = starts_with("log2FoldChange"), padj = starts_with("padj") ) # The significantly differentially expressed genes are the ones found in the upper-left and upper-right corners. # Add a column to the data frame to specify if they are UP- or DOWN- regulated # add a column of NAs fc_p_adj$diff_expressed <- "NO" # if log2Foldchange > 0.6 and pvalue < 0.05, set as "UP" fc_p_adj$diff_expressed[fc_p_adj$log2FoldChange > 0.6 & fc_p_adj$pad < 0.05] <- "UP" # if log2Foldchange < -0.6 and pvalue < 0.05, set as "DOWN" fc_p_adj$diff_expressed[fc_p_adj$log2FoldChange < -0.6 & fc_p_adj$padj < 0.05] <- "DOWN" # Now write down the name of genes beside the points... # Create a new column "delabel" to the dataframe, that will contain the name of genes differentially expressed (NA in case they are not) fc_p_adj$delabel <- NA fc_p_adj$delabel[fc_p_adj$diff_expressed != "NO"] <- fc_p_adj$gene_name[fc_p_adj$diff_expressed != "NO"] #Create and save plot p <- ggplot(data=fc_p_adj, aes(x=log2FoldChange, y=-log10(padj), col=diff_expressed, label=delabel)) + geom_point() + theme_minimal() + geom_text_repel() + scale_color_manual(values=c("blue", "black", "red")) + geom_vline(xintercept=c(-0.6, 0.6), col="red") + geom_hline(yintercept=-log10(0.05), col="red") + ggtitle("-log10(p) vs log2FoldChange") + theme(plot.title = element_text(hjust = 0.5)) ggsave(plot=p, scale = 1, width = 15.4, height = 8.6, filename = snakemake@output[['pdf']]) |
Support
- Future updates