Bio 470 project 2022. Bulk RNA scripts

public public 1yr ago 0 bookmarks

This workflow performs a differential expression analysis using STAR and DESeq2 . It performs a wide range of quality control steps and bundles there results in a qc report via MultiQC . Reported results include:

  • PCA plot of all samples

  • differentially up and down regulated genes for each analysis (ie treatment vs control)

  • MA plot for each analysis

  • heatmap of counts and log2 Fold Changes

  • lollipop plot of enriched gene sets

Usage

Step 1: Configure workflow

You will likely run these analyses on the HPCC, but you will want access to the results locally. My strategy is to have the analysis repository on the local machine and on the HPCC, in the same file path location (relative to the home directory).

We will start by creating the local repository (on your computer):

  • Click on 'Use this template', which will copy the content of this repository to a new remote repositiory on github. Find a good name for your analysis and use it as the name for the repository;

  • Copy git link from this new repository, move to the parent folder for your RNA-seq analysis (ie where you want to save your new analyses/results) and create a local repository on your computer using git clone;

  • Move into the newly created directory (repository);

  • Configure the workflow by editing the file config.yaml ;

    • Choose mouse or human for the STAR Index;

    • Modify the pca columns, design for deseq2, organism of interest, the contrasts, gene sets for (optional) enrichment analysis;

    • Add gene set files to gene_sets directory

  • Configure the samples.txt and units.txt files with project specific annotations and file;

    • Hint: Check for rogue leading/trailing spaces in your file paths/sample names etc;
  • Git push changes to master.

We will now set up the analysis on the HPCC:

  • Copy the git link again and create a repository on the cluster (same path, see above) using git clone;

  • Move into the newly created directory (repository);

  • Create a reads directory and copy over the reads (fastq.gz files) from the sequencing facility.

To run the analysis, activate your conda snakemake environment:

 conda activate snakemake

Step 2: Execute workflow

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores $N

using $N cores or run it on the UGE cluster environment

snakemake --use-conda --profile uge

Step 3: Investigate results

After successful execution, you can create a self-contained interactive HTML report via:

snakemake --report report.html
  • to look at the results, use secure copy scp to transfer the report.html and the quality control report qc/multiqc.html from the HPCC to your local computer (this is where storing them in the same location relative to the home directory comes in handy);

  • explore the results;

  • forward to your collaborators.

Code Snippets

21
22
23
24
25
26
27
28
29
30
31
32
shell:
    """
    STAR \
    {params.extra} \
        --runThreadN {threads} \
        --runMode alignReads \
        --genomeDir {params.index} \
        --readFilesIn {input.fq1} {input.fq2} {params.readcmd} \
        --outReadsUnmapped Fastq \
        --outSAMtype BAM Unsorted \
        --outFileNamePrefix star/{wildcards.sample}-{wildcards.unit}/
    """
46
47
48
49
shell:
    """
    samtools sort {input} -o {output} -@ {threads} -m 20G
    """
63
64
65
66
shell:
    """
    samtools index {input} -@ {threads}
    """         
16
17
script:
    "../scripts/count-matrix.py"
34
35
script:
    "../scripts/deseq2-init.R"
53
54
script:
    "../scripts/plot-pca.R"
73
74
script:
    "../scripts/deseq2.R"
96
97
script:
    "../scripts/plot-heatmap.R"
124
125
script:
    "../scripts/fgsea.R"
137
138
script:
    "../scripts/plot-volcano.R"
11
12
script:
    "../scripts/gtf2bed.py"
SnakeMake From line 11 of rules/qc.smk
30
31
32
33
34
35
36
37
shell:
    """
    junction_annotation.py {params.extra} \
        -i {input.bam} \
        -r {input.bed} \
        -o {params.prefix}
    > {log[0]} 2>&1
    """
SnakeMake From line 30 of rules/qc.smk
55
56
57
58
59
60
61
62
shell:
    """
    junction_saturation.py {params.extra} \
        -i {input.bam} \
        -r {input.bed} \
        -o {params.prefix}
    > {log} 2>&1
    """
SnakeMake From line 55 of rules/qc.smk
76
77
shell:
    "bam_stat.py -i {input} > {output} 2> {log}"
SnakeMake From line 76 of rules/qc.smk
92
93
shell:
    "infer_experiment.py -r {input.bed} -i {input.bam} > {output} 2> {log}"
SnakeMake From line 92 of rules/qc.smk
110
111
112
113
114
115
116
shell:
    """
    inner_distance.py \
        -r {input.bed} \
        -i {input.bam} \
        -o {params.prefix} > {log} 2>&1
    """
SnakeMake From line 110 of rules/qc.smk
131
132
shell:
    "read_distribution.py -r {input.bed} -i {input.bam} > {output} 2> {log}"
SnakeMake From line 131 of rules/qc.smk
148
149
shell:
    "read_duplication.py -i {input} -o {params.prefix} > {log} 2>&1"
SnakeMake From line 148 of rules/qc.smk
165
166
shell:
    "read_GC.py -i {input} -o {params.prefix} > {log} 2>&1"
SnakeMake From line 165 of rules/qc.smk
196
197
198
199
200
201
202
203
204
205
shell:
    """
    multiqc \
        --force \
        --export \
        --outdir qc \
        --filename multiqc_report.html \
        logs/rseqc/rseqc_junction_annotation qc/rseqc star > {log}
        #{input} > {log}
    """
18
19
20
21
22
23
24
25
26
27
28
shell:
    """
    cutadapt \
        {params.adapters} \
        {params.others} \
        -o {output.fastq1} \
        -p {output.fastq2} \
        -j {threads} \
        {input} \
    > {output.qc}
    """
46
47
48
49
50
51
52
53
54
55
shell:
    """
    cutadapt \
        {params.adapters} \
        {params.others} \
        -o {output.fastq} \
        -j {threads} \
        {input} \
    > {output.qc}
    """
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
import pandas as pd

def get_column(strandedness):
    if pd.isnull(strandedness) or strandedness == "none":
        return 1 #non stranded protocol
    elif strandedness == "yes":
        return 2 #3rd column
    elif strandedness == "reverse":
        return 3 #4th column, usually for Illumina truseq
    else:
        raise ValueError(("'strandedness' column should be empty or have the " 
                          "value 'none', 'yes' or 'reverse', instead has the " 
                          "value {}").format(repr(strandedness)))

counts = [pd.read_table(f, index_col=0, usecols=[0, get_column(strandedness)],
          header=None, skiprows=4)
          for f, strandedness in zip(snakemake.input, snakemake.params.strand)]

for t, sample in zip(counts, snakemake.params.samples):
    t.columns = [sample]

matrix = pd.concat(counts, axis=1)
matrix.index.name = "gene"
# collapse technical replicates
if len(set(matrix.columns)) != len(matrix.columns):
    matrix = matrix.groupby(matrix.columns, axis=1).sum()
matrix.to_csv(snakemake.output[0], sep="\t")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("tidyverse")
library("AnnotationHub")
library("DESeq2")
library("biomaRt")

parallel <- FALSE
if (snakemake@threads > 1) {
    library("BiocParallel")
    # setup parallelization
    register(MulticoreParam(snakemake@threads))
    parallel <- TRUE
}

message("Reading counts")
cts <- read.table(snakemake@input[["counts"]], header=TRUE, row.names="gene",
                  check.names=FALSE)
message("Reading sample file")
coldata <- read.table(snakemake@params[["samples"]], header=TRUE,
                      row.names="sample", check.names=FALSE, sep="\t")
coldata$sample <- rownames(coldata)
message("Getting experimental design")
design <- as.formula(snakemake@params[["design"]])

# colData and countData must have the same sample order
if (nrow(coldata) != ncol(cts)) {
    stop("Number of samples in sample sheet and number of samples in counts",
         "matrix is not the same")
}
cts <- cts[,match(rownames(coldata),colnames(cts))]
if (any(c("control", "Control", "CONTROL") %in% levels(coldata$condition))) {
    if ("control" %in% levels(coldata$condition)) {
        coldata$condition <- relevel(coldata$condition, "control" )
    } else if ("Control" %in% levels(coldata$condition)) {
        coldata$condition <- relevel(coldata$condition, "Control" )
    } else {
        coldata$condition <- relevel(coldata$condition, "CONTROL" )
    }
}
dds <- DESeqDataSetFromMatrix(countData=cts,
                              colData=coldata,
                              design=design)

#Collapse replicates if specified in config
if (snakemake@params[["collapse"]] == "yes") {
	                dds <- collapseReplicates(dds, dds$replicate)
}

# remove uninformative columns
dds <- dds[rowSums(counts(dds)) > 1,]

# normalization and preprocessing
dds <- DESeq(dds,  parallel=parallel)

# Remove build number on ENS gene id
rownames(dds) <- gsub("\\.\\d*", "", rownames(dds))

# Annotate by gene names
if (snakemake@params[["species"]] == "mouse") {
	ensembl <- useMart("ensembl", dataset="mmusculus_gene_ensembl")
} else if(snakemake@params[["species"]] == "human") {
	ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
} else {
	stop("No annotation (biomart) specified for organism:",
	     snakemake@params[["species"]])
}

genemap <- getBM(attributes=c('ensembl_gene_id', "external_gene_name"),
		 filters = 'ensembl_gene_id',
		 values = rownames(dds),
		 mart = ensembl) %>%
    dplyr::rename(gene_id = ensembl_gene_id,
		  symbol = external_gene_name) %>%
    distinct()

featureData <- tibble(gene_id=rownames(dds)) %>%
	left_join(genemap, by="gene_id") %>%
	mutate(symbol=case_when(is.na(symbol) ~ gene_id,
				TRUE ~ symbol)) %>%
	dplyr::select(symbol)
mcols(dds) <- DataFrame(mcols(dds), featureData)

saveRDS(dds, file=snakemake@output[[1]])
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

#################
## libraries ####
#################
library("DESeq2")
library("tidyverse")

#################
## functions ####
#################
genes_de <- function(deset, thrP=0.05, thrLog2FC=log2(1.5),
                     direction=c('up', 'down', 'any')) {
    tmp <- deset %>%
        as.data.frame %>%
        rownames_to_column(var="gene_id")
    if (direction == "up") {
        tmp <- tmp %>%
            dplyr::filter(padj < thrP,
                          log2FoldChange > thrLog2FC)
    } else if (direction == "down") {
        tmp <- tmp %>%
            dplyr::filter(padj < thrP,
                          log2FoldChange < thrLog2FC)
    } else if (direction == "any") {
        tmp <- tmp %>%
            dplyr::filter(padj < thrP)
    } else {
        stop(direction, "is not a valid option for direction")
    }
    tmp
}

save_up_down <- function(res, setup) {
    up  <- genes_de(res, direction="up")
    down  <- genes_de(res, direction="down")
    write_csv(up, snakemake@output[['up']])
    write_csv(down, snakemake@output[['down']])
    return(list(up=up$gene_id, down=down$gene_id))
}

############
## data ####
############
parallel <- FALSE
if (snakemake@threads > 1) {
    library("BiocParallel")
    # setup parallelization
    register(MulticoreParam(snakemake@threads))
    parallel <- TRUE
}

dds <- readRDS(snakemake@input[[1]])
directory <- "deseq2"

################
## analysis ####
################

## 1. Model fit ####
# Generate named coefficients need for apeglm lfcShrink
elements <- snakemake@params[["contrast"]]
comparison <- paste(elements[1], "vs", elements[2], sep="_")
coef <- paste("condition", elements[1], "vs", elements[2], sep="_")

# Relevel for reference to second element in contrasts
dds$condition <- relevel(dds$condition, elements[2])
dds <- nbinomWaldTest(dds)

## 2. Process results ####
# Extract coefficient specific results
res <- results(dds, name=coef, parallel=parallel)

# shrink fold changes for lowly expressed genes
res <- lfcShrink(dds, coef=coef, res=res, type='apeglm')

# add gene names to results object
res$gene_name <- mcols(dds)$symbol

# sort by p-value
res <- res[order(res$padj),]

## 3. Summarise results ####
## a) All genes for all groups ####
res_format <- res %>%
    as.data.frame() %>%
    rownames_to_column(var="gene_id") %>%
    as_tibble() %>%
    rename_at(vars(-gene_id, -gene_name), ~ paste0(., "_", comparison))

# normalised expression values
rld <- rlog(dds, blind = FALSE)
deg_genes <- assay(rld)

combined <- deg_genes %>%
    as.data.frame() %>%
    rownames_to_column(var="gene_id") %>%
    as_tibble() %>%
    right_join(res_format, by="gene_id") %>%
    dplyr::select(gene_id, gene_name, everything())

write_csv(combined, snakemake@output[["table"]])

## b) Up/Down genes ####
genes_up_down <- save_up_down(res=res, setup=coef)

## 4. Visualise results ####
svg(snakemake@output[["ma_plot"]])
plotMA(res, ylim=c(-2,2))
dev.off()

pdf(snakemake@output[["ma_pdf"]])
plotMA(res, ylim=c(-2,2))
dev.off()
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("DESeq2")
library("ggrepel")
library("tidyverse")
library("circlize")
library("EnhancedVolcano")
library("readr")
library("janitor")
library("GSA")
library("msigdbr")
library("fgsea")
library("shades")
library("biomaRt")

#Load contrast
elements_contrast <- snakemake@params[["contrast"]]
contrast <- paste(elements_contrast[1], "vs", elements_contrast[2], sep="-")

#Load gene set file names, species name, msigdb categories, descriptions option
elements <- snakemake@params[["gene_sets"]]
species <- snakemake@params[["species"]]
msigdb_categories <- snakemake@params[["msigdb_categories"]]
descriptions <- snakemake@params[["descriptions"]]

#Determine required columns 
if (descriptions == "yes") {
  required <- c("entrez_gene", "gs_name", "gs_description")
} else {
  required <- c("entrez_gene", "gs_name")
}

#Function to retrieve msigdbr species parameter 
if (any(msigdb_categories != "")) {
  if (species == "mouse") {
    species_msigdb <- "Mus musculus"
  } else if(species == "human") {
    species_msigdb <- "Homo sapiens"
  } else {
    stop("No msigdb entries for species: ",
         species)
  }
}


#Retrieve msigdb gene sets if specified
if (any(msigdb_categories != "")) {
  gene_sets <- lapply(msigdb_categories, msigdbr, species = species_msigdb) %>%
    bind_rows() %>%
    dplyr::select(all_of(required))
} else {
  gene_sets <- setNames(data.frame(matrix(ncol = length(required), nrow = 0)), required)
}

#Check to see if uploaded gene sets have required columns

##Function to check gene sets
check_gene_list <- function(filename, required_columns) {
  gene_set <-read.csv(filename)
  if (any(! required_columns %in% colnames(gene_set))) {
    missing <- required_columns[!required_columns %in% colnames(gene_set)] %>%
      paste(., collapse=",")

    stop(paste0("Dataframe of gene sets does not have ", missing, " column(s)"))
  }
  return(gene_set)
}

##If gene sets are uploaded, run check; if passes combine with earlier gene sets
if (any(elements != "")) {
  custom_gene_sets <- lapply(elements, check_gene_list, required) %>% 
    bind_rows() %>%
    dplyr::select(all_of(required))
  gene_sets <- gene_sets %>%
    rbind(custom_gene_sets)
}

gene_set_list <- split(x = gene_sets$entrez_gene, f = gene_sets$gs_name)

#Load full diffexp output and extract log2FoldChange
diff_exp <- read.csv(snakemake@input[['table']])%>% na.omit(cols=startsWith("padj"))
lfc <- dplyr::select(diff_exp, starts_with("gene_id"), starts_with("log2FoldChange"))

#Convert ensembl ID to entrez ID
species <- "mouse"
if (species == "mouse") {
	        ensembl <- useMart("ensembl", dataset="mmusculus_gene_ensembl")
} else if(species == "human") {
	        ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
} else {
	        stop("No annotation (biomart) specified for organism:",
		                  species)
}

lfc_gene_ids  <- as.vector(lfc$gene_id)

#Get entrez ID's for our input genes and remove duplicates
lfc_names <- getBM(attributes=c('entrezgene_id', 'ensembl_gene_id'),
             values = lfc_gene_ids,
             mart = ensembl) %>%
  group_by(ensembl_gene_id) %>%
  filter(!any(row_number() > 1)) %>%
  na.omit()

#Select only the input genes that have corresponding entrez ID's
lfc_both_names <- lfc %>%
  subset(gene_id %in% lfc_names$ensembl_gene_id) %>%
  arrange(gene_id)

#Make sure all of the entrez ID's from biomaRt have ensembl ID's
lfc_names <- lfc_names %>%
  subset(ensembl_gene_id %in% lfc_both_names$gene_id) %>%
  arrange(ensembl_gene_id)

lfc_entrez_list <- cbind(lfc_both_names, lfc_names) %>%
  dplyr::select(starts_with("entrezgene_id"), starts_with("log2FoldChange")) %>%
  deframe()

#Run fgsea
fgseaRes <- fgsea(pathways = gene_set_list,
		  stats = lfc_entrez_list,
		  eps = 0.00,
		  minSize  = 15,
		  maxSize  = 500)

#Collect results
res <- fgseaRes[order(padj), ]

#If gene sets have descriptions, add to results
if (descriptions == "yes") {
  names_list <- as.vector(res$pathway)
  gene_des <- gene_sets %>%
	  dplyr::select(gs_name, gs_description) %>%
		distinct() %>%
    filter(gs_name %in% names_list) %>%
    dplyr::rename(pathway = gs_name)

	res <- list (res, gene_des) %>%
	  reduce(full_join, by = "pathway")
} 

#Save results as RDS object
saveRDS(res, file = snakemake@output[['rds']])

#Remove leadingEdge list column
res_no_le <- res %>%
	subset(select = -c(leadingEdge)) 

#Write results to csv
write_csv(res_no_le, snakemake@output[["table"]])

#Select the significant gene sets (if any) and add ensembl ID's for genes in those gene sets
if (any(res$padj < 0.05)) {
  unpack_le <- function(res_sig) {
    genes <- as.vector(res_sig$leadingEdge)
    return(genes)
  }
  get_BM_vector <- function(genes) {
    genes_ids_df <- getBM(attributes=c('entrezgene_id','ensembl_gene_id'),
                          filters = 'entrezgene_id',
                          values = genes,
                          mart = ensembl)
    genes_ids <- genes_ids_df$ensembl_gene_id
    return(genes_ids)
  }
  res_sig <- filter(res, padj < 0.05) 
  genes <- apply(res_sig, 1, unpack_le)
  genes_ids <- sapply(genes, get_BM_vector) 
  res_sig$leadingEdgeEnsembl <- genes_ids

  #Save significant results as RDS object
  saveRDS(res_sig, snakemake@output[["rds_sig"]]) 

  #Remove leadingEdge list column
  res_sig <- res_sig %>%
    subset(select = -c(leadingEdge, leadingEdgeEnsembl)) %>%
    arrange(padj)

  #Write significant results to csv
  write_csv(res_sig, snakemake@output[["table_sig"]])
}



#Visualize results
##Collect up to the top 5 pathways
if (nrow(res) >= 5) {
  n <- 5
} else {
  n <- nrow(res)
}
subset_pathway <- res[1:n,]
top_pathway_names <- subset_pathway$pathway

##Create enrichment plots
pdf(snakemake@output[["top_gene_sets_pdf"]])

enrichment_plots <- function(top_pathway_names) {
  plotEnrichment(pathway = gene_set_list[[top_pathway_names]], lfc_entrez_list) +
    labs(title = top_pathway_names) +
    theme(plot.title = element_text(hjust = 0.5, face="bold"))
} 

plt <- lapply(top_pathway_names,enrichment_plots)
print(plt)
dev.off()

#Create top pathway dotplot 
##Plot up to top 25 pathways and save to pdf
if (nrow(res) >= 25) {
  m <- 25
} else {
  m <- nrow(res)
}
selected_res <- res[1:m,] %>%
	  na.omit()

##Order results by NES
selected_res <- selected_res %>%
	arrange(NES) %>%
	mutate(pathway=fct_inorder(pathway))

gg <- ggplot(data=selected_res, aes(x=x, y=y)) + 
	geom_segment(aes(x=0, xend=NES, y=pathway, yend=pathway)) + 
	geom_point(aes(size=size, color= -log10(padj), x=NES, y=pathway)) + 
	theme_light() + 
	ggtitle("Enriched Pathways") + 
	lightness(scale_colour_distiller(palette = "YlGnBu", direction=1), scalefac(0.90)) + 
	xlab("Normalized Enrichment Score") + 
	ylab("Pathway") + 
	expand_limits(x=0) + 
	theme(
	        panel.grid.major.y = element_blank(),
		  panel.border = element_blank(),
		  axis.ticks.y = element_blank()
		  )

ggsave(plot=gg, scale = 2, filename = snakemake@output[["pathways_pdf"]], dpi = 600, width = 8, height = 6, units = "in")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
import gffutils

db = gffutils.create_db(snakemake.input[0],
                        dbfn=snakemake.output.db,
                        force=True,
                        keep_order=True,
                        merge_strategy='merge',
                        sort_attribute_values=True,
                        disable_infer_genes=True,
                        disable_infer_transcripts=True)

with open(snakemake.output.bed, 'w') as outfileobj:
    for tx in db.features_of_type('transcript', order_by='start'):
        bed = [s.strip() for s in db.bed12(tx).split('\t')]
        bed[3] = tx.id
        outfileobj.write('{}\n'.format('\t'.join(bed)))
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## libraries ####
library("DESeq2")
library("cowplot")
library("limma")
library("ggrepel")
library("tidyverse")
library("scales")
library("ggnewscale")
library("ComplexHeatmap")
library("circlize")
library("EnhancedVolcano")
library("readr")
library("janitor")
library("biomaRt")

## functions ####
theme_set(theme_half_open(font_family = 'sans' ))

title_size <- 10
text_size <- 8
point_size <- 0.8

## settings ####
map_color_range <- function(matrix, color_vec) {
    matrix %>%
        range(., na.rm=TRUE) %>%
        abs %>%
        max %>%
        seq(-., ., length.out = length(color_vec)) %>%
        colorRamp2(., color_vec)
}

## data ####
dds <- readRDS(snakemake@input[[1]])
elements <- snakemake@params[["contrast"]]
contrast <- paste(elements[1], "vs", elements[2], sep="-")
adjust_batch <- snakemake@params[['batch']]


## analysis ####
# keep contrast specific samples
dds <- dds[,colData(dds)$condition %in% elements]

# obtain normalized counts
counts <- rlog(dds, blind=FALSE)

if (adjust_batch != "") {
    if (!adjust_batch %in% colnames(colData(counts))) {
        stop("Column selected for batch correction prior to PCA not present")
    }
    assay(counts) <- limma::removeBatchEffect(assay(counts),
                                              colData(counts)[, adjust_batch])
}

# Reformat and scale counts
counts_scaled <- t(apply(assay(counts), 1, scale))
colnames(counts_scaled) <- colnames(assay(counts))

# Load up and downregulated genes and extract gene list
up_reg <- read.csv(snakemake@input[['up']])
down_reg <- read.csv(snakemake@input[['down']])
sig_genes <- rbind(up_reg, down_reg)

fc <- sig_genes %>%
  dplyr::select(gene_id, gene_name, log2FoldChange)

se <- sig_genes %>%
  dplyr::select(gene_id, lfcSE)

# Extract counts and fc values for significant genes and order them by gene_id
combined <- counts_scaled %>%
  as.data.frame %>%
  rownames_to_column(var = "gene_id") %>%
  inner_join(fc, by="gene_id") %>%
  inner_join(se, by="gene_id") %>%
  dplyr::select(gene_id, gene_name, log2FoldChange, lfcSE,everything()) %>%
  arrange(gene_id) %>%
  as.data.frame

# Set heatmap categories and condition type
if (!"labels" %in% colnames(colData(dds))) {
    colData(dds)$labels <- colData(dds)$condition
}

# Set heatmap colors, themes, and sizes
cond_type_vals <- c('#d95f02','#7570b3')
names(cond_type_vals) <- unique(colData(dds)$labels)
color <- list(cond_type = cond_type_vals)
counts_color <- c('#a6611a','#dfc27d','#f5f5f5','#80cdc1','#018571')
fc_color <- c('#0571b0','#92c5de','#f7f7f7','#f4a582','#ca0020')
z_stat_color <- c('#e66101','#fdb863','#f7f7f7','#b2abd2','#5e3c99')

# Create column annotations for rnaseq data
coldata <- colData(dds) %>%
  as_tibble

columnAnno <- columnAnnotation(
                   cond_type = coldata$labels,
                   col = list(cond_type = color$cond_type),
                   annotation_legend_param = list(cond_type = list(title = "Condition")),
                   show_annotation_name = FALSE,
                   annotation_name_gp = gpar(fontsize = text_size)
                   )

# Create the heatmap
counts_matrix <- combined %>%
  dplyr::select(-c(gene_id,  log2FoldChange, lfcSE)) %>%
  column_to_rownames("gene_name") %>%
  as.matrix

fc_matrix <- combined %>%
  dplyr::select(gene_name, log2FoldChange) %>%
  column_to_rownames("gene_name") %>%
  as.matrix

z_stat_matrix <- combined %>%
  mutate(z_stat = log2FoldChange / lfcSE) %>%
  dplyr::select(gene_name, z_stat) %>%
  column_to_rownames("gene_name") %>%
  as.matrix

hm_counts_sig <- Heatmap(counts_matrix,
                         name="Counts",
                         col=map_color_range(counts_matrix, counts_color),
                         cluster_rows=TRUE,
                         cluster_columns=FALSE,
                         show_row_dend = FALSE,
                         show_row_names=FALSE,
                         show_column_names=TRUE,
                         top_annotation = columnAnno,
                         row_names_gp = gpar(fontsize = text_size),
                         column_names_gp = gpar(fontsize = text_size),
                         heatmap_legend_param =
                                 list(direction = "horizontal",
                                 title_gp = gpar(fontsize = text_size,
                                                      fontface = "bold"),
                                      labels_gp = gpar(fontsize = text_size)),
                         use_raster=TRUE,
                         row_title=NULL,
                         column_title=NULL)

hm_fc <- Heatmap(fc_matrix,
         col=map_color_range(fc_matrix, fc_color),
         use_raster=TRUE,
         show_row_names=FALSE,
         show_row_dend = FALSE,
         show_column_names = FALSE,
                 row_title = NULL,
         column_title = NULL,
         cluster_columns=FALSE,
         cluster_rows=TRUE,
         row_names_gp = gpar(fontsize = text_size),
         column_names_gp = gpar(fontsize = text_size),
         heatmap_legend_param =
             list(direction = "horizontal",
                  title_gp = gpar(fontsize = text_size,
                          fontface = "bold"),
                  labels_gp = gpar(fontsize = text_size)),
         name = "log2(FC)",
         width = unit(3, "cm"))

hm_z <- Heatmap(z_stat_matrix,
                 col=map_color_range(z_stat_matrix, z_stat_color),
                 use_raster=TRUE,
                 show_row_names=FALSE,
                 show_row_dend = FALSE,
                 show_column_names = FALSE,
                 row_title = NULL,
                 column_title = NULL,
                 cluster_columns=FALSE,
                 cluster_rows=TRUE,
                 row_names_gp = gpar(fontsize = text_size),
                 column_names_gp = gpar(fontsize = text_size),
                 heatmap_legend_param =
                   list(direction = "horizontal",
                        title_gp = gpar(fontsize = text_size,
                                        fontface = "bold"),
                        labels_gp = gpar(fontsize = text_size)),
                 name = "z_statistic",
                 width = unit(3, "cm"))

pdf(snakemake@output[['pdf']], width=190/25.4, height =150/25.4)
draw(hm_counts_sig + hm_fc, heatmap_legend_side = "bottom",
          merge_legend = TRUE)
dev.off()

#Save counts heatmap
saveRDS(hm_counts_sig, file = snakemake@output[['counts_heatmap_data']])
saveRDS(hm_fc, file = snakemake@output[['lfc_heatmap_data']])
saveRDS(hm_z, file = snakemake@output[['z-stat_heatmap_data']])
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("DESeq2")
library("ggplot2")
library("cowplot")
library("limma")
library("ggrepel")

dds <- readRDS(snakemake@input[[1]])
pca_color <- snakemake@params[['color']]
pca_fill <- snakemake@params[['fill']]
pca_labels <- snakemake@params[['labels']]
pca_adjust_batch <- snakemake@params[['batch']]


if (all(c(pca_color, pca_fill) != "")) {
    intgroup <- c(pca_color, pca_fill)
} else if (pca_color != "") {
    intgroup <- pca_color
} else if (pca_fill != "") {
    intgroup <- pca_fill
} else {
    stop("At least one of fill or color have to be specified")
}

if (pca_labels != "") {
    if (!pca_labels %in% intgroup) {
        intgroup <- c(intgroup, pca_labels)
    }
}

head(colnames(colData(dds)))
if (!all(intgroup %in% colnames(colData(dds)))) {
    missing <- intgroup[!intgroup %in% colnames(colData(dds))]
    stop(paste(missing, collapse=","),
         "columns selected for PCA highlighting are missing")
}

# obtain normalized counts
counts <- rlog(dds, blind=FALSE)

if (pca_adjust_batch != "") {
    if (! pca_adjust_batch %in% colnames(colData(counts))) {
        stop("Column selected for batch correction prior to PCA not present")
    }
    assay(counts) <- limma::removeBatchEffect(assay(counts),
                                              colData(counts)[, pca_adjust_batch])

}
pcaData <- plotPCA(counts, intgroup = intgroup, returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))


color_sym <- sym(pca_color)
p <- ggplot(pcaData, aes(x = PC1, y = PC2))

if (all(c(pca_color, pca_fill) != "")) {
    fill_sym <- sym(pca_fill)
    p <- p +
        geom_point(aes(color = !!color_sym, fill = !!fill_sym),
                   pch=22,  size=3, stroke=2)
} else {
    p <- p + geom_point(aes(color = !!color_sym))
}

if (pca_labels != "") {
    label_sym <- sym(pca_labels)
    p <- p + geom_text_repel(aes(label=!!label_sym), size = 3, color="black",
                             segment.size = 0.3, segment.color = "black",
                             min.segment.length = 0.2, point.padding = 0.3)
}

p <- p +
    scale_color_brewer(type="qual", palette="Dark2") +
    labs(x=paste0("PC1: ", percentVar[1], "% variance"),
         y=paste0("PC2: ", percentVar[2], "% variance")) +
    coord_fixed() +
    theme_cowplot() +
    theme(legend.position = "bottom",
          legend.justification = 0)
ggsave(plot=p, height=4.5, width=7.5, filename = "results/pca.pdf")
ggsave(plot=p, height=4.5, width=7.5, filename = "results/pca.svg")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
library("ggrepel")
library("tidyverse")

#Load full diffexp output and extract log2FoldChange and padj
diffexp <- read.csv(snakemake@input[['table']])
fc_p_adj <- dplyr::select(diffexp, starts_with("gene_id"), starts_with("log2FoldChange"), starts_with("padj"), starts_with("gene_name")) %>%
  na.omit %>%
  rename(
	 log2FoldChange = starts_with("log2FoldChange"),
	 padj = starts_with("padj")
	 )

# The significantly differentially expressed genes are the ones found in the upper-left and upper-right corners.
# Add a column to the data frame to specify if they are UP- or DOWN- regulated

# add a column of NAs
fc_p_adj$diff_expressed <- "NO"
# if log2Foldchange > 0.6 and pvalue < 0.05, set as "UP" 
fc_p_adj$diff_expressed[fc_p_adj$log2FoldChange > 0.6 & fc_p_adj$pad < 0.05] <- "UP"
# if log2Foldchange < -0.6 and pvalue < 0.05, set as "DOWN"
fc_p_adj$diff_expressed[fc_p_adj$log2FoldChange < -0.6 & fc_p_adj$padj < 0.05] <- "DOWN"

# Now write down the name of genes beside the points...
# Create a new column "delabel" to the dataframe, that will contain the name of genes differentially expressed (NA in case they are not)
fc_p_adj$delabel <- NA
fc_p_adj$delabel[fc_p_adj$diff_expressed != "NO"] <- fc_p_adj$gene_name[fc_p_adj$diff_expressed != "NO"]

#Create and save plot
p <- ggplot(data=fc_p_adj, aes(x=log2FoldChange, y=-log10(padj), col=diff_expressed, label=delabel)) +
  geom_point() + 
  theme_minimal() +
  geom_text_repel() +
  scale_color_manual(values=c("blue", "black", "red")) +
  geom_vline(xintercept=c(-0.6, 0.6), col="red") +
  geom_hline(yintercept=-log10(0.05), col="red") +
  ggtitle("-log10(p) vs log2FoldChange") +
  theme(plot.title = element_text(hjust = 0.5))
ggsave(plot=p, scale = 1, width = 15.4, height = 8.6, filename = snakemake@output[['pdf']])
ShowHide 20 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/Vyoming/Lung_adenocarcinoma_vs_Squamous_time
Name: lung_adenocarcinoma_vs_squamous_time
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...