A Snakemake workflow for performing genomic region set and gene set enrichment analyses using LOLA, GREAT, and GSEApy.


Genomic Region Set & (Ranked) Gene Set Enrichment Analysis & Visualization Snakemake Workflow for Human and Mouse Genomes.

Given human (hg19 or hg38) or mouse (mm9 or mm10) genomic region sets and/or (ranked) gene sets of interest, together with the respective background region/gene sets, the enrichment within the configured databases is determined using LOLA, GREAT, and GSEApy (over-representation analysis (ORA) & preranked GSEA), and the results are saved as CSV files. Additionally, the most significant results are plotted for each region/gene set, database queried, and analysis performed. Finally, the results within the same "group" (e.g., stemming from the same DEA) are aggregated per database and analysis in summary CSV files and visualized using hierarchically clustered heatmaps and bubble plots. For collaboration, communication, and documentation, a detailed self-contained HTML report covering results, methods, and workflow information can be generated.

This workflow adheres to the module specifications of MR. PARETO, an effort to augment research by modularizing (biomedical) data science. For more details and modules, check out the project's repository.

If you use this workflow in a publication, please don't forget to give credit to the authors by citing it using the DOI 10.5281/zenodo.7810621.

Workflow Rulegraph


Authors

Software

This project wouldn't be possible without the following software and their dependencies:

Software Reference (DOI)
Enrichr https://doi.org/10.1002/cpz1.90
ggplot2 https://ggplot2.tidyverse.org/
GREAT https://doi.org/10.1371/journal.pcbi.1010378
GSEA https://doi.org/10.1073/pnas.0506580102
GSEApy https://doi.org/10.1093/bioinformatics/btac757
LOLA https://doi.org/10.1093/bioinformatics/btv612
pandas https://doi.org/10.5281/zenodo.3509134
pheatmap https://cran.r-project.org/package=pheatmap
rGREAT https://doi.org/10.1093/bioinformatics/btac745
Snakemake https://doi.org/10.12688/f1000research.29032.2

Methods

This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) have to be read out from the respective conda environment specifications (workflow/envs/*.yaml files) or post execution (results_dir/envs/enrichment_analysis/*.yaml files). Parameters that have to be adapted depending on the data or workflow configurations are denoted in square brackets, e.g., [X].

The outlined analyses were performed using the programming languages R (ver) [ref] and Python (ver) [ref] unless stated otherwise. All approaches statistically correct their results using expressed/accessible background genomic region/gene sets from the respective analyses that yielded the query region/gene sets.

Genomic region set enrichment analyses

LOLA. Genomic region set enrichment analysis was performed using LOLA (ver) [ref], which uses Fisher’s exact test. The following databases were queried: [lola_dbs].
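
As a conceptual illustration of the statistic involved, the following sketch sets up the 2x2 contingency table behind such an overlap enrichment test (this is not LOLA's implementation, and all counts are made up):

    # conceptual sketch of a region overlap enrichment test (illustrative counts only)
    from scipy.stats import fisher_exact

    support = 120       # query regions overlapping the database region set
    query_n = 500       # total query regions
    db_bg = 900         # background regions overlapping the database region set
    universe_n = 10000  # total background (universe) regions

    table = [
        [support, query_n - support],
        [db_bg - support, (universe_n - query_n) - (db_bg - support)],
    ]
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    print(f"odds ratio: {odds_ratio:.2f}, p-value: {p_value:.2e}")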

GREAT. Genomic region set enrichment analysis was performed using GREAT [ref], implemented with rGREAT (ver) [ref]. The following databases were queried: [great_dbs].

Furthermore, genomic regions (query- and background-sets) were mapped to genes using GREAT and then analyzed as gene-sets as described below for a complementary and extended perspective.

Gene set enrichment analyses (GSEA)

Over-representation analysis (ORA). Gene set ORA was performed using Enrichr [ref], which uses Fisher’s exact test (i.e., hypergeometric test), implemented with GSEApy's (ver) [ref] function enrich. The following databases were queried: [enrichr_dbs][local_gmt_dbs][local_json_dbs].

Preranked GSEA. Preranked GSEA was performed using GSEA [ref], implemented with GSEApy's (ver) [ref] function prerank. The following databases were queried: [enrichr_dbs][local_gmt_dbs][local_json_dbs].
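
Both GSEApy entry points are called directly in the workflow scripts gene_ORA_GSEApy.py and gene_preranked_GSEApy.py (listed under Code Snippets below); a minimal, self-contained sketch, where the tiny database and gene lists are placeholders:

    import gseapy as gp
    import pandas as pd

    db_dict = {"EXAMPLE_TERM": ["TP53", "MYC", "EGFR"]}            # toy database
    query = ["TP53", "MYC", "GATA1"]                               # query genes
    background = ["TP53", "MYC", "EGFR", "GATA1", "SPI1", "CD34"]  # background genes

    # ORA: Fisher's exact test (i.e., hypergeometric test) against the background
    ora_res = gp.enrich(gene_list=query, gene_sets=db_dict,
                        background=background, outdir=None).res2d

    # preranked GSEA: takes a ranked gene -> score mapping instead of a gene list
    rnk = pd.Series({"TP53": 2.5, "MYC": 1.8, "GATA1": -0.7, "SPI1": -2.1})
    gsea_res = gp.prerank(rnk=rnk, gene_sets=db_dict, min_size=1,
                          permutation_num=100, outdir=None, seed=42).res2d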

Aggregation. The results of all queries belonging to the same analysis [group] were aggregated by method and database. Additionally, we filtered the results by retaining only the union of terms that were statistically significant (i.e., [adj_pvalue] < [adjp_th]) in at least one query.
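
In essence, the aggregation concatenates the per-query result tables and then filters by the union of significant terms; a pandas sketch mirroring workflow/scripts/aggregate.py (column names are configurable; "Term" and "Adjusted_P_value" are example defaults):

    import pandas as pd

    adjp_th = 0.05
    per_query_results = [  # one result table per query, with toy values
        pd.DataFrame({"Term": ["A", "B"], "Adjusted_P_value": [0.01, 0.20], "name": "query1"}),
        pd.DataFrame({"Term": ["A", "B"], "Adjusted_P_value": [0.30, 0.04], "name": "query2"}),
    ]
    result_df = pd.concat(per_query_results, axis=0)
    # union of terms that are significant in at least one query
    sig_terms = result_df.loc[result_df["Adjusted_P_value"] < adjp_th, "Term"].unique()
    result_sig_df = result_df[result_df["Term"].isin(sig_terms)]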

Visualization. All analysis results were visualized in the same way.

For each query, method, and database combination, an enrichment dot plot was used to visualize the most important results. The top [top_n] terms were ranked (along the y-axis) by the mean rank of statistical significance ([p_value]), effect-size ([effect_size]), and overlap ([overlap]) to make the results more balanced and interpretable. The significance (adjusted p-value) is denoted by the dot color, effect-size by the x-axis position, and overlap by the dot size.
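
The workflow computes this mean rank in R (workflow/scripts/enrichment_plot.R); an equivalent pandas sketch with illustrative column names and values:

    import pandas as pd

    df = pd.DataFrame({
        "Term": ["A", "B", "C"],
        "P_value": [1e-4, 1e-2, 1e-3],
        "log2_Odds_Ratio": [2.0, 3.5, 0.5],
        "Overlap": [0.30, 0.10, 0.20],
    })
    df["mean_rank"] = (
        df["P_value"].rank()                                 # smaller p-value ranks better
        + df["log2_Odds_Ratio"].abs().rank(ascending=False)  # larger |effect| ranks better
        + df["Overlap"].rank(ascending=False)                # larger overlap ranks better
    ) / 3
    top_n = 2
    top_df = df.sort_values("mean_rank").head(top_n)  # terms shown in the dot plot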

The aggregated results per analysis [group], method, and database combination were visualized using hierarchically clustered heatmaps and bubble plots. The union of the top [top_terms_n] most significant terms per query was determined, and their effect-sizes and significances were visualized as hierarchically clustered heatmaps, with statistical significance ([adj_pvalue] < [adjp_th]) denoted by *. Furthermore, a hierarchically clustered bubble plot encoding both effect-size (color) and statistical significance (size) is provided, again with statistical significance denoted by *. All summary visualizations’ values were capped at [adjp_cap]/[or_cap]/[nes_cap] to avoid shifts in the coloring scheme caused by outliers.
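
The capping itself is a simple clip of the plotted value matrices, done in R in the workflow (workflow/scripts/overview_plot.R); a pandas sketch with illustrative cap values:

    import pandas as pd

    adjp_cap, or_cap = 4, 5
    adjp_df = pd.DataFrame([[2.0, 7.5], [0.3, 4.2]])     # -log10(adj. p-values)
    effect_df = pd.DataFrame([[1.2, -8.0], [6.3, 0.4]])  # log2(odds ratios)

    adjp_df = adjp_df.clip(upper=adjp_cap)                   # one-sided cap for significance
    effect_df = effect_df.clip(lower=-or_cap, upper=or_cap)  # symmetric cap for effect-sizes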

The analysis and visualizations described here were performed using a publicly available Snakemake (ver) [ref] workflow [10.5281/zenodo.7810621].

Features

The three tools LOLA, GREAT and GSEApy (over-representation analysis (ORA) & preranked GSEA) are used for various enrichment analyses. Databases to be queried can be configured (see ./config/config.yaml). All approaches statistically correct their results using the provided background region/gene sets.

  • enrichment analysis methods:

    • region-set

      • LOLA: Genomic Locus Overlap Enrichment Analysis is run locally. The required (cached) databases are downloaded automatically during the first run. Supported databases depend on the genome (lola_dbs).

      • GREAT using rGREAT: Genomic Regions Enrichment of Annotations Tool is queried remotely (requires a working internet connection). Supported databases depend on the genome (great_dbs).

        • query region sets with >500,000 regions are not supported and empty output files are generated to satisfy Snakemake

        • background region sets with >1,000,000 regions are not supported and the whole genome is used as background

    • gene-set over-representation analysis (ORA_GSEApy)

      • GSEApy enrich() function performs Fisher’s exact test (i.e., hypergeometric test) and is run locally.
    • region-based gene-set over-representation analysis (ORA_GSEApy)

      • region-gene associations for each query and background region-set are obtained using GREAT.

      • they are used for a complementary ORA using GSEApy.

      • thereby an extended region-set enrichment perspective can be gained through the association to genes, by querying the same and/or additional databases that are not supported/provided by region-based tools.

      • limitation: if the background region set exceeds GREAT's capacities (i.e., 1,000,000 regions), no background gene list is generated and a background gene number (bg_n) of 20,000 is used in the ORA.

    • preranked gene-set enrichment analysis (preranked_GSEApy)

      • GSEApy prerank() function performs preranked GSEA and is run locally.

      • no duplicates allowed: only entries with the largest absolute score are kept (see the sketch after this feature list).

  • resources (databases) for both gene-based analyses are downloaded (Enrichr) or copied (local files) and saved as JSON files in /resources

    • all Enrichr databases can be queried (enrichr_dbs).

    • local JSON database files can be queried (local_json_dbs).

    • local GMT database files (e.g., from MSigDB ) can be queried (local_gmt_dbs).

  • group aggregation of results per method and database

    • results of all queries belonging to the same group are aggregated per method (e.g., ORA_GSEApy) and database (e.g., GO_Biological_Process_2021) by concatenation and saved as a long-format table (CSV).

    • a filtered version taking the union of all statistically significant (i.e., adjusted p-value <{adjp_th}) terms per query is also saved as CSV file.

  • visualization

    • region/gene-set specific enrichment dot plots are generated for each query, method and database combination

      • the top {top_n} terms are ranked (along the y-axis) by the mean rank of statistical significance ({p_value}), effect-size ({effect_size}, e.g., log2(odds ratio) or normalized enrichment score), and overlap ({overlap}, e.g., coverage or support) to make the results more balanced and interpretable

      • significance (adjusted p-value) is presented by the dot color

      • effect-size is presented by the x-axis position

      • overlap is presented by the dot size

    • group summary/overview

      • the union of the top {top_terms_n} most significant terms per query, method, and database within a group is determined.

      • their effect-size (effect) and statistical significance (adjp) are visualized as hierarchically clustered heatmaps, with statistical significance denoted by * (PDF).

      • a hierarchically clustered bubble plot encoding both effect-size (color) and significance (size) is provided, with statistical significance denoted by * (PNG and SVG).

      • all summary visualizations are configured to cap the values ({adjp_cap}/{or_cap}/{nes_cap}) to avoid shifts in the coloring scheme caused by outliers.
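
As referenced in the preranked GSEA item above, duplicate handling keeps only the entry with the largest absolute score per gene; a pandas sketch mirroring workflow/scripts/gene_preranked_GSEApy.py:

    import pandas as pd

    # toy gene -> score table with a duplicated gene symbol
    genes = pd.DataFrame({"score": [2.5, -3.1, 0.4]}, index=["TP53", "TP53", "MYC"])
    genes = genes.iloc[genes["score"].abs().argsort()[::-1]]  # sort by |score|, descending
    genes = genes[~genes.index.duplicated(keep="first")]      # keep the largest |score| per gene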

Results

The result directory {result_path}/enrichment_analysis contains a folder for each region/gene-set {query} and {group}:

  • {query}/{method}/{database}/ containing:

    • result table (CSV file): {query}_{database}.csv

    • enrichment dot plot (SVG and PNG): {query}_{database}.{svg|png}

  • {group}/{method}/{database}/ containing:

    • aggregated result table (CSV file): {group}_{database}_all.csv

    • filtered aggregated result table (CSV file): {group}_{database}_sig.csv

    • hierarchically clustered heatmaps visualizing statistical significance and effect-sizes of the top {top_terms_n} terms (PDF): {group}_{database}_{adjp|effect}_hm.pdf

    • hierarchically clustered bubble plot visualizing statistical significance and effect-sizes simultaneously (PNG and SVG): {group}_{database}_summary.{svg|png}

Usage

Here are some tips for using this workflow:

  • Run the analysis on every query gene/region set of interest (e.g., results of differential analyses) with the respective background genes/regions (e.g., all expressed genes in the data or consensus regions).

  • generate the Snakemake Report

  • look through the overview plots of your dedicated groups and queried databases in the report

  • dig deeper by looking at the

    • aggregated result table underlying the summary/overview plot

    • enrichment plots for the individual query sets

  • investigate interesting hits further by looking into the individual query result tables.

Configuration

Detailed specifications can be found here: ./config/README.md

Examples

We provide four example queries:

We provide two local example databases:

Follow these steps to run the complete analysis:

  1. activate your snakemake conda environment

    conda activate snakemake
    
  2. enter the workflow directory

    cd enrichment_analysis
    
  3. run a snakemake dry-run (-n flag) using the provided configuration to check if everything is in order

    snakemake -p --use-conda --configfile .test/config/example_enrichment_analysis_config.yaml -n
    
  4. run the workflow

    snakemake -p --use-conda --configfile .test/config/example_enrichment_analysis_config.yaml
    
  5. generate report

    snakemake --report .test/report.html --configfile .test/config/example_enrichment_analysis_config.yaml
    

Links

Resources

Publications

The following publications successfully used this module for their analyses.

  • ...

Code Snippets

script:
    "../scripts/aggregate.py"
script:
    "../scripts/overview_plot.R"
script:
    "../scripts/region_enrichment_analysis_LOLA.R"
script:
    "../scripts/region_enrichment_analysis_GREAT.R"
script:
    "../scripts/gene_ORA_GSEApy.py"
script:
    "../scripts/gene_preranked_GSEApy.py"
script:
    "../scripts/enrichment_plot.R"
shell:
    """
    conda env export > {output}
    """
run:
    with open(output["configs"], 'w') as outfile:
        yaml.dump(config, outfile,  indent=4)
shell:
    """
    cp {input} {output}
    """
script:
    "../scripts/get_Enrichr_databases_GSEApy.py"
shell:
    """
    cp {input} {output}
    """
script:
    "../scripts/get_gmt_databases_GSEApy.py"

LOLA resource download (rule run block):
run:
    print("start downloading and unpacking LOLA resources")
    os.makedirs(output.lola_resources, exist_ok=True)

    LOLA_path = output.lola_resources

    URL_Core='http://big.databio.org/regiondb/LOLACoreCaches_180412.tgz'
    URL_Ext='http://big.databio.org/regiondb/LOLAExtCaches_170206.tgz'

    # download
    getCore_str = 'wget --directory-prefix={} {}'.format(LOLA_path, URL_Core)
    getExt_str = 'wget --directory-prefix={} {}'.format(LOLA_path, URL_Ext)

    subprocess.run(getCore_str, shell=True)
    subprocess.run(getExt_str, shell=True)

    # unpack
    unpackCore_str = 'tar zxvf {} -C {}'.format(os.path.join(LOLA_path, 'LOLACoreCaches_180412.tgz'), LOLA_path)
    unpackExt_str = 'tar zxvf {} -C {}'.format(os.path.join(LOLA_path, 'LOLAExtCaches_170206.tgz'), LOLA_path)

    subprocess.run(unpackCore_str, shell=True)
    subprocess.run(unpackExt_str, shell=True)

    # remove
    removeCore_str = 'rm -f {}'.format(os.path.join(LOLA_path, 'LOLACoreCaches_180412.tgz'))
    removeExt_str = 'rm -f {}'.format(os.path.join(LOLA_path, 'LOLAExtCaches_170206.tgz'))

    subprocess.run(removeCore_str, shell=True)
    subprocess.run(removeExt_str, shell=True)

workflow/scripts/aggregate.py:
import pandas as pd
import os
import sys
import numpy as np

# configs
result_paths = snakemake.input['enrichment_results']

results_all_path = snakemake.output['results_all']
results_sig_path = snakemake.output['results_sig']

group = snakemake.wildcards["group"] #"testgroup"
tool = snakemake.wildcards["tool"] #"ORA_GSEApy"
db = snakemake.wildcards["db"] #"WikiPathways_2019_Human"

term_col = snakemake.config["column_names"][tool]["term"] #'Term'
adjp_col = snakemake.config["column_names"][tool]["adj_pvalue"] #'Adjusted_P_value'
# effect_col = snakemake.config["column_names"][tool]["effect_size"] #'Odds_Ratio'

adjp_th = snakemake.config["adjp_th"][tool] #0.05

dir_results = os.path.dirname(results_all_path)
os.makedirs(dir_results, exist_ok=True)

# load results
results_list = list()
for result_path in result_paths:
    if os.stat(result_path).st_size != 0:
        tmp_name = os.path.basename(result_path).replace("_{}.csv".format(db),"")
        tmp_res = pd.read_csv(result_path, index_col=0)
        tmp_res['name'] = tmp_name
        results_list.append(tmp_res)


# move on if results are empty
if len(results_list)==0:
    open(results_all_path, mode='a').close()
    open(results_sig_path, mode='a').close()
    sys.exit(0)

# concatenate all results into one results dataframe
result_df = pd.concat(results_list, axis=0)

# save all enrichment results
result_df.to_csv(results_all_path)

# find union of statistically significant terms
sig_terms = result_df.loc[result_df[adjp_col]<adjp_th, term_col].unique()

# filter by significant terms
result_sig_df = result_df.loc[result_df[term_col].isin(sig_terms), :]

# save filtered enrichment results (significant terms only)
result_sig_df.to_csv(results_sig_path)

workflow/scripts/enrichment_plot.R:
library(ggplot2)
library(svglite)

# source utility functions
# source("workflow/scripts/utils.R")
snakemake@source("./utils.R")

# configs
enrichment_result_path <- snakemake@input[["enrichment_result"]]
enrichment_plot_path <- snakemake@output[["enrichment_plot"]]

tool <- snakemake@params[["tool"]] #"ORA_GSEApy"
database <- snakemake@params[["database"]] #"KEGG_2021_Human"
feature_set <- snakemake@params[["feature_set"]] 
plot_cols <- snakemake@config[["column_names"]][[tool]]

top_n <- plot_cols[["top_n"]] #25
pval_col <- plot_cols[["p_value"]] #'P.value'
adjp_col <- plot_cols[["adj_pvalue"]] #'Adjusted.P.value'
effect_col <- plot_cols[["effect_size"]] #'Odds.Ratio'
overlap_col <- plot_cols[["overlap"]] #'Overlap'
term_col <- plot_cols[["term"]] #'Term'

# load enrichment result
if (file.size(enrichment_result_path) != 0L){
    enrichment_result <- read.csv(enrichment_result_path, row.names = NULL, header= TRUE)
}else{
    file.create(enrichment_plot_path)
    quit()
}

# evaluate overlap numerically if necessary
if(class(enrichment_result[[overlap_col]])=="character"){
    enrichment_result[[overlap_col]] <- as.numeric(lapply(enrichment_result[[overlap_col]], evaltext))
}

# calculate comparable effect size either NES or odds-ratio/fold based
if (tool!="preranked_GSEApy"){
    # calculate log2(effect-size) and put in new column
    effect_col_new <- paste0("log2_",effect_col)
    enrichment_result[[effect_col_new]] <- log2(enrichment_result[[effect_col]])
    effect_col <- effect_col_new
}

# determine ranks
enrichment_result$PValue_Rnk <- rank(enrichment_result[[pval_col]])
enrichment_result$Fold_Rnk <- rank(-abs(enrichment_result[[effect_col]]))
enrichment_result$Coverage_Rnk <- rank(-enrichment_result[[overlap_col]])
# calculate and sort by mean rank
enrichment_result$meanRnk <- rowMeans(enrichment_result[,c('PValue_Rnk', 'Fold_Rnk','Coverage_Rnk')])
enrichment_result <- enrichment_result[order(enrichment_result$meanRnk, decreasing=FALSE),]

# format term column so that the order is kept and values are unique
enrichment_result[[term_col]] <- make.names(enrichment_result[[term_col]], unique=TRUE)
enrichment_result[[term_col]] <- factor(enrichment_result[[term_col]], levels = enrichment_result[[term_col]])

# plot top_n terms by mean_rnk
do_enrichment_plot(plot_data=enrichment_result[1:top_n,], 
               title=paste0(tool, ' results of \n',feature_set,' in ',database), 
               x=effect_col, 
               y=term_col, 
               size=overlap_col, 
               colorBy=adjp_col, 
               font.size=10, 
               path=file.path(dirname(enrichment_plot_path)), 
               filename=paste0(feature_set,"_",database),
                   top_n = top_n
              )

workflow/scripts/gene_ORA_GSEApy.py:
import pandas as pd
import json
# import pickle
import os
import numpy as np
import gseapy as gp
import sys


# utils for manual odds ratio calculation -> not used anymore
def overlap_converter(overlap_str, bg_n, gene_list_n):
    overlap_n, gene_set_n = str(overlap_str).split('/')
    return odds_ratio_calc(bg_n, gene_list_n, int(gene_set_n), int(overlap_n))

def odds_ratio_calc(bg_n, gene_list_n, gene_set_n, overlap_n): 
    import scipy.stats as stats
    table=np.array([[gene_set_n, bg_n-gene_set_n],[overlap_n, gene_list_n-overlap_n]])
    oddsratio, pvalue = stats.fisher_exact(table)
    return (1/oddsratio)


# configs

# input
query_genes_path = snakemake.input['query_genes']
background_genes_path = snakemake.input['background_genes']
database_path = snakemake.input['database']

# output
result_path = snakemake.output['result_file']

# parameters
db = snakemake.params["database"]

dir_results = os.path.dirname(result_path)
os.makedirs(dir_results, exist_ok=True)

# check if genes file exists & load or handle exception
if os.path.exists(query_genes_path):
    # read query genes, dropping empty lines
    with open(query_genes_path, "r") as genes_file:
        gene_list = [gene for gene in genes_file.read().split('\n') if gene != '']
else:
    with open(os.path.join(dir_results,"no_genes_found.txt"), 'w') as f:
        f.write('no genes found')
    quit()

# load background genes
with open(background_genes_path, "r") as bg_file:
    background = [gene for gene in bg_file.read().split('\n') if gene != '']

# move on if query-genes are empty
if len(gene_list)==0:
    open(result_path, mode='a').close()
    sys.exit(0)

# load database .pkl file
# with open(enrichr_databases, 'rb') as f:
#     db_dict = pickle.load(f)
with open(database_path) as json_file:
    db_dict = json.load(json_file)

# convert gene lists and database to upper case
gene_list=[str(x).upper() for x in list(gene_list)]
background=[str(x).upper() for x in list(background)]
db_dict = {key: [ele.upper() for ele in db_dict[key] ] for key in db_dict}

# count number of background genes for odds-ratio calculation
bg_n = len(background)

# if background genes are empty, fall back to 20,000 genes as a heuristic
# this exception only occurs if the background region set exceeds 500,000 regions
if bg_n==0:
    bg_n = 20000
    background = bg_n

# perform ORA (hypergeometric test) in database using GSEApy (barplots are generated automatically)
try:
    res = gp.enrich(gene_list=gene_list,
                     gene_sets=db_dict,
                     background=background,
                     outdir=None,#os.path.join(dir_results),
                     top_term=25,
                     cutoff=0.05,
                     format='png',
                     verbose=True,
                    ).res2d
except ValueError:
    print("Result is empty")
    res = pd.DataFrame()

# move on if result is empty
if res.shape[0]==0:
    open(result_path, mode='a').close()
    sys.exit(0)

# annotate used gene set
res['Gene_set'] = db

# odds ratio calculation -> results still differ, but correlation is >0.99 and with large OR values the difference shows up at third decimal place
# gene_list_n=len(gene_list)
# res['Odds Ratio'] = res['Overlap'].apply(overlap_converter, args=(bg_n, gene_list_n))

# make column names language agnostic (i.e., R compatible)
column_names = list(res.columns.values)
column_names = [col.replace(" ","_") for col in column_names]
column_names = [col.replace("-","_") for col in column_names]
res.columns = column_names

# separate export
res.to_csv(result_path)

workflow/scripts/gene_preranked_GSEApy.py:
import pandas as pd
import json
# import pickle
import os
import numpy as np
import gseapy as gp
import sys


# configs

# input
query_genes_path = snakemake.input['query_genes']
database_path = snakemake.input['database']

# output
result_path = snakemake.output['result_file']

# parameters
db = snakemake.params["database"]

dir_results = os.path.dirname(result_path)
os.makedirs(dir_results, exist_ok=True)


# load gene-score file
genes = pd.read_csv(query_genes_path, index_col=0)

# load database JSON file
with open(database_path) as json_file:
    db_dict = json.load(json_file)

# convert all genes to upper case
genes.index = [str(x).upper() for x in list(genes.index)]
db_dict = {key: [ele.upper() for ele in db_dict[key] ] for key in db_dict}

# remove duplicates: only retain largest absolute value
# sort by absolute value of "score" column (i.e., first column)
genes = genes.iloc[abs(genes.iloc[:, 0]).argsort()[::-1]]
# drop duplicates based on index
genes = genes[~genes.index.duplicated(keep='first')]

# run prerank GSEA of database with GSEApy
res = gp.prerank(rnk=genes, 
                 gene_sets=db_dict,
                 #threads=4,
                 min_size=1, # minimum allowed number of genes from the gene set that are also in the data set. Default: 15.
                 max_size=100000, # maximum allowed number of genes from the gene set that are also in the data set. Default: 500.
                 permutation_num=1000, # number of permutations; reduce to speed up testing. Default: 1000. The minimal possible nominal p-value is about 1/nperm.
                 outdir=os.path.join(dir_results),
                 graph_num = 25, # Plot graphs for top sets of each phenotype.
                 format='png',
                 seed=42,
                 verbose=True,
                ).res2d

# move on if result is empty
if res.shape[0]==0:
    open(result_path, mode='a').close()
    sys.exit(0)

# annotate used gene set
res['Gene_set'] = db

# make column names language agnostic (i.e., R compatible)
column_names = list(res.columns.values)
column_names = [col.replace(" %","") for col in column_names]
column_names = [col.replace(" ","_") for col in column_names]
column_names = [col.replace("-","_") for col in column_names]
res.columns = column_names

# separate export
res.to_csv(result_path)

workflow/scripts/get_Enrichr_databases_GSEApy.py:
import json
import gseapy as gp
import os

# config

# output
results_path = snakemake.output['result_file']

# parameters
db = snakemake.params["database"] #gp.get_library_name() #-> gets all available names
organism = "Human" if (snakemake.config["genome"]=='hg19' or snakemake.config["genome"]=='hg38' ) else "Mouse"

# download the database to local dict
# db_dict = gp.parser.gsea_gmt_parser(db, min_size=0, max_size=100000,)
db_dict = gp.parser.download_library(name=db, organism=organism)

# save as json
with open(results_path, 'w') as fp:
    json.dump(db_dict, fp,  indent=4)

workflow/scripts/get_gmt_databases_GSEApy.py:
import json
import gseapy as gp
import os

# config

# input
gmt_path = snakemake.input[0]

# output
results_path = snakemake.output['result_file']

# parameters
db = snakemake.params["database"]

# read the local GMT database file into a dict
db_dict = gp.parser.read_gmt(gmt_path)

# save as json
with open(results_path, 'w') as fp:
    json.dump(db_dict, fp,  indent=4)

workflow/scripts/overview_plot.R:
library(ggplot2)
library(svglite)
library(reshape2)
library(pheatmap)

# source utility functions
# source("workflow/scripts/utils.R")
snakemake@source("./utils.R")

# configs
results_all_path <- snakemake@input[["results_all"]]

plot_path <- snakemake@output[["summary_plot"]]
adjp_hm_path <- snakemake@output[["adjp_hm"]]
effect_hm_path <- snakemake@output[["effect_hm"]]

tool <- snakemake@wildcards[["tool"]] #"ORA_GSEApy"
database <- snakemake@wildcards[["db"]] #"GO_Biological_Process_2021"
group <- snakemake@wildcards[["group"]] #"testgroup"

term_col <- snakemake@config[["column_names"]][[tool]][["term"]] #'Term'
adjp_col <- snakemake@config[["column_names"]][[tool]][["adj_pvalue"]] #'Adjusted_P_value'
effect_col <- snakemake@config[["column_names"]][[tool]][["effect_size"]] #'Odds_Ratio'

top_n <- snakemake@config[["top_terms_n"]] #5
adjp_cap <- snakemake@config[["adjp_cap"]] #4
effect_cap <- if (tool=="preranked_GSEApy") snakemake@config[["nes_cap"]] else snakemake@config[["or_cap"]] #5

adjp_th <- as.numeric(snakemake@config[["adjp_th"]][tool]) #0.05

# stop early if results are empty
if(file.size(results_all_path) == 0L){
    file.create(plot_path)
    file.create(adjp_hm_path)
    file.create(effect_hm_path)
    quit()
}

# load aggregated result dataframe
results_all <- read.csv(results_all_path, header= TRUE)

# stop early if results consist of only one query
if(length(unique(results_all$name))==1){
    file.create(plot_path)
    file.create(adjp_hm_path)
    file.create(effect_hm_path)
    quit()
}

# determine top_n most significant terms (not necessarily statistically significant!)
top_terms <- c()
for (query in unique(results_all$name)){
    tmp_result <- results_all[results_all$name==query,]
    tmp_terms <- tmp_result[order(tmp_result[[adjp_col]]), term_col][1:top_n]
    top_terms <- unique(c(top_terms,tmp_terms))
}

# make adjusted p-value and effect-size (odds-ratio or normalized enrichment score) dataframes
adjp_df <- dcast(results_all, as.formula(paste(term_col, "~ name")), value.var = adjp_col)
rownames(adjp_df) <- adjp_df[[term_col]]
adjp_df[[term_col]] <- NULL

effect_df <- dcast(results_all, as.formula(paste(term_col, "~ name")), value.var = effect_col)
rownames(effect_df) <- effect_df[[term_col]]
effect_df[[term_col]] <- NULL

# filter by top terms
adjp_df <- adjp_df[top_terms,]
effect_df <- effect_df[top_terms,]

# fill NA for effect_df with 1 or 0 (i.e., neutral enrichment) and for adjp_df with 1 (i.e., no significance)
effect_df[is.na(effect_df)] <- if (tool=="preranked_GSEApy") 0 else 1
adjp_df[is.na(adjp_df)] <- 1

# make stat. sign. annotation for effect-size plot later
adjp_annot <- adjp_df
adjp_annot[adjp_df < adjp_th] <- "*"
adjp_annot[adjp_df >= adjp_th] <- ""

# log2 transform odds ratios
if (tool!="preranked_GSEApy"){
    effect_df <- log2(effect_df)
}

# cap effect_df for plotting depending on tool: abs(log2(OR)) < or_cap or abs(NES) < nes_cap
effect_df[effect_df > effect_cap] <- effect_cap
effect_df[effect_df < -effect_cap] <- -effect_cap

# log10 transform adjp & cap -log10(adjpvalue) < adjp_cap
adjp_df <- -log10(adjp_df)
adjp_df[adjp_df > adjp_cap] <- adjp_cap

# plot hierarchically clustered heatmap for adjp and effect
width_hm <- 0.2 * dim(adjp_df)[2] + 5
height_hm <- 0.2 * dim(adjp_df)[1] + 3

pheatmap(adjp_df,
         display_numbers=adjp_annot,
         main="-log10(adj. p-values)",
         treeheight_row = 10,
         treeheight_col = 10,
         fontsize = 6,
         fontsize_number = 10,
         silent=TRUE,
         width=width_hm,
         height=height_hm,
         angle_col=45, 
         cellwidth=10,
         cellheight=10,
         filename=adjp_hm_path,
         breaks=seq(0, max(adjp_df), length.out=200),
         color=colorRampPalette(c("white", "red"))(200)
        )

pheatmap(effect_df,
         display_numbers=adjp_annot,
         main = if (tool=="preranked_GSEApy") effect_col else paste0("log2(",effect_col,")"),
         treeheight_row = 10,
         treeheight_col = 10,
         fontsize = 6,
         fontsize_number = 10,
         silent=TRUE,
         width=width_hm,
         height=height_hm,
         angle_col=45, 
         cellwidth=10,
         cellheight=10,
         filename=effect_hm_path,
         breaks=seq(-max(abs(effect_df)), max(abs(effect_df)), length.out=200),
         color=colorRampPalette(c("blue", "white", "red"))(200)
        )


# perform hierarchical clustering on the effect-sizes (NES or log2 odds ratios) of the terms and reorder dataframe
hc_rows <- hclust(dist(effect_df))
hc_row_names <- rownames(effect_df)[hc_rows$order]
hc_cols <- hclust(dist(t(effect_df)))
hc_col_names <- colnames(effect_df)[hc_cols$order]
effect_df <- effect_df[hc_rows$order, hc_cols$order]

# add a column for the terms
effect_df$terms <- rownames(effect_df)
# melt data frame for plotting
plot_df <- melt(data=effect_df,
                id.vars="terms",
                measure.vars=colnames(adjp_df),
                variable.name = "feature_set",
               value.name = "effect")

# add adjusted p-values to plot dataframe
plot_df$adjp <- apply(plot_df, 1, function(x) adjp_df[x['terms'], x['feature_set']])

# set effect-size and adjusted p-value conditional to NA (odds ratios == 0 and NES == 0) for plotting
plot_df$effect[plot_df$effect==0] <- NA
plot_df$adjp[plot_df$adjp==0] <- NA
# if (tool=="preranked_GSEApy"){
#     plot_df$effect[plot_df$effect==0] <- NA
# }else{
#     plot_df$effect[plot_df$effect<=0] <- NA
# }

# ensure that the order of terms and feature sets is kept
plot_df$terms <- factor(plot_df$terms,levels=hc_row_names)
plot_df$feature_set <- factor(plot_df$feature_set, levels=hc_col_names)

# plot
enr_plot <- ggplot(plot_df, aes(x=feature_set, y=terms, fill=effect, size=adjp))+ 
geom_point(shape=21, stroke=0.25) +
geom_point(data = plot_df[plot_df$adjp > -log10(adjp_th),], aes(x=feature_set, y=terms), shape=8, size=0.5, color = "black", alpha = 0.5) + # stars for statistical significance
# scale_fill_gradient(low="grey", high="red", breaks = c(1, 2, 3, 4), limits = c(0, 4), name="-log10(adjp)") +
scale_fill_gradient2(midpoint=0, low="royalblue4", mid="white", high="firebrick2", space ="Lab", name = if (tool=="preranked_GSEApy") effect_col else paste0("log2(",effect_col,")")) +
scale_y_discrete(label=addline_format) + 
scale_size_continuous(range = c(1,5), name = "-log10(adjp)") + # limits = c(0, adjp_cap) not needed, because the data is already capped
ggtitle(paste(tool, database, group, sep='\n')) +
clean_theme() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust=1),
      axis.title.x=element_blank(),
      axis.title.y=element_blank()) + guides(size = guide_legend(reverse=TRUE))

width <- 0.15 * dim(adjp_df)[2] + 3
height <- 0.2 * dim(adjp_df)[1] + 2
# options(repr.plot.width=width, repr.plot.height=height)
# enr_plot

# save plot
ggsave_new(filename=paste0(group,'_',database,'_summary'),
               results_path=dirname(plot_path),
               plot=enr_plot,
               width=width,
               height=height
              )

workflow/scripts/region_enrichment_analysis_GREAT.R:
library(LOLA)
library(GenomicRanges)
library(rGREAT)

# configs
regions_file <- snakemake@input[["regions"]]
background_file <- snakemake@input[["background"]]
gene_file <- snakemake@output[["genes"]]
result_paths <- snakemake@output[["results"]]

genome <- snakemake@config[["genome"]] #"hg38"
databases <- snakemake@config[["great_dbs"]]
region_set <- snakemake@params[["region_set"]]

dir_result <- dirname(gene_file)

# make result directory, if not exist
dir.create(file.path(dir_result), showWarnings = FALSE, recursive = TRUE)

# load query and background/universe region sets (e.g., consensus region set)
regionSet_query <- unname(readBed(regions_file))

# create empty results if query region set is too large for GREAT and quit i.e. >500,000 https://great-help.atlassian.net/wiki/spaces/GREAT/pages/655402/File+Size
if (length(regionSet_query)>500000){
    print("query regions set is too large (>500,000)")
    for (path in result_paths){
        file.create(path)
    }
    file.create(file.path(gene_file))
    quit()
}

# if query and background the same, do not use a background (used to map background regions to genes for downstream gene-based analyses)
if(regions_file!=background_file){
    regionSet_background <- unname(readBed(background_file))
    # check if background region set is not too large https://great-help.atlassian.net/wiki/spaces/GREAT/pages/655402/File+Size
    if (length(regionSet_background)>1000000){
        print("background regions set is too large (>1,000,000) and set to NULL i.e. whole genome")
        regionSet_background <- NULL
    }
}else{
    regionSet_background <- NULL
}


###### GREAT
job = submitGreatJob(regionSet_query, species = genome, bg=regionSet_background)

# save job description
capture.output(job, file=file.path(dir_result, 'job_description.txt'), append=TRUE)

# plot, get and save gene-region association
pdf(file=file.path(dir_result, paste0('region_gene_associations.pdf')), width=20, height=15)
res = plotRegionGeneAssociationGraphs(job, request_interval=600)
write.csv(res, file.path(dir_result, 'region_gene_associations.csv'), row.names=TRUE)
dev.off()

# save unique associated genes by using mcols(), which returns a DataFrame object containing the metadata columns.
genes <- unique(mcols(res)$gene)
write(genes, file.path(gene_file))
# write.table(genes, file.path(gene_file), sep='\t', row.names=FALSE, col.names=FALSE, quote = FALSE)

# get & save enrichment results
if(regions_file!=background_file){
    for (db in databases){
        tb <- getEnrichmentTables(job,  ontology = db, category=NULL, download_by = 'tsv')
        tb$Desc <- paste(tb$Desc, tb$ID)
    #     write.table(tb[[db]], file.path(dir_result, db, paste0(region_set,'_',db,'.tsv')), sep='\t', row.names=FALSE)
        write.csv(tb[[db]], file.path(dir_result, gsub(" ", "_", db), paste0(region_set,'_',gsub(" ", "_", db),'.csv')), row.names=FALSE)
    }
}else{
    for (path in result_paths){
        file.create(path)
    }
}

workflow/scripts/region_enrichment_analysis_LOLA.R:
library(LOLA)
library(GenomicRanges)

# configs
query_regions <- snakemake@input[["regions"]]
background_regions <- snakemake@input[["background"]]
genome <- snakemake@config[["genome"]] #"hg38" "mm10"
region_set <- snakemake@params[["region_set"]]
dir_result <- snakemake@params[["result_path"]]

### load data

# load query region sets
regionSet_query <- readBed(query_regions)

# load background/universe region sets (e.g., consensus region set)
regionSet_background <- readBed(background_regions)

# needs resources downloaded from: http://big.databio.org/regiondb/LOLACoreCaches_180412.tgz
# needs simpleCache package installed
LOLACore <- loadRegionDB(file.path("resources",snakemake@config[["project_name"]],"LOLA/nm/t1/resources/regions/LOLACore",genome))

if (genome=="hg19" | genome=="hg38"){
    jaspar_motifs <- loadRegionDB(file.path("resources",snakemake@config[["project_name"]],"LOLA/scratch/ns5bc/resources/regions/LOLAExt",genome), collections = 'jaspar_motifs')
    roadmap_epigenomics <- loadRegionDB(file.path("resources",snakemake@config[["project_name"]],"LOLA/scratch/ns5bc/resources/regions/LOLAExt",genome), collections = 'roadmap_epigenomics')
    dblist <- list(LOLACore=LOLACore, jaspar_motifs=jaspar_motifs,roadmap_epigenomics=roadmap_epigenomics)
} else{
    dblist <- list(LOLACore=LOLACore)
}

###### LOLA

# run LOLA and save results for every database
for (db in names(dblist)){

    # run LOLA
    locResults <- runLOLA(regionSet_query, regionSet_background, dblist[[db]], cores=1)

    # determine adjusted p-values via BH method -> not needed qvalue already provided
    # locResults[['adjPvalueLog']] <- -1*log10(p.adjust(('^'(10,-1*locResults[['pValueLog']])), method = 'BH'))

    # make description more descriptive
    if (db=='LOLACore'){
        locResults$description <- paste(locResults$description, locResults$cellType, locResults$antibody,sep='.')
    }else{
        locResults$description <- locResults$filename
    }

    # format description that values are unique
    locResults$description <- make.names(locResults$description, unique=TRUE)

    # determine raw p-value
    locResults$pValue <- ('^'(10,-1*locResults[['pValueLog']]))

    # export all results
    dir.create(file.path(dir_result, region_set, "LOLA", db), showWarnings = FALSE)
#     writeCombinedEnrichment(locResults, outFolder= file.path(dir_result, db), includeSplits=TRUE)

    write.csv(locResults, file.path(dir_result, region_set, "LOLA", db, paste0(region_set,'_',db,'.csv')), row.names=FALSE)
}    