diffEnrich

Introduction

The goal of diffEnrich is to compare functional enrichment between two experimentally-derived groups of genes or proteins. Given a list of NCBI gene symbols, diffEnrich will perform differential enrichment analysis using the Kyoto Encyclopedia of Genes and Genomes (KEGG) REST API. This package provides a number of functions that are intended to be used in a pipeline (See Figure 1). Briefly, the user provides a KEGG formatted species id for either human, mouse or rat, and the package will download and store species specific ENTREZ gene IDs and map them to their respective KEGG pathways by accessing the KEGG REST API. The KEGG API is used to guarantee the most up-to-date pathway data from KEGG. Next, the user will identify significantly enriched pathways in two different gene sets, and finally, the user will identify pathways that are differentially enriched between the two gene sets. In addition to the analysis pipeline, this package also provides a plotting function.

The KEGG REST API

KEGG is a database resource for understanding high-level functions of a biological system, such as a cell, an organism and an ecosystem, from genomic and molecular-level information https://www.kegg.jp/kegg/kegg1a.html. KEGG is an integrated database resource consisting of eighteen databases that are clustered into 4 main categories: 1) systems information (e.g. hierarchies and maps), 2) genomic information (e.g. genes and proteins), 3) chemical information (e.g. biochemical reactions), and 4) health information (e.g. human disease and drugs) https://www.kegg.jp/kegg/kegg1a.html.

In 2012 KEGG released its first application programming interface (API), and has been adding features and functionality ever since. There are benefits to using an API. First, API’s, like KEGG’s, allow users to perform customized analyses with the most up-to-date versions of the data contained in the database. In addition, accessing the KEGG API is very easy using statistical programming tools like R or Python and integrating data retrieval into user’s code makes the program reproducible. To further enforce reproducibilty diffEnrich adds a date and KEGG release tag to all data files it generates from accessing the API. For update histories and release notes for the KEGG REST API please go here.

Figure 1. diffEnrich Analysis pipeline. Functions within the diffEnrich package are represented by blue rectangles. The data that must be provided by the user is represented by the purple ovals. Data objects generated by a function in diffEnrich are represented by red ovals. The external call of the get_kegg function to the KEGG REST API is represented in yellow.

Motivating experimental design for differential enrichment

Often high throughput omics studies include a functional enrichment analysis to glean biological insight from a list of candidate genes, proteins, metabolites, etc. Functional enrichment examines whether the number of genes in the list associated with a biological function or particular pathway is more than would be expected by chance. As an example, enrichment of a particular pathway among a list of genes that are differentially expressed after an experimental manipulation may indicate that the pathway has been altered by that manipulation. This analysis is rather straight forward and many solutions have been offered (e.g., [Haung et al., 2009]https://pubmed.ncbi.nlm.nih.gov/19033363/); Kuleshov et al., 2016; Liao et al., 2019; Subramanian et al., 2005). A wide variety of databases have also been used to define these pathways (e.g., Kanehisa and Goto, 2000) and ontologies (e.g., Ashburn et al., 2000).

One key component of a statistically rigorous functional enrichment analysis is the definition of a background data set that can be used to estimate the number of candidate genes that are ‘expected’ to be associated with the pathway by chance, e.g., if 5% of genes in the background data set are associated with a pathway then 5% of candidate gene are expected to be associated with the pathway by chance. For many study designs, the background data set is relatively simple to define (e.g., RNA-Seq analyses where the background data set includes genes expressed above background).

However, for some newer omics technologies, the background data set is hard to define. For example, LC-MS analysis can be used to identify carbonylated proteins ( Peterson et al., 2018; Shearn et al., 2019; Shearn et al., 2018). With this study design, carbonylated proteins are isolated using a BH-derivation and then LC-MS is used to identify peptides in this isolated sample. The most appropriate background data set would be proteins present in that tissue, but this would require a separate analytical analysis. Furthermore, most functional enrichment analyses involve a single gene list. However, in protein modification studies, the typical experimental design compares the presence or absence of particular modified proteins between multiple groups.

When there are two or more gene lists to compare and the background gene list is not clearly defined, as is often the case in protein modification experiments, we propose a differential enrichment analysis. In this analysis, we compare the proportion of genes/proteins from one gene list associated with a particular pathway to the proportion of genes/proteins from a second gene list that are associated with that pathway. To easily execute this analysis, we have designed an R package that uses the KEGG REST API to obtain the most recent version of the KEGG PATHWAY ( Kanehisa and Goto, 2000) database to initially identify functional enrichment within a gene list using the entire KEGG transcriptome as the background data set and then to identify differentially enriched pathways between two gene lists. This R package includes a function to generate a “differential enrichment” graphic.

Installation

You can install the released version of diffEnrich from CRAN with:

install.packages("diffEnrich")

Example

Step 1: Collect and store pathways from KEGG API

First we will use the get_kegg function to access the KEGG REST API and download the data sets required to perform our downstream analysis. This function takes two arguments. The first, ‘species’ is required. Currently, diffEnrich supports three species, and the argument is a character string using the KEGG code https://www.pnas.org/doi/10.1073/pnas.0806162105: Homo sapiens (human), use ‘hsa’; Mus musculus (mouse), use ‘mmu’; and Rattus norvegicus (rat), use ‘rno’. The second, ‘path’ is also passed as a character string, and is the path of the directory in which the user would like to write the data sets downloaded from the KEGG REST API. If the user does not provide a path, the data sets will be automatically written to the current working directory using the here::here() functionality. These data sets will be tab delimited files with a name describing the data, and for reproducibility, the date they were generated and the version of KEGG when the API was accessed. In addition to these flat files, get_kegg will also create a named list in R with the three relevant KEGG data sets. The names of this list will describe the data set. For a detailed description of list elements use ?get_kegg.

## run get_kegg() using rat
kegg_rno <- get_kegg('rno')
#> 3 data sets will be written as tab delimited text files
#> File location: /private/var/folders/ms/g80_v_b157b6tryj42yxgfjm0000gq/T/Rtmpzqq8Ou/Rbuildf55625004d35/diffEnrich
#> Kegg Release: Release_102.0+_06-27_Jun_22

Here are examples of the output files:

ncbi_to_kegg2019-10-17Release_92.0+_10-17_Oct_19.txt
kegg_to_pathway2019-10-17Release_92.0+_10-17_Oct_19.txt
pathway_to_species2019-10-17Release_92.0+_10-17_Oct_19.txt

Note: Because it is assumed that the user might want to use the data sets generated by get_kegg, it is careful not to overwrite data sets with exact names. get_kegg checks the path provided for data sets generated ‘same-day/same-version’, and if it finds even one of the three, it will not re-write any of the data sets. It will still however, let the user know it is not writing out new data sets and still generate the named list object. Users can generate ‘same-day/same-version’ data sets in different directories if they so choose.

## run get_kegg() using rat
kegg_rno <- get_kegg('rno')
#> These files already exist in your working directory. New files will not be generated.
#> Kegg Release: Release_102.0+_06-27_Jun_22

## run get_kegg() using rat
kegg_rno <- get_kegg(read = TRUE, path = "/path/to/files", date = "2019-10-17", release = "92")

Here is an example of the output:

Reading in the following files:
ncbi_to_kegg2019-10-17Release_92.0+_10-17_Oct_19.txt
kegg_to_pathway2019-10-17Release_92.0+_10-17_Oct_19.txt
pathway_to_species2019-10-17Release_92.0+_10-17_Oct_19.txt
File location: ~/Desktop

Table 1. Description of data sets generated from accessing KEGG REST API
get_kegg.list.object	Object.description
ncbi_to_kegg	ncbi gene ID <– mapped to –> KEGG gene ID
kegg_to_pathway	KEGG gene ID <– mapped to –> KEGG pathway ID
pathway_to_species	KEGG pathway ID <– mapped to –> KEGG pathway species description

Step 2: Perform enrichment analysis of individual gene sets

In this step we will use the pathEnrich function to identify KEGG pathways that are enriched (i.e. over-represented) based on a gene list of interest. User gene lists must be character vectors and be formatted as ENTREZ gene IDs. The clusterProfiler package offers a nice function (bitr) that maps gene symbols and Ensembl IDs to ENTREZ gene IDs, and an example can be seen in their vignette.

## View sample gene lists from package data
head(geneLists$list1)
#> [1] "361692"    "293654"    "293655"    "500974"    "100361529" "171434"
head(geneLists$list2)
#> [1] "315547" "315548" "315549" "315550" "50938"  "58856"

This function may not always use the complete list of genes provided by the user. Specifically, it will only use the genes from the list provided that are also in the most current species list pulled from the KEGG REST API using get_kegg, or from the older KEGG data loaded by the user from a previous get_kegg call. Users can also decide which KEGG pathways should be tested based on how many genes from their gene list are contained in the KEGG pathway. Users can set this parameter by changing the ‘N’ argument. The default is N = 2. The pathEnrich function should be run once for the genes of interest in list 1 and once for the genes of interest in list2. Each pathEnrich call generates a data frame summarizing the results of an enrichment analysis in which a Fisher’s Exact test is used to identify which KEGG pathways are enriched for the user’s list of genes compared to all genes annotated to a KEGG pathway. Users can choose a multiple correction option from those supported by stats::p.adjust. The default is the False Discovery Rate ( Benjamini and Hochberg, 1995), and the default threshold to reach significance is 0.05.

# run pathEnrich using kegg_rno
## List 1
list1_pe <- pathEnrich(gk_obj = kegg, gene_list = geneLists$list1, cutoff = 0.05, N = 2)
## list2
list2_pe <- pathEnrich(gk_obj = kegg, gene_list = geneLists$list2, cutoff = 0.05, N = 2)

Table 2. First 6 rows of list1_pe data frame generated using pathEnrich
KEGG_PATHWAY_ID	KEGG_PATHWAY_description	KEGG_PATHWAY_cnt	KEGG_PATHWAY_in_list	KEGG_DATABASE_cnt	KEG_DATABASE_in_list	expected	enrich_p	p_adj	fold_enrichment
rno04530	Tight junction - Rattus norvegicus (rat)	170	19	8856	295	5.662827	0.0000035	0.0005387	3.355214
rno05135	Yersinia infection - Rattus norvegicus (rat)	128	16	8856	295	4.263776	0.0000049	0.0005387	3.752542
rno05210	Colorectal cancer - Rattus norvegicus (rat)	88	12	8856	295	2.931346	0.0000318	0.0023211	4.093683
rno05231	Choline metabolism in cancer - Rattus norvegicus (rat)	99	12	8856	295	3.297764	0.0001032	0.0044701	3.638829
rno05213	Endometrial cancer - Rattus norvegicus (rat)	58	9	8856	295	1.932023	0.0001132	0.0044701	4.658328
rno04144	Endocytosis - Rattus norvegicus (rat)	275	22	8856	295	9.160456	0.0001225	0.0044701	2.401627

pathEnrich generates a list object that contains a data frame with 9 columns described below as well as other metadata from the function call. Details also provided in pathEnrich documentation. S3 generic functions for print and summary are provided. The print function prints the results table as a tibble, and the summary function returns the number of pathways that reached statistical significance as well as their descriptions, the number of genes used from the KEGG data base, the KEGG species, and the method used for multiple testing correction.

summary(list1_pe)
#> 219 KEGG pathways were tested. 
#>  Only KEGG pathways that contained at least 2 genes from gene_list were tested. 
#>  KEGG pathway species: Rattus norvegicus (rat)
#>  8856 genes from gene_list were in the KEGG data pull. 
#>  p-value adjustment method: BH
#>  36 pathways reached statistical significance after multiple testing correction at a cutoff of 0.05. 
#>  
#> Significant pathways: 
#>  Tight junction
#> Yersinia infection
#> Colorectal cancer
#> Choline metabolism in cancer
#> Endometrial cancer
#> Endocytosis
#> Neurotrophin signaling pathway
#> Thermogenesis
#> Oocyte meiosis
#> VEGF signaling pathway
#> Thyroid hormone signaling pathway
#> Hippo signaling pathway
#> T cell receptor signaling pathway
#> Apoptosis
#> Hepatocellular carcinoma
#> MAPK signaling pathway
#> Focal adhesion
#> Salmonella infection
#> Non-alcoholic fatty liver disease (NAFLD)
#> ErbB signaling pathway
#> Sphingolipid signaling pathway
#> Pancreatic cancer
#> Progesterone-mediated oocyte maturation
#> Alzheimer disease
#> Endocrine resistance
#> Adrenergic signaling in cardiomyocytes
#> IL-17 signaling pathway
#> Chronic myeloid leukemia
#> Dopaminergic synapse
#> Prostate cancer
#> EGFR tyrosine kinase inhibitor resistance
#> Hepatitis C
#> Ras signaling pathway
#> Acute myeloid leukemia
#> Insulin signaling pathway
#> Fc epsilon RI signaling pathway

Table 3. Description of columns is dataframe generated by pathEnrich
Column Names	Column Description
KEGG_PATHWAY_ID	KEGG Pathway Identifier
KEGG_PATHWAY_description	Description of KEGG Pathway (provided by KEGG)
KEGG_PATHWAY_cnt	Number of Genes in KEGG Pathway
KEGG_PATHWAY_in_list	Number of Genes from gene list in KEGG Pathway
KEGG_DATABASE_cnt	Number of Genes in KEGG Database
KEG_DATABASE_in_list	Number of Genes from gene list in KEGG Database
expected	Expected number of genes from list to be in KEGG pathway by chance
enrich_p	P-value for enrichment within the KEGG pathway for list genes
p_adj	Multiple testing adjusted enrichment p-values (default = False Discovery Rate (Benjamini and Hochberg, 1995))
fold_enrichment	Ratio of number of genes observed from the gene list annotated to the KEGG pathway to the number of genes expected from the gene list to be annotated to the KEGG pathway if there was no enrichment (i.e. KEGG_PATHWAY_in_list/expected)

Step 3: Identify differentially enriched KEGG pathways

The diffEnrich function will merge two results from the pathEnrich calls generated above. Specifically, the data frame ‘list1_pe’ and the data frame ‘list2_pe’ will be merged by the following columns: “KEGG_PATHWAY_ID”, “KEGG_PATHWAY_description”, “KEGG_PATHWAY_cnt”, “KEGG_DATABASE_cnt”. This merged data set will then be used to perform differential enrichment using the same method and p-value calculation as described above. Users do have the option of choosing a method for multiple testing adjustment. Users can choose from those supported by stats::p.adjust. The default is the False Discovery Rate ( Benjamini and Hochberg, 1995). KEGG pathways that do not contain any genes from either gene list (i.e., list1_pe\(enrich_table\)KEGG_PATHWAY_in_list for ‘rno04530’ = 0 AND list2_pe\(enrich_table\)KEGG_PATHWAY_in_list for ‘rno04530’ = 0) will be removed as these cannot be tested. If this is the case a warning will be printed that tells the user how many pathways were removed. This can be avoided by setting the ‘N’ parameter to a value > 0 in the pathEnrich calls.

## Perform differential enrichment
diff_enrich <- diffEnrich(list1_pe = list1_pe, list2_pe = list2_pe, method = 'none', cutoff = 0.05)

Table 4. First 6 rows from data frame generated by diffEnrich
KEGG_PATHWAY_ID	KEGG_PATHWAY_description	KEGG_PATHWAY_cnt	KEGG_DATABASE_cnt	KEGG_PATHWAY_in_list1	KEGG_DATABASE_in_list1	expected_list1	enrich_p_list1	p_adj_list1	fold_enrichment_list1	KEGG_PATHWAY_in_list2	KEGG_DATABASE_in_list2	expected_list2	enrich_p_list2	p_adj_list2	fold_enrichment_list2	odd_ratio	diff_enrich_p	diff_enrich_adjusted
rno04530	Tight junction - Rattus norvegicus (rat)	170	8856	19	295	5.662827	0.0000035	0.0005387	3.355214	131	5308	101.89250	0.0000015	0.0000056	1.285669	0.3676651	0.0002936	0.0002936
rno05135	Yersinia infection - Rattus norvegicus (rat)	128	8856	16	295	4.263776	0.0000049	0.0005387	3.752542	105	5308	76.71906	0.0000001	0.0000003	1.368630	0.3520039	0.0005435	0.0005435
rno05210	Colorectal cancer - Rattus norvegicus (rat)	88	8856	12	295	2.931346	0.0000318	0.0023211	4.093683	81	5308	52.74435	0.0000000	0.0000000	1.535709	0.3655602	0.0032572	0.0032572
rno05213	Endometrial cancer - Rattus norvegicus (rat)	58	8856	9	295	1.932023	0.0001132	0.0044701	4.658328	55	5308	34.76332	0.0000000	0.0000000	1.582127	0.3328275	0.0058775	0.0058775
rno04660	T cell receptor signaling pathway - Rattus norvegicus (rat)	106	8856	11	295	3.530940	0.0007743	0.0129121	3.115318	79	5308	63.53297	0.0011075	0.0022562	1.243449	0.3901694	0.0072048	0.0072048
rno04657	IL-17 signaling pathway - Rattus norvegicus (rat)	95	8856	9	295	3.164521	0.0042709	0.0346419	2.844032	59	5308	56.93993	0.3737223	0.4591045	1.036180	0.3572935	0.0087538	0.0087538

Table 5. Description of columns in dataframe generated by diffEnrich
Column Names	Column Description
KEGG_PATHWAY_ID	KEGG Pathway Identifier
KEGG_PATHWAY_description	Description of KEGG Pathway (provided by KEGG)
KEGG_PATHWAY_cnt	Number of Genes in KEGG Pathway
KEGG_DATABASE_cnt	Number of Genes in KEGG Database
KEGG_PATHWAY_in_list1	Number of Genes from gene list 1 in KEGG Pathway
KEGG_DATABASE_in_list1	Number of Genes from gene list 1 in KEGG Database
expected_list1	Expected number of genes from list 1 to be in KEGG pathway by chance
enrich_p_list1	P-value for enrichment of list 1 genes related to KEGG pathway
p_adj_list1	Multiple testing adjusted enrichment p-values from gene list 1 (default = False Discovery Rate (Benjamini and Hochberg, 1995))
fold_enrichment_list1	Ratio of number of genes observed from gene list 1 annotated to the KEGG pathway to the number of genes expected from gene list 1 annotated to the KEGG pathway if there was no enrichment (i.e. KEGG_PATHWAY_in_list1/expected_list1)
KEGG_PATHWAY_in_list2	Number of Genes from gene list 2 in KEGG Pathway
KEGG_DATABASE_in_list2	Number of Genes from gene list 2 in KEGG Database
expected_list2	Expected number of genes from list 2 to be in KEGG pathway by chance
enrich_p_list2	P-value for enrichment of list 2 genes related to KEGG pathway
p_adj_list2	Multiple testing adjusted enrichment p-values from gene list 2 (default = False Discovery Rate (Benjamini and Hochberg, 1995))
fold_enrichment_list2	Ratio of number of genes observed from gene list 2 annotated to the KEGG pathway to the number of genes expected from gene list 2 annotated to the KEGG pathway if there was no enrichment (i.e. KEGG_PATHWAY_in_list2/expected_list2)
odd_ratio	Odds of a gene from list 2 being from this KEGG pathway / Odds of a gene from list 1 being from this KEGG pathway
diff_enrich_p	P-value for differential enrichment of this KEGG pathway between list 1 and list 2
diff_enrich_adjusted	Multiple testing adjusted differential enrichment p-values (default = False Discovery Rate (Benjamini and Hochberg, 1995))

The result of the diffEnrich call is a list object that contains a data frame with the estimated odds ratio generated by the Fisher’s Exact test and the associated p-value as well as other metadata from the function call. S3 generic functions for print and summary are provided. The print function prints the results table as a tibble, and the summary function returns the number of pathways that reached statistical significance as well as their descriptions, the number of genes used from the KEGG database, the KEGG species, the number of pathways that were shared (i.e. had at least 1 gene from each gene list present in the pathway based on what the user chose for N in pathEnrich) between the gene lists and the method used for multiple testing correction.

summary(diff_enrich)
#> 219 KEGG pathways were shared between gene lists and were tested. 
#>  KEGG pathway species: Rattus norvegicus (rat)
#>  8856 genes from gene_list were in the KEGG data pull. 
#>  p-value adjustment method: none
#>  19 pathways reached statistical significance after multiple testing correction at a cutoff of 0.05. 
#>  
#> Significant pathways: 
#>  Tight junction
#> Yersinia infection
#> Colorectal cancer
#> Endometrial cancer
#> T cell receptor signaling pathway
#> IL-17 signaling pathway
#> Salmonella infection
#> Choline metabolism in cancer
#> Endocytosis
#> VEGF signaling pathway
#> Oocyte meiosis
#> Thermogenesis
#> Hippo signaling pathway
#> Neurotrophin signaling pathway
#> Apoptosis
#> Hepatocellular carcinoma
#> Fc epsilon RI signaling pathway
#> Thyroid hormone signaling pathway
#> Alzheimer disease

Figures

Step 4: Plot fold enrichment

plotFoldEnrichment generates a grouped bar plot using ggplot2 and the ggnewscale package. There are 3 arguments: 1) de_res is the dataframe generated from the diffEnrich function, 2) pval is the threshold for the adjusted p-value associated with differential enrichment that will filter which KEGG pathways to plot, and 3) after filtering based on pval N tells the function how many pathways to plot. It is important to make a note that the significance of the fold change is associated with the number of genes in the gene list. Notice that in this example the pathways in gene list 2 have smaller fold changes (shorter bars) than those in list 1, but that many of them are more significant (darker blue). This is because there are more genes in gene list 2 compared to gene list 1.

## Plot fold enrichment
plotFoldEnrichment(de_res = diff_enrich, pval = 0.05, N = 5)

Figure 2. Example of a differential enrichment graphic. KEGG pathways are plotted on the y-axis and fold enrichment is plotted on the x-axis. Each KEGG pathway has a bar depicting its fold enrichment in list 1 (red) and its fold enrichment in list 2 (blue). The transparency of the bars correspond to the unadjusted p-value for the pathway’s enrichment in the given list. The p-value presented as text to the right of each pair of bars is the adjusted p-value (user defined: default is FDR) associated with the differential enrichment of the pathway between the two lists, and the pathways are ordered from top to bottom by this p-value (i.e. smallest p-value on top of plot, and largest p-value on bottom of plot). The dotted line represents a fold enrichment of 1. Finally, the number of genes used for analysis from each gene list (recall that this number may not be the same as the number of genes in the user’s original list) are reported below their respective p-values in the legend.

diffEnrich by example

Harry Smith

2022-06-27