Knowledge Graphs

Thomas Charlon

2024-09-19

Knowledge graphs

Background

Knowledge graphs make it possible to organize vast amounts of knowledge and have been used extensively on the internet to help websites share data and integrate with one another. In research and knowledge discovery, they help organize results and provide specific views of them by showing subsets of interest.

Knowledge graphs are defined by sets of nodes and the relationships between them (in the context of graphs, relationships between nodes are called edges). Both nodes and edges can be of different types. One commonly used type of relationship is “parent-child”: in the context of mental health, for example, schizophrenia “is a” mental health disorder. This is a directed relationship; examples of undirected relationships include pair-wise similarities such as the cosine similarity.
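As a minimal sketch of these two kinds of relationships, using only base R: a directed edge can be stored as a row of a data frame, while a cosine similarity is symmetric in its two arguments. The column names `from` and `to`, the helper `cosine_sim` and the toy vectors are illustrative assumptions, not part of the ‘kgraph’ package.

```r
# Directed relationships: hypothetical edge list where 'from' "is a" 'to'.
df_is_a <- data.frame(
  from = c("schizophrenia", "bipolar disorder"),
  to = c("mental health disorder", "mental health disorder"),
  stringsAsFactors = FALSE
)

# Undirected relationship: cosine similarity between two toy vectors.
# 'cosine_sim' is a hypothetical helper written for this illustration.
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

v_a <- c(1, 2, 3)
v_b <- c(2, 4, 7)

# The similarity does not depend on the order of the arguments.
sim_ab <- cosine_sim(v_a, v_b)
sim_ba <- cosine_sim(v_b, v_a)
```

Swapping `from` and `to` in the first data frame would change the meaning of the edge, whereas `sim_ab` and `sim_ba` are the same value.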

The ‘kgraph’ package provides the ability to easily build knowledge graphs and pass them to the ‘sgraph’ package for visualization. The ‘kgraph’ package focuses on building the complex graphs that arise when we build knowledge graphs, with a particular focus on clinical and biomedical data, although the methods aim to be general enough for use in any application. The ‘sgraph’ package focuses on interfacing with the Sigma.js library and performs minimal computation.

There are three main computations performed by the ‘kgraph’ package:

Features overview

The kgraph package provides two main features: building a knowledge graph from a data frame of concept relationships (weighted by p-values or cosine similarities, for example), and building that data frame of concept relationships from an embedding matrix. The second feature thus logically operates before the first. The user can either pass a data frame of weighted relationships, such as p-values or pre-computed similarities, directly to the build_kgraph function, or pass a data matrix to the fit_embeds_kg function, which computes the similarities and determines a threshold.

Building graphs using specified relationships

Minimal call

The first way to use the ‘kgraph’ package is to call the build_kgraph function with a list of selected concepts and a data frame consisting of 3 columns: ‘concept1’, ‘concept2’ and ‘weight’. The first two columns define 2 nodes in a relationship, and the third defines the weight of the relationship: the higher the weight, the shorter the edge will tend to be. Thus, when using p-values, one should first apply a negative log10 transformation so that stronger associations receive higher weights.
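The p-value transformation can be sketched in base R as follows. The concept names and p-values below are made up for illustration; only the three column names ‘concept1’, ‘concept2’ and ‘weight’ come from the description above.

```r
# Hypothetical association results: one row per trait-SNP pair.
df_assoc <- data.frame(
  concept1 = c("trait_A", "trait_A", "trait_B"),
  concept2 = c("snp_1", "snp_2", "snp_1"),
  pval = c(1e-8, 5e-6, 1e-10),
  stringsAsFactors = FALSE
)

# Smaller p-values become larger weights.
df_assoc$weight <- -log10(df_assoc$pval)

# Keep only the three columns described above.
df_weights <- df_assoc[c("concept1", "concept2", "weight")]
df_weights
```

After the transformation, the most significant association (p = 1e-10) carries the largest weight.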

In the case of a graph with a single node of interest, the df_weights data frame is searched for all rows containing the selected node, and the size of the node of the second concept is proportional to the weight of the relationship. For graphs with multiple nodes of interest, a graph is first built for each node taken individually, and the graphs are then merged. For nodes connected to several nodes of interest, the maximum of the weights to each node of interest is retained to determine the final node size.
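The merging rule for node sizes can be illustrated with a small base R sketch. This is not the package's internal code, only an illustration of the rule described above on made-up data; the `neighbor` column and the object names are assumptions.

```r
# Toy weights: 'snp_1' is linked to both nodes of interest.
df_weights <- data.frame(
  concept1 = c("trait_A", "trait_A", "trait_B", "trait_B"),
  concept2 = c("snp_1", "snp_2", "snp_1", "snp_3"),
  weight = c(8, 5, 10, 6),
  stringsAsFactors = FALSE
)
selected <- c("trait_A", "trait_B")

# Rows involving at least one node of interest.
is_hit <- df_weights$concept1 %in% selected | df_weights$concept2 %in% selected
df_sub <- df_weights[is_hit, ]

# The neighbor is whichever end of the edge is not the node of interest.
df_sub$neighbor <- ifelse(df_sub$concept1 %in% selected,
                          df_sub$concept2, df_sub$concept1)

# One size per neighbor: the maximum weight over all nodes of interest.
df_sizes <- aggregate(weight ~ neighbor, data = df_sub, FUN = max)
df_sizes
```

Here ‘snp_1’ is connected to both traits with weights 8 and 10, and the larger value is the one retained.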

Data preparation

As an example of a concept relationships data frame, the ‘kgraph’ package includes GWAS results on suicide behavior downloaded from https://www.ebi.ac.uk/gwas/efotraits/EFO_0007623.

In the scripts folder of the package, the file gwas_data.R creates an rda file based on it, which we load here.

  library(kgraph)
  data('df_pval')
  head(df_pval)
#>         concept1   concept2   weight
#> 2910 EFO_0004320  rs6557168 11.52288
#> 3010 EFO_0007623  rs6557168 11.52288
#> 354  EFO_0004321 rs62474683 11.04576
#> 111  EFO_0007624 rs34399104 10.39794
#> 1    EFO_0004320 rs73581580 10.39794
#> 2    EFO_0007623 rs73581580 10.39794

Building the igraph object

We can then call the build_kgraph function directly:

  kg_obj = build_kgraph('EFO_0007623', df_pval)              

The kg_obj object returned by build_kgraph is a classic graph object consisting of two data frames: one defining the edges (df_links) and one defining the nodes (df_nodes). We can then convert this graph object to an ‘igraph’ object with the l_graph_to_igraph function from the ‘sgraph’ package, which is essentially a wrapper around the graph_from_data_frame function from the ‘igraph’ package that directly adds the node data frame to the ‘igraph’ object.

  ig_obj = sgraph::l_graph_to_igraph(kg_obj)

Visualization with Sigma.js

We then build the ‘sgraph’ object from the ‘igraph’ object with the sgraph_clusters function from the ‘sgraph’ package. Here we can specify the layout of the ‘igraph’ object: in this example a spring layout (layout_with_kk), but possibly also a force-directed layout with layout_with_fr, or any other ‘igraph’ layout. We can also modify some layout parameters, e.g. the number of iterations.

  sg_obj = sgraph::sgraph_clusters(ig_obj, node_size = 'weight',
                                   label = 'label',
                                   layout = igraph::layout_with_kk(ig_obj))

The ‘sgraph’ object is now ready to be visualized:

  sg_obj

Providing a dictionary

Most of the time, we will want to integrate and overlay data from a second data frame: the node dictionary. This data frame will contain at minimum the columns ‘id’ (corresponding to concept names in ‘concept1’ and ‘concept2’ columns of the weights data frame) and ‘desc’ (the textual descriptions, i.e. labels, that will appear on the graph). Optionally, the data frame can contain the columns ‘color’ (for nodes’ color) and ‘group’ (detailed below).
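As a minimal sketch, a dictionary with these columns can be built by hand in base R. All identifiers and labels below are made up. The final check that every concept has a dictionary entry is a suggested sanity check, not a documented requirement of build_kgraph.

```r
# Hypothetical dictionary: 'id' and 'desc' are the minimum columns,
# 'color' and 'group' are optional.
df_dict <- data.frame(
  id = c("trait_A", "snp_1", "snp_2"),
  desc = c("Trait A", "snp_1", "snp_2"),
  color = c("Phenotype", "SNP", "SNP"),
  group = c("Trait A", "GENE1", "GENE2"),
  stringsAsFactors = FALSE
)

# Hypothetical weights data frame using the same identifiers.
df_weights <- data.frame(
  concept1 = c("trait_A", "trait_A", "trait_A"),
  concept2 = c("snp_1", "snp_2", "snp_3"),
  weight = c(8, 5, 6),
  stringsAsFactors = FALSE
)

# List the concepts that have no entry in the dictionary.
all_concepts <- unique(c(df_weights$concept1, df_weights$concept2))
missing_ids <- setdiff(all_concepts, df_dict$id)
missing_ids
```

In this toy example ‘snp_3’ appears in the weights but not in the dictionary, so it is reported in `missing_ids`.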

The script gwas_data.R builds a basic dictionary that will map the traits’ URI to labels, and SNPs to genes. We load directly the .rda file here.

  data('df_pval_dict')
  head(df_pval_dict)
#>              id              desc             group     color
#> 22  EFO_0007624           suicide           suicide Phenotype
#> 35  EFO_0004320 suicidal ideation suicidal ideation Phenotype
#> 48  EFO_0007623 suicide behaviour suicide behaviour Phenotype
#> 154 EFO_0004321 attempted suicide attempted suicide Phenotype
#> 359  rs73581580        rs73581580              EXD3       SNP
#> 360   rs3757323         rs3757323              ESR1       SNP

Then call build_kgraph with the df_dict parameter.

  kg_obj = build_kgraph('EFO_0007623', df_pval, df_dict = df_pval_dict)              

The get_sgraph function is a wrapper to build the ‘igraph’ and ‘sgraph’ objects, and the ... parameter is passed to the sgraph_clusters function.

  sg_obj = get_sgraph(kg_obj)
  sg_obj

Multiple nodes

We can specify multiple nodes of interest to see which SNPs are associated with several phenotypes.

  kg_obj = build_kgraph(c('EFO_0007623', 'EFO_0007624'), df_pval, df_pval_dict)

  sg_obj = get_sgraph(kg_obj)
  sg_obj

Groupings

The node dictionary data frame can include a ‘group’ column, which will add groups as nodes in the graph. There are two main ways of displaying groups in the ‘kgraph’ package, and each should be used with a specific graph layout from the ‘igraph’ package for best results.

In the first case (‘floating’), the group nodes will be connected to all nodes belonging to each group and should be used with the layout_with_kk layout function, which will have the effect of pulling together nodes from similar groups while keeping their direct relationships to nodes of interest.

In the second case (‘anchored’), nodes’ direct edges to concepts of interest will be removed and group nodes will be connected to concepts of interest instead. This layout should be used with the layout_with_fr function (force-directed), which will have the effect of using the groups as intermediary nodes. Additionally, nodes in a group can be hidden by default and revealed only when the group node is clicked upon.

Groups with a single child node will automatically be set to a common group labeled ‘Other’. This behavior can be disabled by setting the rm_single_groups parameter to FALSE, and the label can be replaced through the display_val_str parameter.
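The effect of collapsing single-member groups can be sketched in base R on a made-up dictionary. This illustrates the behavior described above and is not the package's own implementation; the object names are assumptions.

```r
# Toy dictionary: 'GENE1' has two members, 'GENE2' and 'GENE3' have one each.
df_dict <- data.frame(
  id = c("snp_1", "snp_2", "snp_3", "snp_4"),
  group = c("GENE1", "GENE1", "GENE2", "GENE3"),
  stringsAsFactors = FALSE
)

# Count the members of each group.
group_sizes <- table(df_dict$group)

# Groups with a single member are relabeled with a common label.
single_groups <- names(group_sizes)[group_sizes == 1]
df_dict$group[df_dict$group %in% single_groups] <- "Other"
df_dict
```

After this step, ‘snp_3’ and ‘snp_4’ share the ‘Other’ group, which avoids adding one group node per isolated concept.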

In this vignette, we focus only on the first case, although functions are available to use the second type for users who want to explore that path. The second type might be demonstrated in a second vignette with other features in development.

Building a fit object from a data matrix

Overview

Up to here, the data frames we provided defined all the nodes’ relationships, i.e. the edges. One useful additional feature is the ability to automatically determine such relationships from either a pair-wise similarity matrix, or an embedding matrix derived from co-occurrence in sliding windows or other methods.

In order to do this, we need to determine a similarity threshold, and define all similarities greater than the threshold as relationships. If our number of features isn’t too big we can compute all pair-wise similarities and keep only the 5% or 10% greatest values. Otherwise if our number of features is large and we can’t compute all pair-wise similarities, we can usually assume that most of our concepts are not related, thus we can generate a number of random pairs (e.g. 10,000) and assume they are all true negatives, and use them to determine a 5% or 10% false positive threshold. Then, when we build a knowledge graph, we compute pair-wise similarities between the concepts of interest and all the other concepts, and we use this global threshold to retain only the concepts with a higher similarity.
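The random-pairs approach can be sketched in base R on a simulated embedding matrix. This is only an illustration of the idea under the stated assumption that random pairs are true negatives. It does not reproduce the fit_embeds_kg function, and the matrix, object names and seed are made up.

```r
set.seed(1)

# Toy embedding matrix: 200 concepts (rows) by 20 dimensions (columns).
m_toy <- matrix(rnorm(200 * 20), nrow = 200,
                dimnames = list(paste0("c", 1:200), NULL))

# Normalize rows so that the inner product equals the cosine similarity.
m_norm <- m_toy / sqrt(rowSums(m_toy^2))

# Sample random pairs of distinct concepts, assumed to be true negatives.
n_pairs <- 10000
idx1 <- sample(nrow(m_norm), n_pairs, replace = TRUE)
idx2 <- sample(nrow(m_norm), n_pairs, replace = TRUE)
keep <- idx1 != idx2
null_sims <- rowSums(m_norm[idx1[keep], ] * m_norm[idx2[keep], ])

# 10% false positives correspond to the 90% quantile of these similarities.
threshold <- quantile(null_sims, 0.9)
threshold
```

By construction, about 10% of the random pairs exceed `threshold`, which is the false positive rate we accept when using it to define edges.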

Pair-wise similarities

To determine this threshold from a data matrix, we call the fit_embeds_kg function, typically on an embedding matrix with concepts as rows and dimensions as columns. The similarity method will usually be either the inner product or the cosine similarity, and two other methods are available. The threshold will usually be set at around 5% or 10% false positives (i.e. 0.95 or 0.9 specificity) but may be anywhere between 1% and 50%, depending on the structure of the data (whether or not many concepts are similar) and how the user wants to visualize the graph.

Depending on the number of concepts in the data matrix (i.e. the number of rows) and the amount of available RAM, it may not be possible to compute all pair-wise similarities up front and select nodes of interest afterwards. In that case, pair-wise similarities for concepts of interest are computed on the fly and the fit_embeds_kg function is mainly used to determine the threshold. Otherwise, if all pair-wise similarities can be computed, the fitted object contains the complete data frame of weighted relationships.

The ‘on-the-fly’ computation is performed by the project_graph function, and the build_kgraph_from_fit function automatically dispatches between the two behaviors. The maximum number of concepts above which ‘on-the-fly’ computation is performed is set by the max_concepts parameter. Its default, 1000, is quite low, so it should run easily on modest systems and leave room for concurrent applications. Since the number of nodes of interest is rarely above 100, the ‘on-the-fly’ computation should still be near instantaneous and barely noticeable.
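To give an idea of why the ‘on-the-fly’ computation is cheap, here is a base R sketch that computes similarities only between a few target concepts and all the others, instead of the full pair-wise matrix. It is an illustration on simulated data, not the code of project_graph; the threshold value and object names are assumptions.

```r
set.seed(2)

# Toy embedding matrix: 50 concepts by 10 dimensions, rows normalized.
m_toy <- matrix(rnorm(50 * 10), nrow = 50,
                dimnames = list(paste0("c", 1:50), NULL))
m_norm <- m_toy / sqrt(rowSums(m_toy^2))

targets <- c("c1", "c2")
threshold <- 0.3  # hypothetical value; in practice it comes from the fit

# Similarities between targets and all concepts: a 2 x 50 matrix
# instead of the full 50 x 50 one.
m_sims <- m_norm[targets, , drop = FALSE] %*% t(m_norm)

# Reshape to a long data frame of weighted relationships.
df_proj <- data.frame(
  concept1 = rep(targets, times = ncol(m_sims)),
  concept2 = rep(colnames(m_sims), each = length(targets)),
  weight = as.vector(m_sims),
  stringsAsFactors = FALSE
)

# Drop self-pairs and keep only similarities above the threshold.
df_proj <- df_proj[df_proj$concept1 != df_proj$concept2 &
                     df_proj$weight > threshold, ]
df_proj
```

The cost grows with the number of target concepts times the total number of concepts, rather than with the square of the total number of concepts.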

Similarity threshold

The determined threshold can vary depending on the random sampling performed. By default ~10,000 random pairs are sampled, and this number can be increased with the n_notpairs parameter. Increasing it will increase the computation time of the fitted object (which is only computed once), and will increase the stability of the threshold and of the resulting graph. The fitted object can then be reused to explore different nodes of interest, and only needs to be recomputed when the similarity method or the threshold is changed.
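The effect of the number of sampled pairs on threshold stability can be explored with a small base R simulation. The `sample_threshold` helper is hypothetical and written for this illustration only; it plays the role of the random sampling step and is not part of the ‘kgraph’ package.

```r
set.seed(3)

# Toy embedding matrix: 300 concepts by 20 dimensions, rows normalized.
m_toy <- matrix(rnorm(300 * 20), nrow = 300)
m_norm <- m_toy / sqrt(rowSums(m_toy^2))

# Hypothetical helper: threshold estimated from n random pairs.
sample_threshold <- function(n, specificity = 0.9) {
  idx1 <- sample(nrow(m_norm), n, replace = TRUE)
  idx2 <- sample(nrow(m_norm), n, replace = TRUE)
  keep <- idx1 != idx2
  sims <- rowSums(m_norm[idx1[keep], , drop = FALSE] *
                    m_norm[idx2[keep], , drop = FALSE])
  unname(quantile(sims, specificity))
}

# Repeat the estimation 30 times with few and with many random pairs.
thr_small <- replicate(30, sample_threshold(200))
thr_large <- replicate(30, sample_threshold(20000))

# Spread of the estimated threshold in each setting.
c(small = sd(thr_small), large = sd(thr_large))
```

The standard deviation of the threshold across repetitions is smaller when more pairs are sampled, which is the stability gain mentioned above.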

Example

The ‘kgraph’ package provides an example embedding data matrix, which is a subset of a larger one hosted on Dropbox. The embeddings of medical concepts were computed from 1,700 mental health-related scientific publications using word2vec-like algorithms (co-occurrence in sliding windows, in which pairs of words that frequently appear close together will be more similar). For more information on how this data was computed, you can refer to my R/Medicine 2024 tutorial.

A corresponding dictionary object is also provided, that you can recreate by downloading the NILE software (for .edu, .org or .gov e-mail addresses only, please contact the CELEHS lab if you are an academic with another kind of e-mail address) and then extracting the file in the portable_NILE folder in inst.

Building the fit object

The input data is a matrix with concepts as rows and embedding dimensions in columns.

  data('m_embeds')
  dim(m_embeds)
#> [1] 1122  100

We build the fit object to determine the similarity threshold. The 5% threshold is always computed for information, and the threshold_projs parameter determines the actual threshold that will be used, by default 10% (note that it is specified in AUC specificity, so 10% -> 0.9 and 5% -> 0.95).

  fit_kg = fit_embeds_kg(m_embeds, 'cosine', threshold_projs = 0.9)
  fit_kg$threshold_projs
#>       90% 
#> 0.2018853

Producing the knowledge graph

We can then call the build_kgraph_from_fit function, with a node of interest. We first load the dictionary to find identifiers.

  data('df_embeds_dict')
  head(df_embeds_dict)
#>         id                             desc     color
#> 1 C0001627  hyperplasia, congenital adrenal Disorders
#> 2 C0003099             anniversary reaction Disorders
#> 3 C0004114                 astrocytic tumor Disorders
#> 4 C0007112 adenocarcinoma of prostate gland Disorders
#> 5 C0007642                       cellulitis Disorders
#> 6 C0009176             cocaine intoxication Disorders
#>                              group
#> 1              Disease or Syndrome
#> 2 Mental or Behavioral Dysfunction
#> 3               Neoplastic Process
#> 4               Neoplastic Process
#> 5              Pathologic Function
#> 6 Mental or Behavioral Dysfunction
  target_nodes_idxs = grep('suicide', df_embeds_dict$desc) %>% head(2)
  target_nodes = df_embeds_dict$id[target_nodes_idxs]

  kg_obj = build_kgraph_from_fit(target_nodes, m_embeds, fit_kg,
                                 df_dict = df_embeds_dict)

  sg_obj = get_sgraph(kg_obj)
  sg_obj

Measuring performance using known pairs

As we determine a similarity threshold from data matrices, we are essentially predicting relationships between concepts, and we would naturally like to know how well this prediction model performs. To obtain numeric quality measures, we can use databases of known related concepts, which act as true positives. In the clinical context, these databases can be curated by clinicians, and in the general context by expert knowledge. A useful performance measure for such predictions is the AUC. To compute it, we use as many true negatives as true positives, so for each known pair of concepts, we generate another random pair.
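The idea of balancing known pairs with random pairs can be sketched in base R on simulated embeddings. Here the AUC is computed directly as the probability that a known pair has a higher similarity than a random pair, which is a standard equivalent formulation. The data, the `draw_negatives` and `pair_sim` helpers and the object names are assumptions made for illustration, and this does not reproduce what fit_embeds_kg does internally.

```r
set.seed(4)

# Toy embeddings: row i and row i + n_pairs are built to be related.
n_pairs <- 50
m_base <- matrix(rnorm(n_pairs * 20), nrow = n_pairs)
m_noise <- matrix(rnorm(n_pairs * 20, sd = 0.5), nrow = n_pairs)
m_toy <- rbind(m_base, m_base + m_noise)
m_norm <- m_toy / sqrt(rowSums(m_toy^2))

# Known pairs (true positives).
pos1 <- seq_len(n_pairs)
pos2 <- pos1 + n_pairs

# Hypothetical helper: draw n random pairs that are neither self-pairs
# nor known pairs, to act as true negatives.
draw_negatives <- function(n) {
  idx1 <- integer(0)
  idx2 <- integer(0)
  while (length(idx1) < n) {
    a <- sample(nrow(m_norm), n, replace = TRUE)
    b <- sample(nrow(m_norm), n, replace = TRUE)
    ok <- a != b & abs(a - b) != n_pairs
    idx1 <- c(idx1, a[ok])
    idx2 <- c(idx2, b[ok])
  }
  list(idx1 = idx1[seq_len(n)], idx2 = idx2[seq_len(n)])
}
neg <- draw_negatives(n_pairs)

# Cosine similarity of each pair (rows are already normalized).
pair_sim <- function(i, j) {
  rowSums(m_norm[i, , drop = FALSE] * m_norm[j, , drop = FALSE])
}
sims_pos <- pair_sim(pos1, pos2)
sims_neg <- pair_sim(neg$idx1, neg$idx2)

# AUC: proportion of (positive, negative) comparisons won by the positive,
# counting ties as one half.
auc <- mean(outer(sims_pos, sims_neg, ">") +
              0.5 * outer(sims_pos, sims_neg, "=="))
auc
```

Because the related rows were simulated with little noise, the AUC is close to 1 here; on real data it will depend on how well the embeddings capture the known relationships.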

The ‘kgraph’ package integrates an extract of the PrimeKG database subsetted to PheCodes (diagnoses) relationships. PrimeKG was developed to help with bioinformatics drug discovery and focuses on genetic pathways, but we can use it as a starting point in a biomedical setting (focusing on relations between diagnoses).

To see the performance of our prediction model, we load the known pairs, recompute the fit object, and plot AUC.

  data('df_cuis_pairs')

  fit_kg = fit_embeds_kg(m_embeds, 'cosine', df_pairs = df_cuis_pairs[c(1, 3)])

  pROC::plot.roc(fit_kg$roc, print.auc = TRUE)