Knowledge graphs make it possible to organize vast amounts of knowledge and have been used extensively on the internet to help websites share data and integrate with one another. In the domain of research and knowledge discovery, they make it possible to organize results and provide specific views by showing subsets of interest.
Knowledge graphs are defined by sets of nodes in relationships to each other (in the context of graphs, relationships between nodes can be called edges). Both nodes and edges can be of different types, and one commonly used type of relationship is “parent-child”, e.g. in the context of mental health, one such relationship could be: schizophrenia “is a” mental health disorder. This is an example of a directed relationship; examples of undirected relationships include pair-wise similarities such as cosine similarity.
The ‘kgraph’ package makes it easy to build knowledge graphs and feed them to the ‘sgraph’ package for visualization. The ‘kgraph’ package focuses on the complex computations that arise when building knowledge graphs, with a particular focus on clinical and biomedical data, although the methods aim to be general enough for use in any application. The ‘sgraph’ package focuses on interfacing with the Sigma.JS library and performs minimal computation.
The two main features provided by the ‘kgraph’ package are building a knowledge graph from a data frame of concept relationships, be it p-values or cosine similarities, and building that data frame of concept relationships from an embedding matrix, the second feature thus operating logically before the first one. The user can either provide a data frame of weighted relationships directly (as p-values or pre-computed similarities) to the build_kgraph function, or provide a data matrix to the fit_embeds_kg function, which will compute the similarities and determine a threshold.
The first way to use the ‘kgraph’ package is to call the build_kgraph function with a list of selected concepts and a data frame consisting of three columns: ‘concept1’, ‘concept2’, and ‘weight’. The first two columns define two nodes in a relationship, and the third defines the weight of the relationship; the higher the weight, the shorter the edge between the two nodes. When using p-values, one should therefore first transform them with minus log10, so that stronger associations get larger weights.
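As a small sketch in base R, the minus log10 transform turns p-values into weights suitable for build_kgraph (the concept pairs below reuse identifiers from the GWAS example that follows; the p-values themselves are made up for illustration):

```r
# Hypothetical p-values for two concept pairs; -log10 turns smaller
# p-values into larger weights (and thus shorter edges)
df_weights = data.frame(concept1 = c('EFO_0007623', 'EFO_0004320'),
                        concept2 = c('rs6557168', 'rs73581580'),
                        weight = -log10(c(3e-12, 4e-11)))

round(df_weights$weight, 5)
#> [1] 11.52288 10.39794
```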
In the case of a graph with a single node of interest, the df_weights data frame will be searched for all rows containing the selected node, and the size of each connected node will be proportional to the weight of its relationship to the node of interest. For graphs with multiple nodes of interest, a graph is first built for each node taken individually, and the graphs are then merged. For nodes connected to several nodes of interest, the maximum of the weights to each node of interest is retained to determine the final node size.
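The merging rule can be sketched in base R: for a node connected to two nodes of interest, only the larger weight drives its final size (toy identifiers and values, not from the package):

```r
# one concept ('rs123') connected to two hypothetical nodes of interest
df_merged = data.frame(concept2 = c('rs123', 'rs123'),
                       node_of_interest = c('trait_a', 'trait_b'),
                       weight = c(11.5, 9.3))

# retain the maximum weight per concept to determine the final node size
aggregate(weight ~ concept2, df_merged, max)
```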
As an example of a concept relationships data frame, the ‘kgraph’ package integrates GWAS results on suicide behavior downloaded from https://www.ebi.ac.uk/gwas/efotraits/EFO_0007623. In the scripts folder of the package, the file gwas_data.R creates an rda file based on it, which we load here.
library(kgraph)
data('df_pval')
head(df_pval)
#> concept1 concept2 weight
#> 2910 EFO_0004320 rs6557168 11.52288
#> 3010 EFO_0007623 rs6557168 11.52288
#> 354 EFO_0004321 rs62474683 11.04576
#> 111 EFO_0007624 rs34399104 10.39794
#> 1 EFO_0004320 rs73581580 10.39794
#> 2 EFO_0007623 rs73581580 10.39794
We can then call the build_kgraph function directly:
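For example, with the suicide behaviour trait (EFO_0007623, which appears in df_pval above) as the concept of interest, the call could look like this (a sketch; argument names as per the package help pages):

```r
library(kgraph)
data('df_pval')

# build the knowledge graph around a single selected concept
kg_obj = build_kgraph('EFO_0007623', df_pval)
```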
The kg_obj object returned by build_kgraph is a classic graph object consisting of two data frames: one defining the edges (df_links) and one defining the nodes (df_nodes). We can then import this graph object as an ‘igraph’ object with the l_graph_to_igraph function from the ‘sgraph’ package, which is essentially a wrapper around graph_from_data_frame from the ‘igraph’ package that also attaches the node data frame to the ‘igraph’ object.
We then build the ‘sgraph’ object from the ‘igraph’ object with the sgraph_clusters function from the ‘sgraph’ package. Here we can specify the layout of the ‘igraph’ object, in this example a spring layout, but possibly also a force-directed layout with layout_with_fr, or any other ‘igraph’ layout. We can also modify some layout parameters, e.g. the number of iterations.
sg_obj = sgraph::sgraph_clusters(ig_obj, node_size = 'weight',
label = 'label',
layout = igraph::layout_with_kk(ig_obj))
The ‘sgraph’ object is now ready to be visualized.
Most of the time, we will want to integrate and overlay data from a second data frame: the node dictionary. This data frame will contain at minimum the columns ‘id’ (corresponding to concept names in ‘concept1’ and ‘concept2’ columns of the weights data frame) and ‘desc’ (the textual descriptions, i.e. labels, that will appear on the graph). Optionally, the data frame can contain the columns ‘color’ (for nodes’ color) and ‘group’ (detailed below).
The script gwas_data.R builds a basic dictionary that maps the traits’ URIs to labels, and SNPs to genes. We load the .rda file directly here.
data('df_pval_dict')
head(df_pval_dict)
#> id desc group color
#> 22 EFO_0007624 suicide suicide Phenotype
#> 35 EFO_0004320 suicidal ideation suicidal ideation Phenotype
#> 48 EFO_0007623 suicide behaviour suicide behaviour Phenotype
#> 154 EFO_0004321 attempted suicide attempted suicide Phenotype
#> 359 rs73581580 rs73581580 EXD3 SNP
#> 360 rs3757323 rs3757323 ESR1 SNP
Then call build_kgraph with the df_dict parameter. The get_sgraph function is a wrapper that builds the ‘igraph’ and ‘sgraph’ objects, and its ... parameter is passed to the sgraph_clusters function.
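Continuing the GWAS example, the two calls could look like this (a sketch; argument names follow the package help pages):

```r
# rebuild the graph, mapping node identifiers to readable labels
# via the dictionary data frame
kg_obj = build_kgraph('EFO_0007623', df_pval, df_dict = df_pval_dict)

# wrap the igraph and sgraph construction in a single call;
# additional arguments are forwarded to sgraph_clusters
sg_obj = get_sgraph(kg_obj)
```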
The node dictionary data frame can include a ‘group’ column, which will add groups as nodes in the graph. There are two main ways of displaying groups in the ‘kgraph’ package, and each should be used with a specific graph layout from the ‘igraph’ package for best results.
In the first case (‘floating’), the group nodes will be connected to all nodes belonging to each group and should be used with the layout_with_kk layout function, which has the effect of pulling together nodes from similar groups while keeping their direct relationships to nodes of interest.
In the second case (‘anchored’), nodes’ direct edges to concepts of interest will be removed and group nodes will be connected to the concepts of interest instead. This layout should be used with the layout_with_fr function (force-directed), which has the effect of using the groups as intermediary nodes. Additionally, nodes in a group can be hidden by default and revealed only when the group node is clicked.
Groups with a single child node will automatically be set to a common group labeled ‘Other’. This behavior can be disabled by setting the rm_single_groups parameter to FALSE, and the label can be replaced through the display_val_str parameter.
In this vignette, we focus only on the first case, although functions are available to use the second type for users who want to explore that path. The second type might be demonstrated in a second vignette with other features in development.
Up to here, the data frames we provided defined all the nodes’ relationships, i.e. the edges. One useful additional feature is the ability to determine such relationships automatically, based on either a pair-wise similarity matrix or an embedding matrix derived from co-occurrence in sliding windows or other methods.
To do this, we need to determine a similarity threshold and define all similarities greater than the threshold as relationships. If the number of features is small enough, we can compute all pair-wise similarities and keep only the 5% or 10% greatest values. Otherwise, if the number of features is too large to compute all pair-wise similarities, we can usually assume that most of our concepts are not related: we generate a number of random pairs (e.g. 10,000), assume they are all true negatives, and use them to determine a 5% or 10% false-positive threshold. Then, when we build a knowledge graph, we compute pair-wise similarities between the concepts of interest and all other concepts, and use this global threshold to retain only the concepts with a higher similarity.
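The random-pairs approach can be sketched in base R, independently of the package (toy data; the implementation in ‘kgraph’ may differ in its details):

```r
set.seed(1)

# toy embedding matrix: 500 concepts x 50 dimensions
m_toy = matrix(rnorm(500 * 50), nrow = 500)

# normalize rows so that inner products become cosine similarities
m_norm = m_toy / sqrt(rowSums(m_toy ^ 2))

# sample random concept pairs, assumed to be true negatives
pairs = cbind(sample(500, 1e4, replace = TRUE),
              sample(500, 1e4, replace = TRUE))
null_sims = rowSums(m_norm[pairs[, 1], ] * m_norm[pairs[, 2], ])

# the 95% quantile of the null similarities gives a
# 5% false-positive threshold
threshold = quantile(null_sims, 0.95)
```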
To determine this threshold from a data matrix, we call the fit_embeds_kg function, typically on an embedding matrix with concepts as rows and dimensions as columns. The similarity method may vary and will usually be either the inner product or the cosine similarity; two other methods are also available. The threshold will usually correspond to 5% or 10% false positives (i.e. 0.95 or 0.9 specificity), but may be anywhere between 1% and 50% depending on the structure of the data (whether many concepts are similar or not) and how the user wants to visualize the graph.
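A call could look like the following, assuming an embedding matrix m_embeds with concepts as rows (a sketch; check the package help pages for the exact argument names):

```r
# fit on the embedding matrix using cosine similarity; relationships
# will be defined by the threshold corresponding to 0.9 specificity
fit_kg = fit_embeds_kg(m_embeds, 'cosine', threshold_projs = 0.9)
```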
Depending on the number of concepts in the data matrix (i.e. the number of rows) and the amount of available RAM, it may not be possible to compute all pair-wise similarities ahead of selecting nodes of interest. In that case, pair-wise similarities for concepts of interest are computed on the fly, and the fit_embeds_kg function is mainly used to determine the threshold. Otherwise, if all pair-wise similarities can be computed, the fitted object contains the complete data frame of weighted relationships.
The ‘on-the-fly’ computation is performed by the project_graph function, and the build_kgraph_from_fit function dispatches automatically between the two behaviors. The maximum number of concepts above which ‘on-the-fly’ computation is performed is set by the max_concepts parameter, 1,000 by default, which is quite low and should thus run easily on modest systems and allow concurrent applications. Since the number of nodes of interest is rarely above 100, the ‘on-the-fly’ computation should still be near-instantaneous and barely noticeable.
The determined threshold can vary depending on the random sampling performed. By default ~10,000 random pairs are sampled, and this number can be increased with the n_notpairs parameter. Increasing it will increase the computation time of the fitted object (which is only computed once), and will improve the stability of the threshold and of the resulting graph. The fitted object can then be reused to explore different nodes of interest, and is recomputed only when the similarity method or the threshold is changed.
The ‘kgraph’ package provides an example embedding data matrix, a subset of a larger one hosted on Dropbox, in which embeddings of medical concepts have been computed from 1,700 mental health-related scientific publications using word2vec-like algorithms (co-occurrence in sliding windows, in which pairs of words that frequently appear close together will be more similar). For more information on how this data was computed, you can refer to my R/Medicine 2024 tutorial.
A corresponding dictionary object is also provided, which you can recreate by downloading the NILE software (for .edu, .org, or .gov e-mail addresses only; please contact the CELEHS lab if you are an academic with another kind of e-mail address) and then extracting the file in the portable_NILE folder in inst.
The input data is a matrix with concepts as rows and embedding dimensions in columns.
We build the fit object to determine the similarity threshold. The 5% threshold is always computed for information, and the threshold_projs parameter determines the actual threshold that will be used, 10% by default (note that it is specified as a specificity, so 10% -> 0.9 and 5% -> 0.95).
We can then call the build_kgraph_from_fit function with a node of interest. We first load the dictionary to find identifiers.
data('df_embeds_dict')
head(df_embeds_dict)
#> id desc color
#> 1 C0001627 hyperplasia, congenital adrenal Disorders
#> 2 C0003099 anniversary reaction Disorders
#> 3 C0004114 astrocytic tumor Disorders
#> 4 C0007112 adenocarcinoma of prostate gland Disorders
#> 5 C0007642 cellulitis Disorders
#> 6 C0009176 cocaine intoxication Disorders
#> group
#> 1 Disease or Syndrome
#> 2 Mental or Behavioral Dysfunction
#> 3 Neoplastic Process
#> 4 Neoplastic Process
#> 5 Pathologic Function
#> 6 Mental or Behavioral Dysfunction
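With an identifier taken from the dictionary above (C0009176, cocaine intoxication) and a fit object fit_kg previously returned by fit_embeds_kg, the call could look like this (a sketch; the exact signature is documented in the package reference):

```r
# build the graph around a selected concept; build_kgraph_from_fit
# dispatches automatically between precomputed and on-the-fly similarities
kg_obj = build_kgraph_from_fit('C0009176', fit_kg,
                               df_dict = df_embeds_dict)
```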
As we determine a similarity threshold from data matrices, we are essentially predicting relationships between concepts, and we would of course like to know the performance of our prediction model. To obtain numeric quality measures, we can use databases of known related concepts, which act as true positives. In the clinical context, these databases can be curated by clinicians, and in the general context by expert knowledge. A useful performance measure for such predictions is the AUC. The AUC requires the same number of true negatives as true positives, so for each pair of concepts we know about, we generate another random pair.
The ‘kgraph’ package integrates an extract of the PrimeKG database subsetted to PheCodes (diagnoses) relationships. PrimeKG was developed to help with bioinformatics drug discovery and focuses on genetic pathways, but we can use it as a starting point in a biomedical setting (focusing on relations between diagnoses).
To see the performance of our prediction model, we load the known pairs, recompute the fit object, and plot AUC.