Global RDF Vector Space Embeddings

Abstract. Vector space embeddings have been shown to perform well when using RDF data in data mining and machine learning tasks. Existing approaches, such as RDF2Vec, use local information, i.e., they rely on local sequences generated for nodes in the RDF graph. For word embeddings, global techniques, such as GloVe, have been proposed as an alternative. In this paper, we show how the idea of global embeddings can be transferred to RDF embeddings, and show that the results are competitive with traditional local techniques like RDF2Vec.


Introduction
While RDF data is graph shaped by nature, most traditional data mining and machine learning software expects data to be in propositional form. Hence, to be used in machine learning and data mining pipelines, RDF data needs to be transformed to propositional feature vectors.
Recently, vector space embeddings have been proposed as a means to create low-dimensional feature vector representations of nodes in an RDF graph. Inspired by techniques from NLP, such as word2vec [14], they train neural networks for automatically learning the mapping of RDF nodes to feature vectors. Vector space embeddings have been shown to outperform traditional methods for creating propositional feature vectors from RDF [22], e.g., in tasks like content-based recommender systems [24].
Unlike the first models for RDF vector space embeddings, which are based on paths, walks, or kernels, and therefore rely on local patterns, in this paper we present an approach that exploits global patterns for creating vector space embeddings, inspired by the Global Vectors (GloVe) [20] approach for learning vector space embeddings for words from a text corpus. We show that using the GloVe approach on the same data as the older RDF2Vec approach does not improve the created embeddings. However, this approach is able to incorporate larger portions of the graph without substantially increasing the computational time, leading to comparable results. The main contributions of this paper are this new embedding approach and an approach to approximate all-pairs Personalized PageRank (PPR) computation, which is used to efficiently compute such embeddings.
The rest of this paper is structured as follows. Section 2 presents an overview of related work. In Section 3, we explain the basic idea of GloVe embeddings, and show how we transfer that idea to RDF graphs. Section 4 discusses an evaluation in different scenarios. We close with a summary and an outlook on future work.
The source code used in this evaluation can be found at https://github.com/miselico/globalRDFEmbeddingsISWC. Possible further developments will also be on http://users.jyu.fi/~miselico/software/.

Related Work
RDF vector space embeddings, i.e., projections of an RDF graph into a low-dimensional, dense vector space, have recently been proposed as a means to make RDF data accessible for propositional machine learning techniques, and shown to outperform traditional feature generation techniques [22].
RDF2Vec [22] is one of the first approaches that uses language modeling approaches for unsupervised feature extraction from sequences of words, and adapts them to RDF graphs. The approach generates sequences by leveraging local information from graph sub-structures, harvested by Weisfeiler-Lehman Subtree RDF Graph Kernels and graph walks, and then learns latent numerical representations of entities in RDF graphs.
The RDF2Vec approach is closely related to the approaches DeepWalk [21] and Deep Graph Kernels [31]. DeepWalk uses language modeling approaches to learn social representations of vertices of graphs by modeling short random walks on large social graphs, like BlogCatalog, Flickr, and YouTube. The Deep Graph Kernel approach extends the DeepWalk approach by modeling graph substructures, like graphlets, instead of graph walks. In this paper, we pursue and deepen the idea of random and biased walks since those have proven to be scalable even to large RDF graphs, unlike other transformation approaches, such as graph kernels. Node2vec [7] is another approach very similar to DeepWalk, which uses second order random walks to preserve the network neighborhood of the nodes.
Furthermore, multiple approaches for knowledge graph embeddings for the task of link prediction have been proposed [16], which could also be considered as approaches for generating propositional features from graphs. RESCAL [17] is one of the earliest approaches, which is based on factorization of a three-way tensor. The approach is later extended into Neural Tensor Networks (NTN) [28], which can be used for the same purpose (optionally using multilingual information [10]). One of the most successful approaches is the model based on translating embeddings, TransE [2]. This model builds entity and relation embeddings by regarding a relation as a translation from head entity to tail entity. This approach assumes that relationships between words could be computed by their vector difference in the embedding space. However, this approach cannot deal with reflexive, one-to-many, many-to-one, and many-to-many relations. This problem was resolved in the TransH model [30], which models a relation as a hyperplane together with a translation operation on it. More precisely, each relation is characterized by two vectors, the norm vector of the hyperplane, and the translation vector on the hyperplane. While both TransE and TransH embed the relations and the entities in the same semantic space, the TransR model [13] builds entity and relation embeddings in a separate entity space and multiple relation spaces. This approach is able to model entities that have multiple aspects, and various relations that focus on different aspects of entities.
Unlike the first models for RDF vector space embeddings, which are based on paths, walks, or kernels, and therefore rely on local patterns, the approach in this paper exploits global patterns for creating vector space embeddings, inspired by the Global Vectors (GloVe) [20] approach for learning vector space embeddings for words from a text corpus.

Global Vectors from RDF Data
The embedding method which we propose borrows the optimization problem and approach from GloVe [20]. GloVe training, however, is based on the creation of a global co-occurrence matrix from text. Consequently, in our approach we need to devise a way to build a co-occurrence matrix from graph data. To this end, we first weigh the edges of the graph and compute approximate personalized PageRank scores starting from each node. The PageRank score for the other nodes (i.e., context nodes) is then used as the absolute frequency in a matrix. This procedure is then repeated on the graph with all edges reversed and the result is added to the co-occurrence matrix. This combined matrix is then used for training the vectors with the original GloVe approach.

The GloVe model
GloVe was designed for creating dense word vectors (also known as word embeddings) from natural language texts, which have been recently used with much success in a plethora of Natural Language Processing tasks. GloVe follows a distributional semantic view of word meaning in context, which basically relies on the assumption that 'words which are similar in meaning occur in similar contexts' [25], i.e., meaning can be derived from the context (i.e., the surrounding words) of the word in a large corpus of text.
Consequently, to build a GloVe model a word-word co-occurrence matrix is first built, which contains for each word how often other words occur in its context. Model parameters then include the size of the context window, whether to distinguish left context from right context, as well as a weighting function to weight the contribution of each word co-occurrence, e.g., a decreasing weighting function, where word pairs that are d words apart contribute 1/d to the total co-occurrence count.
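The windowed counting with a decreasing 1/d weight described above can be illustrated with a minimal sketch. The function name `cooccurrence_counts`, the symmetric treatment of left and right context, and the toy token list are assumptions made for this illustration; they are not taken from the GloVe implementation.

```python
from collections import defaultdict


def cooccurrence_counts(tokens, window=2):
    """Symmetric windowed co-occurrence counts with 1/d distance weighting.

    Hypothetical helper for illustration: left and right context are not
    distinguished, so every pair is counted in both directions.
    """
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            j = i + d
            if j >= len(tokens):
                break
            # A pair that is d positions apart contributes 1/d to the count.
            counts[(word, tokens[j])] += 1.0 / d
            counts[(tokens[j], word)] += 1.0 / d
    return dict(counts)


tokens = ["a", "b", "c", "a"]
X = cooccurrence_counts(tokens, window=2)
```

In this toy sequence, "a" and "b" are adjacent once and two positions apart once, so their count accumulates 1 + 1/2.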
After obtaining a co-occurrence matrix, GloVe attempts to minimize the following cost function using Adagrad [5]:

$$J = \sum_{i,j} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $f(X_{ij})$ is a weighting function on co-occurrence counts of word $j$ in the context of word $i$ ($X_{ij}$), $w_i$ are word vectors, $\tilde{w}_j$ context vectors, and $b_i$ and $\tilde{b}_j$ biases. The intuition behind this cost function is the following. Each summand represents the amount of error attributed to a count $X_{ij}$ in the co-occurrence matrix. The error consists of a weighting function $f$, which dampens the effect of very large co-occurrence counts, and a squared error factor. The squared error factor becomes smaller when the dot product of the word vectors becomes closer to the logarithm of the probability that the words co-occur.
Put the other way, when two words co-occur often, the dot product of their vectors has to be relatively high to keep the error factor small, meaning that the vectors are more similar. The logarithm also causes ratios of co-occurrence probabilities to be associated with differences of vectors. As a result, the embedding contains information useful for determining analogies.
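To make the cost function described above concrete, the following sketch evaluates it for a sparse co-occurrence dictionary. The weighting function with `x_max = 100` and exponent `0.75` follows the defaults reported in the original GloVe paper and is not specified in this text; the function names and the data layout (dictionaries keyed by index pairs) are assumptions for illustration only.

```python
import math


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): grows with the count and is capped at 1 for large counts."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_cost(X, w, w_ctx, b, b_ctx):
    """Sum over non-zero counts of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    total = 0.0
    for (i, j), x in X.items():
        if x <= 0:
            continue  # zero counts do not contribute to the cost
        dot = sum(a * c for a, c in zip(w[i], w_ctx[j]))
        err = dot + b[i] + b_ctx[j] - math.log(x)
        total += glove_weight(x) * err * err
    return total


# One pair whose dot product exactly matches log(X_ij): the cost is zero.
X_fit = {(0, 1): math.e}
vectors = {0: [1.0, 0.0], 1: [1.0, 0.0]}
zero_bias = {0: 0.0, 1: 0.0}
cost_fit = glove_cost(X_fit, vectors, vectors, zero_bias, zero_bias)

# All-zero vectors against a large count: the error is log(100) with weight 1.
X_big = {(0, 1): 100.0}
zeros = {0: [0.0, 0.0], 1: [0.0, 0.0]}
cost_big = glove_cost(X_big, zeros, zeros, zero_bias, zero_bias)
```

Training would adjust the vectors and biases to reduce this value; the sketch only evaluates it and does not implement Adagrad.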

Building a Co-occurrence Matrix from Graph Data
The co-occurrence matrix for textual data is obtained by linearly scanning through the text and counting the occurrence of context words in the context of each word. However, the graph which we use as input data does not have a linear structure. This problem has been worked around in the past by performing random walks starting from each of the nodes in the graph. Recording the paths of these walks results in a linear sequence of node (and optionally edge) labels, which can then, in turn, be used as a pseudo-text to train a model. This approach is, for example, used in node2vec [7] and RDF2Vec [22]. However, in these approaches, the trained model is different from the GloVe model and it does not use the co-occurrence counts, but rather trains a neural network on the individual context windows directly. In the case of GloVe, only the counts are needed and hence we are looking for a method to obtain these without generating the random walks explicitly.
A possible solution would be to perform a breadth-first search of a certain depth starting from each node in turn, and take all reachable nodes as the context of each start node. Given these kinds of contexts, one could then straightforwardly apply GloVe's co-occurrence weighting and assign a lower weight to co-occurrence counts of nodes which are further away from the focus node. However, this simple approach is problematic in that: a) there could be nodes reachable through multiple paths at different levels, b) there could be loops in the graph, making a walk pass through the same node multiple times, and c) if there is a node with many context nodes at level d, but only a few at level d−1, then the ones at level d will dominate the closer ones in the co-occurrence matrix simply because there are so many of them.
To solve this problem, we investigate the use of Personalized PageRank [18] to determine how important nodes are in the context of a focus node. In general, PageRank is used to find important nodes in a directed graph. Its first, well-known use is the ranking of web pages, but later PageRank has also been applied in other areas (e.g., peer-to-peer networks [9] and social network analysis [15], among others). At its heart, PageRank works by simulating random walkers over the graph and observing where these random walkers end up. A simplified model, which we will elaborate below, is as follows. First, we denote the out degree of a node $i$ as $\deg(i)$. Then, if there are $n$ nodes in the graph, construct an $n \times n$ matrix $P$ filled with zeros except for positions $i,j$ for which there exists an arc $i \rightarrow j$. These positions contain $1/\deg(i)$. Now, the simplified PageRank problem is solved by finding the stationary solution to the following equation (notation from [1]; $p^{(i)}$ is the vector converging to the PageRank value for each page after $i$ iterations):

$$p^{(i+1)} = P^{\top} p^{(i)}$$

This simplified version of PageRank can run into a number of problems, namely some pages may have a zero out degree (so-called dangling nodes) and there could be groups of pages which form closed cycles. In the first case, PageRank (i.e., random walkers) will get lost from the graph and any node linking directly or indirectly to a zero out-degree node will get a PageRank of zero. In the second case, walkers will get trapped and the pages in the cycles will accumulate all PageRank. To amend these problems, the above equation is adapted to include parts which ensure that when a walk ends up in a dangling node, it will continue from another node selected from a distribution $v$, called the teleportation distribution. Further, to avoid ending in a cycle, a random jump is also performed with probability α to a node selected from the same distribution. Usually, $v$ is chosen to be a uniform distribution, making each node equally likely to be the target of the jump. However, in the case of Personalized PageRank the distribution is degenerate, as the target of these random jumps is always the node for which the rank vector is computed (which we call the focus node). In effect, the Personalized PageRank vector indicates the importance of nodes from the perspective of the focus node.
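The adapted iteration can be sketched as a plain power iteration. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the graph is a dictionary mapping every node to a list of its out-neighbours (each target must itself be a key), α is the jump probability, and both jumps and walks arriving at dangling nodes return to the focus node. The function name `personalized_pagerank` is hypothetical.

```python
def personalized_pagerank(out_links, focus, alpha=0.1, iterations=300):
    """Power iteration for Personalized PageRank with a degenerate teleport vector."""
    nodes = list(out_links)
    p = {n: 0.0 for n in nodes}
    p[focus] = 1.0  # all walkers start at the focus node
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        jump = 0.0
        for n in nodes:
            targets = out_links[n]
            if targets:
                # With probability alpha the walker jumps back to the focus node,
                # otherwise it follows one of the out links uniformly at random.
                jump += alpha * p[n]
                share = (1.0 - alpha) * p[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node: the walk continues from the focus node.
                jump += p[n]
        nxt[focus] += jump
        p = nxt
    return p


# Two nodes linking to each other; the focus node is "a".
cycle = {"a": ["b"], "b": ["a"]}
ppr = personalized_pagerank(cycle, "a", alpha=0.1)
```

For this two-node cycle the stationary value of the focus node is 1 / (2 − α), which can be verified by substituting into the iteration.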
Computing PageRank (and also the PPR variant) is reasonably scalable. However, as we need to compute PPR for each individual node in turn in order to build the co-occurrence matrix, this rapidly becomes too expensive. Moreover, the PageRank algorithm assigns a value to all nodes in the graph. If we computed the co-occurrence matrix this way, we would end up with a very large dense matrix (in our experiments below this would become around 500 TB) with many small values, which have little to no impact on the later training. Hence, we designed a faster, approximate all-pairs PPR computation method, which results in a sparse matrix. This algorithm is based on an approximate PPR method which we will introduce next.

BCA: A Fast Personalized PageRank Approximation
A method for faster computation of Personalized PageRank, called the Bookmark-Coloring Algorithm (BCA), was presented by Berkhin [1]. The main idea behind this method is to create an approximation to the standard PPR such that the effort of the algorithm is only spent on those nodes which will receive a significant rank. This requires fewer computations and, since nodes with no significant PageRank are not assigned a value, a sparse representation is obtained.
An intuitive version of the BCA algorithm is as follows (for full details, see [1]). To compute the PPR vector $p^{(b)}$ for a focus node $b$, we start by injecting a unit amount of paint, representing the walkers in the standard Personalized PageRank computation, into $b$. From this paint an α-portion is retained and added to the value for $b$ in $p^{(b)}$. The remaining (1 − α)-portion is distributed uniformly over the out-links. This retain-and-distribute process is then repeated recursively for all nodes which got paint injected. When a node has a zero out degree, the outgoing paint is discarded.
This basic algorithm can be improved by choosing the order in which nodes are considered for the retain-and-distribute step. It is more efficient to select nodes with a larger amount of paint first. To achieve this, a max priority queue with the amount of paint as priorities is maintained. In principle, the queue could contain an entry for each node involved in each distribute step. However, it is more efficient to merge the separate wet paint amounts into one entry. Hence, the queue must allow efficient finding and updating of elements. Finally, when the amount of paint to be distributed becomes negligible (i.e., less than the parameter ε), it gets discarded, making the resulting rank vector sparse. All these improvements are described in more detail in the BCA paper [1]. One more technique described in the same paper is the reuse of Bookmark-Coloring Vectors (BCV, the equivalent of the PageRank vector) for the computation of other BCVs. This is analyzed further for the case of hubs (i.e., nodes which correspond to a subset of important pages). The BCV is precomputed for these pages and whenever the retain-and-distribute process forwards paint to a page $p^{(h)}$ in the hub, the amount is multiplied with the BCV corresponding to $p^{(h)}$ and added to $p^{(b)}$. This optimization makes sense when many BCVs have to be computed, which is also the case for the co-occurrence matrix. However, since we are interested in computing the BCV for all nodes, further enhancements are possible, as we will discuss in the following subsection.
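The retain-and-distribute process with a priority queue and a discard threshold can be sketched as follows. This is an illustrative reading of the intuitive description above, not Berkhin's reference implementation and not the code used in the experiments. It assumes the graph is a dictionary of out-neighbour lists, uses Python's `heapq` with lazily invalidated entries in place of a queue with in-place updates, and omits the hub-reuse optimization. The function name `bca` is hypothetical.

```python
import heapq


def bca(out_links, focus, alpha=0.1, epsilon=1e-5):
    """Approximate Personalized PageRank vector of `focus` as a sparse dict."""
    rank = {}
    wet = {focus: 1.0}          # merged amount of undistributed paint per node
    heap = [(-1.0, focus)]      # max-heap emulated with negated amounts
    while heap:
        neg_amount, node = heapq.heappop(heap)
        amount = wet.get(node, 0.0)
        if amount == 0.0 or -neg_amount != amount:
            continue            # stale heap entry: paint was merged or already used
        del wet[node]
        if amount < epsilon:
            continue            # negligible paint is discarded (keeps the result sparse)
        # Retain an alpha-portion of the paint at this node.
        rank[node] = rank.get(node, 0.0) + alpha * amount
        targets = out_links.get(node, ())
        if not targets:
            continue            # zero out degree: outgoing paint is discarded
        # Distribute the remaining paint uniformly over the out links.
        share = (1.0 - alpha) * amount / len(targets)
        for t in targets:
            wet[t] = wet.get(t, 0.0) + share
            heapq.heappush(heap, (-wet[t], t))
    return rank


chain_bcv = bca({"a": ["b"], "b": []}, "a")
cycle_bcv = bca({"a": ["b"], "b": ["a"]}, "a")
```

On the two-node cycle, the value retained at the focus node approaches 1 / (2 − α), up to the paint discarded below ε.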

A Fast All-Pairs PPR Algorithm
The method introduced in the previous subsection speeds up individual PPR computations. Now, the observation leading to the reuse of BCVs for pages in a hub can be adapted to our setting. The main point is that the computation of the BCV of node $b$ can reuse the BCVs of nodes reachable through its out links. It is especially beneficial if the BCVs of nodes one hop away have already been computed. Adopting this viewpoint, we say that the computation of the BCV of the node $b$ depends on the BCV computation of all one-hop reachable nodes and hence $b$ is a dependent of these nodes. What we want to achieve is that we only compute the BCV for a node once the BCVs of all nodes it depends on have been computed. However, this will not always be feasible as the graph contains cycles. Hence, we want to quickly find an ordering of nodes such that we can likely reuse as many BCV computations as possible. To achieve this, we break cycles and in that case compute the BCV for the node at which we break without being able to count on the BCVs of all nodes it depends on being available. We choose the node for breaking the cycle to be the one with the highest in-degree, as that one is likely to cause the most reuse and break multiple cycles at once. The pseudocode of the algorithm is shown in Algorithm 1; the actual implementation also includes a couple of indexes and bitmaps to speed up the computation. With the order determined, we can compute each BCV, reusing many previously computed values.
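The ordering idea can be sketched in a few lines. This is a simplified illustration, not the implementation referred to above: it assumes the graph is a dictionary mapping every node to a list of out-neighbours (each target also being a key), selects the cycle-breaking node with a linear scan over the remaining nodes instead of a priority queue, and uses the in-degree in the remaining graph as the selection criterion. The function name `bcv_order` is hypothetical.

```python
def bcv_order(out_links):
    """Order nodes so that, where possible, a node comes after the nodes it depends on."""
    succ = {n: set(ts) for n, ts in out_links.items()}   # remaining out-neighbours
    pred = {n: set() for n in succ}                      # remaining in-neighbours
    for n, ts in succ.items():
        for t in ts:
            pred[t].add(n)
    order = []
    ready = [n for n in succ if not succ[n]]             # nodes with out-degree 0

    def remove(n):
        order.append(n)
        for p in pred[n]:
            succ[p].discard(n)
            if not succ[p] and p != n:
                ready.append(p)      # p no longer waits for any other node
        for s in succ[n]:
            pred[s].discard(n)
        del succ[n]
        del pred[n]

    while succ:
        while ready:
            remove(ready.pop())
        if succ:
            # Only cycles remain: break one at the node with the highest in-degree.
            remove(max(succ, key=lambda n: len(pred[n])))
    return order


dag_order = bcv_order({"a": ["b"], "b": ["c"], "c": []})
cyclic_order = bcv_order({"a": ["b"], "b": ["c"], "c": ["b"], "d": ["b"]})
```

In the second example, nodes "b" and "c" form a cycle; "b" has the highest in-degree, so it is scheduled first and all other nodes become computable afterwards.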

Biasing the Random Walks
The default PageRank and BCA algorithms assume that a random walker will follow the out edges of a node with equal likelihood. However, one can also create a setup in which some out edges are more likely than others. For BCA, this possibility was already hinted at in the original paper [1], but not elaborated much further. This so-called biasing can be accomplished by taking into account the out edge weights when distributing the paint over them. Following our previous work [3], we apply twelve different strategies for assigning these weights to the edges of the graph. These weights will then in turn bias the random walks on the graph. In particular, when a walk arrives in a vertex $v$ with out edges $v_{o1}, \ldots, v_{od}$, then the walk will follow edge $v_{ol}$ with a probability computed by

$$\Pr[\text{follow edge } v_{ol}] = \frac{\mathrm{weight}(v_{ol})}{\sum_{i=1}^{d} \mathrm{weight}(v_{oi})}$$

In other words, the normalized edge weights are directly interpreted as the probability to follow a particular edge. To obtain these edge weights, we make use of the following statistics computed from the RDF data:
- Predicate Frequency: for each predicate in the dataset, we count the number of times the predicate occurs (only occurrences as a predicate are counted).
- Object Frequency: for each resource in the dataset, we count the number of times it occurs as the object of a triple.
- Predicate-Object Frequency: for each pair of a predicate and an object in the dataset, we count the number of times there is a statement with this predicate and object.
Besides these statistics, we also use PageRank [18] computed for the entities in the knowledge graph [29]. This PageRank is computed based on links between the Wikipedia articles representing the respective entities. When using the PageRank computed for DBpedia, not each node has a value assigned, as only entities which have a corresponding Wikipedia page are accounted for in the PageRank computation. Examples of nodes which do not have a PageRank include DBpedia types or categories, like http://dbpedia.org/ontology/Place and http://dbpedia.org/resource/Category:Central_Europe. Therefore, we assigned a fixed PageRank to all nodes which are not entities. We chose a value of 0.2, which is roughly the median PageRank in the non-normalized PageRank values we used.
We have essentially two types of metrics, those assigned to nodes, and those assigned to edges. The predicate frequency and predicate-object frequency, as well as the inverses of these, can be directly used as weights for edges. Therefore, we call these weighting methods edge-centric. In the case of predicate frequency, each predicate edge with that label is assigned the weight in question. In the case of predicate-object frequency, each predicate edge which ends in a given object gets assigned the predicate-object frequency. We also use the inverse metrics, where not the absolute frequency is assigned, but its multiplicative inverse.
In contrast, the object frequency, and also the used PageRank metric, assign a numeric score to each node in the graph. Therefore, we call weighting approaches based on them node-centric. To obtain a weight for the edges, we either push the weight down, meaning that the number assigned to a node is used as the weight of all in edges, or we split the number down, meaning that the weight is divided by the number of in edges and then assigned to all these edges. If split is not mentioned explicitly in node-centric weighting strategies, then it is a push down strategy.
Note that uniform weights are equivalent to using object frequency with splitting the weights. To see why this holds true, we have to follow the steps which will be taken. First, each node gets assigned the number of times it is used as an object. This number is equal to the number of in edges to the node. Then, this number is split over the in edges, i.e., each in edge gets assigned the number 1. Finally, this weight is normalized, assigning to each out link a uniform weight. Hence, this strategy would result in the same walks as using unbiased random walks over the graph.
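The equivalence argued above can be checked on a toy example. The sketch below is an illustration only: the triples, the function names, and the representation of an out edge as a (predicate, object) pair are assumptions, and the triples are assumed to be distinct. It normalizes edge weights per subject into transition probabilities and compares predicate frequency weighting with object frequency weighting using the split strategy.

```python
from collections import Counter, defaultdict


def transition_probabilities(triples, edge_weight):
    """Normalize the weights of each node's out edges into walk probabilities."""
    out_edges = defaultdict(list)
    for s, p, o in triples:
        out_edges[s].append(((p, o), edge_weight(s, p, o)))
    probs = {}
    for s, edges in out_edges.items():
        total = sum(w for _, w in edges)
        probs[s] = {edge: w / total for edge, w in edges}
    return probs


def predicate_frequency(triples):
    """Edge-centric weight: how often the predicate of the edge occurs."""
    freq = Counter(p for _, p, _ in triples)
    return lambda s, p, o: freq[p]


def object_frequency_split(triples):
    """Node-centric weight, split: object frequency divided by the number of in edges."""
    freq = Counter(o for _, _, o in triples)      # times used as an object
    in_degree = Counter(o for _, _, o in triples) # number of in edges (same count)
    return lambda s, p, o: freq[o] / in_degree[o]


triples = [("a", "p", "b"), ("a", "q", "c"), ("d", "p", "c")]
by_predicate = transition_probabilities(triples, predicate_frequency(triples))
by_object_split = transition_probabilities(triples, object_frequency_split(triples))
```

With predicate frequency, the walk from "a" prefers the edge labelled "p" (which occurs twice) over "q"; with object frequency and splitting, every edge weight is 1 and the walk is uniform.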
So, even if we add unbiased random walks to the list of weighting strategies, we retain 12 unique ones, each with their own characteristics. These strategies are further elaborated upon in our earlier work [3].

Combining the Pieces
In earlier work on RDF graph embeddings (specifically RDF2Vec [22]), symmetric windows were used on top of generated random walks, which include both node and edge labels. These symmetric windows have the focus word in the middle and the context of the word is both before and after it. This means that the context of a node $b$ consists of the nodes it can reach by following edges, as well as the nodes which can reach $b$. As a consequence, the result of RDF2Vec would be the same, independently of whether the original walks were performed forward or backward. Inspired by this, we investigated the effect of creating the co-occurrence matrix as the sum of the normal PPR matrix as described above and the PPR matrix of the graph with all edges reversed. Since a positive effect on the embeddings was obtained (at least for the tasks we used in the evaluation), we chose to use this approach.
RDF2Vec also includes edge labels in the walks and the embedding procedure. We also noticed a positive effect of including the edge labels whenever they are traversed by paint, with a weight equal to the amount of paint. Because the summation and additions of the label weights might lead to a skew in the values, we normalize each BCV in the co-occurrence matrix by removing the value on the diagonal and scaling the remaining values such that their sum is 1. This operation led to improvements in the results and hence we adopted this technique for the overall algorithm. The pseudocode of the Global RDF Vector Space Embedding algorithm can be found in Algorithm 2.
The overall algorithm has several parameters. First, there is the weighting strategy; the options are described above. Second, there are the parameters α and ε for the BCA algorithm. We chose α = 0.1 and ε = 0.00001, which is within the ranges stated by Berkhin [1]. Third, there are the parameters for the GloVe training. There is the vector length, which we choose to be 200, which is in the middle of the sizes used in the original GloVe experiments [20]. We use 20 training iterations as we noticed that more iterations did not significantly decrease the cost function. We used the default values for the Adagrad learning rate and damp function.
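The combination and normalization steps described above can be sketched on sparse rows. This is a minimal illustration, not the paper's Algorithm 2: it assumes both PPR matrices are given as dictionaries mapping a focus node to a sparse row (a dictionary from context node to score), and it leaves out edge weighting, edge labels, and the GloVe training itself. The function name and the toy values are hypothetical.

```python
def combine_and_normalize(forward, backward):
    """Sum two sparse PPR matrices, drop the diagonal, and scale each row to sum 1."""
    combined = {}
    for matrix in (forward, backward):
        for focus, row in matrix.items():
            target = combined.setdefault(focus, {})
            for ctx, value in row.items():
                target[ctx] = target.get(ctx, 0.0) + value
    normalized = {}
    for focus, row in combined.items():
        # Remove the value on the diagonal (the focus node as its own context).
        off_diagonal = {c: v for c, v in row.items() if c != focus}
        total = sum(off_diagonal.values())
        normalized[focus] = (
            {c: v / total for c, v in off_diagonal.items()} if total > 0 else {}
        )
    return normalized


forward = {"a": {"a": 0.1, "b": 0.09}}    # BCV of "a" on the original graph
backward = {"a": {"a": 0.1, "c": 0.03}}   # BCV of "a" on the reversed graph
cooccurrence = combine_and_normalize(forward, backward)
```

The resulting rows would then play the role of co-occurrence counts in the GloVe training.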

Evaluation
First, we evaluate the different weighting strategies on a number of classification and regression tasks, comparing the results of different feature extraction strategies combined with different learning algorithms. Second, we evaluate the weighting strategies on the task of computing document similarity. We evaluate our approach using DBpedia [12]. We use the English version of the 2016-04 DBpedia dataset, which contains 4,678,230 instances and 1,379 mapping-based properties. In our evaluation we only consider object properties, and ignore literals. All the experiments were run on a Linux machine using at most 300 GB RAM and 24 Intel Xeon 2.60 GHz CPUs. Depending on the weighting strategy, the processes took between 6 hours for the least demanding strategy, the Predicate Frequency strategy, and 48 hours for the most demanding strategy, the Predicate-Object Frequency. The runtime for building the related work approaches, using the publicly available code, was more than a week.

Machine Learning Tasks
We use the DBpedia entity embeddings on five different datasets from different domains, for the tasks of classification and regression, i.e., Cities, Metacritic Movies, Metacritic Albums, AAUP, and Forbes. Details on the datasets can be found in [23]. We follow the same experimental setup as in our RDF2Vec paper [22], using Naive Bayes, k-Nearest Neighbors, C4.5, and Support Vector Machine for classification, and Linear Regression, M5Rules, and k-Nearest Neighbors for regression, measuring accuracy and root mean squared error (RMSE) in stratified 10-fold cross validation. The results on parameter settings for the algorithms can be found in [22]. Furthermore, from our original RDF2Vec paper [22], we report the best baseline and the best RDF2Vec performance. As an additional baseline, we use the same set of random walks used in [22] to build a simple GloVe model, and report the results under RDF2VecGloVe. Furthermore, we compare our results to the embedding approaches TransE, TransH, and TransR, which have been shown to be scalable to large knowledge graphs.
Tables 1 and 3 depict the results for the classification and regression tasks. We determine the significance in ranking of the approaches using the approach introduced by Demšar [4], as discussed in [22]. The results are depicted in Tables 2 and 4. We can observe that although RDF2Vec is a very strong competitor, the approach introduced in this paper is capable of producing embeddings which outperform the results achieved with RDF2Vec in specific cases, in particular for classification algorithms which yield inferior results with RDF2Vec. It is also remarkable that TransE, TransH, and TransR are often outperformed by the baselines. Furthermore, we can observe that a naive application of the GloVe approach to walks (RDF2VecGloVe) does not lead to convincing results.

Document Modeling
Calculating entity similarity lies at the heart of knowledge-rich approaches to computing semantic similarity, a fundamental task in Natural Language Processing and Information Retrieval [32]. As previously mentioned, semantically similar entities appear close to each other in the embedding space. Therefore, the problem of calculating the similarity between two instances is a matter of calculating the distance between two instances in the given feature space. To do so, we use the standard cosine similarity measure, which is applied on the vectors of the entities.
We use the entity similarity approach in the task of calculating semantic document similarity. We follow an approach similar to the one presented in [19], where two documents are considered to be similar if many entities of the one document are similar to at least one entity in the other document. More precisely, we try to identify the most similar pairs of entities in both documents, ignoring all the other 1-1 similarity values. The similarity of two documents is then defined as the average maximum similarity for all entities in each document (see [3]). We evaluate the performance of the document similarity approach using the LP50 dataset [11]. We follow standard practices and use Pearson's linear correlation coefficient and Spearman's rank correlation plus their harmonic mean as evaluation metrics. In addition to the baselines introduced above, we compare our approach to the following approaches:
- TF-IDF: distributional baseline algorithm.
- GED: semantic similarity using a Graph Edit Distance based measure [27].
The results for the related approaches were taken from the respective papers, except for ESA, which was taken from [19], where it is calculated via the public ESA REST endpoint. All results are collected in Table 5. We can see that our approach, using inverse predicate-object frequency weights, outperforms the state-of-the-art approaches, as well as the embeddings generated by RDF2Vec.
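The document similarity measure described in this section (the average of the maximum entity similarities) can be sketched as follows. This is one plausible reading of that definition, given here as an assumption: the best-match similarities of both documents are pooled and averaged, and documents are represented as lists of entity vectors. The exact normalization used in [3] and [19] may differ, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def document_similarity(doc1, doc2):
    """Average, over all entities of both documents, of the best match in the other one."""
    if not doc1 or not doc2:
        return 0.0
    best1 = [max(cosine(e, f) for f in doc2) for e in doc1]
    best2 = [max(cosine(f, e) for e in doc1) for f in doc2]
    return (sum(best1) + sum(best2)) / (len(best1) + len(best2))


doc_a = [[1.0, 0.0]]                 # one entity
doc_b = [[1.0, 0.0], [0.0, 1.0]]     # two entities, one identical to doc_a's
similarity = document_similarity(doc_a, doc_b)
```

In the toy example, two of the three entities find a perfect match and one finds none, which yields a similarity of 2/3.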

Conclusion and Outlook
In this paper, we have introduced a novel approach for generating embeddings of RDF graphs, which exploits global instead of local patterns. We have shown that it is possible to outperform local graph embedding techniques, in particular on document similarity. For most other tasks similar performance can be obtained.
One key finding of this work is that weighting techniques are a crucial factor in the overall performance. In the future, we would like to investigate this point more thoroughly, and analyze the interplay of the dataset, the task, the learning algorithm, and the weighting technique more formally and with more exhaustive experimentation. One way to achieve this is by evaluating the embedding using intrinsic measures such as those suggested in [26]. Besides, we would like to further investigate how the literals in the dataset can be incorporated while learning the embedding. Furthermore, as GloVe embeddings are known to work particularly well for finding analogies, we plan to adapt the approach for predicting missing links in RDF data sets.
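Algorithm 1
    G ← copy of the input graph              // Copied because G will be modified in the function
    Initialize list Order                     // The list with the node ordering
    Initialize max priority queue Q_indeg     // The nodes in ascending in-degree
    Add all nodes to Q_indeg
    repeat
        while G has a node n with out-degree 0 do
            Add n to Order, Remove n from G, Remove n from Q_indeg
        end while
        if G is not empty then                // There is a cycle which needs to be broken
            n ← Q_indeg.pop()
            Add n to Order, Remove n from G
            for all d dependent on n do
                Update priority of d in Q_indeg
            end for
        end if
    until G is empty
    return Order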

Table 2: Classification average rank results. The best ranked results for each method are marked in bold. The learning models for which the strategies were shown to have a significant difference based on the Friedman test with α < 0.05 are marked with *. Single values marked with * are significantly worse than the best strategy at significance level q = 0.05.

Table 4: Regression average rank results. The best ranked results for each method are marked in bold. The learning models for which the strategies were shown to have a significant difference based on the Friedman test with α < 0.05 are marked with *. Single values marked with * are significantly worse than the best strategy at significance level q = 0.05.