Modeling Relational Data with Graph Convolutional Networks

Michael Schlichtkrull∗ (University of Amsterdam), Thomas N. Kipf∗ (University of Amsterdam), Peter Bloem (VU Amsterdam), Rianne van den Berg (University of Amsterdam), Ivan Titov (University of Amsterdam), Max Welling (University of Amsterdam, CIFAR†)

∗ Equal contribution. † Canadian Institute for Advanced Research.

arXiv:1703.06103v4 [stat.ML] 26 Oct 2017
Abstract

Knowledge graphs enable a wide variety of applications, including question answering and information retrieval. Despite the great effort invested in their creation and maintenance, even the largest (e.g., Yago, DBPedia or Wikidata) remain incomplete. We introduce Relational Graph Convolutional Networks (R-GCNs) and apply them to two standard knowledge base completion tasks: link prediction (recovery of missing facts, i.e. subject-predicate-object triples) and entity classification (recovery of missing entity attributes). R-GCNs are related to a recent class of neural networks operating on graphs, and are developed specifically to deal with the highly multi-relational data characteristic of realistic knowledge bases. We demonstrate the effectiveness of R-GCNs as a stand-alone model for entity classification. We further show that factorization models for link prediction such as DistMult can be significantly improved by enriching them with an encoder model to accumulate evidence over multiple inference steps in the relational graph, demonstrating a large improvement of 29.8% on FB15k-237 over a decoder-only baseline.

[Figure 1: A knowledge base fragment: The nodes are entities, the edges are relations labeled with their types, the nodes are labeled with entity types (e.g., university). The edge and the node label shown in red are the missing information to be inferred.]

1 Introduction

Knowledge bases organize and store factual knowledge, enabling a multitude of applications including question answering (Yao and Van Durme 2014; Bao et al. 2014; Seyler, Yahya, and Berberich 2015; Hixon, Clark, and Hajishirzi 2015; Bordes et al. 2015; Dong et al. 2015) and information retrieval (Kotov and Zhai 2012; Dalton, Dietz, and Allan 2014; Xiong and Callan 2015b; 2015a). Even the largest knowledge bases (e.g. DBPedia, Wikidata or Yago), despite enormous effort invested in their maintenance, are incomplete, and the lack of coverage harms downstream applications. Predicting missing information in knowledge bases is the main focus of statistical relational learning (SRL).

Following previous work on SRL, we assume that knowledge bases store collections of triples of the form (subject, predicate, object). Consider, for example, the triple (Mikhail Baryshnikov, educated at, Vaganova Academy), where we will refer to Baryshnikov and Vaganova Academy as entities and to educated at as a relation. Additionally, we assume that entities are labeled with types (e.g., Vaganova Academy is marked as a university). It is convenient to represent knowledge bases as directed labeled multigraphs with entities corresponding to nodes and triples encoded by labeled edges (see Figure 1).

We consider two fundamental SRL tasks: link prediction (recovery of missing triples) and entity classification (assigning types or categorical properties to entities). In both cases, many missing pieces of information can be expected to reside within the graph encoded through the neighborhood structure, i.e. knowing that Mikhail Baryshnikov was educated at the Vaganova Academy implies both that Mikhail Baryshnikov should have the label person, and that the triple (Mikhail Baryshnikov, lived in, Russia) must belong to the knowledge graph. Following this intuition, we develop an encoder model for entities in the relational graph and apply it to both tasks.

Our entity classification model, similarly to Kipf and Welling (2017), uses softmax classifiers at each node in the graph. The classifiers take node representations supplied by a relational graph convolutional network (R-GCN) and predict the labels. The model, including R-GCN parameters, is learned by optimizing the cross-entropy loss.

Our link prediction model can be regarded as an autoencoder consisting of (1) an encoder: an R-GCN producing latent feature representations of entities, and (2) a decoder: a tensor factorization model exploiting these representations
to predict labeled edges. Though in principle the decoder can rely on any type of factorization (or generally any scoring function), we use one of the simplest and most effective factorization methods: DistMult (Yang et al. 2014). We observe that our method achieves competitive results on standard benchmarks, outperforming, among other baselines, direct optimization of the factorization (i.e. vanilla DistMult). This improvement is especially large when we consider the more challenging FB15k-237 dataset (Toutanova and Chen 2015). This result demonstrates that explicit modeling of neighborhoods in R-GCNs is beneficial for recovering missing facts in knowledge bases.

Our main contributions are as follows. To the best of our knowledge, we are the first to show that the GCN framework can be applied to modeling relational data, specifically to link prediction and entity classification tasks. Secondly, we introduce techniques for parameter sharing and to enforce sparsity constraints, and use them to apply R-GCNs to multigraphs with large numbers of relations. Lastly, we show that the performance of factorization models, at the example of DistMult, can be significantly improved by enriching them with an encoder model that performs multiple steps of information propagation in the relational graph.

2 Neural relational modeling

We introduce the following notation: we denote directed and labeled multi-graphs as G = (V, E, R) with nodes (entities) v_i ∈ V and labeled edges (relations) (v_i, r, v_j) ∈ E, where r ∈ R is a relation type.¹

¹ R contains relations both in canonical direction (e.g. born in) and in inverse direction (e.g. born in inv).

2.1 Relational graph convolutional networks

Our model is primarily motivated as an extension of GCNs that operate on local graph neighborhoods (Duvenaud et al. 2015; Kipf and Welling 2017) to large-scale relational data. These and related methods such as graph neural networks (Scarselli et al. 2009) can be understood as special cases of a simple differentiable message-passing framework (Gilmer et al. 2017):

h_i^{(l+1)} = \sigma\left( \sum_{m \in M_i} g_m\left(h_i^{(l)}, h_j^{(l)}\right) \right),   (1)

where h_i^{(l)} ∈ R^{d^{(l)}} is the hidden state of node v_i in the l-th layer of the neural network, with d^{(l)} being the dimensionality of this layer's representations. Incoming messages of the form g_m(·, ·) are accumulated and passed through an element-wise activation function σ(·), such as the ReLU(·) = max(0, ·).² M_i denotes the set of incoming messages for node v_i and is often chosen to be identical to the set of incoming edges. g_m(·, ·) is typically chosen to be a (message-specific) neural network-like function or simply a linear transformation g_m(h_i, h_j) = W h_j with a weight matrix W such as in Kipf and Welling (2017).

² Note that this represents a simplification of the message passing neural network proposed in (Gilmer et al. 2017) that suffices to include the aforementioned models as special cases.

This type of transformation has been shown to be very effective at accumulating and encoding features from local, structured neighborhoods, and has led to significant improvements in areas such as graph classification (Duvenaud et al. 2015) and graph-based semi-supervised learning (Kipf and Welling 2017).

Motivated by these architectures, we define the following simple propagation model for calculating the forward-pass update of an entity or node denoted by v_i in a relational (directed and labeled) multi-graph:

h_i^{(l+1)} = \sigma\left( \sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)} \right),   (2)

where N_i^r denotes the set of neighbor indices of node i under relation r ∈ R. c_{i,r} is a problem-specific normalization constant that can either be learned or chosen in advance (such as c_{i,r} = |N_i^r|).

Intuitively, (2) accumulates transformed feature vectors of neighboring nodes through a normalized sum. Different from regular GCNs, we introduce relation-specific transformations, i.e. depending on the type and direction of an edge. To ensure that the representation of a node at layer l + 1 can also be informed by the corresponding representation at layer l, we add a single self-connection of a special relation type to each node in the data. Note that instead of simple linear message transformations, one could choose more flexible functions such as multi-layer neural networks (at the expense of computational efficiency). We leave this for future work.

A neural network layer update consists of evaluating (2) in parallel for every node in the graph. In practice, (2) can be implemented efficiently using sparse matrix multiplications to avoid explicit summation over neighborhoods. Multiple layers can be stacked to allow for dependencies across several relational steps. We refer to this graph encoder model as a relational graph convolutional network (R-GCN). The computation graph for a single node update in the R-GCN model is depicted in Figure 2.
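To make the propagation rule (2) concrete, the following NumPy sketch computes a single R-GCN layer update for a small graph. It is an illustrative sketch only: the function name, the dense adjacency layout and the array shapes are our assumptions rather than the authors' released implementation, and a practical version would use the sparse matrix multiplications mentioned above.

import numpy as np

def rgcn_layer(H, adj, W_rel, W_self):
    # H      : (N, d_in)          node states h_i^{(l)}
    # adj    : (R, N, N)          adj[r, i, j] = 1 if there is an edge from v_j to v_i under relation r
    # W_rel  : (R, d_in, d_out)   relation-specific weights W_r^{(l)}
    # W_self : (d_in, d_out)      self-connection weight W_0^{(l)}
    out = H @ W_self                              # self-connection term W_0^{(l)} h_i^{(l)}
    for r in range(adj.shape[0]):
        msg = adj[r] @ (H @ W_rel[r])             # sum over j in N_i^r of W_r^{(l)} h_j^{(l)}
        c = adj[r].sum(axis=1, keepdims=True)     # c_{i,r} = |N_i^r|
        out += msg / np.maximum(c, 1.0)           # normalized sum; avoid division by zero
    return np.maximum(out, 0.0)                   # element-wise activation sigma (here ReLU)

Stacking several such layers, with for instance one-hot inputs at the first layer, yields the R-GCN encoder described in the text.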
[Figure 2: Diagram for computing the update of a single graph node/entity (red) in the R-GCN model. Activations (d-dimensional vectors) from neighboring nodes (dark blue) are gathered and then transformed for each relation type individually (for both in- and outgoing edges). The resulting representation (green) is accumulated in a (normalized) sum and passed through an activation function (such as the ReLU). This per-node update can be computed in parallel with shared parameters across the whole graph.]

2.2 Regularization

A central issue with applying (2) to highly multi-relational data is the rapid growth in the number of parameters with the number of relations in the graph. In practice this can easily lead to overfitting on rare relations and to models of very large size.

To address this issue, we introduce two separate methods for regularizing the weights of R-GCN layers: basis- and block-diagonal-decomposition. With the basis decomposition, each W_r^{(l)} is defined as follows:

W_r^{(l)} = \sum_{b=1}^{B} a_{rb}^{(l)} V_b^{(l)},   (3)

i.e. as a linear combination of basis transformations V_b^{(l)} ∈ R^{d^{(l+1)} × d^{(l)}} with coefficients a_{rb}^{(l)} such that only the coefficients depend on r. In the block-diagonal decomposition, we let each W_r^{(l)} be defined through the direct sum over a set of low-dimensional matrices:

W_r^{(l)} = \bigoplus_{b=1}^{B} Q_{br}^{(l)}.   (4)

Thereby, W_r^{(l)} are block-diagonal matrices diag(Q_{1r}^{(l)}, ..., Q_{Br}^{(l)}) with Q_{br}^{(l)} ∈ R^{(d^{(l+1)}/B) × (d^{(l)}/B)}.

The basis function decomposition (3) can be seen as a form of effective weight sharing between different relation types, while the block decomposition (4) can be seen as a sparsity constraint on the weight matrices for each relation type. The block decomposition structure encodes an intuition that latent features can be grouped into sets of variables which are more tightly coupled within groups than across groups. Both decompositions reduce the number of parameters needed to learn for highly multi-relational data (such as realistic knowledge bases). At the same time, we expect that the basis parameterization can alleviate overfitting on rare relations, as parameter updates are shared between both rare and more frequent relations.
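As an illustration of the two parameterizations, the sketch below assembles a relation-specific weight matrix from shared basis matrices (Eq. 3) and from low-dimensional blocks (Eq. 4). Shapes and helper names are illustrative assumptions, not the authors' code.

import numpy as np

def basis_weight(a_r, V):
    # Basis decomposition (Eq. 3): W_r = sum_b a_rb V_b.
    # a_r : (B,)               coefficients a_rb for relation r (the only relation-specific parameters)
    # V   : (B, d_out, d_in)   basis transformations V_b shared by all relations
    return np.tensordot(a_r, V, axes=1)            # (d_out, d_in)

def block_diag_weight(Q_r):
    # Block-diagonal decomposition (Eq. 4): W_r = diag(Q_1r, ..., Q_Br).
    # Q_r : (B, d_out // B, d_in // B)  low-dimensional blocks for relation r
    B, bo, bi = Q_r.shape
    W = np.zeros((B * bo, B * bi))
    for b in range(B):
        W[b * bo:(b + 1) * bo, b * bi:(b + 1) * bi] = Q_r[b]
    return W

Under the basis decomposition only the B coefficients per relation are relation-specific, so the parameter count grows slowly with the number of relations.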
The overall R-GCN model then takes the following form: We stack L layers as defined in (2) – the output of the previous layer being the input to the next layer. The input to the first layer can be chosen as a unique one-hot vector for each node in the graph if no other features are present. For the block representation, we map this one-hot vector to a dense representation through a single linear transformation. While we only consider such a featureless approach in this work, we note that it was shown in Kipf and Welling (2017) that it is possible for this class of models to make use of predefined feature vectors (e.g. a bag-of-words description of a document associated with a specific node).

[Figure 3: (a) Depiction of an R-GCN model for entity classification with a per-node loss function. (b) Link prediction model with an R-GCN encoder (interspersed with fully-connected/dense layers) and a DistMult decoder that takes pairs of hidden node representations and produces a score for every (potential) edge in the graph. The loss is evaluated per edge.]

3 Entity classification

For (semi-)supervised classification of nodes (entities), we simply stack R-GCN layers of the form (2), with a softmax(·) activation (per node) on the output of the last layer. We minimize the following cross-entropy loss on all labeled nodes (while ignoring unlabeled nodes):

L = -\sum_{i \in Y} \sum_{k=1}^{K} t_{ik} \ln h_{ik}^{(L)},   (5)

where Y is the set of node indices that have labels and h_{ik}^{(L)} is the k-th entry of the network output for the i-th labeled node. t_{ik} denotes its respective ground truth label. In practice, we train the model using (full-batch) gradient descent techniques. A schematic depiction of our entity classification model is given in Figure 3a.
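A minimal sketch of the per-node classifier and the loss in (5), assuming the last R-GCN layer outputs one score per class; the helper names are hypothetical.

import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)           # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def classification_loss(scores, labels, labeled_idx):
    # scores      : (N, K) output of the last R-GCN layer, one row per node
    # labels      : (N,)   integer ground-truth class (only used for labeled nodes)
    # labeled_idx : node indices in Y that carry a label
    h = softmax(scores[labeled_idx])               # h_ik^{(L)} after the per-node softmax
    t = labels[labeled_idx]
    return -np.sum(np.log(h[np.arange(len(t)), t] + 1e-12))   # Eq. (5) with one-hot t_ik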
4 Link prediction

Link prediction deals with prediction of new facts (i.e. triples (subject, relation, object)). Formally, the knowledge base is represented by a directed, labeled graph G = (V, E, R). Rather than the full set of edges E, we are given only an incomplete subset Ê. The task is to assign scores f(s, r, o) to possible edges (s, r, o) in order to determine how likely those edges are to belong to E.

In order to tackle this problem, we introduce a graph auto-encoder model, comprised of an entity encoder and a scoring function (decoder). The encoder maps each entity v_i ∈ V to a real-valued vector e_i ∈ R^d. The decoder reconstructs edges of the graph relying on the vertex representations; in other words, it scores (subject, relation, object)-triples through a function s : R^d × R × R^d → R. Most existing approaches to link prediction (for example, tensor and neural factorization methods (Socher et al. 2013; Lin et al. 2015; Toutanova et al. 2016; Yang et al. 2014; Trouillon et al. 2016)) can be interpreted under this framework. The crucial distinguishing characteristic of our work is the reliance on an encoder. Whereas most previous approaches use a single, real-valued vector e_i for every v_i ∈ V optimized directly in training, we compute representations through an R-GCN encoder with e_i = h_i^{(L)}, similar to the graph auto-encoder model introduced in Kipf and Welling (2016) for unlabeled undirected graphs. Our full link prediction model is schematically depicted in Figure 3b.

In our experiments, we use the DistMult factorization (Yang et al. 2014) as the scoring function, which is known to perform well on standard link prediction benchmarks when used on its own. In DistMult, every relation r is associated with a diagonal matrix R_r ∈ R^{d×d} and a triple (s, r, o) is scored as

f(s, r, o) = e_s^T R_r e_o.   (6)

As in previous work on factorization (Yang et al. 2014; Trouillon et al. 2016), we train the model with negative sampling. For each observed example we sample ω negative ones. We sample by randomly corrupting either the subject or the object of each positive example. We optimize for cross-entropy loss to push the model to score observable triples higher than the negative ones:

L = -\frac{1}{(1+\omega)|\hat{E}|} \sum_{(s,r,o,y) \in T} \left[ y \log \ell\big(f(s,r,o)\big) + (1-y) \log\big(1 - \ell(f(s,r,o))\big) \right],   (7)

where T is the total set of real and corrupted triples, ℓ is the logistic sigmoid function, and y is an indicator set to y = 1 for positive triples and y = 0 for negative ones.
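The decoder and its training objective can be sketched as follows. The entity and relation containers are illustrative assumptions (the actual entity embeddings would come from the R-GCN encoder), and ℓ is the logistic sigmoid as in (7).

import numpy as np

def distmult_score(e_s, r_diag, e_o):
    # DistMult (Eq. 6): f(s, r, o) = e_s^T R_r e_o with diagonal R_r.
    return float(np.sum(e_s * r_diag * e_o))

def link_loss(E, R_diag, triples, labels, omega=1):
    # E       : (N, d)    entity embeddings e_i produced by the encoder
    # R_diag  : (|R|, d)  diagonal entries of R_r for every relation
    # triples : list of (s, r, o) index triples, positives plus corrupted negatives
    # labels  : y = 1 for observed triples, y = 0 for corrupted ones
    total = 0.0
    for (s, r, o), y in zip(triples, labels):
        p = 1.0 / (1.0 + np.exp(-distmult_score(E[s], R_diag[r], E[o])))   # logistic sigmoid
        total += y * np.log(p + 1e-12) + (1 - y) * np.log(1.0 - p + 1e-12)
    n_pos = sum(labels)                                # |Ê|: number of observed triples
    return -total / ((1 + omega) * max(n_pos, 1))      # normalization as in Eq. (7)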
5 Empirical evaluation

5.1 Entity classification experiments

Here, we consider the task of classifying entities in a knowledge base. In order to infer, for example, the type of an entity (e.g. person or company), a successful model needs to reason about the relations with other entities that this entity is involved in.

Datasets We evaluate our model on four datasets³ in Resource Description Framework (RDF) format (Ristoski, de Vries, and Paulheim 2016): AIFB, MUTAG, BGS, and AM. Relations in these datasets need not necessarily encode directed subject-object relations, but are also used to encode the presence, or absence, of a specific feature for a given entity. In each dataset, the targets to be classified are properties of a group of entities represented as nodes. The exact statistics of the datasets can be found in Table 1. For a more detailed description of the datasets the reader is referred to Ristoski, de Vries, and Paulheim (2016). We remove relations that were used to create entity labels: employs and affiliation for AIFB, isMutagenic for MUTAG, hasLithogenesis for BGS, and objectCategory and material for AM.

³ http://dws.informatik.uni-mannheim.de/en/research/a-collection-of-benchmark-datasets-for-ml

Dataset     AIFB     MUTAG    BGS       AM
Entities    8,285    23,644   333,845   1,666,764
Relations   45       23       103       133
Edges       29,043   74,227   916,199   5,988,321
Labeled     176      340      146       1,000
Classes     4        2        2         11

Table 1: Number of entities, relations, edges and classes along with the number of labeled entities for each of the datasets. Labeled denotes the subset of entities that have labels and that are to be classified.

Baselines As a baseline for our experiments, we compare against recent state-of-the-art classification results from RDF2Vec embeddings (Ristoski and Paulheim 2016), Weisfeiler-Lehman kernels (WL) (Shervashidze et al. 2011; de Vries and de Rooij 2015), and hand-designed feature extractors (Feat) (Paulheim and Fürnkranz 2012). Feat assembles a feature vector from the in- and out-degree (per relation) of every labeled entity. RDF2Vec extracts walks on labeled graphs which are then processed using the Skipgram (Mikolov et al. 2013) model to generate entity embeddings, used for subsequent classification. See Ristoski and Paulheim (2016) for an in-depth description and discussion of these baseline approaches. All entity classification experiments were run on CPU nodes with 64GB of memory.

Results All results in Table 2 are reported on the train/test benchmark splits from Ristoski, de Vries, and Paulheim (2016). We further set aside 20% of the training set as a validation set for hyperparameter tuning. For R-GCN, we report performance of a 2-layer model with 16 hidden units (10 for AM), basis function decomposition (Eq. 3), and trained with Adam (Kingma and Ba 2014) for 50 epochs using a learning rate of 0.01. The normalization constant is chosen as c_{i,r} = |N_i^r|. Further details on (baseline) models and hyperparameter choices are provided in the supplementary material.

Model     AIFB    MUTAG   BGS     AM
Feat      55.55   77.94   72.41   66.66
WL        80.55   80.88   86.20   87.37
RDF2Vec   88.88   67.20   87.24   88.33
R-GCN     95.83   73.23   83.10   89.29

Table 2: Entity classification results in accuracy (averaged over 10 runs) for a feature-based baseline (see main text for details), WL (Shervashidze et al. 2011; de Vries and de Rooij 2015), RDF2Vec (Ristoski and Paulheim 2016), and R-GCN (this work). Test performance is reported on the train/test set splits provided by Ristoski, de Vries, and Paulheim (2016).

Our model achieves state-of-the-art results on AIFB and AM. To explain the gap in performance on MUTAG and BGS it is important to understand the nature of these
datasets. MUTAG is a dataset of molecular graphs, which was later converted to RDF format, where relations either indicate atomic bonds or merely the presence of a certain feature. BGS is a dataset of rock types with hierarchical feature descriptions which was similarly converted to RDF format, where relations encode the presence of a certain feature or feature hierarchy. Labeled entities in MUTAG and BGS are only connected via high-degree hub nodes that encode a certain feature.

We conjecture that the fixed choice of normalization constant for the aggregation of messages from neighboring nodes is partly to blame for this behavior, which can be particularly problematic for nodes of high degree. A potential way to overcome this limitation is to introduce an attention mechanism, i.e. to replace the normalization constant 1/c_{i,r} with data-dependent attention weights a_{ij,r}, where \sum_{j,r} a_{ij,r} = 1. We expect this to be a promising avenue for future research.

[Figure 4: Mean reciprocal rank (MRR) for R-GCN and DistMult on the FB15k validation data as a function of the node degree (average of subject and object).]

5.2 Link prediction experiments

As shown in the previous section, R-GCNs serve as an effective encoder for relational data. We now combine our encoder model with a scoring function (which we will refer to as a decoder, see Figure 3b) to score candidate triples for link prediction in knowledge bases.

Datasets Link prediction algorithms are commonly evaluated on FB15k, a subset of the relational database Freebase, and WN18, a subset of WordNet containing lexical relations between words. In Toutanova and Chen (2015), a serious flaw was observed in both datasets: the presence of inverse triplet pairs t = (e1, r, e2) and t' = (e2, r^{-1}, e1) with t in the training set and t' in the test set. This reduces a large part of the prediction task to memorization of affected triplet pairs. A simple baseline LinkFeat employing a linear classifier on top of sparse feature vectors of observed training relations was shown to outperform existing systems by a large margin. To address this issue, Toutanova and Chen proposed a reduced dataset FB15k-237 with all such inverse triplet pairs removed. We therefore choose FB15k-237 as our primary evaluation dataset. Since FB15k and WN18 are still widely used, we also include results on these datasets using the splits introduced by Bordes et al. (2013).

Dataset       WN18      FB15k     FB15k-237
Entities      40,943    14,951    14,541
Relations     18        1,345     237
Train edges   141,442   483,142   272,115
Val. edges    5,000     50,000    17,535
Test edges    5,000     59,071    20,466

Table 3: Number of entities and relation types along with the number of edges per split for the three datasets.

Baselines A common baseline for both experiments is direct optimization of DistMult (Yang et al. 2014). This factorization strategy is known to perform well on standard datasets, and furthermore corresponds to a version of our model with fixed entity embeddings in place of the R-GCN encoder as described in Section 4. As a second baseline, we add the simple neighbor-based LinkFeat algorithm proposed in Toutanova and Chen (2015). We further compare to ComplEx (Trouillon et al. 2016) and HolE (Nickel, Rosasco, and Poggio 2015), two state-of-the-art link prediction models for FB15k and WN18. ComplEx facilitates modeling of asymmetric relations by generalizing DistMult to the complex domain, while HolE replaces the vector-matrix product with circular correlation. Finally, we include comparisons with two classic algorithms – CP (Hitchcock 1927) and TransE (Bordes et al. 2013).

Results We provide results using two commonly used evaluation metrics: mean reciprocal rank (MRR) and Hits at n (H@n). Following Bordes et al. (2013), both metrics can be computed in a raw and a filtered setting. We report both filtered and raw MRR (with filtered MRR typically considered more reliable), and filtered Hits at 1, 3, and 10.

We evaluate hyperparameter choices on the respective validation splits. We found a normalization constant defined as c_{i,r} = c_i = \sum_r |N_i^r| – in other words, applied across relation types – to work best. For FB15k and WN18, we report results using basis decomposition (Eq. 3) with two basis functions, and a single encoding layer with 200-dimensional embeddings. For FB15k-237, we found block decomposition (Eq. 4) to perform best, using two layers with block dimension 5 × 5 and 500-dimensional embeddings. We regularize the encoder through edge dropout applied before normalization, with dropout rate 0.2 for self-loops and 0.4 for other edges. Using edge dropout makes our training objective similar to that of denoising autoencoders (Vincent et al. 2008). We apply l2 regularization to the decoder with a penalty of 0.01.

We use the Adam optimizer (Kingma and Ba 2014) with a learning rate of 0.01. For the baseline and the other factorizations, we found the parameters from Trouillon et al. (2016) – apart from the dimensionality on FB15k-237 – to work best, though to make the systems comparable we maintain the same number of negative samples (i.e. ω = 1). We use full-batch optimization for both the baselines and our model.
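For reference, here is a small sketch of how the raw and filtered ranking metrics described above are commonly computed; the data structures are illustrative assumptions, not the evaluation code used in the paper.

import numpy as np

def filtered_rank(scores, true_idx, known_idx):
    # Rank of the true entity among all candidates, with other known-true
    # completions removed from the ranking (the "filtered" setting).
    s = scores.astype(float)
    s[[i for i in known_idx if i != true_idx]] = -np.inf
    return int(1 + np.sum(s > s[true_idx]))

def mrr_and_hits(ranks, ns=(1, 3, 10)):
    # Mean reciprocal rank and Hits@n from the per-triple ranks.
    ranks = np.asarray(ranks, dtype=float)
    return float(np.mean(1.0 / ranks)), {n: float(np.mean(ranks <= n)) for n in ns}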
                     FB15k                                    WN18
           MRR               Hits @                 MRR               Hits @
Model      Raw     Filtered  1       3       10     Raw     Filtered  1       3       10
LinkFeat   --      0.779     --      --      0.804  --      0.938     --      --      0.939
DistMult   0.248   0.634     0.522   0.718   0.814  0.526   0.813     0.701   0.921   0.943
R-GCN      0.251   0.651     0.541   0.736   0.825  0.553   0.814     0.686   0.928   0.955
R-GCN+     0.262   0.696     0.601   0.760   0.842  0.561   0.819     0.697   0.929   0.964
CP*        0.152   0.326     0.219   0.376   0.532  0.075   0.058     0.049   0.080   0.125
TransE*    0.221   0.380     0.231   0.472   0.641  0.335   0.454     0.089   0.823   0.934
HolE**     0.232   0.524     0.402   0.613   0.739  0.616   0.938     0.930   0.945   0.949
ComplEx*   0.242   0.692     0.599   0.759   0.840  0.587   0.941     0.936   0.945   0.947

Table 4: Results on the Freebase and WordNet datasets. Results marked (*) taken from Trouillon et al. (2016). Results marked (**) taken from Nickel, Rosasco, and Poggio (2015). R-GCN+ denotes an ensemble between R-GCN and DistMult – see main text for details.

On FB15k, local context in the form of inverse relations is expected to dominate the performance of the factorizations, contrasting with the design of the R-GCN model. To better understand the difference, we plot in Figure 4 the FB15k performance of the best R-GCN model and the baseline (DistMult) as functions of the degree of nodes corresponding to entities in the considered triple (namely, the average of degrees for the subject and object entities). It can be seen that our model performs better for nodes with high degree where contextual information is abundant. The observation that the two models are complementary suggests combining the strengths of both into a single model, which we refer to as R-GCN+. On FB15k and WN18 where local and long-distance information can both provide strong solutions, we expect R-GCN+ to outperform each individual model. On FB15k-237 where local information is less salient, we do not expect the combination model to outperform a pure R-GCN model significantly. To test this, we evaluate an ensemble (R-GCN+) with a trained R-GCN model and a separately trained DistMult factorization model: f(s, r, t)_{R-GCN+} = α f(s, r, t)_{R-GCN} + (1 − α) f(s, r, t)_{DistMult}, with α = 0.4 selected on FB15k development data.
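The interpolation itself is a one-liner; the hypothetical helper below simply mirrors the formula above with the α = 0.4 chosen on FB15k development data.

def rgcn_plus_score(score_rgcn, score_distmult, alpha=0.4):
    # R-GCN+ ensemble: f(s, r, t)_{R-GCN+} = alpha * f_{R-GCN} + (1 - alpha) * f_{DistMult}
    return alpha * score_rgcn + (1 - alpha) * score_distmult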
In Table 4, we evaluate the R-GCN model and the combination model (R-GCN+) on FB15k and WN18. On the FB15k and WN18 datasets, R-GCN and R-GCN+ both outperform the DistMult baseline, but like all other systems underperform on these two datasets compared to the LinkFeat algorithm. The strong result from this baseline highlights the contribution of inverse relation pairs to high-performance solutions on these datasets. Interestingly, R-GCN+ yields better performance than ComplEx for FB15k, even though the R-GCN decoder (DistMult) does not explicitly model asymmetry in relations, as opposed to ComplEx. This suggests that combining the R-GCN encoder with the ComplEx scoring function (decoder) may be a promising direction for future work. The choice of scoring function is orthogonal to the choice of encoder; in principle, any scoring function or factorization model could be incorporated as a decoder in our auto-encoder framework.

Model      MRR Raw   MRR Filtered   Hits@1   Hits@3   Hits@10
LinkFeat   --        0.063          --       --       0.079
DistMult   0.100     0.191          0.106    0.207    0.376
R-GCN      0.158     0.248          0.153    0.258    0.414
R-GCN+     0.156     0.249          0.151    0.264    0.417
CP         0.080     0.182          0.101    0.197    0.357
TransE     0.144     0.233          0.147    0.263    0.398
HolE       0.124     0.222          0.133    0.253    0.391
ComplEx    0.109     0.201          0.112    0.213    0.388

Table 5: Results on FB15k-237, a reduced version of FB15k with problematic inverse relation pairs removed. CP, TransE, and ComplEx were evaluated using the code published for Trouillon et al. (2016), while HolE was evaluated using the code published for Nickel, Rosasco, and Poggio (2015).

In Table 5, we show results for FB15k-237 where (as previously discussed) inverse relation pairs have been removed and the LinkFeat baseline fails to generalize.⁴ Here, our R-GCN model outperforms the DistMult baseline by a large margin of 29.8%, highlighting the importance of a separate encoder model. As expected from our earlier analysis, R-GCN and R-GCN+ show similar performance on this dataset. The R-GCN model further compares favorably against other factorization methods, despite relying on a DistMult decoder which shows comparatively weak performance when used without an encoder.

⁴ Our numbers are not directly comparable to those reported in Toutanova and Chen (2015), as they use pruning both for training and testing (see their sections 3.3.1 and 4.2). Since their pruning schema is not fully specified (e.g., values of the relation-specific parameter t are not given) and the code is not available, it is not possible to replicate their set-up.
6 Related Work

6.1 Relational modeling

Our encoder-decoder approach to link prediction relies on DistMult (Yang et al. 2014) in the decoder, a special and simpler case of the RESCAL factorization (Nickel, Tresp, and Kriegel 2011), more effective than the original RESCAL in the context of multi-relational knowledge bases. Numerous alternative factorizations have been proposed and studied in the context of SRL, including both (bi-)linear and non-linear ones (e.g., (Bordes et al. 2013; Socher et al. 2013; Chang et al. 2014; Nickel, Rosasco, and Poggio 2015; Trouillon et al. 2016)). Many of these approaches can be regarded as modifications or special cases of classic tensor decomposition methods such as CP or Tucker; for a comprehensive overview of tensor decomposition literature we refer the reader to Kolda and Bader (2009).

Incorporation of paths between entities in knowledge bases has recently received considerable attention. We can roughly classify previous work into (1) methods creating auxiliary triples, which are then added to the learning objective of a factorization model (Guu, Miller, and Liang 2015; Garcia-Duran, Bordes, and Usunier 2015); (2) approaches using paths (or walks) as features when predicting edges (Lin et al. 2015); or (3) doing both at the same time (Neelakantan, Roth, and McCallum 2015; Toutanova et al. 2016). The first direction is largely orthogonal to ours, as we would also expect improvements from adding similar terms to our loss (in other words, extending our decoder). The second research line is more comparable; R-GCNs provide a computationally cheaper alternative to these path-based models. Direct comparison is somewhat complicated as path-based methods used different datasets (e.g., sub-sampled sets of walks from a knowledge base).

6.2 Neural networks on graphs

Our R-GCN encoder model is closely related to a number of works in the area of neural networks on graphs. It is primarily motivated as an adaption of previous work on GCNs (Bruna et al. 2014; Duvenaud et al. 2015; Defferrard, Bresson, and Vandergheynst 2016; Kipf and Welling 2017) for large-scale and highly multi-relational data, characteristic of realistic knowledge bases.

Early work in this area includes the graph neural network by Scarselli et al. (2009). A number of extensions to the original graph neural network have been proposed, most notably (Li et al. 2016) and (Pham et al. 2017), both of which utilize gating mechanisms to facilitate optimization.

R-GCNs can further be seen as a sub-class of message passing neural networks (Gilmer et al. 2017), which encompass a number of previous neural models for graphs, including GCNs, under a differentiable message passing interpretation.

7 Conclusions

We have introduced relational graph convolutional networks (R-GCNs) and demonstrated their effectiveness in the context of two standard statistical relation modeling problems: link prediction and entity classification. For the entity classification problem, we have demonstrated that the R-GCN model can act as a competitive, end-to-end trainable graph-based encoder. For link prediction, the R-GCN model with DistMult factorization as the decoding component outperformed direct optimization of the factorization model, and achieved competitive results on standard link prediction benchmarks. Enriching the factorization model with an R-GCN encoder proved especially valuable for the challenging FB15k-237 dataset, yielding a 29.8% improvement over the decoder-only baseline.

There are several ways in which our work could be extended. For example, the graph autoencoder model could be considered in combination with other factorization models, such as ComplEx (Trouillon et al. 2016), which can be better suited for modeling asymmetric relations. It is also straightforward to integrate entity features in R-GCNs, which would be beneficial both for link prediction and entity classification problems. To address the scalability of our method, it would be worthwhile to explore subsampling techniques, such as in Hamilton, Ying, and Leskovec (2017). Lastly, it would be promising to replace the current form of summation over neighboring nodes and relation types with a data-dependent attention mechanism. Beyond modeling knowledge bases, R-GCNs can be generalized to other applications where relation factorization models have been shown effective (e.g. relation extraction).

Acknowledgements

We would like to thank Diego Marcheggiani, Ethan Fetaya, and Christos Louizos for helpful discussions and comments. This project is supported by the European Research Council (ERC StG BroadSem 678254), the SAP Innovation Center Network and the Dutch National Science Foundation (NWO VIDI 639.022.518).

References

Bao, J.; Duan, N.; Zhou, M.; and Zhao, T. 2014. Knowledge-based question answering as machine translation. In ACL.

Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; and Yakhnenko, O. 2013. Translating embeddings for modeling multi-relational data. In NIPS.

Bordes, A.; Usunier, N.; Chopra, S.; and Weston, J. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.

Bruna, J.; Zaremba, W.; Szlam, A.; and LeCun, Y. 2014. Spectral networks and locally connected networks on graphs. In ICLR.

Chang, K.-W.; Yih, W.-t.; Yang, B.; and Meek, C. 2014. Typed tensor decomposition of knowledge bases for relation extraction. In EMNLP.

Dalton, J.; Dietz, L.; and Allan, J. 2014. Entity query feature expansion using knowledge base links. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, 365–374.
de Vries, G. K. D., and de Rooij, S. 2015. Substructure counting graph kernels for machine learning from RDF data. Web Semantics: Science, Services and Agents on the World Wide Web 35:71–84.

Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS.

Dong, L.; Wei, F.; Zhou, M.; and Xu, K. 2015. Question answering over Freebase with multi-column convolutional neural networks. In ACL.

Duvenaud, D. K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; and Adams, R. P. 2015. Convolutional networks on graphs for learning molecular fingerprints. In NIPS.

Garcia-Duran, A.; Bordes, A.; and Usunier, N. 2015. Composing relationships with translations. Technical report, CNRS, Heudiasyc.

Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212.

Guu, K.; Miller, J.; and Liang, P. 2015. Traversing knowledge graphs in vector space. arXiv preprint arXiv:1506.01094.

Hamilton, W. L.; Ying, R.; and Leskovec, J. 2017. Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216.

Hitchcock, F. L. 1927. The expression of a tensor or a polyadic as a sum of products. Studies in Applied Mathematics 6(1-4):164–189.

Hixon, B.; Clark, P.; and Hajishirzi, H. 2015. Learning knowledge graphs for question answering through conversational dialog. In Proceedings of NAACL HLT, 851–861.

Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kipf, T. N., and Welling, M. 2016. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.

Kipf, T. N., and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.

Kolda, T. G., and Bader, B. W. 2009. Tensor decompositions and applications. SIAM Review 51(3):455–500.

Kotov, A., and Zhai, C. 2012. Tapping into knowledge base for concept feedback: leveraging ConceptNet to improve search results for difficult queries. In WSDM.

Li, Y.; Tarlow, D.; Brockschmidt, M.; and Zemel, R. 2016. Gated graph sequence neural networks. In ICLR.

Lin, Y.; Liu, Z.; Luan, H.; Sun, M.; Rao, S.; and Liu, S. 2015. Modeling relation paths for representation learning of knowledge bases. arXiv preprint arXiv:1506.00379.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

Neelakantan, A.; Roth, B.; and McCallum, A. 2015. Compositional vector space models for knowledge base completion. arXiv preprint arXiv:1504.06662.

Nickel, M.; Rosasco, L.; and Poggio, T. 2015. Holographic embeddings of knowledge graphs. arXiv preprint arXiv:1510.04935.

Nickel, M.; Tresp, V.; and Kriegel, H.-P. 2011. A three-way model for collective learning on multi-relational data. In ICML.

Paulheim, H., and Fürnkranz, J. 2012. Unsupervised generation of data mining features from linked open data. In Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics, 31.

Pham, T.; Tran, T.; Phung, D.; and Venkatesh, S. 2017. Column networks for collective classification. In AAAI.

Ristoski, P., and Paulheim, H. 2016. RDF2Vec: RDF graph embeddings for data mining. In International Semantic Web Conference, 498–514. Springer.

Ristoski, P.; de Vries, G. K. D.; and Paulheim, H. 2016. A collection of benchmark datasets for systematic evaluations of machine learning on the semantic web. In International Semantic Web Conference, 186–194. Springer.

Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2009. The graph neural network model. IEEE Transactions on Neural Networks 20(1):61–80.

Seyler, D.; Yahya, M.; and Berberich, K. 2015. Generating quiz questions from knowledge graphs. In Proceedings of the 24th International Conference on World Wide Web.

Shervashidze, N.; Schweitzer, P.; van Leeuwen, E. J.; Mehlhorn, K.; and Borgwardt, K. M. 2011. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12(Sep):2539–2561.

Socher, R.; Chen, D.; Manning, C. D.; and Ng, A. 2013. Reasoning with neural tensor networks for knowledge base completion. In NIPS.

Toutanova, K., and Chen, D. 2015. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, 57–66.

Toutanova, K.; Lin, V.; Yih, W.-t.; Poon, H.; and Quirk, C. 2016. Compositional learning of embeddings for relation paths in knowledge base and text. In ACL.

Trouillon, T.; Welbl, J.; Riedel, S.; Gaussier, E.; and Bouchard, G. 2016. Complex embeddings for simple link prediction. In ICML.

Vincent, P.; Larochelle, H.; Bengio, Y.; and Manzagol, P.-A. 2008. Extracting and composing robust features with denoising autoencoders. In ICML.

Xiong, C., and Callan, J. 2015a. EsdRank: Connecting query and documents through external semi-structured data. In CIKM.

Xiong, C., and Callan, J. 2015b. Query expansion with Freebase. In Proceedings of the 2015 International Conference on The Theory of Information Retrieval, 111–120.

Yang, B.; Yih, W.-t.; He, X.; Gao, J.; and Deng, L. 2014. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575.

Yao, X., and Van Durme, B. 2014. Information extraction over structured data: Question answering with Freebase. In ACL.
Further experimental details on entity classification

For the entity classification benchmarks described in our paper, the evaluation process differs subtly between publications. To
eliminate these differences, we repeated the baselines in a uniform manner, using the canonical test/train split from (Ristoski,
de Vries, and Paulheim 2016). We performed hyperparameter optimization on only the training set, running a single evaluation
on the test set after hyperparameters were chosen for each baseline. This explains why the numbers we report differ slightly
from those in the original publications (where cross-validation accuracy was reported).
For WL, we use the tree variant of the Weisfeiler-Lehman subtree kernel from the Mustard library.5 For RDF2Vec, we use
an implementation provided by the authors of (Ristoski and Paulheim 2016) which builds on Mustard. In both cases, we extract
explicit feature vectors for the instance nodes, which are classified by a linear SVM.
For the MUTAG task, our preprocessing differs from that used in (de Vries and de Rooij 2015; Ristoski and Paulheim 2016)
where for a given target relation (s, r, o) all triples connecting s to o are removed. Since o is a boolean value in the MUTAG
data, one can infer the label after processing from other boolean relations that are still present. This issue is now mentioned in
the Mustard documentation. In our preprocessing, we remove only the specific triples encoding the target relation.
Hyperparameters for baselines are chosen according to the best model performance in (Ristoski and Paulheim 2016),
i.e. WL: 2 (tree depth), 3 (number of iterations); RDF2Vec: 2 (WL tree depth), 4 (WL iterations), 500 (embedding size), 5
(window size), 10 (SkipGram iterations), 25 (number of negative samples). We optimize the SVM regularization constant
C ∈ {0.001, 0.01, 0.1, 1, 10, 100, 1000} based on performance on an 80/20 train/validation split (of the original training set).
For R-GCN, we choose an l2 penalty on first-layer weights C_l2 ∈ {0, 5 · 10^-4} and the number of basis functions B ∈ {0, 10, 20, 30, 40} based on validation set performance, where B = 0 refers to using no basis function decomposition. Using
the block decomposition did not improve results. Otherwise, hyperparameters are chosen as follows: 50 (number of epochs), 16
(number of hidden units), and ci,r = |Nir | (normalization constant). We do not use dropout. For AM, we use a reduced number
of 10 hidden units for R-GCN to reduce the memory footprint.
Results with standard error (omitted in main paper due to spatial constraints) are summarized in Table 7. All entity classifi-
cation experiments were run on CPU nodes with 64GB of memory.

R-GCN setting       AIFB   MUTAG      BGS        AM

l2 penalty          0      5 · 10^-4  5 · 10^-4  5 · 10^-4
# basis functions   0      30         40         40
# hidden units      16     16         16         10

Table 6: Best hyperparameter choices based on validation set performance for 2-layer R-GCN model.

Model AIFB MUTAG BGS AM


Feat 55.55 ± 0.00 77.94 ± 0.00 72.41 ± 0.00 66.66 ± 0.00
WL 80.55 ± 0.00 80.88 ± 0.00 86.20 ± 0.00 87.37 ± 0.00
RDF2Vec 88.88 ± 0.00 67.20 ± 1.24 87.24 ± 0.89 88.33 ± 0.61
R-GCN (Ours) 95.83 ± 0.62 73.23 ± 0.48 83.10 ± 0.80 89.29 ± 0.35

Table 7: Entity classification results in accuracy (average and standard error over 10 runs) for a feature-based baseline (see
main text for details), WL (Shervashidze et al. 2011; de Vries and de Rooij 2015), RDF2Vec (Ristoski and Paulheim 2016),
and R-GCN (this work). Test performance is reported on the train/test set splits provided by (Ristoski, de Vries, and Paulheim
2016).

5 https://github.com/Data2Semantics/mustard
