10 Graph Neural Networks: Link Prediction
Muhan Zhang
10.1 Introduction
Link prediction is the problem of predicting the existence of a link between two
nodes in a network (Liben-Nowell and Kleinberg, 2007). Given the ubiquitous ex-
istence of networks, it has many applications such as friend recommendation in
social networks (Adamic and Adar, 2003), co-authorship prediction in citation net-
works (Shibata et al, 2012), movie recommendation in Netflix (Bennett et al, 2007),
protein interaction prediction in biological networks (Qi et al, 2006), drug response
prediction (Stanfield et al, 2017), metabolic network reconstruction (Oyetunde et al,
2017), hidden terrorist group identification (Al Hasan and Zaki, 2011), knowledge
graph completion (Nickel et al, 2016a), etc.
Link prediction has many names in different application domains. The term “link
prediction” often refers to predicting links in homogeneous graphs, where both nodes
and links have only a single type. This is the simplest setting, and most link predic-
tion works focus on this setting. Link prediction in bipartite user-item networks is
referred to as matrix completion or recommender systems, where nodes have two
types (user and item) and links can have multiple types corresponding to different
ratings users can give to items. Link prediction in knowledge graphs is often re-
ferred to as knowledge graph completion, where each node is a distinct entity and
links have multiple types corresponding to different relations between entities. In
most cases, a link prediction algorithm designed for the homogeneous graph setting
can be easily generalized to heterogeneous graphs (e.g., bipartite graphs and knowl-
edge graphs) by considering heterogeneous node type and relation type information.
There are mainly three types of traditional link prediction methods: heuris-
tic methods, latent-feature methods, and content-based methods. Heuristic meth-
ods compute heuristic node similarity scores as the likelihood of links (Liben-
Nowell and Kleinberg, 2007). Popular ones include common neighbors (Liben-
Nowell and Kleinberg, 2007), Adamic-Adar (Adamic and Adar, 2003), preferen-
tial attachment (Barabási and Albert, 1999), and Katz index (Katz, 1953). Latent-
feature methods factorize the matrix representations of a network to learn low-
dimensional latent representations/embeddings of nodes. Popular network embed-
ding techniques such as DeepWalk (Perozzi et al, 2014), LINE (Tang et al, 2015b)
and node2vec (Grover and Leskovec, 2016), are also latent-feature methods because
they implicitly factorize some matrix representations of networks too (Qiu et al,
2018). Both heuristic methods and latent-feature methods infer future/missing links
leveraging the existing network topology. Content-based methods, on the contrary,
leverage explicit node attributes/features rather than the graph structure (Lops et al,
2011). It is shown that combining the graph topology with explicit node features
can improve the link prediction performance (Zhao et al, 2017).
By learning from graph topology and node/edge features in a unified way, graph
neural networks (GNNs) have recently shown superior link prediction performance
to traditional methods (Kipf and Welling, 2016; Zhang and Chen, 2018b; You et al,
2019; Chami et al, 2019; Li et al, 2020e). There are two popular GNN-based link
prediction paradigms: node-based and subgraph-based. Node-based methods first
learn a node representation through a GNN, and then aggregate the pairwise node
representations as link representations for link prediction. An example is (Varia-
tional) Graph AutoEncoder (Kipf and Welling, 2016). Subgraph-based methods first
extract a local subgraph around each target link, and then apply a graph-level GNN
(with pooling) to each subgraph to learn a subgraph representation, which is used as
the target link representation for link prediction. An example is SEAL (Zhang and
Chen, 2018b). We introduce these two types of methods separately in Section 10.3.1
and 10.3.2, and discuss their expressive power differences in Section 10.3.3.
To understand GNNs’ power for link prediction, several theoretical efforts have
been made. The γ-decaying theory (Zhang and Chen, 2018b) unifies existing link
prediction heuristics into a single framework and proves their local approximability,
which justifies using GNNs to “learn” heuristics from the graph structure instead of
using predefined ones. The theoretical analysis of labeling trick (Zhang et al, 2020c)
proves that subgraph-based approaches have a higher link representation power than
node-based approaches by being able to learn most expressive structural representa-
tions of links (Srinivasan and Ribeiro, 2020b) where node-based approaches always
fail. We introduce these theories in Section 10.4.
Finally, by analyzing the limitations of existing methods, we provide several future
directions for GNN-based link prediction in Section 10.5.
10.2 Traditional Link Prediction Methods
In this section, we review traditional link prediction methods. They can be cate-
gorized into three classes: heuristic methods, latent-feature methods, and content-
based methods.
10.2.1 Heuristic Methods
Heuristic methods use simple yet effective node similarity scores as the likelihood
of links (Liben-Nowell and Kleinberg, 2007; Lü and Zhou, 2011). We use x and y
to denote the source and target node between which to predict a link. We use $\Gamma(x)$
to denote the set of x's neighbors.
The simplest heuristic is common neighbors (CN), which counts the number of
neighbors two nodes share as a measurement of their likelihood of having a link:
$f_{\mathrm{CN}}(x, y) = |\Gamma(x) \cap \Gamma(y)|$. (10.1)
The Jaccard index normalizes CN by the total number of distinct neighbors of x and y:
$f_{\mathrm{Jaccard}}(x, y) = \frac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|}$. (10.2)
There is also the famous preferential attachment (PA) heuristic (Barabási and Al-
bert, 1999), which uses the product of node degrees to measure the link likelihood:
$f_{\mathrm{PA}}(x, y) = |\Gamma(x)| \cdot |\Gamma(y)|$.
Fig. 10.1: Illustration of three link prediction heuristics: CN, PA and AA.
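To make these first-order heuristics concrete, below is a minimal Python sketch (illustrative only; networkx is used just to supply an example graph, the function names are ours, and AA denotes Adamic-Adar, which down-weights each common neighbor by the log of its degree):

```python
import math
import networkx as nx

def common_neighbors(G, x, y):
    # f_CN(x, y) = |Gamma(x) ∩ Gamma(y)|
    return len(set(G[x]) & set(G[y]))

def jaccard(G, x, y):
    # f_Jaccard(x, y) = |Gamma(x) ∩ Gamma(y)| / |Gamma(x) ∪ Gamma(y)|
    union = set(G[x]) | set(G[y])
    return len(set(G[x]) & set(G[y])) / len(union) if union else 0.0

def preferential_attachment(G, x, y):
    # f_PA(x, y) = |Gamma(x)| * |Gamma(y)|
    return G.degree(x) * G.degree(y)

def adamic_adar(G, x, y):
    # AA down-weights high-degree common neighbors by 1 / log|Gamma(z)|
    return sum(1.0 / math.log(G.degree(z))
               for z in set(G[x]) & set(G[y]) if G.degree(z) > 1)

G = nx.karate_club_graph()
print(common_neighbors(G, 0, 33), jaccard(G, 0, 33),
      preferential_attachment(G, 0, 33), adamic_adar(G, 0, 33))
```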
There are also high-order heuristics which require knowing the entire network.
Examples include Katz index (Katz, 1953), rooted PageRank (RPR) (Brin and Page,
2012), and SimRank (SR) (Jeh and Widom, 2002).
Katz index uses a weighted sum of all the walks between x and y where a longer
walk is discounted more:
$f_{\mathrm{Katz}}(x, y) = \sum_{l=1}^{\infty} \beta^l \, |\mathrm{walks}^{\langle l \rangle}(x, y)|$. (10.6)
Here β is a decaying factor between 0 and 1, and $|\mathrm{walks}^{\langle l \rangle}(x, y)|$ counts the length-l
walks between x and y. When we only consider length-2 walks, Katz index reduces
to CN.
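Although the Katz index is defined as an infinite sum, in practice it can be estimated by truncating the sum at a maximum walk length; a small numpy sketch (the truncation length and variable names are our own illustrative choices):

```python
import numpy as np
import networkx as nx

def katz_scores(G, beta=5e-4, max_len=5):
    # Truncated Katz (Eq. 10.6): sum_{l=1}^{max_len} beta^l * A^l,
    # where [A^l]_{x,y} counts the length-l walks between x and y.
    A = nx.to_numpy_array(G)
    scores = np.zeros_like(A)
    Al = np.eye(A.shape[0])
    for l in range(1, max_len + 1):
        Al = Al @ A                    # A^l
        scores += (beta ** l) * Al
    return scores

G = nx.karate_club_graph()
print(katz_scores(G)[0, 33])           # Katz score for the pair (0, 33)
```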
Rooted PageRank (RPR) is a generalization of PageRank. It first computes the
stationary distribution $\pi_x$ of a random walker starting from x who randomly moves
to one of its current neighbors with probability α, or returns to x with probability
1 − α. Then it uses $\pi_x$ at node y (denoted by $[\pi_x]_y$) to predict link (x, y). When the
network is undirected, a symmetric version of rooted PageRank uses $[\pi_x]_y + [\pi_y]_x$.
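A simple way to compute $\pi_x$ is power iteration on the transition matrix; the sketch below follows the description above, with α the probability of continuing the walk (the hyperparameter values are illustrative):

```python
import numpy as np
import networkx as nx

def rooted_pagerank(G, x, alpha=0.85, num_iter=100):
    # Iterate pi_x <- alpha * P @ pi_x + (1 - alpha) * e_x,
    # where column j of P spreads probability 1/|Gamma(v_j)| to j's neighbors.
    nodes = list(G.nodes())
    idx = {v: i for i, v in enumerate(nodes)}
    A = nx.to_numpy_array(G, nodelist=nodes)
    P = A / A.sum(axis=0, keepdims=True)
    e_x = np.zeros(len(nodes))
    e_x[idx[x]] = 1.0
    pi = e_x.copy()
    for _ in range(num_iter):
        pi = alpha * P @ pi + (1 - alpha) * e_x
    return {v: pi[idx[v]] for v in nodes}

G = nx.karate_club_graph()
# symmetric rooted PageRank score for (x, y): [pi_x]_y + [pi_y]_x
print(rooted_pagerank(G, 0)[33] + rooted_pagerank(G, 33)[0])
```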
10.2.1.3 Summarization
We summarize the eight introduced heuristics in Table 10.1. For more variants of the
above heuristics, please refer to (Liben-Nowell and Kleinberg, 2007; Lü and Zhou,
2011). Heuristic methods can be regarded as computing predefined graph structure
features located in the observed node and edge structures of the network. Although
effective in many domains, these handcrafted graph structure features have limited
expressivity—they only capture a small subset of all possible structure patterns, and
cannot express general graph structure features underlying different networks. Be-
sides, heuristic methods only work well when the network formation mechanism
aligns with the heuristic. There may exist networks with complex formation mech-
anisms which no existing heuristics can capture well. Most heuristics only work for
homogeneous graphs.
Notes to Table 10.1: $\Gamma(x)$ denotes the neighbor set of vertex x. β < 1 is a damping factor. $|\mathrm{walks}^{\langle l \rangle}(x, y)|$ counts
the number of length-l walks between x and y. $[\pi_x]_y$ is the stationary distribution probability of y
under the random walk from x with restart, see (Brin and Page, 2012). SimRank score uses a
recursive definition.
10.2.2 Latent-Feature Methods
The second class of traditional link prediction methods is called latent-feature meth-
ods. In some literature, they are also called latent-factor models or embedding meth-
ods. Latent-feature methods compute latent properties or representations of nodes,
often obtained by factorizing a specific matrix derived from the network, such as the
adjacency matrix and the Laplacian matrix. These latent features of nodes are not
explicitly observable—they must be computed from the network through optimiza-
tion. Latent features are also not interpretable. That is, unlike explicit node features
where each feature dimension represents a specific property of nodes, we do not
know what each latent feature dimension describes.
10.2.2.1 Matrix Factorization
One of the most popular latent-feature methods is matrix factorization (Koren et al, 2009;
Ahmed et al, 2013), which originated in the recommender systems literature.
Matrix factorization factorizes the observed adjacency matrix A of the network into
the product of a low-rank latent-embedding matrix Z and its transpose. That is, it
approximately reconstructs the edge between i and j using their k-dimensional latent
embeddings zi and z j :
$\hat{A}_{i,j} = z_i^{\top} z_j$, (10.9)
It then minimizes the mean-squared error between the reconstructed adjacency ma-
trix and the true adjacency matrix over the observed edges to learn the latent em-
beddings:
$\mathcal{L} = \frac{1}{|E|} \sum_{(i,j) \in E} (A_{i,j} - \hat{A}_{i,j})^2$. (10.10)
Finally, we can predict new links by the inner product between nodes’ latent em-
beddings. Variants of matrix factorization include using powers of A (Cangea et al,
2018) and using general node similarity matrices (Ou et al, 2016) to replace the
original adjacency matrix A. If we replace A with the Laplacian matrix L and define
the loss as the sum of squared distances between the embeddings of connected nodes,
then the nontrivial solutions are given by the eigenvectors corresponding to the k
smallest nonzero eigenvalues of L, which recovers the Laplacian eigenmap technique
(Belkin and Niyogi, 2002) and the solution to spectral clustering (von Luxburg, 2007).
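As an illustration of Equations (10.9) and (10.10), a plain stochastic-gradient-descent sketch for learning the latent embeddings is given below (the learning rate, regularization, and dimensionality are arbitrary illustrative choices, not those used in the cited works):

```python
import numpy as np

def matrix_factorization(A, k=16, lr=0.05, reg=1e-4, epochs=200, seed=0):
    # Learn Z (n x k) so that z_i^T z_j approximates A_{i,j} over observed edges.
    rng = np.random.default_rng(seed)
    Z = 0.1 * rng.standard_normal((A.shape[0], k))
    edges = np.argwhere(A > 0)
    for _ in range(epochs):
        for i, j in edges:
            err = A[i, j] - Z[i] @ Z[j]          # reconstruction error on edge (i, j)
            Z[i] += lr * (err * Z[j] - reg * Z[i])
            Z[j] += lr * (err * Z[i] - reg * Z[j])
    return Z

# After training, the score of a candidate link (x, y) is simply Z[x] @ Z[y].
```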
10.2.2.2 Network Embedding
Network embedding methods have gained great popularity in recent years since
the pioneering work DeepWalk (Perozzi et al, 2014). These methods learn low-
dimensional representations (embeddings) for nodes, often based on training a skip-
gram model (Mikolov et al, 2013a) over random-walk-generated node sequences,
so that nodes which often appear nearby each other in a random walk (i.e., nodes
close in the network) will have similar representations. Then, the pairwise node
embeddings are aggregated as link representations for link prediction. Although
not explicitly factorizing a matrix, it is shown in (Qiu et al, 2018) that many net-
work embedding methods, including LINE (Tang et al, 2015b), DeepWalk, and
node2vec (Grover and Leskovec, 2016), implicitly factorize some matrix representa-
tions of the network. Thus, they can also be categorized into latent-feature methods.
For example, DeepWalk approximately factorizes:
⇣ 1 w ⌘
log vol(G ) Â
w r=1
(D 1 A)r D 1 log(b), (10.12)
where vol(G ) is the sum of node degrees, D is the diagonal degree matrix, w is
skip-gram’s window size, and b is a constant. As we can see, DeepWalk essentially
factorizes the log of some high-order normalized adjacency matrices’ sum (up to
w). To intuitively understand this, we can think of the random walk as extending a
node’s neighborhood to w hops away, so that we not only require direct neighbors to
have similar embeddings, but also require nodes reachable from each other through
w steps of random walk to have similar embeddings.
Similarly, the LINE algorithm (Tang et al, 2015b) in its second-order form im-
plicitly factorizes $\log\!\left( \mathrm{vol}(\mathcal{G}) \, D^{-1} A D^{-1} \right) - \log(b)$.
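To illustrate Equation (10.12), the matrix that DeepWalk implicitly factorizes can be computed explicitly for a small graph (a direct numpy transcription of the formula; the clipping constant is our own safeguard against log(0)):

```python
import numpy as np
import networkx as nx

def deepwalk_matrix(G, window=10, b=1.0):
    # log( vol(G) * (1/w) * sum_{r=1}^{w} (D^{-1} A)^r * D^{-1} ) - log(b)
    A = nx.to_numpy_array(G)
    deg = A.sum(axis=1)
    vol = deg.sum()
    D_inv = np.diag(1.0 / deg)
    P = D_inv @ A                          # random-walk transition matrix
    S, Pr = np.zeros_like(A), np.eye(A.shape[0])
    for _ in range(window):
        Pr = Pr @ P                        # (D^{-1} A)^r
        S += Pr
    M = vol / window * S @ D_inv
    return np.log(np.maximum(M, 1e-12)) - np.log(b)

M = deepwalk_matrix(nx.karate_club_graph())
# node embeddings can then be obtained from, e.g., a truncated SVD of M
```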
10.2.2.3 Summarization
Both heuristic methods and latent-feature methods face the cold-start problem. That
is, when a new node joins the network, heuristic methods and latent-feature meth-
ods may not be able to predict its links accurately because it has no or only a few
existing links with other nodes. In this case, content-based methods might help.
10.2.3 Content-Based Methods
Content-based methods leverage explicit content features associated with nodes for
link prediction, which have wide applications in recommender systems (Lops et al,
2011). For example, in citation networks, word distributions can be used as content
features for papers. In social networks, a user’s profile, such as their demographic in-
formation and interests, can be used as their content features (however, their friend-
ship information belongs to graph structure features because it is calculated from the
graph structure). However, content-based methods usually have worse performance
than heuristic and latent-feature methods due to not leveraging the graph structure.
Thus, they are usually used together with the other two types of methods (Koren,
2008; Rendle, 2010; Zhao et al, 2017) to enhance the link prediction performance.
10.3 GNN Methods for Link Prediction
In the last section, we have covered three types of traditional link prediction meth-
ods. In this section, we will talk about GNN methods for link prediction. GNN
methods combine graph structure features and content features by learning them to-
gether in a unified way, leveraging the excellent graph representation learning ability
of GNNs.
There are mainly two GNN-based link prediction paradigms, node-based and
subgraph-based. Node-based methods aggregate the pairwise node representations
learned by a GNN as the link representation. Subgraph-based methods extract a
local subgraph around each link and use the subgraph representation learned by a
GNN as the link representation.
10.3.1 Node-Based Methods
The most straightforward way of using GNNs for link prediction is to treat GNNs
as inductive network embedding methods which learn node embeddings from lo-
cal neighborhood, and then aggregate the pairwise node embeddings of GNNs to
construct link representations. We call these methods node-based methods.
A representative node-based method is the Graph AutoEncoder (GAE) (Kipf and
Welling, 2016), which uses a GCN encoder and an inner-product decoder to recon-
struct links:
$\hat{A}_{i,j} = \sigma(z_i^{\top} z_j)$, where $z_i = Z_{i,:}$, $Z = \mathrm{GCN}(X, A)$, (10.14)
where Z is the node representation (embedding) matrix output by the GCN with
the ith row of Z being node i's representation $z_i$, $\hat{A}_{i,j}$ is the predicted probability for
link (i, j) and $\sigma$ is the sigmoid function. If X is not given, GAE can use the one-
hot encoding matrix I instead. The model is trained to minimize the cross entropy
between the reconstructed adjacency matrix and the true adjacency matrix:
$\mathcal{L} = \sum_{i,j} \left( -A_{i,j} \log \hat{A}_{i,j} - (1 - A_{i,j}) \log (1 - \hat{A}_{i,j}) \right)$. (10.15)
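A minimal numpy sketch of the forward pass in Equation (10.14) is given below (the two-layer GCN with symmetric normalization is the usual choice, not a verbatim reproduction of the original implementation; the weights W1 and W2 would be trained with the loss above):

```python
import numpy as np

def gae_forward(A, X, W1, W2):
    # GCN encoder (two propagation steps) + inner-product decoder.
    A_tilde = A + np.eye(A.shape[0])                 # add self-loops
    d = A_tilde.sum(axis=1)
    A_hat = A_tilde / np.sqrt(np.outer(d, d))        # D^{-1/2} (A+I) D^{-1/2}
    H = np.maximum(A_hat @ X @ W1, 0.0)              # first GCN layer (ReLU)
    Z = A_hat @ H @ W2                               # second GCN layer (linear)
    A_rec = 1.0 / (1.0 + np.exp(-Z @ Z.T))           # \hat{A}_{ij} = sigma(z_i^T z_j)
    return Z, A_rec
```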
The Variational Graph AutoEncoder (VGAE) (Kipf and Welling, 2016) is the prob-
abilistic counterpart of GAE, which assumes a likelihood p(A|Z) for the adjacency
matrix given the node embeddings and a prior p(Z) over the embeddings. Given
p(A|Z) and p(Z), we may compute the posterior distribution of Z using Bayes'
rule. However, this distribution is often intractable. Thus, given the adjacency
matrix A and node feature matrix X, VGAE uses graph neural networks to
approximate the posterior distribution of the node embedding matrix Z:
$q(Z|X, A) = \prod_{i \in V} q(z_i|X, A)$, where $q(z_i|X, A) = \mathcal{N}(z_i \mid \mu_i, \mathrm{diag}(\sigma_i^2))$. (10.18)
Here, the mean $\mu_i$ and variance $\sigma_i^2$ of $z_i$ are given by two GCNs. Then, VGAE
maximizes the evidence lower bound to learn the GCN parameters:
$\mathcal{L} = \mathbb{E}_{q(Z|X,A)}[\log p(A|Z)] - \mathrm{KL}\big(q(Z|X, A) \,\|\, p(Z)\big)$, (10.19)
where KL denotes the Kullback-Leibler divergence.
There are many variants of GAE and VGAE. For example, ARGE (Pan et al, 2018)
enhances GAE with an adversarial regularization to regularize the node embeddings
to follow a prior distribution. S-VAE (Davidson et al, 2018) replaces the Normal
distribution in VGAE with a von Mises-Fisher distribution to model data with a hy-
perspherical latent structure. MGAE (Wang et al, 2017a) uses a marginalized graph
autoencoder to reconstruct node features from corrupted ones through a GCN and
applies it to graph clustering.
GAE represents a general class of node-based methods, where a GNN is first used
to learn node embeddings and pairwise node embeddings are aggregated to learn
link representations. In principle, we can replace the GCN used in GAE/VGAE with
any GNN, and replace the inner product $z_i^{\top} z_j$ with any aggregation function over
$\{z_i, z_j\}$ and feed the aggregated link representation to an MLP to predict the link
(i, j). Following this methodology, we can generalize any GNN designed for learn-
ing node representations to link prediction. For example, HGCN (Chami et al, 2019)
combines hyperbolic graph convolutional neural networks with a Fermi-Dirac de-
coder for aggregating pairwise node embeddings and outputting link probabilities:
$p\big((i, j) \in E \mid z_i, z_j\big) = \left[ e^{(d(z_i, z_j)^2 - r)/t} + 1 \right]^{-1}$, (10.20)
where d(·, ·) computes the hyperbolic distance and r, t are hyperparameters.
Position-aware GNN (PGNN) (You et al, 2019) aggregates messages only from
some selected anchor nodes during the message passing to capture position informa-
tion of nodes. Then, the inner product between node embeddings are used to predict
links. The PGNN paper also generalizes other GNNs, including GAT (Petar et al,
2018), GIN (Xu et al, 2019d) and GraphSAGE (Hamilton et al, 2017b), to the link
prediction setting based on the inner-product decoder.
Many graph neural networks use link prediction as an objective for training node
embeddings in an unsupervised manner, despite that their final task is still node clas-
sification. For example, after computing the node embeddings, GraphSAGE (Hamil-
ton et al, 2017b) minimizes the following objective for each $z_i$ to encourage con-
nected (or nearby) nodes to have similar embeddings:
$\mathcal{J}(z_i) = -\log\big(\sigma(z_i^{\top} z_j)\big) - k_n \cdot \mathbb{E}_{j' \sim p_n}\big[\log\big(\sigma(-z_i^{\top} z_{j'})\big)\big]$, (10.21)
where j is a node that co-occurs near i on some fixed-length random walk, $p_n$ is the
negative sampling distribution, and $k_n$ is the number of negative samples. If we focus
on length-2 random walks, the above loss reduces to a link prediction objective. Com-
pared to the GAE loss in Equation (10.15), the above objective does not consider all
O(n) negative links, but uses negative sampling instead to only consider $k_n$ negative
pairs (i, j′) for each positive pair (i, j), and is thus more suitable for large graphs.
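A PyTorch sketch of this negative-sampling objective for one positive pair and its drawn negatives (the expectation over $p_n$ is replaced by a sum over the sampled nodes; variable names are ours):

```python
import torch
import torch.nn.functional as F

def unsupervised_loss(z_i, z_j, z_neg):
    # z_i, z_j: embeddings of a positive pair; z_neg: (k_n, d) negative samples.
    pos = F.logsigmoid(torch.dot(z_i, z_j))        # log sigma(z_i . z_j)
    neg = F.logsigmoid(-(z_neg @ z_i)).sum()       # negative-sampling term
    return -(pos + neg)
```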
In the context of recommender systems, there are also many node-based meth-
ods that can be seen as variants of GAE/VGAE. Monti et al (2017) use GNNs to
learn user and item embeddings from their respective nearest-neighbor networks,
and use the inner product between user and item embeddings to predict links. Berg
et al (2017) propose the graph convolutional matrix completion (GC-MC) model
which applies a GNN to the user-item bipartite graph to learn user and item embed-
dings. They use one-hot encoding of node indices as the input node features, and
use the bilinear product between user and item embeddings to predict links. Spec-
tralCF (Zheng et al, 2018a) uses a spectral-GNN on the bipartite graph to learn node
embeddings. The PinSage model (Ying et al, 2018b) uses node content features as
the input node features, and uses the GraphSAGE (Hamilton et al, 2017b) model to
map related items to similar embeddings.
In the context of knowledge graph completion, R-GCN (Relational Graph Con-
volutional Neural Network) (Schlichtkrull et al, 2018) is one representative node-
based method, which considers the relation types by giving different weights to
different relation types during the message passing. SACN (Structure-Aware Con-
volutional Network) (Shang et al, 2019) performs message passing for each relation
type’s induced subgraphs individually and then uses a weighted sum of node em-
beddings from different relation types.
10.3.2 Subgraph-Based Methods
Subgraph-based methods extract a local subgraph around each target link and learn
a subgraph representation through a GNN for link prediction.
Fig. 10.2: Illustration of the SEAL framework. SEAL first extracts enclosing sub-
graphs around target links to predict. It then applies a node labeling to the enclosing
subgraphs to differentiate nodes of different roles within a subgraph. Finally, the
labeled subgraphs are fed into a GNN to learn graph structure features (supervised
heuristics) for link prediction.
SEAL (Zhang and Chen, 2018b) is the representative subgraph-based method. For
each target link (x, y), SEAL first extracts an h-hop enclosing subgraph around x and
y, and then applies a node labeling scheme called Double-Radius Node Labeling
(DRNL) to the subgraph. DRNL works as follows: First, assign label 1 to x and y. Then, for any node i with
radius (d(i, x), d(i, y)) = (1, 1), assign label 2. Nodes with radius (1, 2) or (2, 1) get
label 3. Nodes with radius (1, 3) or (3, 1) get 4. Nodes with (2, 2) get 5. Nodes with
(1, 4) or (4, 1) get 6. Nodes with (2, 3) or (3, 2) get 7. So on and so forth. In other
words, DRNL iteratively assigns larger labels to nodes with a larger radius w.r.t. the
two center nodes.
DRNL satisfies the following criteria: 1) The two target nodes x and y always
have the distinct label “1” so that they can be distinguished from the context nodes.
2) Nodes i and j have the same label if and only if their “double radius” is the
same, i.e., i and j have the same distances to (x, y). This way, nodes of the same rel-
ative positions within the subgraph (described by the double radius (d(i, x), d(i, y)))
always have the same label.
DRNL has a closed-form solution for directly mapping (d(i, x), d(i, y)) to labels:
$f_l(i) = 1 + \min(d_x, d_y) + (d/2)\big[(d/2) + (d\%2) - 1\big]$, (10.22)
where $d_x := d(i, x)$, $d_y := d(i, y)$, $d := d_x + d_y$, and (d/2) and (d%2) are the integer
quotient and remainder of d divided by 2, respectively. For nodes with $d(i, x) = \infty$
or $d(i, y) = \infty$, DRNL gives them a null label 0.
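The closed-form labeling above is straightforward to implement; a small sketch (None encodes an infinite distance):

```python
def drnl_label(dx, dy):
    # Map the double radius (d(i, x), d(i, y)) to a DRNL label.
    if dx == 0 or dy == 0:            # the target nodes x and y themselves
        return 1
    if dx is None or dy is None:      # unreachable from one of the targets
        return 0
    d = dx + dy
    return 1 + min(dx, dy) + (d // 2) * (d // 2 + d % 2 - 1)

# e.g. (1,1) -> 2, (1,2) -> 3, (1,3) -> 4, (2,2) -> 5, (1,4) -> 6, (2,3) -> 7
```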
After getting the DRNL labels, SEAL transforms them into one-hot encoding
vectors, or feeds them to an embedding layer to get their embeddings. These new
feature vectors are concatenated with the original node content features (if any) to
form the new node features. SEAL additionally allows concatenating some pre-
trained node embeddings such as node2vec embeddings to node features. How-
ever, as its experimental results show, adding pretrained node embeddings does not
show clear benefits to the final performance (Zhang and Chen, 2018b). Furthermore,
adding pretrained node embeddings makes SEAL lose the inductive learning ability.
Finally, SEAL feeds these enclosing subgraphs as well as their new node feature
vectors into a graph-level GNN, DGCNN (Zhang et al, 2018g), to learn a graph
classification function. The groundtruth of each subgraph is whether the two cen-
ter nodes really have a link. To train this GNN, SEAL randomly samples N exist-
ing links from the network as positive training links, and samples an equal number
of unobserved links (random node pairs) as negative training links. After training,
SEAL applies the trained GNN to new unobserved node pairs’ enclosing subgraphs
to predict their links. The entire SEAL framework is illustrated in Figure 10.2.
SEAL achieves strong performance for link prediction, consistently outperforming
predefined heuristics (Zhang and Chen, 2018b).
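The enclosing-subgraph extraction step can be sketched with networkx as follows (a simplified illustration that keeps all nodes within h hops of either target node and hides the target link itself, without the sampling tricks used in practice):

```python
import networkx as nx

def enclosing_subgraph(G, x, y, h=1):
    # Nodes within distance h of x or of y form the h-hop enclosing subgraph.
    dist_x = nx.single_source_shortest_path_length(G, x, cutoff=h)
    dist_y = nx.single_source_shortest_path_length(G, y, cutoff=h)
    sub = G.subgraph(set(dist_x) | set(dist_y)).copy()
    if sub.has_edge(x, y):
        sub.remove_edge(x, y)   # hide the target link during training
    return sub
```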
SEAL inspired many follow-up works. For example, Cai and Ji (2020) propose to
use enclosing subgraphs of different scales to learn scale-invariant models. Li et al
(2020e) propose Distance Encoding (DE) which generalizes DRNL to node classi-
fication and general node set classification problems and theoretically analyzes the
power it brings to GNNs. The line graph link prediction (LGLP) model (Cai et al,
2020c) transforms each enclosing subgraph into its line graph and uses the center
node embedding in the line graph to predict the original link.
SEAL is also generalized to the bipartite graph link prediction problem of rec-
ommender systems (Zhang and Chen, 2019). The model is called Inductive Graph-
based Matrix Completion (IGMC). IGMC also samples an enclosing subgraph
around each target (user, item) pair, but uses a different node labeling scheme. For
each enclosing subgraph, it first gives label 0 and label 1 to the target user and the
target item, respectively. The remaining nodes’ labels are determined based on both
their node types and their distances to the target user and item: if a user-type node’s
shortest path to reach either the target user or the target item has a length k, it will get
a label 2k; if an item-type node’s shortest path to reach the target user or the target
item has a length k, it will get a label 2k + 1. This way, the target nodes can always
be distinguished from the context nodes, and users can be distinguished from items
(users always have even labels). Furthermore, nodes of different distances to the
center nodes can be differentiated, too. Finally, the enclosing subgraphs are fed into
a GNN with R-GCN convolution layers to incorporate the edge type information
(each edge type corresponds to a different rating). And the output representations
of the target user and the target item are concatenated as the link representation to
predict the target rating. IGMC is an inductive matrix completion model without
relying on any content features, i.e., the model predicts ratings based only on local
graph structures, and the learned model can transfer to unseen users/items or new
tasks without retraining.
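IGMC's labeling scheme can be written as a one-line rule (a sketch of our reading of the description above, where k is the node's shortest-path distance to the nearer of the two target nodes):

```python
def igmc_label(node_type, k):
    # Target user -> 0, target item -> 1; then users -> 2k, items -> 2k + 1.
    if k == 0:
        return 0 if node_type == 'user' else 1
    return 2 * k if node_type == 'user' else 2 * k + 1
```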
In the context of knowledge graph completion, SEAL is generalized to GraIL
(Graph Inductive Learning) (Teru et al, 2020). It also follows the enclosing subgraph
extraction, node labeling, and GNN prediction framework. For enclosing subgraph
extraction, it extracts the subgraph induced by all the nodes that occur on at least
one path of length at most h + 1 between the two target nodes. Unlike SEAL, the
enclosing subgraph of GraIL does not include those nodes that are only neighbors
of one target node but are not neighbors of the other target node. This is because for
knowledge graph reasoning, paths connecting two target nodes are of extra impor-
tance than dangling nodes. After extracting the enclosing subgraphs, GraIL applies
DRNL to label the enclosing subgraphs and uses a variant of R-GCN by enhancing
R-GCN with edge attention to output the score for each link to predict.
10.3.3 Comparing Node-Based Methods and Subgraph-Based Methods
At first glance, both node-based methods and subgraph-based methods learn graph
structure features around target links based on a GNN. However, as we will show,
subgraph-based methods actually have a higher link representation ability than
node-based methods due to modeling the associations between two target nodes.
Fig. 10.3: The different link representation ability between node-based methods and
subgraph-based methods. In the left graph, nodes v2 and v3 are isomorphic; links
(v1 , v2 ) and (v4 , v3 ) are isomorphic; link (v1 , v2 ) and link (v1 , v3 ) are not isomor-
phic. However, a node-based method cannot differentiate (v1 , v2 ) and (v1 , v3 ). In
the middle graph, when we predict (v1 , v2 ), we label these two nodes differently
from the rest, so that a GNN is aware of the target link when learning v1 and v2 ’s
representations. Similarly, when predicting (v1 , v3 ), nodes v1 and v3 will be labeled
differently (shown in the right graph). This way, the representation of v2 in the left
graph will be different from the representation of v3 in the right graph, enabling
GNNs to distinguish (v1 , v2 ) and (v1 , v3 ).
10.4 Theory for GNN Link Prediction
10.4.1 γ-Decaying Heuristic Theory
When using GNNs for link prediction, we want to learn graph structure features
useful for predicting links based on message passing. However, it is usually not
possible to use very deep message passing layers to aggregate information from the
entire network due to the computation complexity introduced by neighbor explosion
and the issue of oversmoothing (Li et al, 2018b). This is why node-based methods
(such as GAE) only use 1 to 3 message passing layers in practice, and why subgraph-
based methods only extract a small 1-hop or 2-hop local enclosing subgraph around
each link.
The γ-decaying heuristic theory (Zhang and Chen, 2018b) mainly answers how
much structural information useful for link prediction is preserved in local neigh-
borhood of the link, in order to justify applying a GNN only to a local enclos-
ing subgraph in subgraph-based methods. To answer this question, the γ-decaying
heuristic theory studies how well existing link prediction heuristics can be approxi-
mated from local enclosing subgraphs. If all these existing successful heuristics can
be accurately computed or approximated from local enclosing subgraphs, then we
are more confident to use a GNN to learn general graph structure features from these
local subgraphs.
Firstly, a direct conclusion from the definition of h-hop enclosing subgraphs (Defi-
nition 10.1) is:
Proposition 10.1. Any h-order heuristic score for (x, y) can be accurately calcu-
lated from the h-hop enclosing subgraph $G^h_{x,y}$ around (x, y).
For example, a 1-hop enclosing subgraph contains all the information needed to
calculate any first-order heuristics, while a 2-hop enclosing subgraph contains all the
information needed to calculate any first and second-order heuristics. This indicates
that first and second-order heuristics can be learned from local enclosing subgraphs
based on an expressive GNN. However, how about high-order heuristics? High-
order heuristics usually have better link prediction performance than local ones. To
study high-order heuristics' local approximability, the γ-decaying heuristic theory
first defines a general formulation of high-order heuristics, namely the γ-decaying
heuristic.
Definition 10.2. (γ-decaying heuristic) A γ-decaying heuristic for link (x, y) has
the following form:
$\mathcal{H}(x, y) = \eta \sum_{l=1}^{\infty} \gamma^l f(x, y, l)$, (10.23)
where γ is a decaying factor between 0 and 1, η is a positive constant or a positive
function of γ that is upper bounded by a constant, and f is a nonnegative function of
x, y and l under the given network.
Next, it proves that under certain conditions, any γ-decaying heuristic can be
approximated from an h-hop enclosing subgraph, and the approximation error de-
creases at least exponentially with h.
The above proof indicates that a smaller γ leads to a faster decaying speed and a
smaller approximation error. To approximate a γ-decaying heuristic, one just needs
to sum its first few terms calculable from an h-hop enclosing subgraph.
Then, a natural question to ask is which existing high-order heuristics belong to
γ-decaying heuristics that allow local approximations. Surprisingly, the γ-decaying
heuristic theory shows that the three most popular high-order heuristics, Katz index,
rooted PageRank and SimRank (listed in Table 10.1), are all γ-decaying heuristics
which satisfy the properties in Theorem 10.1.
To prove these, we need the following lemma first.
Lemma 10.1. Any walk between x and y with length $l \leq 2h + 1$ is included in $G^h_{x,y}$.
The Katz index can be written as
$f_{\mathrm{Katz}}(x, y) = \sum_{l=1}^{\infty} \beta^l |\mathrm{walks}^{\langle l \rangle}(x, y)| = \sum_{l=1}^{\infty} \beta^l [A^l]_{x,y}$,
where $\mathrm{walks}^{\langle l \rangle}(x, y)$ is the set of length-l walks between x and y, and $A^l$ is the $l$th
power of the adjacency matrix of the network. Katz index sums over the collection
of all walks between x and y where a walk of length l is damped by $\beta^l$ (0 < β < 1),
giving more weight to shorter walks.
Katz index is directly defined in the form of a γ-decaying heuristic with η = 1,
γ = β, and $f(x, y, l) = |\mathrm{walks}^{\langle l \rangle}(x, y)|$. According to Lemma 10.1, $|\mathrm{walks}^{\langle l \rangle}(x, y)|$
is calculable from $G^h_{x,y}$ for $l \leq 2h + 1$, thus property 2 in Theorem 10.1 is satisfied.
Proposition 10.2. For any nodes i, j, $[A^l]_{i,j}$ is bounded by $d^l$, where d is the maxi-
mum node degree of the network.
Proof. We prove it by induction. When l = 1, $A_{i,j} \leq d$ for any (i, j). Thus the
base case is correct. Now, assume by induction that $[A^l]_{i,j} \leq d^l$ for any (i, j),
we have
$[A^{l+1}]_{i,j} = \sum_{k=1}^{|V|} [A^l]_{i,k} A_{k,j} \leq d^l \sum_{k=1}^{|V|} A_{k,j} \leq d^l d = d^{l+1}$.
Taking λ = d, we can see that whenever $d < 1/\beta$, the Katz index will satisfy
property 1 in Theorem 10.1. In practice, the damping factor β is often set to very
small values like 5E-4 (Liben-Nowell and Kleinberg, 2007), which implies that Katz
can be very well approximated from the h-hop enclosing subgraph.
10.4.1.3 PageRank
The rooted PageRank for node x calculates the stationary distribution of a random
walker starting at x, who iteratively moves to a random neighbor of its current po-
sition with probability α or returns to x with probability 1 − α. Let $\pi_x$ denote the
stationary distribution vector. Let $[\pi_x]_i$ denote the probability that the random walker
is at node i under the stationary distribution.
Let P be the transition matrix with $P_{i,j} = \frac{1}{|\Gamma(v_j)|}$ if $(i, j) \in E$ and $P_{i,j} = 0$ otherwise.
Let $e_x$ be a vector with the xth element being 1 and others being 0. The stationary
distribution satisfies
$\pi_x = \alpha P \pi_x + (1 - \alpha) e_x$.
When used for link prediction, the score for (x, y) is given by $[\pi_x]_y$ (or $[\pi_x]_y + [\pi_y]_x$
for symmetry). To show that rooted PageRank is a γ-decaying heuristic, we
introduce the inverse P-distance theory (Jeh and Widom, 2003), which states that
$[\pi_x]_y$ can be equivalently written as follows:
$[\pi_x]_y = (1 - \alpha) \sum_{w : x \rightsquigarrow y} P[w] \, \alpha^{\mathrm{len}(w)}$,
where the summation is taken over all walks w starting at x and ending at y (pos-
sibly touching x and y multiple times). For a walk $w = \langle v_0, v_1, \cdots, v_k \rangle$, $\mathrm{len}(w) :=$
$|\langle v_0, v_1, \cdots, v_k \rangle|$ is the length of the walk. The term P[w] is defined as $\prod_{i=0}^{k-1} \frac{1}{|\Gamma(v_i)|}$,
which can be interpreted as the probability of traveling w. Now we have the follow-
ing theorem.
Theorem 10.2. The rooted PageRank heuristic is a γ-decaying heuristic which sat-
isfies the properties in Theorem 10.1.
10.4.1.4 SimRank
The SimRank score (Jeh and Widom, 2002) is motivated by the intuition that two
nodes are similar if their neighbors are also similar. It is defined in the following
recursive way: if x = y, then s(x, y) := 1; otherwise,
$s(x, y) := \gamma \, \frac{\sum_{a \in \Gamma(x)} \sum_{b \in \Gamma(y)} s(a, b)}{|\Gamma(x)| \cdot |\Gamma(y)|}$,
where γ is a constant between 0 and 1. According to (Jeh and Widom, 2002), Sim-
Rank has an equivalent definition:
$s(x, y) = \sum_{w : (x, y) \rightsquigarrow (z, z)} P[w] \, \gamma^{\mathrm{len}(w)}$,
where $w : (x, y) \rightsquigarrow (z, z)$ denotes all simultaneous walks such that one walk starts at
x, the other walk starts at y, and they first meet at any vertex z. For a simultaneous
walk $w = \langle (v_0, u_0), \cdots, (v_k, u_k) \rangle$, len(w) = k is the length of the walk. The term P[w]
is similarly defined as $\prod_{i=0}^{k-1} \frac{1}{|\Gamma(v_i)| |\Gamma(u_i)|}$, describing the probability of this walk. Now
we have the following theorem.
Theorem 10.3. SimRank is a γ-decaying heuristic which satisfies the properties in
Theorem 10.1.
10.4.1.5 Discussion
There exist several other high-order heuristics based on path counting or random
walk (Lü and Zhou, 2011) which can as well be incorporated into the γ-decaying
heuristic framework. Another interesting finding is that first and second-order
heuristics can be unified into this framework too. For example, common neighbors
can be seen as a γ-decaying heuristic with η = γ = 1, and $f(x, y, l) = |\Gamma(x) \cap \Gamma(y)|$
for l = 1, f(x, y, l) = 0 otherwise.
The above results reveal that most existing link prediction heuristics inherently
share the same γ-decaying heuristic form, and thus can be effectively approximated
from an h-hop enclosing subgraph with an exponentially small approximation er-
ror. The ubiquity of γ-decaying heuristics is not by accident—it implies that a suc-
cessful link prediction heuristic had better place exponentially smaller weight on
structures far away from the target, as remote parts of the network intuitively make
little contribution to link existence. The γ-decaying heuristic theory builds the foun-
dation for learning supervised heuristics from local enclosing subgraphs, as it
implies that local enclosing subgraphs already contain enough information to learn
good graph structure features for link prediction, which is much desired considering that
learning from the entire network is often infeasible. This motivates the proposal
of subgraph-based methods.
To summarize, from small enclosing subgraphs extracted around links, we are
able to accurately calculate first and second-order heuristics, and approximate a
wide range of high-order heuristics with small errors. Therefore, given a sufficiently
expressive GNN, learning from such enclosing subgraphs is expected to achieve
performance at least as good as a wide range of heuristics.
For simplicity, we will briefly use structural representation to denote most expres-
sive structural representation in the rest of this section. We will omit A if it is
clear from context. We call $\Gamma(i, A)$ a structural node representation for i, and call
$\Gamma(\{i, j\}, A)$ a structural link representation for (i, j).
Definition 10.4 requires the structural representations of two node sets to be the
same if and only if they are isomorphic. That is, isomorphic node sets always have
the same structural representation, while non-isomorphic node sets always have
different structural representations. This is in contrast to positional node embed-
dings such as DeepWalk (Perozzi et al, 2014) and matrix factorization (Mnih and
Salakhutdinov, 2008), where two isomorphic nodes can have different node embed-
dings (Ribeiro et al, 2017).
So why do we need structural representations? Formally speaking, Srinivasan
and Ribeiro (2020b) prove that any joint prediction task over node sets only requires
most-expressive structural representations of node sets, which are the same for two
node sets if and only if these two node sets are isomorphic. This means, for link pre-
diction tasks, we need to learn the same representation for isomorphic links while
discriminating non-isomorphic links by giving them different representations. Intu-
itively speaking, two links being isomorphic means they should be indistinguishable
from any perspective—if one link exists, the other should exist too, and vice versa.
Therefore, link prediction ultimately requires finding such a structural link repre-
sentation for node pairs which can uniquely identify link isomorphism classes.
According to Figure 10.3 left, node-based methods that directly aggregate two
node representations cannot learn such a valid structural link representation because
they cannot differentiate non-isomorphic links such as (v1 , v2 ) and (v1 , v3 ). One may
wonder whether using one-hot encoding of node indices as the input node features
helps node-based methods learn such a structural link representation. Indeed, using
node-discriminating features enables node-based methods to learn different repre-
sentations for (v1 , v2 ) and (v1 , v3 ) in Figure 10.3 left. However, it also loses GNN’s
ability to map isomorphic nodes (such as v2 and v3 ) and isomorphic links (such
as (v1 , v2 ) and (v4 , v3 )) to the same representations, since any two nodes already
have different representations from the beginning. This might result in poor gener-
alization ability—two nodes/links may have different final representations even if they
share identical neighborhoods.
To ease our analysis, we also define a node-most-expressive GNN, which gives
different representations to all non-isomorphic nodes and gives the same represen-
tation to all isomorphic nodes. In other words, a node-most-expressive GNN learns
structural node representations.
Now, we are ready to introduce the labeling trick and see how it enables learning
structural representations of node sets. As we have seen in Section 10.4.2, a simple
zero-one labeling trick can help a GNN distinguish non-isomorphic links such as
(v1 , v2 ) and (v1 , v3 ) in Figure 10.3 left. At the same time, isomorphic links, such
as (v1 , v2 ) and (v4 , v3 ), will still have the same representation, since the zero-one
labeled graph for (v1 , v2 ) is still symmetric to the zero-one labeled graph for (v4 , v3 ).
This brings an exclusive advantage over using one-hot encoding of node indices.
Below we give the formal definition of labeling trick, which incorporates the
zero-one labeling trick as one specific form.
Definition 10.6. (Labeling trick) Given (S, A), we stack a labeling tensor $L^{(S)} \in \mathbb{R}^{n \times n \times d}$
in the third dimension of A to get a new $A^{(S)} \in \mathbb{R}^{n \times n \times (k+d)}$, where L satis-
fies: $\forall S, A, S', A', \pi \in \Pi_n$, (1) $L^{(S)} = \pi(L^{(S')}) \Rightarrow S = \pi(S')$, and (2) $S = \pi(S'), A =
\pi(A') \Rightarrow L^{(S)} = \pi(L^{(S')})$.
To explain a bit, the labeling trick assigns a label vector to each node/edge in graph
A, which constitutes the labeling tensor $L^{(S)}$. By concatenating A and $L^{(S)}$, we get
the adjacency tensor $A^{(S)}$ of the new labeled graph. By definition we can assign
labels to both nodes and edges. For simplicity, here we only consider node labels,
i.e., we let the off-diagonal components $L^{(S)}_{i,j,:}$ be all zero.
The labeling tensor $L^{(S)}$ should satisfy two conditions in Definition 10.6. The
first condition requires the target nodes S to have distinct labels from those of the
rest of the nodes, so that S is distinguishable from others. This is because if a permutation
$\pi$ preserving node labels exists between nodes of A and A', then S and S' must have
distinct labels to guarantee S' is mapped to S by $\pi$. The second condition requires
the labeling function to be permutation equivariant, i.e., when (S, A) and (S', A') are
isomorphic under $\pi$, the corresponding nodes $i \in S$, $j \in S'$, $i = \pi(j)$ must always have
the same label. In other words, the labeling should be consistent across different S.
For example, the zero-one labeling is a valid labeling trick by always giving label 1
to nodes in S and 0 otherwise, which is both consistent and S-discriminating. How-
ever, an all-one labeling is not a valid labeling trick, because it cannot distinguish
the target set S.
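Concretely, the zero-one labeling trick only amounts to appending one extra feature column before running the GNN; a minimal sketch:

```python
import numpy as np

def zero_one_label(num_nodes, target_set, X=None):
    # Label 1 for nodes in the target set S, 0 otherwise, appended to the
    # original node features X (if any).
    lab = np.zeros((num_nodes, 1))
    lab[list(target_set)] = 1.0
    return lab if X is None else np.concatenate([X, lab], axis=1)

# e.g., when predicting link (v1, v2): X_labeled = zero_one_label(n, {v1, v2}, X)
```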
Now we introduce the main theorem of labeling trick showing that with a valid
labeling trick, a node-most-expressive GNN can learn structural link representations
by aggregating its node representations learned from the labeled graph.
Theorem 10.4. Given a node-most-expressive GNN and an injective set aggregation
function AGG, for any $S, A, S', A'$, $\mathrm{AGG}(\{\mathrm{GNN}(i, A^{(S)}) \mid i \in S\}) =
\mathrm{AGG}(\{\mathrm{GNN}(i', A'^{(S')}) \mid i' \in S'\})$ if and only if $(S, A)$ and $(S', A')$ are isomorphic.
The proof of the above theorem can be found in Appendix A of (Zhang et al, 2020c).
Theorem 10.4 implies that $\mathrm{AGG}(\{\mathrm{GNN}(i, A^{(S)}) \mid i \in S\})$ is a structural represen-
tation for (S, A). Remember that directly aggregating structural node representa-
tions learned from the original graph A does not lead to structural link representa-
tions. Theorem 10.4 shows that aggregating over the structural node representations
learned from the adjacency tensor $A^{(S)}$ of the labeled graph, somewhat surprisingly,
results in a structural representation for S.
The significance of Theorem 10.4 is that it closes the gap between GNN’s node
representation nature and link prediction’s link representation requirement, which
solves the open question raised in (Srinivasan and Ribeiro, 2020b) questioning
node-based GNN methods' ability to perform link prediction. Although directly
aggregating pairwise node representations learned by GNNs does not lead to struc-
tural link representations, combining GNNs with a labeling trick enables learning
structural link representations.
It can be easily proved that the zero-one labeling, DRNL and Distance Encod-
ing (DE) (Li et al, 2020e) are all valid labeling tricks. This explains why subgraph-
based methods empirically outperform node-based methods (Zhang
and Chen, 2018b; Zhang et al, 2020c).
10.5 Future Directions
In this section, we introduce several important future directions for link prediction:
accelerating subgraph-based methods, designing more powerful labeling tricks, and
understanding when to use one-hot features.
10.5.1 Accelerating Subgraph-Based Methods
Subgraph-based methods are much more time- and memory-consuming than node-
based methods, which prevents them from being deployed in modern recommendation
systems. How to accelerate subgraph-based methods is thus an important problem to
study.
The extra computation complexity of subgraph-based methods comes from their
node labeling step. The reason is that for every link (i, j) to predict, we need to
relabel the graph according to (i, j). The same node v will be labeled differently
depending on which one is the target link, and will be given a different node rep-
resentation by the GNN when it appears in different links’ labeled graphs. This is
different from node-based methods, where we do not relabel the graph and each
node only has a single representation.
In other words, for node-based methods, we only need to apply the GNN to
the whole graph once to compute a representation for each node, while subgraph-
based methods need to repeatedly apply the GNN to differently labeled subgraphs
each corresponding to a different link. Thus, when computing link representations,
subgraph-based methods require re-applying the GNN for each target link. For a
graph with n nodes and m links to predict, node-based methods only need to apply
a GNN O(n) times to get a representation for each node (and then use some sim-
ple aggregation function to get link representations), while subgraph-based methods
need to apply a GNN O(m) times for all links. When m ≫ n, subgraph-based meth-
ods have much worse time complexity than node-based methods, which is the price
for learning more expressive link representations.
Is it possible to accelerate subgraph-based methods? One possible way is to sim-
plify the enclosing subgraph extraction process and simplify the GNN architecture.
For example, we may adopt sampling or random walk when extracting the enclosing
subgraphs which might largely reduce the subgraph sizes and avoid hub nodes. It is
interesting to study such simplifications’ influence on performance. Another possi-
ble way is to use distributed and parallel computing techniques. The enclosing sub-
graph extraction process and the GNN computation on a subgraph are completely
independent of each other and are naturally parallelizable. Finally, using multi-stage
ranking techniques could also help. Multi-stage ranking will first use some simple
methods (such as traditional heuristics) to filter out most unlikely links, and use
more powerful methods (such as SEAL) in the later stage to only rank the most
promising links and output the final recommendations/predictions.
Either way, solving the scalability issue of subgraph-based methods can be a
great contribution to the field. That means we can enjoy the superior link prediction
performance of subgraph-based GNN methods without using much more computa-
tion resources, which is expected to extend GNNs to more application domains.
10.5.2 Designing More Powerful Labeling Tricks
Another direction is to design more powerful labeling tricks. Definition 10.6 gives
a general definition of labeling trick. Although any labeling trick satisfying Defi-
nition 10.6 can enable a node-most-expressive GNN to learn structural link repre-
sentations, the real-world performance of different labeling tricks can vary a lot due
to the limited expressive power and depths of practical GNNs. Also, some subtle
differences in implementing a labeling trick can also result in large performance
differences. For example, given the two target nodes x and y, when computing the
distance d(i, x) from a node i to x, DRNL will temporarily mask node y and all its
edges, and when computing the distance d(i, y), DRNL will temporarily mask node
x and all its edges (Zhang and Chen, 2018b). The reason for this “masking trick” is
that DRNL aims to use the pure distance between i and x without the influence of
y. If we do not mask y, d(i, x) will be upper bounded by d(i, y) + d(x, y), which ob-
scures the “true distance” between i and x and might hurt the node labels’ ability to
discriminate structurally-different nodes. As shown in Appendix H of (Zhang et al,
2020c), this masking trick can greatly improve the performance. It is thus interest-
ing to study how to design a more powerful labeling trick (not necessarily based on
shortest path distance like DRNL and DE). It should not only distinguish the target
nodes, but also assign diverse but generalizable labels to nodes with different roles
in the subgraph. A further theoretical analysis on the power of different labeling
tricks is also needed.