Link Prediction
Link prediction has attracted significant interest largely because it can uncover
latent patterns and relationships that are not immediately obvious from the observed
data. Unlike static graph analysis, link prediction considers the temporal dynamics
and evolutionary behavior of networks, making it a powerful tool for modeling change
over time.
Moreover, link prediction is inherently challenging due to several factors:
• Sparsity: In most real-world networks, the number of actual links is much smaller
than the total number of possible links.
• Imbalance: The number of non-linked node pairs vastly exceeds the number of
actual links, leading to class imbalance in supervised formulations.
• Scalability: Networks often contain millions of nodes and edges, requiring efficient
algorithms.
• Incompleteness: Data collected from real systems may be noisy or partially
observed.
Researchers have approached the link prediction problem using a wide range of
techniques, from simple heuristics based on graph topology (like common neighbors or
the Jaccard coefficient) to sophisticated machine learning methods (such as
probabilistic models, relational learning, matrix factorization, and deep learning).
This chapter provides a comprehensive survey of these methods, categorizing them
into:
• Feature-based and classification-based methods
G = (V, E)
where:
• V is the set of nodes (vertices),
$s : V \times V \to \mathbb{R}$
which assigns a score s(u, v) to each node pair, indicating the likelihood of a future link.
Symbol / Term     Meaning
G = (V, E)        Graph representing the network, with nodes V and edges E
$G_t$             Snapshot of the graph at time t
$V_t$             Set of nodes present at time t
$E_t$             Set of edges present at time t
$E_{t'}$          Set of edges present at a future time t′ > t
(u, v)            A candidate node pair in the network
Γ(u)              Neighborhood of node u, i.e., all nodes connected to u
$y_{uv}$          Ground truth label: 1 if link exists at t′, else 0
$\hat{y}_{uv}$    Predicted label for whether a link will form
$x_{uv}$          Feature vector representing the pair (u, v)
$f(x_{uv})$       Classification function that predicts $\hat{y}_{uv}$
s(u, v)           Scoring function estimating the likelihood of a link between u and v
$\mathbb{R}$      Set of real numbers (output range of scoring function)
Common Neighbors
Definition:
The Common Neighbors (CN) score between two nodes u and v is given by:
$\mathrm{CN}(u, v) = |\Gamma(u) \cap \Gamma(v)|$
Intuition:
If two nodes share several neighbors, they are more likely to form a link in the future.
This heuristic mirrors the “mutual friend” concept in social networks.
Example Network:
Node Neighbors
A B, D
B A, C, E
C B
D A
E B
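As a rough illustration, the score for this example network can be computed as the size of the intersection of two neighbor sets. The names `neighbors` and `common_neighbors` are illustrative choices, not part of the original text.

```python
# Adjacency sets copied from the example network table.
neighbors = {
    "A": {"B", "D"},
    "B": {"A", "C", "E"},
    "C": {"B"},
    "D": {"A"},
    "E": {"B"},
}


def common_neighbors(u, v):
    """Return the number of neighbors shared by nodes u and v."""
    return len(neighbors[u] & neighbors[v])


# Score a few candidate pairs from the example.
for u, v in [("A", "B"), ("A", "C"), ("A", "E"), ("D", "E")]:
    print(f"CN({u}, {v}) = {common_neighbors(u, v)}")
```

Storing each neighborhood as a set keeps the intersection cheap; for large graphs, sorted adjacency lists would be a common alternative.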
[Figure: bar chart of CN Score (y-axis) for the node pairs A-B, A-C, A-E, and D-E (x-axis).]
[Figure: drawing of the example network with nodes A, B, C, D, and E.]
Advantages:
Limitations:
Applications:
Jaccard Coefficient
The Jaccard Coefficient between nodes u and v is:
$\mathrm{Jaccard}(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|}$
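A minimal sketch of the coefficient, assuming the five-node example network from the Common Neighbors section is reused. The function name `jaccard` and the convention of returning 0.0 for two isolated nodes are assumptions made for illustration.

```python
# Adjacency sets of the five-node example network (reused as an assumption).
neighbors = {
    "A": {"B", "D"},
    "B": {"A", "C", "E"},
    "C": {"B"},
    "D": {"A"},
    "E": {"B"},
}


def jaccard(u, v):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = neighbors[u] | neighbors[v]
    if not union:
        # Assumed convention: two isolated nodes get a score of 0.
        return 0.0
    return len(neighbors[u] & neighbors[v]) / len(union)


print(jaccard("A", "C"))  # one shared neighbor out of two distinct neighbors
print(jaccard("C", "E"))  # identical neighborhoods
```

Dividing by the union normalizes for degree, so a pair of low-degree nodes with one shared neighbor can outrank a pair of hubs with the same overlap.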
Path-Based Features in Link Prediction
Definition: Path-based features evaluate the likelihood of link formation based on the
paths (direct and indirect) between nodes in a network.
Common Metrics
• Shortest Path Distance (SPD):
• Katz Index: $\mathrm{Katz}(u, v) = \sum_{l=1}^{\infty} \beta^{l} \cdot |\mathrm{paths}^{(l)}_{uv}|$
• Rooted PageRank: Random walk with restart from node u to reach node v.
• SimRank: $\mathrm{SimRank}(u, v) = \frac{C}{|\Gamma(u)||\Gamma(v)|} \sum_{i \in \Gamma(u)} \sum_{j \in \Gamma(v)} \mathrm{SimRank}(i, j)$
[Figure: example graph with nodes A, B, C, D, and E.]
Paths from A to C:
[Figure: Katz Score Plot; y-axis: Katz Score.]
Advantages
• Capture indirect, multi-hop relationships.
Disadvantages
• Computationally expensive.
Applications
• Friend recommendation in social networks.
• SPD(A, C) = 2
Advantages:
• Fast to compute (e.g., BFS)
• Intuitive notion of proximity
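The BFS computation mentioned above can be sketched as follows. The adjacency sets are an assumption: they reuse the five-node example network from the Common Neighbors section, in which A and C are two hops apart. The function name `shortest_path_distance` is illustrative.

```python
from collections import deque


def shortest_path_distance(adj, source, target):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                if nxt == target:
                    return dist[nxt]
                queue.append(nxt)
    return None


# Assumed graph: the five-node example network used earlier.
adj = {
    "A": {"B", "D"},
    "B": {"A", "C", "E"},
    "C": {"B"},
    "D": {"A"},
    "E": {"B"},
}
print(shortest_path_distance(adj, "A", "C"))
```

Because BFS visits nodes in order of increasing distance, the first time the target is reached gives the shortest hop count.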
Disadvantages:
2. Katz Index
Definition: The Katz Index sums over all paths between two nodes, exponentially
damped by path length.
Intuition: Shorter paths contribute more, but longer paths also play a role in
connectivity.
Example: In a graph:
[Figure: example graph with nodes A, B, C, D, and E.]
Paths from A to C:
• A-B-C: length 2
• A-D-A-B-C: length 4
Let β = 0.1,
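With β = 0.1, a truncated Katz score can be sketched numerically. The edge list below is an assumption: it reuses the five-node example network from the Common Neighbors section, which contains both walks listed above. The helper names `matmul` and `katz_truncated` are illustrative. Note that matrix powers count every walk of a given length, including ones that revisit nodes (for example A-B-A-B-C), so more length-4 walks are counted than the single one listed.

```python
nodes = ["A", "B", "C", "D", "E"]
# Assumed edges, taken from the earlier example network table.
edges = [("A", "B"), ("A", "D"), ("B", "C"), ("B", "E")]
idx = {name: i for i, name in enumerate(nodes)}

# Build a symmetric 0/1 adjacency matrix as a list of rows.
A = [[0.0] * len(nodes) for _ in nodes]
for u, v in edges:
    A[idx[u]][idx[v]] = 1.0
    A[idx[v]][idx[u]] = 1.0


def matmul(X, Y):
    """Multiply two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def katz_truncated(A, beta, max_len):
    """Sum beta^l * (number of length-l walks) for l = 1..max_len."""
    n = len(A)
    score = [[0.0] * n for _ in range(n)]
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for length in range(1, max_len + 1):
        power = matmul(power, A)  # power now holds A^length
        for i in range(n):
            for j in range(n):
                score[i][j] += beta ** length * power[i][j]
    return score


beta = 0.1
katz = katz_truncated(A, beta, max_len=4)
print(katz[idx["A"]][idx["C"]])
```

Truncating the sum is a common shortcut because the β^l factor makes long walks contribute very little; the cutoff of 4 here is an arbitrary choice for the example.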
Advantages:
Disadvantages:
• Computationally intensive
• Combines locality and global connectivity
• Captures influence of dense regions
Disadvantages:
• Requires matrix iteration
• Sensitive to restart parameter α
4. SimRank
Definition: SimRank is based on the principle: “Two nodes are similar if they are
connected to similar nodes.”
Intuition: Recursive notion of similarity; propagates similarity scores via neighbors.
Example: If A and C both connect to B, and SimRank(B, B) = 1, then A and C
have non-zero SimRank.
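The example can be checked with a small iterative sketch of the SimRank recursion. The decay constant C = 0.8, the iteration count, and the three-node graph (A and C each linked only to B) are assumptions chosen for illustration.

```python
def simrank(adj, C=0.8, iterations=10):
    """Iterate the SimRank recursion starting from sim(u, u) = 1, else 0."""
    nodes = list(adj)
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        updated = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    updated[(u, v)] = 1.0
                elif not adj[u] or not adj[v]:
                    # A node without neighbors has no evidence of similarity.
                    updated[(u, v)] = 0.0
                else:
                    total = sum(sim[(i, j)] for i in adj[u] for j in adj[v])
                    updated[(u, v)] = C * total / (len(adj[u]) * len(adj[v]))
        sim = updated
    return sim


# Hypothetical graph: A and C both connect to B.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
scores = simrank(adj)
print(scores[("A", "C")])
```

Here the only neighbor pair of (A, C) is (B, B), whose similarity is 1, so the score of (A, C) equals the decay constant.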
Advantages:
• Models structural equivalence
• Recursive and expressive
Attribute-Based Features
Definition: Attribute-based features incorporate node-level and edge-level metadata to
enrich the representation of node pairs. These features reflect semantic similarity, profile
match, behavioral overlap, or contextual relevance.
Types of Attributes
• Node-Level Attributes: Age, gender, location, profession, behavior logs, textual
profiles.
• Edge-Level Attributes: Frequency of interaction, strength of relationship, time
of last communication.
Advantages
• Captures semantic similarity not evident from structure.
• Useful in sparse or cold-start conditions.
• Can be fused with topological features in machine learning models.
Disadvantages
• Metadata may be missing or noisy.
• Raises privacy and ethical concerns in social networks.
• Computational cost in processing text/content attributes.
Applications
• Friend and job recommendations (LinkedIn, Facebook)
• Content-based link inference (YouTube, Netflix)
• Knowledge graph completion using entity types and descriptions
Problem Setup
Each node pair (u, v) is associated with:
• A feature vector xuv describing their topological and/or attribute similarity
• A label yuv ∈ {0, 1} indicating link existence
Data Construction
• Positive examples: All existing links (u, v) ∈ E are labeled as y = 1
• Negative examples: Randomly sampled non-linked node pairs (u, v) ∉ E, labeled y = 0
Note: Negative sampling must avoid self-loops and repeated links.
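One way to build such a dataset is sketched below, assuming an undirected graph whose input edge list contains no self-loops. The function name `build_examples`, the fixed seed, and the sample graph are illustrative assumptions.

```python
import random


def build_examples(nodes, edges, num_negatives, seed=0):
    """Return ((u, v), label) pairs: all edges as 1, sampled non-edges as 0."""
    rng = random.Random(seed)
    positives = {tuple(sorted(edge)) for edge in edges}
    # Cap the request at the number of non-linked pairs that actually exist.
    max_negatives = len(nodes) * (len(nodes) - 1) // 2 - len(positives)
    num_negatives = min(num_negatives, max_negatives)

    negatives = set()
    while len(negatives) < num_negatives:
        u, v = rng.sample(nodes, 2)  # two distinct nodes, so no self-loops
        pair = tuple(sorted((u, v)))
        if pair not in positives:
            negatives.add(pair)  # a set silently drops repeated pairs
    return ([(pair, 1) for pair in sorted(positives)]
            + [(pair, 0) for pair in sorted(negatives)])


nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "D"), ("B", "C"), ("B", "E")]
examples = build_examples(nodes, edges, num_negatives=4)
for pair, label in examples:
    print(pair, label)
```

Sampling as many negatives as positives gives a balanced training set, but it does not reflect the true class ratio of the network, which is worth keeping in mind when interpreting evaluation results.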
Feature Engineering
Feature vector xuv may include:
• Structural scores: Common Neighbors, Jaccard Coefficient, Adamic-Adar
• Attribute similarity: age difference, skill overlap
• Hybrid or learned embeddings (e.g., node2vec)
Classification Algorithms
• Decision Trees: Handle categorical features; interpretable
• Logistic Regression: Efficient and interpretable; suited for linear relationships
• SVM: Effective for sparse, non-linear data using kernels
• Neural Networks: Capture complex feature interactions; used in deep learning-based
link prediction
Advantages
• Flexible to include both structural and attribute features
Disadvantages
• Sensitive to negative sampling bias
Confusion Matrix
Predicted: 1 Predicted: 0
Actual: 1 True Positive (TP) False Negative (FN)
Actual: 0 False Positive (FP) True Negative (TN)
Evaluation Metrics
1. Accuracy: $\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$
Measures overall correctness.
2. Precision: $\text{Precision} = \frac{TP}{TP + FP}$
How many predicted links were actually correct?
3. Recall (Sensitivity): $\text{Recall} = \frac{TP}{TP + FN}$
How many actual links were found?
4. F1 Score: $F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
Balances precision and recall.
5. ROC-AUC: Area under the ROC curve – measures ranking quality.
Worked Example
Node Pair True Label Predicted Score Predicted Label
(A,B) 1 0.90 1
(C,D) 0 0.70 1
(E,F) 1 0.40 0
(G,H) 0 0.10 0
• TP = 1, FP = 1, FN = 1, TN = 1
$\text{Accuracy} = \frac{2}{4} = 0.5$, $\text{Precision} = \frac{1}{2} = 0.5$, $\text{Recall} = \frac{1}{2} = 0.5$, F1 Score = 0.5
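The worked example can be reproduced with a short script. The decision threshold of 0.5 is an assumption (the text does not state one), chosen because it matches the predicted labels in the table. The ROC-AUC is computed here as the fraction of positive/negative pairs ranked correctly, which is one standard way to define it.

```python
# (node pair, true label, predicted score) rows from the worked example.
rows = [
    (("A", "B"), 1, 0.90),
    (("C", "D"), 0, 0.70),
    (("E", "F"), 1, 0.40),
    (("G", "H"), 0, 0.10),
]
threshold = 0.5  # assumed cut-off; consistent with the predicted labels above

tp = sum(1 for _, y, s in rows if y == 1 and s >= threshold)
fp = sum(1 for _, y, s in rows if y == 0 and s >= threshold)
fn = sum(1 for _, y, s in rows if y == 1 and s < threshold)
tn = sum(1 for _, y, s in rows if y == 0 and s < threshold)

accuracy = (tp + tn) / len(rows)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# ROC-AUC as the share of (positive, negative) pairs ranked correctly;
# ties count as half.
pos = [s for _, y, s in rows if y == 1]
neg = [s for _, y, s in rows if y == 0]
auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

print(accuracy, precision, recall, f1, auc)
```

Unlike the thresholded metrics, the AUC depends only on the ordering of the scores, so it does not change if the threshold is moved.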
Conclusion
The evaluation step is essential for validating classification-based link predictors using
metrics that reflect real-world performance and tradeoffs.
Advantages
• Captures link uncertainty
• Probabilistic interpretation
Disadvantages
• Requires probability distribution estimation
Feature Examples
• Number of common neighbors
Advantages
• Lightweight and interpretable
Limitations
• Misses global structural information
Motivation
Traditional link prediction models assume static graphs. In contrast, real-world networks
(e.g., social, collaboration, or communication networks) evolve. Network evolution models
address this by estimating:
$P(y_{uv}^{(t+1)} = 1 \mid \text{history up to time } t)$
• $x_{uv}^{(t)}$: Observed features at time t
• $z_u^{(t)}, z_v^{(t)}$: Latent variables representing node roles or preferences
Common Techniques
• Dynamic Bayesian Networks (DBNs): Model temporal transitions in graph
structure.
• Graph Recurrent Neural Networks: Learn node/link dynamics over time using
RNNs or GRUs.
Advantages
• Naturally capture dynamic behavior.
Disadvantages
• Require multiple time-labeled snapshots.
Motivation
Many real-world networks have modular or hierarchical structure (e.g., communities,
departments, research groups). Hierarchical models capture these patterns using:
Key Models
1. Stochastic Block Model (SBM)
Each node belongs to a latent community $z_u$, and the link probability is:
$P(y_{uv} = 1) = \Theta_{z_u z_v}$
Where:
Advantages
• Models overlapping or hierarchical communities
Disadvantages
• Computationally demanding
Motivation
Traditional Bayesian networks assume flat data. PRMs extend this to multi-relational
data, where object attributes and relationships influence one another.
Key Components
• Entities: Distinct classes such as users, items, documents
Example
In a citation network:
The probability of a citation between two papers can depend on their topics, citation
history, and authors’ fields.
Advantages
• Models uncertainty and structure jointly
Disadvantages
• Requires background schema definition
Definition
RMNs represent the probability of a set of relational variables using cliques and potential
functions:
$P(\mathbf{Y}) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(\mathbf{Y}_C)$
Key Characteristics
• Undirected graphical model
Applications
• Co-authorship Networks: Model joint authorship and topical similarity
Advantages
• Captures symmetric and undirected relationships
Disadvantages
• High computational cost for inference
Adjacency and Laplacian Matrices
• Adjacency Matrix (A): $A_{ij} = 1$ if edge (i, j) exists; else 0.
Kernel-Based Methods
Exponential Kernel:
$K = e^{\beta A} = \sum_{n=0}^{\infty} \frac{\beta^n A^n}{n!}$
Von Neumann Kernel:
$K = (I - \beta A)^{-1} = \sum_{n=0}^{\infty} \beta^n A^n$
Converges if $\beta < \frac{1}{\lambda_{\max}(A)}$.
These kernels model the similarity between nodes based on path counts and decaying
influence over longer walks.
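Both kernels can be approximated by truncating their power series, as in the sketch below. The four-node adjacency matrix, β = 0.1, and the helper names are assumptions for illustration. The convergence check uses the fact that the largest eigenvalue of an adjacency matrix is at most the maximum degree, so β below the reciprocal of the maximum degree is sufficient, though stricter than necessary.

```python
def matmul(X, Y):
    """Multiply two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def series_kernel(A, beta, terms, factorial=False):
    """Sum the first `terms` terms of beta^n A^n, divided by n! if requested."""
    n = len(A)
    K = [[0.0] * n for _ in range(n)]
    term = identity(n)  # the n = 0 term of the series
    for k in range(terms):
        for i in range(n):
            for j in range(n):
                K[i][j] += term[i][j]
        # Advance to the next term: multiply by A and rescale.
        scale = beta / (k + 1) if factorial else beta
        term = [[scale * x for x in row] for row in matmul(term, A)]
    return K


# Hypothetical graph with edges (A,B), (A,C), (B,C), (C,D).
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
beta = 0.1
max_degree = max(sum(row) for row in A)
assert beta < 1 / max_degree  # sufficient for the von Neumann series to converge

K_exp = series_kernel(A, beta, terms=30, factorial=True)  # exponential kernel
K_vn = series_kernel(A, beta, terms=200)                  # von Neumann kernel
print(K_exp[0][3], K_vn[0][3])
```

The factorial in the exponential kernel damps long walks much faster than the geometric decay of the von Neumann kernel, which is why the exponential series converges for any β.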
Advantages
• Capture global network structure
Disadvantages
• Computationally intensive on large graphs
7.1 Time-Aware Link Prediction
• Accounts for temporal evolution of networks.
• Techniques:
7.2 Scalability
• Addresses computational challenges in large graphs.
• Techniques:
– Low-rank approximations
– Sampling-based methods
– Parallel/distributed implementations
$\hat{y}_{uv} = \sigma(h_u^{\top} h_v)$
• Methods:
– Pretrained embeddings
– Fine-tuning on target network
– Domain adaptation across relational spaces
Motif Analysis
Motifs are small, recurring subgraph patterns that serve as building blocks of complex
networks. Motif analysis in link prediction identifies incomplete motifs and uses them to
infer the likelihood of future links.
Key Concepts
• A motif is a small graph pattern (e.g., triangle, star, square).
Common Motifs
• Open triad: A–B, A–C, but B–C missing.
Advantages
• Models higher-order dependencies
Disadvantages
• Expensive computation for large graphs
Example Graph
Let G have nodes A, B, C, D and edges (A, B), (B, C), (C, A), (A, D). One triangle exists:
(A, B, C).
Algorithms
1. Naïve Enumeration: Check all 3-node combinations.
Time: $O(n^3)$
2. Node Iterator: For each node u, check edges among its neighbors.
Time: $O(n \cdot d_u^2)$
3. Edge Iterator: For edge (u, v), count common neighbors |N(u) ∩ N(v)|.
Total triangles = $\frac{1}{3} \sum_{(u,v) \in E} |N(u) \cap N(v)|$, since each triangle is counted once for each of its three edges.
4. Matrix Multiplication:
$\text{Triangles} = \frac{1}{6} \cdot \mathrm{Trace}(A^3)$
5. Parallel Methods: Use distributed computing for large-scale graphs (e.g., Spark,
GraphX)
Applications
• Triadic closure for link prediction
• Community detection
Steps
1. Enumerate all triplets (u, v, w) where u < v < w
Example
Graph G has nodes A, B, C, D and edges (A, B), (B, C), (C, A), (A, D)
Triplets:
Total triangles: 1
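The enumeration for this example can be sketched as follows. The function name `count_triangles_naive` and the use of `frozenset` to represent undirected edges are illustrative choices.

```python
from itertools import combinations

nodes = ["A", "B", "C", "D"]
# Undirected edges from the example, stored so that (u, v) equals (v, u).
edges = {frozenset(e) for e in [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")]}


def count_triangles_naive(nodes, edges):
    """Check every unordered triplet u < v < w for all three edges."""
    count = 0
    for u, v, w in combinations(sorted(nodes), 3):
        needed = {frozenset((u, v)), frozenset((v, w)), frozenset((u, w))}
        if needed <= edges:
            count += 1
    return count


print(count_triangles_naive(nodes, edges))
```

Iterating over sorted combinations enforces the u < v < w ordering, so each triangle is visited exactly once.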
Complexity
• Time: $O(n^3)$
[Figure: two drawings of the example graph with nodes A, B, C, and D.]
Steps
1. For each node u, get neighbors N (u)
Example
Graph with edges: (A, B), (A, C), (B, C), (C, D)
Complexity
$O\left(\sum_{u \in V} \binom{\deg(u)}{2}\right)$
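A minimal sketch of the node-iterator approach on the example graph is given below. The function name and the adjacency-set representation are assumptions; dividing the number of hits by three accounts for each triangle being found once from each of its corners.

```python
from itertools import combinations


def count_triangles_node_iterator(adj):
    """For each node, test every pair of its neighbors for an edge."""
    hits = 0
    for u in adj:
        for v, w in combinations(adj[u], 2):
            if w in adj[v]:
                hits += 1
    return hits // 3  # every triangle is seen from each of its three nodes


# Example graph with edges (A,B), (A,C), (B,C), (C,D).
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
print(count_triangles_node_iterator(adj))
```

The inner loop runs over deg(u) choose 2 neighbor pairs per node, which is where the complexity bound above comes from.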
Algorithm
1. For each edge (u, v) ∈ E:
Example
Graph with edges: (A, B), (A, C), (B, C), (C, D)
[Figure: example graph with nodes A, B, C, and D.]
• Edge (A, B): N (A) = {B, C}, N (B) = {A, C} ⇒ {C} → triangle (A, B, C)
Complexity
O(m · d) for average degree d
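The edge-iterator approach can be sketched as follows, assuming the input lists each undirected edge exactly once. The function name is illustrative. Because each triangle is found once per edge, the raw total is divided by three.

```python
def count_triangles_edge_iterator(edges):
    """Sum |N(u) ∩ N(v)| over all edges, then correct for triple counting."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = sum(len(adj[u] & adj[v]) for u, v in edges)
    return total // 3  # each triangle contributes once per edge


edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
print(count_triangles_edge_iterator(edges))
```

An alternative to dividing by three is to impose an ordering on the nodes and only count a common neighbor w when it comes after both u and v.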
Advantages
• Efficient and scalable
Disadvantages
• Triangle counted multiple times
Example
Graph: Nodes = {A, B, C, D} Edges = (A, B), (A, C), (B, C), (C, D)
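With rows and columns ordered A, B, C, D:
\[ A = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix} \]
\[ A^2 = \begin{pmatrix} 2 & 1 & 1 & 1 \\ 1 & 2 & 1 & 1 \\ 1 & 1 & 3 & 0 \\ 1 & 1 & 0 & 1 \end{pmatrix}, \quad A^3 = \begin{pmatrix} 2 & 3 & 4 & 1 \\ 3 & 2 & 4 & 1 \\ 4 & 4 & 2 & 3 \\ 1 & 1 & 3 & 0 \end{pmatrix} \]
$\mathrm{trace}(A^3) = 2 + 2 + 2 + 0 = 6 \Rightarrow \text{Triangles} = \frac{6}{6} = 1$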
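The trace computation can be checked with a few lines of plain Python. The helper `matmul` is an illustrative stand-in for a linear algebra library call.

```python
def matmul(X, Y):
    """Multiply two square integer matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Adjacency matrix of the example graph, rows and columns ordered A, B, C, D.
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]

A2 = matmul(A, A)
A3 = matmul(A2, A)

# Diagonal entry i of A^3 counts closed walks of length 3 starting at node i.
trace = sum(A3[i][i] for i in range(len(A)))
triangles = trace // 6  # 3 starting nodes x 2 directions per triangle
print(trace, triangles)
```

The division by six reflects that each triangle produces six closed walks of length three: one for each starting node and each direction of travel.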
Complexity
• Time: $O(n^{\omega})$, fast matrix multiplication
• Space: $O(n^2)$
Remarks
• Elegant for small dense graphs
• Impractical for large or sparse networks
Motivation
Counting triangles in large-scale networks is computationally expensive. Parallel methods
allow the workload to be split and processed simultaneously.
Techniques
1. Vertex Partitioning:
• Each thread processes a subset of nodes.
• Counts triangles by checking neighbor pairs.
2. Edge Partitioning:
• Each processor handles a subset of edges.
• For edge (u, v), compute N (u) ∩ N (v).
• Avoid duplicate counts using ordering constraints.
3. MapReduce Algorithms:
• Implemented in Apache Hadoop, Spark.
• Mappers emit neighbor information; reducers count triangles.
4. GPU-Based Approaches:
• Use CUDA/OpenCL for fast triangle enumeration.
• Suited for dense graphs in CSR or adjacency matrix form.
Example: Cohen’s MapReduce Strategy
1. For each vertex u, emit edge (u, v) if u < v.
2. Group by v, find common neighbors w.
3. Emit triangle (u, v, w) if all edges exist.
Advantages
• Highly scalable
• Suitable for distributed systems and GPUs
• Handles large graphs efficiently
Disadvantages
• Complex to implement
• Synchronization and load balancing required
• Potential overcounting without constraints
1. Social Networks
• Triangles indicate mutual friendships (triadic closure).
• Star motifs highlight influential users or hubs.
• Helps in link prediction and group recommendation.
2. Biological Networks
• Feed-forward loops (FFLs) regulate gene expression.
• Bi-fan motifs capture protein interaction modules.
3. Neural Circuits
• Triadic motifs support local computation and memory.
• Identify functional modules in the brain.
4. Financial Networks
• Triangles or cycles may signal money laundering or fraud.
• Dense motifs indicate collusion or shell companies.
5. Web and Citation Networks
• Triadic motifs help in link prediction.
6. Infrastructure Networks
• Motifs identify critical points in power grids.
7. Ecological Networks
• Food chain structures use path and triangle motifs.