
Link Prediction in Social Networks

1. Introduction to Link Prediction


In recent years, the explosive growth of online platforms such as Facebook, Twitter, LinkedIn, and research collaboration networks has highlighted the importance of understanding how relationships between entities evolve over time. At the heart of this evolution lies a fundamental problem in network science and data mining: link prediction.
Link prediction refers to the task of predicting the emergence of new edges (links), or the reappearance of previously removed edges, in a network, based on its current structure and, optionally, node/edge attributes. Formally, given a snapshot of a graph G = (V, E) at time t, the objective is to predict which node pairs (u, v) ∉ E are likely to form a connection in the future (t + ∆t).
This task is crucial in several real-world applications:

• Social Networking Sites: Suggesting potential friends or professional connections.

• E-commerce Platforms: Recommending products by predicting future user-product interactions.

• Biological Networks: Identifying unknown interactions between proteins or genes.

• Knowledge Graphs: Predicting missing or future facts and relations.

• Security and Intelligence: Discovering hidden relationships in criminal or terrorist networks.

One of the key reasons link prediction has garnered significant interest is its ability to uncover latent patterns and relationships that are not immediately obvious from the observed data. Unlike static graph analysis, link prediction considers the temporal dynamics and evolutionary behavior of networks, making it a powerful tool for modeling change over time.
Moreover, link prediction is inherently challenging due to several factors:

• Sparsity: In most real-world networks, the number of actual links is much smaller
than the total number of possible links.

• Imbalance: The number of non-linked node pairs vastly exceeds the number of
actual links, leading to class imbalance in supervised formulations.

• Scalability: Networks often contain millions of nodes and edges, requiring efficient
algorithms.

• Incompleteness: Data collected from real systems may be noisy or partially observed.
Researchers have approached the link prediction problem using a wide range of techniques, from simple heuristics based on graph topology (such as common neighbors or the Jaccard coefficient) to sophisticated machine learning methods (such as probabilistic models, relational learning, matrix factorization, and deep learning).
This chapter provides a comprehensive survey of these methods, categorizing them
into:
• Feature-based and classification-based methods

• Probabilistic and Bayesian models

• Probabilistic relational models

• Linear algebraic and matrix factorization techniques

• Emerging and future directions


By the end of this chapter, readers will have gained an in-depth understanding of the theoretical
foundations, algorithmic strategies, and practical applications of link prediction across
domains.

Mathematical Formulation of Link Prediction


Network Representation
Let a social network be represented as a graph:

G = (V, E)

where:
• V is the set of nodes (vertices),

• E ⊆ V × V is the set of edges (links) representing relationships.

Link Prediction Objective


Given:
• A snapshot of the network at time t, denoted Gt = (Vt , Et ),

• A future time point t′ > t,


the goal is to predict the set of future links Et′ \Et . That is, find node pairs (u, v) ∈ Vt ×Vt
such that:
(u, v) ∉ Et and (u, v) ∈ Et′
This can be formalized by learning a scoring function:

s : V × V → R

which assigns a score s(u, v) to each node pair, indicating the likelihood of a future link.

Symbol / Term: Meaning
G = (V, E): Graph representing the network, with nodes V and edges E
Gt: Snapshot of the graph at time t
Vt: Set of nodes present at time t
Et: Set of edges present at time t
Et′: Set of edges present at a future time t′ > t
(u, v): A candidate node pair in the network
Γ(u): Neighborhood of node u, i.e., all nodes connected to u
yuv: Ground-truth label (1 if a link exists at t′, else 0)
ŷuv: Predicted label for whether a link will form
xuv: Feature vector representing the pair (u, v)
f (xuv): Classification function that predicts ŷuv
s(u, v): Scoring function estimating the likelihood of a link between u and v
R: Set of real numbers (output range of the scoring function)

Table 1: Nomenclature used in the binary classification formulation of link prediction

Binary Classification Formulation


In supervised learning, we treat each node pair (u, v) as an instance with:

• A binary label yuv ∈ {0, 1},

• A feature vector xuv derived from the graph or node attributes.

The classifier learns a function:


f (xuv ) = ŷuv
to predict whether a link will form between u and v.

Common Scoring Functions

• Common Neighbors: s(u, v) = |Γ(u) ∩ Γ(v)|

• Jaccard Coefficient: s(u, v) = |Γ(u) ∩ Γ(v)| / |Γ(u) ∪ Γ(v)|

• Adamic-Adar Index: s(u, v) = Σ_{w∈Γ(u)∩Γ(v)} 1 / log |Γ(w)|

• Preferential Attachment: s(u, v) = |Γ(u)| · |Γ(v)|

Here, Γ(u) denotes the set of neighbors of node u.
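As a minimal illustration (not part of the original text), all four scores can be computed directly from an adjacency map; the toy graph below reuses the example network from the next subsection:

import math

# Toy undirected graph as an adjacency map (example network from the next subsection)
graph = {
    "A": {"B", "D"}, "B": {"A", "C", "E"},
    "C": {"B"}, "D": {"A"}, "E": {"B"},
}

def common_neighbors(g, u, v):
    return len(g[u] & g[v])

def jaccard(g, u, v):
    union = g[u] | g[v]
    return len(g[u] & g[v]) / len(union) if union else 0.0

def adamic_adar(g, u, v):
    # Sum 1/log(deg(w)) over shared neighbors w; skip degree-1 nodes (log 1 = 0)
    return sum(1.0 / math.log(len(g[w])) for w in g[u] & g[v] if len(g[w]) > 1)

def preferential_attachment(g, u, v):
    return len(g[u]) * len(g[v])

print(common_neighbors(graph, "A", "C"))         # 1 (shared neighbor B)
print(round(adamic_adar(graph, "A", "C"), 3))    # 0.91, i.e., 1/log 3
print(preferential_attachment(graph, "A", "C"))  # 2 * 1 = 2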

Common Neighbors
Definition:
The Common Neighbors (CN) score between two nodes u and v is given by:

CN(u, v) = |Γ(u) ∩ Γ(v)|

where Γ(u) is the set of neighbors of node u.

Intuition:
If two nodes share several neighbors, they are more likely to form a link in the future.
This heuristic mirrors the “mutual friend” concept in social networks.
Example Network:

Node Neighbors
A B, D
B A, C, E
C B
D A
E B

Step-by-step CN Score Calculations:

CN(A, B) = |{B, D} ∩ {A, C, E}| = 0
CN(A, C) = |{B, D} ∩ {B}| = 1
CN(A, E) = |{B, D} ∩ {B}| = 1
CN(D, E) = |{A} ∩ {B}| = 0

Bar Chart of CN Scores:

[Figure 1: Bar chart of Common Neighbors scores for node pairs A–B, A–C, A–E, and D–E]

Visual Network Diagram:

[Figure 2: Example network structure with nodes A to E; edges A–B, A–D, B–C, B–E]

Advantages:

• Simple and efficient to compute.

• Uses only local topology, making it scalable.

Limitations:

• Fails to account for the influence of high-degree nodes.

• Performs poorly in sparse or noisy graphs.

Applications:

• Friend recommendation on social networks (e.g., Facebook, LinkedIn).

• Scientific collaboration prediction.

• Connection suggestions in professional networking platforms.

Jaccard Coefficient
The Jaccard Coefficient between nodes u and v is:

Jaccard(u, v) = |Γ(u) ∩ Γ(v)| / |Γ(u) ∪ Γ(v)|

It normalizes the number of common neighbors by the size of the union of the two neighborhoods.

Example: If Γ(u) = {A, B, C} and Γ(v) = {B, C, D, E}:

Jaccard(u, v) = 2/5 = 0.4

Pros: Normalized score, interpretable.
Cons: Still limited to local structure.

Path-Based Features in Link Prediction
Definition: Path-based features evaluate the likelihood of link formation based on the
paths (direct and indirect) between nodes in a network.

Common Metrics
• Shortest Path Distance (SPD):

SPD(u, v) = min over all paths from u to v of the path length

• Katz Index:

Katz(u, v) = Σ_{l=1}^{∞} β^l · |paths_uv^(l)|

where |paths_uv^(l)| is the number of paths of length l between u and v.

• Rooted PageRank: Random walk with restart from node u to reach node v.

• SimRank:

SimRank(u, v) = (C / (|Γ(u)| |Γ(v)|)) · Σ_{i∈Γ(u)} Σ_{j∈Γ(v)} SimRank(i, j)

Example for Katz Index

Graph structure: [diagram omitted; nodes A–E, with edges including A–B, B–C, and A–D]

Walks from A to C (Katz counts walks, so nodes may repeat):

• Walk 1: A–B–C (length 2)

• Walk 2: A–D–A–B–C (length 4)

Taking β = 0.1 and counting only these two walks:

Katz(A, C) = 0.1² · 1 + 0.1⁴ · 1 = 0.01 + 0.0001 = 0.0101

Katz Score Plot

[Bar chart of Katz scores for node pairs A–C, A–B, and A–D]

Figure 3: Katz Score for Selected Node Pairs

Advantages
• Capture indirect, multi-hop relationships.

• Use global network structure.

Disadvantages
• Computationally expensive.

• Not scalable without approximations.

Applications
• Friend recommendation in social networks.

• Missing link inference in biological networks.

• Hyperlink prediction in the web.

1. Shortest Path Distance (SPD)


Definition: The shortest path distance between nodes u and v is the minimum number
of edges that must be traversed to go from u to v.
Intuition: If two nodes are a small number of hops apart, they are likely to become
connected in the future.
Example: In the graph A − B − C:
• SPD(A, B) = 1

• SPD(A, C) = 2
Advantages:
• Fast to compute (e.g., BFS)

• Intuitive notion of proximity

Disadvantages:

• Ignores alternate or redundant paths

• Cannot differentiate between equally distant nodes
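A small sketch of SPD via breadth-first search on the chain graph A–B–C from the example above (the helper name spd is illustrative):

from collections import deque

# Chain graph A-B-C as an adjacency map
graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}

def spd(g, src, dst):
    # Standard BFS: dist[w] is the hop count from src to w
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for w in g[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return float("inf")  # dst is unreachable from src

print(spd(graph, "A", "B"))  # 1
print(spd(graph, "A", "C"))  # 2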

2. Katz Index
Definition: The Katz Index sums over all paths between two nodes, exponentially
damped by path length.
Intuition: Shorter paths contribute more, but longer paths also play a role in con-
nectivity.
Example: In a graph with edges A–B, B–C, and A–D (as in the earlier Katz example):

Paths from A to C:

• A-B-C: length 2

• A-D-A-B-C: length 4 (a walk revisiting A)

Let β = 0.1; then, as computed earlier, Katz(A, C) = 0.1² + 0.1⁴ = 0.0101.
Advantages:

• Considers all possible paths

• Gives weight to structural closeness

Disadvantages:

• Requires full adjacency matrix

• Computationally intensive
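As a sketch under standard assumptions (this closed form is well known, though not stated in the text), the full Katz sum can be evaluated as K = (I − βA)⁻¹ − I whenever β < 1/λ_max(A). The NumPy snippet below applies it to the walks example, with node ordering A, B, C, D chosen for illustration:

import numpy as np

# Adjacency matrix for edges A-B, B-C, A-D (nodes ordered A=0, B=1, C=2, D=3)
A = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=float)

beta = 0.1  # must satisfy beta < 1/lambda_max(A) for the series to converge

# Closed form: K = (I - beta*A)^(-1) - I
# (subtracting I drops the trivial length-0 walk from each node to itself)
I = np.eye(A.shape[0])
K = np.linalg.inv(I - beta * A) - I

print(round(K[0, 2], 4))  # Katz(A, C) ~ 0.0103: the two walks above plus two more length-4 walks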

3. Rooted PageRank (Random Walk with Restart)


Definition: Rooted PageRank measures the stationary probability of reaching a node v
from node u via random walks with restarts.
Intuition: Nodes that are more reachable from a source node via multiple short walks
are considered more likely to form a future link.
Advantages:

• Combines locality and global connectivity
• Captures influence of dense regions
Disadvantages:
• Requires matrix iteration
• Sensitive to restart parameter α

4. SimRank
Definition: SimRank is based on the principle: "Two nodes are similar if they are connected to similar nodes."
Intuition: Recursive notion of similarity; propagates similarity scores via neighbors.
Example: If A and C both connect to B, and SimRank(B, B) = 1, then A and C
have non-zero SimRank.
Advantages:
• Models structural equivalence
• Recursive and expressive

Attribute-Based Features
Definition: Attribute-based features incorporate node-level and edge-level metadata to
enrich the representation of node pairs. These features reflect semantic similarity, profile
match, behavioral overlap, or contextual relevance.

Types of Attributes
• Node-Level Attributes: Age, gender, location, profession, behavior logs, textual
profiles.
• Edge-Level Attributes: Frequency of interaction, strength of relationship, time
of last communication.

Example Features for Node Pair (u, v):


• Age difference: |age(u) − age(v)|
• Location match: loc(u) = loc(v)
• Profile similarity: Cosine similarity of profile text
• Skillset overlap: Jaccard similarity of skills
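A minimal sketch of assembling such a feature vector for a pair (u, v); the profile fields and values below are hypothetical, and the profile-text similarity feature is omitted for brevity:

# Hypothetical node profiles; field names are illustrative only
profiles = {
    "u": {"age": 30, "loc": "Pune", "skills": {"python", "ml", "sql"}},
    "v": {"age": 27, "loc": "Pune", "skills": {"python", "spark"}},
}

def jaccard_sim(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def pair_features(p, u, v):
    # Feature vector x_uv: [age difference, location match, skillset overlap]
    return [
        abs(p[u]["age"] - p[v]["age"]),               # |age(u) - age(v)|
        1.0 if p[u]["loc"] == p[v]["loc"] else 0.0,   # loc(u) = loc(v)
        jaccard_sim(p[u]["skills"], p[v]["skills"]),  # Jaccard similarity of skills
    ]

print(pair_features(profiles, "u", "v"))  # [3, 1.0, 0.25]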

Advantages
• Captures semantic similarity not evident from structure.
• Useful in sparse or cold-start conditions.
• Can be fused with topological features in machine learning models.

Disadvantages
• Metadata may be missing or noisy.
• Raises privacy and ethical concerns in social networks.
• Computational cost in processing text/content attributes.

Applications
• Friend and job recommendations (LinkedIn, Facebook)
• Content-based link inference (YouTube, Netflix)
• Knowledge graph completion using entity types and descriptions

3.2 Classification Models


Link prediction can be cast as a binary classification problem where the goal is to learn
a function that predicts whether a link exists between a given pair of nodes.

Problem Setup
Each node pair (u, v) is associated with:
• A feature vector xuv describing their topological and/or attribute similarity
• A label yuv ∈ {0, 1} indicating link existence

Data Construction
• Positive examples: All existing links (u, v) ∈ E are labeled as y = 1
• Negative examples: Randomly sampled non-linked node pairs (u, v) ∉ E, labeled y = 0
Note: Negative sampling must avoid self-loops and repeated links.

Feature Engineering
Feature vector xuv may include:
• Structural scores: Common Neighbors, Jaccard Coefficient, Adamic-Adar
• Attribute similarity: age difference, skill overlap
• Hybrid or learned embeddings (e.g., node2vec)

Classification Algorithms
• Decision Trees: Handle categorical features; interpretable
• Logistic Regression: Efficient and interpretable; suited for linear relationships
• SVM: Effective for sparse, non-linear data using kernels
• Neural Networks: Capture complex feature interactions; used in deep learning-based link prediction
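A compact sketch of this supervised setup with scikit-learn, combining the data construction and feature steps above (the toy graph, the balanced negative sample, and the choice of logistic regression are all illustrative assumptions):

import random
from itertools import combinations
from sklearn.linear_model import LogisticRegression

# Toy undirected graph (same adjacency-map format as the earlier sketches)
graph = {
    "A": {"B", "D"}, "B": {"A", "C", "E"},
    "C": {"B"}, "D": {"A"}, "E": {"B"},
}
edges = {frozenset((u, v)) for u in graph for v in graph[u]}

def features(u, v):
    # Structural features only, for brevity: CN, Jaccard, preferential attachment
    cn = len(graph[u] & graph[v])
    un = len(graph[u] | graph[v])
    return [cn, cn / un if un else 0.0, len(graph[u]) * len(graph[v])]

# Positive examples: existing links; negatives: sampled non-links (no self-loops)
pos = [tuple(e) for e in edges]
non_links = [p for p in combinations(graph, 2) if frozenset(p) not in edges]
neg = random.sample(non_links, k=min(len(pos), len(non_links)))

X = [features(u, v) for u, v in pos + neg]
y = [1] * len(pos) + [0] * len(neg)

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([features("A", "C")])[0, 1])  # estimated link probability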

Advantages
• Flexible to include both structural and attribute features

• Leverages well-studied ML models

• Suitable for evolving networks with periodic retraining

Disadvantages
• Sensitive to negative sampling bias

• Requires effective feature engineering

• Does not model dynamic changes in topology natively

Evaluation Node in Classification-Based Link Prediction


In classification-based link prediction, the evaluation node compares predicted outputs
with actual labels and computes evaluation metrics to assess model performance.

Confusion Matrix
Predicted: 1 Predicted: 0
Actual: 1 True Positive (TP) False Negative (FN)
Actual: 0 False Positive (FP) True Negative (TN)

Evaluation Metrics
1. Accuracy:

Accuracy = (TP + TN) / (TP + FP + TN + FN)

Measures overall correctness.
2. Precision:

Precision = TP / (TP + FP)

How many predicted links were actually correct?
3. Recall (Sensitivity):

Recall = TP / (TP + FN)

How many actual links were found?
4. F1 Score:

F1 = 2 · (Precision · Recall) / (Precision + Recall)

Balances precision and recall.
5. ROC-AUC: Area under the ROC curve; measures ranking quality.

Worked Example
Node Pair True Label Predicted Score Predicted Label
(A,B) 1 0.90 1
(C,D) 0 0.70 1
(E,F) 1 0.40 0
(G,H) 0 0.10 0

• TP = 1, FP = 1, FN = 1, TN = 1

Accuracy = 2/4 = 0.5, Precision = 1/2 = 0.5, Recall = 1/2 = 0.5, F1 Score = 0.5
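These values can be reproduced with scikit-learn's metric functions, thresholding the predicted scores at 0.5 as in the table:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [1, 0, 1, 0]              # (A,B), (C,D), (E,F), (G,H)
scores = [0.90, 0.70, 0.40, 0.10]  # predicted scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(accuracy_score(y_true, y_pred))   # 0.5
print(precision_score(y_true, y_pred))  # 0.5
print(recall_score(y_true, y_pred))     # 0.5
print(f1_score(y_true, y_pred))         # 0.5
print(roc_auc_score(y_true, scores))    # 0.75: ranking quality from the raw scores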

Conclusion
The evaluation node is essential for validating classification-based link predictors using
metrics that reflect real-world performance and tradeoffs.

4. Bayesian Probabilistic Models


Bayesian models estimate the probability that a link exists between two nodes by applying Bayes' theorem. These models capture uncertainty and provide interpretable link likelihoods.

P (yuv = 1 | xuv ) = P (xuv | yuv = 1) · P (yuv = 1) / P (xuv )
Where:
• yuv ∈ {0, 1}: Link existence

• xuv : Feature vector for node pair (u, v)

Advantages
• Captures link uncertainty

• Probabilistic interpretation

• Can incorporate prior knowledge

Disadvantages
• Requires probability distribution estimation

• Simple variants (e.g., Naive Bayes) assume feature independence

4.1 Local Probabilistic Models


Local probabilistic models use node-specific and neighborhood-based features to estimate
link probability.

Feature Examples
• Number of common neighbors

• Same community or group

• Clustering coefficient of nodes

• Similarity in node attributes (e.g., age, location)

Naive Bayes Example


Assume binary features (e.g., location match, age match):

P (yuv = 1 | xuv ) ∝ P (yuv = 1) · Π_i P (xi | yuv = 1)
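A small sketch of this Naive Bayes scoring with two binary features; every probability below is a hypothetical placeholder, not an estimate from real data:

# Hypothetical parameters for two binary features:
# x1 = location match, x2 = age match
prior_link = 0.05                   # P(y_uv = 1)
p_feat_given_link = [0.60, 0.70]    # P(x_i = 1 | y = 1)
p_feat_given_nolink = [0.10, 0.30]  # P(x_i = 1 | y = 0)

def naive_bayes_posterior(x):
    # Unnormalized P(y=1, x) and P(y=0, x) under the feature-independence assumption
    p1, p0 = prior_link, 1 - prior_link
    for xi, pl, pn in zip(x, p_feat_given_link, p_feat_given_nolink):
        p1 *= pl if xi else (1 - pl)
        p0 *= pn if xi else (1 - pn)
    return p1 / (p1 + p0)  # normalize to obtain the posterior P(y=1 | x)

print(round(naive_bayes_posterior([1, 1]), 3))  # 0.424: both features match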

Advantages
• Lightweight and interpretable

• Suitable for large graphs

Limitations
• Misses global structural information

• Sensitive to local sparsity

4.2 Network Evolution Models


Network evolution models aim to capture the dynamics of network structure over time,
enabling temporal link prediction based on historical patterns.

Motivation
Traditional link prediction models assume static graphs. In contrast, real-world networks
(e.g., social, collaboration, or communication networks) evolve. Network evolution models
address this by estimating:
P (y_uv^(t+1) = 1 | history up to time t)

Latent Variable Modeling


These models often introduce hidden variables to encode unobserved node behaviors or
states. The link formation probability becomes:
P (y_uv^(t+1) = 1 | x_uv^(t), z_u^(t), z_v^(t))

• x_uv^(t): Observed features at time t

• z_u^(t), z_v^(t): Latent variables representing node roles or preferences

Common Techniques
• Dynamic Bayesian Networks (DBNs): Model temporal transitions in graph structure.

• Hidden Markov Models (HMMs): Use latent states to generate observable links.

• Tensor Factorization: Represent temporal graphs as 3D tensors (Node × Node × Time).

• Graph Recurrent Neural Networks: Learn node/link dynamics over time using RNNs or GRUs.

Advantages
• Naturally capture dynamic behavior.

• Estimate link likelihood over time.

• Can model latent patterns not evident in static snapshots.

Disadvantages
• Require multiple time-labeled snapshots.

• Estimating latent variables is computationally intensive.

• Model training and validation are more complex.

4.3 Hierarchical Bayesian Models


Hierarchical Bayesian models incorporate layered probabilistic structures to model both
global and community-level patterns in a network. These models extend traditional
Bayesian frameworks by introducing shared latent parameters across groups.

Motivation
Many real-world networks have modular or hierarchical structure (e.g., communities,
departments, research groups). Hierarchical models capture these patterns using:

• Latent community or topic memberships

• Shared priors across communities

• Stochastic block-like dependencies

Key Models
1. Stochastic Block Model (SBM)
Each node belongs to a latent community zu , and the link probability is:

P (yuv = 1) = Θ_{z_u z_v}

2. Mixed Membership Stochastic Block Model (MMSBM)

Nodes can belong to multiple communities:

P (yuv = 1) = Σ_{k=1}^{K} Σ_{l=1}^{K} π_uk π_vl Θ_kl

Where:

• πu : Community membership vector for node u

• Θ: Community interaction matrix
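In matrix form this double sum is π_u^⊤ Θ π_v; a NumPy sketch with hypothetical membership vectors and interaction matrix:

import numpy as np

# Hypothetical MMSBM parameters for K = 2 communities
Theta = np.array([[0.8, 0.1],    # within/between-community link probabilities
                  [0.1, 0.6]])
pi_u = np.array([0.7, 0.3])      # community memberships of node u (sums to 1)
pi_v = np.array([0.2, 0.8])      # community memberships of node v

# P(y_uv = 1) = sum_k sum_l pi_uk * pi_vl * Theta_kl = pi_u^T Theta pi_v
p_link = pi_u @ Theta @ pi_v
print(round(float(p_link), 3))   # 0.318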

3. Relational Topic Model (RTM)


Model links using topic distributions over node attributes and shared parameters.

Advantages
• Models overlapping or hierarchical communities

• Enables knowledge sharing across groups

• Flexible and interpretable probabilistic structure

Disadvantages
• Computationally demanding

• Requires approximations for inference (e.g., variational methods)

• Scalability issues in large networks

5. Probabilistic Relational Models (PRMs)


Probabilistic Relational Models (PRMs) represent uncertainty in relational data by modeling entities, their attributes, and their inter-relationships probabilistically.

Motivation
Traditional Bayesian networks assume flat data. PRMs extend this to multi-relational
data, where object attributes and relationships influence one another.

Key Components
• Entities: Distinct classes such as users, items, documents

• Attributes: Properties of each entity (e.g., user age, paper topic)

• Relations: Links such as authorship, friendship, citation

• Dependencies: Conditional relationships across entities and their attributes

Relational Bayesian Networks (RBNs)


RBNs are a subclass of PRMs that define:

• Conditional probability dependencies among attributes of related entities

• Link existence as a random variable influenced by relational context

Example
In a citation network:

• Authors have attributes: field, publication count

• Papers have attributes: topic, year

• Relations: author writes paper, paper cites another paper

The probability of a citation between two papers can depend on their topics, citation
history, and authors’ fields.

Advantages
• Models uncertainty and structure jointly

• Captures rich, multi-entity relationships

• Suitable for social, citation, and biological graphs

Disadvantages
• Requires background schema definition

• Computational cost of inference is high

• Complex to learn and interpret for large datasets

Relational Markov Networks (RMNs)


Relational Markov Networks (RMNs) extend Markov networks to relational data. They
define joint distributions over attributes and relationships using undirected graphs and
potential functions.

Definition
RMNs represent the probability of a set of relational variables using cliques and potential
functions:
P (Y) = (1/Z) · Π_{C∈C} ψ_C (Y_C)

• Y: Set of variables (e.g., labels, links)

• C: Set of cliques over entities and their attributes

• ψC : Potential function over clique C

• Z: Partition function for normalization

Key Characteristics
• Undirected graphical model

• No assumption of causal direction

• Captures mutual and cyclic dependencies

Applications
• Co-authorship Networks: Model joint authorship and topical similarity

• Movie Ratings: Capture dependencies between users and movie ratings

• Citation Graphs: Jointly classify papers based on citation patterns

Advantages
• Captures symmetric and undirected relationships

• Expressive for complex relational dependencies

• Does not require directional dependency modeling

Disadvantages
• High computational cost for inference

• Inference requires approximation (e.g., Gibbs Sampling, Loopy BP)

• Requires careful design of relational cliques and potentials

6. Linear Algebraic Methods


Linear algebraic methods leverage matrix operations to estimate node similarity and
predict links. These approaches utilize the structure of the graph encoded in matrices
like the adjacency matrix and Laplacian.

Adjacency and Laplacian Matrices
• Adjacency Matrix (A): Aij = 1 if edge (i, j) exists; else 0.

• Graph Laplacian (L): L = D − A, where D is the degree matrix.

Singular Value Decomposition (SVD)


SVD decomposes A as:
A ≈ U Σ V^⊤
Low-rank approximation enables dimensionality reduction. Node embeddings from U or
V are used to compute similarity (e.g., dot product for link score).
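A brief sketch of rank-k SVD link scoring with NumPy (the toy adjacency matrix and the choice k = 2 are illustrative):

import numpy as np

# Adjacency matrix of a small undirected toy graph (hypothetical)
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Rank-k truncated SVD: A ~= U_k Sigma_k V_k^T
U, s, Vt = np.linalg.svd(A)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Score a non-adjacent pair by its reconstructed entry; larger suggests a likelier link
print(round(A_k[0, 3], 3))  # reconstructed affinity between nodes 0 and 3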

Kernel-Based Methods
Exponential Kernel:

K = e^{βA} = Σ_{n=0}^{∞} (β^n A^n) / n!

Von Neumann Kernel:

K = (I − βA)^{−1} = Σ_{n=0}^{∞} β^n A^n

The series converges if β < 1/λ_max(A), where λ_max(A) is the largest eigenvalue of A.
These kernels model the similarity between nodes based on path counts and decaying influence over longer walks.
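A sketch of the Von Neumann kernel with an explicit convergence check, reusing the same toy matrix (the choice β = 0.9/λ_max is an arbitrary value safely inside the convergence region):

import numpy as np

# Same toy adjacency matrix as in the SVD sketch above
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

lam_max = float(np.max(np.linalg.eigvalsh(A)))  # largest eigenvalue (A is symmetric)
beta = 0.9 / lam_max                            # beta < 1/lambda_max guarantees convergence

# Von Neumann kernel: K = (I - beta*A)^(-1) = sum_n beta^n A^n
K = np.linalg.inv(np.eye(A.shape[0]) - beta * A)
print(round(K[0, 3], 3))  # similarity between nodes 0 and 3 via discounted walk counts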

Advantages
• Capture global network structure

• Embedding-friendly for machine learning

• Scalable via low-rank approximations

Disadvantages
• Computationally intensive on large graphs

• Matrix inversion or exponentiation is costly

• Less interpretable compared to local metrics

7. Recent Developments and Future Directions


Recent advancements in link prediction have moved beyond classical approaches by incorporating time dynamics, scalability solutions, deep learning, and knowledge transfer across domains.

7.1 Time-Aware Link Prediction
• Accounts for temporal evolution of networks.

• Techniques:

– Temporal snapshot modeling


– Tensor factorization (Node × Node × Time)
– Dynamic node embeddings (e.g., EvolveGCN, TGAT)

• Applications: Social dynamics, fraud detection, collaboration forecasting

7.2 Scalability
• Addresses computational challenges in large graphs.

• Techniques:

– Low-rank approximations
– Sampling-based methods
– Parallel/distributed implementations

7.3 Graph Neural Networks (GNNs)


• Learn representations via message passing.

• Incorporate both node features and structure.

• GNNs used: GCN, GraphSAGE, GAT

• Example: Predicting a link between nodes u and v using:

ŷuv = σ(h_u^⊤ h_v)
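Given node embeddings h_u produced by any GNN, the scoring step itself is a one-liner; a sketch with random placeholder vectors standing in for learned embeddings:

import numpy as np

rng = np.random.default_rng(0)
h = {"u": rng.normal(size=16), "v": rng.normal(size=16)}  # placeholder embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# y_hat_uv = sigma(h_u . h_v): dot product squashed into a link probability
y_hat = sigmoid(h["u"] @ h["v"])
print(round(float(y_hat), 3))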

7.4 Transfer Learning


• Leverages knowledge from one network to another.

• Methods:

– Pretrained embeddings
– Fine-tuning on target network
– Domain adaptation across relational spaces

Motif Analysis
Motifs are small, recurring subgraph patterns that serve as building blocks of complex
networks. Motif analysis in link prediction identifies incomplete motifs and uses them to
infer the likelihood of future links.

Key Concepts
• A motif is a small graph pattern (e.g., triangle, star, square).

• Motif frequency reflects structural regularity in real-world networks.

• Incomplete motifs can signal potential new links.

Common Motifs
• Open triad: A–B, A–C, but B–C missing.

• Triangle (closed triad): A–B–C–A.

• Tetrads: Squares, stars, or cliques of 4 nodes.

Motif-Based Link Prediction


• Count motif instances involving node pair (u, v).

• Use the number of motifs completed by adding (u, v) as a feature.

• Predict link probability based on motif-based scores.
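For the open-triad motif, this feature reduces to counting the triangles that adding (u, v) would close, i.e., the common neighbors of u and v; a small sketch on the four-node graph used in the triangle-counting examples below:

# Count open triads that the candidate edge (u, v) would close into triangles.
# Adjacency map for edges (A,B), (A,C), (B,C), (C,D).
graph = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}

def triangles_closed_by(g, u, v):
    # Each common neighbor w gives an open triad u-w-v that (u, v) would close
    return len(g[u] & g[v])

print(triangles_closed_by(graph, "B", "D"))  # 1: adding B-D closes the triad B-C-D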

Advantages
• Models higher-order dependencies

• Effective in sparse or community-rich networks

• Enhances prediction when traditional metrics fail

Disadvantages
• Expensive computation for large graphs

• Sensitive to noise and missing links

• Requires choice of relevant motif types

Triangle Counting and Enumeration


A triangle is a 3-node fully connected subgraph in a network. Triangle counting determines the number of such structures, while enumeration identifies them explicitly.

Example Graph
Let G have nodes A, B, C, D and edges (A, B), (B, C), (C, A), (A, D). One triangle exists:
(A, B, C).

Algorithms
1. Naïve Enumeration: Check all 3-node combinations.
Time: O(n³)
2. Node Iterator: For each node u, check edges among its neighbors.
Time: O(Σ_u d_u²), where d_u is the degree of u
3. Edge Iterator: For each edge (u, v), count common neighbors |N (u) ∩ N (v)|.
Total triangles = (1/3) · Σ_{(u,v)∈E} |N (u) ∩ N (v)|, since each triangle is counted once per edge
4. Matrix Multiplication:

Triangles = (1/6) · Trace(A³)

5. Parallel Methods: Use distributed computing for large-scale graphs (e.g., Spark, GraphX)

Applications
• Triadic closure for link prediction

• Clustering coefficient estimation

• Community detection

• Motif-based graph learning

Naïve Enumeration Algorithm

The naïve algorithm for triangle counting checks every unordered triplet of nodes (u, v, w) and verifies whether all three edges exist.

Steps
1. Enumerate all triplets (u, v, w) where u < v < w

2. For each triplet, check if:

(u, v) ∈ E, (v, w) ∈ E, (w, u) ∈ E

3. If true, increment triangle count

Example
Graph G has nodes A, B, C, D and edges (A, B), (B, C), (C, A), (A, D)
Triplets:

• (A, B, C): All edges exist ⇒ Triangle

• (A, B, D): Edge (B, D) missing ⇒ No triangle

• Others: No complete triangle

Total triangles: 1

Complexity
• Time: O(n³)

• Space: O(1) (assuming edge lookup in adjacency matrix)
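A direct Python sketch of this procedure on the example edge set above:

from itertools import combinations

nodes = ["A", "B", "C", "D"]
edges = {frozenset(e) for e in [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")]}

count = 0
for u, v, w in combinations(nodes, 3):       # every unordered triplet
    if (frozenset((u, v)) in edges and
            frozenset((v, w)) in edges and
            frozenset((w, u)) in edges):
        count += 1                           # all three edges exist: a triangle

print(count)  # 1, the triangle (A, B, C)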

[Figure 4: Example graph with 4 nodes and 4 edges: (A, B), (B, C), (C, A), (A, D)]

Example: Triangle Counting via Naïve Enumeration


Consider the undirected graph with nodes {A, B, C, D} and edges:

E = {(A, B), (A, C), (B, C), (C, D)}

Triplets and Checks:

• (A, B, C): All 3 edges exist ⇒ Triangle ✓

• (A, B, D): Edge (B, D) missing ⇒ Not a triangle ✗

• (A, C, D): Edge (A, D) missing ⇒ Not a triangle ✗

• (B, C, D): Edge (B, D) missing ⇒ Not a triangle ✗

Result: Only 1 triangle exists, formed by nodes A, B, and C.

[Figure 5: Graph with triangle among nodes A–B–C; edges (A, B), (A, C), (B, C), (C, D)]

Triangle Counting via Node Iterator


The Node Iterator method counts triangles by examining pairs of neighbors for each node.

Steps
1. For each node u, get neighbors N (u)

2. For each unordered pair (v, w) ∈ N (u), check if (v, w) ∈ E

3. If yes, (u, v, w) forms a triangle

Example
Graph with edges: (A, B), (A, C), (B, C), (C, D)

• Node A: N (A) = {B, C} ⇒ (B, C) ∈ E ⇒ Triangle ✓

• Node B: N (B) = {A, C} ⇒ (A, C) ∈ E ⇒ Triangle ✓

• Node C: N (C) = {A, B, D} ⇒ (A, B) ∈ E ⇒ Triangle ✓

• Node D: N (D) = {C} ⇒ No triangle ✗

Total count = 3, divide by 3 (each triangle counted thrice):

Number of triangles = 3/3 = 1

Complexity

O( Σ_{u∈V} C(deg(u), 2) )

where C(deg(u), 2) is the number of neighbor pairs of node u.

Pros and Cons


• Efficient for sparse graphs

• Uses local neighborhood structure

• Requires correction for overcounting
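A Python sketch of the Node Iterator on the same example graph:

from itertools import combinations

# Adjacency map for edges (A,B), (A,C), (B,C), (C,D)
graph = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}

count = 0
for u in graph:
    for v, w in combinations(graph[u], 2):  # unordered pairs of neighbors of u
        if w in graph[v]:                   # edge (v, w) exists: triangle (u, v, w)
            count += 1

print(count // 3)  # 1; each triangle is found once from each of its 3 vertices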

Triangle Counting via Edge Iterator


The Edge Iterator algorithm counts triangles by iterating over each edge and finding the
common neighbors of its endpoints.

Algorithm
1. For each edge (u, v) ∈ E:

• Compute N (u) ∩ N (v)


• Each node w ∈ N (u) ∩ N (v) completes triangle (u, v, w)

2. Divide total count by 3 to avoid triple counting.

Example
Graph with edges: (A, B), (A, C), (B, C), (C, D)

[Figure 6: Graph with triangle A–B–C; edges (A, B), (A, C), (B, C), (C, D)]

• Edge (A, B): N (A) = {B, C}, N (B) = {A, C} ⇒ {C} → triangle (A, B, C)

• Edge (A, C): {B} → triangle (A, B, C)

• Edge (B, C): {A} → triangle (A, B, C)

• Edge (C, D): ∅ → no triangle

Total triangle instances: 3 → Final count = 3/3 = 1

Complexity
O(m · d) for average degree d

Advantages
• Efficient and scalable

• Exploits local edge neighborhoods

Disadvantages
• Triangle counted multiple times

• Requires fast set intersection
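A corresponding sketch of the Edge Iterator on the same graph:

# Adjacency map for edges (A,B), (A,C), (B,C), (C,D)
graph = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

# Each common neighbor of an edge's endpoints completes one triangle
total = sum(len(graph[u] & graph[v]) for u, v in edges)
print(total // 3)  # 1; each triangle is counted once per edge (3 times in total)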

Triangle Counting via Matrix Multiplication


Given an adjacency matrix A of an undirected graph, the number of triangles is:
Triangles = (1/6) · trace(A³)

Example
Graph: Nodes = {A, B, C, D}, Edges = (A, B), (A, C), (B, C), (C, D)

A =
| 0 1 1 0 |
| 1 0 1 0 |
| 1 1 0 1 |
| 0 0 1 0 |

A² =
| 2 1 1 1 |
| 1 2 1 1 |
| 1 1 3 0 |
| 1 1 0 1 |

A³ =
| 2 3 4 1 |
| 3 2 4 1 |
| 4 4 2 3 |
| 1 1 3 0 |

trace(A³) = 2 + 2 + 2 + 0 = 6 ⇒ Triangles = 6/6 = 1

Complexity
• Time: O(n^ω), using fast matrix multiplication
• Space: O(n²)

Remarks
• Elegant for small dense graphs
• Impractical for large or sparse networks
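A NumPy sketch verifying the computation above:

import numpy as np

# Adjacency matrix for edges (A,B), (A,C), (B,C), (C,D)
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])

A3 = np.linalg.matrix_power(A, 3)
triangles = np.trace(A3) // 6  # each triangle yields 6 closed length-3 walks
print(triangles)  # 1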

Triangle Counting via Parallel Methods


Parallel algorithms for triangle counting distribute the graph workload across processors
or machines to improve efficiency and scalability.

Motivation
Counting triangles in large-scale networks is computationally expensive. Parallel methods
allow the workload to be split and processed simultaneously.

Techniques
1. Vertex Partitioning:
• Each thread processes a subset of nodes.
• Counts triangles by checking neighbor pairs.
2. Edge Partitioning:
• Each processor handles a subset of edges.
• For edge (u, v), compute N (u) ∩ N (v).
• Avoid duplicate counts using ordering constraints.
3. MapReduce Algorithms:
• Implemented in Apache Hadoop, Spark.
• Mappers emit neighbor information; reducers count triangles.
4. GPU-Based Approaches:
• Use CUDA/OpenCL for fast triangle enumeration.
• Suited for dense graphs in CSR or adjacency matrix form.

Example: Cohen’s MapReduce Strategy
1. For each vertex u, emit edge (u, v) if u < v.
2. Group by v, find common neighbors w.
3. Emit triangle (u, v, w) if all edges exist.

Advantages
• Highly scalable
• Suitable for distributed systems and GPUs
• Handles large graphs efficiently

Disadvantages
• Complex to implement
• Synchronization and load balancing required
• Potential overcounting without constraints

Applications of Motif Analysis


Motif analysis identifies frequently occurring subgraph patterns in a network. These
patterns provide insights into the local structural organization and functional behavior
of the network.

1. Social Networks
• Triangles indicate mutual friendships (triadic closure).
• Star motifs highlight influential users or hubs.
• Helps in link prediction and group recommendation.

2. Biological Networks
• Feed-forward loops (FFLs) regulate gene expression.
• Bi-fan motifs capture protein interaction modules.

3. Neural Circuits
• Triadic motifs support local computation and memory.
• Identify functional modules in the brain.

4. Financial Networks
• Triangles or cycles may signal money laundering or fraud.
• Dense motifs indicate collusion or shell companies.

5. Web and Citation Networks
• Triadic motifs help in link prediction.

• Reveal topic-based communities and influence flow.

6. Infrastructure Networks
• Motifs identify critical points in power grids.

• Analyze failure propagation and resilience.

7. Ecological Networks
• Food chain structures use path and triangle motifs.

• Analyze species interactions and ecosystem balance.
