Web Structure Mining and Social Network Analysis
Universität Mannheim – Bizer: Web Structure Mining – FSS2024 (Version: 22.03.2024) – Slide 1
Web Structure Mining
■ Definition
Discovery and interpretation of patterns in
1. the hyperlink structure of the Web
2. the social ties among actors that interact
on the Web
■ Typical sources of web graphs
1. web crawls including HTML pages and hyperlinks
2. social networks representing relations between actors
3. knowledge graphs that have been extracted from the Web
4. other types of community data (discussion forums,
email conversations, navigation paths …)
■ Web structure mining focuses on the structure, but is also
often combined with content or usage mining
techniques
Hyperlink Graph
A hyperlink graph is a collection of hyperlinks between web
pages which belong to web sites.
Social Network
A social network is a set of relations (e.g. friendship, interest,
data exchange) between social entities, i.e. members of a social
system (actors).
Knowledge Graph
A knowledge graph is a set of relations having different types
(e.g. located in, painted, is interested in, is a) between entities
(Mona Lisa, Louvre, Da Vinci) belonging to classes (e.g.
persons, paintings, museums, places, dates).
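A minimal sketch of this definition, assuming the knowledge graph is stored as a set of (subject, relation, object) triples. The entity and relation names come from the slide; the helper functions `objects` and `entities_of_class` are hypothetical names chosen for illustration.

```python
# Knowledge graph as a set of (subject, relation, object) triples.
# Entities and relation types follow the slide's examples.
triples = {
    ("Da Vinci", "painted", "Mona Lisa"),
    ("Mona Lisa", "located in", "Louvre"),
    ("Da Vinci", "is a", "person"),
    ("Mona Lisa", "is a", "painting"),
    ("Louvre", "is a", "museum"),
}


def objects(subject, relation):
    """Return all objects linked to `subject` by a relation of type `relation`."""
    return {o for s, r, o in triples if s == subject and r == relation}


def entities_of_class(cls):
    """Return all entities that belong to class `cls` via an 'is a' relation."""
    return {s for s, r, o in triples if r == "is a" and o == cls}
```

For example, `objects("Mona Lisa", "located in")` looks up where the painting is located, and `entities_of_class("painting")` lists all entities of that class.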
Chapter Outline
1. Describing Graphs
   1. Terminology and Metrics
2. Prominence
   1. Centrality
   2. Prestige
3. Community Detection
   1. Connected Components and K-Cores
   2. Clustering-based Techniques
4. Machine Learning with Graphs
   1. Link Prediction and Node Classification
   2. Node Embeddings
   3. Graph Neural Networks
1. Describing Graphs: Terminology and Metrics
A graph is a collection of vertices that are connected by edges.
Network: often refers to real systems
Graph: mathematical representation of a network
But often: “Network” ≡ “Graph”

Community          Points     Lines
Math               vertices   edges, arcs
Computer Science   nodes      links
Physics            sites      bonds
Sociology          actors     ties, relations
Graphs
A graph is an ordered pair G = (V, E) where
V is a set of vertices and E is a set of
edges.
Two vertices a and b are called adjacent if they are connected by an edge in E:
directed edge/arc: a → b
undirected edge: a – b
Examples: Directed and Undirected Graphs
Undirected graph: undirected (symmetrical) edges
Directed graph (digraph): directed edges (arcs)
[Figure: an example undirected graph and an example directed graph with lettered vertices]
Undirected edges:
• co-authorship links
• roads (mostly)
Directed arcs:
• hyperlinks on the WWW
• following on Twitter
• phone calls
Graph Terminology
Definition: When (u, v) is an edge of the graph G with
directed edges, u is said to be adjacent to v, and v is
said to be adjacent from u.
The vertex u is called the initial vertex of (u, v), and v is
called the terminal vertex of (u, v).
The initial vertex and terminal vertex of a loop are the
same.
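Representing Graphs
[Figure: an undirected graph and a directed graph, each on the vertices a, b, c, d]

Undirected graph:
Vertex   Adjacent Vertices
a        b, c, d
b        a, d
c        a, d
d        a, b, c

Directed graph:
Initial Vertex   Terminal Vertices
a                c
b                a
c                (none)
d                a, b, c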
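The two adjacency tables above can be sketched as Python dictionaries that map each vertex to the set of its neighbours. The vertex data comes from the tables; the helper names `is_undirected` and `initial_vertices` are hypothetical and only serve the illustration.

```python
# Adjacency lists: each vertex maps to the set of vertices it is adjacent to.
undirected = {
    "a": {"b", "c", "d"},
    "b": {"a", "d"},
    "c": {"a", "d"},
    "d": {"a", "b", "c"},
}

# For the directed graph, each initial vertex maps to its terminal vertices.
directed = {
    "a": {"c"},
    "b": {"a"},
    "c": set(),
    "d": {"a", "b", "c"},
}


def is_undirected(adj):
    """True if every edge (u, v) also appears as (v, u)."""
    return all(u in adj[v] for u in adj for v in adj[u])


def initial_vertices(adj, v):
    """All vertices u with an arc (u, v), i.e. v is adjacent from u."""
    return {u for u, targets in adj.items() if v in targets}
```

The symmetry check holds for the first table but not for the second, which is one quick way to tell the two representations apart.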
Adjacency Matrix
A graph can be represented as an adjacency matrix.
Adjacency Matrices for Directed and Undirected
Graphs
[Figure: an undirected (left) and a directed (right) example graph on the vertices 1 to 4, each with its 4×4 adjacency matrix Aij; the entries A12 and A14 are labeled]
Aij=1 if there is a link between vertices i and j
Aij=0 if vertices i and j are not connected to each other.
Note that for an undirected graph (left) the matrix is symmetric.
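A minimal sketch of building such a matrix from an edge list. The edge list is a hypothetical example, not the graph from the slide, and the code assumes the convention that A[i][j] = 1 stands for a link from vertex i to vertex j (vertices are numbered from 1, list indices from 0).

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n adjacency matrix from (i, j) pairs with 1-based vertex ids.

    Assumed convention: A[i-1][j-1] = 1 means there is a link from i to j.
    """
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i - 1][j - 1] = 1
        if not directed:
            A[j - 1][i - 1] = 1  # undirected edges are stored in both directions
    return A


def matrix_is_symmetric(A):
    """True if A equals its transpose."""
    n = len(A)
    return all(A[i][j] == A[j][i] for i in range(n) for j in range(n))


# Hypothetical four-vertex example
edges = [(1, 2), (1, 4), (2, 4), (3, 4)]
A_undirected = adjacency_matrix(4, edges)
A_directed = adjacency_matrix(4, edges, directed=True)
```

Reading the same edge list as undirected yields a symmetric matrix, while reading it as a list of arcs generally does not.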
Weighted and Unweighted Graphs
Unweighted graph (undirected): every entry of the adjacency matrix Aij is 0 or 1.
Weighted graph (undirected): the entries Aij carry edge weights (such as 0.5, 2 and 4 in the example).
Example: Road networks (distance in miles)
[Figure: a four-vertex graph drawn unweighted (left) and weighted (right), each with its adjacency matrix]
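A short sketch of a weighted adjacency matrix for an undirected graph. The road distances below are made-up values for illustration and do not reproduce the matrix on the slide; `weighted_adjacency_matrix` and `unweighted` are hypothetical helper names.

```python
def weighted_adjacency_matrix(n, weighted_edges):
    """Build a symmetric n x n matrix from (i, j, weight) triples (1-based ids)."""
    W = [[0.0] * n for _ in range(n)]
    for i, j, w in weighted_edges:
        W[i - 1][j - 1] = w
        W[j - 1][i - 1] = w  # undirected: same weight in both directions
    return W


def unweighted(W):
    """Drop the weights: any non-zero entry becomes 1."""
    return [[1 if w != 0 else 0 for w in row] for row in W]


# Hypothetical road network: (town, town, distance in miles)
roads = [(1, 2, 2.0), (1, 3, 0.5), (2, 3, 1.0), (1, 4, 4.0)]
W = weighted_adjacency_matrix(4, roads)
```

Calling `unweighted(W)` recovers the plain 0/1 adjacency matrix of the same graph.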
Bipartite Graphs
A bipartite graph (or bigraph) is a graph whose vertices can be
divided into two disjoint sets U and V such that every edge
connects a vertex in U to one in V; that is, U and V are
independent sets.
Examples:
• movie/actor network
• disease/symptom network
• photo/tag network on Flickr
• customer/product recommendations
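One common way to test this property is breadth-first two-colouring, sketched below. The function name `bipartition` and the small movie/actor network are assumptions made for the example; the input is an adjacency list that contains every vertex as a key.

```python
from collections import deque


def bipartition(adj):
    """Return the two vertex sets (U, V) if the graph is bipartite, else None.

    `adj` maps every vertex to the set of its neighbours (undirected graph).
    """
    color = {}
    for start in adj:  # loop over all vertices to cover every component
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]  # neighbour gets the other colour
                    queue.append(v)
                elif color[v] == color[u]:
                    return None  # an edge inside one set: not bipartite
    U = {x for x, c in color.items() if c == 0}
    V = {x for x, c in color.items() if c == 1}
    return U, V


# Hypothetical movie/actor network
movie_actor = {
    "Movie A": {"Actor 1", "Actor 2"},
    "Movie B": {"Actor 2"},
    "Actor 1": {"Movie A"},
    "Actor 2": {"Movie A", "Movie B"},
}
triangle = {"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}
```

The movie/actor network splits cleanly into movies and actors, whereas the triangle cannot be split and yields `None`.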
Vertex, Arc and Edge Attributes
Vertices, arcs and edges can have attributes.
Example of a network with vertex and arc attributes:
■ girls’ school dormitory dining-table partners (Moreno, The sociometry reader, 1960)
■ first and second choices shown
[Figure: network of dining-table choices; the vertices are labeled Louise, Ada, Lena, Adele, Marion, Jane, Cora, Frances, Eva, Maxine, Mary, Anna, Ruth, Edna, Robin, Martha, Betty, Jean, Laura, Alice, Helen, Hazel, Hilda, Ellen, Ella, Irene]
Graph Terminology
Definition: The degree of a vertex in an undirected
graph is the number of edges incident with it, except that
a loop at a vertex contributes twice to the degree of that
vertex.
In other words, you can determine the degree of a vertex
in a displayed graph by counting the lines that touch it.
The degree of the vertex v is denoted by deg(v).
Example: Degrees of Undirected and Directed
Graphs
Undirected
Degree: the number of edges connected to the vertex.
Example: k_A = 1, k_B = 4
Directed
In directed graphs we can define an in-degree and an out-degree.
The (total) degree is the sum of in- and out-degree.
Example: k_C^in = 2, k_C^out = 1, k_C = 3
Source: a vertex with k^in = 0 and k^out > 0
Sink: a vertex with k^out = 0 and k^in > 0
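The in-degree, out-degree, source and sink definitions can be sketched as follows. The arc list is a hypothetical example chosen so that vertex C has in-degree 2 and out-degree 1; it is not the graph from the slide.

```python
def in_out_degrees(vertices, arcs):
    """Count incoming and outgoing arcs per vertex; arcs are (initial, terminal) pairs."""
    k_in = {v: 0 for v in vertices}
    k_out = {v: 0 for v in vertices}
    for u, v in arcs:
        k_out[u] += 1  # the arc leaves u
        k_in[v] += 1   # the arc enters v
    return k_in, k_out


# Hypothetical directed graph
vertices = ["A", "B", "C", "D"]
arcs = [("A", "C"), ("B", "C"), ("C", "D"), ("A", "B")]

k_in, k_out = in_out_degrees(vertices, arcs)
total = {v: k_in[v] + k_out[v] for v in vertices}

# Source: no incoming but at least one outgoing arc; sink: the reverse.
sources = [v for v in vertices if k_in[v] == 0 and k_out[v] > 0]
sinks = [v for v in vertices if k_out[v] == 0 and k_in[v] > 0]
```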
Graph Terminology
A vertex of degree 0 is called isolated, since it is not
adjacent to any vertex.
Note: A vertex with a loop at it has at least degree 2
and, by definition, is not isolated, even if it is not
adjacent to any other vertex.
A vertex of degree 1 is called pendant. It is adjacent to
exactly one other vertex.
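A minimal sketch of these definitions for an undirected graph given as an edge list. The graph is a hypothetical example; a loop is written as an edge whose two endpoints are the same vertex and therefore adds 2 to the degree.

```python
def undirected_degrees(vertices, edges):
    """Degree of every vertex; a loop (v, v) contributes twice to deg(v)."""
    deg = {v: 0 for v in vertices}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1  # for a loop u == v, so the vertex gains 2 in total
    return deg


# Hypothetical graph with loops at q and t
vertices = ["p", "q", "r", "s", "t"]
edges = [("p", "q"), ("q", "r"), ("q", "q"), ("t", "t")]

deg = undirected_degrees(vertices, edges)
isolated = [v for v in vertices if deg[v] == 0]
pendant = [v for v in vertices if deg[v] == 1]
max_degree = max(deg.values())
```

Vertex t has no neighbour other than itself, but its loop gives it degree 2, so it is not counted as isolated.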
Graph Terminology
Example: Which vertices in the following graph are
isolated, which are pendant, and what is the maximum
degree?
[Figure: example undirected graph with lettered vertices]
Solution: Vertex f is isolated, and vertices a, d
and j are pendant. The maximum degree is
deg(g) = 5.
Degree
Degree: number of edges adjacent to a vertex
In-degree: number of arcs pointing to the vertex
Out-degree: number of arcs leaving the vertex
Degree Distribution
Summarizes the degrees of all vertices.
Alternative representations:
1. A frequency count of the vertices of each degree
2. P(k): probability that a randomly chosen vertex has degree k
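P(k) = N_k / N, where N_k is the number of vertices with degree k and N is the total number of vertices.
[Figure: histogram of in-degree frequencies (left) and plot of P(k) against k (right)]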
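Both representations of the degree distribution can be computed from a list of vertex degrees, as in the sketch below. The degree list is hypothetical, and `degree_frequencies` and `degree_distribution` are names chosen for this example.

```python
from collections import Counter


def degree_frequencies(degrees):
    """Representation 1: number of vertices N_k for each degree k."""
    return dict(sorted(Counter(degrees).items()))


def degree_distribution(degrees):
    """Representation 2: P(k) = N_k / N for each degree k that occurs."""
    n = len(degrees)
    return {k: n_k / n for k, n_k in degree_frequencies(degrees).items()}


# Hypothetical degrees of ten vertices
degrees = [1, 1, 2, 2, 2, 3, 3, 3, 3, 4]
N_k = degree_frequencies(degrees)
P = degree_distribution(degrees)
```

Because every vertex has exactly one degree, the values of P(k) sum to 1.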
Degree Distribution: Friendship on Facebook
[Figure: degree distribution of Facebook friendships, displayed on log-log scale. Annotations in the plot: "New or lonely user?" and "Human or robot?"]
Source: Zafarani, et al: Social Media Mining. Cambridge University Press, 2014.
In-Degree Distribution of the WDC Hyperlink Graph
Covers 3.5 billion web pages and 128 billion hyperlinks, extracted from Common Crawl 2012
[Figure: in-degree distribution, displayed on log-log scale, meaning that the left third covers over 99% of the mass.]
Meusel, Vigna, Lehmberg, Bizer: Graph Structure in the Web - Revisited. 23rd Conference on World Wide Web
(WWW2014). Website: [Link]
Explore Common Crawl: Top In-Degree Websites
[Link]
Literature
1. Zafarani, et al: Social Media Mining. Cambridge University Press, 2014. Free online version: [Link]
2. Wasserman and Faust: Social Network Analysis. Cambridge University Press, 1994.
3. Easley and Kleinberg: Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press, 2010. Free online version: [Link]
4. Liu: Web Data Mining. 2nd Edition, Springer, 2011.