GCN
▶ A graph is a triplet G = (V, E, W), which includes vertices V, edges E, and weights W
  ⇒ Edges are ordered pairs of labels (i, j). We interpret (i, j) ∈ E as "i can be influenced by j."
  ⇒ Weights w_ij ∈ ℝ are numbers associated with edges (i, j): the strength of the influence of j on i.
[Figure: an 8-node example graph with edge weights w_ij.]
Directed Graphs
▶ Edge (i, j) is represented by an arrow pointing from j into i: the influence of node j on node i
▶ Edge (i, j) is different from edge (j, i) ⇒ It is possible to have (i, j) ∈ E and (j, i) ∉ E
▶ If both edges are in the edge set, the weights can be different ⇒ It is possible to have w_ij ≠ w_ji
[Figure: a directed 8-node example graph with asymmetric edges and weights.]
Symmetric Graphs
▶ A graph is symmetric or undirected if both the edge set and the weights are symmetric
  ⇒ Edges come in pairs ⇒ We must have (i, j) ∈ E if and only if (j, i) ∈ E
  ⇒ Weights are symmetric ⇒ We must have w_ij = w_ji for all (i, j) ∈ E
[Figure: a symmetric 8-node example graph in which each edge (i, j) is paired with (j, i) and w_ij = w_ji.]
Unweighted Graphs
▶ A graph is unweighted when we do not assign weights to its edges, so that only the edge set matters
  ⇒ Equivalently, we can say that all weights are units ⇒ w_ij = 1 for all (i, j) ∈ E
[Figure: an unweighted 8-node example graph; all edges have unit weight.]
Weighted Symmetric Graphs
▶ Most of the graphs we encounter in practical situations are symmetric and weighted
[Figure: a weighted symmetric 8-node example graph with weights w_12, w_13, w_24, w_35, w_46, w_57, w_68, w_78.]
Graph Shift Operators
▶ Graphs have matrix representations, which in this course we call graph shift operators (GSOs)
Adjacency Matrices
▶ The adjacency matrix of graph G = (V, E, W) is the sparse matrix A with nonzero entries A_ij = w_ij for all (i, j) ∈ E (see the sketch after this slide)
[Figure: a symmetric 5-node graph with weights w_12 = w_21, w_13 = w_31, w_23 = w_32, w_24 = w_42, w_35 = w_53, w_45 = w_54.]

$$ A = \begin{bmatrix}
0 & w_{12} & w_{13} & 0 & 0 \\
w_{21} & 0 & w_{23} & w_{24} & 0 \\
w_{31} & w_{32} & 0 & 0 & w_{35} \\
0 & w_{42} & 0 & 0 & w_{45} \\
0 & 0 & w_{53} & w_{54} & 0
\end{bmatrix}. $$
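As an illustration (not part of the original slides), a minimal NumPy sketch that assembles this adjacency matrix from an edge list; the numerical weight values are arbitrary placeholders.

```python
import numpy as np

# Symmetric 5-node example: undirected edges with weights w_ij = w_ji.
# The numerical values of the weights are arbitrary placeholders.
n = 5
edges = {(1, 2): 1.0, (1, 3): 2.0, (2, 3): 0.5,
         (2, 4): 1.5, (3, 5): 2.5, (4, 5): 1.0}

A = np.zeros((n, n))
for (i, j), w in edges.items():
    A[i - 1, j - 1] = w   # nonzero entry A_ij = w_ij for each (i, j) in E
    A[j - 1, i - 1] = w   # symmetry: w_ji = w_ij

print(A)
```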
Adjacency Matrices for Unweighted Graphs
▶ For the particular case in which the graph is unweighted, the weights are interpreted as units
[Figure: the unweighted 5-node example graph; all edges have unit weight.]

$$ A = \begin{bmatrix}
0 & 1 & 1 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 \\
1 & 1 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 1 \\
0 & 0 & 1 & 1 & 0
\end{bmatrix}. $$
Neighborhoods and Degrees
▶ The neighborhood of node i is the set of nodes that influence i ⇒ n(i) := { j : (i, j) ∈ E }
▶ The degree d_i of node i is the sum of the weights of its incident edges

$$ d_i = \sum_{j \in n(i)} w_{ij} = \sum_{j : (i,j) \in E} w_{ij} $$
[Figure: the symmetric 5-node example graph.]

▶ Node 1 neighborhood ⇒ n(1) = {2, 3}

▶ Node 1 degree ⇒ d_1 = w_12 + w_13
Degree Matrix
▶ The degree matrix is a diagonal matrix D with degrees as diagonal entries ⇒ D_ii = d_i
▶ Write it in terms of the adjacency matrix as D = diag(A1), because (A1)_i = Σ_j w_ij = d_i (see the sketch after this slide)
[Figure: the unweighted 5-node example graph.]

$$ D = \begin{bmatrix}
2 & 0 & 0 & 0 & 0 \\
0 & 3 & 0 & 0 & 0 \\
0 & 0 & 3 & 0 & 0 \\
0 & 0 & 0 & 2 & 0 \\
0 & 0 & 0 & 0 & 2
\end{bmatrix} $$
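A quick NumPy check (an illustration, not from the slides) of the identity D = diag(A1) on the unweighted example graph above.

```python
import numpy as np

# Unweighted 5-node graph from the slide.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 0, 1],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)

degrees = A @ np.ones(A.shape[0])   # (A1)_i = sum_j w_ij = d_i
D = np.diag(degrees)                # degree matrix D = diag(A1)
print(D)                            # diag(2, 3, 3, 2, 2), as on the slide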
Laplacian Matrix
▶ The Laplacian matrix of a graph with degree matrix D and adjacency matrix A is L = D − A (computed in the sketch after this slide)

[Figure: the unweighted 5-node example graph.]

$$ L = \begin{bmatrix}
2 & -1 & -1 & 0 & 0 \\
-1 & 3 & -1 & -1 & 0 \\
-1 & -1 & 3 & 0 & -1 \\
0 & -1 & 0 & 2 & -1 \\
0 & 0 & -1 & -1 & 2
\end{bmatrix} $$
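Continuing the same example, a short sketch that forms L = D − A and reproduces the matrix displayed above (assuming the unweighted 5-node graph).

```python
import numpy as np

# Same unweighted 5-node graph as before.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 0, 1],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)

D = np.diag(A @ np.ones(A.shape[0]))  # degree matrix D = diag(A1)
L = D - A                             # graph Laplacian L = D - A
print(L)
```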
Normalized Matrix Representations: Adjacencies
▶ Normalized adjacency and Laplacian matrices express weights relative to the nodes' degrees

▶ Normalized adjacency matrix

$$ \bar{A} := D^{-1/2} A D^{-1/2} \quad \Rightarrow \quad (\bar{A})_{ij} = \frac{w_{ij}}{\sqrt{d_i d_j}} $$
Normalized Matrix Representations: Laplacians
▶ Given these definitions, the normalized representations satisfy

$$ \bar{L} = D^{-1/2} \big( D - A \big) D^{-1/2} = I - \bar{A} $$

  ⇒ The normalized Laplacian and adjacency are essentially the same linear transformation.

▶ Normalized operators are more homogeneous ⇒ The entries of the vector Ā1 tend to be similar (see the sketch after this slide).
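A sketch (illustrative only) of the normalized representations for the same example graph, verifying that L̄ = I − Ā.

```python
import numpy as np

A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 0, 1],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)

d = A @ np.ones(A.shape[0])              # degrees d_i
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # D^{-1/2}

A_bar = D_inv_sqrt @ A @ D_inv_sqrt      # (A_bar)_ij = w_ij / sqrt(d_i d_j)
L_bar = np.eye(A.shape[0]) - A_bar       # L_bar = I - A_bar

D = np.diag(d)
assert np.allclose(L_bar, D_inv_sqrt @ (D - A) @ D_inv_sqrt)
```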
Graph Shift Operator
▶ The graph shift operator S is a stand-in for any of the matrix representations of the graph
▶ The specific choice matters in practice, but most results and analyses hold for any choice of S
Graph Signals
▶ Graph signals are supported on a graph. They are the objects we process in graph signal processing
Graph Signal
▶ A graph signal is a vector x ∈ ℝⁿ in which component x_i is associated with node i of a graph with shift operator S
▶ To emphasize that the graph is intrinsic to the signal, we may write the signal as a pair ⇒ (S, x)
[Figure: a graph signal x with components x_i attached to the nodes of the weighted symmetric example graph.]
Graph Signal Di↵usion
▶ Multiplication by the graph shift operator implements diffusion of the signal over the graph
▶ Define the diffused signal y = Sx ⇒ Its components are (see the sketch after this slide)

$$ y_i = \sum_{j \in n(i)} w_{ij} x_j = \sum_{j} w_{ij} x_j $$

  ⇒ Codifies a local operation where components are mixed with the components of neighboring nodes.
[Figure: the diffused signal y = Sx on the example graph; component y_2 mixes the values x_j of the neighbors of node 2.]
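A minimal sketch of one diffusion step, using the unweighted adjacency matrix as the shift operator and an arbitrary signal (both are placeholder choices).

```python
import numpy as np

# Use the adjacency matrix of the unweighted 5-node graph as the GSO.
S = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 0, 1],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
x = np.array([1.0, -1.0, 2.0, 0.0, 3.0])  # arbitrary graph signal

y = S @ x   # y_i = sum_{j in n(i)} w_ij x_j: each node mixes its neighbors' values
print(y)
```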
The Di↵usion Sequence
▶ The diffusion sequence is the collection of signals x^(k) produced by the recursion x^(k+1) = S x^(k), with x^(0) = x
▶ We can unroll the recursion and write the diffusion sequence as the power sequence ⇒ x^(k) = S^k x
Some Observations about the Di↵usion Sequence
▶ The kth element of the diffusion sequence x^(k) diffuses information to k-hop neighborhoods
  ⇒ One reason why we use the diffusion sequence to define graph convolutions
▶ We have two definitions: one recursive, the other using powers of S
  ⇒ Always implement the recursive version. The power version is good for analysis (see the sketch after this slide)
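A sketch contrasting the two definitions: the recursive implementation used in practice and the power form used for analysis (function name and sizes are arbitrary).

```python
import numpy as np

def diffusion_sequence(S, x, K):
    """Return [x^(0), ..., x^(K-1)] via the recursion x^(k+1) = S x^(k)."""
    seq = [x]
    for _ in range(K - 1):
        seq.append(S @ seq[-1])   # one more hop of diffusion per step
    return seq

# Sanity check against the power form x^(k) = S^k x.
rng = np.random.default_rng(0)
S, x = rng.random((5, 5)), rng.random(5)
seq = diffusion_sequence(S, x, K=4)
assert np.allclose(seq[3], np.linalg.matrix_power(S, 3) @ x)
```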
Graph Convolutional Filters
▶ Graph convolutional filters are the tool of choice for the linear processing of graph signals
Graph Filters
▶ Given a graph shift operator S and coefficients h_k, a graph filter is a polynomial (series) on S

$$ H(S) = \sum_{k=0}^{\infty} h_k S^k $$

▶ The result of applying the filter H(S) to the signal x is the signal (see the sketch after this slide)

$$ y = H(S)\, x = \sum_{k=0}^{\infty} h_k S^k x $$
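A sketch of a (finite-order) graph filter implemented with the diffusion recursion; the coefficients and shift operator are arbitrary placeholders.

```python
import numpy as np

def graph_filter(S, x, h):
    """Apply y = H(S) x = sum_k h_k S^k x using the diffusion recursion."""
    y = np.zeros_like(x)
    x_k = x                  # x^(0) = S^0 x = x
    for h_k in h:
        y = y + h_k * x_k    # accumulate h_k S^k x
        x_k = S @ x_k        # next element of the diffusion sequence
    return y

rng = np.random.default_rng(1)
S, x = rng.random((5, 5)), rng.random(5)
y = graph_filter(S, x, h=[1.0, 0.5, 0.25])   # filter of order K = 3
```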
From Local to Global Information
▶ Consider a signal x supported on a graph with shift operator S, along with filter coefficients h = {h_k}_{k=0}^{K−1}

[Figure: a 12-node example graph with signal components x_1, ..., x_12.]

▶ Graph convolution output

$$ y = h \star_S x = h_0 S^0 x + h_1 S^1 x + h_2 S^2 x + h_3 S^3 x + \dots = \sum_{k=0}^{K-1} h_k S^k x $$
Transferability of Filters Across Di↵erent Graphs
[Figure: the same filter coefficients h applied to signals supported on two different graphs.]

▶ Graph convolution output

$$ y = h \star_S x = h_0 S^0 x + h_1 S^1 x + h_2 S^2 x + h_3 S^3 x + \dots = \sum_{k=0}^{\infty} h_k S^k x $$

▶ The output depends on the filter coefficients h, the graph shift operator S, and the signal x
Learning with Graph Signals
▶ Almost ready to introduce GNNs. We begin with a short discussion of learning with graph signals
Empirical Risk Minimization
▶ In this course, machine learning (ML) on graphs ≡ empirical risk minimization (ERM) on graphs.

▶ In ERM we are given:
  ⇒ A training set T containing observation pairs (x, y) ∈ T
  ⇒ A loss function ℓ(y, ŷ) to evaluate the similarity between y and an estimate ŷ
  ⇒ A function class C

▶ Learning means finding the function Φ* ∈ C that minimizes the loss ℓ(y, Φ(x)) averaged over the training set

$$ \Phi^* = \operatorname*{argmin}_{\Phi \in C} \sum_{(x,y) \in T} \ell\big(y, \Phi(x)\big) $$

▶ We use Φ*(x) to estimate outputs ŷ = Φ*(x) when inputs x are observed but outputs y are unknown
Empirical Risk Minimization with Graph Signals
▶ In ERM, the function class C is the degree of freedom available to the system's designer

$$ \Phi^* = \operatorname*{argmin}_{\Phi \in C} \sum_{(x,y) \in T} \ell\big(y, \Phi(x)\big) $$

▶ Since we are interested in graph signals, graph convolutional filters are a good starting point
Learning with a Graph Convolutional Filter
▶ The input/output signals x/y are graph signals supported on a common graph with shift operator S

▶ Function class ⇒ graph filters of order K supported on S

$$ \Phi(x) = \sum_{k=0}^{K-1} h_k S^k x = \Phi(x; S, h) $$

[Block diagram: x → graph filter z = Σ_k h_k S^k x → z = Φ(x; S, h).]

▶ Learn the ERM solution restricted to the graph filter class (see the sketch after this slide)

$$ h^* = \operatorname*{argmin}_{h} \sum_{(x,y) \in T} \ell\Big(y, \Phi(x; S, h)\Big) $$

  ⇒ The optimization is over the filter coefficients h, with the graph shift operator S given
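A minimal PyTorch sketch of this ERM problem, assuming a squared loss and a synthetic training set T on a fixed 5-node graph; the sizes, learning rate, and data are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def graph_filter(S, x, h):
    """Phi(x; S, h) = sum_k h_k S^k x, via the diffusion recursion."""
    y, x_k = torch.zeros_like(x), x
    for k in range(h.shape[0]):
        y = y + h[k] * x_k
        x_k = S @ x_k
    return y

n, K = 5, 4
S = torch.rand(n, n)                                       # placeholder shift operator
T = [(torch.rand(n), torch.rand(n)) for _ in range(100)]   # synthetic training pairs (x, y)

h = torch.zeros(K, requires_grad=True)                     # filter coefficients to learn
opt = torch.optim.SGD([h], lr=0.01)
for _ in range(50):                                        # ERM: minimize the loss averaged over T
    opt.zero_grad()
    loss = sum(F.mse_loss(graph_filter(S, x, h), y) for x, y in T) / len(T)
    loss.backward()
    opt.step()
```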
When the Output is Not a Graph Signal: Readout
▶ Outputs y ∈ ℝ^m are not graph signals ⇒ Add a readout layer at the filter's output to match dimensions

▶ A readout matrix A ∈ ℝ^{m×n} yields the parametrization

$$ A \times \Phi(x; S, h) = A \times \sum_{k=0}^{K-1} h_k S^k x $$

[Block diagram: x → graph filter z = Φ(x; S, h) → readout matrix A → A × Φ(x; S, h).]

▶ Making A trainable is inadvisable. Learn the filter only (see the sketch after this slide)

$$ h^* = \operatorname*{argmin}_{h} \sum_{(x,y) \in T} \ell\Big(y, A \times \Phi(x; S, h)\Big) $$

▶ Readouts are simple. Read out node i ⇒ A = e_i^T. Read out the signal average ⇒ A = 1^T.
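A small sketch of the two readouts mentioned above, applied to a placeholder filter output z = Φ(x; S, h); reading out node i uses A = e_i^T, and the average readout adds a 1/n scaling to the slide's A = 1^T.

```python
import numpy as np

n = 5
z = np.random.rand(n)            # stand-in for the filter output Phi(x; S, h)

i = 3
A_node = np.eye(n)[i - 1:i, :]   # read out node i:  A = e_i^T
A_avg = np.ones((1, n)) / n      # read out the signal average (1/n scaling of A = 1^T)

print(A_node @ z, A_avg @ z)
```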
Graph Neural Networks (GNNs)
Pointwise Nonlinearities
▶ The result of applying a pointwise nonlinearity σ to a vector x is

$$ \sigma[x] = \begin{bmatrix} \sigma(x_1) \\ \sigma(x_2) \\ \vdots \\ \sigma(x_n) \end{bmatrix}
\quad\text{for}\quad
x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} $$

▶ ReLU: σ(x) = max(0, x). Hyperbolic tangent: σ(x) = (e^{2x} − 1)/(e^{2x} + 1). Absolute value: σ(x) = |x|.
Learning with a Graph Perceptron
▶ Graph filters have limited expressive power because they can only learn linear maps

▶ A first approach to nonlinear maps is the graph perceptron

$$ \Phi(x) = \sigma\!\left[\, \sum_{k=0}^{K-1} h_k S^k x \,\right] = \Phi(x; S, h) $$

[Block diagram: x → graph filter z = Σ_k h_k S^k x → pointwise nonlinearity σ[z] = Φ(x; S, h).]

▶ The optimal regressor restricted to the perceptron class is (see the sketch after this slide)

$$ h^* = \operatorname*{argmin}_{h} \sum_{(x,y) \in T} \ell\Big(y, \Phi(x; S, h)\Big) $$

  ⇒ The perceptron allows learning of nonlinear maps ⇒ More expressive. Larger representable class
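A sketch of the graph perceptron as a graph filter followed by a pointwise ReLU; the choice of nonlinearity and all numerical values are placeholders.

```python
import torch

def graph_perceptron(S, x, h, sigma=torch.relu):
    """Phi(x; S, h) = sigma[ sum_k h_k S^k x ]: graph filter + pointwise nonlinearity."""
    z, x_k = torch.zeros_like(x), x
    for k in range(h.shape[0]):
        z = z + h[k] * x_k    # accumulate h_k S^k x
        x_k = S @ x_k         # diffuse one more hop
    return sigma(z)

S, x = torch.rand(5, 5), torch.rand(5)
h = torch.rand(3)             # filter coefficients of order K = 3
print(graph_perceptron(S, x, h))
```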
Graph Neural Networks (GNNs)
▶ Layer 1 processes the input signal x with the perceptron h_1 = [h_10, ..., h_{1,K−1}] to produce the output x_1

$$ x_1 = \sigma\big[ z_1 \big] = \sigma\!\left[\, \sum_{k=0}^{K-1} h_{1k} S^k x \,\right] $$

▶ The output x_1 of Layer 1 becomes an input to Layer 2. Still x_1, but with a different interpretation

▶ Repeat analogous operations for L times (the GNN's depth) ⇒ Yields the GNN predicted output x_L
Graph Neural Networks (GNNs)
▶ Layer 2 processes its input signal x_1 with the perceptron h_2 = [h_20, ..., h_{2,K−1}] to produce the output x_2

$$ x_2 = \sigma\big[ z_2 \big] = \sigma\!\left[\, \sum_{k=0}^{K-1} h_{2k} S^k x_1 \,\right] $$

▶ The output x_2 of Layer 2 becomes an input to Layer 3. Still x_2, but with a different interpretation

▶ Repeat analogous operations for L times (the GNN's depth) ⇒ Yields the GNN predicted output x_L
The GNN Layer Recursion
▶ A generic layer of the GNN, Layer ℓ, takes as input the output x_{ℓ−1} of the previous layer (ℓ − 1)

▶ Layer ℓ processes its input signal x_{ℓ−1} with the perceptron h_ℓ = [h_{ℓ0}, ..., h_{ℓ,K−1}] to produce the output x_ℓ

$$ x_\ell = \sigma\big[ z_\ell \big] = \sigma\!\left[\, \sum_{k=0}^{K-1} h_{\ell k} S^k x_{\ell-1} \,\right] $$

▶ With the convention that the Layer 1 input is x_0 = x, this provides a recursive definition of a GNN (see the sketch after this slide)

▶ If it has L layers, the GNN output ⇒ x_L = Φ(x; S, h_1, ..., h_L) = Φ(x; S, H)

▶ The filter tensor H = [h_1, ..., h_L] is the trainable parameter. The graph shift is prior information
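A compact PyTorch sketch of this recursion for a single-feature GNN: L layers, each a graph perceptron with its own K coefficients, sharing the shift operator S. The layer sizes, initialization, and nonlinearity are illustrative choices.

```python
import torch
import torch.nn as nn

class GNN(nn.Module):
    """Single-feature GNN: x_l = sigma[ sum_k h_{lk} S^k x_{l-1} ] for l = 1, ..., L."""
    def __init__(self, S, L=3, K=4, sigma=torch.relu):
        super().__init__()
        self.S, self.sigma = S, sigma
        self.H = nn.Parameter(torch.randn(L, K) / K)   # filter tensor H = [h_1, ..., h_L]

    def forward(self, x):
        for h in self.H:                  # one graph perceptron per layer
            z, x_k = torch.zeros_like(x), x
            for k in range(h.shape[0]):
                z = z + h[k] * x_k
                x_k = self.S @ x_k
            x = self.sigma(z)
        return x                          # x_L = Phi(x; S, H)

gnn = GNN(S=torch.rand(5, 5), L=3, K=4)
print(gnn(torch.rand(5)))
```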
GNN Block Diagram
▶ Illustrate the definition with a GNN with 3 layers

▶ Feed the input signal x = x_0 into Layer 1

$$ x_1 = \sigma\big[ z_1 \big] = \sigma\!\left[\, \sum_{k=0}^{K-1} h_{1k} S^k x_0 \,\right] $$

▶ The last layer's output is the GNN output ⇒ Φ(x; S, H)
  ⇒ Parametrized by the filter tensor H = [h_1, h_2, h_3]

[Block diagram: x_0 = x → Layer 1: z_1 = Σ_k h_1k S^k x_0, x_1 = σ[z_1] → Layer 2: z_2 = Σ_k h_2k S^k x_1, x_2 = σ[z_2] → Layer 3: z_3 = Σ_k h_3k S^k x_2, x_3 = σ[z_3] = Φ(x; S, H).]
GNN Block Diagram
▶ Illustrate the definition with a GNN with 3 layers

▶ Feed the Layer 1 output x_1 as an input to Layer 2

$$ x_2 = \sigma\big[ z_2 \big] = \sigma\!\left[\, \sum_{k=0}^{K-1} h_{2k} S^k x_1 \,\right] $$

▶ The last layer's output is the GNN output ⇒ Φ(x; S, H)
  ⇒ Parametrized by the filter tensor H = [h_1, h_2, h_3]

[Block diagram: x_0 = x → Layer 1 → Layer 2 → Layer 3 → x_3 = Φ(x; S, H), as above.]
GNN Block Diagram
▶ Illustrate the definition with a GNN with 3 layers

▶ Feed the Layer 2 output x_2 as an input to Layer 3

$$ x_3 = \sigma\big[ z_3 \big] = \sigma\!\left[\, \sum_{k=0}^{K-1} h_{3k} S^k x_2 \,\right] $$

▶ The last layer's output is the GNN output ⇒ Φ(x; S, H)
  ⇒ Parametrized by the filter tensor H = [h_1, h_2, h_3]

[Block diagram: x_0 = x → Layer 1 → Layer 2 → Layer 3 → x_3 = Φ(x; S, H), as above.]