MIT 9.520/6.860, Fall 2018
Statistical Learning Theory and Applications
Class 04: Features and Kernels
Lorenzo Rosasco
Linear functions
Let $\mathcal{H}_{\mathrm{lin}}$ be the space of linear functions
$$f(x) = w^\top x.$$
- $f \leftrightarrow w$ is one to one,
- inner product: $\langle f, \bar f \rangle_{\mathcal{H}} := w^\top \bar w$,
- norm/metric: $\|f - \bar f\|_{\mathcal{H}} := \|w - \bar w\|$.
An observation
Function norm controls point-wise convergence. Since
$$|f(x) - \bar f(x)| \le \|x\|\, \|w - \bar w\|, \quad \forall x \in X,$$
then
$$w_j \to w \ \Rightarrow\ f_j(x) \to f(x), \quad \forall x \in X.$$
ERM
$$\min_{w \in \mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)^2 + \lambda \|w\|^2, \qquad \lambda \ge 0$$
- $\lambda \to 0$: ordinary least squares (bias towards the minimal norm solution),
- $\lambda > 0$: ridge regression (stable).
Computations
Let $X_n \in \mathbb{R}^{n \times d}$ and $\widehat Y \in \mathbb{R}^n$.
The ridge regression solution is
$$\widehat w_\lambda = (X_n^\top X_n + n\lambda I)^{-1} X_n^\top \widehat Y, \qquad \text{time } O(nd^2 \vee d^3), \ \text{mem. } O(nd \vee d^2),$$
but also
$$\widehat w_\lambda = X_n^\top (X_n X_n^\top + n\lambda I)^{-1} \widehat Y, \qquad \text{time } O(dn^2 \vee n^3), \ \text{mem. } O(nd \vee n^2).$$
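A minimal NumPy sketch of the two equivalent formulas above, on synthetic data (the names X, y, lam are placeholders, not from the slides):

    import numpy as np

    n, d, lam = 50, 5, 0.1
    rng = np.random.default_rng(0)
    X = rng.standard_normal((n, d))      # X_n, the n x d data matrix
    y = rng.standard_normal(n)           # Y hat, the n outputs

    # primal form: (X^T X + n*lam*I)^{-1} X^T y   -- cost scales with d
    w_primal = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

    # dual form: X^T (X X^T + n*lam*I)^{-1} y     -- cost scales with n
    w_dual = X.T @ np.linalg.solve(X @ X.T + n * lam * np.eye(n), y)

    print(np.allclose(w_primal, w_dual))  # True: the two formulas agree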
Representer theorem in disguise
We noted that
$$\widehat w_\lambda = X_n^\top c = \sum_{i=1}^n x_i c_i \quad \Leftrightarrow \quad \widehat f_\lambda(x) = \sum_{i=1}^n x^\top x_i\, c_i,$$
$$c = (X_n X_n^\top + n\lambda I)^{-1} \widehat Y, \qquad (X_n X_n^\top)_{ij} = x_i^\top x_j.$$
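Continuing the sketch above, the same solution written through the coefficients $c$ (same placeholder names):

    # dual coefficients: c = (X X^T + n*lam*I)^{-1} y
    c = np.linalg.solve(X @ X.T + n * lam * np.eye(n), y)

    x_new = rng.standard_normal(d)                        # a test point
    f_new = sum(c[i] * (x_new @ X[i]) for i in range(n))  # f(x) = sum_i c_i x^T x_i
    print(np.isclose(f_new, x_new @ w_dual))              # same prediction as w^T x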
Limits of linear functions
Regression
Limits of linear functions
Classification
Nonlinear functions
Two main possibilities:
$$f(x) = w^\top \Phi(x), \qquad f(x) = \Phi(w^\top x),$$
where $\Phi$ is a nonlinear map.
- The former choice leads to linear spaces of functions (the spaces are linear, NOT the functions!).
- The latter choice can be iterated:
$$f(x) = \Phi(w_L^\top \Phi(w_{L-1}^\top \cdots \Phi(w_1^\top x))).$$
Features and feature maps
$$f(x) = w^\top \Phi(x),$$
where $\Phi : X \to \mathbb{R}^p$,
$$\Phi(x) = (\varphi_1(x), \dots, \varphi_p(x))^\top$$
and $\varphi_j : X \to \mathbb{R}$, for $j = 1, \dots, p$.
- $X$ need not be $\mathbb{R}^d$.
- We can also write (see the sketch below)
$$f(x) = \sum_{j=1}^p w_j\, \varphi_j(x).$$
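A minimal sketch of an explicit feature map, assuming a simple polynomial map on $X = \mathbb{R}$ (the names phi and w are illustrative, not prescribed by the slides):

    import numpy as np

    def phi(x):
        # feature map Phi(x) = (1, x, x^2, x^3), so p = 4
        return np.array([1.0, x, x**2, x**3])

    w = np.array([0.5, -1.0, 2.0, 0.3])   # some weight vector in R^p

    def f(x):
        # f(x) = w^T Phi(x) = sum_j w_j phi_j(x): linear in w, nonlinear in x
        return w @ phi(x)

    print(f(1.5))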
Geometric view
$$f(x) = w^\top \Phi(x)$$
An example
More examples
The equation
$$f(x) = w^\top \Phi(x) = \sum_{j=1}^p w_j\, \varphi_j(x)$$
suggests to think of features as some form of basis.
Indeed we can consider
- the Fourier basis,
- wavelets and their variations,
- ...
And even more examples
Any set of functions
$$\varphi_j : X \to \mathbb{R}, \qquad j = 1, \dots, p,$$
can be considered.
Feature design/engineering
- vision: SIFT, HOG
- audio: MFCC
- ...
Nonlinear functions using features
Let $\mathcal{H}_\Phi$ be the space of linear functions
$$f(x) = w^\top \Phi(x).$$
- $f \leftrightarrow w$ is one to one, if the $(\varphi_j)_j$ are linearly independent,
- inner product: $\langle f, \bar f \rangle_{\mathcal{H}_\Phi} := w^\top \bar w$,
- norm/metric: $\|f - \bar f\|_{\mathcal{H}_\Phi} := \|w - \bar w\|$.
In this case
$$|f(x) - \bar f(x)| \le \|\Phi(x)\|\, \|w - \bar w\|, \quad \forall x \in X.$$
Back to ERM
$$\min_{w \in \mathbb{R}^p}\ \frac{1}{n}\sum_{i=1}^n (y_i - w^\top \Phi(x_i))^2 + \lambda \|w\|^2, \qquad \lambda \ge 0,$$
equivalent to
$$\min_{f \in \mathcal{H}_\Phi}\ \frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}_\Phi}^2, \qquad \lambda \ge 0.$$
Computations using features
Let $\widehat\Phi \in \mathbb{R}^{n \times p}$ with
$$(\widehat\Phi)_{ij} = \varphi_j(x_i).$$
The ridge regression solution is
$$\widehat w_\lambda = (\widehat\Phi^\top \widehat\Phi + n\lambda I)^{-1} \widehat\Phi^\top \widehat Y, \qquad \text{time } O(np^2 \vee p^3), \ \text{mem. } O(np \vee p^2),$$
but also
$$\widehat w_\lambda = \widehat\Phi^\top (\widehat\Phi\widehat\Phi^\top + n\lambda I)^{-1} \widehat Y, \qquad \text{time } O(pn^2 \vee n^3), \ \text{mem. } O(np \vee n^2).$$
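A minimal sketch, reusing the polynomial feature map phi from the earlier sketch and the same solve-based formula (data and names are illustrative):

    # feature matrix Phi_hat, n x p, with (Phi_hat)_ij = phi_j(x_i)
    xs = np.linspace(-1, 1, 30)                           # n = 30 scalar inputs
    ys = np.sin(3 * xs) + 0.1 * np.random.default_rng(1).standard_normal(30)
    Phi_hat = np.stack([phi(x) for x in xs])              # shape (30, 4)

    n, p, lam = Phi_hat.shape[0], Phi_hat.shape[1], 0.1
    w_hat = np.linalg.solve(Phi_hat.T @ Phi_hat + n * lam * np.eye(p),
                            Phi_hat.T @ ys)               # primal form
    print(w_hat @ phi(0.5))                               # prediction at a new point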
Representer theorem a little less in disguise
Analogously to before,
$$\widehat w_\lambda = \widehat\Phi^\top c = \sum_{i=1}^n \Phi(x_i)\, c_i \quad \Leftrightarrow \quad \widehat f_\lambda(x) = \sum_{i=1}^n \Phi(x)^\top \Phi(x_i)\, c_i,$$
$$c = (\widehat\Phi\widehat\Phi^\top + n\lambda I)^{-1} \widehat Y, \qquad (\widehat\Phi\widehat\Phi^\top)_{ij} = \Phi(x_i)^\top \Phi(x_j),$$
$$\Phi(x)^\top \Phi(\bar x) = \sum_{s=1}^p \varphi_s(x)\varphi_s(\bar x).$$
Unleash the features
- Can we consider linearly dependent features?
- Can we consider $p = \infty$?
An observation
For $X = \mathbb{R}$ consider
$$\varphi_j(x) = x^{\,j-1} e^{-x^2\gamma} \sqrt{\frac{(2\gamma)^{j-1}}{(j-1)!}}, \qquad j = 1, \dots, \infty$$
(so in particular $\varphi_1(x) = e^{-x^2\gamma}$). Then
$$\sum_{j=1}^\infty \varphi_j(x)\varphi_j(\bar x) = \sum_{j=1}^\infty x^{\,j-1} e^{-x^2\gamma}\sqrt{\frac{(2\gamma)^{j-1}}{(j-1)!}}\ \bar x^{\,j-1} e^{-\bar x^2\gamma}\sqrt{\frac{(2\gamma)^{j-1}}{(j-1)!}}$$
$$= e^{-x^2\gamma} e^{-\bar x^2\gamma} \sum_{j=1}^\infty \frac{(2\gamma)^{j-1}}{(j-1)!}(x\bar x)^{j-1} = e^{-x^2\gamma} e^{-\bar x^2\gamma} e^{2\gamma x\bar x} = e^{-|x-\bar x|^2\gamma}.$$
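A quick numerical check of this identity, truncating the series (gamma, x, xbar are arbitrary test values):

    from math import factorial, exp, sqrt

    gamma, x, xbar = 0.7, 0.9, -0.4

    def phi_j(x, j):
        # phi_j(x) = x^(j-1) * exp(-gamma*x^2) * sqrt((2*gamma)^(j-1) / (j-1)!)
        return x**(j - 1) * exp(-gamma * x**2) * sqrt((2 * gamma)**(j - 1) / factorial(j - 1))

    series = sum(phi_j(x, j) * phi_j(xbar, j) for j in range(1, 40))  # truncated series
    closed = exp(-gamma * (x - xbar)**2)                              # e^{-|x - xbar|^2 gamma}
    print(abs(series - closed) < 1e-12)                               # True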
From features to kernels
$$\Phi(x)^\top \Phi(\bar x) = \sum_{j=1}^\infty \varphi_j(x)\varphi_j(\bar x) = k(x, \bar x)$$
We might be able to compute the series in closed form.
The function $k$ is called a kernel.
Can we run ridge regression?
Kernel ridge regression
We have
$$\widehat f_\lambda(x) = \sum_{i=1}^n \Phi(x)^\top \Phi(x_i)\, c_i = \sum_{i=1}^n k(x, x_i)\, c_i,$$
$$c = (\widehat K + n\lambda I)^{-1} \widehat Y, \qquad (\widehat K)_{ij} = \Phi(x_i)^\top \Phi(x_j) = k(x_i, x_j).$$
$\widehat K$ is the kernel matrix, the Gram (inner products) matrix of the data.
“The kernel trick”
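A minimal kernel ridge regression sketch with the Gaussian kernel (all names and data are illustrative):

    import numpy as np

    def gauss_kernel(A, B, gamma=1.0):
        # k(x, xbar) = exp(-gamma * ||x - xbar||^2), for all pairs of rows of A and B
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(40, 1))                  # training inputs
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)   # noisy targets
    n, lam = len(X), 1e-3

    K = gauss_kernel(X, X)                                # kernel (Gram) matrix
    c = np.linalg.solve(K + n * lam * np.eye(n), y)       # c = (K + n*lam*I)^{-1} y

    X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
    print(gauss_kernel(X_test, X) @ c)                    # f(x) = sum_i k(x, x_i) c_i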
Kernels
- Can we start from kernels instead of features?
- Which functions $k : X \times X \to \mathbb{R}$ define kernels we can use?
Positive definite kernels
A function $k : X \times X \to \mathbb{R}$ is called positive definite:
- if the matrix $\widehat K$ is positive semidefinite for all choices of points $x_1, \dots, x_n$, i.e.
$$a^\top \widehat K a \ge 0, \quad \forall a \in \mathbb{R}^n;$$
- equivalently,
$$\sum_{i,j=1}^n k(x_i, x_j)\, a_i a_j \ge 0$$
for any $a_1, \dots, a_n \in \mathbb{R}$, $x_1, \dots, x_n \in X$ (a small numerical check follows below).
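A quick check that a Gaussian kernel matrix is positive semidefinite, reusing gauss_kernel from the sketch above (points are arbitrary):

    pts = np.random.default_rng(2).standard_normal((30, 2))   # 30 points in R^2
    K = gauss_kernel(pts, pts, gamma=0.5)

    eigvals = np.linalg.eigvalsh(K)       # eigenvalues of the symmetric matrix K
    print(eigvals.min() >= -1e-10)        # all (numerically) nonnegative

    a = np.random.default_rng(3).standard_normal(30)
    print(a @ K @ a >= -1e-10)            # a^T K a >= 0 for this particular a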
Inner product kernels are pos. def.
Assume $\Phi : X \to \mathbb{R}^p$, $p \le \infty$, and
$$k(x, \bar x) = \Phi(x)^\top \Phi(\bar x).$$
Note that
$$\sum_{i,j=1}^n k(x_i, x_j)\, a_i a_j = \sum_{i,j=1}^n \Phi(x_i)^\top \Phi(x_j)\, a_i a_j = \Big\|\sum_{i=1}^n \Phi(x_i)\, a_i\Big\|^2 \ge 0.$$
Clearly $k$ is symmetric.
But there are many pos. def. kernels
Classic examples:
- linear: $k(x, \bar x) = x^\top \bar x$
- polynomial: $k(x, \bar x) = (x^\top \bar x + 1)^s$
- Gaussian: $k(x, \bar x) = e^{-\|x - \bar x\|^2 \gamma}$
But one can also consider
- kernels on probability distributions
- kernels on strings
- kernels on functions
- kernels on groups
- kernels on graphs
- ...
It is natural to think of a kernel as a measure of similarity.
From pos. def. kernels to functions
Let $X$ be any set. Given a pos. def. kernel $k$:
- consider the space $\mathcal{H}_k$ of functions
$$f(x) = \sum_{i=1}^N k(x, x_i)\, a_i$$
for any $a_1, \dots, a_N \in \mathbb{R}$, $x_1, \dots, x_N \in X$ and any $N \in \mathbb{N}$;
- define an inner product on $\mathcal{H}_k$
$$\big\langle f, \bar f \big\rangle_{\mathcal{H}_k} = \sum_{i=1}^N \sum_{j=1}^{\bar N} k(x_i, \bar x_j)\, a_i \bar a_j$$
(a small numerical sketch follows below);
- $\mathcal{H}_k$ can be completed to a Hilbert space.
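A small numerical sketch of this construction with the Gaussian kernel, reusing gauss_kernel from above (points and coefficients are arbitrary):

    # f = sum_i a[i] k(., x[i]),   fbar = sum_j abar[j] k(., xbar[j])
    x    = np.array([[0.0], [1.0], [2.0]]); a    = np.array([1.0, -0.5, 2.0])
    xbar = np.array([[0.5], [1.5]]);        abar = np.array([0.3, 1.0])

    def f(t):
        # evaluate f at a scalar point t
        return a @ gauss_kernel(np.array([[t]]), x)[0]

    # inner product <f, fbar>_{H_k} = sum_{i,j} a_i abar_j k(x_i, xbar_j)
    inner = a @ gauss_kernel(x, xbar) @ abar
    print(f(0.7), inner)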
An illustration
Functions defined by Gaussian kernels with large and small widths.
A key result
Theorem
Given a pos. def. $k$ there exists $\Phi$ s.t. $k(x, \bar x) = \langle \Phi(x), \Phi(\bar x)\rangle_{\mathcal{H}_k}$ and
$$\mathcal{H}_\Phi \simeq \mathcal{H}_k.$$
Roughly speaking,
$$f(x) = w^\top \Phi(x) \quad \simeq \quad f(x) = \sum_{i=1}^N k(x, x_i)\, a_i.$$
From features and kernels to RKHS and beyond
$\mathcal{H}_k$ and $\mathcal{H}_\Phi$ have many properties, characterizations, connections:
- reproducing property
- reproducing kernel Hilbert spaces (RKHS)
- Mercer theorem (Karhunen-Loève expansion)
- Gaussian processes
- Cameron-Martin spaces
Reproducing property
Note that by definition of $\mathcal{H}_k$:
- $k_x = k(x, \cdot)$ is a function in $\mathcal{H}_k$;
- for all $f \in \mathcal{H}_k$, $x \in X$,
$$f(x) = \langle f, k_x \rangle_{\mathcal{H}_k},$$
called the reproducing property (a short derivation follows below);
- note that
$$|f(x) - \bar f(x)| \le \|k_x\|_{\mathcal{H}_k}\, \big\|f - \bar f\big\|_{\mathcal{H}_k}, \quad \forall x \in X.$$
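A one-line check of the reproducing property, using the inner product defined earlier on $\mathcal{H}_k$: for $f = \sum_{i=1}^N a_i\, k_{x_i}$ and $\bar f = k_x$,
$$\langle f, k_x \rangle_{\mathcal{H}_k} = \sum_{i=1}^N a_i\, k(x_i, x) = f(x).$$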
The above observations have a converse.
RKHS
Definition
A RKHS $\mathcal{H}$ is a Hilbert space with a function $k : X \times X \to \mathbb{R}$ s.t.
- $k_x = k(x, \cdot) \in \mathcal{H}$,
- and
$$f(x) = \langle f, k_x \rangle_{\mathcal{H}}.$$
Theorem
If $\mathcal{H}$ is a RKHS then $k$ is pos. def.
Evaluation functionals in a RKHS
If $\mathcal{H}$ is a RKHS then the evaluation functionals
$$e_x(f) = f(x)$$
are continuous, i.e.
$$|e_x(f) - e_x(\bar f)| \lesssim \big\|f - \bar f\big\|_{\mathcal{H}}, \quad \forall x \in X,$$
since
$$e_x(f) = \langle f, k_x \rangle_{\mathcal{H}}.$$
Note that $L^2(\mathbb{R}^d)$ and $C(\mathbb{R}^d)$ do not have this property!
Alternative RKHS definition
It turns out the previous property also characterizes a RKHS.
Theorem
A Hilbert space with continuous evaluation functionals is a RKHS.
Summing up
- From linear to nonlinear functions
  - using features
  - using kernels
plus
- pos. def. functions
- reproducing property
- RKHS