Lecture 1: Introduction to GPs
Richard Wilkinson
GP summer school
September 2020
Welcome to Sheffield
Introduction
Univariate Gaussian distributions
[Figure: PDF of a N(0,1) random variable (left) and CDF of a N(0,1) random variable (right).]
$$Y \sim N(\mu, \sigma^2)$$

PDF: $f_Y(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)$

CDF: $F_Y(y) = P(Y \le y)$, not known in closed form.

If $Z \sim N(0, 1)$ then $Y = \mu + \sigma Z \sim N(\mu, \sigma^2)$.
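A minimal numpy sketch of this transform (the values of µ and σ here are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5

# Draw standard normals Z ~ N(0, 1) and shift/scale: Y = mu + sigma * Z ~ N(mu, sigma^2)
z = rng.standard_normal(100_000)
y = mu + sigma * z

print(y.mean(), y.std())   # close to mu and sigma
```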
Univariate Gaussians

The normal/Gaussian distribution occurs naturally and is convenient mathematically:

- The family of normal distributions is closed under linear operations (more later).
- Central limit theorem.
- Maximum entropy/surprisal: $N(\mu, \sigma^2)$ has maximum entropy of any distribution with mean $\mu$ and variance $\sigma^2$ (max. ent. principle: the distribution with the largest entropy should be used as a least-informative default).
- Infinite divisibility.
- If Y and Z are jointly normally distributed and are uncorrelated, then they are independent.
- Squared-loss functions lead to procedures that have a Gaussian probabilistic interpretation, e.g. fitting a model $f_\beta(x)$ to data $y$ by minimising $\sum_i (y_i - f_\beta(x_i))^2$ is equivalent to maximum likelihood estimation under the assumption that $y = f_\beta(x) + \epsilon$ where $\epsilon \sim N(0, \sigma^2)$; a sketch of this equivalence follows below.
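A small numpy sketch of that last point, comparing the least-squares fit of a straight line with the maximiser of the Gaussian log-likelihood (the linear model, data, and fixed σ are invented for illustration):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(50)   # toy data: y = 2x + 1 + noise

X = np.column_stack([np.ones_like(x), x])            # design matrix for f_beta(x) = b0 + b1*x

# Least-squares estimate: minimise sum_i (y_i - f_beta(x_i))^2
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Maximum likelihood under y = f_beta(x) + eps, eps ~ N(0, sigma^2), sigma fixed
def neg_log_lik(beta, sigma=0.1):
    resid = y - X @ beta
    return 0.5 * np.sum(resid**2) / sigma**2 + len(y) * np.log(sigma)

beta_ml = minimize(neg_log_lik, x0=np.zeros(2)).x

print(beta_ls, beta_ml)   # the two estimates agree (up to optimiser tolerance)
```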
Multivariate Gaussian distributions

‘Multivariate’ = two or more random variables.

Suppose $Y \in \mathbb{R}^d$ has a multivariate Gaussian distribution with
- mean vector $\mu \in \mathbb{R}^d$
- covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$.

Write
$$Y \sim N_d(\mu, \Sigma)$$

Bivariate Gaussian: d = 2
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}, \qquad
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 \\ \rho_{21}\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$$

pdf: $f(y \mid \mu, \Sigma) = |\Sigma|^{-\frac{1}{2}} (2\pi)^{-\frac{d}{2}} \exp\left(-\frac{1}{2}(y-\mu)^\top \Sigma^{-1} (y-\mu)\right)$
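A short numpy sketch of working with this density: sample via the Cholesky factor ($Y = \mu + LZ$ with $LL^\top = \Sigma$) and check the pdf formula against scipy (the particular µ and Σ here are just for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])

# Sample via the Cholesky factor: Y = mu + L Z with L L^T = Sigma, Z ~ N(0, I)
L = np.linalg.cholesky(Sigma)
Z = rng.standard_normal((2, 5000))
Y = mu[:, None] + L @ Z
print(np.cov(Y))                       # close to Sigma

# Evaluate the density at a point using the formula above, and compare to scipy
y = np.array([0.5, 0.5])
d = len(mu)
quad = (y - mu) @ np.linalg.solve(Sigma, y - mu)
pdf = np.linalg.det(Sigma) ** -0.5 * (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * quad)
print(pdf, multivariate_normal(mu, Sigma).pdf(y))   # should match
```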
Samples from bivariate Gaussians, for various choices of µ and Σ (each figure plots draws of $(Y_1, Y_2)$):

[Figure] $\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, $\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$. Here $\mathrm{Cor}(Y_1, Y_2) = 0$, hence $Y_1$ is independent of $Y_2$.

[Figure] $\mu = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$, $\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 0.2 \end{pmatrix}$.

[Figure] $\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, $\Sigma = \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix}$.

[Figure] $\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, $\Sigma = \frac{1}{3}\begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix}$.

[Figure] $\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, $\Sigma = \begin{pmatrix} 1 & 0.99 \\ 0.99 & 1 \end{pmatrix}$.

[Figure] $\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, $\Sigma = \begin{pmatrix} 1 & 0.54 \\ 0.54 & 0.3 \end{pmatrix}$, so $\mathrm{Cor}(Y_1, Y_2) = 0.54/\sqrt{0.3} \approx 0.99$.
More pictures

[Figure: the same bivariate samples shown two ways: left, as points in the $(Y_1, Y_2)$ plane; right, each sample plotted as its values Y against the index 1, 2.]
Consider d = 5 with
$$\mu = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix}
1 & 0.99 & 0.98 & 0.97 & 0.96 \\
0.99 & 1 & 0.99 & 0.98 & 0.97 \\
0.98 & 0.99 & 1 & 0.99 & 0.98 \\
0.97 & 0.98 & 0.99 & 1 & 0.99 \\
0.96 & 0.97 & 0.98 & 0.99 & 1
\end{pmatrix}$$

[Figure: samples of Y plotted against index 1, ..., 5. Each line is one sample.]
d = 50

$$\mu = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix}
1 & 0.99 & 0.98 & 0.97 & 0.96 & \cdots \\
0.99 & 1 & 0.99 & 0.98 & 0.97 & \cdots \\
0.98 & 0.99 & 1 & 0.99 & 0.98 & \cdots \\
0.97 & 0.98 & 0.99 & 1 & 0.99 & \cdots \\
0.96 & 0.97 & 0.98 & 0.99 & 1 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{pmatrix}$$

[Figure: samples of Y plotted against index 1, ..., 50. Each line is one sample.]

We can think of Gaussian processes as an infinite dimensional distribution over functions - all we need to do is change the indexing.
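A numpy sketch that reproduces this kind of picture. The covariance is taken to decay linearly with the index gap, $\Sigma_{ij} = 1 - 0.01|i - j|$, which matches the pattern 1, 0.99, 0.98, ... above (an assumption on my part about how the matrix continues):

```python
import numpy as np
import matplotlib.pyplot as plt

d = 50
idx = np.arange(d)

# Covariance with entries 1 - 0.01*|i - j|, matching the pattern 1, 0.99, 0.98, ...
Sigma = 1.0 - 0.01 * np.abs(idx[:, None] - idx[None, :])
mu = np.zeros(d)

# Draw 10 samples via the Cholesky factor (small jitter for numerical stability)
rng = np.random.default_rng(3)
L = np.linalg.cholesky(Sigma + 1e-6 * np.eye(d))
samples = mu + (L @ rng.standard_normal((d, 10))).T

plt.plot(idx + 1, samples.T)    # each line is one sample
plt.xlabel("index")
plt.ylabel("Y")
plt.show()
```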
Gaussian processes

A stochastic process is a collection of random variables indexed by some variable $x \in \mathcal{X}$:
$$y = \{y(x) : x \in \mathcal{X}\}$$

Usually $y(x) \in \mathbb{R}$ and $\mathcal{X} \subset \mathbb{R}^n$, i.e. y can be thought of as a function of x.

If $\mathcal{X} = \mathbb{R}^n$, then y is an infinite dimensional process.

Thankfully, to understand the law of y we only need consider the finite dimensional distributions (FDDs), i.e. the distributions of $(y(x_1), \ldots, y(x_n))$ for all $x_1, \ldots, x_n$ and for all $n \in \mathbb{N}$. y is a Gaussian process if all of its FDDs are multivariate Gaussian:
$$(y(x_1), \ldots, y(x_n)) \sim N(\mu, \Sigma)$$
Mean and covariance function

To fully specify the law of a Gaussian distribution we only need the mean and variance. Similarly, to specify the law of a GP we need a mean function $m(x) = \mathbb{E}(y(x))$ and a covariance function $k(x, x') = \mathrm{Cov}(y(x), y(x'))$; the FDDs are then
$$(y(x_1), \ldots, y(x_n)) \sim N(\mu, \Sigma) \quad \text{where} \quad \mu_i = m(x_i), \;\; \Sigma_{ij} = k(x_i, x_j).$$

We are free to choose the mean $\mathbb{E}(y(x))$ and covariance $\mathrm{Cov}(y(x), y(x'))$ functions however we like (e.g. trial and error), subject to some ‘rules’.
Specifying the mean function

We can use any mean function we want; the ‘rules’ mainly constrain the covariance function.

Examples

RBF/Squared-exponential/exponentiated quadratic:
$$k(x, x') = \exp\left(-\frac{1}{2}\,\frac{(x - x')^2}{0.25^2}\right)$$
RBF/Squared-exponential/exponentiated quadratic:
$$k(x, x') = \exp\left(-\frac{1}{2}\,\frac{(x - x')^2}{4^2}\right)$$

RBF/Squared-exponential/exponentiated quadratic:
$$k(x, x') = 100 \exp\left(-\frac{1}{2}(x - x')^2\right)$$

Matern 3/2:
$$k(x, x') \sim (1 + |x - x'|)\exp\left(-|x - x'|\right)$$

Brownian motion:
$$k(x, x') = \min(x, x')$$

White noise:
$$k(x, x') = \begin{cases} 1 & \text{if } x = x' \\ 0 & \text{otherwise} \end{cases}$$
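A numpy sketch of these covariance functions, used to draw sample paths by evaluating each kernel on a grid and sampling from the resulting multivariate Gaussian (the grid, jitter, and number of paths are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

def rbf(x, xp, lengthscale=1.0, variance=1.0):
    return variance * np.exp(-0.5 * (x[:, None] - xp[None, :])**2 / lengthscale**2)

def matern32(x, xp):
    r = np.abs(x[:, None] - xp[None, :])
    return (1.0 + r) * np.exp(-r)

def brownian(x, xp):
    return np.minimum(x[:, None], xp[None, :])

def white(x, xp):
    return (x[:, None] == xp[None, :]).astype(float)

rng = np.random.default_rng(4)
xs = np.linspace(0.01, 5, 200)   # start above 0 so the Brownian covariance matrix is positive definite

for name, K in [("RBF, lengthscale 0.25", rbf(xs, xs, 0.25)),
                ("Matern 3/2", matern32(xs, xs)),
                ("Brownian motion", brownian(xs, xs)),
                ("White noise", white(xs, xs))]:
    L = np.linalg.cholesky(K + 1e-6 * np.eye(len(xs)))   # jitter for numerical stability
    samples = L @ rng.standard_normal((len(xs), 3))      # three sample paths
    plt.plot(xs, samples)
    plt.title(name)
    plt.show()
```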
Examples

The GP inherits its properties primarily from the covariance function k:
- Smoothness
- Differentiability
- Variance

A final example:
$$k(x, x') = x^\top x'$$
What is happening?

Suppose $y(x) = cx$ where $c \sim N(0, 1)$. Then
$$\mathrm{Cov}(y(x), y(x')) = \mathrm{Cov}(cx, cx') = x^\top \mathrm{Cov}(c, c)\, x' = x^\top x'$$
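A quick Monte Carlo check of this construction in numpy (the grid and sample count are arbitrary): draws of $y(x) = cx$ with $c \sim N(0, 1)$ have empirical covariance close to the outer product of the grid with itself, i.e. the linear kernel.

```python
import numpy as np

rng = np.random.default_rng(5)
xs = np.linspace(-2, 2, 7)

# Empirical covariance of y(x) = c*x with c ~ N(0, 1)
c = rng.standard_normal(100_000)
Y = c[:, None] * xs[None, :]             # each row is one sample path evaluated on the grid
print(np.cov(Y, rowvar=False).round(2))  # approx. the outer product of xs with itself
print(np.outer(xs, xs).round(2))         # the linear kernel k(x, x') = x x' on the grid
```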
Property 2: Conditional distributions are still Gaussian

Suppose
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} \sim N_2(\mu, \Sigma)
\quad \text{where} \quad
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$

Then
$$Y_2 \mid Y_1 = y_1 \sim N\left(\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(y_1 - \mu_1),\; \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\right)$$
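A direct numpy implementation of this conditioning formula (the numbers are made up; the blocks here are 1x1 but the same code works for larger blocks):

```python
import numpy as np

def condition_gaussian(mu1, mu2, S11, S12, S21, S22, y1):
    """Mean and covariance of Y2 | Y1 = y1 for jointly Gaussian (Y1, Y2)."""
    cond_mean = mu2 + S21 @ np.linalg.solve(S11, y1 - mu1)
    cond_cov = S22 - S21 @ np.linalg.solve(S11, S12)
    return cond_mean, cond_cov

# Example: the bivariate Gaussian with correlation 0.9 from earlier
mu1, mu2 = np.array([0.0]), np.array([0.0])
S11, S12 = np.array([[1.0]]), np.array([[0.9]])
S21, S22 = np.array([[0.9]]), np.array([[1.0]])

m, v = condition_gaussian(mu1, mu2, S11, S12, S21, S22, y1=np.array([1.5]))
print(m, v)   # mean 0.9 * 1.5 = 1.35, variance 1 - 0.81 = 0.19
```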
Proof:
$$\pi(y_2 \mid y_1) = \frac{\pi(y_1, y_2)}{\pi(y_1)} \propto \pi(y_1, y_2)$$
$$\propto \exp\left(-\frac{1}{2}(y - \mu)^\top \Sigma^{-1}(y - \mu)\right)$$
$$= \exp\left(-\frac{1}{2}
\begin{pmatrix} y_1 - \mu_1 \\ y_2 - \mu_2 \end{pmatrix}^\top
\begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix}
\begin{pmatrix} y_1 - \mu_1 \\ y_2 - \mu_2 \end{pmatrix}\right)$$
$$\propto \exp\left(-\frac{1}{2}\left[(y_2 - \mu_2)^\top Q_{22}(y_2 - \mu_2) + 2(y_2 - \mu_2)^\top Q_{21}(y_1 - \mu_1)\right]\right)$$
where
$$\Sigma^{-1} := Q := \begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix}$$
So $Y_2 \mid Y_1 = y_1$ is Gaussian.
$$\pi(y_2 \mid y_1) \propto \exp\left(-\frac{1}{2}\left[(y_2 - \mu_2)^\top Q_{22}(y_2 - \mu_2) + 2(y_2 - \mu_2)^\top Q_{21}(y_1 - \mu_1)\right]\right)$$
$$\propto \exp\left(-\frac{1}{2}\left[y_2^\top Q_{22} y_2 - 2 y_2^\top \big(Q_{22}\mu_2 - Q_{21}(y_1 - \mu_1)\big)\right]\right)$$
$$\propto \exp\left(-\frac{1}{2}\left(y_2 - Q_{22}^{-1}\big(Q_{22}\mu_2 - Q_{21}(y_1 - \mu_1)\big)\right)^\top Q_{22}\left(y_2 - \ldots\right)\right)$$

So
$$Y_2 \mid Y_1 = y_1 \sim N\left(\mu_2 - Q_{22}^{-1}Q_{21}(y_1 - \mu_1),\; Q_{22}^{-1}\right)$$

giving, via the block matrix inverse identities $Q_{22}^{-1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$ and $Q_{22}^{-1}Q_{21} = -\Sigma_{21}\Sigma_{11}^{-1}$,
$$Y_2 \mid Y_1 = y_1 \sim N\left(\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(y_1 - \mu_1),\; \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\right)$$
Conditional updates of Gaussian processes

So suppose f is a Gaussian process; then
$$f(x_1), \ldots, f(x_n), f(x) \sim N_{n+1}(\mu, \Sigma)$$

If we observe its value at $x_1, \ldots, x_n$ then
$$f(x) \mid f(x_1), \ldots, f(x_n) \sim N(\mu^*, \sigma^*)$$
where $\mu^*$ and $\sigma^*$ are as on the previous slide.

Note that we still believe f is a GP even though we've observed its value at a number of locations.
Why use GPs? Answer 1

The GP class of models is closed under various operations.

- Closed under addition.
- Closed under conditioning on data: if we observe
  $$D = (f(x_1), \ldots, f(x_n))$$
  then $f \mid D \sim GP$, but with updated mean and covariance functions.
- Closed under any linear operator: if $f \sim GP(m(\cdot), k(\cdot, \cdot))$ and L is a linear operator, then
  $$L \circ f \sim GP(L \circ m, L^2 \circ k)$$
  e.g. $\frac{df}{dx}$, $\int f(x)\,dx$, $Af$ are all GPs.
Conditional updates of Gaussian processes - revisited

Suppose f is a zero-mean Gaussian process with covariance function k; then
$$f(x_1), \ldots, f(x_n), f(x) \sim N_{n+1}(0, \Sigma)$$
where
$$\Sigma = \begin{pmatrix}
k(x_1, x_1) & \cdots & k(x_1, x_n) & k(x_1, x) \\
\vdots & \ddots & \vdots & \vdots \\
k(x_n, x_1) & \cdots & k(x_n, x_n) & k(x_n, x) \\
k(x, x_1) & \cdots & k(x, x_n) & k(x, x)
\end{pmatrix}
= \begin{pmatrix} K_{XX} & k_X(x) \\ k_X(x)^\top & k(x, x) \end{pmatrix}$$
i.e. $K_{XX}$ is the $n \times n$ matrix with entries $k(x_i, x_j)$, and $k_X(x) = (k(x_1, x), \ldots, k(x_n, x))^\top$.
Conditional updates of Gaussian processes - revisited

Then
$$f(x) \mid f(x_1), \ldots, f(x_n) \sim N(\bar{m}(x), \bar{k}(x))$$
where
$$\bar{m}(x) = k_X(x)^\top K_{XX}^{-1}\mathbf{f}
\qquad \text{and} \qquad
\bar{k}(x) = k(x, x) - k_X(x)^\top K_{XX}^{-1} k_X(x)$$
with $\mathbf{f} = (f(x_1), \ldots, f(x_n))^\top$.

Cf. the general case with a non-zero prior mean function m:
$$\bar{m}(x) = m(x) + k_X(x)^\top K_{XX}^{-1}(\mathbf{f} - m_X), \qquad
\bar{k}(x, x') = k(x, x') - k_X(x)^\top K_{XX}^{-1} k_X(x')$$
where $m_X = (m(x_1), \ldots, m(x_n))^\top$.
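A numpy sketch of these noise-free conditioning formulas, using a squared-exponential kernel (the observed points and test grid are invented for illustration):

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / lengthscale**2)

# Observed values of f at a few inputs (toy data)
X = np.array([0.0, 1.0, 2.5, 4.0])
f = np.sin(X)

xs = np.linspace(0, 5, 200)            # test locations

KXX = rbf(X, X)
kXs = rbf(X, xs)                       # column j is k_X(x) at the j-th test point

# Posterior mean and variance:
#   m_bar(x) = k_X(x)^T K_XX^{-1} f,   k_bar(x) = k(x, x) - k_X(x)^T K_XX^{-1} k_X(x)
m_bar = kXs.T @ np.linalg.solve(KXX, f)
k_bar = 1.0 - np.sum(kXs * np.linalg.solve(KXX, kXs), axis=0)   # k(x, x) = 1 for this kernel

print(m_bar[:5], k_bar[:5])
# The posterior interpolates: at the observed inputs m_bar = f and k_bar = 0 (up to round-off).
```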
No noise/nugget - Interpolation

[Figure] Solid line: $\bar{m}(x) = k_X(x)^\top K_{XX}^{-1}\mathbf{f}$. Shaded region: $\bar{m}(x) \pm 1.96\sqrt{\bar{k}(x)}$, where $\bar{k}(x) = k(x, x) - k_X(x)^\top K_{XX}^{-1} k_X(x)$.
Noisy observations/with nugget - Regression

In practice, we don't usually observe f(x) directly. If we observe
$$y_i = f(x_i) + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2)$$
then $y_1, \ldots, y_n, f(x) \sim N_{n+1}(0, \Sigma)$ where
$$\Sigma = \begin{pmatrix} K_{XX} + \sigma^2 I & k_X(x) \\ k_X(x)^\top & k(x, x) \end{pmatrix}$$

Then
$$f(x) \mid y_1, \ldots, y_n \sim N(\bar{m}(x), \bar{k}(x))$$
where
$$\bar{m}(x) = k_X(x)^\top (K_{XX} + \sigma^2 I)^{-1} y, \qquad
\bar{k}(x) = k(x, x) - k_X(x)^\top (K_{XX} + \sigma^2 I)^{-1} k_X(x)$$
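The same conditioning code as before, but with the noise variance σ² added to the training covariance (a sketch; the kernel, σ, and data are again illustrative choices):

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / lengthscale**2)

rng = np.random.default_rng(6)
X = np.linspace(0, 5, 8)
sigma = 0.1
y = np.sin(X) + sigma * rng.standard_normal(len(X))   # noisy observations y_i = f(x_i) + eps_i

xs = np.linspace(0, 5, 200)
K_noisy = rbf(X, X) + sigma**2 * np.eye(len(X))
kXs = rbf(X, xs)

m_bar = kXs.T @ np.linalg.solve(K_noisy, y)
k_bar = 1.0 - np.sum(kXs * np.linalg.solve(K_noisy, kXs), axis=0)   # k(x, x) = 1 here

# 95% band for f(x): m_bar(x) +/- 1.96 sqrt(k_bar(x))
band = 1.96 * np.sqrt(np.clip(k_bar, 0.0, None))
print(m_bar[:3], (m_bar - band)[:3], (m_bar + band)[:3])
```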
Nugget standard deviation σ = 0.1

[Figure] Solid line: $\bar{m}(x) = k_X(x)^\top (K_{XX} + \sigma^2 I)^{-1} y$. Shaded region: $\bar{m}(x) \pm 1.96\sqrt{\bar{k}(x)}$, where $\bar{k}(x) = k(x, x) - k_X(x)^\top (K_{XX} + \sigma^2 I)^{-1} k_X(x)$.

Nugget standard deviation σ = 0.025

[Figure] The same quantities plotted with the smaller nugget σ = 0.025.
If the mean is a linear combination of known regressor functions, then $y \mid D \sim$ t-process.
Why use GPs? Answer 2: non-parametric/kernel regression

We can also view GPs as a non-parametric extension to linear regression. k determines the space of functions that sample paths live in.

Suppose we're given data $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^p$, $y_i \in \mathbb{R}$, and we fit the linear model $y = X\beta + \epsilon$ with the regularised least-squares estimator¹
$$\hat{\beta} = (X^\top X + \sigma^2 I)^{-1} X^\top y
\quad \text{where} \quad
X = \begin{pmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{pmatrix}$$

Rewriting this in the dual form,
$$\hat{\beta} = X^\top (X X^\top + \sigma^2 I)^{-1} y$$
This is useful!

¹ Tikhonov regularisation/the Bayesian MAP estimator with a normal prior on β.
Prediction

The best prediction of y at a new location x' is
$$\hat{y}' = x'^\top\hat{\beta} = x'^\top X^\top (X X^\top + \sigma^2 I)^{-1} y = k_X(x')^\top (K_{XX} + \sigma^2 I)^{-1} y$$
where $K_{XX} = X X^\top$ and $k_X(x') = X x'$ - the same form as the GP posterior mean, with the linear kernel $k(x', x) = x'^\top x$.
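A numpy check that the primal and dual forms give the same prediction, and that the dual form touches the inputs only through inner products (the toy data and σ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma = 30, 3, 0.5
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)
x_new = rng.standard_normal(p)

# Primal ridge estimate: beta_hat = (X^T X + sigma^2 I)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X + sigma**2 * np.eye(p), X.T @ y)
pred_primal = x_new @ beta_hat

# Dual / kernel form: only inner products of the inputs appear
K = X @ X.T                        # K_XX with the linear kernel
k_new = X @ x_new                  # k_X(x') with the linear kernel
pred_dual = k_new @ np.linalg.solve(K + sigma**2 * np.eye(n), y)

print(pred_primal, pred_dual)      # identical up to round-off
```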
Including features

To go beyond linear functions of x, the linear kernel
$$k(x', x) = x'^\top x$$
is replaced by
$$k(x', x) = \phi(x')^\top \phi(x)$$
for some feature vector φ(x).
Including features II

For some sets of features φ(x), computation of the inner product doesn't require us to evaluate the individual features.

E.g., consider $\mathcal{X} = \mathbb{R}^2$ and let
$$\phi : x = (x_1, x_2) \mapsto (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, \sqrt{2}x_1 x_2, x_2^2)^\top$$
i.e., linear regression using all the linear and quadratic terms, and first order interactions.

Then²
$$k(x, x') = \phi(x)^\top\phi(x') = (1 + x^\top x')^2$$
so the kernel can be evaluated without ever forming the six features.

² I'm being sloppy - really we should write this as an inner product $k(x, x') = \langle\phi(x), \phi(x')\rangle$.
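A quick numeric check of this identity (the test points are arbitrary):

```python
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2)*x1, np.sqrt(2)*x2, x1**2, np.sqrt(2)*x1*x2, x2**2])

x  = np.array([0.3, -1.2])
xp = np.array([2.0,  0.7])

print(phi(x) @ phi(xp))     # explicit six-dimensional feature map
print((1 + x @ xp)**2)      # kernel evaluation: same value, no features needed
```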
Including features III

Example: If $\mathcal{X} = [0, 1]$ and $c_0 = 0, c_1 = \frac{1}{N}, c_2 = \frac{2}{N}, \ldots, c_N = 1$, then (modulo some detail) if
$$\phi(x) \propto \left(e^{-\frac{(x - c_0)^2}{2\lambda^2}}, \ldots, e^{-\frac{(x - c_N)^2}{2\lambda^2}}\right)$$
then as $N \to \infty$
$$\phi(x)^\top\phi(x') = \exp\left(-\frac{(x - x')^2}{2\lambda^2}\right)$$

We can use an infinite dimensional feature vector φ(x), and because linear regression can be done solely in terms of inner products (inverting an n × n matrix in the dual form) we never need evaluate the feature vector, only the kernel.
Kernel trick

Generally, we don't think about these features, we just choose a kernel; $k(x, x')$ is a kernel iff it is a positive semidefinite function.

Functions live in function spaces (vector spaces with inner products). There are lots of different function spaces: the GP kernel implicitly determines this space - our hypothesis space.

We can write $k(x, x') = \phi(x)^\top\phi(x')$ for some feature vector φ(x), and our model only includes functions that are linear combinations of this set of features³:
$$f(x) = \sum_i c_i k(x, x_i)$$
This space of functions is called the Reproducing Kernel Hilbert Space (RKHS) of k.

Although reality may not lie in the RKHS defined by k, this space is much richer than any parametric regression model (and can be dense in some sets of continuous bounded functions), and is thus more likely to contain an element close to the true functional form than any class of models that contains only a finite number of features. This is the motivation for non-parametric methods.

³ Not quite - it lies in the completion of this set of linear combinations.
Why use GPs? Answer 3: Naturalness of GP framework

Suppose all we assume about Y is its first and second moments:
$$\mathbb{E}Y(x) = \mu \;\;\forall x, \qquad \mathrm{Cov}(Y(x), Y(x')) = k(x - x') \;\;\forall x, x'$$
and consider linear predictors of the form $\hat{Y}(x) = c + w^\top y$. Unbiasedness requires
$$\mu = \mathbb{E}\hat{Y}(x) = \mathbb{E}(c + w^\top y) = c + w^\top\mu$$
so that
$$\hat{Y}(x) = \mu + w^\top(y - \mu)$$

Best Linear Unbiased Predictors (BLUP) - II

The best linear unbiased predictor minimises the mean square error, and thus
$$\hat{Y}(x) = \mu + k_X(x)^\top K_{XX}^{-1}(y - \mu)$$
as before.

So the Gaussian process posterior mean is optimal (i.e. is the BLUP) even if we don't assume a Gaussian distribution.
Why use GPs? Answer 4: Uncertainty estimates from emulators

We often think of our prediction as consisting of two parts:
- a point estimate
- the uncertainty in that estimate

That GPs come equipped with the uncertainty in their prediction is seen as one of their main advantages. It is important to check both aspects.

Warning: the uncertainty estimates from a GP can be flawed. Note that given data $D = \{X, y\}$,
$$\mathrm{Var}(f(x) \mid X, y) = k(x, x) - k_X(x)^\top K_{XX}^{-1} k_X(x)$$
so the posterior variance of f(x) does not depend upon y!

The variance estimates are particularly sensitive to the hyper-parameter estimates.
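A tiny numpy illustration of the warning: two very different observation vectors at the same inputs give identical posterior variances (the kernel and inputs are arbitrary):

```python
import numpy as np

def rbf(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2)

X = np.array([0.0, 1.0, 2.0])
xs = np.linspace(0, 3, 5)
KXX = rbf(X, X) + 1e-8 * np.eye(len(X))
kXs = rbf(X, xs)

def posterior_var(y):
    # y is unused: Var(f(x) | X, y) = k(x, x) - k_X(x)^T K_XX^{-1} k_X(x)
    return 1.0 - np.sum(kXs * np.linalg.solve(KXX, kXs), axis=0)

print(posterior_var(np.array([0.0, 0.0, 0.0])))
print(posterior_var(np.array([5.0, -3.0, 10.0])))   # same variances, regardless of y
```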
Difficulties of using GPs

Random (Fourier) feature approximations:
$$k(x, x') \approx \frac{1}{m}\sum_{i=1}^{m}\big(\cos(w_i^\top x),\; \sin(w_i^\top x)\big)
\begin{pmatrix}\cos(w_i^\top x') \\ \sin(w_i^\top x')\end{pmatrix}
\quad \text{if } w_i \sim p(\cdot)$$
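A sketch of this random-feature idea for the squared-exponential kernel, whose spectral density is Gaussian (the choice of p(·) as that spectral density, and the values of m, the lengthscale, and the test points, are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(8)
lengthscale, m = 1.0, 2000

# For k(x, x') = exp(-(x - x')^2 / (2 l^2)) in 1d, sample frequencies w ~ N(0, 1/l^2)
w = rng.normal(0.0, 1.0 / lengthscale, size=m)

def features(x):
    # 2m random Fourier features; k(x, x') is approximated by features(x) @ features(x')
    return np.concatenate([np.cos(w * x), np.sin(w * x)]) / np.sqrt(m)

x, xp = 0.3, 1.1
approx = features(x) @ features(xp)
exact = np.exp(-(x - xp)**2 / (2 * lengthscale**2))
print(approx, exact)    # close for large m
```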