
An introduction to Gaussian Processes

Richard Wilkinson

School of Mathematical Sciences


University of Nottingham

GP summer school
September 2020
Welcome to Sheffield
Introduction

(Multivariate) Gaussian distributions


Definition of Gaussian processes
Motivations and derivations
Difficulties
You can download a copy of these slides from www.gpss.cc
Univariate Gaussian distributions

[Figure: PDF and CDF of a N(0,1) random variable.]

Y ∼ N(µ, σ²)

PDF: f_Y(y) = (1/√(2πσ²)) exp( −(y − µ)²/(2σ²) )
CDF: F_Y(y) = P(Y ≤ y), not known in closed form

If Z ∼ N(0, 1) then Y = µ + σZ ∼ N(µ, σ²)
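A minimal sketch of this location-scale relation in NumPy (the library choice, seed, and the particular µ and σ are my own, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5

# Draw Z ~ N(0, 1) and transform: Y = mu + sigma * Z ~ N(mu, sigma^2)
z = rng.standard_normal(100_000)
y = mu + sigma * z

print(y.mean(), y.std())  # close to mu and sigma
```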
Univariate Gaussians
The normal/Gaussian distribution occurs naturally and is convenient mathematically:
The family of normal distributions is closed under linear operations (more later).
Central limit theorem.
Maximum entropy/surprisal: N(µ, σ²) has the maximum entropy of any distribution with mean µ and variance σ² (max. ent. principle: the distribution with the largest entropy should be used as a least-informative default).
Infinite divisibility.
If Y and Z are jointly normally distributed and are uncorrelated, then they are independent.
Square-loss functions lead to procedures that have a Gaussian probabilistic interpretation, e.g. fitting a model f_β(x) to data y by minimizing Σ_i (y_i − f_β(x_i))² is equivalent to maximum likelihood estimation under the assumption that y = f_β(x) + ε where ε ∼ N(0, σ²).
Multivariate Gaussian distributions
'Multivariate' = two or more random variables.
Suppose Y ∈ R^d has a multivariate Gaussian distribution with
  mean vector µ ∈ R^d
  covariance matrix Σ ∈ R^(d×d).
Write
  Y ∼ N_d(µ, Σ)
Bivariate Gaussian: d = 2

  Y = (Y1, Y2)^T   µ = (µ1, µ2)^T   Σ = [ σ1²        ρ12 σ1 σ2 ]
                                        [ ρ21 σ1 σ2  σ2²       ]

  Var(Y_i) = σ_i²   Cov(Y_i, Y_j) = ρ_ij σ_i σ_j   Cor(Y_i, Y_j) = ρ12 for i ≠ j

pdf: f(y | µ, Σ) = |Σ|^(−1/2) (2π)^(−d/2) exp( −½ (y − µ)^T Σ^(−1) (y − µ) )
 
[Figures: scatter plots of samples (Y1, Y2) from bivariate Gaussians with]
µ = (0, 0)^T, Σ = [1 0; 0 1]: Cor(Y1, Y2) = 0, hence Y1 is independent of Y2
µ = (0, 1)^T, Σ = [1 0; 0 0.2]
µ = (0, 0)^T, Σ = [1 0.9; 0.9 1]
µ = (0, 0)^T, Σ = (1/3) [1 0.9; 0.9 1]
µ = (0, 0)^T, Σ = [1 0.99; 0.99 1]
µ = (0, 0)^T, Σ = [1 0.54; 0.54 0.3]: Cor(Y1, Y2) = 0.54/√(1 × 0.3) ≈ 0.99
More pictures
Hard to visualise in dimensions > 2, so stack the points next to each other.
So for 2d, instead of a scatter plot of (Y1, Y2), we plot each sample as Y against its index.
[Figures: scatter plot of Y1 vs Y2, and the same samples plotted as Y against index.]
Consider d = 5 with

  µ = (0, 0, 0, 0, 0)^T   Σ = [ 1     0.99  0.98  0.97  0.96 ]
                              [ 0.99  1     0.99  0.98  0.97 ]
                              [ 0.98  0.99  1     0.99  0.98 ]
                              [ 0.97  0.98  0.99  1     0.99 ]
                              [ 0.96  0.97  0.98  0.99  1    ]

[Figure: samples of Y plotted against index 1, ..., 5.]
Each line is one sample.


d = 50

  µ = (0, ..., 0)^T   Σ = [ 1     0.99  0.98  0.97  0.96  ... ]
                          [ 0.99  1     0.99  0.98  0.97  ... ]
                          [ 0.98  0.99  1     0.99  0.98  ... ]
                          [ 0.97  0.98  0.99  1     0.99  ... ]
                          [ 0.96  0.97  0.98  0.99  1     ... ]
                          [ ...   ...   ...   ...   ...   ... ]

[Figure: samples of Y plotted against index 1, ..., 50.]
Each line is one sample.
We can think of Gaussian processes as an infinite-dimensional distribution over functions - all we need to do is change the indexing.
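A minimal sketch reproducing this picture (NumPy/Matplotlib; the 1 − 0.01|i − j| construction is one way to obtain the pattern of entries shown above, and the jitter, seed, and plotting details are my own choices):

```python
import numpy as np
import matplotlib.pyplot as plt

d = 50
idx = np.arange(d)
# Covariance with entries 1, 0.99, 0.98, ... decaying with |i - j|, as on the slide
Sigma = 1.0 - 0.01 * np.abs(idx[:, None] - idx[None, :])
Sigma += 1e-9 * np.eye(d)  # tiny jitter for numerical stability

rng = np.random.default_rng(1)
samples = rng.multivariate_normal(mean=np.zeros(d), cov=Sigma, size=5)

for y in samples:  # each line is one sample
    plt.plot(idx + 1, y)
plt.xlabel("index")
plt.ylabel("Y")
plt.show()
```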
Gaussian processes
A stochastic process is a collection of random variables indexed by some variable x ∈ X:
  y = {y(x) : x ∈ X}
Usually y(x) ∈ R and X ⊂ R^n, i.e. y can be thought of as a function of x.
If X = R^n, then y is an infinite-dimensional process.
Thankfully, to understand the law of y we only need to consider the finite dimensional distributions (FDDs), i.e., for all x1, ..., xn and for all n ∈ N,
  P(y(x1) ≤ c1, ..., y(xn) ≤ cn),
as these uniquely determine the law of y.
A Gaussian process is a stochastic process with Gaussian FDDs, i.e.,
  (y(x1), ..., y(xn)) ∼ N_n(µ, Σ).
We write y(·) ∼ GP to denote that the function y is a GP.


Mean and covariance function
To fully specify the law of a Gaussian distribution we only need the mean and covariance:
  X ∼ N(µ, Σ)
To fully specify the law of a Gaussian process, we need to specify mean and covariance functions:
  y(·) ∼ GP(m(·), k(·, ·))
where
  E(y(x)) = m(x)
  Cov(y(x), y(x')) = k(x, x')
Specifying the mean function
We are free to choose the mean E(y(x)) and covariance Cov(y(x), y(x')) functions however we like (e.g. trial and error), subject to some 'rules'.
We can use any mean function we want:
  m(x) = E(y(x))
The most popular choices are m(x) = 0 or m(x) = const for all x, or
  m(x) = β^T x
Covariance functions
We usually use a covariance function that is a function of the indexes/locations:
  k(x, x') = Cov(y(x), y(x')).
k must be a positive semi-definite function, i.e., it must lead to valid covariance matrices:
  given locations x1, ..., xn, the n × n Gram matrix K with K_ij = k(xi, xj) must be a positive semi-definite matrix.
This can be problematic (see Nicolas' talk).
We often assume k is a function of only the distance between locations,
  Cov(y(x), y(x')) = k(x − x'),
which results in a stationary process.
If Cov(y(x), y(x')) = k(||x − x'||) the covariance function is said to be isotropic.
The covariance function determines the nature of the GP:
k determines the hypothesis space/space of functions.
Examples
RBF/squared-exponential/exponentiated quadratic:
  k(x, x') = exp( −½ (x − x')² )
  k(x, x') = exp( −½ (x − x')²/0.25² )
  k(x, x') = exp( −½ (x − x')²/4² )
  k(x, x') = 100 exp( −½ (x − x')² )
Matérn 3/2:
  k(x, x') ∝ (1 + |x − x'|) exp( −|x − x'| )
Brownian motion:
  k(x, x') = min(x, x')
White noise:
  k(x, x') = 1 if x = x', 0 otherwise
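A minimal sketch of a few of these covariance functions, together with the Gram-matrix check that they give positive semi-definite matrices (NumPy; the function names, input grid, and default hyper-parameters are my own choices):

```python
import numpy as np

def rbf(x, xp, lengthscale=1.0, variance=1.0):
    # squared-exponential / exponentiated quadratic
    return variance * np.exp(-0.5 * (x - xp) ** 2 / lengthscale ** 2)

def matern32(x, xp):
    # Matern 3/2 (up to the usual length-scale rescaling)
    d = np.abs(x - xp)
    return (1.0 + d) * np.exp(-d)

def brownian(x, xp):
    return np.minimum(x, xp)

x = np.linspace(0.01, 1, 50)
for k in (rbf, matern32, brownian):
    K = k(x[:, None], x[None, :])       # Gram matrix K_ij = k(x_i, x_j)
    eigvals = np.linalg.eigvalsh(K)
    print(k.__name__, "min eigenvalue:", eigvals.min())  # >= 0, up to round-off
```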
Examples
The GP inherits its properties primarily from the covariance function k:
  smoothness
  differentiability
  variance.
A final example:
  k(x, x') = x^T x'
What is happening? Suppose y(x) = cx where c ∼ N(0, 1). Then
  Cov(y(x), y(x')) = Cov(cx, cx')
                   = x^T Cov(c, c) x'
                   = x^T x'
So y(·) ∼ GP(0, k(x, x')) with k(x, x') = x^T x'.
Why use Gaussian processes?
Why would we want to use this very restricted class of model?
Gaussian distributions have several properties that make them easy to work with.
Proposition: Y ∼ N_d(µ, Σ) if and only if AY ∼ N_p(Aµ, AΣA^T) for all A ∈ R^(p×d).
So sums of Gaussians are Gaussian, and marginal distributions of multivariate Gaussians are still Gaussian.
Corollary: Σ must be positive semi-definite, as a^T Σ a ≥ 0 for all a ∈ R^d.
Conversely, any matrix Σ which is positive semi-definite is a valid covariance matrix:
  if Z ∼ N_d(0_d, I_d) then Y = µ + Σ^(1/2) Z ∼ N_d(µ, Σ),
where Σ^(1/2) is a matrix square root of Σ.
This gives one way of generating multivariate Gaussians.
Property 2: Conditional distributions are still Gaussian
Suppose
  Y = (Y1, Y2)^T ∼ N_2(µ, Σ)
where
  µ = (µ1, µ2)^T   Σ = [ Σ11  Σ12 ]
                       [ Σ21  Σ22 ]
Then
  Y2 | Y1 = y1 ∼ N( µ2 + Σ21 Σ11^(−1) (y1 − µ1),  Σ22 − Σ21 Σ11^(−1) Σ12 )
Proof:
  π(y2 | y1) = π(y1, y2)/π(y1) ∝ π(y1, y2)
             ∝ exp( −½ (y − µ)^T Σ^(−1) (y − µ) )
             = exp( −½ [ (y1 − µ1); (y2 − µ2) ]^T [ Q11  Q12; Q21  Q22 ] ··· )
             ∝ exp( −½ [ (y2 − µ2)^T Q22 (y2 − µ2) + 2 (y2 − µ2)^T Q21 (y1 − µ1) ] )
where
  Σ^(−1) := Q := [ Q11  Q12 ]
                 [ Q21  Q22 ]
So Y2 | Y1 = y1 is Gaussian.
  π(y2 | y1) ∝ exp( −½ [ (y2 − µ2)^T Q22 (y2 − µ2) + 2 (y2 − µ2)^T Q21 (y1 − µ1) ] )
             ∝ exp( −½ [ y2^T Q22 y2 − 2 y2^T (Q22 µ2 − Q21 (y1 − µ1)) ] )
             ∝ exp( −½ ( y2 − Q22^(−1)(Q22 µ2 − Q21 (y1 − µ1)) )^T Q22 ( y2 − ... ) )
So
  Y2 | Y1 = y1 ∼ N( µ2 − Q22^(−1) Q21 (y1 − µ1),  Q22^(−1) )
A standard matrix inversion lemma gives
  Q22^(−1) = Σ22 − Σ21 Σ11^(−1) Σ12   and   Q22^(−1) Q21 = −Σ21 Σ11^(−1),
giving
  Y2 | Y1 = y1 ∼ N( µ2 + Σ21 Σ11^(−1) (y1 − µ1),  Σ22 − Σ21 Σ11^(−1) Σ12 )
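As a sanity check of this result, a small Monte Carlo sketch (NumPy; the particular µ, Σ, and conditioning value are arbitrary choices of mine): draw many samples, keep those whose first coordinate lies close to y1, and compare the empirical mean and variance of Y2 with the formula.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

y1 = 1.5
# Theoretical conditional moments of Y2 | Y1 = y1
cond_mean = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (y1 - mu[0])
cond_var = Sigma[1, 1] - Sigma[1, 0] / Sigma[0, 0] * Sigma[0, 1]

# Monte Carlo: keep samples with Y1 close to y1
samples = rng.multivariate_normal(mu, Sigma, size=2_000_000)
near = samples[np.abs(samples[:, 0] - y1) < 0.01, 1]
print(cond_mean, near.mean())  # should agree closely
print(cond_var, near.var())    # should agree closely
```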
Conditional updates of Gaussian processes
So suppose f is a Gaussian process; then
  (f(x1), ..., f(xn), f(x)) ∼ N_(n+1)(µ, Σ).
If we observe its value at x1, ..., xn, then
  f(x) | f(x1), ..., f(xn) ∼ N(µ*, σ*)
where µ* and σ* are as on the previous slide.
Note that we still believe f is a GP even though we've observed its value at a number of locations.
Why use GPs? Answer 1
The GP class of models is closed under various operations.
Closed under addition:
  if f1(·), f2(·) ∼ GP then (f1 + f2)(·) ∼ GP.
Closed under Bayesian conditioning, i.e., if we observe
  D = (f(x1), ..., f(xn))
then
  f | D ∼ GP
but with updated mean and covariance functions.
Closed under any linear operator: if f ∼ GP(m(·), k(·, ·)) and L is a linear operator, then
  L ◦ f ∼ GP(L ◦ m, L² ◦ k),
e.g. df/dx, ∫ f(x) dx, Af are all GPs.
Conditional updates of Gaussian processes - revisited
Suppose f is a Gaussian process; then
  (f(x1), ..., f(xn), f(x)) ∼ N_(n+1)(0, Σ)
where
  Σ = [ k(x1, x1)  ...  k(x1, xn)  k(x1, x) ]
      [   ...            ...         ...    ]
      [ k(xn, x1)  ...  k(xn, xn)  k(xn, x) ]
      [ k(x, x1)   ...  k(x, xn)   k(x, x)  ]

    = [ K_XX      k_X(x)  ]
      [ k_X(x)^T  k(x, x) ]

where X = {x1, ..., xn}, [K_XX]_ij = k(xi, xj) is the Gram/kernel matrix, and [k_X(x)]_j = k(xj, x).
Conditional updates of Gaussian processes - revisited
Then
  f(x) | f(x1), ..., f(xn) ∼ N( m̄(x), k̄(x) )
where
  m̄(x) = k_X(x)^T K_XX^(−1) f
with
  f = (f(x1), ..., f(xn))^T
  k_X(x)^T = (k(x, x1), k(x, x2), ..., k(x, xn)) ∈ R^(1×n)
and
  k̄(x) = k(x, x) − k_X(x)^T K_XX^(−1) k_X(x).
Cf.
  Y2 | Y1 = y1 ∼ N( µ2 + Σ21 Σ11^(−1) (y1 − µ1),  Σ22 − Σ21 Σ11^(−1) Σ12 ).
More generally, if
  f(·) ∼ GP(m(·), k(·, ·))
then
  f(·) | f(x1), ..., f(xn) ∼ GP( m̄(·), k̄(·, ·) )
with
  m̄(x) = m(x) + k_X(x)^T K_XX^(−1) (f − m_X),   where m_X = (m(x1), ..., m(xn))^T,
  k̄(x, x') = k(x, x') − k_X(x)^T K_XX^(−1) k_X(x').
No noise/nugget - Interpolation
[Figure: GP interpolation of the observed values.]
Solid line: m̄(x) = k_X(x)^T K_XX^(−1) f
Shaded region: m̄(x) ± 1.96 √k̄(x), with k̄(x) = k(x, x) − k_X(x)^T K_XX^(−1) k_X(x)
Noisy observations/with nugget - Regression
In practice, we don't usually observe f(x) directly. If we observe
  yi = f(xi) + εi,   εi ∼ N(0, σ²),
then (y1, ..., yn, f(x)) ∼ N_(n+1)(0, Σ) where
  Σ = [ K_XX + σ²I   k_X(x)  ]
      [ k_X(x)^T     k(x, x) ]
Then
  f(x) | y1, ..., yn ∼ N( m̄(x), k̄(x) )
where
  m̄(x) = k_X(x)^T (K_XX + σ²I)^(−1) y
  k̄(x) = k(x, x) − k_X(x)^T (K_XX + σ²I)^(−1) k_X(x).
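A minimal sketch implementing exactly these regression formulas (NumPy; the RBF covariance, the toy data, and the hyper-parameter values are my own choices, not prescribed by the slides):

```python
import numpy as np

def rbf(a, b, lengthscale=0.3, variance=1.0):
    return variance * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

# Toy data: noisy observations of an unknown function
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 8)
sigma = 0.1
y = np.sin(2 * np.pi * X) + sigma * rng.standard_normal(X.size)

xstar = np.linspace(0, 1, 200)
K = rbf(X, X) + sigma ** 2 * np.eye(X.size)  # K_XX + sigma^2 I
kx = rbf(X, xstar)                           # columns are k_X(x*), shape (n, 200)

mean = kx.T @ np.linalg.solve(K, y)                       # m_bar(x*)
var = rbf(xstar, xstar).diagonal() - np.einsum(
    "ij,ij->j", kx, np.linalg.solve(K, kx))               # k_bar(x*)
lower, upper = mean - 1.96 * np.sqrt(var), mean + 1.96 * np.sqrt(var)  # 95% band
```

Plotting mean with the (lower, upper) band against xstar gives the kind of figure described on the nugget slides below.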
Nugget standard deviation σ = 0.1 and σ = 0.025
[Figures: GP regression fits for the two nugget values.]
Solid line: m̄(x) = k_X(x)^T (K_XX + σ²I)^(−1) y
Shaded region: m̄(x) ± 1.96 √k̄(x), with k̄(x) = k(x, x) − k_X(x)^T (K_XX + σ²I)^(−1) k_X(x)
If the mean is a linear combination of known regressor functions,
  m(x) = β^T h(x)   for known h(x),
and β is given a normal prior distribution (including π(β) ∝ 1), then
  y(·) | D, β ∼ GP   and   y(·) | D ∼ GP
with slightly modified mean and variance formulas.
If
  k(x, x') = σ² c(x, x')
and we give σ² an inverse gamma prior (including π(σ²) ∝ 1/σ²), then y | D, σ² ∼ GP and
  y | D ∼ t-process
with n − p degrees of freedom. In practice, for reasonable n, this is indistinguishable from a GP.
Why use GPs? Answer 2: non-parametric/kernel regression
We can also view GPs as a non-parametric extension to linear regression.
k determines the space of functions that sample paths live in.
Suppose we're given data {(xi, yi), i = 1, ..., n} with xi ∈ R^p, yi ∈ R.
  β̂ = arg min_β ||y − Xβ||²₂ + σ² ||β||²₂   (regularised least squares¹)
    = (X^T X + σ²I)^(−1) X^T y              (the usual regularised least squares estimator)
    = X^T (XX^T + σ²I)^(−1) y               (the dual form)
as (X^T X + σ²I) X^T = X^T (XX^T + σ²I), so X^T (XX^T + σ²I)^(−1) = (X^T X + σ²I)^(−1) X^T,
where X is the n × p matrix with rows x1^T, x2^T, ..., xn^T.
¹ Tikhonov regularisation/the Bayesian MAP estimator with a normal prior on β.
At first the dual form
  β̂ = X^T (XX^T + σ²I)^(−1) y
looks harder to compute than the usual
  β̂ = (X^T X + σ²I)^(−1) X^T y:
  X^T X is p × p   (p = number of features/parameters)
  XX^T is n × n    (n = number of data points).
But the dual form only uses inner products between the data vectors xi ∈ R^p:
  XX^T = [ x1^T x1  ...  x1^T xn ]
         [   ...          ...    ]
         [ xn^T x1  ...  xn^T xn ]
       = K_XX   if   k(x, x') = x^T x'.
This is useful!
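A quick numerical check of this identity (NumPy; random toy data and my own choice of σ²): the primal and dual forms agree, while requiring the solve of a p × p and an n × n system respectively.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 200, 5, 0.5
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

beta_primal = np.linalg.solve(X.T @ X + sigma2 * np.eye(p), X.T @ y)  # (X'X + s2 I)^-1 X'y
beta_dual = X.T @ np.linalg.solve(X @ X.T + sigma2 * np.eye(n), y)    # X'(XX' + s2 I)^-1 y

print(np.allclose(beta_primal, beta_dual))  # True
```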
Prediction
The best prediction of y at a new location x' is
  ŷ' = x'^T β̂
     = x'^T X^T (XX^T + σ²I)^(−1) y
     = k_X(x')^T (K_XX + σ²I)^(−1) y
where k_X(x')^T := (x'^T x1, ..., x'^T xn) and [K_XX]_ij := xi^T xj.
K_XX and k_X(x) are kernel matrices: every element is an inner product between two points, k(x, x') = x^T x'.
Note this is exactly the GP conditional mean we derived before,
  m̄(x) = k_X(x)^T (K_XX + σ²I)^(−1) y:
linear regression and GP regression are equivalent when k(x, x') = x^T x'.
Including features I
We can replace x by a feature vector in linear regression, e.g.,
  φ(x) = (1, x, x²).
It doesn't change the expressions, other than that the inner product
  k(x', x) = x'^T x
is replaced by
  k(x', x) = φ(x')^T φ(x).
Including features II
For some sets of features φ(x), computation of the inner product doesn't require us to evaluate the individual features.
E.g., consider X = R² and let
  φ : x = (x1, x2) ↦ (1, √2 x1, √2 x2, x1², √2 x1 x2, x2²)^T,
i.e., linear regression using all the linear and quadratic terms, and first order interactions.
Then
  k(x, z) = φ(x)^T φ(z)
          = (1, √2 x1, √2 x2, x1², √2 x1 x2, x2²)(1, √2 z1, √2 z2, z1², √2 z1 z2, z2²)^T
          = (1 + (x1, x2)(z1, z2)^T)²
          = (1 + x^T z)².
To evaluate k(x, z) we didn't need to explicitly compute the feature vector φ(x).
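A quick numerical check of this identity (NumPy; the two test points are arbitrary):

```python
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])

lhs = phi(x) @ phi(z)        # explicit feature-space inner product
rhs = (1.0 + x @ z) ** 2     # kernel evaluated directly
print(np.isclose(lhs, rhs))  # True
```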
Including features III
To evaluate k(x, z) we didn't need to explicitly compute the feature vectors φ(x), φ(z) ∈ R⁶.
The same idea works with much larger feature vectors, sometimes even when φ(x) ∈ R^∞.
Theorem: A function
  k : X × X → R
is positive semi-definite (and thus a valid covariance function) if and only if we can write²
  k(x, x') = φ(x)^T φ(x')
for some (possibly infinite dimensional) feature vector φ(x).
So GP regression with k can be thought of as linear regression with φ(x).
² I'm being sloppy - really we should write this as an inner product k(x, x') = ⟨φ(x), φ(x')⟩.
Example: If X = [0, 1] and c0 = 0, c1 = 1/N, c2 = 2/N, ..., cN = 1, then (modulo some detail) if
  φ(x) ∝ ( exp(−(x − c0)²/(2λ²)), ..., exp(−(x − cN)²/(2λ²)) ),
then as N → ∞,
  φ(x)^T φ(x') = exp( −(x − x')²/(2λ²) ).
We can use an infinite dimensional feature vector φ(x), and because linear regression can be done solely in terms of inner products (inverting an n × n matrix in the dual form), we never need to evaluate the feature vector, only the kernel.
Kernel trick: lift x into feature space by replacing inner products x^T x' by k(x, x').

Kernel regression
Kernel regression and GP regression are closely related.
Consider the space of functions
  H_k = span{ k(·, x) : x ∈ X },
i.e. functions of the form Σ_{i=1}^n α_i k(·, x_i), with inner product
  ⟨ Σ_i a_i k(·, x_i), Σ_j b_j k(·, y_j) ⟩ = Σ_{ij} a_i b_j k(x_i, y_j).
This is the reproducing kernel Hilbert space (RKHS) associated with k.
Kernel ridge regression chooses f to minimise
  L(f) = Σ_i (f(x_i) − y_i)² + σ² ||f||²_H_k.
We can show that
  m̄(x) = arg min_{f ∈ H_k} L(f),
where m̄(x) is the posterior mean if we assume y_i = f(x_i) + ε_i with ε_i ∼ N(0, σ²) and f(·) ∼ GP(0, k(·, ·)).
Note that m̄(·) ∈ H_k (samples from a GP live in a slightly larger RKHS).
Generally, we don't think about these features; we just choose a kernel.
k(x, x') is a kernel iff it is a positive semidefinite function.
Functions live in function spaces (vector spaces with inner products). There are lots of different function spaces: the GP kernel implicitly determines this space - our hypothesis space.
We can write k(x, x') = φ(x)^T φ(x') for some feature vector φ(x), and our model only includes functions that are linear combinations of this set of features,
  f(x) = Σ_i c_i k(x, x_i).³
This space of functions is called the Reproducing Kernel Hilbert Space (RKHS) of k.
Although reality may not lie in the RKHS defined by k, this space is much richer than any parametric regression model (and can be dense in some sets of continuous bounded functions), and is thus more likely to contain an element close to the true functional form than any class of models that contains only a finite number of features.
This is the motivation for non-parametric methods.
³ Not quite - it lies in the completion of this set of linear combinations.
Why use GPs? Answer 3: Naturalness of GP framework
Why use Gaussian processes as non-parametric models?
If we only knew the expectation and variance of some random variables, X and Y, then how should we best do statistics?
It has been shown, using coherency arguments, or geometric arguments, or ..., that the best second-order inference we can do to update our beliefs about X given Y is
  E(X | Y) = E(X) + Cov(X, Y) Var(Y)^(−1) (Y − E(Y)),
i.e., exactly the Gaussian process update for the posterior mean.
So GPs are in some sense second-order optimal.
Kriging
Suppose Y(x) is a (second order stationary) stochastic process with
  E Y(x) = µ   for all x
  Cov(Y(x), Y(x')) = k(x − x')   for all x, x'.
NB: we're not assuming Y has a Gaussian distribution.
If someone tells you y = (Y(x1), ..., Y(xn))^T, how would you predict Y(x)?
One option is to find the best linear unbiased predictor (BLUP) of Y(x).
Best Linear Unbiased Predictors (BLUP)
Consider the linear estimator
  Ŷ(x) = c + Σ_i w_i Y(x_i) = c + w^T y.
If we require Ŷ(x) to be unbiased,
  µ = E Ŷ(x) = E(c + w^T y) = c + w^T µ,
where µ = (µ, ..., µ)^T.
Thus c = µ − w^T µ and we must have
  Ŷ(x) = µ + w^T (y − µ).
Best Linear Unbiased Predictors (BLUP) - II
The best linear unbiased predictor minimises the mean square error
  MSE(Ŷ(x)) = E( (Ŷ(x) − Y(x))² )
            = E( (w^T (y − µ) + (µ − Y(x)))² )
            = w^T Var(y) w + Var(Y(x)) − 2 w^T Cov(y, Y(x))
            = w^T K_XX w + k(0) − 2 w^T k_X(x).
If we differentiate with respect to w and set the gradient equal to zero, we find
  0 = 2 K_XX w − 2 k_X(x),
and thus
  Ŷ(x) = µ + k_X(x)^T K_XX^(−1) (y − µ),
as before.
So the Gaussian process posterior mean is optimal (i.e. it is the BLUP) even if we don't assume a Gaussian distribution.
Why use GPs? Answer 4: Uncertainty estimates from emulators
We often think of our prediction as consisting of two parts:
  a point estimate
  the uncertainty in that estimate.
That GPs come equipped with the uncertainty in their prediction is seen as one of their main advantages.
It is important to check both aspects.
Warning: the uncertainty estimates from a GP can be flawed. Note that given data D = {X, y},
  Var(f(x) | X, y) = k(x, x) − k_X(x)^T K_XX^(−1) k_X(x),
so the posterior variance of f(x) does not depend upon y!
The variance estimates are particularly sensitive to the hyper-parameter estimates.
Difficulties of using GPs
If we know what RKHS/hypothesis space/covariance function we should use, GPs work great!
Unfortunately, we don't usually know this.
We pick a covariance function from a small set, based usually on differentiability considerations.
Possibly try a few (plus combinations of a few) covariance functions, and attempt to make a good choice using some sort of empirical evaluation.
Covariance functions often contain hyper-parameters, e.g. the RBF kernel
  k(x, x') = σ² exp( −½ (x − x')²/λ² ).
Estimate these using your favourite statistical procedure (maximum likelihood, cross-validation, Bayes, expert judgement, etc).
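One concrete version of the maximum likelihood option, not spelled out on the slides, is to maximise the log marginal likelihood log p(y) = −½ y^T (K + σ²I)^(−1) y − ½ log|K + σ²I| − (n/2) log 2π over the hyper-parameters. A minimal sketch (NumPy; the toy data, the fixed variance and noise, and the crude grid search are my own simplifications):

```python
import numpy as np

def rbf(a, b, lengthscale, variance):
    return variance * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

def log_marginal_likelihood(X, y, lengthscale, variance, noise):
    K = rbf(X, X, lengthscale, variance) + noise ** 2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + noise^2 I)^-1 y
    return (-0.5 * y @ alpha
            - np.log(np.diagonal(L)).sum()                # -1/2 log det
            - 0.5 * len(X) * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 30)
y = np.sin(4 * X) + 0.1 * rng.standard_normal(30)

# Crude grid search over the length-scale; in practice use a gradient-based optimizer
grid = np.linspace(0.05, 1.0, 20)
best = max(grid, key=lambda l: log_marginal_likelihood(X, y, l, 1.0, 0.1))
print("selected length-scale:", best)
```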
Difficulties of using GPs
Gelman et al. 2017
Assuming a GP model for your data imposes a complex structure on the data.
The number of parameters in a GP is essentially infinite, and so they are not always identified even asymptotically.
So the posterior can concentrate not on a point, but on some submanifold of parameter space, and the projection of the prior on this space continues to impact the posterior even as more and more data are collected.
E.g. consider a zero mean GP on [0, 1] with covariance function
  k(x, x') = σ² exp(−κ²|x − x'|).
We can consistently estimate the product σ²κ², but not σ² or κ individually, even as n → ∞.
Problems with hyper-parameter optimization
As well as problems of identifiability, the likelihood surface that is being maximized is often flat and multi-modal, and thus the optimizer can sometimes fail to converge, or get stuck in local maxima.
In practice, it is not uncommon to optimize the hyper-parameters and find solutions such as
[Figure: examples of poor GP fits obtained from badly estimated hyper-parameters.]
We often work around these problems by running the optimizer multiple times from random start points, using prior distributions, constraining or fixing hyper-parameters, or adding white noise.
Computational cost
One difficulty with GPs is that the computational cost of training them is O(n³) (and O(n²) memory).
There are many ways to side-step this cost, but one approach is to consider basis expansions and switch back to the primal form for linear regression.
Suppose
  k(x, x') = Σ_{i=1}^m φ_i(x) φ_i(x') = φ(x)^T φ(x').
Then GP regression is equivalent to linear regression with covariates φ(x):
  the dual form for the regression coefficients costs O(n³),
  but the primal solution only costs O(m³).
In practice we may use a basis expansion with m << n such that
  k(x, x') ≈ Σ_{i=1}^m φ_i(x) φ_i(x').
Choice of basis
There are many choices of basis. Two examples:
Mercer basis: Consider the map
  T_k(f)(·) = ∫_X k(x, ·) f(x) dx.
Consider the eigenfunctions of this map, i.e., φ : X → R s.t. T_k(φ)(·) = λφ(·). Then Mercer's theorem says that
  k(x, x') = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(x').
The Karhunen-Loève theorem says we can write f(·) ∼ GP(0, k(·, ·)) as
  f(x) = Σ_{i=1}^∞ Z_i √λ_i φ_i(x)   where the Z_i are iid N(0, 1).
We can approximate the process (and reduce the cost to O(m³)) by truncating the sum:
  f(x) = Σ_{i=1}^m Z_i √λ_i φ_i(x).
The Mercer/KL basis minimizes the mean square truncation error.
Choice of basis
Random Fourier features:
Bochner's theorem says that a stationary kernel can be represented as a Fourier transform of a distribution:
  k(x − x') = ∫ exp( i w^T (x − x') ) p(w) dw = E_{w∼p} exp( i w^T (x − x') )
            ≈ (1/m) Σ_{i=1}^m ( cos(w_i^T x), sin(w_i^T x) ) ( cos(w_i^T x'), sin(w_i^T x') )^T   if w_i ∼ p(·),
by using Euler's identity and discarding the imaginary part.
Using the primal form for linear regression again reduces the complexity to O(m³).
Recent work by Rudi and Rosasco (2017) shows that using m = √n log(n) features achieves similar performance to using the full kernel.
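A minimal sketch of random Fourier features for the 1-d RBF kernel k(x, x') = exp(−(x − x')²/(2λ²)), whose spectral density p(w) is N(0, 1/λ²) (NumPy; λ, m, and the test points are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, m = 0.5, 2000                        # length-scale and number of frequencies
w = rng.normal(0.0, 1.0 / lam, size=m)    # w_i ~ p(w) = N(0, 1/lam^2)

def features(x):
    # z(x) such that z(x) @ z(x') approximates k(x, x')
    return np.concatenate([np.cos(w * x), np.sin(w * x)]) / np.sqrt(m)

x, xp = 0.3, 0.8
approx = features(x) @ features(xp)
exact = np.exp(-0.5 * (x - xp) ** 2 / lam ** 2)
print(exact, approx)  # close for large m
```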
Conclusions
Once the good china, GPs are now ubiquitous in statistics/ML.
Popularity stems from:
  naturalness of the framework
  mathematical tractability
  empirical success.
Thank you for listening!

References
Rasmussen and Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer, 1999.
Kanagawa, Hennig, Sejdinovic, and Sriperumbudur. Gaussian processes and kernel methods: A review on connections and equivalences. arXiv:1807.02582, 2018.
