
An introduction to Gaussian Processes

Richard Wilkinson

School of Mathematical Sciences


University of Nottingham

GP summer school
September 2020
Welcome to Sheffield
Introduction

(Multivariate) Gaussian distributions


Definition of Gaussian processes
Motivations and derivations
Difficulties
You can download a copy of these slides from www.gpss.cc
Univariate Gaussian distributions

[Figure: PDF and CDF of a N(0,1) random variable.]

Y ∼ N(µ, σ²)

PDF: f_Y(y) = (1/√(2πσ²)) exp( −(y − µ)²/(2σ²) )
CDF: F_Y(y) = P(Y ≤ y), not known in closed form

If Z ∼ N(0, 1) then Y = µ + σZ ∼ N(µ, σ²)
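A minimal sketch of this location-scale relation in NumPy (the library choice, seed, and the particular µ and σ are my own, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5

# Draw Z ~ N(0, 1) and transform: Y = mu + sigma * Z ~ N(mu, sigma^2)
z = rng.standard_normal(100_000)
y = mu + sigma * z

print(y.mean(), y.std())  # close to mu and sigma
```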
Univariate Gaussians
The normal/Gaussian distribution occurs naturally and is convenient mathematically:
The family of normal distributions is closed under linear operations (more later).
Central limit theorem.
Maximum entropy/surprisal: N(µ, σ²) has the maximum entropy of any distribution with mean µ and variance σ² (max. ent. principle: the distribution with the largest entropy should be used as a least-informative default).
Infinite divisibility.
If Y and Z are jointly normally distributed and are uncorrelated, then they are independent.
Square-loss functions lead to procedures that have a Gaussian probabilistic interpretation, e.g. fitting a model f_β(x) to data y by minimizing Σ_i (y_i − f_β(x_i))² is equivalent to maximum likelihood estimation under the assumption that y = f_β(x) + ε where ε ∼ N(0, σ²).
Multivariate Gaussian distributions
'Multivariate' = two or more random variables.
Suppose Y ∈ R^d has a multivariate Gaussian distribution with
  mean vector µ ∈ R^d
  covariance matrix Σ ∈ R^(d×d).
Write
  Y ∼ N_d(µ, Σ)
Bivariate Gaussian: d = 2

  Y = (Y1, Y2)^T   µ = (µ1, µ2)^T   Σ = [ σ1²        ρ12 σ1 σ2 ]
                                        [ ρ21 σ1 σ2  σ2²       ]

  Var(Y_i) = σ_i²   Cov(Y_i, Y_j) = ρ_ij σ_i σ_j   Cor(Y_i, Y_j) = ρ12 for i ≠ j

pdf: f(y | µ, Σ) = |Σ|^(−1/2) (2π)^(−d/2) exp( −½ (y − µ)^T Σ^(−1) (y − µ) )
 
[Figures: scatter plots of samples (Y1, Y2) from bivariate Gaussians with]
µ = (0, 0)^T, Σ = [1 0; 0 1]: Cor(Y1, Y2) = 0, hence Y1 is independent of Y2
µ = (0, 1)^T, Σ = [1 0; 0 0.2]
µ = (0, 0)^T, Σ = [1 0.9; 0.9 1]
µ = (0, 0)^T, Σ = (1/3) [1 0.9; 0.9 1]
µ = (0, 0)^T, Σ = [1 0.99; 0.99 1]
µ = (0, 0)^T, Σ = [1 0.54; 0.54 0.3]: Cor(Y1, Y2) = 0.54/√(1 × 0.3) ≈ 0.99
More pictures
Hard to visualise in dimensions > 2, so stack the points next to each other.
So for 2d, instead of a scatter plot of (Y1, Y2), we plot each sample as Y against its index.
[Figures: scatter plot of Y1 vs Y2, and the same samples plotted as Y against index.]
Consider d = 5 with

  µ = (0, 0, 0, 0, 0)^T   Σ = [ 1     0.99  0.98  0.97  0.96 ]
                              [ 0.99  1     0.99  0.98  0.97 ]
                              [ 0.98  0.99  1     0.99  0.98 ]
                              [ 0.97  0.98  0.99  1     0.99 ]
                              [ 0.96  0.97  0.98  0.99  1    ]

[Figure: samples of Y plotted against index 1, ..., 5.]
Each line is one sample.


d = 50

  µ = (0, ..., 0)^T   Σ = [ 1     0.99  0.98  0.97  0.96  ... ]
                          [ 0.99  1     0.99  0.98  0.97  ... ]
                          [ 0.98  0.99  1     0.99  0.98  ... ]
                          [ 0.97  0.98  0.99  1     0.99  ... ]
                          [ 0.96  0.97  0.98  0.99  1     ... ]
                          [ ...   ...   ...   ...   ...   ... ]

[Figure: samples of Y plotted against index 1, ..., 50.]
Each line is one sample.
We can think of Gaussian processes as an infinite-dimensional distribution over functions - all we need to do is change the indexing.
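A minimal sketch reproducing this picture (NumPy/Matplotlib; the 1 − 0.01|i − j| construction is one way to obtain the pattern of entries shown above, and the jitter, seed, and plotting details are my own choices):

```python
import numpy as np
import matplotlib.pyplot as plt

d = 50
idx = np.arange(d)
# Covariance with entries 1, 0.99, 0.98, ... decaying with |i - j|, as on the slide
Sigma = 1.0 - 0.01 * np.abs(idx[:, None] - idx[None, :])
Sigma += 1e-9 * np.eye(d)  # tiny jitter for numerical stability

rng = np.random.default_rng(1)
samples = rng.multivariate_normal(mean=np.zeros(d), cov=Sigma, size=5)

for y in samples:  # each line is one sample
    plt.plot(idx + 1, y)
plt.xlabel("index")
plt.ylabel("Y")
plt.show()
```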
Gaussian processes
A stochastic process is a collection of random variables indexed by some variable x ∈ X:
  y = {y(x) : x ∈ X}
Usually y(x) ∈ R and X ⊂ R^n, i.e. y can be thought of as a function of x.
If X = R^n, then y is an infinite-dimensional process.
Thankfully, to understand the law of y we only need to consider the finite dimensional distributions (FDDs), i.e., for all x1, ..., xn and for all n ∈ N,
  P(y(x1) ≤ c1, ..., y(xn) ≤ cn),
as these uniquely determine the law of y.
A Gaussian process is a stochastic process with Gaussian FDDs, i.e.,
  (y(x1), ..., y(xn)) ∼ N_n(µ, Σ).
We write y(·) ∼ GP to denote that the function y is a GP.


Mean and covariance function
To fully specify the law of a Gaussian distribution we only need the mean and covariance:
  X ∼ N(µ, Σ)
To fully specify the law of a Gaussian process, we need to specify mean and covariance functions:
  y(·) ∼ GP(m(·), k(·, ·))
where
  E(y(x)) = m(x)
  Cov(y(x), y(x')) = k(x, x')
Specifying the mean function
We are free to choose the mean E(y(x)) and covariance Cov(y(x), y(x')) functions however we like (e.g. trial and error), subject to some 'rules'.
We can use any mean function we want:
  m(x) = E(y(x))
The most popular choices are m(x) = 0 or m(x) = const for all x, or
  m(x) = β^T x
Covariance functions
We usually use a covariance function that is a function of the indexes/locations:
  k(x, x') = Cov(y(x), y(x')).
k must be a positive semi-definite function, i.e., it must lead to valid covariance matrices:
  given locations x1, ..., xn, the n × n Gram matrix K with K_ij = k(xi, xj) must be a positive semi-definite matrix.
This can be problematic (see Nicolas' talk).
We often assume k is a function of only the distance between locations,
  Cov(y(x), y(x')) = k(x − x'),
which results in a stationary process.
If Cov(y(x), y(x')) = k(||x − x'||) the covariance function is said to be isotropic.
The covariance function determines the nature of the GP:
k determines the hypothesis space/space of functions.
Examples
RBF/squared-exponential/exponentiated quadratic:
  k(x, x') = exp( −½ (x − x')² )
  k(x, x') = exp( −½ (x − x')²/0.25² )
  k(x, x') = exp( −½ (x − x')²/4² )
  k(x, x') = 100 exp( −½ (x − x')² )
Matérn 3/2:
  k(x, x') ∝ (1 + |x − x'|) exp( −|x − x'| )
Brownian motion:
  k(x, x') = min(x, x')
White noise:
  k(x, x') = 1 if x = x', 0 otherwise
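A minimal sketch of a few of these covariance functions, together with the Gram-matrix check that they give positive semi-definite matrices (NumPy; the function names, input grid, and default hyper-parameters are my own choices):

```python
import numpy as np

def rbf(x, xp, lengthscale=1.0, variance=1.0):
    # squared-exponential / exponentiated quadratic
    return variance * np.exp(-0.5 * (x - xp) ** 2 / lengthscale ** 2)

def matern32(x, xp):
    # Matern 3/2 (up to the usual length-scale rescaling)
    d = np.abs(x - xp)
    return (1.0 + d) * np.exp(-d)

def brownian(x, xp):
    return np.minimum(x, xp)

x = np.linspace(0.01, 1, 50)
for k in (rbf, matern32, brownian):
    K = k(x[:, None], x[None, :])       # Gram matrix K_ij = k(x_i, x_j)
    eigvals = np.linalg.eigvalsh(K)
    print(k.__name__, "min eigenvalue:", eigvals.min())  # >= 0, up to round-off
```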
Examples
The GP inherits its properties primarily from the covariance function k:
  smoothness
  differentiability
  variance.
A final example:
  k(x, x') = x^T x'
What is happening? Suppose y(x) = cx where c ∼ N(0, 1). Then
  Cov(y(x), y(x')) = Cov(cx, cx')
                   = x^T Cov(c, c) x'
                   = x^T x'
So y(·) ∼ GP(0, k(x, x')) with k(x, x') = x^T x'.
Why use Gaussian processes?
Why would we want to use this very restricted class of model?
Gaussian distributions have several properties that make them easy to work with.
Proposition: Y ∼ N_d(µ, Σ) if and only if AY ∼ N_p(Aµ, AΣA^T) for all A ∈ R^(p×d).
So sums of Gaussians are Gaussian, and marginal distributions of multivariate Gaussians are still Gaussian.
Corollary: Σ must be positive semi-definite, as a^T Σ a ≥ 0 for all a ∈ R^d.
Conversely, any matrix Σ which is positive semi-definite is a valid covariance matrix:
  if Z ∼ N_d(0_d, I_d) then Y = µ + Σ^(1/2) Z ∼ N_d(µ, Σ),
where Σ^(1/2) is a matrix square root of Σ.
This gives one way of generating multivariate Gaussians.
Property 2: Conditional distributions are still Gaussian
Suppose
  Y = (Y1, Y2)^T ∼ N_2(µ, Σ)
where
  µ = (µ1, µ2)^T   Σ = [ Σ11  Σ12 ]
                       [ Σ21  Σ22 ]
Then
  Y2 | Y1 = y1 ∼ N( µ2 + Σ21 Σ11^(−1) (y1 − µ1),  Σ22 − Σ21 Σ11^(−1) Σ12 )
Proof:
  π(y2 | y1) = π(y1, y2)/π(y1) ∝ π(y1, y2)
             ∝ exp( −½ (y − µ)^T Σ^(−1) (y − µ) )
             = exp( −½ [ (y1 − µ1); (y2 − µ2) ]^T [ Q11  Q12; Q21  Q22 ] ··· )
             ∝ exp( −½ [ (y2 − µ2)^T Q22 (y2 − µ2) + 2 (y2 − µ2)^T Q21 (y1 − µ1) ] )
where
  Σ^(−1) := Q := [ Q11  Q12 ]
                 [ Q21  Q22 ]
So Y2 | Y1 = y1 is Gaussian.
  π(y2 | y1) ∝ exp( −½ [ (y2 − µ2)^T Q22 (y2 − µ2) + 2 (y2 − µ2)^T Q21 (y1 − µ1) ] )
             ∝ exp( −½ [ y2^T Q22 y2 − 2 y2^T (Q22 µ2 − Q21 (y1 − µ1)) ] )
             ∝ exp( −½ ( y2 − Q22^(−1)(Q22 µ2 − Q21 (y1 − µ1)) )^T Q22 ( y2 − ... ) )
So
  Y2 | Y1 = y1 ∼ N( µ2 − Q22^(−1) Q21 (y1 − µ1),  Q22^(−1) )
A standard matrix inversion lemma gives
  Q22^(−1) = Σ22 − Σ21 Σ11^(−1) Σ12   and   Q22^(−1) Q21 = −Σ21 Σ11^(−1),
giving
  Y2 | Y1 = y1 ∼ N( µ2 + Σ21 Σ11^(−1) (y1 − µ1),  Σ22 − Σ21 Σ11^(−1) Σ12 )
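As a sanity check of this result, a small Monte Carlo sketch (NumPy; the particular µ, Σ, and conditioning value are arbitrary choices of mine): draw many samples, keep those whose first coordinate lies close to y1, and compare the empirical mean and variance of Y2 with the formula.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

y1 = 1.5
# Theoretical conditional moments of Y2 | Y1 = y1
cond_mean = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (y1 - mu[0])
cond_var = Sigma[1, 1] - Sigma[1, 0] / Sigma[0, 0] * Sigma[0, 1]

# Monte Carlo: keep samples with Y1 close to y1
samples = rng.multivariate_normal(mu, Sigma, size=2_000_000)
near = samples[np.abs(samples[:, 0] - y1) < 0.01, 1]
print(cond_mean, near.mean())  # should agree closely
print(cond_var, near.var())    # should agree closely
```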
Conditional updates of Gaussian processes
So suppose f is a Gaussian process; then
  (f(x1), ..., f(xn), f(x)) ∼ N_(n+1)(µ, Σ).
If we observe its value at x1, ..., xn, then
  f(x) | f(x1), ..., f(xn) ∼ N(µ*, σ*)
where µ* and σ* are as on the previous slide.
Note that we still believe f is a GP even though we've observed its value at a number of locations.
Why use GPs? Answer 1
The GP class of models is closed under various operations.
Closed under addition:
  if f1(·), f2(·) ∼ GP then (f1 + f2)(·) ∼ GP.
Closed under Bayesian conditioning, i.e., if we observe
  D = (f(x1), ..., f(xn))
then
  f | D ∼ GP
but with updated mean and covariance functions.
Closed under any linear operator: if f ∼ GP(m(·), k(·, ·)) and L is a linear operator, then
  L ◦ f ∼ GP(L ◦ m, L² ◦ k),
e.g. df/dx, ∫ f(x) dx, Af are all GPs.
Conditional updates of Gaussian processes - revisited
Suppose f is a Gaussian process; then
  (f(x1), ..., f(xn), f(x)) ∼ N_(n+1)(0, Σ)
where
  Σ = [ k(x1, x1)  ...  k(x1, xn)  k(x1, x) ]
      [   ...            ...         ...    ]
      [ k(xn, x1)  ...  k(xn, xn)  k(xn, x) ]
      [ k(x, x1)   ...  k(x, xn)   k(x, x)  ]

    = [ K_XX      k_X(x)  ]
      [ k_X(x)^T  k(x, x) ]

where X = {x1, ..., xn}, [K_XX]_ij = k(xi, xj) is the Gram/kernel matrix, and [k_X(x)]_j = k(xj, x).
Conditional updates of Gaussian processes - revisited
Then
  f(x) | f(x1), ..., f(xn) ∼ N( m̄(x), k̄(x) )
where
  m̄(x) = k_X(x)^T K_XX^(−1) f
with
  f = (f(x1), ..., f(xn))^T
  k_X(x)^T = (k(x, x1), k(x, x2), ..., k(x, xn)) ∈ R^(1×n)
and
  k̄(x) = k(x, x) − k_X(x)^T K_XX^(−1) k_X(x).
Cf.
  Y2 | Y1 = y1 ∼ N( µ2 + Σ21 Σ11^(−1) (y1 − µ1),  Σ22 − Σ21 Σ11^(−1) Σ12 ).
More generally, if
  f(·) ∼ GP(m(·), k(·, ·))
then
  f(·) | f(x1), ..., f(xn) ∼ GP( m̄(·), k̄(·, ·) )
with
  m̄(x) = m(x) + k_X(x)^T K_XX^(−1) (f − m_X),   where m_X = (m(x1), ..., m(xn))^T,
  k̄(x, x') = k(x, x') − k_X(x)^T K_XX^(−1) k_X(x').
No noise/nugget - Interpolation
[Figure: GP interpolation of the observed values.]
Solid line: m̄(x) = k_X(x)^T K_XX^(−1) f
Shaded region: m̄(x) ± 1.96 √k̄(x), with k̄(x) = k(x, x) − k_X(x)^T K_XX^(−1) k_X(x)
Noisy observations/with nugget - Regression
In practice, we don't usually observe f(x) directly. If we observe
  yi = f(xi) + εi,   εi ∼ N(0, σ²),
then (y1, ..., yn, f(x)) ∼ N_(n+1)(0, Σ) where
  Σ = [ K_XX + σ²I   k_X(x)  ]
      [ k_X(x)^T     k(x, x) ]
Then
  f(x) | y1, ..., yn ∼ N( m̄(x), k̄(x) )
where
  m̄(x) = k_X(x)^T (K_XX + σ²I)^(−1) y
  k̄(x) = k(x, x) − k_X(x)^T (K_XX + σ²I)^(−1) k_X(x).
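A minimal sketch implementing exactly these regression formulas (NumPy; the RBF covariance, the toy data, and the hyper-parameter values are my own choices, not prescribed by the slides):

```python
import numpy as np

def rbf(a, b, lengthscale=0.3, variance=1.0):
    return variance * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

# Toy data: noisy observations of an unknown function
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 8)
sigma = 0.1
y = np.sin(2 * np.pi * X) + sigma * rng.standard_normal(X.size)

xstar = np.linspace(0, 1, 200)
K = rbf(X, X) + sigma ** 2 * np.eye(X.size)  # K_XX + sigma^2 I
kx = rbf(X, xstar)                           # columns are k_X(x*), shape (n, 200)

mean = kx.T @ np.linalg.solve(K, y)                       # m_bar(x*)
var = rbf(xstar, xstar).diagonal() - np.einsum(
    "ij,ij->j", kx, np.linalg.solve(K, kx))               # k_bar(x*)
lower, upper = mean - 1.96 * np.sqrt(var), mean + 1.96 * np.sqrt(var)  # 95% band
```

Plotting mean with the (lower, upper) band against xstar gives the kind of figure described on the nugget slides below.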
Nugget standard deviation σ = 0.1 and σ = 0.025
[Figures: GP regression fits for the two nugget values.]
Solid line: m̄(x) = k_X(x)^T (K_XX + σ²I)^(−1) y
Shaded region: m̄(x) ± 1.96 √k̄(x), with k̄(x) = k(x, x) − k_X(x)^T (K_XX + σ²I)^(−1) k_X(x)
If the mean is a linear combination of known regressor functions,
  m(x) = β^T h(x)   for known h(x),
and β is given a normal prior distribution (including π(β) ∝ 1), then
  y(·) | D, β ∼ GP   and   y(·) | D ∼ GP
with slightly modified mean and variance formulas.
If
  k(x, x') = σ² c(x, x')
and we give σ² an inverse gamma prior (including π(σ²) ∝ 1/σ²), then y | D, σ² ∼ GP and
  y | D ∼ t-process
with n − p degrees of freedom. In practice, for reasonable n, this is indistinguishable from a GP.
Why use GPs? Answer 2: non-parametric/kernel regression
We can also view GPs as a non-parametric extension to linear regression.
k determines the space of functions that sample paths live in.
Suppose we're given data {(xi, yi), i = 1, ..., n} with xi ∈ R^p, yi ∈ R.
  β̂ = arg min_β ||y − Xβ||²₂ + σ² ||β||²₂   (regularised least squares¹)
    = (X^T X + σ²I)^(−1) X^T y              (the usual regularised least squares estimator)
    = X^T (XX^T + σ²I)^(−1) y               (the dual form)
as (X^T X + σ²I) X^T = X^T (XX^T + σ²I), so X^T (XX^T + σ²I)^(−1) = (X^T X + σ²I)^(−1) X^T,
where X is the n × p matrix with rows x1^T, x2^T, ..., xn^T.
¹ Tikhonov regularisation/the Bayesian MAP estimator with a normal prior on β.
At first the dual form
  β̂ = X^T (XX^T + σ²I)^(−1) y
looks harder to compute than the usual
  β̂ = (X^T X + σ²I)^(−1) X^T y:
  X^T X is p × p   (p = number of features/parameters)
  XX^T is n × n    (n = number of data points).
But the dual form only uses inner products between the data vectors xi ∈ R^p:
  XX^T = [ x1^T x1  ...  x1^T xn ]
         [   ...          ...    ]
         [ xn^T x1  ...  xn^T xn ]
       = K_XX   if   k(x, x') = x^T x'.
This is useful!
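A quick numerical check of this identity (NumPy; random toy data and my own choice of σ²): the primal and dual forms agree, while requiring the solve of a p × p and an n × n system respectively.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 200, 5, 0.5
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

beta_primal = np.linalg.solve(X.T @ X + sigma2 * np.eye(p), X.T @ y)  # (X'X + s2 I)^-1 X'y
beta_dual = X.T @ np.linalg.solve(X @ X.T + sigma2 * np.eye(n), y)    # X'(XX' + s2 I)^-1 y

print(np.allclose(beta_primal, beta_dual))  # True
```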
Prediction
The best prediction of y at a new location x' is
  ŷ' = x'^T β̂
     = x'^T X^T (XX^T + σ²I)^(−1) y
     = k_X(x')^T (K_XX + σ²I)^(−1) y
where k_X(x')^T := (x'^T x1, ..., x'^T xn) and [K_XX]_ij := xi^T xj.
K_XX and k_X(x) are kernel matrices: every element is an inner product between two points, k(x, x') = x^T x'.
Note this is exactly the GP conditional mean we derived before,
  m̄(x) = k_X(x)^T (K_XX + σ²I)^(−1) y:
linear regression and GP regression are equivalent when k(x, x') = x^T x'.
Including features I
We can replace x by a feature vector in linear regression, e.g.,
  φ(x) = (1, x, x²).
It doesn't change the expressions, other than that the inner product
  k(x', x) = x'^T x
is replaced by
  k(x', x) = φ(x')^T φ(x).
Including features II
For some sets of features φ(x), computation of the inner product doesn't require us to evaluate the individual features.
E.g., consider X = R² and let
  φ : x = (x1, x2) ↦ (1, √2 x1, √2 x2, x1², √2 x1 x2, x2²)^T,
i.e., linear regression using all the linear and quadratic terms, and first order interactions.
Then
  k(x, z) = φ(x)^T φ(z)
          = (1, √2 x1, √2 x2, x1², √2 x1 x2, x2²)(1, √2 z1, √2 z2, z1², √2 z1 z2, z2²)^T
          = (1 + (x1, x2)(z1, z2)^T)²
          = (1 + x^T z)².
To evaluate k(x, z) we didn't need to explicitly compute the feature vector φ(x).
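A quick numerical check of this identity (NumPy; the two test points are arbitrary):

```python
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])

lhs = phi(x) @ phi(z)        # explicit feature-space inner product
rhs = (1.0 + x @ z) ** 2     # kernel evaluated directly
print(np.isclose(lhs, rhs))  # True
```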
Including features III
To evaluate k(x, z) we didn't need to explicitly compute the feature vectors φ(x), φ(z) ∈ R⁶.
The same idea works with much larger feature vectors, sometimes even when φ(x) ∈ R^∞.
Theorem: A function
  k : X × X → R
is positive semi-definite (and thus a valid covariance function) if and only if we can write²
  k(x, x') = φ(x)^T φ(x')
for some (possibly infinite dimensional) feature vector φ(x).
So GP regression with k can be thought of as linear regression with φ(x).
² I'm being sloppy - really we should write this as an inner product k(x, x') = ⟨φ(x), φ(x')⟩.
Example: If X = [0, 1] and c0 = 0, c1 = 1/N, c2 = 2/N, ..., cN = 1, then (modulo some detail) if
  φ(x) ∝ ( exp(−(x − c0)²/(2λ²)), ..., exp(−(x − cN)²/(2λ²)) ),
then as N → ∞,
  φ(x)^T φ(x') = exp( −(x − x')²/(2λ²) ).
We can use an infinite dimensional feature vector φ(x), and because linear regression can be done solely in terms of inner products (inverting an n × n matrix in the dual form), we never need to evaluate the feature vector, only the kernel.
Kernel trick: lift x into feature space by replacing inner products x^T x' by k(x, x').

Kernel regression
Kernel regression and GP regression are closely related.
Consider the space of functions
  H_k = span{ k(·, x) : x ∈ X },
i.e. functions of the form Σ_{i=1}^n α_i k(·, x_i), with inner product
  ⟨ Σ_i a_i k(·, x_i), Σ_j b_j k(·, y_j) ⟩ = Σ_{ij} a_i b_j k(x_i, y_j).
This is the reproducing kernel Hilbert space (RKHS) associated with k.
Kernel ridge regression chooses f to minimise
  L(f) = Σ_i (f(x_i) − y_i)² + σ² ||f||²_H_k.
We can show that
  m̄(x) = arg min_{f ∈ H_k} L(f),
where m̄(x) is the posterior mean if we assume y_i = f(x_i) + ε_i with ε_i ∼ N(0, σ²) and f(·) ∼ GP(0, k(·, ·)).
Note that m̄(·) ∈ H_k (samples from a GP live in a slightly larger RKHS).
Generally, we don't think about these features; we just choose a kernel.
k(x, x') is a kernel iff it is a positive semidefinite function.
Functions live in function spaces (vector spaces with inner products). There are lots of different function spaces: the GP kernel implicitly determines this space - our hypothesis space.
We can write k(x, x') = φ(x)^T φ(x') for some feature vector φ(x), and our model only includes functions that are linear combinations of this set of features,
  f(x) = Σ_i c_i k(x, x_i).³
This space of functions is called the Reproducing Kernel Hilbert Space (RKHS) of k.
Although reality may not lie in the RKHS defined by k, this space is much richer than any parametric regression model (and can be dense in some sets of continuous bounded functions), and is thus more likely to contain an element close to the true functional form than any class of models that contains only a finite number of features.
This is the motivation for non-parametric methods.
³ Not quite - it lies in the completion of this set of linear combinations.
Why use GPs? Answer 3: Naturalness of GP framework
Why use Gaussian processes as non-parametric models?
If we only knew the expectation and variance of some random variables, X and Y, then how should we best do statistics?
It has been shown, using coherency arguments, or geometric arguments, or ..., that the best second-order inference we can do to update our beliefs about X given Y is
  E(X | Y) = E(X) + Cov(X, Y) Var(Y)^(−1) (Y − E(Y)),
i.e., exactly the Gaussian process update for the posterior mean.
So GPs are in some sense second-order optimal.
Kriging
Suppose Y(x) is a (second order stationary) stochastic process with
  E Y(x) = µ   for all x
  Cov(Y(x), Y(x')) = k(x − x')   for all x, x'.
NB: we're not assuming Y has a Gaussian distribution.
If someone tells you y = (Y(x1), ..., Y(xn))^T, how would you predict Y(x)?
One option is to find the best linear unbiased predictor (BLUP) of Y(x).
Best Linear Unbiased Predictors (BLUP)
Consider the linear estimator
  Ŷ(x) = c + Σ_i w_i Y(x_i) = c + w^T y.
If we require Ŷ(x) to be unbiased,
  µ = E Ŷ(x) = E(c + w^T y) = c + w^T µ,
where µ = (µ, ..., µ)^T.
Thus c = µ − w^T µ and we must have
  Ŷ(x) = µ + w^T (y − µ).
Best Linear Unbiased Predictors (BLUP) - II
The best linear unbiased predictor minimises the mean square error
  MSE(Ŷ(x)) = E( (Ŷ(x) − Y(x))² )
            = E( (w^T (y − µ) + (µ − Y(x)))² )
            = w^T Var(y) w + Var(Y(x)) − 2 w^T Cov(y, Y(x))
            = w^T K_XX w + k(0) − 2 w^T k_X(x).
If we differentiate with respect to w and set the gradient equal to zero, we find
  0 = 2 K_XX w − 2 k_X(x),
and thus
  Ŷ(x) = µ + k_X(x)^T K_XX^(−1) (y − µ),
as before.
So the Gaussian process posterior mean is optimal (i.e. it is the BLUP) even if we don't assume a Gaussian distribution.
Why use GPs? Answer 4: Uncertainty estimates from emulators
We often think of our prediction as consisting of two parts:
  a point estimate
  the uncertainty in that estimate.
That GPs come equipped with the uncertainty in their prediction is seen as one of their main advantages.
It is important to check both aspects.
Warning: the uncertainty estimates from a GP can be flawed. Note that given data D = {X, y},
  Var(f(x) | X, y) = k(x, x) − k_X(x)^T K_XX^(−1) k_X(x),
so the posterior variance of f(x) does not depend upon y!
The variance estimates are particularly sensitive to the hyper-parameter estimates.
Difficulties of using GPs
If we know what RKHS/hypothesis space/covariance function we should use, GPs work great!
Unfortunately, we don't usually know this.
We pick a covariance function from a small set, based usually on differentiability considerations.
Possibly try a few (plus combinations of a few) covariance functions, and attempt to make a good choice using some sort of empirical evaluation.
Covariance functions often contain hyper-parameters, e.g. the RBF kernel
  k(x, x') = σ² exp( −½ (x − x')²/λ² ).
Estimate these using your favourite statistical procedure (maximum likelihood, cross-validation, Bayes, expert judgement, etc).
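One concrete version of the maximum likelihood option, not spelled out on the slides, is to maximise the log marginal likelihood log p(y) = −½ y^T (K + σ²I)^(−1) y − ½ log|K + σ²I| − (n/2) log 2π over the hyper-parameters. A minimal sketch (NumPy; the toy data, the fixed variance and noise, and the crude grid search are my own simplifications):

```python
import numpy as np

def rbf(a, b, lengthscale, variance):
    return variance * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

def log_marginal_likelihood(X, y, lengthscale, variance, noise):
    K = rbf(X, X, lengthscale, variance) + noise ** 2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + noise^2 I)^-1 y
    return (-0.5 * y @ alpha
            - np.log(np.diagonal(L)).sum()                # -1/2 log det
            - 0.5 * len(X) * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 30)
y = np.sin(4 * X) + 0.1 * rng.standard_normal(30)

# Crude grid search over the length-scale; in practice use a gradient-based optimizer
grid = np.linspace(0.05, 1.0, 20)
best = max(grid, key=lambda l: log_marginal_likelihood(X, y, l, 1.0, 0.1))
print("selected length-scale:", best)
```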
Difficulties of using GPs
Gelman et al. 2017
Assuming a GP model for your data imposes a complex structure on the data.
The number of parameters in a GP is essentially infinite, and so they are not always identified even asymptotically.
So the posterior can concentrate not on a point, but on some submanifold of parameter space, and the projection of the prior on this space continues to impact the posterior even as more and more data are collected.
E.g. consider a zero mean GP on [0, 1] with covariance function
  k(x, x') = σ² exp(−κ²|x − x'|).
We can consistently estimate the product σ²κ², but not σ² or κ individually, even as n → ∞.
Problems with hyper-parameter optimization
As well as problems of identifiability, the likelihood surface that is being maximized is often flat and multi-modal, and thus the optimizer can sometimes fail to converge, or get stuck in local maxima.
In practice, it is not uncommon to optimize the hyper-parameters and find solutions such as
[Figure: examples of poor GP fits obtained from badly estimated hyper-parameters.]
We often work around these problems by running the optimizer multiple times from random start points, using prior distributions, constraining or fixing hyper-parameters, or adding white noise.
Computational cost
One difficulty with GPs is that the computational cost of training them is O(n³) (and O(n²) memory).
There are many ways to side-step this cost, but one approach is to consider basis expansions and switch back to the primal form for linear regression.
Suppose
  k(x, x') = Σ_{i=1}^m φ_i(x) φ_i(x') = φ(x)^T φ(x').
Then GP regression is equivalent to linear regression with covariates φ(x):
  the dual form for the regression coefficients costs O(n³),
  but the primal solution only costs O(m³).
In practice we may use a basis expansion with m << n such that
  k(x, x') ≈ Σ_{i=1}^m φ_i(x) φ_i(x').
Choice of basis
There are many choices of basis. Two examples:
Mercer basis: Consider the map
  T_k(f)(·) = ∫_X k(x, ·) f(x) dx.
Consider the eigenfunctions of this map, i.e., φ : X → R s.t. T_k(φ)(·) = λφ(·). Then Mercer's theorem says that
  k(x, x') = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(x').
The Karhunen-Loève theorem says we can write f(·) ∼ GP(0, k(·, ·)) as
  f(x) = Σ_{i=1}^∞ Z_i √λ_i φ_i(x)   where the Z_i are iid N(0, 1).
We can approximate the process (and reduce the cost to O(m³)) by truncating the sum:
  f(x) = Σ_{i=1}^m Z_i √λ_i φ_i(x).
The Mercer/KL basis minimizes the mean square truncation error.
Choice of basis
Random Fourier features:
Bochner's theorem says that a stationary kernel can be represented as a Fourier transform of a distribution:
  k(x − x') = ∫ exp( i w^T (x − x') ) p(w) dw = E_{w∼p} exp( i w^T (x − x') )
            ≈ (1/m) Σ_{i=1}^m ( cos(w_i^T x), sin(w_i^T x) ) ( cos(w_i^T x'), sin(w_i^T x') )^T   if w_i ∼ p(·),
by using Euler's identity and discarding the imaginary part.
Using the primal form for linear regression again reduces the complexity to O(m³).
Recent work by Rudi and Rosasco (2017) shows that using m = √n log(n) features achieves similar performance to using the full kernel.
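A minimal sketch of random Fourier features for the 1-d RBF kernel k(x, x') = exp(−(x − x')²/(2λ²)), whose spectral density p(w) is N(0, 1/λ²) (NumPy; λ, m, and the test points are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, m = 0.5, 2000                        # length-scale and number of frequencies
w = rng.normal(0.0, 1.0 / lam, size=m)    # w_i ~ p(w) = N(0, 1/lam^2)

def features(x):
    # z(x) such that z(x) @ z(x') approximates k(x, x')
    return np.concatenate([np.cos(w * x), np.sin(w * x)]) / np.sqrt(m)

x, xp = 0.3, 0.8
approx = features(x) @ features(xp)
exact = np.exp(-0.5 * (x - xp) ** 2 / lam ** 2)
print(exact, approx)  # close for large m
```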
Conclusions
Once the good china, GPs are now ubiquitous in statistics/ML.
Popularity stems from:
  naturalness of the framework
  mathematical tractability
  empirical success.
Thank you for listening!

References
Rasmussen and Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer, 1999.
Kanagawa, Hennig, Sejdinovic, and Sriperumbudur. Gaussian processes and kernel methods: A review on connections and equivalences. arXiv:1807.02582, 2018.
