Note 8 Informatics 2B - Learning

Gaussians

Hiroshi Shimodaira∗

January-March 2020

In this chapter we introduce the basics of how to build probabilistic models of continuous-valued data, including the most important probability distribution for continuous data: the Gaussian, or Normal, distribution. We discuss both the univariate Gaussian (the Gaussian distribution for one-dimensional data) and the multivariate Gaussian distribution (the Gaussian distribution for multi-dimensional data).

8.1 Continuous random variables

First we review the concepts of the cumulative distribution function and the probability density function of a continuous random variable.

Many events that we want to model probabilistically are described by real numbers rather than discrete symbols or integers. In this case we must use continuous random variables. Some examples of continuous random variables include:

• The sum of two numbers drawn randomly from the interval 0 < x < 1;
• The length of time for a bus to arrive at the bus-stop;
• The height of a member of a population.

8.1.1 Cumulative distribution function

We will develop the way we model continuous random variables using a simple example. Consider waiting for a bus that runs every 30 minutes. We shall make the idealistic assumption that the buses are always exactly on time, so a bus arrives every 30 minutes. If you are waiting for a bus but don't know the timetable, then the precise length of time you need to wait is unknown. Let the continuous random variable X denote the length of time you need to wait for a bus.

Given the above information we may assume that X never takes values above 30 or below 0. We can write this as:

P(X < 0) = 0
P(0 ≤ X ≤ 30) = 1
P(X > 30) = 0

The probability distribution function for a random variable assigns a probability to each value that the variable may take. It is impossible to write down a probability distribution function for a continuous random variable X, since P(X = x) = 0 for all x: X is continuous and can take infinitely many values. However, we can write down a cumulative distribution function F(x), which gives the probability of X taking a value that is less than or equal to x. For the current example:

F(x) = P(X ≤ x) =  0                   x < 0
                   (x − 0)/30 = x/30   0 ≤ x ≤ 30        (8.1)
                   1                   x > 30

In writing down this cumulative distribution, we have made the (reasonable) assumption that the probability of a bus arriving increases in proportion to the interval of time waited (in the region 0–30 minutes). This cumulative distribution function is plotted in Figure 8.1.

Figure 8.1: Cumulative distribution function of random variable X in the 'bus' example.

Cumulative distribution functions have the following properties:

(i) F(−∞) = 0;
(ii) F(∞) = 1;
(iii) if a ≤ b then F(a) ≤ F(b).

To obtain the probability of X falling in an interval we can do the following:

P(a < X ≤ b) = P(X ≤ b) − P(X ≤ a) = F(b) − F(a).        (8.2)

∗© 2014-2020 University of Edinburgh. All rights reserved. This note is heavily based on notes inherited from Steve Renals and Iain Murray.
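Equations (8.1) and (8.2) are easy to check numerically. Below is a minimal Python sketch (the notes use Matlab; the function names `bus_cdf` and `interval_prob` are ours, introduced only for illustration):

```python
def bus_cdf(x):
    """F(x) = P(X <= x) for the 'bus' example, Eq. (8.1)."""
    if x < 0:
        return 0.0
    if x <= 30:
        return x / 30.0
    return 1.0

def interval_prob(a, b):
    """P(a < X <= b) = F(b) - F(a), Eq. (8.2)."""
    return bus_cdf(b) - bus_cdf(a)

print(bus_cdf(15))                        # 0.5
print(round(interval_prob(15, 21), 10))   # 0.2
```

The interval probability depends only on the difference of CDF values, so the same two-line rule works for any continuous random variable once its CDF is known.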
For the 'bus' example:

P(15 < X ≤ 21) = F(21) − F(15) = 0.7 − 0.5 = 0.2

8.1.2 Probability density function

Although we cannot define a probability distribution function for a continuous random variable, we can define a closely related function, called the probability density function (pdf), p(x):

p(x) = (d/dx) F(x) = F′(x)

F(x) = ∫_{−∞}^{x} p(x) dx .

The pdf is the gradient of the cdf. Note that p(x) is not the probability that X has value x. However, the pdf is proportional to the probability that X lies in a small interval centred on x. The pdf is the usual way of describing the probabilities associated with a continuous random variable X. We usually write probabilities using upper case P and probability densities using lower case p.

We can immediately write down the pdf for the 'bus' example:

p(x) =  0      x < 0
        1/30   0 ≤ x ≤ 30
        0      x > 30

Figure 8.2 shows a graph of this pdf. X is said to be uniform on the interval (0, 30).

Figure 8.2: Probability density function of random variable X in the 'bus' example.

The probability that the random variable lies in the interval (a, b) is the area under the pdf between a and b:

P(a < X ≤ b) = F(b) − F(a)
             = ∫_{−∞}^{b} p(x) dx − ∫_{−∞}^{a} p(x) dx
             = ∫_{a}^{b} p(x) dx .

This integral is illustrated in Figure 8.3. The total area under the pdf equals 1, the probability that x takes on some value between −∞ and ∞. We can also confirm that F(∞) − F(−∞) = 1 − 0 = 1.

Figure 8.3: P(−2 < X ≤ 2) is the shaded area under the pdf.

8.2 The Gaussian distribution

The Gaussian (or Normal) distribution is the most commonly encountered (and most easily analysed) continuous distribution. It is also a reasonable model for many situations (the famous 'bell curve'). If a (scalar) variable has a Gaussian distribution, then it has a probability density function of this form:

p(x | µ, σ²) = N(x; µ, σ²) = (1/√(2πσ²)) exp( −(x − µ)² / (2σ²) ) .        (8.3)

The Gaussian is described by two parameters:

• the mean µ (location)
• the variance σ² (dispersion)

We write p(x | µ, σ²) because the pdf of x depends on the parameters. Sometimes (to slim down the notation) we simply write p(x).
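Equation (8.3) can be checked directly. A small Python sketch (the notes use Matlab; `gauss_pdf` is our name): it evaluates the density, confirms the peak height 1/√(2πσ²) at x = µ, and verifies with a crude Riemann sum that the density integrates to 1.

```python
import math

def gauss_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x; mu, sigma2), Eq. (8.3)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# The peak is at x = mu, with height 1/sqrt(2*pi*sigma2)
print(gauss_pdf(0.0, 0.0, 1.0))   # about 0.3989

# The density integrates to 1 (Riemann sum over +-6 standard deviations)
dx = 0.001
total = sum(gauss_pdf(-6 + i * dx, 0.0, 1.0) * dx for i in range(12001))
print(round(total, 3))            # 1.0
```

The sum stops at ±6 standard deviations because virtually all of the probability mass lies inside that range.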
Figure 8.4: One-dimensional Gaussian (µ = 0, σ² = 1).

Figure 8.5: Four Gaussian pdfs with zero mean and different standard deviations (σ = 0.25, 0.5, 1.0, 2.0).

All Gaussians have the same shape, with the location controlled by the mean and the dispersion (horizontal scaling) controlled by the variance.¹ Figure 8.4 shows a one-dimensional Gaussian with zero mean and unit variance (µ = 0, σ² = 1).

In Equation (8.3), the mean occurs in the exponential part of the pdf, exp(−0.5(x − µ)²/σ²). This term has its maximum (exp(0) = 1) when x = µ; thus the peak of the Gaussian corresponds to the mean, and we can think of it as the location parameter.

In one dimension, the variance can be thought of as controlling the width of the Gaussian pdf. Since the area under the pdf must equal 1, wide Gaussians have lower peaks than narrow Gaussians. This explains why the variance occurs twice in the formula for a Gaussian. In the exponential part exp(−0.5(x − µ)²/σ²), the variance parameter controls the width: for larger values of σ², the value of the exponential decreases more slowly as x moves away from the mean. The term 1/√(2πσ²) is the normalisation constant, which scales the whole pdf to ensure that it integrates to 1. This term decreases with σ²: hence as σ² decreases, the pdf gets a taller (but narrower) peak. The behaviour of the pdf with respect to the variance parameter is shown in Figure 8.5.

¹ To be precise, the width of a distribution scales with its standard deviation, σ, i.e. the square root of the variance.

8.3 Parameter estimation

A Gaussian distribution has two parameters: the mean (µ) and the variance (σ²). In machine learning or pattern recognition we are not given the parameters; we have to estimate them from data. As in the case of Naive Bayes, we can choose the parameters such that they maximise the likelihood of the model generating the training data. For the Gaussian distribution, the maximum likelihood estimates (MLE) of the mean and the variance² are:

µ̂ = (1/N) Σ_{n=1}^{N} x_n ,        (8.4)

σ̂² = (1/N) Σ_{n=1}^{N} (x_n − µ̂)² ,        (8.5)

where x_n denotes the feature value of the n'th sample, and N is the total number of samples.

² The maximum likelihood estimate of the variance turns out to be a biased form of the sample variance, which is normalised by N − 1 rather than N.

8.3.1 Maximum likelihood parameter estimation for the Gaussian distribution

The two formulas, Equation (8.4) and Equation (8.5), are very popular, but it is not good practice to just remember them without understanding how they are derived in the context of the Gaussian distribution.

We here consider parameter estimation as an optimisation problem:

max_{µ, σ²} p(x₁, . . . , x_N | µ, σ²) ,        (8.6)

where we try to find the µ and σ² that maximise the likelihood. Note that this likelihood is a function of µ and σ², and not of the data, since the data are fixed: they are given and do not change.
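Equations (8.4) and (8.5) translate directly into code. A Python sketch on toy data (the data and the name `mle_gaussian` are ours, not from the notes):

```python
def mle_gaussian(xs):
    """Maximum likelihood estimates of mean and variance, Eqs. (8.4)-(8.5)."""
    n = len(xs)
    mu = sum(xs) / n                            # Eq. (8.4)
    var = sum((x - mu) ** 2 for x in xs) / n    # Eq. (8.5), normalised by N
    return mu, var

data = [2.0, 4.0, 6.0]   # toy data
mu, var = mle_gaussian(data)
print(mu)                # 4.0
print(var)               # 8/3, about 2.667 (the biased MLE; the unbiased
                         # sample variance would divide by N - 1 = 2 instead)
```

Note the division by N in the variance: this is the biased MLE mentioned in the footnote, not the N − 1 sample variance.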
To make the optimisation problem tractable, we introduce the assumption that all the training samples are independent of each other, so that the optimisation problem simplifies to

max_{µ, σ²} p(x₁ | µ, σ²) · · · p(x_N | µ, σ²) .        (8.7)

Taking the natural log of the likelihood and denoting it by LL(µ, σ²):

LL(µ, σ²) = ln( p(x₁ | µ, σ²) · · · p(x_N | µ, σ²) )        (8.8)
          = Σ_{n=1}^{N} ln p(x_n | µ, σ²)        (8.9)
          = Σ_{n=1}^{N} ln( (1/√(2πσ²)) exp( −(x_n − µ)² / (2σ²) ) )        (8.10)
          = −(N/2) ln(2π) − (N/2) ln σ² − Σ_{n=1}^{N} (x_n − µ)² / (2σ²)        (8.11)

As we studied in Section 5.5 in Note 5, we can find the optimal parameters of this unconstrained optimisation problem by solving the following system of equations:

∂LL(µ, σ²)/∂µ = 0        (8.12)
∂LL(µ, σ²)/∂σ² = 0        (8.13)

You can easily confirm that Equation (8.4) and Equation (8.5) are the solutions.

8.4 Example

A pattern recognition problem has two classes, S and T. Some observations are available for each class:

Class S: 10 8 10 10 11 11
Class T: 12 9 15 10 13 13

We assume that each class may be modelled by a Gaussian. Using the above data, estimate the parameters of the Gaussian pdf for each class, and sketch the pdf for each class.

The mean and variance of each pdf are estimated with the MLE shown in Equation (8.4) and Equation (8.5):

µ̂_S = (10 + 8 + 10 + 10 + 11 + 11) / 6 = 10
σ̂²_S = ((10−10)² + (8−10)² + (10−10)² + (10−10)² + (11−10)² + (11−10)²) / 6 = 1
µ̂_T = (12 + 9 + 15 + 10 + 13 + 13) / 6 = 12
σ̂²_T = ((12−12)² + (9−12)² + (15−12)² + (10−12)² + (13−12)² + (13−12)²) / 6 = 4

The process of estimating the parameters from the training data is sometimes referred to as fitting the distribution to the data.

Figure 8.6: Estimated Gaussian pdfs for class S (µ̂ = 10, σ̂² = 1) and class T (µ̂ = 12, σ̂² = 4).

Figure 8.6 shows the pdfs for each class. The pdf for class T is twice the width of that for class S: the width of a distribution scales with its standard deviation, not its variance.

8.5 The multivariate Gaussian distribution and covariance

The univariate (one-dimensional) Gaussian may be extended to the multivariate (multi-dimensional) case. The D-dimensional Gaussian is parameterised by a mean vector, µ = (µ₁, . . . , µ_D)ᵀ, and a covariance matrix³, Σ = (σ_ij), and has the probability density

p(x | µ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) ) .        (8.14)

The 1-dimensional Gaussian is a special case of this pdf. The covariance matrix gives the variance of each variable (dimension) along the leading diagonal, and the off-diagonal elements measure the correlations between the variables. The argument to the exponential, (1/2)(x − µ)ᵀ Σ⁻¹ (x − µ), is an example of a quadratic form. |Σ| is the determinant of the covariance matrix Σ.

The mean vector µ is the expectation of x:

µ = E[x] .

The covariance matrix Σ is the expectation of the outer product of the deviation of x from the mean:

Σ = E[(x − µ)(x − µ)ᵀ] .        (8.15)

From Equation (8.15) it follows that Σ = (σ_ij) is a D×D symmetric matrix; that is, Σ = Σᵀ:

σ_ij = E[(x_i − µ_i)(x_j − µ_j)] = E[(x_j − µ_j)(x_i − µ_i)] = σ_ji .

³ Σ is a D-by-D square matrix, and σ_ij or Σ_ij denotes its element at the i'th row and j'th column.
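The worked estimates for classes S and T, and the claim that the MLE maximises the log-likelihood of Equation (8.11), can both be verified numerically. A Python sketch (`log_likelihood` is our name for LL(µ, σ²)):

```python
import math

def log_likelihood(xs, mu, var):
    """Gaussian log-likelihood LL(mu, sigma^2) of Eq. (8.11)."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * n * math.log(var)
            - sum((x - mu) ** 2 for x in xs) / (2 * var))

s = [10, 8, 10, 10, 11, 11]
t = [12, 9, 15, 10, 13, 13]
mu_s = sum(s) / len(s)                             # Eq. (8.4)
var_s = sum((x - mu_s) ** 2 for x in s) / len(s)   # Eq. (8.5)
mu_t = sum(t) / len(t)
var_t = sum((x - mu_t) ** 2 for x in t) / len(t)
print(mu_s, var_s, mu_t, var_t)   # 10.0 1.0 12.0 4.0

# The MLE maximises LL: nudging mu away from mu_s can only lower it
assert log_likelihood(s, mu_s, var_s) > log_likelihood(s, mu_s + 0.5, var_s)
```

The final assertion is a spot check of the stationarity conditions (8.12)–(8.13), not a proof; Exercise 3 asks for the full derivation.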
Note that σ_ij denotes not a standard deviation but the covariance between the i'th and j'th elements of x. For example, in the 1-dimensional case, σ₁₁ = σ².

It is helpful to consider how the covariance matrix may be interpreted. The sign of the covariance, σ_ij, helps to determine the relationship between two components:

• If x_j is large when x_i is large, then (x_j − µ_j)(x_i − µ_i) will tend to be positive;⁴
• If x_j is small when x_i is large, then (x_j − µ_j)(x_i − µ_i) will tend to be negative.

If two variables are highly correlated (large covariance) then this may indicate that one does not give much extra information if the other is known. If two components of the input vector, x_i and x_j, are statistically independent then the covariance between them is zero, σ_ij = 0.

Correlation coefficient  The values of the elements of the covariance matrix depend on the unit of measurement: consider the case when x is measured in metres, compared with when x is measured in millimetres. It is useful to define a measure of dispersion that is independent of the unit of measurement. To do this we may define the correlation coefficient⁵ between features x_i and x_j, ρ(x_i, x_j):

ρ(x_i, x_j) = ρ_ij = σ_ij / √(σ_ii σ_jj) .        (8.16)

The correlation coefficient ρ_ij is obtained by normalising the covariance σ_ij by the square root of the product of the variances σ_ii and σ_jj, and satisfies −1 ≤ ρ_ij ≤ 1:

ρ(x, y) = +1 if y = ax + b, a > 0
ρ(x, y) = −1 if y = ax + b, a < 0 .

The correlation coefficient is both scale-invariant and location- (or shift-) invariant, i.e.:

ρ(x_i, x_j) = ρ(a x_i + b, c x_j + d) ,        (8.17)

where a > 0 and c > 0, and b and d are arbitrary constants.

8.6 The 2-dimensional Gaussian distribution

Let's look at a two-dimensional case, with the following inverse covariance matrix⁶:

Σ⁻¹ = [ σ₁₁  σ₁₂ ]⁻¹ = (1 / (σ₁₁σ₂₂ − σ₁₂σ₂₁)) [  σ₂₂  −σ₁₂ ] = [ a₁₁  a₁₂ ] .
      [ σ₂₁  σ₂₂ ]                              [ −σ₂₁   σ₁₁ ]   [ a₁₂  a₂₂ ]

(Remember the covariance matrix is symmetric, so a₁₂ = a₂₁.) To avoid clutter, assume that µ = (0, 0)ᵀ; then the quadratic form is:

xᵀ Σ⁻¹ x = (x₁ x₂) [ a₁₁  a₁₂ ] [ x₁ ]
                   [ a₁₂  a₂₂ ] [ x₂ ]
         = (x₁ x₂) [ a₁₁x₁ + a₁₂x₂ ]
                   [ a₁₂x₁ + a₂₂x₂ ]
         = a₁₁x₁² + 2a₁₂x₁x₂ + a₂₂x₂² .

Thus we see that the argument to the exponential expands as a quadratic in D variables (D = 2 in this case).⁷

In the case of a diagonal covariance matrix:

Σ⁻¹ = [ σ₁₁  0  ]⁻¹ = [ 1/σ₁₁  0     ] = [ a₁₁  0  ] ,
      [ 0   σ₂₂ ]     [ 0      1/σ₂₂ ]   [ 0   a₂₂ ]

and the quadratic form has no cross terms:

xᵀ Σ⁻¹ x = a₁₁x₁² + a₂₂x₂² .

In the multidimensional case, the normalisation term in front of the exponential is 1/((2π)^{D/2} |Σ|^{1/2}). Recall that the determinant of a matrix can be regarded as a measure of its size; the dependence on the dimension reflects the fact that volume increases with dimension.

Consider a two-dimensional Gaussian with the following mean vector and covariance matrix:

µ = (0, 0)ᵀ    Σ = [ 1  0 ]
                   [ 0  1 ]

We refer to this as a spherical Gaussian, since the probability distribution has spherical (circular) symmetry. The covariance matrix is diagonal (so the off-diagonal correlations are 0), and the variances are equal (1). This pdf is illustrated in Figure 8.7a.

Now consider a two-dimensional Gaussian with the following mean vector and covariance matrix⁸:

µ = (0, 0)ᵀ    Σ = [ 1  0 ]
                   [ 0  4 ]

In this case the covariance matrix is again diagonal, but the variances are not equal. Thus the resulting pdf has an elliptical shape, as illustrated in Figure 8.7b.

Now consider the following two-dimensional Gaussian:

µ = (0, 0)ᵀ    Σ = [  1  −1 ]
                   [ −1   4 ]

In this case we have a full covariance matrix (the off-diagonal terms are non-zero). The resultant pdf is shown in Figure 8.7c.

⁴ Note that x_i in this section denotes the i'th element of a vector x (which is a vector of D random variables) rather than the i'th sample in a data set {x₁, . . . , x_N}.
⁵ This is normally referred to as 'Pearson's correlation coefficient'; another version, for sampled data, was discussed in Note 2.
⁶ The inverse covariance matrix is sometimes called the precision matrix.
⁷ Any covariance matrix is positive semi-definite, meaning xᵀΣx ≥ 0 for any real-valued vector x. The inverse of the covariance matrix, if it exists, is also positive semi-definite, i.e., xᵀΣ⁻¹x ≥ 0.
⁸ A covariance matrix whose off-diagonal elements are all zero, like those shown here, is called a 'diagonal covariance matrix', as opposed to a 'full covariance matrix', which has non-zero off-diagonal elements.
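The expansion xᵀΣ⁻¹x = a₁₁x₁² + 2a₁₂x₁x₂ + a₂₂x₂² can be sanity-checked numerically, e.g. for the full covariance matrix Σ = [1 −1; −1 4] of Section 8.6. A Python sketch (`inv2` is our helper, implementing the 2×2 inverse formula from that section):

```python
def inv2(m):
    """Inverse of a 2x2 matrix [[a, b], [c, d]] (assumes non-zero determinant)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

sigma = [[1.0, -1.0], [-1.0, 4.0]]   # full covariance matrix from Sec. 8.6
a = inv2(sigma)                      # precision matrix, entries a_ij

x1, x2 = 0.7, -1.3                   # arbitrary test point
# quadratic form via the expansion a11*x1^2 + 2*a12*x1*x2 + a22*x2^2
q_expanded = a[0][0] * x1**2 + 2 * a[0][1] * x1 * x2 + a[1][1] * x2**2
# same quadratic form computed as x^T A x
q_matrix = (x1 * (a[0][0] * x1 + a[0][1] * x2)
            + x2 * (a[1][0] * x1 + a[1][1] * x2))
print(abs(q_expanded - q_matrix) < 1e-12)   # True: the two forms agree
```

The quadratic form also comes out positive, as footnote 7's positive semi-definiteness requires.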
Figure 8.7: Surface and contour plots of 2-dimensional Gaussians with different covariance structures: (a) spherical Gaussian (diagonal covariance, equal variances); (b) Gaussian with diagonal covariance matrix; (c) Gaussian with full covariance matrix.

8.7 Parameter estimation

It is possible to show that the mean vector and covariance matrix that maximise the likelihood of the Gaussian generating the training data are given by:⁹

µ̂ = (1/N) Σ_{n=1}^{N} x_n        (8.18)

Σ̂ = (1/N) Σ_{n=1}^{N} (x_n − µ̂)(x_n − µ̂)ᵀ .        (8.19)

Alternatively, in a scalar representation:

µ̂_i = (1/N) Σ_{n=1}^{N} x_{ni} ,  for i = 1, . . . , D        (8.20)

σ̂_ij = (1/N) Σ_{n=1}^{N} (x_{ni} − µ̂_i)(x_{nj} − µ̂_j) ,  for i, j = 1, . . . , D .        (8.21)

As an example, consider the data points displayed in Figure 8.8a. To fit a Gaussian to these samples we compute the mean and covariance with the MLE. The resulting Gaussian is superimposed as a contour map on the training data in Figure 8.8b.

8.8 Bayes' theorem and Gaussians

Many of the rules for combining probabilities that were outlined at the start of the course are similar for probability density functions. For example, if x and y are continuous random variables, with probability density functions (pdfs) p(x) and p(y):

p(x, y) = p(x | y) p(y)        (8.22)

p(x) = ∫ p(x, y) dy ,        (8.23)

where p(x | y) is the pdf of x given y.

Indeed, we may mix probabilities of discrete variables and probability densities of continuous variables; for example, if x is continuous and z is discrete:

p(x, z) = p(x | z) P(z) .        (8.24)

Proving that this is so requires a branch of mathematics called measure theory.

We can thus write Bayes' theorem for continuous data x and discrete class k as:

P(C_k | x) = p(x | C_k) P(C_k) / p(x)
           = p(x | C_k) P(C_k) / Σ_{ℓ=1}^{K} p(x | C_ℓ) P(C_ℓ)        (8.25)

P(C_k | x) ∝ p(x | C_k) P(C_k)        (8.26)

⁹ Again, the covariance matrix estimated with the MLE is a biased estimator, rather than the unbiased estimator that is normalised by N − 1.
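Equations (8.18)–(8.21) take only a few lines to implement. A Python sketch on toy 2-dimensional data (the data and the name `fit_gaussian` are ours, not the data of Figure 8.8), which also confirms the symmetry σ̂_ij = σ̂_ji:

```python
def fit_gaussian(xs):
    """MLE mean vector and covariance matrix, Eqs. (8.18)-(8.19), for 2-D data."""
    n = len(xs)
    mu = [sum(x[i] for x in xs) / n for i in range(2)]
    sigma = [[sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in xs) / n
              for j in range(2)] for i in range(2)]
    return mu, sigma

data = [(0.0, 0.0), (2.0, 1.0), (4.0, 2.0), (2.0, 3.0)]   # toy data
mu, sigma = fit_gaussian(data)
print(mu)                           # [2.0, 1.5]
print(sigma[0][1] == sigma[1][0])   # True: the covariance matrix is symmetric
```

Each entry of `sigma` is the scalar form (8.21); stacking the outer products (x_n − µ̂)(x_n − µ̂)ᵀ instead gives the matrix form (8.19).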
The posterior probability of the class given the data is proportional to the probability density of the data times the prior probability of the class.

We can thus use Bayes' theorem for pattern recognition with continuous random variables.

If the pdf of a continuous random variable x given class k is represented as a Gaussian with mean µ_k and variance σ²_k, then we can write:¹⁰

P(C_k | x) ∝ p(x | C_k) P(C_k)
           ∝ N(x; µ_k, σ²_k) P(C_k)
           ∝ (1/√(2πσ²_k)) exp( −(x − µ_k)² / (2σ²_k) ) P(C_k) .        (8.27)

We call p(x | C_k) the likelihood of class k (given the observation x).

Figure 8.8: Fitting a Gaussian to a set of two-dimensional data samples. (a) Training data; (b) Estimated Gaussian.

¹⁰ The suffix k in µ_k and σ_k denotes the class number rather than the k'th element of a vector.

8.9 Appendix: Plotting Gaussians with Matlab

plotgauss1D is a function to plot a one-dimensional Gaussian with mean mu and variance sigma2:

function plotgauss1D(mu, sigma2)
% plot a 1-dimensional Gaussian with mean mu and variance sigma2
sd = sqrt(sigma2);          % std deviation
x = mu-3*sd:0.02:mu+3*sd;   % locations at which the pdf is evaluated
g = 1/(sqrt(2*pi)*sd)*exp(-0.5*(x-mu).^2/sigma2);
plot(x,g);

Recall that the standard deviation (SD) is the square root of the variance. About 0.68 of the probability mass of a Gaussian is within 1 SD (either side) of the mean, about 0.95 is within 2 SDs of the mean, and over 0.99 is within 3 SDs of the mean. Thus plotting a Gaussian for x ranging from µ − 3σ to µ + 3σ captures over 99% of the probability mass, and we take these as the ranges for the plot.

The following Matlab function plots two-dimensional Gaussians as a surface or a contour plot (and was used for the plots in the previous section). We could easily write it to take a (2-dimensional) mean vector and a 2×2 covariance matrix, but it can be convenient to write the covariance matrix in terms of the variances σ_jj and the correlation coefficient ρ_jk. Recall that:

ρ_jk = σ_jk / √(σ_jj σ_kk) .        (8.28)

Then we may write:

σ_jk = ρ_jk √σ_jj √σ_kk ,        (8.29)

where √σ_jj is the standard deviation of dimension j. The following code does the job:

function plotgauss2D(xmu, ymu, xvar, yvar, rho)
% make a contour plot and a surface plot of a 2D Gaussian
% xmu, ymu   - mean of x, y
% xvar, yvar - variance of x, y
% rho        - correlation coefficient between x and y
xsd = sqrt(xvar);   % std deviation on x axis
ysd = sqrt(yvar);   % std deviation on y axis
if (abs(rho) >= 1.0)
  disp('error: rho must lie between -1 and 1');
  return
end
covxy = rho*xsd*ysd;            % calculation of the covariance
C = [xvar covxy; covxy yvar];   % the covariance matrix
A = inv(C);                     % the inverse covariance matrix
% plot between +-2SDs along each dimension
maxsd = max(xsd,ysd);
x = xmu-2*maxsd:0.1:xmu+2*maxsd;   % locations at which x is evaluated
y = ymu-2*maxsd:0.1:ymu+2*maxsd;   % locations at which y is evaluated
[X, Y] = meshgrid(x,y);            % matrices used for plotting
% Compute value of Gaussian pdf at each point in the grid
z = 1/(2*pi*sqrt(det(C))) * ...
    exp(-0.5 * (A(1,1)*(X-xmu).^2 + 2*A(1,2)*(X-xmu).*(Y-ymu) + A(2,2)*(Y-ymu).^2));
surf(x,y,z);
figure;
contour(x,y,z);

The above code computes the vectors x and y over which the function will be plotted. meshgrid takes these vectors and forms the set of all pairs ([X, Y]) over which the pdf is to be evaluated. surf plots the function as a surface, and contour plots it as a contour map, or plan. You can use the Matlab help to find out more about plotting surfaces.

In the equation for the Gaussian pdf in plotgauss2D, because we are evaluating over points in a grid, we write out the quadratic form fully. More generally, if we want to evaluate a D-dimensional Gaussian pdf for a data point x, we can use a Matlab function like the following:

function y = evalgauss1(mu, covar, x)
% EVALGAUSS1 - evaluate a Gaussian pdf
% y = EVALGAUSS1(MU, COVAR, X) evaluates a multivariate Gaussian with
% mean MU and covariance COVAR for a data point X

[d, b] = size(covar);

% Check that the covariance matrix is square
if (d ~= b)
  error('Covariance matrix should be square');
end

% force MU and X into column vectors
mu = reshape(mu, d, 1);
x = reshape(x, d, 1);

% subtract the mean from the data point
x = x - mu;

invcovar = inv(covar);
y = 1/sqrt((2*pi)^d*det(covar)) * exp(-0.5*x'*invcovar*x);

However, for efficiency it is usually better to evaluate the Gaussian pdf for a set of data points together. The following function, from the Netlab toolbox, takes an n×d matrix x, where each row corresponds to a data point.

function y = gauss(mu, covar, x)
% Y = GAUSS(MU, COVAR, X) evaluates a multi-variate Gaussian density
% in D-dimensions at a set of points given by the rows of the matrix X.
% The Gaussian density has mean vector MU and covariance matrix COVAR.
%
% Copyright (c) Ian T Nabney (1996-2001)

[n, d] = size(x);
[j, k] = size(covar);

% Check that the covariance matrix is the correct dimension
if ((j ~= d) | (k ~= d))
  error('Dimension of the covariance matrix and data should match');
end

invcov = inv(covar);
mu = reshape(mu, 1, d);   % Ensure that mu is a row vector

x = x - ones(n, 1)*mu;    % Replicate mu and subtract from each data point
fact = sum(((x*invcov).*x), 2);

y = exp(-0.5*fact);
y = y./sqrt((2*pi)^d*det(covar));

Check that you understand how this function works. Note that sum(a,2) sums along the rows of matrix a to return a column vector of the row sums. (sum(a,1) sums down the columns to return a row vector.)
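The 68%/95%/99% figures quoted in the appendix can be confirmed with the Gaussian error function. In Python (`math.erf`; `mass_within` is our name), the probability mass within k SDs of the mean is erf(k/√2):

```python
import math

def mass_within(k):
    """Probability mass of a Gaussian within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

print(round(mass_within(1), 4))   # 0.6827
print(round(mass_within(2), 4))   # 0.9545
print(round(mass_within(3), 4))   # 0.9973
```

This is why plotting over µ ± 3σ, as plotgauss1D does, captures over 99% of the probability mass.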
Exercises

1. Draw a one-dimensional Gaussian distribution by hand as accurately as possible for µ = 3.0, σ² = 1.0. (You may use a calculator.)

2. Using a calculator, find the height (i.e. the maximum value) of a one-dimensional Gaussian distribution for σ² = 10, 1.0, 0.1, 0.01, 0.001. What will the height be as σ² → 0?

3. By solving the system of equations (8.12) and (8.13), confirm that the MLE for a Gaussian distribution is given by (8.4) and (8.5).

4. Confirm that the correlation coefficient defined in Equation (8.16) is the same as Pearson's correlation coefficient in Note 2.

5. Prove that the correlation coefficient is scale-invariant and location-invariant, as shown in Equation (8.17).

6. Consider a 2-dimensional Gaussian distribution with a mean vector µ = (µ₁, µ₂)ᵀ and a diagonal covariance matrix, i.e., Σ = [ σ₁₁ 0 ; 0 σ₂₂ ]. Show that its pdf can be simplified to the product of two pdfs, each of which corresponds to a one-dimensional Gaussian distribution:

   p(x | µ, Σ) = p(x₁ | µ₁, σ₁₁) p(x₂ | µ₂, σ₂₂)

7. For each of the Gaussian distributions shown in Figure 8.7, what type of correlation do x₁ and x₂ have: (i) a positive correlation, (ii) a negative correlation, or (iii) no correlation?
