Linear Discriminant Analysis Overview
g Limitations of LDA
g Variants of LDA
n Assume we obtain a scalar $y$ by projecting the samples $x$ onto a line, $y = w^T x$
n Of all the possible lines, we would like to select the one that maximizes the separability of the projected scalars
g This is illustrated for the two-dimensional case in the following figures
[Figure: two candidate projection directions for the same two-class data in the (x1, x2) plane]
Linear Discriminant Analysis, two-classes (2)
g In order to find a good projection vector, we need to define a measure
of separation between the projections
g The mean vector of each class in x and y feature space is
$$\mu_i = \frac{1}{N_i}\sum_{x \in \omega_i} x \qquad\text{and}\qquad \tilde{\mu}_i = \frac{1}{N_i}\sum_{y \in \omega_i} y = \frac{1}{N_i}\sum_{x \in \omega_i} w^T x = w^T \mu_i$$
g We could then choose the distance between the projected means as our objective
function
$$J(w) = \left|\tilde{\mu}_1 - \tilde{\mu}_2\right| = \left|w^T\left(\mu_1 - \mu_2\right)\right|$$
g However, the distance between the projected means is not a very good measure since it
does not take into account the standard deviation within the classes
[Figure: two-class data with means µ1 and µ2 projected onto two different axes, one of which yields better class separability]
g Fisher's solution is to maximize the distance between the projected means normalized by the within-class scatter of the projections, $\tilde{s}_i^2 = \sum_{y \in \omega_i}\left(y - \tilde{\mu}_i\right)^2$, i.e. to maximize the Fisher criterion
$$J(w) = \frac{\left|\tilde{\mu}_1 - \tilde{\mu}_2\right|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
g To express $J(w)$ as an explicit function of $w$, we first define the scatter matrices in x-space
$$S_i = \sum_{x \in \omega_i}\left(x - \mu_i\right)\left(x - \mu_i\right)^T \qquad\qquad S_1 + S_2 = S_W$$
g where SW is called the within-class scatter matrix
n The scatter of the projection y can then be expressed as a function of the scatter matrix in
feature space x
$$\tilde{s}_i^2 = \sum_{y \in \omega_i}\left(y - \tilde{\mu}_i\right)^2 = \sum_{x \in \omega_i}\left(w^T x - w^T \mu_i\right)^2 = \sum_{x \in \omega_i} w^T\left(x - \mu_i\right)\left(x - \mu_i\right)^T w = w^T S_i w$$
$$\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w$$
n Similarly, the difference between the projected means can be expressed in terms of the means
in the original feature space
$$\left(\tilde{\mu}_1 - \tilde{\mu}_2\right)^2 = \left(w^T\mu_1 - w^T\mu_2\right)^2 = w^T\left(\mu_1 - \mu_2\right)\left(\mu_1 - \mu_2\right)^T w = w^T S_B w$$
g The matrix SB is called the between-class scatter. Note that, since SB is the outer product of two vectors,
its rank is at most one
n We can finally express the Fisher criterion in terms of SW and SB as
$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$
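To make these quantities concrete, here is a minimal NumPy sketch (made-up two-class samples and an arbitrary direction w, both purely illustrative) that checks the identities above and evaluates the Fisher criterion J(w) for one particular w.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 2D samples for two classes (illustrative only)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(20, 2))
X2 = rng.normal(loc=[4.0, 3.0], scale=1.0, size=(20, 2))
w = np.array([1.0, 2.0])                    # arbitrary projection direction

def scatter(X):
    """S_i = sum over the class of (x - mu_i)(x - mu_i)^T."""
    d = X - X.mean(axis=0)
    return d.T @ d

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
SW = scatter(X1) + scatter(X2)              # within-class scatter matrix
SB = np.outer(mu1 - mu2, mu1 - mu2)         # between-class scatter matrix

for X, mu in ((X1, mu1), (X2, mu2)):
    y = X @ w                               # projected samples y = w^T x
    assert np.isclose(y.mean(), w @ mu)     # projected mean  = w^T mu_i
    assert np.isclose(np.sum((y - y.mean())**2), w @ scatter(X) @ w)

# Fisher criterion for this particular w
J = (w @ SB @ w) / (w @ SW @ w)
print("J(w) =", J)
```

Maximizing this ratio over w is exactly what the derivation on the next slide carries out in closed form.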
Linear Discriminant Analysis, two-classes (5)
n To find the maximum of J(w) we differentiate with respect to w and equate to zero
$$\frac{d}{dw}\left[J(w)\right] = \frac{d}{dw}\left[\frac{w^T S_B w}{w^T S_W w}\right] = 0 \;\Rightarrow$$
$$\Rightarrow\; \left[w^T S_W w\right]\frac{d\left[w^T S_B w\right]}{dw} - \left[w^T S_B w\right]\frac{d\left[w^T S_W w\right]}{dw} = 0 \;\Rightarrow$$
$$\Rightarrow\; \left[w^T S_W w\right] 2 S_B w - \left[w^T S_B w\right] 2 S_W w = 0$$
n Dividing by $w^T S_W w$
$$\left[\frac{w^T S_W w}{w^T S_W w}\right] S_B w - \left[\frac{w^T S_B w}{w^T S_W w}\right] S_W w = 0 \;\Rightarrow$$
$$\Rightarrow\; S_B w - J\, S_W w = 0 \;\Rightarrow\; S_W^{-1} S_B w - J\, w = 0$$
n Solving this generalized eigenvalue problem yields
$$w^* = \operatorname*{argmax}_{w}\;\frac{w^T S_B w}{w^T S_W w} = S_W^{-1}\left(\mu_1 - \mu_2\right)$$
g This is known as Fisher's Linear Discriminant (1936), although it is not a discriminant itself but rather a specific choice of direction for projecting the data down to one dimension
n Example: the samples of class ω2 are X2 = (x1, x2) = {(9,10), (6,8), (9,5), (8,7), (10,8)}
g SOLUTION (by hand)
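As a numerical cross-check of the by-hand solution, a small sketch of the closed-form result w* = SW⁻¹(µ1 − µ2) is given below. Only the ω2 samples appear above, so the ω1 samples used here are assumed placeholder values and the printed direction is illustrative rather than the answer to the original exercise.

```python
import numpy as np

# omega_2 samples from the example; omega_1 samples are assumed placeholders
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)
X1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)   # assumed

def scatter(X):
    d = X - X.mean(axis=0)
    return d.T @ d

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
SW = scatter(X1) + scatter(X2)             # within-class scatter S1 + S2
w = np.linalg.solve(SW, mu1 - mu2)         # w* = SW^{-1} (mu1 - mu2)
w /= np.linalg.norm(w)                     # the scale of w is arbitrary

print("LDA direction w*:", w)
print("projected class means:", (X1 @ w).mean(), (X2 @ w).mean())
```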
Linear Discriminant Analysis, C-classes
g For C classes, the within-class and between-class scatter matrices generalize to
$$S_W = \sum_{i=1}^{C} S_i \qquad\qquad S_B = \sum_{i=1}^{C} N_i\left(\mu_i - \mu\right)\left(\mu_i - \mu\right)^T$$
$$\text{where}\quad \mu = \frac{1}{N}\sum_{\forall x} x = \frac{1}{N}\sum_{i=1}^{C} N_i\,\mu_i$$
[Figure: a multi-class example in the (x1, x2) plane showing the per-class within-class scatters]
g where ST=SB+SW is called the total scatter matrix
n Similarly, we define the mean and the scatter matrices of the projected samples $y = W^T x$
$$\tilde{\mu}_i = \frac{1}{N_i}\sum_{y \in \omega_i} y \qquad\qquad \tilde{\mu} = \frac{1}{N}\sum_{\forall y} y \qquad\qquad \tilde{S}_B = \sum_{i=1}^{C} N_i\left(\tilde{\mu}_i - \tilde{\mu}\right)\left(\tilde{\mu}_i - \tilde{\mu}\right)^T$$
n These can be written in terms of the scatter matrices in the original feature space as
$$\tilde{S}_W = W^T S_W W \qquad\qquad \tilde{S}_B = W^T S_B W$$
n Recall that we are looking for a projection that maximizes the ratio of between-class to
within-class scatter. Since the projection is no longer a scalar (it has C-1 dimensions), we
then use the determinant of the scatter matrices to obtain a scalar objective function:
$$J(W) = \frac{\left|\tilde{S}_B\right|}{\left|\tilde{S}_W\right|} = \frac{\left|W^T S_B W\right|}{\left|W^T S_W W\right|}$$
n And we will seek the projection matrix W* that maximizes this ratio
$$W^* = \left[\,w_1^* \,\middle|\, w_2^* \,\middle|\, \cdots \,\middle|\, w_{C-1}^*\,\right] = \operatorname*{argmax}_{W}\;\frac{\left|W^T S_B W\right|}{\left|W^T S_W W\right|} \;\Rightarrow\; \left(S_B - \lambda_i S_W\right) w_i^* = 0$$
g NOTES
n SB is the sum of C matrices of rank one or less, and the C mean vectors are tied together by the single constraint
$$\mu = \frac{1}{N}\sum_{i=1}^{C} N_i\,\mu_i$$
g Therefore, SB will be of rank (C-1) or less
g This means that only (C-1) of the eigenvalues λi will be non-zero
n The projections with maximum class separability information are the eigenvectors corresponding to the largest eigenvalues of $S_W^{-1} S_B$ (a numerical sketch follows these notes)
n LDA can be derived as the Maximum Likelihood method for the case of normal class-
conditional densities with equal covariance matrices
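As referenced above, here is a minimal sketch of the C-class computation on made-up 3-class data (an illustrative assumption), solving for the eigenvectors of SW⁻¹SB and showing that only C − 1 = 2 eigenvalues are numerically non-zero.

```python
import numpy as np

rng = np.random.default_rng(1)
C, D, n = 3, 4, 30                          # 3 classes, 4-D features, 30 samples each
means = np.array([[0, 0, 0, 0],             # made-up class means (illustrative only)
                  [3, 1, 0, 0],
                  [0, 3, 1, 0]], dtype=float)
classes = [rng.normal(m, 1.0, size=(n, D)) for m in means]

mu = np.vstack(classes).mean(axis=0)        # overall mean
SW = np.zeros((D, D))
SB = np.zeros((D, D))
for X in classes:
    mu_i = X.mean(axis=0)
    d = X - mu_i
    SW += d.T @ d                                    # within-class scatter
    SB += len(X) * np.outer(mu_i - mu, mu_i - mu)    # between-class scatter

# Eigen-decomposition of SW^{-1} SB, sorted by decreasing eigenvalue
evals, evecs = np.linalg.eig(np.linalg.solve(SW, SB))
order = np.argsort(evals.real)[::-1]
evals, evecs = evals.real[order], evecs.real[:, order]

print("eigenvalues:", np.round(evals, 4))   # only C - 1 = 2 are non-zero
W = evecs[:, :C - 1]                        # projection matrix W* (one column per w_i*)
Y = np.vstack(classes) @ W                  # samples projected to C - 1 dimensions
```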
g Case study: a gas sensor array measuring samples from five classes (Sulawesi, Kenya, Arabian, Sumatra, Colombia)
n The response of the gas sensor array was processed in order to obtain a 60-dimensional feature vector
g Results
n From the 3D scatter plots it is clear that LDA outperforms PCA in terms of class discrimination
n This is one example where the discriminatory information is not aligned with the direction of maximum variance
[Figure: raw sensor response traces, and 3D scatter plots of the data projected onto the first three PCA axes and the first three LDA axes]
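The same qualitative effect can be reproduced on synthetic data. In the sketch below (scikit-learn is assumed to be available; the actual sensor data are not reproduced here), a shared high-variance direction dominates the first PCA axis, while the class difference lies along a low-variance direction that LDA recovers.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
n = 200
# Both classes share a large variance along x1; they differ only slightly along x2
X1 = np.column_stack([rng.normal(0, 10, n), rng.normal(0.0, 1, n)])
X2 = np.column_stack([rng.normal(0, 10, n), rng.normal(3.0, 1, n)])
X = np.vstack([X1, X2])
y = np.array([0] * n + [1] * n)

pca_axis = PCA(n_components=1).fit(X).components_[0]                # ignores labels
lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
lda_axis = lda.scalings_[:, 0] / np.linalg.norm(lda.scalings_[:, 0])

print("PCA axis:", np.round(pca_axis, 2))   # ~ [1, 0]: direction of maximum variance
print("LDA axis:", np.round(lda_axis, 2))   # ~ [0, 1]: direction of class separation
```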
Limitations of LDA
g LDA will fail when the discriminatory information is not in the mean but rather in the variance of the data
[Figure: two classes ω1 and ω2 with equal means (µ1 = µ2 = µ) but different spread, and the projection directions labeled PCA and LDA for such data in the (x1, x2) plane]
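This failure mode is easy to reproduce with made-up data (an illustrative sketch, not from the original lecture): when the class means coincide, the Fisher direction SW⁻¹(µ1 − µ2) is essentially noise, while a simple variance-based feature still separates the classes.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
X1 = rng.normal(0.0, 0.5, size=(n, 2))     # class 1: zero mean, small spread
X2 = rng.normal(0.0, 3.0, size=(n, 2))     # class 2: zero mean, large spread

def scatter(X):
    d = X - X.mean(axis=0)
    return d.T @ d

SW = scatter(X1) + scatter(X2)
w = np.linalg.solve(SW, X1.mean(axis=0) - X2.mean(axis=0))
print("w* (driven only by sampling noise, since mu1 - mu2 ~ 0):", w)

# A variance-based feature such as ||x||^2 does separate the two classes
r1, r2 = np.sum(X1**2, axis=1), np.sum(X2**2, axis=1)
print("mean ||x||^2 per class:", r1.mean(), r2.mean())
```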
[Figure: examples of an 'uninteresting' and an 'interesting' low-dimensional projection of the same data]
g Sammon's mapping seeks a lower-dimensional configuration of points P'i that preserves the pairwise distances d(Pi, Pj) of the original points Pi, by minimizing the stress measure
$$E(d, d') = \sum_{i \neq j} \frac{\left[d(P_i, P_j) - d(P'_i, P'_j)\right]^2}{d(P_i, P_j)}$$
n The original method did not obtain an explicit mapping but only a lookup table for the elements in
the training set
n Recent implementations using artificial neural networks (MLPs and RBFs) do provide an explicit
mapping for test data and also consider cost functions (Neuroscale)
n Sammon’s mapping is closely related to Multi-Dimensional Scaling (MDS), a family of multivariate
statistical methods commonly used in the social sciences
[Figure: a point Pj in the original space (x1, x2, x3) and its image P'j in the two-dimensional map]
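To make the stress measure concrete, the sketch below evaluates E(d, d') for a candidate two-dimensional configuration, here simply the PCA scores of some made-up 5-D points (an actual Sammon mapping would go on to adjust the mapped points iteratively so as to reduce this stress).

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 5))                # made-up points P_i in 5 dimensions

# Candidate 2-D configuration P'_i: the first two principal components
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Y = Xc @ Vt[:2].T

def sammon_stress(X, Y):
    """E(d, d') = sum over pairs of (d_ij - d'_ij)^2 / d_ij
    (pdist lists each unordered pair once; the i != j sum merely doubles E)."""
    d = pdist(X)                            # original distances d(P_i, P_j)
    d_prime = pdist(Y)                      # mapped distances d(P'_i, P'_j)
    return np.sum((d - d_prime) ** 2 / d)

print("Sammon stress of the PCA configuration:", sammon_stress(X, Y))
```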