Fisher’s linear discriminant can be used as a supervised
learning classifier. Given labeled data, the classifier finds
a set of weights that defines a decision boundary for
classifying the data. Fisher’s linear discriminant
attempts to find the vector that maximizes the
separation between the classes of the projected
data. “Separation” on its own is ambiguous, so Fisher’s
linear discriminant uses a concrete criterion: maximize the
distance between the projected class means while minimizing
the projected within-class variance.
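In practice a ready-made implementation can be used instead of coding the discriminant by hand. Below is a minimal sketch assuming scikit-learn; the data points and labels are placeholders for illustration, not values from this article.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Placeholder labeled data: X holds the feature vectors, y the class labels.
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([0, 0, 1, 1])

clf = LinearDiscriminantAnalysis()  # linear discriminant analysis classifier
clf.fit(X, y)                       # learns the weight vector from the labeled data
print(clf.predict([[1.2, 1.9]]))    # predicts the class of a new point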
For example:
Here are two bivariate Gaussians with identical covariance
matrices and distinct means. We want to find the vector
that best separates the projections of the data. Let’s
start by drawing a random vector and plotting the projections.
Remember, we are looking at the projections of the data onto
the vector (the dot product of the data matrix and the
weight vector), not at a decision boundary. The projections
of the data onto this random weight vector can be plotted
as a histogram (image on the right). As you can see, when
the data are projected onto this vector and plotted as a
histogram, the two classes are not well separated.
The goal is to find the direction that best separates the two
distributions in the image on the right.
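Here is a minimal sketch of this setup, assuming NumPy and Matplotlib; the means, covariance matrix, and sample sizes are illustrative placeholders, not the exact values behind the figures.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
mean1, mean2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])  # distinct means (placeholders)
cov = np.array([[1.0, 0.5], [0.5, 1.0]])                   # identical covariance for both classes
X1 = rng.multivariate_normal(mean1, cov, 200)              # class 1 samples
X2 = rng.multivariate_normal(mean2, cov, 200)              # class 2 samples

w = rng.normal(size=2)
w /= np.linalg.norm(w)        # random unit weight vector

p1, p2 = X1 @ w, X2 @ w       # 1-D projections (dot products) of each class onto w
plt.hist(p1, bins=30, alpha=0.5, label="class 1")
plt.hist(p2, bins=30, alpha=0.5, label="class 2")
plt.legend()
plt.show()                    # the two histograms overlap heavily for a random direction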
To separate the two distributions, we could first try to
maximize the distance between the projected means, so that
the two distributions are, on average, as far apart as
possible. Let’s draw a line through the two means and plot
the histogram of the projections onto that line.
[Figure: histogram of the projections onto the line through the two class means. Image by Author]
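Continuing the sketch above, projecting onto the direction joining the two sample means looks roughly like this (again an illustrative sketch, reusing the placeholder X1 and X2).

w_means = X2.mean(axis=0) - X1.mean(axis=0)   # direction from class-1 mean to class-2 mean
w_means /= np.linalg.norm(w_means)

p1, p2 = X1 @ w_means, X2 @ w_means
plt.hist(p1, bins=30, alpha=0.5, label="class 1")
plt.hist(p2, bins=30, alpha=0.5, label="class 2")
plt.legend()
plt.show()                    # better separated, but the histograms still overlap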
That’s quite a bit better, but the projections of the data are
still not fully separated. To separate them further, Fisher’s
linear discriminant minimizes the within-class variance of
the projections while simultaneously maximizing the distance
between the projected means. It pushes the projected means
apart, as discussed above, but also makes each projected
distribution as tight as possible. This allows for better
separation, as you’ll see below.
[Figure: histogram of the projections onto the Fisher discriminant direction. Image by Author]
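For two classes, the direction that maximizes Fisher's criterion can be computed in closed form as w proportional to S_W^{-1} (m2 - m1), where S_W is the within-class scatter matrix and m1, m2 are the class means. A sketch, continuing with the placeholder data above:

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter matrix: sum of the per-class scatter matrices.
S_W = np.cov(X1, rowvar=False) * (len(X1) - 1) + np.cov(X2, rowvar=False) * (len(X2) - 1)

w_fisher = np.linalg.solve(S_W, m2 - m1)   # w proportional to S_W^{-1} (m2 - m1)
w_fisher /= np.linalg.norm(w_fisher)

p1, p2 = X1 @ w_fisher, X2 @ w_fisher
plt.hist(p1, bins=30, alpha=0.5, label="class 1")
plt.hist(p2, bins=30, alpha=0.5, label="class 2")
plt.legend()
plt.show()                    # the projected classes are now well separated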
As you can see, the projections of the data are now well
separated. We can take a vector orthogonal to the weight
vector to define a decision boundary: data on one side of
the boundary is predicted to be one class, and data on the
other side the other class. For multivariate Gaussian
distributions with identical covariance matrices, this
yields an optimal classifier.
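As a sketch of the resulting classifier (continuing the placeholder example), a simple threshold on the projection can serve as the decision boundary; placing it at the midpoint of the projected means assumes equal class priors.

# Decision threshold on the 1-D projection, halfway between the projected class means.
threshold = 0.5 * (m1 @ w_fisher + m2 @ w_fisher)

def predict(x):
    # Points projecting beyond the threshold fall on class 2's side of the boundary.
    return 2 if x @ w_fisher > threshold else 1

print(predict(mean1))   # expected: 1
print(predict(mean2))   # expected: 2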
In symbols:
Π_w = the hyperplane (the decision boundary), defined by w^T x = 0
w = the weight vector the data are projected onto
x = the data points (black and red)
xi' = w^T xi = the projection of the point xi onto w
D = {(xi, yi)} = the labeled data set
u1' and u2' = the means of the projected classes 1 and 2
s1'^2 and s2'^2 = the within-class variances (scatter) of the projected classes 1 and 2, i.e. how spread out the projected points are around their class mean
J(w) = the Fisher discriminant criterion to maximize:
J(w) = (u1' - u2')^2 / (s1'^2 + s2'^2)
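To connect the formula back to the sketches above, the criterion can be evaluated for any candidate direction; the Fisher direction should score higher than the random one (illustrative code, reusing the placeholder variables).

def fisher_criterion(w, X1, X2):
    # Projected class means and variances; the ratio is Fisher's J(w).
    p1, p2 = X1 @ w, X2 @ w
    return (p1.mean() - p2.mean()) ** 2 / (p1.var() + p2.var())

print(fisher_criterion(w, X1, X2))          # random direction: small J(w)
print(fisher_criterion(w_fisher, X1, X2))   # Fisher direction: larger J(w)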