INTRODUCTION TO MACHINE LEARNING - 20EC6404C

Topic 2.8: Topics that come under Discriminant Functions


Unit - II
Faculty: Turimerla Pratap

Least squares for classification


We considered models that were linear functions of the parameters, and we saw that the mini-
mization of a sum-of-squares error function led to a simple closed-form solution for the parameter
values. One justification for using least squares in such a context is that it approximates the condi-
tional expectation E[t|x] of the target values given the input vector.

Each class Ck is described by its own linear model, so that

y_k(x) = w_k^T x + w_{k0}

where k = 1, . . . , K. We can conveniently group these together using vector notation so that

y(x) = W^T x,

where W is a matrix whose kth column comprises the (D + 1)-dimensional vector w_k, and x is the corresponding augmented input vector with a dummy input x_0 = 1. A new input x is then assigned to the class for which the output y_k = w_k^T x is largest.
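
As a minimal sketch of this decision rule (assuming NumPy and an already-fitted parameter matrix W; the helper classify is purely illustrative):

import numpy as np

def classify(W, x):
    """Assign x to the class with the largest linear output y_k = w_k^T x.

    W : (D + 1, K) parameter matrix whose kth column is w_k (bias in row 0).
    x : (D,) input vector; a dummy input x0 = 1 is prepended here.
    """
    x_tilde = np.concatenate(([1.0], x))   # augmented input vector
    y = W.T @ x_tilde                      # y_k(x) for k = 1, ..., K
    return int(np.argmax(y))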

We now determine the parameter matrix W by minimizing a sum-of-squares error function.


Consider a training dataset {x_n, t_n} where n = 1, . . . , N, and define a matrix T whose nth row is the vector t_n^T, together with a matrix X whose nth row is x_n^T. The sum-of-squares error function can then be written as

E_D(W) = (1/2) Tr{ (XW − T)^T (XW − T) }.
Setting the derivative with respect to W to zero and rearranging, we then obtain the solution for
W in the form

W = (X^T X)^{-1} X^T T = X^† T,

where X^† is the pseudo-inverse of the matrix X. We then obtain the discriminant function in the form

y(x) = W^T x = T^T (X^†)^T x.

An interesting property of least-squares solutions with multiple target variables is that if every
target vector in the training set satisfies some linear constraint aT tn + b = 0 for some constants a
and b, then the model prediction for any value of x will satisfy the same constraint, so that

aT y(x) + b = 0

Thus, if we use a 1-of-K coding scheme for K classes, then the predictions made by the model
will have the property that the elements of y(x) will sum to 1 for any value of x. However, this
summation constraint alone is not sufficient to allow the model outputs to be interpreted as proba-
bilities because they are not constrained to lie within the interval (0, 1).

The least-squares approach gives an exact closed-form solution for the discriminant function parameters. However, even as a discriminant function it may perform poorly in practice for classification problems: the sum-of-squares error corresponds to maximum likelihood under a Gaussian noise model, which is a poor match for binary target vectors, and the resulting solutions lack robustness to outliers.
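
A minimal sketch of this closed-form fit (assuming NumPy; the helper fit_least_squares and the toy data are purely illustrative):

import numpy as np

def fit_least_squares(X, T):
    """Least-squares classification: W = pinv(X_tilde) @ T.

    X : (N, D) inputs; T : (N, K) 1-of-K target matrix.
    Returns W of shape (D + 1, K), bias row first.
    """
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])  # dummy input x0 = 1
    return np.linalg.pinv(X_tilde) @ T                   # W = X^dagger T

# Toy usage: two Gaussian blobs, K = 2 classes with 1-of-K coded targets.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
T = np.zeros((100, 2))
T[:50, 0] = 1.0   # class 1 targets (1, 0)
T[50:, 1] = 1.0   # class 2 targets (0, 1)

W = fit_least_squares(X, T)
y = np.hstack([1.0, X[0]]) @ W   # model outputs for one input
print(y.sum())                   # approx 1, but entries need not lie in (0, 1)
print(np.argmax(y))              # predicted class index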

Fisher’s linear discriminant


One way to view a linear classification model is in terms of dimensionality reduction. Consider
first the case of two classes, and suppose we take the D-dimensional input vector x and project it
down to one dimension using y = wT x. If we place a threshold on y and classify y ≥ −w0 as
class C1 , and otherwise class C2 , then we obtain our standard linear classifier.

In general, the projection onto one dimension leads to a considerable loss of information, and
classes that are well separated in the original D-dimensional space may become strongly overlap-
ping in one dimension. However, by adjusting the components of the weight vector w, we can
select a projection that maximizes the class separation. To begin with, consider a two-class prob-
lem in which there are N1 points of class C1 and N2 points of class C2 , so that the mean vectors of
the two classes are given by
m_1 = (1/N_1) ∑_{n∈C_1} x_n,        m_2 = (1/N_2) ∑_{n∈C_2} x_n.

The simplest measure of the separation of the classes when projected onto w is the separation
of the projected class means. This suggests that we might choose w so as to maximize

m_2 − m_1 = w^T (m_2 − m_1),

where m_k = w^T m_k is the mean of the projected data from class C_k. However, this expression can be made arbitrarily large simply by increasing the magnitude of w. To solve this problem, we could constrain w to have unit length, so that ||w||^2 = 1. Using a Lagrange multiplier to perform the constrained maximization, we then find that w ∝ (m_2 − m_1). This choice maximizes the separation of the projected means but ignores the spread of the projected data within each class, so classes that are well separated in the original space can still overlap after projection. Fisher’s approach takes this within-class spread into account through the total within-class covariance matrix

S_W = ∑_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + ∑_{n∈C_2} (x_n − m_2)(x_n − m_2)^T,

and leads (as derived below) to the direction

w ∝ S_W^{-1} (m_2 − m_1),

which is known as Fisher’s linear discriminant and gives the direction for the projection of the data.

The projection formula that transforms the set of labeled data points in x into a labeled set in
the one-dimensional space y is given by:

y = w^T x.

The within-class variance of the transformed data from class Ck is defined as:

s_k^2 = ∑_{n∈C_k} (y_n − m_k)^2,

where y_n = w^T x_n. We can define the total within-class variance for the whole dataset as:

s_1^2 + s_2^2.

The Fisher criterion is defined as the ratio of the between-class variance to the within-class
variance and is given by:
J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2).
By rewriting the Fisher criterion using the projection formula and the within-class variances,
we have:
J(w) = (w^T S_B w) / (w^T S_W w),

where S_B is the between-class covariance matrix defined as:

S_B = (m_2 − m_1)(m_2 − m_1)^T.

To find the weight vector w that maximizes J(w), we differentiate J(w) with respect to w and set the result to zero. This leads to the equation

(w^T S_B w) S_W w = (w^T S_W w) S_B w.

From the definition of S_B we see that S_B w is always in the direction of (m_2 − m_1). Since only the direction of w matters, we can drop the scalar factors and multiply both sides by S_W^{-1} to conclude that the optimal weight vector is proportional to S_W^{-1} (m_2 − m_1).
Finally, we can use the projected data and a threshold y0 to construct a discriminant. A new
vector x can be classified as belonging to class C1 if y(x) ≥ y0 , and as belonging to class C2
otherwise. The threshold y0 can be determined by modeling the class-conditional densities p(y|Ck )
using Gaussian distributions and using maximum likelihood estimation to find the parameters of
the Gaussian distributions.
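
A minimal sketch of the two-class procedure (assuming NumPy; the helper fisher_direction is hypothetical, and for simplicity the threshold below is the midpoint of the projected class means rather than the result of a full Gaussian maximum-likelihood fit):

import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction w ∝ S_W^{-1} (m2 - m1).

    X1, X2 : arrays of shape (N1, D) and (N2, D) holding the two classes.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class covariance
    w = np.linalg.solve(S_W, m2 - m1)                          # only the direction matters
    return w / np.linalg.norm(w)

# Hypothetical usage: project and classify with a simple midpoint threshold.
rng = np.random.default_rng(1)
X1 = rng.normal([0, 0], 1.0, (60, 2))
X2 = rng.normal([4, 2], 1.0, (40, 2))
w = fisher_direction(X1, X2)
y0 = 0.5 * (w @ X1.mean(axis=0) + w @ X2.mean(axis=0))  # threshold between projected means
x_new = np.array([3.5, 1.5])
label = 'C2' if w @ x_new >= y0 else 'C1'

With w oriented from m_1 toward m_2 as above, projections of class C2 tend to lie above the threshold; flipping the sign of w recovers the convention in the text, where y(x) ≥ y0 is assigned to class C1.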

Relation to least squares
The least squares approach and the Fisher criterion are related for the two-class problem. By
adopting a different target coding scheme, the least squares solution can be shown to be equivalent
to the Fisher solution.
In the least squares approach, we minimize the sum-of-squares error function given by:

E = (1/2) ∑_{n=1}^{N} (w^T x_n + w_0 − t_n)^2,

where w is the weight vector, w0 is the bias, xn are the input vectors, and tn are the target values.
Setting the derivatives of E with respect to w0 and w to zero, we obtain the following equations:


∑_{n=1}^{N} (w^T x_n + w_0 − t_n) = 0,        ∑_{n=1}^{N} (w^T x_n + w_0 − t_n) x_n = 0.

Using a specific target coding scheme where the targets for class C1 are N/N1 and the targets
for class C2 are −N/N2 , we can simplify the above equations. The bias can be expressed as:

w_0 = −w^T m,

where m is the mean of the total dataset.


By substituting the bias expression into the second equation and performing some algebraic manipulations, we arrive at:

(S_W + (N_1 N_2 / N) S_B) w = N (m_1 − m_2),

where S_W is the total within-class covariance matrix and S_B is the between-class covariance matrix defined earlier. Since S_B w is always in the direction of (m_2 − m_1), it follows that w is proportional to S_W^{-1} (m_2 − m_1). Therefore, up to an irrelevant scale factor, the weight vector obtained from the least squares approach coincides with the Fisher solution.
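
A small numerical check of this equivalence (assuming NumPy; the data below are synthetic and purely illustrative):

import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal([0, 0], 1.0, (30, 2))          # class C1 samples
X2 = rng.normal([3, 1], 1.0, (70, 2))          # class C2 samples
N1, N2 = len(X1), len(X2)
N = N1 + N2

# Least squares with targets N/N1 for class C1 and -N/N2 for class C2.
X = np.vstack([X1, X2])
t = np.concatenate([np.full(N1, N / N1), np.full(N2, -N / N2)])
X_aug = np.hstack([np.ones((N, 1)), X])        # prepend the bias column
w0, *w_ls = np.linalg.lstsq(X_aug, t, rcond=None)[0]
w_ls = np.array(w_ls)

# Fisher direction S_W^{-1} (m2 - m1).
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
w_fisher = np.linalg.solve(S_W, m2 - m1)

# The two directions agree up to sign and scale, so |cos| should be close to 1.
cos = w_ls @ w_fisher / (np.linalg.norm(w_ls) * np.linalg.norm(w_fisher))
print(abs(cos))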

Fisher’s discriminant for multiple classes


Fisher’s linear discriminant can also be extended to handle multiple classes. The objective is to find
a projection that maximizes the class separability while minimizing the within-class scatter. The
resulting discriminant is known as Fisher’s discriminant or Fisher’s linear discriminant analysis.

Let us consider a problem with K classes. We want to find one or more projection vectors (which can be grouped into the columns of a matrix W) that map the original D-dimensional input space to a lower-dimensional space in which the projected data can be effectively separated into their respective classes.

The between-class scatter matrix S_B is defined as a weighted sum of scatter matrices, one per class, each measuring the deviation of that class mean from the overall mean. Mathematically, S_B is given by:

S_B = ∑_{k=1}^{K} N_k (m_k − m)(m_k − m)^T,

where N_k is the number of samples in class k, m_k is the mean of class k, and m is the overall mean of the data.

The within-class scatter matrix SW measures the scatter within each class. It is defined as the
sum of the scatter matrices for each class, which quantify the deviations of samples from their
respective class means. Mathematically, SW is given by:


S_W = ∑_{k=1}^{K} ∑_{n∈C_k} (x_n − m_k)(x_n − m_k)^T,

where xn is a sample from class k and mk is the mean of class k.


The Fisher criterion seeks to maximize the ratio of the between-class scatter to the within-class
scatter. This can be expressed as the following optimization problem:

max_w  (w^T S_B w) / (w^T S_W w).
Solving this optimization problem yields the optimal projection directions, which can be obtained by computing the eigenvectors of S_W^{-1} S_B and selecting those corresponding to the largest eigenvalues. Since S_B has rank at most K − 1, at most K − 1 useful projection directions can be found in this way.

Once the projection vector w is obtained, new samples can be projected onto this vector, and a
classification rule can be applied to assign them to the appropriate class.
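
A minimal sketch of the multi-class computation (assuming NumPy; the helper fisher_multiclass and its n_components parameter are hypothetical):

import numpy as np

def fisher_multiclass(X, labels, n_components):
    """Return projection directions from the leading eigenvectors of S_W^{-1} S_B.

    X : (N, D) inputs; labels : (N,) integer class labels.
    """
    classes = np.unique(labels)
    m = X.mean(axis=0)                                  # overall mean
    D = X.shape[1]
    S_B = np.zeros((D, D))
    S_W = np.zeros((D, D))
    for k in classes:
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)
        S_B += len(Xk) * np.outer(mk - m, mk - m)       # between-class scatter
        S_W += (Xk - mk).T @ (Xk - mk)                  # within-class scatter
    # Eigen-decomposition of S_W^{-1} S_B; keep eigenvectors with largest eigenvalues.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:n_components]]        # (D, n_components)

# Hypothetical usage: project data onto the leading K - 1 directions.
# Z = X @ fisher_multiclass(X, labels, n_components=K - 1)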

Fisher’s discriminant provides a linear decision boundary in the projected space that separates the classes well when each class is approximately Gaussian with a similar covariance matrix; these assumptions may not always hold in practice.

The Perceptron
The perceptron is a fundamental binary classification algorithm in machine learning. It is a type
of linear classifier that separates two classes by learning a decision boundary based on the input
features. The perceptron algorithm was introduced by Frank Rosenblatt in 1957 and forms the
basis for many subsequent developments in neural networks.

The perceptron model takes an input vector x of D features and assigns weights w1 , w2 , . . . , wD
to each feature. Additionally, there is a bias term b that represents the threshold for the decision
boundary. The output of the perceptron is a binary prediction ŷ, indicating which class the input belongs to.

The prediction ŷ is obtained by applying the following steps:

• Calculate the weighted sum of the input features:

  z = ∑_{i=1}^{D} w_i x_i + b.

• Apply an activation function to the weighted sum. In the perceptron algorithm, the activation
function is a simple step function (also known as the Heaviside step function), which returns
1 if z is positive or zero, and 0 otherwise:
ŷ = 1 if z ≥ 0, and ŷ = 0 otherwise.

The predicted output ŷ represents the class label assigned by the perceptron. The learning
process of the perceptron involves adjusting the weights and bias based on the training data. The
algorithm starts with random or zero weights and iterates through the training samples until con-
vergence or a predefined number of iterations. During each iteration, the perceptron updates its
weights based on the following rule:

wi ← wi + η(y − ŷ)xi ,
b ← b + η(y − ŷ),
where y is the true class label of the training sample, ŷ is the predicted class label, xi is the i-th
feature of the input, and η is the learning rate, which controls the step size of the weight updates.
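
A minimal sketch of this learning rule (assuming NumPy; the helper train_perceptron and its lr and epochs parameters are hypothetical, with labels y in {0, 1}):

import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=100):
    """Train a perceptron with the step activation and the update rule above.

    X : (N, D) inputs; y : (N,) labels in {0, 1}.
    Returns the learned weights w of shape (D,) and the bias b.
    """
    N, D = X.shape
    w = np.zeros(D)
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            y_hat = 1 if w @ xi + b >= 0 else 0        # step activation
            if y_hat != yi:
                w += lr * (yi - y_hat) * xi            # weight update
                b += lr * (yi - y_hat)                 # bias update
                errors += 1
        if errors == 0:                                # all training samples correct
            break
    return w, b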

The learning process continues until the algorithm correctly classifies all training samples or
reaches the maximum number of iterations. The perceptron algorithm guarantees convergence if
the classes are linearly separable; otherwise, it may not converge. The perceptron is a foundational
algorithm that paved the way for more sophisticated neural network architectures. While it is a
linear classifier, its simplicity and interpretability make it an essential concept in understanding the
basics of machine learning and neural networks.
