Lec 10

The document recaps key concepts related to linear classifiers and regression. It discusses perceptrons, linear regression, least squares solutions, gradient descent, and the LMS algorithm. The LMS algorithm is described as a stochastic gradient descent method for minimizing the mean squared error of predictions. Linear regression and classification models are described as fitting functions to minimize expected error.

Recap

• We have been considering linear classifiers:

  h(X) = 1 if Σ_{i=1}^d w_i φ_i(X) + w_0 > 0
       = 0 otherwise

  where φ_i are fixed functions.
• We take h(X) = sign(W^T X) for simplicity of notation.
• Perceptron is a classical algorithm to learn such a classifier.


Recap

• We also discussed linear regression. The objective is to learn a model:

  ŷ(X) = Σ_{i=1}^d w_i x_i + w_0

  We could use φ_i(X) in place of x_i.
• The criterion is to minimize

  J(W) = (1/2) Σ_{i=1}^n (W^T X_i − y_i)^2

• The minimizer is the linear least squares solution, given by

  W* = (A^T A)^{-1} A^T Y

  where A is a matrix whose rows are X_i and Y is a vector whose components are y_i (see the numerical sketch after this list).
• We can also minimize J by iterative gradient descent.
• An incremental version of this gradient descent is the LMS algorithm.
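As a quick illustration, here is a minimal numpy sketch of the closed-form solution via the normal equations; the synthetic data and the choice of a bias column are illustrative assumptions, not part of the slides.

```python
import numpy as np

# Minimal sketch: closed-form linear least squares on synthetic data.
# A has rows X_i (augmented with a constant 1 for w_0); Y holds the targets y_i.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 samples, d = 3 features (illustrative)
true_W = np.array([1.5, -2.0, 0.5])
Y = X @ true_W + 3.0 + 0.1 * rng.normal(size=100)

A = np.hstack([X, np.ones((100, 1))])          # augment with a column of ones for w_0
W_star = np.linalg.solve(A.T @ A, A.T @ Y)     # W* = (A^T A)^{-1} A^T Y (normal equations)
print(W_star)                                  # last entry approximates w_0
```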


Recap

• The LMS algorithm is:

  W(k+1) = W(k) − η (X(k)^T W(k) − y(k)) X(k)

  where (X(k), y(k)) is the (random) sample picked and W(k) is the weight vector at iteration k.
• This is used in many adaptive signal processing problems.
• This is very similar to the Perceptron algorithm.
• If y(k) ∈ {0, 1} and we use the thresholded version of X(k)^T W(k) in the above, what we get is exactly the Perceptron algorithm.
• This is also a classical algorithm.


Adaline

• We can view this as a unit similar to the Perceptron.
• Output is a weighted sum of the inputs.
• Called Adaline (ADAptive LINear Element); the weights are adapted (Widrow 1963).


• The least squares error criterion is to minimize

  J(W) = (1/2) Σ_{i=1}^n (X_i^T W − y_i)^2

• Assuming the training examples to be drawn iid, the above is a good approximation of

  J(W) = (n/2) E[(X^T W − y)^2]

• That is, the objective is to minimize mean squared error.
• Equating the gradient of J(W) to zero we get

  n E[XX^T] W − n E[Xy] = 0

• This gives us the optimal W* as

  W* = (n E[XX^T])^{-1} (n E[Xy])

• The earlier expression we have for W* would be the same as this if we approximate the expectations by sample averages.


• Since rows of A are X_i, we have

  A^T A = Σ_{i=1}^n X_i X_i^T ≈ n E[XX^T]

• Similarly, A^T Y = Σ_{i=1}^n X_i y_i ≈ n E[Xy].
• Thus we have

  (A^T A)^{-1} A^T Y ≈ (n E[XX^T])^{-1} (n E[Xy])


• We are fitting a W to minimize (1/2) E[(W^T X − y)^2].
• The gradient descent on this objective would be

  W(k+1) = W(k) − η E[(W^T X − y) X]

• However, we cannot calculate the expectation.
• We have iid training samples.
• We can evaluate only the ‘noisy’ gradient at any sample.


LMS and Stochastic Gradient Descent

• Consider the LMS algorithm.
• Suppose at the k-th iteration a random sample (X(k), y(k)) is picked. Then

  W(k+1) = W(k) − η (W(k)^T X(k) − y(k)) X(k)

• So, we use the ‘noisy’ gradient. This is the same as the Robbins-Monro algorithm we saw earlier.
• This is a stochastic gradient descent algorithm (see the sketch below).
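A minimal numpy sketch of the LMS update as stochastic gradient descent; the synthetic data, step size η, and number of passes are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the LMS update as stochastic gradient descent (synthetic data, illustrative).
rng = np.random.default_rng(0)
n, d = 500, 3
X = np.hstack([rng.normal(size=(n, d)), np.ones((n, 1))])   # augmented inputs (bias component)
true_W = np.array([1.0, -0.5, 2.0, 0.3])
y = X @ true_W + 0.05 * rng.normal(size=n)

W = np.zeros(d + 1)
eta = 0.01                                                   # step size
for _ in range(20):                                          # a few passes over the data
    for k in rng.permutation(n):                             # pick samples in random order
        err = X[k] @ W - y[k]                                # 'noisy' gradient uses one sample
        W -= eta * err * X[k]                                # W(k+1) = W(k) - eta*(W^T X - y)*X
print(W)                                                     # approaches the least squares solution
```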


• Least squares method of fitting a model tries to find a function f to minimize

  R(f) = E[(f(X) − y)^2]

• Since we are learning only linear models here, the minimization is only over all f that are linear (or affine) functions.
• In general, we can find the best f among all possible functions.


• This is a problem of approximating a random variable y as a function of another random variable X in the sense of best mean square error.
• If f* is the optimal function, then f*(X) is called the regression function of y on X.
• We will show that this f* is given by

  f*(X) = E[y | X]


• We need some properties of conditional expectation in the proof.
• For random variables X, Z (with a joint density)

  E[g(Z) | X = x] = ∫ g(z) f_{Z|X}(z|x) dz

  where f_{Z|X}(z|x) is the conditional density.
• If Z is a discrete random variable

  E[g(Z) | X = x] = Σ_j g(z_j) P[Z = z_j | X = x]


• E[Z | X] is a function of X and is a random variable.
• It has all the linearity properties of expectation.

Two important special properties of conditional expectation are:

(i) E[ E[Z | X] ] = E[Z], ∀ Z, X

(ii) E[g(Z) h(X) | X] = h(X) E[g(Z) | X], ∀ g, h

• We will need both these properties for our proof.


• We want to show that for all f

  E[(E[y | X] − y)^2] ≤ E[(f(X) − y)^2]

  We have

  (f(X) − y)^2 = [(f(X) − E[y | X]) + (E[y | X] − y)]^2
               = (f(X) − E[y | X])^2 + (E[y | X] − y)^2 + 2 (f(X) − E[y | X]) (E[y | X] − y)

  Now we can take expectation on both sides.


First consider the last term:

  E[(f(X) − E[y | X]) (E[y | X] − y)]
  = E[ E{(f(X) − E[y | X]) (E[y | X] − y) | X} ]    (because E[Z] = E[ E[Z|X] ])
  = E[ (f(X) − E[y | X]) E{(E[y | X] − y) | X} ]    (because E[g(X)h(Z)|X] = g(X) E[h(Z)|X])
  = E[ (f(X) − E[y | X]) (E[y | X] − E[y | X]) ]
  = 0


Hence we get

  E[(f(X) − y)^2] = E[(f(X) − E[y | X])^2] + E[(E[y | X] − y)^2]
                  ≥ E[(E[y | X] − y)^2]

• Since the above is true for all functions f, we get

  f*(X) = E[y | X]


• We showed that if we want to predict y as a function of X to minimize E[(f(X) − y)^2], then the optimal function is

  f*(X) = E[y | X]

• Suppose y ∈ {0, 1}. Then

  f*(X) = E[y | X] = P[y = 1 | X] = q_1(X)

• It is easy to see that, if y ∈ {−1, 1} then f*(X) = 2 q_1(X) − 1.


• In a classification problem, suppose we learnt W to minimize

  J(W) = (1/2) Σ_{i=1}^n (X_i^T W − y_i)^2

• If we had y ∈ {0, 1}, then we learn a best linear approximation to the posterior probability, q_1(X).
• So, by thresholding X^T W* at 0.5, we get a good classifier.
• If we had y ∈ {−1, 1}, then we learn a good linear approximation to 2 q_1(X) − 1.
• Hence we can threshold X^T W* at zero to get a good classifier.


• However, in general, a linear function of X is not a good choice for the posterior probability function q_1(X).
• But we can extend the linear least squares method to take care of more interesting models.
• Let h : ℜ → ℜ+ be a continuous, strictly monotonically increasing function.
• Suppose we want to learn a model

  ŷ(X) = h(W^T X + w_0)


• By our assumptions, h is invertible.
• Suppose ŷ(X) = h(W^T X + w_0) is a good model.
• Given data {(X_i, y_i), i = 1, ..., n}, we can approximate y_i well by h(X_i^T W* + w_0*).
• This means, X_i^T W* + w_0* is a good approximation for h^{-1}(y_i).


• Thus we can use the usual linear least squares method by simply taking the ‘targets’ to be h^{-1}(y_i).
• Given data is: {(X_i, y_i), i = 1, 2, ..., n}.
• Make new data, {(X_i, y_i′), i = 1, 2, ..., n} with y_i′ = h^{-1}(y_i).
• Now we fit the model W^T X + w_0 to the new data.
• This is the usual linear least squares problem.
• Finally we use the model h(X^T W* + w_0*) (a sketch of this procedure follows below).
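A minimal sketch of this target-transformation trick, assuming (for illustration only) the link h(a) = exp(a), so that h^{-1}(y) = ln y; the data is synthetic.

```python
import numpy as np

# Minimal sketch of least squares through an invertible link h (here h = exp, so h^{-1} = log).
# The data-generating model and parameters are illustrative assumptions.
rng = np.random.default_rng(1)
n, d = 200, 2
X = np.hstack([rng.normal(size=(n, d)), np.ones((n, 1))])    # augmented inputs
true_W = np.array([0.8, -0.4, 0.2])
y = np.exp(X @ true_W + 0.05 * rng.normal(size=n))            # positive targets, y ≈ h(W^T X)

y_prime = np.log(y)                                           # new targets y_i' = h^{-1}(y_i)
W_star = np.linalg.lstsq(X, y_prime, rcond=None)[0]           # usual linear least squares
y_hat = np.exp(X @ W_star)                                    # final model h(X^T W*)
print(W_star)
```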


• We can also use the LMS algorithm.
• For notational simplicity, assume augmented variables and write X_i^T W for X_i^T W + w_0.
• Our criterion is to minimize

  J(W) = (1/2) Σ_{i=1}^n (h(X_i^T W) − y_i)^2

• We can use a gradient descent algorithm for minimizing J.


• A gradient descent for this would be

  W(k+1) = W(k) − η Σ_{i=1}^n h′(X_i^T W) X_i (h(X_i^T W) − y_i)

  where h′(·) is the derivative of h.


• Or, we can use the incremental version of gradient descent:

  W(k+1) = W(k) − η h′_k X(k) e(k)

  where e(k) = h(X(k)^T W(k)) − y(k) and h′_k = h′(X(k)^T W(k)).
• This is the LMS algorithm extended for this case.
• The LMS algorithm is simple to implement.


Logistic Regression

• An h that is often used is

  h(a) = 1 / (1 + exp(−a))

• This function is known as the logistic function or sigmoid function.
• This is a useful model for the posterior probability function.
• The least squares method with this h is called logistic regression.
• Normally one uses the LMS algorithm in logistic regression.


• Logistic regression is often a good way to learn the posterior probability function.
• In a 2-class problem, let
  • f_0(X), f_1(X) – class conditional densities
  • q_0(X), q_1(X) – posterior probabilities
  • p_0, p_1 – prior probabilities


• Then, by Bayes rule

  q_0(X) = f_0(X) p_0 / (f_0(X) p_0 + f_1(X) p_1)
         = 1 / (1 + exp(−ξ))

  where

  ξ = − ln( f_1(X) p_1 / (f_0(X) p_0) ) = ln( f_0(X) p_0 / (f_1(X) p_1) )


• Thus, logistic regression is a very good method if we can write

  ln( f_0(X) p_0 / (f_1(X) p_1) ) = W^T X + w_0

• For example, if f_0 and f_1 are Gaussian with the same covariance matrix, then the above holds.
• Thus, this is one case where logistic regression would give you the optimal classifier.


• In logistic regression we find W, w_0 to minimize

  (1/2) Σ_{i=1}^n (h(W^T X_i + w_0) − y_i)^2

  where h(a) = (1 + exp(−a))^{-1} is the logistic function.
• The ‘targets’ are y_i ∈ {0, 1}.
• We can use, e.g., the LMS algorithm to find W*, w_0*.
• Then, we use h(X^T W* + w_0*) as the posterior probability of class 1, to implement the classifier (see the sketch below).
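A minimal sketch of logistic regression trained with the extended LMS update (squared error through the sigmoid, as in the slides); the synthetic data, step size, and number of passes are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: logistic regression fitted by the extended LMS update
# (squared-error loss through the sigmoid). Synthetic, illustrative data.
rng = np.random.default_rng(2)

def h(a):
    return 1.0 / (1.0 + np.exp(-a))               # logistic (sigmoid) function

n, d = 400, 2
X = np.hstack([rng.normal(size=(n, d)), np.ones((n, 1))])    # augmented inputs
true_W = np.array([2.0, -1.0, 0.5])
y = (rng.uniform(size=n) < h(X @ true_W)).astype(float)      # targets in {0, 1}

W = np.zeros(d + 1)
eta = 0.1
for _ in range(50):
    for k in rng.permutation(n):
        a = X[k] @ W
        e = h(a) - y[k]                            # e(k) = h(X^T W) - y
        W -= eta * h(a) * (1.0 - h(a)) * e * X[k]  # h'(a) = h(a)(1 - h(a))
print(W)                                           # h(X^T W) approximates P[y = 1 | X]
```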


• Consider the Bayes (or naive Bayes) classifier.
• We are trying to model (in the 2-class case):

  f(x, y) = f(y) f(x|y) = (p_0 f_0(x))^{1−y} (p_1 f_1(x))^y

  This is what is called a generative model.
• But the purpose of the model ultimately is to predict the target.


• In logistic regression we are fitting a model:

  Prob[y = 1 | X, W] = 1 / (1 + exp(−W^T X))

• If we take y ∈ {+1, −1} then we can write the conditional density as

  f(y | X, W) = 1 / (1 + exp(−y W^T X))

• This is a ‘discriminative’ model.


• In most applications, our observations or data would be noisy.
• We can take the X_i to be fixed and the observed y_i to be random.
• Often, we get data by measuring y_i for a specific value of X_i. Hence this is a useful scenario.
• Now the W* obtained through linear least squares regression would also be random.
• Hence we would like to know its variance.


• We assume that the noise terms corrupting different y_i are iid and zero-mean.
• Recall that Y is a vector random variable with components y_i.
• By our assumption, its covariance matrix is

  Σ_Y = σ^2 I

  where I is the identity matrix and σ^2 is the noise variance.


• For any random vectors Z, Y, if Z = BY for some matrix B, then

  Σ_Z = E[(Z − EZ)(Z − EZ)^T] = E[B (Y − EY)(Y − EY)^T B^T] = B Σ_Y B^T

• We have W* = (A^T A)^{-1} A^T Y. Hence

  Σ_W = (A^T A)^{-1} A^T σ^2 I A (A^T A)^{-1} = σ^2 (A^T A)^{-1}

• This gives us the covariance matrix of the least squares estimate (checked numerically in the sketch below).
• If the noise is Gaussian then the least squares estimate W would also be Gaussian.
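A minimal numerical check of Σ_W = σ^2 (A^T A)^{-1}, comparing it with the empirical covariance of W* over repeated noise draws; the design matrix, true weights, and noise level are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: compare Sigma_W = sigma^2 (A^T A)^{-1} with an empirical estimate
# obtained by redrawing the zero-mean noise many times (synthetic, illustrative setup).
rng = np.random.default_rng(7)
n, d = 100, 3
A = rng.normal(size=(n, d))                      # fixed design matrix, rows are X_i
W_true = np.array([1.0, 2.0, -1.0])
sigma = 0.5

Sigma_theory = sigma ** 2 * np.linalg.inv(A.T @ A)

estimates = []
for _ in range(2000):                            # redraw the noise many times
    Y = A @ W_true + sigma * rng.normal(size=n)
    estimates.append(np.linalg.solve(A.T @ A, A.T @ Y))
Sigma_empirical = np.cov(np.array(estimates).T)

print(np.max(np.abs(Sigma_theory - Sigma_empirical)))   # should be small
```
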
• Suppose X, y are related by y = W^T X + ξ where ξ is a zero mean noise.
• Then we expect the linear least squares method to easily learn W.
• The final mean square error would be the variance of ξ.
• Using this idea, we can think of the least squares method as an ML estimation procedure under a reasonable probability model.


• Let y be a random variable, a function of X.
• We take the probability model for y as

  f(y | X, W, σ) = (1/(σ√(2π))) exp( −(y − W^T X)^2 / (2σ^2) )

  where W and σ are the parameters.
• Let D = {y_1(X_1), ..., y_n(X_n)} be the iid data.
• We want to derive the ML estimate for the parameters.


• The data likelihood is

  L(W, σ | D) = ∏_{i=1}^n f(y_i | X_i, W, σ)
              = ∏_{i=1}^n (1/(σ√(2π))) exp( −(y_i − X_i^T W)^2 / (2σ^2) )


• The log likelihood is given by

  l(W, σ | D) = n ln(1/(σ√(2π))) − (1/(2σ^2)) Σ_{i=1}^n (y_i − X_i^T W)^2

• Equating the gradient of the log likelihood (with respect to W) to zero, we get

  Σ_{i=1}^n X_i (y_i − X_i^T W) = 0

• This gives us the same W as least squares.


• Suppose we want the ML estimate of σ also:

  ∂l/∂σ = −n/σ + (1/σ^3) Σ_{i=1}^n (y_i − X_i^T W)^2 = 0

• This gives us

  σ^2 = (1/n) Σ_{i=1}^n (y_i − X_i^T W)^2

• This is the residual average squared error (see the sketch below).
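A minimal sketch of the two ML estimates under this Gaussian model: W from least squares and σ^2 as the average squared residual; the synthetic data is an illustrative assumption.

```python
import numpy as np

# Minimal sketch of the joint ML estimates under the Gaussian noise model:
# W is the least squares solution and sigma^2 is the residual average squared error.
rng = np.random.default_rng(8)
n, d = 200, 4
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.3 * rng.normal(size=n)        # true noise std 0.3 (illustrative)

W_ml, *_ = np.linalg.lstsq(X, y, rcond=None)                 # same W as least squares
sigma2_ml = np.mean((y - X @ W_ml) ** 2)                     # (1/n) sum of squared residuals
print(np.sqrt(sigma2_ml))                                    # close to 0.3
```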


Regularization

• As we saw, we can take any fixed basis functions in our linear model:

  ŷ(X) = f(X) = Σ_{i=0}^M w_i φ_i(X)

• Suppose we have one dimensional data: X_i, y_i ∈ ℜ.
• If we take φ_i(X) = X^i, i = 0, 1, ..., M, then we are trying to fit a polynomial of degree M for the data.
• What M should we take?


• Fixing an M to get ‘least’ error is not a good idea.
• If we take M = n − 1 (where n is the number of data points), we get zero error!
  (Given n points, we can always find an (n − 1)-degree polynomial that goes through all data points.)
• A wrong choice of M can result in ‘overfitting’ – low training error but poor generalization (illustrated in the sketch below).
• This is a fundamental issue in learning from examples.
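A minimal sketch illustrating the point, fitting polynomials of a few degrees to noisy one-dimensional data; the data and the degrees chosen are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of polynomial fitting with phi_i(X) = X^i, illustrating overfitting:
# at M = n - 1 the training error drops to (numerically) zero. Synthetic, illustrative data.
rng = np.random.default_rng(3)
n = 10
x = np.linspace(0.0, 1.0, n)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.normal(size=n)        # noisy 1-D data

for M in (1, 3, n - 1):
    A = np.vander(x, M + 1, increasing=True)                  # columns are x^0, x^1, ..., x^M
    W, *_ = np.linalg.lstsq(A, y, rcond=None)                 # least squares fit of degree M
    train_mse = np.mean((A @ W - y) ** 2)
    print(f"M = {M}: training MSE = {train_mse:.2e}")
```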


• We are fitting a model f(X) = W^T Φ(X) to the data.
• We want to rate different W for their ‘goodness of fit’.
• Σ_i (W^T Φ(X_i) − y_i)^2 is the ‘data error’.
• But it does not tell the whole story of how good W is.
• We can say: in addition, we want a ‘simple’ model.


• Hence we can change our criterion to

  J(W) = Data error + λ · model complexity
       = (1/2) Σ_{i=1}^n (W^T Φ(X_i) − y_i)^2 + λ Ω(W)

• Here Ω(W) is some measure of how ‘complex’ the model is.
• This is called regularized least squares and λ is called the regularization constant.


• In linear least squares regression, we often choose Ω(W) = (1/2)||W||^2.
• Now the criterion is

  J(W) = (1/2) Σ_{i=1}^n (W^T Φ(X_i) − y_i)^2 + (λ/2) W^T W
       = (1/2) (AW − Y)^T (AW − Y) + (λ/2) W^T W

  where, as earlier, A is the matrix whose rows are Φ(X_i).
• Equating the gradient of J to zero, we get

  A^T (AW − Y) + λ W = 0

• This gives us

  (A^T A + λI) W = A^T Y  ⇒  W* = (A^T A + λI)^{-1} A^T Y

• This is similar to the least squares solution except for the λI term (see the sketch below).
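A minimal sketch of the regularized solution next to the ordinary one; the data and the value of λ are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of regularized least squares: W* = (A^T A + lambda I)^{-1} A^T Y.
rng = np.random.default_rng(4)
n, d = 50, 5
A = rng.normal(size=(n, d))                      # rows are Phi(X_i)
Y = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam = 1.0                                        # regularization constant lambda (illustrative)
W_reg = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ Y)
W_ols = np.linalg.solve(A.T @ A, A.T @ Y)        # ordinary least squares for comparison
print(np.linalg.norm(W_reg), np.linalg.norm(W_ols))   # the regularized solution is 'shrunk'
```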


• We are essentially minimizing

  J(W) = Σ_{i=1}^n (W^T X_i − y_i)^2 + λ W^T W

• This is known as ridge regression in statistics.
• It essentially ‘shrinks’ components of W.
• Gradient descent would give

  W(k+1) = W(k) − η (W^T X_i − y_i) X_i − η λ W(k)

• This is known as weight decay in the neural network literature.


• Another way to look at regularized least squares is from a Bayesian framework.
• We saw that the least squares solution can also be derived as an ML estimate of the parameters of a reasonable probability model for y.
• The regularized least squares solution can be derived as a Bayesian (MAP) estimate of the parameters of the same model.


• As earlier, take the probability model for y as

  f(y | X, W, σ) = (1/(σ√(2π))) exp( −(y − W^T X)^2 / (2σ^2) )

• We want to estimate W from n iid observations {y_i(X_i), i = 1, ..., n}.
• We take the prior density of W as

  f(W) = (1/(α√(2π)))^d exp( −W^T W / (2α^2) )

  which is zero-mean normal with a diagonal covariance matrix; α is a parameter of the prior.


Now the posterior density is given by

  f(W | Y) ∝ ∏_{i=1}^n f(y_i | X_i, W, σ) f(W)
           ∝ exp( −Σ_{i=1}^n (y_i − W^T X_i)^2 / (2σ^2) − W^T W / (2α^2) )

• To find the MAP estimate we need to maximize the posterior density.
• We can maximize the log of the posterior.


Now, the log posterior density is given by

  ln f(W | Y) = −(1/(2σ^2)) Σ_{i=1}^n (y_i − W^T X_i)^2 − (1/(2α^2)) W^T W + K

  where K is a constant.


Thus, the log posterior is of the form

  ln f(W | Y) = −(1/2) Σ_{i=1}^n (y_i − W^T X_i)^2 − λ W^T W + K

(up to scaling by σ^2, with λ = σ^2/(2α^2)).

• Maximizing this is the same as minimizing the regularized least squares criterion.
• Hence the MAP estimate is the regularized least squares solution.


Bayesian Linear Regression

• We saw that we can look at linear least squares estimation in a Bayesian framework also.
• The MAP estimate corresponds to regularized least squares.
• MAP is a point estimate obtained from the posterior.
• In the Bayesian framework, the estimate is the whole of the posterior.
• We can also calculate the distribution of the prediction variable based on the data.


The posterior density

• In general, the prior can be

  f(W) = N(W | µ_0, Σ_0)

• The probability model for y is

  f(Y | A, W, σ^2) = ∏_{i=1}^n N(y_i | X_i^T W, σ^2) = N(Y | AW, σ^2 I)

• The posterior would also be a Gaussian, which can be calculated using the technique of ‘completing the squares’:

  f(W | Y, A, σ^2) = N(W | µ_n, Σ_n)

  where

  µ_n = Σ_n (σ^{-2} A^T Y + Σ_0^{-1} µ_0),    Σ_n^{-1} = Σ_0^{-1} + σ^{-2} A^T A


The Predictive Distribution

• We can also calculate (for any given new X)

  f(y | X, A, Y, σ^2) = ∫ f(y | X, W, σ^2) f(W | A, Y, σ^2) dW

  which is N(y | µ_n^T X, X^T Σ_n X + σ^2).
• We can use this to predict y for any X (see the sketch below).
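A minimal sketch of these formulas; the prior, noise level, and data are illustrative assumptions, and the predictive variance uses X^T Σ_n X + σ^2 for a single new input X.

```python
import numpy as np

# Minimal sketch of Bayesian linear regression: posterior N(W | mu_n, Sigma_n)
# and the predictive distribution at a new input X. Synthetic, illustrative setup.
rng = np.random.default_rng(5)
n, d = 30, 3
A = rng.normal(size=(n, d))                      # rows are the inputs X_i
Y = A @ np.array([1.0, -1.0, 0.5]) + 0.2 * rng.normal(size=n)

sigma2 = 0.2 ** 2                                # noise variance (assumed known here)
mu_0 = np.zeros(d)                               # prior mean
Sigma_0 = np.eye(d)                              # prior covariance

Sigma_n_inv = np.linalg.inv(Sigma_0) + A.T @ A / sigma2
Sigma_n = np.linalg.inv(Sigma_n_inv)
mu_n = Sigma_n @ (A.T @ Y / sigma2 + np.linalg.inv(Sigma_0) @ mu_0)

X_new = rng.normal(size=d)                       # a new input
pred_mean = mu_n @ X_new                         # mu_n^T X
pred_var = X_new @ Sigma_n @ X_new + sigma2      # X^T Sigma_n X + sigma^2
print(pred_mean, pred_var)
```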


Robust Regression

• We saw that the linear least squares solution can be viewed as the MLE obtained by taking y = W^T X + ξ where ξ is zero-mean Gaussian noise.
• The least squares solution is sensitive to outliers.
• One way of looking at this is that the Gaussian distribution has ‘light tails’.
• To make the solution robust to outliers, we can attempt other noise models.


The Laplace Density

• The Laplace density is given by

  f_Lap(x | µ, b) = (1/(2b)) exp( −|x − µ| / b ),   −∞ < x < ∞

• Mean is µ and variance is 2b^2.
• This is a heavy-tailed distribution.


• We take our noise model to be

  f(y | X, W, b) = f_Lap(y | W^T X, b)

• One can (easily!) show that the MLE is the same as minimizing

  J(W) = Σ_{i=1}^n |y_i − W^T X_i|

• This is called robust linear regression.
• The optimization problem is rather hard.
• Another way to achieve robustness is to use the Huber loss instead of the squared error loss.
• The Huber loss is defined by

  L_{H−δ}(a, b) = 0.5 (a − b)^2          if |a − b| ≤ δ
                = δ |a − b| − 0.5 δ^2     if |a − b| > δ

• Using d|r|/dr = sign(r), this loss function is differentiable and hence we can solve the resulting optimization problem (see the sketch below).
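A minimal sketch of robust regression with the Huber loss, minimized by plain gradient descent; the synthetic data (with injected outliers), δ, and step size are illustrative assumptions, and huber_grad is a helper written for this sketch.

```python
import numpy as np

# Minimal sketch of robust linear regression with the Huber loss, fitted by gradient descent.
# Synthetic data with a few large outliers; delta and the step size are illustrative choices.
rng = np.random.default_rng(6)
n, d = 200, 2
X = np.hstack([rng.normal(size=(n, d)), np.ones((n, 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
y[:10] += 20.0                                   # inject a few large outliers

def huber_grad(r, delta):
    # derivative of the Huber loss with respect to the residual r
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

W = np.zeros(d + 1)
delta, eta = 1.0, 0.01
for _ in range(500):
    r = X @ W - y
    W -= eta * X.T @ huber_grad(r, delta) / n    # gradient step on the average Huber loss
print(W)                                         # should stay close to the true weights despite the outliers
```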
