Lec 10

The document recaps key concepts related to linear classifiers and regression. It discusses perceptrons, linear regression, least squares solutions, gradient descent, and the LMS algorithm. The LMS algorithm is described as a stochastic gradient descent method for minimizing the mean squared error of predictions. Linear regression and classification models are described as fitting functions to minimize expected error.

Recap

• We have been considering linear classifiers:

  h(X) = 1 if Σ_{i=1}^d w_i φ_i(X) + w_0 > 0
       = 0 otherwise

  where φ_i are fixed functions.
• We take h(X) = sign(W^T X) for simplicity of notation.
• Perceptron is a classical algorithm to learn such a classifier.


Recap

• We also discussed linear regression. The objective is to learn a model:

  ŷ(X) = Σ_{i=1}^d w_i x_i + w_0

  We could use φ_i(X) in place of x_i.
• The criterion is to minimize

  J(W) = (1/2) Σ_{i=1}^n (W^T X_i − y_i)^2

• The minimizer is the linear least squares solution, given by

  W* = (A^T A)^{-1} A^T Y

  where A is a matrix whose rows are X_i and Y is a vector whose components are y_i (see the numerical sketch after this list).
• We can also minimize J by iterative gradient descent.
• An incremental version of this gradient descent is the LMS algorithm.
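As a quick illustration, here is a minimal numpy sketch of the closed-form solution via the normal equations; the synthetic data and the choice of a bias column are illustrative assumptions, not part of the slides.

```python
import numpy as np

# Minimal sketch: closed-form linear least squares on synthetic data.
# A has rows X_i (augmented with a constant 1 for w_0); Y holds the targets y_i.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 samples, d = 3 features (illustrative)
true_W = np.array([1.5, -2.0, 0.5])
Y = X @ true_W + 3.0 + 0.1 * rng.normal(size=100)

A = np.hstack([X, np.ones((100, 1))])          # augment with a column of ones for w_0
W_star = np.linalg.solve(A.T @ A, A.T @ Y)     # W* = (A^T A)^{-1} A^T Y (normal equations)
print(W_star)                                  # last entry approximates w_0
```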


Recap

• The LMS algorithm is:

  W(k+1) = W(k) − η (X(k)^T W(k) − y(k)) X(k)

  where (X(k), y(k)) is the (random) sample picked and W(k) is the weight vector at iteration k.
• This is used in many adaptive signal processing problems.
• This is very similar to the Perceptron algorithm.
• If y(k) ∈ {0, 1} and we use the thresholded version of X(k)^T W(k) in the above, what we get is exactly the Perceptron algorithm.
• This is also a classical algorithm.


Adaline

• We can view this as a unit similar to the Perceptron.
• Output is a weighted sum of the inputs.
• Called Adaline (ADAptive LINear Element); the weights are adapted (Widrow 1963).


• The least squares error criterion is to minimize

  J(W) = (1/2) Σ_{i=1}^n (X_i^T W − y_i)^2

• Assuming the training examples to be drawn iid, the above is a good approximation of

  J(W) = (n/2) E[(X^T W − y)^2]

• That is, the objective is to minimize mean squared error.
• Equating the gradient of J(W) to zero we get

  n E[XX^T] W − n E[Xy] = 0

• This gives us the optimal W* as

  W* = (n E[XX^T])^{-1} (n E[Xy])

• The earlier expression we have for W* would be the same as this if we approximate the expectations by sample averages.


• Since rows of A are X_i, we have

  A^T A = Σ_{i=1}^n X_i X_i^T ≈ n E[XX^T]

• Similarly, A^T Y = Σ_{i=1}^n X_i y_i ≈ n E[Xy].
• Thus we have

  (A^T A)^{-1} A^T Y ≈ (n E[XX^T])^{-1} (n E[Xy])


• We are fitting a W to minimize (1/2) E[(W^T X − y)^2].
• The gradient descent on this objective would be

  W(k+1) = W(k) − η E[(W^T X − y) X]

• However, we cannot calculate the expectation.
• We have iid training samples.
• We can evaluate only the ‘noisy’ gradient at any sample.


LMS and Stochastic Gradient Descent

• Consider the LMS algorithm.
• Suppose at the k-th iteration a random sample (X(k), y(k)) is picked. Then

  W(k+1) = W(k) − η (W(k)^T X(k) − y(k)) X(k)

• So, we use the ‘noisy’ gradient. This is the same as the Robbins-Monro algorithm we saw earlier.
• This is a stochastic gradient descent algorithm (see the sketch below).
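A minimal numpy sketch of the LMS update as stochastic gradient descent; the synthetic data, step size η, and number of passes are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the LMS update as stochastic gradient descent (synthetic data, illustrative).
rng = np.random.default_rng(0)
n, d = 500, 3
X = np.hstack([rng.normal(size=(n, d)), np.ones((n, 1))])   # augmented inputs (bias component)
true_W = np.array([1.0, -0.5, 2.0, 0.3])
y = X @ true_W + 0.05 * rng.normal(size=n)

W = np.zeros(d + 1)
eta = 0.01                                                   # step size
for _ in range(20):                                          # a few passes over the data
    for k in rng.permutation(n):                             # pick samples in random order
        err = X[k] @ W - y[k]                                # 'noisy' gradient uses one sample
        W -= eta * err * X[k]                                # W(k+1) = W(k) - eta*(W^T X - y)*X
print(W)                                                     # approaches the least squares solution
```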


• Least squares method of fitting a model tries to find a function f to minimize

  R(f) = E[(f(X) − y)^2]

• Since we are learning only linear models here, the minimization is only over all f that are linear (or affine) functions.
• In general, we can find the best f among all possible functions.


• This is a problem of approximating a random variable y as a function of another random variable X in the sense of best mean square error.
• If f* is the optimal function, then f*(X) is called the regression function of y on X.
• We will show that this f* is given by

  f*(X) = E[y | X]


• We need some properties of conditional expectation in the proof.
• For random variables X, Z (with a joint density)

  E[g(Z) | X = x] = ∫ g(z) f_{Z|X}(z|x) dz

  where f_{Z|X}(z|x) is the conditional density.
• If Z is a discrete random variable

  E[g(Z) | X = x] = Σ_j g(z_j) P[Z = z_j | X = x]


• E[Z | X] is a function of X and is a random variable.
• It has all the linearity properties of expectation.

Two important special properties of conditional expectation are:

(i) E[ E[Z | X] ] = E[Z], ∀ Z, X

(ii) E[g(Z) h(X) | X] = h(X) E[g(Z) | X], ∀ g, h

• We will need both these properties for our proof.


• We want to show that for all f

  E[(E[y | X] − y)^2] ≤ E[(f(X) − y)^2]

  We have

  (f(X) − y)^2 = [(f(X) − E[y | X]) + (E[y | X] − y)]^2
               = (f(X) − E[y | X])^2 + (E[y | X] − y)^2 + 2 (f(X) − E[y | X]) (E[y | X] − y)

  Now we can take expectation on both sides.


First consider the last term:

  E[(f(X) − E[y | X]) (E[y | X] − y)]
  = E[ E{(f(X) − E[y | X]) (E[y | X] − y) | X} ]    (because E[Z] = E[ E[Z|X] ])
  = E[ (f(X) − E[y | X]) E{(E[y | X] − y) | X} ]    (because E[g(X)h(Z)|X] = g(X) E[h(Z)|X])
  = E[ (f(X) − E[y | X]) (E[y | X] − E[y | X]) ]
  = 0


Hence we get

  E[(f(X) − y)^2] = E[(f(X) − E[y | X])^2] + E[(E[y | X] − y)^2]
                  ≥ E[(E[y | X] − y)^2]

• Since the above is true for all functions f, we get

  f*(X) = E[y | X]


• We showed that if we want to predict y as a function of X to minimize E[(f(X) − y)^2], then the optimal function is

  f*(X) = E[y | X]

• Suppose y ∈ {0, 1}. Then

  f*(X) = E[y | X] = P[y = 1 | X] = q_1(X)

• It is easy to see that, if y ∈ {−1, 1} then f*(X) = 2 q_1(X) − 1.


• In a classification problem, suppose we learnt W to minimize

  J(W) = (1/2) Σ_{i=1}^n (X_i^T W − y_i)^2

• If we had y ∈ {0, 1}, then we learn a best linear approximation to the posterior probability, q_1(X).
• So, by thresholding X^T W* at 0.5, we get a good classifier.
• If we had y ∈ {−1, 1}, then we learn a good linear approximation to 2 q_1(X) − 1.
• Hence we can threshold X^T W* at zero to get a good classifier.


• However, in general, a linear function of X is not a good choice for the posterior probability function q_1(X).
• But we can extend the linear least squares method to take care of more interesting models.
• Let h : ℜ → ℜ+ be a continuous, strictly monotonically increasing function.
• Suppose we want to learn a model

  ŷ(X) = h(W^T X + w_0)


• By our assumptions, h is invertible.
• Suppose ŷ(X) = h(W^T X + w_0) is a good model.
• Given data {(X_i, y_i), i = 1, ..., n}, we can approximate y_i well by h(X_i^T W* + w_0*).
• This means, X_i^T W* + w_0* is a good approximation for h^{-1}(y_i).


• Thus we can use the usual linear least squares method by simply taking the ‘targets’ to be h^{-1}(y_i).
• Given data is: {(X_i, y_i), i = 1, 2, ..., n}.
• Make new data, {(X_i, y_i′), i = 1, 2, ..., n} with y_i′ = h^{-1}(y_i).
• Now we fit the model W^T X + w_0 to the new data.
• This is the usual linear least squares problem.
• Finally we use the model h(X^T W* + w_0*) (a sketch of this procedure follows below).
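A minimal sketch of this target-transformation trick, assuming (for illustration only) the link h(a) = exp(a), so that h^{-1}(y) = ln y; the data is synthetic.

```python
import numpy as np

# Minimal sketch of least squares through an invertible link h (here h = exp, so h^{-1} = log).
# The data-generating model and parameters are illustrative assumptions.
rng = np.random.default_rng(1)
n, d = 200, 2
X = np.hstack([rng.normal(size=(n, d)), np.ones((n, 1))])    # augmented inputs
true_W = np.array([0.8, -0.4, 0.2])
y = np.exp(X @ true_W + 0.05 * rng.normal(size=n))            # positive targets, y ≈ h(W^T X)

y_prime = np.log(y)                                           # new targets y_i' = h^{-1}(y_i)
W_star = np.linalg.lstsq(X, y_prime, rcond=None)[0]           # usual linear least squares
y_hat = np.exp(X @ W_star)                                    # final model h(X^T W*)
print(W_star)
```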


• We can also use the LMS algorithm.
• For notational simplicity, assume augmented variables and write X_i^T W for X_i^T W + w_0.
• Our criterion is to minimize

  J(W) = (1/2) Σ_{i=1}^n (h(X_i^T W) − y_i)^2

• We can use a gradient descent algorithm for minimizing J.


• A gradient descent for this would be

  W(k+1) = W(k) − η Σ_{i=1}^n h′(X_i^T W) X_i (h(X_i^T W) − y_i)

  where h′(·) is the derivative of h.


• Or, we can use the incremental version of gradient descent:

  W(k+1) = W(k) − η h′_k X(k) e(k)

  where e(k) = h(X(k)^T W(k)) − y(k) and h′_k = h′(X(k)^T W(k)).
• This is the LMS algorithm extended for this case.
• The LMS algorithm is simple to implement.


Logistic Regression

• An h that is often used is

  h(a) = 1 / (1 + exp(−a))

• This function is known as the logistic function or sigmoid function.
• This is a useful model for the posterior probability function.
• The least squares method with this h is called logistic regression.
• Normally one uses the LMS algorithm in logistic regression.


• Logistic regression is often a good way to learn the posterior probability function.
• In a 2-class problem, let
  • f_0(X), f_1(X) – class conditional densities
  • q_0(X), q_1(X) – posterior probabilities
  • p_0, p_1 – prior probabilities


• Then, by Bayes rule

  q_0(X) = f_0(X) p_0 / (f_0(X) p_0 + f_1(X) p_1)
         = 1 / (1 + exp(−ξ))

  where

  ξ = − ln( f_1(X) p_1 / (f_0(X) p_0) ) = ln( f_0(X) p_0 / (f_1(X) p_1) )


• Thus, logistic regression is a very good method if we can write

  ln( f_0(X) p_0 / (f_1(X) p_1) ) = W^T X + w_0

• For example, if f_0 and f_1 are Gaussian with the same covariance matrix, then the above holds.
• Thus, this is one case where logistic regression would give you the optimal classifier.


• In logistic regression we find W, w_0 to minimize

  (1/2) Σ_{i=1}^n (h(W^T X_i + w_0) − y_i)^2

  where h(a) = (1 + exp(−a))^{-1} is the logistic function.
• The ‘targets’ are y_i ∈ {0, 1}.
• We can use, e.g., the LMS algorithm to find W*, w_0*.
• Then, we use h(X^T W* + w_0*) as the posterior probability of class 1, to implement the classifier (see the sketch below).
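A minimal sketch of logistic regression trained with the extended LMS update (squared error through the sigmoid, as in the slides); the synthetic data, step size, and number of passes are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: logistic regression fitted by the extended LMS update
# (squared-error loss through the sigmoid). Synthetic, illustrative data.
rng = np.random.default_rng(2)

def h(a):
    return 1.0 / (1.0 + np.exp(-a))               # logistic (sigmoid) function

n, d = 400, 2
X = np.hstack([rng.normal(size=(n, d)), np.ones((n, 1))])    # augmented inputs
true_W = np.array([2.0, -1.0, 0.5])
y = (rng.uniform(size=n) < h(X @ true_W)).astype(float)      # targets in {0, 1}

W = np.zeros(d + 1)
eta = 0.1
for _ in range(50):
    for k in rng.permutation(n):
        a = X[k] @ W
        e = h(a) - y[k]                            # e(k) = h(X^T W) - y
        W -= eta * h(a) * (1.0 - h(a)) * e * X[k]  # h'(a) = h(a)(1 - h(a))
print(W)                                           # h(X^T W) approximates P[y = 1 | X]
```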


• Consider the Bayes (or naive Bayes) classifier.
• We are trying to model (in the 2-class case):

  f(x, y) = f(y) f(x|y) = (p_0 f_0(x))^{1−y} (p_1 f_1(x))^y

  This is what is called a generative model.
• But the purpose of the model ultimately is to predict the target.


• In logistic regression we are fitting a model:

  Prob[y = 1 | X, W] = 1 / (1 + exp(−W^T X))

• If we take y ∈ {+1, −1} then we can write the conditional density as

  f(y | X, W) = 1 / (1 + exp(−y W^T X))

• This is a ‘discriminative’ model.


• In most applications, our observations or data would be noisy.
• We can take the X_i to be fixed and the observed y_i to be random.
• Often, we get data by measuring y_i for a specific value of X_i. Hence this is a useful scenario.
• Now the W* obtained through linear least squares regression would also be random.
• Hence we would like to know its variance.


• We assume that the noise terms corrupting different y_i are iid and zero-mean.
• Recall that Y is a vector random variable with components y_i.
• By our assumption, its covariance matrix is

  Σ_Y = σ^2 I

  where I is the identity matrix and σ^2 is the noise variance.


• For any random vectors Z, Y, if Z = BY for some matrix B, then

  Σ_Z = E[(Z − EZ)(Z − EZ)^T] = E[B (Y − EY)(Y − EY)^T B^T] = B Σ_Y B^T

• We have W* = (A^T A)^{-1} A^T Y. Hence

  Σ_W = (A^T A)^{-1} A^T σ^2 I A (A^T A)^{-1} = σ^2 (A^T A)^{-1}

• This gives us the covariance matrix of the least squares estimate (checked numerically in the sketch below).
• If the noise is Gaussian then the least squares estimate W would also be Gaussian.
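A minimal numerical check of Σ_W = σ^2 (A^T A)^{-1}, comparing it with the empirical covariance of W* over repeated noise draws; the design matrix, true weights, and noise level are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: compare Sigma_W = sigma^2 (A^T A)^{-1} with an empirical estimate
# obtained by redrawing the zero-mean noise many times (synthetic, illustrative setup).
rng = np.random.default_rng(7)
n, d = 100, 3
A = rng.normal(size=(n, d))                      # fixed design matrix, rows are X_i
W_true = np.array([1.0, 2.0, -1.0])
sigma = 0.5

Sigma_theory = sigma ** 2 * np.linalg.inv(A.T @ A)

estimates = []
for _ in range(2000):                            # redraw the noise many times
    Y = A @ W_true + sigma * rng.normal(size=n)
    estimates.append(np.linalg.solve(A.T @ A, A.T @ Y))
Sigma_empirical = np.cov(np.array(estimates).T)

print(np.max(np.abs(Sigma_theory - Sigma_empirical)))   # should be small
```
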
• Suppose X, y are related by y = W^T X + ξ where ξ is a zero mean noise.
• Then we expect the linear least squares method to easily learn W.
• The final mean square error would be the variance of ξ.
• Using this idea, we can think of the least squares method as an ML estimation procedure under a reasonable probability model.


• Let y be a random variable, a function of X.
• We take the probability model for y as

  f(y | X, W, σ) = (1/(σ√(2π))) exp( −(y − W^T X)^2 / (2σ^2) )

  where W and σ are the parameters.
• Let D = {y_1(X_1), ..., y_n(X_n)} be the iid data.
• We want to derive the ML estimate for the parameters.


• The data likelihood is

  L(W, σ | D) = ∏_{i=1}^n f(y_i | X_i, W, σ)
              = ∏_{i=1}^n (1/(σ√(2π))) exp( −(y_i − X_i^T W)^2 / (2σ^2) )


• The log likelihood is given by

  l(W, σ | D) = n ln(1/(σ√(2π))) − (1/(2σ^2)) Σ_{i=1}^n (y_i − X_i^T W)^2

• Equating the gradient of the log likelihood (with respect to W) to zero, we get

  Σ_{i=1}^n X_i (y_i − X_i^T W) = 0

• This gives us the same W as least squares.


• Suppose we want the ML estimate of σ also:

  ∂l/∂σ = −n/σ + (1/σ^3) Σ_{i=1}^n (y_i − X_i^T W)^2 = 0

• This gives us

  σ^2 = (1/n) Σ_{i=1}^n (y_i − X_i^T W)^2

• This is the residual average squared error (see the sketch below).
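A minimal sketch of the two ML estimates under this Gaussian model: W from least squares and σ^2 as the average squared residual; the synthetic data is an illustrative assumption.

```python
import numpy as np

# Minimal sketch of the joint ML estimates under the Gaussian noise model:
# W is the least squares solution and sigma^2 is the residual average squared error.
rng = np.random.default_rng(8)
n, d = 200, 4
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.3 * rng.normal(size=n)        # true noise std 0.3 (illustrative)

W_ml, *_ = np.linalg.lstsq(X, y, rcond=None)                 # same W as least squares
sigma2_ml = np.mean((y - X @ W_ml) ** 2)                     # (1/n) sum of squared residuals
print(np.sqrt(sigma2_ml))                                    # close to 0.3
```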


Regularization

• As we saw, we can take any fixed basis functions in our linear model:

  ŷ(X) = f(X) = Σ_{i=0}^M w_i φ_i(X)

• Suppose we have one dimensional data: X_i, y_i ∈ ℜ.
• If we take φ_i(X) = X^i, i = 0, 1, ..., M, then we are trying to fit a polynomial of degree M for the data.
• What M should we take?


• Fixing an M to get ‘least’ error is not a good idea.
• If we take M = n − 1 (where n is the number of data points), we get zero error!
  (Given n points, we can always find an (n − 1)-degree polynomial that goes through all data points.)
• A wrong choice of M can result in ‘overfitting’ – low training error but poor generalization (illustrated in the sketch below).
• This is a fundamental issue in learning from examples.
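A minimal sketch illustrating the point, fitting polynomials of a few degrees to noisy one-dimensional data; the data and the degrees chosen are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of polynomial fitting with phi_i(X) = X^i, illustrating overfitting:
# at M = n - 1 the training error drops to (numerically) zero. Synthetic, illustrative data.
rng = np.random.default_rng(3)
n = 10
x = np.linspace(0.0, 1.0, n)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.normal(size=n)        # noisy 1-D data

for M in (1, 3, n - 1):
    A = np.vander(x, M + 1, increasing=True)                  # columns are x^0, x^1, ..., x^M
    W, *_ = np.linalg.lstsq(A, y, rcond=None)                 # least squares fit of degree M
    train_mse = np.mean((A @ W - y) ** 2)
    print(f"M = {M}: training MSE = {train_mse:.2e}")
```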


• We are fitting a model f(X) = W^T Φ(X) to the data.
• We want to rate different W for their ‘goodness of fit’.
• Σ_i (W^T Φ(X_i) − y_i)^2 is the ‘data error’.
• But it does not tell the whole story of how good W is.
• We can say: in addition, we want a ‘simple’ model.


• Hence we can change our criterion to

  J(W) = Data error + λ · model complexity
       = (1/2) Σ_{i=1}^n (W^T Φ(X_i) − y_i)^2 + λ Ω(W)

• Here Ω(W) is some measure of how ‘complex’ the model is.
• This is called regularized least squares and λ is called the regularization constant.


• In linear least squares regression, we often choose Ω(W) = (1/2)||W||^2.
• Now the criterion is

  J(W) = (1/2) Σ_{i=1}^n (W^T Φ(X_i) − y_i)^2 + (λ/2) W^T W
       = (1/2) (AW − Y)^T (AW − Y) + (λ/2) W^T W

  where, as earlier, A is the matrix whose rows are Φ(X_i).
• Equating the gradient of J to zero, we get

  A^T (AW − Y) + λ W = 0

• This gives us

  (A^T A + λI) W = A^T Y  ⇒  W* = (A^T A + λI)^{-1} A^T Y

• This is similar to the least squares solution except for the λI term (see the sketch below).
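A minimal sketch of the regularized solution next to the ordinary one; the data and the value of λ are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of regularized least squares: W* = (A^T A + lambda I)^{-1} A^T Y.
rng = np.random.default_rng(4)
n, d = 50, 5
A = rng.normal(size=(n, d))                      # rows are Phi(X_i)
Y = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam = 1.0                                        # regularization constant lambda (illustrative)
W_reg = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ Y)
W_ols = np.linalg.solve(A.T @ A, A.T @ Y)        # ordinary least squares for comparison
print(np.linalg.norm(W_reg), np.linalg.norm(W_ols))   # the regularized solution is 'shrunk'
```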


• We are essentially minimizing

  J(W) = Σ_{i=1}^n (W^T X_i − y_i)^2 + λ W^T W

• This is known as ridge regression in statistics.
• It essentially ‘shrinks’ components of W.
• Gradient descent would give

  W(k+1) = W(k) − η (W^T X_i − y_i) X_i − η λ W(k)

• This is known as weight decay in the neural network literature.


• Another way to look at regularized least squares is from a Bayesian framework.
• We saw that the least squares solution can also be derived as an ML estimate of the parameters of a reasonable probability model for y.
• The regularized least squares solution can be derived as a Bayesian (MAP) estimate of the parameters of the same model.


• As earlier, take the probability model for y as

  f(y | X, W, σ) = (1/(σ√(2π))) exp( −(y − W^T X)^2 / (2σ^2) )

• We want to estimate W from n iid observations {y_i(X_i), i = 1, ..., n}.
• We take the prior density of W as

  f(W) = (1/(α√(2π)))^d exp( −W^T W / (2α^2) )

  which is zero-mean normal with a diagonal covariance matrix; α is a parameter of the prior.


Now the posterior density is given by

  f(W | Y) ∝ ∏_{i=1}^n f(y_i | X_i, W, σ) f(W)
           ∝ exp( −Σ_{i=1}^n (y_i − W^T X_i)^2 / (2σ^2) − W^T W / (2α^2) )

• To find the MAP estimate we need to maximize the posterior density.
• We can maximize the log of the posterior.


Now, the log posterior density is given by

  ln f(W | Y) = −(1/(2σ^2)) Σ_{i=1}^n (y_i − W^T X_i)^2 − (1/(2α^2)) W^T W + K

  where K is a constant.


Thus, the log posterior is of the form

  ln f(W | Y) = −(1/2) Σ_{i=1}^n (y_i − W^T X_i)^2 − λ W^T W + K

(up to scaling by σ^2, with λ = σ^2/(2α^2)).

• Maximizing this is the same as minimizing the regularized least squares criterion.
• Hence the MAP estimate is the regularized least squares solution.


Bayesian Linear Regression

• We saw that we can look at linear least squares estimation in a Bayesian framework also.
• The MAP estimate corresponds to regularized least squares.
• MAP is a point estimate obtained from the posterior.
• In the Bayesian framework, the estimate is the whole of the posterior.
• We can also calculate the distribution of the prediction variable based on the data.


The posterior density

• In general, the prior can be

  f(W) = N(W | µ_0, Σ_0)

• The probability model for y is

  f(Y | A, W, σ^2) = ∏_{i=1}^n N(y_i | X_i^T W, σ^2) = N(Y | AW, σ^2 I)

• The posterior would also be a Gaussian, which can be calculated using the technique of ‘completing the squares’:

  f(W | Y, A, σ^2) = N(W | µ_n, Σ_n)

  where

  µ_n = Σ_n (σ^{-2} A^T Y + Σ_0^{-1} µ_0),    Σ_n^{-1} = Σ_0^{-1} + σ^{-2} A^T A


The Predictive Distribution

• We can also calculate (for any given new X)

  f(y | X, A, Y, σ^2) = ∫ f(y | X, W, σ^2) f(W | A, Y, σ^2) dW

  which is N(y | µ_n^T X, X^T Σ_n X + σ^2).
• We can use this to predict y for any X (see the sketch below).
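A minimal sketch of these formulas; the prior, noise level, and data are illustrative assumptions, and the predictive variance uses X^T Σ_n X + σ^2 for a single new input X.

```python
import numpy as np

# Minimal sketch of Bayesian linear regression: posterior N(W | mu_n, Sigma_n)
# and the predictive distribution at a new input X. Synthetic, illustrative setup.
rng = np.random.default_rng(5)
n, d = 30, 3
A = rng.normal(size=(n, d))                      # rows are the inputs X_i
Y = A @ np.array([1.0, -1.0, 0.5]) + 0.2 * rng.normal(size=n)

sigma2 = 0.2 ** 2                                # noise variance (assumed known here)
mu_0 = np.zeros(d)                               # prior mean
Sigma_0 = np.eye(d)                              # prior covariance

Sigma_n_inv = np.linalg.inv(Sigma_0) + A.T @ A / sigma2
Sigma_n = np.linalg.inv(Sigma_n_inv)
mu_n = Sigma_n @ (A.T @ Y / sigma2 + np.linalg.inv(Sigma_0) @ mu_0)

X_new = rng.normal(size=d)                       # a new input
pred_mean = mu_n @ X_new                         # mu_n^T X
pred_var = X_new @ Sigma_n @ X_new + sigma2      # X^T Sigma_n X + sigma^2
print(pred_mean, pred_var)
```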


Robust Regression

• We saw that the linear least squares solution can be viewed as the MLE obtained by taking y = W^T X + ξ where ξ is zero-mean Gaussian noise.
• The least squares solution is sensitive to outliers.
• One way of looking at this is that the Gaussian distribution has ‘light tails’.
• To make the solution robust to outliers, we can attempt other noise models.


The Laplace Density

• The Laplace density is given by

  f_Lap(x | µ, b) = (1/(2b)) exp( −|x − µ| / b ),   −∞ < x < ∞

• Mean is µ and variance is 2b^2.
• This is a heavy-tailed distribution.


• We take our noise model to be

  f(y | X, W, b) = f_Lap(y | W^T X, b)

• One can (easily!) show that the MLE is the same as minimizing

  J(W) = Σ_{i=1}^n |y_i − W^T X_i|

• This is called robust linear regression.
• The optimization problem is rather hard.
• Another way to achieve robustness is to use the Huber loss instead of the squared error loss.
• The Huber loss is defined by

  L_{H−δ}(a, b) = 0.5 (a − b)^2          if |a − b| ≤ δ
                = δ |a − b| − 0.5 δ^2     if |a − b| > δ

• Using d|r|/dr = sign(r), this loss function is differentiable and hence we can solve the resulting optimization problem (see the sketch below).
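A minimal sketch of robust regression with the Huber loss, minimized by plain gradient descent; the synthetic data (with injected outliers), δ, and step size are illustrative assumptions, and huber_grad is a helper written for this sketch.

```python
import numpy as np

# Minimal sketch of robust linear regression with the Huber loss, fitted by gradient descent.
# Synthetic data with a few large outliers; delta and the step size are illustrative choices.
rng = np.random.default_rng(6)
n, d = 200, 2
X = np.hstack([rng.normal(size=(n, d)), np.ones((n, 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
y[:10] += 20.0                                   # inject a few large outliers

def huber_grad(r, delta):
    # derivative of the Huber loss with respect to the residual r
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

W = np.zeros(d + 1)
delta, eta = 1.0, 0.01
for _ in range(500):
    r = X @ W - y
    W -= eta * X.T @ huber_grad(r, delta) / n    # gradient step on the average Huber loss
print(W)                                         # should stay close to the true weights despite the outliers
```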
