Lecture 6: Regression continued

C4B Machine Learning Hilary 2011 A. Zisserman

• Lasso
  • L1 regularization
  • other regularizers

• SVM regression
  • epsilon-insensitive loss

• More loss functions

Regression

• Suppose we are given a training set of N observations
  ((x_1, y_1), . . . , (x_N, y_N)) with x_i ∈ R^d, y_i ∈ R

• The regression problem is to estimate f(x) from this data such that
  y_i = f(x_i)
Regression cost functions

Minimize with respect to w:

    Σ_{i=1}^{N} ℓ(f(x_i, w), y_i)  +  λ R(w)
        loss function                 regularization

• There is a choice of both loss function and regularizer

• So far we have seen – “ridge” regression
  • squared loss: Σ_{i=1}^{N} (y_i − f(x_i, w))^2
  • squared regularizer: λ ||w||^2

• Now, consider other losses and regularizers
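Not part of the slides: as a reference point before moving on, here is a minimal numpy sketch of ridge regression for a linear model f(x, w) = w^T x, using the closed-form solution w = (X^T X + λI)^{-1} X^T y. The data and the value of λ are made up for illustration.

```python
import numpy as np

# Toy data (invented for illustration): y depends linearly on x plus noise.
rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.normal(size=(N, d))
w_true = np.array([2.0, -1.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=N)

# Ridge regression: minimize sum_i (y_i - w^T x_i)^2 + lam * ||w||^2.
# Setting the gradient to zero gives w = (X^T X + lam * I)^{-1} X^T y.
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_ridge)
```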

The “Lasso” or L1 norm regularization

• LASSO = Least Absolute Shrinkage and Selection Operator

Minimize with respect to w ∈ R^d:

    Σ_{i=1}^{N} (y_i − f(x_i, w))^2  +  λ Σ_{j=1}^{d} |w_j|
        loss function                    regularization

• This is a quadratic optimization problem

• There is a unique solution

• p-norm definition: ||w||_p = ( Σ_{j=1}^{d} |w_j|^p )^{1/p}
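Not part of the slides: one standard way to solve the lasso objective above (for a linear model f(x, w) = w^T x) is coordinate descent with soft-thresholding. The sketch below is a minimal numpy version; the data, λ, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

def soft_threshold(a, t):
    """Soft-thresholding operator: sign(a) * max(|a| - t, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize sum_i (y_i - w^T x_i)^2 + lam * sum_j |w_j| by cycling over coordinates."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]             # residual with coordinate j removed
            rho = X[:, j] @ r
            z = X[:, j] @ X[:, j]
            w[j] = soft_threshold(rho, lam / 2.0) / z  # exact minimizer of the 1-D subproblem
    return w

# Toy example (invented): only two of five features actually matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=100)
print(lasso_coordinate_descent(X, y, lam=5.0))   # most weights come out exactly zero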
Sparsity property of the Lasso

• Contour plots for d = 2 of the loss Σ_{i=1}^{N} (y_i − f(x_i, w))^2 and of the
  two regularizers: λ ||w||^2 (ridge regression) and λ Σ_j |w_j| (lasso)

[Figure: contours of the loss and of the two regularizers in the (w_1, w_2) plane]

• The minimum occurs where the loss contours are tangent to the regularizer’s contours

• For the lasso case, minima occur at “corners” of the regularizer’s contours
• Consequently one of the weights is zero
• In high dimensions many weights can be zero
Example: Lasso for polynomial basis functions regression

• The red curve is the true function (which is not a polynomial)

• The data points are samples from the curve with added noise in y

• N = 9, M = 7

[Figure: “ideal fit” – the sample points and the ideal fit to the true curve]

    f(x, w) = Σ_{j=0}^{M} w_j x^j = w^T Φ(x)

where w is an (M+1)-dimensional vector.
[Figures: “Variation of weights with lambda” – the weights w_j plotted against log λ for ridge regression (left) and the lasso (right), with zoomed-in “detail” panels below]
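Not from the slides: a sketch of how such weight paths could be reproduced with scikit-learn. The true curve, noise level, and grid of λ values are invented (the slides do not specify them), and scikit-learn calls the regularization weight alpha rather than λ.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Hypothetical setup: N = 9 noisy samples of a smooth (non-polynomial) curve,
# fitted with a degree M = 7 polynomial basis Phi(x) = (1, x, ..., x^M).
rng = np.random.default_rng(0)
N, M = 9, 7
x = np.linspace(0, 1, N)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=N)   # invented true curve + noise
Phi = np.vander(x, M + 1, increasing=True)             # design matrix, one row per sample

# Trace the weights as the regularization parameter varies.
for lam in [1e-8, 1e-6, 1e-4, 1e-2]:
    ridge = Ridge(alpha=lam, fit_intercept=False).fit(Phi, y)
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=100000).fit(Phi, y)
    n_zero = np.sum(np.abs(lasso.coef_) < 1e-8)
    print(f"lambda={lam:.0e}  ridge |w|max={np.abs(ridge.coef_).max():.1f}  lasso zeros={n_zero}")
```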
Second example – lasso in action

[Figure: the fitted weights plotted against the regularization parameter λ; as λ grows, weights are successively driven to zero]

Sparse weight vectors

• Weights being zero is a method of “feature selection” – zeroing out the unimportant features

• The SVM classifier also has this property (sparse alpha in the dual representation)

• Ridge regression does not

• AdaBoost achieves feature selection by a different, greedy approach
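Not in the slides: with a fitted lasso model, the selected features are simply those with nonzero weights. A minimal scikit-learn sketch (the data and the value of alpha are made up):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: 20 features, only three of which influence y.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 20))
y = X[:, [0, 5, 12]] @ np.array([3.0, -2.0, 1.5]) + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(np.abs(lasso.coef_) > 1e-8)
print("selected features:", selected)   # typically recovers {0, 5, 12}
```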
Other regularizers

    Σ_{i=1}^{N} (y_i − f(x_i, w))^2  +  λ Σ_{j=1}^{d} |w_j|^q

• For q ≥ 1, the cost function is convex and has a unique minimum.
  The solution can be obtained by quadratic optimization.

• For q < 1, the problem is not convex, and obtaining the global
  minimum is more difficult.
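Not in the slides: a small matplotlib sketch of the contour |w_1|^q + |w_2|^q = 1 for a few values of q, which makes the “corners” argument above visual. The particular q values are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

# Plot the unit "ball" {w : |w1|^q + |w2|^q = 1} for several q:
# q = 2 gives a circle (ridge), q = 1 a diamond with corners on the axes (lasso),
# and q < 1 a non-convex, star-like set.
w1, w2 = np.meshgrid(np.linspace(-1.5, 1.5, 400), np.linspace(-1.5, 1.5, 400))
for q in [0.5, 1, 2, 4]:
    plt.contour(w1, w2, np.abs(w1) ** q + np.abs(w2) ** q, levels=[1.0])
plt.gca().set_aspect("equal")
plt.xlabel("w1")
plt.ylabel("w2")
plt.title("contours |w1|^q + |w2|^q = 1")
plt.show()
```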
SVMs for Regression

Use the ε-insensitive error measure

    V_ε(r) = 0         if |r| ≤ ε
             |r| − ε   otherwise

This can also be written as

    V_ε(r) = (|r| − ε)_+

where (·)_+ indicates the positive part of (·), or equivalently as

    V_ε(r) = max(|r| − ε, 0)

[Figure: V_ε(r) compared with the square loss, as a function of r – the cost is zero inside the epsilon “tube”]

• As before, introduce slack variables for points that violate the ε-insensitive error

• For each data point x_i, two slack variables ξ_i, ξ̂_i are required (depending on
  whether f(x_i) is above or below the tube)

• Learning is by the optimization

    min_{w ∈ R^d, ξ_i, ξ̂_i}   C Σ_{i=1}^{N} (ξ_i + ξ̂_i)  +  (1/2) ||w||^2

  subject to

    y_i ≤ f(x_i, w) + ε + ξ_i,   y_i ≥ f(x_i, w) − ε − ξ̂_i,   ξ_i ≥ 0,   ξ̂_i ≥ 0   for i = 1 . . . N

• Again, this is a quadratic programming problem


• It can be dualized
• Some of the data points will become support vectors
• It can be kernelized
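Not from the lecture: a sketch of the primal optimization above written out with cvxpy for a plain linear model f(x, w) = w^T x (no bias term, for brevity). The toy data and the values of C and ε are made up.

```python
import numpy as np
import cvxpy as cp

# Invented toy data with a linear relationship plus noise.
rng = np.random.default_rng(0)
N, d = 30, 3
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

C, eps = 1.0, 0.1
w = cp.Variable(d)
xi = cp.Variable(N, nonneg=True)      # slack above the tube
xi_hat = cp.Variable(N, nonneg=True)  # slack below the tube

f = X @ w                             # f(x_i, w) = w^T x_i
objective = cp.Minimize(C * cp.sum(xi + xi_hat) + 0.5 * cp.sum_squares(w))
constraints = [y <= f + eps + xi, y >= f - eps - xi_hat]
cp.Problem(objective, constraints).solve()
print(w.value)
```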
Example: SV regression with Gaussian basis functions

• The red curve is the true function (which is not a polynomial)

• Regression function – Gaussians centred on the data points

• Parameters are: C, epsilon, sigma

[Figure: “ideal fit” – the sample points and the ideal fit to the true curve]

    f(x, w) = Σ_{i=1}^{N} w_i e^{−(x − x_i)^2 / σ^2} = w^T Φ(x)

    Φ : x → Φ(x),   R → R^N,   w is an N-vector

epsilon = 0.01

[Figure: the sample points, the validation set fit, and the support vectors for epsilon = 0.01]

• The validation set fit is a search over both C and sigma

epsilon = 0.5 and epsilon = 0.8

[Figures: the corresponding fits and support vectors for epsilon = 0.5 and epsilon = 0.8]

As epsilon increases:
• the fit becomes looser
• fewer data points are support vectors
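Not from the slides: a scikit-learn sketch of the same experiment. SVR with an RBF kernel corresponds to the Gaussian-basis model above (scikit-learn parameterizes the kernel width as gamma ≈ 1/σ^2); the data and the parameter values are invented.

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical data: noisy samples of a smooth curve on [0, 1].
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + 0.1 * rng.normal(size=40)

# Fit an RBF-kernel SVR for several widths of the epsilon tube.
for eps in [0.01, 0.5, 0.8]:
    svr = SVR(kernel="rbf", C=10.0, gamma=10.0, epsilon=eps).fit(x, y)
    print(f"epsilon={eps}: {len(svr.support_)} of {len(y)} points are support vectors")
```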

Loss functions for regression

• quadratic (square) loss:   ℓ(y, f(x)) = (1/2) (y − f(x))^2

• ε-insensitive loss:        ℓ(y, f(x)) = max(|r| − ε, 0),  where r = y − f(x)

• Huber loss (mixed quadratic/linear), for robustness to outliers:

    ℓ(y, f(x)) = h(y − f(x)),  with
    h(r) = r^2          if |r| ≤ c
           2c|r| − c^2  otherwise

• all of these are convex

[Figure: the square, ε-insensitive and Huber losses plotted against r = y − f(x)]
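Not in the lecture: minimal numpy versions of the three losses above, with ε and c left as parameters (their default values here are arbitrary).

```python
import numpy as np

def square_loss(r):
    """Quadratic loss: 0.5 * r^2."""
    return 0.5 * r ** 2

def eps_insensitive_loss(r, eps=0.5):
    """Zero inside the epsilon tube, linear outside it."""
    return np.maximum(np.abs(r) - eps, 0.0)

def huber_loss(r, c=1.0):
    """Quadratic for |r| <= c, linear (2c|r| - c^2) beyond that."""
    return np.where(np.abs(r) <= c, r ** 2, 2 * c * np.abs(r) - c ** 2)

# Evaluate all three on a grid of residuals r = y - f(x).
r = np.linspace(-3, 3, 7)
print(square_loss(r))
print(eps_insensitive_loss(r))
print(huber_loss(r))
```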
Final notes on cost functions

Regressors and classifiers can be constructed by a “mix ‘n’ match” of loss
functions and regularizers to obtain a learning machine suited to a
particular application, e.g. for a classifier f(x) = w^T x + b:

• L1 logistic regression

    min_{w ∈ R^d}  Σ_{i=1}^{N} log(1 + e^{−y_i f(x_i)})  +  λ ||w||_1

• L1-SVM

    min_{w ∈ R^d}  Σ_{i=1}^{N} max(0, 1 − y_i f(x_i))  +  λ ||w||_1

• Least squares SVM

    min_{w ∈ R^d}  Σ_{i=1}^{N} [max(0, 1 − y_i f(x_i))]^2  +  λ ||w||^2
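Not part of the lecture: both L1-regularized classifiers above have close counterparts in scikit-learn; a rough sketch (scikit-learn uses C ≈ 1/λ rather than λ, and the data here is made up).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Invented two-class data: only a few of 20 features are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = np.sign(X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200))

# L1 logistic regression: log-loss + L1 penalty.
logreg = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)

# "L1-SVM": LinearSVC does not offer the plain hinge loss with an L1 penalty,
# so this uses the squared hinge, the closest built-in option.
svm = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.5).fit(X, y)

print("nonzero weights (logistic):", np.count_nonzero(logreg.coef_))
print("nonzero weights (svm):     ", np.count_nonzero(svm.coef_))
```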

Background reading

• Bishop, chapters 3.1 & 7.1.4

• Hastie et al, chapters 3.4 & 12.3.5

• More on web page: [Link]
