Lec15 Regression Gradient Descent
Acknowledgements: These slides are based primarily on material from the book
Machine Learning by Tom Mitchell (and associated slides), the book Machine
Learning, An Applied Mathematics Introduction by Paul Wilmott, slides created by
Vivek Srikumar (Utah), Dan Roth (Penn), Julia Hockenmaier (Illinois Urbana-
Champaign), Jessica Wu (Harvey Mudd) and C. David Page (U of Wisconsin-Madison)
Today’s Topics
Midterm
– Wednesday, March 30
Linear Regression
– Gradient descent
– Example
– Convergence
– Stochastic gradient descent
– Regularization
[email protected]
Recap
Evaluation metrics (for classification)
To evaluate a model, compare predicted labels to the actual labels
By default, precision and recall are computed for the positive label, since that is
usually the class of interest and usually the one with fewer examples (e.g.,
diagnosing diseases in patients, identifying spam emails)
Combining into one number

f1 = 2pr / (p + r)
Goal: use the training data to find the best possible value of w

If our hypothesis space is linear functions, how do we know which weight vector is the best one for a training set?

For an input (xi, yi) in the training set, the cost of a mistake is

| yi − wT xi |

i.e., how far apart is the true output yi from the predicted output wT xi, in an absolute sense? If they are very different, then the weight vector is probably not very good. But it could also be that the weight vector is just bad for that example.
How do we decide whether a weight vector is good?

For an input (xi, yi) in the training set, the cost of a mistake, | yi − wT xi |, tells us how good w is for one example.

J is a function that evaluates how good other functions (regressors) are, e.g.,

J(f) = (1/2) ∑_{i=1}^{m} (yi − f(xi))^2

Every choice of w gives a different regressor, so J evaluates how good a regressor is.
One strategy for learning: find the w with the least cost on this data.
This is called Least Mean Squares (LMS) Regression

min_w J(w) = min_w (1/2) ∑_{i=1}^{m} (yi − wT xi)^2

‣ This is just the training objective: you can use different learning algorithms to minimize this objective
‣ Different strategies exist for learning by optimization:
  • Gradient descent is a popular algorithm
  • Matrix inversion: for this particular minimization objective, there is also an analytical solution; no need for gradient descent: b = (XT X)^{−1} XT Y
‣ Properties of J(w): differentiable and convex. Lower values mean a better weight vector w, i.e., a better regressor
‣ Mathematical optimization focuses on solving problems of the form min_w J(w), so many algorithms exist to solve this problem
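The analytical (matrix-inversion) solution can be sketched in a few lines of NumPy. This is an illustrative sketch, not the lecture's code: the tiny dataset is invented, and np.linalg.solve is used instead of forming the inverse explicitly, which is numerically preferable.

```python
import numpy as np

# Invented toy data: m = 4 examples, a bias column of ones plus 2 features.
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 0.5, 3.0],
              [1.0, 1.5, 2.0],
              [1.0, 3.0, 0.5]])
y = np.array([5.0, 4.5, 5.5, 4.0])

# Analytical LMS solution: b = (X^T X)^{-1} X^T y.
b = np.linalg.solve(X.T @ X, X.T @ y)

# b minimizes J(w) = 1/2 * sum_i (y_i - w^T x_i)^2, so the gradient at b is ~0.
grad_at_b = -X.T @ (y - X @ b)
print(grad_at_b)  # every component numerically close to 0
```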
Linear Regression
GRADIENT DESCENT
Gradient descent

We are trying to minimize J(w) = (1/2) ∑_{i=1}^{m} (yi − wT xi)^2

General strategy for minimizing a function J(w):
At every point, compute the gradient (the arrow) and take a step in the direction opposite to the gradient, i.e., move to a point where the value of the function is lower.
Keep repeating …
Gradient descent for LMS

We are trying to minimize J(w) = (1/2) ∑_{i=1}^{m} (yi − wT xi)^2

1. Initialize w0 (to zeroes or randomly; J is a convex function, so it doesn't matter where we initialize)
2. For t = 0, 1, 2, …
   – Compute the gradient of J(w) at wt. Call it ∇J(wt) ("grad J" or "nabla J")
   – Update w as follows: wt+1 = wt − r ∇J(wt)
     (the "−" is used since the step is in the direction opposite to the gradient)
   where r is the learning rate (a small constant)
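The two-step loop above can be sketched in NumPy. This is an illustrative sketch, not the lecture's code: it assumes the LMS gradient −∑i (yi − wT xi) xi derived later in the lecture, and the toy data and learning rate are invented.

```python
import numpy as np

def gradient_descent_lms(X, y, r=0.01, steps=100):
    """Batch gradient descent for J(w) = 1/2 * sum_i (y_i - w^T x_i)^2."""
    w = np.zeros(X.shape[1])          # initialize w0 to zeroes
    for _ in range(steps):
        grad = -X.T @ (y - X @ w)     # dJ/dw_j = -sum_i (y_i - w^T x_i) x_ij
        w = w - r * grad              # step in the direction opposite the gradient
    return w

# Invented toy data where y = 2*x exactly, so w should approach [2].
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
print(gradient_descent_lms(X, y))     # close to [2.]
```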
Gradient of the cost J at point w

Remember that w is a vector with d elements: w = [w1, w2, w3, …, wj, …, wd]. J is a function that maps w to a real number (the total cost).

To find the best direction in the weight space w, we compute the gradient of J with respect to each of the components of w. For the jth weight:

∂J/∂wj = ∂/∂wj [ (1/2) ∑_{i=1}^{m} (yi − wT xi)^2 ]

= (1/2) ∑_{i=1}^{m} ∂/∂wj (yi − wT xi)^2     (the gradient of a sum is the sum of the gradients, so move the partial derivative inside)

= (1/2) ∑_{i=1}^{m} 2(yi − wT xi) ∂/∂wj (yi − w1 xi1 − ⋯ − wj xij − ⋯)     (chain rule, with the dot product expanded)

= (1/2) ∑_{i=1}^{m} 2(yi − wT xi)(−xij)     (only the term wj xij depends on wj)

= − ∑_{i=1}^{m} (yi − wT xi) xij     (move the 2 and the minus outside; the 2s cancel)

This is one element of the gradient vector. The negative of this gradient is how much to change the jth weight: larger features (xij) with larger errors will cause a larger change.
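The final formula can be sanity-checked against a central finite-difference approximation of ∂J/∂wj. Everything in this sketch (the data, the choice of eps) is invented for illustration.

```python
import numpy as np

def J(w, X, y):
    """Cost J(w) = 1/2 * sum_i (y_i - w^T x_i)^2."""
    residual = y - X @ w
    return 0.5 * residual @ residual

def grad_J(w, X, y):
    """Derived gradient: dJ/dw_j = -sum_i (y_i - w^T x_i) x_ij."""
    return -X.T @ (y - X @ w)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
w = rng.normal(size=3)

# Central finite differences, one coordinate direction e at a time.
eps = 1e-6
numeric = np.array([
    (J(w + eps * e, X, y) - J(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.max(np.abs(numeric - grad_J(w, X, y))))  # tiny (pure roundoff)
```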
Gradient descent for LMS

We are trying to minimize J(w) = (1/2) ∑_{i=1}^{m} (yi − wT xi)^2

1. Initialize w0
2. For t = 0, 1, 2, …
   – Compute ∇J(wt) = [∂J/∂w1, ∂J/∂w2, ⋯, ∂J/∂wd]
   – Update w as follows: wt+1 = wt − r ∇J(wt)
   where r is the learning rate (for now a small constant)
Gradient Descent
EXAMPLE
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage
(x 100 lb) (years) per gallon
31.5 6 21
36.2 2 25
43.1 0 18
27.6 2 30
38
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18
x4 x40 = 1 x41 27.6 x42 2 30
Example Feature
index index
39
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18
x4 x40 = 1 x41 27.6 x42 2 30
40
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2
41
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2
42
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2
m
∂J One element
(yi − wT xi)xij
∑
=−
∂wj of ∇J(wt )
i=1
wt+1 = wt − r ∇J(wt )
43
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2
m
∂J
(yi − wT xi)xi0
∑
=−
[ ∂w0 ∂w1 ∂w2 ]
∂J ∂J ∂J
∇J(wt ) = , , ∂w0 i=1
m
∂J
(yi − wT xi)xij
∑
=−
∂wj i=1
wt+1 = wt − r ∇J(wt )
44
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2
m
∂J
(yi − wT xi)xi0
∑
=−
[ ∂w0 ∂w1 ∂w2 ]
∂J ∂J ∂J
∇J(wt ) = , , ∂w0 i=1
m
∂J
m (yi − wT xi)xi1
∑
∂J =−
(yi − wT xi)xij ∂w1
∑
=− i=1
∂wj i=1
m
∂J
(yi − wT xi)xi2
∑
wt+1 = wt − r ∇J(wt ) =−
∂w2 i=1 45
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
∂J m w2
(yi − wT xi)xi0
∑
=−
∂w0 i=1
= − (y1 − wT x1)x10 − (y2 − wT x2)x20 − (y3 − wT x3)x30 − (y4 − wT x4)x40
46
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
∂J m w2
(yi − wT xi)xi0
∑
=−
∂w0 i=1
= − (y1 − wT x1)x10 − (y2 − wT x2)x20 − (y3 − wT x3)x30 − (y4 − wT x4)x40
m
∂J
(yi − wT xi)xi1
∑
=−
∂w1 i=1
= − (y1 − wT x1)x11 − (y2 − wT x2)x21 − (y3 − wT x3)x31 − (y4 − wT x4)x41
m
∂J
(yi − wT xi)xi2
∑
=−
∂w2 i=1
= − (y1 − wT x1)x12 − (y2 − wT x2)x22 − (y3 − wT x3)x32 − (y4 − wT x4)x42 47
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2
m
∂J
(yi − wT xi)xi0
∑
=−
∂w0 i=1
= − (y1 − wT x1)x10 − (y2 − wT x2)x20 − (y3 − wT x3)x30 − (y4 − wT x4)x40
= − (21 − wT x1)1 − (25 − wT x2)1 − (18 − wT x3)1 − (30 − wT x4)1
= − (21 − 0)1 − (25 − 0)1 − (18 − 0)1 − (30 − 0)1
= − 94
48
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2
m
∂J
(yi − wT xi)xi1
∑
=−
∂w1 i=1
= − (y1 − wT x1)x11 − (y2 − wT x2)x21 − (y3 − wT x3)x31 − (y4 − wT x4)x41
= − (21 − wT x1)31.5 − (25 − wT x2)36.2 − (18 − wT x3)43.1 − (30 − wT x4)27.6
= − (21 − 0)31.5 − (25 − 0)36.2 − (18 − 0)43.1 − (30 − 0)27.6
= − 661.5 − 905 − 775 − 828
= − 3169.5 49
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2
m
∂J
(yi − wT xi)xi2
∑
=−
∂w2 i=1
= − (y1 − wT x1)x12 − (y2 − wT x2)x22 − (y3 − wT x3)x32 − (y4 − wT x4)x42
= − (21 − wT x1)6 − (25 − wT x2)2 − (18 − wT x3)0 − (30 − wT x4)2
= − (21 − 0)6 − (25 − 0)2 − (18 − 0)0 − (30 − 0)2
= − 126 − 50 − 0 − 60
= − 236 50
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2
[ 0 1 ∂w2 ]
∂J ∂J ∂J
∇J(wt ) = , ,
∂w ∂w m
∂J
(yi − wT xi)xi0 = − 94
∑
wt+1 = wt − r ∇J(wt ) =−
∂w0 i=1
m
∂J
(yi − wT xi)xi1 = − 3169.5
∑
=−
∂w1 i=1
m
∂J
(yi − wT xi)xi2 = − 236
∑
=−
∂w2 i=1 51
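This first gradient computation can be reproduced in NumPy with the table's data; the learning rate value below is an arbitrary illustrative choice.

```python
import numpy as np

# Mileage data: columns are bias x_i0 = 1, weight (x100 lb), age (years).
X = np.array([[1.0, 31.5, 6.0],
              [1.0, 36.2, 2.0],
              [1.0, 43.1, 0.0],
              [1.0, 27.6, 2.0]])
y = np.array([21.0, 25.0, 18.0, 30.0])  # mileage per gallon

w = np.zeros(3)                 # w0 = [0, 0, 0]
grad = -X.T @ (y - X @ w)       # ≈ [-94, -3170.3, -236]
print(grad)

r = 0.0001                      # arbitrary small learning rate for illustration
w = w - r * grad                # first update: w1 = w0 - r * grad(J)(w0)
print(w)                        # ≈ [0.0094, 0.31703, 0.0236]
```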
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2
54
Learning Rates and Convergence

▪ In the general ("non-separable") case the learning rate
r must decrease to zero to guarantee convergence.
▪ The learning rate is also called the step size. There are more
sophisticated algorithms that choose the step size
automatically and converge faster.
Impact of learning rate

[Figure: two plots of cost versus w, each starting from a random initial value, contrasting the behavior of gradient descent under different learning rates.]
Decreasing learning rate over time

In order to guarantee that the algorithm will converge, the
learning rate should decrease over time. Here is a general
formula:

‣ At iteration t,

rt = c1 / (t + c2)

where c1 and c2 are constants.
When should the algorithm stop?

1. Stop after a fixed number of iterations
Stopping criteria

For most functions, you probably won't get the gradient
to be exactly equal to 0 in a reasonable amount of time.
Distance

How far apart are two points? In the plane, the distance d between (x1, y1) and (x2, y2) is the hypotenuse of a right triangle with legs x2 − x1 and y2 − y1. In general, for k-dimensional points p and q:

d(p, q) = √( ∑_{i=1}^{k} (pi − qi)^2 )
Stopping criteria

Stop when the norm of the gradient is below some
threshold, θ:

||∇L(w)|| < θ
Gradient descent

1. Initialize the parameters w to some guess (usually all zeroes, or random values)
2. Repeat until a stopping criterion is met: w ← w − r ∇J(w)
Linear Regression
INCREMENTAL/STOCHASTIC GRADIENT DESCENT
Gradient descent for LMS

We are trying to minimize J(w) = (1/2) ∑_{i=1}^{m} (yi − wT xi)^2

1. Initialize w0
Incremental/stochastic gradient descent

Repeat for each example (xi, yi):
‣ Pretend that the entire training set is represented by this single example
‣ Use this example to calculate the gradient and update the model
Incremental/stochastic gradient descent

1. Initialize w
2. For each example (xi, yi), update: w ← w + r (yi − wT xi) xi

May get close to the optimum much faster than the batch version. In general it does not
converge to the global minimum; decreasing r with time guarantees convergence. But
online/incremental algorithms are often preferred when the training set is very large.
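A sketch of this scheme in NumPy, using the single-example gradient −(yi − wT xi) xi, i.e., the batch gradient restricted to one example. The data, learning rate, and epoch count are invented for illustration.

```python
import numpy as np

def sgd_lms(X, y, r=0.05, epochs=50, seed=0):
    """Stochastic gradient descent for LMS: one update per example."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):     # visit examples in random order
            error = y[i] - w @ X[i]           # y_i - w^T x_i
            w = w + r * error * X[i]          # step on this example's gradient
    return w

# Invented toy data where y = 3*x, so w should get close to [3].
X = np.array([[1.0], [2.0], [0.5], [1.5]])
y = 3.0 * X[:, 0]
print(sgd_lms(X, y))   # close to [3.]
```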
Linear regression: summary
▪ What we want: predict a real-valued output using a feature representation of the input
Linear Regression
REGULARIZATION
Generalization
▪ Prediction functions that work on the training data
might not work on other data
Regularization
▪ Modify learning algorithm to favor “simpler” prediction
rules to avoid overfitting
Regularization

▪ How do we define whether weights are large?

d(w, 0) = √( ∑_{i=1}^{k} (wi)^2 ) = ||w||

Note that the bias term w0 is not regularized.
Regularization

▪ New goal for minimization:

L(w) + λ ||w||^2

(square the norm to eliminate the square root: easier to work with mathematically)
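Gradient descent adapts directly to this regularized objective: the penalty contributes an extra 2λw to the gradient, with the bias left unregularized per the earlier note. A NumPy sketch on invented data:

```python
import numpy as np

def ridge_gd(X, y, lam=0.1, r=0.01, steps=2000):
    """Gradient descent on L(w) + lam * ||w[1:]||^2, with LMS loss L.

    Column 0 of X is the bias feature, so w[0] is not regularized.
    """
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = -X.T @ (y - X @ w)      # LMS part of the gradient
        grad[1:] += 2 * lam * w[1:]    # + 2*lam*w from the penalty (skip bias)
        w = w - r * grad
    return w

# Invented toy data: bias column plus one feature.
X = np.array([[1.0, 2.0], [1.0, 0.5], [1.0, 1.5], [1.0, 3.0]])
y = np.array([4.1, 1.2, 3.0, 6.2])
print(ridge_gd(X, y))
```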
Regularization
▪ Regularization helps the computational problem
because gradient descent won’t try to make some
feature weights grow larger and larger
Regularization
▪ This also helps with generalization because it won’t
give large weight to features unless there is sufficient
evidence that they are useful
Regularization

▪ More generally:

L(w) + λ R(w)

R(w) is called the regularization term, regularizer, or penalty. The squared L2 norm is one kind of penalty, but there are others.

λ is called the regularization strength. Other common names for λ: alpha in sklearn, C in many algorithms. Usually C actually refers to the inverse regularization strength, 1/λ. Figure out which one your implementation is using (i.e., whether increasing it will increase or decrease regularization).
L2 Regularization
▪ When the regularizer is the squared L2 norm | | w | |2 , this is
called L2 regularization.
L2 Regularization
▪ The function R(w) = | | w | |2 is convex, so if it is added
to a convex loss function, the combined function will
still be convex.
L1 Regularization

▪ Another common regularizer is the L1 norm:

||w||1 = ∑_{j=1}^{k} |wj|
L2 + L1 Regularization

▪ L2 and L1 regularization can be combined:

R(w) = λ2 ||w||^2 + λ1 ||w||1

(this combination is known as the elastic net)
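In scikit-learn these penalties correspond to Ridge (L2), Lasso (L1), and ElasticNet (both); each takes the regularization strength as alpha, matching the naming caveat above. A quick sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
# Feature 2 is irrelevant: y depends only on the first two features.
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
# The L1-based models tend to drive the irrelevant third coefficient to exactly 0.
```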
Feature normalization
▪ The scale of the feature values matters when using
regularization
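A common way to handle this is to standardize each feature to zero mean and unit variance before fitting, so the penalty treats all weights comparably. A minimal sketch (the numbers echo the mileage table's scales but are otherwise arbitrary):

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit variance (z-scores)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Features on very different scales, e.g. weight (x100 lb) and age (years).
X = np.array([[31.5, 6.0], [36.2, 2.0], [43.1, 0.0], [27.6, 2.0]])
Xs, mu, sigma = standardize(X)
print(Xs.mean(axis=0))  # ~[0, 0]
print(Xs.std(axis=0))   # [1, 1]
```

At prediction time, apply the same mu and sigma to new data rather than re-estimating them.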