INDIAN INSTITUTE OF TECHNOLOGY ROORKEE
Linear Classifier
by
Dr. Sanjeev Kumar
Associate Professor
Department of Mathematics
IIT Roorkee, Roorkee-247 667, India
[email protected]
[email protected]
Linear models
A strong high-bias assumption is linear separability:
in 2 dimensions, can separate classes by a line
in higher dimensions, need hyperplanes
A linear model is a model that assumes the data is linearly
separable
Linear models
A linear model in m-dimensional space (i.e. m features) is defined by m+1 weights:
In two dimensions, a line:
0 = w1 f1 + w2 f2 + b    (where b = -a)
In three dimensions, a plane:
0 = w1 f1 + w2 f2 + w3 f3 + b
In m dimensions, a hyperplane:
0 = b + Σ_{j=1}^{m} w_j f_j
Perceptron learning algorithm
repeat until convergence (or for some # of iterations):
for each training example (f1, f2, …, fm, label):
prediction = b + Σ_{j=1}^{m} w_j f_j
if prediction * label ≤ 0: // they don’t agree
for each wj:
wj = wj + fj*label
b = b + label
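A minimal Python sketch of this loop (the function name, data format, and stopping rule are assumptions, not from the slides); each example is a (features, label) pair with label in {-1, +1}:

import random

def perceptron_train(examples, num_iterations=100):
    """examples: list of (features, label) pairs, label in {-1, +1}."""
    m = len(examples[0][0])
    w = [0.0] * m                                  # one weight per feature
    b = 0.0
    for _ in range(num_iterations):
        mistakes = 0
        for f, label in examples:
            prediction = b + sum(w[j] * f[j] for j in range(m))
            if prediction * label <= 0:            # they don't agree
                for j in range(m):
                    w[j] += f[j] * label
                b += label
                mistakes += 1
        if mistakes == 0:                          # converged: no mistakes in a full pass
            break
    return w, b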
Which line will it find?
Which line will it find?
Only guaranteed to find some
line that separates the data
Linear models
Perceptron algorithm is one example of a linear
classifier
There are many, many other algorithms that learn a line (i.e. a setting of
weights for a linear combination of the features)
Goals:
Explore a number of linear training algorithms
Understand why these algorithms work
Perceptron learning algorithm
repeat until convergence (or for some # of iterations):
for each training example (f1, f2, …, fm, label):
prediction = b + Σ_{j=1}^{m} w_j f_j
if prediction * label ≤ 0: // they don’t agree
for each wj:
wj = wj + fj*label
b = b + label
A closer look at why we got it wrong
Current weights: w1 = 0, w2 = 1.  Training example: (-1, -1, positive).
Prediction: 0 * f1 + 1 * f2 = 0 * (-1) + 1 * (-1) = -1
We'd like this value to be positive, since it's a positive example.
w1 didn't contribute, but it could have; w2 contributed in the wrong direction.
The update decreases both: w1: 0 -> -1, w2: 1 -> 0.
Intuitively these changes make sense.
Why change by 1?  Any other way of doing it?
Model-based machine learning
1. pick a model
e.g. a hyperplane, a decision tree,…
A model is defined by a collection of parameters
What are the parameters for DT? Perceptron?
Model-based machine learning
1. pick a model
e.g. a hyperplane, a decision tree,…
A model is defined by a collection of parameters
2. pick a criterion to optimize (aka objective function)
What criterion do decision tree learning and
perceptron learning optimize?
Model-based machine learning
1. pick a model
e.g. a hyperplane, a decision tree,…
A model is defined by a collection of parameters
2. pick a criterion to optimize (aka objective function)
e.g. training error
3. develop a learning algorithm
the algorithm should try to minimize the criterion
sometimes in a heuristic way (i.e. non-optimally)
sometimes explicitly
Linear models in general
1. pick a model
0 = b + Σ_{j=1}^{m} w_j f_j
These are the parameters we want to learn
2. pick a criterion to optimize (aka objective function)
Some notation: indicator function
1[x] = 1 if x = True, 0 if x = False
Convenient notation for turning T/F answers into numbers/counts:
drinks_to_bring_for_class = Σ_{x ∈ class} 1[ x >= 21 ]
Some notation: dot-product
Sometimes it is convenient to use vector notation
We represent an example f1, f2, …, fm as a single vector, x
Similarly, we can represent the weight vector w1, w2, …, wm as a single
vector, w
The dot-product between two vectors a and b is defined as:
a · b = Σ_{j=1}^{m} a_j b_j
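In this notation the model's prediction is simply w · x + b. A quick NumPy illustration (the numbers are made up):

import numpy as np

w = np.array([0.5, -1.0, 2.0])     # weight vector
x = np.array([1.0, 3.0, 0.5])      # one example's feature vector
b = 0.2

prediction = np.dot(w, x) + b      # w · x + b = 0.5 - 3.0 + 1.0 + 0.2 = -1.3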
Linear models
1. pick a model
0 = b + Σ_{j=1}^{m} w_j f_j
These are the parameters we want to learn
2. pick a criterion to optimize (aka objective function)
Σ_{i=1}^{n} 1[ y_i (w · x_i + b) ≤ 0 ]
What does this equation say?
0/1 loss function
Σ_{i=1}^{n} 1[ y_i (w · x_i + b) ≤ 0 ]
distance = b + Σ_{j=1}^{m} w_j x_j = w · x + b      (distance from the hyperplane)
incorrect: y_i (w · x_i + b) ≤ 0                     (whether or not the prediction and label agree)
0/1 loss = Σ_{i=1}^{n} 1[ y_i (w · x_i + b) ≤ 0 ]    (total number of mistakes, aka the 0/1 loss)
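A NumPy sketch of this count, assuming X is an n × m matrix of examples and y a vector of ±1 labels (the names are illustrative):

import numpy as np

def zero_one_loss(w, b, X, y):
    """Total number of mistakes: sum over i of 1[ y_i (w · x_i + b) <= 0 ]."""
    margins = y * (X @ w + b)          # y_i (w · x_i + b) for every example
    return int(np.sum(margins <= 0))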
Model-based machine learning
1. pick a model
0 = b + Σ_{j=1}^{m} w_j f_j
2. pick a criterion to optimize (aka objective function)
Σ_{i=1}^{n} 1[ y_i (w · x_i + b) ≤ 0 ]
3. develop a learning algorithm
argmin_{w,b} Σ_{i=1}^{n} 1[ y_i (w · x_i + b) ≤ 0 ]      Find w and b that minimize the 0/1 loss
Minimizing 0/1 loss
argmin_{w,b} Σ_{i=1}^{n} 1[ y_i (w · x_i + b) ≤ 0 ]      Find w and b that minimize the 0/1 loss
How do we do this?
How do we minimize a function?
Why is it hard for this function?
Minimizing 0/1 in one dimension
Σ_{i=1}^{n} 1[ y_i (w · x_i + b) ≤ 0 ]
[plot: the 0/1 loss as a step function of a single weight w]
Each time we change w so that an example becomes right/wrong, the loss decreases/increases by one.
Minimizing 0/1 over all w
Σ_{i=1}^{n} 1[ y_i (w · x_i + b) ≤ 0 ]
[plot: the 0/1 loss surface over all the weights]
Each new feature we add (i.e. each new weight) adds another dimension to this space!
Minimizing 0/1 loss
argmin_{w,b} Σ_{i=1}^{n} 1[ y_i (w · x_i + b) ≤ 0 ]      Find w and b that minimize the 0/1 loss
This turns out to be hard (in fact, NP-hard)
Challenges:
- small changes in any w can produce large changes in the loss (the change isn't continuous)
- there can be many, many local minima
- at any given point, we don't have much information to direct us towards any minimum
More manageable loss functions
[plot: a smooth, bowl-shaped loss as a function of w]
What property/properties do we want from our loss function?
More manageable loss functions
[plot: the same smooth, bowl-shaped loss]
- Ideally, continuous and differentiable, so we get an indication of the direction of minimization
- Only one minimum
Convex functions
Convex functions look something like:
One definition: the line segment between any
two points on the function lies on or above the function
Surrogate loss functions
For many applications, we really would like to minimize the
0/1 loss
A surrogate loss function is a loss function that provides an
upper bound on the actual loss function (in this case, 0/1)
We’d like to identify convex surrogate loss functions to make
them easier to minimize
Key to a loss function is how it scores the difference between
the actual label y and the predicted label y’
Surrogate loss functions
0/1 loss:  l(y, y') = 1[ yy' ≤ 0 ]
Ideas?
Some function that is a proxy for
error, but is continuous and convex
Surrogate loss functions
0/1 loss:      l(y, y') = 1[ yy' ≤ 0 ]
Hinge:         l(y, y') = max(0, 1 - yy')
Exponential:   l(y, y') = exp(-yy')
Squared loss:  l(y, y') = (y - y')²
Why do these work? What do they penalize?
Surrogate loss functions
0/1 loss:      l(y, y') = 1[ yy' ≤ 0 ]         Hinge:        l(y, y') = max(0, 1 - yy')
Squared loss:  l(y, y') = (y - y')²            Exponential:  l(y, y') = exp(-yy')
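As a sketch, the four losses as Python functions of the label y and the predicted value y' (here yp), using NumPy:

import numpy as np

def zero_one(y, yp):    return np.where(y * yp <= 0, 1.0, 0.0)
def hinge(y, yp):       return np.maximum(0.0, 1.0 - y * yp)
def exponential(y, yp): return np.exp(-y * yp)
def squared(y, yp):     return (y - yp) ** 2

# For y in {-1, +1}, hinge, exponential, and squared loss all upper-bound the 0/1 loss
# and are convex in yp.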
Model-based machine learning
1. pick a model
0 = b + Σ_{j=1}^{m} w_j f_j
2. pick a criterion to optimize (aka objective function)
Σ_{i=1}^{n} exp(-y_i (w · x_i + b))      use a convex surrogate loss function
3. develop a learning algorithm
argmin_{w,b} Σ_{i=1}^{n} exp(-y_i (w · x_i + b))      Find w and b that minimize the surrogate loss
Finding the minimum
You’re blindfolded, but you can see out of the bottom of the
blindfold to the ground right by your feet. I drop you off
somewhere and tell you that you're in a convex-shaped valley
and escape is at the bottom/minimum. How do you get out?
Finding the minimum
How do we do this for a function?
One approach: gradient descent
Partial derivatives give us the
slope (i.e. direction to move)
in that dimension
One approach: gradient descent
Partial derivatives give us the
slope (i.e. direction to move) in
that dimension
Approach:
pick a starting point (w)
repeat:
pick a dimension
move a small amount in that dimension towards decreasing loss (using the derivative)
One approach: gradient descent
Partial derivatives give us the
slope (i.e. direction to move) in
that dimension
Approach:
pick a starting point (w)
repeat:
pick a dimension
move a small amount in that
dimension towards decreasing loss
(using the derivative)
Gradient descent
pick a starting point (w)
repeat until loss doesn’t decrease in all dimensions:
pick a dimension
move a small amount in that dimension towards decreasing loss
(using the derivative)
w_j = w_j - η d/dw_j loss(w)
What does this do?
Gradient descent
pick a starting point (w)
repeat until loss doesn’t decrease in all dimensions:
pick a dimension
move a small amount in that dimension towards decreasing loss
(using the derivative)
w_j = w_j - η d/dw_j loss(w)
η: the learning rate (how much we want to move in the error direction; often this will change over time)
Some maths
d/dw_j loss = d/dw_j Σ_{i=1}^{n} exp(-y_i (w · x_i + b))
            = Σ_{i=1}^{n} exp(-y_i (w · x_i + b)) · d/dw_j [ -y_i (w · x_i + b) ]
            = Σ_{i=1}^{n} -y_i x_ij exp(-y_i (w · x_i + b))
Gradient descent
pick a starting point (w)
repeat until loss doesn’t decrease in all dimensions:
pick a dimension
move a small amount in that dimension towards decreasing loss
(using the derivative)
w_j = w_j + η Σ_{i=1}^{n} y_i x_ij exp(-y_i (w · x_i + b))
What is this doing?
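A hedged NumPy sketch of this batch update (the learning rate, iteration count, and bias update are assumptions for illustration):

import numpy as np

def gradient_descent_exp_loss(X, y, eta=0.1, num_iters=1000):
    """Minimize sum_i exp(-y_i (w·x_i + b)) by batch gradient descent."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(num_iters):
        coef = y * np.exp(-y * (X @ w + b))   # y_i exp(-y_i (w·x_i + b)) for every example
        w += eta * (X.T @ coef)               # w_j += eta * sum_i y_i x_ij exp(...)
        b += eta * np.sum(coef)               # analogous update for the bias term
    return w, b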
Exponential update rule
w_j = w_j + η Σ_{i=1}^{n} y_i x_ij exp(-y_i (w · x_i + b))
for each example x_i:
w_j = w_j + η y_i x_ij exp(-y_i (w · x_i + b))
Does this look familiar?
Perceptron learning algorithm!
repeat until convergence (or for some # of iterations):
for each training example (f1, f2, …, fm, label):
prediction = b + Σ_{j=1}^{m} w_j f_j
if prediction * label ≤ 0: // they don’t agree
for each wj:
wj = wj + fj*label
b = b + label
w_j = w_j + η y_i x_ij exp(-y_i (w · x_i + b))
or
w_j = w_j + x_ij y_i c    where c = η exp(-y_i (w · x_i + b))
The constant
c = η exp(-y_i (w · x_i + b))
η: learning rate    y_i: label    (w · x_i + b): prediction
When is this large/small?
The constant
c = η exp(-y_i (w · x_i + b))
y_i: label    (w · x_i + b): prediction
If they're the same sign, then as the prediction gets larger the update gets smaller.
If they're different, the more different they are, the bigger the update.
Perceptron learning algorithm!
repeat until convergence (or for some # of iterations):
for each training example (f1, f2, …, fm, label):
prediction = b + Σ_{j=1}^{m} w_j f_j
if prediction * label ≤ 0: // they don’t agree
for each wj:        Note: for gradient descent, we always update (not just on mistakes)
wj = wj + fj*label
b = b + label
w_j = w_j + η y_i x_ij exp(-y_i (w · x_i + b))
or
w_j = w_j + x_ij y_i c    where c = η exp(-y_i (w · x_i + b))
Summary
Model-based machine learning:
define a model, objective function (i.e. loss function),
minimization algorithm
Gradient descent minimization algorithm:
requires that our loss function is convex
makes small updates towards lower losses
Perceptron learning algorithm:
gradient descent
exponential loss function (modulo a learning rate)
Regularization
Introduction
Ill-posed Problems
In finite domains, most inverse problems are referred to as ill-posed problems.
The mathematical term well-posed problem stems
from a definition given by Jacques Hadamard. He
believed that mathematical models of physical
phenomena should have the properties that
•A solution exists
•The solution is unique
•The solution's behavior changes continuously
with the initial conditions.
Ill-conditioned Problems
Problems that are not well-posed in the sense of
Hadamard are termed ill-posed. Inverse problems
are often ill-posed.
Even if a problem is well-posed, it may still be ill-
conditioned, meaning that a small error in the
initial data can result in much larger errors in the
answers. An ill-conditioned problem is indicated by
a large condition number.
Example: Curve Fitting
Some Examples: Linear Systems of Equations
Solve [A]{x} = {b}
Ill-conditioning
• A system of equations is singular if det(A) = 0
• If a system of equations is nearly singular, it is ill-conditioned.
Systems which are ill-conditioned are extremely
sensitive to small changes in coefficients of [A]
and {b}. These systems are inherently sensitive to
round-off errors.
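A small NumPy illustration of this sensitivity (the matrix and right-hand side are made up): a nearly singular system has a huge condition number, and a tiny change in {b} moves the solution a long way.

import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])                 # nearly singular: det(A) = 1e-4
b = np.array([2.0, 2.0001])

x1 = np.linalg.solve(A, b)                    # [1, 1]
x2 = np.linalg.solve(A, b + [0.0, 0.0001])    # perturb b by 1e-4 -> [0, 2]

print(np.linalg.cond(A))                      # ~4e4, a large condition number
print(x1, x2)                                 # the two solutions differ dramatically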
One concern
argmin_{w,b} Σ_{i=1}^{n} exp(-y_i (w · x_i + b))
What is this calculated on?
Is this what we want to optimize?
Perceptron learning algorithm!
repeat until convergence (or for some # of iterations):
for each training example (f1, f2, …, fm, label):
prediction = b + Σ_{j=1}^{m} w_j f_j
if prediction * label ≤ 0: // they don’t agree
for each wj:        Note: for gradient descent, we always update (not just on mistakes)
wj = wj + fj*label
b = b + label
w_j = w_j + η y_i x_ij exp(-y_i (w · x_i + b))
or
w_j = w_j + x_ij y_i c    where c = η exp(-y_i (w · x_i + b))
One concern
argmin_{w,b} Σ_{i=1}^{n} exp(-y_i (w · x_i + b))
We're calculating this on the training set.
We still need to be careful about overfitting!
The minimum over w, b on the training set is generally NOT the minimum for the test set.
How did we deal with this for the perceptron algorithm?
Overfitting revisited: regularization
A regularizer is an additional criterion added to the loss function to make sure that we don't overfit.
It's called a regularizer since it tries to keep the parameters more normal/regular.
It is a bias on the model that forces the learning to prefer certain types of weights over others.
argmin_{w,b} Σ_{i=1}^{n} loss(yy') + λ regularizer(w, b)
Regularizers
0 = b + Σ_{j=1}^{m} w_j f_j
Should we allow all possible weights?
Any preferences?
What makes for a “simpler” model for a
linear model?
Regularizers
0 = b + Σ_{j=1}^{m} w_j f_j
Generally, we don’t want huge weights
If weights are large, a small change in a feature can result in a
large change in the prediction
Also gives too much weight to any one feature
Might also prefer weights of 0 for features that aren’t useful
How do we encourage small weights? or penalize large weights?
Regularizers
0 = b + Σ_{j=1}^{m} w_j f_j
How do we encourage small weights? or penalize large weights?
argmin_{w,b} Σ_{i=1}^{n} loss(yy') + λ regularizer(w, b)
Common regularizers
sum of the weights:          r(w, b) = Σ_j |w_j|
sum of the squared weights:  r(w, b) = Σ_j w_j²
What’s the difference between these?
Common regularizers
sum of the weights:          r(w, b) = Σ_j |w_j|
sum of the squared weights:  r(w, b) = Σ_j w_j²
Squared weights penalize large values more.
Sum of weights penalizes small values more.
p-norm
sum of the weights (1-norm):          r(w, b) = Σ_j |w_j|
sum of the squared weights (2-norm):  r(w, b) = Σ_j w_j²
p-norm:                               r(w, b) = (Σ_j |w_j|^p)^(1/p) = ‖w‖_p
Smaller values of p (p < 2) encourage sparser vectors
Larger values of p discourage large weights more
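A quick Python check of how the penalty for the same weight vector changes with p (the numbers are illustrative):

import numpy as np

def p_norm(w, p):
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

w = np.array([0.5, -0.5, 3.0])
for p in [1, 1.5, 2, 3]:
    print(p, round(p_norm(w, p), 3))
# 1 -> 4.0, 1.5 -> ~3.27, 2 -> ~3.08, 3 -> ~3.01:
# small p keeps penalizing the small entries; large p is dominated by the biggest weight.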
p-norms visualized
[plot: unit balls of the p-norm in the (w1, w2) plane for several values of p; lines indicate penalty = 1]
For example, if w1 = 0.5, the w2 value on the penalty = 1 line is:
p      w2
1      0.5
1.5    0.75
2      0.87
3      0.95
∞      1
p-norms visualized
all p-norms penalize larger
weights
p < 2 tends to create sparse solutions
(i.e. lots of 0 weights)
p > 2 tends to like similar
weights
Model-based machine learning
1. pick a model
0 = b + Σ_{j=1}^{m} w_j f_j
2. pick a criterion to optimize (aka objective function)
Σ_{i=1}^{n} loss(yy') + λ regularizer(w)
3. develop a learning algorithm
argmin_{w,b} Σ_{i=1}^{n} loss(yy') + λ regularizer(w)      Find w and b that minimize this objective
Minimizing with a regularizer
We know how to solve convex minimization problems using
gradient descent:
argmin_{w,b} Σ_{i=1}^{n} loss(yy')
If we can ensure that the loss + regularizer is convex then we
could still use gradient descent:
argmin_{w,b} Σ_{i=1}^{n} loss(yy') + λ regularizer(w)
make convex
Convexity revisited
One definition: the line segment between any
two points on the function lies on or above the function
Mathematically, f is convex if for all x1, x2:
f(t·x1 + (1-t)·x2) ≤ t·f(x1) + (1-t)·f(x2)    for all 0 < t < 1
left-hand side: the value of the function at some point between x1 and x2
right-hand side: the value at that point on the line segment between x1 and x2
Adding convex functions
Claim: if f and g are convex functions, then so is the function z = f + g.
To prove:
z(t·x1 + (1-t)·x2) ≤ t·z(x1) + (1-t)·z(x2)    for all 0 < t < 1
Recall, f is convex if for all x1, x2:
f(t·x1 + (1-t)·x2) ≤ t·f(x1) + (1-t)·f(x2)    for all 0 < t < 1
Adding convex functions
By definition of the sum of two functions:
z(t·x1 + (1-t)·x2) = f(t·x1 + (1-t)·x2) + g(t·x1 + (1-t)·x2)
t·z(x1) + (1-t)·z(x2) = t·f(x1) + t·g(x1) + (1-t)·f(x2) + (1-t)·g(x2)
                      = t·f(x1) + (1-t)·f(x2) + t·g(x1) + (1-t)·g(x2)
Then, given that:
f(t·x1 + (1-t)·x2) ≤ t·f(x1) + (1-t)·f(x2)
g(t·x1 + (1-t)·x2) ≤ t·g(x1) + (1-t)·g(x2)
We know:
f(t·x1 + (1-t)·x2) + g(t·x1 + (1-t)·x2) ≤ t·f(x1) + (1-t)·f(x2) + t·g(x1) + (1-t)·g(x2)
So:  z(t·x1 + (1-t)·x2) ≤ t·z(x1) + (1-t)·z(x2)
Minimizing with a regularizer
We know how to solve convex minimization problems using
gradient descent:
argmin_{w,b} Σ_{i=1}^{n} loss(yy')
If we can ensure that the loss + regularizer is convex then we
could still use gradient descent:
argmin_{w,b} Σ_{i=1}^{n} loss(yy') + λ regularizer(w)
convex as long as both loss and regularizer are convex
p-norms are convex
r(w, b) = (Σ_j |w_j|^p)^(1/p) = ‖w‖_p
p-norms are convex for p ≥ 1
Model-based machine learning
1. pick a model
0 = b + Σ_{j=1}^{m} w_j f_j
2. pick a criterion to optimize (aka objective function)
Σ_{i=1}^{n} exp(-y_i (w · x_i + b)) + (λ/2) ‖w‖²
3. develop a learning algorithm
argmin_{w,b} Σ_{i=1}^{n} exp(-y_i (w · x_i + b)) + (λ/2) ‖w‖²      Find w and b that minimize this objective
Our optimization criterion
argmin_{w,b} Σ_{i=1}^{n} exp(-y_i (w · x_i + b)) + (λ/2) ‖w‖²
Loss function: penalizes examples where the prediction is different than the label.
Regularizer: penalizes large weights.
Key: this function is convex, allowing us to use gradient descent.
Gradient descent
pick a starting point (w)
repeat until loss doesn’t decrease in all dimensions:
pick a dimension
move a small amount in that dimension towards decreasing loss
(using the derivative)
w_j = w_j - η d/dw_j ( loss(w) + regularizer(w, b) )
argmin_{w,b} Σ_{i=1}^{n} exp(-y_i (w · x_i + b)) + (λ/2) ‖w‖²
Some more maths
d/dw_j objective = d/dw_j [ Σ_{i=1}^{n} exp(-y_i (w · x_i + b)) + (λ/2) ‖w‖² ]
…
(some math happens)
= -Σ_{i=1}^{n} y_i x_ij exp(-y_i (w · x_i + b)) + λ w_j
Gradient descent
pick a starting point (w)
repeat until loss doesn’t decrease in all dimensions:
pick a dimension
move a small amount in that dimension towards decreasing loss
(using the derivative)
w_j = w_j - η d/dw_j ( loss(w) + regularizer(w, b) )
w_j = w_j + η Σ_{i=1}^{n} y_i x_ij exp(-y_i (w · x_i + b)) - η λ w_j
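Extending the earlier gradient descent sketch with the L2 term (η, λ, and leaving the bias unregularized are assumptions for illustration):

import numpy as np

def gd_exp_loss_l2(X, y, eta=0.1, lam=0.01, num_iters=1000):
    """Minimize sum_i exp(-y_i (w·x_i + b)) + lam/2 * ||w||^2."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(num_iters):
        coef = y * np.exp(-y * (X @ w + b))       # y_i exp(-y_i (w·x_i + b))
        w += eta * (X.T @ coef) - eta * lam * w   # loss correction minus the shrinkage term
        b += eta * np.sum(coef)                   # bias is typically left unregularized
    return w, b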
The update
w_j = w_j + η y_i x_ij exp(-y_i (w · x_i + b)) - η λ w_j
η: learning rate;  y_i x_ij: direction of the update;  exp(-y_i (w · x_i + b)): constant, how far from wrong;  η λ w_j: regularization
What effect does the regularizer have?
The update
w_j = w_j + η y_i x_ij exp(-y_i (w · x_i + b)) - η λ w_j
η: learning rate;  y_i x_ij: direction of the update;  exp(-y_i (w · x_i + b)): constant, how far from wrong;  η λ w_j: regularization
If w_j is positive, the regularization term reduces it; if w_j is negative, it increases it: either way, it moves w_j towards 0.
L1 regularization
argmin_{w,b} Σ_{i=1}^{n} exp(-y_i (w · x_i + b)) + λ ‖w‖₁
d/dw_j objective = d/dw_j [ Σ_{i=1}^{n} exp(-y_i (w · x_i + b)) + λ ‖w‖₁ ]
= -Σ_{i=1}^{n} y_i x_ij exp(-y_i (w · x_i + b)) + λ sign(w_j)
L1 regularization
w_j = w_j + η y_i x_ij exp(-y_i (w · x_i + b)) - η λ sign(w_j)
η: learning rate;  y_i x_ij: direction of the update;  exp(-y_i (w · x_i + b)): constant, how far from wrong;  η λ sign(w_j): regularization
What effect does the regularizer have?
L1 regularization
w_j = w_j + η y_i x_ij exp(-y_i (w · x_i + b)) - η λ sign(w_j)
η: learning rate;  y_i x_ij: direction of the update;  exp(-y_i (w · x_i + b)): constant, how far from wrong;  η λ sign(w_j): regularization
If w_j is positive, the regularization term reduces it by a constant; if w_j is negative, it increases it by a constant, regardless of magnitude: either way, it moves w_j towards 0.
Regularization with p-norms
L1:  w_j = w_j + η ( loss_correction - λ sign(w_j) )
L2:  w_j = w_j + η ( loss_correction - λ w_j )
Lp:  w_j = w_j + η ( loss_correction - λ c w_j^(p-1) )
How do higher-order norms affect the weights?
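The three regularization corrections side by side as code (c in the Lp case stands for the constant coming out of the derivative, and all names are illustrative):

import numpy as np

def reg_correction(w, lam, p, c=1.0):
    """Per-weight amount subtracted inside the update, for different p-norms."""
    if p == 1:
        return lam * np.sign(w)                              # constant-size push towards 0
    if p == 2:
        return lam * w                                       # push proportional to the weight
    return lam * c * np.sign(w) * np.abs(w) ** (p - 1)       # higher p: big weights pushed much harder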
Regularizers summarized
L1 is popular because it tends to result in sparse solutions
(i.e. lots of zero weights)
However, it is not differentiable (at zero), so it only works with gradient-descent-style solvers
L2 is also popular because for some loss functions, it can
be solved directly (no gradient descent required, though
often iterative solvers still)
Lp norms with larger p are less popular since they don't tend to shrink the
weights enough
The other loss functions
Without regularization, the generic update is:
w_j = w_j + η y_i x_ij c
where
c = exp(-y_i (w · x_i + b))     exponential
c = 1[ yy' < 1 ]                hinge loss
w_j = w_j + η (y_i - (w · x_i + b)) x_ij     squared error
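As a sketch, the per-example multiplier c for the exponential and hinge updates, and the full squared-error update (names are illustrative):

import numpy as np

def update_multiplier(loss, y_i, pred):
    """pred = w·x_i + b; returns c in the update w_j += eta * y_i * x_ij * c."""
    if loss == "exponential":
        return np.exp(-y_i * pred)
    if loss == "hinge":
        return 1.0 if y_i * pred < 1 else 0.0    # only examples inside the margin update
    raise ValueError(loss)

def squared_error_update(w, b, x_i, y_i, eta):
    pred = np.dot(w, x_i) + b
    return w + eta * (y_i - pred) * x_i          # w_j += eta * (y_i - pred) * x_ij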