Lecture 9: Linear Regression
Roadmap

1 When Can Machines Learn?
2 Why Can Machines Learn?
    Lecture 8: Noise and Error
    learning can happen with target distribution P(y|x) and low E_in w.r.t. err
3 How Can Machines Learn?
    Lecture 9: Linear Regression
        Linear Regression Problem
        Linear Regression Algorithm
        Generalization Issue
        Linear Regression for Binary Classification
4 How Can Machines Learn Better?
Linear Regression Problem
Credit Limit Problem

customer application example:
    age:               23 years
    gender:            female
    annual salary:     33,000 USD
    year in residence: 1 year
    year in job:       0.5 year
    current debt:      20,000
    credit limit?      5,000

[diagram: the learning setup]
    unknown target function f: X → Y (ideal credit limit formula)
    training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in bank)
    learning algorithm A
    hypothesis set H (set of candidate formulas)
    final hypothesis g ≈ f (learned formula to be used)

Y = R: regression
Linear Regression Hypothesis
customer features x:
    age:           23 years
    annual salary: 33,000 USD
    year in job:   0.5 year
    current debt:  20,000

For x = (x_0, x_1, x_2, ..., x_d), the features of a customer,
approximate the desired credit limit with a weighted sum:

    y \approx \sum_{i=0}^{d} w_i x_i

linear regression hypothesis: h(x) = w^T x

h(x): like the perceptron, but without the sign
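For concreteness, a minimal numpy sketch of this weighted sum; the feature encoding and the weight values are made up purely for illustration, with x_0 = 1 as the usual constant coordinate:

    import numpy as np

    # feature vector x = (x0, x1, ..., xd): x0 = 1 is the constant coordinate,
    # followed by age, salary (thousands), year in job, debt (thousands)
    x = np.array([1.0, 23.0, 33.0, 0.5, 20.0])

    # hypothetical weight vector w, one weight per coordinate (illustrative only)
    w = np.array([2.0, 0.1, 0.3, 1.0, -0.2])

    # linear regression hypothesis: h(x) = w^T x, a weighted sum of the features
    h_x = w @ x
    print(h_x)  # the approximated credit limit for this customer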
Illustration of Linear Regression
[figure: left, one-dimensional input x = (x) ∈ R with data points fit by a line;
 right, two-dimensional input x = (x_1, x_2) ∈ R^2 with data points fit by a
 hyperplane; the point-to-line/hyperplane distances are the residuals]

linear regression:
find lines/hyperplanes with small residuals
The Error Measure
popular/historical error measure:
squared error err(\hat{y}, y) = (\hat{y} - y)^2

in-sample:

    E_in(w) = \frac{1}{N} \sum_{n=1}^{N} (\underbrace{h(x_n)}_{w^T x_n} - y_n)^2

out-of-sample:

    E_out(w) = \mathop{E}_{(x,y) \sim P} (w^T x - y)^2
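A minimal sketch of the in-sample error on made-up stand-in data (X, y, and w here are arbitrary, not the credit data); note that E_out is an expectation over the unknown P and can only be estimated on fresh data:

    import numpy as np

    # toy stand-ins: X holds N examples with x0 = 1 prepended, y the targets,
    # w an arbitrary (not yet optimized) weight vector
    rng = np.random.default_rng(0)
    N, d = 100, 3
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
    y = rng.normal(size=N)
    w = rng.normal(size=d + 1)

    # in-sample squared error: E_in(w) = (1/N) * sum_n (w^T x_n - y_n)^2
    E_in = np.mean((X @ w - y) ** 2)
    print(E_in)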
next: how to minimize E_in(w)?
Fun Time
Consider using the linear regression hypothesis h(x) = w^T x to
predict the credit limit of customers x. Which feature below shall
have a positive weight in a good hypothesis for the task?

1 birth month
2 monthly income
3 current debt
4 number of credit cards owned

Reference Answer: 2

Customers with higher monthly income should naturally be given a
higher credit limit, which is captured by the positive weight on the
monthly income feature.
Linear Regression Algorithm
Matrix Form of E_in(w)

    E_in(w) = \frac{1}{N} \sum_{n=1}^{N} (w^T x_n - y_n)^2
            = \frac{1}{N} \sum_{n=1}^{N} (x_n^T w - y_n)^2

            = \frac{1}{N} \left\| \begin{bmatrix} x_1^T w - y_1 \\ x_2^T w - y_2 \\ \vdots \\ x_N^T w - y_N \end{bmatrix} \right\|^2

            = \frac{1}{N} \left\| \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} w - \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \right\|^2

            = \frac{1}{N} \| \underbrace{X}_{N \times (d+1)} \underbrace{w}_{(d+1) \times 1} - \underbrace{y}_{N \times 1} \|^2
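A quick numeric check of this identity on made-up data; the per-example sum and the matrix-norm form should agree to floating-point precision:

    import numpy as np

    rng = np.random.default_rng(1)
    N, d = 50, 4
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
    y = rng.normal(size=N)
    w = rng.normal(size=d + 1)

    # summation form: (1/N) * sum_n (x_n^T w - y_n)^2
    E_sum = np.mean([(X[n] @ w - y[n]) ** 2 for n in range(N)])

    # matrix form: (1/N) * ||X w - y||^2
    E_mat = np.linalg.norm(X @ w - y) ** 2 / N

    print(np.isclose(E_sum, E_mat))  # True: the two forms are identical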
    \min_w E_in(w) = \frac{1}{N} \| Xw - y \|^2

E_in(w): continuous, differentiable, convex

necessary condition of the best w: the gradient must vanish,

    \nabla E_in(w) = \begin{bmatrix} \frac{\partial E_in}{\partial w_0}(w) \\ \frac{\partial E_in}{\partial w_1}(w) \\ \vdots \\ \frac{\partial E_in}{\partial w_d}(w) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}

task: find w_LIN such that ∇E_in(w_LIN) = 0
The Gradient ∇E_in(w)

    E_in(w) = \frac{1}{N} \|Xw - y\|^2
            = \frac{1}{N} \big( w^T \underbrace{X^T X}_{A} w - 2 w^T \underbrace{X^T y}_{b} + \underbrace{y^T y}_{c} \big)

one w only:
    E_in(w) = \frac{1}{N} (a w^2 - 2 b w + c)
    \nabla E_in(w) = \frac{1}{N} (2 a w - 2 b)
    simple! :-)

vector w:
    E_in(w) = \frac{1}{N} (w^T A w - 2 w^T b + c)
    \nabla E_in(w) = \frac{1}{N} (2 A w - 2 b)
    similar (derived by definition)

    \nabla E_in(w) = \frac{2}{N} \big( X^T X w - X^T y \big)
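A sanity check of the final gradient formula against central finite differences, again on made-up data; this is only a sketch of the derivation's conclusion, not part of the algorithm:

    import numpy as np

    rng = np.random.default_rng(2)
    N, d = 50, 4
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
    y = rng.normal(size=N)
    w = rng.normal(size=d + 1)

    def E_in(w):
        return np.linalg.norm(X @ w - y) ** 2 / N

    # analytic gradient: (2/N) * (X^T X w - X^T y)
    grad = 2.0 / N * (X.T @ X @ w - X.T @ y)

    # numerical gradient by central differences, one coordinate at a time
    eps = 1e-6
    num_grad = np.array([
        (E_in(w + eps * e) - E_in(w - eps * e)) / (2 * eps)
        for e in np.eye(d + 1)
    ])

    print(np.allclose(grad, num_grad))  # True up to finite-difference error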
Optimal Linear Regression Weights

task: find w_LIN such that \frac{2}{N} (X^T X w - X^T y) = \nabla E_in(w) = 0

invertible X^T X:
    easy! unique solution

        w_LIN = \underbrace{(X^T X)^{-1} X^T}_{\text{pseudo-inverse } X^\dagger} \, y

    often the case because N ≫ d+1

singular X^T X:
    many optimal solutions
    one of the solutions

        w_LIN = X^\dagger y

    by defining X† in other ways

practical suggestion: use a well-implemented routine for X†
instead of (X^T X)^{-1} X^T, for numerical stability when X^T X is almost singular
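A sketch of both cases in numpy, where np.linalg.pinv plays the role of the well-implemented X† routine; the duplicated column in the second half makes X^T X singular on purpose:

    import numpy as np

    rng = np.random.default_rng(3)
    N = 100
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
    y = rng.normal(size=N)

    # invertible X^T X: the explicit normal-equation solution exists...
    w_explicit = np.linalg.inv(X.T @ X) @ X.T @ y
    # ...but a dedicated pseudo-inverse routine is preferred numerically
    w_pinv = np.linalg.pinv(X) @ y
    print(np.allclose(w_explicit, w_pinv))  # True: X^T X is well-conditioned here

    # singular X^T X: duplicating a column makes X^T X non-invertible,
    # yet pinv still returns one of the many optimal solutions
    X_sing = np.hstack([X, X[:, [1]]])
    w_sing = np.linalg.pinv(X_sing) @ y
    print(np.linalg.norm(X_sing @ w_sing - y) ** 2 / N)  # minimal E_in attained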
Linear Regression Algorithm

1 from D, construct input matrix X and output vector y by

    X = \underbrace{\begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix}}_{N \times (d+1)}
    \qquad
    y = \underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}}_{N \times 1}

2 calculate pseudo-inverse \underbrace{X^\dagger}_{(d+1) \times N}

3 return \underbrace{w_{LIN}}_{(d+1) \times 1} = X^\dagger y

simple and efficient with a good X† routine
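Putting the three steps together, a minimal sketch of the algorithm; np.linalg.lstsq is used as one such good routine, since it solves the least-squares problem directly rather than forming X† explicitly (the data below is made up):

    import numpy as np

    def linear_regression(X_raw, y):
        """Return w_LIN for raw inputs X_raw (N x d) and targets y (N,)."""
        N = X_raw.shape[0]
        # step 1: construct X by prepending the constant coordinate x0 = 1
        X = np.hstack([np.ones((N, 1)), X_raw])
        # steps 2-3: w_LIN = X^dagger y, via a stable least-squares routine
        w_lin, *_ = np.linalg.lstsq(X, y, rcond=None)
        return w_lin

    # usage on made-up data
    rng = np.random.default_rng(4)
    X_raw = rng.normal(size=(200, 5))
    y = X_raw @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
    w_lin = linear_regression(X_raw, y)
    print(w_lin.shape)  # (d+1,) = (6,)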
Fun Time

After getting w_LIN, we can calculate the predictions \hat{y}_n = w_LIN^T x_n.
If all \hat{y}_n are collected in a vector \hat{y}, similar to how we form y,
what is the matrix formula of \hat{y}?

1 X X^T y
2 X^T X y
3 X X† y
4 X† X y

Reference Answer: 3

Note that \hat{y} = X w_LIN. Then, a simple substitution of w_LIN reveals
the answer.
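A numeric confirmation on made-up data that \hat{y} = X w_LIN and X X† y coincide:

    import numpy as np

    rng = np.random.default_rng(5)
    X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])
    y = rng.normal(size=50)

    X_dagger = np.linalg.pinv(X)
    w_lin = X_dagger @ y

    y_hat_via_w = X @ w_lin          # y_hat = X w_LIN
    y_hat_via_H = X @ X_dagger @ y   # y_hat = X X^dagger y (choice 3)
    print(np.allclose(y_hat_via_w, y_hat_via_H))  # True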
Linear Regression for Binary Classification
Linear Classification vs. Linear Regression

                  Linear Classification                  Linear Regression
    output        Y = {-1, +1}                           Y = R
    hypothesis    h(x) = sign(w^T x)                     h(x) = w^T x
    error         err(\hat{y}, y) = [[\hat{y} ≠ y]]      err(\hat{y}, y) = (\hat{y} - y)^2
    solving       NP-hard to solve in general            efficient analytic solution

{-1, +1} ⊂ R: linear regression for classification?

1 run LinReg on binary classification data D (efficient)
2 return g(x) = sign(w_LIN^T x)

but explanation of this heuristic?
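Before the explanation, a minimal sketch of the heuristic itself; the binary labels here are synthetic, and lstsq again stands in for the X† routine:

    import numpy as np

    rng = np.random.default_rng(6)
    N, d = 200, 3
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
    y = np.sign(X @ rng.normal(size=d + 1))   # made-up labels in {-1, +1}

    # step 1: run linear regression on the binary classification data
    w_lin, *_ = np.linalg.lstsq(X, y, rcond=None)

    # step 2: classify with g(x) = sign(w_LIN^T x)
    g = np.sign(X @ w_lin)
    print(np.mean(g != y))  # in-sample 0/1 error of the heuristic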
Relation of Two Errors

    err_{0/1} = [[ sign(w^T x) ≠ y ]]
    err_{sqr} = (w^T x - y)^2

[figure: err versus w^T x for desired y = +1 (left) and desired y = -1 (right);
 in both panels the squared-error curve lies on or above the 0/1-error step]

    err_{0/1} ≤ err_{sqr}
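A quick numeric check of this inequality over a grid of scores s = w^T x, for both desired labels; a sketch only:

    import numpy as np

    s = np.linspace(-3, 3, 601)                    # scores s = w^T x
    for y in (+1, -1):
        err01 = (np.sign(s) != y).astype(float)    # 0/1 error
        errsqr = (s - y) ** 2                      # squared error
        assert np.all(err01 <= errsqr)             # err_0/1 <= err_sqr everywhere
    print("err_0/1 <= err_sqr holds on the grid")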
Linear Regression for Binary Classification

err_{0/1} ≤ err_{sqr}, so by the VC bound,

    classification E_out(w) ≤ classification E_in(w) + \sqrt{......}    (VC)
                            ≤ regression E_in(w) + \sqrt{......}

(loose) upper bound: use err_{sqr} as \hat{err} to approximate err_{0/1}
trade bound tightness for efficiency

w_LIN: useful baseline classifier,
or as initial PLA/pocket vector (sketched below)
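One way to use w_LIN as a warm start, sketched with a plain PLA-style correction loop on synthetic separable data; the loop details are assumptions for illustration, not a prescribed implementation:

    import numpy as np

    rng = np.random.default_rng(7)
    N, d = 200, 3
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
    y = np.sign(X @ rng.normal(size=d + 1))   # separable synthetic labels

    # baseline classifier: w_LIN from linear regression on the +/-1 labels
    w, *_ = np.linalg.lstsq(X, y, rcond=None)

    # refine with PLA-style updates, starting from w_LIN instead of zero
    for _ in range(1000):
        wrong = np.nonzero(np.sign(X @ w) != y)[0]
        if len(wrong) == 0:
            break                       # all examples classified correctly
        n = rng.choice(wrong)
        w = w + y[n] * X[n]             # correct one mistake: w <- w + y_n x_n
    print(np.mean(np.sign(X @ w) != y))  # remaining 0/1 error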
Summary
1 When Can Machines Learn?
2 Why Can Machines Learn?
    Lecture 8: Noise and Error
3 How Can Machines Learn?
    Lecture 9: Linear Regression
        Linear Regression Problem: use hyperplanes to approximate real values
        Linear Regression Algorithm: analytic solution with pseudo-inverse
        Generalization Issue: E_out - E_in ≈ 2(d+1)/N on average
        Linear Regression for Binary Classification: 0/1 error ≤ squared error
    next: binary classification, regression, and then?
4 How Can Machines Learn Better?