07. Linear Regression
Linear Regression
Matt Gormley
Lecture 7
Feb. 5, 2020
1
Reminders
• Homework 2: Decision Trees
– Out: Wed, Jan. 22
– Due: Wed, Feb. 05 at 11:59pm
• Homework 3: KNN, Perceptron, Lin.Reg.
– Out: Wed, Feb. 05 (+ 1 day)
– Due: Wed, Feb. 12 at 11:59pm
• Today’s In-Class Poll
– http://p7.mlcourse.org
5
THE PERCEPTRON ALGORITHM
6
Intercept Term
Q: Why do we need an intercept term?

Q: Why do we add / subtract 1.0 to the intercept term during Perceptron training?

A: Two cases
1. Increasing b shifts the decision boundary towards the negative side
2. Decreasing b shifts the decision boundary towards the positive side

[Figure: decision boundaries for b < 0, b = 0, and b > 0]
7
Perceptron Inductive Bias
1. Decision boundary should be linear
2. Most recent mistakes are most important
(and should be corrected)
8
Background: Hyperplanes
Notation Trick: fold the bias b and the weights w into a single vector θ by prepending a constant 1 to x, increasing the dimensionality by one, to get x' (see the sketch below).

Hyperplane (Definition 1):
H = {x : w^T x = b}

Hyperplane (Definition 2):
H = {x' : θ^T x' = 0}, where x' = [1, x_1, …, x_M]^T and θ = [−b, w_1, …, w_M]^T, so that θ^T x' = w^T x − b

Half-spaces:
H+ = {x' : θ^T x' > 0} and H− = {x' : θ^T x' < 0}
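A minimal numpy sketch (names are my own, not from the course) of this notation trick, folding the bias of the activation w · x + b into a single vector θ by prepending a constant 1 to x:

```python
import numpy as np

def add_bias_column(X):
    """Prepend a constant 1 feature to each row of X (N x M -> N x (M+1))."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

# Folding the bias of the activation w.x + b into theta = [b, w], x' = [1, x]:
w, b = np.array([2.0, -1.0]), 0.5
theta = np.concatenate([[b], w])

X = np.array([[1.0, 3.0], [0.0, -2.0]])
assert np.allclose(X @ w + b, add_bias_column(X) @ theta)
```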
(Online) Perceptron Algorithm
Data: Inputs are continuous vectors of length M. Outputs
are discrete.
11
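A minimal sketch of the online perceptron update under the usual assumptions (labels in {−1, +1}, bias already folded into x as above); this is an illustration, not the course's reference pseudocode:

```python
import numpy as np

def online_perceptron(stream):
    """Online perceptron: one pass over a stream of (x, y) pairs, y in {-1, +1}.

    Each x is assumed to already include the constant-1 bias feature.
    """
    theta = None
    for x, y in stream:
        if theta is None:
            theta = np.zeros_like(x, dtype=float)
        y_hat = 1.0 if theta @ x >= 0 else -1.0
        if y_hat != y:          # mistake: add / subtract the features
            theta += y * x
    return theta
```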
(Batch) Perceptron Algorithm
Learning for Perceptron also works if we have a fixed training
dataset, D. We call this the “batch” setting in contrast to the “online”
setting that we’ve discussed so far.
Discussion:
The Batch Perceptron Algorithm can be derived in two ways.
1. By extending the online Perceptron algorithm to the batch
setting (as mentioned above)
2. By applying Stochastic Gradient Descent (SGD) to minimize a
so-called Hinge Loss on a linear separator
13
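A sketch of derivation 2: stochastic (sub)gradient descent on the hinge-style loss max(0, −y (θ · x)) over the fixed dataset D. With step size 1, each nonzero step is exactly the perceptron update, so the batch algorithm reduces to mistake-driven updates cycled over the data. Function names and the epoch count are illustrative:

```python
import numpy as np

def batch_perceptron_sgd(X, y, epochs=10, lr=1.0):
    """SGD on the loss max(0, -y * (theta . x)) over a fixed dataset.

    With lr=1.0 each nonzero subgradient step is exactly the perceptron
    update theta += y * x, taken only on misclassified points.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            if y[i] * (theta @ X[i]) <= 0:   # subgradient is nonzero only on mistakes
                theta += lr * y[i] * X[i]
    return theta
```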
Extensions of Perceptron
• Voted Perceptron
– generalizes better than (standard) perceptron
– memory intensive (keeps around every weight vector seen during
training, so each one can vote)
• Averaged Perceptron
– empirically similar performance to voted perceptron
– can be implemented in a memory efficient way
(running averages are efficient)
• Kernel Perceptron
– Choose a kernel K(x’, x)
– Apply the kernel trick to Perceptron
– Resulting algorithm is still very simple
• Structured Perceptron
– Basic idea can also be applied when y ranges over an exponentially
large set
– Mistake bound does not depend on the size of that set
14
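A sketch of the memory-efficient averaged perceptron mentioned above: rather than storing every intermediate weight vector (as the voted perceptron does), keep a running sum of the weights and predict with their average. Details such as the epoch count are my own choices:

```python
import numpy as np

def averaged_perceptron(X, y, epochs=5):
    """Averaged perceptron: predict with the average of theta over all steps."""
    theta = np.zeros(X.shape[1])
    theta_sum = np.zeros(X.shape[1])
    steps = 0
    for _ in range(epochs):
        for i in range(len(y)):
            if y[i] * (theta @ X[i]) <= 0:   # mistake-driven update
                theta += y[i] * X[i]
            theta_sum += theta               # accumulate after every example
            steps += 1
    return theta_sum / steps                 # averaged weights used for prediction
```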
Perceptron Exercises
Question:
The parameter vector w learned by the
Perceptron algorithm can be written as
a linear combination of the feature
vectors x(1), x(2),…, x(N).
16
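One way to see why the statement is true (when θ is initialized to zero): every update adds y^(i) x^(i) to the parameters, so the learned vector is always a linear combination of the training feature vectors. A small sketch (assuming labels in {−1, +1}) that tracks the combination coefficients α explicitly:

```python
import numpy as np

def perceptron_with_alphas(X, y, epochs=5):
    """Track counts alpha so that theta == sum_i alpha[i] * y[i] * X[i]."""
    N, M = X.shape
    alpha = np.zeros(N)
    theta = np.zeros(M)
    for _ in range(epochs):
        for i in range(N):
            if y[i] * (theta @ X[i]) <= 0:
                theta += y[i] * X[i]
                alpha[i] += 1
    assert np.allclose(theta, (alpha * y) @ X)   # theta is a linear combination of the x(i)
    return theta, alpha
```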
Geometric Margin
Definition: The margin of example x w.r.t. a linear separator w is the distance from x to the plane w · x = 0 (or the negative of that distance if x is on the wrong side).

[Figure: positive and negative examples, with the margins of two example points shown relative to the separator w]
Slide from Nina Balcan
Geometric Margin
Definition: The margin of example x w.r.t. a linear separator w is the distance from x to the plane w · x = 0 (or the negative of that distance if x is on the wrong side).

Definition: The margin γ_w of a set of examples S w.r.t. a linear separator w is the smallest margin over points x ∈ S.

Definition: The margin γ of a set of examples S is the maximum γ_w over all linear separators w.

[Figure: the margin γ on both sides of the separator w]
Slide from Nina Balcan
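A sketch of these definitions for a fixed separator w through the origin (assuming labels y in {−1, +1} mark the correct side, so the value comes out negative on the wrong side):

```python
import numpy as np

def margin_of_example(w, x, y):
    """Signed distance of x to the plane w.x = 0; negative if x is on the wrong side."""
    return y * (w @ x) / np.linalg.norm(w)

def margin_of_dataset(w, X, y):
    """gamma_w: the smallest margin over all examples (x, y) in the set."""
    return min(margin_of_example(w, x, yi) for x, yi in zip(X, y))
```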
Linear Separability
Def: For a binary classification problem, a set of examples S is linearly separable if there exists a linear decision boundary that can separate the points.

[Examples: configurations of + and − points illustrating separable and non-separable cases]
20
Analysis: Perceptron
Perceptron Mistake Bound
Guarantee: If the data has margin γ and all points lie inside a ball of radius R, then Perceptron makes at most (R/γ)² mistakes.

(Normalized margin: multiplying all points by 100, or dividing all points by 100, doesn't change the number of mistakes; the algorithm is invariant to scaling.)

[Figure: separable points inside a ball of radius R with margin γ on each side of the separator]
Slide adapted from Nina Balcan
21
Analysis: Perceptron
Perceptron Mistake Bound
Guarantee: If the data has margin γ and all points lie inside a ball of radius R, then Perceptron makes at most (R/γ)² mistakes.

(Normalized margin: multiplying all points by 100, or dividing all points by 100, doesn't change the number of mistakes; the algorithm is invariant to scaling.)

Def: We say that the (batch) perceptron algorithm has converged if it stops making mistakes on the training data (perfectly classifies the training data).

Main Takeaway: For linearly separable data, if the perceptron algorithm cycles repeatedly through the data, it will converge in a finite # of steps.

Slide adapted from Nina Balcan
22
Analysis: Perceptron
Perceptron Mistake Bound
Theorem 0.1 (Block (1962), Novikoff (1962)).
Given dataset: D = {(x^(i), y^(i))}_{i=1}^N.
Suppose:
1. Finite size inputs: ||x^(i)|| ≤ R
2. Linearly separable data: ∃ θ* s.t. ||θ*|| = 1 and y^(i) (θ* · x^(i)) ≥ γ, ∀i
Then: The number of mistakes made by the Perceptron algorithm on this dataset is
k ≤ (R/γ)²

[Figure: separable points inside a ball of radius R with margin γ]
Figure from Nina Balcan
23
Analysis: Perceptron
Perceptron Mistake Bound

Common Misunderstanding: The radius R is centered at the origin, not at the center of the points.

Theorem 0.1 (Block (1962), Novikoff (1962)).
Given dataset: D = {(x^(i), y^(i))}_{i=1}^N.
Suppose:
1. Finite size inputs: ||x^(i)|| ≤ R
2. Linearly separable data: ∃ θ* s.t. ||θ*|| = 1 and y^(i) (θ* · x^(i)) ≥ γ, ∀i
Then: The number of mistakes made by the Perceptron algorithm on this dataset is
k ≤ (R/γ)²

[Figure: separable points inside a ball of radius R with margin γ]
Figure from Nina Balcan
24
Analysis: Perceptron
Proof of Perceptron Mistake Bound:
We will show that there exist constants A and B such that
A k ≤ ||θ^(k+1)|| ≤ B √k
25
Analysis: Perceptron
Theorem 0.1 (Block (1962), Novikoff (1962)).
Given dataset: D = {(x^(i), y^(i))}_{i=1}^N.
Suppose:
1. Finite size inputs: ||x^(i)|| ≤ R
2. Linearly separable data: ∃ θ* s.t. ||θ*|| = 1 and y^(i) (θ* · x^(i)) ≥ γ, ∀i
Then: k ≤ (R/γ)²
(Cauchy-Schwarz inequality)

[Figure: separable points inside a ball of radius R with margin γ]
28
Analysis: Perceptron
Proof of Perceptron Mistake Bound:
Part 2: for some B, ||θ^(k+1)|| ≤ B √k

||θ^(k+1)||² = ||θ^(k) + y^(i) x^(i)||²
    (by the Perceptron algorithm update)
  = ||θ^(k)||² + (y^(i))² ||x^(i)||² + 2 y^(i) (θ^(k) · x^(i))
  ≤ ||θ^(k)||² + (y^(i))² ||x^(i)||²
    (since the kth mistake means y^(i) (θ^(k) · x^(i)) ≤ 0)
  ≤ ||θ^(k)||² + R²
    (since (y^(i))² ||x^(i)||² = ||x^(i)||² ≤ R² by assumption and (y^(i))² = 1)
⟹ ||θ^(k+1)||² ≤ k R²
    (by induction on k, since ||θ^(1)||² = 0)
⟹ ||θ^(k+1)|| ≤ √k R
29
Analysis: Perceptron
Proof of Perceptron Mistake Bound:
Part 3: Combining the bounds finishes the proof.
kγ ≤ ||θ^(k+1)|| ≤ √k R
⟹ k ≤ (R/γ)²
30

Analysis: Perceptron
(Excerpt shown on the slide, including the linearly inseparable case:)
Combining gives
√k R ≥ ||v_{k+1}|| ≥ v_{k+1} · u ≥ kγ
which implies k ≤ (R/γ)², proving the theorem. □

For the inseparable case, let d_i = max{0, γ − y_i (u · x_i)} and define D = sqrt(Σ_{i=1}^m d_i²). Then the number of mistakes of the online perceptron algorithm on this sequence is bounded by ((R + D)/γ)².
Proof: The case D = 0 follows from Theorem 1, so we can assume that D > 0.
31
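A quick numerical sanity check of the separable-case bound, not from the lecture: generate separable 2-D data, run the cycling perceptron, and compare the mistake count to (R/γ)². The data generator and the minimum-margin threshold are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Separable 2-D data: label by a fixed unit vector u, keep points with margin >= 0.05
u = np.array([0.6, 0.8])                      # ||u|| = 1
X = rng.uniform(-1, 1, size=(500, 2))
y = np.sign(X @ u)
keep = np.abs(X @ u) >= 0.05                  # enforce a minimum margin
X, y = X[keep], y[keep]

R = np.max(np.linalg.norm(X, axis=1))         # radius of a ball containing the data
gamma = np.min(y * (X @ u))                   # margin of the data w.r.t. u

theta, mistakes, converged = np.zeros(2), 0, False
while not converged:                          # cycle until a full pass with no mistakes
    converged = True
    for x_i, y_i in zip(X, y):
        if y_i * (theta @ x_i) <= 0:
            theta += y_i * x_i
            mistakes += 1
            converged = False

print(mistakes, "<=", (R / gamma) ** 2)       # mistakes should not exceed (R/gamma)^2
```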
Perceptron Exercises
Question:
Unlike Decision Trees and K-
Nearest Neighbors, the Perceptron
algorithm does not suffer from
overfitting because it does not
have any hyperparameters that
could be over-tuned on the
training data.
A. True
B. False
C. True and False
32
Summary: Perceptron
• Perceptron is a linear classifier
• Simple learning algorithm: when a mistake is
made, add / subtract the features
• Perceptron will converge if the data are linearly
separable; it will not converge if the data are
linearly inseparable
• For linearly separable and inseparable data, we
can bound the number of mistakes (geometric
argument)
• Extensions support nonlinear separators and
structured prediction
33
Perceptron Learning Objectives
You should be able to…
• Explain the difference between online learning and
batch learning
• Implement the perceptron algorithm for binary
classification [CIML]
• Determine whether the perceptron algorithm will
converge based on properties of the dataset, and
the limitations of the convergence guarantees
• Describe the inductive bias of perceptron and the
limitations of linear models
• Draw the decision boundary of a linear model
• Identify whether a dataset is linearly separable or not
• Defend the use of a bias term in perceptron
34
REGRESSION
39
Regression
Goal:
– Given a training dataset of pairs
(x,y) where
• x is a vector
• y is a scalar
– Learn a function (aka. curve or line)
y’ = h(x) that best fits the training
data
Example Applications:
– Stock price prediction
– Forecasting epidemics
– Speech synthesis
– Generation of images (e.g. Deep
Dream)
– Predicting the number of tourists
on Machu Picchu on a given day
Fig 2. 2013–2014 national forecast, retrospectively, using the final revisions of wILI values, using revised wILI data through epidemiological weeks (A) 47, (B) 51, (C) 1, and (D) 7.
Figure from Brooks et al. (2015), doi:10.1371/journal.pcbi.1004382.g002
41
Regression
Example: Dataset with only one feature x and one scalar output y.
Q: What is the function that best fits these points?
43
k-NN Regression
Example: Dataset with only one feature x and one scalar output y.
k=1 Nearest Neighbor Regression:
• Train: store all (x, y) pairs
• Predict: pick the nearest x in the training data and return its y
44
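A minimal sketch of the k = 1 nearest-neighbor regressor described above (numpy; class and method names are my own):

```python
import numpy as np

class OneNNRegressor:
    """k=1 nearest neighbor regression: memorize the data, predict the y of the closest x."""

    def fit(self, X, y):
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y, dtype=float)
        return self

    def predict(self, X_new):
        X_new = np.atleast_2d(np.asarray(X_new, dtype=float))
        # index of the closest stored x for each query point
        dists = np.linalg.norm(self.X[None, :, :] - X_new[:, None, :], axis=2)
        return self.y[np.argmin(dists, axis=1)]

# Usage on a 1-D toy dataset
reg = OneNNRegressor().fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 4.0])
print(reg.predict([[1.4]]))   # -> [1.] (the closest stored x is 1.0)
```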
LINEAR REGRESSION
45
Regression Problems
Chalkboard
– Definition of Regression
– Linear functions
– Residuals
– Notation trick: fold in the intercept
46
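To anticipate the chalkboard items, here is a small sketch (my own, using numpy's least-squares solver rather than any method prescribed by the course) of a linear function with the intercept folded in, its residuals, and a least-squares fit:

```python
import numpy as np

def fold_intercept(X):
    """Prepend a constant 1 so the intercept becomes just another weight."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_linear_regression(X, y):
    """Least-squares fit: minimize ||X' theta - y||^2 over theta."""
    theta, *_ = np.linalg.lstsq(fold_intercept(X), y, rcond=None)
    return theta

def residuals(theta, X, y):
    """Residuals e_i = y_i - h(x_i) for the linear function h(x) = theta . [1, x]."""
    return y - fold_intercept(X) @ theta

# Toy 1-D example
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.1, 3.9, 6.2])
theta = fit_linear_regression(X, y)
print(theta, residuals(theta, X, y))
```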