
10-601 Introduction to Machine Learning

Machine Learning Department


School of Computer Science
Carnegie Mellon University

Linear Regression

Matt Gormley
Lecture 7
Feb. 5, 2020

1
Reminders
• Homework 2: Decision Trees
– Out: Wed, Jan. 22
– Due: Wed, Feb. 05 at 11:59pm
• Homework 3: KNN, Perceptron, Lin.Reg.
– Out: Wed, Feb. 05 (+ 1 day)
– Due: Wed, Feb. 12 at 11:59pm
• Today’s In-Class Poll
– http://p7.mlcourse.org

5
THE PERCEPTRON ALGORITHM

6
Intercept Term
Q: Why do we need an intercept term?
A: It shifts the decision boundary off the origin.

Q: Why do we add / subtract 1.0 to the intercept term during Perceptron training?
A: Two cases:
1. Increasing b shifts the decision boundary towards the negative side.
2. Decreasing b shifts the decision boundary towards the positive side.

(Figure: decision boundaries with b < 0, b = 0, and b > 0.)
7
Perceptron Inductive Bias
1. Decision boundary should be linear
2. Most recent mistakes are most important
(and should be corrected)

8
Background: Hyperplanes
Hyperplane (Definition 1): H = {x : w^T x = b}

Hyperplane (Definition 2): H = {x' : θ^T x' = 0 and x'_1 = 1}

Notation Trick: fold the bias b and the weights w into a single vector θ by prepending a constant 1 to x (increasing the dimensionality by one) to get x'.

Half-spaces: the hyperplane splits the space into a positive half-space {x' : θ^T x' > 0} and a negative half-space {x' : θ^T x' < 0}.
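In code, the notation trick is a one-liner. Below is a minimal NumPy sketch (my own illustration; the function name fold_bias is not from the lecture) that prepends a constant 1 to each input so the bias b folds into θ:

import numpy as np

def fold_bias(X):
    """Prepend a constant 1 to each row of X so that theta = [b, w1, ..., wM]
    and theta @ x_prime equals w @ x + b."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

# Quick check that the folded form agrees with the original form.
X = np.array([[2.0, -1.0], [0.5, 3.0]])
w, b = np.array([1.0, 2.0]), -0.5
theta = np.concatenate([[b], w])
assert np.allclose(X @ w + b, fold_bias(X) @ theta)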
(Online) Perceptron Algorithm
Data: Inputs are continuous vectors of length M. Outputs are discrete.

Prediction: Output determined by hyperplane.
    ŷ = h_θ(x) = sign(θ^T x),   where sign(a) = +1 if a ≥ 0, and −1 otherwise

Learning: Iterative procedure:
• initialize parameters to vector of all zeroes
• while not converged
  • receive next example (x^(i), y^(i))
  • predict ŷ = h(x^(i))
  • if positive mistake: add x^(i) to parameters
  • if negative mistake: subtract x^(i) from parameters
10
(Online) Perceptron Algorithm
Data: Inputs are continuous vectors of length M. Outputs are discrete.

Prediction: Output determined by hyperplane.
    ŷ = h_θ(x) = sign(θ^T x),   where sign(a) = +1 if a ≥ 0, and −1 otherwise

Learning: on a mistake, update θ ← θ + y^(i) x^(i).

Implementation Trick: this update has the same behavior as our “add on positive mistake and subtract on negative mistake” version, because y^(i) takes care of the sign.
11
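A minimal NumPy sketch of this online step (illustrative code, not provided by the course), using the implementation trick above so that a single θ ← θ + y·x update covers both mistake cases:

import numpy as np

def perceptron_update(theta, x, y):
    """One online Perceptron step on example (x, y), with y in {+1, -1}.
    Assumes the constant-1 feature is already folded into x (see the notation trick)."""
    y_hat = 1 if theta @ x >= 0 else -1   # predict with the current parameters
    if y_hat != y:                        # on a mistake ...
        theta = theta + y * x             # ... adding y * x handles both cases
    return theta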
(Batch) Perceptron Algorithm
Learning for Perceptron also works if we have a fixed training
dataset, D. We call this the “batch” setting in contrast to the “online”
setting that we’ve discussed so far.

Algorithm 1 Perceptron Learning Algorithm (Batch)
1: procedure PERCEPTRON(D = {(x^(1), y^(1)), . . . , (x^(N), y^(N))})
2:   θ ← 0                              Initialize parameters
3:   while not converged do
4:     for i ∈ {1, 2, . . . , N} do     For each example
5:       ŷ ← sign(θ^T x^(i))            Predict
6:       if ŷ ≠ y^(i) then              If mistake
7:         θ ← θ + y^(i) x^(i)          Update parameters
8:   return θ
12
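The pseudocode above translates directly to NumPy. The sketch below is my own rendering (it assumes labels in {+1, −1} and inputs with the bias feature already folded in), with comments keyed to the numbered lines of Algorithm 1:

import numpy as np

def perceptron_batch(X, Y, max_epochs=1000):
    """Batch Perceptron. X has shape (N, M); Y has shape (N,) with entries in {+1, -1}."""
    theta = np.zeros(X.shape[1])                   # 2: initialize parameters
    for _ in range(max_epochs):                    # 3: while not converged
        mistakes = 0
        for x_i, y_i in zip(X, Y):                 # 4: for each example
            y_hat = 1 if theta @ x_i >= 0 else -1  # 5: predict
            if y_hat != y_i:                       # 6: if mistake
                theta = theta + y_i * x_i          # 7: update parameters
                mistakes += 1
        if mistakes == 0:                          # no mistakes in a full pass: converged
            break
    return theta                                   # 8: return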
(Batch) Perceptron Algorithm
Learning for Perceptron also works if we have a fixed training
dataset, D. We call this the “batch” setting in contrast to the “online”
setting that we’ve discussed so far.

Discussion:
The Batch Perceptron Algorithm can be derived in two ways.
1. By extending the online Perceptron algorithm to the batch
setting (as mentioned above)
2. By applying Stochastic Gradient Descent (SGD) to minimize a
so-called Hinge Loss on a linear separator (see the sketch below)

13
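To make derivation 2 concrete: the Perceptron update can be viewed as a unit-step-size stochastic (sub)gradient step on the loss ℓ(θ; x, y) = max(0, −y θ^T x), a margin-free variant of the hinge loss. A hedged sketch under that assumption:

import numpy as np

def perceptron_loss_subgrad(theta, x, y):
    """Subgradient of max(0, -y * theta @ x) with respect to theta."""
    if y * (theta @ x) <= 0:          # loss is active: a mistake (or on the boundary)
        return -y * x
    return np.zeros_like(theta)       # correctly classified: zero loss, zero subgradient

def sgd_step(theta, x, y, lr=1.0):
    """One SGD step; with lr = 1.0 this is exactly the Perceptron mistake update."""
    return theta - lr * perceptron_loss_subgrad(theta, x, y)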
Extensions of Perceptron
• Voted Perceptron
– generalizes better than (standard) perceptron
– memory intensive (keeps around every weight vector seen during
training, so each one can vote)
• Averaged Perceptron
– empirically similar performance to voted perceptron
– can be implemented in a memory efficient way
(running averages are efficient; see the sketch after this list)
• Kernel Perceptron
– Choose a kernel K(x’, x)
– Apply the kernel trick to Perceptron
– Resulting algorithm is still very simple
• Structured Perceptron
– Basic idea can also be applied when y ranges over an exponentially
large set
– Mistake bound does not depend on the size of that set

14
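As a concrete illustration of the memory-efficient trick for the Averaged Perceptron mentioned in the list above (a sketch of the standard running-average approach, not course-provided code):

import numpy as np

def averaged_perceptron(X, Y, epochs=10):
    """Averaged Perceptron: return the average of the weight vector over all steps."""
    theta = np.zeros(X.shape[1])
    theta_sum = np.zeros(X.shape[1])
    steps = 0
    for _ in range(epochs):
        for x_i, y_i in zip(X, Y):
            if y_i * (theta @ x_i) <= 0:   # mistake: standard Perceptron update
                theta = theta + y_i * x_i
            theta_sum += theta             # running sum, updated after every example
            steps += 1
    return theta_sum / steps               # average weight vector

A further counter trick avoids the per-example accumulation, but the running sum above already removes the need to store one weight vector per update, unlike the Voted Perceptron.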
Perceptron Exercises
Question:
The parameter vector w learned by the
Perceptron algorithm can be written as
a linear combination of the feature
vectors x(1), x(2),…, x(N).

A. True, if you replace “linear” with “polynomial” above
B. True, for all datasets
C. False, for all datasets
D. True, but only for certain datasets
E. False, but only for certain datasets
15
ANALYSIS OF PERCEPTRON

16
Geometric Margin
Definition: The margin of example x w.r.t. a linear separator w is the distance from x to the plane w · x = 0 (or the negative if on the wrong side).

(Figure: margin of a positive example and margin of a negative example with respect to w.)

Slide from Nina Balcan
Geometric Margin
Definition: The margin of example x w.r.t. a linear separator w is the distance from x to the plane w · x = 0 (or the negative if on the wrong side).
Definition: The margin γ_w of a set of examples S w.r.t. a linear separator w is the smallest margin over points x ∈ S.

(Figure: positive and negative points with margin γ_w on either side of the separator w.)

Slide from Nina Balcan
Geometric Margin
Definition: The margin of example x w.r.t. a linear separator w is the distance from x to the plane w · x = 0 (or the negative if on the wrong side).
Definition: The margin γ_w of a set of examples S w.r.t. a linear separator w is the smallest margin over points x ∈ S.
Definition: The margin γ of a set of examples S is the maximum γ_w over all linear separators w.

(Figure: the maximum-margin separator w with margin γ on either side.)

Slide from Nina Balcan
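These definitions translate directly into code. The sketch below (my own illustration) computes the signed margin of a single example and the margin γ_w of a whole set with respect to a given separator w:

import numpy as np

def example_margin(w, x, y):
    """Signed distance from x to the plane w . x = 0; negative if x is on the wrong side."""
    return y * (w @ x) / np.linalg.norm(w)

def set_margin(w, X, Y):
    """Margin gamma_w of a set of examples w.r.t. separator w: the smallest example margin."""
    return min(example_margin(w, x, y) for x, y in zip(X, Y))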
Linear Separability
Def: For a binary classification problem, a set of examples S is linearly separable if there exists a linear decision boundary that can separate the points.

(Figure: Cases 1–4, four small example datasets of + and − points, some linearly separable and some not.)
20
Analysis: Perceptron
Perceptron Mistake Bound
Guarantee: If the data has margin γ and all points lie inside a ball of radius R, then Perceptron makes at most (R/γ)² mistakes.
(Normalized margin: multiplying all points by 100, or dividing all points by 100, doesn’t change the number of mistakes; the algorithm is invariant to scaling.)

(Figure: separable + and − points inside a ball of radius R, with margin γ on either side of the separator.)

Slide adapted from Nina Balcan
21
Analysis: Perceptron
Perceptron Mistake Bound
Guarantee: If the data has margin γ and all points lie inside a ball of radius R, then Perceptron makes at most (R/γ)² mistakes.
(Normalized margin: multiplying all points by 100, or dividing all points by 100, doesn’t change the number of mistakes; the algorithm is invariant to scaling.)

Def: We say that the (batch) perceptron algorithm has converged if it stops making mistakes on the training data (perfectly classifies the training data).

Main Takeaway: For linearly separable data, if the perceptron algorithm cycles repeatedly through the data, it will converge in a finite # of steps.

Slide adapted from Nina Balcan
22
Analysis: Perceptron
Perceptron Mistake Bound
Theorem 0.1 (Block (1962), Novikoff (1962)).
Given dataset: D = {(x^(i), y^(i))}_{i=1}^N.
Suppose:
1. Finite size inputs: ||x^(i)|| ≤ R
2. Linearly separable data: ∃ θ* s.t. ||θ*|| = 1 and y^(i) (θ* · x^(i)) ≥ γ, ∀i
Then: The number of mistakes made by the Perceptron algorithm on this dataset is
    k ≤ (R/γ)²

(Figure: separable + and − points inside a ball of radius R, with margin γ.)
23
Figure from Nina Balcan
Analysis: Perceptron
Perceptron Mistake Bound
Common Misunderstanding: The radius R is centered at the origin, not at the center of the points.
Theorem 0.1 (Block (1962), Novikoff (1962)).
Given dataset: D = {(x^(i), y^(i))}_{i=1}^N.
Suppose:
1. Finite size inputs: ||x^(i)|| ≤ R
2. Linearly separable data: ∃ θ* s.t. ||θ*|| = 1 and y^(i) (θ* · x^(i)) ≥ γ, ∀i
Then: The number of mistakes made by the Perceptron algorithm on this dataset is
    k ≤ (R/γ)²

(Figure: separable + and − points inside a ball of radius R centered at the origin, with margin γ.)
24
Figure from Nina Balcan
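The bound is easy to sanity-check numerically. The sketch below (illustrative, not from the lecture) builds a linearly separable dataset, runs the online Perceptron while counting mistakes, and compares the count to (R/γ)² computed from the known separator θ*:

import numpy as np

rng = np.random.default_rng(0)
theta_star = np.array([0.6, 0.8])            # a unit-norm separator through the origin
X = rng.uniform(-1, 1, size=(500, 2))
X = X[np.abs(X @ theta_star) > 0.1]          # keep only points with margin at least 0.1
Y = np.sign(X @ theta_star)

R = np.max(np.linalg.norm(X, axis=1))        # radius of the smallest origin-centered ball
gamma = np.min(Y * (X @ theta_star))         # margin of the data w.r.t. theta_star

theta, mistakes = np.zeros(2), 0
for _ in range(100):                         # cycle repeatedly through the data
    for x, y in zip(X, Y):
        if y * (theta @ x) <= 0:             # mistake
            theta, mistakes = theta + y * x, mistakes + 1

print(mistakes, "<=", (R / gamma) ** 2)      # empirical mistakes vs. the Novikoff bound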
Analysis: Perceptron
Proof of Perceptron Mistake Bound:

We will show that there exist constants A and B s.t.

    A k  ≤  ||θ^(k+1)||  ≤  B √k

25
Analysis: Perceptron
Theorem 0.1 (Block (1962), Novikoff (1962)).
Given dataset: D = {(x^(i), y^(i))}_{i=1}^N.
Suppose:
1. Finite size inputs: ||x^(i)|| ≤ R
2. Linearly separable data: ∃ θ* s.t. ||θ*|| = 1 and y^(i) (θ* · x^(i)) ≥ γ, ∀i
Then: The number of mistakes made by the Perceptron algorithm on this dataset is
    k ≤ (R/γ)²

(Figure: separable + and − points inside a ball of radius R, with margin γ.)

Algorithm 1 Perceptron Learning Algorithm (Online)
1: procedure PERCEPTRON(D = {(x^(1), y^(1)), (x^(2), y^(2)), . . .})
2:   θ ← 0, k = 1                         Initialize parameters
3:   for i ∈ {1, 2, . . .} do             For each example
4:     if y^(i) (θ^(k) · x^(i)) ≤ 0 then  If mistake
5:       θ^(k+1) ← θ^(k) + y^(i) x^(i)    Update parameters
6:       k ← k + 1
7:   return θ
26
Analysis: Perceptron
Proof of Perceptron Mistake Bound:
Part 1: for some A, A k ≤ ||θ^(k+1)|| ≤ B √k

θ^(k+1) · θ* = (θ^(k) + y^(i) x^(i)) · θ*
                  by Perceptron algorithm update
             = θ^(k) · θ* + y^(i) (θ* · x^(i))
             ≥ θ^(k) · θ* + γ
                  by assumption
⇒ θ^(k+1) · θ* ≥ kγ
                  by induction on k, since θ^(1) = 0
⇒ ||θ^(k+1)|| ≥ kγ
                  since ||a|| ||b|| ≥ a · b and ||θ*|| = 1
                  (Cauchy-Schwarz inequality)
28
Analysis: Perceptron
Proof of Perceptron Mistake Bound:
Part 2: for some B, A k ≤ ||θ^(k+1)|| ≤ B √k

||θ^(k+1)||² = ||θ^(k) + y^(i) x^(i)||²
                  by Perceptron algorithm update
             = ||θ^(k)||² + (y^(i))² ||x^(i)||² + 2 y^(i) (θ^(k) · x^(i))
             ≤ ||θ^(k)||² + (y^(i))² ||x^(i)||²
                  since the kth mistake means y^(i) (θ^(k) · x^(i)) ≤ 0
             ≤ ||θ^(k)||² + R²
                  since (y^(i))² ||x^(i)||² = ||x^(i)||² ≤ R² by assumption and (y^(i))² = 1
⇒ ||θ^(k+1)||² ≤ kR²
                  by induction on k, since ||θ^(1)||² = 0
⇒ ||θ^(k+1)|| ≤ √k R
29
Analysis: Perceptron
Proof of Perceptron Mistake Bound:
Part 3: Combining the bounds finishes the proof.

    kγ ≤ ||θ^(k+1)|| ≤ √k R
    ⇒ k ≤ (R/γ)²

The total number of mistakes must be less than this.
30
Analysis: Perceptron
What if the data is not linearly separable?
1. Perceptron will not converge in this case (it can’t!)
2. However, Freund & Schapire (1999) show that by projecting the points (hypothetically) into a higher dimensional space, we can achieve a similar bound on the number of mistakes made on one pass through the sequence of examples.

Excerpt from Freund & Schapire (1999):
“Combining, gives
    √k R ≥ ∥v_{k+1}∥ ≥ v_{k+1} · u ≥ kγ
which implies k ≤ (R/γ)², proving the theorem.

3.2. Analysis for the inseparable case
If the data are not linearly separable then Theorem 1 cannot be used directly. However, we now give a generalized version of the theorem which allows for some mistakes in the training set. As far as we know, this theorem is new, although the proof technique is very similar to that of Klasner and Simon (1995, Theorem 2.2). See also the recent work of Shawe-Taylor and Cristianini (1998) who used this technique to derive generalization error bounds for any large margin classifier.

Theorem 2. Let ⟨(x₁, y₁), . . . , (x_m, y_m)⟩ be a sequence of labeled examples with ∥x_i∥ ≤ R. Let u be any vector with ∥u∥ = 1 and let γ > 0. Define the deviation of each example as
    d_i = max{0, γ − y_i (u · x_i)},
and define D = sqrt(Σ_{i=1}^m d_i²). Then the number of mistakes of the online perceptron algorithm on this sequence is bounded by
    ((R + D)/γ)².

Proof: The case D = 0 follows from Theorem 1, so we can assume that D > 0.”
31
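The quantities in Theorem 2 are straightforward to compute for any candidate separator u and target margin γ. A minimal sketch (illustrative; the function name and arguments are my own):

import numpy as np

def inseparable_mistake_bound(X, Y, u, gamma):
    """Theorem 2 bound of Freund & Schapire (1999): ((R + D) / gamma)^2."""
    u = u / np.linalg.norm(u)                 # the theorem assumes ||u|| = 1
    R = np.max(np.linalg.norm(X, axis=1))
    d = np.maximum(0.0, gamma - Y * (X @ u))  # per-example deviations d_i
    D = np.sqrt(np.sum(d ** 2))
    return ((R + D) / gamma) ** 2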
Perceptron Exercises
Question:
Unlike Decision Trees and K-
Nearest Neighbors, the Perceptron
algorithm does not suffer from
overfitting because it does not
have any hyperparameters that
could be over-tuned on the
training data.

A. True
B. False
C. True and False
32
Summary: Perceptron
• Perceptron is a linear classifier
• Simple learning algorithm: when a mistake is
made, add / subtract the features
• Perceptron will converge if the data are linearly separable; it will not converge if the data are linearly inseparable
• For linearly separable and inseparable data, we
can bound the number of mistakes (geometric
argument)
• Extensions support nonlinear separators and
structured prediction
33
Perceptron Learning Objectives
You should be able to…
• Explain the difference between online learning and
batch learning
• Implement the perceptron algorithm for binary
classification [CIML]
• Determine whether the perceptron algorithm will
converge based on properties of the dataset, and
the limitations of the convergence guarantees
• Describe the inductive bias of perceptron and the
limitations of linear models
• Draw the decision boundary of a linear model
• Identify whether a dataset is linearly separable or not
• Defend the use of a bias term in perceptron
34
REGRESSION

39
Regression
Goal:
– Given a training dataset of pairs (x, y) where
  • x is a vector
  • y is a scalar
– Learn a function (aka. curve or line) y’ = h(x) that best fits the training data

Example Applications:
– Stock price prediction
– Forecasting epidemics
– Speech synthesis
– Generation of images (e.g. Deep Dream)
– Predicting the number of tourists on Machu Picchu on a given day

(Figure: Fig 2 from Brooks et al. (2015), 2013–2014 national forecast, retrospectively, using revised wILI data. doi:10.1371/journal.pcbi.1004382.g002)
Regression
Example Application:
Forecasting Epidemics
• Input features, x:
attributes of the
epidemic
• Output, y:
Weighted %ILI,
prevalence of the
disease
• Setting: observe
past prevalence to
predict future
prevalence

Fig 2. 2013–2014 national forecast, retrospectively, using the final revisions of wILI values, with revised wILI data through epidemiological weeks (A) 47, (B) 51, (C) 1, and (D) 7.
Figure from Brooks et al. (2015), doi:10.1371/journal.pcbi.1004382.g002
41
Regression
Example: Dataset with only one feature x and one scalar output y.
Q: What is the function that best fits these points?

(Figure: scatter plot of the training points in the x–y plane.)
43
k-NN Regression
Example: Dataset with only one feature x and one scalar output y.

k=1 Nearest Neighbor Regression
• Train: store all (x, y) pairs
• Predict: pick the nearest x in training data and return its y

k=2 Nearest Neighbor Distance-Weighted Regression
• Train: store all (x, y) pairs
• Predict: pick the nearest two instances x^(n1) and x^(n2) in training data and return the weighted average of their y values

(Figure: the resulting fits on the one-feature dataset.)
44
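A minimal sketch of the two variants described above (illustrative code, not from the course), using brute-force distances over a single 1D feature:

import numpy as np

def knn1_regress(x_train, y_train, x_query):
    """k=1 nearest neighbor regression: return the y of the closest training x."""
    nearest = np.argmin(np.abs(x_train - x_query))
    return y_train[nearest]

def knn2_weighted_regress(x_train, y_train, x_query, eps=1e-12):
    """k=2 distance-weighted regression: weighted average of the two nearest
    neighbors' y values (inverse-distance weighting is one common choice)."""
    dists = np.abs(x_train - x_query)
    n1, n2 = np.argsort(dists)[:2]
    w1, w2 = 1.0 / (dists[n1] + eps), 1.0 / (dists[n2] + eps)
    return (w1 * y_train[n1] + w2 * y_train[n2]) / (w1 + w2)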
LINEAR REGRESSION

45
Regression Problems
Chalkboard
– Definition of Regression
– Linear functions
– Residuals
– Notation trick: fold in the intercept

46
