Lecture 2: Regression

CHAPTER 2: Classic Machine Learning: Regression
What is ML?
Arthur Samuel (1959). Machine learning:
a field of study that gives computers the ability to learn
without being explicitly programmed.
Tom Mitchell (1998): Well-posed Learning Problem:
"A computer program is said to learn from experience E with
respect to some task T and some performance measure P, if
its performance on T, as measured by P, improves with
experience E."
Suppose your email program watches which emails you do
or do not mark as spam, and based on that learns how to
better filter spam.
What is the task T in this setting?
Classifying emails as spam or not spam
What is the experience E in this setting?
Watching you label emails as spam or not spam
What is the performance measure P in this setting?
The number (or fraction) of emails correctly classified as spam or not spam
Types of ML
- Supervised
- Unsupervised
- Semi-supervised: an intermediate approach in which the model learns from both labeled and unlabeled data
Supervised Learning Types
Supervised Learning: classification vs. regression
Model Construction
A training set is used to create the model. The model is represented as classification rules, decision trees, or mathematical formulae.

Model Usage
The test set is used to see how well the model works for classifying future or unknown objects.
How ML works: classic ML process

The general learning approach: first, the dataset is transformed into a representation, most often a list of vectors, that the learning algorithm can use. The learning algorithm chooses a model and efficiently searches for the model's parameters.

After a model has been fitted on the training dataset, we can use the test dataset to estimate how well it performs on unseen data, i.e., to estimate the so-called generalization error.
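To make the split concrete, here is a minimal Python sketch of this train/test workflow using scikit-learn; the dataset is synthetic and all numbers are purely illustrative, not from the lecture.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: house size in feet^2 -> price in $1000's (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(500, 2500, size=(100, 1))
y = 0.15 * X[:, 0] + rng.normal(0, 20, size=100)

# Hold out a test set so we can estimate the generalization error.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)   # fit on training data only
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"Estimated generalization error (test MSE): {test_mse:.2f}")
```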
Learning: model representations

Training set → Learning Algorithm → hypothesis $h$
$x$ (size of house) → $h$ → estimated price $y$

Hypothesis: $y = h_\theta(x) = \theta_0 + \theta_1 x$ (shorthand: $h(x)$)
How Does ML Work: ML framework

$y_p = f(\Theta, x)$

- $x$: the input
- $y_p$: the output (the values predicted by the model)
- $\Theta$: the parameters of the model (one or more variables)

For linear regression: $h_\theta(x) = \theta_0 + \theta_1 x$
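As a minimal sketch, the univariate hypothesis can be written directly in Python; the parameter values below are made-up examples, not fitted values.

```python
# h_theta(x) = theta_0 + theta_1 * x
def h(theta0: float, theta1: float, x: float) -> float:
    """Predict the output y (e.g., price in $1000's) from the input x."""
    return theta0 + theta1 * x

# With the illustrative choice theta_0 = 50, theta_1 = 0.1, a 2000 ft^2
# house is predicted to cost 50 + 0.1 * 2000 = 250 (in $1000's).
print(h(50.0, 0.1, 2000.0))  # 250.0
```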
Linear regression: model representations

[Figure: training set (price in $1000's vs. size in feet^2) feeding the learning algorithm, which outputs the hypothesis: $x \to h \to y$]

$h_\theta(x) = \theta_0 + \theta_1 x$

With different choices of the parameters $\theta_0$ and $\theta_1$, we get different hypothesis functions:

- $\theta_0 = 1.5$, $\theta_1 = 0$: a horizontal line at $h(x) = 1.5$
- $\theta_0 = 0$, $\theta_1 = 0.5$: a line through the origin with slope $0.5$
- $\theta_0 = 1$, $\theta_1 = 0.5$: a line with intercept $1$ and slope $0.5$
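A short sketch evaluating the three parameter choices above on $x = 1, 2, 3$ (the range shown in the original panels):

```python
def h(theta0, theta1, x):
    return theta0 + theta1 * x

# The three (theta_0, theta_1) pairs from the figure above.
for theta0, theta1 in [(1.5, 0.0), (0.0, 0.5), (1.0, 0.5)]:
    ys = [h(theta0, theta1, x) for x in (1, 2, 3)]
    print(f"theta0={theta0}, theta1={theta1}: {ys}")
# (1.5, 0.0) is a flat line; the other two share slope 0.5 but differ in intercept.
```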
Linear regression: cost function

Idea: choose $\theta_0$, $\theta_1$ so that $h_\theta(x)$ is close to $y$ for the training examples $(x, y)$:

$$\min_{\theta_0,\,\theta_1} \; \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2, \qquad h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}$$

[Figure: training data, price ($) in 1000's vs. size in feet^2]

This defines the squared-error cost function

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

and the goal is to minimize it: $\min_{\theta_0,\,\theta_1} J(\theta_0, \theta_1)$.
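A direct Python translation of $J(\theta_0, \theta_1)$, evaluated on the four housing rows from the table later in the lecture; the parameter guess is arbitrary, chosen only to show the call.

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """J(theta0, theta1) = (1/2m) * sum_i (h_theta(x_i) - y_i)^2."""
    m = len(x)
    predictions = theta0 + theta1 * x
    return np.sum((predictions - y) ** 2) / (2 * m)

x = np.array([2104.0, 1416.0, 1534.0, 852.0])  # size in feet^2
y = np.array([460.0, 232.0, 315.0, 178.0])     # price in $1000's
print(compute_cost(0.0, 0.2, x, y))            # cost for an arbitrary guess
```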
Linear regression: cost function example
Linear regression: gradient descent

We take steps down the cost function in the direction of steepest descent. The size of each step is determined by the parameter $\alpha$, called the learning rate.
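The update equations themselves did not survive on these slides; for reference, the standard simultaneous update for this cost function is:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1), \qquad j \in \{0, 1\}$$

which, for the squared-error cost, works out to

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right), \qquad \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$$

with both parameters updated simultaneously.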
Linear regression: gradient descent algorithm intuition

The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these gradient descent updates, our hypothesis becomes more and more accurate.
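A minimal Python sketch of this loop, assuming the squared-error cost above; the learning rate, iteration count, and toy data are illustrative choices, not values from the lecture.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=1000):
    """Repeatedly apply the simultaneous gradient descent updates."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0                 # initial guess
    for _ in range(iterations):
        error = (theta0 + theta1 * x) - y     # h_theta(x^(i)) - y^(i)
        grad0 = error.sum() / m               # dJ/dtheta0
        grad1 = (error * x).sum() / m         # dJ/dtheta1
        theta0 -= alpha * grad0               # simultaneous update
        theta1 -= alpha * grad1
    return theta0, theta1

# Toy data lying exactly on y = x: the updates drive (theta0, theta1) -> (0, 1).
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(gradient_descent(x, y))                 # approximately (0.0, 1.0)
```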
Univariate Linear Regression: Example

Consider the problem of predicting how well a student does in their second year of university, given how well they did in their first year. Specifically, let $x$ be the number of "A" grades a student receives in their first year, and let $y$ be the number of "A" grades they get in their second year, which we would like to predict.

Recall that in linear regression, our hypothesis is $h_\theta(x) = \theta_0 + \theta_1 x$, and we use $m$ to denote the number of training examples.

Size in feet^2 (x)    Price ($) in 1000's (y)
2104                  460
1416                  232
1534                  315
852                   178
…                     …

$h_\theta(x) = \theta_0 + \theta_1 x$
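For illustration, the four table rows can be fitted directly; here `np.polyfit` solves the least-squares problem in closed form (a shortcut, not the gradient descent procedure from the slides).

```python
import numpy as np

x = np.array([2104.0, 1416.0, 1534.0, 852.0])  # size in feet^2
y = np.array([460.0, 232.0, 315.0, 178.0])     # price in $1000's

theta1, theta0 = np.polyfit(x, y, 1)           # degree-1 fit: [slope, intercept]
print(f"theta0 = {theta0:.1f}, theta1 = {theta1:.3f}")
print(f"predicted price for 1500 ft^2: {theta0 + theta1 * 1500:.0f} ($1000's)")
```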
Linear Regression with multiple variables: multivariate

Size in feet^2 (x)    Price ($) in 1000's (y)
2104                  460
1416                  232
1534                  315
852                   178
…                     …

With one variable: $h_\theta(x) = \theta_0 + \theta_1 x$. With $n$ features, using the convention $x_0 = 1$:

$$h_\theta(x) = \begin{bmatrix} \theta_0 & \theta_1 & \theta_2 & \cdots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \Theta^T x$$
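A minimal NumPy sketch of the vectorized hypothesis $\Theta^T x$; the parameter values and the second feature (number of bedrooms) are illustrative assumptions, not from the lecture.

```python
import numpy as np

theta = np.array([50.0, 0.1, 20.0])  # [theta_0, theta_1, theta_2], made-up values
x = np.array([1.0, 2104.0, 3.0])     # [x_0 = 1, size in feet^2, bedrooms (assumed)]

print(theta @ x)  # theta^T x = 50 + 0.1*2104 + 20*3 = 320.4
```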
Multivariate Linear Regression: gradient descent
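The slide body did not survive extraction; for reference, the standard multivariate gradient descent update (with the convention $x_0^{(i)} = 1$) is:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \qquad j = 0, 1, \ldots, n$$

with all parameters updated simultaneously.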
Classification: model construction

Training Data → Classification Algorithms → Classifier (Model)

NAME     RANK            YEARS   TENURED
Mike     Assistant Prof  3       no
Mary     Assistant Prof  7       yes
Bill     Professor       2       yes
Jim      Associate Prof  7       yes
Dave     Assistant Prof  6       no
Anne     Associate Prof  3       no

Learned rule:
IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
(Summary) Learning – Model Usage

Model usage: the test set is used to see how well the classifier works for classifying future or unknown objects.

Testing Data → Classifier (Model)

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen data: (Jeff, Professor, 4) → Tenured? By the learned rule (rank = 'professor'), the model predicts 'yes'.
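As a minimal sketch, the learned rule can be written as a one-line classifier and applied to the unseen example:

```python
def tenured(rank: str, years: int) -> bool:
    """The rule from the model-construction slide:
    IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return rank == "professor" or years > 6

# Unseen data (Jeff, Professor, 4):
print(tenured("professor", 4))  # True -> tenured = 'yes'
```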
Thank You