
Machine Learning

ABDELA AHMED, PhD


© University of Gondar, 2022
[email protected]
Contents…

2. Classic Machine Learning

 Linear regression
 Logistic regression
 K-nearest neighbor (KNN)
 Decision trees
 Data preprocessing and representations
 Evaluation methods
CHAPTER 2: Classic Machine Learning: Regression
What is ML?
 Arthur Samuel (1959). Machine learning:
 a field of study that gives computers the ability to learn
without being explicitly programmed.
 Tom Mitchell (1998): Well-posed Learning Problem:
 "A computer program is said to learn from experience E with
respect to some task T and some performance measure P, if
its performance on T, as measured by P, improves with
experience E."
 Suppose your email program watches which emails you do
or do not mark as spam, and based on that learns how to
better filter spam.
 What is the task T in this setting?
 Classifying emails as spam or not spam
 What is the experience E in this setting?
 Watching you label emails as spam or not spam
 What is the performance measure P in this setting?
 The number of emails correctly classified as spam or not spam
Types of ML

 Supervised
 Has a target column
 Data points have a known outcome
 Unsupervised
 Doesn't have a target column
 Data points have an unknown outcome
 Semi-supervised
 An intermediate learning setting
 The model learns from both labeled and unlabeled data
Supervised Learning Types

[Figure: types of supervised learning]
Supervised Learning:
classification vs regression

 Suppose you are working on weather prediction, and
your weather station makes one of three predictions
for each day's weather: Sunny, Cloudy or Rainy. You'd
like to use a learning algorithm to predict tomorrow's
weather.
 Would you treat this as a classification or a
regression problem?
 Multiclass classification
 Given data about the size of houses and number of
bedrooms on the real estate market, try to predict their price.
 Would you treat price as a function of size and
#bedrooms as a classification or a regression problem?
 Regression
Learning – a two-step process
 Model: a learning algorithm
 A model is a small thing that captures a larger thing.
 A good model omits unimportant details while
retaining what's important.
 But we need to do so in a way that preserves the features and
relationships that we're interested in.

 Model construction
 A training set is used to create the model.
 The model is represented as classification rules, decision
trees, or a mathematical formula.

 Model usage
 The test set is used to see how well the model works for classifying
future or unknown objects.
How ML works:
classic ML process

 The general learning approach: First, the dataset needs to be transformed into a
representation, most often a list of vectors, which can be used by the learning
algorithm. The learning algorithm chooses a model and efficiently searches for the
model's parameters.
 After we have selected a model that has been fitted on the training dataset, we can
use the test dataset to estimate how well it performs on this unseen data, i.e. to estimate
the so-called generalization error.
How ML works:
classic ML process

 From a technical perspective, every classic machine learning problem is composed of
several key choices to be made in a standard pipeline of five steps.

 Step 1: Feature extraction
 Derive compact and uncorrelated features to represent the raw data.
 Step 2: Choose a proper model (hypothesis function)
 Based on the nature of the given problem, choose a good machine learning
model from candidates such as:
 Linear models
 Logistic sigmoid, softmax
 Nonlinear kernels
 Decision trees
 This constrains the function family to be learned from.
 Step 3: Choose a learning criterion (cost/objective function)
 Choose an appropriate learning criterion from candidates such as:
 Mean squared error
 Minimum classification error
 Minimum cross-entropy
 The chosen criterion forms an objective function of the model parameters:
it measures how well the selected model fits the training data as a
function of the unknown model parameters.
How ML works:
classic ML process

 Step 4: Choose an optimization algorithm
 Considering the characteristics of the derived objective function, use an
appropriate optimization algorithm to learn the model parameters:
 Grid search, gradient descent, SGD, Adam, RMSProp
 Once the objective function is determined, machine learning is turned into a
standard optimization problem, where the objective function is
maximized or minimized with respect to the unknown model parameters.
 This estimates the parameters of the hypothesis function.

 Step 5: Perform empirical evaluation
 Use held-out data to empirically evaluate the performance of the learned models.
 In practice, the performance of the learned models can always be empirically
evaluated on a held-out data set that is not used anywhere in the earlier
steps.
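As a concrete illustration of the five steps, here is a minimal sketch using scikit-learn; the synthetic dataset, coefficients, and hyperparameters are illustrative assumptions, not part of the lecture.

```python
# A minimal sketch of the five-step pipeline (synthetic, illustrative data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Step 1: feature extraction -- here the raw data is already a feature vector.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))               # e.g. a single input feature
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 1, 100)   # noisy linear target (assumed)

# Step 5 (setup): hold out data that is not used in any earlier step.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 2: choose a model (a linear model). Steps 3 and 4 -- the learning
# criterion (mean squared error) and the optimizer (a least-squares solve) --
# are built into LinearRegression.
model = LinearRegression()
model.fit(X_train, y_train)

# Step 5: empirical evaluation on the held-out set.
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```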
Learning: model representations

 To describe the supervised learning problem slightly more
formally, our goal is, given a training set, to learn a hypothesis
function
h : X → Y
so that h(x) is a "good" predictor for the corresponding value of y.

Training set → Learning Algorithm → hypothesis h
x (size of house) → h → ŷ (estimated price)

 Hypothesis: ŷ = hθ(x) = θ0 + θ1x   (shorthand: h(x))
How Does ML Work: ML framework

yp = f(θ, x)
 x: the input
 yp: the output (values predicted by the model)
 θ: the parameters of the model (one or more variables)

 Training: given a training set of labeled examples {(x1, y1), …,
(xn, yn)}, estimate the prediction function f by minimizing the
prediction error on the training set.
 Testing: apply f to a never-before-seen test example x and output
the predicted value y = f(x).
Linear regression
 (Univariate) linear regression
 involves finding the "best" line to fit two attributes (or
variables) so that one attribute can be used to predict the
other:
hθ(x) = θ0 + θ1x
 Example: predict the box-office revenue of a movie using
its marketing budget only.
Linear regression
 Multivariate linear regression
 an extension of linear regression in which more than two attributes are
involved and the data are fit to a multidimensional surface.
 With the fitted line hθ(x) = θ0 + θ1x we can predict what the
box-office revenue will be for a new movie.
Linear regression: house pricing predictions

[Scatter plot: house price ($ in 1000s, 100–400) against size in feet² (500–2500)]
Linear regression: notations

Size in feet² (x) | Price ($) in 1000s (y)
2104              | 460
1416              | 232
1534              | 315
852               | 178
…                 | …
(m = 47)

 Notation:
 m = number of training examples
 x = input variable / features
 y = output variable / target variable
 (x, y) = one training example
 (x⁽ⁱ⁾, y⁽ⁱ⁾) = the i-th training example
 Examples:
 x⁽¹⁾ = 2104, x⁽²⁾ = 1416, y⁽¹⁾ = 460
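To make the notation concrete, a tiny sketch (not from the slides) holding the table above as parallel arrays:

```python
# The training set from the table as parallel arrays.
x = [2104, 1416, 1534, 852]   # size in feet^2
y = [460, 232, 315, 178]      # price in $1000s

print(x[0], y[0])  # x^(1) = 2104, y^(1) = 460 (the slides index from 1)
```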
Linear regression
 Model representation
 Cost function
 Gradient descent
 Gradient descent for linear regression
Linear regression: model representations

Training set → Learning Algorithm → hypothesis h
x (size of house) → h → ŷ (estimated price)

[Scatter plot: price ($ in 1000s) against size in feet²]

 Univariate linear regression
 Hypothesis: ŷ = hθ(x) = θ0 + θ1x   (shorthand: h(x))
 Parameters of the model: θ0 and θ1
Linear regression: model representations

 hθ(x) = θ0 + θ1x
 With different choices of the parameters θ0 and θ1,
we get different hypothesis functions:
 θ0 = 1.5, θ1 = 0 → a horizontal line at height 1.5
 θ0 = 0, θ1 = 0.5 → a line through the origin with slope 0.5
 θ0 = 1, θ1 = 0.5 → a line with intercept 1 and slope 0.5
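As a quick check, a small sketch (not from the slides) that evaluates these three hypotheses at x = 1, 2, 3:

```python
# Evaluating h_theta(x) = theta0 + theta1 * x for the three parameter choices.
def h(theta0, theta1, x):
    return theta0 + theta1 * x

for theta0, theta1 in [(1.5, 0), (0, 0.5), (1, 0.5)]:
    print(theta0, theta1, [h(theta0, theta1, x) for x in (1, 2, 3)])
# (1.5, 0)  -> [1.5, 1.5, 1.5]   horizontal line
# (0, 0.5)  -> [0.5, 1.0, 1.5]   line through the origin
# (1, 0.5)  -> [1.5, 2.0, 2.5]   intercept 1, slope 0.5
```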
Linear regression: model representations

 In linear regression, what we want to do is come up
with values for the parameters θ0 and θ1 so that the
straight line fits the data well.
 Idea: choose θ0 and θ1 so that hθ(x) is close to y
for the training examples (x, y).
Linear regression
 Model representation
 Objective function
 Optimization algorithm (parameter learning)
 Gradient descent for linear regression
Linear regression: cost function

 Idea: choose θ0, θ1 so that hθ(x) is close to y
for the training examples (x, y):

minimize over θ0, θ1:  (1/2m) · Σᵢ₌₁…ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

where hθ(x⁽ⁱ⁾) = θ0 + θ1x⁽ⁱ⁾.

 This defines the squared-error cost function

J(θ0, θ1) = (1/2m) · Σᵢ₌₁…ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

and the goal is to minimize J(θ0, θ1) over θ0, θ1.
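A minimal sketch (not from the slides) implementing J; the function name is illustrative.

```python
# Squared-error cost J(theta0, theta1) = (1 / 2m) * sum((h(x_i) - y_i)^2).
import numpy as np

def compute_cost(theta0, theta1, x, y):
    m = len(x)
    predictions = theta0 + theta1 * np.asarray(x, dtype=float)
    return np.sum((predictions - np.asarray(y, dtype=float)) ** 2) / (2 * m)
```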
Linear regression: cost function example

 Let the training set be (1,1), (2,2), (3,3).
 Fixing θ0 = 0 (so hθ(x) = θ1x), calculate the cost function J(θ1)
for the following θ1 values: 0.5, 1, 1.5, 0, −0.5.
Linear regression: cost function example

 For the training set (1,1), (2,2), (3,3) with θ0 = 0:

θ1   | J(θ1)
1.5  | 0.583
1    | 0
0.5  | 0.583
0    | 2.33
−0.5 | 5.25
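The table can be reproduced with the compute_cost sketch above:

```python
# Reproducing the J(theta1) table, using compute_cost defined earlier.
x, y = [1, 2, 3], [1, 2, 3]
for theta1 in (1.5, 1, 0.5, 0, -0.5):
    print(theta1, round(compute_cost(0.0, theta1, x, y), 3))
# 1.5 -> 0.583, 1 -> 0.0, 0.5 -> 0.583, 0 -> 2.333, -0.5 -> 5.25
```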
Linear regression

 Model representation (hypothesis function)
 Objective function (cost function)
 Optimization algorithm (parameter learning)
 Gradient descent for linear regression
Linear regression: gradient descent

 So we have our hypothesis function, and we have a way of
measuring how well it fits the data. Now we need to
estimate the parameters in the hypothesis function. That's
where gradient descent comes in.
Linear regression: gradient descent

 Imagine that we graph the cost function based on the parameters
θ0 and θ1.
 The points on our graph will be the result of the cost function
using our hypothesis with those specific theta parameters.
Linear regression: gradient descent

 We will know that we have succeeded when our cost function is at
the very bottom of the pits in our graph, i.e. when its value is the
minimum.
Linear regression: gradient descent

 We make steps down the cost function in the direction of
steepest descent. The size of each step is determined by the
parameter α, which is called the learning rate.
Linear regression: gradient descent

 Depending on where one starts on the graph, one could end up at
different points: two different starting points can end up in
two different local minima.
Linear regression: gradient descent algorithm

repeat until convergence {
    θj := θj − α · ∂J(θ0, θ1)/∂θj    (simultaneously for j = 0 and j = 1)
}
Linear regression: gradient descent algorithm intuition

 The intuition behind the convergence of gradient
descent is that the derivative term ∂J(θ1)/∂θ1 approaches 0 as we approach
the bottom of our convex function.
 At the minimum, the derivative will always be 0, and
thus we get
θ1 := θ1 − α · 0 = θ1
Linear regression: gradient descent algorithm intuition

 Suppose θ1 is at a local optimum of J(θ1). What will one step of
gradient descent do?
 Since the derivative is 0 at a local optimum, the step leaves θ1
unchanged: gradient descent can converge to a local minimum and
remain there.
Linear regression: gradient descent for linear regression

 We want to apply gradient descent to minimize the squared-error
cost function:
minimize J(θ0, θ1) over θ0, θ1
 Plugging hθ(x) = θ0 + θ1x into the derivative terms gives the update rules
(applied simultaneously):
θ0 := θ0 − α · (1/m) · Σᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)
θ1 := θ1 − α · (1/m) · Σᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x⁽ⁱ⁾
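A minimal sketch (illustrative, not from the slides) of batch gradient descent for univariate linear regression; the learning rate and iteration count are assumed values.

```python
# Batch gradient descent for h_theta(x) = theta0 + theta1 * x.
import numpy as np

def gradient_descent(x, y, alpha=0.1, n_iters=1000):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        error = theta0 + theta1 * x - y      # h_theta(x^(i)) - y^(i)
        grad0 = error.sum() / m              # dJ/dtheta0
        grad1 = (error * x).sum() / m        # dJ/dtheta1
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# On the toy set (1,1), (2,2), (3,3) this converges toward theta0 = 0, theta1 = 1.
print(gradient_descent([1, 2, 3], [1, 2, 3]))
```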
Linear regression: gradient descent for linear regression

 The point of all this is that if we start with a guess for our
hypothesis and then repeatedly apply these gradient descent
equations, our hypothesis will become more and more
accurate.
Univariate Linear Regression: Example
 Consider the problem of predicting how well a student does in
their second year of university, given how well they did in their
first year.
 Specifically, let x be equal to the number of "A" grades that a
student receives in their first year of university. We would like
to predict the value of y, which we define as the number of "A"
grades they get in their second year.
 Recall that in linear regression, our hypothesis is hθ(x) = θ0 + θ1x,
and we use m to denote the number of training examples.

[Training-set table from the slide]

 For the training set given above, what is the value of m?
 What is J(0, 1)?
 0.5
 Suppose we set θ0 = −1, θ1 = 0.5. What is hθ(4)? J(θ0, θ1)?
 1 and 3.094
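The first answer can be verified directly from the hypothesis, hθ(4) = θ0 + θ1 · 4:

```python
# Checking h_theta(4) for theta0 = -1, theta1 = 0.5.
theta0, theta1 = -1.0, 0.5
print(theta0 + theta1 * 4)  # 1.0
```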
Linear Regression with multiple variables:
multivariate

Size in feet² (x) | Price ($) in 1000s (y)
2104              | 460
1416              | 232
1534              | 315
852               | 178
…                 | …

 With a single feature: hθ(x) = θ0 + θ1x
 With multiple features (e.g. size, number of bedrooms, …):
hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
Multivariate Linear Regression: hypothesis

 hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + … + θnxn
 If we define x0 = 1, then hθ(x) = θ0x0 + θ1x1 + θ2x2 + … + θnxn

 x = [x0, x1, x2, …, xn]ᵀ,  θ = [θ0, θ1, θ2, …, θn]ᵀ

 hθ(x) = θᵀx
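A small sketch (feature and parameter values assumed for illustration) of computing hθ(x) = θᵀx with NumPy:

```python
# Vectorized hypothesis h_theta(x) = theta^T x.
import numpy as np

theta = np.array([0.5, 3.0, 2.0])    # [theta0, theta1, theta2], assumed values
x_raw = np.array([1416.0, 3.0])      # raw features, e.g. size and #bedrooms
x = np.concatenate(([1.0], x_raw))   # prepend x0 = 1 for the intercept term

print(theta @ x)                     # theta^T x
```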
Multivariate Linear Regression: gradient descent

 The gradient descent equation
itself is generally the same
form; we just have to repeat it
for all n + 1 parameters:

θj := θj − α · (1/m) · Σᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xj⁽ⁱ⁾   (for j = 0, 1, …, n, with x0⁽ⁱ⁾ = 1)
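A vectorized sketch (illustrative) of one such simultaneous update, assuming X already includes the x0 = 1 column:

```python
# One simultaneous gradient descent update over all n + 1 parameters.
import numpy as np

def gradient_step(theta, X, y, alpha):
    m = len(y)
    error = X @ theta - y                     # h_theta(x^(i)) - y^(i) for all i
    return theta - alpha * (X.T @ error) / m  # theta_j -= alpha*(1/m)*sum(error * x_j)
```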
(Summary) Learning – Model Construction

 A training set is used to create the model.
 The model is represented as classification rules, decision
trees, or a mathematical formula.

Training Data → Classification Algorithm → Classifier (Model)

Size (m²) | #bedrooms | Price (in million)
250       | 2         | 2
500       | 3         | 2.5
750       | 3         | 3.5
1000      | 5         | 5

Model: Price = a·size + b·#bedrooms
(Summary) Learning – Model Construction

 A training set is used to create the model.
 The model is represented as classification rules, decision
trees, or a mathematical formula.

Training Data → Classification Algorithm → Classifier (Model)

NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof | 3     | no
Mary | Assistant Prof | 7     | yes
Bill | Professor      | 2     | yes
Jim  | Associate Prof | 7     | yes
Dave | Assistant Prof | 6     | no
Anne | Associate Prof | 3     | no

Model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
(Summary) Learning – Model Usage

 Model usage
 The test set is used to see how well the model works for classifying
future or unknown objects.

Testing Data → Classifier (Model) → prediction for unseen data

NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes

Unseen data: (Jeff, Professor, 4) → Tenured?
Thank You
