Machine Learning
ABDELA AHMED, PhD
© University of Gondar, 2022
[email protected]

Contents
2. Classic Machine Learning
Linear regression
Logistic regression
K-nearest neighbor (KNN)
Decision Tree
Data preprocessing and representations
Evaluation methods
CHAPTER 2: Classic Machine Learning: Regression
What is ML?
Arthur Samuel (1959): Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed.
Tom Mitchell (1998): Well-posed learning problem:
"A computer program is said to learn from experience E with
respect to some task T and some performance measure P, if
its performance on T, as measured by P, improves with
experience E."
Suppose your email program watches which emails you do
or do not mark as spam, and based on that learns how to
better filter spam.
What is the task T in this setting?
Classifying emails as spam or not spam
What is the experience E in this setting?
Watching you label emails as spam or not spam
What is the performance measure P in this setting?
The number of emails correctly classified as spam or not spam
Types of ML
Supervised: has a target column; data points have a known outcome.
Unsupervised: doesn't have a target column; data points have an unknown outcome.
Semi-supervised: an intermediate approach; the model learns from both labeled and unlabeled data.
Supervised Learning Types
Supervised Learning:
classification vs regression
Suppose you are working on weather prediction, and
your weather station makes one of three predictions
for each day's weather: Sunny, Cloudy or Rainy. You'd
like to use a learning algorithm to predict tomorrow's
weather.
Would you treat this as a classification or a
regression problem?
multiclass classification
Given data about the size of houses and number of bedrooms on the real estate market, try to predict their price.
Would you treat price as a function of size and #bedrooms as a classification or a regression problem?
regression
Learning – a two step process
Model: the output of a learning algorithm
A model is a small thing that captures a larger thing.
A good model omits unimportant details while retaining what's important,
but it must do so in a way that preserves the features or relationships that we're interested in.
Model Construction
A training set is used to create the model.
The model is represented as classification rules, decision trees, or mathematical formulas
Model Usage
The test set is used to see how well the model works for classifying future or unknown objects
How ML works:
classic ML process
The general learning approach: first, the dataset needs to be transformed into a representation, most often a list of vectors, which can be used by the learning algorithm. The learning algorithm chooses a model and efficiently searches for the model's parameters.
After we have selected a model that has been fitted on the training dataset, we can use the test dataset to estimate how well it performs on this unseen data, i.e. to estimate the so-called generalization error.
How ML works:
classic ML process
From a technical perspective, every classic machine learning problem is composed of
several key choices to be made in a standard pipeline of five steps.
Step 1: Feature Extraction
Derive compact and uncorrelated features to represent raw data.
Step 2: Choose a proper model (hypothesis function)
Based on the nature of the given problem, choose a good machine learning
model from the candidates listed below
Linear models
Logistic sigmoid, softmax
Nonlinear kernels
Decision trees
The chosen model constrains the family of functions that can be learned.
Step 3: Choose a learning criterion (cost/objective function)
Choose an appropriate learning criterion from the candidates listed below, which forms an objective function of the model parameters.
Mean squared error
Minimum classification error
Minimum cross-entropy
The chosen criterion measures how well the selected model fits the training data, as a function of the unknown model parameters (two of these criteria are sketched in code below).
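As a sketch, two of these criteria written out in Python with NumPy (the function names are mine, not from the slides):

import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average squared difference between targets and predictions
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, p_pred, eps=1e-12):
    # Binary cross-entropy: y_true in {0, 1}, p_pred = predicted P(y = 1)
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))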
How ML works:
classic ML process
Step 4: Choose an optimization algorithm
Considering the characteristics of the derived objective function, use an appropriate optimization algorithm from the list below to learn the model parameters.
Grid search, gradient descent, SGD, Adam, RMSProp
Once the objective functions are determined, machine learning is turned into a
standard optimization problem, where the objective function needs to be
maximized or minimized with respect to the unknown model parameters.
Estimate the parameters in the hypothesis function
Step 5: Perform empirical evaluation
Use held-out data to empirically evaluate the performance of learned models
In practice, the performance of the learned models can always be empirically
evaluated based on a held-out data set that is not used anywhere in the earlier
steps
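As a minimal end-to-end sketch of these five steps, assuming Python with scikit-learn (the toy rows reuse the house-price table from the summary slides at the end of this chapter):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Step 1: feature extraction -- here, toy features [size, #bedrooms]
X = np.array([[250, 2], [500, 3], [750, 3], [1000, 5]], dtype=float)
y = np.array([2.0, 2.5, 3.5, 5.0])  # price in millions

# Hold out data now so step 5 uses data untouched by the earlier steps
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 2: choose a model -- a linear model
model = LinearRegression()

# Steps 3-4: LinearRegression minimizes mean squared error with a
# built-in least-squares solver (criterion and optimizer fixed by the class)
model.fit(X_train, y_train)

# Step 5: empirical evaluation on the held-out set
print("held-out MSE:", mean_squared_error(y_test, model.predict(X_test)))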
Learning: model representations
To describe the supervised learning problem slightly more
formally, our goal is, given a training set, to learn a hypothesis
function
h:X→Y
so that h(x) is a “good” predictor for the corresponding value of y
Training set → Learning Algorithm → Hypothesis h
x (size of house) → h → y (estimated price)
y = hθ(x) = θ0 + θ1x    (shorthand: h(x))
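As a minimal sketch in Python (the parameter values below are hypothetical, chosen only for illustration):

def h(theta0, theta1, x):
    # Univariate hypothesis: h_theta(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

# Hypothetical parameters: estimated price (in $1000's) of a 2104 ft^2 house
print(h(50.0, 0.1, 2104))  # 260.4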
How Does ML Work: ML framework
yp = f(Θ, x)
x: the input
yp: the output (the values predicted by the model)
Θ: the parameters of the model (one or more variables)
Training: given a training set of labeled examples {(x1,y1), …,
(xn,yn)}, estimate the prediction function f by minimizing the
prediction error on the training set
Testing: apply f to a never-before-seen test example x and output the predicted value y = f(x)
Linear regression
(Univariate) Linear regression
involves finding the “best” line to fit two attributes (or
variables) so that one attribute can be used to predict the
other.
hθ(x) = θ0 + θ1x
Example: predict the box office revenue of a movie using the marketing budget only; with this fitted line we can predict what the revenue will be for a new movie.
Multivariate linear regression
an extension of linear regression, where more than two attributes are involved
and the data are fit to a multidimensional surface.
Linear regression: house pricing predictions
[scatter plot: house size in feet² (x-axis, 500–2500) vs. price ($) in 1000's (y-axis, 100–400)]
Linear regression: notations
Size in feet² (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852  | 178
...  | ...
(m = 47 rows in total)

Notation:
m = number of training examples
x = input variable / feature
y = output variable / target variable
(x, y) = one training example
(x^(i), y^(i)) = the i-th training example

Examples: x^(1) = 2104, x^(2) = 1416, y^(1) = 460
Linear regression
Model representation
Cost function
Gradient descent
Gradient descent for linear regression
Linear regression: model representations
Training set → Learning Algorithm → Hypothesis h
x (size of house) → h → y (estimated price)
[scatter plot: size in feet² vs. price ($) in 1000's]
Univariate linear regression:
y = hθ(x) = θ0 + θ1x    (shorthand: h(x))
Parameters of the model: θ0 and θ1
Linear regression: model representations
hθ(x) = θ0 + θ1x
With different choices of the parameters θ0 and θ1, we get different hypothesis functions:
θ0 = 1.5, θ1 = 0  → a horizontal line at height 1.5
θ0 = 0, θ1 = 0.5  → a line through the origin with slope 0.5
θ0 = 1, θ1 = 0.5  → a line with intercept 1 and slope 0.5
Linear regression: model representations
[scatter plot: size in feet² vs. price ($) in 1000's, with a candidate fitted line]
In linear regression, what we want to do is come up with values for the parameters θ0 and θ1 so that the straight line fits the data well.
Idea: choose θ0 and θ1 so that hθ(x) is close to y for the training examples (x, y).
Linear regression
Model representation
Objective function
Optimization algorithm (parameter learning)
Gradient descent for linear regression
Linear regression: cost function
Idea: choose θ0 and θ1 so that hθ(x) is close to y for the training examples (x, y), where hθ(x^(i)) = θ0 + θ1x^(i).

Squared-error cost function:
J(θ0, θ1) = (1/2m) · Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²

Goal: minimize J(θ0, θ1) over θ0 and θ1.
[scatter plot: size in feet² (x) vs. price ($) in 1000's (y)]
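A minimal sketch of this cost function in Python with NumPy (the names are mine):

import numpy as np

def cost(theta0, theta1, x, y):
    # Squared-error cost J(theta0, theta1) = (1/2m) * sum((h(x_i) - y_i)^2)
    m = len(x)
    predictions = theta0 + theta1 * x   # h_theta(x^(i)) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)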
Linear regression: cost function example
Let the training set be (1,1), (2,2), (3,3)
Calculate the cost function J(θ1) for the following θ1 values (with θ0 fixed at 0):
0.5, 1, 1.5, 0, -0.5
Linear regression: cost function example
Let the training set be (1,1), (2,2), (3,3), and fix θ0 = 0.

θ1   | J(θ1)
1.5  | 0.583
1    | 0
0.5  | 0.583
0    | 2.33
-0.5 | 5.25
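The table can be checked with a few lines of Python, reusing the cost function sketched above:

import numpy as np

x = np.array([1.0, 2.0, 3.0])   # training set (1,1), (2,2), (3,3)
y = np.array([1.0, 2.0, 3.0])

for theta1 in (1.5, 1.0, 0.5, 0.0, -0.5):
    J = np.sum((theta1 * x - y) ** 2) / (2 * len(x))  # theta0 fixed at 0
    print(f"theta1 = {theta1:5.2f}  ->  J = {J:.3f}")
# 0.583, 0.000, 0.583, 2.333, 5.250 -- matching the table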
Linear regression
Model representation (hypothesis function)
Objective function (cost function)
Optimization algorithm (parameter learning)
Gradient descent for linear regression
Linear regression: gradient descent
So we have our hypothesis function, and we have a way of measuring how well it fits the data. Now we need to estimate the parameters in the hypothesis function. That's where gradient descent comes in.
Linear regression: gradient descent
Imagine that we graph the cost function based on the parameters θ0 and θ1. The points on our graph are the values of the cost function using our hypothesis with those specific theta parameters.
Linear regression: gradient descent
We will know that we have succeeded when our cost function is at
the very bottom of the pits in our graph, i.e. when its value is the
minimum.
Linear regression: gradient descent
We make steps down the cost function in the direction with the
steepest descent. The size of each step is determined by the
parameter α, which is called the learning rate.
Linear regression: gradient descent
Depending on where one starts on the graph, one could end up at different points: two different starting points can end up in two different local minima.
Linear regression: gradient descent algorithm
Repeat until convergence:
θj := θj − α · ∂J(θ0, θ1)/∂θj    (simultaneously for j = 0 and j = 1)
α is the learning rate.
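As a sketch of one update step in Python, with the derivative estimated numerically so it works for any differentiable J (the helper names are mine):

def gradient_step(theta, alpha, J, eps=1e-6):
    # One update: theta := theta - alpha * dJ/dtheta (central-difference derivative)
    dJ = (J(theta + eps) - J(theta - eps)) / (2 * eps)
    return theta - alpha * dJ

# Example: J(theta) = (theta - 3)^2 has its minimum at theta = 3
theta = 0.0
for _ in range(100):
    theta = gradient_step(theta, alpha=0.1, J=lambda t: (t - 3.0) ** 2)
print(theta)  # approaches 3.0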
Linear regression: gradient descent algorithm intuition
The intuition behind the convergence of gradient descent is that the derivative ∂J(θ1)/∂θ1 approaches 0 as we approach the bottom of our convex function.
At the minimum the derivative is exactly 0, and thus we get
θ1 := θ1 − α·0, i.e. θ1 stops changing.
Linear regression: gradient descent algorithm intuition
Suppose θ1 is already at a local optimum of J(θ1). What will one step of gradient descent do?
It will leave θ1 unchanged, since the derivative there is 0: gradient descent can converge to a local minimum and stay there.
Linear regression: gradient descent for linear regression
We want to apply gradient descent to minimize the squared
error cost function
minimize J(θ0, θ1) over θ0 and θ1
Working out the partial derivatives of this J gives the update rules:
θ0 := θ0 − α · (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))
θ1 := θ1 − α · (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) · x^(i)
Linear regression: gradient descent for linear regression
The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these gradient descent equations, our hypothesis will become more and more accurate.
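A minimal sketch of batch gradient descent for univariate linear regression, implementing the update rules above with NumPy (names and default hyperparameters are mine):

import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=1000):
    theta0, theta1 = 0.0, 0.0
    m = len(x)
    for _ in range(iterations):
        error = theta0 + theta1 * x - y       # h_theta(x^(i)) - y^(i)
        grad0 = error.sum() / m               # compute both gradients first,
        grad1 = (error * x).sum() / m         # then update simultaneously
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

print(gradient_descent(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 3.0])))
# approaches (0.0, 1.0): the line y = x fits the toy set exactly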
Univariate Linear Regression: Example
Consider the problem of predicting how well a student does in
her second year of university, given how well she did in her
first year.
Specifically, let x be equal to the number of "A" grades that a
student receives in their first year of university. We would like
to predict the value of y, which we define as the number of "A"
grades they get in their second year.
Recall that in linear regression, our hypothesis is hθ(x) =θ0+θ1x,
and we use m to denote the number of training examples.
For the training set given above (figure not reproduced), what is the value of m?
What is J(0,1)?
0.5
Suppose we set θ0 = −1, θ1 = 0.5. What is hθ(4)? And what is J(θ0, θ1)?
hθ(4) = −1 + 0.5·4 = 1, and J(θ0, θ1) = 3.094
Linear Regression with multiple variables:
multivariate
Size in feet² (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852  | 178
...  | ...
hθ(x) = θ0 + θ1x
With more than one input attribute (e.g. size and #bedrooms), the hypothesis gains one term per feature:
hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
Multivariate Linear Regression: hypothesis
hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + … + θnxn
If we define x0 = 1, then hθ(x) = θ0x0 + θ1x1 + θ2x2 + … + θnxn
In vector form, with x = [x0, x1, x2, …, xn]ᵀ and Θ = [θ0, θ1, θ2, …, θn]ᵀ:
hθ(x) = Θᵀx
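A sketch of the vectorized hypothesis in Python with NumPy (the parameter values are hypothetical):

import numpy as np

def h(theta, x):
    # Vectorized hypothesis h_theta(x) = Theta^T x, with x0 = 1 prepended
    x = np.concatenate(([1.0], x))
    return theta @ x

theta = np.array([80.0, 0.1, 25.0])        # hypothetical theta0, theta1, theta2
print(h(theta, np.array([2104.0, 3.0])))   # prediction for a 2104 ft^2, 3-bedroom house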
Multivariate Linear Regression: gradient descent
The gradient descent update itself keeps the same form; we just repeat it for all n features:
Repeat until convergence:
θj := θj − α · (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) · xj^(i)    (simultaneously for j = 0, 1, …, n)
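In vectorized form the same update becomes one line; a sketch with NumPy, assuming a design matrix X whose first column is all ones (x0 = 1):

import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=1000):
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(iterations):
        gradient = X.T @ (X @ theta - y) / m   # (1/m) * error times feature, all j at once
        theta -= alpha * gradient              # simultaneous update of every theta_j
    return theta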
(Summary) Learning – Model Construction
A training set is used to create the model.
The model is represented as classification rules, decision trees, or mathematical formulas
Training Data → Classification Algorithm → Classifier (Model)

Size (m²) | #bedrooms | Price (in millions)
250  | 2 | 2
500  | 3 | 2.5
750  | 3 | 3.5
1000 | 5 | 5

Learned model: Price = a·size + b·#bedrooms
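As a sketch, the coefficients a and b can be recovered from the table by least squares (NumPy; no intercept term, matching the model as written):

import numpy as np

X = np.array([[250, 2], [500, 3], [750, 3], [1000, 5]], dtype=float)  # size, #bedrooms
y = np.array([2.0, 2.5, 3.5, 5.0])                                    # price in millions

(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"Price ≈ {a:.4f}*size + {b:.4f}*#bedrooms")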
(Summary) Learning – Model Construction
A training set is used to create the model.
The model is represented as classification rules, decision trees, or mathematical formulas
Training Data → Classification Algorithm → Classifier (Model)

NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor      | 2 | yes
Jim  | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no

Learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
(Summary) Learning – Model Usage
Model usage
The test set is used to see how well the model works for classifying future or unknown objects
Testing Data → Classifier (Model) → Prediction

NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George  | Professor      | 5 | yes
Joseph  | Assistant Prof | 7 | yes

Unseen data: (Jeff, Professor, 4) → Tenured?
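As a sketch, the learned rule can be applied as a tiny classifier in Python (the function name is mine):

def tenured(rank: str, years: int) -> bool:
    # Learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return rank.lower() == "professor" or years > 6

# Apply the model to the unseen example (Jeff, Professor, 4):
print(tenured("Professor", 4))  # True -> predict tenured = 'yes'

On the four test rows above, the rule gets 3 of 4 right (it mispredicts Merlisa); measuring exactly this kind of error on held-out data is the point of model usage.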
Thank You