Introduction to
Machine Learning
Dr. Muhammad Amjad Iqbal
Associate Professor
University of Central Punjab, Lahore.
[email protected]
Slides adapted from Prof. Dr. Andrew Ng (Stanford) and Dr. Humayoun
Lecture 2:
Supervised Learning
Linear regression with one variable
Reading:
• Chapter 17, “Bayesian Reasoning and Machine Learning”, pages 345–348
• Chapter 3, “Pattern Recognition and Machine Learning”, Christopher M. Bishop, page 137
• Chapter 11, “Data Mining: A Knowledge Discovery Approach”, from page 346
• Chapter 18, “Artificial Intelligence: A Modern Approach”, from page 718
Model representation
[Figure: Housing Prices (Portland, OR) — price (in 1000s of dollars) vs. size (feet²), with sizes from 0 to 3000 ft² and prices from 0 to 500.]
Supervised Learning: we are given the “right answer” for each example in the data.
Regression problem: predict a real-valued output.
Classification problem: predict a discrete-valued output.
Training set of housing prices (Portland, OR):

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …

Notation:
m = number of training examples
x's = “input” variable / features
y's = “output” variable / “target” variable
(x, y) – one training example
(x(i), y(i)) – the i-th training example, where i is an index into the training set
For example: x(1) = 2104, x(3) = 1534, y(2) = 232, y(4) = 178
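A minimal sketch of this training set and notation in Python (NumPy assumed; note that the slides' 1-based index i becomes a 0-based index in code):

```python
import numpy as np

# Training set: size in feet^2 (x) and price in $1000's (y)
x = np.array([2104, 1416, 1534, 852])
y = np.array([460, 232, 315, 178])

m = len(x)          # m = number of training examples -> 4

# The slides use 1-based indices: x(1) = 2104, y(2) = 232.
# With Python's 0-based indexing these are x[0] and y[1].
print(m)            # 4
print(x[0], y[1])   # 2104 232
```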
Training Set → Learning Algorithm → h (hypothesis)
The hypothesis h takes the size of a house (x) as input and outputs the estimated price (the estimated value of y).

How do we represent h?
hθ(x) = θ0 + θ1x   (shorthand: h(x))

h is a function that maps from x's to y's. This model is linear regression with one variable, also called univariate linear regression.
In summary
• A hypothesis h takes in some variable(s)
• Uses parameters determined by a learning algorithm
• Outputs a prediction based on that input
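As a small illustrative sketch (the function name is mine, not from the slides), this hypothesis can be written directly in Python:

```python
def h(x, theta0, theta1):
    """Hypothesis h_theta(x) = theta0 + theta1 * x for univariate linear regression."""
    return theta0 + theta1 * x

# With theta0 = 50 and theta1 = 0.06 (parameter values that appear later in the slides),
# a 1000 ft^2 house is predicted to cost 50 + 0.06 * 1000 = 110, i.e. $110,000.
print(h(1000, 50, 0.06))   # 110.0
```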
Cost function
• A cost function lets us figure out how to fit the best possible straight line to our data.
Training set:

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …

Hypothesis: hθ(x) = θ0 + θ1x
θ0, θ1: parameters
How do we choose θ0, θ1?
Different parameter values give different functions:
• θ0 = 1.5, θ1 = 0:   h(x) = 1.5 + 0·x   (a horizontal line)
• θ0 = 0,  θ1 = 0.5:  h(x) = 0 + 0.5·x
• θ0 = 1,  θ1 = 0.5:  h(x) = 1 + 0.5·x
The line has a positive slope if θ1 > 0.
Idea: choose θ0, θ1 so that hθ(x) is close to y for our training examples (x, y).
• hθ(x) is a “y imitator”: it tries to convert each x into y.
• Since we already have y, we can evaluate how well hθ(x) does this.
We want the deviation of hθ(x) from y to be minimal.
Idea: choose θ0, θ1 so that hθ(x) is close to y for our training examples — a minimization problem.
Cost function

$h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}$

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

$\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

• This cost function is also called the squared error cost function
  – A reasonable choice for most regression problems
  – Probably the most commonly used cost function
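A minimal NumPy sketch of this squared error cost function (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    """Squared error cost J(theta0, theta1) = (1 / (2m)) * sum_i (h_theta(x(i)) - y(i))^2."""
    m = len(x)
    predictions = theta0 + theta1 * x         # h_theta(x(i)) for every training example
    squared_errors = (predictions - y) ** 2   # (h_theta(x(i)) - y(i))^2
    return squared_errors.sum() / (2 * m)

# Housing training set from the earlier slide
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])
print(compute_cost(x, y, theta0=50.0, theta1=0.06))
```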
Cost function intuition I

Simplified version: fix θ0 = 0, so the hypothesis passes through the origin.
Hypothesis: hθ(x) = θ1x
Parameter: θ1
Cost function: $J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Goal: $\min_{\theta_1} J(\theta_1)$
[Plots: left — hθ(x), which for fixed θ1 is a function of x, drawn over the toy training set {(1, 1), (2, 2), (3, 3)}; right — J(θ1) as a function of the parameter θ1.]

For θ1 = 1:
$J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_1 x^{(i)} - y^{(i)} \right)^2 = \frac{1}{2 \times 3} \left( 0^2 + 0^2 + 0^2 \right) = 0$
so J(1) = 0.
For θ1 = 0.5:
$J(0.5) = \frac{1}{2 \times 3} \left[ (0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2 \right] = \frac{1}{6} (0.25 + 1 + 2.25) = \frac{3.5}{6} \approx 0.58$
For θ1 = 0:
$J(0) = \frac{1}{2 \times 3} \left[ (0 \times 1 - 1)^2 + (0 \times 2 - 2)^2 + (0 \times 3 - 3)^2 \right] = \frac{1}{6} (1 + 4 + 9) = \frac{14}{6} \approx 2.3$
For θ1 = −0.5:
$J(-0.5) = \frac{1}{2 \times 3} \left[ (-0.5 - 1)^2 + (-1 - 2)^2 + (-1.5 - 3)^2 \right] = \frac{1}{6} (2.25 + 9 + 20.25) = \frac{31.5}{6} = 5.25$

• If we compute J(θ1) for a range of θ1 values and plot J(θ1) vs. θ1, we get a curve that looks like a quadratic (a parabola).
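The worked values above can be checked with a short script over the toy training set {(1, 1), (2, 2), (3, 3)} (a sketch with θ0 fixed at 0, matching the simplified hypothesis):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)

def J(theta1):
    """Simplified cost with theta0 = 0: J(theta1) = (1 / (2m)) * sum_i (theta1 * x(i) - y(i))^2."""
    return ((theta1 * x - y) ** 2).sum() / (2 * m)

for t in [1.0, 0.5, 0.0, -0.5]:
    print(f"J({t}) = {J(t):.2f}")
# Prints J(1.0) = 0.00, J(0.5) = 0.58, J(0.0) = 2.33, J(-0.5) = 5.25,
# matching the hand calculations above.
```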
Cost function intuition II

Hypothesis: hθ(x) = θ0 + θ1x
Parameters: θ0, θ1
Cost function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
[Plot: the housing data — price ($ in 1000's) vs. size in feet² — with the hypothesis hθ(x) = 50 + 0.06x, i.e. θ0 = 50 and θ1 = 0.06. For fixed θ0, θ1 this is a function of x; J(θ0, θ1) is a function of the parameters. With two parameters, J can no longer be drawn as a simple 2-D curve.]
[Figures: contour plots of J(θ0, θ1). On each slide, the left plot shows hθ(x) for a particular (θ0, θ1) pair (for fixed θ0, θ1 it is a function of x), and the right plot marks that (θ0, θ1) pair on the contour plot of J (a function of the parameters).]
• Reading parameter values off plots by hand is painful.
• What we really want is an efficient algorithm that finds the values of θ0 and θ1 that minimize J.
Gradient descent algorithm
• Minimizes the cost function J
• Used all over machine learning for minimization
Gradient descent algorithm

Have some function J(θ0, θ1).
Want $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$.

Outline:
• Start with some θ0, θ1
• Keep changing θ0, θ1 to reduce J(θ0, θ1) until we hopefully end up at a minimum
• Local search methods for optimization: hill climbing, simulated annealing, the gradient descent algorithm, etc.
Local Search Methods
• Applicable when we seek a goal state and don't care how we get there. E.g.,
  – N-queens
  – finding shortest/cheapest round trips (Travelling Salesman Problem, Vehicle Routing Problem)
  – finding models of propositional formulae (SAT solvers)
  – VLSI layout, planning, scheduling, time-tabling, …
  – map coloring
  – resource allocation
  – protein structure prediction
  – genome sequence assembly
Local search
Key idea (surprisingly simple):
1. Select a (random) initial state (generate an initial guess)
2. Make a local modification to improve the current state (evaluate the current state and move to other states)
3. Repeat step 2 until a goal state is found (or we run out of time)
Back to gradient descent on J(θ0, θ1). [Figures: 3-D surface plots of J(θ0, θ1) over θ0 and θ1; starting gradient descent from different initial points, the algorithm walks downhill and can end up at different local minima.]
Gradient descent algorithm

Repeat until convergence:
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$   (for j = 0 and j = 1)

Here $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ is the derivative term and α is the learning rate.
α: learning rate (should be a small number)
• Large α := huge steps
• Small α := baby steps
Gradient descent algorithm

Correct: simultaneous update of θ0, θ1 — compute both new values from the current (θ0, θ1) before overwriting either parameter.
Incorrect: updating θ0 first and then using the already-updated θ0 when computing the update for θ1 (see the sketch below).
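A hedged Python sketch of the difference (grad is a hypothetical helper that returns both partial derivatives at a given point; the names are illustrative):

```python
def step_simultaneous(theta0, theta1, alpha, grad):
    """Correct: both partial derivatives are evaluated at the CURRENT (theta0, theta1),
    and both parameters are overwritten only afterwards."""
    g0, g1 = grad(theta0, theta1)
    return theta0 - alpha * g0, theta1 - alpha * g1

def step_sequential(theta0, theta1, alpha, grad):
    """Incorrect: theta0 is overwritten first, so the derivative used for theta1
    is evaluated at a mixed point (new theta0, old theta1)."""
    g0, _ = grad(theta0, theta1)
    theta0 = theta0 - alpha * g0
    _, g1 = grad(theta0, theta1)
    return theta0, theta1 - alpha * g1
```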
Gradient descent intuition
• To understand the intuition, we'll return to the simpler one-parameter cost function and minimize a single parameter to explain the algorithm in more detail:
$\min_{\theta_1} J(\theta_1)$, where $\theta_1 \in \mathbb{R}$
Two key terms in the algorithm
• The derivative term $\frac{d}{d\theta_1} J(\theta_1)$
• The learning rate α
Partial derivative vs. derivative
• Use a partial derivative when the function has multiple variables but we differentiate with respect to only one of them.
• Use an ordinary derivative when the function has a single variable.

$\theta_1 := \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_1)$

If the slope at the current point is positive, $\frac{d}{d\theta_1} J(\theta_1) \geq 0$, then θ1 := θ1 − α · (positive number), so θ1 decreases and moves toward the minimum.

Derivative: take the tangent to the curve at the point (the straight red line) and calculate the slope of this tangent line. Slope = vertical change / horizontal change.
$\theta_1 := \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_1)$

If the slope at the current point is negative, $\frac{d}{d\theta_1} J(\theta_1) \leq 0$, then θ1 := θ1 − α · (negative number), so θ1 increases and moves toward the minimum.
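A small numeric check of this sign argument on the toy cost J(θ1) from the earlier intuition slides (a sketch; the finite-difference derivative is only for illustration and is not part of the algorithm):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def J(theta1):
    return ((theta1 * x - y) ** 2).sum() / (2 * len(x))

def dJ(theta1, eps=1e-6):
    """Numerical derivative dJ/dtheta1 via a central finite difference."""
    return (J(theta1 + eps) - J(theta1 - eps)) / (2 * eps)

alpha = 0.1
for start in [2.0, -0.5]:                       # right of the minimum, then left of it
    theta1_new = start - alpha * dJ(start)      # one gradient descent step
    print(start, "->", round(theta1_new, 3))
# From 2.0 the slope is positive, so theta1 decreases toward the minimum at 1;
# from -0.5 the slope is negative, so theta1 increases toward it.
```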
Slope
• Familiar meaning?
• The slope of a line is the change in y divided by the change in x.
• Slope: $m = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1} = \frac{\text{rise}}{\text{run}}$
• Pick any two points on the line: (x1, y1), (x2, y2).
• Example: find the slope of the line which passes through the points (2, 5) and (0, 1):
  $m = \frac{5 - 1}{2 - 0} = \frac{4}{2} = 2$, which is a positive number.
• Meaning: every time x increases by 1 (anywhere on the line), y increases by 2, and whenever x decreases by 1, y decreases by 2.
Positive slope (i.e. m > 0)
• y always increases when x increases, and y always decreases when x decreases.
• The graph of the line starts at the bottom left and goes towards the top right.
Negative slope (i.e. m < 0)
Example: $m = \frac{3 + 1}{-2 - 1} = \frac{4}{-3} = -\frac{4}{3} \approx -1.33$
• y always decreases when x increases, and y always increases when x decreases.
Horizontal and Vertical Lines
• The slope of any horizontal line is 0: $m = \frac{0}{\Delta x} = 0$
• The slope of any vertical line is undefined.
The derivative $\frac{d}{d\theta_1} J(\theta_1)$ can take:
• a positive value
• a negative value
• a zero value
At each point, the line is tangent to the curve, and its slope is the derivative at that point.

α: learning rate
• If α is too small, gradient descent can be slow.
• If α is too large, gradient descent can overshoot the minimum; it may fail to converge.
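Both failure modes can be seen on a simple quadratic cost; a sketch (the cost J(θ) = θ² and the α values are illustrative choices, not from the slides):

```python
def run(alpha, steps=10, theta=1.0):
    """Gradient descent on J(theta) = theta^2, whose derivative is 2 * theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

print(run(alpha=0.01))   # too small: after 10 steps theta is still far from the minimum at 0 (slow)
print(run(alpha=0.5))    # reasonable: converges (here it reaches 0 immediately)
print(run(alpha=1.1))    # too large: theta overshoots, grows in magnitude, and fails to converge
```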
Question: what happens when you reach a local minimum (a local optimum)?
At a local optimum, the derivative term is 0 for the current value of θ1, so the update becomes θ1 := θ1 − α · 0 and θ1 remains the same.
Gradient descent can converge to a local minimum, even with the learning rate α fixed.
As we approach a local minimum, the derivative term gets smaller, so gradient descent automatically takes smaller steps. There is no need to decrease α over time.
Gradient descent for linear regression
We now apply the gradient descent algorithm to the linear regression model: the hypothesis hθ(x) = θ0 + θ1x with the squared error cost J(θ0, θ1).
Computing the derivative term for each parameter:
$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2$

$j = 0: \quad \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$

$j = 1: \quad \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}$
Gradient descent algorithm (for linear regression)

Repeat until convergence (updating θ0 and θ1 simultaneously):
$\theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
$\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}$
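Putting the pieces together, a hedged NumPy sketch of batch gradient descent for univariate linear regression (the function name, α, and the iteration count are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, num_iters=2000):
    """Batch gradient descent for h_theta(x) = theta0 + theta1 * x."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        errors = (theta0 + theta1 * x) - y     # h_theta(x(i)) - y(i) for all examples
        grad0 = errors.sum() / m               # dJ/dtheta0
        grad1 = (errors * x).sum() / m         # dJ/dtheta1
        # Simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Toy data lying on the line y = 1 + 2x; gradient descent should recover theta0 ≈ 1, theta1 ≈ 2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x
print(gradient_descent(x, y))   # approximately (1.0, 2.0)
```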
[Figures: plots of J(θ0, θ1) over θ0 and θ1 for the linear regression problem.]
[Figures (sequence of gradient descent steps): on each slide, the left plot shows the current hypothesis hθ(x) (for fixed θ0, θ1, a function of x) drawn through the housing data, and the right plot shows the corresponding point (θ0, θ1) on the contour plot of J(θ0, θ1) (a function of the parameters) moving step by step toward the minimum.]
Linear Regression with One Variable
• In this formulation of the error, a is the y-intercept while b is the slope.
• The error measure is the SSE (sum of squared errors).
Another name: “Batch” Gradient Descent
“Batch”: each step of gradient descent uses all the training examples.
Another algorithm that solves the same minimization problem: the normal equations method.
The gradient descent algorithm scales better than the normal equations method to larger datasets.
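For comparison, a minimal NumPy sketch of the normal equations method for the same univariate model (this uses the standard least-squares formula θ = (XᵀX)⁻¹Xᵀy, stated here without derivation):

```python
import numpy as np

def normal_equation(x, y):
    """Closed-form least-squares solution for h(x) = theta0 + theta1 * x."""
    X = np.column_stack([np.ones_like(x), x])   # design matrix: a column of 1s for theta0, then x
    return np.linalg.solve(X.T @ X, X.T @ y)    # solves (X^T X) theta = X^T y without an explicit inverse

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x
print(normal_equation(x, y))   # approximately [1. 2.]
```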
Generalization of the gradient descent algorithm
• Learn with a larger number of features.
• With many features, the hypothesis and cost function become difficult to plot.
With multiple features, the inputs form a matrix: each row holds the size, number of bedrooms, number of floors, and age of one home. The corresponding prices are collected in a vector, shown as y.
• We need linear algebra for more complex linear regression models.
• Linear algebra is good for building computationally efficient models (we'll see this later):
  – it provides a good way to work with large data sets
  – vectorization of a problem is a common optimization technique (see the sketch below)
End