2-LR_Optim
Zijun Yao
Assistant Professor, EECS Department
The University of Kansas
Agenda
• Loss function
• Optimizing parameters
Supervised learning setup
• Given a collection of records (training set)
• Each record is characterized by a pair (x, y)
* x is bolded because it represents a set of features; y is not because it is just a value.
House price prediction - regression
[Diagram: features (Size of House, # of Bedrooms, …) → f → Price of House]
Linear regression
*A set of functions means the same model but with different values of the parameters.
Step 1: Model definition
f(x_{size}, x_{bath}) = \hat{y}, the predicted value of the price y, where x_{size} is the size of the house and x_{bath} is the # of baths
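Not part of the original slides: a minimal Python sketch of this model definition, assuming the linear form f(x) = b + w_size·x_size + w_bath·x_bath that the lecture builds toward; the weight and bias values are made up purely for illustration.

```python
# A hypothetical linear house-price model; the parameter values are invented.
def f(x_size, x_bath, w_size=150.0, w_bath=20_000.0, b=50_000.0):
    """Return the predicted house price y_hat for one house."""
    return b + w_size * x_size + w_bath * x_bath

# Example: a 2,000 sqft house with 2 baths (made-up inputs).
y_hat = f(x_size=2000, x_bath=2)
print(y_hat)  # 390000.0
```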
Step 1: Define set of functions
f(x_{size}, x_{bath}) = \hat{y} (predicted price), where x_{size} is the size of the house and x_{bath} is the # of baths; each choice of parameter values gives one function in the set
Step 1: A variant form of linear model
Equivalence: f(x) = b + \sum_{j=1}^{d} w_j x_j can be written as f(x) = \sum_{j=0}^{d} w_j x_j by defining x_0 = 1 and w_0 = b
Step 1: A variant form of linear model
\hat{y} (prediction) = f(x) (a function of x), where the w_j are the parameters and the x_j are the features
Agenda
• Loss function
• Optimizing parameters
Step 2: Goodness of function
How good is a function? Measure the difference between the predicted and true y.
[Diagram: Training Data → Model (a set of functions f1, f2, …)]
Step 2: Goodness of function
How good is a function? Measure the difference between the predicted and true y.
Suppose we have two house features in this data: size and age (the superscript is the data index).
• House 1: x_{size}^{(1)} = 4,043 sqft, x_{age}^{(1)} = 26 years → \hat{y}^{(1)} = f(x^{(1)}) vs. y^{(1)} = 784,000
• House 2: x_{size}^{(2)} = 4,976 sqft, x_{age}^{(2)} = 8 years → \hat{y}^{(2)} = f(x^{(2)}) vs. y^{(2)} = 724,900
Step 2: Measure error
How good is a function? - use a loss function L
Input: a function f and the data. Output: the loss - how far the predictions are from the true values.
Sum of squared errors (SSE): L(f) = \sum_{i=1}^{n} (y^{(i)} - f(x^{(i)}))^2, summed over the examples, where f(x^{(i)}) is the estimated y based on the input function.
Averaged by n, you have the mean squared error (MSE) loss: \frac{1}{n} \sum_{i=1}^{n} (y^{(i)} - f(x^{(i)}))^2
[Diagram: Training Data → Model (a set of functions f1, f2, …) → Goodness of function f (loss L)]
[Plots: a simple case where only one feature is used to predict y; the case where 2 features are used to predict y]
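As a small sketch (not from the slides), the SSE and MSE losses can be computed directly; the predicted values below are invented, and only the two true prices come from the earlier example.

```python
import numpy as np

y_true = np.array([784_000.0, 724_900.0])   # true prices from the example houses
y_pred = np.array([800_000.0, 700_000.0])   # hypothetical model predictions

# Sum of squared errors: L(f) = sum_i (y^(i) - f(x^(i)))^2
sse = np.sum((y_true - y_pred) ** 2)

# Mean squared error: the SSE averaged over the n examples
mse = sse / len(y_true)

print(sse, mse)
```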
Step 2: Intuition of loss function
• Let's use a simple case with only one feature
L(f) = \sum_{i=1}^{n} (y^{(i)} - w \cdot x^{(i)})^2
With w = 1, b = 0: L(f) = (1-1)^2 + (2-2)^2 + (3-3)^2 = 0
[Plots: the data points (1, 1), (2, 2), (3, 3) and the fitted line f(x) = w·x (left); L(f) vs. w (right)]
Step 2: Intuition of loss function
• Let's use a simple case with only one feature
L(f) = \sum_{i=1}^{n} (y^{(i)} - w \cdot x^{(i)})^2
With w = 0.5, b = 0: L(f) = (1-0.5)^2 + (2-1)^2 + (3-1.5)^2 = 3.5
[Plots: f(x) vs. x (left); L(f) vs. w (right)]
Step 2: Intuition of loss function
• Let's use a simple case with only one feature
L(f) = \sum_{i=1}^{n} (y^{(i)} - w \cdot x^{(i)})^2
With w = 0, b = 0: L(f) = (1-0)^2 + (2-0)^2 + (3-0)^2 = 14
[Plots: f(x) vs. x (left); L(f) vs. w (right)]
Step 2: Intuition of loss function
• Let's use a simple case with only one feature
L(f) = \sum_{i=1}^{n} (y^{(i)} - w \cdot x^{(i)})^2
With w = 0, b = 0: L(f) = (1-0)^2 + (2-0)^2 + (3-0)^2 = 14
The loss function L is convex (bowl-shaped) in w.
[Plots: f(x) vs. x (left); L(f) vs. w (right)]
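A short sketch that reproduces the loss values on these slides for the toy data (1, 1), (2, 2), (3, 3); sweeping w traces out the convex curve shown in the right-hand plots.

```python
import numpy as np

# Toy data used in the intuition slides: y = x for x = 1, 2, 3.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def loss(w):
    """SSE loss L(f) = sum_i (y^(i) - w * x^(i))^2 for the one-feature model (b = 0)."""
    return np.sum((y - w * x) ** 2)

for w in [1.0, 0.5, 0.0]:
    print(w, loss(w))   # 0.0, 3.5, 14.0 as on the slides

# Sweep w to trace the loss curve L(f) vs. w.
ws = np.linspace(0.0, 1.5, 31)
curve = [loss(w) for w in ws]
```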
Step 2: Intuition of loss function
[Plot: L(f) vs. w, comparing the loss at two candidate parameter values w_1 and w_2]
Agenda
• Loss function
• Optimizing parameters
Step 3: Find the best function
L(w, b) = \sum_{i=1}^{n} \left( y^{(i)} - (b + w_{size} \cdot x_{size}^{(i)} + w_{age} \cdot x_{age}^{(i)}) \right)^2
[Diagram: Training Data → Model (a set of functions f1, f2, …) → Goodness of function f (loss L)]
Step 3: Find the best function
• The derivative of the loss function L(w) is the slope of the tangent line to L(w) at a given point.
[Plot: L(w) with the tangent line at w = 1]
• Gradients: the gradient is a vector consisting of the partial derivatives of L with respect to each parameter, e.g., \nabla L(w, b) = (\partial L/\partial w, \partial L/\partial b)
• If the derivative of L at the current point w_0 is positive, decrease w; if it is negative, increase w.
[Plot: L(w) with the tangent line at w_0]
Step 3: Gradient descent
• Consider a loss function L(w) with one parameter w: w^* = \arg\min_w L(w)
• Update: w^1 \leftarrow w^0 - \alpha \, (dL/dw)|_{w=w^0}
• \alpha is called the "learning rate" - usually small, like 0.05
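A minimal sketch of this one-parameter update rule, reusing the toy data from the intuition slides; the starting point and learning rate below are arbitrary illustrative choices.

```python
import numpy as np

# Toy data from the intuition slides: y = x for x = 1, 2, 3.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def dL_dw(w):
    """Derivative of L(w) = sum_i (y^(i) - w * x^(i))^2 with respect to w."""
    return -2.0 * np.sum(x * (y - w * x))

alpha = 0.01   # learning rate (illustrative choice)
w0 = 0.0       # arbitrary starting point

# One gradient descent step: w1 = w0 - alpha * dL/dw |_{w=w0}
w1 = w0 - alpha * dL_dw(w0)
print(w1)  # 0.28, one step from w0 = 0 toward the optimum w* = 1
```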
Step 3: Gradient descent
• Consider a loss function L(w, b) with two parameters w and b: w^*, b^* = \arg\min_{w,b} L(w, b)
➢ w^1 \leftarrow w^0 - \alpha \, (\partial L/\partial w)|_{w=w^0, b=b^0},  b^1 \leftarrow b^0 - \alpha \, (\partial L/\partial b)|_{w=w^0, b=b^0}
➢ Compute (\partial L/\partial w)|_{w=w^1, b=b^1} and (\partial L/\partial b)|_{w=w^1, b=b^1}
➢ w^2 \leftarrow w^1 - \alpha \, (\partial L/\partial w)|_{w=w^1, b=b^1},  b^2 \leftarrow b^1 - \alpha \, (\partial L/\partial b)|_{w=w^1, b=b^1}
• Repeat iterations until convergence
Step 3: Gradient descent
• Gradient of Linear Regression
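The formula on this slide is not reproduced in the export, so as a hedged reconstruction: for the one-feature SSE loss L(w, b) = \sum_i (y^{(i)} - (w x^{(i)} + b))^2, the partial derivatives are \partial L/\partial w = -2 \sum_i x^{(i)} (y^{(i)} - (w x^{(i)} + b)) and \partial L/\partial b = -2 \sum_i (y^{(i)} - (w x^{(i)} + b)). The sketch below runs the full gradient descent loop with these gradients on made-up toy data.

```python
import numpy as np

# Made-up one-feature data for illustration (generated from y = 2x + 1).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0   # initial parameters w0, b0
alpha = 0.01      # learning rate

for _ in range(5000):
    residual = y - (w * x + b)               # y^(i) - (w x^(i) + b)
    grad_w = -2.0 * np.sum(x * residual)     # dL/dw for the SSE loss
    grad_b = -2.0 * np.sum(residual)         # dL/db for the SSE loss
    w, b = w - alpha * grad_w, b - alpha * grad_b

print(w, b)   # approaches w = 2, b = 1
```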
Step 3: Gradient descent
[Figure: the fitted line f(x) (left) and the loss surface L(w, b) (right); each gradient descent update moves (w, b) from high L toward low L, stopping where \partial L/\partial w \approx 0 or = 0]
http://www.bdhammel.com/learning-rates/
Linear algebra review
• A vector in ℝ^d is an ordered set of d real values
• A matrix in ℝ^{n×m} is an n-by-m object with n rows and m columns
• Transpose
• Matrix product: e.g., a 4×2 matrix times a 2×3 matrix gives a 4×3 matrix
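A quick sketch (not from the slides) checking these shape rules with NumPy.

```python
import numpy as np

A = np.ones((4, 2))   # a 4x2 matrix
B = np.ones((2, 3))   # a 2x3 matrix

C = A @ B             # matrix product: (4x2) @ (2x3) -> (4x3)
print(C.shape)        # (4, 3)
print(A.T.shape)      # transpose of A: (2, 4)
```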
Vectorization of linear regression
• Benefits of vectorization
  • More compact equations
  • Faster code (using optimized matrix libraries)
• Linear regression model: f(x) = \sum_{j=0}^{d} w_j x_j, with x_0 = 1 and w_0 = b
• Let \mathbf{w} = [w_0, w_1, …, w_d]^T and \mathbf{x} = [1, x_1, …, x_d]^T, so that f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}
Vectorization of linear regression
• Consider the model for n instances
• Let \mathbf{w} ∈ ℝ^{(d+1)×1} and \mathbf{X} ∈ ℝ^{n×(d+1)} (one row per instance, with a leading 1 in each row)
• In vectorized form, the linear regression model is \hat{\mathbf{y}} = \mathbf{X}\mathbf{w}
Vectorization of linear regression
• For the loss function: L(\mathbf{w}) = (\mathbf{y} - \mathbf{X}\mathbf{w})^T (\mathbf{y} - \mathbf{X}\mathbf{w})
https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3
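A hedged sketch of the vectorized model and loss, assuming the standard design-matrix construction with a leading column of 1s; the feature rows beyond the two example houses, the last two prices, and the parameter values are made up.

```python
import numpy as np

# Features (size in sqft, age in years): the first two rows are the example houses,
# the rest are invented to pad out the toy data set.
X_raw = np.array([[4043.0, 26.0],
                  [4976.0,  8.0],
                  [3200.0, 15.0],
                  [2800.0, 40.0]])
y = np.array([784_000.0, 724_900.0, 650_000.0, 500_000.0])

# Prepend a column of 1s so the bias is absorbed into w: X is n x (d+1).
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

w = np.zeros(X.shape[1])            # parameter vector in R^(d+1)

y_hat = X @ w                       # vectorized predictions: y_hat = X w
loss = (y - y_hat) @ (y - y_hat)    # vectorized SSE: (y - Xw)^T (y - Xw)
grad = -2.0 * X.T @ (y - y_hat)     # gradient of the SSE loss in vector form

print(loss, grad)
```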
Feature standardization
• Rescale features to have zero mean and unit variance: x_j \leftarrow (x_j - \mu_j)/\sigma_j for j = 1 … d (not x_0), where \mu_j and \sigma_j are the mean and standard deviation of feature j on the training set
• Must apply the same transformation for both training and testing instances
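A small sketch of feature standardization with made-up training and test feature matrices; the key point from the slide is that the training-set mean and standard deviation are reused on the test set.

```python
import numpy as np

# Made-up feature matrices (rows = instances, columns = features).
X_train = np.array([[4043.0, 26.0],
                    [4976.0,  8.0],
                    [2800.0, 40.0]])
X_test = np.array([[3500.0, 15.0]])

mu = X_train.mean(axis=0)      # per-feature mean, from the training set only
sigma = X_train.std(axis=0)    # per-feature standard deviation, training set only

# Apply the SAME transformation to training and testing instances.
X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma

print(X_train_std.mean(axis=0))  # ~0 for each feature
print(X_train_std.std(axis=0))   # ~1 for each feature
```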
Regularization
• A method to control the complexity of the model and avoid overfitting
• Why - address overfitting issues by keeping w small
• How - penalize large values of w_j
• Can be incorporated into the loss function
• Works well when we have a lot of features
L(f) = \sum_{i=1}^{n} (y^{(i)} - f(x^{(i)}))^2 + \lambda \sum_{j=1}^{d} w_j^2
The penalty term \sum_j w_j^2 is the squared L_2-norm of w, so this is also called L_2 regularization.
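A sketch of the regularized (L2 / ridge) loss and its gradient, assuming the bias term w_0 is excluded from the penalty as in the j = 1 … d sum above; the example reuses the two houses from earlier plus an invented third row, and \lambda is an arbitrary choice.

```python
import numpy as np

# Design matrix with a leading column of 1s (bias column) and prices; third row invented.
X = np.array([[1.0, 4043.0, 26.0],
              [1.0, 4976.0,  8.0],
              [1.0, 2800.0, 40.0]])
y = np.array([784_000.0, 724_900.0, 500_000.0])
lam = 0.1   # regularization strength lambda (illustrative)

def ridge_loss(w):
    """SSE loss plus the L2 penalty; w[0] is the bias and is not penalized."""
    residual = y - X @ w
    return residual @ residual + lam * np.sum(w[1:] ** 2)

def ridge_grad(w):
    """Gradient of the regularized loss with respect to w."""
    residual = y - X @ w
    grad = -2.0 * X.T @ residual
    grad[1:] += 2.0 * lam * w[1:]   # penalty gradient, bias excluded
    return grad

w = np.zeros(X.shape[1])
print(ridge_loss(w), ridge_grad(w))
```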