2-LR_Optim

The document outlines the fundamentals of linear regression in machine learning, detailing the model definition, loss function, and parameter optimization. It explains the supervised learning setup, where the goal is to predict a target value based on input features. The document also discusses the steps involved in defining the model, measuring its goodness through loss functions, and optimizing parameters to minimize error.

EECS 836: Machine Learning

Zijun Yao
Assistant Professor, EECS Department
The University of Kansas

Agenda

• Linear Regression model
  • Model definition
  • Loss function
  • Optimizing parameters
Supervised learning setup

• Given a collection of records (training set)
  • Each record is characterized by a pair (x, y)
    • x: feature, attribute, independent variable
    • y: target, label, dependent variable
• Goal
  • Learn a model (function f) so that f(x) can correctly predict ŷ for the corresponding value of y
• Tasks
  • Regression: to predict a continuous value ŷ
  • Classification: to predict a category class ŷ

[Diagram: training set → learning algorithm → function f; a feature vector x goes into f, which outputs ŷ, the predicted value of y]

* x is bolded because it represents a set of features; y is not because it is just a value.
House price prediction - regression

[Diagram: features such as size of house, # of bedrooms, … feed into f, which outputs the price of the house]
Linear regression

[Scatter plot: independent variables (features) x on the horizontal axis, dependent variables (targets) y on the vertical axis, with a fitted line through the points]

• Given
  • Data
  • Corresponding labels
• Goal: find a continuous function that models the continuous points
3 ML steps for linear regression

Step 1: define a set of functions* (define a model)
Step 2: goodness of function (measure the error)
Step 3: pick the best function (optimize the parameters)

*A set of functions means the same model but with different values of the parameters.
Step 1: Model definition

$$\hat{y} = b + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d$$

• ŷ is the predicted value of y (the target); x₁, x₂, …, x_d are the 1st through d-th features (the data)
• b is the bias, a fixed offset; w₁, w₂, …, w_d are the weights, the significance of each feature. Together, the bias and the weights are the parameters
• The model assumes a linear relationship between the features and the target
Step 1: Define a set of functions

Example: predict the price y of a house from two features, x_size (size of house) and x_bath (# of baths):

$$f(\mathbf{x}) = b + w_{size} \cdot x_{size} + w_{bath} \cdot x_{bath}$$

This is the Linear Regression model. w and b are parameters (they can be any value), and each choice of values gives a different function f₁, f₂, …

A set of functions {f₁, f₂, …}: the same model with different values of the parameters, so there are infinitely many functions in the set.
Step 1: A variant form of linear model

$$\hat{y} = b + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d$$

Equivalence: introduce a constant 0-th feature x₀ = 1 and let w₀ = b. Then:

$$\hat{y} = w_0 x_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d = \mathbf{w}^{\top}\mathbf{x}$$

In this form of the Linear Regression model, the prediction ŷ is a function of x, where w = (w₀, w₁, …, w_d) are the parameters and x = (x₀, x₁, …, x_d) are the features.
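As a concrete illustration (not from the slides), a minimal NumPy sketch of this model; the parameter and feature values are hypothetical:

```python
import numpy as np

def predict(w: np.ndarray, x: np.ndarray) -> float:
    """Linear model y_hat = w^T x, where x[0] = 1 so that w[0] plays the role of the bias b."""
    return float(w @ x)

# Hypothetical parameters [b, w_size, w_bath] and one augmented feature vector [x0=1, x_size, x_bath]
w = np.array([1.0, 0.5, 2.0])
x = np.array([1.0, 4.0, 2.0])
print(predict(w, x))  # 1.0*1 + 0.5*4 + 2.0*2 = 7.0
```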
Agenda

• Linear Regression model
  • Model definition
  • Loss function
  • Optimizing parameters
Step 2: Goodness of function

How good is a function? - measure the difference between predicted and true y.

Suppose we have two house features in this data: size and age. The superscript means the data index.

Model input x                                    Model output (scalar)   Label value (true y)
x_size^(1) = 4,043 sqft, x_age^(1) = 26 years    ŷ^(1) = f(x^(1))        y^(1) = 784,000
x_size^(2) = 4,976 sqft, x_age^(2) = 8 years     ŷ^(2) = f(x^(2))        y^(2) = 724,900

For each training example, measure the difference between the prediction ŷ^(i) and the true value y^(i).
Step 2: Measure error

How good is a function? - Loss function L
• Input: a function f and the data
• Output: the loss, i.e., how far the predictions are from the true values

Sum of squared error (SSE), summed over the n examples, where f(x^(i)) is the estimated y based on the input function:

$$L(f) = \sum_{i=1}^{n} \left( y^{(i)} - f\!\left(x^{(i)}\right) \right)^2$$

Averaged by n, you have the mean squared error (MSE) loss:

$$\frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f\!\left(x^{(i)}\right) \right)^2$$

Since f is determined by its parameters, the loss can be written as a function of the parameters. For the two-feature house example:

$$L(w_{size}, w_{age}, b) = \sum_{i=1}^{n} \left( y^{(i)} - \left( b + w_{size} \cdot x_{size}^{(i)} + w_{age} \cdot x_{age}^{(i)} \right) \right)^2$$

A loss function is also called a cost function or an objective function.


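A minimal NumPy sketch of these two losses (the function names are my own):

```python
import numpy as np

def sse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Sum of squared errors over all examples."""
    return float(np.sum((y_true - y_pred) ** 2))

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error: the SSE averaged over the n examples."""
    return sse(y_true, y_pred) / len(y_true)
```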
Step 2: Loss function

$$L(f) = \sum_{i=1}^{n} \left( y^{(i)} - f\!\left(x^{(i)}\right) \right)^2$$

[Plots: in a simple case where only one feature is used to predict y, the loss is a curve over the single weight; when there are 2 features to predict y, the loss is a surface over the two weights]
Step 2: Intuition of loss function
• Let's use a simple case with only one feature and bias b = 0, so f(x) = w·x:

$$L(f) = \sum_{i=1}^{n} \left( y^{(i)} - w \cdot x^{(i)} \right)^2$$

The training data consists of the three points (1, 1), (2, 2), and (3, 3). Plugging in different values of w:

w = 1:   L(f) = (1−1)² + (2−2)² + (3−3)² = 0
w = 0.5: L(f) = (1−0.5)² + (2−1)² + (3−1.5)² = 3.5
w = 0:   L(f) = (1−0)² + (2−0)² + (3−0)² = 14

[Plots: left, the data points and the line f(x) = w·x for each value of w; right, the loss curve L(f) over w, lowest at w = 1 and rising as w moves away]

The loss function L is convex: it has a single minimum at the best parameter value.
Step 2: Intuition of loss function

[Plots: one-feature case, the loss curve L(f) over w; two-feature case, the loss surface L(f) over (w₁, w₂)]

The loss function tracks the performance of the model as the parameters change.
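The numbers above are easy to reproduce; this sketch sweeps w over the three-point dataset from the slides:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

for w in (1.0, 0.5, 0.0):
    loss = np.sum((y - w * x) ** 2)  # SSE with b = 0
    print(f"w = {w}: L = {loss}")    # prints 0.0, 3.5, 14.0
```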
Agenda

• Linear Regression model
  • Model definition
  • Loss function
  • Optimizing parameters
Step 3: Find the best function

$$L(w, b) = \sum_{i=1}^{n} \left( y^{(i)} - \left( b + w_{size} \cdot x_{size}^{(i)} + w_{age} \cdot x_{age}^{(i)} \right) \right)^2$$

The "best" function is the one that gives the minimum loss. Optimizing parameters means searching over w, b to find the minimum L:

$$f^{*} = \arg\min_{f} L(f)$$

$$w^{*}, b^{*} = \arg\min_{w,b} L(w, b) = \arg\min_{w,b} \sum_{i=1}^{n} \left( y^{(i)} - \left( b + w_{size} \cdot x_{size}^{(i)} + w_{age} \cdot x_{age}^{(i)} \right) \right)^2$$
Derivatives

[Plot: the loss curve L(w) with its tangent line at w = 1]

• Derivative: the derivative of the loss function L is the sensitivity of the loss function to a change in a parameter w.
• Partial derivative: let L(w₁, …, w_d) be a loss function of several parameters. The partial derivative ∂L/∂wᵢ is the derivative of L with respect to one of the parameters wᵢ, with the others held constant.
• Gradient: the gradient is a vector consisting of the partial derivative with respect to each parameter:

$$\nabla L = \left( \frac{\partial L}{\partial w_1}, \ldots, \frac{\partial L}{\partial w_d} \right)$$

How do we reduce the loss function? Subtract the gradient from each parameter w: the gradient is the direction that increases the value of L(w), so stepping in the opposite direction decreases the loss.
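A quick numerical check of the partial-derivative idea (a finite-difference sketch, not from the slides; the example function is made up):

```python
def numerical_partial(L, w, i, eps=1e-6):
    """Approximate dL/dw_i by a central difference, holding the other parameters constant."""
    w_plus, w_minus = list(w), list(w)
    w_plus[i] += eps
    w_minus[i] -= eps
    return (L(w_plus) - L(w_minus)) / (2 * eps)

# Example: L(w) = w0^2 + 3*w1 has gradient (2*w0, 3)
L = lambda w: w[0] ** 2 + 3 * w[1]
print(numerical_partial(L, [2.0, 1.0], 0))  # ~4.0
print(numerical_partial(L, [2.0, 1.0], 1))  # ~3.0
```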
Step 3: Gradient descent

• Consider a loss function L(w) with one parameter w: w* = arg min_w L(w)

➢ (Randomly) pick an initial value w⁰ at time 0
➢ Compute dL/dw at w = w⁰: if the derivative is negative, increase w; if positive, decrease w
➢ Update the parameter, where α is called the "learning rate" (usually small, like 0.05):

$$w^{t+1} \leftarrow w^{t} - \alpha \left. \frac{dL}{dw} \right|_{w = w^{t}}$$

➢ Repeat the iterations until convergence: w⁰ → w¹ → w² → … → w^T

[Plot: the loss curve L(w) with successive steps w⁰, w¹, w², … moving downhill; depending on the starting point, gradient descent may end at a local minimum rather than the global minimum]
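A minimal sketch of this loop on the three-point dataset used earlier (the learning rate and iteration count are my own choices):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

w, alpha = 0.0, 0.05                       # initial value w0 and learning rate
for t in range(100):
    grad = -2.0 * np.sum((y - w * x) * x)  # dL/dw for L(w) = sum_i (y_i - w*x_i)^2
    w = w - alpha * grad
print(w)  # converges toward w = 1, the minimizer found above
```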
Step 3: Gradient descent
• How about two parameters?

$$\nabla L = \begin{pmatrix} \partial L / \partial w \\ \partial L / \partial b \end{pmatrix} \;\text{(gradient)} \qquad w^{*}, b^{*} = \arg\min_{w,b} L(w, b)$$

➢ (Randomly) pick an initial value for each parameter: w⁰, b⁰
➢ Compute ∂L/∂w and ∂L/∂b at (w⁰, b⁰), then update both parameters:

$$w^{1} \leftarrow w^{0} - \alpha \left. \frac{\partial L}{\partial w} \right|_{w=w^{0},\, b=b^{0}} \qquad b^{1} \leftarrow b^{0} - \alpha \left. \frac{\partial L}{\partial b} \right|_{w=w^{0},\, b=b^{0}}$$

➢ Compute ∂L/∂w and ∂L/∂b at (w¹, b¹) and update again to get (w², b²)
➢ Repeat the iterations until convergence
Step 3: Gradient descent
• Gradient of linear regression: for the SSE loss defined above, the partial derivatives work out to

$$\frac{\partial L}{\partial w_j} = -2 \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) x_j^{(i)} \qquad \frac{\partial L}{\partial b} = -2 \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)$$
Step 3: Gradient descent

How does the loss get minimized by gradient descent?

[Animation across several slides: left, the fitted line f(x) over the data; right, a contour plot of L(w, b) shading from high L to low L. Each gradient step moves (w, b) downhill on the contour plot while the fitted line improves, until the trajectory finds the minimum.]

Slides by Andrew Ng
Step 3: Gradient descent
• A small gradient can slow down or halt the optimization:

[Plot: loss vs. the value of the parameter w, showing three trouble spots]

  • Very slow at a plateau, where ∂L/∂w ≈ 0
  • Stuck at a saddle point, where ∂L/∂w = 0
  • Stuck at a local minimum, where ∂L/∂w = 0


Step 3: Gradient descent - learning rate α

Monitor the loss at each iteration:
- Search α by order of magnitude at first (like 10⁻¹ … 10⁻⁵)
- Then tune α locally to achieve efficient convergence

http://www.bdhammel.com/learning-rates/
Linear algebra review
• A vector in ℝ^d is an ordered set of d real values
• A matrix in ℝ^(n×m) is an n-by-m object with n rows and m columns
• Transpose: Aᵀ swaps the rows and columns of A
• Matrix product: an n×m matrix times an m×p matrix yields an n×p matrix (e.g., a 4×2 matrix times a 2×3 matrix yields a 4×3 matrix)
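A quick NumPy illustration of the shape rules (the example values are my own):

```python
import numpy as np

A = np.ones((4, 2))   # 4×2 matrix
B = np.ones((2, 3))   # 2×3 matrix
print((A @ B).shape)  # matrix product: (4, 3)
print(A.T.shape)      # transpose: (2, 4)
```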
Vectorization of linear regression
• Benefits of vectorization
  • More compact equations
  • Faster code (using optimized matrix libraries)
• Linear regression model for one instance: ŷ = wᵀx
• For n instances, let w ∈ ℝ^((d+1)×1) stack the parameters (w₀ = b first) and let X ∈ ℝ^(n×(d+1)) stack the instances as rows, each augmented with x₀ = 1
• In vectorized form, the linear regression model for all n instances is:

$$\hat{\mathbf{y}} = X\mathbf{w}$$

• The SSE loss is likewise a one-time matrix calculation, without iterating through all data samples:

$$L(\mathbf{w}) = (\mathbf{y} - X\mathbf{w})^{\top}(\mathbf{y} - X\mathbf{w})$$
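A sketch of both vectorized formulas (the first two rows reuse the house examples from earlier; the third row and label are made up):

```python
import numpy as np

# n=3 instances, d=2 features (size, age), each row augmented with x0 = 1
X = np.array([[1.0, 4043.0, 26.0],
              [1.0, 4976.0,  8.0],
              [1.0, 3500.0, 15.0]])
y = np.array([784000.0, 724900.0, 610000.0])  # the third label is hypothetical
w = np.zeros(3)                               # parameters [b, w_size, w_age]

y_hat = X @ w                     # all n predictions in one matrix product
loss = (y - y_hat) @ (y - y_hat)  # SSE as a single calculation, no loop over samples
```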
Improving learning
• Feature scaling (or normalization)
  • Ensure all features have similar scales
  • Gradient descent would converge faster

https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3
Feature standardization
• Rescale features to have zero mean and unit variance
  • Let μⱼ be the mean of feature j
  • Let sⱼ be the standard deviation of feature j
  • Replace each value xⱼ with (xⱼ − μⱼ) / sⱼ, for j = 1 … d (not x₀)
• Must apply the same transformation to both training and testing instances
• Outliers can cause problems
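A minimal sketch, assuming X_train and X_test hold only the d raw features (the constant x₀ column is added afterwards):

```python
import numpy as np

def standardize(X_train: np.ndarray, X_test: np.ndarray):
    """Rescale to zero mean and unit variance; the statistics come from the training set only."""
    mu = X_train.mean(axis=0)  # mean of each feature j
    s = X_train.std(axis=0)    # standard deviation of each feature j
    return (X_train - mu) / s, (X_test - mu) / s  # same transformation for both
```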
Regularization
• A method to control the complexity of the model and avoid overfitting
• Why - address overfitting by keeping w small
• How - penalize large values of wⱼ
  • Can be incorporated into the loss function
  • Works well when we have a lot of features

$$L(f) = \underbrace{\sum_{i=1}^{n} \left( y^{(i)} - f\!\left(x^{(i)}\right) \right)^2}_{\text{model fit to data}} + \underbrace{\lambda \sum_{j=1}^{d} w_j^2}_{\text{regularization}}$$

The penalty term is also called the L₂-norm penalty.

o λ is a predefined hyperparameter that controls the degree of regularization
o No regularization on w₀ (the bias b)
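A sketch of this regularized loss (often called ridge regression), reusing the vectorized form from above:

```python
import numpy as np

def ridge_loss(w: np.ndarray, X: np.ndarray, y: np.ndarray, lam: float) -> float:
    """SSE plus an L2 penalty; w[0] is the bias b and is left unpenalized."""
    residual = y - X @ w
    return float(residual @ residual + lam * np.sum(w[1:] ** 2))
```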
Summary
• Problem: estimate a real value
• Model: ŷ = b + w₁x₁ + w₂x₂ + ⋯ + w_d x_d
• Loss function: squared error, here averaged over the n examples (the MSE form of the SSE loss):

$$L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} l^{(i)}(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$$

• Optimize parameters by the gradient descent method:
  • Choose a starting point
  • Repeat until convergence:
    • Compute the gradient
    • Update the parameters
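Putting the three steps together, a compact end-to-end sketch (the data and hyperparameters are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # n=100 instances, d=2 standardized features
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1]        # made-up ground-truth linear model

Xa = np.hstack([np.ones((100, 1)), X])         # Step 1: augment with x0 = 1 so w[0] is the bias b
w = np.zeros(3)
alpha = 0.1
for t in range(500):                           # Step 3: gradient descent
    grad = (-2.0 / 100) * Xa.T @ (y - Xa @ w)  # Step 2: gradient of the MSE loss
    w -= alpha * grad
print(w)  # approaches [3.0, 1.5, -2.0]
```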
Demo
• Use an ML library:
  https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html
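For reference, the library version takes only a few lines (scikit-learn's LinearRegression solves the least-squares problem directly rather than by gradient descent; the data here reuses the made-up example above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # n×d feature matrix (no x0 column needed)
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1]  # made-up targets

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)     # bias b and weights w
y_hat = model.predict(X)
```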
