
Lec 4 – More on Linear Regression and Gradient Descent
Mariette Awad

Slide sources for this set of slides: Stanford Intro to ML course


Lecture Outline
• Multivariate Linear Regression
• Stochastic Gradient Descent
• Practical Tricks to make GD work well - Feature Scaling
• Practical Tricks to make GD work well - Plotting J(θ) and choosing the Learning rate
• Polynomial Regression
Lecture Outcomes
• What is Multivariate Linear Regression
• What is Stochastic Gradient Descent
• What are Practical Tricks to make GD work well
• Feature Scaling
• Plotting J(θ)
• Choosing the Learning rate well
• What is Polynomial Regression
Multivariate Linear Regression
or Multiple Features Hypothesis
Linear regression with multiple features (variables).

Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
2104           5                    1                  45                    460
1416           3                    2                  40                    232
1534           3                    2                  30                    315
852            2                    1                  36                    178
…              …                    …                  …                     …

Before: a single feature (size) was used to predict price.
Now: we have more features (e.g. size, # of bedrooms, # of floors, age of home).

Notation:
$n$ = number of features
$m$ = number of training examples
$x^{(i)}$ = input (features) of the $i$-th training example
$x_j^{(i)}$ = value of feature $j$ in the $i$-th training example
Multivariate linear regression
For convenience of notation, define $x_0 = 1$.

Hypothesis:
$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n = \theta^T x$

Parameters:
$\theta_0, \theta_1, \dots, \theta_n$ (the vector $\theta$)

Cost function:
$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

Gradient descent:
Repeat {
  $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
}
(simultaneously update $\theta_j$ for every $j = 0, \dots, n$)

Gradient Descent

New algorithm ($n \ge 1$):
Repeat {
  $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
}
(simultaneously update $\theta_j$ for $j = 0, 1, \dots, n$)

Previously ($n = 1$):
Repeat {
  $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
  $\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$
}
(simultaneously update $\theta_0, \theta_1$)
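As a concrete illustration (not part of the original slides), the following NumPy sketch implements the batch update above; the names X, y, theta, and alpha are my own, and X is assumed to already contain the $x_0 = 1$ column.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for multivariate linear regression.

    X : (m, n+1) array whose first column is all ones (x0 = 1)
    y : (m,) array of target values
    Returns the learned parameter vector theta of shape (n+1,).
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        errors = X @ theta - y                # h_theta(x^(i)) - y^(i) for all i
        gradient = (X.T @ errors) / m         # partial derivatives of J(theta)
        theta = theta - alpha * gradient      # simultaneous update of every theta_j
    return theta
```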
Stochastic Gradient Descent
(SGD)
Stochastic Gradient Descent (SGD)
• So far, what we have used is called batch gradient descent, since it uses all the training data (i.e. the whole batch) in every iteration.
• There are other versions of gradient descent such as SGD:
Batch Gradient Descent:
Repeat until convergence {
  $\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$   (for every $j$)
}

Stochastic Gradient Descent:
Loop {
  for $i = 1$ to $m$ {
    $\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$   (for every $j$)
  }
}
SGD versus Batch GD
• Batch gradient descent has to process the entire training set on every iteration before making a single update to the parameters.
• SGD parameter updates are conducted with one training sample at
every step.
• Often, SGD gets θ “close” to the minimum much faster than batch
gradient descent. Note however that it may never “converge” to the
minimum
• The parameters θ will keep oscillating around the minimum of J(θ);
but in practice most of the values near the minimum will be
reasonably good approximations to the true minimum.
• When the training set is large, SGD is preferred.
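A minimal sketch of the per-example SGD loop, using the same X, y, alpha conventions as the batch sketch above (my naming); shuffling the examples each pass is a common practical addition that is not shown on the slide.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, num_epochs=10, seed=0):
    """SGD for linear regression: one parameter update per training example."""
    rng = np.random.default_rng(seed)
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_epochs):
        for i in rng.permutation(m):              # visit examples in shuffled order
            error = y[i] - X[i] @ theta           # y^(i) - h_theta(x^(i))
            theta = theta + alpha * error * X[i]  # updates every theta_j at once
    return theta
```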
SGD (for linear regression) updates in vector form

$\theta := \theta + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}$

Note here:
• $\theta$ is a vector in this equation with dimension = (number of features + 1).
• $x^{(i)}$ is the vector of feature values for the $i$-th (one) training sample.
Practical Tricks to make gradient
descent work well – Feature
Scaling
Feature Scaling
• If one feature takes values that are much larger than another's, the contour plot of the cost function will likely be skewed (elongated) in the direction orthogonal to the parameter of the large-valued feature, since any change in that parameter changes the cost dramatically.
• Accordingly, gradient descent may take a long time to converge; it may actually oscillate between steps on its way to converging.
• To address this problem, we scale all features by their maximum values to bring them all into a similar range (or close to the same range).
• Mathematically, it can be shown that convergence then takes a more direct path and is faster.
Example
Idea: Make sure features are on a similar scale.
E.g. $x_1$ = size (0–2000 feet²), $x_2$ = number of bedrooms (1–5).
Dividing each feature by its maximum value ($x_1 := \text{size}/2000$, $x_2 := \#\text{bedrooms}/5$) gives
$0 \le x_1 \le 1, \quad 0 \le x_2 \le 1$
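For instance (illustrative code, with a raw feature matrix built from the housing table above), the "divide by the maximum" rescaling is a one-liner in NumPy:

```python
import numpy as np

# Columns: size (feet^2), number of bedrooms
X_raw = np.array([[2104, 5], [1416, 3], [1534, 3], [852, 2]], dtype=float)
X_scaled = X_raw / X_raw.max(axis=0)   # every feature now lies in (0, 1]
```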
Scaling and Mean Normalization

Scaling: Get every feature into approximately a $-1 \le x_j \le 1$ range (or a similarly small range).

Scaling with mean normalization:
Replace $x_j$ with $x_j - \mu_j$ to make the features have approximately zero mean (do not apply this to $x_0 = 1$).

E.g. $x_1 = \dfrac{\text{size} - \mu_1}{2000}$, $x_2 = \dfrac{\#\text{bedrooms} - \mu_2}{5}$, where $\mu_1, \mu_2$ are the average size and average number of bedrooms in the training set.

Other option for normalization: divide by the standard deviation instead of the maximum:
$x_j := \dfrac{x_j - \mu_j}{\sigma_j}$
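Both variants, sketched in NumPy (continuing the X_raw example from above; the variable names are mine):

```python
import numpy as np

X_raw = np.array([[2104, 5], [1416, 3], [1534, 3], [852, 2]], dtype=float)
mu = X_raw.mean(axis=0)

X_mean_norm = (X_raw - mu) / X_raw.max(axis=0)     # mean normalization, scaled by the maximum
X_standardized = (X_raw - mu) / X_raw.std(axis=0)  # the "divide by standard deviation" option
```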


Practical Tricks to make gradient
descent work well – Learning
Rate
Debugging - Making sure gradient descent is working correctly
• The job of gradient descent is to find the θ that minimizes the cost function.
• Debugging approach: plot the cost function J(θ) against the number of iterations.
• For example, run gradient descent for 100 iterations and evaluate the cost function at the value of θ obtained after 100 iterations; do the same after 200 iterations, and so on.
• If gradient descent is working properly, J(θ) should decrease after every iteration.
[Figure: J(θ) decreasing with the number of iterations (0–400)]
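For instance, one could record J(θ) during training and plot it with matplotlib; this is a sketch under my own naming (compute_cost, gradient_descent_with_history, and the synthetic data are not from the slides).

```python
import numpy as np
import matplotlib.pyplot as plt

def compute_cost(X, y, theta):
    """J(theta) = (1 / 2m) * sum of squared errors."""
    m = len(y)
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)

def gradient_descent_with_history(X, y, alpha, num_iters):
    """Batch gradient descent that also records J(theta) at every iteration."""
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(num_iters):
        history.append(compute_cost(X, y, theta))
        theta = theta - alpha * (X.T @ (X @ theta - y)) / m
    return theta, history

# Small synthetic example just to exercise the plot.
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.uniform(0, 1, 100)]   # bias column + one scaled feature
y = 4 + 3 * X[:, 1] + rng.normal(0, 0.1, 100)

theta, history = gradient_descent_with_history(X, y, alpha=0.1, num_iters=400)
plt.plot(history)
plt.xlabel("No. of iterations")
plt.ylabel("J(theta)")
plt.show()
```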
Number of Iterations
• Assume that after 300–400 iterations the cost is not changing much; then you know gradient descent has converged, or is close to convergence.
• Note that the number of iterations needed may be very different for different algorithms.
• For some, it may be 30 iterations; for others, it could be 3 million. The best way to know is from the plot.
Easy to spot failure
• We can also find out when gradient descent is not working: if we plot the cost function against the number of iterations and notice that it is going up, then gradient descent is not converging.
[Figure: J(θ) increasing with the number of iterations]
Automatic Convergence Test
• Example automatic convergence test: declare convergence if J(θ) decreases by less than some small threshold $\varepsilon$ (e.g. $10^{-3}$) in one iteration (sketched below).
• The problem with this approach is that it is very hard to decide what threshold to choose. Checking the plot is a better approach.
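A sketch of how such a test could be wired into the training loop (the epsilon value and the structure are illustrative, not prescribed by the slides):

```python
import numpy as np

def gradient_descent_auto_stop(X, y, alpha=0.1, epsilon=1e-3, max_iters=10_000):
    """Stop once J(theta) decreases by less than epsilon in one iteration."""
    m, n = X.shape
    theta = np.zeros(n)
    prev_cost = float("inf")
    for it in range(max_iters):
        errors = X @ theta - y
        cost = (errors @ errors) / (2 * m)
        if prev_cost - cost < epsilon:        # declare convergence
            return theta, it
        prev_cost = cost
        theta = theta - alpha * (X.T @ errors) / m
    return theta, max_iters
```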
Choice of Learning Rate
• If gradient descent is not converging, a good approach is to choose a smaller learning rate ($\alpha$).
• Example scenarios where gradient descent is not converging:
  • The cost function increases with the iterations.
  • The cost function oscillates with the iterations.
• It can be shown that if the learning rate is chosen small enough, the cost function J(θ) will decrease on every iteration.
• However, it may take a while for the algorithm to converge when the learning rate is too small.
• Advice for choosing $\alpha$: try …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, … (see the sketch below).
  • Then choose the value that gives you the fastest rate of decrease while still converging.
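One way to follow that advice in code, assuming the gradient_descent_with_history helper sketched earlier and some feature matrix X and targets y (all names are mine):

```python
import matplotlib.pyplot as plt

# Try a geometric sequence of learning rates and compare the cost curves.
for alpha in (0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0):
    _, history = gradient_descent_with_history(X, y, alpha=alpha, num_iters=200)
    plt.plot(history, label=f"alpha = {alpha}")
plt.xlabel("No. of iterations")
plt.ylabel("J(theta)")
plt.legend()
plt.show()
```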
Choice of features and
polynomial regression
Choice of features
• Sometimes you may be given a set of features, but decide you need other features that can be derived from the existing ones.
• Example:
  • Consider the case where you are given the frontage of a house and the depth of a house, and you want to predict house prices.
  • You may choose to compute the area, which is the product of frontage and depth (see the sketch after this list).
• In some cases, we may want new features that are squares or cubes of the original features.
• This would lead to polynomial regression.
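A hypothetical illustration of building such a derived feature with NumPy (the frontage and depth values are made up):

```python
import numpy as np

frontage = np.array([40.0, 25.0, 60.0])   # hypothetical street-facing widths (feet)
depth = np.array([80.0, 120.0, 50.0])     # hypothetical lot depths (feet)

area = frontage * depth                   # derived feature: x = frontage * depth
X = np.c_[np.ones(len(area)), area]       # bias column + the single derived feature
```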
Polynomial regression

For the given data, one possible option is a quadratic model:
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$
But this may cause the predicted price to drop for larger values of size.

Another alternative may be a cubic model:
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$

[Figure: price (y) versus size (x) with the fitted polynomial curve]

This can be represented using linear regression modeling by defining the features $x_1 = x$, $x_2 = x^2$, $x_3 = x^3$.
The same gradient descent is then applicable as in linear regression models.


Other choices of features to match data patterns

[Figure: price (y) versus size (x) showing a saturating trend]

By having insight into the fact that the square root has a saturating pattern, you might choose:
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 \sqrt{x}$
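As a concrete illustration, polynomial regression can be implemented by constructing the extra features, scaling them, and reusing the batch_gradient_descent sketch from earlier (all names and data values here are mine, not the slides'):

```python
import numpy as np

size = np.array([852.0, 1416.0, 1534.0, 2104.0])   # house sizes (feet^2) from the table above
price = np.array([178.0, 232.0, 315.0, 460.0])     # prices in $1000

# Build polynomial features x, x^2, x^3 and scale each by its maximum
# so that gradient descent behaves well on them.
features = np.c_[size, size**2, size**3]
features = features / features.max(axis=0)
X = np.c_[np.ones(len(size)), features]            # prepend the x0 = 1 column

theta = batch_gradient_descent(X, price, alpha=0.1, num_iters=5000)
print(theta)   # parameters of the cubic model h_theta(x) = theta^T x
```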
