Lecture 4 - More On Linear Regression and Polynomial Regression
Notation:
n = number of features
m = number of training examples
x^(i) = input (features) of the i-th training example
x_j^(i) = value of feature j in the i-th training example
Multivariate linear regression
For convenience of notation, define x_0 = 1.
Hypothesis: h_θ(x) = θ^T x = θ_0 x_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_n x_n
Parameters: θ_0, θ_1, ..., θ_n (an (n+1)-dimensional vector θ)
Cost function: J(θ) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
Gradient descent:
Repeat {
    θ_j := θ_j − α ∂J(θ)/∂θ_j = θ_j + α Σ_{i=1}^{m} (y^(i) − h_θ(x^(i))) x_j^(i)
    (simultaneously update θ_j for every j = 0, ..., n)
}
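To make this update concrete, here is a minimal NumPy sketch of batch gradient descent for multivariate linear regression; the function name, learning rate, and iteration count are placeholder choices for illustration, not values from the lecture.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Multivariate linear regression fit with batch gradient descent.

    X: (m, n) matrix of features, y: (m,) vector of targets.
    """
    m = X.shape[0]
    Xb = np.c_[np.ones(m), X]            # prepend x_0 = 1 to every example
    theta = np.zeros(Xb.shape[1])        # parameters theta_0 ... theta_n
    for _ in range(num_iters):
        errors = y - Xb @ theta          # y^(i) - h_theta(x^(i)) for all i at once
        theta = theta + alpha * (Xb.T @ errors)   # simultaneous update of all theta_j
    return theta
```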
Stochastic Gradient Descent (SGD)
• So far, what we have used is called batch gradient descent, since it uses all of the training data (i.e., we look at the whole batch) in every iteration.
• There are other versions of gradient descent, such as SGD:
Batch Gradient Descent:
Repeat until convergence {
    θ_j := θ_j + α Σ_{i=1}^{m} (y^(i) − h_θ(x^(i))) x_j^(i)   (for every j)
}

Stochastic Gradient Descent:
Loop {
    for i = 1 to m {
        θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)   (for every j)
    }
}
SGD versus Batch GD
• Batch gradient descent has to process the entire training set at every
iteration of making an update to the parameters.
• SGD parameter updates are conducted with one training sample at
every step.
• Often, SGD gets θ "close" to the minimum much faster than batch gradient descent. Note, however, that it may never "converge" to the minimum.
• The parameters θ will keep oscillating around the minimum of J(θ);
but in practice most of the values near the minimum will be
reasonably good approximations to the true minimum.
• When the training set is large, SGD is preferred.
SGD (for linear regression) updates in vector form
θ := θ + α (y^(i) − h_θ(x^(i))) x^(i)
Note here:
• θ is a vector in this equation, with dimension = (number of features + 1).
• x^(i) is the vector of feature values for the i-th (single) training sample.
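A minimal NumPy sketch of this vector-form SGD update follows; the epoch count, learning rate, and random shuffling order are illustrative assumptions.

```python
import numpy as np

def sgd_linear_regression(X, y, alpha=0.01, num_epochs=5):
    """Stochastic gradient descent for linear regression: one sample per update."""
    m = X.shape[0]
    Xb = np.c_[np.ones(m), X]                     # x_0 = 1 for every example
    theta = np.zeros(Xb.shape[1])
    for _ in range(num_epochs):
        for i in np.random.permutation(m):        # visit training samples in random order
            error = y[i] - Xb[i] @ theta          # scalar: y^(i) - h_theta(x^(i))
            theta = theta + alpha * error * Xb[i] # update the whole vector theta at once
    return theta
```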
Practical Tricks to make gradient descent work well – Feature Scaling
Feature Scaling
• If one feature takes values that are much larger than another's, the contour plot of the cost function will likely be elongated in the direction orthogonal to the parameter of the larger-valued feature, since any change in that parameter changes the cost dramatically.
• Accordingly, gradient descent may take a long time to converge; it may oscillate back and forth on its way to the minimum.
• To address this problem, we scale all features by their maximum values to bring them all to a similar range (or close to the same range).
• It can be shown that convergence then takes a more direct path and is faster.
Example
Idea: Make sure features are on a similar scale.
E.g. x_1 = size (0–2000 feet²), x_2 = number of bedrooms (1–5).
Scaling by the maximum values gives x_1 = size (feet²) / 2000 and x_2 = (number of bedrooms) / 5, so that
0 ≤ x_1 ≤ 1, 0 ≤ x_2 ≤ 1
Scaling and Mean Normalization
Scaling: Get every feature into approximately a −1 ≤ x_i ≤ 1 range.
Mean normalization: Replace x_i with x_i − μ_i so that each feature has approximately zero mean (do not apply this to x_0 = 1).
E.g. replace the size feature by (size − average size) / (maximum size).
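A short NumPy sketch of both ideas, assuming a feature matrix `X` with one column per feature (the helper names are hypothetical):

```python
import numpy as np

def scale_by_max(X):
    """Divide each feature (column) by its maximum absolute value."""
    return X / np.max(np.abs(X), axis=0)

def mean_normalize(X):
    """Subtract each feature's mean, then divide by its range (max - min)."""
    mu = X.mean(axis=0)
    spread = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / spread
```

Either transform would be applied to the training features before running gradient descent, but not to the constant x_0 = 1 column.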
Example automatic convergence test:
Declare convergence if J(θ) decreases by less than some small threshold ε in one iteration.
[Plot: J(θ) versus the number of iterations, decreasing as gradient descent runs.]
• The problem with this approach is that it is very hard to decide what threshold to choose. Checking the plot of J(θ) against the number of iterations is a better approach.
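As a sketch, the automatic test can be implemented by recording J(θ) after every iteration and stopping once the decrease falls below a threshold; the threshold `eps` and learning rate here are placeholders, and plotting the recorded history is the check recommended above.

```python
import numpy as np

def cost(theta, Xb, y):
    """J(theta) = 1/2 * sum of squared prediction errors."""
    errors = Xb @ theta - y
    return 0.5 * float(errors @ errors)

def gradient_descent_with_history(Xb, y, alpha=0.01, eps=1e-3, max_iters=10000):
    """Run batch gradient descent, recording J(theta) so it can be plotted."""
    theta = np.zeros(Xb.shape[1])
    history = [cost(theta, Xb, y)]
    for _ in range(max_iters):
        theta = theta + alpha * (Xb.T @ (y - Xb @ theta))
        history.append(cost(theta, Xb, y))
        if history[-2] - history[-1] < eps:    # automatic convergence test
            break
    return theta, history                      # plot history vs. iteration number
```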
Choice of Learning Rate
• If gradient descent is not converging, a good approach is to choose a
smaller learning rate (alpha)
• Example scenarios with gradient descent not converging:
• Cost function increasing with iterations
• Cost function oscillating with iterations
• It can be shown that if the learning rate is chosen small enough, the cost J(θ) will decrease on every iteration of gradient descent.
• However, it may take a while for the algorithm to converge when the
learning rate is too small.
Advice to choose α: try ..., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ...
• Then choose the one that gives you the fastest rate of decrease while still converging
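A sketch of that procedure, reusing the hypothetical `gradient_descent_with_history` helper from the earlier sketch and placeholder data `Xb`, `y`:

```python
# Suggested grid of learning rates from the slide.
candidate_alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]

final_costs = {}
for a in candidate_alphas:
    theta, history = gradient_descent_with_history(Xb, y, alpha=a, max_iters=200)
    # Reject runs where the cost ever increased (diverging or oscillating).
    increased = any(curr > prev for prev, curr in zip(history, history[1:]))
    if not increased:
        final_costs[a] = history[-1]          # cost reached within the fixed budget

best_alpha = min(final_costs, key=final_costs.get)   # fastest decrease among converging runs
```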
Choice of features and polynomial regression
Choice of features
• Sometimes you may be given a set of features, but decide that you need other features that can be derived from the existing ones.
• Example:
• Consider the case where you are given frontage of house and depth of house,
and you want to compute house prices.
• You may choose to compute the area, which is the product of frontage and
depth.
• In some cases, we may want new features as squares or cubes of
original features.
• This would lead to polynomial regression.
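A small sketch of constructing such derived features, with made-up frontage/depth values purely for illustration:

```python
import numpy as np

# Hypothetical raw features for a few houses (illustrative values only).
frontage = np.array([50.0, 40.0, 60.0])
depth = np.array([30.0, 45.0, 20.0])

area = frontage * depth                              # derived feature: frontage x depth

# Polynomial features built from the single "area" feature.
X_poly = np.column_stack([area, area**2, area**3])   # x, x^2, x^3

# Feature scaling matters even more here: area^3 is vastly larger than area.
X_poly = X_poly / np.max(np.abs(X_poly), axis=0)
```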
Polynomial regression
For the given data (price y versus size x), one possible option is a quadratic model:
h_θ(x) = θ_0 + θ_1 x + θ_2 x²
[Plot: price (y) versus size (x) with a quadratic fit]
But this may cause a drop in the predicted price for larger values of size, since a quadratic eventually turns back down.
Another alternative, by having insight into the fact that the square root has a saturating pattern, may be:
h_θ(x) = θ_0 + θ_1 x + θ_2 √x
[Plot: price (y) versus size (x) with a square-root fit]
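As a sketch, the square-root model is still linear in the parameters, so it can be fit by least squares once √(size) is added as a feature; the size/price arrays below are made-up illustrative values, not data from the lecture.

```python
import numpy as np

# Hypothetical training data (size in square feet, price in $1000s).
size = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
price = np.array([200.0, 280.0, 340.0, 390.0, 430.0])

# Design matrix for h(x) = theta_0 + theta_1 * size + theta_2 * sqrt(size).
Xb = np.column_stack([np.ones_like(size), size, np.sqrt(size)])

theta, *_ = np.linalg.lstsq(Xb, price, rcond=None)   # least-squares fit
predicted = Xb @ theta                                # predictions from the square-root model
```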