07: Regularization: The Problem of Overfitting
To recap, if we have too many features then the learned hypothesis may fit the training set so well
that the cost function is driven to (nearly) exactly zero
But this tries too hard to fit the training set
Fails to provide a general solution - unable to generalize (apply to new examples)
Addressing overfitting
Later we'll look at identifying when overfitting and underfitting are occurring
Earlier we just plotted a higher order function - saw that it looks "too curvy"
Plotting hypothesis is one way to decide, but doesn't always work
Often we have lots of features - here it's not just a case of selecting the degree of a
polynomial; it's also harder to plot the data and visualize it to decide which features to
keep and which to drop
If you have lots of features and little data - overfitting can be a problem
How do we deal with this?
1) Reduce number of features
Manually select which features to keep
Model selection algorithms are discussed later (good for reducing number of
features)
But, in reducing the number of features we lose some information
Ideally select those features which minimize data loss, but even so, some
info is lost
2) Regularization
Keep all features, but reduce magnitude of parameters θ
Works well when we have a lot of features, each of which contributes a bit to
predicting y
We can modify our cost function by adding large penalty terms for θ3 and θ4 (see the sketch below)
So here we end up with θ3 and θ4 being driven close to zero (because the penalty constants are
massive)
So we're basically left with a quadratic function
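As a concrete sketch of that modified objective (the 1000s are just arbitrarily large constants
chosen to force θ3 and θ4 towards zero):
    min over θ of  (1/(2m)) Σ_i (h_θ(x^(i)) − y^(i))^2  +  1000·θ3^2  +  1000·θ4^2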
Regularization
Small values for the parameters correspond to a simpler hypothesis
(you effectively get rid of some of the terms)
A simpler hypothesis is less prone to overfitting
Another example
Have 100 features x1, x2, ..., x100
Unlike the polynomial example, we don't know which terms are the high-order ones
How do we pick which parameters to shrink?
With regularization, take the cost function and modify it to shrink all the parameters
Add a regularization term at the end:
    J(θ) = (1/(2m)) [ Σ_{i=1..m} (h_θ(x^(i)) − y^(i))^2  +  λ Σ_{j=1..n} θj^2 ]
This regularization term shrinks every parameter
By convention you don't penalize θ0 - the regularization sum runs from θ1 onwards (a code
sketch of this cost follows below)
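A minimal Octave sketch of this regularized linear regression cost (the function and variable
names here are my choices, not from the course):

    function J = regularizedCost(theta, X, y, lambda)
      % Regularized linear regression cost
      % X is m x (n+1) with a leading column of ones, theta is (n+1) x 1
      m = length(y);
      predictions = X * theta;                   % h_theta(x) for every training example
      sqErr = sum((predictions - y) .^ 2);       % sum of squared errors
      reg = lambda * sum(theta(2:end) .^ 2);     % skip theta(1), i.e. theta_0
      J = (1 / (2 * m)) * (sqErr + reg);
    end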
Previously, gradient descent would repeatedly update the parameters θj, where j = 0,1,2,...,n,
simultaneously
With the regularized cost function the updates become (θ0 keeps its original, unregularized update):
    θ0 := θ0 − α (1/m) Σ_i (h_θ(x^(i)) − y^(i)) x_0^(i)
    θj := θj − α [ (1/m) Σ_i (h_θ(x^(i)) − y^(i)) x_j^(i)  +  (λ/m) θj ]    for j = 1,...,n
The θj update can be rewritten as
    θj := θj (1 − α λ/m) − α (1/m) Σ_i (h_θ(x^(i)) − y^(i)) x_j^(i)
The term (1 − α λ/m) is a number just slightly less than 1, so each iteration shrinks θj a little
before applying the usual gradient step
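A minimal Octave sketch of one such simultaneous update (again, the names are mine, not from the
course):

    function theta = gradientDescentStep(theta, X, y, alpha, lambda)
      % One simultaneous regularized gradient descent update for linear regression
      m = length(y);
      grad = (1 / m) * (X' * (X * theta - y));                   % unregularized gradient
      grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);   % add (lambda/m)*theta_j for j >= 1 only
      theta = theta - alpha * grad;
    end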
We saw earlier that logistic regression can be prone to overfitting with lots of features
The regularized logistic regression cost function is as follows:
    J(θ) = −(1/m) Σ_i [ y^(i) log(h_θ(x^(i))) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]  +  (λ/(2m)) Σ_{j=1..n} θj^2
Again, to modify the algorithm we simply need to modify the update rule for θ1 onwards (sketched in code after this block)
Looks cosmetically the same as linear regression, except obviously the hypothesis
is very different
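A minimal Octave sketch of the regularized logistic regression cost and gradient (the function name
lrCostFunction and the inline sigmoid are my assumptions, not given in the notes):

    function [J, grad] = lrCostFunction(theta, X, y, lambda)
      % Regularized logistic regression cost and gradient
      m = length(y);
      h = 1 ./ (1 + exp(-X * theta));            % sigmoid hypothesis
      J = (-1 / m) * (y' * log(h) + (1 - y)' * log(1 - h)) ...
          + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);
      grad = (1 / m) * (X' * (h - y));
      grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);   % regularize j >= 1 only
    end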
Use fminunc
Pass it a @costFunction handle
fminunc minimizes in an optimized manner using the cost function
The cost function must compute two things:
jVal
Need code to compute J(θ)
Need to include the regularization term
Gradient
Each entry needs to be the partial derivative of J(θ) with respect to θj
Adding the appropriate (λ/m)θj regularization term here is also necessary (for j ≥ 1, not for θ0)
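A hedged sketch of wiring this up in Octave, reusing the lrCostFunction above (the option values
and lambda here are illustrative, not prescribed by the notes):

    % initialTheta is an (n+1) x 1 column vector; X and y are the training data
    options = optimset('GradObj', 'on', 'MaxIter', 400);   % 'GradObj','on' => costFunction also returns the gradient
    initialTheta = zeros(size(X, 2), 1);
    lambda = 1;                                            % illustrative regularization strength
    [optTheta, functionVal, exitFlag] = ...
        fminunc(@(t)(lrCostFunction(t, X, y, lambda)), initialTheta, options);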