Choosing Model and Tuning
• Now that we have looked at many examples of bad data, let’s look at
a couple of examples of bad algorithms.
• Overfitting the Training Data
• Underfitting the Training Data
Overfitting the Training Data
• Complex models such as deep neural networks can detect subtle patterns
in the data, but if the training set is noisy, or if it is too small (which
introduces sampling noise), then the model is likely to detect patterns in
the noise itself.
• Obviously these patterns will not generalize to new instances.
• For example, say you feed your life satisfaction model many more
attributes, including uninformative ones such as the country’s name.
• In that case, a complex model may detect patterns like the fact that all countries in the training data with a w in their name have a life satisfaction greater than 7: New Zealand (7.3), Norway (7.4), Sweden (7.2), and Switzerland (7.5).
• How confident are you that the W-satisfaction rule generalizes to Rwanda or
Zimbabwe?
• Obviously, this pattern occurred in the training data by pure chance, but the model
has no way to tell whether a pattern is real or simply the result of noise in the data.
…
• There are several ways to prevent overfitting:
• If the training set is too small (so the model picks up sampling noise), gather more relevant, clean training data.
• If the number of features is too large, perform feature selection and remove unnecessary features.
• Constrain the model to make it simpler and reduce the risk of overfitting ➔ this is called regularization (see the sketch after this list). For example:
• select a model with fewer parameters (e.g., a linear model rather than a high-degree polynomial model);
• the fewer degrees of freedom a model has, the harder it is for it to overfit the data.
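As a minimal sketch of regularization in code, one could constrain a linear model with an L2 penalty (Ridge regression); the variable names X_train and y_train below are placeholders, not from the running example:
from sklearn.linear_model import Ridge
# Ridge = linear regression with an L2 penalty on the weights;
# the alpha hyperparameter controls how strongly the model is constrained.
ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X_train, y_train)
Larger alpha values shrink the weights more aggressively, trading a little bias for lower variance.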
End-to-End ML Project
Part(2)
Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow, by Aurelien Geron
5. Select a Model and Train it.
• How are you supposed to decide which learning algorithm to pick for a given task?
• The best answer is to let the data decide: try a number of different
approaches and see what works best.
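The training step behind the “working Linear Regression model” mentioned next is simply the following (a minimal sketch, assuming housing_prepared and housing_labels from the data-preparation part of the project):
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)  # fit on the fully prepared training set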
• Done! You now have a working Linear Regression model. Let’s try it out on a few instances from the training set:
some_data = housing.iloc[:5]             # housing: the training set before preprocessing
some_labels = housing_labels.iloc[:5]
some_data_prepared = preparingData(some_data, cat_enc, imputer, stsc)
print("Predictions:", lin_reg.predict(some_data_prepared))
print("Labels:", list(some_labels))
…
• It works, although the predictions are not exactly accurate.
• Predictions: [85657.90, 305492.60, 152056.46, 186095.70, 244550.67]
• Labels: [72100.0, 279600.0, 82700.0, 112500.0, 238300.0]
…
import pandas as pd  # needed for DataFrame construction and concat below

def preparingData(someData, cat_enc, imputer, stsc):
    # someData: unlabeled instances before preprocessing;
    # cat_enc, imputer, stsc: encoder, imputer and scaler already fitted on the training data.
    someData_num = someData.drop("ocean_proximity", axis=1)
    someData_cat = someData[["ocean_proximity"]]
    # One-hot encode the categorical attribute.
    n = cat_enc.transform(someData_cat)
    someData_cat_1hot = pd.DataFrame(n.toarray(), columns=list(cat_enc.categories_[0]))
    # Fill missing numerical values with the fitted imputer.
    X = imputer.transform(someData_num)
    someData_num = pd.DataFrame(X, columns=someData_num.columns)
    # Add the combined attributes used during training.
    someData_num["rooms_per_household"] = someData_num["total_rooms"] / someData_num["households"]
    someData_num["population_per_household"] = someData_num["population"] / someData_num["households"]
    someData_num["bedrooms_per_room"] = someData_num["total_bedrooms"] / someData_num["total_rooms"]
    # Standardize the numerical attributes.
    n0 = stsc.transform(someData_num)
    someData_num = pd.DataFrame(n0, columns=someData_num.columns)
    return pd.concat([someData_num, someData_cat_1hot], axis=1)
…
• Let’s measure this regression model’s RMSE on the whole training set (already prepared) using Scikit-Learn’s mean_squared_error function:
import numpy as np
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
…
• The resulting RMSE is large relative to typical district median house values, so the predictions are far from satisfying. This is an example of a model underfitting the training data. When this happens it can mean that the features do not provide enough information to make good predictions, or that the model is not powerful enough.
• Let’s try a more complex model to see how it does: a DecisionTreeRegressor. This is a powerful model, capable of finding complex nonlinear relationships in the data:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
…
• Now that the model is trained, let’s evaluate it on the training set:
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse) ➔ 0.0
• Wait, what!? No error at all? Could this model really be absolutely perfect?
• It is much more likely that the model has badly overfit the data.
• How can you be sure? As we saw earlier, you don’t want to touch the test
set until you are ready to launch a model you are confident about, so you
need to use part of the training set for training, and part for model
validation.
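One simple option is to carve a validation set out of the training set (a sketch; the 80/20 split and random_state are assumptions):
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(
    housing_prepared, housing_labels, test_size=0.2, random_state=42)
A more thorough alternative, described next, is K-fold cross-validation.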
…
• Shuffle the dataset and split it into k equal folds.
• In the first iteration, the first fold is used as the validation set while all the remaining folds together form the training set.
• Train the model on this training set and evaluate it on the validation set. Keep the evaluation score and discard the model.
• In the next iteration, select a different fold as the validation set: re-train the model on the remaining folds, validate it on the new validation set, keep the evaluation score, and discard the model.
• Repeat this k times in total, so you end up with k evaluation scores.
• The overall error estimate is the average of these k individual evaluation scores.
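A sketch of the cross-validation call that the note below refers to (assuming tree_reg, housing_prepared and housing_labels from the earlier steps):
import numpy as np
from sklearn.model_selection import cross_val_score
# 10-fold cross-validation of the Decision Tree on the training set.
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)  # negate the (negative) scores, then take the square root
print("Mean:", tree_rmse_scores.mean())
print("Standard deviation:", tree_rmse_scores.std())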
Scikit-Learn’s cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function is actually the opposite of the MSE (i.e., a negative value), which is why the code computes –scores before calculating the square root.
…
• Let’s try one last model now: the RandomForestRegressor.
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(housing_prepared, housing_labels)
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse) ➔ 18650.698
…
• Cross-validation with the RandomForestRegressor:
from sklearn.model_selection import cross_val_score
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
print("Scores:", forest_rmse_scores)
print("Mean:", forest_rmse_scores.mean()) ➔ 50435
print("Standard deviation:", forest_rmse_scores.std()) ➔ 2203
This is much better: Random Forests look very promising. However, note that the score on the training set is
still much lower than on the validation sets, meaning that the model is still overfitting the training set.
…
• If you want out-of-sample predictions rather than scores, you can use the cross_val_predict() function:
• Just like the cross_val_score() function, cross_val_predict() performs
K-fold cross-validation, but instead of returning the evaluation scores,
it returns the predictions made on each test fold. This means that you
get a clean prediction for each instance in the training set (“clean”
meaning that the prediction is made by a model that never saw the
data during training).
• For example, if there are 5 folds, then the output of each sample is
its predicted value when it was in the test fold (trained on the other 4
folds of data).
from sklearn.model_selection import cross_val_predict
# clf, X_train, y_train: a generic estimator and training set (illustrative).
y_train_pred = cross_val_predict(clf, X_train, y_train, cv=3)
6. Fine-Tune Your Model.
• Let’s assume that you now have a shortlist of promising models. You
now need to fine-tune them.
• One way to do that would be to fiddle with the hyperparameters manually (e.g., the regularization strength) until you find a great combination of hyperparameter values ➔ tedious work.
• On the other hand, suppose that the linear model generalizes better, but you want to apply some regularization to avoid overfitting. The question is: how do you choose the value of the regularization hyperparameter? One option is to train 100 different models using 100 different values for this hyperparameter.
• Tuning hyperparameters is an important part of building a Machine
Learning system
A hyperparameter is a parameter of a learning algorithm (not of the model). As such, it is not affected by
the learning algorithm itself; it must be set prior to training and remains constant during training.
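For instance (an illustrative sketch): max_depth below is a hyperparameter, chosen before training and held fixed, whereas the tree’s split thresholds are model parameters learned from the data.
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(max_depth=5)   # max_depth: hyperparameter, set by you
tree_reg.fit(housing_prepared, housing_labels)  # split thresholds: parameters, learned from data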
…
• Scikit-Learn provides several ways to do this in a simple way:
1. Grid Search: all you need to do is tell it which hyperparameters you want it to experiment with and what values to try out, and it will evaluate all the possible combinations of hyperparameter values, using cross-validation.
2. Randomized Search: instead of trying out all possible combinations, it evaluates a given number of random combinations by selecting a random value for each hyperparameter at every iteration (see the sketch after this list).
3.Ensemble Methods: try to combine the models that perform best.
The group (or “ensemble”) will often perform better than the best
individual model (just like Random Forests perform better than the
individual Decision Trees they rely on), especially if the individual
models make very different types of errors.
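A minimal RandomizedSearchCV sketch for comparison (assuming forest_reg, housing_prepared and housing_labels from earlier; the search ranges and n_iter are illustrative assumptions):
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
    'n_estimators': randint(low=1, high=200),   # illustrative ranges
    'max_features': randint(low=1, high=8),
}
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error',
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)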
6.1 Grid Search
• The following code searches for the best combination of hyperparameter values for the
RandomForestRegressor:
from sklearn.model_selection import GridSearchCV
param_grid = [
# First, try 12 (3×4) combinations of hyperparameters
{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
# then try 6 (2×3) combinations with bootstrap set as False
{'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}
]
forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
scoring='neg_mean_squared_error',
return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
print(grid_search.best_params_) ➔ {'max_features': 8, 'n_estimators': 30}
…
• Since 8 and 30 are the maximum values that were evaluated, you
should probably try searching again with higher values, since the score
may continue to improve.
• The RMSE score for this combination is 49,682, which is slightly better
than the score you got earlier using the default hyperparameter values
(which was 50,182).
• You can also get the best estimator directly:
print(grid_search.best_estimator_)
➔ RandomForestRegressor(max_features=8, n_estimators=30, random_state=42)
Don’t forget that you can treat some of the data preparation steps as hyperparameters. For
example, the grid search will automatically find out whether or not to add a feature you were
not sure about. It may similarly be used to automatically find the best way to handle outliers,
missing features, feature selection, and more.
…
• The evaluation scores are also available:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
print(np.sqrt(-mean_score), params)
Learning Curves
• How can you tell that your model is overfitting or underfitting the data?
• You can use cross-validation to get an estimate of a model’s generalization
performance.
• If a model performs well on the training data but generalizes poorly according to the
cross-validation metrics, then your model is overfitting.
• If it performs poorly on both, then it is underfitting. This is one way to tell when a
model is too simple or too complex.
• Another way is to look at the learning curves:
• these are plots of the model’s performance on the training set and the validation set
as a function of the training set size.
• To generate the plots, simply train the model several times on different sized subsets
of the training set.
• The following code defines a function that plots the learning curves of a model given
some training data:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
plot_learning_curves(lin_reg, X, y)   # X, y: some training data (features and targets)
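The plot_learning_curves helper itself is not shown on the slide; a sketch along the lines of the book’s version (the plotting details are assumptions):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    # Hold out 20% of the data as a fixed validation set.
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    # Train on progressively larger subsets of the training set.
    for m in range(1, len(X_train) + 1):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
    plt.xlabel("Training set size")
    plt.ylabel("RMSE")
    plt.legend()
    plt.show()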
…
• First, let’s look at the performance on the training data:
• when there are just one or two instances in the training set, the model can fit them perfectly,
which is why the curve starts at zero.
• But as new instances are added to the training set, it becomes impossible for the model to fit
the training data perfectly, both because the data is noisy and because it is not linear at all.
So the error on the training data goes up until it reaches a plateau, at which point adding new
instances to the training set doesn’t make the average error much better or worse.
• Now let’s look at the performance of the model on the validation data.
• When the model is trained on very few training instances, it is incapable of generalizing
properly, which is why the validation error is initially quite big.
• Then as the model is shown more training examples, it learns and thus the validation error
slowly goes down. However, once again a straight line cannot do a good job modeling the
data, so the error ends up at a plateau, very close to the other curve.
• These learning curves are typical of an underfitting model. Both curves have
reached a plateau; they are close and fairly high.
…
• Using a different model (a 10th-degree polynomial model), these learning curves look a bit like the previous ones, but there are two very important differences:
• The error on the training data is much lower than with the Linear Regression model.
• There is a gap between the curves. This means that the model performs significantly better on the training data than on the validation data, which is the hallmark of an overfitting model.
• However, if you used a much larger training set, the two curves would continue to get closer.
One way to improve an overfitting model is to feed it more training data until the validation error reaches the training error.
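A sketch of the kind of 10th-degree polynomial model used for these curves (following the book’s approach; the pipeline step names are assumptions):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

polynomial_regression = Pipeline([
    ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
    ("lin_reg", LinearRegression()),
])
plot_learning_curves(polynomial_regression, X, y)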
Feature Selection
• There are a number of techniques to address the feature selection problem:
• Logistic regression, SVMs, and decision trees/forests have methods of determining relative
feature importance; you can run these and keep only the features with highest importance.
• You can use L1 regularization for feature selection in logistic regression and SVM classifiers.
• If the number n of features is reasonably small (say, n < 100) you can use a “build it up”
approach:
• build n one-feature models and determine which is best on your validation set; then
build n–1 two-feature models, and so on, until the gain of adding an additional feature is
below a certain threshold.
• Similarly, you can use a “leave one out” approach: build a model on n features, then n
models on n–1 features and keep the best, and so on until the loss of removing an
additional feature is too great.
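Scikit-Learn’s SequentialFeatureSelector implements essentially these two strategies (a sketch; the estimator, the number of features to keep, and the training data names are illustrative assumptions):
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# direction="forward" corresponds to the "build it up" approach,
# direction="backward" to the "leave one out" approach.
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=10,
                                direction="forward", cv=5)
sfs.fit(X_train, y_train)
X_train_selected = sfs.transform(X_train)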
…
• Scikit-Learn implements the sklearn.feature_selection.SelectFromModel helper, which selects features based on importance weights.
• Any trained estimator that exposes a feature_importances_ or coef_ attribute after fitting can be passed to SelectFromModel for importance-based feature selection.
• DecisionTreeClassifier and RandomForestClassifier have the feature_importances_ attribute.
• LinearRegression, LogisticRegression, and linear support vector machines have the coef_ attribute.
…
• Assuming that we have a pretrained DecisionTreeClassifier model (variable name clf) and an original training dataset (variable name train_x) with 119 features, here is a short code snippet showing how to use SelectFromModel to keep only the features whose importance lies above the mean:
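The snippet itself is not shown on the slide; a sketch of how it could look (threshold="mean" and the prefit usage are assumptions about the intended call):
from sklearn.feature_selection import SelectFromModel

# clf is already fitted, so pass it with prefit=True and keep only the
# features whose importance is above the mean importance.
selector = SelectFromModel(clf, threshold="mean", prefit=True)
train_x_reduced = selector.transform(train_x)
print(train_x_reduced.shape)  # fewer than the original 119 features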