
Choosing a Model and Tuning

• Now that we have looked at many examples of bad data, let’s look at
a couple of examples of bad algorithms.
• Overfitting the Training Data
• Underfitting the Training Data

Overfitting the Training Data
• Complex models such as deep neural networks can detect subtle patterns
in the data, but if the training set is noisy, or if it is too small (which
introduces sampling noise), then the model is likely to detect patterns in
the noise itself.
• Obviously these patterns will not generalize to new instances.
• For example, say you feed your life satisfaction model many more
attributes, including uninformative ones such as the country’s name.
• In that case, a complex model may detect patterns like the fact that all countries in
the training data with a w in their name have a life satisfaction greater than 7: New
Zealand (7.3), Norway (7.4), Sweden (7.2), and Switzerland (7.5).
• How confident are you that the W-satisfaction rule generalizes to Rwanda or
Zimbabwe?
• Obviously, this pattern occurred in the training data by pure chance, but the model
has no way to tell whether a pattern is real or simply the result of noise in the data.

Overfitting the Training Data


• Overfitting: the model you construct matches the
training data so thoroughly that it does not
generalize well to unseen data.
• For example, consider a decision boundary that classifies every training
point correctly but whose shape is very complex: it is unlikely that such a
boundary will effectively separate new points.
• The error rate on new cases is called the
generalization error
• If the training error is low (i.e., your model makes
few mistakes on the training set) but the
generalization error is high, it means that your model
is overfitting the training data.
One common cause: if the number of features is greater than the number of training instances, your model is very likely to overfit.


• There are several ways to prevent overfitting:
• If the training set is too small (so the model picks up noisy patterns), add more
relevant, clean data.
• If the number of features is too large, do some feature selection and
remove unnecessary features.
• Constrain the model to make it simpler and reduce the risk of overfitting
➔ this is called regularization (see the short sketch below).
• For example, select a model with fewer parameters (e.g., a linear model rather than a high-degree
polynomial model).
• The fewer degrees of freedom a model has, the harder it is for it to overfit the data.
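As an illustration of regularization, here is a minimal sketch, assuming some placeholder training arrays X_train and y_train (not from this example): Ridge adds an L2 penalty that constrains the linear model's weights, and a larger alpha means a stronger constraint.

from sklearn.linear_model import LinearRegression, Ridge

plain_model = LinearRegression()   # unconstrained linear model
ridge_model = Ridge(alpha=1.0)     # regularized: larger alpha => stronger constraint, simpler model

plain_model.fit(X_train, y_train)  # X_train, y_train are placeholder training data
ridge_model.fit(X_train, y_train)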

Underfitting the Training Data


• underfitting is the opposite of overfitting: it occurs when
your model is too simple to learn the underlying structure of
the data.
• For example, a linear model is prone to underfit; reality is just
more complex than the model, so its predictions are bound to
be inaccurate, even on the training examples.
• For example, a simple linear decision boundary may be correct on the training
data a majority of the time, but it still makes many errors and will likely also
perform poorly on unseen data.
• This type of problem can be very easily detected by the
performance metrics.
When a model fails to learn from the training dataset and also cannot
generalize to the test dataset, it is said to be underfitting.

• The ways to prevent underfitting are stated below (a short sketch follows the list):
• Selecting a more powerful model, with more parameters (Increase the model
complexity)
• Feeding better features to the learning algorithm (feature engineering)
• Reducing the constraints on the model (e.g., reducing the regularization
hyperparameter)
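A minimal sketch of these three remedies, assuming a hypothetical numeric training matrix X_train (with at least two columns) and targets y_train:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# 1) More powerful model: add polynomial features instead of a plain linear fit.
more_complex = make_pipeline(PolynomialFeatures(degree=3), Ridge(alpha=1.0))

# 2) Better features (feature engineering): e.g., a hand-crafted ratio feature.
X_engineered = np.c_[X_train, X_train[:, 0] / (X_train[:, 1] + 1e-9)]

# 3) Fewer constraints: lower the regularization hyperparameter.
less_regularized = Ridge(alpha=0.01)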

End-to-End ML Project
Part (2)

5. Select a model and train it.


6. Fine-tune your model.
7. Evaluate Your System on the Test Set
8. Launch, monitor, and maintain your system.

Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow, by Aurelien Geron

5. Select a Model and Train it.
• How are you supposed to decide which classification algorithm to
pick for a given task?
• The best answer is to let the data decide: try a number of different
approaches and see what works best.

5.1 Training and Evaluating on the Training Set


• Let’s first train a Linear Regression model:

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

• Done! You now have a working Linear Regression model. Let’s try it out on a few instances from the training set:

some_data = housing.iloc[:5]                # housing is the training set before preprocessing
some_labels = housing_labels.iloc[:5]
some_data_prepared = preparingData(some_data, cat_enc, imputer, stsc)
print("Predictions:", lin_reg.predict(some_data_prepared))
print("Labels:", list(some_labels))


• It works, although the predictions are not exactly accurate.
• Predictions: [85657.90, 305492.60, 152056.46, 186095.70, 244550.67]
• Labels: [72100.0, 279600.0, 82700.0, 112500.0, 238300.0]



def preparingData(someData, cat_enc, imputer, stsc):   # someData: unlabeled instances before preprocessing
    someData_num = someData.drop("ocean_proximity", axis=1)
    someData_cat = someData[["ocean_proximity"]]
    n = cat_enc.transform(someData_cat)
    someData_cat_1hot = pd.DataFrame(n.toarray(), columns=list(cat_enc.categories_[0]))
    X = imputer.transform(someData_num)
    someData_num = pd.DataFrame(X, columns=someData_num.columns)
    someData_num["rooms_per_household"] = someData_num["total_rooms"] / someData_num["households"]
    someData_num["population_per_household"] = someData_num["population"] / someData_num["households"]
    someData_num["bedrooms_per_room"] = someData_num["total_bedrooms"] / someData_num["total_rooms"]
    n0 = stsc.transform(someData_num)
    someData_num = pd.DataFrame(n0, columns=someData_num.columns)
    return pd.concat([someData_num, someData_cat_1hot], axis=1)


• Let’s measure this regression model’s RMSE on the whole training set
(already prepared) using Scikit-Learn’s mean_squared_error function:

from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)   # ➔ 68,628.198

• An RMSE of about $68,628 is not very satisfying.



• This is an example of a model underfitting the training data. When
this happens it can mean that the features do not provide enough
information to make good predictions, or that the model is not
powerful enough.
• let’s try a more complex model to see how it does:
• DecisionTreeRegressor: this is a powerful model, capable of finding complex nonlinear
relationships in the data.

from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)


• Now that the model is trained, let’s evaluate it on the training set:

housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)   # ➔ 0.0

• Wait, what!? No error at all? Could this model really be absolutely perfect?
• It is much more likely that the model has badly overfit the data.
• How can you be sure? As we saw earlier, you don’t want to touch the test set
until you are ready to launch a model you are confident about, so you need to
use part of the training set for training, and part for model validation.

5.2 Better Evaluation Using Cross-Validation


• A common approach: simply hold out part of the training set (the validation set)
to evaluate several candidate models and select the best one (a minimal split
sketch follows below).
• The majority (say, 60%) is the training set, which is passed to the learning algorithm.
• The second part (say, 20%) is the validation set, used to evaluate and iterate on
the model produced by the learning algorithm.
• Finally, after you have settled on an optimal model, you use the remaining part
(say, 20%) as the test set to estimate real-world performance on unseen data.

You can tune model parameters based on performance on the validation set.
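A minimal split sketch, assuming a feature matrix X and labels y; the 60/20/20 proportions follow the example above:

from sklearn.model_selection import train_test_split

# First split off 40% of the data, then cut that 40% in half.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Result: X_train ~ 60%, X_val ~ 20%, X_test ~ 20%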

• More specifically,
• you train multiple models with various hyperparameters on the reduced training
set, and you select the model that performs best on the validation set.
• After this holdout validation process, you train the best model on the full training
set (including the validation set), and this gives you the final model.
• Lastly, you evaluate this final model on the test set to get an estimate of the
generalization error.
• Which part is used as the validation set?
• K-fold cross-validation: This technique is a standard method for evaluating
models when there is not a lot of training data, so every labeled example makes
a significant contribution to the learned model.



• Shuffle the dataset and split it into k equal folds.
• In the first iteration, use the first fold as the validation set and all the
other folds as the training set.
• Train the model on the training set and evaluate it on the validation set.
Keep the evaluation score and discard the model.
• In the next iteration, select a different fold as the validation set: retrain
the model on the remaining folds, validate it on the new validation set, keep
the evaluation score, and discard the model.
• Continue until you have done this k times. You end up with k evaluation scores.
• The overall error estimate is the average of these individual evaluation
scores (a hand-rolled sketch of this loop follows).
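The sketch below walks through this loop by hand with scikit-learn's KFold, assuming NumPy arrays X and y and any untrained regressor `model`; it illustrates the procedure and is not code from the source.

import numpy as np
from sklearn.base import clone
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in kf.split(X):
    fold_model = clone(model)                    # fresh, untrained copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])   # train on the other k-1 folds
    preds = fold_model.predict(X[val_idx])       # evaluate on the held-out fold
    scores.append(np.sqrt(mean_squared_error(y[val_idx], preds)))
    # the fold model is then discarded; only its score is kept
print("Mean RMSE:", np.mean(scores))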
Scikit-Learn's cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the
scoring function is actually the opposite of the MSE (i.e., a negative value), which is why the code below computes -scores
before calculating the square root.

Scikit-Learn’s K-fold cross-validation


• The following code randomly splits the training set into 10 distinct folds, then trains and
evaluates the Decision Tree model 10 times, picking a different fold for evaluation every time
and training on the other 9 folds. The result is an array containing the 10 evaluation scores:

from sklearn.model_selection import cross_val_score
tree_reg = DecisionTreeRegressor()
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
print("Scores:", tree_rmse_scores)
print("Mean:", tree_rmse_scores.mean())                  # ➔ 71491.4
print("Standard deviation:", tree_rmse_scores.std())     # ➔ 2674.5

The Decision Tree has a score of approximately 71,491, generally ±2,674.


Now the Decision Tree doesn’t look as good as it did earlier



• Let’s try one last model now: the RandomForestRegressor.
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(housing_prepared, housing_labels)
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)   # ➔ 18650.698


• Cross validation with RandomForest
from sklearn.model_selection import cross_val_score
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
print("Scores:", forest_rmse_scores)
print("Mean:", forest_rmse_scores.mean()) ➔ 50435
print("Standard deviation:", forest_rmse_scores.std()) ➔ 2203

The Random Forest has a score of approximately 50,435, generally ±2,203.

This is much better: Random Forests look very promising. However, note that the score on the training set is
still much lower than on the validation sets, meaning that the model is still overfitting the training set.



• Alternatively, you can use the cross_val_predict() function:
• Just like the cross_val_score() function, cross_val_predict() performs
K-fold cross-validation, but instead of returning the evaluation scores,
it returns the predictions made on each test fold. This means that you
get a clean prediction for each instance in the training set (“clean”
meaning that the prediction is made by a model that never saw the
data during training).
• For example, if there are 5 folds, then the output of each sample is
its predicted value when it was in the test fold (trained on the other 4
folds of data).
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(clf, X_train, y_train, cv=3)
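The snippet above uses generic classifier-style names (clf, X_train, y_train). For the housing regressor from this chapter, a comparable sketch (an illustration, not from the source) would compute out-of-fold predictions and then an RMSE over them:

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict

# Each prediction comes from a model that never saw that instance during training.
housing_pred_oof = cross_val_predict(forest_reg, housing_prepared, housing_labels, cv=5)
oof_rmse = np.sqrt(mean_squared_error(housing_labels, housing_pred_oof))
print("Out-of-fold RMSE:", oof_rmse)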

6. Fine-Tune Your Model.
• Let’s assume that you now have a shortlist of promising models. You
now need to fine-tune them.
• One way to do that would be to fiddle with the hyperparameters manually
(e.g., the regularization strength) until you find a great combination of
hyperparameter values ➔ tedious work.
• On the other hand, suppose that the linear model generalizes better, but
you want to apply some regularization to avoid overfitting. The question is:
how do you choose the value of the regularization hyperparameter? One option
is to train 100 different models using 100 different values for this
hyperparameter (a small sketch of this loop follows the note below).
• Tuning hyperparameters is an important part of building a Machine
Learning system
A hyperparameter is a parameter of a learning algorithm (not of the model). As such, it is not affected by
the learning algorithm itself; it must be set prior to training and remains constant during training.
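A small sketch of that "train many models with different hyperparameter values" loop, assuming a Ridge regressor and a hypothetical holdout split (X_train/y_train for training, X_val/y_val for validation):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

best_alpha, best_rmse = None, float("inf")
for alpha in np.logspace(-3, 3, 100):          # 100 candidate regularization values
    model = Ridge(alpha=alpha)
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
    if rmse < best_rmse:                       # keep the value that does best on the validation set
        best_alpha, best_rmse = alpha, rmse
print("Best alpha:", best_alpha, "validation RMSE:", best_rmse)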



• Scikit-Learn provides several simple ways to do this:
1.Grid Search: All you need to do is tell it which hyperparameters you
want it to experiment with, and what values to try out, and it will
evaluate all the possible combinations of hyperparameter values,
using cross-validation.
2.Randomized Search: instead of trying out all possible combinations,
it evaluates a given number of random combinations by selecting a
random value for each hyperparameter at every iteration
3.Ensemble Methods: try to combine the models that perform best.
The group (or “ensemble”) will often perform better than the best
individual model (just like Random Forests perform better than the
individual Decision Trees they rely on), especially if the individual
models make very different types of errors.

6.1 Grid Search
• The following code searches for the best combination of hyperparameter values for the
RandomForestRegressor:
from sklearn.model_selection import GridSearchCV
param_grid = [
# First, try 12 (3×4) combinations of hyperparameters
{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
# then try 6 (2×3) combinations with bootstrap set as False
{'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}
]
forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
scoring='neg_mean_squared_error',
return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
print(grid_search.best_params_) ➔ {'max_features': 8, 'n_estimators': 30}



• Since 8 and 30 are the maximum values that were evaluated, you
should probably try searching again with higher values, since the score
may continue to improve.
• The RMSE score for this combination is 49,682, which is slightly better
than the score you got earlier using the default hyperparameter values
(which was 50,182).
• You can also get the best estimator directly:
print(grid_search.best_estimator_)
➔ RandomForestRegressor(max_features=8, n_estimators=30, random_state=42)

Don’t forget that you can treat some of the data preparation steps as hyperparameters. For
example, the grid search will automatically find out whether or not to add a feature you were
not sure about. It may similarly be used to automatically find the best way to handle outliers,
missing features, feature selection, and more.
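One way to do that is to wrap a preparation step and the model in a Pipeline and let GridSearchCV search over the preparation step's parameters as well. This is a sketch: the pipeline layout, parameter names, and the housing_numeric variable below are illustrative assumptions, not from the source.

from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("imputer", SimpleImputer()),                       # a data preparation step
    ("model", RandomForestRegressor(random_state=42)),
])
param_grid = {
    "imputer__strategy": ["mean", "median"],            # preparation choice treated as a hyperparameter
    "model__n_estimators": [10, 30],
}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
# grid.fit(housing_numeric, housing_labels)             # housing_numeric: hypothetical numeric-only features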


• The evaluation scores are also available:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
print(np.sqrt(-mean_score), params)

63895.161577951665 {'max_features': 2, 'n_estimators': 3}
54916.32386349543 {'max_features': 2, 'n_estimators': 10}
52885.86715332332 {'max_features': 2, 'n_estimators': 30}
60075.3680329983 {'max_features': 4, 'n_estimators': 3}
52495.01284985185 {'max_features': 4, 'n_estimators': 10}


6.2 Randomized Search


from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}
forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error',
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
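The search results can then be inspected the same way as with grid search (a short sketch, assuming numpy is imported as np as in the earlier snippets):

# Print the cross-validated RMSE for each sampled combination, then the best one.
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
print(rnd_search.best_params_)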
7. Evaluate Your System on the Test Set
• After tweaking your models for a while, you eventually have a system that
performs sufficiently well. ➔ Now is the time to evaluate the final model
on the test set.
• There is nothing special about this process; just get the predictors and the
labels from your test set, prepare your test set, and evaluate the final
model on the test set:

final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()
X_test_prepared = preparingData(X_test, cat_enc, imputer, stsc)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)   # => evaluates to 47,873.260
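If you also want a sense of how precise this estimate is, one optional addition (not part of the original steps, just a common sketch) is a rough 95% confidence interval for the generalization RMSE, computed from the per-instance squared errors:

import numpy as np
from scipy import stats

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
# t-interval over the mean squared error, then take the square root to get an RMSE range.
interval = np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                                    loc=squared_errors.mean(),
                                    scale=stats.sem(squared_errors)))
print("95% CI for the RMSE:", interval)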


8-Launch, Monitor, and Maintain Your System


• You need to write monitoring code to check your system’s live performance
at regular intervals and trigger alerts when it drops.
• You may also want human experts to evaluate the system’s performance.
• You should also make sure you evaluate the quality of the system’s input data. If
you monitor your system’s inputs, you can catch data-quality problems early. Monitoring
the inputs is particularly important for online learning systems.
• You will generally want to train your models on a regular basis using fresh
data.
• You should automate this process as much as possible.
• If you don’t, you are very likely to refresh your model only every six months (at best)
• If your system is an online learning system, you should make sure you save snapshots
of its state at regular intervals so you can easily roll back to a previously working
state.

Learning Curve


Learning Curves
• How can you tell that your model is overfitting or underfitting the data?
• You can use cross-validation to get an estimate of a model’s generalization
performance.
• If a model performs well on the training data but generalizes poorly according to the
cross-validation metrics, then your model is overfitting.
• If it performs poorly on both, then it is underfitting. This is one way to tell when a
model is too simple or too complex.
• Another way is to look at the learning curves:
• these are plots of the model’s performance on the training set and the validation set
as a function of the training set size.
• To generate the plots, train the model several times on differently sized subsets
of the training set.
• The following code defines a function that plots the learning curves of a model given
some training data:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
    plt.legend()   # needed so the "train"/"val" labels are displayed

lin_reg = LinearRegression()
plot_learning_curves(lin_reg, X, y)

This deserves a bit of explanation.


Let’s look at the learning curves of the plain Linear Regression model


• First, let’s look at the performance on the training data:
• when there are just one or two instances in the training set, the model can fit them perfectly,
which is why the curve starts at zero.
• But as new instances are added to the training set, it becomes impossible for the model to fit
the training data perfectly, both because the data is noisy and because it is not linear at all.
So the error on the training data goes up until it reaches a plateau, at which point adding new
instances to the training set doesn’t make the average error much better or worse.
• Now let’s look at the performance of the model on the validation data.
• When the model is trained on very few training instances, it is incapable of generalizing
properly, which is why the validation error is initially quite big.
• Then as the model is shown more training examples, it learns and thus the validation error
slowly goes down. However, once again a straight line cannot do a good job modeling the
data, so the error ends up at a plateau, very close to the other curve.
• These learning curves are typical of an underfitting model. Both curves have
reached a plateau; they are close and fairly high.

• Using a different model (a 10th-degree polynomial model), these learning
curves look a bit like the previous ones, but there are two very important
differences (a sketch to reproduce them follows below):
• The error on the training data is much lower
than with the Linear Regression model.
• There is a gap between the curves. This
means that the model performs significantly
better on the training data than on the
validation data, which is the hallmark of an
overfitting model.
• However, if you used a much larger training set, the two curves would continue
to get closer.
One way to improve an overfitting model is to feed it more training data until the
validation error reaches the training error.
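A sketch to reproduce these curves, reusing the plot_learning_curves() helper defined earlier with the same X and y; the exact pipeline below is an assumption about the 10th-degree model:

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

polynomial_regression = Pipeline([
    ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
    ("lin_reg", LinearRegression()),
])
plot_learning_curves(polynomial_regression, X, y)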

Feature Selection

Feature Selection
• There are a number of techniques to address the feature selection problem (a sketch follows the list):
• Logistic regression, SVMs, and decision trees/forests have methods of determining relative
feature importance; you can run these and keep only the features with highest importance.
• You can use L1 regularization for feature selection in logistic regression and SVM classifiers.
• If the number n of features is reasonably small (say, n < 100) you can use a “build it up”
approach:
• build n one-feature models and determine which is best on your validation set; then
build n–1 two-feature models, and so on, until the gain of adding an additional feature is
below a certain threshold.
• Similarly, you can use a “leave one out” approach: build a model on n features, then n
models on n–1 features and keep the best, and so on until the loss of removing an
additional feature is too great.
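Scikit-Learn's SequentialFeatureSelector implements essentially this idea. The sketch below assumes a hypothetical classification dataset X, y and a LogisticRegression base model; direction="backward" gives the "leave one out" variant.

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=10,   # hypothetical target count
                                     direction="forward", cv=5)
selector.fit(X, y)
X_selected = selector.transform(X)   # keeps only the chosen feature columns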


• scikit-learn implements the
sklearn.feature_selection.SelectFromModel helper utility
that assists operators in selecting features based on importance
weights.
• As long as a trained estimator has the feature_importances_ or
coef_ attribute after fitting, it can be passed into SelectFromModel
for feature selection importance ranking.
• DecisionTreeClassifier and RandomForestClassifier, have the feature_importances_ attribute.
• LinearRegression, LogisticRegression, and support vector machines have the coef_
attribute.

• Assuming that we have a pretrained DecisionTreeClassifier model (variable name clf) and an
original training dataset (variable name train_x) with 119 features, here is a short code snippet
showing how to use SelectFromModel to keep only the features whose importance lies above the mean:

from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(clf, prefit=True)   # i.e. clf has been trained before
train_x_new = sfm.transform(train_x)      # new training set, keeping only the selected features
print("Orig # features:", train_x.shape[1], "selected # features:", train_x_new.shape[1])
# ➔ Orig # features: 119, selected # features: 7
print(clf.feature_importances_)   # a decision tree exposes feature_importances_, not coef_
print(sfm.get_support())
