Evaluating Machine Learning Models
Violeta Menéndez González
[email protected]What is Machine Learning?
Image by Manfred Steger from Pixabay
It’s a data analytics technique that provides systems with the
ability to automatically learn and improve from experience
without being explicitly programmed.
They learn from sample data to make predictions or decisions on
future data.
What is a good ML model?
What task is the model designed to perform?
How can we assess whether a model is doing it well? (How would
you evaluate a human’s performance?)
Image by Ahmed Gad from Pixabay
There is no one type of model that definitively solves any problem;
nor is there any one definitive set of data that produces the best
predictions.
Model evaluation
Carefully analyse the model’s outputs to evaluate whether they
are meeting the goals that we set up for it.
This will allow us to compare models.
Chabacano / CC BY-SA
(https://creativecommons.org/licenses/by-sa/4.0)
Model evaluation
To prevent overfitting we split the data we have:
Some to train the model.
Some to test the model (“fake” future data).
Split data methods:
Holdout: train-validation-test data sets.
Cross-validation: k-fold cross validation.
Holdout testing
Data is divided into train/validation/test sets.
The split is usually 60/20/20, but if we have a lot of data it's enough to
test on smaller proportions (e.g. ~1,000,000 examples with a 98/1/1 split),
as long as the held-out sets are large enough to give high confidence in
the overall performance.
Validation and test sets should come from the same distribution,
something that reflects future data.
The model learns from the training set and is tested on the validation
data. If performance is not good enough, we adjust and test again until
it is, and only then do we evaluate on the test data.
It’s like studying for an exam: Study → Practice test → Refocus
studying → (repeat) → Final exam.
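As a minimal sketch (not from the slides), a 60/20/20 holdout split could be built with scikit-learn by splitting twice; the toy data and random_state are just illustrative:

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(20).reshape(10, 2)   # toy features: 10 samples, 2 features
    y = np.arange(10)                  # toy targets

    # First hold out 20% as the test set...
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
    # ...then split the remaining 80% into 60/20 train/validation
    # (0.25 of the remaining 80% is 20% of the full data set).
    X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

    print(len(X_train), len(X_val), len(X_test))   # 6 2 2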
Cross-validation testing
Divide data into k number of sets (k-folds).
Leave one set out for testing and train on the other k-1 sets. Repeat
so that each fold is held out once, then average the scores across all
the tests.
Useful when we have limited amounts of data.
Gufosowa / CC BY-SA (https://creativecommons.org/licenses/by-sa/4.0)
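A minimal k-fold sketch with scikit-learn; the classifier, the toy data and k = 5 are placeholder choices, not something prescribed by the slides:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=100, random_state=0)  # toy data
    model = LogisticRegression(max_iter=1000)

    # Train on k-1 folds, test on the held-out fold, repeat for every fold,
    # then average the k scores.
    scores = cross_val_score(model, X, y, cv=5)
    print(scores.mean(), scores.std())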
Model metrics
A way of quantifying how well a model is performing at a certain
task.
Different types of models need different types of metrics.
It's like the scores of an exam. They score each prediction as right
or wrong and produce an overall score for the model.
We can give more importance to one type of performance, just as we
can give more weight to certain questions on an exam.
Image by Manfred Steger from Pixabay
Classification models
Predict a class for each input.
Binary classification: is the picture a cat or not?
Multi-class: is the picture a cat, a dog, or an owl?
Multi-label: the object in the picture is a cat, an animal, and black.
Output is commonly the probability of an input belonging to a
class. We can change the decision threshold.
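A small illustration (hypothetical numbers) of turning predicted probabilities into class labels with an adjustable decision threshold:

    import numpy as np

    probs = np.array([0.10, 0.45, 0.80, 0.95])   # hypothetical P(picture is a cat)
    labels = (probs >= 0.5).astype(int)           # default threshold: [0 0 1 1]
    labels_low = (probs >= 0.3).astype(int)       # lower threshold flags more positives: [0 1 1 1]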
Classification model evaluation
Intuitively we tend to use accuracy: how many right predictions
we got over all predictions made.
This doesn’t work well for imbalanced problems!
For example, consider a fraud detection algorithm that predicts all
transactions to be valid. If 5% of transactions are fraudulent, its
accuracy is 95%! But the model is useless.
Image by mohamed Hassan from Pixabay
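A toy sketch of that fraud example, with the numbers chosen to match the slide:

    import numpy as np
    from sklearn.metrics import accuracy_score

    y_true = np.array([1] * 5 + [0] * 95)     # 5% fraud (1), 95% valid (0)
    y_pred = np.zeros(100, dtype=int)          # "model" that predicts everything as valid

    print(accuracy_score(y_true, y_pred))      # 0.95, yet it never catches a single fraud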
Confusion matrix
Model prediction vs reality.
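A minimal sketch with made-up predictions; scikit-learn's confusion_matrix returns true classes as rows and predicted classes as columns:

    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 0]   # reality
    y_pred = [1, 0, 0, 1, 0, 1]   # model prediction

    # [[TN, FP],
    #  [FN, TP]]
    print(confusion_matrix(y_true, y_pred))
    # [[2 1]
    #  [1 2]]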
Precision and recall
Precision: of all the positive predictions, how many were actually
positive?
Recall: of all the actual positive results, how many did the model
predict were positive?
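In terms of the confusion-matrix counts:

    Precision = TP / (TP + FP)
    Recall    = TP / (TP + FN)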
Precision and recall
Example: out of 100 transactions, 5% are fraud.
Model 1: All transactions are detected as fraud.
Model 2: Only 1 transaction correctly detected as fraud.
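Working the example through in code (assuming Model 2 flags nothing except that one true fraud, which the slide doesn't state explicitly):

    import numpy as np
    from sklearn.metrics import precision_score, recall_score

    y_true = np.array([1] * 5 + [0] * 95)    # 5 fraudulent, 95 valid transactions

    m1_pred = np.ones(100, dtype=int)        # Model 1: flags every transaction as fraud
    m2_pred = np.array([1] + [0] * 99)       # Model 2: catches one true fraud, flags nothing else

    print(precision_score(y_true, m1_pred), recall_score(y_true, m1_pred))  # 0.05 1.0
    print(precision_score(y_true, m2_pred), recall_score(y_true, m2_pred))  # 1.0 0.2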
F1 score
We want a compromise on precision and recall.
We use the harmonic mean because it punishes extreme
values.
We can adjust the importance we give each of them.
In critical systems we may want to favour recall, so as not to miss
any positive results.
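A small sketch: F1 is the harmonic mean of precision and recall, and the more general F-beta score with beta > 1 weights recall more heavily (beta = 2 below is just an illustrative choice):

    from sklearn.metrics import f1_score, fbeta_score

    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 0, 1]   # precision = 2/3, recall = 1/2

    print(f1_score(y_true, y_pred))              # ≈ 0.571, harmonic mean
    print(fbeta_score(y_true, y_pred, beta=2))   # ≈ 0.526, recall weighted more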
Precision-recall curve
Shows the precision vs recall trade-off as we vary the threshold
for identifying a positive in our model.
Appropriate for imbalanced data.
Jason Brownlee, ROC Curves and Precision-Recall Curves for Imbalanced Classification, Machine Learning Mastery, Available from
https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-imbalanced-classification/, accessed May 3rd, 2020.
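A minimal sketch with made-up scores; scikit-learn's precision_recall_curve sweeps the threshold and returns one precision/recall pair per threshold:

    from sklearn.metrics import precision_recall_curve

    y_true   = [0, 0, 1, 1]
    y_scores = [0.1, 0.4, 0.35, 0.8]   # predicted probability of the positive class

    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    print(precision, recall, thresholds)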
Receiver Operating Characteristic curve
Shows how the true positive rate vs the false positive rate
changes as we vary the model’s threshold.
TPR – recall.
FPR – probability of a false alarm.
Better for balanced data.
Jason Brownlee, ROC Curves and Precision-Recall Curves for Imbalanced Classification, Machine Learning Mastery, Available from
https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-imbalanced-classification/, accessed May 3rd, 2020.
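The equivalent sketch for the ROC curve, again with made-up scores:

    from sklearn.metrics import roc_curve

    y_true   = [0, 0, 1, 1]
    y_scores = [0.1, 0.4, 0.35, 0.8]

    fpr, tpr, thresholds = roc_curve(y_true, y_scores)   # one (FPR, TPR) point per threshold
    print(fpr, tpr)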
Comparing models
We can quantify a model’s curve by calculating the total Area
Under the Curve (AUC).
The AUC summarises the skill of a model across thresholds.
The F-score summarises model skill for a specific threshold.
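Continuing the same toy scores, the AUC reduces the whole curve to one number:

    from sklearn.metrics import roc_auc_score

    y_true   = [0, 0, 1, 1]
    y_scores = [0.1, 0.4, 0.35, 0.8]

    print(roc_auc_score(y_true, y_scores))   # 0.75: one number across all thresholds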
Regression models
Predicts a continuous target value from input data.
House price prediction.
Stock price prediction.
Training and testing methods are the same; what differs is how we
assess the models.
Regression metrics
Most common metrics are:
Mean Absolute Error (MAE).
Mean Squared Error (MSE).
MSE penalises large errors more than MAE.
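A quick sketch with made-up true and predicted values:

    from sklearn.metrics import mean_absolute_error, mean_squared_error

    y_true = [3.0, -0.5, 2.0, 7.0]   # actual values
    y_pred = [2.5,  0.0, 2.0, 8.0]   # model predictions

    print(mean_absolute_error(y_true, y_pred))   # 0.5
    print(mean_squared_error(y_true, y_pred))    # 0.375, squaring punishes large errors more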
Clustering models
Grouping of data points:
Unsupervised learning: we don't have ground-truth labels.
We use metrics that measure how similar observations within the
same cluster are to each other.
These don't really measure the validity of the predictions but can
help us compare models.
Jason Brownlee, 10 Clustering Algorithms With Python, Machine Learning Mastery, Available
from https://machinelearningmastery.com/clustering-algorithms-with-python/, accessed May 3rd,
2020.
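One common choice is the silhouette score; a minimal sketch with toy blob data (KMeans and k = 3 are just illustrative choices):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=100, centers=3, random_state=0)        # toy data, labels unused
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # Silhouette score in [-1, 1]: higher means points are closer to their own
    # cluster than to the neighbouring one. Useful for comparing clusterings.
    print(silhouette_score(X, labels))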
Conclusions?
It’s very important to understand what a model does and what we
want to measure.
One metric does NOT fit all!
Metrics are only useful if you can compare them to something.
Be careful when testing your models, you may be overfitting.
The end
Thanks!
Photo by Franck V. on Unsplash