Model Evaluation & Selection
Evaluation metrics: How can we measure accuracy? Other metrics to
consider?
Use a separate test set of class-labeled tuples, rather than the training set, when assessing accuracy
Methods for estimating a classifier’s accuracy:
Holdout method, random subsampling
Cross-validation
Bootstrap
Comparing classifiers:
Confidence intervals
Cost-benefit analysis and ROC Curves
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Actual class \ Predicted class   C1                     ¬C1
C1                               True Positives (TP)    False Negatives (FN)
¬C1                              False Positives (FP)   True Negatives (TN)
Example of Confusion Matrix:
Actual class \ Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes               6954                 46                  7000
buy_computer = no                412                  2588                3000
Total                            7366                 2634                10000
Given m classes, an entry CM(i, j) in a confusion matrix indicates the # of tuples in class i that were labeled by the classifier as class j
May have extra rows/columns to provide totals
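As an illustration of this bookkeeping, here is a minimal Python sketch (not from the original slides; the labels are made up) that tallies a confusion matrix CM(i, j) from actual and predicted class labels:

```python
def confusion_matrix(actual, predicted, classes):
    """CM[i][j] = # of tuples of actual class i labeled by the classifier as class j."""
    cm = {i: {j: 0 for j in classes} for i in classes}
    for a, p in zip(actual, predicted):
        cm[a][p] += 1
    return cm

# Made-up labels, just to show the tallying.
actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]
cm = confusion_matrix(actual, predicted, classes=["yes", "no"])

for i in ["yes", "no"]:
    print(i, [cm[i][j] for j in ["yes", "no"]])
# With "yes" as the positive class: row "yes" holds TP and FN, row "no" holds FP and TN.
```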
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
A \ P    C     ¬C
C        TP    FN     P
¬C       FP    TN     N
         P'    N'     All

Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified
Accuracy = (TP + TN)/All
Error rate: 1 – accuracy, or
Error rate = (FP + FN)/All

Class Imbalance Problem:
One class may be rare, e.g., fraud, or HIV-positive
Significant majority of the negative class and minority of the positive class
Sensitivity: True Positive recognition rate
Sensitivity = TP/P
Specificity: True Negative recognition rate
Specificity = TN/N
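A minimal Python sketch of these four formulas (my own illustration, applied to the buy_computer counts from the earlier table: TP = 6954, FN = 46, FP = 412, TN = 2588):

```python
def basic_rates(TP, FN, FP, TN):
    P, N = TP + FN, FP + TN            # actual positives and negatives
    All = P + N
    accuracy    = (TP + TN) / All      # recognition rate
    error_rate  = (FP + FN) / All      # = 1 - accuracy
    sensitivity = TP / P               # true positive recognition rate
    specificity = TN / N               # true negative recognition rate
    return accuracy, error_rate, sensitivity, specificity

# buy_computer confusion matrix from the earlier slide.
acc, err, sens, spec = basic_rates(TP=6954, FN=46, FP=412, TN=2588)
print(f"accuracy={acc:.4f}  error={err:.4f}  sensitivity={sens:.4f}  specificity={spec:.4f}")
```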
Classifier Evaluation Metrics: Precision and Recall, and F-measures
Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive
Precision = TP/(TP + FP)
Recall: completeness – what % of positive tuples did the classifier label as positive?
Recall = TP/(TP + FN) = TP/P
A perfect score is 1.0
There is an inverse relationship between precision and recall
F measure (F1 or F-score): harmonic mean of precision and recall
F1 = 2 × Precision × Recall / (Precision + Recall)
Fβ: weighted measure of precision and recall; assigns β times as much weight to recall as to precision
Fβ = (1 + β²) × Precision × Recall / (β² × Precision + Recall)
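The same formulas in a short Python sketch (my own illustration; the counts TP = 90, FP = 140, FN = 210 come from the cancer example below):

```python
def precision_recall_f(TP, FP, FN, beta=1.0):
    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives the harmonic mean (F1).
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f_beta

p, r, f1 = precision_recall_f(TP=90, FP=140, FN=210)
print(f"precision={p:.4f}  recall={r:.4f}  F1={f1:.4f}")  # 0.3913, 0.3000, ~0.34
```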
Example
Actual class \ Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                     90             210           300     30.00 (sensitivity)
cancer = no                      140            9560          9700    98.56 (specificity)
Total                            230            9770          10000   96.50 (accuracy)
A \ P    C     ¬C
C        TP    FN    P
¬C       FP    TN    N
         P'    N'    All

Accuracy = (TP + TN)/All
Sensitivity = TP/P
Specificity = TN/N
Precision = 90/230 = 39.13%
Recall = 90/300 = 30.00%
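These figures can also be cross-checked with scikit-learn (assuming it is available) by expanding the table counts into label arrays; a sketch:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Expand the table counts into 10,000 labels; 1 = "cancer = yes" (positive class).
y_true = np.array([1] * 300 + [0] * 9700)
y_pred = np.array([1] * 90 + [0] * 210       # actual yes: 90 TP, 210 FN
                  + [1] * 140 + [0] * 9560)  # actual no: 140 FP, 9560 TN

print(confusion_matrix(y_true, y_pred))   # rows = actual class, columns = predicted class
print(accuracy_score(y_true, y_pred))     # 0.9650
print(precision_score(y_true, y_pred))    # 0.3913
print(recall_score(y_true, y_pred))       # 0.3000 (= sensitivity)
```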
Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods
Holdout method
Given data is randomly partitioned into two independent sets
Training set (e.g., 2/3 of the data) for model construction
Test set (e.g., 1/3 of the data) for accuracy estimation
Random subsampling: a variation of holdout
Repeat holdout k times; accuracy = avg. of the accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets D1, ..., Dk, each of approximately equal size
At the i-th iteration, use Di as the test set and the remaining subsets as the training set
Leave-one-out: k folds where k = # of tuples, for small-sized data (see the sketch below)
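A hedged sketch of the holdout split and 10-fold cross-validation with scikit-learn; the dataset (load_breast_cancer) and classifier (DecisionTreeClassifier) are placeholders of my choosing, not part of the slides:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Holdout: 2/3 of the data for training, 1/3 for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
print("holdout accuracy:", clf.fit(X_tr, y_tr).score(X_te, y_te))

# 10-fold cross-validation: average accuracy over the k test folds.
scores = cross_val_score(clf, X, y, cv=10)
print("10-fold CV accuracy:", scores.mean())

# Leave-one-out (k = number of tuples) is the same call with cv=LeaveOneOut().
```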
Model Selection: ROC Curves
ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
Show the trade-off between the true positive rate and the false positive rate
The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate
To plot a model's curve, rank the test tuples in decreasing order of estimated probability of belonging to the positive class: the tuple most likely to be positive appears at the top of the list
The plot also shows a diagonal line (the expected curve of random guessing)
The area under the ROC curve is a measure of the accuracy of the model
A model with perfect accuracy will have an area of 1.0
The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
Model Selection: ROC Curves
An ROC curve for a given model shows the trade-off between the true positive rate (TPR) and the false positive rate (FPR)
Given a test set and a model, TPR is the proportion of positive (or “yes”) tuples that are correctly labeled by the model
FPR is the proportion of negative (or “no”) tuples that are mislabeled as positive
Given that TP, FP, P, and N are the numbers of true positive, false positive, positive, and negative tuples, respectively, we have:
TPR = TP/P
FPR = FP/N
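To make the construction concrete, here is a small Python sketch (made-up scores, not from the slides) that walks down the ranked list, stepping up by 1/P at each true positive and right by 1/N at each false positive, and then estimates the area under the curve; scikit-learn's roc_curve and roc_auc_score compute the same quantities.

```python
import numpy as np

# Made-up test tuples: true class (1 = positive) and the model's estimated
# probability of belonging to the positive class.
y_true  = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.95, 0.85, 0.78, 0.66, 0.60, 0.55, 0.53, 0.40])

order = np.argsort(-y_score)                  # rank tuples, most likely positive first
P, N = (y_true == 1).sum(), (y_true == 0).sum()

fpr, tpr, tp, fp = [0.0], [0.0], 0, 0
for i in order:
    if y_true[i] == 1:
        tp += 1                               # step up
    else:
        fp += 1                               # step right
    tpr.append(tp / P)                        # TPR = TP/P
    fpr.append(fp / N)                        # FPR = FP/N

# Area under the curve via the trapezoidal rule.
auc = sum((fpr[k + 1] - fpr[k]) * (tpr[k + 1] + tpr[k]) / 2 for k in range(len(fpr) - 1))
print("ROC points:", list(zip(fpr, tpr)))
print("AUC =", auc)                           # 1.0 = perfect, 0.5 = diagonal (random guessing)
```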
Issues Affecting Model Selection
Accuracy
classifier accuracy: predicting class label
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree size or compactness
of classification rules
Another Example
Consider a classifier with the following confusion matrix (TP = 15, FN = 47, FP = 12, TN = 118):

A \ P    C     ¬C
C        15    47     62
¬C       12    118    130
         27    165    192

Solution
In the above example, there are 192 cases in total (15 + 47 + 12 + 118).
Accuracy = (15 + 118)/192 = 69.27%
Precision = 15/(15 + 12) = 55.56%
Recall = 15/(15 + 47) = 24.19%
Specificity = 118/(118 + 12) = 90.77%
F1-Score = 2*15/((2*15) + 12 + 47) = 33.71%
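A quick plain-Python check of the arithmetic above:

```python
TP, FN, FP, TN = 15, 47, 12, 118
All = TP + FN + FP + TN                            # 192 cases

print("accuracy   ", (TP + TN) / All)              # 0.6927
print("precision  ", TP / (TP + FP))               # 0.5556
print("recall     ", TP / (TP + FN))               # 0.2419
print("specificity", TN / (TN + FP))               # 0.9077
print("F1         ", 2 * TP / (2 * TP + FP + FN))  # 0.3371
```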
References
Jiawei Han, Micheline Kamber and Jian Pei, “Data Mining: Concepts and Techniques”,
3rd ed., The Morgan Kaufmann Series in Data Management Systems, Morgan
Kaufmann Publishers, July 2011. ISBN 978-0123814791
https://hanj.cs.illinois.edu/bk3/bk3_slidesindex.htm