Evaluation
Connectionist and Statistical Language Processing
Frank Keller
[email protected]
holdout, stratification
crossvalidation, leave-one-out
comparing machine learning algorithms
comparing against a baseline
precision and recall
evaluating numeric prediction
Literature: Witten and Frank (2000: ch. 6), Mitchell (1997: ch. 5).
Holdout
If a lot of data are available, simply take two independent samples and use one for training and one for testing. The more training data, the better the model; the more test data, the more accurate the error estimate.

Problem: obtaining data is often expensive and time consuming. Example: a corpus annotated with word senses, or experimental data on subcat preferences.

Solution: obtain a limited data set and use a holdout procedure. Most straightforward: a random split into test and training set. Typically between 1/3 and 1/10 of the data is held out for testing.
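As an illustration, here is a minimal sketch of a random holdout split in plain Python; the toy data set and the 1/3 test fraction are hypothetical, not part of the slides:

```python
import random

def holdout_split(data, test_fraction=1/3, seed=0):
    """Randomly split a data set into a training set and a test set."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_fraction)
    return items[n_test:], items[:n_test]   # (training set, test set)

# Hypothetical toy data: 30 (features, label) instances.
data = [((i,), "A" if i % 3 else "B") for i in range(30)]
train, test = holdout_split(data)
print(len(train), len(test))                # 20 10
```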
Stratification
Problem: the split into training and test set might be unrepresentative, e.g., a certain class is not represented in the training set, so the model will not learn to classify it.

Solution: use stratified holdout, i.e., sample in such a way that each class is represented in both sets.

Example: data set with two classes A and B; aim: construct a 10% test set. Take a 10% sample of all instances of class A plus a 10% sample of all instances of class B. However, this procedure doesn't work well on small data sets.
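A sketch of stratified holdout under the same assumptions (hypothetical (features, label) pairs): each class contributes the same fraction of its instances to the test set.

```python
import random
from collections import defaultdict

def stratified_holdout(data, test_fraction=0.1, seed=0):
    """Sample the same fraction of each class for the test set."""
    by_class = defaultdict(list)
    for features, label in data:
        by_class[label].append((features, label))
    rng = random.Random(seed)
    train, test = [], []
    for items in by_class.values():
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])          # 10% of this class
        train.extend(items[n_test:])         # remaining 90%
    return train, test

# Usage (hypothetical data with classes "A" and "B"):
# train, test = stratified_holdout(data, test_fraction=0.1)
```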
Crossvalidation

Solution: k-fold crossvalidation maximizes the use of the data.

Divide the data randomly into k folds. Train the model on k − 1 folds, use one fold for testing. Repeat this k times so that all folds are used for testing. Compute the average performance on the k test sets. This effectively uses all the data for both training and testing.

Typically k = 10 is used. Sometimes stratified k-fold crossvalidation is used.

Example: data set with 20 instances, 5-fold crossvalidation:

fold   test set              training set
1      i1, i2, i3, i4        i5, ..., i20
2      i5, i6, i7, i8        i1, ..., i4, i9, ..., i20
3      i9, i10, i11, i12     i1, ..., i8, i13, ..., i20
4      i13, i14, i15, i16    i1, ..., i12, i17, ..., i20
5      i17, i18, i19, i20    i1, ..., i16

Compute the error rate for each fold, then compute the average error rate.
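The procedure can be sketched as follows; `train_fn` (fits a model on a training set) and `error_fn` (error rate of a model on a test set) are hypothetical callables standing in for whatever learner is being evaluated:

```python
import random

def crossvalidate(data, k, train_fn, error_fn, seed=0):
    """k-fold crossvalidation: average error over k train/test splits."""
    items = list(data)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]           # k roughly equal folds
    errors = []
    for i in range(k):
        test = folds[i]                               # one fold for testing
        train = [x for j, fold in enumerate(folds)    # remaining k - 1 folds
                 if j != i for x in fold]
        errors.append(error_fn(train_fn(train), test))
    return sum(errors) / k                            # average error rate
```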
Leave-one-out
Leave-one-out crossvalidation is simply k-fold crossvalidation with k set to n, the number of instances in the data set.
This means that the test set consists of only a single instance, which will be classified either correctly or incorrectly.

Advantages: maximal use of training data, i.e., training on n − 1 instances; the procedure is deterministic, no sampling involved.

Disadvantages: infeasible for large data sets (a large number of training runs is required, at high computational cost); cannot be stratified (there is only one class in the test set).
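Leave-one-out then needs no sampling at all; with the same hypothetical `train_fn` and a `classify_fn` that labels a single instance, a sketch looks like this:

```python
def leave_one_out(data, train_fn, classify_fn):
    """n-fold crossvalidation with a single test instance per fold."""
    errors = 0
    for i, (features, label) in enumerate(data):
        train = data[:i] + data[i + 1:]        # train on n - 1 instances
        model = train_fn(train)
        if classify_fn(model, features) != label:
            errors += 1                        # instance misclassified
    return errors / len(data)                  # overall error rate
```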
Comparing Algorithms
Assume you want to compare the performance of two machine learning algorithms A and B on the same data set. You could use crossvalidation to determine the error rates of A and B and then compute the difference.

Problem: sampling is involved (in obtaining the data and in crossvalidation), hence there is variance in the error rates.

Solution: determine whether the difference between the error rates is statistically significant. If the crossvalidations of A and B use the same random division of the data (the same folds), then a paired t-test is appropriate.
Comparing Algorithms
Let the k samples of the error rate of algorithm A be denoted by x_1, ..., x_k and the k samples of the error rate of B by y_1, ..., y_k. Then the t statistic for a paired t-test is:

(1) $t = \frac{\bar{d}}{\sqrt{\sigma_d^2 / k}}$

where $\bar{d}$ is the mean of the differences $d_i = x_i - y_i$, and $\sigma_d^2$ is the variance of the differences.
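A sketch of the computation of (1) using only the standard library; the per-fold error rates are invented for illustration:

```python
import math
from statistics import mean, variance

def paired_t(xs, ys):
    """t statistic of equation (1) for paired per-fold error rates."""
    diffs = [x - y for x, y in zip(xs, ys)]
    k = len(diffs)
    return mean(diffs) / math.sqrt(variance(diffs) / k)

# Hypothetical error rates of A and B on the same 10 folds:
a = [0.21, 0.19, 0.25, 0.22, 0.18, 0.23, 0.20, 0.24, 0.19, 0.21]
b = [0.24, 0.22, 0.26, 0.25, 0.21, 0.24, 0.23, 0.27, 0.22, 0.23]
print(paired_t(a, b))   # compare against t with k - 1 degrees of freedom
```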
Confusion Matrix
Assume a two-way classification. Four classification outcomes are possible, which can be displayed in a confusion matrix:

                    predicted yes      predicted no
actual class yes    true positive      false negative
actual class no     false positive     true negative
True positives (TP): class members classified as class members
True negatives (TN): class non-members classified as non-members
False positives (FP): class non-members classified as class members
False negatives (FN): class members classified as class non-members
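Counting the four outcomes is straightforward; a minimal sketch with invented labels, treating "yes" as the positive class:

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, FP, FN, TN for a two-way classification."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a != positive and p == positive:
            fp += 1
        elif a == positive and p != positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]
print(confusion_counts(actual, predicted))   # (2, 1, 1, 2)
```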
Precision: the number of class members classified correctly over the total number of instances classified as class members.

(2) $\text{Precision} = \frac{TP}{TP + FP}$

Example: consider a model that detects oil slicks. A false positive means wrongly identifying an oil slick if there is none; a false negative means failing to identify an oil slick if there is one. Here, false negatives (environmental disasters) are much more costly than false positives (false alarms). We have to take that into account when we evaluate our model.
Recall: the number of class members classified correctly over the total number of class members.

(3) $\text{Recall} = \frac{TP}{TP + FN}$

The F-measure combines the two:

(4) $F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

Example: for the oil slick scenario, we want to maximize recall (avoiding environmental disasters); maximizing precision (avoiding false alarms) is less important. The F-measure can be used if precision and recall are equally important.
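Equations (2)-(4) translate directly into code; a sketch building on the hypothetical `confusion_counts` above (the guards against empty denominators are an added convention, not part of the slides):

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

print(precision_recall_f(tp=2, fp=1, fn=1))   # (0.667, 0.667, 0.667) approx.
```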
Evaluating Numeric Prediction

For numeric prediction, let p_1, ..., p_n be the predicted values for instances 1, ..., n, and a_1, ..., a_n the actual values for instances 1, ..., n.

There are several measures that compare the a_i and the p_i, among them the mean squared error and the root mean squared error:

(5) $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (p_i - a_i)^2$

(6) $\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (p_i - a_i)^2}$
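A sketch of (5) and (6) in plain Python, with invented predicted and actual values:

```python
import math

def mse(predicted, actual):
    """Mean squared error between predicted and actual values."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root mean squared error: same units as the target variable."""
    return math.sqrt(mse(predicted, actual))

p = [2.5, 0.0, 2.1, 7.8]
a = [3.0, -0.5, 2.0, 8.0]
print(mse(p, a), rmse(p, a))   # 0.1375 0.3708...
```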
Summary
No matter how the performance of the model is measured (precision, recall, MSE, correlation), we always need to measure it on the test set, not on the training set. Performance on the training set only tells us that the model learns what it's supposed to learn; it is not a good indicator of performance on unseen data.

The test set can be obtained using an independent sample or holdout techniques (crossvalidation, leave-one-out).

To meaningfully compare the performance of two algorithms on a given type of data, we need to determine whether a difference in performance is significant. We also need to compare performance against a baseline (chance or frequency); a sketch of a frequency baseline follows below.
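A frequency baseline can be as simple as always predicting the most frequent class in the training data; a minimal sketch with invented labels:

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(label == majority for label in test_labels) / len(test_labels)

print(majority_baseline(["A", "A", "B"], ["A", "B", "A", "A"]))   # 0.75
```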
References

Mitchell, Tom M. 1997. Machine Learning. New York: McGraw-Hill.

Witten, Ian H., and Eibe Frank. 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco: Morgan Kaufmann.