Evaluation
Connectionist and Statistical Language Processing
Frank Keller
[email protected]
holdout, stratification
crossvalidation, leave-one-out
comparing machine learning algorithms
comparing against a baseline
precision and recall
evaluating numeric prediction
Literature: Witten and Frank (2000: ch. 6), Mitchell (1997: ch. 5).
Holdout
If a lot of data are available, simply take two independent samples and use one for training and one for testing. The more training data, the better the model; the more test data, the more accurate the error estimate.

Problem: obtaining data is often expensive and time consuming. Example: a corpus annotated with word senses, or experimental data on subcat preferences.

Solution: obtain a limited data set and use a holdout procedure. Most straightforward: a random split into test and training set. Typically between 1/3 and 1/10 of the data is held out for testing.
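As an illustration, here is a minimal sketch of a random holdout split in plain Python; the toy data set and the 1/3 test fraction are hypothetical, not part of the slides:

```python
import random

def holdout_split(data, test_fraction=1/3, seed=0):
    """Randomly split a data set into a training set and a test set."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_fraction)
    return items[n_test:], items[:n_test]   # (training set, test set)

# Hypothetical toy data: 30 (features, label) instances.
data = [((i,), "A" if i % 3 else "B") for i in range(30)]
train, test = holdout_split(data)
print(len(train), len(test))                # 20 10
```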
Stratification
Problem: the split into training and test set might be unrepresentative, e.g., a certain class is not represented in the training set, so the model will not learn to classify it.

Solution: use stratified holdout, i.e., sample in such a way that each class is represented in both sets.

Example: data set with two classes A and B; aim: construct a 10% test set. Take a 10% sample of all instances of class A plus a 10% sample of all instances of class B. However, this procedure doesn't work well on small data sets.
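A sketch of stratified holdout under the same assumptions (hypothetical (features, label) pairs): each class contributes the same fraction of its instances to the test set.

```python
import random
from collections import defaultdict

def stratified_holdout(data, test_fraction=0.1, seed=0):
    """Sample the same fraction of each class for the test set."""
    by_class = defaultdict(list)
    for features, label in data:
        by_class[label].append((features, label))
    rng = random.Random(seed)
    train, test = [], []
    for items in by_class.values():
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])          # 10% of this class
        train.extend(items[n_test:])         # remaining 90%
    return train, test

# Usage (hypothetical data with classes "A" and "B"):
# train, test = stratified_holdout(data, test_fraction=0.1)
```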
Crossvalidation

Solution: k-fold crossvalidation maximizes the use of the data.

Divide the data randomly into k folds. Train the model on k − 1 folds, use one fold for testing. Repeat this k times so that all folds are used for testing. Compute the average performance on the k test sets. This effectively uses all the data for both training and testing.

Typically k = 10 is used. Sometimes stratified k-fold crossvalidation is used.

Example: data set with 20 instances, 5-fold crossvalidation:

fold   test set              training set
1      i1, i2, i3, i4        i5, ..., i20
2      i5, i6, i7, i8        i1, ..., i4, i9, ..., i20
3      i9, i10, i11, i12     i1, ..., i8, i13, ..., i20
4      i13, i14, i15, i16    i1, ..., i12, i17, ..., i20
5      i17, i18, i19, i20    i1, ..., i16

Compute the error rate for each fold, then compute the average error rate.
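The procedure can be sketched as follows; `train_fn` (fits a model on a training set) and `error_fn` (error rate of a model on a test set) are hypothetical callables standing in for whatever learner is being evaluated:

```python
import random

def crossvalidate(data, k, train_fn, error_fn, seed=0):
    """k-fold crossvalidation: average error over k train/test splits."""
    items = list(data)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]           # k roughly equal folds
    errors = []
    for i in range(k):
        test = folds[i]                               # one fold for testing
        train = [x for j, fold in enumerate(folds)    # remaining k - 1 folds
                 if j != i for x in fold]
        errors.append(error_fn(train_fn(train), test))
    return sum(errors) / k                            # average error rate
```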
Leave-one-out
Leave-one-out crossvalidation is simply k-fold crossvalidation with k set to n, the number of instances in the data set.
This means that the test set consists of only a single instance, which will be classified either correctly or incorrectly.

Advantages: maximal use of training data, i.e., training on n − 1 instances; the procedure is deterministic, no sampling involved.

Disadvantages: infeasible for large data sets (a large number of training runs is required, at high computational cost); cannot be stratified (there is only one class in the test set).
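Leave-one-out then needs no sampling at all; with the same hypothetical `train_fn` and a `classify_fn` that labels a single instance, a sketch looks like this:

```python
def leave_one_out(data, train_fn, classify_fn):
    """n-fold crossvalidation with a single test instance per fold."""
    errors = 0
    for i, (features, label) in enumerate(data):
        train = data[:i] + data[i + 1:]        # train on n - 1 instances
        model = train_fn(train)
        if classify_fn(model, features) != label:
            errors += 1                        # instance misclassified
    return errors / len(data)                  # overall error rate
```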
Comparing Algorithms
Assume you want to compare the performance of two machine learning algorithms A and B on the same data set. You could use crossvalidation to determine the error rates of A and B and then compute the difference.

Problem: sampling is involved (in obtaining the data and in crossvalidation), hence there is variance in the error rates.

Solution: determine whether the difference between the error rates is statistically significant. If the crossvalidations of A and B use the same random division of the data (the same folds), then a paired t-test is appropriate.
Comparing Algorithms
Let the k samples of the error rate of algorithm A be denoted by x_1, ..., x_k and the k samples of the error rate of B by y_1, ..., y_k. Then the t statistic for a paired t-test is:

(1) $t = \frac{\bar{d}}{\sqrt{\sigma_d^2 / k}}$

where $\bar{d}$ is the mean of the differences $d_i = x_i - y_i$, and $\sigma_d^2$ is the variance of the differences.
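A sketch of the computation of (1) using only the standard library; the per-fold error rates are invented for illustration:

```python
import math
from statistics import mean, variance

def paired_t(xs, ys):
    """t statistic of equation (1) for paired per-fold error rates."""
    diffs = [x - y for x, y in zip(xs, ys)]
    k = len(diffs)
    return mean(diffs) / math.sqrt(variance(diffs) / k)

# Hypothetical error rates of A and B on the same 10 folds:
a = [0.21, 0.19, 0.25, 0.22, 0.18, 0.23, 0.20, 0.24, 0.19, 0.21]
b = [0.24, 0.22, 0.26, 0.25, 0.21, 0.24, 0.23, 0.27, 0.22, 0.23]
print(paired_t(a, b))   # compare against t with k - 1 degrees of freedom
```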
Confusion Matrix
Assume a two-way classification. Four classification outcomes are possible, which can be displayed in a confusion matrix:

                    predicted yes      predicted no
actual class yes    true positive      false negative
actual class no     false positive     true negative
True positives (TP): class members classified as class members
True negatives (TN): class non-members classified as non-members
False positives (FP): class non-members classified as class members
False negatives (FN): class members classified as class non-members
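Counting the four outcomes is straightforward; a minimal sketch with invented labels, treating "yes" as the positive class:

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, FP, FN, TN for a two-way classification."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a != positive and p == positive:
            fp += 1
        elif a == positive and p != positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]
print(confusion_counts(actual, predicted))   # (2, 1, 1, 2)
```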
Precision: the number of class members classified correctly over the total number of instances classified as class members.

(2) $\text{Precision} = \frac{TP}{TP + FP}$

Example: consider a model that detects oil slicks. A false positive means wrongly identifying an oil slick if there is none; a false negative means failing to identify an oil slick if there is one. Here, false negatives (environmental disasters) are much more costly than false positives (false alarms). We have to take that into account when we evaluate our model.
Recall: the number of class members classified correctly over the total number of class members.

(3) $\text{Recall} = \frac{TP}{TP + FN}$

The F-measure combines the two:

(4) $F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

Example: for the oil slick scenario, we want to maximize recall (avoiding environmental disasters); maximizing precision (avoiding false alarms) is less important. The F-measure can be used if precision and recall are equally important.
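Equations (2)-(4) translate directly into code; a sketch building on the hypothetical `confusion_counts` above (the guards against empty denominators are an added convention, not part of the slides):

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

print(precision_recall_f(tp=2, fp=1, fn=1))   # (0.667, 0.667, 0.667) approx.
```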
Evaluating Numeric Prediction

For numeric prediction, let p_1, ..., p_n be the predicted values for instances 1, ..., n, and a_1, ..., a_n the actual values for instances 1, ..., n.

There are several measures that compare the a_i and the p_i, among them the mean squared error and the root mean squared error:

(5) $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (p_i - a_i)^2$

(6) $\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (p_i - a_i)^2}$
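A sketch of (5) and (6) in plain Python, with invented predicted and actual values:

```python
import math

def mse(predicted, actual):
    """Mean squared error between predicted and actual values."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root mean squared error: same units as the target variable."""
    return math.sqrt(mse(predicted, actual))

p = [2.5, 0.0, 2.1, 7.8]
a = [3.0, -0.5, 2.0, 8.0]
print(mse(p, a), rmse(p, a))   # 0.1375 0.3708...
```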
Summary
No matter how the performance of the model is measured (precision, recall, MSE, correlation), we always need to measure it on the test set, not on the training set. Performance on the training set only tells us that the model learns what it's supposed to learn; it is not a good indicator of performance on unseen data.

The test set can be obtained using an independent sample or holdout techniques (crossvalidation, leave-one-out).

To meaningfully compare the performance of two algorithms on a given type of data, we need to determine whether a difference in performance is significant. We also need to compare performance against a baseline (chance or frequency); a sketch of a frequency baseline follows below.
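A frequency baseline can be as simple as always predicting the most frequent class in the training data; a minimal sketch with invented labels:

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(label == majority for label in test_labels) / len(test_labels)

print(majority_baseline(["A", "A", "B"], ["A", "B", "A", "A"]))   # 0.75
```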
References

Mitchell, Tom M. 1997. Machine Learning. New York: McGraw-Hill.

Witten, Ian H., and Eibe Frank. 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco: Morgan Kaufmann.