
Model Evaluation
• Evaluation metrics: how can we measure accuracy?
• Use a validation/test set of class-labeled tuples, rather than the training set, when assessing accuracy
• Methods for estimating a classifier's accuracy:
  • Holdout method, random subsampling
  • Cross-validation
  • Bootstrap
Evaluating classifiers
•Outcome:
•Accuracy;
•Confusion matrix;
•If cost-sensitive, the expected cost of
classification (attribute test cost +
misclassification cost);
etc.
Confusion Matrix
• Also called an error matrix
• Most commonly used for binary classification
• Visualizes the information needed for performance evaluation
• Categorizes predictions by correctness and class
Example of Confusion Matrix
(counts reconstructed from the accuracy, precision and recall calculations that follow)

Actual \ Predicted   Positive    Negative    Total
Positive (P)         TP = 6954   FN = 46     7000
Negative (N)         FP = 412    TN = 2588   3000
Total                7366        2634        10000
Accuracy and Error Rate
• Accuracy = (TP + TN) / ALL = (6954 + 2588) / 10000 = 0.9542
• Error rate = (FP + FN) / ALL = 1 - Accuracy = (412 + 46) / 10000 = 0.0458
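As a minimal illustration (not part of the original slides), these two metrics can be computed directly from the confusion-matrix counts above:

# Sketch: accuracy and error rate from confusion-matrix counts
TP, TN, FP, FN = 6954, 2588, 412, 46

total = TP + TN + FP + FN
accuracy = (TP + TN) / total          # 0.9542
error_rate = (FP + FN) / total        # 0.0458, equal to 1 - accuracy

print(f"accuracy   = {accuracy:.4f}")
print(f"error rate = {error_rate:.4f}")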
Problem of Imbalanced Data
• Some classes may be much rarer than others (e.g., fraud)
• A classifier can have high accuracy yet be unsatisfactory:
  predicting ~C for everything gives 99% accuracy on the data below
• Sensitivity: true positive recognition rate = TP/P = 0/1 = 0%
• Specificity: true negative recognition rate = TN/N = 99/99 = 100%

Actual \ Predicted   C    ~C    Total
C                    0    1     1
~C                   0    99    99
Total                0    100   100
Classifier Evaluation Metrics:
Precision and Recall
• Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive?
  Precision = TP / (TP + FP)
• Recall: completeness – what % of positive tuples did the classifier label as positive?
  Recall = TP / (TP + FN)
• A perfect score is 1.0
• There is typically an inverse relationship between precision and recall
Precision and Recall
• Focus on a single class (usually the positive class in binary classification)
• Precision: exactness, precision of the positive predictions
  TP / (TP + FP) = 6954 / (6954 + 412) = 0.9440
• Recall: completeness, recall of the positive instances
  TP / (TP + FN) = 6954 / (6954 + 46) = 0.9934
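A small sketch (again, not from the slides) computing precision and recall from the same counts:

# Sketch: precision and recall from the confusion-matrix counts above
TP, FP, FN = 6954, 412, 46

precision = TP / (TP + FP)   # 0.9440 – fraction of predicted positives that are truly positive
recall    = TP / (TP + FN)   # 0.9934 – fraction of true positives that were found

print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")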
Classifier Evaluation Metrics: Example

Actual class \ Predicted class   cancer = yes   cancer = no   Total    Recognition (%)
cancer = yes                     90             210           300      30.00 (sensitivity)
cancer = no                      140            9560          9700     98.56 (specificity)
Total                            230            9770          10000    96.40 (accuracy)

• Precision = 90/230 = 39.13%    Recall = 90/300 = 30.00%


Scoring and ranking evaluation method
• Scoring is related to classification.
• We are interested in a single class (the positive class), e.g., the buyers class in a marketing database.
• Instead of assigning each test instance a definite class, scoring assigns a probability estimate (PE) indicating the likelihood that the example belongs to the positive class.
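For illustration only (the slides do not name a library), a classifier that exposes probability estimates can be used for scoring; assuming scikit-learn and synthetic data:

# Sketch: obtaining probability estimates (PE) to use as scores
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]   # likelihood of the positive class per test instance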
Ranking and lift analysis
• After each example is given a PE score, we can rank all examples according to their PEs.
• We then divide the ranked data into n (say 10) bins. A lift curve can be drawn according to how many positive examples fall in each bin. This is called lift analysis.
• Classification systems can be used for scoring, but they need to produce a probability estimate.
Example
A businessman wants to send promotion materials to potential customers to sell a watch.
Each package costs $0.50 to send (material and postage).
If a watch is sold, the businessman makes a $5 profit.
Suppose the businessman has a large amount of past data for building a predictive/classification model, as well as a large list of potential customers.
How many packages should he send, and to whom should he send them?
Example (cont.)
• Assume that the test set has 10000 instances. Out of these, 500 are positive cases.
• After the classifier is built, we score each test instance, rank the test set, and divide the ranked test set into 10 bins.
• Each bin has 1000 test instances.
  • Bin 1 has 210 actual positive instances
  • Bin 2 has 120 actual positive instances
  • Bin 3 has 60 actual positive instances
  • …
  • Bin 10 has 5 actual positive instances
Lift curve

Bin                    1     2     3     4     5      6      7      8      9     10
Positive instances     210   120   60    40    22     18     12     7      6     5
% of total positives   42%   24%   12%   8%    4.4%   3.6%   2.4%   1.4%   1.2%  1%
Cumulative %           42%   66%   78%   86%   90.4%  94%    96.4%  97.8%  99%   100%

[Lift chart: cumulative percent of total positive cases (y-axis) vs. percent of testing cases (x-axis), comparing the lift curve against random selection]
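A rough sketch of the lift computation (not from the slides): sort test instances by score, split them into 10 bins, and count the actual positives per bin. The names here are illustrative.

# Sketch: lift analysis – rank by score, split into bins, count positives per bin
import numpy as np

def lift_table(scores, labels, n_bins=10):
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(scores)[::-1]                 # highest scores first
    bins = np.array_split(labels[order], n_bins)
    positives = np.array([b.sum() for b in bins])    # actual positives in each bin
    pct = positives / labels.sum()                   # share of all positives per bin
    return positives, pct, np.cumsum(pct)

# e.g. positives, pct, cumulative = lift_table(scores, y_test)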
Generating datasets
Methods:
• Holdout (2/3 training, 1/3 testing)
• Cross-validation (n-fold)
  • Divide the data into n parts
  • Train on n-1 parts, test on the remaining part
  • Repeat for the different combinations
• Bootstrapping
  • Select random samples (with replacement) to form the training set
Holdout method
• The holdout method is the simplest kind of cross
validation.
• The data set is separated into two sets, called the
training set and the testing set.
• The function approximator fits a function using the
training set only.
• Then the function approximator is asked to predict
the output values for the data in the testing set
(it has never seen these output values before).
Holdout Method
• This is the basic approach to estimating the quality of a prediction.
• The classifier is learned from the training set and evaluated on the testing set.
• The proportion of training to testing data is at the analyst's discretion, typically 1:1 or 2:1, and there is a trade-off between the sizes of the two sets.
• If the training set is too large, the model may be good, but the estimate may be less reliable because the testing set is small, and vice versa.
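A minimal holdout sketch, assuming scikit-learn and one of its bundled datasets (the slides are library-agnostic), using the 2:1 split suggested above:

# Sketch: holdout evaluation with a 2:1 train/test split
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)           # 2/3 training, 1/3 testing

clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("holdout accuracy:", clf.score(X_test, y_test))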
Cross-Validation
In cross-validation the original sample is split into two
parts. One part is called the training (or derivation)
sample, and the other part is called the validation (or
validation + testing) sample.
1) What portion of the sample should be in each part?
If sample size is very large, it is often best to split the
sample in half. For smaller samples, it is more
conventional to split the sample such that 2/3 of the
observations are in the derivation sample and 1/3 are in
the validation sample.
Cross-Validation
2) How should the sample be split?
The most common approach is to divide the sample randomly, thus theoretically eliminating any systematic differences.
Modeling of the data uses one part only. The model selected for this part is then used to predict the values in the other part of the data. A valid model should show good predictive accuracy.
Cross-Validation
1. Divide the data into three sets: training, validation and test sets
2. Find the optimal model on the training set, using the validation set to compare candidate models
3. See how well the chosen model can predict the test set
4. The test-set error gives an unbiased estimate of the predictive power of the model
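As an illustration (not from the slides), a three-way split can be made with two successive random splits; the 60/20/20 proportions below are an assumption:

# Sketch: train / validation / test split (60% / 20% / 20%)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
# fit candidate models on (X_train, y_train), pick the best on (X_val, y_val),
# and report the final, unbiased estimate on (X_test, y_test)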
Cross-Validation
• Split the original set of examples D and train on the training portion
[Figure: positive (+) and negative (−) examples in D, the Train subset, and a hypothesis in hypothesis space H]
Cross-Validation
• Evaluate the hypothesis on the testing set
[Figure: the testing-set examples with their true labels, and the hypothesis in hypothesis space H]
Cross-Validation
• Evaluate the hypothesis on the testing set
[Figure: the labels predicted by the hypothesis for the testing-set examples in hypothesis space H]
Cross-Validation
• Compare the true concept against the prediction: 9/13 correct
[Figure: predicted vs. true labels for the 13 testing examples; 9 of 13 agree]
K-fold Cross Validation
1. Split the data into 5 samples (folds).
2. Fit a model to the training samples and use the held-out test sample to calculate a CV metric.
3. Repeat the process for the next sample, until every sample has been used once to test the model.
[Figures: the train/test fold assignments across the k iterations]
Bootstrapping
• Technique for estimating the confidence in the model parameters θ
• Procedure:
  1. Draw k hypothetical datasets from the original data, either via cross-validation or by sampling with replacement.
  2. Fit the model to each dataset to compute parameters θ_1, …, θ_k.
  3. Return the standard deviation of θ_1, …, θ_k (or a confidence interval).
• Can also estimate the confidence in a prediction y = f(x)
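A rough illustration of this procedure (my sketch, not the slides'), bootstrapping the uncertainty of a simple parameter, the sample mean:

# Sketch: bootstrap estimate of the uncertainty in a sample mean
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=200)   # stand-in for the original data

k = 1000
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()   # resample with replacement, refit
    for _ in range(k)
])

print("bootstrap std. error of the mean:", boot_means.std())
print("95% confidence interval:", np.percentile(boot_means, [2.5, 97.5]))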
Bootstrap Method
• The Bootstrap method samples training records with replacement.
• Each time a record is selected for the training set, it is put back into the original pool of records, so that it is equally likely to be redrawn in the next draw.
• In other words, the Bootstrap method samples the given data set uniformly with replacement.
• The rationale for this strategy is that it lets some records occur more than once in the samples used for training as well as testing.
• What is the probability that a record will be selected more than once?
Bootstrap Method (sample with n = 3 observations)
[Figure: the possible bootstrap samples drawn with replacement from 3 observations]
Bootstrap Method: Implication
• A bootstrap sample of size n contains, on average, about 63.2% of the distinct original records: the probability that a given record is selected at least once is 1 − (1 − 1/n)^n, which approaches 1 − 1/e ≈ 0.632 as n grows.
• This is why the Bootstrap method is also known as the 0.632 bootstrap method.
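A quick numeric check of that limit (illustrative, not from the slides):

# Sketch: probability that a record appears at least once in a bootstrap sample
for n in (10, 100, 1000, 10000):
    p = 1 - (1 - 1/n) ** n
    print(n, round(p, 4))      # tends to 1 - 1/e ≈ 0.6321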
Evaluating which method works best for classification
• No model is uniformly the best
• Dimensions for comparison:
  • speed of training
  • speed of model application
  • noise tolerance
  • explanation ability
• Best results: hybrid, integrated models
Pseudo-code
For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears;
        find the most frequent class;
        make the rule assign that class to this attribute-value.
    Calculate the error rate of the rules.
Choose the attribute whose rules have the smallest total error rate.
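A compact Python sketch of this rule-per-attribute procedure (my illustration; the data layout and column names are assumptions that follow the weather table on the next slide):

# Sketch: learn one rule set per attribute and keep the one with the fewest errors
from collections import Counter

def one_rule(rows, attributes, target):
    best = None
    for attr in attributes:
        rules, errors = {}, 0
        for value in {r[attr] for r in rows}:
            # most frequent class among rows with this attribute value
            classes = Counter(r[target] for r in rows if r[attr] == value)
            majority, count = classes.most_common(1)[0]
            rules[value] = majority
            errors += sum(classes.values()) - count      # misclassified rows for this value
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best   # (attribute, {value: predicted class}, total errors)

# e.g. one_rule(weather_rows, ["Outlook", "Temp", "Humidity", "Windy"], "Play")
#      -> ("Outlook", {"Sunny": "No", "Overcast": "Yes", "Rainy": "Yes"}, 4)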
Evaluating the weather attributes

Weather data:
Outlook    Temp.  Humidity  Windy  Play
Sunny      Hot    High      False  No
Sunny      Hot    High      True   No
Overcast   Hot    High      False  Yes
Rainy      Mild   High      False  Yes
Rainy      Cool   Normal    False  Yes
Rainy      Cool   Normal    True   No
Overcast   Cool   Normal    True   Yes
Sunny      Mild   High      False  No
Sunny      Cool   Normal    False  Yes
Rainy      Mild   Normal    False  Yes
Sunny      Mild   Normal    True   Yes
Overcast   Mild   High      True   Yes
Overcast   Hot    Normal    False  Yes
Rainy      Mild   High      True   No

Resulting rules:
Attribute     Rules             Errors   Total errors
Outlook       Sunny → No        2/5      4/14
              Overcast → Yes    0/4
              Rainy → Yes       2/5
Temperature   Hot → No*         2/4      5/14
              Mild → Yes        2/6
              Cool → Yes        1/4
Humidity      High → No         3/7      4/14
              Normal → Yes      1/7
Windy         False → Yes       2/8      5/14
              True → No*        3/6
(* indicates an arbitrary choice between classes that are equally frequent)
Dealing with numeric attributes
• Numeric attributes are discretized: the range of the
attribute is divided into a set of intervals
• Instances are sorted according to attribute’s
values
• Breakpoints are placed where the (majority) class
changes (so that the total error is minimized)
• Example: temperature from weather data (in Fahrenheit)
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No
Result of overfitting avoidance
• Final result for the temperature attribute (in Fahrenheit):
  64 65 68 69 70 71 72 72 75 75 80 81 83 85
  Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
• Resulting rule sets:

Attribute     Rules                     Errors   Total errors
Outlook       Sunny → No                2/5      4/14
              Overcast → Yes            0/4
              Rainy → Yes               2/5
Temperature   ≤ 77.5 → Yes              3/10     5/14
              > 77.5 → No*              2/4
Humidity      ≤ 82.5 → Yes              1/7      3/14
              > 82.5 and ≤ 95.5 → No    2/6
              > 95.5 → Yes              0/1
Windy         False → Yes               2/8      5/14
              True → No*                3/6
Model Overfitting

Underfitting: when model is too simple, both training and test errors
are large
Overfitting: when model is too complex, training error is small but
test error is large
Model Overfitting
Using twice the number of data instances:
• If the training data is under-representative, testing errors increase and training errors decrease as the number of nodes increases
• Increasing the size of the training data reduces the difference between training and testing errors at a given number of nodes
Multiple Comparison Procedure
• Consider the task of predicting whether the stock market will rise or fall on each of the next 10 trading days
  (example outcomes: Day 1 Up, Day 2 Down, Day 3 Down, Day 4 Up, Day 5 Down, Day 6 Down, Day 7 Up, Day 8 Up, Day 9 Up, Day 10 Down)
• Random guessing: P(correct) = 0.5
• Make 10 random guesses in a row:
  P(# correct ≥ 8) = [ C(10,8) + C(10,9) + C(10,10) ] / 2^10 = (45 + 10 + 1) / 1024 = 0.0547
Effect of Multiple Comparison Procedure
Approach:
• Get 50 analysts
• Each analyst makes 10 random guesses
• Choose the analyst who makes the most correct predictions

Probability that at least one analyst makes at least 8 correct predictions:
P(# correct ≥ 8) = 1 − (1 − 0.0547)^50 = 0.9399
