
Model Evaluation
• Evaluation metrics: how can we measure accuracy?
• Use a validation/test set of class-labeled tuples, rather than the training set, when assessing accuracy
• Methods for estimating a classifier's accuracy:
  • Holdout method, random subsampling
  • Cross-validation
  • Bootstrap
Evaluating classifiers
•Outcome:
•Accuracy;
•Confusion matrix;
•If cost-sensitive, the expected cost of
classification (attribute test cost +
misclassification cost);
etc.
Confusion Matrix
• Also called an error matrix
• Most commonly used for binary classification
• Visualizes the information needed for performance evaluation
• Categorizes predictions by correctness and class
Example of Confusion Matrix
(counts reconstructed from the accuracy, precision and recall calculations that follow)

Actual \ Predicted   Positive    Negative    Total
Positive (P)         TP = 6954   FN = 46     7000
Negative (N)         FP = 412    TN = 2588   3000
Total                7366        2634        10000
Accuracy and Error Rate
• Accuracy = (TP + TN) / ALL = (6954 + 2588) / 10000 = 0.9542
• Error rate = (FP + FN) / ALL = 1 - Accuracy = (412 + 46) / 10000 = 0.0458
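As a minimal illustration (not part of the original slides), these two metrics can be computed directly from the confusion-matrix counts above:

# Sketch: accuracy and error rate from confusion-matrix counts
TP, TN, FP, FN = 6954, 2588, 412, 46

total = TP + TN + FP + FN
accuracy = (TP + TN) / total          # 0.9542
error_rate = (FP + FN) / total        # 0.0458, equal to 1 - accuracy

print(f"accuracy   = {accuracy:.4f}")
print(f"error rate = {error_rate:.4f}")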
Problem of Imbalanced Data
• Some classes may be much rarer than others (e.g., fraud)
• A classifier can have high accuracy yet be unsatisfactory:
  predicting ~C for everything gives 99% accuracy on the data below
• Sensitivity: true positive recognition rate = TP/P = 0/1 = 0%
• Specificity: true negative recognition rate = TN/N = 99/99 = 100%

Actual \ Predicted   C    ~C    Total
C                    0    1     1
~C                   0    99    99
Total                0    100   100
Classifier Evaluation Metrics:
Precision and Recall
• Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive?
  Precision = TP / (TP + FP)
• Recall: completeness – what % of positive tuples did the classifier label as positive?
  Recall = TP / (TP + FN)
• A perfect score is 1.0
• There is typically an inverse relationship between precision and recall
Precision and Recall
• Focus on a single class (usually the positive class in binary classification)
• Precision: exactness, precision of the positive predictions
  TP / (TP + FP) = 6954 / (6954 + 412) = 0.9440
• Recall: completeness, recall of the positive instances
  TP / (TP + FN) = 6954 / (6954 + 46) = 0.9934
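A small sketch (again, not from the slides) computing precision and recall from the same counts:

# Sketch: precision and recall from the confusion-matrix counts above
TP, FP, FN = 6954, 412, 46

precision = TP / (TP + FP)   # 0.9440 – fraction of predicted positives that are truly positive
recall    = TP / (TP + FN)   # 0.9934 – fraction of true positives that were found

print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")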
Classifier Evaluation Metrics: Example

Actual class \ Predicted class   cancer = yes   cancer = no   Total    Recognition (%)
cancer = yes                     90             210           300      30.00 (sensitivity)
cancer = no                      140            9560          9700     98.56 (specificity)
Total                            230            9770          10000    96.40 (accuracy)

• Precision = 90/230 = 39.13%    Recall = 90/300 = 30.00%


Scoring and ranking evaluation method
• Scoring is related to classification.
• We are interested in a single class (the positive class), e.g., the buyers class in a marketing database.
• Instead of assigning each test instance a definite class, scoring assigns a probability estimate (PE) indicating the likelihood that the example belongs to the positive class.
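For illustration only (the slides do not name a library), a classifier that exposes probability estimates can be used for scoring; assuming scikit-learn and synthetic data:

# Sketch: obtaining probability estimates (PE) to use as scores
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]   # likelihood of the positive class per test instance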
Ranking and lift analysis
• After each example is given a PE score, we can rank all examples according to their PEs.
• We then divide the ranked data into n (say 10) bins. A lift curve can be drawn according to how many positive examples fall in each bin. This is called lift analysis.
• Classification systems can be used for scoring, but they need to produce a probability estimate.
Example
A businessman wants to send promotion materials to potential customers to sell a watch.
Each package costs $0.50 to send (material and postage).
If a watch is sold, the businessman makes a $5 profit.
Suppose the businessman has a large amount of past data for building a predictive/classification model, as well as a large list of potential customers.
How many packages should he send, and to whom should he send them?
Example (cont.)
• Assume that the test set has 10000 instances. Out of these, 500 are positive cases.
• After the classifier is built, we score each test instance, rank the test set, and divide the ranked test set into 10 bins.
• Each bin has 1000 test instances.
  • Bin 1 has 210 actual positive instances
  • Bin 2 has 120 actual positive instances
  • Bin 3 has 60 actual positive instances
  • …
  • Bin 10 has 5 actual positive instances
Lift curve

Bin                    1     2     3     4     5      6      7      8      9     10
Positive instances     210   120   60    40    22     18     12     7      6     5
% of total positives   42%   24%   12%   8%    4.4%   3.6%   2.4%   1.4%   1.2%  1%
Cumulative %           42%   66%   78%   86%   90.4%  94%    96.4%  97.8%  99%   100%

[Lift chart: cumulative percent of total positive cases (y-axis) vs. percent of testing cases (x-axis), comparing the lift curve against random selection]
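A rough sketch of the lift computation (not from the slides): sort test instances by score, split them into 10 bins, and count the actual positives per bin. The names here are illustrative.

# Sketch: lift analysis – rank by score, split into bins, count positives per bin
import numpy as np

def lift_table(scores, labels, n_bins=10):
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(scores)[::-1]                 # highest scores first
    bins = np.array_split(labels[order], n_bins)
    positives = np.array([b.sum() for b in bins])    # actual positives in each bin
    pct = positives / labels.sum()                   # share of all positives per bin
    return positives, pct, np.cumsum(pct)

# e.g. positives, pct, cumulative = lift_table(scores, y_test)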
Generating datasets
Methods:
• Holdout (2/3 training, 1/3 testing)
• Cross-validation (n-fold)
  • Divide the data into n parts
  • Train on n-1 parts, test on the remaining part
  • Repeat for the different combinations
• Bootstrapping
  • Select random samples (with replacement) to form the training set
Holdout method
• The holdout method is the simplest kind of cross
validation.
• The data set is separated into two sets, called the
training set and the testing set.
• The function approximator fits a function using the
training set only.
• Then the function approximator is asked to predict
the output values for the data in the testing set
(it has never seen these output values before).
Holdout Method
• This is the basic approach to estimating the quality of a prediction.
• The classifier is learned from the training set and evaluated on the testing set.
• The proportion of training to testing data is at the analyst's discretion, typically 1:1 or 2:1, and there is a trade-off between the sizes of the two sets.
• If the training set is too large, the model may be good, but the estimate may be less reliable because the testing set is small, and vice versa.
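A minimal holdout sketch, assuming scikit-learn and one of its bundled datasets (the slides are library-agnostic), using the 2:1 split suggested above:

# Sketch: holdout evaluation with a 2:1 train/test split
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)           # 2/3 training, 1/3 testing

clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("holdout accuracy:", clf.score(X_test, y_test))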
Cross-Validation
In cross-validation the original sample is split into two
parts. One part is called the training (or derivation)
sample, and the other part is called the validation (or
validation + testing) sample.
1) What portion of the sample should be in each part?
If sample size is very large, it is often best to split the
sample in half. For smaller samples, it is more
conventional to split the sample such that 2/3 of the
observations are in the derivation sample and 1/3 are in
the validation sample.
Cross-Validation
2) How should the sample be split?
The most common approach is to divide the sample randomly, thus theoretically eliminating any systematic differences.
Modeling of the data uses one part only. The model selected for this part is then used to predict the values in the other part of the data. A valid model should show good predictive accuracy.
Cross-Validation
1. Divide the data into three sets: training, validation and test sets
2. Find the optimal model on the training set, using the validation set to compare candidate models
3. See how well the chosen model can predict the test set
4. The test-set error gives an unbiased estimate of the predictive power of the model
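As an illustration (not from the slides), a three-way split can be made with two successive random splits; the 60/20/20 proportions below are an assumption:

# Sketch: train / validation / test split (60% / 20% / 20%)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
# fit candidate models on (X_train, y_train), pick the best on (X_val, y_val),
# and report the final, unbiased estimate on (X_test, y_test)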
Cross-Validation
• Split the original set of examples D and train on the training portion
[Figure: positive (+) and negative (−) examples in D, the Train subset, and a hypothesis in hypothesis space H]
Cross-Validation
• Evaluate the hypothesis on the testing set
[Figure: the testing-set examples with their true labels, and the hypothesis in hypothesis space H]
Cross-Validation
• Evaluate the hypothesis on the testing set
[Figure: the labels predicted by the hypothesis for the testing-set examples in hypothesis space H]
Cross-Validation
• Compare the true concept against the prediction: 9/13 correct
[Figure: predicted vs. true labels for the 13 testing examples; 9 of 13 agree]
K-fold Cross Validation
1. Split the data into 5 samples (folds).
2. Fit a model to the training samples and use the held-out test sample to calculate a CV metric.
3. Repeat the process for the next sample, until every sample has been used once to test the model.
[Figures: the train/test fold assignments across the k iterations]
Bootstrapping
• Technique for estimating the confidence in the model parameters θ
• Procedure:
  1. Draw k hypothetical datasets from the original data, either via cross-validation or by sampling with replacement.
  2. Fit the model to each dataset to compute parameters θ_1, …, θ_k.
  3. Return the standard deviation of θ_1, …, θ_k (or a confidence interval).
• Can also estimate the confidence in a prediction y = f(x)
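A rough illustration of this procedure (my sketch, not the slides'), bootstrapping the uncertainty of a simple parameter, the sample mean:

# Sketch: bootstrap estimate of the uncertainty in a sample mean
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=200)   # stand-in for the original data

k = 1000
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()   # resample with replacement, refit
    for _ in range(k)
])

print("bootstrap std. error of the mean:", boot_means.std())
print("95% confidence interval:", np.percentile(boot_means, [2.5, 97.5]))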
Bootstrap Method
• The Bootstrap method samples training records with replacement.
• Each time a record is selected for the training set, it is put back into the original pool of records, so that it is equally likely to be redrawn in the next draw.
• In other words, the Bootstrap method samples the given data set uniformly with replacement.
• The rationale for this strategy is that it lets some records occur more than once in the samples used for training as well as testing.
• What is the probability that a record will be selected more than once?
Bootstrap Method (sample with n = 3 observations)
[Figure: the possible bootstrap samples drawn with replacement from 3 observations]
Bootstrap Method: Implication
• A bootstrap sample of size n contains, on average, about 63.2% of the distinct original records: the probability that a given record is selected at least once is 1 − (1 − 1/n)^n, which approaches 1 − 1/e ≈ 0.632 as n grows.
• This is why the Bootstrap method is also known as the 0.632 bootstrap method.
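A quick numeric check of that limit (illustrative, not from the slides):

# Sketch: probability that a record appears at least once in a bootstrap sample
for n in (10, 100, 1000, 10000):
    p = 1 - (1 - 1/n) ** n
    print(n, round(p, 4))      # tends to 1 - 1/e ≈ 0.6321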
Evaluating which method works best for classification
• No model is uniformly the best
• Dimensions for comparison:
  • speed of training
  • speed of model application
  • noise tolerance
  • explanation ability
• Best results: hybrid, integrated models
Pseudo-code
For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears;
        find the most frequent class;
        make the rule assign that class to this attribute-value.
    Calculate the error rate of the rules.
Choose the attribute whose rules have the smallest total error rate.
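A compact Python sketch of this rule-per-attribute procedure (my illustration; the data layout and column names are assumptions that follow the weather table on the next slide):

# Sketch: learn one rule set per attribute and keep the one with the fewest errors
from collections import Counter

def one_rule(rows, attributes, target):
    best = None
    for attr in attributes:
        rules, errors = {}, 0
        for value in {r[attr] for r in rows}:
            # most frequent class among rows with this attribute value
            classes = Counter(r[target] for r in rows if r[attr] == value)
            majority, count = classes.most_common(1)[0]
            rules[value] = majority
            errors += sum(classes.values()) - count      # misclassified rows for this value
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best   # (attribute, {value: predicted class}, total errors)

# e.g. one_rule(weather_rows, ["Outlook", "Temp", "Humidity", "Windy"], "Play")
#      -> ("Outlook", {"Sunny": "No", "Overcast": "Yes", "Rainy": "Yes"}, 4)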
Evaluating the weather attributes

Weather data:
Outlook    Temp.  Humidity  Windy  Play
Sunny      Hot    High      False  No
Sunny      Hot    High      True   No
Overcast   Hot    High      False  Yes
Rainy      Mild   High      False  Yes
Rainy      Cool   Normal    False  Yes
Rainy      Cool   Normal    True   No
Overcast   Cool   Normal    True   Yes
Sunny      Mild   High      False  No
Sunny      Cool   Normal    False  Yes
Rainy      Mild   Normal    False  Yes
Sunny      Mild   Normal    True   Yes
Overcast   Mild   High      True   Yes
Overcast   Hot    Normal    False  Yes
Rainy      Mild   High      True   No

Resulting rules:
Attribute     Rules             Errors   Total errors
Outlook       Sunny → No        2/5      4/14
              Overcast → Yes    0/4
              Rainy → Yes       2/5
Temperature   Hot → No*         2/4      5/14
              Mild → Yes        2/6
              Cool → Yes        1/4
Humidity      High → No         3/7      4/14
              Normal → Yes      1/7
Windy         False → Yes       2/8      5/14
              True → No*        3/6
(* indicates an arbitrary choice between classes that are equally frequent)
Dealing with numeric attributes
• Numeric attributes are discretized: the range of the
attribute is divided into a set of intervals
• Instances are sorted according to attribute’s
values
• Breakpoints are placed where the (majority) class
changes (so that the total error is minimized)
• Example: temperature from weather data (in Fahrenheit)
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No
Result of overfitting avoidance
• Final result for the temperature attribute (in Fahrenheit):
  64 65 68 69 70 71 72 72 75 75 80 81 83 85
  Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
• Resulting rule sets:

Attribute     Rules                     Errors   Total errors
Outlook       Sunny → No                2/5      4/14
              Overcast → Yes            0/4
              Rainy → Yes               2/5
Temperature   ≤ 77.5 → Yes              3/10     5/14
              > 77.5 → No*              2/4
Humidity      ≤ 82.5 → Yes              1/7      3/14
              > 82.5 and ≤ 95.5 → No    2/6
              > 95.5 → Yes              0/1
Windy         False → Yes               2/8      5/14
              True → No*                3/6
Model Overfitting

Underfitting: when model is too simple, both training and test errors
are large
Overfitting: when model is too complex, training error is small but
test error is large
Model Overfitting
Using twice the number of data instances:
• If the training data is under-representative, testing errors increase and training errors decrease as the number of nodes increases
• Increasing the size of the training data reduces the difference between training and testing errors at a given number of nodes
Multiple Comparison Procedure
• Consider the task of predicting whether the stock market will rise or fall on each of the next 10 trading days
  (example outcomes: Day 1 Up, Day 2 Down, Day 3 Down, Day 4 Up, Day 5 Down, Day 6 Down, Day 7 Up, Day 8 Up, Day 9 Up, Day 10 Down)
• Random guessing: P(correct) = 0.5
• Make 10 random guesses in a row:
  P(# correct ≥ 8) = [ C(10,8) + C(10,9) + C(10,10) ] / 2^10 = (45 + 10 + 1) / 1024 = 0.0547
Effect of Multiple Comparison Procedure
Approach:
• Get 50 analysts
• Each analyst makes 10 random guesses
• Choose the analyst who makes the most correct predictions

Probability that at least one analyst makes at least 8 correct predictions:
P(# correct ≥ 8) = 1 − (1 − 0.0547)^50 = 0.9399
