Evaluation of Machine Learning Classifiers
Machine Learning
Dr. Dinesh K. Vishwakarma
Outline: Evaluation Parameters
Precision
Recall
Accuracy
F-Measure
True Positive Rate
False Positive Rate
Sensitivity
ROC
Experiment: Training and Testing
Objective: Unbiased estimate of accuracy
Experiment: Training and Testing…
How can we get an unbiased estimate of the accuracy of a learned model?
When learning a model, you should pretend that you don’t have the test data yet (it is “in the mail”).*
If the test-set labels influence the learned model in any way, accuracy estimates will be biased.
* In some applications it is reasonable to assume that you have access to the feature vector (i.e. x) but not the y part of each test instance.
Learning Curve
How does the accuracy of a learning method change as a function of the training-set size?
This can be assessed by plotting learning curves.
Given a training/test set partition:
• for each sample size s on the learning curve
• (optionally) repeat n times
• randomly select s instances from the training set
• learn the model
• evaluate the model on the test set to determine accuracy a
• plot (s, a) or (s, average accuracy with error bars)
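Below is a minimal Python sketch of the procedure above, assuming NumPy and scikit-learn are available; the function name `learning_curve_points`, the sample sizes, and the number of repeats are illustrative choices rather than anything prescribed by the slide.

```python
# Sketch of the learning-curve procedure (assumes NumPy and scikit-learn).
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score

def learning_curve_points(model, X_train, y_train, X_test, y_test,
                          sizes=(50, 100, 200, 400), n_repeats=5, seed=0):
    rng = np.random.default_rng(seed)
    points = []
    for s in sizes:                                   # each sample size s on the curve
        accs = []
        for _ in range(n_repeats):                    # (optionally) repeat n times
            idx = rng.choice(len(X_train), size=s, replace=False)    # random s instances
            m = clone(model).fit(X_train[idx], y_train[idx])         # learn the model
            accs.append(accuracy_score(y_test, m.predict(X_test)))   # evaluate on test set
        points.append((s, np.mean(accs), np.std(accs)))  # (s, avg accuracy, error bar)
    return points
```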
Validation (Tuning) Set
Suppose we want unbiased estimates of accuracy during the learning process (e.g. to choose the best level of decision-tree pruning).
Partition the training data into separate training and validation sets.
Limitation of Single Training/Test Partition
We may not have enough data to make sufficiently large training and test sets:
a larger test set gives a more reliable estimate of accuracy (i.e. a lower-variance estimate),
but a larger training set will be more representative of how much data we actually have for the learning process.
A single training set doesn’t tell us how sensitive accuracy is to a particular training sample.
Random Sampling
The second issue can be addressed by repeatedly and randomly partitioning the available data into training and test sets.
Random Sampling…
When randomly selecting training or validation sets, we may want to ensure that class proportions are maintained in each selected set.
This can be done via stratified sampling: first stratify the instances by class, then randomly select instances from each class proportionally.
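A minimal sketch of stratified sampling with scikit-learn's `train_test_split` (assuming scikit-learn is installed); the toy imbalanced dataset is only there to make the example self-contained.

```python
# Stratified sampling sketch: train_test_split(..., stratify=y) preserves class proportions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset (roughly 90% / 10%) just to make the example runnable.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Class proportions are (approximately) the same in both selected sets.
print(np.bincount(y_train) / len(y_train))
print(np.bincount(y_val) / len(y_val))
```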
Cross Validation
Partition the data into n subsamples.
Iteratively leave one subsample out for the test set and train on the rest.
Cross Validation Example
Suppose we have 100 instances, and we want to estimate accuracy with cross validation.
With 10-fold cross validation, each iteration tests on 10 held-out instances and trains on the remaining 90.
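A minimal cross-validation sketch in Python, assuming scikit-learn; the logistic-regression model and the synthetic 100-instance dataset are stand-ins chosen only to make the example self-contained.

```python
# n-fold cross validation: each subsample is left out once as the test set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=100, random_state=0)   # e.g. 100 instances

accs = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    accs.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print(f"10-fold CV accuracy: {np.mean(accs):.3f} +/- {np.std(accs):.3f}")
```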
Cross Validation…
10-fold cross validation is common, but smaller values of n are often used when learning takes a lot of time.
In leave-one-out cross validation, n = the number of instances.
In stratified cross validation, stratified sampling is used when partitioning the data.
CV makes efficient use of the available data for testing.
Note that whenever we use multiple training sets, as in CV and random resampling, we are evaluating a learning method as opposed to an individual learned model.
Internal Cross Validation
Instead of a single validation set, we can use cross-validation within a training set to select a model (e.g. to choose the best level of decision-tree pruning).
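A sketch of internal cross-validation for model selection, assuming scikit-learn: `GridSearchCV` runs 5-fold CV inside the training set to pick a decision-tree depth (used here as a stand-in for the "level of pruning"), while the outer test set stays untouched for the final estimate.

```python
# Internal cross-validation: model selection happens inside the training set only.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, None]},  # stand-in for the level of pruning
    cv=5,                                       # internal 5-fold CV on the training data
)
search.fit(X_train, y_train)

print(search.best_params_)            # model chosen by internal CV
print(search.score(X_test, y_test))   # unbiased estimate from the untouched test set
```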
Confusion Matrix
It is also called a prediction table.
It is an N × N matrix used for evaluating the performance of a classification model, where N is the number of target classes.
It compares the actual target values with those predicted by the model.
The columns represent the actual values of the target variable.
The rows represent the predicted values of the target variable.
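A minimal confusion-matrix sketch with scikit-learn (assuming it is installed). Note that `sklearn.metrics.confusion_matrix` lays the matrix out with rows = actual and columns = predicted, i.e. the transpose of the layout described on this slide; the toy labels below match the threshold-0.5 example a few slides later.

```python
# Confusion matrix for a small binary example.
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1, 0, 1]   # actual labels
y_pred = [0, 0, 1, 1, 0, 1]   # predicted labels (scores thresholded at 0.5)

cm = confusion_matrix(y_true, y_pred)   # rows = actual, columns = predicted
tn, fp, fn, tp = cm.ravel()             # for binary labels {0, 1}
print(cm)
print("TP =", tp, "FP =", fp, "FN =", fn, "TN =", tn)   # TP=2 FP=1 FN=1 TN=2
```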
Confusion Matrix…
Type-I and Type-II Error
Precision
Precision measures the correctness achieved in positive prediction: it tells us how many of all the instances predicted positive are actually positive. Precision should be high (ideally 1).
“Precision is a useful metric in cases where false positives are a higher concern than false negatives.”
Precision / positive predictive value: $P = \frac{t_p}{t_p + f_p}$
Recall: $R = \frac{t_p}{t_p + f_n}$
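A quick plain-Python check of the two definitions above; the counts are taken from the worked example later in the deck.

```python
# Precision and recall from confusion-matrix counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

print(precision(tp=2, fp=1))   # 0.666...
print(recall(tp=2, fn=1))      # 0.666...
```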
Issues with “Precision & Recall”
(Confusion matrices for the two classifiers, built on different data sets, omitted.)
Both classifiers give the same precision and recall values of 66.7% and 40% (note: the data sets are different).
They exhibit very different behaviours:
same positive recognition rate,
extremely different negative recognition rate: strong on the left, nil on the right.
Note: accuracy has no problem catching this!
A combined measure: F
The combined measure that assesses the precision/recall tradeoff is the F measure (a weighted harmonic mean):
$F = \frac{1}{\alpha\frac{1}{P} + (1-\alpha)\frac{1}{R}} = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}$, where $\beta^2 = \frac{1-\alpha}{\alpha}$
People usually use the balanced F1 measure, i.e. with $\beta = 1$ (equivalently $\alpha = 1/2$).
The harmonic mean is a conservative average.
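A small sketch of the weighted F measure in Python; the β = 2 line is just an extra illustration of weighting recall more heavily.

```python
# Weighted F measure; beta = 1 gives the balanced F1 (harmonic mean of P and R).
def f_measure(p, r, beta=1.0):
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

p, r = 0.667, 0.667
print(f_measure(p, r))           # F1
print(f_measure(p, r, beta=2))   # F2 weights recall more heavily than precision
```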
Accuracy
Accuracy measures the fraction of correct predictions.
The accuracy metric is not suited for imbalanced classes:
for imbalanced data, a model that predicts every point as the majority class will have high accuracy, even though it is not a useful model.
Accuracy is a valid evaluation choice for classification problems that are well balanced, i.e. not skewed by class imbalance.
Accuracy Measure
The accuracy of a classifier (engine) is the fraction of its classifications that are correct:
$\text{Accuracy}(\%) = \frac{t_p + t_n}{t_p + t_n + f_n + f_p} \times 100$
Accuracy Measure
y labelled value (0 = Negative, 1 = Positive)   ŷ predicted value   Output at threshold 0.5
0                                               0.3                 0
1                                               0.4                 0
0                                               0.7                 1
1                                               0.8                 1
0                                               0.4                 0
1                                               0.7                 1

Confusion matrix: TP = 2, FP = 1, FN = 1, TN = 2

$\text{Accuracy} = \frac{4}{6} = 0.666$, $\text{Recall} = \frac{TP}{TP + FN} = \frac{2}{3} = 0.666$, $\text{Precision} = \frac{TP}{TP + FP} = \frac{2}{3} = 0.666$
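The same worked example checked with scikit-learn (assuming it is installed); the scores are thresholded at 0.5 to obtain the predicted labels.

```python
# Reproducing the threshold-0.5 example above.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = np.array([0, 1, 0, 1, 0, 1])
scores = np.array([0.3, 0.4, 0.7, 0.8, 0.4, 0.7])
y_pred = (scores >= 0.5).astype(int)       # threshold the predicted scores at 0.5

print(accuracy_score(y_true, y_pred))      # 4/6 ≈ 0.667
print(precision_score(y_true, y_pred))     # 2/3 ≈ 0.667
print(recall_score(y_true, y_pred))        # 2/3 ≈ 0.667
```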
Issues with Accuracy
Consider a 2-class problem
Number of Class 0 examples = 9990
Number of Class 1 examples = 10
If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%.
Accuracy is misleading because the model does not detect any class 1 example.
Issues with Accuracy…
Both classifiers give 60% accuracy.
They exhibit very different behaviors:
on the left: weak positive recognition rate / strong negative recognition rate,
on the right: strong positive recognition rate / weak negative recognition rate.
Is Accuracy an Adequate Measure?
Accuracy may not be a useful measure in cases where:
there is a large class skew
(Is 98% accuracy good if 97% of the instances are negative?)
there are differential misclassification costs, say, getting a positive wrong costs more than getting a negative wrong
(Consider a medical domain in which a false positive results in an extraneous test but a false negative results in a failure to treat a disease.)
we are most interested in a subset of high-confidence predictions
Misclassification Error
Recognition rate = accuracy = success rate.
Misclassification rate = failure rate.
$\text{Misclassification error} = \frac{FN + FP}{TP + FP + TN + FN}$
For example, with FN + FP = 5 + 10 misclassified instances out of 50 + 10 + 5 + 100 = 165 total:
$\text{Misclassification error} = \frac{5 + 10}{50 + 10 + 5 + 100} \approx 0.09$
Error in percentage $= \frac{FN + FP}{TP + FP + TN + FN} \times 100$
Sensitivity & Specificity
Sensitivity is the metric that evaluates a model’s ability to predict the true positives of each available category: $\text{Sensitivity} = \frac{TP}{TP + FN}$.
Specificity is the metric that evaluates a model’s ability to predict the true negatives of each available category: $\text{Specificity} = \frac{TN}{TN + FP}$.
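A plain-Python sketch of the two formulas, reusing the TP/FP/FN/TN counts from the earlier threshold-0.5 example.

```python
# Sensitivity (true positive rate) and specificity (true negative rate).
def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

print(sensitivity(tp=2, fn=1))   # 0.666...
print(specificity(tn=2, fp=1))   # 0.666...
```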
Find Sensitivity and Specificity
Other Forms of Accuracy Metrics
ROC/AUC
A Receiver Operating Characteristic (ROC) curve plots the TP rate vs. the FP rate as a threshold on the confidence of an instance being positive is varied; the Area Under the Curve (AUC) summarizes the curve in a single number.
Different methods can work better in different parts of ROC space; which part matters depends on the cost of false positives vs. false negatives.
The diagonal is the expected curve for random guessing.
Area Under the Receiver Operating Characteristic (AUC-ROC)
The AUC-ROC curve measures performance at various threshold settings.
ROC is a probability curve and AUC represents the degree or measure of separability.
AUC tells us how capable the model is of distinguishing between classes.
The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1.
The ROC curve is plotted as TPR against FPR, with TPR on the y-axis and FPR on the x-axis.
ROC curves & Misclassification
costs
Best operating point
when FN costs 10× FP
Best operating point when
cost of misclassifying
positives
and negatives is equal
Best operating point when
FP costs 10× FN
32
Create ROC of a model
Consider a prediction table at different threshold settings:

y (0 = Negative, 1 = Positive)   ŷ predicted value   @0.5   @0.6   @0.72   @0.8
0                                0.3                 0      0      0       0
1                                0.55                1      0      0       0
0                                0.75                1      1      1       0
1                                0.8                 1      1      1       1
0                                0.4                 0      0      0       0
1                                0.7                 1      1      0       0

Threshold setting 0.5: TP = 3, FP = 1, TN = 2, FN = 0, so TPR = 3/(3+0) = 1 and FPR = 1/(1+2) = 0.33.
Create ROC of a model…
Using the same prediction table at threshold setting 0.6: TP = 2, FP = 1, TN = 2, FN = 1, so TPR = 2/(2+1) = 0.66 and FPR = 1/(1+2) = 0.33.
Create ROC of a model…
Using the same prediction table at threshold setting 0.72: TP = 1, FP = 1, TN = 2, FN = 2, so TPR = 1/(1+2) = 0.33 and FPR = 1/(1+2) = 0.33.
Create ROC of a model…
Using the same prediction table at threshold setting 0.80: TP = 1, FP = 0, TN = 3, FN = 2, so TPR = 1/(1+2) = 0.33 and FPR = 0.
Plot of ROC
Threshold 0.5:  TP = 3, FP = 1, TN = 2, FN = 0 → TPR = 1,    FPR = 0.33
Threshold 0.6:  TP = 2, FP = 1, TN = 2, FN = 1 → TPR = 0.66, FPR = 0.33
Threshold 0.72: TP = 1, FP = 1, TN = 2, FN = 2 → TPR = 0.33, FPR = 0.33
Threshold 0.8:  TP = 1, FP = 0, TN = 3, FN = 2 → TPR = 0.33, FPR = 0

(ROC plot: the (FPR, TPR) points above, with TPR on the y-axis and FPR on the x-axis.)
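The table can be cross-checked with scikit-learn (assuming it is installed): `roc_curve` sweeps thresholds over the same scores and returns the corresponding FPR/TPR pairs, and the four points above appear among its output.

```python
# Cross-check of the ROC points for the 6-example prediction table.
from sklearn.metrics import roc_curve

y_true = [0, 1, 0, 1, 0, 1]
scores = [0.3, 0.55, 0.75, 0.8, 0.4, 0.7]

fpr, tpr, thresholds = roc_curve(y_true, scores, drop_intermediate=False)
for t, f, s in zip(thresholds, fpr, tpr):
    print(f"threshold {t:.2f}: FPR = {f:.2f}, TPR = {s:.2f}")
```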
Steps to Create an ROC Curve
Sort the test-set predictions according to the confidence that each instance is positive.
Step through the sorted list from high to low confidence:
locate a threshold between instances with opposite classes (keeping instances with the same confidence value on the same side of the threshold),
compute TPR and FPR for the instances above the threshold,
output an (FPR, TPR) coordinate.
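A plain-Python sketch of this procedure, applied to the 10-example data on the following slides. For simplicity it emits a point after every distinct confidence value (a superset of the thresholds between opposite classes), which traces the same curve; the points for thresholds 0.72 and 0.65 below, (0.2, 0.4) and (0.2, 0.8), match the slides.

```python
# Threshold sweep producing (FPR, TPR) points for an ROC curve.
def roc_points(labels, scores):
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)   # high to low confidence
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (s, y) in enumerate(ranked):
        tp += (y == 1)
        fp += (y == 0)
        # emit a point only after the last instance sharing this confidence value
        if i == len(ranked) - 1 or ranked[i + 1][0] != s:
            points.append((fp / neg, tp / pos))          # (FPR, TPR) above this cut
    return points

labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # Ex 9, 7, 2, 6, 5 (+) and Ex 1, 10, 3, 4, 8 (-)
scores = [0.99, 0.98, 0.70, 0.65, 0.24, 0.72, 0.51, 0.39, 0.11, 0.01]
print(roc_points(labels, scores))
```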
Example of ROC Plot
Example of ROC Plot …
Rearrange the samples according to class:

Correct class   Instance   Confidence positive
+               Ex 9       0.99
+               Ex 7       0.98
+               Ex 2       0.70
+               Ex 6       0.65
+               Ex 5       0.24
-               Ex 1       0.72
-               Ex 10      0.51
-               Ex 3       0.39
-               Ex 4       0.11
-               Ex 8       0.01

The first five instances form the positive class; the last five form the negative class.
Example of ROC Plot …
For threshold 0.72, instances with confidence ≥ 0.72 (Ex 9, Ex 7, Ex 1) are predicted positive and the rest negative:
TP = 2, FP = 1, TN = 4, FN = 3
TPR = TP/(TP + FN) = 2/5
FPR = FP/(FP + TN) = 1/5
Example of ROC Plot …
For threshold 0.65, instances with confidence ≥ 0.65 (Ex 9, Ex 7, Ex 2, Ex 6, Ex 1) are predicted positive and the rest negative:
TP = 4, FP = 1, TN = 4, FN = 1
TPR = TP/(TP + FN) = 4/5
FPR = FP/(FP + TN) = 1/5
Significance of ROC
This is the ideal situation: when the two class distributions don’t overlap at all, the model has an ideal measure of separability. It is perfectly able to distinguish between the positive class and the negative class.
Significance of ROC…
When the two distributions overlap, Type-I and Type-II errors are introduced.
Depending upon the threshold, these errors can be minimized or maximized.
When AUC is 0.7, it means there is a 70% chance that the model will be able to distinguish between the positive class and the negative class.
Significance of ROC…
This is the worst situation: when AUC is approximately 0.5, the model has no capacity to discriminate between the positive class and the negative class.
Significance of ROC…
When AUC is approximately 0, the model is actually reciprocating the classes: it is predicting the negative class as the positive class and vice versa.
Issues with ROC/AUC
ROC/AUC has been adopted as a replacement for accuracy, but it has also attracted some criticism:
The ROC curves on which the AUCs of different classifiers are based may cross, thus not giving an accurate picture of what is really happening.
The misclassification cost distributions used by the AUC are different for different classifiers. Therefore, we may be comparing “apples and oranges”, as the AUC may give more weight to misclassifying a point by classifier A than it does by classifier B.
A proposed remedy is the H-measure.
Other Accuracy Metrics
Precision/recall curves
A precision/recall curve plots precision vs. recall (TP rate) as a threshold on the confidence of an instance being positive is varied.
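A minimal precision/recall-curve sketch with scikit-learn (assuming it is installed), reusing the six scores from the ROC walkthrough above.

```python
# Precision/recall pairs at each candidate threshold.
from sklearn.metrics import precision_recall_curve

y_true = [0, 1, 0, 1, 0, 1]
scores = [0.3, 0.55, 0.75, 0.8, 0.4, 0.7]

precision, recall, thresholds = precision_recall_curve(y_true, scores)
for t, p, r in zip(thresholds, precision, recall):
    print(f"threshold {t:.2f}: precision = {p:.2f}, recall = {r:.2f}")
```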
Comment on ROC/PR Curve
Both:
allow predictive performance to be assessed at various levels of confidence,
assume binary classification tasks,
are sometimes summarized by calculating the area under the curve.
ROC curves:
are insensitive to changes in class distribution (the ROC curve does not change if the proportion of positive and negative instances in the test set is varied),
can identify optimal classification thresholds for tasks with differential misclassification costs.
Precision/recall curves:
show the fraction of predictions that are false positives,
are well suited for tasks with lots of negative instances.
Loss Function
Mean Square Error Loss Function
It is used for regression problems.
The mean square error loss for m data points is defined as
$L_{SE} = \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2$
For a single point, $L_{SE} = (y - \hat{y})^2$.
Binary Cross Entropy Loss Function
It is used for classification problems.
The BCE loss function is defined as
$L_{CE} = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i \ln \hat{y}_i + (1 - y_i)\ln(1 - \hat{y}_i)\right]$
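A small NumPy sketch of both loss functions; the epsilon clip is an added safeguard against log(0), not something from the slide.

```python
# Mean square error and binary cross-entropy losses for m data points.
import numpy as np

def mse_loss(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def bce_loss(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)   # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y     = np.array([0.0, 1.0, 1.0])
y_hat = np.array([0.9, 0.8, 0.4])
print(mse_loss(y, y_hat), bce_loss(y, y_hat))
```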
Example
Consider a 2-class problem.
If the ground truth is $y = 0$, then $L_{SE} = \hat{y}^2$ and $L_{CE} = -\ln(1 - \hat{y})$.
If $y = 1$, then $L_{SE} = (1 - \hat{y})^2$ and $L_{CE} = -\ln(\hat{y})$.
For example, with $y = 0$ and $\hat{y} = 0.9$:
$L_{SE} = 0.81$ and $L_{CE} \approx 2.3$
The gradients are $\frac{\partial L_{SE}}{\partial \hat{y}} = 2\hat{y} = 1.8$ and $\frac{\partial L_{CE}}{\partial \hat{y}} = \frac{1}{1 - \hat{y}} = 10.0$.
The cross-entropy loss penalizes the model more.