
Evaluation of Machine Learning Classifiers
Machine Learning

Dr. Dinesh K. Vishwakarma

1
Outline: Evaluation Parameters
 Precision
 Recall
 Accuracy
 F-Measure
 True Positive Rate
 False Positive Rate
 Sensitivity
 ROC
2
Experiment: Training and Testing
 Objective: Unbiased estimate of accuracy

3
Experiment: Training and Testing…
 How can we get an unbiased estimate of the
accuracy of a learned model?
 when learning a model, you should pretend that you
don’t have the test data yet (it is “in the mail”)*
 if the test-set labels influence the learned model in
any way, accuracy estimates will be biased
 * In some applications it is reasonable to assume that
you have access to the feature vector (i.e. x) but not the
y part of each test instance

4
Learning Curve
 How does the accuracy of a learning method
change as a function of the training-set size?
 This can be assessed by plotting learning curves
# Given a training/test set partition
for each sample size s on the learning curve:
    (optionally) repeat n times:
        randomly select s instances from the training set
        learn a model
        evaluate the model on the test set to determine accuracy a
    plot (s, a) or (s, avg. accuracy and error bars)
5
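As a hedged illustration of this procedure, here is a minimal Python sketch; the dataset, the DecisionTreeClassifier used as a stand-in for the learned model, the sample sizes, and the 10 repeats are illustrative assumptions rather than part of the slide.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
curve = []
for s in [25, 50, 100, 200, len(X_train)]:                   # sample sizes on the learning curve
    accs = []
    for _ in range(10):                                      # (optionally) repeat n times
        idx = rng.choice(len(X_train), size=s, replace=False)    # randomly select s instances
        model = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])  # learn model
        accs.append(model.score(X_test, y_test))             # evaluate on the fixed test set
    curve.append((s, np.mean(accs), np.std(accs)))           # plot (s, avg. accuracy, error bars)
print(curve)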
Validation (Tuning) Set
 Suppose we want unbiased estimates of accuracy during the learning process (e.g. to choose the best level of decision-tree pruning).
 Partition the training data into separate training and validation sets.
6


Limitation of Single Training/Test Partition
 We may not have enough data to make sufficiently large training and test sets:
 a larger test set gives a more reliable (lower-variance) estimate of accuracy
 but a larger training set is more representative of how much data we actually have for the learning process
 A single training/test partition doesn’t tell us how sensitive accuracy is to the particular training sample used

7
Random Sampling
 The second issue can be addressed by repeatedly and randomly partitioning the available data into training and test sets.

8
Random Sampling…
 When randomly selecting training or validation sets, we may want to ensure that class proportions are maintained in each selected set.
 This can be done via stratified sampling: first stratify instances by class, then randomly select instances from each class proportionally.

9
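A minimal sketch of stratified selection, assuming scikit-learn is available; train_test_split's stratify argument performs exactly this per-class proportional sampling (the dataset and the 80/20 split are illustrative choices).

from collections import Counter
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# stratify=y first groups the instances by class, then samples from each class proportionally
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
print(Counter(y), Counter(y_tr), Counter(y_val))   # class proportions are (nearly) preserved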
Cross Validation

 Partition the data into n subsamples.
 Iteratively leave one subsample out for the test set and train on the rest.
10
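A small sketch of this loop, assuming scikit-learn; KFold partitions the data into n subsamples and each fold in turn is left out as the test set (the dataset, the decision-tree classifier, and n = 10 are illustrative choices).

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
accs = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])  # train on the rest
    accs.append(model.score(X[test_idx], y[test_idx]))                              # test on the held-out fold
print(np.mean(accs))   # accuracy estimate for the learning method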
Cross Validation Example
 Suppose we have 100 instances, and we want to
estimate accuracy with cross validation.

11
Cross Validation…
 10-fold cross validation is common, but smaller values
of n are often used when learning takes a lot of time
 In leave-one-out cross validation, n = # instances
 In stratified cross validation, stratified sampling is used
when partitioning the data
 CV makes efficient use of the available data for testing
 Note that whenever we use multiple training sets, as in
CV and random resampling, we are evaluating a
learning method as opposed to an individual learned
model
12
Internal Cross Validation
 Instead of a single validation set, we can use cross-
validation within a training set to select a model (e.g.
to choose the best level of decision-tree pruning)

13
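One possible sketch of internal cross-validation, assuming scikit-learn; GridSearchCV runs 5-fold CV inside the training set to choose a hyperparameter (max_depth is used here as an illustrative stand-in for a decision-tree pruning level), and the held-out test set is used only once at the end.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# internal 5-fold cross-validation over the training set selects the best depth
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 4, 8, None]}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))   # final estimate on the untouched test set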
Confusion Matrix
 It is also called the prediction table.
 It is an N × N matrix used for evaluating the performance of a classification model, where N is the number of target classes.
 It compares the actual target values with those predicted by the model.
 The columns represent the actual values of the target variable.
 The rows represent the predicted values of the target variable.
14
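A tiny illustration with scikit-learn's confusion_matrix (the labels and predictions below are made up); note that sklearn places actual values on the rows and predictions on the columns, i.e. the transpose of the convention described on this slide.

from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1]
# rows = actual, columns = predicted; for labels {0, 1} this prints [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))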
Confusion Matrix…

15
Type-I and Type-II Error

16
Sec. 8.3

Precision
 Precision measures the correctness achieved in positive prediction: it tells us how many of all the instances predicted positive are actually positive. Precision should be high (ideally 1).
 “Precision is a useful metric in cases where false positives are a higher concern than false negatives.”
 Precision / Positive Predictive Value: P = TP / (TP + FP)
 Recall: R = TP / (TP + FN)

17
Issues with “Precision & Recall”

[Figure: two confusion matrices (laid out as TP FP / FN TN) for two classifiers on different data sets]

 Both classifiers give the same precision and recall values of 66.7% and 40% (note: the data sets are different).
 Yet they exhibit very different behaviours:
 Same positive recognition rate
 Extremely different negative recognition rate: strong on the left, nil on the right
 Note: accuracy has no problem catching this!
18
Sec. 8.3

A combined measure: F
 A combined measure that assesses the precision/recall tradeoff is the F measure (weighted harmonic mean):

   F = 1 / (α·(1/P) + (1 − α)·(1/R)) = (β² + 1)·P·R / (β²·P + R)

 People usually use the balanced F1 measure
 i.e., with β = 1 (equivalently α = ½)
 Harmonic mean is a conservative average.
19
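As a small sketch, the same formulas written as a Python function (the counts passed in the call are illustrative only).

def f_measure(tp, fp, fn, beta=1.0):
    """Precision, recall, and the weighted harmonic mean F_beta."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_beta = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f_beta

print(f_measure(tp=2, fp=1, fn=1))   # balanced F1 (beta = 1)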
Accuracy
 Accuracy measures the fraction of predictions that are correct.
 The accuracy metric is not suited to imbalanced classes: if the model simply predicts the majority class for every point, accuracy will be high even though the model is not useful.
 Accuracy is a valid choice of evaluation metric for classification problems that are well balanced, i.e. where there is no class skew.
20
Sec. 8.3

Accuracy Measure
 The accuracy of an engine is the fraction of its classifications that are correct:

   Accuracy(%) = (TP + TN) / (TP + TN + FP + FN) × 100

21
Accuracy Measure
   y labelled (0 = Negative, 1 = Positive)   ŷ predicted value   Output at threshold (0.5)
   0                                         0.3                 0
   1                                         0.4                 0
   0                                         0.7                 1
   1                                         0.8                 1
   0                                         0.4                 0
   1                                         0.7                 1

   Confusion matrix: TP = 2, FP = 1, FN = 1, TN = 2

   Accuracy = 4/6 ≈ 0.666    Recall = TP/(TP + FN) = 2/3 ≈ 0.666    Precision = TP/(TP + FP) = 2/3 ≈ 0.666
22
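The same worked example can be reproduced in a few lines of numpy (a sketch; the threshold of 0.5 and the values are taken from the table above).

import numpy as np

y_true = np.array([0, 1, 0, 1, 0, 1])
scores = np.array([0.3, 0.4, 0.7, 0.8, 0.4, 0.7])
y_pred = (scores >= 0.5).astype(int)            # threshold the predicted values at 0.5

tp = np.sum((y_true == 1) & (y_pred == 1))      # 2
tn = np.sum((y_true == 0) & (y_pred == 0))      # 2
fp = np.sum((y_true == 0) & (y_pred == 1))      # 1
fn = np.sum((y_true == 1) & (y_pred == 0))      # 1
print((tp + tn) / (tp + tn + fp + fn))          # accuracy  = 4/6 ≈ 0.666
print(tp / (tp + fn))                           # recall    = 2/3 ≈ 0.666
print(tp / (tp + fp))                           # precision = 2/3 ≈ 0.666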
Issues with Accuracy
 Consider a 2-class problem
 Number of Class 0 examples = 9990

 Number of Class 1 examples = 10

 If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%.
 Accuracy is misleading because the model does not detect any class 1 example.
23


Issues with Accuracy…

 Both classifiers give 60% accuracy.
 Yet they exhibit very different behaviors:
 On the left: weak positive recognition rate / strong negative recognition rate
 On the right: strong positive recognition rate / weak negative recognition rate
24
Is accuracy an adequate measure?
 Accuracy may not be a useful measure in cases where
 there is a large class skew
 Is 98% accuracy good if 97% of the instances are negative?
 there are differential misclassification costs – say,
getting a positive wrong costs more than getting a
negative wrong.
 Consider a medical domain in which a false positive results in
an extraneous test but a false negative results in a failure to
treat a disease
 we are most interested in a subset of high-confidence
predictions
25
Misclassification Error
 Recognition rate = accuracy = success rate
 Misclassification rate = failure rate

 Misclassification Error = (5 + 10) / (50 + 10 + 5 + 100) ≈ 0.09
 Error in percentage = (FN + FP) / (TP + FP + TN + FN) × 100
26
Sensitivity & Specificity
 Sensitivity is the metric that evaluates a
model’s ability to predict true positives of each
available category.
 Specificity is the metric that evaluates a
model’s ability to predict true negatives of each
available category.

27
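In terms of confusion-matrix counts, these correspond to the standard definitions (added here for reference; the slide states them only in words):

   Sensitivity = Recall = TPR = TP / (TP + FN)
   Specificity = TNR = TN / (TN + FP)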
Find Sensitivity and Specificity

28
Other form of Accuracy Metrics

29
ROC/AUC
 A Receiver Operating Characteristic (ROC) curve plots the TP rate vs. the FP rate as a threshold on the confidence of an instance being positive is varied; the Area Under the Curve (AUC) summarizes it.
 Different methods can work better in different parts of ROC space; which is better depends on the cost of false positives vs. false negatives.
 The diagonal is the expected curve for random guessing.
30
Area Under the Receiver Operating Characteristic Curve (AUC-ROC)
 The AUC-ROC curve measures performance across various threshold settings.
 ROC is a probability curve and AUC represents the degree or measure of separability.
 AUC tells us how capable the model is of distinguishing between classes.
 The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1.
 The ROC curve is plotted with TPR on the y-axis and FPR on the x-axis.
31
ROC Curves & Misclassification Costs
 Best operating point when a FN costs 10× a FP
 Best operating point when the costs of misclassifying positives and negatives are equal
 Best operating point when a FP costs 10× a FN
32
Create ROC of a model
 Consider a prediction table at different threshold settings:

   y labelled        ŷ predicted   Output at        Output at        Output at         Output at
   (0=Neg, 1=Pos)    value         threshold (0.5)  threshold (0.6)  threshold (0.72)  threshold (0.8)
   0                 0.3           0                0                0                 0
   1                 0.55          1                0                0                 0
   0                 0.75          1                1                1                 0
   1                 0.8           1                1                1                 1
   0                 0.4           0                0                0                 0
   1                 0.7           1                1                0                 0

   Threshold setting (0.5): TP = 3, FP = 1, TN = 2, FN = 0
   TPR = 3/(3 + 0) = 1,  FPR = 1/(1 + 2) ≈ 0.33
33
Create ROC of a model…
 Threshold setting (0.6) — using the same prediction table as above:

   TP = 2, FP = 1, TN = 2, FN = 1
   TPR = 2/(2 + 1) ≈ 0.66,  FPR = 1/(1 + 2) ≈ 0.33
34
Create ROC of a model…
 Threshold setting (0.72) — using the same prediction table as above:

   TP = 1, FP = 1, TN = 2, FN = 2
   TPR = 1/(1 + 2) ≈ 0.33,  FPR = 1/(1 + 2) ≈ 0.33
35
Create ROC of a model…
 Threshold setting (0.8) — using the same prediction table as above:

   TP = 1, FP = 0, TN = 3, FN = 2
   TPR = 1/(1 + 2) ≈ 0.33,  FPR = 0
36
Plot of ROC
   Threshold setting (0.5):  TP = 3, FP = 1, TN = 2, FN = 0   TPR = 1,     FPR ≈ 0.33
   Threshold setting (0.6):  TP = 2, FP = 1, TN = 2, FN = 1   TPR ≈ 0.66,  FPR ≈ 0.33
   Threshold setting (0.72): TP = 1, FP = 1, TN = 2, FN = 2   TPR ≈ 0.33,  FPR ≈ 0.33
   Threshold setting (0.8):  TP = 1, FP = 0, TN = 3, FN = 2   TPR ≈ 0.33,  FPR = 0

   [Plot: ROC curve through these (FPR, TPR) points, with TPR on the y-axis and FPR on the x-axis]
37
Steps to create an ROC curve
 Sort test-set predictions according to confidence
that each instance is positive.
 Step through sorted list from high to low
confidence
 locate a threshold between instances with opposite
classes (keeping instances with the same confidence
value on the same side of threshold)
 compute TPR, FPR for instances above threshold
 output (FPR, TPR) coordinate

38
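A hedged sketch of these steps using scikit-learn's roc_curve, reusing the six labelled instances from the earlier threshold example; using sklearn rather than stepping through the sorted list by hand is a substitution for illustration.

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 1, 0, 1, 0, 1]                       # labels from the earlier worked example
scores = [0.3, 0.55, 0.75, 0.8, 0.4, 0.7]         # confidence that each instance is positive
fpr, tpr, thresholds = roc_curve(y_true, scores)  # one (FPR, TPR) point per distinct threshold
print(list(zip(thresholds, fpr, tpr)))
print(roc_auc_score(y_true, scores))              # area under the resulting curve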
Example of ROC Plot

39
Example of ROC Plot …
 Rearrange the samples according to class:

   Correct class   Instance   Confidence positive
   Positive class:
   +               Ex 9       0.99
   +               Ex 7       0.98
   +               Ex 2       0.70
   +               Ex 6       0.65
   +               Ex 5       0.24
   Negative class:
   -               Ex 1       0.72
   -               Ex 10      0.51
   -               Ex 3       0.39
   -               Ex 4       0.11
   -               Ex 8       0.01
40
Example of ROC Plot …
 For threshold 0.72:

   Correct class   Instance   Confidence positive   Predicted class
   +               Ex 9       0.99                  +
   +               Ex 7       0.98                  +
   +               Ex 2       0.70                  -
   +               Ex 6       0.65                  -
   +               Ex 5       0.24                  -
   -               Ex 1       0.72                  +
   -               Ex 10      0.51                  -
   -               Ex 3       0.39                  -
   -               Ex 4       0.11                  -
   -               Ex 8       0.01                  -

   Confidence ≥ threshold → positive class, else → negative class
   TP = 2, FP = 1, TN = 4, FN = 3
   TPR = TP/(TP + FN) = 2/5
   FPR = FP/(FP + TN) = 1/5
41
Example of ROC Plot …
 For threshold 0.65:

   Correct class   Instance   Confidence positive   Predicted class
   +               Ex 9       0.99                  +
   +               Ex 7       0.98                  +
   +               Ex 2       0.70                  +
   +               Ex 6       0.65                  +
   +               Ex 5       0.24                  -
   -               Ex 1       0.72                  +
   -               Ex 10      0.51                  -
   -               Ex 3       0.39                  -
   -               Ex 4       0.11                  -
   -               Ex 8       0.01                  -

   Confidence ≥ threshold → positive class, else → negative class
   TP = 4, FP = 1, TN = 4, FN = 1
   TPR = TP/(TP + FN) = 4/5
   FPR = FP/(FP + TN) = 1/5
42
Significance of ROC

 This is the ideal situation: when the two curves (class-score distributions) don't overlap at all, the model has an ideal measure of separability.
 It is perfectly able to distinguish between the positive class and the negative class.

43
Significance of ROC…

 When the two distributions overlap, type 1 and type 2 errors are introduced.
 Depending upon the threshold, these errors can be minimized or maximized. When AUC is 0.7, there is a 70% chance that the model will be able to distinguish between the positive class and the negative class.
44
Significance of ROC…

 This is the worst situation.


 When AUC is approximately 0.5, the model has no
discrimination capacity to distinguish between positive
class and negative class.

45
Significance of ROC…

 When AUC is approximately 0, the model is actually inverting the classes: it predicts the negative class as positive and vice versa.
 (TPR↑, FPR↑ and TPR↓, FPR↓)
46


Issues with ROC/AUC
 AUC/ROC has been adopted as a replacement for accuracy, but it has also drawn some criticism:
 The ROC curves on which the AUCs of different classifiers are based may cross, thus not giving an accurate picture of what is really happening.
 The misclassification cost distributions used by the AUC are different for different classifiers. Therefore, we may be comparing “apples and oranges”, as the AUC may give more weight to misclassifying a point by classifier A than it does by classifier B.
 A proposed answer: the H-measure.
47
Other Accuracy Metrics

48
Precision/recall curves
 A precision/recall curve plots the precision vs.
recall (TP-rate) as a threshold on the confidence
of an instance being positive is varied.

49
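A minimal sketch, assuming scikit-learn; precision_recall_curve varies the confidence threshold and returns one (precision, recall) pair per threshold (the six instances reuse the earlier worked example).

from sklearn.metrics import precision_recall_curve

y_true = [0, 1, 0, 1, 0, 1]
scores = [0.3, 0.55, 0.75, 0.8, 0.4, 0.7]
precision, recall, thresholds = precision_recall_curve(y_true, scores)
print(list(zip(recall, precision)))   # points of the precision/recall curve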
Comment on ROC/PR Curve
 Both
 allow predictive performance to be assessed at various levels of
confidence
 assume binary classification tasks
 sometimes summarized by calculating area under the curve
 ROC curves
 insensitive to changes in class distribution (ROC curve does not
change if the proportion of positive and negative instances in the
test set are varied)
 can identify optimal classification thresholds for tasks with
differential misclassification costs
 Precision/Recall curves
 show the fraction of predictions that are false positives
 well suited for tasks with lots of negative instances
50
Loss Function
 Mean Square Error Loss Function
 It is used for regression problems.
 The mean square error loss for m data points is defined as
   L_SE = (1/m) · Σ_{i=1..m} (y_i − ŷ_i)²
 For a single point, L_SE(1) = (y − ŷ)².
 Binary Cross Entropy Loss Function
 It is used for classification problems.
 The BCE loss function is defined as
   L_CE = −(1/m) · Σ_{i=1..m} [ y_i · ln(ŷ_i) + (1 − y_i) · ln(1 − ŷ_i) ]
51
Example
 Consider a 2-class problem. If the ground truth is y = 0, then L_SE(1) = ŷ², and if y = 1, L_SE(1) = (1 − ŷ)².
 Similarly, L_CE(1) = −ln(1 − ŷ) for y = 0 and −ln(ŷ) for y = 1.
 Consider an example: y = 0 and ŷ = 0.9, so L_SE = 0.81 and L_CE ≈ 2.3.
 Gradients: ∂L_SE/∂ŷ = 1.8 and ∂L_CE/∂ŷ = 10.0
 Cross-entropy loss therefore penalizes the model more.
52
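A small numpy check of this worked example, assuming the L_SE and L_CE definitions from the previous slide (the code itself is illustrative and not part of the original).

import numpy as np

y, y_hat = 0.0, 0.9
l_se = (y - y_hat) ** 2                                      # 0.81
l_ce = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))    # ≈ 2.30

# gradients with respect to the prediction y_hat
g_se = 2 * (y_hat - y)                                       # 1.8
g_ce = (1 - y) / (1 - y_hat) - y / y_hat                     # 10.0
print(l_se, l_ce, g_se, g_ce)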
