
Evaluation of Classifiers

ROC, PR Curves, AUC, EER

Agha Ali Raza

CS535/EE514 – Machine Learning


Gold Labels vs Predicted Labels (the confusion matrix):

|                    | Gold Positive        | Gold Negative        |
|--------------------|----------------------|----------------------|
| Predicted Positive | True Positives (tp)  | False Positives (fp) |
| Predicted Negative | False Negatives (fn) | True Negatives (tn)  |

• "Precision" aka "Positive Predictive Value" = tp / (tp + fp)
• "Negative Predictive Value" = tn / (fn + tn)
• "Recall" aka "Sensitivity" aka "True Positive Rate" aka "True Acceptance Rate" = tp / (tp + fn)
• "Specificity" aka "True Negative Rate" aka "True Rejection Rate" = tn / (fp + tn)
• 1 - Sensitivity = "False Negative Rate" aka "False Rejection Rate" = fn / (tp + fn)
• 1 - Specificity = "False Positive Rate" aka "False Acceptance Rate" = fp / (fp + tn)
• Accuracy = (tp + tn) / (tp + fp + tn + fn)

• Sensitivity, specificity, FNR and FPR are not influenced by real-world class imbalances, because each is computed within a single gold class (its denominator is one column total of the table above).
• Precision and Negative Predictive Value mix both gold classes in their denominators, so they are impacted by these imbalances and are sensitive to them.
• E.g. for a rare disease, or a rare phenomenon (like a fraud email):

|        | Actual + | Actual - |
|--------|----------|----------|
| Test + | 5        | 5,000    |
| Test - | 5        | 5,000    |

  Sensitivity = 5 / 10 = 0.5, but Precision = 5 / (5 + 5,000) ≈ 0.001
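A minimal sketch in plain Python of the formulas above (the function name is illustrative), applied to these counts, shows why the within-class rates ignore the imbalance while precision collapses:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Rates from the confusion-matrix formulas above."""
    return {
        "sensitivity (TPR)": tp / (tp + fn),
        "specificity (TNR)": tn / (fp + tn),
        "precision (PPV)":   tp / (tp + fp),
        "NPV":               tn / (fn + tn),
        "accuracy":          (tp + tn) / (tp + fp + fn + tn),
    }

# Rare-disease example: 10 actual positives vs 10,000 actual negatives
m = confusion_metrics(tp=5, fp=5_000, fn=5, tn=5_000)
print(m["sensitivity (TPR)"])   # 0.5    -- unchanged by the 1:1000 imbalance
print(m["precision (PPV)"])     # ~0.001 -- swamped by the 5,000 false positives
```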

The Thresholding Problem in Classification
• Pinocchio’s nose
• Assume that liars have longer noses (wooden dummies only! ☺)
• We need to set a threshold on the nose length (input feature) above which we
classify the subject as a liar (label = yes).
• The same principle applies to classification of:
o A growth as a tumor based on size
o Blood pressure levels as hypertension
o An email as SPAM/Not-SPAM, Misinfo/Not-Misinfo based on probability scores
o A search item as match/not-match based on similarity scores, e.g.
▪ spoken term detection
▪ keyword spotting
▪ voice biometrics

• So, the question is where to place the cutoff


• Can we exhaustively try all cutoffs over a bounded score?
o Yes, but what do we track?
▪ Precision/Recall?
▪ False acceptances/False rejections?
▪ True positives/False positives?

• Say hello to ROC curves!


Receiver Operating Characteristic (ROC)
• A graphical plot that illustrates the diagnostic ability of a binary
classifier as its discrimination threshold is varied
• The method was originally developed for operators of military
radar receivers starting in 1941, which led to its name.
• Plot the true positive rate (TPR) – sensitivity – against the
false positive rate (FPR) – (1 - specificity) at various threshold
settings

https://en.wikipedia.org/wiki/Receiver_operating_characteristic
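A minimal sketch of producing such a plot, assuming scikit-learn and matplotlib are available, using the height/adult scores worked out on the next slide (note that roc_curve picks its own thresholds from the distinct scores rather than a fixed 0.1–1.0 grid):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

y_true  = [0, 1, 0, 1, 1, 0, 1, 0, 0, 1]                    # adult (gold): n=0, y=1
y_score = [0.12, 0.82, 0.18, 0.60, 0.72, 0.55, 0.48, 0.24, 0.26, 0.68]

fpr, tpr, thresholds = roc_curve(y_true, y_score)           # one (FPR, TPR) point per threshold
plt.plot(fpr, tpr, marker="o", label="classifier")
plt.plot([0, 1], [0, 1], "r--", label="no discrimination")  # random-guess diagonal
plt.xlabel("False Positive Rate (1 - specificity)")
plt.ylabel("True Positive Rate (sensitivity)")
plt.legend()
plt.show()
```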
Thresholds
Predicted label and outcome for each subject at each decision threshold (0.1–1.0):

| Height h (inches) | Output Score (probability) | Adult (gold) | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | 0.12 | n | y (fp) | n (tn) | n (tn) | n (tn) | n (tn) | n (tn) | n (tn) | n (tn) | n (tn) | n (tn) |
| 82 | 0.82 | y | y (tp) | y (tp) | y (tp) | y (tp) | y (tp) | y (tp) | y (tp) | y (tp) | n (fn) | n (fn) |
| 18 | 0.18 | n | y (fp) | n (tn) | n (tn) | n (tn) | n (tn) | n (tn) | n (tn) | n (tn) | n (tn) | n (tn) |
| 60 | 0.60 | y | y (tp) | y (tp) | y (tp) | y (tp) | y (tp) | n (fn) | n (fn) | n (fn) | n (fn) | n (fn) |
| 72 | 0.72 | y | y (tp) | y (tp) | y (tp) | y (tp) | y (tp) | y (tp) | y (tp) | n (fn) | n (fn) | n (fn) |
| 55 | 0.55 | n | y (fp) | y (fp) | y (fp) | y (fp) | y (fp) | n (tn) | n (tn) | n (tn) | n (tn) | n (tn) |
| 48 | 0.48 | y | y (tp) | y (tp) | y (tp) | y (tp) | n (fn) | n (fn) | n (fn) | n (fn) | n (fn) | n (fn) |
| 24 | 0.24 | n | y (fp) | y (fp) | n (tn) | n (tn) | n (tn) | n (tn) | n (tn) | n (tn) | n (tn) | n (tn) |
| 26 | 0.26 | n | y (fp) | y (fp) | n (tn) | n (tn) | n (tn) | n (tn) | n (tn) | n (tn) | n (tn) | n (tn) |
| 68 | 0.68 | y | y (tp) | y (tp) | y (tp) | y (tp) | y (tp) | y (tp) | n (fn) | n (fn) | n (fn) | n (fn) |
| tp |  |  | 5 | 5 | 5 | 5 | 4 | 3 | 2 | 1 | 0 | 0 |
| fn |  |  | 0 | 0 | 0 | 0 | 1 | 2 | 3 | 4 | 5 | 5 |
| fp |  |  | 5 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| tn |  |  | 0 | 2 | 4 | 4 | 4 | 5 | 5 | 5 | 5 | 5 |
| 1 - Specificity (FPR) |  |  | 1 | 0.6 | 0.2 | 0.2 | 0.2 | 0 | 0 | 0 | 0 | 0 |
| Sensitivity (Recall, TPR) |  |  | 1 | 1 | 1 | 1 | 0.8 | 0.6 | 0.4 | 0.2 | 0 | 0 |
| Precision |  |  | 0.5 | 0.625 | 0.8333 | 0.8333 | 0.8 | 1 | 1 | 1 | NaN | NaN |
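The summary rows can be reproduced with a short threshold sweep; a minimal sketch in plain Python, assuming (as the table implies) that a subject is predicted to be an adult when the score is strictly greater than the threshold:

```python
scores = [0.12, 0.82, 0.18, 0.60, 0.72, 0.55, 0.48, 0.24, 0.26, 0.68]
gold   = ["n",  "y",  "n",  "y",  "y",  "n",  "y",  "n",  "n",  "y"]

for t in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
    pred = ["y" if s > t else "n" for s in scores]             # predict adult if score > threshold
    tp = sum(p == "y" and g == "y" for p, g in zip(pred, gold))
    fp = sum(p == "y" and g == "n" for p, g in zip(pred, gold))
    fn = sum(p == "n" and g == "y" for p, g in zip(pred, gold))
    tn = sum(p == "n" and g == "n" for p, g in zip(pred, gold))
    tpr  = tp / (tp + fn)                                      # sensitivity / recall
    fpr  = fp / (fp + tn)                                      # 1 - specificity
    prec = tp / (tp + fp) if (tp + fp) else float("nan")
    print(f"t={t:.1f}  tp={tp} fp={fp} fn={fn} tn={tn}  "
          f"TPR={tpr:.1f} FPR={fpr:.1f} Precision={prec:.4f}")
```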

Characteristics
• The best possible prediction method would
yield a point in the upper left corner (0,1)
i.e. 100% sensitivity (no false negatives) and
100% specificity (no false positives)
• A random guess would give a point along the
diagonal line (the line of no-discrimination)
from the bottom-left to the top-right corner
(TPR = FPR)
• The red diagonal divides the ROC space.
• Points above the diagonal represent good
classification (better than random)
• Points below the line represent bad results (worse
than random)
• The output of a consistently bad predictor could
simply be inverted to obtain a good predictor.
• The blue diagonal is the Equal Error Diagonal
o Along it FPR = FNR (where FNR = 1 - TPR); its intersection with the ROC curve is the Equal Error Point
• The Equal Error Point is a viable way to locate the desired threshold
• A smaller equal error rate is better (in the graph: the point sits higher and further to the left)
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
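A minimal sketch of locating the equal error point numerically, assuming scikit-learn and NumPy and reusing the height/adult scores from the Thresholds slide; picking the swept threshold whose FPR and FNR are closest is a rough approximation (no interpolation between ROC points):

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true  = np.array([0, 1, 0, 1, 1, 0, 1, 0, 0, 1])
y_score = np.array([0.12, 0.82, 0.18, 0.60, 0.72, 0.55, 0.48, 0.24, 0.26, 0.68])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
fnr = 1 - tpr                              # false negative / false rejection rate

i = np.argmin(np.abs(fpr - fnr))           # ROC point closest to the equal error diagonal
eer = (fpr[i] + fnr[i]) / 2                # report the midpoint at the crossover
print(f"EER ~= {eer:.2f} at threshold {thresholds[i]:.2f}")
```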
More Characteristics
• It is hard to compare classifiers directly from their ROC curves
• A way around that is to use AUC – the Area Under the ROC Curve, also written A' (pronounced "a-prime") or called the "c-statistic" ("concordance statistic")
• Larger is better
• “AUC ROC can be interpreted as the probability
that the scores given by a classifier will rank a
randomly chosen positive instance higher than a
randomly chosen negative one.” (Page 54,
Learning from Imbalanced Data Sets, 2018)
• For imbalanced datasets: “ROC analysis does not
have any bias toward models that perform well on
the minority class at the expense of the majority
class—a property that is quite attractive when
dealing with imbalanced data.” (Page 27,
Imbalanced Learning: Foundations, Algorithms,
and Applications, 2013)

Tronci, Roberto, Giorgio Giacinto, and Fabio Roli. "Dynamic score combination: A supervised and unsupervised score combination method." In International
Workshop on Machine Learning and Data Mining in Pattern Recognition, pp. 163-177. Springer, Berlin, Heidelberg, 2009.
https://www.researchgate.net/publication/225180361_Dynamic_Score_Combination_A_Supervised_and_Unsupervised_Score_Combination_Method/figures?lo=1
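That probabilistic reading can be checked directly with a pairwise count; a minimal sketch (ties scored as 1/2, the usual convention) compared against scikit-learn's roc_auc_score, on the height/adult scores from earlier:

```python
from itertools import product
from sklearn.metrics import roc_auc_score

y_true  = [0, 1, 0, 1, 1, 0, 1, 0, 0, 1]
y_score = [0.12, 0.82, 0.18, 0.60, 0.72, 0.55, 0.48, 0.24, 0.26, 0.68]

pos = [s for s, y in zip(y_score, y_true) if y == 1]   # scores of positive instances
neg = [s for s, y in zip(y_score, y_true) if y == 0]   # scores of negative instances

# P(score of a random positive > score of a random negative), ties counted as 1/2
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
auc_pairwise = wins / (len(pos) * len(neg))

print(auc_pairwise, roc_auc_score(y_true, y_score))    # the two values agree
```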

Examples
Example ROC curves ranging from AUC → 1 (perfect separation), through AUC ≈ 0.7 and AUC ≈ 0.5 (no better than chance), down to AUC → 0 (the ranking is completely inverted).
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
Precision-Recall Curve
• The precision-recall curve shows the tradeoff
between precision and recall for different
thresholds. A high area under the curve
represents both high recall and high precision
• High scores for both show that the classifier is
returning accurate results (high precision), as
well as returning a majority of all positive
results (high recall).

https://www.vlfeat.org/overview/plots-rank.html, https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
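A minimal sketch of drawing the curve with scikit-learn and matplotlib, again on the height/adult scores; average precision is used here only as a convenient summary of the area under the curve:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true  = [0, 1, 0, 1, 1, 0, 1, 0, 0, 1]
y_score = [0.12, 0.82, 0.18, 0.60, 0.72, 0.55, 0.48, 0.24, 0.26, 0.68]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
ap = average_precision_score(y_true, y_score)      # summary of the area under the curve

plt.step(recall, precision, where="post")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title(f"Precision-Recall curve (AP = {ap:.2f})")
plt.show()
```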

Discussion: PR vs ROC
Assume a "positive" class 1 and a "negative" class 0; ŷ is our estimate of the true class label y.

Precision = P(y = 1 | ŷ = 1)
Recall = Sensitivity = P(ŷ = 1 | y = 1)
Specificity = P(ŷ = 0 | y = 0)

• P(y = 1) is the baseline probability, which depends on how common the event is in the real world
• P(ŷ = 1) is the probability that our classifier will classify a sample as positive

• ROC curves will be the same regardless of P(y = 1)


• PR curves may be more useful in practice for needle-in-haystack type problems or problems
where the "positive" class is more interesting than the negative class.
Bottom line:
• Use precision and recall to focus on a small positive class
o When P(y = 1) is low and correctly detecting positive samples is our main focus (correct detection of negative examples is less important to the problem).
• Use ROC when the detection of both classes is equally important
o When we want to give equal weight to the prediction ability on both classes.
• Use ROC when the positives are the majority, or switch the labels and use precision and recall
o When the positive class is larger, use ROC, because precision and recall would mostly reflect the ability to predict the positive class and not the negative class, which is naturally harder to detect due to its smaller number of samples.
o If the negative class (the minority in this case) is more important, we can switch the labels and use precision and recall.
https://stats.stackexchange.com/questions/7207/roc-vs-precision-and-recall-curves, https://towardsdatascience.com/what-metrics-should-we-
use-on-imbalanced-data-set-precision-recall-roc-e2e79252aeba
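A small simulation (made-up Gaussian scores; the classifier itself never changes) illustrates the bottom line: adding many more negatives leaves the ROC AUC essentially unchanged but drags the area under the PR curve down:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

def sample(n_pos, n_neg):
    # Positives score higher on average; only the class balance varies
    scores = np.concatenate([rng.normal(1.0, 1.0, n_pos), rng.normal(0.0, 1.0, n_neg)])
    labels = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
    return labels, scores

for n_neg in (1_000, 100_000):                     # 1:1 versus 1:100 imbalance
    y, s = sample(1_000, n_neg)
    print(n_neg, roc_auc_score(y, s), average_precision_score(y, s))
# ROC AUC stays roughly the same in both runs; average precision drops sharply
# once the negatives outnumber the positives 100 to 1.
```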
A word on Biometric Systems
• The crossover error rate describes the point where
the false reject rate (FRR) and false accept rate
(FAR) are equal. CER is also known as the equal
error rate (EER). The crossover error rate describes
the overall accuracy of a biometric system.
• As the sensitivity of a biometric system increases, FRRs will rise and FARs will drop. Conversely, as the sensitivity is lowered, FRRs will drop and FARs will rise. (Note: this "sensitivity" is not recall; it is the calibration sensitivity of the biometric device.)
• CER is better when lower.
• Authentication algorithms need to simultaneously minimize permeability to intruders, which requires them to be demanding, and maximize the comfort level, which requires them to be permissive.
• This contradiction is the basis of the optimization problem in authentication algorithms, and the measure of success for the overall precision of an algorithm and its usability is the CER.

https://www.sciencedirect.com/topics/computer-science/crossover-error-rate
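A minimal sketch of the FAR/FRR trade-off described here, with made-up genuine and impostor score distributions and a swept acceptance threshold; the crossover of the two curves is the CER/EER:

```python
import numpy as np

rng = np.random.default_rng(1)
genuine  = rng.normal(2.0, 1.0, 5_000)       # matcher scores for genuine users
impostor = rng.normal(0.0, 1.0, 5_000)       # matcher scores for impostors

best = None
for t in np.linspace(-3.0, 5.0, 801):        # sweep the acceptance threshold
    frr = np.mean(genuine  < t)              # genuine users rejected (false rejects)
    far = np.mean(impostor >= t)             # impostors accepted (false accepts)
    if best is None or abs(far - frr) < abs(best[1] - best[2]):
        best = (t, far, frr)                 # keep the point closest to FAR == FRR

t, far, frr = best
print(f"CER/EER ~= {(far + frr) / 2:.3f} at threshold {t:.2f}")
# A stricter (higher) threshold raises FRR and lowers FAR; a laxer one does the opposite.
```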
For more details please visit

http://aghaaliraza.com

Thank you!