Evaluating Binary Classification
Binary classification is the task of classifying elements into two groups based on a
classification rule.
The observed response (output) y has two possible values, e.g., +/− or True/False.
Requires defining the relationship between the classifier's prediction h(x) and the true label y.
Uses a decision rule.
Examples:
Medical test: Determining if a patient has a disease.
Fitness test: Determining if a person is fit.
Spam email classification.
Definitions
Instances: The objects of interest in machine learning.
Instance Space: The set of all possible instances. For example, the set of all
possible e-mails.
Label Space: The set of all possible labels; used in supervised learning to label examples.
Model: A mapping from the instance space to the output space.
In classification, the output space is a set of classes.
In regression, it is the set of real numbers.
To learn a model, a training set of labeled instances (x, l(x)), also called
examples, is needed.
Assessing Classification Performance
The outputs of learning algorithms must be assessed and analyzed carefully in order to
evaluate and compare them. The performance of classifiers can be summarized using a
contingency table or confusion matrix.
1. Contingency Table or Confusion Matrix
A confusion matrix is a table that describes the performance of a
classification model on a set of test data where the true values are known.
It summarizes prediction results on a classification problem.
It contains counts of correct and incorrect predictions, broken down by each
class.
It shows how the classification model is confused when it makes predictions.
It contains information about actual and predicted classifications.
Key terms:
True Positive (TP): The classifier correctly predicts a spam email as spam.
False Negative (FN): The classifier incorrectly predicts a spam email as non-
spam (a miss).
False Positive (FP): The classifier incorrectly predicts a non-spam email as
spam (a false alarm).
True Negative (TN): The classifier correctly predicts a non-spam email as non-
spam.
Example: Confusion matrix of email classification
Classification problem: spam and non-spam classes.
Dataset: 100 examples, 65 are spam and 35 are non-spam.

                    Predicted spam    Predicted non-spam    Total
Actual spam              45 (TP)             20 (FN)           65
Actual non-spam           5 (FP)             30 (TN)           35
Total                    50                  50                100
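A minimal sketch of how the four counts are obtained, assuming the actual and predicted labels are available as Python lists (the short lists below are illustrative, not the 100-email dataset):

```python
# Count TP, FN, FP, TN for a binary (spam vs. non-spam) problem.
# "spam" is treated as the positive class.
y_true = ["spam", "spam", "non-spam", "spam", "non-spam"]
y_pred = ["spam", "non-spam", "non-spam", "spam", "spam"]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == "spam" and p == "spam")
fn = sum(1 for t, p in zip(y_true, y_pred) if t == "spam" and p == "non-spam")
fp = sum(1 for t, p in zip(y_true, y_pred) if t == "non-spam" and p == "spam")
tn = sum(1 for t, p in zip(y_true, y_pred) if t == "non-spam" and p == "non-spam")

print(tp, fn, fp, tn)  # 2 1 1 1 for the toy lists above
```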
Key Metrics Derived from the Confusion Matrix
Sensitivity (True Positive Rate or Recall): Measure of positive examples
labeled as positive by the classifier. Should be as high as possible.
For instance, the proportion of spam emails that are correctly classified as spam,
out of all spam emails.
Of all the actual positive examples, how many were predicted correctly.
Sensitivity = TP / (TP + FN)
Example: Sensitivity = 45 / (45 + 20) ≈ 0.6923 (69.23% of spam emails are
correctly classified).
Specificity (True Negative Rate): Measure of negative examples labeled as
negative by the classifier. Should be as high as possible.
For instance, the proportion of non-spam emails that are correctly classified as
non-spam, out of all non-spam emails.
Specificity = TN / (TN + FP)
Example: Specificity = 30 / (30 + 5) ≈ 0.8571 (85.71% of non-spam emails are
accurately classified).
Accuracy: Proportion of the total number of predictions that are correct.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Example: Accuracy = (45 + 30) / (45 + 30 + 20 + 5) = 0.75 (75% of examples are
correctly classified).
Precision: Ratio of correctly classified positive examples to the total number of
predicted positive examples.
Shows the correctness achieved in positive prediction (out of all the instances
predicted as positive, how many are actually positive).
High precision indicates that an example labeled as positive is indeed
positive (small number of FPs).
Precision = TP / (TP + FP)
Example: Precision = 45 / (45 + 5) = 0.90 (90% of examples classified as spam
are actually spam).
Recall: Ratio of correctly classified positive examples to the total number of
positive examples.
Of all the actual positive examples, how many were predicted correctly.
Should be as high as possible.
High recall indicates the class is correctly recognized (small number of
FNs).
F-measure (F1 score): Balances precision and recall.
Combines recall and precision in one equation, which helps when comparing
models where one has low recall and high precision, or vice versa.
F-measure = (2 · Precision · Recall) / (Precision + Recall)
In the confusion matrix, the last column and the last row give the marginals (i.e., the column and row sums).
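A small sketch that reproduces the metric values above from the example counts (TP = 45, FN = 20, FP = 5, TN = 30); the variable names are illustrative:

```python
# Metrics derived from the e-mail confusion matrix: TP=45, FN=20, FP=5, TN=30.
tp, fn, fp, tn = 45, 20, 5, 30

sensitivity = tp / (tp + fn)                 # recall / true positive rate
specificity = tn / (tn + fp)                 # true negative rate
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(f"sensitivity = {sensitivity:.4f}")  # 0.6923
print(f"specificity = {specificity:.4f}")  # 0.8571
print(f"accuracy    = {accuracy:.4f}")     # 0.7500
print(f"precision   = {precision:.4f}")    # 0.9000
print(f"F1          = {f1:.4f}")           # 0.7826
```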
Visualizing Classification Performance
1. Coverage Plot
A coverage plot visualizes the four numbers Pos (number of positives), Neg (number of
negatives), TP (number of true positives), and FP (number of false positives) in a
rectangular coordinate system: each classifier is drawn as a single point, with FP on the
x-axis (from 0 to Neg) and TP on the y-axis (from 0 to Pos). In a coverage plot, classifiers
with the same accuracy are connected by line segments with slope 1.
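A quick check of the slope-1 claim, written as a math block and using the coordinates above together with TN = Neg − FP:

```latex
\mathrm{accuracy} = \frac{TP + TN}{Pos + Neg} = \frac{TP + (Neg - FP)}{Pos + Neg}
\;\;\Longrightarrow\;\;
TP - FP = \mathrm{accuracy}\cdot(Pos + Neg) - Neg .
```

For a fixed dataset (fixed Pos and Neg), equal accuracy therefore means equal TP − FP, i.e., the classifiers lie on a line TP = FP + c of slope 1.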
2. ROC Curves
An ROC curve (receiver operating characteristic curve) is a graph showing
the performance of a classification model at all classification thresholds.
This curve plots two parameters:
True Positive Rate (TPR)
False Positive Rate (FPR)
An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the
classification threshold classifies more items as positive, thus increasing both False
Positives and True Positives.
Example:
Hypothetical Data:
True Labels: [1, 0, 1, 0, 1, 1, 0, 0, 1, 0]
Predicted Probabilities: [0.8, 0.3, 0.6, 0.2, 0.7, 0.9, 0.4, 0.1, 0.75, 0.55]
In each case below, an instance is predicted positive when its predicted probability is
greater than or equal to the threshold.
Case 1: Threshold = 0.5
TPR = TP / (TP + FN) = 5 / (5 + 0) = 1.0
FPR = FP / (FP + TN) = 1 / (1 + 4) = 0.2
Case 2: Threshold = 0.7
TPR = TP / (TP + FN) = 4 / (4 + 1) = 0.8
FPR = FP / (FP + TN) = 0 / (0 + 5) = 0
Case 3: Threshold = 0.4
TPR = TP / (TP + FN) = 5 / (5 + 0) = 1.0
FPR = FP / (FP + TN) = 2 / (2 + 3) = 0.4
Case 4: Threshold = 0.2
TPR = TP / (TP + FN) = 5 / (5 + 0) = 1.0
FPR = FP / (FP + TN) = 4 / (4 + 1) = 0.8
Case 5: Threshold = 0.85
TPR = TP / (TP + FN) = 1 / (1 + 4) = 0.2
FPR = FP / (FP + TN) = 0 / (0 + 5) = 0
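A sketch that reproduces the five (FPR, TPR) points above from the hypothetical data; the helper function and variable names are illustrative:

```python
# TPR and FPR at several thresholds for the hypothetical data above.
y_true = [1, 0, 1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.8, 0.3, 0.6, 0.2, 0.7, 0.9, 0.4, 0.1, 0.75, 0.55]

def roc_point(threshold):
    # Predict positive when the score is >= threshold.
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn), tp / (tp + fn)  # (FPR, TPR)

for thr in [0.85, 0.7, 0.5, 0.4, 0.2]:
    fpr, tpr = roc_point(thr)
    print(f"threshold={thr}: FPR={fpr:.1f}, TPR={tpr:.1f}")
# threshold=0.85: FPR=0.0, TPR=0.2
# threshold=0.7: FPR=0.0, TPR=0.8
# threshold=0.5: FPR=0.2, TPR=1.0
# threshold=0.4: FPR=0.4, TPR=1.0
# threshold=0.2: FPR=0.8, TPR=1.0
```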
AUC (Area Under the ROC Curve)
AUC stands for "Area Under the ROC Curve." AUC measures the entire
two-dimensional area underneath the ROC curve, from (0,0) to (1,1).
AUC ranges in value from 0 to 1.
A model whose predictions are 100% wrong has an AUC of 0.0.
One whose predictions are 100% correct has an AUC of 1.0.
AUC ROC indicates how well the probabilities from the positive classes are
separated from the negative classes.
Plotting the ROC curves of different classifiers for a given dataset makes it easy to
compare them: pick the classifier whose curve has the larger AUC, i.e., a good TP rate
at a low FP rate.
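A sketch of the ranking interpretation mentioned above: AUC equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score (ties counted as one half). For the hypothetical data from the ROC example, every positive is scored above every negative, so the AUC is 1.0:

```python
# AUC as the fraction of (positive, negative) pairs that are ranked correctly.
y_true = [1, 0, 1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.8, 0.3, 0.6, 0.2, 0.7, 0.9, 0.4, 0.1, 0.75, 0.55]

pos = [p for t, p in zip(y_true, y_prob) if t == 1]  # scores of positives
neg = [p for t, p in zip(y_true, y_prob) if t == 0]  # scores of negatives

pairs = [(sp, sn) for sp in pos for sn in neg]
auc = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
          for sp, sn in pairs) / len(pairs)
print(auc)  # 1.0 -- the positive scores are perfectly separated from the negatives
```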
Class Probability Estimation
The probability of an event is the likelihood that the event will happen.
Probability-based classifiers produce the class probability estimation (the
probability that a test instance belongs to the predicted class).
Involves not only predicting the class label but also obtaining a probability of
the respective label for decision-making.
Definition:
A probabilistic classifier is a classifier that is able to predict, given an
observation of an input, a probability distribution over a set of classes.
An ordinary (binary) classifier uses a function that assigns to a sample x a class
label ŷ:
ŷ = f(x)
Probabilistic classifiers: instead of a single-valued function, they model a conditional
distribution Pr(Y | X); for a given x ∈ X, they assign a probability to every y ∈ Y (and
these probabilities sum to one).
Examples: Naive Bayes, logistic regression, and multilayer perceptrons are
naturally probabilistic.
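A minimal sketch of a probabilistic classifier, assuming scikit-learn is available; LogisticRegression.predict_proba returns one probability per class for each input, and the probabilities in each row sum to one (the tiny one-feature dataset is illustrative):

```python
# Logistic regression used as a probabilistic classifier (illustrative data).
from sklearn.linear_model import LogisticRegression

X = [[0.1], [0.4], [0.5], [0.9], [1.2], [1.5]]  # one feature per instance
y = [0, 0, 0, 1, 1, 1]                          # binary labels

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba([[0.45], [1.1]])      # class probabilities for two test points
print(proba)                         # each row: [P(class 0), P(class 1)], rows sum to 1
print(clf.predict([[0.45], [1.1]]))  # hard labels, i.e. the most probable class
```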
Assessing Class Probability Estimates
1. Sum Squared Error (SSE): Square the individual error terms (difference
between the estimated values and the actual value), which results in a positive
number for all values.
2. Mean Squared Error (MSE): Measures the average of the squares of the errors.
The average squared difference between the estimated values and the
actual value (take the average, or the mean, of the individual squared error
terms).
3. Brier Score:
Definition of error in probability estimates, used in forecasting theory:
Brier score = (1/N) · Σ_{t=1}^{N} (f_t − o_t)²
f_t – the probability that was forecast for instance t.
o_t – the actual outcome of the event at instance t (0 if it does not happen
and 1 if it does happen).
N – the number of forecasting instances.
In effect, it is the mean squared error of the forecast.
The Brier score is a proper scoring rule; in this form it applies to binary events
(for example, "rain" or "no rain").
Example: Suppose one is forecasting the probability P that it will rain on a
given day. Then the Brier score is calculated as follows:
If the forecast is 100% (P = 1) and it rains, then the Brier Score is 0
(best score).
If the forecast is 100% and it does not rain, then the Brier Score is 1
(worst score).
If the forecast is 70% (P = 0.70) and it rains, then the Brier Score is
(0.70 − 1)² = 0.09.
If the forecast is 70% (P = 0.70) and it does not rain, then the Brier
Score is (0.70 − 0)² = 0.49.
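A small sketch that computes SSE, MSE, and the Brier score for a short run of rain forecasts; the forecast probabilities and outcomes below are illustrative and include the two 70% cases worked out above:

```python
# SSE, MSE, and Brier score for probability forecasts.
# The Brier score is the mean squared error of the forecast probabilities.
forecasts = [1.0, 1.0, 0.70, 0.70]  # forecast probability of rain, f_t
outcomes  = [1,   0,   1,    0]     # o_t: 1 = it rained, 0 = it did not

squared_errors = [(f - o) ** 2 for f, o in zip(forecasts, outcomes)]
sse = sum(squared_errors)        # 0 + 1 + 0.09 + 0.49 = 1.58
mse = sse / len(squared_errors)  # 0.395
brier = mse                      # identical to the MSE of the forecasts

print(round(sse, 3), round(mse, 3), round(brier, 3))  # 1.58 0.395 0.395
```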
Empirical Probability
Empirical probability uses the number of occurrences of an outcome
within a sample set as a basis for determining the probability of that
outcome.
For example, the number of times "event X" happens out of 100 trials gives an estimate
of the probability of event X happening.
The empirical probability of an event is the ratio of the number of outcomes in
which a specified event occurs to the total number of trials.
Empirical probability (experimental probability) estimates probabilities from
experience and observation.
Example: In a buffet, 95 out of 100 people chose to order coffee over tea. What
is the empirical probability of someone ordering tea?
Answer: The empirical probability of someone ordering tea is 5/100 = 0.05 (5%).