DSML Classification
INTRODUCTION
Classification
§ Introduction
§ Logistic Regression
§ Classification Process
§ Naïve Bayes
§ Decision Trees
§ KNN
Classification
Classification is the problem of identifying to which of a set of categories (or labels) a new observation belongs.
Classification of the new observation is based on a training set of data containing observations (or
instances) whose category membership is known.
Regression example          Classification example
X1   X2   Y                 X1   X2   Y
10   20   100               10   20   A
15   30   150               15   30   A
 5   10    75                5   10   B
Logistic Regression
Logistic regression is a technique used for binary classification problems, where the goal is to predict one
of two possible outcomes.
Unlike linear regression, which predicts a continuous outcome, logistic regression predicts the probability
that a given input belongs to a certain class.
Logistic Regression
• What is the output of the logistic regression model?
• Is the output range bound?
• How is the output constrained to a range?
Logistic Regression
Logistic regression uses a logistic function (or sigmoid function) to model the probability of a particular outcome. The logistic function, σ(z) = 1 / (1 + e^(−z)), maps any real-valued number into the range (0, 1).
z      output
-5     0.01
-2     0.12
 0     0.50
 1     0.73
 2     0.88
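A minimal sketch of the sigmoid mapping (Python with NumPy; the values reproduce the table above):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Reproduce the z -> output table above
for z in [-5, -2, 0, 1, 2]:
    print(z, round(sigmoid(z), 2))   # -5 0.01, -2 0.12, 0 0.5, 1 0.73, 2 0.88
```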
Logistic Regression
Instance   Class (y_i)   Predicted p_i = P(y=1)   log(p_i)   Class (1 − y_i)   1 − p_i   log(1 − p_i)   −(y_i·log(p_i) + (1 − y_i)·log(1 − p_i))
1          1             0.8                      -0.22      0                 0.2       -1.61          0.223
2          1             0.9                      -0.11      0                 0.1       -2.30          0.105
3          0             0.1                      -2.30      1                 0.9       -0.11          0.105
4          0             0.2                      -1.61      1                 0.8       -0.22          0.223
5          1             0.9                      -0.11      0                 0.1       -2.30          0.105
6          0             0.3                      -1.20      1                 0.7       -0.36          0.357
8          0             0.4                      -0.92      1                 0.6       -0.51          0.511
9          1             0.6                      -0.51      0                 0.4       -0.92          0.511
10         0             0.1                      -2.30      1                 0.9       -0.11          0.105
Objective function value (mean log loss)                                                                 0.250
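A short sketch (Python with NumPy; the classes and probabilities are copied from the table above) showing that the objective function value of 0.250 is the mean of the per-instance log losses:

```python
import numpy as np

# Actual classes y_i and predicted probabilities p_i = P(y_i = 1) from the table above
y = np.array([1, 1, 0, 0, 1, 0, 0, 1, 0])
p = np.array([0.8, 0.9, 0.1, 0.2, 0.9, 0.3, 0.4, 0.6, 0.1])

# Per-instance binary cross-entropy: -(y*log(p) + (1-y)*log(1-p))
loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
print(np.round(loss, 3))        # [0.223 0.105 0.105 0.223 0.105 0.357 0.511 0.511 0.105]
print(f"{loss.mean():.3f}")     # 0.250  (the objective function value)
```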
Interpretation
Positive Beta: Indicates that as the predictor
increases, the probability of the outcome increases.
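A minimal sketch (Python with scikit-learn, synthetic data; the variable names are illustrative, not from the slides) of fitting a logistic regression and reading the sign of the coefficients:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                     # two illustrative predictors
# Synthetic target: the first predictor pushes the probability up, the second down
z = 2.0 * X[:, 0] - 1.5 * X[:, 1]
y = (rng.random(200) < 1 / (1 + np.exp(-z))).astype(int)

model = LogisticRegression().fit(X, y)
print(model.coef_, model.intercept_)   # a positive beta: probability of class "1" rises with that predictor
print(model.predict_proba(X[:3]))      # predicted probabilities for the first three instances
```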
Classification Process
Evaluation Measures
• AUC-ROC
  • Obtain the Receiver Operating Characteristic (ROC) curve and determine the AUC value
  • Determine the overall goodness of the model(s)
• Confusion Matrix
  • Obtain predicted probabilities
  • Examine the probability distribution and identify a threshold for classification
  • Assess accuracy, sensitivity, specificity, precision, F1-score, etc.
  • If needed, determine the threshold for classification using other methods
    • Youden's Index (see the sketch below)
    • Cost-Benefit Approach
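A small sketch (Python with scikit-learn; the arrays are illustrative) of picking a threshold with Youden's Index, J = TPR − FPR, from the ROC curve:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Illustrative actual classes and predicted probabilities for class "1"
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_prob = np.array([0.8, 0.7, 0.3, 0.9, 0.2, 0.1, 0.9, 0.4, 0.6, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
j = tpr - fpr                       # Youden's Index at each candidate threshold
best = thresholds[np.argmax(j)]     # threshold that maximises J
print(best)
```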
Model Performance
• Can I get different predictions based on the threshold?
• Can I get TPR and FPR values for different thresholds?
• Yes: by setting the probability threshold to each value between 0.01 and 0.99 and obtaining the predicted classes each time, I can get 99 different TPR and FPR values.
Classification Process
AUC-ROC (example ROC curve with AUC = 0.8; a short AUC computation sketch follows the steps below)
1. Obtain the predicted probabilities for the instances in the test data
2. Plot the probability distribution of predicted probabilities for classes “1” and “0” of test data
3. Select a suitable threshold from the probability distribution for classifying an instance as “1”
4. Obtain the predicted classes using the threshold
5. Create the confusion matrix
6. Calculate Accuracy, Sensitivity, Specificity, Precision and F1-Score
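A hedged sketch (Python with scikit-learn; the labels and probabilities are illustrative) of obtaining the AUC value from predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative test-set labels and model-predicted probabilities for class "1"
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_prob = np.array([0.8, 0.7, 0.3, 0.9, 0.2, 0.1, 0.9, 0.4, 0.6, 0.3])

auc = roc_auc_score(y_true, y_prob)   # area under the ROC curve; closer to 1 is better
print(round(auc, 2))
```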
Classification Process
Creating the Confusion Matrix
Step 1: Obtain the predicted probabilities (i.e., the probability of being class "1") for the instances in the test data
The test data contains the instances with their features and actual class labels. The model is applied to the test data to obtain the predicted probabilities for the instances.
#   Actual Class   Predicted Probability
1   1              0.8
2   0              0.7
3   1              0.3
4   1              0.9
5   0              0.2
6   0              0.1
7   1              0.9
Classification Process
Creating the Confusion Matrix
Step 2: Plot the probability distribution of the predicted probabilities for classes "1" and "0" of the test data
Classification Process
Creating the Confusion Matrix
Step 3: Select a suitable threshold from the probability distribution for classifying an instance as “1”
Classification Process
Creating the Confusion Matrix
Step 4: Obtain the predicted classes using the threshold.
Predicted Class = 1 if Predicted Probability > 0.25 (for example)
#   Actual Class   Predicted Probability   Predicted Class   Outcome
1   1              0.8                      1                 TP
2   0              0.7                      1                 FP
3   1              0.1                      0                 FN
4   1              0.9                      1                 TP
5   0              0.2                      0                 TN
6   0              0.1                      0                 TN
7   1              0.9                      1                 TP
Classification Process
Creating the Confusion Matrix
Step 5: Create the confusion matrix
                          Actual Class
                          1           0
Predicted Class    1      TP = 3      FP = 1
                   0      FN = 1      TN = 2
Classification Process
Creating the Confusion Matrix
Step 6: Calculate Accuracy, Sensitivity, Specificity, Precision and F1-Score
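A small sketch (Python) of the metric calculations, using the counts from the confusion matrix of the 7-instance example above (TP = 3, FP = 1, FN = 1, TN = 2):

```python
# Counts from the confusion matrix of the 7-instance example (threshold = 0.25)
TP, FP, FN, TN = 3, 1, 1, 2

accuracy    = (TP + TN) / (TP + TN + FP + FN)   # 5/7 ≈ 0.714
sensitivity = TP / (TP + FN)                    # recall / TPR: 3/4 = 0.75
specificity = TN / (TN + FP)                    # TNR: 2/3 ≈ 0.667
precision   = TP / (TP + FP)                    # 3/4 = 0.75
f1_score    = 2 * precision * sensitivity / (precision + sensitivity)  # 0.75
print(accuracy, sensitivity, specificity, precision, f1_score)
```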
Classification Process
Gain
Gain Tables and Curves show the proportion of targets captured by the model up to a chosen
percentile of the predicted probabilities. Higher gain values indicate better model performance in
capturing the target class compared to random selection.
Classification Process
Lift
Lift measures how much better the model is at identifying positive cases compared to a random
model. A lift greater than 1 indicates that the model is effective at identifying positive outcomes
better than random selection. A lift of 1 means that the model performs no better than random
guessing.
Gain and Lift - Example
Let's consider an insurance website where visitors explore various insurance products. In this context, a conversion
occurs when a visitor responds positively to an offer. We mark visitors who convert as "1" and those who do not
convert as "0." Assume we have developed a classification model designed to predict which visitors are likely to
convert. How do we calculate the gain and lift in this case?
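A hedged sketch (Python with pandas; the visitor data is synthetic, not from the slides) of computing gain and lift by decile of the predicted conversion probability:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"pred_prob": rng.random(1000)})                   # model-predicted conversion probability
df["converted"] = (rng.random(1000) < df["pred_prob"]).astype(int)   # 1 = visitor converted, 0 = did not

# Rank visitors by predicted probability and split into 10 deciles (decile 1 = highest scores)
df = df.sort_values("pred_prob", ascending=False).reset_index(drop=True)
df["decile"] = pd.qcut(np.arange(len(df)), 10, labels=False) + 1

summary = df.groupby("decile")["converted"].agg(["count", "sum"])
summary["cum_targets"] = summary["sum"].cumsum()
summary["gain"] = summary["cum_targets"] / df["converted"].sum()           # share of all converters captured so far
summary["lift"] = summary["gain"] / (summary["count"].cumsum() / len(df))  # vs. random selection (lift = 1)
print(summary)
```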
Gain and Lift - Example
[Figure: conversions captured without machine learning (random selection) vs. with machine learning (model-ranked selection)]
Naïve Bayes
Naïve Bayes is a Probabilistic Classifier. It is based on Bayes theorem. Bayes Theorem describes the
probability of an event, based on prior knowledge of conditions that might be related to the event.
The conditional probability that an object belongs to class Ck given the feature set X is given by P(Ck | X) = P(X | Ck) · P(Ck) / P(X). The "naïve" assumption is that the features in X are conditionally independent given the class, so P(X | Ck) is the product of the individual feature likelihoods.
Naïve Bayes
[Training data table: 12 instances with features Humidity, Temp and Outlook, and class label Play (7 Yes, 5 No)]
Naïve Bayes
We have to find the probability of playing when Humidity = Medium, Temp = Low and Outlook =
Overcast
X = {Humidity=Medium, Temp = Low, Outlook=Overcast}
P(X|CYes) * P(CYes)
= P(Humidity=Medium |Play= Yes) * P(Temp = Low | Play = Yes) * P(Outlook = Overcast | Play = Yes) * P (Yes)
= (2/7)*(2/7)*(2/7)*(7/12) = 0.0136
P(X|CNo) * P(CNo)
= P(Humidity=Medium |Play= No) * P(Temp = Low | Play = No) * P(Outlook = Overcast | Play = No) * P (No)
= (2/5)*(2/5)*(1/5)*(5/12) = 0.0133
P(X) = P(Humidity=Medium)*P(Temp=Low)*P(Outlook=Overcast)
P(X) = (4/12)*(4/12)*(3/12) = 0.0278
P(Play = Yes) when X = {Humidity = Medium, Temp = Low, Outlook = Overcast} = 0.0136/0.0278 = 0.490
P(Play = No) when X = {Humidity = Medium, Temp = Low, Outlook = Overcast} = 0.0133/0.0278 = 0.480
Naïve Bayes
We know that P(Yes | X) + P(No | X) = 1, therefore normalizing the results:
P(Play = Yes | X) = 0.49/(0.49 + 0.48) = 0.505
P(Play = No | X) = 0.48/(0.49 + 0.48) = 0.495
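A minimal sketch (Python) that reproduces the arithmetic above; the likelihoods and priors are the counts read off the training data in the example:

```python
# Likelihoods and priors taken from the worked example above
p_x_given_yes = (2/7) * (2/7) * (2/7)     # P(Humidity=Medium|Yes) * P(Temp=Low|Yes) * P(Outlook=Overcast|Yes)
p_x_given_no  = (2/5) * (2/5) * (1/5)
p_yes, p_no   = 7/12, 5/12

num_yes = p_x_given_yes * p_yes           # ≈ 0.0136
num_no  = p_x_given_no * p_no             # ≈ 0.0133

# Normalise so the two posteriors sum to 1
p_yes_given_x = num_yes / (num_yes + num_no)
p_no_given_x  = num_no / (num_yes + num_no)
print(round(p_yes_given_x, 3), round(p_no_given_x, 3))   # ≈ 0.505 0.495
```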
Decision Trees
A decision tree (classification tree) is an algorithm that constructs rules based on the independent variables (predictors) by recursively partitioning the data, in order to split the data into the given classes (class labels) so that the resulting subsets are as homogeneous as possible.
[Structure of a decision tree: the root node contains all the data and is recursively split into internal nodes and leaf nodes.]
Decision Trees - Example
[Two example datasets of coloured shapes (red, blue and green circles and squares), each instance labelled with class 0 or 1.]
Example 1 (split on Colour only):
IF Colour (IV) = Red, Class (DV) = 1
IF Colour (IV) = Blue, Class (DV) = 0
Example 2 (split on Colour and Shape):
IF Colour (IV) = Red & Shape (IV) = Circle, Class (DV) = 1
IF Colour (IV) = Green & Shape (IV) = Square, Class (DV) = 1
IF Colour (IV) = Red & Shape (IV) = Square, Class (DV) = 0
IF Colour (IV) = Green & Shape (IV) = Circle, Class (DV) = 0
Decision Trees - Example
Dataset: n = 14, Yes = 9, No = 5
IF Overcast, Play Golf (4 Yes)
IF Sunny and Not Windy, Play Golf (3 Yes)
IF Sunny and Windy, Do Not Play Golf (2 No)
IF Rainy and High Humidity, Do Not Play Golf (3 No)
IF Rainy and Normal Humidity, Play Golf (2 Yes)
[Decision tree diagram: the Sunny branch splits on Windy (True: n=2, Y=0, N=2; False: n=3, Y=3, N=0) and the Rainy branch splits on Humidity (High: n=3, Y=0, N=3; Normal: n=2, Y=2, N=0).]
Decision Trees - Challenge
How do we find the feature (IV) to be used for determining the split?
We select the feature which results in the most pure or homogeneous subsets.
There are various measures of purity or homogeneity. If we have two classes, a and b, with P(a) and P(b) being the probabilities of an instance belonging to class a and class b, then:
Gini impurity = 1 − P(a)² − P(b)²
Entropy = −P(a)·log₂P(a) − P(b)·log₂P(b)
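A small sketch (Python) of the two purity measures for a two-class node; the values match the examples on the following slides:

```python
import math

def gini(p_a):
    """Gini impurity for a two-class node: 1 - P(a)^2 - P(b)^2."""
    p_b = 1 - p_a
    return 1 - p_a**2 - p_b**2

def entropy(p_a):
    """Entropy for a two-class node: -P(a)*log2 P(a) - P(b)*log2 P(b)."""
    p_b = 1 - p_a
    return -sum(p * math.log2(p) for p in (p_a, p_b) if p > 0)

print(gini(0.5), entropy(0.5))       # 0.5, 1.0   (50/50 node)
print(gini(9/14), entropy(9/14))     # ≈ 0.459, 0.940  (root node of the golf example: 9 Yes, 5 No)
print(gini(3/5), entropy(3/5))       # ≈ 0.480, 0.971  (node with 3 Yes, 2 No)
```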
Decision Trees
1. Calculate the entropy / Gini impurity of the root node
2. Choose the attribute which results in the highest information gain or the largest reduction in impurity
Decision Trees
Gini = 0.5
Gini = 0.5
n=10, Y=5, N=5 n=10, Y=5, N=5
Feature Feature
D1 D2
n=5, Y=0, N=5 n=5, Y=5, N=0 n=5, Y=3, N=2 n=5, Y=2, N=3
Gini = Gini =
Avg Gini = 0.0 Avg Gini = 0.480 0.480
Gini = 0.0 Gini = 0.0 0.480
Decision Trees
Root node: n=10, Y=5, N=5, Entropy = 1.0
Split on Feature D1: left node n=5, Y=0, N=5 (Entropy = 0.0); right node n=5, Y=5, N=0 (Entropy = 0.0); weighted average entropy = 0.0
Split on Feature D2: left node n=5, Y=3, N=2 (Entropy = 0.97); right node n=5, Y=2, N=3 (Entropy = 0.97); weighted average entropy = 0.97
Feature D1 gives the larger information gain (1.0 → 0.0), so it is chosen for the split.
Decision Trees - GINI
Example [Gini calculation]
Root node 00: n=14, Y=9, N=5, Gini = 0.459
Split on Outlook = Overcast (weighted Gini after the split, A_Gini = 0.357):
  Node 11 (Overcast): n=4, Y=4, N=0, Gini = 0.000
  Node 12 (Not overcast): n=10, Y=5, N=5, Gini = 0.500
Node 12 split on Outlook = Sunny (weighted Gini after the split, A_Gini = 0.480):
  Node 21 (Sunny): n=5, Y=3, N=2, Gini = 0.480
  Node 22 (Rainy): n=5, Y=2, N=3, Gini = 0.480
Node 21 split on Windy:
  Windy = True: n=2, Y=0, N=2, Gini = 0.000
  Windy = False: n=3, Y=3, N=0, Gini = 0.000
Node 22 split on Humidity:
  Humidity = High: n=3, Y=0, N=3, Gini = 0.000
  Humidity = Normal: n=2, Y=2, N=0, Gini = 0.000
Decision Trees - Entropy
Example [Entropy calculation]
Root node 00: n=14, Y=9, N=5, Entropy = 0.940
Split on Outlook = Overcast (weighted entropy after the split, A_Ent = 0.714):
  Node 11 (Overcast): n=4, Y=4, N=0, Entropy = 0.000
  Node 12 (Not overcast): n=10, Y=5, N=5, Entropy = 1.000
Node 12 split on Outlook = Sunny (weighted entropy after the split = 0.971):
  Node 21 (Sunny): n=5, Y=3, N=2, Entropy = 0.971
  Node 22 (Rainy): n=5, Y=2, N=3, Entropy = 0.971
Node 21 split on Windy:
  Windy = True: n=2, Y=0, N=2, Entropy = 0.000
  Windy = False: n=3, Y=3, N=0, Entropy = 0.000
Node 22 split on Humidity:
  Humidity = High: n=3, Y=0, N=3, Entropy = 0.000
  Humidity = Normal: n=2, Y=2, N=0, Entropy = 0.000
k-NN
In k-NN Classification, an object is assigned to the class most
common among its k nearest neighbors.
The test sample (green dot) should be classified either to blue squares or to red triangles.
If k = 3 (solid line circle) it is assigned to the red triangles because there are 2 triangles and only 1 square inside the
inner circle.
If k = 5 (dashed line circle) it is assigned to the blue squares (3 squares vs. 2 triangles inside the outer circle).
k-NN
In k-NN Classification, an object is assigned to the class most common among its k nearest neighbors.
• The algorithm only stores the training examples during the learning phase
• The algorithm is executed during the classification phase. The unlabeled observation is assigned the label
which is the most frequent among its k-nearest neighbours.
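A minimal sketch (Python with scikit-learn, toy data invented for illustration) of k-NN classification with k = 3:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: two features, binary class labels
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 5], [7, 7], [8, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)   # label = majority class among the 3 nearest neighbours
knn.fit(X_train, y_train)                   # "training" only stores the examples

print(knn.predict([[2, 2], [7, 6]]))        # [0 1]
print(knn.predict_proba([[5, 4]]))          # fraction of the 3 neighbours belonging to each class
```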
[Summary of the four classifiers covered: Logistic Regression, Naïve Bayes, CART, KNN]
Model Evaluation – Training and Testing
[The data is split into four folds: Fold1, Fold2, Fold3, Fold4]
Train Model A on Data using Fold1, Fold2, Fold3 and test it on Fold4
Train Model B on Data using Fold2, Fold3, Fold4 and test it on Fold1
Train Model C on Data using Fold3, Fold4, Fold1 and test it on Fold2
Train Model D on Data using Fold4, Fold1, Fold2 and test it on Fold3
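A short sketch (Python with scikit-learn; the model and data are illustrative) of the 4-fold scheme described above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

kf = KFold(n_splits=4, shuffle=True, random_state=0)   # Fold1..Fold4
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])   # train on three folds
    acc = accuracy_score(y[test_idx], model.predict(X[test_idx]))  # test on the held-out fold
    print(f"Held-out fold {i}: accuracy = {acc:.2f}")
```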
Thank You