DSML Classification

Classification

INTRODUCTION

Classification
§ Introduction
§ Logistic Regression
§ Classification Process
§ Naïve Bayes
§ Decision Trees
§ KNN

Classification
Classification is the problem of identifying to which of a set of categories (labels) a new observation belongs.

Classification of the new observation is based on a training set of data containing observations (or
instances) whose category membership is known.

Regression                 Classification
X1   X2   Y                X1   X2   Y
10   20   100              10   20   A
15   30   150              15   30   A
5    10   75               5    10   B
Logistic Regression
Logistic regression is a technique used for binary classification problems, where the goal is to predict one
of two possible outcomes.

Unlike linear regression, which predicts a continuous outcome, logistic regression predicts the probability
that a given input belongs to a certain class.

Logistic Regression
• What is the output of the logistic regression model?
• Is the output range bound?
• How is the output constrained to a range?

Logistic Regression
Logistic regression uses a logistic function (or sigmoid function) to model the probability of a particular
outcome. The logistic function maps any real-valued number into the range (0, 1):

sigmoid(z) = 1 / (1 + e^(-z))

z     output
-5    0.01
-2    0.12
0     0.50
1     0.73
2     0.88
Logistic Regression

What do you notice in the picture? [Figure: the S-shaped sigmoid curve]
Logistic Regression

The model expresses the probability of class 1 through the sigmoid of a linear combination of the predictors:

p = P(y = 1 | X) = 1 / (1 + e^-(β0 + β1x1 + β2x2 + ... + βkxk))
Logistic Regression
The model is fit by minimizing the log loss (binary cross-entropy) over all instances:

Loss = - (yi * log p(yi) + (1 - yi) * log(1 - p(yi)))

Instance | Class (yi) | Probability p(yi) | log p(yi) | Class (1-yi) | Probability (1-p(yi)) | log(1-p(yi)) | Loss
1        | 1          | 0.8               | -0.22     | 0            | 0.2                   | -1.61        | 0.223
2        | 1          | 0.9               | -0.11     | 0            | 0.1                   | -2.30        | 0.105
3        | 0          | 0.1               | -2.30     | 1            | 0.9                   | -0.11        | 0.105
4        | 0          | 0.2               | -1.61     | 1            | 0.8                   | -0.22        | 0.223
5        | 1          | 0.9               | -0.11     | 0            | 0.1                   | -2.30        | 0.105
6        | 0          | 0.3               | -1.20     | 1            | 0.7                   | -0.36        | 0.357
8        | 0          | 0.4               | -0.92     | 1            | 0.6                   | -0.51        | 0.511
9        | 1          | 0.6               | -0.51     | 0            | 0.4                   | -0.92        | 0.511
10       | 0          | 0.1               | -2.30     | 1            | 0.9                   | -0.11        | 0.105

Objective Function Value (mean loss): 0.250
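A minimal NumPy sketch (not from the original slides) of the per-instance losses and the objective value:

```python
import numpy as np

y = np.array([1, 1, 0, 0, 1, 0, 0, 1, 0])                     # actual classes
p = np.array([0.8, 0.9, 0.1, 0.2, 0.9, 0.3, 0.4, 0.6, 0.1])   # predicted P(class = 1)

loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))  # per-instance log loss
print(loss.round(3))         # [0.223 0.105 0.105 0.223 0.105 0.357 0.511 0.511 0.105]
print(loss.mean().round(3))  # 0.25
```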

Logistic Regression

Are the betas significant? If so, the feature has a significant impact on the outcome.
Interpretation
Positive Beta: Indicates that as the predictor increases, the probability of the outcome increases.

Negative Beta: Indicates that as the predictor increases, the probability of the outcome decreases.

Magnitude of Beta: The larger the absolute value of β, the stronger the association between the predictor and the outcome.

If β = 0.5, then e^0.5 ≈ 1.65. This means the odds of the outcome are 65% higher for each one-unit increase in the predictor.
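A one-line check of this odds-ratio interpretation (an illustration, not from the slides):

```python
import numpy as np

beta = 0.5
odds_ratio = np.exp(beta)   # e^0.5 ≈ 1.65: odds multiplier per one-unit increase
print(f"odds ratio: {odds_ratio:.2f} -> {100 * (odds_ratio - 1):.0f}% higher odds")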

Classification Process
Evaluation Measures
• AUC-ROC
• Obtain the Receiver-Operating Characteristic
Curve and determine the AUC value
• Determine the overall goodness of the model(s)
• Confusion Matrix
• Obtain Predicted Probabilities
• Examine Probability Distribution and Identify
Threshold for Classification
• Assess accuracy, sensitivity, specificity,
precision, F1-score, etc.
• If needed, determine the threshold for classification
using other methods
• Youden's Index
• Cost-Benefit Approach

Model Performance

Can I get different predictions based on threshold? Can I get TPR and FPR for different thresholds?
Yes: we can get 99 different TPR and FPR values by obtaining predicted classes each time, setting the
probability threshold anywhere between 0.01 and 0.99.

Case | Actual Class | Pred Prob | Predicted Class | Type
1    | 1            | 0.9       | 1               | True Positive
2    | 0            | 0.1       | 0               | True Negative
3    | 1            | 0.8       | 0               | False Negative
4    | 0            | 0.3       | 1               | False Positive
5    | 1            | 0.7       | 1               | True Positive
.    | .            | .         | .               | .
N    | 0            | 0.4       | 0               | True Negative

TPR: Ratio of True Positives to all actual positive (Class 1) observations
FPR: Ratio of False Positives to all actual negative (Class 0) observations
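A sketch of the threshold sweep described above; the toy arrays are assumptions for illustration:

```python
import numpy as np

y_true = np.array([1, 0, 1, 0, 1, 0])              # assumed actual labels
y_prob = np.array([0.9, 0.1, 0.8, 0.3, 0.7, 0.4])  # assumed predicted P(class = 1)

points = []
for t in np.arange(0.01, 1.00, 0.01):              # 99 candidate thresholds
    y_pred = (y_prob > t).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)                 # True Positive Rate
    fpr = fp / np.sum(y_true == 0)                 # False Positive Rate
    points.append((round(t, 2), tpr, fpr))

print(points[24])   # TPR and FPR at threshold 0.25, e.g. (0.25, 1.0, 0.67)
```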

Classification Process
AUC-ROC

AUC quantifies the overall ability of the model to discriminate between positive and negative classes,
with values closer to 1 indicating better performance. [Figure: ROC curves labeled Good and Bad, with AUC = 0.8]

To find the AUC-ROC:
1. Obtain the predicted probabilities for instances in the test dataset
2. Use the actual target labels and their predicted probabilities to plot the ROC curve and obtain AUC

TPR: Ratio of True Positives to all actual positive (Class 1) observations
FPR: Ratio of False Positives to all actual negative (Class 0) observations
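A sketch using scikit-learn's roc_curve and roc_auc_score; the toy arrays are assumptions, not slide data:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_test = [1, 0, 1, 1, 0, 0, 1]                   # assumed actual labels
y_prob = [0.8, 0.7, 0.1, 0.9, 0.2, 0.1, 0.9]     # assumed predicted P(class = 1)

fpr, tpr, thresholds = roc_curve(y_test, y_prob)  # points of the ROC curve
print(roc_auc_score(y_test, y_prob))              # area under that curve, ≈ 0.79 here
```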
Classification Process
A confusion matrix is a fundamental tool used to evaluate the performance of a classification
model. It provides a detailed breakdown of the model's predictions compared to the actual
outcomes.

Creating the Confusion Matrix

1. Obtain the predicted probabilities for the instances in the test data
2. Plot the probability distribution of predicted probabilities for classes “1” and “0” of test data
3. Select a suitable threshold from the probability distribution for classifying an instance as “1”
4. Obtain the predicted classes using the threshold
5. Create the confusion matrix
6. Calculate Accuracy, Sensitivity, Specificity, Precision and F1-Score

Classification Process
Creating the Confusion Matrix

Step 1: Obtain the predicted probabilities (the probability of being class "1") for the instances in the test data.

The test data contains the instances with their features and actual class labels. The model is applied
on the test data to obtain the predicted probabilities for the instances.

# | Actual Class | Predicted Probability
1 | 1            | 0.8
2 | 0            | 0.7
3 | 1            | 0.1
4 | 1            | 0.9
5 | 0            | 0.2
6 | 0            | 0.1
7 | 1            | 0.9
Classification Process
Creating the Confusion Matrix
Step 2: Plot the probability distribution of the predicted probabilities for classes "1" and "0" of the test data.

Try to select a threshold that reduces the margin of overlap between the two distributions
(the overlap is responsible for incorrect predictions).
Classification Process
Creating the Confusion Matrix
Step 3: Select a suitable threshold from the probability distribution for classifying an instance as "1".

Try to select a threshold that reduces the margin of overlap (overlap is responsible for incorrect
predictions). [Figure: overlapping predicted-probability distributions with TN, TP, FN and FP regions marked]

Let's say we select 0.25 to be the threshold. All instances in the test data with predicted probabilities
> 0.25 will be labeled as Class 1.
Classification Process
Creating the Confusion Matrix
Step 4: Obtain the predicted classes using the threshold.
Predicted Class = 1 if Predicted Probability > 0.25 (for example)

# | Actual Class | Predicted Probability | Predicted Class | Type
1 | 1            | 0.8                   | 1               | TP
2 | 0            | 0.7                   | 1               | FP
3 | 1            | 0.1                   | 0               | FN
4 | 1            | 0.9                   | 1               | TP
5 | 0            | 0.2                   | 0               | TN
6 | 0            | 0.1                   | 0               | TN
7 | 1            | 0.9                   | 1               | TP
Classification Process
Creating the Confusion Matrix
Step 5: Create the confusion matrix

                      Actual Class
                      Class (1)   Class (0)
Predicted Class (1)   3 [TP]      1 [FP]
Predicted Class (0)   1 [FN]      2 [TN]
Classification Process
Creating the Confusion Matrix
Step 6: Calculate Accuracy, Sensitivity, Specificity, Precision and F1-Score

Accuracy             = (TP + TN) / (TP + TN + FP + FN) = 5/7 ≈ 0.71
Sensitivity (Recall) = TP / (TP + FN) = 3/4 = 0.75
Specificity          = TN / (TN + FP) = 2/3 ≈ 0.67
Precision            = TP / (TP + FP) = 3/4 = 0.75
F1-Score             = 2 * Precision * Recall / (Precision + Recall) = 0.75
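A scikit-learn sketch reproducing the metrics of this worked example at threshold 0.25; a sketch, not a definitive pipeline:

```python
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 0, 1]                    # actual classes from the example
y_prob = [0.8, 0.7, 0.1, 0.9, 0.2, 0.1, 0.9]      # predicted probabilities from Step 1
y_pred = [1 if p > 0.25 else 0 for p in y_prob]   # Step 4: apply the threshold

print(confusion_matrix(y_true, y_pred))  # [[TN FP], [FN TP]] = [[2 1], [1 3]]
print(accuracy_score(y_true, y_pred))    # ≈ 0.71
print(recall_score(y_true, y_pred))      # sensitivity = 0.75
print(precision_score(y_true, y_pred))   # 0.75
print(f1_score(y_true, y_pred))          # 0.75
```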
Classification Process
Gain

Gain Tables and Curves show the proportion of targets captured by the model up to a chosen
percentile of the predicted probabilities. Higher gain values indicate better model performance in
capturing the target class compared to random selection.

Classification Process
Lift

Lift measures how much better the model is at identifying positive cases compared to a random
model. A lift greater than 1 indicates that the model is effective at identifying positive outcomes
better than random selection. A lift of 1 means that the model performs no better than random
guessing.

Gain and Lift - Example
Let's consider an insurance website where visitors explore various insurance products. In this context, a conversion
occurs when a visitor responds positively to an offer. We mark visitors who convert as "1" and those who do not
convert as "0". Assume we have developed a classification model designed to predict which visitors are likely to
convert. How do we calculate the gain and lift in this case?

Number of visitors: 1000
Number of converts in the entire dataset: 200
Conversion percentage: 200/1000 = 20%
Number of actual converts in the top 20% of leads (200 predictions) by predicted probability: 80
Gain: 80/200 = 40% [the model captures 40% of all converted leads within these two deciles]
Number of converts identifiable by random guessing: 20% of 200 = 40
Lift: 80/40 = 2 [the model is 2 times better than random guessing at capturing leads that convert in the top two deciles]
Gain and Lift - Example

Without Machine Learning                       With Machine Learning

Decile | Visitors | Converts | % Converts     Decile | Visitors | Converts | % Converts
1      | 100      | 20       | 20             1      | 100      | 50       | 50
2      | 100      | 20       | 20             2      | 100      | 30       | 30
3      | 100      | 20       | 20             3      | 100      | 25       | 25
4      | 100      | 20       | 20             4      | 100      | 20       | 20
5      | 100      | 20       | 20             5      | 100      | 17       | 17
6      | 100      | 20       | 20             6      | 100      | 15       | 15
7      | 100      | 20       | 20             7      | 100      | 13       | 13
8      | 100      | 20       | 20             8      | 100      | 12       | 12
9      | 100      | 20       | 20             9      | 100      | 10       | 10
10     | 100      | 20       | 20             10     | 100      | 8        | 8
Total  | 1000     | 200      | 20             Total  | 1000     | 200      | 20
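A NumPy sketch (not from the slides) computing cumulative gain and lift from the "With Machine Learning" decile counts:

```python
import numpy as np

converts_per_decile = np.array([50, 30, 25, 20, 17, 15, 13, 12, 10, 8])  # "With ML" column
total_converts = converts_per_decile.sum()                               # 200

cum_gain = converts_per_decile.cumsum() / total_converts  # fraction of converts captured so far
cum_lift = cum_gain / (np.arange(1, 11) / 10)             # vs. random selection of the same share
print(cum_gain[1])   # 0.4 -> gain of 40% within the top two deciles
print(cum_lift[1])   # 2.0 -> lift of 2 in the top two deciles
```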
Naïve Bayes
Naïve Bayes is a probabilistic classifier based on Bayes' theorem. Bayes' theorem describes the
probability of an event, based on prior knowledge of conditions that might be related to the event.

The conditional probability that an object belongs to class Ck given the feature set X is given by

P(Ck | X) = P(X | Ck) * P(Ck) / P(X)
Naïve Bayes

[Training data table with features Humidity, Temp and Outlook, and class Play (n = 12: 7 Yes, 5 No)]
Naïve Bayes
We have to find the probability of playing when Humidity = Medium, Temp = Low and Outlook = Overcast.
X = {Humidity = Medium, Temp = Low, Outlook = Overcast}

P(X | C_Yes) * P(C_Yes)
= P(Humidity = Medium | Play = Yes) * P(Temp = Low | Play = Yes) * P(Outlook = Overcast | Play = Yes) * P(Yes)
= (2/7) * (2/7) * (2/7) * (7/12) = 0.0136

P(X | C_No) * P(C_No)
= P(Humidity = Medium | Play = No) * P(Temp = Low | Play = No) * P(Outlook = Overcast | Play = No) * P(No)
= (2/5) * (2/5) * (1/5) * (5/12) = 0.0133

P(X) = P(Humidity = Medium) * P(Temp = Low) * P(Outlook = Overcast)
     = (4/12) * (4/12) * (3/12) = 0.0278

P(Cricket = Y | Medium Humidity, Low Temp, Overcast Outlook) = 0.0136/0.0278 = 0.490
P(Cricket = N | Medium Humidity, Low Temp, Overcast Outlook) = 0.0133/0.0278 = 0.480
Naïve Bayes
We know that P(Yes) + P(No) = 1, therefore normalizing the results:

P(Cricket = Y | Medium Humidity, Low Temp, Overcast Outlook) = 0.49/(0.49 + 0.48) = 0.505
P(Cricket = N | Medium Humidity, Low Temp, Overcast Outlook) = 0.48/(0.49 + 0.48) = 0.495
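A small Python sketch (not from the slides) reproducing the arithmetic above:

```python
# Likelihoods of each feature value given the class, from the training counts
p_x_given_yes = (2/7) * (2/7) * (2/7)   # given Play = Yes
p_x_given_no  = (2/5) * (2/5) * (1/5)   # given Play = No
p_yes, p_no   = 7/12, 5/12              # class priors

score_yes = p_x_given_yes * p_yes       # ≈ 0.0136
score_no  = p_x_given_no * p_no         # ≈ 0.0133

# Normalize so the two posteriors sum to 1
total = score_yes + score_no
print(score_yes / total, score_no / total)   # ≈ 0.505, 0.495
```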
Decision Trees
A decision tree (classification tree) is an algorithm that constructs rules based on the independent variables
(predictors) by recursively partitioning the data, in order to split the data into given classes (class labels)
that are as homogeneous as possible.

Structure of a decision tree:
Root Node: contains all data
Decision Node: applies a rule using one of the IVs; branches on whether the condition is satisfied (Yes / No)
Terminal Node: a leaf where no further rule is applied and a class is assigned
Decision Trees - Example

[Figure: two datasets of red, blue and green circles and squares labeled with classes 0 and 1]

Left tree:   IF Colour (IV) = Red, Class (DV) = 1
             IF Colour (IV) = Blue, Class (DV) = 0

Right tree:  IF Colour (IV) = Red & Shape (IV) = Circle, Class (DV) = 1
             IF Colour (IV) = Red & Shape (IV) = Square, Class (DV) = 0
             IF Colour (IV) = Green & Shape (IV) = Square, Class (DV) = 1
             IF Colour (IV) = Green & Shape (IV) = Circle, Class (DV) = 0
Decision Trees - Example

n = 14, Y = 9, N = 5

Rules learned:
IF Overcast, Play Golf (4Y)
IF Sunny and Not Windy, Play Golf (3Y)
IF Sunny and Windy, Not Play Golf (2N)
IF Rainy and High Humidity, Not Play Golf (3N)
IF Rainy and Normal Humidity, Play Golf (2Y)

Tree:
Outlook = Overcast?
├─ Yes: n=4, Y=4, N=0 (leaf: Play)
└─ No:  n=10, Y=5, N=5
   Outlook = Sunny?
   ├─ Yes: n=5, Y=3, N=2
   │  Windy = True?
   │  ├─ Yes: n=2, Y=0, N=2 (leaf: Not Play)
   │  └─ No:  n=3, Y=3, N=0 (leaf: Play)
   └─ No:  n=5, Y=2, N=3
      Humidity = High?
      ├─ Yes: n=3, Y=0, N=3 (leaf: Not Play)
      └─ No:  n=2, Y=2, N=0 (leaf: Play)
Decision Trees - Challenge
How do we find the feature (IV) to be used for determining the split?

We select the feature which results in the most pure or homogeneous subsets.

So how do we measure homogeneity or purity?

There are various measures of purity or homogeneity. If we have two classes, a and b, with probabilities
P(a) and P(b), then:

(1) Gini Impurity
    Gini_Impurity_Node = 1 - P(a)^2 - P(b)^2
(2) Information Gain
    Entropy_Node = -P(a)*log2(P(a)) - P(b)*log2(P(b))
    Information Gain = Entropy of original set - weighted entropy of the sets resulting after the split
(3) Variance Reduction (usually for regression trees)
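A small Python sketch (not from the slides) of these purity measures for a two-class node:

```python
import numpy as np

def gini_impurity(p_a):
    p_b = 1 - p_a
    return 1 - p_a**2 - p_b**2

def entropy(p_a):
    p_b = 1 - p_a
    # Convention: 0 * log2(0) = 0 for pure nodes
    return -sum(p * np.log2(p) for p in (p_a, p_b) if p > 0)

print(gini_impurity(0.5))        # 0.5  (most impure node)
print(gini_impurity(1.0))        # 0.0  (pure node)
print(entropy(0.5))              # 1.0
print(round(entropy(9/14), 3))   # 0.940 (root node of the golf example)
```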
Decision Trees
1. Calculate the entropy / Gini impurity of the root node
2. Choose the attribute which results in the highest information gain or reduction in impurity
3. Repeat the procedure until no more splits are possible

The attribute that gives the highest resulting homogeneity is said to have the highest information gain.
Decision Trees

Comparing two candidate splits of the root node (n=10, Y=5, N=5, Gini = 0.5):

Feature D1: splits into (n=5, Y=0, N=5, Gini = 0.0) and (n=5, Y=5, N=0, Gini = 0.0)
            Avg Gini = 0.0
Feature D2: splits into (n=5, Y=3, N=2, Gini = 0.480) and (n=5, Y=2, N=3, Gini = 0.480)
            Avg Gini = 0.480

Feature D1 produces the purer subsets, so it is preferred for the split.
Decision Trees

The same comparison using entropy (root node: n=10, Y=5, N=5, Entropy = 1):

Feature D1: splits into (n=5, Y=0, N=5, Ent = 0.0) and (n=5, Y=5, N=0, Ent = 0.0)
            Avg Entropy = 0.0
Feature D2: splits into (n=5, Y=3, N=2, Ent = 0.97) and (n=5, Y=2, N=3, Ent = 0.97)
            Avg Entropy = 0.97
Decision Trees - GINI
(Gini = node Gini; A_Gini = weighted Gini of a split)

Root: n=14, Y=9, N=5, Gini = 0.459
Split on Outlook = Overcast (A_Gini = 0.357):
├─ Yes: n=4, Y=4, N=0, Gini = 0.0
└─ No:  n=10, Y=5, N=5, Gini = 0.500
   Split on Outlook = Sunny (A_Gini = 0.480):
   ├─ Yes: n=5, Y=3, N=2, Gini = 0.480
   │  Split on Windy = True (A_Gini = 0.0):
   │  ├─ Yes: n=2, Y=0, N=2, Gini = 0.0
   │  └─ No:  n=3, Y=3, N=0, Gini = 0.0
   └─ No:  n=5, Y=2, N=3, Gini = 0.480
      Split on Humidity = High (A_Gini = 0.0):
      ├─ Yes: n=3, Y=0, N=3, Gini = 0.0
      └─ No:  n=2, Y=2, N=0, Gini = 0.0
Decision Trees - Entropy
Example [Entropy Calculation]
(E = node entropy; A_Ent = weighted entropy of a split)

Root: n=14, Y=9, N=5, E = 0.940
Split on Outlook = Overcast (A_Ent = 0.714):
├─ Yes: n=4, Y=4, N=0, E = 0.0
└─ No:  n=10, Y=5, N=5, E = 1.000
   Split on Outlook = Sunny (A_Ent = 0.971):
   ├─ Yes: n=5, Y=3, N=2, E = 0.971
   │  Split on Windy = True (A_Ent = 0.0):
   │  ├─ Yes: n=2, Y=0, N=2, E = 0.0
   │  └─ No:  n=3, Y=3, N=0, E = 0.0
   └─ No:  n=5, Y=2, N=3, E = 0.971
      Split on Humidity = High (A_Ent = 0.0):
      ├─ Yes: n=3, Y=0, N=3, E = 0.0
      └─ No:  n=2, Y=2, N=0, E = 0.0
k-NN
In k-NN classification, an object is assigned to the class most common among its k nearest neighbors.

[Figure: a green test point surrounded by blue squares and red triangles]

The test sample (green dot) should be classified either to the blue squares or to the red triangles.

If k = 3 (solid line circle) it is assigned to the red triangles because there are 2 triangles and only 1 square inside the
inner circle.

If k = 5 (dashed line circle) it is assigned to the blue squares (3 squares vs. 2 triangles inside the outer circle).
k-NN
In k-NN classification, an object is assigned to the class most common among its k nearest neighbors.
• The algorithm only stores the training examples during the learning phase
• The algorithm is executed during the classification phase. The unlabeled observation is assigned the label
  which is the most frequent among its k nearest neighbours.
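A minimal scikit-learn sketch of k-NN classification; the training points are assumed toy data:

```python
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]]  # assumed toy features
y_train = [0, 0, 0, 1, 1, 1]                                # assumed toy labels

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)        # "learning" = storing the training examples
print(knn.predict([[7, 6]]))     # majority label among the 3 nearest neighbours -> [1]
```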
Comparison: Logistic Regression | Naïve Bayes | CART | KNN

Logistic Regression
+ White-box model; gives a very nice probabilistic estimate
+ Robust enough against overfitting, especially when regularized
+ Efficient, no assumption of distributions
- Useful only in linear models

Naïve Bayes
+ Preferred approach with a large # of categorical IVs
+ Useful in non-linear patterns; computationally efficient
- Does not give good results when assumptions of independence are violated
- Continuous IVs must hold a normal distribution

CART
+ White-box model; useful in non-linear patterns
+ Ensembles perform very well
- Prone to overfitting; sensitive to small changes in values in the data
- Large trees are difficult to interpret

KNN
+ Useful in non-linear patterns
+ Robust results in large sample size
- Choosing k is difficult; memory intensive
- Poor performance on high-dimension data
- More susceptible to noise in small sample size
Model Evaluation – Training and Testing

Data: Fold1 | Fold2 | Fold3 | Fold4

Use a k-fold validation strategy:

Train Model A on Fold1, Fold2, Fold3 and test it on Fold4
Train Model B on Fold2, Fold3, Fold4 and test it on Fold1
Train Model C on Fold3, Fold4, Fold1 and test it on Fold2
Train Model D on Fold4, Fold1, Fold2 and test it on Fold3

Confusion matrix measures on Fold1, Fold2, Fold3 and Fold4 should be similar.
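A scikit-learn sketch of this 4-fold strategy with synthetic data (an illustration, not the course's exact setup):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)  # synthetic data
model = LogisticRegression()

scores = cross_val_score(model, X, y, cv=4)   # one accuracy score per held-out fold
print(scores)   # the four fold scores should be similar for a stable model
```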
Thank You
