Classification Techniques
Classification
• Classification predicts categorical (discrete) labels
• Example: categorize bank loan applications as either safe or risky
• A tuple, X, is represented by an n-dimensional attribute vector, X = (x1, x2, …, xn), depicting n
measurements made on the tuple from n database attributes, respectively, A1, A2, …, An
• Each tuple, X, is assumed to belong to a predefined class as determined by another database
attribute called the class label attribute.
• The class label attribute is discrete-valued and unordered. It is categorical in that each value
serves as a category or class.
• Data classification is a two-step process
• In the first step, a classification algorithm builds the classifier by analyzing or “learning from” a training set
made up of database tuples and their associated class labels.
• In the second step, the model is used for classification
• Because the class label of each training tuple is provided, this step is also known as supervised
learning
Decision Tree
• Decision tree induction is the learning of decision
trees from class-labeled training tuples.
• A decision tree is a flowchart-like tree structure,
where each internal node (nonleaf node) denotes
a test on an attribute, each branch represents an
outcome of the test, and each leaf node (or
terminal node) holds a class label.
• The topmost node in a tree is the root node.
• Given a tuple, X, for which the associated class
label is unknown, the attribute values of the
tuple are tested against the decision tree.
• A path is traced from the root to a leaf node,
which holds the class for that tuple.
Decision Tree types
• An attribute selection measure is a heuristic for selecting the splitting criterion
that “best” separates a given data partition, D, of class-labeled
training tuples into individual classes.
• Attribute selection measures and the algorithms that use them:
▪ Information gain – ID3
▪ Gain ratio – C4.5
▪ Gini index - CART
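For concreteness, a minimal sketch of ID3's information-gain measure; the loan-style tuples and the attribute values below are invented for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): expected bits to classify a tuple given the class mix."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain(A) = Info(D) - Info_A(D) for one categorical attribute index."""
    n = len(labels)
    parts = {}
    for row, lbl in zip(rows, labels):
        parts.setdefault(row[attr], []).append(lbl)
    info_a = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(labels) - info_a

# Invented loan tuples: (income, has_job) -> safe / risky.
rows = [("high", "yes"), ("high", "no"), ("low", "yes"), ("low", "no")]
labels = ["safe", "safe", "risky", "risky"]
```

Here income splits the classes perfectly (gain 1 bit) while has_job carries no information (gain 0), so ID3 would split on income.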
ID3 Decision Tree
Naive Bayesian Classification
Bayesian Classification
• Bayesian classifiers are statistical classifiers.
• They can predict class membership probabilities, such as the
probability that a given tuple belongs to a particular class.
• Bayes’ Theorem: P(H|X) = P(X|H) P(H) / P(X), where H is a hypothesis (such as “X belongs to class C”) and X is the observed tuple
Naive Bayesian Classification
• Let X = (x1, x2, …, xn) represent a tuple with attribute vector over attributes A1, A2, …, An
• Let C1, C2, …, Cm represent the m classes
• X belongs to the class Ci if and only if P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i
• Where P(Ci|X) = P(X|Ci) P(Ci) / P(X), and the naive assumption of class-conditional independence gives P(X|Ci) = ∏k=1..n P(xk|Ci)
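The classification rule above can be sketched for categorical attributes as follows; the loan-style tuples are invented, and no zero-probability smoothing is applied:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count-based estimates of P(Ci) and P(xk | Ci); no smoothing."""
    priors = Counter(labels)
    cond = defaultdict(Counter)          # (class, attr index) -> value counts
    for row, c in zip(rows, labels):
        for k, v in enumerate(row):
            cond[(c, k)][v] += 1
    return priors, cond, len(labels)

def classify_nb(x, priors, cond, n):
    """Return the Ci maximizing P(Ci) * prod_k P(xk | Ci)."""
    best, best_score = None, -1.0
    for c, count in priors.items():
        score = count / n                        # P(Ci)
        for k, v in enumerate(x):
            score *= cond[(c, k)][v] / count     # P(xk | Ci)
        if score > best_score:
            best, best_score = c, score
    return best

# Invented loan tuples: (age, income) -> risky / safe.
rows = [("young", "low"), ("young", "high"), ("old", "low"), ("old", "high")]
labels = ["risky", "risky", "safe", "safe"]
priors, cond, n = train_nb(rows, labels)
```

In practice a Laplacian correction is added so that a single unseen attribute value does not zero out the whole product.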
Neural Network
Feed-forward Neural Network
• The backpropagation algorithm performs learning on a multilayer
feed-forward neural network.
• It iteratively learns a set of weights for prediction of the class label of
tuples.
• A multilayer feed-forward neural network consists of an input layer,
one or more hidden layers, and an output layer.
Learning by the backpropagation algorithm
Sample calculations for learning by the
backpropagation algorithm.
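The per-step arithmetic can be illustrated with a tiny sketch; the network size (2-2-1), initial weights, biases, learning rate, and input tuple are all assumed here, not taken from the slides:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny 2-2-1 network; every number below is an assumed illustration.
x = [1.0, 0.0]                      # input tuple
target = 1.0                        # its known class label
w_h = [[0.2, -0.3], [0.4, 0.1]]    # hidden-unit weights
b_h = [-0.4, 0.2]                   # hidden-unit biases
w_o = [-0.3, -0.2]                  # output-unit weights
b_o = 0.1
lr = 0.9                            # learning rate

# Forward pass: each unit emits sigmoid(weighted input + bias).
h = [sigmoid(sum(w_h[j][i] * x[i] for i in range(2)) + b_h[j]) for j in range(2)]
o = sigmoid(sum(w_o[j] * h[j] for j in range(2)) + b_o)

# Backward pass: error terms first (using the pre-update weights) ...
err_o = o * (1 - o) * (target - o)
err_h = [h[j] * (1 - h[j]) * err_o * w_o[j] for j in range(2)]

# ... then gradient-descent updates of weights and biases.
for j in range(2):
    w_o[j] += lr * err_o * h[j]
    for i in range(2):
        w_h[j][i] += lr * err_h[j] * x[i]
b_o += lr * err_o
b_h = [b_h[j] + lr * err_h[j] for j in range(2)]

# One more forward pass: the output should have moved toward the target.
h2 = [sigmoid(sum(w_h[j][i] * x[i] for i in range(2)) + b_h[j]) for j in range(2)]
o2 = sigmoid(sum(w_o[j] * h2[j] for j in range(2)) + b_o)
```

Iterating this update over all training tuples, for many epochs, is exactly the loop the backpropagation algorithm performs.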
Why Linear Regression fails for categorical data

AGE  OUTCOME    AGE  OUTCOME    AGE  OUTCOME
10   1          20   0          60   0
11   1          21   0          61   0
12   1          22   0          62   0
13   1          23   0          63   0
14   1          24   0          64   0
15   1          25   0          65   0
16   1          26   0          66   0
17   1          27   0          67   0
18   1          28   0          68   0
19   1          29   0          69   0

Fitting a straight line to these 0/1 outcomes gives Y = -0.01404 * X + 0.8176
• The line is unbounded, so its predictions fall outside [0, 1] and cannot be read as class probabilities
The logistic function keeps the prediction between 0 and 1:

p = 1 / (1 + e^-(β0 + β1 ∗ X))
The dataset of pass/fail in an exam for 5 students is given in the table below.
If we use Logistic Regression as the classifier and assume the model
suggested by the optimizer for the odds of passing the course is:
log(Odds) = −64 + 2 × hours
• What is the probability of passing for the student who studied 33 hours?
• At least how many hours must a student study to pass the course with a
probability of more than 95%?
HOURS STUDIED RESULT (1 = PASS, 0 = FAIL)
29 0
15 0
33 1
28 1
39 1
Probability of Pass for the student who studied 33 hours
Z = -64 + 2 * Hours = -64 + 2 * 33 = 2
P = 1 / (1 + e^-Z) = 1 / (1 + e^-2) ≈ 0.88
A student who studies for 33 hours has an 88% chance of passing the
course.
At least how many hours must the student study to pass the course with a
probability of more than 95%?
Set P = 0.95 in P = 1 / (1 + e^-Z):
0.95 (1 + e^-Z) = 1
0.95 e^-Z = 1 − 0.95 = 0.05
e^-Z = 0.0526
-Z = ln(0.0526) = -2.94, so Z = 2.94
Then log(odds) = Z gives 2.94 = -64 + 2 * hours, so hours = 33.47 ≈ 33.5
Check: at 33.5 hours, Z = -64 + 2 * 33.5 = 3 and P = 1 / (1 + e^-3) ≈ 0.952
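The two calculations above can be checked with a short sketch; the coefficients −64 and 2 come from the example's model:

```python
import math

def pass_probability(hours, b0=-64.0, b1=2.0):
    """P(pass) = 1 / (1 + e^-z) with z = log(odds) = b0 + b1 * hours."""
    z = b0 + b1 * hours
    return 1.0 / (1.0 + math.exp(-z))

def hours_for_probability(p, b0=-64.0, b1=2.0):
    """Invert the model: hours = (ln(p / (1 - p)) - b0) / b1."""
    return (math.log(p / (1.0 - p)) - b0) / b1
```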
Ensemble Learning
(AdaBoost Algorithm)
Ensemble Learning
• Ensemble learning combines several base algorithms to form one
optimized predictive algorithm
• Example: Instead of one Decision Tree, Ensemble Methods take
several different trees and aggregate them into one final, strong
predictor
• Types
• Bagging
• Boosting
• Stacking
                       Bagging                   Boosting        Stacking
Weak learners          Homogeneous               Homogeneous     Heterogeneous
Learning               Parallel                  Sequential      Parallel
Combination strategy   Deterministic averaging   Deterministic   Meta-model
Goal                   Decrease variance         Decrease bias   Improve predictions
Boosting
• A boosting algorithm tries to build a strong learner (predictive model) from the mistakes of several
weaker models.
• It starts by creating a model from the training data.
• Then, it creates a second model from the previous one by trying to reduce the errors from the
previous model.
• Models are added sequentially, each correcting its predecessor, until the training data is predicted
perfectly or the maximum number of models has been added.
• Boosting basically tries to reduce the bias error which arises when models are not able to identify
relevant trends in the data.
• This happens by evaluating the difference between the predicted value and the actual value.
• Types
• AdaBoost (Adaptive Boosting)
• Gradient Tree Boosting
• XGBoost
AdaBoost (Adaptive Boosting)
• Initialize weights wi = 1/N for every i
• For t = 1 to T
❖ Generate a training dataset by sampling with the weights {wi}
❖ Fit a weak learner gt
❖ Compute the weighted error et = Σi=1..n (ei * wi) / Σi=1..n (wi)
❖ Set λt = ½ ln[(1 − et) / et]
❖ Update the weights
➢ wi ← wi e^λt if wrongly classified by gt
➢ wi ← wi e^-λt if correctly classified
❖ Normalize the wi to sum to one
• The new model is ft = ft-1 + λt gt
• fT(x) = sign[ Σt=1..T λt gt(x) ]
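A minimal sketch of the loop above, with one simplification: the weak learners are supplied as an assumed pool of candidate classifiers and the lowest-weighted-error one is chosen each round, rather than refitting on a weighted sample:

```python
import math

def adaboost(X, y, stumps, T):
    """Minimal AdaBoost sketch: y in {-1, +1}; stumps is an assumed pool
    of candidate weak classifiers (plain functions x -> {-1, +1})."""
    n = len(X)
    w = [1.0 / n] * n
    model = []                              # list of (lambda_t, g_t)
    for _ in range(T):
        # weighted error e_t of each candidate; keep the best stump
        g, e = min(((g, sum(wi for wi, xi, yi in zip(w, X, y) if g(xi) != yi))
                    for g in stumps), key=lambda pair: pair[1])
        if e == 0:                          # perfect learner: stop early
            model.append((1.0, g))
            break
        if e >= 0.5:                        # no better than chance: stop
            break
        lam = 0.5 * math.log((1 - e) / e)
        model.append((lam, g))
        # up-weight mistakes (e^lam), down-weight correct (e^-lam), normalize
        w = [wi * math.exp(lam if g(xi) != yi else -lam)
             for wi, xi, yi in zip(w, X, y)]
        s = sum(w)
        w = [wi / s for wi in w]
    return model

def predict(model, x):
    """f_T(x) = sign of the lambda-weighted vote of the weak learners."""
    return 1 if sum(lam * g(x) for lam, g in model) >= 0 else -1

# Toy usage on assumed 1-D data with two threshold stumps.
X, y = [1, 2, 3, 4], [1, 1, -1, -1]
stumps = [lambda v: 1 if v < 2.5 else -1,
          lambda v: 1 if v < 1.5 else -1]
model = adaboost(X, y, stumps, T=3)
```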
Example

Dataset:

x1    x2    Decision
2     3     true
2.1   2     true
4.5   6     true
4     3.5   false
3.5   1     false
5     7     true
5     3     false
6     5.5   true
8     6     false
8     2     false

Round 1 — initialize weights wi = 1/N = 0.1, fit a weak learner gt (x-values
rounded below), and record each instance's loss:

x1  x2  actual  weight  prediction  loss  weight*loss  w(i+1)  norm(w(i+1))
2   3   1       0.1     1           0     0            0.065   0.071
2   2   1       0.1     1           0     0            0.065   0.071
4   6   1       0.1     -1          1     0.1          0.153   0.167
4   3   -1      0.1     -1          0     0            0.065   0.071
4   1   -1      0.1     -1          0     0            0.065   0.071
5   7   1       0.1     -1          1     0.1          0.153   0.167
5   3   -1      0.1     -1          0     0            0.065   0.071
6   5   1       0.1     -1          1     0.1          0.153   0.167
8   6   -1      0.1     -1          0     0            0.065   0.071
8   2   -1      0.1     -1          0     0            0.065   0.071

et = 0.3
λt = ½ ln[(1 − et)/et] = ln[(1 − 0.3)/0.3] / 2 = 0.42

Update the weights — wi ← wi e^λt if wrongly classified by gt, wi ← wi e^-λt
if correctly classified — then normalize them to sum to one (the w(i+1) and
norm(w(i+1)) columns above).

A later round, where et = 0.1 and λt = ½ ln[(1 − 0.1)/0.1] = 1.1:

x1  x2  actual  weight  prediction  loss  weight*loss  w(i+1)  norm(w(i+1))
2   2   1       0.122   1           0     0.000        0.041   0.068
4   6   1       0.167   1           0     0.000        0.056   0.093
4   1   -1      0.033   1           1     0.033        0.100   0.167
5   7   1       0.167   1           0     0.000        0.056   0.093
5   3   -1      0.033   1           1     0.033        0.100   0.167
6   5   1       0.167   1           0     0.000        0.056   0.093
8   6   -1      0.122   -1          0     0.000        0.041   0.068
8   2   -1      0.033   -1          0     0.000        0.011   0.019

After four rounds the learner weights are λ1 = 0.42, λ2 = 0.65, λ3 = 0.38,
λ4 = 1.1, and each instance's final label is the sign of the λ-weighted sum
of its four round predictions. For example, the prediction of the 1st
instance (round predictions 1, -1, 1, 1) will be

0.42 x 1 + 0.65 x (-1) + 0.38 x 1 + 1.1 x 1 = 1.25

and we apply the sign function: sign(1.25) = +1, i.e. true, which is
correctly classified.
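The final combination step for the 1st instance, using the example's λ values and round predictions:

```python
# Lambda weights of the four rounds and the 1st instance's round predictions.
alphas = [0.42, 0.65, 0.38, 1.10]
preds = [1, -1, 1, 1]

score = sum(a * p for a, p in zip(alphas, preds))   # 0.42 - 0.65 + 0.38 + 1.10
label = 1 if score >= 0 else -1                     # sign function -> "true"
```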
Classification & Prediction
Accuracy and Error Measures
Classifier Accuracy Measures
The accuracy of a classifier on a given test set is the percentage of test
tuples that are correctly classified by the classifier:

accuracy = (t_pos + t_neg) / (pos + neg)

Class-specific measures:

sensitivity = t_pos / pos    specificity = t_neg / neg    precision = t_pos / (t_pos + f_pos)

Where t_pos is the number of true positives, pos the number of positive
(“cancer”) tuples, t_neg the number of true negatives, neg the number of
negative (“not cancer”) tuples, and f_pos is the number of false positives
(“not cancer” tuples that were incorrectly labelled as “cancer”)
Confusion matrix
Predictor Error Measures
• Loss functions measure the error between the actual value yi and the predicted value yi′: the absolute error |yi − yi′| or the squared error (yi − yi′)²
• The average loss over d test tuples is the mean absolute error Σ|yi − yi′| / d or the mean squared error Σ(yi − yi′)² / d, where the mean squared error exaggerates the presence of outliers
• The relative error scales the error to what it would have been if we had just predicted the mean value ȳ for y from the training data, D: Σ|yi − yi′| / Σ|yi − ȳ|
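These measures transcribe directly into code:

```python
def mae(y, yhat):
    """Mean absolute error: sum of |yi - yi'| over d test tuples, / d."""
    return sum(abs(a - p) for a, p in zip(y, yhat)) / len(y)

def mse(y, yhat):
    """Mean squared error: squaring exaggerates the effect of outliers."""
    return sum((a - p) ** 2 for a, p in zip(y, yhat)) / len(y)

def relative_absolute_error(y, yhat):
    """Total |error| relative to always predicting the mean of y."""
    mean = sum(y) / len(y)
    return (sum(abs(a - p) for a, p in zip(y, yhat))
            / sum(abs(a - mean) for a in y))
```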
Evaluating the Accuracy of a Classifier or
Predictor
• Holdout Method
• Two-thirds of the data are allocated to the training set, and the remaining
one-third is allocated to the test set.
• The training set is used to derive the model, whose accuracy is estimated with
the test set
• Random subsampling
• Holdout method is repeated k times
• The overall accuracy estimate is taken as the average of the accuracies
obtained from each iteration
• For prediction, the average of the predictor error rates is the overall error rate
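A sketch of the two procedures, using an assumed trivial majority-class learner as the model being evaluated:

```python
import random
from collections import Counter

def holdout_accuracy(data, train_fn, seed):
    """One holdout split: two-thirds train, one-third test."""
    rows = data[:]
    random.Random(seed).shuffle(rows)     # random partition of the tuples
    cut = (2 * len(rows)) // 3
    model = train_fn(rows[:cut])          # derive the model from the training set
    test = rows[cut:]
    return sum(1 for x, label in test if model(x) == label) / len(test)

def random_subsampling(data, train_fn, k=10):
    """Repeat the holdout k times and average the k accuracy estimates."""
    return sum(holdout_accuracy(data, train_fn, s) for s in range(k)) / k

def majority_trainer(train):
    """Toy stand-in for a classifier: always predict the majority label."""
    label = Counter(lbl for _, lbl in train).most_common(1)[0][0]
    return lambda x: label
```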
Ensemble Methods - Bagging
Ensemble Methods - Boosting
(Diagram: weighted models W1, W2, …, Wn are combined into a single prediction.)
ROC Curve
#  Actual  Predicted  Prob.Y  Prob.N
1  Y       N          0.35    0.65
2  N       N          0.23    0.77
3  N       Y          0.55    0.45
4  Y       N          0.32    0.68
5  Y       Y          0.54    0.46
6  N       N          0.47    0.53

Sorted by decreasing Prob.Y:

Actual  Prob.Y
N       0.55
Y       0.54
N       0.47
Y       0.35
Y       0.32
N       0.23
TP Rate = TP / (TP + FN)
FP Rate = FP / (FP + TN)
Cut-off 0.5:
TP Rate = 1/(1+2) = 0.33
FP Rate = 1/(1+2) = 0.33
Cut-off 0.4:
TP Rate = 1/(1+2) = 0.33
FP Rate = 2/(2+1) = 0.66
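The two cut-off computations can be reproduced from the example's (actual, Prob.Y) pairs:

```python
def roc_point(scored, cutoff):
    """TPR and FPR at one cutoff; scored holds (actual, Prob.Y) pairs."""
    tp = sum(1 for a, p in scored if a == "Y" and p >= cutoff)
    fn = sum(1 for a, p in scored if a == "Y" and p < cutoff)
    fp = sum(1 for a, p in scored if a == "N" and p >= cutoff)
    tn = sum(1 for a, p in scored if a == "N" and p < cutoff)
    return tp / (tp + fn), fp / (fp + tn)

# The example's six tuples, sorted by decreasing Prob.Y.
scored = [("N", 0.55), ("Y", 0.54), ("N", 0.47),
          ("Y", 0.35), ("Y", 0.32), ("N", 0.23)]
```

Sweeping the cutoff from 1 down to 0 and plotting (FPR, TPR) traces the ROC curve.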
PREDICTION & CLUSTERING
TECHNIQUES
PREDICTION TECHNIQUES
Linear Regression
Multiple Linear Regression
Regression Tree
LINEAR REGRESSION
Prediction
Prediction is the task of predicting continuous values for given input.
For example, we may wish to predict the salary of college graduates with
10 years of work experience.
By far, the most widely used approach for numeric prediction is regression,
a statistical methodology.
Regression analysis can be used to model the relationship between one or
more predictor variables and a response variable (which is continuous-
valued).
Predictor variables are the attributes describing the tuple.
The values of the predictor variables are known.
The response variable is what we want to predict.
Linear Regression
Linear Regression develops a model Y as a linear function of X.
y = w0 + w1 x
where w0 and w1 are the y-intercept and slope of the line, respectively.
These regression coefficients can be solved by the method of least
squares:

w1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²        w0 = ȳ − w1 x̄

where (x1, y1), …, (x|D|, y|D|) are the data points, |D| is the number of data
points, and x̄ and ȳ are the means of x and y.
Using this equation, we can predict that the salary of a college graduate with, say, 10 years of experience is $58,600
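A sketch of the least-squares fit; the salary data from the example is not reproduced here, so the usage below runs on invented points that lie on an exact line:

```python
def fit_line(xs, ys):
    """Least-squares estimates of w0 (intercept) and w1 (slope)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    w0 = ybar - w1 * xbar
    return w0, w1
```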
MULTIPLE LINEAR REGRESSION
Multiple linear regression formula
The formula for a multiple linear regression with two predictor variables is
y = a + b1 x1 + b2 x2
where a = the y-intercept.
a, b1, and b2 have to be chosen so as to minimize the sum of squared errors of prediction; the fitted prediction equation is then ŷ = a + b1 x1 + b2 x2.
Bivariate Linear Regression
EXAMPLE
X1 X2 Y
2 6 7
4 5 7
5 8 9
1 3 5
3 4 6
2 2 4
1 4 5
Y   X1   X2   X1X2   X1Y   X2Y   X1²   X2²
7   2    6    12     14    42    4     36
7   4    5    20     28    35    16    25
9   5    8    40     45    72    25    64
5   1    3    3      5     15    1     9
6   3    4    12     18    24    9     16
4   2    2    4      8     8     4     4
5   1    4    4      5     20    1     16

ΣX1 = 18, ΣX2 = 32, ΣY = 43, ΣX1X2 = 95, ΣX1Y = 123, ΣX2Y = 216, ΣX1² = 60, ΣX2² = 170, N = 7

Corrected sums of squares and products, Suv = Σuv − (Σu)(Σv)/N:

S11 = 60 − 18²/7 = 13.7143      S22 = 170 − 32²/7 = 23.7143      S12 = 95 − (18)(32)/7 = 12.7143
S1y = 123 − (18)(43)/7 = 12.4286      S2y = 216 − (32)(43)/7 = 19.4286

b1 = (S22 S1y − S12 S2y) / (S11 S22 − S12²) = 47.71/163.57 = 0.2917
b2 = (S11 S2y − S12 S1y) / (S11 S22 − S12²) = 108.43/163.57 = 0.6629
a = Ȳ − b1 X̄1 − b2 X̄2 = 6.1429 − (0.2917)(2.5714) − (0.6629)(4.5714) = 2.3624
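The hand computation can be reproduced with the corrected-sums formulas, using the data from the example table:

```python
def fit_two_predictors(x1, x2, y):
    """Normal-equation solution for y = a + b1*x1 + b2*x2."""
    n = len(y)
    def s(u, v):
        # corrected sum of products: sum(uv) - sum(u)*sum(v)/N
        return sum(p * q for p, q in zip(u, v)) - sum(u) * sum(v) / n
    s11, s22, s12 = s(x1, x1), s(x2, x2), s(x1, x2)
    s1y, s2y = s(x1, y), s(x2, y)
    den = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / den
    b2 = (s11 * s2y - s12 * s1y) / den
    a = sum(y) / n - b1 * sum(x1) / n - b2 * sum(x2) / n
    return a, b1, b2

# The example's seven tuples.
x1 = [2, 4, 5, 1, 3, 2, 1]
x2 = [6, 5, 8, 3, 4, 2, 4]
y = [7, 7, 9, 5, 6, 4, 5]
a, b1, b2 = fit_two_predictors(x1, x2, y)
```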
Golf players

Golf players = {25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30}

Standard deviation and number of instances for each attribute value:

Outlook     Std. dev.  Instances
Overcast    3.49       4
Rain        10.87      5
Sunny       7.78       5

Temperature  Std. dev.  Instances
Hot          8.95       4
Cool         10.51      4
Mild         7.65       6
(Golf players for mild temperature = {45, 35, 46, 48, 52, 30})

Humidity  Std. dev.  Instances
High      9.36       7
Normal    8.73       7
(Golf players for high humidity = {25, 30, 46, 45, 35, 52, 30};
 for normal humidity = {52, 23, 43, 38, 46, 48, 44})

Wind    Std. dev.  Instances
Strong  10.59      6
Weak    7.87       8
(Golf players for strong wind = {30, 23, 43, 48, 52, 30};
 for weak wind = {25, 46, 45, 52, 35, 38, 46, 44})

Standard deviation reduction per feature:

Outlook      1.66
Temperature  0.47
Humidity     0.27
Wind         0.29

Outlook gives the largest standard deviation reduction, so it is chosen as the root split.
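A sketch of the standard-deviation-reduction computation for the outlook attribute; the sunny partition comes from the example, and the rain partition is inferred as the five remaining values:

```python
import math

def sd(values):
    """Population standard deviation."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def sd_reduction(groups):
    """SDR = sd(whole set) - weighted sd over an attribute's partitions."""
    allv = [v for g in groups for v in g]
    n = len(allv)
    return sd(allv) - sum(len(g) / n * sd(g) for g in groups)

# Outlook partition of the golf-players target values.
sunny = [25, 30, 35, 38, 48]
overcast = [46, 43, 52, 44]
rain = [45, 52, 23, 46, 30]
```

Running `sd_reduction([sunny, overcast, rain])` reproduces the 1.66 reduction in the table above.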
Standard deviation for sunny outlook

Golf players for sunny outlook = {25, 30, 35, 38, 48}; standard deviation = 7.78

Temperature  Std. dev.  Instances
Hot          2.5        2
Cool         0          1
Mild         6.5        2

Weighted standard deviation for sunny outlook and temperature = (2/5) x 2.5 + (1/5) x 0 + (2/5) x 6.5 = 3.6
Standard deviation reduction for sunny outlook and temperature = 7.78 − 3.6 = 4.18

Humidity  Std. dev.  Instances
High      4.08       3
Normal    5.00       2

Weighted standard deviation for sunny outlook and humidity = (3/5) x 4.08 + (2/5) x 5 = 4.45
Standard deviation reduction for sunny outlook and humidity = 7.78 − 4.45 = 3.33
Sunny outlook and Wind

Standard deviation for sunny outlook and strong wind = 9 (2 instances, e.g. Day 2: Sunny, Hot, High, Strong → 30 players)
Standard deviation for sunny outlook and weak wind = 5.56 (3 instances, e.g. Day 1: Sunny, Hot, High, Weak → 25; Day 8: Sunny, Mild, High, Weak → 35)
Weighted standard deviation for sunny outlook and wind = (2/5) x 9 + (3/5) x 5.56 = 6.94
Standard deviation reduction for sunny outlook and wind = 7.78 − 6.94 = 0.85

Standard deviation reductions within the sunny branch:

Temperature  4.18
Humidity     3.33
Wind         0.85

Temperature gives the largest reduction, so the sunny branch is split on temperature.
Pruning
The cool branch has one instance in its sub data set. We can say that if the outlook is sunny and the temperature is cool, then there would be 38 golf
players.
But what about the hot branch? There are still 2 instances.
Should we add another branch for weak wind and strong wind? No, we should not,
because this would cause over-fitting.
We should terminate building branches, for example, when there are fewer than five instances in the sub data set,
or when the standard deviation of the sub data set is less than 5% of that of the entire data set.
Here, we terminate the branch if there are fewer than 5 instances in the current sub data set.
If this termination condition is satisfied, we calculate the average of the sub data set.
This operation is called pruning in decision trees.
Overcast outlook
The overcast outlook branch already has 4 instances in its sub data set.
We can terminate building branches for this leaf.
Final decision will be average of the following table for overcast
outlook.
If outlook is overcast, then there would be (46+43+52+44)/4 = 46.25
golf players
Rainy outlook and Wind

Wind    Std. dev.  Instances
Weak    3.09       3
Strong  3.5        2

Weighted standard deviation for rainy outlook and wind = (3/5) x 3.09 + (2/5) x 3.5 = 3.25
Standard deviation reduction for rainy outlook and wind = 10.87 − 3.25 = 7.62

Feature      Standard deviation reduction
Temperature  0.67
Humidity     0.37
Wind         7.62

Wind gives the largest reduction, so the rainy branch is split on wind.
• Decision trees are a powerful way to solve classification problems
• They can be adapted to regression problems
• Regression trees tend to over-fit much more than classification trees
• The termination rule should be tuned carefully to avoid over-fitting