Data Classification and Prediction : Lecture-11
Dr.J. Dhar 1
Classification Algorithms
Classification by Decision Tree Induction:
ID3
C4.5
CART
• Bayesian Classifier
• k-Nearest-Neighbor Classifier
• Classification and Prediction Accuracy Measures
A Defect of Information Gain
• It favors attributes with many values.
• Such an attribute splits the N training tuples into many subsets, and if these are small, they will tend to be pure anyway.
• One way to rectify this is through a corrected measure, the information gain ratio.
• C4.5, a successor of ID3, uses an extension to information gain known as gain ratio, which attempts to overcome this bias.
• It applies a kind of normalization to information gain using a “split information” value defined analogously with Info(D).
C4.5 Algorithm: Information Gain Ratio
• SplitInfo_A(D) is the amount of information needed to determine the value of attribute A for a tuple in D:

SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) · log2(|D_j| / |D|)

• The gain ratio is then GainRatio(A) = Gain(A) / SplitInfo_A(D), and the attribute with the maximum gain ratio is selected as the splitting attribute.
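As a minimal sketch, the gain-ratio computation can be written directly from these definitions. The helper names and the count-list interface are illustrative, not part of the slides:

```python
from math import log2

def info(counts):
    """Expected information (entropy) of a class-count distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain_ratio(class_counts, partitions):
    """class_counts: class distribution of D.
    partitions: one class distribution per value of attribute A."""
    n = sum(class_counts)
    # Gain(A) = Info(D) - Info_A(D)
    gain = info(class_counts) - sum(sum(p) / n * info(p) for p in partitions)
    # SplitInfo_A(D): entropy of the partition sizes themselves
    split_info = info([sum(p) for p in partitions])
    return gain / split_info if split_info > 0 else 0.0
```

For instance, a 9/5 class split partitioned into subsets with class distributions (2, 3), (4, 0), and (3, 2) gives a gain of about 0.247 but a gain ratio of only about 0.156, since the three-way split itself carries information.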
Information Gain Ratio

[Worked example (figure): the data set is split on Color (red, green, yellow) and the information gain ratio of the split is computed.]
Comparison:
Information Gain and Information Gain Ratio
CART Algorithm: Gini Index
• Another sensible measure of impurity (i and j are classes):

Gini(D) = Σ_{i≠j} p_i · p_j = 1 − Σ_i p_i²

where p_i is the probability that a tuple in D belongs to class i.
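A minimal sketch of the Gini computation under the definition above (count lists stand in for class distributions; the GiniGain helper uses the same subset weighting as information gain):

```python
def gini(counts):
    """Gini(D) = 1 - sum_i p_i^2, equivalently sum over i != j of p_i * p_j."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_gain(class_counts, partitions):
    """Reduction in impurity from splitting D into the given partitions."""
    n = sum(class_counts)
    return gini(class_counts) - sum(sum(p) / n * gini(p) for p in partitions)
```

A pure node has Gini index 0, while a 50/50 two-class node has the maximum two-class value of 0.5.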
Gini Index
Gini Index for Color

[Worked example (figure): the split on Color (red, green, yellow) with the Gini index of each resulting subset.]
GiniGain of Color

GiniGain(A) = Gini(D) − Σ_{j=1}^{v} (|D_j| / |D|) · Gini(D_j)
Comparison of Three Impurity Measures
A Gain(A) GainRatio(A) GiniGain(A)
Color 0.247 0.156 0.058
Outline 0.152 0.152 0.046
Dot 0.048 0.049 0.015
Tree Pruning
When a decision tree is built, many of the branches will reflect
anomalies in the training data due to noise or outliers.
Pruned trees tend to be smaller and less complex and, thus,
easier to comprehend.
They are usually faster and better at correctly classifying
independent test data than unpruned trees.
In the prepruning approach, a tree is “pruned” by halting its
construction early; in the postpruning approach, subtrees are
removed from a fully grown tree.
Decision Tree Pruning
Bayesian Classification
• Bayesian classification is based on Bayes’ theorem.
• Studies comparing classification algorithms have found a
simple Bayesian classifier known as the naïve Bayesian
classifier to be comparable in performance with decision tree
and selected neural network classifiers.
• Bayesian classifiers have also exhibited high accuracy
and speed when applied to large databases.
Naïve Bayesian Classification
1. Let D be a training set of tuples and their associated class labels.
As usual, each tuple is represented by an n-dimensional attribute
vector, X = (x_1, x_2, …, x_n), depicting n measurements made on
the tuple from n attributes, A_1, A_2, …, A_n, respectively.
2. Suppose that there are m classes, C_1, C_2, …, C_m. Given a tuple
X, the classifier will predict that X belongs to the class having the
highest posterior probability, conditioned on X. That is, the naïve
Bayesian classifier predicts that tuple X belongs to the class C_i if
and only if

P(C_i|X) > P(C_j|X)  for 1 ≤ j ≤ m, j ≠ i.
3. Thus we maximize P(C_i|X). The class C_i for which P(C_i|X) is
maximized is called the maximum posteriori hypothesis. By
Bayes’ theorem:

P(C_i|X) = P(X|C_i) · P(C_i) / P(X)

Since P(X) is constant for all classes, only P(X|C_i)·P(C_i) needs to be maximized.
• In order to reduce computation in evaluating P(X|C_i), the
naive assumption of class-conditional independence is made.
We can easily estimate the probabilities P(x_1|C_i), P(x_2|C_i), …,
P(x_n|C_i) from the training tuples. Thus

P(X|C_i) = Π_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) · P(x_2|C_i) · … · P(x_n|C_i)
Example: Bank Loan
The data tuples are described by the attributes Age, Income, Occupation, and C_R. The class label attribute, Bank loan, has two distinct values (namely, Safe and Risky). Let C_1 correspond to the class Bank loan = Safe and C_2 correspond to Bank loan = Risky.

Sl No  Age     Income   Occupation  C_R   Loan
1      Young   Low      Service     Good  Safe
2      Young   Average  Entp        Fair  Risky
3      Middle  High     Farmer      Fair  Safe
4      Senior  Average  Entp        Good  Risky
5      Middle  High     Service     Fair  Safe
6      Senior  High     Entp        Good  Safe
7      Middle  High     Farmer      Good  Safe
8      Senior  Average  Entp        Fair  Risky
9      Middle  Average  Entp        Good  Risky
10     Young   Average  Service     Fair  Safe

The tuple we wish to classify is
X = (Age = Middle, Income = Average, Occupation = Entp, C_R = Fair)
Solution
We need to maximize P(X|C_i) · P(C_i) for i = 1, 2.
Here P(C_1) = 6/10 and P(C_2) = 4/10.
X = (Age = Middle, Income = Average, Occupation = Entp, C_R = Fair)
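The full computation for this example can be sketched as follows. The helper `naive_bayes_score` is a hypothetical name, but it estimates each P(x_k|C_i) by relative frequency exactly as described above:

```python
from collections import Counter

# The ten training tuples from the bank-loan table:
# (Age, Income, Occupation, C_R) -> Loan
data = [
    (("Young", "Low", "Service", "Good"), "Safe"),
    (("Young", "Average", "Entp", "Fair"), "Risky"),
    (("Middle", "High", "Farmer", "Fair"), "Safe"),
    (("Senior", "Average", "Entp", "Good"), "Risky"),
    (("Middle", "High", "Service", "Fair"), "Safe"),
    (("Senior", "High", "Entp", "Good"), "Safe"),
    (("Middle", "High", "Farmer", "Good"), "Safe"),
    (("Senior", "Average", "Entp", "Fair"), "Risky"),
    (("Middle", "Average", "Entp", "Good"), "Risky"),
    (("Young", "Average", "Service", "Fair"), "Safe"),
]

def naive_bayes_score(x, data):
    """Return P(X|C_i)*P(C_i) for each class, by relative frequencies."""
    classes = Counter(label for _, label in data)
    n = len(data)
    scores = {}
    for c, nc in classes.items():
        score = nc / n  # prior P(C_i)
        for j, value in enumerate(x):
            match = sum(1 for feats, label in data
                        if label == c and feats[j] == value)
            score *= match / nc  # P(x_j | C_i)
        scores[c] = score
    return scores

x = ("Middle", "Average", "Entp", "Fair")
scores = naive_bayes_score(x, data)
# Risky: 0.4 * (1/4) * (4/4) * (4/4) * (2/4) = 0.05
# Safe:  0.6 * (3/6) * (1/6) * (1/6) * (3/6) ≈ 0.0042
# so the classifier predicts Bank loan = Risky for X
```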
Example
Example
• Learning Phase
Outlook Play=Yes Play=No Temperature Play=Yes Play=No
Sunny 2/9 3/5 Hot 2/9 2/5
Overcast 4/9 0/5 Mild 4/9 2/5
Rain 3/9 2/5 Cool 3/9 1/5
Example
• Test Phase
– Given a new instance, predict its label
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables obtained in the learning phase
P(Outlook=Sunny|Play=Yes) = 2/9 P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9 P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9 P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9 P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14 P(Play=No) = 5/14
Naïve Bayes
• Example: Continuous-valued Features
– Temperature is naturally of continuous value.
Yes: 22.8, 25.2, 20.1, 19.3, 18.5, 21.7, 24.3, 23.1, 19.8
No: 15.1, 17.4, 27.3, 30.1, 29.5
– Estimate the mean and standard deviation for each class:

μ = (1/N) Σ_{n=1}^{N} x_n ,   σ² = (1/(N−1)) Σ_{n=1}^{N} (x_n − μ)²

μ_Yes = 21.64 , σ_Yes = 2.35 ;   μ_No = 23.88 , σ_No = 7.09

– Learning Phase: output two Gaussian models for P(temp|C):

P̂(x|Yes) = 1/(2.35·√(2π)) · exp( −(x − 21.64)² / (2 · 2.35²) )
P̂(x|No) = 1/(7.09·√(2π)) · exp( −(x − 23.88)² / (2 · 7.09²) )
Naïve Bayes: Zero conditional probability
• If no example contains the feature value
– In this circumstance, we face a zero conditional probability problem
during test
P̂(x_1|c_i) ⋯ P̂(a_jk|c_i) ⋯ P̂(x_n|c_i) = 0 when x_j = a_jk and P̂(a_jk|c_i) = 0
– For a remedy, class-conditional probabilities are re-estimated with

P̂(a_jk|c_i) = (n_c + m·p) / (n + m)   (m-estimate)

n_c : number of training examples for which x_j = a_jk and c = c_i
n : number of training examples for which c = c_i
p : prior estimate (usually, p = 1/t for t possible values of x_j)
m : weight given to the prior (number of “virtual” examples, m ≥ 1)
Zero conditional probability
• Example: P(outlook=overcast|no)=0 in the play-tennis dataset
– Adding m “virtual” examples (m: up to 1% of #training examples)
• In this dataset, # of training examples for the “no” class is 5.
• Non-probabilistic Classification Algorithm
k-Nearest-Neighbor Classifiers
• Nearest-neighbor classifiers are based on learning by analogy, that
is, by comparing a given test tuple with training tuples that are similar
to it. The training tuples are described by n attributes. Each tuple
represents a point in an n-dimensional space. In this way, all of the
training tuples are stored in an n-dimensional pattern space.
• When given an unknown tuple, a k-nearest-neighbor classifier
searches the pattern space for the k training tuples that are closest to
the unknown tuple. These k training tuples are the k “nearest
neighbors” of the unknown tuple.
k-NN Classifiers
• “Closeness” is defined in terms of a distance metric, such
as Euclidean distance. The Euclidean distance between
two points or tuples, say, X1 = (x11, x12, …, x1n) and
X2 = (x21, x22, …, x2n), is

dist(X1, X2) = √( Σ_{i=1}^{n} (x1i − x2i)² )

• Min-max normalization is performed to bring all attributes
into the same range, so that attributes with initially large
ranges do not outweigh attributes with smaller ranges.
• “But how can distance be computed for attributes that are
not numeric, but categorical, such as color?” A simple
method compares the corresponding values: if they are
identical, the difference is 0; otherwise, it is 1.
• “What about missing values?” In general, if the value
of a given attribute A is missing in tuple X1 and/or in
tuple X2, we assume the maximum possible difference.
• “How can I determine a good value for k, the number
of neighbors?” This can be determined experimentally.
Starting with k = 1, we use a test set to estimate the
error rate of the classifier. This process can be repeated
each time by incrementing k to allow for one more
neighbor.
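The whole procedure for numeric attributes (min-max normalization, Euclidean distance, majority vote over the k nearest tuples) can be sketched as follows; the helper names are illustrative:

```python
from collections import Counter
from math import sqrt

def minmax_normalize(data):
    """Scale each attribute to [0, 1] so no attribute dominates the distance."""
    cols = list(zip(*data))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, mins, maxs))
        for row in data
    ]

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, labels, query, k):
    """Majority vote among the k training tuples closest to the query."""
    ranked = sorted(range(len(train)), key=lambda i: euclidean(train[i], query))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]
```

As noted above, k is best chosen experimentally: evaluate the error rate on a test set for k = 1, 2, … and keep the value that performs best.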
Comparing Classification and Prediction Methods
• Accuracy: The accuracy of a classifier refers to the ability to
correctly predict the class label of previously unseen data.
• Scalability: This refers to the ability to construct the classifier
or predictor efficiently given large amounts of data.
• Speed: This refers to the time required in generating and
using the given classifier or predictor.
• Robustness: This is the ability of the classifier to make correct
predictions given noisy data or data with missing values.
Classifier Accuracy Measures
Evaluating the Accuracy of a Classifier or Predictor
• Holdout Method and Random Subsampling
• k-fold Cross-Validation
• Bootstrap: The bootstrap method samples the given training
tuples uniformly with replacement. That is, each time a tuple is
selected, it is equally likely to be selected again and re-added
to the training set.
• Ensemble Methods (increasing the accuracy): Bagging, Boosting
Few more algorithms
• We will discuss other machine-learning (i.e., soft-computing)
based classification algorithms after the introduction of soft
computing (in Lectures 15 and 16).
Thank You