
Lecture-11

Data Classification and Prediction


(Part-2)

Dr. J. Dhar
Classification Algorithms
• Classification by Decision Tree Induction:
  – ID3
  – C4.5
  – CART
• Bayesian Classifier
• k-Nearest-Neighbor Classifier
• Classification and Prediction Accuracy Measures
A Defect of Information Gain
• It favors attributes with many values.
• Such an attribute splits the data into many subsets, and if these are small, they will tend to be pure anyway.
• One way to rectify this is through a corrected measure, the information gain ratio.
• C4.5, a successor of ID3, uses an extension to information gain known as gain ratio, which attempts to overcome this bias.
• It applies a kind of normalization to information gain using a "split information" value defined analogously to Info(D).
C4.5 Algorithm: Information Gain Ratio
• SplitInfo_A(D) is the amount of information needed to determine the value of attribute A:

  SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)

• Information gain ratio:

  GainRatio(A) = Gain(A) / SplitInfo_A(D)
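To make these definitions concrete, here is a minimal Python sketch (not part of the original slides; the tiny toy data set and all function names are invented for illustration) that computes Info(D), Gain(A), SplitInfo_A(D), and GainRatio(A) for a categorical attribute.

from collections import Counter
from math import log2

def info(labels):
    """Info(D): entropy of the class label distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def partitions(rows, labels, attr):
    """Group the class labels by the value taken by one attribute."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    return groups

def gain(rows, labels, attr):
    """Gain(A) = Info(D) - sum_j (|D_j|/|D|) * Info(D_j)."""
    n = len(labels)
    info_a = sum(len(part) / n * info(part)
                 for part in partitions(rows, labels, attr).values())
    return info(labels) - info_a

def split_info(rows, labels, attr):
    """SplitInfo_A(D) = -sum_j (|D_j|/|D|) * log2(|D_j|/|D|)."""
    n = len(labels)
    return -sum(len(part) / n * log2(len(part) / n)
                for part in partitions(rows, labels, attr).values())

def gain_ratio(rows, labels, attr):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    return gain(rows, labels, attr) / split_info(rows, labels, attr)

# Hypothetical toy data: each row records a Color value; labels are the classes.
toy_rows = [{"Color": "red"}, {"Color": "red"}, {"Color": "green"},
            {"Color": "yellow"}, {"Color": "yellow"}, {"Color": "green"}]
toy_labels = ["+", "+", "-", "+", "-", "-"]
print(gain(toy_rows, toy_labels, "Color"))        # information gain of Color
print(gain_ratio(toy_rows, toy_labels, "Color"))  # gain ratio of Color

Because SplitInfo grows with the number of branches, an attribute that merely fragments the data receives a smaller gain ratio than its raw information gain would suggest.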
Information Gain Ratio
[Figure: worked example of the information gain ratio for the attribute Color, whose split ("Color?") sends the training examples down red, green, and yellow branches.]
Comparison:
Information Gain and Information Gain Ratio

A         |v(A)|   Gain(A)   GainRatio(A)
Color     3        0.247     0.156
Outline   2        0.152     0.152
Dot       2        0.048     0.049
CART Algorithm: Gini Index
• Another sensible measure of impurity (i and j are classes):

  Gini(D) = Σ_{i ≠ j} p_i p_j

• After applying attribute A, the resulting Gini value is

  Gini_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Gini(D_j)

• The Gini index can be interpreted as an expected error rate. The gain from splitting on A (GiniGain) is

  GiniGain(A) = Gini(D) − Gini_A(D)
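Again as a minimal sketch (function names mine, data hypothetical), the CART quantities can be computed in the same way; the multiway split below mirrors the Color/Outline/Dot comparison that follows rather than CART's usual binary splits.

from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_i p_i^2 (equivalently the sum over i != j of p_i * p_j)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_after_split(rows, labels, attr):
    """Gini_A(D): weighted Gini index of the subsets produced by attribute A."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    return sum(len(part) / n * gini(part) for part in groups.values())

def gini_gain(rows, labels, attr):
    """GiniGain(A) = Gini(D) - Gini_A(D): expected reduction in error rate."""
    return gini(labels) - gini_after_split(rows, labels, attr)

# Same kind of hypothetical toy data as in the previous sketch.
toy_rows = [{"Color": "red"}, {"Color": "red"}, {"Color": "green"},
            {"Color": "yellow"}, {"Color": "yellow"}, {"Color": "green"}]
toy_labels = ["+", "+", "-", "+", "-", "-"]
print(gini(toy_labels))                        # 0.5 for a balanced two-class set
print(gini_gain(toy_rows, toy_labels, "Color"))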
Gini Index
[Figure: worked example computing the Gini index of the training set.]
Gini Index for Color
[Figure: the "Color?" split into red, green, and yellow branches, with the Gini index computed for each resulting subset.]
GiniGain of Color
[The GiniGain computation for the Color attribute; the resulting value, 0.058, appears in the comparison table on the next slide.]
Comparison of Three Impurity Measures

A         Gain(A)   GainRatio(A)   GiniGain(A)
Color     0.247     0.156          0.058
Outline   0.152     0.152          0.046
Dot       0.048     0.049          0.015

• These impurity measures assess the effect of a single attribute.
• The "most informative" criterion they define is local.
• It does not reliably predict the effect of several attributes applied jointly.
Alternative calculation of Gini Index

  Gini(D) = 1 − Σ_{i=1}^{m} p_i^2,  where p_i is the relative frequency of class i in D
Tree Pruning
• When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers.
• Pruned trees tend to be smaller and less complex and, thus, easier to comprehend.
• They are usually faster and better at correctly classifying independent test data than unpruned trees.
• In the prepruning approach, a tree is "pruned" by halting its construction early.
Decision Tree Pruning
[Figure: illustration of decision tree pruning.]
Bayesian Classification
• Bayesian classification is based on Bayes' theorem.
• Studies comparing classification algorithms have found a simple Bayesian classifier, known as the naïve Bayesian classifier, to be comparable in performance with decision tree and selected neural network classifiers.
• Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.
Naïve Bayesian Classification
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x_1, x_2, …, x_n), depicting n measurements made on the tuple from n attributes A_1, A_2, …, A_n, respectively.
2. Suppose that there are m classes, C_1, C_2, …, C_m. Given a tuple X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class C_i if and only if

   P(C_i | X) > P(C_j | X)  for 1 ≤ j ≤ m, j ≠ i.
3. Thus we maximize P(C_i | X). The class C_i for which P(C_i | X) is maximized is called the maximum posteriori hypothesis. By Bayes' theorem:

   P(C_i | X) = P(X | C_i) P(C_i) / P(X)

4. As P(X) is constant for all classes, only P(X | C_i) P(C_i) need be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C_1) = P(C_2) = … = P(C_m), and we would therefore maximize P(X | C_i). Otherwise, we maximize P(X | C_i) P(C_i).
• In order to reduce the computation in evaluating P(X | C_i), the naive assumption of class-conditional independence is made. We can easily estimate the probabilities P(x_1 | C_i), P(x_2 | C_i), …, P(x_n | C_i) from the training tuples. Thus

  P(X | C_i) = Π_{k=1}^{n} P(x_k | C_i) = P(x_1 | C_i) × P(x_2 | C_i) × … × P(x_n | C_i)
Example: Bank Loan
The data tuples are described by the attributes Age, Income, Occupation, and C_R. The class label attribute, Bank loan, has two distinct values (namely, Safe and Risky). Let C1 correspond to the class Bank loan = Safe and C2 correspond to Bank loan = Risky.

The tuple we wish to classify is
X = (Age = Middle, Income = Average, Occupation = Entp, C_R = Fair)

Sl No   Age      Income    Occupation   C_R    Loan
1       Young    Low       Service      Good   Safe
2       Young    Average   Entp         Fair   Risky
3       Middle   High      Farmer       Fair   Safe
4       Senior   Average   Entp         Good   Risky
5       Middle   High      Service      Fair   Safe
6       Senior   High      Entp         Good   Safe
7       Middle   High      Farmer       Good   Safe
8       Senior   Average   Entp         Fair   Risky
9       Middle   Average   Entp         Good   Risky
10      Young    Average   Service      Fair   Safe
Solution
By Bayes' theorem, P(C_i | X) = P(X | C_i) P(C_i) / P(X), i = 1, 2.

Here P(C1) = 6/10 and P(C2) = 4/10.

X = (Age = Middle, Income = Average, Occupation = Entp, C_R = Fair)

P(X | C1) = P(Age = Middle | C1) × P(Income = Average | C1) × P(Occupation = Entp | C1) × P(C_R = Fair | C1)
          = (3/6) × (1/6) × (1/6) × (3/6) = 1/144

P(X | C2) = P(Age = Middle | C2) × P(Income = Average | C2) × P(Occupation = Entp | C2) × P(C_R = Fair | C2)
          = (1/4) × (4/4) × (4/4) × (2/4) = 1/8

This implies P(X | C1) P(C1) = (1/144) × (6/10) ≈ 0.0042 and P(X | C2) P(C2) = (1/8) × (4/10) = 0.05.

Hence P(X | C2) P(C2) > P(X | C1) P(C1); therefore X belongs to class C2 (i.e., the loan is Risky).
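The arithmetic above can be checked with a few lines of Python. The ten training tuples are copied from the slide; the variable names and the data layout are mine.

from collections import Counter

# (Age, Income, Occupation, C_R) -> Loan, copied from the Bank Loan table.
data = [
    (("Young",  "Low",     "Service", "Good"), "Safe"),
    (("Young",  "Average", "Entp",    "Fair"), "Risky"),
    (("Middle", "High",    "Farmer",  "Fair"), "Safe"),
    (("Senior", "Average", "Entp",    "Good"), "Risky"),
    (("Middle", "High",    "Service", "Fair"), "Safe"),
    (("Senior", "High",    "Entp",    "Good"), "Safe"),
    (("Middle", "High",    "Farmer",  "Good"), "Safe"),
    (("Senior", "Average", "Entp",    "Fair"), "Risky"),
    (("Middle", "Average", "Entp",    "Good"), "Risky"),
    (("Young",  "Average", "Service", "Fair"), "Safe"),
]
x = ("Middle", "Average", "Entp", "Fair")        # the tuple to classify

class_counts = Counter(label for _, label in data)
scores = {}
for c, n_c in class_counts.items():
    prior = n_c / len(data)                      # P(C_i)
    likelihood = 1.0                             # P(X | C_i) under independence
    for j, value in enumerate(x):
        matches = sum(1 for feats, label in data if label == c and feats[j] == value)
        likelihood *= matches / n_c              # P(x_j | C_i)
    scores[c] = prior * likelihood               # P(X | C_i) P(C_i)

print(scores)                        # Safe: 1/240 ~ 0.0042, Risky: 1/20 = 0.05
print(max(scores, key=scores.get))   # -> 'Risky'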
Example
[The play-tennis (weather) data set used in the following example: 14 tuples with the attributes Outlook, Temperature, Humidity, and Wind, and the class label Play.]
Example
• Learning Phase

Outlook    Play=Yes   Play=No
Sunny      2/9        3/5
Overcast   4/9        0/5
Rain       3/9        2/5

Temperature   Play=Yes   Play=No
Hot           2/9        2/5
Mild          4/9        2/5
Cool          3/9        1/5

Humidity   Play=Yes   Play=No
High       3/9        4/5
Normal     6/9        1/5

Wind     Play=Yes   Play=No
Strong   3/9        3/5
Weak     6/9        2/5

P(Play=Yes) = 9/14      P(Play=No) = 5/14
Example
• Test Phase
– Given a new instance, predict its label:
  x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables obtained in the learning phase:
  P(Outlook=Sunny|Play=Yes) = 2/9          P(Outlook=Sunny|Play=No) = 3/5
  P(Temperature=Cool|Play=Yes) = 3/9       P(Temperature=Cool|Play=No) = 1/5
  P(Humidity=High|Play=Yes) = 3/9          P(Humidity=High|Play=No) = 4/5
  P(Wind=Strong|Play=Yes) = 3/9            P(Wind=Strong|Play=No) = 3/5
  P(Play=Yes) = 9/14                       P(Play=No) = 5/14
– Decision making with the MAP rule:
  P(Yes|x') ≈ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
  P(No|x')  ≈ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206

Given the fact that P(Yes|x') < P(No|x'), we label x' to be "No".
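A quick check of the numbers above (the probabilities are read off the learning-phase tables; the variable names are mine):

p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # P(Yes|x') up to P(x'): ~0.0053
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # P(No|x')  up to P(x'): ~0.0206
print(p_yes, p_no, "No" if p_no > p_yes else "Yes")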
Naïve Bayes Algorithm: Continuous-Valued Features
– A continuous-valued feature can take numberless (uncountably many) values.
– Its conditional probability is often modeled with the normal (Gaussian) distribution:

  P̂(x_j | c_i) = 1 / (√(2π) σ_ji) × exp( −(x_j − μ_ji)^2 / (2 σ_ji^2) )

  μ_ji : mean (average) of the feature values x_j of the examples for which c = c_i
  σ_ji : standard deviation of the feature values x_j of the examples for which c = c_i

– Learning Phase: for X = (X_1, …, X_F), C = c_1, …, c_L
  Output: normal distributions and P(C = c_i), i = 1, …, L
– Test Phase: given an unknown instance X' = (a_1, …, a_F)
  • Instead of looking up tables, calculate the conditional probabilities with the normal distributions obtained in the learning phase.
  • Apply the MAP rule to assign a label (the same as done for the discrete case).
Naïve Bayes
• Example: Continuous-Valued Features
– Temperature is naturally of continuous value.
  Yes: 22.8, 25.2, 20.1, 19.3, 18.5, 21.7, 24.3, 23.1, 19.8
  No: 15.1, 17.4, 27.3, 30.1, 29.5
– Estimate the mean and standard deviation for each class:

  μ = (1/N) Σ_{n=1}^{N} x_n,   σ^2 = (1/(N−1)) Σ_{n=1}^{N} (x_n − μ)^2

  μ_Yes = 21.64, σ_Yes = 2.35        μ_No = 23.88, σ_No = 7.09

– Learning Phase: output two Gaussian models for P(temp | C):

  P̂(x | Yes) = 1 / (2.35 √(2π)) × exp( −(x − 21.64)^2 / (2 × 2.35^2) )
  P̂(x | No)  = 1 / (7.09 √(2π)) × exp( −(x − 23.88)^2 / (2 × 7.09^2) )
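A minimal sketch of this learning phase for the Temperature feature (all names mine). It uses the sample standard deviation, which is what reproduces the slide's 21.64/2.35 and 23.88/7.09 figures.

from math import exp, pi, sqrt

def fit_gaussian(values):
    """Return (mean, sample standard deviation) of a list of feature values."""
    n = len(values)
    mu = sum(values) / n
    sigma = sqrt(sum((v - mu) ** 2 for v in values) / (n - 1))
    return mu, sigma

def gaussian_pdf(x, mu, sigma):
    """Normal density used to model P(x | c) for a continuous feature."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

temp_yes = [22.8, 25.2, 20.1, 19.3, 18.5, 21.7, 24.3, 23.1, 19.8]
temp_no  = [15.1, 17.4, 27.3, 30.1, 29.5]

mu_yes, sd_yes = fit_gaussian(temp_yes)   # ~ (21.64, 2.35)
mu_no,  sd_no  = fit_gaussian(temp_no)    # ~ (23.88, 7.09)

# Test phase for some temperature, e.g. 24 degrees: under the MAP rule each
# density would still be multiplied by its class prior before comparing.
print(gaussian_pdf(24.0, mu_yes, sd_yes), gaussian_pdf(24.0, mu_no, sd_no))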
Naïve Bayes: Zero Conditional Probability
• If no training example contains a given feature value:
– we face a zero conditional probability problem at test time, because a single zero factor wipes out the whole product:

  P̂(x_1 | c_i) × … × P̂(a_jk | c_i) × … × P̂(x_n | c_i) = 0  whenever x_j = a_jk and P̂(a_jk | c_i) = 0

– As a remedy, the class-conditional probabilities are re-estimated with the m-estimate:

  P̂(a_jk | c_i) = (n_c + m·p) / (n + m)

  n_c : number of training examples for which x_j = a_jk and c = c_i
  n   : number of training examples for which c = c_i
  p   : prior estimate (usually p = 1/t for t possible values of x_j)
  m   : weight given to the prior (the number of "virtual" examples, m ≥ 1)
Zero Conditional Probability
• Example: P(Outlook=Overcast | No) = 0 in the play-tennis dataset
– Add m "virtual" examples (m: up to 1% of the number of training examples).
  • In this dataset, the number of training examples for the "No" class is 5.
  • We can therefore add only m = 1 "virtual" example in our m-estimate remedy.
– The Outlook feature can take only 3 values, so p = 1/3.
– Re-estimate P(Outlook=Overcast | No) with the m-estimate:

  P(Outlook=Overcast | No) = (0 + 1 × 1/3) / (5 + 1) = 1/18
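The m-estimate itself is a one-liner; this sketch (names mine) reproduces the re-estimate above and, for comparison, the smoothed values for the other two Outlook outcomes of the "No" class.

from fractions import Fraction

def m_estimate(n_c, n, p, m):
    """Smoothed estimate (n_c + m*p) / (n + m) for P(a_jk | c_i)."""
    return (n_c + m * p) / (n + m)

p = Fraction(1, 3)                  # Outlook takes 3 values, so p = 1/3
print(m_estimate(0, 5, p, 1))       # Overcast | No : (0 + 1/3) / 6 = 1/18
print(m_estimate(3, 5, p, 1))       # Sunny | No    : (3 + 1/3) / 6 = 5/9
print(m_estimate(2, 5, p, 1))       # Rain | No     : (2 + 1/3) / 6 = 7/18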
• Non-probabilistic Classification Algorithm

k-Nearest-Neighbor Classifiers
• Nearest-neighbor classifiers are based on learning by analogy, that
is, by comparing a given test tuple with training tuples that are similar
to it. The training tuples are described by n attributes. Each tuple
represents a point in an n-dimensional space. In this way, all of the
training tuples are stored in an n-dimensional pattern space.
• When given an unknown tuple, a k-nearest-neighbor classifier
searches the pattern space for the k training tuples that are closest to
the unknown tuple. These k training tuples are the k “nearest
neighbors” of the unknown tuple.

k-NN Classifiers
• "Closeness" is defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance between two points or tuples, say X1 = (x11, x12, …, x1n) and X2 = (x21, x22, …, x2n), is

  dist(X1, X2) = √( Σ_{i=1}^{n} (x1i − x2i)^2 )

• Min-max normalization is performed to bring all attributes into the same range, so that attributes with initially large ranges do not outweigh attributes with smaller ranges.
• "But how can distance be computed for attributes that are not numeric, but categorical, such as color?"
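A minimal sketch of these two ingredients, the Euclidean distance and min-max normalization to [0, 1] (function names mine):

from math import sqrt

def euclidean(x1, x2):
    """dist(X1, X2) = sqrt(sum over i of (x1_i - x2_i)^2)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def min_max_normalize(rows):
    """Rescale every numeric attribute to [0, 1], column by column."""
    cols = list(zip(*rows))
    lows = [min(col) for col in cols]
    highs = [max(col) for col in cols]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)]
            for row in rows]

# Without normalization, an attribute with a large range (e.g. income) would
# swamp an attribute with a small range (e.g. age) in the distance computation.
print(min_max_normalize([[25, 40000], [35, 60000], [45, 80000]]))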
• "What about missing values?" In general, if the value of a given attribute A is missing in tuple X1 and/or in tuple X2, we assume the maximum possible difference.
• "How can I determine a good value for k, the number of neighbors?" This can be determined experimentally. Starting with k = 1, we use a test set to estimate the error rate of the classifier. This process can be repeated, each time incrementing k to allow for one more neighbor; the k value giving the minimum error rate may then be selected.
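Continuing the sketch (names mine), classification is a majority vote among the k nearest training tuples, and the experimental choice of k described above can be scripted as a loop over candidate values, keeping the k with the lowest error rate on a held-out test set.

from collections import Counter
from math import sqrt

def euclidean(x1, x2):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training tuples closest to the query tuple."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def choose_k(train_x, train_y, test_x, test_y, k_values):
    """Estimate the error rate on a test set for each candidate k; return the best."""
    errors = {}
    for k in k_values:
        wrong = sum(knn_predict(train_x, train_y, x, k) != y
                    for x, y in zip(test_x, test_y))
        errors[k] = wrong / len(test_y)
    return min(errors, key=errors.get), errors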
Comparing Classification and Prediction Methods
• Accuracy: The ability of a classifier to correctly predict the class label of previously unseen data.
• Scalability: The ability to construct the classifier or predictor efficiently given large amounts of data.
• Speed: The computational cost of generating and using the given classifier or predictor.
• Robustness: The ability of the classifier to make correct predictions given noisy data or data with missing values.
Classifier Accuracy Measures

IF-THEN Rules Classification

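The formulas for this slide did not survive extraction. As a hedged reconstruction, the measures usually listed under this heading can be computed from the confusion matrix of a binary classifier as follows (function name mine):

def accuracy_measures(tp, tn, fp, fn):
    """Common accuracy measures derived from the counts of a confusion matrix."""
    total = tp + tn + fp + fn
    return {
        "accuracy":    (tp + tn) / total,      # fraction classified correctly
        "error_rate":  (fp + fn) / total,
        "sensitivity": tp / (tp + fn),         # recall / true positive rate
        "specificity": tn / (tn + fp),         # true negative rate
        "precision":   tp / (tp + fp),
    }

print(accuracy_measures(tp=90, tn=80, fp=20, fn=10))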
Evaluating the Accuracy of a Classifier or Predictor
• Holdout Method and Random Subsampling
• k-fold Cross-Validation
• Bootstrap: the bootstrap method samples the given training tuples uniformly with replacement; that is, each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
• Ensemble Methods (Increasing the Accuracy): Bagging, Boosting
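As a minimal sketch of k-fold cross-validation (all names mine; `train_fn` and `predict_fn` stand in for any classifier's training and prediction routines), the data is split into k folds, each fold serves once as the test set, and the k accuracy estimates are averaged:

def k_fold_cross_validation(xs, ys, k, train_fn, predict_fn):
    """Average accuracy over k folds; train_fn builds a model, predict_fn applies it."""
    n = len(xs)
    fold_size = n // k
    accuracies = []
    for i in range(k):
        lo = i * fold_size
        hi = (i + 1) * fold_size if i < k - 1 else n   # last fold takes the remainder
        test_x, test_y = xs[lo:hi], ys[lo:hi]
        train_x, train_y = xs[:lo] + xs[hi:], ys[:lo] + ys[hi:]
        model = train_fn(train_x, train_y)
        correct = sum(predict_fn(model, x) == y for x, y in zip(test_x, test_y))
        accuracies.append(correct / len(test_y))
    return sum(accuracies) / k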
Few more algorithms
• We will discuss other machine-learning (i.e., soft computing) based classification algorithms after the introduction to soft computing (during classes 15/16).
Thank You
