CIS 517: Data Mining and Warehousing
Week 6 & 7: Classification (Chapter 8)
Instructor:
Email: maalnasser@[Link]
Lecture Outline
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Rule-based classification
Classification accuracy
Summary
Classification vs. Prediction
Classification:
predicts categorical class labels
constructs a model from the training set and the class labels of a classifying attribute, and uses it to classify new data
Prediction:
models continuous-valued functions, i.e., predicts unknown or missing values
Classification: Definition
Given a collection of records (training set)
Each record contains a set of attributes, one of the
attributes is the class.
Find a model for class attribute as a function of the
values of other attributes.
Goal: previously unseen records should be assigned a
class as accurately as possible.
A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to
build the model and test set used to validate it.
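A minimal sketch of this split in Python (scikit-learn assumed; the toy records, labels, and the 2/3 vs. 1/3 ratio are illustrative):

from sklearn.model_selection import train_test_split

records = [[1, 125], [0, 100], [0, 70], [1, 120], [0, 95], [0, 60]]  # toy attribute vectors
labels = ["No", "No", "No", "No", "Yes", "No"]                       # class labels

# 2/3 of the records build the model; the held-out 1/3 estimates its accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    records, labels, test_size=1/3, random_state=0)
print(len(X_train), "training records,", len(X_test), "test records")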
Illustrating Classification Task
A learning algorithm is applied to the training set to learn a model (induction); the model is then applied to the test set (deduction).

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Classification—A Two-Step Process
Model construction: describing a set of predetermined
classes
Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
The set of tuples used for model construction: training set
The model is represented as classification rules, decision trees,
or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of test sample is compared with the
classified result from the model
Accuracy rate is the percentage of test set samples that are
correctly classified by the model
Test set is independent of training set, otherwise over-fitting
will occur
Classification Process (1): Model Construction
A classification algorithm is run on the training data to produce a classifier (model).

Training Data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (Model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Classification Process (2): Use the Model in Prediction
The classifier is first checked on the testing data, then applied to unseen data.

Testing Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  3      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) -> Tenured?
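As an illustrative sketch, the learned rule can be written directly as a function and applied to the unseen tuple; predict_tenured is a hypothetical name, and the rank strings follow the tables above:

# Hypothetical encoding of the learned rule:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict_tenured(rank: str, years: int) -> str:
    return "yes" if rank == "Professor" or years > 6 else "no"

print(predict_tenured("Professor", 4))       # yes -- Jeff, by rank
print(predict_tenured("Assistant Prof", 7))  # yes -- cf. Joseph, by years > 6
print(predict_tenured("Assistant Prof", 2))  # no  -- cf. Tom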
Supervised vs. Unsupervised
Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of training data are unknown
Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
Issues (1): Data Preparation
Data cleaning
Preprocess data in order to reduce noise and handle
missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
Issues (2):
Evaluating Classification Methods
Predictive accuracy
Time to construct the model
Time to use the model
Handling noise and missing values
Decision tree size
Compactness of classification rules
Classification: Measure the quality
Usually the accuracy measure is used:
Accuracy = (number of correctly classified test tuples) / (total number of test tuples)
Classification Techniques
Decision Tree based Methods
Rule-based Methods
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines (SVM)
Classification by Decision Tree Induction
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases
Tree construction
At start, all the training examples are at the root
Partition examples recursively based on selected attributes
Tree pruning
Identify and remove branches that reflect noise or outliers
Use of decision tree: Classifying an unknown sample
Test the attribute values of the sample against the decision tree
Example: Training Dataset
Class-labeled training tuples from the AllElectronics customer database:

RID  age          income  student  credit_rating  buys_computer
1    youth        high    no       fair           no
2    youth        high    no       excellent      no
3    middle_aged  high    no       fair           yes
4    senior       medium  no       fair           yes
5    senior       low     yes      fair           yes
6    senior       low     yes      excellent      no
7    middle_aged  low     yes      excellent      yes
8    youth        medium  no       fair           no
9    youth        low     yes      fair           yes
10   senior       medium  yes      fair           yes
11   youth        medium  yes      excellent      yes
12   middle_aged  medium  no       excellent      yes
13   middle_aged  high    yes      fair           yes
14   senior       medium  no       excellent      no

Output: A Decision Tree for "buys_computer"
age = middle_aged: yes
age = youth, student = no: no
age = youth, student = yes: yes
age = senior, credit_rating = excellent: no
age = senior, credit_rating = fair: yes
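A minimal sketch of inducing such a tree with scikit-learn (assumed available). Note that scikit-learn grows binary splits over ordinally encoded values rather than ID3-style multiway splits, so the resulting tree is analogous to, not identical with, the one above:

from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

rows = [  # (age, income, student, credit_rating, buys_computer)
    ("youth", "high", "no", "fair", "no"), ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"), ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"), ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"), ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"), ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"), ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"), ("senior", "medium", "no", "excellent", "no"),
]
X = [r[:4] for r in rows]               # attribute values
y = [r[4] for r in rows]                # class labels
enc = OrdinalEncoder()                  # map categorical values to integers
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(enc.fit_transform(X), y)
print(export_text(tree, feature_names=["age", "income", "student", "credit_rating"]))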
Algorithm for Decision
Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer
manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are
discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
There are no samples left
Example of a Decision Tree
Training data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (decision tree; Refund is the first splitting attribute):
Refund = Yes: NO
Refund = No, MarSt = Married: NO
Refund = No, MarSt = Single or Divorced, TaxInc < 80K: NO
Refund = No, MarSt = Single or Divorced, TaxInc > 80K: YES
Another Example of a Decision Tree
A different tree over the same training data, splitting on MarSt first:

MarSt = Married: NO
MarSt = Single or Divorced, Refund = Yes: NO
MarSt = Single or Divorced, Refund = No, TaxInc < 80K: NO
MarSt = Single or Divorced, Refund = No, TaxInc > 80K: YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task
The same workflow as before: a tree induction algorithm learns a decision tree from the training set (induction), and the tree is then applied to the test set (deduction). The training and test sets are those of the earlier illustration (Tids 1-10 and 11-15).
Apply Model to Test Data
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree:
Refund = No -> take the "No" branch to MarSt
MarSt = Married -> reach the leaf NO

Assign Cheat = "No".
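A sketch of the same walk-through in code; predict_cheat is a hypothetical encoding of the tree above (the 80K boundary is placed on the YES side, an assumption about the "> 80K" branch):

def predict_cheat(refund: str, marital_status: str, taxable_income: int) -> str:
    if refund == "Yes":
        return "No"                     # Refund = Yes leaf
    if marital_status == "Married":
        return "No"                     # Married leaf
    return "No" if taxable_income < 80_000 else "Yes"  # Single/Divorced branch

# Test record: Refund = No, Marital Status = Married, Taxable Income = 80K
print(predict_cheat("No", "Married", 80_000))  # -> No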
Decision Tree Induction
Many Algorithms:
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5 (J48 on WEKA)
SLIQ, SPRINT
Decision Tree Induction
Greedy strategy (heuristic method):
split the records based on an attribute test that optimizes a certain criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
Greedy Algorithm: Generate a Decision Tree
Input:
Data partition, D, which is a set of training tuples and
their associated class labels;
attribute list, the set of candidate attributes;
Attribute selection method, a procedure to determine the
splitting criterion that “best” partitions the data tuples into
individual classes. This criterion consists of a splitting
attribute and, possibly, either a split-point or splitting
subset.
Output: A decision tree
Greedy Algorithm: Generate a Decision Tree
Method:
1. Create a node N.
2. If the tuples in D are all of the same class C, return N as a leaf labeled C.
3. If attribute_list is empty, return N as a leaf labeled with the majority class in D.
4. Apply Attribute_selection_method(D, attribute_list) to find the "best" splitting_criterion, and label N with it.
5. If the splitting attribute is discrete-valued and multiway splits are allowed, remove it from attribute_list.
6. For each outcome j of the splitting criterion: let Dj be the set of tuples in D satisfying outcome j; if Dj is empty, attach a leaf labeled with the majority class in D; otherwise attach the subtree obtained by recursing on Dj.
7. Return N.
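A minimal Python sketch of this recursive method, assuming each tuple is a dict from attribute name to value; best_split stands in for Attribute_selection_method (e.g., information gain):

from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attrs, best_split):
    if len(set(labels)) == 1:            # all tuples in the same class -> leaf
        return labels[0]
    if not attrs:                        # no attributes left -> majority vote
        return majority(labels)
    a = best_split(rows, labels, attrs)  # choose the "best" splitting attribute
    tree = {a: {}}
    for v in {r[a] for r in rows}:       # one branch per value of a
        idx = [i for i, r in enumerate(rows) if r[a] == v]
        tree[a][v] = build_tree([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [x for x in attrs if x != a],
                                best_split)
    return tree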
How to determine the best split?
(Figure slides: candidate splits are compared by how well they separate the classes; the attribute selection measures below make this precise.)
Example 2:
New data: Age = 30, Student = Yes, Income = 4000, Buy_computer = ?
Attribute Selection Measures
Heuristic for selecting the splitting criterion that
“best” separates a given data partition, D, of
class-labeled training tuples into individual
classes.
Three popular attribute selection measures:
1. Information gain.
2. Gain ratio.
3. Gini index.
Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain.
Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|.
Expected information (entropy) needed to classify a tuple in D:
Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
Information gained by branching on attribute A:
Gain(A) = Info(D) - Info_A(D)
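A minimal sketch of these formulas in Python (the names info and gain are illustrative); the final calls reproduce Info(D) and Gain(age) for the buys_computer data used in the next example:

from collections import Counter
from math import log2

def info(labels):
    # Info(D) = -sum_i p_i log2(p_i), with p_i estimated from class counts
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(values, labels):
    # Gain(A) = Info(D) - Info_A(D) for one categorical attribute column
    n = len(labels)
    info_a = 0.0
    for v in set(values):
        sub = [l for x, l in zip(values, labels) if x == v]
        info_a += len(sub) / n * info(sub)
    return info(labels) - info_a

age = ["youth", "youth", "middle_aged", "senior", "senior", "senior", "middle_aged",
       "youth", "youth", "senior", "youth", "middle_aged", "middle_aged", "senior"]
cls = ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no"]
print(round(info(cls), 3))       # 0.94  -> Info(D)
print(round(gain(age, cls), 3))  # 0.246 -> Gain(age)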
Attribute Selection Measure:
Information Gain (ID3)
Example: Induction of a decision tree using information gain.
The table shown earlier presents a training set, D, of class-labeled tuples randomly selected from the AllElectronics customer database.
Example: Induction of a decision tree using
information gain
1. Compute the expected information needed to classify a tuple in D. The class label attribute, buys_computer, has two distinct values (yes, no), so there are m = 2 classes; D contains 9 yes and 5 no tuples:
Info(D) = -(9/14)\log_2(9/14) - (5/14)\log_2(5/14) = 0.940 bits
2. Compute the expected information requirement for each attribute. Starting with the attribute age (youth: 2 yes, 3 no; middle_aged: 4 yes, 0 no; senior: 3 yes, 2 no):
Info_{age}(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694 bits
Example: Induction of a decision tree using
information gain
3. The gain in information from such a partitioning would be:
Gain(age) = Info(D) - Info_{age}(D) = 0.940 - 0.694 = 0.246 bits
Similarly, Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit_rating) = 0.048 bits.
Because age has the highest information gain among the attributes, it is selected as the splitting attribute.
Node N is labeled with age, and branches are grown for
each of the attribute’s values
Example: Induction of a decision tree using information gain
(Figure: the tuples are partitioned on age; the middle_aged partition is a pure "yes" leaf, while the youth and senior partitions are split recursively, yielding the tree shown earlier for "buys_computer".)
Rule-Based Classification
Using IF-THEN Rules for Classification
Rules are expressed in the form IF condition THEN conclusion. A rule R is assessed by its coverage and accuracy: if n_covers is the number of tuples R covers, n_correct the number of those it classifies correctly, and |D| the size of the data set, then
coverage(R) = n_covers / |D| and accuracy(R) = n_correct / n_covers.

Example: Rule accuracy and coverage
Let's go back to our data: the class-labeled tuples from the AllElectronics customer database, where the task is to predict whether a customer will buy a computer. Consider rule
R1: IF age = youth AND student = yes THEN buys_computer = yes,
which covers 2 of the 14 tuples and correctly classifies both:
Coverage(R1) = 2/14 = 14.28%
Accuracy(R1) = 2/2 = 100%
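A sketch of computing both measures on the AllElectronics tuples, reduced to the attributes R1 tests (rule_stats is a hypothetical helper):

def rule_stats(D, condition, conclusion):
    covered = [t for t in D if condition(t)]
    correct = [t for t in covered if t["class"] == conclusion]
    coverage = len(covered) / len(D)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

D = [{"age": a, "student": s, "class": c} for a, s, c in [
    ("youth", "no", "no"), ("youth", "no", "no"), ("middle_aged", "no", "yes"),
    ("senior", "no", "yes"), ("senior", "yes", "yes"), ("senior", "yes", "no"),
    ("middle_aged", "yes", "yes"), ("youth", "no", "no"), ("youth", "yes", "yes"),
    ("senior", "yes", "yes"), ("youth", "yes", "yes"), ("middle_aged", "no", "yes"),
    ("middle_aged", "yes", "yes"), ("senior", "no", "no")]]

cov, acc = rule_stats(D, lambda t: t["age"] == "youth" and t["student"] == "yes", "yes")
print(f"coverage = {cov:.2%}, accuracy = {acc:.2%}")  # coverage = 14.29%, accuracy = 100.00%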
Rule Extraction from a Decision Tree
Rules are easy to extract from a tree: one rule is created for each path from the root to a leaf; each attribute-value pair along the path forms a conjunct in the rule antecedent, and the leaf holds the class prediction. For example, from the buys_computer tree: IF age = youth AND student = no THEN buys_computer = no.
Naïve Bayes Classification
Bayes Classification
A statistical classifier: performs probabilistic prediction, i.e., predicts
class membership probabilities
Foundation: Based on Bayes’ Theorem.
Performance: A simple Bayesian classifier, naïve Bayesian classifier,
has comparable performance with decision tree and selected neural
network classifiers
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct — prior
knowledge can be combined with observed data
Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision making
against which other methods can be measured
Bayes’ Theorem: Basics
Bayes’ Theorem:
P(H|X) = \frac{P(X|H) \, P(H)}{P(X)}
Let X be a data sample (“evidence”): class label is unknown
Let H be a hypothesis that X belongs to class C
P(H) (prior probability): the initial probability
E.g., X will buy computer, regardless of age, income, …
Naïve Bayes Classifier
Let D be a training set of class-labeled tuples and let there be m classes C_1, ..., C_m. Given a tuple X, the classifier predicts the class C_i that maximizes the posterior P(C_i|X), i.e., that maximizes P(X|C_i) P(C_i), since P(X) is constant across classes.
The naïve assumption is class-conditional independence of the attributes:
P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)
This greatly reduces the computation: only the per-attribute class-conditional probabilities need to be estimated from the training data.
Example of Naïve Bayes Classifier: Training Dataset
Using the AllElectronics data (9 yes, 5 no), classify the unknown tuple
X = (age = youth, income = medium, student = yes, credit_rating = fair):
P(buys_computer = yes) = 9/14 = 0.643; P(buys_computer = no) = 5/14 = 0.357
P(age = youth | yes) = 2/9; P(income = medium | yes) = 4/9; P(student = yes | yes) = 6/9; P(credit_rating = fair | yes) = 6/9
P(age = youth | no) = 3/5; P(income = medium | no) = 2/5; P(student = yes | no) = 1/5; P(credit_rating = fair | no) = 2/5
P(X | yes) = (2/9)(4/9)(6/9)(6/9) = 0.044; P(X | no) = (3/5)(2/5)(1/5)(2/5) = 0.019
P(X | yes) P(yes) = 0.044 x 0.643 = 0.028; P(X | no) P(no) = 0.019 x 0.357 = 0.007
Therefore, the naïve Bayesian classifier predicts buys_computer = yes for tuple X.
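A sketch reproducing the arithmetic above from the per-class counts (naive_bayes_score is a hypothetical helper):

def naive_bayes_score(prior, likelihoods):
    # P(X|Ci) * P(Ci) under the class-conditional independence assumption
    p = prior
    for l in likelihoods:
        p *= l
    return p

# Counts from the 14-tuple AllElectronics data: 9 yes, 5 no.
p_yes = naive_bayes_score(9/14, [2/9, 4/9, 6/9, 6/9])  # ~0.028
p_no = naive_bayes_score(5/14, [3/5, 2/5, 1/5, 2/5])   # ~0.007
print("buys_computer =", "yes" if p_yes > p_no else "no")  # yes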
Model Evaluation and Selection
Evaluation metrics: How can we measure accuracy?
Other metrics to consider?
What if we have more than one classifier and want to
choose the “best” one? This is referred to as model
selection.
Use validation test set of class-labeled tuples instead
of training set when assessing accuracy.
Methods for estimating a classifier’s accuracy:
Holdout method, random subsampling
Cross-validation
Bootstrap
Metrics for Evaluating Classifier Performance:
Confusion Matrix
Confusion Matrix:
Actual class \ Predicted class   YES                    NO
YES                              True Positives (TP)    False Negatives (FN)
NO                               False Positives (FP)   True Negatives (TN)

Example of Confusion Matrix:
Actual class \ Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes               6954                 46                  7000
buy_computer = no                412                  2588                3000
Total                            7366                 2634                10000

Given m classes, an entry CM_{i,j} in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j.
May have extra rows/columns to provide totals.
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

A \ P   yes   no
yes     TP    FN    P
no      FP    TN    N
        P'    N'    All

Classifier accuracy (recognition rate): percentage of test set tuples that are correctly classified:
Accuracy = (TP + TN) / All
Error rate: 1 - accuracy, or
Error rate = (FP + FN) / All

Class imbalance problem: one class may be rare, e.g., fraud or HIV-positive, with a significant majority of the negative class and a minority of the positive class. Then use:
Sensitivity (true positive recognition rate) = TP / P
Specificity (true negative recognition rate) = TN / N
Classifier Evaluation Metrics: Precision and Recall, and F-measures
Precision (exactness): what % of tuples that the classifier labeled as positive are actually positive:
Precision = TP / (TP + FP)
Recall (completeness): what % of positive tuples the classifier labeled as positive:
Recall = TP / (TP + FN)
A perfect score is 1.0; there is an inverse relationship between precision and recall.
F measure (F_1 or F-score): harmonic mean of precision and recall:
F = \frac{2 \times precision \times recall}{precision + recall}
Classifier Evaluation Metrics: Example
Actual class \ Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                     90             210           300     30.00 (sensitivity)
cancer = no                      140            9560          9700    98.56 (specificity)
Total                            230            9770          10000   96.50 (accuracy)

Precision = 90/230 = 39.13%
Recall = 90/300 = 30.00%
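A sketch deriving all of these metrics from the four cells of the cancer confusion matrix:

TP, FN, FP, TN = 90, 210, 140, 9560
P, N = TP + FN, FP + TN            # 300 actual positives, 9700 actual negatives
All = P + N

accuracy = (TP + TN) / All         # 0.9650
error_rate = (FP + FN) / All       # 0.0350
sensitivity = TP / P               # 0.3000 (= recall)
specificity = TN / N               # ~0.9856
precision = TP / (TP + FP)         # ~0.3913
f1 = 2 * precision * sensitivity / (precision + sensitivity)  # ~0.3396
print(f"acc={accuracy:.4f} sens={sensitivity:.4f} spec={specificity:.4f} "
      f"prec={precision:.4f} F1={f1:.4f}")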
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
Holdout method
Given data is randomly partitioned into two independent sets
Training set (e.g., 2/3) for model construction
Test set (e.g., 1/3) for accuracy estimation
Random subsampling: a variation of holdout; repeat holdout k times and average the accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets, each
approximately equal size
At i-th iteration, use Di as test set and others as training set
Leave-one-out: k folds where k = # of tuples, for small sized data
Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
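A minimal sketch of 10-fold stratified cross-validation with scikit-learn (assumed available); the bundled breast-cancer dataset and the decision tree are stand-ins for any labeled data and classifier:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # each tuple tested exactly once
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(f"mean accuracy over 10 folds: {scores.mean():.3f}")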
Summary
Classification is an extensively studied problem (mainly in
statistics, machine learning & neural networks)
Classification is probably one of the most widely used
data mining techniques with a lot of extensions
Scalability is still an important issue for database
applications: thus combining classification with database
techniques should be a promising topic
Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.