TOOLS &
TECHNIQUES FOR
DATA SCIENCE
LECTURE 3
Classification, Decision Tree Induction, Model Evaluation and
Selection
Prepared by Dr. Danish Jamil
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc.
with the aim of establishing the existence of
classes or clusters in the data
What is classification?
Classification is the task of learning a target function f
that maps attribute set x to one of the predefined class labels
y.

Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class attribute).

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

One of the attributes is the class attribute: in this case, Cheat.
Two class labels (or classes): Yes (1), No (0).
Why classification?
The target function f is known as a classification model
Descriptive modeling: Explanatory tool to distinguish between objects
of different classes (e.g., understand why people cheat on their taxes)
Predictive modeling: Predict a class of a previously unseen record
Examples of Classification Tasks
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Categorizing news stories as finance, weather, entertainment, sports, etc.
Identifying spam email, spam web pages
Understanding whether a web query has commercial intent or not
General approach to classification
Training set consists of records with known class labels
Training set is used to build a classification model
A labeled test set of previously unseen data records is used to
evaluate the quality of the model.
The classification model is applied to new records with unknown class
labels
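A minimal sketch of this workflow in Python, assuming scikit-learn is available; the synthetic dataset is a stand-in for real labeled records:

```python
# Sketch of the general approach: train on labeled records, evaluate on a
# held-out labeled test set, then apply the model to unlabeled records.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labeled records: X holds the attribute values, y the known class labels.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# Hold out part of the labeled data as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Build the classification model from the training set.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Evaluate the model on the previously unseen, labeled test set.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Apply the model to new records with unknown class labels.
new_records = X_test[:3]          # stand-in for genuinely new data
print("predicted labels:", model.predict(new_records))
```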
Illustrating Classification Task
Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Induction: a learning algorithm is applied to the training set to learn a model.

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Deduction: the model is applied to the test set to predict the unknown class labels.
Prediction Problems: Classification vs.
Numeric Prediction
Classification
predicts categorical class labels (discrete or nominal)
constructs a model from the training set and the class labels of a classifying attribute, then uses it to classify new data
Numeric Prediction
models continuous-valued functions, i.e., predicts
unknown or missing values
Typical applications
Credit/loan approval
Medical diagnosis: whether a tumor is cancerous or benign
Fraud detection: whether a transaction is fraudulent
Web page categorization
Classification—A Two-Step
Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, or
mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of test sample is compared with the classified
result from the model
Accuracy rate is the percentage of test set samples that are
correctly classified by the model
Test set is independent of training set (otherwise overfitting)
If the accuracy is acceptable, use the model to classify new data
Note: If the test set is used to select models, it is called a validation (test) set
Process (1): Model Construction
Training Data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

A classification algorithm learns a classifier (model) from the training data, e.g.:

IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Process (2): Using the Model in Prediction
Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

The classifier is evaluated on the testing data and then applied to unseen data, e.g., (Jeff, Professor, 4) → Tenured?
Decision Trees
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
Decision Tree Induction: An Example
Training data set: Buys_computer. The data set follows an example from Quinlan’s ID3 (Playing Tennis).

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

Resulting tree:

age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit_rating?
    excellent: no
    fair: yes
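As a quick illustration, the same table can be fed to scikit-learn. One caveat: sklearn's DecisionTreeClassifier implements CART (binary splits) rather than ID3's multi-way splits, so the induced tree will differ in shape even with criterion="entropy"; this is a sketch, not a reproduction of ID3:

```python
# Inducing a tree on the buys_computer data with scikit-learn (CART, not ID3).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rows = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
df = pd.DataFrame(rows, columns=["age", "income", "student",
                                 "credit_rating", "buys_computer"])

X = pd.get_dummies(df.drop(columns="buys_computer"))  # one-hot encode
y = df["buys_computer"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))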
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-
conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they
are discretized in advance)
Examples are partitioned recursively based on
selected attributes
Test attributes are selected on the basis of a heuristic
or statistical measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further
partitioning – majority voting is employed for
classifying the leaf
There are no samples left
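A minimal sketch of the greedy attribute-selection step in plain Python, assuming records are stored as dicts (the tiny Playing-Tennis-style dataset here is a hypothetical stand-in):

```python
# Pick the split attribute with the highest information gain:
# entropy(parent) minus the weighted entropy of the partitions.
import math
from collections import Counter

def entropy(labels):
    """Entropy -sum(p_i * log2(p_i)) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(records, attr, target="label"):
    """Parent entropy minus weighted entropy of the partitions on attr."""
    parent = entropy([r[target] for r in records])
    n = len(records)
    children = 0.0
    for value in {r[attr] for r in records}:
        part = [r[target] for r in records if r[attr] == value]
        children += len(part) / n * entropy(part)
    return parent - children

# Hypothetical records in the style of the Playing Tennis data.
records = [
    {"outlook": "sunny",    "windy": "no",  "label": "no"},
    {"outlook": "sunny",    "windy": "yes", "label": "no"},
    {"outlook": "overcast", "windy": "no",  "label": "yes"},
    {"outlook": "rain",     "windy": "no",  "label": "yes"},
    {"outlook": "rain",     "windy": "yes", "label": "no"},
]
best = max(["outlook", "windy"], key=lambda a: information_gain(records, a))
print("split on:", best)
```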
Comparing Attribute Selection Measures
The three measures, in general, return good results, but each has its biases:
Information gain:
biased towards multivalued attributes
Gain ratio:
tends to prefer unbalanced splits in which one partition is much smaller than
the others
Gini index:
biased towards multivalued attributes
has difficulty when # of classes is large
tends to favor tests that result in equal-sized partitions and purity in both
partitions
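To make the measures concrete, a small sketch computing Gini, information gain, and gain ratio on the class counts of the age split from the Buys_computer table above (9 yes / 5 no at the parent); counts are per-partition [yes, no]:

```python
# Compare the three attribute-selection measures on one candidate split.
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def info_gain(parent, parts):
    n = sum(sum(p) for p in parts)
    return entropy(parent) - sum(sum(p) / n * entropy(p) for p in parts)

def gain_ratio(parent, parts):
    # SplitInfo penalizes splits with many partitions.
    split_info = entropy([sum(p) for p in parts])
    return info_gain(parent, parts) / split_info

parent = [9, 5]                    # 9 "yes", 5 "no" at the node
split = [[2, 3], [4, 0], [3, 2]]   # age: <=30, 31..40, >40

print("gini(parent)    :", gini(parent))
print("information gain:", info_gain(parent, split))   # ~0.246
print("gain ratio      :", gain_ratio(parent, split))
```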
Tree Induction
Issues
How to classify a leaf node
Assign the majority class
If the leaf is empty, assign the default class: the class that occurs most frequently in the training data
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
How to Specify Test Condition?
Depends on attribute types
Nominal
Ordinal
Continuous
Depends on number of ways to split
2-way split
Multi-way split
Splitting Based on Nominal Attributes
Multi-way split: Use as many partitions as distinct values.
  CarType → Family, Sports, Luxury

Binary split: Divides values into two subsets. Need to find the optimal partitioning.

  CarType → {Sports, Luxury} vs {Family}, OR CarType → {Family, Luxury} vs {Sports}
Splitting Based on Ordinal
Attributes
Multi-way split: Use as many partitions as distinct values.
  Size → Small, Medium, Large

Binary split: Divides values into two subsets; respects the order. Need to find the optimal partitioning.

  Size → {Small, Medium} vs {Large}, OR Size → {Small} vs {Medium, Large}

What about this split? Size → {Small, Large} vs {Medium}: it does not respect the order, so it is not a valid ordinal split.
Splitting Based on Continuous Attributes
Different ways of handling
Discretization to form an ordinal categorical attribute
Static – discretize once at the beginning
Dynamic – ranges can be found by equal interval bucketing, equal frequency
bucketing (percentiles), or clustering.
Binary Decision: (A < v) or (A ≥ v)
considers all possible splits and finds the best cut
can be more computationally intensive
Splitting Based on Continuous
Attributes
(i) Binary split: Taxable Income > 80K? → Yes / No
(ii) Multi-way split: Taxable Income? → < 10K, [10K,25K), [25K,50K), [50K,80K), > 80K
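A minimal sketch of the binary-decision approach: scan candidate cut points (midpoints between consecutive distinct values) and score each with the weighted Gini index. The values below are the Taxable Income and Cheat columns from the earlier table:

```python
# Find the best cut (A < v) vs (A >= v) for a continuous attribute.
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_cut(values, labels):
    pairs = sorted(zip(values, labels))
    best_v, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                       # skip ties
        v = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for x, l in pairs if x < v]
        right = [l for x, l in pairs if x >= v]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_v, best_score = v, score
    return best_v, best_score

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]   # Taxable Income (K)
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_cut(income, cheat))   # best cut at 97.5K, weighted Gini 0.3
```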
Stopping Criteria for Tree Induction
Stop expanding a node when all the records belong to the same class
Stop expanding a node when all the records have similar attribute
values
Early termination (to be discussed later)
Decision Tree Based Classification
Advantages:
Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Accuracy is comparable to other classification techniques for many simple
data sets
Overfitting and Tree Pruning
Overfitting: An induced tree may overfit the training
data
Too many branches, some of which may reflect anomalies due to noise or outliers
Poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown” tree, producing a sequence of progressively pruned trees
Use a set of data different from the training data to decide which is the “best pruned tree”
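As one concrete postpruning scheme, a sketch using scikit-learn's minimal cost-complexity pruning (CART style), where a held-out validation set picks the "best pruned tree"; the dataset here is synthetic:

```python
# Postpruning sketch: grow a full tree, compute the pruning sequence,
# and pick the pruned tree that scores best on held-out validation data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, flip_y=0.1,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Each ccp_alpha corresponds to one tree in the pruning sequence.
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train).ccp_alphas

best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in alphas),
    key=lambda t: t.score(X_val, y_val),   # validation accuracy decides
)
print("leaves in best pruned tree:", best.get_n_leaves())
```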
Metrics for Performance Evaluation
Focus on the predictive capability of a model
Rather than how long it takes to classify or build models, scalability, etc.
Confusion Matrix:

                    PREDICTED CLASS
                    Class=Yes  Class=No
ACTUAL  Class=Yes   a          b
CLASS   Class=No    c          d

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance
Evaluation…

                    PREDICTED CLASS
                    Class=Yes  Class=No
ACTUAL  Class=Yes   a (TP)     b (FN)
CLASS   Class=No    c (FP)     d (TN)

Most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
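A minimal sketch checking the formula against scikit-learn, using made-up labels:

```python
# Accuracy from the confusion-matrix counts a, b, c, d.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"]
y_pred = ["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes"]

# labels=... fixes row/column order to [Yes, No] as in the matrix above.
(a, b), (c, d) = confusion_matrix(y_true, y_pred, labels=["Yes", "No"])
print("accuracy:", (a + d) / (a + b + c + d))       # (TP+TN)/(TP+TN+FP+FN)
print("check   :", accuracy_score(y_true, y_pred))  # same value
```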
Precision-Recall
                    PREDICTED CLASS
                    Class=Yes  Class=No
ACTUAL  Class=Yes   a (TP)     b (FN)
CLASS   Class=No    c (FP)     d (TN)

Precision (p) = a / (a + c) = TP / (TP + FP)
Recall (r) = a / (a + b) = TP / (TP + FN)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c) = 2TP / (2TP + FP + FN)

(F is the harmonic mean of p and r: F = 1 / ((1/r + 1/p) / 2).)
Precision is biased towards C(Yes|Yes) & C(Yes|No)
Recall is biased towards C(Yes|Yes) & C(No|Yes)
F-measure is biased towards all except C(No|No)
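The same made-up labels give a quick check of precision, recall, and F-measure:

```python
# Precision, recall, and F-measure from counts, cross-checked with sklearn.
from sklearn.metrics import precision_score, recall_score, f1_score

TP, FN, FP, TN = 3, 1, 1, 3     # a, b, c, d from the matrix above

p = TP / (TP + FP)              # precision
r = TP / (TP + FN)              # recall
f = 2 * r * p / (r + p)         # harmonic mean = 2TP / (2TP + FP + FN)
print(p, r, f)

y_true = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"]
y_pred = ["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes"]
print(precision_score(y_true, y_pred, pos_label="Yes"),
      recall_score(y_true, y_pred, pos_label="Yes"),
      f1_score(y_true, y_pred, pos_label="Yes"))
```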
Methods for Performance Evaluation
How to obtain a reliable estimate of performance?
Performance of a model may depend on other factors besides the
learning algorithm:
Class distribution
Cost of misclassification
Size of training and test sets
Methods of Estimation
Holdout
Reserve 2/3 for training and 1/3 for testing
Random subsampling
A single random split may be biased, so the holdout is repeated several times (repeated holdout)
Cross validation
Partition data into k disjoint subsets
k-fold: train on k-1 partitions, test on the
remaining one
Leave-one-out: k=n
Guarantees that each record is used the same number of times for training
and testing
Bootstrap
Sampling with replacement
~63.2% of records used for training, ~36.8% for testing (a record is left out of a bootstrap sample with probability (1 - 1/n)^n ≈ e^-1 ≈ 0.368)
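A minimal sketch of 10-fold cross-validation with scikit-learn on synthetic data:

```python
# k-fold cross-validation: each record is used k-1 times for training
# and exactly once for testing.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("10-fold accuracies:", scores.round(3))
print("mean / std        :", scores.mean().round(3), scores.std().round(3))
```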
Model Selection: ROC Curves
ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
Originated from signal detection theory
Shows the trade-off between the true positive rate (vertical axis) and the false positive rate (horizontal axis)
The plot also shows a diagonal line, corresponding to random guessing
The area under the ROC curve is a measure of the accuracy of the model
Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
A model with perfect accuracy will have an area of 1.0
The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
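A minimal sketch tracing an ROC curve and its area with scikit-learn on synthetic data:

```python
# Rank test tuples by predicted probability of the positive class,
# then trace TPR vs FPR and compute the area under the curve.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]      # P(positive class) per tuple

fpr, tpr, _ = roc_curve(y_te, scores)         # points on the ROC curve
print("AUC:", roc_auc_score(y_te, scores))    # 1.0 = perfect, 0.5 = random
```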
Issues Affecting Model Selection
Accuracy
classifier accuracy: predicting class label
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as
decision tree size or compactness of classification rules