
TOOLS & TECHNIQUES FOR DATA SCIENCE
LECTURE 3
Classification, Decision Tree Induction, Model Evaluation and Selection

Prepared by Dr. Danish Jamil


Supervised vs. Unsupervised Learning

 Supervised learning (classification)
 Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
What is classification?
 Classification is the task of learning a target function f
that maps attribute set x to one of the predefined class labels
y

 One of the attributes is the class attribute; in this case: Cheat
 Two class labels (or classes): Yes (1), No (0)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Refund and Marital Status are categorical, Taxable Income is continuous, and Cheat is the class attribute.)
Why classification?
 The target function f is known as a classification model
 Descriptive modeling: Explanatory tool to distinguish between objects of different classes (e.g., understand why people cheat on their taxes)
 Predictive modeling: Predict the class of a previously unseen record

Examples of Classification Tasks

 Predicting tumor cells as benign or malignant
 Classifying credit card transactions as legitimate or fraudulent
 Categorizing news stories as finance, weather, entertainment, sports, etc.
 Identifying spam email and spam web pages
 Understanding whether a web query has commercial intent or not


General approach to classification

 The training set consists of records with known class labels
 The training set is used to build a classification model
 A labeled test set of previously unseen records is used to evaluate the quality of the model
 The classification model is then applied to new records with unknown class labels, as sketched below
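As a concrete illustration of this workflow, here is a minimal sketch using scikit-learn; the dataset is a stand-in for the slides' examples, not from them:

```python
# A minimal sketch of the train / evaluate / apply workflow with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Hold out a labeled test set of previously unseen records
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)

model = DecisionTreeClassifier()    # induction: build the model
model.fit(X_train, y_train)         # learn from the training set

y_pred = model.predict(X_test)      # deduction: apply to unseen records
print("Test accuracy:", accuracy_score(y_test, y_pred))
```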
Illustrating Classification Task
Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

A learning algorithm performs induction on the training set to learn a model.

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

The model is then applied (deduction) to assign class labels to the test records.
Prediction Problems: Classification vs.
Numeric Prediction
 Classification
 predicts categorical class labels (discrete or nominal)
 classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying new
data
 Numeric Prediction
 models continuous-valued functions, i.e., predicts
unknown or missing values
 Typical applications
 Credit/loan approval
 Medical diagnosis: whether a tumor is cancerous or benign
 Fraud detection: whether a transaction is fraudulent
 Web page categorization
Classification—A Two-Step
Process
 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
 The set of tuples used for model construction is the training set
 The model is represented as classification rules, decision trees, or mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate the accuracy of the model
 The known label of a test sample is compared with the classified result from the model
 The accuracy rate is the percentage of test set samples that are correctly classified by the model
 The test set is independent of the training set (otherwise overfitting results)
 If the accuracy is acceptable, use the model to classify new data
 Note: If the test set is used to select models, it is called a validation (test) set
Process (1): Model Construction

Training data is fed to a classification algorithm, which learns a classifier (model).

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned model:
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Process (2): Using the Model in Prediction

The classifier is first evaluated on labeled testing data; if its accuracy is acceptable, it is applied to unseen data, e.g., (Jeff, Professor, 4) → Tenured?

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
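The learned rule above can be expressed as a tiny function; the function name and record format here are our own illustration:

```python
# Toy sketch: apply the learned rule to an unseen record.
def predict_tenured(rank, years):
    """The rule induced in Process (1)."""
    return "yes" if rank == "professor" or years > 6 else "no"

print(predict_tenured("professor", 4))  # Jeff -> 'yes'
```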
Decision Trees
 Decision tree
 A flow-chart-like tree structure
 Internal node denotes a test on an attribute
 Branch represents an outcome of the test
 Leaf nodes represent class labels or class distribution
Decision Tree Induction: An Example
 Training data set: Buys_computer
 The data set follows an example of Quinlan’s ID3 (Playing Tennis)

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

 Resulting tree:

age?
├─ <=30  → student?
│          ├─ no  → no
│          └─ yes → yes
├─ 31…40 → yes
└─ >40   → credit rating?
           ├─ excellent → no
           └─ fair      → yes
Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-
conquer manner
 At start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they
are discretized in advance)
 Examples are partitioned recursively based on
selected attributes
 Test attributes are selected on the basis of a heuristic
or statistical measure (e.g., information gain)
 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further
partitioning – majority voting is employed for
classifying the leaf
 There are no samples left
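A compact sketch of this greedy algorithm for categorical attributes, using entropy-based information gain as the selection measure; the helper names are ours, and this is illustrative rather than a production implementation:

```python
# Illustrative ID3-style induction for categorical attributes.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction from partitioning on attribute index attr."""
    total, n = entropy(labels), len(labels)
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        total -= len(subset) / n * entropy(subset)
    return total

def build_tree(rows, labels, attrs):
    # Stop: all samples in one class, or no attributes left (majority vote)
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    tree = {}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[(best, value)] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [a for a in attrs if a != best])
    return tree

# Usage on two toy attributes (age, student) predicting buys_computer:
rows = [("<=30", "no"), ("<=30", "yes"), ("31..40", "no"), (">40", "yes")]
labels = ["no", "yes", "yes", "yes"]
print(build_tree(rows, labels, [0, 1]))
```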
Comparing Attribute Selection Measures

 The three measures, in general, return good results, but:
 Information gain:
 biased towards multivalued attributes
 Gain ratio:
 tends to prefer unbalanced splits in which one partition is much smaller than the others
 Gini index:
 biased towards multivalued attributes
 has difficulty when the number of classes is large
 tends to favor tests that result in equal-sized partitions and purity in both partitions
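To make the comparison concrete, here is a small sketch computing information gain, gain ratio, and Gini reduction for one candidate split; the class counts are made up for illustration:

```python
# Comparing impurity measures on a single candidate split (made-up counts).
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

parent = [9, 5]                 # class counts before the split
children = [[4, 0], [5, 5]]     # class counts in each partition
n = sum(parent)

info_gain = entropy(parent) - sum(
    sum(ch) / n * entropy(ch) for ch in children)
gini_gain = gini(parent) - sum(
    sum(ch) / n * gini(ch) for ch in children)

# Gain ratio divides information gain by the split's own entropy,
# penalizing splits into many small partitions
split_info = entropy([sum(ch) for ch in children])
gain_ratio = info_gain / split_info
print(info_gain, gain_ratio, gini_gain)
```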
Tree Induction

 Issues
 How to classify a leaf node
 Assign the majority class
 If the leaf is empty, assign the default class: the class with the highest overall frequency
 Determine how to split the records
 How to specify the attribute test condition?
 How to determine the best split?
 Determine when to stop splitting
How to Specify Test Condition?

 Depends on attribute types
 Nominal
 Ordinal
 Continuous

 Depends on number of ways to split
 2-way split
 Multi-way split
Splitting Based on Nominal Attributes
 Multi-way split: Use as many partitions as distinct values.
   CarType → Family | Sports | Luxury

 Binary split: Divides values into two subsets; need to find the optimal partitioning.
   CarType → {Sports, Luxury} | {Family}   OR   CarType → {Family, Luxury} | {Sports}
Splitting Based on Ordinal
Attributes

 Multi-way split: Use as many partitions as distinct values.
   Size → Small | Medium | Large

 Binary split: Divides values into two subsets that respect the order; need to find the optimal partitioning.
   Size → {Small, Medium} | {Large}   OR   Size → {Small} | {Medium, Large}

 What about this split?
   Size → {Small, Large} | {Medium}   (it groups non-adjacent values, so it violates the order)
Splitting Based on Continuous Attributes

 Different ways of handling
 Discretization to form an ordinal categorical attribute
 Static – discretize once at the beginning
 Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.

 Binary Decision: (A < v) or (A ≥ v)
 consider all possible splits and find the best cut (see the sketch below)
 can be more compute intensive
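A sketch of the exhaustive best-cut search under the Gini criterion, using the Taxable Income values from the earlier Cheat table; the function names are ours:

```python
# Find the best binary cut for a continuous attribute by scanning
# midpoints between consecutive sorted values.
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts) if n else 0

def best_cut(values, labels):
    """Return the threshold v minimizing weighted Gini of (A < v) vs (A >= v)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_v, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no valid cut between equal values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for x, l in pairs if x < v]
        right = [l for x, l in pairs if x >= v]
        score = (len(left) * gini([left.count(c) for c in set(left)]) +
                 len(right) * gini([right.count(c) for c in set(right)])) / n
        if score < best_score:
            best_v, best_score = v, score
    return best_v

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]   # in K
cheat   = ["No","No","No","No","Yes","No","No","Yes","No","Yes"]
print(best_cut(incomes, cheat))  # -> 97.5 (weighted Gini 0.30)
```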
Splitting Based on Continuous
Attributes

(i) Binary split:     Taxable Income > 80K? → Yes / No
(ii) Multi-way split: Taxable Income? → < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | > 80K

Stopping Criteria for Tree Induction

 Stop expanding a node when all the records belong to the same class

 Stop expanding a node when all the records have similar attribute
values

 Early termination (to be discussed later)


Decision Tree Based Classification

 Advantages:
 Inexpensive to construct
 Extremely fast at classifying unknown records
 Easy to interpret for small-sized trees
 Accuracy is comparable to other classification techniques for many simple
data sets
Overfitting and Tree Pruning
 Overfitting: An induced tree may overfit the training
data
 Too many branches, some of which may reflect anomalies due to noise or outliers
 Poor accuracy for unseen samples
 Two approaches to avoid overfitting
 Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
 Difficult to choose an appropriate threshold
 Postpruning: Remove branches from a “fully grown” tree to get a sequence of progressively pruned trees
 Use a set of data different from the training data to decide which is the “best pruned tree”
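One concrete postpruning recipe (not necessarily the one intended here) is scikit-learn's minimal cost-complexity pruning, which yields exactly such a sequence of progressively pruned trees; the dataset is a stand-in:

```python
# Postpruning via scikit-learn's minimal cost-complexity pruning.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow a full tree, then get the sequence of progressively pruned trees
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)

# Pick the pruning level that does best on held-out (non-training) data
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val))
print("chosen alpha:", best.ccp_alpha, "val accuracy:", best.score(X_val, y_val))
```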
Metrics for Performance Evaluation

 Focus on the predictive capability of a model
 Rather than how long it takes to classify or build models, scalability, etc.
 Confusion Matrix:

                        PREDICTED CLASS
                        Class=Yes   Class=No
 ACTUAL    Class=Yes    a (TP)      b (FN)
 CLASS     Class=No     c (FP)      d (TN)

 a: TP (true positive)     b: FN (false negative)
 c: FP (false positive)    d: TN (true negative)
Metrics for Performance Evaluation…

                        PREDICTED CLASS
                        Class=Yes   Class=No
 ACTUAL    Class=Yes    a (TP)      b (FN)
 CLASS     Class=No     c (FP)      d (TN)

 Most widely-used metric:

 Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Precision-Recall

                        PREDICTED CLASS
                        Class=Yes   Class=No
 ACTUAL    Class=Yes    a (TP)      b (FN)
 CLASS     Class=No     c (FP)      d (TN)

 Precision (p) = a / (a + c) = TP / (TP + FP)
 Recall (r) = a / (a + b) = TP / (TP + FN)
 F-measure (F) = 1 / (½ (1/r + 1/p)) = 2rp / (r + p) = 2a / (2a + b + c) = 2TP / (2TP + FP + FN)

 Precision is biased towards C(Yes|Yes) & C(Yes|No)
 Recall is biased towards C(Yes|Yes) & C(No|Yes)
 F-measure is biased towards all except C(No|No)
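The same metrics computed with scikit-learn, on made-up label vectors:

```python
# Precision, recall, and F-measure with scikit-learn (illustrative labels).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 5/6
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 5/6
print(f1_score(y_true, y_pred))         # 2TP / (2TP + FP + FN) = 5/6
```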
Methods for Performance Evaluation

 How to obtain a reliable estimate of performance?
 Performance of a model may depend on other factors besides the learning algorithm:
 Class distribution
 Cost of misclassification
 Size of training and test sets
Methods of Estimation
 Holdout
 Reserve 2/3 for training and 1/3 for testing
 Random subsampling
 One sample may be biased -- Repeated holdout
 Cross validation
 Partition data into k disjoint subsets
 k-fold: train on k-1 partitions, test on the
remaining one
 Leave-one-out: k=n
 Guarantees that each record is used the same number of times for training
and testing
 Bootstrap
 Sampling with replacement
 ~63% of records used for training, ~37% for testing
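A minimal k-fold cross-validation sketch with scikit-learn; the dataset is a stand-in:

```python
# 10-fold cross-validation: train on 9 partitions, test on the remaining one.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean(), scores.std())
```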
Model Selection: ROC Curves
 ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models
 Originated from signal detection theory
 Shows the trade-off between the true positive rate and the false positive rate
 Vertical axis represents the true positive rate; horizontal axis represents the false positive rate
 The area under the ROC curve is a measure of the accuracy of the model
 Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
 The plot also shows a diagonal line
 A model with perfect accuracy will have an area of 1.0
 The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
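A sketch of producing a ROC curve and its AUC with scikit-learn; the dataset and classifier are stand-ins:

```python
# Plot a ROC curve and compute AUC for a classifier's ranked scores.
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Rank test tuples by estimated probability of the positive class
scores = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, scores)

plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.plot([0, 1], [0, 1], "--")   # the diagonal (random-guess) line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```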
Issues Affecting Model Selection

 Accuracy
 classifier accuracy: predicting class label
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability
 understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as
decision tree size or compactness of classification rules
