TOOLS &
TECHNIQUES FOR
DATA SCIENCE
LECTURE 3
Classification, Decision Tree Induction, Model Evaluation and
Selection
Prepared by Dr. Danish Jamil
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc.
with the aim of establishing the existence of
classes or clusters in the data
What is classification?
Classification is the task of learning a target function f
that maps attribute set x to one of the predefined class labels
y.

Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class attribute).

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

One of the attributes is the class attribute: in this case, Cheat.
Two class labels (or classes): Yes (1), No (0).
Why classification?
The target function f is known as a classification model
Descriptive modeling: Explanatory tool to distinguish between objects
of different classes (e.g., understand why people cheat on their taxes)
Predictive modeling: Predict a class of a previously unseen record
Examples of Classification Tasks
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Categorizing news stories as finance, weather, entertainment, sports, etc.
Identifying spam email, spam web pages
Understanding whether a web query has commercial intent or not
General approach to classification
Training set consists of records with known class labels
Training set is used to build a classification model
A labeled test set of previously unseen data records is used to
evaluate the quality of the model.
The classification model is applied to new records with unknown class
labels
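A minimal sketch of this workflow in Python, assuming scikit-learn is available; the synthetic dataset is a stand-in for real labeled records:

```python
# Sketch of the general approach: train on labeled records, evaluate on a
# held-out labeled test set, then apply the model to unlabeled records.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labeled records: X holds the attribute values, y the known class labels.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# Hold out part of the labeled data as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Build the classification model from the training set.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Evaluate the model on the previously unseen, labeled test set.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Apply the model to new records with unknown class labels.
new_records = X_test[:3]          # stand-in for genuinely new data
print("predicted labels:", model.predict(new_records))
```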
Illustrating Classification Task
Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Induction: a learning algorithm is applied to the training set to learn a model.

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Deduction: the model is applied to the test set to predict the unknown class labels.
Prediction Problems: Classification vs.
Numeric Prediction
Classification
predicts categorical class labels (discrete or nominal)
constructs a model from the training set and the class labels of a classifying attribute, then uses it to classify new data
Numeric Prediction
models continuous-valued functions, i.e., predicts
unknown or missing values
Typical applications
Credit/loan approval
Medical diagnosis: whether a tumor is cancerous or benign
Fraud detection: whether a transaction is fraudulent
Web page categorization
Classification—A Two-Step
Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, or
mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of test sample is compared with the classified
result from the model
Accuracy rate is the percentage of test set samples that are
correctly classified by the model
Test set is independent of training set (otherwise overfitting)
If the accuracy is acceptable, use the model to classify new data
Note: If the test set is used to select models, it is called a validation (test) set
Process (1): Model Construction
Training Data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

A classification algorithm learns a classifier (model) from the training data, e.g.:

IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Process (2): Using the Model in Prediction
Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

The classifier is evaluated on the testing data and then applied to unseen data, e.g., (Jeff, Professor, 4) → Tenured?
Decision Trees
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
Decision Tree Induction: An Example
Training data set: Buys_computer. The data set follows an example from Quinlan’s ID3 (Playing Tennis).

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

Resulting tree:

age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit_rating?
    excellent: no
    fair: yes
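As a quick illustration, the same table can be fed to scikit-learn. One caveat: sklearn's DecisionTreeClassifier implements CART (binary splits) rather than ID3's multi-way splits, so the induced tree will differ in shape even with criterion="entropy"; this is a sketch, not a reproduction of ID3:

```python
# Inducing a tree on the buys_computer data with scikit-learn (CART, not ID3).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rows = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
df = pd.DataFrame(rows, columns=["age", "income", "student",
                                 "credit_rating", "buys_computer"])

X = pd.get_dummies(df.drop(columns="buys_computer"))  # one-hot encode
y = df["buys_computer"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))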
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-
conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they
are discretized in advance)
Examples are partitioned recursively based on
selected attributes
Test attributes are selected on the basis of a heuristic
or statistical measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further
partitioning – majority voting is employed for
classifying the leaf
There are no samples left
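A minimal sketch of the greedy attribute-selection step in plain Python, assuming records are stored as dicts (the tiny Playing-Tennis-style dataset here is a hypothetical stand-in):

```python
# Pick the split attribute with the highest information gain:
# entropy(parent) minus the weighted entropy of the partitions.
import math
from collections import Counter

def entropy(labels):
    """Entropy -sum(p_i * log2(p_i)) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(records, attr, target="label"):
    """Parent entropy minus weighted entropy of the partitions on attr."""
    parent = entropy([r[target] for r in records])
    n = len(records)
    children = 0.0
    for value in {r[attr] for r in records}:
        part = [r[target] for r in records if r[attr] == value]
        children += len(part) / n * entropy(part)
    return parent - children

# Hypothetical records in the style of the Playing Tennis data.
records = [
    {"outlook": "sunny",    "windy": "no",  "label": "no"},
    {"outlook": "sunny",    "windy": "yes", "label": "no"},
    {"outlook": "overcast", "windy": "no",  "label": "yes"},
    {"outlook": "rain",     "windy": "no",  "label": "yes"},
    {"outlook": "rain",     "windy": "yes", "label": "no"},
]
best = max(["outlook", "windy"], key=lambda a: information_gain(records, a))
print("split on:", best)
```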
Comparing Attribute Selection Measures
The three measures, in general, return good results, but each has its biases:
Information gain:
biased towards multivalued attributes
Gain ratio:
tends to prefer unbalanced splits in which one partition is much smaller than
the others
Gini index:
biased towards multivalued attributes
has difficulty when # of classes is large
tends to favor tests that result in equal-sized partitions and purity in both
partitions
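To make the measures concrete, a small sketch computing Gini, information gain, and gain ratio on the class counts of the age split from the Buys_computer table above (9 yes / 5 no at the parent); counts are per-partition [yes, no]:

```python
# Compare the three attribute-selection measures on one candidate split.
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def info_gain(parent, parts):
    n = sum(sum(p) for p in parts)
    return entropy(parent) - sum(sum(p) / n * entropy(p) for p in parts)

def gain_ratio(parent, parts):
    # SplitInfo penalizes splits with many partitions.
    split_info = entropy([sum(p) for p in parts])
    return info_gain(parent, parts) / split_info

parent = [9, 5]                    # 9 "yes", 5 "no" at the node
split = [[2, 3], [4, 0], [3, 2]]   # age: <=30, 31..40, >40

print("gini(parent)    :", gini(parent))
print("information gain:", info_gain(parent, split))   # ~0.246
print("gain ratio      :", gain_ratio(parent, split))
```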
Tree Induction
Issues
How to classify a leaf node
Assign the majority class
If the leaf is empty, assign the default class: the class that occurs most frequently in the training data
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
How to Specify Test Condition?
Depends on attribute types
Nominal
Ordinal
Continuous
Depends on number of ways to split
2-way split
Multi-way split
Splitting Based on Nominal Attributes
Multi-way split: Use as many partitions as distinct values.
  CarType → Family, Sports, Luxury

Binary split: Divides values into two subsets. Need to find the optimal partitioning.

  CarType → {Sports, Luxury} vs {Family}, OR CarType → {Family, Luxury} vs {Sports}
Splitting Based on Ordinal
Attributes
Multi-way split: Use as many partitions as distinct values.
  Size → Small, Medium, Large

Binary split: Divides values into two subsets; respects the order. Need to find the optimal partitioning.

  Size → {Small, Medium} vs {Large}, OR Size → {Small} vs {Medium, Large}

What about this split? Size → {Small, Large} vs {Medium}: it does not respect the order, so it is not a valid ordinal split.
Splitting Based on Continuous Attributes
Different ways of handling
Discretization to form an ordinal categorical attribute
Static – discretize once at the beginning
Dynamic – ranges can be found by equal interval bucketing, equal frequency
bucketing (percentiles), or clustering.
Binary Decision: (A < v) or (A ≥ v)
considers all possible splits and finds the best cut
can be more computationally intensive
Splitting Based on Continuous
Attributes
(i) Binary split: Taxable Income > 80K? → Yes / No
(ii) Multi-way split: Taxable Income? → < 10K, [10K,25K), [25K,50K), [50K,80K), > 80K
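A minimal sketch of the binary-decision approach: scan candidate cut points (midpoints between consecutive distinct values) and score each with the weighted Gini index. The values below are the Taxable Income and Cheat columns from the earlier table:

```python
# Find the best cut (A < v) vs (A >= v) for a continuous attribute.
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_cut(values, labels):
    pairs = sorted(zip(values, labels))
    best_v, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                       # skip ties
        v = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for x, l in pairs if x < v]
        right = [l for x, l in pairs if x >= v]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_v, best_score = v, score
    return best_v, best_score

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]   # Taxable Income (K)
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_cut(income, cheat))   # best cut at 97.5K, weighted Gini 0.3
```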
Stopping Criteria for Tree Induction
Stop expanding a node when all the records belong to the same class
Stop expanding a node when all the records have similar attribute
values
Early termination (to be discussed later)
Decision Tree Based Classification
Advantages:
Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Accuracy is comparable to other classification techniques for many simple
data sets
Overfitting and Tree Pruning
Overfitting: An induced tree may overfit the training
data
Too many branches, some of which may reflect anomalies due to noise or outliers
Poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown” tree, producing a sequence of progressively pruned trees
Use a set of data different from the training data to decide which is the “best pruned tree”
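As one concrete postpruning scheme, a sketch using scikit-learn's minimal cost-complexity pruning (CART style), where a held-out validation set picks the "best pruned tree"; the dataset here is synthetic:

```python
# Postpruning sketch: grow a full tree, compute the pruning sequence,
# and pick the pruned tree that scores best on held-out validation data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, flip_y=0.1,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Each ccp_alpha corresponds to one tree in the pruning sequence.
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train).ccp_alphas

best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in alphas),
    key=lambda t: t.score(X_val, y_val),   # validation accuracy decides
)
print("leaves in best pruned tree:", best.get_n_leaves())
```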
Metrics for Performance Evaluation
Focus on the predictive capability of a model
Rather than how long it takes to classify or build models, scalability, etc.
Confusion Matrix:

                    PREDICTED CLASS
                    Class=Yes  Class=No
ACTUAL  Class=Yes   a          b
CLASS   Class=No    c          d

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance
Evaluation…

                    PREDICTED CLASS
                    Class=Yes  Class=No
ACTUAL  Class=Yes   a (TP)     b (FN)
CLASS   Class=No    c (FP)     d (TN)

Most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
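A minimal sketch checking the formula against scikit-learn, using made-up labels:

```python
# Accuracy from the confusion-matrix counts a, b, c, d.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"]
y_pred = ["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes"]

# labels=... fixes row/column order to [Yes, No] as in the matrix above.
(a, b), (c, d) = confusion_matrix(y_true, y_pred, labels=["Yes", "No"])
print("accuracy:", (a + d) / (a + b + c + d))       # (TP+TN)/(TP+TN+FP+FN)
print("check   :", accuracy_score(y_true, y_pred))  # same value
```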
Precision-Recall
                    PREDICTED CLASS
                    Class=Yes  Class=No
ACTUAL  Class=Yes   a (TP)     b (FN)
CLASS   Class=No    c (FP)     d (TN)

Precision (p) = a / (a + c) = TP / (TP + FP)
Recall (r) = a / (a + b) = TP / (TP + FN)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c) = 2TP / (2TP + FP + FN)

(F is the harmonic mean of p and r: F = 1 / ((1/r + 1/p) / 2).)
Precision is biased towards C(Yes|Yes) & C(Yes|No)
Recall is biased towards C(Yes|Yes) & C(No|Yes)
F-measure is biased towards all except C(No|No)
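The same made-up labels give a quick check of precision, recall, and F-measure:

```python
# Precision, recall, and F-measure from counts, cross-checked with sklearn.
from sklearn.metrics import precision_score, recall_score, f1_score

TP, FN, FP, TN = 3, 1, 1, 3     # a, b, c, d from the matrix above

p = TP / (TP + FP)              # precision
r = TP / (TP + FN)              # recall
f = 2 * r * p / (r + p)         # harmonic mean = 2TP / (2TP + FP + FN)
print(p, r, f)

y_true = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"]
y_pred = ["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes"]
print(precision_score(y_true, y_pred, pos_label="Yes"),
      recall_score(y_true, y_pred, pos_label="Yes"),
      f1_score(y_true, y_pred, pos_label="Yes"))
```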
Methods for Performance Evaluation
How to obtain a reliable estimate of performance?
Performance of a model may depend on other factors besides the
learning algorithm:
Class distribution
Cost of misclassification
Size of training and test sets
Methods of Estimation
Holdout
Reserve 2/3 for training and 1/3 for testing
Random subsampling
A single random split may be biased, so the holdout is repeated several times (repeated holdout)
Cross validation
Partition data into k disjoint subsets
k-fold: train on k-1 partitions, test on the
remaining one
Leave-one-out: k=n
Guarantees that each record is used the same number of times for training
and testing
Bootstrap
Sampling with replacement
~63.2% of records used for training, ~36.8% for testing (a record is left out of a bootstrap sample with probability (1 - 1/n)^n ≈ e^-1 ≈ 0.368)
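A minimal sketch of 10-fold cross-validation with scikit-learn on synthetic data:

```python
# k-fold cross-validation: each record is used k-1 times for training
# and exactly once for testing.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("10-fold accuracies:", scores.round(3))
print("mean / std        :", scores.mean().round(3), scores.std().round(3))
```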
Model Selection: ROC Curves
ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
Originated from signal detection theory
Shows the trade-off between the true positive rate (vertical axis) and the false positive rate (horizontal axis)
The plot also shows a diagonal line, corresponding to random guessing
The area under the ROC curve is a measure of the accuracy of the model
Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
A model with perfect accuracy will have an area of 1.0
The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
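A minimal sketch tracing an ROC curve and its area with scikit-learn on synthetic data:

```python
# Rank test tuples by predicted probability of the positive class,
# then trace TPR vs FPR and compute the area under the curve.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]      # P(positive class) per tuple

fpr, tpr, _ = roc_curve(y_te, scores)         # points on the ROC curve
print("AUC:", roc_auc_score(y_te, scores))    # 1.0 = perfect, 0.5 = random
```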
Issues Affecting Model Selection
Accuracy
classifier accuracy: predicting class label
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as
decision tree size or compactness of classification rules