Debre Tabor University
Gafat Institute of Technology
Department of Computer Science
Introduction to Data Mining & Warehousing
For 4th Year IT Computer Science students
Instructor: Habtu Hailu (PhD)
November, 24
Chapter 04
Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by Backpropagation
Prediction
Classification accuracy
Summary
Classification vs. Prediction
Classification
predicts categorical class labels (discrete or nominal)
classifies data (constructs a model) based on the training
set and the values (class labels) in a classifying attribute
and uses it in classifying new data
Prediction
models continuous-valued functions, i.e., predicts
unknown or missing values
Typical applications
Credit approval
Target marketing
Medical diagnosis
Fraud detection
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate the accuracy of the model
The known label of each test sample is compared with the classified result from the model
The accuracy rate is the percentage of test set samples that are correctly classified by the model
The test set is independent of the training set; otherwise over-fitting will occur
If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Process (1): Model Construction

Training data are fed to the classification algorithm, which outputs the classifier (model).

    NAME  RANK            YEARS  TENURED
    Mike  Assistant Prof  3      no
    Mary  Assistant Prof  7      yes
    Bill  Professor       2      yes
    Jim   Associate Prof  7      yes
    Dave  Assistant Prof  6      no
    Anne  Associate Prof  3      no

Learned model (as a classification rule):

    IF rank = ‘professor’ OR years > 6
    THEN tenured = ‘yes’
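The learned rule above can be sketched as a tiny Python classifier (the function name and value spellings are illustrative assumptions, not from the slides):

```python
# A minimal sketch of the model produced in step 1: the single
# classification rule learned from the tenure training data.

def predict_tenured(rank, years):
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    if rank == "Professor" or years > 6:
        return "yes"
    return "no"

# The rule reproduces the training labels:
print(predict_tenured("Professor", 2))       # yes (Bill)
print(predict_tenured("Assistant Prof", 7))  # yes (Mary)
print(predict_tenured("Assistant Prof", 6))  # no  (Dave)
```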
Process (2): Using the Model in Prediction

The classifier is first run against testing data (independent of the training set) to estimate accuracy, then applied to unseen data.

    Testing data:
    NAME     RANK            YEARS  TENURED
    Tom      Assistant Prof  2      no
    Merlisa  Associate Prof  7      no
    George   Professor       5      yes
    Joseph   Assistant Prof  7      yes

    Unseen data: (Jeff, Professor, 4)
    Tenured? The learned rule predicts ‘yes’ (rank = ‘professor’).
Supervised vs. Unsupervised
Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of training data are unknown
Given a set of measurements, observations, etc.
with the aim of establishing the existence of
classes or clusters in the data
Issues: Data Preparation
Data cleaning
Preprocess data in order to reduce noise and
handle missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
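As a concrete illustration of the transformation step, a minimal min-max normalization sketch (the attribute values below are made up):

```python
# Minimal sketch of min-max normalization to [0.0, 1.0], a common
# data-transformation step before classification.

def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [25, 35, 45, 55]
print(min_max_normalize(ages))  # [0.0, 0.333..., 0.666..., 1.0]
```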
Issues: Evaluating Classification
Methods
Accuracy
classifier accuracy: predicting the class label
predictor accuracy: estimating the value of the predicted attribute
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Classification by Decision Tree
Induction
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases
Tree construction
At start, all the training examples are at the root
Partition examples recursively based on selected
attributes
Tree pruning
Identify and remove branches that reflect noise or
outliers
Use of decision tree: Classifying an unknown sample
Test the attribute values of the sample against the
decision tree
Decision Tree Induction: Training
Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
This follows an example from Quinlan’s ID3 (Playing Tennis).
Output: A Decision Tree for “buys_computer”

    age?
    |-- <=30: student?
    |         |-- no:  no
    |         |-- yes: yes
    |-- 31..40: yes
    |-- >40: credit rating?
              |-- excellent: no
              |-- fair:      yes
Algorithm for Decision Tree
Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-
conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are
discretized in advance)
Examples are partitioned recursively based on selected
attributes
Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning
– majority voting is employed for classifying the leaf
There are no samples left
Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain.
Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|.

Expected information (entropy) needed to classify a tuple in D:

    Info(D) = - Σ_{i=1..m} p_i log2(p_i)

Information needed (after using A to split D into v partitions) to classify D:

    Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × I(D_j)

Information gained by branching on attribute A:

    Gain(A) = Info(D) - Info_A(D)
Attribute Selection: Information Gain

Using the training data above:
Class P: buys_computer = “yes” (9 tuples)
Class N: buys_computer = “no” (5 tuples)

    Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Partitioning on age:

    age      p_i  n_i  I(p_i, n_i)
    <=30     2    3    0.971
    31…40    4    0    0
    >40      3    2    0.971

    Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

The term (5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence

    Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly,

    Gain(income) = 0.029
    Gain(student) = 0.151
    Gain(credit_rating) = 0.048
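The gain computation can be reproduced with a short script (a sketch, not part of the original slides) that recomputes entropy and information gain for each attribute of the buys_computer data:

```python
import math
from collections import Counter

# Rows: (age, income, student, credit_rating, buys_computer),
# transcribed from the training table above.
data = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31..40","high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31..40","low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31..40","medium", "no",  "excellent", "yes"),
    ("31..40","high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]
ATTRS = ["age", "income", "student", "credit_rating"]

def entropy(labels):
    """Info(D) = -sum p_i log2(p_i) over the class distribution."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attr):
    """Gain(A) = Info(D) - Info_A(D) for the attribute at index attr."""
    labels = [r[-1] for r in rows]
    parts = {}
    for r in rows:
        parts.setdefault(r[attr], []).append(r[-1])
    info_a = sum(len(p) / len(rows) * entropy(p) for p in parts.values())
    return entropy(labels) - info_a

for i, a in enumerate(ATTRS):
    print(f"Gain({a}) = {gain(data, i):.4f}")
# Gain(age) = 0.2467, Gain(income) = 0.0292,
# Gain(student) = 0.1518, Gain(credit_rating) = 0.0481
```

The exact values are 0.2467, 0.0292, 0.1518, and 0.0481; the slide's 0.246 and 0.151 are the same quantities truncated rather than rounded. Since age has the highest gain, it becomes the root test of the tree shown earlier.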
Classification by
Backpropagation
Backpropagation: A neural network learning
algorithm
Started by psychologists and neurobiologists to
develop and test computational analogues of
neurons
A neural network: A set of connected input/output
units where each connection has a weight
associated with it
During the learning phase, the network learns
by adjusting the weights so as to be able to
predict the correct class label of the input tuples
Neural Network as a
Classifier
Weakness
Long training time
Require a number of parameters that are typically best determined empirically, e.g., the network topology or “structure”
Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and of “hidden units” in the network
Strength
High tolerance to noisy data
Ability to classify untrained patterns
Well-suited for continuous-valued inputs and outputs
Successful on a wide array of real-world data
Algorithms are inherently parallel
Techniques have recently been developed for the
extraction of rules from trained neural networks
A Neuron (= a Perceptron)

The n-dimensional input vector x is mapped into the output variable y by means of the scalar product with the weight vector w and a nonlinear activation function f. With weighted sum, bias μ_k, and sign activation:

    y = sign( Σ_{i=0..n} w_i x_i - μ_k )
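As a sketch, the unit above in Python. The weights and threshold below are hand-picked assumptions (chosen so the unit computes logical AND); none of these values come from the slides:

```python
# Sketch of the perceptron above: weighted sum of the inputs, minus
# a bias mu, passed through a sign activation.

def perceptron(x, w, mu):
    """y = sign(sum_i w_i * x_i - mu)."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - mu
    return 1 if s >= 0 else -1

print(perceptron([1, 1], [1.0, 1.0], 1.5))  # 1  (both inputs on)
print(perceptron([1, 0], [1.0, 1.0], 1.5))  # -1
print(perceptron([0, 0], [1.0, 1.0], 1.5))  # -1
```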
A Multi-Layer Feed-Forward Neural Network

The input vector X enters the input layer, passes through weighted connections to the hidden layer, and on to the output layer, which produces the output vector. For a unit j with inputs O_i from the previous layer, weights w_ij, bias θ_j, target output T_j, and learning rate l:

    I_j = Σ_i w_ij O_i + θ_j                  (net input)
    O_j = 1 / (1 + e^(-I_j))                  (sigmoid output)
    Err_j = O_j (1 - O_j) (T_j - O_j)         (error at an output unit)
    Err_j = O_j (1 - O_j) Σ_k Err_k w_jk      (error at a hidden unit)
    w_ij = w_ij + (l) Err_j O_i               (weight update)
    θ_j = θ_j + (l) Err_j                     (bias update)
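A minimal numeric sketch of one backpropagation update for a tiny 2-1-1 network, following the equations above (all weights, inputs, and the learning rate are made-up values):

```python
import math

def sigmoid(I):
    return 1.0 / (1.0 + math.exp(-I))

# Made-up initial weights/biases: one hidden unit h, one output unit o
x = [1.0, 0.0]                       # input vector
w_h, theta_h = [0.5, -0.3], 0.1      # weights/bias into hidden unit
w_o, theta_o = 0.4, -0.2             # weight/bias into output unit
l, target = 0.9, 1.0                 # learning rate, desired output T

# Forward pass: I_j = sum_i w_ij * O_i + theta_j; O_j = sigmoid(I_j)
O_h = sigmoid(x[0] * w_h[0] + x[1] * w_h[1] + theta_h)
O_o = sigmoid(O_h * w_o + theta_o)

# Backward pass: errors per the equations above
err_o = O_o * (1 - O_o) * (target - O_o)   # output-layer error
err_h = O_h * (1 - O_h) * (err_o * w_o)    # hidden-layer error

# Updates: w_ij += l * Err_j * O_i;  theta_j += l * Err_j
w_o += l * err_o * O_h
theta_o += l * err_o
w_h[0] += l * err_h * x[0]
w_h[1] += l * err_h * x[1]
theta_h += l * err_h

# After the update, the prediction has moved toward the target:
new_O_o = sigmoid(sigmoid(x[0]*w_h[0] + x[1]*w_h[1] + theta_h) * w_o + theta_o)
print(O_o, "->", new_O_o)
```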
How Does a Multi-Layer Neural Network Work?
The inputs to the network correspond to the attributes
measured for each training tuple
Inputs are fed simultaneously into the units making up the
input layer
They are then weighted and fed simultaneously to a
hidden layer
The number of hidden layers is arbitrary, although usually
only one
The weighted outputs of the last hidden layer are input to
units making up the output layer, which emits the
network's prediction
The network is feed-forward in that none of the weights
cycles back to an input unit or to an output unit of a
previous layer
Defining a Network Topology

First decide the network topology: the number of units in the input layer, the number of hidden layers (if > 1), the number of units in each hidden layer, and the number of units in the output layer
Normalize the input values of each attribute measured in the training tuples to [0.0, 1.0]
For a discrete attribute, one input unit per domain value, each initialized to 0
For classification with more than two classes, one output unit per class is used
If a trained network's accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights
What Is Prediction?
(Numerical) prediction is similar to classification
construct a model
use model to predict continuous or ordered value for a
given input
Prediction is different from classification
Classification predicts a categorical class label
Prediction models continuous-valued functions
Major method for prediction: regression
model the relationship between one or more independent
or predictor variables and a dependent or response
variable
Regression analysis
Linear and multiple regression
Non-linear regression
Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
Linear Regression
Linear regression: involves a response variable y and a single
predictor variable x
y = w0 + w1 x
where w0 (y-intercept) and w1 (slope) are regression
coefficients
Multiple linear regression: involves more than one predictor
variable
Training data is of the form (X1, y1), (X2, y2),…, (X|D|, y|D|)
Ex. For 2-D data, we may have: y = w0 + w1 x1+ w2 x2
Solvable by an extension of the least-squares method, or by using statistical software such as SAS or S-Plus
Many nonlinear functions can be transformed into the
above
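The single-predictor least-squares fit can be sketched in a few lines (the data points below are made up so that the fit is exact):

```python
# Sketch of fitting y = w0 + w1*x by the least-squares method.

def fit_line(xs, ys):
    """Least-squares estimates for intercept w0 and slope w1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    w1 = num / den
    w0 = my - w1 * mx
    return w0, w1

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # exactly y = 1 + 2x
w0, w1 = fit_line(xs, ys)
print(w0, w1)  # 1.0 2.0
```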
Nonlinear Regression

Some nonlinear models can be modeled by a polynomial function
A polynomial regression model can be transformed into a linear regression model. For example,

    y = w0 + w1 x + w2 x^2 + w3 x^3

is convertible to linear form with the new variables x2 = x^2 and x3 = x^3:

    y = w0 + w1 x + w2 x2 + w3 x3

Other functions, such as the power function, can also be transformed to a linear model
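A small sketch of the substitution itself (the coefficients are made up): treating x2 = x^2 and x3 = x^3 as new variables leaves the model's predictions unchanged, which is why ordinary multiple linear regression applies:

```python
# The cubic model and its "linearized" form compute the same value;
# only the view of the variables changes. Coefficients are made up.

def cubic(x, w):                   # y = w0 + w1*x + w2*x^2 + w3*x^3
    return w[0] + w[1]*x + w[2]*x**2 + w[3]*x**3

def linear_in_new_vars(x, w):      # same model, linear in (x, x2, x3)
    x2, x3 = x**2, x**3
    return w[0] + w[1]*x + w[2]*x2 + w[3]*x3

w = [1.0, -2.0, 0.5, 0.25]
print(cubic(2.0, w), linear_in_new_vars(2.0, w))  # 1.0 1.0
```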
Classifier Accuracy Measures

Confusion matrix (actual class vs. predicted class):

                 predicted C1     predicted C2
    actual C1    true positive    false negative
    actual C2    false positive   true negative

Example (buys_computer):

    classes                buys_computer = yes   buys_computer = no   total   recognition (%)
    buys_computer = yes    6954                  46                   7000    99.34
    buys_computer = no     412                   2588                 3000    86.27
    total                  7366                  2634                 10000   95.42

Accuracy of a classifier M, acc(M): percentage of test set tuples that are correctly classified by the model M
Error rate (misclassification rate) of M = 1 - acc(M)
Given m classes, CM_{i,j}, an entry in a confusion matrix, indicates the number of tuples in class i that are labeled by the classifier as class j
Alternative accuracy measures (e.g., for cancer diagnosis):

    sensitivity = t-pos / pos                 /* true positive recognition rate */
    specificity = t-neg / neg                 /* true negative recognition rate */
    precision   = t-pos / (t-pos + f-pos)
    accuracy    = sensitivity × pos/(pos + neg) + specificity × neg/(pos + neg)

This model can also be used for cost-benefit analysis
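The measures above can be checked directly against the buys_computer confusion matrix (a small sketch; the variable names are mine):

```python
# Accuracy measures computed from the buys_computer confusion matrix:
# tp = 6954, fn = 46, fp = 412, tn = 2588.

tp, fn, fp, tn = 6954, 46, 412, 2588
pos, neg = tp + fn, fp + tn          # 7000 actual yes, 3000 actual no

accuracy    = (tp + tn) / (pos + neg)
error_rate  = 1 - accuracy
sensitivity = tp / pos               # true positive recognition rate
specificity = tn / neg               # true negative recognition rate
precision   = tp / (tp + fp)

print(f"accuracy={accuracy:.4f} error_rate={error_rate:.4f}")
print(f"sensitivity={sensitivity:.4f} specificity={specificity:.4f} "
      f"precision={precision:.4f}")
# accuracy=0.9542 sensitivity=0.9934 specificity=0.8627 precision=0.9441
```

Note that accuracy also satisfies the weighted form given above: sensitivity × pos/(pos+neg) + specificity × neg/(pos+neg) = 0.9934 × 0.7 + 0.8627 × 0.3 ≈ 0.9542.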