Unit-3
1
Unit – III : Classification
◼ Basic Concepts, Decision Tree Induction, Bayes
Classification Methods, Rule-Based Classification,
Metrics for Evaluating Classifier Performance,
Ensemble Methods, Multilayer Feed-Forward Neural
Network, Support Vector Machines, k-Nearest-
Neighbor Classifiers.
2
Classification
◼ Basic Concepts
◼ Decision Tree Induction
◼ Bayes Classification Methods
◼ Rule-Based Classification
◼ Metrics for Evaluating Classifier Performance
◼ Ensemble Methods
◼ Multilayer Feed-Forward Neural Network
◼ Support Vector Machines
◼ k-Nearest-Neighbor Classifiers.
3
Supervised vs. Unsupervised Learning
◼ Supervised learning (classification): the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation; new data is classified based on the training set
◼ Unsupervised learning (clustering): the class labels of the training data are unknown; given a set of measurements or observations, the aim is to establish the existence of classes or clusters in the data
5
Classification—A Two-Step Process
◼ Model construction: describing a set of predetermined classes
◼ Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute; the set of tuples used for model construction is the training set
◼ The model is represented as classification rules, decision trees, or mathematical formulae
◼ Model usage: for classifying future or unknown objects
◼ Estimate accuracy of the model: the known label of each test tuple is compared with the model's prediction; accuracy is the percentage of test set tuples that are correctly classified
◼ Note: If the test set is used to select models, it is called validation (test) set
6
Process (1): Model Construction
The classification algorithm takes the training data and produces the classifier (model):

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned model:
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
7
Process (2): Using the Model in Prediction
The classifier is applied first to testing data (to estimate accuracy) and then to unseen data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4)  →  Tenured? yes
8
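To make the two-step process concrete, here is a minimal sketch (not part of the original slides) assuming scikit-learn is available; RANK is hand-coded as an integer and NAME is dropped because it is not predictive.

# Minimal sketch of the two-step classification process (assumes scikit-learn).
from sklearn.tree import DecisionTreeClassifier

rank_code = {"Assistant Prof": 0, "Associate Prof": 1, "Professor": 2}

# Step 1: model construction from the training data (NAME is not predictive and is dropped).
train_X = [[rank_code["Assistant Prof"], 3], [rank_code["Assistant Prof"], 7],
           [rank_code["Professor"], 2],      [rank_code["Associate Prof"], 7],
           [rank_code["Assistant Prof"], 6], [rank_code["Associate Prof"], 3]]
train_y = ["no", "yes", "yes", "yes", "no", "no"]
model = DecisionTreeClassifier(random_state=0).fit(train_X, train_y)

# Step 2: model usage -- classify the unseen tuple (Jeff, Professor, 4).
print(model.predict([[rank_code["Professor"], 4]]))   # expected: ['yes'] (rank = professor)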
Classification
◼ Basic Concepts
◼ Decision Tree Induction
◼ Bayes Classification Methods
◼ Rule-Based Classification
◼ Metrics for Evaluating Classifier Performance
◼ Ensemble Methods
◼ Multilayer Feed-Forward Neural Network
◼ Support Vector Machines
◼ k-Nearest-Neighbor Classifiers.
9
Decision Tree Induction: An Example
❑ Training data set: buys_computer (the AllElectronics customer data)
❑ The data set follows an example of Quinlan’s ID3 (Playing Tennis)

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

❑ Resulting tree:
age?
├─ <=30  → student?        (no → no, yes → yes)
├─ 31…40 → yes
└─ >40   → credit_rating?  (excellent → no, fair → yes)
10
Algorithm for Decision Tree Induction
◼ Basic algorithm (a greedy algorithm)
◼ Tree is constructed in a top-down recursive divide-and-conquer manner
◼ At start, all the training examples are at the root
◼ Attributes are categorical (if continuous-valued, they are discretized in advance)
◼ Examples are partitioned recursively based on selected attributes
◼ Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
12
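The greedy, top-down, divide-and-conquer strategy can be sketched as follows. This is an illustrative ID3-style implementation (not from the slides), assuming categorical attributes and tuples represented as Python dicts.

# ID3-style sketch of greedy, top-down, recursive divide-and-conquer tree induction.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Gain(A) = Info(D) - Info_A(D), where D is split by the values of attribute A.
    total, n, split_info = entropy(labels), len(rows), 0.0
    for value in set(r[attr] for r in rows):
        subset = [lbl for r, lbl in zip(rows, labels) if r[attr] == value]
        split_info += (len(subset) / n) * entropy(subset)
    return total - split_info

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:                 # all tuples in one class -> leaf
        return labels[0]
    if not attrs:                             # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(r[best] for r in rows):  # partition on the selected attribute and recurse
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attrs if a != best])
    return tree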
Attribute Selection Measure:
Information Gain (ID3/C4.5)
◼ Select the attribute with the highest information gain
◼ Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|
◼ Expected information (entropy) needed to classify a tuple in D:
   Info(D) = −Σi=1..m pi log2(pi)
◼ Information needed (after using attribute A to split D into v partitions):
   InfoA(D) = Σj=1..v (|Dj|/|D|) × Info(Dj)
◼ Information gained by branching on attribute A:
   Gain(A) = Info(D) − InfoA(D)
13
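As a sanity check on the measure, the following sketch computes Info(D), Info_age(D), and Gain(age) for the 14-tuple buys_computer table shown earlier (class counts 9 “yes” / 5 “no”; age partitions with 2/3, 4/0, and 3/2 class counts).

import math

def info(counts):
    # Info(D) = -sum p_i * log2(p_i) over the class counts.
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

# Whole data set D: 9 tuples buy a computer, 5 do not.
info_D = info([9, 5])                                        # ≈ 0.940 bits

# Partition D on age: <=30 has (2 yes, 3 no), 31..40 has (4 yes, 0 no), >40 has (3 yes, 2 no).
partitions = [[2, 3], [4, 0], [3, 2]]
info_age = sum((sum(p) / 14) * info(p) for p in partitions)  # ≈ 0.694 bits

gain_age = info_D - info_age                                 # ≈ 0.246 bits -> age is selected
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))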
Attribute Selection: Information Gain
◼ Class P: buys_computer = “yes” (9 tuples); Class N: buys_computer = “no” (5 tuples)
◼ Gain(age) = 0.246 is the highest gain among the four attributes, so age is selected as the splitting attribute
◼ Gain ratio (C4.5) overcomes the bias toward attributes with many values: GainRatio(A) = Gain(A)/SplitInfo(A); the attribute with the maximum gain ratio is selected
◼ Overfitting: an induced tree may overfit the training data; too many branches may reflect anomalies due to noise or outliers, giving poor accuracy for unseen samples
23
Scalability Framework for RainForest
◼ Separates the scalability aspects from the criteria that determine the quality of the tree
◼ Builds an AVC-set (Attribute-Value, Class_label) for each attribute at each tree node, so that the AVC-sets, rather than the raw data, are what must fit in memory
24
Classification
◼ Basic Concepts
◼ Decision Tree Induction
◼ Bayes Classification Methods
◼ Rule-Based Classification
◼ Metrics for Evaluating Classifier Performance
◼ Ensemble Methods
◼ Multilayer Feed-Forward Neural Network
◼ Support Vector Machines
◼ k-Nearest-Neighbor Classifiers.
25
Bayesian Classification: Why?
◼ A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
◼ Foundation: Based on Bayes’ Theorem.
◼ Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
◼ Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
◼ Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
26
Bayes’ Theorem: Basics
◼ Total probability theorem: P(B) = Σi P(B|Ai) P(Ai)
◼ Bayes’ theorem: P(H|X) = P(X|H) P(H) / P(X)
◼ Example: let X be a 35-year-old customer with medium income and H the hypothesis that the customer will buy a computer; then P(H|X) is the probability that the customer buys a computer given X
27
Prediction Based on Bayes’ Theorem
◼ Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes’ theorem:
   P(H|X) = P(X|H) P(H) / P(X)   (posteriori = likelihood × prior / evidence)
◼ Predicts that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes
28
Classification Is to Derive the Maximum Posteriori
◼ Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute vector
X = (x1, x2, …, xn)
◼ Suppose there are m classes C1, C2, …, Cm.
◼ Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
◼ This can be derived from Bayes’ theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X)
◼ Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized
29
Naïve Bayes Classifier
◼ A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
   P(X|Ci) = Πk=1..n P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
◼ This greatly reduces the computation cost: Only counts the class
distribution
◼ If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk
for Ak divided by |Ci, D| (# of tuples of Ci in D)
◼ If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
   g(x, μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²)),  and  P(xk|Ci) = g(xk, μCi, σCi)
30
Naïve Bayes Classifier: Training Dataset
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data to be classified:
X = (age <=30, Income = medium, Student = yes, Credit_rating = fair)

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
31
Naïve Bayes Classifier: An Example
◼ P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643; P(buys_computer = “no”) = 5/14 = 0.357
◼ Compute P(X|Ci) for each class (see the sketch below):
   P(X|“yes”) = 2/9 × 4/9 × 6/9 × 6/9 ≈ 0.044;  P(X|“no”) = 3/5 × 2/5 × 1/5 × 2/5 ≈ 0.019
◼ P(X|Ci) P(Ci): “yes” ≈ 0.028, “no” ≈ 0.007; therefore X belongs to class buys_computer = “yes”
◼ Zero-probability problem: if an attribute value never occurs with some class, the whole product becomes zero; the Laplacian correction (adding 1 to each count) avoids this, and the “corrected” probability estimates are close to their “uncorrected” counterparts
33
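The calculation above can be reproduced with a few lines of arithmetic; the sketch below (illustration only) uses the relative-frequency estimates read directly from the training table.

# Naive Bayes by hand for X = (age<=30, income=medium, student=yes, credit_rating=fair),
# using the class counts from the 14-tuple buys_computer training table.
p_yes, p_no = 9 / 14, 5 / 14                     # prior probabilities P(Ci)

# P(x_k | Ci) estimated by relative frequency within each class:
px_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)   # age<=30, medium, student, fair | yes ≈ 0.044
px_no  = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)   # age<=30, medium, student, fair | no  ≈ 0.019

score_yes = px_yes * p_yes                       # ≈ 0.028
score_no  = px_no  * p_no                        # ≈ 0.007
print("yes" if score_yes > score_no else "no")   # -> "yes": X is predicted to buy a computer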
Naïve Bayes Classifier: Comments
◼ Advantages
◼ Easy to implement
◼ Disadvantages
◼ Assumption: class conditional independence, therefore loss
of accuracy
◼ Practically, dependencies exist among variables (e.g., a patient’s profile, symptoms, and disease); such dependencies cannot be modeled by a naïve Bayes classifier
◼ How to deal with these dependencies? Bayesian Belief Networks
34
Classification
◼ Basic Concepts
◼ Decision Tree Induction
◼ Bayes Classification Methods
◼ Rule-Based Classification
◼ Metrics for Evaluating Classifier Performance
◼ Ensemble Methods
◼ Multilayer Feed-Forward Neural Network
◼ Support Vector Machines
◼ k-Nearest-Neighbor Classifiers.
35
Using IF-THEN Rules for Classification
◼ Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
◼ Rule antecedent/precondition vs. rule consequent
◼ One rule is created for each path from the root to a leaf
◼ Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
   (Decision tree: age? → <=30: student? (no → no, yes → yes); 31…40: yes; >40: credit_rating? (excellent → no, fair → yes))
◼ Rules are mutually exclusive and exhaustive
◼ Example: Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
37
Rule Induction: Sequential Covering Method
◼ Sequential covering algorithm: Extracts rules directly from training
data
◼ Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
◼ Rules are learned sequentially, each for a given class Ci will cover
many tuples of Ci but none (or few) of the tuples of other classes
◼ Steps:
◼ Rules are learned one at a time
◼ Each time a rule is learned, the tuples covered by the rules are
removed
◼ Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the
quality of a rule returned is below a user-specified threshold
◼ Comparison with decision-tree induction: a decision tree effectively learns a set of rules simultaneously, whereas sequential covering learns one rule at a time
38
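A minimal sketch of the sequential-covering loop (not from the slides); learn_one_rule and rule_quality are assumed helper functions, e.g., greedy rule growth scored with FOIL-gain as described on the following slides.

# Sequential covering: learn rules for class c one at a time, removing covered tuples.
# `learn_one_rule`, `rule_quality`, and `rule.covers` are assumed helpers (illustration only).
def sequential_covering(tuples, labels, target_class, learn_one_rule, rule_quality,
                        min_quality=0.5):
    rules = []
    remaining = [(x, y) for x, y in zip(tuples, labels)]
    while any(y == target_class for _, y in remaining):          # positive tuples still uncovered
        rule = learn_one_rule(remaining, target_class)           # grow one rule for the class
        if rule is None or rule_quality(rule, remaining) < min_quality:
            break                                                # rule too poor -> terminate
        rules.append(rule)
        remaining = [(x, y) for x, y in remaining if not rule.covers(x)]  # remove covered tuples
    return rules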
Sequential Covering Algorithm
[Figure: the positive examples are successively covered — one region covered by Rule 1, another by Rule 2, another by Rule 3]
39
Rule Generation
◼ To generate a rule
while(true)
find the best predicate p
if foil-gain(p) > threshold then add p to current rule
else break
[Figure: general-to-specific rule growth — starting from A3=1, the rule is extended to A3=1 && A1=2 and then to A3=1 && A1=2 && A8=5, covering more positive examples while excluding negative examples]
40
How to Learn-One-Rule?
◼ Start with the most general rule possible: condition = empty
◼ Adding new attributes by adopting a greedy depth-first strategy
◼ Picks the one that most improves the rule quality
◼ Rule-quality measure: FOIL_Gain assesses the information gained by extending the rule’s condition:
   FOIL_Gain = pos′ × (log2(pos′/(pos′+neg′)) − log2(pos/(pos+neg)))
   It favors rules that have high accuracy and cover many positive tuples
◼ Rule pruning based on an independent set of test tuples:
   FOIL_Prune(R) = (pos − neg)/(pos + neg); if FOIL_Prune is higher for the pruned version of R, then prune R
42
Model Evaluation and Selection
◼ Evaluation metrics: How can we measure accuracy? Other
metrics to consider?
◼ Use validation test set of class-labeled tuples instead of
training set when assessing accuracy
◼ Methods for estimating a classifier’s accuracy:
◼ Holdout method, random subsampling
◼ Cross-validation
◼ Bootstrap
◼ Comparing classifiers:
◼ Confidence intervals
◼ Cost-benefit analysis and ROC Curves
43
Classifier Evaluation Metrics: Confusion
Matrix
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)
45
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
◼ Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive: precision = TP / (TP + FP)
◼ Recall: completeness – what % of positive tuples the classifier labeled as positive: recall = TP / (TP + FN)
◼ F measure (F1 score): harmonic mean of precision and recall: F = 2 × precision × recall / (precision + recall)
46
Classifier Evaluation Metrics: Example
47
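As a worked illustration (the counts below are hypothetical, not taken from the slides), the metrics can be computed directly from the confusion-matrix entries:

# Classifier evaluation metrics from a confusion matrix (hypothetical counts for illustration).
TP, FN, FP, TN = 90, 210, 140, 9560

accuracy  = (TP + TN) / (TP + TN + FP + FN)        # overall recognition rate
precision = TP / (TP + FP)                         # exactness: % of predicted positives that are positive
recall    = TP / (TP + FN)                         # completeness (sensitivity): % of positives found
f1        = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")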
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
◼ Holdout method
◼ Given data is randomly partitioned into two independent sets: a training set (e.g., 2/3) for model construction and a test set (e.g., 1/3) for accuracy estimation
◼ Random subsampling: a variation of holdout — repeat the holdout k times and take the average accuracy
◼ Cross-validation (k-fold, where k = 10 is most popular): randomly partition the data into k mutually exclusive subsets of approximately equal size; at the i-th iteration, use Di as the test set and the remaining subsets as the training set
49
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
◼ Suppose we have 2 classifiers, M1 and M2, which one is better?
◼ These mean error rates are just estimates of error on the true
population of future data cases
50
Estimating Confidence Intervals:
Null Hypothesis
◼ Perform 10-fold cross-validation
◼ Assume samples follow a t distribution with k–1 degrees of
freedom (here, k=10)
◼ Use t-test (or Student’s t-test)
◼ Null Hypothesis: M1 & M2 are the same
◼ If we can reject null hypothesis, then
◼ we conclude that the difference between M 1 & M2 is
statistically significant
◼ Choose the model with the lower error rate
51
Estimating Confidence Intervals: t-test
◼ If only one test set is available: pairwise comparison — for the i-th round of 10-fold cross-validation, the same cross-partitioning is used for M1 and M2, and the difference of their error rates is taken:
   t = (err(M1) − err(M2)) / √(var(M1 − M2) / k),  where err(·) is the mean error rate over the k rounds
◼ If two independent test sets are available: use an unpaired t-test with
   var(M1 − M2) = var(M1)/k1 + var(M2)/k2,
   where k1 & k2 are # of cross-validation samples used for M1 & M2, resp.
52
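A sketch of the pairwise comparison, assuming SciPy is available and using made-up per-fold error rates: the same 10-fold cross-partitioning gives one error rate per fold for each model, and a paired t-test decides whether the mean difference is significant.

# Paired t-test over 10-fold cross-validation error rates (hypothetical numbers).
from scipy import stats

err_m1 = [0.12, 0.10, 0.14, 0.11, 0.13, 0.12, 0.15, 0.10, 0.11, 0.12]  # per-fold error of M1
err_m2 = [0.15, 0.14, 0.16, 0.13, 0.15, 0.17, 0.16, 0.14, 0.15, 0.16]  # per-fold error of M2

t_stat, p_value = stats.ttest_rel(err_m1, err_m2)   # t distribution with k-1 = 9 degrees of freedom
if p_value < 0.05:                                   # significance level sig = 5%
    print("Reject H0: the difference between M1 and M2 is statistically significant")
else:
    print("Cannot reject H0: any observed difference may be due to chance")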
Estimating Confidence Intervals:
Table for t-distribution
◼ Symmetric
◼ Significance level,
e.g., sig = 0.05 or
5% means M1 & M2
are significantly
different for 95% of
population
◼ Confidence limit, z
= sig/2
53
Estimating Confidence Intervals:
Statistical Significance
◼ Are M1 & M2 significantly different?
◼ Compute t. Select significance level (e.g., sig = 5%)
◼ Consult the table for the t-distribution: find the t value corresponding to k−1 degrees of freedom (here, 9) at confidence limit z = sig/2
◼ If t > z or t < −z, the t value lies in the rejection region: reject the null hypothesis that the mean error rates of M1 & M2 are the same
◼ Conclude: statistically significant difference between M1
& M2
◼ Otherwise, conclude that any difference is chance
54
Model Selection: ROC Curves
◼ ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
◼ Originated from signal detection theory
◼ Shows the trade-off between the true positive rate and the false positive rate
◼ The area under the ROC curve is a measure of the accuracy of the model
◼ Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
◼ The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate; the plot also shows a diagonal line
◼ A model with perfect accuracy has an area of 1.0; the closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
55
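The ranking procedure described above can be sketched as follows (illustration only, with hypothetical scores and labels); the area under the curve is approximated with the trapezoidal rule.

# Build ROC points and approximate AUC by ranking tuples by decreasing positive-class probability.
def roc_points(scores, labels):                      # labels: 1 = positive, 0 = negative
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    P = sum(labels)
    N = len(labels) - P
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in ranked:                              # move the threshold down the ranked list
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))              # (FPR, TPR) after including this tuple
    return points

def auc(points):                                     # trapezoidal approximation of the area
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical probabilities and true labels for ten test tuples:
scores = [0.90, 0.80, 0.70, 0.60, 0.55, 0.54, 0.53, 0.51, 0.40, 0.30]
labels = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
print(f"AUC ≈ {auc(roc_points(scores, labels)):.2f}")   # 1.0 = perfect, 0.5 = diagonal (random)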
Issues Affecting Model Selection
◼ Accuracy
◼ classifier accuracy: predicting class label
◼ Speed
◼ time to construct the model (training time)
◼ time to use the model (classification/prediction time)
◼ Robustness: handling noise and missing values
◼ Scalability: efficiency in disk-resident databases
◼ Interpretability
◼ understanding and insight provided by the model
◼ Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
56
Classification
◼ Basic Concepts
◼ Decision Tree Induction
◼ Bayes Classification Methods
◼ Rule-Based Classification
◼ Metrics for Evaluating Classifier Performance
◼ Ensemble Methods
◼ Multilayer Feed-Forward Neural Network
◼ Support Vector Machines
◼ k-Nearest-Neighbor Classifiers.
57
Ensemble Methods: Increasing the Accuracy
◼ Ensemble methods
◼ Use a combination of models to increase accuracy: combine a series of k learned models, M1, M2, …, Mk, to create an improved model M*
◼ Popular ensemble methods
◼ Bagging: averaging the prediction over a collection of classifiers
◼ Boosting: weighted vote with a collection of classifiers
58
Bagging: Bootstrap Aggregation
◼ Analogy: Diagnosis based on multiple doctors’ majority vote
◼ Training
◼ Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample)
◼ A classifier model Mi is learned for each training set Di
◼ Classification: to classify an unknown tuple X, each classifier Mi returns its class prediction, which counts as one vote
◼ The bagged classifier M* counts the votes and assigns the class with the most votes to X
◼ Prediction: can be applied to the prediction of continuous values by taking
the average value of each prediction for a given test tuple
◼ Accuracy
◼ Often significantly better than a single classifier derived from D
61
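A minimal bagging sketch (not from the slides), using a scikit-learn decision tree as a stand-in base learner; any classifier could be substituted.

# Bagging sketch: train each classifier on a bootstrap sample of D, then take a majority vote.
import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier      # stand-in base learner

def bagging(X, y, k=10, seed=0):
    rng = random.Random(seed)
    d = len(X)
    models = []
    for _ in range(k):
        idx = [rng.randrange(d) for _ in range(d)]    # sample d tuples with replacement from D
        Xi = [X[i] for i in idx]
        yi = [y[i] for i in idx]
        models.append(DecisionTreeClassifier().fit(Xi, yi))
    return models

def bagged_predict(models, x):
    votes = [m.predict([x])[0] for m in models]       # each classifier Mi casts one vote
    return Counter(votes).most_common(1)[0][0]        # class with the most votes wins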
Random Forest (Breiman 2001)
◼ Random Forest:
◼ Each classifier in the ensemble is a decision tree classifier, generated using a random selection of attributes at each node to determine the split
◼ During classification, each tree votes and the most popular class is returned
◼ Two Methods to construct Random Forest:
◼ Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology
is used to grow the trees to maximum size
◼ Forest-RC (random linear combinations): creates new attributes (or features) that are linear combinations of the existing attributes, which reduces the correlation between the individual classifiers
64
Classification by Backpropagation
[Figure: a multilayer feed-forward network — the input vector X enters the input layer, weighted connections wij feed a hidden layer, and the output layer emits the output vector]
67
How A Multi-Layer Neural Network Works
◼ The inputs to the network correspond to the attributes measured for each
training tuple
◼ Inputs are fed simultaneously into the units making up the input layer
◼ They are then weighted and fed simultaneously to a hidden layer
◼ The number of hidden layers is arbitrary, although usually only one
◼ The weighted outputs of the last hidden layer are input to units making up
the output layer, which emits the network's prediction
◼ The network is feed-forward: None of the weights cycles back to an input
unit or to an output unit of a previous layer
◼ From a statistical point of view, networks perform nonlinear regression:
Given enough hidden units and enough training samples, they can closely
approximate any function
68
Defining a Network Topology
◼ Decide the network topology: Specify # of units in the input
layer, # of hidden layers (if > 1), # of units in each hidden layer,
and # of units in the output layer
◼ Normalize the input values for each attribute measured in the
training tuples to [0.0—1.0]
◼ Discrete-valued attributes may be encoded with one input unit per domain value, each initialized to 0
◼ Output: for classification with more than two classes, one output unit per class is used
◼ Once a network has been trained and its accuracy is
unacceptable, repeat the training process with a different
network topology or a different set of initial weights
69
Backpropagation
◼ Iteratively process a set of training tuples & compare the network's prediction
with the actual known target value
◼ For each training tuple, the weights are modified to minimize the mean
squared error between the network's prediction and the actual target value
◼ Modifications are made in the “backwards” direction: from the output layer,
through each hidden layer down to the first hidden layer, hence
“backpropagation”
◼ Steps
◼ Initialize weights to small random numbers, associated with biases
◼ Propagate the inputs forward (by applying activation function)
◼ Backpropagate the error (by updating weights and biases)
◼ Terminating condition (when error is very small, etc.)
70
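The four steps above can be sketched for a single hidden layer of sigmoid units; this is an illustrative NumPy implementation (not from the slides) trained on a tiny XOR-style data set, using squared error and a fixed learning rate.

# Minimal backpropagation sketch for one hidden layer with sigmoid units (illustration only).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # toy training tuples
y = np.array([[0], [1], [1], [0]], dtype=float)               # target values (XOR)

# Initialize weights and biases to small random numbers.
W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros((1, 4))
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros((1, 1))
lr = 0.5

for epoch in range(10000):
    # Propagate the inputs forward through the hidden and output layers.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backpropagate the error: output layer first, then the hidden layer.
    err_out = (out - y) * out * (1 - out)          # squared-error gradient * sigmoid derivative
    err_hid = (err_out @ W2.T) * h * (1 - h)

    # Update weights and biases in the "backwards" direction.
    W2 -= lr * h.T @ err_out;  b2 -= lr * err_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ err_hid;  b1 -= lr * err_hid.sum(axis=0, keepdims=True)

print(np.round(out.ravel(), 2))   # predictions approach [0, 1, 1, 0] as training converges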
Neuron: A Hidden/Output Layer Unit
◼ An n-dimensional input vector x is mapped to the output y by means of a weighted sum, a bias (threshold) θk, and a nonlinear activation function f, for example:
   y = sign( Σi=0..n wi xi − θk )
[Figure: inputs x0 … xn with weights w0 … wn feed a weighted sum; the bias θk is subtracted and the activation function f produces the output y]
72
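For a single unit, the computation in the figure reduces to a weighted sum, a subtracted bias, and an activation function; the values below are hypothetical.

# The unit's output: y = sign( sum_i w_i * x_i - theta_k ), as in the formula above.
import numpy as np

x = np.array([1.0, 0.5, -0.3])         # inputs x0..xn (hypothetical values)
w = np.array([0.4, 0.2, 0.7])          # corresponding weights w0..wn
theta_k = 0.2                          # bias (threshold) of the unit

weighted_sum = np.dot(w, x) - theta_k
y = np.sign(weighted_sum)              # activation: sign function -> output is +1 or -1
print(weighted_sum, y)                 # 0.4*1.0 + 0.2*0.5 + 0.7*(-0.3) - 0.2 = 0.09 -> +1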
Classification
◼ Basic Concepts
◼ Decision Tree Induction
◼ Bayes Classification Methods
◼ Rule-Based Classification
◼ Metrics for Evaluating Classifier Performance
◼ Ensemble Methods
◼ Multilayer Feed-Forward Neural Network
◼ Support Vector Machines
◼ k-Nearest-Neighbor Classifiers.
73
Classification: A Mathematical Mapping
◼ Classification can be viewed as deriving a mapping f: X → Y from a feature vector x ∈ ℝⁿ to a class label y ∈ {+1, −1}
◼ Criticism of such discriminative classifiers (e.g., neural networks)
◼ Long training time
◼ The learned function (weights) is difficult for humans to interpret
75
SVM—Support Vector Machines
◼ A relatively new classification method for both linear and
nonlinear data
◼ It uses a nonlinear mapping to transform the original training
data into a higher dimension
◼ With the new dimension, it searches for the linear optimal
separating hyperplane (i.e., “decision boundary”)
◼ With an appropriate nonlinear mapping to a sufficiently high
dimension, data from two classes can always be separated by a
hyperplane
◼ SVM finds this hyperplane using support vectors (“essential”
training tuples) and margins (defined by the support vectors)
76
SVM—History and Applications
◼ Vapnik and colleagues (1992) — groundwork from Vapnik & Chervonenkis’ statistical learning theory in the 1960s
◼ Features: training can be slow, but accuracy is high owing to the ability to model complex nonlinear decision boundaries (margin maximization)
◼ Used for: classification and numeric prediction
◼ Applications: handwritten digit recognition, object recognition, speaker identification
77
SVM—General Philosophy
78
SVM—Margins and Support Vectors
79
SVM—When Data Is Linearly Separable
Let the data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples with associated class labels yi.
There are an infinite number of lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data).
SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH).
80
SVM—Linearly Separable
◼ A separating hyperplane can be written as
W●X+b=0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
◼ For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
◼ The hyperplane defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
◼ Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
◼ This becomes a constrained (convex) quadratic optimization
problem: Quadratic objective function and linear constraints →
Quadratic Programming (QP) → Lagrangian multipliers
81
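A small sketch (assuming scikit-learn and a toy separable data set) that fits a linear SVM and reads off the weight vector W, the bias b, the support vectors, and the margin 2/||W||:

# Fitting a linear SVM; the separating hyperplane is W . X + b = 0 and the margin is 2 / ||W||.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)  # toy 2-D data
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # large C approximates a hard margin

W = clf.coef_[0]                               # weight vector (w1, w2)
b = clf.intercept_[0]                          # bias
margin = 2.0 / np.linalg.norm(W)               # distance between the hyperplanes H1 and H2
print("W =", W, "b =", b, "margin =", margin)
print("support vectors:", clf.support_vectors_)  # the tuples lying on H1 / H2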
Why Is SVM Effective on High Dimensional Data?
◼ The complexity of the trained classifier is characterized by the number of support vectors rather than by the dimensionality of the data
◼ The support vectors are the essential or critical training examples — they lie closest to the decision boundary (MMH)
◼ Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high
82
SVM—Linearly Inseparable
◼ Transform the original input data into a higher dimensional space
◼ Search for a linear separating hyperplane in the new space
[Figure: data plotted on attributes A1 and A2 that cannot be separated by a straight line]
83
SVM: Different Kernel functions
◼ Instead of computing the dot product on the transformed
data, it is math. equivalent to applying a kernel function
K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi)·Φ(Xj)
◼ Typical kernel functions:
   Polynomial of degree h: K(Xi, Xj) = (Xi·Xj + 1)^h
   Gaussian radial basis function: K(Xi, Xj) = exp(−‖Xi − Xj‖² / 2σ²)
   Sigmoid: K(Xi, Xj) = tanh(κ Xi·Xj − δ)
Scaling SVM by Hierarchical Micro-Clustering
◼ SVM is not scalable to the number of data objects in terms of training time and memory usage
◼ H. Yu, J. Yang, and J. Han, “Classifying Large Data Sets Using SVM with
Hierarchical Clusters”, KDD'03)
◼ CB-SVM (Clustering-Based SVM)
◼ Given limited amount of system resources (e.g., memory), maximize
the SVM performance in terms of accuracy and the training speed
◼ Use micro-clustering to effectively reduce the number of points to be
considered
◼ At deriving support vectors, de-cluster micro-clusters near “candidate
vector” to ensure high classification accuracy
85
CF-Tree: Hierarchical Micro-cluster
◼ Read the data set once, construct a statistical summary of the data
(i.e., hierarchical clusters) given a limited amount of memory
◼ Micro-clustering: Hierarchical indexing structure
◼ provide finer samples closer to the boundary and coarser
samples farther from the boundary
86
Selective Declustering: Ensure High Accuracy
87
CB-SVM Algorithm: Outline
◼ Construct two CF-trees from positive and negative data sets
independently
◼ Need one scan of the data set
88
Accuracy and Scalability on Synthetic Dataset
91
Classification
◼ Basic Concepts
◼ Decision Tree Induction
◼ Bayes Classification Methods
◼ Rule-Based Classification
◼ Metrics for Evaluating Classifier Performance
◼ Ensemble Methods
◼ Multilayer Feed-Forward Neural Network
◼ Support Vector Machines
◼ k-Nearest-Neighbor Classifiers
92
Lazy vs. Eager Learning
◼ Lazy vs. eager learning
◼ Lazy learning (e.g., instance-based learning): Simply stores
training data (or only minor processing) and waits until it is
given a test tuple
◼ Eager learning (the above discussed methods): Given a set of
training tuples, constructs a classification model before
receiving new (e.g., test) data to classify
◼ Lazy: less time in training but more time in predicting
◼ Accuracy
◼ Lazy method effectively uses a richer hypothesis space since
it uses many local linear functions to form an implicit global
approximation to the target function
◼ Eager: must commit to a single hypothesis that covers the
entire instance space
93
Lazy Learner: Instance-Based Methods
◼ Instance-based learning:
◼ Store training examples and delay the processing (“lazy
evaluation”) until a new instance must be classified
◼ Typical approaches
◼ k-nearest neighbor approach
◼ Case-based reasoning: uses symbolic representations and knowledge-based inference
94
The k-Nearest Neighbor Algorithm
◼ All instances correspond to points in the n-D space
◼ The nearest neighbors are defined in terms of Euclidean
distance, dist(X1, X2)
◼ Target function could be discrete- or real- valued
◼ For discrete-valued, k-NN returns the most common value
among the k training examples nearest to xq
◼ Voronoi diagram: the decision surface induced by 1-NN for
a typical set of training examples
[Figure: 1-NN decision regions (Voronoi diagram) around a query point xq, with positive (+) and negative (−) training examples]
95
Discussion on the k-NN Algorithm
◼ k-NN is sensitive to noisy or irrelevant attributes, since the distance can be dominated by them; attribute scaling or elimination of irrelevant attributes helps
◼ Distance-weighted nearest-neighbor: give greater weight to closer neighbors, e.g., w = 1 / d(xq, xi)²
◼ Averaging over the k nearest neighbors makes the classifier more robust to noisy data
97
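A minimal k-NN sketch (not from the slides) using Euclidean distance and a majority vote over the k closest training tuples; the 2-D data below is hypothetical.

# k-nearest-neighbor classification with Euclidean distance and majority vote (illustration).
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(train_X, train_y, xq, k=3):
    # Rank all training points by distance to the query tuple xq and keep the k closest.
    neighbors = sorted(zip(train_X, train_y), key=lambda p: euclidean(p[0], xq))[:k]
    # For a discrete-valued target, return the most common class among the k neighbors.
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Hypothetical 2-D example:
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = ["-", "-", "-", "+", "+", "+"]
print(knn_classify(train_X, train_y, (2, 2), k=3))   # -> "-" (the closest neighbors are negative)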
End of Unit - III
98