UNIT-III
Classification
Classification:
predicts categorical class labels
classifies data (constructs a model) based on the training set and the values (class labels) of a
classifying attribute, and uses the model to classify new data
Prediction:
models continuous-valued functions, i.e., predicts unknown or missing values
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the
class label attribute
The set of tuples used for model construction: training set
The model is represented as classification rules, decision trees, or mathematical
formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of a test sample is compared with the classified result from the
model
Accuracy rate is the percentage of test set samples that are correctly classified
by the model
Test set is independent of training set, otherwise over-fitting will occur
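A minimal sketch of this two-step process in Python, assuming scikit-learn and its bundled iris data (both the library and the dataset are choices of this example, not part of the notes):

```python
# Step 1: model construction on a training set; Step 2: model usage on an
# independent test set, with accuracy as the evaluation measure.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: learn the model from labeled training tuples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: classify the held-out test set and estimate the accuracy rate.
y_pred = model.predict(X_test)
print("accuracy rate:", accuracy_score(y_test, y_pred))
```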
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are accompanied by
labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc. with the aim of establishing the
existence of classes or clusters in the data
Issues regarding classification and prediction (1): Data Preparation
Data cleaning
Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
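A small illustration of the normalization step in Python (the attribute values below are hypothetical; any numeric column of the training data could be used):

```python
# Min-max and z-score normalization of one numeric attribute.
values = [12.0, 45.0, 30.0, 7.0, 60.0]          # hypothetical attribute values

lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]            # rescaled to [0, 1]

mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
z_scores = [(v - mean) / std for v in values]               # zero mean, unit variance
```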
Evaluating Classification Methods
Predictive accuracy
Speed and scalability
time to construct the model
time to use the model
Robustness
handling noise and missing values
Scalability
efficiency in disk-resident databases
Interpretability:
understanding and insight provided by the model
Goodness of rules
decision tree size
compactness of classification rules
Classification by Decision Tree Induction
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases
Tree construction
At start, all the training examples are at the root
Partition examples recursively based on selected attributes
Tree pruning
Identify and remove branches that reflect noise or outliers
Use of decision tree: Classifying an unknown sample
Test the attribute values of the sample against the decision tree
Training Dataset
Output: A Decision Tree for “buys_computer”
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g.,
information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning – majority voting is employed
for classifying the leaf
There are no samples left
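The greedy procedure above can be written down compactly. A minimal sketch (the helper names are hypothetical; categorical attributes only, with information gain, introduced in the next subsection, as the selection measure):

```python
# Top-down, recursive, divide-and-conquer tree construction.
# Usage: build_tree(list_of_dicts, list_of_labels, list_of_attribute_names)
import math
from collections import Counter

def info(labels):
    """Expected information (entropy) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def build_tree(rows, labels, attributes):
    if not labels:
        return None                       # no samples left
    if len(set(labels)) == 1:
        return labels[0]                  # all samples belong to the same class
    if not attributes:
        return Counter(labels).most_common(1)[0][0]   # majority voting at the leaf

    def gain(a):                          # information gain of splitting on a
        subsets = {}
        for row, lab in zip(rows, labels):
            subsets.setdefault(row[a], []).append(lab)
        expected = sum(len(s) / len(labels) * info(s) for s in subsets.values())
        return info(labels) - expected

    best = max(attributes, key=gain)      # greedy attribute selection
    node = {}
    for value in set(row[best] for row in rows):
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        node[(best, value)] = build_tree([r for r, _ in sub], [l for _, l in sub],
                                         [a for a in attributes if a != best])
    return node
```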
Attribute Selection Measure
Information gain (ID3/C4.5)
All attributes are assumed to be categorical
Can be modified for continuous-valued attributes
Gini index (IBM IntelligentMiner)
All attributes are assumed continuous-valued
Assume there exist several possible split values for each attribute
May need other tools, such as clustering, to get the possible split values
Can be modified for categorical attributes
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Assume there are two classes, P and N
Let the set of examples S contain p elements of class P and n elements of class N
The amount of information needed to decide if an arbitrary example in S belongs to P
or N is defined as
I(p, n) = −(p/(p+n))·log2(p/(p+n)) − (n/(p+n))·log2(n/(p+n))
Information Gain in Decision Tree Induction
Assume that using attribute A a set S will be partitioned into sets {S1, S2 , …, Sv}
If Si contains pi examples of P and ni examples of N, the entropy, or the expected
information needed to classify objects in all subtrees Si is
E(A) = Σ_{i=1..v} ((pi + ni)/(p + n)) · I(pi, ni)
The encoding information that would be gained by branching on A
Gain(A) = I(p, n) − E(A)
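The three formulas above transcribe directly into Python (a minimal sketch for the two-class case, with I, E, and gain mirroring the notation used here):

```python
import math

def I(p, n):
    """Expected information I(p, n); 0*log2(0) is treated as 0."""
    total = p + n
    return sum(-(x / total) * math.log2(x / total) for x in (p, n) if x)

def E(partition):
    """E(A) for a split on A; partition is a list of (p_i, n_i) pairs, one per subset S_i."""
    p = sum(pi for pi, _ in partition)
    n = sum(ni for _, ni in partition)
    return sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in partition)

def gain(p, n, partition):
    """Information gain of branching on A: Gain(A) = I(p, n) - E(A)."""
    return I(p, n) - E(partition)
```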
Attribute Selection by Information Gain Computation
Class P: buys_computer = “yes” (9 examples)
Class N: buys_computer = “no” (5 examples)
I(p, n) = I(9, 5) = 0.940
Compute the entropy for age:

age       pi   ni   I(pi, ni)
<=30      2    3    0.971
30…40     4    0    0
>40       3    2    0.971

E(age) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.69
Hence
Gain(age) = I(p, n) − E(age)
Similarly
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
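The Gain(age) computation above can be reproduced with a few lines of Python (the per-branch counts are taken from the age table; the notes quote E(age) rounded to 0.69, the unrounded gain is about 0.247):

```python
import math

def info(p, n):
    """I(p, n) for the two-class case; 0*log2(0) is treated as 0."""
    total = p + n
    return sum(-(x / total) * math.log2(x / total) for x in (p, n) if x)

age_branches = [(2, 3), (4, 0), (3, 2)]        # (<=30), (30...40), (>40)

I_pn = info(9, 5)                                                 # ~0.940
E_age = sum((p + n) / 14 * info(p, n) for p, n in age_branches)   # ~0.69
print(f"Gain(age) = {I_pn - E_age:.3f}")                          # ~0.247
```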
Gini Index (IBM IntelligentMiner)
If a data set T contains examples from n classes, the gini index gini(T) is defined as
gini(T) = 1 − Σ_{j=1..n} pj², where pj is the relative frequency of class j in T.
If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index
of the split data is defined as
gini_split(T) = (N1/N)·gini(T1) + (N2/N)·gini(T2)
The attribute that provides the smallest gini_split(T) is chosen to split the node (all possible
splitting points for each attribute need to be enumerated).
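The two Gini formulas above, written out in Python (the example split at the end is hypothetical; it only reuses the 9 “yes” / 5 “no” class totals from the buys_computer data):

```python
def gini(class_counts):
    """gini(T) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(counts_T1, counts_T2):
    """Weighted gini index of splitting T into T1 and T2."""
    n1, n2 = sum(counts_T1), sum(counts_T2)
    n = n1 + n2
    return n1 / n * gini(counts_T1) + n2 / n * gini(counts_T2)

# Hypothetical binary split of a node holding 9 "yes" and 5 "no" examples:
print(gini_split([6, 1], [3, 4]))     # smaller values indicate purer splits
```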
Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand
Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
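Tree libraries can also print these paths directly. A hedged sketch using scikit-learn's export_text (the library and dataset are assumptions of this example); it lists every root-to-leaf path as nested conditions ending in a class, which read off as IF-THEN rules:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# One printed line per attribute test along each root-to-leaf path.
print(export_text(clf, feature_names=list(load_iris().feature_names)))
```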
Avoid Overfitting in Classification
The generated tree may overfit the training data
Too many branches, some may reflect anomalies due to noise or outliers
Results in poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: Halt tree construction early—do not split a node if this would result in the
goodness measure falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown” tree—get a sequence of
progressively pruned trees
Use a set of data different from the training data to decide which is the “best
pruned tree”
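A hedged sketch of both ideas with scikit-learn (library and dataset are assumptions of this example): prepruning via a goodness threshold set up front, and postpruning via cost-complexity pruning with a held-out set deciding the “best pruned tree”:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Prepruning alternative: halt splitting early with a threshold, e.g.
# DecisionTreeClassifier(min_impurity_decrease=0.01).

# Postpruning: compute the pruning path of a fully grown tree, fit one tree per
# pruning level, and keep the one that does best on data not used for training.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val),
)
print(best.get_n_leaves(), best.score(X_val, y_val))
```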
Approaches to Determine the Final Tree Size
Separate training (2/3) and testing (1/3) sets
Use cross validation, e.g., 10-fold cross validation
Use all the data for training
but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a
node may improve the entire distribution
Use minimum description length (MDL) principle:
halting growth of the tree when the encoding is minimized
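For the cross-validation approach, a minimal sketch (scikit-learn and its breast-cancer dataset are assumptions of this example): pick the tree size, here controlled by depth, that scores best under 10-fold cross validation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Mean 10-fold cross-validation accuracy for each candidate tree depth.
scores = {
    depth: cross_val_score(DecisionTreeClassifier(max_depth=depth, random_state=0),
                           X, y, cv=10).mean()
    for depth in range(1, 11)
}
best_depth = max(scores, key=scores.get)
print(best_depth, round(scores[best_depth], 3))
```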
Enhancements to basic decision tree induction
Allow for continuous-valued attributes
Dynamically define new discrete-valued attributes that partition the continuous
attribute value into a discrete set of intervals
Handle missing attribute values
Assign the most common value of the attribute
Assign probability to each of the possible values
Attribute construction
Create new attributes based on existing ones that are sparsely represented
This reduces fragmentation, repetition, and replication
Classification in Large Databases
Classification—a classical problem extensively studied by statisticians and machine learning
researchers
Scalability: Classifying data sets with millions of examples and hundreds of attributes with
reasonable speed
Why decision tree induction in data mining?
relatively faster learning speed (than other classification methods)
convertible to simple and easy to understand classification rules
can use SQL queries for accessing databases
comparable classification accuracy with other methods
Scalable Decision Tree Induction Methods in Data Mining Studies
SLIQ (EDBT’96 — Mehta et al.)
builds an index for each attribute; only the class list and the current attribute list reside
in memory
SPRINT (VLDB’96 — J. Shafer et al.)
constructs an attribute list data structure
PUBLIC (VLDB’98 — Rastogi & Shim)
integrates tree splitting and tree pruning: stop growing the tree earlier
RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
separates the scalability aspects from the criteria that determine the quality of the tree
builds an AVC-list (attribute, value, class label)
Data Cube-Based Decision-Tree Induction
Integration of generalization with decision-tree induction (Kamber et al., 1997).
Classification at primitive concept levels
E.g., precise temperature, humidity, outlook, etc.
Low-level concepts, scattered classes, bushy classification-trees
Semantic interpretation problems.
Cube-based multi-level classification
Relevance analysis at multi-levels.
Information-gain analysis with dimension + level.
Bayesian Classification
Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most practical
approaches to certain types of learning problems
Incremental: Each training example can incrementally increase/decrease the probability that a
hypothesis is correct. Prior knowledge can be combined with observed data.
Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
Standard: Even when Bayesian methods are computationally intractable, they can provide a
standard of optimal decision making against which other methods can be measured
Bayesian Theorem
Given training data D, the posterior probability of a hypothesis h, P(h | D), follows from Bayes’ theorem:
P(h | D) = P(D | h) · P(h) / P(D)
MAP (maximum a posteriori) hypothesis:
h_MAP = argmax_{h ∈ H} P(h | D) = argmax_{h ∈ H} P(D | h) · P(h)
Practical difficulty: require initial knowledge of many probabilities, significant computational
cost
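A small worked example of Bayes' theorem in Python, using counts that appear later in this unit's play-tennis data (P(p) = 9/14, P(n) = 5/14, P(sunny | p) = 2/9, P(sunny | n) = 3/5):

```python
# Posterior of each class given the single observation outlook = sunny.
prior = {"p": 9 / 14, "n": 5 / 14}
likelihood_sunny = {"p": 2 / 9, "n": 3 / 5}

unnormalized = {h: likelihood_sunny[h] * prior[h] for h in prior}   # P(D|h)P(h)
evidence = sum(unnormalized.values())                               # P(D)
posterior = {h: v / evidence for h, v in unnormalized.items()}      # P(h|D)

h_map = max(posterior, key=posterior.get)     # MAP hypothesis: largest P(D|h)P(h)
print(posterior, "MAP:", h_map)               # posterior = {p: 0.4, n: 0.6}
```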
Naïve Bayes Classifier
A simplified assumption: attributes are conditionally independent:
P(Cj | V) ∝ P(Cj) · Π_{i=1..n} P(vi | Cj)
Greatly reduces the computation cost; only the class distributions need to be counted.
Given a training set, we can compute the probabilities
Outlook       P     N        Humidity     P     N
sunny         2/9   3/5      high         3/9   4/5
overcast      4/9   0        normal       6/9   1/5
rain          3/9   2/5

Temperature   P     N        Windy        P     N
hot           2/9   2/5      true         3/9   3/5
mild          4/9   2/5      false        6/9   2/5
cool          3/9   1/5
Bayesian classification
The classification problem may be formalized using a-posteriori probabilities:
P(C|X) = prob. that the sample tuple X=<x1,…,xk> is of class C.
E.g. P(class=N | outlook=sunny,windy=true,…)
Idea: assign to sample X the class label C such that P(C|X) is maximal
Estimating a-posteriori probabilities
Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
P(X) is constant for all classes
P(C) = relative freq of class C samples
C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum
Problem: computing P(X|C) directly is infeasible!
Naïve Bayesian Classification
Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
If i-th attribute is categorical:
P(xi|C) is estimated as the relative freq of samples having value xi as i-th attribute in class C
If i-th attribute is continuous:
P(xi|C) is estimated through a Gaussian density function
Computationally easy in both cases
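A sketch of the continuous case: estimate a class-conditional Gaussian from the values of the attribute observed in class C, then evaluate its density at xi (the temperature readings below are hypothetical, since the play-tennis data here is categorical):

```python
import math

def gaussian_density(x, values):
    """Density of a Gaussian fitted to the class-conditional attribute values."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / (len(values) - 1)   # sample variance
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

temps_in_class_P = [21.0, 24.0, 19.5, 22.0, 23.5]   # hypothetical training values
print(gaussian_density(22.5, temps_in_class_P))     # used as P(x_i | C) in the product
```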
Play-tennis example: estimating P(xi|C)
Outlook Temperature Humidity Windy Class
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
Play-tennis example: classifying X
An unseen sample X = <rain, hot, high, false>
P(X|p)·P(p) =
P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582
P(X|n)·P(n) =
P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286
Sample X is classified in class n (don’t play)
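The whole example can be reproduced directly from the play-tennis table above: count relative frequencies per class, then take the naive-Bayes product P(x1|C)···P(xk|C)·P(C) for C in {P, N}.

```python
from collections import Counter, defaultdict

data = [  # (outlook, temperature, humidity, windy, class) -- rows of the table
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
]

class_counts = Counter(row[-1] for row in data)
cond = defaultdict(Counter)                  # cond[class][(attr_index, value)] = count
for row in data:
    for i, value in enumerate(row[:-1]):
        cond[row[-1]][(i, value)] += 1

def score(x, c):
    p = class_counts[c] / len(data)                      # P(C)
    for i, value in enumerate(x):
        p *= cond[c][(i, value)] / class_counts[c]       # P(x_i | C)
    return p

x = ("rain", "hot", "high", "false")
print(score(x, "P"), score(x, "N"))    # ~0.010582 and ~0.018286 -> classified as n
```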
The independence hypothesis…
… makes computation possible
… yields optimal classifiers when satisfied
… but is seldom satisfied in practice, as attributes (variables) are often correlated.
Attempts to overcome this limitation:
Bayesian networks, which combine Bayesian reasoning with causal relationships between
attributes
Decision trees, which reason on one attribute at a time, considering the most important
attributes first
Bayesian Belief Networks (I)