UNIT-III
Classification
There are two forms of data analysis that can be used to extract models describing important classes or
predict future data trends. These two forms are as follows:
Classification
Prediction
We use classification and prediction to extract models that represent data classes or predict future
data trends.
Classification models predict categorical class labels, whereas prediction models predict
continuous-valued functions.
Classification is a form of data analysis that extracts models describing important data classes. Such
models, called classifiers, predict categorical (discrete, unordered) class labels. For example, we can
build a classification model to categorize bank loan applications as either safe or risky.
Why Classification?
A bank loans officer needs analysis of her data to learn which loan applicants are “safe” and
which are “risky” for the bank. A marketing manager at AllElectronics needs data analysis to help guess
whether a customer with a given profile will buy a new computer.
A medical researcher wants to analyze breast cancer data to predict which one of three specific
treatments a patient should receive. In each of these examples, the data analysis task is classification,
where a model or classifier is constructed to predict class (categorical) labels, such as “safe” or “risky”
for the loan application data; “yes” or “no” for the marketing data; or “treatment A,” “treatment B,” or
“treatment C” for the medical data.
Suppose that the marketing manager wants to predict how much a given customer will
spend during a sale at AllElectronics. This data analysis task is an example of numeric prediction,
where the model constructed predicts a continuous-valued function, or ordered value, as opposed to a
class label. This model is a predictor.
Regression analysis is a statistical methodology that is most often used for numeric prediction; hence the
two terms tend to be used synonymously, although other methods for numeric prediction exist.
Classification and numeric prediction are the two major types of prediction problems.
General Approach for Classification:
Classification is a two-step process involving,
Learning Step: the step in which the classification model is constructed. In this phase,
training data are analyzed by a classification algorithm.
Classification Step: the step in which the model is used to predict class labels for given data.
In this phase, test data are used to estimate the accuracy of the classification rules.
Data classification is a two-step process, consisting of a learning step (where a classification model is
constructed) and a classification step (where the model is used to predict class labels for given data).
In the first step, a classifier is built describing a predetermined set of data classes or concepts.
This is the learning step (or training phase), where a classification algorithm builds the
classifier by analyzing or “learning from” a training set made up of database tuples and their
associated class labels.
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label
attribute
In the second step, the model is used for classification. First, the predictive accuracy of the
classifier is estimated.
Accuracy rate is the percentage of test set samples that are correctly classified by the
model
Difference between Classification and Prediction
1. Classification is the process of identifying which category a new observation belongs to, based on
a training data set containing observations whose category membership is known. Prediction is the
process of identifying the missing or unavailable numerical data for a new observation.
2. In classification, the accuracy depends on finding the class label correctly. In prediction, the
accuracy depends on how well a given predictor can guess the value of a predicted attribute for
new data.
3. In classification, the model is known as the classifier. In prediction, the model is known as the
predictor.
4. In classification, a model or classifier is constructed to find categorical labels. In prediction, a
model or predictor is constructed that predicts a continuous-valued function or ordered value.
5. For example, grouping patients based on their medical records can be considered classification.
Prediction can be thought of as, for example, predicting the correct treatment for a particular
disease for a person.
Decision Tree Induction:
Decision tree induction is the learning of decision trees from class-labeled training tuples.
⮚ A decision tree is a flowchart-like tree structure,
⮚ Where each internal node (non leaf node) denotes a test on an attribute,
⮚ Each branch represents an outcome of the test, and
⮚ Each leaf node (or terminal node) holds a class label.
⮚ The topmost node in a tree is the root node.
⮚ Internal nodes are denoted by rectangles,
⮚ and leaf nodes are denoted by ovals.
The following decision tree is for the concept buy_computer that indicates whether a customer at a
company is likely to buy a computer or not. Each internal node represents a test on an attribute.
Each leaf node represents a class.
The benefits of having a decision tree are as follows −
It does not require any domain knowledge.
It is easy to comprehend.
The learning and classification steps of a decision tree are simple and fast.
“How are decision trees used for classification?” Given a tuple, X, for which the associated class
label is unknown, the attribute values of the tuple are tested against the decision tree.
A path is traced from the root to a leaf node, which holds the class prediction for that tuple.
Decision trees can easily be converted to classification rules.
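The following is a minimal Python sketch (not from the text) of this traversal: the buys_computer tree
from the worked example later in this unit is written as a nested dictionary, and an unseen tuple is
classified by tracing a path from the root to a leaf. The dictionary layout and the classify helper are
assumptions made purely for illustration.

# Hypothetical nested-dictionary encoding of the buys_computer decision tree.
tree = {
    "attribute": "age",
    "branches": {
        "youth":       {"attribute": "student",
                        "branches": {"no": {"label": "no"}, "yes": {"label": "yes"}}},
        "middle_aged": {"label": "yes"},
        "senior":      {"attribute": "credit_rating",
                        "branches": {"fair": {"label": "yes"}, "excellent": {"label": "no"}}},
    },
}

def classify(node, tuple_x):
    """Trace a path from the root to a leaf and return the class label it holds."""
    while "label" not in node:                  # internal node: test an attribute
        outcome = tuple_x[node["attribute"]]    # outcome of the test for this tuple
        node = node["branches"][outcome]        # follow the branch for that outcome
    return node["label"]                        # leaf node holds the class prediction

X = {"age": "youth", "student": "yes", "credit_rating": "fair"}
print(classify(tree, X))                        # -> "yes"

Each root-to-leaf path in this structure corresponds directly to one classification rule (e.g., IF
age=youth AND student=yes THEN buys_computer=yes).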
“Why are decision tree classifiers so popular?” The construction of decision tree classifiers does
not require any domain knowledge or parameter setting, and therefore is appropriate for
exploratory knowledge discovery.
Decision trees can handle multidimensional data. Their representation of acquired knowledge in tree
form is intuitive and generally easy to assimilate by humans.
The learning and classification steps of decision tree induction are simple and fast.
Decision tree induction algorithms have been used for classification in many application areas such
as medicine, manufacturing and production, financial analysis, astronomy, and molecular biology.
Decision trees are the basis of several commercial rule induction systems.
During tree construction, attribute selection measures are used to select the attribute that best
partitions the tuples into distinct classes. When decision trees are built, many of the branches may
reflect noise or outliers in the training data.
Tree pruning attempts to identify and remove such branches, with the goal of improving
classification accuracy on unseen data.
o During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning,
developed a decision tree algorithm known as ID3 (Iterative Dichotomiser).
o This work expanded on earlier work on concept learning systems, described by E. B. Hunt, J.
Marin, and P. T. Stone. Quinlan later presented C4.5 (a successor of ID3), which became a
benchmark to which newer supervised learning algorithms are often compared.
o In 1984, a group of statisticians (L. Breiman, J. Friedman, R. Olshen, and C. Stone) published
the book Classification and Regression Trees (CART), which described the generation of binary
decision trees.
Decision Tree Algorithm:
Algorithm: Generate decision tree. Generate a decision tree from the training tuples of
data partition, D.
Input:
▪ Data partition, D, which is a set of training tuples and their associated class labels;
▪ attribute list, the set of candidate attributes;
▪ Attribute selection method, a procedure to determine the splitting
criterion that “best” partitions the data tuples into individual classes.
This criterion consists of a splitting attribute and, possibly, either a
split-point or splitting subset.
Output: A decision tree.
Method:
1) create a node N;
2) if tuples in D are all of the same class, C, then
3) return N as a leaf node labeled with the class C;
4) if attribute list is empty then
5) return N as a leaf node labeled with the majority class in D; // majority
voting
6) apply Attribute selection method(D, attribute list) to
find the “best” splitting criterion;
7) label node N with splitting criterion;
8) if splitting attribute is discrete-valued and
multiway splits allowed then // not restricted to binary trees
9) attribute list ← attribute list − splitting attribute; // remove splitting attribute
10) for each outcome j of splitting criterion
// partition the tuples and grow subtrees for each partition
11) let Dj be the set of data tuples in D satisfying outcome j; // a partition
12) if Dj is empty then
13) attach a leaf labeled with the majority class in D to node N;
14) else attach the node returned by Generate decision tree(Dj , attribute list) to
node N;
endfor
15) return N;
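The pseudocode above can be sketched in Python as follows. This is a hedged illustration, not the
textbook's reference implementation: tuples are assumed to be (attribute-dictionary, class-label) pairs,
and Attribute selection method is passed in as an ordinary function (for example, the information-gain
selector sketched in the attribute selection section below).

from collections import Counter

def generate_decision_tree(D, attribute_list, select_attribute, multiway=True):
    """Sketch of Generate_decision_tree: D is a list of (tuple_dict, class_label) pairs."""
    labels = [c for _, c in D]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:                  # steps 2-3: all tuples in the same class
        return {"label": labels[0]}
    if not attribute_list:                     # steps 4-5: majority voting
        return {"label": majority}
    split_attr = select_attribute(D, attribute_list)          # steps 6-7: "best" criterion
    node = {"attribute": split_attr, "branches": {}}
    # Steps 8-9: for a discrete attribute with multiway splits, remove it from the list.
    remaining = [a for a in attribute_list if a != split_attr] if multiway else attribute_list
    for outcome in {x[split_attr] for x, _ in D}:              # steps 10-11: one partition per outcome
        Dj = [(x, c) for x, c in D if x[split_attr] == outcome]
        if not Dj:                                             # steps 12-13: empty partition
            node["branches"][outcome] = {"label": majority}
        else:                                                  # step 14: grow a subtree for Dj
            node["branches"][outcome] = generate_decision_tree(Dj, remaining, select_attribute, multiway)
    return node

Because the outcomes here are taken from the values actually present in D, the empty-partition case
cannot arise in this sketch; the check is kept only to mirror step 12 of the pseudocode, which assumes
outcomes come from the attribute's full domain.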
Methods for selecting best test conditions:
Decision tree induction algorithms must provide a method for expressing an attribute test condition
and its corresponding outcomes for different attribute types.
Binary Attributes: The test condition for a binary attribute generates two potential outcomes.
Nominal Attributes: These can have many values. A test on a nominal attribute can be expressed
either as a multiway split (one branch per distinct value) or as a binary split that groups the
values into two subsets (see the sketch after this list).
Ordinal attributes: These can produce binary or multiway splits. The values can be grouped
as long as the grouping does not violate the order property of attribute values.
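A small sketch of how such candidate test conditions could be enumerated in code; the attribute names
(income, shirt sizes) and values are hypothetical and only illustrate the three cases above.

from itertools import combinations

binary_outcomes = ["yes", "no"]               # binary attribute: exactly two outcomes

# Nominal attribute: either one branch per distinct value (multiway split) ...
income_values = ["low", "medium", "high"]
multiway_split = [{v} for v in income_values]

# ... or a binary split that groups the values into two subsets. Fixing the first
# value in the left subset avoids generating each split twice.
rest = income_values[1:]
binary_splits = []
for r in range(len(rest)):
    for s in combinations(rest, r):
        left = {income_values[0], *s}
        binary_splits.append((left, set(income_values) - left))
# -> ({low}, {medium, high}), ({low, medium}, {high}), ({low, high}, {medium})

# Ordinal attribute: groupings must preserve the order of the values, so only
# "cuts" between adjacent values are allowed.
sizes = ["small", "medium", "large", "extra_large"]
ordinal_splits = [(sizes[:i], sizes[i:]) for i in range(1, len(sizes))]
# -> ([small], [medium, large, extra_large]), ([small, medium], [large, extra_large]), ...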
Attribute Selection Measures:
An attribute selection measure is a heuristic for selecting the splitting criterion that “best”
separates a given data partition, D, of class-labeled training tuples into individual classes.
If we were to split D into smaller partitions according to the outcomes of the splitting criterion,
ideally each partition would be pure (i.e., all the tuples that fall into a given partition would belong
to the same class).
Conceptually, the “best” splitting criterion is the one that most closely results in such a scenario.
Attribute selection measures are also known as splitting rules because they determine how the
tuples at a given node are to be split.
The attribute selection measure provides a ranking for each attribute describing the given training
tuples. The attribute having the best score for the measure is chosen as the splitting attribute for the
given tuples.
If the splitting attribute is continuous-valued or if we are restricted to binary trees, then,
respectively, either a split point or a splitting subset must also be determined as part of the splitting
criterion.
The tree node created for partition D is labeled with the splitting criterion, branches are grown for
each outcome of the criterion, and the tuples are partitioned accordingly.
There are three popular attribute selection measures—information gain, gain ratio, and Gini index.
Information Gain
ID3 uses information gain as its attribute selection measure. Let node N hold the tuples of
partition D.
The attribute with the highest information gain is chosen as the splitting attribute for node N.
Such an approach minimizes the expected number of tests needed to classify a given tuple and
guarantees that a simple (but not necessarily the simplest) tree is found.
The expected information needed to classify a tuple in D is given by
Info(D) = − Σ pi log2(pi), where the sum is taken over the m classes (i = 1, …, m),
and pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci, estimated
by |Ci,D|/|D|. A log function to the base 2 is used, because the information is encoded in
bits. Info(D) is also known as the entropy of D.
The information still needed to classify the tuples after using A to split D into v partitions
(corresponding to the v outcomes of a test on A) is
InfoA(D) = Σ (|Dj|/|D|) × Info(Dj), where the sum is taken over the v partitions (j = 1, …, v).
Information gain is defined as the difference between the original information requirement (i.e.,
based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning
on A). That is,
Gain(A) = Info(D) − InfoA(D).
The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at
nodeN. This is equivalent to saying that we want to partition on the attribute A that would do the
“best classification,” so that the amount of information still required to finish classifying the
tuples is minimal.
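These measures can be computed directly. The sketch below is an illustrative Python implementation
(not the textbook's); data sets are assumed to be lists of (attribute-dictionary, class-label) pairs, as
in the earlier sketches.

import math
from collections import Counter

def info(D):
    """Info(D) = -sum(pi * log2(pi)) over the m classes: the entropy of D."""
    n = len(D)
    counts = Counter(c for _, c in D)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def info_gain(D, attribute):
    """Gain(A) = Info(D) - InfoA(D), where each partition Dj is weighted by |Dj|/|D|."""
    n = len(D)
    info_A = 0.0
    for v in {x[attribute] for x, _ in D}:
        Dj = [(x, c) for x, c in D if x[attribute] == v]
        info_A += len(Dj) / n * info(Dj)
    return info(D) - info_A

def select_by_info_gain(D, attribute_list):
    """ID3-style selector: pick the attribute with the highest information gain."""
    return max(attribute_list, key=lambda a: info_gain(D, a))

select_by_info_gain can be passed as the select_attribute argument of the generate_decision_tree sketch
given earlier.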
Gain Ratio
The information gain measure is biased toward tests with many outcomes, that is, toward attributes
having a large number of distinct values. C4.5, a successor of ID3, uses an extension to information
gain known as gain ratio, which attempts to overcome this bias. It applies a kind of normalization to
information gain using a “split information” value defined analogously with Info(D) as
SplitInfoA(D) = − Σ (|Dj|/|D|) log2(|Dj|/|D|), summed over the v partitions (j = 1, …, v).
This value represents the potential information generated by splitting the training data set, D, into v
partitions, corresponding to the v outcomes of a test on attribute A.
It differs from information gain, which measures the information with respect to classification that
is acquired based on the same partitioning. The gain ratio is defined as
GainRatio(A) = Gain(A) / SplitInfoA(D).
The attribute with the maximum gain ratio is selected as the splitting attribute.
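Continuing the same sketch (and reusing math, Counter, and info_gain from the previous block), split
information and gain ratio could be computed as follows; this is an illustration, not C4.5 itself.

def split_info(D, attribute):
    """SplitInfoA(D): potential information generated by splitting D on attribute A."""
    n = len(D)
    counts = Counter(x[attribute] for x, _ in D)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def gain_ratio(D, attribute):
    """GainRatio(A) = Gain(A) / SplitInfoA(D), the C4.5 selection measure."""
    si = split_info(D, attribute)
    return info_gain(D, attribute) / si if si > 0 else 0.0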
Gini Index
The Gini index is used in CART. Using the notation previously described, the Gini index measures
the impurity of D, a data partition or set of training tuples, as
Gini(D) = 1 − Σ pi², where the sum is taken over the m classes (i = 1, …, m),
and pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci, estimated
by |Ci,D|/|D|.
Note: The Gini index considers a binary split for each attribute.
When considering a binary split, we compute a weighted sum of the impurity of each resulting
partition. For example, if a binary split on A partitions D into D1 and D2, the Gini index of D given
that partitioning is
GiniA(D) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2).
For each attribute, each of the possible binary splits is considered. For a discrete-valued attribute,
the subset that gives the minimum Gini index for that attribute is selected as its splitting subset.
For continuous-valued attributes, each possible split-point must be considered. The strategy is
similar to that described earlier for information gain, where the midpoint between each pair of
(sorted) adjacent values is taken as a possible split-point.
The reduction in impurity that would be incurred by a binary split on a discrete- or continuous-
valued attribute A is
ΔGini(A) = Gini(D) − GiniA(D).
The attribute that maximizes the reduction in impurity (i.e., has the minimum Gini index) is selected
as the splitting attribute.
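A corresponding sketch for the Gini index, again reusing the (attribute-dictionary, label)
representation and Counter from the earlier blocks; the binary-split helper takes the candidate
splitting subset explicitly.

def gini(D):
    """Gini(D) = 1 - sum(pi^2) over the m classes: impurity of a set of tuples."""
    n = len(D)
    counts = Counter(c for _, c in D)
    return 1.0 - sum((k / n) ** 2 for k in counts.values())

def gini_binary_split(D, attribute, subset):
    """GiniA(D) for the binary split 'value in subset' (D1) versus the rest (D2)."""
    D1 = [(x, c) for x, c in D if x[attribute] in subset]
    D2 = [(x, c) for x, c in D if x[attribute] not in subset]
    n = len(D)
    return len(D1) / n * gini(D1) + len(D2) / n * gini(D2)

def gini_reduction(D, attribute, subset):
    """Delta Gini(A) = Gini(D) - GiniA(D): reduction in impurity from the split."""
    return gini(D) - gini_binary_split(D, attribute, subset)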
Tree Pruning:
When a decision tree is built, many of the branches will reflect anomalies in the training data due to
noise or outliers.
Tree pruning methods address this problem of overfitting the data. Such methods typically use
statistical measures to remove the least-reliable branches.
Pruned trees tend to be smaller and less complex and, thus, easier to comprehend.
They are usually faster and better at correctly classifying independent test data (i.e., of previously
unseen tuples) than unpruned trees.
“How does tree pruning work?” There are two common approaches to tree pruning:
prepruning and post pruning.
In the prepruning approach, a tree is “pruned” by halting its construction early. Upon halting, the node
becomes a leaf. The leaf may hold the most frequent class among the subset tuples or the
probability distribution of those tuples.
If partitioning the tuples at a node would result in a split that falls below a pre specified threshold,
then further partitioning of the given subset is halted. There are difficulties, however, in choosing
an appropriate threshold.
The second and more common approach is postpruning, which removes subtrees from a “fully grown”
tree. A subtree at a given node is pruned by removing its branches and replacing it with a leaf. The
leaf is labeled with the most frequent class among the subtree being replaced.
Fig: Unpruned and Pruned Trees
The cost complexity pruning algorithm used in CART is an example of the post pruning approach.
This approach considers the cost complexity of a tree to be a function of the number of leaves in
the tree and the error rate of the tree (where the error rate is the percentage of tuples misclassified
by the tree). It starts from the bottom of the tree.
For each internal node, N, it computes the cost complexity of the subtree at N, and the cost
complexity of the subtree at N if it were to be pruned (i.e., replaced by a leaf node).
The two values are compared. If pruning the subtree at node N would result in a smaller cost
complexity, then the subtree is pruned. Otherwise, it is kept.
A pruning set of class-labeled tuples is used to estimate cost complexity.
This set is independent of the training set used to build the unpruned tree and of any test set used for
accuracy estimation.
The algorithm generates a set of progressively pruned trees. Ingeneral, the smallest decision tree
that minimizes the cost complexity is preferred.
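The comparison described above can be sketched as a bottom-up pass over the nested-dictionary tree used
in the earlier sketches (reusing classify and Counter). This is a simplified illustration of the idea
with a single fixed alpha parameter, not CART's full algorithm, which generates a whole sequence of
progressively pruned trees; it also assumes every test outcome in the pruning set appears as a branch.

def count_leaves(node):
    return 1 if "label" in node else sum(count_leaves(c) for c in node["branches"].values())

def prune(node, pruning_set, alpha=0.01):
    """Replace a subtree with a leaf when that lowers cost complexity on the pruning set,
    where cost complexity = error rate + alpha * (number of leaves)."""
    if "label" in node or not pruning_set:
        return node
    for outcome, child in list(node["branches"].items()):   # prune the children first (bottom-up)
        subset = [(x, c) for x, c in pruning_set if x[node["attribute"]] == outcome]
        node["branches"][outcome] = prune(child, subset, alpha)
    labels = [c for _, c in pruning_set]
    majority = Counter(labels).most_common(1)[0][0]
    subtree_err = sum(classify(node, x) != c for x, c in pruning_set) / len(labels)
    leaf_err = sum(c != majority for c in labels) / len(labels)
    # Compare the two cost complexities; prune if the single leaf is no worse.
    if leaf_err + alpha * 1 <= subtree_err + alpha * count_leaves(node):
        return {"label": majority}
    return node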
C4.5 uses a method called pessimistic pruning, which is similar to the cost complexity method in
that it also uses error rate estimates to make decisions regarding subtree pruning.
Advantages Of Decision Tree Classification
Enlisted below are the various merits of Decision Tree Classification:
1. Decision tree classification does not require any domain knowledge, hence, it is
appropriate for the knowledge discovery process.
2. The representation of data in the form of the tree is easily understood by humans and it is
intuitive.
3. It can handle multidimensional data.
4. It is a quick process with great accuracy.
Disadvantages Of Decision Tree Classification
Given below are the various demerits of Decision Tree Classification:
1. Sometimes decision trees become very complex and these are called overfitted trees.
2. The decision tree algorithm may not be an optimal solution.
3. The decision trees may return a biased solution if some class label dominates it.
Scalability of Decision Tree Induction:
“What if D, the disk-resident training set of class-labeled tuples, does not fit in memory? In other
words, how scalable is decision tree induction?” The efficiency of existing decision tree
algorithms, such as ID3, C4.5, and CART, has been well established for relatively small data sets.
Efficiency becomes an issue of concern when these algorithms are applied to the mining of very
large real-world databases.
The pioneering decision tree algorithms that we have discussed so far have the restriction that the
training tuples should reside in memory.
In data mining applications, very large training sets of millions of tuples are common. Most often, the
training data will not fit in memory! Therefore, decision tree construction becomes inefficient due to
swapping of the training tuples in and out of main and cache memories.
More scalable approaches, capable of handling training data that are too large to fit in memory, are
required. Earlier strategies to “save space” included discretizing continuous- valued attributes and
sampling data at each node. These techniques, however, still assume that the training set can fit in
memory.
Several scalable decision tree induction methods have been introduced in recent studies.
RainForest, for example, adapts to the amount of main memory available and applies to
any decision tree induction algorithm.
The method maintains an AVC-set (where “AVC” stands for “Attribute-Value,
Classlabel”) for each attribute, at each tree node, describing the training tuples at the
node. The AVC-set of an attribute A at node N gives the class label counts for each value
of A for the tuples at N. The set of all AVC-sets at a node N is the AVC-group of N.
The size of an AVC-set for attribute A at node N depends only on the number of distinct
values of A and the number of classes in the set of tuples at N. Typically, this size should
fit in memory, even for real-world data.
RainForest also has techniques, however, for handling the case where the AVC-group
does not fit in memory. Therefore, the method has high scalability for decision tree
induction in very large data sets.
Fig: AVC Sets for dataset
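The AVC-group for a node could be built in one scan over its tuples, as in this sketch (the
nested-dictionary layout is an assumption made for illustration, not RainForest's actual data
structure):

from collections import defaultdict

def build_avc_group(D, attribute_list):
    """AVC-group of a node: for each attribute, class-label counts per attribute value.
    Its size depends on the number of distinct values and classes, not on |D|."""
    avc_group = {a: defaultdict(lambda: defaultdict(int)) for a in attribute_list}
    for x, c in D:                      # a single sequential scan over the tuples at the node
        for a in attribute_list:
            avc_group[a][x[a]][c] += 1
    return avc_group

# Example: avc_group["age"]["youth"]["yes"] would hold the number of tuples at this
# node that have age = youth and class label buys_computer = yes.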
Example for Decision Tree construction and Classification Rules:
Construct Decision Tree for following dataset
Age income Student credit_rating buys_computer
Youth high No fair No
Youth high No excellent No
middle_aged high No fair Yes
Senior medium No fair Yes
Senior low Yes fair Yes
Senior low Yes excellent No
middle_aged low Yes excellent Yes
Youth medium No fair No
Youth low Yes fair Yes
Senior medium Yes fair Yes
Youth medium Yes excellent Yes
middle_aged medium No excellent Yes
middle_aged high Yes fair Yes
Senior medium No excellent No
Age           P (yes)   N (no)   Total   I(P, N)
youth            2         3       5     I(2,3) = 0.970
middle_aged      4         0       4     I(4,0) = 0
senior           3         2       5     I(3,2) = 0.970
Gain(Age) = Info(D) – InfoAge(D)
= 0.940 – 0.693 = 0.247
Similarly
Gain(Income) = 0.029
Gain (Student) = 0.151
Gain(credit_rating)=0.048
Finally, since age has the highest information gain among the attributes, it is selected as the splitting
attribute. Node N is labeled with age, and branches are grown for each of the attribute’s values.
The tuples are then partitioned accordingly.
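Reusing the info and info_gain sketches from the attribute selection section, the gains quoted above can
be reproduced for this data set (small rounding differences aside); the row encoding below is only for
this illustration.

rows = [
    ("youth", "high", "no", "fair", "no"),        ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"), ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),      ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"), ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),       ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"), ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"), ("senior", "medium", "no", "excellent", "no"),
]
attrs = ["age", "income", "student", "credit_rating"]
D = [(dict(zip(attrs, r[:4])), r[4]) for r in rows]

print(round(info(D), 3))                  # 0.94: 9 "yes" and 5 "no" tuples
for a in attrs:
    print(a, round(info_gain(D, a), 3))   # age has the highest gain (~0.247), so it
                                          # becomes the splitting attribute at the root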
The Tree after Tree Pruning,
Finally, The Classification Rules are,
⮚ IF age=Youth AND Student=Yes THEN buys_computer=Yes
⮚ IF age=Middle_aged THEN buys_computer=Yes
⮚ IF age=Senior AND Credit_rating=Fair THEN buys_computer=Yes
Constructing a Decision Tree
Let us take an example of the last 10 days weather dataset with attributes outlook, temperature,
wind, and humidity. The outcome variable will be playing cricket or not. We will use the ID3
algorithm to build the decision tree.
Day Outlook Temperature Humidity Wind Play cricket
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
Step1: The first step will be to create a root node.
Step2: If all tuples in the current set belong to the same class (“yes” or “no”), then a leaf node
labeled with that class is returned.
Step3: Find out the Entropy of all observations and entropy with attribute “x” that is E(S) and
E(S, x).
Step4: Find out the information gain and select the attribute with high information gain.
Step5: Repeat the above steps until all attributes are covered.
Calculation of Entropy:
The data set contains 9 “Yes” and 5 “No” tuples, so
E(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) ≈ 0.94.
If entropy is zero, it means that all members belong to the same class; if entropy is one, it means
that half of the tuples belong to one class and the other half to the other class. A value of 0.94
indicates a fairly even distribution.
Find the attribute that gives the maximum information gain.
For example, “Wind” takes two values, Strong and Weak; therefore x = {Strong, Weak}.
Find out H(x) and P(x) for x = Weak and x = Strong. H(S) is already calculated above.
Weak = 8 observations, Strong = 6 observations.
For “Weak” wind, 6 of them say “Yes” to play cricket and 2 of them say “No”, so the entropy is
H(S_weak) = −(6/8) log2(6/8) − (2/8) log2(2/8) ≈ 0.811.
For “Strong” wind, 3 said “No” to play cricket and 3 said “Yes”, so H(S_strong) = 1.0.
Calculate the information gain:
Gain(S, Wind) = H(S) − (8/14) H(S_weak) − (6/14) H(S_strong) = 0.94 − (8/14)(0.811) − (6/14)(1.0) ≈ 0.048.
Similarly, the information gain for the other attributes is:
Gain(S, Outlook) ≈ 0.246, Gain(S, Humidity) ≈ 0.151, Gain(S, Temperature) ≈ 0.029.
The attribute Outlook has the highest information gain of 0.246, thus it is chosen as the root.
Outlook has 3 values: Sunny, Overcast and Rain. For Overcast, play cricket is always “Yes”, so it
ends in a leaf node labeled “yes”. The other values, “Sunny” and “Rain”, require further splitting;
the Sunny subset is shown below.
Temperature Humidity Wind Play cricket
Hot High Weak No
Hot High Strong No
Mild High Weak No
Cool Normal Weak Yes
Mild Normal Strong Yes
Entropy for the “Sunny” value of Outlook (2 “Yes”, 3 “No”) is
E(S_sunny) = −(2/5) log2(2/5) − (3/5) log2(3/5) ≈ 0.971.
The information gain for the remaining attributes with respect to the Sunny subset is:
Gain(Humidity) ≈ 0.971, Gain(Temperature) ≈ 0.571, Gain(Wind) ≈ 0.020.
The information gain for humidity is highest, therefore it is chosen as the next node. Similarly,
Entropy is calculated for Rain. Wind gives the highest information gain.
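The same info and info_gain sketches reproduce the play-cricket calculations above; the subset step for
Sunny is included to show how the next splitting attribute (Humidity) emerges. The encoding of the table
as tuples is only for this illustration.

weather = [
    ("Sunny", "Hot", "High", "Weak", "No"),      ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),  ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),   ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
cols = ["Outlook", "Temperature", "Humidity", "Wind"]
S = [(dict(zip(cols, r[:4])), r[4]) for r in weather]

print(round(info(S), 2))                     # 0.94 (9 "Yes", 5 "No")
for a in cols:
    print(a, round(info_gain(S, a), 3))      # Outlook has the highest gain (~0.246)

# Repeat on the Sunny subset to choose the attribute below the Outlook = Sunny branch.
sunny = [(x, c) for x, c in S if x["Outlook"] == "Sunny"]
for a in ["Temperature", "Humidity", "Wind"]:
    print(a, round(info_gain(sunny, a), 3))  # Humidity gives the highest gain here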
CART
The CART model, i.e. Classification and Regression Trees, is a decision tree algorithm for building
models. A decision tree model in which the target values have a discrete nature is called a
classification model; a discrete value comes from a finite or countably infinite set of values, for
example, age group or size category. Models in which the target values are continuous (usually
floating-point numbers) are called regression models. These two model types together give CART its
name.
CART uses the Gini index as its attribute selection measure.
Decision Tree Induction for Data Mining: ID3
In the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning, built a decision
tree algorithm known as ID3 (Iterative Dichotomiser). This algorithm was an extension of the concept
learning systems described by E. B. Hunt, J. Marin, and P. T. Stone. Quinlan later extended ID3 into
C4.5. ID3 and C4.5 follow a greedy top-down approach for constructing decision trees. The algorithm
starts with a training data set with class labels that is partitioned into smaller subsets as the tree
is being constructed.
#1) Initially, there are three parameters i.e. attribute list, attribute selection method and data
partition. The attribute list describes the attributes of the training set tuples.
#2) The attribute selection method describes the method for selecting the best attribute for
discrimination among tuples. The methods used for attribute selection can either be Information
Gain or Gini Index.
#3) The structure of the tree (binary or non-binary) is decided by the attribute selection method.
#4) When constructing a decision tree, it starts as a single node representing the tuples.
#5) If the root node tuples represent different class labels, then it calls an attribute selection method
to split or partition the tuples. The step will lead to the formation of branches and decision nodes.
#6) The splitting method will determine which attribute should be selected to partition the data
tuples. It also determines the branches to be grown from the node according to the test outcome.
The main motive of the splitting criterion is that the partitions at each branch of the decision tree
should be as pure as possible, i.e., ideally all tuples in a partition belong to the same class.