Module 3
Classification
Content
• Basic Concepts; Classification methods: 1. Decision Tree Induction: Attribute Selection Measures, Tree Pruning. 2. Bayesian Classification: Naïve Bayes' Classifier. Prediction: Structure of regression models; Simple linear regression, Multiple linear regression. Accuracy and Error measures, Precision, Recall. (06 hours)
Basic Concepts
There are two forms of data analysis that can be used to extract models describing important classes or to predict future data trends.
These two forms are as follows −
• Classification
• Prediction
• Classification models predict categorical class labels, whereas prediction models predict continuous-valued functions. For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment, given their income and occupation.
What is classification?
Following are examples of cases where the data analysis task is classification −
• A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
• A marketing manager at a company needs to predict whether a customer with a given profile will buy a new computer.
• In both of the above examples, a model or classifier is constructed to
predict the categorical labels. These labels are risky or safe for loan
application data and yes or no for marketing data.
How Does Classification Work?
With the help of the bank loan application example discussed above, let us understand how classification works. The data classification process includes two steps −
• Building the Classifier or Model
• Using Classifier for Classification
Step 1: Building the Classifier or Model
• This step is the learning step or the learning phase.
• In this step the classification algorithms build the classifier.
• The classifier is built from the training set made up of database tuples and their
associated class labels.
• Each tuple that constitutes the training set is assumed to belong to a predefined class, as determined by the class label attribute. These tuples can also be referred to as samples, objects, or data points.
Step 2: Using Classifier for Classification
In this step, the classifier is used for classification. Here the test data is used to estimate the accuracy of the classification rules. The classification rules can be applied to new data tuples if the accuracy is considered acceptable.
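As a small illustration of these two steps, the following sketch (assuming scikit-learn and a synthetic dataset standing in for the bank loan tuples) builds a classifier from a training set and then estimates its accuracy on held-out test tuples.

# Sketch of the two-step classification process.
# Assumes scikit-learn; the synthetic data is only a stand-in for real tuples.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labelled tuples (attributes X, class labels y).
X, y = make_classification(n_samples=500, n_features=4, random_state=42)

# Hold out part of the data as test tuples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 1: build the classifier from the training set (learning phase).
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Step 2: use the classifier on the test tuples and estimate its accuracy.
y_pred = clf.predict(X_test)
print("Estimated accuracy:", accuracy_score(y_test, y_pred))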
Classification Issues
The major issue is preparing the data for Classification and Prediction.
Preparing the data involves the following activities −
• Data Cleaning − Data cleaning involves removing the noise and treating missing values. Noise is removed by applying smoothing techniques, and missing values are handled by replacing each missing value with the most commonly occurring value for that attribute.
• Relevance Analysis − The database may also contain irrelevant attributes. Correlation analysis is used to check whether any two given attributes are related.
• Data Transformation and Reduction − The data can be transformed by any of the following methods.
• Normalization − The data is transformed using normalization. Normalization involves scaling all values of a given attribute so that they fall within a small specified range. Normalization is used when the learning step involves neural networks or methods based on distance measurements (a small sketch is given after this list).
• Generalization − The data can also be transformed by generalizing it to higher-level concepts. For this purpose, we can use concept hierarchies.
• Note − Data can also be reduced by some other methods such as wavelet
transformation, binning, histogram analysis, and clustering.
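As a concrete example of normalization, min-max normalization (one common scaling method; the function below is only an illustrative sketch) rescales the values of an attribute into a small range such as [0, 1].

# Min-max normalization sketch: scales values of one attribute into [new_min, new_max].
# Illustrative only; z-score or decimal-scaling normalization could be used instead.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:                      # all values identical: map everything to new_min
        return [new_min for _ in values]
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]

incomes = [12000, 35000, 58000, 98000]
print(min_max_normalize(incomes))      # all values now fall within [0, 1]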
Comparison of Classification and Prediction Methods
Here are the criteria for comparing the methods of classification and prediction.
• Accuracy − The accuracy of a classifier refers to its ability to predict the class label correctly, and the accuracy of a predictor refers to how well a given predictor can estimate the value of the predicted attribute for new data.
• Speed − This refers to the computational cost in generating and using the
classifier or predictor.
• Robustness − It refers to the ability of the classifier or predictor to make correct predictions from noisy data.
• Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently, given a large amount of data.
• Interpretability − It refers to the extent to which the classifier or predictor can be understood, i.e., the level of insight the model provides.
Classification is a highly popular aspect of data mining. As a result, machine
learning has many classifiers:
1. Logistic regression
2. Linear regression
3. Decision trees
4. Random forest
5. Naïve Bayes
6. Support Vector Machines
7. K-nearest neighbours
Decision Tree Induction
• Decision Tree Mining is a type of data mining technique that is used to build
Classification Models. It builds classification models in the form of a tree-like
structure, just like its name. This type of mining belongs to supervised class
learning.
• In supervised learning, the target result is already known. Decision trees can be
used for both categorical and numerical data. The categorical data represent
gender, marital status, etc. while the numerical data represent age, temperature,
etc.
• Decision Tree is used to build classification and regression models. It is used to
create data models that will predict class labels or values for the decision-
making process. The models are built from the training dataset fed to the system
(supervised learning).
• A decision tree lets us visualize the decisions, which makes them easy to understand; this is why it is a popular data mining technique.
• A decision tree is a structure that includes a root node, branches, and leaf
nodes.
• Each internal node denotes a test on an attribute, each branch denotes the
outcome of a test, and each leaf node holds a class label.
• The topmost node in the tree is the root node.
• The following decision tree is for the concept buy_computer, which indicates whether a customer at a company is likely to buy a computer or not (a small sketch of such a tree is given after this list).
• Each internal node represents a test on an attribute. Each leaf node represents a
class.
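As a concrete illustration, the snippet below sketches one plausible shape for such a buy_computer tree using nested dictionaries (the attributes, split values, and class labels are assumed purely for this example; a tree learned from real data may differ). Internal nodes test an attribute, branches carry the test outcomes, and leaves hold a class label.

# A hand-written decision tree for buy_computer, represented as nested dicts.
# The attributes and splits are assumed for illustration only.
tree = {
    "attribute": "age",
    "branches": {
        "youth":       {"attribute": "student",
                        "branches": {"yes": "buys_computer = yes",
                                     "no":  "buys_computer = no"}},
        "middle_aged": "buys_computer = yes",
        "senior":      {"attribute": "credit_rating",
                        "branches": {"fair":      "buys_computer = yes",
                                     "excellent": "buys_computer = no"}},
    },
}

def classify(node, tuple_):
    # Follow branches from the root until a leaf (a plain string) is reached.
    while isinstance(node, dict):
        node = node["branches"][tuple_[node["attribute"]]]
    return node

print(classify(tree, {"age": "youth", "student": "yes", "credit_rating": "fair"}))
# -> buys_computer = yes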
The benefits of having a decision tree are as follows −
1. It does not require any domain knowledge.
2. It is easy to comprehend.
3. The learning and classification steps of a decision tree are simple and fast.
Decision Tree Induction Algorithm
• A machine learning researcher named J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser) in 1980.
• Later, he presented C4.5, which was the successor of ID3.
• ID3 and C4.5 adopt a greedy approach.
• In this algorithm, there is no backtracking; the trees are constructed in a top-
down recursive divide-and-conquer manner.
• Generating a decision tree from training tuples of data partition D
Algorithm : Generate_decision_tree
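A minimal, runnable sketch of this recursive divide-and-conquer procedure is given below. It is a simplification under assumed data structures, not the textbook's exact listing: the attribute selection measure is passed in as a function, and a trivial placeholder (pick the first remaining attribute) is used so the sketch runs, where information gain or gain ratio would normally be plugged in.

# Minimal sketch of Generate_decision_tree (simplified; not the exact textbook listing).
# Tuples are dicts; the class label is stored under the key "class".
from collections import Counter

def majority_class(tuples):
    return Counter(t["class"] for t in tuples).most_common(1)[0][0]

def generate_decision_tree(tuples, attribute_list, attribute_selection):
    classes = {t["class"] for t in tuples}
    if len(classes) == 1:                 # all tuples are in one class: return a leaf
        return classes.pop()
    if not attribute_list:                # no attributes left: majority-vote leaf
        return majority_class(tuples)
    best = attribute_selection(tuples, attribute_list)   # splitting criterion
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attribute_list if a != best]
    for value in {t[best] for t in tuples}:              # partition D on each outcome
        subset = [t for t in tuples if t[best] == value]
        node["branches"][value] = generate_decision_tree(subset, remaining,
                                                         attribute_selection)
    return node

# Placeholder selection measure (stand-in for information gain, gain ratio, etc.).
first_attribute = lambda tuples, attrs: attrs[0]

data = [
    {"age": "youth",  "student": "no",  "class": "no"},
    {"age": "youth",  "student": "yes", "class": "yes"},
    {"age": "senior", "student": "no",  "class": "yes"},
]
print(generate_decision_tree(data, ["age", "student"], first_attribute))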
What is ID3 Algorithm?
• The ID3 (Iterative Dichotomiser 3) algorithm is one of the earliest and
most widely used algorithms to create Decision Trees from a given
dataset.
• It uses the concept of entropy and information gain to select the best
attribute for splitting the data at each node.
• Entropy measures the uncertainty or randomness in the data, and
information gain quantifies the reduction in uncertainty achieved by
splitting the data on a particular attribute.
• The ID3 algorithm recursively splits the dataset based on the attributes
with the highest information gain until a stopping criterion is met,
resulting in a Decision Tree that can be used for classification tasks.
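In symbols, using the standard textbook notation (these are the usual definitions, written out here for reference): for a data partition D containing tuples from m classes, where p_i is the proportion of tuples in D belonging to class C_i,

    Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i                     (entropy of D)
    Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, Info(D_j)    (expected information after splitting D on attribute A into D_1, ..., D_v)
    Gain(A) = Info(D) - Info_A(D)                                (information gain of A)

The attribute A with the highest Gain(A) is chosen as the splitting attribute at the current node.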
Understanding the ID3 Algorithm:
• ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm iteratively (repeatedly) dichotomises (divides) features into two or more groups at each step.
• The ID3 algorithm uses the concept of entropy and information gain to
construct a decision tree.
• Entropy measures the amount of uncertainty or randomness in a dataset,
while information gain quantifies the reduction in entropy achieved by
splitting the data on a specific attribute.
• The attribute with the highest information gain is selected as the decision
node for the tree.
Steps to Making a Decision Tree
a) Take the entire dataset as an input.
b) Calculate the entropy of the target variable, as well as of the predictor attributes.
c) Calculate the information gain of all attributes.
d) Choose the attribute with the highest information gain as the root node.
e) Repeat the same procedure on every branch until the decision node of each branch is finalized.
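Steps (b)-(d) are illustrated by the short sketch below on an invented toy dataset (the attribute names, values, and class labels are made up purely for this example).

# Steps (b)-(d): entropy of the target, information gain of each attribute,
# and the choice of root attribute.  The data is invented for illustration.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target="class"):
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

rows = [
    {"outlook": "sunny",    "windy": "no",  "class": "no"},
    {"outlook": "sunny",    "windy": "yes", "class": "no"},
    {"outlook": "rainy",    "windy": "no",  "class": "yes"},
    {"outlook": "rainy",    "windy": "yes", "class": "no"},
    {"outlook": "overcast", "windy": "no",  "class": "yes"},
]

for attr in ("outlook", "windy"):
    print(attr, "gain =", round(information_gain(rows, attr), 3))
# The attribute with the larger gain ("outlook" here) would be chosen as the root node.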
[Link]
achine-learning-4120d8ba013b
• (Go through above link for solved problems)
Overfitting and Tree Pruning
Overfitting:
• An induced tree may overfit the training data: too many branches, some of which may reflect anomalies due to noise or outliers, leading to poor accuracy for unseen samples.
Two approaches to avoid overfitting:
• Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
• Postpruning: Remove branches from a “fully grown” tree, obtaining a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the “best pruned tree”.
What Is Tree Pruning?
• Pruning is the method of removing unwanted branches from the decision tree. Some branches of the decision tree might represent outliers or noisy data.
• Tree pruning removes such unwanted branches of the tree. This reduces the complexity of the tree and helps in effective predictive analysis. It reduces overfitting, as it removes the unimportant branches from the tree.
There are two ways of pruning the tree:
1) Prepruning:
• In this approach, the construction of the decision tree is stopped early.
• It means it is decided not to further partition the branches.
• The last node constructed becomes the leaf node and this leaf node may hold
the most frequent class among the tuples.
• Attribute selection measures are used to assess the goodness of a split.
• Threshold values are prescribed to decide which splits are regarded as useful.
• If partitioning the tuples at a node would result in a split that falls below the threshold, the process is halted.
2) Postpruning:
• This method removes the outlier branches from a fully grown tree.
• The unwanted branches are removed and replaced by a leaf node denoting the
most frequent class label.
• This technique requires more computation than prepruning, however, it is more
reliable.
• The pruned trees are more precise and compact when compared to unpruned trees, but decision trees can still carry the disadvantages of replication and repetition.
• Repetition occurs when the same attribute is tested again and again along a
branch of a tree.
• Replication occurs when duplicate subtrees are present within the tree. These issues can be addressed by using multivariate splits.
The image below shows an unpruned and a pruned tree.
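Both pruning strategies can also be sketched with scikit-learn's DecisionTreeClassifier (assuming that library; the parameter values below are arbitrary and only illustrate the two ideas): limiting growth up front approximates prepruning, while cost-complexity pruning of a fully grown tree corresponds to postpruning.

# Prepruning vs. postpruning, sketched with scikit-learn (parameter values are arbitrary).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Prepruning: stop growth early by bounding depth and minimum tuples per split.
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=20, random_state=0)
pre.fit(X_train, y_train)

# Postpruning: grow the tree fully, then prune it back via cost-complexity pruning.
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
post.fit(X_train, y_train)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
for name, model in [("full", full), ("prepruned", pre), ("postpruned", post)]:
    print(name, "leaves:", model.get_n_leaves(),
          "test accuracy:", round(model.score(X_test, y_test), 3))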