Module 3
Classification
Content
• Basic Concepts; Classification methods: 1. Decision Tree Induction: Attribute Selection Measures, Tree Pruning. 2. Bayesian Classification: Naïve Bayes' Classifier. Prediction: Structure of regression models; Simple linear regression, Multiple linear regression. Accuracy and Error measures, Precision, Recall. (06 hours)
Basic Concepts
There are two forms of data analysis that can be used to extract models describing important classes or to predict future data trends.
These two forms are as follows −
• Classification
• Prediction
• Classification models predict categorical class labels, whereas prediction models predict continuous-valued functions. For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment, given their income and occupation.
What is classification?
Following are examples of cases where the data analysis task is classification −
• A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
• A marketing manager at a company needs to predict whether a customer with a given profile will buy a new computer.
• In both of the above examples, a model or classifier is constructed to
predict the categorical labels. These labels are risky or safe for loan
application data and yes or no for marketing data.
How Does Classification Work?
With the help of the bank loan application example discussed above, let us understand how classification works. The data classification process includes two steps −
• Building the Classifier or Model
• Using Classifier for Classification
Step 1: Building the Classifier or Model
• This step is the learning step or the learning phase.
• In this step the classification algorithms build the classifier.
• The classifier is built from the training set made up of database tuples and their
associated class labels.
• Each tuple that constitutes the training set is assumed to belong to a predefined class, as determined by the class label attribute. These tuples can also be referred to as samples, objects, or data points.
Step 2: Using Classifier for Classification
In this step, the classifier is used for classification. Here the test data is used to estimate the accuracy of the classification rules. The classification rules can be applied to new data tuples if the accuracy is considered acceptable.
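As a small illustration of these two steps, the following sketch (assuming scikit-learn and a synthetic dataset standing in for the bank loan tuples) builds a classifier from a training set and then estimates its accuracy on held-out test tuples.

# Sketch of the two-step classification process.
# Assumes scikit-learn; the synthetic data is only a stand-in for real tuples.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labelled tuples (attributes X, class labels y).
X, y = make_classification(n_samples=500, n_features=4, random_state=42)

# Hold out part of the data as test tuples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 1: build the classifier from the training set (learning phase).
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Step 2: use the classifier on the test tuples and estimate its accuracy.
y_pred = clf.predict(X_test)
print("Estimated accuracy:", accuracy_score(y_test, y_pred))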
Classification Issues
The major issue is preparing the data for Classification and Prediction.
Preparing the data involves the following activities −
• Data Cleaning − Data cleaning involves removing the noise and treating missing values. Noise is removed by applying smoothing techniques, and missing values are handled by replacing each missing value with the most commonly occurring value for that attribute.
• Relevance Analysis − The database may also contain irrelevant attributes. Correlation analysis is used to check whether any two given attributes are related.
• Data Transformation and Reduction − The data can be transformed by any of the following methods.
• Normalization − The data is transformed using normalization. Normalization involves scaling all values of a given attribute so that they fall within a small specified range. Normalization is used when the learning step involves neural networks or methods based on distance measurements (a small sketch is given after this list).
• Generalization − The data can also be transformed by generalizing it to higher-level concepts. For this purpose, we can use concept hierarchies.
• Note − Data can also be reduced by some other methods such as wavelet
transformation, binning, histogram analysis, and clustering.
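As a concrete example of normalization, min-max normalization (one common scaling method; the function below is only an illustrative sketch) rescales the values of an attribute into a small range such as [0, 1].

# Min-max normalization sketch: scales values of one attribute into [new_min, new_max].
# Illustrative only; z-score or decimal-scaling normalization could be used instead.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:                      # all values identical: map everything to new_min
        return [new_min for _ in values]
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]

incomes = [12000, 35000, 58000, 98000]
print(min_max_normalize(incomes))      # all values now fall within [0, 1]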
Comparison of Classification and Prediction Methods
Here are the criteria for comparing the methods of classification and prediction.
• Accuracy − The accuracy of a classifier refers to its ability to predict the class label correctly, and the accuracy of a predictor refers to how well a given predictor can estimate the value of the predicted attribute for new data.
• Speed − This refers to the computational cost in generating and using the
classifier or predictor.
• Robustness − It refers to the ability of the classifier or predictor to make correct predictions from noisy data.
• Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently, given a large amount of data.
• Interpretability − It refers to the extent to which the classifier or predictor can be understood, i.e., the level of insight the model provides.
Classification is a highly popular aspect of data mining. As a result, machine
learning has many classifiers:
1. Logistic regression
2. Linear regression
3. Decision trees
4. Random forest
5. Naïve Bayes
6. Support Vector Machines
7. K-nearest neighbours
Decision Tree Induction
• Decision Tree Mining is a type of data mining technique that is used to build
Classification Models. It builds classification models in the form of a tree-like
structure, just like its name. This type of mining belongs to supervised class
learning.
• In supervised learning, the target result is already known. Decision trees can be
used for both categorical and numerical data. The categorical data represent
gender, marital status, etc. while the numerical data represent age, temperature,
etc.
• Decision Tree is used to build classification and regression models. It is used to
create data models that will predict class labels or values for the decision-
making process. The models are built from the training dataset fed to the system
(supervised learning).
• A decision tree lets us visualize the decisions, which makes them easy to understand; this is why it is a popular data mining technique.
• A decision tree is a structure that includes a root node, branches, and leaf
nodes.
• Each internal node denotes a test on an attribute, each branch denotes the
outcome of a test, and each leaf node holds a class label.
• The topmost node in the tree is the root node.
• The following decision tree is for the concept buy_computer, which indicates whether a customer at a company is likely to buy a computer or not (a small sketch of such a tree is given after this list).
• Each internal node represents a test on an attribute. Each leaf node represents a
class.
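As a concrete illustration, the snippet below sketches one plausible shape for such a buy_computer tree using nested dictionaries (the attributes, split values, and class labels are assumed purely for this example; a tree learned from real data may differ). Internal nodes test an attribute, branches carry the test outcomes, and leaves hold a class label.

# A hand-written decision tree for buy_computer, represented as nested dicts.
# The attributes and splits are assumed for illustration only.
tree = {
    "attribute": "age",
    "branches": {
        "youth":       {"attribute": "student",
                        "branches": {"yes": "buys_computer = yes",
                                     "no":  "buys_computer = no"}},
        "middle_aged": "buys_computer = yes",
        "senior":      {"attribute": "credit_rating",
                        "branches": {"fair":      "buys_computer = yes",
                                     "excellent": "buys_computer = no"}},
    },
}

def classify(node, tuple_):
    # Follow branches from the root until a leaf (a plain string) is reached.
    while isinstance(node, dict):
        node = node["branches"][tuple_[node["attribute"]]]
    return node

print(classify(tree, {"age": "youth", "student": "yes", "credit_rating": "fair"}))
# -> buys_computer = yes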
The benefits of having a decision tree are as follows −
1. It does not require any domain knowledge.
2. It is easy to comprehend.
3. The learning and classification steps of a decision tree are simple and fast.
Decision Tree Induction Algorithm
• A machine learning researcher named J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser) in 1980.
• Later, he presented C4.5, which was the successor of ID3.
• ID3 and C4.5 adopt a greedy approach.
• In this algorithm, there is no backtracking; the trees are constructed in a top-
down recursive divide-and-conquer manner.
• Generating a decision tree from training tuples of data partition D
Algorithm : Generate_decision_tree
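A minimal, runnable sketch of this recursive divide-and-conquer procedure is given below. It is a simplification under assumed data structures, not the textbook's exact listing: the attribute selection measure is passed in as a function, and a trivial placeholder (pick the first remaining attribute) is used so the sketch runs, where information gain or gain ratio would normally be plugged in.

# Minimal sketch of Generate_decision_tree (simplified; not the exact textbook listing).
# Tuples are dicts; the class label is stored under the key "class".
from collections import Counter

def majority_class(tuples):
    return Counter(t["class"] for t in tuples).most_common(1)[0][0]

def generate_decision_tree(tuples, attribute_list, attribute_selection):
    classes = {t["class"] for t in tuples}
    if len(classes) == 1:                 # all tuples are in one class: return a leaf
        return classes.pop()
    if not attribute_list:                # no attributes left: majority-vote leaf
        return majority_class(tuples)
    best = attribute_selection(tuples, attribute_list)   # splitting criterion
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attribute_list if a != best]
    for value in {t[best] for t in tuples}:              # partition D on each outcome
        subset = [t for t in tuples if t[best] == value]
        node["branches"][value] = generate_decision_tree(subset, remaining,
                                                         attribute_selection)
    return node

# Placeholder selection measure (stand-in for information gain, gain ratio, etc.).
first_attribute = lambda tuples, attrs: attrs[0]

data = [
    {"age": "youth",  "student": "no",  "class": "no"},
    {"age": "youth",  "student": "yes", "class": "yes"},
    {"age": "senior", "student": "no",  "class": "yes"},
]
print(generate_decision_tree(data, ["age", "student"], first_attribute))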
What is ID3 Algorithm?
• The ID3 (Iterative Dichotomiser 3) algorithm is one of the earliest and
most widely used algorithms to create Decision Trees from a given
dataset.
• It uses the concept of entropy and information gain to select the best
attribute for splitting the data at each node.
• Entropy measures the uncertainty or randomness in the data, and
information gain quantifies the reduction in uncertainty achieved by
splitting the data on a particular attribute.
• The ID3 algorithm recursively splits the dataset based on the attributes
with the highest information gain until a stopping criterion is met,
resulting in a Decision Tree that can be used for classification tasks.
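In symbols, using the standard textbook notation (these are the usual definitions, written out here for reference): for a data partition D containing tuples from m classes, where p_i is the proportion of tuples in D belonging to class C_i,

    Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i                     (entropy of D)
    Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, Info(D_j)    (expected information after splitting D on attribute A into D_1, ..., D_v)
    Gain(A) = Info(D) - Info_A(D)                                (information gain of A)

The attribute A with the highest Gain(A) is chosen as the splitting attribute at the current node.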
Understanding the ID3 Algorithm:
• ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm iteratively (repeatedly) dichotomises (divides) features into two or more groups at each step.
• The ID3 algorithm uses the concept of entropy and information gain to
construct a decision tree.
• Entropy measures the amount of uncertainty or randomness in a dataset,
while information gain quantifies the reduction in entropy achieved by
splitting the data on a specific attribute.
• The attribute with the highest information gain is selected as the decision
node for the tree.
Steps to Making a Decision Tree
a) Take the entire dataset as an input.
b) Calculate the entropy of the target variable, as well as of the predictor attributes.
c) Calculate the information gain of all attributes.
d) Choose the attribute with the highest information gain as the root node.
e) Repeat the same procedure on every branch until the decision node of each branch is finalized.
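Steps (b)-(d) are illustrated by the short sketch below on an invented toy dataset (the attribute names, values, and class labels are made up purely for this example).

# Steps (b)-(d): entropy of the target, information gain of each attribute,
# and the choice of root attribute.  The data is invented for illustration.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target="class"):
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

rows = [
    {"outlook": "sunny",    "windy": "no",  "class": "no"},
    {"outlook": "sunny",    "windy": "yes", "class": "no"},
    {"outlook": "rainy",    "windy": "no",  "class": "yes"},
    {"outlook": "rainy",    "windy": "yes", "class": "no"},
    {"outlook": "overcast", "windy": "no",  "class": "yes"},
]

for attr in ("outlook", "windy"):
    print(attr, "gain =", round(information_gain(rows, attr), 3))
# The attribute with the larger gain ("outlook" here) would be chosen as the root node.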
[Link]
achine-learning-4120d8ba013b
• (Go through above link for solved problems)
Overfitting and Tree Pruning
Overfitting:
• An induced tree may overfit the training data: too many branches, some of which may reflect anomalies due to noise or outliers, leading to poor accuracy for unseen samples.
Two approaches to avoid overfitting:
• Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
• Postpruning: Remove branches from a “fully grown” tree, obtaining a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the “best pruned tree”.
What Is Tree Pruning?
• Pruning is the method of removing unwanted branches from the decision tree. Some branches of the decision tree might represent outliers or noisy data.
• Tree pruning removes such unwanted branches of the tree. This reduces the complexity of the tree and helps in effective predictive analysis. It reduces overfitting, as it removes the unimportant branches from the tree.
There are two ways of pruning the tree:
1) Prepruning:
• In this approach, the construction of the decision tree is stopped early.
• It means it is decided not to further partition the branches.
• The last node constructed becomes the leaf node and this leaf node may hold
the most frequent class among the tuples.
• Attribute selection measures are used to assess the goodness of a split.
• Threshold values are prescribed to decide which splits are regarded as useful.
• If partitioning the tuples at a node would result in a split that falls below the threshold, the process is halted.
2) Postpruning:
• This method removes the outlier branches from a fully grown tree.
• The unwanted branches are removed and replaced by a leaf node denoting the
most frequent class label.
• This technique requires more computation than prepruning, however, it is more
reliable.
• The pruned trees are more precise and compact when compared to unpruned trees, but decision trees can still carry the disadvantages of replication and repetition.
• Repetition occurs when the same attribute is tested again and again along a
branch of a tree.
• Replication occurs when duplicate subtrees are present within the tree. These issues can be addressed by using multivariate splits.
The image below shows an unpruned and a pruned tree.
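Both pruning strategies can also be sketched with scikit-learn's DecisionTreeClassifier (assuming that library; the parameter values below are arbitrary and only illustrate the two ideas): limiting growth up front approximates prepruning, while cost-complexity pruning of a fully grown tree corresponds to postpruning.

# Prepruning vs. postpruning, sketched with scikit-learn (parameter values are arbitrary).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Prepruning: stop growth early by bounding depth and minimum tuples per split.
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=20, random_state=0)
pre.fit(X_train, y_train)

# Postpruning: grow the tree fully, then prune it back via cost-complexity pruning.
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
post.fit(X_train, y_train)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
for name, model in [("full", full), ("prepruned", pre), ("postpruned", post)]:
    print(name, "leaves:", model.get_n_leaves(),
          "test accuracy:", round(model.score(X_test, y_test), 3))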