UNIT-III
Classification
Classification:
predicts categorical class labels
classifies data (constructs a model) based on the training set and the values (class labels) of a
classifying attribute, and uses the model to classify new data
Prediction:
models continuous-valued functions, i.e., predicts unknown or missing values
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the
class label attribute
The set of tuples used for model construction: training set
The model is represented as classification rules, decision trees, or mathematical
formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of a test sample is compared with the classified result from the
model
Accuracy rate is the percentage of test set samples that are correctly classified
by the model
Test set is independent of training set, otherwise over-fitting will occur
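A minimal sketch of this two-step process in Python, assuming scikit-learn and its bundled iris data (both the library and the dataset are choices of this example, not part of the notes):

```python
# Step 1: model construction on a training set; Step 2: model usage on an
# independent test set, with accuracy as the evaluation measure.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: learn the model from labeled training tuples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: classify the held-out test set and estimate the accuracy rate.
y_pred = model.predict(X_test)
print("accuracy rate:", accuracy_score(y_test, y_pred))
```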
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are accompanied by
labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc. with the aim of establishing the
existence of classes or clusters in the data
Issues regarding classification and prediction (1): Data Preparation
Data cleaning
Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
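A small illustration of the normalization step in Python (the attribute values below are hypothetical; any numeric column of the training data could be used):

```python
# Min-max and z-score normalization of one numeric attribute.
values = [12.0, 45.0, 30.0, 7.0, 60.0]          # hypothetical attribute values

lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]            # rescaled to [0, 1]

mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
z_scores = [(v - mean) / std for v in values]               # zero mean, unit variance
```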
Evaluating Classification Methods
Predictive accuracy
Speed and scalability
time to construct the model
time to use the model
Robustness
handling noise and missing values
Scalability
efficiency in disk-resident databases
Interpretability:
understanding and insight provided by the model
Goodness of rules
decision tree size
compactness of classification rules
Classification by Decision Tree Induction
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases
Tree construction
At start, all the training examples are at the root
Partition examples recursively based on selected attributes
Tree pruning
Identify and remove branches that reflect noise or outliers
Use of decision tree: Classifying an unknown sample
Test the attribute values of the sample against the decision tree
Training Dataset
Output: A Decision Tree for “buys_computer”
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g.,
information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning – majority voting is employed
for classifying the leaf
There are no samples left
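The greedy procedure above can be written down compactly. A minimal sketch (the helper names are hypothetical; categorical attributes only, with information gain, introduced in the next subsection, as the selection measure):

```python
# Top-down, recursive, divide-and-conquer tree construction.
# Usage: build_tree(list_of_dicts, list_of_labels, list_of_attribute_names)
import math
from collections import Counter

def info(labels):
    """Expected information (entropy) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def build_tree(rows, labels, attributes):
    if not labels:
        return None                       # no samples left
    if len(set(labels)) == 1:
        return labels[0]                  # all samples belong to the same class
    if not attributes:
        return Counter(labels).most_common(1)[0][0]   # majority voting at the leaf

    def gain(a):                          # information gain of splitting on a
        subsets = {}
        for row, lab in zip(rows, labels):
            subsets.setdefault(row[a], []).append(lab)
        expected = sum(len(s) / len(labels) * info(s) for s in subsets.values())
        return info(labels) - expected

    best = max(attributes, key=gain)      # greedy attribute selection
    node = {}
    for value in set(row[best] for row in rows):
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        node[(best, value)] = build_tree([r for r, _ in sub], [l for _, l in sub],
                                         [a for a in attributes if a != best])
    return node
```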
Attribute Selection Measure
Information gain (ID3/C4.5)
All attributes are assumed to be categorical
Can be modified for continuous-valued attributes
Gini index (IBM IntelligentMiner)
All attributes are assumed continuous-valued
Assume there exist several possible split values for each attribute
May need other tools, such as clustering, to get the possible split values
Can be modified for categorical attributes
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Assume there are two classes, P and N
Let the set of examples S contain p elements of class P and n elements of class N
The amount of information needed to decide if an arbitrary example in S belongs to P
or N is defined as
I(p, n) = −(p/(p+n))·log2(p/(p+n)) − (n/(p+n))·log2(n/(p+n))
Information Gain in Decision Tree Induction
Assume that using attribute A a set S will be partitioned into sets {S1, S2 , …, Sv}
If Si contains pi examples of P and ni examples of N, the entropy, or the expected
information needed to classify objects in all subtrees Si is
E(A) = Σ_{i=1..v} ((pi + ni)/(p + n)) · I(pi, ni)
The encoding information that would be gained by branching on A
Gain(A) = I(p, n) − E(A)
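The three formulas above transcribe directly into Python (a minimal sketch for the two-class case, with I, E, and gain mirroring the notation used here):

```python
import math

def I(p, n):
    """Expected information I(p, n); 0*log2(0) is treated as 0."""
    total = p + n
    return sum(-(x / total) * math.log2(x / total) for x in (p, n) if x)

def E(partition):
    """E(A) for a split on A; partition is a list of (p_i, n_i) pairs, one per subset S_i."""
    p = sum(pi for pi, _ in partition)
    n = sum(ni for _, ni in partition)
    return sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in partition)

def gain(p, n, partition):
    """Information gain of branching on A: Gain(A) = I(p, n) - E(A)."""
    return I(p, n) - E(partition)
```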
Attribute Selection by Information Gain Computation
Class P: buys_computer = “yes” (9 examples)
Class N: buys_computer = “no” (5 examples)
I(p, n) = I(9, 5) = 0.940
Compute the entropy for age:

age       pi   ni   I(pi, ni)
<=30      2    3    0.971
30…40     4    0    0
>40       3    2    0.971

E(age) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.69
Hence
Gain(age) = I(p, n) − E(age)
Similarly
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
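The Gain(age) computation above can be reproduced with a few lines of Python (the per-branch counts are taken from the age table; the notes quote E(age) rounded to 0.69, the unrounded gain is about 0.247):

```python
import math

def info(p, n):
    """I(p, n) for the two-class case; 0*log2(0) is treated as 0."""
    total = p + n
    return sum(-(x / total) * math.log2(x / total) for x in (p, n) if x)

age_branches = [(2, 3), (4, 0), (3, 2)]        # (<=30), (30...40), (>40)

I_pn = info(9, 5)                                                 # ~0.940
E_age = sum((p + n) / 14 * info(p, n) for p, n in age_branches)   # ~0.69
print(f"Gain(age) = {I_pn - E_age:.3f}")                          # ~0.247
```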
Gini Index (IBM IntelligentMiner)
If a data set T contains examples from n classes, the gini index gini(T) is defined as
gini(T) = 1 − Σ_{j=1..n} pj², where pj is the relative frequency of class j in T.
If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index
of the split data is defined as
gini_split(T) = (N1/N)·gini(T1) + (N2/N)·gini(T2)
The attribute that provides the smallest gini_split(T) is chosen to split the node (all possible
splitting points for each attribute need to be enumerated).
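The two Gini formulas above, written out in Python (the example split at the end is hypothetical; it only reuses the 9 “yes” / 5 “no” class totals from the buys_computer data):

```python
def gini(class_counts):
    """gini(T) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(counts_T1, counts_T2):
    """Weighted gini index of splitting T into T1 and T2."""
    n1, n2 = sum(counts_T1), sum(counts_T2)
    n = n1 + n2
    return n1 / n * gini(counts_T1) + n2 / n * gini(counts_T2)

# Hypothetical binary split of a node holding 9 "yes" and 5 "no" examples:
print(gini_split([6, 1], [3, 4]))     # smaller values indicate purer splits
```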
Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand
Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
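Tree libraries can also print these paths directly. A hedged sketch using scikit-learn's export_text (the library and dataset are assumptions of this example); it lists every root-to-leaf path as nested conditions ending in a class, which read off as IF-THEN rules:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# One printed line per attribute test along each root-to-leaf path.
print(export_text(clf, feature_names=list(load_iris().feature_names)))
```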
Avoid Overfitting in Classification
The generated tree may overfit the training data
Too many branches, some may reflect anomalies due to noise or outliers
Results in poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: Halt tree construction early—do not split a node if this would result in the
goodness measure falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown” tree—get a sequence of
progressively pruned trees
Use a set of data different from the training data to decide which is the “best
pruned tree”
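A hedged sketch of both ideas with scikit-learn (library and dataset are assumptions of this example): prepruning via a goodness threshold set up front, and postpruning via cost-complexity pruning with a held-out set deciding the “best pruned tree”:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Prepruning alternative: halt splitting early with a threshold, e.g.
# DecisionTreeClassifier(min_impurity_decrease=0.01).

# Postpruning: compute the pruning path of a fully grown tree, fit one tree per
# pruning level, and keep the one that does best on data not used for training.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val),
)
print(best.get_n_leaves(), best.score(X_val, y_val))
```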
Approaches to Determine the Final Tree Size
Separate training (2/3) and testing (1/3) sets
Use cross validation, e.g., 10-fold cross validation
Use all the data for training
but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a
node may improve the entire distribution
Use minimum description length (MDL) principle:
halting growth of the tree when the encoding is minimized
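For the cross-validation approach, a minimal sketch (scikit-learn and its breast-cancer dataset are assumptions of this example): pick the tree size, here controlled by depth, that scores best under 10-fold cross validation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Mean 10-fold cross-validation accuracy for each candidate tree depth.
scores = {
    depth: cross_val_score(DecisionTreeClassifier(max_depth=depth, random_state=0),
                           X, y, cv=10).mean()
    for depth in range(1, 11)
}
best_depth = max(scores, key=scores.get)
print(best_depth, round(scores[best_depth], 3))
```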
Enhancements to basic decision tree induction
Allow for continuous-valued attributes
Dynamically define new discrete-valued attributes that partition the continuous
attribute value into a discrete set of intervals
Handle missing attribute values
Assign the most common value of the attribute
Assign probability to each of the possible values
Attribute construction
Create new attributes based on existing ones that are sparsely represented
This reduces fragmentation, repetition, and replication
Classification in Large Databases
Classification—a classical problem extensively studied by statisticians and machine learning
researchers
Scalability: Classifying data sets with millions of examples and hundreds of attributes with
reasonable speed
Why decision tree induction in data mining?
relatively faster learning speed (than other classification methods)
convertible to simple and easy to understand classification rules
can use SQL queries for accessing databases
comparable classification accuracy with other methods
Scalable Decision Tree Induction Methods in Data Mining Studies
SLIQ (EDBT’96 — Mehta et al.)
builds an index for each attribute; only the class list and the current attribute list reside
in memory
SPRINT (VLDB’96 — J. Shafer et al.)
constructs an attribute list data structure
PUBLIC (VLDB’98 — Rastogi & Shim)
integrates tree splitting and tree pruning: stop growing the tree earlier
RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
separates the scalability aspects from the criteria that determine the quality of the tree
builds an AVC-list (attribute, value, class label)
Data Cube-Based Decision-Tree Induction
Integration of generalization with decision-tree induction (Kamber et al., 1997).
Classification at primitive concept levels
E.g., precise temperature, humidity, outlook, etc.
Low-level concepts, scattered classes, bushy classification-trees
Semantic interpretation problems.
Cube-based multi-level classification
Relevance analysis at multi-levels.
Information-gain analysis with dimension + level.
Bayesian Classification
Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most practical
approaches to certain types of learning problems
Incremental: Each training example can incrementally increase/decrease the probability that a
hypothesis is correct. Prior knowledge can be combined with observed data.
Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
Standard: Even when Bayesian methods are computationally intractable, they can provide a
standard of optimal decision making against which other methods can be measured
Bayesian Theorem
Given training data D, the posterior probability of a hypothesis h, P(h | D), follows from Bayes’ theorem:
P(h | D) = P(D | h) · P(h) / P(D)
MAP (maximum a posteriori) hypothesis:
h_MAP = argmax_{h ∈ H} P(h | D) = argmax_{h ∈ H} P(D | h) · P(h)
Practical difficulty: require initial knowledge of many probabilities, significant computational
cost
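A small worked example of Bayes' theorem in Python, using counts that appear later in this unit's play-tennis data (P(p) = 9/14, P(n) = 5/14, P(sunny | p) = 2/9, P(sunny | n) = 3/5):

```python
# Posterior of each class given the single observation outlook = sunny.
prior = {"p": 9 / 14, "n": 5 / 14}
likelihood_sunny = {"p": 2 / 9, "n": 3 / 5}

unnormalized = {h: likelihood_sunny[h] * prior[h] for h in prior}   # P(D|h)P(h)
evidence = sum(unnormalized.values())                               # P(D)
posterior = {h: v / evidence for h, v in unnormalized.items()}      # P(h|D)

h_map = max(posterior, key=posterior.get)     # MAP hypothesis: largest P(D|h)P(h)
print(posterior, "MAP:", h_map)               # posterior = {p: 0.4, n: 0.6}
```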
Naïve Bayes Classifier
A simplified assumption: attributes are conditionally independent:
P(Cj | V) ∝ P(Cj) · Π_{i=1..n} P(vi | Cj)
Greatly reduces the computation cost; only the class distributions need to be counted.
Given a training set, we can compute the probabilities
Outlook       P     N        Humidity     P     N
sunny         2/9   3/5      high         3/9   4/5
overcast      4/9   0        normal       6/9   1/5
rain          3/9   2/5

Temperature   P     N        Windy        P     N
hot           2/9   2/5      true         3/9   3/5
mild          4/9   2/5      false        6/9   2/5
cool          3/9   1/5
Bayesian classification
The classification problem may be formalized using a-posteriori probabilities:
P(C|X) = prob. that the sample tuple X=<x1,…,xk> is of class C.
E.g. P(class=N | outlook=sunny,windy=true,…)
Idea: assign to sample X the class label C such that P(C|X) is maximal
Estimating a-posteriori probabilities
Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
P(X) is constant for all classes
P(C) = relative freq of class C samples
C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum
Problem: computing P(X|C) directly is infeasible!
Naïve Bayesian Classification
Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
If i-th attribute is categorical:
P(xi|C) is estimated as the relative freq of samples having value xi as i-th attribute in class C
If i-th attribute is continuous:
P(xi|C) is estimated through a Gaussian density function
Computationally easy in both cases
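A sketch of the continuous case: estimate a class-conditional Gaussian from the values of the attribute observed in class C, then evaluate its density at xi (the temperature readings below are hypothetical, since the play-tennis data here is categorical):

```python
import math

def gaussian_density(x, values):
    """Density of a Gaussian fitted to the class-conditional attribute values."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / (len(values) - 1)   # sample variance
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

temps_in_class_P = [21.0, 24.0, 19.5, 22.0, 23.5]   # hypothetical training values
print(gaussian_density(22.5, temps_in_class_P))     # used as P(x_i | C) in the product
```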
Play-tennis example: estimating P(xi|C)
Outlook Temperature Humidity Windy Class
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
Play-tennis example: classifying X
An unseen sample X = <rain, hot, high, false>
P(X|p)·P(p) =
P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582
P(X|n)·P(n) =
P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286
Sample X is classified in class n (don’t play)
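The whole example can be reproduced directly from the play-tennis table above: count relative frequencies per class, then take the naive-Bayes product P(x1|C)···P(xk|C)·P(C) for C in {P, N}.

```python
from collections import Counter, defaultdict

data = [  # (outlook, temperature, humidity, windy, class) -- rows of the table
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
]

class_counts = Counter(row[-1] for row in data)
cond = defaultdict(Counter)                  # cond[class][(attr_index, value)] = count
for row in data:
    for i, value in enumerate(row[:-1]):
        cond[row[-1]][(i, value)] += 1

def score(x, c):
    p = class_counts[c] / len(data)                      # P(C)
    for i, value in enumerate(x):
        p *= cond[c][(i, value)] / class_counts[c]       # P(x_i | C)
    return p

x = ("rain", "hot", "high", "false")
print(score(x, "P"), score(x, "N"))    # ~0.010582 and ~0.018286 -> classified as n
```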
The independence hypothesis…
… makes computation possible
… yields optimal classifiers when satisfied
… but is seldom satisfied in practice, as attributes (variables) are often correlated.
Attempts to overcome this limitation:
Bayesian networks, which combine Bayesian reasoning with causal relationships between
attributes
Decision trees, which reason on one attribute at a time, considering the most important
attributes first
Bayesian Belief Networks (I)