4 & 5 DWM 2024-25
CLASSIFICATION
Basic Concepts - Decision Tree Induction - Bayes Classification Methods - Rule-Based Classification-
Bayesian Belief Networks - Support Vector Machines - Other Classification Methods.
4.1 Classification:
Predicts categorical class labels
Classifies data (constructs a model) based on the training set and the values (class labels) in a
classifying attribute and uses it in classifying new data
Example: A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe. A marketing manager at a company needs to analyze the profile of a customer in order to predict whether that customer will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical labels.
These labels are risky or safe for loan application data and yes or no for marketing data.
Prediction:
Models continuous-valued functions, i.e., predicts unknown or missing values
Example: Suppose the marketing manager needs to predict how much a given customer will spend during a
sale at his company. Therefore the data analysis task is an example of numeric prediction. In this case, a
model or a predictor will be constructed that predicts a continuous-valued-function or ordered value.
Typical applications:
Credit approval
Target marketing
Medical diagnosis
Fraud detection
The Data Classification process includes two steps −
Building the Classifier or Model
Using Classifier for Classification
Building the Classifier or Model:
This step is the learning step or the learning phase.
In this step the classification algorithms build the classifier.
The classifier is built from the training set made up of database tuples and their associated class
labels.
Each tuple that constitutes the training set belongs to a predefined category or class. These tuples can
also be referred to as samples, objects, or data points.
Figure 4.1 Building the Classifier or Model
Using Classifier for Classification:
In this step, the classifier is used for classification. Here the test data is used to estimate the accuracy of
classification rules. The classification rules can be applied to the new data tuples if the accuracy is considered
acceptable.
Figure 4.3 Decision tree
The benefits of having a decision tree are as follows −
It does not require any domain knowledge.
It is easy to comprehend.
The learning and classification steps of a decision tree are simple and fast.
Decision Tree Induction Algorithm:
In this algorithm, there is no backtracking; the trees are constructed in a top-down recursive
divide-and-conquer manner.
The tree starts as a single node, N, representing the training tuples in D (step 1)
If the tuples in D are all of the same class, then node N becomes a leaf and is labeled with that
class (steps 2 and 3).
Note that steps 4 and 5 are terminating conditions. All of the terminating conditions are
explained at the end of the algorithm.
Otherwise, the algorithm calls Attribute selection method to determine the splitting criterion.
The splitting criterion tells us which attribute to test at node N by determining the "best" way
to separate or partition the tuples in D into individual classes (step 6).
The splitting criterion also tells us which branches to grow from node N with respect to the
outcomes of the chosen test. More specifically, the splitting criterion indicates the splitting
attribute and may also indicate either a split-point or a splitting subset. The splitting criterion is
determined so that, ideally, the resulting partitions at each branch are as "pure" as possible.
A partition is pure if all of the tuples in it belong to the same class. In other words, if we were to
split up the tuples in D according to the mutually exclusive outcomes of the splitting criterion,
we hope for the resulting partitions to be as pure as possible.
The node N is labeled with the splitting criterion, which serves as a test at the node (step 7). A
branch is grown from node N for each of the outcomes of the splitting criterion. The tuples in D
are partitioned accordingly (steps 10 to 11).
Generating a decision tree from the training tuples of data partition D
Algorithm: Generate_decision_tree
Input:
Data partition, D, which is a set of training tuples and their associated class labels;
attribute_list, the set of candidate attributes;
Attribute_selection_method, a procedure to determine the splitting criterion that best partitions the data tuples into individual classes. This criterion includes a splitting_attribute and either a split-point or a splitting subset.
Output: A decision tree.
Method:
(1) create a node N;
(2) if tuples in D are all of the same class, C, then
(3) return N as a leaf node labeled with the class C;
(4) if attribute_list is empty then
(5) return N as a leaf node labeled with the majority class in D; // majority voting
(6) apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
(7) label node N with splitting_criterion;
(8) if splitting_attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees
(9) attribute_list = attribute_list - splitting_attribute; // remove splitting_attribute
(10) for each outcome j of splitting_criterion
// partition the tuples and grow subtrees for each partition
(11) let Dj be the set of data tuples in D satisfying outcome j; // a partition
(12) if Dj is empty then
(13) attach a leaf labeled with the majority class in D to node N;
(14) else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
(15) return N;
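A minimal Python sketch of the procedure above, for illustration only: training tuples are assumed to be dicts of attribute values plus a 'class' key, all attributes are treated as discrete-valued, and the attribute selection method is passed in as a callback (for example, one based on information gain, defined next).

from collections import Counter

class Node:
    def __init__(self, label=None, attribute=None):
        self.label = label          # class label if this node is a leaf
        self.attribute = attribute  # splitting attribute if internal node
        self.branches = {}          # outcome value -> child Node

def majority_class(D):
    return Counter(t['class'] for t in D).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list, attribute_selection_method):
    classes = {t['class'] for t in D}
    if len(classes) == 1:                        # steps 2-3: pure partition -> leaf
        return Node(label=classes.pop())
    if not attribute_list:                       # steps 4-5: majority voting
        return Node(label=majority_class(D))
    split_attr = attribute_selection_method(D, attribute_list)    # step 6
    N = Node(attribute=split_attr)               # step 7
    remaining = [a for a in attribute_list if a != split_attr]    # steps 8-9
    for outcome in {t[split_attr] for t in D}:   # steps 10-11: one branch per outcome
        Dj = [t for t in D if t[split_attr] == outcome]
        if not Dj:                               # steps 12-13 (kept for fidelity to the pseudocode)
            N.branches[outcome] = Node(label=majority_class(D))
        else:                                    # step 14: recurse on the partition
            N.branches[outcome] = generate_decision_tree(
                Dj, remaining, attribute_selection_method)
    return N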
Information gain: ID3 uses information gain as its attribute selection measure.
The expected information (entropy) needed to classify a tuple in D is Info(D) = -Σi pi log2(pi), where pi is the probability that a tuple in D belongs to class Ci. The information still required after partitioning D on attribute A into v partitions D1, ..., Dv is InfoA(D) = Σj (|Dj| / |D|) × Info(Dj).
Information gain is defined as the difference between the original information requirement (i.e., based on just
the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,
Gain(A) = Info(D) - InfoA(D)
In other words, Gain(A) tells us how much would be gained by branching on A. It is the expected reduction in
the information requirement caused by knowing the value of A. The attribute A with the highest information
gain, Gain(A), is chosen as the splitting attribute at node N.
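A short Python sketch of Info(D), InfoA(D) and Gain(A) as defined above; the dict-based tuple format and the toy data are assumptions made for illustration.

import math
from collections import Counter

def info(D):
    # Expected information (entropy) needed to classify a tuple in D.
    total = len(D)
    counts = Counter(t['class'] for t in D)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_after_split(D, A):
    # Info_A(D): weighted entropy after partitioning D on attribute A.
    total = len(D)
    partitions = {}
    for t in D:
        partitions.setdefault(t[A], []).append(t)
    return sum((len(Dj) / total) * info(Dj) for Dj in partitions.values())

def gain(D, A):
    # Gain(A) = Info(D) - Info_A(D)
    return info(D) - info_after_split(D, A)

# Hypothetical toy data: 'windy' separates the classes perfectly, 'humid' does not.
D = [{'windy': 'yes', 'humid': 'high', 'class': 'play'},
     {'windy': 'yes', 'humid': 'low',  'class': 'play'},
     {'windy': 'no',  'humid': 'high', 'class': 'stay'},
     {'windy': 'no',  'humid': 'low',  'class': 'stay'}]
print(gain(D, 'windy'), gain(D, 'humid'))   # 1.0 and 0.0

With the tree builder sketched earlier, max(attribute_list, key=lambda a: gain(D, a)) could serve as the attribute selection callback.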
Example: Induction of a decision tree using information gain.

Example: Naïve Bayesian classification (Spam / Not Spam)
Dataset: a small set of e-mails, 4 labeled Spam and 4 labeled Not Spam, described by word-occurrence features such as "Cheap".
Goal: classify a new e-mail as Spam or Not Spam.
Solution:
The prior probabilities represent the likelihood of each class (Spam/Not Spam) based on the dataset.
P(Spam) = 4/8 = 0.5
P(Not Spam) = 4/8 = 0.5
The likelihood represents the probability of observing a feature value given a class. Use the frequency of
occurrences.
For Spam: the likelihood of each observed feature value is computed from its frequency among the Spam e-mails.
For Not Spam:
P(Cheap = Yes | Not Spam) = 0/4 = 0
P(Class|Data) = P(Data|Class) · P(Class) / P(Data)
Since P(Data) is the same for both classes, we only need to compute the numerator:
For Spam:
P(Spam|Data) ∝ 0.5 · 0.75 · 0.25 · 0.5 = 0.046875
Step 6: Conclusion. Since P(Cheap = Yes | Not Spam) = 0 makes the Not Spam score 0, while the Spam score is 0.046875 > 0, the new e-mail is classified as Spam.
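A minimal Python sketch of the computation in this example: multiply the class prior by the likelihood of each observed feature value and compare the unnormalized scores. The factors for Spam are the ones shown above; the non-zero Not Spam factors and the feature names are illustrative placeholders, since only P(Cheap = Yes | Not Spam) = 0 is given here.

def naive_bayes_score(prior, likelihoods):
    score = prior                    # P(Class)
    for p in likelihoods:            # multiply by P(feature value | Class)
        score *= p
    return score

spam_score     = naive_bayes_score(0.5, [0.75, 0.25, 0.5])   # = 0.046875, as above
not_spam_score = naive_bayes_score(0.5, [0.0, 0.5, 0.5])     # 0.0, because P(Cheap=Yes|Not Spam)=0

prediction = 'Spam' if spam_score > not_spam_score else 'Not Spam'
print(spam_score, not_spam_score, prediction)                # 0.046875 0.0 Spam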
Bayesian Belief Networks:
A Bayesian Belief Network provides a graphical model of causal relationships on which learning can be performed.
We can use a trained Bayesian Network for classification.
There are two components that define a Bayesian Belief Network −
Directed acyclic graph
A set of conditional probability tables
Directed Acyclic Graph:
Each node in a directed acyclic graph represents a random variable.
These variables may be discrete or continuous valued.
These variables may correspond to actual attributes given in the data.
Directed Acyclic Graph Representation
The following diagram shows a directed acyclic graph for six Boolean variables.
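Since the figure is not reproduced here, the following Python sketch illustrates the two components on a smaller, hypothetical network: a DAG with two parent nodes (FamilyHistory, Smoker, names borrowed from a common textbook example) feeding one child node (LungCancer), plus a conditional probability table, and how they combine by summing over parent configurations. All probability values are made up.

from itertools import product

# Priors for the root nodes (nodes with no parents).
P_family_history = {True: 0.1, False: 0.9}
P_smoker         = {True: 0.3, False: 0.7}

# CPT for LungCancer given its parents (FamilyHistory, Smoker).
P_cancer_given = {
    (True,  True):  0.8,
    (True,  False): 0.5,
    (False, True):  0.4,
    (False, False): 0.1,
}

def prob_lung_cancer():
    # P(LungCancer = true), summing over all parent configurations.
    total = 0.0
    for fh, sm in product([True, False], repeat=2):
        joint_parents = P_family_history[fh] * P_smoker[sm]   # independent root nodes
        total += joint_parents * P_cancer_given[(fh, sm)]
    return total

print(prob_lung_cancer())   # 0.024 + 0.035 + 0.108 + 0.063 = 0.23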
Support Vector Machines (SVM):
SVM uses a nonlinear mapping to transform the original training data into a higher dimension. Within this new
dimension, it searches for the linear optimal separating hyperplane (i.e., decision boundary).
With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can
always be separated by a hyperplane.
SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the
support vectors).
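A minimal sketch, assuming scikit-learn is available: the RBF kernel plays the role of the nonlinear mapping described above, and the fitted model exposes the support vectors that define the separating hyperplane. The toy data below is made up (an XOR-like pattern that is not linearly separable in 2-D).

from sklearn import svm

X = [[0, 0], [0.2, 0.1], [1, 1], [0.9, 1.1],   # class 0
     [0, 1], [0.1, 0.9], [1, 0], [1.1, 0.1]]   # class 1
y = [0, 0, 0, 0, 1, 1, 1, 1]

clf = svm.SVC(kernel='rbf', C=1.0)   # nonlinear kernel = implicit higher-dimensional mapping
clf.fit(X, y)

print(clf.support_vectors_)          # the "essential" training tuples
print(clf.predict([[0.5, 0.05]]))    # classify a new tuple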
Figure 4.7 Linearly separable 2-D training data
Rough Set Approach:
Figure 4.9 Rough set approach
Lower Approximation of C − The lower approximation of C consists of all the data tuples that, based on
the knowledge of the attributes, are certain to belong to class C.
Upper Approximation of C − The upper approximation of C consists of all the tuples that, based on the
knowledge of the attributes, cannot be described as not belonging to C.
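A small Python sketch of these two approximations: tuples that are indiscernible on the available attributes form equivalence classes, the lower approximation keeps only the classes lying entirely inside C, and the upper approximation keeps every class that overlaps C. The toy tuples and attribute names are assumptions.

def equivalence_classes(tuples, attributes):
    blocks = {}
    for name, row in tuples.items():
        key = tuple(row[a] for a in attributes)   # indiscernible tuples share this key
        blocks.setdefault(key, set()).add(name)
    return blocks.values()

def approximations(tuples, attributes, C):
    lower, upper = set(), set()
    for block in equivalence_classes(tuples, attributes):
        if block <= C:     # certainly in C
            lower |= block
        if block & C:      # possibly in C
            upper |= block
    return lower, upper

# Hypothetical loan tuples: x3 and x4 are indiscernible but only x3 is in C.
tuples = {'x1': {'income': 'high', 'debt': 'low'},
          'x2': {'income': 'low',  'debt': 'high'},
          'x3': {'income': 'high', 'debt': 'high'},
          'x4': {'income': 'high', 'debt': 'high'}}
C = {'x1', 'x3'}          # tuples known to belong to class "safe"
print(approximations(tuples, ['income', 'debt'], C))
# lower approximation = {'x1'}, upper approximation = {'x1', 'x3', 'x4'}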
Part B & C
1. Explain algorithm for constructing a decision tree from training samples.
2. Write Bayes theorem. Describe in detail about the following Classification methods.
a. Bayesian classification
b. Fuzzy set approach
c. Genetic algorithms
3. Generalize the Bayes theorem of posterior probability and explain the working of a Naïve Bayesian
classifier with an example. (Apr/May 2024)
4. Explain in detail about the Naive Bayesian classification method.(Nov/Dec 2024)
5. Formulate rule based classification techniques. (Apr/May 2024)
6. Explain in detail about Bayesian Belief Networks.
7. What are Bayesian classifiers? Explain.
8. Explain about support vector machines. (Apr/May 2024) (Nov/Dec 2024)
9. Apply tree pruning in decision tree induction. What is a drawback of using a separate set of tuples
to evaluate pruning? (Nov/Dec 2024)
UNIT V
Cluster Analysis – Partitioning methods – Hierarchical methods – Density based methods – Grid based
methods – Clustering in high dimensional data - Outliers and Outlier Analysis – Outlier detection methods
What is Cluster?
Cluster is a group of objects that belong to the same class.
Clustering is basically a type of unsupervised learning method. An unsupervised learning method is a
method in which we draw references from datasets consisting of input data without labeled
responses.
Clustering is the task of dividing the population or data points into a number of groups such that data
points in the same group are more similar to one another than to the data points in other groups.
In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in other clusters.
While doing cluster analysis, we first partition the set of data into groups based on data similarity and then
assign labels to the groups.
Figure 5.1 Cluster Analysis
Applications of Cluster Analysis
Cluster analysis is used in market research, pattern recognition, data analysis, and image processing.
In market research, it helps characterize customer groups based on purchasing patterns.
In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with similar
functionality, and gain insight into structures inherent in populations.
Clustering also helps in the identification of areas of similar land use in an earth observation database, and
in the identification of groups of houses in a city according to house type, value, and geographic location.
Clustering also helps in classifying documents on the web for information discovery.
Clustering is also used in outlier detection applications such as detection of credit card fraud.
As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data and to
observe the characteristics of each cluster.
Clustering Methods
The clustering methods can be classified into the following categories:
Partitioning Method (e.g., K-means)
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
5.2 Partitioning Method
Suppose we are given a database of n objects; the partitioning method constructs k partitions of the data.
Each partition represents a cluster and k ≤ n. This means the data is classified into k groups, which
satisfy the following requirements:
Each group contains at least one object.
Each object must belong to exactly one group.
Typical methods:
K-means, k-medoids, CLARANS
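A minimal Python sketch of the k-means idea (assign each point to the nearest centroid, then recompute each centroid as the mean of its cluster); the initialization and the choice of k are assumptions, and the sample points are the five two-dimensional points from exercise 15 in Part B & C below.

import math

def kmeans(points, k, max_iter=100):
    centroids = list(points[:k])                 # naive initialization: first k points
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                         # assignment step
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        new_centroids = [                        # update step: mean of each cluster
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:           # converged: no change
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0, 2.5), (0, 0), (1.5, 0), (5, 0), (5, 2)]
print(kmeans(points, k=2))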
The k-means method is sensitive to outliers: because such objects are far away from the majority of the data,
when assigned to a cluster they can dramatically distort the mean value of the cluster. This
inadvertently affects the assignment of other objects to clusters.
Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on medoid or central objects.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
1. arbitrarily choose k objects in D as the initial representative objects or seeds;
2. repeat
3. assign each remaining object to the cluster with the nearest representative object;
4. randomly select a nonrepresentative object, Orandom;
5. compute the total cost, S, of swapping representative object, Oj, with Orandom;
6. if S < 0 then swap Oj with Orandom to form the new set of k representative objects;
7. until no change;
5.3 Hierarchical Method
Agglomerative Approach
This approach is also known as the bottom-up approach. We start with each object forming a separate
group and keep merging the objects or groups that are close to one another. This continues until all
of the groups are merged into one or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. We start with all of the objects in the same
cluster. In each successive iteration, a cluster is split into smaller clusters. This is done until each object is
in a cluster of its own or until the termination condition holds.
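A naive Python sketch of the agglomerative (bottom-up) approach using single-link distance: every object starts in its own group and the two closest groups are merged repeatedly until a chosen number of clusters remains. The points, the linkage choice and the termination condition are assumptions.

import math

def single_link(c1, c2):
    # Single-link distance: closest pair of points across the two clusters.
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerative(points, num_clusters):
    clusters = [[p] for p in points]             # each object forms a separate group
    while len(clusters) > num_clusters:
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters[j])          # merge the two closest groups
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerative(points, num_clusters=3))
# [[(0, 0), (0, 1)], [(5, 5), (5, 6)], [(10, 0)]]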
Figure 5.2 Hierarchical Methods
Disadvantage
This method is rigid, i.e., once a merge or split is done, it can never be undone.
BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to
preserve the inherent clustering structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
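A minimal Python sketch of BIRCH's clustering feature, CF = (N, LS, SS): the count, linear sum and square sum of the points in a sub-cluster. CFs are additive, which is what allows the tree to be built incrementally; the CF-tree insertion and node-splitting logic is omitted here.

import math

class CF:
    def __init__(self, dim):
        self.n = 0                  # N: number of points
        self.ls = [0.0] * dim       # LS: linear sum per dimension
        self.ss = 0.0               # SS: sum of squared norms

    def add_point(self, p):
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, p)]
        self.ss += sum(x * x for x in p)

    def merge(self, other):
        # CF additivity: CF1 + CF2 summarizes the union of two sub-clusters.
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        return [x / self.n for x in self.ls]

    def radius(self):
        # Average distance from the points to the centroid, computed from N, LS, SS only.
        c2 = sum(x * x for x in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))

cf = CF(dim=2)
for p in [(1, 1), (2, 1), (1, 2)]:
    cf.add_point(p)
print(cf.centroid(), cf.radius())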
Figure 5.3 BIRCH Methods
ROCK (1999): clustering categorical data by neighbor and link analysis (RObust Clustering using linKs).
Major ideas
Use links to measure similarity/proximity
Not distance-based
Computational complexity: O(n² + n·mm·ma + n²·log n), where mm is the maximum number of neighbors and ma is the average number of neighbors of a point.
Algorithm: sampling-based clustering
Draw random sample
Cluster with links
Label the data on disk
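A small Python sketch of the link idea: two transactions are neighbors when their Jaccard similarity reaches a threshold theta, and the number of links between two transactions is their number of common neighbors. The sample transactions and theta are assumptions.

def jaccard(a, b):
    return len(a & b) / len(a | b)

def neighbors(data, theta):
    return {i: {j for j in data if j != i and jaccard(data[i], data[j]) >= theta}
            for i in data}

def links(data, theta):
    # link(i, j) = number of common neighbors of transactions i and j.
    nb = neighbors(data, theta)
    return {(i, j): len(nb[i] & nb[j]) for i in data for j in data if i < j}

data = {1: {'a', 'b', 'c'},
        2: {'a', 'b', 'd'},
        3: {'a', 'b', 'e'},
        4: {'x', 'y', 'z'}}
print(links(data, theta=0.4))
# e.g. link(1, 2) = 1, because transaction 3 is a neighbor of both 1 and 2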
5.4 Density-based Method
Clustering based on density (local cluster criterion), such as density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-neighbourhood of that point
Typical methods: DBSCAN, OPTICS, DenClue
DBSCAN: Density Based Spatial Clustering of Applications with Noise
Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with noise
Algorithm:
Arbitrarily select a point p.
Retrieve all points density-reachable from p w.r.t. Eps and MinPts. If p is a core point, a cluster is
formed.
If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of
the database.
Continue the process until all of the points have been processed.
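A minimal Python sketch following the steps above: core points (at least MinPts neighbors within Eps) start clusters and expand them through density-reachable points, while points reached from no core point remain noise. The toy points, Eps and MinPts are assumptions.

import math

def region_query(points, p, eps):
    return [q for q in range(len(points)) if math.dist(points[p], points[q]) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)            # None = unvisited, -1 = noise
    cluster_id = 0
    for p in range(len(points)):
        if labels[p] is not None:
            continue
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:         # not a core point: mark as noise for now
            labels[p] = -1
            continue
        labels[p] = cluster_id               # p is a core point: a cluster is formed
        seeds = list(neighbors)
        while seeds:                         # expand the cluster
            q = seeds.pop()
            if labels[q] == -1:              # previously noise: becomes a border point
                labels[q] = cluster_id
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            q_neighbors = region_query(points, q, eps)
            if len(q_neighbors) >= min_pts:  # q is also a core point: keep expanding
                seeds.extend(q_neighbors)
        cluster_id += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8), (50, 50)]
print(dbscan(points, eps=2.0, min_pts=3))    # two clusters and one noise point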
5.5 Grid-Based Method
Typical methods: STING, WaveCluster, CLIQUE
STING: a Statistical Information Grid approach
The spatial area is divided into rectangular cells
There are several levels of cells corresponding to different levels of resolution
Each cell at a high level is partitioned into a number of smaller cells in the next lower level
Statistical info of each cell is calculated and stored beforehand and is used to answer queries
Parameters of higher level cells can be easily calculated from the parameters of lower level cells: count,
mean, standard deviation (s), min, max, and
type of distribution (normal, uniform, etc.)
Use a top-down approach to answer spatial data queries
Start from a pre-selected layer—typically with a small number of cells
For each cell in the current level compute the confidence interval
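A small Python sketch of the grid idea: statistics of the bottom-level cells (count, sum, min, max of an attribute) are precomputed from the data, and the statistics of each higher-level cell are then derived from its child cells instead of rescanning the data. The grid size, the spatial extent and the sample points are assumptions.

from collections import defaultdict

def new_cell():
    return {'count': 0, 'sum': 0.0, 'min': float('inf'), 'max': float('-inf')}

def cell_stats(points, cells_per_side, area=10.0):
    size = area / cells_per_side
    stats = defaultdict(new_cell)
    for x, y, value in points:               # bottom level: scan the data once
        s = stats[(int(x // size), int(y // size))]
        s['count'] += 1
        s['sum'] += value
        s['min'] = min(s['min'], value)
        s['max'] = max(s['max'], value)
    return stats

def roll_up(child_stats):
    # Derive the next (coarser) level purely from the child-cell statistics.
    parents = defaultdict(new_cell)
    for (i, j), s in child_stats.items():
        p = parents[(i // 2, j // 2)]
        p['count'] += s['count']
        p['sum'] += s['sum']
        p['min'] = min(p['min'], s['min'])
        p['max'] = max(p['max'], s['max'])
    return parents

points = [(1.0, 1.0, 5.0), (1.5, 2.0, 7.0), (8.0, 8.5, 2.0)]   # (x, y, attribute value)
low = cell_stats(points, cells_per_side=4)    # 4 x 4 bottom-level cells
print(dict(roll_up(low)))                     # 2 x 2 cells at the next level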
Wavelet transformation (used by WaveCluster) allows natural clusters to become more distinguishable.
5.6 Clustering High-Dimensional Data
Subspace Clustering Methods: Subspace clustering approaches search for clusters existing in
subspaces of the given high-dimensional data space, where a subspace is defined by a subset of
attributes in the full space.
There are 3 Subspace Clustering Methods:
Subspace search methods
Correlation-based clustering methods
Biclustering methods
Figure 5.7 Subspace Clustering Methods
5.7 Outlier Detection Methods
Statistical distribution-based outlier detection: identifies outliers with respect to a distribution model of the data
using a discordancy test.
How the discordancy test works:
The data are assumed to be generated under a working hypothesis, H (the assumed distribution model).
Each data object in the dataset is compared against the working hypothesis and is either accepted under H
or rejected as discordant, i.e., assigned to an alternative hypothesis (an outlier).
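A simplified stand-in for a discordancy test, assuming the working hypothesis H is that the data follow a normal distribution: values whose z-score exceeds a chosen threshold are rejected as discordant. A formal test (e.g., Grubbs' test) would use a proper critical value; the data and the threshold here are assumptions.

import statistics

def discordant(values, threshold=2.5):
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]

data = [10, 11, 9, 10, 12, 11, 10, 9, 50]   # 50 is far from the rest
print(discordant(data))                      # [50]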
Figure 5.8 Discordancy test
Distance-Based outlier detection
Introduced to counter the main limitations imposed by statistical methods:
We need multi-dimensional analysis without knowing the data distribution.
Algorithms for mining distance-based outliers include the index-based, nested-loop, and cell-based algorithms.
Distance-based outlier detection is based on global distance distribution
It encounters difficulties to identify outliers if data is not uniformly distributed
Ex. C1 contains 400 loosely distributed points, C2 has 100 tightly condensed points, 2 outlier
points o1, o2
Some outliers can be defined as global outliers, some can be defined as local outliers to a given
cluster
O2 would not normally be considered an outlier with regular distance-based outlier detection, since
it looks at the global picture
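A minimal sketch of the DB(pct, dmin) notion of a distance-based outlier: an object is flagged when at least a fraction pct of the remaining objects lie farther than dmin from it. This is the global view discussed above; pct, dmin and the points are assumptions.

import math

def distance_based_outliers(points, pct, dmin):
    outliers = []
    for p in points:
        far = sum(1 for q in points if q != p and math.dist(p, q) > dmin)
        if far / (len(points) - 1) >= pct:   # enough of the data set is far away
            outliers.append(p)
    return outliers

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(distance_based_outliers(points, pct=0.9, dmin=3.0))   # [(10, 10)]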
Each data object is assigned a local outlier factor (LOF).
An outlying object that lies near a dense cluster receives a higher LOF than one that lies near a sparse
cluster, because LOF compares an object's local density with the density of its neighbors. LOF varies
according to the parameter MinPts.
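A minimal sketch, assuming scikit-learn is available: LocalOutlierFactor assigns each object a local outlier factor relative to its nearest neighbors (n_neighbors plays the role of MinPts). The points form a miniature analogue of the C1/C2/o1/o2 example above; which points get flagged depends on the parameters.

from sklearn.neighbors import LocalOutlierFactor

X = [[0, 0], [0, 2], [2, 0], [2, 2],        # a loose group
     [5.0, 5.0], [5.05, 5.0], [5.0, 5.05],  # a tight group
     [3.5, 3.5], [5.5, 5.0]]                # two stray points

lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)                 # -1 marks points flagged as outliers
print(labels)
print(lof.negative_outlier_factor_)         # more negative = stronger outlier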
PART A
1. Define cluster.
2. What is clustering?
3. List the various clustering methods. (Apr/May 2024)
4. Identify what changes you make to solve the problem in cluster analysis.
5. Classify the typical phases of outlier detection methods.
6. List the challenges of outlier detection.
7. Classify the hierarchical clustering methods.
8. Compare Density based and grid-based cluster analysis. (Apr/May 2024)
9. Define outlier analysis.
10. List the various applications of cluster analysis.
11. What are the requirements of cluster analysis in data mining?
12. Define DBSCAN.
13. Define outlier.
14. How do clustering and Nearest Neighbour prediction algorithms work? (Nov/Dec 2024)
15. What is an outlier? Mention the methods of detecting outliers. (Nov/Dec 2024)
PART B & C
1. Discuss the different hierarchical methods in cluster analysis.
2. Discuss the different types of data in cluster analysis.
3. Interpret following clustering algorithm using examples.
a) K-means
b) K-medoid.
4. Explain the hierarchical based method for cluster analysis.
5. Illustrate the concepts
a. CLIQUE
b. DBSCAN
6. Explain the hierarchical based method for cluster analysis.
7. Explain in detail about density based methods. (Apr/May 2024)
8. How would you discuss the outlier analysis in detail? (Apr/May 2024)
9. Discuss in detail about the various detection techniques in outlier. (Nov/Dec 2024)
10. Explain in detail about the grid based method? (Nov/Dec 2024)
11. Build the algorithm for DBSCAN.
12. Illustrate the k-means partitioning algorithm.
13. Categorize outlier detection methods according to whether the sample of data for analysis is given with
domain expert- provided labels that can be used to build an outlier detection model.
14. Develop a clustering high dimensional data. (Nov/Dec 2024)
15. Consider five points { X1, X2,X3, X4, X5} with the following coordinates as a two dimensional sample
for clustering: X1 = (0,2.5); X2 = (0,0); X3= (1.5,0); X4 = (5,0); X5 = (5,2)
Compose the K-means partitioning algorithm using the above data set. (Apr/May 2024)
16. Consider that the data mining task is to cluster the following eight points A1, A2, A3, B1, B2, B3, C1 and
C2 (with (X, Y) representing location) into three clusters: A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5),
B3(6,4), C1(1,2), C2(4,9). The distance function is Euclidean distance. Suppose initially we assign A1,
B1 and C1 as the center of each cluster, respectively. Analyze the K-means algorithm to show the
three cluster centers after the first round of execution and the final three clusters.