
UNIT IV

CLASSIFICATION

Basic Concepts - Decision Tree Induction - Bayes Classification Methods - Rule-Based Classification-
Bayesian Belief Networks - Support Vector Machines - Other Classification Methods.

4.1 Classification:
 Predicts categorical class labels
 Classifies data by constructing a model based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
Example: A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe. A marketing manager at a company needs to analyze whether a customer with a given profile will buy a new computer.
 In both of the above examples, a model or classifier is constructed to predict the categorical labels.
These labels are risky or safe for loan application data and yes or no for marketing data.

Prediction:
Models continuous-valued functions, i.e., predicts unknown or missing values
Example: Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. This data analysis task is therefore an example of numeric prediction. In this case, a model or predictor is constructed that predicts a continuous-valued function, or an ordered value.
Typical applications:
 Credit approval
 Target marketing
 Medical diagnosis
 Fraud detection
The Data Classification process includes two steps −
 Building the Classifier or Model
 Using Classifier for Classification
Building the Classifier or Model:
 This step is the learning step or the learning phase.
 In this step the classification algorithms build the classifier.
 The classifier is built from the training set made up of database tuples and their associated class
labels.
 Each tuple that constitutes the training set is assumed to belong to a predefined class, as determined by the class-label attribute. These tuples can also be referred to as samples, objects, or data points.

Figure 4.1 Building the Classifier or Model
Using Classifier for Classification:
In this step, the classifier is used for classification. Here the test data is used to estimate the accuracy of
classification rules. The classification rules can be applied to the new data tuples if the accuracy is considered
acceptable.

Figure 4.2 Using Classifier for Classification


 Supervised learning (classification)
 Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
 New data is classified based on the training set.
 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc. with the aim of establishing the existence of
classes or clusters in the data
Issues Regarding Classification and Prediction
 Data cleaning: This refers to the preprocessing of data in order to remove or reduce noise (by applying smoothing techniques, for example) and to treat missing values (e.g., by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics).
 Relevance analysis: Many of the attributes in the data may be redundant. Correlation analysis can be used to identify whether any two given attributes are statistically related.
 Data Transformation and reduction− The data can be transformed by any of the following methods.
 Normalization− The data is transformed using normalization. Normalization involves scaling all values for a given attribute so that they fall within a small specified range. Normalization is used when, in the learning step, neural networks or methods involving distance measurements are used.
 Generalization− The data can also be transformed by generalizing it to a higher-level concept. For this purpose we can use concept hierarchies.
Comparing Classification and Prediction Methods
Classification and prediction methods can be compared and evaluated according to the following criteria:
 Accuracy
 Speed
 Robustness
 Scalability
 Interpretability

4.2 Classification By Decision Tree Induction


Decision tree
 A decision tree is a structure that includes a root node, branches, and leaf nodes.
 Each internal node denotes a test on an attribute, each branch denotes the outcome of a test,
and each leaf node holds a class label.
 The topmost node in the tree is the root node.
 Decision tree generation consists of two phases
 Tree construction
 At start, all the training examples are at the root
 Partition examples recursively based on selected attributes
 Tree pruning
 Identify and remove branches that reflect noise or outliers
The following decision tree is for the concept buy_computer that indicates whether a customer at a company
is likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node
represents a class.

Figure 4.3 Decision tree
The benefits of having a decision tree are as follows −
 It does not require any domain knowledge.
 It is easy to comprehend.
 The learning and classification steps of a decision tree are simple and fast.
Decision Tree Induction Algorithm:
 In this algorithm, there is no backtracking; the trees are constructed in a top-down recursive
divide-and-conquer manner.
 The tree starts as a single node, N, representing the training tuples in D (step 1)
 If the tuples in D are all of the same class, then node N becomes a leaf and is labeled with that
class (steps 2 and 3).
 Note that steps 4 and 5 are terminating conditions. All of the terminating conditions are
explained at the end of the algorithm.
 Otherwise, the algorithm calls the attribute selection method to determine the splitting criterion. The splitting criterion tells us which attribute to test at node N by determining the "best" way to separate or partition the tuples in D into individual classes (step 6).
 The splitting criterion also tells us which branches to grow from node N with respect to the
outcomes of the chosen test. More specifically, the splitting criterion indicates the splitting
attribute and may also indicate either a split-point or a splitting subset. The splitting criterion is
determined so that, ideally, the resulting partitions at each branch are as "pure" as possible.
 A partition is pure if all of the tuples in it belong to the same class. In other words, if we were to
split up the tuples in D according to the mutually exclusive outcomes of the splitting criterion,
we hope for the resulting partitions to be as pure as possible.

 The node N is labeled with the splitting criterion, which serves as a test at the node (step 7). A
branch is grown from node N for each of the outcomes of the splitting criterion. The tuples in D
are partitioned accordingly (steps 10 to 11).
Generating a decision tree from the training tuples of data partition D
Algorithm: Generate_decision_tree
Input:
Data partition, D, which is a set of training tuples
and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute_selection_method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion includes a
splitting_attribute and either a split-point or a splitting subset.
Output:
A decision tree
Method:
create a node N;
if tuples in D are all of the same class, C, then
return N as a leaf node labeled with class C;
if attribute_list is empty then
return N as a leaf node labeled
with the majority class in D; // majority voting
apply Attribute_selection_method(D, attribute_list)
to find the best splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and
multiway splits allowed then // not restricted to binary trees
attribute_list = attribute_list - splitting_attribute; // remove splitting_attribute
for each outcome j of splitting_criterion
// partition the tuples and grow subtrees for each partition
let Dj be the set of data tuples in D satisfying outcome j; // a partition
if Dj is empty then
attach a leaf labeled with the majority
class in D to node N;
else
attach the node returned by
Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;

Attribute Selection Measures


An attribute selection measure is a heuristic for selecting the splitting criterion that "best" separates a given data partition, D, of class-labeled training tuples into individual classes. Ideally, if we were to split D into smaller partitions according to the outcomes of the splitting criterion, each partition would be pure. If the splitting attribute is continuous-valued or if we are restricted to binary trees then, respectively, either a split-point or a splitting subset must also be determined as part of the splitting criterion.

Information gain: ID3 uses information gain as its attribute selection measure.

Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,
Gain(A) = Info(D) − Info_A(D)
where Info(D) = −Σi pi log2(pi) is the expected information (entropy) needed to classify a tuple in D, and Info_A(D) = Σj (|Dj|/|D|) × Info(Dj) is the information still required after partitioning D on attribute A.
In other words, Gain(A) tells us how much would be gained by branching on A. It is the expected reduction in the information requirement caused by knowing the value of A. The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N.
Example Induction of a decision tree using information gain.
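To make the measure concrete, the following short Python sketch (not part of the original algorithm) computes Info(D), Info_A(D), and Gain(A) for a small hypothetical set of training tuples; the attribute and class values are invented purely for illustration.

import math
from collections import Counter

def info(labels):
    # Info(D) = -sum(p_i * log2(p_i)): expected information (entropy) needed to classify a tuple in D
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, attr_index, labels):
    # Gain(A) = Info(D) - Info_A(D), where Info_A(D) weights each partition by its relative size
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    info_a = sum(len(part) / total * info(part) for part in partitions.values())
    return info(labels) - info_a

# hypothetical training tuples (age, student) with class label buys_computer
rows = [("youth", "no"), ("youth", "yes"), ("middle", "no"), ("senior", "yes"), ("senior", "no")]
labels = ["no", "yes", "yes", "yes", "no"]
print(info_gain(rows, 0, labels))   # gain from splitting on 'age'
print(info_gain(rows, 1, labels))   # gain from splitting on 'student'

The attribute whose call returns the larger value would be chosen as the splitting attribute at the current node.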

4.3 Bayesian Classification


Bayesian classification is based on Bayes' Theorem.
Bayesian classifiers are statistical classifiers.
Bayesian classifiers can predict class membership probabilities such as the probability that a given tuple
belongs to a particular class.
1. Bayes' Theorem
Bayes' theorem is named after Thomas Bayes. There are two types of probabilities −
 Posterior probability, P(H|X)
 Prior probability, P(H)
where X is a data tuple and H is some hypothesis.

According to Bayes' theorem,
P(H|X) = P(X|H) P(H) / P(X)


2. Naïve Bayesian Classification

Example:

Dataset

Email ID Contains "Buy" Contains "Cheap" Contains "Win" Spam/Not Spam


1 Yes Yes No Spam
2 No Yes Yes Spam
3 No No Yes Spam
4 Yes Yes Yes Spam
5 Yes No No Not Spam
6 No No No Not Spam
7 No No Yes Not Spam
8 Yes No Yes Not Spam

Goal

Predict whether the following email is Spam or Not Spam:

 Contains "Buy" = Yes


 Contains "Cheap" = Yes
 Contains "Win" = No

Solution

Step 1: Calculate Prior Probabilities

The prior probabilities represent the likelihood of each class (Spam/Not Spam) based on the dataset.

P(Spam) = Number of Spam emails / Total emails

= 4/8 = 0.5

P(Not Spam) = Number of Not Spam emails / Total emails

= 4/8 = 0.5

Step 2: Calculate Likelihoods

The likelihood represents the probability of observing a feature value given a class. Use the frequency of
occurrences.

For Spam

P(Buy = Yes|Spam) = 2/4 = 0.5

P(Cheap = Yes|Spam) = 3/4 = 0.75

P(Win = No|Spam)= 1/4 = 0.25

For Not Spam

P(Buy = Yes|Not Spam) = 2/4 = 0.5

P(Cheap = Yes|Not Spam) = 0/4 = 0

P(Win = No|Not Spam) = 2/4 = 0.5

Step 3: Apply Bayes Theorem

The formula for Naive Bayes:

P(Class|Data) = P(Data|Class) ⋅ P(Class) / P(Data)

Since P(Data) is the same for both classes, we only need to compute the numerator:

P(Spam|Data) = P(Buy = Yes | Spam)⋅P(Cheap = Yes | Spam)⋅P(Win = No | Spam)⋅P(Spam)

P(Not Spam|Data)=P(Buy = Yes|Not Spam)⋅P(Cheap = Yes|Not Spam)⋅P(Win = No|Not Spam)⋅P(Not Spam)

Step 4: Compute Probabilities

For Spam

P(Spam|Data) = 0.5⋅0.75⋅0.25⋅0.5=0.046875

For Not Spam

P(Not Spam|Data) = 0.5⋅0⋅0.5⋅0.5 = 0

Step 5: Normalize Probabilities

Since P (Not Spam | Data) = 0

the normalized probability for P(Spam|Data) is 1.

Step 6: Conclusion

The predicted class is Spam
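The hand calculation in Steps 1-5 can be reproduced with a few lines of Python. This is only a sketch of the arithmetic above, not a production classifier; in practice a Laplacian correction would usually be added so that a zero count such as P(Cheap = Yes | Not Spam) does not force the whole product to zero.

# training data from the table above: (Buy, Cheap, Win, class)
emails = [
    ("Yes", "Yes", "No", "Spam"), ("No", "Yes", "Yes", "Spam"),
    ("No", "No", "Yes", "Spam"), ("Yes", "Yes", "Yes", "Spam"),
    ("Yes", "No", "No", "Not Spam"), ("No", "No", "No", "Not Spam"),
    ("No", "No", "Yes", "Not Spam"), ("Yes", "No", "Yes", "Not Spam"),
]
query = ("Yes", "Yes", "No")   # Buy = Yes, Cheap = Yes, Win = No

def score(cls):
    rows = [e for e in emails if e[3] == cls]
    prior = len(rows) / len(emails)                       # Step 1: prior probability of the class
    likelihood = 1.0
    for i, value in enumerate(query):                     # Step 2: multiply the per-feature likelihoods
        likelihood *= sum(1 for r in rows if r[i] == value) / len(rows)
    return prior * likelihood                             # Steps 3-4: numerator of Bayes' theorem

scores = {c: score(c) for c in ("Spam", "Not Spam")}
print(scores)                             # {'Spam': 0.046875, 'Not Spam': 0.0}
print(max(scores, key=scores.get))        # Step 6: predicted class -> Spam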

4.4 Rule Based Classification

Using IF-THEN Rules for Classification:

 Represent the knowledge in the form of IF-THEN rules


 R: IF age = youth AND student = yes THEN buys_computer = yes
 The IF part is the rule antecedent (precondition); the THEN part is the rule consequent
 Assessment of a rule: coverage and accuracy, where ncovers = # of tuples covered by R and ncorrect = # of tuples correctly classified by R (a short sketch of these measures follows this list)
 coverage(R) = ncovers / |D| /* D: training data set */
 accuracy(R) = ncorrect / ncovers
 If more than one rule is triggered, need conflict resolution
 Size ordering: assign the highest priority to the triggering rule that has the toughest requirement (i.e., with the most attribute tests)
 Class-based ordering: decreasing order of prevalence or misclassification cost per class
 Rule-based ordering (decision list): rules are organized into one long priority list, according to some
measure of rule quality or by experts
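A minimal Python sketch, using a tiny hypothetical training set, of how a single IF-THEN rule and its coverage and accuracy (as defined above) could be evaluated:

# hypothetical training tuples with class label buys_computer
D = [
    {"age": "youth", "student": "yes", "buys_computer": "yes"},
    {"age": "youth", "student": "no", "buys_computer": "no"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
    {"age": "youth", "student": "yes", "buys_computer": "no"},
]

# R: IF age = youth AND student = yes THEN buys_computer = yes
antecedent = lambda t: t["age"] == "youth" and t["student"] == "yes"

covered = [t for t in D if antecedent(t)]                       # tuples satisfying the rule antecedent
correct = [t for t in covered if t["buys_computer"] == "yes"]   # covered tuples where the consequent also holds
coverage = len(covered) / len(D)        # coverage(R) = ncovers / |D|
accuracy = len(correct) / len(covered)  # accuracy(R) = ncorrect / ncovers
print(coverage, accuracy)               # 0.5 0.5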

Rule Extraction from a Decision Tree


 Rules are easier to understand than large trees
 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
 Rules are mutually exclusive and exhaustive

Figure 4.4 Rule extraction from our buys_computer decision-tree

Rule Extraction from the Training Data

Figure 4.5 Rule Extraction from the Training Data

4.5 Bayesian Belief Networks


 Bayesian Belief Networks specify joint conditional probability distributions.
 They are also known as Belief Networks, Bayesian Networks, or Probabilistic Networks.
 A Belief Network allows class conditional independencies to be defined between subsets of variables.

 It provides a graphical model of causal relationship on which learning can be performed.
 We can use a trained Bayesian Network for classification.
There are two components that define a Bayesian Belief Network −
 Directed acyclic graph
 A set of conditional probability tables
Directed Acyclic Graph:
 Each node in a directed acyclic graph represents a random variable.
 These variables may be discrete or continuous-valued.
 These variables may correspond to actual attributes given in the data.
Directed Acyclic Graph Representation
The following diagram shows a directed acyclic graph for six Boolean variables.

Figure 4.6 Directed Acyclic Graph Representation


The arcs in the diagram allow representation of causal knowledge. For example, lung cancer is influenced by
a person's family history of lung cancer, as well as whether or not the person is a smoker. It is worth noting
that the variable PositiveXray is independent of whether the patient has a family history of lung cancer or that
the patient is a smoker, given that we know the patient has lung cancer.
Conditional Probability Table
The conditional probability table for the values of the variable LungCancer (LC) showing each possible
combination of the values of its parent nodes, FamilyHistory (FH), and Smoker (S) is as follows −

Figure 4.7 Conditional Probability Table
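To make the two components concrete, the sketch below stores a conditional probability table for LungCancer given FamilyHistory and Smoker and uses the usual belief-network factorization, P(X1, ..., Xn) = Π P(Xi | Parents(Xi)), to compute one joint probability. All numeric values here are invented for illustration and are not taken from Figure 4.7.

# hypothetical prior probabilities (True = yes)
p_family_history = {True: 0.1, False: 0.9}
p_smoker = {True: 0.3, False: 0.7}
# hypothetical CPT: P(LungCancer = True | FamilyHistory, Smoker)
p_lc_given = {(True, True): 0.8, (True, False): 0.5,
              (False, True): 0.7, (False, False): 0.1}

def joint(fh, s, lc):
    # the joint probability factorizes over the DAG: P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S)
    p_lc = p_lc_given[(fh, s)] if lc else 1 - p_lc_given[(fh, s)]
    return p_family_history[fh] * p_smoker[s] * p_lc

print(joint(True, True, True))   # P(FH = yes, S = yes, LC = yes) = 0.1 * 0.3 * 0.8 ≈ 0.024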

4.6 SVM—SUPPORT VECTOR MACHINES


 A new classification method for both linear and nonlinear data

 It uses a nonlinear mapping to transform the original training data into a higher dimension. With the new dimension, it searches for the linear optimal separating hyperplane (i.e., decision boundary)
 With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can
always be separated by a hyperplane.
 SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors)

1. The Case When the Data Are Linearly Separable


 An SVM approaches this problem by searching for the maximum marginal hyperplane.
 Both hyperplanes can correctly classify all of the given data tuples. Intuitively, however, we expect the
hyperplane with the larger margin to be more accurate at classifying future data tuples than the
hyperplane with the smaller margin.
 The SVM searches for the hyperplane with the largest margin, that is, the maximum marginal
hyperplane (MMH). The associated margin gives the largest separation between classes.
 Getting to an informal definition of margin, we can say that the shortest distance from a hyperplane to one side of its margin is equal to the shortest distance from the hyperplane to the other side of its margin, where the "sides" of the margin are parallel to the hyperplane.
 When dealing with the MMH, this distance is, in fact, the shortest distance from the MMH to the closest
training tuple of either class.

Figure 4.8 Linearly separable 2-D training data

Figure 4.9 Linearly separable training data with a small margin

Figure 4.10 Linearly separable training data with a larger margin

2. The Case When the Data Are Linearly Inseparable


A nonlinear SVM is obtained by extending the approach for linear SVMs as follows.
 There are two main steps. In the first step, we transform the original input data into a higher dimensional
space using a nonlinear mapping.
 Once the data have been transformed into the new higher-dimensional space, the second step searches for a linear separating hyperplane in the new space. We again end up with a quadratic optimization problem that can be solved using the linear SVM formulation.
 The maximum marginal hyperplane found in the new space corresponds to a nonlinear separating hypersurface in the original space.
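As a hedged illustration of the linear and nonlinear cases, the sketch below trains scikit-learn's SVC on a toy two-class dataset, once with a linear kernel and once with an RBF kernel that implicitly performs the nonlinear mapping. The data and parameter values are arbitrary, and scikit-learn is assumed to be available.

from sklearn.svm import SVC

# toy two-class training tuples (2-D points) and their class labels
X = [[1, 1], [2, 1], [1, 2], [6, 5], [7, 7], [6, 6]]
y = [0, 0, 0, 1, 1, 1]

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)            # maximum-margin hyperplane in the input space
rbf_svm = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)   # nonlinear mapping done implicitly by the RBF kernel

print(linear_svm.support_vectors_)       # the "essential" training tuples that define the margin
print(linear_svm.predict([[2, 2], [6, 7]]))
print(rbf_svm.predict([[2, 2], [6, 7]]))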

4.7 Other Classification Methods:


Genetic Algorithms:
 In a genetic algorithm, first of all, an initial population is created. This initial population consists of randomly generated rules. We can represent each rule by a string of bits.
 For example, in a given training set, the samples are described by two Boolean attributes such as A1
and A2. And this given training set contains two classes such as C1 and C2.
 We can encode the rule IF A1 AND NOT A2 THEN C2 into a bit string 100. In this bit representation, the
two leftmost bits represent the attribute A1 and A2, respectively.
 Likewise, the rule IF NOT A1 AND NOT A2 THEN C1 can be encoded as 001.
 If an attribute has K values, where K > 2, then we can use K bits to encode the attribute's values. The classes are also encoded in the same manner.
 Points to remember:
 Based on the notion of survival of the fittest, a new population is formed that consists of the fittest rules in the current population, as well as offspring of these rules.
 The fitness of a rule is assessed by its classification accuracy on a set of training samples.
 The genetic operators such as crossover and mutation are applied to create offspring.
 In crossover, substrings from pairs of rules are swapped to form a new pair of rules.
 In mutation, randomly selected bits in a rule’s string are inverted.
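A minimal sketch of the two genetic operators applied to bit-string encoded rules such as "100" (IF A1 AND NOT A2 THEN C2); fitness evaluation against training samples is omitted, and the random choices make the mutation output vary from run to run.

import random

def crossover(rule1, rule2, point=None):
    # swap the substrings of a pair of rules after a crossover point to form a new pair of rules
    point = point if point is not None else random.randrange(1, len(rule1))
    return rule1[:point] + rule2[point:], rule2[:point] + rule1[point:]

def mutate(rule, n_bits=1):
    # invert n randomly selected bits in the rule's string
    bits = list(rule)
    for i in random.sample(range(len(bits)), n_bits):
        bits[i] = "1" if bits[i] == "0" else "0"
    return "".join(bits)

r1, r2 = "100", "001"   # IF A1 AND NOT A2 THEN C2, and IF NOT A1 AND NOT A2 THEN C1
print(crossover(r1, r2, point=2))   # ('101', '000')
print(mutate(r1))                   # e.g. '110'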

Rough Set Approach:


 We can use the rough set approach to discover structural relationship within imprecise and noisy data.
 This approach can only be applied on discrete-valued attributes. Therefore, continuous-valued attributes
must be discretized before its use.
 Rough set theory is based on the establishment of equivalence classes within the given training data. The tuples that form an equivalence class are indiscernible; that is, the samples are identical with respect to the attributes describing the data.
 There are some classes in the given real-world data which cannot be distinguished in terms of the available attributes. We can use rough sets to roughly define such classes.
 The following diagram shows the Upper and Lower Approximation of class C –
 For a given class C, the rough set definition is approximated by two sets as follows:

Figure 4.11 Rough set approach

 Lower Approximation of C − The lower approximation of C consists of all the data tuples that, based on the knowledge of the attributes, are certain to belong to class C.
 Upper Approximation of C − The upper approximation of C consists of all the tuples that, based on the knowledge of the attributes, cannot be described as not belonging to C.

Fuzzy Set Approaches:


 Fuzzy set theory is also called possibility theory. It was proposed by Lotfi Zadeh in 1965 as an alternative to two-valued logic and probability theory.
 This theory allows us to work at a high level of abstraction. It also provides us the means for dealing
with imprecise measurement of data.
 Fuzzy set theory also allows us to deal with vague or inexact facts. For example, membership in the set of high incomes is inexact (e.g., if $50,000 is high, then what about $49,000 and $48,000?).
 Unlike in the traditional crisp set, where an element either belongs to S or to its complement, in fuzzy set theory the same element can belong to more than one fuzzy set.
 For example, the income value $49,000 belongs to both the medium and high fuzzy sets, but to differing degrees. Fuzzy set notation for this income value is as follows −
m_medium_income($49K) = 0.15 and m_high_income($49K) = 0.96
where 'm' is the membership function that operates on the fuzzy sets medium_income and high_income, respectively. This notation can be shown diagrammatically as follows −

Figure 4.12 Fuzzy set approach
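As a small illustration in code, two hypothetical piecewise-linear membership functions below assign a single income value to both the medium_income and high_income fuzzy sets with differing degrees. The breakpoints are invented for the sketch and do not reproduce the exact degrees 0.15 and 0.96 quoted above.

def m_medium_income(x):
    # hypothetical triangular membership: rises from 20K, peaks at 35K, fades out by 50K
    if 20_000 <= x <= 35_000:
        return (x - 20_000) / 15_000
    if 35_000 < x <= 50_000:
        return (50_000 - x) / 15_000
    return 0.0

def m_high_income(x):
    # hypothetical ramp: no membership below 40K, full membership above 60K
    if x <= 40_000:
        return 0.0
    if x >= 60_000:
        return 1.0
    return (x - 40_000) / 20_000

income = 49_000
print(m_medium_income(income), m_high_income(income))   # the same value belongs to both fuzzy sets, to differing degrees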


PART A
1. Define classification.
2. How does prediction differ from classification? (Nov/Dec 2024)
3. Write the applications of classification.
4. Define a null-invariant measure. (Nov/Dec 2024)
5. What is meant by supervised and unsupervised learning? (Apr/May 2024)
6. What are the issues of classification and prediction?
7. What are the criteria to compare classification and prediction?
8. Define Information gain.
9. What are the steps involved in preparing the data for classification? (Apr/May 2024)
10. What is naïve Bayesian classification? How does it differ from Bayesian classification?
11. Name the features of Decision tree induction.
12. Define Bayesian belief network.
13. How does tree pruning work?
14. What are the other classification methods in data mining?

Part B & C
1. Explain algorithm for constructing a decision tree from training samples.
2. Write Bayes theorem. Describe in detail about the following Classification methods.
a. Bayesian classification
b. Fuzzy set approach
c. Genetic algorithms
3. Generalize the Bayes theorem of posterior probability and explain the working of a Naïve Bayesian
classifier with an example. (Apr/May 2024)
4. Explain in detail about the Naive Bayesian classification method.(Nov/Dec 2024)
5. Formulate rule based classification techniques. (Apr/May 2024)
6. Explain in detail about Bayesian Belief Networks.
7. What are Bayesian classifiers? Explain.
8. Explain about support vector machines. (Apr/May 2024) (Nov/Dec 2024)
9. Apply tree pruning in decision tree induction. What is a drawback of using a separate set of tuples to evaluate pruning? (Nov/Dec 2024)

UNIT V

CLUSTER ANALYSIS AND OUTLIER DETECTION

Cluster Analysis – Partitioning methods – Hierarchical methods – Density based methods – Grid based
methods – Clustering in high dimensional data - Outliers and Outlier Analysis – Outlier detection methods

5.1 CLUSTER ANALYSIS

What is Cluster?
 Cluster is a group of objects that belong to the same class.
 Clustering is basically a type of unsupervised learning method. An unsupervised learning method is a method in which we draw inferences from datasets consisting of input data without labeled responses.
 Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to one another and dissimilar to the data points in other groups.
 In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.
 While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups.

Figure 5.1 Cluster Analysis
Applications of Cluster Analysis
 Market research, pattern recognition, data analysis, and image processing.
 Characterize their customer groups based on purchasing patterns.
 In field of biology it can be used to derive plant and animal taxonomies, categorize genes with similar
functionality and gain insight into structures inherent in populations.
 Clustering also helps in identifying areas of similar land use in an earth observation database. It also helps in identifying groups of houses in a city according to house type, value, and geographic location.
 Clustering also helps in classifying documents on the web for information discovery.
 Clustering is also used in outlier detection applications such as detection of credit card fraud.
 As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data and to observe the characteristics of each cluster.

Requirements of Clustering in Data Mining


Here are the typical requirements of clustering in data mining:
 Scalability - We need highly scalable clustering algorithms to deal with large databases.
 Ability to deal with different kinds of attributes - Algorithms should be able to handle any kind of data, such as interval-based (numerical), categorical, and binary data.
 Discovery of clusters with arbitrary shape - The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be bounded to distance measures that tend to find only spherical clusters of small size.
 High dimensionality - The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional data.
 Ability to deal with noisy data - Databases contain noisy, missing or erroneous data. Some algorithms
are sensitive to such data and may lead to poor quality clusters.
 Interpretability - The clustering results should be interpretable, comprehensible and usable.

Clustering Methods
The clustering methods can be classified into the following categories:
 Partitioning Method (e.g., k-means)

 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method
5.2 Partitioning Method
Suppose we are given a database of n objects; the partitioning method constructs k partitions of the data. Each partition will represent a cluster, and k ≤ n. That is, it classifies the data into k groups, which satisfy the following requirements:
 Each group contains at least one object.
 Each object must belong to exactly one group.
Typical methods:
K-means, k-medoids, CLARANS

k-Means: A Centroid-Based Technique


A centroid-based partitioning technique uses the centroid of a cluster, Ci, to represent that cluster. Conceptually, the centroid of a cluster is its center point. The centroid can be defined in various ways, such as by the mean or medoid of the objects (or points) assigned to the cluster. The difference between an object p ∈ Ci and ci, the representative of the cluster, is measured by dist(p, ci), where dist(x, y) is the Euclidean distance between two points x and y. The quality of cluster Ci can be measured by the within-cluster variation, i.e., the sum of squared errors E = Σ(i = 1..k) Σ(p ∈ Ci) dist(p, ci)^2.
Algorithm: k-means. The k-means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
1. arbitrarily choose k objects from D as the initial cluster centers;
2. repeat
3. (re)assign each object to the cluster to which the object is the most similar, based on the mean value
of the objects in the cluster;
4. update the cluster means, that is, calculate the mean value of the objects for each cluster;
5. until no change;
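A compact Python sketch of the loop above, using Euclidean distance on 2-D points; the sample points and the value of k are arbitrary.

import math, random

def kmeans(points, k, max_iter=100):
    centers = random.sample(points, k)          # step 1: arbitrarily choose k objects as the initial centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                        # step 3: (re)assign each object to the most similar (nearest) center
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)    # step 4: update each center to the mean of its cluster
        ]
        if new_centers == centers:              # step 5: stop when the cluster means no longer change
            break
        centers = new_centers
    return centers, clusters

points = [(0, 2.5), (0, 0), (1.5, 0), (5, 0), (5, 2)]
print(kmeans(points, k=2))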

k-Medoids: A Representative Object-Based Technique


The k-means algorithm is sensitive to outliers because such objects are far away from the majority of the data, and thus, when assigned to a cluster, they can dramatically distort the mean value of the cluster. This inadvertently affects the assignment of other objects to clusters.

Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on medoid or central objects.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
1. arbitrarily choose k objects in D as the initial representative objects or seeds;
2. repeat
3. assign each remaining object to the cluster with the nearest representative object;
4. randomly select a nonrepresentative object, Orandom;
5. compute the total cost, S, of swapping representative object, Oj , with Orandom;
6. if S < 0 then swap Oj with Orandom to form the new set of k representative objects;
7. until no change;
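The sketch below shows only the cost computation behind steps 5 and 6 of PAM: the total cost S of swapping a representative object with a nonrepresentative one is the change in the sum of distances from every object to its nearest representative, and the swap is kept only if S is negative. The points are hypothetical.

import math

def total_distance(points, medoids):
    # sum, over all objects, of the distance to the nearest representative object (medoid)
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def swap_cost(points, medoids, old_medoid, candidate):
    # S = total cost after the swap - total cost before; the swap is kept only if S < 0
    swapped = [candidate if m == old_medoid else m for m in medoids]
    return total_distance(points, swapped) - total_distance(points, medoids)

points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
medoids = [(1, 1), (2, 1)]   # both current medoids sit in the left group, so one is a poor representative
print(swap_cost(points, medoids, (2, 1), (8, 8)))   # negative: swapping (2, 1) for (8, 8) improves the clustering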

5.3 Hierarchical Methods


This method creates a hierarchical decomposition of the given set of data objects. There are two approaches:
 Agglomerative Approach
 Divisive Approach

Agglomerative Approach
This approach is also known as the bottom-up approach. We start with each object forming a separate group, and keep merging the objects or groups that are close to one another. This continues until all of the groups are merged into one or until the termination condition holds.

Divisive Approach
This approach is also known as the top-down approach. We start with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters. This continues until each object is in a cluster of its own or the termination condition holds.

Figure 5.2 Hierarchical Methods
Disadvantage
This method is rigid, i.e., once a merge or split is done, it can never be undone.
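A short sketch of the agglomerative (bottom-up) approach described above, assuming SciPy is available: each object starts as its own group, the closest groups are merged step by step, and cutting the resulting hierarchy gives a chosen number of clusters.

from scipy.cluster.hierarchy import linkage, fcluster

# toy 2-D objects; in the agglomerative approach each starts as its own group
points = [[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]]

Z = linkage(points, method="single")              # repeatedly merge the two closest groups (bottom-up)
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the resulting hierarchy into 2 clusters
print(labels)                                     # e.g. [1 1 1 2 2 2]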

Approaches to improve quality of Hierarchical clustering


Here are the two approaches that are used to improve the quality of hierarchical clustering:
 Perform careful analysis of object linkages at each hierarchical partitioning.
 Integrate hierarchical agglomeration with other clustering approaches by first using a hierarchical agglomerative algorithm to group objects into microclusters, and then performing macroclustering on the microclusters.
Typical methods: Diana, Agnes, BIRCH, ROCK, CAMELEON

BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to
preserve the inherent clustering structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree

Figure 5.3 BIRCH Methods
ROCK (1999): RObust Clustering using linKs; clusters categorical data by neighbor and link analysis
Major ideas
 Use links to measure similarity/proximity
 Not distance-based
 Computational complexity:
 Algorithm: sampling-based clustering
 Draw random sample
 Cluster with links
 Label data in disk

CHAMELEON (1999): hierarchical clustering using dynamic modeling


 Measures the similarity based on a dynamic model
 Two clusters are merged only if the interconnectivity and closeness (proximity) between two
clusters are high relative to the internal interconnectivity of the clusters and closeness of items
within the clusters
 CURE ignores information about the interconnectivity of the objects, while ROCK ignores information about the closeness of two clusters
 A two-phase algorithm
 Use a graph partitioning algorithm: cluster objects into a large number of relatively small sub-
clusters
 Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly
combining these sub-clusters

Figure 5.4 Chameleon Methods

5.4 Density-based Method
Clustering based on density (local cluster criterion), such as density-connected points
Major features:
 Discover clusters of arbitrary shape
 Handle noise
 One scan
 Need density parameters as termination condition
Two parameters:
 Eps: Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Eps-neighbourhood of that point
Typical methods: DBSCAN, OPTICS, DENCLUE
DBSCAN: Density Based Spatial Clustering of Applications with Noise
Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with noise
Algorithm :
 Arbitrarily select a point p

 Retrieve all points density-reachable from p w.r.t. Eps and MinPts. If p is a core point, a cluster is
formed.
 If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of
the database.
 Continue the process until all of the points have been processed.

Figure 5.5 DBSCAN Methods
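A hedged sketch of the algorithm above using scikit-learn's DBSCAN implementation (assuming scikit-learn is available); eps corresponds to Eps, min_samples to MinPts, and points labeled -1 are treated as noise.

from sklearn.cluster import DBSCAN

# two dense groups of 2-D points plus one isolated noise point
points = [[1, 1], [1, 2], [2, 1], [2, 2], [8, 8], [8, 9], [9, 8], [9, 9], [50, 50]]

db = DBSCAN(eps=2.0, min_samples=3).fit(points)   # eps = Eps neighbourhood radius, min_samples = MinPts
print(db.labels_)   # cluster id per point, with -1 marking noise, e.g. [0 0 0 0 1 1 1 1 -1]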

OPTICS: Ordering Points To Identify the Clustering Structure


 Produces a special order of the database with its density-based clustering structure
 This cluster ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings
 Good for both automatic and interactive cluster analysis, including finding intrinsic clustering
structure
 Can be represented graphically or using visualization techniques

DENCLUE: Density based clustering


Major features
 Solid mathematical foundation
 Good for data sets with large amounts of noise
 Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data
sets
 Significantly faster than existing algorithms (e.g., DBSCAN)
 But needs a large number of parameters

5.5 Grid-based Method


Using multi-resolution grid data structure
Advantage
 The major advantage of this method is fast processing time.
 It is dependent only on the number of cells in each dimension in the quantized space.

Typical methods: STING, WaveCluster, CLIQUE
STING: a Statistical Information Grid approach
 The spatial area is divided into rectangular cells
 There are several levels of cells corresponding to different levels of resolution

Figure 5.6 STING Methods

 Each cell at a high level is partitioned into a number of smaller cells in the next lower level
 Statistical info of each cell is calculated and stored beforehand and is used to answer queries
 Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells: count, mean, standard deviation (s), min, max, and type of distribution—normal, uniform, etc.
 Use a top-down approach to answer spatial data queries
 Start from a pre-selected layer—typically with a small number of cells
 For each cell in the current level compute the confidence interval
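A minimal sketch of the pre-computation idea behind STING: points are mapped to rectangular cells and simple statistics (count, mean, min, max) are stored per cell, so that later queries can be answered from the cell summaries rather than from the raw points. The grid size and data are hypothetical, and the multi-level hierarchy of cells is omitted.

from collections import defaultdict

points = [(0.5, 1.2), (0.7, 1.1), (3.2, 0.4), (3.9, 0.8), (3.5, 3.6)]
cell_size = 1.0

cells = defaultdict(list)
for x, y in points:
    cells[(int(x // cell_size), int(y // cell_size))].append((x, y))   # map each point to its rectangular cell

summaries = {}
for cell, pts in cells.items():
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    summaries[cell] = {                      # statistical information stored per cell, computed once beforehand
        "count": len(pts),
        "mean": (sum(xs) / len(xs), sum(ys) / len(ys)),
        "min": (min(xs), min(ys)),
        "max": (max(xs), max(ys)),
    }
print(summaries[(0, 1)])   # summary of the cell containing the first two points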

Wave Cluster: Clustering by Wavelet Analysis


 A multi-resolution clustering approach which applies wavelet transform to the feature space
 How to apply wavelet transform to find clusters
 Summarizes the data by imposing a multidimensional grid structure onto data space
 These multidimensional spatial data objects are represented in an n-dimensional feature space
 Apply wavelet transform on feature space to find the dense regions in the feature space
 Apply wavelet transform multiple times which result in clusters at different scales from fine to
coarse
 Wavelet transform: A signal processing technique that decomposes a signal into different frequency sub-bands (can be applied to n-dimensional signals)
 Data are transformed to preserve relative distance between objects at different levels of resolution

 Allows natural clusters to become more distinguishable

Figure 5.7 Wave Cluster Methods

5.6 CLUSTERING HIGH-DIMENSIONAL DATA


 Clustering of high-dimensional data returns groups of objects, which are the clusters. It is required to group similar types of objects together to perform cluster analysis of high-dimensional data, but the high-dimensional data space is huge and it has complex data types and attributes.
 A major challenge is that we need to find the set of attributes that are present in each cluster; a cluster is defined and characterized based on the attributes present in it. When clustering high-dimensional data we need to search both for the clusters and for the subspaces in which the clusters exist.
 The high-dimensional data is often reduced to lower-dimensional data to make clustering and the search for clusters simpler.
 Some applications need appropriate models of clusters, especially for high-dimensional data. Clusters in high-dimensional data are significantly small, and conventional distance measures can be ineffective.
 Instead, to find the hidden clusters in high-dimensional data we need to apply sophisticated techniques that can model correlations among the objects in subspaces.

Subspace Clustering Methods: Subspace clustering approaches search for clusters existing in subspaces of the given high-dimensional data space, where a subspace is defined using a subset of attributes in the full space.
There are 3 Subspace Clustering Methods:
 Subspace search methods
 Correlation-based clustering methods
 Biclustering methods

Figure 5.8 Subspace Clustering Methods

1. Subspace Search Methods:


 A subspace search method searches the subspaces for clusters.
 Here, the cluster is a group of similar types of objects in a subspace.
 The similarity between the clusters is measured by using distance or density features.
 CLIQUE algorithm is a subspace clustering method.
 Subspace search methods search a series of subspaces.
 There are two approaches in Subspace Search Methods:
 Bottom-up approach starts to search from the low-dimensional subspaces. If the hidden
clusters are not found in low-dimensional subspaces then it searches in higher dimensional
subspaces.
 The top-down approach starts to search from the high-dimensional subspaces and then
search in subsets of low-dimensional subspaces. Top-down approaches are effective if the
subspace of a cluster can be defined by the local neighborhood sub-space clusters.
2. Correlation-Based Clustering:
 Correlation-based approaches discover the hidden clusters by developing advanced correlation
models.
 Correlation-based models are preferred if it is not possible to cluster the objects by using the subspace search methods.
 Correlation-Based clustering includes the advanced mining techniques for correlation cluster
analysis.
 Biclustering Methods are the Correlation-Based clustering methods in which both the objects and
attributes are clustered.
3. Biclustering Methods:
Biclustering means clustering the data based on two factors: in some applications we can cluster both objects and attributes at the same time. The resultant clusters are biclusters. To perform biclustering there are four requirements:
 Only a small set of objects participates in a cluster.
 A cluster only involves a small number of attributes.
 A data object can take part in multiple clusters, or may not take part in any cluster.
 An attribute may be involved in multiple clusters. Objects and attributes are not treated in the same way: objects are clustered according to their attribute values, and objects and attributes are treated differently in biclustering analysis.

5.7 OUTLIER ANALYSIS


Outliers are a set of objects that are considerably dissimilar from the remainder of the data. Example: in sports, Michael Jordan, Wayne Gretzky, ...
Problem: Define and find outliers in large data sets.
Applications:
 Credit card fraud detection
 Telecom fraud detection
 Customer segmentation
 Medical analysis

 Statistical distribution-based outlier detection − Identify outliers with respect to the model using a discordancy test.
How the discordancy test works:
The data are assumed to be part of a working hypothesis, H.
Each data object in the dataset is compared to the working hypothesis and is either accepted in the working hypothesis or rejected as discordant into an alternative hypothesis (the outliers).

Figure 5.9 Discordancy test
 Distance-based outlier detection
 Counters the main limitations imposed by statistical methods
 We need multi-dimensional analysis without knowing the data distribution
Algorithms for mining distance-based outliers:
 Distance-based outlier detection is based on a global distance distribution
 It encounters difficulties in identifying outliers if the data is not uniformly distributed
 Example: C1 contains 400 loosely distributed points, C2 has 100 tightly condensed points, and there are 2 outlier points, o1 and o2
 Some outliers can be defined as global outliers, while some can be defined as local outliers with respect to a given cluster
 o2 would not normally be considered an outlier with regular distance-based outlier detection, since it looks at the global picture
 Each data object is assigned a local outlier factor (LOF)
 Objects which lie close to a dense cluster, but are not part of it, receive a higher LOF; LOF varies according to the parameter MinPts

Figure 5.10 Distance-based outlier detection
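As a sketch of the local outlier factor idea (assuming scikit-learn is available), LocalOutlierFactor scores each object relative to the density of its nearest neighbours, with n_neighbors playing the role of MinPts; a local outlier such as o2 sitting near a tight cluster receives a noticeably higher factor than the points inside the cluster.

from sklearn.neighbors import LocalOutlierFactor

# a tight cluster of 2-D points plus one nearby local outlier
points = [[1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [1.1, 1.1], [0.9, 1.0], [3.0, 3.0]]

lof = LocalOutlierFactor(n_neighbors=3)   # n_neighbors plays the role of MinPts
lof.fit_predict(points)                   # returns -1 for outliers and 1 for inliers
print(-lof.negative_outlier_factor_)      # LOF scores; the last point scores much higher than the others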

5.8 OUTLIER DETECTION METHODS


There are many outlier detection methods in the literature and in practice. Here, we present two orthogonal
ways to categorize outlier detection methods. First, we categorize outlier detection methods according to
whether the sample of data for analysis is given with domain expert–provided labels that can be used to
build an outlier detection model. Second, we divide methods into groups according to their assumptions
regarding normal objects versus outliers.
Supervised, Semi-Supervised, and Unsupervised Methods

Figure 5.11 Outlier Detection Methods

PART A
1. Define cluster.
2. What is clustering?
3. List the various clustering methods. (Apr/May 2024)
4. Identify what changes you make to solve the problem in cluster analysis.
5. Classify the typical phases of outlier detection methods.
6. List the challenges of outlier detection.
7. Classify the hierarchical clustering methods.
8. Compare Density based and grid-based cluster analysis. (Apr/May 2024)
9. Define outlier analysis.
10. List the various applications of cluster analysis.
11. What are the requirements of cluster analysis in data mining?
12. Define DBSCAN.
13. Define outlier.
14. How do clustering and the Nearest Neighbour prediction algorithm work? (Nov/Dec 2024)

15. What is an outlier? Mention the methods of detecting outliers. (Nov/Dec 2024)

PART B & C
1. Discuss the different hierarchical methods in cluster analysis.
2. Discuss the different types of data in cluster analysis.
3. Interpret the following clustering algorithms using examples.
a) K-means
b) K-medoids
4. Explain the hierarchical based method for cluster analysis.
5. Illustrate the concepts
a. CLIQUE
b. DBSCAN
6. Explain the hierarchical based method for cluster analysis.
7. Explain in detail about density based methods. (Apr/May 2024)
8. How would you discuss the outlier analysis in detail? (Apr/May 2024)
9. Discuss in detail the various outlier detection techniques. (Nov/Dec 2024)
10. Explain in detail about the grid based method? (Nov/Dec 2024)
11. Build the algorithm for DBSCAN.
12. Illustrate about the k – means partitioning algorithm.
13. Categorize outlier detection methods according to whether the sample of data for analysis is given with
domain expert- provided labels that can be used to build an outlier detection model.
14. Develop a clustering of high-dimensional data. (Nov/Dec 2024)
15. Consider five points { X1, X2,X3, X4, X5} with the following coordinates as a two dimensional sample
for clustering: X1 = (0,2.5); X2 = (0,0); X3= (1.5,0); X4 = (5,0); X5 = (5,2)
Compose the K-means partitioning algorithm using the above data set. (Apr/May 2024)
16. Consider that the data mining task is to cluster the following eight points A1,A2,A3,B1,B2,B3,C1 AND
C2(with (X,Y) representing location) into three clusters A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5),
B3(6,4), C1(1,2), C2(4,9). The distance function is Euclidean distance .Suppose initially we assign A1,
B1 and C1 as the center of each cluster, respectively. Analyze the K-means algorithm to show the three cluster centers after the first round of execution and the final three clusters.

