DWM Unit 5 Mining Frequent Patterns and Cluster Analysis
Vitthalrao Vikhe Patil Institute of Technology & Engineering (POLYTECHNIC), Loni 0030
DWM 22621
Frequent Patterns:
Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures) that
appear frequently in a data set.
For example, a set of items, such as milk and bread, which appear frequently together in a
transaction data set, is a frequent itemset.
A subsequence, such as buying first a PC, then a digital camera, and then a memory card,
if it occurs frequently in a shopping history database, is a (frequent) sequential pattern.
How is it used?
As a first step, market basket analysis can be used in deciding the location and promotion
of goods inside a store.
If it has been observed that purchasers of Barbie dolls are more likely to buy candy,
then high-margin candy can be placed near the Barbie doll display.
Customers who would have bought candy with their Barbie dolls, had they thought of it, will
now be suitably tempted.
But this is only the first level of analysis. Differential market basket analysis can find
interesting results and can also eliminate the problem of a potentially high volume of trivial
results.
Frequent Itemsets:
Frequent itemsets are patterns that appear frequently in a data set.
For example, a set of items, such as milk and bread, that appears frequently together in a
transaction data set is a frequent itemset.
A frequent itemset is an itemset whose support is greater than or equal to a minimum support threshold (e.g., 2).
From the diagram, the frequent itemsets are: a, b, c, d, ab, ad, bd, cd, abd
Closed Itemsets:
An itemset is closed if none of its immediate supersets has the same support as the
itemset.
Formally, consider two itemsets X and Y: if every item of X is in Y and there is at least one
item of Y that is not in X, then Y is a proper super-itemset of X. The itemset X is closed if
no proper super-itemset of X has the same support count as X.
If X is both closed and frequent, it is called a closed frequent itemset.
From the diagram, the closed frequent itemsets are: a, c, cd, abd
cd is a closed itemset because its supersets acd and bcd have support less than 2, i.e., less than the support of cd.
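The diagram the notes refer to is not reproduced here, so the following sketch uses a small made-up transaction list (an assumption) purely to illustrate the two definitions: frequent itemsets are filtered by support, and closed ones by comparing each itemset's support with that of its immediate supersets.

```python
from itertools import combinations

# Made-up transactions, only to illustrate the definitions above.
D = [{"a", "b", "d"}, {"a", "b", "d"}, {"c", "d"}, {"a", "c"}]
min_sup = 2
items = sorted({i for t in D for i in t})

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in D if set(itemset) <= t)

# Frequent: support count at least min_sup
frequent = [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(c) >= min_sup]

# Closed: no immediate superset (one extra item) has the same support
closed = [f for f in frequent
          if all(support(f | {i}) < support(f) for i in items if i not in f)]
print(sorted(map(sorted, closed)))  # [['a'], ['a', 'b', 'd'], ['c'], ['d']]
```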
Association Rules:
Association rule mining finds interesting associations and relationships among large sets of data
items. An association rule shows how frequently an itemset occurs in a transaction.
A typical example is Market Basket Analysis.
Association Rule: An implication expression of the form X → Y, where X and Y are
itemsets.
Support: The first measure, called the support, is the number of transactions that include
all items in X and Y as a percentage of the total number of transactions.
Support(X → Y) = support count(X ∪ Y) / N, where N is the total number of transactions.
Confidence: The second measure, called the confidence of the rule, is the ratio of the
number of transactions that include all items in both X and Y to the number of transactions
that include all items in X.
Confidence(X → Y) = support count(X ∪ Y) / support count(X)
It measures how often items in Y appear in transactions that contain X.
Lift: The third measure, called the lift or lift ratio, is the ratio of confidence to expected
confidence. Expected confidence is the frequency of Y in the data set, i.e., support count(Y) / N. The lift
tells us how much better a rule is at predicting the result than just assuming the result in the
first place. Greater lift values indicate stronger associations.
Lift(X → Y) = Confidence(X → Y) / (support count(Y) / N)
It indicates how much our confidence that Y will be purchased increases given that X was
purchased.
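The three measures can be checked with a short computation. Below is a minimal Python sketch, assuming a made-up four-transaction list (the item names and counts are illustrative, not from the notes).

```python
# Minimal sketch: support, confidence, and lift for a rule X -> Y
# over a small, made-up transaction list.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk"},
]
N = len(transactions)

def support_count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"milk"}, {"bread"}
support = support_count(X | Y) / N                    # fraction with X and Y
confidence = support_count(X | Y) / support_count(X)  # P(Y | X)
lift = confidence / (support_count(Y) / N)            # confidence / P(Y)
print(support, confidence, lift)                      # 0.5 0.666... 0.888...
```

A lift below 1, as here, means buying X actually makes Y slightly less likely than its base rate; a lift above 1 indicates a positive association.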
Examples of associated itemsets: bread and butter, laptop and antivirus software, etc.
Apriori Algorithm:
The name of the algorithm is Apriori because it uses prior knowledge of frequent itemset
properties. It applies an iterative, level-wise search in which frequent k-itemsets
are used to find candidate (k+1)-itemsets.
The algorithm uses two steps, "join" and "prune" (prune means delete), to reduce the search
space.
It is an iterative approach to discover the most frequent itemsets.
Apriori property:
If P(X) is less than the minimum support threshold, then the itemset X is not frequent,
and no superset of X can be frequent either.
Notation:
D: database of transactions
min_sup: minimum support count
k: size of an itemset
Ck: candidate k-itemsets
Lk: frequent k-itemsets in D
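The pseudocode of the algorithm appeared as a figure in the original notes. As a substitute, here is a minimal Python sketch of the level-wise search with join and prune steps (the function name and the data layout, a list of item sets, are assumptions).

```python
from itertools import combinations

def apriori(D, min_sup):
    """Level-wise search: frequent k-itemsets (Lk) are joined to form
    candidate (k+1)-itemsets, which are pruned and then counted
    against the database D (a list of item sets)."""
    items = sorted({i for t in D for i in t})
    # C1 -> L1: single items meeting the minimum support count
    L = {frozenset([i]) for i in items
         if sum(1 for t in D if i in t) >= min_sup}
    frequent = set(L)
    k = 2
    while L:
        # Join step: combine frequent (k-1)-itemsets into k-itemsets
        C = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step: delete candidates with an infrequent (k-1)-subset
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Scan D for the count of each candidate
        L = {c for c in C if sum(1 for t in D if c <= t) >= min_sup}
        frequent |= L
        k += 1
    return frequent

# e.g. apriori([{"milk", "bread"}, {"milk"}, {"bread", "butter"}], min_sup=2)
```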
Solution (the transaction database D, with 4 transactions, was given as a figure in the original notes):
Calculate min_sup = 0.5 × 4 = 2 (support count is 2).
(0.5: minimum support given in the problem; 4: total number of transactions in database D)
Step 1: Generate candidate list C1 from D.
C1 =
Itemset
{1}
{2}
{3}
{4}
{5}
Step 2: Scan D for the count of each candidate and find the support.
C1 =
Itemset   Support count
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3
Step 3: Compare each candidate's support count with min_sup: itemset {4} is pruned, giving L1 = {{1}, {2}, {3}, {5}}.
Step 4: Generate candidate list C2 by joining L1 with itself.
Step 5: Scan D for the count of each candidate and find the support.
C2 =
Itemset   Support count
{1,2}     1
{1,3}     2
{1,5}     1
{2,3}     2
{2,5}     3
{3,5}     2
Step 6: Compare each candidate's support count with min_sup: {1,2} and {1,5} are pruned, giving L2 = {{1,3}, {2,3}, {2,5}, {3,5}}.
Step 7: Generate candidate list C3 from L2.
Step 8: Scan D for the count of each candidate and find the support.
C3 =
Itemset   Support count
{1,2,3}   1
{1,2,5}   1
{1,3,5}   1
{2,3,5}   2
Only {2,3,5} meets the minimum support, so L3 = {{2,3,5}}, and association rules are generated from it. As the minimum confidence threshold is 70%, the two rules with confidence 2/2 = 100% are the output:
i.e. {2,3} → {5} and {3,5} → {2} (the remaining rules, such as {2,5} → {3}, have confidence 2/3 ≈ 67% and are rejected).
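The rule confidences can be verified programmatically. The database below is an assumption: the original transaction table was a figure, and this is the standard textbook instance consistent with the support counts shown above.

```python
from itertools import combinations

# Assumed transaction database, consistent with the candidate counts above.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def count(itemset):
    """Support count: transactions containing every item of the itemset."""
    return sum(1 for t in D if itemset <= t)

# Check every rule X -> Y with X ∪ Y = {2, 3, 5}.
full = {2, 3, 5}
for r in (1, 2):
    for X in map(set, combinations(full, r)):
        Y = full - X
        conf = count(full) / count(X)
        print(sorted(X), "->", sorted(Y), f"confidence = {conf:.0%}")
```

Only {2,3} → {5} and {3,5} → {2} reach 100%, matching the output above.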
Cluster Analysis:
Clustering is a data mining technique used to place data elements into related
groups without advance knowledge.
Clustering is the process of grouping a set of data objects into multiple groups or clusters
so that objects within a cluster have high similarity.
Dissimilarities and similarities are assessed based on the attribute values describing the
objects and often involve distance measures.
Cluster analysis or simply clustering is the process of partitioning a set of data objects (or
observations) into subsets.
Each subset is a cluster, such that objects in a cluster are similar to one another, yet
dissimilar to objects in other clusters. The set of clusters resulting from a cluster analysis
can be referred to as a clustering.
Clustering Methods:
1. Partitioning Method:
Suppose we are given a database of 'n' objects, and the partitioning method constructs 'k'
partitions of the data. Each partition represents a cluster, where k ≤ n. In other words, it
classifies the data into k groups, which satisfy the following requirements:
Each group contains at least one object.
Each object must belong to exactly one group.
Points to remember:
For a given number of partitions (say k), the partitioning method will create an initial
partitioning.
Then it uses the iterative relocation technique to improve the partitioning by moving
objects from one group to another.
Algorithm: k-means.
The k-means algorithm for partitioning, where each cluster’s centre is represented by the
mean value of the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centres
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar,
based on the mean value of the objects in the cluster
(4) update the cluster means, that is, calculate the mean value of the objects for
each cluster
(5) until no change;
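The method above can be written directly in code. Below is a minimal 1-D sketch; the function name and the use of an explicit initial partition are assumptions, chosen so the worked example that follows can be reproduced.

```python
def kmeans_1d(values, initial_clusters, max_iter=100):
    """1-D k-means: assign each value to the cluster with the nearest
    mean, recompute the means, and repeat until nothing changes."""
    clusters = [list(c) for c in initial_clusters]
    means = []
    for _ in range(max_iter):
        # An empty cluster is given mean 0, as in step 2 of the example below
        means = [sum(c) / len(c) if c else 0.0 for c in clusters]
        new = [[] for _ in clusters]
        for v in values:
            nearest = min(range(len(means)), key=lambda i: abs(v - means[i]))
            new[nearest].append(v)
        if new == clusters:   # assignment unchanged: converged
            break
        clusters = new
    return clusters, means

values = [2, 3, 6, 8, 9, 12, 15, 18, 22]
# Same arbitrary initial partition as step 1 of the example below
clusters, means = kmeans_1d(values, [[2, 8, 15], [3, 9, 18], [6, 12, 22]])
print(clusters)  # [[6, 8, 9, 12], [2, 3], [15, 18, 22]]
print(means)     # [8.75, 2.5, 18.33...]
```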
Example: K-means
Question: Use the k-means algorithm to create 3 clusters for the given set of values:
{2,3,6,8,9,12,15,18,22}
Answer:
Set of values: 2,3,6,8,9,12,15,18,22
1. Break the given set of values randomly into 3 clusters and calculate the means.
K1: 2,8,15 mean=8.33
K2: 3,9,18 mean=10
K3: 6,12,22 mean=13.33
2. Reassign the values to clusters as per the mean calculated and calculate the mean
again.
K1: 2,3,6,8,9 mean=5.6
K2: (empty) mean=0
K3: 12,15,18,22 mean=16.75
3. Reassign the values to clusters as per the mean calculated and calculate the mean
again.
K1: 3,6,8,9 mean=6.5
K2: 2 mean=2
K3: 12,15,18,22 mean=16.75
4. Reassign the values to clusters as per the mean calculated and calculate the mean
again.
K1: 6,8,9 mean=7.67
K2: 2,3 mean=2.5
K3: 12,15,18,22 mean=16.75
5. Reassign the values to clusters as per the mean calculated and calculate the mean
again.
K1: 6,8,9,12 mean=8.75
K2: 2,3 mean=2.5
K3: 15,18,22 mean=18.33
6. Reassign the values to clusters as per the mean calculated: the clusters do not change,
so the algorithm terminates.
K1: 6,8,9,12 mean=8.75
K2: 2,3 mean=2.5
K3: 15,18,22 mean=18.33
2. Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can
classify hierarchical methods on the basis of how the hierarchical decomposition is formed.
There are two approaches here:
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object
forming a separate group. It keeps on merging the objects or groups that are close to one
another. It keeps on doing so until all of the groups are merged into one or until the
termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the
objects in the same cluster. In each successive iteration, a cluster is split up into smaller
clusters. This is done until each object is in its own cluster or the termination condition holds.
Hierarchical methods are rigid, i.e., once a merging or splitting is done, it can never be undone.
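As a sketch of the agglomerative (bottom-up) approach described above, the following assumes 1-D points and single-link distance (the smallest gap between two clusters), merging until k groups remain.

```python
def agglomerative_1d(points, k):
    """Bottom-up clustering: start with one group per object and keep
    merging the two closest groups until only k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters

print(agglomerative_1d([2, 3, 6, 8, 9, 12, 15, 18, 22], 3))
```

Single-link tends to chain nearby groups together, which is why its result on this data can differ from the k-means result above.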
3. Density-based Method
This method is based on the notion of density. The basic idea is to continue growing the
given cluster as long as the density in the neighbourhood exceeds some threshold, i.e., for
each data point within a given cluster, the neighbourhood of a given radius has to contain at
least a minimum number of points.
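A minimal sketch of this idea, in the style of the DBSCAN algorithm, follows; the 1-D points and the eps and min_pts values are illustrative assumptions.

```python
def region_query(points, p, eps):
    """Indices of all points within distance eps of points[p]."""
    return [q for q in range(len(points)) if abs(points[p] - points[q]) <= eps]

def density_cluster(points, eps, min_pts):
    """Grow a cluster from any point whose eps-neighbourhood holds at
    least min_pts points, expanding while the threshold is met."""
    labels = [None] * len(points)    # None = unvisited, -1 = noise
    cluster = 0
    for p in range(len(points)):
        if labels[p] is not None:
            continue
        neighbours = region_query(points, p, eps)
        if len(neighbours) < min_pts:
            labels[p] = -1           # not dense: noise for now
            continue
        cluster += 1
        labels[p] = cluster
        seeds = list(neighbours)
        while seeds:                 # keep growing the cluster
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster  # noise reachable from a dense point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbours = region_query(points, q, eps)
            if len(q_neighbours) >= min_pts:
                seeds.extend(q_neighbours)
    return labels

print(density_cluster([1, 2, 3, 10, 11, 12, 25], eps=2, min_pts=3))
# [1, 1, 1, 2, 2, 2, -1]: two dense runs plus one noise point
```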
4. Grid-based Method
In this method, the object space is quantized into a finite number of cells that form a grid
structure, and the clustering operations are performed on this grid.
Advantage: fast processing time, typically independent of the number of data objects and
dependent only on the number of cells in the quantized space.
5. Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data for a
given model. This method locates the clusters by clustering the density function. It reflects
spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters based
on standard statistics, taking outlier or noise into account. It therefore yields robust
clustering methods.
6. Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-
oriented constraints. A constraint refers to the user expectation or the properties of desired
clustering results. Constraints provide us with an interactive way of communication with the
clustering process. Constraints can be specified by the user or the application requirement.
Applications of Clustering:
Clustering algorithms can be applied in many fields, for instance:
1. Marketing: finding groups of customers with similar behaviour given a large database
of customer data containing their properties and past buying records;
2. Biology: classification of plants and animals given their features;
3. Libraries: book ordering;
4. Insurance: identifying groups of motor insurance policy holders with a high average
claim cost; identifying frauds;
5. City-planning: identifying groups of houses according to their house type, value and
geographical location;
6. Earthquake studies: clustering observed earthquake epicenters to identify dangerous
zones;
7. WWW: document classification; clustering weblog data to discover groups of similar
access patterns.
Assignment No. 5
TID Items
1 K, A, D, B
2 D, A, C, E, B
3 C, A, B, E
4 B, A, D
Find all frequent itemsets using the Apriori method. List the strong association rules. (6)
8. List clustering Methods explain any two. (6)
9. Explain Apriori algorithms for frequent itemset using candidate generation. (6)
10. Consider the data sets given and create 3 clusters (Data set 1) and 4 clusters (Data set 2)
using the k-means method. (6)
Data set1: {10,4,2,12,3,20,30,11,25,31}
Data set2: {8,14,2,22,13,40,30,18,25,10}