
Padmashri Dr. Vitthalrao Vikhe Patil Institute of Technology & Engineering (POLYTECHNIC), Loni 0030
DWM 22621

Unit 5: Mining Frequent Patterns and Cluster Analysis


(14 Marks)
Course Outcome (CO): Apply basic statistical calculations on Datasets.

 Frequent Patterns:
Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures) that
appear frequently in a data set.
For example, a set of items, such as milk and bread, which appear frequently together in a
transaction data set, is a frequent itemset.
A subsequence, such as buying first a PC, then a digital camera, and then a memory card,
if it occurs frequently in a shopping history database, is a (frequent) sequential pattern.

 Market Basket Analysis:


Market Basket Analysis is a modelling technique based upon the theory that if you buy a
certain group of items, you are more (or less) likely to buy another group of items.
Ex: (Computer → Antivirus)
Market Basket Analysis is one of the key techniques used by large retailers to uncover
associations between items.
It works by looking for combinations of items that occur together frequently in transactions,
i.e., it allows retailers to identify relationships between the items that people buy.
Market basket analysis can be used in deciding the location and promotion of goods inside
a store.
Market Basket Analysis creates If-Then scenario rules: for example, if item A is purchased,
then item B is likely to be purchased.

How is it used?
As a first step, market basket analysis can be used in deciding the location and promotion
of goods inside a store.
If it has been observed that purchasers of Barbie dolls are more likely to buy candy, then
high-margin candy can be placed near the Barbie doll display.
Customers who would have bought candy with their Barbie dolls had they thought of it will
now be suitably tempted.
But this is only the first level of analysis. Differential market basket analysis can find
interesting results and can also eliminate the problem of a potentially high volume of trivial
results.


In differential analysis, results are compared between different stores, between customers
in different demographic groups, between different days of the week, between different
seasons of the year, etc.
If we observe that a rule holds in one store, but not in any other (or does not hold in one
store, but holds in all others), then we know that there is something interesting about that
store.
Investigating such differences may yield useful insights which will improve company sales.

Other Application Areas


Market Basket Analysis is used for:
1. Analysis of credit card purchases.
2. Analysis of telephone calling patterns.
3. Identification of fraudulent medical insurance claims.
4. Analysis of telecom service purchases.


 Frequent Itemsets and Closed Itemsets:

Frequent Itemsets:
Frequent itemsets are patterns that appear frequently in a data set.
For example, a set of items, such as milk and bread, that appear frequently together in a
transaction data set is a frequent itemset.
A frequent itemset is an itemset whose support is greater than or equal to a minimum
support threshold (e.g., a support count of 2).
From the diagram, the frequent itemsets are: a, b, c, d, ab, ad, bd, cd, abd

Closed Itemsets:
An itemset is closed if none of its immediate supersets has the same support as the
itemset itself.
Consider two itemsets X and Y: if every item of X is in Y and there is at least one item of Y
that is not in X, then Y is a proper superset of X. If no proper superset of X has the same
support count as X, then X is a closed itemset.
If X is both closed and frequent, it is called a closed frequent itemset.
From the diagram, the closed frequent itemsets are: a, c, cd, abd.
For example, cd is a closed itemset because its supersets acd and bcd have lower support
(less than 2). A runnable sketch of these definitions follows.
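
For illustration, here is a minimal Python sketch of the two definitions. The transaction
database is the one used in the Apriori example later in this unit (not the data of the
diagram above), and the helper names are illustrative.

from itertools import combinations

# Illustrative transactions; min_sup is the minimum support count.
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
min_sup = 2

def support(itemset):
    # Number of transactions containing every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

# Enumerate all frequent itemsets by brute force (fine for tiny data).
items = sorted(set().union(*transactions))
frequent = [set(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if support(set(c)) >= min_sup]

# A frequent itemset is closed if no proper superset has the same support
# (any equal-support superset of a frequent itemset is itself frequent).
closed = [x for x in frequent
          if not any(x < y and support(y) == support(x) for y in frequent)]

print("frequent:", frequent)        # {1}, {2}, {3}, {5}, {1,3}, ..., {2,3,5}
print("closed frequent:", closed)   # {3}, {1,3}, {2,5}, {2,3,5} for this data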


 Association Rules:
Association rule mining finds interesting associations and relationships among large sets
of data items. A rule shows how frequently an itemset occurs in transaction data.
A typical example is Market Basket Analysis.
Association Rule: An implication expression of the form X → Y, where X and Y are
itemsets.

Example: {Milk, Cheese} → {Banana}

Example of Association Rules:


You are in a supermarket to buy milk. Based on the analysis, are you more likely to buy
apples or cheese in the same transaction than somebody who did not buy milk?
In the following table, there are nine baskets containing various combinations of milk,
cheese, apples, and bananas.

Basket   Product 1   Product 2   Product 3
1        Milk        Cheese
2        Milk        Apples      Cheese
3        Apples      Banana
4        Milk        Cheese
5        Apples      Banana
6        Milk        Cheese      Banana
7        Milk        Cheese
8        Cheese      Banana
9        Cheese      Milk


Determine the relationships and the rules.

Support: The first measure, called the support, is the number of transactions that include
the items in both the {X} and {Y} parts of the rule, as a percentage of the total number of
transactions. It is a measure of how frequently the collection of items occurs together, as a
fraction of all transactions.
Support = (X+Y)/N   (N: total transactions or baskets)
i.e., the fraction of transactions that contain both X and Y.

Confidence: The second measure, called the confidence of the rule, is the ratio of the
number of transactions that include all items in both {X} and {Y} to the number of
transactions that include all items in {X}.
Confidence = (X+Y)/X
i.e., how often items in Y appear in transactions that contain X.

Lift: The third measure, called the lift or lift ratio, is the ratio of confidence to expected
confidence, where the expected confidence is simply the frequency of Y among all
transactions (Y/N). The lift tells us how much better a rule is at predicting the result than
just assuming the result in the first place. Greater lift values indicate stronger associations.
Lift = Confidence / (Y/N)
i.e., how much our confidence that Y will be purchased has increased, given that X was
purchased.
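
As a sketch, the three measures can be computed directly from the nine baskets above;
the function names below are illustrative, not part of any standard library.

baskets = [
    {"Milk", "Cheese"}, {"Milk", "Apples", "Cheese"}, {"Apples", "Banana"},
    {"Milk", "Cheese"}, {"Apples", "Banana"}, {"Milk", "Cheese", "Banana"},
    {"Milk", "Cheese"}, {"Cheese", "Banana"}, {"Cheese", "Milk"},
]
N = len(baskets)

def count(items):
    # Number of baskets containing every item in `items`.
    return sum(1 for b in baskets if items <= b)

def measures(X, Y):
    support = count(X | Y) / N            # (X+Y)/N
    confidence = count(X | Y) / count(X)  # (X+Y)/X
    lift = confidence / (count(Y) / N)    # Confidence/(Y/N)
    return support, confidence, lift

print(measures({"Milk"}, {"Cheese"}))            # ~(0.67, 1.0, 1.29)
print(measures({"Apples", "Milk"}, {"Cheese"}))  # ~(0.11, 1.0, 1.29)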

Rule                  Support (X+Y)/N   Confidence (X+Y)/X   Lift = Confidence/(Y/N)
Milk→Cheese           6/9 = 0.67        6/6 = 1              1/(7/9) = 9/7 = 1.28
Apple, Milk→Cheese    1/9 = 0.11        1/1 = 1              1/(7/9) = 9/7 = 1.28
Apple, Cheese→Milk    1/9 = 0.11        1/1 = 1              1/(6/9) = 9/6 = 1.5

Justification (Milk→Cheese): X+Y: milk+cheese; X: milk; Y: cheese; N: no. of baskets.


 Apriori Algorithm – Frequent Pattern Algorithms


A set of items together is called an itemset. An itemset that contains k items is called a
k-itemset.
An itemset consists of one or more items. An itemset that occurs frequently is called a
frequent itemset.
Thus, frequent itemset mining is a data mining technique to identify the items that
often occur together.

For example: Bread and butter, Laptop and Antivirus software, etc.

The algorithm is named Apriori because it uses prior knowledge of frequent itemset
properties. It applies an iterative, level-wise search in which frequent k-itemsets are used
to find (k+1)-itemsets.
The algorithm uses two steps, “join” and “prune” (prune means delete), to reduce the
search space.
It is an iterative approach to discover the most frequent itemsets.
The Apriori property says:
 If P(x), the support of an itemset x, is less than the minimum support threshold, then x is
not frequent, and neither is any superset of x.

The steps followed in the Apriori Algorithm of data mining are:


1. Join Step: This step generates (k+1)-itemset candidates from the frequent k-itemsets
by joining the set of frequent k-itemsets with itself.
2. Prune Step: This step scans the database for the count of each candidate. If a
candidate itemset does not meet the minimum support, it is marked infrequent and
removed. This step is performed to reduce the size of the candidate itemsets.

Apriori Algorithm (notation):
D: database
min_sup: minimum support count
k: number of items in an itemset
Ck: candidate k-itemsets
Lk: frequent k-itemsets in D
A runnable sketch of the algorithm follows; the worked example after it traces the same
steps by hand.
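
The sketch below is a minimal Python rendering of the algorithm (function and variable
names are illustrative). Note that it applies the standard subset-based prune in addition to
the support check, so it generates fewer candidates than the step-by-step trace below, but
it finds the same frequent itemsets.

from itertools import combinations

def apriori(D, min_sup):
    # Returns {itemset: support count} for all frequent itemsets in D,
    # where D is a list of transactions (sets) and min_sup is a count.
    def support(c):
        return sum(1 for t in D if c <= t)

    items = sorted(set().union(*D))
    Lk = {frozenset([i]) for i in items if support(frozenset([i])) >= min_sup}
    frequent = {c: support(c) for c in Lk}
    k = 2
    while Lk:
        # Join step: combine pairs of frequent (k-1)-itemsets into k-itemsets.
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step 1: drop candidates containing an infrequent (k-1)-subset.
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Prune step 2: scan D and keep the candidates meeting min_sup.
        Lk = {c for c in Ck if support(c) >= min_sup}
        frequent.update({c: support(c) for c in Lk})
        k += 1
    return frequent

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]   # database of the example below
for itemset, count in sorted(apriori(D, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)                  # ends with [2, 3, 5] 2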


Example Apriori Method:


Consider the given database D with minimum support 50%. Apply the Apriori algorithm to
find the frequent itemsets, and derive the association rules with confidence greater than 70%.

TID   Items
1     1, 3, 4
2     2, 3, 5
3     1, 2, 3, 5
4     2, 5

Solution:
Calculate min_supp=0.5*4=2 (support count is 2)
(0.5: given minimum support in problem, 4: total transactions in database D)
Step 1: Generate candidate list C1 from D
C1=
Itemset
{1}
{2}
{3}
{4}
{5}

Step 2: Scan D for count of each candidate and find the support.
C1=
Itemset   Support count
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

7
Padmashri Dr. Vitthalrao Vikhe Patil Institute of Technology & Engineering (POLYTECHNIC), Loni 0030
DWM 22621

Step 3: Compare each candidate's support count with min_supp (i.e. 2) and prune
(remove) the itemsets whose support count is less than min_supp.
L1 =
Itemset   Support count
{1}       2
{2}       3
{3}       3
{5}       3

Step 4: Generate candidate list C2 from L1
(the k-itemsets are joined to form (k+1)-itemsets)
C2 =
Itemset (k+1)
{1,2}
{1,3}
{1,5}
{2,3}
{2,5}
{3,5}

Step 5: Scan D for count of each candidate and find the support.
C2 =
Itemset   Support count
{1,2}     1
{1,3}     2
{1,5}     1
{2,3}     2
{2,5}     3
{3,5}     2

Step 6: Compare each candidate's support count with min_supp (i.e. 2) and prune the
itemsets whose support count is less than min_supp.
L2 =
Itemset   Support count
{1,3}     2
{2,3}     2
{2,5}     3
{3,5}     2


Step 7: Generate candidate list C3 from L2
(the k-itemsets are joined to form (k+1)-itemsets)
C3 =
Itemset (k+1)
{1,2,3}
{1,2,5}
{1,3,5}
{2,3,5}
(In the standard prune step, {1,2,3}, {1,2,5}, and {1,3,5} would be removed immediately
because they contain the infrequent 2-subsets {1,2} or {1,5}; they are kept here so the
scan in the next step can be shown.)

Step 8: Scan D for count of each candidate and find the support.
C3 =
Itemset    Support count
{1,2,3}    1
{1,2,5}    1
{1,3,5}    1
{2,3,5}    2

Step 9: Compare each candidate's support count with min_supp (i.e. 2) and prune the
itemsets whose support count is less than min_supp.
L3 =
Itemset    Support count
{2,3,5}    2

Step 10: The final frequent itemset is {2,3,5}.

Apply association rules:

Rule          Support   Confidence (X+Y)/X   Confidence %
{2,3}→{5}     2         2/2 = 1              100
{3,5}→{2}     2         2/2 = 1              100
{2,5}→{3}     2         2/3 = 0.67           67
{2}→{3,5}     2         2/3 = 0.67           67
{3}→{2,5}     2         2/3 = 0.67           67
{5}→{2,3}     2         2/3 = 0.67           67

As the minimum confidence threshold is 70%, the first two rules are the output,
i.e. {2,3}→{5} and {3,5}→{2}.
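
As a sketch, this rule-generation step can also be automated. The support counts below
are copied from the worked example; the function name is illustrative.

from itertools import combinations

# Support counts taken from the worked example above.
support = {
    frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
    frozenset({2, 3}): 2, frozenset({2, 5}): 3, frozenset({3, 5}): 2,
    frozenset({2, 3, 5}): 2,
}

def rules(itemset, min_conf):
    # All rules X -> Y with X union Y = itemset and confidence >= min_conf.
    s = frozenset(itemset)
    found = []
    for r in range(1, len(s)):
        for X in map(frozenset, combinations(sorted(s), r)):
            conf = support[s] / support[X]   # (X+Y)/X
            if conf >= min_conf:
                found.append((sorted(X), sorted(s - X), conf))
    return found

for X, Y, conf in rules({2, 3, 5}, 0.7):
    print(X, "->", Y, " confidence:", conf)
# Prints only {2,3}->{5} and {3,5}->{2}, matching the table above.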


 Cluster Analysis:
Clustering is a data mining technique used to place data elements into related
groups without advance knowledge of the group definitions.
Clustering is the process of grouping a set of data objects into multiple groups or clusters
so that objects within a cluster have high similarity.
Dissimilarities and similarities are assessed based on the attribute values describing the
objects and often involve distance measures.
Cluster analysis or simply clustering is the process of partitioning a set of data objects (or
observations) into subsets.
Each subset is a cluster, such that objects in a cluster are similar to one another, yet
dissimilar to objects in other clusters. The set of clusters resulting from a cluster analysis
can be referred to as a clustering.

Requirements of Cluster Analysis:


 Scalability: We need highly scalable clustering algorithms to deal with large databases.
 Ability to deal with different kinds of attributes: Algorithms should be capable of
being applied to any kind of data, such as interval-based (numerical), categorical,
and binary data.
 Discovery of clusters with arbitrary shape: The clustering algorithm should be
capable of detecting clusters of arbitrary shape. It should not be bounded to only
distance measures that tend to find spherical clusters of small size.
 High dimensionality: The clustering algorithm should be able to handle not only
low-dimensional data but also high-dimensional space.
 Ability to deal with noisy data: Databases contain noisy, missing or erroneous
data. Some algorithms are sensitive to such data and may lead to poor quality
clusters.
 Interpretability: The clustering results should be interpretable, comprehensible, and
usable.


 Basic Clustering Methods:


Clustering methods can be classified into the following categories:
1. Partitioning Method
2. Hierarchical Method
3. Density-based Method
4. Grid-Based Method
5. Model-Based Method
6. Constraint-based Method

1. Partitioning Method:
Suppose we are given a database of ‘n’ objects and the partitioning method constructs ‘k’
partition of data. Each partition will represent a cluster and k ≤ n. It means that it will
classify the data into k groups, which satisfy the following requirements:
 Each group contains at least one object.
 Each object must belong to exactly one group.
Points to remember:
 For a given number of partitions (say k), the partitioning method will create an initial
partitioning.
 Then it uses the iterative relocation technique to improve the partitioning by moving
objects from one group to another.

Algorithm: k-means.
The k-means algorithm for partitioning, where each cluster’s centre is represented by the
mean value of the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centres;
(2) repeat
(3)     (re)assign each object to the cluster to which the object is the most similar,
        based on the mean value of the objects in the cluster;
(4)     update the cluster means, that is, calculate the mean value of the objects for
        each cluster;
(5) until no change.
A runnable sketch of this method is given after the worked example below.

Example: K-means
Question: Use k-means algorithm to create 3 clusters for given set of values:
{2,3,6,8,9,12,15,18,22}
Answer:
Set of values: 2,3,6,8,9,12,15,18,22
1. Break given set of values randomly in to 3 clusters and calculate the mean value.
K1: 2,8,15 mean=8.3
K2: 3,9,18 mean=10
K3: 6,12,22 mean=13.3

2. Reassign the values to clusters as per the mean calculated and calculate the mean
again.
K1: 2,3,6,8,9 mean=5.6
K2: (empty) mean=0 (no value is nearest to this cluster's mean; this example treats an
empty cluster's mean as 0)
K3: 12,15,18,22 mean=16.75

3. Reassign the values to clusters as per the mean calculated and calculate the mean
again.
K1: 3,6,8,9 mean=6.5
K2: 2 mean=2
K3: 12,15,18,22 mean=16.75

4. Reassign the values to clusters as per the mean calculated and calculate the mean
again.
K1: 6,8,9 mean=7.67
K2: 2,3 mean=2.5
K3: 12,15,18,22 mean=16.75

5. Reassign the values to clusters as per the mean calculated and calculate the mean
again.
K1: 6,8,9,12 mean=8.75
K2: 2,3 mean=2.5
K3: 15,18,22 mean=18.33

12
Padmashri Dr. Vitthalrao Vikhe Patil Institute of Technology & Engineering (POLYTECHNIC), Loni 0030
DWM 22621

6. Reassign the values to clusters as per the mean calculated and calculate the mean
again.
K1: 6,8,9,12 mean=8.75
K2: 2,3 mean=2.5
K3: 15,18,22 mean=18.33

7. The means of all three clusters remain the same, so the algorithm stops.

So, the final 3 clusters are {6,8,9,12}, {2,3}, {15,18,22}
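
Here is a minimal runnable sketch of the k-means method above, restricted to 1-D data. It
follows the worked example's conventions (including treating an empty cluster's mean as
0), and the initial means come from the example's random partition; the function name is
illustrative.

def kmeans_1d(values, means, max_iter=100):
    # `means` holds the initial cluster means, one per cluster.
    for _ in range(max_iter):
        # Assignment step: each value goes to the cluster with the nearest mean.
        clusters = [[] for _ in means]
        for v in values:
            nearest = min(range(len(means)), key=lambda i: abs(v - means[i]))
            clusters[nearest].append(v)
        # Update step: recompute each cluster mean (0 for an empty cluster,
        # as in the worked example above).
        new_means = [sum(c) / len(c) if c else 0 for c in clusters]
        if new_means == means:   # no change: converged
            break
        means = new_means
    return clusters

# Initial means of the random partition {2,8,15}, {3,9,18}, {6,12,22}.
data = [2, 3, 6, 8, 9, 12, 15, 18, 22]
print(kmeans_1d(data, [25 / 3, 10, 40 / 3]))
# -> [[6, 8, 9, 12], [2, 3], [15, 18, 22]], the same final clusters.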

2. Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can
classify hierarchical methods on the basis of how the hierarchical decomposition is formed.
There are two approaches here:
 Agglomerative Approach
 Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object
forming a separate group. It keeps on merging the objects or groups that are close to one
another. It keeps on doing so until all of the groups are merged into one or until the
termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the
objects in the same cluster. In each iteration, a cluster is split up into smaller clusters. This
continues until each object is in its own cluster or the termination condition holds.
Hierarchical methods are rigid, i.e., once a merging or splitting is done, it can never be
undone.
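
As a sketch of the agglomerative (bottom-up) approach, SciPy's hierarchical clustering can
be used; the 1-D data set is reused from the k-means example purely for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# One observation per row, as SciPy expects.
data = np.array([2, 3, 6, 8, 9, 12, 15, 18, 22]).reshape(-1, 1)

# Bottom-up merging: repeatedly join the two closest groups
# (single linkage); Z records the sequence of merges.
Z = linkage(data, method="single")

# Cut the merge tree so that exactly 3 flat clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)   # one cluster id per data point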

3. Density-based Method
This method is based on the notion of density. The basic idea is to continue growing a
given cluster as long as the density in its neighbourhood exceeds some threshold, i.e., for
each data point within a given cluster, the neighbourhood of a given radius has to contain
at least a minimum number of points. A minimal sketch of this idea is given below.
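
A well-known algorithm of this kind is DBSCAN. The following is a minimal 1-D sketch of
the density idea only (a cluster grows from any point whose eps-neighbourhood contains at
least min_pts points); it is an illustration under those assumptions, not a full DBSCAN
implementation.

def density_cluster(points, eps, min_pts):
    # Returns one label per point; -1 marks noise (outliers).
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j in range(len(points))
                if abs(points[i] - points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1        # not dense enough; provisionally noise
            continue
        labels[i] = cluster       # i is a dense (core) point: grow a cluster
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:   # noise reachable from a dense point
                labels[j] = cluster
            if labels[j] is None:
                labels[j] = cluster
                more = neighbours(j)
                if len(more) >= min_pts:  # j is dense too: keep growing
                    seeds.extend(more)
        cluster += 1
    return labels

print(density_cluster([2, 3, 6, 8, 9, 12, 15, 18, 22], eps=3, min_pts=2))
# -> [0, 0, 0, 0, 0, 0, 0, 0, -1]: one density-connected cluster; 22 is noise.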

4. Grid-based Method
In this method, the object space is quantized into a finite number of cells that together form
a grid structure, and clustering operations are performed on this grid.
Advantages:
 The major advantage of this method is fast processing time.
 It is dependent only on the number of cells in each dimension in the quantized
space.

5. Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data for a
given model. This method locates the clusters by clustering the density function. It reflects
spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters based
on standard statistics, taking outlier or noise into account. It therefore yields robust
clustering methods.

6. Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-
oriented constraints. A constraint refers to the user expectation or the properties of desired
clustering results. Constraints provide us with an interactive way of communication with the
clustering process. Constraints can be specified by the user or the application requirement.

Applications of Clustering:
Clustering algorithms can be applied in many fields, for instance:
1. Marketing: finding groups of customers with similar behaviour given a large database
of customer data containing their properties and past buying records;
2. Biology: classification of plants and animals given their features;
3. Libraries: book ordering;
4. Insurance: identifying groups of motor insurance policy holders with a high average
claim cost; identifying frauds;
5. City-planning: identifying groups of houses according to their house type, value and
geographical location;
6. Earthquake studies: clustering observed earthquake epicenters to identify dangerous
zones;
7. WWW: document classification; clustering weblog data to discover groups of similar
access patterns.


Assignment No. 5

1. State applications of cluster analysis. (2)


2. Define cluster analysis or clustering. (2)
3. Define frequent itemset and closed itemset. (2)
4. Describe association rules in data mining with an example. (4/6)
5. Explain Market basket analysis with example. (4/6)
6. Describe the requirements of clustering in data mining. (4)
7. Consider the database (D) with min_supp=50% and 60% and min_confidence=80%

TID Items
1 K, A, D, B
2 D, A, C, E, B
3 C, A, B, E
4 B, A, D

Find all frequent itemsets using the Apriori method. List the strong association rules. (6)
8. List the clustering methods and explain any two. (6)
9. Explain Apriori algorithms for frequent itemset using candidate generation. (6)
10. Consider the data sets given and create 3 clusters (Data set 1) and 4 clusters (Data
set 2) using the k-means method. (6)
Data set 1: {10,4,2,12,3,20,30,11,25,31}
Data set 2: {8,14,2,22,13,40,30,18,25,10}
