
Fundamentals of Data Science

UNIT-3

Mining Frequent Patterns

Basic concepts:

Frequent pattern mining searches for recurring relationships in a given data set. This section
introduces the basic concepts of frequent pattern mining for the discovery of interesting associations
and correlations between item sets in transactional and relational databases.

Frequent pattern mining in data mining is the process of identifying patterns or associations within a
dataset that occur frequently. This is typically done by analysing large datasets to find items or sets of
items that appear together frequently.

A classic example is market basket analysis, the earliest form of frequent pattern mining for association rules.

Definition of Frequent Patterns: Frequent patterns refer to combinations of items, sequences, or substructures that occur frequently in a dataset. For example, in a retail dataset, a frequent pattern could be the association between certain products that are often purchased together, like bread and butter.

Mining frequent patterns in data science involves identifying recurring associations or relationships
within a dataset.

Market Basket Analysis: A Motivating Example

Frequent itemset mining leads to the discovery of associations and correlations among items in large
transactional or relational data sets. With massive amounts of data continuously being collected and
stored, many industries are becoming interested in mining such patterns from their databases.

The discovery of interesting correlation relationships among huge amounts of business transaction records can help in many business decision-making processes such as catalog design, cross-marketing, and customer shopping behaviour analysis.

A typical example of frequent itemset mining is market basket analysis. This process analyzes
customer buying habits by finding associations between the different items that customers place in
their “shopping baskets” (Figure). The discovery of these associations can help retailers develop
marketing strategies by gaining insight into which items are frequently purchased together by
customers.

For instance, if customers are buying milk, how likely are they to also buy bread (and what kind of
bread) on the same trip to the supermarket? This information can lead to increased sales by helping
retailers do selective marketing and plan their shelf space.


Figure: Market basket analysis.

Frequent Itemset Mining Methods:

In this section, you will learn methods for mining the simplest form of frequent patterns, such as those discussed for market basket analysis.

Apriori Algorithm:

Finding Frequent Itemsets Using Candidate Generation: The Apriori Algorithm

• Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules.

• The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties.

• Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets.

• First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item and collecting those items that satisfy minimum support. The resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found.

• The finding of each Lk requires one full scan of the database.

• A two-step process is followed in Apriori, consisting of a join action and a prune action.
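
The level-wise search can be sketched in a few lines of Python. This is a minimal illustration under assumptions of our own, not an optimized implementation: the name apriori, the frozenset representation of itemsets, and the absolute (count-based) min_sup parameter are all illustrative choices.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori sketch. transactions: iterable of item sets;
    min_sup: minimum support as an absolute count."""
    # First scan: count 1-itemsets and keep those meeting minimum support (L1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(Lk)

    k = 2
    while Lk:
        # Join step: combine (k-1)-itemsets that differ in exactly one item.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step: drop any candidate with an infrequent (k-1)-subset
        # (the Apriori property).
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # One full scan of the database per level to count candidate supports.
        Lk = {c: sum(1 for t in transactions if c <= set(t)) for c in candidates}
        Lk = {s: c for s, c in Lk.items() if c >= min_sup}
        frequent.update(Lk)
        k += 1
    return frequent
```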


Example:

TID     List of item IDs
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3

There are nine transactions in this database; that is, |D| = 9.


Steps:

1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item.

2. Suppose that the minimum support count required is 2, that is, min_sup = 2. The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum support.

3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate a candidate set of 2-itemsets, C2. No candidates are removed from C2 during the prune step because each subset of the candidates is also frequent.

4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated.

5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.

6. The generation of the set of candidate 3-itemsets, C3, begins with the join step, from which we first get C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent.

7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.

8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4.
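
Assuming the apriori sketch given earlier, the nine-transaction database above can be mined as follows; the comments summarize what the run produces.

```python
# The nine transactions of the table above, as Python sets.
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
     {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
     {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]

freq = apriori(D, min_sup=2)

# L3 comes out as {I1, I2, I3}: 2 and {I1, I2, I5}: 2. Joining L3 with L3
# yields the single candidate {I1, I2, I3, I5}, which the prune step rejects
# because its subset {I2, I3, I5} is not frequent, so the search terminates.
print(sorted((sorted(s), c) for s, c in freq.items()))
```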


FP-growth (finding frequent itemsets without candidate generation)

• We re-examine the mining of the transaction database, D, shown in the table above using the frequent pattern growth approach.

• The first scan of the database is the same as in Apriori, which derives the set of frequent items (1-itemsets) and their support counts (frequencies). Let the minimum support count be 2. The set of frequent items is sorted in the order of descending support count. This resulting set or list is denoted L.

• An FP-tree is then constructed as follows. First, create the root of the tree, labeled with “null.” Scan database D a second time. The items in each transaction are processed in L order (i.e., sorted according to descending support count), and a branch is created for each transaction.


• For example, the scan of the first transaction, “T100: I1, I2, I5,” which contains three items (I2, I1, I5 in L order), leads to the construction of the first branch of the tree with three nodes, ⟨I2: 1⟩, ⟨I1: 1⟩, and ⟨I5: 1⟩, where I2 is linked as a child of the root, I1 is linked to I2, and I5 is linked to I1.

• The second transaction, T200, contains the items I2 and I4 in L order, which would result in a branch where I2 is linked to the root and I4 is linked to I2. However, this branch would share a common prefix, I2, with the existing path for T100.

• Therefore, we instead increment the count of the I2 node by 1, and create a new node, ⟨I4: 1⟩, which is linked as a child of ⟨I2: 2⟩. In general, when considering the branch to be added for a transaction, the count of each node along a common prefix is incremented by 1, and nodes for the items following the prefix are created and linked accordingly.

• To facilitate tree traversal, an item header table is built so that each item points to its occurrences in the tree via a chain of node-links. The tree obtained after scanning all of the transactions is shown in Figure 5.7 with the associated node-links. In this way, the problem of mining frequent patterns in databases is transformed to that of mining the FP-tree.
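
Below is a compact Python sketch of the two-scan FP-tree construction just described. The Node class, the header-table layout (item mapped to a list of its nodes, standing in for the node-link chains), and the tie-breaking within L order are our own illustrative choices.

```python
from collections import defaultdict

class Node:
    """One FP-tree node: an item, its count, a parent link, and children."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fptree(transactions, min_sup):
    # Scan 1: derive the frequent items and their support counts.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    L = [i for i in sorted(counts, key=counts.get, reverse=True)
         if counts[i] >= min_sup]
    rank = {item: r for r, item in enumerate(L)}  # position in L order

    root = Node(None, None)                 # the "null" root
    header = defaultdict(list)              # item -> its node-link chain
    # Scan 2: insert each transaction with items in descending support order.
    for t in transactions:
        items = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in items:
            if item in node.children:       # shared prefix: bump the count
                node.children[item].count += 1
            else:                           # new branch after the prefix
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
    return root, header

root, header = build_fptree(D, min_sup=2)   # D as defined earlier
```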


The FP-tree is mined as follows.

• Start from each frequent length-1 pattern (as an initial suffix pattern), construct its conditional pattern base (a “sub-database,” which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern), then construct its (conditional) FP-tree, and perform mining recursively on such a tree. The pattern growth is achieved by the concatenation of the suffix pattern with the frequent patterns generated from a conditional FP-tree.

• Mining of the FP-tree is summarized in Table 5.2 and detailed as follows. We first consider I5, which is the last item in L, rather than the first. The reason for starting at the end of the list will become apparent as we explain the FP-tree mining process. I5 occurs in two branches of the FP-tree of Figure 5.7. (The occurrences of I5 can easily be found by following its chain of node-links.) The paths formed by these branches are ⟨I2, I1, I5: 1⟩ and ⟨I2, I1, I3, I5: 1⟩.

• Therefore, considering I5 as a suffix, its corresponding two prefix paths are ⟨I2, I1: 1⟩ and ⟨I2, I1, I3: 1⟩, which form its conditional pattern base. Its conditional FP-tree contains only a single path, ⟨I2: 2, I1: 2⟩; I3 is not included because its support count of 1 is less than the minimum support count.

• The single path generates all the combinations of frequent patterns: {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}.
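
To make the recursive step concrete, here is a sketch of the mining phase on top of the build_fptree function above. Collecting prefix paths via the node-links and expanding each conditional pattern base into a plain transaction list (simple, though not space-efficient) are our own simplifications.

```python
def prefix_paths(item, header):
    """Conditional pattern base: the prefix paths co-occurring with `item`."""
    base = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

def fpgrowth(transactions, min_sup, suffix=frozenset()):
    """Pattern growth: grow `suffix` with patterns from conditional FP-trees."""
    _, header = build_fptree(transactions, min_sup)
    patterns = {}
    for item in header:                     # each frequent length-1 suffix
        support = sum(n.count for n in header[item])
        grown = suffix | {item}
        patterns[grown] = support
        # Expand the conditional pattern base into plain transactions and
        # recurse on the conditional FP-tree it induces.
        cond_db = [set(path) for path, cnt in prefix_paths(item, header)
                   for _ in range(cnt)]
        if cond_db:
            patterns.update(fpgrowth(cond_db, min_sup, grown))
    return patterns

# For the suffix I5, the recursion sees the conditional pattern base
# {(I2, I1): 1, (I2, I1, I3): 1} and yields {I2, I5}: 2, {I1, I5}: 2,
# and {I2, I1, I5}: 2, matching the walkthrough above.
print(fpgrowth(D, min_sup=2))
```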

Generating Association Rules from Frequent Itemsets

Once the frequent itemsets from the transactions in a database D have been found, it is straightforward to generate strong association rules from them, where strong rules are those that satisfy both minimum support and minimum confidence. The confidence of a rule A ⇒ B can be computed directly from the support counts: confidence(A ⇒ B) = support_count(A ∪ B) / support_count(A).
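
As a sketch, rule generation from the output of the earlier apriori function might look like the following; the name generate_rules and the fractional min_conf parameter are illustrative choices of our own.

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """Emit rules A => B whose confidence, support_count(A u B) /
    support_count(A), meets min_conf. `frequent` maps frozensets to counts."""
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for A in map(frozenset, combinations(itemset, r)):
                # A is frequent by the Apriori property, so its count is known.
                conf = sup / frequent[A]
                if conf >= min_conf:
                    rules.append((set(A), set(itemset - A), conf))
    return rules

for A, B, conf in generate_rules(freq, min_conf=0.7):
    print(f"{A} => {B} (confidence {conf:.0%})")
```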


Mining Multilevel Association Rules

• For many applications, it is difficult to find strong associations among data items at low or primitive levels of abstraction due to the sparsity of data at those levels.
• Strong associations discovered at high levels of abstraction may represent commonsense knowledge.
• Therefore, data mining systems should provide capabilities for mining association rules at multiple levels of abstraction, with sufficient flexibility for easy traversal among different abstraction spaces.
• Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel association rules.
• Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.
• In general, a top-down strategy is employed, where counts are accumulated for the calculation of frequent itemsets at each concept level, starting at concept level 1 and working downward in the hierarchy toward the more specific concept levels, until no more frequent itemsets can be found.
• A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. Data can be generalized by replacing low-level concepts within the data by their higher-level concepts, or ancestors, from a concept hierarchy.


The concept hierarchy has five levels, referred to as levels 0 to 4, starting with level 0 at the root node for “all.”
• Here, level 1 includes computer, software, printer & camera, and computer accessory.
• Level 2 includes laptop computer, desktop computer, office software, and antivirus software.
• Level 3 includes IBM desktop computer, Microsoft office software, and so on.
• Level 4 is the most specific abstraction level of this hierarchy.
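
As an illustration only, a hierarchy like the one described above can be encoded as a parent-to-children mapping; the dictionary layout and the generalize helper below are hypothetical, not part of any standard library.

```python
# The electronics concept hierarchy sketched above (levels 0-4, abbreviated).
hierarchy = {
    "all": ["computer", "software", "printer & camera", "computer accessory"],
    "computer": ["laptop computer", "desktop computer"],
    "software": ["office software", "antivirus software"],
    "desktop computer": ["IBM desktop computer"],
    "office software": ["Microsoft office software"],
}

# Invert to child -> parent so items can be generalized one level at a time.
parent_of = {child: parent
             for parent, children in hierarchy.items()
             for child in children}

def generalize(item):
    """Replace an item by its ancestor one level up, if it has one."""
    return parent_of.get(item, item)

print(generalize("IBM desktop computer"))   # -> desktop computer
print(generalize("desktop computer"))       # -> computer
```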

Approaches For Mining Multilevel Association Rules

1. Uniform Minimum Support:
• The same minimum support threshold is used when mining at each level of abstraction. When a uniform minimum support threshold is used, the search procedure is simplified. The method is also simple in that users are required to specify only one minimum support threshold.

• The uniform support approach, however, has some difficulties. It is unlikely that items at lower levels of abstraction will occur as frequently as those at higher levels of abstraction.

• If the minimum support threshold is set too high, it could miss some meaningful associations occurring at low abstraction levels. If the threshold is set too low, it may generate many uninteresting associations occurring at high abstraction levels.

2. Reduced Minimum Support:

• Each level of abstraction has its own minimum support threshold.
• The deeper the level of abstraction, the smaller the corresponding threshold is. For example, suppose the minimum support thresholds for levels 1 and 2 are 5% and 3%, respectively. In this way, “computer,” “laptop computer,” and “desktop computer” are all considered frequent.


3. Group-Based Minimum Support:

• Because users or experts often have insight as to which groups are more important than others, it is sometimes more desirable to set up user-specific, item-based, or group-based minimum support thresholds when mining multilevel rules.

• For example, a user could set up the minimum support thresholds based on product price, or on items of interest, such as by setting particularly low support thresholds for laptop computers and flash drives in order to pay particular attention to the association patterns containing items in these categories.
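
The three strategies differ only in how the threshold for an itemset is chosen; the toy comparison below makes that explicit (all threshold values are invented for illustration).

```python
def min_sup_uniform(level, item=None):
    return 0.05                                   # the same 5% at every level

def min_sup_reduced(level, item=None):
    # The deeper the level of abstraction, the smaller the threshold.
    return {1: 0.05, 2: 0.03}.get(level, 0.01)

def min_sup_group_based(level, item=None):
    # Particularly low thresholds for groups the user cares about.
    special = {"laptop computer": 0.005, "flash drive": 0.005}
    return special.get(item, min_sup_reduced(level))
```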
