Association-Analysis
Pukar Karki
Assistant Professor
[email protected]
Contents
1. Basics and Algorithms
2. Frequent Itemset Pattern & Apriori Principle
3. FP-Growth, FP-Tree
4. Handling Categorical Attributes
5. Sequential, Subgraph, and Infrequent Patterns
2
Contents
1. Basics and Algorithms
2. Frequent Itemset Pattern & Apriori Principle
3. FP-Growth, FP-Tree
4. Handling Categorical Attributes
5. Sequential, Subgraph, and Infrequent Patterns
3
Frequent Pattern Mining
Frequent pattern mining searches for recurring relationships in a given data set.
4
Frequent Pattern Mining
✔ Frequent itemset mining leads to the discovery of associations and correlations among items in large transactional or relational data sets.
✔ With massive amounts of data continuously being collected and stored, many industries are becoming interested in mining such patterns from their databases.
5
Frequent Pattern Mining
✔ The discovery of interesting correlation relationships among huge amounts of business transaction records can help in many business decision-making processes such as
- catalog design
- cross-marketing, and
- customer shopping behavior analysis.
6
Frequent Pattern Mining – Market Basket Analysis
✔ A typical example of frequent itemset mining is market basket analysis.
✔ This process analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets”.
✔ The discovery of these associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers.
7
Frequent Pattern Mining – Market Basket Analysis
✔ For instance, if customers are buying milk, how likely are they to also buy bread (and what kind of bread) on the same trip?
✔ This information can lead to increased sales by helping retailers do selective marketing and plan their shelf space.
8
Frequent Pattern Mining – Market Basket Analysis
For example, the information that customers who purchase computers also tend to buy antivirus software at the same time is represented in the following association rule:
computer ⇒ antivirus_software [support = 2%, confidence = 60%]
A support of 2% for this rule means that 2% of all the transactions under analysis show that computer and antivirus software are purchased together.
A confidence of 60% means that 60% of the customers who purchased a computer also bought the software.
9
Frequent Pattern Mining – Market Basket Analysis
✔ Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.
✔ These thresholds can be set by users or domain experts.
10
Frequent Itemsets, Closed Itemsets, and Association Rules
Let I = {I1, I2,..., Im} be an itemset.
Let D, the task-relevant data, be a set of database transactions where
each transaction T is a nonempty itemset such that T ⊆ I.
Each transaction is associated with an identifier, called a TID.
Let A be a set of items.
A transaction T is said to contain A if A ⊆ T.
11
Frequent Itemsets, Closed Itemsets, and Association Rules
An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂
I, A ≠ ∅, B ≠ ∅, and A ∩ B = ∅.
The rule A ⇒ B holds in the transaction set D with support s, where s is
the percentage of transactions in D that contain A ∪ B (i.e., the union of sets A and B or, equivalently, both A and B).
This is taken to be the probability, P(A ∪ B).
12
Frequent Itemsets, Closed Itemsets, and Association Rules
The rule A ⇒ B has confidence c in the transaction set D, where c is the
percentage of transactions in D containing A that also contain B.
This is taken to be the conditional probability, P(B|A).
13
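In symbols, these two measures can be written as follows (a standard formulation consistent with the definitions above, where support_count(X) denotes the number of transactions that contain X):

\mathrm{support}(A \Rightarrow B) = P(A \cup B) = \frac{\mathrm{support\_count}(A \cup B)}{|D|}

\mathrm{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{\mathrm{support\_count}(A \cup B)}{\mathrm{support\_count}(A)}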
Frequent Itemsets, Closed Itemsets, and Association Rules
A set of items is referred to as an itemset.
An itemset that contains k items is a k-itemset.
The set {computer, antivirus software} is a 2-itemset.
14
Frequent Itemsets, Closed Itemsets, and Association Rules
In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets will
occur at least as frequently as a predetermined minimum support count,
min sup.
2. Generate strong association rules from the frequent itemsets:
By definition, these rules must satisfy minimum support and minimum
confidence.
15
Contents
1. Basics and Algorithms
2. Frequent Itemset Pattern & Apriori Principle
3. FP-Growth, FP-Tree
4. Handling Categorical Attributes
5. Sequential, Subgraph, and Infrequent Patterns
16
Apriori Algorithm
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in
1994 for mining frequent itemsets for Boolean association rules.
The name of the algorithm is based on the fact that the algorithm uses
prior knowledge of frequent itemset properties.
17
Apriori Algorithm
Apriori property: All nonempty subsets of a frequent itemset must also be
frequent.
18
Apriori Algorithm
✔ The Apriori property is based on the following observation.
✔ By definition, if an itemset I does not satisfy the minimum support threshold, min_sup, then I is not frequent, that is, P(I) < min_sup.
✔ If an item A is added to the itemset I, then the resulting itemset (i.e., I ∪ A) cannot occur more frequently than I.
✔ Therefore, I ∪ A is not frequent either, that is, P(I ∪ A) < min_sup.
19
Apriori Algorithm
✔ This property belongs to a special category of properties called antimonotonicity, in the sense that if a set cannot pass a test, all of its supersets will fail the same test as well.
✔ It is called antimonotonicity because the property is monotonic in the context of failing a test.
20
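Stated compactly (this restates the property above in symbols):

A \subseteq B \;\Rightarrow\; \mathrm{support}(B) \le \mathrm{support}(A), \quad \text{so} \quad \mathrm{support}(A) < min\_sup \;\Rightarrow\; \mathrm{support}(B) < min\_sup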
Apriori Algorithm: Example
Consider an example, based on the AllElectronics transaction
database, D.
There are nine transactions in this database, that is, |D| = 9.
21
Apriori Algorithm: Example
In the first iteration of the algorithm, each item is a member of the set of
candidate 1-itemsets, C1. The algorithm simply scans all of the transactions
to count the number of occurrences of each item.
22
Apriori Algorithm: Example
Suppose that the minimum support count required is 2, that is, min_sup = 2.
(Here, we are referring to absolute support because we are using a support
count. The corresponding relative support is 2/9 = 22%.)
The set of frequent 1-itemsets, L1, can then be determined. It consists of the
candidate 1-itemsets satisfying minimum support.
23
Apriori Algorithm: Example
To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈
L1 to generate a candidate set of 2-itemsets, C2.
24
Apriori Algorithm: Example
Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated.
25
Apriori Algorithm: Example
The set of frequent 2-itemsets, L2, is then determined, consisting of
those candidate 2-itemsets in C2 having minimum support.
26
Apriori Algorithm: Example
The generation of the set of the candidate 3-itemsets, C3, is detailed below.
From the join step, we first get
C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
The prune step then removes every candidate that has an infrequent 2-item subset, leaving C3 = {{I1, I2, I3}, {I1, I2, I5}}.
27
Apriori Algorithm: Example
The transactions in D are scanned to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support, giving L3 = {{I1, I2, I3}, {I1, I2, I5}}.
28
Apriori Algorithm: Example
The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4.
Although the join results in {{I1, I2, I3, I5}}, itemset {I1, I2, I3, I5} is
pruned because its subset {I2, I3, I5} is not frequent.
Thus, C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets.
29
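The whole procedure can be summarized in a short sketch. This is a minimal illustration rather than the textbook's pseudocode; the nine transactions are assumed to match the AllElectronics example (the transaction table itself is not reproduced in these notes), and the helper names are mine.

from itertools import combinations

# Assumed AllElectronics-style transactions (T100..T900); replace with your own data.
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_sup = 2  # absolute (count-based) minimum support

def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, min_sup):
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets
    L = [frozenset([i]) for i in sorted(items)
         if support_count(frozenset([i]), transactions) >= min_sup]
    frequent = list(L)
    k = 2
    while L:
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step (Apriori property): drop candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        L = [c for c in candidates if support_count(c, transactions) >= min_sup]
        frequent.extend(L)
        k += 1
    return frequent

for itemset in apriori(transactions, min_sup):
    print(sorted(itemset), support_count(itemset, transactions))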
Generating Association Rules from Frequent Itemsets
For each frequent itemset l, generate all nonempty subsets of l.
For every nonempty subset s of l, output the rule “s ⇒ (l − s)” if confidence(s ⇒ (l − s)) = support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
30
Generating Association Rules from Frequent Itemsets
Because the rules are generated from frequent itemsets, each one automatically satisfies the minimum support.
Frequent itemsets can be stored ahead of time in hash tables along with
their counts so that they can be accessed quickly.
31
Generating Association Rules: Example
Consider the frequent itemset X = {I1, I2, I5}. What are the
association rules that can be generated from X?
The nonempty subsets of X are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}. The resulting association rules, each listed with its confidence, can be computed as in the sketch below:
32
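A sketch of this computation (the support counts below are assumed from the same AllElectronics-style data used earlier, so treat the exact numbers as illustrative):

from itertools import combinations

# Assumed support counts for the relevant itemsets (illustrative values).
support_count = {
    frozenset(s): c for s, c in [
        (("I1",), 6), (("I2",), 7), (("I5",), 2),
        (("I1", "I2"), 4), (("I1", "I5"), 2), (("I2", "I5"), 2),
        (("I1", "I2", "I5"), 2),
    ]
}

X = frozenset({"I1", "I2", "I5"})
for r in range(1, len(X)):
    for subset in combinations(sorted(X), r):
        s = frozenset(subset)
        confidence = support_count[X] / support_count[s]
        print(f"{set(s)} => {set(X - s)}  confidence = {confidence:.0%}")

If the minimum confidence threshold were, say, 70%, only the rules whose confidence reaches 70% would be output as strong.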
Improving the efficiency of Apriori
Hash-based technique:
A hash-based technique can be used to reduce the size of the
candidate k-itemsets, Ck , for k > 1.
For example, when scanning each transaction in the database to
generate the frequent 1-itemsets, L1, we can generate all the 2-itemsets
for each transaction, hash (i.e., map) them into the different buckets of a
hash table structure, and increase the corresponding bucket counts.
A 2-itemset with a corresponding bucket count in the hash table that is
below the support threshold cannot be frequent and thus should be
removed from the candidate set.
Such a hash-based technique may substantially reduce the number of
candidate k-itemsets examined (especially when k = 2).
33
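A minimal sketch of this idea (the bucket count, the hash function, and the small transaction list are all illustrative choices, not a prescribed design):

from itertools import combinations

transactions = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}]  # illustrative subset
min_sup = 2
NUM_BUCKETS = 7
bucket_count = [0] * NUM_BUCKETS

def bucket(pair):
    # Any deterministic mapping of a 2-itemset to a bucket works; this is one choice.
    return hash(frozenset(pair)) % NUM_BUCKETS

# While scanning the transactions to count 1-itemsets for L1, also hash every
# 2-itemset of each transaction and increment the count of its bucket.
for t in transactions:
    for pair in combinations(sorted(t), 2):
        bucket_count[bucket(pair)] += 1

def may_be_frequent(pair):
    # A 2-itemset whose bucket count is below min_sup cannot be frequent,
    # so it can be removed from the candidate set C2.
    return bucket_count[bucket(pair)] >= min_sup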
Improving the efficiency of Apriori
Hash-based technique:
34
Improving the efficiency of Apriori
Transaction reduction:
A transaction that does not contain any frequent k-itemsets cannot
contain any frequent (k + 1)-itemsets.
Therefore, such a transaction can be marked or removed from
further consideration because subsequent database scans for j-
itemsets, where j > k, will not need to consider such a transaction.
35
Improving the efficiency of Apriori
Partitioning:
A partitioning technique can be used that requires just two database
scans to mine the frequent itemsets
36
Improving the efficiency of Apriori
Sampling:
The basic idea of the sampling approach is to pick a random sample S of
the given data D, and then search for frequent itemsets in S instead of D.
In this way, we trade off some degree of accuracy against efficiency.
The sample size of S is such that the search for frequent itemsets in S can be
done in main memory, and so only one scan of the transactions in S is
required overall.
Because we are searching for frequent itemsets in S rather than in D, it is
possible that we will miss some of the global frequent itemsets.
37
Improving the efficiency of Apriori
Sampling:
To reduce this possibility, we use a lower support threshold than minimum
support to find the frequent itemsets local to S (denoted LS).
The rest of the database is then used to compute the actual frequencies
of each itemset in LS.
A mechanism is used to determine whether all the global frequent
itemsets are included in LS.
If LS actually contains all the frequent itemsets in D, then only one scan of
D is required.
Otherwise, a second pass can be done to find the frequent itemsets that were missed in the first pass.
38
Contents
1. Basics and Algorithms
2. Frequent Itemset Pattern & Apriori Principle
3. FP-Growth, FP-Tree
4. Handling Categorical Attributes
5. Sequential, Subgraph, and Infrequent Patterns
39
A Pattern-Growth Approach for Mining Frequent Itemsets
In many cases the Apriori candidate generate-and-test method
significantly reduces the size of candidate sets, leading to good
performance gain. However, it can suffer from two nontrivial costs:
1) It may still need to generate a huge number of candidate sets. For example, if there are 10^4 frequent 1-itemsets, the Apriori algorithm will need to generate more than 10^7 candidate 2-itemsets.
2) It may need to repeatedly scan the whole database and check a large
set of candidates by pattern matching. It is costly to go over each
transaction in the database to determine the support of the candidate
itemsets.
40
FP-Growth
Consider an example, based on the AllElectronics transaction
database, D.
There are nine transactions in this database, that is, |D| = 9.
41
FP-Growth
An FP-tree is then constructed as follows.
First, create the root of the tree, labeled with “null.”
Scan database D a second time.
The items in each transaction are processed in L order (i.e., sorted according to descending support count) and a branch is created for each transaction.
42
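A compact sketch of the construction just described (the node and header-table layout is one possible implementation, not the only one, and the transactions are again the assumed AllElectronics-style data):

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # First scan: count supports and keep the frequent items in L order
    # (descending support count, ties broken alphabetically here).
    counts = Counter(i for t in transactions for i in t)
    L_order = sorted((i for i in counts if counts[i] >= min_sup),
                     key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(L_order)}
    root = FPNode(None, None)
    header = {item: [] for item in L_order}  # header table: item -> its nodes in the tree
    # Second scan: insert each transaction with its items sorted in L order,
    # sharing existing prefixes and incrementing their counts.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=lambda i: rank[i]):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header, L_order

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
root, header, L_order = build_fp_tree(transactions, min_sup=2)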
FP-Growth
43
Mining FP-Tree
The FP-tree is mined as follows.
Start from each frequent length-1 pattern (as an initial suffix pattern),
construct its conditional pattern base (a “sub-database,” which
consists of the set of prefix paths in the FP-tree co-occurring with the
suffix pattern), then construct its (conditional) FP-tree, and perform
mining recursively on the tree.
The pattern growth is achieved by the concatenation of the suffix
pattern with the frequent patterns generated from a conditional FP-
tree.
44
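Building on the FP-tree sketch above, the conditional pattern base of an item can be collected by walking each of its nodes back toward the root (a sketch; FPNode and header come from the previous example):

def conditional_pattern_base(item, header):
    # For each occurrence of `item`, record the prefix path to the root together
    # with that occurrence's count; these pairs form the conditional pattern base.
    paths = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            paths.append((list(reversed(path)), node.count))
    return paths

# For the assumed example data, conditional_pattern_base("I5", header) yields the
# prefix paths of I5: {I2, I1} with count 1 and {I2, I1, I3} with count 1.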
Mining FP-Tree
We first consider I5, which is the last item in L, rather than the first.
45
Mining FP-Tree
46
FP-Growth
The FP-growth method transforms the problem of finding long
frequent patterns into searching for shorter ones in much smaller
conditional databases recursively and then concatenating the suffix.
It uses the least frequent items as a suffix, offering good selectivity.
The method substantially reduces the search costs.
47
FP-Growth Vs Apriori
A study of the FP-growth method performance shows that it is
efficient and scalable for mining both long and short frequent
patterns, and is about an order of magnitude faster than the Apriori
algorithm.
48
Contents
1. Basics and Algorithms
2. Frequent Itemset Pattern & Apriori Principle
3. FP-Growth, FP-Tree
4. Handling Categorical Attributes
5. Sequential, Subgraph, and Infrequent Patterns
49
Handling Categorical Attributes
Until now, we have assumed that the input data consists of binary
attributes called items.
The presence of an item in a transaction is also assumed to be more
important than its absence.
As a result, an item is treated as an asymmetric binary attribute and
only frequent patterns are considered interesting.
50
Handling Categorical Attributes
There are many applications that contain symmetric binary and nominal attributes.
51
Handling Categorical Attributes
To extract such patterns, the categorical and symmetric binary
attributes are transformed into “items” first, so that existing association
rule mining algorithms can be applied.
This type of transformation can be performed by creating a new item
for each distinct attribute-value pair.
52
Handling Categorical Attributes
For example, the nominal attribute Level of Education can be replaced
by three binary items: Education = College, Education =
Graduate, and Education = High School.
Similarly, symmetric binary attributes such as Gender can be converted into a pair of binary items, Male and Female.
53
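A minimal sketch of this transformation using pandas (the toy DataFrame and its column values are hypothetical; pd.get_dummies simply creates one binary column per distinct attribute-value pair):

import pandas as pd

# Hypothetical survey records; the attribute names mirror the slide's example.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Education": ["College", "Graduate", "High School"],
})

# One binary "item" per attribute-value pair, e.g. Education=College, Gender=Male, ...
items = pd.get_dummies(df, prefix_sep="=")
print(items.columns.tolist())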
Handling Categorical Attributes
54
Handling Categorical Attributes: Issues(1)
Some attribute values may not be frequent enough to be part of a
frequent pattern.
This problem is more evident for nominal attributes that have many
possible values, e.g., state names.
Lowering the support threshold does not help because it exponentially
increases the number of frequent patterns found (many of which may be
spurious) and makes the computation more expensive.
55
Handling Categorical Attributes: Issues(1)
A more practical solution is to group related attribute values into a small number of categories.
For example, each state name can be replaced by its corresponding geographical region, such as Midwest, Pacific Northwest, Southwest, and East Coast.
Another possibility is to aggregate the less frequent attribute values into a single category called Others.
56
Handling Categorical Attributes: Issues(2)
Some attribute values may have considerably higher frequencies than
others.
For example, suppose 85% of the survey participants own a home
computer.
By creating a binary item for each attribute value that appears frequently in the data, we may potentially generate many redundant patterns.
57
Handling Categorical Attributes: Issues(2)
Because the high-frequency items correspond to the typical values of an
attribute, they seldom carry any new information that can help us to
better understand the pattern.
It may therefore be useful to remove such items before applying standard
association analysis algorithms.
58
Handling Categorical Attributes: Issues(3)
Although the width of every transaction is the same as the number of
attributes in the original data, the computation time may increase
especially when many of the newly created items become frequent.
This is because more time is needed to deal with the additional candidate
itemsets generated by these items.
One way to reduce the computation time is to avoid generating candidate
itemsets that contain more than one item from the same attribute.
For example, we do not have to generate a candidate itemset such as
{State = X, State = Y, . . .} because the support count of the itemset is
zero.
59
Contents
1. Basics and Algorithms
2. Frequent Itemset Pattern & Apriori Principle
3. FP-Growth, FP-Tree
4. Handling Categorical Attributes
5. Sequential, Subgraph, and Infrequent Patterns
60
Sequential Patterns
Event-based data collected from scientific experiments or the monitoring of physical systems, such as telecommunications networks, computer networks, and wireless sensor networks, have an inherent sequential nature to them.
This sequential information may be valuable for identifying recurring features of a dynamic system or predicting future occurrences of certain events.
61
Sequential Patterns
Event 6 is followed by event 1 in all of the sequences. Note that such a pattern cannot be inferred if we treat this as market basket data, ignoring the information about the object and timestamp.
62
Sequential Patterns
63
Subgraph Patterns
Association analysis methods can be extended to graphs, which are more complex entities than itemsets and sequences.
A number of entities such as chemical compounds, 3-D protein
structures, computer networks, and tree structured XML documents can
be modeled using a graph representation.
64
Subgraph Patterns
A useful data mining task to perform on this type of data is to derive
a set of frequently occurring substructures in a collection of graphs.
Such a task is known as frequent subgraph mining.
65
Subgraph Patterns
Subgraph: A graph G′ = (V′, E′) is a subgraph of another graph G = (V, E) if its vertex set V′ is a subset of V and its edge set E′ is a subset of E, such that the endpoints of every edge in E′ are contained in V′.
66
Subgraph Patterns
Support: Given a collection of graphs G, the support for a subgraph g is defined as the fraction of all graphs in G that contain g as a subgraph, i.e., s(g) = |{Gi ∈ G : g is a subgraph of Gi}| / |G|.
67
Infrequent Patterns
The association analysis formulation described so far is based on the
premise that the presence of an item in a transaction is more
important than its absence.
As a consequence, patterns that are rarely found in a database are
often considered to be uninteresting and are eliminated using the
support measure.
Such patterns are known as infrequent patterns.
70
Infrequent Patterns
Some infrequent patterns may also suggest the occurrence of
interesting rare events or exceptional situations in the data.
For example, if {Fire = Yes} is frequent but {Fire = Yes, Alarm = On}
is infrequent, then the latter is an interesting infrequent pattern
because it may indicate faulty alarm systems.
To detect such unusual situations, the expected support of a pattern
must be determined, so that, if a pattern turns out to have a
considerably lower support than expected, it is declared as an
interesting infrequent pattern.
71
Infrequent Patterns
Key issues in mining infrequent patterns are:
(1) how to identify interesting infrequent patterns, and
(2) how to efficiently discover them in large data sets.
72