Association Analysis: Basic Concepts and Algorithms
Association Rule Mining: given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
Market-Basket transactions

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Given a set of transactions T, the goal of association rule mining is to find all rules having
support ≥ minsup threshold
confidence ≥ minconf threshold
Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
Example of Rules:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements (a short computational sketch follows below)
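As a concrete sketch (helper names are illustrative, not from the slides), support and confidence can be computed directly from the transactions in the example:

# Minimal sketch: computing support and confidence of a rule X -> Y
# from the market-basket transactions shown above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    # confidence(X -> Y) = support(X u Y) / support(X)
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"Milk", "Diaper", "Beer"}, transactions))      # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # ~0.67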
Mining Association Rules
Two-step approach:
Frequent Itemset Generation
Generate all itemsets whose support ≥ minsup
Rule Generation
Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
[Figure: itemset lattice over items A, B, C, D, E — all candidate itemsets, from the single items up to ABCDE]
Brute-force approach for frequent itemset generation:
Each itemset in the lattice is a candidate frequent itemset.
Count the support of each candidate by scanning the database of N transactions (M candidates, maximum transaction width w):

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Match each transaction against every candidate.
Complexity ~ O(NMw) ⇒ expensive, since M = 2^d !!!
Computational Complexity
Given d unique items:
Total number of itemsets = 2^d
Total number of possible association rules:

$$ R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1 $$
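A quick sanity check of this identity (a small sketch, not from the slides):

from math import comb

def rule_count(d):
    # Number of association rules over d items: sum over antecedent size k,
    # with the consequent drawn from the remaining d - k items.
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

d = 6
print(rule_count(d))           # 602
print(3**d - 2**(d + 1) + 1)   # 602, the closed form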
Apriori principle holds due to the following property of the support measure:
$$ \forall X, Y : (X \subseteq Y) \Rightarrow s(X) \geq s(Y) $$
Support of an itemset never exceeds the support of its subsets
This is known as the anti-monotone property of support
[Figure: illustrating the Apriori principle on the lattice over A, B, C, D, E — once an itemset (AB in the figure) is found to be infrequent, all of its supersets are pruned]
Apriori Algorithm

Method:
Let k=1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified
Generate length (k+1) candidate itemsets from length k frequent itemsets
Prune candidate itemsets containing subsets of length k that are infrequent
Count the support of each candidate by scanning the DB
Eliminate candidates that are infrequent, leaving only those that are frequent (a code sketch of this loop follows below)
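A compact sketch of this loop (reusing the support helper from the earlier sketch; candidates are generated by merging frequent k-itemsets that share their first k−1 items):

from itertools import combinations

def apriori(transactions, minsup):
    items = sorted({i for t in transactions for i in t})
    # Frequent 1-itemsets
    frequent = [(i,) for i in items if support({i}, transactions) >= minsup]
    all_frequent = list(frequent)
    k = 1
    while frequent:
        # Merge frequent k-itemsets sharing their first k-1 items
        candidates = set()
        for a, b in combinations(frequent, 2):
            if a[:-1] == b[:-1]:
                c = tuple(sorted(set(a) | set(b)))
                # Prune candidates that contain an infrequent k-subset
                if all(s in frequent for s in combinations(c, k)):
                    candidates.add(c)
        # Count support by scanning the database, keep only frequent candidates
        frequent = [c for c in candidates if support(set(c), transactions) >= minsup]
        all_frequent.extend(frequent)
        k += 1
    return all_frequent

print(apriori(transactions, 0.4))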
Generate Hash Tree
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5},
{3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
Hash function
Max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets stored in a leaf exceeds the max leaf size, split the node)
[Figure: hash tree built over the 15 candidate 3-itemsets. At each level the hash function sends items 1, 4, 7 to the left branch, items 2, 5, 8 to the middle branch, and items 3, 6, 9 to the right branch; the leaves store the candidate itemsets, e.g. {1 4 5}, {1 3 6}, {3 4 5}, {2 3 4}, {5 6 7}, {6 8 9}]
[Figure: subset operation — enumerating the 3-subsets of transaction t = {1, 2, 3, 5, 6} level by level: level 1 fixes the first item (1 + 2356, 2 + 356, 3 + 56), level 2 fixes the second (12 + 356, 13 + 56, 15 + 6, 23 + 56, 25 + 6, 35 + 6), and level 3 yields the subsets 123, 125, 126, 135, 136, 156, 235, 236, 256, 356]
[Figure: subset operation using the hash tree — the transaction {1 2 3 5 6} is hashed down the tree (1 + 2356, 2 + 356, 3 + 56 at the root, then 12 + 356, 13 + 56, 15 + 6, ... at the next level), so only the leaves reachable along matching paths are visited]
The transaction is matched against only 11 of the 15 candidates (a sketch of the plain subset-matching step follows below).
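For comparison, matching without a hash tree can be sketched by enumerating all 3-subsets of the transaction and intersecting with the candidate set; the hash tree avoids this full enumeration:

from itertools import combinations

candidates = {(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6), (2,3,4),
              (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)}
transaction = (1, 2, 3, 5, 6)

# Enumerate every 3-subset of the transaction and keep those that are candidates.
matched = [s for s in combinations(transaction, 3) if s in candidates]
print(matched)   # [(1, 2, 5), (1, 3, 6), (3, 5, 6)]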
Factors Affecting Complexity
Choice of minimum support threshold
lowering support threshold results in more frequent itemsets
this may increase number of candidates and max length of frequent itemsets
Dimensionality (number of items) of the data set
more space is needed to store support count of each item
if number of frequent items also increases, both computation and I/O costs may also increase
Size of database
since Apriori makes multiple passes, run time of algorithm may increase with number of transactions
Average transaction width
transaction width increases with denser data sets
This may increase the max length of frequent itemsets and the number of hash tree traversals (the number of subsets in a transaction increases with its width)
Compact Representation of Frequent Itemsets

Some itemsets are redundant because they have the same support as their supersets.
TID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
Number of frequent itemsets $= 3 \times \sum_{k=1}^{10} \binom{10}{k}$

Need a compact representation.
Maximal Frequent Itemset

An itemset is maximal frequent if none of its immediate supersets is frequent.

[Figure: itemset lattice over A, B, C, D, E showing the border between frequent and infrequent itemsets; the maximal frequent itemsets are the frequent itemsets immediately inside the border]
Closed Itemset
An itemset is closed if none of its immediate supersets has the same support as the itemset
TID  Items
1    {A,B}
2    {B,C,D}
3    {A,B,C,D}
4    {A,B,D}
5    {A,B,C,D}

Itemset  Support     Itemset    Support
{A}      4           {A,B,C}    2
{B}      5           {A,B,D}    3
{C}      3           {A,C,D}    2
{D}      4           {B,C,D}    3
{A,B}    4           {A,B,C,D}  2
{A,C}    2
{A,D}    3
{B,C}    3
{B,D}    4
{C,D}    3
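A direct way to identify the closed itemsets from this support table is to check every immediate superset (a small sketch over the example data):

support_table = {
    frozenset("A"): 4, frozenset("B"): 5, frozenset("C"): 3, frozenset("D"): 4,
    frozenset("AB"): 4, frozenset("AC"): 2, frozenset("AD"): 3,
    frozenset("BC"): 3, frozenset("BD"): 4, frozenset("CD"): 3,
    frozenset("ABC"): 2, frozenset("ABD"): 3, frozenset("ACD"): 2,
    frozenset("BCD"): 3, frozenset("ABCD"): 2,
}
items = set("ABCD")

def is_closed(itemset):
    # Closed: no immediate superset has the same support as the itemset.
    return all(support_table.get(itemset | {x}, 0) < support_table[itemset]
               for x in items - itemset)

closed = [sorted(s) for s in support_table if is_closed(s)]
print(closed)
# 6 closed itemsets here: {B}, {A,B}, {B,D}, {A,B,D}, {B,C,D}, {A,B,C,D}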
Maximal vs Closed Itemsets

[Figure: itemset lattice annotated with the TID-list of each itemset (transaction database: 1 ABC, 2 ABCD, 3 BCE, 4 ACDE, 5 DE); closed frequent itemsets and maximal frequent itemsets are marked]
# Closed = 9
# Maximal = 4
[Figure: traversal of the itemset lattice from the null set down to {a1, a2, ..., an}, with the frequent itemset border separating frequent from infrequent itemsets]
Representation of Database
– horizontal vs vertical data layout
Horizontal Data Layout              Vertical Data Layout

TID  Items                          Item  TID-list
1    A,B,E                          A     1, 4, 5, 6, 7, 8, 9
2    B,C,D                          B     1, 2, 5, 7, 8, 10
3    C,E                            C     2, 3, 4, 8, 9
4    A,C,D                          D     2, 4, 5, 9
5    A,B,C,D                        E     1, 3, 6
6    A,E
7    A,B
8    A,B,C
9    A,C,D
10   B
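A minimal sketch of deriving the vertical TID-list layout from the horizontal one (variable names are illustrative):

from collections import defaultdict

horizontal = {1: "ABE", 2: "BCD", 3: "CE", 4: "ACD", 5: "ABCD",
              6: "AE", 7: "AB", 8: "ABC", 9: "ACD", 10: "B"}

# Invert the layout: for each item, collect the TIDs of the transactions containing it.
vertical = defaultdict(list)
for tid, items in horizontal.items():
    for item in items:
        vertical[item].append(tid)

print(sorted(vertical["A"]))   # [1, 4, 5, 6, 7, 8, 9]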
FP-growth Algorithm

Use a compressed representation of the database in the form of an FP-tree.
Once an FP-tree has been constructed, use a recursive divide-and-conquer approach to mine the frequent itemsets.
FP-Tree Construction

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

After reading TID=1: null → A:1 → B:1
After reading TID=2: a second branch is added, null → B:1 → C:1 → D:1
After reading all ten transactions:

[Figure: the complete FP-tree — root children A:7 and B:3; under A:7 the branches B:5, C:1 and D:1; under A–B:5 the nodes C:3 and D:1; further D:1 nodes appear deeper in the tree; under B:3 a C:3 branch; pointers from the header table to nodes with the same item are omitted]
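A bare-bones sketch of FP-tree insertion (header table, node links, and the recursive mining step are omitted; items are assumed to already be in a fixed global order, A–E as in the slide):

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}   # item -> FPNode

def insert(root, transaction):
    # Walk down the tree, incrementing counts and creating nodes as needed.
    node = root
    for item in transaction:
        child = node.children.get(item)
        if child is None:
            child = node.children[item] = FPNode(item, node)
        child.count += 1
        node = child

root = FPNode(None)
for t in [["A","B"], ["B","C","D"], ["A","C","D","E"], ["A","D","E"], ["A","B","C"],
          ["A","B","C","D"], ["B","C"], ["A","B","C"], ["A","B","D"], ["B","C","E"]]:
    insert(root, t)

print(root.children["A"].count)   # 7, matching the A:7 node in the slide's tree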
Tree Projection
[Figure: set-enumeration tree over items A–E rooted at the null set; the possible extensions of node A are E(A) = {B, C, D, E}]
Projected Database

[Figure: projected database of the original database for each node of the set-enumeration tree]

ECLAT

For each item, store a list of transaction ids (a TID-list), i.e., the vertical data layout shown earlier.
Determine support of any k-itemset by intersecting tid-lists of two of its (k-1) subsets.
A:  1, 4, 5, 6, 7, 8, 9
B:  1, 2, 5, 7, 8, 10
AB: 1, 5, 7, 8
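In code, this support counting is just a set intersection (a minimal sketch using the TID-lists above):

tidlist = {"A": {1, 4, 5, 6, 7, 8, 9}, "B": {1, 2, 5, 7, 8, 10}}

# support({A,B}) = size of the intersection of the TID-lists of {A} and {B}
ab = tidlist["A"] & tidlist["B"]
print(sorted(ab), len(ab))   # [1, 5, 7, 8] 4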
3 traversal approaches:
– top-down, bottom-up and hybrid
Advantage: very fast support counting
Disadvantage: intermediate tid-lists may become too large for memory
Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L – f satisfies the minimum confidence requirement.
If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC→D, ABD→C, ACD→B, BCD→A, A→BCD, B→ACD, C→ABD, D→ABC, AB→CD, AC→BD, AD→BC, BC→AD, BD→AC, CD→AB
Rule Generation
How to efficiently generate rules from frequent itemsets?
In general, confidence does not have an anti-monotone property:
c(ABC → D) can be larger or smaller than c(AB → D)
But confidence of rules generated from the same itemset has an anti-monotone property
– e.g., L = {A,B,C,D}:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
Confidence is anti-monotone with respect to the number of items on the right-hand side of the rule.
Lattice of Rules

[Figure: lattice of rules generated from the frequent itemset {A,B,C,D}, from ABCD ⇒ {} at the top down to single-antecedent rules; if BCD ⇒ A is found to be a low-confidence rule, every rule whose antecedent is a subset of BCD (i.e., whose consequent contains A) is pruned]
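A naive sketch of rule generation for a single frequent itemset, reusing the confidence helper from the earlier sketch (this enumerates all binary partitions; the Apriori rule-generation step would instead grow consequents level by level and prune using the anti-monotone property above):

from itertools import combinations

def gen_rules(itemset, transactions, minconf):
    # For every non-empty proper subset rhs of the itemset, form the rule (L - rhs) -> rhs
    # and keep it if its confidence meets the threshold.
    rules = []
    for r in range(1, len(itemset)):
        for rhs in combinations(sorted(itemset), r):
            lhs = set(itemset) - set(rhs)
            c = confidence(lhs, set(rhs), transactions)
            if c >= minconf:
                rules.append((lhs, set(rhs), c))
    return rules

# On the market-basket data above with minconf = 0.6, this keeps 4 of the 6 rules for {Milk, Diaper, Beer}.
print(gen_rules({"Milk", "Diaper", "Beer"}, transactions, 0.6))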
How should the minsup threshold be set?
If minsup is set too high, we could miss itemsets involving interesting rare items.
If minsup is set too low, it is computationally expensive and the number of itemsets is very large.
Multiple Minimum Support

Item  MS(I)   Sup(I)
A     0.10%   0.25%
B     0.20%   0.26%
C     0.30%   0.29%
D     0.50%   0.05%
E     3%      4.20%

[Figure: itemset lattice over A–E annotated with these item-specific minimum supports]
Modifications to Apriori:
In traditional Apriori,
A candidate (k+1)-itemset is generated by merging two frequent itemsets of size k
The candidate is pruned if it contains any infrequent subsets of size k
Pruning step has to be modified:
Prune only if subset contains the first item
e.g.: Candidate = {Broccoli, Coke, Milk} (ordered according to minimum support)
{Broccoli, Coke} and {Broccoli, Milk} are frequent, but {Coke, Milk} is infrequent
– Candidate is not pruned because {Coke,Milk} does not contain the first item, i.e., Broccoli.
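A sketch of the modified pruning test (the candidate is assumed ordered by increasing minimum support, and is_frequent is a hypothetical lookup over the frequent k-itemsets):

from itertools import combinations

def keep_candidate(candidate, is_frequent):
    # candidate: tuple ordered by increasing minimum support, e.g. ("Broccoli", "Coke", "Milk")
    k = len(candidate) - 1
    for subset in combinations(candidate, k):
        # Prune only when the infrequent subset contains the first item.
        if candidate[0] in subset and not is_frequent(subset):
            return False
    return True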
Pattern Evaluation
Association rule algorithms tend to produce too many rules
many of them are uninteresting or redundant
Redundant if {A,B,C} → {D} and {A,B} → {D} have the same support & confidence
In the original formulation of association rules, support & confidence are the only measures used
Application of Interestingness Measure

[Figure: interestingness measures are applied in the post-processing stage of the knowledge-discovery pipeline — data selection (Selected Data), preprocessing (Preprocessed Data), mining, and evaluation of the extracted patterns]
Computing Interestingness Measure

Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.

Contingency table for X → Y:

       Y     ¬Y
X      f11   f10   f1+
¬X     f01   f00   f0+
       f+1   f+0   |T|

f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y
Drawback of Confidence

         Coffee   ¬Coffee
Tea      15       5         20
¬Tea     75       5         80
         90       10        100

Association rule: Tea → Coffee. Confidence = P(Coffee | Tea) = 15/20 = 0.75; although the confidence is high, the rule is misleading, because P(Coffee) = 0.9 and P(Coffee | ¬Tea) = 75/80 = 0.9375.
Statistical Independence
Population of 1000 students
600 students know how to swim (S)
700 students know how to bike (B)
420 students know how to swim and bike (S,B)
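Checking independence with these numbers: P(S,B) = 420/1000 = 0.42 and P(S) × P(B) = 0.6 × 0.7 = 0.42, so P(S,B) = P(S) × P(B) and swimming and biking are statistically independent in this population.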
Statistical-based Measures

Measures that take into account statistical dependence:

$$ \text{Lift} = \frac{P(Y \mid X)}{P(Y)} \qquad
   \text{Interest} = \frac{P(X,Y)}{P(X)\,P(Y)} \qquad
   PS = P(X,Y) - P(X)\,P(Y) $$

$$ \phi\text{-coefficient} = \frac{P(X,Y) - P(X)\,P(Y)}{\sqrt{P(X)\,[1 - P(X)]\,P(Y)\,[1 - P(Y)]}} $$
Example: Lift/Interest
         Coffee   ¬Coffee
Tea      15       5         20
¬Tea     75       5         80
         90       10        100

Association rule: Tea → Coffee. Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9, so Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated).
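A small helper (hypothetical function name) for computing these measures directly from the 2×2 counts:

def measures(f11, f10, f01, f00):
    # f11..f00 are the four cells of the contingency table for X -> Y.
    n = f11 + f10 + f01 + f00
    px, py, pxy = (f11 + f10) / n, (f11 + f01) / n, f11 / n
    lift = pxy / (px * py)
    ps = pxy - px * py
    phi = ps / (px * (1 - px) * py * (1 - py)) ** 0.5
    return {"lift": lift, "PS": ps, "phi": phi}

# Tea -> Coffee table above: lift = 0.15 / (0.2 * 0.9) ~= 0.83 < 1,
# confirming the negative association despite the 0.75 confidence.
print(measures(15, 5, 75, 5))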
Drawback of Lift & Interest

       Y    ¬Y                     Y    ¬Y
X      10   0    10         X     90   0    90
¬X     0    90   90         ¬X    0    10   10
       10   90   100              90   10   100

Lift = 0.1 / (0.1 × 0.1) = 10          Lift = 0.9 / (0.9 × 0.9) = 1.11
Statistical independence:
If P(X,Y)=P(X)P(Y) => Lift = 1
There are lots of measures proposed in the literature. Some measures are good for certain applications, but not for others.
What about Apriori-style support-based pruning? How does it affect these measures?
Properties of A Good Measure
Piatetsky-Shapiro:
3 properties a good measure M must satisfy:
– M(A,B) = 0 if A and B are statistically independent
– M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain unchanged
– M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged
Property under Variable Permutation

       B    ¬B                 A    ¬A
A      p    q           B     p    r
¬A     r    s           ¬B    q    s

Does M(A,B) = M(B,A)?

Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc.
Asymmetric measures: confidence, conviction, Laplace, J-measure, etc.
Property under Row/Column Scaling

[Figure: grade–gender example — the second contingency table is obtained from the first by scaling one column by 2x and the other by 10x]

Mosteller: the underlying association should be independent of the relative number of male and female students in the sample.
Property under Inversion Operation

[Figure: 0/1 vectors for items A–F across N transactions; some vectors are inversions of others (0s and 1s swapped), used to illustrate how the measures behave when presence and absence are exchanged]
Example: φ-Coefficient

φ-coefficient is analogous to the correlation coefficient for continuous variables.

       Y    ¬Y                     Y    ¬Y
X      60   10   70         X     20   10   30
¬X     10   20   30         ¬X    10   60   70
       70   30   100              30   70   100

φ = (0.6 − 0.7 × 0.7) / √(0.7 × 0.3 × 0.7 × 0.3) = 0.5238      φ = (0.2 − 0.3 × 0.3) / √(0.7 × 0.3 × 0.7 × 0.3) = 0.5238

The φ-coefficient is the same for both tables, even though co-presence dominates in the first and co-absence in the second.
Property under Null Addition

       B    ¬B                 B    ¬B
A      p    q           A     p    q
¬A     r    s           ¬A    r    s + k

Invariant measures: support, cosine, Jaccard, etc.
Non-invariant measures: correlation, Gini, mutual information, odds ratio, etc.
Effect of Support-based Pruning

Most of the association rule mining algorithms use the support measure to prune rules and itemsets.

[Figure: histograms of the correlation values of item pairs, for all item pairs and for the pairs remaining under increasingly tight support thresholds]
Steps:
Generate 10000 contingency tables
Rank each table according to the different measures
Compute the pair-wise correlation between the measures (a simulation sketch follows below)
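A rough sketch of such an experiment (hypothetical helper names; only two measures, lift and the φ-coefficient, are compared, and the rank correlation is computed by hand to avoid external dependencies):

import random

def lift(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return (f11 / n) / (((f11 + f10) / n) * ((f11 + f01) / n))

def phi(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    px, py, pxy = (f11 + f10) / n, (f11 + f01) / n, f11 / n
    return (pxy - px * py) / (px * (1 - px) * py * (1 - py)) ** 0.5

def ranks(values):
    # Rank position of each value (ties broken arbitrarily), for a Spearman-style comparison.
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0] * len(values)
    for pos, idx in enumerate(order):
        r[idx] = pos
    return r

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(0)
tables = [[random.randint(1, 250) for _ in range(4)] for _ in range(10000)]
lifts, phis = [lift(*t) for t in tables], [phi(*t) for t in tables]
print(pearson(ranks(lifts), ranks(phis)))   # rank correlation between lift and the phi-coefficient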
[Figure: without support pruning (all pairs) — grouping of the measures by pairwise rank correlation (Yule Q, Yule Y, Kappa, Klosgen, Confidence, Laplace, IS, Jaccard, Support, Lambda, Gini, J-measure, Mutual Info, ...), and a scatter plot of correlation vs. the Jaccard measure]
[Figure: the same comparison after support-based pruning — measure groupings and the scatter plot between correlation and the Jaccard measure]
Effect of Support-based Pruning
0.5% ≤ support ≤ 30%, i.e., 0.005 ≤ support ≤ 0.300 (76.42% of the contingency tables)

[Figure: measure rankings under this support-based pruning (Support, Interest, Reliability, Conviction, Yule Q, Odds ratio, Confidence, CF, Yule Y, Kappa, Jaccard, Lambda, Mutual Info, Gini, J-measure, ...), and the scatter plot between correlation and the Jaccard measure]
[Figure: expected (+/–) vs. unexpected (–/+) patterns]
Need to combine expectation of users with evidence from data (i.e., extracted patterns)
$P(X_1, X_2, \ldots, X_k)$
– Use Dempster-Shafer theory to combine domain knowledge and evidence from data