
Data Mining

Chapter 5
Association Analysis: Basic Concepts

Introduction to Data Mining, 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar



Association Rule Mining

 Given a set of transactions, find rules that will predict the
  occurrence of an item based on the occurrences of other items in
  the transaction

Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Tea, Eggs
3    Milk, Diaper, Tea, Coke
4    Bread, Milk, Diaper, Tea
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
  {Diaper} → {Tea},
  {Milk, Bread} → {Eggs, Coke},
  {Tea, Bread} → {Milk}

Implication means co-occurrence, not causality!



Definition: Frequent Itemset
 Itemset
  – A collection of one or more items
     Example: {Milk, Bread, Diaper}
  – k-itemset
     An itemset that contains k items

 Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2

 Support
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5

 Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Tea, Eggs
3    Milk, Diaper, Tea, Coke
4    Bread, Milk, Diaper, Tea
5    Bread, Milk, Diaper, Coke



Definition: Association Rule
 Association Rule
  – An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Tea}

 Rule Evaluation Metrics
  – Support (s)
     Fraction of transactions that contain both X and Y
  – Confidence (c)
     Measures how often items in Y appear in transactions that contain X

TID  Items
1    Bread, Milk
2    Bread, Diaper, Tea, Eggs
3    Milk, Diaper, Tea, Coke
4    Bread, Milk, Diaper, Tea
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Tea}

  s = σ(Milk, Diaper, Tea) / |T| = 2/5 = 0.4

  c = σ(Milk, Diaper, Tea) / σ(Milk, Diaper) = 2/3 = 0.67
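As a concrete check of these two metrics, here is a minimal Python sketch (added for illustration, not part of the original slides) that computes s and c for the rule {Milk, Diaper} → {Tea} over the five transactions above; the helper name support_count is just illustrative.

# Support and confidence of {Milk, Diaper} -> {Tea} over the market-basket data
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Tea", "Eggs"},
    {"Milk", "Diaper", "Tea", "Coke"},
    {"Bread", "Milk", "Diaper", "Tea"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Tea"}
s = support_count(X | Y, transactions) / len(transactions)               # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions)  # 2/3 ≈ 0.67
print(f"s = {s:.2f}, c = {c:.2f}")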
Association Rule Mining Task

 Given a set of transactions T, the goal of association rule mining
  is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

 Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf
thresholds
 Computationally prohibitive!



Computational Complexity
 Given d unique items:
  – Total number of itemsets = 2^d
  – Total number of possible association rules:

    R = \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j}
      = 3^d - 2^{d+1} + 1

    If d = 6, R = 602 rules
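The closed form can be sanity-checked against the double sum directly; this small Python sketch (an added illustration, not slide material) prints 602 twice for d = 6.

from math import comb

def total_rules(d):
    # Sum over antecedent size k and consequent size j drawn from the remaining items
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

d = 6
print(total_rules(d), 3**d - 2**(d + 1) + 1)   # both print 602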



Mining Association Rules

TID  Items
1    Bread, Milk
2    Bread, Diaper, Tea, Eggs
3    Milk, Diaper, Tea, Coke
4    Bread, Milk, Diaper, Tea
5    Bread, Milk, Diaper, Coke

Example of Rules:
  {Milk, Diaper} → {Tea}   (s=0.4, c=0.67)
  {Milk, Tea} → {Diaper}   (s=0.4, c=1.0)
  {Diaper, Tea} → {Milk}   (s=0.4, c=0.67)
  {Tea} → {Milk, Diaper}   (s=0.4, c=0.67)
  {Diaper} → {Milk, Tea}   (s=0.4, c=0.5)
  {Milk} → {Diaper, Tea}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Tea}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements



Mining Association Rules

 Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup

2. Rule Generation
– Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset

 Frequent itemset generation is still computationally expensive



Frequent Itemset Generation
[Figure: the itemset lattice over five items A–E, from the null set at the top,
through all 1-, 2-, 3-, and 4-itemsets, down to ABCDE at the bottom]

Given d items, there are 2^d possible candidate itemsets
Frequent Itemset Generation
 Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database

    TID  Items
    1    Bread, Milk
    2    Bread, Diaper, Tea, Eggs
    3    Milk, Diaper, Tea, Coke
    4    Bread, Milk, Diaper, Tea
    5    Bread, Milk, Diaper, Coke

    (N transactions of average width w are matched against a list of M candidates)

  – Match each transaction against every candidate
  – Complexity ~ O(NMw) ⇒ expensive since M = 2^d !!!
Frequent Itemset Generation Strategies
 Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M
 Reduce the number of transactions (N)
– Reduce size of N as the size of itemset increases
– Used by DHP and vertical-based mining algorithms
 Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or
transactions
– No need to match every candidate against every
transaction



Reducing Number of Candidates

 Apriori principle:
  – If an itemset is frequent, then all of its subsets must also be frequent

 The Apriori principle holds due to the following property of the
  support measure:

      ∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

  – Support of an itemset never exceeds the support of its subsets
  – This is known as the anti-monotone property of support
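A small illustration of the anti-monotone property on the transactions from the earlier slides (a sketch added for clarity, not from the slides): each time an item is added to an itemset, its support can only stay the same or drop.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Tea", "Eggs"},
    {"Milk", "Diaper", "Tea", "Coke"},
    {"Bread", "Milk", "Diaper", "Tea"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

print(support({"Milk"}))                    # 0.8
print(support({"Milk", "Diaper"}))          # 0.6
print(support({"Milk", "Diaper", "Tea"}))   # 0.4 -- never increases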



Illustrating Apriori Principle

[Figure: the same itemset lattice over A–E. One itemset is found to be
infrequent, so all of its supersets are pruned from the search space]
Illustrating Apriori Principle

TID  Items
1    Bread, Milk
2    Tea, Bread, Diaper, Eggs
3    Tea, Coke, Diaper, Milk
4    Tea, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Tea     3
Diaper  4
Eggs    1

Minimum Support = 3





Illustrating Apriori Principle

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Tea     3
Diaper  4
Eggs    1

Pairs (2-itemsets):
(No need to generate candidates involving Coke or Eggs)
Itemset           Count
{Bread, Milk}     3
{Bread, Tea}      2
{Bread, Diaper}   3
{Milk, Tea}       2
{Milk, Diaper}    3
{Tea, Diaper}     3

Triplets (3-itemsets):
Itemset                 Count
{Bread, Diaper, Milk}   2

Minimum Support = 3

If every subset is considered:
  6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates
With support-based pruning:
  6 + 6 + 1 = 13 candidates



Support Counting of Candidate Itemsets

 Scan the database of transactions to determine the support of each
  candidate itemset
  – Must match every candidate itemset against every transaction,
    which is an expensive operation

TID  Items
1    Bread, Milk
2    Tea, Bread, Diaper, Eggs
3    Tea, Coke, Diaper, Milk
4    Tea, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

Candidate itemsets:
  {Tea, Diaper, Milk}
  {Tea, Bread, Diaper}
  {Bread, Diaper, Milk}
  {Tea, Bread, Milk}



Apriori Algorithm

– Fk: frequent k-itemsets
– Lk: candidate k-itemsets
 Algorithm
– Let k=1
– Generate F1 = {frequent 1-itemsets}
– Repeat until Fk is empty
 Candidate Generation: Generate Lk+1 from Fk
 Candidate Pruning: Prune candidate itemsets in Lk+1
containing subsets of length k that are infrequent
 Support Counting: Count the support of each candidate in
Lk+1 by scanning the DB
 Candidate Elimination: Eliminate candidates in Lk+1 that are
infrequent, leaving only those that are frequent ⇒ Fk+1
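The loop above can be sketched in a few lines of Python. This is an illustrative implementation rather than the textbook's code: itemsets are represented as sorted tuples, and the four-transaction database from the next slide is used as the example.

from itertools import combinations

def apriori(transactions, minsup_count):
    # F1: frequent 1-itemsets (support counted over one DB scan)
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    Fk = {i for i, c in counts.items() if c >= minsup_count}
    frequent = {i: counts[i] for i in Fk}

    k = 1
    while Fk:
        # Candidate generation: F(k-1) x F(k-1) merge on a shared (k-1)-item prefix
        candidates = set()
        for a in Fk:
            for b in Fk:
                if a < b and a[:-1] == b[:-1]:
                    candidates.add(a + (b[-1],))
        # Candidate pruning: drop candidates with an infrequent k-item subset
        candidates = {c for c in candidates
                      if all(s in Fk for s in combinations(c, k))}
        # Support counting: one pass over the database per level
        cand_counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            ts = set(t)
            for c in candidates:
                if set(c) <= ts:
                    cand_counts[c] += 1
        # Candidate elimination: keep only the frequent candidates => F(k+1)
        Fk = {c for c, n in cand_counts.items() if n >= minsup_count}
        frequent.update((c, cand_counts[c]) for c in Fk)
        k += 1
    return frequent

# The four-transaction database from the next slide, minimum support count = 2
db = [["A", "C", "D"], ["B", "C", "E"], ["A", "B", "C", "E"], ["B", "E"]]
for itemset, n in sorted(apriori(db, 2).items()):
    print(itemset, n)          # includes ('B', 'C', 'E') 2 among the frequent itemsets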
The Apriori Algorithm—An Example

Supmin (minimum support count) = 2

Database:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan – candidate 1-itemsets L1 and frequent 1-itemsets F1:
  L1: {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
  F1: {A}: 2, {B}: 3, {C}: 3, {E}: 3

2nd scan – candidate 2-itemsets L2 and frequent 2-itemsets F2:
  L2: {A,B}: 1, {A,C}: 2, {A,E}: 1, {B,C}: 2, {B,E}: 3, {C,E}: 2
  F2: {A,C}: 2, {B,C}: 2, {B,E}: 3, {C,E}: 2

3rd scan – candidate 3-itemset L3 and frequent 3-itemset F3:
  L3: {B,C,E}
  F3: {B,C,E}: 2


Candidate Generation: Fk-1 x Fk-1 Method

 Merge two frequent (k-1)-itemsets if their first (k-2) items are identical

 F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE}
– Merge(ABC, ABD) = ABCD
– Merge(ABC, ABE) = ABCE
– Merge(ABD, ABE) = ABDE

– Do not merge (ABD, ACD) because they share only a prefix of
  length 1 instead of length 2
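The merge condition can be checked mechanically on the F3 above; the short sketch below (an added illustration, assuming the items within each itemset are kept in lexicographic order) reproduces exactly the three merges listed.

F3 = ["ABC", "ABD", "ABE", "ACD", "BCD", "BDE", "CDE"]

candidates = []
for i, a in enumerate(F3):
    for b in F3[i + 1:]:
        if a[:-1] == b[:-1]:               # identical (k-2)-item prefix, e.g. "AB"
            candidates.append(a + b[-1])   # merge on the two differing last items

print(candidates)   # ['ABCD', 'ABCE', 'ABDE']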



Candidate Pruning

 Let F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE} be
the set of frequent 3-itemsets

 L4 = {ABCD,ABCE,ABDE} is the set of candidate 4-itemsets generated
  (from the previous slide)

 Candidate pruning
– Prune ABCE because ACE and BCE are infrequent
– Prune ABDE because ADE is infrequent

 After candidate pruning: L4 = {ABCD}
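The pruning step is just as mechanical: a candidate 4-itemset survives only if every one of its 3-item subsets is in F3. The sketch below (an added illustration, not slide code) reproduces the result L4 = {ABCD}.

from itertools import combinations

F3 = {"ABC", "ABD", "ABE", "ACD", "BCD", "BDE", "CDE"}
L4 = ["ABCD", "ABCE", "ABDE"]

survivors = [c for c in L4
             if all("".join(s) in F3 for s in combinations(c, 3))]
print(survivors)   # ['ABCD'] -- ABCE fails on ACE and BCE, ABDE fails on ADE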


Rule Generation

 Given a frequent itemset L, find all non-empty subsets f ⊂ L such
  that f → L – f satisfies the minimum confidence requirement
  – If {A,B,C,D} is a frequent itemset, candidate rules:
      ABC → D,  ABD → C,  ACD → B,  BCD → A,
      A → BCD,  B → ACD,  C → ABD,  D → ABC,
      AB → CD,  AC → BD,  AD → BC,  BC → AD,
      BD → AC,  CD → AB

 If |L| = k, then there are 2^k – 2 candidate association rules
  (ignoring L → ∅ and ∅ → L)
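A small sketch of this enumeration (added for illustration, reusing the transactions from the earlier slides): it lists every rule f → L – f from the frequent itemset {Milk, Diaper, Tea} whose confidence meets a minconf of 0.6. The helper name sigma is illustrative.

from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Tea", "Eggs"},
    {"Milk", "Diaper", "Tea", "Coke"},
    {"Bread", "Milk", "Diaper", "Tea"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

L = {"Milk", "Diaper", "Tea"}
minconf = 0.6
for r in range(1, len(L)):                   # all non-empty, proper subsets f of L
    for f in map(set, combinations(L, r)):
        conf = sigma(L) / sigma(f)           # c(f -> L - f)
        if conf >= minconf:
            print(sorted(f), "->", sorted(L - f), round(conf, 2))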



Rule Generation

 In general, confidence does not have an anti-monotone property
  – c(ABC → D) can be larger or smaller than c(AB → D)

 But the confidence of rules generated from the same itemset does have
  an anti-monotone property
  – E.g., suppose {A,B,C,D} is a frequent 4-itemset:

      c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

  – Confidence is anti-monotone w.r.t. the number of items on the RHS
    of the rule
Rule Generation for Apriori Algorithm

Lattice of rules

[Figure: the lattice of rules generated from the frequent itemset {A,B,C,D},
from ABCD ⇒ { } at the top down to A ⇒ BCD, B ⇒ ACD, C ⇒ ABD, D ⇒ ABC at the
bottom. A rule found to have low confidence has all of the rules below it in
the lattice – those that move further items from its antecedent to its
consequent – pruned]


Factors Affecting Complexity of Apriori
 Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of frequent itemsets
 Dimensionality (number of items) of the data set
– more space is needed to store support count of each item
– if number of frequent items also increases, both computation and I/O costs may
also increase
 Size of database
– since Apriori makes multiple passes, run time of algorithm may increase with
number of transactions
 Average transaction width
– transaction width increases with denser data sets
– This may increase max length of frequent itemset



Construct FP-tree from a Transaction Database

min_support = 3

TID   Items bought                (ordered) frequent items
100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
300   {b, f, h, j, o, w}          {f, b}
400   {b, c, k, s, p}             {c, b, p}
500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

1. Scan the DB once and find the frequent 1-itemsets (single-item patterns):
   f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
2. Sort the frequent items in frequency-descending order to obtain the f-list:
   F-list = f-c-a-b-m-p
3. Scan the DB again and construct the FP-tree by inserting each ordered
   transaction as a path from the root

[Figure: header table (item, frequency, head of node-links) next to the FP-tree
rooted at {}, with paths f:4–c:3–a:3–m:2–p:2, f:4–c:3–a:3–b:1–m:1, f:4–b:1,
and c:1–b:1–p:1]
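The three construction steps can be sketched compactly in Python. This is an illustration of the procedure described above rather than the original FP-growth code; the Node class and the build_fp_tree name are assumptions for the example, and the tie order among equally frequent items may differ from the slide's f-list.

from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent     # parent kept for prefix-path walks
        self.count, self.children = 0, {}

def build_fp_tree(transactions, min_support):
    # Pass 1: frequent single items, sorted by descending frequency (the f-list)
    freq = Counter(item for t in transactions for item in t)
    flist = [i for i, c in freq.most_common() if c >= min_support]
    order = {item: rank for rank, item in enumerate(flist)}

    # Pass 2: keep only frequent items, re-order each transaction by the f-list,
    # and insert it into the trie, incrementing counts along its path
    root = Node(None, None)
    for t in transactions:
        path = sorted((i for i in t if i in order), key=order.get)
        node = root
        for item in path:
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
    return root, flist

db = [list("facdgimp"), list("abcflmo"), list("bfhjow"), list("bcksp"), list("afcelpmn")]
root, flist = build_fp_tree(db, 3)
print(flist)                       # frequent items by count; ties may differ from f-c-a-b-m-p
print(root.children["f"].count)    # 4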
Partition Patterns and Databases

 Frequent patterns can be partitioned into subsets according to the f-list
 F-list = f-c-a-b-m-p

 Patterns containing p

 Patterns having m but no p

 …

 Patterns having c but none of a, b, m, or p

 Pattern f

Find Patterns Having P From P-conditional Database

 Starting at the frequent-item header table of the FP-tree
 Traverse the FP-tree by following the node-links of each frequent item p
 Accumulate all of the transformed prefix paths of item p to form p's
  conditional pattern base

Conditional pattern bases:
item   conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1

[Figure: the FP-tree and header table from the previous slide, annotated with
the node-links used to collect each item's prefix paths]
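The conditional pattern bases in the table can also be derived directly from the ordered transactions, without walking the tree: for each occurrence of an item, record the prefix of frequent items that precedes it. The sketch below (an added illustration, not slide code) reproduces the table above.

from collections import defaultdict

flist = ["f", "c", "a", "b", "m", "p"]
ordered = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]

cond_base = defaultdict(list)
for t in ordered:
    for i, item in enumerate(t):
        if t[:i]:                               # skip empty prefixes
            cond_base[item].append("".join(t[:i]))

for item in flist:
    print(item, cond_base.get(item, []))
# e.g. p ['fcam', 'cb', 'fcam']  -> summarized on the slide as fcam:2, cb:1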
