
Frequent Pattern (FP) Growth
(FP-Tree)
Challenges of Frequent Pattern Mining

- Challenges
  - Multiple scans of the transaction database
  - Huge number of candidates
  - Tedious workload of support counting for candidates
- Improving Apriori: general ideas
  - Reduce the number of passes over the transaction database
  - Shrink the number of candidates
  - Facilitate support counting of candidates
Bottleneck of Frequent-Pattern Mining

- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1i2…i100, the number of scans is 100
- Bottleneck: candidate generation and test
- Can we avoid candidate generation?
Methods to Improve Apriori’s Efficiency

- Transaction reduction
  - A transaction that does not contain any frequent k-itemset is useless in subsequent scans
- Partitioning
  - Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB

Methods to Improve Apriori’s Efficiency

- Sampling
  - Mine on a subset of the given data with a lower support threshold, plus a method to determine completeness
Mining Frequent Patterns Without Candidate Generation

- Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
  - Highly condensed, but complete for frequent pattern mining
  - Avoids costly repeated database scans
- Develop an efficient FP-tree-based frequent pattern mining method
  - A divide-and-conquer methodology: decompose mining tasks into smaller ones
  - Avoid candidate generation: sub-database test only
Mining Frequent Patterns Without Candidate Generation

- Grow long patterns from short ones using locally frequent items
  - Suppose “abc” is a frequent pattern
  - Get all transactions containing “abc”: DB|abc
  - If “d” is a locally frequent item in DB|abc, then “abcd” is a frequent pattern
Steps

1) Find the support count of each item
2) Order the frequent items in descending order of support (keep only items whose support >= minimum support)
3) Build the FP-tree
4) Find frequent patterns from the FP-tree
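Steps 1 and 2 amount to one counting scan plus a reordering pass. A minimal Python sketch, our own illustration using the example database from the next slide (ties among equal-support items may break differently from the slide’s f-list):

```python
from collections import Counter

# Transaction database from the example below; minimum support = 3.
transactions = [
    ['f', 'a', 'c', 'd', 'g', 'i', 'm', 'p'],
    ['a', 'b', 'c', 'f', 'l', 'm', 'o'],
    ['b', 'f', 'h', 'j', 'o', 'w'],
    ['b', 'c', 'k', 's', 'p'],
    ['a', 'f', 'c', 'e', 'l', 'p', 'm', 'n'],
]
MIN_SUPPORT = 3

# Step 1: one scan to count each item's support.
counts = Counter(item for t in transactions for item in t)

# Step 2: keep items meeting minimum support, in descending frequency.
f_list = [item for item, c in counts.most_common() if c >= MIN_SUPPORT]
rank = {item: i for i, item in enumerate(f_list)}

# Rewrite each transaction: drop infrequent items, sort by f-list order.
ordered = [sorted((i for i in t if i in rank), key=rank.get)
           for t in transactions]

print(f_list)    # ['f', 'c', 'a', 'm', 'p', 'b'] here; the slide breaks ties as f-c-a-b-m-p
print(ordered)   # first row: ['f', 'c', 'a', 'm', 'p']
```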
Example: FP-Growth

TID  Items bought
100  {f, a, c, d, g, i, m, p}
200  {a, b, c, f, l, m, o}
300  {b, f, h, j, o, w}
400  {b, c, k, s, p}
500  {a, f, c, e, l, p, m, n}

Minimum support = 3
Construct FP-tree from a Transaction Database

TID  Items bought               (Ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o, w}         {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

min_support = 3

1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency descending order: the f-list
3. Scan DB again, construct the FP-tree

F-list = f-c-a-b-m-p

Header table: f:4, c:4, a:3, b:3, m:3, p:3 (each entry heads a node-link chain into the tree)

Resulting FP-tree (counts after all five insertions):

{}
├── f:4
│   ├── c:3
│   │   └── a:3
│   │       ├── m:2
│   │       │   └── p:2
│   │       └── b:1
│   │           └── m:1
│   └── b:1
└── c:1
    └── b:1
        └── p:1
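A sketch of step 3 under the same assumptions: a small FPNode class plus a builder that inserts each ordered transaction along a shared prefix path (illustrative, not the original authors’ code):

```python
# FPNode: item, running count, parent link, children keyed by item.
class FPNode:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fp_tree(ordered_transactions):
    """Insert each ordered transaction along a shared prefix path."""
    root = FPNode(None, None)
    header = {}                     # item -> list of that item's nodes
    for trans in ordered_transactions:
        node = root
        for item in trans:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1        # shared prefixes accumulate counts
            node = child
    return root, header

# Ordered transactions following the slide's f-list f-c-a-b-m-p:
ordered = [['f', 'c', 'a', 'm', 'p'], ['f', 'c', 'a', 'b', 'm'],
           ['f', 'b'], ['c', 'b', 'p'], ['f', 'c', 'a', 'm', 'p']]
root, header = build_fp_tree(ordered)

# Per-item supports recovered from the node links:
print({i: sum(n.count for n in nodes) for i, nodes in header.items()})
# -> {'f': 4, 'c': 4, 'a': 3, 'm': 3, 'p': 3, 'b': 3}
```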
Benefits of the FP-tree Structure

- Reduces irrelevant information: infrequent items are gone
- Items are in frequency descending order: the more frequently an item occurs, the more likely its node is shared
Partition Patterns and Databases

- Frequent patterns can be partitioned into subsets according to the f-list
  - F-list = f-c-a-b-m-p
  - Patterns containing p
  - Patterns containing m but not p
  - …
  - Patterns containing c but none of a, b, m, p
  - Pattern f
Find Patterns Having p from the p-conditional Database

- Start at the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node links of each frequent item p
- Accumulate all transformed prefix paths of item p to form p’s conditional pattern base

Conditional pattern bases (from the FP-tree above):

Item  Conditional pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1
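Continuing the sketch above (it reuses FPNode, build_fp_tree, root, and header), conditional pattern bases fall out of the header table’s node links: for each occurrence of an item, walk the parent pointers up to the root and record the prefix path with that occurrence’s count:

```python
def prefix_path(node):
    """Items on the path from node's parent up to (excluding) the root."""
    path = []
    node = node.parent
    while node is not None and node.item is not None:
        path.append(node.item)
        node = node.parent
    path.reverse()
    return path

def conditional_pattern_base(item, header):
    # Each occurrence of `item` contributes its prefix path,
    # weighted by that occurrence's count.
    return [(prefix_path(n), n.count) for n in header[item]]

print(conditional_pattern_base('m', header))
# -> [(['f', 'c', 'a'], 2), (['f', 'c', 'a', 'b'], 1)], i.e. fca:2, fcab:1
```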


From Conditional Pattern Bases to Conditional FP-trees

- For each pattern base:
  - Accumulate the count for each item in the base
  - Construct the FP-tree for the frequent items of the pattern base

m’s conditional pattern base: fca:2, fcab:1

m-conditional FP-tree (b is dropped, since its local count 1 < min_support):

{}
└── f:3
    └── c:3
        └── a:3

All frequent patterns involving m:
m, fm, cm, am, fcm, fam, cam, fcam
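Because the m-conditional FP-tree is a single path, mining it reduces to enumerating subsets of that path; a small sketch of the single-path case (variable names are ours):

```python
from itertools import combinations

# The single path of the m-conditional FP-tree, with node counts:
path = [('f', 3), ('c', 3), ('a', 3)]

patterns = {('m',): 3}              # m itself has support 3
for r in range(1, len(path) + 1):
    for combo in combinations(path, r):
        items = tuple(i for i, _ in combo) + ('m',)
        # Support of a sub-path combination is the minimum count on it.
        patterns[items] = min(c for _, c in combo)

print(patterns)
# 8 patterns: m, fm, cm, am, fcm, fam, cam, fcam, each with support 3
```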
Recursion: Mining Each Conditional FP-tree

Mining the m-conditional FP-tree ({} → f:3 → c:3 → a:3) recurses on each of its items:

- Conditional pattern base of “am”: (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
- Conditional pattern base of “cm”: (f:3) → cm-conditional FP-tree: {} → f:3
- Conditional pattern base of “cam”: (f:3) → cam-conditional FP-tree: {} → f:3
Mining Frequent Patterns With FP-trees

- Idea: frequent pattern growth
  - Recursively grow frequent patterns by pattern and database partitioning
- Method (sketched below)
  - For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
  - Repeat the process on each newly created conditional FP-tree
  - Until the resulting FP-tree is empty, or it contains only one path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern
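Putting the pieces together, a compact recursive sketch of the whole method, reusing build_fp_tree and prefix_path from the earlier snippets (a teaching aid under our assumptions, not the paper’s optimized implementation):

```python
from collections import Counter

def fp_growth(db, min_support, suffix=()):
    """Mine all frequent patterns from a list of ordered transactions."""
    root, header = build_fp_tree(db)
    patterns = {}
    for item, nodes in header.items():
        support = sum(n.count for n in nodes)
        if support < min_support:
            continue
        pattern = (item,) + suffix
        patterns[pattern] = support
        # Conditional pattern base -> keep locally frequent items only.
        base = [(prefix_path(n), n.count) for n in nodes]
        local = Counter()
        for path, count in base:
            for i in path:
                local[i] += count
        keep = {i for i, c in local.items() if c >= min_support}
        # Expand each weighted path into `count` copies to reuse the builder.
        cond_db = []
        for path, count in base:
            cond_db.extend([[i for i in path if i in keep]] * count)
        if cond_db:
            patterns.update(fp_growth(cond_db, min_support, pattern))
    return patterns

result = fp_growth(ordered, 3)
print(len(result), result[('c', 'p')])   # 18 patterns; support of cp is 3
```

Each recursive call mines a strictly smaller conditional database, which is exactly the divide-and-conquer behavior the next slide credits for FP-Growth’s speed.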
Why Is FP-Growth the Winner?

- Divide-and-conquer:
  - Decomposes both the mining task and the DB according to the frequent patterns obtained so far
  - Leads to focused search of smaller databases
- Other factors
  - No candidate generation, no candidate test
  - Compressed database: the FP-tree structure
  - No repeated scan of the entire database
  - Basic operations are counting local frequent items and building sub-FP-trees; no pattern search and matching
From Association Mining to Correlation Analysis

Interestingness Measurements

- Objective measures: two popular measurements
  - Support
  - Confidence
- Subjective measures: a rule (pattern) is interesting if
  - it is unexpected (surprising to the user), and/or
  - it is actionable (the user can do something with it)
Criticism of Support and Confidence

Example:

- Among 5000 students:
  - 3000 play basketball
  - 3750 eat cereal
  - 2000 both play basketball and eat cereal
- play basketball → eat cereal [support 40%, confidence 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%
- play basketball → not eat cereal [support 20%, confidence 33.3%] is far more accurate, although it has lower support and confidence

            basketball  not basketball  sum(row)
cereal      2000        1750            3750
not cereal  1000        250             1250
sum(col.)   3000        2000            5000
Other Interestingness Measures: Interest

- Interest (correlation, lift):

    interest(A, B) = P(A ∪ B) / (P(A) P(B))

- Takes both P(A) and P(B) into consideration
- A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated

X  1 1 1 1 0 0 0 0
Y  1 1 0 0 0 0 0 0
Z  0 1 1 1 1 1 1 1

Itemset  Support  Interest
X,Y      25%      2
X,Z      37.50%   0.9
Y,Z      12.50%   0.57
Criticism of Support and Confidence

Example:

- X and Y: positively correlated
- X and Z: negatively correlated
- We need a measure of dependent or correlated events:

    corr(A, B) = P(A ∪ B) / (P(A) P(B))

X  1 1 1 1 0 0 0 0
Y  1 1 0 0 0 0 0 0
Z  0 1 1 1 1 1 1 1

Itemset  Support  Interest
X,Y      25%      2
X,Z      37.50%   0.9
Y,Z      12.50%   0.57

Rule   Support  Confidence
X=>Y   25%      50%
X=>Z   37.50%   75%
Interestingness Measure: Correlations (Lift)

- play basketball → eat cereal [40%, 66.7%] is misleading
  - The overall % of students eating cereal is 75% > 66.7%
- play basketball → not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
- Measure of dependent/correlated events: lift

    lift(A, B) = P(A ∪ B) / (P(A) P(B))

            Basketball  Not basketball  Sum (row)
Cereal      2000        1750            3750
Not cereal  1000        250             1250
Sum (col.)  3000        2000            5000

lift(Basketball, Cereal) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(Basketball, Not cereal) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
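A quick arithmetic check of the two lift values above (plain Python; the variable names are ours):

```python
N = 5000
basketball, cereal, both = 3000, 3750, 2000

lift_cereal = (both / N) / ((basketball / N) * (cereal / N))
lift_not_cereal = ((basketball - both) / N) / ((basketball / N) * ((N - cereal) / N))

print(round(lift_cereal, 2), round(lift_not_cereal, 2))   # 0.89 1.33
```

Lift below 1 (0.89) confirms basketball and cereal are negatively correlated, while lift above 1 (1.33) confirms basketball and "not cereal" are positively correlated.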
