
Approximate Frequent Pattern Mining

Philip S. Yu^1, Xifeng Yan^1, Jiawei Han^2, Hong Cheng^2, Feida Zhu^2
^1 IBM T.J. Watson Research Center
^2 University of Illinois at Urbana-Champaign
Frequent Pattern Mining
 Frequent pattern mining has been studied for over a decade
with tons of algorithms developed
 Apriori (SIGMOD’93, VLDB’94, …)
 FPgrowth (SIGMOD’00), EClat, LCM, …
 Extended to sequential pattern mining, graph mining, …
 GSP, PrefixSpan, CloSpan, gSpan, …
 Applications: Dozens of interesting applications explored
 Association and correlation analysis
 Classification (CBA, CMAR, …, discrim. feature analysis)
 Clustering (e.g., micro-array analysis)
 Indexing (e.g., gIndex)
The Problem of Frequent Itemset Mining
 First proposed by Agrawal et al. in 1993 [AIS93].
 Itemset X = {x1, …, xk}
 Given a minimum support s, discover all itemsets X s.t. sup(X) >= s
 sup(X) is the percentage of transactions containing X
 If s = 40%, X = {A, B} is a frequent itemset, since sup(X) = 3/7 > 40%

Transaction-id | Items bought
10 | A, B, C
20 | A
30 | A, B, C, D
40 | C, D
50 | A, B
60 | A, C, D
70 | B, C, D
Table 1. A sample transaction database D
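To make the definition concrete, here is a minimal brute-force sketch in Python (our own illustration, not code from the slides); practical miners such as Apriori or FPgrowth prune the search space rather than enumerating every candidate as done here.

```python
from itertools import combinations

# Table 1: the sample transaction database D.
D = {10: {"A", "B", "C"}, 20: {"A"}, 30: {"A", "B", "C", "D"},
     40: {"C", "D"}, 50: {"A", "B"}, 60: {"A", "C", "D"}, 70: {"B", "C", "D"}}

def support(X, db):
    """sup(X): fraction of transactions containing every item of itemset X."""
    return sum(X <= t for t in db.values()) / len(db)

def frequent_itemsets(db, s):
    """All itemsets X with sup(X) >= s, by exhaustive enumeration."""
    items = sorted(set().union(*db.values()))
    return {frozenset(c): support(set(c), db)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(set(c), db) >= s}

print(support({"A", "B"}, D))           # 3/7 ≈ 0.43 >= 0.40, so {A, B} is frequent
print(len(frequent_itemsets(D, 0.40)))  # number of frequent itemsets at s = 40%
```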
A Binary Matrix Representation
 We can also use a binary matrix to represent a transaction database.
 Row: Transactions
 Column: Items
 Entry: Presence/absence of an item in a transaction

     A  B  C  D
10   1  1  1  0
20   1  0  0  0
30   1  1  1  1
40   0  0  1  1
50   1  1  0  0
60   1  0  1  1
70   0  1  1  1
Table 2. Binary representation of D
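A short sketch of this transformation (again our own illustration; the item order A–D is assumed):

```python
# Build the binary matrix of Table 2 from the transaction database:
# rows = transactions, columns = items, entry = presence (1) / absence (0).
D = {10: {"A", "B", "C"}, 20: {"A"}, 30: {"A", "B", "C", "D"},
     40: {"C", "D"}, 50: {"A", "B"}, 60: {"A", "C", "D"}, 70: {"B", "C", "D"}}
items = ["A", "B", "C", "D"]
matrix = {tid: [int(i in t) for i in items] for tid, t in sorted(D.items())}
for tid, row in matrix.items():
    print(tid, row)  # e.g. 10 [1, 1, 1, 0]
```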
A Noisy Data Model
 A noise-free data model
 Assumption made by all the above algorithms
 A noisy data model
 Real-world data is subject to random noise and measurement error. For example:
 Promotions
 Special events
 Out-of-stock or overstocked items
 Measurement imprecision
 The true frequent itemsets could be distorted by such noise.
 Exact itemset mining algorithms will discover multiple fragmented itemsets but miss the true ones.
Itemsets With and Without Noise
 Exact mining algorithms get fragmented itemsets!

[Figure 1(a): Itemset without noise. Figure 1(b): Itemset with noise. Both panels plot Transactions (rows) against Items (columns), marking Itemset A and Itemset B.]
Alternative Models
 Existence of core patterns
 I.e., even under noise, the original pattern can still appear with high probability
 Only summary patterns can be derived
 A summary pattern may not even appear in the database
The Core Pattern Approach
 Core Pattern Definition
 An itemset x is a core pattern if its exact support in the noisy database satisfies
   sup(x) ≥ α · min_sup, 0 ≤ α ≤ 1
 If an approximate itemset is interesting, then with high probability it is a core pattern in the noisy database. Therefore, we can discover the approximate itemsets from only the core patterns.
 Besides the core pattern constraint, we use the constraints of minimum support, ε_r, and ε_c, as in [LPS+06].
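The definition transcribes directly into code; a minimal sketch, assuming min_sup is an absolute transaction count (as in the running example later):

```python
# Core pattern test: x is a core pattern iff its exact support in the
# noisy database is at least alpha * min_sup, with 0 <= alpha <= 1.
def is_core_pattern(x, db, min_sup, alpha):
    """x: itemset (set); db: {tid: set of items}; min_sup: support count."""
    exact_sup = sum(x <= t for t in db.values())
    return exact_sup >= alpha * min_sup
```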
Approximate Itemset Example
 Let ε_r = 0.25 and ε_c = 0.25
 For <ABCD>, its exact support = 1
 By allowing a fraction of ε_r = 0.25 noise in a row, transactions 10, 30, 60, 70 all approximately support <ABCD>
 For each item in <ABCD>, in the transaction set {10, 30, 60, 70}, a fraction of at most ε_c = 0.25 of the entries is 0

     A  B  C  D
10   1  1  1  0
20   1  0  0  0
30   1  1  1  1
40   0  0  1  1
50   1  1  0  0
60   1  0  1  1
70   0  1  1  1
The Approximate Frequent Itemset Mining Approach
 Intuition
 Discover approximate itemsets by allowing “holes” in the matrix representation.
 Constraints (a checker sketch follows this list)
 Minimum support s: the percentage of transactions containing an itemset
 Row error rate ε_r: the percentage of 0s (missing items) allowed in each transaction
 Column error rate ε_c: the percentage of 0s allowed in the transaction set for each item
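A hedged sketch of how the two error-rate constraints can be checked for a candidate itemset and supporting transaction set; the function and parameter names are ours, not from [LPS+06]:

```python
def satisfies_error_rates(X, T, db, eps_r, eps_c):
    """X: itemset; T: list of transaction ids; db: {tid: set of items}."""
    X = sorted(X)
    # Row error rate: each transaction in T may miss at most eps_r * |X| items of X.
    for tid in T:
        missing = sum(i not in db[tid] for i in X)
        if missing > eps_r * len(X):
            return False
    # Column error rate: each item of X may be absent from at most eps_c * |T| rows of T.
    for i in X:
        absent = sum(i not in db[tid] for tid in T)
        if absent > eps_c * len(T):
            return False
    # (The minimum support constraint additionally requires |T| / |db| >= s.)
    return True

D = {10: {"A", "B", "C"}, 20: {"A"}, 30: {"A", "B", "C", "D"},
     40: {"C", "D"}, 50: {"A", "B"}, 60: {"A", "C", "D"}, 70: {"B", "C", "D"}}
# Transactions {10, 30, 60, 70} approximately support <ABCD> at eps = 0.25.
print(satisfies_error_rates({"A", "B", "C", "D"}, [10, 30, 60, 70], D, 0.25, 0.25))  # True
```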
Algorithm Outline
 Mine core patterns using min_sup' = α · min_sup, 0 ≤ α ≤ 1
 Build a lattice of the core patterns
 Traverse the lattice to compute the approximate itemsets
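The first two steps can be turned into a small illustrative sketch (a brute-force simplification of ours, not the authors' actual algorithm):

```python
from itertools import combinations

def mine_core_patterns(db, min_sup, alpha):
    """Return {itemset: support count} for all itemsets with exact
    support >= alpha * min_sup (exhaustive search, for clarity only)."""
    items = sorted(set().union(*db.values()))
    core = {}
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            sup = sum(set(c) <= t for t in db.values())
            if sup >= alpha * min_sup:
                core[frozenset(c)] = sup
    return core

def lattice_levels(core):
    """Group core patterns by itemset size (level k holds the k-itemsets;
    the empty set at level 0 is omitted here)."""
    levels = {}
    for x, sup in core.items():
        levels.setdefault(len(x), []).append((x, sup))
    return levels
```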
A Running Example
 Let the database be D, ε_r = 0.5, ε_c = 0.5, s = 3, and α = 1/3

Database D:
     A  B  C  D
10   1  1  1  0
20   1  0  0  0
30   1  1  1  1
40   0  0  1  1
50   1  1  0  0
60   1  0  1  1
70   0  1  1  1

The Lattice of Core Patterns:
Level 0: null:7
Level 1: a:5  b:4  c:5  d:4
Level 2: ab:3  ac:3  ad:2  bc:3  bd:2  cd:4
Level 3: abc:2  abd:1  acd:2  bcd:2
Level 4: abcd:1
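As a usage example of the sketch after the algorithm outline: with s = 3 and α = 1/3, min_sup' = 1, so every itemset occurring in D is a core pattern, and the computed levels reproduce the lattice above.

```python
# Uses mine_core_patterns / lattice_levels from the sketch above.
D = {10: {"A", "B", "C"}, 20: {"A"}, 30: {"A", "B", "C", "D"},
     40: {"C", "D"}, 50: {"A", "B"}, 60: {"A", "C", "D"}, 70: {"B", "C", "D"}}
core = mine_core_patterns(D, min_sup=3, alpha=1/3)
for k, patterns in sorted(lattice_levels(core).items()):
    row = sorted(("".join(sorted(x)).lower(), sup) for x, sup in patterns)
    print(f"Level {k}:", row)
# Level 2: [('ab', 3), ('ac', 3), ('ad', 2), ('bc', 3), ('bd', 2), ('cd', 4)]
```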
Microarray → Co-Expression Network

[Figure: a microarray (genes × conditions) is transformed into a coexpression network, from which a module is extracted; example genes shown include MCM7, NASP, MCM3, FEN1, UNG, CCNB1, SNRPG, CDC2.]

Two issues:
 noise edges
 large scale
Mining Poor Quality Data
 Patterns discovered in multiple graphs are more reliable and significant
 Transform graph mining into dense vertexset mining

[Figure: a stack of co-expression networks with transcriptional annotation; scale: ~9000 genes, 105 × ~(9000 × 9000) ≈ 8 billion edges.]
Summary Graph: Concept

[Figure: overlap clustering scales down M networks to ONE summary graph.]
Summary Graph: Noise Edges
 Frequent dense vertexsets ≟ dense subgraphs in the summary graph
 Dense subgraphs can be accidentally formed by noise edges
 They are false frequent dense vertexsets
 Noise edges will also interfere with true modules
Unsupervised Partition: Find a Subset

[Figure, steps (1)–(3): start from a seed, identify the group, and cluster/mine together.]
Frequent Approximate Substring

ATCCGCACAGGTCAGT    AGCA
Limitation on Mining Frequent Patterns: Mine Very Small Patterns!
 Can we mine large (i.e., colossal) patterns, such as of size around 50 to 100? Unfortunately, not!
 Why not? ― the curse of the “downward closure” of frequent patterns
 The “downward closure” property
 Any sub-pattern of a frequent pattern is frequent.
 Example: If (a1, a2, …, a100) is frequent, then a1, a2, …, a100, (a1, a2), (a1, a3), …, (a1, a100), (a1, a2, a3), … are all frequent! There are about 2^100 such frequent itemsets (see the quick check below)!
 No matter whether we use breadth-first search (e.g., Apriori) or depth-first search (e.g., FPgrowth), we have to examine that many patterns
 Thus the downward closure property leads to explosion!
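A quick check of the slide's count:

```python
# Number of nonempty sub-itemsets of a frequent 100-itemset, all of
# which are frequent by downward closure.
print(2**100 - 1)  # 1267650600228229401496703205375, i.e. ~1.27e30
```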
Do We Need Mining Colossal Patterns?
 From frequent patterns to closed patterns and maximal patterns
 A frequent pattern is closed if and only if there exists no super-pattern
that is both frequent and has the same support
 A frequent pattern is maximal if and only if there exists no frequent
super-pattern
 Closed/maximal patterns may partially alleviate the problem but not
really solve it: We often need to mine scattered large patterns!
 Many real-world mining tasks need mining colossal patterns
 Micro-array analysis in bioinformatics (when support is low)
 Biological sequence patterns
 Biological/sociological/information graph pattern mining
Colossal Pattern Mining Philosophy
 No hope for completeness
 If the mining of mid-sized patterns is explosive in size,
there is no hope of finding colossal patterns efficiently by
insisting on the “complete set” mining philosophy
 Jumping out of the swamp of the mid-sized results
 What we may develop is a philosophy that may jump
out of the swamp of mid-sized results that are
explosive in size and jump to reach colossal patterns
 Striving for mining almost complete colossal patterns
 The key is to develop a mechanism that may quickly
reach colossal patterns and discover most of them
Conclusions
 Most previous work focused on finding exact frequent patterns
 There exists a discrepancy between the exact model and some real-world phenomena due to
 Noise, perturbation, etc.
 Very long pattern mining can be another prohibitive problem
 Need to develop new methodologies to find approximate frequent patterns