NGDM07 Philip Yu
Pattern Mining
Philip S. Yu (1), Xifeng Yan (1), Jiawei Han (2), Hong Cheng (2), Feida Zhu (2)
(1) IBM T.J. Watson Research Center
(2) University of Illinois at Urbana-Champaign
Frequent Pattern Mining
Frequent pattern mining has been studied for over a decade
with tons of algorithms developed
Apriori (SIGMOD’93, VLDB’94, …)
FPgrowth (SIGMOD’00), EClat, LCM, …
Extended to sequential pattern mining, graph mining, …
GSP, PrefixSpan, CloSpan, gSpan, …
Applications: Dozens of interesting applications explored
Association and correlation analysis
Classification (CBA, CMAR, …, discriminative feature analysis)
Clustering (e.g., micro-array analysis)
Indexing (e.g., gIndex)
The Problem of Frequent Itemset Mining
First proposed by Agrawal et al. in 1993 [AIS93].
Itemset X = {x1, …, xk}
Given a minimum support s, discover all itemsets X s.t. sup(X) >= s
sup(X) is the percentage of transactions containing X
If s=40%, X={A,B} is a frequent itemset since sup(X)=3/7 > 40%

Transaction-id   Items bought
10               A, B, C
20               A
30               A, B, C, D
40               C, D
50               A, B
60               A, C, D
70               B, C, D
Table 1. A sample transaction database D
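The definition above can be checked directly on Table 1. The following is a minimal brute-force sketch, not Apriori or any of the algorithms named earlier; the names `D`, `sup`, and `frequent_itemsets` are chosen here for illustration only.

```python
from itertools import combinations

# Table 1: transaction id -> items bought
D = {
    10: {"A", "B", "C"},
    20: {"A"},
    30: {"A", "B", "C", "D"},
    40: {"C", "D"},
    50: {"A", "B"},
    60: {"A", "C", "D"},
    70: {"B", "C", "D"},
}

def sup(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= items for items in db.values()) / len(db)

def frequent_itemsets(db, s):
    """Enumerate every itemset X with sup(X) >= s by exhaustive search.

    Exponential in the number of items; only meant for tiny examples.
    """
    items = sorted(set().union(*db.values()))
    result = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            support = sup(cand, db)
            if support >= s:
                result[cand] = support
    return result

freq = frequent_itemsets(D, 0.4)
for itemset, support in sorted(freq.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, round(support, 3))
```

With s=40%, an itemset must appear in at least 3 of the 7 transactions, so {A,B} (transactions 10, 30, 50) qualifies while {A,D} (transactions 30, 60) does not.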
A Binary Matrix Representation
We can also use a binary matrix to represent a transaction database.
Row: Transactions
Column: Items
Entry: Presence/absence of an item in a transaction

      A  B  C  D
10    1  1  1  0
20    1  0  0  0
30    1  1  1  1
40    0  0  1  1
50    1  1  0  0
60    1  0  1  1
70    0  1  1  1
Table 2. Binary representation of D
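As a small sketch of this representation, the binary matrix can be built from the transaction list, and support can be read off as the fraction of rows with a 1 in every column of the itemset. The helper name `sup_from_matrix` is an assumption made for this example.

```python
# Transaction database D: transaction id -> items bought
D = {
    10: {"A", "B", "C"},
    20: {"A"},
    30: {"A", "B", "C", "D"},
    40: {"C", "D"},
    50: {"A", "B"},
    60: {"A", "C", "D"},
    70: {"B", "C", "D"},
}

items = sorted(set().union(*D.values()))  # column order: A, B, C, D
tids = sorted(D)                          # row order: 10, 20, ..., 70

# Entry is 1 if the item is present in the transaction, else 0
matrix = [[1 if item in D[tid] else 0 for item in items] for tid in tids]

def sup_from_matrix(itemset, matrix, items):
    """Fraction of rows that have a 1 in every column of itemset."""
    cols = [items.index(i) for i in itemset]
    hits = sum(all(row[c] == 1 for c in cols) for row in matrix)
    return hits / len(matrix)

for tid, row in zip(tids, matrix):
    print(tid, row)
```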
A Noisy Data Model
A noise-free data model
Assumption made by all the above algorithms
A noisy data model
Real world data is subject to random noise and measurement
error. For example:
Promotions
Special events
Out-of-stock items or overstocked items
Measurement imprecision
The true frequent itemsets could be distorted by such noise.
The exact itemset mining algorithms will discover multiple
fragmented itemsets, but miss the true ones.
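A toy illustration of this fragmentation effect follows. The data is made up, and `tolerant_support` with its `max_missing` parameter is a hypothetical relaxation introduced only for this sketch; it is not claimed to be the authors' noise-tolerant model.

```python
# Made-up noisy data: the "true" itemset {a, b, c, d} was bought in all five
# transactions, but noise dropped one item from three of them.
noisy = [
    {"a", "b", "c", "d"},
    {"a", "b", "c"},       # d missing
    {"a", "b", "d"},       # c missing
    {"a", "b", "c", "d"},
    {"b", "c", "d"},       # a missing
]

def exact_support(itemset, db):
    """Fraction of transactions containing every item of itemset."""
    return sum(itemset <= t for t in db) / len(db)

def tolerant_support(itemset, db, max_missing):
    """Fraction of transactions missing at most max_missing items of itemset.

    Hypothetical relaxation for illustration only.
    """
    return sum(len(itemset - t) <= max_missing for t in db) / len(db)

true_itemset = {"a", "b", "c", "d"}
print(exact_support(true_itemset, noisy))        # support of the true itemset
print(exact_support({"a", "b", "c"}, noisy))     # support of one fragment
print(tolerant_support(true_itemset, noisy, 1))  # allowing one missing item
```

Under exact matching the true itemset is supported by only 2 of 5 transactions, while each 3-item fragment such as {a,b,c} is supported by 3 of 5, so at a 50% threshold an exact miner reports the fragments and misses the true itemset.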
Itemsets With and Without Noise
Exact mining algorithms get fragmented itemsets!
[Figure: two panels, each labeled Itemset B; axes: Transactions, Items]
Mining Poor Quality Data
[Figure: Microarray, Coexpression Network, Module; axis labels: genes, conditions; gene labels: MCM7, NASP, MCM3, FEN1, UNG, CCNB1, SNRPG, CDC2]
Two Issues:
• noise edges
• large scale
Summary Graph: Noise Edges
[Figure labels: Scale Down (M networks, ONE graph); overlap clustering; identify group; Transcriptional Annotation; steps (2), (3)]
Frequent Approximate Substring
ATCCGCACAGGTCAGT AGCA
Limitation on Mining Frequent Patterns:
Mine Very Small Patterns!