Frequent Item Mining

What is data mining?
• = Pattern mining?
• What patterns?
• Why are they useful?

Definition: Frequent Itemset
• Itemset
  – A collection of one or more items
    • Example: {Milk, Bread, Diaper}
  – k-itemset
    • An itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2
• Support
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold
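
The two measures defined above can be sketched in a few lines. The five-transaction database below is not given in the slides; it is a hypothetical example chosen so that {Milk, Bread, Diaper} occurs in 2 of 5 transactions, and the function names are illustrative.

```python
# Hypothetical database (an assumption, not from the slides), chosen so that
# {Milk, Bread, Diaper} appears in exactly 2 of the 5 transactions.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset, transactions) / len(transactions)


def is_frequent(itemset, transactions, minsup):
    """X is frequent if its support is at least the minsup threshold."""
    return support(itemset, transactions) >= minsup


X = {"Milk", "Bread", "Diaper"}
print(support_count(X, TRANSACTIONS))  # 2
print(support(X, TRANSACTIONS))        # 0.4
```

With minsup = 0.4 this itemset counts as frequent; with minsup = 0.5 it does not.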

Frequent Itemsets Mining
• Minimum support level: 50%
  – Frequent itemsets: {A}, {B}, {C}, {A,B}, {A,C}

TID     Transactions
100     { A, B, E }
200     { B, D }
300     { A, B, E }
400     { A, C }
500     { B, C }
600     { A, C }
700     { A, B }
800     { A, B, C, E }
900     { A, B, C }
1000    { A, C, E }
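
As a check on the table above, the sketch below counts every non-empty subset of every transaction and keeps those that reach the 50% threshold. This is only workable because the transactions are short; the function name is illustrative.

```python
from collections import Counter
from itertools import combinations

# Transactions copied from the table above.
TRANSACTIONS = {
    100: {"A", "B", "E"},
    200: {"B", "D"},
    300: {"A", "B", "E"},
    400: {"A", "C"},
    500: {"B", "C"},
    600: {"A", "C"},
    700: {"A", "B"},
    800: {"A", "B", "C", "E"},
    900: {"A", "B", "C"},
    1000: {"A", "C", "E"},
}


def frequent_itemsets(transactions, minsup):
    """Count each non-empty subset of each transaction, then filter by support."""
    counts = Counter()
    for items in transactions.values():
        ordered = sorted(items)
        for k in range(1, len(ordered) + 1):
            for subset in combinations(ordered, k):
                counts[subset] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= minsup}


result = frequent_itemsets(TRANSACTIONS, 0.5)
for itemset in sorted(result, key=lambda s: (len(s), s)):
    print(itemset, result[itemset])
```

The output lists {A}, {B}, {C}, {A,B} and {A,C}, with supports 0.8, 0.7, 0.6, 0.5 and 0.5.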

Frequent Pattern Mining
[Figure: a collection of small graphs whose nodes are labeled A to F]

Beyond Itemsets
• Sequence Mining
  – Finding frequent subsequences from a collection of sequences
• Graph Mining
  – Finding frequent (connected) subgraphs from a collection of graphs
• Tree Mining
  – Finding frequent (embedded) subtrees from a set of trees/graphs
• Geometric Structure Mining
  – Finding frequent substructures from 3-D or 2-D geometric graphs
• Among others…

Why Is Frequent Pattern Mining So Important?
• Application Domains
  – Business, biology, chemistry, WWW, computer/networking security, …
• Summarizing the underlying datasets, providing key insights
• Basic tools for other data mining tasks
  – Association rule mining
  – Classification
  – Clustering
  – Change Detection
  – etc…

Network motifs: recurring patterns that occur significantly more often than in randomized networks
• Do motifs have specific roles in the network?
• Many possible distinct subgraphs
  – The 13 three-node connected subgraphs
  – 199 four-node directed connected subgraphs
  – And it grows fast for larger subgraphs: 9,364 five-node subgraphs, 1,530,843 six-node…
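
The subgraph counts quoted above can be reproduced for small sizes by brute force. The sketch below enumerates every directed graph on n labeled nodes, keeps the weakly connected ones, and merges isomorphic copies by taking the smallest relabeled edge list as a canonical form. The function names are illustrative, and the approach is only practical for n up to about 4.

```python
from itertools import permutations, product


def weakly_connected(n, edges):
    """True if all n nodes are reachable from node 0, ignoring edge direction."""
    adj = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n


def count_connected_digraphs(n):
    """Number of non-isomorphic weakly connected directed graphs on n nodes."""
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    canonical_forms = set()
    for mask in product([0, 1], repeat=len(pairs)):
        edges = [p for p, bit in zip(pairs, mask) if bit]
        if not weakly_connected(n, edges):
            continue
        # Canonical form: smallest sorted edge list over all node relabelings.
        canon = min(
            tuple(sorted((perm[u], perm[v]) for u, v in edges))
            for perm in permutations(range(n))
        )
        canonical_forms.add(canon)
    return len(canonical_forms)


print(count_connected_digraphs(3))  # 13
```

Calling `count_connected_digraphs(4)` gives 199; the 5-node and 6-node cases need a smarter enumeration than this exhaustive loop.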

Finding network motifs – an overview
• Generation of a suitable random ensemble (reference networks)
• Network motifs detection process:
• How is FIM formulated in these different settings?

Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets

Frequent Itemset Generation
• Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database
  – Match each transaction against every candidate
  – Complexity ~ O(NMw) => expensive, since M = 2^d (N transactions, M candidates, w maximum transaction width)
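
A minimal sketch of the brute-force approach, using the ten transactions from the earlier table. It generates every non-empty itemset in the lattice and matches each one against each transaction, counting the matches performed. The function names are illustrative.

```python
from itertools import combinations

# Transactions from the earlier TID table (TIDs omitted).
TRANSACTIONS = [
    {"A", "B", "E"}, {"B", "D"}, {"A", "B", "E"}, {"A", "C"}, {"B", "C"},
    {"A", "C"}, {"A", "B"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
    {"A", "C", "E"},
]


def candidate_lattice(items):
    """Yield every non-empty subset of the items (2^d - 1 candidates)."""
    ordered = sorted(items)
    for k in range(1, len(ordered) + 1):
        for combo in combinations(ordered, k):
            yield frozenset(combo)


def brute_force_support(transactions):
    """Match every transaction against every candidate; return counts and work done."""
    items = set().union(*transactions)
    support, matches = {}, 0
    for candidate in candidate_lattice(items):
        count = 0
        for t in transactions:
            matches += 1  # one candidate-vs-transaction comparison
            if candidate <= t:
                count += 1
        support[candidate] = count
    return support, matches


support, matches = brute_force_support(TRANSACTIONS)
print(len(support), matches)  # 31 310
```

With d = 5 items and N = 10 transactions this is 31 candidates and 310 comparisons; each extra item doubles the number of candidates.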

Reducing Number of Candidates
• Apriori principle:
  – If an itemset is frequent, then all of its subsets must also be frequent
• Apriori principle holds due to the following property of the support measure:
  – Support of an itemset never exceeds the support of its subsets
  – This is known as the anti-monotone property of support
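
The anti-monotone property can be checked empirically. The sketch below, again on the ten transactions from the earlier table, verifies that no itemset has a higher support count than any of its immediate subsets; by transitivity this covers all subsets. The function names are illustrative.

```python
from itertools import combinations

# Transactions from the earlier TID table (TIDs omitted).
TRANSACTIONS = [
    {"A", "B", "E"}, {"B", "D"}, {"A", "B", "E"}, {"A", "C"}, {"B", "C"},
    {"A", "C"}, {"A", "B"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
    {"A", "C", "E"},
]


def support_count(itemset, transactions):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def check_anti_monotone(transactions):
    """True if support(Y) <= support(X) for every Y and every immediate subset X."""
    items = sorted(set().union(*transactions))
    for k in range(2, len(items) + 1):
        for combo in combinations(items, k):
            y = frozenset(combo)
            sup_y = support_count(y, transactions)
            for item in y:
                x = y - {item}
                if support_count(x, transactions) < sup_y:
                    return False
    return True


print(check_anti_monotone(TRANSACTIONS))  # True
```

For instance, {A,B,E} is contained in 3 transactions while its subset {A,B} is contained in 5.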

Illustrating Apriori Principle
[Figure: itemset lattice in which one itemset is found to be infrequent and all of its supersets are pruned]

Illustrating Apriori Principle
[Figure: support-count tables for Items (1-itemsets), Pairs (2-itemsets) and Triplets (3-itemsets), with Minimum Support = 3]

Apriori
• Step 2: pruning
    forall itemsets c in C_k do
        forall (k-1)-subsets s of c do
            if (s is not in L_{k-1}) then
                delete c from C_k
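
The pruning step can be sketched in Python as the `prune` function below. To make it runnable it is embedded in a small level-wise loop. The `join` step (candidate generation) is not shown in the slides and is an assumption here, as are the function names and the minimum support count of 3 applied to the transactions of the earlier table.

```python
from itertools import combinations

# Transactions from the earlier TID table (TIDs omitted).
TRANSACTIONS = [
    {"A", "B", "E"}, {"B", "D"}, {"A", "B", "E"}, {"A", "C"}, {"B", "C"},
    {"A", "C"}, {"A", "B"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
    {"A", "C", "E"},
]


def count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)


def join(prev_frequent, k):
    """Assumed join step: unions of two frequent (k-1)-itemsets that have size k."""
    prev = list(prev_frequent)
    return {a | b for i, a in enumerate(prev) for b in prev[i + 1:]
            if len(a | b) == k}


def prune(candidates, prev_frequent, k):
    """Step 2: drop c from C_k if any (k-1)-subset of c is not in L_{k-1}."""
    return {c for c in candidates
            if all(frozenset(s) in prev_frequent
                   for s in combinations(c, k - 1))}


def apriori(transactions, min_count):
    """Level-wise search; returns {frequent itemset: support count}."""
    items = set().union(*transactions)
    current = {frozenset([i]) for i in items
               if count(frozenset([i]), transactions) >= min_count}
    frequent, k = {}, 1
    while current:
        for c in current:
            frequent[c] = count(c, transactions)
        k += 1
        candidates = prune(join(current, k), current, k)
        current = {c for c in candidates
                   if count(c, transactions) >= min_count}
    return frequent


frequent = apriori(TRANSACTIONS, 3)
print(sorted("".join(sorted(s)) for s in frequent))
```

At level 3 the join produces ABC, ABE, ACE and BCE; pruning removes ACE and BCE because {C,E} is not frequent, so only ABC and ABE have their support counted.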

Challenges of Frequent Itemset Mining
• Challenges
  – Multiple scans of transaction database
  – Huge number of candidates
  – Tedious workload of support counting for candidates
• Improving Apriori: general ideas
  – Reduce passes of transaction database scans
  – Shrink number of candidates
  – Facilitate support counting of candidates

Compact Representation of Frequent Itemsets
• Some itemsets are redundant because they have the same support as their supersets
• Number of frequent itemsets
• Need a compact representation

Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets is frequent
[Figure: itemset lattice showing the maximal itemsets along the border with the infrequent itemsets]

Closed Itemset
• An itemset is closed if none of its immediate supersets has the same support as the itemset

Maximal vs Closed Itemsets
[Figure: itemset lattice annotated with transaction ids; some itemsets are not supported by any transactions]

Maximal vs Closed Frequent Itemsets
[Figure: itemset lattice with minimum support = 2, marking itemsets that are closed but not maximal and itemsets that are closed and maximal]
• # Closed = 9
• # Maximal = 4
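
The two definitions can be applied directly once all support counts are known. The five-transaction database below is not given in the text; it is an assumed example chosen so that the counts come out as above (9 closed and 4 maximal frequent itemsets at minimum support 2). The function names are illustrative.

```python
from itertools import combinations

# Assumed example database (not in the slides), chosen to match the counts above.
TRANSACTIONS = [set("ABC"), set("ABCD"), set("BCE"), set("ACDE"), set("DE")]


def all_supports(transactions):
    """Support count of every non-empty itemset over the items that occur."""
    items = sorted(set().union(*transactions))
    sup = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            x = frozenset(combo)
            sup[x] = sum(1 for t in transactions if x <= t)
    return sup, items


def closed_and_maximal(transactions, min_count):
    """Return (closed frequent itemsets, maximal frequent itemsets)."""
    sup, items = all_supports(transactions)
    closed, maximal = set(), set()
    for x, s in sup.items():
        if s < min_count:
            continue
        supersets = [x | {i} for i in items if i not in x]
        # Closed: no immediate superset has the same support.
        if all(sup[y] < s for y in supersets):
            closed.add(x)
        # Maximal: no immediate superset is frequent.
        if all(sup[y] < min_count for y in supersets):
            maximal.add(x)
    return closed, maximal


closed, maximal = closed_and_maximal(TRANSACTIONS, 2)
print(len(closed), len(maximal))  # 9 4
print(sorted("".join(sorted(m)) for m in maximal))  # ['ABC', 'ACD', 'CE', 'DE']
```

Every maximal frequent itemset is also closed, which is why the maximal set is the smaller of the two.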

Maximal vs Closed Itemsets

Research Questions
• How to efficiently enumerate Maximal Frequent Itemsets?
• How about Closed Frequent Itemsets?

Alternative Methods for Frequent Itemset Generation
• Representation of Database
  – horizontal vs vertical data layout

ECLAT
• For each item, store a list of transaction ids (tids): its TID-list
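
A minimal sketch of the conversion from the horizontal layout (one row per transaction) to the vertical layout (one TID-list per item), using the transactions of the earlier table. The function name is illustrative.

```python
from collections import defaultdict

# Horizontal layout: TID -> items, copied from the earlier table.
HORIZONTAL = {
    100: {"A", "B", "E"},
    200: {"B", "D"},
    300: {"A", "B", "E"},
    400: {"A", "C"},
    500: {"B", "C"},
    600: {"A", "C"},
    700: {"A", "B"},
    800: {"A", "B", "C", "E"},
    900: {"A", "B", "C"},
    1000: {"A", "C", "E"},
}


def to_vertical(horizontal):
    """Vertical layout: item -> set of TIDs of the transactions containing it."""
    tidlists = defaultdict(set)
    for tid, items in horizontal.items():
        for item in items:
            tidlists[item].add(tid)
    return dict(tidlists)


vertical = to_vertical(HORIZONTAL)
for item in sorted(vertical):
    print(item, sorted(vertical[item]))
```

The support count of a single item is then just the length of its TID-list, for example 8 for A and 1 for D.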

ECLAT
• Determine support of any k-itemset by intersecting tid-lists of two of its (k-1) subsets
• 3 traversal approaches:
  – top-down, bottom-up and hybrid
• Advantage: very fast support counting
• Disadvantage: intermediate tid-lists may become too large for memory
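
The intersection idea can be sketched as a depth-first search over TID-lists. The TID-lists below are derived from the earlier transaction table, and the minimum support count of 5 corresponds to its 50% threshold. The function signature and the depth-first traversal order are choices made for this sketch, not prescribed above.

```python
# TID-lists derived from the earlier transaction table.
TIDLISTS = {
    "A": {100, 300, 400, 600, 700, 800, 900, 1000},
    "B": {100, 200, 300, 500, 700, 800, 900},
    "C": {400, 500, 600, 800, 900, 1000},
    "D": {200},
    "E": {100, 300, 800, 1000},
}


def eclat(prefix, candidates, min_count, out):
    """Depth-first search; candidates is a list of (item, tidlist) extending prefix."""
    for i, (item, tids) in enumerate(candidates):
        if len(tids) < min_count:
            continue
        itemset = prefix + (item,)
        out[itemset] = len(tids)  # support count = length of the TID-list
        # TID-list of a k-itemset = intersection of the TID-lists of two of
        # its (k-1)-subsets that share the same prefix.
        extensions = [(other, tids & other_tids)
                      for other, other_tids in candidates[i + 1:]]
        eclat(itemset, extensions, min_count, out)
    return out


frequent = eclat((), sorted(TIDLISTS.items()), 5, {})
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, sup)
```

No database scan is needed after the TID-lists are built, but every recursion level holds its own intersected lists in memory, which is the disadvantage noted above.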

FP-growth Algorithm
• Use a compressed representation of the database using an FP-tree
• Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets

FP-tree construction
• After reading TID=1: null → A:1 → B:1
• After reading TID=2: null → A:1 → B:1, plus a second branch null → B:1 → C:1 → D:1

FP-Tree Construction
[Figure: transaction database and its FP-tree (root null; nodes include A:7, B:5, B:3, C:3, C:1 and several D:1 and E:1 nodes), with a header table]
• Pointers are used to assist frequent itemset generation
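
A minimal sketch of FP-tree construction with a header table. The ten-transaction database is an assumption: it is not listed in the text and was chosen so that the resulting node counts match the figure (A:7, B:5, B:3, ...). Items are inserted in alphabetical order to match the figure; FP-growth implementations usually order items by descending support instead. The class and function names are illustrative.

```python
# Assumed database (not listed in the slides), chosen to reproduce the
# node counts shown in the figure.
TRANSACTIONS = [
    ["A", "B"], ["B", "C", "D"], ["A", "C", "D", "E"], ["A", "D", "E"],
    ["A", "B", "C"], ["A", "B", "C", "D"], ["B", "C"], ["A", "B", "C"],
    ["A", "B", "D"], ["B", "C", "E"],
]


class Node:
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions):
    """Insert each transaction as a path from the root, sharing common prefixes."""
    root = Node(None, None)
    header = {}  # item -> nodes carrying that item (the header-table pointers)
    for t in transactions:
        node = root
        for item in sorted(set(t)):  # alphabetical order (assumption)
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, header


def prefix_paths(header, item):
    """Follow the header pointers for item and collect (prefix path, count)."""
    paths = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        paths.append((tuple(reversed(path)), node.count))
    return paths


root, header = build_fp_tree(TRANSACTIONS)
print(root.children["A"].count, root.children["B"].count)  # 7 3
print(sorted(prefix_paths(header, "D")))
```

The header table lets the miner reach every node of a given item without walking the whole tree; `prefix_paths` uses it to gather the paths that lead to each occurrence of D.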

FP-growth
• Conditional pattern base for D:
  P = {(A:1,B:1,C:1), (A:1,B:1), (A:1,C:1), (A:1), (B:1,C:1)}
• Recursively apply FP-growth on P
• Frequent itemsets found (with sup > 1): AD, BD, CD, ABD, ACD, BCD
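
The itemsets ending in D can be derived from the conditional pattern base P alone. The sketch below does not implement the recursive FP-growth step; it swaps in exhaustive subset counting over the prefix paths of P, which gives the same result on a pattern base this small. The function name is illustrative.

```python
from collections import Counter
from itertools import combinations

# Conditional pattern base for D, as (prefix path, count) pairs.
P = [
    (("A", "B", "C"), 1),
    (("A", "B"), 1),
    (("A", "C"), 1),
    (("A",), 1),
    (("B", "C"), 1),
]


def frequent_with_suffix(pattern_base, suffix, min_count):
    """Count every subset of every prefix path, weighted by the path count."""
    counts = Counter()
    for path, count in pattern_base:
        for k in range(1, len(path) + 1):
            for subset in combinations(path, k):
                counts[subset] += count
    return {subset + (suffix,): c
            for subset, c in counts.items() if c >= min_count}


result = frequent_with_suffix(P, "D", 2)  # sup > 1 means count >= 2
for itemset in sorted(result, key=lambda s: (len(s), s)):
    print("".join(itemset), result[itemset])
```

The result includes ABD with count 2, because the two prefix paths (A:1,B:1,C:1) and (A:1,B:1) both contain A and B; ABCD is excluded with count 1.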