
Frequent Item Mining

What is data mining?

• = Pattern Mining?
• What patterns?
• Why are they useful?

Definition: Frequent Itemset

• Itemset
  – A collection of one or more items
    • Example: {Milk, Bread, Diaper}
  – k-itemset
    • An itemset that contains k items

• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2

• Support
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5

• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold
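These definitions map directly onto code. A minimal sketch (the function names are mine; the five-transaction market-basket table that the example values σ = 2 and s = 2/5 refer to is a figure in the slides and is not reproduced here):

def sigma(itemset, transactions):
    """Support count σ: number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Support s: fraction of transactions containing the itemset."""
    return sigma(itemset, transactions) / len(transactions)

def is_frequent(itemset, transactions, minsup):
    """Frequent itemset: support >= a minsup threshold (a fraction)."""
    return support(itemset, transactions) >= minsup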

Frequent Itemsets Mining

TID    Transaction
100    {A, B, E}
200    {B, D}
300    {A, B, E}
400    {A, C}
500    {B, C}
600    {A, C}
700    {A, B}
800    {A, B, C, E}
900    {A, B, C}
1000   {A, C, E}

• Minimum support level: 50%
  – Frequent itemsets: {A}, {B}, {C}, {A,B}, {A,C}
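A brute-force check of this example (a sketch; this naive enumeration is exactly the approach whose cost is discussed below):

from itertools import combinations

db = [{'A','B','E'}, {'B','D'}, {'A','B','E'}, {'A','C'}, {'B','C'},
      {'A','C'}, {'A','B'}, {'A','B','C','E'}, {'A','B','C'}, {'A','C','E'}]
minsup = 0.5 * len(db)   # 50% of 10 transactions -> support count of 5
items = sorted(set().union(*db))

for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        count = sum(1 for t in db if set(cand) <= t)
        if count >= minsup:
            print(cand, count)
# Prints exactly {A}:8, {B}:7, {C}:6, {A,B}:5, {A,C}:5.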
Frequent Pattern Mining

[Figure: example collections of itemsets/graphs over items A–F from which frequent patterns are mined]

Beyond Itemsets

• Sequence Mining
  – Finding frequent subsequences from a collection of sequences

• Graph Mining
  – Finding frequent (connected) subgraphs from a collection of graphs

• Tree Mining
  – Finding frequent (embedded) subtrees from a set of trees/graphs

• Geometric Structure Mining
  – Finding frequent substructures from 3-D or 2-D geometric graphs

• Among others…

Why Is Frequent Pattern Mining So Important?

• Application Domains
  – Business, biology, chemistry, WWW, computer/networking security, …

• Summarizing the underlying datasets, providing key insights

• Basic tools for other data mining tasks
  – Association rule mining
  – Classification
  – Clustering
  – Change detection
  – etc.

Network motifs: recurring patterns that occur significantly more often than in randomized networks

• Do motifs have specific roles in the network?

• There are many possible distinct subgraphs:
  – 13 three-node connected subgraphs
  – 199 four-node directed connected subgraphs
  – and the number grows fast for larger subgraphs: 9,364 five-node subgraphs, 1,530,843 six-node…

Finding network motifs – an overview

• Generation of a suitable random ensemble (reference networks)

• Network motif detection process:
  – Count how many times each subgraph appears
  – Compute the statistical significance of each subgraph – the probability of it appearing in the random networks as often as in the real network (P-value or Z-score)

Example: a subgraph that appears Real = 5 times in the real network but Rand = 0.5 ± 0.6 times in the ensemble of random networks has Z-score (number of standard deviations) = (5 − 0.5) / 0.6 = 7.5.

Three Different Views of FIM

• Transactional Database
  – How do we store a transactional database?
    • Horizontal, Vertical, Transaction-Item Pair

• Binary Matrix

• Bipartite Graph

• How is FIM formulated in these different settings?
Frequent Itemset Generation

Given d items, there are 2^d possible candidate itemsets.

[Figure: the itemset lattice over d items]

Frequent Itemset Generation

• Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database
  – Match each transaction against every candidate
  – Complexity ~ O(NMw), where N is the number of transactions, M the number of candidates, and w the maximum transaction width => expensive, since M = 2^d !!!

Reducing the Number of Candidates

• Apriori principle:
  – If an itemset is frequent, then all of its subsets must also be frequent
  – Equivalently: if an itemset is infrequent, none of its supersets can be frequent, so they need not be counted (see the check below)

• The Apriori principle holds due to the following property of the support measure:
  – The support of an itemset never exceeds the support of its subsets
  – This is known as the anti-monotone property of support
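A one-line check of the anti-monotone property (a sketch; the toy database and the sigma() helper are mine): every transaction containing {A,B} also contains {A}, so adding items can only shrink the set of supporting transactions.

db = [{'A', 'B', 'E'}, {'B', 'D'}, {'A', 'B', 'E'}, {'A', 'C'}]

def sigma(itemset, transactions):
    # Support count: number of transactions containing the itemset.
    return sum(1 for t in transactions if itemset <= t)

assert sigma({'A', 'B'}, db) <= sigma({'A'}, db)   # 2 <= 3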

Illustrating the Apriori Principle

[Figure: itemset lattice in which one itemset is found to be infrequent and all of its supersets are pruned]

Illustrating the Apriori Principle

Minimum Support = 3

[Figure: support counts for the six items (1-itemsets), the surviving pairs (2-itemsets) – no need to generate candidates involving Coke or Eggs – and the single triplet (3-itemsets)]

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.

Apriori

R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB, 487–499, 1994.
How to Generate Candidates?

• Suppose the items in L_{k-1} are listed in an order

• Step 1: self-joining L_{k-1}

  insert into C_k
  select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
  from L_{k-1} p, L_{k-1} q
  where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}

• Step 2: pruning

  forall itemsets c in C_k do
    forall (k-1)-subsets s of c do
      if (s is not in L_{k-1}) then delete c from C_k
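The same two steps in runnable Python (a sketch; the sorted-tuple representation and the function name are my choices, not the paper's):

from itertools import combinations

def apriori_gen(Lk_1):
    """Lk_1: set of sorted (k-1)-tuples. Returns the candidate set C_k."""
    Lk_1 = set(Lk_1)
    Ck = set()
    for p in Lk_1:
        for q in Lk_1:
            # Join step: p and q agree on the first k-2 items and
            # p's last item precedes q's last item.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                cand = p + (q[-1],)
                # Prune step: every (k-1)-subset must itself be in L_{k-1}.
                if all(s in Lk_1 for s in combinations(cand, len(cand) - 1)):
                    Ck.add(cand)
    return Ck

For example, L2 = {AB, AC, AE, BC} self-joins to ABC, ABE and ACE, but ABE and ACE are pruned because BE and CE are not in L2, leaving C3 = {('A','B','C')}:

print(apriori_gen({('A','B'), ('A','C'), ('A','E'), ('B','C')}))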

Challenges of Frequent Itemset Mining

• Challenges
  – Multiple scans of the transaction database
  – Huge number of candidates
  – Tedious workload of support counting for candidates

• Improving Apriori: general ideas
  – Reduce the number of passes over the transaction database
  – Shrink the number of candidates
  – Facilitate support counting of candidates

Compact Representation of Frequent Itemsets

• Some itemsets are redundant because they have the same support as their supersets

• The number of frequent itemsets can therefore be very large

• Need a compact representation

Maximal Frequent Itemset

An itemset is maximal frequent if none of its immediate supersets is frequent.

[Figure: itemset lattice showing the border between frequent and infrequent itemsets; the maximal itemsets sit just inside the border]

Closed Itemset

• An itemset is closed if none of its immediate supersets has the same support as the itemset.
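Both definitions are easy to check once the supports of all frequent itemsets are known. A sketch (the helper name and dict representation are mine) that labels each frequent itemset as closed and/or maximal:

def classify(freq, all_items):
    """freq: {frozenset: support count} of ALL frequent itemsets."""
    labels = {}
    for itemset, sup in freq.items():
        # Immediate supersets: add one item not already present.
        supersets = [itemset | {i} for i in all_items if i not in itemset]
        maximal = not any(s in freq for s in supersets)
        # A superset with equal support is itself frequent, so checking
        # within freq suffices for the closed test.
        closed = not any(freq.get(s) == sup for s in supersets)
        labels[itemset] = {'closed': closed, 'maximal': maximal}
    return labels

For example, with freq = {{A}: 3, {B}: 2, {A,B}: 2} over items {A, B}: {A} is closed but not maximal, {B} is neither (its superset {A,B} has the same support), and {A,B} is both closed and maximal.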

Maximal vs Closed Itemsets

[Figure: itemset lattice annotated with the transaction ids supporting each itemset; itemsets not supported by any transactions are marked]

Maximal vs Closed Frequent Itemsets

Minimum support = 2

[Figure: the same lattice with itemsets marked as "closed but not maximal" or "closed and maximal"]

# Closed = 9
# Maximal = 4

Maximal vs Closed Itemsets

[Figure: the relationship between frequent, closed frequent, and maximal frequent itemsets]

Research Questions

• How to efficiently enumerate Maximal Frequent Itemsets?

• How about Closed Frequent Itemsets?

Alternative Methods for Frequent Itemset Generation

• Representation of the Database
  – horizontal vs vertical data layout

ECLAT

• For each item, store a list of transaction ids (tids): the TID-list.

[Figure: a horizontal database and its equivalent vertical TID-list layout]

ECLAT

• Determine the support of any k-itemset by intersecting the tid-lists of two of its (k-1)-subsets, e.g. t(A) ∧ t(B) → t(AB).

• 3 traversal approaches:
  – top-down, bottom-up and hybrid

• Advantage: very fast support counting

• Disadvantage: intermediate tid-lists may become too large for memory
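A small ECLAT-style sketch (my own minimal depth-first version, not the paper's exact formulation): keep a tid-set per item, and obtain each extension's tid-set, and hence its support, by intersection.

def eclat(tidsets, minsup):
    """tidsets: {item: set of tids}. Yields (itemset, support count)."""
    def recurse(prefix, items):
        # items: (item, tidset) pairs whose itemsets extend `prefix`.
        for i, (item, tids) in enumerate(items):
            yield tuple(prefix + [item]), len(tids)
            # Intersect with every later item's tid-set; keep frequent ones.
            ext = [(other, tids & otids) for other, otids in items[i + 1:]
                   if len(tids & otids) >= minsup]
            yield from recurse(prefix + [item], ext)
    frequent = [(i, t) for i, t in sorted(tidsets.items()) if len(t) >= minsup]
    yield from recurse([], frequent)

On the vertical layout of the ten-transaction database from earlier (D and E are pruned immediately at minsup = 5):

tidsets = {'A': {100, 300, 400, 600, 700, 800, 900, 1000},
           'B': {100, 200, 300, 500, 700, 800, 900},
           'C': {400, 500, 600, 800, 900, 1000},
           'D': {200}, 'E': {100, 300, 800, 1000}}
for itemset, sup in eclat(tidsets, 5):
    print(itemset, sup)
# ('A',) 8, ('A','B') 5, ('A','C') 5, ('B',) 7, ('C',) 6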

FP-growth Algorithm

• Use a compressed representation of the database: an FP-tree

• Once the FP-tree has been constructed, use a recursive divide-and-conquer approach to mine the frequent itemsets

FP-tree Construction

After reading TID=1 ({A, B}):

  null
    A:1 - B:1

After reading TID=2 ({B, C, D}), a second branch hangs off the root:

  null
    A:1 - B:1
    B:1 - C:1 - D:1
FP-Tree Construction

[Figure: the complete FP-tree for the example transaction database, together with a header table; the header table's pointers chain together all nodes that hold the same item]

Pointers are used to assist frequent itemset generation.
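A minimal FP-tree construction sketch (the class and function names are mine). Each transaction's frequent items are inserted in descending frequency order so that common prefixes share nodes, and a header table chains together all nodes holding the same item:

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}   # item -> child FPNode
        self.link = None     # next node holding the same item

def build_fptree(transactions, minsup):
    # Pass 1: count items; only frequent items enter the tree.
    counts = {}
    for t in transactions:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    order = {i: c for i, c in counts.items() if c >= minsup}
    root, header = FPNode(None, None), {}
    # Pass 2: insert each transaction's frequent items in sorted order.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order),
                           key=lambda i: (-order[i], i)):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = FPNode(item, node)
                # Thread the new node onto the header table's chain.
                child.link, header[item] = header.get(item), child
            child.count += 1
            node = child
    return root, header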

FP-growth

Conditional pattern base for D (the prefix paths of every D node in the FP-tree):

  P = {(A:1, B:1, C:1),
       (A:1, B:1),
       (A:1, C:1),
       (A:1),
       (B:1, C:1)}

Recursively apply FP-growth on P.

Frequent itemsets found (with sup > 1): AD, BD, CD, ABD, ACD, BCD.
(ABD qualifies because A and B occur together in two of the five prefix paths.)
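The recursion can be sketched compactly. This version (mine) follows the conditional-pattern-base recursion described above, but stores each pattern base as a plain list of (prefix path, count) pairs rather than as a compressed conditional FP-tree, trading the tree's prefix sharing for brevity:

from collections import defaultdict

def fpgrowth(transactions, minsup):
    """Return {frozenset: support count} of all frequent itemsets."""
    counts = defaultdict(int)
    for t in transactions:
        for i in t:
            counts[i] += 1
    freq = {i for i, c in counts.items() if c >= minsup}
    # Order each transaction's frequent items by descending frequency.
    base = [(sorted((i for i in t if i in freq),
                    key=lambda i: (-counts[i], i)), 1) for t in transactions]
    results = {}
    _mine(base, minsup, [], results)
    return results

def _mine(patterns, minsup, suffix, results):
    # patterns: a conditional pattern base of (ordered items, count) pairs
    # (the whole database at the top level).
    counts = defaultdict(int)
    for items, c in patterns:
        for i in items:
            counts[i] += c
    for item, c in counts.items():
        if c < minsup:
            continue
        results[frozenset(suffix + [item])] = c
        # Conditional pattern base for `item`: prefixes of the paths that
        # contain it, restricted to items still frequent in this base.
        cond = []
        for items, cnt in patterns:
            if item in items:
                prefix = [i for i in items[:items.index(item)]
                          if counts[i] >= minsup]
                if prefix:
                    cond.append((prefix, cnt))
        _mine(cond, minsup, suffix + [item], results)

db = [{'A','B','E'}, {'B','D'}, {'A','B','E'}, {'A','C'}, {'B','C'},
      {'A','C'}, {'A','B'}, {'A','B','C','E'}, {'A','B','C'}, {'A','C','E'}]
print(fpgrowth(db, 5))   # the same five itemsets as the brute-force check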

