Blatt03 Sol

The document discusses frequent itemset mining and the Apriori and FP-growth algorithms. It provides definitions of key terms such as items, itemsets, transactions, and support. It then presents exercises involving proving properties of frequent itemsets, running the Apriori and FP-growth algorithms on a sample database, and finding closed and maximal frequent itemsets.


Database Systems Group • Prof. Dr. Thomas Seidl

Exercise 3:
Frequent Itemset Mining
Knowledge Discovery in Databases I
SS 2016
Recap: Frequent Itemset Mining

Basic terms and definitions:

• Items: I = {i_1, …, i_m}
• Itemset: X ⊆ I
• Database: D, a set of transactions
• Transaction: T ⊆ I
• Support: support(X) = |{T ∈ D | X ⊆ T}|
• Frequent itemset: X is frequent iff support(X) ≥ minSup

Sample database:

TID   items
100   {butter, bread, milk, sugar}
200   {butter, flour, milk, sugar}
300   {butter, eggs, milk, salt}
400   {eggs}
500   {butter, flour, milk, salt, sugar}

Goal: Find all frequent itemsets in D!
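The support definition above can be checked directly on the sample database. A minimal sketch (variable and function names are my own, not from the slides):

```python
# Sample database from the slide: TID -> set of items.
D = {
    100: {"butter", "bread", "milk", "sugar"},
    200: {"butter", "flour", "milk", "sugar"},
    300: {"butter", "eggs", "milk", "salt"},
    400: {"eggs"},
    500: {"butter", "flour", "milk", "salt", "sugar"},
}

def support(X, D):
    # support(X) = |{T in D : X is a subset of T}|
    return sum(1 for T in D.values() if X <= T)

print(support({"butter", "milk"}, D))  # 4  (TIDs 100, 200, 300, 500)
print(support({"eggs"}, D))            # 2  (TIDs 300, 400)
```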


Knowledge Discovery in Databases I: Exercise 3 13.05.2016 2
Recap: Frequent Itemset Mining

Naive Algorithm: Just count the frequencies of all possible subsets of I in the database.

• Problem: For |I| = m, there are 2^m such itemsets!
• Clearly, this becomes infeasible rather quickly…

Main idea of the Apriori algorithm: Prune the exponential search space using anti-monotonicity.

(The slide shows the itemset lattice over {A, B, C, D}, from Ø at the bottom up to ABCD at the top, with ABCD marked as not frequent.)
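The 2^m blowup is easy to see by enumerating the candidate subsets; a quick sketch, using the item names from the lattice on the slide:

```python
from itertools import combinations

items = ["A", "B", "C", "D"]  # m = 4 items

# The naive algorithm would have to count support for every
# non-empty subset of I:
candidates = [frozenset(c)
              for r in range(1, len(items) + 1)
              for c in combinations(items, r)]

print(len(candidates))  # 15, i.e. 2**4 - 1
# For m = 40 items this would already be about 10**12 itemsets.
```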

Exercise 3-1: Frequent Itemsets

The Apriori algorithm makes use of prior knowledge of subset support properties. Prove the following subset properties:

a) All non-empty subsets of a frequent itemset must also be frequent.
b) The support of any non-empty subset S′ of an itemset S must be at least as great as the support of S.

Exercise 3-1 (a): Frequent Itemsets

a) All non-empty subsets of a frequent itemset must also be frequent:

Proof:
• Let S ⊆ I be a frequent itemset, i.e. support(S) ≥ minSup.
• Let ∅ ≠ S′ ⊆ S.
• Then
      support(S′) ≥ support(S)   (by b)
                  ≥ minSup       (S is frequent)
  i.e. S′ is a frequent itemset.

Exercise 3-1 (b): Frequent Itemsets

b) The support of any non-empty subset S′ of an itemset S must be at least as great as the support of S.

Proof:
• Let ∅ ≠ S′ ⊆ S ⊆ I.
• For any transaction T ⊆ I in database D, we have: S ⊆ T ⇒ S′ ⊆ T.
• Thus, it holds that
      {T ∈ D | S ⊆ T} ⊆ {T ∈ D | S′ ⊆ T}
  and consequently
      support(S) = |{T ∈ D | S ⊆ T}| ≤ |{T ∈ D | S′ ⊆ T}| = support(S′).
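The subset argument can be spot-checked on the recap's sample database. A small sketch (names are my own):

```python
from itertools import combinations

# Transactions from the recap slide.
D = [
    {"butter", "bread", "milk", "sugar"},
    {"butter", "flour", "milk", "sugar"},
    {"butter", "eggs", "milk", "salt"},
    {"eggs"},
    {"butter", "flour", "milk", "salt", "sugar"},
]

def support(X):
    return sum(1 for T in D if X <= T)

S = frozenset({"butter", "milk", "sugar"})
# Every non-empty proper subset S' of S must satisfy
# support(S') >= support(S).
for r in range(1, len(S)):
    for Sp in map(frozenset, combinations(S, r)):
        assert support(Sp) >= support(S)
print("support(S) =", support(S), "- every subset has at least this support")
```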

Exercise 3-2: Frequent Itemset Mining

Let D be a database that contains the following four transactions:

TID   items_bought
T1    {K, A, D, B}
T2    {D, A, C, E, B}
T3    {C, A, B, E}
T4    {B, A, D}

In addition, let minSup = 60%.

a) Find all frequent itemsets using the Apriori algorithm.
b) Find all frequent itemsets using the FP-growth algorithm.
c) Determine all closed and maximal frequent itemsets.

Exercise 3-2 (a): Apriori Algorithm

minSup = 0.6, i.e. an itemset is frequent iff it occurs in at least 3 of the 4 transactions.

Scan D → C1:
itemset   sup
{A}       100%
{B}       100%
{C}        50%
{D}        75%
{E}        50%
{K}        25%

L1 (frequent 1-itemsets): {A} 100%, {B} 100%, {D} 75%

L1 ⋈ L1 → C2: {A, B}, {A, D}, {B, D}
Prune C2: nothing removed (every 1-subset is in L1)
Scan D: {A, B} 100%, {A, D} 75%, {B, D} 75%
L2: {A, B} 100%, {A, D} 75%, {B, D} 75%

L2 ⋈ L2 → C3: {A, B, D}
Prune C3: nothing removed (every 2-subset is in L2)
Scan D: {A, B, D} 75%
L3: {A, B, D} 75%

L3 ⋈ L3 → C4 is empty.
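The trace above can be reproduced with a compact Apriori sketch. This is a minimal implementation of the join/prune/count loop, not the course's reference code:

```python
from itertools import combinations

# Database from Exercise 3-2; support counts are absolute here.
D = [
    {"K", "A", "D", "B"},
    {"D", "A", "C", "E", "B"},
    {"C", "A", "B", "E"},
    {"B", "A", "D"},
]
min_sup = 0.6 * len(D)  # 2.4, so an itemset needs at least 3 transactions

def support(X):
    return sum(1 for T in D if X <= T)

# L[k-1] holds the frequent k-itemsets.
items = sorted({i for T in D for i in T})
L = [{frozenset({i}) for i in items if support(frozenset({i})) >= min_sup}]

k = 1
while L[-1]:
    prev = L[-1]
    # Join step: unite two k-itemsets into a (k+1)-itemset candidate.
    cands = {a | b for a in prev for b in prev if len(a | b) == k + 1}
    # Prune step: drop candidates that have an infrequent k-subset.
    cands = {c for c in cands
             if all(frozenset(s) in prev for s in combinations(c, k))}
    # Count step: one scan of D per level.
    L.append({c for c in cands if support(c) >= min_sup})
    k += 1

for level in L[:-1]:
    print(sorted(sorted(x) for x in level))
# [['A'], ['B'], ['D']]
# [['A', 'B'], ['A', 'D'], ['B', 'D']]
# [['A', 'B', 'D']]
```

The output matches L1, L2, and L3 from the trace, and C4 comes out empty.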
Recap: FP-Growth Algorithm

Bottleneck of Apriori: Candidate generation
• Huge candidate sets
• Multiple scans of the database

FP-Growth: FP-mining without candidate generation
• Compress the database, retaining only the information relevant to FP-mining: the FP-tree
• Use an efficient divide & conquer approach and grow frequent patterns without generating candidate sets

Exercise 3-2 (b): FP-Growth Algorithm

minSup = 0.6. For each transaction, keep only its frequent items, sorted in descending order of their frequencies:

TID   items bought        (ordered) frequent items
1     {K, A, D, B}        {A, B, D}
2     {D, A, C, E, B}     {A, B, D}
3     {C, A, B, E}        {A, B}
4     {B, A, D}           {A, B, D}

Header table (items sorted in the order of descending support):
item   frequency
A      4
B      4
D      3
C      2
E      2
K      1

Initial FP-tree: {} → A:4 → B:4 → D:3 (a single path, since all ordered transactions share the prefix A, B)
Exercise 3-2 (b): FP-Growth Algorithm

Conditional pattern bases from the initial FP-tree ({} → A:4 → B:4 → D:3):

item   cond. pattern base
A      {}
B      A:4
D      AB:3

D-conditional FP-tree: {}|D with A:3 → B:3
→ frequent itemsets: {D}, {A, D}, {B, D}, {A, B, D}

B-conditional FP-tree: {}|B with A:4
→ frequent itemsets: {B}, {A, B}

A-conditional FP-tree: {}|A = {}
→ frequent itemset: {A}
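For this particular database the FP-tree is a single path, so the conditional pattern bases can be read off the ordered transactions directly. A sketch of that shortcut (a full FP-growth implementation would traverse the tree via the header table's node links):

```python
from collections import Counter

# Transactions reduced to their frequent items, in descending
# frequency order, as in the table above.
ordered = [["A", "B", "D"], ["A", "B", "D"], ["A", "B"], ["A", "B", "D"]]

# For each item, collect the prefix paths leading to it
# (its conditional pattern base), together with their counts.
cpb = {}
for t in ordered:
    for i, item in enumerate(t):
        cpb.setdefault(item, Counter())[tuple(t[:i])] += 1

print(cpb["A"])  # Counter({(): 4})          -> A: {}
print(cpb["B"])  # Counter({('A',): 4})      -> B: A:4
print(cpb["D"])  # Counter({('A', 'B'): 3})  -> D: AB:3
```

This matches the pattern-base table on the slide.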
Exercise 3-2 (c): Closed and Maximal Frequent Itemsets

• Closed frequent itemsets:
  • X closed ⇔ ∄Y: X ⊂ Y ∧ support(Y) = support(X)
  • The set of closed itemsets contains the complete support information.

• Maximal frequent itemsets:
  • X maximal ⇔ ∄Y: X ⊂ Y ∧ support(Y) ≥ minSup
  • Not complete, but more compact.

frequent itemset   support
{A}                1
{B}                1
{D}                0.75
{A, B}             1       ← closed, but not maximal
{A, D}             0.75
{B, D}             0.75
{A, B, D}          0.75    ← closed & maximal
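The closed/maximal classification follows mechanically from the two definitions. A short sketch over the support table above (supports hard-coded from the slide):

```python
# Frequent itemsets and their relative supports, from the table above.
freq = {
    frozenset("A"): 1.0,   frozenset("B"): 1.0,   frozenset("D"): 0.75,
    frozenset("AB"): 1.0,  frozenset("AD"): 0.75, frozenset("BD"): 0.75,
    frozenset("ABD"): 0.75,
}

# X is closed iff no proper superset Y has the same support.
closed = {X for X, s in freq.items()
          if not any(X < Y and freq[Y] == s for Y in freq)}
# X is maximal iff no proper superset Y is frequent at all.
maximal = {X for X in freq if not any(X < Y for Y in freq)}

print(sorted(sorted(X) for X in closed))   # [['A', 'B'], ['A', 'B', 'D']]
print(sorted(sorted(X) for X in maximal))  # [['A', 'B', 'D']]
```

Every maximal itemset is closed, but not vice versa: {A, B} is closed (its only frequent superset {A, B, D} has lower support) yet not maximal.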

Recap: Association Rule Mining

Association rule: X ⇒ Y, where X, Y ⊆ I are two itemsets with X ∩ Y = ∅.

• support(X ⇒ Y) = support(X ∪ Y)
• confidence(X ⇒ Y) = support(X ∪ Y) / support(X)
• Strong association rules have support ≥ minSup and confidence ≥ minConf.

Goal: Find all strong association rules in D!
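The two measures can be evaluated directly on the database from Exercise 3-2; a minimal sketch (function names are my own):

```python
# Database D from Exercise 3-2.
D = [{"K", "A", "D", "B"}, {"D", "A", "C", "E", "B"},
     {"C", "A", "B", "E"}, {"B", "A", "D"}]

def support(X):
    # Relative support: fraction of transactions containing X.
    return sum(1 for T in D if X <= T) / len(D)

def confidence(X, Y):
    # confidence(X => Y) = support(X u Y) / support(X)
    return support(X | Y) / support(X)

print(support({"A", "D"}))       # 0.75
print(confidence({"A"}, {"D"}))  # 0.75
print(confidence({"D"}, {"A"}))  # 1.0
```

Note the asymmetry: A ⇒ D and D ⇒ A share the same support but have different confidences.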

Exercise 3-3: Association Rule Mining

After frequent itemset mining, association rules can be extracted as follows: For each frequent itemset X and every non-empty subset Y ⊂ X, generate a rule Y ⇒ X ∖ Y if it fulfills the minimum confidence property.

a) Prove the following anti-monotonicity lemma for strong association rules:

Let X be a frequent itemset and Y ⊂ X. If Y ⇒ X ∖ Y is a strong association rule, then Y′ ⇒ X ∖ Y′ is also a strong association rule for every Y′ with Y ⊆ Y′ ⊂ X.

Exercise 3-3 (a): Association Rule Mining

Let X be a frequent itemset and Y ⊂ X. If Y ⇒ X ∖ Y is a strong association rule, then Y′ ⇒ X ∖ Y′ is also a strong association rule for every Y′ with Y ⊆ Y′ ⊂ X.

Proof:
• support(Y′ ⇒ X ∖ Y′) = support(X) ≥ minSup   (X is frequent)
• confidence(Y′ ⇒ X ∖ Y′) = support(X) / support(Y′)
                          ≥ support(X) / support(Y)   (by 3-1 b, since Y ⊆ Y′)
                          = confidence(Y ⇒ X ∖ Y)
                          ≥ minConf   (Y ⇒ X ∖ Y is strong)

Exercise 3-3 (b): Association Rule Mining

b) Extract all strong association rules from the database D provided in the previous exercise with a minimum confidence of minConf = 80%. Which candidate rules can be pruned based on anti-monotonicity?

frequent itemsets   support
{A}                 1
{B}                 1
{D}                 0.75
{A, B}              1
{A, D}              0.75
{B, D}              0.75
{A, B, D}           0.75

candidate rule   confidence
A ⇒ B            1      ✔
B ⇒ A            1      ✔
A ⇒ D            0.75   ✗
D ⇒ A            1      ✔
B ⇒ D            0.75   ✗
D ⇒ B            1      ✔
A, B ⇒ D         0.75   ✗
A, D ⇒ B         1      ✔
B, D ⇒ A         1      ✔
D ⇒ A, B         1      ✔

A ⇒ B, D and B ⇒ A, D can be pruned: A, B ⇒ D is not strong (confidence 0.75 < 80%), so by the anti-monotonicity lemma no rule over X = {A, B, D} with a smaller left-hand side than {A, B} can be strong either.
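The full table can be generated by iterating over every frequent itemset and each of its non-empty proper subsets. A sketch (the frequent itemsets are hard-coded from part (a)):

```python
from itertools import combinations

# Database and frequent itemsets from Exercise 3-2.
D = [{"K", "A", "D", "B"}, {"D", "A", "C", "E", "B"},
     {"C", "A", "B", "E"}, {"B", "A", "D"}]
frequent = [frozenset(s) for s in ("A", "B", "D", "AB", "AD", "BD", "ABD")]
min_conf = 0.8

def support(X):
    return sum(1 for T in D if X <= T) / len(D)

# For each frequent X and non-empty Y < X, keep Y => X \ Y
# if its confidence reaches min_conf.
strong = []
for X in frequent:
    for r in range(1, len(X)):
        for Y in map(frozenset, combinations(X, r)):
            conf = support(X) / support(Y)
            if conf >= min_conf:
                strong.append((tuple(sorted(Y)), tuple(sorted(X - Y))))

for lhs, rhs in sorted(strong):
    print(",".join(lhs), "=>", ",".join(rhs))
```

This yields exactly the seven ✔-marked rules from the table; the rules A ⇒ B, D and B ⇒ A, D never need their confidence computed once A, B ⇒ D has failed.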
