Dwdmunit2 Assoc
Association Rules
• Applications
• Rule form
prediction (Boolean variables) ⇒ prediction (Boolean variables) [support, confidence]
– computer ⇒ antivirus_software [support = 2%, confidence = 60%]
– buys(x, “computer”) ⇒ buys(x, “antivirus_software”) [0.5%, 60%]
• Confidence: the percentage of transactions containing the rule’s antecedent that also contain its consequent
• Support: the percentage of transactions that contain both the antecedent and the consequent
• Minimum support threshold
• Minimum confidence threshold
• Shopping baskets
• Each item has a Boolean variable representing the
presence or absence of that item.
• Each basket can be represented by a Boolean vector
of values assigned to these variables.
• Patterns can be identified from these Boolean vectors
• These patterns can be represented by association rules.
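The basket-to-Boolean-vector mapping above can be sketched in Python; the item names and baskets here are illustrative, not from the source:

```python
# Each basket becomes a Boolean vector over the item universe:
# one Boolean variable per item, True if the basket contains it.
items = ["bread", "milk", "computer", "antivirus_software"]

baskets = [
    {"bread", "milk"},
    {"computer", "antivirus_software"},
]

vectors = [[item in basket for item in items] for basket in baskets]
print(vectors)  # [[True, True, False, False], [False, False, True, True]]
```

Frequent co-occurrence patterns mined from these vectors are what the association rules express.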
Lecture-28
Mining single-dimensional Boolean association rules from transactional databases
Mining Association Rules—An Example
Mining Frequent Itemsets: the Key Step
• Find the frequent itemsets: the sets of items that
have minimum support
– Any subset of a frequent itemset must also be a frequent itemset
• i.e., if {A, B} is a frequent itemset, both {A} and {B} must also be frequent itemsets
– Iteratively find frequent itemsets with cardinality from 1
to k (k-itemset)
• Use the frequent itemsets to generate association
rules.
The Apriori Algorithm
• Join Step
– Ck is generated by joining Lk-1 with itself
• Prune Step
– Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
The Apriori Algorithm
• Pseudo-code:
Ck: Candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
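The pseudo-code can be sketched in Python; the function name and the dictionary-based counting are implementation choices, not from the source:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Find all frequent itemsets: L1 from single items, then repeatedly
    generate C(k+1) from Lk, count supports, and keep frequent candidates."""
    # L1: count single items, keep those meeting min_support
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s for s, c in counts.items() if c >= min_support}
    all_frequent = {s: counts[s] for s in L}
    k = 1
    while L:
        # generate C(k+1): join Lk with itself, prune by the Apriori property
        candidates = set()
        for a in L:
            for b in L:
                u = a | b
                if len(u) == k + 1 and all(
                        frozenset(s) in L for s in combinations(u, k)):
                    candidates.add(u)
        # scan the database: count candidates contained in each transaction
        counts = {c: 0 for c in candidates}
        for t in transactions:
            ts = set(t)
            for c in candidates:
                if c <= ts:
                    counts[c] += 1
        # L(k+1): frequent candidates
        L = {c for c in candidates if counts[c] >= min_support}
        all_frequent.update({c: counts[c] for c in L})
        k += 1
    return all_frequent  # the union of all Lk, with support counts

# Usage on the example database D with minimum support count 2:
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
freq = apriori(D, min_support=2)
print(freq[frozenset({2, 3, 5})])  # 2
```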
The Apriori Algorithm — Example
Database D (minimum support count = 2):

  TID   Items
  100   1 3 4
  200   2 3 5
  300   1 2 3 5
  400   2 5

Scan D for C1:   {1}:2  {2}:3  {3}:3  {4}:1  {5}:3
L1 (frequent):   {1}:2  {2}:3  {3}:3  {5}:3

C2 from L1:      {1 2}  {1 3}  {1 5}  {2 3}  {2 5}  {3 5}
Scan D for C2:   {1 2}:1  {1 3}:2  {1 5}:1  {2 3}:2  {2 5}:3  {3 5}:2
L2 (frequent):   {1 3}:2  {2 3}:2  {2 5}:3  {3 5}:2

C3 from L2:      {2 3 5}
Scan D for C3 → L3:  {2 3 5}:2
How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
• Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
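The two steps above can be mirrored directly in Python; representing itemsets as sorted tuples matches the "items listed in an order" assumption, and the function name `apriori_gen` is taken from the literature:

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate Ck from Lk-1: self-join on the first k-2 items, then
    prune candidates that have an infrequent (k-1)-subset."""
    prev = set(L_prev)
    candidates = set()
    # Step 1: self-join — p and q agree on the first k-2 items,
    # and p's last item precedes q's last item
    for p in L_prev:
        for q in L_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # Step 2: prune — drop c if any of its (k-1)-subsets is not in Lk-1
    return {c for c in candidates
            if all(s in prev for s in combinations(c, k - 1))}
```

With L3 = {abc, abd, acd, ace, bcd}, the join produces abcd and acde, and the prune step removes acde because ade is not in L3, leaving C4 = {abcd}.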
How to Count Supports of Candidates?
• Why is counting the supports of candidates a problem?
– The total number of candidates can be very large
– One transaction may contain many candidates
• Method
– Candidate itemsets are stored in a hash-tree
– Leaf node of hash-tree contains a list of itemsets and
counts
– Interior node contains a hash table
– Subset function: finds all the candidates contained in a
transaction
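The hash-tree itself is more involved; a minimal flat stand-in for its subset function, assuming a plain set lookup suffices for illustration, can be sketched as:

```python
from itertools import combinations

def subset(candidates, transaction):
    """Return the candidate itemsets contained in a transaction.
    A hash-tree does this without materialising every k-subset of the
    transaction; this flat version enumerates them and looks each one up."""
    by_size = {}
    for c in candidates:
        by_size.setdefault(len(c), set()).add(c)
    found = []
    for k, cset in sorted(by_size.items()):
        for s in combinations(sorted(transaction), k):
            if s in cset:
                found.append(s)
    return found
```

For example, against C2 = {(1,3), (2,3), (2,5), (3,5)}, the transaction {2, 3, 5} contains the candidates (2,3), (2,5), and (3,5).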
Example of Generating Candidates
• Self-joining: L3*L3
– abcd from abc and abd
– acde from acd and ace
• Pruning:
– acde is removed because ade is not in L3
• C4={abcd}
Methods to Improve Apriori’s Efficiency
• Hash-based itemset counting
– A k-itemset whose corresponding hashing bucket count is below the
threshold cannot be frequent
• Transaction reduction
– A transaction that does not contain any frequent k-itemset is useless
in subsequent scans
• Partitioning
– Any itemset that is potentially frequent in DB must be frequent in at
least one of the partitions of DB
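Of the three methods, transaction reduction is the simplest to sketch; the function below is an illustrative rendering, with `frequent_k` standing for the current Lk as a collection of frozensets:

```python
def reduce_transactions(transactions, frequent_k):
    """Transaction reduction: a transaction containing no frequent
    k-itemset cannot contain any frequent (k+1)-itemset, so it can be
    dropped from subsequent database scans."""
    return [t for t in transactions
            if any(s <= set(t) for s in frequent_k)]
```

Applied after the L2 pass of the running example, any transaction containing none of {1 3}, {2 3}, {2 5}, {3 5} would be excluded from the C3 scan.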
Methods to Improve Apriori’s Efficiency
• Sampling
– Mine on a subset of the given data with a lowered support threshold, plus a method to verify completeness against the full data
Mining Frequent Patterns Without Candidate
Generation
Lecture-29
Mining multilevel association rules
from transactional databases
Mining various kinds of association rules
[Concept-hierarchy figure: Level 1 “Milk”, min_sup = 5%, support = 10%]
Reduced Support
[Concept-hierarchy figure with reduced support: Level 1 “Milk”, min_sup = 5%, support = 10%]
• Single-dimensional rules
buys(X, “milk”) ⇒ buys(X, “bread”)
• Multi-dimensional rules
– Inter-dimension association rules — no repeated predicates
age(X, “19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
– Hybrid-dimension association rules — repeated predicates
age(X, “19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
• Categorical Attributes
– finite number of possible values, no ordering
among values
• Quantitative Attributes
– numeric, implicit ordering among values
• Subjective measures
– A rule (pattern) is interesting if
• it is unexpected (surprising to the user); and/or
• it is actionable (the user can do something with it)
• Succinctness
• Anti-monotonicity
• Monotonicity
• Convertible constraints
• Inconvertible constraints
Lecture-32 - Constraint-based association mining
Property of Constraints: Anti-Monotone
• Anti-monotonicity: if a set S violates the constraint, any superset of S violates the constraint.
• Examples:
– sum(S.price) ≤ v is anti-monotone
– sum(S.price) ≥ v is not anti-monotone
– sum(S.price) = v is partly anti-monotone
• Application:
– Push “sum(S.price) ≤ 1000” deeply into iterative frequent set computation.
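The pushed constraint amounts to a pruning test during candidate generation; a minimal sketch, with an illustrative price map that is not from the source:

```python
def violates_max_sum(itemset, prices, v):
    """Anti-monotone constraint sum(S.price) <= v: once a set's price sum
    exceeds v, every superset also exceeds it, so the whole branch of
    supersets can be pruned from the search."""
    return sum(prices[i] for i in itemset) > v

# Hypothetical prices for illustration
prices = {"a": 600, "b": 500, "c": 100}
print(violates_max_sum({"a", "b"}, prices, 1000))  # True — prune all supersets
print(violates_max_sum({"a", "c"}, prices, 1000))  # False — keep exploring
```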
Constraint                     Antimonotone
S θ v, θ ∈ {=, ≤, ≥}           yes
v ∈ S                          no
S ⊇ V                          no
S ⊆ V                          yes
S = V                          partly
min(S) ≤ v                     no
min(S) ≥ v                     yes
min(S) = v                     partly
max(S) ≥ v                     no
max(S) ≤ v                     yes
max(S) = v                     partly
count(S) ≤ v                   yes
count(S) ≥ v                   no
count(S) = v                   partly
sum(S) ≤ v                     yes
sum(S) ≥ v                     no
sum(S) = v                     partly
avg(S) θ v, θ ∈ {=, ≤, ≥}      convertible
(frequent constraint)          (yes)
Example of Convertible Constraints: avg(S) ≥ v