Data Mining:
Association
Mining Association Rules
in Large Databases
Association rule mining
Mining single-dimensional Boolean association
rules from transactional databases
Mining multilevel association rules from
transactional databases
Mining multidimensional association rules from
transactional databases and data warehouse
From association mining to correlation analysis
Constraint-based association mining
Summary
What Is Association Mining?
Association rule mining:
Finding frequent patterns, associations,
correlations, or causal structures among sets of
items or objects in transaction databases,
relational databases, and other information
repositories.
Applications:
Basket data analysis, cross-marketing, catalog
design, loss-leader analysis, clustering,
classification, etc.
Example:
Computer=>antivirus_software [support=2%, confidence=60%]
A support of 2% for the above rule
means that 2% of all the
transactions under analysis show
that computer and antivirus software
are purchased together.
A confidence of 60% means that
60% of the customers who
purchased a computer also bought
antivirus software.
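For reference, the two measures can be written compactly; this is the standard formulation (not on the original slide), where A ∪ B denotes the itemset containing both A and B:

\[
\mathrm{support}(A \Rightarrow B) = P(A \cup B), \qquad
\mathrm{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)}
\]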
Association Rule: Basic Concepts
Given: (1) database of transactions, (2) each
transaction is a list of items (purchased by a
customer in a visit)
Find: all rules that correlate the presence of one set
of items with that of another set of items
E.g., 98% of people who purchase tires and auto
accessories also get automotive services done
Association Rule Mining: A
Road Map
Boolean vs. quantitative associations (Based
on the types of values handled)
buys(x, "SQLServer") ^ buys(x, "DMBook")
⇒ buys(x, "DBMiner") [0.2%, 60%]  (Boolean)
age(x, "30..39") ^ income(x, "42..48K")
⇒ buys(x, "PC") [1%, 75%]  (quantitative)
Mining Association Rules in
Large Databases
Association rule mining
Mining single-dimensional Boolean association
rules from transactional databases
Mining multilevel association rules from
transactional databases
Mining multidimensional association rules from
transactional databases and data warehouse
From association mining to correlation analysis
Constraint-based association mining
Summary
Mining Association Rules—An Example
Min. support 50%, Min. confidence 50%

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%

For rule A ⇒ C:
    support = support({A,C}) = 50%
    confidence = support({A,C}) / support({A}) = 66.6%
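The numbers above can be reproduced with a few lines of Python; this is a minimal sketch added for illustration, with variable and function names of my own choosing:

```python
# Support/confidence check for rule A => C on the four example transactions.
transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

sup_ac = support({"A", "C"}, transactions)          # 0.5   -> 50%
conf_a_c = sup_ac / support({"A"}, transactions)    # 0.666 -> 66.6%
print(f"support(A=>C) = {sup_ac:.0%}, confidence(A=>C) = {conf_a_c:.1%}")
```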
The Apriori principle:
Any subset of a frequent itemset must be frequent.
Mining Frequent
Itemsets: the Key Step
Find the frequent itemsets: the sets of
items that have minimum support
A subset of a frequent itemset must also be a
frequent itemset
i.e., if {A, B} is a frequent itemset, both {A} and
{B} must be frequent itemsets
Iteratively find frequent itemsets with
cardinality from 1 to k (k-itemset)
Use the frequent itemsets to generate
association rules.
The Apriori Algorithm
Join Step: Ck is generated by joining Lk-1 with itself
Prune Step: Any (k-1)-itemset that is not
frequent cannot be a subset of a frequent k-
itemset
Pseudo-code:
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
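The pseudo-code maps fairly directly onto Python. The sketch below is a straightforward, unoptimized rendering under the assumption that transactions are plain sets of items; function and variable names are mine, not from the slides, and candidate generation follows the join-and-prune idea described on the following slides.

```python
from itertools import combinations

def apriori(db, min_support_count):
    """Return {itemset (frozenset): support count} for all frequent itemsets.

    db: iterable of transactions, each a set of items.
    min_support_count: absolute minimum support (number of transactions).
    """
    db = [set(t) for t in db]

    # L1: frequent 1-itemsets
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support_count}
    frequent = dict(current)

    k = 1
    while current:
        # Join step: union pairs of frequent k-itemsets that share k-1 items,
        # keeping only candidates of size k+1.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k + 1:
                # Prune step: every k-subset of the candidate must be frequent.
                if all(frozenset(sub) in current for sub in combinations(union, k)):
                    candidates.add(union)

        # Count the candidates contained in each transaction.
        cand_counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    cand_counts[c] += 1

        current = {c: n for c, n in cand_counts.items() if n >= min_support_count}
        frequent.update(current)
        k += 1

    return frequent
```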
The Apriori Algorithm — Example
Database D:
    TID   Items
    100   1 3 4
    200   2 3 5
    300   1 2 3 5
    400   2 5

C1 (after scanning D):
    {1}:2   {2}:3   {3}:3   {4}:1   {5}:3

L1 (min. support count = 2):
    {1}:2   {2}:3   {3}:3   {5}:3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
C2 counts (after scanning D):
    {1 2}:1   {1 3}:2   {1 5}:1   {2 3}:2   {2 5}:3   {3 5}:2

L2:
    {1 3}:2   {2 3}:2   {2 5}:3   {3 5}:2

C3 (generated from L2): {2 3 5}
L3 (after scanning D): {2 3 5}:2
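As a sanity check, running the apriori sketch from the previous slide on this database with an absolute support count of 2 reproduces the tables (the helper name comes from that sketch, not from the slides):

```python
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(db, min_support_count=2)   # apriori() from the sketch above
# e.g. result[frozenset({2, 3, 5})] == 2, and frozenset({4}) is absent
print(sorted((sorted(s), c) for s, c in result.items()))
```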
How to Generate Candidates?
Suppose the items in Lk-1 are listed in an
order
Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2,
p.itemk-1 < q.itemk-1
Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck
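A direct Python rendering of the two steps, under the assumption that each (k-1)-itemset is stored as a sorted tuple so the SQL-style prefix join applies; the function name apriori_gen is mine:

```python
from itertools import combinations

def apriori_gen(L_prev):
    """Generate C_k from L_{k-1}, given as a set of sorted tuples of length k-1."""
    L_prev = set(L_prev)
    C = set()
    # Step 1: self-join -- pair (k-1)-itemsets that agree on their first k-2 items.
    for p in L_prev:
        for q in L_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                C.add(p + (q[-1],))
    # Step 2: prune -- drop c if any (k-1)-subset of c is not in L_{k-1}.
    return {c for c in C
            if all(s in L_prev for s in combinations(c, len(c) - 1))}
```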
How to Count Supports of
Candidates?
Why is counting the supports of candidates a problem?
The total number of candidates can be huge
One transaction may contain many
candidates
Method:
Candidate itemsets are stored in a hash-tree
Leaf node of hash-tree contains a list of
itemsets and counts
Interior node contains a hash table
Subset function: finds all the candidates contained in a transaction
Example of Generating
Candidates
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4={abcd}
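Feeding this example into the apriori_gen sketch from the previous slide (items as single-character strings) reproduces the result:

```python
L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")}
print(apriori_gen(L3))   # {('a', 'b', 'c', 'd')} -- acde is pruned since ade is not in L3
```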
Methods to Improve Apriori’s
Efficiency
Hash-based itemset counting: A k-itemset whose
corresponding hashing bucket count is below the
threshold cannot be frequent
Transaction reduction: A transaction that does not contain
any frequent k-itemset is useless in subsequent scans
Partitioning: Any itemset that is potentially frequent in DB
must be frequent in at least one of the partitions of DB
Sampling: mining on a subset of given data, lower support
threshold + a method to determine the completeness
Dynamic itemset counting: add new candidate itemsets
only when all of their subsets are estimated to be frequent
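As a concrete illustration of one of these ideas, transaction reduction, here is a minimal sketch of my own (not from the slides): after the k-th pass, a transaction containing no frequent k-itemset can be dropped from subsequent scans, since it cannot contain any frequent (k+1)-itemset either.

```python
def reduce_transactions(db, frequent_k):
    """Keep only transactions that contain at least one frequent k-itemset.

    db: list of transactions (sets of items).
    frequent_k: iterable of frequent k-itemsets (frozensets) from the last pass.
    """
    frequent_k = list(frequent_k)
    return [t for t in db if any(s <= t for s in frequent_k)]
```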
Visualization of Association Rule Using Plane Graph
Mining Association Rules in
Large Databases
Association rule mining
Mining single-dimensional Boolean association
rules from transactional databases
Mining multilevel association rules from
transactional databases
Mining multidimensional association rules from
transactional databases and data warehouse
From association mining to correlation analysis
Constraint-based association mining
Summary
Multiple-Level Association
Rules
Items often form a hierarchy, e.g.:
    Food → milk, bread
    milk → skim, 2%;  bread → wheat, white
    (brand-level items such as Fraser and Sunset sit at the lowest level)
Items at the lower levels are expected to have lower support.
Rules regarding itemsets at appropriate levels could be quite useful.
The transaction database can be encoded based on dimensions and levels:

    TID   Items
    T1    {111, 121, 211, 221}
    T2    {111, 211, 222, 323}
    T3    {112, 122, 221, 411}
    T4    {111, 121}
    T5    {111, 122, 211, 221, 413}

We can explore shared multi-level mining.
Mining Multi-Level
Associations
A top_down, progressive deepening approach:
First find high-level strong rules:
milk ⇒ bread [20%, 60%].
Then find their lower-level "weaker" rules:
2% milk ⇒ wheat bread [6%, 50%].
Variations in mining multiple-level association rules:
Level-crossed association rules:
2% milk ⇒ Wonder wheat bread
Association rules with multiple, alternative
hierarchies:
2% milk ⇒ Wonder bread
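One way to see the progressive-deepening idea in code, using the level-encoded transactions from the earlier slide. The sketch assumes the usual reading of the codes (first digit = category, second = content, third = brand), which is my interpretation rather than something stated on the slides: truncate each item code to the desired level, then run an ordinary frequent-itemset miner (such as the apriori sketch earlier) on the generalized transactions.

```python
def generalize(db, level):
    """Map each encoded item (e.g. '112') to its ancestor at `level` digits."""
    return [{item[:level] for item in t} for t in db]

db = [{"111", "121", "211", "221"},
      {"111", "211", "222", "323"},
      {"112", "122", "221", "411"},
      {"111", "121"},
      {"111", "122", "211", "221", "413"}]

level1 = generalize(db, 1)   # e.g. first transaction becomes {'1', '2'}
level2 = generalize(db, 2)   # e.g. first transaction becomes {'11', '12', '21', '22'}
# Illustrative thresholds only:
# frequent_l1 = apriori(level1, min_support_count=3)   # mine high-level rules first
# frequent_l2 = apriori(level2, min_support_count=2)   # then deepen with reduced support
```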
Multi-level Association: Uniform
Support vs. Reduced Support
Uniform Support: the same minimum support for all
levels
+ One minimum support threshold. No need to examine
itemsets containing any item whose ancestors do not have
minimum support.
– Lower-level items do not occur as frequently. If the support threshold is
set too high ⇒ miss low-level associations
set too low ⇒ generate too many high-level associations
Reduced Support: reduced minimum support at
lower levels
There are 4 search strategies:
Level-by-level independent
Level-cross filtering by k-itemset
Level-cross filtering by single item
Controlled level-cross filtering by single item
Uniform Support
Multi-level mining with uniform support
Level 1 (min_sup = 5%):  Milk [support = 10%]
Level 2 (min_sup = 5%):  2% Milk [support = 6%]   Skim Milk [support = 4%]
Reduced Support
Multi-level mining with reduced support
Level 1 (min_sup = 5%):  Milk [support = 10%]
Level 2 (min_sup = 3%):  2% Milk [support = 6%]   Skim Milk [support = 4%]
Multi-level Association:
Redundancy Filtering
Some rules may be redundant due to
“ancestor” relationships between items.
Example
milk ⇒ wheat bread [support = 8%, confidence = 70%]
2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
We say the first rule is an ancestor of the
second rule.
A rule is redundant if its support is close to
the "expected" value, based on the rule's ancestor.
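A small worked check of the redundancy test, under the illustrative assumption (not stated on the slide) that 2% milk accounts for roughly a quarter of all milk sold:

```python
# Expected support of the descendant rule, derived from its ancestor:
ancestor_support = 0.08        # milk => wheat bread [8%]
share_of_2pct_milk = 0.25      # assumed share of "2% milk" within "milk"

expected_support = ancestor_support * share_of_2pct_milk   # 0.02 -> 2%
actual_support = 0.02                                       # from the slide

# Actual support (2%) is close to the expected 2%, and the confidences
# (72% vs. 70%) are also close, so the second rule adds little information.
print(expected_support, actual_support)
```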
Multi-Level Mining:
Progressive Deepening
A top-down, progressive deepening
approach:
First mine high-level frequent items:
milk (15%), bread (10%)
Then mine their lower-level "weaker" frequent itemsets:
2% milk (5%), wheat bread (4%)
Different min_support thresholds across
multiple levels lead to different algorithms:
If adopting the same min_support across multiple levels,
then toss t if any of t's ancestors is infrequent.
If adopting reduced min_support at lower levels,
then examine only those descendants whose ancestors are frequent.
Progressive Refinement of
Data Mining Quality
Why progressive refinement?
Mining operator can be expensive or cheap, fine
or rough
Trade speed with quality: step-by-step
refinement.
Superset coverage property:
Preserve all the positive answers: allow a
false positive but not a false negative.
Two- or multi-step mining:
First apply a rough/cheap operator (superset coverage),
then apply an expensive/accurate operator on the reduced candidate set.
Mining Association Rules in
Large Databases
Association rule mining
Mining single-dimensional Boolean association
rules from transactional databases
Mining multilevel association rules from
transactional databases
Mining multidimensional association rules from
transactional databases and data warehouse
From association mining to correlation analysis
Constraint-based association mining
Summary
Multi-Dimensional
Association: Concepts
Single-dimensional rules:
buys(X, "milk") ⇒ buys(X, "bread")
Multi-dimensional rules: ≥ 2 dimensions or predicates
Inter-dimension association rules (no repeated predicates):
age(X, "19-25") ^ occupation(X, "student") ⇒ buys(X, "coke")
Hybrid-dimension association rules (repeated predicates):
age(X, "19-25") ^ buys(X, "popcorn") ⇒ buys(X, "coke")
Categorical Attributes
finite number of possible values, no ordering among values
Quantitative Attributes
numeric, implicit ordering among values
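Quantitative attributes are commonly handled by discretizing them into ranges so that each (attribute, value) pair becomes a Boolean item; the sketch below illustrates this, with bin boundaries and attribute names chosen by me for illustration only.

```python
def encode_record(record):
    """Turn one relational record into a set of (attribute, value) items."""
    items = set()
    # Categorical attribute: use the value directly.
    items.add(("occupation", record["occupation"]))
    # Quantitative attribute: map to a predefined range (static discretization).
    age = record["age"]
    if age < 19:
        age_bin = "<19"
    elif age <= 25:
        age_bin = "19-25"
    else:
        age_bin = ">25"
    items.add(("age", age_bin))
    items.add(("buys", record["buys"]))
    return items

# e.g. encode_record({"age": 22, "occupation": "student", "buys": "coke"})
# -> {("age", "19-25"), ("occupation", "student"), ("buys", "coke")}
```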
Mining Association Rules in
Large Databases
Association rule mining
Mining single-dimensional Boolean association
rules from transactional databases
Mining multilevel association rules from
transactional databases
Mining multidimensional association rules from
transactional databases and data warehouse
From association mining to correlation analysis
Constraint-based association mining
Summary
Interestingness
Measurements
Objective measures
Two popular measurements:
support; and
confidence
Subjective measures (Silberschatz &
Tuzhilin, KDD95)
A rule (pattern) is interesting if
it is unexpected (surprising to the user);
and/or
actionable (the user can do something
with it)
Criticism of Support and
Confidence
Example 1: (Aggarwal & Yu, PODS98)
Among 5000 students
3000 play basketball
3750 eat cereal
2000 both play basketball and eat cereal
play basketball ⇒ eat cereal [40%, 66.7%] is misleading,
because the overall percentage of students eating cereal
is 75%, which is higher than 66.7%.
play basketball ⇒ not eat cereal [20%, 33.3%] is far more
accurate, although with lower support and confidence.
                basketball   not basketball   sum(row)
cereal              2000             1750        3750
not cereal          1000              250        1250
sum(col.)           3000             2000        5000
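The negative correlation behind this criticism can be made explicit with the correlation (lift) measure introduced on the following slides; the arithmetic below follows directly from the contingency table above:

\[
\frac{P(\mathrm{basketball} \wedge \mathrm{cereal})}{P(\mathrm{basketball})\,P(\mathrm{cereal})}
= \frac{2000/5000}{(3000/5000)\,(3750/5000)}
= \frac{0.40}{0.60 \times 0.75} \approx 0.89 < 1
\]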
Criticism of Support and
Confidence (Cont.)
Example 2:

    X  1 1 1 1 0 0 0 0
    Y  1 1 0 0 0 0 0 0
    Z  0 1 1 1 1 1 1 1

X and Y are positively correlated; X and Z are negatively related,
yet the support and confidence of X ⇒ Z dominate:

    Rule    Support   Confidence
    X ⇒ Y   25%       50%
    X ⇒ Z   37.50%    75%

We need a measure of dependent or correlated events A, B:

    corr(A, B) = P(A ^ B) / (P(A) · P(B))
Other Interestingness Measures:
Interest
Interest (correlation, lift):

    Interest(A, B) = P(A ^ B) / (P(A) · P(B))

takes both P(A) and P(B) into consideration.
P(A ^ B) = P(A) · P(B) if A and B are independent events.
A and B are negatively correlated if the value is less than 1;
otherwise A and B are positively correlated.

    X  1 1 1 1 0 0 0 0
    Y  1 1 0 0 0 0 0 0
    Z  0 1 1 1 1 1 1 1

    Itemset   Support   Interest
    X,Y       25%       2
    X,Z       37.50%    0.9
    Y,Z       12.50%    0.57
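A quick check of the table, treating each row above as eight Boolean observations (function and variable names are mine):

```python
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

def interest(a, b):
    """P(A ^ B) / (P(A) * P(B)) over the eight observations."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x and y for x, y in zip(a, b)) / n
    return p_ab / (p_a * p_b)

print(round(interest(X, Y), 2))   # 2.0
print(round(interest(X, Z), 2))   # ~0.86 (rounded to 0.9 on the slide)
print(round(interest(Y, Z), 2))   # ~0.57
```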
Mining Association Rules in
Large Databases
Association rule mining
Mining single-dimensional Boolean association
rules from transactional databases
Mining multilevel association rules from
transactional databases
Mining multidimensional association rules from
transactional databases and data warehouse
From association mining to correlation analysis
Constraint-based association mining
Summary
Constraint-Based Mining
Interactive, exploratory mining of gigabytes of data?
Could it be real? Making good use of constraints!
What kinds of constraints can be used in mining?
Knowledge type constraint: classification, association,
etc.
Data constraint: SQL-like queries
Find product pairs sold together in Vancouver in
Dec.’98.
Dimension/level constraints:
in relevance to region, price, brand, customer
category.
Rule constraints
small sales (price < $10) triggers big sales (sum >
$200).
Interestingness constraints:
strong rules (min_support ≥ 3%, min_confidence ≥ 60%).
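As an illustration of how a rule constraint like the one above might be checked, here is a minimal post-filter sketch; the price table and the rule representation are assumptions of mine, not from the slides.

```python
# Hypothetical price table (dollars), for illustration only.
price = {"pen": 2, "notebook": 5, "printer": 150, "scanner": 80}

def satisfies_rule_constraint(lhs, rhs):
    """Check: every LHS item is a small sale (price < $10)
    and the RHS items together form a big sale (sum > $200)."""
    return (all(price[i] < 10 for i in lhs)
            and sum(price[i] for i in rhs) > 200)

# e.g. satisfies_rule_constraint({"pen", "notebook"}, {"printer", "scanner"}) -> True
```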
Mining Association Rules in
Large Databases
Association rule mining
Mining single-dimensional Boolean association
rules from transactional databases
Mining multilevel association rules from
transactional databases
Mining multidimensional association rules from
transactional databases and data warehouse
From association mining to correlation analysis
Constraint-based association mining
Summary