Association Rule Mining with R
Wirapong Chansanam, Ph.D.
iSchools, Khon Kaen University,
Thailand
Outline
• Basics of Association Rules
• Algorithms: Apriori, ECLAT and FP-growth
• Interestingness Measures
• Applications
• Association Rule Mining with R
• Removing Redundancy
• Interpreting Rules
• Visualizing Association Rules
• Further Readings and Online Resources
Association Rules
• To discover association rules showing item sets that occur together
frequently [Agrawal et al., 1993].
• Widely used to analyze retail basket or transaction data.
• An association rule is of the form A => B, where A and B are items or
attribute-value pairs.
• The rule means that database tuples containing the items on the
left-hand side of the rule are also likely to contain the items on the
right-hand side.
• Examples of association rules:
– bread => butter
– computer =>software
– age in [20,29] & income in [60K,100K] => buying up-to-date mobile handsets
Association Rules
Association rules are rules presenting association or correlation
between item sets. For a rule A => B:
• support(A => B) = P(A & B)
• confidence(A => B) = P(A & B) / P(A)
• lift(A => B) = confidence(A => B) / P(B)
where P(A) is the percentage (or probability) of cases containing
A.
An Example
• Assume there are 100 students.
• 10 of them know data mining techniques,
8 know the R language and 6 know both.
• knows R => knows data mining
• support = P(R & data mining) = 6/100 = 0.06
• confidence = support / P(R) = 0.06/0.08 = 0.75
• lift = confidence / P(data mining) = 0.75/0.10 = 7.5
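The three measures can be checked numerically. Below is a minimal base-R sketch using the counts from this example (the variable names are ours):

```r
# Counts from the student example
n <- 100       # total students
n_dm <- 10     # know data mining
n_r <- 8       # know the R language
n_both <- 6    # know both

# Measures for the rule: knows R => knows data mining
support <- n_both / n                 # P(R & data mining) = 0.06
confidence <- support / (n_r / n)     # support / P(R)     = 0.75
lift <- confidence / (n_dm / n)       # confidence / P(data mining) = 7.5

print(c(support = support, confidence = confidence, lift = lift))
```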
Association Rule Mining
• Association Rule Mining is normally composed of two
steps:
– Finding all frequent item sets whose supports are no less than a
minimum support threshold;
– From above frequent item sets, generating association rules
with confidence above a minimum confidence threshold.
• The second step is straightforward, but the first one,
frequent item set generation, is computationally intensive.
• The number of possible item sets is 2^n − 1, where n is the
number of unique items.
• Algorithms: Apriori, ECLAT, FP-Growth
Downward-Closure Property
• Downward-closure property of support, a.k.a. anti-
monotonicity
• If an itemset is frequent, then all its subsets are also frequent:
if {A,B} is frequent, then both {A} and {B} are frequent.
• If an itemset is infrequent, then all its supersets are also
infrequent:
if {A} is infrequent, then {A,B}, {A,C} and {A,B,C} are all
infrequent.
• Useful for pruning candidate item sets
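The property can be verified directly on data. A small base-R sketch (the toy transactions are made up for illustration):

```r
# Toy transactions: each is a set of items
transactions <- list(c("A","B","C"), c("A","B"), c("B","C"), c("B"))

# Support count of an itemset = number of transactions containing all its items
supp_count <- function(itemset) {
  sum(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

min_count <- 3
supp_count("A")             # 2 -> {A} is infrequent at min_count = 3
supp_count(c("A","B"))      # 2 -> never exceeds supp_count("A")
supp_count(c("A","B","C"))  # 1 -> supersets of {A} can be pruned unseen
```

Because a transaction containing {A,B} necessarily contains {A}, support can only shrink as an itemset grows, which is exactly why Apriori can discard every superset of an infrequent itemset without counting it.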
Itemset Lattice
Apriori
• Apriori [Agrawal and Srikant, 1994]: a classic
algorithm for association rule mining
• A level-wise, breadth-first algorithm
• Counts transactions to find frequent itemsets
• Generates candidate itemsets by exploiting
downward closure property of support
Apriori Process
1. Find all frequent 1-itemsets
2. Join step: generate candidate k-itemsets by
joining frequent (k−1)-itemsets with themselves
3. Prune step: prune candidate k-itemsets using
downward-closure property
4. Scan the dataset to count frequency of candidate
k-itemsets and select frequent k-itemsets
5. Repeat above process, until no more frequent
itemsets can be found.
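The level-wise process above can be sketched in a few lines of base R. This is a toy illustration, not the arules implementation; in this sketch the prune step is folded into the support counting, and the transactions are made up:

```r
apriori_frequent <- function(transactions, min_count) {
  items <- sort(unique(unlist(transactions)))
  supp_count <- function(itemset)
    sum(vapply(transactions, function(t) all(itemset %in% t), logical(1)))

  # Step 1: frequent 1-itemsets
  level <- Filter(function(s) supp_count(s) >= min_count,
                  lapply(items, identity))
  frequent <- level

  while (length(level) > 1) {
    # Step 2 (join): unite pairs of frequent k-itemsets differing in one item
    candidates <- list()
    for (i in seq_along(level)) for (j in seq_along(level)) {
      if (i < j) {
        u <- sort(union(level[[i]], level[[j]]))
        if (length(u) == length(level[[i]]) + 1)
          candidates <- c(candidates, list(u))
      }
    }
    candidates <- unique(candidates)
    # Steps 3-4 (prune and count): keep candidates meeting min_count
    level <- Filter(function(s) supp_count(s) >= min_count, candidates)
    frequent <- c(frequent, level)   # Step 5: repeat until no more found
  }
  frequent
}

trans <- list(c("A","B","C"), c("A","B"), c("B","C"), c("A","C"))
apriori_frequent(trans, min_count = 2)
```

On these four transactions the sketch returns {A}, {B}, {C}, {A,B}, {A,C} and {B,C}; {A,B,C} appears in only one transaction and is rejected at the counting step.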
FP-growth
• FP-growth: frequent-pattern growth, which mines frequent
itemsets without candidate generation [Han et al., 2004]
• Compresses the input database by creating an FP-tree
instance to represent frequent items.
• Divides the compressed database into a set of conditional
databases, each one associated with one frequent pattern.
• Each such database is mined separately.
• It reduces the search cost by looking for short patterns
recursively and then concatenating them into longer frequent
patterns [1].
[1] [Link]
FP-tree
• The frequent-pattern tree (FP-tree) is a compact structure that stores
quantitative information about frequent patterns in a dataset. It has two
components:
– A root labeled as “null" with a set of item-prefix subtrees as children
– A frequent-item header table
• Each node has three attributes:
– Item name
– Count: number of transactions represented by the path from root to the node
– Node link: links to the next node having the same item name
• Each entry in the frequent-item header table also has three attributes:
– Item name
– Head of node link: points to the first node in the FP-tree having the same item name
– Count: frequency of the item
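The tree construction can be sketched compactly in base R. This illustration stores nodes in a data frame with parent indices (0 denotes the root) and omits the header table's node links; the toy transactions are made up:

```r
build_fp_tree <- function(transactions, min_count) {
  counts <- table(unlist(transactions))
  freq <- names(counts)[counts >= min_count]
  freq <- freq[order(-counts[freq])]   # global order: descending frequency
  nodes <- data.frame(item = character(0), count = integer(0),
                      parent = integer(0), stringsAsFactors = FALSE)
  for (t in transactions) {
    items <- freq[freq %in% t]         # keep frequent items, in global order
    parent <- 0L
    for (it in items) {
      hit <- which(nodes$item == it & nodes$parent == parent)
      if (length(hit) == 1) {          # shared prefix: increment the count
        nodes$count[hit] <- nodes$count[hit] + 1L
        parent <- hit
      } else {                         # new branch: append a node
        nodes <- rbind(nodes, data.frame(item = it, count = 1L,
                                         parent = parent,
                                         stringsAsFactors = FALSE))
        parent <- nrow(nodes)
      }
    }
  }
  nodes
}

trans <- list(c("a","b"), c("b","c"), c("a","b","c"), c("b"))
build_fp_tree(trans, min_count = 2)
```

Because transactions sharing a frequent prefix share a path, the four toy transactions compress to just four tree nodes, with the root's "b" child carrying count 4.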
FP-tree
ECLAT
• ECLAT: equivalence class transformation [Zaki et al., 1997]
• A depth-first search algorithm using set intersection
• Idea: use tidset intersection to compute the support of a candidate
itemset, avoiding the generation of subsets that do not exist in
the prefix tree.
• t(AB) = t(A) ∩ t(B)
• support(AB) = | t(AB)|
• ECLAT intersects the tidsets only if the frequent itemsets share a
common prefix.
• It traverses the prefix search tree in a DFS-like manner, processing a
group of itemsets that have the same prefix, also called a prefix
equivalence class.
ECLAT
• It works recursively.
• The initial call uses all single items with their tidsets.
• In each recursive call, it verifies each itemset-tidset
pair (X, t(X)) with all the other pairs to generate new
candidates. If a new candidate is frequent, it is added to the
set of frequent itemsets.
• Recursively, it finds all frequent itemsets in the X
branch.
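The tidset intersection at the heart of ECLAT is a one-liner in base R. A small sketch with made-up tidsets:

```r
# Tidsets: for each item, the ids of the transactions containing it
tidsets <- list(A = c(1, 2, 4), B = c(1, 2, 3), C = c(1, 3, 4))

# t(AB) = t(A) ∩ t(B); support(AB) = |t(AB)|
t_AB <- intersect(tidsets$A, tidsets$B)
support_AB <- length(t_AB)            # 2 (transactions 1 and 2)

# Extending the prefix: t(ABC) = t(AB) ∩ t(C)
t_ABC <- intersect(t_AB, tidsets$C)
support_ABC <- length(t_ABC)          # 1 (transaction 1)
```

No transaction scan is needed beyond the initial tidset construction: the support of any candidate in a prefix equivalence class falls out of intersecting the tidsets already at hand.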
ECLAT
Interestingness Measures
• Which rules or patterns are the most interesting ones? One way
is to rank the discovered rules or patterns with interestingness
measures.
• The measures of rule interestingness fall into two categories,
subjective and objective [Freitas, 1998, Silberschatz and Tuzhilin,
1996].
• Objective measures, such as lift, odds ratio and conviction, are
often data-driven and give the interestingness in terms of
statistics or information theory.
• Subjective (user-driven) measures, e.g., unexpectedness and
actionability, focus on finding interesting patterns by matching
against a given set of user beliefs.
Objective Interestingness Measures
• Support, confidence and lift are the most widely used objective measures to
select interesting rules.
• Many other objective measures were introduced by Tan et al. [Tan et al., 2002], such
as the φ-coefficient, odds ratio, kappa, mutual information, J-measure, Gini index,
Laplace, conviction, interest and cosine.
• Their study shows that different measures have different intrinsic properties
and there is no measure that is better than others in all application domains.
• In addition, any-confidence, all-confidence and bond were designed by
Omiecinski [Omiecinski, 2003].
• Utility is used by Chan et al. [Chan et al., 2003] to find top-k objective-directed
rules.
• Unexpected Confidence Interestingness and Isolated Interestingness were
designed by Dong and Li [Dong and Li, 1998] by considering a rule's unexpectedness
in terms of other association rules in its neighborhood.
Subjective Interestingness Measures
• Unexpectedness and actionability are two main categories of subjective
measures [Silberschatz and Tuzhilin, 1995].
• A pattern is unexpected if it is new to a user or contradicts the user's
experience or domain knowledge.
• A pattern is actionable if the user can do something with it to his/her
advantage [Silberschatz and Tuzhilin, 1995, Liu et al., 2003].
• Liu and Hsu [Liu and Hsu, 1996] proposed to rank learned rules by matching
against expected patterns provided by the user.
• Ras and Wieczorkowska [Ras and Wieczorkowska, 2000] designed action-
rules which show “what actions should be taken to improve the profitability
of customers". The attributes are grouped into “hard attributes" which
cannot be changed and “soft attributes" which are possible to change with
reasonable costs. The status of customers can be moved from one to another
by changing the values of soft ones.
Interestingness Measures - I
Interestingness Measures - II
Applications - I
• Market basket analysis
– Identifying associations between items in shopping baskets, i.e., which items are
frequently purchased together
– Can be used by retailers to understand customer shopping habits, do selective
marketing and plan shelf space
• Churn analysis and selective marketing
– Discovering demographic characteristics and behaviors of customers who are
likely/unlikely to switch to other telcos
– Identifying customer groups who are likely to purchase a new service or product
• Credit card risk analysis
– Finding characteristics of customers who are likely to default on credit card or
mortgage
– Can be used by banks to reduce risks when assessing new credit card or
mortgage applications
Applications - II
• Stock market analysis
– Finding relationships between individual stocks, or
between stocks and economic factors
– Can help stock traders select interesting stocks and
improve trading strategies
• Medical diagnosis
– Identifying relationships between symptoms, test
results and illness
– Can be used for assisting doctors on illness diagnosis or
even on treatment
Association Rule Mining Algorithms in R
• Apriori [Agrawal and Srikant, 1994]
– a level-wise, breadth-first algorithm which counts
transactions to find frequent itemsets and then derive
association rules from them
– apriori() in package arules
• ECLAT [Zaki et al., 1997]
– finds frequent itemsets with equivalence classes,
depth-first search and set intersection instead of
counting
– eclat() in package arules
The Titanic Dataset
• The Titanic dataset in the datasets package is a 4-
dimensional table with summarized information on the
fate of passengers on the Titanic according to social
class, sex, age and survival.
• To make it suitable for association rule mining, we
reconstruct the raw data as [Link], where each
row represents a person.
• The reconstructed raw data can also be downloaded at
[Link]
R programming
Function apriori()
• Mine frequent itemsets, association rules or
association hyperedges using the Apriori
algorithm. The Apriori algorithm employs
level-wise search for frequent itemsets.
• Default settings:
– minimum support: supp=0.1
– minimum confidence: conf=0.8
– maximum length of rules: maxlen=10
>library(arules)
>[Link] <- apriori([Link])
>inspect([Link])
# rules with rhs containing "Survived" only
rules <- apriori([Link],
control = list(verbose=F),
parameter = list(minlen=2, supp=0.005, conf=0.8),
appearance = list(rhs=c("Survived=No",
"Survived=Yes"),
default="lhs"))
## keep three decimal places
quality(rules) <- round(quality(rules), digits=3)
## order rules by lift
[Link] <- sort(rules, by="lift")
>inspect([Link])
Redundant Rules
• There are often too many association rules
discovered from a dataset.
• It is necessary to remove redundant rules
before a user is able to study the rules and
identify interesting ones from them.
Redundant Rules
• Rule #2 provides no extra knowledge in addition to rule #1, since rule #1 tells
us that all 2nd-class children survived.
• When a rule (such as #2) is a super rule of another rule (such as #1) and has
the same or a lower lift, the super rule (#2) is considered redundant.
• Other redundant rules in the above result are rules #4, #7 and #8, compared
respectively with #3, #6 and #5.
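The redundancy criterion can be stated directly in base R. A sketch with rules held as plain lists (the rule contents and lift values below are illustrative, not taken from the Titanic output):

```r
# A rule is redundant if a more general rule exists (its lhs is a proper
# subset, same rhs) with the same or a higher lift
is_redundant <- function(rule, rules) {
  any(vapply(rules, function(other) {
    identical(other$rhs, rule$rhs) &&
      all(other$lhs %in% rule$lhs) &&
      length(other$lhs) < length(rule$lhs) &&
      rule$lift <= other$lift
  }, logical(1)))
}

r1 <- list(lhs = c("Class=2nd", "Age=Child"),
           rhs = "Survived=Yes", lift = 3.1)
r2 <- list(lhs = c("Class=2nd", "Age=Child", "Sex=Female"),
           rhs = "Survived=Yes", lift = 3.1)

is_redundant(r2, list(r1, r2))   # TRUE: super rule of r1 with the same lift
is_redundant(r1, list(r1, r2))   # FALSE: no more general rule exists
```

The arules code on the next slide applies the same idea with a subset matrix over the whole sorted rule set instead of pairwise list comparisons.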
Remove Redundant Rules
## find redundant rules
>[Link] <- [Link]([Link], [Link])
>[Link][[Link]([Link], diag = T)] <- NA
>redundant <- colSums([Link], [Link] = T) >= 1
## which rules are redundant
>which(redundant)
## remove redundant rules
>[Link] <- [Link][!redundant]
Interpreting Rules
>inspect([Link][1])
• Did children have a higher survival rate than
adults?
• Did children of the 2nd class have a higher
survival rate than other children?
• The rule states only that all children of class 2
survived, but provides no information at all about
the survival rates of other classes.
Rules about Children
>#Rules about Children
>rules <- apriori([Link], control = list(verbose=F),
parameter = list(minlen=3, supp=0.002, conf=0.2),
appearance = list(default="none", rhs=c("Survived=Yes"),
lhs=c("Class=1st", "Class=2nd", "Class=3rd",
"Age=Child", "Age=Adult")))
>[Link] <- sort(rules, by="confidence")
>inspect([Link])
Visualizing Association Rules
>library(arulesViz)
>plot([Link])
Visualizing Association Rules
>plot([Link], method = "grouped")
Visualizing Association Rules
>plot([Link], method = "graph")
Visualizing Association Rules
>plot([Link], method = "graph", control = list(type = "items"))
Visualizing Association Rules
>plot([Link], method = "paracoord", control = list(reorder = TRUE))
Further Readings and Online Resources
• More than 20 interestingness measures, such as chi-square, conviction, gini and
leverage
• Tan, P.-N., Kumar, V., and Srivastava, J. (2002). Selecting the right interestingness
measure for association patterns. In Proc. of KDD '02, pages 32-41, New York, NY,
USA. ACM Press.
• More reviews on interestingness measures:
[Silberschatz and Tuzhilin, 1996], [Tan et al., 2002],
[Omiecinski, 2003] and [Wu et al., 2007]
• Post mining of association rules, such as selecting interesting association rules,
visualization of association rules and using association rules for classification
[Zhao et al., 2009]
Yanchang Zhao, et al. (Eds.). “Post-Mining of Association Rules: Techniques for
Effective Knowledge Extraction", ISBN 978-1-60566-404-0, May 2009. Information
Science Reference.
• Package arulesSequences: mining sequential patterns
[Link]
Further Readings and Online Resources
• Chapter 9 - Association Rules, in book titled R and Data Mining:
Examples and Case Studies [Zhao, 2012]
[Link]
• RDataMining Reference Card
[Link]
• Free online courses and documents
[Link]
• RDataMining Group on LinkedIn (22,000+ members)
[Link]
• Twitter (2,700+ followers)
@RDataMining
• Association Rule Mining with R
[Link]