Data Science for Business
Lecture 6 – Other Data Science Tasks
and Techniques
Assoc. Prof. Pham Quoc Trung
[email protected]
Data Mining Tasks and Machine Learning
Unsupervised Learning: Association Analysis
Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2017), Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition, Pearson
Transaction Database
Transaction ID Items bought
T01 A, B, D
T02 A, C, D
T03 B, C, D, E
T04 A, B, D
T05 A, B, C, E
T06 A, C
T07 B, C, D
T08 B, D
T09 A, C, E
T10 B, D
Association Analysis
Association Analysis:
Mining Frequent Patterns,
Association and Correlations
• Association Analysis
• Mining Frequent Patterns
• Association and Correlations
• Apriori Algorithm
Source: Han & Kamber (2006)
Market Basket Analysis
Source: Han & Kamber (2006)
Association Rule Mining
• Apriori Algorithm
Raw Transaction Data (Transaction No: SKUs / Item Nos):
1: 1, 2, 3, 4
2: 2, 3, 4
3: 2, 3
4: 1, 2, 4
5: 1, 2, 3, 4
6: 2, 4

One-item Itemsets (Itemset: Support):
1: 3
2: 6
3: 4
4: 5

Two-item Itemsets (Itemset: Support):
1, 2: 3
1, 3: 2
1, 4: 3
2, 3: 4
2, 4: 5
3, 4: 3

Three-item Itemsets (Itemset: Support):
1, 2, 4: 3
2, 3, 4: 3
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
Association Rule Mining
• A very popular DM method in business
• Finds interesting relationships (affinities) between
variables (items or events)
• Part of machine learning family
• Employs unsupervised learning
• There is no output variable
• Also known as market basket analysis
• Often used as an example to describe DM to
ordinary people, such as the famous “relationship
between diapers and beers!”
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
Association Rule Mining
• Input: the simple point-of-sale transaction data
• Output: Most frequent affinities among items
• Example: according to the transaction data…
“Customers who bought a laptop computer and virus
protection software also bought an extended service plan 70
percent of the time.”
• How do you use such a pattern/knowledge?
• Put the items next to each other for ease of finding
• Promote the items as a package (do not put one on sale if the other(s)
are on sale)
• Place items far apart from each other so that the customer has to
walk the aisles to search for them, and in doing so potentially sees
and buys other items
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
Association Rule Mining
• Representative applications of association rule mining include
• In business: cross-marketing, cross-selling, store design,
catalog design, e-commerce site design, optimization of
online advertising, product pricing, and sales/promotion
configuration
• In medicine: relationships between symptoms and
illnesses; diagnosis and patient characteristics and
treatments (to be used in medical DSS); and genes and
their functions (to be used in genomics projects)…
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
Association Rule Mining
• Are all association rules interesting and useful?
A Generic Rule: X → Y [S%, C%]
X, Y: products and/or services
X: Left-hand-side (LHS)
Y: Right-hand-side (RHS)
S: Support: how often X and Y go together
C: Confidence: how often Y goes together with X
Example: {Laptop Computer, Antivirus Software} → {Extended Service Plan} [30%, 70%]
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
Association Rule Mining
• Algorithms are available for generating association rules
• Apriori
• Eclat
• FP-Growth
• + Derivatives and hybrids of the three
• The algorithms help identify the frequent itemsets, which are then
converted to association rules
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
Association Rule Mining
• Apriori Algorithm
• Finds subsets that are common to at least a minimum number of the itemsets
• Uses a bottom-up approach
• Frequent subsets are extended one item at a time (the size of frequent subsets increases from one-item subsets to two-item subsets, then three-item subsets, and so on), and
• Groups of candidates at each level are tested against the data for minimum support
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
Basic Concepts: Frequent Patterns and Association
Rules
Transaction-id Items bought
10 A, B, D
20 A, C, D
30 A, D, E
40 B, E, F
50 B, C, D, E, F

• Itemset X = {x1, …, xk}
• Find all the rules X → Y with minimum support and confidence
• support, s: probability that a transaction contains X ∪ Y
• confidence, c: conditional probability that a transaction having X also contains Y

(Figure: Venn diagram of customers who buy beer, buy diapers, or buy both)

Let sup_min = 50%, conf_min = 50%
Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A → D (support = 3/5 = 60%, confidence = 3/3 = 100%)
D → A (support = 3/5 = 60%, confidence = 3/4 = 75%)
Source: Han & Kamber (2006)
Market basket analysis
• Example
• Which groups or sets of items are customers likely to purchase on a given trip to
the store?
• Association Rule
• Computer → antivirus_software
[support = 2%; confidence = 60%]
• A support of 2% means that 2% of all the transactions under analysis show that computer
and antivirus software are purchased together.
• A confidence of 60% means that 60% of the customers who purchased a computer also
bought the software.
Source: Han & Kamber (2006)
Association rules
• Association rules are considered interesting if they satisfy both
• a minimum support threshold and
• a minimum confidence threshold.
Source: Han & Kamber (2006)
Frequent Itemsets, Closed Itemsets, and Association Rules

Support (A → B) = P(A ∪ B)
Confidence (A → B) = P(B|A)
Source: Han & Kamber (2006)
Support (A → B) = P(A ∪ B)
Confidence (A → B) = P(B|A)
• The notation P(A ∪ B) indicates the probability that a transaction contains the union of set A and set B
• (i.e., it contains every item in A and in B).
• This should not be confused with P(A or B), which indicates the
probability that a transaction contains either A or B.
Source: Han & Kamber (2006)
Does diaper purchase predict beer purchase?
• Contingency tables

DEPENDENT (yes):
              Beer
              Yes   No
No diapers     6    94   (100)
Diapers       40    60   (100)

INDEPENDENT (no predictability):
              Beer
              Yes   No
No diapers    23    77
Diapers       23    77
Source: Dickey (2012) http://www4.stat.ncsu.edu/~dickey/SAScode/Encore_2012.ppt
Support (A → B) = P(A ∪ B)
Confidence (A → B) = P(B|A)
Conf (A → B) = Supp (A ∪ B) / Supp (A)
Lift (A → B) = Supp (A ∪ B) / (Supp (A) × Supp (B))
Lift (Correlation):
Lift (A → B) = Confidence (A → B) / Support (B)
Source: Dickey (2012) http://www4.stat.ncsu.edu/~dickey/SAScode/Encore_2012.ppt
Lift
Lift = Confidence / Expected Confidence if Independent
Saving \ Checking:   No      Yes     Total
No                   500     3,500   4,000
Yes                  1,000   5,000   6,000
Total                1,500   8,500   10,000

SVG → CHKG: expected confidence = 8,500/10,000 = 85% if independent
Observed confidence = 5,000/6,000 ≈ 83%
Lift = 83/85 < 1
Savings account holders are actually LESS likely than others to have a checking account!
Source: Dickey (2012) http://www4.stat.ncsu.edu/~dickey/SAScode/Encore_2012.ppt
Support & Confidence
Five transactions: {A, B, C}, {A, C, D}, {B, C, D}, {A, D, E}, {B, C, E}

Rule         Support   Confidence
A → D        2/5       2/3
C → A        2/5       2/4
A → C        2/5       2/3
B & C → D    1/5       1/3
Source: SAS Enterprise Miner Course Notes, 2014, SAS
Support & Confidence & Lift
Saving Account \ Checking Account:   No      Yes     Total
No                                   500     3,500   4,000
Yes                                  1,000   5,000   6,000
Total                                                10,000

Support (SVG → CK) = 50% = 5,000/10,000
Confidence (SVG → CK) = 83% = 5,000/6,000
Expected Confidence (SVG → CK) = 85% = 8,500/10,000
Lift (SVG → CK) = Confidence / Expected Confidence = 0.83/0.85 < 1
Source: SAS Enterprise Miner Course Notes, 2014, SAS
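A minimal Python sketch of this calculation (an illustration written for these notes, not code from the SAS course), recomputing support, confidence, expected confidence, and lift from the counts in the table above:

```python
# Counts taken from the 2x2 savings/checking table above.
n_total = 10_000
n_saving = 6_000      # savings account holders (rule LHS)
n_checking = 8_500    # checking account holders (rule RHS)
n_both = 5_000        # hold both accounts

support = n_both / n_total                   # 5,000/10,000 = 50%
confidence = n_both / n_saving               # 5,000/6,000  ≈ 83%
expected_confidence = n_checking / n_total   # 8,500/10,000 = 85%
lift = confidence / expected_confidence      # ≈ 0.98 < 1

print(f"Support: {support:.0%}, Confidence: {confidence:.1%}")
print(f"Expected confidence: {expected_confidence:.0%}, Lift: {lift:.2f}")
```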
Support (A→B)
Confidence (A→B)
Expected Confidence (A→B)
Lift (A→B)
Support (A → B) = P(A ∪ B) = Count(A & B) / Count(Total)
Confidence (A → B) = P(B|A) = Supp (A ∪ B) / Supp (A) = Count(A & B) / Count(A)
Expected Confidence (A → B) = Support (B) = Count(B) / Count(Total)
Lift (A → B) = Confidence (A → B) / Expected Confidence (A → B)
Lift (Correlation):
Lift (A → B) = Confidence (A → B) / Support (B) = Supp (A ∪ B) / (Supp (A) × Supp (B))
Lift (A→B)
• Lift (A → B)
= Confidence (A → B) / Expected Confidence (A → B)
= Confidence (A → B) / Support (B)
= (Supp (A & B) / Supp (A)) / Supp (B)
= Supp (A & B) / (Supp (A) × Supp (B))
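To make these definitions concrete, here is a small, self-contained Python sketch (an illustration for these notes, not from the source slides) that computes support, confidence, and lift directly from a transaction list, demonstrated on the five baskets from the Support & Confidence slide above:

```python
# Sketch: support, confidence, and lift computed directly from a
# transaction list, following the count-based definitions above.
def support(itemset, transactions):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    return support(lhs | rhs, transactions) / support(lhs, transactions)

def lift(lhs, rhs, transactions):
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)

# The five transactions from the Support & Confidence slide.
transactions = [{'A','B','C'}, {'A','C','D'}, {'B','C','D'},
                {'A','D','E'}, {'B','C','E'}]

print(support({'A','D'}, transactions))        # 2/5 = 0.4
print(confidence({'A'}, {'D'}, transactions))  # 2/3 ≈ 0.667
print(lift({'A'}, {'D'}, transactions))        # (2/3) / (3/5) ≈ 1.11
```

A lift above 1 here means A and D co-occur more often than expected if they were independent.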
Minimum Support and
Minimum Confidence
• Rules that satisfy both a minimum support threshold (min_sup) and a
minimum confidence threshold (min_conf) are called strong.
• By convention, support and confidence values are written as percentages between 0% and 100% rather than as probabilities between 0 and 1.0.
Source: Han & Kamber (2006)
K-itemset
• itemset
• A set of items is referred to as an itemset.
• K-itemset
• An itemset that contains k items is a k-itemset.
• Example:
• The set {computer, antivirus software} is a 2-itemset.
Source: Han & Kamber (2006)
Absolute Support and
Relative Support
• Absolute Support
• The occurrence frequency of an itemset is the number of transactions that contain the itemset
• also called the frequency, support count, or count of the itemset
• Ex: an itemset contained in 3 transactions has an absolute support (count) of 3
• Relative support
• the fraction of transactions that contain the itemset
• Ex: a count of 3 out of 5 transactions is a relative support of 60%
Source: Han & Kamber (2006)
Frequent Itemset
• If the relative support of an itemset I satisfies a prespecified
minimum support threshold, then I is a frequent itemset.
• i.e., the absolute support of I satisfies the corresponding minimum support
count threshold
• The set of frequent k-itemsets is commonly denoted by Lk
Source: Han & Kamber (2006)
Confidence
• The confidence of rule A → B can be easily derived from the support counts of A and A ∪ B.
• Once the support counts of A, B, and A ∪ B are found, it is straightforward to derive the corresponding association rules A → B and B → A and check whether they are strong.
• Thus the problem of mining association rules can be reduced to that of mining frequent itemsets.
Source: Han & Kamber (2006)
Association rule mining:
Two-step process
1. Find all frequent itemsets
• By definition, each of these itemsets will occur at least as frequently as a
predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets
• By definition, these rules must satisfy minimum support and minimum
confidence.
Source: Han & Kamber (2006)
Efficient and Scalable
Frequent Itemset Mining Methods
• The Apriori Algorithm
• Finding Frequent Itemsets Using Candidate Generation
Source: Han & Kamber (2006)
Apriori Algorithm
• Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant
in 1994 for mining frequent itemsets for Boolean association rules.
• The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties, as we shall see in the following.
Source: Han & Kamber (2006)
Apriori Algorithm
• Apriori employs an iterative approach known as a level-wise
search, where k-itemsets are used to explore (k+1)-itemsets.
• First, the set of frequent 1-itemsets is found by scanning the
database to accumulate the count for each item, and
collecting those items that satisfy minimum support. The
resulting set is denoted L1.
• Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found.
• The finding of each Lk requires one full scan of the database.
Source: Han & Kamber (2006)
Apriori Algorithm
• To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space.
• Apriori property
• All nonempty subsets of a frequent itemset must also be frequent.
Source: Han & Kamber (2006)
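The Apriori property is what makes candidate generation cheap: before counting, any candidate k-itemset with an infrequent (k−1)-subset can be discarded. The following Python sketch of this join-and-prune step is illustrative (not from Han & Kamber); the L2 shown is taken from the worked example that follows:

```python
from itertools import combinations

# Sketch of the Apriori candidate-generation step: join pairs of
# frequent (k-1)-itemsets, then prune every candidate that has an
# infrequent (k-1)-subset (the Apriori property).
def apriori_gen(prev_frequent, k):
    candidates = set()
    for a in prev_frequent:
        for b in prev_frequent:
            union = a | b
            if len(union) == k and all(
                    frozenset(s) in prev_frequent
                    for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates

# L2 from the worked example below ({D, E} is not frequent).
L2 = {frozenset(p) for p in [('A','B'), ('A','C'), ('A','D'), ('A','E'),
                             ('B','C'), ('B','D'), ('B','E'),
                             ('C','D'), ('C','E')]}
for c in sorted(sorted(c) for c in apriori_gen(L2, 3)):
    print(c)
# {A,D,E}, {B,D,E}, {C,D,E} are pruned because {D,E} is not in L2.
```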
Apriori algorithm
(1) Frequent Itemsets
(2) Association Rules
Transaction Database
Transaction ID Items bought
T01 A, B, D
T02 A, C, D
T03 B, C, D, E
T04 A, B, D
T05 A, B, C, E
T06 A, C
T07 B, C, D
T08 B, D
T09 A, C, E
T10 B, D
Table 1 shows a database with 10 transactions.
Let minimum support = 20% and minimum confidence = 80%.
Please use the Apriori algorithm to generate association rules
from frequent itemsets.
Table 1: Transaction Database
Transaction ID Items bought
T01 A, B, D
T02 A, C, D
T03 B, C, D, E
T04 A, B, D
T05 A, B, C, E
T06 A, C
T07 B, C, D
T08 B, D
T09 A, C, E
T10 B, D
Apriori Algorithm Step 1-1: C1 → L1
(Transaction database as in Table 1)

Minimum support = 20% = 2/10, so Min. Support Count = 2

C1 (Itemset: Support Count):
A: 6
B: 7
C: 6
D: 7
E: 3

L1 (candidates meeting Min. Support Count = 2):
A: 6
B: 7
C: 6
D: 7
E: 3
Apriori Algorithm Step 1-2: C2 → L2
(Transaction database as in Table 1; Min. Support Count = 2)

C2 (Itemset: Support Count):
A, B: 3
A, C: 4
A, D: 3
A, E: 2
B, C: 3
B, D: 6
B, E: 2
C, D: 3
C, E: 3
D, E: 1

L2 (candidates meeting Min. Support Count = 2):
A, B: 3
A, C: 4
A, D: 3
A, E: 2
B, C: 3
B, D: 6
B, E: 2
C, D: 3
C, E: 3
Apriori Algorithm Step 1-3: C3 → L3
(Transaction database as in Table 1; Min. Support Count = 2)

C3 (Itemset: Support Count):
A, B, C: 1
A, B, D: 2
A, B, E: 1
A, C, D: 1
A, C, E: 2
B, C, D: 2
B, C, E: 2

L3 (candidates meeting Min. Support Count = 2):
A, B, D: 2
A, C, E: 2
B, C, D: 2
B, C, E: 2
Generating Association Rules Step 2-1
(Transaction database as in Table 1; L1 and L2 as above; minimum confidence = 80%)

Association rules generated from L2 (rule: confidence):
A → B: 3/6        B → A: 3/7
A → C: 4/6        C → A: 4/6
A → D: 3/6        D → A: 3/7
A → E: 2/6        E → A: 2/3
B → C: 3/7        C → B: 3/6
B → D: 6/7 = 85.7% *   D → B: 6/7 = 85.7% *
B → E: 2/7        E → B: 2/3
C → D: 3/6        D → C: 3/7
C → E: 3/6        E → C: 3/3 = 100% *

(* rules meeting the 80% minimum confidence)
Generating Association Rules Step 2-2
(Transaction database as in Table 1; L1, L2, and L3 as above; minimum confidence = 80%)

Association rules generated from L3 (rule: confidence):
A → BD: 2/6       B → CD: 2/7
B → AD: 2/7       C → BD: 2/6
D → AB: 2/7       D → BC: 2/7
AB → D: 2/3       BC → D: 2/3
AD → B: 2/3       BD → C: 2/6
BD → A: 2/6       CD → B: 2/3
A → CE: 2/6       B → CE: 2/7
C → AE: 2/6       C → BE: 2/6
E → AC: 2/3       E → BC: 2/3
AC → E: 2/4       BC → E: 2/3
AE → C: 2/2 = 100% *   BE → C: 2/2 = 100% *
CE → A: 2/3       CE → B: 2/3

(* rules meeting the 80% minimum confidence)
Frequent Itemsets and Association Rules
(Transaction database as in Table 1)

L1 (Itemset: Support Count): A: 6, B: 7, C: 6, D: 7, E: 3
L2 (Itemset: Support Count): A,B: 3; A,C: 4; A,D: 3; A,E: 2; B,C: 3; B,D: 6; B,E: 2; C,D: 3; C,E: 3
L3 (Itemset: Support Count): A,B,D: 2; A,C,E: 2; B,C,D: 2; B,C,E: 2

minimum support = 20%
minimum confidence = 80%

Association Rules:
B → D (60%, 85.7%) (Sup.: 6/10, Conf.: 6/7)
D → B (60%, 85.7%) (Sup.: 6/10, Conf.: 6/7)
E → C (30%, 100%) (Sup.: 3/10, Conf.: 3/3)
AE → C (20%, 100%) (Sup.: 2/10, Conf.: 2/2)
BE → C (20%, 100%) (Sup.: 2/10, Conf.: 2/2)
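For reference, a compact, self-contained Python sketch (an illustrative implementation written for these notes, not taken from the lecture) that runs the full level-wise Apriori search on the Table 1 database and prints exactly the five strong rules above:

```python
from itertools import combinations

# Illustrative end-to-end Apriori run on the Table 1 database
# (minimum support = 20%, minimum confidence = 80%).
transactions = [{'A','B','D'}, {'A','C','D'}, {'B','C','D','E'},
                {'A','B','D'}, {'A','B','C','E'}, {'A','C'},
                {'B','C','D'}, {'B','D'}, {'A','C','E'}, {'B','D'}]
min_sup, min_conf = 0.2, 0.8
n = len(transactions)

def sup(itemset):
    # Relative support: fraction of transactions containing the itemset.
    return sum(itemset <= t for t in transactions) / n

# Level-wise search: C1 -> L1, then C2 -> L2, and so on.
frequent = {}
level = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
k = 1
while level:
    level = [c for c in level if sup(c) >= min_sup]
    frequent.update({c: sup(c) for c in level})
    k += 1
    level = {a | b for a in level for b in level if len(a | b) == k}

# Generate strong rules by splitting each frequent itemset into LHS -> RHS.
for itemset, s in frequent.items():
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            conf = s / frequent[lhs]   # every subset of a frequent
            if conf >= min_conf:       # itemset is itself frequent
                print(sorted(lhs), '->', sorted(itemset - lhs),
                      f'(sup {s:.0%}, conf {conf:.1%})')
# Prints the five strong rules: B->D and D->B (85.7%),
# E->C, AE->C, and BE->C (100%), matching the slides.
```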
Co-occurrences and Associations
• Complexity control:
• Support of the association
• Let's say that we require rules to apply to at least 0.01% of all transactions
• Confidence or strength of the rule
• Let's say that we require that 5% or more of the time, a buyer of A also buys B
• Measuring surprise:
• Lift(A, B) = p(A, B) / (p(A) × p(B))
Example: Beer and Lottery Tickets
• We operate a small convenience store where people buy groceries,
liquor, lottery tickets, etc. We estimate that:
• 30% of all transactions involve beer,
• 40% of all transactions involve lottery tickets,
• and 20% of the transactions include both beer and lottery tickets.
Example: Beer and Lottery Tickets
• If the two products are unrelated:
• p(beer) × p(lottery tickets) = 0.3 × 0.4 = 0.12
• Otherwise:
• Lift(beer, lottery tickets) = 0.2 / 0.12 ≈ 1.67
• Leverage(beer, lottery tickets) = 0.2 − 0.12 = 0.08
• Support(beer, lottery tickets) = 20%
• Strength(beer, lottery tickets) = p(lottery tickets | beer) = 0.2 / 0.3 ≈ 67%
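The same arithmetic as a short Python sketch (illustrative only, using the estimated probabilities from the slide):

```python
# Surprise measures for the beer / lottery-ticket example.
p_beer, p_lottery, p_both = 0.30, 0.40, 0.20

lift = p_both / (p_beer * p_lottery)      # 0.2 / 0.12 ≈ 1.67
leverage = p_both - p_beer * p_lottery    # 0.2 - 0.12 = 0.08
support = p_both                          # 20%
strength = p_both / p_beer                # p(lottery | beer) ≈ 67%

print(f"Lift: {lift:.2f}, Leverage: {leverage:.2f}, "
      f"Support: {support:.0%}, Strength: {strength:.0%}")
```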
Profiling: Finding Typical Behavior
• Profiling attempts to characterize the typical behavior of an
individual, group, or population
• Profiling can essentially involve clustering, if there are subgroups of
the population with different behaviors
Link Prediction and Social Recommendation
• Sometimes, instead of predicting a property (target value) of a data
item, it is more useful to predict connections between data items
• A common example of this is predicting that a link should exist
between two individuals
• Link prediction can also estimate the strength of a link
Data Reduction and Latent Information
• There is a trade-off between the insight or manageability gained and the information lost
Latent Information and Movie
Recommendation
Bias, Variance, and Ensemble Methods
• The errors a model makes can be characterized by three factors:
• 1. Inherent randomness,
• 2. Bias, and
• 3. Variance.
Thanks!
Q&A