
Data Science for Business

Lecture 6 – Other Data Science Tasks and Techniques

Assoc. Prof. Pham Quoc Trung


[email protected]
Data Mining Tasks and Machine Learning

Unsupervised Learning: Association Analysis

Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2017), Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition, Pearson
Transaction Database
Transaction ID Items bought
T01 A, B, D
T02 A, C, D
T03 B, C, D, E
T04 A, B, D
T05 A, B, C, E
T06 A, C
T07 B, C, D
T08 B, D
T09 A, C, E
T10 B, D
Association Analysis
4
Association Analysis:
Mining Frequent Patterns,
Association and Correlations

• Association Analysis
• Mining Frequent Patterns
• Association and Correlations
• Apriori Algorithm

5
Source: Han & Kamber (2006)
Market Basket Analysis

6
Source: Han & Kamber (2006)
Association Rule Mining
• Apriori Algorithm

Raw transaction data (Transaction No: SKUs):
1: 1, 2, 3, 4
2: 2, 3, 4
3: 2, 3
4: 1, 2, 4
5: 1, 2, 3, 4
6: 2, 4

One-item itemsets (itemset: support): 1: 3 | 2: 6 | 3: 4 | 4: 5
Two-item itemsets: (1, 2): 3 | (1, 3): 2 | (1, 4): 3 | (2, 3): 4 | (2, 4): 5 | (3, 4): 3
Three-item itemsets: (1, 2, 4): 3 | (2, 3, 4): 3
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems 7
Association Rule Mining
• A very popular DM method in business
• Finds interesting relationships (affinities) between
variables (items or events)
• Part of the machine learning family
• Employs unsupervised learning
• There is no output variable
• Also known as market basket analysis
• Often used as an example to describe DM to
ordinary people, such as the famous “relationship
between diapers and beers!”

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems 8
Association Rule Mining
• Input: the simple point-of-sale transaction data
• Output: Most frequent affinities among items
• Example: according to the transaction data…
“Customers who bought a laptop computer and virus protection
software also bought an extended service plan 70 percent of the time."
• How do you use such a pattern/knowledge?
• Put the items next to each other for ease of finding
• Promote the items as a package (do not put one on sale if the other(s)
are on sale)
• Place items far apart from each other so that the customer has to
walk the aisles to search for them, potentially seeing and buying
other items along the way

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems 9
Association Rule Mining
• Representative applications of association rule
mining include
• In business: cross-marketing, cross-selling, store design,
catalog design, e-commerce site design, optimization of
online advertising, product pricing, and sales/promotion
configuration
• In medicine: relationships between symptoms and
illnesses; diagnosis and patient characteristics and
treatments (to be used in medical DSS); and genes and
their functions (to be used in genomics projects)…

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems 10
Association Rule Mining
• Are all association rules interesting and useful?
A generic rule: X → Y [S%, C%]
X, Y: products and/or services
X: Left-hand side (LHS)
Y: Right-hand side (RHS)
S: Support: how often X and Y appear together
C: Confidence: how often Y appears in transactions that contain X
Example: {Laptop Computer, Antivirus Software} →
{Extended Service Plan} [30%, 70%]

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems 11
Association Rule Mining

• Algorithms are available for generating association rules


• Apriori
• Eclat
• FP-Growth
• + Derivatives and hybrids of the three
• The algorithms help identify the frequent itemsets, which are then
converted to association rules

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems 12
Association Rule Mining
• Apriori Algorithm
• Finds subsets that are common to at least a minimum number of the itemsets
• uses a bottom-up approach
• frequent subsets are extended one item at a time (the size of frequent subsets increases from
one-item subsets to two-item subsets, then three-item subsets, and so on), and
• groups of candidates at each level are tested against the data for minimum support

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems 13
Basic Concepts: Frequent Patterns and Association
Rules
Transaction-id   Items bought
10   A, B, D
20   A, C, D
30   A, D, E
40   B, E, F
50   B, C, D, E, F

• Itemset X = {x1, …, xk}
• Find all the rules X → Y with minimum support and confidence
• support, s: probability that a transaction contains X ∪ Y
• confidence, c: conditional probability that a transaction having X also contains Y

(Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both.)

Let supmin = 50%, confmin = 50%
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A → D (support = 3/5 = 60%, confidence = 3/3 = 100%)
D → A (support = 3/5 = 60%, confidence = 3/4 = 75%)
14
Source: Han & Kamber (2006)
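As a quick check, the support and confidence figures above can be recomputed with a short Python sketch (illustrative only; the function names are not from the slides):

```python
transactions = [
    {"A", "B", "D"},            # 10
    {"A", "C", "D"},            # 20
    {"A", "D", "E"},            # 30
    {"B", "E", "F"},            # 40
    {"B", "C", "D", "E", "F"},  # 50
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs | lhs): support of the combined itemset over support of lhs."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A", "D"}))       # 0.6  -> A -> D has 60% support
print(confidence({"A"}, {"D"}))  # 1.0  -> A -> D has 100% confidence
print(confidence({"D"}, {"A"}))  # 0.75 -> D -> A has 75% confidence
```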
Market basket analysis
• Example
• Which groups or sets of items are customers likely to purchase on a given trip to
the store?
• Association Rule
• Computer → antivirus_software
[support = 2%; confidence = 60%]
• A support of 2% means that 2% of all the transactions under analysis show that computer
and antivirus software are purchased together.
• A confidence of 60% means that 60% of the customers who purchased a computer also
bought the software.

15
Source: Han & Kamber (2006)
Association rules

• Association rules are considered interesting if they satisfy both


• a minimum support threshold and
• a minimum confidence threshold.

16
Source: Han & Kamber (2006)
Frequent Itemsets,
Closed Itemsets, and
Association Rules

Support (A → B) = P(A ∪ B)

Confidence (A → B) = P(B|A)

17
Source: Han & Kamber (2006)
Support (A → B) = P(A ∪ B)
Confidence (A → B) = P(B|A)

• The notation P(A ∪ B) indicates the probability that a transaction
contains the union of set A and set B
• (i.e., it contains every item in A and in B).
• This should not be confused with P(A or B), which indicates the
probability that a transaction contains either A or B.

18
Source: Han & Kamber (2006)
Does diaper purchase predict beer purchase?

• Contingency tables

DEPENDENT (yes):
              Beer: Yes   Beer: No   Total
No diapers    6           94         100
Diapers       40          60         100

INDEPENDENT (no predictability):
              Beer: Yes   Beer: No
No diapers    23          77
Diapers       23          77


Source: Dickey (2012) http://www4.stat.ncsu.edu/~dickey/SAScode/Encore_2012.ppt
Support (A → B) = P(A ∪ B)
Confidence (A → B) = P(B|A)
Conf (A → B) = Supp (A ∪ B) / Supp (A)
Lift (A → B) = Supp (A ∪ B) / (Supp (A) × Supp (B))

Lift (Correlation):
Lift (A → B) = Confidence (A → B) / Support(B)

20
Source: Dickey (2012) http://www4.stat.ncsu.edu/~dickey/SAScode/Encore_2012.ppt
Lift
Lift = Confidence / Expected Confidence (if independent)

                 Checking: No   Checking: Yes   Total
                 (1,500)        (8,500)         (10,000)
Saving: No       500            3,500           4,000
Saving: Yes      1,000          5,000           6,000

SVG => CHKG: Expected confidence = 8,500/10,000 = 85% if independent
Observed confidence = 5,000/6,000 = 83%
Lift = 83/85 < 1
Savings account holders are actually LESS likely than others to
have a checking account!

21
Source: Dickey (2012) http://www4.stat.ncsu.edu/~dickey/SAScode/Encore_2012.ppt
Support & Confidence

Transactions: {A, B, C}, {A, C, D}, {B, C, D}, {A, D, E}, {B, C, E}

Rule        Support   Confidence
A → D       2/5       2/3
C → A       2/5       2/4
A → C       2/5       2/3
B & C → D   1/5       1/3
22
Source: SAS Enterprise Miner Course Notes, 2014, SAS
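A few lines of Python (an illustrative sketch, not course code) reproduce two of the rows above:

```python
transactions = [{"A", "B", "C"}, {"A", "C", "D"}, {"B", "C", "D"},
                {"A", "D", "E"}, {"B", "C", "E"}]

def supp(itemset):
    # fraction of the five transactions containing every item in `itemset`
    return sum(itemset <= t for t in transactions) / len(transactions)

# A -> D: support 2/5 = 0.4, confidence 2/3
print(supp({"A", "D"}), supp({"A", "D"}) / supp({"A"}))
# B & C -> D: support 1/5 = 0.2, confidence 1/3
print(supp({"B", "C", "D"}), supp({"B", "C", "D"}) / supp({"B", "C"}))
```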
Support & Confidence & Lift

                   Checking Account
                   No       Yes      Total
Saving    No       500      3,500    4,000
Account   Yes      1,000    5,000    6,000
                                     10,000

Support (SVG → CK) = 50% = 5,000/10,000
Confidence (SVG → CK) = 83% = 5,000/6,000
Expected Confidence (SVG → CK) = 85% = 8,500/10,000
Lift (SVG → CK) = Confidence / Expected Confidence = 0.83/0.85 < 1
23
Source: SAS Enterprise Miner Course Notes, 2014, SAS
Support (A→B)
Confidence (A→B)
Expected Confidence (A→B)
Lift (A→B)

24
Support (A → B) = P(A ∪ B) = Count(A & B) / Count(Total)
Confidence (A → B) = P(B|A)
Conf (A → B) = Supp (A ∪ B) / Supp (A) = Count(A & B) / Count(A)
Expected Confidence (A → B) = Support(B) = Count(B) / Count(Total)

Lift (A → B) = Confidence (A → B) / Expected Confidence (A → B)

Lift (A → B) = Supp (A ∪ B) / (Supp (A) × Supp (B))
Lift (Correlation):
Lift (A → B) = Confidence (A → B) / Support(B)
25
Lift (A→B)
• Lift (A → B)
= Confidence (A → B) / Expected Confidence (A → B)
= Confidence (A → B) / Support(B)
= (Supp (A & B) / Supp (A)) / Supp(B)
= Supp (A & B) / (Supp (A) × Supp (B))

26
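All four measures reduce to simple counts. A small illustrative helper (the function and argument names are assumptions, not from the slides) recomputes the savings/checking example:

```python
def measures(count_ab, count_a, count_b, n_total):
    """Compute the four rule measures from raw transaction counts."""
    support = count_ab / n_total              # Count(A & B) / Count(Total)
    confidence = count_ab / count_a           # Count(A & B) / Count(A)
    expected_confidence = count_b / n_total   # Support(B)
    lift = confidence / expected_confidence
    return support, confidence, expected_confidence, lift

# Savings -> Checking: (0.5, 0.833..., 0.85, 0.980...) -> lift just under 1
print(measures(count_ab=5000, count_a=6000, count_b=8500, n_total=10000))
```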
Minimum Support and
Minimum Confidence
• Rules that satisfy both a minimum support threshold (min_sup) and a
minimum confidence threshold (min_conf) are called strong.
• By convention, support and confidence values are written as
percentages between 0% and 100%, rather than as fractions between 0 and 1.0.

27
Source: Han & Kamber (2006)
K-itemset
• itemset
• A set of items is referred to as an itemset.
• K-itemset
• An itemset that contains k items is a k-itemset.
• Example:
• The set {computer, antivirus software} is a 2-itemset.

28
Source: Han & Kamber (2006)
Absolute Support and
Relative Support
• Absolute Support
• The occurrence frequency of an itemset is the number of transactions that
contain the itemset
• frequency, support count, or count of the itemset
• Ex: 3
• Relative support
• The fraction of transactions that contain the itemset
• Ex: 60%

29
Source: Han & Kamber (2006)
Frequent Itemset

• If the relative support of an itemset I satisfies a prespecified


minimum support threshold, then I is a frequent itemset.
• i.e., the absolute support of I satisfies the corresponding minimum support
count threshold
• The set of frequent k-itemsets is commonly denoted by Lk

30
Source: Han & Kamber (2006)
Confidence

• The confidence of rule A → B can be easily derived
from the support counts of A and A ∪ B.
• Once the support counts of A, B, and A ∪ B are found,
it is straightforward to derive the corresponding
association rules A → B and B → A and check whether
they are strong.
• Thus the problem of mining association rules can be
reduced to that of mining frequent itemsets.

31
Source: Han & Kamber (2006)
Association rule mining:
Two-step process
1. Find all frequent itemsets
• By definition, each of these itemsets will occur at least as frequently as a
predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets
• By definition, these rules must satisfy minimum support and minimum
confidence.

32
Source: Han & Kamber (2006)
Efficient and Scalable
Frequent Itemset Mining Methods
• The Apriori Algorithm
• Finding Frequent Itemsets Using Candidate Generation

33
Source: Han & Kamber (2006)
Apriori Algorithm

• Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant


in 1994 for mining frequent itemsets for Boolean association rules.
• The name of the algorithm comes from the fact that the algorithm
uses prior knowledge of frequent itemset properties, as we shall see
in the following.

34
Source: Han & Kamber (2006)
Apriori Algorithm

• Apriori employs an iterative approach known as a level-wise


search, where k-itemsets are used to explore (k+1)-itemsets.
• First, the set of frequent 1-itemsets is found by scanning the
database to accumulate the count for each item, and
collecting those items that satisfy minimum support. The
resulting set is denoted L1.
• Next, L1 is used to find L2, the set of frequent 2-itemsets,
which is used to find L3, and so on, until no more frequent k-itemsets can be found.
• The finding of each Lk requires one full scan of the database.

35
Source: Han & Kamber (2006)
Apriori Algorithm

• To improve the efficiency of the level-wise generation of frequent
itemsets, an important property called the Apriori property is used
to reduce the search space.
• Apriori property
• All nonempty subsets of a frequent itemset must also be frequent.

36
Source: Han & Kamber (2006)
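To make the level-wise search concrete, here is a minimal Python sketch of Apriori under the definitions above (an illustration, not textbook code); it returns every frequent itemset with its support count. Run on the 10-transaction database used in the following slides (db, with min_count = 2 for a 20% minimum support), it reproduces the L1, L2, and L3 tables that follow.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise search: frequent k-itemsets (Lk) are joined to build
    (k+1)-candidates; each level needs one full pass over the database."""
    transactions = [set(t) for t in transactions]
    # C1: every single item seen in the database
    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while level:
        # Count each candidate against the database; keep those meeting min_count
        counts = {c: sum(c <= t for t in transactions) for c in level}
        Lk = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(Lk)
        k += 1
        # Join step: union pairs of frequent itemsets into (k+1)-candidates
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step (Apriori property): every k-subset must itself be frequent
        level = {c for c in candidates
                 if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
    return frequent

db = [{"A","B","D"}, {"A","C","D"}, {"B","C","D","E"}, {"A","B","D"},
      {"A","B","C","E"}, {"A","C"}, {"B","C","D"}, {"B","D"},
      {"A","C","E"}, {"B","D"}]
freq = apriori(db, min_count=2)   # minimum support 20% of 10 transactions
```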
Apriori algorithm
(1) Frequent Itemsets
(2) Association Rules

37
Transaction Database
Transaction ID Items bought
T01 A, B, D
T02 A, C, D
T03 B, C, D, E
T04 A, B, D
T05 A, B, C, E
T06 A, C
T07 B, C, D
T08 B, D
T09 A, C, E
T10 B, D
Table 1 shows a database with 10 transactions.
Let minimum support = 20% and minimum confidence = 80%.
Use the Apriori algorithm to generate association rules
from the frequent itemsets.

Table 1: Transaction Database
Transaction ID   Items bought
T01   A, B, D
T02   A, C, D
T03   B, C, D, E
T04   A, B, D
T05   A, B, C, E
T06   A, C
T07   B, C, D
T08   B, D
T09   A, C, E
T10   B, D
Apriori Algorithm Step 1-1: C1 → L1
(Scan the transaction database in Table 1 once to get the support count of each item.)

C1 (candidate 1-itemsets), Itemset: Support Count
A: 6, B: 7, C: 6, D: 7, E: 3

Minimum support = 20% = 2/10 → minimum support count = 2

L1 (frequent 1-itemsets): A: 6, B: 7, C: 6, D: 7, E: 3
(All candidates meet the minimum support count.)
40
Apriori Algorithm Step 1-2: C2 → L2
(Join L1 with itself to form C2; scan the database to count each candidate.)

C2 (candidate 2-itemsets), Itemset: Support Count
A,B: 3 | A,C: 4 | A,D: 3 | A,E: 2 | B,C: 3 | B,D: 6 | B,E: 2 | C,D: 3 | C,E: 3 | D,E: 1

Minimum support count = 2 (20% of 10 transactions), so {D, E} is pruned.

L2 (frequent 2-itemsets):
A,B: 3 | A,C: 4 | A,D: 3 | A,E: 2 | B,C: 3 | B,D: 6 | B,E: 2 | C,D: 3 | C,E: 3
Apriori Algorithm Step 1-3: C3 → L3
(Join L2 with itself to form C3, prune candidates whose 2-subsets are not all in L2, and count the rest.)

C3 (candidate 3-itemsets), Itemset: Support Count
A,B,C: 1 | A,B,D: 2 | A,B,E: 1 | A,C,D: 1 | A,C,E: 2 | B,C,D: 2 | B,C,E: 2

L3 (frequent 3-itemsets, support count ≥ 2):
A,B,D: 2 | A,C,E: 2 | B,C,D: 2 | B,C,E: 2
42
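As a usage example, running the earlier apriori() sketch on this database (assuming the apriori() function and db list defined above) reproduces these tables; for instance, the frequent 3-itemsets:

```python
# Assumes apriori() and db from the earlier sketch.
L3 = {s: n for s, n in apriori(db, min_count=2).items() if len(s) == 3}
print(L3)  # four itemsets, each with support count 2: ABD, ACE, BCD, BCE
```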
Generating Association Rules Step 2-1
(minimum confidence = 80%; strong rules are marked *)

Association rules generated from L2:
A→B: 3/6    B→A: 3/7
A→C: 4/6    C→A: 4/6
A→D: 3/6    D→A: 3/7
A→E: 2/6    E→A: 2/3
B→C: 3/7    C→B: 3/6
B→D: 6/7 = 85.7% *    D→B: 6/7 = 85.7% *
B→E: 2/7    E→B: 2/3
C→D: 3/6    D→C: 3/7
C→E: 3/6    E→C: 3/3 = 100% *
Generating Association Rules Step 2-2
(minimum confidence = 80%; strong rules are marked *)

Association rules generated from L3:
From {A, B, D}: A→BD: 2/6, B→AD: 2/7, D→AB: 2/7, AB→D: 2/3, AD→B: 2/3, BD→A: 2/6
From {B, C, D}: B→CD: 2/7, C→BD: 2/6, D→BC: 2/7, BC→D: 2/3, BD→C: 2/6, CD→B: 2/3
From {A, C, E}: A→CE: 2/6, C→AE: 2/6, E→AC: 2/3, AC→E: 2/4, CE→A: 2/3, AE→C: 2/2 = 100% *
From {B, C, E}: B→CE: 2/7, C→BE: 2/6, E→BC: 2/3, BC→E: 2/3, CE→B: 2/3, BE→C: 2/2 = 100% *
Frequent Itemsets and Association Rules
(minimum support = 20%, minimum confidence = 80%)

L1: A: 6, B: 7, C: 6, D: 7, E: 3
L2: A,B: 3 | A,C: 4 | A,D: 3 | A,E: 2 | B,C: 3 | B,D: 6 | B,E: 2 | C,D: 3 | C,E: 3
L3: A,B,D: 2 | A,C,E: 2 | B,C,D: 2 | B,C,E: 2

Association Rules:
B→D (60%, 85.7%) (Sup.: 6/10, Conf.: 6/7)
D→B (60%, 85.7%) (Sup.: 6/10, Conf.: 6/7)
E→C (30%, 100%) (Sup.: 3/10, Conf.: 3/3)
AE→C (20%, 100%) (Sup.: 2/10, Conf.: 2/2)
BE→C (20%, 100%) (Sup.: 2/10, Conf.: 2/2)
45
Table 1 shows a database with 10 transactions.
Let minimum support = 20% and minimum confidence = 80%.
Use the Apriori algorithm to generate association rules from the frequent itemsets.

Transaction ID   Items bought
T01   A, B, D
T02   A, C, D
T03   B, C, D, E
T04   A, B, D
T05   A, B, C, E
T06   A, C
T07   B, C, D
T08   B, D
T09   A, C, E
T10   B, D

Association Rules:
B→D (60%, 85.7%) (Sup.: 6/10, Conf.: 6/7)
D→B (60%, 85.7%) (Sup.: 6/10, Conf.: 6/7)
E→C (30%, 100%) (Sup.: 3/10, Conf.: 3/3)
AE→C (20%, 100%) (Sup.: 2/10, Conf.: 2/2)
BE→C (20%, 100%) (Sup.: 2/10, Conf.: 2/2)
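Step 2 (rule generation) can be sketched the same way. The snippet below (assuming the apriori() function and db list from the earlier sketch) reproduces exactly the five strong rules listed above:

```python
from itertools import combinations

def strong_rules(frequent, min_conf):
    """Split each frequent itemset into LHS -> RHS and keep rules whose
    confidence = count(itemset) / count(LHS) meets min_conf."""
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = count / frequent[lhs]  # LHS is frequent (Apriori property)
                if conf >= min_conf:
                    rules.append((sorted(lhs), sorted(itemset - lhs), conf))
    return rules

# Yields B->D and D->B (6/7 = 85.7%), E->C (3/3), AE->C and BE->C (2/2)
print(strong_rules(apriori(db, min_count=2), min_conf=0.8))
```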
Co-occurrences and Associations

• Complexity control:
• Support of association
• Let’s say that we require rules to apply to at least 0.01% of all transactions
• Confidence or strength of the rule
• Let’s say that we require that 5% or more of the time, a buyer of A also buys B

• Measuring surprise:

Lift(A, B) = p(A, B) / (p(A) × p(B))
Example: Beer and Lottery Tickets

• We operate a small convenience store where people buy groceries,


liquor, lottery tickets, etc. We estimate that:
• 30% of all transactions involve beer,
• 40% of all transactions involve lottery tickets,
• and 20% of the transactions include both beer and lottery tickets.
Example: Beer and Lottery Tickets

• If the two products are unrelated:

p(beer) × p(lottery tickets) = 0.3 × 0.4 = 0.12

• Otherwise:

Lift(beer, lottery tickets) = 0.2 / 0.12 ≈ 1.67

Leverage(beer, lottery tickets) = 0.2 − 0.12 = 0.08

Support(beer, lottery tickets) = 20%

Strength(beer, lottery tickets) = p(lottery tickets | beer) = 0.2 / 0.3 ≈ 67%
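In code form (plain arithmetic on the estimates above):

```python
# Probabilities taken from the convenience-store estimates above.
p_beer, p_lottery, p_both = 0.30, 0.40, 0.20

lift = p_both / (p_beer * p_lottery)    # 0.20 / 0.12 ≈ 1.67
leverage = p_both - p_beer * p_lottery  # 0.20 - 0.12 = 0.08
strength = p_both / p_beer              # p(lottery | beer) ≈ 0.67
print(lift, leverage, strength)
```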
Profiling: Finding Typical Behavior

• Profiling attempts to characterize the typical behavior of an


individual, group, or population
• Profiling can essentially involve clustering, if there are subgroups of
the population with different behaviors
(Figure slides: profiling examples.)
Link Prediction and Social Recommendation

• Sometimes, instead of predicting a property (target value) of a data


item, it is more useful to predict connections between data items
• A common example of this is predicting that a link should exist
between two individuals
• Link prediction can also estimate the strength of a link
Data Reduction and Latent Information

• Trade-off between the insight or manageability gained and the
information lost
Latent Information and Movie
Recommendation
Bias, Variance, and Ensemble Methods

• The errors a model makes can be characterized by three factors:


• 1. Inherent randomness,
• 2. Bias, and
• 3. Variance.
Thanks!

Q&A
