Big Data Computing
Map Reduce, Clustering, Association Analysis
(Version 26/06/2020)
Edited by:
Stefano Ivancich
CONTENTS
1. Map Reduce
1.1. Intro
1.2. MapReduce computation
1.3. Basic Techniques and Primitives
1.3.1. Partitioning Technique
1.3.2. Efficiency-accuracy tradeoffs
1.3.3. Exploiting samples
2. Clustering
2.1. Distance
2.2. Center-based clustering
2.3. k-center clustering
2.4. k-means clustering
2.5. k-median clustering
2.6. Cluster Evaluation
2.6.1. Clustering tendency
2.6.2. Unsupervised evaluation
2.6.3. Supervised evaluation
3. Association Analysis
3.1. Basics
3.2. Mining of Frequent Itemsets and Association Rules
3.2.1. Mining Frequent Itemsets
3.2.2. Mining association rules
3.3. Frequent itemset mining for Big Data
3.4. Limitations
3.4.1. Closed and Maximal itemsets
3.4.2. Top-K frequent itemsets
This document was written by students with no intention of replacing university materials. It is a useful tool for studying the subject, but it does not guarantee a preparation as exhaustive and complete as the material recommended by the University.
The purpose of this document is to summarize the fundamental concepts of the notes taken during the lessons, rewritten, corrected and completed by referring to the slides, so that it can be used as a "practical and quick" manual to consult when designing Big Data systems. Detailed explanations are not provided; for these, please refer to the cited texts and slides.
1.2. MapReduce computation
Map-Reduce computation:
• Is a sequence of rounds.
• Each round transforms a set of key-value pairs into another set of key-value pairs (data
centric view), through the following two phases:
o Map phase: a user-specified map function is applied separately to each input key-
value pair and produces ≥ 𝟎 other key-value pairs, referred to as intermediate key-
value pairs.
o Reduce phase: the intermediate key-value pairs are grouped by key and a user-
specified reduce function is applied separately to each group of key-value pairs with
the same key, producing ≥ 0 other key-value pairs, which is the output of the round.
• The output of a round is the input of the next round.
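As a concrete illustration of a single round (data-centric view), here is a minimal simulated word-count round in Python. It is only a sketch: the names map_fn and reduce_fn and the in-memory grouping are assumptions of this example, not the API of any specific framework.

    from collections import defaultdict

    def map_fn(key, value):
        # value is a line of text; the key of the input element is ignored
        for word in value.split():
            yield (word, 1)                      # >= 0 intermediate key-value pairs

    def reduce_fn(key, values):
        yield (key, sum(values))                 # >= 0 output key-value pairs

    def run_round(pairs):
        # Map phase: apply map_fn separately to each input key-value pair
        intermediate = [kv for k, v in pairs for kv in map_fn(k, v)]
        # Group the intermediate pairs by key (the shuffle)
        groups = defaultdict(list)
        for k, v in intermediate:
            groups[k].append(v)
        # Reduce phase: apply reduce_fn separately to each group
        return [kv for k, vs in groups.items() for kv in reduce_fn(k, vs)]

    print(run_round([(0, "a rose is a rose"), (1, "a dog")]))
    # e.g. [('a', 3), ('rose', 2), ('is', 1), ('dog', 1)]

The output of run_round could then be fed as the input of the next round.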
Implementation of a round:
• The input file consists of elements, which can be of any type (tuples, documents, …). A chunk is a collection of elements, and no element is stored across two chunks. Keys of input elements are not relevant, and we tend to ignore them.
• The input file is split into 𝑋 chunks, and each chunk is the input of a map task.
• Each map task is assigned to a worker (a compute node) which applies the map function to
each key-value pair of the corresponding chunk, buffering the intermediate key-value pairs
it produces in its local disk.
A Map task can produce several key-value pairs with the same key, even from the same
element.
• The intermediate key-value pairs, while residing on the local disks, are grouped by key, and the values associated with each key are formed into a list of values.
This is done by partitioning the pairs into 𝑌 buckets through a hash function ℎ: a pair (𝑘, 𝑣) is put into bucket 𝑖 = ℎ(𝑘) mod 𝑌.
Hashing helps balance the load across the workers.
• Each bucket is the input of a different reduce task, which is assigned to a worker.
The input of the reduce function is a pair of the form (𝑘, [𝑣1, …, 𝑣𝑘]).
The output of the reduce function is a sequence of ≥ 0 key-value pairs.
Reducer: the application of the reduce function to a single key and its associated list of values.
A reduce task executes one or more reducers.
• The outputs from all the reduce tasks are merged into a single file written to the Distributed File System.
• The user program is forked into a master process and several worker processes. The master
is in charge of assigning map and reduce tasks to the various workers, and to monitor their
status (idle, in-progress, completed).
• Input and output files reside on a Distributed File System, while intermediate data are stored
on the workers' local disks.
• The round involves a data shuffle for moving the intermediate key-value pairs from the
compute nodes where they were produced (by map tasks) to the compute nodes where they
must be processed (by reduce tasks).
Shuffle is often the most expensive operation of the round
• The values 𝑋 and 𝑌 are design parameters.
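A small sketch of how intermediate pairs could be assigned to the 𝑌 buckets described above; the choice 𝑌 = 4 and the use of Python's built-in hash are only illustrative assumptions.

    Y = 4                                   # number of buckets / reduce tasks (illustrative)

    def bucket_of(key, Y):
        # (k, v) goes to bucket i = h(k) mod Y
        return hash(key) % Y

    intermediate = [("rose", 1), ("dog", 1), ("rose", 1), ("a", 1)]
    buckets = {i: [] for i in range(Y)}
    for k, v in intermediate:
        buckets[bucket_of(k, Y)].append((k, v))
    # All pairs with the same key end up in the same bucket,
    # and hashing spreads distinct keys roughly evenly across buckets.
    print(buckets)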
Pros of Map Reduce:
• Data-centric view: Algorithm design can focus on data transformations, targeting the design
goals. Programming frameworks (e.g., Spark) usually make allocation of tasks to workers,
data management, handling of failures, totally transparent to the programmer.
• Portability/Adaptability: Applications can run on different platforms and the underlying
implementation of the framework will do best effort to exploit parallelism and minimize data
movement costs.
• Cost: MapReduce applications can be run on moderately expensive platforms, and many
popular cloud providers support their execution
Cons of Map Reduce:
• Weakness of the number of rounds as performance measure. It ignores the runtimes of
map and reduce functions and the actual volume of data shuffled. More sophisticated (yet,
less usable) performance measures exist.
• Curse of the last reducer: In some cases, one or a few reducers may be much slower than
the other ones, thus delaying the end of the round. When designing MapReduce algorithms,
one should try to ensure load balancing (if at all possible) among reducers.
1.3. Basic Techniques and Primitives
Chernoff bound: let 𝑋 be a Binomial random variable 𝐵𝑖𝑛(𝑛, 𝑝), with 𝜇 = 𝐸[𝑋] = 𝑛𝑝.
For every 𝛿1 ≥ 5 and 𝛿2 ∈ (0,1):
• Pr(𝑋 ≥ (1 + 𝛿1)𝜇) ≤ 2^(−(1+𝛿1)𝜇)
• Pr(𝑋 ≤ (1 − 𝛿2)𝜇) ≤ 2^(−𝜇·𝛿2²/2)
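As a purely illustrative numeric check of the lower-tail bound (the values 𝑛 = 10⁶ and 𝑝 = 10⁻³ are arbitrary choices for this example):

    n, p = 10**6, 10**-3
    mu = n * p                 # expected value: 1000
    delta2 = 0.5
    # Pr(X <= (1 - delta2)*mu) <= 2^(-mu * delta2^2 / 2)
    exponent = mu * delta2**2 / 2
    print(f"Pr(X <= {(1 - delta2) * mu:.0f}) <= 2^(-{exponent:.0f})")   # 2^(-125)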
1.3.2. Efficiency-accuracy tradeoffs
For some problems an exact MapReduce algorithm may be too costly (require large 𝑅, 𝑀𝐿 , or 𝑀𝐴 )
If it is acceptable for the application, we relax the requirement of an exact solution and search for an approximate solution.
E.g.: Sorting
2. Clustering
Given a set of points belonging to some space, with a notion of distance between points, clustering
aims at grouping the points into a number of subsets (clusters) such that:
• Points in the same cluster are "close" to one another
• Points in different clusters are "distant" from one another
The distance captures a notion of similarity: close points are similar, distant points are dissimilar.
2.1. Distance
Metric Space: ordered pair (𝑀, 𝑑 ) where:
• 𝑀 is a set
• 𝑑 (∙) is a metric on 𝑀. 𝑑: 𝑀 × 𝑀 → ℝ s.t. ∀𝑥, 𝑦, 𝑧 ∈ 𝑀
o 𝑑 (𝑥, 𝑦) ≥ 0
o 𝑑 (𝑥, 𝑦) = 0 ⟺ 𝑥 = 𝑦
o 𝑑 (𝑥, 𝑦) = 𝑑 (𝑦, 𝑥 ) symmetry
o 𝑑 (𝑥, 𝑧) ≤ 𝑑 (𝑥, 𝑦) + 𝑑 (𝑦, 𝑧) triangle inequality
Different distance functions can be used to analyze dataset through clustering. The choice of the
function may have a significant impact on the effectiveness of the analysis.
Distance functions:
• Minkowski distances: 𝑑_{𝐿𝑟}(𝑋, 𝑌) = (∑_{𝑖=1}^{𝑛} |𝑥_𝑖 − 𝑦_𝑖|^𝑟)^{1/𝑟}, where the vectors 𝑋, 𝑌 ∈ ℝ𝑛 and 𝑟 > 0
o 𝑟 = 2: Euclidean distance (straight-line, "as the crow flies") ‖∙‖. Aggregates the gap in each dimension between two objects.
o 𝑟 = 1: Manhattan distance. Used in grid-like environments.
o 𝑟 = ∞: Chebyshev distance, max_{1≤𝑖≤𝑛} |𝑥_𝑖 − 𝑦_𝑖|
• Cosine (or angular) distance: 𝑑_cosine(𝑋, 𝑌) = arccos((𝑋·𝑌)/(‖𝑋‖·‖𝑌‖)) = arccos((∑_{𝑖=1}^{𝑛} 𝑥_𝑖 𝑦_𝑖) / (√(∑_{𝑖=1}^{𝑛} 𝑥_𝑖²) · √(∑_{𝑖=1}^{𝑛} 𝑦_𝑖²)))
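A small self-contained sketch of these distance functions (plain Python, no external libraries; vectors are tuples of floats):

    from math import acos, sqrt, inf

    def minkowski(x, y, r):
        if r == inf:                                  # Chebyshev distance
            return max(abs(a - b) for a, b in zip(x, y))
        return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

    def cosine_distance(x, y):
        dot = sum(a * b for a, b in zip(x, y))
        norm_x = sqrt(sum(a * a for a in x))
        norm_y = sqrt(sum(b * b for b in y))
        return acos(dot / (norm_x * norm_y))          # angle between the two vectors

    x, y = (1.0, 2.0, 3.0), (4.0, 0.0, 3.0)
    print(minkowski(x, y, 2))      # Euclidean
    print(minkowski(x, y, 1))      # Manhattan
    print(minkowski(x, y, inf))    # Chebyshev
    print(cosine_distance(x, y))   # angular distance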
Curse of dimensionality: issues that arise when processing data in high-dimensional spaces.
• Quality: distance functions may lose effectiveness in assessing similarity. Points in a high-dimensional metric space tend to be sparse, almost equally distant from one another, and almost orthogonal (as vectors) to one another.
• Performance: running times may have a linear or even exponential dependency on the dimension 𝑛.
o Running time of a clustering algorithm: 𝑂(𝐶_dist · 𝑁_dist) = (cost of one distance computation) · (number of distance computations)
o Nearest Neighbor Search problem: given a set 𝑆 ⊆ ℝ𝑛 of 𝑚 points, construct a data
structure that, for a given query point 𝑞 ∈ ℝ𝑛 , efficiently returns the closest point
𝑥 ∈ 𝑆 to 𝑞. The problem requires time: Θ(min{𝑛 ∗ 𝑚, 2𝑛 })
2.2. Center-based clustering
Family of clustering problems used in data analysis.
Partitioning primitive: assign each point 𝑥 ∈ 𝑃 to the cluster 𝐶_𝑖 whose center 𝑐_𝑖 is closest to 𝑥.
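A minimal sketch of this partitioning primitive Partition(P, S), assuming Euclidean points represented as tuples (math.dist, available from Python 3.8, is used for the distance):

    from math import dist   # Euclidean distance

    def partition(P, S):
        # Assign each point x in P to the cluster C_i whose center c_i is closest to x.
        clusters = [[] for _ in S]
        for x in P:
            i = min(range(len(S)), key=lambda j: dist(x, S[j]))
            clusters[i].append(x)
        return clusters

    P = [(0, 0), (1, 0), (9, 9), (10, 10)]
    S = [(0, 0), (10, 10)]        # centers
    print(partition(P, S))        # [[(0, 0), (1, 0)], [(9, 9), (10, 10)]]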
2.3. k-center clustering
Provides a strong guarantee on how close each point is to the center of its cluster.
For noisy pointsets (e.g., pointsets with outliers) the clustering which optimizes the k-center
objective may obfuscate some “natural" clustering inherent in the data.
Farthest-First Traversal is a 2-approximation algorithm: 𝜙_kcenter(𝐶_alg) ≤ 2 · 𝜙^opt_kcenter(𝑃, 𝑘).
It means that each point is at distance at most 2 · 𝜙^opt_kcenter(𝑃, 𝑘) from its closest center.
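A minimal sketch of the sequential Farthest-First Traversal, assuming Euclidean points as tuples; choosing P[0] as the first center is an arbitrary simplification (any starting point can be used).

    from math import dist

    def farthest_first_traversal(P, k):
        centers = [P[0]]                          # first center: an arbitrary point of P
        # distance of every point from its closest center selected so far
        d_min = [dist(p, centers[0]) for p in P]
        for _ in range(k - 1):
            # next center: the point farthest from the current set of centers
            i = max(range(len(P)), key=lambda j: d_min[j])
            centers.append(P[i])
            d_min = [min(d_min[j], dist(P[j], P[i])) for j in range(len(P))]
        return centers

    P = [(0, 0), (1, 1), (10, 0), (0, 10), (10, 10)]
    print(farthest_first_traversal(P, 3))

Keeping the array d_min of distances to the closest current center is what allows each of the 𝑘 − 1 iterations to be done with a single scan of 𝑃.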
Dealing with Big Data: pointset 𝑃 is too large for a single machine.
Farthest-First Traversal requires 𝑘 − 1 scans of pointset 𝑃 which is impractical for massive pointsets
and not so small 𝑘.
Coreset for K-Center:
• Select a coreset 𝑇 ⊂ 𝑃, making sure that each 𝑥 ∈ 𝑃 − 𝑇 is close to some point of 𝑇 (at distance approximately within 𝜙^opt_kcenter(𝑃, 𝑘)).
• Search for the set 𝑆 of 𝑘 final centers in 𝑇, so that each point of 𝑇 is at distance approximately within 𝜙^opt_kcenter(𝑃, 𝑘) from the closest center.
By combining the above two points, we gather that each point of 𝑃 is at distance 𝑂(𝜙^opt_kcenter(𝑃, 𝑘)) from the closest center.
MapReduce-Farthest-First Traversal
Is a 4-approximation algorithm: 𝜙_kcenter(𝐶_alg) ≤ 4 · 𝜙^opt_kcenter(𝑃, 𝑘).
It means that each point is at distance at most 4 · 𝜙^opt_kcenter(𝑃, 𝑘) from its closest center.
The sequential Farthest-First Traversal algorithm is used both to extract the coreset and to compute
the final set of centers. It provides a good coreset since it ensures that any point not in the coreset
be well represented by some coreset point.
By selecting 𝑘′ > 𝑘 centers from each subset 𝑃_𝑖 in Round 1, the quality of the final clustering improves. In fact, it can be shown that when 𝑃 satisfies certain properties and 𝑘′ is sufficiently large, MR-Farthest-First Traversal can almost match the approximation quality of the sequential Farthest-First Traversal, while still using sublinear local space and linear aggregate space.
2.4. k-means clustering
Aims at optimizing average (squared) distances.
Finds a 𝑘-clustering 𝐶 = (𝐶1, …, 𝐶𝑘) which minimizes 𝜙_kmeans(𝐶) = ∑_{𝑖=1}^{𝑘} ∑_{𝑎∈𝐶_𝑖} 𝑑(𝑎, 𝑐_𝑖)²
• Assumes that centers need not necessarily belong to the input pointset.
• Aims at minimizing cluster variance and works well for discovering ball-shaped clusters.
• Because of the quadratic dependence on distances, 𝑘-means clustering is rather sensitive to
outliers (though less than 𝑘-center).
Centroid of a set 𝑃: 𝑐(𝑃) = (1/𝑁) ∑_{𝑋∈𝑃} 𝑋 (component-wise sum).
It is the point that minimizes the sum of the squared distances to all the points of 𝑃: for every point 𝑌, ∑_{𝑋∈𝑃} 𝑑(𝑋, 𝑐(𝑃))² ≤ ∑_{𝑋∈𝑃} 𝑑(𝑋, 𝑌)².
Lloyd's algorithm
• Always terminates but the number of iterations can be exponential in the input size.
• May be trapped into a local optimum.
• The quality of the solution and the speed of convergence of Lloyd's algorithm depend considerably on the choice of the initial set of centers.
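A minimal sketch of Lloyd's algorithm, assuming Euclidean points as tuples and an externally supplied initial set of centers (e.g., from k-means++ below); the max_iter safeguard is an assumption of this sketch.

    from math import dist

    def centroid(points):
        # component-wise average of the points of a cluster
        n = len(points)
        return tuple(sum(coords) / n for coords in zip(*points))

    def lloyd(P, centers, max_iter=100):
        for _ in range(max_iter):
            # Partition: assign each point to its closest center
            clusters = [[] for _ in centers]
            for x in P:
                i = min(range(len(centers)), key=lambda j: dist(x, centers[j]))
                clusters[i].append(x)
            # Update: replace each center with the centroid of its cluster
            new_centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
            if new_centers == centers:        # no change: local optimum reached
                break
            centers = new_centers
        return centers, clusters

    P = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
    print(lloyd(P, [(0, 0), (9, 9)]))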
k-means++: center initialization strategy that yields clustering not too far from the optimal ones.
Computes a (initial) set 𝑆 of 𝑘 centers for 𝑃
Probability that a point 𝑝_𝑗 ∈ 𝑃 − 𝑆 is selected: 𝜋_𝑗 = (𝑑(𝑝_𝑗, 𝑆))² / ∑_{𝑞∈𝑃−𝑆} (𝑑(𝑞, 𝑆))², with ∑_{𝑗≥1} 𝜋_𝑗 = 1.
Pick 𝑥 ∈ [0,1] uniformly at random and set 𝑐_𝑖 = 𝑝_𝑟, where 𝑟 ∈ [1, |𝑃 − 𝑆|] is such that ∑_{𝑗=1}^{𝑟−1} 𝜋_𝑗 < 𝑥 ≤ ∑_{𝑗=1}^{𝑟} 𝜋_𝑗.
Let 𝐶_alg = Partition(𝑃, 𝑆) be the 𝑘-clustering of 𝑃 induced by the set 𝑆 of centers returned by 𝑘-means++. 𝜙_kmeans(𝐶_alg) is a random variable, and 𝐸[𝜙_kmeans(𝐶_alg)] ≤ 8(ln 𝑘 + 2) · 𝜙^opt_kmeans(𝑃, 𝑘).
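A minimal sketch of the k-means++ initialization described above, assuming Euclidean points as tuples; random.choices with weights proportional to the squared distances is equivalent to the cumulative-sum selection rule written above.

    import random
    from math import dist

    def kmeanspp_init(P, k, rng=random.Random(0)):
        centers = [rng.choice(P)]                       # first center: uniform at random
        for _ in range(k - 1):
            # squared distance of each point from the current set of centers
            d2 = [min(dist(p, c) for c in centers) ** 2 for p in P]
            # pick the next center with probability proportional to d2
            # (points already chosen as centers have d2 = 0, hence probability 0)
            centers.append(rng.choices(P, weights=d2, k=1)[0])
        return centers

    P = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
    print(kmeanspp_init(P, 2))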
Dealing with Big Data: pointset 𝑃 is too large for a single machine.
• Distributed (e.g., MapReduce) implementation of Lloyd's algorithm and 𝑘-means++.
• Coreset-based approach
Parallel k-means: variant in which a set of > 𝑘 candidate centers is selected in 𝑂(log 𝑁 ) parallel
rounds, and then 𝑘-centers are extracted from this set using a weighted version of 𝑘-means++.
Weighted k-means: each point 𝑝 ∈ 𝑃 is given an integer weight 𝑤(𝑝) > 0. It minimizes:
𝜙^w_kmeans(𝐶) = ∑_{𝑖=1}^{𝑘} ∑_{𝑝∈𝐶_𝑖} 𝑤(𝑝) · 𝑑(𝑝, 𝑐_𝑖)²
Most of the algorithms known for the standard 𝑘-means clustering problem (unit weights) can be
adapted to solve the case with general weights.
It is sufficient to regard each point 𝑝 as representing 𝑤(𝑝) identical copies of itself.
Lloyd's algorithm remains virtually identical except that:
• the centroid 𝑐_𝑖 of cluster 𝐶_𝑖 becomes (1 / ∑_{𝑝∈𝐶_𝑖} 𝑤(𝑝)) · ∑_{𝑝∈𝐶_𝑖} 𝑤(𝑝) · 𝑝
• the objective function 𝜙^w_kmeans(𝐶) is used
k-means++ remains virtually identical except that, in the selection of the 𝑖-th center 𝑐_𝑖, the probability for a point 𝑝 ∈ 𝑃 − 𝑆 to be selected becomes 𝑤(𝑝) · (𝑑(𝑝, 𝑆))² / ∑_{𝑞∈𝑃−𝑆} 𝑤(𝑞) · (𝑑(𝑞, 𝑆))².
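A small sketch of the two weighted adaptations above; representing the weights as a dict from point to integer weight is an assumption of this example.

    import random
    from math import dist

    def weighted_centroid(cluster, w):
        # c = (1 / sum_{p in C} w(p)) * sum_{p in C} w(p) * p
        total = sum(w[p] for p in cluster)
        d = len(cluster[0])
        return tuple(sum(w[p] * p[i] for p in cluster) / total for i in range(d))

    def weighted_kmeanspp_pick(P, S, w, rng=random.Random(0)):
        # probability of p proportional to w(p) * d(p, S)^2
        weights = [w[p] * min(dist(p, c) for c in S) ** 2 for p in P]
        return rng.choices(P, weights=weights, k=1)[0]

    w = {(0, 0): 3, (2, 0): 1, (9, 9): 5}
    print(weighted_centroid([(0, 0), (2, 0)], w))          # (0.5, 0.0)
    print(weighted_kmeanspp_pick(list(w), [(0, 0)], w))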
The coreset-based MapReduce algorithm (which runs an 𝛼-approximation algorithm 𝐴 for Weighted k-means clustering, with 𝛼 > 1, on the coreset) is an 𝑂(𝛼²)-approximation algorithm: 𝜙_kmeans(𝐶_alg) = 𝑂(𝛼² · 𝜙^opt_kmeans(𝑃, 𝑘)), i.e., the cost of the returned clustering is within an 𝑂(𝛼²) factor of the optimal cost.
𝑇 is a 𝜸-coreset for 𝑃, 𝑘 and 𝜙_kmeans if ∑_{𝑝∈𝑃} (𝑑(𝑝, 𝜋(𝑝)))² ≤ 𝛾 · 𝜙^opt_kmeans(𝑃, 𝑘), where:
• 𝑇 ⊆ 𝑃
• 𝜋: 𝑃 → 𝑇 is a proxy function
The smaller 𝛾, the better 𝑇 represents 𝑃 wrt the k-means problem, and it’s likely that a good solution
can be extracted from 𝑇, by weighing each point of 𝑇 with the # of points for which it acts as proxy.
2.5. k-median clustering
Finds a 𝑘-clustering 𝐶 = (𝐶1 , … , 𝐶𝑘 ) which minimizes 𝜙𝑘𝑚𝑒𝑑𝑖𝑎𝑛 (𝐶 ) = ∑𝑘𝑖=1 ∑𝑎∈𝐶𝑖 𝑑 (𝑎, 𝑐𝑖 )
• Requires that cluster centers belong to the input pointset.
• Works in arbitrary metric spaces (e.g., with the Manhattan distance).
• The local-search (swap-based) algorithm is a 5-approximation algorithm: 𝜙_kmedian(𝐶_alg) ≤ 5 · 𝜙^opt_kmedian(𝑃, 𝑘).
• is very slow in practice especially for large inputs since:
o the local search may require a large number of iterations to converge (𝑁𝑘). This can
be solved by stopping the iteration when the objective function does not decrease
significantly.
o in each iteration up to 𝑘(𝑁 − 𝑘) swaps may need to be checked, and for each swap
a new clustering must be computed.
• Faster alternative: an adaptation of Lloyd's algorithm where the center of a cluster 𝐶 is defined to be the point of 𝐶 that minimizes the sum of the distances to all other points of 𝐶 (the medoid).
The coreset-based MapReduce approach (using an 𝛼-approximation algorithm for the weighted problem on the coreset, 𝛼 > 1) is an 𝑂(𝛼²)-approximation algorithm: 𝜙_kmedian(𝐶_alg) = 𝑂(𝛼² · 𝜙^opt_kmedian(𝑃, 𝑘)).
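A minimal sketch of the center-update step used in the Lloyd-style adaptation mentioned above: the new center of a cluster is its medoid, i.e. the point of the cluster minimizing the sum of the distances to the other points (quadratic in the cluster size, so this is only an illustrative sketch).

    from math import dist

    def medoid(cluster):
        # point of the cluster that minimizes the sum of distances to all other points
        return min(cluster, key=lambda c: sum(dist(c, p) for p in cluster))

    C = [(0, 0), (1, 0), (2, 0), (3, 0)]
    print(medoid(C))        # (1, 0): one of the two points with the smallest total distance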
k-center vs k-means vs k-median
• All 3 problems tend to return spherically shaped clusters.
• They differ in the center selection strategy since, given the centers, the best clustering
around them is the same for the 3 problems.
• k-center is useful when we want to optimize the distance of every point from its center, hence minimizing the radii of the cluster spheres. It is also useful for coreset selection.
• k-median is useful for facility-location applications when targeting minimum average
distance from the centers. However, it is the most challenging problem from a
computational standpoint.
• k-means is useful when centers need not belong to the pointset and minimum average
distance is targeted. Optimized implementations are available in many software packages
(unlike k-median)
2.6. Cluster Evaluation
2.6.1. Clustering tendency
Assess whether the data contains meaningful clusters, namely clusters that are unlikely to occur in
random data.
Hopkins statistic: measures to what extent 𝑃 can be regarded as a random subset from 𝑀
𝐻(𝑃) = (∑_{𝑖=1}^{𝑡} 𝑢_𝑖) / (∑_{𝑖=1}^{𝑡} 𝑢_𝑖 + ∑_{𝑖=1}^{𝑡} 𝑤_𝑖) ∈ [0,1], for some fixed 𝑡 ≪ 𝑁 (typically 𝑡 < 0.1 · 𝑁), where:
• 𝑃 set of 𝑁 points in some metric space (𝑀, 𝑑 )
• 𝑋 = {𝑥1 , … , 𝑥𝑡 } random points from 𝑃
• 𝑌 = {𝑦1 , … , 𝑦𝑡 } random set of points from 𝑀 (metric space)
• 𝑤𝑖 = 𝑑 (𝑥𝑖 , 𝑃 − {𝑥𝑖 }) for 1 ≤ 𝑖 ≤ 𝑡
• 𝑢𝑖 = 𝑑 (𝑦𝑖 , 𝑃 − {𝑦𝑖 }) for 1 ≤ 𝑖 ≤ 𝑡
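A small sketch of the Hopkins statistic for points in the unit square; drawing the set 𝑌 uniformly from [0,1]² is an assumption made only for this example (how 𝑀 is sampled depends on the application).

    import random
    from math import dist

    def hopkins(P, t, rng=random.Random(0)):
        X = rng.sample(P, t)                                  # t random points of P
        Y = [(rng.random(), rng.random()) for _ in range(t)]  # t random points of M = [0,1]^2
        # w_i = d(x_i, P - {x_i}),  u_i = d(y_i, P)
        w = [min(dist(x, p) for p in P if p != x) for x in X]
        u = [min(dist(y, p) for p in P) for y in Y]
        return sum(u) / (sum(u) + sum(w))

    # Two tight groups of points in opposite corners of the unit square:
    # values close to 1 suggest a clustering tendency, values around 0.5 suggest random data.
    P  = [(i / 100, j / 100) for i in range(5) for j in range(5)]
    P += [(0.9 + i / 100, 0.9 + j / 100) for i in range(5) for j in range(5)]
    print(hopkins(P, t=5))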
In the case of 𝑘-center, 𝑘-means, and 𝑘-median, the value of the objective function is the most natural metric to assess the quality of a clustering or the relative quality of two clusterings.
However, the objective functions of 𝑘-center/means/median capture intra-cluster similarity but do
not assess whether distinct clusters are truly separated.
Approaches for measuring intra-cluster similarity and inter-cluster dissimilarity:
• Cohesion/Separation metrics
• Silhouette coefficient
Sum of the distances between a given point and a subset of points 𝑑𝑠𝑢𝑚 (𝑝, 𝐶 ) = ∑𝑞∈𝐶 𝑑 (𝑝, 𝑞)
Average distance between 𝑝 and 𝐶: 𝑑𝑠𝑢𝑚 (𝑝, 𝐶 )/|𝐶 |
Computing one 𝑑𝑠𝑢𝑚 (𝑝, 𝐶 ) can be done efficiently, even for large 𝐶, but computing many of them
at once may become too costly.
Cohesion: cohesion(𝐶) = [∑_{𝑖=1}^{𝑘} (1/2) ∑_{𝑝∈𝐶_𝑖} 𝑑_sum(𝑝, 𝐶_𝑖)] / [∑_{𝑖=1}^{𝑘} (𝑁_𝑖 choose 2)], where 𝑁_𝑖 = |𝐶_𝑖|, i.e., the average distance between two points of the same cluster.
Average silhouette coefficient: 𝑠_𝐶 = (1/|𝑃|) ∑_{𝑝∈𝑃} 𝑠_𝑝, where, for each point 𝑝, 𝑠_𝑝 = (𝑏_𝑝 − 𝑎_𝑝)/max{𝑎_𝑝, 𝑏_𝑝}, with 𝑎_𝑝 the average distance of 𝑝 from the points of its own cluster and 𝑏_𝑝 the minimum, over the other clusters, of the average distance of 𝑝 from their points.
• measures the quality of the clustering 𝐶
• Good: 𝑠_𝐶 ≅ 1 (i.e., 𝑎_𝑝 ≪ 𝑏_𝑝 for most points)
Computing cohesion, separation, or the average silhouette coefficient exactly requires computing all pairwise distances between points, which is impractical for very large 𝑷.
Approximation of these sums: 𝑑̃_sum(𝑝, 𝐶, 𝑡) = (𝑛/𝑡) ∑_{𝑖=1}^{𝑡} 𝑑(𝑝, 𝑥_𝑖), where:
• 𝑛 = |𝐶|
• 𝑥_1, …, 𝑥_𝑡 are 𝑡 random points from 𝐶, drawn with replacement and uniform probability
• it is a random variable
• it is an unbiased estimator of 𝑑_sum(𝑝, 𝐶)
For any 𝜖 ∈ (0,1), if we set 𝑡 = 𝛼 · log(𝑛)/𝜖² for a suitable constant 𝛼, then with high probability:
𝑑_sum(𝑝, 𝐶) − 𝛿 ≤ 𝑑̃_sum(𝑝, 𝐶, 𝑡) ≤ 𝑑_sum(𝑝, 𝐶) + 𝛿, where 𝛿 = 𝜖 · 𝑛 · max_{𝑦∈𝐶} 𝑑(𝑝, 𝑦).
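A small sketch of the sampled estimator 𝑑̃_sum(𝑝, 𝐶, 𝑡) described above (sampling with replacement, uniform probability):

    import random
    from math import dist

    def dsum(p, C):
        # exact sum of the distances between p and the points of C
        return sum(dist(p, q) for q in C)

    def dsum_approx(p, C, t, rng=random.Random(0)):
        # (n/t) * sum of the distances from p to t points of C
        # drawn uniformly at random with replacement
        sample = rng.choices(C, k=t)
        return len(C) / t * sum(dist(p, x) for x in sample)

    C = [(i, i % 7) for i in range(1000)]
    p = (0, 0)
    print(dsum(p, C), dsum_approx(p, C, t=100))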
3. Association Analysis
3.1. Basics
Set 𝐼 of 𝑑 items
Transaction 𝑡 ⊆ 𝐼
Dataset 𝑇 = {𝑡1 , … , 𝑡𝑁 } of 𝑁 transactions over 𝐼, with 𝑡𝑖 ⊆ 𝐼 for 1 ≤ 𝑖 ≤ 𝑁
Itemset 𝑋 ⊆ 𝐼
• 𝑇_𝑋 ⊆ 𝑇 is the subset of transactions that contain 𝑋
• Support of 𝑋 w.r.t. 𝑇: Supp_𝑇(𝑋) = |𝑇_𝑋| / 𝑁, the fraction of transactions of 𝑇 that contain 𝑋
o Supp_𝑇(∅) = 1
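A tiny illustrative sketch of the support of an itemset over a dataset of transactions represented as Python sets:

    def support(T, X):
        # fraction of the transactions of T that contain the itemset X
        X = set(X)
        return sum(1 for t in T if X <= t) / len(T)

    T = [{"bread", "milk"}, {"bread", "beer"}, {"milk", "beer"}, {"bread", "milk", "beer"}]
    print(support(T, {"bread", "milk"}))   # 2 out of 4 transactions -> 0.5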
Support and confidence measure the interestingness of a pattern (itemset or rule).
• thresholds minsup and minconf define which patterns must be regarded as interesting.
Ideally, we would like that the support and confidence (for rules) of the returned patterns be
unlikely to be seen in a random dataset.
The choice of minsup and minconf directly influences:
• Output size: low thresholds may yield too many patterns, which become hard to exploit
effectively.
• False positive/negatives: low thresholds may yield a lot of uninteresting patterns (false
positives), while high thresholds may miss some interesting patterns (false negatives).
Lattice of Itemsets
The family of itemsets under ⊆ forms a lattice:
• Partially ordered set
• For each two elements 𝑋, 𝑌 there is: a unique least upper bound (𝑋 ∪ 𝑌) and a unique
greatest lower bound (𝑋 ∩ 𝑌)
Represented through the Hasse diagram:
• nodes contain itemsets
• two nodes are connected if one is contained in the other and they differ by exactly one item.
3.2. Mining of Frequent Itemsets and Association Rules
Input: Dataset 𝑇 of 𝑁 transactions over 𝐼, minsup and minconf
Output (Frequent Itemsets): 𝐹𝑇,minsup = {(𝑋, Supp𝑇 (𝑋)): 𝑋 ≠ ∅ 𝑎𝑛𝑑 Supp𝑇 (𝑋) ≥ minsup}
Output (Association Rules):
{(𝑟: 𝑋 → 𝑌, Supp 𝑇 (𝑟), Conf 𝑇 (𝑟)): Supp𝑇 (𝑟) ≥ minsup 𝑎𝑛𝑑 Conf 𝑇 (𝑟) ≥ minconf}
apriori-gen(F):
• Candidate generation: merge pairs of itemsets from 𝐹 that have all but one item in common.
E.g., itemsets "ABCD" and "ABCE" in 𝐹 generate "ABCDE".
• Candidate pruning: exclude candidates containing an infrequent subset.
E.g., "ABCDE" survives only if all its length-4 subsets are in 𝐹.
The pruning is done a priori, without computing supports.
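A minimal sketch of apriori-gen, with frequent 𝑘-itemsets represented as sorted tuples; merging pairs that share their first 𝑘−1 items is the usual prefix-based way of realizing "all but one item in common" without generating duplicates, and is an implementation choice of this sketch.

    from itertools import combinations

    def apriori_gen(F):
        # F: collection of frequent k-itemsets, each a sorted tuple of items
        F = sorted(F)
        k = len(F[0])
        candidates = set()
        # Candidate generation: merge pairs sharing the first k-1 items
        for a, b in combinations(F, 2):
            if a[:-1] == b[:-1]:
                candidates.add(a + (b[-1],))
        # Candidate pruning: keep a candidate only if all its k-subsets are frequent
        frequent = set(F)
        return {c for c in candidates
                if all(tuple(s) in frequent for s in combinations(c, k))}

    F3 = [("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"), ("B", "C", "D"), ("A", "B", "E")]
    print(sorted(apriori_gen(F3)))   # only ('A', 'B', 'C', 'D') survives the pruning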
Efficiency of A-Priori:
• just 𝑘𝑚𝑎𝑥 + 1 passes over the dataset, where 𝑘𝑚𝑎𝑥 is the length of the longest frequent
itemset.
• Support is computed only for a few non-frequent itemsets: this is due to candidate generation and pruning, which exploit the antimonotonicity of support.
• Several optimizations are known for the computation of the supports of the candidates, which is typically the most time-consuming task.
• The algorithm computes the support for ≤ 𝑑 + min{𝑀², 𝑑 · 𝑀} itemsets, where 𝑀 is the number of frequent itemsets returned at the end and 𝑑 is the number of items in 𝐼.
• Can be implemented in time polynomial in both the input size (sum of all transaction lengths)
and the output size (sum of all frequent itemsets lengths).
Other approaches to frequent itemsets mining: depth-first mining strategies
• avoiding several passes over the entire dataset of transactions.
• confining the support counting of longer itemsets to suitable small projections of the
dataset, typically much smaller than the original one.
3.2.2. Mining association rules
Let
• 𝑇 be a set of 𝑁 transactions over the set 𝐼 of 𝑑 items
• minsup, minconf ∈ (0,1) be the support/confidence thresholds
• 𝑟: 𝑋 → 𝑌 be an association rule, with 𝑋, 𝑌 non-empty disjoint itemsets
• Supp_𝑇(𝑟) = Supp_𝑇(𝑋 ∪ 𝑌)
• Conf_𝑇(𝑟) = Supp_𝑇(𝑋 ∪ 𝑌)/Supp_𝑇(𝑋)
Compute: {(𝑟: 𝑋 → 𝑌, Supp𝑇 (𝑟), Conf 𝑇 (𝑟)): Supp 𝑇 (𝑟) ≥ minsup 𝑎𝑛𝑑 Conf 𝑇 (𝑟) ≥ minconf}
If 𝐹 is the set of frequent itemsets w.r.t. minsup, then for every rule 𝑟: 𝑋 → 𝑌 that we target, we
have that 𝑋, 𝑋 ∪ 𝑌 ∈ 𝐹
Mining strategy:
• Compute the frequent itemsets w.r.t. minsup and their supports
• For each frequent itemset 𝑍 compute {𝑟: 𝑍 − 𝑌 → 𝑌 𝑠𝑡. 𝑌 ⊂ 𝑍 𝑎𝑛𝑑 Conf 𝑇 (𝑟) ≥ minconf}
Each rule derived from 𝑍 automatically satisfies the support constraint since 𝑍 is frequent.
Conversely, rules derived from itemsets which are not frequent need not be checked, since they would not satisfy the support constraint.
If
• 𝑌′ ⊂ 𝑌 ⊂ 𝑍
• 𝑟1 = 𝑍 − 𝑌 ′ → 𝑌 ′
• 𝑟2 = 𝑍 − 𝑌 → 𝑌
Then: Conf 𝑇 (𝑟1 ) < minconf ⇒ Conf 𝑇 (𝑟2 ) < minconf
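A small sketch of the rule-generation step for a single frequent itemset 𝑍; representing supports as a dict from frozensets to values (supp) is an assumption, and this simple version checks every non-empty proper subset 𝑌 of 𝑍 rather than exploiting the confidence anti-monotonicity above.

    from itertools import combinations

    def rules_from_itemset(Z, supp, minconf):
        Z = frozenset(Z)
        rules = []
        for r in range(1, len(Z)):                      # size of the consequent Y
            for Y in combinations(Z, r):
                Y = frozenset(Y)
                conf = supp[Z] / supp[Z - Y]            # Conf(Z-Y -> Y) = Supp(Z)/Supp(Z-Y)
                if conf >= minconf:
                    rules.append((Z - Y, Y, supp[Z], conf))
        return rules

    supp = {frozenset("A"): 0.6, frozenset("B"): 0.5, frozenset("AB"): 0.4}
    print(rules_from_itemset("AB", supp, minconf=0.7))
    # keeps B -> A (confidence 0.8) and discards A -> B (confidence 0.667)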
3.3. Frequent itemset mining for Big Data
Two approaches can be used when the dataset 𝑇 is very large:
• Partition-based approach
• Sampling-based approach
• The size of the sample is independent of the support threshold minsup and of the number
𝑁 of transactions. It only depends on the approximation guarantee embodied in the
parameters 𝜖, 𝛿, and on the max transaction length h, which is often quite low.
• Tighter bounds on the sample size are known.
• The sampling-based approach yields a 2-round MapReduce algorithm: in the first round, a sample of suitable size is extracted; in the second round, the approximate set of frequent itemsets is mined from the sample (e.g., through A-Priori) within a single reducer.
3.4. Limitations
Limitations of the Support/Confidence framework:
• Redundancy: many returned patterns may characterize the same subpopulation of data
(e.g., transactions/customers).
• Difficult control of output size: it is hard to predict how many patterns will be returned for
given support/confidence thresholds.
• Significance: are the returned patterns significant, interesting?
Itemset 𝑋 ⊆ 𝐼 is Closed wrt 𝑇: if for each superset 𝑌 ⊃ 𝑋 we have Supp𝑇 (𝑌) < Supp𝑇 (𝑋)
It means that 𝑋 is closed if its support decreases as soon as an item is added.
• CLO 𝑇 = {𝑋 ⊆ 𝐼: 𝑋 is closed wrt 𝑇} set of closed itemsets
• CLO-F𝑇,minsup = {𝑋 ∈ CLO 𝑇 : Supp𝑇 (𝑋) ≥ minsup} set of frequent closed itemsets
Itemset 𝑋 ⊆ 𝐼 is Maximal wrt 𝑇 and minsup: if Supp𝑇 (𝑋) ≥ minsup and for each superset 𝑌 ⊃ 𝑋
we have Supp 𝑇 (𝑌) < minsup
It means that 𝑋 is maximal if it is frequent and becomes infrequent as soon as an item is added.
• MAX 𝑇,minsup = {𝑋 ⊆ 𝐼: 𝑋 is maximal wrt 𝑇 𝑎𝑛𝑑 minsup}
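A small sketch of the two definitions, checking closedness and maximality by brute force against the immediate supersets of 𝑋 (which suffices by the antimonotonicity of support); the support function is recomputed here to keep the sketch self-contained, and this is only suitable for tiny examples.

    def support(T, X):
        X = set(X)
        return sum(1 for t in T if X <= t) / len(T)

    def is_closed(T, I, X):
        # closed: every superset has strictly smaller support
        # (checking supersets with one extra item suffices, by antimonotonicity)
        return all(support(T, set(X) | {a}) < support(T, X) for a in set(I) - set(X))

    def is_maximal(T, I, X, minsup):
        # maximal: frequent, and every superset is infrequent
        return support(T, X) >= minsup and \
               all(support(T, set(X) | {a}) < minsup for a in set(I) - set(X))

    I = {"A", "B", "C"}
    T = [{"A", "B"}, {"A", "B"}, {"A", "B", "C"}, {"A"}]
    print(is_closed(T, I, {"A", "B"}), is_maximal(T, I, {"A", "B"}, minsup=0.5))   # True True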
Consequences of the properties:
MAX and CLO-F provide compact and lossless representations of 𝐹:
• Determine 𝐹 by taking all subsets of the itemsets in MAX or CLO-F.
CLO-F (+ supports) provides a compact and lossless representation of 𝐹 (+ supports):
• Determine 𝐹 from CLO-F as above
• For each 𝑋 ∈ 𝐹 compute Supp𝑇 (𝑋) = max{Supp𝑇 (𝑌): 𝑌 ∈ CLO-F and 𝑋 ⊆ 𝑌}
Maximal/frequent closed itemsets yield a quite effective reduction of the output size.
However, there are non-trivial pathological cases where their number is exponential in the input
size.
3.4.2. Top-K frequent itemsets
Controls output size.
Let
• 𝑋_1, 𝑋_2, … an enumeration of the non-empty itemsets in non-increasing order of support.
• For a given integer 𝐾 ≥ 1
• 𝑠(𝐾 ) = Supp𝑇 (𝑋𝐾 )
The Top-K frequent itemsets w.r.t. 𝑇 are all itemsets of support ≥ 𝑠(𝐾 )
The number of Top-𝐾 frequent itemsets is ≥ 𝐾
Let
• 𝑋_1, 𝑋_2, … an enumeration of the non-empty closed itemsets in non-increasing order of support
• For a given integer 𝐾 ≥ 1
• 𝑠𝑐(𝐾 ) = Supp𝑇 (𝑋𝐾 )
The Top-K frequent Closed itemsets w.r.t. 𝑇 are all itemsets of support ≥ 𝑠𝑐 (𝐾 )
There are efficient algorithms for mining the Top-𝐾 frequent (closed) itemsets. A popular strategy
is this:
• Generate (closed) itemsets in non-increasing order of support (using a priority queue).
• Stop at the first itemset whose support is smaller than that of the 𝐾-th one.
For Top-𝐾 frequent closed itemsets there is a tight polynomial bound on the output size.
For Top-𝐾 frequent itemsets there are pathological cases with exponential output size.
Control on the output size
For the Top-𝐾 frequent closed itemsets, 𝐾 provides tight control on the output size.
For 𝐾 > 0, the number of Top-𝑲 frequent closed itemsets is 𝑶(𝒅 · 𝑲), where 𝑑 = number of items.
For small 𝐾 (at most polynomial in the input size) the number of Top-𝐾 frequent closed itemsets
will be polynomial in the input size.
Maximal itemsets:
• Lossless representation of the frequent itemsets
• In practice, much fewer than the frequent itemsets, but, in pathological cases, still
exponential in the input size.
• Reduction of redundancy
Frequent Closed itemsets:
• Lossless representation of the frequent itemsets with supports
• In practice, much fewer than the frequent itemsets, but, in pathological cases, still
exponential in the input size.
• Reduction of redundancy
Top-𝑲 frequent (closed) itemsets:
• Output size: 𝑂(𝑑 · 𝐾) if restricted to closed itemsets; otherwise small in practice but exponential in 𝑑 in pathological cases.
• Reduction of redundancy and control on the output size (with closed itemsets)