Big Data Computing
Map Reduce, Clustering, Association Analysis
(Version 26/06/2020)
Edited by:
Stefano Ivancich
CONTENTS
1. Map Reduce
1.1. Intro
1.2. MapReduce computation
1.3. Basic Techniques and Primitives
1.3.1. Partitioning Technique
1.3.2. Efficiency-accuracy tradeoffs
1.3.3. Exploiting samples
2. Clustering
2.1. Distance
2.2. Center-based clustering
2.3. k-center clustering
2.4. k-means clustering
2.5. k-median clustering
2.6. Cluster Evaluation
2.6.1. Clustering tendency
2.6.2. Unsupervised evaluation
2.6.3. Supervised evaluation
3. Association Analysis
3.1. Basics
3.2. Mining of Frequent Itemsets and Association Rules
3.2.1. Mining Frequent Itemsets
3.2.2. Mining association rules
3.3. Frequent itemset mining for Big Data
3.4. Limitations
3.4.1. Closed and Maximal itemsets
3.4.2. Top-K frequent itemsets
This document was written by students with no intention of replacing university materials. It is a useful tool for studying the subject, but it does not guarantee a preparation as exhaustive and complete as the material recommended by the University.
The purpose of this document is to summarize the fundamental concepts of the notes taken during the lessons, rewritten, corrected and completed by referring to the slides, so that it can be used as a "practical and quick" manual to consult when designing Big Data systems. Detailed explanations are not provided; for these, please refer to the cited texts and slides.
1.2. MapReduce computation
Map-Reduce computation:
• Is a sequence of rounds.
• Each round transforms a set of key-value pairs into another set of key-value pairs (data
centric view), through the following two phases:
o Map phase: a user-specified map function is applied separately to each input key-
value pair and produces ≥ 𝟎 other key-value pairs, referred to as intermediate key-
value pairs.
o Reduce phase: the intermediate key-value pairs are grouped by key and a user-
specified reduce function is applied separately to each group of key-value pairs with
the same key, producing ≥ 0 other key-value pairs, which is the output of the round.
• The output of a round is the input of the next round.
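As a concrete illustration of a single round (data-centric view), here is a minimal simulated word-count round in Python. It is only a sketch: the names map_fn and reduce_fn and the in-memory grouping are assumptions of this example, not the API of any specific framework.

    from collections import defaultdict

    def map_fn(key, value):
        # value is a line of text; the key of the input element is ignored
        for word in value.split():
            yield (word, 1)                      # >= 0 intermediate key-value pairs

    def reduce_fn(key, values):
        yield (key, sum(values))                 # >= 0 output key-value pairs

    def run_round(pairs):
        # Map phase: apply map_fn separately to each input key-value pair
        intermediate = [kv for k, v in pairs for kv in map_fn(k, v)]
        # Group the intermediate pairs by key (the shuffle)
        groups = defaultdict(list)
        for k, v in intermediate:
            groups[k].append(v)
        # Reduce phase: apply reduce_fn separately to each group
        return [kv for k, vs in groups.items() for kv in reduce_fn(k, vs)]

    print(run_round([(0, "a rose is a rose"), (1, "a dog")]))
    # e.g. [('a', 3), ('rose', 2), ('is', 1), ('dog', 1)]

The output of run_round could then be fed as the input of the next round.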
Implementation of a round:
• The input file consists of elements, which can be of any type (tuples, documents, …). A chunk is a collection of elements, and no element is stored across two chunks. Keys of input elements are not relevant, and we tend to ignore them.
• The input file is split into 𝑋 chunks, and each chunk is the input of a map task.
• Each map task is assigned to a worker (a compute node) which applies the map function to
each key-value pair of the corresponding chunk, buffering the intermediate key-value pairs
it produces in its local disk.
A Map task can produce several key-value pairs with the same key, even from the same
element.
• The intermediate key-value pairs, while residing on the local disks, are grouped by key, and the values associated with each key are formed into a list of values.
This is done by partitioning the pairs into 𝑌 buckets through a hash function ℎ: a pair (𝑘, 𝑣) is put into bucket 𝑖 = ℎ(𝑘) mod 𝑌.
Hashing helps balance the load across the workers.
• Each bucket is the input of a different reduce task, which is assigned to a worker.
The input of the reduce function is a pair of the form (𝑘, [𝑣1, …, 𝑣𝑘]).
The output of the reduce function is a sequence of ≥ 0 key-value pairs.
Reducer: the application of the reduce function to a single key and its associated list of values.
A reduce task executes one or more reducers.
• The outputs from all the reduce tasks are merged into a single file written to the Distributed File System.
• The user program is forked into a master process and several worker processes. The master
is in charge of assigning map and reduce tasks to the various workers, and to monitor their
status (idle, in-progress, completed).
• Input and output files reside on a Distributed File System, while intermediate data are stored
on the workers' local disks.
• The round involves a data shuffle for moving the intermediate key-value pairs from the
compute nodes where they were produced (by map tasks) to the compute nodes where they
must be processed (by reduce tasks).
Shuffle is often the most expensive operation of the round
• The values 𝑋 and 𝑌 are design parameters.
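A small sketch of how intermediate pairs could be assigned to the 𝑌 buckets described above; the choice 𝑌 = 4 and the use of Python's built-in hash are only illustrative assumptions.

    Y = 4                                   # number of buckets / reduce tasks (illustrative)

    def bucket_of(key, Y):
        # (k, v) goes to bucket i = h(k) mod Y
        return hash(key) % Y

    intermediate = [("rose", 1), ("dog", 1), ("rose", 1), ("a", 1)]
    buckets = {i: [] for i in range(Y)}
    for k, v in intermediate:
        buckets[bucket_of(k, Y)].append((k, v))
    # All pairs with the same key end up in the same bucket,
    # and hashing spreads distinct keys roughly evenly across buckets.
    print(buckets)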
Pros of Map Reduce:
• Data-centric view: Algorithm design can focus on data transformations, targeting the design
goals. Programming frameworks (e.g., Spark) usually make allocation of tasks to workers,
data management, handling of failures, totally transparent to the programmer.
• Portability/Adaptability: Applications can run on different platforms and the underlying
implementation of the framework will do best effort to exploit parallelism and minimize data
movement costs.
• Cost: MapReduce applications can be run on moderately expensive platforms, and many
popular cloud providers support their execution
Cons of Map Reduce:
• Weakness of the number of rounds as performance measure. It ignores the runtimes of
map and reduce functions and the actual volume of data shuffled. More sophisticated (yet,
less usable) performance measures exist.
• Curse of the last reducer: In some cases, one or a few reducers may be much slower than
the other ones, thus delaying the end of the round. When designing MapReduce algorithms,
one should try to ensure load balancing (if at all possible) among reducers.
1.3. Basic Techniques and Primitives
Chernoff bound: let 𝑋 be a Binomial random variable 𝐵𝑖𝑛(𝑛, 𝑝), with 𝜇 = 𝐸[𝑋] = 𝑛𝑝.
For every 𝛿1 ≥ 5 and 𝛿2 ∈ (0,1):
• Pr(𝑋 ≥ (1 + 𝛿1)𝜇) ≤ 2^(−(1+𝛿1)𝜇)
• Pr(𝑋 ≤ (1 − 𝛿2)𝜇) ≤ 2^(−𝜇·𝛿2²/2)
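As a purely illustrative numeric check of the lower-tail bound (the values 𝑛 = 10⁶ and 𝑝 = 10⁻³ are arbitrary choices for this example):

    n, p = 10**6, 10**-3
    mu = n * p                 # expected value: 1000
    delta2 = 0.5
    # Pr(X <= (1 - delta2)*mu) <= 2^(-mu * delta2^2 / 2)
    exponent = mu * delta2**2 / 2
    print(f"Pr(X <= {(1 - delta2) * mu:.0f}) <= 2^(-{exponent:.0f})")   # 2^(-125)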
1.3.2. Efficiency-accuracy tradeoffs
For some problems an exact MapReduce algorithm may be too costly (require large 𝑅, 𝑀𝐿 , or 𝑀𝐴 )
If it is acceptable for the application, we relax the requirement of an exact solution and search for an approximate solution.
E.g.: Sorting
2. Clustering
Given a set of points belonging to some space, with a notion of distance between points, clustering
aims at grouping the points into a number of subsets (clusters) such that:
• Points in the same cluster are "close" to one another
• Points in different clusters are "distant" from one another
The distance captures a notion of similarity: close points are similar, distant points are dissimilar.
2.1. Distance
Metric Space: ordered pair (𝑀, 𝑑 ) where:
• 𝑀 is a set
• 𝑑 (∙) is a metric on 𝑀. 𝑑: 𝑀 × 𝑀 → ℝ s.t. ∀𝑥, 𝑦, 𝑧 ∈ 𝑀
o 𝑑 (𝑥, 𝑦) ≥ 0
o 𝑑 (𝑥, 𝑦) = 0 ⟺ 𝑥 = 𝑦
o 𝑑 (𝑥, 𝑦) = 𝑑 (𝑦, 𝑥 ) symmetry
o 𝑑 (𝑥, 𝑧) ≤ 𝑑 (𝑥, 𝑦) + 𝑑 (𝑦, 𝑧) triangle inequality
Different distance functions can be used to analyze dataset through clustering. The choice of the
function may have a significant impact on the effectiveness of the analysis.
Distance functions:
• Minkowski distances: 𝑑_{𝐿𝑟}(𝑋, 𝑌) = (∑_{𝑖=1}^{𝑛} |𝑥_𝑖 − 𝑦_𝑖|^𝑟)^{1/𝑟}, where the vectors 𝑋, 𝑌 ∈ ℝ𝑛 and 𝑟 > 0
o 𝑟 = 2: Euclidean distance (straight-line, "as the crow flies") ‖∙‖. Aggregates the gap in each dimension between two objects.
o 𝑟 = 1: Manhattan distance. Used in grid-like environments.
o 𝑟 = ∞: Chebyshev distance, max_{1≤𝑖≤𝑛} |𝑥_𝑖 − 𝑦_𝑖|
• Cosine (or angular) distance: 𝑑_cosine(𝑋, 𝑌) = arccos((𝑋·𝑌)/(‖𝑋‖·‖𝑌‖)) = arccos((∑_{𝑖=1}^{𝑛} 𝑥_𝑖 𝑦_𝑖) / (√(∑_{𝑖=1}^{𝑛} 𝑥_𝑖²) · √(∑_{𝑖=1}^{𝑛} 𝑦_𝑖²)))
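A small self-contained sketch of these distance functions (plain Python, no external libraries; vectors are tuples of floats):

    from math import acos, sqrt, inf

    def minkowski(x, y, r):
        if r == inf:                                  # Chebyshev distance
            return max(abs(a - b) for a, b in zip(x, y))
        return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

    def cosine_distance(x, y):
        dot = sum(a * b for a, b in zip(x, y))
        norm_x = sqrt(sum(a * a for a in x))
        norm_y = sqrt(sum(b * b for b in y))
        return acos(dot / (norm_x * norm_y))          # angle between the two vectors

    x, y = (1.0, 2.0, 3.0), (4.0, 0.0, 3.0)
    print(minkowski(x, y, 2))      # Euclidean
    print(minkowski(x, y, 1))      # Manhattan
    print(minkowski(x, y, inf))    # Chebyshev
    print(cosine_distance(x, y))   # angular distance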
Curse of dimensionality: issues that arise when processing data in high-dimensional spaces.
• Quality: distance functions may lose effectiveness in assessing similarity. Points in a high-dimensional metric space tend to be sparse, almost equally distant from one another, and almost orthogonal (as vectors) to one another.
• Performance: running times may have a linear or even exponential dependency on the dimension 𝑛.
o Running time of a clustering algorithm: 𝑂(𝐶_dist · 𝑁_dist) = (cost of one distance computation) · (number of distance computations)
o Nearest Neighbor Search problem: given a set 𝑆 ⊆ ℝ𝑛 of 𝑚 points, construct a data
structure that, for a given query point 𝑞 ∈ ℝ𝑛 , efficiently returns the closest point
𝑥 ∈ 𝑆 to 𝑞. The problem requires time: Θ(min{𝑛 ∗ 𝑚, 2𝑛 })
2.2. Center-based clustering
Family of clustering problems used in data analysis.
Partitioning primitive: assign each point 𝑥 ∈ 𝑃 to the cluster 𝐶_𝑖 whose center 𝑐_𝑖 is closest to 𝑥.
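A minimal sketch of this partitioning primitive Partition(P, S), assuming Euclidean points represented as tuples (math.dist, available from Python 3.8, is used for the distance):

    from math import dist   # Euclidean distance

    def partition(P, S):
        # Assign each point x in P to the cluster C_i whose center c_i is closest to x.
        clusters = [[] for _ in S]
        for x in P:
            i = min(range(len(S)), key=lambda j: dist(x, S[j]))
            clusters[i].append(x)
        return clusters

    P = [(0, 0), (1, 0), (9, 9), (10, 10)]
    S = [(0, 0), (10, 10)]        # centers
    print(partition(P, S))        # [[(0, 0), (1, 0)], [(9, 9), (10, 10)]]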
2.3. k-center clustering
Provides a strong guarantee on how close each point is to the center of its cluster.
For noisy pointsets (e.g., pointsets with outliers) the clustering which optimizes the k-center
objective may obfuscate some “natural" clustering inherent in the data.
Farthest-First Traversal is a 2-approximation algorithm: 𝜙_kcenter(𝐶_alg) ≤ 2 · 𝜙^opt_kcenter(𝑃, 𝑘).
It means that each point is at distance at most 2 · 𝜙^opt_kcenter(𝑃, 𝑘) from its closest center.
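A minimal sketch of the sequential Farthest-First Traversal, assuming Euclidean points as tuples; choosing P[0] as the first center is an arbitrary simplification (any starting point can be used).

    from math import dist

    def farthest_first_traversal(P, k):
        centers = [P[0]]                          # first center: an arbitrary point of P
        # distance of every point from its closest center selected so far
        d_min = [dist(p, centers[0]) for p in P]
        for _ in range(k - 1):
            # next center: the point farthest from the current set of centers
            i = max(range(len(P)), key=lambda j: d_min[j])
            centers.append(P[i])
            d_min = [min(d_min[j], dist(P[j], P[i])) for j in range(len(P))]
        return centers

    P = [(0, 0), (1, 1), (10, 0), (0, 10), (10, 10)]
    print(farthest_first_traversal(P, 3))

Keeping the array d_min of distances to the closest current center is what allows each of the 𝑘 − 1 iterations to be done with a single scan of 𝑃.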
Dealing with Big Data: pointset 𝑃 is too large for a single machine.
Farthest-First Traversal requires 𝑘 − 1 scans of pointset 𝑃 which is impractical for massive pointsets
and not so small 𝑘.
Coreset for K-Center:
• Select a coreset 𝑇 ⊂ 𝑃, making sure that each 𝑥 ∈ 𝑃 − 𝑇 is close to some point of 𝑇 (at distance approximately within 𝜙^opt_kcenter(𝑃, 𝑘)).
• Search for the set 𝑆 of 𝑘 final centers in 𝑇, so that each point of 𝑇 is at distance approximately within 𝜙^opt_kcenter(𝑃, 𝑘) from the closest center.
By combining the above two points, we gather that each point of 𝑃 is at distance 𝑂(𝜙^opt_kcenter(𝑃, 𝑘)) from the closest center.
MapReduce-Farthest-First Traversal
Is a 4-approximation algorithm: 𝜙_kcenter(𝐶_alg) ≤ 4 · 𝜙^opt_kcenter(𝑃, 𝑘).
It means that each point is at distance at most 4 · 𝜙^opt_kcenter(𝑃, 𝑘) from its closest center.
The sequential Farthest-First Traversal algorithm is used both to extract the coreset and to compute
the final set of centers. It provides a good coreset since it ensures that any point not in the coreset
be well represented by some coreset point.
By selecting 𝑘′ > 𝑘 centers from each subset 𝑃_𝑖 in Round 1, the quality of the final clustering improves. In fact, it can be shown that when 𝑃 satisfies certain properties and 𝑘′ is sufficiently large, MR-Farthest-First Traversal can almost match the approximation quality of the sequential Farthest-First Traversal, while still using sublinear local space and linear aggregate space.
2.4. k-means clustering
Aims at optimizing average (squared) distances.
Finds a 𝑘-clustering 𝐶 = (𝐶1, …, 𝐶𝑘) which minimizes 𝜙_kmeans(𝐶) = ∑_{𝑖=1}^{𝑘} ∑_{𝑎∈𝐶_𝑖} 𝑑(𝑎, 𝑐_𝑖)²
• Assumes that centers need not necessarily belong to the input pointset.
• Aims at minimizing cluster variance and works well for discovering ball-shaped clusters.
• Because of the quadratic dependence on distances, 𝑘-means clustering is rather sensitive to
outliers (though less than 𝑘-center).
Centroid of a set 𝑃: 𝑐(𝑃) = (1/𝑁) ∑_{𝑋∈𝑃} 𝑋 (component-wise sum).
It is the point that minimizes the sum of the squared distances to all the points of 𝑃: for every point 𝑌, ∑_{𝑋∈𝑃} 𝑑(𝑋, 𝑐(𝑃))² ≤ ∑_{𝑋∈𝑃} 𝑑(𝑋, 𝑌)².
Lloyd's algorithm
• Always terminates but the number of iterations can be exponential in the input size.
• May be trapped into a local optimum.
• The quality of the solution and the speed of convergence of Lloyd's algorithm depend considerably on the choice of the initial set of centers.
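A minimal sketch of Lloyd's algorithm, assuming Euclidean points as tuples and an externally supplied initial set of centers (e.g., from k-means++ below); the max_iter safeguard is an assumption of this sketch.

    from math import dist

    def centroid(points):
        # component-wise average of the points of a cluster
        n = len(points)
        return tuple(sum(coords) / n for coords in zip(*points))

    def lloyd(P, centers, max_iter=100):
        for _ in range(max_iter):
            # Partition: assign each point to its closest center
            clusters = [[] for _ in centers]
            for x in P:
                i = min(range(len(centers)), key=lambda j: dist(x, centers[j]))
                clusters[i].append(x)
            # Update: replace each center with the centroid of its cluster
            new_centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
            if new_centers == centers:        # no change: local optimum reached
                break
            centers = new_centers
        return centers, clusters

    P = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
    print(lloyd(P, [(0, 0), (9, 9)]))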
k-means++: center initialization strategy that yields clustering not too far from the optimal ones.
Computes a (initial) set 𝑆 of 𝑘 centers for 𝑃
Probability that a point 𝑝_𝑗 ∈ 𝑃 − 𝑆 is selected: 𝜋_𝑗 = (𝑑(𝑝_𝑗, 𝑆))² / ∑_{𝑞∈𝑃−𝑆} (𝑑(𝑞, 𝑆))², with ∑_{𝑗≥1} 𝜋_𝑗 = 1.
Pick 𝑥 ∈ [0,1] uniformly at random and set 𝑐_𝑖 = 𝑝_𝑟, where 𝑟 ∈ [1, |𝑃 − 𝑆|] is such that ∑_{𝑗=1}^{𝑟−1} 𝜋_𝑗 < 𝑥 ≤ ∑_{𝑗=1}^{𝑟} 𝜋_𝑗.
Let 𝐶_alg = Partition(𝑃, 𝑆) be the 𝑘-clustering of 𝑃 induced by the set 𝑆 of centers returned by 𝑘-means++. 𝜙_kmeans(𝐶_alg) is a random variable, and 𝐸[𝜙_kmeans(𝐶_alg)] ≤ 8(ln 𝑘 + 2) · 𝜙^opt_kmeans(𝑃, 𝑘).
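A minimal sketch of the k-means++ initialization described above, assuming Euclidean points as tuples; random.choices with weights proportional to the squared distances is equivalent to the cumulative-sum selection rule written above.

    import random
    from math import dist

    def kmeanspp_init(P, k, rng=random.Random(0)):
        centers = [rng.choice(P)]                       # first center: uniform at random
        for _ in range(k - 1):
            # squared distance of each point from the current set of centers
            d2 = [min(dist(p, c) for c in centers) ** 2 for p in P]
            # pick the next center with probability proportional to d2
            # (points already chosen as centers have d2 = 0, hence probability 0)
            centers.append(rng.choices(P, weights=d2, k=1)[0])
        return centers

    P = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
    print(kmeanspp_init(P, 2))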
Dealing with Big Data: pointset 𝑃 is too large for a single machine.
• Distributed (e.g., MapReduce) implementation of Lloyd's algorithm and 𝑘-means++.
• Coreset-based approach
Parallel k-means: variant in which a set of > 𝑘 candidate centers is selected in 𝑂(log 𝑁 ) parallel
rounds, and then 𝑘-centers are extracted from this set using a weighted version of 𝑘-means++.
Weighted k-means: each point 𝑝 ∈ 𝑃 is given an integer weight 𝑤(𝑝) > 0. It minimizes:
𝜙^w_kmeans(𝐶) = ∑_{𝑖=1}^{𝑘} ∑_{𝑝∈𝐶_𝑖} 𝑤(𝑝) · 𝑑(𝑝, 𝑐_𝑖)²
Most of the algorithms known for the standard 𝑘-means clustering problem (unit weights) can be
adapted to solve the case with general weights.
It is sufficient to regard each point 𝑝 as representing 𝑤(𝑝) identical copies of itself.
Lloyd's algorithm remains virtually identical except that:
• the centroid 𝑐_𝑖 of cluster 𝐶_𝑖 becomes (1 / ∑_{𝑝∈𝐶_𝑖} 𝑤(𝑝)) · ∑_{𝑝∈𝐶_𝑖} 𝑤(𝑝) · 𝑝
• the objective function 𝜙^w_kmeans(𝐶) is used
k-means++ remains virtually identical except that, in the selection of the 𝑖-th center 𝑐_𝑖, the probability for a point 𝑝 ∈ 𝑃 − 𝑆 to be selected becomes 𝑤(𝑝) · (𝑑(𝑝, 𝑆))² / ∑_{𝑞∈𝑃−𝑆} 𝑤(𝑞) · (𝑑(𝑞, 𝑆))².
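A small sketch of the two weighted adaptations above; representing the weights as a dict from point to integer weight is an assumption of this example.

    import random
    from math import dist

    def weighted_centroid(cluster, w):
        # c = (1 / sum_{p in C} w(p)) * sum_{p in C} w(p) * p
        total = sum(w[p] for p in cluster)
        d = len(cluster[0])
        return tuple(sum(w[p] * p[i] for p in cluster) / total for i in range(d))

    def weighted_kmeanspp_pick(P, S, w, rng=random.Random(0)):
        # probability of p proportional to w(p) * d(p, S)^2
        weights = [w[p] * min(dist(p, c) for c in S) ** 2 for p in P]
        return rng.choices(P, weights=weights, k=1)[0]

    w = {(0, 0): 3, (2, 0): 1, (9, 9): 5}
    print(weighted_centroid([(0, 0), (2, 0)], w))          # (0.5, 0.0)
    print(weighted_kmeanspp_pick(list(w), [(0, 0)], w))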
The coreset-based MapReduce algorithm (which runs an 𝛼-approximation algorithm 𝐴 for Weighted k-means clustering, with 𝛼 > 1, on the coreset) is an 𝑂(𝛼²)-approximation algorithm: 𝜙_kmeans(𝐶_alg) = 𝑂(𝛼² · 𝜙^opt_kmeans(𝑃, 𝑘)), i.e., the cost of the returned clustering is within an 𝑂(𝛼²) factor of the optimal cost.
𝑇 is a 𝜸-coreset for 𝑃, 𝑘 and 𝜙_kmeans if ∑_{𝑝∈𝑃} (𝑑(𝑝, 𝜋(𝑝)))² ≤ 𝛾 · 𝜙^opt_kmeans(𝑃, 𝑘), where:
• 𝑇 ⊆ 𝑃
• 𝜋: 𝑃 → 𝑇 is a proxy function
The smaller 𝛾, the better 𝑇 represents 𝑃 wrt the k-means problem, and it’s likely that a good solution
can be extracted from 𝑇, by weighing each point of 𝑇 with the # of points for which it acts as proxy.
2.5. k-median clustering
Finds a 𝑘-clustering 𝐶 = (𝐶1 , … , 𝐶𝑘 ) which minimizes 𝜙𝑘𝑚𝑒𝑑𝑖𝑎𝑛 (𝐶 ) = ∑𝑘𝑖=1 ∑𝑎∈𝐶𝑖 𝑑 (𝑎, 𝑐𝑖 )
• Requires that cluster centers belong to the input pointset.
• Works in arbitrary metric spaces (e.g., with the Manhattan distance).
• The local-search (swap-based) algorithm is a 5-approximation algorithm: 𝜙_kmedian(𝐶_alg) ≤ 5 · 𝜙^opt_kmedian(𝑃, 𝑘).
• is very slow in practice especially for large inputs since:
o the local search may require a large number of iterations to converge (𝑁𝑘). This can
be solved by stopping the iteration when the objective function does not decrease
significantly.
o in each iteration up to 𝑘(𝑁 − 𝑘) swaps may need to be checked, and for each swap
a new clustering must be computed.
• Faster alternative: an adaptation of Lloyd's algorithm where the center of a cluster 𝐶 is defined to be the point of 𝐶 that minimizes the sum of the distances to all other points of 𝐶 (the medoid).
The coreset-based MapReduce approach (using an 𝛼-approximation algorithm for the weighted problem on the coreset, 𝛼 > 1) is an 𝑂(𝛼²)-approximation algorithm: 𝜙_kmedian(𝐶_alg) = 𝑂(𝛼² · 𝜙^opt_kmedian(𝑃, 𝑘)).
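A minimal sketch of the center-update step used in the Lloyd-style adaptation mentioned above: the new center of a cluster is its medoid, i.e. the point of the cluster minimizing the sum of the distances to the other points (quadratic in the cluster size, so this is only an illustrative sketch).

    from math import dist

    def medoid(cluster):
        # point of the cluster that minimizes the sum of distances to all other points
        return min(cluster, key=lambda c: sum(dist(c, p) for p in cluster))

    C = [(0, 0), (1, 0), (2, 0), (3, 0)]
    print(medoid(C))        # (1, 0): one of the two points with the smallest total distance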
k-center vs k-means vs k-median
• All 3 problems tend to return spherically shaped clusters.
• They differ in the center selection strategy since, given the centers, the best clustering
around them is the same for the 3 problems.
• k-center is useful when we want to optimize the distance of every point from its center, hence minimizing the radii of the cluster spheres. It is also useful for coreset selection.
• k-median is useful for facility-location applications when targeting minimum average
distance from the centers. However, it is the most challenging problem from a
computational standpoint.
• k-means is useful when centers need not belong to the pointset and minimum average
distance is targeted. Optimized implementations are available in many software packages
(unlike k-median)
2.6. Cluster Evaluation
2.6.1. Clustering tendency
Assess whether the data contains meaningful clusters, namely clusters that are unlikely to occur in
random data.
Hopkins statistic: measures to what extent 𝑃 can be regarded as a random subset from 𝑀
𝐻(𝑃) = (∑_{𝑖=1}^{𝑡} 𝑢_𝑖) / (∑_{𝑖=1}^{𝑡} 𝑢_𝑖 + ∑_{𝑖=1}^{𝑡} 𝑤_𝑖) ∈ [0,1], for some fixed 𝑡 ≪ 𝑁 (typically 𝑡 < 0.1 · 𝑁), where:
• 𝑃 set of 𝑁 points in some metric space (𝑀, 𝑑 )
• 𝑋 = {𝑥1 , … , 𝑥𝑡 } random points from 𝑃
• 𝑌 = {𝑦1 , … , 𝑦𝑡 } random set of points from 𝑀 (metric space)
• 𝑤𝑖 = 𝑑 (𝑥𝑖 , 𝑃 − {𝑥𝑖 }) for 1 ≤ 𝑖 ≤ 𝑡
• 𝑢𝑖 = 𝑑 (𝑦𝑖 , 𝑃 − {𝑦𝑖 }) for 1 ≤ 𝑖 ≤ 𝑡
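A small sketch of the Hopkins statistic for points in the unit square; drawing the set 𝑌 uniformly from [0,1]² is an assumption made only for this example (how 𝑀 is sampled depends on the application).

    import random
    from math import dist

    def hopkins(P, t, rng=random.Random(0)):
        X = rng.sample(P, t)                                  # t random points of P
        Y = [(rng.random(), rng.random()) for _ in range(t)]  # t random points of M = [0,1]^2
        # w_i = d(x_i, P - {x_i}),  u_i = d(y_i, P)
        w = [min(dist(x, p) for p in P if p != x) for x in X]
        u = [min(dist(y, p) for p in P) for y in Y]
        return sum(u) / (sum(u) + sum(w))

    # Two tight groups of points in opposite corners of the unit square:
    # values close to 1 suggest a clustering tendency, values around 0.5 suggest random data.
    P  = [(i / 100, j / 100) for i in range(5) for j in range(5)]
    P += [(0.9 + i / 100, 0.9 + j / 100) for i in range(5) for j in range(5)]
    print(hopkins(P, t=5))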
In the case of 𝑘-center, 𝑘-means, and 𝑘-median, the value of the objective function is the most natural metric to assess the quality of a clustering or the relative quality of two clusterings.
However, the objective functions of 𝑘-center/means/median capture intra-cluster similarity but do
not assess whether distinct clusters are truly separated.
Approaches for measuring intra-cluster similarity and inter-cluster dissimilarity:
• Cohesion/Separation metrics
• Silhouette coefficient
Sum of the distances between a given point and a subset of points 𝑑𝑠𝑢𝑚 (𝑝, 𝐶 ) = ∑𝑞∈𝐶 𝑑 (𝑝, 𝑞)
Average distance between 𝑝 and 𝐶: 𝑑𝑠𝑢𝑚 (𝑝, 𝐶 )/|𝐶 |
Computing one 𝑑𝑠𝑢𝑚 (𝑝, 𝐶 ) can be done efficiently, even for large 𝐶, but computing many of them
at once may become too costly.
Cohesion: cohesion(𝐶) = [∑_{𝑖=1}^{𝑘} (1/2) ∑_{𝑝∈𝐶_𝑖} 𝑑_sum(𝑝, 𝐶_𝑖)] / [∑_{𝑖=1}^{𝑘} (𝑁_𝑖 choose 2)], where 𝑁_𝑖 = |𝐶_𝑖|, i.e., the average distance between two points of the same cluster.
Average silhouette coefficient: 𝑠_𝐶 = (1/|𝑃|) ∑_{𝑝∈𝑃} 𝑠_𝑝, where, for each point 𝑝, 𝑠_𝑝 = (𝑏_𝑝 − 𝑎_𝑝)/max{𝑎_𝑝, 𝑏_𝑝}, with 𝑎_𝑝 the average distance of 𝑝 from the points of its own cluster and 𝑏_𝑝 the minimum, over the other clusters, of the average distance of 𝑝 from their points.
• measures the quality of the clustering 𝐶
• Good: 𝑠_𝐶 ≅ 1 (i.e., 𝑎_𝑝 ≪ 𝑏_𝑝 for most points)
Computing cohesion, separation, or the average silhouette coefficient exactly requires computing all pairwise distances between points, which is impractical for very large 𝑷.
Approximation of these sums: 𝑑̃_sum(𝑝, 𝐶, 𝑡) = (𝑛/𝑡) ∑_{𝑖=1}^{𝑡} 𝑑(𝑝, 𝑥_𝑖), where:
• 𝑛 = |𝐶|
• 𝑥_1, …, 𝑥_𝑡 are 𝑡 random points from 𝐶, drawn with replacement and uniform probability
• it is a random variable
• it is an unbiased estimator of 𝑑_sum(𝑝, 𝐶)
For any 𝜖 ∈ (0,1), if we set 𝑡 = 𝛼 · log(𝑛)/𝜖² for a suitable constant 𝛼, then with high probability:
𝑑_sum(𝑝, 𝐶) − 𝛿 ≤ 𝑑̃_sum(𝑝, 𝐶, 𝑡) ≤ 𝑑_sum(𝑝, 𝐶) + 𝛿, where 𝛿 = 𝜖 · 𝑛 · max_{𝑦∈𝐶} 𝑑(𝑝, 𝑦).
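A small sketch of the sampled estimator 𝑑̃_sum(𝑝, 𝐶, 𝑡) described above (sampling with replacement, uniform probability):

    import random
    from math import dist

    def dsum(p, C):
        # exact sum of the distances between p and the points of C
        return sum(dist(p, q) for q in C)

    def dsum_approx(p, C, t, rng=random.Random(0)):
        # (n/t) * sum of the distances from p to t points of C
        # drawn uniformly at random with replacement
        sample = rng.choices(C, k=t)
        return len(C) / t * sum(dist(p, x) for x in sample)

    C = [(i, i % 7) for i in range(1000)]
    p = (0, 0)
    print(dsum(p, C), dsum_approx(p, C, t=100))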
3. Association Analysis
3.1. Basics
Set 𝐼 of 𝑑 items
Transaction 𝑡 ⊆ 𝐼
Dataset 𝑇 = {𝑡1 , … , 𝑡𝑁 } of 𝑁 transactions over 𝐼, with 𝑡𝑖 ⊆ 𝐼 for 1 ≤ 𝑖 ≤ 𝑁
Itemset 𝑋 ⊆ 𝐼
• 𝑇_𝑋 ⊆ 𝑇 is the subset of transactions that contain 𝑋
• Support of 𝑋 w.r.t. 𝑇: Supp_𝑇(𝑋) = |𝑇_𝑋| / 𝑁, the fraction of transactions of 𝑇 that contain 𝑋
o Supp_𝑇(∅) = 1
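A tiny illustrative sketch of the support of an itemset over a dataset of transactions represented as Python sets:

    def support(T, X):
        # fraction of the transactions of T that contain the itemset X
        X = set(X)
        return sum(1 for t in T if X <= t) / len(T)

    T = [{"bread", "milk"}, {"bread", "beer"}, {"milk", "beer"}, {"bread", "milk", "beer"}]
    print(support(T, {"bread", "milk"}))   # 2 out of 4 transactions -> 0.5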
Support and confidence measure the interestingness of a pattern (itemset or rule).
• thresholds minsup and minconf define which patterns must be regarded as interesting.
Ideally, we would like that the support and confidence (for rules) of the returned patterns be
unlikely to be seen in a random dataset.
The choice of minsup and minconf directly influences:
• Output size: low thresholds may yield too many patterns, which become hard to exploit
effectively.
• False positive/negatives: low thresholds may yield a lot of uninteresting patterns (false
positives), while high thresholds may miss some interesting patterns (false negatives).
Lattice of Itemsets
The family of itemsets under ⊆ forms a lattice:
• Partially ordered set
• For each two elements 𝑋, 𝑌 there is: a unique least upper bound (𝑋 ∪ 𝑌) and a unique
greatest lower bound (𝑋 ∩ 𝑌)
Represented through the Hasse diagram:
• nodes contain itemsets
• two nodes are connected if one is contained in the other and they differ by exactly one item.
3.2. Mining of Frequent Itemsets and Association Rules
Input: Dataset 𝑇 of 𝑁 transactions over 𝐼, minsup and minconf
Output (Frequent Itemsets): 𝐹𝑇,minsup = {(𝑋, Supp𝑇 (𝑋)): 𝑋 ≠ ∅ 𝑎𝑛𝑑 Supp𝑇 (𝑋) ≥ minsup}
Output (Association Rules):
{(𝑟: 𝑋 → 𝑌, Supp 𝑇 (𝑟), Conf 𝑇 (𝑟)): Supp𝑇 (𝑟) ≥ minsup 𝑎𝑛𝑑 Conf 𝑇 (𝑟) ≥ minconf}
apriori-gen(F):
• Candidate generation: merge pairs of itemsets from 𝐹 that have all but one item in common.
E.g., itemsets "ABCD" and "ABCE" in 𝐹 generate "ABCDE".
• Candidate pruning: exclude candidates containing an infrequent subset.
E.g., "ABCDE" survives only if all its length-4 subsets are in 𝐹.
The pruning is done a priori, without computing supports.
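A minimal sketch of apriori-gen, with frequent 𝑘-itemsets represented as sorted tuples; merging pairs that share their first 𝑘−1 items is the usual prefix-based way of realizing "all but one item in common" without generating duplicates, and is an implementation choice of this sketch.

    from itertools import combinations

    def apriori_gen(F):
        # F: collection of frequent k-itemsets, each a sorted tuple of items
        F = sorted(F)
        k = len(F[0])
        candidates = set()
        # Candidate generation: merge pairs sharing the first k-1 items
        for a, b in combinations(F, 2):
            if a[:-1] == b[:-1]:
                candidates.add(a + (b[-1],))
        # Candidate pruning: keep a candidate only if all its k-subsets are frequent
        frequent = set(F)
        return {c for c in candidates
                if all(tuple(s) in frequent for s in combinations(c, k))}

    F3 = [("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"), ("B", "C", "D"), ("A", "B", "E")]
    print(sorted(apriori_gen(F3)))   # only ('A', 'B', 'C', 'D') survives the pruning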
Efficiency of A-Priori:
• just 𝑘𝑚𝑎𝑥 + 1 passes over the dataset, where 𝑘𝑚𝑎𝑥 is the length of the longest frequent
itemset.
• Support is computed only for a few non-frequent itemsets: this is due to candidate generation and pruning, which exploit the antimonotonicity of support.
• Several optimizations are known for the computation of the supports of the candidates, which is typically the most time-consuming task.
• The algorithm computes the support for ≤ 𝑑 + min{𝑀², 𝑑 · 𝑀} itemsets, where 𝑀 is the number of frequent itemsets returned at the end and 𝑑 is the number of items in 𝐼.
• Can be implemented in time polynomial in both the input size (sum of all transaction lengths)
and the output size (sum of all frequent itemsets lengths).
Other approaches to frequent itemsets mining: depth-first mining strategies
• avoiding several passes over the entire dataset of transactions.
• confining the support counting of longer itemsets to suitable small projections of the
dataset, typically much smaller than the original one.
3.2.2. Mining association rules
Let
• 𝑇 be a set of 𝑁 transactions over the set 𝐼 of 𝑑 items
• minsup, minconf ∈ (0,1) be the support/confidence thresholds
• 𝑟: 𝑋 → 𝑌 be an association rule, with 𝑋, 𝑌 non-empty disjoint itemsets
• Supp_𝑇(𝑟) = Supp_𝑇(𝑋 ∪ 𝑌)
• Conf_𝑇(𝑟) = Supp_𝑇(𝑋 ∪ 𝑌)/Supp_𝑇(𝑋)
Compute: {(𝑟: 𝑋 → 𝑌, Supp𝑇 (𝑟), Conf 𝑇 (𝑟)): Supp 𝑇 (𝑟) ≥ minsup 𝑎𝑛𝑑 Conf 𝑇 (𝑟) ≥ minconf}
If 𝐹 is the set of frequent itemsets w.r.t. minsup, then for every rule 𝑟: 𝑋 → 𝑌 that we target, we
have that 𝑋, 𝑋 ∪ 𝑌 ∈ 𝐹
Mining strategy:
• Compute the frequent itemsets w.r.t. minsup and their supports
• For each frequent itemset 𝑍 compute {𝑟: 𝑍 − 𝑌 → 𝑌 𝑠𝑡. 𝑌 ⊂ 𝑍 𝑎𝑛𝑑 Conf 𝑇 (𝑟) ≥ minconf}
Each rule derived from 𝑍 automatically satisfies the support constraint since 𝑍 is frequent.
Conversely, rules derived from itemsets which are not frequent need not be checked, since they would not satisfy the support constraint.
If
• 𝑌′ ⊂ 𝑌 ⊂ 𝑍
• 𝑟1 = 𝑍 − 𝑌 ′ → 𝑌 ′
• 𝑟2 = 𝑍 − 𝑌 → 𝑌
Then: Conf 𝑇 (𝑟1 ) < minconf ⇒ Conf 𝑇 (𝑟2 ) < minconf
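A small sketch of the rule-generation step for a single frequent itemset 𝑍; representing supports as a dict from frozensets to values (supp) is an assumption, and this simple version checks every non-empty proper subset 𝑌 of 𝑍 rather than exploiting the confidence anti-monotonicity above.

    from itertools import combinations

    def rules_from_itemset(Z, supp, minconf):
        Z = frozenset(Z)
        rules = []
        for r in range(1, len(Z)):                      # size of the consequent Y
            for Y in combinations(Z, r):
                Y = frozenset(Y)
                conf = supp[Z] / supp[Z - Y]            # Conf(Z-Y -> Y) = Supp(Z)/Supp(Z-Y)
                if conf >= minconf:
                    rules.append((Z - Y, Y, supp[Z], conf))
        return rules

    supp = {frozenset("A"): 0.6, frozenset("B"): 0.5, frozenset("AB"): 0.4}
    print(rules_from_itemset("AB", supp, minconf=0.7))
    # keeps B -> A (confidence 0.8) and discards A -> B (confidence 0.667)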
3.3. Frequent itemset mining for Big Data
Two approaches can be used when the dataset 𝑇 is very large:
• Partition-based approach
• Sampling-based approach
• The size of the sample is independent of the support threshold minsup and of the number
𝑁 of transactions. It only depends on the approximation guarantee embodied in the
parameters 𝜖, 𝛿, and on the max transaction length h, which is often quite low.
• Tighter bounds on the sample size are known.
• The sampling-based approach yields a 2-round MapReduce algorithm: in the first round, a sample of suitable size is extracted; in the second round, the approximate set of frequent itemsets is mined from the sample (e.g., through A-Priori) within a single reducer.
3.4. Limitations
Limitations of the Support/Confidence framework:
• Redundancy: many returned patterns may characterize the same subpopulation of data
(e.g., transactions/customers).
• Difficult control of output size: it is hard to predict how many patterns will be returned for
given support/confidence thresholds.
• Significance: are the returned patterns significant, interesting?
Itemset 𝑋 ⊆ 𝐼 is Closed wrt 𝑇: if for each superset 𝑌 ⊃ 𝑋 we have Supp𝑇 (𝑌) < Supp𝑇 (𝑋)
It means that 𝑋 is closed if its support decreases as soon as an item is added.
• CLO 𝑇 = {𝑋 ⊆ 𝐼: 𝑋 is closed wrt 𝑇} set of closed itemsets
• CLO-F𝑇,minsup = {𝑋 ∈ CLO 𝑇 : Supp𝑇 (𝑋) ≥ minsup} set of frequent closed itemsets
Itemset 𝑋 ⊆ 𝐼 is Maximal wrt 𝑇 and minsup: if Supp𝑇 (𝑋) ≥ minsup and for each superset 𝑌 ⊃ 𝑋
we have Supp 𝑇 (𝑌) < minsup
It means that 𝑋 is maximal if it is frequent and becomes infrequent as soon as an item is added.
• MAX 𝑇,minsup = {𝑋 ⊆ 𝐼: 𝑋 is maximal wrt 𝑇 𝑎𝑛𝑑 minsup}
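A small sketch of the two definitions, checking closedness and maximality by brute force against the immediate supersets of 𝑋 (which suffices by the antimonotonicity of support); the support function is recomputed here to keep the sketch self-contained, and this is only suitable for tiny examples.

    def support(T, X):
        X = set(X)
        return sum(1 for t in T if X <= t) / len(T)

    def is_closed(T, I, X):
        # closed: every superset has strictly smaller support
        # (checking supersets with one extra item suffices, by antimonotonicity)
        return all(support(T, set(X) | {a}) < support(T, X) for a in set(I) - set(X))

    def is_maximal(T, I, X, minsup):
        # maximal: frequent, and every superset is infrequent
        return support(T, X) >= minsup and \
               all(support(T, set(X) | {a}) < minsup for a in set(I) - set(X))

    I = {"A", "B", "C"}
    T = [{"A", "B"}, {"A", "B"}, {"A", "B", "C"}, {"A"}]
    print(is_closed(T, I, {"A", "B"}), is_maximal(T, I, {"A", "B"}, minsup=0.5))   # True True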
Consequences of the properties:
MAX and CLO-F provide compact and lossless representations of 𝐹:
• Determine 𝐹 by taking all subsets of the itemsets in MAX or CLO-F.
CLO-F (+ supports) provides a compact and lossless representation of 𝐹 (+ supports):
• Determine 𝐹 from CLO-F as above
• For each 𝑋 ∈ 𝐹 compute Supp𝑇 (𝑋) = max{Supp𝑇 (𝑌): 𝑌 ∈ CLO-F and 𝑋 ⊆ 𝑌}
Maximal/frequent closed itemsets yield a quite effective reduction of the output size.
However, there are non-trivial pathological cases where their number is exponential in the input
size.
3.4.2. Top-K frequent itemsets
Controls output size.
Let
• 𝑋_1, 𝑋_2, … an enumeration of the non-empty itemsets in non-increasing order of support.
• For a given integer 𝐾 ≥ 1
• 𝑠(𝐾 ) = Supp𝑇 (𝑋𝐾 )
The Top-K frequent itemsets w.r.t. 𝑇 are all itemsets of support ≥ 𝑠(𝐾 )
The number of Top-𝐾 frequent itemsets is ≥ 𝐾
Let
• 𝑋_1, 𝑋_2, … an enumeration of the non-empty closed itemsets in non-increasing order of support
• For a given integer 𝐾 ≥ 1
• 𝑠𝑐(𝐾 ) = Supp𝑇 (𝑋𝐾 )
The Top-K frequent Closed itemsets w.r.t. 𝑇 are all itemsets of support ≥ 𝑠𝑐 (𝐾 )
There are efficient algorithms for mining the Top-𝐾 frequent (closed) itemsets. A popular strategy
is this:
• Generate (closed) itemsets in non-increasing order of support (using a priority queue).
• Stop at the first itemset whose support is smaller than that of the 𝐾-th one.
For Top-𝐾 frequent closed itemsets there is a tight polynomial bound on the output size.
For Top-𝐾 frequent itemsets there are pathological cases with exponential output size.
Control on the output size
For the Top-𝐾 frequent closed itemsets, 𝐾 provides tight control on the output size.
For 𝐾 > 0, the number of Top-𝑲 frequent closed itemsets is 𝑶(𝒅 · 𝑲), where 𝑑 = number of items.
For small 𝐾 (at most polynomial in the input size) the number of Top-𝐾 frequent closed itemsets
will be polynomial in the input size.
Maximal itemsets:
• Lossless representation of the frequent itemsets
• In practice, much fewer than the frequent itemsets, but, in pathological cases, still
exponential in the input size.
• Reduction of redundancy
Frequent Closed itemsets:
• Lossless representation of the frequent itemsets with supports
• In practice, much fewer than the frequent itemsets, but, in pathological cases, still
exponential in the input size.
• Reduction of redundancy
Top-𝑲 frequent (closed) itemsets:
• Output size: 𝑂(𝑑 · 𝐾) if restricted to closed itemsets; otherwise small in practice but exponential in 𝑑 in pathological cases.
• Reduction of redundancy and control on the output size (with closed itemsets)