
Randomized Algorithms Lecture Notes

The document consists of lecture notes on randomized algorithms, covering topics such as Monte Carlo and Las Vegas algorithms, probabilistic analysis of Quick-Sort, and various randomized algorithms including Randomized Selection and Min-Cut. It explains the differences between deterministic and randomized algorithms, their advantages, and provides detailed examples and analyses of specific algorithms. The notes also discuss the average-case runtime of Quick-Sort and techniques for generating random permutations.


Algorithms

Lecture Notes — Randomized Algorithms


Imdad ullah Khan

Contents
1 Introduction
  1.1 Monte Carlo and Las Vegas Algorithms
2 Probabilistic Analysis: Quick-Sort
  2.1 Randomized Quick Sort
3 Randomized Selection
4 Randomized Max-Cut
5 Minimum Cut
  5.1 Edge Contraction
  5.2 Karger's Algorithm
  5.3 Karger-Stein Algorithm
6 Max-3-SAT
  6.1 Derandomization
7 Closest Pair
8 Hashing
  8.1 Dictionary ADT
  8.2 Chained Hash Tables
  8.3 Randomized Hashing
9 Stream Processing
  9.1 Stream Model of Computation
    9.1.1 Synopsis
  9.2 Random Sampling
    9.2.1 Weighted Sampling
    9.2.2 Reservoir Sampling
  9.3 Linear Sketch and Frequency Moments
  9.4 Count-Min Sketch

1 Introduction
We have seen various algorithm design paradigms so far: greedy, divide-and-conquer,
dynamic programming and network flows. We have also used some of these types of
algorithms to cope with np-hard problems, compromising on different aspects such as
polynomial runtime (Intelligent Exhaustive Search), exact solution (Approximation &
Heuristic Algorithms), and solving for all cases (Special Cases) and parameters (Fixed
Parameter Tractability).
One thing that all these algorithms have in common is that they are deterministic, i.e. they always produce the same output for the same input. Surprisingly, however, some of the smartest and fastest algorithms that exist depend on chance: some specified decisions in the algorithm are based on the outcome of the toss of a random coin. Such algorithms are called randomized algorithms. So how is randomness incorporated into the operation of these algorithms?

Figure 1: Deterministic vs. Randomized Algorithms

As shown in Figure 1, a randomized algorithm receives, in addition to the input, a stream of random numbers used to make decisions during its execution. As a result, a randomized algorithm may output different results on the same input across different runs. When designing a randomized algorithm, the aim is to have good average-case (or expected) behaviour: with high probability, the algorithm should return the exact answer, or one close to it, within a small runtime. Randomized algorithms are often simple and elegant, and with high probability they produce an output that is correct to some acceptable degree. Another advantage of randomized algorithms is that they often require less execution time or space than the best known deterministic algorithm.

1.1 Monte Carlo and Las Vegas Algorithms


We can broadly classify randomized algorithms into two types: Las Vegas and Monte Carlo algorithms. Monte Carlo algorithms introduce randomness in the solution: they are guaranteed to run in a fixed time but output a correct answer only with some, usually high, probability. On the other hand, Las Vegas algorithms introduce randomness in the runtime: they are guaranteed to output the correct answer, but the runtime is only small with high probability. An interesting thing to note is that we can always convert a Las Vegas algorithm into a Monte Carlo algorithm by stopping it after a fixed amount of time. The reverse conversion, however, requires an efficient way to verify a candidate answer (repeatedly run the Monte Carlo algorithm until the verifier accepts its output); no general method is known without such verification. This classification is summarized in Figure 2.
Figure 2: Monte Carlo and Las Vegas Randomized Algorithms

We illustrate the differences among deterministic, Monte Carlo, and Las Vegas randomized algorithms through a simple problem.
Input: An array A with n/4 1's and 3n/4 0's
Output: An index k such that A[k] = 1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

0 0 0 1 0 0 0 0 0 0 1 0 0 1 1 0

We give a deterministic algorithm, a Monte Carlo algorithm, and a Las Vegas randomized algorithm for the problem and illustrate their quality and runtime analysis.

Algorithm (Deterministic):
    k ← 0
    for i = 1 → n do
        if A[i] = 1 then
            k ← i
            return k
Quality: always correct. Worst-case runtime: 3n/4 + 1 (scan past all the 0's).

Algorithm (Monte Carlo):
    k ← random(1 · · · n)
    return k
Quality: correct with probability 1/4. Worst-case runtime: 1.

Algorithm (Las Vegas):
    k ← 1
    while A[k] ≠ 1 do
        k ← random(1 · · · n)
    return k
Quality: always correct. Expected runtime: 4 probes.
[Figure: probability tree for the Las Vegas algorithm — each probe succeeds (✓) with probability 1/4 and fails (✗) with probability 3/4, so the number of probes is geometric with expectation 4.]
We will study, among others, a Monte Carlo algorithm for the Min-Cut problem, and
Las Vegas algorithms for sorting and closest-pair problems.
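The three toy algorithms above can be sketched in Python as follows; this is an illustrative translation of the pseudocode (the function names are ours), using the 16-element array from the problem statement:

```python
import random

def find_one_deterministic(A):
    # Scan left to right; always correct, worst case about 3n/4 + 1 probes.
    for i in range(len(A)):
        if A[i] == 1:
            return i
    return -1  # no 1 present

def find_one_monte_carlo(A):
    # A single random probe: O(1) time, correct with probability 1/4.
    return random.randrange(len(A))

def find_one_las_vegas(A):
    # Probe random positions until a 1 is found: always correct,
    # expected number of probes is 1/(1/4) = 4.
    k = random.randrange(len(A))
    while A[k] != 1:
        k = random.randrange(len(A))
    return k

A = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0]  # n = 16, n/4 ones
```

The Monte Carlo version returns an index that may or may not hold a 1, while the Las Vegas version never errs but its runtime is a random variable.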

2 Probabilistic Analysis: Quick-Sort


Recall the recursive quick-sort algorithm outlined in Algorithm 4.

Algorithm 4 Sorting A using partition
1: function quicksort(A)
2:    if |A| ≤ 1 then
3:        return A
4:    z ← A[1]    ▷ pivot z
5:    partition(A, z)
6:    r ← rank(z, A)
7:    quicksort(A[1 . . . r − 1])
8:    quicksort(A[r + 1 . . . |A|])
9: function partition(A, z)
10:    i ← 1, j ← |A|
11:    r ← rank(z, A)
12:    while i < j do
13:        while A[i] < z do
14:            i ← i + 1
15:        while A[j] > z do
16:            j ← j − 1
17:        if i ≠ r and j ≠ r then
18:            swap(A[i], A[j])

As shown in Figure 3, the partition procedure rearranges the input array such that
all elements less than z are to its left, and all elements greater than z are to its right.
The process is recursively repeated for both sub-arrays, to the left and right of z, as
shown in Figure 4.

Figure 3: Partitioning A around pivot z: A[1 · · · r−1] ≤ z, A[r] = z, and A[r+1 · · · n] ≥ z


Figure 4: Recursion Tree of quick-sort

Let T (n) be the runtime of QuickSort on |A| = n. The worst case for the runtime
is when the pivot is always the minimum or maximum of A, leading to a completely
skewed recursion tree, as shown in Figure 5, from which we do not gain any benefit of
the divide-and-conquer approach.


Figure 5: Worst case of quick-sort when z = min(A)

The recurrence relation for T(n) in the above case is:

T(n) = T(n − 1) + T(0) + O(n)   if n > 1
T(n) = 1                        if n ≤ 1

Therefore, in the worst case, T(n) = O(n^2).


On the contrary, the best case of quickSort is when the pivot is always the median of the array, so the left and right sub-arrays after partition are balanced. In this case, the recurrence relation for T(n) is:

T(n) = 2T(n/2) + O(n)   if n > 1
T(n) = 1                if n ≤ 1

Figure 6: Best case of quick-sort when z = median(A)

Therefore, in the best case, T(n) = O(n log n).
We know the extremes, best and worst case, of the runtime performance of quickSort, but for practical purposes the more relevant question is: what is the average-case running time of quickSort? But what do we average over? We have, or assume, knowledge of the distribution of the input, and we average over that distribution, weighting each input by how probable it is. When probability is used in the analysis of a deterministic algorithm, it is called probabilistic analysis.
For the probabilistic analysis of quickSort, we assume all permutations of the n numbers in A are equally likely, i.e. the ranks of the numbers in A form a uniform random permutation of [1 · · · n]. We first make a few observations about the quickSort algorithm.
Note that an element of A can be chosen as pivot at most once, since all subsequent processing is done on the two sub-arrays. Furthermore, elements of A are compared only to pivots: elements are compared only with z in the partition function, and no comparison is made in the outer recursive function. From these two statements we conclude that a pair of elements of A is compared only when one of them is a pivot, and consequently a pair is compared at most once. Since comparisons always involve the pivot, after a comparison is made the two elements go to different parts. We use these insights to obtain a bound on the expected number of comparisons made.
Let the sorted order of elements of A be z1 , z2 , . . . , zn . We define an indicator random
variable Xij as:
(
1 if zi is compared with zj
Xij =
0 else

Note that the comparison between zi and zj can be at any time of the execution and not
in a specific call. Thus, the total number of comparisons X made during the execution
of the algorithm can be obtained by summing over all pairs.

X = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} X_ij

Thus, by linearity of expectation:

E(X) = E[ Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} X_ij ] = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} E[X_ij]

To find the expected value of X_ij, we define Z_ij to be the set of elements between z_i and z_j (inclusive): z_i, z_{i+1}, . . . , z_j, so |Z_ij| = j − i + 1. Initially, all the elements of Z_ij are in the same array A. They split only when some z_k with i ≤ k ≤ j is selected as the pivot. Recall that z_i and z_j are compared only if they are in the same (sub)array and either z_i or z_j is selected as the pivot. If the first pivot chosen from Z_ij is neither z_i nor z_j, then Z_ij is split, z_i and z_j never get compared, and X_ij = 0. Therefore, E[X_ij] = Pr[z_i or z_j is the first element of Z_ij chosen as a pivot].
Each element of Z_ij is equally likely to be chosen first, so the probability that z_i is chosen as the pivot first among Z_ij is 1/(j − i + 1), and likewise for z_j. Summing these two probabilities,

E(X_ij) = 2/(j − i + 1)

Substituting k = j − i in the equation for E(X), we get:

E(X) = Σ_{i=1}^{n−1} Σ_{k=1}^{n−i} 2/(k + 1) < Σ_{i=1}^{n} Σ_{k=1}^{n} 2/k ≤ 2n log n.

Therefore, the average-case runtime of quickSort is O(n log n).
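The bound can be checked numerically; a small sketch (the function name is ours) that evaluates the exact double sum from the derivation above and compares it against 2n log n:

```python
import math

def expected_comparisons(n):
    # E[X] = sum over pairs i<j of 2/(j-i+1); grouping pairs by their gap
    # k = j-i, there are n-k pairs with gap k, each contributing 2/(k+1).
    return sum((n - k) * 2.0 / (k + 1) for k in range(1, n))

# The exact expectation stays below the 2n log n bound from the text.
for n in (10, 100, 1000):
    assert expected_comparisons(n) < 2 * n * math.log(n)
```

For n = 2 the sum gives exactly 1, as expected: two elements are always compared exactly once.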


Since the input array cannot be guaranteed to be randomly ordered, we permute the array A ourselves to obtain a random permutation. In fact, generating a random permutation is an interesting exercise in itself. A simple method uses an auxiliary array: repeatedly select a random element of A, copy it to the next empty position of the auxiliary array, and remove it from A. However, the time complexity of this method is O(n^2). A smarter approach, which avoids extra space and permutes the array in O(n), is the Fisher-Yates shuffle, which assumes we can generate a random integer in a given range in O(1). The key idea is to begin at the last element of A and swap it with the element at a randomly selected index (possibly the last index itself), then repeat on the array from the first to the second-last element, and so on, shrinking the range by one each time until the first element is reached.
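A minimal in-place sketch of the Fisher-Yates shuffle described above (assuming a uniform O(1) random-integer generator, here Python's random.randrange):

```python
import random

def fisher_yates_shuffle(A):
    # Walk from the last index down to 1; swap position i with a uniformly
    # random position j in [0, i] (j may equal i, so A[i] can stay put).
    for i in range(len(A) - 1, 0, -1):
        j = random.randrange(i + 1)
        A[i], A[j] = A[j], A[i]
    return A
```

Each of the n! permutations is produced with equal probability, and the shuffle uses O(1) extra space.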
We can also reduce the likelihood of the worst case occurring by taking the median of 3 or 4 sampled elements as the pivot. If the pivot were always the median of the whole array, then the best, worst, and average case time complexity would all be O(n log n).

2.1 Randomized Quick Sort
Instead of randomly permuting the array A before running QuickSort as above, we can choose the pivot randomly, instead of always taking the first element, as shown in Algorithm 5.

Algorithm 5 Randomized QuickSort


function rand-quicksort(A)
if |A| ≤ 1 then
return A
randIndex ← random(1, |A|)
z ← A[randIndex]
partition(A, z)
r ← rank(z, A)
rand-quicksort(A[1 . . . r − 1])
rand-quicksort(A[r + 1 . . . |A|])

By partitioning around a random element, the runtime is independent of the input order. Therefore, no assumptions need to be made about the distribution of the input, and no specific input elicits the worst-case behavior. Rather, the worst-case runtime depends on the sequence of random numbers.
The probabilistic analysis we carried out above finds the average runtime over all inputs (of length n) of the deterministic quicksort algorithm. For the randomized-quicksort algorithm, we instead find the expected runtime, i.e. the expected value of the runtime random variable of a randomized algorithm. The expected runtime can be understood as, effectively, an 'average' of the runtime over all sequences of random numbers. The analysis of randomized-quicksort is exactly the same as the probabilistic analysis discussed above, using indicator random variables.
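A short Python sketch of randomized quicksort; for clarity it builds new lists rather than partitioning in place as Algorithm 5 does, but the random pivot choice, and hence the expected O(n log n) analysis, is the same:

```python
import random

def rand_quicksort(A):
    if len(A) <= 1:
        return A
    z = random.choice(A)              # pivot chosen uniformly at random
    left = [x for x in A if x < z]    # elements smaller than the pivot
    mid = [x for x in A if x == z]    # the pivot (and any duplicates)
    right = [x for x in A if x > z]   # elements larger than the pivot
    return rand_quicksort(left) + mid + rand_quicksort(right)
```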

3 Randomized Selection
For the quicksort algorithm, we saw that we can achieve the best case runtime if we
repeatedly select the median of each (sub)problem as the pivot to partition around.
This is only one case where efficiently finding the median of an array is important.
Likewise, for numerous applications, we may need to select the minimum, maximum
or the k th smallest element (called the k th order statistic) in an array.
PROBLEM 1 (Select(S, k) problem): Find the k-th smallest element in S.

In the case of a sorted array with distinct elements, Select can be done in O(1). This also gives us a simple algorithm for any array: first sort the array and then select the element at the k-th index. However, such deterministic sorting algorithms take O(n log n) time.
We discuss a randomized algorithm for the Select problem, i.e. select the k th smallest
element in an array S. Assume S has only distinct elements. For a randomly chosen
z ∈ S, partition S into SL (< z), Sz (= z) and SG (> z) as shown in Figure 7.

Figure 7: Partition of S into S_L (< z), S_z (= z), and S_G (> z). In each sub-problem, choose a random z to partition around until z is the k-th order statistic.

Given the chosen z, the randomized algorithm is described by the following recurrence, where r = |S_L| + 1 is the rank of z:

select(S, k) = select(S_L, k)                 if k ≤ |S_L|
select(S, k) = z                              if k = r = |S_L| + 1
select(S, k) = select(S_G, k − |S_L| − 1)     if k > |S_L| + 1

We analyze the runtime T(n) of the above randomized algorithm for select on an input S with |S| = n. If the chosen pivot z has rank r, then

T(n) = T(max{r − 1, n − r}) + Θ(n)

We want z to be the median, or close to the median, of S to find the required value efficiently. However, since we choose z randomly, it may be far from the median, resulting in an imbalanced partition; in the worst case, T(n) = Θ(n^2). However, the expected runtime satisfies E[T(n)] ≤ cn, which can be proven by induction.
Recall that the rank of an element x in an array is determined by the number of elements smaller than x. Since S contains distinct elements, Pr[rank(z) = r] = 1/n. Then,

E[T(n)] ≤ n + Σ_{r=1}^{n} Pr[rank(z) = r] · T(max{r − 1, n − r})
        = n + (1/n) Σ_{r=1}^{n} T(max{r − 1, n − r})
        ≤ n + (2/n) Σ_{r=⌊n/2⌋}^{n−1} T(r)

Therefore, E[T(n)] = O(n).
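The recurrence above translates directly into a short Python sketch (k is 1-indexed; the names are ours):

```python
import random

def rand_select(S, k):
    # k-th smallest element of S (1-indexed), expected O(n) time.
    z = random.choice(S)
    SL = [x for x in S if x < z]   # elements smaller than the pivot
    SG = [x for x in S if x > z]   # elements larger than the pivot
    if k <= len(SL):
        return rand_select(SL, k)
    if k > len(S) - len(SG):
        return rand_select(SG, k - (len(S) - len(SG)))
    return z                       # z is the k-th smallest
```

With distinct elements, len(S) − len(SG) equals |S_L| + 1, so the three branches match the recurrence exactly.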

4 Randomized Max-Cut
Recall that a cut in a graph G = (V, E) is defined by a subset S ⊂ V. Cuts in graphs are useful structures with applications in network flows, statistical physics, circuit design, complexity and approximation theory.
We denote a cut as [S, S̄], where S̄ = V \ S, and assume that the cut is non-trivial, i.e. S ≠ ∅ and S ≠ V. As shown in Figure 8, an edge (u, v) is said to cross the cut [S, S̄] if u ∈ S and v ∈ S̄.

Figure 8: A cut [{A, F, E, G}, {C, B, D}] in G

The size or cost of a cut is defined as the number of crossing edges. For weighted
graphs, the size of cut is the sum of weights of crossing edges.
PROBLEM 2 (Max-Cut(G) problem): Find a cut in G of maximum size.
Figure 9: Examples of a cut (of size 4) and a max-cut (of size 6) in a graph

The decision version of the max-cut(G) problem is NP-Complete. A simple randomized approximation algorithm for max-cut(G) is to place each vertex in S or S̄ randomly, as shown in Algorithm 6.

Algorithm 6 rand-max-cut(G)
S←∅
for each vertex v ∈ V do
S ← S ∪ {v} with probability 1/2

The runtime of the above algorithm is clearly O(n) where n = |V |, since it requires only
one iteration over the vertex set V . However, how ‘good’ is the estimated max-cut?
We analyze it as follows.
For each edge e, let C_e be an indicator random variable, where C_e = 1 if e crosses the cut [S, S̄] and C_e = 0 otherwise. Then, for a given cut, size([S, S̄]) = Σ_{e∈E} C_e. By linearity of expectation,

E[size([S, S̄])] = E[ Σ_{e∈E} C_e ] = Σ_{e∈E} E[C_e]

E[C_e] is the probability that e is a crossing edge of the cut. As shown in Figure 10,

Pr[e = (u, v) crosses [S, S̄]] = Pr[(u ∈ S ∧ v ∈ S̄) ∨ (u ∈ S̄ ∧ v ∈ S)] = 1/4 + 1/4 = 1/2

Figure 10: Given a cut [S, S̄] in G = (V, E), each edge e ∈ E must be one of four types, depending on which side each of its endpoints lies.

E[size([S, S̄])] = Σ_{e∈E} E[C_e] = Σ_{e∈E} 1/2 = |E|/2 = m/2

Therefore, since the size of every cut, including a max-cut, is ≤ m, where m = |E|, the cut constructed by this randomized algorithm has expected size at least 1/2 of the max-cut.

Although the algorithm is good on average, there is no (probabilistic) guarantee on the quality of its output. We can improve over rand-max-cut(G) using the magic of repeated trials to amplify the probability: run rand-max-cut(G) k times on G and return the largest cut found. We refer to this as a meta-algorithm of k repeated trials of rand-max-cut. The runtime of the meta-algorithm is O(k(m + n)), since building a cut takes O(n) and determining its size takes O(m).
For large k, the meta-algorithm returns a large cut with high probability, which is the kind of guarantee we desire. In fact, we will show that we get a cut of size m/4 with very high probability.
Let X1, X2, · · · , Xk be the sizes of the cuts found by the k runs of rand-max-cut(G), and let E be the event that the meta-algorithm of k repeated trials produces a cut of size less than m/4, i.e. E = ∩_{i=1}^{k} (Xi ≤ m/4).
Since the Xi's are independent,

Pr[E] = Pr[ ∩_{i=1}^{k} (Xi ≤ m/4) ] = Π_{i=1}^{k} Pr[Xi ≤ m/4]

To find Pr[Xi ≤ m/4], we will use the Markov Inequality.
Theorem 1 (Markov Inequality). If Z is a non-negative random variable and a > 0, then Pr[Z ≥ a] ≤ E[Z]/a, or equivalently, Pr[Z ≥ a·E[Z]] ≤ 1/a.

Note that we can write (Xi ≤ m/4) as (m − Xi ≥ 3m/4) to match the form of the Markov Inequality. For simplicity, define random variables Yi = m − Xi for i = 1, · · · , k; each Yi is non-negative. From the analysis of a single run, E[Xi] = m/2, so E[Yi] = m − E[Xi] = m − m/2 = m/2. Then,

Pr[E] = Π_{i=1}^{k} Pr[Yi ≥ 3m/4] ≤ Π_{i=1}^{k} E[Yi]/(3m/4) = Π_{i=1}^{k} (m/2)/(3m/4) = Π_{i=1}^{k} 2/3 = (2/3)^k

If we output the largest of the k cuts found by rand-max-cut(G), the probability that we don't get ≥ m/4 edges is at most (2/3)^k, i.e. the probability that we do get ≥ m/4 edges is at least 1 − (2/3)^k. If we set k = log_{3/2} m, then the probability of getting at least m/4 edges is 1 − 1/m.
Therefore, the meta-algorithm is an O((m + n) log m)-time randomized algorithm that finds a 0.25-approximation to max-cut with probability 1 − 1/m. This is called an (ϵ, δ)-randomized approximation algorithm, but more on that later.
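A sketch of rand-max-cut and the k-trial meta-algorithm; the graph representation (a vertex list and an edge list) and the function names are our choices:

```python
import random

def rand_max_cut(V, E):
    # Put each vertex in S independently with probability 1/2.
    S = {v for v in V if random.random() < 0.5}
    size = sum(1 for (u, v) in E if (u in S) != (v in S))  # crossing edges
    return S, size

def meta_max_cut(V, E, k):
    # Repeat k independent trials and keep the largest cut found; by the
    # analysis above, the result has >= m/4 edges w.p. at least 1 - (2/3)^k.
    return max((rand_max_cut(V, E) for _ in range(k)), key=lambda cut: cut[1])
```

For example, on the complete graph K4 (m = 6), a few dozen trials all but guarantee a cut of size at least m/2 = 3.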

5 Minimum Cut
Similar to the max-cut problem discussed above, we may also need to find a cut with
minimum cost in a graph.
PROBLEM 3 (Min-Cut(G) problem): Find a cut in G of minimum size.

Figure 11: Examples of a cut (of size 3) and a min-cut (of size 2) in a graph

Note that the size of the minimum cut is at most the minimum degree of any vertex
in G and as shown in Figure 12, a minimum cut does not have to be unique.

Figure 12: 3 minimum cuts, each of size 2

The problem is also called Global Min-Cut and has applications in network reliability
and robustness analysis, as shown in Figure 13.

Figure 13: Knowing a min-cut allows an adversary to disrupt a network with minimal effort; the network on the left is easier to disconnect.

Normalized minimum cut is used when spectral clustering is applied to image segmentation, in which a digital image is broken down into various subgroups. Suppose we want to separate the background of the image from the foreground to, for example, distinguish an aircraft or missile from the horizon. Figure 14 shows this separation for an example image, and the application of Min-Cut to segmentation of this image is briefly described in Figure 15.

Figure 14: The original image (left) and its segmentation by a boundary separating
the foreground from the background of the image (right)


Figure 15: Image Segmentation using Min-Cut: Let each square denote a pixel. Red
pixels belong to the foreground whereas white pixels denote the background. Observe
that if pixel (x, y) is background or foreground, then so are nearby pixels. We make
a graph (left) with nodes for each pixel adjacent to neighboring pixels and the weight
of edge (i, j) is defined as a penalty pij of classifying i and j differently, i.e. pij is a
‘similarity measure’ determined by image processing. A min-cut in this weighted graph
draws the boundary between the foreground and the background as it (overall) cuts
across edges between nodes with least similarity.

Recall from network flows that a maximum s − t flow in G is equal to a minimum s − t cut. Since the value of the min-cut of G is the minimum over all possible s − t cuts in G, the brute-force solution is to compute a min s − t cut for all s − t pairs of V. This requires O(n^2) calls to a min s − t cut (max s − t flow) solver such as
• ford-fulkerson algorithm: O(n^2 · m · |f_max|)
• edmond-karp algorithm: O(n^2 · nm^2)
• dinic's or push-relabel algorithm: O(n^2 · n^2 m)
A smarter approach is to fix a node s, which must appear in one of S or S̄, and find the min s − t cut for all t ∈ V. This requires only O(n) calls to a min s − t cut (max s − t flow) solver.
Many deterministic algorithms have been proposed for the min-cut problem, such as the Stoer-Wagner algorithm, which takes O(nm + n^2 log n) time. We will study a simple randomized algorithm by Karger and an elegant extension of it by Karger and Stein. These algorithms are based on the edge contraction operation. Before we discuss what the edge contraction operation is, recall the concepts of pseudographs and multigraphs, as shown in Figure 16.

Figure 16: A graph G = (V, E), where V is the vertex set, is a pseudograph if E is a set of edges, and a multigraph if E is a multiset of edges, i.e. there may be multiple edges between a pair of nodes. Both types of graphs may have self-loops on nodes.

5.1 Edge Contraction


Contraction of an edge (u, v) in G constructs a graph G \ {u, v} in which u and v become a single vertex uv, the edge (u, v) becomes a self-loop (which we remove in this case), and all edges incident on u or v become incident on uv. As shown in Figure 17, the resulting graph may be a multigraph, and we keep all parallel edges introduced in the contraction process.

16
Figure 17: Example of edge contraction: the function contract(G, e) takes a graph and an edge as input and returns the resulting graph after contracting the given edge. The self-loop introduced by the contraction is shown only in the first graph after contraction.

Edge contraction can be performed in O(n) time by merging the adjacency lists of u and v; the adjacency lists of the neighbouring vertices can also be updated in O(n) time if we keep corresponding pointers at the entries of the adjacency lists (see Figure 18). We remove any self-loops in the resulting graph. Multigraphs can be stored compactly by recording edge multiplicities as weights.
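As a concrete sketch, edge contraction on an adjacency-list multigraph can be written as follows. This uses a dict-of-lists representation (our choice) and rescans the lists rather than using the O(n) pointer trick of Figure 18:

```python
def contract(adj, u, v):
    # adj: dict mapping each vertex to a list of neighbours (a symmetric
    # multigraph). Merge v into u: v's edges move to u, every remaining
    # reference to v is redirected to u, and (u, v) self-loops are removed.
    for w in adj.pop(v):
        if w != u:              # edges between u and v become self-loops: drop
            adj[u].append(w)
    for w in adj:
        adj[w] = [u if x == v else x for x in adj[w]]
    adj[u] = [x for x in adj[u] if x != u]
    return adj
```

For example, contracting edge (b, c) of the triangle a-b-c leaves a multigraph with two parallel edges between a and the supernode b.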


Figure 18: In each entry of the adjacency list, we maintain a pointer to the corresponding entry in the adjacency list of the other end-point. These pointers can be initially populated in one scan over the graph.

Now, how does edge contraction affect the minimum cut in the graph? The min cut
in G \ {u, v} is at least as large as the min cut in G because any cut in G \ {u, v} is
‘actually’ a cut in G too. However, the converse is not necessarily true, as shown in
Figure 19.

Figure 19: The min cut in G is not necessarily at least as large as the min cut in G \ {u, v}: the min-cut of the graph on the left, [{a, b, c}, {e, f, g}], has size one, whereas there is no cut of size one in the graph on the right, obtained by contracting edge (c, f).

5.2 Karger’s Algorithm


Karger's algorithm for estimating the minimum cut simply contracts a random edge in the graph until only two vertices, or 'supernodes', are left. The estimated min-cut is then the cut induced by the two supernodes. For example, if the last two remaining vertices are (uv) and (xyz), the cut is [{u, v}, {x, y, z}].

Algorithm 7 : Karger’s algorithm for mincut (G)


while there are more than two vertices left in G do
Pick a random edge e = (u, v)
G ← G \ uv
return G ▷ the cut induced by the remaining two (super)nodes

The estimated cuts from two runs of Algorithm 7 are shown in Figure 20.
Since each contraction reduces the number of vertices by 1, the number of contractions performed by Karger's algorithm is n − 2. With the right data structure, a contraction can be done in O(n), so the total runtime is O(n^2).
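Karger's algorithm with repetition can be sketched as follows; vertices are numbered 0..n−1, and a union-find structure tracks supernodes (an implementation choice of ours; the adjacency-list merging described earlier would also work):

```python
import random

def karger_min_cut(edges, n):
    # Contract random edges until two supernodes remain; the edges that
    # survive (i.e. are not self-loops) form the estimated cut.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    remaining, E = n, list(edges)
    while remaining > 2:
        u, v = random.choice(E)
        ru, rv = find(u), find(v)
        if ru != rv:                        # contract: merge the supernodes
            parent[ru] = rv
            remaining -= 1
        E = [(a, b) for (a, b) in E if find(a) != find(b)]  # drop self-loops
    return E                                # crossing edges of the cut

def good_min_cut(edges, n, M):
    # Repeat M times and return the smallest cut found.
    return min((karger_min_cut(edges, n) for _ in range(M)), key=len)
```

On a graph made of two triangles joined by a single bridge edge, enough repetitions almost surely recover the unique min-cut of size 1.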
Let us now analyze Karger's algorithm for the 'goodness' of the estimated min-cut. Let C = [S, S̄] be a cut. The intuition is that if some edge of C is contracted during the execution, the algorithm cannot output C: if (u, v) ∈ C, i.e. u ∈ S and v ∈ S̄, is contracted, then u and v belong to the same supernode and (u, v) cannot be a crossing edge. Since edges are selected at random for contraction, a cut with more crossing edges is more likely to be destroyed. The algorithm outputs C only if it never contracts any edge of C, and among all cuts, min-cuts have the least probability of having an edge contracted. However, this intuition alone is not enough, and we must obtain a probabilistic guarantee on the likelihood of a min-cut being output by the algorithm.

Figure 20: Two runs of the Karger’s algorithm (the arrow marks the contracted edge
in each step) producing a sub-optimal cut of size 3 (top) and an optimal min-cut of
size 2 (bottom).

Let G_i = (V_i, E_i) be the graph after the i-th contraction for 0 ≤ i ≤ n − 2, so |V_i| = n_i = n − i. Note that initially, G_0 = (V_0, E_0) = G = (V, E) with |V_0| = n and |E_0| = m.
Let C = [S, S̄] be a specific min-cut of size k. Since C is a min-cut of size k, every vertex has degree ≥ k. Combining this with the handshaking lemma (i.e. the sum of vertex degrees equals twice the number of edges in the graph), we get m_0 ≥ kn_0/2 and, more generally, m_i ≥ kn_i/2 = k(n − i)/2.
We say that C is 'killed' if one of its edges is contracted in a given 'round' or iteration of the algorithm. We compute the probability that the min-cut C of size k is the output of the algorithm as follows.

Pr[C is killed in the 1st round] = Pr[an edge of C is contracted] = k/m_0 ≤ 2/n
Pr[C survives the 1st round] = Pr[no edge of C is contracted] ≥ 1 − 2/n
Pr[C survives the (i+1)-th round | C survived so far] = 1 − k/m_i ≥ 1 − 2/(n − i)
Pr[C survives all rounds] = Pr[C is the output]
  = Π_{i=0}^{n−3} Pr[C survives round i + 1 | C survived so far]
  ≥ Π_{i=0}^{n−3} (n − i − 2)/(n − i)
  = (n−2)/n × (n−3)/(n−1) × (n−4)/(n−2) × · · · × 2/4 × 1/3 = 2/(n(n−1))

Therefore, the probability that a specific min-cut is output by Karger's algorithm is at least 2/(n(n−1)) ≈ 2/n^2, which seems very small. However, a graph has exponentially many cuts, and the algorithm still finds a min-cut with this inverse-polynomial probability.

Nevertheless, with repeated trials we can amplify the probability to any desired value: run Karger's algorithm multiple times, say M times, and return the smallest of the M cuts obtained. Suppose we call this the good-min-cut(G, M) algorithm. Then,

Pr[all M runs fail to output C] = Π_{i=1}^{M} Pr[run i fails] ≤ (1 − 1/n^2)^M

A very useful inequality that we will use now is: ∀ x ∈ R, 1 + x ≤ e^x. Applying this inequality to the probability that good-min-cut(G, M) fails to output C, we get:

Pr[good-min-cut(G, M) fails to output C] ≤ e^(−M/n^2)

If we set M = cn^2 log n, then Pr[good-min-cut(G, M) outputs C] ≥ 1 − 1/n^c. In this case, with M = cn^2 log n, the runtime of good-min-cut is O(n^4 log n).

5.3 Karger-Stein Algorithm


In the good-min-cut algorithm, we repeat all n − 2 rounds of Karger's algorithm each time. The Karger-Stein extension of the algorithm, however, tries to do this more cleverly. Let us analyze the probability of a min-cut C being 'killed' (i.e. an edge of C being contracted) in each round.

Pr[C is killed in round 1] = Pr[an edge of C is contracted] = k/m_0 ≤ 2/n
Pr[C is killed in round 2 | C survived round 1] = k/m_1 ≤ 2/(n − 1)
Pr[C is killed in round (i+1) | C survived so far] = k/m_i ≤ 2/(n − i)
Pr[C is killed in round (n − 3) | C survived so far] ≤ 2/4
Pr[C is killed in round (n − 2) | C survived so far] ≤ 2/3

As we can see, the probability of a wrong contraction increases in each round. The intuition is that we can repeat only the later rounds, where the probability of C being killed is higher, to make it more likely that we come across a run where it is not killed, rather than wasting time repeating the first 'few' safer iterations. In other words, as G gets smaller, we repeat increasingly many times to reduce the error probability. This is, in fact, what the Karger-Stein algorithm does.
As outlined in Algorithm 8, the Karger-Stein algorithm obtains two independently and randomly contracted graphs H_1 and H_2 from G and recurses on each, which in turn produces 4 random contraction sequences on smaller graphs, and so on. When the graph has at most 6 vertices, an exact min-cut is found by brute force over its ∼ 2^5 cuts. Note that we can no longer chase a fixed minimum cut C: both C_1 and C_2 could be min-cuts, if successful, and we may return either.

Algorithm 8 : Karger-Stein algorithm for min-cut
1: function contract(G, t)
2:     while more than t vertices left in G do
3:         Pick a random edge e = (u, v)
4:         G ← G/uv                          ▷ contract the edge e
5:     return G
6: function Fast-Cut(G)
7:     if n ≤ 6 then
8:         return exact min-cut via brute force
9:     t ← ⌈1 + n/√2⌉
10:    H1 ← contract(G, t)
11:    H2 ← contract(G, t)
12:    C1 ← Fast-Cut(H1)
13:    C2 ← Fast-Cut(H2)
14:    return smaller of C1 and C2

Let T(n) be the runtime of fast-cut(G) with |V(G)| = n. Then

T(n) = 2T(n/√2) + O(n²)   if n > 6
T(n) = O(1)               otherwise

Then, by the master theorem, T(n) = O(n² log n), which is not much worse than the O(n²) of the initial Karger's algorithm.
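For concreteness, here is a compact Python sketch of fast-cut (our own illustration: contraction relabels vertices in O(n) per merge, so it does not attain the O(n² log n) bound, but it mirrors the structure of Algorithm 8; the base case enumerates all 2⁵ − 1 two-partitions of a ≤6-vertex graph):

```python
import math
import random

def contract(edges, labels, t):
    """Randomly contract until t super-vertices remain; `labels` maps each
    original vertex to its super-vertex."""
    labels = dict(labels)                    # recursion branches must not share state
    vertices = set(labels.values())
    while len(vertices) > t:
        u, v = random.choice(edges)
        lu, lv = labels[u], labels[v]
        if lu != lv:                         # skip self-loops of the multigraph
            for w in labels:
                if labels[w] == lu:
                    labels[w] = lv           # merge super-vertex lu into lv
            vertices.discard(lu)
    return labels

def cut_size(edges, labels):
    return sum(1 for u, v in edges if labels[u] != labels[v])

def fast_cut(edges, labels):
    n = len(set(labels.values()))
    if n <= 6:                               # brute force over all 2^(n-1) - 1 partitions
        verts = sorted(set(labels.values()))
        best = len(edges)
        for mask in range(1, 2 ** (len(verts) - 1)):
            side = {verts[i] for i in range(len(verts)) if mask >> i & 1}
            lab = {u: labels[u] in side for u in labels}
            best = min(best, cut_size(edges, lab))
        return best
    t = math.ceil(1 + n / math.sqrt(2))
    return min(fast_cut(edges, contract(edges, labels, t)),
               fast_cut(edges, contract(edges, labels, t)))

# two triangles joined by a bridge: the min-cut is the single bridge edge
G = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
print(fast_cut(G, {v: v for v in range(6)}))   # 1 (n = 6, solved by brute force)
```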
Finally, we analyze the quality of the Karger-Stein algorithm. fast-cut(G) succeeds (i.e. finds an optimal min-cut) if (a) a min-cut survives one of the contract(G, t) calls and (b) the corresponding recursive call fast-cut(Hᵢ) finds a min-cut. The probability that a min-cut survives a contract(G, t) call (lines 10 and 11 in Algorithm 8) is:

Pr[a cut survives n − t contractions] = ∏_{i=0}^{n−t−1} (n − i − 2)/(n − i)
  = (n − 2)/n × (n − 3)/(n − 1) × ... × t/(t + 2) × (t − 1)/(t + 1)
  = t(t − 1)/(n(n − 1)) ≃ 1/2,   since t ≈ n/√2

Let P(j) be the probability that fast-cut(H) finds a min-cut when |V(H)| = j. fast-cut(G) finds a min-cut when at least one of the following two branches succeeds:
• a min-cut survives the contraction to H₁ AND C₁ is a min-cut of H₁
• a min-cut survives the contraction to H₂ AND C₂ is a min-cut of H₂
The probability that branch i succeeds (i.e. a min-cut survives in Hᵢ AND Cᵢ is a min-cut of Hᵢ) is (1/2)·P(t). The probability P(n) that fast-cut(G) succeeds is therefore at least the probability that not both branches fail, and it can be proved by induction that

P(n) ≥ 1 − (1 − (1/2)P(t))² = 1 − (1 − (1/2)P(n/√2))² = Ω(1/log n)

Therefore, the extended algorithm has success probability Ω(1/log n), which is much better than the Ω(1/n²) of the initial version. Furthermore, to achieve success probability 1 − 1/nᶜ, the initial version had to be amplified by cn² log n independent trials, with total O(n⁴ log n) runtime, whereas the fast-cut(G) algorithm needs to be amplified by only c log² n independent trials, with total runtime O(n² log³ n). Further improvements to these algorithms have been made, but they are out of the scope of this course.

6 Max-3-SAT
Recall the 3-SAT(f) problem: given n Boolean variables x₁, ..., xₙ, where each xᵢ can take the value 0 or 1, a literal is a variable appearing in a formula as xᵢ or x̄ᵢ, and a clause of size 3 is an OR of three literals. A 3-CNF formula, which is an AND of one or more clauses of size ≤ 3, is satisfiable if there is an assignment of 0/1 values to the variables under which the formula evaluates to 1 (True). The 3-SAT(f) search problem is to find a satisfying assignment for the 3-CNF formula f. This problem is NP-hard. An NP-hard optimization problem in the same setting is Max-3-SAT.
PROBLEM 4 (Max-3-SAT(f) problem). Find an assignment for a 3-CNF formula f that satisfies the maximum number of clauses.
The brute force algorithm for Max-3-SAT(f) is to try all 2ⁿ possible assignments in O(m·2ⁿ) time, which is infeasible. We study a more efficient randomized algorithm.
The key idea is very simple: toss a coin and set a variable to True if Heads and to False
if Tails, i.e. independently set each variable true with probability 1/2. The important
question now is, what is the expected number of clauses satisfied by such a random
assignment?
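The coin-tossing algorithm is a few lines of code. In the sketch below (the clause encoding and names are our own: a clause is a list of signed variable indices), we estimate the expected number of satisfied clauses empirically:

```python
import random

def random_assignment(n):
    """Independently set each variable True with probability 1/2."""
    return [random.random() < 0.5 for _ in range(n)]

def num_satisfied(clauses, a):
    """A clause is a list of signed indices: +i means x_i, -i means its negation."""
    value = lambda l: a[abs(l) - 1] if l > 0 else not a[abs(l) - 1]
    return sum(any(value(l) for l in clause) for clause in clauses)

# (x1 v x2 v x3) ^ (!x1 v x2 v !x4) ^ (x1 v !x2 v x4) ^ (!x3 v !x4 v x2)
clauses = [[1, 2, 3], [-1, 2, -4], [1, -2, 4], [-3, -4, 2]]
random.seed(1)
trials = 20_000
avg = sum(num_satisfied(clauses, random_assignment(4))
          for _ in range(trials)) / trials
print(avg)   # concentrates near 7m/8 = 3.5 for these m = 4 clauses
```

The empirical average matches the 7m/8 expectation derived next.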
Theorem 2. A random assignment to the variables satisfies, in expectation, 7m/8 clauses of a 3-CNF formula f with m clauses (each containing three distinct literals).

Proof. Let the random variable Zⱼ = 1 if clause Cⱼ is satisfied and Zⱼ = 0 otherwise. Then E[Zⱼ] = Pr[Cⱼ is satisfied] = 1 − Pr[Cⱼ is not satisfied]. Evidently, Cⱼ is not satisfied only when all three literals of Cⱼ evaluate to false. Since the variable values are set independently, Pr[Cⱼ is not satisfied] = (1/2)³ = 1/8. Thus, E[Zⱼ] = 7/8.
Let Z be the number of clauses satisfied by a random assignment. By linearity of expectation,

E[Z] = ∑_{j=1}^{m} E[Zⱼ] = ∑_{j=1}^{m} 7/8 = 7m/8.

Theorem 3. For any instance of Max-3-SAT with m clauses, there exists a truth assignment which satisfies at least 7m/8 clauses.

Proof. There is a non-zero probability that a random variable takes a value at least its expectation, i.e. Pr[Z ≥ E[Z] = 7m/8] > 0.

This non-constructive way of proving a claim, due to Paul Erdős, is called the Probabilistic Method: proving the existence of a non-obvious property by showing that a random construction produces it with positive probability.

Although there exists a truth assignment satisfying at least 7m/8 clauses, we have no probabilistic guarantee that the randomized algorithm above finds one. Again utilizing the standard trick of repeated trials, we can repeatedly generate a random assignment A to the variables until A satisfies at least 7m/8 clauses. But how long will that take? Can we guarantee that this is achieved in polynomial time? It turns out yes, in expectation at least. In fact, this gives us a 7/8 Las Vegas approximation algorithm for Max-3-SAT, i.e. an algorithm guaranteed to find an assignment satisfying at least 7m/8 clauses whose expected runtime is polynomial.
When is the runtime polynomial? Suppose Pr[A satisfies ≥ 7m/8 clauses] ≥ p. Then, by the expectation of a geometric random variable, the expected number of trials to find such an assignment is 1/p. If p is at least an inverse polynomial, the expected running time is polynomial. We therefore show that p ≥ 1/(8m).
Theorem 4. The probability p that a random assignment satisfies ≥ 7m/8 clauses is at least 1/(8m).

Proof. A lower bound on p can be obtained using E[Z] = 7m/8, where Z is the number of clauses satisfied by a random assignment. Let pⱼ be the probability that the random assignment satisfies exactly j clauses, for j = 0, 1, ..., m. Then

E[Z] = ∑_{j=0}^{m} j·pⱼ = ∑_{j<7m/8} j·pⱼ + ∑_{j≥7m/8} j·pⱼ ≤ (7m−1)/8 · ∑_{j<7m/8} pⱼ + m · ∑_{j≥7m/8} pⱼ ,

where (7m−1)/8 bounds the largest integer smaller than 7m/8. This implies E[Z] ≤ (7m−1)/8 · 1 + m·p, which in turn implies 7m/8 ≤ (7m−1)/8 + mp. Solving this inequality for p shows that p ≥ 1/(8m).

In fact, it was proven by Håstad in 1997 that this bound is tight, i.e. Max-3-SAT cannot be approximated in polynomial time to within a ratio greater than 7/8, unless P = NP.

6.1 Derandomization
So far, in all the randomized algorithms that we have seen, the random choices made by an algorithm sometimes happen to be 'good', meaning the algorithm's output is close to optimal. Is it possible to always make these 'good' choices, i.e. can they be made deterministically? For some problems and algorithms, yes. This process of transforming a randomized algorithm into a deterministic algorithm is called derandomization. We show how the 7/8 Las Vegas algorithm for Max-3-SAT can be derandomized.
In order to derandomize the 7/8 Las Vegas algorithm for Max-3-SAT, we need to know which choices of variable assignments are 'good', i.e. satisfy a large number of clauses. The idea is to consider the choice for each variable, either True or False, one at a time. Given assignments x₁ = a₁, ..., xᵢ = aᵢ for the "first i" variables, the expected number of satisfied clauses under a random assignment of the unassigned variables xᵢ₊₁, ..., xₙ can be computed in polynomial time: for each clause Cⱼ, when a variable is assigned, if the corresponding literal in the clause evaluates to False, remove that literal from Cⱼ; otherwise, if the literal evaluates to True, the clause is already satisfied and can be ignored for the remaining assignments. This yields the following polynomial-time deterministic algorithm for Max-3-SAT.
First, fix an order of variables x1 , x2 , · · · , xn . Traverse the variables in the fixed order,
setting the value of the next variable in each iteration. How is the value of xᵢ set? xᵢ is set to the value for which the conditional expectation of the number of satisfied clauses is larger, i.e. if E[Z | x₁ = a₁, ..., xᵢ₋₁ = aᵢ₋₁, xᵢ = True] > E[Z | x₁ = a₁, ..., xᵢ₋₁ = aᵢ₋₁, xᵢ = False], then xᵢ = True, and otherwise xᵢ = False.
How is the conditional expectation computed? Let Z be the number of clauses sat-
isfied. The conditional expectation of Z is the unconditional expectation of Z in the
reduced set of clauses plus the number of already satisfied clauses.
Since E[Z | x₁ = a₁, ..., xᵢ = aᵢ] ≥ E[Z] for 1 ≤ i ≤ n by construction, and E[Z] = 7m/8, it follows that E[Z | x₁ = a₁, ..., xₙ = aₙ] ≥ 7m/8. Therefore, the derandomized algorithm for Max-3-SAT deterministically satisfies at least 7m/8 clauses.
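The greedy rule above can be sketched in a few lines of Python (function names and the clause encoding are our own; a clause that is not yet satisfied and has u unfixed literals contributes 1 − 2⁻ᵘ to the conditional expectation):

```python
def cond_expectation(clauses, fixed):
    """E[# satisfied clauses] when the variables in `fixed` (var -> bool)
    are set and the remaining variables are uniformly random."""
    total = 0.0
    for clause in clauses:
        satisfied, unfixed = False, 0
        for l in clause:                      # +i means x_i, -i means its negation
            v = abs(l)
            if v in fixed:
                satisfied |= (l > 0) == fixed[v]
            else:
                unfixed += 1
        total += 1.0 if satisfied else 1.0 - 0.5 ** unfixed
    return total

def derandomized_max3sat(clauses, n):
    """Method of conditional expectations: fix x1, ..., xn one at a time,
    always keeping the branch with the larger conditional expectation."""
    fixed = {}
    for i in range(1, n + 1):
        fixed[i] = True
        e_true = cond_expectation(clauses, fixed)
        fixed[i] = False
        e_false = cond_expectation(clauses, fixed)
        fixed[i] = e_true > e_false
    return fixed

clauses = [[1, 2, 3], [-1, 2, -4], [1, -2, 4], [-3, -4, 2]]
a = derandomized_max3sat(clauses, 4)
sat = sum(any((l > 0) == a[abs(l)] for l in clause) for clause in clauses)
print(sat)   # >= ceil(7*4/8) = 4: all four clauses are satisfied
```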

7 Closest Pair
For two points pᵢ = (xᵢ, yᵢ) and pⱼ = (xⱼ, yⱼ), the Euclidean distance is defined as d(pᵢ, pⱼ) = √((xᵢ − xⱼ)² + (yᵢ − yⱼ)²), which can be computed in O(1).
PROBLEM 5 (Closest-Pair(P) problem). Given a set P = {p₁, p₂, ..., pₙ} of n distinct points in ℝ², find a pair of distinct points (pᵢ, pⱼ) in P that minimizes d(pᵢ, pⱼ).
The closest pair problem finds various applications in computer graphics, computer
vision, geographic information systems, molecular modeling and air traffic control.
The naive brute force algorithm finds the minimum among all (n choose 2) pairwise distances in O(n²) time. In the 1-dimensional case, a straightforward algorithm is to sort the points in O(n log n) time and find the closest adjacent points in O(n). For the 2-dimensional case, we have already seen a Divide and Conquer algorithm that takes O(n log n) time. Now we will study a randomized algorithm for the Closest-Pair(P) problem whose expected runtime is O(n).
We assume that all pairwise distances are distinct and, without loss of generality, that all points lie in the unit square 0 ≤ xᵢ, yᵢ ≤ 1.
Let P = {p₁, p₂, ..., pₙ} be a fixed random order of the points and let Sᵢ = {p₁, p₂, ..., pᵢ} be the set of the first i points of P. We denote by δᵢ the distance of the closest pair in Sᵢ; thus the distance of the closest pair in P is δₙ. The idea is to begin with S₂ laid out on a grid G with cell size δ₂ × δ₂. In each following iteration, we add the next random point pᵢ to the grid and update the cell size of the grid to δᵢ if δᵢ < δᵢ₋₁, as shown in Figure 21.

Figure 21: Point pᵢ added to the grid (left). The highlighted cells constitute the neighbourhood N(pᵢ) of the cell containing pᵢ, which is used to compute δᵢ efficiently. The grid cell size is updated when δᵢ < δᵢ₋₁ (right).

A naive method to compute δᵢ is to compute the distance of pᵢ to all points in Sᵢ₋₁, which results in overall quadratic time, as eventually all pairwise distances would have been compared. An efficient way to compute δᵢ in constant time uses the fact that any point outside the cells adjacent to the cell containing pᵢ, i.e. outside the neighbourhood N(pᵢ) shown in Figure 21, is already at distance at least δᵢ₋₁ from pᵢ, since the cell size of the grid is the minimum pairwise distance computed so far. Therefore, it suffices to compare δᵢ₋₁ with the distances from pᵢ to the points in the cells of N(pᵢ), each of which contains only a constant number of points (points at pairwise distance ≥ δᵢ₋₁ in a δᵢ₋₁ × δᵢ₋₁ cell). The procedure is outlined in Algorithm 9.

To implement the grid structure, we identify the following required operations:
1. build-grid(S, δ): build a grid G with cell size δ and insert all points of S
2. insert-point(pᵢ): insert pᵢ
3. locate-cell(pᵢ): return the cell containing pᵢ
4. get-points(c): return the points in cell c
The grid can be implemented using hashing such that all operations take O(1) expected time. The key universe is the set of IDs of all cells of the grid, whereas the actual key space is the IDs of the cells containing points. The point coordinates are the data for each key. For a grid with cell size δ, point pᵢ lies in the cell (⌊xᵢ/δ⌋, ⌊yᵢ/δ⌋).

Algorithm 9 Randomized Closest Pair: returns distance

function closest-pair(P)
    {p₁, p₂, ..., pₙ} ← random-permutation(P)
    S₂ ← {p₁, p₂};  δ₂ ← d(p₁, p₂)
    G ← build-grid(S₂, δ₂)
    for i = 3 → n do
        Sᵢ ← Sᵢ₋₁ ∪ {pᵢ}                  ▷ O(1)
        Compute δᵢ                         ▷ O(1)
        if δᵢ < δᵢ₋₁ then G ← build-grid(Sᵢ, δᵢ)   ▷ O(i)
        else insert-point(pᵢ)              ▷ O(1)
    return δₙ
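Algorithm 9 can be sketched in Python as below. This is our own illustration, with one small deviation from the text: each grid cell stores a short (constant-size) list of points rather than a single point, which keeps the neighbourhood search correct even when two points at distance ≥ δ share a δ × δ cell.

```python
import math
import random

def closest_pair(points):
    """Randomized closest pair with expected O(n) time; the grid is a dict
    from integer cell ids to the (few) points stored in that cell."""
    pts = points[:]
    random.shuffle(pts)                       # random insertion order

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def cell(p, delta):
        return (int(p[0] // delta), int(p[1] // delta))

    def build(pts_so_far, delta):
        grid = {}
        for p in pts_so_far:
            grid.setdefault(cell(p, delta), []).append(p)
        return grid

    delta = dist(pts[0], pts[1])
    grid = build(pts[:2], delta)
    for i in range(2, len(pts)):
        p = pts[i]
        cx, cy = cell(p, delta)
        # any point closer than delta to p must lie in the 3x3 neighbourhood
        best = delta
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for q in grid.get((cx + dx, cy + dy), []):
                    best = min(best, dist(p, q))
        if best < delta:                      # closest distance shrank: rebuild
            delta = best
            grid = build(pts[:i + 1], delta)
        else:
            grid.setdefault((cx, cy), []).append(p)
    return delta

pts = [(0.1, 0.1), (0.9, 0.9), (0.5, 0.52), (0.5, 0.5), (0.2, 0.8)]
print(round(closest_pair(pts), 6))            # 0.02
```

The result does not depend on the shuffle; only the running time does.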

Note that the runtime of an iteration of the above algorithm depends on whether the grid is rebuilt with an updated cell size. Thus, to analyze the expected runtime, we first need the probability of the event δᵢ < δᵢ₋₁ for a given set Sᵢ.
Let Cᵢ be the closest pair in Sᵢ. Given Sᵢ, we have δᵢ < δᵢ₋₁ exactly when pᵢ ∈ Cᵢ, i.e. when the last point inserted is one of the two points of the closest pair. Since each permutation of Sᵢ is equally likely, Pr[δᵢ < δᵢ₋₁ | Sᵢ] = 2(i − 1)!/i! = 2/i. The unconditional probability of δᵢ < δᵢ₋₁ is then

Pr[δᵢ < δᵢ₋₁] = ∑_j Pr[δᵢ < δᵢ₋₁ | Sᵢ^j] · Pr[Sᵢ^j] ,

where the sum ranges over all (n choose i) possible choices Sᵢ^j of the set Sᵢ.

Since the probabilities of the (n choose i) possible sets Sᵢ^j add up to 1, i.e. ∑_j Pr[Sᵢ^j] = 1, the law of total probability gives

Pr[δᵢ < δᵢ₋₁] = (2/i) · ∑_j Pr[Sᵢ^j] = 2/i

Let Xᵢ be the runtime of iteration i and let X be the overall runtime of the algorithm. Then

E[Xᵢ] = O(1) + O(i) · Pr[δᵢ < δᵢ₋₁] = O(1) + O(i) · 2/i = O(1).

By linearity of expectation, E[X] = ∑_{i=1}^{n} E[Xᵢ] = O(n). Therefore, the expected runtime of the randomized algorithm for Closest-Pair(P) is linear.

8 Hashing
8.1 Dictionary ADT
From your Data Structures course, recall that the dictionary ADT (Abstract Data Type) maintains a set of n elements from a universe U. The elements are unique, identified by their 'key' k, and may have a particular value or satellite data associated with the key, i.e. elements can be compound (key, value) pairs. For example, a dictionary may use student IDs as keys and exam scores as the corresponding values. Another dictionary keyed by student IDs may store each student's personal information such as age, gender, address, etc. The main operations required of the dictionary ADT are Insert, Lookup and Delete. A dictionary can be implemented in various ways: using arrays and linked lists (which may be sorted or unsorted), binary search trees (which may be balanced or unbalanced), and hash tables. Table 1 summarizes the time complexity of the key operations for each of these data structures.

Operation     | Unsorted Array | Sorted Array | Unsorted Linked list | Sorted Linked list | BST  | AVL      | Hash Function
Search(D,k)   | O(n)           | O(log n)     | O(n)                 | O(n)               | O(h) | O(log n) | O(1)
Insert(D,k,v) | O(1)           | O(n)         | O(1)                 | O(n)               | O(h) | O(log n) | O(1)
Delete(D,k)   | O(n)           | O(n)         | O(n)                 | O(n)               | O(h) | O(log n) | O(1)

Table 1: Runtime of dictionary operations for different implementations

In many applications, it is vital to perform all three operations efficiently, in constant time. A simple, straightforward way to ensure all operations are done in O(1) is a direct-address table, as shown in Figure 22, where each position of the table corresponds to a key of the universe U.

Figure 22: Dictionary implementation using a direct-address table (image: CLRS)

An evident drawback of this implementation is the large amount of unused space occupied when the universe is large. If there is no satellite data, the space can be reduced by storing the keys in a bit-vector, which for a fixed number of elements takes much less space than an array. However, a large fraction of the occupied space is still unused. To see how O(1) time complexity can be achieved without wasting space, we introduce hash tables.

8.2 Chained Hash Tables


Let m ∈ ℤ⁺ be a positive integer and let h : U → [m] be a function that maps elements of U to integers in [1, m]. The hash table T is then an array (or table) of length m. For an element x with key k: to insert x, store it at T[h(k)]; to look x up, read T[h(k)]; and to delete x, remove it from T[h(k)]. Since h(k) can be computed in constant time, all operations take O(1).
Figure 23: Dictionary implementation using a hash table, which results in a collision as elements k₅ and k₂ map to the same index in the table (image: CLRS)

As shown in Figure 23, a collision occurs if h(kₓ) = h(k_y), and it is unclear which of kₓ or k_y should be stored. Most likely we want to store both kₓ and k_y, for which we use chained hash tables. Instead of storing a single element at each position of the table T, we let T[i] be an array or list for each i ∈ [1, m]. The operations are carried out as before, but the element x with key k is now inserted, looked up and deleted in the list at T[h(k)] instead of the slot T[h(k)] itself, as shown in Figure 24.
Figure 24: Dictionary implementation using a chained hash table (image: CLRS)

The runtime of all operations on an element x with key k now depends on the length of the list at T[h(k)]. Chaining is reasonably efficient if we can somehow ensure that the lists in T do not grow too long. As it turns out, we can obtain probabilistic guarantees by introducing randomization into the hashing (the mapping of elements of U to lists of T by h).

8.3 Randomized Hashing


Note that since an element must always hash (be mapped) to the same list of T for correct storage and retrieval, the hash function h cannot involve randomness at query time. However, given multiple hash functions, we can randomly choose which one to use; for the analysis, assume that for each z ∈ U, h(z) is distributed uniformly over [m]. We now analyze the expected runtime of operations in chained hash tables under such randomized hashing.
For any xᵢ ∈ U, let the random variable Cᵢ = 1 if h(xᵢ) = h(z) and Cᵢ = 0 otherwise. Let X be the number of elements in the same list as z, i.e. X = ∑_{xᵢ≠z} Cᵢ. Then, by linearity of expectation,

E[X] = E[∑_{xᵢ≠z} Cᵢ] = ∑_{xᵢ≠z} E[Cᵢ] = ∑_{xᵢ≠z} Pr[h(xᵢ) = h(z)] = ∑_{xᵢ≠z} 1/m ≤ n/m

Therefore, the expected runtime of the operations is O(1 + E[X]) = O(1 + n/m). We observe a space-time tradeoff: a larger m leads to a lower expected runtime, since the more lists there are, the shorter each list is expected to be for a fixed number of elements n. We desire a small range m of the hash functions due to space limitations, yet few collisions for efficient operations. Furthermore, since each operation requires evaluating a hash value, it must be easy to do so for any key using small space. We can obtain such probabilistic guarantees using universal hash functions, defined as follows:
Theorem 5. A family of hash functions H is 2-universal iff for any x, y ∈ U with x ≠ y, if h ∈ H is chosen uniformly at random, then Pr[h(x) = h(y)] ≤ 1/m.
An example of a universal family, for U = ℤ, is given by linear congruences. For a prime number p > m and any two integers a and b such that 1 ≤ a ≤ p − 1 and 0 ≤ b ≤ p − 1, the hash function h_{a,b} : U → [m] is defined as h_{a,b}(x) = ((ax + b) mod p) mod m. Then H := {h_{a,b} : 1 ≤ a ≤ p − 1, 0 ≤ b ≤ p − 1} is 2-universal. This makes picking a random h ∈ H easy, since it amounts to selecting random a and b.
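This family is a few lines of Python. The sketch below (our own illustration; the chosen prime and table size are arbitrary) draws many random functions from the family and checks empirically that a fixed pair of distinct keys collides with probability about 1/m:

```python
import random

# A sketch of the ((a*x + b) mod p) mod m family for integer keys in [0, p).
p = 2_147_483_647                     # a prime larger than the key universe
m = 1000                              # hash table size

def random_hash():
    """Pick h uniformly from H = {h_ab : 1 <= a <= p-1, 0 <= b <= p-1}."""
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda x: ((a * x + b) % p) % m

random.seed(0)
x, y = 42, 4242                       # a fixed pair of distinct keys
trials = 200_000
collisions = sum(1 for _ in range(trials) if (h := random_hash())(x) == h(y))
print(collisions / trials)            # close to 1/m = 0.001
```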
Another interesting example of hashing with probabilistic guarantees is Locality-Sensitive Hashing (LSH), which is widely used where algorithms must work in very limited space, such as in the data streaming model, where the data is far larger than the available memory. Although we do not dive into the LSH scheme, we discuss how randomization is crucial to the processing of streaming data in the following section.

9 Stream Processing
Stream processing is the application of data analysis and algorithmic methods to a continuous data stream: a massive sequence of data items too large to store on disk, in memory, cache, etc. For example, data from social media (twitter feeds, foursquare checkins), sensors (weather, radars, cameras, IoT and energy devices), network traffic (trajectories, source/destination pairs), financial and satellite feeds, and sequences of web clicks and search queries are processed in a streaming manner. Before we see how this kind of data is dealt with, and where randomization lies in this context, we note the key characteristics of data streams:
• Huge volumes of continuous data, possibly infinite
• Arbitrary arrival order of items
• Fast changing, requiring fast, real-time responses
• Captures aptly our data processing needs of today
• Random access is expensive
• Constant time per-item processing required
• A single pass or scan over the stream is allowed in most cases
• Only a summary of the data seen so far can be stored, due to limited memory
• Most stream data are low-level or multidimensional in nature and thus need multi-level and multi-dimensional processing
Data items can be of complex types such as documents (tweets, news articles), images, geo-located time-series, etc. For simplicity, we abstract away application-specific details and study the basic algorithmic ideas, treating the data stream as a sequence of numbers.

9.1 Stream Model of Computation


A data stream is a sequence of m items S = a₁, a₂, a₃, ..., a_m, where each item is drawn from some universe of size n. Typically we take the universe to be [n] := {1, 2, ..., n}. Here n and m are two size parameters; m may be unknown, and we do not assume anything about the distribution of the items.
Typically we work in the model where each item is seen only once, in the order given by S; in general we cannot (or do not want to) save the whole stream. The literature also contains algorithms that take more than one pass over the stream, but here we restrict ourselves to one-pass algorithms only.
Our goal is to compute some function f(S) over the stream using a very small amount of memory: the space requirement of the algorithm should be o(min{m, n}). Ideally we should use O(log n + log m) space; note that this much space is needed anyway to store one (or a constant number of) stream items for processing. Sometimes the space requirement is relaxed to polylogarithmic in min{m, n} (like (log n)^c for some constant c). In most interesting cases, f(S) provably cannot be computed exactly under the sublinear-space and one-pass requirements. We therefore often allow approximation algorithms that make errors with bounded probability.

9.1.1 Synopsis
The fundamental methodology in stream processing is to keep a synopsis: a succinct summary of the stream that has arrived so far, from which queries are answered. The synopsis is updated in O(1) time after examining each item and occupies space of the order of poly-log bits. Various kinds of synopses can be maintained, such as a sliding window, a random sample, histograms, wavelets and sketches. Figure 25 outlines the stream processing model.
Figure 25: Data Stream Model of Computation (after Motwani, PODS 2002). The stream a₁, a₂, ..., a_m with aᵢ ∈ [n] is examined once by the stream processing engine, which maintains an O(log n)-bit synopsis; a query processing engine answers application queries Q from the synopsis.

Using a small synopsis, we can compute quite a few functions of the stream S = a₁, a₂, ..., a_m with each aᵢ ∈ [n]. The following examples are included as motivation and to make the concepts of stream computation and synopsis clear.

Figure 26: The random subset is a representative sample of the stream (stream elements arrive in arbitrary order)

• Length of S (m): computed by storing a running counter. The synopsis is one integer.
• Sum of S: computed by storing a running sum, i.e. an integer variable initialized to 0 to which each arriving stream item aᵢ is added. At the end of the stream, or at query time, the variable contains the sum of all elements of S.
• Mean of S: computed from the sum and length of S (computed above). The synopsis in this case is 2 integers.
• Variance of S: computed from a synopsis of 3 integers holding the sum of the elements of S, the sum of the squares of the elements of S, and the length of S, using

Var(X) = E(X²) − (E(X))²

Recall that computing the variance directly from the definition Var(X) = E((X − µ)²) will not work in the streaming model, since µ is unknown until the stream ends.
All these use O(log n) bits of memory and constant time per element. This is just to emphasize that, though the streaming model is restrictive, we can still achieve quite a lot.

9.2 Random Sampling


A general and powerful technique to tackle massive data streams is sampling. A random sample is the most versatile general-purpose synopsis, with deep statistical foundations. Recall from your undergraduate statistics courses the relevant concepts of population, sample, confidence interval, sample size, bias, weighted sampling, etc. We keep a "representative" subset of the stream as a synopsis and compute answers to the given query based on the sample (with appropriate scaling, etc.). Probabilistic guarantees on the approximation quality of the query answer are derived using the fact that the sample is a (random) representative sample.
There are several techniques for maintaining a random sample of a data stream. We first discuss the case when the entire data is given as an array or list (assuming the data is one-dimensional), and then see how one or more elements can be sampled from a stream.
PROBLEM 6. Sample a random element from an array A of length n. Generally, we require a uniform sample, i.e. each element is chosen with equal probability (pick A[i] with probability 1/n).
This problem can be solved as follows. Generate a random number r ∈ [0, n]. Many programming languages provide a pseudorandom number generator function; most of them generate real numbers in [0, 1], so to get a number in [0, n] we can take r ← rand() × n. The number r is then rounded up, and we return A[⌈r⌉].

Figure 27: Sample an element from A (length 12)

9.2.1 Weighted Sampling


PROBLEM 7. Given an array A of n elements, where element A[i] has an associated weight wᵢ, sample a random element (by weight) from A, i.e. choose A[i] with probability wᵢ/W, where W is the total (sum of) weights of the elements of A.
Let Wᵢ = ∑_{j=1}^{i} wⱼ (this implies W = Wₙ). We generate a random number r in [0, W]; as discussed above, this can be done for instance as r ← rand() × Wₙ. Then we return the element A[i] such that Wᵢ₋₁ ≤ r < Wᵢ. This returns A[i] with probability wᵢ/W, because the probability that Wᵢ₋₁ ≤ r < Wᵢ is (Wᵢ − Wᵢ₋₁)/W = wᵢ/W.

Figure 28: Sample a random element (by weight) from an array A of length 12 using the prefix sums Wᵢ = ∑_{j=1}^{i} wⱼ
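A sketch of this scheme in Python is below (our own illustration; we locate the index with Wᵢ₋₁ ≤ r < Wᵢ by binary search over the prefix sums instead of a linear scan, an implementation choice that costs O(log n) per sample after O(n) preprocessing):

```python
import bisect
import random

def weighted_sample(items, weights):
    """Return items[i] with probability weights[i] / sum(weights), using the
    prefix sums W_i and the rule W_{i-1} <= r < W_i."""
    prefix = []
    total = 0
    for w in weights:
        total += w
        prefix.append(total)                 # prefix[i] = W_{i+1}
    r = random.random() * total              # r uniform in [0, W)
    return items[bisect.bisect_right(prefix, r)]

random.seed(0)
counts = {"a": 0, "b": 0, "c": 0}
for _ in range(30_000):
    counts[weighted_sample(["a", "b", "c"], [1, 2, 7])] += 1
print(counts)   # roughly 10%, 20% and 70% of the draws
```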

9.2.2 Reservoir Sampling


PROBLEM 8. Given a stream S = a₁, a₂, ..., sample a random element from S, i.e. choose the element aᵢ with probability 1/m, where m is the length of the stream.
If m is known, we can use the algorithm for uniformly sampling an element from an array: pick a random index r between 1 and m and choose a_r when it arrives. Often, however, m is unknown. Think of it as follows: elements are streamed one at a time, and at any time we may be asked to return a random sample. So m is known only at query time, and since we cannot store the whole stream, we must always be ready with a random sample of the elements observed so far. The algorithm for this is called reservoir sampling.

Algorithm 10 : Reservoir Sampling (S)

R ← a₁                                ▷ R (reservoir) maintains the sample
for i ≥ 2 do
    Pick aᵢ with probability 1/i
    If aᵢ is picked, replace the element in R with aᵢ

Let us analyze the probability that a given element aᵢ is chosen. The probability that aᵢ is in the reservoir R (at query time m or the end of the stream) is

Pr[aᵢ ∈ R] = Pr[aᵢ was selected at time i] × Pr[aᵢ survived in R until time m]
  = (1/i) × ∏_{j=i+1}^{m} (1 − 1/j)
  = (1/i) × (i/(i+1)) × ((i+1)/(i+2)) × ... × ((m−2)/(m−1)) × ((m−1)/m) = 1/m
PROBLEM 9. Given a stream S = a₁, a₂, ..., sample k random elements from S, i.e. each element aᵢ should be in the sample with probability k/m. Again, m is the length of the stream.
Reservoir sampling is easily extended to solve this problem.

Algorithm 11 : Reservoir Sampling (S, k)

R ← a₁, a₂, ..., a_k                  ▷ R (reservoir) maintains the sample
for i ≥ k + 1 do
    Pick aᵢ with probability k/i
    If aᵢ is picked, replace a uniformly random element of R with aᵢ

The probability that aᵢ is in the reservoir R (at query time m or the end of the stream) is

Pr[aᵢ ∈ R] = Pr[aᵢ was selected at time i] × Pr[aᵢ survived in R until time m]
  = (k/i) × ∏_{j=i+1}^{m} (1 − (k/j) · (1/k))
  = (k/i) × (i/(i+1)) × ((i+1)/(i+2)) × ... × ((m−2)/(m−1)) × ((m−1)/m) = k/m
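Algorithm 11 translates almost line by line into Python (names are our own). The empirical check below confirms that each stream element lands in the reservoir with probability k/m:

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform k-subset of the stream: item i (1-indexed) is
    picked with probability k/i and evicts a uniformly random slot of R."""
    R = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            R.append(item)                   # the first k items fill R
        elif random.random() < k / i:
            R[random.randrange(k)] = item    # evict a random reservoir slot
    return R

# Each element should be in the reservoir with probability k/m = 3/10.
random.seed(0)
hits = [0] * 10
for _ in range(20_000):
    for x in reservoir_sample(range(10), 3):
        hits[x] += 1
print([h / 20_000 for h in hits])            # each entry close to 0.3
```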
A stream S : a₁, a₂, a₃, a₄, ..., a_m with aᵢ ∈ [n] defines the vector F : f₁, f₂, ..., fₙ, where fⱼ = |{i : aᵢ = j}| is the frequency of j in S. For example,

S : 2, 5, 6, 7, 8, 2, 1, 2, 7, 5, 5, 4, 2, 8, 8, 9, 5, 6, 4, 4, 2, 5, 5
F : 1, 5, 0, 3, 6, 2, 2, 3, 1   (indexed by j = 1, ..., 9)

Figure 29: Example of a linear sketch

A random sample is a general-purpose synopsis and can be used to answer any query about the whole stream. However, in sampling we only process the sampled elements and take no advantage of observing the whole stream. Sketches, histograms and wavelets exploit the fact that the processor sees the whole stream (though it cannot remember all of it). We discuss linear sketches in some detail and give some examples of their use. Sketches are generally specific to a particular purpose (meaning sketches are designed to answer specific queries).

9.3 Linear Sketch and Frequency Moments


A linear sketch interprets the stream as defining a frequency vector, as shown in Figure 29. Often we are interested in functions of the frequency vector of a stream. Given S = a₁, a₂, a₃, ..., a_m with aᵢ ∈ [n], let fᵢ be the frequency of i in S and F = (f₁, f₂, ..., fₙ) be the frequency vector. Some commonly used functions of the frequency vector are:
• F₀ := ∑_{i=1}^{n} fᵢ⁰    ▷ number of distinct elements
• F₁ := ∑_{i=1}^{n} fᵢ     ▷ length of the stream, m
• F₂ := ∑_{i=1}^{n} fᵢ²    ▷ second frequency moment

Having established the importance of the frequency vector of a data stream, we now attempt to estimate it. Recall that storing the exact frequency of every item is not possible, as the stream may contain a huge number of distinct items, whereas we can only use a very small, limited amount of memory.

9.4 Count-Min Sketch


PROBLEM 10. Given a data stream S = a₁, a₂, a₃, ..., a_m, where each aᵢ ∈ [n], which defines the frequency vector F = (f₁, f₂, ..., fₙ) with fⱼ = |{i : aᵢ = j}|, estimate the frequencies fⱼ of all elements.
A randomized solution to this problem is the Count-Min sketch, a simple sketch that has found many applications; it was introduced by Cormode & Muthukrishnan in 2005. The algorithm takes an error bound ϵ and an error probability bound δ and provides an additive approximation guarantee. We begin with a simpler version.
Let h : [n] → [k] be a function chosen uniformly from a 2-universal family of hash functions, i.e. a family H with the property that for x ≠ y,

Pr_{h ∈_R H}[h(x) = h(y)] ≤ 1/k

We keep an array Count[1 . . . k] of k integers and consider the following simple algorithm.

Algorithm : Count-Min Sketch (k, ϵ, δ)

count ← zeros(k)                      ▷ sketch consists of k integers
Pick a random h : [n] → [k] from a 2-universal family H
On input aᵢ :
    count[h(aᵢ)] ← count[h(aᵢ)] + 1   ▷ increment the count at index h(aᵢ)
On query j :                          ▷ query: F[j] = ?
    return count[h(j)]

Figure 30: Count-Min sketch of the stream S : 2, 5, 6, 7, 8, 2, 1, 2, 7, 5, 5, 4, 2, 8, 8, 9, 5, 6, 4, 4, 2, 5, 5 with k = 3 and a mapping h : {1, 2, ..., 9} → {1, 2, 3}. Each counter stores the sum of the frequencies of the elements hashing to it, while the true frequencies are F = (1, 5, 0, 3, 6, 2, 2, 3, 1).

Note that when k < n (which is typically the case; in fact we require k ∈ o(n), i.e.
k ≪ n), the algorithm's answer is an upper bound on the actual frequency: since the
algorithm returns count[h(j)], every other element i with h(i) = h(j) also contributes
to the returned value count[h(j)].
Figure 31: The frequency estimated by the algorithm is an upper bound on the actual
frequency: each counter accumulates the frequencies fj = |{ai ∈ S : ai = j}| of all
elements j that hash to it.

Let f˜j = count[h(j)] be the estimate provided by the algorithm for query j. From the
above reasoning we get that
f˜j ≥ fj .

For j ∈ [n], we estimate the excess (error) Xj in f˜j ; clearly Xj = f˜j − fj . Let
1_{h(i)=h(j)} be the indicator random variable for the event h(i) = h(j):

1_{h(i)=h(j)} = 1 if h(i) = h(j), and 0 otherwise.

Note that an element i ̸= j contributes to f˜j iff h(i) = h(j) (i.e. 1_{h(i)=h(j)} = 1),
and when it does contribute, its contribution is exactly fi . We therefore get that

Xj = Σ_{i ∈ [n]\{j}} fi · 1_{h(i)=h(j)} .

By the goodness of h (h is drawn from a 2-universal family), we have that for all i ̸= j,
P[h(i) = h(j)] = 1/k. This gives us

E[1_{h(i)=h(j)}] = P[h(i) = h(j)] = 1/k.

We now compute the expectation of Xj .

E[Xj] = E[ Σ_{i∈[n]\{j}} fi · 1_{h(i)=h(j)} ]
      = Σ_{i∈[n]\{j}} fi · E[1_{h(i)=h(j)}]   (linearity of expectation)
      = Σ_{i∈[n]\{j}} fi · (1/k)
      ≤ ∥F ∥1 / k.

Since all frequencies are non-negative, Σ_i fi is exactly the L1 norm of the frequency
vector; that is why we bound the final sum by ∥F ∥1 .
Figure 32: Comparison of good vs. bad case (how much frequency mass from other
elements hashes into the queried bucket of the sketch).

Recall Markov's inequality: if Z is a non-negative random variable, then

P[Z ≥ t · E[Z]] ≤ 1/t.

Substitute k = 2/ϵ (the length k of the hash table is in our control), so that
E[Xj] ≤ ∥F ∥1 /k = (ϵ/2)∥F ∥1 . Using Markov's inequality we get

P[Xj ≥ ϵ∥F ∥1 ] ≤ P[Xj ≥ 2 E[Xj]] ≤ 1/2.

Summarizing the analysis of this algorithm we get:

- f˜j ≥ fj
- f˜j ≤ fj + ϵ∥F ∥1 with probability at least 1/2

Hence this algorithm is an (ϵ∥F ∥1 , 1/2)-additive approximation algorithm. The space
required by the algorithm is k integers (plus a little more for processing, etc.), and
k = 2/ϵ is a constant independent of m and n.
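The 1/2 bound from Markov's inequality can be checked empirically. The following experiment is our own illustration (not part of the notes): it rebuilds the one-row sketch many times with fresh random multiply-mod-prime hashes, k = 2/ϵ buckets, and counts how often the excess Xj on a fixed query exceeds ϵ∥F ∥1:

```python
import random
from collections import Counter

def excess(S, k, j):
    """Build a one-row sketch with a fresh random hash; return Xj = f~j - fj."""
    p = 2**31 - 1                        # prime for the multiply-mod hash
    a, b = random.randrange(1, p), random.randrange(p)
    h = lambda x: ((a * x + b) % p) % k
    count = [0] * k
    for x in S:
        count[h(x)] += 1
    return count[h(j)] - Counter(S)[j]   # contribution of colliding elements

random.seed(0)
S = [2, 5, 6, 7, 8, 2, 1, 2, 7, 5, 5, 4, 2, 8, 8, 9, 5, 6, 4, 4, 2, 5, 5]
eps = 0.5
k = int(2 / eps)                         # k = 2/eps = 4 buckets
bound = eps * len(S)                     # eps * ||F||_1
failures = sum(excess(S, k, 5) >= bound for _ in range(1000))
print(failures / 1000)                   # empirically well below the 1/2 guarantee
```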
We can amplify the probability of success by selecting t independent hash functions
h1 , h2 , . . . , ht , each hi : [n] → [k] drawn from a 2-universal family of hash functions,
and proceeding as follows.

Figure 33: An example of the count-min sketch on the stream S above, using two
independent hash functions h1 , h2 : {1, 2, . . . , 9} → {1, 2, 3}; each row of the sketch is
updated with its own hash function.

Algorithm 13 : Count-Min Sketch (k, ϵ, δ)

count ← zeros(t × k) ▷ sketch consists of t rows of k integers
Pick t random functions h1 , . . . , ht : [n] → [k] from a 2-universal family
On input ai :
    for r = 1 to t do
        count[r][hr (ai )] ← count[r][hr (ai )] + 1 ▷ increment count[r] at index hr (ai )
On query j : ▷ query: F[j] = ?
    return min_{1 ≤ r ≤ t} count[r][hr (j)]
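A compact executable version of Algorithm 13 may look as follows (our own Python sketch; the multiply-mod-prime hashes are again one concrete choice of a (near-)2-universal family):

```python
import math
import random

class CountMinSketch:
    """Count-min sketch with t rows of k counters; a query returns the row minimum."""
    P = 2**31 - 1  # prime for the multiply-mod hash family

    def __init__(self, eps, delta):
        self.k = math.ceil(2 / eps)               # row width, k = 2/eps
        self.t = math.ceil(math.log2(1 / delta))  # number of rows, t = log(1/delta)
        self.count = [[0] * self.k for _ in range(self.t)]
        self.hashes = [(random.randrange(1, self.P), random.randrange(self.P))
                       for _ in range(self.t)]

    def _h(self, r, x):
        a, b = self.hashes[r]
        return ((a * x + b) % self.P) % self.k

    def update(self, ai):                         # on input ai
        for r in range(self.t):
            self.count[r][self._h(r, ai)] += 1

    def query(self, j):                           # f~j = min_r count[r][h_r(j)]
        return min(self.count[r][self._h(r, j)] for r in range(self.t))
```

For instance, eps = 0.1 and delta = 0.01 yield k = 20 counters per row and t = 7 rows; every per-row estimate overestimates fj, so their minimum is the best available estimate.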

So we keep t estimates instead of one, and since every estimate is an upper bound on
fj , it is clear that we should return the minimum of all the estimates (the one with the
least contribution from other elements).

Figure 34: Amplifying the success probability of the count-min sketch using t
independent hash functions: on input a, the counter count[r][hr (a)] is incremented in
every row r; on query a, the sketch returns min_r count[r][hr (a)].

Define Xjr to be the contribution of other elements to count[r][hr (j)]. From the
single-row analysis we know that if the length of each hash table is k = 2/ϵ, then for
each row r,

P[Xjr ≥ ϵ∥F ∥1 ] ≤ 1/2.

Now if f˜j ≥ fj + ϵ∥F ∥1 , then for all 1 ≤ r ≤ t we must have Xjr ≥ ϵ∥F ∥1 , and the
probability of this event (since the hr are independent) is bounded as

P[ ∀ r : Xjr ≥ ϵ∥F ∥1 ] ≤ (1/2)^t.

Substitute t = log2 (1/δ) (the number of hash functions is in our control) and we get

P[ ∀ r : Xjr ≥ ϵ∥F ∥1 ] ≤ (1/2)^{log2 (1/δ)} = δ.
Summarizing the analysis of Algorithm 13 we get:

- f˜j ≥ fj
- f˜j ≤ fj + ϵ∥F ∥1 with probability at least 1 − δ

Hence Algorithm 13 is an (ϵ∥F ∥1 , δ)-additive approximation algorithm. The space
required by the algorithm is k · t integers (plus a little more for processing, etc.), where
k = 2/ϵ and t = log2 (1/δ).
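To get a feel for the space bound, the parameter values below are our own illustrative choices: with, say, ϵ = 0.01 and δ = 2^-20, the sketch needs only k = 200 counters per row and t = 20 rows, i.e. 4000 integers in total, independent of the stream length m and the universe size n:

```python
import math

def cms_size(eps, delta):
    """Number of counters used by a count-min sketch with parameters eps, delta."""
    k = math.ceil(2 / eps)               # row width, k = 2/eps
    t = math.ceil(math.log2(1 / delta))  # number of rows, t = log(1/delta)
    return k, t, k * t

print(cms_size(0.01, 2**-20))  # (200, 20, 4000)
```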

