CSC 263
Zhang.
Contents
5 Randomized Algorithms 57
Randomized quicksort 58
Universal Hashing 60
6 Graphs 63
Fundamental graph definitions 63
Implementing graphs 65
Graph traversals: breadth-first search 67
Graph traversals: depth-first search 72
Applications of graph traversals 74
Weighted graphs 79
Minimum spanning trees 80
7 Amortized Analysis 87
Dynamic arrays 87
Amortized analysis 89
8 Disjoint Sets 95
Initial attempts 95
Heuristics for tree-based disjoint sets 98
Combining the heuristics: a tale of incremental analyses 105
1 Introduction and analysing running time
Before we begin our study of different data structures and their applica-
tions, we need to discuss how we will approach this study. In general, we
will follow a standard approach:
1. Motivate a new abstract data type or data structure with some examples
and reflection of previous knowledge.
2. Introduce a data structure, discussing both its mechanisms for how it
stores data and how it implements operations on this data.
3. Justify why the operations are correct.
4. Analyse the running time performance of these operations.
Given that we all have experience with primitive data structures such
as arrays (or Python lists), one might wonder why we need to study data
structures at all: can’t we just do everything with arrays and pointers,
already available in many languages?
Indeed, it is true that any data we want to store and any operation we
want to perform can be done in a programming language using primi-
tive constructs such as these. The reason we study data structures at all,
spending time inventing and refining more complex ones, is largely because
of the performance improvements we can hope to gain over their more
primitive counterparts. (This is not to say arrays and pointers play no role.
To the contrary, the study of data structures can be viewed as the study of
how to organize and synthesize basic programming language components
in new and sophisticated ways.)

Given the importance of this performance analysis, it is worth reviewing
what you already know about how to analyse the running time of algorithms,
and pointing out some subtleties and common misconceptions you may
have missed along the way.
• The target of the analysis is always a relationship between the size of the
input and number of basic operations performed.
What we mean by “size of the input” depends on the context, and we’ll
always be very careful when defining input size throughout this course.
What we mean by “basic operation” is anything whose running time
does not depend on the size of the algorithm’s input. (This is deliberately
a very liberal definition of “basic operation.” We don’t want you to get
hung up on step counting, because that’s completely hidden by Big-Oh
expressions.)

• The result of the analysis is a qualitative rate of growth, not an exact
number or even an exact function. We will not say “by our asymptotic
analysis, we find that this algorithm runs in 5 steps” or even “...in 10n +
3 steps.” Rather, expect to read (and write) statements like “we find that
this algorithm runs in O(n) time.”
Worst-case analysis
how long the list is, but whether and where it has any 263 entries.
Asymptotic notation alone cannot help solve this problem: it helps us
clarify how we are counting, but we have here a problem of what exactly
we are counting.
This problem is why asymptotic analysis is typically specialized to
worst-case analysis. Whereas asymptotic analysis studies the relationship
between input size and running time, worst-case analysis studies only the
relationship between input size and maximum possible running time. In other
words, rather than answering the question “what is the running time of
this algorithm for an input size n?” we instead aim to answer the question
“what is the maximum possible running time of this algorithm for an input
size n?” The first question’s answer might be a whole range of values; the
second question’s answer can only be a single number, and that’s how we
get a function involving n.
Some notation: we typically use the name T (n) to represent the maxi-
mum possible running time as a function of n, the input size. The result
of our analysis could be something like T(n) = Θ(n), meaning that the
worst-case running time of our algorithm grows linearly with the input
size.
But we still haven’t answered the question: where do O and Ω come in?
The answer is basically the same as before: even with this more restricted
notion of worst-case running time, it is not always possible to calculate an
exact expression for this function. What is usually easy to do, however, is
calculate an upper bound on the maximum number of operations. One such
example is the likely familiar line of reasoning, “the loop will run at most
n times” when searching for 263 in a list of length n. Such analysis, which
gives a pessimistic outlook on the most number of operations that could
theoretically happen, results in an exact count – e.g., n + 1 – which is an
upper bound on the maximum number of operations. From this analysis,
we can conclude that T (n), the worst-case running time, is O(n).
What we can’t conclude is that T (n) = Ω(n). There is a subtle impli-
cation of the English phrase “at most.” When we say “you can have at
most 10 chocolates,” it is generally understood that you can indeed have
exactly 10 chocolates; whatever number is associated with “at most” is
achievable.
In our analysis, however, we have no way of knowing that the upper
bound we obtain by being pessimistic in our operation counting is ac-
tually achievable. This is made more obvious if we explicitly mention
the fact that we’re studying the maximum possible number of operations:
(i) Give a pessimistic upper bound on the number of basic operations that
could occur for any input of a fixed size n. Obtain the corresponding
Big-Oh expression (i.e., T(n) = O(f)).

(ii) Give a family of inputs (one for each input size), and give a lower bound
on the number of basic operations that occurs for this particular family of
inputs. Obtain the corresponding Omega expression (i.e., T(n) = Ω(f)).

(Observe that (i) is proving something about all possible inputs, while (ii)
is proving something about just one family of inputs.)
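For example, here is how this two-step recipe plays out for the 263-searching
example above, counting list accesses as the basic operation. For (i): the loop
inspects each element at most once, so at most n + 1 operations occur on any
list of length n, and hence T(n) = O(n). For (ii): take the family of lists of
length n that contain no 263 at all; on these inputs the search must inspect
all n elements before it can stop, so T(n) ≥ n, and hence T(n) = Ω(n).
Together, these give T(n) = Θ(n).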
Average-case analysis
So far in your career as computer scientists, you have been primarily con-
cerned with the worst-case algorithm analysis. However, in practice this
type of analysis often ends up being misleading, with a variety of al-
gorithms and data structures having a poor worst-case performance still
performing well on the majority of possible inputs.
Some reflection makes this not too surprising; focusing on the maxi-
mum of a set of numbers (like running times) says very little about the
“typical” number in that set, or, more precisely, the distribution of num-
bers within that set. In this section, we will learn a powerful new tech-
nique that enables us to analyse some notion of “typical” running time for
an algorithm.
Warmup
1  def evens_are_bad(lst):
2      if all(x % 2 == 0 for x in lst):   # is every number in lst even?
3          for _ in range(len(lst)):      # repeat lst.length times
4              print(sum(lst))            # calculate and print the sum of lst
5          return 1
6      else:
7          return 0
Let n represent the length of the input list lst. Suppose that lst contains
only even numbers. Then the initial check on line 2 takes Ω(n) time,
while the computation in the if branch takes Ω(n^2) time. This means that
the worst-case running time of this algorithm is Ω(n^2). (We leave it as an
exercise to justify why the if branch takes Ω(n^2) time.) It is not too hard
to prove the matching upper bound, and so the worst-case running time
is Θ(n^2).

However, the loop only executes when every number in lst is even;
when just one number is odd, the running time is O(n), the maximum
possible running time of executing the all-even check. (Because executing
the check might abort quickly if it finds an odd number early in the list,
we used the pessimistic upper bound of O(n) rather than Θ(n).) Intuitively,
it seems much more likely that not every number in lst is even, so we
expect the more “typical case” for this algorithm is to have a running time
bounded above as O(n), and only very rarely to have a running time of Θ(n^2).
Our goal now is to define precisely what we mean by the “typical case”
for an algorithm’s running time when considering a set of inputs. As
is often the case when dealing with the distribution of a quantity (like
running time) over a set of possible values, we will use our tools from
probability theory to help achieve this.
We define the average-case running time of an algorithm to be the
function Tavg (n) which takes a number n and returns the (weighted) aver-
age of the algorithm’s running time for all inputs of size n.
For now, let’s ignore the “weighted” part, and just think of Tavg (n) as
computing the average of a set of numbers. What can we say about the
average-case for the function evens_are_bad? First, we fix some input size
n. We want to compute the average of all running times over all input lists
of length n.
At this point you might be thinking, “well, each number is even with
probability one-half, so...” This is a nice thought, but a little premature –
the first step when doing an average-case analysis is to define the possible
set of inputs. For this example, we’ll start with a particularly simple set
of inputs: the lists whose elements are between 1 and 5, inclusive.
The reason for choosing such a restricted set is to simplify the calcu-
lations we need to perform when computing an average. Furthermore,
because the calculation requires precise numbers to work with, we will
need to be precise about what “basic operations” we’re counting. For this
example, we’ll count only the number of times a list element is accessed,
either to check whether it is even, or when computing the sum of the list.
So a “step” will be synonymous with “list access.” (It is actually fairly
realistic to focus solely on operations of a particular type in a runtime
analysis. We typically choose the operation that happens the most frequently,
as in this case, or the one which is the most expensive. Of course, the latter
requires that we are intimately knowledgeable about the low-level details
of our computing environment.)

The preceding paragraph is the work of setting up the context of our
analysis: what inputs we’re looking at, and how we’re measuring runtime.
The final step is what we had initially talked about: compute the average
running time over inputs of length n. This often requires some calculation,
so let’s get to it. To simplify our calculations even further, we’ll assume
that the all-evens check on line 2 always accesses all n elements. (You’ll
explore the “return early” variant in an exercise.) In the loop, there are n^2
steps (each number is accessed n times, once per iteration of the loop), so
an all-even list takes n^2 + n steps in total, while every other list takes just
the n steps of the check. Since exactly 2^n of the 5^n possible lists consist
only of even numbers, averaging gives:
    T_avg(n) = (2^n · (n^2 + n) + (5^n − 2^n) · n) / 5^n        (5^n inputs total)

             = (2/5)^n · n^2 + n

             = Θ(n)

(Remember that any exponential grows faster than any polynomial, so that
first term goes to 0 as n goes to infinity.)
This analysis tells us that the average-case running time of this algorithm
is Θ(n), as our intuition originally told us. Because we computed an exact
expression for the average number of steps, we could convert this directly
into a Theta expression. (As is the case with worst-case analysis, it won’t
always be so easy to compute an exact expression for the average, and in
those cases the usual upper and lower bounding must be done.)
The analysis we performed in the previous section was done through the
lens of counting the different kinds of inputs and then computing the av-
erage of their running times. However, there is a more powerful tech-
nique that you know about: treating the algorithm’s running time as a
random variable T, defined over the set of possible inputs of size n. We
can then redo the above analysis in this probabilistic context, performing
these steps:
1. Define the set of possible inputs and a probability distribution over this
set. In this case, our set is all lists of length n that contain only elements
in the range 1-5, and the probability distribution is the uniform distribution.
(Recall that the uniform distribution assigns equal probability to each
element in the set.)

2. Define how we are measuring runtime. (This is unchanged from the
previous analysis.)

3. Define the random variable T over this probability space to represent
the running time of the algorithm. (Recall that a random variable is a
function whose domain is the set of inputs.) In this case, we have the nice
definition

    T = n^2 + n,   if the input contains only even numbers
    T = n,         otherwise

and its expected value is

    E[T] = Σ_t t · Pr[T = t]
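To connect this back to the earlier calculation: each element of the list is
even with probability 2/5 (two of the five allowed values are even), so
Pr[the input contains only even numbers] = (2/5)^n, and the expectation
works out to the same answer as before:

    E[T] = (n^2 + n) · (2/5)^n + n · (1 − (2/5)^n)

         = (2/5)^n · n^2 + n

         = Θ(n)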
Exercise Break!
The previous example may have been a little underwhelming, since it was
“obvious” that the worst-case was quite rare, and so not surprising that
the average-case running time is asymptotically faster than the worst-case.
However, this is not a fact you should take for granted. Indeed, it is often
the case that algorithms are asymptotically no better “on average” than
they are in the worst-case. Sometimes, though, an algorithm can have a
significantly better average case than worst case, but it is not nearly as
obvious; we’ll finish off this chapter by studying one particularly well-known
example. (Keep in mind that because asymptotic notation hides the constants,
saying two functions are different asymptotically is much more significant
than saying that one function is “bigger” than another.)
Recall the quicksort algorithm, which takes a list and sorts it by choos-
ing one element to be the pivot (say, the first element), partitioning the
remaining elements into two parts, those less than the pivot, and those
greater than the pivot, recursively sorting each part, and then combining
the results.
1  def quicksort(array):
2      if array.length < 2:
3          return
4      else:
5          pivot = array[0]
6          smaller, bigger = partition(array[1:], pivot)
7          quicksort(smaller)
8          quicksort(bigger)
9          array = smaller + [pivot] + bigger  # array concatenation
10
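Since the partition helper is not shown above, here is a minimal runnable
Python sketch of the same algorithm; the helper name partition and the
choice to return a new sorted list (rather than rearrange the array in place)
are conveniences for this sketch, not necessarily the notes' implementation.

    def partition(rest, pivot):
        # Split the remaining elements around the pivot.
        smaller = [x for x in rest if x <= pivot]
        bigger = [x for x in rest if x > pivot]
        return smaller, bigger

    def quicksort(array):
        if len(array) < 2:
            return array
        pivot = array[0]
        smaller, bigger = partition(array[1:], pivot)
        return quicksort(smaller) + [pivot] + quicksort(bigger)

    print(quicksort([5, 2, 4, 1, 3]))  # [1, 2, 3, 4, 5]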
You have seen in previous courses that the choice of pivot is crucial, as
it determines the size of each of the two partitions. In the best case, the
pivot is the median, and the remaining elements are split into partitions of
roughly equal size, leading to a running time of Θ(n log n), where n is the
size of the list. However, if the pivot is always chosen to be the maximum
element, then the algorithm must recurse on a partition that is only one
element smaller than the original, leading to a running time of Θ(n2 ).
So given that there is a difference between the best- and worst-case
running times of quicksort, the next natural question to ask is: What is
the average-case running time? This is what we’ll answer in this section.
First, let n be the length of the list. Our set of inputs is all possible
permutations of the numbers 1 to n (inclusive). We’ll assume any of these
n! permutations are equally likely; in other words, we’ll use the uniform
distribution. (Our analysis will therefore assume the lists have no duplicates.)
Proposition 1.1. Let 1 ≤ i < j ≤ n. The probability that i and j are compared
when running quicksort on a random permutation of {1, . . . , n} is 2/( j − i + 1).
Proof. First, let us think about precisely when elements are compared with
each other in quicksort. Quicksort works by selecting a pivot element,
then comparing the pivot to every other item to create two partitions, and
then recursing on each partition separately. So we can make the following
observations:
    E[T] = Σ_{i=1}^{n} Σ_{j=i+1}^{n} Pr[i and j are compared]

         = Σ_{i=1}^{n} Σ_{j=i+1}^{n} 2 / (j − i + 1)

         = Σ_{i=1}^{n} Σ_{j'=1}^{n−i} 2 / (j' + 1)        (change of index j' = j − i)

         = 2 Σ_{i=1}^{n} Σ_{j'=1}^{n−i} 1 / (j' + 1)

The margin table lists, for each value of i, the values that j' ranges over:

    i        j' values
    1        1, 2, 3, . . . , n − 2, n − 1
    2        1, 2, 3, . . . , n − 2
    ...      ...
    n − 3    1, 2, 3
    n − 2    1, 2
    n − 1    1
    n        (none)

Now, note that the individual terms of the inner summation don’t depend
on i, only the bound does. The first term, when j' = 1, occurs when
1 ≤ i ≤ n − 1, or n − 1 times in total; the second (j' = 2) occurs when
i ≤ n − 2, and in general the j' = k term appears n − k times. So

    E[T] = 2 Σ_{j'=1}^{n−1} (n − j') / (j' + 1)

         = 2 Σ_{j'=1}^{n−1} ( (n + 1) / (j' + 1) − 1 )

         = 2(n + 1) Σ_{j'=1}^{n−1} 1 / (j' + 1)  −  2(n − 1)

We will use the fact from mathematics that the function Σ_{i=1}^{n} 1/(i + 1)
is Θ(log n), and so we get that E[T] = Θ(n log n). (Actually, we know
something stronger: we even know that the constant hidden in the Θ is
equal to 1.)
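As an optional sanity check of this derivation, the following short Python
experiment (not part of the notes) counts the pivot comparisons quicksort
makes on random permutations and compares the empirical average against
the exact expression 2(n + 1) Σ_{j'=1}^{n−1} 1/(j' + 1) − 2(n − 1) obtained above.

    import random

    def count_comparisons(array):
        # Number of element-vs-pivot comparisons made while sorting.
        if len(array) < 2:
            return 0
        pivot, rest = array[0], array[1:]
        smaller = [x for x in rest if x < pivot]
        bigger = [x for x in rest if x > pivot]
        return len(rest) + count_comparisons(smaller) + count_comparisons(bigger)

    def exact_expected(n):
        return 2 * (n + 1) * sum(1 / (j + 1) for j in range(1, n)) - 2 * (n - 1)

    n, trials = 50, 2000
    empirical = sum(count_comparisons(random.sample(range(1, n + 1), n))
                    for _ in range(trials)) / trials
    print(empirical, exact_expected(n))  # the two numbers should be close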
Exercise Break!
1.4 Review the insertion sort algorithm, which builds up a sorted list by
repeatedly inserting new elements into a sorted sublist (usually at the
front). We know that its worst-case running time is Θ(n2 ), but its best
case is Θ(n), even better than quicksort. So does it beat quicksort on
average?
Suppose we run insertion sort on a random permutation of the numbers
{1, . . . , n}, and consider counting the number of swaps as the running
time. Let T be the random variable representing the total number of
swaps.
In this chapter, we will study our first major data structure: the heap. As
this is our first extended analysis of a new data structure, it is important to
pay attention to the four components of this study outlined at the previous
chapter:
1. Motivate a new abstract data type or data structure with some examples
and reflection of previous knowledge.
2. Introduce a data structure, discussing both its mechanisms for how it
stores data and how it implements operations on this data.
3. Justify why the operations are correct.
4. Analyse the running time performance of these operations.
Definition 2.1 (abstract data type, data structure). An abstract data type
(ADT) is a theoretical model of an entity and the set of operations that
can be performed on that entity. (The key term is abstract: an ADT is a
definition that can be understood and communicated without any code at all.)

A data structure is a value in a program which can be used to store and
operate on data.
For example, contrast the difference between the List ADT and an array
data structure.
List ADT
• Length(L): Return the number of items in L.

• Get(L, i): Return the item stored at index i in L.

• Store(L, i, x): Store the item x at index i in L.

(There may be other operations you think are fundamental to lists; we’ve
given as bare-bones a definition as possible.)
This definition of the List ADT is clearly abstract: it specifies what the
possible operations are for this data type, but says nothing at all about
how the data is stored, or how the operations are performed.
It may be tempting to think of ADTs as the definition of interfaces or
abstract classes in a programming language – something that specifies a
collection of methods that must be implemented – but keep in mind that
it is not necessary to represent an ADT in code. A written description of
the ADT, such as the one we gave above, is perfectly acceptable.
On the other hand, a data structure is tied fundamentally to code. It
exists as an entity in a program; when we talk about data structures, we
talk about how we write the code to implement them. We are aware of
not just what these data structures do, but how they do them.
When we discuss arrays, for example, we can say that they implement
the List ADT; i.e., they support the operations defined in the List ADT.
However, we can say much more than this:
The main implementation-level detail that we’ll care about in this course
is the running time of an operation. This is not a quantity that can be spec-
ified in the definition of an ADT, but is certainly something we can study if
we know how an operation is implemented in a particular data structure.
The first abstract data type we will study is the Priority Queue, which is
similar in spirit to the stacks and queues that you have previously studied.
Like those data types, the priority queue supports adding and removing
an item from a collection. Unlike those data types, the order in which
items are removed does not depend on the order in which they are added,
but rather depends on a priority which is specified when each item is
added.
A classic example of priority queues in practice is a hospital waiting
room: more severe injuries and illnesses are generally treated before minor
ones, regardless of when the patients arrived at the hospital.
A basic implementation
8   def FindMax(PQ):
9       n = PQ.head
10      maxNode = None
11      while n is not None:
12          if maxNode is None or n.priority > maxNode.priority:
13              maxNode = n
14          n = n.next
15      return maxNode.item
16
17
18  def ExtractMax(PQ):
19      n = PQ.head
20      prev = None
21      maxNode = None
22      prevMaxNode = None
23      while n is not None:
24          if maxNode is None or n.priority > maxNode.priority:
25              maxNode, prevMaxNode = n, prev
26          prev, n = n, n.next
27
28      if prevMaxNode is None:
29          PQ.head = maxNode.next
30      else:
31          prevMaxNode.next = maxNode.next
32
33      return maxNode.item
Heaps
Recall that a binary tree is a tree in which every node has at most two
children, which we distinguish by calling the left and right. You probably
remember studying binary search trees, a particular application of a binary
tree that can support fast insertion, deletion, and search. (We’ll look at a
more advanced form of binary search trees in the next chapter.)
Unfortunately, this particular variant of binary trees does not exactly
suit our purposes, since the item with the highest priority will generally
Definition 2.2 (heap property). A tree satisfies the heap property if and
only if for each node in the tree, the value of that node is greater than or
equal to the value of all of its descendants.
Alternatively, for any pair of nodes a, b in the tree, if a is an ancestor of
b, then the value of a is greater than or equal to the value of b.
This property is actually less stringent than the BST property: given
a node which satisfies the heap property, we cannot conclude anything
about the relationships between its left and right subtrees. This means
that it is actually much easier to get a compact, balanced binary tree that
satisfies the heap property than one that satisfies the BST property. In fact,
we can get away with enforcing as strong a balancing as possible for our
data structure.
• All of its levels are full, except possibly the bottom one. (Every leaf must
be in one of the two bottommost levels.)

• All of the nodes in the bottom level are as far to the left as possible.

Complete trees are essentially the trees which are most “compact.” A
complete tree with n nodes has height ⌈log n⌉, which is the smallest possible
height of any binary tree on n nodes.
Definition 2.4 (heap). A heap is a binary tree that satisfies the heap property
and is complete.
We implement a heap in a program as an array, where the items in the
array correspond to the level order of the actual binary tree. (This is a nice
example of how we use primitive data structures to build more complex
ones. Indeed, a heap is nothing sophisticated from a technical standpoint;
it is merely an array whose contents are kept in a particular order.)
Now that we have defined the heap data structure, let us see how to use
it to implement the three operations of a priority queue.
FindMax becomes very simple to both implement and analyse, be-
cause the root of the heap is always the item with the maximum priority,
and in turn is always stored at the front of the array.
Remember that indexing starts at 1
rather than 0.
1  def FindMax(PQ):
2      return PQ[1]
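Because the heap is stored in level order, moving between a node and its
children or parent is just index arithmetic; the following helpers (a small
sketch using the 1-based indexing assumed above) are exactly what the
bubbling operations below rely on.

    def parent(i):
        return i // 2

    def left_child(i):
        return 2 * i

    def right_child(i):
        return 2 * i + 1

For example, the root lives at index 1, its children at indices 2 and 3, and
the children of index 5 at indices 10 and 11.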
But the last leaf priority is smaller than many other priorities, and so if
we leave the heap like this, the heap property will be violated. Our last
step is to repeatedly swap the moved value with one of its children until
the heap property is satisfied once more. On each swap, we choose the
larger of the two children to ensure that the heap property is preserved.
(In the margin diagrams, 17 is swapped with 45, since 45 is greater than 30;
after that, no more swaps occur, since 17 is greater than both 16 and 2.)

1  def ExtractMax(PQ):
2      temp = PQ[1]
3      PQ[1] = PQ[PQ.size]  # Replace the root with the last leaf
4      PQ.size = PQ.size - 1
5
6      # Bubble down
7      i = 1
8      while i < PQ.size:
9          curr_p = PQ[i].priority
10         left_p = PQ[2*i].priority
11         right_p = PQ[2*i + 1].priority
12
13         # the current node already has the highest priority: stop
14         if curr_p >= left_p and curr_p >= right_p:
15             break
16         # left child has higher priority
17         elif left_p >= right_p:
18             PQ[i], PQ[2*i] = PQ[2*i], PQ[i]
19             i = 2*i
20         # right child has higher priority
21         else:
22             PQ[i], PQ[2*i + 1] = PQ[2*i + 1], PQ[i]
23             i = 2*i + 1
24
25     return temp

(The repeated swapping is colloquially called the “bubble down” step,
referring to how the last leaf starts at the top of the heap and makes its way
back down in the loop.)
What is the running time of this algorithm? All individual lines of code
take constant time, meaning the runtime is determined by the number of
loop iterations.
At each iteration, either the loop stops immediately, or i increases by at
least a factor of 2. This means that the total number of iterations is at most
log n, where n is the number of items in the heap. The worst-case running
time of Remove is therefore O(log n).
The implementation of Insert is similar. We again use the fact that
the number of items in the heap completely determines its structure: in
this case, because a new item is being added, there will be a new leaf
immediately to the right of the current final leaf, and this corresponds to
the next open position in the array after the last item.

So our algorithm simply puts the new item there, and then performs
an inverse of the swapping from last time, comparing the new item with
its parent, and swapping if it has a larger priority: a “bubble up” instead
of a “bubble down.” The margin diagrams show the result of adding 35
to the given heap; 35 is swapped twice with its parent (20, then 30), but
does not get swapped with 45.

1  def Insert(PQ, x, priority):
2      PQ.size = PQ.size + 1
3      PQ[PQ.size].item = x
4      PQ[PQ.size].priority = priority
5
6      i = PQ.size
7      while i > 1:
8          curr_p = PQ[i].priority
9          parent_p = PQ[i // 2].priority
10
11         if curr_p <= parent_p:
12             break
13         else:
14             PQ[i], PQ[i // 2] = PQ[i // 2], PQ[i]
15             i = i // 2

Again, this loop runs at most log n iterations, where n is the number
of items in the heap. The worst-case running time of this algorithm is
therefore O(log n).

Runtime summary
Let us compare the worst-case running times for the three operations for
the two different implementations we discussed in this chapter. In this
table, n refers to the number of elements in the priority queue.

    Operation     Linked list    Heap
    Insert        Θ(1)           Θ(log n)
    FindMax       Θ(n)           Θ(1)
    ExtractMax    Θ(n)           Θ(log n)

This table nicely illustrates the tradeoffs generally found in data structure
design and implementation. We can see that heaps beat unsorted linked
lists in two of the three priority queue operations, but are asymptotically
slower than unsorted linked lists when adding a new element.
Now, this particular case is not much of a choice: that the slowest oper-
ation for heaps runs in Θ(log n) time in the worst case is substantially
better than the corresponding Θ(n) time for unsorted linked lists, and in
practice heaps are indeed widely used.
The reason for the speed of the heap operations is the two properties –
the heap property and completeness – that are enforced by the heap data
structures. These properties impose a structure on the data that allows
us to more quickly extract the desired information. The cost of these
properties is that they must be maintained whenever the data structure is
mutated. It is not enough to take a new item and add it to the end of the
heap array; it must be “bubbled up” to its correct position to maintain the
heap property, and this is what causes the Insert operation to take longer.
In our final section of this chapter, we will look at one interesting applica-
tion of heaps to a fundamental task in computer science: sorting. Given
a heap, we can extract a sorted list of the elements in the heap simply by
repeatedly calling Remove and adding the items to a list. (Of course, this
technically sorts by priority of the items. In general, given a list of values
to sort, we would treat these values as priorities for the purpose of priority
queue operations.)

However, to turn this into a true sorting algorithm, we need a way of
converting an input unsorted list into a heap. To do this, we interpret
the list as the level order of a complete binary tree, same as with heaps.
The difference is that this binary tree does not necessarily satisfy the heap
property, and it is our job to fix it.
We can do this by performing the “bubble down” operation on each
node in the tree, starting at the bottom node and working our way up.
1  def BuildHeap(items):
2      i = items.length
3      while i > 0:
4          BubbleDown(items, i)
5          i = i - 1
6
    Σ_{i=1}^{n} T(i, n) = Σ_{h=1}^{k} h · (# nodes at height h)
The final question is, how many nodes are at height h in the tree? To
make the analysis simpler, suppose the tree has height k and n = 2^k − 1
nodes; this causes the binary tree to have a full last level. Consider the
complete binary tree shown at the right (k = 4, with its 15 nodes labelled
1 through 15 in level order): there are 8 nodes at height 1 (the leaves), 4
nodes at height 2, 2 nodes at height 3, and 1 node at height 4 (the root). In
general, the number of nodes at height h when the tree has height k is
2^(k−h). Plugging this into the previous expression for the total number of
iterations yields:
    Σ_{i=1}^{n} T(i, n) = Σ_{h=1}^{k} h · (# nodes at height h)

                        = Σ_{h=1}^{k} h · 2^(k−h)

                        = 2^k Σ_{h=1}^{k} h / 2^h            (2^k doesn't depend on h)

                        = (n + 1) Σ_{h=1}^{k} h / 2^h        (n = 2^k − 1)

                        < (n + 1) Σ_{h=1}^{∞} h / 2^h

It turns out quite remarkably that Σ_{h=1}^{∞} h / 2^h = 2, and so the total
number of iterations is less than 2(n + 1).
Bringing this back to our original problem, this means that the total cost
of all the calls to BubbleDown is O(n), which leads to a total running
time of BuildHeap of O(n), i.e., linear in the size of the list. (Note that the
final running time depends only on the size of the list: there’s no “i” input
to BuildHeap, after all. So what we did was a more careful analysis of the
helper function BubbleDown, which did involve i, which we then used in
a summation over all possible values for i.)

The Heapsort algorithm

Now let us put our work together with the heapsort algorithm. Our first
step is to take the list and convert it into a heap. Then, we repeatedly
extract the maximum element from the heap, with a bit of a twist to keep
this sort in-place: rather than return it and build a new list, we simply
swap it with the current last leaf, making sure to decrement the heap size
so that the max is never touched again for the remainder of the algorithm.
1  def Heapsort(items):
2      BuildHeap(items)
3
4      # Repeatedly build up a sorted list from the back of the list, in place.
5      # sorted_index is the index immediately before the sorted part.
6      sorted_index = items.size
7      while sorted_index > 1:
8          items[sorted_index], items[1] = items[1], items[sorted_index]
9          sorted_index = sorted_index - 1
10         # Bubble the new root down within items[1..sorted_index], exactly
11         # as in ExtractMax, so the heap property is restored before the
12         # next iteration.
13         BubbleDown(items, 1)
Exercise Break!
1  def BuildHeap(items):
2      i = 1
3      while i < items.size:
4          BubbleDown(items, i)
5          i = i + 1
(a) Give a good upper bound on the running time of this algorithm.
(b) Is this algorithm also correct? If so, justify why it is correct. Other-
wise, give a counterexample: an input where this algorithm fails to
produce a true heap.
2.2 Analyse the running time of the loop in Heapsort. In particular, show
that its worst-case running time is Ω(n log n), where n is the number of
items in the heap.
3 Dictionaries, Round One: AVL Trees
In this chapter and the next, we will look at two data structures which
take very different approaches to implementing the same abstract data
type: the dictionary, which is a collection of key-value pairs that supports
the following operations:
Dictionary ADT
You probably have seen some basic uses of dictionaries in your prior
programming experience; Python dicts and Java Maps are realizations of
this ADT in these two languages. We use dictionaries as a simple and ef-
ficient tool in our applications for storing associative data with unique key
identifiers, such as mapping student IDs to a list of courses each student
is enrolled in. Dictionaries are also fundamental in the behind-the-scenes
implementation of programming languages themselves, from supporting
identifier lookup during programming compilation or execution, to im-
plementing dynamic dispatch for method lookup during runtime.
One might wonder why we devote two chapters to data structures im-
plementing dictionaries at all, given that we can implement this function-
ality using the various list data structures at our disposal. Of course, the
answer is efficiency: it is not obvious how to use either a linked list or
array to support all three of these operations in better than Θ(n) worst-case
time, or even on average. In this chapter and the next, we will examine
some new data structures which do better both in the worst case and on
average.
Recall the definition of a binary search tree (BST), which is a binary tree
that satisfies the binary search tree property: for every node, its key is ≥
every key in its left subtree, and ≤ every key in its right subtree. An
example of a binary search tree is shown in the figure on the right (its keys
are 40, 20, 55, 12, and 22).
All three of these algorithms are recursive; in each one the cost of
the non-recursive part is Θ(1) (simply some comparisons, attribute ac-
cess/modification), and so each has running time proportional to the
number of recursive calls made. Since each recursive call is made on a
tree of height one less than its parent call, in the worst-case the number of
recursive calls is h, the height of the BST. This means that an upper bound
on the worst-case running time of each of these algorithms is O(h). (We
leave it as an exercise to show that this bound is in fact tight in all three
cases.)

However, this bound of O(h) does not tell the full story, as we measure
the size of a dictionary by the number of key-value pairs it contains. A
binary tree of height h can have anywhere from h to 2^h − 1 nodes, and so
in the worst case, a tree of n nodes can have height n. This leads to
a worst-case running time of O(n) for all three of these algorithms (and
again, you can show this bound is tight).
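The naïve algorithms themselves are not reproduced here; as a point of
reference, a recursive Search might look like the following sketch (the node
attribute names key, value, left, and right are assumptions for illustration,
not necessarily the notes' exact interface).

    def bst_search(node, key):
        # Return the value stored with key, or None if key is absent.
        if node is None:
            return None
        if key == node.key:
            return node.value
        elif key < node.key:
            return bst_search(node.left, key)
        else:
            return bst_search(node.right, key)

Each recursive call moves one level down the tree, which is where the O(h)
bound above comes from.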
But given that the best case for the height of a tree of n nodes is log n,
it seems as though a tree of n nodes having height anywhere close to n
is quite extreme – perhaps we would be very “unlucky” to get such trees.
As you’ll show in the exercises, the deficiency is not in the BST property
itself, but how we implement insertion and deletion. The simple algo-
rithms we presented above make no effort to keep the height of the tree
small when adding or removing values, leaving it quite possible to end up
with a very linear-looking tree after repeatedly running these operations.
So the question is: can we implement BST insertion and deletion to not only
insert/remove a key, but also keep the tree’s height (relatively) small?
Exercise Break!
3.1 Prove that the naïve Search, Insert, and Delete algorithms given in the
previous section have worst-case running time Ω(h), where h is the height
of the tree.
3.2 Consider a BST containing the keys 1 through n, whose n nodes form a
single chain of height n.
(a) What is an order we could insert the keys so that the resulting tree
has height n? (Note: there’s more than one right answer.)
(b) Assume n = 2h − 1 for some h. Describe an order we could insert
the keys so that the resulting tree has height h.
(c) Given a random permutation of the keys 1 through n, what is the
probability that if the keys are inserted in this order, the resulting tree
has height n?
(d) (Harder) Assume n = 2h − 1 for some h. Given a random permuta-
tion of the keys 1 through n, what is the probability that if the keys
are inserted in this order, the resulting tree has height h?
AVL Trees
Well, of course we can improve on the naïve Insert and Delete – other-
wise we wouldn’t talk about binary trees in CS courses nearly as much as
we do. Let’s focus on insertion first for some intuition. The problem with
the insertion algorithm above is it always inserts a new key as a leaf of the
BST, without changing the position of any other nodes. (Note that for
inserting a new key, there is only one leaf position it could go into which
satisfies the BST property.) This renders the structure of the BST completely
at the mercy of the order in which items are inserted, as you investigated
in the previous set of exercises.
Suppose we took the following “just-in-time” approach. After each
insertion, we compute the size and height of the BST. If its height is too
large (e.g., > √n), then we do a complete restructuring of the tree to
reduce the height to ⌈log n⌉. This has the nice property that it enforces
some maximum limit on the height of the tree, with the downside that
rebalancing an entire tree does not seem so efficient.
You can think of this approach as attempting to maintain an invariant
on the data structure – the BST height is roughly log n – but only enforcing
this invariant when it is extremely violated. Sometimes, this does in fact
lead to efficient data structures, as we’ll study in Chapter 8. However, it
turns out that in this case, being stricter with an invariant – enforcing it
at every operation – leads to a faster implementation, and this is what we
will focus on for this chapter.
More concretely, we will modify the Insert and Delete algorithms
so that they always perform a check for a particular “balanced” invariant.
If this invariant is violated, they perform some minor local restructuring
of the tree to restore the invariant. Our goal is to make both the check
and restructuring as simple as possible, to not increase the asymptotic
worst-case running times of O(h).
The implementation details for such an approach turn solely on the
choice of invariant we want to preserve. This may sound strange: can’t
we just use the invariant “the height of the tree is ≤ ⌈log n⌉”? It turns
out that even though this invariant is optimal in terms of possible height,
it requires too much work to maintain every time we mutate the tree.
Instead, several weaker invariants have been studied and used in the
decades that BSTs have been studied, and corresponding names coined
for the different data structures. (Such data structures include red-black
trees and 2-3-4 trees.) In this course, we will look at one of the simpler
invariants, used in the data structure known as the AVL tree.
In a full binary tree (2^h − 1 nodes stored in a binary tree of height h),
every node has the property that the height of its left subtree is equal to
the height of its right subtree. Even when the binary tree is complete, the
heights of the left and right subtrees of any node differ by at most 1. Our
next definitions describe a slightly looser version of this property. (In the
margin figure, each node is labelled by its balance factor.)

Definition 3.2 (AVL invariant, AVL tree). A node satisfies the AVL invariant
if its balance factor is between -1 and 1. A binary tree is AVL-balanced
if all of its nodes satisfy the AVL invariant.

An AVL tree is a binary search tree which is AVL-balanced.
The balance factor of a node lends itself very well to our style of recur-
sive algorithms because it is a local property: it can be checked for a given
node just by looking at the subtree rooted at that node, without knowing
about the rest of the tree. Moreover, if we modify the implementation of
the binary tree node so that each node maintains its height as an attribute,
whether or not a node satisfies the AVL invariant can even be checked in
constant time!
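A minimal sketch of why this check is constant-time once heights are cached
(the attribute names, and the sign convention of right height minus left
height, are assumptions for this sketch): each node stores its height, and the
balance factor is just a difference of two stored values.

    def height(node):
        # Convention: an empty subtree has height 0.
        return 0 if node is None else node.height

    def balance_factor(node):
        return height(node.right) - height(node.left)

    def update_height(node):
        # Called on the way back up after an insertion or deletion.
        node.height = 1 + max(height(node.left), height(node.right))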
There are two important questions that come out of this invariant:
• How do we preserve this invariant when inserting and deleting nodes?

• How does this invariant affect the height of an AVL tree?

(These can be reframed as, “how much complexity does this invariant add?”
and “what does this invariant buy us?”)
For the second question, the intuition is that if each node’s subtrees
are almost the same height, then the whole tree is pretty close to being
complete, and so should have small height. We’ll make this more precise
a bit later in this chapter. But first, we turn our attention to the more
algorithmic challenge of modifying the naïve BST insertion and deletion
algorithms to preserve the AVL invariant.
Exercise Break!
3.5 Give an algorithm for taking an arbitrary BST of size n and modifying
it so that its height becomes ⌈log n⌉, and which runs in time O(n log n).
3.6 Investigate the balance factors for nodes in a complete binary tree. How
many nodes have a balance factor of 0? -1? 1?
Rotations
How do we restore the AVL invariant? Before we get into the nitty-
gritty details, we first make the following global observation: inserting or
deleting a node can only change the balance factors of its ancestors. This is be-
cause inserting/deleting a node can only change the height of the subtrees
which contain this node, and these subtrees are exactly the ones whose
roots are ancestors of the node. For simplicity, we’ll spend the remainder
of this section focused on insertion; AVL deletion can be performed in
almost exactly the same way.
Even better, the naïve algorithms already traverse exactly the nodes
which are ancestors of the modified node. So it is extremely straight-
forward to check and restore the AVL invariant for these nodes; we can
simply do so after the recursive Insert, Delete, ExtractMax, or Ex-
tractMin call. So we go down the tree to search for the correct spot to
insert the node, and then go back up the tree to restore the AVL invari-
ant. Our code looks like the following (only Insert is shown; Delete is
similar):
Let us spell out what fix_imbalance must do. It gets as input a BST
where the root’s balance factor is less than −1 or greater than 1. However,
because of the recursive calls to Insert, we can assume that all the non-root
nodes in D satisfy the AVL invariant, which is a big help: all we need to do
is fix the root. (Reminder here about the power of recursion: we can assume
that the recursive Insert calls worked properly on the subtree of D, and in
particular made sure that the subtree containing the new node is an AVL
tree.)

One other observation is a big help. We can assume that at the beginning
of every insertion, the tree is already an AVL tree – that it is balanced.
Since inserting a node can cause a subtree’s height to increase by at most
1, each node’s balance factor can change by at most 1 as well. Thus if the
root does not satisfy the AVL invariant after an insertion, its balance factor
can only be -2 or 2.
These observations together severely limit the “bad cases” for the root
that we are responsible for fixing. In fact, these restrictions make it quite
straightforward to define a small set of simple, constant-time procedures
to restructure the tree to restore the balance factor in these cases. These
procedures are called rotations.
Right rotation

These rotations are best explained through pictures. In the top margin
diagram, variables x and y are keys, while the triangles A, B, and C represent
arbitrary subtrees (that could consist of many nodes): y is the root, with left
child x and right subtree C, and x has left subtree A and right subtree B. We
assume that all the nodes except y satisfy the AVL invariant, and that the
balance factor of y is -2.

This means that its left subtree must have height 2 greater than its right
subtree, and so A.height = C.height + 1 or B.height = C.height + 1. We
will first consider the case A.height = C.height + 1.

In this case, we can perform a right rotation to make x the new root,
moving around the three subtrees and y as in the second diagram (x becomes
the root, with left subtree A and right child y, which keeps B and C as its
subtrees) to preserve the BST property. It is worth checking carefully that
this rotation does indeed restore the invariant.
Lemma 3.1 (Correctness of right rotation). Let x, y, A, B, and C be defined
as in the margin figure. Assume that this tree is a binary search tree, and that x
and every node in A, B, and C satisfy the balance factor invariant. Also assume
that A.height = C.height + 1, and y has a balance factor of -2.
Then applying a right rotation to the tree results in an AVL tree, and in par-
ticular, x and y satisfy the AVL invariant.
Proof. First, observe that whether or not a node satisfies the balance factor
invariant only depends on its descendants. Then since the right rotation
doesn’t change the internal structure of A, B, and C, all the nodes in these
subtrees still satisfy the AVL invariant after the rotation. So we only need
to show that both x and y satisfy the invariant. (This is the power of having
a local property like the balance factor invariant. Even though A, B, and C
move around, their contents don’t change.)
• Node y. The new balance factor of y is C.height − B.height. Since x orig-
inally satisfied the balance factor, we know that B.height ≥ A.height −
1 = C.height. Moreover, since y originally had a balance factor of
−2, we know that B.height ≤ C.height + 1. So B.height = C.height
or B.height = C.height + 1, and the balance factor is either -1 or 0.
• Node x. The new balance factor of x is A.height − (1 + max( B.height, C.height)).
Our assumption tells us that A.height = C.height + 1, and as we just
observed, either B.height = C.height or B.height = C.height + 1, so the
balance factor of x is also -1 or 0.
Left-right rotation
What about the case when B.height = C.height + 1 and A.height = C.height?
Well, before we get ahead of ourselves, let’s think about what would hap-
pen if we just applied the same right rotation as before.
The diagram looks exactly the same, and in fact the AVL invariant for y
still holds. The problem is now the relationship between A.height and
B.height: because A.height = C.height = B.height − 1, this rotation would
leave x with a balance factor of 2. Not good.

Since in this case B is “too tall,” we will break it down further and move
its subtrees separately: in the margin diagram, B is now drawn with its root
z and its two subtrees D and E. Keep in mind that we’re still assuming the
AVL invariant is satisfied for every node except the root y.
Then after performing a left rotation at x, then a right rotation at y, the result-
ing tree is an AVL tree. This combined operation is sometimes called a left-right
double rotation.
We leave it as an exercise to think about the cases where the right sub-
tree’s height is 2 more than the left subtree’s. The arguments are symmet-
ric; the two relevant rotations are “left” and “right-left” rotations.
14
15  def fix_imbalance(D):
16      # Check balance factor and perform rotations
17      if D.balance_factor == -2:
18          if D.left.left.height == D.right.height + 1:
19              right_rotate(D)
20          else:  # D.left.right.height == D.right.height + 1
21              left_rotate(D.left)
22              right_rotate(D)
23      elif D.balance_factor == 2:
24          # left as an exercise; symmetric to above case
25          ...
26
27
28  def right_rotate(D):
29      # Using some temporary variables to match up with the diagram
30      y = D.root
31      x = D.left.root
32      A = D.left.left
33      B = D.left.right
34      C = D.right
35
36      D.root = x
37      D.left = A
38      # Assume access to constructor AVLTree(root, left, right)
39      D.right = AVLTree(y, B, C)
40
41
42  def left_rotate(D):
43      # Left as an exercise
Theorem 3.3 (AVL Tree Correctness). The above AVL Tree insertion algorithm
is correct. That is, it results in a binary tree which satisfies the binary search tree
property and is balanced, with one extra node added, and with the height and
balance_factor attributes set properly.
Proof. Since insertion starts with the naïve algorithm, we know that it
correctly changes the contents of the tree. We need only to check that
fix_imbalance results in a balanced BST.
tree in our running time analysis of the modified Insert and Delete
algorithms.
First, a simple lemma which should look familiar, as it is the basis of
most recursive tree algorithm analyses.
Lemma 3.4. The worst-case running time of AVL tree insertion and deletion is
O(h), the same as for the naïve insertion and deletion algorithms. (Don’t forget
that we’ve only proved a O(h) upper bound in these notes – you’re responsible
for proving a lower bound yourself!)

Proof. We simply observe that the new implementation consists of the old
one, plus the new updates of the height and balance_factor attributes, and
at most two rotations. This adds a constant-time overhead per recursive
call, and as before, there are O(h) recursive calls made. So the total running
time is still O(h).
So far, the analysis is exactly the same as for naïve BSTs. Here is where
we diverge to success. As we observed at the beginning of this section,
BSTs can be extremely imbalanced, and have height equal to the number
of nodes in the tree. This is obviously not true for AVL trees, which
cannot have all their nodes on (say) the left subtree. But AVL trees are not
perfectly balanced, either, since each node’s two subtree heights can differ
by 1. So the question is, is restricting the balance factor for each node to
the range {−1, 0, 1} good enough to get a strong bound on the height of
the AVL tree? The answer is a satisfying yes.
Lemma 3.5 (AVL Tree Height). An AVL tree with n nodes has height at most
1.44 log n.
    N(h) = 0,                          if h = 0
    N(h) = 1,                          if h = 1
    N(h) = N(h − 1) + N(h − 2) + 1,    if h ≥ 2
This looks almost, but not quite, the same as the Fibonacci sequence.
A bit of trial-and-error or more sophisticated techniques for solving
recurrences (beyond the scope of this course) reveal that N(h) = f_{h+2} − 1,
where f_i is the i-th Fibonacci number. Since we do have a closed form
expression for the Fibonacci numbers in terms of the golden ratio
ϕ = (1 + √5)/2, and in particular f_i ≥ ϕ^(i−2) for all i ≥ 1, we get the
following bound on N(h):

    N(h) = f_{h+2} − 1 ≥ ϕ^h − 1

Combining this with the fact that N(h) ≤ n:

    ϕ^h − 1 ≤ n
    ϕ^h ≤ n + 1
    h ≤ log_ϕ(n + 1)
    h ≤ (1 / log ϕ) · log(n + 1)

The theorem follows from calculating 1 / log ϕ = 1.4404 . . . .
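A quick way to gain confidence in the identity N(h) = f_{h+2} − 1 used in
this proof is to compute both sides for small h (a throwaway Python check,
not part of the notes).

    def N(h):
        # The recurrence from the proof above.
        if h == 0:
            return 0
        if h == 1:
            return 1
        return N(h - 1) + N(h - 2) + 1

    def fib(n):
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a

    for h in range(12):
        assert N(h) == fib(h + 2) - 1   # e.g. N(4) = 7 = fib(6) - 1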
Putting the previous two lemmas together, we can conclude that AVL
insertion, deletion, and search all have logarithmic running time in the
worst-case.
Theorem 3.6 (AVL Tree Runtime). AVL tree insertion, deletion, and search
have worst-case running time Θ(log n), where n is the number of nodes in the
tree.
Exercise Break!
3.7 Show that you can use an AVL tree to implement a priority queue.
What is an upper bound on the worst-case running time of the three
priority queue operations in this case?
4 Dictionaries, Round Two: Hash Tables
Hash functions
Let U represent the set of all possible keys we would like to store. In
general, we do not restrict ourselves to just numeric keys, and so U might
be a set of strings, floating-point numbers, or other custom data types that
we define based on our application. Like direct addressing, we wish to
store these keys in an array, but here we’ll let the size of the array be a
separate variable m. We won’t assume anything about the relationship
between m and |U | right now, though you might already have some ideas
about that.
Definition 4.1 (hash function, hash table). A hash function is a function
h : U → {0, 1, . . . , m − 1}. You should think of this as a function that takes
a key and computes the array slot where the key is to be placed.
A hash table is a data structure containing an array of length m and a
hash function.
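As a small concrete instance of Definition 4.1 (the specific function here is
an assumption for illustration, not one prescribed by the notes): with m = 11,
we can hash arbitrary Python keys by reducing the built-in hash value modulo
m, and a hash table is then just an array of m slots together with this function.

    m = 11

    def h(key):
        # Maps any hashable key into {0, 1, ..., m - 1}.
        return hash(key) % m

    table = [None] * m
    print(h("csc263"), h(42))   # two slot indices in range(11)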
Of course, this is not the end of the story. Even though this imple-
mentation is certainly efficient – all three operations run in Θ(1) time – it
actually is not guaranteed to be correct.
Remember that we have not made any assumptions about the relation-
ship between m, the size of the hash table, and |U |, the total possible
number of keys; nor have we said anything about what h actually does.
Our current implementation makes the assumption that there are no col-
lisions in the hash function, i.e., each key gets mapped to a unique array
slot. Consider what happens when you try to use the above algorithm to
insert two key-value pairs whose hashed keys are the same index i: the
first value is stored at index i, but then the second value overwrites it.
Now, if m ≥ |U|, i.e., there are at least as many array slots as possible
keys, then there always exists a hash function h which has no collisions.
Such hash functions are called perfect hash functions, earning the
superlative name because they enable the above naive hash table
implementation to be correct. (Of course, even when we know that a perfect
hash function exists for a given U and m, it is sometimes non-trivial to find
and compute it. Consider the case when U is a set of 10 000 names – can
you efficiently implement a perfect hash function without storing a sorted
list of the names?) However, this assumption is often not realistic, as we
typically expect the number of possible keys to be extremely large, and
the array to not take up too much space. More concretely, if m < |U|, then
at least one collision is guaranteed to occur, causing the above
implementation to fail. The remainder of this chapter will be devoted to
discussing two strategies for handling collisions.
With closed addressing, each array element no longer stores a value
associated with a particular key, but rather a pointer to a linked list of
key-value pairs. (This approach is also called “chaining.”) Thus, collisions
are sidestepped completely by simply inserting both the key and value into
the linked list at the appropriate array index. (Why is it necessary to store
both the key and the value?) Search and deletion are just slightly more
complex, requiring a traversal of the linked list at that index.
This implementation maintains the invariant that for every key k stored
in the table, it must appear in the linked list at position h(k). From this
observation, we can see that these algorithms are correct. In addition, we
only need to consider the linked list at one particular array index, saving
lots of time.
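Here is a minimal Python sketch of closed addressing, using an ordinary
Python list in place of each linked list; the function names and the use of
(key, value) tuples are conveniences for the sketch.

    m = 11
    table = [[] for _ in range(m)]      # one chain per array slot

    def insert(key, value):
        # New pairs go at the front of the chain.
        table[hash(key) % m].insert(0, (key, value))

    def search(key):
        for k, v in table[hash(key) % m]:
            if k == key:
                return v
        return None

    insert("csc263", 90)
    insert("csc236", 85)
    print(search("csc263"))             # 90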
Or perhaps not. What if all of the keys get mapped to the same index?
In this case, the worst case of search and deletion reduces to the worst-
case search and deletion from a linked list, which is Θ(n), where n is the
number of pairs stored in the hash table. Can we simply pick a function
that guarantees that this extreme case won’t happen? Not if we allow
the set of keys to be much larger than the size of the array: for any hash
function h, there exists an index i such that at least |U |/m keys get mapped
to i by h. (This type of counting argument is extremely common in algorithm
analysis: if you have a set of numbers whose average is x, then one of the
numbers must have value ≥ x.)

So while the worst-case running time for Insert is Θ(1) (insertions happen
at the front of the linked list), the worst-case performance of Search and
Delete seems very bad: no better than simply using a linked list, and much
worse than the AVL trees we saw in the previous chapter. However, it turns
out that if we pick good hash functions, hash tables with closed addressing
are much better than both of these data structures on average.
number of keys inserted into h(k_i) after k_i is (n − i)/m. So then to compute
the average running time, we add the cost of computing the hash function
(just 1 step) plus the average of searching for each k_i:

    E[T] = 1 + (1/n) Σ_{i=1}^{n} (expected number of keys visited by the search for k_i)

         = 1 + (1/n) Σ_{i=1}^{n} (1 + (n − i)/m)

         = 2 + (1/n) Σ_{i=1}^{n} n/m  −  (1/n) Σ_{i=1}^{n} i/m

         = 2 + n/m − (1/n) · n(n + 1)/(2m)

         = 2 + n/m − (n + 1)/(2m)

         = 2 + n/(2m) − 1/(2m)

         = Θ(1 + n/m)
The average-case running time for Delete is the same, since once a
key has been found in an array, it is only a constant-time step to remove
that node from the linked list.
Because the ratio n/m between the number of keys stored and the number
of spots in the array comes up frequently in hashing analysis, we give it a
special name.

Definition 4.2 (load factor). The load factor of a hash table is the ratio of
the number of keys stored to the size of the table. We use the notation
α = n/m, where n is the number of keys, and m the number of slots.

So we can say that the average-case running times of Search and Delete
are Θ(1 + α).
Exercise Break!
4.1 Suppose that we use an AVL tree instead of a linked list to store the
key-value pairs that are hashed to the same index.
(a) What is the worst-case running time of a search when the hash table
has n key-value pairs?
(b) What is the average-case running time of an unsuccessful search when
the hash table has n key-value pairs?
Open addressing
The other strategy used to resolve collisions is to require each array ele-
ment to contain only one key, but to allow keys to be mapped to alternate
indices when their original spot is already occupied. This saves the space
overhead required for storing extra references in the linked lists of closed
addressing, but the cost is that the load factor α must always be less than 1,
i.e., the number of keys stored in the hash table cannot exceed the length
of the array.
In this type of hashing, we have a parameterized hash function h : U × ℕ → {0, . . . , m − 1} that takes two arguments, a key and a non-negative integer. The “first” hash value for a key k is h(k, 0), and if this spot is occupied, the next index chosen is h(k, 1), and then h(k, 2), etc.
Searching for an item requires examining not just one spot, but many
spots. Essentially, when searching for a key k, we visit the same sequence
of indices h(k, 0), h(k, 1), etc. until either we find the key, or reach a None
value.
Deletion seems straightforward – simply search for the given key, then
replace the item stored in the array with None, right? Not exactly. Suppose
we insert a key k into the hash table, but the spot h(k, 0) is occupied, and
so this key is stored at index h(k, 1) instead. Then suppose we delete
the pair that was stored at h(k, 0). Any subsequent search for k will start
at h(k, 0), find it empty, and so return None, not bothering to continue
checking h(k, 1).
So instead, after we delete an item, we replace it with a special value
Deleted, rather than simply None. This way, the Search algorithm will
not halt when it reaches an index that belonged to a deleted key. We leave it as an exercise to modify the
Insert algorithm so that it stores keys
in Deleted positions as well.
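To see how the Deleted marker interacts with the operations, here is a sketch of Search and Delete for open addressing; the names, the (key, value) slot format, and the parameterized hash function h(key, i) are assumptions for illustration.

DELETED = object()   # sentinel distinct from None (empty) and from real entries

def oa_search(table, h, key):
    m = len(table)
    for i in range(m):
        entry = table[h(key, i)]
        if entry is None:              # a truly empty slot: the key cannot appear later
            return None
        if entry is not DELETED and entry[0] == key:
            return entry[1]
    return None

def oa_delete(table, h, key):
    m = len(table)
    for i in range(m):
        idx = h(key, i)
        entry = table[idx]
        if entry is None:
            return
        if entry is not DELETED and entry[0] == key:
            table[idx] = DELETED       # leave a marker so later searches keep probing
            return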
Linear probing
The corresponding linear probing sequence for key k is hash(k, 0), hash(k, 1),
hash(k, 2), . . .
When d = 1, the probe sequence is simply hash(k), hash(k) + 1, . . . Suppose, for example, that three keys are inserted, each with the hash value 10. These will occupy indices 10, 11, and 12 in the
hash table. But now any new key with hash value 11 will require 2 spots
to be checked and rejected because they are full. The collision-handling
for 10 has now ensured that hash values 11 and 12 will also have colli-
sions, increasing the runtime for any operation which visits these indices.
Moreover, the effect is cumulative: a collision with any index in a cluster
causes an extra element to be added at the end of the cluster, growing it
in size by one. You’ll study the effects of clustering more precisely in an
exercise.
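For reference, the linear probing probe function with step size d (the form the omitted definition presumably takes) can be written as follows.

def linear_probe(hash_value, i, d, m):
    # The i-th index probed for a key whose base hash value is hash_value.
    return (hash_value + i * d) % m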
Quadratic probing
The main problem with linear probing is that the hash values in the mid-
dle of a cluster will follow the exact same search pattern as a hash value
at the beginning of the cluster. As such, more and more keys are absorbed
into this long search pattern as clusters grow. We can solve this prob-
lem using quadratic probing, which causes the offset between consecutive
indices in the probe sequence to increase as the probe sequence is visited.
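One common form of quadratic probing, shown here as a sketch (the notes' exact constants are not reproduced in this excerpt), makes the offset a quadratic function of the probe number i.

def quadratic_probe(hash_value, i, m, c1=1, c2=1):
    # The gap between consecutive probes grows with i, so a key that first collides
    # in the middle of a cluster does not retrace the rest of the cluster one index at a time.
    return (hash_value + c1 * i + c2 * i * i) % m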
Double hashing
Definition 4.5 (Double hashing). Given two hash functions hash1 , hash2 :
U → {0, . . . , m − 1} and number b ∈ {0, . . . , m − 1}, we can define the
parameterized hash function for double hashing h as follows:
While both linear and quadratic probing have at most m different probe sequences (one for each distinct starting index), this scheme has at most m² different probe sequences, one for each pair of values (hash1(k), hash2(k)).
Under such a scheme, it is far less likely for large clusters to form: not
only would keys have to have the same initial hash value (using hash1 ),
they would also need to have the same offset (using hash2 ).
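Since the formula in Definition 4.5 is omitted from this excerpt, here is the standard double-hashing probe function as a sketch; the role of the constant b in the notes' exact definition may differ.

def double_hash_probe(hash1_value, hash2_value, i, m):
    # hash1 fixes the starting index and hash2 fixes the step size, so two keys
    # follow the same probe sequence only if they agree on both hash values.
    return (hash1_value + i * hash2_value) % m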
To round out this chapter, we’ll briefly discuss some of the performance
characteristics of open addressing. First, we observe that there are m!
possible probe sequences (i.e., the order in which hash table indices are
visited). A typical simplifying assumption is that the probability of get-
ting any one of these sequences is 1/m!, which is not true for any of the
probe sequences studied here – even double hashing has only m² possible
sequences – but can be a useful approximation.
However, under this assumption, the average number of indices searched for all three dictionary operations is at most 1/(1 − α). So for example, if the table is 90% full (so α = 0.9), the average number of indices searched would be just 10, which isn't bad at all!
Exercise Break!
4.2 Suppose we use linear probing with d = 1, and our hash table has n
keys stored in the array indices 0 through n − 1.
(a) What is the maximum number of array indices checked when a key
is searched for in this hash table?
(b) What is the probability that this maximum number of array indices
occurs? Use the simple uniform hashing assumption.
(c) What is the average running time of Search for this array?
5 Randomized Algorithms
So far, when we have used probability theory to help analyse our algo-
rithms we have always done so in the context of average-case analysis: de-
fine a probability distribution over the set of possible inputs of a fixed
size, and then calculate the expected running time of the algorithm over
this set of inputs.
This type of analysis is a way of refining our understanding of how
quickly an algorithm runs, beyond simply the worst and best possible in-
puts: we look at the spread (or distribution) of running times across all dif-
ferent inputs. It allows us to say things like “Quicksort may have a worst-
case running time of Θ(n²), but on average its performance is Θ(n log n),
which is the same as mergesort.”
However, average-case analysis has one important limitation: the im-
portance of the asymptotic bound is directly correlated with the “plausi-
bility” of the probability distribution used to perform the analysis. We can
say, for example, that the average running time of quicksort is Θ(n log n)
when the input is a random, uniformly-chosen permutation of n items;
but if the input is instead randomly chosen among only the permutations
which have almost all items sorted, the “average” running time might be Θ(n²). (This assumes naive quicksort, which chooses the first element as the pivot.) We might assume that the keys inserted into a hash table are equally likely to be hashed to any of the m array spots, but if they're all hashed to the same spot, searching for keys not already in the hash table will always take Θ(n) time.
In practice, an algorithm can have an adversarial relationship with the
entity feeding it inputs. Not only is there no guarantee that the inputs will
have a distribution close to the one we used in our analysis; a malicious
user could conceivably hand-pick a series of inputs designed to realize
the worst-case running time. That the algorithm has good average-case
performance means nothing in the face of an onslaught of particularly
bad inputs.
In this chapter, we will discuss one particularly powerful algorithm
design technique which can (sometimes) be used to defend against this
problem. Rather than assume the randomness exists external to the algorithm (in its input), we will build randomness into the algorithm itself, letting it make random choices as it runs.
Randomized quicksort
Let us return to quicksort from our first chapter, which we know has poor
worst-case performance on pre-sorted lists. Because our implementation
always chooses the first list element as the pivot, a pre-sorted list always
results in maximally uneven partition sizes. We were able to “fix” this
problem in our average-case analysis with the insight that in a random
permutation, any of the n input numbers is equally likely to be chosen as
a pivot in any of the recursive calls to quicksort. (You'll recall our argument was a tad more subtle and relied on indicator variables, but this idea that the pivot was equally likely to be any number did play a central role.)
So if we allow our algorithm to make random choices, we can turn any input into a “random” input simply by preprocessing it, and then applying the regular quicksort function:
def randomized_quicksort(A):
    randomly permute A
    quicksort(A)
Huh, this is so simple it almost feels like cheating. The only difference
is the application of a random permutation to the input. The correctness
of this algorithm is clear: because permuting a list doesn’t change its
contents, the exact same items will be sorted, so the output will be the
same. (This isn't entirely true if the list contains duplicates, but there are ways of fixing this if we need to.) What is more interesting is talking about the running time of this algorithm.
One important point to note is that for randomized algorithms, the be-
haviour of the algorithm depends not only on its input, but also the random
choices it makes. Thus running a randomized algorithm multiple times on
the same input can lead to different behaviours on each run. In the case
of randomized quicksort, we know that the result – A is sorted – is always
the same. What can change is its running time; after all, it is possible to
get as input a “good” input for the original quicksort (sorts in Θ(n log n)
time), but then after applying the random permutation the list becomes
sorted, which is the worst input for quicksort.
In other words, randomly applying a permutation to the input list can
cause the running time to increase, decrease, or stay roughly the same.
For randomized quicksort, all lists of size n have the same asymptotic
expected running time, and so we conclude that randomized quicksort
has a worst-case expected running time of Θ(n log n).
You might hear the phrase “worst-case expected” and think this is a
bit vague: it’s not the “absolute” maximum running time, or even an
“average” maximum. So what is this really telling us? The right way to
think about worst-case expected running time is from the perspective of
a malicious user who is trying to feed our algorithm bad inputs. With
regular quicksort, it is possible to give the algorithm an input for which
the running time is guaranteed to be Θ(n²). However, this is not the case
for randomized quicksort. Here, no matter what input the user picks, if
the algorithm is run many times on this input, the average running time
will be Θ(n log n). Individual runs of randomized quicksort might take
longer, but in the long run our adversary can’t make our algorithm take
more than Θ(n log n) per execution.
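For concreteness, the “randomly permute A” step in the listing above can be implemented with the Fisher–Yates shuffle (in practice, random.shuffle does exactly this); a sketch:

import random

def randomly_permute(A):
    # Fisher-Yates: every permutation of A is produced with equal probability.
    for i in range(len(A) - 1, 0, -1):
        j = random.randint(0, i)      # uniform over 0..i inclusive
        A[i], A[j] = A[j], A[i]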
Universal Hashing
The key property of a universal hash family H is that for any two distinct keys k1 and k2,
Pr_{h∈H}[h(k1) = h(k2)] ≤ 1/m.
Our algorithm is now quite simple. When a new hash table is created,
it picks a hash function uniformly at random from a universal hash family,
and uses this to do its hashing. This type of hashing is called universal
hashing, and is a randomized implementation of hashing commonly used
in practice.
" #
n n
E ∑ Xi = ∑ E [ Xi ]
i =1 i =1
n
= ∑ Pr[h(ki ) = h(k)]
i =1
n
1
≤ ∑m (universal hash family)
i =1
n
=
m
Note that we have made no assumptions about the keys in the hash
table, nor about k itself: the only property we used was the universal hash
family property, which gave us the upper bound on the probability that
the two hash values h(k i ) and h(k ) were equal. Because this analysis is true
for all possible inputs, we conclude that the worst-case expected running
time of an unsuccessful search with universal hashing is O(1 + α), where Why didn’t we immediately conclude
Θ (1 + α )?
Note that these functions are very easy to implement; the first computation, a·k + b mod 2^w, is simply unsigned integer arithmetic, and the second operation is a bit shift, taking the M most significant bits.
While we won’t go into the proof here, it turns out that the set of func-
tions { h a,b }, for the defined ranges on a and b, is a universal hash family.
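A sketch of one such function, assuming w-bit unsigned arithmetic and table size m = 2^M; the parameters a and b are chosen at random when the table is created (see the omitted definition for their exact ranges).

def make_hash(a, b, w, M):
    mask = (1 << w) - 1                  # used to reduce mod 2^w
    def h(k):
        # (a*k + b) mod 2^w, then keep the M most significant bits of the result.
        return ((a * k + b) & mask) >> (w - M)
    return h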
Exercise Break!
5.1 Prove that the number of hash functions in a universal hash family must be ≥ 1/m.
5.2 Consider the hash function
Find a set of inputs of size 2^(w−M) which all have the same hash value. (This question underlines the fact that an individual hash function in a universal family still has a lot of collisions.)
6 Graphs
In this chapter, we will study the graph abstract data type, which is one
of the most versatile and fundamental in computer science. You have
already studied trees, which can be used to model hierarchical data like
members of a family or a classification of living things. Graphs are a
generalization of trees that can represent arbitrary (binary) relationships
between various objects. (There are generalizations of graphs that model relationships between more than two objects at a time, but we won't go into that here.) Some common examples of graphs in practice are modelling geographic and spatial relationships; activity between members of a social network; requirements and dependencies in a complex system
like the development of a space shuttle. By the end of this chapter, you will
be able to define the graph ADT and compare different implementations
of this data type; and perform and analyse algorithms which traverse
graphs. If you go on to take CSC373, you
will build on this knowledge to write
and analyse more complex graph
algorithms.
Fundamental graph definitions
Before moving on, let’s see some pictures to illustrate some simple
graphs. In each diagram, the vertices are the circles, and the edges are
the lines connecting pairs of circles. Vertices can be labelled or unlabelled,
depending on the graph context.
[Three example graphs: one with vertices a–e, one with vertices 1–3, and one with vertices A–D.]
Paths are fundamental to graphs because they allow us to take basic re-
lationships (edges) and derive more complex ones. Suppose, for example,
we are representing all the people of the world, with edges representing
“A and B personally know each other.” The famous example from social
networks is the idea of Six Degrees of Separation, which says that on this
graph, any two people have a path of length at most 6 between them.
Given an arbitrary graph, one might wonder what the maximum path
length is between any two vertices – or even whether there is a path at all!
Definition 6.4 (connected). We say that a graph is connected if for every
pair of vertices in the graph, there is a path between them. Otherwise, we
say the graph is unconnected.
Implementing graphs
Graph ADT
Adjacency lists
Adjacency matrices
Definition 6.5 (degree). The degree of a vertex is the number of its neigh-
bours. For a vertex v, we will usually denote its degree by dv .
Lemma 6.1 (handshake lemma). The sum of the degrees of all vertices is equal to twice the number of edges. Or,
∑_{v∈V} d_v = 2|E|.
(This lemma gets its name from the real-life example of meeting people at a party. Suppose a bunch of people at a party always shake hands when they meet someone new. After greetings are done, if you ask each person how many new people they met and add them up, the sum is equal to twice the number of handshakes that occurred.)
Proof. Each edge e = (u, v) is counted twice in the degree sum: once for u, and once for v.
So this means the total cost of storing the references in our adjacency list data structure is Θ(|E|), for a total space cost of Θ(|V| + |E|) (adding a |V| term for storing the labels). Note that this is the cost for storing all the vertices, not just a single vertex. Adding up the cost for the graph data structure and all the vertices gives a total space cost of Θ(|V| + |E|).
At first glance, it may seem that |V|² and |V| + |E| are incomparable. However, observe that the maximum number of possible edges is Θ(|V|²), which arises in the case when every vertex is adjacent to every other vertex. So in asymptotic terms, |V| + |E| = O(|V|²), but this bound is not
tight: if there are very few (e.g., O(|V |)) edges then adjacency lists will be
much more space efficient than the adjacency matrix.
We are now ready to do our first real algorithm on graphs. Consider the
central problem of exploring a graph by traversing its edges (rather than
relying on direct access of references or array indices). For example, we
might want to take a social network, and print out the names of everyone
who is connected to a particular person. We would like to be able to
start from a given vertex in a graph, and then visit its neighbours, the
neighbours of its neighbours, etc. until we have visited every vertex which
is connected to it. Moreover, we'd like to do so efficiently. (If we assume that the graph is connected, then such an exploration visits every vertex in the graph.)
For the next two sections, we will study two approaches for performing such an exploration in a principled way. The first approach is called
breadth-first search (BFS). The basic idea is to take a starting vertex s,
visit its neighbours, then visit the neighbours of its neighbours, then the
neighbours of its neighbours of its neighbours, proceeding until all of the
vertices have been explored. The trick is doing so without repeatedly vis-
iting the same vertex. To formalize this in an algorithm, we need a way to
keep track of which vertices have already been visited, and which vertex
should be visited next.
To do this, we use a queue as an auxiliary container of pending vertices to be visited. An enqueued attribute is used to keep track of which vertices have already been added to the queue. This algorithm enqueues the starting vertex, and then repeatedly dequeues a vertex and enqueues its neighbours which haven't yet been enqueued.
To illustrate this algorithm, suppose we run it on the graph to the right (vertices A, B, C, D, E), starting at vertex A. We'll also assume the neighbours are always accessed in alphabetical order. We mark in black when the vertices are visited (i.e., dequeued and passed to the Visit function).
5. The remaining vertices are removed from the queue in the order C then E.
So the order in which these vertices are visited is A, B, D, C, E.
Correctness of BFS
Now that we have the algorithm in place, we need to do the two standard
things: show that this algorithm is correct, and analyse its running time.
But what does it mean for breadth-first search to be correct? The in-
troductory remarks in this section give some idea, but we need to make
this a little more precise. Remember that we had two goals: to visit every
vertex in the graph, and to visit all the neighbours of the starting vertex,
then all of the neighbours of the neighbours, then all of the neighbours of
the neighbours of the neighbours, etc. We can capture this latter idea by
recalling our definition of distance between vertices. We can then rephrase
the “breadth-first” part of the search to say that we want to visit all ver-
tices at distance 1 from the starting vertex, then the vertices at distance 2,
then distance 3, etc.
Theorem 6.2 (Correctness of BFS). Let v be the starting vertex chosen by the
BFS algorithm. Then for every vertex w which is connected to v, the following
statements hold:
Inductive step. Let k ≥ 0, and assume P(k) holds. That is, assume all vertices at distance k from v are visited, and visited after all vertices at distance at most k − 1. If there are no vertices at distance k + 1 from v, then P(k + 1) is vacuously true and we are done.
Otherwise, let w be a vertex which is at distance k + 1 from v. Then by the definition of distance, w must have a neighbour w′ which is at distance k from v. By the induction hypothesis, w′ is visited at some point during the search. When a vertex is visited, all of its neighbours are enqueued into the queue unless they have already been enqueued. So w is enqueued. Because the loop does not terminate until the queue is empty, this ensures that w at some point becomes the current vertex, and hence is visited. This proves that statement (1) holds for w.
What about (2)? We need to show that w is visited after every vertex at distance at most k from v. Let u be a vertex which is at distance d from v, where d ≤ k. Note that d = 0 or d ≥ 1. We'll only do the d ≥ 1 case here, and leave the other case as an exercise.
By the induction hypothesis, we know that every vertex at distance < d is visited before u. In particular, u has a neighbour u′ at distance d − 1 from v which is visited before u. But even more is true: u′ is visited before w′, since w′ is at distance k from v, and u′ is at distance d − 1 ≤ k − 1 from v. Then the neighbours of u′, which include u, must be enqueued before the neighbours of w′, which include w. (This claim uses the induction hypothesis as well.) So u is added to the queue before w, and hence is dequeued and visited before w. (If you read through the proof carefully, it is only in the last sentence that we use the fact that we're storing the vertices in a queue and not some other collection data type!)
Analysis of BFS
Now let us analyse the running time of this algorithm. The creation of the
new queue takes constant time, but the initialization of all vertices to “not
enqueued” takes Θ(|V |) time.
The analysis of the loops is a little tricky because it is not immediately
obvious how many times each loop runs. Let us start with the outer
loop, which only terminates when queue is empty. We will do this a bit
formally to illustrate the extremely powerful idea that determining the
runtime of an algorithm often involves understanding the subtleties of
what the algorithm really does.
Proof. This follows immediately from the previous proposition, and the fact that at each iteration, only one item is dequeued.
The inner loop runs once per neighbour of the current vertex. Because we know that curr takes
on the value of each vertex at most once, the total number of iterations of
the inner loop across all vertices is bounded above by the total number of
“neighbours” for each vertex. By the Handshake Lemma, this is exactly
2 · | E |.
Putting this together yields an upper bound on the worst-case running
time of O(|V | + | E|). Is this bound tight? You may think it’s “obvious,”
but keep in mind that actually the lower bound doesn’t follow from the
arguments we’ve made in this section. The initialization of the vertex
enqueued attributes certainly takes Ω(|V |) time, but how do we get a bet-
ter lower bound involving | E| using the loops? After all, the first proposi-
tion says that each vertex is added at most once. It is conceivable that no
vertices other than the start are added at all, meaning the loop would only
run just once.
This is where our careful analysis of the correctness of the algorithm
saves us. It tells us that every vertex which is connected to the starting
vertex is added to the queue and visited. For a lower bound on the worst-
case running time, we need a family of inputs (one for each size) which
run in time Ω(|V | + | E|). Note that the size measure here includes both
the number of vertices and the number of edges. So what kind of input
should we pick? Well, consider a graph G in which all edges are part of the same connected component of the graph, and then do a breadth-first search starting at one of the vertices in this component. (By “connected component” we mean a group of vertices which are all connected to each other.)
Then by Theorem 6.2, we know that all vertices in this component are
visited, and the inner loop runs a total of 2 · | E| times (since all edges are
between vertices in this component). This, combined with the Ω(|V |) time
for initialization, results in an Ω(|V | + | E|) running time for this family,
and hence a lower bound on the worst-case running time of BFS.
Exercise Break!
6.1 Build on our BFS algorithm to visit every vertex in a graph, even if the
graph is not connected. Hint: you’ll need to use a breadth-first search
more than once.
6.2 Our analysis of BFS assumed that Visit runs in constant time. Let’s
try to make this a bit more general: suppose that Visit runs in time
Θ( f (|V |)), where f : N → R+ is some cost function. What is the
running time of BFS in this case?
The other major change that we make is that now we have two at-
tributes, started and finished, tracking the status of each vertex through-
out the search.
Let us trace through this algorithm on the same graph from the previous section, again starting at vertex A. In the diagrams, white vertices have not been started, gray vertices have been started but not finished, and black vertices have been started and finished.
1. First, A is visited. It is marked as started, but is not yet finished.
2. Then B is visited (A's first neighbour). It is marked as started.
Theorem 6.5 (Weak correctness of DFS). Let v be the starting vertex given as input to the DFS algorithm. Then for every vertex w which is connected to v, the following statement holds:
(1) w is visited.
1. w is not started.
2. There is a path between v and w which consists of only vertices which have not been started.
Finally, using a very similar analysis as BFS, we can obtain the following
tight bound on the worst-case running time of DFS.
Theorem 6.7 (DFS runtime). The worst-case running time of depth-first search
is Θ(|V | + | E|).
Exercise Break!
• Each vertex other than the starting vertex stores its parent vertex, which
is the neighbour which caused the vertex to be visited
• The inner loop contains a check for started neighbours that are not the
parent of the current vertex; as we’ll show later, this is a valid check for
a cycle.
The caveat for our (quite simple) implementation is that it only works when the graph is connected; we leave it as an exercise to decide how to modify it to always report a cycle, even when the graph is not connected.
1 def DetectCycle(graph):
2 initialize all vertices in the graph to not started or finished
3 s = pick starting vertex
4 DFS_helper(graph, s)
5
18 v.finished = True
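Most of the listing above is elided in this excerpt; a sketch of the recursive helper consistent with the surrounding description (the attribute and function names are assumptions) is:

def DetectCycle(graph):
    for v in graph.vertices:
        v.started = False
        v.finished = False
        v.parent = None
    s = next(iter(graph.vertices))       # pick an arbitrary starting vertex
    DFS_helper(graph, s)

def DFS_helper(graph, v):
    v.started = True
    for u in v.neighbours:
        if not u.started:
            u.parent = v                 # remember which neighbour caused the visit
            DFS_helper(graph, u)
        elif u is not v.parent:
            print("cycle detected involving", u, "and", v)   # started non-parent neighbour
    v.finished = True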
Theorem 6.8 (Cycle Detection with DFS correctness). The modified DFS
algorithm reports that a cycle exists if and only if the input graph has a cycle.
Proof. Let us do the forward direction: assume that the algorithm reports
a cycle. We want to prove that the input graph has a cycle.
Let u and v represent the values of the variables u and v when the cycle
is reported. We will claim that u and v are part of a cycle. Note that u and
v are neighbours, so all we need to do is show that there is a path from u
to v which doesn’t use this edge. We can “close” the path with the edge
to form the cycle.
First, since u is started, it must have been visited before v. Let s be the
starting vertex of the DFS. Because both u and v have been visited, they
are both connected to s.
Then there are two possible cases:
• If u is finished, then the path between u and s taken by the DFS does
not share any vertices in common with the path between v and s other
than s itself. We can simply join those two paths to get a path between
v and u.
• If u is not finished, then it must be on the path between v and s. More-
over, because it is not the parent of v, the segment of this path between
u and v contains at least one other vertex, and does not contain the
edge (u, v).
Since in both cases we can find a path between u and v not using the
edge (u, v), this gives us a cycle.
Now the backwards direction: assume the input graph has a cycle. We
want to show that a cycle is reported by this algorithm. Let u be the first
vertex in the cycle which is visited by the DFS, and let w be the next vertex
in the cycle which is visited.
Though there may be other vertices visited by the search between u and
w, we will ignore those, looking only at the vertices on the cycle. When
w is visited, it and u are the only vertices on the cycle which have been
visited. In particular, this means that u has at least one neighbour in the
cycle which hasn’t been visited. Let v be a neighbour of u which has not This is true even if w and u are neigh-
bours, since u has two neighbours on
yet been visited. Then because there is a path between w and v consisting
the cycle.
of vertices which are not started, v must be visited before the DFS_helper
call ends.
Now consider what happens when DFS_helper is called on v. We know
that u is a neighbour of v, that it is started, and that it is not the parent of
v (since the DFS_helper call on w hasn’t ended yet). Then the conditions
are all met for the algorithm to report the cycle.
Before moving onto our next example, let us take a brief detour into the
world of directed graphs. Recall that a directed graph is one where each
edge has a direction: the tuple (u, v) represents an edge from u to v, and
is treated as different from the edge (v, u) from v to u.
Cycle detection in directed graphs has several applications; here is
just one example. Consider a directed graph where each vertex is a course
offered by the University of Toronto, and an edge (u, v) represents the
relationship “u is a prerequisite of v.” Then a cycle of three directed edges
with vertex sequence v1 , v2 , v3 , v1 would represent the relationships “v1 is
a prerequisite of v2 , v2 is a prerequisite of v3 , and v3 is a prerequisite of
v1 .” But this is nonsensical: none of the three courses v1 , v2 , or v3 could
ever be taken. Detecting cycles in a prerequisite graph is thus a vital step in ensuring that the prerequisite relationships it encodes make sense.
It turns out that essentially the same algorithm works in both the undi-
rected and directed graph case, with one minor modification in the check
for a cycle. Before we write the algorithm, we note the following changes
in terminology in the directed case:
1 def DetectCycleDirected(graph):
2 initialize all vertices in the graph to not started or finished
3 s = pick starting vertex
4 DFS_helper(graph, s)
5
17 v.finished = True
It is a testament to the flexibility of DFS that such minor modifications can have such interesting consequences.
Shortest Path
from v, because then it wouldn't be the parent of u. Why not? u must have a neighbour at distance d from v, and by the BFS correctness theorem, this neighbour would be visited before u′ if u′ were at a greater distance from v. But the parent of u is the first of its neighbours to be visited by BFS, so u′ must be visited before any other of u's neighbours.
So, u′ is at distance d from v. By the induction hypothesis, the sequence of parents u′ = u0, u1, . . . , ud is a path of length d which ends at v. But then the sequence of parents starting at u of length d + 1 is u, u0, u1, . . . , ud, which also ends at v.
Weighted graphs
The story of shortest paths does not end here. Often in real-world ap-
plications of graphs, not all edges – relationships – are made equal. For
example, suppose we have a map of cities and roads between them. When
trying to find the shortest path between two cities, we do not care about
the number of roads that must be taken, but rather the total length across
all the roads. But so far, our graphs can only represent the fact that two
cities are connected by a road, and not how long each road is. To solve this
more general problem, we need to augment our representation of graphs
to add such “metadata” to each edge in a graph.
Now that we have introduced weighted graphs, we will study one more
fundamental weighted graph problem in this chapter. Suppose we have
a connected weighted graph, and want to remove some edges from the
graph but keep the graph connected. You can think of this as taking a set
of roads connecting cities, and removing some of the roads while still making
sure that it is possible to get from any city to any other city. Essentially,
we want to remove “redundant” edges, the ones whose removal does not
disconnect the graph.
How do we know if such redundant edges exist? It turns out (though
we will not prove it here) that if a connected graph has a cycle, then
removing any edge from that cycle keeps the graph connected. Moreover,
if the graph has no cycles, then it does not have any redundant edges. (We won't prove this here, but please take a course on graph theory if you are interested in exploring more properties like this in greater detail.) This property – not having any cycles – is important enough to warrant a definition.
Definition 6.8 (tree). A tree is a connected graph which has no cycles. (Warning: one difference between this definition of tree and the trees you are used to seeing as data structures is that the latter are rooted, meaning we give one vertex special status as the root of the tree. This is not the case for trees in general.)
So you can view our goal as taking a connected graph, and identifying a subset of the edges to keep, which results in a tree on the vertices of the original graph.
It turns out that the unweighted case is not very interesting, because
all trees obey a simple relationship between their number of edges and
vertices: every tree with n vertices has exactly n − 1 edges. So finding such
a tree can be done quite easily by repeatedly finding cycles and choosing
to keep all but one of the edges.
However, the weighted case is more challenging, and is the problem of
study in this section.
The first algorithm we will study for solving this problem is known as
Prim’s algorithm. The basic idea of this algorithm is quite straightfor-
ward: pick a starting vertex for the tree, and then repeatedly add new
vertices to the tree, each time selecting an edge with minimum weight
which connects the current tree to a new vertex.
def PrimMST(G):
    v = pick starting vertex from G.vertices
    TV = {v}  # MST vertices
    TE = {}   # MST edges

    while TV != G.vertices:
        extensions = { (u, v) in G.edges where exactly one of u, v are in TV }
        e = edge in extensions having minimum weight
        TV = TV + endpoints of e
        TE = TE + {e}

    return TE
This algorithm is quite remarkable for two reasons. The first is that
it works regardless of which starting vertex is chosen; in other words,
there is no such thing as the “best” starting vertex for this algorithm.
The second is that it works without ever backtracking and discarding an
already-selected edge to choose a different one. (This algorithm doesn't “make mistakes” when choosing edges.) Let us spend some time studying the algorithm to prove that it is indeed correct.
Specifically, we will show that the loop in Prim's algorithm maintains the following two invariants:
(1) The graph (TV, TE) is a tree. (Remember: trees are connected and have no cycles.)
(2) The tree (TV, TE) can be extended, by adding only some edges and vertices, to a minimum spanning tree of G. (This is like saying that TE is a subset of a possible solution to this problem.)
Before we prove that these invariants are correct, let’s see why they
imply the correctness of Prim’s algorithm. When the loop terminates, both
of these invariants hold, so the graph (TV, TE) is a tree and can be extended
to an MST for the input graph. However, when the loop terminates, its
condition is false, and so TV contains all the vertices in the graph. This
means that (TV, TE) must be an MST for the input graph, since no more
vertices/edges can be added to extend it into one.
We will omit the proof of the first invariant, which is rather straightforward. (High level: every time a vertex is added, an edge connecting that vertex to the current tree is also added. Adding this edge doesn't create a cycle.) The second one is significantly more interesting, and harder to reason about.
Fix a loop iteration. Let (V1, E1) be the tree at the begin-
ning of the loop body, and (V2 , E2 ) be the tree at the end of the loop body.
We assume that (V1, E1) can be extended to an MST (V, E′) for the input, where E1 ⊆ E′. (This assumption is precisely the loop invariant being true at the beginning of the loop body.) Our goal is to show that (V2, E2) can still be extended to an MST for the input (but not necessarily the same one).
What the loop body does is add one vertex v to V1 and one edge e = (v, u, w_e) to E1, where u is already in V1, and w_e is the weight of this edge. So we have V2 = V1 ∪ {v} and E2 = E1 ∪ {e}.
Case 1: e ∈ E′. In this case, E2 ⊆ E′, so the tree (V2, E2) can also be extended to the same MST as (V1, E1), so the statement holds.
Case 2: e ∉ E′. This is tougher because this means E2 cannot simply be extended to E′; we need to find a different minimum spanning tree (V, E″) which contains E2.
Let us study the minimum spanning tree (V, E′) more carefully. Consider the partition of the vertices V1, V \ V1 in this MST. Let u and v be the endpoints of e; then they must be connected in the MST, and so there is a path between them. Let e′ ∈ E′ be the edge on this path which connects a vertex from V1 to a vertex from V \ V1. Since e ∉ E′, this means that e′ ≠ e.
Now consider what happens in the loop body. Since e′ connects a vertex in V1 to a vertex not in V1, it is put into the set extensions. We also know that e was selected as the minimum-weight edge from extensions, and so w_{e′} ≥ w_e. Note that their weights could be equal.
Now define the edge set E″ to be the same as E′, except with e′ removed and e added. The resulting graph (V, E″) is still connected, and doesn't contain any cycles – it is a tree. Moreover, the total weight of its edges is less than or equal to that of E′, since w_{e′} ≥ w_e. Then the graph (V, E″) is also a minimum spanning tree of the input graph, and E2 ⊆ E″. (In fact, because E′ forms a minimum spanning tree, the total weight of E″ can't be smaller than that of E′, so their weights are equal.)
The idea of replacing e′ by e in the initial MST to form a new MST is a very nice one. Intuitively, we argued that at the loop iteration where we “should” have chosen e′ to go into the MST, choosing e instead was
still just as good, and also leads to a correct solution. It is precisely this
argument that “every choice leads to a correct solution” that allows Prim’s
algorithm to never backtrack, undoing a choice to make a different one.
This has significant efficiency implications, as we’ll study in the next sec-
tion.
to the set TE, terminating only when TV contains all n vertices. But what happens inside the loop body?
This is actually where the choice of data structures is important. Note
that we have two sets here, a set of vertices and a set of edges. We’ll first
use an array to represent these sets, so looking up a particular item or
adding a new item takes constant time.
How do we compute extensions, the edges which have one endpoint
in TV and one endpoint not? We can loop through all the edges, each time
checking its endpoints; this takes Θ(m) time. (Using adjacency lists, this statement is certainly true, but requires a bit of thought. We leave this as an exercise.) What about computing the minimum weight edge? This is linear in the size of extensions, and so in the worst case is Θ(m).
This leads to a total running time of Θ(mn), which is not great (recall
that BFS and DFS both took time which was linear in the two quantities,
Θ(n + m)). The key inefficiency here is that the inner computations of
finding “extension” edges and finding the minimum weight one both do
a lot of repeated work in different iterations of the outer loop. After all,
from one iteration to the next, the extensions only change slightly, since
the set TV only changes by one vertex.
So rather than recompute extensions at every iteration, we maintain a
heap of edges which extend the current tree (TV, TE), where the “priority”
of an edge is simply its weight. This is quite nice because the operation we
want to perform on extensions, find the edge with the minimum weight,
is well-suited to what heaps can give us.
So we can rearrange our loop to check each edge exactly once, provided we do some pre-sorting on the edges to make sure that when each edge is checked, it has the minimum weight of all the remaining edges. Since
we only care about getting the remaining edge with the minimum weight,
this is a perfect time for a heap:
def PrimMST(G):
    v = pick starting vertex from G.vertices
    TV = {v}  # MST vertices
    TE = {}   # MST edges
    extensions = new heap  # treat weights as priority
    for each edge on v:
        Insert(extensions, edge)

    while TV != G.vertices:
        e = ExtractMin(extensions)
        u = endpoint of e not in TV

        TV = TV + {u}
        TE = TE + {e}
        for each edge on u:
            if other endpoint of edge is not in TV:
                Insert(extensions, edge)

    return TE
Kruskal’s algorithm
As a bridge between this chapter and our next major topic of discussion,
we will look at another algorithm for solving the minimum spanning tree
problem. You might wonder why we need another algorithm at all: didn’t
we just spend a lot of time developing an algorithm that seems to work
perfectly well? But of course, there are plenty of reasons to study more
than one algorithm for solving a given problem, much as there is for
studying more than one implementation of a given ADT: generation of
new ideas or insights into problem structure; improving running time ef-
ficiency; different possible directions of generalization. In this particular
case, we have a pedagogical reason for introducing this algorithm here as
well, as motivation for the final abstract data type we will study in this
course.
The second algorithm for solving the minimum spanning tree problem
is known as Kruskal’s algorithm. It is quite similar to Prim’s algorithm,
in that it incrementally builds up an MST by selecting “good” edges one
at a time. Rather than build up a single connected component of vertices,
however, it simply sorts edges by weight, and always picks the smallest-
weight edge it can without creating a cycle with the edges which have
already been selected.
1 def KruskalMST(G):
2 TE = {}
3 sort G.edges in non-decreasing order of weights
4
9 return TE
1 def KruskalMST(G):
2 TE = {}
3 sort G.edges in non-decreasing order of weights
4 set each vertex to be its own connected component
5
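Both listings above are shown only in part here; the following self-contained sketch captures the overall structure, using an explicit vertex-to-component mapping in place of the disjoint set data structure developed in the next chapters (the edge format and helper names are assumptions).

def KruskalMST(G):
    TE = set()
    # Each vertex starts in its own connected component.
    component = {v: i for i, v in enumerate(G.vertices)}
    for (u, v, weight) in sorted(G.edges, key=lambda e: e[2]):
        if component[u] != component[v]:         # adding (u, v) creates no cycle
            TE.add((u, v, weight))
            old, new = component[v], component[u]
            for x in component:                  # merge v's component into u's
                if component[x] == old:
                    component[x] = new
    return TE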
As we will see in the following two chapters, the answer is: “we can
support all three of these operations very efficiently, if we consider the
operations in bulk rather than one at a time.” But before we formalize the
abstract data type to support these operations and study some implemen-
tations of this data type, we need to really understand what it means to
“consider the operations in bulk.” This is a type of runtime analysis dis-
tinct from worst-case, average-case, and even worst-case expected running
times, and is the topic of the next chapter.
7 Amortized Analysis
Dynamic arrays
course that memory might be already allocated for a different use, and if it is then the existing array cannot simply be expanded. (This “different use” might even be external to our program altogether!)
Instead, we will do the only thing we can, and allocate a separate and
bigger block of memory for the array, copying over all existing elements,
and then inserting the new element. We show below one basic imple-
mentation of this approach, where A is a wrapper around an array which
has been augmented with three attributes: array, a reference to the ac-
tual array, allocated, the total amount of memory allocated for this array,
and size, the number of filled spots in A. For this implementation, we
choose an “expansion factor” of two, meaning that each time the array is
expanded, the memory block allocated doubles in size.
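The implementation referred to above is not reproduced in this excerpt; a sketch consistent with the description (attributes array, allocated, and size, expansion factor two) is:

class DynamicArray:
    def __init__(self):
        self.allocated = 1                  # total amount of memory allocated
        self.size = 0                       # number of filled spots
        self.array = [None] * self.allocated

    def append(self, x):
        if self.size == self.allocated:
            # Array is full: allocate a block twice as big and copy every element over.
            self.allocated *= 2
            new_array = [None] * self.allocated
            for i in range(self.size):
                new_array[i] = self.array[i]
            self.array = new_array
        self.array[self.size] = x
        self.size += 1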
For the purposes of analysing the running time of this data structure, we
will only count the number of array accesses made. This actually allows
us to compute an exact number for the cost, which will be useful in the following sections. (While this does cause us to ignore the cost of the other operations, all of them are constant time, and so focusing on the number of array accesses won't affect our asymptotic analysis.)
For now, suppose we have a dynamic array which contains n items. We note that the final two steps involve just one array access (inserting the new item x), while the code executed in the if block has exactly 2n
accesses: access each of the n items in the old array and copy the value to
the new array.
So we have a worst-case running time of 2n + 1 = O(n) array accesses.
This upper bound does not seem very impressive: AVL trees were able to
support insertion in worst-case Θ(log n) time, for instance. So why should
we bother with dynamic arrays at all? The intuition is that this linear
worst-case insertion is not a truly representative measure of the efficiency
of this algorithm, since it only occurs when the array is full. Furthermore,
if we assume the array length starts at 1 and doubles each time the array
expands, the array lengths are always powers of 2. This means that 2n + 1
array accesses can only ever occur when n is a power of 2; the rest of the
time, insertion accesses only a single element! A bit more formally, if T (n)
is the number of array accesses when inserting into a dynamic array with
n items, then
T(n) = 2n + 1 if n is a power of 2, and T(n) = 1 otherwise.
So while it is true that T (n) = O(n), it is not true that T (n) = Θ(n).
Are average-case or worst-case expected analyses any better here? Well,
remember that these forms of analysis are only relevant if we treat either
the input data as random, or if the algorithm itself has some randomness.
This algorithm certainly doesn’t have any randomness in it, and there isn’t
a realistic input distribution (over a finite set of inputs) we can use here,
as the whole point of this application is to be able to handle insertion into
arrays of any length. So what do we do instead?
Amortized analysis
Aggregate method
The first method we’ll use to perform an amortized analysis is called the
aggregate method, so named because we simply compute the total cost of
a sequence of operations, and then divide by the number of operations to
find the average cost per operation.
Let’s make this a bit more concrete with dynamic arrays. Suppose we
start with an empty dynamic array (size 0, allocated space 1), and then
perform a sequence of m insertion operations on it. What is the total cost
of this sequence of operations? We might naively think of the worst-case expression 2n + 1 for the insertion on an array of size n (we're still counting only array accesses here), but as we discussed earlier, this vastly overestimates the cost for most of the insertions.
Since each insertion increases the size of the array by 1, we have a simple
expression of the total cost:
∑_{k=0}^{m−1} T(k).
Recall that T(k) is 1 when k isn't a power of two, and 2k + 1 when it is. Define T′(k) = T(k) − 1, so that we can write
∑_{k=0}^{m−1} T(k) = ∑_{k=0}^{m−1} (1 + T′(k)) = m + ∑_{k=0}^{m−1} T′(k).
But most of the T′(k) terms are 0, and in fact we can simply count the terms which aren't by explicitly using the powers of two from 2^0 to 2^⌊log(m−1)⌋. So then
∑_{k=0}^{m−1} T(k) = m + ∑_{k=0}^{m−1} T′(k)
                  = m + ∑_{i=0}^{⌊log(m−1)⌋} T′(2^i)          (substituting k = 2^i)
                  = m + ∑_{i=0}^{⌊log(m−1)⌋} 2 · 2^i
                  = m + 2(2^(⌊log(m−1)⌋+1) − 1)               (since ∑_{i=0}^{d} 2^i = 2^(d+1) − 1)
                  = m − 2 + 4 · 2^⌊log(m−1)⌋
The last term there looks a little intimidating, but note that the floor doesn't change the value of the exponent much: the power and the log roughly cancel each other out, since (m − 1)/2 ≤ 2^⌊log(m−1)⌋ ≤ m − 1.
So the total cost is between roughly 3m and 5m. This might sound linear, but remember that we are considering a sequence of operations, and so the key point is that the average cost per operation in a sequence is actually between 3 and 5, which is constant! So we conclude that the amortized cost of dynamic array insertion is Θ(1), formalizing our earlier intuition that the inefficient insertions are balanced out by the much more frequent fast insertions. (You may find it funny that the key variable is the number of operations, not the size of the data structure like usual. In amortized analysis, the size of the data structure is usually tied implicitly to the number of operations performed.)
Now let us perform the analysis again through a different technique known
as the accounting method, also known as the banker’s method. As be-
fore, our setup is to consider a sequence of m insertion operations on
our dynamic array. In this approach, we do not try to compute the total
cost of these insertions directly, but instead associate with each operation
a charge, which is simply a positive number, with the property that the total
value of the m charges is greater than or equal to the total cost of the m operations.
You can think of cost vs. charge in terms of a recurring payment system.
The cost is the amount of money you have to pay at a given time, and the
charge is the amount you actually pay. At any point in time the total
amount you’ve paid must be greater than or equal to the total amount
you had to pay, but for an individual payment you can overpay (pay more
than is owed) to reduce the amount you need to pay in the future.
In the simplest case, the charge is chosen to be the same for each op-
eration, say some value c, in which case the total charge is cm. Let T (m)
be the total cost of the m operations in the sequence. If we assume that
T (m) ≤ cm, then T (m)/m ≤ c, and so the average cost per operation is
at most c. In other words, the chosen charge c is an upper bound on the
amortized cost of the operation.
The motivation for this technique comes from the fact that in some
cases, it is harder to compute the total cost of a sequence of m operations,
but more straightforward to compute a charge value and argue that the
charges are enough to cover the total cost of the operations. This is a dif-
ferent style of argument that tends to be more local in nature: the charges
usually are associated with particular elements or components of the data
structure, and cover costs only associated with those elements. (Warning: this is just a rule of thumb when it comes to assigning charges, and not every accounting argument follows this style.) Let us make this more concrete by seeing one charge assignment for dynamic array insertion.
Charging scheme for dynamic array insertion. Each insertion gets a
charge of five, subject to the following rules.
• The i-th insertion associates five “credits” with index i. (Think of a credit as representing one array access operation.)
• One credit is immediately used to pay for the insertion at position i, and removed.
• Let 2^k be the smallest power of two which is ≥ i. When the 2^k-th insertion occurs, the remaining four credits associated with position i are removed.
What we want to argue is that the “credits” here represent the difference between the charge and cost of the insertion sequence. (Credits are part of the analysis, not the actual data structure. There isn't any credits attribute in the program!) The number of credits for an index never drops below zero: the first rule ensures that for each index 1 ≤ i ≤ m, five credits are added, and the second and third rules together consume these five credits (note that each is only applied once per i).
So if we can prove that the total number of credits represents the differ-
ence between the charge and cost of the insertion sequence, then we can
conclude that the total charge is always greater than or equal to the total
cost. Because the charge is 5 per operation, this leads to an amortized cost
of O(1), the same result as we obtained from the aggregate method.
Proof. There are two cases: when i is a power of two, and when i is not a
power of two.
Case 1: i is not a power of two. Then there’s only one array access, and
this is paid for by removing one credit from index i.
Case 2: i is a power of two. There are 2i + 1 array accesses. There is one
credit removed from index i, leaving 2i accesses to be accounted for.
There are four credits removed for each index j such that i is the small-
est power of two greater than or equal to j. How many such j are there?
Let i = 2^k. The next smallest power of two is 2^(k−1) = i/2. Then the possible values for j are {i/2 + 1, i/2 + 2, . . . , i}; there are i/2 such choices. (Note that this includes index i itself; so if i is a power of two, then index i gets 5 credits, and then immediately all of them are used.)
So then there are (i/2) · 4 = 2i credits removed, accounting for the remaining array accesses.
This method often proves more flexible, but also more challenging, than
the aggregate method. We must not only come up with a charging scheme
for the operation, but then analyse this scheme to ensure that the total
charge added is always greater than or equal to the total cost of the opera-
tions. As we saw, this is not the same as arguing that the charge is greater
than the cost for an individual operation; some operations might have a
far higher cost than charge, but have the excess cost be paid for by pre-
vious surplus charges. Creative charging schemes take advantage of this
by keeping charges high enough to cover the cost of any future operation,
while still being low enough to obtain a good bound on the amortized
cost of each operation.
8 Disjoint Sets
• MakeSet(DS, v): Take a single item v and create a new set {v}. The
new item is the representative element of the new set.
• Find(DS, x): Find the unique set that contains x, and return the repre-
sentative element of that set (which may or may not be x).
• Union(DS, x, y): Take two items and merge the sets that contain these
items. The new representative might be one of the two representatives
of the original sets, one of x or y, or something completely different.
The remainder of this chapter will be concerned with how we can effi-
ciently implement this ADT.
Initial attempts
One obvious approach is to store each set as a linked list, and to pick as
the representative element the last element in the linked list. Let us try to
implement the Disjoint Set ADT using linked lists: the data structure it-
self will contain a collection of linked lists, where each individual linked
list represents a set. Each node in the linked list represents an element
in a set. Given any element, it is possible to simply follow the references
from that element to the end of its list. We assume that the inputs to
Find and Union are references to the node of a linked list. This is an
implementation-specific detail that is consistent across all of the imple-
mentations we’ll look at in this chapter, but it is not necessarily intuitive,
so please do keep it in mind.
5 # end1 and end2 are the last nodes in their respective linked lists
6 end1.next = end2
But wait, you say – if you do this, the result is no longer a linked list,
since there are two different nodes pointing to end2. And indeed, the
combined structure is no longer a linked list – it’s a tree! Of course, the
links are the reverse of the tree-based data structures we have seen so far:
instead of a node having references to each of its children, now each node
only has one reference, which is to its parent.
Below is a complete implementation of the Disjoint Set ADT using a
collection of trees. You’ll quickly find that this is basically the implemen-
tation we have already given, except with some renaming to better reflect
the tree-ness of the implementation. Also, it will be helpful to keep in
mind that the representative elements are now the root of each tree.
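The complete implementation is omitted from this excerpt; a sketch of the tree-based version, with each node storing only a reference to its parent (the names are ours, and the disjoint set collection argument is dropped for brevity), is:

class Node:
    def __init__(self, value):
        self.value = value
        self.parent = None                  # None means this node is a root (representative)

def MakeSet(value):
    return Node(value)

def Find(node):
    while node.parent is not None:          # walk up to the root of the tree
        node = node.parent
    return node

def Union(x, y):
    root_x, root_y = Find(x), Find(y)
    if root_x is not root_y:
        root_y.parent = root_x              # make one root a child of the other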
Runtime analysis
Now, let us briefly analyse the running time of the operations for this
tree-based disjoint set implementation. The MakeSet operation clearly
takes Θ(1) time, while both Find and Union take time proportional to
the distance from the inputs to the root of their respective trees. In the worst case, this is Θ(h), where h is the maximum height of any tree in the collection. (Remember that the disjoint set collection can contain multiple trees, each with their own height. The running time of Find and Union depends on the trees that are involved.)
Of course, as with general trees, it is possible for the heights to be pro-
portional to n, the total number of items stored in all of the sets, leading
to a worst-case running time of Θ(n).
Union-by-rank
One idea we borrow from AVL trees is the idea of enforcing some local
property of nodes to try to make the whole tree roughly balanced. While
we do not have to enforce the binary search tree property here, we are
limited in two ways:
Rather than have nodes keep track of their height or balance factor, we
will store an attribute called rank, which has the following definition:
Definition 8.1 (rank (disjoint sets)). The rank of a node in a disjoint set
tree is defined recursively as follows:
can make the smaller tree a subtree of the bigger tree without changing
the bigger tree’s height. We call this modification of the basic Union
algorithm the union-by-rank heuristic.
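A sketch of Union with the union-by-rank heuristic, building on the parent-pointer nodes sketched earlier and assuming each node also stores a rank attribute initialized to 0:

def MakeSetWithRank(value):
    node = Node(value)
    node.rank = 0
    return node

def UnionByRank(x, y):
    root_x, root_y = Find(x), Find(y)
    if root_x is root_y:
        return
    if root_x.rank > root_y.rank:
        root_y.parent = root_x              # larger-rank root stays the root; no rank change
    elif root_x.rank < root_y.rank:
        root_x.parent = root_y
    else:
        root_y.parent = root_x              # equal ranks: pick one root arbitrarily and
        root_x.rank += 1                    # increase its rank by one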
Lemma 8.1 (Rank for disjoint sets). Let T be a tree generated by a series of MakeSet and Union operations using the union-by-rank heuristic. Let r be the rank of the root of T, and n be the number of nodes in T. Then 2^r ≤ n.
and r be the rank of the root of the union tree. We want to prove that 2^r ≤ n.
Case 1: r1 > r2. In this case, r = r1, since the root of T1 is chosen as the root of the union tree, and its rank doesn't change. So then 2^r = 2^r1 ≤ n1 < n, so the property still holds. The same argument works for the case that r1 < r2 as well, by switching the 1's and 2's.
Case 2: r1 = r2. In this case, the root of T1 is selected to be the new root, and its rank is increased by one: r = r1 + 1. Then since 2^r1 ≤ n1 and 2^r2 ≤ n2, we get 2^r1 + 2^r2 ≤ n1 + n2 = n. Since r1 = r2, 2^r1 + 2^r2 = 2^(r1+1) = 2^r, and the property holds.
We’ll leave it as an exercise to prove the second lemma, which says that
the rank really is a good approximation of the height of the tree.
Together, these two lemmas give us a logarithmic bound on the worst-case
running time of Find and Union under union-by-rank.
Proof. Note that the manipulations involving the new rank attribute in
the implementation are all constant time, and so don’t affect the asymp-
totic running time of these algorithms. So, by our previous argument, the
worst-case running time of these two operations is still Θ(h), where h is
the maximum height of a tree in the disjoint set data structure. Now we
show that h = O(log n), where n is the total number of items stored in the
sets.
Let T be a tree in the data structure with maximum height h, r be the
rank of the root of this tree, and n0 be the size of this tree. (Clearly,
n0 ≤ n.) By our previous two lemmas, we have the chain of inequalities

    h = r + 1 ≤ log n0 + 1 ≤ log n + 1.
Exercise Break!
8.1 Prove that the worst-case running time of a Find or Union operation
when using union-by-rank is Ω(log n), where n is the number of items
stored in the sets.
8.2 Suppose we have a disjoint set collection consisting of n items, with
each item in its own disjoint set.
Then, suppose we perform k Union operations using the union-by-
rank heuristic, where k < n, followed by one Find operation. Give a
tight asymptotic bound on the worst-case running time of this Find
operation, in terms of k (and possibly n).
Path compression
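Path compression modifies Find so that, after locating the root, every node
on the find path is re-pointed directly at the root. As a rough illustration
(not necessarily the exact listing this chapter uses), a two-pass,
path-compressing Find might look like the following sketch, reusing the Node
class from the tree-based sketch earlier.

    def find(node):
        # First pass: walk up the parent references to locate the root.
        root = node
        while root.parent is not None:
            root = root.parent

        # Second pass (path compression): make the root the direct parent of
        # every node that was visited on the way up.
        while node is not root:
            next_node = node.parent
            node.parent = root
            node = next_node

        return root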
Path compression does not improve the worst-case cost of a single Find; its
payoff shows up in the amortized cost. We will show that, for a collection of
n items, the total cost of a sequence of n Find operations is O(n log n), so
their amortized cost is O(log n) each.
Proof. We will use the aggregate method to perform the amortized analy-
sis here, but in a more advanced way: rather than computing the cost of
each Find operation separately and adding them up, we will attempt to
add up all the costs at once, without knowing the value of any individual
cost.
The key insight is that for any individual Find operation, its running
time is proportional to the number of edges traversed from the input node
up to the root. In this proof, we will call the path traversed by a Find
operation a find path. So we define F to be the collection of edges appearing
in the n find paths, allowing for duplicates in F because the same edge can
belong to multiple find paths. Then the total cost of the sequence of n Find
operations is simply |F|, and the amortized cost of an individual Find
operation is |F|/n.
The second key idea is to partition F based on the differential rank of
the endpoints of each edge. For each 1 ≤ i ≤ log n, define the set Fi to be
the set of edges (x, y) in F, where y is the parent of x, such that
2^{i-1} ≤ rank(y) − rank(x) ≤ 2^i. We say that an edge e ∈ Fi is a
last edge if, among the edges of Fi on its find path, it is the one closest to
the set representative of its tree. Otherwise, we say that the edge e is a
middle edge. We let Li and Mi be the set of last and middle edges from Fi,
respectively.
This is a little abstract, so let’s make it more concrete. Suppose we have
a find path that contains three edges e1 , e2 , e3 from Fi (plus possibly many
other edges). Suppose e3 is closest to the root of the tree. Then e3 is a last
edge, and e1 and e2 are middle edges. In general, if multiple edges from
Fi appear on the same find path, exactly one of them will be a last edge,
and all the others will be middle edges.
Since each edge in Fi is either a last or middle edge, |Fi| = |Li| + |Mi|.
(Don't worry, we aren't going to subdivide any further! Geez.) Rather than
computing the size of Fi directly, we want to compute the sizes of Li and Mi
separately. The last edges are easy to count: each find path contains at most
one last edge from Fi, and there are n find paths in total, so |Li| ≤ n. Now
suppose e = (x, y) is a middle edge in Mi, traversed during some Find
operation, where y is the parent of x at that time. We know the following:
• There is another edge e0 ∈ Fi that is closer than e to the root of the tree
(since e is a middle edge).
• y is not the set representative of the tree (since it can’t be the root).
Let z be the set representative for this tree; z is distinct from x and y.
Since the nodes visited on the path from y to z have increasing rank, and both
endpoints of e0 lie on this path, we know that rank(z) − rank(y) ≥ 2^{i−1}.
(Checkpoint: this inequality only uses the fact that the edge e0 ∈ Fi lies
between y and z. We haven't yet used the fact that e is also in Fi.)
Furthermore, due to path compression, after this Find operation is
complete, z becomes the new parent of x. Here is the third big idea. The
actual node change itself doesn't matter to our proof; what does matter is
that from the point of view of x, its parent's rank has increased by at least
2^{i−1}. (Remember, no node actually changed its rank. It's just that the
parent of x changed to one with a bigger rank.)
Why is this such a big deal? We can use this to prove the following
claim.
Proposition 8.5. Let x be a node. There are at most two edges in Mi that start
at x.
Proof. Suppose for a contradiction that there are three edges (x, y1), (x, y2),
(x, y3) in Mi that start at x, listed in the order in which their Find
operations occur. By the argument above, each of the first two of these Find
operations increases the rank of x's parent by at least 2^{i−1}. This is
finally where we use the fact that these edges are in Mi! This tells
us that rank(y1) ≥ rank(x) + 2^{i−1}, and so we get rank(y3) ≥ rank(x) + 2^i +
2^{i−1}. But the fact that (x, y3) ∈ Mi means that rank(y3) ≤ rank(x) + 2^i, a
contradiction.
Since the above claim is true for any node x, this means that each of the
n nodes in the disjoint sets can be the starting point for at most two edges
in Mi. This implies that |Mi| ≤ 2n. (It is actually quite remarkable that the
size of each Mi doesn't depend on i itself.)
Putting this all together yields a bound on the size of F:
    |F| = ∑_{i=1}^{log n} |Fi|
        = ∑_{i=1}^{log n} (|Li| + |Mi|)
        ≤ ∑_{i=1}^{log n} (n + 2n)
        = 3n log n
In other words, the total number of edges visited by the n Find oper-
ations is O(n log n), and this is the aggregate cost of the n Find opera-
tions.
We have seen that both union-by-rank and path compression can each be
used to improve the efficiency of the Find operation. We have performed
the analysis on them separately, but what happens when we apply both
heuristics at once? This is not merely an academic matter: we have seen
that it is extremely straightforward to implement each one, and because
union-by-rank involves a change to Union and path compression is a
change to Find, it is very easy to combine them in a single implementa-
tion.
The challenge here is performing the running time analysis on the com-
bined heuristics. We can make some simple observations:
• Path compression never changes the rank of any node, so the height bound
guaranteed by union-by-rank still holds, and the worst-case cost of a single
Find or Union remains O(log n).
• Union-by-rank only changes how Union chooses a root; it does not change how
Find traverses and compresses a find path, so the aggregate analysis of path
compression still applies.
In other words, the two analyses that we performed for the two heuris-
tics still apply when we combine them. The only lingering question is the
amortized cost of a Find operation, for which we only proved an upper
bound in the previous section. Given that our analysis for path compres-
sion alone had to accommodate a node of rank n, while union-by-rank
ensured that a node’s rank is at most log n, there may be hope for a better
amortized cost.
And indeed, there is. What follows is a tour of the major results in
the history of this disjoint set implementation, going from its humble be-
ginnings to the precise analysis we take for granted today. It is truly a
remarkable story of an algorithm whose ease of implementation belied
the full subtlety of its running time analysis.
The first known appearance of this disjoint set implementation, which
combined both union-by-rank and path compression, was in a paper by
Bernard A. Galler and Michael J. Fischer in 1964. However, the authors
of this paper did not perform a precise analysis of their algorithm, and in
fact the question of just how efficient it really is took almost two decades
to resolve.
In 1972, Fischer proved that the amortized cost of each Find operation
is O(log log n), already a drastic improvement over the O(log n) bound
we proved in the previous section. (You should be able to modify our proof to
get an amortized cost of O(log log n) for the rank bound that union-by-rank
gives us.) The next year, John Hopcroft and Jeffrey Ullman published a further
improvement, showing the amortized cost of Find to be O(log* n), where log* n
is the iterated logarithm function, defined recursively as follows:

    log* n = 0                      if n ≤ 1
    log* n = 1 + log*(log n)        otherwise
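To get a feel for how slowly this function grows, here is a worked evaluation
using the definition above: log* 65536 = 1 + log* 16 = 2 + log* 4 = 3 + log* 2
= 4 + log* 1 = 4. Consequently log*(2^65536) = 1 + log* 65536 = 5, even though
2^65536 has over sixty thousand binary digits.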
Progress did not stop at the Hopcroft-Ullman result. In 1975, just two years
later, Robert Endre Tarjan published his paper “Efficiency of a Good But Not Linear Set
Union Algorithm.” His main result was that the amortized cost of each
Find operation is Θ(α(n)), where α(n) is the inverse Ackermann function,
a function that grows even more slowly than the iterated logarithm. (For
comparison to the iterated logarithm, α(2^(2^65536)) = 4.)
And even more impressive than getting a better upper bound was that
Tarjan was able to show a matching lower bound, i.e., give a sequence of
operations so that the amortized cost of the Find operations when using
both union-by-rank and path compression is Ω(α(n)). This showed that
no one else would be able to come along and give a better upper bound
for the amortized cost of these operations, and the question was settled
– almost. At the end of the paper, Tarjan wrote the following (emphasis
added):
This is probably the first and maybe the only existing example of a simple
algorithm with a very complicated running time. The lower bound given in
Theorem 16 is general enough to apply to many variations of the algorithm,
although it is an open problem whether there is a linear-time algorithm for
the online set union problem. On the basis of Theorem 16, I conjecture that
there is no linear-time method, and that the algorithm considered here is
optimal to within a constant factor.
In other words, having completed his analysis of the disjoint set data
structure using both union-by-rank and path compression, Tarjan pro-
posed that no other heuristics would improve the asymptotic amortized
cost of Find, nor would any other completely different algorithm.
And in 1989 (!), Michael Fredman and Michael Saks proved that this
was true: any implementation of disjoint sets would have an amortized
cost for Find of Ω(α(n)). This is quite a remarkable statement to prove,
as it establishes some kind of universal truth over all possible implemen-
tations of disjoint sets – even ones that haven’t been invented yet! Rather
than analyse a single algorithm or single data structure, they defined a
notion of innate hardness for the disjoint set ADT itself that would con-
strain any possible implementation. This idea, that one can talk about
all possible implementations of an abstract data type, or all possible al-
gorithms solving a problem, is truly one of the most fundamental – and
most challenging – pillars of theoretical computer science.
Exercise Break!
8.3 The idea of using an attribute to store the rank of each node is a power-
ful one, and can be extended to storing all sorts of metadata about each
set.