
IA Algorithms

Damon Wischik, Computer Laboratory, Cambridge University. Lent Term 2021

Contents
5 Graphs and path finding 1
5.1 Notation and representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
5.2 Depth-first search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
5.3 Breadth-first search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5.4 Dijkstra’s algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
5.5 Algorithms and proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5.6 Bellman-Ford . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.7 Dynamic programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.8 Johnson’s algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

6 Graphs and subgraphs 27


6.1 Flow networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.2 Ford-Fulkerson algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.3 Max-flow min-cut theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.4 Matchings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.5 Prim’s algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.6 Kruskal’s algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.7 Topological sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

7 Advanced data structures 49


7.1 Aggregate analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.2 Amortized costs: introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.3 Amortized costs: definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.4 Potential functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.5 Three priority queues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
7.6 Fibonacci heap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.7 Implementing the Fibonacci heap∗ . . . . . . . . . . . . . . . . . . . . . . . . 66
7.8 Analysis of Fibonacci heap . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.9 Disjoint sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5. Graphs and path finding


A great many algorithmic problems are about entities and the connections between them.
Graphs are how we describe them. A graph is a set of vertices (or nodes, or locations) and
edges (or connections, or links) between them.

Example. In 1735 the mathematician Leonhard Euler was asked, about the city of Königsberg, “Can I
go for a stroll around the city on a route that crosses each bridge exactly once?” He was
intrigued – “This question is so banal, but seemed to me worthy of attention in that [neither]
geometry, nor algebra, nor even the art of counting was sufficient to solve it.” He
proved the answer was ‘No’. His innovation was to turn the question into what we would
now call a discrete maths question about a graph, in this case a graph with 4 vertices
and 7 edges (and he also came up with a clever proof).1
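Euler’s parity argument can be checked mechanically. Here is a small sketch; the edge list is our reconstruction of the seven bridges, chosen so that the vertex degrees match those quoted in footnote 1 (A, C, D touch 3 bridges each, B touches 5):

```python
from collections import Counter

# The seven bridges of Königsberg as an undirected multigraph.
# This edge list is our reconstruction, chosen so that the degrees
# match those used in footnote 1: A, C, D have 3 bridges, B has 5.
bridges = [("A", "B"), ("A", "B"), ("B", "C"), ("B", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]

# A vertex's degree is the number of bridges touching it.
degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

# Euler's argument: a stroll crossing every bridge exactly once can exist
# only if at most two vertices have odd degree.
odd = [v for v in degree if degree[v] % 2 == 1]
stroll_possible = len(odd) <= 2
```

Here all four vertices have odd degree, so no such stroll exists.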

Example. Agents can find paths through a game-map using a graph. Beforehand, (1) draw
polygons around the obstacles, (2) subdivide the map into a mesh of convex polygons, (3) de-
fine a graph with one vertex per polygon, and edges between adjacent polygons. Then, we can
find the agent’s shortest path on this graph, using the algorithms we’ll study in this course.
Once we’ve found this optimal sequence of polygons, we can go on to find an aesthetically
pleasing physical path that walks through them.

Example. Facebook’s underlying data structure is a graph. Vertices are used to represent
users, locations, comments, check-ins, etc. From a Facebook engineering blog post:2 [figure omitted]

1 Euler’s proof: Consider any stroll and list the edges it crosses, then count up the number of times each
vertex appears. For example, in the stroll [B ↔ A, A ↔ B, B ↔ D, D ↔ C], A appears twice, B appears three
times, and so on. Clearly every vertex must appear an even number of times except possibly for the start
and end vertices. Now, if there were a stroll that crossed each bridge exactly once, then (by looking at the
graph) there are 3 edges at vertex A so A would have to appear 3 times, B would have to appear 5 times,
and C and D would have to appear 3 times. But we’ve already shown that in any stroll there can be at most
two vertices that appear an odd number of times, hence no such stroll exists.

Example. OpenStreetMap represents its map as XML, with nodes and ways. In some parts
of the city, this data is very fine-grained. The more vertices and edges there are, the more
space it takes to store the data, and the slower the algorithms run. Later in this course we
will discuss geometric algorithms which could be used to simplify the graph while keeping its
basic shape.
<osm version="0.6" generator="Overpass API">
  <node id="687022827" user="François Guerraz"
        lat="52.2082725" lon="0.1379459" />
  <node id="687022823" user="bigalxyz123"
        lat="52.2080972" lon="0.1377715" />
  <node id="687022775" user="bigalxyz123"
        lat="52.2080032" lon="0.1376761" >
    <tag k="direction" v="clockwise"/>
    <tag k="highway" v="mini_roundabout"/>
  </node>
  <way id="3030266" user="urViator">
    <nd ref="687022827"/>
    <nd ref="687022823"/>
    <nd ref="687022775"/>
    <tag k="cycleway" v="lane"/>
    <tag k="highway" v="primary"/>
    <tag k="name" v="East Road"/>
    <tag k="oneway" v="yes"/>
  </way>
  ...
</osm>

Exercise. Why do you think Facebook chose to make CHECKIN a type of vertex, rather
than an edge from a USER to a LOCATION?

In graphs, edges are only allowed to connect vertices to vertices, they’re not allowed to connect
vertices to other edges. If we want to be able to attach a COMMENT to a CHECKIN, then
CHECKIN has to be a vertex.

2 https://engineering.fb.com/2013/06/25/core-data/tao-the-power-of-the-graph/

5.1. Notation and representation


Some notation for describing graphs. A graph is a collection of vertices, with edges between
them. Write a graph as g = (V, E) where V is the set of vertices and E the set of edges.

• A graph may be directed or undirected;


in a directed graph, an edge from v1 to v2 is written v1 → v2
in an undirected graph, an edge between v1 and v2 is written v1 ↔ v2 .

• The neighbours of a vertex v are the vertices you reach by following an edge from v;

in a directed graph, neighbours(v) = {w ∈ V : v → w}

in an undirected graph, neighbours(v) = {w ∈ V : v ↔ w}.

• A path is a sequence of two or more vertices connected by edges;

in a directed graph, v1 → v2 → · · · → vk

in an undirected graph, v1 ↔ v2 ↔ · · · ↔ vk .

(In this course, paths are allowed to visit the same vertex more than once. Some people define paths to disallow repeat visits.)

• A cycle is a path from a vertex back to itself, with no vertices repeated other than
v1 = vk .

There are some special types of graph that we’ll look at in more detail later.

• A directed acyclic graph or DAG is a directed graph without any cycles. They’re used
all over computer science. We’ll study some properties of DAGs in Section 6.7.

• An undirected graph is connected if for every pair of vertices there is a path between
them. A forest is an undirected acyclic graph. A tree is a connected forest. We’ll study
algorithms for finding trees and forests within a graph in Sections 6.5–6.6.

(It sounds perverse to define a tree to be a type of forest! But you need to get used to
reasoning about algorithms directly from definitions, rather than from your hunches and
instinct; and a deliberately perverse definition can help remind you of this.)

[Figure: three example graphs: a DAG, a forest, and a forest that is also a tree.]

REPRESENTATION
Here are two standard ways to store graphs in computer code: as an array of adjacency lists,
or as an adjacency matrix.

array of adjacency lists:

1: [2, 5]
2: [1, 3, 4, 5]
3: [2, 4]
4: [2, 3, 5]
5: [1, 2, 4]

adjacency matrix:

    1 2 3 4 5
1 [ 0 1 0 0 1 ]
2 [ 1 0 1 1 1 ]
3 [ 0 1 0 1 0 ]
4 [ 0 1 1 0 1 ]
5 [ 1 1 0 1 0 ]

The ‘adjacency matrix’ representation is good if we need to be able to quickly look up


for a given pair of vertices whether there is an edge between them. The ‘adjacency list’
representation is good if the graph is sparse, i.e. if there are many vertices and few edges per
vertex.
What exactly do we mean by a ‘good representation’? As usual, we’re interested in
execution time and in storage requirements, and we’ll analyse them using big-O notation so
we don’t get bogged down in details. For a graph g = (V, E), the storage requirements are

adjacency list representation: O(V + E)

adjacency matrix representation: O(V²)

(Note: V and E are sets, so we should really write O(|V| + |E|) etc., but it’s conventional to
drop the | · |.) If the graph is sparse, i.e. if E = o(V²), then the adjacency list doesn’t need
as much storage. As for algorithm execution time, we’ll study that in the following sections.
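To make the comparison concrete, here is one way the two representations might look in Python for the 5-vertex example graph above (names like adj_list are ours, not the notes’):

```python
# The undirected 5-vertex example graph, in both representations.

# Array of adjacency lists: entry v lists the neighbours of vertex v.
adj_list = {
    1: [2, 5],
    2: [1, 3, 4, 5],
    3: [2, 4],
    4: [2, 3, 5],
    5: [1, 2, 4],
}

# Adjacency matrix: adj_matrix[u-1][v-1] == 1 iff there is an edge u-v.
vertices = [1, 2, 3, 4, 5]
adj_matrix = [[1 if v in adj_list[u] else 0 for v in vertices]
              for u in vertices]

def has_edge(u, v):
    # O(1) lookup with the matrix; with the lists it would be O(degree).
    return adj_matrix[u - 1][v - 1] == 1
```

The storage figures are visible here too: the lists hold one entry per edge endpoint, O(V + E) in total, while the matrix always holds V² entries.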

Exercise.
(a) What is the largest possible number of edges in an undirected graph with V ver-
tices?
(b) and in a directed graph?
(c) What is the smallest possible number of edges in a tree?
(d) Suppose our graph has multiple edges between a pair of nodes, and that we want
to store a label for each edge. If we use an adjacency matrix we’ll therefore need
to store a list of edges in each cell, not just a number. What is the storage
requirement?

Answers:
(a) V (V − 1)/2
(b) V (V − 1)
(c) V − 1
(d) O(V² + E)
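Answers (a) and (b) are easy to sanity-check by brute force for a small V (a throwaway sketch, not from the notes):

```python
from itertools import combinations, permutations

V = 6  # any small example size

# (a) In an undirected graph, each unordered pair of distinct vertices
#     can carry at most one edge.
max_undirected = len(list(combinations(range(V), 2)))

# (b) In a directed graph, each ordered pair of distinct vertices can.
max_directed = len(list(permutations(range(V), 2)))
```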



5.2. Depth‐first search


A common task is traversing a graph and doing some work at each vertex. For example, a
web crawler for a search engine visits every page it can find, and at each page the work is
to process the text to add to its search index, then extract all the links in order to find new
pages.
[Figure: two graphs on the vertices A–H; the left one is a tree, the right one has extra edges forming cycles.]

GENERAL IDEA: LIKE TRAVERSING A TREE
It’s easy to explore a tree, using recursion. If we call visit_tree(D, None) on the graph on
the left, which is a tree, then we’ll see it visit D, H, A, C, B, E, F, G (or perhaps some other
order depending on the order of each vertex’s neighbours).

(We defined tree to mean ‘undirected connected acyclic graph’. The tree is drawn as if to
suggest that A is the root, but because the graph is undirected there is actually no
distinguished vertex, and we’re entitled to start the traversal at D.)

# Visit all vertices in the subtree rooted at v
def visit_tree(v, v_parent):
    print("visiting", v, "from", v_parent)
    for w in v.neighbours:
        # Only visit v's children, not its parent
        if w != v_parent:
            visit_tree(w, v)

But if the graph has cycles then this algorithm will get stuck in an infinite recursion. If we
run it on the right hand graph above, we might get D, H, C, A, D, H, C, A, D, ... We need
some way to prevent this.

DFS IMPLEMENTATION 1: USING RECURSION


Depth-first search uses the same idea as traversing a tree. But we’ll explicitly mark which
vertices we’ve already visited, so that we know not to return to them.
1  # Visit all vertices reachable from s
2  def dfs_recurse(g, s):
3      for v in g.vertices:
4          v.visited = False
5      visit(s)
6
7  def visit(v):
8      v.visited = True
9      for w in v.neighbours:
10         if not w.visited:
11             visit(w)
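The same algorithm as a self-contained Python sketch, with the graph as a dictionary of adjacency lists and a visited set in place of the per-vertex flags (our adaptation of the pseudocode above, not the notes’ code):

```python
def dfs_recurse(graph, s):
    """Return the set of vertices reachable from s.
    graph maps each vertex to a list of its neighbours."""
    visited = set()

    def visit(v):
        visited.add(v)
        for w in graph[v]:
            if w not in visited:
                visit(w)

    visit(s)
    return visited

# A graph with a cycle A-B-C-A; the visited set stops the recursion looping.
graph = {
    "A": ["B", "C"], "B": ["C", "A", "E"], "C": ["A", "B", "D"],
    "D": ["C"], "E": ["B"],
    "F": ["G"], "G": ["F"],   # a separate component, unreachable from A
}
reachable = dfs_recurse(graph, "A")
```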

(It’s always a good idea, when you see a new algorithm, to run through it by hand
step-by-step, noting all the decisions it takes. This is especially helpful for understanding
recursive algorithms.)

Here’s the beginning of an execution trace, running on the right hand graph and starting
from D. The vertical lines indicate which function call we’re currently inside.

dfs_recurse(g, D):
  visit(D):
    neighbours = [H, C, A]
    visit(H):
      neighbours = [D]
      don't visit D
    return from visit(H)
    visit(C):
      neighbours = [D, A]
      don't visit D
      visit(A):
        ...

DFS IMPLEMENTATION 2: WITH A STACK


Now a second implementation, in which we’ll keep a record of which vertices are waiting to
be explored, and jump straight to the next one.
[Figure: the visit order of dfs_recurse (left) versus dfs using a Stack (right), on the same graph A–H.]

To see why this is different, consider the execution trace of dfs_recurse starting at B:

visit(B):
  visit(E) from B, return from E to B
  visit(F) from B
    visit(G) from F, return from G to F,
  return from F to B
  visit(A) from B

Instead of this chain of returns, could we jump straight from G to A, the next vertex waiting
to be explored? Obviously we’d need to know that A is waiting to be explored, so we’d need
to have noted this down when we first visited B. We can use a Stack, a Last-In-First-Out
data structure, to store the list of vertices waiting to be explored; this way we’ll visit children
before returning to distant cousins.

(There is a subtle difference between dfs and dfs_recurse as implemented here — they don’t
actually visit vertices in the same order as each other. The example sheet asks you to look
into this, and to modify dfs so that they do.)

1  # Visit all vertices reachable from s
2  def dfs(g, s):
3      for v in g.vertices:
4          v.seen = False
5      toexplore = Stack([s])  # a Stack initially containing a single element
6      s.seen = True
7
8      while not toexplore.is_empty():
9          v = toexplore.popright()  # Now visiting vertex v
10         for w in v.neighbours:
11             if not w.seen:
12                 toexplore.pushright(w)
13                 w.seen = True
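For comparison, here is a runnable sketch of the Stack version, using a plain Python list as the Stack and a dictionary-based graph (our adaptation, not the notes’ code):

```python
def dfs(graph, s):
    """Visit every vertex reachable from s, returning them in visit order."""
    seen = {s}
    toexplore = [s]   # a Python list used as a Stack
    order = []
    while toexplore:
        v = toexplore.pop()        # popright: take the most recently pushed vertex
        order.append(v)            # now visiting vertex v
        for w in graph[v]:
            if w not in seen:
                toexplore.append(w)  # pushright
                seen.add(w)
    return order

# The 5-vertex example graph from Section 5.1.
graph = {1: [2, 5], 2: [1, 3, 4, 5], 3: [2, 4], 4: [2, 3, 5], 5: [1, 2, 4]}
order = dfs(graph, 1)
```

Note the visit order jumps to the most recently discovered vertex each time, rather than returning through a chain of calls.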

[Figure: three snapshots of dfs starting at A, on a graph with vertices A–E. Starting at A: toexplore = [A], A marked as seen. Pick A to explore: remove A from toexplore, add its neighbours and mark them as seen; toexplore = [B, D], with A, B, D marked as seen. Pick D to explore: remove D from toexplore, add its neighbours (excluding A, which was already seen) and mark them as seen; toexplore = [B, C], with A, B, C, D marked as seen.]

ANALYSIS
In dfs, (a) line 4 is run for every vertex which takes O(V ); (b) lines 8–9 are run at most once
per vertex, since the seen flag ensures that each vertex enters toexplore at most once, so the
running time is O(V ); (c) lines 10 and below are run for every edge out of every vertex that
is visited, which takes O(E). Thus the total running time is O(V + E).
The dfs_recurse algorithm also has running time O(V + E). To see this, (a) line 4 is
run once per vertex, (b) line 8 is run at most once per vertex, since the visited flag ensures
that visit(v) is run at most once per vertex; (c) lines 9 and below are run for every edge out
of every vertex visited.
8 5.2 Depth-first search

∗ ∗ ∗

Pay close attention to the clever trick in analysing the running time. We didn’t try
to build up some complicated recursive formula about the running time of each call to visit,
or to reason about when which part of the graph was put on the stack. Instead we used
mathematical reasoning to bound the total number of times that a vertex could possibly be
processed during the entire execution. This is called aggregate analysis (see Section 7.1), and
we’ll see more examples later in the course when we look at the design of advanced data structures.

The recursive implementation uses the language’s call stack, rather than our own data
structure. Recursive algorithms are sometimes easier to reason about, and we’ll use the
recursive implementation as part of a proof in Section 6.7.

5.3. Breadth‐first search


A common task is finding paths in a graph. With a tiny tweak to the depth-first search
algorithm, we can find shortest paths.

GENERAL IDEA

Suppose we want to find the shortest path from A to some other vertex, in a directed graph.
Let’s start by rearranging the graph, to put A at the top, then all the vertices at distance 1
underneath, and all the nodes at distance 2 underneath that, and so on. (By ‘distance d’
we mean ‘can be reached by a path with d edges, and cannot be reached by a path with
< d edges’.) The graph rearranges itself into two parts: there is a tree consisting of all the
vertices, with edges that go down by exactly one ‘level’; and there are extra edges that go
either horizontally or up.

A distance from A = 0
D B
B D distance from A = 1
A
E
C E C distance from A = 2

The idea of breadth first search is to visit A, then all nodes at distance 1 from A, then all
nodes at distance 2, and so on. In other words, we’ll explore the breadth of the tree first,
before going deeper. There’s a very simple way to achieve this. Suppose we’ve already visited
all the vertices at level < d, and we’ve got a list of all the vertices at level d: then we just go
through that list and add all the vertices that are newly reachable; these must be vertices at
level d + 1. Keep going until there’s nothing more to visit.

IMPLEMENTATION

To implement the breadth-first strategy, we don’t even need to manage ‘list of vertices at
distance d’. All we need is a Queue to store all the vertices we’re waiting to explore. Push new
vertices on the right of the queue, pop vertices from the left, and that way we’re guaranteed
to pop all vertices in correct order of distance.
The code turns out to be almost identical to dfs. The only difference is that it uses a
Queue instead of a Stack.
1  # Visit all the vertices in g reachable from start vertex s
2  def bfs(g, s):
3      for v in g.vertices:
4          v.seen = False
5      toexplore = Queue([s])  # a Queue initially containing a single element
6      s.seen = True
7
8      while not toexplore.is_empty():
9          v = toexplore.popleft()  # Now visiting vertex v
10         for w in v.neighbours:
11             if not w.seen:
12                 toexplore.pushright(w)
13                 w.seen = True
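A runnable sketch, with collections.deque playing the role of the Queue and a distance dictionary doubling as the ‘seen’ marker, so each vertex’s hop count is recorded as a bonus (our adaptation, not the notes’ code):

```python
from collections import deque

def bfs_distances(graph, s):
    """Return a dict mapping each vertex reachable from s to its hop count."""
    distance = {s: 0}            # a vertex is 'seen' iff it has a distance
    toexplore = deque([s])       # a Queue initially containing s
    while toexplore:
        v = toexplore.popleft()  # now visiting vertex v
        for w in graph[v]:
            if w not in distance:
                toexplore.append(w)            # pushright
                distance[w] = distance[v] + 1  # one level deeper than v
    return distance

# One reading of the levelled example: A at distance 0, B and D at 1, C and E at 2.
graph = {"A": ["B", "D"], "B": ["D", "E", "C"], "D": ["B"], "E": ["C"], "C": []}
dist = bfs_distances(graph, "A")
```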

[Figure: three snapshots of bfs starting at A. Starting at A: toexplore = [A], A marked as seen. Pick A to explore: remove A from toexplore, add its neighbours and mark them as seen; toexplore = [B, D], with A, B, D marked as seen. Pick B to explore: remove B from toexplore, add its neighbours (excluding D, which was already seen) and mark them as seen; toexplore = [D, E, C], with A, B, C, D, E marked as seen.]

We can adapt this code to find a path between a pair of nodes: we just need to keep track
of how we discovered each vertex. For every vertex at distance d, we’ll store a come_from
arrow, pointing to a vertex at distance d − 1. Here’s a picture, then the code.

[Figure: the same snapshots, now with a come_from arrow drawn from each newly discovered vertex back to the vertex it was discovered from.]

# Find a path from s to t, if one exists
def bfs_path(g, s, t):
    for v in g.vertices:
        v.seen = False
        v.come_from = None
    s.seen = True
    toexplore = Queue([s])

    # Traverse the graph, visiting everything reachable from s
    while not toexplore.is_empty():
        v = toexplore.popleft()
        for w in v.neighbours:
            if not w.seen:
                toexplore.pushright(w)
                w.seen = True
                w.come_from = v

    # Reconstruct the full path from s to t, working backwards
    if t.come_from is None:
        return None  # there is no path from s to t
    else:
        path = [t]
        while path[0].come_from != s:
            path.prepend(path[0].come_from)
        path.prepend(s)
        return path
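A self-contained version, with come_from kept in a dictionary and the notes’ prepend rendered as insert(0, ...) (our adaptation, not the notes’ code):

```python
from collections import deque

def bfs_path(graph, s, t):
    """Return a fewest-hops path from s to t as a list, or None if none exists."""
    come_from = {s: None}
    toexplore = deque([s])

    # Traverse the graph, visiting everything reachable from s
    while toexplore:
        v = toexplore.popleft()
        for w in graph[v]:
            if w not in come_from:
                come_from[w] = v
                toexplore.append(w)

    # Reconstruct the full path from s to t, working backwards
    if t not in come_from:
        return None   # there is no path from s to t
    path = [t]
    while path[0] != s:
        path.insert(0, come_from[path[0]])
    return path

graph = {"A": ["B", "D"], "B": ["D", "E", "C"], "D": ["B"], "E": ["C"], "C": []}
```

Testing membership in come_from, rather than checking for None, also makes the trivial path from s to itself come out right.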

ANALYSIS
The bfs algorithm has running time O(V + E), based on exactly the same analysis as for dfs
in section 5.2.

∗ ∗ ∗

Here is another way to think of the bfs algorithm: keep track of the ‘disc’ of vertices
that are distance ≤ d from the start, then grow the disc by adding the ‘frontier’ of vertices
at distance d + 1, and so on. What’s magic is that the bfs algorithm does this implicitly, via
the Queue, without needing an explicit variable to store d.

In this illustration,3 we’re running bfs starting from the blob in the middle. The graph has
one vertex for each light grey grid cell, and edges between adjacent cells, and the black cells in
the left hand panel are impassable. The next three panels show some stages in the expanding
frontier.

3 These pictures are taken from the excellent Red Blob Games blog, http://www.redblobgames.com/pathfinding/a-star/introduction.html

5.4. Dijkstra’s algorithm


In many applications it’s natural to use graphs where each edge is labelled with a cost, and
to look for paths with minimum cost. For example, suppose the graph’s edges represent road
segments, and each edge is labelled with its travel time: how do we find the quickest route
between two locations?
Let’s express the problem more precisely. Let cost(u → v) be the cost associated with
edge u → v, let the cost of a path be the sum of its edge costs, and define distance(u to v)
to be the minimum cost of all paths from u to v. If there is no path from u to v, let
distance(u to v) = ∞. Any path which achieves the minimum cost is called a shortest path;
as with breadth-first search, once we can work out distances, it’s easy to find shortest paths.
Here’s an illustration.4 These pictures show two possible paths between the blob and
the cross. The left hand picture shows the number of hops from the blob; the right picture
shows the distance from the blob. Here, the darkest cells can’t be crossed, light cells cost 1
to cross, and darker cells cost 5.

[Figure: two grid maps, the left labelled ‘number of hops’, the right labelled ‘distance’.]

PROBLEM STATEMENT

Given a directed graph where each edge is labelled with a cost ≥ 0, and a start vertex s,
compute the distance from s to every other vertex.

(Why require costs ≥ 0? Try to work out what goes wrong in the algorithm below if there
are negative costs. You can read the answer in Sections 5.5 and 5.6.)

GENERAL IDEA
In breadth-first search, we visited vertices in order of how many hops they are from the start
vertex. Now, let’s visit vertices in order of distance from the start vertex. We’ll keep track
of a frontier of vertices that we’re waiting to explore (i.e. the vertices whose neighbours
we haven’t yet examined). We’ll keep the frontier vertices ordered by distance, and at each
iteration we’ll pick the next closest.
We might end up coming across a vertex multiple times, with different costs. If we’ve
never come across it, just add it to the frontier. If we’ve come across it previously and our
new path is shorter than the old path, then update its distance.

4 Pictures taken from the Red Blob Games blog, http://www.redblobgames.com/pathfinding/a-star/introduction.html

[Figure: the vertices we know about, the frontier vertices, and ‘the great unknown’. We visit v, the nearest frontier vertex; then we update the distances of v’s neighbours and extend the frontier.]

IMPLEMENTATION
This algorithm was invented in 1959 and is due to Dijkstra5 (1930–2002), a pioneer of
computer science.

Line 5 declares that toexplore is a PriorityQueue in which the key of an item v is
v.distance. Line 11 iterates through all the vertices w that are neighbours of v, and retrieves
the cost of the edge v → w at the same time.

(See Section 4.8 for a definition of PriorityQueue. It supports inserting items, decreasing
the key of an item, and extracting the item with smallest key.)

1  def dijkstra(g, s):
2      for v in g.vertices:
3          v.distance = ∞
4      s.distance = 0
5      toexplore = PriorityQueue([s], sortkey = lambda v: v.distance)
6
7      while not toexplore.isempty():
8          v = toexplore.popmin()
9          # Assert: v.distance is the true shortest distance from s to v
10         # Assert: v is never put back into toexplore
11         for (w, edgecost) in v.neighbours:
12             dist_w = v.distance + edgecost
13             if dist_w < w.distance:
14                 w.distance = dist_w
15                 if w in toexplore:
16                     toexplore.decreasekey(w)
17                 else:
18                     toexplore.push(w)

Although we’ve called the variable v.distance, we really mean “shortest distance from s
to v that we’ve found so far”. It starts at ∞ and it decreases as we find new and shorter paths
to v. Given the assertion on line 10, we could have coded this algorithm slightly differently:
we could put all nodes into the priority queue in line 5, and delete lines 15, 17, and 18. It
takes some work to prove the assertion...

5 Dijkstra was famous for his way with words. Some of his sayings: “The question of whether Machines
Can Think [. . . ] is about as relevant as the question of whether Submarines Can Swim.”
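Python’s standard heapq has no decreasekey, but one standard workaround (ours, not the notes’ PriorityQueue) is to push duplicate entries and skip any vertex that has already been popped. A runnable sketch under that adaptation:

```python
import heapq
from math import inf

def dijkstra(graph, s):
    """graph[v] is a list of (neighbour, edgecost) pairs, all costs >= 0.
    Return the distance from s to every vertex."""
    distance = {v: inf for v in graph}
    distance[s] = 0
    toexplore = [(0, s)]   # heap of (distance, vertex) pairs
    popped = set()
    while toexplore:
        d, v = heapq.heappop(toexplore)
        if v in popped:
            continue       # a stale duplicate entry: v was already finalised
        popped.add(v)      # at this point d is the true distance from s to v
        for w, edgecost in graph[v]:
            dist_w = d + edgecost
            if dist_w < distance[w]:
                distance[w] = dist_w
                heapq.heappush(toexplore, (dist_w, w))  # in place of decreasekey
    return distance

# A small worked example: the shortest route to b is via a (cost 3), not direct (cost 4).
graph = {
    "s": [("a", 1), ("b", 4)],
    "a": [("b", 2), ("c", 6)],
    "b": [("c", 3)],
    "c": [],
}
dist = dijkstra(graph, "s")
```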

ANALYSIS
Running time. Line 8 is run at most once per vertex (by the assertion on line 10), and
lines 12–18 are run at most once per edge. So the total running time is

O(V) + O(V) × cost_popmin + O(E) × cost_push/decreasekey

where the individual operation costs depend on how the PriorityQueue is implemented. Later
in the course, we’ll describe an implementation called the Fibonacci heap which for n items
has O(1) running time for both push() and decreasekey() and O(log n) running time for
popmin(). Since the number of items stored in the heap at any time is ≤ V, by the assertion
on line 10, the total running time is O(E + V log V).

Theorem (Correctness). The dijkstra algorithm terminates. When it does, for every vertex
v, the value v.distance it has computed is equal to distance(s to v). Furthermore, the two
assertions never fail.

(In this theorem, v.distance is a variable that is updated during program execution, and
distance(s to v) is the Platonic mathematical object. Pay close attention to whether you’re
dealing with abstract mathematical statements (which can be stated and proved even without
an algorithm), or if you’re reasoning about program execution.)

Proof (that Assertion 9 never fails). Suppose this assertion fails at some point in execution.
Let v be the vertex for which it first fails, and let T be the time of this failure. Consider a
shortest path from s to v. (This means the Platonic mathematical object, not a computed
variable.) Write this path as

s = u1 → · · · → uk = v

There are two cases to consider: CASE 1, in which one of these vertices hasn’t yet been popped
from toexplore by time T, and CASE 2, in which they have all been popped.
Consider CASE 1 first, and let i be the index of the first vertex in the sequence that, at
time T, hasn’t been popped. So the path is

s = u1 → u2 → · · · → ui−1 → ui → · · · → uk = v

where u1, . . . , ui−1 have already been popped, ui has not yet been popped, and the status of
the remaining vertices is ‘???’.

(We’ve just popped v = uk , so we know i < k. The vertices between i and k, if there are any,
might or might not have been popped by time T .) Now, reasoning about the stored .distance
variables as they stand at time T ,

distance(s to v)
< v.distance since the assertion failed at v
≤ ui .distance (*)
≤ ui−1 .distance + cost(ui−1 → ui ) by lines 13–18 when ui−1 was popped
= distance(s to ui−1 ) + cost(ui−1 → ui ) assertion didn’t fail at ui−1
≤ distance(s to v) since s → · · · ui−1 → ui is on a shortest path s to v.

The tricky step is (*). This line is because we just popped v from the PriorityQueue, and we
know ui was in there also because it would have been forced in when we popped ui−1 , and
the PriorityQueue gave us v rather than ui , hence v.distance ≤ ui .distance. Thus, we obtain
a contradiction.
In CASE2, we also obtain a contradiction, and it’s easier to prove:

distance(s to v)
< v.distance since the assertion failed at v
≤ distance(s to uk−1 ) + cost(uk−1 → v) by lines 13–18 when uk−1 was popped
= distance(s to v) since s = u1 → · · · → uk = v is a shortest path.

This again is a contradiction.



We have proved that, if Assertion 9 fails at some point in execution, there is a contra-
diction. Thus, it can never fail.
Proof that Assertion 10 never fails. Once a vertex v has been popped, Assertion 9 guarantees
that v.distance = distance(s to v). The only way that v could be pushed back into toexplore
is if we found a shorter path to v (on line 13) which is impossible.
Rest of proof. Since vertices can never be re-pushed into toexplore, the algorithm must
terminate. At termination, all the vertices that are reachable from s must have been visited,
and popped, and when they were popped they passed Assertion 9. They can’t have had
v.distance changed subsequently (since it can only ever decrease, and it’s impossible for it to
be less than the true minimum distance, since the algorithm only ever looks at legitimate
paths from s.) □

THE ‘BREAKPOINT’ PROOF STRATEGY


It’s worth stepping back and looking at the proof strategy we used here, since it’s one we’ll
use again and again. I call it the ‘breakpoint’ strategy.
We decided on a property which we want to be true at all points during execution. We
designed this property so that (1) if it’s true when the algorithm terminates, then the output
of the algorithm is correct, and (2) if it’s true up to an arbitrary timepoint T − 1, then it
must be true at time T . The proof of (2) is rather like setting a breakpoint in a debugger
and arguing that, based on the guarantees so far about the algorithm’s data structures, what
happens in the next few lines of code can’t possibly go wrong.
Formally, a breakpoint proof is proof by induction. Any proof by induction requires us
to specify an ordering, and here the ordering is by execution time. The induction hypothesis
is “our property holds at all instants up to T − 1”, and the induction step is to prove that it
must therefore hold at time T .
For some other graph algorithms, it’s helpful to use induction applied to a different
order, e.g. by distance rather than execution time. Whenever you use proof by induction,
make sure you say explicitly what your ordering is.

5.5. Algorithms and proofs


Here’s what Dijkstra had to say about programming and proofs:

Right from the beginning, and all through the course, we stress that the program-
mer’s task is not just to write down a program, but that his main task is to give
a formal proof that the program he proposes meets the equally formal functional
specification.6

If you want more effective programmers, you will discover that they should not
waste their time debugging, they should not introduce the bugs to start with.7

Programming is one of the most difficult branches of applied mathematics; the
poorer mathematicians had better remain pure mathematicians.8

This course is 50% about the ideas behind the algorithms, 50% about the proofs. Don’t
think of proofs as hoops that a cruel lecturer forces you to jump through! If you can’t figure
out a correct proof that your algorithm works, chances are there’s a bug in your algorithm,
for example a special case that you haven’t coded for. Conversely, every nifty trick that you
invented to make your algorithm work is likely to have a counterpart in a proof of correctness,
otherwise the trick would be superfluous. Algorithm proofs are just a tool for making sure
our code works correctly and for clarifying in our heads the big ideas behind an algorithm.

Here is an exam question about Dijkstra’s algorithm, and a selection of mangled proofs.
The first thing an examiner checks for is whether the proof passes the ‘smell test’. Does this
answer have all the key ingredients from Dijkstra’s proof—the ‘breakpoint’ proof strategy,
the reliance on edge weights being non-negative? If it fails the smell test, the examiner will
look for a faulty inference, in other words construct a counterexample to demonstrate the
fault.

Let dijkstra_path(g, s, t) be an implementation of Dijkstra’s shortest path algorithm that re-


turns the shortest path from vertex s to vertex t in a graph g. Prove that the implementation
can safely terminate when it first encounters vertex t.

Bad Answer 1. At the moment when the vertex t is popped from the priority queue, it has
to be the vertex in the priority queue with the least distance from s. This means that any
other vertex in the priority queue has distance ≥ that for t. Since all edge weights in the
graph are ≥ 0, any path from s to t via anything still in the priority queue will have distance
≥ that of the distance from s to t when t is popped, thus the distance to t is correct when t
is popped.

This fails the smell test in two ways. First, there is no hint of induction—the proof only
discusses what happens when the target vertex t is popped. Second, it doesn’t distinguish
between the two ‘distances’: (a) v.distance, the quantity computed and updated in the course
of the algorithm, and (b) distance(s to v), the true mathematical distance. The point of
Dijkstra’s algorithm is that the former eventually becomes equal to the latter.
Where exactly is the faulty inference? The first sentence is about how priority queues
work, so it must be referring to distances as computed by the algorithm, not to true mathe-
matical distances; but the last sentence seems to be referring to true mathematical distances.
So let’s set up a counterexample where the two are at odds. In this diagram, u and t are in
the priority queue, and the nodes have been labelled by their computed distances at the instant
when t is popped.

6 EWD 1036: On the cruelty of really teaching computing science, https://www.cs.utexas.edu/~EWD/ewd10xx/EWD1036.PDF
7 EWD 340: The humble programmer, https://www.cs.utexas.edu/users/EWD/ewd03xx/EWD340.PDF.
Quite the opposite of test-driven development! Contrast to Donald Knuth, another pioneer of computer
science, who once wrote “Beware of bugs in the above code; I have only proved it correct, not tried it.”
8 EWD 498: How do we tell truths that might hurt? https://www.cs.utexas.edu/users/EWD/ewd04xx/EWD498.PDF

[Figure: edges s →2→ v →3→ t, with v →1→ u and u →1→ t; the labels show the computed
distances s.distance = 0, v.distance = 2, t.distance = 5, u.distance = ∞]

What does the Bad Answer say about this scenario? We’re about to pop t, since t.distance <
u.distance. And yet the true mathematical distance of the path s → v → u → t is shorter
than the computed t.distance, so the final sentence of the Bad Answer is incorrect.
One might say “Oh, but the proof in Section 5.4 shows that this can’t happen.” True ...
but an answer to the question should PROVE that this scenario can’t happen (or make explicit
reference to lecture notes!), and because this answer doesn’t it’s a Bad Answer.

Bad Answer 2. Dijkstra’s algorithm performs a breadth-first search on the graph, storing
a frontier of all unexplored vertices that are neighbours to explored vertices.
Each time it chooses a new vertex v to explore, from the frontier of unexplored vertices,
it chooses the one that will have the shortest distance from the start s, based on the edge
weight plus the distance from s of its already explored neighbour.
Given that no other vertex in the frontier is closer to s, and that this new vertex v has
yet to be explored, when v is explored it must have been via the shortest path from s to v.
Hence, when t is first encountered, it must have been found via its shortest path and
the program can safely terminate.

This answer is slightly better than the previous one because it nods in the direction of
induction—it makes an argument about what happens “each time it chooses a new vertex”,
not just about what happens when it chooses t. But the ordering behind the induction isn’t
made clear. And it still fails the smell test because it makes no distinction between ‘distance
as computed’ and ‘true mathematical distance’.

[Figure: a graph whose nodes are labelled with their current computed distances
(0, 4, 2, 3, 5, ∞) and whose edges are labelled with their costs]
Let’s look for a counterexample. In this graph, each node has been labelled with its
current distance, and each edge with its cost. Node 0 is the start vertex, node ∞ is the
destination vertex, nodes 0, 4, and 2 have already been explored, and nodes 3 and 5 are in
the frontier. The algorithm will pick node 3 to explore, since this has the smallest distance
variable, but this distance is incorrect and the corresponding path (via node 2) is wrong.
The issue is that the algorithm should never get into the state shown here! But the
Bad Answer doesn’t argue that this state is impossible: instead it just assumes that when v
is popped all the already-explored vertices are correct. The proof needs to say ‘by induction
on vertices in the order they are explored’. Only then is it legitimate to assume that all the
already-explored vertices are correct.

Exercise. Bad Answer 2 doesn’t state anywhere that edge weights need to be ≥ 0. Explain
which line of the proof fails when the algorithm is run on this graph:
[Figure: a graph from s to t whose edges are labelled with weights 1, 3, −4, 3 and 2]

5.6. Bellman‐Ford
Now for a new wrinkle: graphs whose edges can have both positive and negative costs.
We’ll use the term weight rather than cost of an edge. (The words ‘cost’ and ‘distance’
suggest positive numbers, and so they give us bad intuition for graphs with negative costs.)
The weight of a path is the sum of its edge weights, and our goal is to find minimum weight
paths from some start vertex.

Example (Planning problems). Consider an adventurer who has just entered a room in a dun-
geon, and wants to quickly get to the other exit, but has the option of detouring to pick up
treasure. We could frame this as seeking to minimize
    W = T − r   if we pick up the treasure
    W = T       otherwise

where T is the time to reach the exit and r is the value of the treasure. What is the optimal
path, and does it involve picking up the treasure?
We can turn this into a directed graph problem. Let there be a vertex for every state
the game can be in: this comprises both the location of the adventurer, as well as a flag
saying whether the treasure is still available. And let there be edges for all the possible game
moves. Give each edge weight 1, except for the moves which involve picking up the treasure,
which have weight 1 − r. In graph language, we’re told the start vertex, and we have the
choice of two possible destination vertices (exit with treasure, and exit without treasure),
and we want to find the weights of minimum weight paths to those two destinations. ♢
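The translation can be sketched in code. This is a hypothetical version on a one-dimensional dungeon of numbered cells: the function name, the cell layout, and the rule that stepping onto the treasure cell picks the treasure up are all illustrative assumptions, not part of the notes.

```python
import itertools

def build_state_graph(n_cells, treasure_cell, r):
    """Vertices are (cell, treasure_available) pairs; edges are game moves.
    Every move has weight 1, except the move that picks up the treasure,
    which has weight 1 - r."""
    edges = {}   # maps (state, state) -> edge weight
    for cell, avail in itertools.product(range(n_cells), (True, False)):
        for nxt in (cell - 1, cell + 1):
            if 0 <= nxt < n_cells:
                if avail and nxt == treasure_cell:
                    # stepping onto the treasure cell picks it up
                    edges[(cell, avail), (nxt, False)] = 1 - r
                else:
                    edges[(cell, avail), (nxt, avail)] = 1
    return edges
```

Any shortest-path algorithm that tolerates negative weights can then be run on `edges` from the start state to the two exit states.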

Example (Negative cycles). What’s the minimum weight from a to b in the graph below? By
going around b → c → d → b again and again, the weight of the path goes down and down.
This is referred to as a negative weight cycle, and we’d say that the minimum weight from a
to b is −∞.

[Figure: a → b has weight 1, and the cycle b → c → d → b has edge weights 2, 3 and −6]

a → b: weight 1
a → b → c → d → b: weight 0
a → b → c → d → b → c → d → b: weight −1
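To check the arithmetic: reading the edge weights off the figure, each trip around the cycle b → c → d → b changes the weight by 2 + 3 − 6 = −1, so after k extra trips the path weight is 1 − k, which decreases without bound.

```python
def path_weight(k):
    """Weight of the path a -> b followed by k trips around the
    cycle b -> c -> d -> b (cycle weight 2 + 3 - 6 = -1)."""
    return 1 + k * (2 + 3 - 6)

weights = [path_weight(k) for k in range(4)]   # 1, 0, -1, -2
```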

Exercise. Run Dijkstra’s algorithm by hand on each of these graphs, starting from the
shaded vertex. The labels indicate edge weights. What happens?

[Figure: two graphs from a shaded start vertex; the left has edges weighted
1, 3, −4, 3 and 2, the right has edges weighted 1, 3, −4, 2 and 2]

GENERAL IDEA
Dijkstra’s algorithm can fail on graphs with negative edge weights. [Before continuing, do the
exercise!] But the update step at the heart of Dijkstra’s algorithm, lines 13–14 on page 13,
is still sound. Let’s restate it. If we’ve found a path from s to u, call it s ⇝ u, and if there
is an edge u → v, then s ⇝ u → v is a path from s to v. If we store the minimum weight s
path we’ve found so far in the variable minweight, then the obvious update is

    if v.minweight > u.minweight + weight(u → v):
        let v.minweight = u.minweight + weight(u → v)

This update rule is known as relaxing the edge u → v.
The idea of the Bellman-Ford algorithm is to just keep relaxing all the edges in the
graph, over and over again, updating the minweight every time we find a better path. The
magic is that we only need to apply it V times.

    Dijkstra                               Bellman-Ford

    can get stuck in an ∞ loop             always terminates
    if some weights < 0

    O(E + V log V)                         O(V E)
    if all weights ≥ 0

    visits vertices in a clever order,     visits vertices in any order,
    relaxes each edge once                 relaxes each edge multiple times

PROBLEM STATEMENT
Given a directed graph where each edge is labelled with a weight, and a start vertex s, (i) if
the graph contains no negative-weight cycles reachable from s then for every vertex v compute
the minimum weight from s to v; (ii) otherwise report that there is a negative weight cycle
reachable from s.

IMPLEMENTATION
In this code, lines 8 and 12 iterate over all edges in the graph, and c is the weight of the edge
u → v. The assertion in line 10 refers to the true minimum weight among all paths from s to
v, which the algorithm doesn’t know yet; the assertion is just there to help us reason about
how the algorithm works, not something we can actually test during execution.

1   def bf(g, s):
2       for v in g.vertices:
3           v.minweight = ∞    # best estimate so far of minweight from s to v
4       s.minweight = 0
5
6       repeat len(g.vertices) − 1 times:
7           # relax all the edges
8           for (u, v, c) in g.edges:
9               v.minweight = min(u.minweight + c, v.minweight)
10              # Assert v.minweight >= true minimum weight from s to v
11
12      for (u, v, c) in g.edges:
13          if u.minweight + c < v.minweight:
14              throw "Negative-weight cycle detected"

Lines 12–14 say, in effect, “If the answer we get after V − 1 rounds of relaxation is different
to the answer after V rounds, then there is a negative-weight cycle; and vice versa.”

ANALYSIS
The algorithm iterates over all the edges, and it repeats this V times, so the overall running
time is O(V E).
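The pseudocode translates almost line for line into runnable Python. This is a sketch under two assumptions not in the listing: the graph is given as a vertex list plus an edge list of (u, v, weight) tuples rather than the object-style g.vertices and g.edges, and the throw becomes a raised exception.

```python
import math

def bf(vertices, edges, s):
    """Bellman-Ford. `edges` is a list of (u, v, weight) tuples.
    Returns a dict of minimum weights from s, or raises ValueError
    if a negative-weight cycle is reachable from s."""
    minweight = {v: math.inf for v in vertices}
    minweight[s] = 0
    for _ in range(len(vertices) - 1):
        # relax all the edges
        for (u, v, c) in edges:
            minweight[v] = min(minweight[u] + c, minweight[v])
    for (u, v, c) in edges:
        if minweight[u] + c < minweight[v]:
            raise ValueError("Negative-weight cycle detected")
    return minweight
```

Note that `math.inf + c` is still `math.inf` even for negative `c`, so unreached vertices never propagate spurious weights.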

Theorem. The algorithm correctly solves the problem statement. In case (i) it terminates
successfully, and in case (ii) it throws an exception in line 14. Furthermore the assertion on
line 10 is true.

Proof (of assertion on line 10). Write w(v) for the true minimum weight among all paths
from s to v, with the convention that w(v) = −∞ if there is a path that includes a negative-
weight cycle. The algorithm only ever updates v.minweight when it has a valid path to v,
therefore the assertion is true.

Proof for case (i). Pick any vertex v, and consider a minimum-weight path from s to v. Let
the path be
s = u0 → u1 → · · · → uk = v.
Consider what happens in successive iterations of the main loop, lines 8–10.
• Initially, u0 .minweight is correct, i.e. equal to w(s) which is 0.
• After one iteration, u1 .minweight is correct. Why? If there were a lower-weight path
to u1 , then the path we’ve got here couldn’t be a minimum-weight path to v.
• After two iterations, u2 .minweight is correct.
• and so on...
We can assume (without loss of generality) that this path has no cycles—if it did, the cycle
would have weight ≥ 0 by assumption, so we could cut it out. So it has at most |V | − 1 edges,
so after |V | − 1 iterations v.minweight is correct.
Thus, by the time we reach line 12, all vertices have the correct minweight, hence the
test on line 13 never goes on to line 14, i.e. the algorithm terminates without an exception.

Proof of (ii). Suppose there is a negative-weight cycle reachable from s,

s → · · · → v0 → v 1 → · · · → vk → v0

where
weight(v0 → v1 ) + · · · + weight(vk → v0 ) < 0.
If the algorithm terminates without throwing an exception, then all these edges pass the test
in line 13, i.e.

v0 .minweight + weight(v0 → v1 ) ≥ v1 .minweight


v1 .minweight + weight(v1 → v2 ) ≥ v2 .minweight
..
.
vk .minweight + weight(vk → v0 ) ≥ v0 .minweight

Putting all these equations together,

v0 .minweight + weight(v0 → v1 ) + · · · + weight(vk → v0 ) ≥ v0 .minweight

hence the cycle has weight ≥ 0. This contradicts the premise—so at least one of the edges
must fail the test in line 13, and so the exception will be thrown. □

5.7. Dynamic programming


Dynamic programming means figuring out how to express a problem in terms of easier sub-
problems. Richard Bellman (1920–1984) invented dynamic programming in order to solve
planning problems, which, as we saw in section 5.6, can be thought of as finding minimum
weight paths on a graph. Our graph has a vertex for every state that the system can be
in, and an edge for every transition. Edges are labelled with rewards. (‘Reward’ just means
‘negative cost’. In graph problems it’s common to think in terms of costs and distances, and
in planning problems it’s more common to think in terms of rewards.)
[Figure: 30 frames of Seaquest gameplay and the value function computed over them.
x-axis: frame number (0 to 30); y-axis: value (roughly 10.0 to 10.8); three
instants A, B and C are marked]

The heart of dynamic programming is a recurrence equation for the value function. The value
function evaluates, for any state v, the expected future reward that can be gained starting
from v. The picture above illustrates 30 frames of gameplay of the Atari game Seaquest, and
the value function that was computed by DeepMind.9 The player is the yellow submarine
on the right. At time A a new enemy appears on the left, and the value function increases
because of the potential reward. At time B the player’s torpedo has nearly reached the enemy,
and the value function is very high anticipating the reward. (The value function shown here
is based on a reward of +1 whenever the player gains points.) At time C the enemy has been
hit, and the reward has been won, so the value function reverts to its baseline.

GENERAL IDEA
Define the value function Fdst,t (v) to be the minimum weight for reaching vertex dst within
t timesteps starting from v, assuming that it takes one timestep to follow an edge.

[Figure: a →1→ b, a →4→ c, b →2→ c, c →3→ d, d →(−4)→ a]

To illustrate, here’s the value function for reaching state d, for the simple graph shown above:


    Fd,1(v) = 3   if v = c, with the one-hop path c → d
    Fd,1(v) = 0   if v = d, since we’re already there
    Fd,1(v) = ∞   otherwise, since we can’t reach d within 1 timestep

    Fd,2(v) = 5   if v = b, by following b → c → d
    Fd,2(v) = 7   if v = a, by following a → c → d
    Fd,2(v) = 3   if v = c, with the single-hop path c → d
    Fd,2(v) = 0   if v = d, since we’re already there

and so on.

9 Playing Atari with Deep Reinforcement Learning, Mnih, Kavukcuoglu, Silver, Graves, Antonoglou, Wier-

stra, and Riedmiller, 2013, https://arxiv.org/pdf/1312.5602v1.pdf. This is the paper that kickstarted
the deep reinforcement learning revolution. See https://blog.evjang.com/2018/08/dijkstras.html for an
excellent blog post by Eric Jang, that discusses the link between reinforcement learning and shortest paths.

For a general graph, we can write down a recurrence equation (called the Bellman
equation) for the value function:

    Fd,t(v) = min{ Fd,t−1(v), min over w: v→w of ( weight(v → w) + Fd,t−1(w) ) }

where the minimum over an empty set is taken to be ∞.

In words, “If we have a path v ⇝ d that takes ≤ t − 1 steps, that’s obviously a valid path
of ≤ t steps; alternatively, we could take a first step v → w and then take the optimal path
w ⇝ d in ≤ t − 1 steps.” The terminal condition is
    Fd,0(v) = 0 if v = d,   ∞ otherwise.
We can simply iterate this equation, in the usual dynamic programming way: first write
down Fd,0 (v) for all v, then compute Fd,1 (v) for all v, and so on. The value function itself
tells us the weight of a minimum weight path, and we can recover the path by reading off
which option gives the minimum at each timestep.
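The iteration can be sketched directly in Python. This is a sketch, assuming a graph representation not specified in the notes: an adjacency map v → [(w, weight), …].

```python
import math

def value_function(adj, d, t):
    """Compute F_{d,t}(v), the minimum weight to reach d from v within
    t timesteps, by iterating the Bellman equation starting from F_{d,0}.
    `adj` maps each vertex v to a list of (w, weight) pairs for edges v -> w."""
    vertices = set(adj) | {w for outs in adj.values() for (w, _) in outs}
    # terminal condition F_{d,0}
    F = {v: (0 if v == d else math.inf) for v in vertices}
    for _ in range(t):
        # one application of the Bellman equation, for every vertex at once
        F = {v: min(F[v],
                    min((wt + F[w] for (w, wt) in adj.get(v, [])),
                        default=math.inf))
             for v in vertices}
    return F
```

Each pass builds a fresh table from the previous one, so F after k passes is exactly Fd,k.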

PROBLEM STATEMENT
Given a directed graph where each edge is labelled with a weight, and assuming it contains
no negative-weight cycles, then for every pair of vertices compute the weight of the minimum-
weight path between those vertices.

Why the condition about negative-weight cycles, and why is there no mention of time
horizon? The time horizon in the Bellman equation was crucial—it’s what lets us break
the problem down into easier subproblems. There’s a simple reason why we don’t need it here.
If the graph has no negative-weight cycles, then any minimum weight path must have ≤ V
vertices (if it had more, it must contain a cycle, which by assumption has weight ≥ 0, so we
might as well excise it). Thus for any pair of vertices there is a minimum weight path between
them with ≤ V − 1 edges. So what we want to compute is Fd,V−1(v).
The problem statement says for every pair of vertices. There’s a nifty way to compute
the value function using matrices, and all-to-all minimum weights just drop out with no extra
work. This implementation has running time O(V 3 log V ).

MATRIX IMPLEMENTATION (NON-EXAMINABLE)
Let M(t)ij be the minimum weight of going from i to j in ≤ t steps. The Bellman equation
says

    M(t)ij = min{ M(t−1)ij , min over k: i→k of ( weight(i → k) + M(t−1)kj ) }.
Let’s define a matrix W to store the weights,

    Wij = weight(i → j)   if there is an edge i → j
    Wij = 0               if i = j
    Wij = ∞               otherwise.
The nifty thing about this matrix is that it lets us simplify the Bellman equation to

    M(t)ij = min over k of ( Wik + M(t−1)kj ),    M(1)ij = Wij .

The first clause in the Bellman equation is taken care of because we defined Wii = 0; and
the restriction to {k : i → k} is taken care of because we defined Wij = ∞ if there is no edge.
We could start the iteration at M (0) , but it’s easy to see that a single iteration of Bellman’s
equation gives M (1) = W , and we have W already, so we might as well use it.
The matrix-Bellman equation can be rewritten as (where the notation x ∧ y means min(x, y))

    M(t)ij = (Wi1 + M(t−1)1j) ∧ (Wi2 + M(t−1)2j) ∧ · · · ∧ (Win + M(t−1)nj).

This is just like regular matrix multiplication

    [A B]ij = Ai1 B1j + Ai2 B2j + · · · + Ain Bnj
except it uses + instead of multiplication and ∧ instead of addition. Let’s write it M (t) =
W ⊗ M (t−1) . This nifty notation lets us write out the complete algorithm very concisely:

1   Let M(1) = W
2   Compute M(|V|−1), using M(t) = W ⊗ M(t−1)
3   Return M(|V|−1)
As noted above, it’s sufficient to compute up to time horizon |V | − 1, since we assumed
the graph has no negative-weight cycles.

Running time. As with regular matrix multiplication, it takes V 3 operations to compute ⊗,


so the total running time is O(V 4 ). There is a cunning trick to reduce the running time.
Let’s illustrate with V = 10. Rather than applying ⊗ 8 times to compute M (9) , we can
repeatedly square:

    M(1)  = W
    M(2)  = M(1) ⊗ M(1)
    M(4)  = M(2) ⊗ M(2)
    M(8)  = M(4) ⊗ M(4)
    M(16) = M(8) ⊗ M(8)
          = M(9), as there are no negative-weight cycles.

This trick gives overall running time O(V 3 log V ).
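Here is a sketch of the matrix method, with ⊗ as an explicit function and the repeated-squaring trick; matrices are plain lists of lists, with math.inf standing in for ∞.

```python
import math

def otimes(A, B):
    """The (min, +) matrix product: (A otimes B)_ij = min_k (A_ik + B_kj)."""
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def all_pairs_minweight(W):
    """Repeatedly square M, starting from M = W, until the implicit time
    horizon reaches >= |V| - 1. Assumes no negative-weight cycles."""
    n = len(W)
    M, t = W, 1
    while t < n - 1:
        M, t = otimes(M, M), 2 * t
    return M
```

The loop runs O(log V) times and each ⊗ costs O(V³) additions and comparisons, matching the O(V³ log V) bound above.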

∗ ∗ ∗

For interesting problems like Go, or even the Seaquest game shown at the beginning of this
section, it’s impractical to solve the Bellman equation exactly because the number of states
is combinatorially huge. Instead of solving the value function exactly, DeepMind trains a
neural network to learn an approximation to the value function.

5.8. Johnson’s algorithm


What if we want to compute shortest paths between all pairs of vertices?

• The betweenness centrality of an edge is defined to be the number of shortest paths
that use that edge, over all the shortest paths between all pairs of vertices in a graph.
(If there are n shortest paths between a pair of vertices, count each of them as
contributing 1/n.) It measures how ‘important’ each edge is, and it’s used for
summarizing the shape of e.g. a social network. To compute it, we need shortest paths
between all pairs of vertices.

• Each router in the internet has to know, for every packet it might receive, where that
packet should be forwarded to. Routers send messages between themselves using the
Border Gateway Protocol (BGP), advertising which destinations they know about, and
they update their routing tables based on the messages they receive. The entire internet
can be thought of as a distributed algorithm for computing all-to-all paths.

We’ve learnt three algorithms we can use for this purpose: (1) if all edge weights are ≥ 0 we
can run Dijkstra’s algorithm once from each vertex; (2) we can run Bellman-Ford once from
each vertex; (3) we can use dynamic programming with matrices. The running times are

    running time                        E = V − 1       E = V (V − 1)   E = Θ(V^α)

    Dijkstra      V × O(E + V log V)   O(V^2 log V)    O(V^3)          O(V^(1+α) + V^2 log V)
    Bellman-Ford  V × O(V E)           O(V^3)          O(V^4)          O(V^(2+α))
    d.p.          O(V^3 log V)         O(V^3 log V)    O(V^3 log V)    O(V^3 log V)
    Johnson       same as Dijkstra

The table shows the running time as a function of V and E, and it also shows it for two special
cases, E = V − 1 (a tree, the sparsest connected graph on V vertices) and E = V (V − 1)
(a fully connected graph, the densest graph on V vertices), as well as for E = Θ(V α ) for
α ∈ [1, 2], which spans the range from sparse to dense. This last column makes it easier to
see the comparison. Dijkstra is best for any α, and dynamic programming is better than
Bellman-Ford for any α > 1.
The last row is for Johnson’s algorithm, the topic of this section. It is as fast as
Dijkstra’s algorithm, but it also works with positive and negative edge weights. It was
discovered by Donald Johnson in 1977.

GENERAL IDEA
Johnson’s idea was that we can construct a suitable helper graph, run Dijkstra once from
each vertex in the helper graph, and then translate the answers back to the original graph.
His method is subtle and clever, but his general strategy is very common, and we’ll see it
again and again. It’s worth highlighting the two parts to his strategy:
TRANSLATION strategy: Translate the problem we want to solve into a different setting,
use a standard algorithm in the different setting, then translate the answer back to the original
setting. In this case, the translated setting is ‘graphs with different edge weights’. Of course
we’ll need to argue why these translated answers solve the original problem. We’ll see more
of the TRANSLATION strategy in Section 6.
AMORTIZATION strategy: It takes work to construct the helper graph, but this work
pays off because we only need to do it once and then we save time on each of the V times that
we can run Dijkstra’s algorithm rather than Bellman-Ford. We’ll see more of the AMORTI-
ZATION strategy in Section 7.

PROBLEM STATEMENT
Given a directed graph where each edge is labelled with a weight, (i) if the graph contains
no negative-weight cycles then for every pair of vertices compute the weight of the minimal-
weight path between those vertices; (ii) if the graph contains a negative-weight cycle then
detect that this is so.

IMPLEMENTATION AND ANALYSIS
1. The augmented graph. First build an augmented graph with an extra vertex s, as shown
below. Run Bellman-Ford on this augmented graph, and let the minimum weight from s
to v be dv . (The direct path s → v has weight 0, so obviously dv ≤ 0. But if there are
negative-weight edges in the graph, some vertices will have dv < 0.) If Bellman-Ford reports
a negative-weight cycle, then stop.

[Figure: three panels — the original graph, with edge weights w(u → v); the augmented
graph, with an extra vertex s and zero-weight edges s → v for all vertices v; and the
helper graph, with modified edge weights w′(u → v)]

2. The helper graph. Define a helper graph which is like the original graph, but with different
edge weights:

    w′(u → v) = du + w(u → v) − dv .

CLAIM: in this helper graph, every edge has w′(u → v) ≥ 0. PROOF: The relaxation
equation, applied to the augmented graph, says that dv ≤ du + w(u → v), therefore
w′(u → v) ≥ 0.

3. Dijkstra on the helper graph. Run Dijkstra’s algorithm V times on the helper graph, once
from each vertex. (We’ve ensured that the helper graph has edge weights ≥ 0, so Dijkstra
terminates correctly.) CLAIM: Minimum-weight paths in the helper graph are the same as
in the original graph. PROOF: Pick any two vertices p and q, and any path between them

p = v0 → v1 → · · · → vk = q.

What weight does this path have, in the helper graph and in the original graph?

    weight in helper graph
      = (dp + w(v0 → v1) − dv1) + (dv1 + w(v1 → v2) − dv2) + · · ·
      = dp + w(v0 → v1) + w(v1 → v2) + · · · + w(vk−1 → vk) − dq
      = weight in original graph + dp − dq .

(This algebraic trick is called a telescoping sum.)

Since dp − dq is the same for every path from p to q, the ranking of paths is the same in the
helper graph as in the original graph (though of course the weights are different).

4. Wrap up. We’ve just shown that

    ( min weight from p to q in original graph )
        = ( min weight from p to q in helper graph ) − dp + dq

which gives us the solution to the problem statement.
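The four steps can be put together in a short sketch. Two deviations from the text, flagged as assumptions: Bellman-Ford is run with every dv initialised to 0, which has the same effect as adding the extra vertex s with zero-weight edges; and Dijkstra uses a binary heap, so the constants differ from the Fibonacci-heap running time quoted earlier.

```python
import heapq
import math

def dijkstra(adj, src):
    """Dijkstra with a binary heap, on an adjacency map
    {v: [(w, weight), ...]} whose weights are all >= 0.
    Vertex labels must be orderable (e.g. strings), for heap ties."""
    dist = {v: math.inf for v in adj}
    dist[src] = 0
    pq = [(0, src)]
    while pq:
        du, u = heapq.heappop(pq)
        if du > dist[u]:
            continue                     # stale queue entry, skip it
        for (v, c) in adj[u]:
            if du + c < dist[v]:
                dist[v] = du + c
                heapq.heappush(pq, (dist[v], v))
    return dist

def johnson(vertices, edges):
    """All-pairs minimum weights; `edges` is a list of (u, v, weight).
    Raises ValueError if there is a negative-weight cycle."""
    # 1. Bellman-Ford on the augmented graph. Initialising every d_v to 0
    #    stands in for the extra vertex s with its zero-weight edges.
    d = {v: 0 for v in vertices}
    for _ in range(len(vertices)):
        for (u, v, c) in edges:
            d[v] = min(d[v], d[u] + c)
    if any(d[u] + c < d[v] for (u, v, c) in edges):
        raise ValueError("Negative-weight cycle detected")
    # 2. The helper graph, with w'(u -> v) = d_u + w(u -> v) - d_v >= 0.
    adj = {v: [] for v in vertices}
    for (u, v, c) in edges:
        adj[u].append((v, d[u] + c - d[v]))
    # 3. Dijkstra from each vertex; 4. translate the weights back,
    #    using (weight in original) = (weight in helper) - d_p + d_q.
    return {p: {q: w - d[p] + d[q] for (q, w) in dijkstra(adj, p).items()}
            for p in vertices}
```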



6. Graphs and subgraphs


In this section we will look at algorithms for finding structures within graphs. Here’s an
example to illustrate.

Example (Minimum spanning trees). Suppose we have to build a power grid to connect 6 cities,
and the costs of running cabling are as shown on the left. We’ve learnt how to find a minimum-
cost path between a pair of nodes. But what is the minimum cost tree that connects all the
nodes? (This is called a minimum spanning tree.)
[Figure: three panels — the edge costs on a graph connecting the six cities a–f (costs
1 to 9); the minimum cost path from c to d; and the minimum weight tree connecting all
six cities]

But we’re not studying subgraph algorithms for their own sake. We’re studying them
because they highlight two strategies for algorithm design:

• We’ll see the translation strategy again, in Section 6.4. This is where we translate a
problem into a different setting, solve the translated problem using a standard algorithm,
and translate the solution back to answer the original problem. (The translation strategy
was used by Johnson’s algorithm, Section 5.8.)

• Often, the soul of an algorithm can be dressed in different guises. For example, the
soul of Dijkstra’s algorithm can help us find a minimum spanning tree (Prim’s algorithm,
Section 6.5). The algorithms in Sections 6.5–6.7 are different guises for algorithms whose
souls we’ve seen before.

6.1. Flow networks


For some applications, it’s useful to consider a graph where each edge has a capacity. In
the internet each link has a maximum bandwidth it can carry; in the road network each
road has a maximum amount of traffic it can handle; in an oil pipeline system each pipe
has a capacity. Many interesting problems can be boiled down to the question: what is the
maximum amount of stuff that can be carried between a given pair of vertices?

[Figure: a flow network with source s, sink t and intermediate vertices u, v, w; each
directed edge is labelled with its capacity and its current flow]
This picture illustrates a flow on a network. There are two distinguished vertices, the source
vertex s where flow originates, and the sink vertex t where flow is consumed. The edges are
directed, and labelled with their capacities. The flow value is the net flow out of the source
vertex, and it’s 12 − 1 = 11 in this picture. This is equal to the net flow into the sink vertex,
of course.
What’s the maximum possible flow value, over all possible flows? For this simple
network, it’s fairly easy to discover a flow of value 14. Furthermore, the total capacity of the
edges going into the sink is 14, so it’s impossible to have a flow of value > 14. Therefore the
maximum possible flow value is 14.

In Sections 6.2 and 6.3 we will see an algorithm for finding a maximum flow, and prove
that it is correct. First, here is a pair of flow problems10 that inspired the algorithm.

TWO TRANSPORTATION PROBLEMS

The Russian applied mathematician A.N. Tolstoĭ was the first to formalize the flow problem.
He was interested in the problem of shipping cement, salt, etc. over the rail network. Formally,
he posed the problem “Given a graph with edge capacities, and a list of source vertices and
their supply capacities, and a list of destination vertices and their demands, find a flow that
meets the demands.” (We’ll only study single-commodity flows, i.e. where there is a single
type of ‘stuff’ flowing. Multi-commodity flow problems are much much harder.)

Exercise. In the standard formulation of the flow problem, there is a single source with
unlimited supply capacity, and a single sink. Suppose we have an algorithm that solves
this standard problem. Explain how to use it to solve Tolstoĭ’s problem.

10 For further reading, see On the history of the transportation and maximum flow problems by Alexander
Schrijver, http://homepages.cwi.nl/~lex/files/histtrpclean.pdf; and Flows in railway optimization by
the same author, http://homepages.cwi.nl/~lex/files/flows_in_ro.pdf.

[Figure: from Methods of finding the minimum total kilometrage in cargo-transportation
planning in space, A.N. Tolstoy, 1930. In this illustration, the circles mark sources and
sinks for cargo, from Omsk in the north to Tashkent in the south.]

The US military was also interested in flow networks during the cold war. If the
Soviets were to attempt a land invasion of Western Europe through East Germany (vertex
EG), they’d need to transport fuel to the front line. The diagram shows the links in the
rail network, and the carrying capacity of each link. It also shows the various available
fuel sources, aggregated into a single vertex marked ORIGINS. What is the max flow from
ORIGINS to EG? More importantly, if the US Air Force wants to strike and degrade one of
the links, which link should it target in order to reduce the max flow? It’s no use hitting a
link where the Soviets can just reroute around the damage.

[Figure: from Fundamentals of a method for evaluating rail net capacities, T.E. Harris
and F.S. Ross, 1955, a report by the RAND Corporation for the US Air Force (declassified
by the Pentagon in 1999).]

6.2. Ford‐Fulkerson algorithm


PROBLEM STATEMENT
Given a weighted directed graph g with a source s and a sink t, find a flow from s to t with
maximum possible value (also called a maximum flow).

[Figure: the flow network from Section 6.1 — source s, sink t, intermediate vertices
u, v, w, each edge labelled with its capacity and its flow]

But to write a proper problem statement, we need to be more precise than this! Here are
some definitions. Let the weight associated with edge u → v be c(u → v), and call this the capacity
of the edge. Assume it is > 0 for every edge in the graph. A flow is a set of edge labels
f (u → v) such that
0 ≤ f (u → v) ≤ c(u → v) on every edge
and

    Σ over u: u→v of f(u → v)  =  Σ over w: v→w of f(v → w)    at all vertices v ∈ V \ {s, t}.

The second equation is called flow conservation, and it says that as much stuff comes into v
as goes out. Flow conservation doesn’t need to hold at s or t—indeed, the value of a flow is
the net flow out of s,
    value(f) = Σ over u: s→u of f(s → u)  −  Σ over u: u→s of f(u → s).

In the network pictured above the flow value is 12 − 1 = 11, i.e. the total flow out minus the
total flow in.
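As a concrete check of these two conditions, here is a sketch in Python (not from the notes; it assumes a network is encoded as a dict mapping each edge (u, v) to its capacity, and a flow as a dict over the same edges):

```python
def flow_value(capacity, flow, s, t):
    """Check the two defining conditions of a flow, then return value(f),
    the net flow out of the source s."""
    vertices = {u for u, v in capacity} | {v for u, v in capacity}
    for e in capacity:                        # condition 1: 0 <= f <= c on every edge
        assert 0 <= flow[e] <= capacity[e]
    for v in vertices - {s, t}:               # condition 2: flow conservation
        inflow = sum(f for (a, b), f in flow.items() if b == v)
        outflow = sum(f for (a, b), f in flow.items() if a == v)
        assert inflow == outflow
    return (sum(f for (a, b), f in flow.items() if a == s)
            - sum(f for (a, b), f in flow.items() if b == s))
```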

GENERAL IDEA
The basic idea of the algorithm is “look for vertices to which we could increase flow”.
Imagine that the source and all the other vertices apart from the sink are in bandit
country, and the bandits want to siphon off flow from intermediate vertices. We’ll assume
they can sneak into the vertices to siphon off flow, and to redirect existing flow, and they
can also increase the flow at the source. But they daren’t do anything that would disrupt
the total flow to the sink, because that’d attract the attention of the authorities, who would
come and put an end to their banditry. Here are two types of step that the bandits could
take, starting from the flow pictured at the top of the page.

[Figure: three copies of a small network on vertices s, a, b, c, t, illustrating three bandit moves:
(1) turn up the flow at s, and siphon it off;
(2) instead of siphoning off all the excess at s, increase the flow s → b and siphon it off at b;
(3) instead of siphoning off all the excess at b, send some of it along b → t and reduce the a → b flow to match, giving an excess at a that can be siphoned off.]

Suppose the bandits are overzealous and they discover that they can siphon off some flow at
t. That means that the network operator—which was spying on the bandits all along—has
learned a reconfiguration that delivers extra flow to the sink.
Let’s look more closely at this reconfiguration. In the network fragment below, the
bandits discovered they could siphon off flow at a, thence b, thence c, thence t. How much
could they siphon off at each of these locations?

[Figure: a network fragment with edges s → a, a → b, c → b, and c → t, each labelled with its existing flow and capacity (e.g. 'flow 2/5' on s → a), shown before and after the adjustment.]

• They could siphon off 3 at a by increasing s → a (limiting factor: spare capacity on s → a)
• Or siphon off 2 at b by increasing s → a → b (limiting factor: spare capacity on a → b)
• Or siphon off 1 at c by increasing s → a → b and decreasing c → b, leaving the other outflow at b undisturbed (limiting factor: existing flow on c → b)
• Or siphon off 1 at t by increasing s → a → b and decreasing c → b and increasing c → t, leaving the inflow at c undisturbed (limiting factor: existing flow on c → b)

The network operator only wants to get flow to t, not to any of the other vertices. So it
chooses a flow adjustment that gets as much as possible to t, with no excess at any of the
other vertices along the path.

The Ford-Fulkerson algorithm starts with an empty flow, then repeatedly uses this
‘bandit search’ to find whether it’s possible to siphon off flow at the sink. If it is possible, it
adjusts flow along a suitable sequence of edges and thereby increases the flow value. If it’s
not possible, then the algorithm terminates.

IMPLEMENTATION
There are two pieces that we need to turn into a proper formal algorithm, (1) how exactly
the bandit search works, and (2) how to use the results of the bandit search to adjust the
flow.

The residual graph. To formalize the bandit search, we’ll build what’s called the residual
graph. This has the same vertices as the flow network, and it has either one or two edges for
every edge in the original flow network:

• If f (u → v) < c(u → v) in the flow network, let the residual graph have an edge u → v
with the label “increase flow u → v”.
• If f (u → v) > 0 in the flow network, let the residual graph have an edge v → u (i.e. in
the opposite direction) with the label “decrease flow u → v”.

The two clauses here correspond to the two types of adjustment that the bandits can make.
In this illustration, the ‘increase’ edges are solid lines and the ‘decrease’ edges are dotted.

[Figure: the residual graph for the flow shown earlier, on vertices s, a, b, c, t, with solid ‘increase’ edges and dotted ‘decrease’ edges.]

If the bandits can siphon off flow at some vertex u, and if the residual graph has an edge
u → v, then they can siphon flow off at v. They can certainly siphon off some flow at s.
Thus, if the residual graph has a path from s to t, then they can siphon flow off at t.
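The two construction rules can be sketched directly in code (an illustrative representation, using the same dict-of-edges encoding as before; note that the residual graph can have two edges between the same pair of vertices, coming from two different original edges, so edges are returned as labelled triples):

```python
def residual_edges(capacity, flow):
    """List the residual graph's edges as (u, v, label) triples, one or two
    per edge of the original flow network."""
    edges = []
    for (u, v), c in capacity.items():
        if flow[(u, v)] < c:
            edges.append((u, v, 'inc'))   # room to increase flow u -> v
        if flow[(u, v)] > 0:
            edges.append((v, u, 'dec'))   # reversed edge: decrease flow u -> v
    return edges
```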

Augmenting paths. Suppose the residual graph has a path from s to t. This is called an
augmenting path. We could find it using breadth-first search or depth-first search, or any
other path-finding algorithm we like.

[Figure: an augmenting path s → b → a → t in the residual graph (left), and the corresponding flow adjustment (right): +δ on s → b, −δ on a → b, +δ on a → t.]

Each edge in the augmenting path has a label, saying either ‘increase’ or ‘decrease’. In the
path shown here, the s → b edge is labelled “increase the flow s → b”, the b → a edge is
labelled “decrease the flow a → b”, and the a → t edge is labelled “increase the flow a → t”.
We pick some amount δ > 0 to increase/decrease each edge by, and we update the flow.
The crucial thing about this update is that it leaves us with a valid flow. To verify
this, remember the two defining characteristics of a flow: (1) it must satisfy the capacity
constraints 0 ≤ f (u → v) ≤ c(u → v), and (2) it must satisfy flow conservation. The
rules for constructing the residual graph ensure that the updated flow satisfies the capacity
constraints, as long as δ is sufficiently small; and by thinking carefully about the four different
possibilities for what happens at each vertex along the augmenting path, we see that the total
flow in minus total flow out is unchanged.

[Figure: the four possibilities at an intermediate vertex of the augmenting path, depending on whether the incoming and outgoing path edges are ‘+δ’ (increase) or ‘−δ’ (decrease) adjustments; in each case the net flow through the vertex is unchanged.]

Also, the flow value increases by δ. To see this, consider the two possible labels for the first
edge of the augmenting path. Whether it’s an “increase” edge or a “decrease” edge, either
way the net flow out of s increases.

[Figure: the two possibilities for the first edge of the augmenting path: an ‘increase’ edge out of s (+δ) or a ‘decrease’ edge into s (−δ); either way the net flow out of s increases by δ.]

1   def ford_fulkerson(g, s, t):
2       # let f be a flow, initially empty
3       for u → v in g.edges:
4           f(u → v) = 0
5
6       # Define a helper function for finding an augmenting path
7       def find_augmenting_path():
8           # define the residual graph h on the same vertices as g
9           for each edge u → v in g:
10              if f(u → v) < c(u → v): give h an edge u → v labelled "inc"
11              if f(u → v) > 0: give h an edge v → u labelled "dec"
12          if h has a path from s to t:
13              return some such path, together with the labels of its edges
14          else:
15              # There is a set of vertices that we can reach starting from s;
16              # call this "the cut associated with flow f".
17              # We'll use this in the analysis.
18              return None
19
20      # Repeatedly find an augmenting path and add flow to it
21      while True:
22          p = find_augmenting_path()
23          if p is None:
24              break  # give up -- can't find an augmenting path
25          else:
26              let the vertices of p be s = v0, v1, ..., vk = t
27              δ = ∞  # amount by which we'll augment the flow
28              for each edge vi → vi+1 along p:
29                  if the edge has label "inc":
30                      δ = min(δ, c(vi → vi+1) − f(vi → vi+1))
31                  else the edge must have label "dec":
32                      δ = min(δ, f(vi+1 → vi))
33              # assert: δ > 0
34              for each edge vi → vi+1 along p:
35                  if the edge has label "inc":
36                      f(vi → vi+1) = f(vi → vi+1) + δ
37                  else the edge must have label "dec":
38                      f(vi+1 → vi) = f(vi+1 → vi) − δ
39              # assert: f is still a valid flow

This pseudocode doesn’t tell us how to choose the path in line 13. One sensible idea is ‘pick
the shortest path’, and this version is called the Edmonds–Karp algorithm; it is a simple
matter of running breadth first search on the residual graph. Another sensible idea is ‘pick
the path that makes δ as large as possible’, also due to Edmonds and Karp.
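For concreteness, here is a compact runnable sketch of the Edmonds–Karp variant (illustrative only: it uses the dict-of-edges encoding from earlier, and folds the residual-graph BFS into the main loop rather than following the pseudocode's structure line by line):

```python
from collections import deque

def edmonds_karp(capacity, s, t):
    """Maximum flow: repeatedly find a shortest augmenting path by BFS over
    the residual graph, and push flow along it."""
    flow = {e: 0 for e in capacity}
    while True:
        # BFS in the residual graph, recording for each reached vertex which
        # original edge we used, and whether as an 'inc' or 'dec' residual edge.
        come_from = {s: None}
        queue = deque([s])
        while queue and t not in come_from:
            x = queue.popleft()
            for (a, b), c in capacity.items():
                if a == x and flow[(a, b)] < c and b not in come_from:
                    come_from[b] = (a, b, 'inc')   # residual edge a -> b
                    queue.append(b)
                if b == x and flow[(a, b)] > 0 and a not in come_from:
                    come_from[a] = (a, b, 'dec')   # residual edge b -> a
                    queue.append(a)
        if t not in come_from:
            return flow                            # no augmenting path: flow is maximum
        # Recover the augmenting path, find the bottleneck delta, apply it.
        path, v = [], t
        while come_from[v] is not None:
            a, b, label = come_from[v]
            path.append((a, b, label))
            v = a if label == 'inc' else b
        delta = min(capacity[(a, b)] - flow[(a, b)] if label == 'inc' else flow[(a, b)]
                    for (a, b, label) in path)
        for (a, b, label) in path:
            flow[(a, b)] += delta if label == 'inc' else -delta
```

Scanning every edge at each BFS step, as here, is wasteful but keeps the sketch short; adjacency lists would give the O(V + E) search quoted in the analysis below.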

ANALYSIS OF RUNNING TIME
Be scared of the while loop in line 21: how can we be sure it will terminate? In fact, there
are simple graphs with irrational capacities where the algorithm does not terminate. On the
other hand,

Lemma. If all capacities are integers then the algorithm terminates, and the resulting flow on
each edge is an integer.

Proof. Initially, the flow on each edge is 0, i.e. integer. At each execution of lines 27–32,
we start with integer capacities and integer flow sizes, so we obtain δ an integer ≥ 0. It’s
not hard to prove the assertion on line 33, i.e. that δ > 0, by thinking about the residual
graph in find_augmenting_path. Therefore the total flow has increased by an integer after
lines 34–38. The value of the flow can never exceed the sum of all capacities, so the algorithm
must terminate. □

Now let’s analyse running time, under the assumption that capacities are integer. We
execute the while loop at most f ∗ times, where f ∗ is the value of maximum flow. We can build
the residual graph and find a path in it using breadth first search, so find_augmenting_path
is O(V + E). Lines 27–38 involve some operations per edge of the augmenting path,  which is
O(V ) since the path is of length ≤ V . Thus the total running time is O (E + V )f ∗ . There’s
no point including the vertices that can’t be reached from s, so we might as well assume
that all vertices can be reached from s, so E ≥ V − 1 and the running time can be written
O(Ef ∗ ).

It is worth noting that the running time we found depends on the values in the input
data (via f ∗ ). This is in contrast to all the algorithms we’ve studied so far, like Quicksort
and Depth-first search, in which we found a running time that depends only on the size of
the data. Often in machine learning and optimization, we get answers that depend on the
contents of the data.
On one hand it’s good to get an answer that depends on the values in the input data
rather than just the size, because any analysis that ignores the data contents can’t be very
informative. On the other hand it’s bad in this problem because we don’t have a useful upper
bound for f ∗ .

The Edmonds–Karp version of the algorithm can be shown to have running time
O(E 2 V ).

CORRECTNESS
There are two parts to proving correctness: (1) does this algorithm produce a flow? and (2)
is the flow it produces a maximum flow?
We’ve already argued why the assertion on line 39 is correct, i.e. why the algorithm
produces a valid flow at every iteration. The proof that it does indeed produce a maximum
flow is left to the next section.

∗ ∗ ∗

In computer science textbooks and on YouTube, there are plenty of explanations of the Ford-
Fulkerson algorithm that start by defining the residual graph, and make no mention of what
these notes have called the ‘bandit search’ problem. If you’re an engineer and all you want
is a recipe to follow, then you don’t need to think about the bandit search at all. But if
you’re a computer scientist or mathematician and you want to understand why the algorithm
works, the bandit search idea is the linchpin, and the residual graph is just an implementation
detail. The big idea in the proof of correctness relates directly to the bandit search problem.
The most elegant algorithms are, in my opinion, those in which each part of the algorithm
corresponds to a line of a proof, and where the proof is as concise as it can be.

6.3. Max‐flow min‐cut theorem


In this section we’ll prove that the Ford-Fulkerson algorithm finds a maximum flow.

[Figure: two drawings of the same flow network (vertices s, a, b, c, t, with edge capacities) and the same cut ({s, b, c}, {t, a}); the right-hand drawing separates the vertices into the two sides of the cut.]

The proof is based on the idea of a cut. A cut is a partition of the vertices into two sets,
V = S ∪ S̄, with s ∈ S and t ∈ S̄. The capacity of a cut is

    capacity(S, S̄) = ∑_{u∈S, v∈S̄: u→v} c(u → v).

The two pictures above show the same network and cut, with cut capacity 20 (not 24!). The
right hand picture emphasizes that a cut splits the vertices into two groups, one group with
the source, the other with the sink.
It’s obvious that the maximum flow in this network is ≤ 15, since the total capacity
of all edges out of the source is 12 + 3 = 15. Similarly, by considering all the edges into
the sink, the maximum flow must be ≤ 14. The idea of a cut is to generalize this type of
bound. Looking at the right hand picture above, which shows the cut ({s, b, c}, {t, a}), we
see it’s impossible to push more flow from left to right than the total left→right capacity,
which is 12 + 4 = 16. If we had chosen the cut ({s}, {a, b, c, t}) we’d get the ‘source’ bound,
that flow ≤ 15, and if we had chosen the cut ({s, a, b}, {t}) we’d get the ‘sink’ bound, that
flow ≤ 14. Here’s a theorem to formalize this idea.

Theorem (Max‐flow min‐cut theorem). For any flow f and any cut (S, S̄),

value(f ) ≤ capacity(S, S̄).

Proof. To simplify notation in this proof, we’ll extend f and c to all pairs of vertices: if there
is no edge u → v, let f (u → v) = c(u → v) = 0.

    value(f ) = ∑_u f (s → u) − ∑_u f (u → s)                     by definition of flow value

             = ∑_{v∈S} ( ∑_u f (v → u) − ∑_u f (u → v) )          by flow conservation
                                        (the term in brackets is zero for v ≠ s)

             = ∑_{v∈S} ∑_{u∈S} f (v → u) + ∑_{v∈S} ∑_{u∉S} f (v → u)
               − ∑_{v∈S} ∑_{u∈S} f (u → v) − ∑_{v∈S} ∑_{u∉S} f (u → v)
                                        (splitting the sum over u into two sums, u ∈ S and u ∉ S)

             = ∑_{v∈S} ∑_{u∉S} f (v → u) − ∑_{v∈S} ∑_{u∉S} f (u → v)   by ‘telescoping’ the sum

             ≤ ∑_{v∈S} ∑_{u∉S} f (v → u)                          since f ≥ 0      (1)

             ≤ ∑_{v∈S} ∑_{u∉S} c(v → u)                           since f ≤ c      (2)

             = capacity(S, S̄)                                     by definition of cut capacity.

This completes the proof. □



[Figure: a number line comparing the value of a flow with the capacity of a cut. Every flow’s value (flows g, h, i) lies below every cut’s capacity (cuts (A, Ā), (B, B̄), (C, C̄)); the maximum flow f ∗ and the cut (S ∗ , S̄ ∗ ) meet in the middle.]

What’s this theorem for? The theorem says that for any cut, every possible flow’s value is
≤ that cut’s capacity. Likewise, for any flow, every possible cut’s capacity is ≥ that flow’s
value.
Hence, if we’re able to find a flow f ∗ and a matching cut (S ∗ , S̄ ∗ ) such that value(f ∗ ) =
capacity(S ∗ , S̄ ∗ ), then every other flow must have value ≤ capacity(S ∗ , S̄ ∗ ), therefore f ∗ is a
maximum flow. In other words, if we can find a matching flow and cut, then the cut acts as a
‘certificate of correctness’. If someone doubts whether our proposed flow truly is a maximum
flow, all we need do is show them our cut, and (assuming the flow value is equal to the cut
capacity!) that proves our flow is correct.11
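The certificate check described here can be sketched in a few lines (illustrative only, using the dict-of-edges encoding from Section 6.2; `certifies` is a name invented for this sketch):

```python
def cut_capacity(capacity, S):
    """Total capacity of the edges crossing from S to its complement."""
    return sum(c for (u, v), c in capacity.items() if u in S and v not in S)

def certifies(capacity, flow, S, s):
    """Does the cut (S, complement of S) certify that `flow` is maximum?
    True exactly when value(flow) equals capacity of the cut."""
    value = (sum(f for (u, v), f in flow.items() if u == s)
             - sum(f for (u, v), f in flow.items() if v == s))
    return value == cut_capacity(capacity, S)
```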
This is exactly what we need to prove the Ford-Fulkerson algorithm correct. The
algorithm runs the ‘bandit search’ to find all the vertices to which flow could be increased.
The algorithm terminates when it can’t increase flow to the sink—in other words, when the
bandit search produces a cut. And this cut is exactly what we need to certify that the flow
it just found is a maximum flow!

Theorem (Correctness of Ford‐Fulkerson). Suppose the algorithm terminates, and f ∗ is the final
flow it produces. Then f ∗ is a maximum flow.

Proof. Let S ∗ be the set of vertices found in the final call to find_augmenting_path on line 16,
page 32. Since it’s the final call, we know it failed to find a path to the sink, i.e. t ̸∈ S ∗ ,
hence (S ∗ , S̄ ∗ ) is a cut.
Now imagine drawing the original flow network with all the S ∗ vertices on the left and
all the S̄ ∗ vertices on the right. Suppose the network has an edge u → v that goes from left
to right, i.e. u ∈ S ∗ and v ̸∈ S ∗ . This means that the residual graph has a path from s to u,
but not a path from s to v; therefore u → v cannot be present in the residual graph. Hence,
by the condition on line 10, f (u → v) = c(u → v).
Next, suppose the flow network has an edge u → v that goes from right to left, i.e.
v ∈ S ∗ and u ̸∈ S ∗ . This means that the residual graph has a path from s to v, but not a
path from s to u, hence v → u is not present in the residual graph. Therefore, by line 11,
f (u → v) = 0.
We have proved that for any edge from S ∗ to S̄ ∗ , the flow on that edge is equal to the
capacity, hence inequality (2) in the proof of the max-flow min-cut theorem is an equality.
And we have also proved that for any edge from S̄ ∗ to S ∗ , the flow on that edge is equal to 0,
hence inequality (1) is an equality also. Therefore the derivation in that proof shows that
value(f ∗ ) is equal to capacity(S ∗ , S̄ ∗ ).
As we have argued, the cut (S ∗ , S̄ ∗ ) thus acts as a ‘certificate’ proving that flow f ∗ is
a maximum flow. □

∗ ∗ ∗

11 In the practical assignment for this part of the course, you’re asked to produce a maximum flow and a

matching cut. The tester doesn’t bother computing its own answer to verify against yours, it simply checks
the certificate you provided.

A cut corresponding to a maximum flow is called a bottleneck cut. (The maximum flow might
not be unique, and the bottleneck cut might not be unique either. But all maximum flows
have the same flow value, and all bottleneck cuts have the same cut capacity.) The RAND
report shows a bottleneck cut, and suggests it’s the natural target for an air strike.

6.4. Matchings
There are several graph problems that don’t on the surface look like flow networks, but
which can be solved by translating them into a well-chosen maximum flow problem. Finding
matchings in bipartite graphs is one example. The example sheet has more.

A bipartite graph is one in which the vertices are split into two sets, and all the edges have
one end in one set and the other end in the other set. For example, kidney transplant donors
and recipients, with edges to indicate compatibility. We’ll assume the graph is undirected.
A matching in a bipartite graph is a selection of some or all of the graph’s edges, such that
no vertex is connected to more than one edge in this selection. The size of a matching is the
number of edges it contains. A maximum matching is one with the largest possible size. Our
goal is to find a maximum matching.

IMPLEMENTATION

Let’s translate the matching problem into a flow problem, as follows.

1. start with a bipartite graph


2. create a helper graph as follows: add a source s with edges to each left-hand vertex;
add a sink with edges from each right-hand vertex; turn the original edges into directed
edges from left to right; give all edges capacity 1
3. run the Ford–Fulkerson algorithm on the helper graph to find a maximum flow from s
to t
4. interpret that flow as a matching in the original bipartite graph

[Figure: a bipartite graph (left) and the helper graph built from it (right), with source s, sink t, and every edge given capacity 1.]
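Steps 2–4 can be sketched as follows (illustrative only, with adjacency assumed to be given as a dict). Because every capacity in the helper graph is 1, running Ford–Fulkerson on it boils down to the classic alternating-path search below: each call to `augment` is a search for an augmenting path in the residual graph of the helper network, and updating `match` is the flow update.

```python
def max_matching(left_vertices, edges):
    """Maximum matching in a bipartite graph.  `edges` maps each left-hand
    vertex to a list of its right-hand neighbours."""
    match = {}  # right-hand vertex -> the left-hand vertex it is matched to

    def augment(u, seen):
        # Try to match left-hand vertex u, rematching others if necessary.
        for v in edges.get(u, []):
            if v not in seen:
                seen.add(v)
                # v is free, or its current partner can be rematched elsewhere
                if v not in match or augment(match[v], seen):
                    match[v] = u
                    return True
        return False

    for u in left_vertices:
        augment(u, set())
    return {(u, v) for v, u in match.items()}
```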

ANALYSIS

It’s easy to not even notice that there’s something that needs to be proved here. To un-
derstand the issue, it’s helpful to imagine we’re explaining the procedure to someone who
doesn’t know anything at all about Ford–Fulkerson. They’ll ask us two questions:

• How do you know that a maximum flow can be translated into a matching? For example,
what if it returns a flow with a fractional amount on some edge?
• How do you know that your maximum flow actually gives a maximum matching? What
if there’s a larger-size matching out there, which perhaps corresponds to a lower-value
flow, or perhaps doesn’t correspond to a flow at all?

[Figure: matching sizes and flow values compared on a number line: the matching m∗ corresponds to the max flow f ∗ , and a hypothetical matching with larger size would correspond to a flow of larger value.]

We need to justify the translation in two directions. First we have to justify why the flow f ∗
found in Step 3 can be translated into a matching m∗ . Second we have to justify why any
hypothetical larger-size matching m′ would translate into a higher-value flow f ′ .
Once we’ve justified these translations, it’s easy to argue that m∗ is a maximum-size
matching. Suppose it were not. Then there would be a larger-size matching m′ , with a
corresponding f ′ for which value(f ′ ) > value(f ∗ ). But f ∗ is a maximum flow, therefore no
such f ′ can exist. Hence the premise is false, i.e. m∗ is a maximum-size matching.
This style of algorithm and proof is called the TRANSLATION strategy. Remember,
when you use it, that you have to justify the translation in both directions. Once we’ve
figured out what it is we have to prove, the proof is easy.

Lemma (translation matching ↔ flow).


1. Step 3 of the algorithm described above terminates, and the flow f ∗ that it produces can
be translated into a matching m∗ , with size(m∗ ) = value(f ∗ ).
2. Any matching m translates into a flow f with size(m) = value(f ). In particular, if
there were any matching m′ with size(m′ ) > size(m∗ ) then its corresponding flow f ′
would have value(f ′ ) > value(f ∗ ).

Proof (of 1). The lemma in Section 6.2 on page 33 tells us that the Ford-Fulkerson algorithm
terminates, since all edge capacities are integer. Write f ∗ for the flow produced by Ford-
Fulkerson. The lemma tells us furthermore that f ∗ is integer on all edges. Since the edge
capacities are all 1, the flow must be 0 or 1 on all edges. Translate f ∗ into a matching m∗ , by
simply selecting all the edges in the original bipartite graph that got f ∗ = 1. The capacity
constraints on edges from s mean that each left-hand vertex has either 0 or 1 flow coming
in, so it must have 0 or 1 flow going out, therefore it is connected to at most one edge in m∗ .
Similarly, each right-hand vertex is connected to at most one edge in m∗ . Therefore m∗ is a
matching.
Proof (of 2). Take any matching m and translate it into a flow f in the natural way, i.e.
with a flow of 1 from s to every matched left hand vertex, and similarly for t. It’s easy
to use the definition of ‘matching’ to prove that f is indeed a flow, i.e. that it satisfies
the capacity constraints as well as flow conservation. From this translation it’s clear that
size(m) = value(f ). □

6.5. Prim’s algorithm

Given a connected undirected graph g with edge weights, a spanning tree of g is a tree that
connects all of g’s vertices, using some or all of g’s edges. (See Section 5.1 for the definition
of ‘connect’ and ‘tree’.) A minimum spanning tree (MST) is a spanning tree that has
minimum weight among all spanning trees. (The weight of a tree is just the sum of the
weights of its edges.)

How can we find an MST? We’ll look at an algorithm due to Jarnik (1930), and independently to Prim (1957) and Dijkstra (1959).

[Figure: four small graphs: an undirected graph with edge weights; a tree, but not spanning; a spanning tree of weight 6; another spanning tree of weight 6.]

APPLICATIONS

• The MST problem was first posed and solved by the Czech mathematician Borůvka
in 1926, motivated by a network planning problem. His friend, an employee of the
West Moravian Powerplants company, put to him the question: if you have to build
an electrical power grid to connect a given set of locations, and you know the costs of
running cabling between locations, what is the cheapest power grid to build?

• Minimal spanning trees are a useful tool for exploratory data analysis. In this
illustration from bioinformatics12 , each vertex is a genotype of Staphylococcus aureus,
and the size shows the prevalence of that genotype in the study sample. Let there be
edges between all genotypes, weighted according to edit distance. (The ‘edit distance’
between two strings is a measure of how different they are; see Section 1.2.) The
illustration shows the MST, after some additional high-weight edges are removed.

GENERAL IDEA

We’ll build up the MST greedily. Suppose we’ve already built a tree containing some of the
vertices (start it with just a single vertex, chosen arbitrarily). Look at all the edges between
the tree we’ve built so far and the adjacent vertices that aren’t part of the tree, pick the edge
of lowest weight among these and add it to the tree, then repeat.

This greedy algorithm will certainly give us a spanning tree. To prove that it’s an MST
takes some more thought.

12 From Multiple-Locus Variable Number Tandem Repeat Analysis of Staphylococcus Aureus, Schouls et al.,

PLoS ONE 2009.



[Figure: three stages of the greedy algorithm: a tree built up with four edges so far; three candidate vertices to add next; pick the cheapest of the four connecting edges and add it to the tree.]

PROBLEM STATEMENT
Given a connected undirected graph with edge weights, construct an MST.

IMPLEMENTATION
We don’t need to recompute the nearby vertices every iteration. Instead we can use a
structure very similar to Dijkstra’s algorithm for shortest paths: store a ‘frontier’ of vertices
that are neighbours of the tree, and update it each iteration. For each of the frontier vertices
w, we’ll store the lowest-weight edge connecting it to the tree that we’ve discovered so far
(w.come_from), and the weight of that edge (w.distance). We pick the frontier vertex v with
the smallest v.distance, add it to the tree, and add its neighbours to the frontier if they’re
not already in the tree. When the algorithm terminates, an MST is formed from the edges

    {v ↔ v.come_from : v ∈ V \ {s}}.

1       def prim(g, s):
2           for v in g.vertices:
3               v.distance = ∞
4   +           v.in_tree = False
5   +       s.come_from = None
6           s.distance = 0
7           toexplore = PriorityQueue([s], lambda v: v.distance)
8
9           while not toexplore.isempty():
10              v = toexplore.popmin()
11  +           v.in_tree = True
12              # Let t be the graph made of vertices with in_tree=True,
13              # and edges {w -- w.come_from, for w in g.vertices excluding s}.
14              # Assert: t is part of an MST for g
15              for (w, edgeweight) in v.neighbours:
16  ×               if (not w.in_tree) and edgeweight < w.distance:
17  ×                   w.distance = edgeweight
18  +                   w.come_from = v
19                      if w in toexplore:
20                          toexplore.decreasekey(w)
21                      else:
22                          toexplore.push(w)
Compared to Dijkstra’s algorithm, we need some extra lines to keep track of the tree
(lines labelled +), and two modified lines (labelled ×) because here we’re interested in ‘dis-
tance from the tree’ whereas Dijkstra is interested in ‘distance from the start node’. The
start vertex s can be chosen arbitrarily.
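For a runnable sketch in plain Python: the standard `heapq` module has no `decreasekey`, so a common workaround (assumed here, not part of the pseudocode above) is to push duplicate entries and lazily skip any vertex that is already in the tree.

```python
import heapq

def prim(g, s):
    """g: dict mapping vertex -> list of (neighbour, edgeweight) pairs.
    Returns the MST as a list of (parent, child) edges."""
    in_tree = set()
    tree_edges = []
    toexplore = [(0, s, None)]          # (distance from tree, vertex, come_from)
    while toexplore:
        d, v, come_from = heapq.heappop(toexplore)
        if v in in_tree:
            continue                    # stale duplicate entry: skip it
        in_tree.add(v)
        if come_from is not None:
            tree_edges.append((come_from, v))
        for w, edgeweight in g[v]:
            if w not in in_tree:
                heapq.heappush(toexplore, (edgeweight, w, v))
    return tree_edges
```

The lazy-deletion heap gives O(E log E) rather than the Fibonacci-heap bound O(E + V log V) quoted below, which is usually an acceptable trade in practice.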

ANALYSIS
Running time. It’s easy to check that Prim’s algorithm terminates. It is nearly identical to
Dijkstra’s algorithm, and exactly the same analysis of running time applies: it is O(E + V log V ),
assuming the priority queue is implemented using a Fibonacci heap.

Correctness. To prove that Prim’s algorithm does indeed find an MST (and for many other
problems to do with constructing networks on top of graphs) it’s helpful to make a definition.
A cut of a graph is an assignment of its vertices into two non-empty sets, and an edge is said
to cross the cut if its two ends are in different sets.

[Figure: an undirected graph with edge weights on vertices a, b, c, d; a cut into {a, b} and {c, d}, with two edges crossing the cut; and a cut into {a} and {b, c, d}, with one edge crossing the cut.]
It can be proved by induction that Prim’s algorithm produces an MST, using the following
theorem. The details of the induction are left as an exercise. (The theorem as stated here is
more general than is needed for Prim’s algorithm. That’s so that we can re-use it for
Kruskal’s algorithm in the next section.)

Theorem. Suppose we have a forest F and a cut C such that (i) no edges of F cross C, and
(ii) there exists some MST that contains F . Then, if we add to F a min-weight edge that
crosses C, the result is still part of an MST.

The following proof is not examinable. It is all maths, no algorithm.

Proof. Let F be the forest, and let F̄ be an MST that F is part of (the condition of the
theorem requires that such an F̄ exists). Let e be a minimum-weight edge across the cut.
We want to show that there is an MST that includes F ∪ {e}. If F̄ includes edge e, we are
done.

[Figure: the forest F and the cut C; an MST F̄ not containing the edge e; and a different MST F̂ obtained by adding e and removing e′ .]
Suppose then that F̄ doesn’t contain e. Let u and v be the vertices at either end of e, and
consider the path in F̄ between u and v. (There must be such a path, since F̄ is a spanning
tree, i.e. it connects all the vertices.) This path must cross the cut (since its ends are on
different sides of the cut). Let e′ be an edge in the path that crosses the cut. Now, let F̂ be
like F̄ but with e added and e′ removed.
It’s easy to see that weight(F̂ ) ≤ weight(F̄ ): e is a min-weight edge in the cut, so
weight(e) ≤ weight(e′ ). CLAIM: F̂ is connected. If this claim is true, then F̂ must be a tree,
since it has the same number of edges as F̄ , namely |V | − 1. Therefore it is an MST including
F ∪ {e}, and the theorem is proved.
[Margin figure: the edges e and e′ , with vertices u, v on either end of e and vertices x, y elsewhere in the tree.]
PROOF OF CLAIM. Pick any two vertices x and y. Since F̄ is a spanning tree, there
is a path in F̄ from x to y. If this path doesn’t use e′ then it is a path in F̂ , and we are done.
If it does use e′ , consider the paths x − y and u − v in F̄ , both of which use e′ . Either x and
u are on one side of e′ and y and v are on the other (as illustrated), or vice versa. Either
way, we can build a path x − y using e instead of e′ , by splicing in the relevant part of the
u − v path. This new path we’ve built is in F̂ . Since F̂ contains a path between any two
vertices x and y, it is connected. □

6.6. Kruskal’s algorithm


Another algorithm for finding a minimum spanning tree is due to Kruskal (1956). It makes
the same assumptions as Prim’s algorithm. Its running time is worse. It does however
produce intermediate states that are rather useful in data science.

GENERAL IDEA
Kruskal’s algorithm builds up the MST by agglomerating smaller subtrees together. At
each stage, we’ve built up some fragments of the MST. The algorithm greedily chooses two
fragments to join together, by picking the lowest-weight edge that will join two fragments.
(Kruskal’s algorithm maintains a ‘forest’. Look back at Section 5.1 for the definition.)

[Figure: three stages of the algorithm: four tree fragments have been found so far, including two trees that each consist of a single vertex; five candidate edges would join two fragments; pick the cheapest of the five candidate edges, and add it, thereby joining two fragments.]

PROBLEM STATEMENT
(Same as for Prim’s algorithm.) Given a connected undirected graph with edge weights,
construct an MST.

IMPLEMENTATION
We could scan through all the edges in the graph at every iteration, looking for the best
edge to add next. Or we could maintain a list of all candidate edges, pre-sorted in order of
increasing weight, and iterate through it; and every time we join two fragments, remove all
the edges from this list that have just become redundant.
Kruskal’s algorithm doesn’t do either of these. It uses a list of edges pre-sorted in order
of increasing weight, but it doesn’t do any housekeeping on the list—it just considers each
edge in turn, asks “does this edge join two fragments?”, and skips over the edges that don’t.
The data structure it uses to perform this test is called a DisjointSet. This keeps track of a
collection of disjoint sets (sets with no common elements), also known as a partition. (We’ll
study the DisjointSet data structure in Section 7.9.) Here we’re using it to keep track of
which vertices belong to which fragment.
Initially (lines 4–5) every vertex is in its own fragment. We iterate through all the
edges of the graph in order of edge weight, lowest edge weight first (lines 6–8), and for each
edge we test whether that edge’s ends belong to the same set (lines 9–11). If they belong
to different sets, we add that edge to the tree and merge the two sets (lines 12–13).

1       def kruskal(g):
2           tree_edges = []
3           partition = DisjointSet()
4           for v in g.vertices:
5               partition.addsingleton(v)
6           edges = sorted(g.edges, sortkey = lambda u, v, edgeweight: edgeweight)
7
8           for (u, v, edgeweight) in edges:
9               p = partition.getsetwith(u)
10              q = partition.getsetwith(v)
11              if p != q:
12                  tree_edges.append((u, v))
13                  partition.merge(p, q)
14                  # Let f be the forest made up of edges in tree_edges.
15                  # Assert: f is part of an MST
16
17          return tree_edges
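Here is a runnable sketch, with a minimal stand-in for the DisjointSet (union-find with path compression; the real data structure is deferred to a later section):

```python
def kruskal(vertices, edges):
    """edges: list of (u, v, weight) triples.  Returns the MST edges."""
    parent = {v: v for v in vertices}   # each vertex starts in its own fragment

    def find(x):
        # follow parent pointers to the fragment's representative,
        # halving paths along the way (a simple form of path compression)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree_edges = []
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:                    # edge joins two different fragments
            tree_edges.append((u, v))
            parent[ru] = rv             # merge the fragments
    return tree_edges
```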

ANALYSIS
Running time. The running time of Kruskal’s algorithm depends on how DisjointSet is im-
plemented. We’ll see in Section 7.4 that all the operations on DisjointSet can be done in
O(1) time13 . The total cost is O(E log E) for the sort on line 6; O(E) for iterating over edges
in lines 8–11; and O(V ) for lines 12–13, since there can be at most V merges. So the total
running time is O(E log E).
The maximum possible number of edges in an undirected graph is V (V − 1)/2, and the
minimum number of edges in a connected graph is V − 1, so log E = Θ(log V ), and so the
running time can be written O(E log V ).

Correctness. To prove that Kruskal’s algorithm finds an MST, we apply the theorem used
for the proof of Prim’s algorithm, as follows. When the algorithm merges fragments p and
q, consider the cut of all vertices into p versus not-p; the algorithm picks a minimum-weight
edge across this cut, and so by the theorem we’ve still got something that’s part of an MST.

APPLICATION
If we draw the tree fragments another way, the operation of Kruskal’s algorithm looks like
clustering, and its intermediate stages correspond to a classification tree:
[Figure, three panels: an undirected graph with edge weights; the MST found by Kruskal's algorithm; each fragment drawn as a subtree, with arcs drawn when two fragments are joined.]
This can be used for image segmentation. Here we’ve started with an image, put vertices on
a hexagonal grid, added edges between adjacent vertices, given low weight to edges where the
vertices have similar colour and brightness, run Kruskal’s algorithm to find an MST, split
the tree into clusters by removing a few of the final edges, and coloured vertices by which
cluster they belong to.

13 This is a white lie. The actual complexity is O(α_n) for a DisjointSet with n elements, where α_n is a
function that grows extraordinarily slowly.

6.7. Topological sort


A directed graph can be used to represent precedence or preference.

The prototypical application, for computer scientists, is representing computational depen-


dencies. In an Excel spreadsheet, for example, imagine a graph with a vertex for each cell
and edges between cells to indicate “cell v is a function of cell w, so w needs to be evaluated
before v” (you can see these edges with the menu option ‘Show Dependencies’). Excel needs
to figure out an order in which to evaluate all the cells, and this order needs to respect the
edge directions. Think of it as putting all the cells in a straight line, such that the edges
of the dependency graph all point in the same direction. Such an ordering is called a total
order or linear order.

[Figure: a DAG on vertices a, b, c, d with two attempted total orders marked 'ok', and a graph with a cycle through q and r whose attempted order is marked 'not ok'.]

For what graphs is it possible to put all the vertices into a total order? And, if it is possible,
how can we compute the total order? The picture above shows two simple graphs, and three
attempted total orders. The first two are valid total orders, and the third is not—and a
moment’s thought about the second graph tells us that it’s impossible to find a total order,
because of the cycle.

GENERAL IDEA
Recall depth-first search. After reaching a vertex v, it visits all v’s children and other
descendants. We want v to appear earlier in the ordering than all its descendants. So,
can we use depth-first search to find a total ordering?
Here again is the depth-first search algorithm. This is dfs_recurse from Section 5.2, but
modified so that it visits the entire graph (rather than just the part reachable from some
given start vertex).
1  def dfs_recurse_all(g):
2      for v in g.vertices:
3          v.visited = False
4      for v in g.vertices:
5          if not v.visited:
6              visit(v)   # start dfs from v
7
8  def visit(v):
9      v.visited = True
10     for w in v.neighbours:
11         if not w.visited:
12             visit(w)

A standard way to visualise program execution is with a flame chart. Time goes on
the horizontal axis, each function call is shown as a rectangle, and if function f calls function
g then g is drawn above f . Here is the flame chart for running dfs_recurse_all on a simple
graph.

[Figure: a simple graph on vertices a, b, c, d, and the flame chart for running dfs_recurse_all on it: dfs_recurse_all along the bottom, visit(a) and visit(b) above it, and visit(d) and visit(c) nested above those.]

And here, on the left, is what happens if we order vertices by when we visit them. It turns
out not to be a total order. A better guess is to order vertices by when visit(v) returns.

[Figure: the same flame chart annotated two ways. Ordering vertices by when visit(v) is called gives a d b c; ordering them by when visit(v) returns gives d a c b.]

PROBLEM STATEMENT
Given a directed acyclic graph (DAG), return a total ordering of all its vertices, such that if
v1 → v2 then v1 appears before v2 in the total order.

Don’t get muddled


Why the restriction to DAGs? Recall the definition. A cycle is a path from a vertex
by the word back to itself, following the edge directions, and a directed graph is called acyclic if it has
‘acyclic’. A DAG no cycles. If a graph is not a DAG then it has a cycle, and as we noted at the beginning
doesn’t have to be a
tree! The graph we of this section that means no total ordering is possible. It turns out that every DAG has a
ran dfs_recurse_all total ordering, as we’ll see from the proof of correctness below.
on is a DAG.

ALGORITHM
This algorithm is due to Knuth. It is based on dfs_recurse_all, with some extra lines (labelled
+). These extra lines build up a linked list for the rankings, as the algorithm visits and leaves
each vertex.

1   def toposort(g):
2       for v in g.vertices:
3           v.visited = False
4           # v.colour = 'white'
5 +     totalorder = []   # an empty list
6       for v in g.vertices:
7           if not v.visited:
8               visit(v, totalorder)
9 +     return totalorder
10
11  def visit(v, totalorder):
12      v.visited = True
13      # v.colour = 'grey'
14      for w in v.neighbours:
15          if not w.visited:
16              visit(w, totalorder)
17 +    totalorder.prepend(v)
18      # v.colour = 'black'

This listing also has some commented lines which aren’t part of the algorithm itself, but which
are helpful for arguing that the algorithm is correct. They’re a bit like assert statements:
they’re there for our understanding of the algorithm, not for its execution.

ANALYSIS
Running time. We haven’t changed anything substantial from dfs_recurse so the analysis in
Section 5.2 still applies: the algorithm obviously terminates (thanks to the visited flag, which
ensures we never visit a vertex more than once), and its running time is O(V + E).

Theorem (Correctness). The toposort algorithm returns totalorder which solves the problem
statement.

Proof. Pick any edge v1 → v2 . We want to show that v1 appears before v2 in totalorder. It’s
easy to see that every vertex is visited exactly once, and on that visit (1) it’s coloured grey,
(2) some stuff happens, (3) it’s coloured black. Let’s consider the instant when v1 is coloured
grey. At this instant, there are three possibilities for v2 :

• v2 is black. If this is so, then v2 has already been prepended to the list, so v1 will be
prepended after v2 , so v1 appears before v2 .
• v2 is white. If this is so, then v2 hasn’t yet been visited, therefore we’ll call visit(v2 ) at
some point during the execution of lines 14–16 in visit(v1 ). This call to visit(v2 ) must
finish before returning to the execution of visit(v1 ), so v2 gets prepended earlier and v1
gets prepended later, so v1 appears before v2 .
• v2 is grey. If this is so, then there was an earlier call to visit(v2) which we're currently
inside. The call stack corresponds to a path in the graph from v2 to v1: it looks like
visit(v2), . . . , visit(v1). But we've picked an edge v1 → v2, so there is a cycle, which is
impossible in a DAG. This is a contradiction, so it's impossible that v2 is grey. □

∗ ∗ ∗

The breakpoint proof technique. The proof technique we used here was (1) consider an instant
in time at which the algorithm has just reached a line of code; (2) reason about the current
state of all the variables, and the call stack, using mathematical logic; (3) make an inference
about what the algorithm does next. This is the same structure as the proof of correctness
of Dijkstra’s algorithm.
For this proof technique to work, we may need to store extra information about program
state, so that the mathematical reasoning can use it. In this case, we invented the variable
v.colour, which records a useful fact about what happened in the past. The algorithm doesn’t
need it, but it’s useful for the maths proof.

7. Advanced data structures


7.1. Aggregate analysis
When we design an algorithm, it’s good practice to find its worst-case running time, and we
typically express it using big-O notation. But when we design a data structure it’s more
useful to know the worst-case running time for a sequence of operations. Here are some
examples to illustrate why.

Example (Dijkstra’s algorithm / priority queue). Dijkstra’s algorithm uses the Priority Queue data
structure, which supports operations popmin, push, and decreasekey. A single run of Dijkstra’s
algorithm involves running time
O(V ) × popmin + O(E) × push / decreasekey. analysis of
Dijkstra’s algorithm:
section 5.4 page 14
To work out the worst-case cost of a run of Dijkstra’s algorithm, it doesn’t matter what the
worst-case cost is for a single call to popmin or the other operations, what matters is the
worst-case cost for the aggregate of all of the calls. ♢

But can’t we find the worst-case cost for the aggregate of many calls by simply adding
up the worst-case costs for each individual call? The next example shows why not. It
shows that if we simply add up worst-case costs of individual operations, we’ll get an unduly
pessimistic bound for the worst-case cost of the aggregate.

Example (Binomial heap). Suppose we have a binomial heap and we know it never has more
than N elements. (Binomial heaps: section 4.8.2.) The worst-case cost of inserting an
element is O(log N), for the case when there are O(log N) trees in the heap.

But what about two insertions in a row? We certainly can't hit this O(log N) worst case
twice in a row. Indeed, with some careful thought about how binomial heaps work, either the
first insertion or the second insertion must be O(1). With even more careful thought it can
be shown that the aggregate cost of N insertions is O(N). (This is similar to heapsort,
section 2.10, which takes O(N log N) to insert N items one by one but only O(N) to heapify
them in a batch.)

Another way to put it: if one operation ends up needing to do a lot of work, then it
will at least leave the data structure in a nice clean state so that the following operations are
fast. ♢

Example (HashTable). A hash table is typically stored as an array, and we increase the capacity
of the array and rehash all the existing items whenever the occupancy exceeds a certain
threshold. Java's HashMap, for example, rehashes once occupancy reaches 75%. (Hash table
with chaining: see section 4.7.) Most insertions don't trigger a rehash so they are very fast,
O(1) for a hash table with chaining; but every so often an insertion will trigger a rehash,
which takes O(N) where N is the number of items. So the worst case for a single insertion
is O(N)—but the worst case for N insertions grows slower than N². ♢

Advanced data structures are those that have been designed with cunning tricks so that
the work done by one operation can benefit subsequent operations. This is done to reduce
the aggregate cost of sequences of operations.
The study of advanced data structures is centered around analysing the worst-case
aggregate computational complexity of sequences of operations. This is known as aggregate
analysis.
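The hash-table example can be checked by simulation. The sketch below uses simplifying assumptions (a plain insertion costs 1; the table is resized at 100% occupancy rather than a real threshold like 75%, which only changes the constants; a rehash costs the number of items moved): the aggregate cost over N insertions comes out linear in N, even though a single insertion can cost N.

```python
# Simulating rehash-on-doubling: each insertion costs 1, and when the table is
# full we rehash into a table of double the capacity, at a cost equal to the
# number of items moved. Unit costs and the 100% threshold are simplifications.

def aggregate_insertion_cost(n):
    capacity, items, cost = 1, 0, 0
    for _ in range(n):
        if items == capacity:
            cost += items       # rehash: move every existing item
            capacity *= 2
        items += 1
        cost += 1               # insert the new item
    return cost

n = 10_000
assert aggregate_insertion_cost(n) < 3 * n   # linear, not quadratic, in n
```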

TIGHT BOUNDS AND WORST CASES
When we analyze computational complexity, it’s good practice to give tight big-O bounds on
the worst-case performance.
It’s technically correct to say that the worst-case cost of inserting N elements into a
N
binomial heap, starting from empty, is O(N log N ). It’s also O(N !), and O(N N ) ... these are
all technically correct, but pretty useless! What’s the worst case that can actually happen?
After looking at some example binomial heaps, we soon find we’re unable to construct a
scenario in which the cost is N log N ; this tells us that either the O(N log N ) bound isn’t
tight, or that we haven’t looked hard enough.
Formally, big-O notation provides an upper bound (refresher: section 2.3.2). When we
say "the worst-case cost is O(f(N))" what we really mean is: there exist N0 and κ > 0 such
that for all N ≥ N0,

    (worst-case cost for problem of size N) ≤ κ f(N).

To argue that this bound is tight, we should be able to demonstrate scenarios of cost at least
f(N). Formally14, we should give a constant κ′ > 0, and a sequence of scenarios i = 1, 2, . . .
of sizes N1, N2, . . . with Ni → ∞ such that

    (cost for scenario i) ≥ κ′ f(Ni).

Only if both of these bounds hold would we say that our big-O bound is tight.

14 The purists who want to say that the worst-case complexity is Θ(f(N)) would need to give N0 and κ′ > 0
and a scenario for every N ≥ N0 of cost ≥ κ′f(N). In the author's opinion, that's going beyond the call of
duty.

7.2. Amortized costs: introduction


“Amortize a loan” means “pay it off gradually by making regular repayments before the debt
finally becomes due”. In computer science, amortization is a cunning accounting trick for
reasoning about aggregate costs, based on pretending that one operation can “pay off” the
running time of a subsequent operation.

Here’s a simple illustration. Suppose we want to store a list of items, and we have to
support four operations:
class MinList<T>:
    append(T v)    # add a new item v to the list
    flush()        # empty the list
    foreach(f)     # do f(x) for each item in the list
    T min()        # get the minimum of all items in the list

The first three operations are straightforward; a simple linked list will do. To implement min,
here are four stages of enlightenment.

Stage 0. Simply iterate through the entire list every time we want to compute min.
This takes time Θ(N ), where N is the number of items in the list.
Stage 1. It’s a waste to redo the work involved in computing min. Instead, we should
remember the result of the last call to min, and keep a pointer to the tail of the list at the
time of that last call. Next time we need min we only need to iterate through the items
added after the last min or flush. It’s faster—but the worst case is still Θ(N ).
Stage 2. We could store the minimum of the entire list, and update it every time a
new item is added. This way, append and min are both O(1), and obviously we can’t do any
better than this! We’d describe this as “amortizing the computation of min”, meaning we
split the work up into small pieces done along the way.
Stage 3. If we count up the total amount of work for the Stage 2 implementation,
it’s no better than that for Stage 1. (It could even be worse, if flush is called before min.)
Morally, it’s unfair to say that Stage 2 has better running-time complexity. Instead, we
should just stick with the Stage 1 implementation and ‘pay off’ the running time of min in
early repayments, by ascribing its cost to the append calls that preceded it, and which caused
min all that work in the first place.

[Figure: a sequence append, append, append, min. The true costs are capp, capp, capp, and c1 + N c2; the amortized costs are capp + c2, capp + c2, capp + c2, and c1.]

We'd say "the amortized cost of append is capp + c2, and the amortized cost of min is c1".

WHY AMORTIZE COSTS?
The point of this accounting trick is to make it easier to get tight bounds on aggregate costs.

Consider the Stage 1 implementation, and suppose we perform a sequence of m1 ×
append interspersed with m2 × min, on an initially-empty list. According to the naive worst-
case analysis,
    The cost of each append is capp = O(1). The worst-case cost of min is O(N) where
    N = m1 is the maximum number of items in the list. So the worst-case aggregate
    cost is O(m1 + m2 m1).
This isn’t a tight bound. If we think more cleverly about aggregate costs, and about how
many times each of the m1 items gets ‘touched’ by min, we deduce that the aggregate cost is
actually O(m1 + m2 ). The point of amortized costs is to let us get to the same answer but
with less cleverness:

The amortized cost of each append is capp + c2 = O(1). The amortized cost of
each min is c1 = O(1). So the worst-case aggregate cost is m1 O(1) + m2 O(1) =
O(m1 + m2 ).
It’s a nuisance to always have to reason carefully about aggregate costs of sequences of
operations. Much better if someone (preferably someone else!) thinks hard about the best
way to ascribe costs, and tells us an amortized cost for each operation that the data structure
supports. Then it’s easy for us to work out aggregate costs, by just adding up amortized
costs. Amortization is nothing more than an accounting trick to make it easier to find tight
bounds on aggregate cost.

There’s a subtle caveat. In the analysis we just did, we snuck in the little phrase ‘on
an initially-empty list’. Suppose instead we had started with a list of N items, and we called
min once: then the cost of this single operation is N , and no clever accounting trick can turn
it into O(1).
In practice, this isn’t a problem. For example, when we set out to analyse Dijkstra’s
algorithm, which uses a Priority Queue, we do indeed start out with an initially-empty data
structure. It’s only when we start thinking too hard about what amortized analysis actually
means that things get confusing—and the best way to resolve confusion is with a rigorous
definition. That’s the topic of the next section.

7.3. Amortized costs: definition


We use amortized costs to reason about aggregate costs of sequences of operations.
We might read for example that a certain data structure “supports push at amortized
cost O(1), and popmin at amortized cost O(log N )”. This is just a way of writing a statement
about aggregate costs: it says that for any sequence comprising m1 push operations and m2
popmin operations, applied to an initially-empty data structure,

    (worst-case aggregate cost) ≤ m1 O(1) + m2 O(log N) = O(m1 + m2 log N).

FUNDAMENTAL INEQUALITY OF AMORTIZED ANALYSIS


Here is the formal definition. Let there be a sequence of m operations, applied to an
initially-empty data structure, whose true costs are c1 , c2 , . . . , cm . Suppose someone invents
c′1 , c′2 , . . . , c′m such that

c1 + · · · + cj ≤ c′1 + · · · + c′j for all j ≤ m ;

then we call these amortized costs. In words,

    (aggregate true cost of a sequence of operations) ≤ (aggregate amortized cost of those operations).

This is the fundamental inequality of amortized analysis. If someone else has told us amor-
tized costs, then this inequality tells us what we can do with them. If we are asked to find
amortized costs, this is the inequality we have to ensure is satisfied.
How does one invent amortized costs? Section 7.4 suggests a useful strategy—but for
now we’ll just take them as given.

Asymptotic usage. The fundamental inequality applies to sequences of any length. It is not
an asymptotic (big-O) statement.
Nevertheless, it is in the context of asymptotic analysis that we usually encounter
amortized costs. We might read for example “the amortized cost of push is O(1) and the
amortized cost of popmin is O(log N ), where N is an upper bound on the number of elements
stored”. This tells us that for any sequence of m1 × push and m2 × popmin operations, applied
to an initially empty data structure,

    (worst-case aggregate cost) ≤ m1 O(1) + m2 O(log N) = O(m1 + m2 log N).

More precisely: there exists N0 and κ > 0 such that, for any N ≥ N0 and for any such
sequence of operations on a data structure which always has ≤ N elements,

    (worst-case aggregate cost) ≤ κ(m1 + m2 log N).

Typically we’re interested in the case where m1 and m2 grow with N . It might be m1 = N 2
and m2 = 1, or m1 = log N and m2 = log log N , or anything at all—the fundamental
inequality has to hold for all sequences.

EXAMPLE: DYNAMIC ARRAY
Here is a more involved example. Consider a dynamically-sized array with initial capacity 1,
and which doubles its capacity whenever it becomes full. (To be precise: we maintain a
fixed-length array; when it becomes full then we allocate a new fixed-length array of double
the length, copy everything across from the old array, and deallocate the old array. We’ll
only consider appending, not deleting.)

initially empty
append:                              •
append, requires doubling capacity:  ••
append, requires doubling capacity:  •••
append:                              ••••
append, requires doubling capacity:  •••••

Suppose that the cost of writing an element is 1, and the cost of doubling capacity from
m to 2m and copying everything across is κm for some constant κ > 0.

Aggregate analysis. After adding n elements, the cost of the initial writes is n, and the total
cost from all of the doubling is

    κ(1 + 2 + · · · + 2^⌊log2(n−1)⌋)

which is ≤ κ(2n − 3). (Standard formula: 1 + r + · · · + r^(n−1) = (r^n − 1)/(r − 1).) Thus,
the total cost of n calls to append is ≤ κ(2n − 3) + n = O(n).

Amortized costs. Let’s ascribe a cost c′ = 2κ + 1 to each append operation. Then, for n calls
to append,

aggregate true cost ≤ n(2κ + 1) − 3κ ,


aggregate amortized cost = n(2κ + 1)

so the fundamental inequality “aggregate true cost ≤ aggregate amortized cost” is satisfied,
i.e. these are valid amortized costs. We’d write this as “the amortized cost of append is
O(1).”
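We can check the fundamental inequality numerically for this example. Taking κ = 1, the sketch below computes the true cost of each append (1 per write, plus the old capacity when a doubling happens) and verifies that every prefix sum of true costs is bounded by the corresponding prefix sum of the amortized costs 2κ + 1 = 3.

```python
# True costs of appends into a doubling array, checked against the claimed
# amortized cost 2*kappa + 1 per append (here kappa = 1, so 3).

def true_costs(n_appends, kappa=1):
    capacity, size, costs = 1, 0, []
    for _ in range(n_appends):
        c = 1                        # write the new element
        if size == capacity:
            c += kappa * capacity    # copy everything into the doubled array
            capacity *= 2
        size += 1
        costs.append(c)
    return costs

running_true, running_amortized = 0, 0
for c in true_costs(100):
    running_true += c
    running_amortized += 3
    assert running_true <= running_amortized   # holds for every prefix
```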

∗ ∗ ∗

Why do we keep on putting in the caveat “applied to an initially empty data structure”? It’s
easier to explain why this is needed, now that we have a rigorous definition and a concrete
example.
Suppose we start with a dynamic array that’s on the brink of needing to be expanded.
In other words, let it have capacity N , and let it hold N items. (Of course this requires that
N be a power of 2.) Now, consider a single append: this single operation has cost κN + 1.
If we wanted our fundamental inequality to hold for all possible sequences of operations and
all possible initial states of the data structure, we’d be forced to say that the amortized cost
of append is O(N )—but then we’d be back in the land of naive per-operation worst case
analysis, and that doesn’t give us tight bounds for other sequences of operations.
This is why formal statements about amortized costs generally have awkward phrasing
such as “consider any sequence of m operations on a data structure, initially empty, whose
size is always ≤ N ”, or alternatively “consider any sequence of m ≥ N operations on a data
structure, initially empty, where N of those operations are insertions”.

7.4. Potential functions


How do we come up with amortized costs? There’s a systematic approach that sometimes
gives useful answers, based on potential functions.
Given a data structure, there are many possible states it might be in. Here, ‘state’
includes absolutely everything: its capacity, its contents, how those contents are arranged,
which bits have pointers to which other bits, and so on. Let Ω be the set of all possible states.
A function Φ : Ω → R is called a potential function if

Φ(S) ≥ 0 for all S ∈ Ω , Φ(empty) = 0.

Here empty refers to the empty initial state. Now, consider an operation op which, when
applied to state Sante, yields state Spost, and which has true cost c. We'll write this as

    Sante −c→ Spost.

Define the modified cost c′ of op to be

    c′ = c + Φ(Spost) − Φ(Sante)

also written informally as c′ = c + ∆Φ.

Theorem (the ‘potential theorem’). The modified costs defined in this way are valid amortized
costs, i.e. they satisfy the fundamental inequality of amortized analysis.

This begs the question: how do we come up with potential functions? It’s generally
easier to come up with useful potential functions than it is to come up with amortized costs
from scratch. We’ll give some guidance in a moment, after looking at an example.

EXAMPLE ANALYSIS USING POTENTIAL FUNCTIONS
Consider the dynamic array from page 53, where the cost of writing an element is 1 and the
cost of doubling and copying from capacity m to 2m is κm. Define the potential function

    Φ = κ [ 2 × (num. items in array) − (capacity of array) ]⁺

(Notation: [x]⁺ means max(0, x).) This is clearly a valid potential function since it's ≥ 0
and = 0 at the initial empty state.
Let’s run through a sequence of append operations, to get a feel for how this potential
function behaves. We’ll annotate each operation with its true cost c and its amortized cost
c + ∆Φ.

initially empty                                          Φ = 0
append: c = 1, c + ∆Φ = 1 + κ
•                                                        Φ = κ
append with doubling: c = κ + 1, c + ∆Φ = 2κ + 1
••                                                       Φ = 2κ
append with doubling: c = 2κ + 1, c + ∆Φ = 2κ + 1
•••                                                      Φ = 2κ
append: c = 1, c + ∆Φ = 2κ + 1
••••                                                     Φ = 4κ
append with doubling: c = 4κ + 1, c + ∆Φ = 2κ + 1
•••••                                                    Φ = 2κ

In our run-through, for each operation the amortized cost turned out to be c + ∆Φ ≤ 2κ + 1.
Now let’s write out the full argument, to show that this holds in general. There are two ways
that append could play out:

[Figure: Φ plotted against the number of appends. While the capacity is unchanged, each append increases Φ by 2κ; just before a doubling the array has capacity m and m items, so Φ = κm, and just after it has capacity 2m and m + 1 items, so Φ = 2κ.]

• It could be that the capacity needs to double from m to 2m. The true cost of this
type of append is c = κm + 1. The change in potential is ∆Φ = 2κ − κm. Thus the
amortized cost is c + ∆Φ = 2κ + 1.
• Or it could be that we don’t need to double capacity. The true cost of this type of
append is c = 1. The change in potential is ∆Φ = 2κ. Again, the amortized cost is
c + ∆Φ = 2κ + 1.
• There is only one corner case that we need to worry about, the very first append. For
the initial empty state the number of items (zero) is less than half the capacity (one),
and so the [·]+ in the definition of the potential function kicks in. For every other
state, the number of items is ≥ half the capacity, and so the [·]+ doesn’t kick in. We’ve
already calculated the amortized cost of the very first append in the run-through above,
and we found it to be κ + 1.

In all cases the amortized cost of an append is ≤ 2κ + 1 = O(1).
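The case analysis can be checked mechanically: the sketch below computes c + ∆Φ for each append, with Φ = κ[2 × num. items − capacity]⁺, and confirms that the first append costs 1 + κ and every subsequent append costs exactly 2κ + 1.

```python
# Computing c + delta(Phi) for each append into the doubling array, where
# Phi = kappa * max(0, 2 * (num. items) - capacity).

def amortized_costs(n_appends, kappa=1):
    def phi(size, capacity):
        return kappa * max(0, 2 * size - capacity)
    capacity, size, out = 1, 0, []
    for _ in range(n_appends):
        c, new_capacity = 1, capacity
        if size == capacity:              # this append triggers a doubling
            c += kappa * capacity
            new_capacity = 2 * capacity
        out.append(c + phi(size + 1, new_capacity) - phi(size, capacity))
        size, capacity = size + 1, new_capacity
    return out

costs = amortized_costs(10)
assert costs[0] == 2                      # the corner case: 1 + kappa
assert all(c == 3 for c in costs[1:])     # 2*kappa + 1 for every other append
```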

In practice, we’d usually write out the amortized analysis using sloppier notation:

Assume that the cost of append is O(1) if there’s no doubling needed, and O(n) if we need to
double an array with n items. Define the potential function Φ = 2n − c where n is the number
of items in the array and c is the capacity. There are two ways that append could play out:
• If there are n items and the capacity is n, we need to double the capacity. The true cost
is c = O(n) and the change in potential is ∆Φ = [2(n + 1) − 2n] − [2n − n] = 2 − n, so
the amortized cost is O(n) + 2 − n = O(1).
• Otherwise we don’t need to double the capacity. The true cost is c = O(1) and the
change in potential is ∆Φ = 2, so the amortized cost is O(1) − 2 = O(1).
Our Φ function isn’t quite a potential function—we should have Φ ≥ 0 everywhere, and Φ = 0
at the initial empty state, and this isn’t the case. But we can special-case Φ(empty) = 0,
and this is a modification at just a single state so the asymptotic results are still valid. We
conclude that in all cases the amortized cost of append is O(1).

It’s perfectly fine to use this sloppy way of writing, as long as you understand the real
reasoning behind the otherwise preposterous equation “c+∆Φ = O(n)+2−n = O(1)”. What
it really means is “The true cost is ≤ κn, for n sufficiently large, and the change in potential
is ∆Φ = 2 − n. We should really have defined a different potential function, by multiplying
by κ. If we had done that, then the amortized cost would be ≤ κn + 2 − κn = O(1).”

DESIGNING POTENTIAL FUNCTIONS
Where do potential functions come from? We can invent whatever potential function we like,
and different choices might lead to different amortized costs of operation. If we are cunning
in our choice of potential function, we’ll end up with informative amortized costs. There is
no universal recipe. Here are some suggestions:

• Sometimes we want to say “this operation may technically have worst case O(n) but
morally speaking it should be O(1)”, as with MinList.min on page 51 and with Dynami-
cArray.append. To get amortized cost O(1), the potential has to build up to n before
the expensive operation, and drop to 0 after—or, to be precise, build up to Ω(n) and
drop to O(1). Think of Φ as storing up credit for the cleanup we’re going to have to
do.
• More generally, think of Φ as measuring the amount of stored-up mess. If an operation
increases mess (∆Φ > 0) then this mess will have to be cleaned up eventually, so we
set the amortized cost to be larger than the true cost, to store up credit. An operation
that does cleanup will decrease mess (∆Φ < 0), which can cancel out the true cost of
the cleanup operation.
• When you invent a potential function, make sure that Φ ≥ 0 and that Φ = 0 for the
initial empty data structure. Otherwise you’ll end up with spurious amortized costs.
This is a common source of errors! (See the example sheet.)
The goal of designing a potential function is to obtain useful amortized costs, and the goal
of amortized analysis is to get tight bounds on the worst-case performance of a sequence of
operations on a data structure. So, to figure out if our potential function is well-designed,
we should see if it gives tight bounds.
For example, suppose we've used our potential function to prove that “operation popmin
has amortized cost O(log N) for a data structure containing ≤ N items”. This proves that
the aggregate cost of m calls to popmin is15 O(m log N). To check whether this bound is
tight, we should look for a concrete example of a sequence of m operations that has cost
Ω(m log N). Typically, m will grow with N. If we can find such a lower bound, we know that
the amortized costs are tight. If on the other hand we had chosen a daft potential function,
we'd end up with an Ω–O gap.

Exercise.
Consider the MinList data structure from page 51, implemented using the Stage 1 method.
By designing a suitable potential function, show that append and min both have amor-
tized cost O(1).

Each append creates ‘mess’ in the form of values that will need to be trawled through by the
next call to min. So we want the potential to increase on each call to append, enough to ‘pay
for’ the work that the next min will do. So let’s define

Φ = num. items that min hasn’t processed.

Then the amortized cost of append is c + ∆Φ = O(1) + 1 = O(1), and the amortized cost of
min is O(L) + (0 − L) = O(1), where L is the number of items that it has to process. Thus
both of these operations have amortized cost O(1).
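We can test this potential function on a random sequence of operations. The sketch below takes the hidden O(1) constants to be 1, purely for illustration: true cost of append is 1, true cost of min is the number of unprocessed items; we charge amortized cost 2 per append (that's 1 + ∆Φ) and 1 per min, and check the fundamental inequality after every operation.

```python
# Testing the MinList potential function Phi = number of unprocessed items,
# on a random mix of append and min operations (O(1) constants taken as 1).
import random

random.seed(0)
unprocessed = 0                      # this is Phi
true_total, amortized_total = 0, 0
for _ in range(1000):
    if random.random() < 0.7:        # an append
        unprocessed += 1
        true_total += 1
        amortized_total += 2         # 1 + delta(Phi)
    else:                            # a min: scans all unprocessed items
        true_total += unprocessed
        unprocessed = 0
        amortized_total += 1
    assert true_total <= amortized_total   # the fundamental inequality
```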

15 This is an asymptotic statement in N, and m is allowed to depend on N. Remember that the fundamental
inequality of amortized analysis, the inequality on page 53, is required to hold for every sequence of operations.

PROOF OF THE 'POTENTIAL THEOREM'
The whole analysis using potential functions rests on the ‘potential theorem’. Let’s restate
it, and give a proof.

Theorem (the ‘potential theorem’). Let Φ be a potential function. For an operation that takes
the state from Sante to Spost , and that has true cost c, define the modified cost to be

c′ = c + Φ(Spost ) − Φ(Sante ).

The modified costs defined in this way are valid amortized costs. In other words they satisfy
the fundamental inequality of amortized analysis: for any sequence of operations

    (aggregate true cost of the sequence) ≤ (aggregate amortized cost of that sequence).

Proof. Consider a sequence of operations starting from an initially empty state,


c c c c
S0 −→
1
S1 −→
2
S2 −→
3
· · · −→
k
Sk

This ‘telescoping
where the true costs are c1 , c2 , . . . , ck . Then
sum’ trick also
appeared in the aggregate amortized cost
analysis of n o n o
Johnson’s algorithm,
section 5.8 page 25
= −Φ(S0 ) + c1 + Φ(S1 ) + −Φ(S1 ) + c2 + Φ(S2 )
n o
+ · · · + −Φ(Sk−1 ) + ck + Φ(Sk )
= c1 + · · · + ck − Φ(S0 ) + Φ(Sk )
= aggregate true cost − Φ(S0 ) + Φ(Sk )

hence

aggregate true cost


= aggregate amortized cost + Φ(S0 ) − Φ(Sk )
≤ aggregate amortized cost + Φ(S0 ) (since Φ ≥ 0)
= aggregate amortized cost (since Φ = 0 for the initial empty state S0 ).

Thus the costs c′ that we defined are indeed valid amortized costs. □

7.5. Three priority queues


All of the work in this section is a build-up to a very advanced implementation of the Priority
Queue, called the Fibonacci Heap.
AbstractDataType PriorityQueue:
    # Holds a dynamic collection of items.
    # Each item has a value/payload v, and a key/priority k.

    # Extract the item with the smallest key
    Pair<Key, Value> popmin()

    # Add v to the queue, and give it key k
    push(Value v, Key k)

    # For a value already in the queue, give it a new (lower) key
    decreasekey(Value v, Key newk)

    # Sometimes we also include methods for:
    #   merge two priority queues
    #   delete a value
    #   peek at the item with smallest key, without removing it

It’s useful to review the two implementations that we’ve already seen. We’ll also look at a
third, very simple implementation, the linked-list implementation, as a thought experiment
to sharpen our thinking and to highlight where there’s room for improvement. This table
summarizes the running times for those three implementations, as well as for the Fibonacci
Heap. In this table N is the maximum number of items in the heap.16

                 popmin               push             decreasekey
binary heap      O(log N)             O(log N)         O(log N)
binomial heap    O(log N)             O(1) amortized   O(log N)
linked list      O(N)                 O(1)             O(1)
Fibonacci heap   O(log N) amortized   O(1) amortized   O(1) amortized

BINARY HEAP*

*This section of notes is a recap of section 4.8.1. A binary heap is an almost-full
binary tree (i.e. every level except the bottom is full), so its height is ⌊log2 n⌋ where n is the
number of elements. (Notation: ⌊x⌋ is the floor of x, i.e. ⌊x⌋ ≤ x < ⌊x⌋ + 1.) It satisfies the
heap property (each node’s key is ≤ its children’s), so the minimum of the entire heap can
be found at the root.

[Figure: a binary heap with root 1. The heap property: each node’s key is ≤ those of its children.]

To implement popmin we extract the root item, replace the root by the end element, and then
bubble it far enough down so as to satisfy the heap property. The number of bubble-down
steps is limited by the height of the tree, so popmin is O(log n).

16 For naive worst-case analysis, N is simply the number of items in the heap when we do the operation.

But amortized analysis applies to sequences of operations, and the number of items will fluctuate over those
operations, and so we have to instead define N as “upper bound on the number of items in the heap over the
sequence of operations in question”.

[Figure: popmin on a binary heap. The root is replaced by the end element 3, which is then bubbled down until the heap property holds again.]

To implement push we append the new item to the very end, and then bubble it far enough
up the tree so as to satisfy the heap property. Again, the number of bubble-up steps is
O(log n). And decreasekey is very similar.
[Figure: push on a binary heap. The new item 0 is appended at the end, then bubbled up until the heap property holds; decreasekey bubbles up in the same way.]
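The bubbling operations above can be sketched in array form. This is a minimal sketch of my own, not the course’s reference implementation; it uses the standard array layout in which the children of index i sit at indices 2i+1 and 2i+2.

```python
# Minimal array-based binary min-heap. The array layout makes the tree
# almost-full by construction: children of index i live at 2*i+1 and 2*i+2.

def push(heap, key):
    heap.append(key)                  # add the new item at the very end ...
    i = len(heap) - 1
    while i > 0 and heap[(i - 1) // 2] > heap[i]:
        heap[i], heap[(i - 1) // 2] = heap[(i - 1) // 2], heap[i]
        i = (i - 1) // 2              # ... and bubble it up: O(log n) steps

def popmin(heap):
    heap[0], heap[-1] = heap[-1], heap[0]   # replace the root by the end element
    m = heap.pop()
    i = 0
    while True:                       # bubble the new root down
        kids = [j for j in (2 * i + 1, 2 * i + 2) if j < len(heap)]
        c = min(kids, key=lambda j: heap[j], default=None)
        if c is None or heap[i] <= heap[c]:
            return m
        heap[i], heap[c] = heap[c], heap[i]
        i = c
```

Pushing a batch of keys and popping repeatedly returns them in sorted order, since each popmin restores the heap property before returning.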

BINOMIAL HEAP*

*This section of notes is a recap of section 4.8.2. A binomial tree of degree 0 is a
single node. A binomial tree of degree k is a tree obtained by combining two binomial trees
of degree k − 1, by appending one of the trees to the root of the other.
[Figure: binomial trees of degrees 0, 1, 2, and 3, each rooted at the node with key 2.]

A binomial heap is a collection of binomial trees, at most one for each tree degree, each
obeying the heap property i.e. each node’s key is ≤ those of its children. Here is a binomial
heap consisting of one binomial tree of degree 0, and one of degree 3. (The dotted parts in
the middle indicate ‘there is no tree of degree 1 or 2’.)

[Figure: a binomial heap with a degree-0 tree (root 3) and a degree-3 tree (root 1).]

Here are some basic properties of binomial trees and heaps.


1. A binomial tree of degree k has 2^k nodes
2. A binomial tree of degree k has height k
3. In a binomial tree of degree k, the root node has k children (which is why we call it
‘degree’)
4. In a binomial tree of degree k, the root node’s k children are binomial trees of all the
degrees k − 1, k − 2, . . . , 0.
5. In a binomial heap with n nodes, the 1s in the binary expansion of the number n
correspond to the degrees of trees contained in the heap. For example, a heap with 9
nodes (binary 1001 = 2^3 + 2^0 ) has one tree of degree 3 and one tree of degree 0.
6. If a binomial heap contains n nodes, it contains O(log n) binomial trees, and the largest
of those trees has degree O(log n).
The operations on binomial heaps end up resembling binary arithmetic, thanks to property 5.

push(v, k) is O(log n):


Treat the new item as a binomial heap with only one node, and merge it as described
below, at cost O(log n), where n is the total number of nodes. It can be shown that
the amortized cost is O(1) — see the example sheet.
decreasekey(v, newk) is O(log n):
Proceed as with a normal binary heap, applied to the tree to which v belongs. The
entire heap has O(n) nodes, so this tree has O(n) nodes and height O(log n), so the
cost of decreasekey is O(log n).
popmin() is O(log n):
First scan the roots of all the trees in the heap, at cost O(log n) since there are that
many trees, to find which root to remove. Cut it out from its tree. Its children form a
binomial heap, by property 4. Merge this heap with what remains of the original one,
as described below, at cost O(log n).
merge(h1, h2) is O(log n):
To merge two binomial heaps, start from degree 0 and go up, as if doing binary addition,
but instead of adding digits in place k we merge binomial trees of degree k, keeping the
tree with smaller root on top. If n is the total number of nodes in both heaps together,
then there are O(log n) trees in each heap, and O(log n) operations in total.
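The ‘binary addition’ structure of merge can be sketched as follows. This is a toy encoding of my own, not the course’s: a tree is a pair (key, children), its degree is the number of children, and a heap is a dict mapping each degree to the tree of that degree (if any).

```python
# Merging two binomial heaps works like adding two binary numbers:
# at each degree d there may be a 'digit' from h1, a 'digit' from h2,
# and a carry; two trees of equal degree link into one of degree d+1.

def link(t1, t2):
    """Combine two trees of equal degree, keeping the smaller root on top."""
    (k1, c1), (k2, c2) = t1, t2
    return (k1, c1 + [t2]) if k1 <= k2 else (k2, c2 + [t1])

def merge(h1, h2):
    out, carry = {}, None
    top = max(list(h1) + list(h2) + [-1])
    for d in range(top + 2):                     # +2 so the last carry is flushed
        trees = [t for t in (h1.get(d), h2.get(d), carry) if t is not None]
        if len(trees) % 2 == 1:
            out[d] = trees.pop()                 # this degree's digit of the sum
        carry = link(trees[0], trees[1]) if len(trees) >= 2 else None
    return out
```

Pushing an item is then merge(h, {0: (k, [])}); after nine pushes the heap holds trees of degrees 0 and 3, matching 9 = binary 1001.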

LINKED LIST PRIORITY QUEUE
Here’s a very simple priority queue. It uses a doubly-linked list to store all the items, and it
also keeps a pointer to the smallest item.
[Figure: a doubly-linked list of items; ‘first’ points to the head of the list, and ‘minitem’ points to the item with the smallest key.]

push(v, k) is O(1):
just attach the new item to the front of the list, and if k < minitem.key then update
minitem

decreasekey(v, newk) is also O(1):


update v’s key, and if newk < minitem.key then update minitem
popmin() is O(n):
we can remove minitem in O(1) time, but to find the new minitem we have to traverse
the entire list.
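A minimal sketch of this implementation (my own; it uses a plain Python list where a real implementation would use a doubly-linked list, so that removal in popmin is genuinely O(1)):

```python
# Linked-list priority queue: push and decreasekey do no tidying at all;
# popmin pays the price by rescanning the whole list.

class ListPQ:
    def __init__(self):
        self.items = []              # each item is a mutable pair [value, key]
        self.minitem = None          # pointer to the pair with the smallest key

    def push(self, v, k):            # O(1)
        item = [v, k]
        self.items.append(item)
        if self.minitem is None or k < self.minitem[1]:
            self.minitem = item

    def decreasekey(self, item, newk):   # O(1)
        item[1] = newk
        if newk < self.minitem[1]:
            self.minitem = item

    def popmin(self):                # O(n): must rescan to find the new minimum
        v, k = self.minitem
        self.items.remove(self.minitem)
        self.minitem = min(self.items, key=lambda it: it[1], default=None)
        return v, k
```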

7.6. Fibonacci heap


The Fibonacci heap is a fast priority queue. It was developed by Fredman and Tarjan in
1984, specifically to speed up Dijkstra’s algorithm.

GEN ER A L I D EA
Here is an outline of the thinking that led to the Fibonacci heap.

1. On a graph with V vertices and E edges, Dijkstra’s algorithm might make V calls to
popmin, and E calls to push and/or decreasekey. Since E might be as big as Ω(V^2 ),
push and decreasekey are the common operations, so we want them to be O(1).

2. To make push and decreasekey be O(1), they have to be lazy and they should only
‘touch’ a small part of the data structure. The linked list implementation shows us one
way we can make push be lazy — just dump new nodes into a list, with no further
tidying. Nor should decreasekey do any tidying — if decreasing a key leads to a heap
violation, then the offending node should just be dumped into a list, with no further
tidying.
The binomial heap actually uses a ‘binary counter’ structure so that, most of the time,
push only needs to touch a few small trees, and this gives amortized cost O(1). The
Fibonacci heap uses the same trick, but inside popmin rather than push.

3. In a heap, a call to popmin extracts the root of a tree, and so all of its children need
to be processed. Thus popmin has complexity Ω(d) where d is the number of children,
also called the degree.

4. We don’t want our heap to have any wide shallow trees — they would require popmin
to do a lot of work, just as bad as the simple linked list implementation. We need to
limit the degree of a node in the heap. We need a mechanism to ensure that the trees
are deep and bushy, as they are in a binomial tree.

5. We shall from time to time do expensive housekeeping. If we’re accumulating mess


that will have to be cleaned up anyway, why not just clean up as we go? The heart of
the answer lies in our analysis of heapsort in Section 2.10. We saw that it takes time
O(n log n) to add n items to a binary heap one by one, but only O(n) to heapify them
in a batch.
Doing housekeeping in batches is the big idea behind the Fibonacci heap. The house-
keeping is split between popmin and decreasekey. We will invent a potential function
to help us reason about the amount of housekeeping.

HOW TO PUSH AND POPMIN
The Fibonacci heap, like the binomial heap, stores a list of heaps. Unlike the binomial heap,
the trees can have any shape. Like the linked list priority queue, we’ll keep track of minroot,
the smallest element in the data structure, which must of course be the root of one of the
heaps.
1   # Maintain a list of heaps (i.e. store a pointer to the root of each heap)
2   roots = []
3
4   # Maintain a pointer to the smallest root
5   minroot = None
6
7   def push(v, k):
8       create a new heap h consisting of a single item (v, k)
9       add h to the list of roots
10      update minroot if k < minroot.key
11

12  def popmin():
13      take note of minroot.value and minroot.key
14      delete the minroot node, and promote its children to be roots
15      # cleanup the roots
16      while there are two roots with the same degree:
17          merge those two roots, by making the larger root a child of the smaller
18      update minroot to point to the smallest root
19      return the value and key from line 13

[Figure: popmin on a Fibonacci heap. 1. popmin extracts the minroot (key 1); 2. it promotes the minroot’s children to be roots; 3. it merges trees of equal degree, and updates minroot.]
In this simple version, with only push and popmin, one can show by induction that the
Fibonacci heap consists at all times of a collection of binomial trees, and that after the
cleanup in lines 16–17 it is a binomial heap. (See the example sheet.)

It doesn’t matter how the cleanup is implemented, as long as it is done efficiently. Here is
an example implementation.
20  def cleanup(roots):
21      root_array = [None, None, ...]   # empty array
22      for each tree t in roots:
23          x = t
24          while root_array[x.degree] is not None:
25              u = root_array[x.degree]
26              root_array[x.degree] = None
27              x = merge(x, u)
28          root_array[x.degree] = x
29      return list of non-None values in root_array
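Lines 1–29 can be rendered as runnable Python roughly as follows. This is a sketch under my own simplifications: a node is a dict, a node’s degree is the length of its children list, and minroot is found by scanning the root list rather than maintained as a pointer.

```python
# Simple Fibonacci heap supporting only push and popmin. The cleanup
# inside popmin merges roots of equal degree, exactly as in lines 16-17.

def push(roots, v, k):
    roots.append({'key': k, 'value': v, 'children': []})   # lazy: just dump it

def popmin(roots):
    m = min(roots, key=lambda n: n['key'])   # minroot, here found by scanning
    roots.remove(m)
    roots.extend(m['children'])              # promote the children to roots
    root_array = {}                          # cleanup, as in lines 20-29
    for t in roots:
        x = t
        while len(x['children']) in root_array:
            u = root_array.pop(len(x['children']))
            if u['key'] < x['key']:
                x, u = u, x                  # keep the smaller root on top
            x['children'].append(u)          # larger root becomes a child
        root_array[len(x['children'])] = x
    roots[:] = root_array.values()
    return m['value'], m['key']
```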

HOW TO DECREASEKEY

If we can decrease the key of an item in-place (i.e. if its parent is still ≤ the new key), then
that’s all that decreasekey needs to do. If however the node’s new key is smaller than its
parent, we need to do something to maintain the heap. We’ve already discussed why it’s a
reasonable idea to be lazy—to just cut such a node out of its tree and dump it into the root
list, to be cleaned up in the next call to popmin.
There is however one extra twist. If we just cut out nodes and dump them in the root
list, we might end up with trees that are shallow and wide, even as big as Ω(n), where n is
the number of items in the heap. This would make popmin very costly, since it has to iterate
through all minroot’s children.

To make popmin reasonably fast, we need to keep the maximum degree small. The Fibonacci
heap achieves this via two rules:

1. Lose one child, and you get marked as a ‘loser’ node.


2. Lose two children, and you get dumped into the root list (and your mark is removed).

This ensures that the trees end up with a good number of descendants. Formally, we’ll
show in Section 7.8 that a tree with degree d contains ≥ 1.618^d nodes, and hence that the
maximum degree in a heap of n items is O(log n). (Similarly, a binomial tree of degree k
has 2^k nodes, which implies a binomial heap of n items has maximum degree O(log n).)

[Figure: three successive decreasekeys on a tree rooted at 1. First, decreasekey from 5 to 3: the new key still satisfies the heap condition, so nothing moves. Next, decreasekey again to 0: the node is cut out and moved to the root list, and its parent is marked as a loser. Finally, decreasekey from 7 to 1: the node is moved to the root list, and its parent, having now lost two children, is dumped into the root list as well (and its loser mark removed).]
7.6 Fibonacci heap 65

30  # Every node will store a flag, p.loser = True/False
31
32  def decreasekey(v, k′):
33      let n be the node where this value is stored
34      n.key = k′
35      if n violates the heap condition:
36          repeat:
37              p = n.parent
38              remove n from p.children
39              insert n into the list of roots, updating minroot if necessary
40              n.loser = False
41              n = p
42          until p.loser == False
43          if p is not a root:
44              p.loser = True
45
46  def popmin():
47      mark all of minroot's children as loser = False
48      then do the same as in the simple version, lines 13–19

Here is another example of the operation of decreasekey, this time highlighting what happens
if there are multiple loser ancestors.

[Figure: a decreasekey whose violating node has several loser ancestors: first the heap violator is dumped into the root list, then each double-loser ancestor in turn is dumped into the root list as well.]
or

7.7. Implementing the Fibonacci heap∗


* This section of notes is not examinable.

There’s a question that gets asked every year when students learn about the Fibonacci heap:
“How do we find the node that we want to call decreasekey on? Do we trawl through the
entire heap? Isn’t this O(N )?”
The mental model behind this question is as follows. Suppose we have a Fibonacci
heap in which each node has a pointer to a parent node (except for root nodes which don’t
have parents) as well as to child nodes; and suppose each node also contains a payload, for
example an object representing a graph vertex. If we’re given a graph vertex w and we want
to call decreasekey we’d have to first find the Fibonacci heap node that contains it, and then
perhaps rewire the Fibonacci heap’s parent/child pointers.
def dijkstra(g, s):
    ...
    toexplore = PriorityQueue()
    toexplore.push(s, key=0)
    while not toexplore.isempty():
        v = toexplore.popmin()
        for (w, edgecost) in v.neighbours:
            ...
            # we may do one of these two:
            toexplore.push(w)
            toexplore.decreasekey(w)

[Figure: a Fibonacci heap node with a parent pointer, a payload w, and children pointers.]
There’s something fishy about this question. There are two competing worldviews here—one
view says that vertices are nodes in a Fibonacci heap, the other that they are objects in a
graph—and why should one worldview take precedence over the other?17

[Figure: the same five vertices a–e shown twice: once as vertices in a graph, where arrows show graph-neighbours, and once where edges show heap-relationships.]

Here’s a more thoughtful way to formulate the problem: When we use the Fibonacci heap
as the priority queue in Dijkstra’s algorithm, each vertex is doing double duty: it is at once
both a vertex in the graph and also a node in the Fibonacci heap. They are participating in
two data structures simultaneously. How should we implement this?

NAIVE SOLUTION
The naive solution is to just throw everything into a single class, as in the code below. Every
VertexNode object does double duty. We don’t need to look up the Fibonacci heap node that
contains a graph vertex w, because the node is the vertex.
class VertexNode:
    # used by Dijkstra:
    List<VertexNode> graph_neighbours
    float distance

17 In the five years I’ve been teaching this material, every year students have asked “how do you find the node you want to call decreasekey on?”. But not once have they asked the question the other way round: not once have they started from the mental model “each vertex object contains a payload that is a node object” and asked the obvious question “When we call popmin, and find the minroot node, how do we find the vertex object that contains it? Do we trawl through the entire graph? Isn’t this O(V)?”

    VertexNode come_from

    # used by FibHeap:
    VertexNode fib_parent
    List<VertexNode> fib_children
    float key
    bool is_loser

This is an ugly solution because it ties the two data structures together. All the library
code for the Fibonacci heap is tied to this particular class, and so it isn’t reusable by other
programmers who aren’t interested in graphs and just want their own priority queue. Can
we tease the two data structures apart without paying a price?

EXPLOSION IN THE CLASS FACTORY
Here’s an implementation for programmers who haven’t outgrown an infatuation with classes
and templates. We’ll let Vertex represent a vertex in the graph, Node represent a node in
the Fibonacci heap, and we’ll declare that each Vertex uses a Node to store its heap-related
pointers.
[Figure: the example graph again; each Vertex holds a Node for its heap-related pointers.]

The heap-related pointers stored by Node have to point to Vertex objects, rather than to Node
objects, since otherwise we’re stuck with the problem “how do I find the Vertex object for a
given Node?” But we want Node to be general purpose, not tied to a graph. The solution is
to use interfaces and templates, to allow the Fibonacci heap routines to ‘see through’ a Vertex
to get to its Node without needing to know anything about how Vertex is implemented.

class FibHeap<T extends FibHeap.Nodeable<T>>:

    class Node<T>:
        float key
        int degree
        T parent
        List<T> children

    interface Nodeable<T>:
        Node<T> get_fib_node()

    List<T> roots
    def T popmin(): ...
    def push(T value, float key): ...
    def decrease_key(T value, float newkey): ...

class Graph:
    class Vertex implements FibHeap.Nodeable<Vertex>:
        List<Pair<Vertex, float>> neighbours
        float distance
        Vertex? come_from
        FibHeap.Node<Vertex> pqn
        FibHeap.Node<Vertex> get_fib_node(): return pqn
    def compute_shortest_paths_from(Vertex s):
        ...

DYNAMIC ARCHITECTURE
A cleaner approach is to label each vertex by an id, and to use two hash tables: one to look
up a vertex given its id, and another to look up a heap node given the id. (Often there is a
natural id to use for each vertex, for example, the node id in an OpenStreetMap graph, or
the primary key of the graph as stored in a database.) The interfaces for both the Fibonacci
heap and the graph are defined in terms of ids.

[Figure: two hash tables keyed by the same ids 'a'–'e': vertices maps each id to its Vertex object, and nodes maps each id to its heap node.]

The advantage of this architecture is that it keeps the code for graph traversal entirely
separate from that for the priority queue. This would be especially suitable if we start with
a graph and we don’t know ahead of time which algorithms we’ll need to run on it; in this
scenario it doesn’t make sense to have a Vertex class that embodies a particular storage
requirement.
class FibHeap:
    HashMap<Id, FibHeapNode> nodes
    class FibHeapNode:
        Id id
        float key
        int degree
        FibHeapNode parent
        List<FibHeapNode> children
    List<FibHeapNode> roots
    def Id popmin(): ...
    def push(Id node, float key): ...
    def decrease_key(Id node, float newkey): ...

class Graph:
    HashMap<Id, Vertex> vertices
    class Vertex:
        Id id
        List<Pair<Vertex, float>> neighbours
        float distance
        Vertex? come_from
    def compute_shortest_paths_from(Id s):
        ...

NITTY GRITTY
It’s worth thinking in a little more detail how exactly to store the parent and children pointers
of each node in the Fibonacci heap. We have to perform various slicing and rearranging
operations on these pointers, and it would be silly to put in lots of effort designing a very clever
amortized design and then waste it all with an inefficient implementation! The manipulations
we want to perform are:
• slice a node out of a tree in O(1)
• add a node to the root list in O(1)
• merge two trees in O(1)
• iterate through a root’s children in O(num. children)

These can all be achieved by keeping enough pointers around. We can use a circular doubly-
linked list for the root list; and the same for each list of siblings; and we’ll let each node point
to its parent; and each parent will point to one of its children. So, for example, to iterate
through a node’s children, we first follow the down-pointer to get to one of the children, then
we follow the sibling pointers around until they bring us back to the initial child.
[Figure: each node stores its degree, its is_loser flag, its key, a parent pointer, two sibling pointers, and a pointer to one child; illustrated for a small Fibonacci heap with roots 1 and 2, where 1 has children 4 and 3.]

7.8. Analysis of Fibonacci heap


We’ll now compute the amortized costs of the various operations on the Fibonacci heap, using
the potential function

Φ = (number of roots) + 2 × (number of loser nodes).
Let n be the number of items in the heap, and let dmax be an upper bound on the degree of
any node in a Fibonacci heap (we’ll see soon that dmax = O(log n) is suitable).
push() : amortized cost O(1)
This just adds a new node to the root list, so the true cost is O(1). The change in
potential is ∆Φ = 1, so the amortized cost is O(1).
popmin() : amortized cost O(dmax )
Let’s split this into three parts. First, cut out minroot and promote its children to
the root list. There are at most dmax children to promote, so the true cost is O(dmax ).
These children get promoted to root, and maybe some of them lose the loser mark, so
∆Φ ≤ dmax . So the amortized cost for this first part is O(dmax ).
The second part is running cleanup. Line 21 initializes an array, in which we will store
a tree of degree d in root_array[d]. We defined dmax to be the largest possible degree
in a Fibonacci heap with n items, so an array of size dmax + 1 is sufficient,18 and it takes
O(dmax ) to initialize the array. Now, suppose that cleanup starts with x trees, does
M merges, and ends up with y trees. The total true cost is

    cost = O(x + M + y + dmax )
             [O(x) for processing each tree, lines 22–23 & 28; O(M ) for the
             merges, lines 24–27; O(y) for the final copy out of root_array,
             line 29; O(dmax ) for initializing the array]
         = O(2y + 2M + dmax )
             [every merge decreases the number of trees by 1, so x − M = y]
         = O(M + dmax )
             [we end up with at most one tree per cell of the array, so y ≤ dmax + 1].

And ∆Φ = −M , because our M merges decreased the number of roots by M . Thus the
amortized cost is O(M + dmax ) − M = O(dmax ).
The third and final part is fixing up minroot, line 18. This can be done by scanning
through the list of trees, which is O(dmax ).
decreasekey() : amortized cost O(1)
It takes O(1) elementary operations to decrease the key. If the node doesn’t have to
move, then Φ doesn’t change, so amortized cost = true cost = O(1). If the node does
have to move, the following happens:
• We move the node, call it a, to the root list. The true cost is O(1), and Φ increases
  by ≤ 1: it increases by 1 if a wasn’t a loser, and decreases by 1 if it was.
• Some of a’s loser ancestors are moved to the root list. Say there are L of these.
  The true cost of moving is O(L). And ∆Φ = L − 2L: the L is because of the new
  roots, and the −2L is because their loser marks are erased. Thus the amortized
  cost of this step is zero, regardless of L.
• One ancestor might have to be marked as a loser. The true cost is O(1), and Φ
  increases by 2, so the amortized cost is O(1).
We can see now why the potential function was chosen just so. The two operations popmin
and decreasekey both include an uncontrolled number of steps, M for popmin and L for
decreasekey. But these steps have already been ‘paid for’, by the operations that increased
Φ, so they contribute zero to the amortized cost.

18 A common question: ‘Doesn’t it need to be size dmax + 2, in case there were two trees of degree dmax before the cleanup, resulting in a tree of degree dmax + 1 after the cleanup?’ No. We defined dmax to be the maximum possible degree in the Fibonacci heap, so the maximum degree even after cleanup is by definition ≤ dmax.

BOUNDING THE SHAPE OF THE TREES
The amortized cost of popmin is O(dmax ), where dmax is the maximum number of children of
any of the nodes in the heap. The peculiar mechanism of decreasekey was designed to keep
dmax small. How small?

Theorem (Fibonacci shape theorem). If a node in a Fibonacci heap has d children, then the
subtree rooted at that node has ≥ Fd+2 nodes, where F1 , F2 , . . . are the Fibonacci numbers.
(F1 = 1, F2 = 1, F3 = 2, F4 = 3, F5 = 5, . . . ; the general formula for Fn is
(ϕ^n − (−ϕ)^{−n})/√5.)

It’s a mathematical fact from linear algebra that Fd+2 ≥ ϕ^d where ϕ is the golden ratio,
ϕ = (1 + √5)/2 ≈ 1.618. It’s a simple exercise (left to the example sheet) to deduce from
this fact and the theorem that dmax = O(log n).
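The fact is easy to check numerically (illustrative only; since √5 ≈ 2.236, F_{d+2} ≈ ϕ^{d+2}/√5 ≈ 1.17 ϕ^d, so the inequality holds with room to spare):

```python
# Check that F_{d+2} >= phi^d: a root with d children therefore guards
# at least phi^d nodes, forcing d = O(log n).

phi = (1 + 5 ** 0.5) / 2         # the golden ratio, about 1.618
F = [0, 1]                       # F[n] is the nth Fibonacci number
for _ in range(60):
    F.append(F[-1] + F[-2])

assert all(F[d + 2] >= phi ** d for d in range(40))
```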

Proof (Fibonacci shape theorem). Consider an arbitrary node x in a Fibonacci heap, at some
point in execution, and suppose it has d children, call them y1 , . . . , yd in the order of when
they last became children of x. (There may be other children that x acquired then lost in
the meantime, but we’re not including those.)

[Figure: node x with children y1, . . . , yd, in the order in which they last became children of x.]

When x acquired y2 , x already had y1 as a child, so y2 must have had ≥ 1 child seeing
as it got merged into x. Similarly, when x acquired y3 , y3 must have had ≥ 2 children, and
so on. After x acquired a child yi , that child might have lost a child, but it can’t have lost
more because of the rules of decreasekey. Thus, at the point of execution at which we’re
inspecting x,

y1 has ≥ 0 children
y2 has ≥ 0 children
y3 has ≥ 1 child, . . .
yd has ≥ d − 2 children.

Now for some pure maths. Consider an arbitrary tree all of whose nodes obey the
grandchild rule “a node with children i = 1, . . . , d has at least i − 2 grandchildren via child
i”. Let Nd be the smallest possible number of nodes in a subtree whose root has d children.
Then

    Nd = Nd−2 + Nd−3 + · · · + N0 + N0 + 1
         (child d) (child d−1) ... (child 2) (child 1) (the root).

The terms after the first sum to Nd−1 , so Nd = Nd−2 + Nd−1 , the defining equation for the
Fibonacci sequence, hence Nd = Fd+2 .
We’ve shown that the nodes in a Fibonacci heap obey the grandchild rule, therefore
the number of nodes in the subtree rooted at x is ≥ Fd+2 where d is the number of children
of x. □

7.9. Disjoint sets


The DisjointSet data structure (also known as union-find or merge-find) is used to keep track
of a dynamic collection of items in disjoint sets. We used it in Kruskal’s algorithm for finding
a minimum spanning tree: each vertex of the graph corresponds to an item in the collection,
and we used sets to track which vertices we had joined together into forest fragments.
In this section we’ll first look at three ways to implement this data structure. These
implementations are interesting in their own right, but what’s more interesting is to see
them as embodiments of the same sort of design strategies that led to the Fibonacci heap;
and so we’ll conclude the section by drawing out the analogy.

First, here’s the specification of the DisjointSet abstract data type.

AbstractDataType DisjointSet:
    # Holds a dynamic collection of disjoint sets

    # Return a handle to the set containing an item.
    # The handle must be stable, as long as the DisjointSet is not modified.
    Handle get_set_with(Item x)

    # Add a new set consisting of a single item (assuming it's not been added already)
    add_singleton(Item x)

    # Merge two sets into one
    merge(Handle x, Handle y)

(This specification refers to Handles. It doesn’t say what a handle is, only that handles don’t
change unless the DisjointSet is modified (by either add_singleton or merge). In practice, we
might use a representative element from each set as the set’s handle.)

IMPLEMENTATION 1: FLAT FOREST

[Figure: merging two flat trees; every item points directly at its set’s handle.]
To make get_set_with fast, we could make each item point to its set’s handle.
get_set_with() is just a single lookup.
merge() needs to iterate through each item in one or other set, and update its pointer. This
takes O(n) time, where n is the number of items in the DisjointSet.

To be able to iterate through the items, we could store each set as a linked list:

[Figure: merging two sets stored as linked lists.]

A smarter way to merge is to keep track of the size of each set, and pick the smaller set to
update. This is called the weighted union heuristic. In the Example Sheet you’ll show that
the aggregate cost of any sequence of m operations on ≤ N elements (i.e. m operations of
which ≤ N are add_singleton) is O(m + N log N ), asymptotic in N .
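A minimal sketch of the flat forest with the weighted union heuristic (dict-based, my own encoding):

```python
# Flat forest: every item points directly at its set's handle, so
# get_set_with is one lookup; merge relabels every item of the smaller set.

class FlatForest:
    def __init__(self):
        self.handle = {}             # item -> handle of the set containing it
        self.members = {}            # handle -> list of the set's items

    def add_singleton(self, x):
        self.handle[x] = x
        self.members[x] = [x]

    def get_set_with(self, x):       # O(1)
        return self.handle[x]

    def merge(self, h1, h2):
        if len(self.members[h1]) < len(self.members[h2]):
            h1, h2 = h2, h1          # weighted union: relabel the smaller set
        for x in self.members.pop(h2):
            self.handle[x] = h1
            self.members[h1].append(x)
```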

IMPLEMENTATION 2: DEEP FOREST

[Figure: merging two deep trees by attaching one root to the other.]

To make merge faster, we could skip all the work of updating the items in a set, and just
build a deeper tree.
merge() attaches one root to the other, which only requires updating a single pointer.
get_set_with() needs to walk up the tree to find the root. This takes O(h) time, where h is
the height of the tree.
To keep h small, we can use the same idea as for the flat forest: keep track of the rank of each
root (i.e. the height of its tree), and always attach the lower-rank root to the higher-rank root.
If the two roots had ranks r1 and r2 then the resulting rank is max(r1 , r2 ) if r1 ̸= r2 , and r1 + 1
if r1 = r2 . This is called the union by rank heuristic. It can be shown that the aggregate
cost of any sequence of m operations on ≤ N elements is O(m log N ), asymptotic in N .
(For the aggregate cost, see CLRS exercises 21.3-3 and 21.4-4.)

IMPLEMENTATION 3: LAZY FOREST

We’d like the forest to be flat so that get_set_with is fast, but we’d like to let it get deep so
that merge can be fast. Here’s a way to get the best of both worlds, inspired by the Fibonacci
heap: defer cleanup until you actually need the answer.

[Figure: merge attaches one root to the other; get_set_with(•) then makes • and all intermediate nodes direct children of the root.]

merge() is as for the deep forest.


get_set_with(x) does some cleanup. It walks up the tree once to find the root, and then it
walks up the tree a second time to make x and all the intermediate nodes be direct
children of the root.
This method is called the path compression heuristic. We won’t adjust the stored ranks
during path compression, and so rank won’t be the exact height of the tree, just an upper
bound on the height. (If we wanted to know the actual tree height we’d have to compute
it—and we don’t want the heuristic to take more time on book-keeping than it saves on
actually doing the work!)
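A minimal sketch of the lazy forest, combining union by rank with path compression (dict-based, my own encoding; merge expects its arguments to be roots, i.e. handles returned by get_set_with):

```python
# Lazy forest: merge is a single pointer update guided by rank;
# get_set_with walks up twice, the second walk compressing the path.

class LazyForest:
    def __init__(self):
        self.parent = {}             # item -> parent; roots point to themselves
        self.rank = {}               # root -> upper bound on its tree's height

    def add_singleton(self, x):
        self.parent[x] = x
        self.rank[x] = 0

    def get_set_with(self, x):
        root = x
        while self.parent[root] != root:     # first walk: find the root
            root = self.parent[root]
        while self.parent[x] != root:        # second walk: compress the path
            self.parent[x], x = root, self.parent[x]
        return root

    def merge(self, h1, h2):
        if self.rank[h1] < self.rank[h2]:
            h1, h2 = h2, h1                  # attach lower rank under higher
        self.parent[h2] = h1
        if self.rank[h1] == self.rank[h2]:
            self.rank[h1] += 1
```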

It can be shown that with the lazy forest the cost of m operations on ≤ N items is
O(m αN ) where αN is an integer-valued monotonically increasing sequence, related to the
Ackermann function, which grows extremely slowly:

    αN = 0 for N = 0, 1, 2
    αN = 1 for N = 3
    αN = 2 for N = 4, 5, 6, 7
    αN = 3 for 8 ≤ N ≤ 2047
    αN = 4 for 2048 ≤ N ≤ 10^80, more than there are atoms in the observable universe.

For practical purposes, αN may be ignored in the O notation, and therefore the amortized
cost per operation is O(1).

LESSONS FOR DESIGNING ADVANCED DATA STRUCTURES

The three implementations of the DisjointSet can be thought of as embodying three different
design strategies: anal retentive, adrenaline junkie, and forward planner. Those three
strategies can also be seen in the implementations of the Priority Queue.19

Anal retentive: keeps everything pristine, all the time; wastes effort by doing so.
    Binary heap:
        • push: slow O(log N), tidies up every time
        • popmin: fast O(log N), since the heap is always tidy
    Flat forest:
        • merge: slow, rewrites parent pointers
        • get_set_with: fast O(1), just look up the parent

Adrenaline junkie: lets work pile up then does it all in a rush, and is too harried to learn from it.
    Linked list heap:
        • push: fast and lazy O(1)
        • popmin: slow O(N), does its work, but retains none of its findings for next time
    Deep forest:
        • merge: fast O(1), updates a single pointer
        • get_set_with: slow, has to walk up the tree

Forward planner: does work when it's needed, and keeps its working so things are easier next time.
    Fibonacci heap:
        • push: fast and lazy O(1)
        • popmin: fast O(log N), leaves the heap semi-tidy for next time
    Lazy forest:
        • merge: fast and lazy O(1)
        • get_set_with: leaves tracks that future operations can benefit from

The forward planner strategy means keeping your working, or at least some of it, in the
anticipation of later operations. Here’s another much simpler illustration of the general idea.

Example 7.2. Suppose we have a list of four objects

a b c d
val = 2 val = 5 val = 3 val = 0

and we want to sort them by value, lowest value first. Suppose we run selection sort:
• First, find the smallest. We need to make some comparisons. We’ll find that a.val <
b.val, then a.val < c.val, then a.val > d.val, and conclude that d is the smallest.
• Next, find the second-smallest. We’ll end up repeating some of the comparisons we’ve
already made: we’ll find a.val < b.val, then a.val < c.val, and conclude that a is the
second smallest.
But this is a waste! We’ve already made those comparisons. We should have found a way
to retain our findings from the first pass, for example by using a heap, so we don’t duplicate
the effort in the second pass. ♢

This is a trivial example. There’s no general recipe for how to implement the forward planner
strategy—it’s an art to work out how much working to keep, and how to keep it without
creating extra bookkeeping work for ourselves.

19 The binomial heap is also a ‘forward planner’ algorithm; it just hasn’t gone all the way with forward planning for decreasekey.
