Graph Tree Notes
Contents
5 Graphs and path finding 1
5.1 Notation and representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
5.2 Depth-first search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
5.3 Breadth-first search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5.4 Dijkstra’s algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
5.5 Algorithms and proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5.6 Bellman-Ford . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.7 Dynamic programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.8 Johnson’s algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Example. Leonhard Euler was asked a question about the city of Königsberg: “Can I
go for a stroll around the city on a route that crosses each bridge exactly once?” He was
intrigued – “This question is so banal, but seemed to me worthy of attention in that [neither]
geometry, nor algebra, nor even the art of counting was sufficient to solve it.” In 1735 he
proved the answer was ‘No’. His innovation was to turn the question into what we would
now call a discrete maths question about a graph, in this case a simple graph with 4 vertices
and 7 edges (and he also came up with a clever proof).1
Example. Agents can find paths through a game-map using a graph. Beforehand, (1) draw
polygons around the obstacles, (2) subdivide the map into a mesh of convex polygons, (3) de-
fine a graph with one vertex per polygon, and edges between adjacent polygons. Then, we can
find the agent’s shortest path on this graph, using the algorithms we’ll study in this course.
Once we’ve found this optimal sequence of polygons, we can go on to find an aesthetically
pleasing physical path that walks through them.
Example. Facebook’s underlying data structure is a graph. Vertices are used to represent
users, locations, comments, check-ins, etc., as described in a Facebook engineering blog post.2

1 Euler’s proof: Consider any stroll and list the edges it crosses, then count up the number of times each
vertex appears. For example, in the stroll [B ↔ A, A ↔ B, B ↔ D, D ↔ C], A appears twice, B appears three
times, and so on. Clearly every vertex must appear an even number of times except possibly for the start
and end vertices. Now, if there were a stroll that crossed each bridge exactly once, then (by looking at the
graph) there are 3 edges at vertex A so A would have to appear 3 times, B would have to appear 5 times,
and C and D would have to appear 3 times. But we’ve already shown that in any stroll there can be at most
two vertices that appear an odd number of times, hence no such stroll exists.
Example. OpenStreetMap represents its map as XML, with nodes and ways. In some parts
of the city, this data is very fine-grained. The more vertices and edges there are, the more
space it takes to store the data, and the slower the algorithms run. Later in this course we
will discuss geometric algorithms which could be used to simplify the graph while keeping its
basic shape.
<osm version="0.6" generator="Overpass API">
  <node id="687022827" user="François Guerraz"
        lat="52.2082725" lon="0.1379459" />
  <node id="687022823" user="bigalxyz123"
        lat="52.2080972" lon="0.1377715" />
  <node id="687022775" user="bigalxyz123"
        lat="52.2080032" lon="0.1376761" >
    <tag k="direction" v="clockwise"/>
    <tag k="highway" v="mini_roundabout"/>
  </node>
  <way id="3030266" user="urViator">
    <nd ref="687022827"/>
    <nd ref="687022823"/>
    <nd ref="687022775"/>
    <tag k="cycleway" v="lane"/>
    <tag k="highway" v="primary"/>
    <tag k="name" v="East Road"/>
    <tag k="oneway" v="yes"/>
  </way>
  ...
</osm>
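As a sketch of how XML like this can be turned into a graph in code, here is a short example using Python's standard xml.etree module. The variable names and the trimmed-down sample document are ours, not OpenStreetMap's; consecutive <nd> references within a way become edges.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# A trimmed-down sample in the same shape as the OSM extract above
osm_xml = """
<osm version="0.6" generator="Overpass API">
  <node id="687022827" lat="52.2082725" lon="0.1379459" />
  <node id="687022823" lat="52.2080972" lon="0.1377715" />
  <node id="687022775" lat="52.2080032" lon="0.1376761" />
  <way id="3030266">
    <nd ref="687022827"/>
    <nd ref="687022823"/>
    <nd ref="687022775"/>
    <tag k="oneway" v="yes"/>
  </way>
</osm>
"""

root = ET.fromstring(osm_xml)

# Store each node's coordinates, keyed by id
coords = {n.get("id"): (float(n.get("lat")), float(n.get("lon")))
          for n in root.findall("node")}

# Build an adjacency list: consecutive <nd> references in a way are edges.
# This way is tagged oneway=yes, so we add edges in the listed direction only.
neighbours = defaultdict(list)
for way in root.findall("way"):
    refs = [nd.get("ref") for nd in way.findall("nd")]
    for u, v in zip(refs, refs[1:]):
        neighbours[u].append(v)

print(neighbours["687022827"])   # ['687022823']
```

A real extract would also need to handle two-way streets (adding edges in both directions) and ways that share nodes.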
Exercise. Why do you think Facebook chose to make CHECKIN a type of vertex, rather
than an edge from a USER to a LOCATION?
In graphs, edges are only allowed to connect vertices to vertices, they’re not allowed to connect
vertices to other edges. If we want to be able to attach a COMMENT to a CHECKIN, then
CHECKIN has to be a vertex.
□

2 https://engineering.fb.com/2013/06/25/core-data/tao-the-power-of-the-graph/
5.1 Notation and representation
• The neighbours of a vertex v are the vertices you reach by following an edge from v;
in a directed graph, neighbours(v) = {w ∈ V : v → w},
in an undirected graph, neighbours(v) = {w ∈ V : v ↔ w}.
There are some special types of graph that we’ll look at in more detail later.

• A directed acyclic graph or DAG is a directed graph without any cycles. They’re used
all over computer science. We’ll study some properties of DAGs in Section 6.7.

• An undirected graph is connected if for every pair of vertices there is a path between
them. A forest is an undirected acyclic graph. A tree is a connected forest. We’ll study
algorithms for finding trees and forests within a graph in Sections 6.5–6.6.

[Margin note: It sounds perverse to define a tree to be a type of forest! But you need to get
used to reasoning about algorithms directly from definitions, rather than from your hunches
and instinct; and a deliberately perverse definition can help remind you of this.]
REPRESENTATION
Here are two standard ways to store graphs in computer code: as an array of adjacency lists,
or as an adjacency matrix.
array of adjacency lists:        adjacency matrix:

  1: [2, 5]                          1 2 3 4 5
  2: [1, 3, 4, 5]                  1 0 1 0 0 1
  3: [2, 4]                        2 1 0 1 1 1
  4: [2, 3, 5]                     3 0 1 0 1 0
  5: [1, 2, 4]                     4 0 1 1 0 1
                                   5 1 1 0 1 0
An array of adjacency lists takes O(V + E) space to store, whereas an adjacency matrix
takes O(V 2 ). (Note: V and E are sets, so we should really write O(|V | + |E|) etc., but it’s
conventional to drop the | · |.) If the graph is sparse, i.e. if E = o(V 2 ), then the adjacency
list doesn’t need as much storage. As for algorithm execution time, we’ll study that in the
following sections.
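For concreteness, here is a minimal sketch of both representations in Python, for a small undirected graph on vertices 1–5 (the edge set is chosen for illustration):

```python
# Undirected graph on vertices 1..5, given as an edge list
edges = [(1, 2), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (4, 5)]
n = 5

# Array of adjacency lists: O(V + E) space
adj_list = {v: [] for v in range(1, n + 1)}
for u, v in edges:
    adj_list[u].append(v)
    adj_list[v].append(u)   # undirected: store the edge in both directions

# Adjacency matrix: O(V^2) space (row/column 0 unused, to keep 1-based indexing)
adj_matrix = [[0] * (n + 1) for _ in range(n + 1)]
for u, v in edges:
    adj_matrix[u][v] = 1
    adj_matrix[v][u] = 1

print(adj_list[2])        # [1, 3, 4, 5]
print(adj_matrix[2][5])   # 1
```

Note that each undirected edge is stored twice in both representations, which is why the matrix of an undirected graph is symmetric.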
Exercise.
(a) What is the largest possible number of edges in an undirected graph with V ver-
tices?
(b) and in a directed graph?
(c) What is the smallest possible number of edges in a tree?
(d) Suppose our graph has multiple edges between a pair of nodes, and that we want
to store a label for each edge. If we use an adjacency matrix we’ll therefore need
to store a list of edges in each cell, not just a number. What is the storage
requirement?
(a) V (V − 1)/2
(b) V (V − 1)
(c) V −1
(d) O(V 2 + E)
□
5.2 Depth-first search
[Figure: two graphs on vertices A–H; the left-hand graph is a tree, the right-hand graph has cycles.]
GENERAL IDEA: LIKE TRAVERSING A TREE
It’s easy to explore a tree, using recursion. If we call visit_tree(D, None) on the graph on
the left, which is a tree, then we’ll see it visit D, H, A, C, B, E, F, G (or perhaps some other
order depending on the order of each vertex’s neighbours).

[Margin note: We defined tree to mean ‘undirected connected acyclic graph’. The tree is
drawn as if to suggest that A is the root, but because the graph is undirected there is
actually no distinguished vertex, and we’re entitled to start the traversal at D.]

1   # Visit all vertices in the subtree rooted at v
2   def visit_tree(v, v_parent):
3       print("visiting", v, "from", v_parent)
4       for w in v.neighbours:
5           # Only visit v's children, not its parent
6           if w != v_parent:
7               visit_tree(w, v)
But if the graph has cycles then this algorithm will get stuck in an infinite recursion. If we
run it on the right hand graph above, we might get D, H, C, A, D, H, C, A, D, ... We need
some way to prevent this.
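The standard fix is a ‘visited’ flag. The dfs_recurse algorithm discussed below isn’t listed in this extract; here is a minimal sketch of it, representing the graph as a dict of neighbour lists so the example runs (the listing in the full notes uses per-vertex attributes instead):

```python
# A sketch of dfs_recurse: like visit_tree, but a 'visited' set
# stops us from going around cycles forever.
def dfs_recurse(g, s, visited=None, order=None):
    if visited is None:
        visited, order = set(), []
    visited.add(s)      # mark s before recursing, so cycles terminate
    order.append(s)     # record the order in which vertices are visited
    for w in g[s]:
        if w not in visited:
            dfs_recurse(g, w, visited, order)
    return order

# A small cyclic graph: C points back to the already-visited A
g = {'A': ['D'], 'D': ['H'], 'H': ['C'], 'C': ['A', 'B'], 'B': []}
print(dfs_recurse(g, 'D'))   # ['D', 'H', 'C', 'A', 'B']
```

On the cyclic example the recursion now terminates: when it reaches C and sees the edge back to A, A is already in the visited set and is skipped.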
[Figure: the same graph traversed twice, once by dfs_recurse and once by dfs using a Stack.]
To see why this is different, consider the execution trace of dfs_recurse starting at B:
visit(B):
    visit(E) from B, return from E to B
    visit(F) from B
        visit(G) from F, return from G to F,
        return from F to B
    visit(A) from B
Instead of this chain of returns, could we jump straight from G to A, the next vertex waiting
to be explored? Obviously we’d need to know that A is waiting to be explored, so we’d need
to have noted this down when we first visited B. We can use a Stack, a Last-In-First-Out
data structure, to store the list of vertices waiting to be explored; this way we’ll visit children
before returning to distant cousins.

[Margin note: There is a subtle difference between dfs and dfs_recurse as implemented
here — they don’t actually visit vertices in the same order as each other. The example sheet
asks you to look into this, and to modify dfs so that they do.]

1   # Visit all vertices reachable from s
2   def dfs(g, s):
3       for v in g.vertices:
4           v.seen = False
5       toexplore = Stack([s])   # a Stack initially containing a single element
6       s.seen = True
7
8       while not toexplore.is_empty():
9           v = toexplore.pop()          # Now visiting vertex v
10          for w in v.neighbours:
11              if not w.seen:
12                  toexplore.push(w)
13                  w.seen = True
[Figure: running dfs starting at A, on a graph with vertices A–E.
  Starting at A:      toexplore = [A], A marked as seen.
  Pick A to explore:  remove A from toexplore, add its neighbours, and mark them as
                      seen: toexplore = [B,D], A,B,D marked as seen.
  Pick D to explore:  remove D from toexplore, add its neighbours (excluding A, which
                      was already seen), and mark them as seen: toexplore = [B,C],
                      A,B,C,D marked as seen.]
ANALYSIS
In dfs, (a) line 4 is run for every vertex which takes O(V ); (b) lines 8–9 are run at most once
per vertex, since the seen flag ensures that each vertex enters toexplore at most once, so the
running time is O(V ); (c) lines 10 and below are run for every edge out of every vertex that
is visited, which takes O(E). Thus the total running time is O(V + E).
The dfs_recurse algorithm also has running time O(V + E). To see this, (a) line 4 is
run once per vertex, (b) line 8 is run at most once per vertex, since the visited flag ensures
that visit(v) is run at most once per vertex; (c) lines 9 and below are run for every edge out
of every vertex visited.
∗ ∗ ∗
Pay close attention to the clever trick in analysing the running time. We didn’t try
to build up some complicated recursive formula about the running time of each call to visit,
or to reason about when which part of the graph was put on the stack. Instead we used
mathematical reasoning to bound the total number of times that a vertex could possibly be
processed during the entire execution. This is called aggregate analysis, and we’ll see more
examples later in the course when we look at the design of advanced data structures.

[Margin note: aggregate analysis: section 7.1]
The recursive implementation uses the language’s call stack, rather than our own data
structure. Recursive algorithms are sometimes easier to reason about, and we’ll use the
recursive implementation as part of a proof in Section 6.7.
5.3 Breadth-first search
GENERAL IDEA
Suppose we want to find the shortest path from A to some other vertex, in a directed graph.
Let’s start by rearranging the graph, to put A at the top, then all the vertices at distance 1
underneath, and all the nodes at distance 2 underneath that, and so on. (By ‘distance d’
we mean ‘can be reached by a path with d edges, and cannot be reached by a path with
< d edges’.) The graph rearranges itself into two parts: there is a tree consisting of all the
vertices, with edges that go down by exactly one ‘level’; and there are extra edges that go
either horizontally or up.
[Figure: a graph on vertices A–E, redrawn with A at the top (distance from A = 0), B and D
on the next level (distance from A = 1), and C and E below them (distance from A = 2).]
The idea of breadth first search is to visit A, then all nodes at distance 1 from A, then all
nodes at distance 2, and so on. In other words, we’ll explore the breadth of the tree first,
before going deeper. There’s a very simple way to achieve this. Suppose we’ve already visited
all the vertices at level < d, and we’ve got a list of all the vertices at level d: then we just go
through that list and add all the vertices that are newly reachable; these must be vertices at
level d + 1. Keep going until there’s nothing more to visit.
IMPLEMENTATION
To implement the breadth-first strategy, we don’t even need to manage ‘list of vertices at
distance d’. All we need is a Queue to store all the vertices we’re waiting to explore. Push new
vertices on the right of the queue, pop vertices from the left, and that way we’re guaranteed
to pop all vertices in correct order of distance.
The code turns out to be almost identical to dfs. The only difference is that it uses a
Queue instead of a Stack.
1   # Visit all the vertices in g reachable from start vertex s
2   def bfs(g, s):
3       for v in g.vertices:
4           v.seen = False
5       toexplore = Queue([s])   # a Queue initially containing a single element
6       s.seen = True
7
8       while not toexplore.is_empty():
9           v = toexplore.popleft()      # Now visiting vertex v
10          for w in v.neighbours:
11              if not w.seen:
12                  toexplore.pushright(w)
13                  w.seen = True
[Figure: running bfs starting at A.
  Starting at A:      toexplore = [A], A marked as seen.
  Pick A to explore:  remove A from toexplore, add its neighbours, and mark them as
                      seen: toexplore = [B,D], A,B,D marked as seen.
  Pick B to explore:  remove B from toexplore, add its neighbours (excluding D, which
                      was already seen), and mark them as seen: toexplore = [D,E,C],
                      A,B,C,D,E marked as seen.]
We can adapt this code to find a path between a pair of nodes: we just need to keep track
of how we discovered each vertex. For every vertex at distance d, we’ll store a come_from
arrow, pointing to a vertex at distance d − 1. Here’s a picture, then the code.
[Figure: three snapshots of bfs starting at A, showing the come_from arrows: each
discovered vertex points back to the vertex from which it was discovered.]
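The come_from variant might be sketched as follows. This is a reconstruction, not the course's own listing; the graph is a dict of neighbour lists so the example runs.

```python
from collections import deque

# bfs with come_from arrows: each vertex records the vertex it was
# discovered from, so we can walk the arrows back to recover a path.
def bfs_path(g, s, t):
    come_from = {s: None}           # doubles as the 'seen' marker
    toexplore = deque([s])
    while toexplore:
        v = toexplore.popleft()
        for w in g[v]:
            if w not in come_from:  # w not seen yet
                come_from[w] = v
                toexplore.append(w)
    if t not in come_from:
        return None                 # t is unreachable from s
    path = []
    while t is not None:            # follow the arrows back to s
        path.append(t)
        t = come_from[t]
    return path[::-1]

g = {'A': ['B', 'D'], 'B': ['E', 'C'], 'D': ['C'], 'E': [], 'C': []}
print(bfs_path(g, 'A', 'C'))   # ['A', 'B', 'C']
```

Because vertices are discovered in order of distance, the come_from arrow at each vertex points one level closer to s, so the recovered path is a shortest one (in number of edges).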
ANALYSIS
The bfs algorithm has running time O(V + E), based on exactly the same analysis as for dfs
in section 5.2.
∗ ∗ ∗
Here is another way to think of the bfs algorithm: keep track of the ‘disc’ of vertices
that are distance ≤ d from the start, then grow the disc by adding the ‘frontier’ of vertices
at distance d + 1, and so on. What’s magic is that the bfs algorithm does this implicitly, via
the Queue, without needing an explicit variable to store d.
In this illustration3 , we’re running bfs starting from the blob in the middle. The graph has
one vertex for each light grey grid cell, and edges between adjacent cells, and the black cells in
the left hand panel are impassable. The next three panels show some stages in the expanding
frontier.
3 These pictures are taken from the excellent Red Blob Games blog, http://www.redblobgames.com/pathfinding/a-star/introduction.html
5.4 Dijkstra’s algorithm
PROBLEM STATEMENT

Given a directed graph where each edge is labelled with a cost ≥ 0, and a start vertex s,
compute the distance from s to every other vertex.

[Margin note: Why require costs ≥ 0? Try to work out what goes wrong in the algorithm
below if there are negative costs. You can read the answer in Sections 5.5 and 5.6.]

GENERAL IDEA
In breadth-first search, we visited vertices in order of how many hops they are from the start
vertex. Now, let’s visit vertices in order of distance from the start vertex. We’ll keep track
of a frontier of vertices that we’re waiting to explore (i.e. the vertices whose neighbours
we haven’t yet examined). We’ll keep the frontier vertices ordered by distance, and at each
iteration we’ll pick the next closest.
We might end up coming across a vertex multiple times, with different costs. If we’ve
never come across it, just add it to the frontier. If we’ve come across it previously and our
new path is shorter than the old path, then update its distance.
[Figure: two snapshots of the algorithm. Left: ‘Visit v, the nearest frontier vertex’.
Right: ‘Update v & extend the frontier’. Vertices are labelled with their computed dist
values and edges with their costs.]
IMPLEMENTATION
This algorithm was invented in 1959 and is due to Dijkstra5 (1930–2002), a pioneer of com-
puter science.
Line 5 declares that toexplore is a PriorityQueue in which the key of an item v is
v.distance. Line 11 iterates through all the vertices w that are neighbours of v, and retrieves
the cost of the edge v → w at the same time.

[Margin note: See Section 4.8 for a definition of PriorityQueue. It supports inserting items,
decreasing the key of an item, and extracting the item with smallest key.]

1   def dijkstra(g, s):
2       for v in g.vertices:
3           v.distance = ∞
4       s.distance = 0
5       toexplore = PriorityQueue([s], sortkey = lambda v: v.distance)
6
7       while not toexplore.isempty():
8           v = toexplore.popmin()
9           # Assert: v.distance is the true shortest distance from s to v
10          # Assert: v is never put back into toexplore
11          for (w, edgecost) in v.neighbours:
12              dist_w = v.distance + edgecost
13              if dist_w < w.distance:
14                  w.distance = dist_w
15                  if w in toexplore:
16                      toexplore.decreasekey(w)
17                  else:
18                      toexplore.push(w)
Although we’ve called the variable v.distance, we really mean “shortest distance from s
to v that we’ve found so far”. It starts at ∞ and it decreases as we find new and shorter paths
to v. Given the assertion on line 10, we could have coded this algorithm slightly differently:
we could put all nodes into the priority queue in line 5, and delete lines 15, 17, and 18. It
takes some work to prove the assertion...

5 Dijkstra was famous for his way with words. Some of his sayings: “The question of whether Machines
Can Think [. . . ] is about as relevant as the question of whether Submarines Can Swim.”
ANALYSIS
Running time. Line 8 is run at most once per vertex (by the assertion on line 10), and
lines 12–18 are run at most once per edge. So the total running time is

    O(V × cost of popmin + E × cost of push or decreasekey)

where the individual operation costs depend on how the PriorityQueue is implemented. Later
in the course, we’ll describe an implementation called the Fibonacci heap which for n items
has O(1) running time for both push() and decreasekey() and O(log n) running time for
popmin(). Since the number of items stored in the heap at any time is ≤ V , by the assertion
on line 10, the total running time is O(E + V log V ).

Theorem (Correctness). The dijkstra algorithm terminates. When it does, for every vertex
v, the value v.distance it has computed is equal to distance(s to v). Furthermore, the two
assertions never fail.

[Margin note: In this theorem, v.distance is a variable that is updated during program
execution, and distance(s → v) is the Platonic mathematical object. Pay close attention to
whether you’re dealing with abstract mathematical statements (which can be stated and
proved even without an algorithm), or if you’re reasoning about program execution.]

Proof (that Assertion 9 never fails). Suppose this assertion fails at some point in execution.
Let v be the vertex for which it first fails, and let T be the time of this failure. Consider a
shortest path from s to v. (This means the Platonic mathematical object, not a computed
variable.) Write this path as

    s = u1 → · · · → uk = v

There are two cases to consider: CASE 1, in which one of these vertices hasn’t yet been popped
from toexplore by time T, and CASE 2, in which they have all been popped.
Consider CASE 1 first, and let i be the index of the first vertex in the sequence that, at
time T, hasn’t been popped. So the path is

    s = u1 → u2 → · · · → ui−1 → ui → · · · → uk = v

where u1, . . . , ui−1 have already been popped and ui has not. (We’ve just popped v = uk, so
we know i < k. The vertices between i and k, if there are any, might or might not have been
popped by time T.) Now, reasoning about the stored .distance variables as they stand at
time T,
distance(s to v)
< v.distance since the assertion failed at v
≤ ui .distance (*)
≤ ui−1 .distance + cost(ui−1 → ui ) by lines 13–18 when ui−1 was popped
= distance(s to ui−1 ) + cost(ui−1 → ui ) assertion didn’t fail at ui−1
≤ distance(s to v) since s → · · · ui−1 → ui is on a shortest path s to v.
The tricky step is (*). This line is because we just popped v from the PriorityQueue, and we
know ui was in there also because it would have been forced in when we popped ui−1 , and
the PriorityQueue gave us v rather than ui , hence v.distance ≤ ui .distance. Thus, we obtain
a contradiction.
In CASE2, we also obtain a contradiction, and it’s easier to prove:
distance(s to v)
< v.distance since the assertion failed at v
≤ distance(s to uk−1 ) + cost(uk−1 → v) by lines 13–18 when uk−1 was popped
= distance(s to v) since s = u1 → · · · → uk = v is a shortest path.
We have proved that, if Assertion 9 fails at some point in execution, there is a contra-
diction. Thus, it can never fail.
Proof that Assertion 10 never fails. Once a vertex v has been popped, Assertion 9 guarantees
that v.distance = distance(s to v). The only way that v could be pushed back into toexplore
is if we found a shorter path to v (on line 13), which is impossible.
Rest of proof. Since vertices can never be re-pushed into toexplore, the algorithm must
terminate. At termination, all the vertices that are reachable from s must have been visited,
and popped, and when they were popped they passed Assertion 9. They can’t have had
v.distance changed subsequently (since it can only ever decrease, and it’s impossible for it to
be less than the true minimum distance, since the algorithm only ever looks at legitimate
paths from s.) □
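As an aside, here is a runnable sketch of the algorithm using Python's heapq module. heapq has no decreasekey, so this version uses the standard workaround of pushing a fresh entry and skipping stale ones when popped; it is a sketch in our own notation, not the course's listing.

```python
import heapq

def dijkstra(g, s):
    # g maps each vertex to a list of (neighbour, edgecost) pairs, costs >= 0
    distance = {v: float('inf') for v in g}
    distance[s] = 0
    toexplore = [(0, s)]            # heap of (distance, vertex) pairs
    popped = set()
    while toexplore:
        d, v = heapq.heappop(toexplore)
        if v in popped:
            continue                # stale entry: v already popped at its true distance
        popped.add(v)
        for w, edgecost in v_neighbours(g, v):
            dist_w = d + edgecost
            if dist_w < distance[w]:
                distance[w] = dist_w
                # push a fresh entry instead of decreasekey
                heapq.heappush(toexplore, (dist_w, w))
    return distance

def v_neighbours(g, v):
    return g[v]

g = {'s': [('u', 2), ('t', 6)], 'u': [('t', 1)], 't': []}
print(dijkstra(g, 's'))   # {'s': 0, 'u': 2, 't': 3}
```

Each vertex can appear several times in the heap, so this variant runs in O(E log E) rather than the O(E + V log V) of a Fibonacci heap, but it is simple and is what's typically written in practice.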
5.5 Algorithms and proofs

    Right from the beginning, and all through the course, we stress that the programmer’s
    task is not just to write down a program, but that his main task is to give a formal
    proof that the program he proposes meets the equally formal functional specification.6

    If you want more effective programmers, you will discover that they should not
    waste their time debugging, they should not introduce the bugs to start with.7
This course is 50% about the ideas behind the algorithms, 50% about the proofs. Don’t
think of proofs as hoops that a cruel lecturer forces you to jump through! If you can’t figure
out a correct proof that your algorithm works, chances are there’s a bug in your algorithm,
for example a special case that you haven’t coded for. Conversely, every nifty trick that you
invented to make your algorithm work is likely to have a counterpart in a proof of correctness,
otherwise the trick would be superfluous. Algorithm proofs are just a tool for making sure
our code works correctly and for clarifying in our heads the big ideas behind an algorithm.
Here is an exam question about Dijkstra’s algorithm, and a selection of mangled proofs.
The first thing an examiner checks for is whether the proof passes the ‘smell test’. Does this
answer have all the key ingredients from Dijkstra’s proof—the ‘breakpoint’ proof strategy,
the reliance on edge weights being non-negative? If it fails the smell test, the examiner will
look for a faulty inference, in other words construct a counterexample to demonstrate the
fault.
Bad Answer 1. At the moment when the vertex t is popped from the priority queue, it has
to be the vertex in the priority queue with the least distance from s. This means that any
other vertex in the priority queue has distance ≥ that for t. Since all edge weights in the
graph are ≥ 0, any path from s to t via anything still in the priority queue will have distance
≥ that of the distance from s to t when t is popped, thus the distance to t is correct when t
is popped.
This fails the smell test in two ways. First, there is no hint of induction—the proof only
discusses what happens when the target vertex t is popped. Second, it doesn’t distinguish
between the two ‘distances’: (a) v.distance, the quantity computed and updated in the course
of the algorithm, and (b) distance(s to v), the true mathematical distance. The point of
Dijkstra’s algorithm is that the former eventually becomes equal to the latter.
Where exactly is the faulty inference? The first sentence is about how priority queues
work, so it must be referring to distances as computed by the algorithm, not to true mathe-
matical distances; but the last sentence seems to be referring to true mathematical distances.
So let’s set up a counterexample where the two are at odds. In this diagram, u and t are in
the priority queue, and the nodes have been labelled by their computed distances at the instant
when t is popped.
6 EWD 1036: https://www.cs.utexas.edu/users/EWD/ewd10xx/EWD1036.PDF
7 EWD 340: The humble programmer, https://www.cs.utexas.edu/users/EWD/ewd03xx/EWD340.PDF.
Quite the opposite of test-driven development! Contrast to Donald Knuth, another pioneer of computer
science, who once wrote “Beware of bugs in the above code; I have only proved it correct, not tried it.”
8 EWD 498: How do we tell truths that might hurt? https://www.cs.utexas.edu/users/EWD/ewd04xx/EWD498.PDF
[Figure: u (dist=∞) at the top; s (dist=0), v (dist=2), and t (dist=5) along the bottom;
edges s → v with cost 2, v → t with cost 3, v → u with cost 1, and u → t with cost 1.]
What does the Bad Answer say about this scenario? We’re about to pop t, since t.distance <
u.distance. And yet the true mathematical distance of the path s → v → u → t is shorter
than the computed t.distance, so the final sentence of the Bad Answer is incorrect.
One might say “Oh, but the proof in Section 5.4 shows that this can’t happen.” True ...
but an answer to the question should PROVE that this scenario can’t happen (or make explicit
reference to lecture notes!), and because this answer doesn’t it’s a Bad Answer.
□
Bad Answer 2. Dijkstra’s algorithm performs a breadth-first search on the graph, storing
a frontier of all unexplored vertices that are neighbours to explored vertices.
Each time it chooses a new vertex v to explore, from the frontier of unexplored vertices,
it chooses the one that will have the shortest distance from the start s, based on the edge
weight plus the distance from s of its already explored neighbour.
Given that no other vertex in the frontier is closer to s, and that this new vertex v has
yet to be explored, when v is explored it must have been via the shortest path from s to v.
Hence, when t is first encountered, it must have been found via its shortest path and
the program can safely terminate.
This answer is slightly better than the previous one because it nods in the direction of
induction—it makes an argument about what happens “each time it chooses a new vertex”,
not just about what happens when it chooses t. But the ordering behind the induction isn’t
made clear. And it still fails the smell test because it makes no distinction between ‘distance
as computed’ and ‘true mathematical distance’.
[Figure: a graph whose nodes are labelled with their current computed distances (start
node 0; explored nodes 4 and 2; frontier nodes 5 and 3; destination node ∞), and whose
edges are labelled with their costs.]
Let’s look for a counterexample. In this graph, each node has been labelled with its
current distance, and each edge with its cost. Node 0 is the start vertex, node ∞ is the
destination vertex, nodes 0, 4, and 2 have already been explored, and nodes 3 and 5 are in
the frontier. The algorithm will pick node 3 to explore, since this has the smallest distance
variable, but this distance is incorrect and the corresponding path (via node 2) is wrong.
The issue is that the algorithm should never get into the state shown here! But the
Bad Answer doesn’t argue that this state is impossible: instead it just assumes that when v
is popped all the already-explored vertices are correct. The proof needs to say ‘by induction
on vertices in the order they are explored’. Only then is it legitimate to assume that all the
already-explored vertices are correct.
□
Exercise. Bad Answer 2 doesn’t state anywhere that edge weights need to be ≥ 0. Explain
which line of the proof fails when the algorithm is run on this graph:
[Figure: a small graph from s to t with edge weights 1, 3, −4, 3, and 2.]
5.6 Bellman-Ford
Now for a new wrinkle: graphs whose edges can have both positive and negative costs.
We’ll use the term weight rather than cost of an edge. (The words ‘cost’ and ‘distance’
suggest positive numbers, and so they give us bad intuition for graphs with negative costs.)
The weight of a path is the sum of its edge weights, and our goal is to find minimum weight
paths from some start vertex.
Example (Planning problems). Consider an adventurer who has just entered a room in a dun-
geon, and wants to quickly get to the other exit, but has the option of detouring to pick up
treasure. We could frame this as seeking to minimize
    W = { T − r   if we pick up treasure
        { T       otherwise
where T is the time to reach the exit and r is the value of the treasure. What is the optimal
path, and does it involve picking up the treasure?
We can turn this into a directed graph problem. Let there be a vertex for every state
the game can be in: this comprises both the location of the adventurer, as well as a flag
saying whether the treasure is still available. And let there be edges for all the possible game
moves. Give each edge weight 1, except for the moves which involve picking up the treasure,
which have weight 1 − r. In graph language, we’re told the start vertex, and we have the
choice of two possible destination vertices (exit with treasure, and exit without treasure),
and we want to find the weights of minimum weight paths to those two destinations. ♢
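To make the state-graph encoding concrete, here is a tiny sketch; the room layout and the treasure value are invented for illustration. A state is a (location, treasure still available) pair, and picking up the treasure is a move of weight 1 − r.

```python
# Tiny invented instance: locations entry -> mid -> exit, with treasure at mid.
r = 5   # value of the treasure (invented for this example)

edges = []   # (from_state, to_state, weight)
for avail in (True, False):
    # Ordinary moves cost 1, and don't change the treasure flag
    edges.append((('entry', avail), ('mid', avail), 1))
    edges.append((('mid', avail), ('exit', avail), 1))
# Picking up the treasure: stay at mid, the flag flips, weight 1 - r
edges.append((('mid', True), ('mid', False), 1 - r))

# Weights of the two candidate paths from ('entry', True):
skip_it = 1 + 1                # entry -> mid -> exit, treasure untouched
grab_it = 1 + (1 - r) + 1      # entry -> mid -> pick up -> exit
print(skip_it, grab_it)        # 2 -2
```

With r = 5 the pickup edge has weight −4, so the grab-the-treasure path has the smaller total weight; this is exactly why we need an algorithm that tolerates negative edge weights.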
Example (Negative cycles). What’s the minimum weight from a to b in the graph below? By
going around b → c → d → b again and again, the weight of the path goes down and down.
This is referred to as a negative weight cycle, and we’d say that the minimum weight from a
to b is −∞.
[Figure: vertices a, b, c, d; edge a → b has weight 1, and the cycle b → c → d → b has
edge weights 2, 3, and −6, which sum to −1.]

    a → b: weight 1
    a → b → c → d → b: weight 0
    a → b → c → d → b → c → d → b: weight −1
Exercise. Run Dijkstra’s algorithm by hand on each of these graphs, starting from the
shaded vertex. The labels indicate edge weights. What happens?
[Figure: two small graphs with labelled edge weights, each containing an edge of weight −4.]
GENERAL IDEA
Dijkstra’s algorithm can fail on graphs with negative edge weights. [Before continuing, do the
exercise!] But the update step at the heart of Dijkstra’s algorithm, lines 13–14 of the dijkstra
code, is still sound. Let’s restate it. If we’ve found a path from s to u, call it s ⇝ u, and if there
is an edge u → v, then s ⇝ u → v is a path from s to v. If we store the weight of the minimum-
weight s ⇝ v path we’ve found so far in the variable v.minweight, then the obvious update is

    v.minweight = min(u.minweight + weight(u → v), v.minweight)

    Dijkstra                               Bellman-Ford
    can get stuck in an ∞ loop             always terminates
    if some weights < 0
    O(E + V log V )                        O(V E)
    if all weights ≥ 0
    visits vertices in a clever order,     visits vertices in any order,
    relaxes each edge once                 relaxes each edge multiple times
PROBLEM STATEMENT
Given a directed graph where each edge is labelled with a weight, and a start vertex s, (i) if
the graph contains no negative-weight cycles reachable from s then for every vertex v compute
the minimum weight from s to v; (ii) otherwise report that there is a negative weight cycle
reachable from s.
IMPLEMENTATION
In this code, lines 8 and 12 iterate over all edges in the graph, and c is the weight of the edge
u → v. The assertion in line 10 refers to the true minimum weight among all paths from s to
v, which the algorithm doesn’t know yet; the assertion is just there to help us reason about
how the algorithm works, not something we can actually test during execution.
1   def bf(g, s):
2       for v in g.vertices:
3           v.minweight = ∞   # best estimate so far of minweight from s to v
4       s.minweight = 0
5
6       repeat len(g.vertices) − 1 times:
7           # relax all the edges
8           for (u, v, c) in g.edges:
9               v.minweight = min(u.minweight + c, v.minweight)
10              # Assert v.minweight >= true minimum weight from s to v
11
12      for (u, v, c) in g.edges:
13          if u.minweight + c < v.minweight:
14              throw "Negative-weight cycle detected"
Lines 12–14 say, in effect, “If the answer we get after V − 1 rounds of relaxation is different
to the answer after V rounds, then there is a negative-weight cycle; and vice versa.”
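The pseudocode above can be transcribed into plain runnable Python, with the graph given as an edge list and an exception in place of throw (a sketch, not the course's own listing):

```python
def bf(vertices, edges, s):
    # edges is a list of (u, v, c) triples; weights may be negative
    minweight = {v: float('inf') for v in vertices}
    minweight[s] = 0
    for _ in range(len(vertices) - 1):
        for u, v, c in edges:                 # relax all the edges
            minweight[v] = min(minweight[u] + c, minweight[v])
    for u, v, c in edges:                     # one extra round, to detect cycles
        if minweight[u] + c < minweight[v]:
            raise ValueError("Negative-weight cycle detected")
    return minweight

# The negative-cycle example from earlier: cycle b -> c -> d -> b has weight -1
vertices = ['a', 'b', 'c', 'd']
edges = [('a', 'b', 1), ('b', 'c', 2), ('c', 'd', 3), ('d', 'b', -6)]
try:
    bf(vertices, edges, 'a')
except ValueError as e:
    print(e)   # Negative-weight cycle detected
```

On a graph without negative cycles, e.g. the single edge s → t of weight −4, the same function returns the correct minimum weights {s: 0, t: −4}.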
ANALYSIS
The algorithm iterates over all the edges, and it repeats this V times, so the overall running
time is O(V E).
Theorem. The algorithm correctly solves the problem statement. In case (i) it terminates
successfully, and in case (ii) it throws an exception in line 14. Furthermore the assertion on
line 10 is true.
Proof (of assertion on line 10). Write w(v) for the true minimum weight among all paths
from s to v, with the convention that w(v) = −∞ if there is a path that includes a negative-
weight cycle. The algorithm only ever updates v.minweight when it has a valid path to v,
therefore the assertion is true.
Proof for case (i). Pick any vertex v, and consider a minimum-weight path from s to v. Let
the path be
s = u0 → u1 → · · · → uk = v.
Consider what happens in successive iterations of the main loop, lines 8–10.
• Initially, u0 .minweight is correct, i.e. equal to w(s) which is 0.
• After one iteration, u1 .minweight is correct. Why? If there were a lower-weight path
to u1 , then the path we’ve got here couldn’t be a minimum-weight path to v.
• After two iterations, u2 .minweight is correct.
• and so on...
We can assume (without loss of generality) that this path has no cycles—if it did, the cycle
would have weight ≥ 0 by assumption, so we could cut it out. So it has at most |V | − 1 edges,
so after |V | − 1 iterations v.minweight is correct.
Thus, by the time we reach line 12, all vertices have the correct minweight, hence the
test on line 13 never goes on to line 14, i.e. the algorithm terminates without an exception.
Proof for case (ii). Suppose the graph contains a negative-weight cycle reachable from s, say
s = u0 → · · · → v0 → v1 → · · · → vk → v0
where
weight(v0 → v1) + · · · + weight(vk → v0) < 0.
If the algorithm terminates without throwing an exception, then all these edges pass the test
in line 13, i.e.
v0.minweight + weight(v0 → v1) ≥ v1.minweight
· · ·
vk.minweight + weight(vk → v0) ≥ v0.minweight.
Summing these inequalities, the minweight terms cancel, leaving
weight(v0 → v1) + · · · + weight(vk → v0) ≥ 0,
hence the cycle has weight ≥ 0. This contradicts the premise—so at least one of the edges
must fail the test in line 13, and so the exception will be thrown. □
5.7. Dynamic programming
[Figure: 30 frames of Seaquest gameplay, with a plot of the value function (vertical axis: value, from 10.0 to 10.8; horizontal axis: frame number, 0 to 30) and three marked times A, B, and C.]
The heart of dynamic programming is a recurrence equation for the value function. The value
function evaluates, for any state v, the expected future reward that can be gained starting
from v. The picture above illustrates 30 frames of gameplay of the Atari game Seaquest, and
the value function that was computed by DeepMind.9 The player is the yellow submarine
on the right. At time A a new enemy appears on the left, and the value function increases
because of the potential reward. At time B the player’s torpedo has nearly reached the enemy,
and the value function is very high anticipating the reward. (The value function shown here
is based on a reward of +1 whenever the player gains points.) At time C the enemy has been
hit, and the reward has been won, so the value function reverts to its baseline.
GENERAL IDEA
Define the value function Fdst,t (v) to be the minimum weight for reaching vertex dst within
t timesteps starting from v, assuming that it takes one timestep to follow an edge.
[Figure: a small directed graph on vertices a, b, c, d, with edge weights including a → b (1), b → c (2), a → c (4), c → d (3), and one edge of weight −4.]
To illustrate, here's the value function for reaching state d, for the simple graph shown above:

Fd,1 (v) = 3 if v = c, with the one-hop path c → d
           0 if v = d, since we're already there
           ∞ otherwise, since we can't reach d within 1 timestep

Fd,2 (v) = 7 if v = a, by following a → c → d
           5 if v = b, by following b → c → d
           3 if v = c, with the one-hop path c → d
           0 if v = d, since we're already there
           ∞ otherwise

and so on.
9 Playing Atari with Deep Reinforcement Learning, Mnih, Kavukcuoglu, Silver, Graves, Antonoglou, Wier-
stra, and Riedmiller, 2013, https://arxiv.org/pdf/1312.5602v1.pdf. This is the paper that kickstarted
the deep reinforcement learning revolution. See https://blog.evjang.com/2018/08/dijkstras.html for an
excellent blog post by Eric Jang, that discusses the link between reinforcement learning and shortest paths.
For a general graph, we can write down a recurrence equation (called the Bellman equation) for the value function:

Fd,t (v) = min( Fd,t−1 (v), min_{w: v→w} { weight(v → w) + Fd,t−1 (w) } )

(the minimum over an empty set is taken to be ∞).
In words, “If we have a path v ⇝ d that takes ≤ t − 1 steps, that’s obviously a valid path
of ≤ t steps; alternatively, we could take a first step v → w and then take the optimal path
w ⇝ d in ≤ t − 1 steps.” The terminal condition is
Fd,0 (v) = 0 if v = d, and ∞ otherwise.
We can simply iterate this equation, in the usual dynamic programming way: first write
down Fd,0 (v) for all v, then compute Fd,1 (v) for all v, and so on. The value function itself
tells us the weight of a minimum weight path, and we can recover the path by reading off
which option gives the minimum at each timestep.
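The iteration just described can be sketched in a few lines of Python, here using the four-vertex example above. The representation (a dict of edge weights) is illustrative only.

```python
import math

def value_iteration(vertices, edges, dst, horizon):
    """Compute F_{dst,t}(v) for all v by iterating the Bellman equation
    t = horizon times. edges maps (v, w) -> weight(v -> w)."""
    # Terminal condition: F_{dst,0}
    F = {v: (0 if v == dst else math.inf) for v in vertices}
    for _ in range(horizon):
        # Either stay with the best path of <= t-1 steps, or take one
        # edge v -> w and then the best path of <= t-1 steps from w
        F = {v: min([F[v]] +
                    [wt + F[w] for (x, w), wt in edges.items() if x == v])
             for v in vertices}
    return F

# The example graph: a -> b (1), b -> c (2), a -> c (4), c -> d (3)
edges = {("a", "b"): 1, ("b", "c"): 2, ("a", "c"): 4, ("c", "d"): 3}
F2 = value_iteration("abcd", edges, "d", 2)
# F2 == {'a': 7, 'b': 5, 'c': 3, 'd': 0}, matching F_{d,2} above
```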
PROBLEM STATEMENT
Given a directed graph where each edge is labelled with a weight, and assuming it contains
no negative-weight cycles, then for every pair of vertices compute the weight of the minimum-
weight path between those vertices.
Why the condition about negative-weight cycles, and why is there no mention of a time
horizon? The time horizon in the Bellman equation was crucial—it's what lets us break
the problem down into easier subproblems. There's a simple reason we can drop it here. If
the graph has no negative-weight cycles, then any minimum-weight path must have ≤ V
vertices (if it had more then it must contain a cycle, which by assumption has weight ≥ 0,
so we might as well excise it). Thus for any pair of vertices there is a minimum-weight path
between them with ≤ V − 1 edges. So what we want to compute is Fd,V−1 (v).
The problem statement says for every pair of vertices. There’s a nifty way to compute
the value function using matrices, and all-to-all minimum weights just drop out with no extra
work. This implementation has running time O(V 3 log V ).
MATRIX IMPLEMENTATION (NON-EXAMINABLE)
Let M(t)ij be the minimum weight of going from i to j in ≤ t steps. The Bellman equation says

M(t)ij = min( M(t−1)ij , min_{k: i→k} { weight(i → k) + M(t−1)kj } ).
Let's define a matrix W to store the weights,

Wij = weight(i → j) if there is an edge i → j
      0 if i = j
      ∞ otherwise.
The nifty thing about this matrix is that it lets us simplify the Bellman equation to

M(t)ij = min_k { Wik + M(t−1)kj },    M(1)ij = Wij.
The first clause in the Bellman equation is taken care of because we defined Wii = 0; and
the restriction to {k : i → k} is taken care of because we defined Wij = ∞ if there is no edge.
We could start the iteration at M (0) , but it’s easy to see that a single iteration of Bellman’s
equation gives M (1) = W , and we have W already, so we might as well use it.
The matrix-Bellman equation can be rewritten as

M(t)ij = (Wi1 + M(t−1)1j) ∧ (Wi2 + M(t−1)2j) ∧ · · · ∧ (Win + M(t−1)nj)

(the notation x ∧ y means min(x, y)).
This is just like regular matrix multiplication
[A B]ij = Ai1 B1j + Ai2 B2j + · · · + Ain Bnj
except it uses + instead of multiplication and ∧ instead of addition. Let’s write it M (t) =
W ⊗ M (t−1) . This nifty notation lets us write out the complete algorithm very concisely:
1 Let M (1) = W
2 Compute M (|V |−1) , using M (t) = W ⊗ M (t−1)
3 Return M (|V |−1)
As noted above, it's sufficient to compute up to time horizon |V | − 1, since we assumed
the graph has no negative-weight cycles. Furthermore, since ⊗ is associative and M (t) stops
changing once t ≥ |V | − 1, we can reach the answer by repeated squaring, using O(log V )
multiplications each costing O(V 3 )—which is where the O(V 3 log V ) running time comes
from. For example, with |V | = 10:
M (1) = W
M (2) = M (1) ⊗ M (1)
M (4) = M (2) ⊗ M (2)
M (8) = M (4) ⊗ M (4)
M (16) = M (8) ⊗ M (8) = M (9), as there are no negative-weight cycles.
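In code, the ⊗ operation and the repeated squaring might look like the following sketch, using plain lists of lists with math.inf standing for "no edge".

```python
import math

def min_plus(A, B):
    """The circled-times product: like matrix multiplication, but with
    + in place of multiplication and min in place of addition."""
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def all_pairs_min_weight(W):
    """Square repeatedly until the time horizon is at least |V| - 1."""
    n = len(W)
    M, t = W, 1
    while t < n - 1:
        M = min_plus(M, M)
        t *= 2
    return M

inf = math.inf
# Vertices 0..3; e.g. W[0][1] = 1 is the edge 0 -> 1 with weight 1
W = [[0,   1,   4,   inf],
     [inf, 0,   2,   inf],
     [inf, inf, 0,   3  ],
     [inf, inf, inf, 0  ]]
M = all_pairs_min_weight(W)   # M[0][3] == 6, via 0 -> 1 -> 2 -> 3
```

Because W has 0 on the diagonal, squaring past the horizon is harmless: M(t) = M(|V|−1) for every t ≥ |V| − 1 when there are no negative-weight cycles.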
∗ ∗ ∗
For interesting problems like Go, or even the Seaquest game shown at the beginning of this
section, it’s impractical to solve the Bellman equation exactly because the number of states
is combinatorially huge. Instead of solving the value function exactly, DeepMind trains a
neural network to learn an approximation to the value function.
5.8. Johnson's algorithm
• Each router in the internet has to know, for every packet it might receive, where that
packet should be forwarded to. Routers send messages between themselves using the
Border Gateway Protocol (BGP), advertising which destinations they know about, and
they update their routing tables based on the messages they receive. The entire internet
can be thought of as a distributed algorithm for computing all-to-all paths.
We’ve learnt three algorithms we can use for this purpose: (1) if all edge weights are ≥ 0 we
can run Dijkstra’s algorithm once from each vertex; (2) we can run Bellman-Ford once from
each vertex; (3) we can use dynamic programming with matrices. The running times are:

algorithm                 running time        E = V − 1      E = V (V − 1)   E = Θ(V α )
Dijkstra, V times         O(V E + V 2 log V)  O(V 2 log V)   O(V 3 )         O(V 1+α + V 2 log V)
Bellman-Ford, V times     O(V 2 E)            O(V 3 )        O(V 4 )         O(V 2+α )
dynamic programming       O(V 3 log V)        O(V 3 log V)   O(V 3 log V)    O(V 3 log V)
Johnson                   O(V E + V 2 log V)  O(V 2 log V)   O(V 3 )         O(V 1+α + V 2 log V)

The table shows the running time as a function of V and E, and it also shows it for two special
cases, E = V − 1 (a tree, the sparsest connected graph on V vertices) and E = V (V − 1)
(a fully connected graph, the densest graph on V vertices), as well as for E = Θ(V α ) for
α ∈ [1, 2], which spans the range from sparse to dense. This last column makes it easier to
see the comparison. Dijkstra is best for any α, and dynamic programming is better than
Bellman-Ford for any α > 1.
The last row is for Johnson’s algorithm, the topic of this section. It is as fast as
Dijkstra’s algorithm, but it also works with positive and negative edge weights. It was
discovered by Donald Johnson in 1977.
GEN ER A L I D EA
Johnson’s idea was that we can construct a suitable helper graph, run Dijkstra once from
each vertex in the helper graph, and then translate the answers back to the original graph.
His method is subtle and clever, but his general strategy is very common, and we’ll see it
again and again. It’s worth highlighting the two parts to his strategy:
TRANSLATION strategy: Translate the problem we want to solve into a different setting,
use a standard algorithm in the different setting, then translate the answer back to the original
setting. In this case, the translated setting is ‘graphs with different edge weights’. Of course
we’ll need to argue why these translated answers solve the original problem. We’ll see more
of the TRANSLATION strategy in Section 6.
AMORTIZATION strategy: It takes work to construct the helper graph, but this work
pays off because we only need to do it once, and we then save time on each of the V runs,
since we can run Dijkstra's algorithm rather than Bellman-Ford. We'll see more of the
AMORTIZATION strategy in Section 7.
PROBLEM STATEMENT
Given a directed graph where each edge is labelled with a weight, (i) if the graph contains
no negative-weight cycles then for every pair of vertices compute the weight of the minimal-
weight path between those vertices; (ii) if the graph contains a negative-weight cycle then
detect that this is so.
IMPLEMENTATION AND ANALYSIS
1. The augmented graph. First build an augmented graph with an extra vertex s, as shown
below. Run Bellman-Ford on this augmented graph, and let the minimum weight from s
to v be dv . (The direct path s → v has weight 0, so obviously dv ≤ 0. But if there are
negative-weight edges in the graph, some vertices will have dv < 0.) If Bellman-Ford reports
a negative-weight cycle, then stop.
[Figure: the augmented graph — the original graph, plus an extra vertex s with a weight-0 edge s → v to every vertex v of the original graph.]
2. The helper graph. Define a helper graph which is like the original graph, but with different
edge weights:

w′(u → v) = du + w(u → v) − dv.

CLAIM: in this helper graph, every edge has w′(u → v) ≥ 0. PROOF: The relaxation
equation, applied to the augmented graph, says that dv ≤ du + w(u → v), therefore
w′(u → v) ≥ 0.
3. Dijkstra on the helper graph. Run Dijkstra’s algorithm V times on the helper graph, once
from each vertex. (We’ve ensured that the helper graph has edge weights ≥ 0, so Dijkstra
terminates correctly.) CLAIM: Minimum-weight paths in the helper graph are the same as
in the original graph. PROOF: Pick any two vertices p and q, and any path between them
p = v0 → v1 → · · · → vk = q.
What weight does this path have, in the helper graph and in the original graph?

weight in helper graph = w′(v0 → v1 ) + · · · + w′(vk−1 → vk )
                       = (dv0 + w(v0 → v1 ) − dv1 ) + · · · + (dvk−1 + w(vk−1 → vk ) − dvk )
                       = dp + (weight in original graph) − dq

since the intermediate d terms cancel. (This algebraic trick is called a telescoping sum.)
Since dp − dq is the same for every path from p to q, the ranking of paths is the same in the
helper graph as in the original graph (though of course the weights are different); and to
translate a minimum weight back to the original graph, we just subtract dp − dq.
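Putting the three steps together, here is a compact Python sketch of Johnson's algorithm, with an edge-list representation chosen for illustration; a real implementation would reuse the Bellman-Ford and Dijkstra code from earlier sections.

```python
import heapq
import math

def johnson(vertices, edges):
    """All-pairs minimum weights. edges: list of (u, v, w) triples.
    Raises ValueError if there is a negative-weight cycle."""
    # Step 1: Bellman-Ford on the augmented graph. Adding a vertex s with
    # weight-0 edges to every vertex is the same as starting every
    # estimate at 0; the augmented graph has |V|+1 vertices, so |V| rounds.
    d = {v: 0 for v in vertices}
    for _ in range(len(vertices)):
        for (u, v, w) in edges:
            if d[u] + w < d[v]:
                d[v] = d[u] + w
    for (u, v, w) in edges:
        if d[u] + w < d[v]:
            raise ValueError("negative-weight cycle detected")

    # Step 2: the helper graph, with weights w' = d_u + w - d_v >= 0
    adj = {v: [] for v in vertices}
    for (u, v, w) in edges:
        adj[u].append((v, d[u] + w - d[v]))

    # Step 3: Dijkstra from each vertex, then translate weights back
    result = {}
    for p in vertices:
        dist = {v: math.inf for v in vertices}
        dist[p] = 0
        heap = [(0, p)]
        while heap:
            du, u = heapq.heappop(heap)
            if du > dist[u]:
                continue  # stale heap entry
            for (v, w) in adj[u]:
                if du + w < dist[v]:
                    dist[v] = du + w
                    heapq.heappush(heap, (dist[v], v))
        for q in vertices:
            # helper weight = original weight + d_p - d_q, so undo the shift
            result[(p, q)] = dist[q] - d[p] + d[q]
    return result
```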
Example (Minimum spanning trees). Suppose we have to build a power grid to connect 6 cities,
and the costs of running cabling are as shown on the left. We’ve learnt how to find a minimum-
cost path between a pair of nodes. But what is the minimum cost tree that connects all the
nodes? (This is called a minimum spanning tree.)
[Figure: three copies of a graph on six vertices a, b, c, d, e, f. The left copy is labelled with the cabling costs (1–9); the other two copies show candidate spanning trees.]
But we're not studying subgraph algorithms for their own sake. We're studying them
because they highlight two strategies for algorithm design:
• We'll see the translation strategy again, in Section 6.4. This is where we translate a
problem into a different setting, solve the translated problem using a standard algorithm,
and translate the solution back to answer the original problem. (The translation strategy
was used by Johnson's algorithm, Section 5.8.)
• Often, the soul of an algorithm can be dressed in different guises. For example, the
soul of Dijkstra's algorithm can help us find a minimum spanning tree (Prim's algorithm,
Section 6.5). The algorithms in Sections 6.5–6.7 are different guises for algorithms whose
souls we've seen before.
6.1. Flow networks
[Figure: a flow network with source s, sink t, and intermediate vertices u, v, w. Each edge is labelled with its capacity (cap. 3, 4, 5, 6, 10, 12) and its current flow; the flow out of s is 12 and the flow back into s is 1.]
This picture illustrates a flow on a network. There are two distinguished vertices, the source
vertex s where flow originates, and the sink vertex t where flow is consumed. The edges are
directed, and labelled with their capacities. The flow value is the net flow out of the source
vertex, and it’s 12 − 1 = 11 in this picture. This is equal to the net flow into the sink vertex,
of course.
What’s the maximum possible flow value, over all possible flows? For this simple
network, it’s fairly easy to discover a flow of value 14. Furthermore, the total capacity of the
edges going into the sink is 14, so it’s impossible to have a flow of value > 14. Therefore the
maximum possible flow value is 14.
In Sections 6.2 and 6.3 we will see an algorithm for finding a maximum flow, and prove
that it is correct. First, here is a pair of flow problems10 that inspired the algorithm.
TWO TRANSPORTATION PROBLEMS
The Russian applied mathematician A.N. Tolstoĭ was the first to formalize the flow problem.
He was interested in the problem of shipping cement, salt, etc. over the rail network. Formally,
he posed the problem "Given a graph with edge capacities, and a list of source vertices and
their supply capacities, and a list of destination vertices and their demands, find a flow that
meets the demands." (We'll only study single-commodity flows, i.e. where there is a single
type of 'stuff' flowing. Multi-commodity flow problems are much, much harder.)

Exercise. In the standard formulation of the flow problem, there is a single source with
unlimited supply capacity, and a single sink. Suppose we have an algorithm that solves
this standard problem. Explain how to use it to solve Tolstoĭ's problem.
10 For further reading, see On the history of the transportation and maximum flow problems by Alexander Schrijver.
The US military was also interested in flow networks during the cold war. If the
Soviets were to attempt a land invasion of Western Europe through East Germany (vertex
EG), they’d need to transport fuel to the front line. The diagram shows the links in the
rail network, and the carrying capacity of each link. It also shows the various available
fuel sources, aggregated into a single vertex marked ORIGINS. What is the max flow from
ORIGINS to EG? More importantly, if the US Air Force wants to strike and degrade one of
the links, which link should it target in order to reduce the max flow? It’s no use hitting a
link where the Soviets can just reroute around the damage.
[Figure: the flow network from the start of this section again — source s, sink t, intermediate vertices u, v, w, each edge labelled with its capacity and its flow.]
But to write a proper problem statement, we need to be more precise than this! Here are
some definitions. Let the weight associated with edge u → v be c(u → v), and call this the
capacity of the edge. Assume it is > 0 for every edge in the graph. A flow is a set of edge
labels f (u → v) such that

0 ≤ f (u → v) ≤ c(u → v) on every edge

and

∑_{u: u→v} f (u → v) = ∑_{w: v→w} f (v → w) at all vertices v ∈ V \ {s, t}.
The second equation is called flow conservation, and it says that as much stuff comes into v
as goes out. Flow conservation doesn’t need to hold at s or t—indeed, the value of a flow is
the net flow out of s,
value(f ) = ∑_{u: s→u} f (s → u) − ∑_{u: u→s} f (u → s).
In the network pictured above the flow value is 12 − 1 = 11, i.e. the total flow out minus the
total flow in.
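These definitions are easy to check mechanically. Here is a sketch, with capacities and flows stored as dicts keyed by edge; the representation is chosen purely for illustration.

```python
def is_valid_flow(capacity, flow, s, t):
    """Check the two defining properties of a flow."""
    # Capacity constraints: 0 <= f <= c on every edge
    if any(not (0 <= flow[e] <= capacity[e]) for e in capacity):
        return False
    # Flow conservation at every vertex except s and t
    vertices = {u for (u, v) in capacity} | {v for (u, v) in capacity}
    for v in vertices - {s, t}:
        into = sum(f for (x, y), f in flow.items() if y == v)
        out = sum(f for (x, y), f in flow.items() if x == v)
        if into != out:
            return False
    return True

def flow_value(flow, s):
    """The net flow out of the source."""
    return (sum(f for (u, v), f in flow.items() if u == s)
            - sum(f for (u, v), f in flow.items() if v == s))
```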
GENERAL IDEA
The basic idea of the algorithm is “look for vertices to which we could increase flow”.
Imagine that the source and all the other vertices apart from the sink are in bandit
country, and the bandits want to siphon off flow from intermediate vertices. We’ll assume
they can sneak into the vertices to siphon off flow, and to redirect existing flow, and they
can also increase the flow at the source. But they daren’t do anything that would disrupt
the total flow to the sink, because that’d attract the attention of the authorities, who would
come and put an end to their banditry. Here are two types of step that the bandits could
take, starting from the flow pictured at the top of the page.
[Figure: three copies of a small network s → {a, b, c} → t, illustrating three adjustments:]
• Turn up the flow at s, and siphon off the excess at s.
• Instead of siphoning off all the excess at s, increase the flow s → b and siphon it off at b.
• Instead of siphoning off all the excess at b, send some of it along b → t and reduce the a → b flow to match, giving an excess at a that can be siphoned off.
6.2 Ford-Fulkerson algorithm 31
Suppose the bandits are overzealous and they discover that they can siphon off some flow at
t. That means that the network operator—which was spying on the bandits all along—has
learned a reconfiguration that delivers extra flow to the sink.
Let’s look more closely at this reconfiguration. In the network fragment below, the
bandits discovered they could siphon off flow at a, thence b, thence c, thence t. How much
could they siphon off at each of these locations?
[Figure: a chain s → a → b → c → t of adjustments, with flow labels such as 2/5 on s → a and 1/8 on c → b, and adjustments +1 and −1 marked along the chain.]

• They could siphon off 3 at a, by increasing s → a (limiting factor: spare capacity on s → a).
• Or siphon off 2 at b, by increasing s → a → b (limiting factor: spare capacity on a → b).
• Or siphon off 1 at c, by increasing s → a → b and decreasing c → b, leaving the other outflow from c undisturbed (limiting factor: existing flow on c → b).
• Or siphon off 1 at t, by increasing s → a → b and decreasing c → b and increasing c → t, leaving the inflow at c undisturbed (limiting factor: existing flow on c → b).
The network operator only wants to get flow to t, not to any of the other vertices. So it
chooses a flow adjustment that gets as much as possible to t, with no excess at any of the
other vertices along the path.
The Ford-Fulkerson algorithm starts with an empty flow, then repeatedly uses this
'bandit search' to find whether it's possible to siphon off flow at the sink. If it is possible, it
adjusts the flow along a suitable sequence of edges and thereby increases the flow value. If
it's not possible, then the algorithm terminates.
IMPLEMENTATION
There are two pieces that we need to turn into a proper formal algorithm, (1) how exactly
the bandit search works, and (2) how to use the results of the bandit search to adjust the
flow.
The residual graph. To formalize the bandit search, we’ll build what’s called the residual
graph. This has the same vertices as the flow network, and it has either one or two edges for
every edge in the original flow network:
• If f (u → v) < c(u → v) in the flow network, let the residual graph have an edge u → v
with the label “increase flow u → v”.
• If f (u → v) > 0 in the flow network, let the residual graph have an edge v → u (i.e. in
the opposite direction) with the label “decrease flow u → v”.
The two clauses here correspond to the two types of adjustment that the bandits can make.
In this illustration, the ‘increase’ edges are solid lines and the ‘decrease’ edges are dotted.
[Figure: an example flow network on vertices s, b, t (among others), with its residual graph overlaid.]
If the bandits can siphon off flow at some vertex u, and if the residual graph has an edge
u → v, then they can siphon flow off at v. They can certainly siphon off some flow at s.
Thus, if the residual graph has a path from s to t, then they can siphon flow off at t.
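The two clauses above translate directly into code. Here is a sketch of building the residual graph, with capacities and flows as dicts keyed by edge (an illustrative representation).

```python
def residual_graph(capacity, flow):
    """Return adjacency lists: h[u] is a list of (v, label) pairs."""
    h = {}
    for (u, v) in capacity:
        if flow[(u, v)] < capacity[(u, v)]:
            # room to push more along u -> v
            h.setdefault(u, []).append((v, "inc"))
        if flow[(u, v)] > 0:
            # existing flow on u -> v can be reduced, seen as an edge v -> u
            h.setdefault(v, []).append((u, "dec"))
    return h
```

A path search (breadth-first or depth-first) on the result then answers the bandit question: which vertices can flow be siphoned off at?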
Augmenting paths. Suppose the residual graph has a path from s to t. This is called an
augmenting path. We could find it using breadth-first search or depth-first search, or any
other path-finding algorithm we like.
[Figure: left, an augmenting path s → b → a → t in the residual graph; right, the corresponding flow adjustments — +δ on s → b, −δ on a → b, and +δ on a → t.]
Each edge in the augmenting path has a label, saying either ‘increase’ or ‘decrease’. In the
path shown here, the s → b edge is labelled “increase the flow s → b”, the b → a edge is
labelled “decrease the flow a → b”, and the a → t edge is labelled “increase the flow a → t”.
We pick some amount δ > 0 to increase/decrease each edge by, and we update the flow.
The crucial thing about this update is that it leaves us with a valid flow. To verify
this, remember the two defining characteristics of a flow: (1) it must satisfy the capacity
constraints 0 ≤ f (u → v) ≤ c(u → v), and (2) it must satisfy flow conservation. The
rules for constructing the residual graph ensure that the updated flow satisfies the capacity
constraints, as long as δ is sufficiently small; and by thinking carefully about the four different
possibilities for what happens at each vertex along the augmenting path, we see that the total
flow in minus total flow out is unchanged.
[Figure: the four possibilities at an intermediate vertex of the augmenting path — the incoming and outgoing residual edges each either add δ to or subtract δ from an underlying edge, and in every combination the net flow through the vertex is unchanged.]
Also, the flow value increases by δ. To see this, consider the two possible labels for the first
edge of the augmenting path. Whether it’s an “increase” edge or a “decrease” edge, either
way the net flow out of s increases.
[Figure: the two possibilities at s — the first edge of the augmenting path either increases flow on an edge out of s (+δ) or decreases flow on an edge into s (−δ).]
1  def ford_fulkerson(g, s, t):
2      # let f be a flow, initially empty
3      for u → v in g.edges:
4          f(u → v) = 0
5
6      # Define a helper function for finding an augmenting path
7      def find_augmenting_path():
8          # define the residual graph h on the same vertices as g
9          for each edge u → v in g:
10             if f(u → v) < c(u → v): give h an edge u → v labelled "inc"
11             if f(u → v) > 0: give h an edge v → u labelled "dec"
12         if h has a path from s to t:
13             return some such path, together with the labels of its edges
14         else:
15             # There is a set of vertices that we can reach starting from s;
16             # call this "the cut associated with flow f".
17             # We'll use this in the analysis.
18             return None
19
20     # Repeatedly find an augmenting path and add flow to it
21     while True:
22         p = find_augmenting_path()
23         if p is None:
24             break
25
26         # Find delta, the amount by which we can adjust flow along the path
27         delta = ∞
28         for each edge u → v in p:
29             if the edge is labelled "inc":
30                 delta = min(delta, c(u → v) − f(u → v))
31             else:
32                 delta = min(delta, f(u → v))
33         # Assert delta > 0
34
35         # Adjust the flow along the path
36         for each edge u → v in p:
37             if the edge is labelled "inc": f(u → v) = f(u → v) + delta
38             else: f(u → v) = f(u → v) − delta
39         # Assert f is still a valid flow
This pseudocode doesn’t tell us how to choose the path in line 13. One sensible idea is ‘pick
the shortest path’, and this version is called the Edmonds–Karp algorithm; it is a simple
matter of running breadth first search on the residual graph. Another sensible idea is ‘pick
the path that makes δ as large as possible’, also due to Edmonds and Karp.
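For concreteness, here is a runnable sketch of the Edmonds–Karp variant, using breadth-first search on the residual graph to pick a shortest augmenting path. The representation (a dict keyed by edge) is chosen for illustration.

```python
from collections import deque

def edmonds_karp(capacity, s, t):
    """capacity: dict mapping (u, v) -> capacity. Returns (value, flow)."""
    flow = {e: 0 for e in capacity}
    vertices = {u for (u, v) in capacity} | {v for (u, v) in capacity}

    def find_augmenting_path():
        # BFS over the residual graph, remembering how each vertex was reached
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in vertices:
                if v in parent:
                    continue
                if (u, v) in capacity and flow[(u, v)] < capacity[(u, v)]:
                    parent[v] = (u, (u, v), "inc"); queue.append(v)
                elif (v, u) in capacity and flow[(v, u)] > 0:
                    parent[v] = (u, (v, u), "dec"); queue.append(v)
        if t not in parent:
            return None
        path, v = [], t
        while v != s:
            u, e, label = parent[v]
            path.append((e, label))
            v = u
        return path

    value = 0
    while True:
        p = find_augmenting_path()
        if p is None:
            return value, flow
        # delta: spare capacity on "inc" edges, existing flow on "dec" edges
        delta = min(capacity[e] - flow[e] if lab == "inc" else flow[e]
                    for e, lab in p)
        for e, lab in p:
            flow[e] += delta if lab == "inc" else -delta
        value += delta
```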
ANALYSIS OF RUNNING TIME
Be scared of the while loop in line 21: how can we be sure it will terminate? In fact, there
are simple graphs with irrational capacities where the algorithm does not terminate. On the
other hand,
Lemma. If all capacities are integers then the algorithm terminates, and the resulting flow on
each edge is an integer.
Proof. Initially, the flow on each edge is 0, i.e. integer. At each execution of lines 27–32,
we start with integer capacities and integer flow sizes, so we obtain δ an integer ≥ 0. It’s
not hard to prove the assertion on line 33, i.e. that δ > 0, by thinking about the residual
graph in find_augmenting_path. Therefore the total flow has increased by an integer after
lines 34–38. The value of the flow can never exceed the sum of all capacities, so the algorithm
must terminate. □
Now let’s analyse running time, under the assumption that capacities are integer. We
execute the while loop at most f ∗ times, where f ∗ is the value of maximum flow. We can build
the residual graph and find a path in it using breadth first search, so find_augmenting_path
is O(V + E). Lines 27–38 involve some operations per edge of the augmenting path, which is
O(V ) since the path is of length ≤ V . Thus the total running time is O (E + V )f ∗ . There’s
no point including the vertices that can’t be reached from s, so we might as well assume
that all vertices can be reached from s, so E ≥ V − 1 and the running time can be written
O(Ef ∗ ).
It is worth noting that the running time we found depends on the values in the input
data (via f ∗ ). This is in contrast to all the algorithms we've studied so far, like Quicksort
and depth-first search, for which we found running times that depend only on the size of
the data. Often in machine learning and optimization, we get answers that depend on the
contents of the data.
On one hand it's good to get an answer that depends on the values in the input data
rather than just the size, because any analysis that ignores the data contents can't be very
informative. On the other hand it's bad in this problem, because we don't have a useful
upper bound for f ∗ .
The Edmonds–Karp version of the algorithm can be shown to have running time
O(E 2 V ).
CORRECTNESS
There are two parts to proving correctness: (1) does this algorithm produce a flow? and (2)
is the flow it produces a maximum flow?
We’ve already argued why the assertion on line 39 is correct, i.e. why the algorithm
produces a valid flow at every iteration. The proof that it does indeed produce a maximum
flow is left to the next section.
∗ ∗ ∗
In computer science textbooks and on YouTube, there are plenty of explanations of the Ford-
Fulkerson algorithm that start by defining the residual graph, and make no mention of what
these notes have called the ‘bandit search’ problem. If you’re an engineer and all you want
is a recipe to follow, then you don’t need to think about the bandit search at all. But if
you’re a computer scientist or mathematician and you want to understand why the algorithm
works, the bandit search idea is the linchpin, and the residual graph is just an implementation
detail. The big idea in the proof of correctness relates directly to the bandit search problem.
The most elegant algorithms are, in my opinion, those in which each part of the algorithm
corresponds to a line of a proof, and where the proof is as concise as it can be.
6.3. Max-flow min-cut theorem
[Figure: the same network and cut, drawn twice. The left picture shows the network s, a, b, c, t with its edge capacities (including cap. 3, cap. 4, cap. 5, cap. 6, cap. 10); the right picture redraws it with the vertices split into two groups, one containing the source, the other containing the sink.]
The proof is based on the idea of a cut. A cut is a partition of the vertices into two sets,
V = S ∪ S̄, with s ∈ S and t ∈ S̄. The capacity of a cut is

capacity(S, S̄) = ∑_{u∈S, v∈S̄: u→v} c(u → v).
The two pictures above show the same network and cut, with cut capacity 20 (not 24!). The
right-hand picture emphasizes that a cut splits the vertices into two groups, one group with
the source, the other with the sink.
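Computing a cut's capacity is a one-liner: count only the edges crossing from S to S̄. A sketch, with capacities in a dict keyed by edge (the network here is illustrative, not the one in the figure):

```python
def cut_capacity(capacity, S):
    """capacity: dict mapping (u, v) -> c(u -> v). S: the set of vertices
    on the source side. Only edges crossing out of S count; edges going
    back into S are ignored."""
    return sum(c for (u, v), c in capacity.items() if u in S and v not in S)
```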
It’s obvious that the maximum flow in this network is ≤ 15, since the total capacity
of all edges out of the source is 12 + 3 = 15. Similarly, by considering all the edges into
the sink, the maximum flow must be ≤ 14. The idea of a cut is to generalize this type of
bound. Looking at the right hand picture above, which shows the cut ({s, b, c}, {t, a}), we
see it’s impossible to push more flow from left to right than the total left→right capacity,
which is 12 + 4 = 16. If we had chosen the cut ({s}, {a, b, c, t}) we’d get the ‘source’ bound,
that flow ≤ 15, and if we had chosen the cut ({s, a, b}, {t}) we’d get the ‘sink’ bound, that
flow ≤ 14. Here’s a theorem to formalize this idea.
Theorem (Max‐flow min‐cut theorem). For any flow f and any cut (S, S̄),
value(f ) ≤ capacity(S, S̄).
Proof. To simplify notation in this proof, we'll extend f and c to all pairs of vertices: if there
is no edge u → v, let f (u → v) = c(u → v) = 0.

value(f ) = ∑_u f (s → u) − ∑_u f (u → s)    by definition of flow value
= ∑_{v∈S} ( ∑_u f (v → u) − ∑_u f (u → v) )    by flow conservation, since the term for each v ∈ S \ {s} is zero
= ∑_{v∈S, u∈S̄} f (v → u) − ∑_{v∈S, u∈S̄} f (u → v)    since the terms for edges within S cancel
≤ ∑_{v∈S, u∈S̄} f (v → u)    (1), since f ≥ 0
≤ ∑_{v∈S, u∈S̄} c(v → u)    (2), since f ≤ c
= capacity(S, S̄). □
[Figure: a number line. The value of every flow (flows g, h, i, and a maximum flow f ∗) lies at or below the capacity of every cut (cuts (A, Ā), (B, B̄), (C, C̄)); the value of f ∗ meets the capacity of the cut (S ∗, S̄ ∗).]
What’s this theorem for? The theorem says that for any cut, every possible flow’s value is
≤ that cut’s capacity. Likewise, for any flow, every possible cut’s capacity is ≥ that flow’s
value.
Hence, if we’re able to find a flow f ∗ and a matching cut (S ∗ , S̄ ∗ ) such that value(f ∗ ) =
capacity(S ∗ , S̄ ∗ ), then every other flow must have value ≤ capacity(S ∗ , S̄ ∗ ), therefore f ∗ is a
maximum flow. In other words, if we can find a matching flow and cut, then the cut acts as a
‘certificate of correctness’. If someone doubts whether our proposed flow truly is a maximum
flow, all we need do is show them our cut, and (assuming the flow value is equal to the cut
capacity!) that proves our flow is correct.11
This is exactly what we need to prove the Ford-Fulkerson algorithm correct. The
algorithm runs the ‘bandit search’ to find all the vertices to which flow could be increased.
The algorithm terminates when it can’t increase flow to the sink—in other words, when the
bandit search produces a cut. And this cut is exactly what we need to certify that the flow
it just found is a maximum flow!
Theorem (Correctness of Ford‐Fulkerson). Suppose the algorithm terminates, and f ∗ is the final
flow it produces. Then f ∗ is a maximum flow.
Proof. Let S ∗ be the set of vertices found in the final call to find_augmenting_path on line 16,
page 32. Since it's the final call, we know it failed to find a path to the sink, i.e. t ̸∈ S ∗ ,
hence (S ∗ , S̄ ∗ ) is a cut.
Now imagine drawing the original flow network with all the S ∗ vertices on the left and
all the S̄ ∗ vertices on the right. Suppose the network has an edge u → v that goes from left
to right, i.e. u ∈ S ∗ and v ̸∈ S ∗ . This means that the residual graph has a path from s to u,
but not a path from s to v; therefore u → v cannot be present in the residual graph. Hence,
by the condition on line 10, f (u → v) = c(u → v).
Next, suppose the flow network has an edge u → v that goes from right to left, i.e.
v ∈ S ∗ and u ̸∈ S ∗ . This means that the residual graph has a path from s to v, but not a
path from s to u, hence v → u is not present in the residual graph. Therefore, by line 11,
f (u → v) = 0.
We have proved that for any edge from S ∗ to S̄ ∗ , the flow on that edge is equal to the
capacity, hence inequality (2) in the proof of the max-flow min-cut theorem is an equality.
And we have also proved that for any edge from S̄ ∗ to S ∗ , the flow on that edge is equal to 0,
hence inequality (1) is an equality also. Therefore the derivation in that proof shows that
value(f ∗ ) is equal to capacity(S ∗ , S̄ ∗ ).
As we have argued, the cut (S ∗ , S̄ ∗ ) thus acts as a ‘certificate’ proving that flow f ∗ is
a maximum flow. □
∗ ∗ ∗
11 In the practical assignment for this part of the course, you’re asked to produce a maximum flow and a
matching cut. The tester doesn’t bother computing its own answer to verify against yours, it simply checks
the certificate you provided.
A cut corresponding to a maximum flow is called a bottleneck cut. (The maximum flow might
not be unique, and the bottleneck cut might not be unique either. But all maximum flows
have the same flow value, and all bottleneck cuts have the same cut capacity.) The RAND
report shows a bottleneck cut, and suggests it’s the natural target for an air strike.
6.4. Matchings
There are several graph problems that don’t on the surface look like flow networks, but
which can be solved by translating them into a well-chosen maximum flow problem. Finding
matchings in bipartite graphs is one example. The example sheet has more.
A bipartite graph is one in which the vertices are split into two sets, and all the edges have
one end in one set and the other end in the other set. For example, kidney transplant donors
and recipients, with edges to indicate compatibility. We’ll assume the graph is undirected.
A matching in a bipartite graph is a selection of some or all of the graph's edges, such that
no vertex is connected to more than one edge in this selection. The size of a matching is the
number of edges it contains. A maximum matching is one with the largest possible size. Our
goal is to find a maximum matching.
IMPLEMENTATION
[Figure: the bipartite graph turned into a flow network. A source s has an edge to every
left-hand vertex, every right-hand vertex has an edge to a sink t, and every edge has
capacity 1.]
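The construction can be sketched in runnable form as follows. This is not the course's code: it assumes the bipartite graph is given as two vertex lists plus an edge set, and it finds augmenting paths by BFS (Edmonds–Karp style) rather than committing to any particular Ford–Fulkerson implementation.

```python
from collections import deque

def max_bipartite_matching(left, right, edges):
    """Translate bipartite matching into max-flow on a unit-capacity
    network, find a max flow by repeated BFS augmenting paths, then
    read the matching off the saturated left-to-right edges."""
    s, t = object(), object()          # fresh source and sink vertices
    # cap[u][v] = remaining capacity of u -> v in the residual graph
    cap = {u: {} for u in [s, t] + list(left) + list(right)}
    def add_edge(u, v):
        cap[u][v] = 1
        cap[v].setdefault(u, 0)        # residual edge, initially 0
    for l in left:
        add_edge(s, l)
    for (l, r) in edges:
        add_edge(l, r)
    for r in right:
        add_edge(r, t)

    def augmenting_path():
        # BFS from s to t in the residual graph
        come_from = {s: None}
        q = deque([s])
        while q:
            u = q.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in come_from:
                    come_from[v] = u
                    if v is t:
                        path = [t]
                        while path[-1] is not s:
                            path.append(come_from[path[-1]])
                        return list(reversed(path))
                    q.append(v)
        return None

    path = augmenting_path()
    while path is not None:
        for u, v in zip(path, path[1:]):   # push one unit of flow
            cap[u][v] -= 1
            cap[v][u] += 1
        path = augmenting_path()

    # an original edge l -> r carries flow 1 iff its capacity is used up
    return {(l, r) for (l, r) in edges if cap[l][r] == 0}
```

The returned set is a matching by Proof (of 1) below, and a maximum one by Proof (of 2).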
ANALYSIS
It’s easy to not even notice that there’s something that needs to be proved here. To un-
derstand the issue, it’s helpful to imagine we’re explaining the procedure to someone who
doesn’t know anything at all about Ford–Fulkerson. They’ll ask us two questions:
• How do you know that a maximum flow can be translated into a matching? For example,
what if it returns a flow with a fractional amount on some edge?
• How do you know that your maximum flow actually gives a maximum matching? What
if there’s a larger-size matching out there, which perhaps corresponds to a lower-value
flow, or perhaps doesn’t correspond to a flow at all?
[Figure: each matching corresponds to a flow with value equal to the matching’s size; a
hypothetical matching with larger size would correspond to a flow with larger value.]
We need to justify the translation in two directions. First we have to justify why the flow f ∗
found in Step 3 can be translated into a matching m∗ . Second we have to justify why any
hypothetical larger-size matching m′ would translate into a higher-value flow f ′ .
Once we’ve justified these translations, it’s easy to argue that m∗ is a maximum-size
matching. Suppose it were not. Then there would be a larger-size matching m′ , with a
corresponding f ′ for which value(f ′ ) > value(f ∗ ). But f ∗ is a maximum flow, therefore no
such f ′ can exist. Hence the premise is false, i.e. m∗ is a maximum-size matching.
This style of algorithm and proof is called the TRANSLATION strategy. Remember,
when you use it, that you have to justify the translation in both directions. Once we’ve
figured out what it is we have to prove, the proof is easy.
Proof (of 1). The lemma in Section 6.2 on page 33 tells us that the Ford-Fulkerson algorithm
terminates, since all edge capacities are integer. Write f ∗ for the flow produced by Ford-
Fulkerson. The lemma tells us furthermore that f ∗ is integer on all edges. Since the edge
capacities are all 1, the flow must be 0 or 1 on all edges. Translate f ∗ into a matching m∗ , by
simply selecting all the edges in the original bipartite graph that got f ∗ = 1. The capacity
constraints on edges from s mean that each left-hand vertex has either 0 or 1 flow coming
in, so it must have 0 or 1 flow going out, therefore it is connected to at most one edge in m∗ .
Similarly, each right-hand vertex is connected to at most one edge in m∗ . Therefore m∗ is a
matching.
Proof (of 2). Take any matching m and translate it into a flow f in the natural way, i.e.
with a flow of 1 from s to every matched left hand vertex, and similarly for t. It’s easy
to use the definition of ‘matching’ to prove that f is indeed a flow, i.e. that it satisfies
the capacity constraints as well as flow conservation. From this translation it’s clear that
size(m) = value(f ). □
6.5. Prim’s algorithm
How can we find an MST? We’ll look at an algorithm due to Jarník (1930), and inde-
pendently to Prim (1957) and Dijkstra (1959).
[Figure: four snapshots of the greedy tree-building process on a small weighted graph,
growing the tree by one lowest-weight edge at each step.]
APPLICATIONS
• The MST problem was first posed and solved by the Czech mathematician Borůvka
in 1926, motivated by a network planning problem. His friend, an employee of the
West Moravian Powerplants company, put to him the question: if you have to build
an electrical power grid to connect a given set of locations, and you know the costs of
running cabling between locations, what is the cheapest power grid to build?
GENERAL IDEA
We’ll build up the MST greedily. Suppose we’ve already built a tree containing some of the
vertices (start it with just a single vertex, chosen arbitrarily). Look at all the edges between
the tree we’ve built so far and the adjacent vertices that aren’t part of the tree, pick the edge
of lowest weight among these and add it to the tree, then repeat.
This greedy algorithm will certainly give us a spanning tree. To prove that it’s an MST
takes some more thought.
12 From Multiple-Locus Variable Number Tandem Repeat Analysis of Staphylococcus Aureus, Schouls et al.,
PROBLEM STATEMENT
Given a connected undirected graph with edge weights, construct an MST.
IMPLEMENTATION
We don’t need to recompute the nearby vertices every iteration. Instead we can use a
structure very similar to Dijkstra’s algorithm for shortest paths: store a ‘frontier’ of vertices
that are neighbours of the tree, and update it each iteration. For each of the frontier vertices
w, we’ll store the lowest-weight edge connecting it to the tree that we’ve discovered so far
(w.come_from), and the weight of that edge (w.distance). We pick the frontier vertex v with
the smallest v.distance, add it to the tree, and add its neighbours to the frontier if they’re
not already in the tree. When the algorithm terminates, an MST is formed from the edges
{v ↔ v.come_from : v ∈ V \ {s}}.
1  def prim(g, s):
2      for v in g.vertices:
3          v.distance = ∞
4  +        v.in_tree = False
5  +    s.come_from = None
6      s.distance = 0
7      toexplore = PriorityQueue([s], lambda v: v.distance)
8
9      while not toexplore.isempty():
10         v = toexplore.popmin()
11 +       v.in_tree = True
12         # Let t be the graph made of vertices with in_tree=True,
13         # and edges {w ↔ w.come_from, for w in g.vertices excluding s}.
14         # Assert: t is part of an MST for g
15         for (w, edgeweight) in v.neighbours:
16 ×           if (not w.in_tree) and edgeweight < w.distance:
17 ×               w.distance = edgeweight
18 +               w.come_from = v
19                 if w in toexplore:
20                     toexplore.decreasekey(w)
21                 else:
22                     toexplore.push(w)
Compared to Dijkstra’s algorithm, we need some extra lines to keep track of the tree
(lines labelled +), and two modified lines (labelled ×) because here we’re interested in ‘dis-
tance from the tree’ whereas Dijkstra is interested in ‘distance from the start node’. The
start vertex s can be chosen arbitrarily.
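For reference, here is a runnable sketch of the same algorithm. Python's heapq has no decreasekey, so this version uses the standard workaround of pushing duplicate entries and skipping stale ones when popped; the dictionary-based graph representation is an assumption of this sketch, not the notes' interface.

```python
import heapq

def prim_mst(neighbours, s):
    """Prim's algorithm. neighbours maps each vertex to a list of
    (w, edgeweight) pairs; the graph is undirected and connected.
    Returns (total weight of the MST, list of tree edges)."""
    distance = {v: float('inf') for v in neighbours}
    come_from = {s: None}
    in_tree = set()
    distance[s] = 0
    heap = [(0, s)]                     # (distance-from-tree, vertex)
    tree_edges, total = [], 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in in_tree:
            continue                    # stale entry: v already in the tree
        in_tree.add(v)
        if come_from[v] is not None:
            tree_edges.append((come_from[v], v))
            total += d
        for (w, edgeweight) in neighbours[v]:
            if w not in in_tree and edgeweight < distance[w]:
                distance[w] = edgeweight
                come_from[w] = v
                heapq.heappush(heap, (edgeweight, w))
    return total, tree_edges
```

Pushing duplicates leaves the asymptotic running time with a binary heap at O(E log V), since the heap holds at most one entry per edge.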
ANALYSIS
Running time. It’s easy to check that Prim’s algorithm terminates. It is nearly identical to
Dijkstra’s algorithm, and exactly the same analysis of running time applies: it is O(E + V log V ).
Correctness. To prove that Prim’s algorithm does indeed find an MST (and for many other
problems to do with constructing networks on top of graphs) it’s helpful to make a definition.
A cut of a graph is an assignment of its vertices into two non-empty sets, and an edge is said
to cross the cut if its two ends are in different sets.
[Figure: an undirected graph with edge weights; a cut into {a, b} and {c, d}, with two
edges crossing the cut; and a cut into {a} and {b, c, d}, with one edge crossing the cut.]
It can be proved by induction that Prim’s algorithm produces an MST, using the following
theorem. The details of the induction are left as an exercise. (The theorem as stated here is
more general than is needed for Prim’s algorithm. That’s so that we can re-use it for
Kruskal’s algorithm in the next section.)

Theorem. Suppose we have a forest F and a cut C such that (i) no edges of F cross C, and
(ii) there exists some MST that contains F . Then, if we add to F a min-weight edge that
crosses C, the result is still part of a MST.
Proof. Let F be the forest, and let F̄ be an MST that F is part of (the condition of the
theorem requires that such an F̄ exists). Let e be a minimum weight edge across the cut.
We want to show that there is an MST that includes F ∪ {e}. If F̄ includes edge e, we are
done. Otherwise, adding e to F̄ creates a cycle, and this cycle must cross the cut in at least
one other edge e′ . Swapping e′ for e in F̄ gives another spanning tree; it still contains F
(since e′ crosses the cut, it is not an edge of F ); and since e is a minimum weight edge
across the cut, weight(e) ≤ weight(e′ ), so the new spanning tree’s weight is no greater than
F̄ ’s. Hence it is an MST containing F ∪ {e}. □
6.6. Kruskal’s algorithm

GENERAL IDEA
Kruskal’s algorithm builds up the MST by agglomerating smaller subtrees together. At
each stage, we’ve built up some fragments of the MST. The algorithm greedily chooses two
fragments to join together, by picking the lowest-weight edge that will join two fragments.
(Kruskal’s algorithm maintains a ‘forest’. Look back at Section 5.1 for the definition.)
PROBLEM STATEMENT
(Same as for Prim’s algorithm.) Given a connected undirected graph with edge weights,
construct an MST.
IMPLEMENTATION
We could scan through all the edges in the graph at every iteration, looking for the best
edge to add next. Or we could maintain a list of all candidate edges, pre-sorted in order of
increasing weight, and iterate through it; and every time we join two fragments, remove all
the edges from this list that have just become redundant.
Kruskal’s algorithm doesn’t do either of these. It uses a list of edges pre-sorted in order
of increasing weight, but it doesn’t do any housekeeping on the list—it just considers each
edge in turn, asks “does this edge join two fragments?”, and skips over the edges that don’t.
The data structure it uses to perform this test is called a DisjointSet. This keeps track of a
collection of disjoint sets (sets with no common elements), also known as a partition. Here
we’re using it to keep track of which vertices belong to which fragment. (We’ll study the
DisjointSet data structure in Section 7.9.)
Initially (lines 4–5) every vertex is in its own fragment. We iterate through all the
edges of the graph in order of edge weight, lowest edge weight first (lines 6–8), and for each
edge we test whether that edge’s ends belong to the same set (lines 9–11). If they belong
to different sets, we add that edge to the tree and merge the two sets (lines 12–13).
1  def kruskal(g):
2      tree_edges = []
3      partition = DisjointSet()
4      for v in g.vertices:
5          partition.addsingleton(v)
6      edges = sorted(g.edges, sortkey = lambda u, v, edgeweight: edgeweight)
7
8      for (u, v, edgeweight) in edges:
9          p = partition.getsetwith(u)
10         q = partition.getsetwith(v)
11         if p != q:
12             tree_edges.append((u, v))
13             partition.merge(p, q)
14             # Let f be the forest made up of edges in tree_edges.
15             # Assert: f is part of an MST
16
17     return tree_edges
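The DisjointSet operations are deferred to a later section. To make the pseudocode above executable, here is a minimal union-find sketch; the method names match the pseudocode, but the internals (a parent map with path compression, and merging by re-pointing roots) are my own filling-in, not the course's implementation.

```python
class DisjointSet:
    """Minimal union-find, enough to drive the kruskal pseudocode above.
    getsetwith returns a canonical representative, so two elements are
    in the same set iff their representatives are equal."""
    def __init__(self):
        self.parent = {}
    def addsingleton(self, v):
        self.parent[v] = v              # v is its own representative
    def getsetwith(self, v):
        root = v
        while self.parent[root] != root:
            root = self.parent[root]    # walk up to the root
        while self.parent[v] != root:   # path compression
            self.parent[v], v = root, self.parent[v]
        return root
    def merge(self, p, q):
        # p and q are representatives returned by getsetwith
        self.parent[p] = q
```

With path compression alone, each operation is fast enough for the O(E log E) total in the running-time analysis below to go through.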
ANALYSIS
Running time. The running time of Kruskal’s algorithm depends on how DisjointSet is im-
plemented. We’ll see in Section 7.4 that all the operations on DisjointSet can be done in
O(1) time13 . The total cost is O(E log E) for the sort on line 6; O(E) for iterating over edges
in lines 8–11; and O(V ) for lines 12–13, since there can be at most V merges. So the total
running time is O(E log E).
The maximum possible number of edges in an undirected graph is V (V − 1)/2, and the
minimum number of edges in a connected graph is V − 1, so log E = Θ(log V ), and so the
running time can be written O(E log V ).
Correctness. To prove that Kruskal’s algorithm finds an MST, we apply the theorem used
for the proof of Prim’s algorithm, as follows. When the algorithm merges fragments p and
q, consider the cut of all vertices into p versus not-p; the algorithm picks a minimum-weight
edge across this cut, and so by the theorem we’ve still got something that’s part of an MST.
APPLICATION
If we draw the tree fragments another way, the operation of Kruskal’s algorithm looks like
clustering, and its intermediate stages correspond to a classification tree:
[Figure: an undirected graph with edge weights; the MST found by Kruskal’s algorithm;
and the same process drawn as clustering, with each fragment drawn as a subtree and an
arc drawn whenever two fragments are joined.]
This can be used for image segmentation. Here we’ve started with an image, put vertices on
a hexagonal grid, added edges between adjacent vertices, given low weight to edges where the
vertices have similar colour and brightness, run Kruskal’s algorithm to find an MST, split
the tree into clusters by removing a few of the final edges, and coloured vertices by which
cluster they belong to.
13 This is a white lie. The actual complexity is O(αn ) for a DisjointSet with n elements, where αn is a
function that grows extraordinarily slowly.
6.7. Topological sort
[Figure: two simple graphs and three attempted total orderings; the first two orderings
are valid (‘ok’), the third is not (‘not ok’).]
For what graphs is it possible to put all the vertices into a total order? And, if it is possible,
how can we compute the total order? The picture above shows two simple graphs, and three
attempted total orders. The first two are valid total orders, and the third is not—and a
moment’s thought about the second graph tells us that it’s impossible to find a total order,
because of the cycle.
GENERAL IDEA
Recall depth-first search. After reaching a vertex v, it visits all v’s children and other
descendants. We want v to appear earlier in the ordering than all its descendants. So,
can we use depth-first search to find a total ordering?
Here again is the depth-first search algorithm. This is dfs_recurse from Section 5.2, but
modified so that it visits the entire graph (rather than just the part reachable from some
given start vertex).
1  def dfs_recurse_all(g):
2      for v in g.vertices:
3          v.visited = False
4      for v in g.vertices:
5          if not v.visited:
6              visit(v)   # start dfs from v
7
8  def visit(v):
9      v.visited = True
10     for w in v.neighbours:
11         if not w.visited:
12             visit(w)
A standard way to visualise program execution is with a flame chart. Time goes on
the horizontal axis, each function call is shown as a rectangle, and if function f calls function
g then g is drawn above f . Here is the flame chart for running dfs_recurse_all on a simple
graph.
[Figure: a four-vertex graph and the flame chart for running dfs_recurse_all on it. The
top-level loop calls visit(a) and then visit(b); visit(a) calls visit(d), and visit(b) calls
visit(c).]
And here, on the left, is what happens if we order vertices by when we visit them: a, d, b, c.
It turns out not to be a total order. A better guess is to order vertices by when visit(v)
returns: d, a, c, b.
PROBLEM STATEMENT
Given a directed acyclic graph (DAG), return a total ordering of all its vertices, such that if
v1 → v2 then v1 appears before v2 in the total order.
ALGORITHM
This algorithm is due to Knuth. It is based on dfs_recurse_all, with some extra lines (labelled
+). These extra lines build up a linked list for the rankings, as the algorithm visits and leaves
each vertex.
1  def toposort(g):
2      for v in g.vertices:
3          v.visited = False
4          # v.colour = 'white'
5  +    totalorder = []   # an empty list
6      for v in g.vertices:
7          if not v.visited:
8              visit(v, totalorder)
9  +    return totalorder
10
11 def visit(v, totalorder):
12     v.visited = True
13     # v.colour = 'grey'
14     for w in v.neighbours:
15         if not w.visited:
16             visit(w, totalorder)
17 +   totalorder.prepend(v)
18     # v.colour = 'black'
This listing also has some commented lines which aren’t part of the algorithm itself, but which
are helpful for arguing that the algorithm is correct. They’re a bit like assert statements:
they’re there for our understanding of the algorithm, not for its execution.
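As a sanity check, here is a runnable Python version of the pseudocode. Representing the graph as a plain dictionary is an assumption of this sketch; the g.vertices / v.neighbours interface above is pseudocode.

```python
def toposort(neighbours):
    """Topological sort by depth-first search, as in the listing above.
    neighbours maps each vertex to the list of vertices it has edges to.
    Returns a total order (as a list) in which every edge points
    forwards. Assumes the graph is a DAG."""
    visited = {v: False for v in neighbours}
    totalorder = []
    def visit(v):
        visited[v] = True
        for w in neighbours[v]:
            if not visited[w]:
                visit(w)
        totalorder.insert(0, v)     # 'prepend': v goes before its descendants
    for v in neighbours:
        if not visited[v]:
            visit(v)
    return totalorder
```

Since each vertex is prepended only after all its out-neighbours have been prepended, every edge points forwards in the returned list.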
ANALYSIS
Running time. We haven’t changed anything substantial from dfs_recurse so the analysis in
Section 5.2 still applies: the algorithm obviously terminates (thanks to the visited flag, which
ensures we never visit a vertex more than once), and its running time is O(V + E).
Theorem (Correctness). The toposort algorithm returns totalorder which solves the problem
statement.
Proof. Pick any edge v1 → v2 . We want to show that v1 appears before v2 in totalorder. It’s
easy to see that every vertex is visited exactly once, and on that visit (1) it’s coloured grey,
(2) some stuff happens, (3) it’s coloured black. Let’s consider the instant when v1 is coloured
grey. At this instant, there are three possibilities for v2 :
• v2 is black. If this is so, then v2 has already been prepended to the list, so v1 will be
prepended after v2 , so v1 appears before v2 .
• v2 is white. If this is so, then v2 hasn’t yet been visited, therefore we’ll call visit(v2 ) at
some point during the execution of lines 14–16 in visit(v1 ). This call to visit(v2 ) must
finish before returning to the execution of visit(v1 ), so v2 gets prepended earlier and v1
gets prepended later, so v1 appears before v2 .
• v2 is grey. If this is so, then there was an earlier call to visit(v2 ) which we’re currently
inside. The call stack corresponds to a path in the graph from v2 to v1 . But we’ve
picked an edge v1 → v2 , so there is a cycle, which is impossible in a DAG. This is a
contradiction, so it’s impossible that v2 is grey. □
∗ ∗ ∗
The breakpoint proof technique. The proof technique we used here was (1) consider an instant
in time at which the algorithm has just reached a line of code; (2) reason about the current
state of all the variables, and the call stack, using mathematical logic; (3) make an inference
about what the algorithm does next. This is the same structure as the proof of correctness
of Dijkstra’s algorithm.
For this proof technique to work, we may need to store extra information about program
state, so that the mathematical reasoning can use it. In this case, we invented the variable
v.colour, which records a useful fact about what happened in the past. The algorithm doesn’t
need it, but it’s useful for the maths proof.
Example (Dijkstra’s algorithm / priority queue). Dijkstra’s algorithm uses the Priority Queue data
structure, which supports operations popmin, push, and decreasekey. A single run of Dijkstra’s
algorithm involves running time
O(V ) × popmin + O(E) × push / decreasekey.

(See the analysis of Dijkstra’s algorithm: Section 5.4, page 14.)
To work out the worst-case cost of a run of Dijkstra’s algorithm, it doesn’t matter what the
worst-case cost is for a single call to popmin or the other operations, what matters is the
worst-case cost for the aggregate of all of the calls. ♢
But can’t we find the worst-case cost for the aggregate of many calls by simply adding
up the worst-case costs for each individual call? The next example shows why not. It
shows that if we simply add up worst-case costs of individual operations, we’ll get an unduly
pessimistic bound for the worst-case cost of the aggregate.
Example (Binomial heap). Suppose we have a binomial heap and we know it never has more
than N elements. The worst-case cost of inserting an element is O(log N ), for the case when
there are O(log N ) trees in the heap. (Binomial heaps: Section 4.8.2.)
But what about two insertions in a row? We certainly can’t hit this O(log N ) worst case
twice in a row. Indeed, with some careful thought about how binomial heaps work, either the
first insertion or the second insertion must be O(1). With even more careful thought it can
be shown that the aggregate cost of N insertions is O(N ). (This is similar to heapsort,
Section 2.10, which takes O(N log N ) to insert N items one by one but only O(N ) to heapify
them in a batch.)
Another way to put it: if one operation ends up needing to do a lot of work, then it
will at least leave the data structure in a nice clean state so that the following operations are
fast. ♢
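One standard way to see the O(N) aggregate (an analogy, not taken from the notes) is that insertion into a binomial heap behaves like incrementing a binary counter: the heap keeps one tree per set bit of the element count, and an insertion that has to merge k trees corresponds to an increment that flips k trailing 1s. A small sketch that counts this merge work:

```python
def total_merge_work(n):
    """Aggregate work for n insertions, modelled as n binary-counter
    increments: each insertion costs 1 plus the number of carries
    (tree merges) it triggers. The aggregate is at most 2n."""
    work, count = 0, 0
    for _ in range(n):
        k = 0
        while (count >> k) & 1:    # trailing 1s of count = carries = merges
            k += 1
        work += k + 1
        count += 1
    return work
```

A single insertion can still cost log N merges (when count is all 1s), but across the whole sequence the merge work averages out to O(1) per insertion.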
Example (HashTable). A hash table is typically stored as an array, and we increase the capac-
ity of the array and rehash all the existing items whenever the occupancy exceeds a certain
threshold. Java’s HashMap, for example, rehashes once occupancy reaches 75%. Most inser-
tions don’t trigger a rehash so they are very fast, O(1) for a hash table with chaining (see
Section 4.7); but every so often an insertion will trigger a rehash, which takes O(N ) where
N is the number of items. So the worst case for a single insertion is O(N )—but the worst
case for N insertions grows slower than N² . ♢
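A quick way to see the sub-quadratic aggregate is to simulate the cost model just described. The 75% threshold matches the example; the starting capacity of 8 and the rehash-before-insert convention are arbitrary assumptions of this sketch.

```python
def hashtable_insert_costs(n, threshold=0.75):
    """Model per-insertion costs for a chained hash table that doubles
    its capacity and rehashes (cost = number of existing items) when an
    insertion would push occupancy above `threshold`."""
    capacity, size, costs = 8, 0, []
    for _ in range(n):
        cost = 1                        # the insertion itself
        if (size + 1) / capacity > threshold:
            cost += size                # rehash all existing items
            capacity *= 2
        costs.append(cost)
        size += 1
    return costs
```

The per-insert worst case really is Θ(N), but the doublings mean each rehash is roughly twice the size of the previous one, so their total is O(N) and the aggregate stays linear.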
Advanced data structures are those that have been designed with cunning tricks so that
the work done by one operation can benefit subsequent operations. This is done to reduce
the aggregate cost of sequences of operations.
The study of advanced data structures is centered around analysing the worst-case
aggregate computational complexity of sequences of operations. This is known as aggregate
analysis.
TIGHT BOUNDS AND WORST CASES
When we analyze computational complexity, it’s good practice to give tight big-O bounds on
the worst-case performance.
It’s technically correct to say that the worst-case cost of inserting N elements into a
binomial heap, starting from empty, is O(N log N ). It’s also O(N !), and O(N^N ) ... these are
all technically correct, but pretty useless! What’s the worst case that can actually happen?
After looking at some example binomial heaps, we soon find we’re unable to construct a
scenario in which the cost is N log N ; this tells us that either the O(N log N ) bound isn’t
tight, or that we haven’t looked hard enough.
Formally, big-O notation provides an upper bound (refresher: Section 2.3.2). When we say
“the worst-case cost is O(f (N ))” what we really mean is: there exist N0 and κ > 0 such that
for all N ≥ N0 , the worst-case cost is ≤ κ f (N ).
To argue that this bound is tight, we should be able to demonstrate scenarios of cost at least
f (N ). Formally14 , we should give a constant κ′ > 0, and a sequence of scenarios i = 1, 2, . . .
of sizes N1 , N2 , . . . with Ni → ∞ such that

    (cost for scenario i) ≥ κ′ f (Ni ).

Only if both of these bounds hold would we say that our big-O bound is tight.
14 The purists who want to say that the worst-case complexity is Θ(f (N )) would need to give N0 and κ′ > 0
and a scenario for every N ≥ N0 of cost ≥ κ′ f (N ). In the author’s opinion, that’s going beyond the call of
duty.
7.2. Amortized costs: introduction
Here’s a simple illustration. Suppose we want to store a list of items, and we have to
support four operations:
class MinList<T>:
    append(T v)   # add a new item v to the list
    flush()       # empty the list
    foreach(f)    # do f(x) for each item in the list
    T min()       # get the minimum of all items in the list
The first three operations are straightforward; a simple linked list will do. To implement min,
here are four stages of enlightenment.
Stage 0. Simply iterate through the entire list every time we want to compute min.
This takes time Θ(N ), where N is the number of items in the list.
Stage 1. It’s a waste to redo the work involved in computing min. Instead, we should
remember the result of the last call to min, and keep a pointer to the tail of the list at the
time of that last call. Next time we need min we only need to iterate through the items
added after the last min or flush. It’s faster—but the worst case is still Θ(N ).
Stage 2. We could store the minimum of the entire list, and update it every time a
new item is added. This way, append and min are both O(1), and obviously we can’t do any
better than this! We’d describe this as “amortizing the computation of min”, meaning we
split the work up into small pieces done along the way.
Stage 3. If we count up the total amount of work for the Stage 2 implementation,
it’s no better than that for Stage 1. (It could even be worse, if flush is called before min.)
Morally, it’s unfair to say that Stage 2 has better running-time complexity. Instead, we
should just stick with the Stage 1 implementation and ‘pay off’ the running time of min in
early repayments, by ascribing its cost to the append calls that preceded it, and which caused
min all that work in the first place.
We’d say “the amortized cost of append is capp + c2 , and the amortized cost of min is c1 ”.
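Here is a sketch of the Stage 1 implementation, to make the discussion concrete. A Python list and an index stand in for the linked list and tail pointer, and the field names (cached_min, scanned) are mine; the notes don't prescribe an implementation.

```python
class MinList:
    """Stage 1: remember the result of the last call to min, and only
    scan the items appended since then. The worst case of a single min
    is still Θ(n), but repeated calls don't redo old work."""
    def __init__(self):
        self.items = []
        self.cached_min = None    # min of items[:scanned], or None
        self.scanned = 0          # how many items the cache covers
    def append(self, v):
        self.items.append(v)
    def flush(self):
        self.items, self.cached_min, self.scanned = [], None, 0
    def foreach(self, f):
        for x in self.items:
            f(x)
    def min(self):
        for x in self.items[self.scanned:]:   # only the new items
            if self.cached_min is None or x < self.cached_min:
                self.cached_min = x
        self.scanned = len(self.items)
        return self.cached_min
```

Each item is scanned by min at most once between flushes, which is exactly the observation that the amortized accounting makes precise.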
WHY AMORTIZE COSTS?
The point of this accounting trick is to make it easier to get tight bounds on aggregate costs.
The amortized cost of each append is capp + c2 = O(1). The amortized cost of
each min is c1 = O(1). So the worst-case aggregate cost is m1 O(1) + m2 O(1) =
O(m1 + m2 ).
It’s a nuisance to always have to reason carefully about aggregate costs of sequences of
operations. Much better if someone (preferably someone else!) thinks hard about the best
way to ascribe costs, and tells us an amortized cost for each operation that the data structure
supports. Then it’s easy for us to work out aggregate costs, by just adding up amortized
costs. Amortization is nothing more than an accounting trick to make it easier to find tight
bounds on aggregate cost.
There’s a subtle caveat. In the analysis we just did, we snuck in the little phrase ‘on
an initially-empty list’. Suppose instead we had started with a list of N items, and we called
min once: then the cost of this single operation is N , and no clever accounting trick can turn
it into O(1).
In practice, this isn’t a problem. For example, when we set out to analyse Dijkstra’s
algorithm, which uses a Priority Queue, we do indeed start out with an initially-empty data
structure. It’s only when we start thinking too hard about what amortized analysis actually
means that things get confusing—and the best way to resolve confusion is with a rigorous
definition. That’s the topic of the next section.
7.3. Amortized costs: definition
worst-case aggregate cost ≤ m1 O(1) + m2 O(log N ) = O(m1 + m2 log N ).
This is the fundamental inequality of amortized analysis. If someone else has told us amor-
tized costs, then this inequality tells us what we can do with them. If we are asked to find
amortized costs, this is the inequality we have to ensure is satisfied.
How does one invent amortized costs? Section 7.4 suggests a useful strategy—but for
now we’ll just take them as given.
Asymptotic usage. The fundamental inequality applies to sequences of any length. It is not
an asymptotic (big-O) statement.
Nevertheless, it is in the context of asymptotic analysis that we usually encounter
amortized costs. We might read for example “the amortized cost of push is O(1) and the
amortized cost of popmin is O(log N ), where N is an upper bound on the number of elements
stored”. This tells us that for any sequence of m1 × push and m2 × popmin operations, applied
to an initially empty data structure,
worst-case aggregate cost ≤ m1 O(1) + m2 O(log N ) = O(m1 + m2 log N ).
More precisely: there exists N0 and κ > 0 such that, for any N ≥ N0 and for any such
sequence of operations on a data structure which always has ≤ N elements,
worst-case aggregate cost ≤ κ(m1 + m2 log N ).
Typically we’re interested in the case where m1 and m2 grow with N . It might be m1 = N 2
and m2 = 1, or m1 = log N and m2 = log log N , or anything at all—the fundamental
inequality has to hold for all sequences.
EXAMPLE: DYNAMIC ARRAY
Here is a more involved example. Consider a dynamically-sized array with initial capacity 1,
and which doubles its capacity whenever it becomes full. (To be precise: we maintain a
fixed-length array; when it becomes full then we allocate a new fixed-length array of double
the length, copy everything across from the old array, and deallocate the old array. We’ll
only consider appending, not deleting.)
initially empty
append → •
append, requires doubling capacity → ••
append, requires doubling capacity → •••
append → ••••
append, requires doubling capacity → •••••
Suppose that the cost of writing in an element is 1, and the cost of doubling capacity from
m to 2m and copying everything across is κm for some constant κ > 0.
Aggregate analysis. After adding n elements, the cost of the initial writes is n, and the total
cost from all of the doubling is

    κ(1 + 2 + · · · + 2^⌊log2 (n−1)⌋ )

which is ≤ κ(2n − 3). Thus, the total cost of n calls to append is ≤ κ(2n − 3) + n = O(n).
(Standard formula: 1 + r + · · · + r^(n−1) = (r^n − 1)/(r − 1).)
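This bound can be checked numerically with a small simulation of the cost model (the model is as stated above: cost 1 per write, κm to double from capacity m; the code itself is just an illustration, not part of the notes).

```python
def append_cost_total(n, kappa=1.0):
    """Total cost of n appends to a doubling array with initial
    capacity 1, under the model: 1 per write, kappa*m to grow the
    capacity from m to 2m (copying everything across)."""
    capacity, size, total = 1, 0, 0.0
    for _ in range(n):
        if size == capacity:            # full: double before writing
            total += kappa * capacity
            capacity *= 2
        total += 1                      # write the new element
        size += 1
    return total
```

For κ = 1 and n = 5 the doublings cost 1 + 2 + 4 = 7 and the writes cost 5, matching the κ(2n − 3) + n bound exactly.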
Amortized costs. Let’s ascribe a cost c′ = 2κ + 1 to each append operation. Then, for n calls
to append, the aggregate amortized cost is (2κ + 1)n, which is ≥ κ(2n − 3) + n, the aggregate
true cost; so the fundamental inequality “aggregate true cost ≤ aggregate amortized cost” is
satisfied, i.e. these are valid amortized costs. We’d write this as “the amortized cost of
append is O(1).”
∗ ∗ ∗
Why do we keep on putting in the caveat “applied to an initially empty data structure”? It’s
easier to explain why this is needed, now that we have a rigorous definition and a concrete
example.
Suppose we start with a dynamic array that’s on the brink of needing to be expanded.
In other words, let it have capacity N , and let it hold N items. (Of course this requires that
N be a power of 2.) Now, consider a single append: this single operation has cost κN + 1.
If we wanted our fundamental inequality to hold for all possible sequences of operations and
all possible initial states of the data structure, we’d be forced to say that the amortized cost
of append is O(N )—but then we’d be back in the land of naive per-operation worst case
analysis, and that doesn’t give us tight bounds for other sequences of operations.
This is why formal statements about amortized costs generally have awkward phrasing
such as “consider any sequence of m operations on a data structure, initially empty, whose
size is always ≤ N ”, or alternatively “consider any sequence of m ≥ N operations on a data
structure, initially empty, where N of those operations are insertions”.
7.4. Potential functions
Here empty refers to the empty initial state. Now, consider an operation op which, when
applied to state Sante , yields state Spost , and which has true cost c. We’ll write this as

    Sante −c→ Spost .

Define the modified cost of the operation to be

    c′ = c + Φ(Spost ) − Φ(Sante ).
Theorem (the ‘potential theorem’). The modified costs defined in this way are valid amortized
costs, i.e. they satisfy the fundamental inequality of amortized analysis.
This begs the question: how do we come up with potential functions? It’s generally
easier to come up with useful potential functions than it is to come up with amortized costs
from scratch. We’ll give some guidance in a moment, after looking at an example.
EXAMPLE ANALYSIS USING POTENTIAL FUNCTIONS
Consider the dynamic array from page 53, where the cost of writing an element is 1 and the
cost of doubling and copying from capacity m to 2m is κm. Define the potential function

    Φ = κ [ 2 × (num. items in array) − (capacity of array) ]+

(Notation: [x]+ means max(0, x).) This is clearly a valid potential function since it’s ≥ 0
and = 0 at the initial empty state.
Let’s run through a sequence of append operations, to get a feel for how this potential
function behaves. We’ll annotate each operation with its true cost c and its amortized cost
c + ∆Φ.
initially empty
Φ=0
append: c = 1, c + ∆Φ = 1 + κ
• Φ=κ
append with doubling: c = κ + 1, c + ∆Φ = 2κ + 1
• • Φ = 2κ
append with doubling: c = 2κ + 1, c + ∆Φ = 2κ + 1
••• Φ = 2κ
append: c = 1, c + ∆Φ = 2κ + 1
• • • • Φ = 4κ
append with doubling: c = 4κ + 1, c + ∆Φ = 2κ + 1
••••• Φ = 2κ
In our run-through, for each operation the amortized cost turned out to be c + ∆Φ ≤ 2κ + 1.
Now let’s write out the full argument, to show that this holds in general. There are two ways
that append could play out:
[Figure: Φ plotted against the number of appends. While the capacity is unchanged, each
append gives ∆Φ = 2κ; a doubling append takes the array from capacity m with m items
(Φ = κm) to capacity 2m with m + 1 items (Φ = 2κ).]
• It could be that the capacity needs to double from m to 2m. The true cost of this
type of append is c = κm + 1. The change in potential is ∆Φ = 2κ − κm. Thus the
amortized cost is c + ∆Φ = 2κ + 1.
• Or it could be that we don’t need to double capacity. The true cost of this type of
append is c = 1. The change in potential is ∆Φ = 2κ. Again, the amortized cost is
c + ∆Φ = 2κ + 1.
• There is only one corner case that we need to worry about, the very first append. For
the initial empty state the number of items (zero) is less than half the capacity (one),
and so the [·]+ in the definition of the potential function kicks in. For every other
state, the number of items is ≥ half the capacity, and so the [·]+ doesn’t kick in. We’ve
already calculated the amortized cost of the very first append in the run-through above,
and we found it to be κ + 1.
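The per-operation amortized costs can likewise be checked by simulation. This sketch recomputes Φ before and after each append under the same cost model; it is an illustration of the analysis, not course code.

```python
def amortized_costs(n, kappa=2.0):
    """For each of n appends to a doubling array, return its amortized
    cost c + ΔΦ, where Φ = kappa * max(0, 2*(num items) - capacity)."""
    phi = lambda size, cap: kappa * max(0, 2 * size - cap)
    capacity, size = 1, 0
    out = []
    for _ in range(n):
        before = phi(size, capacity)
        cost = 0.0
        if size == capacity:            # doubling append: true cost κm + 1
            cost += kappa * capacity
            capacity *= 2
        cost += 1                       # write the new element
        size += 1
        out.append(cost + phi(size, capacity) - before)
    return out
```

As the analysis predicts, the very first append has amortized cost κ + 1, and every subsequent append has amortized cost exactly 2κ + 1, whether or not it doubles.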
In practice, we’d usually write out the amortized analysis using sloppier notation:
Assume that the cost of append is O(1) if there’s no doubling needed, and O(n) if we need to
double an array with n items. Define the potential function Φ = 2n − m where n is the number
of items in the array and m is the capacity. There are two ways that append could play out:
• If there are n items and the capacity is n, we need to double the capacity. The true cost
is c = O(n) and the change in potential is ∆Φ = [2(n + 1) − 2n] − [2n − n] = 2 − n, so
the amortized cost is O(n) + 2 − n = O(1).
• Otherwise we don’t need to double the capacity. The true cost is c = O(1) and the
change in potential is ∆Φ = 2, so the amortized cost is O(1) + 2 = O(1).
Our Φ function isn’t quite a potential function—we should have Φ ≥ 0 everywhere, and Φ = 0
at the initial empty state, and this isn’t the case. But we can special-case Φ(empty) = 0,
and this is a modification at just a single state so the asymptotic results are still valid. We
conclude that in all cases the amortized cost of append is O(1).
□
It’s perfectly fine to use this sloppy way of writing, as long as you understand the real
reasoning behind the otherwise preposterous equation “c+∆Φ = O(n)+2−n = O(1)”. What
it really means is “The true cost is ≤ κn, for n sufficiently large, and the change in potential
is ∆Φ = 2 − n. We should really have defined a different potential function, by multiplying
by κ. If we had done that, then the amortized cost would be ≤ κn + 2 − κn = O(1).”
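To make the bookkeeping concrete, here is a small sketch (names invented, and κ taken to be 1 so that copying one item costs one unit) of a doubling array that tallies its true costs; the run checks that the aggregate true cost of n appends stays within the amortized bound (2κ + 1)n = 3n, and that the potential never goes negative:

```python
class DynamicArray:
    """Doubling array that tallies its true costs (kappa = 1: one unit per item copied)."""
    def __init__(self):
        self.capacity = 1
        self.items = []
        self.true_cost = 0    # running total of true costs

    def append(self, x):
        if len(self.items) == self.capacity:
            self.true_cost += self.capacity   # copy every item into the doubled array
            self.capacity *= 2
        self.items.append(x)
        self.true_cost += 1                   # write the new item

arr = DynamicArray()
n = 1000
for i in range(n):
    arr.append(i)
    phi = 2 * len(arr.items) - arr.capacity   # potential (kappa = 1)
    assert phi >= 0
# aggregate true cost <= aggregate amortized cost = (2*kappa + 1) * n = 3n
assert arr.true_cost <= 3 * n
```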
7.4 Potential functions 57
DESIGNING POTENTIAL FUNCTIONS
Where do potential functions come from? We can invent whatever potential function we like,
and different choices might lead to different amortized costs of operation. If we are cunning
in our choice of potential function, we’ll end up with informative amortized costs. There is
no universal recipe. Here are some suggestions:
• Sometimes we want to say “this operation may technically have worst case O(n) but
morally speaking it should be O(1)”, as with MinList.min on page 51 and with DynamicArray.append. To get amortized cost O(1), the potential has to build up to n before
the expensive operation, and drop to 0 after—or, to be precise, build up to Ω(n) and
drop to O(1). Think of Φ as storing up credit for the cleanup we’re going to have to
do.
• More generally, think of Φ as measuring the amount of stored-up mess. If an operation
increases mess (∆Φ > 0) then this mess will have to be cleaned up eventually, so we
set the amortized cost to be larger than the true cost, to store up credit. An operation
that does cleanup will decrease mess (∆Φ < 0), which can cancel out the true cost of
the cleanup operation.
• When you invent a potential function, make sure that Φ ≥ 0 and that Φ = 0 for the
initial empty data structure. Otherwise you’ll end up with spurious amortized costs.
This is a common source of errors! (See the example sheet.)
The goal of designing a potential function is to obtain useful amortized costs, and the goal
of amortized analysis is to get tight bounds on the worst-case performance of a sequence of
operations on a data structure. So, to figure out if our potential function is well-designed,
we should see if it gives tight bounds.
For example, suppose we’ve used our potential function to prove that “operation popmin
has amortized cost O(log N ) for a data structure containing ≤ N items”. This proves that
the aggregate cost of m calls to popmin is O(m log N ).15 To check whether this bound is
tight, we should look for a concrete example of a sequence of m operations that has cost
Ω(m log N ). Typically, m will grow with N . If we can find such a lower bound, we know that
the amortized costs are tight. If on the other hand we had chosen a daft potential function,
we’d end up with a Ω–O gap.
Exercise.
Consider the MinList data structure from page 51, implemented using the Stage 1 method.
By designing a suitable potential function, show that append and min both have amor-
tized cost O(1).
Each append creates ‘mess’ in the form of values that will need to be trawled through by the
next call to min. So we want the potential to increase on each call to append, enough to ‘pay
for’ the work that the next min will do. So let’s define

Φ = number of items appended since the last call to min.
Then the amortized cost of append is c + ∆Φ = O(1) + 1 = O(1), and the amortized cost of
min is O(L) + (0 − L) = O(1), where L is the number of items that it has to process. Thus
both of these operations have amortized cost O(1).
□
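As one hypothetical rendering of the Stage 1 method (the cost conventions are assumptions: append costs 1, min costs the number of items it trawls), we can track the true costs and the potential explicitly and check the invariant that true cost plus potential never exceeds twice the number of operations:

```python
import random

class MinList:
    """Sketch of the Stage 1 MinList: append is lazy, min does the trawling."""
    def __init__(self):
        self.pending = []      # items appended since the last call to min
        self.best = None       # minimum over all items seen by earlier calls to min
        self.true_cost = 0

    def append(self, x):
        self.pending.append(x)
        self.true_cost += 1                  # true cost O(1); potential rises by 1

    def min(self):
        self.true_cost += len(self.pending)  # true cost = L, the items to trawl
        for x in self.pending:
            if self.best is None or x < self.best:
                self.best = x
        self.pending = []                    # potential drops from L back to 0
        return self.best

random.seed(1)
ml, ops = MinList(), 0
for _ in range(1000):
    ops += 1
    if random.random() < 0.7:
        ml.append(random.randrange(100))
    else:
        ml.min()
    phi = len(ml.pending)
    # amortized cost is at most 2 per operation, so:
    assert ml.true_cost + phi <= 2 * ops
```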
15 This is an asymptotic statement in N , and m is allowed to depend on N . Remember that the fundamental
inequality of amortized analysis, the inequality on page 53, is required to hold for every sequence of operations.
58 7.4 Potential functions
PROOF OF THE ‘POTENTIAL THEOREM’
The whole analysis using potential functions rests on the ‘potential theorem’. Let’s restate
it, and give a proof.
Theorem (the ‘potential theorem’). Let Φ be a potential function. For an operation that takes
the state from S_ante to S_post , and that has true cost c, define the modified cost to be
c′ = c + Φ(S_post ) − Φ(S_ante ).
The modified costs defined in this way are valid amortized costs. In other words they satisfy
the fundamental inequality of amortized analysis: for any sequence of operations, taking the
data structure through states S0 , S1 , . . . , Sk , where the true costs are c1 , c2 , . . . , ck ,

aggregate amortized cost
= {−Φ(S0 ) + c1 + Φ(S1 )} + {−Φ(S1 ) + c2 + Φ(S2 )}
+ · · · + {−Φ(Sk−1 ) + ck + Φ(Sk )}
= c1 + · · · + ck − Φ(S0 ) + Φ(Sk )
= aggregate true cost − Φ(S0 ) + Φ(Sk )

hence, since Φ(S0 ) = 0 and Φ(Sk ) ≥ 0,

aggregate amortized cost ≥ aggregate true cost.

(Margin note: this ‘telescoping sum’ trick also appeared in the analysis of Johnson’s algorithm,
section 5.8 page 25.)

Thus the costs c′ that we defined are indeed valid amortized costs. □
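The telescoping identity is easy to check numerically. This small sketch (not from the notes) draws an arbitrary run of true costs and potentials, with Φ(S0) = 0 and Φ ≥ 0 throughout, and confirms both the telescoping sum and the fundamental inequality:

```python
import random

random.seed(0)
k = 20
true_costs = [random.randrange(1, 10) for _ in range(k)]
# potentials Phi(S0), ..., Phi(Sk): zero initially, non-negative everywhere
phis = [0] + [random.randrange(0, 10) for _ in range(k)]

# modified cost c'_i = c_i + Phi(S_i) - Phi(S_{i-1})
modified = [c + phis[i + 1] - phis[i] for i, c in enumerate(true_costs)]

# the sum telescopes: aggregate modified = aggregate true - Phi(S0) + Phi(Sk)
assert sum(modified) == sum(true_costs) - phis[0] + phis[-1]
# and since Phi(S0) = 0 <= Phi(Sk), the modified costs are valid amortized costs
assert sum(modified) >= sum(true_costs)
```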
7.5 Three priority queues
# Sometimes we also include methods for:
#   merge two priority queues
#   delete a value
#   peek at the item with smallest key, without removing it
It’s useful to review the two implementations that we’ve already seen. We’ll also look at a
third very simple implementation, the linked-list implementation, as a thought experiment
to sharpen our thinking and to highlight where there’s room for improvement. This table
summarizes the running times for those three implementations, as well as for the Fibonacci
Heap. In this table N is the maximum number of items in the heap.16
BINARY HEAP*

*This section of notes is a recap of section 4.8.1.

A binary heap is an almost-full binary tree (i.e. every level except the bottom is full), so its
height is ⌊log2 n⌋ where n is the number of elements. It satisfies the heap property (each
node’s key is ≤ its children’s), so the minimum of the entire heap can be found at the root.

(Margin note: ⌊x⌋ is the floor of x, i.e. ⌊x⌋ ≤ x < ⌊x⌋ + 1.)
To implement popmin we extract the root item, replace the root by the end element, and then
bubble it far enough down so as to satisfy the heap property. The number of bubble-down
steps is limited by the height of the tree, so popmin is O(log n).
16 For naive worst-case analysis, N is simply the number of items in the heap when we do the operation.
But amortized analysis applies to sequences of operations, and the number of items will fluctuate over those
operations, and so we have to instead define N as “upper bound on the number of items in the heap over the
sequence of operations in question”.
[Figure: popmin on a binary heap — extract the root, move the end element to the root, then bubble it down until the heap property holds.]
To implement push we append the new item to the very end, and then bubble it far enough
up the tree so as to satisfy the heap property. Again, the number of bubble-up steps is
O(log n). And decreasekey is very similar.
[Figure: push on a binary heap — append the new item at the very end of the tree, then bubble it up until the heap property holds.]
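A minimal array-based sketch of the binary heap just described (array indices stand in for tree pointers: the children of a[i] live at a[2i+1] and a[2i+2]):

```python
class BinaryHeap:
    def __init__(self):
        self.a = []   # a[i]'s children live at a[2i+1] and a[2i+2]

    def push(self, key):
        self.a.append(key)                 # append at the very end
        i = len(self.a) - 1
        while i > 0 and self.a[(i - 1) // 2] > self.a[i]:   # bubble up
            self.a[i], self.a[(i - 1) // 2] = self.a[(i - 1) // 2], self.a[i]
            i = (i - 1) // 2

    def popmin(self):
        root = self.a[0]
        self.a[0] = self.a[-1]             # replace the root by the end element
        self.a.pop()
        i = 0
        while True:                        # bubble down, at most O(log n) steps
            smallest = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < len(self.a) and self.a[c] < self.a[smallest]:
                    smallest = c
            if smallest == i:
                break
            self.a[i], self.a[smallest] = self.a[smallest], self.a[i]
            i = smallest
        return root

h = BinaryHeap()
for k in [6, 1, 12, 7, 3, 9, 5]:
    h.push(k)
assert [h.popmin() for _ in range(7)] == [1, 3, 5, 6, 7, 9, 12]
```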
BINOMIAL HEAP*

*This section of notes is a recap of section 4.8.2.

A binomial tree of degree 0 is a single node. A binomial tree of degree k is a tree obtained by combining two binomial trees
single node. A binomial tree of degree k is a tree obtained by combining two binomial trees
of degree k − 1, by appending one of the trees to the root of the other.
[Figure: binomial trees of degree 0, 1, 2 and 3.]
A binomial heap is a collection of binomial trees, at most one for each tree degree, each
obeying the heap property i.e. each node’s key is ≤ those of its children. Here is a binomial
heap consisting of one binomial tree of degree 0, and one of degree 3. (The dotted parts in
the middle indicate ‘there is no tree of degree 1 or 2’.)
[Figure: a binomial heap consisting of one binomial tree of degree 0 and one of degree 3.]
LINKED LIST PRIORITY QUEUE
Here’s a very simple priority queue. It uses a doubly-linked list to store all the items, and it
also keeps a pointer to the smallest item.
[Figure: items stored in a doubly-linked list, with first pointing to the head of the list and minitem pointing to the smallest item.]
push(v, k) is O(1): just attach the new item to the front of the list, and if k < minitem.key
then update minitem.

popmin() is O(n): we remove and return the item that minitem points to, but then we have
to scan through the entire list to find the new minimum.
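A sketch of this priority queue (a Python list stands in for the doubly-linked list, which changes constants but not the trade-off: O(1) push, O(n) rescan to restore minitem):

```python
class ListPQ:
    """Sketch of the linked-list priority queue; a Python list stands in
    for the doubly-linked list of (value, key) pairs."""
    def __init__(self):
        self.items = []        # (value, key) pairs in arbitrary order
        self.minitem = None    # the (value, key) pair with smallest key

    def push(self, v, k):      # O(1)
        self.items.append((v, k))
        if self.minitem is None or k < self.minitem[1]:
            self.minitem = (v, k)

    def popmin(self):          # O(n): rescan the list to find the new minimum
        v, k = self.minitem
        self.items.remove((v, k))
        self.minitem = min(self.items, key=lambda p: p[1]) if self.items else None
        return v

pq = ListPQ()
for v, k in [('a', 3), ('b', 1), ('c', 2)]:
    pq.push(v, k)
assert [pq.popmin() for _ in range(3)] == ['b', 'c', 'a']
```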
7.6 Fibonacci heap

GENERAL IDEA
Here is an outline of the thinking that led to the Fibonacci heap.
1. On a graph with V vertices and E edges, Dijkstra’s algorithm might make V calls to
popmin, and E calls to push and/or decreasekey. Since E might be as big as Ω(V²),
push and decreasekey are the common operations, so we want them to be O(1).
2. To make push and decreasekey be O(1), they have to be lazy and they should only
‘touch’ a small part of the data structure. The linked list implementation shows us one
way we can make push be lazy — just dump new nodes into a list, with no further
tidying. Nor should decreasekey do any tidying — if decreasing a key leads to a heap
violation, then the offending node should just be dumped into a list, with no further
tidying.
The binomial heap actually uses a ‘binary counter’ structure so that, most of the time,
push only needs to touch a few small trees, and this gives amortized cost O(1). The
Fibonacci heap uses the same trick, but inside popmin rather than push.
3. In a heap, a call to popmin extracts the root of a tree, and so all of its children need
to be processed. Thus popmin has complexity Ω(d) where d is the number of children,
also called the degree.
4. We don’t want our heap to have any wide shallow trees — they would require popmin
to do a lot of work, just as bad as the simple linked list implementation. We need to
limit the degree of a node in the heap. We need a mechanism to ensure that the trees
are deep and bushy, as they are in a binomial tree.
HOW TO PUSH AND POPMIN
The Fibonacci heap, like the binomial heap, stores a list of heaps. Unlike the binomial heap,
the trees can have any shape. Like the linked list priority queue, we’ll keep track of minroot,
the smallest element in the data structure, which must of course be the root of one of the
heaps.
1   # Maintain a list of heaps (i.e. store a pointer to the root of each heap)
2   roots = []
3
4   # Maintain a pointer to the smallest root
5   minroot = None
6
7   def push(v, k):
8       create a new heap h consisting of a single item (v, k)
9       add h to the list of roots
10      update minroot if k < minroot.key
11
12  def popmin():
13      take note of minroot.value and minroot.key
14      delete the minroot node, and promote its children to be roots
15      # cleanup the roots
16      while there are two roots with the same degree:
17          merge those two roots, by making the larger root a child of the smaller
18      update minroot to point to the smallest root
19      return the value and key from line 13
[Figure: popmin in action — (1) it extracts the minroot, (2) promotes the minroot’s children to be roots, (3a–3c) repeatedly merges trees of equal degree, and finally updates minroot.]
In this simple version, with only push and popmin, one can show by induction that the
Fibonacci heap consists at all times of a collection of binomial trees, and that after the
cleanup in lines 16–17 it is a binomial heap. (See the example sheet.)
It doesn’t matter how the cleanup is implemented, as long as it is done efficiently. Here is
an example implementation.
20  def cleanup(roots):
21      root_array = [None, None, ...]   # empty array
22      for each tree t in roots:
23          x = t
24          while root_array[x.degree] is not None:
25              u = root_array[x.degree]
26              root_array[x.degree] = None
27              x = merge(x, u)
28          root_array[x.degree] = x
29      return list of non-None values in root_array
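Here is one way the pseudocode above might be turned into runnable code. It is a sketch with only push and popmin, so no parent pointers or loser flags are needed yet, and the cleanup uses a dictionary keyed by degree rather than a fixed-size array:

```python
class Node:
    def __init__(self, value, key):
        self.value, self.key = value, key
        self.children = []

    @property
    def degree(self):
        return len(self.children)

class FibHeap:
    def __init__(self):
        self.roots = []
        self.minroot = None

    def push(self, v, k):
        h = Node(v, k)                   # a new heap with a single item
        self.roots.append(h)
        if self.minroot is None or k < self.minroot.key:
            self.minroot = h

    def popmin(self):
        v, k = self.minroot.value, self.minroot.key
        self.roots.remove(self.minroot)
        self.roots.extend(self.minroot.children)   # promote children to roots
        self.roots = self._cleanup(self.roots)
        self.minroot = min(self.roots, key=lambda r: r.key) if self.roots else None
        return (v, k)

    def _cleanup(self, roots):
        root_array = {}                  # degree -> the one tree of that degree
        for t in roots:
            x = t
            while x.degree in root_array:
                u = root_array.pop(x.degree)
                if u.key < x.key:        # make the larger root a child of the smaller
                    x, u = u, x
                x.children.append(u)
            root_array[x.degree] = x
        return list(root_array.values())

fh = FibHeap()
keys = [5, 3, 8, 1, 9, 2, 7, 6, 4, 0]
for k in keys:
    fh.push(str(k), k)
assert [fh.popmin()[1] for _ in range(len(keys))] == sorted(keys)
```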
HOW TO DECREASEKEY
If we can decrease the key of an item in-place (i.e. if its parent is still ≤ the new key), then
that’s all that decreasekey needs to do. If however the node’s new key is smaller than its
parent, we need to do something to maintain the heap. We’ve already discussed why it’s a
reasonable idea to be lazy—to just cut such a node out of its tree and dump it into the root
list, to be cleaned up in the next call to popmin.
There is however one extra twist. If we just cut out nodes and dump them in the root
list, we might end up with trees that are shallow and wide, even as big as Ω(n), where n is
the number of items in the heap. This would make popmin very costly, since it has to iterate
through all minroot’s children.
To make popmin reasonably fast, we need to keep the maximum degree small. The Fibonacci
heap achieves this via two rules:

1. Each node keeps track of whether it is a ‘loser’, i.e. whether it has lost a child since
the last time it became a child of another node.
2. When a loser loses a second child, it is itself cut out of its tree and dumped into the
root list (and its loser flag is reset).

This ensures that the trees end up with a good number of descendants. Formally, we’ll
show in Section 7.8 that a tree with degree d contains ≥ 1.618^d nodes, and hence that the
maximum degree in a heap of n items is O(log n).

(Margin note: similarly, a binomial tree of degree k has 2^k nodes, which implies a binomial
heap of n items has maximum degree O(log n).)
[Figure: three decreasekey examples. First, decreasekey from 5 to 3: the decrease can be done in place, since the heap property still holds. Second, decreasekey from 7 to 1: the new key is smaller than its parent’s, so the node is cut out and moved to the root list, and its parent is marked as a loser. Third, the parent has now lost two children, so the double-loser is also moved to the root list.]
Here is another example of the operation of decreasekey, this time highlighting what happens
if there are multiple loser ancestors.
[Figure: decreasekey when there are multiple loser ancestors — the heap violator is cut out and moved to the root list, and then each double-loser ancestor is cut out and moved to the root list in turn.]
7.7 Implementing the Fibonacci heap∗
There’s a question that gets asked every year when students learn about the Fibonacci heap:
“How do we find the node that we want to call decreasekey on? Do we trawl through the
entire heap? Isn’t this O(N )?”
The mental model behind this question is as follows. Suppose we have a Fibonacci
heap in which each node has a pointer to a parent node (except for root nodes which don’t
have parents) as well as to child nodes; and suppose each node also contains a payload, for
example an object representing a graph vertex. If we’re given a graph vertex w and we want
to call decreasekey we’d have to first find the Fibonacci heap node that contains it, and then
perhaps rewire the Fibonacci heap’s parent/child pointers.
def dijkstra(g, s):
    ...
    toexplore = PriorityQueue()
    toexplore.push(s, key=0)
    while not toexplore.isempty():
        v = toexplore.popmin()
        for (w, edgecost) in v.neighbours:
            ...
            # we may do one of these two:
            toexplore.push(w)
            toexplore.decreasekey(w)

[Figure: a Fibonacci heap node, with a parent pointer, children pointers, and a payload w.]
There’s something fishy about this question. There are two competing worldviews here—one
view says that vertices are nodes in a Fibonacci heap, the other that they are objects in a
graph—and why should one worldview take precedence over the other?17
[Figure: on the left, vertices a–e in a graph, with arrows showing graph-neighbours; on the right, the same vertices, with edges showing heap-relationships.]
Here’s a more thoughtful way to formulate the problem: When we use the Fibonacci heap
as the priority queue in Dijkstra’s algorithm, each vertex is doing double duty: it is at once
both a vertex in the graph and also a node in the Fibonacci heap. They are participating in
two data structures simultaneously. How should we implement this?
JAVANILE SOLUTION
The naive solution is to just throw everything into a single class, as in the code below. Every
VertexNode object does double duty. We don’t need to look up the Fibonacci heap node that
contains a graph vertex w, because the node is the vertex.
class VertexNode:
    # used by Dijkstra:
    List<VertexNode> graph_neighbours
    float distance
17 In the five years I’ve been teaching this material, every year students have asked “how do you find the
node you want to call decreasekey on?”. But not once have they asked the question the other way round—not
once have they started from the mental model “each vertex object contains a payload that is a node object”
and asked the obvious question “When we call popmin, and find the minroot node, how do we find the vertex
object that contains it? Do we trawl through the entire graph? Isn’t this O(V )?”
    VertexNode come_from
    # used by FibHeap:
    VertexNode fib_parent
    List<VertexNode> fib_children
    float key
    bool is_loser
This is an ugly solution because it ties the two data structures together. All the library
code for the Fibonacci heap is tied to this particular class, and so it isn’t reusable by other
programmers who aren’t interested in graphs and just want their own priority queue. Can
we tease the two data structures apart without paying a price?
EXPLOSION IN THE CLASS FACTORY
Here’s an implementation for programmers who haven’t outgrown an infatuation with classes
and templates. We’ll let Vertex represent a vertex in the graph, Node represent a node in
the Fibonacci heap, and we’ll declare that each Vertex uses a Node to store its heap-related
pointers.
[Figure: the example graph, where each Vertex object contains a Node object holding its heap-related pointers.]
The heap-related pointers stored by Node have to point to Vertex objects, rather than to Node
objects, since otherwise we’re stuck with the problem “how do I find the Vertex object for a
given Node?” But we want Node to be general purpose, not tied to a graph. The solution is
to use interfaces and templates, to allow the Fibonacci heap routines to ‘see through’ a Vertex
to get to its Node without needing to know anything about how Vertex is implemented.
class FibHeap<T implements Nodeable<T>>:
    List<T> roots
    def T popmin(): ...
    def push(T value, float key): ...
    def decrease_key(T value, float newkey): ...
class Graph:
    class Vertex implements FibHeap.Nodeable<Vertex>:
        List<Pair<Vertex, float>> neighbours
        float distance
        Vertex? come_from
        FibHeap.Node<Vertex> pqn
        FibHeap.Node<Vertex> get_fib_node(): return pqn

    def compute_shortest_paths_from(Vertex s):
        ...
DYNAMIC ARCHITECTURE
A cleaner approach is to label each vertex by an id, and to use two hash tables: one to look
up a vertex given its id, and another to look up a heap node given the id. (Often there is a
natural id to use for each vertex, for example, the node id in an OpenStreetMap graph, or
the primary key of the graph as stored in a database.) The interfaces for both the Fibonacci
heap and the graph are defined in terms of ids.
[Figure: two hash tables — vertices maps each id ('a' to 'e') to its Vertex object, and nodes maps each id to its heap node.]
The advantage of this architecture is that it keeps the code for graph traversal entirely
separate from that for the priority queue. This would be especially suitable if we start with
a graph and we don’t know ahead of time which algorithms we’ll need to run on it; in this
scenario it doesn’t make sense to have a Vertex class that embodies a particular storage
requirement.
class FibHeap:
    HashMap<Id, FibHeapNode> nodes

    class FibHeapNode:
        Id id
        float key
        int degree
        FibHeapNode parent
        List<FibHeapNode> children

    List<FibHeapNode> roots
    def Id popmin(): ...
    def push(Id node, float key): ...
    def decrease_key(Id node, float newkey): ...

[Figure: the same example graph and heap as before, now addressed by vertex ids.]
class Graph:
    HashMap<Id, Vertex> vertices

    class Vertex:
        Id id
        List<Pair<Vertex, float>> neighbours
        float distance
        Vertex? come_from

    def compute_shortest_paths_from(Id s):
        ...
NITTY GRITTY
It’s worth thinking in a little more detail about how exactly to store the parent and children
pointers of each node in the Fibonacci heap. We have to perform various slicing and
rearranging operations on these pointers—and it would be silly to put in lots of effort on
a very clever amortized design and then waste it all with an inefficient implementation! The
manipulations we want to perform are:
• slice a node out of a tree in O(1)
• add a node to the root list in O(1)
• merge two trees in O(1)
• iterate through a root’s children in O(num. children)
These can all be achieved by keeping enough pointers around. We can use a circular doubly-
linked list for the root list; and the same for each list of siblings; and we’ll let each node point
to its parent; and each parent will point to one of its children. So, for example, to iterate
through a node’s children, we first follow the down-pointer to get to one of the children, then
we follow the sibling pointers around until they bring us back to the initial child.
[Figure: each node stores its key, degree and is_loser flag, a pointer to its parent, sibling pointers forming a circular doubly-linked list, and a pointer to one of its children; followed by an example heap and its pointer representation.]
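As a sketch of the pointer gymnastics (not the notes’ own code), here is a circular doubly-linked list of the kind the sibling and root lists need, showing that splicing a node out and inserting one are both O(1):

```python
class DNode:
    """Node in a circular doubly-linked list (as used for sibling/root lists)."""
    def __init__(self, key):
        self.key = key
        self.prev = self.next = self   # a singleton list is a one-node cycle

def splice_out(x):        # O(1): remove x from whatever list it is in
    x.prev.next = x.next
    x.next.prev = x.prev
    x.prev = x.next = x   # x becomes a singleton again

def insert_after(a, x):   # O(1): insert singleton x just after a
    x.prev, x.next = a, a.next
    a.next.prev = x
    a.next = x

def iterate(start):       # O(length): walk the cycle once, back to the start
    out, x = [start.key], start.next
    while x is not start:
        out.append(x.key)
        x = x.next
    return out

a, b, c = DNode('a'), DNode('b'), DNode('c')
insert_after(a, b)        # a <-> b
insert_after(b, c)        # a <-> b <-> c
assert iterate(a) == ['a', 'b', 'c']
splice_out(b)             # O(1) removal, as decreasekey needs
assert iterate(a) == ['a', 'c']
assert iterate(b) == ['b']
```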
7.8 Analysis of Fibonacci heap
BOUNDING THE SHAPE OF THE TREES
The amortized cost of popmin is O(dmax ), where dmax is the maximum number of children of
any of the nodes in the heap. The peculiar mechanism of decreasekey was designed to keep
dmax small. How small?
Theorem (Fibonacci shape theorem). If a node in a Fibonacci heap has d children, then the
subtree rooted at that node has ≥ Fd+2 nodes, where F1 , F2 , . . . are the Fibonacci numbers.

(Margin note: F1 = 1, F2 = 1, F3 = 2, F4 = 3, F5 = 5, . . .. The general formula for Fn is
(ϕ^n − (−ϕ)^−n )/√5.)

It’s a mathematical fact from linear algebra that Fd+2 ≥ ϕ^d where ϕ is the golden ratio,
ϕ = (1 + √5)/2 ≈ 1.618. It’s a simple exercise (left to the example sheet) to deduce from
this fact and the theorem that dmax = O(log n).
Proof (Fibonacci shape theorem). Consider an arbitrary node x in a Fibonacci heap, at some
point in execution, and suppose it has d children, call them y1 , . . . , yd in the order of when
they last became children of x. (There may be other children that x acquired then lost in
the meantime, but we’re not including those.)
[Figure: node x with children y1 , . . . , yd , in the order in which they became children of x.]
When x acquired y2 , x already had y1 as a child, so y2 must have had ≥ 1 child seeing
as it got merged into x. Similarly, when x acquired y3 , y3 must have had ≥ 2 children, and
so on. After x acquired a child yi , that child might have lost a child, but it can’t have lost
more because of the rules of decreasekey. Thus, at the point of execution at which we’re
inspecting x,
y1 has ≥ 0 children
y2 has ≥ 0 children
y3 has ≥ 1 child, . . .
yd has ≥ d − 2 children.
Now for some pure maths. Consider an arbitrary tree all of whose nodes obey the
grandchild rule “a node with children i = 1, . . . , d has at least i − 2 grandchildren via child
i”. Let Nd be the smallest possible number of nodes in a subtree whose root has d children.
Then
Nd = Nd−2 + Nd−3 + · · · + N0 + N0 + 1,

where the terms count, respectively, the subtrees under child d, child d − 1, . . . , child 2,
child 1, and finally the root itself.
Substituting in Nd−1 , we get Nd = Nd−2 + Nd−1 , the defining equation for the Fibonacci
sequence, hence Nd = Fd+2 .
We’ve shown that the nodes in a Fibonacci heap obey the grandchild rule, therefore
the number of nodes in the subtree rooted at x is ≥ Fd+2 where d is the number of children
of x. □
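The recurrence and the claimed bounds are easy to sanity-check numerically (a quick sketch, not part of the proof):

```python
# smallest subtree sizes under the grandchild rule:
# N_d = (N_0 + N_1 + ... + N_{d-2}) + N_0 + 1, with N_0 = 1
N = [1]
for d in range(1, 15):
    N.append(sum(N[:d - 1]) + N[0] + 1)

# Fibonacci numbers F_1, F_2, ... = 1, 1, 2, 3, 5, ... (F[0] holds F_1)
F = [1, 1]
while len(F) < 20:
    F.append(F[-1] + F[-2])

phi = (1 + 5 ** 0.5) / 2
for d in range(15):
    assert N[d] == F[d + 1]                    # N_d = F_{d+2}
    assert F[d + 1] >= phi ** d * (1 - 1e-9)   # F_{d+2} >= phi^d
```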
7.9 Disjoint sets
AbstractDataType DisjointSet:
    # Holds a dynamic collection of disjoint sets
    # add_singleton(item): add a new set containing just this item
    # get_set_with(item): return a handle for the set containing item
    # merge(s1, s2): merge the two sets with these handles
IMPLEMENTATION 1: FLAT FOREST
[Figure: a flat forest before and after merge.]
To make get_set_with fast, we could make each item point to its set’s handle.
get_set_with() is just a single lookup.
merge() needs to iterate through each item in one or other set, and update its pointer. This
takes O(n) time, where n is the number of items in the DisjointSet.
To be able to iterate through the items, we could store each set as a linked list:
[Figure: each set stored as a linked list, before and after merge.]
A smarter way to merge is to keep track of the size of each set, and pick the smaller set to
update. This is called the weighted union heuristic. In the Example Sheet you’ll show that
the aggregate cost of any sequence of m operations on ≤ N elements (i.e. m operations of
which ≤ N are add_singleton) is O(m + N log N ), asymptotic in N .
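A sketch of the flat forest with the weighted union heuristic (the representation is an assumption: a dictionary from item to handle, plus a membership list per set, rather than literal pointers and linked lists):

```python
class FlatForest:
    """Flat forest with the weighted union heuristic."""
    def __init__(self):
        self.handle = {}    # item -> handle of the set containing it
        self.members = {}   # handle -> list of items in that set

    def add_singleton(self, item):
        self.handle[item] = item
        self.members[item] = [item]

    def get_set_with(self, item):   # O(1): a single lookup
        return self.handle[item]

    def merge(self, s1, s2):        # s1, s2 are distinct handles
        if len(self.members[s1]) < len(self.members[s2]):
            s1, s2 = s2, s1         # weighted union: repoint the smaller set
        for item in self.members[s2]:
            self.handle[item] = s1
        self.members[s1] += self.members.pop(s2)

ds = FlatForest()
for x in 'abcdef':
    ds.add_singleton(x)
ds.merge('a', 'b')
ds.merge('c', 'd')
ds.merge(ds.get_set_with('a'), ds.get_set_with('d'))
assert ds.get_set_with('b') == ds.get_set_with('c')
assert ds.get_set_with('e') != ds.get_set_with('a')
```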
IMPLEMENTATION 2: DEEP FOREST
[Figure: a deep forest before and after merge — one root is attached to the other.]
To make merge faster, we could skip all the work of updating the items in a set, and just
build a deeper tree.
merge() attaches one root to the other, which only requires updating a single pointer.
get_set_with() needs to walk up the tree to find the root. This takes O(h) time, where h is
the height of the tree.
To keep h small, we can use the same idea as for the flat forest: keep track of the rank of each
root (i.e. the height of its tree), and always attach the lower-rank root to the higher-rank. If
the two roots had ranks r1 and r2 then the resulting rank is max(r1 , r2 ) if r1 ̸= r2 , and r1 + 1
if r1 = r2 . This is called the union by rank heuristic. It can be shown that the aggregate
cost of any sequence of m operations on ≤ N elements is O(m log N ), asymptotic in N.

(Margin note: for the aggregate cost, see CLRS exercises 21.3-3 and 21.4-4.)
IMPLEMENTATION 3: LAZY FOREST
We’d like the forest to be flat so that get_set_with is fast, but we’d like to let it get deep so
that merge can be fast. Here’s a way to get the best of both worlds, inspired by the Fibonacci
heap—defer cleanup until you actually need the answer.
[Figure: lazy forest — merge just attaches one root to the other; get_set_with(•) walks up from • to the root, then flattens the path it walked.]
It can be shown that with the lazy forest the cost of m operations on ≤ N items is
O(mαN ) where αN is an integer-valued monotonically increasing sequence, related to the
Ackermann function, which grows extremely slowly:

αN = 0 for N = 0, 1, 2
αN = 1 for N = 3
αN = 2 for N = 4, 5, 6, 7
αN = 3 for 8 ≤ N ≤ 2047
αN = 4 for 2048 ≤ N ≤ 10^80, more than there are atoms in the observable universe.

For practical purposes, αN may be ignored in the O notation, and therefore the amortized
cost per operation is O(1).
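A sketch of the lazy forest: merge attaches one root to the other using union by rank, and get_set_with flattens the path it walked on the way up (this deferred cleanup is usually called path compression):

```python
class LazyForest:
    """Lazy forest: union by rank plus path compression."""
    def __init__(self):
        self.parent = {}
        self.rank = {}

    def add_singleton(self, item):
        self.parent[item] = item   # a root points to itself
        self.rank[item] = 0

    def get_set_with(self, item):
        root = item
        while self.parent[root] != root:   # walk up to the root
            root = self.parent[root]
        while self.parent[item] != root:   # then flatten the path (the cleanup)
            self.parent[item], item = root, self.parent[item]
        return root

    def merge(self, s1, s2):               # s1, s2 are distinct roots
        if self.rank[s1] < self.rank[s2]:
            s1, s2 = s2, s1                # attach the lower rank under the higher
        self.parent[s2] = s1
        if self.rank[s1] == self.rank[s2]:
            self.rank[s1] += 1

ds = LazyForest()
for x in range(10):
    ds.add_singleton(x)
for x in range(9):
    ds.merge(ds.get_set_with(x), ds.get_set_with(x + 1))
assert all(ds.get_set_with(x) == ds.get_set_with(0) for x in range(10))
```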
The forward planner strategy means keeping your working, or at least some of it, in
anticipation of later operations. Here’s another much simpler illustration of the general idea.
Example. Suppose we have four objects

a (val = 2), b (val = 5), c (val = 3), d (val = 0)

and we want to sort them by value, lowest value first. Suppose we run selection sort:
• First, find the smallest. We need to make some comparisons. We’ll find that a.val <
b.val, then a.val < c.val, then a.val > d.val, and conclude that d is the smallest.
• Next, find the second-smallest. We’ll end up repeating some of the comparisons we’ve
already made: we’ll find a.val < b.val, then a.val < c.val, and conclude that a is the
second smallest.
But this is a waste! We’ve already made those comparisons. We should have found a way
to retain our findings from the first pass, for example by using a heap, so we don’t duplicate
the effort in the second pass. ♢
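For instance, Python’s heapq can serve as the ‘retained working’: heapify makes the first pass of comparisons once, and each later extraction reuses that structure rather than re-comparing from scratch.

```python
import heapq

items = [('a', 2), ('b', 5), ('c', 3), ('d', 0)]
h = [(val, name) for name, val in items]
heapq.heapify(h)   # first pass: the comparisons are retained as heap order
order = [heapq.heappop(h)[1] for _ in range(len(h))]   # each pop is O(log n)
assert order == ['d', 'a', 'c', 'b']
```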
This is a trivial example. There’s no general recipe for how to implement the forward planner
strategy—it’s an art to work out how much working to keep, and how to keep it without
creating extra bookkeeping work for ourselves.
19 The binomial heap is also a ‘forward planner’ algorithm, it’s just not gone all the way with forward