DisOpt409 Spring2020
Spring 2020
Thomas Rothvoss
Contents
3 Shortest paths
  3.1 Dijkstra's algorithm
  3.2 The Moore-Bellman-Ford Algorithm
  3.3 Detecting negative cycles
4 Network flows
  4.1 The Ford-Fulkerson algorithm
  4.2 Min Cut problem
  4.3 The Edmonds-Karp algorithm
    4.3.1 Some remarks
  4.4 Application to bipartite matching
    4.4.1 Kőnig's Theorem
    4.4.2 Hall's Theorem
5 Linear programming
  5.1 Separation, Duality and Farkas Lemma
  5.2 Algorithms for linear programming
    5.2.1 The simplex method
    5.2.2 Interior point methods and the Ellipsoid method
    5.2.3 Solving linear programs via gradient descent
  5.3 Connection to discrete optimization
  5.4 Integer programs and integer hull
6 Total unimodularity
  6.1 Application to bipartite matching
  6.2 Application to flows
  6.3 Application to interval scheduling
8 Non-bipartite matching
  8.1 Augmenting paths
  8.2 Computing augmenting paths
  8.3 Contracting odd cycles
Chapter 1

Introduction
Roughly speaking, discrete optimization deals with finding the best solution out of a finite number of possibilities in a computationally efficient way. Typically the number of possible solutions is larger than the number of atoms in the universe, hence instead of mindlessly trying out all of them, we have to come up with insights into the problem structure in order to succeed. In this class, we plan to study several classical and basic problems of this kind. The purpose of this class is to give a proof-based, formal introduction to the theory of discrete optimization.
Find Duplicate
Input: A list of numbers a1 , . . . , an ∈ Z
Goal: Decide whether some number appears at least twice in the list.
Obviously this is not a very interesting problem, but it will serve us well as an introductory example to bring us all on the same page. A straightforward algorithm to solve the problem is as follows:
(1) FOR i = 1 TO n DO
(2) FOR j = i + 1 TO n DO
(3) If ai = aj then return ”yes”
(4) Return “no”
The algorithm is stated in what is called pseudo code, which means it is not actually written in one of the common programming languages like Java, C, C++, Pascal or BASIC. On the other hand, it takes only small modifications to translate the algorithm into one of those languages. There are no consistent rules for what is allowed in pseudo code and what is not; the point of pseudo code is that it does not need to be machine-readable, but it should be human-readable. A good rule of thumb is that everything is allowed that one of the mentioned programming languages can also do.
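To illustrate this, here is one possible translation of the pseudo code into Python (the function name find_duplicate is our own choice):

```python
def find_duplicate(a):
    """Return True if some number appears at least twice in the list a."""
    n = len(a)
    for i in range(n):             # corresponds to step (1)
        for j in range(i + 1, n):  # corresponds to step (2)
            if a[i] == a[j]:       # corresponds to step (3)
                return True        # "yes"
    return False                   # step (4): "no"

print(find_duplicate([3, 1, 4, 1, 5]))  # True
print(find_duplicate([3, 1, 4, 5]))     # False
```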
In particular, we allow any of the following operations: addition, subtraction, multiplication,
division, comparisons, etc. Moreover, we allow the algorithm an infinite amount of memory (though
our little algorithm above only needed the two variables i and j).
The next question that we should discuss in the analysis of the algorithm is its running time, which we define as the number of elementary operations (such as adding, subtracting, comparing, etc.) that the algorithm makes. Since the variable i runs from 1 to n and j runs from j = i + 1 to n, step (3) is executed (n choose 2) = n(n−1)/2 many times. On the other hand, should we count only step (3) or shall we also count the FOR loops? And in the 2nd FOR loop, shall we only count one operation for the comparison or shall we also count the addition in i + 1? We see that it might be very tedious to determine the exact number of operations. On the other hand we probably agree that the running time is of the form Cn², where C is some constant that might be, say, 3 or 4 or 8 depending on what exactly we count as an elementary operation. Let us agree from now on that we only want to determine the running time up to constant factors.
As a side remark, there is a precisely defined formal computational model, which is called a Turing machine (interestingly, it was defined by Alan M. Turing in 1936 before the first computer was actually built). In particular, for any algorithm in the Turing machine model one can reduce
the running time by a constant factor while increasing the state space. An implication of this fact
is that running times in the Turing machine model are actually only well defined up to constant
factors. We take this as one more reason to be content with our decision of only determining running
times up to constant factors.
So, the outcome of our runtime analysis for the Find Duplicate algorithm is the following:
There is some constant C > 0 so that the Find Duplicate algorithm finishes after at most Cn² many operations. (1.1)
Observe that it was actually possible that the algorithm finishes much faster, namely if it finds a
match in step (3), so we are only interested in an upper bound. Note that the simple algorithm
that we found is not the most efficient one for deciding whether n numbers contain a duplicate. It
is actually possible to answer that question in time C ′ n log(n) using a sorting algorithm. If we want
to compare the running times Cn2 and C ′ n log(n), then we do not know which of the constants C
and C ′ is larger. So for small values of n, we don’t know which algorithm would be faster. But
lim_{n→∞} Cn² / (C′n log(n)) = ∞, hence if n is large enough the C′n log(n) algorithm would outperform the Cn² algorithm. Thus, we do consider the C′n log(n) algorithm as the more efficient one.
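For completeness, here is a sketch of the sorting-based approach (we simply rely on a built-in O(n log n) sort; after sorting, any duplicate must occupy two adjacent positions):

```python
def find_duplicate_fast(a):
    """Decide in O(n log n) time whether the list a contains a duplicate."""
    b = sorted(a)  # O(n log n) comparisons
    return any(b[i] == b[i + 1] for i in range(len(b) - 1))
```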
It is standard in computer science and operations research to abbreviate the claim from (1.1)
using the O-notation and replace it by the equivalent statement:

The Find Duplicate algorithm has a running time of O(n²). (1.2)
The formal definition of the O-notation is a little bit technical:
Definition 1. If f(n) and g(n) are two positive real-valued functions on N, the set of non-negative integers, we say that f(n) = O(g(n)) if there is a constant c > 0 such that f(n) ≤ c · g(n) for all n greater than some finite n₀.
It might suffice to just note that the statements (1.1) and (1.2) are equivalent and we would use
the latter for the sake of convenience.
Going once more back to the Find Duplicate algorithm, recall that the input consists of the numbers a1, ..., an. We say that the input length is n, which is the number of numbers in the input. For the performance of an algorithm, we always measure the running time with respect to the input length. In particular, the running time of O(n²) is bounded by a polynomial in the input length n, so we say that our algorithm has polynomial running time. Such polynomial time algorithms are considered efficient from the theoretical perspective². Formally, we say that any running time of the form n^C is polynomial, where C > 0 is a constant and n is the length of the input.

² This is true for theoretical considerations. For many practical applications, researchers actually try to come up with near-linear time algorithms and consider anything of the order n² as highly impractical.
For example, below we list a couple of possible running times and sort them according to their
asymptotic behavior:

100n ≪ n ln(n) ≪ n² ≪ n^10   (efficient)   ≪   2^√n ≪ 2^n ≪ n!   (inefficient)
To put this into context: the class P contains all problems that can be solved by a polynomial time algorithm. For many natural problems no such algorithm is known; one example is Partition, where given numbers a1, ..., an ∈ Z≥0 we need to decide whether {1, ..., n} can be split into two sets I1, I2 with ∑_{i∈I1} ai = ∑_{i∈I2} ai. To capture problems of this type, one defines a more general class: NP is the class of problems
that admit a non-deterministic polynomial time algorithm. Intuitively, it means that a problem
lies in NP if given a solution one is able to verify in polynomial time that this is indeed a solution.
For example for Partition, if somebody claims to us that for a given input a1, ..., an the answer is “yes”, then (s)he could simply give us the sets I1, I2. We could then check that they are indeed a partition of {1, ..., n} and compute the sums ∑_{i∈I1} ai and ∑_{i∈I2} ai and compare them. In other
words, the partition I1 , I2 is a computational proof that the answer is “yes” and the proof can
be verified in polynomial time. That is exactly what defines a problem in NP. Note that trivially,
P ⊆ NP.
We say that a problem P ∈ NP is NP-complete if with a polynomial time algorithm for P , one
could solve any other problem in NP in polynomial time. Intuitively, the NP-complete problems
are the hardest problems in NP. One of the 7 Millennium problems (with a $1,000,000 award) is to prove the conjecture that NP-complete problems do not have polynomial time algorithms (i.e. NP ≠ P). An incomplete overview of the complexity landscape (assuming that indeed NP ≠ P) is as follows:
[Figure: an incomplete overview of the complexity landscape. Inside P: shortest path, linear programming, matching, max flow, min cut. Among the NP-complete problems: partition, integer programming, TSP, SAT, max cut.]
From time to time we want to make some advanced remarks that actually exceed the scope of this lecture. Those kinds of remarks will be in a gray box labeled advanced remark. Those comments are not relevant for the exam, but they give some background information for the interested student.
Advanced remark:
Now with the notation of P and NP, we want to go back to how we determine the
running time. We used what is called the arithmetic model / RAM model where any
arithmetic operation like addition, multiplication etc, counts only one unit. On the other
hand, if we implement an algorithm using a Turing machine, then we need to encode all
numbers using bits (or with a constant number of symbols, which has the same effect up to
constant factors). If the numbers are large, we might need a lot of bits per number and we
might dramatically underestimate the running time on a Turing machine if we only count
the number of arithmetic operations. To be more concrete, consider the following (useless)
algorithm
(1) Set a := 2
(2) FOR i = 1 TO n DO
(3) Update a := a².
The algorithm only performs O(n) arithmetic operations. On the other hand, the variable at the end is a = 2^(2^n). In other words, we need 2^n bits to represent the result, which leaves an exponential gap between the number of operations in the arithmetic model and the bit
an exponential gap between the number of operations in the arithmetic model and the bit
model where we count each bit operation. It is even worse: One can solve NP-complete
problems using a polynomial number of arithmetic operations by creating numbers with
exponentially many bits and using them to do exponential work.
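As a small illustration (our own), one can watch the bit length of a grow exponentially while the number of arithmetic operations stays linear:

```python
a = 2
for i in range(1, 11):            # 10 arithmetic operations (squarings)
    a = a * a
    # after i squarings a = 2^(2^i), which needs 2^i + 1 bits
    print(i, a.bit_length())
```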
For a more formal accounting of running time, it is hence necessary to make sure that
the gap between the arithmetic model and the bit model is bounded by a polynomial in
the input length. For example it suffices to argue that all numbers are not more than single-exponentially large in the input. All algorithms that we consider in these lecture
notes will have that property, so we will not insist on doing that (sometimes tedious) part
of the analysis.
1.2 Basic Graph Theory
Since most of the problems that we will study in this course are related to graphs, we want to
review the basics of graph theory first.
1.2.1 Undirected graphs

• An undirected graph G = (V, E) is a pair of sets V and E where V is the set of vertices of G and E is the set of (undirected) edges of G. An edge e ∈ E is a set {i, j} where i, j ∈ V. We write G = (V(G), E(G)) if there are multiple graphs being considered.
[Figure: Graph G = (V, E) with V = {1, 2, 3, 4, 5} and E = {{1, 2}, {1, 3}, {2, 4}, {3, 4}, {3, 5}}]
While it is popular to draw the nodes of graphs as dots and the edges as lines or curves
between those dots, one should keep in mind that the position of the nodes has actually no
meaning.
• For a node set U ⊆ V, δ(U) := {{u, v} ∈ E | |U ∩ {u, v}| = 1} denotes the set of edges cut by U; for a single node v we abbreviate δ(v) := δ({v}).

[Figure: a node set U; the bold edges form the cut δ(U)]
• The degree of a node is the number of incident edges. We write deg(v) = |δ(v)|.
• If the edges of G are precisely the (|V| choose 2) pairs {i, j} for every pair i, j ∈ V, then G is the complete graph on V. The complete graph on n = |V| vertices is denoted as Kn.
• A subgraph of G = (V (G), E(G)) is a graph H = (V (H), E(H)) where V (H) ⊆ V (G) and
E(H) ⊆ E(G) with the restriction that if {i, j} ∈ E(H) then i, j ∈ V (H).
• If V ′ ⊆ V (G), then the subgraph induced by V ′ is the graph (V ′ , E(V ′ )) where E(V ′ ) is the
set of all edges in G for which both vertices are in V ′ .
• A subgraph H of G with V(H) = V(G) is called a spanning subgraph.
• Given G = (V, E) we can define subgraphs obtained by deletions of vertices or edges.
• A path is a graph P = (V, E) with distinct vertices V = {v0, v1, ..., vk} and edges E = {{v0, v1}, {v1, v2}, ..., {vk−1, vk}}. If we drop the requirement that vertices and edges be distinct, we speak of a walk.

[Figure: a path v0, v1, ..., vk, and a walk of length 7 in which some vertices and edges repeat]
• A cycle is a graph G = (V, E) where V = {v0 , v1 , . . . , vk−1 } and E = {{v0 , v1 }, {v1 , v2 }, . . . , {vk−1 , v0 }}
where v0 , . . . , vk−1 are distinct and k ≥ 3.
[Figure: a cycle with k = 5 vertices and edges]
• A forest is a graph that contains no cycle. A tree is a connected forest. A spanning tree of G is a subgraph T that is a tree with V(T) = V(G). A Hamiltonian circuit (or tour) of G is a cycle that contains all vertices of G.
• A set M ⊆ E of edges such that each vertex is incident to exactly one edge of M is called a perfect matching.
Convention: We defined paths / trees / spanning trees / cycles / Hamiltonian circuits as graphs,
i.e. as pairs H = (V (H), E(H)). But from the edge set E(H) we can recover the nodes. In
algorithms we will often just talk about the edges of those objects.
Advanced remark:
Looking at the literature one sees that the definitions of a “path” and a “cycle” are not standardized. Some texts admit revisiting nodes, others do not. We have followed the definitions of the book “Graph Theory” by Diestel.
1.2.2 Directed Graphs
• A directed graph G = (V, E) is a pair of sets V and E where V is the set of vertices of G and E is the set of (directed) edges of G. An edge e ∈ E is a tuple (i, j) where i, j ∈ V. One can imagine i as the “start node” and j as the “end node” of the edge.
[Figure: Graph G = (V, E) with V = {1, 2, 3, 4, 5} and E = {(1, 2), (1, 3), (4, 2), (3, 4), (3, 5), (5, 3)}]
In the literature, directed graphs are often called digraphs and edges in a directed graph are
often called arcs.
• A directed path is a directed graph P = (V, E) with distinct vertices V = {v0 , . . . , vk } and
edges E = {(v0 , v1 ), (v1 , v2 ), . . . , (vk−1 , vk )}. By a slight abuse of notation we will consider
the edge set E itself as the directed path.
[Figure: a directed path v0, v1, ..., vk]
A central example in discrete optimization is the traveling salesperson problem:

Traveling Salesperson Problem (TSP)
Input: Costs cij ≥ 0 for travelling between i and j, for every pair of nodes i ≠ j in a complete n-node graph
Goal: Find a tour (i.e. a Hamiltonian circuit) of minimum total cost.

[Figure: a small TSP instance with edge costs; the optimal tour is drawn in bold]
Note that this problem again has an input, which defines the concrete instance (here, the
concrete set of numbers cij ). Then the problem has an implicitly defined set of solutions (here, all
tours in an n-node graph). Usually, the problem also has an objective function that should be
optimized (here, minimize the sum of the edge cost in the tour).
The first trivial idea would be to simply enumerate all possible tours in the complete graph;
then for each tour we compute the cost and always remember the cheapest one that we found so far,
which we output at the end. If we ignore the overhead needed to list all tours, certainly the time is
dominated by the number of tours. Unfortunately, the number of tours in an n-node graph is
(n − 1) · (n − 2) · . . . · 2 · 1 = (n − 1)!
If we remember Stirling's formula n! ≈ √(2πn) · (n/e)^n, then roughly n! ≈ 2^(n log(n) − Θ(n)). For example if n = 50, then (n − 1)! ≈ 10^62 and if our computer can enumerate 10^9 tours per second, it would take ≈ 10^43 centuries to enumerate all of them. Obviously, this approach is not very practicable. But in fact,
we can solve the problem in a slightly smarter way (though not that much smarter).
The trick in the Held-Karp algorithm is to first compute the best subtours using dynamic programming. For each subset S ⊆ V and points i, j ∈ S we want to compute table entries

C(S, i, j) := cost of a cheapest path that starts in i, ends in j and visits every node of S exactly once.

Held-Karp algorithm
Input: Distances cij ≥ 0 in a complete n-node graph
Output: Cost of an optimum TSP tour
(1) C({i, j}, i, j) := cij for all i ≠ j
(2) FOR k = 3 TO n DO
(3)     For all S ⊆ V with |S| = k and all i, j ∈ S with i ≠ j set C(S, i, j) := min_{ℓ ∈ S\{i,j}} { C(S\{j}, i, ℓ) + c_{ℓj} }
(4) Return min_{i ≠ j} { C(V, i, j) + c_{ji} }

[Figure: a path from i to j visiting all of S; the node ℓ is the last node before j]
Let us make the following observation:
Lemma 1. The algorithm indeed computes the cost of the optimum tour.
Proof. Prove by induction on |S| that the quantity C(S, i, j) computed in the algorithm really
satisfies the definition.
Now we should discuss how we analyse the running time of the algorithm. For the considered TSP algorithm, we have at most n²·2^n many entries of the form C(S, i, j); for each entry, the min_ℓ{...} expression needs to be evaluated for at most n many choices of ℓ and evaluating each term C(S\{j}, i, ℓ) + c_{ℓj} takes O(1) many elementary operations. The overall running time of the algorithm is hence O(n³·2^n). We might wonder whether one could come up with a much more efficient algorithm for TSP than the one that we have discussed. But in fact, it is unknown whether there is any algorithm that solves TSP in time O(1.9999^n). For the input length we count the number of numbers/objects, i.e. in this case (n choose 2) ∼ n².
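For concreteness, here is one possible Python implementation of the dynamic program (the dictionary representation of the table C is our own choice); it performs O(n³·2^n) operations as discussed:

```python
from itertools import combinations

def held_karp(c):
    """c[i][j] = distance between nodes i and j in a complete graph on n nodes.
    Returns the cost of an optimum TSP tour."""
    n = len(c)
    C = {}
    # (1) base cases: paths on two nodes
    for i in range(n):
        for j in range(n):
            if i != j:
                C[(frozenset({i, j}), i, j)] = c[i][j]
    # (2)-(3) grow the subsets S
    for k in range(3, n + 1):
        for S in combinations(range(n), k):
            S = frozenset(S)
            for i in S:
                for j in S:
                    if i != j:
                        C[(S, i, j)] = min(C[(S - {j}, i, l)] + c[l][j]
                                           for l in S - {i, j})
    # (4) close the cheapest path on all nodes into a tour
    V = frozenset(range(n))
    return min(C[(V, i, j)] + c[j][i]
               for i in range(n) for j in range(n) if i != j)

# e.g. held_karp([[0, 1, 2], [1, 0, 1], [2, 1, 0]]) returns 4
```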
Apart from the obvious application to schedule a traveling salesperson/postman tour, the vertices
v1 , . . . , vn could be holes that need to be drilled on a circuit board and cij could be the time taken
by the drill to travel from vi to vj . In this case, we are looking for the fastest way to drill all holes
on the circuit board if the drill traces a tour through the locations that need to be drilled. Yet
another application might be that v1 , . . . , vn are n celestial objects that need to be imaged by a
satellite and cij is the amount of fuel needed to change the position of the satellite from vi to vj .
In this case, the optimal tour allows the satellite to complete its job with the least amount of fuel.
These and many other applications of the TSP as well as the state-of-the-art on this problem can
be found on the web page http://www.tsp.gatech.edu/. In many cases, the distances cij have special properties, e.g. they define a metric or are Euclidean distances. In those cases one can design more efficient algorithms.
Chapter 2

Minimum spanning trees
In this chapter we study the problem of finding a minimum cost spanning tree in a graph.
Then recall that in a graph G = (V, E) we call a subgraph (V′, T) a spanning tree if V′ = V (it is spanning) and the subgraph is connected and acyclic (it is a tree).
Throughout this section the underlying graph G = (V, E) will be fixed and so by a slight abuse of
notation we call a subset of edges T ⊆ E a spanning tree of G if this is the case for the subgraph
(V, T ). A very useful equivalent definition is:
The edges T ⊆ E form a spanning tree of G if and only if for each pair u, v ∈ V of nodes
there is exactly one path from u to v.
This is clear because if you have two different paths between a pair of nodes u, v, then this implies a cycle; on the other hand, being connected and spanning means you have at least one path between each pair of vertices. For example, in the graph below, we marked the minimum spanning tree with bold edges (edges {i, j} are labelled with their cost cij):
[Figure: an example graph with edge costs; the bold edges form a minimum spanning tree]
We start by establishing some basic properties of spanning trees. Some of those might appear obvious, but let us be formal here.

Lemma 2. Let G = (V, E) be a connected graph on n = |V| nodes. Then:
a) |E| ≥ n − 1
b) ∃ spanning tree T ⊆ E
c) G acyclic ⇔ |E| = n − 1
d) G is a tree ⇔ |E| = n − 1
Proof. Luckily, in all 4 cases we can use almost the same proof. We begin with finding some spanning tree in G. To do that, we fix an arbitrary node r ∈ V and call it the root. We give each node v a label d(v) which tells the distance (in terms of number of edges) that this node has from the root. Since G is connected, there is a path from v to r and d(v) is well-defined. For each node v ∈ V\{r}, we select an arbitrary neighbor node p(v) that is closer to r, that means {v, p(v)} ∈ E and d(p(v)) = d(v) − 1 (here the p stands for parent node). We now create the set T := {{u, p(u)} | u ∈ V\{r}} of all such edges. In other words, each node u has exactly one edge {u, p(u)} ∈ T that goes to a node with a lower d-label.
[Figure: the root r at the top; nodes arranged by their labels d(v) = 0, 1, 2, 3; the edges {u, p(u)} ∈ T drawn bold, the remaining edges in E\T thin]
i) (V, T ) is connected : This is because from each node u you can find a path to the root by
repeatedly taking the edge to its parent.
ii) |E| ≥ |T | = n − 1: That is because we have a bijective pairing between nodes u ∈ V \{r} and
edges {u, p(u)}, thus |T | = |V \{r}| = n − 1.
iii) (V, T ) is acyclic: If not, we can delete an edge in the cycle and obtain a still connected
spanning subgraph T ′ ⊆ T with only |T ′ | = n − 2 edges; but if we apply ii) we obtain that
each connected spanning graph on n nodes must have ≥ n−1 edges; that gives a contradiction.
v) For any edge {u, v} ∈ E\T , T ∪ {u, v} contains a cycle: This is because there was already a
path from u to v in T . That means if |E| > n − 1, then G is not acyclic, if |E| = n − 1 then
E = T and G is acyclic by iii).
vi) d) follows from c): That is since G is connected and spanning by assumption.
Lemma 3. Let T ⊆ E be a spanning tree in G = (V, E). Take an edge e = {u, v} ∈ E\T and let
P ⊆ T be the edges on the path from u to v. Then for any edge e′ ∈ P , T ′ = (T \{e′ }) ∪ {e} is a
spanning tree.
[Figure: the edge e = {u, v} ∈ E\T and the u-v path P ⊆ T; e′ is an edge on P]
Proof. Swapping e′ for e on the cycle P ∪ {e} leaves the graph connected, and the number of edges is still n − 1; hence T′ is still a spanning tree.
The operation to replace e′ by e is called an edge swap. It leads to a spanning tree of cost
c(T ′ )= c(T ) − c(e′ ) + c(e). If T was already the minimum spanning tree, then we know that
c(e) ≥ c(e′ ) or in other words: if T is optimal, then any edge {u, v} ∈ E\T is at least as expensive
as all the edges on the path P ⊆ T that connects u and v (because otherwise we could swap that
edge with {u, v} and we would obtain a cheaper spanning tree).
Surprisingly it turns out that the other direction is true as well: if no edge swap would decrease
the cost, then T is optimal.
Before we show that, we need one more definition that is very useful in graph theory. For any graph G = (V, E) we can partition the nodes into V = V1 ∪̇ ... ∪̇ Vk so that a pair u, v is in the same set Vi if and only if they are connected by a path. Those sets V1, ..., Vk are called connected components. Formally, one could define a relation ∼ with u ∼ v if and only if there is a path from u to v in G. Then one can easily check that ∼ is an equivalence relation and the equivalence classes define the connected components, see the following example:
[Figure: a graph with three connected components V1, V2, V3]
Also recall that for a node set U ⊆ V, δ(U) = {{u, v} ∈ E | |U ∩ {u, v}| = 1} denotes the set of edges that are cut by U.
Theorem 4. Let T be a spanning tree in a graph G with edge costs ce for all e ∈ E. Then the following statements are equivalent:
(1) T is a minimum cost spanning tree.
(2) No edge swap decreases the cost, i.e. for every edge e = {u, v} ∈ E\T and every edge e′ on the u-v path in T one has c(e) ≥ c(e′).
Proof. It only remains to show (2) ⇒ (1). In other words, we fix a non-optimal spanning tree T; then we will find an improving edge swap. Let T* be an optimum minimum spanning tree. If this choice is not unique, then we choose a T* that maximizes |T ∩ T*|; in other words, we pick an optimum solution T* that has as many edges as possible in common with T.
Take any edge f = {u, v} ∈ T \T ∗ . The subgraph T \{f } consists of two connected components,
let us call them C ⊆ V and V \C, say with u ∈ C and v ∈ V \C.
[Figure: left, the tree T with edge f = {u, v} and the component C of T\{f}; right, the tree T* with the u-v path P and the edge e ∈ P ∩ δ(C)]
On the other hand, there is a path P ⊆ T* that connects u and v. As this path “starts” in C and “ends” in V\C, there must be an edge in the path that lies in the cut δ(C) ∩ T*. We call this edge e ∈ δ(C) ∩ T* ∩ P. Let us make two observations:
• T′ = (T\{f}) ∪ {e} is a spanning tree! This is because e connects the two connected components of T\{f}, hence it is connected and T′ also has n − 1 edges, thus T′ is a spanning tree.
• T** := (T*\{e}) ∪ {f} is a spanning tree as well: this follows from Lemma 3, since e lies on the path P ⊆ T* between the endpoints of f.
We now compare the costs of e and f:
• Case c(f) < c(e): Then c(T**) = c(T*) − c(e) + c(f) < c(T*), which contradicts the optimality of T*.
• Case c(f) = c(e): Then T** is also optimal. But T** has one edge more in common with T than T* has. This contradicts the choice of T*.
Hence c(e) < c(f) and therefore c(T′) = c(T) − c(f) + c(e) < c(T); in other words swapping f for e is an improving edge swap for T and the claim follows.
We want to emphasize at this point that spanning trees have a very special structure which implies that a “locally optimal solution” (one that cannot be improved by edge swaps) is also a “globally optimal solution”. Most other problems like TSP do not share this property.
Kruskal’s Algorithm
Input: A connected graph G with edge costs ce ∈ R, ∀ e ∈ E(G).
Output: A MST T of G.
(1) Sort the edges such that ce1 ≤ ce2 ≤ . . . ≤ cem .
(2) Set T = ∅
(3) For i from 1 to m do
If T ∪ {ei } is acyclic then update T := T ∪ {ei }.
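For concreteness, here is one possible Python implementation of Kruskal's algorithm; the union-find data structure used to test whether T ∪ {ei} stays acyclic is one standard way to obtain the running time bound of Theorem 6 below:

```python
def kruskal(n, edges):
    """n = number of nodes (0, ..., n-1), edges = list of (cost, u, v).
    Returns the edges of a minimum spanning tree of a connected graph."""
    parent = list(range(n))

    def find(x):
        # root of the component currently containing x (with path halving)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    T = []
    for cost, u, v in sorted(edges):   # step (1): sort edges by cost
        ru, rv = find(u), find(v)
        if ru != rv:                   # adding {u, v} keeps T acyclic
            parent[ru] = rv
            T.append((u, v, cost))
    return T
```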
Theorem 5. Kruskal's algorithm outputs a minimum spanning tree of G.

Proof. Let T be the output of Kruskal's algorithm. We need to first show that T is a spanning tree of G. Obviously, T will be acyclic at the end. Now suppose for the sake of contradiction that T was not connected and that C ⊆ V is one connected component with 1 ≤ |C| < n. We assumed that the graph G itself was connected, hence G did contain some edge e = {u, v} ∈ E running between C and V\C. But the algorithm considered that edge at some point and decided not to take it. That means there was already a path in T between u and v, which gives a contradiction.
We now argue that T cannot be improved by edge swaps, which by Theorem 4 proves that T is the cheapest spanning tree. Consider any pair of edges ei ∈ T and ej ∈ E\T so that (T\{ei}) ∪ {ej} is again a spanning tree.
[Figure: the tree T with edge ei ∈ T and edge ej ∈ E\T closing a cycle]
In particular, (T \{ei }) ∪ {ej } is acyclic, but then at the time that ei was added, it would have been
possible to add instead ej without creating a cycle. That did not happen, hence ej was considered
later. This means i < j and c(ei ) ≤ c(ej ), which implies that an edge swap would not have paid
off.
It should be clear that the algorithm has polynomial running time (in the arithmetic and in the
bit model). In fact, it is very efficient:
Theorem 6. Kruskal’s algorithm has a running time of O(m log(n)) (in the arithmetic model)
where n = |V | and m = |E|.
Chapter 3
Shortest paths
To motivate the next problem that we want to study in detail, imagine that you would like to drive home from UW and suppose you want to select a route so that you spend as little as possible on gas. Suppose you have already isolated a couple of road segments that you consider using on your way home, and from your experience you know how much gas your car needs on each segment. We draw a graph with a vertex for each location/crossing and an edge for each segment, labelled with the amount of gas that is needed for it.
[Figure: road network between UW and home; edges labelled with the gas needed per segment, the cheapest UW-home path in bold]
Now we are interested in a path in this graph that goes from UW to your home and that minimizes the amount of gas. Observe that some roads are one-way streets and also the amount of gas could depend on the direction that you drive, e.g. because you need more gas to drive uphill than downhill and because the traffic in one direction might be heavier than in the other one (at a given time of the day). In other words, we allow the edges to have directions, but so far we can assume that ce ≥ 0. We refresh some definitions from Chapter 1:
Definition 2. A directed graph (digraph) G = (V, E) consists of a vertex set V and directed edges e = (i, j) ∈ E (also called arcs) such that i is the tail of e and j is the head of e. A directed path is of the form (v0, v1), (v1, v2), ..., (vk−1, vk) with (vi, vi+1) ∈ E for all i. A path is called an s-t path if it starts in s and ends in t (s, t ∈ V).
In the following, whenever we have a cost function c : E → R, then “shortest path” means the
one minimizing the cost (not necessarily the one minimizing the number of edges). In the example
above, we depicted the shortest “UW-home path”.
Now let us go back to our introductory example where we want to find the most gas-efficient way from UW to your home. Now imagine you just bought a hybrid car, which can charge the battery when braking. In other words, when you drive downhill, the gas-equivalent that you spend may be negative (while before we assumed c(e) ≥ 0). Maybe the road network with gas consumption now looks as follows (again with the shortest path in bold):
[Figure: the road network with some negative gas consumptions; the shortest UW-home path in bold]
So, we might need to drop the assumption that edge cost are non-negative (i.e. c(e) ≥ 0). But could
we allow arbitrary (possibly negative) edge cost? Well, consider the following example instance:
[Figure: a small instance with nodes s and t and a directed cycle of total cost −1]
What is the length of the shortest s-t path? Observe that the length of the cycle is −1. Hence
if we allow to revisit nodes/edges, then the length of the shortest walk is −∞ and hence not well
defined. If we restrict the problem to paths (= not revisiting nodes), then one can prove that the
problem becomes NP-complete1 , which we want to avoid. It seems the following two assumptions
are meaningful, where the 1st one implies the 2nd one.
• Strong assumption: ce ≥ 0 ∀e ∈ E
• Weak assumption: there is no directed cycle of negative total cost (we will call such cost functions conservative).
In this chapter, we will see several algorithms; some need the stronger assumption (but are more
efficient) and others have higher running time, but work with the weaker assumption.
Lemma 7 (Bellman's principle). Let G = (V, E) be a directed, weighted graph with no negative cost cycle. Let P = {(s, v1), (v1, v2), ..., (vk, t)} be a shortest s-t path. Then Ps,i = {(s, v1), (v1, v2), ..., (vi−1, vi)} is a shortest s-vi path.
Proof. If we have a s-vi path Q that is shorter, then replacing the s-vi part in P with Q would give
a shorter s-t path.
¹ Given an undirected graph (without costs) and nodes s, t, it is known to be NP-hard to test whether there is an s-t path visiting all nodes exactly once. Now put edge cost −1 on all edges. If you can find a simple s-t path of cost −(n − 1), then this must be a path that visits all nodes once.
[Figure: the shortest s-t path P through v1, ..., vi, ..., vk and an alternative s-vi path Q]
3.1 Dijkstra's algorithm

We begin with the case of non-negative edge costs. Dijkstra's algorithm maintains a set R of nodes whose shortest path distance is already known, together with tentative labels ℓ(v):

Dijkstra's algorithm
Input: a directed graph G = (V, E); edge costs c(e) ≥ 0 ∀e ∈ E; a source node s ∈ V.
Output: The length ℓ(v) of a shortest s-v path, for all v ∈ V.
(1) Set R := {s}, ℓ(s) := 0 and ℓ(v) := c(s, v) for all v ∈ V\{s} (with c(s, v) := ∞ if (s, v) ∉ E)
(2) WHILE R ≠ V DO
(3)     Choose v ∈ V\R with ℓ(v) minimal and set R := R ∪ {v}
(4)     For all w ∈ V\R update ℓ(w) := min{ℓ(w), ℓ(v) + c(v, w)}

[Figure: two snapshots of an example run of Dijkstra's algorithm; nodes are labelled with ℓ(v) and the edges of the shortest path tree are drawn in bold in the final solution]
We can easily modify the algorithm so that it also computes the shortest paths themselves. For each node v ∈ V\{s}, we pick one node p(v) so that ℓ(v) = ℓ(p(v)) + c(p(v), v). Then from each node v we can find the shortest path from the root by repeatedly going back to p(v). Those edges (p(v), v) are depicted in bold in the last figure. If there is no s-v path, then we will have ℓ(v) = ∞.
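For concreteness, here is one possible Python implementation of Dijkstra's algorithm that also records the parents p(v); it organizes the selection of the minimum label with a binary heap, one of the standard implementation choices:

```python
import heapq

def dijkstra(n, adj, s):
    """adj[v] = list of (w, cost) with cost >= 0. Returns (labels, parents)."""
    l = [float("inf")] * n
    p = [None] * n
    l[s] = 0
    done = [False] * n          # the set R of finished nodes
    heap = [(0, s)]
    while heap:
        lv, v = heapq.heappop(heap)
        if done[v]:
            continue
        done[v] = True          # add v to R
        for w, cost in adj[v]:
            if lv + cost < l[w]:        # update l(w) := min{l(w), l(v) + c(v, w)}
                l[w] = lv + cost
                p[w] = v
                heapq.heappush(heap, (l[w], w))
    return l, p
```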
Lemma 8. Let d(u, v) be the length of the shortest u-v path in graph G = (V, E). After the
termination of Dijkstra’s algorithm we have ℓ(v) = d(s, v) for all v ∈ V .
Proof. We prove that the following two invariants are satisfied after each iteration of the algorithm:
• Invariant I. For all v ∈ R one has: ℓ(v) = d(s, v).
• Invariant II. For all v ∈ V\R one has: ℓ(v) = length of the shortest path from s to v that has all nodes in R, except of v itself.
At the end of the algorithm, we have R = V and the claim follows from invariant I. Both invariants are true before the first iteration when R = {s}. Now consider any iteration of the algorithm; assume that both invariants were satisfied at the beginning of the iteration and that v is the node that was selected. Let P be the shortest s-v path that has all nodes except v itself in R; its cost is c(P) = ℓ(v) by invariant II. Now suppose that this is not the shortest s-v path and that there is a cheaper s-v path Q with c(Q) < c(P) = ℓ(v). Let w be the first node on the s-v path Q that lies outside of R (this node must exist since Q starts in R and ends outside; also that node cannot be v, since otherwise by invariant II we would have c(P) ≤ c(Q)). Since we have only non-negative edge costs, ℓ(w) ≤ c(Q) < ℓ(v). But that means the algorithm would have processed w instead of v. This is a contradiction. It follows that invariant I is maintained.
[Figure: the set R containing s, the selected node v with path P, and the first node w of the path Q outside of R]
For invariant II, consider a node w ∈ V\R. There are two possibilities for the shortest s-w path with all nodes except w in R ∪ {v}: either it does not use the newly added node v, in which case ℓ(w) remains valid; or it does use v, but then it has to use v as the last node before w and ℓ(w) = ℓ(v) + c(v, w) is the correct length. The reason is that for any u ∈ R, by invariant I, there is a shortest s-u path not using v. Hence also invariant II remains valid.
We can also offer an alternative exposition of the same proof that is slightly more technical:
Alternative Exposition. For a set of nodes U ⊆ V and s, t ∈ U , let dU (s, t) be the length of the
shortest s-t path that uses exclusively nodes in U . In particular dV (s, t) is the length of the overall
shortest s-t path in the graph. We prove that the following two invariants are satisfied after each
iteration of the algorithm:
• Invariant I. ℓ(w) = dV (s, w) ∀w ∈ R
• Invariant II. ℓ(w) = dR∪{w} (s, w) ∀w ∈ V \ R.
At the end of the algorithm, we have R = V and the claim follows from invariant I. Now consider any
iteration of the algorithm; assume that both invariants were satisfied at the beginning of the iteration
and that v is the node that was selected, that means in particular the update is R′ := R ∪ {v}.
To see Invariant I, let Qsv be the overall shortest s-v path and suppose for the sake of contra-
diction that c(Qsv ) < dR∪{v} (s, v). Then Qsv must be visiting some node outside of R ∪ {v}; let
w be the first such node. By Qsw we denote the subpath going only from s to w. Note that each
subpath must also be a globally shortest path between the endpoints. Due to non-negative edge
costs,

ℓ(w) = d_{R∪{w}}(s, w) = c(Q_sw) ≤ c(Q_sv) < ℓ(v)

(the first equality is invariant II, the inequality uses the non-negative edge costs), which is a contradiction to the minimality of the picked label.
For invariant II, consider a node w ∈ V\R. We will consider the length of the shortest path from s to w only visiting nodes in R′ ∪ {w}. This path has to visit some node of R′ just before w. Taking the minimum over all choices we can write

d_{R′∪{w}}(s, w) = min{ min_{u∈R} { d_{R′}(s, u) + c(u, w) }, d_{R′}(s, v) + c(v, w) } = min{ ℓ(w), ℓ(v) + c(v, w) },

using that d_{R′}(s, u) = d_R(s, u) for u ∈ R, that min_{u∈R} { d_R(s, u) + c(u, w) } = d_{R∪{w}}(s, w) = ℓ(w), and that d_{R′}(s, v) = ℓ(v).
Here we use in particular that even the shortest global path from s to some u ∈ R only runs inside
R.
Theorem 9. Dijkstra's algorithm can be implemented to run in time O(n²).

Proof. The algorithm has at most n iterations and the main work is done for updating the ℓ(w)-values. Note that one update step ℓ(w) := min{ℓ(w), ℓ(v) + c(v, w)} takes time O(1) and we perform at most n such updates per iteration. In total the running time is O(n²).
Again with more sophisticated data structures (so called Fibonacci heaps), the running time
can be reduced to O(|E| + |V | log |V |). Recall that Dijkstra’s algorithm computes the lengths of n
shortest paths d(s, v). Interestingly, there is no known algorithm that is faster, even if we are only
interested in a single distance d(s, t).
3.2 The Moore-Bellman-Ford Algorithm
Moore-Bellman-Ford Algorithm
Input: a directed graph G = (V, E); edge cost c(e) ∈ R ∀e ∈ E; a source node s ∈ V .
Assumption: There are no negative cost cycles in G
Output: The length ℓ(v) of the shortest s-v path, for all v ∈ V .
(1) Set ℓ(s) := 0 and ℓ(v) := ∞ for all v ∈ V\{s}
(2) For i = 1, ..., n − 1 do
(3)     For each (v, w) ∈ E do:
            If ℓ(w) > ℓ(v) + cvw then set ℓ(w) := ℓ(v) + cvw (“update edge (v, w)”)
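For concreteness, the algorithm can be transcribed almost literally into Python (the list-of-edge-triples representation is our own choice):

```python
def moore_bellman_ford(n, edges, s):
    """edges = list of (v, w, cost); assumes there is no negative cost cycle.
    Returns the labels l(v), i.e. the shortest s-v path lengths."""
    l = [float("inf")] * n
    l[s] = 0
    for _ in range(n - 1):              # step (2)
        for v, w, cost in edges:        # step (3)
            if l[v] + cost < l[w]:      # update edge (v, w)
                l[w] = l[v] + cost
    return l
```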
An example run of the algorithm can be found below (after iteration 3 we already have all the
correct labels in this case):
[Figure: the example instance (left) and the labels at the end of the algorithm (right): ℓ(s) = 0, ℓ(a) = −4, ℓ(b) = 0, ℓ(c) = −2, ℓ(d) = −1]
The edges are processed in the order (b, a), (c, a), (c, b), (d, b), (d, c), (s, c), (s, d); an entry in the table shows the new label value whenever an update occurs:

              edge     ℓ(s)   ℓ(a)   ℓ(b)   ℓ(c)   ℓ(d)
start                   0      ∞      ∞      ∞      ∞
Iteration 1   (b, a)
              (c, a)
              (c, b)
              (d, b)
              (d, c)
              (s, c)                          5
              (s, d)                                −1
Iteration 2   (b, a)
              (c, a)           3
              (c, b)                  7
              (d, b)                  3
              (d, c)                         −2
              (s, c)
              (s, d)
Iteration 3   (b, a)
              (c, a)          −4
              (c, b)                  0
              (d, b)
              (d, c)
              (s, c)
              (s, d)
end                     0     −4      0     −2     −1
Theorem 10. The Moore-Bellman-Ford algorithm has a running time of O(nm).

Proof. The outer loop takes O(n) time since there are n − 1 iterations. The inner loop takes O(m) time as it checks the ℓ-value at the head vertex of every edge in the graph. This gives an O(nm) algorithm.
Theorem 11. After termination of the Moore-Bellman-Ford algorithm one has ℓ(v) = d(s, v) for all v ∈ V, where d(s, v) denotes the length of a shortest s-v path.

Proof. Let d(u, v) be the actual length of the shortest u-v path. We want to show that indeed after the termination of the algorithm one has ℓ(u) = d(s, u) for all u ∈ V. First, by induction one can easily prove that at any time one has ℓ(u) ≥ d(s, u). To see this, suppose in some iteration we update edge (v, w); then ℓ(w) = ℓ(v) + cvw ≥ d(s, v) + c(v, w) ≥ d(s, w) by the triangle inequality.
In the following, we will show that after n − 1 iterations of the outer loop of the algorithm, we have ℓ(w) = d(s, w) for each node. Hence, consider a shortest path s = v0 → v1 → v2 → ... → vk = w from s to some node w.
s = v0 v1 ... vi−1 vi . . . w = vk
Suppose after i−1 iterations, we have ℓ(vi−1 ) = d(s, vi−1 ). Then in the ith iteration we will consider
edge (vi−1 , vi ) for an update and set ℓ(vi ) := ℓ(vi−1 ) + c(vi−1 , vi ) = d(s, vi−1 ) + c(vi−1 , vi ) = d(s, vi )
(if that was not already the case before). Note that here we have implicitly used that s → v1 → ... → vi−1 is also the shortest path from s to vi−1. Since all shortest paths include at most n − 1
edges, at the end we have computed all distances from s correctly.
Again, one can also extract the shortest paths themselves once we know the labels ℓ(v). Finally, we should remark that the speed at which the values ℓ(v) propagate does depend on the order of the edges. Suppose we had sorted the edges as e1, ..., em so that d(s, tail(e1)) ≤ ... ≤ d(s, tail(em)); then already after one iteration, all the values ℓ(v) would coincide with the actual distance d(s, v).
3.3 Detecting negative cycles

Definition 3. Let G = (V, E) be a directed graph with edge costs c(e) and let π : V → R be a vector of node potentials.
1. The reduced cost of an edge e = (i, j) ∈ E is c_π(e) := c(e) + π(tail(e)) − π(head(e)) = c(e) + π(i) − π(j). We call the cost function c conservative if there is no directed cycle of negative total cost.
2. If the reduced costs of all edges of G are nonnegative, we say that π is a feasible potential on G.
For example, below we have an example with negative edge costs (but no negative cycle) on the left and a feasible potential with the resulting non-negative reduced costs on the right:

[Figure: left, a small graph on s and t with some negative edge costs; right, the same graph with a feasible potential, e.g. π(s) = 2, and the corresponding reduced costs]
Theorem 12. The directed graph G with vector of costs c has a feasible potential if and only if c
is conservative.
Proof. (⇒): Suppose π is a feasible potential on (G, c). Consider a directed cycle with edges C.
Then adding up the costs of the edges gives

∑_{e∈C} c(e) = ∑_{e∈C} ( c(e) + π(tail(e)) − π(head(e)) ) ≥ 0,

where each summand on the right hand side is a non-negative reduced cost; the equality holds
using that each node on the cycle appears once as tail(e) and once as head(e).
(⇐): Suppose c is conservative. Augment G to a graph G̃ by adding a new vertex s and adding
edges (s, v) with cost c(s, v) = 0. This does not create negative cost cycles because s has only
outgoing edges.
[Figure: the augmented graph G̃: a new vertex s with cost-0 edges (s, v) to every node of G; the potential is π(i) = d(s, i)]
Now let d(s, i) be the length of the shortest s-i path in G̃. Note that these values might be negative,
but they are well defined since there are no negative cost cycles and from s we can reach every node.
We claim that π(i) := d(s, i) is a feasible potential. Consider any edge (i, j) ∈ E, then its reduced
cost is
c(i, j) + d(s, i) − d(s, j) ≥ 0 ⇔ d(s, j) ≤ d(s, i) + c(i, j)
which is just the triangle inequality.
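As a small illustration (the instance below is our own toy example), the feasibility of a potential can be checked by computing all reduced costs:

```python
def reduced_costs(edges, pi):
    """edges = list of (i, j, cost); pi maps each node to its potential.
    Returns {(i, j): c(i, j) + pi[i] - pi[j]}."""
    return {(i, j): cost + pi[i] - pi[j] for i, j, cost in edges}

def is_feasible_potential(edges, pi):
    return all(rc >= 0 for rc in reduced_costs(edges, pi).values())

# tiny instance with a negative edge; pi(v) = d(s, v) gives a feasible potential
edges = [("s", "a", 2), ("a", "b", -1), ("s", "b", 3)]
pi = {"s": 0, "a": 2, "b": 1}
print(is_feasible_potential(edges, pi))  # True
```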
Lemma 13. Given (G, c), in O(nm) time we can find either a feasible potential on G or a negative
cost cycle in G.
Proof. Let G be the given graph and add again a source node s that is connected to each node
at cost 0. Now run the Moore-Bellman-Ford algorithm to compute labels ℓ(v). If no edge update
is made in the last iteration, then we know that the labels satisfy ℓ(v) ≤ ℓ(u) + cuv for all edges
(u, v) ∈ E. In fact, in the last proof we implicitly argued that such labels define a feasible potential.
Now, let us make a couple of observations concerning the behaviour of the algorithm. Where
do the labels ℓ(v) actually come from? Well, an easy inductive argument shows that at any time in
the algorithm and for any v, there is an s-v path P so that ℓ(v) = c(P ).
Claim: Suppose a label ℓ(v) has been updated in iteration k. Then there is an s-v path P with at least k edges so that ℓ(v) = c(P).
Proof: If we updated ℓ(v), then we had ℓ(v) = ℓ(u) + c(u, v) at that time with (u, v) ∈ E. The
only reason why ℓ(v) was not updated earlier is that u itself was updated either in iteration k or in
iteration k − 1. The argument follows by induction.
Now suppose that a node u was updated in the nth iteration. Then we can find a path P with at least n edges so that ℓ(u) = c(P). Since the path has n edges, there must be a node, say v, that appears twice on this path. Say ℓ′(v) was the label at the earlier update and ℓ″(v) < ℓ′(v) was the label at the later update. Let Q be the part of P that goes from s to v and C be the cycle from v to v. Then ℓ′(v) = c(Q) while ℓ″(v) = c(Q) + c(C) < ℓ′(v) = c(Q). Hence c(C) < 0.
[Figure: the path P from s to u; the node v appears twice, Q is the s-v prefix and C the cycle from v back to v]
Intuitively, the purpose of potentials is to “correct” the edge costs so that they become non-negative while not changing the shortest paths themselves. This provides the following insight:
Corollary 14. Given a directed weighted graph with no negative cost cycles, |V | = n and |E| =
m. Then one can compute the lengths (d(u, v))u,v∈V of all shortest paths simultaneously in time
O(nm + n² log n).
Proof. Since there are no negative cost cycles, we can compute a feasible potential π in time O(nm).
Suppose that dπ (u, v) gives the length of the shortest path w.r.t. the reduced costs. Observe that
dπ (u, v) = d(u, v) + π(u) − π(v). Hence a path is a shortest path w.r.t. the original costs if and
only if it is a shortest path for the reduced costs. The advantage is that the reduced costs are
non-negative, hence we can apply Dijkstra’s algorithm n times (once with each node as source
node). Each application takes time O(m + n log n). Hence we obtain a total running time of
O(nm) + n · O(n log(n)) which is of the claimed form.
Chapter 4
Network flows
In this chapter, we want to discuss network flows and in particular how to compute maximum flows
in directed graphs.
Imagine you work for an oil company that owns an oil field (at node s), a couple of pipelines
and a harbor (at node t). The pipeline network is depicted below. Each pipeline segment e has a
maximum capacity ue (say tons of oil per minute that we write on the edges). For technical reasons,
each pipeline segment has a direction and the oil can only be pumped into that direction. What is
the maximum amount of oil that you can transport from the oil field to the harbor?
[Figure: the pipeline network with source s (oil field), sink t (harbor) and a capacity on each edge]
Let us make a couple of definitions that will help us to formally state the underlying mathemat-
ical problem.
Definition 4. A network (G, u, s, t) consists of the following data:
• A directed graph G = (V, E),
• edge capacities ue ∈ Z≥0 for all e ∈ E, and
• two specified nodes s (source) and t (sink) in V .
Definition 5. A s-t flow is a function f : E → R≥0 with the following properties:
• The flow respects the capacities, i.e. 0 ≤ f (e) ≤ ue ∀e ∈ E.
• It satisfies flow conservation at each vertex v ∈ V \{s, t}:
∑_{e∈δ−(v)} f(e) = ∑_{e∈δ+(v)} f(e)
The value of an s-t flow f is
value(f) := ∑_{e∈δ+(s)} f(e) − ∑_{e∈δ−(s)} f(e)
which is the total flow leaving the source node s minus the total flow entering s.
In our example application, it turns out that the maximum flow f has a value of value(f) = 9; the flow f itself is depicted below, where we label the edges e with f(e). Observe that indeed we respect the capacities.

[Figure: the pipeline network with a maximum flow of value 9; edges labelled with f(e)]
4.1 The Ford-Fulkerson algorithm

A first natural attempt is the following greedy algorithm:
(1) Set f(e) := 0 for all e ∈ E
(2) REPEAT
(3)     Find an s-t path P using only edges with u(e) > 0 (stop if there is none)
(4)     FOR e ∈ P set f(e) := f(e) + 1 and u(e) := u(e) − 1 (remove e if u(e) = 0)
Consider the following example instance (again, edges are labelled with capacities u(e)):
[Figure: a four-node example with nodes s, a, b, t and all capacities equal to 1]
Obviously the maximum flow value is 2. Now let us see how our algorithm would perform on the
example. We could decide to select the path P = s → a → b → t and send a flow of 1 along it. But
after removing a capacity of 1 along P , we end up with the following remaining capacities:
[Figure: the remaining capacities after sending one unit of flow along s → a → b → t]
That means we cannot improve the flow of value 1 anymore. The problem is that we made the
mistake of sending flow along edge (a, b) and we need a way of “correcting” such mistakes later in
the algorithm. The key idea to do that is using residual graphs.
Definition 6. Given a network (G = (V, E), u, s, t) with capacities u and an s-t flow f. The residual graph Gf = (V, Ef) has edges

Ef = {(i, j) | (i, j) ∈ E and u(i, j) − f(i, j) > 0}  ∪  {(j, i) | (i, j) ∈ E and f(i, j) > 0},

where the edges in the first set are called forward edges and those in the second set backward edges. The residual capacity is uf(i, j) := u(i, j) − f(i, j) for a forward edge and uf(j, i) := f(i, j) for a backward edge. An f-augmenting path is an s-t path P ⊆ Ef in the residual graph; to augment f along P by γ means to increase f(e) by γ on every forward edge of P and to decrease the flow by γ on the original edge corresponding to every backward edge of P. For a node v we abbreviate its excess by exf(v) := ∑_{e∈δ−(v)} f(e) − ∑_{e∈δ+(v)} f(e).
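For concreteness, the residual graph of Definition 6 can be computed as follows (the dictionary representation of capacities and flows is our own choice):

```python
def residual_graph(u, f):
    """u, f: dicts mapping each directed edge (i, j) to its capacity / flow value.
    Returns the residual capacities u_f on the edge set E_f.
    (For simplicity we assume that E never contains both (i, j) and (j, i).)"""
    uf = {}
    for (i, j), cap in u.items():
        if cap - f[(i, j)] > 0:     # forward edge with remaining capacity
            uf[(i, j)] = cap - f[(i, j)]
        if f[(i, j)] > 0:           # backward edge allows undoing flow
            uf[(j, i)] = f[(i, j)]
    return uf
```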
Lemma 15. Let (G, u, s, t) be a network with an s-t flow f . Let P be a f -augmenting path so that
uf (e) ≥ γ ∀e ∈ P . Let us augment f along P by γ and call the result f ′ . Then f ′ is a feasible s-t
flow of value value(f ′ ) = value(f ) + γ.
Proof. By definition, we never violate any edge capacity (and also f ′ (e) ≥ 0). Now we need to argue
that exf ′ (v) = 0 for all v ∈ V \{s, t}. Let us define a flow g that we get by naively adding value to
edges in P:

g(e) := f(e) + γ  if e ∈ P,   and   g(e) := f(e)  otherwise.
[Figure: the flow g in our example above]
Any node v ∈ V\{s, t} on the path P now has γ more flow incoming and γ more flow outgoing.
Hence exg (v) = 0. Moreover, t has flow γ more ingoing than before, hence exg (t) = exf (t) + γ. But
g is not yet f ′ because P might have contained a backward edge (i, j). That means g(i, j) ≥ γ and
g(j, i) ≥ γ. Now we reduce the flow value on both (i, j) and (j, i) by γ and we obtain f ′ . Note that
this “flow cancelling” does not change the excess. Hence also exf ′ (v) = 0 for all v ∈ V \{s, t}.
The Ford-Fulkerson algorithm now simply keeps augmenting along augmenting paths in the residual graph:

Ford-Fulkerson algorithm
Input: A network (G, u, s, t)
Output: An s-t flow f of maximum value
(1) Set f(e) := 0 for all e ∈ E
(2) WHILE there exists an f-augmenting path DO
(3)     Choose an f-augmenting path P ⊆ Ef and set γ := min{uf(e) | e ∈ P}
(4)     Augment f along P by γ

Example: Ford-Fulkerson Algorithm
[Figure: three steps of the Ford-Fulkerson algorithm on the example network. Step 2: value(f) = 4, augmenting path with γ = 3. Step 3: value(f) = 7, augmenting path with γ = 2. Step 4: value(f) = 9, no further augmenting path, f optimal.]
Edges on left hand side are labelled with f (e)/u(e). Edges on right hand side are labelled with
residual capacities uf (e). The augmenting path P ⊆ Ef that is selected is drawn in Gf in bold. The
value γ = min{uf (e) | e ∈ P } gives the minimum capacity on any edge on P .
The natural question is: does this algorithm actually find the optimum solution? We will find
out that the answer is yes. Before we proceed with proving that, we want to introduce some more
concepts.
4.2 Min Cut problem

Next, we want to introduce a problem that will turn out to be the dual to the Max Flow problem. An s-t cut in a network (G, u, s, t) is a set S ⊆ V with s ∈ S and t ∉ S; its capacity is u(δ+(S)) := ∑_{e∈δ+(S)} u(e), the total capacity of the edges leaving S. The Min Cut problem asks for an s-t cut of minimum capacity. In our example application, the minimum s-t cut would look as follows (the bold edges are in the cut δ+(S)):
[Figure: the example network with a minimum s-t cut S; the bold edges form δ+(S)]
Lemma 16. Let (G, u, s, t) be a network with an s-t flow f and an s-t cut S ⊆ V . Then value(f ) ≤
u(δ + (S)).
Proof. The claim itself is not surprising: if the flow goes from s to t, then at some point it has to
“leave” S. Mathematically, we argue as follows:
value(f) = ∑_{e∈δ+(s)} f(e) − ∑_{e∈δ−(s)} f(e) = ∑_{v∈S} ( ∑_{e∈δ+(v)} f(e) − ∑_{e∈δ−(v)} f(e) )

by the definition of the flow value and since, by flow conservation, the inner term is 0 for every v ≠ s. Furthermore,

(∗)   = ∑_{e∈δ+(S)} f(e) − ∑_{e∈δ−(S)} f(e) ≤ u(δ+(S)),

using f(e) ≤ u(e) for the outgoing edges and f(e) ≥ 0 for the incoming edges.
In (∗), we drop edges e that run inside S from the sums. The reason is that they appear twice in
the summation: once with a “+” and once with a “−”, so their contribution is 0 anyway.
Actually, the last proof gives us an insight for free by simply inspecting when the inequality
would be an equality:
Corollary 17. In the last proof we have value(f ) = u(δ + (S)) if and only if both conditions are
satisfied:
• All outgoing edges from S are saturated: f (e) = u(e) ∀e ∈ δ + (S)
• All incoming edges into S have 0 flow: f (e) = 0 ∀e ∈ δ − (S).
Now we can show that if the Ford-Fulkerson stops, then it has found an optimum flow.
Theorem 18. An (s, t)-flow f is optimal if and only if there does not exist an f -augmenting path
in Gf .
Proof. (⇒): This is clear, because we already discussed that if there exists an f -augmenting path
in Gf , then f is not maximal (simply because the augmented flow will have a higher value).
(⇐): Now suppose that we have an s-t flow f and there is no augmenting path in Gf anymore. We
need to argue that this flow is optimal. The proof is surprisingly simple: Choose

S := {v ∈ V | ∃ s-v path in Gf}.

Since there is no augmenting path, t is not reachable from s in Gf, hence s ∈ S and t ∉ S, i.e. S is an s-t cut. We check the two conditions of Corollary 17:
• An outgoing edge e = (i, j) ∈ δ+(S) is saturated, i.e. f(e) = u(e): otherwise the forward edge (i, j) would be contained in Gf and j would be reachable from s via i, contradicting j ∉ S.
• An incoming edge e = (j, i) ∈ δ−(S) has f(e) = 0: similarly, if f(j, i) > 0, then the edge (i, j) exists in the residual network Gf and again j would be reachable via i.
Hence value(f) = u(δ+(S)) and by Lemma 16 the flow f is optimal (and S is a minimum cut).
Combining Lemma 16, Cor. 17 and Theorem 18 we conclude the max flow min cut theorem:
Theorem 19 (MaxFlow = MinCut (Ford-Fulkerson 1956)). The max value of an s-t flow in the
network (G, u, s, t) equals the min capacity of a s-t cut in the network.
[Figure: left, a maximum flow in the example network with edges labelled f(e)/u(e) and the minimum cut S; right, the residual graph Gf with capacities uf]
Remark 1. Note that the Ford-Fulkerson algorithm not only finds a max flow but also the min cut
in the network.
Lemma 20. For the network (G, u, s, t), let U := max{u(e) | e ∈ E} be the maximum capacity.
Then the Ford-Fulkerson algorithm runs in time O(n · m · U ).
Proof. Since s has at most n edges of capacity at most U outgoing, any flow has value(f ) ≤ n · U .
In each iteration of the algorithm, the value of the flow increases by at least 1. Hence, the algorithm
needs at most n · U iterations. Each iteration (i.e. constructing Ef and finding an augmenting path)
can be done in time O(m).
Note that the quantity U is exponential in the bit encoding length of the input, hence O(n·m·U )
is not a polynomial running time. Moreover, we cannot improve the analysis by much. Consider
the following example where we always augment along the bold paths P = s → a → b → t or
P = s → b → a → t.
[Figure: a bad example for the Ford-Fulkerson algorithm: G has edges s → a, s → b, a → t, b → t of capacity M and an edge between a and b of capacity 1. Augmenting alternately along s → a → b → t and s → b → a → t gives residual graphs Gf1, Gf2, ... with value(f1) = 1, value(f2) = 2, ...]
We see that in each iteration, we indeed gain only one flow unit. Thus it takes 2M iterations to find the optimum. It seems that the problem was that Ford-Fulkerson chooses an arbitrary augmenting path instead of making a smarter choice. This is what we want to do next.

4.3 The Edmonds-Karp algorithm

The Edmonds-Karp algorithm is the Ford-Fulkerson algorithm with one extra rule: in every iteration we augment along an augmenting path with the smallest number of edges, which can be found by breadth first search in the residual graph.
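For concreteness, here is one possible Python implementation of this rule (Ford-Fulkerson with breadth first search on an adjacency-matrix representation, which is our own simplification):

```python
from collections import deque

def edmonds_karp(cap, s, t):
    """cap[i][j] = capacity of edge (i, j) (0 if absent). Returns the max flow value."""
    n = len(cap)
    flow_value = 0
    res = [row[:] for row in cap]        # residual capacities, updated in place
    while True:
        # BFS for a shortest augmenting path in the residual graph
        parent = [None] * n
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] is None:
            v = queue.popleft()
            for w in range(n):
                if res[v][w] > 0 and parent[w] is None:
                    parent[w] = v
                    queue.append(w)
        if parent[t] is None:            # no augmenting path: flow is maximum
            return flow_value
        # bottleneck capacity gamma along the path
        gamma = float("inf")
        w = t
        while w != s:
            gamma = min(gamma, res[parent[w]][w])
            w = parent[w]
        # augment: decrease forward residual capacity, increase backward one
        w = t
        while w != s:
            v = parent[w]
            res[v][w] -= gamma
            res[w][v] += gamma
            w = v
        flow_value += gamma

# e.g. edmonds_karp([[0, 1, 1], [0, 0, 1], [0, 0, 0]], 0, 2) returns 2
```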
Theorem 21. The Edmonds-Karp algorithm requires at most m · n iterations and has running time
O(m²n).
Proof. Each iteration can be done in time O(m) using breadth first search. For bounding the number of iterations it suffices to show that each edge e appears at most n times as the bottleneck edge, i.e. as an edge e ∈ P with uf(e) = γ. Let f0, f1, ..., fT be the sequence of flows with 0 = value(f0) < value(f1) < ... < value(fT) that
the algorithm maintains. Let di (s, v) be the length of a shortest s-v path in the residual graph Gfi
w.r.t. the number of edges. Note that the flow f changes from iteration to iteration and with it
38
possibly di (s, v). We make the important observation that the distances from s never decrease:
Claim: For all v ∈ V , d0 (s, v) ≤ d1 (s, v) ≤ . . . ≤ dT (s, v).
Proof. Consider a fixed node v ∗ and some iteration i. We want to argue that di (s, v ∗ ) ≤ di+1 (s, v ∗ ).
We partition the nodes into blocks Vj := {w | di (s, w) = j} of nodes that have distance j to the
source. Note that the augmenting path P that is used in iteration i will have exactly one node
in each block Vj and it will only use edges that go from Vj to Vj+1 (since P is the shortest path
possible).
[Figure: the blocks V1, V2, ..., Vj* of nodes at distance 1, 2, ..., j* from s; left, the graph Gfi with the augmenting path P; right, the graph Gfi+1 with an s-v* path Q]
Suppose that v ∗ ∈ Vj ∗ . The only edges that are in Efi+1 \Efi are edges on P reversed, hence the
only new edges in Efi+1 \Efi run from Vj+1 to Vj for some j. Hence an s-v ∗ path Q in Gfi+1 can
never “skip” one of the blocks Vj , hence di+1 (s, v ∗ ) ≥ j ∗ . That shows the claim.
Claim: Suppose that (u, w) was the bottleneck edge in two iterations i and i′ > i. Then di′(s, u) > di(s, u).
Proof: Consider the situation after iteration i: the edge (u, w) has been the bottleneck and now it exists only as a (w, u) edge in the residual graph. The edge (u, w) can only reappear in the residual graph if at some point (w, u) lies on the shortest augmenting path. In particular, it must be the case that di″(s, w) = di″(s, u) − 1 for some iteration i < i″ < i′. But since di(s, w) = di(s, u) + 1 and distances never decrease, we must have di′(s, u) ≥ di″(s, u) > di(s, u).
Since di (s, u) ∈ {1, . . . , n − 1, ∞} the last claim implies that each edge can be bottleneck edge
at most n times.
There are many more Max Flow algorithms. The so-called Push-Relabel algorithm needs
running time O(n3 ). Orlin’s algorithm from 2012 solves MaxFlow even in time O(n · m) (where
again n = |V | and m = |E|).
Corollary 22. In a network (G, u, s, t) one can find an optimum s-t MinCut in time O(m²n).

Proof. Run the Edmonds-Karp algorithm to compute a maximum flow f. As in the MaxFlow=MinCut
Theorem, we know that the set S of nodes that are reachable from s in the residual graph Gf will
be a MinCut!
Corollary 23. In a network (G, u, s, t) with integral capacities (i.e. u(e) ∈ Z≥0 ∀e ∈ E) there is
always a maximum flow f that is integral.
Proof. If the capacities u(e) are integral, then the increment γ that is used in either Ford-Fulkerson
or in Edmonds-Karp is integral as well. Then also the resulting flow is integral.
4.4 Application to bipartite matching

Recall that a matching M ⊆ E in an undirected graph G = (V, E) is a set of edges so that each node is incident to at most one of those edges. A maximum matching is a matching that maximizes the number |M| of edges. Here we want to show that Max Flow algorithms can also be used to compute maximum matchings.
Let G = (V1 ∪ V2 , E) be a bipartite graph, like the one below.
[Figure: a bipartite graph with parts V1 and V2]
Now, make all the edges directed from V1 to V2 and add new nodes s, t. We set the capacities u as

u(i, j) := 1  if i = s or j = t,   and   u(i, j) := ∞  if (i, j) ∈ V1 × V2.
[Figure: the network G′: the source s is joined to every node of V1 by a capacity-1 edge, the original edges are directed from V1 to V2 with capacity ∞, and every node of V2 is joined to the sink t by a capacity-1 edge]
Lemma 24. Let G = (V1 ∪ V2, E) be a bipartite graph and (G′, u, s, t) be the network constructed as above. Then the size of a maximum matching in G equals the value of a maximum s-t flow in (G′, u, s, t).
Proof. Suppose that M is a matching. Then for any edge (u, v) ∈ M we can put 1 unit of flow
on the path s → u → v → t. Hence there is also a flow of value |M |. Now suppose that the
maximum flow value is k. The Ford-Fulkerson algorithm shows that there is always an integral
flow f achieving this value, i.e. f (e) ∈ Z. In our example, we would have a bijection between the
bold matching and the depicted flow (edges labelled with f (e)/u(e)):
[Figure: the network G′ with a maximum flow; edges labelled with f(e)/u(e). The edges of the corresponding matching carry one unit of flow]
Then
M = {(u, v) ∈ E | f (u, v) = 1}
is a matching of value k as well.
Lemma 25. Maximum matching in bipartite graphs can be solved in time O(n · m) (|V | = n,
|E| = m).
Proof. Remember that in a network with integral capacities and maximum flow value of k, the
Ford-Fulkerson algorithm has a running time of at most O(m · k). In this case, we have k ≤ n and
the claim follows.
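For concreteness, the construction of the network (G′, u, s, t) can be sketched as follows (the node numbering and the use of a large finite value in place of ∞ are our own choices); by Lemma 24, the maximum flow value of this network equals the size of a maximum matching:

```python
def matching_network(n1, n2, edges):
    """Bipartite graph with parts {0,...,n1-1} and {0,...,n2-1} and edge list
    `edges` of pairs (i, j). Returns (cap, s, t) for a max-flow routine;
    node numbering: s = 0, V1 = 1..n1, V2 = n1+1..n1+n2, t = n1+n2+1."""
    N = n1 + n2 + 2
    s, t = 0, N - 1
    INF = n1 + n2 + 1                  # large enough to play the role of infinity
    cap = [[0] * N for _ in range(N)]
    for i in range(n1):
        cap[s][1 + i] = 1              # source -> V1, capacity 1
    for j in range(n2):
        cap[1 + n1 + j][t] = 1         # V2 -> sink, capacity 1
    for i, j in edges:
        cap[1 + i][1 + n1 + j] = INF   # original edges, directed V1 -> V2
    return cap, s, t
```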
4.4.1 Kőnig's Theorem

Recall that a vertex cover in a graph G = (V, E) is a set U ⊆ V of nodes such that every edge has at least one endpoint in U.

Theorem 26 (Kőnig's Theorem). In a bipartite graph G = (V1 ∪ V2, E), the size of a maximum matching equals the size of a minimum vertex cover.

Proof. We will see in the homework that in every graph (not just in bipartite ones), one has |M| ≤ |U| for each matching M and each vertex cover U. Now suppose that the size of the maximum matching is k. We need to argue that there is a vertex cover U with |U| = k nodes. Since the maximum matching in G has size k, also the maximum flow in the network G′ = (V ∪ {s, t}, E′) as constructed above has value k. By the MaxFlow=MinCut Theorem, we know that this implies that there is an s-t cut S (i.e. s ∈ S, t ∉ S) of capacity k. It remains to show:
Claim: U := (V1\S) ∪ (V2 ∩ S) is a vertex cover for G of size k.
Proof. Observe that the nodes in the vertex cover correspond exactly to the cut edges incident to s and t: V1\S = {u ∈ V1 | (s, u) ∈ δ+(S)} and V2 ∩ S = {v ∈ V2 | (v, t) ∈ δ+(S)}. Since exactly k edges of the form (s, u) or (v, t) are cut, we have |U| = k. It remains to show that
U is a vertex cover. Consider any edge (u, v) ∈ E. If (u, v) is not covered by U, then this means that u ∈ S and v ∉ S. But then (u, v) ∈ δ+(S), which is impossible because it means that the cut cuts an edge of infinite capacity.
In our little example, we draw a maximum flow (of value 2) together with the min cut S (also of capacity 2); the two filled nodes form the vertex cover.

[Figure: the network with a maximum flow of value 2, the min cut S, and the two vertex cover nodes filled]
For a general graph G, let M be the largest matching and U be the smallest vertex cover. Then it is possible that |M| < |U|. On the other hand, one can show that |M| ≤ |U| ≤ 2|M|, that means both quantities are within a factor of 2 of each other. Interestingly, finding the maximum matching in a general graph is solvable in polynomial time (with a non-trivial algorithm), while vertex cover in general graphs is NP-hard and hence (unless P = NP) there is no polynomial time algorithm.
4.4.2 Hall's Theorem

For a set U ⊆ V1, let N(U) := {v ∈ V2 | ∃u ∈ U : {u, v} ∈ E} denote the neighborhood of U.

[Figure: a set U ⊆ V1 and its neighborhood N(U) ⊆ V2]
Theorem 27 (Hall’s Theorem). A bipartite graph G = (V1 ∪ V2 , E) with |V1 | = |V2 | = n has a
perfect matching if and only if |N (U )| ≥ |U | for all U ⊆ V1 .
Proof. If there is a set U ⊆ V1 with |N (U )| < |U |, then obviously there cannot be a perfect matching,
simply because the nodes in U cannot all be matched. For the other direction, assume that the
maximum size matching has size at most n − 1. Then by Kőnig’s Theorem, there is a vertex cover
S ⊆ V1 ∪ V2 of size |S| ≤ n − 1. We make the choice of U := V1 \S.
[Figure: the vertex cover S split into V1 ∩ S and V2 ∩ S; the set U = V1\S has its neighborhood N(U) contained in V2 ∩ S]
42
In particular, S being a vertex cover means that U is only incident to nodes in V2 ∩ S, thus
N (U ) ⊆ V2 ∩ S. But since |V1 ∩ S| + |U | = n and |V1 ∩ S| + |V2 ∩ S| ≤ n − 1, we must have
|N (U )| ≤ |V2 ∩ S| < |U |.
43
44
Chapter 5
Linear programming
One of the main tools in combinatorial optimization is linear programming. We want to quickly
review the key concepts and results. Since most statements and proofs are known from course 407,
from time to time we will be satisfied with informal proof sketches.
Recall that a vector c ∈ Rn can be used to define a linear function f : Rn → R by setting
f (x) := cT x = c1 x1 + . . . + cn xn . In this text, we always consider a vector c ∈ Rn as a column
vector and so cT is a row vector.
A polyhedron P ⊆ Rn is the set of all points x ∈ Rn that satisfy a finite set of linear inequalities.
Mathematically,
P = {x ∈ Rn : Ax ≤ b}
for some matrix A ∈ Rm×n and a vector b ∈ Rm . A polyhedron can be presented in many different
ways such as P = {x ∈ Rn : Ax = b, x ≥ 0} or P = {x ∈ Rn : Ax ≥ b}. All these formulations
are equivalent. A polyhedron is called a polytope if it is bounded, i.e., can be enclosed in a ball of
finite radius.
(0, 2)
P
2x1 + 3x2 ≤ 6 vertex
x1 ≥ 0
P P
Definition 7. A set Q ⊆ Rn is convex if for any two points x and y in Q, the line segment joining
them is also in Q. Mathematically, for every pair of points x, y ∈ Q, the convex combination
λx + (1 − λ)y ∈ Q for every λ such that 0 ≤ λ ≤ 1.
x y x y
Q Q
convex not convex
45
n
Pt8. A convex combination of a finite set of pointsPvt1 , . . . , vt in R , is any vector of
Definition
the form i=1 λi vi such that 0 ≤ λi ≤ 1 for all i = 1, . . . , t and i=1 λi = 1. The set of all convex
combinations of v1 , . . . , vt is called the convex hull of v1 , . . . , vt . We denote it by
t
nX o
conv{v1 , . . . , vt } = λi vi | λ1 + . . . + λt = 1; λi ≥ 0 ∀i = 1, . . . , t
i=1
v2 conv{v1 , . . . , v5 }
v1
v5
v3
v4
Theorem 28. Every polytope P is the convex hull of a finite number of points (and vice versa).
PnA linear program is the problem of maximizing or minimizing a linear function of the form
i=1 ci xi over all x = (x1 , . . . , xn ) in a polyhedron P . Mathematically, it is the problem
n
nX o
max ci xi | Ax ≤ b
i=1
for some matrix A and vector b (alternatively min instead of max). It turns out that extreme points
are quite useful in linear optimization:
P
I x∗
Proof. Note that the optimum solution might not be unique; the claim is just that there is an
optimum point that is an extreme point where n linearly independent constraints are tight. Let
46
x∗ be an optimum solution that maximizes the number of indices i ∈ [m] with ATi x∗ = bi . Let
J := {i ∈ [m] | ATi x∗ = bi } be the set of those indices. Let AJ be the submatrix of A consisting of
rows indexed by i ∈ J.
Claim. One has rank(AJ ) = n.
Proof of claim. If not, then rank(AJ ) < n and there is a direction y ∈ ker(AJ ) \ {0} (meaning
AJ y = 0). Define
δ + := max{δ | ATi (x∗ + δy) ≤ bi ∀i ∈ [m]} and δ − := min{δ | ATi (x∗ + δy) ≤ bi ∀i ∈ [m]}
x∗ + δ − y
x∗
J
y x∗ + δ + y
P
In fact with slightly more care one can prove this lemma for any (potentially unbounded) poly-
hedron P = {x ∈ Rn | Ax ≤ b} as long as rank(A) = n. Note that while linear programs are
continuous optimization problems, the previous Lemma 29 tells us that optimum solutions are
always taken from a finite set of extreme points.
max{x1 + x2 | x1 + 2x2 ≤ 6, x1 ≤ 2, x1 − x2 ≤ 1}
(which is of the form max{cT x | Ax ≤ b}). First, let us make the following observation: if we add
up non-negative multiples of feasible inequalities, then we obtain again an inequality that is valid
for each solution x of the LP. For example we can add up the inequalities in the following way:
47
13
x1 + x2 ≤ 3
c
2 x1 + 2x2 ≤ 6
3 ·( x1 +2x2 ≤ 6)
0·( x1 ≤ 2) x∗
1 P
3 ·( x1 −x2 ≤ 1)
13
x1 +x2 ≤ 3 ≈ 4.33 x1 − x2 ≤ 1
x1 ≤ 2
Theorem 30 (Weak duality Theorem). One has (P ) ≤ (D). More precisely if we have (x, y) with
Ax ≤ b, y T A = cT and y ≥ 0, then cT x ≤ bT y.
In the remainder of this subsection we will show that always (P ) = (D), that means we can
always combine a feasible inequality that gives a tight upper bound. But first, some tools:
Theorem 31 (Hyperplane separation Theorem). Let P, Q ⊆ Rn be convex, closed and disjoint sets
with at least one of them bounded. Then there exists a hyperplane cT x = β with
cT x < β < cT y ∀x ∈ P ∀y ∈ Q
Proof. Let (x∗ , y ∗ ) ∈ P × Q be the pair minimizing the distance kx∗ − y ∗ k2 (this must exist for
the following reason: suppose that P is bounded; then P is compact. Then the distance function
d(x) := min{ky − xk2 | y ∈ Q} is well-defined and continuous, hence a minimum is attained on P ).
Then the hyperplane through 21 (x∗ + y ∗ ) with normal vector c = y ∗ − x∗ separates P and Q.
y∗ Q
x∗
P
cT x = β
48
To see this, suppose for the sake of contradiction that Q contains a point y with cT y < cT y ∗ .
Consider the interpolated point y(λ) := (1 − λ)y ∗ + λy = x∗ + c + λ(y − y ∗ ). By convexity of Q we
know that y(λ) ∈ Q for all 0 ≤ λ ≤ 1. We claim that for some small enough λ > 0, the point y(λ)
is closer to x∗ than y ∗ which would give a contradiction. And indeed
!
ky(λ) − x∗ k22 = kc + λ(y − y ∗ )k22 = kck22 + 2λ cT (y − y ∗ ) +λ2 ky − y ∗ k22 < kck22 = ky ∗ − x∗ k22
| {z }
<0
if we pick λ > 0 small enough since the coefficient in front of the linear λ term is negative. That is
a contradiction.
y
Q
y(λ)
y∗
P x∗
c
This theorem is the finite dimensional version of the Hahn-Banach separation theorem from
functional analysis.
The separation theorem quickly provides us the very useful Farkas Lemma which is like a
duality theorem without objective function. The lemma tells us that out of two particular linear
systems, precisely one is going to be feasible.
Lemma 32 (Farkas Lemma). One has ∃x ≥ 0 : Ax = b ∨˙ ∃y : y T A ≥ 0 and y T b < 0 .
Proof. First let us check that it is impossible that such x, y exist at the same time since
x = y T |{z}
0 ≤ y T A |{z} b <0
|{z}
≥0 ≥0 =Ax
For the other direction, assume that there is no x ≥ 0 with P Ax = b. We need to show that there
is a proper y. Consider the cone K := {Ax | x ≥ 0} = { ni=1 ai xi | x1 , . . . , xn ≥ 0} of valid right
hand sides, where a1 , . . . , an are the columns of A. By assumption we know that b ∈ / K.
K
a2 a3
b
a1
y
0
49
But K is a closed convex set, hence there is a hyperplane y T c = γ that separates K and {b}, i.e.
∀a ∈ K : y T a > γ > y T b
As 0 ∈ K we must have γ < 0. Moreover all non-negative multiples of columns are in K, that
means ai xi ∈ K for all xi ≥ 0, thus y T (xi )ai > γ for each i ∈ [n], which implies that even y T ai ≥ 0
for each i. This can be be written more compactly as y T A ≥ 0.
max{cT x : Ax ≤ b} = min{bT y : y T A = cT , y ≥ 0}
Proof sketch. Suppose that (P ) is feasible. Let x∗ be an optimum solution1 . Let a1 , . . . , am be rows
of A and let I := {i | aTi x∗ = bi } be the tight inequalities.
aTi1 x ≤ bi1
a i1
b C
a i2
x∗ b
λ c
(P )
aTi2 x ≤ bi2
P
Suppose for the sake of contradiction that c ∈/ { i∈I ai yi | yi ≥ 0 ∀i ∈ I} =: C. Then there is a
direction λ ∈ Rn with cT λ > 0, aTi λ ≤ 0 ∀i ∈ I. Walking in direction λ improves the objective
function while we do not walk into the direction of constraints that are already tight. In other
words, there is a small enough ε > 0 so that x∗ + ελ is feasible for (P ) and has a better objective
function value — but x∗ was optimal. That is a contradiction!
Hence we know that indeed c ∈ C, hence there is a y ≥ 0 with y T A = cT and yi = 0 ∀i ∈ / I
(that means we only use tight inequalities in y).
aTi1 x ≤ bi1
a i1
b
c C
a i2
x∗ b
(P )
aTi2 x ≤ bi2
1
That’s why this is only a proof sketch. For a polytope it is easy to argue that P is compact and hence there
must be an optimum solution. If P is unbounded, but the objective function value is bounded, then one needs more
technical arguments that we skip here.
50
Now we can calculate, that the duality gap is 0:
m
X
y T b − cT x∗ = y T b − y T A x∗ = y T (b − Ax∗ ) = yi · (bi − aTi x∗ ) = 0
|{z} |{z} | {z }
i=1
=cT =0 if i∈I
/ =0 if i∈I
Note that moreover, if (P ) is unbounded, then (D) is infeasible. If (D) is unbounded then (P )
is infeasible. On the other hand, it is possible that (P ) and (D) are both infeasible.
(ATi x∗ = bi ∨ yi∗ = 0) ∀i
Proof. We did already prove this in the last line of the duality theorem!
In this section, we want to briefly discuss the different methods that are known to solve linear
programs.
The oldest and most popular one is the simplex method. Suppose the linear program is of the
form
max{cT x | Ax ≤ b}.
We may assume that the underlying polyhedron P = {x ∈ Rn | Ax ≤ b} has vertices2 . For a set
of row indices I ⊆ [m], let AI be the submatrix of A that contains all the rows with index in I. A
compact way of stating the simplex algorithm is as follows:
2
This is equivalent to saying that the ker(A) = {0}. A simple way to obtain this property is to substitute a
variable xi ∈ R by xi = x+ − + −
i − xi and adding the constraints xi , xi ≥ 0 to the constraint system. Now the kernel of
the constraint matrix is empty.
51
Simplex algorithm
Input: A ∈ Rm×n , b ∈ Rm , c ∈ Rn and a starting
basis I ∈ [m]
n
Output: opt. solution x attaining max{cT x | Ax ≤ b}
(1) x = A−1 j′
I bI
I′ x′ = A−1
I ′ bI
′
(2) IF y := cA−1
I ≥ 0 THEN RETURN x is optimal c
P
(3) select j ∈ I and ∈ j′
/ I so that for := I′ I\{j} ∪ {j ′ } I
the following 3 conditions are satisfied x = A−1
I bI
(i) rank(AI ′ ) = n j
The maintained set I of indices is usually called a basis and the maintained solution x is always a
vertex of P . In other words, the simplex method moves from one vertex to the next one so that the
objective function always improves (or at least does not get worse). If the condition y = cA−1 I ≥0
in step (3) is satisfied, then we have a dual solution y ≥ 0 with c = yAI that uses only constraints
that are tight for x; from our section on duality we know that then x must be an optimal solution.
One might object that the algorithm needs a basis I so that x = A−1 I x is a feasible solution.
But one can also find a feasible solution using a linear program. For example, the linear program
min{λ | Ax ≤ b + λ1} has a feasible solution (0, λ) for λ large enough and an optimum solution will
be feasible for Ax ≤ b (given that there is any feasible solution).
Typically the simplex method is stated in a tableau form from which can easily determine for
which choice of j one can pick a j ′ so that the conditions (i),(ii),(iii) are satisfied. Also, the tableau
form uses that the inverted matrix A−1 2
I ′ can be computed with only O(n ) operations since AI is
−1
Unfortunately the simplex algorithm is still not a polynomial time algorithm because the number
of iterations may be exponentially large for some instances.
52
Advanced remark:
The first known pathological instance for the simplex method was found by Klee and Minty
in 1973 (as a side remark, Victor Klee was on the UW faculty for many years). If one solves
the linear program
n
nX Xi−1 o
max 10n−j xj | 2 10i−j xj + xi ≤ 100i−1 ∀i ∈ [n]; xj ≥ 0 ∀j ∈ [n]
j=1 j=1
with the largest coefficient rule starting at x = (0, . . . , 0), then the simplex will take
2n − 1 iterations. The underlying polytope P of this LP is a skewed hypercube with
2n vertices and the simplex algorithm would go through all of them. We will not give
the proof here, but want to give a visualization for n = 3. In that case, the LP and the
polytope looks as follows:
b
b
(0, 0, 10000) b b
x1 , x2 , x3 ≥ 0 x3
x2 b b
(0, 0, 0) (1, 0, 0)
x1
The bold arrows denote the path that is taken by the simplex algorithm if the largest
coefficient pivoting rule is used. Note the the axes are skewed.
Despite the above result, the simplex method works extremely well in practice. Thus the measure
of complexity we are studying has its limitations and may not give a good feel for how an algorithm
behaves on the average (on most problems). It is a worst case model that computes the complexity
of an algorithm by considering its performance on all instances of a fixed size. A few pathological
examples can skew the running time function considerably. Yet, we still get a pretty good indication
of the efficiency of an algorithm using this idea.
It is a major open question in discrete optimization whether there is some pivot rule for which
the simplex method runs in time polynomial in the input size.
On the other hand, linear programs can be solved in polynomial time using algorithms such as
interior point methods and the ellipsoid method. Both these algorithms were discovered around
1980 and use non-linear mathematics. They are quite complicated to explain compared to the
53
simplex method. We can compare their characterisics as follows3
practical worst case works for
performance running time
Simplex very fast exponential LPs
Interiour Point fast O(n3.5 L) arithmetic operations LPs+SDPs
Ellipsoid method slow O(n6 L) arithmetic operations any convex set
Here L is the number of bits necessary to represent the input. For example we can write the
number 5 as bit sequence 101 (since 1 · 22 + 0 · 21 + 1 · 20 = 6). Note that in general it takes
1 + ⌈log2 (|a|) + 1⌉ ≈ log2 |a| bits to encode an integer number a (we use the extra 1 to encode the
sign of a and ⌈log2 (|a|) + 1⌉ to encode |a|).
Both algorithms, the Interior point method and the Ellipsoid method need more time if the
numbers in the input are larger, which is a contrast to the other algorithms that we encountered
so far. It is a major open problem whether there is an algorithm for solving linear problems, say of
the form max{cT x | Ax ≤ b}, so that that the running time is polynomial in the number of rows
and columns of A.
P
x̄
Hence, we will use a “softer” version of the max-function. Let us rescale the rows of A so that
kAi k2 = 1. Let λ > 0 be a parameter that we will determine later and define the potential function
m
X
Φ(x) := exp(λ · (ATi x − bi ))
| {z }
i=1
=:wi (x)
3
To be precise, the running times are for LPs with O(n) constraints. In particular the running time for the Interior
Point Method listed here is outdated. More recent work has brought the running time down to essentially the time
that it takes to solve a system of linear equations.
54
where wi (x) > 0 will be the weight of the ith constraint. Intuitively, the weight will be growing
exponentially in the violation. Now the function Φ has a smoothly changing derivative everywhere.
We can show that the minimizer of Φ is at least almost feasible for P .
Proof. If x ∈ P , then
m
X
Φ(x) = exp(λ · (ATi x − bi )) ≤ m
| {z }
i=1
≤0
| {z }
≤1
and a) follows. Now suppose that x ∈ / Pε . That means there is an index i∗ with ATi∗ x > bi∗ + ε.
Then Φ(x) ≥ exp(λ · (ATi∗ x − bi∗ )) > exp(λ · ε) ≥ exp(ln(2m)) = 2m.
It remains to show how one can improve the value of Φ(x). By taking the derivative, we see
that the gradient of the function is
Xm m
X
∇Φ(x) = λ Ai · exp(λ · (ATi x − bi )) = λ wi (x) · Ai
i=1 i=1
d
For this sake, recall the one dimensional derivative exp(αx) = α · exp(αx). We will use the
dx
following gradient descent approach, where we determine the step size α > 0 later.
(1) Set x := 0
(2) WHILE x ∈
/ Pε DO
(3) Update x′ := x − α · ∇Φ(x).
First we will show that the potential function can decrease — if the length of the gradient is
sufficiently long:
Lemma 36 (One Gradient Dscent Step). One has Φ(x − α · ∇Φ(x)) ≤ Φ(x) · (1 + λ2 α2 k∇Φ(x)k22 ) −
α · k∇Φ(x)k22 .
55
In (∗) we use the inequality exp(z) ≤ 1 + z + z 2 for all −1 ≤ z ≤ 1. In (∗∗) we use that
| hAi , yi | ≤ kAi k2 · kyk2 ≤ αk∇Φ(x)k2 .
1
Lemma 37. Suppose that x ∈
/ Pε . Then k∇Φ(x)k2 ≥ √
2 n
· Φ(x).
Proof. The function Φ is convex. One property of convex functions is that the function lies above
the linear approximation, meaning that for every x, y ∈ Rn one has Φ(y) ≥ Φ(x) + h∇Φ(x), y − xi.
Recall that Φ(x∗ ) ≤ m for x∗ ∈ P and Φ(x) ≥ 2m for x ∈ / Pε . Then
1 convexity √
Φ(x) ≤ |Φ(x) − Φ(x∗ )| ≤ | h∇Φ(x), x − x∗ i | ≤ kx − x∗ k2 · k∇Φ(x)k2 ≤ n · k∇Φ(x)k2 .
2
Rearranging gives the claim.
1 1
Lemma 38. For α := 2λ2 ·Φ(x)
one has Φ(x′ ) ≤ Φ(x) · (1 − 16nλ2
).
The analysis here is not optimized for running time. First of all, one could replace the system
Ax ≤ b by the n + 1 dimensional system {Ax ≤ x0 b, x0 ≤ 1, x0 ≥ 1}. Then the starting point
x̃ = (x, x0 ) = 0 has a starting potential of only Φ(x) ≤ m + 2. Secondly, the lower bound on
k∇Φ(x)k2 in Lemma 37 is suboptimal and could be improved to Θ(λ) · Φ(x).
Recall that a perfect matching M ⊆ E is a set of edges that has degree exactly 1 for each node in
the graph. For example in the instance below we mark the only perfect matching in the graph:
56
Obviously not all graphs have perfect matchings. We try a bit naively to set up a linear program that
is supposed to solve our matching problem. It appears to be natural be have a variable xe ∈ [0, 1]
that tells us whether or not the edge e should be contained in our matching. A perfect matching is
defined by requiring that the degree of a node is 1.
X
max c e xe
e∈E
X
xe = 1 ∀v ∈ V
e∈δ(v)
xe ≥ 0 ∀e ∈ E
is a feasible LP solution. We want to try out our LP with a small example instance. Already finding
a perfect matching itself is not a trivial problem, hence let us consider the following instance with
ce = 0 for each edge:
57
there is no perfect matching, yet the linear program has the extreme point solution
1/2 1/2
1/2 1/2
1/2 1/2
So, the whole approach does not look very promising for matching. But it turns out that for graphs
G = (V, E) that are bipartite, this approach does work. Note that we call a graph G = (V, E)
˙ 2 so that all edges are running between
bipartite, if the node set V can be partitioned into V = V1 ∪V
V1 and V2 .
V1 V2
bipartite graph
We want to remark that bipartite graphs already cover a large fraction of applications for
matching.
Lemma 40. Let G = (V, E) be a bipartite graph and x be an optimum extreme point solution to
the matching LP. Then x is integral and M = {e ∈ E | xe = 1} is an optimum perfect matching.
Proof. Suppose for the sake of contradiction that x is an extreme point, but it is not completely
integral. We will lead this to a contradiction by finding two other feasible LP solutions y and z with
x = 12 (y + z).
Let Ef = {e ∈ E | 0 < xe < 1} be the fractional edges and let Vf = {v ∈ V | |δ(v) ∩ Ef | > 0} be
the nodes that are incident to fractional edges. Observe that if a node v is incident to one fractional
edge {u, v}, then it must be incident to at least another fractional edge (since 1 − xuv ∈ ]0, 1[).
In other words, in the subgraph (Vf , Ef ), each node has degree at least 2. Each such graph must
contain at least one circuit C ⊆ Ef (simply start an arbitrary walk through Ef until you hit a node
that you have previously visited; this will close a circuit). Since the graph is bipartite, any cycle must
have even length. Denote the edges of the circuit with e1 , . . . , e2k and let ε := min{xe | e ∈ C} > 0.
Now we define y ∈ RE and z ∈ RE by
58
xei + ε e = ei and i odd
xei − ε e = ei and i odd
ze = xei − ε e = ei and i even ye = xei + ε e = ei and i even
xe if e ∈
/C xe if e ∈
/C
x e1 x e1 + ε x e1 − ε
x e2 x e2 − ε x e2 + ε
x e4 x e4 − ε x e4 + ε
x e3 x e3 + ε x e3 − ε
1 1 1
2x1 + 6x2 ≤ 21
2 b b b
6x1 − x2 ≤ 12
1 P
x1 + x2 ≥ 1 b b b
2
x1 , x2 ≥ 0 x1
0 b b
x1 , x2 ∈ Z 0 1 2
59
and the optimum solution is x∗∗ = (1, 3) with objective function value cT x∗∗ = 15. In general,
for any polyhedron P ⊆ Rn , the linear program max{cT x | x ∈ P } is associated with the integer
program max{cT x | x ∈ P ∩ Zn }. The LP is also called an LP relaxation of its integer program.
Obviously, the value of the linear program and the integer program can be very different. It is also
not difficult to construct an example where the linear program is feasible, while the integer program
is infeasible. Note that always
max{cT x | x ∈ P } ≥ max{cT x | x ∈ P ∩ Zn }
(given that both systems have a finite value), simply because the left hand maximum is over a
larger set. Also, we want to remark that the same integer program could have several different LP
relaxations.
PI = conv(P ∩ Zn )
Phrased differently, the integer hull PI is the smallest polyhedron that contains all the integral
points in P . That means PI is the “perfect” relaxation for the integer program. The integer hull is
particularly useful since optimizing any function over it gives the same value as the integer program.
x2
x∗∗ c = (3, 4)
3 b b
P
2 b b b
PPI
1 b b b
x1
0 b b
0 1 2
Lemma 41. Let P ⊆ Rn be a polytope. Then the following conditions are equivalent
(a) P = PI
Proof. For (a) ⇒ (b), just observe that a linear function over conv(P ∩ Zn ) is maximized at one
of the “spanning” points, i.e. for x ∈ P ∩ Zn . For the 2nd direction we show that ¬(a) ⇒ ¬(b).
In words, we assume that P 6= PI . Let x ∈ P \PI . Recall that PI is a convex set, hence there is
a hyperplane separating x from PI hand that hyperplane has cT x > cT y ∀y ∈ PI which gives the
claim.
P
b b
x
PI
b b b c
60
The claim is actually true for unbounded polyhedra.
What we ideally would like to solve are integer programs. The reason is that those are powerful
enough to model every discrete optimization problem that we will encounter in this lecture. For
example, the max weight perfect matching problem can be stated as
nX X o
max c e xe | xe = 1 ∀v ∈ V ; xe ≥ 0 ∀e ∈ E xe ∈ Z ∀e ∈ E
e∈E e∈δ(v)
(in arbitrary graphs). The issue is that integer programming is NP-hard, which means that assum-
ing NP 6= P, there is no polynomial time algorithm for such problems in general. Instead we want
to solve the underlying LP relaxation; for many problems we will be able to argue that P = PI and
hence a computed extreme point is always integral.
61
62
Chapter 6
Total unimodularity
max{cT x | Ax ≤ b; x ∈ Zn }
we can solve in polynomial time simply by computing an extreme point solution to the corresponding
linear relaxation
max{cT x | Ax ≤ b}.
In other words, we wonder when all the extreme points of P = {x ∈ Rn | Ax ≤ b} are integral like
in this figure:
b b b b
b b b b
b b
P b b
b b b b
Remember the following fact that we know from the previous Chapter:
Lemma 42. Let P = {x ∈ Rn | Ax ≤ b} be a polyhedron with A ∈ Rm×n . For each extreme point
x of P , there are row indices I ⊆ {1, . . . , m} so that AI is an n × n matrix of full rank and x is the
unique solution to AI x = bI .
Here AI denotes the rows of A that have their index in I. We wonder: when can we guarantee
that the solution to AI x = bI is integral? Recall the following lemma from linear algebra:
Lemma 43 (Cramer’s rule). Let B = (bij )i,j∈{1,...,n} be an n × n square matrix of full rank. Then
the system Bx = b has exactly one solution x∗ where
det(B i )
x∗i =
det(B)
63
Definition 10. A matrix A is totally unimodular (TU) if every square submatrix of A has
determinant 0,1 or −1.
Note that A itself does not have to be square. We just need all the square submatrices of A to
have 0, ±1 determinant. However, if A is totally unimodular, then every entry of A has to be 0, ±1
since every entry is a 1 × 1 submatrix of A.
Lemma 44. Let P = {x ∈ Rn | Ax ≤ b}. If A is TU and b ∈ Zm , then all extreme points of P are
integral.
Proof. Let x be an extreme point of P and let AI be the submatrix so that x is the unique solution
to AI x = bI . Then by Cramers rule
det(AiI )
xi =
det(AI )
Since A is TU we know that det(AI ) ∈ {±1} (the case det(AI ) = 0 is not possible since AI has full
rank). Since A and b are integral, AiI has only integral entries, and hence det(AiI ) ∈ Z. This shows
that xi ∈ Z.
The TU property is somewhat robust under changes as we will see in the following:
Lemma 45. Let A ∈ {−1, 0, 1}m×n . Then the following operations do not change whether or not
A is TU:
b) Permuting rows/columns
c) Adding/removing a row/column that has at most one ±1 entry and zeros otherwise
d) Duplicating rows
Proof. We can study the effect of those operations on a square submatrix B = (bij )i,j∈{1,...,n} of A.
The claims follow quickly from linear algebra facts such as (I) det(B) = det(B T ); (II) multiplying
a row/column by λ changes the determinant by λ; (III) permuting rows/columns only changes the
sign of det. For (c), we can use the Laplace formula: Let Mij be the (n − 1) × (n − 1) matrix
that we obtain by removing the ith row and jth column from B. Then for all i ∈ {1, . . . , n},
n
X
det(B) = (−1)i+j · bij · det(Mij ).
j=1
We get:
64
b) All extreme points of {x ∈ Rn | Ax = b; x ≥ 0} are integral.
b) All extreme points of {x ∈ Rn | Ax ≤ b; ℓ ≤ x ≤ u} are integral.
Proof. Consider a matrix A which is of the given form and suppose for the sake of contradiction
that A is not TU. Let A′ be a minimal square submatrix so that det(A′ ) ∈ / {−1, 0, 1}. In particular
we may assume that for any proper submatrix A of A one still has det(A′′ ) ∈ {−1, 0, 1}. If A′
′′ ′
has any column that has only 0 entries, then det(A′ ) = 0. If A′ has any column that has exactly
one non-zero entry then det(A′ ) = (±1) · det(A′′ ) where A′′ is a strict submatrix of A′ which by
assumption has det(A′′ ) ∈ {−1, 0, 1}. The only remaining case is that A′ has at′ least 2 non-zero
C1
entries in every every column. Then A′ must indeed be of the form A′ = where C1′ is a
C2′
submatrix of C1 and C2′ is a submatrix of C2 . Moreover A′ must have exactly one +1 and exactly
one −1 per column. But then, adding up all rows of A′ gives the 0-vector, i.e. det(A′ ) = 0.
An example matrix for which this lemma proves total unimodularity is the following:
B C1
0 0 0 0 1 0 0
0 0 0 1 0 1 0
A = 0 −1 1 0 0 0 1
0 0 0 −1 0 0 −1
1 0 0 0 −1 −1 0
C2
65
Before we show some applications, we want to mention one more insight. Remember that going
to the dual LP essentially means we transpose the constraint matrix. Since transposing a matrix
does not destroy the TU property we immediately get:
Lemma 48. Suppose A is TU and both b and c are integral. Then there are optimum integral
solutions x∗ , y ∗ to
with cT x∗ = bT y ∗ .
Theorem 49 (Seymour ’80, Truemper ’82). For a matrix A ∈ {0, ±1}m×n one can test in polyno-
mial time whether A is TU.
We already know from Section 5.3 that for bipartite graphs all extreme points of PM are integral,
while this is not necessarily the case for non-bipartite graphs. Now we want to understand why!
We can now reprove Lemma 40:
Lemma 50. If G is bipartite, then all extreme points x of PM satisfy x ∈ {0, 1}E .
We claim that A is TU. Since G is bipartite, the matrix has rows that belong to nodes in V1 and
rows that belong to V2 . A column belonging to an edge (u, v) has one 1 in the V1 -part and the
other 1 in the V2 -part. Now flip the signs in the V2 -part, then Lemma 47 applies and shows that A
is TU.
66
(a, v)
V1 V2 (a, u) edges
a u a 10 10 10 0 0 0 0 0 0
b 0 0 0 10 01 01 0 0 0 V1
nodes
b v c 0 0 0 0 0 0 10 10 10
u 10 0 0 10 0 0 10 0 0
c w v 0 10 0 0 01 0 0 10 0 V2
w 0 0 10 0 0 01 0 0 10
graph G node edge incidence matrix A
We already showed one direction of the following consequence — we will not show the other
direction.
Lemma 51. The incidence matrix of an undirected graph G is TU if and only if G is bipartite.
with
−k
v=s
b(v) = k v=t
0 otherwise.
We want to have a closer look, how the constraint matrix of this polytope looks like. And in fact,
if we write
o
Pk−flow = {f ∈ RE | Af = b; 0 ≤ f ≤ u
1
e ∈ δ − (v)
Av,e = −1 e ∈ δ + (v)
0 otherwise.
This matrix is also called the node edge incidence matrix of the directed graph G.
67
edges
(s, a)
(a, b)
(s, c)
(s, b)
(a, t)
(b, c)
(c, t)
(b, t)
a
s -1
0 -1
0 -1
0 0 0 0 0 0
nodes
a 10 0 0 -1
0 0 -10 0 0
s b t
b 0 01 0 01 -1
0 0 -1
0 0
c 0 0 01 0 01 0 0 -1
0
c
t 0 0 0 0 0 10 01 01
graph G
node edge incidence matrix A
Lemma 52. The node edge incidence matrix A ∈ {−1, 0, 1}V ×E of a directed graph is TU.
Proof. Again, look at the smallest square submatrix B of A that is a counterexample. Then each
column must contain 2 non-zero entries, otherwise, we could obtain a smaller counterexample. But
then each column must contain exactly one 1 and one −1 entry. Then summing up all rows of B
gives (0, . . . , 0). That means the rows are not linearly independent and hence det(B) = 0.
Corollary 53. For any integer k, the polytope Pk−flow of value k s-t flows has only integral extreme
points.
Let us see how far we can push this method. The following is actually a very general problem.
Note that one can easily use this problem to solve minimum cost matchings in bipartite graphs or
minimum cost max flow problems.
Lemma 54. Minimum Cost Integer Circulation Problem can be solved in polynomial time.
Proof. Simply observe that the relaxation can be written as min{cT f | Af = 0; ℓ ≤ f ≤ u} where
the constraint matrix is the node-edge incidence matrix, which is TU. Then we can solve the LP
with the Interior point method to find an optimum extreme point f ∗ . This solution is going to be
integral since A is TU.
68
Interval scheduling
Input: A list of intervals I1 , . . . , In ⊆ R with profits ci for interval Ii .
Goal: Select a disjoint subset of intervals that maximize the sum of profits.
Note that there are at most 2n many points that are end points of intervals let us denote those
points with t1 < t2 < . . . < tm . Then it suffices to check that we select intervals so that no point in
t1 , . . . , tm lies in more than one interval. Hence we can formulate the interval scheduling problem
as the following integer program:
n X o
max cT x | xi ≤ 1 ∀j ∈ {1, . . . , m}; xi ∈ {0, 1} ∀i ∈ {1, . . . , n}
i:tj ∈Ii
Observe that the matrix A ∈ {0, 1}m×n has the consecutive-ones property, that means the ones
in each column are consecutive.
Lemma 55. A matrix A ∈ {0, 1}m×n with consecutive ones property is TU.
Proof. Again, we consider a minimal counterexample. Since any submatrix of A also has the
consecutive ones property, it suffices to show that det(A) ∈ {−1, 0, 1} assuming that A itself is a
square matrix. Let A1 , . . . , An be the rows of A. For all i, in decreasing order, we subtract row
i − 1 from row i, which does not change the determinant. Formally, we define A′i := Ai − Ai−1 and
A′1 = A1 , then det(A′ ) = det(A) (see the example below). Now each column in A′ contains one 1
and one −1 (or only one 1 in case that the interval contains tn ).
t1 01 0 01 0 0 0 01 0 01 0 0 0
I3
t2 01
0 01 0 0 01
0
0 0 0 0 01
I1
t3 I6 01 0 0 01 01 01 ′ 0 0 -1
0 01 01 0
I4 A= A =
t4 01 0 0 01 01 01 0 0 0 0 0 0
I5
t5 0 01 0 0 01 0 -10 01 0 -1
0 0 -1
0
I2
t6 0 01 0 0 01 0 0 0 0 0 0 0
Theorem 56. The interval scheduling problem can be solved in polynomial time.
69
70
Chapter 7
We already saw many problems for which we can devise an efficient, polynomial time algorithm.
On the other hand, we also discussed that for NP-complete problems like the Travelling Salesmen
problem, it is widely believed that no polynomial time algorithm exists. While from a theoretical
perspective we can be perfectly happy with the non-existence of an algorithm, this is not very
satisfying for practitioners.
Suppose you in fact do have an instance of an NP-complete problem and you really need a
solution! For all the problems that we discuss in this lecture and also the vast majority of problems
that appear in practical applications, it is very straightforward to formulate it as an integer linear
program
max{cT x | Ax ≤ b; x ∈ Zn } (∗)
Suppose for a moment that we have the additional restriction that x ∈ {0, 1}n . Then the naive
approach would just try out all 2n solutions and then pick the best. A quick calculation shows
that even for very moderate instance sizes of, say n = 200, we would obtain an astronomically high
running time.
We wonder: is there an algorithm that solves the IP (*) in many cases much faster than 2n ? The
answer is “yes”! The Branch & Bound algorithm is a much more intelligent search algorithm
that usually avoids searching the whole solution space. See the figure on page 72 for the description
of the algorithm and see the figure on page 73 for an example performance of the algorithm. We
want to emphasize the three points that make the algorithm smarter than trivial enumeration for
solving max{cT x | x ∈ P ; x ∈ Zn } with P = {x | Ax ≤ b}. In the following P ′ ⊆ P denotes a
subproblem.
• Insight 1: Suppose α ∈ / Z and i ∈ {1, . . . , n}. Then all integer points in P are either in
P ∩ {x | xi ≤ ⌊α⌋} or in P ∩ {x | xi ≥ ⌈α⌉}.
P ∩ {x | xi ≤ ⌊α⌋} P ∩ {x | xi ≥ ⌈α⌉}
P
xi
⌊α⌋ ⌈α⌉
71
Branch & Bound algorithm
Input: c ∈ Rn , A ∈ Rm×n , b ∈ Rm
Output: An optimum solution to max{cT x | Ax ≤ b; x ∈ Zn }
(1) Set x∗ := UNDEFINED (best solution found so far)
(2) Put problem P := {x | Ax ≤ b} on the stack
(3) WHILE stack is non-empty DO
(4) Select a polytope P ′ from the stack
(5) Solve max{cT x | x ∈ P ′ } and denote LP solution by x̃
(6) IF P ′ = ∅ THEN goto (3) (“prune by infeasibility”)
(7) IF c T x∗ ≥ cT x̃ THEN goto (3) (“prune by bound”)
(8) IF cT x∗ < cT x̃ and x̃ ∈ Zn THEN (let cT x∗ = −∞ if x∗ = UNDEFINED
update x∗ := x̃ and goto (3) (“prune by optimality”)
/ Zn , cT x∗ < cT x̃) do
(9) Otherwise (i.e. x̃ ∈ (“branch”)
(10) Select coordinate i with x̃i ∈
/ Z.
(11) Add problems P ∩ {x | xi ≤ ⌊x̃i ⌋} and P ′ ∩ {x | xi ≥ ⌈x̃i ⌉} to stack
′
c
b
x̃
b b b
P′ P
b b
x∗
P′
P
Theorem 57. Branch and bound does find the optimum for max{cT x | Ax ≤ b; x ∈ Zn }.
72
Example: Branch & Bound. Instance: max{−x1 + 2x2 | (x1 , x2 ) ∈ P }
with P := {x ∈ R2 | −x1 + 6x2 ≤ 9, x1 + x2 ≤ 4, x2 ≥ 0, −3x1 + 2x2 ≤ 3}
Iteration 1: x2 c
• x∗ = UNDEFINED 3 x̃
• Stack: {P }
• Select: P ′ := P 2
• LP opt: x̃ = (1.5, 2.5) P′
• Case: Branch on i = 2. 1
Add P ′ ∩ {x2 ≤ 2}, P ′ ∩ {x2 ≥ 3} to stack x1
−1 0 1 2 3 4
x2
Iteration 2:
3 c
• x∗ = UNDEFINED
x̃
• Stack: {P ∩ {x2 ≤ 2}, P ∩ {x2 ≥ 3}} 2
• Select: P ′ := P ∩ {x2 ≤ 2}
• LP opt: x̃ = (0.75, 2) 1 P′
• Case: Branch on i = 1. Add P ′ ∩ {x2 ≤ 2, x1 ≥ x1
1}, P ′ ∩ {x2 ≤ 2, x1 ≤ 0} to stack
−1 0 1 2 3 4
x2
Iteration 3:
3 c
• x∗ = UNDEFINED
x̃
• Stack: {P ∩ {x2 ≤ 2, x1 ≥ 1}, P ∩ {x2 ≤ 2, x1 ≤ 2
0}, P ∩ {x2 ≥ 3}}
• Select: P ′ := P ∩ {x2 ≤ 2, x1 ≥ 1} 1 P′
• LP opt: x̃ = (1, 2) x1
• Case: Prune by optimality. Update x∗ := x̃
−1 0 1 2 3 4
x2
Iteration 4: 3
c
• x∗ = (1, 2) 2
• Stack: {P ∩ {x2 ≤ 2, x1 ≤ 0}, P ∩ {x2 ≥ 3}} x̃ x∗
• Select: P ′ := P ∩ {x2 ≤ 2, x1 ≤ 0} 1
• LP opt: x̃ = (0, 1.5) P′ x1
• Case: Prune by bound (cT x̃ ≤ cT x∗ ).
−1 0 1 2 3 4
Iteration 5: 3
• x∗ = (1, 2) P′ = ∅
• Stack: {P ∩ {x2 ≥ 3}} 2
x∗
• Select: P ′ := P ∩ {x2 ≥ 3}
• LP opt: UNDEFINED 1
• Case: Prune by infeasibility (P ′ = ∅)
−1 0 1 2 3 4
Iteration 6: Stack empty. Optimum solution: x∗ = (1, 2)
73
Observe that the branch and bound process implicitly constructs a search tree where a node
corresponds to a subproblem P ′ . For our example on page 73, the Branch & Bound tree looks
as follows:
Iteration 1:
P
Iteration 2: Iteration 5:
P ∩ {x2 ≤ 2} P ∩ {x2 ≥ 3}
prune by LP infeasibility
Iteration 3: Iteration 4:
P ∩ {x2 ≤ 2, x1 ≥ 1} P ∩ {x2 ≤ 2, x1 ≤ 0}
prune by optimality prune by bound
There are two steps in the algorithm that are “generic”. More precisely, we did not specify which
problem should be selected from the stack and which coordinate should be used to branch.
• Which problem P ′ should be selected from the stack? There are two popular strategies:
– Depth First Search: Always selecting the last added problem from the stack corresponds
to a depth first seach in the branch and bound tree. The hope is that this way, one
quickly finds some feasible integral solution that then can be used to prune other parts
of the tree.
– Best bound: Select the problem that has the highest LP maximum. The hope is that
there is also a good integral solution hiding.
Note that we cannot give any theoretical justification for or against one strategy. For any strategy
one could cook up a pathological instance where the strategy misarably fails while another succeeds
quickly.
Lemma 58. Let n ≥ 4 be an even integer. No matter what strategies are used by B & B to solve
the following IP
n n o
1 X n
max x0 | x0 + xi = ; x ∈ {0, 1}n+1
2 2
i=1
74
the search tree has at least 2n/3 leafs.
Proof. Observe that the IP only has integral solutions with x0 = 0. Hence the integral optimum is
0. Since 0 ≤ xi ≤ 1, a branching step consists in fixing a variable as either 0 or 1. Hence, for each
node in the Branch & Bound tree, let I0 ⊆ {0, . . . , n} be indices of variables that are fixed to 0 and
let I1 be the indices of variables that are fixed to 1. The point is that Branch and Bound is not
intelligent enough to realize that a sum of integers will be integral. If only few variables are fixed,
then there is still a feasible fractional solution that is much better than the best integral solution,
hence we would not yet prune the search tree at this point. More formally:
Claim: Consider a node in the search tree with fixed variables (I0 , I1 ) and 0 ∈/ I0 . If |I0 |+|I1 | ≤
n
3 , then there is a feasible LP solution for this node with x̃ 0 = 1 and hence the node could not be
pruned.
Proof. Consider the fractional vector x̃ with
1 i=0
1 i ∈ I
1
x̃i =
0 i ∈ I 0
λ otherwise
We can always choose a λ ∈ [0, 1] to satisfy the equation. Hence there is always a fractional solution
of value x̃0 = 1, hence this node could not be pruned!
The claim shows that all the leafs of the Branch and Bound tree are in depth at least n3 , hence
the tree must have at least 2n/3 many leaves as it is binary1 .
1
Here we assumed that as long as |I0 | + |I1 | ≤ n3 , an optimum LP solution always has x̃0 = 1 and hence we would
not branch on i = 0. One could modify Branch and Bound so that it does branch on i = 0, but still the number of
leaves would be 2Ω(n) .
75
76
Chapter 8
Non-bipartite matching
Apart from NP-complete problems, the only problem that we have mentioned before and for which
have not seen a polynomial time algorithm yet is the matching problem on general, non-bipartite
graphs. We already discussed the problem with a linear programming approach and in fact, the
polynomial time algorithm is highly non-trivial and goes back to seminal work of Edmonds in 1965.
Recall that in an undirected graph G = (V, E), a matching is a subset of disjoint edges M ⊆ E.
We know already a combinatorial algorithm to solve the problem in bipartite graphs: add a
source and a sink and use the Ford-Fulkerson algorithm. In other words, we know that at least for
bipartite graphs, an approach with iteratively augmenting a flow/matching must work. Our idea is
to extend this to non-bipartite matching.
edge in M
M -augm. path
exposed node
matching M with augmenting path matching after augmentation
77
Theorem 59 (Augmenting Path Theorem for Matchings). A matching M in graph G = (V, E) is
maximum if and only if there is no M -augmenting path.
a a′
b b′
c c′
d d′
e f e e′
a b c d f f′
graph G with matching bipartite graph G′
and augm. path with augm. path
Observe that a path in this directed bipartite graph corresponds to an M -alternating walk.
78
Lemma 60. In time O(mn + n2 log n) we can compute the shortest M -alternating walk starting
and ending in an exposed node.
Proof. We simply run Dijkstra’s algorithm in G′ for each node as source to compute shortest path
distances.
It sounds a bit like we are done. But there is a major problem: a shortest walk in the directed
bipartite graph G′ will indeed be a path, but that graph contains node v twice. In that case, for
some v ∈ V we found a subpath B ⊆ P in the directed graph G′ that goes from v to v ′ . In
particular, that means that the length of that path is odd and it contains exactly 21 (|B| − 1) many
M -edges.
Hence, the whole walk P , in the original graph might look as follows
M -blossom
M -flower
Formally we define an M -flower as an M -alternating walk in G = (V, E) that (i) starts at an
M -exposed node u (ii) revisits exactly one node v once which is the other end point of the path and
(iii) the cycle that is closed at v has an odd number of edges. That odd cycle B ⊆ E that contains
exactly |M ∩ B| = 12 (|B| − 1) edges is called an M -blossom. Note that it is possible that u and v
are identical. We can always guarantee the following:
Lemma 61. In time O(nm + n2 log n) we can compute (i) either an M -augmenting path P or (ii)
find an M -flower or (iii) decide that there is no M -augmenting path.
v0 v1 ... vj−1 e1 vj e2
v ℓ e3
vℓ−1
79
It remains to discuss what to do if we find an M -flower.
contraction
vB
Lemma 62. Let M be a matching and B be an odd cycle with |M ∩B| = 21 (|B|−1) many matching
edges. Then M/B is a matching in G × B and any M/B augmenting path in G × B can be extended
to an M -augmenting path in G.
Proof. First, note that vB has degree one in M/B since there is only one edge of M in δ(B). Now,
let P be an M/B augmenting path in G × B and let vB be the supernode that was created by
contraction.
If P does not contain vB , the claim is clear. After expansion the path P will enter B using an
M -edge e1 in a node that we call v1 and it will leave B with a non-M edge e2 in a node v2 . From
v2 we extend P by going either clockwise or counterclockwise so that the first edge that we take
from v2 in B is an M -edge. This way we create an M -augmenting path in G.
P vB expansion
v1 B
e1 e2 e1 v2
e2
80
Proof. Let S be the stem of the M -flower. Note that S is an even-length non-revisiting M -
alternating path. Consider the matching M ′ := M ∆S. Then |M ′ | = |M | and |M ′ /B| = |M/B|.
So it suffices to assume that there is an M ′ -augmenting path P in G and show that there is an
M ′ /B-augmenting path in G × B. If P does not include any node in B, then there is nothing
to show. Otherwise consider the beginning P ′ = (u0 , u1 , . . . , uj ) of that path with u0 ∈
/ B being
M ′ -exposed where uj is the first node in B.
u1 u0
B uj
Theorem 64. The Blossom algorithm finds a maximum cardinality matching in time O(n3 m +
n4 log n).
Proof. We need to find an M -augmenting path at most n times. While we try to find an M -
augmenting path, we might contract at most n times a cycle and before contracting a cycle we need
81
to find a shortest path in the bipartite graph, which each time takes time O(nm + n2 log n). Hence
the total running time that we obtain is O(n3 m + n4 log n).
A more careful implementation (with a more careful analysis) gives a running time of O(n3 ).
82
Chapter 9
In this chapter we want to show more techniques how one can deal with NP-hard optimization
problems.
To give a motivating example, suppose a burglar breaks into a house and discovers a large set
of n precious objects. Unfortunately, all objects together would be to heavy to carry for him, so he
has to make some selection. He estimates that he can carry a weight of at most W and that the
weight of the ith item is wi and that he could receive a price of ci for it.
In a mathematical abstract form we can formulate this as follows:
Knapsack Problem
Input: n items with weights w1 , . . . , wn ≥ 0, profits c1 , . . . , cn ≥ 0 and a bud-
get/capacity W P P
Goal: Find a subset S ⊆ {1, . . . , n} with i∈S wi ≤ W that maximizes the profit i∈S ci .
where xi is a decision variable telling whether you want to select the ith item. In other words,
the Knapsack Integer program is an integer program with just a single constraint (apart from the
0 ≤ xi ≤ 1 constraints). Interestingly, that single constraint makes the problem already NP-hard,
hence there won’t be a polynomial time algorithm to solve it in general. We want to design an
algorithm that is smarter than just trying out all 2n possible combinations.
83
9.1 A dynamic programming algorithm
We want to show that
P the problem is efficiently
P solvable if the involved numbers are nice. We will
abbreviate w(S) = i∈S wi and c(S) = i∈S ci .
The technique that we want to use to obtain this result is dynamic programming that we
saw already at the beginning of the course. The property that makes dynamic programming work
is that the problem can be broken into independent subproblems. We want to use the following
table entries for j ∈ {0, . . . , n} and C ′ ∈ {0, . . . , C}
If there is no set S ⊆ {1, . . . , j} with c(S) = C ′ , then we set T (j, C ′ ) = ∞. For j = 0, it is easy to
see that T (0, 0) = 0 and T (0, C ′ ) = ∞ if C ′ 6= 0. Let us make an observation how the other table
entries can be computed:
Lemma 66. For j ∈ {1, . . . , n} and C ′ ∈ {0, . . . , C} one has T (j, C ′ ) = min{T (j − 1, C ′ − cj ) +
wj , T (j − 1, C ′ )}
Proof. Suppose that S ∗ ⊆ {1, . . . , j} is the set with c(S ∗ ) = C ′ and w(S ∗ ) minimal. Then there are
two case:
/ S ∗ : Then T (j, C ′ ) = T (j − 1, C ′ ).
(a) j ∈
We now write down an algorithm to compute the optimal solution of the knapsack problem.
We can use the dynamic program to solve the example from above; indeed we obtain that the
optimum solution has a value of 4. One can also “backtrack” that the corresponding solution will
be S ∗ = {2, 4} (this can be done by going backwards through the entries and checking whether the
min-expression for determining the table entry was attained for the case j ∈ S or j ∈
/ S).
84
Item 1 Item 2 Item 3 Item 4
w1 = 6 w2 = 7 w3 = 4 w4 = 1
c1 = 2 c2 = 3 c3 = 2 c4 = 1
profit C′ =8 ∞ ∞ ∞ ∞ 18
profit C′ =7 ∞ ∞ ∞ 17 17
profit C′ =6 ∞ ∞ ∞ ∞ 12
highest feasible profit
profit C′ =5 ∞ ∞ 13 11 11
profit C′ =4 ∞ ∞ ∞ 10 8
profit C′ =3 ∞ ∞ 7 7 5
profit C′ =2 ∞ 6 6 4 4 weights ≤ W = 9
profit C′ =1 ∞ ∞ ∞ ∞ 1
profit C′ =0 0 0 0 0 0
C(0, ∗) C(1, ∗) C(2, ∗) C(3, ∗) C(4, ∗)
Note that the size of the input to the knapsack problem has a number of bits that is at most
O(n(log C + log W )), hence the running time of O(nC) is not polynomial in the size of the input.
But the running time satisfies a weaker condition and it is a so-called pseudopolynomial running
time.
In general, suppose P is an optimization problem such that each instance consists of a list of
integers a1 , . . . , an ∈ Z. An algorithm for P is called pseudopolynomial if the running time is
polynomial in n and max{|ai | : i ∈ {1, . . . , n}}.
85
Proof. First of all, we gave the correct weights to the dynamic program, but the “wrong” profits.
But at least the computed solution S will be feasible, i.e. w(S) ≤ W . Next, note that the rounded
instance has only integral profits as input and their sum is C ′ ≤ n · nε . Hence the dynamic program
3
has a running time of O(n · C ′ ) which in our case is O( nε ). It remains to show that S is a good
solution. This will follow from two claims.
Claim 1: c(S) ≥ OP T ′ .
Proof: Since the dynamic program computes the optimum S for the profits c′ , we know that
c′ (S) = OP T ′ . Since the profits c′i are obtained by rounding down ci we know that c(S) ≥ c′ (S),
which gives the claim.
Claim 2: OP T ′ ≥ (1 − ε)OP T .
Proof. Let S ∗ be the solution with c(S ∗ ) = OP T . We claim that c′ (S ∗ ) ≥ (1 − ε)c(S ∗ ). And in
fact, the rounding can cost at most 1 per item hence
In fact, one can improve the running time by a factor of n. We can estimate the value of OP T
up to a factor of 2 and then scale the profits so that OP T ≈ nε . Then the analysis still goes through,
2
but we have C ≤ O( nε ) which improves the running time to O( nε ).
86