
LECTURE NOTES

Convexity, Duality, and


Lagrange Multipliers

Dimitri P. Bertsekas

with assistance from

Angelia Geary-Nedic and Asuman Koksal

Massachusetts Institute of Technology


Spring 2001

These notes were developed for the needs of the 6.291 class at
M.I.T. (Spring 2001). They are copyright-protected, but they
may be reproduced freely for noncommercial purposes.
Contents

1. Convex Analysis and Optimization


1.1. Linear Algebra and Analysis . . . . . . . . . . . . . . . . .
1.1.1. Vectors and Matrices . . . . . . . . . . . . . . . . . .
1.1.2. Topological Properties . . . . . . . . . . . . . . . . . .
1.1.3. Square Matrices . . . . . . . . . . . . . . . . . . . .
1.1.4. Derivatives . . . . . . . . . . . . . . . . . . . . . . .
1.2. Convex Sets and Functions . . . . . . . . . . . . . . . . . .
1.2.1. Basic Properties . . . . . . . . . . . . . . . . . . . .
1.2.2. Convex and Affine Hulls . . . . . . . . . . . . . . . . .
1.2.3. Closure, Relative Interior, and Continuity . . . . . . . . .
1.2.4. Recession Cones . . . . . . . . . . . . . . . . . . . .
1.3. Convexity and Optimization . . . . . . . . . . . . . . . . .
1.3.1. Local and Global Minima . . . . . . . . . . . . . . . .
1.3.2. The Projection Theorem . . . . . . . . . . . . . . . . .
1.3.3. Directions of Recession
and Existence of Optimal Solutions . . . . . . . . . . . .
1.3.4. Existence of Saddle Points . . . . . . . . . . . . . . . .
1.4. Hyperplanes . . . . . . . . . . . . . . . . . . . . . . . .
1.5. Conical Approximations and Constrained Optimization . . . . .
1.6. Polyhedral Convexity . . . . . . . . . . . . . . . . . . . .
1.6.1. Polyhedral Cones . . . . . . . . . . . . . . . . . . . .
1.6.2. Polyhedral Sets . . . . . . . . . . . . . . . . . . . . .
1.6.3. Extreme Points . . . . . . . . . . . . . . . . . . . . .
1.6.4. Extreme Points and Linear Programming . . . . . . . . .
1.7. Subgradients . . . . . . . . . . . . . . . . . . . . . . . .
1.7.1. Directional Derivatives . . . . . . . . . . . . . . . . . .
1.7.2. Subgradients and Subdifferentials . . . . . . . . . . . . .
1.7.3. ε-Subgradients . . . . . . . . . . . . . . . . . . . . .
1.7.4. Subgradients of Extended Real-Valued Functions . . . . . .
1.7.5. Directional Derivative of the Max Function . . . . . . . . .
1.8. Optimality Conditions . . . . . . . . . . . . . . . . . . . .
1.9. Notes and Sources . . . . . . . . . . . . . . . . . . . . .


2. Lagrange Multipliers
2.1. Introduction to Lagrange Multipliers . . . . . . . . . . . . .
2.2. Enhanced Fritz John Optimality Conditions . . . . . . . . . .
2.3. Informative Lagrange Multipliers . . . . . . . . . . . . . . .
2.4. Pseudonormality and Constraint Qualifications . . . . . . . . .
2.5. Exact Penalty Functions . . . . . . . . . . . . . . . . . . .
2.6. Using the Extended Representation . . . . . . . . . . . . . .
2.7. Extensions to the Nondifferentiable Case . . . . . . . . . . . .
2.8. Notes and Sources . . . . . . . . . . . . . . . . . . . . .

3. Lagrangian Duality
3.1. Geometric Multipliers . . . . . . . . . . . . . . . . . . . .
3.2. Duality Theory . . . . . . . . . . . . . . . . . . . . . . .
3.3. Linear and Quadratic Programming Duality . . . . . . . . . .
3.4. Strong Duality Theorems . . . . . . . . . . . . . . . . . .
3.4.1. Convex Cost – Linear Constraints . . . . . . . . . . . . .
3.4.2. Convex Cost – Convex Constraints . . . . . . . . . . . .
3.5. Notes and Sources . . . . . . . . . . . . . . . . . . . . .

4. Conjugate Duality and Applications


4.1. Conjugate Functions . . . . . . . . . . . . . . . . . . . .
4.2. The Fenchel Duality Theorem . . . . . . . . . . . . . . . .
4.3. The Primal Function and Sensitivity Analysis . . . . . . . . .
4.4. Exact Penalty Functions . . . . . . . . . . . . . . . . . . .
4.5. Notes and Sources . . . . . . . . . . . . . . . . . . . . .

5. Dual Computational Methods


5.1. Dual Subgradients . . . . . . . . . . . . . . . . . . . . .
5.2. Subgradient Methods . . . . . . . . . . . . . . . . . . . .
5.2.1. Analysis of Subgradient Methods . . . . . . . . . . . . .
5.2.2. Subgradient Methods with Randomization . . . . . . . . .
5.3. Cutting Plane Methods . . . . . . . . . . . . . . . . . . .
5.4. Ascent Methods . . . . . . . . . . . . . . . . . . . . . .
5.5. Notes and Sources . . . . . . . . . . . . . . . . . . . . .
Preface

These lecture notes were developed for the needs of a graduate course at
the Electrical Engineering and Computer Science Department at M.I.T.
They focus selectively on a number of fundamental analytical and com-
putational topics in (deterministic) optimization that span a broad range
from continuous to discrete optimization, but are connected through the
recurring theme of convexity, Lagrange multipliers, and duality. These top-
ics include Lagrange multiplier theory, Lagrangian and conjugate duality,
and nondifferentiable optimization. The notes contain substantial portions
that are adapted from my textbook “Nonlinear Programming: 2nd Edi-
tion,” Athena Scientific, 1999. However, the notes are more advanced,
more mathematical, and more research-oriented.
As part of the course I have also decided to develop in detail those
aspects of the theory of convex sets and functions that are essential for an
in-depth coverage of Lagrange multipliers and duality. I have long thought
that convexity, aside from being an eminently useful subject in engineering
and operations research, is also an excellent vehicle for assimilating some
of the basic concepts of analysis within an intuitive geometrical setting.
Unfortunately, the subject’s coverage in mathematics and engineering cur-
ricula is scant and incidental. I believe that at least part of the reason
is that while there are a number of excellent books on convexity, as well
as a true classic (Rockafellar’s 1970 book), none of them is well suited for
teaching nonmathematicians who form the largest part of the potential
audience.
I have therefore tried in these notes to make convex analysis accessible
by limiting somewhat its scope and by emphasizing its geometrical charac-
ter, while at the same time maintaining mathematical rigor. The coverage
of the theory is significantly extended in the exercises, whose detailed so-
lutions are posted on the internet. I have included as many insightful
illustrations as I could come up with, and I have tried to use geometric
visualization as a principal tool for maintaining the students’ interest in
mathematical proofs. To highlight a contrast in style, Rockafellar’s mar-
velous book contains no figures at all!


An important part of my approach has been to maintain a close link
between the theoretical treatment of convexity concepts and their appli-
cation to optimization. For example, in Chapter 1, soon after the devel-
opment of some of the basic facts about convexity, I discuss some of their
applications to optimization and saddle point theory; soon after the dis-
cussion of hyperplanes and cones, I discuss conical approximations and
necessary conditions for optimality; soon after the discussion of polyhedral
convexity, I discuss its application in linear programming; and soon after
the discussion of subgradients, I discuss their use in optimality conditions.
I follow consistently this style in the remaining chapters, although having
developed in Chapter 1 most of the needed convexity theory, the discussion
in the subsequent chapters is more heavily weighted towards optimization.
In addition to their educational purpose, these notes aim to develop
two topics that I have recently researched with two of my students, and to
integrate them into the overall landscape of convexity, duality, and opti-
mization. These topics are:
(a) A new approach to Lagrange multiplier theory, based on a set of en-
hanced Fritz-John conditions and the notion of constraint pseudonor-
mality. This work, joint with my Ph.D. student Asuman Koksal, aims
to generalize, unify, and streamline the theory of constraint qualifica-
tions. It allows for an abstract set constraint (in addition to equalities
and inequalities), it highlights the essential structure of constraints
using the new notion of pseudonormality, and it develops the connec-
tion between Lagrange multipliers and exact penalty functions.
(b) A new approach to the computational solution of (nondifferentiable)
dual problems via incremental subgradient methods. These methods,
developed jointly with my Ph.D. student Angelia Geary-Nedic, in-
clude some interesting randomized variants, which according to both
analysis and experiment, perform substantially better than the stan-
dard subgradient methods for large scale problems that typically arise
in the context of duality.
The lecture notes may be freely reproduced and distributed for non-
commercial purposes. They represent work-in-progress, and your feedback
and suggestions for improvements in content and style will be most wel-
come.

Dimitri P. Bertsekas
[email protected]
Spring 2001
1

Convex Analysis and


Optimization

Date: June 10, 2001

Contents

1.1. Linear Algebra and Analysis . . . . . . . . . . . . . p. 5


1.1.1. Vectors and Matrices . . . . . . . . . . . . . . p. 5
1.1.2. Topological Properties . . . . . . . . . . . . . p. 8
1.1.3. Square Matrices . . . . . . . . . . . . . . . . p. 16
1.1.4. Derivatives . . . . . . . . . . . . . . . . . . . p. 20
1.2. Convex Sets and Functions . . . . . . . . . . . . . . p. 24
1.2.1. Basic Properties . . . . . . . . . . . . . . . . p. 24
1.2.2. Convex and Affine Hulls . . . . . . . . . . . . . p. 35
1.2.3. Closure, Relative Interior, and Continuity . . . . . p. 38
1.2.4. Recession Cones . . . . . . . . . . . . . . . . p. 45
1.3. Convexity and Optimization . . . . . . . . . . . . . p. 58
1.3.1. Local and Global Minima . . . . . . . . . . . . p. 58
1.3.2. The Projection Theorem . . . . . . . . . . . . . p. 61
1.3.3. Directions of Recession
and Existence of Optimal Solutions . . . . . . . . p. 63
1.3.4. Existence of Saddle Points . . . . . . . . . . . . p. 71
1.4. Hyperplanes . . . . . . . . . . . . . . . . . . . . p. 81


1.5. Conical Approximations and Constrained Optimization . p. 87


1.6. Polyhedral Convexity . . . . . . . . . . . . . . . . p. 98
1.6.1. Polyhedral Cones . . . . . . . . . . . . . . . . p. 98
1.6.2. Polyhedral Sets . . . . . . . . . . . . . . . . p. 103
1.6.3. Extreme Points . . . . . . . . . . . . . . . . p. 104
1.6.4. Extreme Points and Linear Programming . . . . p. 107
1.7. Subgradients . . . . . . . . . . . . . . . . . . . p. 111
1.7.1. Directional Derivatives . . . . . . . . . . . . . p. 111
1.7.2. Subgradients and Subdifferentials . . . . . . . . p. 115
1.7.3. ε-Subgradients . . . . . . . . . . . . . . . . p. 123
1.7.4. Subgradients of Extended Real-Valued Functions . p. 128
1.7.5. Directional Derivative of the Max Function . . . p. 129
1.8. Optimality Conditions . . . . . . . . . . . . . . . p. 135
1.9. Notes and Sources . . . . . . . . . . . . . . . . p. 137

In this chapter we provide the mathematical background for this book. In


Section 1.1, we list some basic definitions, notational conventions, and re-
sults from linear algebra and analysis. We assume that the reader is familiar
with this material, so no proofs are given. In the remainder of the chapter,
we focus on convex analysis with an emphasis on optimization-related top-
ics. We assume no prior knowledge of the subject, and we provide proofs
and a fairly detailed development.
For related and additional material, we recommend the books by
Hoffman and Kunze [HoK71], Lancaster and Tismenetsky [LaT85], and
Strang [Str76] (linear algebra), the books by Ash [Ash72], Ortega and
Rheinboldt [OrR70], and Rudin [Rud76] (analysis), and the books by Rock-
afellar [Roc70], Ekeland and Temam [EkT76], Rockafellar [Roc84], Hiriart-
Urruty and Lemarechal [HiL93], Rockafellar and Wets [RoW98], Bonnans
and Shapiro [BoS00], and Borwein and Lewis [BoL00] (convex analysis).
The book by Rockafellar [Roc70], widely viewed as the classic con-
vex analysis text, contains a deeper and more extensive development of
convexity than the one given here, although it does not cross over into
nonconvex optimization. The book by Rockafellar and Wets [RoW98] is
a deep and detailed treatment of “variational analysis,” a broad spectrum
of topics that integrate classical analysis, convexity, and optimization of
both convex and nonconvex (possibly nonsmooth) functions. These two
books represent important milestones in the development of optimization
theory, and contain a wealth of material, a good deal of which is original.
However, they are written for the advanced reader, in a style that many
nonmathematicians find challenging.
As we embark on the study of convexity, it is worth listing some of
the properties of convex sets and functions that make them so special in
optimization.
(a) Convex functions have no local minima that are not global . Thus the
difficulties associated with multiple disconnected local minima, whose
global optimality is hard to verify in practice, are avoided.
(b) Convex sets are connected and have feasible directions at any point
(assuming they consist of more than one point). By this we mean that
given any point x in a convex set X, it is possible to move from x
along some directions and stay within X for at least a nontrivial inter-
val. In fact a stronger property holds: given any two distinct points
x and x̄ in X, the direction x̄ − x is a feasible direction at x, and
all feasible directions can be characterized this way. For optimization
purposes, this is important because it allows a calculus-based com-
parison of the cost of x with the cost of its close neighbors, and forms
the basis for some important algorithms. Furthermore, much of the
difficulty commonly associated with discrete constraint sets (arising
for example in combinatorial optimization), is not encountered under
convexity.

(c) Convex sets have a nonempty relative interior. In other words, when
viewed within the smallest affine set containing it, a convex set has a
nonempty interior. Thus convex sets avoid the analytical and compu-
tational optimization difficulties associated with “thin” and “curved”
constraint surfaces.
(d) A nonconvex function can be “convexified” while maintaining the opti-
mality of its global minima, by forming the convex hull of the epigraph
of the function.
(e) The existence of a global minimum of a convex function over a convex
set is conveniently characterized in terms of directions of recession
(see Section 1.3).
(f) Polyhedral convex sets (those specified by linear equality and inequal-
ity constraints) are characterized in terms of a finite set of extreme
points and extreme directions. This is the basis for finitely terminat-
ing methods for linear programming, including the celebrated simplex
method (see Section 1.6).
(g) Convex functions are continuous and have nice differentiability prop-
erties. In particular, a real-valued convex function is directionally
differentiable at any point. Furthermore, while a convex function
need not be differentiable, it possesses subgradients, which are nice
and geometrically intuitive substitutes for a gradient (see Section 1.7).
Just like gradients, subgradients figure prominently in optimality con-
ditions and computational algorithms.
(h) Convex functions are central in duality theory. Indeed, the dual prob-
lem of a given optimization problem (discussed in Chapters 3 and 4)
consists of minimization of a convex function over a convex set, even
if the original problem is not convex.
(i) Closed convex cones are self-dual with respect to orthogonality . In
words, the set of vectors orthogonal to the set C ⊥ (the set of vectors
that form a nonpositive inner product with all vectors in a closed and
convex cone C) is equal to C. This simple and geometrically intuitive
property (discussed in Section 1.5) underlies important aspects of
Lagrange multiplier theory.
(j) Convex, lower semicontinuous functions are self-dual with respect to
conjugacy. It will be seen in Chapter 4 that a certain geometrically
motivated conjugacy operation on a given convex, lower semicontinu-
ous function generates a convex, lower semicontinuous function, and
when applied for a second time regenerates the original function. The
conjugacy operation is central in duality theory, and has a nice inter-
pretation that can be used to visualize and understand some of the
most profound aspects of optimization.

Our approach in this chapter is to maintain a close link between
the theoretical treatment of convexity concepts and their application to
optimization. For example, soon after the development of some of the basic
facts about convexity in Section 1.2, we discuss some of their applications
to optimization in Section 1.3; and soon after the discussion of hyperplanes
and cones in Sections 1.4 and 1.5, we discuss conditions for optimality. We
follow consistently this style in the remaining chapters, although having
developed in Chapter 1 most of the convexity theory that we will need,
the discussion in the subsequent chapters is more heavily weighted towards
optimization.

1.1 LINEAR ALGEBRA AND ANALYSIS

Notation

If X is a set and x is an element of X, we write x ∈ X . A set can be


specified in the form X = {x | x satisfies P }, as the set of all elements
satisfying property P . The union of two sets X1 and X2 is denoted by
X1 ∪ X2 and their intersection by X1 ∩ X2 . The symbols ∃ and ∀ have
the meanings “there exists” and “for all,” respectively. The empty set is
denoted by Ø.
The set of real numbers (also referred to as scalars) is denoted by <.
The set < augmented with +∞ and −∞ is called the set of extended real
numbers. We denote by [a, b] the set of (possibly extended) real numbers x
satisfying a ≤ x ≤ b. A rounded, instead of square, bracket denotes strict
inequality in the definition. Thus (a, b], [a, b), and (a, b) denote the set of
all x satisfying a < x ≤ b, a ≤ x < b, and a < x < b, respectively. When
working with extended real numbers, we use the natural extensions of the
rules of arithmetic: x · 0 = 0 for every extended real number x, x · ∞ = ∞
if x > 0, x · ∞ = −∞ if x < 0, and x + ∞ = ∞ and x − ∞ = −∞ for
every scalar x. The expression ∞ − ∞ is meaningless and is never allowed
to occur.
If f is a function, we use the notation f : X 7→ Y to indicate the fact
that f is defined on a set X (its domain ) and takes values in a set Y (its
range). If f : X ↦ Y is a function, and U and V are subsets of X and Y ,
respectively, the set {f(x) | x ∈ U} is called the image or forward image
of U, and the set {x ∈ ℜ^n | f(x) ∈ V} is called the inverse image of V .

1.1.1 Vectors and Matrices

We denote by <n the set of n-dimensional real vectors. For any x ∈ <n ,
we use xi to indicate its ith coordinate, also called its ith component.

Vectors in ℜ^n will be viewed as column vectors, unless the contrary
is explicitly stated. For any x ∈ ℜ^n, x' denotes the transpose of x, which is
an n-dimensional row vector. The inner product of two vectors x, y ∈ ℜ^n is
defined by x'y = \sum_{i=1}^n x_i y_i. Any two vectors x, y ∈ ℜ^n satisfying
x'y = 0 are called orthogonal.


If x is a vector in <n , the notations x > 0 and x ≥ 0 indicate that all
coordinates of x are positive and nonnegative, respectively. For any two
vectors x and y, the notation x > y means that x − y > 0. The notations
x ≥ y, x < y, etc., are to be interpreted accordingly.
If X is a set and λ is a scalar we denote by λX the set {λx | x ∈ X}.
If X1 and X2 are two subsets of <n , we denote by X1 + X2 the vector sum

{x1 + x2 | x1 ∈ X1 , x2 ∈ X2 }.

We use a similar notation for the sum of any finite number of subsets. In
the case where one of the subsets consists of a single vector x̄, we simplify
this notation as follows:

x̄ + X = {x̄ + x | x ∈ X}.

Given sets X_i ⊂ ℜ^{n_i}, i = 1, . . . , m, the Cartesian product of the X_i,
denoted by X_1 × · · · × X_m, is the subset

{(x_1, . . . , x_m) | x_i ∈ X_i, i = 1, . . . , m}

of ℜ^{n_1 + ··· + n_m}.

Subspaces and Linear Independence

A subset S of ℜ^n is called a subspace if ax + by ∈ S for every x, y ∈ S
and every a, b ∈ ℜ. An affine set in ℜ^n is a translated subspace, i.e., a set
of the form x̄ + S = {x̄ + x | x ∈ S}, where x̄ is a vector in ℜ^n and S is
a subspace of ℜ^n. The span of a finite collection {x_1, . . . , x_m} of elements
of ℜ^n (also called the subspace generated by the collection) is the subspace
consisting of all vectors y of the form y = \sum_{k=1}^m a_k x_k, where each a_k is a
scalar.
The vectors x_1, . . . , x_m ∈ ℜ^n are called linearly independent if there
exists no set of scalars a_1, . . . , a_m such that \sum_{k=1}^m a_k x_k = 0, unless a_k = 0
for each k. An equivalent definition is that x_1 ≠ 0, and for every k > 1,
the vector x_k does not belong to the span of x_1, . . . , x_{k−1}.
If S is a subspace of <n containing at least one nonzero vector, a basis
for S is a collection of vectors that are linearly independent and whose
span is equal to S. Every basis of a given subspace has the same number
of vectors. This number is called the dimension of S. By convention, the
subspace {0} is said to have dimension zero. The dimension of an affine set

x + S is the dimension of the corresponding subspace S. Every subspace


of nonzero dimension has an orthogonal basis, i.e., a basis consisting of
mutually orthogonal vectors.
Given any set X, the set of vectors that are orthogonal to all elements
of X is a subspace denoted by X ⊥ :

X ⊥ = {y | y 0 x = 0, ∀ x ∈ X}.

If S is a subspace, S ⊥ is called the orthogonal complement of S. It can


be shown that (S ⊥ )⊥ = S (see the Polar Cone Theorem in Section 1.5).
Furthermore, any vector x can be uniquely decomposed as the sum of a
vector from S and a vector from S ⊥ (see the Projection Theorem in Section
1.3.2).

Matrices

For any matrix A, we use Aij , [A]ij , or aij to denote its ijth element. The
transpose of A, denoted by A0 , is defined by [A0 ]ij = aji . For any two
matrices A and B of compatible dimensions, the transpose of the product
matrix AB satisfies (AB)0 = B 0 A0 .
If X is a subset of <n and A is an m × n matrix, then the image of
X under A is denoted by AX (or A · X if this enhances notational clarity):

AX = {Ax | x ∈ X}.

If X is a subspace, then AX is also a subspace.


Let A be a square matrix. We say that A is symmetric if A0 = A. We
say that A is diagonal if [A]ij = 0 whenever i 6= j. We use I to denote the
identity matrix. The determinant of A is denoted by det(A).
Let A be an m × n matrix. The range space of A, denoted by R(A),
is the set of all vectors y ∈ <m such that y = Ax for some x ∈ <n . The
null space of A, denoted by N (A), is the set of all vectors x ∈ <n such
that Ax = 0. It is seen that the range space and the null space of A are
subspaces. The rank of A is the dimension of the range space of A. The
rank of A is equal to the maximal number of linearly independent columns
of A, and is also equal to the maximal number of linearly independent rows
of A. The matrix A and its transpose A0 have the same rank. We say that
A has full rank , if its rank is equal to min{m, n}. This is true if and only
if either all the rows of A are linearly independent, or all the columns of A
are linearly independent.
The range of an m × n matrix A and the orthogonal complement of
the nullspace of its transpose are equal, i.e.,

R(A) = N (A0 )⊥ .

Another way to state this result is that given vectors a1 , . . . , an ∈ <m (the
columns of A) and a vector x ∈ <m , we have x0 y = 0 for all y such that

a0i y = 0 for all i if and only if x = λ1a1 + · · · + λn an for some scalars


λ1 , . . . , λn . This is a special case of Farkas’ lemma, an important result for
constrained optimization, which will be discussed later in Section 1.6. A
useful application of this result is that if S1 and S2 are two subspaces of
<n , then
S1⊥ + S2⊥ = (S1 ∩ S2)⊥ .

This follows by introducing matrices B1 and B2 such that S1 = {x | B1 x =


0} = N (B1 ) and S2 = {x | B2 x = 0} = N (B2), and writing

S_1^\perp + S_2^\perp = R\big([\,B_1' \;\; B_2'\,]\big)
= N\!\left(\begin{bmatrix} B_1 \\ B_2 \end{bmatrix}\right)^{\!\perp}
= \big(N(B_1) \cap N(B_2)\big)^\perp = (S_1 \cap S_2)^\perp.
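This identity, like the relation R(A) = N(A')⊥ itself, is easy to verify numerically. The following sketch is only an illustration (it assumes NumPy is available and uses randomly generated test data, not anything from the notes); it builds an orthonormal basis of N(A') from the SVD of A and checks that every vector of the form Ax is orthogonal to it.

    import numpy as np

    np.random.seed(0)
    m, n = 5, 3
    A = np.random.randn(m, n)

    # Orthonormal basis of N(A'): left singular vectors of A with zero singular value.
    U, s, Vt = np.linalg.svd(A)
    rank = np.sum(s > 1e-10)
    N_basis = U[:, rank:]                # columns span N(A') = R(A)^perp

    # Any vector Ax lies in R(A), hence is orthogonal to every vector in N(A').
    x = np.random.randn(n)
    print(np.allclose(N_basis.T @ (A @ x), 0.0))   # True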

A function f : <n 7→ < is said to be affine if it has the form f (x) =


a0 x+ b for some a ∈ <n and b ∈ <. Similarly, a function f : <n 7→ <m is
said to be affine if it has the form f (x) = Ax + b for some m × n matrix
A and some b ∈ <m . If b = 0, f is said to be a linear function or linear
transformation.

1.1.2 Topological Properties

Definition 1.1.1: A norm k · k on <n is a function that assigns a


scalar kxk to every x ∈ <n and that has the following properties:
(a) kxk ≥ 0 for all x ∈ <n .
(b) kαxk = |α| · kxk for every scalar α and every x ∈ <n .
(c) kxk = 0 if and only if x = 0.
(d) kx + yk ≤ kxk + kyk for all x, y ∈ <n (this is referred to as the
triangle inequality ).

The Euclidean norm of a vector x = (x1 , . . . , xn ) is defined by

\|x\| = (x'x)^{1/2} = \left( \sum_{i=1}^n |x_i|^2 \right)^{1/2}.

The space <n , equipped with this norm, is called a Euclidean space. We
will use the Euclidean norm almost exclusively in this book. In particular,
in the absence of a clear indication to the contrary, k · k will denote the
Euclidean norm. Two important results for the Euclidean norm are:

Proposition 1.1.1: (Pythagorean Theorem) For any two vectors


x and y that are orthogonal, we have

‖x + y‖^2 = ‖x‖^2 + ‖y‖^2.

Proposition 1.1.2: (Schwartz inequality) For any two vectors x


and y, we have
|x'y| ≤ ‖x‖ · ‖y‖,
with equality holding if and only if x = αy for some scalar α.

Two other important norms are the maximum norm ‖·‖_∞ (also called
sup-norm or ℓ_∞-norm), defined by

‖x‖_∞ = \max_i |x_i|,

and the ℓ_1-norm ‖·‖_1, defined by

‖x‖_1 = \sum_{i=1}^n |x_i|.
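As a small numerical illustration (a sketch assuming NumPy is available; the vector is an arbitrary choice), the three norms can be computed and compared, and they satisfy the equivalence inequalities ‖x‖_∞ ≤ ‖x‖ ≤ ‖x‖_1 ≤ n‖x‖_∞:

    import numpy as np

    x = np.array([3.0, -4.0, 1.0])

    euclidean = np.linalg.norm(x)           # (sum_i |x_i|^2)^(1/2)
    maxnorm   = np.linalg.norm(x, np.inf)   # max_i |x_i|
    l1norm    = np.linalg.norm(x, 1)        # sum_i |x_i|

    print(euclidean, maxnorm, l1norm)
    # The norms are equivalent on R^n:
    print(maxnorm <= euclidean <= l1norm <= len(x) * maxnorm)   # True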

Sequences

We use both subscripts and superscripts in sequence notation. Generally,


we prefer subscripts, but we use superscripts whenever we need to reserve
the subscript notation for indexing coordinates or components of vectors
and functions. The meaning of the subscripts and superscripts should be
clear from the context in which they are used.
A sequence {xk | k = 1, 2, . . .} (or {xk } for short) of scalars is said
to converge if there exists a scalar x such that for every ε > 0 we have
|xk − x| < ε for every k greater than some integer K (depending on ε). We
call the scalar x the limit of {xk }, and we also say that {xk } converges to
x; symbolically, xk → x or limk→∞ xk = x. If for every scalar b there exists
some K (depending on b) such that xk ≥ b for all k ≥ K, we write xk → ∞
and limk→∞ xk = ∞. Similarly, if for every scalar b there exists some K
such that xk ≤ b for all k ≥ K, we write xk → −∞ and limk→∞ xk = −∞.

A sequence {xk } is called a Cauchy sequence if for every ε > 0, there
exists some K (depending on ε) such that |xk − xm | < ε for all k ≥ K and
m ≥ K.
A sequence {xk } is said to be bounded above (respectively, below) if
there exists some scalar b such that xk ≤ b (respectively, xk ≥ b) for all
k. It is said to be bounded if it is bounded above and bounded below.
The sequence {xk } is said to be monotonically nonincreasing (respectively,
nondecreasing) if xk+1 ≤ xk (respectively, xk+1 ≥ xk ) for all k. If {xk } con-
verges to x and is nonincreasing (nondecreasing), we also use the notation
xk ↓ x (xk ↑ x, respectively).

Proposition 1.1.3: Every bounded and monotonically nonincreasing


or nondecreasing scalar sequence converges.

Note that a monotonically nondecreasing sequence {xk } is either


bounded, in which case it converges to some scalar x by the above propo-
sition, or else it is unbounded, in which case xk → ∞. Similarly, a mono-
tonically nonincreasing sequence {xk } is either bounded and converges, or
it is unbounded, in which case xk → −∞.
The supremum of a nonempty set X of scalars, denoted by sup X, is
defined as the smallest scalar x such that x ≥ y for all y ∈ X. If no such
scalar exists, we say that the supremum of X is ∞. Similarly, the infimum
of X, denoted by inf X, is defined as the largest scalar x such that x ≤ y
for all y ∈ X, and is equal to −∞ if no such scalar exists. For the empty
set, we use the convention

sup(Ø) = −∞, inf(Ø) = ∞.

(This is somewhat paradoxical, since we have that the sup of a set is less
than its inf, but works well for our analysis.) If sup X is equal to a scalar
x that belongs to the set X, we say that x is the maximum point of X and
we often write
x = sup X = max X.
Similarly, if inf X is equal to a scalar x that belongs to the set X, we often
write
x = inf X = min X.
Thus, when we write max X (or min X) in place of sup X (or inf X, re-
spectively) we do so just for emphasis: we indicate that it is either evident,
or it is known through earlier analysis, or it is about to be shown that the
maximum (or minimum, respectively) of the set X is attained at one of its
points.
Given a scalar sequence {xk }, the supremum of the sequence, denoted
by supk xk , is defined as sup{xk | k = 1, 2, . . .}. The infimum of a sequence

is similarly defined. Given a sequence {xk }, let ym = sup{xk | k ≥ m},


zm = inf{xk | k ≥ m}. The sequences {ym } and {zm } are nonincreasing
and nondecreasing, respectively, and therefore have a limit whenever {xk }
is bounded above or is bounded below, respectively (Prop. 1.1.3). The
limit of ym is denoted by lim supk→∞ xk , and is referred to as the limit
superior of {xk }. The limit of zm is denoted by lim inf k→∞ xk , and is
referred to as the limit inferior of {xk }. If {xk } is unbounded above,
we write lim supk→∞ xk = ∞, and if it is unbounded below, we write
lim inf k→∞ xk = −∞.

Proposition 1.1.4: Let {xk} and {yk} be scalar sequences.
(a) There holds

\inf_k xk ≤ \liminf_{k→∞} xk ≤ \limsup_{k→∞} xk ≤ \sup_k xk.

(b) {xk} converges if and only if \liminf_{k→∞} xk = \limsup_{k→∞} xk
and, in that case, both of these quantities are equal to the limit
of xk.
(c) If xk ≤ yk for all k, then

\liminf_{k→∞} xk ≤ \liminf_{k→∞} yk,    \limsup_{k→∞} xk ≤ \limsup_{k→∞} yk.

(d) We have

\liminf_{k→∞} xk + \liminf_{k→∞} yk ≤ \liminf_{k→∞} (xk + yk),

\limsup_{k→∞} xk + \limsup_{k→∞} yk ≥ \limsup_{k→∞} (xk + yk).
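For a concrete illustration (a sketch assuming NumPy; the sequence is an arbitrary choice, not from the notes), consider xk = (−1)^k + 1/k, for which the limit superior is 1 and the limit inferior is −1; the tail suprema and infima approach these values:

    import numpy as np

    K = 10000
    k = np.arange(1, K + 1)
    x = (-1.0) ** k + 1.0 / k          # oscillates; limsup = 1, liminf = -1

    m = 1000
    tail = x[m - 1:]                   # the terms x_k with k >= m
    print(tail.max())                  # sup over the tail, close to 1 from above
    print(tail.min())                  # inf over the tail, close to -1 from above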

A sequence {xk } of vectors in <n is said to converge to some x ∈ <n if


the ith coordinate of xk converges to the ith coordinate of x for every i. We
use the notations xk → x and limk→∞ xk = x to indicate convergence for
vector sequences as well. The sequence {xk } is called bounded (respectively,
a Cauchy sequence) if each of its corresponding coordinate sequences is
bounded (respectively, a Cauchy sequence). It can be seen that {xk } is
bounded if and only if there exists a scalar c such that kxk k ≤ c for all k.

Definition 1.1.2: We say that a vector x ∈ <n is a limit point of a se-


quence {xk } in <n if there exists a subsequence of {xk } that converges
to x.

Proposition 1.1.5: Let {xk } be a sequence in <n .


(a) If {xk } is bounded, it has at least one limit point.
(b) {xk } converges if and only if it is bounded and it has a unique
limit point.
(c) {xk } converges if and only if it is a Cauchy sequence.

o(·) Notation

If p is a positive integer and h : ℜ^n ↦ ℜ^m, then we write

h(x) = o(‖x‖^p)

if

\lim_{k→∞} \frac{h(x_k)}{\|x_k\|^p} = 0,

for all sequences {xk}, with xk ≠ 0 for all k, that converge to 0.

Closed and Open Sets

We say that x is a closure point or limit point of a set X ⊂ <n if there


exists a sequence {xk }, consisting of elements of X, that converges to x.
The closure of X, denoted cl(X ), is the set of all limit points of X .

Definition 1.1.3: A set X ⊂ <n is called closed if it is equal to its


closure. It is called open if its complement (the set {x | x ∉ X}) is
closed. It is called bounded if there exists a scalar c such that the
magnitude of any coordinate of any element of X is less than c. It is
called compact if it is closed and bounded.

Definition 1.1.4: A neighborhood of a vector x is an open set con-


taining x. We say that x is an interior point of a set X ⊂ <n if there
exists a neighborhood of x that is contained in X. A vector x ∈ cl(X)
which is not an interior point of X is said to be a boundary point of
X.

Let ‖·‖ be a given norm in ℜ^n. For any ε > 0 and x∗ ∈ ℜ^n, consider
the sets

{x | ‖x − x∗‖ < ε},    {x | ‖x − x∗‖ ≤ ε}.

The first set is open and is called an open sphere centered at x∗ , while the
second set is closed and is called a closed sphere centered at x∗ . Sometimes
the terms open ball and closed ball are used, respectively.

Proposition 1.1.6:
(a) The union of finitely many closed sets is closed.
(b) The intersection of closed sets is closed.
(c) The union of open sets is open.
(d) The intersection of finitely many open sets is open.
(e) A set is open if and only if all of its elements are interior points.
(f) Every subspace of <n is closed.
(g) A set X is compact if and only if every sequence of elements of
X has a subsequence that converges to an element of X.
(h) If {Xk } is a sequence of nonempty and compact sets such that
Xk ⊃ Xk+1 for all k, then the intersection ∩_{k=0}^∞ Xk is nonempty
and compact.

The topological properties of subsets of <n , such as being open,


closed, or compact, do not depend on the norm being used. This is a
consequence of the following proposition, referred to as the norm equiva-
lence property in <n , which shows that if a sequence converges with respect
to one norm, it converges with respect to all other norms.

Proposition 1.1.7: For any two norms k · k and k · k0 on <n , there


exists a scalar c such that kxk ≤ ckxk0 for all x ∈ <n .

Using the preceding proposition, we obtain the following.



Proposition 1.1.8: If a subset of <n is open (respectively, closed,


bounded, or compact) with respect to some norm, it is open (respec-
tively, closed, bounded, or compact) with respect to all other norms.

Sequences of Sets

Let {Xk } be a sequence of nonempty subsets of <n . The outer limit of


{Xk }, denoted lim supk→∞ Xk , is the set of all x ∈ <n such that every
neighborhood of x has a nonempty intersection with infinitely many of the
sets Xk , k = 1, 2, . . .. Equivalently, lim supk→∞ Xk is the set of all limits
of subsequences {xk }K such that xk ∈ Xk , for all k ∈ K.
The inner limit of {Xk }, denoted lim inf k→∞ Xk , is the set of all
x ∈ <n such that every neighborhood of x has a nonempty intersection
with all except finitely many of the sets Xk , k = 1, 2, . . .. Equivalently,
lim inf k→∞ Xk is the set of all limits of sequences {xk } such that xk ∈ Xk ,
for all k = 1, 2, . . ..
The sequence {Xk } is said to converge to a set X if

X = lim inf Xk = lim sup Xk ,


k→∞ k→∞

in which case X is said to be the limit of {Xk }. The inner and outer limits
are closed (possibly empty) sets. If each set Xk consists of a single point
xk , lim supk→∞ Xk is the set of limit points of {xk }, while lim inf k→∞ Xk
is just the limit of {xk } if {xk } converges, and otherwise it is empty.

Continuity

Let X be a subset of ℜ^n and let f : X ↦ ℜ^m be some function. Let x
be a closure point of X. If there exists a vector y ∈ ℜ^m such that the
sequence {f(xk)} converges to y for every sequence {xk} ⊂ X such that
lim_{k→∞} xk = x, we write lim_{z→x} f(z) = y.
If X is a subset of < and x is a closure point of X, we denote by
limz↑x f(z) [respectively, limz↓x f (z)] the limit of f (xk ), where {xk } is any
sequence of elements of X converging to x and satisfying xk ≤ x (respec-
tively, xk ≥ x), assuming that at least one such sequence {xk } exists, and
the corresponding limit of f (xk ) exists and is independent of the choice of
{xk }.

Definition 1.1.5: Let X be a subset of <n .


(a) A function f : X 7→ <m is called continuous at a point x ∈ X if
limz→x f(z) = f (x).
(b) A function f : X 7→ <m is called right-continuous (respectively,
left-continuous) at a point x ∈ X if limz↓x f (z) = f (x) [respec-
tively, limz↑x f (z) = f(x)].
(c) A real-valued function f : X 7→ < is called upper semicontinuous
(respectively, lower semicontinuous ) at a vector x ∈ X if f(x) ≥
lim supk→∞ f (xk ) [respectively, f (x) ≤ lim inf k→∞ f (xk )] for ev-
ery sequence {xk } of elements of X converging to x.

If f : X 7→ <m is continuous at every point of a subset of its domain


X, we say that f is continuous over that subset . If f : X 7→ <m is con-
tinuous at every point of its domain X , we say that f is continuous. We
use similar terminology for right-continuous, left-continuous, upper semi-
continuous, and lower semicontinuous functions.

Proposition 1.1.9:
(a) The composition of two continuous functions is continuous.
(b) Any vector norm on <n is a continuous function.
(c) Let f : ℜ^n ↦ ℜ^m be continuous, and let Y ⊂ ℜ^m be open
(respectively, closed). Then the inverse image of Y, {x ∈ ℜ^n | f(x) ∈ Y},
is open (respectively, closed).
(d) Let f : ℜ^n ↦ ℜ^m be continuous, and let X ⊂ ℜ^n be compact.
Then the forward image of X, {f(x) | x ∈ X}, is compact.

Matrix Norms

A norm ‖·‖ on the set of n × n matrices is a real-valued function that has
the same properties as vector norms do when the matrix is viewed as an
element of ℜ^{n^2}. The norm of an n × n matrix A is denoted by ‖A‖.
We are mainly interested in induced norms, which are constructed as
follows. Given any vector norm ‖·‖, the corresponding induced matrix
norm, also denoted by ‖·‖, is defined by

‖A‖ = \sup_{\{x ∈ ℜ^n \,|\, ‖x‖ = 1\}} ‖Ax‖.

It is easily verified that for any vector norm, the above equation defines a
bona fide matrix norm having all the required properties.
Note that by the Schwartz inequality (Prop. 1.1.2), we have

‖A‖ = \sup_{‖x‖=1} ‖Ax‖ = \sup_{‖y‖=‖x‖=1} |y'Ax|.

By reversing the roles of x and y in the above relation and by using the
equality y'Ax = x'A'y, it follows that ‖A‖ = ‖A'‖.
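These facts are easy to check numerically. The sketch below is an illustration only (it assumes NumPy and uses a random test matrix); it relies on the fact that the matrix norm induced by the Euclidean vector norm equals the largest singular value, and compares it with a sampled approximation of the supremum over unit vectors.

    import numpy as np

    np.random.seed(1)
    A = np.random.randn(4, 4)

    norm_A  = np.linalg.norm(A, 2)       # induced (spectral) norm = largest singular value
    norm_At = np.linalg.norm(A.T, 2)
    print(np.isclose(norm_A, norm_At))   # True: ||A|| = ||A'||

    # Approximate sup_{||x||=1} ||Ax|| by sampling many unit vectors.
    xs = np.random.randn(4, 100000)
    xs /= np.linalg.norm(xs, axis=0)
    print(np.max(np.linalg.norm(A @ xs, axis=0)) <= norm_A + 1e-12)   # True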

1.1.3 Square Matrices

Definition 1.1.6: A square matrix A is called singular if its determi-


nant is zero. Otherwise it is called nonsingular or invertible.

Proposition 1.1.10:
(a) Let A be an n × n matrix. The following are equivalent:
(i) The matrix A is nonsingular.
(ii) The matrix A0 is nonsingular.
(iii) For every nonzero x ∈ <n , we have Ax 6= 0.
(iv) For every y ∈ <n , there is a unique x ∈ <n such that
Ax = y.
(v) There is an n × n matrix B such that AB = I = BA.
(vi) The columns of A are linearly independent.
(vii) The rows of A are linearly independent.
(b) Assuming that A is nonsingular, the matrix B of statement (v)
(called the inverse of A and denoted by A−1 ) is unique.
(c) For any two square invertible matrices A and B of the same
dimensions, we have (AB)−1 = B −1 A−1 .

Definition 1.1.7: The characteristic polynomial φ of an n × n matrix


A is defined by φ(λ) = det(λI − A), where I is the identity matrix
of the same size as A. The n (possibly repeated or complex) roots of
φ are called the eigenvalues of A. A nonzero vector x (with possibly
complex coordinates) such that Ax = λx, where λ is an eigenvalue of
A, is called an eigenvector of A associated with λ.

Proposition 1.1.11: Let A be a square matrix.


(a) A complex number λ is an eigenvalue of A if and only if there
exists a nonzero eigenvector associated with λ.

(b) A is singular if and only if it has an eigenvalue that is equal to


zero.

Note that the only use of complex numbers in this book is in relation
to eigenvalues and eigenvectors. All other matrices or vectors are implicitly
assumed to have real components.

Proposition 1.1.12: Let A be an n × n matrix.


(a) If T is a nonsingular matrix and B = T AT −1 , then the eigenval-
ues of A and B coincide.
(b) For any scalar c, the eigenvalues of cI + A are equal to c +
λ1 , . . . , c + λn , where λ1 , . . . , λn are the eigenvalues of A.
(c) The eigenvalues of A^k are equal to λ1^k, . . . , λn^k, where λ1 , . . . , λn
are the eigenvalues of A.
(d) If A is nonsingular, then the eigenvalues of A−1 are the recipro-
cals of the eigenvalues of A.
(e) The eigenvalues of A and A0 coincide.

Symmetric and Positive Definite Matrices

Symmetric matrices have several special properties, particularly regarding


their eigenvalues and eigenvectors. In what follows in this section, k · k

denotes the Euclidean norm.

Proposition 1.1.13: Let A be a symmetric n × n matrix. Then:


(a) The eigenvalues of A are real.
(b) The matrix A has a set of n mutually orthogonal, real, and
nonzero eigenvectors x1 , . . . , xn .
(c) Suppose that the eigenvectors in part (b) have been normalized
so that ‖x_i‖ = 1 for each i. Then

A = \sum_{i=1}^n λ_i x_i x_i',

where λ_i is the eigenvalue corresponding to x_i.
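Part (c) can be illustrated numerically with NumPy's symmetric eigensolver (an illustrative sketch with a random symmetric test matrix, not part of the development):

    import numpy as np

    np.random.seed(2)
    B = np.random.randn(4, 4)
    A = (B + B.T) / 2                        # a symmetric matrix

    lam, X = np.linalg.eigh(A)               # real eigenvalues, orthonormal eigenvectors (columns)
    A_rebuilt = sum(lam[i] * np.outer(X[:, i], X[:, i]) for i in range(4))
    print(np.allclose(A, A_rebuilt))         # True: A = sum_i lambda_i x_i x_i'
    print(np.allclose(X.T @ X, np.eye(4)))   # eigenvectors are mutually orthogonal with unit norm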

Proposition 1.1.14: Let A be a symmetric n × n matrix, and let


λ1 ≤ · · · ≤ λn be its (real) eigenvalues. Then:
(a) ‖A‖ = max{|λ1|, |λn|}, where ‖·‖ is the matrix norm induced
by the Euclidean norm.
(b) λ1 ‖y‖^2 ≤ y'Ay ≤ λn ‖y‖^2 for all y ∈ ℜ^n.
(c) If A is nonsingular, then

‖A^{-1}‖ = 1 / \min_i |λ_i|.

Proposition 1.1.15: Let A be a square matrix, and let ‖·‖ be the
matrix norm induced by the Euclidean norm. Then:
(a) If A is symmetric, then ‖A^k‖ = ‖A‖^k for any positive integer k.
(b) ‖A‖^2 = ‖A'A‖ = ‖AA'‖.

Definition 1.1.8: A symmetric n × n matrix A is called positive defi-


nite if x0 Ax > 0 for all x ∈ <n , x 6= 0. It is called positive semidefinite
if x0 Ax ≥ 0 for all x ∈ <n .

Throughout this book, the notion of positive definiteness applies ex-


clusively to symmetric matrices. Thus whenever we say that a matrix is
positive (semi)definite, we implicitly assume that the matrix is symmetric.

Proposition 1.1.16:
(a) The sum of two positive semidefinite matrices is positive semidef-
inite. If one of the two matrices is positive definite, the sum is
positive definite.
(b) If A is a positive semidefinite n × n matrix and T is an m ×
n matrix, then the matrix T AT 0 is positive semidefinite. If A
is positive definite and T is invertible, then T AT 0 is positive
definite.

Proposition 1.1.17:
(a) For any m×n matrix A, the matrix A0 A is symmetric and positive
semidefinite. A0 A is positive definite if and only if A has rank n.
In particular, if m = n, A0 A is positive definite if and only if A
is nonsingular.
(b) A square symmetric matrix is positive semidefinite (respectively,
positive definite) if and only if all of its eigenvalues are nonneg-
ative (respectively, positive).
(c) The inverse of a symmetric positive definite matrix is symmetric
and positive definite.
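Parts (a) and (b) of the preceding proposition can be illustrated numerically (a sketch assuming NumPy; the matrices are arbitrary test data): A'A is always positive semidefinite, and it is positive definite exactly when the columns of A are linearly independent, which can be read off from the eigenvalues.

    import numpy as np

    np.random.seed(3)
    A_full = np.random.randn(5, 3)        # full column rank with probability 1
    A_def  = A_full[:, [0, 1, 0]]         # rank 2: a repeated column

    for A in (A_full, A_def):
        M = A.T @ A                       # symmetric n x n matrix A'A
        eigs = np.linalg.eigvalsh(M)
        print(np.all(eigs >= -1e-12),     # always True: A'A is positive semidefinite
              np.all(eigs > 1e-12))       # True only when A has full column rank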

Proposition 1.1.18: Let A be a symmetric positive semidefinite n×n


matrix of rank m. There exists an n × m matrix C of rank m such
that
A = CC 0 .
Furthermore, for any such matrix C:
(a) A and C 0 have the same null space: N (A) = N (C 0 ).
(b) A and C have the same range space: R(A) = R(C).

Proposition 1.1.19: Let A be a square symmetric positive semidef-


inite matrix.
(a) There exists a symmetric matrix Q with the property Q2 = A.
Such a matrix is called a symmetric square root of A and is de-
noted by A1/2.
(b) There is a unique symmetric square root if and only if A is pos-
itive definite.
(c) A symmetric square root A1/2 is invertible if and only if A is
invertible. Its inverse is denoted by A−1/2 .
(d) There holds A−1/2 A−1/2 = A−1.
(e) There holds AA1/2 = A1/2 A.

1.1.4 Derivatives

Let f : ℜ^n ↦ ℜ be some function, fix some x ∈ ℜ^n, and consider the
expression

\lim_{α→0} \frac{f(x + α e_i) − f(x)}{α},
where ei is the ith unit vector (all components are 0 except for the ith
component which is 1). If the above limit exists, it is called the ith partial
derivative of f at the point x and it is denoted by (∂f /∂xi )(x) or ∂f (x)/∂xi
(xi in this section will denote the ith coordinate of the vector x). Assuming
all of these partial derivatives exist, the gradient of f at x is defined as the
column vector

\nabla f(x) = \begin{pmatrix} \frac{\partial f(x)}{\partial x_1} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \end{pmatrix}.

For any y ∈ <n , we define the one-sided directional derivative of f in


the direction y, to be

f'(x; y) = \lim_{α↓0} \frac{f(x + αy) − f(x)}{α},

provided that the limit exists. We note from the definitions that

f'(x; e_i) = −f'(x; −e_i)  ⇒  f'(x; e_i) = (∂f/∂x_i)(x).
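Numerically, one-sided directional derivatives can be approximated by difference quotients with α decreasing to 0. The sketch below (assuming NumPy; the function and points are arbitrary illustrative choices) does this for f(x) = ‖x‖_1, which is directionally differentiable everywhere but not differentiable at points with a zero coordinate:

    import numpy as np

    f = lambda x: np.sum(np.abs(x))          # f(x) = ||x||_1

    def directional_derivative(f, x, y, alphas=(1e-2, 1e-4, 1e-6)):
        # one-sided difference quotients (f(x + a y) - f(x)) / a as a decreases to 0
        return [(f(x + a * y) - f(x)) / a for a in alphas]

    x = np.array([1.0, 0.0])
    print(directional_derivative(f, x, np.array([0.0,  1.0])))   # -> 1.0 = f'(x; e2)
    print(directional_derivative(f, x, np.array([0.0, -1.0])))   # -> 1.0, not -1.0, so f is not differentiable at x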

If the directional derivative of f at a vector x exists in all directions


y and f 0 (x; y) is a linear function of y, we say that f is differentiable at
x. This type of differentiability is also called Gateaux differentiability . It
is seen that f is differentiable at x if and only if the gradient ∇f (x) exists
and satisfies ∇f (x)0 y = f 0 (x; y) for every y ∈ <n . The function f is called
differentiable over a given subset U of <n if it is differentiable at every
x ∈ U. The function f is called differentiable (without qualification) if it
is differentiable at all x ∈ <n .
If f is differentiable over an open set U and the gradient ∇f (x) is
continuous at all x ∈ U, f is said to be continuously differentiable over U .
Such a function has the property

\lim_{y→0} \frac{f(x + y) − f(x) − ∇f(x)'y}{‖y‖} = 0,    ∀ x ∈ U,    (1.1)

where k · k is an arbitrary vector norm. If f is continuously differentiable


over <n , then f is also called a smooth function.
The preceding equation can also be used as an alternative definition
of differentiability. In particular, f is called Frechet differentiable at x
if there exists a vector g satisfying Eq. (1.1) with ∇f (x) replaced by g.
If such a vector g exists, it can be seen that all the partial derivatives
(∂f /∂xi )(x) exist and that g = ∇f (x). Frechet differentiability implies
(Gateaux) differentiability but not conversely (see for example [OrR70] for
a detailed discussion). In this book, when dealing with a differentiable
function f , we will always assume that f is continuously differentiable
(smooth) over a given open set [∇f(x) is a continuous function of x over
that set], in which case f is both Gateaux and Frechet differentiable, and
the distinctions made above are of no consequence.
The definitions of differentiability of f at a point x only involve the
values of f in a neighborhood of x. Thus, these definitions can be used
for functions f that are not defined on all of <n , but are defined instead
in a neighborhood of the point at which the derivative is computed. In
particular, for functions f : X 7→ < that are defined over a strict subset
X of <n , we use the above definition of differentiability of f at a vector
x provided x is an interior point of the domain X. Similarly, we use the
above definition of differentiability or continuous differentiability of f over

a subset U, provided U is an open subset of the domain X. Thus any


mention of differentiability of a function f over a subset implicitly assumes
that this subset is open.
If f : <n 7→ <m is a vector-valued function, it is called differentiable
(or smooth) if each component fi of f is differentiable (or smooth, respec-
tively). The gradient matrix of f , denoted ∇f(x), is the n × m matrix
whose ith column is the gradient ∇fi (x) of fi . Thus,
∇f(x) = [ ∇f_1(x)  · · ·  ∇f_m(x) ].

The transpose of ∇f is called the Jacobian of f and is a matrix whose ijth


entry is equal to the partial derivative ∂fi /∂xj .
Now suppose that each one of the partial derivatives of a function
f : <n 7→ < is a smooth function of x. We use the notation (∂ 2 f/∂xi ∂xj )(x)
to indicate the ith partial derivative of ∂f /∂xj at a point x ∈ <n . The
Hessian of f is the matrix whose ijth entry is equal to (∂ 2 f/∂xi ∂xj )(x),
and is denoted by ∇2f (x). We have (∂ 2 f /∂xi ∂xj )(x) = (∂ 2 f/∂xj ∂xi )(x)
for every x, which implies that ∇2 f (x) is symmetric.
If f : <m+n 7→ < is a function of (x, y), where x = (x1 , . . . , xm ) ∈ <m
and y = (y1, . . . , yn ) ∈ <n , we write
∇_x f(x, y) = \begin{pmatrix} \frac{\partial f(x,y)}{\partial x_1} \\ \vdots \\ \frac{\partial f(x,y)}{\partial x_m} \end{pmatrix}, \qquad ∇_y f(x, y) = \begin{pmatrix} \frac{\partial f(x,y)}{\partial y_1} \\ \vdots \\ \frac{\partial f(x,y)}{\partial y_n} \end{pmatrix}.

We denote by ∇^2_{xx} f(x, y), ∇^2_{xy} f(x, y), and ∇^2_{yy} f(x, y) the matrices with
components

[∇^2_{xx} f(x, y)]_{ij} = \frac{\partial^2 f(x,y)}{\partial x_i \partial x_j}, \quad [∇^2_{xy} f(x, y)]_{ij} = \frac{\partial^2 f(x,y)}{\partial x_i \partial y_j}, \quad [∇^2_{yy} f(x, y)]_{ij} = \frac{\partial^2 f(x,y)}{\partial y_i \partial y_j}.
If f : ℜ^{m+n} ↦ ℜ^r, f = (f_1, f_2, . . . , f_r), we write

∇_x f(x, y) = [ ∇_x f_1(x, y)  · · ·  ∇_x f_r(x, y) ],

∇_y f(x, y) = [ ∇_y f_1(x, y)  · · ·  ∇_y f_r(x, y) ].
Let f : ℜ^k ↦ ℜ^m and g : ℜ^m ↦ ℜ^n be smooth functions, and let h
be their composition, i.e.,

h(x) = g(f(x)).

Then, the chain rule for differentiation states that

∇h(x) = ∇f(x) ∇g(f(x)),    ∀ x ∈ ℜ^k.

Some examples of useful relations that follow from the chain rule are:

∇(f(Ax)) = A' ∇f(Ax),    ∇^2 (f(Ax)) = A' ∇^2 f(Ax) A,

where A is a matrix,

∇_x f(h(x), y) = ∇h(x) ∇_h f(h(x), y),

∇_x f(h(x), g(x)) = ∇h(x) ∇_h f(h(x), g(x)) + ∇g(x) ∇_g f(h(x), g(x)).
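The first of these relations can be checked with finite differences (an illustrative sketch assuming NumPy; the function f and the data are arbitrary smooth choices, not from the notes):

    import numpy as np

    np.random.seed(4)
    A = np.random.randn(3, 2)                    # maps R^2 into R^3
    f = lambda z: np.sum(np.sin(z))              # smooth f : R^3 -> R
    grad_f = lambda z: np.cos(z)                 # its gradient

    def num_grad(g, x, h=1e-6):
        # central finite-difference approximation of the gradient of g at x
        return np.array([(g(x + h * e) - g(x - h * e)) / (2 * h) for e in np.eye(len(x))])

    x = np.random.randn(2)
    h_of_x = lambda x: f(A @ x)                  # h(x) = f(Ax)
    print(np.allclose(num_grad(h_of_x, x), A.T @ grad_f(A @ x), atol=1e-6))   # True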

We now state some theorems relating to differentiable functions that


will be useful for our purposes.

Proposition 1.1.20: (Mean Value Theorem) Let f : <n 7→ < be


continuously differentiable over an open sphere S, and let x be a vector
in S. Then for all y such that x + y ∈ S, there exists an α ∈ [0, 1] such
that
f (x + y) = f (x) + ∇f (x + αy)0 y.

Proposition 1.1.21: (Second Order Expansions) Let f : <n 7→


< be twice continuously differentiable over an open sphere S, and let
x be a vector in S. Then for all y such that x + y ∈ S:
(a) There holds

f(x + y) = f(x) + y'∇f(x) + \tfrac{1}{2}\, y' \left( \int_0^1 \left( \int_0^t ∇^2 f(x + τy)\, dτ \right) dt \right) y.

(b) There exists an α ∈ [0, 1] such that

f(x + y) = f(x) + y'∇f(x) + \tfrac{1}{2}\, y' ∇^2 f(x + αy)\, y.

(c) There holds

f(x + y) = f(x) + y'∇f(x) + \tfrac{1}{2}\, y' ∇^2 f(x)\, y + o(‖y‖^2).
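Part (c) can be illustrated numerically: the remainder f(x + y) − f(x) − y'∇f(x) − ½ y'∇²f(x)y should vanish faster than ‖y‖². The sketch below assumes NumPy; the test function, point, and direction are arbitrary illustrative choices.

    import numpy as np

    f      = lambda x: np.exp(x[0]) + x[0] * x[1] ** 2
    grad_f = lambda x: np.array([np.exp(x[0]) + x[1] ** 2, 2 * x[0] * x[1]])
    hess_f = lambda x: np.array([[np.exp(x[0]), 2 * x[1]],
                                 [2 * x[1],     2 * x[0]]])

    x = np.array([0.5, -1.0])
    d = np.array([1.0, 2.0])
    for t in (1e-1, 1e-2, 1e-3):
        y = t * d
        remainder = f(x + y) - f(x) - y @ grad_f(x) - 0.5 * y @ hess_f(x) @ y
        print(t, remainder / np.dot(y, y))   # ratio tends to 0: remainder = o(||y||^2)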

Proposition 1.1.22: (Implicit Function Theorem) Consider a
function f : ℜ^{n+m} ↦ ℜ^m of x ∈ ℜ^n and y ∈ ℜ^m such that:
(1) f(x̄, ȳ) = 0.
(2) f is continuous, and has a continuous and nonsingular gradient
matrix ∇_y f(x, y) in an open set containing (x̄, ȳ).
Then there exist open sets S_x̄ ⊂ ℜ^n and S_ȳ ⊂ ℜ^m containing x̄ and ȳ,
respectively, and a continuous function φ : S_x̄ ↦ S_ȳ such that ȳ = φ(x̄)
and f(x, φ(x)) = 0 for all x ∈ S_x̄. The function φ is unique in the sense
that if x ∈ S_x̄, y ∈ S_ȳ, and f(x, y) = 0, then y = φ(x). Furthermore,
if for some integer p > 0, f is p times continuously differentiable, the
same is true for φ, and we have

∇φ(x) = −∇_x f(x, φ(x)) \big(∇_y f(x, φ(x))\big)^{-1},    ∀ x ∈ S_x̄.
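As a simple illustration (not part of the proposition; the example is an arbitrary choice), take n = m = 1, f(x, y) = x² + y² − 1, and (x̄, ȳ) = (0, 1), so that φ(x) = (1 − x²)^{1/2} near x̄; the formula then gives ∇φ(x) = −x/φ(x). A quick check, assuming NumPy:

    import numpy as np

    f      = lambda x, y: x ** 2 + y ** 2 - 1.0
    grad_x = lambda x, y: 2.0 * x
    grad_y = lambda x, y: 2.0 * y
    phi    = lambda x: np.sqrt(1.0 - x ** 2)        # the implicit function near (0, 1)

    x = 0.3
    print(np.isclose(f(x, phi(x)), 0.0))            # phi solves f(x, phi(x)) = 0
    lhs = (phi(x + 1e-7) - phi(x - 1e-7)) / 2e-7    # numerical derivative of phi
    rhs = -grad_x(x, phi(x)) / grad_y(x, phi(x))    # -grad_x f * (grad_y f)^{-1}
    print(np.isclose(lhs, rhs, atol=1e-6))          # True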

As a final word of caution to the reader, let us mention that one can
easily get confused with gradient notation and its use in various formulas,
such as for example the order of multiplication of various gradients in the
chain rule and the Implicit Function Theorem. Perhaps the safest guideline
to minimize errors is to remember our conventions:
(a) A vector is viewed as a column vector (an n × 1 matrix).
(b) The gradient ∇f of a scalar function f : <n 7→ < is also viewed as a
column vector.
(c) The gradient matrix ∇f of a vector function f : <n 7→ <m with
components f1 , . . . , fm is the n × m matrix whose columns are the
(column) vectors ∇f1 , . . . , ∇fm .
With these rules in mind one can use “dimension matching” as an effective
guide to writing correct formulas quickly.

1.2 CONVEX SETS AND FUNCTIONS

1.2.1 Basic Properties

The notion of a convex set is defined below and is illustrated in Fig. 1.2.1.

(Figure: the segment αx + (1 − α)y, 0 < α < 1, between points x and y; left panel: Convex Sets, right panel: Nonconvex Sets.)

Figure 1.2.1. Illustration of the definition of a convex set. For convexity, linear
interpolation between any two points of the set must yield points that lie within
the set.

Definition 1.2.1: Let C be a subset of <n . We say that C is convex


if
αx + (1 − α)y ∈ C, ∀ x, y ∈ C, ∀ α ∈ [0, 1]. (1.2)

Note that the empty set is by convention considered to be convex.


Generally, when referring to a convex set, it will usually be apparent from
the context whether this set can be empty, but we will often be specific in
order to minimize ambiguities.
The following proposition lists some operations that preserve convex-
ity of a set.

Proposition 1.2.1:
(a) The intersection ∩i∈I Ci of any collection {Ci | i ∈ I} of convex
sets is convex.
(b) The vector sum C1 + C2 of two convex sets C1 and C2 is convex.
(c) The set x + λC is convex for any convex set C, vector x, and
scalar λ. Furthermore, if C is a convex set and λ1, λ2 are positive
scalars, we have

(λ1 + λ2 )C = λ1 C + λ2 C.

(d) The closure and the interior of a convex set are convex.

(e) The image and the inverse image of a convex set under an affine
function are convex.

Proof: The proof is straightforward using the definition of convexity, cf.


Eq. (1.2). For example, to prove part (a), we take two points x and y
from ∩i∈I Ci , and we use the convexity of Ci to argue that the line segment
connecting x and y belongs to all the sets Ci , and hence, to their intersec-
tion. The proofs of parts (b)-(e) are similar and are left as exercises for the
reader. Q.E.D.

A set C is said to be a cone if for all x ∈ C and λ > 0, we have


λx ∈ C. A cone need not be convex and need not contain the origin
(although the origin always lies in its closure). Several of the results of the
above proposition have analogs for cones (see the exercises).

Convex Functions

The notion of a convex function is defined below and is illustrated in Fig.


1.2.2.

Definition 1.2.2: Let C be a convex subset of <n . A function f :


C 7→ < is called convex if
f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y),    ∀ x, y ∈ C, ∀ α ∈ [0, 1].    (1.3)
The function f is called concave if −f is convex. The function f is
called strictly convex if the above inequality is strict for all x, y ∈ C
with x 6= y, and all α ∈ (0, 1). For a function f : X 7→ <, we also say
that f is convex over the convex set C if the domain X of f contains
C and Eq. (1.3) holds, i.e., when the domain of f is restricted to C, f
becomes convex.


Figure 1.2.2. Illustration of the definition of a function that is convex over a


convex set C. The linear interpolation αf (x) + (1 − α)f (y) overestimates the
function value f (αx + (1 − α)y) for all α ∈ [0, 1].

If C is a convex set and f : C 7→ < is a convex function, the level sets {x ∈ C | f(x) ≤ γ} and {x ∈ C | f(x) < γ} are convex for all scalars γ. To see this, note that if x, y ∈ C are such that f(x) ≤ γ and f(y) ≤ γ, then for any α ∈ [0, 1], we have αx + (1 − α)y ∈ C, by the convexity of C, and f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) ≤ γ, by the convexity of f. However, the converse is not true; for example, the function f(x) = √|x| has convex level sets but is not convex.
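A short numerical check of this example (an illustration added here, not part of the original text):

    import numpy as np

    f = lambda x: np.sqrt(np.abs(x))

    # The level set {x | f(x) <= gamma} = [-gamma**2, gamma**2] is an interval, hence convex.
    gamma = 1.5
    print(-gamma**2, gamma**2)

    # But the convexity inequality (1.3) fails, e.g., at x = 0, y = 1, alpha = 0.5:
    x, y, alpha = 0.0, 1.0, 0.5
    lhs = f(alpha * x + (1 - alpha) * y)      # f(0.5) ~ 0.707
    rhs = alpha * f(x) + (1 - alpha) * f(y)   # 0.5
    print(lhs, rhs, lhs <= rhs)               # prints False for the last entry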
Unless otherwise indicated, we implicitly assume that a convex func-
tion is real-valued and is defined over the entire Euclidean space (rather
than over just a convex subset). We occasionally deal with extended real-
valued convex functions that can take the value of ∞ or can take the value −∞ (but never with functions that can take both values −∞ and ∞). A function f mapping a convex set C ⊂ <n into (−∞, ∞] is also called convex if the condition

f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y),    ∀ x, y ∈ C, ∀ α ∈ [0, 1]

holds. It can again be seen that if f is convex, the level sets {x ∈ C |


f (x) ≤ γ} and {x ∈ C | f (x) < γ} are convex for all scalars γ. The
effective domain of such a convex function f is the convex set

dom(f) = {x ∈ C | f(x) < ∞}.

By replacing the domain of an extended real-valued convex function with


its effective domain, we can convert it to a real-valued function. In this way,
we can use results stated in terms of real-valued functions, and we can also
avoid calculations with ∞. Thus, the entire subject of convex functions
can be developed without resorting to extended real-valued functions. The
reverse is also true, namely that extended real-valued functions can be
adopted as the norm; for example, the classical treatment of Rockafellar
[Roc70] uses this approach. We will adopt a flexible approach, generally
preferring to avoid extended real-valued functions, unless there are strong
notational or other advantages for doing so.
An extended real-valued function f : X 7→ (−∞, ∞] is called lower
semicontinuous at a vector x ∈ X if f (x) ≤ lim inf k→∞ f(xk ) for every
sequence {xk } converging to x. This definition is consistent with the cor-
responding definition for real-valued functions [cf. Def. 1.1.5(c)]. If f is
lower semicontinuous at every x in a subset U ⊂ X, we say that f is lower
semicontinuous over U .
The epigraph of a function f : X 7→ (−∞, ∞], where X ⊂ <n , is the
subset of <n+1 given by

epi(f) = {(x, w) | x ∈ X, w ∈ <, f(x) ≤ w};

(see Fig. 1.2.3). Note that if we restrict f to its effective domain {x ∈ X | f(x) < ∞}, so that it becomes real-valued, the epigraph remains
unaffected. Epigraphs are useful for our purposes because of the follow-
ing proposition, which shows that questions about convexity and lower
semicontinuity of functions can be reduced to corresponding questions of
convexity and closure of their epigraphs.

Figure 1.2.3. Illustration of the epigraph of a convex function and a nonconvex


function f : X 7→ (−∞, ∞].

Proposition 1.2.2: Let f : X 7→ (−∞, ∞] be a function. Then:


(a) epi(f ) is convex if and only if the set X is convex and f is convex
over X.
(b) Assuming X = <n , the following are equivalent:
(i) epi(f ) is closed.
(ii) f is lower semicontinuous over <n .
(iii) The level sets {x | f(x) ≤ γ} are closed for all scalars γ.

Proof: (a) Assume that X is convex and f is convex over X. If (x1, w1 )


and (x2 , w2 ) belong to epi(f ) and α ∈ [0, 1], we have

f (x1 ) ≤ w1 , f(x2 ) ≤ w2 ,

and by multiplying these inequalities with α and (1 − α), respectively, by


adding, and by using the convexity of f , we obtain
f(αx1 + (1 − α)x2) ≤ αf(x1) + (1 − α)f(x2) ≤ αw1 + (1 − α)w2.

Hence the vector (αx1 + (1 − α)x2, αw1 + (1 − α)w2), which is equal to α(x1, w1) + (1 − α)(x2, w2), belongs to epi(f), showing the convexity of epi(f).
Conversely, assume that epi(f) is convex, and let x1, x2 ∈ X and α ∈ [0, 1]. The pairs (x1, f(x1)) and (x2, f(x2)) belong to epi(f), so by convexity, we have

(αx1 + (1 − α)x2, αf(x1) + (1 − α)f(x2)) ∈ epi(f).

Therefore, by the definition of epi(f), it follows that αx1 + (1 − α)x2 ∈ X, so X is convex, while

f(αx1 + (1 − α)x2) ≤ αf(x1) + (1 − α)f(x2),

so f is convex over X.
(b) We first show that (i) and (ii) are equivalent. Assume that f is
lower semicontinuous over <n , and let (x, w) be the limit of a sequence
{(xk , wk )} ⊂ epi(f ). We have f (xk ) ≤ wk , and by taking limit as k →
∞ and by using the lower semicontinuity of f at x, we obtain f (x) ≤
lim inf k→∞ f (xk ) ≤ w. Hence (x, w) ∈ epi(f ) and epi(f ) is closed.
Conversely, assume that epi(f ) is closed, choose any x ∈ <n , let {xk }
be a sequence converging to x, and let w = lim inf k→∞ f(xk ). We will
show that f(x) ≤ w. Indeed, if w = ∞, we have f(x) ≤ w. If w < ∞, for each positive integer n, let wn = w + 1/n, and let k(n) be an integer such that k(n) ≥ n and f(xk(n)) ≤ wn. The sequence {(xk(n), wn)} belongs to epi(f) and converges to (x, w), so by the closure of epi(f), we must have f(x) ≤ w. Thus, f is lower semicontinuous at x.
We next show that (i) implies (iii), and that (iii) implies (ii). Assume that epi(f) is closed and let {xk} be a sequence that converges to some x and belongs to the level set {z | f(z) ≤ γ}, where γ is a scalar. Then (xk, γ) ∈ epi(f) for all k, and by closure of epi(f), we have (x, γ) ∈ epi(f). Hence x belongs to the level set {x | f(x) ≤ γ}, implying that this set is closed. Therefore (i) implies (iii).
Finally, assume that the level sets {x | f(x) ≤ γ} are closed, fix an x, and let {xk} be a sequence converging to x. If lim inf_{k→∞} f(xk) < ∞, then for each γ with lim inf_{k→∞} f(xk) < γ and all sufficiently large k, we have f(xk) < γ. From the closure of the level sets {x | f(x) ≤ γ}, it follows that x belongs to all the level sets with lim inf_{k→∞} f(xk) < γ, implying that f(x) ≤ lim inf_{k→∞} f(xk), and that f is lower semicontinuous at x. Therefore, (iii) implies (ii). Q.E.D.

If the epigraph of a function f : X 7→ (−∞, ∞] is a closed set, we say that f is a closed function. Thus, if we extend the domain of f to <n and consider the function f˜ given by

f˜(x) = f(x) if x ∈ X,    f˜(x) = ∞ if x ∉ X,

we see that according to the preceding proposition, f is closed if and only if f˜ is lower semicontinuous over <n.
Common examples of convex functions are affine functions and norms;
this is straightforward to verify, using the definition of convexity. For ex-
ample, for any x, y ∈ <n and any α ∈ [0, 1], we have by using the triangle
inequality,
kαx + (1 − α)yk ≤ kαxk + k(1 − α)yk = αkxk + (1 − α)kyk,

so the norm function k · k is convex. The following proposition provides


some means for recognizing convex functions, and gives some algebraic
operations that preserve convexity of a function.

Proposition 1.2.3:
(a) Let f1 , . . . , fm : <n 7→ (−∞, ∞] be given functions, let λ1 , . . . , λm
be positive scalars, and consider the function g : <n 7→ (−∞, ∞]
given by
g(x) = λ1 f1 (x) + · · · + λm fm (x).
If f1 , . . . , fm are convex, then g is also convex, while if f1 , . . . , fm
are closed, then g is also closed.
(b) Let f : <m 7→ (−∞, ∞] be a given function, let A be an m × n matrix, and consider the function g : <n 7→ (−∞, ∞] given by

g(x) = f(Ax).

If f is convex, then g is also convex, while if f is closed, then g


is also closed.
(c) Let fi : <n 7→ (−∞, ∞] be given functions for i ∈ I, where I is
an index set, and consider the function g : <n 7→ (−∞, ∞] given
by
g(x) = sup fi (x).
i∈I

If fi , i ∈ I, are convex, then g is also convex, while if fi , i ∈ I,


are closed, then g is also closed.

Proof: (a) Let f1, . . . , fm be convex. We use the definition of convexity to write for any x, y ∈ <n and α ∈ [0, 1],

g(αx + (1 − α)y) = ∑_{i=1}^m λi fi(αx + (1 − α)y)
                 ≤ ∑_{i=1}^m λi (αfi(x) + (1 − α)fi(y))
                 = α ∑_{i=1}^m λi fi(x) + (1 − α) ∑_{i=1}^m λi fi(y)
                 = αg(x) + (1 − α)g(y).

Hence g is convex.
Let f1, . . . , fm be closed. Then the fi are lower semicontinuous at every x ∈ <n [cf. Prop. 1.2.2(b)], so for every sequence {xk} converging to x, we have fi(x) ≤ lim inf_{k→∞} fi(xk) for all i. Hence

g(x) ≤ ∑_{i=1}^m λi lim inf_{k→∞} fi(xk) ≤ lim inf_{k→∞} ∑_{i=1}^m λi fi(xk) = lim inf_{k→∞} g(xk),

where we have used Prop. 1.1.4(d) (the sum of the limit inferiors of sequences is less than or equal to the limit inferior of the sum sequence). Therefore, g is lower semicontinuous at all x ∈ <n, so by Prop. 1.2.2(b), it is closed.
(b) This is straightforward, along the lines of the proof of part (a).
(c) A pair (x, w) belongs to the epigraph

epi(g) = {(x, w) | g(x) ≤ w}

if and only if fi(x) ≤ w for all i ∈ I, or (x, w) ∈ ∩i∈I epi(fi). Therefore,

epi(g) = ∩i∈I epi(fi).

If the fi are convex, the epigraphs epi(fi ) are convex, so epi(g) is convex,
and g is convex. If the fi are closed, then the epigraphs epi(fi ) are closed,
so epi(g) is closed, and g is closed. Q.E.D.
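As a computational illustration of part (c) (added here; the random data are arbitrary), the pointwise maximum of finitely many affine functions can be spot-checked against the convexity inequality:

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((5, 3))       # five affine functions on R^3
    b = rng.standard_normal(5)
    g = lambda x: np.max(A @ x + b)       # pointwise maximum, convex by Prop. 1.2.3(c)

    # Spot-check the defining inequality at random points and interpolation weights.
    for _ in range(1000):
        x, y = rng.standard_normal(3), rng.standard_normal(3)
        a = rng.random()
        assert g(a * x + (1 - a) * y) <= a * g(x) + (1 - a) * g(y) + 1e-12
    print("no violations found")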

Characterizations of Differentiable Convex Functions

For differentiable functions, there is an alternative characterization of con-


vexity, given in the following proposition and illustrated in Fig. 1.2.4.

Proposition 1.2.4: Let C ⊂ <n be a convex set and let f : <n 7→ <
be differentiable over <n .
(a) f is convex over C if and only if

f (z) ≥ f(x) + (z − x)0 ∇f (x), ∀ x, z ∈ C. (1.4)

(b) f is strictly convex over C if and only if the above inequality is


strict whenever x 6= z.

Proof: We prove (a) and (b) simultaneously. Assume that the inequality
(1.4) holds. Choose any x, y ∈ C and α ∈ [0, 1], and let z = αx + (1 − α)y.
Using the inequality (1.4) twice, we obtain

f(x) ≥ f (z) + (x − z)0 ∇f (z),



Figure 1.2.4. Characterization of convexity in terms of first derivatives. The


condition f(z) ≥ f (x) + (z − x)0 ∇f (x) states that a linear approximation, based
on the first order Taylor series expansion, underestimates a convex function.

f (y) ≥ f (z) + (y − z)0 ∇f (z).


We multiply the first inequality by α, the second by (1 − α), and add them
to obtain
αf(x) + (1 − α)f(y) ≥ f(z) + (αx + (1 − α)y − z)0 ∇f(z) = f(z),

which proves that f is convex. If the inequality (1.4) is strict as stated in


part (b), then if we take x 6= y and α ∈ (0, 1) above, the three preceding
inequalities become strict, thus showing the strict convexity of f.
Conversely, assume that f is convex, let x and z be any vectors in C with x ≠ z, and for α ∈ (0, 1], consider the function

g(α) = ( f(x + α(z − x)) − f(x) ) / α,    α ∈ (0, 1].

We will show that g(α) is monotonically decreasing with α, and is strictly monotonically decreasing if f is strictly convex. This will imply that

(z − x)0 ∇f(x) = lim_{α↓0} g(α) ≤ g(1) = f(z) − f(x),

with strict inequality if g is strictly monotonically decreasing, thereby showing that the desired inequality (1.4) holds (and holds strictly if f is strictly convex). Indeed, consider any α1, α2, with 0 < α1 < α2 < 1, and let

ᾱ = α1/α2,    z̄ = x + α2(z − x).    (1.5)

We have

f(x + ᾱ(z̄ − x)) ≤ ᾱf(z̄) + (1 − ᾱ)f(x),

or

( f(x + ᾱ(z̄ − x)) − f(x) ) / ᾱ ≤ f(z̄) − f(x),    (1.6)

and the above inequalities are strict if f is strictly convex. Substituting the definitions (1.5) in Eq. (1.6), we obtain after a straightforward calculation

( f(x + α1(z − x)) − f(x) ) / α1 ≤ ( f(x + α2(z − x)) − f(x) ) / α2,

or

g(α1) ≤ g(α2),

with strict inequality if f is strictly convex. Hence g is monotonically


decreasing with α, and strictly so if f is strictly convex. Q.E.D.

Note a simple consequence of Prop. 1.2.4(a): if f : <n 7→ < is a convex


function and ∇f(x∗ ) = 0, then x∗ minimizes f over <n . This is a classical
sufficient condition for unconstrained optimality, originally formulated (in
one dimension) by Fermat in 1637.
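A numerical spot-check of inequality (1.4) for a particular smooth convex function (an added illustration; the function below is our own choice):

    import numpy as np

    a = np.array([1.0, -2.0, 0.5])
    f = lambda x: np.log1p(np.exp(a @ x)) + 0.5 * x @ x      # smooth and convex
    grad = lambda x: a / (1.0 + np.exp(-(a @ x))) + x        # its gradient

    rng = np.random.default_rng(2)
    for _ in range(1000):
        x, z = rng.standard_normal(3), rng.standard_normal(3)
        # The linearization at x underestimates f at z, as in Eq. (1.4).
        assert f(z) >= f(x) + grad(x) @ (z - x) - 1e-10
    print("inequality (1.4) holds at all sampled pairs")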
For twice differentiable convex functions, there is another characteri-
zation of convexity as shown by the following proposition.

Proposition 1.2.5: Let C ⊂ <n be a convex set and let f : <n 7→ <
be twice continuously differentiable over <n .
(a) If ∇2 f (x) is positive semidefinite for all x ∈ C, then f is convex
over C.
(b) If ∇2 f (x) is positive definite for all x ∈ C, then f is strictly
convex over C.
(c) If C = <n and f is convex, then ∇2 f(x) is positive semidefinite
for all x.

Proof: (a) By Prop. 1.1.21(b), for all x, y ∈ C we have

f(y) = f(x) + (y − x)0 ∇f(x) + (1/2)(y − x)0 ∇2 f(x + α(y − x))(y − x)

for some α ∈ [0, 1]. Therefore, using the positive semidefiniteness of ∇2 f ,


we obtain
f(y) ≥ f (x) + (y − x)0 ∇f(x), ∀ x, y ∈ C.
From Prop. 1.2.4(a), we conclude that f is convex.
(b) Similar to the proof of part (a), we have f(y) > f (x) + (y − x)0 ∇f (x)
for all x, y ∈ C with x 6= y, and the result follows from Prop. 1.2.4(b).

(c) Suppose that f : <n 7→ < is convex and suppose, to obtain a con-
tradiction, that there exist some x ∈ <n and some z ∈ <n such that
z 0 ∇2f (x)z < 0. Using the continuity of ∇2 f , we see that we can choose the
norm of z to be small enough so that z 0 ∇2 f (x+αz)z < 0 for every α ∈ [0, 1].
Then, using again Prop. 1.1.21(b), we obtain f (x + z) < f (x) + z 0 ∇f (x),
which, in view of Prop. 1.2.4(a), contradicts the convexity of f . Q.E.D.

If f is convex over a strict subset C ⊂ <n , it is not necessarily true


that ∇2 f(x) is positive semidefinite at any point of C [take for example n = 2, C = {(x1, 0) | x1 ∈ <}, and f(x) = x1² − x2²]. A strengthened
version of Prop. 1.2.5 is given in the exercises. It can be shown that the
conclusion of Prop. 1.2.5(c) also holds if C is assumed to have nonempty
interior instead of being equal to <n .
The following proposition considers a strengthened form of strict convexity characterized by the following condition:

(∇f(x) − ∇f(y))0 (x − y) ≥ αkx − yk²,    ∀ x, y ∈ <n,    (1.7)

where α is some positive number. Convex functions with this property are called strongly convex with coefficient α.

Proposition 1.2.6: (Strong Convexity) Let f : <n 7→ < be


smooth. If f is strongly convex with coefficient α, then f is strictly
convex. Furthermore, if f is twice continuously differentiable, then
strong convexity of f with coefficient α is equivalent to the positive
semidefiniteness of ∇2 f(x) − αI for every x ∈ <n , where I is the
identity matrix.

Proof: Fix some x, y ∈ <n such that x ≠ y, and define the function h : < 7→ < by h(t) = f(x + t(y − x)). Consider some t, t′ ∈ < such that t < t′. Using the chain rule and Eq. (1.7), we have

( (dh/dt)(t′) − (dh/dt)(t) ) (t′ − t) = ( ∇f(x + t′(y − x)) − ∇f(x + t(y − x)) )0 (y − x)(t′ − t) ≥ α(t′ − t)² kx − yk² > 0.

Thus, dh/dt is strictly increasing and for any t ∈ (0, 1), we have

(h(t) − h(0))/t = (1/t) ∫_0^t (dh/dτ)(τ) dτ < (1/(1 − t)) ∫_t^1 (dh/dτ)(τ) dτ = (h(1) − h(t))/(1 − t).

Equivalently, th(1) + (1 − t)h(0) > h(t). The definition of h yields tf(y) + (1 − t)f(x) > f(ty + (1 − t)x). Since this inequality has been proved for arbitrary t ∈ (0, 1) and x ≠ y, we conclude that f is strictly convex.

Suppose now that f is twice continuously differentiable and Eq. (1.7) holds. Let c be a scalar. We use Prop. 1.1.21(b) twice to obtain

f(x + cy) = f(x) + cy0 ∇f(x) + (c²/2) y0 ∇2 f(x + tcy)y,

and

f(x) = f(x + cy) − cy0 ∇f(x + cy) + (c²/2) y0 ∇2 f(x + scy)y,

for some t and s belonging to [0, 1]. Adding these two equations and using Eq. (1.7), we obtain

(c²/2) y0 (∇2 f(x + scy) + ∇2 f(x + tcy)) y = (∇f(x + cy) − ∇f(x))0 (cy) ≥ αc² kyk².

We divide both sides by c² and then take the limit as c → 0 to conclude that y0 ∇2 f(x)y ≥ αkyk². Since this inequality is valid for every y ∈ <n, it follows that ∇2 f(x) − αI is positive semidefinite.
For the converse, assume that ∇2 f(x) − αI is positive semidefinite for all x ∈ <n. Consider the function g : < 7→ < defined by

g(t) = ∇f(tx + (1 − t)y)0 (x − y).

Using the Mean Value Theorem (Prop. 1.1.20), we have (∇f(x) − ∇f(y))0 (x − y) = g(1) − g(0) = (dg/dt)(t) for some t ∈ [0, 1]. The result follows because

(dg/dt)(t) = (x − y)0 ∇2 f(tx + (1 − t)y)(x − y) ≥ αkx − yk²,

where the last inequality is a consequence of the positive semidefiniteness of ∇2 f(tx + (1 − t)y) − αI. Q.E.D.

As an example, consider the quadratic function

f (x) = x0 Qx,

where Q is a symmetric matrix. By Prop. 1.2.5, the function f is convex


if and only if Q is positive semidefinite. Furthermore, by Prop. 1.2.6, f is strongly convex with coefficient α if and only if ∇2 f(x) − αI = 2Q − αI is positive semidefinite. Thus f is strongly convex with some
positive coefficient (as well as strictly convex) if and only if Q is positive
definite.
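These eigenvalue criteria are easy to check numerically; the following sketch (added for illustration, with an arbitrary Q) uses the fact that ∇2 f(x) = 2Q:

    import numpy as np

    Q = np.array([[2.0, 0.5],
                  [0.5, 1.0]])              # symmetric matrix
    f = lambda x: x @ Q @ x

    eigs = np.linalg.eigvalsh(Q)            # eigenvalues of the symmetric matrix Q
    print(eigs)                             # all positive here, so Q is positive definite
    # f is convex iff min(eigs) >= 0, and strongly convex with any coefficient
    # alpha <= 2*min(eigs), since 2Q - alpha*I must be positive semidefinite.
    print("convex:", eigs.min() >= 0, " largest strong convexity coefficient:", 2 * eigs.min())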

1.2.2 Convex and Affine Hulls

Let X be a subset of <n. A convex combination of elements of X is a vector of the form ∑_{i=1}^m αi xi, where m is a positive integer, x1, . . . , xm belong to X, and α1, . . . , αm are scalars such that

αi ≥ 0,  i = 1, . . . , m,    ∑_{i=1}^m αi = 1.

Note that if X is convex, then the convex combination ∑_{i=1}^m αi xi belongs to X (this is easily shown by induction; see the exercises), and for any function f : <n 7→ < that is convex over X, we have

f(∑_{i=1}^m αi xi) ≤ ∑_{i=1}^m αi f(xi).    (1.8)

This follows by using repeatedly the definition of convexity. The preceding


relation is a special case of Jensen’s inequality and can be used to prove a
number of interesting inequalities in applied mathematics and probability
theory.
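A quick numerical verification of Eq. (1.8) (an added illustration with an arbitrary convex function and random weights):

    import numpy as np

    rng = np.random.default_rng(3)
    f = lambda x: np.exp(x) + x**2                    # a convex function of one variable

    xs = rng.standard_normal(20)                      # points x_1, ..., x_m
    alphas = rng.random(20)
    alphas /= alphas.sum()                            # convex combination weights

    lhs = f(alphas @ xs)                              # f(sum_i alpha_i x_i)
    rhs = alphas @ f(xs)                              # sum_i alpha_i f(x_i)
    print(lhs <= rhs + 1e-12)                         # Jensen's inequality holds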
The convex hull of a set X , denoted conv(X), is the intersection of
all convex sets containing X, and is a convex set by Prop. 1.2.1(a). It is
straightforward to verify that the set of all convex combinations of elements
of X is convex, and is equal to the convex hull conv(X) (see the exercises).
In particular, if X consists of a finite number of vectors x1, . . . , xm, its convex hull is

conv({x1, . . . , xm}) = { ∑_{i=1}^m αi xi | αi ≥ 0, i = 1, . . . , m,  ∑_{i=1}^m αi = 1 }.

We recall that an affine set M is a set of the form x + S, where S


is a subspace, called the subspace parallel to M . If X is a subset of <n ,
the affine hull of X, denoted aff(X), is the intersection of all affine sets
containing X. Note that aff(X) is itself an affine set and that it contains
conv(X). It can be seen that the affine hull of X , the affine hull of the
convex hull conv(X), and the affine hull of the closure cl(X) coincide (see
the exercises). For a convex set C, the dimension of C is defined to be the
dimension of aff(C).
Given a subset X ⊂ <n, a nonnegative combination of elements of X is a vector of the form ∑_{i=1}^m αi xi, where m is a positive integer, x1, . . . , xm belong to X, and α1, . . . , αm are nonnegative scalars. If the scalars αi are all positive, the combination ∑_{i=1}^m αi xi is said to be positive. The cone generated by X, denoted by cone(X), is the set of nonnegative combinations of elements of X. It is easily seen that cone(X) is a convex cone, although

it need not be closed [cone(X) can be shown to be closed in special cases,


such as when X is a finite set – this is one of the central results of polyhedral
convexity and will be shown in Section 1.6].
The following is a fundamental characterization of convex hulls.

Proposition 1.2.7: (Caratheodory’s Theorem) Let X be a sub-


set of <n .
(a) Every x in conv(X) can be represented as a convex combination
of vectors x1 , . . . , xm ∈ X such that x2 − x1 , . . . , xm − x1 are
linearly independent, where m is a positive integer with m ≤
n + 1.
(b) Every x in cone(X) can be represented as a positive combination
of vectors x1 , . . . , xm ∈ X that are linearly independent, where
m is a positive integer with m ≤ n.

Proof: (a) Let x be a vector in the convex hull of X, and let m be the smallest integer such that x has the form ∑_{i=1}^m αi xi, where ∑_{i=1}^m αi = 1, αi > 0, and xi ∈ X for all i = 1, . . . , m. The m − 1 vectors x2 −
x1 , . . . , xm − x1 belong to the subspace parallel to aff(X). Assume, to
arrive at a contradiction, that these vectors are linearly dependent. Then,
there must exist scalars λ2, . . . , λm, at least one of which is positive, such that

∑_{i=2}^m λi(xi − x1) = 0.

Letting µi = λi for i = 2, . . . , m and µ1 = −∑_{i=2}^m λi, we see that

∑_{i=1}^m µi xi = 0,    ∑_{i=1}^m µi = 0,

while at least one of the scalars µ2, . . . , µm is positive. Define

ᾱi = αi − γ̄µi,    i = 1, . . . , m,

where γ̄ > 0 is the largest γ such that αi − γµi ≥ 0 for all i. Then, since ∑_{i=1}^m µi xi = 0, we see that x is also represented as ∑_{i=1}^m ᾱi xi. Furthermore, in view of the choice of γ̄ and the fact ∑_{i=1}^m µi = 0, the coefficients ᾱi are nonnegative, sum to one, and at least one of them is zero. Thus,
x can be represented as a convex combination of fewer than m vectors
of X, contradicting our earlier assumption. It follows that the vectors
x2 − x1 , . . . , xm − x1 must be linearly independent, so that their number
must be at most n. Hence m ≤ n + 1.

(b) Let x be a nonzero vector in cone(X), and let m be the smallest integer such that x has the form ∑_{i=1}^m αi xi, where αi > 0 and xi ∈ X for all i = 1, . . . , m. If the vectors xi were linearly dependent, there would exist scalars λ1, . . . , λm, with ∑_{i=1}^m λi xi = 0 and at least one of the λi positive. Consider the linear combination ∑_{i=1}^m (αi − γ̄λi)xi, where γ̄ is the largest γ such that αi − γλi ≥ 0 for all i. This combination provides a representation of x as a positive combination of fewer than m vectors of X – a contradiction. Since any linearly independent set of vectors contains at most n elements, we must have m ≤ n. Q.E.D.
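The proof of part (a) is constructive, and the reduction it describes can be carried out numerically. The sketch below (added for illustration; the function name and tolerance are our own choices) repeatedly finds µ with ∑ µi xi = 0 and ∑ µi = 0 and subtracts γ̄µ, until at most n + 1 points carry positive weight.

    import numpy as np

    def caratheodory_reduce(points, alphas, tol=1e-12):
        # Reduce a convex combination of points in R^n to one that uses at most
        # n + 1 of the points, following the proof of Prop. 1.2.7(a).
        pts = [np.asarray(p, dtype=float) for p in points]
        a = np.asarray(alphas, dtype=float)
        n = pts[0].size
        while len(pts) > n + 1:
            X = np.column_stack(pts)                  # n x m matrix of the current points
            A = np.vstack([X, np.ones(len(pts))])     # extra row enforces sum(mu) = 0
            _, _, vh = np.linalg.svd(A)
            mu = vh[-1]                               # null-space vector: sum mu_i x_i = 0, sum mu_i = 0
            if mu.max() <= tol:                       # make sure some component is positive
                mu = -mu
            gamma = min(a[i] / mu[i] for i in range(len(pts)) if mu[i] > tol)   # the proof's gamma-bar
            a = a - gamma * mu                        # coefficients stay nonnegative, one becomes zero
            keep = a > tol
            pts = [p for p, k in zip(pts, keep) if k]
            a = a[keep]
        return pts, a / a.sum()                       # renormalize against round-off

    rng = np.random.default_rng(0)
    points = rng.standard_normal((10, 2))             # ten points in the plane
    w = rng.random(10); w /= w.sum()
    x = w @ points                                    # a point in their convex hull
    pts, a = caratheodory_reduce(points, w)
    print(len(pts), np.allclose(sum(ai * p for ai, p in zip(a, pts)), x))

With n = 2 the reduction terminates with at most three points, as the theorem guarantees.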

It is not generally true that the convex hull of a closed set is closed
[take for instance the convex hull of the set consisting of the origin and the
subset {(x1 , x2) | x1 x2 = 1, x1 ≥ 0, x2 ≥ 0} of <2 ]. We have, however, the
following.

Proposition 1.2.8: The convex hull of a compact set is compact.

Proof: Let X be a compact subset of <n. By Caratheodory's Theorem, a sequence in conv(X) can be expressed as {∑_{i=1}^{n+1} α_i^k x_i^k}, where for all k and i, α_i^k ≥ 0, x_i^k ∈ X, and ∑_{i=1}^{n+1} α_i^k = 1. Since the sequence

{(α_1^k, . . . , α_{n+1}^k, x_1^k, . . . , x_{n+1}^k)}

belongs to a compact set, it has a limit point (ᾱ1, . . . , ᾱ_{n+1}, x̄1, . . . , x̄_{n+1}) such that ∑_{i=1}^{n+1} ᾱi = 1, and for all i, ᾱi ≥ 0 and x̄i ∈ X. Thus, the vector ∑_{i=1}^{n+1} ᾱi x̄i, which belongs to conv(X), is a limit point of the sequence {∑_{i=1}^{n+1} α_i^k x_i^k}, showing that conv(X) is compact. Q.E.D.

1.2.3 Closure, Relative Interior, and Continuity

We now consider some generic topological properties of convex sets and


functions. Let C be a nonempty convex subset of <n . The closure cl(C)
of C is also a nonempty convex set (Prop. 1.2.1). While the interior of C
may be empty, it turns out that convexity implies the existence of interior
points relative to the affine hull of C. This is an important property, which
we now formalize.
We say that x is a relative interior point of C, if x ∈ C and there
exists a neighborhood N of x such that N ∩aff(C) ⊂ C, i.e., x is an interior
point of C relative to aff(C). The relative interior of C, denoted ri(C), is
the set of all relative interior points of C. For example, if C is a line
segment connecting two distinct points in the plane, then ri(C) consists of
all points of C except for the end points.

The following proposition gives some basic facts about relative interior
points.

Proposition 1.2.9: Let C be a nonempty convex set.


(a) (Line Segment Principle) If x ∈ ri(C) and x̄ ∈ cl(C), then all points on the line segment connecting x and x̄, except possibly x̄, belong to ri(C).
(b) (Nonemptiness of Relative Interior ) ri(C) is a nonempty and con-
vex set, and has the same affine hull as C. In fact, if m is the di-
mension of aff(C) and m > 0, there exist vectors x0 , x1 , . . . , xm ∈
ri(C) such that x1 − x0 , . . . , xm − x0 span the subspace parallel
to aff(C).
(c) x ∈ ri(C) if and only if every line segment in C having x as one endpoint can be prolonged beyond x without leaving C [i.e., for every x̄ ∈ C, there exists a γ > 1 such that x + (γ − 1)(x − x̄) ∈ C].

Proof: (a) In the case where x̄ ∈ C, see Fig. 1.2.5. In the case where x̄ ∉ C, to show that for any α ∈ (0, 1] we have xα = αx + (1 − α)x̄ ∈ ri(C), consider a sequence {xk} ⊂ C that converges to x̄, and let xk,α = αx + (1 − α)xk. Then, as in Fig. 1.2.5, we see that {z | kz − xk,α k < αε} ∩ aff(C) ⊂ C for all k. Since for large enough k, we have

{z | kz − xα k < αε/2} ⊂ {z | kz − xk,α k < αε},

it follows that {z | kz − xα k < αε/2} ∩ aff(C) ⊂ C, which shows that xα ∈ ri(C).
(b) Convexity of ri(C) follows from the line segment principle of part (a).
By using a translation argument if necessary, we assume without loss of
generality that 0 ∈ C. Then, the affine hull of C is a subspace of dimension
m. If m = 0, then C and aff(C) consist of a single point, which is a unique
relative interior point. If m > 0, we can find m linearly independent vectors
z1 , . . . , zm in C that span aff(C); otherwise there would exist r < m linearly
independent vectors in C whose span contains C, contradicting the fact that
the dimension of aff(C) is m. Thus z1, . . . , zm form a basis for aff(C).
Consider the set

X = { x | x = ∑_{i=1}^m αi zi,  ∑_{i=1}^m αi < 1,  αi > 0, i = 1, . . . , m }

(see Fig. 1.2.6). This set is open relative to aff(C); that is, for every x ∈ X,
there exists an open set N such that x ∈ N and N ∩ aff(C) ⊂ X. [To see

Figure 1.2.5. Proof of the line segment principle for the case where x̄ ∈ C. Since x ∈ ri(C), there exists a sphere S = {z | kz − xk < ε} such that S ∩ aff(C) ⊂ C. For all α ∈ (0, 1], let xα = αx + (1 − α)x̄ and let Sα = {z | kz − xα k < αε}. It can be seen that each point of Sα ∩ aff(C) is a convex combination of x̄ and some point of S ∩ aff(C). Therefore, Sα ∩ aff(C) ⊂ C, implying that xα ∈ ri(C).

this, note that X is the inverse image of the open set in <m

{ (α1, . . . , αm) | ∑_{i=1}^m αi < 1,  αi > 0, i = 1, . . . , m }

under the linear transformation from aff(C) to <m that maps ∑_{i=1}^m αi zi into (α1, . . . , αm); openness of the above set follows by continuity of the linear transformation.] Therefore all points of X are relative interior points of
C, and ri(C) is nonempty. Since by construction, aff(X) = aff(C) and
X ⊂ ri(C), it follows that ri(C) and C have the same affine hull.
To show the last assertion of part (b), consider vectors

x0 = α ∑_{i=1}^m zi,    xi = x0 + αzi,  i = 1, . . . , m,

where α is a positive scalar such that α(m + 1) < 1. The vectors x0, . . . , xm
are in the set X and in the relative interior of C, since X ⊂ ri(C). Further-
more, because xi − x0 = αzi for all i and vectors z1 , . . . , zm span aff(C),
the vectors x1 − x0 , . . . , xm − x0 also span aff(C).
(c) If x ∈ ri(C), the condition given clearly holds. Conversely, let x satisfy the given condition. We will show that x ∈ ri(C). By part (b), there exists a vector x̄ ∈ ri(C). We may assume that x̄ ≠ x, since otherwise we are done. By the given condition, since x̄ is in C, there is a γ > 1 such that y = x + (γ − 1)(x − x̄) ∈ C. Then we have x = (1 − α)x̄ + αy, where α = 1/γ ∈ (0, 1), so by part (a), we obtain x ∈ ri(C). Q.E.D.

Figure 1.2.6. Construction of the relatively open set X in the proof of nonempti-
ness of the relative interior of a convex set C that contains the origin. We choose
m linearly independent vectors z1 , . . . , zm ∈ C, where m is the dimension of
aff(C), and let

X = { ∑_{i=1}^m αi zi | ∑_{i=1}^m αi < 1,  αi > 0, i = 1, . . . , m }.

In view of Prop. 1.2.9(b), C and ri(C) have the same dimension.
It can also be shown that C and cl(C) have the same dimension (see the
exercises). The next proposition gives several properties of closures and
relative interiors of convex sets.

Proposition 1.2.10: Let C be a nonempty convex set.

(a) cl(C) = cl(ri(C)).
(b) ri(C) = ri(cl(C)).
(c) Let C̄ be another nonempty convex set. Then the following three conditions are equivalent:
(i) C and C̄ have the same relative interior.
(ii) C and C̄ have the same closure.
(iii) ri(C) ⊂ C̄ ⊂ cl(C).
(d) A · ri(C) = ri(A · C) for all m × n matrices A.
(e) If C is bounded, then A · cl(C) = cl(A · C) for all m × n matrices A.

Proof: (a) Since ri(C) ⊂ C, we have cl(ri(C)) ⊂ cl(C). Conversely, let x̄ ∈ cl(C). We will show that x̄ ∈ cl(ri(C)). Let x be any point in ri(C) [there exists such a point by Prop. 1.2.9(b)], and assume that x ≠ x̄ (otherwise we are done). By the line segment principle [Prop. 1.2.9(a)], we have αx + (1 − α)x̄ ∈ ri(C) for all α ∈ (0, 1]. Thus, x̄ is the limit of the sequence {(1/k)x + (1 − 1/k)x̄ | k ≥ 1}, which lies in ri(C), so x̄ ∈ cl(ri(C)).
(b) Since C ⊂ cl(C), we must have ri(C) ⊂ ri(cl(C)). To prove the reverse inclusion, let z ∈ ri(cl(C)). We will show that z ∈ ri(C). By Prop. 1.2.9(b), there exists an x ∈ ri(C). We may assume that x ≠ z (otherwise we are done). We choose γ > 1, with γ sufficiently close to 1, so that the vector y = z + (γ − 1)(z − x) belongs to ri(cl(C)) [cf. Prop. 1.2.9(c)], and hence also to cl(C). Then we have z = (1 − α)x + αy, where α = 1/γ ∈ (0, 1), so by the line segment principle [Prop. 1.2.9(a)], we obtain z ∈ ri(C).
(c) If ri(C) = ri(C̄), part (a) implies that cl(C) = cl(C̄). Similarly, if cl(C) = cl(C̄), part (b) implies that ri(C) = ri(C̄). Thus, conditions (i) and (ii) are equivalent. Furthermore, if these conditions hold, the relation ri(C̄) ⊂ C̄ ⊂ cl(C̄) implies condition (iii). Finally, assume that condition (iii) holds. Then by taking closures, we have cl(ri(C)) ⊂ cl(C̄) ⊂ cl(C), and by using part (a), we obtain cl(C) ⊂ cl(C̄) ⊂ cl(C). Hence C and C̄ have the same closure, and, by the equivalence already shown, the same relative interior.
(d) For any set X, we have A · cl(X) ⊂ cl(A · X), since if a sequence
{xk } ⊂ X converges to some x ∈ cl(X) then the sequence {Axk } ⊂ A · X
converges to Ax, implying that Ax ∈ cl(A · X). We use this fact and part
(a) to write
A · ri(C) ⊂ A · C ⊂ A · cl(C) = A · cl(ri(C)) ⊂ cl(A · ri(C)).

Thus A·C lies between the set A·ri(C) and the closure of that set, implying
that the relative interiors of the sets A · C and A · ri(C) are equal [part (c)].
Hence ri(A · C) ⊂ A · ri(C). We will show the reverse inclusion by taking any z ∈ A · ri(C) and showing that z ∈ ri(A · C). Let x be any vector in A · C, and let z̄ ∈ ri(C) and x̄ ∈ C be such that Az̄ = z and Ax̄ = x. By Prop. 1.2.9(c), there exists γ > 1 such that the vector y = z̄ + (γ − 1)(z̄ − x̄) belongs to C. Thus we have Ay ∈ A · C and Ay = z + (γ − 1)(z − x), so by Prop. 1.2.9(c) it follows that z ∈ ri(A · C).
(e) By the argument given in part (d), we have A · cl(C) ⊂ cl(A · C). To show the converse, choose any x ∈ cl(A · C). Then, there exists a sequence {xk} ⊂ C such that Axk → x. Since C is bounded, {xk} has a subsequence that converges to some x̄ ∈ cl(C), and we must have Ax̄ = x. It follows that x ∈ A · cl(C). Q.E.D.

Note that if C is closed but unbounded, the set A · C need not be closed [cf. part (e) of the above proposition]. For example, take the closed set C = {(x1, x2) | x1 x2 ≥ 1, x1 ≥ 0, x2 ≥ 0} and let A have the effect of projecting the typical vector x on the horizontal axis, i.e., A(x1, x2) = (x1, 0). Then A · C is the (nonclosed) halfline {(x1, x2) | x1 > 0, x2 = 0}.

Proposition 1.2.11: Let C1 and C2 be nonempty convex sets:


(a) Assume that the sets ri(C1 ) and ri(C2 ) have a nonempty inter-
section. Then

cl(C1 ∩ C2 ) = cl(C1 ) ∩ cl(C2 ), ri(C1 ∩ C2 ) = ri(C1 ) ∩ ri(C2 ).

(b) ri(C1 + C2 ) = ri(C1 ) + ri(C2 ).

Proof: (a) Let y ∈ cl(C1) ∩ cl(C2). If x ∈ ri(C1) ∩ ri(C2), by the line segment principle [Prop. 1.2.9(a)], the vector αx + (1 − α)y belongs to ri(C1) ∩ ri(C2) for all α ∈ (0, 1]. Hence y is the limit of a sequence {αk x + (1 − αk)y} ⊂ ri(C1) ∩ ri(C2) with αk → 0, implying that y ∈ cl(ri(C1) ∩ ri(C2)). Hence, we have

cl(C1) ∩ cl(C2) ⊂ cl(ri(C1) ∩ ri(C2)) ⊂ cl(C1 ∩ C2).

Also C1 ∩ C2 is contained in cl(C1) ∩ cl(C2 ), which is a closed set, so we


have
cl(C1 ∩ C2 ) ⊂ cl(C1 ) ∩ cl(C2 ).

Thus, equality holds throughout in the preceding two relations, so that


cl(C1 ∩ C2 ) = cl(C1 ) ∩ cl(C2 ). Furthermore, the sets ri(C1 ) ∩ ri(C2) and
C1 ∩ C2 have the same closure. Therefore, by Prop. 1.2.10(c), they have
the same relative interior, implying that

ri(C1 ∩ C2 ) ⊂ ri(C1 ) ∩ ri(C2 ).

To show the converse, take any x ∈ ri(C1 ) ∩ ri(C2 ) and any y ∈


C1 ∩ C2 . By Prop. 1.2.9(c), the line segment connecting x and y can be
prolonged beyond x by a small amount without leaving C1 ∩ C2 . By the
same proposition, it follows that x ∈ ri(C1 ∩ C2 ).
(b) Consider the linear transformation A : <2n 7→ <n given by A(x1 , x2) =
x1 + x2 for all x1 , x2 ∈ <n . The relative interior of the Cartesian product
C1 ×C2 (viewed as a subset of <2n ) is easily seen to be ri(C1 )×ri(C2 ). Since
A(C1 × C2 ) = C1 + C2 , the result follows from Prop. 1.2.10(d). Q.E.D.

The requirement that ri(C1 ) ∩ ri(C2 ) 6= Ø is essential in part (a) of


the above proposition. As an example, consider the subsets of the real
line C1 = {x | x > 0} and C2 = {x | x < 0}. Then we have cl(C1 ∩
C2) = Ø ≠ {0} = cl(C1) ∩ cl(C2). Also, consider C1 = {x | x ≥ 0} and C2 = {x | x ≤ 0}. Then we have ri(C1 ∩ C2) = {0} ≠ Ø = ri(C1) ∩ ri(C2).

Continuity of Convex Functions

We close this section with a basic result on the continuity properties of


convex functions.

Proposition 1.2.12: If f : <n 7→ < is convex, then it is continuous.


More generally, if C ⊂ <n is convex and f : C 7→ < is convex, then f
is continuous over the relative interior of C.

Proof: Restricting attention to the affine hull of C and using a transformation argument if necessary, we assume without loss of generality that the origin is an interior point of C and that the unit cube X = {x | kxk∞ ≤ 1} is contained in C. It will suffice to show that f is continuous at 0, i.e., that for any sequence {xk} ⊂ <n that converges to 0, we have f(xk) → f(0).
Let ei, i = 1, . . . , 2^n, be the corners of X, i.e., each ei is a vector whose entries are either 1 or −1. It is not difficult to see that any x ∈ X can be expressed in the form x = ∑_{i=1}^{2^n} αi ei, where each αi is a nonnegative scalar and ∑_{i=1}^{2^n} αi = 1. Let A = max_i f(ei). From Jensen's inequality [Eq. (1.8)], it follows that f(x) ≤ A for every x ∈ X.
For the purpose of proving continuity at zero, we can assume that xk ∈ X and xk ≠ 0 for all k. Consider the sequences {yk} and {zk} given by

yk = xk / kxk k∞,    zk = −xk / kxk k∞

(cf. Fig. 1.2.7). Using the definition of a convex function for the line segment that connects yk, xk, and 0, we have

f(xk) ≤ (1 − kxk k∞) f(0) + kxk k∞ f(yk).

We have kxk k∞ → 0 while f(yk) ≤ A for all k, so by taking the limit as k → ∞, we obtain

lim sup_{k→∞} f(xk) ≤ f(0).

Using the definition of a convex function for the line segment that connects xk, 0, and zk, we have

f(0) ≤ ( kxk k∞ / (kxk k∞ + 1) ) f(zk) + ( 1 / (kxk k∞ + 1) ) f(xk),

and letting k → ∞, we obtain

f(0) ≤ lim inf_{k→∞} f(xk).

Thus, lim_{k→∞} f(xk) = f(0) and f is continuous at zero. Q.E.D.
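The key step of the proof — bounding f on the unit cube by its maximum over the 2^n corners — can be checked numerically (an added illustration; the convex function below is arbitrary):

    import numpy as np
    from itertools import product

    f = lambda x: np.log(np.sum(np.exp(x)))                     # log-sum-exp, convex on R^3

    corners = np.array(list(product([-1.0, 1.0], repeat=3)))    # the 2^n corners e_i of the cube
    A = max(f(e) for e in corners)

    rng = np.random.default_rng(4)
    X = rng.uniform(-1.0, 1.0, size=(1000, 3))                  # random points of the unit cube
    print(all(f(x) <= A + 1e-12 for x in X))                    # f(x) <= A, as in the proof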

A straightforward consequence of the continuity of a real-valued function f that is convex over <n is that its epigraph as well as the level sets {x | f(x) ≤ γ} are closed and convex (cf. Prop. 1.2.2).
Figure 1.2.7. Construction for proving continuity of a convex function (cf. Prop. 1.2.12).

1.2.4 Recession Cones

Some of the preceding results [Props. 1.2.8, 1.2.10(e)] have illustrated how
boundedness affects the topological properties of sets obtained through
various operations on convex sets. In this section we take a closer look at
this issue.
Given a convex set C, we say that a vector y is a direction of recession
of C if x + αy ∈ C for all x ∈ C and α ≥ 0. In words, y is a direction of
recession of C if starting at any x in C and going indefinitely along y, we
never cross the boundary of C to points outside C. The set of all directions
of recession is a cone containing the origin. It is called the recession cone
of C and it is denoted by RC (see Fig. 1.2.8). This definition implies that
the recession cone of the intersection of any collection of sets Ci , i ∈ I, is
equal to the corresponding intersection of the recession cones:

R∩i∈I Ci = ∩i∈I RCi .

The following proposition gives some additional properties of recession


cones.

Proposition 1.2.13: (Recession Cone Theorem) Let C be a


nonempty closed convex set.
(a) The recession cone RC is a closed convex cone.
(b) A vector y belongs to RC if and only if there exists a vector
x ∈ C such that x + αy ∈ C for all α ≥ 0.
(c) RC contains a nonzero direction if and only if C is unbounded.

Figure 1.2.8. Illustration of the recession cone RC of a convex set C. A direction


of recession y has the property that x + αy ∈ C for all x ∈ C and α ≥ 0.

Proof: (a) If y1 , y2 belong to RC and λ1 , λ2 are positive scalars such that


λ1 + λ2 = 1, we have for any x ∈ C and α ≥ 0
x + α(λ1 y1 + λ2y2 ) = λ1 (x + αy1 ) + λ2 (x + αy2 ) ∈ C,
where the last inclusion holds because x + αy1 and x + αy2 belong to C by
the definition of RC . Hence λ1y1 + λ2y2 ∈ RC , implying that RC is convex.
Let y be in the closure of RC , and let {yk } ⊂ RC be a sequence
converging to y. For any x ∈ C and α ≥ 0 we have x + αyk ∈ C for all k,
and because C is closed, we have x + αy ∈ C. This implies that y ∈ RC
and that RC is closed.
(b) If y ∈ RC, every vector x ∈ C has the required property by the definition of RC. Conversely, let y be such that there exists a vector x̄ ∈ C with x̄ + αy ∈ C for all α ≥ 0. We fix x ∈ C and ᾱ > 0, and we show that x + ᾱy ∈ C. We may assume that y ≠ 0 (otherwise we are done) and without loss of generality, we may assume that kyk = 1. Let

zk = x̄ + kᾱy,    k = 1, 2, . . . .

If x = zk for some k, then x + ᾱy = x̄ + (k + 1)ᾱy, which belongs to C and we are done. We thus assume that x ≠ zk for all k, and we define

yk = (zk − x)/kzk − xk,    k = 1, 2, . . .

(see the construction of Fig. 1.2.9). We have

yk = (kzk − x̄k/kzk − xk) · (zk − x̄)/kzk − x̄k + (x̄ − x)/kzk − xk = (kzk − x̄k/kzk − xk) · y + (x̄ − x)/kzk − xk.

Because {zk} is an unbounded sequence,

kzk − x̄k/kzk − xk → 1,    (x̄ − x)/kzk − xk → 0,

Figure 1.2.9. Construction for the proof of Prop. 1.2.13(b).

so by combining the preceding relations, we have yk → y. Thus x + ᾱy is the limit of {x + ᾱyk}. The vector x + ᾱyk lies between x and zk in the line segment connecting x and zk for all k such that kzk − xk ≥ ᾱ, so by convexity of C, we have x + ᾱyk ∈ C for all sufficiently large k. Using the closure of C, it follows that x + ᾱy must belong to C.
(c) Assuming that C is unbounded, we will show that RC contains a nonzero
direction (the reverse is clear). Choose any x ∈ C and any unbounded
sequence {zk } ⊂ C. Consider the sequence {yk }, where

zk − x
yk = ,
kzk − xk

and let y be a limit point of {yk } (compare with the construction of Fig.
1.2.9). For any fixed α ≥ 0, the vector x + αyk lies between x and zk in the
line segment connecting x and zk for all k such that kzk − xk ≥ α. Hence
by convexity of C, we have x + αyk ∈ C for all sufficiently large k. Since
x + αy is a limit point of {x + αyk }, and C is closed, we have x + αy ∈ C.
Hence the nonzero vector y is a direction of recession. Q.E.D.
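The normalization argument used in parts (b) and (c) is easy to visualize numerically (an added illustration; the particular set below is our own choice):

    import numpy as np

    # C = {(x1, x2) | x2 >= x1**2} is closed, convex, and unbounded.  Normalizing an
    # unbounded sequence {z_k} in C around a fixed x in C, as in the proof, yields a
    # direction of recession in the limit.
    x = np.array([0.0, 0.0])
    for k in [10, 100, 1000, 10000]:
        z = np.array([float(k), float(k)**2])        # z_k = (k, k^2) lies in C
        y = (z - x) / np.linalg.norm(z - x)
        print(k, y)                                  # y_k -> (0, 1)
    # Indeed, (0, 1) is a direction of recession: (x1, x2) + a*(0, 1) stays in C for all a >= 0.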

Note that part (c) of the above proposition yields a characterization


of compact and convex sets, namely that a closed convex set is bounded
if and only if RC = {0}. A useful generalization is that for a compact set
W ⊂ <m and an m × n matrix A, the set

V = {x ∈ C | Ax ∈ W }

is compact if and only if RC ∩ N(A) = {0}. To see this, note that the recession cone of the set

V̄ = {x ∈ <n | Ax ∈ W}

is N(A) [clearly N(A) ⊂ RV̄ ; if x ∉ N(A) but x ∈ RV̄ , then for any x̄ ∈ V̄ we would have Ax̄ + αAx ∈ W for all α > 0, which contradicts the boundedness of W]. Hence, the recession cone of V is RC ∩ N(A), so by Prop. 1.2.13(c), V is compact if and only if RC ∩ N(A) = {0}.
One possible use of recession cones is to obtain conditions guaran-
teeing the closure of linear transformations and vector sums of convex sets
in the absence of boundedness, as in the following two propositions (some
refinements are given in the exercises).

Proposition 1.2.14: Let C be a nonempty closed convex subset of


<n and let A be an m × n matrix with nullspace denoted by N (A).
Assume that RC ∩ N (A) = {0}. Then AC is closed.

Proof: For any y ∈ cl(AC), the set

Cε = {x ∈ C | ky − Axk ≤ ε}

is nonempty for all ε > 0. Furthermore, by the discussion following the proof of Prop. 1.2.13, the assumption RC ∩ N(A) = {0} implies that Cε is compact. It follows that the set ∩ε>0 Cε is nonempty and any x ∈ ∩ε>0 Cε satisfies Ax = y, so y ∈ AC. Q.E.D.

Proposition 1.2.15: Let C1 , . . . , Cm be nonempty closed convex sub-


sets of <n such that the equality y1 + · · · + ym = 0 for some vectors
yi ∈ RCi implies that yi = 0 for all i = 1, . . . , m. Then the vector sum
C1 + · · · + Cm is a closed set.

Proof: Let C be the Cartesian product C1 × · · · × Cm viewed as a sub-


set of <mn and let A be the linear transformation that maps a vector
(x1, . . . , xm) ∈ <mn into x1 + · · · + xm. We have

RC = RC1 × · · · × RCm

(see the exercises) and

N(A) = {(y1, . . . , ym) | y1 + · · · + ym = 0, yi ∈ <n},

so under the given condition, we obtain RC ∩ N (A) = {0}. Since AC =


C1 + · · · + Cm , the result follows by applying Prop. 1.2.14. Q.E.D.

When specialized to just two sets C1 and C2 , the above proposition


says that if there is no nonzero direction of recession of C1 that is the

opposite of a direction of recession of C2 , then C1 + C2 is closed. This is


true in particular if RC1 = {0} which is equivalent to C1 being compact
[cf. Prop. 1.2.13(c)]. We thus obtain the following proposition.

Proposition 1.2.16: Let C1 and C2 be closed, convex sets. If C1 is


bounded, then C1 + C2 is a closed and convex set. If both C1 and C2
are bounded, then C1 + C2 is a convex and compact set.

Proof: Closedness of C1 + C2 follows from the preceding discussion. If


both C1 and C2 are bounded, then C1 + C2 is also bounded and hence also
compact. Q.E.D.

Note that if C1 and C2 are both closed and unbounded, the vector sum C1 + C2 need not be closed. For example, consider the closed subsets of <2 given by C1 = {(x1, x2) | x1 x2 ≥ 1, x1 ≥ 0, x2 ≥ 0} and C2 = {(x1, x2) | x1 = 0}. Then C1 + C2 is the open halfspace {(x1, x2) | x1 > 0}.

EXERCISES

1.2.1

(a) Show that a set is convex if and only if it contains all the convex combina-
tions of its elements.
(b) Show that the convex hull of a set coincides with the set of all the convex
combinations of its elements.

1.2.2

Let C be a nonempty set in <n , and let λ1 and λ2 be positive scalars. Show
by example that the sets (λ1 + λ2 )C and λ1 C + λ2 C may differ when C is not
convex [cf. Prop. 1.2.1].

1.2.3 (Properties of Cones)

(a) For any collection {Ci | i ∈ I} of cones, the intersection ∩i∈I Ci is a cone.
(b) The vector sum C1 + C2 of two cones C1 and C2 is a cone.

(c) The closure of a cone is a cone.


(d) The image and the inverse image of a cone under a linear transformation are cones.
(e) For any collection of vectors {ai | i ∈ I}, the set

C = {x | a0i x ≤ 0, i ∈ I}

is a closed convex cone.

1.2.4 (Convex Cones)

Let C1 and C2 be convex cones containing the origin.


(a) Show that
C1 + C2 = conv(C1 ∪ C2 ).

(b) Consider the set C given by

C = ∪_{α∈[0,1]} ( (1 − α)C1 ∩ αC2 ).

Show that

C = C1 ∩ C2.

1.2.5

Given sets Xi ⊂ <ni , i = 1, . . . , m, let X = X1 × · · · × Xm be their Cartesian


product.
(a) Show that the convex hull (closure, affine hull) of X is equal to the Cartesian
product of the convex hulls (closures, affine hulls, respectively) of the Xi ’s.
(b) Assuming X1 , . . . , Xm are convex, show that the relative interior (reces-
sion cone) of X is equal to the Cartesian product of the relative interiors
(recession cones) of the Xi ’s.

1.2.6

Let {Ci | i ∈ I} be an arbitrary collection of convex sets in <n , and let C be the
convex hull of the union of the collection. Show that
C = ∪ ( ∑_{i∈I} αi Ci ),

where the union is taken over all convex combinations such that only finitely
many coefficients αi are nonzero.

1.2.7

Let X be a nonempty set.


(a) Show that X, conv(X), and cl(X) have the same dimension.
(b) Show that cone(X) = cone(conv(X)).
(c) Show that the dimension of conv(X ) is at most as large as the dimension
of cone(X ). Give an example where the dimension of conv(X ) is smaller
than the dimension of cone(X ).
(d) Assuming that the origin belongs to conv(X), show that conv(X) and
cone(X) have the same dimension.

1.2.8

Let g be a convex, monotonically nondecreasing function of a single variable [i.e., g(y) ≤ g(ȳ) for y < ȳ], and let f be a convex function defined on a convex set C ⊂ <n. Show that the function h defined by

h(x) = g(f(x))

is convex over C.

1.2.9 (Convex Functions)

Show that the following functions are convex:


(a) f1 : X → < is given by

f1(x1, . . . , xn) = −(x1 x2 · · · xn)^{1/n},

where X = {(x1, . . . , xn) | x1 ≥ 0, . . . , xn ≥ 0}.
(b) f2 (x) = kxkp with p ≥ 1.
(c) f3(x) = 1/g(x) with g a concave function over <n such that g(x) > 0 for all x.
(d) f4 (x) = αf (x) + β with f a convex function over <n , and α and β scalars
such that α ≥ 0.
(e) f5 (x) = max{f (x), 0} with f a convex function over <n .
(f) f6 (x) = ||Ax − b|| with A an m × n matrix and b a vector in <m .
(g) f7 (x) = x0 Ax + b0 x + β with A an n × n positive semidefinite symmetric
matrix, b a vector in <n , and β a scalar.
(h) f8(x) = exp(βx0 Ax) with A an n × n positive semidefinite symmetric matrix and β a positive scalar.

1.2.10

Use the Line Segment Principle and the method of proof of Prop. 1.2.5(c) to show
that if C is a convex set with nonempty interior, and f : <n 7→ < is convex and
twice continuously differentiable over C, then ∇2 f (x) is positive semidefinite for
all x ∈ C.

1.2.11

Let C ⊂ <n be a convex set and let f : <n 7→ < be twice continuously differen-
tiable over C. Let S be the subspace that is parallel to the affine hull of C. Show
that f is convex over C if and only if y0 ∇2 f (x)y ≥ 0 for all x ∈ C and y ∈ S.

1.2.12

Let f : <n 7→ < be a differentiable function. Show that f is convex over a convex
set C if and only if
(∇f(x) − ∇f(y))0 (x − y) ≥ 0,    ∀ x, y ∈ C.
Hint : The condition above says that the function f , restricted to the line segment
connecting x and y, has monotonically nondecreasing gradient; see also the proof
of Prop. 1.2.6.

1.2.13 (Ascent/Descent Behavior of a Convex Function)

Let f : < 7→ < be a convex function of a single variable.


(a) (Monotropic Property) Use the definition of convexity to show that f is
“turning upwards” in the sense that if x1 , x2 , x3 are three scalars such that
x1 < x2 < x3 , then
(f(x2) − f(x1))/(x2 − x1) ≤ (f(x3) − f(x2))/(x3 − x2).

(b) Use part (a) to show that there are four possibilities as x increases to ∞:
(1) f (x) decreases monotonically to −∞, (2) f (x) decreases monotonically
to a finite value, (3) f (x) reaches and stays at some value, (4) f (x) increases
monotonically to ∞ when x ≥ x̄ for some x̄ ∈ <.

1.2.14 (Arithmetic-Geometric Mean Inequality)


Show that if α1, . . . , αn are positive scalars with ∑_{i=1}^n αi = 1, then for every set
of positive scalars x1, . . . , xn, we have

x1^{α1} x2^{α2} · · · xn^{αn} ≤ α1 x1 + α2 x2 + · · · + αn xn,

with equality if and only if x1 = x2 = · · · = xn . Hint : Show that − ln x is a


strictly convex function on (0, ∞).

1.2.15

Use the result of Exercise 1.2.14 to verify Young's inequality

xy ≤ x^p/p + y^q/q,

where p > 0, q > 0, 1/p + 1/q = 1, x ≥ 0, and y ≥ 0. Then, use Young's inequality to verify Hölder's inequality

∑_{i=1}^n |xi yi| ≤ (∑_{i=1}^n |xi|^p)^{1/p} (∑_{i=1}^n |yi|^q)^{1/q}.

1.2.16

Let f : <n+m 7→ < be a convex function. Consider the function h : <n 7→ < given by

h(x) = inf_{u∈U} f(x, u),

where U is a nonempty and convex subset of <m. Assuming that h(x) > −∞ for all x ∈ <n, show that h is convex. Hint: There cannot exist α ∈ [0, 1], x1, x2, u1 ∈ U, u2 ∈ U such that h(αx1 + (1 − α)x2) > αf(x1, u1) + (1 − α)f(x2, u2).

1.2.17

(a) Let C be a convex set in <n+1 and let

f (x) = inf{w | (x, w) ∈ C}.

Show that f is convex over <n .


(b) Let f1, . . . , fm be convex functions over <n and let

f(x) = inf { ∑_{i=1}^m fi(xi) | ∑_{i=1}^m xi = x }.

Assuming that f(x) > −∞ for all x, show that f is convex over <n.
(c) Let h : <m 7→ < be a convex function and let

f(x) = inf_{By=x} h(y),

where B is an n × m matrix. Assuming that f(x) > −∞ for all x, show that f is convex over the range space of B.
(d) In parts (b) and (c), show by example that if the assumption f(x) > −∞ for all x is violated, then the set {x | f(x) > −∞} need not be convex.

1.2.18

Let {fi | i ∈ I} be an arbitrary collection of convex functions on <n. Define the convex hull of these functions as the function f : <n → < given by

f(x) = inf { w | (x, w) ∈ conv(∪i∈I epi(fi)) }.

Show that f(x) is given by

f(x) = inf { ∑_{i∈I} αi fi(xi) | ∑_{i∈I} αi xi = x },

where the infimum is taken over all representations of x as a convex combination


of elements xi , such that only finitely many coefficients αi are nonzero.

1.2.19 (Convexification of Nonconvex Functions)

Let X be a nonempty subset of <n, and let f : X 7→ < be a function that is bounded below over X. Define the function F : conv(X) 7→ < by

F(x) = inf { w | (x, w) ∈ conv(epi(f)) }.

(a) Show that F is convex over conv(X) and is given by

F(x) = inf { ∑_{i=1}^m αi f(xi) | ∑_{i=1}^m αi xi = x, xi ∈ X, ∑_{i=1}^m αi = 1, αi ≥ 0, m ≥ 1 }.

(b) Show that

inf_{x∈conv(X)} F(x) = inf_{x∈X} f(x).

(c) Show that the set of global minima of F over conv(X) includes all global
minima of f over X.

1.2.20 (Extension of Caratheodory’s Theorem)

Let X1 and X2 be subsets of <n , and let X = conv(X1 ) + cone(X2 ). Show that
every vector x in X can be represented in the form
x = ∑_{i=1}^k αi xi + ∑_{i=k+1}^m αi xi,

where m is a positive integer with m ≤ n +1, the vectors x1 , . . . , xk belong to X1 ,


the vectors xk+1 , . . . , xm belong to X2 , and the scalars α1 , . . . , αm are nonnegative
with α1 +· · ·+αk = 1. Furthermore, the vectors x2 −x1 , . . . , xk −x1 , xk+1 , . . . , xm
are linearly independent.

1.2.21

Let x0 , . . . , xm be vectors in <n such that x1 − x0 , . . . , xm − x0 are linearly


independent. The convex hull of x0 , . . . , xm is called an m-dimensional simplex,
and x0 , . . . , xm are called the vertices of the simplex.
(a) Show that the dimension of a convex set C is the maximum of the dimen-
sions of the simplices included in C.
(b) Use part (a) to show that a nonempty convex set has a nonempty relative
interior.

1.2.22

Let X be a bounded subset of <n. Show that

cl(conv(X)) = conv(cl(X)).

In particular, if X is closed and bounded, then conv(X) is closed and bounded (cf. Prop. 1.2.8).

1.2.23

Let C1 and C2 be two nonempty convex sets such that C1 ⊂ C2 .


(a) Give an example showing that ri(C1 ) need not be a subset of ri(C2 ).
(b) Assuming that the sets ri(C1 ) and ri(C2 ) have nonempty intersection, show
that ri(C1 ) ⊂ ri(C2 ).
(c) Assuming that the sets C1 and ri(C2 ) have nonempty intersection, show
that the set ri(C1 ) ∩ ri(C2 ) is nonempty.

1.2.24

Let C be a nonempty convex set.


(a) Show the following refinement of the Line Segment Principle [Prop. 1.2.9(c)]: x ∈ ri(C) if and only if for every x̄ ∈ aff(C), there exists γ > 1 such that x + (γ − 1)(x − x̄) ∈ C.
(b) Assuming that the origin lies in ri(C), show that cone(C) coincides with
aff(C).
(c) Show the following extension of part (b) to a nonconvex set: If X is a
nonempty set such that the origin lies in the relative interior of conv(X),
then cone(X) coincides with aff(X ).

1.2.25

Let C be a compact set.


(a) Assuming that C is a convex set not containing the origin on its boundary, show that cone(C) is closed.
(b) Give examples showing that the assertion of part (a) fails if C is unbounded
or C contains the origin on its boundary.
(c) The convexity assumption in part (a) can be relaxed as follows: assum-
ing that conv(C) does not contain the origin on its boundary, show that
cone(C) is closed. Hint : Use part (a) and Exercise 1.2.7(b).

1.2.26

(a) Let C be a convex cone. Show that ri(C) is also a convex cone.
(b) Let C = cone({x1, . . . , xm}). Show that

ri(C) = { ∑_{i=1}^m αi xi | αi > 0, i = 1, . . . , m }.

1.2.27

Let A be an m × n matrix and let C be a nonempty convex set in <m. Assuming that the inverse image A−1 · C is nonempty, show that

ri(A−1 · C) = A−1 · ri(C),    cl(A−1 · C) = A−1 · cl(C).

[Compare these relations with those of Prop. 1.2.10(d) and (e), respectively.]

1.2.28 (Lipschitz Property of Convex Functions)

Let f : <n → < be a convex function and let X be a bounded set in <n. Show that f has the Lipschitz property over X, i.e., there exists a positive scalar c such that

|f (x) − f(y)| ≤ c · ||x − y||, ∀ x, y ∈ X.

1.2.29

Let C be a closed convex set and let M be an affine set such that the intersection C ∩ M is nonempty and bounded. Show that for every affine set M̄ that is parallel to M, the intersection C ∩ M̄ is bounded when nonempty.

1.2.30 (Recession Cones of Nonclosed Sets)

Let C be a nonempty convex set.


(a) Show by counterexample that part (b) of the Recession Cone Theorem need
not hold when C is not closed.
(b) Show that
RC ⊂ Rcl(C) , cl(RC ) ⊂ Rcl(C) .
Give an example where the inclusion cl(RC ) ⊂ Rcl(C) is strict, and another
example where cl(RC ) = Rcl(C) . Also, give an example showing that Rri(C)
need not be a subset of RC .
(c) Let C̄ be a closed convex set such that C ⊂ C̄. Show that RC ⊂ RC̄ . Give an example showing that the inclusion can fail if C̄ is not closed.

1.2.31 (Recession Cones of Relative Interiors)

Let C be a nonempty convex set.


(a) Show that a vector y belongs to Rri(C) if and only if there exists a vector
x ∈ ri(C) such that x + αy ∈ C for every α ≥ 0.
(b) Show that Rri(C) = Rcl(C) .
(c) Let C̄ be a relatively open convex set such that C ⊂ C̄. Show that RC ⊂ RC̄ . Give an example showing that the inclusion can fail if C̄ is not relatively open. [Compare with Exercise 1.2.30(b).]
Hint: In part (a), follow the proof of Prop. 1.2.13(b). In parts (b) and (c), use the result of part (a).

1.2.32

Let C be a nonempty convex set in <n and let A be an m × n matrix.


(a) Show the following refinement of Prop. 1.2.13: if Rcl(C) ∩N (A) = {0}, then

cl(A · C) = A · cl(C), A · Rcl(C) = RA·cl(C) .

(b) Give an example showing that A · Rcl(C) and RA·cl(C) can differ when
Rcl(C) ∩ N (A) 6= {0}.

1.2.33 (Lineality Space and Recession Cone)

Let C be a nonempty convex set in <n . Define the lineality space of C, denoted by
L, to be a subspace of vectors y such that simultaneously y ∈ RC and −y ∈ RC .
(a) Show that for every subspace S ⊂ L

C = (C ∩ S ⊥ ) + S.

(b) Show the following refinement of Prop. 1.2.13 and Exercise 1.2.32: if A is
an m × n matrix and Rcl(C) ∩ N (A) is a subspace of L, then

cl(A · C) = A · cl(C), RA·cl(C) = A · Rcl(C) .

1.2.34

This exercise is a refinement of Prop. 1.2.14.


(a) Let C1 , . . . , Cm be nonempty closed convex sets in <n such that the equality
y1 + · · · + ym = 0 with yi ∈ RCi implies that each yi belongs to the lineality
space of Ci . Then the vector sum C1 + · · · + Cm is a closed set and

RC1 +···+Cm = RC1 + · · · + RCm .

(b) Show the following extension of part (a) to nonclosed sets: Let C1 , . . . , Cm
be nonempty convex sets in <n such that the equality y1 +· · ·+ym = 0 with
yi ∈ Rcl(Ci ) implies that each yi belongs to the lineality space of cl(Ci ).
Then we have

cl(C1 + · · · + Cm ) = cl(C1 ) + · · · + cl(Cm ),

Rcl(C1 +···+Cm ) = Rcl(C1 ) + · · · + Rcl(Cm ) .

1.3 CONVEXITY AND OPTIMIZATION

In this section we discuss applications of convexity to some basic optimiza-


tion issues, such as the existence and uniqueness of global minima. Several
other applications, relating to optimality conditions and polyhedral con-
vexity, will be discussed in subsequent sections.

1.3.1 Local and Global Minima

Let X be a nonempty subset of <n and let f : <n 7→ (−∞, ∞] be a


function. We say that a vector x∗ ∈ X is a minimum of f over X if
f (x∗ ) = inf x∈X f (x). We also call x∗ a minimizing point or minimizer or
global minimum over X. Alternatively, we say that f attains a minimum
over X at x∗ , and we indicate this by writing

x∗ ∈ arg minx∈X f (x).

We use similar terminology for maxima, i.e., a vector x∗ ∈ X such that


f (x∗ ) = supx∈X f (x) is said to be a maximum of f over X, and we indicate
this by writing
x∗ ∈ arg maxx∈X f (x).

If the domain of f is the set X (instead of <n ), we also call x∗ a (global)


minimum or (global) maximum of f (without the qualifier “over X ”).
A basic question in minimization problems is whether an optimal
solution exists. This question can often be resolved with the aid of the
classical theorem of Weierstrass, which states that a continuous function
attains a minimum over a compact set. We will provide a more general
version of this theorem, and to this end, we introduce some terminology.
We say that a function f : <n 7→ (−∞, ∞] is coercive if

limk→∞ f (xk ) = ∞

for every sequence {xk } such that kxk k → ∞ for some norm k · k. Note
that as a consequence of the definition, the level sets {x | f (x) ≤ γ} of a
coercive function f are bounded whenever they are nonempty.

Proposition 1.3.1: (Weierstrass’ Theorem) Let X be a nonempty


subset of <n and let f : <n 7→ (−∞, ∞] be a closed function. Assume
that one of the following three conditions holds:
(1) X is compact.
(2) X is closed and f is coercive.
(3) There exists a scalar γ such that the set

{x ∈ X | f (x) ≤ γ}

is nonempty and compact.


Then, f attains a minimum over X.

Proof: If f (x) = ∞ for all x ∈ X, then every x ∈ X attains the minimum


of f over X. Thus, with no loss of generality, we assume that inf x∈X f (x) <
∞. Assume condition (1). Let {xk } ⊂ X be a sequence such that

limk→∞ f (xk ) = inf x∈X f (x).

Since X is bounded, this sequence has at least one limit point x∗ [Prop.
1.1.5(a)]. Since f is closed, f is lower semicontinuous at x∗ [cf. Prop.
1.2.2(b)], so that f (x∗ ) ≤ limk→∞ f (xk ) = inf x∈X f(x). Since X is closed,
x∗ belongs to X, so we must have f(x∗ ) = inf x∈X f(x).

Assume condition (2). Consider a sequence {xk } as in the proof under
condition (1). Since f is coercive, {xk } must be bounded and the proof
proceeds similarly to the proof under condition (1).
Assume condition (3). If the given γ is equal to inf x∈X f (x), the set
of minima of f over X is {x ∈ X | f (x) ≤ γ}, and since by assumption
this set is nonempty, we are done. If inf x∈X f (x) < γ, consider a sequence
{xk } as in the proof under condition (1). Then, for all k sufficiently large,
xk must belong to the set {x ∈ X | f (x) ≤ γ}. Since this set is compact,
{xk } must be bounded and the proof proceeds similarly to the proof under
condition (1). Q.E.D.

Note that with appropriate adjustments, the above proposition ap-


plies to the existence of maxima of f over X. In particular, if f is upper
semicontinuous at all points of X and X is compact, then f attains a
maximum over X.
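As a small computational aside (not part of the development of these notes), the following Python sketch illustrates how condition (2) of Prop. 1.3.1 is typically used; the function and set below are arbitrary illustrative choices, and the SciPy call is only a numerical stand-in for the existence argument.

```python
import numpy as np
from scipy.optimize import minimize

# f(x) = ||x||^2 + e^{x_1} is convex, continuous (hence closed), and coercive:
# f(x) -> infinity whenever ||x|| -> infinity.
def f(x):
    return np.dot(x, x) + np.exp(x[0])

# X = {x in R^2 | x_1 + x_2 >= 1} is closed and convex but unbounded, so
# compactness of X cannot be invoked; coercivity of f is what guarantees
# that a minimizer exists (condition (2) of Prop. 1.3.1).
constraints = [{"type": "ineq", "fun": lambda x: x[0] + x[1] - 1.0}]

result = minimize(f, x0=np.array([5.0, 5.0]), constraints=constraints)
print("minimizer:", result.x, "minimum value:", result.fun)
```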
We say that a vector x∗ ∈ X is a local minimum of f over X if
there exists some ε > 0 such that f (x∗ ) ≤ f (x) for every x ∈ X satisfying
kx − x∗ k ≤ ε, where k · k is some vector norm. If the domain of f is the set
X (instead of <n ), we also call x∗ a local minimum (or local maximum) of
f (without the qualifier “over X”). Local and global maxima are defined
similarly.
An important implication of convexity of f and X is that all local
minima are also global, as shown in the following proposition and in Fig.
1.3.1.

Proposition 1.3.2: If X ⊂ <n is a convex set and f : X 7→ < is a


convex function, then a local minimum of f is also a global minimum.
If in addition f is strictly convex, then there exists at most one global
minimum of f .

Proof: See Fig. 1.3.1 for a proof that a local minimum of f is also global.
Let f be strictly convex, and to obtain a contradiction, assume that two
distinct global minima x and y exist. Then the average (x + y)/2 must
belong to X, since X is convex. Furthermore, the value of f must be
smaller at the average than at x and y by the strict convexity of f . Since
x and y are global minima, we obtain a contradiction. Q.E.D.

1.3.2 The Projection Theorem

In this section we develop a basic result of analysis and optimization.




Figure 1.3.1. Proof of why local minima of convex functions are also global.
Suppose that f is convex, and assume to arrive at a contradiction, that x∗ is a
local minimum that is not global. Then there must exist an x ∈ X such that
f (x) < f (x∗ ). By convexity, for all α ∈ (0, 1),

f (αx∗ + (1 − α)x) ≤ αf (x∗ ) + (1 − α)f (x) < f (x∗ ).

Thus, f has strictly lower value than f(x∗ ) at every point on the line segment
connecting x∗ with x, except x∗ . This contradicts the local minimality of x∗ .

Proposition 1.3.3: (Projection Theorem) Let C be a closed con-


vex set and let k · k be the Euclidean norm.
(a) For every x ∈ <n , there exists a unique vector z ∈ C that mini-
mizes kz − xk over all z ∈ C. This vector is called the projection
of x on C, and is denoted by PC (x), i.e.,

PC (x) = arg minz∈C kz − xk.

(b) For every x ∈ <n , a vector z ∈ C is equal to PC (x) if and only if

(y − z)0 (x − z) ≤ 0, ∀ y ∈ C.

(c) The function f : <n 7→ C defined by f (x) = PC (x) is continuous


and nonexpansive, i.e.,
kPC (x) − PC (y)k ≤ kx − yk,    ∀ x, y ∈ <n .

(d) The distance function

d(x, C) = minz∈C kz − xk,    x ∈ <n ,

is convex.

Proof: (a) Fix x and let w be some element of C. Minimizing kx − zk over


all z ∈ C is equivalent to minimizing the same function over all z ∈ C such
that kx − zk ≤ kx − wk, which is a compact set. Furthermore, the function
g defined by g(z) = kz − xk2 is continuous. Existence of a minimizing
vector follows by Weierstrass’ Theorem (Prop. 1.3.1).
To prove uniqueness, notice that the square of the Euclidean norm
is a strictly convex function of its argument [Prop. 1.2.5(d)]. Therefore, g
is strictly convex and it follows that its minimum is attained at a unique
point (Prop. 1.3.2).
(b) For all y and z in C we have

ky − xk2 = ky −zk2 +kz −xk2 − 2(y −z)0 (x−z) ≥ kz −xk2 − 2(y −z)0 (x−z).

Therefore, if z is such that (y − z)0 (x − z) ≤ 0 for all y ∈ C, we have


ky − xk2 ≥ kz − xk2 for all y ∈ C, implying that z = PC (x).
Conversely, let z = PC (x), consider any y ∈ C, and for α > 0, define

yα = αy + (1 − α)z. We have
kx − yα k2 = k(1 − α)(x − z) + α(x − y)k2
= (1 − α)2kx − zk2 + α2 kx − yk2 + 2(1 − α)α(x − z)0 (x − y).
Viewing kx − yα k2 as a function of α, we have

∂/∂α {kx − yα k2 } |α=0 = −2kx − zk2 + 2(x − z)0 (x − y) = −2(y − z)0 (x − z).

Therefore, if (y − z)0 (x − z) > 0 for some y ∈ C, then


∂/∂α {kx − yα k2 } |α=0 < 0

and for positive but small enough α, we obtain kx − yα k < kx − zk. This
contradicts the fact z = PC (x) and shows that (y − z)0 (x − z) ≤ 0 for all
y ∈ C.
(c) Let x and y be elements of <n . From part (b), we have (w − PC (x))0 (x −
PC (x)) ≤ 0 for all w ∈ C. Since PC (y) ∈ C, we obtain

(PC (y) − PC (x))0 (x − PC (x)) ≤ 0.

Similarly,
(PC (x) − PC (y))0 (y − PC (y)) ≤ 0.

Adding these two inequalities, we obtain

(PC (y) − PC (x))0 (x − PC (x) − y + PC (y)) ≤ 0.
By rearranging and by using the Schwarz inequality, we have

kPC (y) − PC (x)k2 ≤ (PC (y) − PC (x))0 (y − x) ≤ kPC (y) − PC (x)k · ky − xk,

showing that PC (·) is nonexpansive and a fortiori continuous.


(d) Assume, to arrive at a contradiction, that there exist x1 , x2 ∈ <n and
an α ∈ [0, 1] such that
d(αx1 + (1 − α)x2 , C) > αd(x1 , C) + (1 − α)d(x2 , C).

Then there must exist z1 , z2 ∈ C such that

d(αx1 + (1 − α)x2 , C) > αkz1 − x1 k + (1 − α)kz2 − x2 k,

which implies that

kαz1 + (1 − α)z2 − αx1 − (1 − α)x2 k > αkx1 − z1 k + (1 − α)kx2 − z2 k.

This contradicts the triangle inequality in the definition of norm. Q.E.D.

Figure 1.3.2 illustrates the necessary and sufficient condition of part


(b) of the Projection Theorem.

Figure 1.3.2. Illustration of the condition satisfied by the projection PC (x).
For each vector y ∈ C, the vectors x − PC (x) and y − PC (x) form an angle
greater than or equal to π/2 or, equivalently,

(y − PC (x))0 (x − PC (x)) ≤ 0.
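Before moving on, here is a quick numerical sketch of parts (b) and (c) of the Projection Theorem (not part of the original text; the box below is an arbitrary choice of C for which the projection has a simple closed form).

```python
import numpy as np

# Projection onto the box C = [0,1]^n, computed coordinatewise.
def project_box(x, lo=0.0, hi=1.0):
    return np.clip(x, lo, hi)

rng = np.random.default_rng(0)
n = 5
x, y = rng.normal(size=n) * 3, rng.normal(size=n) * 3
Px, Py = project_box(x), project_box(y)

# Part (b): (w - P_C(x))'(x - P_C(x)) <= 0 for every w in C (checked on samples).
samples = rng.uniform(0.0, 1.0, size=(1000, n))
assert np.all((samples - Px) @ (x - Px) <= 1e-10)

# Part (c): nonexpansiveness, ||P_C(x) - P_C(y)|| <= ||x - y||.
assert np.linalg.norm(Px - Py) <= np.linalg.norm(x - y) + 1e-12
```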

1.3.3 Directions of Recession and Existence of Optimal Solutions

The recession cone, discussed in Section 1.2.4, is also useful for character-
izing directions along which convex functions asymptotically increase or
decrease. A key idea here is that a function that is convex over <n can
be described in terms of its epigraph, which is a closed and convex set.
The recession cone of the epigraph can be used to obtain the directions
along which the function “slopes downward.” This is the idea underlying
the following proposition.

Proposition 1.3.4: Let f : <n 7→ < be a convex function and con-
sider the level sets La = {x | f (x) ≤ a}.
(a) All the level sets La that are nonempty have the same recession
cone, given by

RLa = {y | (y, 0) ∈ Repi(f ) },

where Repi(f ) is the recession cone of the epigraph of f .


(b) If one nonempty level set La is compact, then all nonempty level
sets are compact.

Proof: From the formula for the epigraph


epi(f ) = {(x, w) | f (x) ≤ w},

it can be seen that for all a for which La is nonempty, we have


{(x, a) | x ∈ La } = epi(f ) ∩ {(x, a) | x ∈ <n }.

The recession cone of the set in the left-hand side above is {(y, 0) | y ∈
RLa }. The recession cone of the set in the right-hand side is equal to
the intersection of the recession cone of epi(f ) and the recession cone of
{(x, a) | x ∈ <n }, which is equal to {(y, 0) | y ∈ <n }, the horizontal
subspace that passes through the origin. Thus we have

{(y, 0) | y ∈ RLa } = {(y, 0) | (y, 0) ∈ Repi(f ) },

from which it follows that


RLa = {y | (y, 0) ∈ Repi(f ) }.

This proves part (a). Part (b) follows by applying Prop. 1.2.13(c) to the
recession cone of epi(f ). Q.E.D.

For a convex function f : <n 7→ <, the (common) recession cone of


the nonempty level sets of f is referred to as the recession cone of f, and
is denoted by Rf . Thus
Rf = {y | f (x) ≥ f (x + αy), ∀ x ∈ <n , α ≥ 0}.

Each y ∈ Rf is called a direction of recession of f . If we start at any


x ∈ <n and move indefinitely along a direction of recession y, we must stay
within each level set that contains x, or equivalently we must encounter
exclusively points z with f (z) ≤ f (x). In words, a direction of recession of
f is a direction of uninterrupted nonascent for f .
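Numerically, this characterization can be probed along a single ray, as in the following sketch (an illustrative aside with arbitrarily chosen f, x, and sampling grid; a finite sample at one starting point is only a necessary-condition check, not a proof of membership in Rf ).

```python
import numpy as np

def looks_like_recession_direction(f, x, y, alphas=np.linspace(0.0, 100.0, 2001)):
    # A direction of recession never takes f above f(x) along the ray x + alpha*y.
    vals = np.array([f(x + a * y) for a in alphas])
    return np.all(vals <= vals[0] + 1e-9)

f = lambda x: max(0.0, x[0]) + x[1] ** 2          # a convex function on R^2
x0 = np.array([3.0, 1.0])

print(looks_like_recession_direction(f, x0, np.array([-1.0, 0.0])))  # True: (-1, 0) is in R_f
print(looks_like_recession_direction(f, x0, np.array([1.0, 0.0])))   # False: f eventually ascends
```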
Conversely, if we start at some x ∈ <n and while moving along a
direction y, we encounter a point z with f(z) > f (x), then y cannot be a
direction of recession. It is easily seen via a convexity argument that once
we cross the boundary of a level set of f we never cross it back again, and
with a little thought (see Fig. 1.3.3 and the exercises), it follows that a
direction that is not a direction of recession of f is a direction of eventual
uninterrupted ascent of f . In view of these observations, it is not surprising
that directions of recession play a prominent role in characterizing the
existence of solutions of convex optimization problems, as shown in the
following proposition.

Proposition 1.3.5: Let f : <n 7→ < be a convex function, X be a


nonempty closed convex subset of <n , and X ∗ be the set of minimizing
points of f over X. Then X ∗ is nonempty and compact if and only if
X and f have no common nonzero direction of recession.

Proof: Let f ∗ = inf x∈X f (x), and note that


X ∗ = X ∩ {x | f (x) ≤ f ∗ }.

If X ∗ is nonempty and compact, it has no nonzero direction of recession


[Prop. 1.2.13(c)]. Therefore, there is no nonzero vector in the intersection


Figure 1.3.3. Ascent/descent behavior of a convex function starting at some
x ∈ <n and moving along a direction y. If y is a direction of recession of f , there
are two possibilities: either f decreases monotonically to a finite value or −∞
[figures (a) and (b), respectively], or f reaches a value that is less than or equal
to f (x) and stays at that value [figures (c) and (d)]. If y is not a direction of
recession of f , then eventually f increases monotonically to ∞ [figures (e) and
(f)]; i.e., for some ᾱ ≥ 0 and all α1 , α2 ≥ ᾱ with α1 < α2 we have f (x + α1 y) <
f (x + α2 y).

of the recession cones of X and {x | f (x) ≤ f ∗ }. This is equivalent to X
and f having no common nonzero direction of recession.
Conversely, let a be a scalar such that the set

Xa = X ∩ {x | f (x) ≤ a}

is nonempty and has no nonzero direction of recession. Then, Xa is closed
[since X is closed and {x | f (x) ≤ a} is closed by the continuity of f ], and
by Prop. 1.2.13(c), Xa is compact. Since minimization of f over X and
over Xa yields the same set of minima, X ∗ , by Weierstrass’ Theorem (Prop.
1.3.1) X ∗ is nonempty, and since X ∗ ⊂ Xa , we see that X ∗ is bounded.
Since X ∗ is closed [since X is closed and {x | f (x) ≤ f ∗ } is closed by the
continuity of f ] it is compact. Q.E.D.

If the closed convex set X and the convex function f of the above
proposition have a common direction of recession, then either X ∗ is empty
[take for example, X = (−∞, 0] and f (x) = ex ] or else X ∗ is nonempty
and unbounded [take for example, X = (−∞, 0] and f (x) = max{0, x}].
Another interesting question is what happens when X and f have a
common direction of recession, call it y, but f is bounded below over X:

f ∗ = inf x∈X f (x) > −∞.

Then for any x ∈ X, we have x + αy ∈ X (since y is a direction of recession
of X), and f (x + αy) is monotonically nonincreasing, converging to a finite
value as α → ∞ (since y is a direction of recession of f and f ∗ > −∞). Generally,
the minimum of f over X need not be attained. However, it turns out
that the minimum is attained in an important special case: when f is
quadratic and X is polyhedral (i.e., it is specified by linear equality and inequality
constraints).
To understand the main idea, consider the problem

minimize f (x) = c0 x + (1/2) x0 Qx
subject to Ax = 0,                    (1.9)

where Q is a positive semidefinite symmetric n × n matrix, c ∈ <n is a


given vector, and A is an m × n matrix. Let N (Q) and N (A) denote the
nullspaces of Q and A, respectively. There are two possibilities:
(a) For some x ∈ N (A) ∩ N (Q), we have c0 x ≠ 0. Then, since f (αx) =
αc0 x for all α ∈ <, it follows that f becomes unbounded from below
either along x or along −x.
(b) For all x ∈ N (A) ∩ N (Q), we have c0 x = 0. In this case, we have
f (x) = 0 for all x ∈ N (A) ∩ N (Q). For x ∈ N (A) such that x ∉ N (Q),
since N (Q) and R(Q), the range of Q, are orthogonal subspaces, x
can be uniquely decomposed as xR + xN , where xN ∈ N (Q) and
xR ∈ R(Q), and we have f (x) = c0 x + (1/2)x0R QxR , where xR is
the (nonzero) component of x along R(Q). Hence f (αx) = αc0 x +
(1/2)α2 x0R QxR for all α > 0, with x0R QxR > 0. It follows that f is
bounded below along all feasible directions x ∈ N (A).

We thus conclude that for f to be bounded from below along all directions
in N (A) it is necessary and sufficient that c0 x = 0 for all x ∈ N (A) ∩ N (Q).
However, boundedness from below of a convex cost function f along all
directions of recession of a constraint set does not guarantee existence of
an optimal solution, or even boundedness from below over the constraint set
(see the exercises). On the other hand, since the constraint set N (A) is a
subspace, it is possible to use a transformation x = Bz where the columns
of the matrix B are basis vectors for N (A), and view the problem as an
unconstrained minimization over z of the cost function h(z) = f (Bz), which
is positive semidefinite quadratic. We can then argue that boundedness
from below of this function along all directions z is necessary and sufficient
for existence of an optimal solution. This argument indicates that problem
(1.9) has an optimal solution if and only if c0 x = 0 for all x ∈ N (A)∩N (Q).
By using a translation argument, this result can also be extended to the
case where the constraint set is a general affine set of the form {x | Ax = b}
rather than the subspace {x | Ax = 0}.
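The criterion just derived is easy to check computationally. The following sketch (an aside; Q, c, and A are arbitrary illustrative data, and the nullspace-based solve is just one way to produce a minimizer when the criterion holds) tests whether c0 x = 0 on N (A) ∩ N (Q) and, if so, computes a minimizer of problem (1.9).

```python
import numpy as np
from scipy.linalg import null_space

# Illustrative data: Q positive semidefinite, A with a nontrivial nullspace.
Q = np.array([[2.0, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
c = np.array([1.0, 0.0, 0.0])
A = np.array([[0.0, 1.0, 0.0]])

# Basis for N(A) ∩ N(Q): the nullspace of the stacked matrix [A; Q].
NAQ = null_space(np.vstack([A, Q]))
exists = NAQ.size == 0 or np.allclose(c @ NAQ, 0.0)
print("optimal solution exists:", exists)

if exists:
    # Minimize over x = B z, with B a basis of N(A); the optimality condition
    # is (B'QB) z = -B'c, and least squares picks one solution of this
    # (possibly singular but consistent) system, hence one minimizer.
    B = null_space(A)
    z, *_ = np.linalg.lstsq(B.T @ Q @ B, -B.T @ c, rcond=None)
    print("a minimizer:", B @ z)
```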
In part (a) of the following proposition we state the result just de-
scribed (equality constraints only). While we can prove the result by for-
malizing the argument outlined above, we will use instead a more elemen-
tary variant of this argument, whereby the constraints are eliminated via
a penalty function; this will give us the opportunity to introduce a line
of proof that we will frequently employ in other contexts as well. In part
(b) of the proposition, we allow linear inequality constraints, and we show
that a convex quadratic program has an optimal solution if and only if its
optimal value is bounded below. Note that the cost function may be linear,
so the proposition applies to linear programs as well.

Proposition 1.3.6: (Existence of Solutions of Quadratic Pro-


grams) Let f : <n 7→ < be a quadratic function of the form

f (x) = c0 x + (1/2) x0 Qx,

where Q is a positive semidefinite symmetric n×n matrix and c ∈ <n is


a given vector. Let also A be an m × n matrix and b ∈ <m be a vector.
Denote by N (A) and N (Q), the nullspaces of A and Q, respectively.
(a) Let X = {x | Ax = b} and assume that X is nonempty. The
following are equivalent:
(i) f attains a minimum over X.
(ii) f ∗ = inf x∈X f (x) > −∞.
(iii) c0 y = 0 for all y ∈ N (A) ∩ N (Q).
(b) Let X = {x | Ax ≤ b} and assume that X is nonempty. The
following are equivalent:
(i) f attains a minimum over X.
(ii) f ∗ = inf x∈X f (x) > −∞.
(iii) c0 y ≥ 0 for all y ∈ N (Q) such that Ay ≤ 0.

Proof: (a) (i) clearly implies (ii).


We next show that (ii) implies (iii). For all x ∈ X, y ∈ N (A) ∩ N (Q),
and α ∈ <, we have x + αy ∈ X and

f (x + αy) = c0 (x + αy) + (1/2) (x + αy)0 Q(x + αy) = f (x) + αc0 y.

If c0 y ≠ 0, then either limα→∞ f (x + αy) = −∞ or limα→−∞ f (x + αy) =


−∞, and we must have f ∗ = −∞. Hence (ii) implies that c0 y = 0 for all
y ∈ N (A) ∩ N (Q).
We finally show that (iii) implies (i) by first using a translation argu-
ment and then using a penalty function argument. Choose any x ∈ X, so
that X = x+ N (A). Then minimizing f over X is equivalent to minimizing
f (x + y) over y ∈ N (A), or

minimize h(y)
subject to Ay = 0,

where
h(y) = f (x + y) = f (x) + ∇f (x)0 y + (1/2) y 0 Qy.

For any integer k > 0, let


hk (y) = h(y) + (k/2) kAyk2 = f (x) + ∇f (x)0 y + (1/2) y 0 Qy + (k/2) kAyk2 .    (1.10)
Note that for all k

hk (y) ≤ hk+1 (y), ∀ y ∈ <n ,

and

inf y∈<n hk (y) ≤ inf y∈<n hk+1 (y) ≤ inf Ay=0 h(y) ≤ h(0) = f (x).    (1.11)

Denote
S = (N (A) ∩ N (Q))⊥
and write any y ∈ <n as y = z + w, where

z ∈ S,    w ∈ S ⊥ = N (A) ∩ N (Q).

Then, by using the assumption c0 w = 0 [implying that ∇f (x)0 w = (c +
Qx)0 w = 0], we see from Eq. (1.10) that

hk (y) = hk (z + w) = hk (z), (1.12)

i.e., hk is determined in terms of its restriction to the subspace S. It can


be seen from Eq. (1.10) that the function hk has no nonzero direction of
recession in common with S, so hk (z) attains a minimum over S, call it yk ,
and in view of Eq. (1.12), yk also attains the minimum of hk (y) over <n .
From Eq. (1.11), we have

hk (yk ) ≤ hk+1 (yk+1 ) ≤ inf Ay=0 h(y) ≤ f (x),    (1.13)

and we will use this relation to show that {yk } is bounded and each of its
limit points minimizes h(y) subject to Ay = 0. Indeed, from Eq. (1.13), the
sequence {hk (yk )} is bounded, so if {yk } were unbounded, then assuming
without loss of generality that yk ≠ 0, we would have hk (yk )/kyk k → 0, or

limk→∞ [ f (x)/kyk k + ∇f (x)0 ŷk + kyk k ( (1/2) ŷk0 Qŷk + (k/2) kAŷk k2 ) ] = 0,

where ŷk = yk /kyk k. For this to be true, all limit points ŷ of the bounded
sequence {ŷk } must be such that ŷ 0 Qŷ = 0 and Aŷ = 0, which is impossible
since kŷk = 1 and ŷ ∈ S. Thus {yk } is bounded and for any one of its limit
points, call it ȳ, we have ȳ ∈ S and

lim supk→∞ hk (yk ) = f (x) + ∇f (x)0 ȳ + (1/2) ȳ 0 Qȳ + lim supk→∞ (k/2) kAyk k2 ≤ inf Ay=0 h(y).

It follows that Aȳ = 0 and that ȳ minimizes h(y) over Ay = 0. This implies
that the vector x + ȳ minimizes f (x) subject to Ax = b.
(b) Clearly (i) implies (ii), and similar to the proof of part (a), (ii) implies
that c0 y ≥ 0 for all y ∈ N (Q) with Ay ≤ 0.
Finally, we show that (iii) implies (i) by using the corresponding re-
sult of part (a). For any x ∈ X , let J (x) denote the index set of active
constraints at x, i.e., J (x) = {j | a0j x = bj }, where the a0j are the rows of
A.
For any sequence {xk } ⊂ X with f (xk ) → f ∗ , we can extract a
subsequence such that J (xk ) is constant and equal to some J . Accordingly,
we select a sequence {xk } ⊂ X such that f (xk ) → f ∗ , and the index set
J (xk ) is equal for all k to a set J that is maximal over all such sequences [for
any other sequence {x̃k } ⊂ X with f (x̃k ) → f ∗ and such that J (x̃k ) = J̃
for all k, we cannot have J ⊂ J̃ unless J = J̃ ].
Consider the problem

minimize f (x)
(1.14)
subject to a0j x = bj , j ∈ J.

Assume, to come to a contradiction, that this problem does not have a


solution. Then, by part (a), we have c0 y < 0 for some y ∈ N (Ā) ∩ N (Q),
where Ā is the matrix having as rows the a0j , j ∈ J . Consider the line
{xk + γy | γ > 0}. Since y ∈ N (Q), we have

f (xk + γy) = f (xk ) + γc0 y,

so that
f(xk + γy) < f (xk ), ∀ γ > 0.
Furthermore, since y ∈ N (Ā), we have

a0j (xk + γy) = bj , ∀ j ∈ J, γ > 0.

We must also have a0j y > 0 for at least one j ∉ J [otherwise (iii) would be
violated], so the line {xk + γy | γ > 0} crosses the boundary of X for some
γ̄k > 0. The sequence {x̃k }, where x̃k = xk + γ̄k y, satisfies {x̃k } ⊂ X,
f (x̃k ) → f ∗ [since f (x̃k ) ≤ f (xk )], and the active index set J (x̃k ) strictly
contains J for all k. This contradicts the maximality of J , and shows that
problem (1.14) has an optimal solution, call it x.
Since xk is a feasible solution of problem (1.14), we have

f(x) ≤ f (xk ), ∀ k,

so that
f(x) ≤ f ∗ .

We will now show that x minimizes f over X, by showing that x ∈ X,


thereby completing the proof. Assume, to arrive at a contradiction, that
x ∉ X. Let x̂k be a point in the interval connecting xk and x that belongs
to X and is closest to x. We have that J (x̂k ) strictly contains J for all k.
Since f (x) ≤ f (xk ) and f is convex over the interval [xk , x], it follows that

f (x̂k ) ≤ max{f (xk ), f (x)} = f (xk ).

Thus f (x̂k ) → f ∗ , which contradicts the maximality of J . Q.E.D.

1.3.4 Existence of Saddle Points

Suppose that we are given a function φ : X × Z 7→ <, where X ⊂ <n ,


Z ⊂ <m , and we wish to either

minimize supz∈Z φ(x, z)
subject to x ∈ X

or

maximize inf x∈X φ(x, z)
subject to z ∈ Z.
These problems are encountered in at least three major optimization con-
texts:
(1) Worst-case design, whereby we view z as a parameter and we wish
to minimize over x a cost function, assuming the worst possible value
of z. A special case of this is the discrete minimax problem, where
we want to minimize over x ∈ X
max{f1 (x), . . . , fm (x)},

where the fi are some given functions. Here, Z is the finite set
{1, . . . , m}. Within this context, it is important to provide char-
acterizations of the max function

maxz∈Z φ(x, z),

particularly its directional derivative. We do this in Section 1.7, where


we discuss the differentiability properties of convex functions.
(2) Exact penalty functions, which can be used for example to convert
constrained optimization problems of the form

minimize f(x)
(1.15)
subject to x ∈ X, gj (x) ≤ 0, j = 1, . . . , r

to (less constrained) minimax problems of the form


minimize f (x) + c max{0, g1 (x), . . . , gr (x)}
subject to x ∈ X,

where c is a large positive penalty parameter. This conversion is


useful for both analytical and computational purposes, and will be
discussed in Chapters 2 and 4.
(3) Duality theory, where using problem (1.15) as an example, we intro-
duce the so-called Lagrangian function

L(x, µ) = f (x) + µ1 g1 (x) + · · · + µr gr (x)

involving the vector µ = (µ1 , . . . , µr ) ∈ <r , and the dual problem

maximize inf x∈X L(x, µ)
subject to µ ≥ 0.                    (1.16)

The original (primal) problem (1.15) can also be written as

minimize supµ≥0 L(x, µ)
subject to x ∈ X

[if x violates any of the constraints gj (x) ≤ 0, we have supµ≥0 L(x, µ) =


∞, and if it does not, we have supµ≥0 L(x, µ) = f(x)]. Thus the pri-
mal and the dual problems (1.15) and (1.16) can be viewed in terms
of a minimax problem.
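As a minimal hand-checkable illustration of this minimax view (an aside, not part of the text's development): for the one-dimensional problem of minimizing x2 subject to x ≥ 1, i.e., g1 (x) = 1 − x ≤ 0, the Lagrangian is L(x, µ) = x2 + µ(1 − x), and the sketch below evaluates both sides of the corresponding minimax relation on (arbitrarily truncated) grids; both come out approximately equal to the optimal value 1.

```python
import numpy as np

# Problem: minimize x^2 subject to x >= 1, so L(x, mu) = x^2 + mu*(1 - x).
# The grids are arbitrary truncations of R and {mu >= 0}, chosen wide enough
# not to affect the result.
x_grid  = np.linspace(-5.0, 5.0, 2001)
mu_grid = np.linspace(0.0, 10.0, 2001)

L = x_grid[None, :] ** 2 + mu_grid[:, None] * (1.0 - x_grid[None, :])

dual_value   = L.min(axis=1).max()   # sup over mu >= 0 of inf over x
primal_value = L.max(axis=0).min()   # inf over x of sup over mu >= 0
print(dual_value, primal_value)      # both approximately 1, the optimal value
```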
We will now derive conditions guaranteeing that

supz∈Z inf x∈X φ(x, z) = inf x∈X supz∈Z φ(x, z),    (1.17)

and that the inf and sup above are attained. This is a major issue in
duality theory because it connects the primal and the dual problems [cf.
Eqs. (1.15) and (1.16)] through their optimal values and optimal solutions.
In particular, when we discuss duality in Chapter 3, we will see that a
major question is whether there is no duality gap, i.e., whether the optimal
primal and dual values are equal. This is so if and only if

supµ≥0 inf x∈X L(x, µ) = inf x∈X supµ≥0 L(x, µ).    (1.18)

We will prove in this section one major result, the Saddle Point The-
orem, which guarantees the equality (1.17), assuming convexity/concavity

assumptions on φ and (essentially) compactness assumptions on X and


Z.† Unfortunately, the theorem to be shown later in this section is only
partially adequate for the development of duality theory, because compact-
ness of Z and, to some extent, compactness of X turn out to be restrictive
assumptions [for example Z corresponds to the set {µ | µ ≥ 0} in Eq.
(1.18), which is not compact]. We will derive additional theorems of the
minimax type in Chapter 3, when we discuss duality and we make a closer
connection with the theory of Lagrange multipliers.
A first observation regarding the potential validity of the minimax
equality (1.17) is that we always have the inequality

supz∈Z inf x∈X φ(x, z) ≤ inf x∈X supz∈Z φ(x, z),    (1.19)

[for every z ∈ Z, write inf x∈X φ(x, z) ≤ inf x∈X supz∈Z φ(x, z) and take the
supremum over z ∈ Z of the left-hand side]. However, special conditions
are required to guarantee equality.
Suppose that x∗ is an optimal solution of the problem

minimize supz∈Z φ(x, z)
subject to x ∈ X                    (1.20)

and z ∗ is an optimal solution of the problem

maximize inf x∈X φ(x, z)
subject to z ∈ Z.                    (1.21)

† The Saddle Point Theorem is also central in game theory, as we now briefly
explain. In the simplest type of zero sum game, there are two players: the first
may choose one out of n moves and the second may choose one out of m moves.
If moves i and j are selected by the first and the second player, respectively, the
first player gives a specified amount aij to the second. The objective of the first
player is to minimize the amount given to the other player, and the objective of
the second player is to maximize this amount. The players use mixed strategies,
whereby the first player selects a probability distribution x = (x1 , . . . , xn ) over
his n possible moves and the second player selects a probability distribution
z = (z1 , . . . , zm ) over his m possible moves. Since the probability of selecting i
and j is xi zj , the expected amount to be paid by the first player to the second is
Σi,j aij xi zj , or x0 Az, where A is the n × m matrix with elements aij .
If each player adopts a worst case viewpoint, whereby he optimizes his
choice against the worst possible selection by the other player, the first player
must minimize maxz x0 Az and the second player must maximize minx x0 Az. The
main result, a special case of the existence result we will prove shortly, is that
these two optimal values are equal, implying that there is an amount that can
be meaningfully viewed as the value of the game for its participants.

Then we have

supz∈Z inf x∈X φ(x, z) = inf x∈X φ(x, z ∗ ) ≤ φ(x∗ , z ∗ ) ≤ supz∈Z φ(x∗ , z) = inf x∈X supz∈Z φ(x, z).    (1.22)
If the minimax equality [cf. Eq. (1.17)] holds, then equality holds through-
out above, so that

supz∈Z φ(x∗ , z) = φ(x∗ , z ∗ ) = inf x∈X φ(x, z ∗ ),    (1.23)

or equivalently

φ(x∗ , z) ≤ φ(x∗ , z ∗ ) ≤ φ(x, z ∗ ), ∀ x ∈ X, z ∈ Z. (1.24)

A pair of vectors x∗ ∈ X and z ∗ ∈ Z satisfying the two above (equivalent)


relations is called a saddle point of φ (cf. Fig. 1.3.4).
The preceding argument showed that if the minimax equality (1.17)
holds, any vectors x∗ and z ∗ that are optimal solutions of problems (1.20)
and (1.21), respectively, form a saddle point. Conversely, if (x∗ , z ∗ ) is a
saddle point, then the definition [cf. Eq. (1.23)] implies that

inf x∈X supz∈Z φ(x, z) ≤ supz∈Z inf x∈X φ(x, z).

This, together with the minimax inequality (1.19), guarantees that the min-
imax equality (1.17) holds and, from Eq. (1.22), that x∗ and z ∗ are optimal
solutions of problems (1.20) and (1.21), respectively.
We summarize the above discussion in the following proposition.

Proposition 1.3.7: A pair (x∗ , z ∗ ) is a saddle point of φ if and only if


the minimax equality (1.17) holds, and x∗ and z ∗ are optimal solutions
of problems (1.20) and (1.21), respectively.

Note a simple consequence of the above proposition: the set of saddle


points, when nonempty , is the Cartesian product X ∗ ×Z ∗ , where X ∗ and Z ∗
are the sets of optimal solutions of problems (1.20) and (1.21), respectively.
In other words x∗ and z ∗ can be independently chosen within the sets X ∗
and Z ∗ , respectively, to form a saddle point. Note also that if the minimax
equality (1.17) does not hold, there is no saddle point, even if the sets X ∗
and Z ∗ are nonempty.
One can visualize saddle points in terms of the sets of minimizing
points over X for fixed z ∈ Z and maximizing points over Z for fixed
x ∈ X:

X̂(z) = {x̂ | x̂ minimizes φ(x, z) over X},


Figure 1.3.4. Illustration of a saddle point of a function φ(x, z) over x ∈ X and
z ∈ Z [the function plotted here is φ(x, z) = (1/2)(x2 + 2xz − z 2 )]. Let

x̂(z) = arg minx∈X φ(x, z),    ẑ(x) = arg maxz∈Z φ(x, z)

be the curves of minimizing and maximizing points, and consider the correspond-
ing curves φ(x̂(z), z) and φ(x, ẑ(x)) [shown in the figure for the case where x̂(z)
and ẑ(x) are unique; otherwise x̂(z) and ẑ(x) should be viewed as set-valued
mappings]. By definition, a pair (x∗ , z ∗ ) is a saddle point if and only if

maxz∈Z φ(x∗ , z) = φ(x∗ , z ∗ ) = minx∈X φ(x, z ∗ ),

or equivalently, if (x∗ , z ∗ ) lies on both curves [x∗ = x̂(z ∗ ) and z ∗ = ẑ(x∗ )]. At
such a pair, we also have

maxz∈Z φ(x̂(z), z) = maxz∈Z minx∈X φ(x, z) = φ(x∗ , z ∗ ) = minx∈X maxz∈Z φ(x, z) = minx∈X φ(x, ẑ(x)),

so that
φ(x̂(z), z) ≤ φ(x∗ , z ∗ ) ≤ φ(x, ẑ(x)),    ∀ x ∈ X, z ∈ Z

(see Prop. 1.3.7). Visually, the curve of maxima φ(x, ẑ(x)) must lie “above” the
curve of minima φ(x̂(z), z) (completely, i.e., for all x ∈ X and z ∈ Z).
Ẑ(x) = {ẑ | ẑ maximizes φ(x, z) over Z}.
The definition implies that the pair (x∗ , z ∗ ) is a saddle point if and only if
it is a “point of intersection” of X̂(·) and Ẑ(·) in the sense that

x∗ ∈ X̂(z ∗ ), z ∗ ∈ Ẑ(x∗ );

see Fig. 1.3.4.
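For the finite zero sum game described in the footnote, X and Z are simplices (compact convex sets) and φ(x, z) = x0 Az is linear in each argument, so a saddle point exists and can be computed by linear programming. The following sketch (an aside; the payoff matrix is an arbitrary illustrative choice, and the LP formulation is the standard one for matrix games) computes a saddle point with scipy.optimize.linprog and verifies the saddle inequalities.

```python
import numpy as np
from scipy.optimize import linprog

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])      # illustrative payoff matrix: first player pays a_ij
n, m = A.shape

# First player: minimize t subject to A'x <= t*1, x in the unit simplex.
res_x = linprog(c=np.r_[np.zeros(n), 1.0],
                A_ub=np.hstack([A.T, -np.ones((m, 1))]), b_ub=np.zeros(m),
                A_eq=np.r_[np.ones(n), 0.0].reshape(1, -1), b_eq=[1.0],
                bounds=[(0, None)] * n + [(None, None)])
x_star, value = res_x.x[:n], res_x.x[-1]

# Second player: maximize s subject to A z >= s*1, z in the unit simplex.
res_z = linprog(c=np.r_[np.zeros(m), -1.0],
                A_ub=np.hstack([-A, np.ones((n, 1))]), b_ub=np.zeros(n),
                A_eq=np.r_[np.ones(m), 0.0].reshape(1, -1), b_eq=[1.0],
                bounds=[(0, None)] * m + [(None, None)])
z_star = res_z.x[:m]

# Saddle inequalities checked against pure strategies; by linearity this gives
# phi(x*, z) <= value <= phi(x, z*) for all mixed strategies x and z.
assert np.all(x_star @ A <= value + 1e-7) and np.all(A @ z_star >= value - 1e-7)
print("game value:", value, "x*:", x_star, "z*:", z_star)
```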


We now consider conditions that guarantee the existence of a saddle
point. We have the following classical result.

Proposition 1.3.8: (Saddle Point Theorem) Let X and Z be


closed, convex subsets of <n and <m , respectively, and let φ : X ×Z 7→
< be a function. Assume that for each z ∈ Z, the function φ(·, z) :
X 7→ < is convex and lower semicontinuous, and for each x ∈ X, the
function φ(x, ·) : Z 7→ < is concave and upper semicontinuous. Then
there exists a saddle point of φ under any one of the following four
conditions:
(1) X and Z are compact.
(2) Z is compact and there exists a vector z ∈ Z such that φ(·, z) is
coercive.
(3) X is compact and there exists a vector x ∈ X such that −φ(x, ·)
is coercive.
(4) There exist vectors x ∈ X and z ∈ Z such that φ(·, z) and
−φ(x, ·) are coercive.

Proof: We first prove the result under the assumption that X and Z
are compact [condition (1)], and the additional assumption that φ(x, ·) is
strictly concave for each x ∈ X.
By Weierstrass’ Theorem (Prop. 1.3.1), the function

f (x) = maxz∈Z φ(x, z),    x ∈ X,

is real-valued and the maximum above is attained for each x ∈ X at a


point denoted ẑ(x), which is unique by the strict concavity assumption.
Furthermore, by Prop. 1.2.3(c), f is lower semicontinuous, so again by
Weierstrass’ Theorem, f attains a minimum over x ∈ X at some point x∗ .
Let z ∗ = ẑ(x∗ ), so that

φ(x∗ , z) ≤ φ(x∗ , z ∗ ) = f (x∗ ), ∀ z ∈ Z. (1.25)

We will show that (x∗ , z ∗ ) is a saddle point of φ, and in view of the above
relation, it will suffice to show that φ(x∗ , z ∗ ) ≤ φ(x, z ∗ ) for all x ∈ X.

Choose any x ∈ X and let


xk = (1/k) x + (1 − 1/k) x∗ ,    zk = ẑ(xk ),    k = 1, 2, . . . .

Let z̄ be any limit point of {zk } corresponding to a subsequence K of
positive integers. Using the convexity of φ(·, zk ), we have

f (x∗ ) ≤ f (xk ) = φ(xk , zk ) ≤ (1/k) φ(x, zk ) + (1 − 1/k) φ(x∗ , zk ).    (1.26)
Taking the limit as k → ∞ and k ∈ K, and using the upper semicontinuity
of φ(x∗ , ·), we obtain

f (x∗ ) ≤ lim supk→∞, k∈K φ(x∗ , zk ) ≤ φ(x∗ , z̄) ≤ maxz∈Z φ(x∗ , z) = f (x∗ ).

Hence equality holds throughout above, and it follows that φ(x∗ , z) ≤
φ(x∗ , z̄) = f (x∗ ) for all z ∈ Z. Since z ∗ is the unique maximizer of φ(x∗ , ·)
over Z, we see that z̄ = ẑ(x∗ ) = z ∗ , so that {zk } has z ∗ as its unique limit
point, independently of the choice of the vector x within X (this is the fine
point in the argument where the strict concavity assumption is needed).
Equation (1.25) implies that for all k,

φ(x∗ , zk ) ≤ φ(x∗ , z ∗ ) = f (x∗ ),

which when combined with Eq. (1.26), yields


φ(x∗ , z ∗ ) ≤ (1/k) φ(x, zk ) + (1 − 1/k) φ(x∗ , z ∗ ),    ∀ x ∈ X,
or
φ(x∗ , z ∗ ) ≤ φ(x, zk ), ∀ x ∈ X.
Taking the limit as zk → z ∗ , and using the upper semicontinuity of φ(x, ·),
we obtain
φ(x∗ , z ∗ ) ≤ φ(x, z ∗ ), ∀ x ∈ X.
Combining this relation with Eq. (1.25), we see that (x∗ , z ∗ ) is a saddle
point of φ.
Next, we remove the assumption of strict concavity of φ(x, ·). We
introduce the functions φk : X × Z 7→ < given by

φk (x, z) = φ(x, z) − (1/k)kzk2 ,    k = 1, 2, . . .

Since φk (x, ·) is strictly concave for each x ∈ X, by what has already been
proved, there exists a saddle point (x∗k , zk∗ ) of φk , satisfying

φ(x∗k , z) − (1/k)kzk2 ≤ φ(x∗k , zk∗ ) − (1/k)kzk∗ k2 ≤ φ(x, zk∗ ) − (1/k)kzk∗ k2 ,    ∀ x ∈ X, z ∈ Z.

Let (x∗ , z ∗ ) be a limit point of {(x∗k , zk∗ )}. By taking the limit as k → ∞ and by
using the semicontinuity assumptions on φ, it follows that

φ(x∗ , z) ≤ lim inf k→∞ φ(x∗k , z) ≤ lim supk→∞ φ(x, zk∗ ) ≤ φ(x, z ∗ ),    ∀ x ∈ X, z ∈ Z.
(1.27)
By alternately setting x = x∗ and z = z ∗ in the above relation, we obtain

φ(x∗ , z) ≤ φ(x∗ , z ∗ ) ≤ φ(x, z ∗ ), ∀ x ∈ X, z ∈ Z,

so (x∗ , z ∗ ) is a saddle point of φ.


We now prove the existence of a saddle point under conditions (2),
(3), or (4). Let k̄ be a scalar satisfying

k̄ ≥ max{maxx∈X kxk, kz̄k} under condition (2),
k̄ ≥ max{kx̄k, maxz∈Z kzk} under condition (3),
k̄ ≥ max{kx̄k, kz̄k} under condition (4),

where x̄ and z̄ are the points appearing in conditions (2)-(4). For each
k ≥ k̄, we introduce the convex and compact sets

Xk = {x ∈ X | kxk ≤ k},    Zk = {z ∈ Z | kzk ≤ k}.

Note that Z = Zk under condition (2) and X = Xk under condition (3).


Furthermore, x̄ ∈ Xk and z̄ ∈ Zk for all k ≥ k̄. Using the result already
shown under condition (1), for each k, there exists a saddle point over
Xk × Zk , i.e., a pair (xk , zk ) such that

φ(xk , z) ≤ φ(xk , zk ) ≤ φ(x, zk ), ∀ x ∈ Xk , z ∈ Zk . (1.28)

Assume that condition (2) holds. Then, since Z is compact, {zk } is


bounded. If {xk } were unbounded, the coercivity of φ(·, z̄) would imply
that φ(xk , z̄) → ∞ and from Eq. (1.28) it would follow that φ(x, zk ) → ∞
for all x ∈ X. This contradicts the boundedness of {zk }. Hence {xk } must
be bounded, and (xk , zk ) must have a limit point (x∗ , z ∗ ). Taking limit as
k → ∞ in Eq. (1.28), and using the semicontinuity assumptions on φ, it
follows that

φ(x∗ , z) ≤ lim inf k→∞ φ(xk , z) ≤ lim supk→∞ φ(x, zk ) ≤ φ(x, z ∗ ),    ∀ x ∈ X, z ∈ Z,

which is identical to Eq. (1.27). Thus, by the argument following that


equation, (x∗ , z ∗ ) is a saddle point of φ.
A symmetric argument with the obvious modifications, shows the
result under condition (3). Finally, under condition (4), note that Eq.
(1.28) yields for all k,

φ(xk , z̄) ≤ φ(xk , zk ) ≤ φ(x̄, zk ).



If {xk } were unbounded, the coercivity of φ(·, z̄) would imply that φ(xk , z̄) →
∞ and hence φ(x̄, zk ) → ∞, which violates the coercivity of −φ(x̄, ·).
Hence {xk } must be bounded, and a symmetric argument shows that {zk }
must be bounded. Thus (xk , zk ) must have a limit point (x∗ , z ∗ ). The
result then follows from Eq. (1.27), similar to the case where condition (2)
holds. Q.E.D.

It is easy to construct examples showing that the convexity of X and Z
is an essential assumption for the above proposition (this is also evident from
Fig. 1.3.4). The assumptions of compactness/coercivity and lower/upper
semicontinuity of φ are essential for existence of a saddle point (just as they
are essential in Weierstrass’ Theorem). An interesting question is whether
convexity/concavity and lower/upper semicontinuity of φ are sufficient to
guarantee the minimax equality (1.17). Unfortunately this is not so for
reasons that also touch upon some of the deeper aspects of duality theory
(see Chapter 3). Here is an example:

Example 1.3.1

Let
X = {x ∈ <2 | x ≥ 0}, Z = {z ∈ < | z ≥ 0},

and let

φ(x, z) = e−√(x1 x2 ) + zx1 ,

which satisfy the convexity/concavity and lower/upper semicontinuity as-


sumptions of Prop. 1.3.8. For all z ≥ 0, we have
inf x≥0 {e−√(x1 x2 ) + zx1 } = 0,

since the expression in braces is nonnegative for x ≥ 0 and can approach zero
by taking x1 → 0 and x1 x2 → ∞. Hence

supz≥0 inf x≥0 φ(x, z) = 0.

We also have for all x ≥ 0

supz≥0 {e−√(x1 x2 ) + zx1 } = 1 if x1 = 0, and = ∞ if x1 > 0.

Hence
inf x≥0 supz≥0 φ(x, z) = 1,

so inf x≥0 supz≥0 φ(x, z) > supz≥0 inf x≥0 φ(x, z). The difficulty here is that
the compactness/coercivity assumptions of Prop. 1.3.8 are violated.

EXERCISES

1.3.1

Let f : <n 7→ < be a convex function, let X be a closed convex set, and assume
that f and X have no common direction of recession. Let X ∗ be the optimal
solution set (nonempty and compact by Prop. 1.3.5) and let f ∗ = inf x∈X f (x).
Show that:
(a) For every ε > 0 there exists a δ > 0 such that every vector x ∈ X with
f (x) ≤ f ∗ + δ satisfies minx∗ ∈X ∗ kx − x∗ k ≤ ε.
(b) Every sequence {xk } ⊂ X satisfying limk→∞ f (xk ) = f ∗ is bounded and
all its limit points belong to X ∗ .

1.3.2

Let C be a convex set and S be a subspace. Show that projection on S is a


linear transformation and use this to show that the projection of the set C on S
is a convex set, which is compact if C is compact. Is the projection of C always
closed if C is closed?

1.3.3 (Existence of Solution of Quadratic Programs)

This exercise deals with an extension of Prop. 1.3.6 to the case where the quadratic
cost may not be convex. Consider a problem of the form

minimize c0 x + (1/2) x0 Qx
subject to Ax ≤ b,

where Q is a symmetric (not necessarily positive semidefinite) matrix, c and b


are given vectors, and A is a matrix. Show that the following are equivalent:
(a) There exists at least one optimal solution.
(b) The cost function is bounded below over the constraint set.
(c) The problem has at least one feasible solution, and for any feasible x, there
is no y ∈ <n such that Ay ≤ 0 and either y 0 Qy < 0 or y 0 Qy = 0 and
(c + Qx)0 y < 0.

1.3.4 (Existence of Optimal Solutions)

Let f : <n 7→ < be a convex function, and consider the problem of minimizing
f over a closed and convex set X. Suppose that f attains a minimum along all
half lines of the form {x + αy | α ≥ 0} where x ∈ X and y is in the recession cone
of X. Show that we may have inf x∈X f (x) = −∞. Hint : Use the case n = 2,
X = <2 , f (x) = minz∈C kz − xk2 − x1 , where C = {(x1 , x2 ) | (x1 )2 ≤ x2 }.

1.3.5 (Saddle Points in two Dimensions)

Consider a function φ of two real variables x and z taking values in compact


intervals X and Z, respectively. Assume that for each z ∈ Z, the function φ(·, z)
is minimized over X at a unique point denoted x̂(z). Similarly, assume that for
each x ∈ X, the function φ(x, ·) is maximized over Z at a unique point denoted
ẑ(x). Assume further that the functions x̂(z) and ẑ(x) are continuous over Z
and X, respectively. Show that φ has a unique saddle point (x∗ , z ∗ ). Use this to
investigate the existence of saddle points of φ(x, z) = x2 + z 2 over X = [0, 1] and
Z = [0, 1].

1.4 HYPERPLANES

A hyperplane is a set of the form {x | a0 x = b}, where a ∈ <n , a ≠ 0,
and b ∈ <, as illustrated in Fig. 1.4.1. An equivalent definition is that a
hyperplane in <n is an affine set of dimension n − 1. The vector a is called
the normal vector of the hyperplane (it is orthogonal to the difference x − y
of any two vectors x and y of the hyperplane). The two sets

{x | a0 x ≥ b}, {x | a0 x ≤ b},

are called the halfspaces associated with the hyperplane (also referred to as
the positive and negative halfspaces, respectively). We have the following
result, which is also illustrated in Fig. 1.4.1. The proof is based on the
Projection Theorem and is illustrated in Fig. 1.4.2.

Proposition 1.4.1: (Supporting Hyperplane Theorem) If C ⊂


<n is a convex set and x is a point that does not belong to the interior
of C, there exists a vector a 6= 0 such that

a0 x ≥ a0 x, ∀ x ∈ C. (1.29)


Figure 1.4.1. (a) A hyperplane {x | a0 x = b} divides the space into two halfs-
paces as illustrated. (b) Geometric interpretation of the Supporting Hyperplane

paces as illustrated. (b) Geometric interpretation of the Supporting Hyperplane
Theorem. (c) Geometric interpretation of the Separating Hyperplane Theorem.


Figure 1.4.2. Illustration of the proof of the Supporting Hyperplane Theorem


for the case where the vector x belongs to the closure of C. We choose a sequence
{xk } of vectors not belonging to the closure of C which converges to x, and we
project xk on the closure of C. We then consider, for each k, the hyperplane that
is orthogonal to the line segment connecting xk and its projection, and passes
through xk . These hyperplanes “converge” to a hyperplane that supports C at x.

Proof: Consider the closure cl(C) of C, which is a convex set by Prop.



1.2.1(d). Let {xk } be a sequence of vectors not belonging to cl(C), which


converges to x; such a sequence exists because x does not belong to the
interior of C. If x̂k is the projection of xk on cl(C), we have by part (b) of
the Projection Theorem (Prop. 1.3.3)

(x̂k − xk )0 (x − x̂k ) ≥ 0, ∀ x ∈ cl(C).

Hence we obtain for all k and x ∈ cl(C),

(x̂k −xk )0 x ≥ (x̂k −xk )0 x̂k = (x̂k −xk )0 (x̂k −xk )+(x̂k −xk )0 xk ≥ (x̂k −xk )0 xk .

We can write this inequality as

a0k x ≥ a0k xk , ∀ x ∈ cl(C), k = 0, 1, . . . , (1.30)

where
ak = (x̂k − xk )/kx̂k − xk k.

We have kak k = 1 for all k, and hence the sequence {ak } has a subsequence
that converges to a nonzero limit a. By considering Eq. (1.30) for all ak
belonging to this subsequence and by taking the limit as k → ∞, we obtain
Eq. (1.29). Q.E.D.

Proposition 1.4.2: (Separating Hyperplane Theorem) If C1


and C2 are two nonempty and disjoint convex subsets of <n , there
exists a hyperplane that separates them, i.e., a vector a 6= 0 such that

a0 x1 ≤ a0 x2 , ∀ x1 ∈ C1 , x2 ∈ C2 . (1.31)

Proof: Consider the convex set

C = {x | x = x2 − x1 , x1 ∈ C1 , x2 ∈ C2 }.

Since C1 and C2 are disjoint, the origin does not belong to C, so by the
Supporting Hyperplane Theorem there exists a vector a 6= 0 such that

0 ≤ a0 x, ∀ x ∈ C,

which is equivalent to Eq. (1.31). Q.E.D.



Proposition 1.4.3: (Strict Separation Theorem) If C1 and C2


are two nonempty and disjoint convex sets such that C1 is closed and
C2 is compact, there exists a hyperplane that strictly separates them,
i.e., a vector a 6= 0 and a scalar b such that

a0 x1 < b < a0 x2 , ∀ x1 ∈ C1, x2 ∈ C2 . (1.32)

Proof: Consider the problem


minimize kx1 − x2 k
subject to x1 ∈ C1 , x2 ∈ C2.
This problem is equivalent to minimizing the distance d(x2 , C1) over x2 ∈
C2 . Since C2 is compact, and the distance function is convex (cf. Prop.
1.3.3) and hence continuous (cf. Prop. 1.2.12), the problem has at least one
solution, which we denote by (x̄1 , x̄2 ). Let

a = (x̄2 − x̄1 )/2,    x̄ = (x̄1 + x̄2 )/2,    b = a0 x̄.

Then a ≠ 0, since x̄1 ∈ C1 , x̄2 ∈ C2 , and C1 and C2 are disjoint. The
hyperplane
{x | a0 x = b}
contains x̄, and it can be seen from the preceding discussion that x̄1 is the
projection of x̄ on C1 , and x̄2 is the projection of x̄ on C2 (see Fig. 1.4.3).
By Prop. 1.3.3(b), we have

(x̄ − x̄1 )0 (x1 − x̄1 ) ≤ 0,    ∀ x1 ∈ C1 ,

or equivalently, since x̄ − x̄1 = a,

a0 x1 ≤ a0 x̄1 = a0 x̄ + a0 (x̄1 − x̄) = b − kak2 < b,    ∀ x1 ∈ C1 .
Thus, the left-hand side of Eq. (1.32) is proved. The right-hand side is
proved similarly. Q.E.D.
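The proof is constructive, and the construction is easy to mimic numerically. The following sketch (an aside, with two arbitrarily chosen sets whose projections have closed forms; the alternating-projection loop is only a heuristic way to locate the closest pair in this simple example) forms the strictly separating hyperplane {x | a0 x = b} exactly as in the proof and checks the strict inequalities on samples.

```python
import numpy as np

# C1: the box [2,3] x [2,3] (closed, convex); C2: the closed unit ball (compact, convex).
proj_C1 = lambda x: np.clip(x, 2.0, 3.0)
proj_C2 = lambda x: x / max(1.0, np.linalg.norm(x))

# Alternating projections to approximate the closest pair of points.
x1 = np.array([2.5, 2.5])
for _ in range(100):
    x2 = proj_C2(x1)
    x1 = proj_C1(x2)

a = (x2 - x1) / 2.0               # normal vector, as in the proof of Prop. 1.4.3
b = a @ (x1 + x2) / 2.0

# Sample each set and verify a'x < b on C1 and a'x > b on C2.
rng = np.random.default_rng(0)
pts1 = rng.uniform(2.0, 3.0, size=(500, 2))
pts2 = rng.normal(size=(500, 2))
pts2 = pts2 / np.maximum(1.0, np.linalg.norm(pts2, axis=1))[:, None]
assert np.all(pts1 @ a < b) and np.all(pts2 @ a > b)
```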

The preceding proposition may be used to provide a fundamental


characterization of closed convex sets.

Proposition 1.4.4:
(a) A closed convex set C ⊂ <n is the intersection of the halfspaces that
contain C.
(b) The closure of the convex hull of a set C ⊂ <n is the intersection
of the halfspaces that contain C.

Figure 1.4.3. (a) Illustration of the construction of a strictly separating hyper-
plane of two disjoint closed convex sets C1 and C2 one of which is also bounded
(cf. Prop. 1.4.3). (b) An example, with C1 = {(ξ1 , ξ2 ) | ξ1 ≤ 0} and C2 =
{(ξ1 , ξ2 ) | ξ1 > 0, ξ2 > 0, ξ1 ξ2 ≥ 1}, showing that if none of the two sets is
compact, there may not exist a strictly separating hyperplane. This is due to the
fact that the set C = {x1 − x2 | x1 ∈ C1 , x2 ∈ C2 } is equal to {(ξ1 , ξ2 ) | ξ1 < 0}
and is not closed, even though C1 and C2 are closed.

Proof: (a) Clearly C is contained in the intersection of the halfspaces


that contain C, so we focus on proving the reverse inclusion. Let x ∉ C.
Applying the Strict Separation Theorem (Prop. 1.4.3) to the sets C and
{x}, we see that there exists a halfspace containing C but not containing x.
Hence, if x ∉ C, then x cannot belong to the intersection of the halfspaces
containing C, proving the result.
containing C, proving the result.
(b) Each halfspace H that contains C must also contain conv(C) (since H is
convex), and also cl(conv(C)) (since H is closed). Hence the intersection of
all halfspaces containing C and the intersection of all halfspaces containing
cl(conv(C)) coincide. From part (a), the latter intersection is equal to
cl(conv(C)). Q.E.D.

EXERCISES

1.4.1 (Strong Separation)

Let C1 and C2 be two nonempty, convex subsets of <n , and let B denote the unit

ball in <n . A hyperplane H is said to separate strongly C1 and C2 if there exists


an ε > 0 such that C1 + εB is contained in one of the open halfspaces associated
with H and C2 + εB is contained in the other. Show that:
(a) The following three conditions are equivalent:
(i) There exists a hyperplane strongly separating C1 and C2 .
(ii) There exists a vector b ∈ <n such that inf x∈C1 b0 x > supx∈C2 b0 x.
(iii) inf x1 ∈C1 , x2 ∈C2 kx1 − x2 k > 0.
(b) If C1 and C2 are closed and have no common directions of recession, there
exists a hyperplane strongly separating C1 and C2 .
(c) If the two sets C1 and C2 have disjoint closures, and at least one of the two
is bounded, there exists a hyperplane strongly separating them.

1.4.2 (Proper Separation)

Let C1 and C2 be two nonempty, convex subsets of <n . A hyperplane H is said


to separate properly C1 and C2 if C1 and C2 are not both contained in H . Show
that the following three conditions are equivalent:
(i) There exists a hyperplane properly separating C1 and C2 .
(ii) There exists a vector b ∈ <n such that

inf x∈C1 b0 x ≥ supx∈C2 b0 x,    supx∈C1 b0 x > inf x∈C2 b0 x,

(iii) The relative interiors ri(C1 ) and ri(C2 ) have no point in common.

1.5 CONICAL APPROXIMATIONS AND CONSTRAINED


OPTIMIZATION

Given a set C, the cone given by

C ⊥ = {y | y 0 x ≤ 0, ∀ x ∈ C},

is called the polar cone of C (see Fig. 1.5.1). Clearly, the polar cone C ⊥ ,
being the intersection of a collection of closed halfspaces, is closed and
convex (regardless of whether C is closed and/or convex). We also have
the following basic result.


Figure 1.5.1. Illustration of the polar cone C ⊥ of a set C ⊂ <2 . In (a) C


consists of just two points, a1 and a2 , while in (b) C is the convex cone {x | x =
µ1 a1 + µ2 a2 , µ1 ≥ 0, µ2 ≥ 0}. The polar cone C ⊥ is the same in both cases.

Proposition 1.5.1: (Polar Cone Theorem)


(a) For any set C, we have
C ⊥ = (cl(C))⊥ = (conv(C))⊥ = (cone(C))⊥ .

(b) For any cone C, we have


(C ⊥ )⊥ = cl(conv(C)).

In particular, if C is closed and convex, (C ⊥ )⊥ = C.

Proof: (a) Clearly, we have C ⊥ ⊃ (cl(C))⊥ . Conversely, if y ∈ C ⊥ , then
y 0 xk ≤ 0 for all k and all sequences {xk } ⊂ C, so that y 0 x ≤ 0 for all limits
x of such sequences. Hence y ∈ (cl(C))⊥ and C ⊥ ⊂ (cl(C))⊥ .
Similarly, we have C ⊥ ⊃ (conv(C))⊥ . Conversely, if y ∈ C ⊥ , then
y 0 x ≤ 0 for all x ∈ C so that y 0 z ≤ 0 for all z that are convex combinations
of vectors x ∈ C. Hence y ∈ (conv(C))⊥ and C ⊥ ⊂ (conv(C))⊥ . A nearly
identical argument also shows that C ⊥ = (cone(C))⊥ .
(b) Figure 1.5.2 shows that if C is closed and convex, then (C ⊥ )⊥ = C.
From this it follows that
((cl(conv(C)))⊥ )⊥ = cl(conv(C)),

and by using part (a) in the left-hand side above, we obtain (C ⊥ )⊥ =
cl(conv(C)). Q.E.D.


Figure 1.5.2. Proof of the Polar Cone Theorem for the case where C is a closed
and convex cone. If x ∈ C, then for all y ∈ C ⊥ , we have x0 y ≤ 0, which implies
that x ∈ (C ⊥ )⊥ . Hence, C ⊂ (C ⊥ )⊥ . To prove the reverse inclusion, take
z ∈ (C ⊥ )⊥ , and let ẑ be the unique projection of z on C, as shown in the figure.
Since C is closed, the projection exists by the Projection Theorem (Prop. 1.3.3),
which also implies that

(z − ẑ)0 (x − ẑ) ≤ 0, ∀ x ∈ C.

By taking in the preceding relation x = 0 and x = 2ẑ (which belong to C since C


is a closed cone), it is seen that

(z − ẑ)0 ẑ = 0.

Combining the last two relations, we obtain (z − ẑ)0 x ≤ 0 for all x ∈ C. Therefore,
(z − ẑ) ∈ C ⊥ , and since z ∈ (C ⊥ )⊥ , we obtain (z − ẑ)0 z ≤ 0, which when added
to (z − ẑ)0 ẑ = 0 yields kz − ẑk2 ≤ 0. Therefore, z = ẑ and z ∈ C. It follows that
(C ⊥ )⊥ ⊂ C.
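A small numerical sketch of these definitions (an aside, with arbitrarily chosen generators): for a finitely generated cone C = {µ1 a1 + µ2 a2 | µ1 ≥ 0, µ2 ≥ 0}, membership of y in C ⊥ reduces to checking a01 y ≤ 0 and a02 y ≤ 0, and the defining inequality of the polar cone can then be checked on samples.

```python
import numpy as np

a1, a2 = np.array([2.0, 1.0]), np.array([1.0, 3.0])   # generators of C (arbitrary choice)

def in_polar(y):
    # y is in the polar cone of C iff y'x <= 0 for all x in C; since every x in C
    # is a nonnegative combination of a1 and a2, it suffices to test the generators.
    return y @ a1 <= 1e-12 and y @ a2 <= 1e-12

rng = np.random.default_rng(0)
cone_pts  = [mu[0] * a1 + mu[1] * a2 for mu in rng.uniform(0.0, 5.0, size=(300, 2))]
polar_pts = [y for y in rng.normal(size=(2000, 2)) if in_polar(y)]

# Defining property of the polar cone: y'x <= 0 for every x in C and y in its polar.
# This also exhibits the inclusion of C in its double polar on the samples; by
# Prop. 1.5.1(b) the two sets coincide here, since C is closed and convex.
assert all(y @ x <= 1e-9 for x in cone_pts for y in polar_pts)
```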

The analysis of a constrained optimization problem often centers on


how the cost function behaves along directions leading away from a local
minimum to some neighboring feasible points. The sets of the relevant
directions constitute cones that can be viewed as approximations to the
constraint set, locally near a point of interest. Let us introduce two such
cones, which are important in connection with optimality conditions.

Definition 1.5.1: Given a subset X of <n and a vector x ∈ X, a


feasible direction of X at x is a vector y ∈ <n such that there exists
an ᾱ > 0 with x + αy ∈ X for all α ∈ [0, ᾱ]. The set of all feasible
directions of X at x is a cone denoted by FX (x).

It can be seen that if X is convex, the feasible directions at x are the
vectors of the form α(x̄ − x) with α > 0 and x̄ ∈ X. However, when X
is nonconvex, the cone of feasible directions need not provide interesting
information about the local structure of the set X near the point x. For
example, often there is no nonzero feasible direction at x when X is noncon-
vex [think of the set X = {x | h(x) = 0}, where h : <n 7→ < is a nonlinear
function]. The next definition introduces a cone that provides information
on the local structure of X even when there is no nonzero feasible direction.

Definition 1.5.2: Given a subset X of <n and a vector x ∈ X, a


vector y is said to be a tangent of X at x if either y = 0 or there exists
a sequence {xk } ⊂ X such that xk 6= x for all k and

xk → x,    (xk − x)/kxk − xk → y/kyk.

The set of all tangents of X at x is a cone called the tangent cone of


X at x, and is denoted by TX (x).

Figure 1.5.3. Illustration of a tangent y at a vector x ∈ X. There is a sequence
{xk } ⊂ X that converges to x and is such that the normalized direction sequence
(xk − x)/kxk − xk converges to y/kyk, the normalized direction of y, or equiva-
lently, the sequence
yk = kyk(xk − x)/kxk − xk
illustrated in the figure converges to y.

Thus a nonzero vector y is a tangent at x if it is possible to approach x


with a feasible sequence {xk } such that the normalized direction sequence
(xk − x)/kxk − xk converges to y/kyk, the normalized direction of y (see
Fig. 1.5.3). The following proposition provides an equivalent definition of


Figure 1.5.4. Examples of the cones FX (x) and TX (x) of a set X at the vector
x = (0, 1). In (a), we have

X = {(x1 , x2 ) | (x1 + 1)2 − x2 ≤ 0, (x1 − 1)2 − x2 ≤ 0}.

Here X is convex and the tangent cone TX (x) is equal to the closure of the cone of
feasible directions FX (x) (which is an open set in this example). Note, however,
that the vectors (1, 2) and (−1, 2) (as well as the origin) belong to TX (x) and also
to the closure of FX (x), but are not feasible directions. In (b), we have

X = {(x1 , x2 ) | ((x1 + 1)2 − x2 )((x1 − 1)2 − x2 ) = 0}.

Here the set X is nonconvex, and TX (x) is closed but not convex. Furthermore,
FX (x) consists of just the zero vector.

a tangent, which is occasionally more convenient.

Proposition 1.5.2: A vector y is a tangent of a set X if and only


if there exists a sequence {xk } ⊂ X with xk → x, and a positive
sequence {αk } such that αk → 0 and (xk − x)/αk → y.

Proof: If {xk } is the sequence in the definition of a tangent, take αk =


kxk − xk/kyk. Conversely, if αk → 0, (xk − x)/αk → y, and y ≠ 0, clearly
xk → x and

(xk − x)/kxk − xk = ((xk − x)/αk ) / k(xk − x)/αk k → y/kyk,
so {xk } satisfies the definition of a tangent. Q.E.D.

Figure 1.5.4 illustrates the cones TX (x) and FX (x) with examples.
The following proposition gives some of the properties of the cones FX (x)
and TX (x).

Proposition 1.5.3: Let X be a nonempty subset of <n and let x


be a vector in X. The following hold regarding the cone of feasible
directions FX (x) and the tangent cone TX (x).
(a) TX (x) is a closed cone.
(b) cl(FX (x)) ⊂ TX (x).
(c) If X is convex, then FX (x) and TX (x) are convex, and we have
cl(FX (x)) = TX (x).

Proof: (a) Let {yk } be a sequence in TX (x) that converges to y. We will


show that y ∈ TX (x). If y = 0, then y ∈ TX (x), so assume that y ≠ 0. By
the definition of a tangent, for every yk there is a sequence {xik } ⊂ X with
xik ≠ x such that

limi→∞ xik = x,    limi→∞ (xik − x)/||xik − x|| = yk /||yk ||.

For k = 1, 2, . . ., choose an index ik such that i1 < i2 < . . . < ik and
(writing xikk for the vector xik with i = ik )

limk→∞ ||xikk − x|| = 0,    limk→∞ || (xikk − x)/||xikk − x|| − yk /||yk || || = 0.

We also have for all k


|| (xikk − x)/||xikk − x|| − y/||y|| || ≤ || (xikk − x)/||xikk − x|| − yk /||yk || || + || yk /||yk || − y/||y|| ||,

so the fact that yk → y and the preceding two relations imply that

limk→∞ ||xikk − x|| = 0,    limk→∞ || (xikk − x)/||xikk − x|| − y/||y|| || = 0.

Hence y ∈ TX (x) and TX (x) is closed.


(b) Clearly every feasible direction is also a tangent, so FX (x) ⊂ TX (x).
Since by part (a), TX (x) is closed, the result follows.
(c) If X is convex, the feasible directions at x are the vectors of the form
α(x − x) with α > 0 and x ∈ X. From this it can be seen that FX (x) is
convex. Convexity
¡ of T¢X (x) will follow from the convexity of FX (x) once
we show that cl FX (x) = TX (x).
In view of part (b), it will suffice to show that T_X(x) ⊂ cl(F_X(x)).
Let y ∈ T_X(x) and, using Prop. 1.5.2, let {x_k} be a sequence in X and
{α_k} be a positive sequence such that α_k → 0 and (x_k − x)/α_k → y. Since
X is a convex set, the direction (x_k − x)/α_k is feasible at x for all k. Hence
y ∈ cl(F_X(x)), and it follows that T_X(x) ⊂ cl(F_X(x)). Q.E.D.
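
As a simple illustration of these definitions, consider the (nonconvex) set
X = {(x_1, x_2) | x_1 x_2 = 0}, the union of the two coordinate axes in ℝ^2, and
the vector x = 0. Every nonzero y ∈ X is a tangent at 0 [take x_k = (1/k)y], and
every tangent must lie on one of the axes, since it is a limit of normalized
vectors (x_k − 0)/‖x_k‖ that lie on the axes. Hence T_X(0) = X, which is a closed
cone, consistent with Prop. 1.5.3(a), but it is not convex, showing that the
convexity assertions of Prop. 1.5.3(c) need not hold when X is not convex.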

The tangent cone finds an important application in the following basic


necessary condition for local optimality:

Proposition 1.5.4: Let f : ℝ^n ↦ ℝ be a smooth function, and let
x* be a local minimum of f over a set X ⊂ ℝ^n. Then
$$\nabla f(x^*)' y \ge 0, \qquad \forall\ y \in T_X(x^*).$$
If X is convex, this condition can be equivalently written as
$$\nabla f(x^*)'(x - x^*) \ge 0, \qquad \forall\ x \in X,$$
and in the case where X = ℝ^n, it reduces to ∇f(x*) = 0.

Proof: Let y be a nonzero tangent of X at x*. Then there exists a sequence
{ξ_k} and a sequence {x_k} ⊂ X such that x_k ≠ x* for all k,
$$\xi_k \to 0, \qquad x_k \to x^*,$$
and
$$\frac{x_k - x^*}{\|x_k - x^*\|} = \frac{y}{\|y\|} + \xi_k.$$
By the Mean Value Theorem, we have for all k

f (xk ) = f (x∗ ) + ∇f (x̃k )0 (xk − x∗ ),

where x̃k is a vector that lies on the line segment joining xk and x∗ . Com-
bining the last two equations, we obtain

$$f(x_k) = f(x^*) + \frac{\|x_k - x^*\|}{\|y\|}\, \nabla f(\tilde{x}_k)' y_k, \tag{1.33}$$
where
$$y_k = y + \|y\|\, \xi_k.$$
If ∇f (x∗ )0 y < 0, since x̃k → x∗ and yk → y, it follows that for all suffi-
ciently large k, ∇f(x̃k )0 yk < 0 and [from Eq. (1.33)] f (xk ) < f (x∗ ). This
contradicts the local optimality of x∗ .
When X is convex, we have cl(F_X(x*)) = T_X(x*) (cf. Prop. 1.5.3).
Thus the condition shown can be written as
$$\nabla f(x^*)' y \ge 0, \qquad \forall\ y \in \mathrm{cl}\big(F_X(x^*)\big),$$

which in turn is equivalent to

∇f(x∗ )0 (x − x∗ ) ≥ 0, ∀ x ∈ X.

If X = <n , by setting x = x∗ + ei and x = x∗ − ei , where ei is the ith


unit vector (all components are 0 except for the ith, which is 1), we obtain
∂f (x∗ )/∂xi = 0 for all i = 1, . . . , n, so ∇f (x∗ ) = 0. Q.E.D.

A direction y for which ∇f (x∗ )0 y < 0 may be viewed as a descent


direction of f at x∗ , in the sense that we have (by Taylor’s theorem)

f (x∗ + αy) = f (x∗ ) + α∇f (x∗ )0 y + o(α) < f (x∗ )

for sufficiently small but positive α. Thus Prop. 1.5.4 says that if x∗ is
a local minimum, there is no descent direction within the tangent cone
TX (x∗ ).
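
As a simple illustration, let f(x) = x_1 + x_2 and X = {x ∈ ℝ^2 | x ≥ 0},
with local (in fact global) minimum x* = 0. Here T_X(x*) is the nonnegative
orthant, and indeed ∇f(x*)'y = y_1 + y_2 ≥ 0 for every y ∈ T_X(x*), consistent
with Prop. 1.5.4; every descent direction of f at x*, such as y = (−1, −1),
lies outside the tangent cone.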
Note that the necessary condition of Prop. 1.5.4 can equivalently be
written as
−∇f (x∗ ) ∈ TX (x∗ )⊥
(see Fig. 1.5.5). There is an interesting converse of this result, namely that
given any vector z ∈ TX (x∗ )⊥ , there exists a smooth function f such that
−∇f (x∗ ) = z and x∗ is a local minimum of f over X. We will return to
this result and to the subject of conical approximations when we discuss
Lagrange multipliers in Chapter 2.

The Normal Cone

In addition to the cone of feasible directions and the tangent cone, there is
one more conical approximation that is of special interest for the optimiza-
tion topics covered in this book. This is the normal cone of X at x, denoted
by NX (x), and obtained from the polar cone TX (x)⊥ by means of a clo-
sure operation. In particular, we have z ∈ NX (x) if there exist sequences
{xk } ⊂ X and {zk } such that xk → x, zk → z, and zk ∈ TX (xk )⊥ for all
k. Equivalently, the graph of NX (·), viewed as a point-to-set mapping, is
the closure of the graph of TX (·)⊥ :
$$\big\{ (x, z) \mid x \in X,\ z \in N_X(x) \big\} = \mathrm{cl}\Big( \big\{ (x, z) \mid x \in X,\ z \in T_X(x)^\perp \big\} \Big).$$

Clearly, NX (x) is a closed cone containing TX (x)⊥ , but it need not


be convex like TX (x)⊥ (see the examples of Fig. 1.5.6). In the case where

Figure 1.5.5. Illustration of the necessary optimality condition

−∇f (x∗ ) ∈ TX (x∗ )⊥

for x∗ to be a local minimum of f over X.

TX (x)⊥ = NX (x), we say that X is regular at x. An important consequence


of convexity of X is that it implies regularity, as shown in the following
proposition.

Proposition 1.5.5: Let X be a convex set. Then for all x ∈ X, we
have
$$z \in T_X(x)^\perp \quad \text{if and only if} \quad z'(\bar{x} - x) \le 0, \quad \forall\ \bar{x} \in X. \tag{1.34}$$
Furthermore, X is regular at all x ∈ X.

Proof: Since (x̄ − x) ∈ F_X(x) ⊂ T_X(x) for all x̄ ∈ X, it follows that
if z ∈ T_X(x)^⊥, then z'(x̄ − x) ≤ 0 for all x̄ ∈ X. Conversely, let z be
such that z'(x̄ − x) ≤ 0 for all x̄ ∈ X, and to arrive at a contradiction,
assume that z ∉ T_X(x)^⊥. Then there exists some y ∈ T_X(x) such that
z'y > 0. Since cl(F_X(x)) = T_X(x) [cf. Prop. 1.5.3(c)], there exists a
sequence {y_k} ⊂ F_X(x) such that y_k → y, so that y_k = α_k(x_k − x) for
some α_k > 0 and some x_k ∈ X. Since z'y > 0, we have α_k z'(x_k − x) > 0
for large enough k, which is a contradiction.

Figure 1.5.6. Examples of normal cones. In the case of figure (a), X is the union
of two lines passing through the origin:
$$X = \{ x \mid (a_1' x)(a_2' x) = 0 \}.$$
For x = 0 we have T_X(x) = X, T_X(x)^⊥ = {0}, while N_X(x) is the nonconvex set
consisting of the two lines of vectors that are collinear to either a_1 or a_2. Thus
X is not regular at x = 0. At all other vectors x ∈ X, we have regularity with
T_X(x)^⊥ and N_X(x) equal to either the line of vectors that are collinear to a_1 or
the line of vectors that are collinear to a_2.
In the case of figure (b), X is regular at all points except at x = 0, where
we have T_X(x) = ℝ^n, T_X(x)^⊥ = {0}, while N_X(x) is equal to the horizontal axis.

If x ∈ X and z ∈ N_X(x), there must exist sequences {x_k} ⊂ X and
{z_k} such that x_k → x, z_k → z, and z_k ∈ T_X(x_k)^⊥. By Eq. (1.34) (which
critically depends on the convexity of X), we must have z_k'(x̄ − x_k) ≤ 0
for all x̄ ∈ X. Taking the limit as k → ∞, we obtain z'(x̄ − x) ≤ 0 for
all x̄ ∈ X, which by Eq. (1.34), implies that z ∈ T_X(x)^⊥. Thus, we have
N_X(x) ⊂ T_X(x)^⊥. Since the reverse inclusion always holds, it follows that
N_X(x) = T_X(x)^⊥, so that X is regular at x. Q.E.D.
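
For example, for the convex set X = {x ∈ ℝ^2 | x ≥ 0} and x = 0, we have
T_X(0) = X, and Eq. (1.34) gives T_X(0)^⊥ = {z | z'x̄ ≤ 0 for all x̄ ≥ 0} =
{z | z ≤ 0}. Regularity then guarantees that N_X(0) is this same nonpositive
orthant, with no additional limiting normals arising from nearby points of X.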

Note that convexity of TX (x) does not imply regularity of X at x,


as the example of Fig. 1.5.6(b) shows. However, it can be shown that the
converse is true, i.e., if X is regular at x, then TX (x) must be convex (see
Rockafellar and Wets [RoW98]). This provides an often useful method for
recognizing nonregularity.

EXERCISES

1.5.1 (Fermat’s Principle in Optics)

Let C ⊂ <n be a closed convex set, and let y and z be given vectors in <n such
that the line segment connecting y and z does not intersect with C. Consider
the problem of minimizing the sum of distances ky − xk + kz − xk over x ∈ C.
Derive a necessary and sufficient optimality condition. Does an optimal solution
exist and if so, is it unique? Discuss the case where C is closed but not convex.

1.5.2

Let C1 , C2 , and C3 be three closed subsets of <n . Consider the problem of finding
a triangle with minimum perimeter that has one vertex on each of the three sets,
i.e., the problem of minimizing kx1 − x2 k + kx2 − x3 k + kx3 − x1 k subject to
xi ∈ Ci , i = 1, 2, 3, and the additional condition that x1 , x2 , and x3 do not lie on
the same line. Show that if (x∗1 , x∗2 , x∗3 ) defines an optimal triangle, there exists
a vector z ∗ in the triangle such that

(z∗ − x∗i ) ∈ TCi (x∗i )⊥ , i = 1, 2, 3.

1.5.3 (Cone Decomposition Theorem)

Let C ⊂ <n be a closed convex cone and let x be a given vector in <n . Show
that:

(a) x̂ is the projection of x on C if and only if

x̂ ∈ C, x − x̂ ∈ C ⊥ , (x − x̂)0 x̂ = 0.

(b) The following two properties are equivalent:


(i) x1 and x2 are the projections of x on C and C ⊥ , respectively.
(ii) x = x1 + x2 with x1 ∈ C, x2 ∈ C ⊥ , and x01 x2 = 0.

1.5.4

Let C ⊂ ℝ^n be a closed convex cone and let a be a given vector in ℝ^n. Show
that for any positive scalars β and γ, we have
$$\max_{\|x\| \le \beta,\ x \in C} a' x \le \gamma \quad \text{if and only if} \quad a \in C^\perp + \{ x \mid \|x\| \le \gamma/\beta \}.$$

1.5.5 (Quasiregularity)

Consider the set
$$X = \big\{ x \mid h_i(x) = 0,\ i = 1, \ldots, m \big\},$$
where the functions h_i : ℝ^n ↦ ℝ are smooth. For any x ∈ X consider the
subspace
$$V(x) = \big\{ y \mid \nabla h_i(x)' y = 0,\ i = 1, \ldots, m \big\}.$$
Show that:
(a) TX (x) ⊂ V (x).
(b) TX (x) = V (x) if either the gradients ∇hi (x), i = 1, . . . , m, are linearly in-
dependent or the functions hi are linear. Note: The property TX (x) = V (x)
is called quasiregularity, and will be significant in the Lagrange multiplier
theory of Chapter 2.

1.6 POLYHEDRAL CONVEXITY

1.6.1 Polyhedral Cones

We now develop some basic results regarding the geometry of polyhedral


sets. We start with properties of cones that have a polyhedral structure.
We say that a cone C is finitely generated, if it has the form
$$C = \Big\{ x \ \Big|\ x = \sum_{j=1}^{r} \mu_j a_j,\ \mu_j \ge 0,\ j = 1, \ldots, r \Big\},$$

where a_1, . . . , a_r are some vectors. We say that a cone C is polyhedral, if
it has the form
$$C = \{ x \mid a_j' x \le 0,\ j = 1, \ldots, r \},$$

where a1 , . . . , ar are some vectors.


It can be seen that polyhedral cones are closed, since they are inter-
sections of closed halfspaces. Finitely generated cones are also closed and
are in fact polyhedral, but this is a fairly deep fact, which is shown in the
following proposition.
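
As a concrete two-dimensional illustration of these two descriptions, the cone
generated by a_1 = (1, 0) and a_2 = (1, 1), namely
$$C = \{ \mu_1 (1, 0) + \mu_2 (1, 1) \mid \mu_1 \ge 0,\ \mu_2 \ge 0 \} = \{ x \mid 0 \le x_2 \le x_1 \},$$
is also the polyhedral cone {x | −x_2 ≤ 0, x_2 − x_1 ≤ 0}, in agreement with the
Minkowski–Weyl Theorem proved below.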

Proposition 1.6.1:
(a) Let a_1, . . . , a_r be vectors of ℝ^n. Then the finitely generated cone
$$C = \Big\{ x \ \Big|\ x = \sum_{j=1}^{r} \mu_j a_j,\ \mu_j \ge 0,\ j = 1, \ldots, r \Big\} \tag{1.35}$$
is closed and its polar cone is the polyhedral cone given by
$$C^\perp = \big\{ x \mid a_j' x \le 0,\ j = 1, \ldots, r \big\}. \tag{1.36}$$
(b) (Farkas' Lemma) Let x, e_1, . . . , e_m, and a_1, . . . , a_r be vectors of
ℝ^n. We have x'y ≤ 0 for all vectors y ∈ ℝ^n such that
$$y' e_i = 0, \quad \forall\ i = 1, \ldots, m, \qquad y' a_j \le 0, \quad \forall\ j = 1, \ldots, r,$$
if and only if x can be expressed as
$$x = \sum_{i=1}^{m} \lambda_i e_i + \sum_{j=1}^{r} \mu_j a_j,$$
where λ_i and μ_j are some scalars with μ_j ≥ 0 for all j.
(c) (Minkowski–Weyl Theorem) A cone is polyhedral if and only if
it is finitely generated.

Proof: (a) We first show that the polar cone of C has the desired form
(1.36). If y satisfies y 0 aj ≤ 0 for all j, then y 0 x ≤ 0 for all x ∈ C, so the
set in the right-hand side of Eq. (1.36) is a subset of C ⊥ . Conversely, if
y ∈ C ⊥ , i.e., if y 0 x ≤ 0 for all x ∈ C, then (since aj belong to C) we have
y 0 aj ≤ 0, for all j. Thus, C ⊥ is a subset of the set in the right-hand side
of Eq. (1.36).

There remains to show that C is closed. We will give two proofs


for this. The first proof is simpler and suffices for the purpose of showing
Farkas’ lemma [part (b)]. The second proof shows a stronger result, namely
that C is polyhedral. This not only shows that C is closed, but also proves
half of the Minkowski-Weyl Theorem [part (c)].
The first proof that C is closed is based on induction on the number
of vectors r. When r = 1, C is either {0} (if a_1 = 0) or a halfline, and is
therefore closed. Suppose, for some r ≥ 1, all cones of the form
$$\Big\{ x \ \Big|\ x = \sum_{j=1}^{r} \mu_j a_j,\ \mu_j \ge 0 \Big\}$$
are closed. Then, we will show that a cone of the form
$$C_{r+1} = \Big\{ x \ \Big|\ x = \sum_{j=1}^{r+1} \mu_j a_j,\ \mu_j \ge 0 \Big\}$$
is also closed. Without loss of generality, assume that ‖a_j‖ = 1 for all j.
There are two cases: (i) The vectors −a1 , . . . , −ar+1 belong to Cr+1 , in
which case Cr+1 is the subspace spanned by a1 , . . . , ar+1 and is therefore
closed, and (ii) The negative of one of the vectors, say −ar+1 , does not
belong to C_{r+1}. In this case, consider the cone
$$C_r = \Big\{ x \ \Big|\ x = \sum_{j=1}^{r} \mu_j a_j,\ \mu_j \ge 0 \Big\},$$
which is closed by the induction hypothesis. Let
$$m = \min_{x \in C_r,\ \|x\| = 1} a_{r+1}' x.$$

Since the set {x ∈ C_r | ‖x‖ = 1} is nonempty and compact, the minimum
above is attained at some x* by Weierstrass' Theorem. We have, using the
Schwarz inequality,
$$m = a_{r+1}' x^* \ge -\|a_{r+1}\| \cdot \|x^*\| = -1,$$

with equality if and only if x∗ = −ar+1 . It follows that

m > −1,

since otherwise we would have x* = −a_{r+1}, which violates the hypothesis
(−a_{r+1}) ∉ C_r. Let {x_k} be a convergent sequence in C_{r+1}. We will prove
that its limit belongs to Cr+1 , thereby showing that Cr+1 is closed. Indeed,

for all k, we have xk = ξk ar+1 + yk , where ξk ≥ 0 and yk ∈ Cr . Using the


fact kar+1 k = 1, we obtain

$$\|x_k\|^2 = \xi_k^2 + \|y_k\|^2 + 2 \xi_k a_{r+1}' y_k \ge \xi_k^2 + \|y_k\|^2 + 2 m \xi_k \|y_k\| = \big(\xi_k - \|y_k\|\big)^2 + 2(1 + m)\xi_k \|y_k\|.$$

Since {xk } converges, ξk ≥ 0, and 1 + m > 0, it follows that the sequences


{ξk } and {yk } are bounded and hence, they have limit points denoted by
ξ and y, respectively. The limit of {xk } is

$$\lim_{k \to \infty} (\xi_k a_{r+1} + y_k) = \xi a_{r+1} + y,$$

which belongs to Cr+1 , since ξ ≥ 0 and y ∈ Cr (by the closure hypothesis


on Cr ). We conclude that Cr+1 is closed, completing the proof.
We now give the alternative proof that C is closed, by showing that it
is polyhedral. The proof is constructive and uses induction on the number
of vectors r. To start the induction, we assume without loss of generality
that a1 = 0. Then, for r = 1, we have C = {0}, which is polyhedral, since
it can be expressed as

{x | u0i x ≤ 0, −u0i x ≤ 0, i = 1, . . . , n},

where ui is the ith unit coordinate vector.


Assume that for some r ≥ 2, the set
$$C_{r-1} = \Big\{ x \ \Big|\ x = \sum_{j=1}^{r-1} \mu_j a_j,\ \mu_j \ge 0 \Big\}$$
has a polyhedral representation
$$P_{r-1} = \big\{ x \mid b_j' x \le 0,\ j = 1, \ldots, m \big\}.$$

Let
$$\beta_j = a_r' b_j, \qquad j = 1, \ldots, m,$$
and define the index sets
$$J^- = \{ j \mid \beta_j < 0 \}, \qquad J^0 = \{ j \mid \beta_j = 0 \}, \qquad J^+ = \{ j \mid \beta_j > 0 \}.$$
Let also
$$b_{l,k} = b_l - \frac{\beta_l}{\beta_k}\, b_k, \qquad \forall\ l \in J^+,\ k \in J^-.$$

We will show that the set
$$C_r = \Big\{ x \ \Big|\ x = \sum_{j=1}^{r} \mu_j a_j,\ \mu_j \ge 0 \Big\}$$
has the polyhedral representation
$$P_r = \big\{ x \mid b_j' x \le 0,\ j \in J^- \cup J^0,\ \ b_{l,k}' x \le 0,\ l \in J^+,\ k \in J^- \big\},$$

thus completing the induction.


We have Cr ⊂ Pr because by construction, all the vectors a1 , . . . , ar
satisfy the inequalities defining Pr . To show the reverse inclusion, we fix a
vector x ∈ Pr and we verify that there exists µr ≥ 0 such that

x − µr ar ∈ Pr−1 ,

which is equivalent to
γ ≤ µr ≤ δ,
where
$$\gamma = \max\Big\{ 0,\ \max_{j \in J^+} \frac{b_j' x}{\beta_j} \Big\}, \qquad \delta = \min_{j \in J^-} \frac{b_j' x}{\beta_j}.$$
Since x ∈ P_r, we have
$$0 \le \frac{b_k' x}{\beta_k}, \qquad \forall\ k \in J^-, \tag{1.37}$$
and also b_{l,k}' x ≤ 0 for all l ∈ J^+, k ∈ J^-, or equivalently
$$\frac{b_l' x}{\beta_l} \le \frac{b_k' x}{\beta_k}, \qquad \forall\ l \in J^+,\ k \in J^-. \tag{1.38}$$

Equations (1.37) and (1.38) imply that γ ≤ δ, thereby completing the


proof.
(b) Define a_{r+i} = e_i and a_{r+m+i} = −e_i, i = 1, . . . , m. The result to be
shown translates to
$$x \in P^\perp \iff x \in C,$$
where
$$P = \{ y \mid y' a_j \le 0,\ j = 1, \ldots, r + 2m \}, \qquad C = \Big\{ x \ \Big|\ x = \sum_{j=1}^{r+2m} \mu_j a_j,\ \mu_j \ge 0 \Big\}.$$

Since by part (a), P = C^⊥ and C is closed and convex, we have by the
Polar Cone Theorem (Prop. 1.5.1) that P^⊥ = (C^⊥)^⊥ = C.
(c) We have already shown in the proof of part (a) that a finitely generated
cone is polyhedral. To show the reverse, we use part (a) and the Polar Cone
Theorem (Prop. 1.5.1) to conclude that the polar of any polyhedral cone
[cf. Eq. (1.36)] is finitely generated [cf. Eq. (1.35)]. The finitely generated
cone (1.35) has already been shown to be polyhedral, so its polar, which is
the “typical” polyhedral cone (1.36), is finitely generated. This completes
the proof. Q.E.D.
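
As a simple illustration of Farkas' Lemma, take n = 2 with no vectors e_i,
a_1 = (1, 0), a_2 = (0, 1), and x = (2, 3). The condition y'a_1 ≤ 0, y'a_2 ≤ 0 means
y ≤ 0 componentwise, and for such y we clearly have x'y = 2y_1 + 3y_2 ≤ 0; the
lemma then asserts, consistently with the obvious representation x = 2a_1 + 3a_2,
that x must be a nonnegative combination of a_1 and a_2.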

1.6.2 Polyhedral Sets

A nonempty subset of ℝ^n is said to be a polyhedral set (or polyhedron) if
it is of the form
$$P = \big\{ x \mid a_j' x \le b_j,\ j = 1, \ldots, r \big\},$$
where a_j are some vectors and b_j are some scalars.


The following is a fundamental result, showing that a polyhedral set
can be represented as the sum of the convex hull of a finite set of points
and a finitely generated cone.

Proposition 1.6.2: A set P is polyhedral if and only if there exist
a nonempty and finite set of vectors {v_1, . . . , v_m}, and a finitely
generated cone C such that
$$P = \Big\{ x \ \Big|\ x = y + \sum_{j=1}^{m} \mu_j v_j,\ y \in C,\ \sum_{j=1}^{m} \mu_j = 1,\ \mu_j \ge 0,\ j = 1, \ldots, m \Big\}.$$

Proof: Assume that P is polyhedral. Then, it has the form
$$P = \big\{ x \mid a_j' x \le b_j,\ j = 1, \ldots, r \big\},$$
for some vectors a_j and some scalars b_j. Consider the polyhedral cone of
ℝ^{n+1}
$$\hat{P} = \big\{ (x, w) \mid 0 \le w,\ a_j' x \le b_j w,\ j = 1, \ldots, r \big\}$$
and note that
$$P = \big\{ x \mid (x, 1) \in \hat{P} \big\}.$$

By the Minkowski–Weyl Theorem [Prop. 1.6.1(c)], P̂ is finitely generated,
so it has the form
$$\hat{P} = \Big\{ (x, w) \ \Big|\ x = \sum_{j=1}^{m} \mu_j v_j,\ w = \sum_{j=1}^{m} \mu_j d_j,\ \mu_j \ge 0,\ j = 1, \ldots, m \Big\},$$

for some vectors vj and scalars dj . Since w ≥ 0 for all vectors (x, w) ∈ P̂ ,
we see that dj ≥ 0 for all j. Let

J + = {j | dj > 0}, J 0 = {j | dj = 0}.

By replacing μ_j by μ_j/d_j for all j ∈ J^+, we obtain the equivalent description
$$\hat{P} = \Big\{ (x, w) \ \Big|\ x = \sum_{j=1}^{m} \mu_j v_j,\ w = \sum_{j \in J^+} \mu_j,\ \mu_j \ge 0,\ j = 1, \ldots, m \Big\}.$$
Since P = {x | (x, 1) ∈ P̂}, we obtain
$$P = \Big\{ x \ \Big|\ x = \sum_{j \in J^+} \mu_j v_j + \sum_{j \in J^0} \mu_j v_j,\ \sum_{j \in J^+} \mu_j = 1,\ \mu_j \ge 0,\ j = 1, \ldots, m \Big\}.$$

Thus, P is the vector sum of the convex hull of the vectors v_j, j ∈ J^+, plus
the finitely generated cone
$$\Big\{ \sum_{j \in J^0} \mu_j v_j \ \Big|\ \mu_j \ge 0,\ j \in J^0 \Big\}.$$

To prove that the vector sum of the convex hull of a finite set of
points with a finitely generated cone is a polyhedral set, we use a reverse
argument; we pass to a finitely generated cone description, we use the
Minkowski – Weyl Theorem to assert that this cone is polyhedral, and we
finally construct a polyhedral set description. The details are left as an
exercise for the reader. Q.E.D.
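
For example, the polyhedron P = {x ∈ ℝ^2 | x_1 ≥ 0, x_2 ≥ 0, x_1 + x_2 ≥ 1}
can be written in the form of Prop. 1.6.2 with v_1 = (1, 0), v_2 = (0, 1), and C
equal to the nonnegative orthant: every x ∈ P is a convex combination of v_1 and
v_2 [namely (x_1, x_2)/(x_1 + x_2)] plus the nonnegative vector x − x/(x_1 + x_2),
and conversely every such sum has nonnegative coordinates adding to at least 1.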

1.6.3 Extreme Points

A vector x is said to be an extreme point of a convex set C if x belongs


to C and there do not exist vectors y ∈ C and z ∈ C, with y 6= x and
z 6= x, and a scalar α ∈ (0, 1) such that x = αy + (1 − α)z. An equivalent
definition is that x cannot be expressed as a convex combination of some
vectors of C, all of which are different from x.

The following proposition provides some intuition into the nature of


extreme points.

Proposition 1.6.3: Let C be a nonempty, closed, convex set in <n .


(a) If H is a hyperplane that passes through a boundary point of C
and contains C in one of its halfspaces, then every extreme point
of T = C ∩ H is also an extreme point of C.
(b) C has at least one extreme point if and only if it does not contain
a line, i.e., a set L of the form L = {x + αd | α ∈ <} with d 6= 0.

Proof: (a) Let x be an element of T which is not an extreme point of C.


Then we have x = αy + (1 − α)z for some α ∈ (0, 1), and some y ∈ C and
z ∈ C, with y 6= x and z 6= x. Since x ∈ H, x is a boundary point of C,
and the halfspace containing C is of the form {x | a0 x ≥ a0 x}, where a 6= 0.
Then a0 y ≥ a0 x and a0 z ≥ a0 x, which in view of x = αy + (1 − α)z, implies
that a0 y = a0 x and a0 z = a0 x. Therefore, y ∈ T and z ∈ T , showing that x
cannot be an extreme point of T .
(b) Assume that C has an extreme point x̄ and contains a line L = {x + αd |
α ∈ ℝ}, where d ≠ 0. We will arrive at a contradiction. For each integer
n > 0, the vector
$$x_n = \Big(1 - \frac{1}{n}\Big) \bar{x} + \frac{1}{n} (x + n d) = \bar{x} + d + \frac{1}{n} (x - \bar{x})$$
lies in the line segment connecting x̄ and x + nd, so it belongs to C. Since
C is closed, x̄ + d = lim_{n→∞} x_n must also belong to C. Similarly, we show
that x̄ − d must belong to C. Thus x̄ − d, x̄, and x̄ + d all belong to C,
contradicting the hypothesis that x̄ is an extreme point.
Conversely, we use induction on the dimension of the space to show
that if C does not contain a line, it must have an extreme point. This is
true in the real line <1 , so assume it is true in <n−1 . If a nonempty, closed,
convex subset C of <n contains no line, it must have some boundary point
x. Take any hyperplane H passing through x and containing C in one of
its halfspaces. Then, since H is an (n − 1)-dimensional affine set, the set
C ∩ H lies in an (n − 1)-dimensional space and contains no line, so by the
induction hypothesis, it must have an extreme point. By part (a), this
extreme point must also be an extreme point of C. Q.E.D.
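
For instance, the unit square {x ∈ ℝ^2 | 0 ≤ x_1 ≤ 1, 0 ≤ x_2 ≤ 1} contains no
line and its extreme points are its four corners, while the closed halfplane
{x ∈ ℝ^2 | x_1 ≥ 0} contains the line {(0, α) | α ∈ ℝ} and, consistently with
part (b) of the proposition, has no extreme points.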

An important fact that forms the basis for the simplex method of
linear programming, is that if a linear function f attains a minimum over
a polyhedral set C having at least one extreme point, then f attains a
minimum at some extreme point of C (as well as possibly at some other

nonextreme points). We will come to this fact after considering the more
general case where f is concave and C is closed and convex.
We say that a set C ⊂ <n is bounded from below if there exists a
vector b ∈ <n such that x ≥ b for all x ∈ C.

Proposition 1.6.4: Let C be a closed convex set which is bounded


from below and let f : C 7→ < be a concave function. Then if f attains
a minimum over C, it attains a minimum at some extreme point of C.

Proof: We first show that f attains a minimum at some boundary point


of C. Let x∗ be a vector where f attains a minimum over C. If x∗ is a
boundary point we are done, so assume that x∗ is an interior point of C.
Let
L = {x | x = x∗ + λd, λ ∈ <}
be a line passing through x∗ , where d is a vector with strictly positive coor-
dinates. Then, using the boundedness from below, convexity, and closure
of C, we see that the set C ∩ L contains a set of the form

{x∗ + λd | λ1 ≤ λ ≤ λ2 }

for some λ_2 > 0 and some λ_1 < 0 for which the vector
$$\bar{x} = x^* + \lambda_1 d$$
is a boundary point of C. If f(x̄) > f(x*), we have by concavity of f,
$$f(x^*) \ge \frac{\lambda_2}{\lambda_2 - \lambda_1}\, f(\bar{x}) + \Big(1 - \frac{\lambda_2}{\lambda_2 - \lambda_1}\Big) f(x^* + \lambda_2 d) > \frac{\lambda_2}{\lambda_2 - \lambda_1}\, f(x^*) + \Big(1 - \frac{\lambda_2}{\lambda_2 - \lambda_1}\Big) f(x^* + \lambda_2 d).$$
It follows that f(x*) > f(x* + λ_2 d). This contradicts the optimality of x*,
proving that f(x̄) = f(x*).
We have shown that the minimum of f is attained at some boundary
point x̄ of C. If x̄ is an extreme point of C, we are done. If it is not an
extreme point, consider a hyperplane H passing through x̄ and containing
C in one of its halfspaces. The intersection T_1 = C ∩ H is closed, convex,
bounded from below, and lies in an affine set M_1 of dimension n − 1.
Furthermore, f attains its minimum over T_1 at x̄. Thus, by the preceding
argument, it also attains its minimum at some boundary point x_1 of T_1.
If x1 is an extreme point of T1 , then by Prop. 1.6.3, it is also an extreme
point of C and the result follows. If x1 is not an extreme point of T1 , then
we view M1 as a space of dimension n − 1 and we form T2 , the intersection

of T1 with a hyperplane in M1 that passes through x1 and contains T1


in one of its halfspaces. This hyperplane will be of dimension n − 2. We
can continue this process for at most n times, when a set Tn consisting of
a single point is obtained. This point is an extreme point of Tn and, by
repeated application of Prop. 1.6.3, an extreme point of C. Q.E.D.
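
As a simple illustration, the concave function f(x) = −x_1^2 − x_2^2 attains its
minimum over the unit square C = {x | 0 ≤ x_1 ≤ 1, 0 ≤ x_2 ≤ 1} at the extreme
point (1, 1) (with value −2), and not at any interior or non-corner boundary
point, in agreement with the proposition.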

As a corollary we have the following:

Proposition 1.6.5: Let C be a closed convex set and let f : C 7→ <


be a concave function. Assume that for some invertible n × n matrix
A and some b ∈ <n we have

Ax ≥ b, ∀ x ∈ C.

Then if f attains a minimum over C, it attains a minimum at some


extreme point of C.

Proof: Consider the transformation x = A^{-1} y and the problem of minimizing
$$h(y) = f(A^{-1} y)$$
over Y = {y | A^{-1} y ∈ C}. The function h is concave over the closed
convex set Y . Furthermore, y ≥ b for all y ∈ Y and hence Y is bounded
from below. By Prop. 1.6.4, h attains a minimum at some extreme point
y ∗ of Y . Then f attains its minimum over C at x∗ = A−1 y ∗ , while x∗ is an
extreme point of C, since it can be verified that invertible transformations
of sets map extreme points to extreme points. Q.E.D.

1.6.4 Extreme Points and Linear Programming

We now consider a polyhedral set P and we characterize the set of its


extreme points (also called vertices). By Prop. 1.6.2, P can be represented
as
$$P = C + \hat{P},$$
where C is a finitely generated cone and P̂ is the convex hull of some vectors
v_1, . . . , v_m:
$$\hat{P} = \Big\{ x \ \Big|\ x = \sum_{j=1}^{m} \mu_j v_j,\ \sum_{j=1}^{m} \mu_j = 1,\ \mu_j \ge 0,\ j = 1, \ldots, m \Big\}.$$

We note that an extreme point x of P cannot be of the form x = c + x̂,


where c 6= 0, c ∈ C, and x̂ ∈ P̂ , since in this case x would be the midpoint

of the line segment connecting the distinct vectors x̂ and 2c + x̂. Therefore,
an extreme point of P must belong to P̂ , and since P̂ ⊂ P , it must also be
an extreme point of P̂ . An extreme point of P̂ must be one of the vectors
v1 , . . . , vm , since otherwise this point would be expressible as a convex
combination of v1 , . . . , vm . Thus the set of extreme points of P is either
empty or finite. Using Prop. 1.6.3(b), it follows that the set of extreme
points of P is nonempty and finite if and only if P contains no line.
If P is bounded, then we must have P = P̂ , and it can be shown
that P is equal to the convex hull of its extreme points (not just the con-
vex hull of the vectors v1 , . . . , vm ), as shown in the following proposition.
The proposition also gives another and more specific characterization of
extreme points of polyhedral sets, which is central in the theory of linear
programming.

Proposition 1.6.6: Let P be a polyhedral set in <n .


(a) If P has the form
$$P = \big\{ x \mid a_j' x \le b_j,\ j = 1, \ldots, r \big\},$$
where a_j and b_j are given vectors and scalars, respectively, then
a vector v ∈ P is an extreme point of P if and only if the set
$$A_v = \big\{ a_j \mid a_j' v = b_j,\ j = 1, \ldots, r \big\}$$
contains n linearly independent vectors.


(b) If P has the form

P = {x | Ax = b, x ≥ 0},

where A is a given m × n matrix and b is a given vector, then a


vector v ∈ P is an extreme point of P if and only if the columns
of A corresponding to the nonzero coordinates of v are linearly
independent.
(c) If P has the form
$$P = \Big\{ x \ \Big|\ x = \sum_{j=1}^{m} \mu_j v_j,\ \sum_{j=1}^{m} \mu_j = 1,\ \mu_j \ge 0,\ j = 1, \ldots, m \Big\} \tag{1.39}$$
where vj are given vectors, then P is the convex hull of its ex-
treme points.

(d) (Fundamental Theorem of Linear Programming) Assume that P


has at least one extreme point. Then if a linear function attains
a minimum over P , it attains a minimum at some extreme point
of P .

Proof: (a) If the set Av contains fewer than n linearly independent vectors,
then the system of equations

a0j w = 0, ∀ a j ∈ Av

has a nonzero solution w. For sufficiently small γ > 0, we have v + γw ∈ P


and v − γw ∈ P , thus showing that v is not an extreme point. Thus, if v
is an extreme point, Av must contain n linearly independent vectors.
Conversely, suppose that Av contains a subset Āv consisting of n
linearly independent vectors. Suppose that for some y ∈ P , z ∈ P , and
α ∈ (0, 1), we have v = αy + (1 − α)z. Then for all aj ∈ Āv , we have

bj = a0j v = αa0j y + (1 − α)a0j z ≤ αbj + (1 − α)bj = bj .

Thus v, y, and z are all solutions of the system of n linearly independent


equations
a0j w = bj , ∀ aj ∈ Āv .

Hence v = y = z, implying that v is an extreme point.


(b) Let k be the number of zero coordinates of v, and consider the matrix
Ā, which is the same as A except that the columns corresponding to the
zero coordinates of v are set to zero. We write P in the form

P = {x | Ax ≤ b, −Ax ≤ −b, −x ≤ 0},

and apply the result of part (a). We obtain that v is an extreme point if and
only if Ā contains n − k linearly independent rows, which is equivalent to
the n − k nonzero columns of Ā (corresponding to the nonzero coordinates
of v) being linearly independent.
(c) We use induction on the dimension of the space. Suppose that all
bounded polyhedra of (n − 1)-dimensional spaces have a representation of
the form (1.39), but there is a bounded polyhedron P ⊂ ℝ^n and a vector
x ∈ P, which is not in the convex hull P_E of the extreme points of P. Let
x̂ be the projection of x on P_E and let x̄ be a solution of the problem
$$\text{maximize} \ \ (x - \hat{x})' z \qquad \text{subject to} \ \ z \in P.$$
The polyhedron
$$\hat{P} = P \cap \big\{ z \mid (x - \hat{x})' z = (x - \hat{x})' \bar{x} \big\}$$
is equal to the convex hull of its extreme points by the induction hypothesis.
Show that PE ∩ P̂ = Ø, while, by Prop. 1.6.3(a), each of the extreme points
of P̂ is also an extreme point of P , arriving at a contradiction.
(d) Since P is polyhedral, it has a representation

P = {x | Ax ≥ b},

for some m × n matrix A and some b ∈ <m . If A had rank less than n, then
its nullspace would contain some nonzero vector x, so P would contain
a line parallel to x, contradicting the existence of an extreme point [cf.
Prop. 1.6.3(b)]. Thus A has rank n and hence it must contain n linearly
independent rows that constitute an n × n invertible submatrix Â. If b̂ is
the corresponding subvector of b, we see that every x ∈ P satisfies Âx ≥ b̂.
The result then follows using Prop. 1.6.5. Q.E.D.
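
As a simple illustration of part (a), consider P = {x ∈ ℝ^2 | x_1 ≥ 0, x_2 ≥ 0,
x_1 + x_2 ≤ 1}. At v = (0, 0) the constraints −x_1 ≤ 0 and −x_2 ≤ 0 are active and
their normals are linearly independent, so (0, 0) is an extreme point, while at
v = (1/2, 0) only −x_2 ≤ 0 is active, so v is not an extreme point. A linear cost
such as f(x) = x_1 + x_2 attains its minimum over P at the vertex (0, 0), as
asserted by part (d).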

EXERCISES

1.6.1

Let C1 and C2 be polyhedral sets. Show that C1 + C2 is polyhedral. Show also


that if C1 and C2 are polyhedral cones, then C1 + C2 is a polyhedral cone.

1.6.2

Show that the image and the inverse image of a polyhedral set under a linear
transformation are polyhedral.

1.6.3 (Gordan’s Theorem of the Alternative)

(a) Let A be an m×n matrix. Then exactly one of the following two conditions
holds:
(i) There is an x ∈ <n such that Ax < 0 (all components of Ax are
strictly negative).
(ii) There is a µ ∈ <m such that A0 µ = 0, µ 6= 0, and µ ≥ 0.

(b) Show that an alternative and equivalent statement of part (a) is the follow-
ing: a polyhedral cone has nonempty interior if and only if its polar cone
does not contain a line, i.e., a set of the form {αz | α ∈ <}, where z is a
nonzero vector.

1.6.4

Let P be a polyhedral set represented as
$$P = \Big\{ x \ \Big|\ x = y + \sum_{j=1}^{m} \mu_j v_j,\ y \in C,\ \sum_{j=1}^{m} \mu_j = 1,\ \mu_j \ge 0,\ j = 1, \ldots, m \Big\},$$

where v1 , . . . , vm are some vectors and C is a finitely generated cone (cf. Prop.
1.6.2). Show that the recession cone of P is equal to C.

1.7 SUBGRADIENTS

1.7.1 Directional Derivatives

Convex functions have interesting differentiability properties, which we dis-


cuss in this section. We first consider convex functions of a single variable.
Let I be an interval of real numbers, and let f : I 7→ < be convex. If
x, y, z ∈ I and x < y < z, then we can show the relation

$$\frac{f(y) - f(x)}{y - x} \le \frac{f(z) - f(x)}{z - x} \le \frac{f(z) - f(y)}{z - y}, \tag{1.40}$$

which is illustrated in Fig. 1.7.1. For a formal proof, note that, using the
definition of a convex function [cf. Eq. (1.3)], we obtain
$$f(y) \le \Big(\frac{y - x}{z - x}\Big) f(z) + \Big(\frac{z - y}{z - x}\Big) f(x)$$

and either of the desired inequalities follows by appropriately rearranging


terms.
Let a and b be the infimum and the supremum, respectively, of I, also
referred to as the end points of I. For any x ∈ I, x 6= b, and for any α > 0
such that x + α ∈ I, we define

$$s^+(x, \alpha) = \frac{f(x + \alpha) - f(x)}{\alpha}.$$

Figure 1.7.1. Illustration of the inequalities (1.40), comparing the slopes
(f(y) − f(x))/(y − x), (f(z) − f(x))/(z − x), and (f(z) − f(y))/(z − y) for
x < y < z. The rate of change of the function f is nondecreasing with its argument.

Let 0 < α ≤ α0 . We use the first inequality in Eq. (1.40) with y = x + α


and z = x + α0 to obtain s+ (x, α) ≤ s+ (x, α0 ). Therefore, s+ (x, α) is a
nondecreasing function of α and, as α decreases to zero, it converges either
to a finite number or to −∞. Let f + (x) be the value of the limit, which
we call the right derivative of f at the point x. Similarly, if x ∈ I, x 6= a,
α > 0, and x − α ∈ I, we define
f (x) − f (x − α)
s− (x, α) = ,
α
which is, by a symmetrical argument, a nonincreasing function of α. Its
limit as α decreases to zero, denoted by f − (x), is called the left derivative
of f at the point x, and is either finite or equal to ∞.
In the case where the end points a and b belong to the domain I of
f , we define for completeness f − (a) = −∞ and f + (b) = ∞.
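
For example, for f(x) = |x| on I = ℝ and x = 0 we have s^+(0, α) = α/α = 1 and
s^-(0, α) = (0 − α)/α = −1 for every α > 0, so that f^+(0) = 1 and f^-(0) = −1;
in particular f^-(0) ≤ f^+(0), as part (a) of the following proposition asserts.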

Proposition 1.7.1: Let I ⊂ < be a convex interval and let f : I 7→ <


be a convex function. Let a and b be the end points of I.
(a) We have f − (y) ≤ f + (y) for every y ∈ I.
(b) If x belongs to the interior of I, then f + (x) and f − (x) are finite.
(c) If x, z ∈ I and x < z, then f + (x) ≤ f − (z).
(d) The functions f − , f + : I 7→ [−∞, +∞] are nondecreasing.
(e) The function f + (respectively, f − ) is right– (respectively, left–)
continuous at every interior point of I. Also, if a ∈ I (respec-
tively, b ∈ I) and f is continuous at a (respectively, b), then f +
(respectively, f − ) is right– (respectively, left–) continuous at a
(respectively, b).

(f) If f is differentiable at a point x belonging to the interior of I,


then f + (x) = f − (x) = (df/dx)(x).
(g) For any x, z ∈ I and any d satisfying f − (x) ≤ d ≤ f + (x), we
have
f (z) ≥ f (x) + d(z − x).

(h) The function f + : I 7→ (−∞, ∞] [respectively, f − : I 7→ [−∞, ∞)]


is upper (respectively, lower) semicontinuous at every x ∈ I.

Proof: (a) If y is an end point of I, the result is trivial because f − (a) =


−∞ and f + (b) = ∞. We assume that y is an interior point, we let α > 0,
and use Eq. (1.40), with x = y − α and z = y + α, to obtain s− (y, α) ≤
s+ (y, α). Taking the limit as α decreases to zero, we obtain f − (y) ≤ f + (y).
(b) Let x belong to the interior of I and let α > 0 be such that x − α ∈ I.
Then f − (x) ≥ s−(x, α) > −∞. For similar reasons, we obtain f + (x) < ∞.
Part (a) then implies that f −(x) < ∞ and f + (x) > −∞.
(c) We use Eq. (1.40), with y = (z + x)/2, to obtain s^+(x, (z − x)/2) ≤
s^-(z, (z − x)/2). The result then follows because f^+(x) ≤ s^+(x, (z − x)/2)
and s^-(z, (z − x)/2) ≤ f^-(z).
(d) This follows by combining parts (a) and (c).
(e) Fix some x ∈ I, x 6= b, and some positive δ and α such that x+δ+α < b.
We allow x to be equal to a, in which case f is assumed to be continuous
at a. We have f + (x + δ) ≤ s+ (x + δ, α). We take the limit, as δ decreases
to zero, to obtain limδ↓0 f + (x + δ) ≤ s+ (x, α). We have used here the fact
that s+ (x, α) is a continuous function of x, which is a consequence of the
continuity of f (Prop. 1.2.12). We now let α decrease to zero to obtain
limδ↓0 f + (x + δ) ≤ f + (x). The reverse inequality is also true because f +
is nondecreasing and this proves the right–continuity of f + . The proof for
f − is similar.
(f) This is immediate from the definition of f + and f − .
(g) Fix some x, z ∈ I. The result is trivially true for x = z. We only
consider the case x < z; the proof for the case x > z is similar. Since
s^+(x, α) is nondecreasing in α, we have (f(z) − f(x))/(z − x) ≥ s^+(x, α)
for α belonging to (0, z − x). Letting α decrease to zero, we obtain
(f(z) − f(x))/(z − x) ≥ f^+(x) ≥ d and the result follows.
(h) This follows from parts (a), (d), (e), and the definition of semicontinuity
(Definition 1.1.5). Q.E.D.

Given a function f : <n 7→ <, a point x ∈ <n , and a vector y ∈ <n ,



we say that f is directionally differentiable at x in the direction y if the


limit
$$\lim_{\alpha \downarrow 0} \frac{f(x + \alpha y) - f(x)}{\alpha}$$
exists, in which case it is denoted by f 0 (x; y). We say that f is directionally
differentiable at a point x if it is directionally differentiable at x in all
directions. As in Section 1.1.4, we say that f is differentiable at x if it is
directionally differentiable at x and f 0 (x; y) is a linear function of y denoted
by
f 0 (x; y) = ∇f (x)0 y
where ∇f (x) is the gradient of f at x.
An interesting property of convex functions is that they are direction-
ally differentiable at all points x. This is a consequence of the directional
differentiability of scalar convex functions, as can be seen from the relation

$$f'(x; y) = \lim_{\alpha \downarrow 0} \frac{f(x + \alpha y) - f(x)}{\alpha} = \lim_{\alpha \downarrow 0} \frac{F_y(\alpha) - F_y(0)}{\alpha} = F_y^+(0), \tag{1.41}$$
where F_y^+(0) is the right derivative of the convex scalar function
$$F_y(\alpha) = f(x + \alpha y)$$

at α = 0. Note that the above calculation also shows that the left derivative
Fy− (0) of Fy is equal to −f 0 (x; −y) and, by using Prop. 1.7.1(a), we obtain
Fy− (0) ≤ Fy+ (0), or equivalently,

−f 0 (x; −y) ≤ f 0 (x; y), ∀ y ∈ <n . (1.42)


Note also that for a convex function, the difference quotient (f(x + αy) −
f(x))/α is a monotonically nondecreasing function of α, so an equivalent
definition of the directional derivative is
$$f'(x; y) = \inf_{\alpha > 0} \frac{f(x + \alpha y) - f(x)}{\alpha}.$$
The directional derivative can be used to provide a necessary and
sufficient condition for optimality in the problem of minimizing a convex
function f : <n 7→ < over a convex set X ⊂ <n . In particular, x∗ is a
global minimum of f over X if and only if

f 0 (x∗ ; x − x∗ ) ≥ 0, ∀ x ∈ X.

This follows from the definition (1.41) of directional derivative, and from
the fact that the difference quotient
¡ ¢
f x∗ + α(x − x∗ ) − f(x∗ )
α

is a monotonically nondecreasing function of α.
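
As a one-dimensional illustration, for f(x) = |x| and X = ℝ, the point x* = 0
is a global minimum, and indeed f'(0; y) = |y| ≥ 0 for every direction y; by
contrast, at x = 1 we have f'(1; −1) = −1 < 0, consistent with 1 not being a
minimum.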


The following proposition generalizes the upper semicontinuity prop-
erty of right derivatives of scalar convex functions [Prop. 1.7.1(h)], and
shows that if f is differentiable, then its gradient is continuous.

Proposition 1.7.2: Let f : <n 7→ < be convex, and let {fk } be a


sequence of convex functions fk : <n 7→ < with the property that
limk→∞ fk (xk ) = f(x) for every x ∈ <n and every sequence {xk } that
converges to x. Then for any x ∈ <n and y ∈ <n , and any sequences
{xk } and {yk } converging to x and y, respectively, we have

$$\limsup_{k \to \infty} f_k'(x_k; y_k) \le f'(x; y). \tag{1.43}$$

Furthermore, if f is differentiable over <n , then it is continuously


differentiable over <n .

Proof: For any μ > f'(x; y), there exists an ᾱ > 0 such that
$$\frac{f(x + \alpha y) - f(x)}{\alpha} < \mu, \qquad \forall\ \alpha \le \bar{\alpha}.$$
Hence, for α ≤ ᾱ, we have
$$\frac{f_k(x_k + \alpha y_k) - f_k(x_k)}{\alpha} < \mu$$
for all sufficiently large k, and using Eq. (1.41), we obtain

$$\limsup_{k \to \infty} f_k'(x_k; y_k) < \mu.$$

Since this is true for all µ > f 0 (x; y), inequality (1.43) follows.
If f is differentiable at all x ∈ <n , then using the continuity of f and
the part of the proposition just proved, we have for every sequence {xk }
converging to x and every y ∈ <n ,

$$\limsup_{k \to \infty} \nabla f(x_k)' y = \limsup_{k \to \infty} f'(x_k; y) \le f'(x; y) = \nabla f(x)' y.$$

By replacing y by −y in the preceding argument, we obtain


$$-\liminf_{k \to \infty} \nabla f(x_k)' y = \limsup_{k \to \infty} \big(-\nabla f(x_k)' y\big) \le -\nabla f(x)' y.$$

Therefore, we have ∇f (xk )0 y → ∇f (x)0 y for every y, which implies that


∇f (xk ) → ∇f (x). Hence, the gradient is continuous. Q.E.D.

1.7.2 Subgradients and Subdifferentials

Given a convex function f : <n 7→ <, we say that a vector d ∈ <n is a


subgradient of f at a point x ∈ <n if

f (z) ≥ f (x) + (z − x)0 d, ∀ z ∈ <n . (1.44)

If instead f is a concave function, we say that d is a subgradient of f at


x if −d is a subgradient of the convex function −f at x. The set of all
subgradients of a convex (or concave) function f at x ∈ <n is called the
subdifferential of f at x, and is denoted by ∂f (x). Figure 1.7.2 provides
some examples of subdifferentials.

Figure 1.7.2. The subdifferential of some scalar convex functions as a function
of the argument x; the functions shown are f(x) = |x| and f(x) = max{0, (1/2)(x^2 − 1)}.
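
For another example, in ℝ^n the Euclidean norm f(x) = ‖x‖ has ∂f(0) =
{d | ‖d‖ ≤ 1}: the subgradient inequality ‖z‖ ≥ z'd for all z holds whenever
‖d‖ ≤ 1 by the Schwarz inequality, while taking z = d shows it fails if ‖d‖ > 1.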

We next provide the relationship between the directional derivative


and the subdifferential, and prove some basic properties of subgradients.

Proposition 1.7.3: Let f : <n 7→ < be convex. For every x ∈ <n ,


the following hold:
(a) A vector d is a subgradient of f at x if and only if

f 0 (x; y) ≥ y 0 d, ∀ y ∈ <n .

(b) The subdifferential ∂f (x) is a nonempty, convex, and compact


set, and there holds

$$f'(x; y) = \max_{d \in \partial f(x)} y' d, \qquad \forall\ y \in \Re^n. \tag{1.45}$$

In particular, f is differentiable at x with gradient ∇f (x), if and


only if it has ∇f (x) as its unique subgradient at x. Furthermore,
if X is a bounded set, the set ∪x∈X ∂f (x) is bounded.
(c) If a sequence {xk } converges to x and dk ∈ ∂f (xk ) for all k,
the sequence {dk } is bounded and each of its limit points is a
subgradient of f at x.
(d) If f is equal to the sum f1 + · · · + fm of convex functions fj :
<n 7→ <, j = 1, . . . , m, then ∂f(x) is equal to the vector sum
∂f1 (x) + · · · + ∂fm (x).
(e) If f is equal to the composition of a convex function h : ℝ^m ↦ ℝ
and an m × n matrix A [f(x) = h(Ax)], then ∂f(x) is equal to
A'∂h(Ax) = {A'g | g ∈ ∂h(Ax)}.
(f) x∗ minimizes f over a convex set X ⊂ <n if and only if there
exists a subgradient d ∈ ∂f(x∗ ) such that

d0 (x − x∗ ) ≥ 0, ∀ x ∈ X.

Equivalently, x∗ minimizes f over a convex set X ⊂ <n if and


only if
0 ∈ ∂f (x∗ ) + TX (x∗ )⊥ .

Proof: (a) The subgradient inequality (1.44) is equivalent to
$$\frac{f(x + \alpha y) - f(x)}{\alpha} \ge y' d, \qquad \forall\ y \in \Re^n,\ \alpha > 0.$$
Since the quotient on the left above decreases monotonically to f'(x; y) as
α ↓ 0 [Eq. (1.40)], we conclude that the subgradient inequality (1.44) is
equivalent to f'(x; y) ≥ y'd for all y ∈ ℝ^n. Therefore we obtain
$$d \in \partial f(x) \iff f'(x; y) \ge y' d, \quad \forall\ y \in \Re^n. \tag{1.46}$$

(b) From Eq. (1.46), we see that ∂f(x) is the intersection of the closed
halfspaces {d | y'd ≤ f'(x; y)}, where y ranges over the nonzero vectors
of <n . It follows that ∂f (x) is closed and convex. It is also bounded,
since otherwise, for some y ∈ <n , y 0 d could be made unbounded by proper
choice of d ∈ ∂f (x), contradicting Eq. (1.46). Since ∂f (x) is both closed
and bounded, it is compact.
To show that ∂f (x) is nonempty and that Eq. (1.45) holds, we first
observe that Eq. (1.46) implies that f 0 (x; y) ≥ maxd∈∂f(x) y 0 d [where the
maximum is −∞ if ∂f(x) is empty]. To show the reverse inequality, take
any x and y in <n , and consider the subset of <n+1
$$C_1 = \big\{ (\mu, z) \mid \mu > f(z) \big\},$$
and the half-line
$$C_2 = \big\{ (\mu, z) \mid \mu = f(x) + \alpha f'(x; y),\ z = x + \alpha y,\ \alpha \ge 0 \big\};$$

see Fig. 1.7.3. Using the definition of directional derivative and the convex-
ity of f , it follows that these two sets are nonempty, convex, and disjoint.
By applying the Separating Hyperplane Theorem (Prop. 1.4.2), we see that
there exists a nonzero vector (γ, w) ∈ <n+1 such that
$$\gamma \mu + w' z \ge \gamma\big( f(x) + \alpha f'(x; y) \big) + w'(x + \alpha y), \qquad \forall\ \alpha \ge 0,\ z \in \Re^n,\ \mu > f(z). \tag{1.47}$$
We cannot have γ < 0 since then the left-hand side above could be made
arbitrarily small by choosing µ sufficiently large. Also if γ = 0, then Eq.
(1.47) implies that w = 0, which is a contradiction. Therefore, γ > 0 and
by dividing with γ in Eq. (1.47), we obtain

$$\mu + (z - x)'(w/\gamma) \ge f(x) + \alpha f'(x; y) + \alpha y'(w/\gamma), \qquad \forall\ \alpha \ge 0,\ z \in \Re^n,\ \mu > f(z). \tag{1.48}$$
By setting α = 0 in the above relation and by taking the limit as µ ↓ f (z),
we obtain f (z) ≥ f(x) + (z − x)0 (−w/γ) for all z ∈ <n , implying that
(−w/γ) ∈ ∂f (x). By setting z = x and α = 1 in Eq. (1.48), and by taking
the limit as µ ↓ f (x), we obtain y 0 (−w/γ) ≥ f 0 (x; y), which implies that
maxd∈∂f (x) y 0 d ≥ f 0 (x; y). The proof of Eq. (1.45) is complete.
From the definition of directional derivative, we see that f is differ-
entiable at x with gradient ∇f(x) if and only if the directional derivative
f 0 (x; y) is a linear function of the form f 0 (x; y) = ∇f (x)0 y. Thus, from Eq.
(1.45), f is differentiable at x with gradient ∇f(x), if and only if it has
∇f (x) as its unique subgradient at x.
Finally, let X be a bounded set. To show that ∪x∈X ∂f (x) is bounded,
we assume the contrary, i.e. that there exists a sequence {xk } ⊂ X , and a
sequence {dk } with dk ∈ ∂f (xk ) for all k and kdk k → ∞. Without loss of
generality, we assume that dk 6= 0 for all k, and we denote yk = dk /kdk k.

Figure 1.7.3. Illustration of the sets C_1 and C_2 used in the hyperplane separation
argument of the proof of Prop. 1.7.3(b).

Since both {xk } and {yk } are bounded, they must contain convergent sub-
sequences. We assume without loss of generality that xk converges to some
x and yk converges to some y with kyk = 1. By Eq. (1.45), we have

f 0 (xk ; yk ) ≥ d0k yk = kdk k,

so it follows that f 0 (xk ; yk ) → ∞. This contradicts, however, Eq. (1.43),


which requires that lim supk→∞ f 0 (xk ; yk ) ≤ f 0 (x; y).
(c) By part (b), the sequence {dk } is bounded, and by part (a), we have

y 0 dk ≤ f 0 (xk ; y), ∀ y ∈ <n .

If d is a limit point of {dk }, we have by taking limit in the above relation


and by using Prop. 1.7.2

y 0 d ≤ lim sup f 0 (xk ; y) ≤ f 0 (x; y), ∀ y ∈ <n .


k→∞

Therefore, by part (a), we have d ∈ ∂f (x).


(d) It will suffice to prove the result for the case where f = f1 + f2 . If
d1 ∈ ∂f1 (x) and d2 ∈ ∂f2 (x), then from the subgradient inequality (1.44),
we have
f1 (z) ≥ f1 (x) + (z − x)0 d1 , ∀ z ∈ <n ,
f2 (z) ≥ f2 (x) + (z − x)0 d2 , ∀ z ∈ <n ,
so by adding, we obtain

f (z) ≥ f (x) + (z − x)0 (d1 + d2 ), ∀ z ∈ <n .



Hence d1 + d2 ∈ ∂f (x), implying that ∂f1 (x) + ∂f2 (x) ⊂ ∂f (x).


To prove the reverse inclusion, suppose to come to a contradiction,
that there exists a d ∈ ∂f(x) such that d ∈/ ∂f1 (x) + ∂f2(x). Since by part
(b), the sets ∂f1 (x) and ∂f2 (x) are compact, the set ∂f1 (x) + ∂f2(x) is
compact (cf. Prop. 1.2.16), and by Prop. 1.4.3, there exists a hyperplane
strictly separating {d} from ∂f1 (x) + ∂f2 (x), i.e., a vector y and a scalar b
such that

y 0 (d1 + d2) < b < y 0 d, ∀ d1 ∈ ∂f1 (x), d2 ∈ ∂f2 (x).

From this we obtain

$$\max_{d_1 \in \partial f_1(x)} y' d_1 + \max_{d_2 \in \partial f_2(x)} y' d_2 < y' d,$$

or using part (b),


f10 (x; y) + f20 (x; y) < y 0 d.
By using the definition of directional derivative, f10 (x; y)+f20 (x; y) = f 0 (x; y),
so we have
f 0 (x; y) < y 0 d,
which is a contradiction in view of part (a).
(e) It is seen using the definition of directional derivative that

f 0 (x; y) = h0 (Ax; Ay), ∀ y ∈ <n .

Let g ∈ ∂h(Ax) and d = A0 g. Then by part (a), we have

g0 z ≤ h0 (Ax; z) ∀ z ∈ <m ,

and in particular,

g 0 Ay ≤ h0 (Ax; Ay) ∀ y ∈ <n ,

or
(A'g)'y ≤ f'(x; y), ∀ y ∈ ℝ^n.
Hence, by part (a), we have A0 g ∈ ∂f(x), so that A0 ∂h(Ax) ⊂ ∂f (x).
To prove the reverse inclusion, suppose to come to a contradiction,
that there exists a d ∈ ∂f(x) such that d ∉ A'∂h(Ax). Since by part (b),
the set ∂h(Ax) is compact, the set A'∂h(Ax) is also compact [cf. Prop.
1.1.9(d)], and by Prop. 1.4.3, there exists a hyperplane strictly separating
{d} from A0 ∂h(Ax), i.e., a vector y and a scalar b such that

y 0 (A0 g) < b < y0 d, ∀ g ∈ ∂h(Ax).



From this we obtain


$$\max_{g \in \partial h(Ax)} (Ay)' g < y' d,$$

or using part (b),


h0 (Ax; Ay) < y 0 d.
Since h0 (Ax; Ay) = f 0 (x; y), it follows that

f 0 (x; y) < y 0 d,

which is a contradiction in view of part (a).


(f) Suppose that for some d ∈ ∂f (x∗ ) and all x ∈ X, we have d0 (x−x∗ ) ≥ 0.
Then, since from the definition of a subgradient we have f (x) − f (x∗ ) ≥
d0 (x − x∗ ) for all x ∈ X, we obtain f (x) − f(x∗ ) ≥ 0 for all x ∈ X, so
x∗ minimizes f over X. Conversely, suppose that x∗ minimizes f over X.
Consider the set of feasible directions of X at x∗

FX (x∗ ) = {w 6= 0 | x∗ + αw ∈ X for some α > 0},

and the cone


$$W = -F_X(x^*)^\perp = \big\{ d \mid d' w \ge 0,\ \forall\ w \in F_X(x^*) \big\}.$$

If ∂f (x∗ ) and W have a point in common, we are done, so to arrive at a


contradiction, assume the opposite, i.e., ∂f (x∗ ) ∩ W = Ø. Since ∂f (x∗ ) is
compact and W is closed, by Prop. 1.4.3 there exists a hyperplane strictly
separating ∂f (x∗ ) and W , i.e., a vector y and a scalar c such that

g0 y < c < d0 y, ∀ g ∈ ∂f (x∗ ), ∀ d ∈ W.

Using the fact that W is a closed cone, it follows that

c < 0 ≤ d0 y, ∀ d ∈ W, (1.49)

which when combined with the preceding inequality, also yields

$$\max_{g \in \partial f(x^*)} g' y < c < 0.$$

Thus, using part (b), we have f 0 (x∗ ; y) < 0, while from Eq. (1.49), we see
that y belongs to the polar cone of FX (x∗ )⊥ , which by the Polar Cone
theorem (Prop. 1.5.1), implies that y is in the closure of the set of feasible
directions FX (x∗ ). Hence for a sequence yk of feasible directions converging
to y we have f 0 (x∗ ; yk ) < 0, which contradicts the optimality of x∗ .
The last statement follows from the convexity of X which implies that
TX (x∗ )⊥ is the set of all z such that z 0 (x − x∗ ) ≤ 0 for all x ∈ X (cf. Props.
1.5.3 and 1.5.5). Q.E.D.
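
As an illustration of parts (d) and (f), let f(x) = |x| + |x − 1| on ℝ. At x = 0
we have ∂|x| = [−1, 1] and ∂|x − 1| = {−1}, so ∂f(0) = [−2, 0]. Since 0 ∈ ∂f(0),
the point x = 0 minimizes f over ℝ, which agrees with the fact that f attains its
minimum value 1 on the entire interval [0, 1].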

Note that the last part of the above proposition generalizes the optimality
condition of Prop. 1.5.4 for the case where f is convex and smooth:
$$\nabla f(x^*)'(x - x^*) \ge 0, \qquad \forall\ x \in X.$$

In the special case where X = <n , we obtain a basic necessary and sufficient
condition for unconstrained optimality of x∗ :

0 ∈ ∂f (x∗ ).
This optimality condition is also evident from the subgradient inequality
(1.44).
We close this section with a version of the chain rule for directional
derivatives and subgradients.

Proposition 1.7.4: (Chain Rule) Let f : ℝ^n ↦ ℝ be a convex
function and let g : ℝ ↦ ℝ be a smooth scalar function. Then the
function F defined by
$$F(x) = g\big(f(x)\big)$$
is directionally differentiable at all x, and its directional derivative is
given by
$$F'(x; y) = \nabla g\big(f(x)\big) f'(x; y), \qquad \forall\ x \in \Re^n,\ y \in \Re^n. \tag{1.50}$$
Furthermore, if g is convex and monotonically nondecreasing, then F
is convex and its subdifferential is given by
$$\partial F(x) = \nabla g\big(f(x)\big)\, \partial f(x), \qquad \forall\ x \in \Re^n. \tag{1.51}$$

Proof: We have
$$F'(x; y) = \lim_{\alpha \downarrow 0} \frac{F(x + \alpha y) - F(x)}{\alpha} = \lim_{\alpha \downarrow 0} \frac{g\big(f(x + \alpha y)\big) - g\big(f(x)\big)}{\alpha}. \tag{1.52}$$
From the convexity of f it follows that there are three possibilities: (1)
For some ᾱ > 0, f(x + αy) = f(x) for all α ∈ (0, ᾱ], (2) For some ᾱ > 0,
f(x + αy) > f(x) for all α ∈ (0, ᾱ], (3) For some ᾱ > 0, f(x + αy) < f(x)
for all α ∈ (0, ᾱ].
In case (1), from Eq. (1.52), we have F'(x; y) = f'(x; y) = 0 and the
given formula (1.50) holds. In case (2), from Eq. (1.52), we have for all
α ∈ (0, ᾱ]
$$F'(x; y) = \lim_{\alpha \downarrow 0} \frac{f(x + \alpha y) - f(x)}{\alpha} \cdot \frac{g\big(f(x + \alpha y)\big) - g\big(f(x)\big)}{f(x + \alpha y) - f(x)}.$$

As α ↓ 0, we have f(x + αy) → f(x), so the preceding equation yields
F'(x; y) = ∇g(f(x)) f'(x; y). The proof for case (3) is similar.
Assume now that g is convex and monotonically nondecreasing. Then
F is convex since for any x_1, x_2 ∈ ℝ^n and α ∈ [0, 1], we have
$$\begin{aligned}
\alpha F(x_1) + (1 - \alpha) F(x_2) &= \alpha g\big(f(x_1)\big) + (1 - \alpha) g\big(f(x_2)\big) \\
&\ge g\big(\alpha f(x_1) + (1 - \alpha) f(x_2)\big) \\
&\ge g\Big(f\big(\alpha x_1 + (1 - \alpha) x_2\big)\Big) \\
&= F\big(\alpha x_1 + (1 - \alpha) x_2\big),
\end{aligned}$$

where for the first inequality we use the convexity of g, and for the second
inequality we use the convexity of f and the monotonicity of g.
To obtain the formula for the subdifferential of F , we note that by
Prop. 1.7.3(a), d ∈ ∂F (x) if and only if y 0 d ≤ F 0 (x; y) for all y ∈ <n , or
equivalently (from what has been already shown)
$$y' d \le \nabla g\big(f(x)\big) f'(x; y), \qquad \forall\ y \in \Re^n.$$
If ∇g(f(x)) = 0, this relation yields d = 0, so ∂F(x) = {0} and the desired
formula (1.51) holds. If ∇g(f(x)) ≠ 0, we have ∇g(f(x)) > 0 by the
monotonicity of g, so we obtain
$$y' \frac{d}{\nabla g\big(f(x)\big)} \le f'(x; y), \qquad \forall\ y \in \Re^n,$$
which, by Prop. 1.7.3(a), is equivalent to d/∇g(f(x)) ∈ ∂f(x). Thus we
have shown that d ∈ ∂F(x) if and only if d/∇g(f(x)) ∈ ∂f(x), which
proves the desired formula (1.51). Q.E.D.
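
As a simple illustration, take f(x) = |x| and g(t) = e^t, which is smooth, convex,
and monotonically nondecreasing. Then F(x) = e^{|x|} and formula (1.51) gives
∂F(0) = ∇g(f(0)) ∂f(0) = e^0 · [−1, 1] = [−1, 1], which can also be checked from
the subgradient inequality, since e^{|z|} ≥ 1 + |z| ≥ 1 + zd for every z whenever
|d| ≤ 1.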

1.7.3 ²-Subgradients

We now consider a notion of approximate subgradient. Given a convex


function f : <n 7→ < and a scalar ² > 0, we say that a vector d ∈ <n is an
²-subgradient of f at a point x ∈ <n if

f (z) ≥ f (x) + (z − x)0 d − ², ∀ z ∈ <n . (1.53)

If instead f is a concave function, we say that d is an ε-subgradient of f
at x if −d is an ε-subgradient of the convex function −f at x. The set of all
ε-subgradients of a convex (or concave) function f at x ∈ ℝ^n is called the
ε-subdifferential of f at x, and is denoted by ∂_ε f(x).
For our purposes, ²-subgradients are important because they find sev-
eral applications in nondifferentiable optimization. Some of their important
properties are given in the following proposition.
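
For example, for f(x) = x^2 on ℝ and the point x = 0, a scalar d is an
ε-subgradient if and only if z^2 ≥ zd − ε for all z, which holds exactly when
d^2 ≤ 4ε; thus ∂_ε f(0) = [−2√ε, 2√ε], a set that shrinks to the subdifferential
∂f(0) = {0} as ε ↓ 0.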

Proposition 1.7.5: Let f : ℝ^n ↦ ℝ be convex and ε be a positive
scalar. For every x ∈ ℝ^n, the following hold:
(a) The ε-subdifferential ∂_ε f(x) is a nonempty, convex, and compact
set, and for all y ∈ ℝ^n there holds
$$\inf_{\alpha > 0} \frac{f(x + \alpha y) - f(x) + \epsilon}{\alpha} = \max_{d \in \partial_\epsilon f(x)} y' d.$$
(b) We have 0 ∈ ∂_ε f(x) if and only if
$$f(x) \le \inf_{z \in \Re^n} f(z) + \epsilon.$$
(c) If a direction y is such that y'd < 0 for all d ∈ ∂_ε f(x), then
$$\inf_{\alpha > 0} f(x + \alpha y) < f(x) - \epsilon.$$
(d) If 0 ∉ ∂_ε f(x), then the direction y = −d̄, where
$$\bar{d} = \arg\min_{d \in \partial_\epsilon f(x)} \|d\|,$$
satisfies y'd < 0 for all d ∈ ∂_ε f(x).
(e) If f is equal to the sum f_1 + · · · + f_m of convex functions f_j :
ℝ^n ↦ ℝ, j = 1, . . . , m, then
$$\partial_\epsilon f(x) \subset \partial_\epsilon f_1(x) + \cdots + \partial_\epsilon f_m(x) \subset \partial_{m\epsilon} f(x).$$

Proof: (a) We have
$$d \in \partial_\epsilon f(x) \iff f(x + \alpha y) \ge f(x) + \alpha y' d - \epsilon, \quad \forall\ \alpha > 0,\ y \in \Re^n. \tag{1.54}$$
Hence
$$d \in \partial_\epsilon f(x) \iff \inf_{\alpha > 0} \frac{f(x + \alpha y) - f(x) + \epsilon}{\alpha} \ge y' d, \quad \forall\ y \in \Re^n. \tag{1.55}$$

It follows that ∂_ε f(x) is the intersection of the closed halfspaces
$$\Big\{ d \ \Big|\ \inf_{\alpha > 0} \frac{f(x + \alpha y) - f(x) + \epsilon}{\alpha} \ge y' d \Big\}$$

as y ranges over <n . Hence ∂² f (x) is closed and convex. To show that
∂² f (x) is also bounded, suppose to arrive at a contradiction that there is a
sequence {dk } ⊂ ∂² f (x) with ||dk || → ∞. Let yk = ||ddk || . Then, from Eq.
k
(1.54), we have for α = 1
f (x + yk ) ≥ f (x) + ||dk || − ², ∀ k,
so it follows that f(x + yk ) → ∞. This is a contradiction since f is convex
and hence continuous, so it is bounded on any bounded set. Thus ∂² f (x)
is bounded.
To show that ∂_ε f(x) is nonempty and satisfies
$$\inf_{\alpha > 0} \frac{f(x + \alpha y) - f(x) + \epsilon}{\alpha} = \max_{d \in \partial_\epsilon f(x)} y' d, \qquad \forall\ y \in \Re^n,$$
we argue similar to the proof of Prop. 1.7.3(b). Consider the subset of ℝ^{n+1}
$$C_1 = \big\{ (\mu, z) \mid \mu > f(z) \big\},$$
and the half-line
$$C_2 = \Big\{ (\mu, z) \ \Big|\ \mu = f(x) - \epsilon + \beta \inf_{\alpha > 0} \frac{f(x + \alpha y) - f(x) + \epsilon}{\alpha},\ z = x + \beta y,\ \beta \ge 0 \Big\}.$$
These sets are nonempty and convex. They are also disjoint, since we have
for all (μ, z) ∈ C_2
$$\mu = f(x) - \epsilon + \beta \inf_{\alpha > 0} \frac{f(x + \alpha y) - f(x) + \epsilon}{\alpha} \le f(x) - \epsilon + \beta\, \frac{f(x + \beta y) - f(x) + \epsilon}{\beta} = f(x + \beta y) = f(z).$$
Hence there exists a hyperplane separating them, i.e., a (γ, w) 6= (0, 0) such
that for all β ≥ 0, z ∈ <n , and µ > f (z),
$$\gamma \mu + w' z \le \gamma \Big( f(x) - \epsilon + \beta \inf_{\alpha > 0} \frac{f(x + \alpha y) - f(x) + \epsilon}{\alpha} \Big) + w'(x + \beta y). \tag{1.56}$$
We cannot have γ > 0, since then the left-hand side above could be made
arbitrarily large by choosing µ to be sufficiently large. Also, if γ = 0,
Eq. (1.56) implies that w = 0, contradicting the fact that (γ, w) 6= (0, 0).
Therefore, γ < 0 and after dividing Eq. (1.56) by γ, we obtain for all β ≥ 0,
z ∈ <n , and µ > f (z),
$$\mu + \Big(\frac{w}{\gamma}\Big)'(z - x) \ge f(x) - \epsilon + \beta \inf_{\alpha > 0} \frac{f(x + \alpha y) - f(x) + \epsilon}{\alpha} + \beta \Big(\frac{w}{\gamma}\Big)' y. \tag{1.57}$$

Taking the limit above as μ ↓ f(z) and setting β = 0, we obtain
$$f(z) \ge f(x) - \epsilon + \Big(-\frac{w}{\gamma}\Big)'(z - x), \qquad \forall\ z \in \Re^n.$$

Hence −w/γ belongs to ∂_ε f(x). Also by taking z = x in Eq. (1.57), and
by letting μ ↓ f(x) and by dividing with β, we obtain
$$-\Big(\frac{w}{\gamma}\Big)' y \ge -\frac{\epsilon}{\beta} + \inf_{\alpha > 0} \frac{f(x + \alpha y) - f(x) + \epsilon}{\alpha}.$$
Since β can be chosen as large as desired, we see that
$$-\Big(\frac{w}{\gamma}\Big)' y \ge \inf_{\alpha > 0} \frac{f(x + \alpha y) - f(x) + \epsilon}{\alpha}.$$

Combining this relation with Eq. (1.55), we obtain
$$\max_{d \in \partial_\epsilon f(x)} d' y = \inf_{\alpha > 0} \frac{f(x + \alpha y) - f(x) + \epsilon}{\alpha}.$$

(b) By definition, 0 ∈ ∂² f (x) if and only if f(z) ≥ f (x) − ² for all z ∈ <n ,
which is equivalent to inf z∈<n f(z) ≥ f (x) − ².
(c) Assume that a direction y is such that
$$\max_{d \in \partial_\epsilon f(x)} y' d < 0, \tag{1.58}$$
while inf_{α>0} f(x + αy) ≥ f(x) − ε. Then f(x + αy) − f(x) ≥ −ε for all
α > 0, or equivalently
$$\frac{f(x + \alpha y) - f(x) + \epsilon}{\alpha} \ge 0, \qquad \forall\ \alpha > 0.$$
Consequently, using part (a), we have
$$\max_{d \in \partial_\epsilon f(x)} y' d = \inf_{\alpha > 0} \frac{f(x + \alpha y) - f(x) + \epsilon}{\alpha} \ge 0,$$
which contradicts Eq. (1.58).


(d) The vector d̄ is the projection of the zero vector on the convex and
compact set ∂_ε f(x). If 0 ∉ ∂_ε f(x), we have ‖d̄‖ > 0, while by Prop. 1.3.3,
$$(d - \bar{d})' \bar{d} \ge 0, \qquad \forall\ d \in \partial_\epsilon f(x).$$
Hence
$$d' \bar{d} \ge \|\bar{d}\|^2 > 0, \qquad \forall\ d \in \partial_\epsilon f(x).$$

(e) It will suffice to prove the result for the case where f = f₁ + f₂. If
d₁ ∈ ∂_ε f₁(x) and d₂ ∈ ∂_ε f₂(x), then from Eq. (1.54), we have

    f₁(x + αy) ≥ f₁(x) + αy′d₁ − ε,   ∀ α > 0, y ∈ ℜⁿ,
    f₂(x + αy) ≥ f₂(x) + αy′d₂ − ε,   ∀ α > 0, y ∈ ℜⁿ,

so by adding, we obtain

    f(x + αy) ≥ f(x) + αy′(d₁ + d₂) − 2ε,   ∀ α > 0, y ∈ ℜⁿ.

Hence from Eq. (1.54), we have d₁ + d₂ ∈ ∂_{2ε} f(x), implying that ∂_ε f₁(x) +
∂_ε f₂(x) ⊂ ∂_{2ε} f(x).
To prove that ∂_ε f(x) ⊂ ∂_ε f₁(x) + ∂_ε f₂(x), we use an argument similar
to the one of the proof of Prop. 1.7.3(d). Suppose, to come to a contradiction,
that there exists a d ∈ ∂_ε f(x) such that d ∉ ∂_ε f₁(x) + ∂_ε f₂(x). Since by
part (a), the sets ∂_ε f₁(x) and ∂_ε f₂(x) are compact, the set ∂_ε f₁(x) + ∂_ε f₂(x)
is compact (cf. Prop. 1.2.16), and by Prop. 1.4.3, there exists a hyperplane
strictly separating {d} from ∂_ε f₁(x) + ∂_ε f₂(x), i.e., a vector y and a scalar
b such that

    y′(d₁ + d₂) < b < y′d,   ∀ d₁ ∈ ∂_ε f₁(x), d₂ ∈ ∂_ε f₂(x).

From this we obtain

    max_{d₁∈∂_ε f₁(x)} y′d₁ + max_{d₂∈∂_ε f₂(x)} y′d₂ < y′d,

or, using part (a),

    inf_{α>0} (f₁(x + αy) − f₁(x) + ε)/α + inf_{α>0} (f₂(x + αy) − f₂(x) + ε)/α < y′d.
Let ᾱⱼ, j = 1, 2, be positive scalars such that

    Σ_{j=1}^{2} (fⱼ(x + ᾱⱼy) − fⱼ(x) + ε)/ᾱⱼ < y′d.   (1.59)

Define

    ᾱ = 1 / Σ_{j=1}^{2} (1/ᾱⱼ).

As a consequence of the convexity of fⱼ, the ratio (fⱼ(x + αy) − fⱼ(x))/α
is monotonically nondecreasing in α. Thus, since ᾱⱼ ≥ ᾱ, we have

    (fⱼ(x + ᾱⱼy) − fⱼ(x))/ᾱⱼ ≥ (fⱼ(x + ᾱy) − fⱼ(x))/ᾱ,   j = 1, 2,

and Eq. (1.59) together with the definition of ᾱ yields

    y′d > Σ_{j=1}^{2} (fⱼ(x + ᾱⱼy) − fⱼ(x) + ε)/ᾱⱼ
        ≥ ε/ᾱ + Σ_{j=1}^{2} (fⱼ(x + ᾱy) − fⱼ(x))/ᾱ
        ≥ inf_{α>0} (f(x + αy) − f(x) + ε)/α.

This contradicts Eq. (1.55), thus implying that ∂_ε f(x) ⊂ ∂_ε f₁(x) + ∂_ε f₂(x).
Q.E.D.

Parts (b)-(d) of Prop. 1.7.5 contain the elements of an important algorithm
(called the ε-descent algorithm and introduced by Bertsekas and Mitter
[BeM71], [BeM73]) for minimizing convex functions to within a tolerance
of ε. At a given point x, we check whether 0 ∈ ∂_ε f(x). If this is so, then
by Prop. 1.7.5(b), x is an ε-optimal solution. If not, then by going along
the direction opposite to the vector of minimum norm on ∂_ε f(x), we are
guaranteed a cost improvement of at least ε. An implementation of this
algorithm due to Lemarechal [Lem74], [Lem75] is given in the exercises.
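The following is a minimal computational sketch of one such step, added here
only for illustration. It is specialized to a polyhedral function f(x) = max_i (aᵢ′x + bᵢ),
for which the convex hull of the gradients of the ε-active pieces is an easily computed
inner approximation of ∂_ε f(x). The function names, the Frank-Wolfe routine for the
minimum-norm point, and the crude line search are choices made for this sketch; they
are not prescribed by the text or by [BeM71].

    import numpy as np

    def eps_active_gradients(A, b, x, eps):
        # Rows a_i with a_i'x + b_i >= f(x) - eps are "eps-active"; each such a_i
        # is an eps-subgradient of f(x) = max_i (a_i'x + b_i) at x.
        vals = A @ x + b
        return A[vals >= vals.max() - eps]

    def min_norm_in_hull(G, iters=2000):
        # Frank-Wolfe iteration for min ||G'lam||^2 over the unit simplex.
        m = G.shape[0]
        lam = np.full(m, 1.0 / m)
        for t in range(iters):
            w = G.T @ lam                      # current point in the hull
            s = np.zeros(m)
            s[np.argmin(G @ w)] = 1.0          # vertex minimizing the linearized cost
            lam += 2.0 / (t + 2.0) * (s - lam)
        return G.T @ lam

    def eps_descent_step(A, b, x, eps):
        f = lambda y: np.max(A @ y + b)
        w = min_norm_in_hull(eps_active_gradients(A, b, x, eps))
        if np.linalg.norm(w) < 1e-9:
            return x, True                     # 0 in the (approximate) eps-subdifferential
        alphas = np.linspace(0.0, 5.0, 501)    # crude search along {x - alpha w | alpha >= 0}
        return min((x - a * w for a in alphas), key=f), False

    # Usage: f(x) = max(x1 + x2, -x1, -x2), starting at x = (2, 1).
    A = np.array([[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
    b = np.zeros(3)
    x_new, optimal = eps_descent_step(A, b, np.array([2.0, 1.0]), eps=0.1)

Because only an inner approximation of ∂_ε f(x) is used, this is a heuristic illustration
of the mechanics (minimum-norm vector, motion opposite to it) rather than a faithful
implementation of the method of [BeM71].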

1.7.4 Subgradients of Extended Real-Valued Functions

In this work the major emphasis is on real-valued convex functions f :
ℜⁿ ↦ ℜ, which are defined over the entire space ℜⁿ and are convex over ℜⁿ.
There are, however, important cases, prominently arising in the context of
duality, where we must deal with functions g : D ↦ ℜ that are defined over
a convex subset D of ℜⁿ, and are convex over D. This type of function may
also be specified as the extended real-valued function f : ℜⁿ ↦ (−∞, ∞]
given by

    f(x) = { g(x)  if x ∈ D,
           { ∞     otherwise,

with D referred to as the effective domain of f.
The notion of a subdifferential and a subgradient of such a function
can be developed along the lines of the present section. In particular, given
a convex function f : ℜⁿ ↦ (−∞, ∞], a vector d is a subgradient of f at a
vector x such that f(x) < ∞ if the subgradient inequality holds, i.e.,

    f(z) ≥ f(x) + (z − x)′d,   ∀ z ∈ ℜⁿ.

If g : D ↦ ℜ is a concave function (i.e., −g is a convex function over
the convex set D), it can also be represented as the extended real-valued
function f : ℜⁿ ↦ [−∞, ∞), where

    f(x) = { g(x)  if x ∈ D,
           { −∞    otherwise.
As earlier, we say that d is a subgradient of f at an x ∈ D if −d is a
subgradient of the convex function −g at x.
The subdifferential ∂f(x) is the set of all subgradients of the convex
(or concave) function f. By convention, ∂f(x) is considered empty for all
x with f(x) = ∞. Note that contrary to the case of real-valued functions,
∂f(x) may be empty, or closed but unbounded. For example, the extended
real-valued convex function given by

    f(x) = { −√x   if 0 ≤ x ≤ 1,
           { ∞     otherwise,

has the subdifferential

            { −1/(2√x)   if 0 < x < 1,
    ∂f(x) = { [−1/2, ∞)  if x = 1,
            { ∅          if x ≤ 0 or 1 < x.

Thus, ∂f(x) can be empty and can be unbounded at points x that belong
to the effective domain of f (as in the cases x = 0 and x = 1, respectively,
of the above example). However, it can be shown that ∂f(x) is nonempty
and compact at points x that are interior points of the effective domain of
f, as also illustrated by the above example.
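A quick numerical check of the endpoint value −1/2 at x = 1 in this example may
be helpful (an added illustration, not part of the original text); it simply tests the
subgradient inequality f(z) ≥ f(1) + d(z − 1) over the effective domain:

    import numpy as np

    z = np.linspace(0.0, 1.0, 10001)      # grid over the effective domain D = [0, 1]
    fz = -np.sqrt(z)                      # f(z) = -sqrt(z) on D

    def is_subgradient_at_1(d):
        # subgradient inequality at x = 1, where f(1) = -1
        return bool(np.all(fz >= -1.0 + d * (z - 1.0) - 1e-12))

    print(is_subgradient_at_1(-0.5))      # True:  -1/2 belongs to the subdifferential at 1
    print(is_subgradient_at_1(-0.51))     # False: slopes below -1/2 fail near z = 1
    print(is_subgradient_at_1(3.0))       # True:  any d >= -1/2 works, so the set is unbounded above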
Similarly, a vector d is an ε-subgradient of f at a vector x such that
f(x) < ∞ if

    f(z) ≥ f(x) + (z − x)′d − ε,   ∀ z ∈ ℜⁿ.

The ε-subdifferential ∂_ε f(x) is the set of all ε-subgradients of f. Figure
1.7.4 illustrates the definition of ∂_ε f(x) for the case of a one-dimensional
function f. The figure illustrates that even when f is extended real-valued,
the ε-subdifferential ∂_ε f(x) is nonempty at all points of the effective domain
D. This can be shown for multi-dimensional functions f as well.
One can provide generalized versions of the results of Props. 1.7.3-
1.7.5 within the context of extended real-valued convex functions, but with
appropriate adjustments and additional assumptions to deal with cases
where ∂f (x) may be empty or noncompact. Since we will not need these
generalizations in this book, we will not state and prove them. The reader
will find a detailed account in the book by Rockafellar [Roc70].

1.7.5 Directional Derivative of the Max Function

As mentioned in Section 1.3.4, the max function f (x) = maxz∈Z φ(x, z)


arises in a variety of interesting contexts, including duality. It is therefore
important to characterize this function, and the following proposition gives
the directional derivative and the subdifferential of f for the case where
φ(·, z) is convex for all z ∈ Z.
[Graph omitted: two one-dimensional examples over an effective domain D. In the left
panel the slopes of the drawn lines give the two endpoints of the ε-subdifferential at x;
in the right panel only the right endpoint is drawn, the left endpoint being −∞.]
Figure 1.7.4. Illustration of the ε-subdifferential ∂_ε f(x) of a one-dimensional
function

    f(x) = { g(x)  if x ∈ D,
           { ∞     otherwise,

where g is convex and the effective domain D is an interval. It corresponds to the
set of slopes indicated in the figure. Note that ∂_ε f(x) is nonempty at all x ∈ D.
Its left endpoint is

    f_ε⁻(x) = { sup_{δ<0, x+δ∈D} (f(x + δ) − f(x) + ε)/δ   if inf D < x,
              { −∞                                          if inf D = x,

and its right endpoint is

    f_ε⁺(x) = { inf_{δ>0, x+δ∈D} (f(x + δ) − f(x) + ε)/δ   if x < sup D,
              { ∞                                           if x = sup D.

Note that these endpoints can be −∞ (as in the figure on the right) or ∞. For
ε = 0, the above formulas also give the endpoints of the subdifferential ∂f(x).
Note that while ∂f(x) is nonempty for all x in the interior of D, it may be empty
for x at the boundary of D (as in the figure on the right).

Proposition 1.7.6: (Danskin’s Theorem) Let Z ⊂ ℜᵐ be a
compact set, and let φ : ℜⁿ × Z ↦ ℜ be continuous and such that
φ(·, z) : ℜⁿ ↦ ℜ is convex for each z ∈ Z.
(a) The function f : ℜⁿ ↦ ℜ given by

    f(x) = max_{z∈Z} φ(x, z)   (1.60)

is convex and has directional derivative given by

    f′(x; y) = max_{z∈Z(x)} φ′(x, z; y),   (1.61)

where φ′(x, z; y) is the directional derivative of the function φ(·, z) at
x in the direction y, and Z(x) is the set of maximizing points in Eq.
(1.60),

    Z(x) = { z̄ | φ(x, z̄) = max_{z∈Z} φ(x, z) }.

In particular, if Z(x) consists of a unique point z̄ and φ(·, z̄) is differentiable
at x, then f is differentiable at x, and ∇f(x) = ∇ₓφ(x, z̄),
where ∇ₓφ(x, z̄) is the vector with coordinates

    ∂φ(x, z̄)/∂xᵢ,   i = 1, . . . , n.

(b) If φ(·, z) is differentiable for all z ∈ Z and ∇ₓφ(x, ·) is continuous
on Z for each x, then

    ∂f(x) = conv{ ∇ₓφ(x, z) | z ∈ Z(x) },   ∀ x ∈ ℜⁿ.

(c) The conclusion of part (a) also holds if, instead of assuming that
Z is compact, we assume that Z(x) is nonempty for all x ∈ ℜⁿ,
and that φ and Z are such that for every convergent sequence
{xₖ}, there exists a bounded sequence {zₖ} with zₖ ∈ Z(xₖ) for
all k.

Proof: (a) The convexity of f has been established in Prop. 1.2.3(c). We
note that since φ is continuous and Z is compact, the set Z(x) is nonempty
by Weierstrass’ Theorem (Prop. 1.3.1) and f takes real values. For any
z ∈ Z(x), y ∈ ℜⁿ, and α > 0, we use the definition of f to obtain

    (f(x + αy) − f(x))/α ≥ (φ(x + αy, z) − φ(x, z))/α.

Taking the limit as α decreases to zero, we obtain f′(x; y) ≥ φ′(x, z; y).
Since this is true for every z ∈ Z(x), we conclude that

    f′(x; y) ≥ sup_{z∈Z(x)} φ′(x, z; y),   ∀ y ∈ ℜⁿ.   (1.62)

To prove the reverse inequality and that the supremum in the right-hand
side of the above inequality is attained, consider a sequence {αₖ} with
αₖ ↓ 0 and let xₖ = x + αₖy. For each k, let zₖ be a vector in Z(xₖ). Since
{zₖ} belongs to the compact set Z, it has a subsequence converging to some
z̄ ∈ Z. Without loss of generality, we assume that the entire sequence {zₖ}
converges to z̄. We have

    φ(xₖ, zₖ) ≥ φ(xₖ, z),   ∀ z ∈ Z,
so by taking the limit as k → ∞ and by using the continuity of φ, we obtain

    φ(x, z̄) ≥ φ(x, z),   ∀ z ∈ Z.

Therefore, z̄ ∈ Z(x). We now have

    f′(x; y) ≤ (f(x + αₖy) − f(x))/αₖ
             = (φ(x + αₖy, zₖ) − φ(x, z̄))/αₖ
             ≤ (φ(x + αₖy, zₖ) − φ(x, zₖ))/αₖ   (1.63)
             ≤ −φ′(x + αₖy, zₖ; −y)
             ≤ φ′(x + αₖy, zₖ; y),

where the last inequality follows from inequality (1.42). We apply Prop.
1.4.2 to the functions fₖ defined by fₖ(·) = φ(·, zₖ), and with xₖ = x + αₖy,
to obtain

    lim sup_{k→∞} φ′(x + αₖy, zₖ; y) ≤ φ′(x, z̄; y).   (1.64)

We take the limit in inequality (1.63) as k → ∞, and we use inequality
(1.64) to conclude that

    f′(x; y) ≤ φ′(x, z̄; y).

This relation together with inequality (1.62) proves Eq. (1.61).
For the last statement of part (a), if Z(x) consists of the unique point
z̄, Eq. (1.61) and the differentiability assumption on φ yield

    f′(x; y) = φ′(x, z̄; y) = y′∇ₓφ(x, z̄),   ∀ y ∈ ℜⁿ,

which implies that ∇f(x) = ∇ₓφ(x, z̄).


(b) By part (a), we have

f 0 (x; y) = max ∇x φ(x, z)0 y,


z∈Z(x)

while by Prop. 1.7.3, we have

f 0 (x; y) = max d0 y.
z∈∂f (x)

For all z ∈ Z(x) and y ∈ <n , we have

f (y) = max φ(y, z)


z∈Z

≥ φ(y, z)
≥ φ(x, z) + ∇x φ(x, z)0 (y − x)
= f (x) + ∇x φ(x, z)0 (y − x).
Therefore, ∇ₓφ(x, z) is a subgradient of f at x, implying that

    conv{ ∇ₓφ(x, z) | z ∈ Z(x) } ⊂ ∂f(x).

To prove the reverse inclusion, we use a hyperplane separation argument.
By the continuity of φ(x, ·) and the compactness of Z, we see that
Z(x) is compact, so by the continuity of ∇ₓφ(x, ·), the set { ∇ₓφ(x, z) |
z ∈ Z(x) } is compact. By Prop. 1.2.8, it follows that conv{ ∇ₓφ(x, z) | z ∈
Z(x) } is compact. If d ∈ ∂f(x) while d ∉ conv{ ∇ₓφ(x, z) | z ∈ Z(x) }, by
the Strict Separation Theorem (Prop. 1.4.3), there exist y ≠ 0 and γ ∈ ℜ,
such that

    d′y > γ > ∇ₓφ(x, z)′y,   ∀ z ∈ Z(x).

Therefore, we have

    d′y > max_{z∈Z(x)} ∇ₓφ(x, z)′y = f′(x; y),

contradicting Prop. 1.7.3. Thus ∂f(x) ⊂ conv{ ∇ₓφ(x, z) | z ∈ Z(x) } and
the proof is complete.
(c) The proof of this part is nearly identical to the proof of part (a).
Q.E.D.
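As a quick numerical illustration of part (a) (added here; the particular φ, the point
x, and the grid over Z are arbitrary choices, not from the text), one can compare the
gradient predicted by Danskin’s theorem with a finite-difference gradient of the max
function at a point where the maximizer is unique:

    import numpy as np

    # phi(x, z) = z*x[0] + z^2*x[1], with Z = [-1, 1] discretized by a fine grid
    Z = np.linspace(-1.0, 1.0, 20001)

    def f(x):
        return np.max(Z * x[0] + Z**2 * x[1])

    def grad_by_danskin(x):
        zbar = Z[np.argmax(Z * x[0] + Z**2 * x[1])]   # the maximizer (unique here)
        return np.array([zbar, zbar**2])              # gradient of phi(., zbar) with respect to x

    x = np.array([1.0, -2.0])                         # maximizer is zbar = 0.25
    h = 1e-6
    fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(2)])
    print(grad_by_danskin(x))                         # [0.25, 0.0625]
    print(fd)                                         # approximately the same values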

A simple but important application of the above proposition is when
Z is a finite set and φ is linear in x for all z. In this case, f(x) can be
represented as

    f(x) = max{a₁′x + b₁, . . . , aₘ′x + bₘ},

where a₁, . . . , aₘ are given vectors in ℜⁿ and b₁, . . . , bₘ are given scalars.
Then ∂f(x) is the convex hull of the set of vectors aᵢ such that aᵢ′x + bᵢ =
max{a₁′x + b₁, . . . , aₘ′x + bₘ}.
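For a small concrete instance (the data below are arbitrary and added here only for
illustration), the active vectors whose convex hull forms ∂f(x) are easy to list:

    import numpy as np

    # f(x) = max_i (a_i'x + b_i); the subdifferential at x is the convex hull of
    # the vectors a_i of the pieces that attain the maximum at x.
    a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    b = np.array([0.0, 0.0, -1.0])

    def active_vectors(x, tol=1e-9):
        vals = a @ x + b
        return a[vals >= vals.max() - tol]

    print(active_vectors(np.array([1.0, 1.0])))   # all three pieces tie: hull of (1,0), (0,1), (1,1)
    print(active_vectors(np.array([2.0, 0.5])))   # only the first piece is active: f is differentiable, gradient (1,0)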

EXERCISES

1.7.1

Consider the problem of minimizing a convex function f : ℜⁿ ↦ ℜ over the polyhedral
set

    X = {x | aⱼ′x ≤ bⱼ, j = 1, . . . , r}.

Show that x* is an optimal solution if and only if there exist scalars µ₁*, . . . , µᵣ*
such that
(i) µⱼ* ≥ 0 for all j, and µⱼ* = 0 for all j such that aⱼ′x* < bⱼ.
(ii) 0 ∈ ∂f(x*) + Σ_{j=1}^{r} µⱼ*aⱼ.

Hint: Characterize the cone T_X(x*)⊥, and use Prop. 1.7.5 and Farkas’ lemma.

1.7.2

Use the boundedness properties of the subdifferential to show that a convex
function f : ℜⁿ ↦ ℜ is Lipschitz continuous on bounded sets, i.e., given any
bounded set B, there exists a scalar L such that

    |f(x) − f(y)| ≤ L‖x − y‖,   ∀ x, y ∈ B.

1.7.3 (Steepest Descent Directions of Convex Functions)

Let f : ℜⁿ ↦ ℜ be a convex function and fix a vector x ∈ ℜⁿ. Show that a
vector d̄ is the vector of minimum norm in ∂f(x) if and only if either d̄ = 0 or
else d̄/‖d̄‖ minimizes f′(x; d) over all d with ‖d‖ ≤ 1.

1.7.4 (Generating Descent Directions of Convex Functions)

Let f : ℜⁿ ↦ ℜ be a convex function and fix a vector x ∈ ℜⁿ. A vector d ∈ ℜⁿ is
said to be a descent direction of f at x if the corresponding directional derivative
of f satisfies

    f′(x; d) < 0.

This exercise provides a method for generating a descent direction in circumstances
where obtaining a single subgradient at x is relatively easy.
Assume that x does not minimize f and let g₁ be some subgradient of f at
x. For k = 2, 3, . . ., let wₖ be the vector of minimum norm in the convex hull of
g₁, . . . , gₖ₋₁,

    wₖ = arg min_{g∈conv{g₁,...,gₖ₋₁}} ‖g‖.

If −wₖ is a descent direction, stop; else let gₖ be an element of ∂f(x) such that

    gₖ′wₖ = min_{g∈∂f(x)} g′wₖ = −f′(x; −wₖ).

Show that this process terminates in a finite number of steps with a descent
direction. Hint: If −wₖ is not a descent direction, then gᵢ′wₖ ≥ ‖wₖ‖² ≥ ‖g*‖² > 0
for all i = 1, . . . , k − 1, where g* is the subgradient of minimum norm, while at
the same time gₖ′wₖ ≤ 0. Consider a limit point of {(wₖ, gₖ)}.
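The following sketch (an added illustration, not part of the exercise) instantiates this
procedure for a polyhedral function f(x) = max_i (aᵢ′x + bᵢ), for which the oracle
min_{g∈∂f(x)} g′wₖ can be evaluated by scanning the active pieces and the minimum-norm
point of the hull can be delegated to a small quadratic program; all names and data
are illustrative:

    import numpy as np
    from scipy.optimize import minimize

    A = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, 1.0]])   # f(x) = max_i (a_i'x + b_i)
    b = np.zeros(3)
    x = np.zeros(2)                                         # all pieces are active at x = 0

    def active(x, tol=1e-9):
        vals = A @ x + b
        return A[vals >= vals.max() - tol]                  # gradients of the active pieces

    def min_norm_in_hull(G):
        # min ||G'lam||^2 over the unit simplex, solved as a small QP
        m = G.shape[0]
        res = minimize(lambda lam: np.sum((G.T @ lam) ** 2), np.full(m, 1.0 / m),
                       bounds=[(0.0, None)] * m,
                       constraints=[{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}])
        return G.T @ res.x

    def dir_deriv(x, d):
        return max(g @ d for g in active(x))                # f'(x; d) for a max of affine pieces

    gs = [active(x)[0]]                                     # g_1: some subgradient at x
    for k in range(20):
        w = min_norm_in_hull(np.array(gs))
        if dir_deriv(x, -w) < -1e-9:                        # is -w_k a descent direction?
            print("descent direction:", -w)
            break
        gs.append(min(active(x), key=lambda g: g @ w))      # g_k attaining the min of g'w over the subdifferential

For the data above the loop stops after a few iterations with the direction
−wₖ ≈ (−0.154, −0.231), along which every affine piece, and hence f, decreases.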
1.7.5 (Implementing the ε-Descent Method [Lem74])

Let f : ℜⁿ ↦ ℜ be a convex function. This exercise shows how the procedure of
Exercise 1.7.4 can be modified so that it generates an ε-descent direction at a given
vector x. At the typical step of this procedure, we have g₁, . . . , gₖ₋₁ ∈ ∂_ε f(x).
Let wₖ be the vector of minimum norm in the convex hull of g₁, . . . , gₖ₋₁,

    wₖ = arg min_{g∈conv{g₁,...,gₖ₋₁}} ‖g‖.

If wₖ = 0, stop; we have 0 ∈ ∂_ε f(x), so by Prop. 1.7.5(b), f(x) ≤ inf_{z∈ℜⁿ} f(z) + ε
and x is ε-optimal. Otherwise, by a search along the line {x − αwₖ | α ≥ 0},
determine whether there exists a scalar α such that f(x − αwₖ) < f(x) − ε. If
such an α can be found, stop and replace x with x − αwₖ (the value of f has been
improved by at least ε). Otherwise let gₖ be an element of ∂_ε f(x) such that

    gₖ′wₖ = min_{g∈∂_ε f(x)} g′wₖ.

Show that this process will terminate in a finite number of steps with either an
improvement of the value of f by at least ε, or with confirmation that x is an
ε-optimal solution.

1.8 OPTIMALITY CONDITIONS

We finally extend the optimality conditions of Props. 1.6.2 and 1.7.3(f) to


the case where the cost function involves a (possibly nonconvex) smooth
component and a convex (possibly nondifferentiable) component.

Proposition 1.8.1: Let x* be a local minimum of a function f :
ℜⁿ ↦ ℜ over a subset X of ℜⁿ. Assume that the tangent cone T_X(x*)
is convex, and that f has the form

    f(x) = f₁(x) + f₂(x),

where f₁ is convex and f₂ is smooth. Then

    −∇f₂(x*) ∈ ∂f₁(x*) + T_X(x*)⊥.

Proof: The proof is analogous to the one of Prop. 1.6.2. Let y be a
nonzero tangent of X at x*. Then there exists a sequence {ξₖ} and a
sequence {xₖ} ⊂ X such that xₖ ≠ x* for all k,

    ξₖ → 0,   xₖ → x*,
and

    (xₖ − x*)/‖xₖ − x*‖ = y/‖y‖ + ξₖ.

We write this equation as

    xₖ − x* = (‖xₖ − x*‖/‖y‖) yₖ,   (1.65)

where

    yₖ = y + ‖y‖ξₖ.

By the convexity of f₁, we have

    −f₁′(xₖ; xₖ − x*) ≤ f₁′(xₖ; x* − xₖ)

and

    f₁(xₖ) + f₁′(xₖ; x* − xₖ) ≤ f₁(x*),

and by adding these inequalities, we obtain

    f₁(xₖ) ≤ f₁(x*) + f₁′(xₖ; xₖ − x*),   ∀ k.

Also, by the Mean Value Theorem, we have

    f₂(xₖ) = f₂(x*) + ∇f₂(x̃ₖ)′(xₖ − x*),   ∀ k,

where x̃ₖ is a vector on the line segment joining xₖ and x*. By adding the
last two relations,

    f(xₖ) ≤ f(x*) + f₁′(xₖ; xₖ − x*) + ∇f₂(x̃ₖ)′(xₖ − x*),   ∀ k,

so using Eq. (1.65), we obtain

    f(xₖ) ≤ f(x*) + (‖xₖ − x*‖/‖y‖)(f₁′(xₖ; yₖ) + ∇f₂(x̃ₖ)′yₖ),   ∀ k.   (1.66)

We now show by contradiction that f₁′(x*; y) + ∇f₂(x*)′y ≥ 0. Indeed,
if f₁′(x*; y) + ∇f₂(x*)′y < 0, then since x̃ₖ → x* and yₖ → y, it follows,
using also Prop. 1.7.2, that for all sufficiently large k,

    f₁′(xₖ; yₖ) + ∇f₂(x̃ₖ)′yₖ < 0

and [from Eq. (1.66)] f(xₖ) < f(x*). This contradicts the local optimality
of x*.
We have thus shown that f₁′(x*; y) + ∇f₂(x*)′y ≥ 0 for all y ∈ T_X(x*),
or equivalently, by Prop. 1.7.3(b),

    max_{d∈∂f₁(x*)} d′y + ∇f₂(x*)′y ≥ 0,
or equivalently [since T_X(x*) is a cone]

    min_{‖y‖≤1, y∈T_X(x*)}  max_{d∈∂f₁(x*)} (d + ∇f₂(x*))′y = 0.

We can now apply the Saddle Point Theorem (Prop. 1.3.8; the convexity/concavity
and compactness assumptions of that proposition are satisfied)
to assert that there exists a d̄ ∈ ∂f₁(x*) such that

    min_{‖y‖≤1, y∈T_X(x*)} (d̄ + ∇f₂(x*))′y = 0.

This implies that

    −(d̄ + ∇f₂(x*)) ∈ T_X(x*)⊥,

which in turn is equivalent to −∇f₂(x*) ∈ ∂f₁(x*) + T_X(x*)⊥. Q.E.D.

Note that in the special case where f₁(x) ≡ 0, we obtain Prop. 1.6.2.
The convexity assumption on T_X(x*) is unnecessary in this case, but it is
essential in general. [Consider the subset X = {(x₁, x₂) | x₁x₂ = 0} of
ℜ²; it is easy to construct a convex nondifferentiable function that has a
global minimum at x* = 0 without satisfying the necessary condition of
Prop. 1.8.1.]
In the special case where f₂(x) ≡ 0 and X is convex, Prop. 1.8.1 yields
the necessity part of Prop. 1.7.3(e). More generally, when X is convex, an
equivalent statement of Prop. 1.8.1 is that if x* is a local minimum of f
over X, there exists a subgradient d ∈ ∂f₁(x*) such that

    (d + ∇f₂(x*))′(x − x*) ≥ 0,   ∀ x ∈ X.

This is because for a convex set X, we have z ∈ T_X(x*)⊥ if and only if
z′(x − x*) ≤ 0 for all x ∈ X.
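As a simple one-dimensional illustration of this condition (an example added here,
not in the original text), take X = ℜ, f₁(x) = |x|, and f₂(x) = (x − 1)². The
condition then requires some d ∈ ∂f₁(x*) with d + 2(x* − 1) = 0. For x* ∈ (0, 1)
we have ∂f₁(x*) = {1}, and 1 + 2(x* − 1) = 0 gives x* = 1/2; indeed f(1/2) = 3/4
is the minimum of |x| + (x − 1)² over ℜ.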

1.9 NOTES AND SOURCES

The material in this chapter is classical and is developed in various books.


Most of these books relate to both convex analysis and optimization, but
differ in their scope and emphasis. Rockafellar [Roc70] focuses on convex-
ity and duality in finite dimensions. Ekeland and Temam [EkT76] develop
the subject in infinite dimensional spaces. Hiriart-Urruty and Lemarechal
[HiL93] emphasize algorithms for dual and nondifferentiable optimization.
Rockafellar [Roc84] focuses on convexity and duality in network optimiza-
tion and an important generalization, known as monotropic programming.
Bertsekas [Ber98] also gives a detailed coverage of this material, which owes
much to the early work of Minty [Min60] on network optimization. Rock-


afellar and Wets [RoW98] extend many of the classical convexity concepts
to a broader analytical context involving nonconvex sets and functions in
finite dimensions, a subject also known as nonsmooth analysis. The nor-
mal cone, introduced by Mordukhovich [Mor76], plays a central role in this
subject. Bonnans and Shapiro [BoS00] emphasize sensitivity analysis and
discuss infinite dimensional problems as well. Borwein and Lewis [BoL00]
develop many of the concepts in Rockafellar and Wets [RoW98], but with
less detail.
Among the early contributors to convexity theory, we note Steinitz
[Ste13], [Ste14], [Ste16], who developed the theory of relative interiors,
recession cones, and polar cones, and Minkowski [Min11], who first in-
vestigated hyperplane separation theorems. The Minkowski-Weyl Theo-
rem given here is due to Weyl [Wey35]. The notion of a tangent and the
tangent cone originated with Bouligand [Bou30], [Bou32]. Subdifferential
theory owes much to the work of Fenchel, who in his 1951 lecture notes
[Fen51] laid the foundations for the subsequent work on convexity and its
connection with optimization.

REFERENCES

[BeM71] Bertsekas, D. P., and Mitter, S. K., 1971. “Steepest Descent for


Optimization Problems with Nondifferentiable Cost Functionals,” Proc.
5th Annual Princeton Confer. Inform. Sci. Systems, Princeton, N. J., pp.
347-351.
[BeM73] Bertsekas, D. P., and Mitter, S. K., 1973. “A Descent Numerical
Method for Optimization Problems with Nondifferentiable Cost Function-
als,” SIAM J. on Control, Vol. 11, pp. 637-652.
[Ber98] Bertsekas, D. P., 1998. Network Optimization: Continuous and
Discrete Models, Athena Scientific, Belmont, MA.
[Ber99] Bertsekas, D. P., 1999. Nonlinear Programming: 2nd Edition, Athena
Scientific, Belmont, MA.
[BoS00] Bonnans, J. F., and Shapiro, A., 2000. Perturbation Analysis of
Optimization Problems, Springer-Verlag, Berlin and N. Y.
[BoL00] Borwein, J. M., and Lewis, A. S., 2000. Convex Analysis and Nonlinear
Optimization, Springer-Verlag, N. Y.
[Bou30] Bouligand, G., 1930. “Sur les Surfaces Depourvues de Points Hy-
perlimites,” Annales de la Societe Polonaise de Mathematique, Vol. 9, pp.
32-41.
[Bou32] Bouligand, G., 1932. Introduction a la Geometrie Infinitesimale
Directe, Gauthiers-Villars, Paris.
[EkT76] Ekeland, I., and Temam, R., 1976. Convex Analysis and Varia-
tional Problems, North-Holland Publ., Amsterdam.
[Fen51] Fenchel, W., 1951. “Convex Cones, Sets, and Functions,” Mimeogra-
phed Notes, Princeton Univ.
[HiL93] Hiriart-Urruty, J.-B., and Lemarechal, C., 1993. Convex Analysis
and Minimization Algorithms, Vols. I and II, Springer-Verlag, Berlin and
N. Y.
[HoK71] Hoffman, K., and Kunze, R., 1971. Linear Algebra, Prentice-Hall,
Englewood Cliffs, N. J.
[LaT85] Lancaster, P., and Tismenetsky, M., 1985. The Theory of Matrices,
Academic Press, N. Y.
[Lem74] Lemarechal, C., 1974. “An Algorithm for Minimizing Convex Func-
tions,” in Information Processing ’74, Rosenfeld, J. L., (Ed.), pp. 552-556,
North-Holland, Amsterdam.
[Lem75] Lemarechal, C., 1975. “An Extension of Davidon Methods to Non-
differentiable Problems,” Math. Programming Study 3, Balinski, M., and
Wolfe, P., (Eds.), North-Holland, Amsterdam, pp. 95-109.
[Min11] Minkowski, H., 1911. “Theorie der Konvexen Korper, Insbesondere
Begrundung Ihres Ober Flachenbegriffs,” Gesammelte Abhandlungen, II,
Teubner, Leipzig.
[Min60] Minty, G. J., 1960. “Monotone Networks,” Proc. Roy. Soc. London,
A, Vol. 257, pp. 194-212.
[Mor76] Mordukhovich, B. S., 1976. “Maximum Principle in the Problem
of Time Optimal Response with Nonsmooth Constraints,” J. of Applied
Mathematics and Mechanics, Vol. 40, pp. 960-969.
[OrR70] Ortega, J. M., and Rheinboldt, W. C., 1970. Iterative Solution of
Nonlinear Equations in Several Variables, Academic Press, N. Y.
[RoW98] Rockafellar, R. T., and Wets, R. J.-B., 1998. Variational Analysis,
Springer-Verlag, Berlin.
[Roc70] Rockafellar, R. T., 1970. Convex Analysis, Princeton Univ. Press,
Princeton, N. J.
[Roc84] Rockafellar, R. T., 1984. Network Flows and Monotropic Opti-
mization, Wiley, N. Y.; republished by Athena Scientific, Belmont, MA,
1998.
[Rud76] Rudin, W., 1976. Principles of Mathematical Analysis, McGraw-
Hill, N. Y.
[Ste13] Steinitz, H., 1913. “Bedingt Konvergente Reihen und Konvexe Sys-
teme, I,” J. of Math., Vol. 143, pp. 128-175.
[Ste14] Steinitz, H., 1914. “Bedingt Konvergente Reihen und Konvexe Sys-
teme, II,” J. of Math., Vol. 144, pp. 1-40.
[Ste16] Steinitz, H., 1916. “Bedingt Konvergente Reihen und Konvexe Sys-
teme, III,” J. of Math., Vol. 146, pp. 1-52.
[Str76] Strang, G., 1976. Linear Algebra and Its Applications, Academic
Press, N. Y.
[Wey35] Weyl, H., 1935. “Elementare Theorie der Konvexen Polyeder,”
Commentarii Mathematici Helvetici, Vol. 7, pp. 290-306.
