
Convex optimization and applications 2023


Ami Wiesel
Contents

1 Introduction
  1.1 Lecture
    1.1.1 Background and history
    1.1.2 Line search
    1.1.3 Convexity
  1.2 Tirgul
    1.2.1 Eigenvalues
    1.2.2 Positive definiteness
2 Convex sets
  2.1 Lecture
    2.1.1 Lines and segments
    2.1.2 Affine sets
    2.1.3 Convex sets
    2.1.4 Operations that preserve convexity
  2.2 Tirgul
3 Convex functions
  3.1 Lecture
    3.1.1 Theory
    3.1.2 Basic functions
    3.1.3 Operations that preserve convexity
  3.2 Tirgul
4 Convex optimization problems
  4.1 Lecture
    4.1.1 Optimality conditions
    4.1.2 Projection onto a convex set
  4.2 Tirgul
    4.2.1 Linear programming
5 Linear algebra
  5.1 Lecture
    5.1.1 Square and symmetric - EVD
    5.1.2 Rectangular - SVD
    5.1.3 Least squares
    5.1.4 Equality constraints (Boyd p. 141)
    5.1.5 Pseudoinverse - minimum norm solution
    5.1.6 Projection onto an affine set
  5.2 Tirgul
    5.2.1 Robust linear programming
    5.2.2 Semidefinite programming
6 Gradient Descent and Newton
  6.1 Lecture
    6.1.1 Gradient Descent
    6.1.2 Newton method
    6.1.3 Interior point method (barrier)
  6.2 Tirgul
    6.2.1 Gradient Descent - quadratic analysis
    6.2.2 Gradient Descent - Lipschitz analysis
    6.2.3 Acceleration
7 Minimization majorization
  7.1 Lecture
    7.1.1 Main idea
    7.1.2 Gradient Descent
    7.1.3 Iterative thresholding
    7.1.4 Iterative Reweighted Least Squares (IRLS)
    7.1.5 Convex - Concave
    7.1.6 Exponentiated Gradients
8 Duality I
  8.1 Lecture
    8.1.1 Definitions and theorems
    8.1.2 Examples
  8.2 Tirgul
9 Duality II
  9.1 Lecture
    9.1.1 Separating and supporting hyperplanes
    9.1.2 Strong duality
  9.2 Tirgul
10 Optimality conditions
  10.1 Lagrange multiplier theorem
  10.2 Convex optimality conditions
11 Convex relaxation
  11.1 Linear
  11.2 Semidefinite relaxation

1. Introduction

A. Lecture

1) Background and history: IMPORTANT: These lecture notes are a rough summary of the
book “Convex Optimization” by Boyd and Vandenberghe. The book is amazing, free and
available online. I highly recommend using the book and not the summary which is only
meant for my personal teaching. All the rights clearly belong to Boyd and Vandenberghe.
The typos and errors are mine.

"... the great watershed in optimization isn't between linearity and nonlinearity, but convexity and nonconvexity." - R. Rockafellar, SIAM Review 1993

$$\min_{x \in S}\ f(x) \quad \text{s.t. } g_i(x) \le 0,\ i = 1, \cdots, m$$

Almost all engineering problems involve optimization: a goal and constraints. The art is to formulate them mathematically and to solve them (and these two are related).

Linear-Convex-Conic

• 1947 - Linear programming - Kantorovich, Dantzig.
• 1972 - Convex programming via the ellipsoid method - Nemirovski and Yudin.
• 1984 - Interior point method for LP - Karmarkar.
• 1998 - SeDuMi by Sturm (LP-SDP). FREE.
• 2004 - YALMIP - easy to use interface for 30 solvers. FREE.
• 2006 - CVX - disciplined convex programming (uses SDP for nonlinear). FREE.
• 2012 - Core undergraduate course in every leading EE/CS university in the world.

Our goal

• Identify convex/conic problems
• Optimality conditions - structure, check existing...
• Numerical algorithms - Gradient descent and Newton.
• Learn to use existing state-of-the-art free algorithms
• Transform to standard problems
• Examples and applications - lots!!!
• Duality
• Relaxation when non-convex

Image reconstruction

Reconstruct an image $x$ from a blurred/sampled observation $y \approx Hx$, assuming some sparsity (e.g., in a wavelet basis):

$$\min_x\ \|y - Hx\|_2 \quad \text{s.t. } \|x\|_1 \le \alpha$$

Efficient numerical solution, and iterative thresholding interpretation.
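As an aside, problems like this are easy to prototype with the free tools listed above. A minimal sketch using CVXPY, with synthetic $H$, $y$ and a sparsity budget `alpha` (all names here are illustrative, not course data):

```python
# Sketch: sparse reconstruction  min ||y - Hx||_2  s.t. ||x||_1 <= alpha  (CVXPY).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n = 50, 200                       # fewer measurements than unknowns
H = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, 5, replace=False)] = rng.standard_normal(5)  # sparse signal
y = H @ x_true
alpha = np.abs(x_true).sum()         # oracle sparsity budget, for illustration only

x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm(y - H @ x, 2)),
                  [cp.norm(x, 1) <= alpha])
prob.solve()
print("recovery error:", np.linalg.norm(x.value - x_true))
```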

Inpainting

Inpainting (also known as image interpolation) refers to the application of sophisticated algorithms to replace lost or corrupted parts of the image data (Facetune - remove small defects).

$$\min_x\ \sum_{ij} \left\| \begin{bmatrix} x_{i+1,j} - x_{i,j} \\ x_{i,j+1} - x_{i,j} \end{bmatrix} \right\|_2 \quad \text{s.t. } x_{ij} = x^{\text{known}}_{ij}$$

Communication and information theory

Power/bandwidth allocation to maximize capacity subject to power constraints:

$$\max_p\ \sum_i \log(1 + p_i \gamma_i) \quad \text{s.t. } p_i \ge 0,\ \sum_i p_i \le P$$

Efficient and insightful waterfilling solution via duality.

Finance - portfolio design

Optimize the weight allocations to minimize variance subject to a target expected return:

$$\min_w\ w^T \Sigma w \quad \text{s.t. } w^T \mu = r,\ w^T \mathbf{1} = 1$$

Closed form solution, extensions, robustness, etc...

Classification (support vector machine)

Try to classify points $x_i$ to their labels $y_i \in \{\pm 1\}$ using an affine function. For example, digit recognition in license plates. The goal is

$$\operatorname{sign}\left(w^T x_i + b\right) = y_i$$

with large magnitude. A standard approach is to maximize the margin:

$$\min_w\ \|w\|_2 \quad \text{s.t. } y_i\left(w^T x_i + b\right) \ge 1$$

Efficient and insightful (support) solution via duality.

Two-way partitioning problem

A difficult problem: assign points to two groups while minimizing various costs of separating two points.

$$\min_w\ w^T Q w \quad \text{s.t. } w_i \in \{\pm 1\}$$

A brute force solution requires an exponential number of evaluations. We will find good bounds via duality, as well as approximations.

Books

• Boyd and Vandenberghe, Convex optimization [online]


• Nemirovski, Lecture notes I-II [online]
• Peressini, Sullivan and Uhl, The mathematics of nonlinear programming
• Bertsekas, Nonlinear programming
• Hardt, EE 227C (Spring 2018)

2) Line search: One dimensional optimization problems. They include all the main ingredients, are very intuitive, and are part of most high dimensional algorithms.

Optimality

Consider the following one dimensional optimization problem

$$\min_{x \in S}\ f(x)$$

• f (x) is the objective function (goal, target)


• $S$ is the feasible set (constraints). For now, we will only use closed/open intervals $S = \mathbb{R}$, $S = \{a \le x \le b\}$, $S = \{x \ge a\}$, $S = \{x \le b\}$, with possibly strict inequalities. (Later we will call these convex sets.)
• A solution does not necessarily exist! E.g., $\min_x x$ (then we need inf, but I will try to avoid it).
• How is a solution defined? Up to a tolerance!
• When can we efficiently solve this problem? Without defining complexity! We can always draw it? In what interval? In which precision? What is the complexity (number of iterations/oracle calls/etc.)?

A solution is a value of $x$ for which $f(x)$ is minimal. There are two kinds of minima (local and global).

Definition 1. $x^*$ is a global minimum of $f(x)$ over the interval $S$ if it is no worse than all other $x \in S$:

$$f(x^*) \le f(x), \quad \forall x \in S.$$

Definition 2. $x^*$ is a local minimum of $f(x)$ over the set $S$ if it is no worse than its feasible neighbors, i.e., there exists an $\epsilon > 0$ such that

$$f(x^*) \le f(x), \quad \forall x \in S,\ |x^* - x| < \epsilon.$$

Theorem 1 (Necessary optimality conditions - scalar functions). Let $x^*$ be a local minimum of a twice continuously differentiable function $f$ over $S$. Then either $x^*$ is an end point, or $f'(x^*) = 0$ and $f''(x^*) \ge 0$.
Proof. Suppose $x^*$ is a local minimum and is not an end point. We will show that $f'(x^*) = 0$. Recall that

$$f'(x^*) = \lim_{\alpha \to 0} \frac{f(x^* + \alpha) - f(x^*)}{\alpha}$$

Due to local optimality, $f(x^* + \alpha) - f(x^*) \ge 0$ for sufficiently small $|\alpha|$. Hence

$$\frac{f(x^* + \alpha) - f(x^*)}{\alpha} \ge 0, \quad \forall \alpha > 0$$
$$\frac{f(x^* + \alpha) - f(x^*)}{\alpha} \le 0, \quad \forall \alpha < 0$$

These inequalities can only hold together if $f'(x^*) = 0$. Using a Taylor expansion,

$$f(x^* + \alpha) - f(x^*) = \alpha f'(x^*) + \frac{\alpha^2}{2} f''(z)$$

for some $z \in [x^*, x^* + \alpha]$. Due to local optimality and $f'(x^*) = 0$, we get

$$0 \le 2\,\frac{f(x^* + \alpha) - f(x^*)}{\alpha^2} = f''(z).$$

Due to continuity of $f''(\cdot)$, this holds for all sufficiently small $|\alpha|$ only if $f''(x^*) \ge 0$.

• Maximization - just replace inequality signs.


• Strict - replace inequalities with strict inequalities.

3) Convexity: The previous result is only local. To get something global we need convexity.

Definition 3. A set $S$ is called convex if for all $x, y \in S$ and $\lambda \in [0, 1]$ we have $\lambda x + (1 - \lambda) y \in S$.

Three convex one dimensional sets are the bounded and unbounded intervals:

I = {x ≥ a}

I = {a ≤ x ≤ b}

I = {x ≤ b}

Definition 4. A function f (x) is called convex over an interval I if

f (λx + (1 − λ) y) ≤ λf (x) + (1 − λ) f (y) , ∀x, y ∈ I, λ ∈ [0, 1] .

Note that since I is an interval, we have λx + (1 − λ) y ∈ I and the definition is well defined.

Draw example - line over function.



Now, we give two alternative definitions via theorems.

Theorem 2. A continuously differentiable function f (x) is convex over an interval I if and only
if it is above its linear approximation

f (x) ≥ f (y) + (x − y)f ′ (y), ∀x, y ∈ I

Proof. Assume that f is convex, then by definition

f (y + λ (x − y)) ≤ λf (x) + (1 − λ) f (y)

divide by $\lambda$ and rearrange:

$$f(x) \ge f(y) + (x - y)\,\frac{f(y + \lambda(x - y)) - f(y)}{\lambda (x - y)}$$
Taking the limit λ → 0 yields the required result. On the other hand, assume the condition holds
for all x, y ∈ I. Choose z = λx + (1 − λ) y for any λ ∈ [0, 1], and apply the condition twice

f (x) ≥ f (z) + (x − z)f ′ (z)

f (y) ≥ f (z) + (y − z)f ′ (z)

multiply the inequalities by λ and 1 − λ, and add them up to obtain the required result

λf (x) + (1 − λ)f (y) ≥ f (z).

Theorem 3. A twice continuously differentiable function f (x) is convex over an interval I if


and only if

f ′′ (x) ≥ 0, ∀x ∈ I

Proof. Sufficiency: using a Taylor expansion,

$$f(y) - f(x) = (y - x) f'(x) + \frac{(y - x)^2}{2} f''(z)$$

for some $z \in [x, y]$. Due to $f'' \ge 0$, we get

$$f(y) - f(x) \ge (y - x) f'(x)$$

which is sufficient for convexity. On the other hand, assume $f$ is convex and that $f''(x) < 0$. Then we can choose $y$ close enough such that $f''(z) < 0$ for all $z \in [x, y]$, and obtain

$$f(y) - f(x) = (y - x) f'(x) + \frac{(y - x)^2}{2} f''(z) < (y - x) f'(x)$$

which is a contradiction to the convexity of f .

A few examples:

• Exponential: $f(x) = e^x$ is convex on $\mathbb{R}$.

$$f(x) = e^x, \quad f'(x) = e^x, \quad f''(x) = e^x \ge 0$$

• Powers: $f(x) = x^a$ is convex on $\mathbb{R}_{++}$ when $a \ge 1$, and concave when $0 < a \le 1$.

$$f(x) = x^a, \quad f'(x) = a x^{a-1}, \quad f''(x) = a(a-1)x^{a-2}$$

• Logarithm: $f(x) = -\log x$ is convex on $\mathbb{R}_{++}$.

$$f(x) = -\log x, \quad f'(x) = -\frac{1}{x}, \quad f''(x) = \frac{1}{x^2} \ge 0$$

• Negative entropy: $f(x) = x \log x$ is convex on $\mathbb{R}_{++}$.

$$f(x) = x \log x, \quad f'(x) = x \cdot \frac{1}{x} + \log x = 1 + \log x, \quad f''(x) = \frac{1}{x} \ge 0$$
Using these definitions, we can now find the global minimum:

Theorem 4. If f (x) is a convex function over I, then any local minimum of f over I is also a
global minimum.

Proof. Suppose that x is a local minimum but not global. Then there exists a y ̸= x such that
f (y) < f (x). Using the convexity:

$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y) < f(x), \quad \forall \lambda \in [0, 1)$$

where we have used $f(y) < f(x)$ in the second inequality. Taking $\lambda$ close to 1 gives feasible points arbitrarily close to $x$ with a smaller objective value, which contradicts the local optimality of $x$.

Question: convexity is a sufficient condition for local=global; is it necessary? No! For example, quasi-convexity. However, it is the easiest, most studied, easiest to generalize to higher dimensions, and best understood (no saddle points, convex+convex is convex, etc.).

B. Tirgul

For a function f (x)

f : Rn → Rm

the Jacobian is an $m \times n$ matrix

$$Df(x) = \begin{bmatrix} \frac{\partial f_1(x)}{\partial x_1} & \cdots & \frac{\partial f_1(x)}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_m(x)}{\partial x_1} & \cdots & \frac{\partial f_m(x)}{\partial x_n} \end{bmatrix}$$

Example:

$$f(x) = Ax$$

Compute the Jacobian: the $(i, j)$ entry is $\frac{\partial \sum_k A_{i,k} x_k}{\partial x_j} = A_{i,j}$, so

$$D(Ax) = A$$

The first order approximation of f is

f (z) ≈ f (x) + Df (x) (z − x)

Indeed, in the example we have (and this helps to remember the transposes and sizes)

$$Az = Ax + A(z - x)$$

In particular, if $m = 1$ then the transposed Jacobian is called the gradient vector (always a column):

$$\nabla f(x) = [Df(x)]^T = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \end{bmatrix}$$

so that

$$f(z) \approx f(x) + [\nabla f(x)]^T (z - x)$$


In the example:

$$f(x) = a^T x, \quad Df(x) = a^T, \quad \nabla f(x) = a$$

A more interesting example:

$$f(x) = x^T A x$$

We can find the gradient by definition, or by the approximation (for small $\delta$):

$$(x + \delta z)^T A (x + \delta z) \approx x^T A x + \delta\, \nabla^T f(x) z$$

Ignoring second order terms, we get

$$z^T A x + x^T A z = \nabla^T f(x) z$$

Using $c^T d = d^T c$ and $[Ax]^T = x^T A^T$ we get

$$x^T A^T z + x^T A z = \nabla^T f(x) z$$

so that

$$\nabla f(x) = \left(A + A^T\right) x$$

or in the symmetric case

$$\nabla f(x) = 2Ax$$

Compare to the scalar case $(a x^2)' = 2ax$.
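A quick way to sanity-check such gradient formulas is a finite-difference test. A small sketch (my own illustration, not from the book):

```python
# Check that the gradient of f(x) = x^T A x is (A + A^T) x via central differences.
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))      # not necessarily symmetric
x = rng.standard_normal(n)

f = lambda v: v @ A @ v
grad = (A + A.T) @ x                 # the formula derived above

eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
               for e in np.eye(n)])  # numerical gradient
print(np.max(np.abs(fd - grad)))     # ~1e-9
```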

Chain rule: for

$$f: \mathbb{R}^n \to \mathbb{R}^m, \quad g: \mathbb{R}^m \to \mathbb{R}^p$$

with the composition

$$h: \mathbb{R}^n \to \mathbb{R}^p, \quad h(x) = g(f(x))$$

then

$$\underbrace{Dh(x)}_{p \times n} = \underbrace{Dg(f(x))}_{p \times m}\, \underbrace{Df(x)}_{m \times n}$$

When $m = 1$ and $p = 1$ this is the well known chain rule

$$\nabla h(x) = g'(f(x)) \nabla f(x)$$

Example:

$$h(x) = g(Ax + b)$$

where $g: \mathbb{R}^m \to \mathbb{R}$ and $f: \mathbb{R}^n \to \mathbb{R}^m$ with $f(x) = Ax + b$, $A$ of size $m \times n$. Then

$$Dh(x) = Dg(Ax + b)\, A$$

and its transpose is

$$\nabla h(x) = A^T \nabla g(Ax + b)$$

The Hessian is defined as

$$\nabla^2 f(x) = \frac{\partial^2 f(x)}{\partial x \partial x^T} = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_1 \partial x_1} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\ \vdots & & \vdots \\ \frac{\partial^2 f(x)}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f(x)}{\partial x_n \partial x_n} \end{bmatrix}$$

Note that the Hessian is always symmetric (for twice continuously differentiable $f$).

The Hessian also has a chain rule, but it's more cumbersome. In the special case of an affine transformation

$$g(x) = f(Ax + b)$$

we have

$$\nabla^2 g(x) = A^T \nabla^2 f(Ax + b)\, A$$

Use the sizes to remember!

Exercise:

$$f(x) = \sqrt{x^T Q x} - b^T x$$

where $Q \succ 0$. Compute its gradient and Hessian.

$$\nabla f(x) = \frac{2Qx}{2\sqrt{x^T Q x}} - b = Qx \left(x^T Q x\right)^{-\frac{1}{2}} - b$$

$$\begin{aligned}
\nabla^2 f(x) &= -\frac{1}{2}\left(x^T Q x\right)^{-\frac{3}{2}}\, 2 Q x x^T Q + \left(x^T Q x\right)^{-\frac{1}{2}} Q \\
&= -\left(x^T Q x\right)^{-\frac{3}{2}} Q x x^T Q + \left(x^T Q x\right)^{-\frac{1}{2}} Q \\
&= \left(x^T Q x\right)^{-\frac{3}{2}} \left[ \left(x^T Q x\right) Q - Q x x^T Q \right]
\end{aligned}$$

Is the Hessian positive semidefinite? YES:

$$\begin{aligned}
v^T \left[ \left(x^T Q x\right) Q - Q x x^T Q \right] v &= \left(x^T Q x\right)\left(v^T Q v\right) - v^T Q x\, x^T Q v \\
&= \left(x^T U^T U x\right)\left(v^T U^T U v\right) - \left(v^T U^T U x\right)\left(x^T U^T U v\right) \\
&= \left(y^T y\right)\left(u^T u\right) - \left(u^T y\right)^2 \ge 0
\end{aligned}$$

where $u = Uv$, $y = Ux$, $Q = U^T U$ (since $Q$ is positive definite), and we have used the Cauchy-Schwarz inequality.

1) Eigenvalues:

$$Av = \lambda v$$

$A$ is a matrix, $\lambda$ is its eigenvalue and $v$ is an eigenvector. To find $\lambda$ we solve

$$[A - \lambda I]\, v = 0$$

so that we need to find the $\lambda$'s such that

$$|A - \lambda I| = 0$$

We say that $A$ is diagonalizable if it is similar to a diagonal matrix and can be decomposed as

$$A = U \operatorname{diag}\{\lambda_i\}\, U^{-1}$$

An example of a non-diagonalizable matrix is $\begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}$.


Definition 5. Any real symmetric matrix is diagonalizable by an orthogonal matrix with real eigenvalues, and can be decomposed into $A = U \operatorname{diag}\{\lambda_i\}\, U^T$ where $U$ is orthogonal: $U^T U = U U^T = I$.

2D example:

$$\left| \begin{matrix} a - \lambda & c \\ c & b - \lambda \end{matrix} \right| = (a - \lambda)(b - \lambda) - c^2 = \lambda^2 + \lambda(-a - b) + (ab - c^2) = 0$$

$$\lambda_{1,2} = \frac{a + b \pm \sqrt{(a + b)^2 - 4(ab - c^2)}}{2}$$

2) Positive definiteness: A concept that generalizes scalar positivity to matrices. A main ingredient in convex optimization: it is the second order condition for convexity, and it allows us to use linear matrix inequalities.

It is not a full order (two matrices can be both non-$\succ$ and non-$\prec$). It is not the only generalization (e.g., element-wise positive matrices), but this one works with the eigenvalues (which are usually more interesting than the elements themselves).

We will only work with symmetric matrices!!

Definition 6. A symmetric matrix X is positive definite, denoted by X ≻ 0, if z T Xz > 0 for all


z ̸= 0.

Equivalent statements:

• It is diagonalizable (symmetric matrices always are), and its eigenvalues are positive. Proof: for the $j$th eigenvector $u_j$,

$$u_j^T U D U^T u_j = d_j > 0$$

On the other hand, if all $d_i > 0$ then for any $z \neq 0$

$$z^T X z = z^T U D U^T z = \sum_i d_i \left(U^T z\right)_i^2 > 0$$

In the 2D example:

$$\lambda_{1,2} = \frac{a + b \pm \sqrt{(a + b)^2 - 4(ab - c^2)}}{2} > 0 \quad \Leftrightarrow \quad a + b > 0,\ ab - c^2 > 0$$
Proof: require that both the determinant and the trace be positive. Actually $a > 0,\ ab - c^2 > 0$ suffices.
• It has a Cholesky decomposition $X = U^T U$ where $U$ is an upper triangular matrix with positive diagonal elements (the generalization of the square root). This is the standard numerical method for testing positive definiteness.

$$\begin{bmatrix} a & c \\ c & b \end{bmatrix} = \begin{bmatrix} \sqrt{a} & \frac{c}{\sqrt{a}} \\ 0 & \sqrt{b - \frac{c^2}{a}} \end{bmatrix}^T \begin{bmatrix} \sqrt{a} & \frac{c}{\sqrt{a}} \\ 0 & \sqrt{b - \frac{c^2}{a}} \end{bmatrix}, \quad a > 0,\ ab - c^2 > 0$$

• All the leading principal minors (determinants of the upper left subblocks) are positive:

$$\begin{bmatrix} a & c \\ c & b \end{bmatrix} \succ 0 \quad \Leftrightarrow \quad a > 0,\ ab - c^2 > 0$$

Later, we will generalize this using Schur's Lemma.
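As noted above, attempting a Cholesky factorization is the standard numerical test. A small NumPy sketch (illustrative):

```python
# Test positive definiteness by attempting a Cholesky factorization.
import numpy as np

def is_positive_definite(X):
    try:
        np.linalg.cholesky(X)        # succeeds iff X is (numerically) PD
        return True
    except np.linalg.LinAlgError:
        return False

print(is_positive_definite(np.array([[2.0, 1.0], [1.0, 2.0]])))   # True
print(is_positive_definite(np.array([[1.0, 2.0], [2.0, 1.0]])))   # False: ab - c^2 < 0
```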

Definition 7. A symmetric matrix $X$ is positive semidefinite, denoted by $X \succeq 0$, if $z^T X z \ge 0$ for all $z$.

Now we can write linear matrix inequalities: we say that $A \succ B$ if $A - B \succ 0$, just like with scalars.
2. Convex sets

A. Lecture

We now begin to generalize the previous results to the multidimensional case. The first thing we
need to do is generalize the notion of an interval. We will deal with constrained optimizations
as

$$\min_{x \in S}\ f(x)$$

If S is not convex, we cannot even test the convexity of f (x).

In the scalar case, we showed that the problem is efficiently solvable if f (x) is a convex function
and S is a convex set.

Why do we need constraints?

Theoretically, we don't need constraints, as we can define the function as infinity outside of the domain. But this is not useful theoretically, numerically, or didactically. However, we will see that the two concepts, convex functions and convex sets, are very similar. In fact, once you master one, the other is trivial. In practice, this is similar to the duality between constraints and penalties (which are completely equivalent in convex optimization).

We can transform any non-linear unconstrained optimization into a constrained optimization with a linear objective:

$$\min_x\ f(x)$$

by adding a slack variable $t$ (this is our first trick!):

$$\min_{x,t}\ t \quad \text{s.t. } f(x) \le t$$

1) Lines and segments: Suppose $x_1 \neq x_2 \in \mathbb{R}^n$ are two vectors; then the line passing through them is defined as

$$y = \theta x_1 + (1 - \theta) x_2 = x_2 + \theta(x_1 - x_2) \quad \text{(base and direction)}$$

If $\theta \in \mathbb{R}$ then it is a line. If $0 \le \theta \le 1$ then it is a line segment.

2) Affine sets: A set $C \subseteq \mathbb{R}^n$ is affine if the line through any two distinct points in $C$ lies in $C$:

$$x_1, x_2 \in C,\ \theta \in \mathbb{R} \ \Rightarrow\ \theta x_1 + (1 - \theta) x_2 \in C$$

This can be generalized to more than two points; an affine set is a subspace plus an offset, and its dimension is defined as the dimension of the subspace.

Example 1 (Solution set of linear equations). $C = \{x : Ax = b\}$ is an affine set:

$$x_1, x_2 \in C \ \Rightarrow\ Ax_1 = b,\ Ax_2 = b$$
$$A(\theta x_1 + (1 - \theta) x_2) = \theta A x_1 + (1 - \theta) A x_2 = \theta b + (1 - \theta) b = b$$
$$\Rightarrow\ \theta x_1 + (1 - \theta) x_2 \in C$$

There is also a converse: every affine set can be expressed as the solution set of a system of linear equations.

3) Convex sets: A set $C \subseteq \mathbb{R}^n$ is convex if the line segment between any two points in $C$ lies in $C$:

$$x_1, x_2 \in C,\ 0 \le \theta \le 1 \ \Rightarrow\ \theta x_1 + (1 - \theta) x_2 \in C$$

• Can be generalized to more than two points (even infinite with an integral).
• In the most general form: If C is convex and x is a random vector in C with probability
one, then E [x] ∈ C.

• Every affine set is convex.



• What is the difference between affine and convex? both are linear combinations whose
coefficients sum to one but in convex the coefficients are also non-negative (like a weighted
average).

• Hyperplane $\{x : a^T x = b\}$, $a \neq 0$, is affine and convex:

$$a^T x = b,\ a^T y = b \ \Rightarrow\ a^T(\theta x + (1 - \theta) y) = \theta a^T x + (1 - \theta) a^T y = \theta b + (1 - \theta) b = b$$

• Halfspace $\{x : a^T x \le b\}$, $a \neq 0$, is convex but not affine:

$$a^T x \le b,\ a^T y \le b \ \Rightarrow\ a^T(\theta x + (1 - \theta) y) = \theta a^T x + (1 - \theta) a^T y \overset{\theta \in [0,1]}{\le} \theta b + (1 - \theta) b = b$$

• Euclidean ball $\{x : \|x - x_c\|_2 \le r\}$ with radius $r > 0$ is convex:

$$\|\theta x_1 + (1 - \theta) x_2 - x_c\|_2 = \|\theta(x_1 - x_c) + (1 - \theta)(x_2 - x_c)\|_2 \le \theta \|x_1 - x_c\|_2 + (1 - \theta)\|x_2 - x_c\|_2 \le r$$

It can also be expressed as $\{x_c + r u : \|u\|_2 \le 1\}$.



• Second order (ice cream, Lorentz, quadratic) cone $\{(x, t) : \|x\|_2 \le t\}$ is convex. DRAW.

$$\|\theta x_1 + (1 - \theta) x_2\|_2 - \theta t_1 - (1 - \theta) t_2 \le \theta(\|x_1\|_2 - t_1) + (1 - \theta)(\|x_2\|_2 - t_2) \le 0$$

• The positive semidefinite cone is convex:

$$S_+^n = \{X \in S^n : X \succeq 0\} = \{X \in S^n : z^T X z \ge 0\ \forall z\}$$

Proof:

$$z^T(\theta X_1 + (1 - \theta) X_2) z = \theta z^T X_1 z + (1 - \theta) z^T X_2 z \ge 0$$

In 2D,

$$\begin{bmatrix} x & y \\ y & z \end{bmatrix} \succeq 0$$

is the set $\{x \ge 0,\ z \ge 0,\ xz \ge y^2\}$.

4) Operations that preserve convexity: How to check whether a set is convex without going back to the definitions.

• Intersection: if $S_1$ and $S_2$ are convex, then so is $S_1 \cap S_2$:

$$x \in S_1 \cap S_2 \ \Rightarrow\ x \in S_1 \text{ and } x \in S_2$$
$$y \in S_1 \cap S_2 \ \Rightarrow\ y \in S_1 \text{ and } y \in S_2$$

then

$$\theta x + (1 - \theta) y \in S_1 \text{ and } \theta x + (1 - \theta) y \in S_2 \ \Rightarrow\ \theta x + (1 - \theta) y \in S_1 \cap S_2$$

This means we can add convex constraints (we look for the intersection of them).

• Infinite intersection: if $S_\alpha$ is convex for all $\alpha \in \mathcal{A}$ then so is $\cap_{\alpha \in \mathcal{A}} S_\alpha$ ($\mathcal{A}$ does not need to be convex). Example: the positive semidefinite cone can be expressed as the intersection of an infinite number of halfspaces and is therefore convex:

$$\{X \in S^n : X \succeq 0\} = \cap_{z \neq 0} \{X \in S^n : z^T X z \ge 0\}$$

Which representation is more efficient/compact?

There is also a converse: every closed convex set is a (usually infinite) intersection of halfspaces, i.e., the solution set of an infinite set of linear inequalities.

• Let $S_1$ and $S_2$ be convex sets; then their direct product $S_1 \times S_2 = \{(x_1, x_2) \mid x_1 \in S_1,\ x_2 \in S_2\}$ is convex.

• Affine functions: if $S \subseteq \mathbb{R}^n$ is convex then its image under an affine function $f(x) = Ax + b$ is convex:

$$f(S) = \{Ax + b : x \in S\}$$


Proof:

$$f_1 \in f(S) \ \Rightarrow\ \text{there exists } x_1 \in S \text{ such that } f_1 = Ax_1 + b$$
$$f_2 \in f(S) \ \Rightarrow\ \text{there exists } x_2 \in S \text{ such that } f_2 = Ax_2 + b$$

Now we need to show that

$$f' = \theta f_1 + (1 - \theta) f_2 \in f(S)$$

That is, to show that there exists $x' \in S$ such that $f' = Ax' + b$. But this is trivial, since

$$x' = \theta x_1 + (1 - \theta) x_2 \in S$$

and

$$Ax' + b = \theta A x_1 + (1 - \theta) A x_2 + b = \theta(f_1 - b) + (1 - \theta)(f_2 - b) + b = f'$$

as required.

Two simple examples are scaling

$$\alpha S = \{\alpha x \mid x \in S\}$$

and translation

$$S + a = \{x + a \mid x \in S\}$$

which are convex if $S$ is convex. Also rotation.

Example: Let $S_1$ and $S_2$ be convex sets; then $S_1 + S_2 = \{x_1 + x_2 \mid x_1 \in S_1,\ x_2 \in S_2\}$ is convex. Proof: it is the image of an affine transformation of the direct product $S_1 \times S_2$.

• Affine functions: if $S \subseteq \mathbb{R}^n$ is convex then its inverse image under an affine function $f(x) = Ax + b$ is convex:

$$f^{-1}(S) = \{x : Ax + b \in S\}$$

Proof:

$$x_1 \in f^{-1}(S) \ \Rightarrow\ Ax_1 + b \in S$$
$$x_2 \in f^{-1}(S) \ \Rightarrow\ Ax_2 + b \in S$$

Let us show that $x = \theta x_1 + (1 - \theta) x_2$ satisfies $Ax + b \in S$ and therefore $x \in f^{-1}(S)$. But that's easy:

$$Ax + b = \theta(Ax_1 + b) + (1 - \theta)(Ax_2 + b) \in S \quad \text{(by convexity of } S\text{)}$$

Example: The polyhedron $\{x : Ax \le b,\ Cx = d\}$ is convex, e.g., the simplex $\{x_i \ge 0,\ \sum_i x_i = 1\}$.

Example: Ellipsoid. Write $P^{-1} = U^T U$, i.e., $P = U^{-T} U^{-1}$. Thus

$$\{x : (x - x_c)^T P^{-1} (x - x_c) \le 1\} = \{x : Ux - Ux_c \in \{u : \|u\| \le 1\}\} = \{U^{-1} z + x_c : \|z\| \le 1\}$$

Example: The hyperbolic cone is convex, since it is the inverse image of the second order cone under an affine function. With $P \succ 0$, $P = U^T U$:

$$\left\{x : x^T P x \le \left(c^T x\right)^2,\ c^T x \ge 0\right\} = \left\{x : \left(Ux,\ c^T x\right) \in \left\{(z, t) : z^T z \le t^2,\ t \ge 0\right\}\right\}$$

Is there something more general than affine transformations?

The perspective function $P: \mathbb{R}^{n+1} \to \mathbb{R}^n$ is

$$P(x) = \frac{x_{1:n}}{x_{n+1}}, \quad x_{n+1} > 0$$

The image of a convex set under the perspective function is convex:

$$P(S) = \{P(x) : x \in S\} \text{ is convex}$$

Proof:

$$p_1 \in P(S) \ \Rightarrow\ \text{there exists } x \in S \text{ such that } p_1 = P(x)$$
$$p_2 \in P(S) \ \Rightarrow\ \text{there exists } y \in S \text{ such that } p_2 = P(y)$$

Now, we need to show that for any $\mu \in [0, 1]$

$$\mu p_1 + (1 - \mu) p_2 \in P(S)$$

That is, to show that there exists a $z \in S$ such that

$$\mu p_1 + (1 - \mu) p_2 = P(z)$$

For this purpose, we choose

$$z = \theta x + (1 - \theta) y, \quad \theta = \frac{\mu y_{n+1}}{\mu y_{n+1} + (1 - \mu) x_{n+1}}$$

where this construction with $x_{n+1} > 0$ and $y_{n+1} > 0$ guarantees that $\theta \in [0, 1]$, so that $z \in S$. Plugging into $P$ yields

$$P(z) = P(\theta x + (1 - \theta) y) = \frac{\theta x_{1:n} + (1 - \theta) y_{1:n}}{\theta x_{n+1} + (1 - \theta) y_{n+1}} \overset{(*)}{=} \frac{\mu}{x_{n+1}}\, x_{1:n} + \frac{1 - \mu}{y_{n+1}}\, y_{1:n} = \mu P(x) + (1 - \mu) P(y) = \mu p_1 + (1 - \mu) p_2$$

since

$$\mu = \frac{\theta x_{n+1}}{\theta x_{n+1} + (1 - \theta) y_{n+1}} \in [0, 1], \qquad \frac{\mu}{x_{n+1}} = \frac{\theta}{\theta x_{n+1} + (1 - \theta) y_{n+1}} \ (*), \qquad \frac{1 - \mu}{y_{n+1}} = \frac{1 - \theta}{\theta x_{n+1} + (1 - \theta) y_{n+1}} \ (*)$$

Similarly, the inverse image of a convex set under the perspective function is convex:

$$P^{-1}(S) = \left\{ x : \frac{x_{1:n}}{x_{n+1}} \in S,\ x_{n+1} > 0 \right\}$$

B. Tirgul

Exercise: Show that a set is convex if and only if its intersection with any line is convex.

First direction: The intersection of two convex sets is convex. Therefore if S is a convex set,
the intersection of S with a line is convex.

Conversely, suppose the intersection of S with any line is convex. Take any two distinct points
x1 , x2 ∈ S. The intersection of S with the line through x1 and x2 is convex. Therefore, convex
combinations of x1 and x2 belong to the intersection, hence also to S.

Reminder: a positive semidefinite matrix is defined by

$$A \succeq 0 \ \Leftrightarrow\ z^T A z \ge 0\ \forall z$$

Exercise: Prove that the solution set of a quadratic inequality

C = {x | xT Ax + bT x + c ≤ 0}

is convex if A ⪰ 0.

Proof: we will show that its intersection with an arbitrary line

$$\{x + tv \mid t \in \mathbb{R}\}$$

is convex. We express

$$(x + tv)^T A (x + tv) + b^T (x + tv) + c$$

as

$$\alpha t^2 + \beta t + \gamma$$

where

$$\alpha = v^T A v, \quad \beta = b^T v + 2 x^T A v, \quad \gamma = c + b^T x + x^T A x$$

Thus the intersection is defined as

$$\{x + tv \mid \alpha t^2 + \beta t + \gamma \le 0\}$$

which is convex if $\alpha \ge 0$ (a U-shaped parabola is below zero over an interval). This holds for any $v$ if $A \succeq 0$.

Is the converse true? No! For example, take $A = -1$, $b = 0$ and $c = -1$, which makes the set $C = \mathbb{R}$, which is of course convex.

You can already guess here what is really needed for the converse - it should hold for any $c$.

Now, show that the intersection of $C$ and the hyperplane defined by $g^T x + h = 0$ (where $g \neq 0$) is convex if there exists a $\lambda$ such that $A + \lambda g g^T \succeq 0$.

What does this mean? Positive semidefiniteness everywhere except the direction of $g$. Which condition is stronger?

As before, the set is convex if and only if its intersection with an arbitrary line is convex. The intersection is

$$\{x + tv : \alpha t^2 + \beta t + \gamma \le 0,\ \delta t + \epsilon = 0\} \tag{1}$$

where

$$\delta = g^T v \tag{2}$$
$$\epsilon = g^T x + h \tag{3}$$

If $\delta \neq 0$ then the intersection is at most a singleton, which is always convex. Otherwise $\delta = 0$, i.e., $g^T v = 0$. So the set is convex if

$$g^T v = 0 \ \Rightarrow\ v^T A v \ge 0 \tag{4}$$

Intuitively, we do not need positive semidefiniteness all over, but just on $v$'s that satisfy $g^T v = 0$, i.e., that are orthogonal to $g$. Hence the condition. Indeed, if $A + \lambda g g^T \succeq 0$ then

$$0 \le v^T \left(A + \lambda g g^T\right) v = v^T A v + \lambda \cdot 0 \tag{5}$$

for $v$'s that are orthogonal to $g$.

Back to gradients:

Log-sum-exp: $f(x) = \log(e^{x_1} + \cdots + e^{x_n})$ is convex (a smooth approximation of max).

$$g(x) = \log \sum_i e^{x_i}$$

The gradient of $g(x)$ is

$$\nabla g(x) = \frac{1}{\sum_i e^{x_i}} \begin{bmatrix} e^{x_1} \\ \vdots \\ e^{x_n} \end{bmatrix} = \frac{z}{\mathbf{1}^T z}$$

where $z_i = e^{x_i}$, and the Hessian is

$$\nabla^2 g(x) = \frac{\operatorname{diag}\{z\}}{\mathbf{1}^T z} - \frac{z z^T}{(\mathbf{1}^T z)^2}$$

To show convexity we check that the Hessian is positive semidefinite, i.e., for any $u$:

$$u^T \left[ \frac{\operatorname{diag}\{z\}}{\mathbf{1}^T z} - \frac{z z^T}{(\mathbf{1}^T z)^2} \right] u \ge 0$$

Choosing $a = \operatorname{diag}\{\sqrt{z}\}\, u$ and $b = \sqrt{z}$ (element-wise):

$$(\mathbf{1}^T z)(u^T \operatorname{diag}\{z\} u) - (u^T z)^2 = \left(\sum_i z_i\right)\left(\sum_j u_j^2 z_j\right) - \left(\sum_i u_i z_i\right)^2 = (b^T b)(a^T a) - (a^T b)^2 \ge 0$$

due to Cauchy-Schwarz.

Log-determinant: $f(X) = \log\det(X)$ is concave on $X \succ 0$. We will find the gradient matrix by completing

$$f(X + D) \approx f(X) + \operatorname{Tr}\{D \nabla f\}$$

Due to positive definiteness, we have $X = X^{\frac{1}{2}} X^{\frac{1}{2}}$ and

$$\begin{aligned}
\log|X + D| &= \log|X| + \log\left|I + X^{-\frac{1}{2}} D X^{-\frac{1}{2}}\right| \\
&= \log|X| + \sum_i \log(1 + \lambda_i) \\
&\approx \log|X| + \sum_i \lambda_i \\
&= \log|X| + \operatorname{Tr}\left\{X^{-\frac{1}{2}} D X^{-\frac{1}{2}}\right\} \\
&= \log|X| + \operatorname{Tr}\left\{D X^{-1}\right\}
\end{aligned}$$

so that $\nabla \log|X| = X^{-1}$, where $\lambda_i$ are the eigenvalues of $X^{-\frac{1}{2}} D X^{-\frac{1}{2}}$, and we have used $\log(1 + x) \approx x$ for small $x$.
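A small numerical check of $\nabla \log|X| = X^{-1}$ (my own illustration, assuming a random positive definite $X$ and a small symmetric perturbation $D$):

```python
# Verify the first order expansion log|X + D| - log|X| ≈ Tr{D X^{-1}}.
import numpy as np

rng = np.random.default_rng(2)
n = 4
B = rng.standard_normal((n, n))
X = B @ B.T + n * np.eye(n)          # a positive definite matrix
D = 1e-6 * rng.standard_normal((n, n))
D = (D + D.T) / 2                    # small symmetric perturbation

lhs = np.linalg.slogdet(X + D)[1] - np.linalg.slogdet(X)[1]
rhs = np.trace(D @ np.linalg.inv(X)) # Tr{D X^{-1}}
print(lhs, rhs)                      # agree to first order
```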
3. Convex functions

A. Lecture

1) Theory:

Definition 8. A function $f(x)$ over a convex set $C$ is called convex if

$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y), \quad \forall x, y \in C,\ \lambda \in [0, 1].$$

This is well defined due to the convexity of the set.

This inequality is sometimes called Jensen's inequality, and can be generalized to other convex combinations $\lambda_i \ge 0$ such that $\sum_i \lambda_i = 1$, or in general even a continuum:

$$f(\mathbb{E}[x]) \le \mathbb{E}[f(x)]$$

where the $\lambda_i$ represent a distribution.

Theorem 5. A function is convex if and only if it is convex when restricted to any line that
intersects its domain.

f (x) is convex in x ⇔ g(t) = f (ty + (1 − t)x) is convex in t for all x, y

Proof: If $f(x)$ is convex then $g(t)$ is a composition with an affine function and is therefore also convex. Note that

$$g(t) = f(x + t(y - x))$$

and we get

$$\begin{aligned}
g(\theta t + (1 - \theta)\bar{t}) &= f\left(x + (\theta t + (1 - \theta)\bar{t})(y - x)\right) \\
&= f\left(\theta(x + t(y - x)) + (1 - \theta)(x + \bar{t}(y - x))\right) && \text{[split $x$ between the two terms]} \\
&\le \theta f(x + t(y - x)) + (1 - \theta) f(x + \bar{t}(y - x)) && \text{[convexity]} \\
&= \theta g(t) + (1 - \theta) g(\bar{t})
\end{aligned}$$

Conversely, assume that $g(t)$ is convex in $t$ for all $x$ and $y$. Then

$$g(\theta) \le \theta g(1) + (1 - \theta) g(0)$$

where we chose $t = 1$ and $\bar{t} = 0$. In terms of $f$ this is exactly the definition of convexity:

$$f(\theta y + (1 - \theta) x) \le \theta f(y) + (1 - \theta) f(x)$$

Theorem 6. A continuously differentiable function $f(x)$ is convex on $\mathbb{R}^n$ if and only if

$$f(y) \ge f(x) + (y - x)^T \nabla f(x), \quad \forall x, y$$

Proof. Let us restrict it to a line:

$$g(t) = f(ty + (1 - t)x), \quad g'(t) = \nabla f(ty + (1 - t)x)^T (y - x)$$

Assume that $f(x)$ is convex in $x$; then $g(t)$ is convex in $t$ and

$$g(1) \ge g(0) + (1 - 0)\, g'(0)$$

which is exactly what we need to prove.

On the other hand, assume that the condition holds; then apply it to the points $ty + (1 - t)x$ and $\bar{t}y + (1 - \bar{t})x$ to obtain

$$f(ty + (1 - t)x) \ge f(\bar{t}y + (1 - \bar{t})x) + \nabla f(\bar{t}y + (1 - \bar{t})x)^T (y - x)(t - \bar{t})$$

which means that

$$g(t) \ge g(\bar{t}) + g'(\bar{t})(t - \bar{t})$$

and therefore $g(t)$ is convex, and so is $f(x)$.

This property is so important that it leads to a definition in the non-differentiable case:

Definition 9. A subgradient of $f$ at $y$, denoted by $\nabla f(y)$, is a vector such that

$$f(x) \ge f(y) + \nabla f(y)^T (x - y), \quad \forall x$$

The set of all subgradients at $y$ is called the subdifferential at $y$.

Theorem 7. A twice continuously differentiable function $f(x)$ is convex on $\mathbb{R}^n$ if and only if

$$\nabla^2 f(x) \succeq 0, \quad \forall x$$

Proof. Let us restrict it to a line $g(t) = f(a + tb)$ and find a condition so that $g(t)$ is convex in $t$ for all $a$ and $b$. But that is clearly $g''(t) = b^T \nabla^2 f(a + tb)\, b \ge 0$ for all $a, b, t$, which is equivalent to $\nabla^2 f(x) \succeq 0$ for all $x$.

Relation to convex sets:

If f (x) is convex then clearly S = {x : f (x) ≤ 0} is convex.

Proof:

$$x, y \in S \ \Rightarrow\ f(x) \le 0,\ f(y) \le 0$$

and therefore

$$f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y) \le 0$$

so that $\theta x + (1 - \theta) y \in S$.

Of course, we can replace 0 by any constant. This result is the standard way to represent convex sets in practice.

The converse is wrong. Indeed, $f(x) = -x^2 - 1$ is non-convex, but its set $\{f(x) \le 0\}$ is $\mathbb{R}$, which is convex.

To get a converse we need something stronger:

Definition 10. The graph of a function $f(x)$ is the set of points $[x\ f(x)]$. The epigraph is the set above the graph, i.e., $\{[x\ t] : f(x) \le t\}$.

Theorem 8. A function f (x) is convex in x iff its epigraph {[x t] : f (x) ≤ t} is convex in [x t].

Draw examples - hyperplane, parabola and a non-convex function.

Proof: One direction is trivial. If f (x) is convex in x then f (x) − t is convex in (x, t) and
therefore f (x) − t ≤ 0 is a convex set.
On the other hand, assume $\operatorname{epi} f = \{(x, t) : f(x) \le t\}$ is convex.

Consider $x_1$ and $x_2$ and let $t_1 = f(x_1)$ and $t_2 = f(x_2)$. Then

$$(x_1, t_1) \in \operatorname{epi} f,\ (x_2, t_2) \in \operatorname{epi} f \ \Rightarrow\ (\lambda x_1 + (1 - \lambda) x_2,\ \lambda t_1 + (1 - \lambda) t_2) \in \operatorname{epi} f$$

and

$$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda t_1 + (1 - \lambda) t_2 = \lambda f(x_1) + (1 - \lambda) f(x_2)$$

as required.

2) Basic functions: Examples on $\mathbb{R}^n$:

• Affine functions $f(x) = a^T x + b$ are convex (and concave):

$$\nabla f = a, \quad \nabla^2 f = 0$$

• Quadratic functions $f(x) = x^T A x + b^T x + c$ are convex if $A \succeq 0$:

$$\nabla f = 2Ax + b, \quad \nabla^2 f = 2A$$

• Norms - every norm is convex (proof - the triangle inequality):

$$\|\lambda x + (1 - \lambda) y\| \le \|\lambda x\| + \|(1 - \lambda) y\| = \lambda \|x\| + (1 - \lambda) \|y\|$$

A very important example, since $\min_{x \in S} \|y - Hx\|_2$ may be more efficient using SOCP than the natural $\min_{x \in S} \|y - Hx\|_2^2$ using QP.

• Max functions - $f(x) = \max\{x_1, \cdots, x_n\}$ is convex [even a maximum over an infinite or non-convex family is convex!!!! complexity, robust optimization]:

$$\max_i \{\lambda x_i + (1 - \lambda) y_i\} \le \max_i \{\lambda x_i\} + \max_i \{(1 - \lambda) y_i\}$$

since the RHS allows more options (different maximum indices with respect to $x$ and $y$). This is the analog of intersections in sets, which preserve convexity.

• Quadratic-over-linear $f(x, y) = \frac{x^2}{y}$ is convex on $\{(x, y) : y > 0\}$:

$$\nabla f = \begin{bmatrix} \frac{2x}{y} \\ -\frac{x^2}{y^2} \end{bmatrix}$$
$$\nabla^2 f = \frac{2}{y^3} \begin{bmatrix} y^2 & -xy \\ -xy & x^2 \end{bmatrix} = \frac{2}{y^3} \begin{bmatrix} y \\ -x \end{bmatrix} \begin{bmatrix} y \\ -x \end{bmatrix}^T \succeq 0$$

Another proof: in sets, we need to prove that the epigraph $\{(x, y, t) : y > 0,\ x^2/y \le t\}$ is convex, or, using Schur's lemma, that

$$\left\{ (x, y, t) : y > 0,\ \begin{bmatrix} y & x \\ x & t \end{bmatrix} \succeq 0 \right\}$$

is a convex set (which we know is true since this is an LMI).


• Log-sum-exp: $f(x) = \log(e^{x_1} + \cdots + e^{x_n})$ is convex (a smooth approximation of max). The gradient is

$$\nabla f(x) = \frac{1}{\sum_i e^{x_i}} \begin{bmatrix} e^{x_1} \\ \vdots \\ e^{x_n} \end{bmatrix} = \frac{z}{\mathbf{1}^T z}$$

where $z_i = e^{x_i}$, and the Hessian is

$$\nabla^2 f(x) = \frac{\operatorname{diag}\{z\}}{\mathbf{1}^T z} - \frac{z z^T}{(\mathbf{1}^T z)^2}$$

To show convexity we check that the Hessian is positive semidefinite, i.e., for any $u$:

$$u^T \left[ \frac{\operatorname{diag}\{z\}}{\mathbf{1}^T z} - \frac{z z^T}{(\mathbf{1}^T z)^2} \right] u \ge 0$$

Choosing $a = \operatorname{diag}\{\sqrt{z}\}\, u$ and $b = \sqrt{z}$ (element-wise):

$$(\mathbf{1}^T z)(u^T \operatorname{diag}\{z\} u) - (u^T z)^2 = \left(\sum_i z_i\right)\left(\sum_j u_j^2 z_j\right) - \left(\sum_i u_i z_i\right)^2 = (b^T b)(a^T a) - (a^T b)^2 \ge 0$$

due to Cauchy-Schwarz.

• Geometric mean: $f(x) = \left(\prod_i x_i\right)^{\frac{1}{n}}$ is concave on $\mathbb{R}^n_{++}$.

• Log-determinant: $f(X) = \log\det(X)$ is concave on $X \succ 0$. We will find the gradient matrix by completing

$$f(X + D) \approx f(X) + \operatorname{Tr}\{D \nabla f\}$$

Due to positive definiteness, we have $X = X^{\frac{1}{2}} X^{\frac{1}{2}}$ and

$$\begin{aligned}
\log|X + D| &= \log|X| + \log\left|I + X^{-\frac{1}{2}} D X^{-\frac{1}{2}}\right| \\
&= \log|X| + \sum_i \log(1 + \lambda_i) \approx \log|X| + \sum_i \lambda_i \\
&= \log|X| + \operatorname{Tr}\left\{X^{-\frac{1}{2}} D X^{-\frac{1}{2}}\right\} = \log|X| + \operatorname{Tr}\left\{D X^{-1}\right\}
\end{aligned}$$

so that $\nabla \log|X| = X^{-1}$, where $\lambda_i$ are the eigenvalues of $X^{-\frac{1}{2}} D X^{-\frac{1}{2}}$, and we have used $\log(1 + x) \approx x$ for small $x$.

Concavity proof: either by the Hessian (but that is difficult), or by restricting to a line

$$X = Z + tV$$

where $Z$ and $V$ are symmetric but not necessarily positive definite, and we only look at $t$'s for which $Z + tV \succ 0$. Without loss of generality, we can assume $Z \succ 0$, i.e., $t = 0$ is in the domain. Then

$$\begin{aligned}
g(t) &= \log|Z + tV| \\
&= \log\left|Z^{\frac{1}{2}}\left(I + t Z^{-\frac{1}{2}} V Z^{-\frac{1}{2}}\right) Z^{\frac{1}{2}}\right| \\
&= \log\left|I + t Z^{-\frac{1}{2}} V Z^{-\frac{1}{2}}\right| + \log|Z| \\
&= \sum_i \log(1 + t\lambda_i) + \log|Z|
\end{aligned}$$

where $\lambda_i$ are the eigenvalues of $Z^{-\frac{1}{2}} V Z^{-\frac{1}{2}}$. Now

$$g'(t) = \sum_i \frac{\lambda_i}{1 + t\lambda_i}, \qquad g''(t) = -\sum_i \frac{\lambda_i^2}{(1 + t\lambda_i)^2} \le 0$$

3) Operations that preserve convexity:

• Non-negative weighted sums: $f = \sum_i w_i f_i$ with $w_i \ge 0$ (also infinite sums). Most important and highly non-trivial. There exist other unimodal functions which do not satisfy this, e.g., quasi-convex functions.

• Composition with an affine mapping: if $f(x)$ is convex then so is $g(x) = f(Ax + b)$:

$$\begin{aligned}
g(\theta x + (1 - \theta) y) &= f(A(\theta x + (1 - \theta) y) + b) \\
&= f(\theta(Ax + b) + (1 - \theta)(Ay + b)) && \text{[split $b$ between the two terms]} \\
&\le \theta f(Ax + b) + (1 - \theta) f(Ay + b) && \text{[$f$ is convex]} \\
&= \theta g(x) + (1 - \theta) g(y)
\end{aligned}$$

• Pointwise maximum and supremum: if $f_1$ and $f_2$ are convex then so is $f(x) = \max\{f_1(x), f_2(x)\}$ (also a maximum over infinite and non-convex sets). We already proved this. This is the analog of intersection in sets.
  – Distance to the farthest point of a set $C$ is convex: $f(x) = \max_{y \in C} \|x - y\|$
  – Piecewise linear: $f(x) = \max\{a_1^T x + b_1, \cdots, a_k^T x + b_k\}$
  – Sum of the $r$ largest components. For example, the sum of the two largest numbers out of three: $f(x_1, x_2, x_3) = \max\{x_1 + x_2,\ x_1 + x_3,\ x_2 + x_3\}$
  – Maximal eigenvalue of a symmetric matrix: $\lambda_{\max}(A) = \max_{x : \|x\| = 1} x^T A x$
  – Matrix norm: $\|A\|_{\alpha,\beta} = \max_{x : \|x\|_\beta = 1} \|Ax\|_\alpha$

• Scalar composition: $f(x) = h(g(x))$ with

$$f'(x) = h'(g(x))\, g'(x), \qquad f''(x) = h''(g(x))\, g'(x)^2 + h'(g(x))\, g''(x)$$

therefore:
  – $f$ is convex if $h$ is convex and non-decreasing ($h'' \ge 0$ and $h' \ge 0$), and $g$ is convex.
  – $f$ is convex if $h$ is convex and non-increasing ($h'' \ge 0$ and $h' \le 0$), and $g$ is concave.
  – $f$ is concave if $h$ is concave and non-decreasing, and $g$ is concave.
  – $f$ is concave if $h$ is concave and non-increasing, and $g$ is convex.

For example:
  – If $g$ is convex then $e^{g(x)}$ is convex.
  – If $g$ is concave then $\log g(x)$ is concave.
  – If $g$ is concave and positive then $\frac{1}{g(x)}$ is convex.
• Convex partial minimization - if $g(x, y)$ is jointly convex in $(x, y)$ then $f(x) = \min_y g(x, y)$ is convex in $x$ (this also holds for convex constrained minimization). For simplicity, we prove the min version and not the inf. Let

$$f(x) = g(x, y), \qquad f(x') = g(x', y')$$

i.e., the minimizer at $x$ is $y$ and at $x'$ is $y'$. Then

$$\begin{aligned}
f(\lambda x + (1 - \lambda)x') &= \min_{y''}\ g(\lambda x + (1 - \lambda)x',\ y'') \\
&\le g(\lambda x + (1 - \lambda)x',\ \lambda y + (1 - \lambda)y') && [y'' = \lambda y + (1 - \lambda)y' \text{ is suboptimal}] \\
&\le \lambda g(x, y) + (1 - \lambda) g(x', y') && [\text{convexity}] \\
&= \lambda f(x) + (1 - \lambda) f(x') && [\text{definitions}]
\end{aligned}$$

  – Schur complement: Assume that

$$\begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \succeq 0$$

and that $C$ is invertible ($C \succ 0$). Define the convex quadratic form

$$g(x, y) = x^T A x + 2 x^T B y + y^T C y$$

The partial minimum with respect to $y$ is at

$$2 B^T x + 2 C y = 0 \ \Rightarrow\ y = -C^{-1} B^T x$$

Plugging it back into the objective yields

$$f(x) = \min_y\ g(x, y) = x^T \left(A - B C^{-1} B^T\right) x$$

which is convex too, and therefore

$$A - B C^{-1} B^T \succeq 0.$$
More on Schur's lemma:

Lemma 1 (Schur). Let $C \succ 0$. Then $X = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \succeq 0$ if and only if $S = A - B C^{-1} B^T \succeq 0$.

Proof: A matrix $X$ is positive semidefinite if and only if $T X T^T$ is positive semidefinite for an invertible matrix $T$. If $C$ is invertible then we have the following block Cholesky decomposition:

$$\begin{bmatrix} A & B \\ B^T & C \end{bmatrix} = \begin{bmatrix} I & B C^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} S & 0 \\ 0 & C \end{bmatrix} \begin{bmatrix} I & B C^{-1} \\ 0 & I \end{bmatrix}^T$$

The matrix $\begin{bmatrix} I & B C^{-1} \\ 0 & I \end{bmatrix}$ is invertible, and a block diagonal matrix is positive semidefinite if and only if its blocks are positive semidefinite. Thus, when $C \succ 0$, $X \succeq 0$ if and only if $S \succeq 0$.

This is a generalization of Cauchy-Schwarz, since

$$\begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \succeq 0 \ \Leftrightarrow\ \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} = \begin{bmatrix} U \\ V \end{bmatrix} \begin{bmatrix} U \\ V \end{bmatrix}^T$$

which means

$$U U^T - U V^T \left(V V^T\right)^{-1} V U^T \succeq 0$$

which is exactly Cauchy-Schwarz if $U$ and $V$ are row vectors.


• Perspective of a function: If $f(x)$ is convex, then so is its perspective function $g(x, t) = t f\left(\frac{x}{t}\right)$ for $t > 0$.

Let us prove this via the epigraph. If $f$ is convex then

$$\operatorname{epi} f = \{(x, z) : f(x) \le z\}$$

is convex. Now consider

$$\begin{aligned}
\operatorname{epi} g &= \left\{ (x, t, z) : t > 0,\ t f\left(\tfrac{x}{t}\right) \le z \right\} \\
&= \left\{ (x, t, z) : t > 0,\ f\left(\tfrac{x}{t}\right) \le \tfrac{z}{t} \right\} \\
&= \left\{ (x, t, z) : t > 0,\ \left(\tfrac{x}{t}, \tfrac{z}{t}\right) \in \operatorname{epi} f \right\}
\end{aligned}$$

Thus $\operatorname{epi} g$ is the inverse image of $\operatorname{epi} f$ under the perspective function.
Example: The perspective of $f(x) = x^T x$ is $g(x, t) = \frac{x^T x}{t}$, which is jointly convex in $x$ and $t > 0$.

Example: Matrix fractional function. The function

$$f(x, Y) = x^T Y^{-1} x$$

is convex on $\mathbb{R}^n \times S^n_{++}$. Proof via the epigraph:

$$\operatorname{epi} f = \left\{ (x, Y, t) : Y \succ 0,\ x^T Y^{-1} x \le t \right\} = \left\{ (x, Y, t) : Y \succ 0,\ \begin{bmatrix} Y & x \\ x^T & t \end{bmatrix} \succeq 0 \right\}$$

Example: A weighted norm with weight $A \succ 0$ is defined as $\|x\|_A = \sqrt{x^T A x}$. Show that it is concave in $A \succ 0$. Proof: restrict it to a line $A = B + tC$; then we get $\sqrt{x^T B x + t\, x^T C x}$. The square root is concave in a positive variable, and the rest is just an affine transformation. An application is metric learning:

$$\min_{A \succ 0}\ \sum_{i,j \in \mathcal{N}} \|x_i - x_j\|_A^2 \quad \text{s.t. } \sum_{i,j \notin \mathcal{N}} \|x_i - x_j\|_A > 1 \tag{6}$$

B. Tirgul

Finishing the material from the lecture.


4. Convex optimization problems

A. Lecture

Standard form:

$$\min_{x \in S}\ f_0(x) \quad \text{s.t. } f_i(x) \le 0,\ h_i(x) = 0$$

The problem is convex if $S$ is a convex set, the $f_i$ are convex functions, and the $h_i(x)$ are affine functions.

• To solve maximizations we minimize −f0 (x).


• Change of variables: when there exists a one-to-one transformation such that

$$x = \phi(z), \quad z = \phi^{-1}(x)$$

the problem is equivalent to

$$\min_{\phi(z) \in S}\ f_0(\phi(z)) \quad \text{s.t. } f_i(\phi(z)) \le 0,\ h_i(\phi(z)) = 0$$

Indeed, when we find the optimal $z^*$ then the optimal $x^*$ is $x^* = \phi(z^*)$.


Example: Amplitude and phase recovery from a noisy sine function (known frequency $f$):

$$\min_{A, \phi}\ \sum_i |y_i - A\cos(2\pi f x_i + \phi)|^2$$

A non-convex least squares problem! Using a simple change of variables from polar coordinates to cartesian,

$$a_1 = A\cos(\phi), \quad a_2 = A\sin(\phi)$$

we obtain

$$A\cos(2\pi f x_i + \phi) = a_1 \cos(2\pi f x_i) - a_2 \sin(2\pi f x_i)$$

and a convex LS:

$$\min_{a_1, a_2}\ \sum_i |y_i - a_1 \cos(2\pi f x_i) + a_2 \sin(2\pi f x_i)|^2$$
To return to polar:

$$A = \sqrt{a_1^2 + a_2^2}, \quad \phi = \tan^{-1}\frac{a_2}{a_1}$$
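A numerical sketch of this trick (synthetic data; all names are illustrative):

```python
# Fit y_i ≈ A cos(2π f x_i + φ) by a linear LS in (a1, a2) = (A cos φ, A sin φ).
import numpy as np

rng = np.random.default_rng(3)
f, A_true, phi_true = 2.0, 1.5, 0.7
x = np.linspace(0, 1, 100)
y = A_true * np.cos(2 * np.pi * f * x + phi_true) + 0.05 * rng.standard_normal(100)

# Design matrix: A cos(2πfx + φ) = a1 cos(2πfx) - a2 sin(2πfx)
M = np.column_stack([np.cos(2 * np.pi * f * x), -np.sin(2 * np.pi * f * x)])
a1, a2 = np.linalg.lstsq(M, y, rcond=None)[0]

A_hat = np.hypot(a1, a2)             # back to polar coordinates
phi_hat = np.arctan2(a2, a1)
print(A_hat, phi_hat)                # ≈ 1.5, 0.7
```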
• Monotonic transformation. For example, quadratic minimization vs. norm minimization:

$$\min_x\ (y - Hx)^T (y - Hx) = \min_x\ \|y - Hx\|_2^2 \quad \Leftrightarrow \quad \min_x\ \|y - Hx\|_2$$

• Eliminating linear equality constraints:

$$\min_x\ f(x) \quad \text{s.t. } Ax = b$$

From linear algebra we know that there are three options for $Ax = b$:
  – No solution, in which case the problem is infeasible.
  – A unique solution, in which case it is also the optimal solution.
  – Infinitely many solutions (usually when $A$ is fat), in which case

$$x = x_1 + x_2, \quad \text{where } Ax_1 = b,\ x_2 = Vz,\ AV = 0$$

Later, we will say that the columns of $V$ span the null space of $A$.

An intuitive example:

$$\min_{x \in \mathbb{R}^n}\ f(x_1, \cdots, x_n) \quad \text{s.t. } \sum_{i=1}^n x_i = 0$$

Any $x \in \mathbb{R}^n$ which satisfies $\mathbf{1}^T x = 0$ can be written as

$$x = \left[z_1, z_2, \cdots, z_{n-1}, -\sum_{i=1}^{n-1} z_i\right]^T$$

for some $z \in \mathbb{R}^{n-1}$, so that the problem is equivalent to an unconstrained problem of lower dimension:

$$\min_{z \in \mathbb{R}^{n-1}}\ f\left(z_1, z_2, \cdots, z_{n-1}, -\sum_{i=1}^{n-1} z_i\right)$$

Note that the smaller problem remains convex, as affine transformations preserve convexity.
• Introducing linear equality constraints - sometimes useful for clarity and duality:

$$\min_x\ f(Ax + b) \quad \Leftrightarrow \quad \min_{x,z}\ f(z) \quad \text{s.t. } Ax + b = z$$

• Implicit/explicit constraints:

$$\min_{x \ge 0}\ f(x) \quad \Leftrightarrow \quad \min_x\ f(x) \quad \text{s.t. } x \ge 0$$

Another way of writing the left problem is $\min_x \tilde{f}(x)$, where

$$\tilde{f}(x) = \begin{cases} f(x) & x \ge 0 \\ \infty & x < 0 \end{cases}$$
1) Optimality conditions:

Theorem 9. Consider

$$\min_{x \in S}\ f(x)$$

where $f(x)$ is differentiable and convex and $S$ is a convex set. Then $x$ is a global solution if and only if $x \in S$ and

$$\nabla f^T(x)(y - x) \ge 0 \quad \forall y \in S$$

Proof. Suppose $x, y \in S$ and the condition holds. Due to convexity,

$$f(y) \ge f(x) + \nabla f^T(x)(y - x) \ge f(x)$$

which means $x$ is globally optimal.

Conversely, suppose $x$ is optimal but the condition does not hold, i.e., there exists a $y \in S$ such that

$$\nabla f^T(x)(y - x) < 0$$

Consider the point

$$z(t) = ty + (1 - t)x, \quad t \in [0, 1]$$

This point is feasible since $x$ and $y$ are. For small $t$ we will have $f(z(t)) < f(x)$, which is a contradiction to the optimality of $x$. To show this we note that

$$\left.\frac{df(z(t))}{dt}\right|_{t=0} = \nabla f^T(x)(y - x) < 0$$

which means that the function decreases close to $x$.

Two famous special cases:

Unconstrained convex optimization: a necessary and sufficient condition for optimality is

$$\nabla f(x) = 0$$

since the condition must hold for $y = x + z$ and $y = x - z$ for any $z$. That is,

$$\nabla f^T(x) z \ge 0 \ \forall z \quad \text{and} \quad \nabla f^T(x) z \le 0 \ \forall z$$

and this only holds if the gradient is identically zero.

Convex scalar minimization on an interval $I = \{a \le x \le b\}$: a necessary and sufficient condition for optimality is $x \in I$ and

$$f'(x)(y - x) \ge 0 \quad \forall y \in I$$

Now if $x$ is strictly inside the interval, then as before the condition reduces to $f'(x) = 0$. If it is on the boundary, e.g., $x = b$, then for all $y \in I$ we have $y - x \le 0$ and the condition reduces to

$$f'(x) \le 0$$

which means that the end point is a global minimizer.

2) Projection onto a convex set: Probably the most common optimization over a convex set, e.g., in Projected Gradient Descent. Define the projection of $z$ onto a convex set $S$:

$$P_S(z) = \arg\min_{x \in S} \|x - z\|$$

Theorem 10. Let $S$ be a non-empty, closed and convex subset of $\mathbb{R}^n$. Then:

1) For every $z$, there exists a unique projection onto $S$.
2) $x^* \in S$ is equal to the projection $P_S(z)$ if and only if $(z - x^*)^T (x - x^*) \le 0$ for all $x \in S$.
3) The projection is continuous and non-expansive: $\|P_S(x) - P_S(y)\| \le \|x - y\|$ for all $x$ and $y$.
Non-convex counterexample. Wide angle 2D convex example.

We will only give a sketch of a proof, skipping existence and uniqueness. The condition is just the optimality condition from before.

Non-expansiveness: Consider some $x$ and $y$. Then $P_S(x)$ and $P_S(y)$ are in $S$, and the optimality conditions yield

$$(P_S(y) - P_S(x))^T (x - P_S(x)) \le 0$$
$$(P_S(x) - P_S(y))^T (y - P_S(y)) \le 0$$

Adding these up, we obtain

$$(P_S(y) - P_S(x))^T (x - P_S(x) - y + P_S(y)) \le 0$$

Rearranging and using Cauchy-Schwarz:

$$\|P_S(y) - P_S(x)\|^2 \le (P_S(y) - P_S(x))^T (y - x) \le \|P_S(y) - P_S(x)\|\, \|y - x\|$$

Example 2. The projection onto an interval $I = \{a \le x \le b\}$ is

$$P_I(x) = \begin{cases} a & x \le a \\ b & x \ge b \\ x & a \le x \le b \end{cases}$$

In particular, $P_{x \ge 0}(x) = [x]_+ = \max\{0, x\}$.
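In NumPy this projection is just element-wise clipping (a trivial illustration):

```python
# Projection onto the interval [a, b] is element-wise clipping.
import numpy as np

a, b = -1.0, 1.0
x = np.array([-3.0, 0.5, 2.0])
print(np.clip(x, a, b))              # [-1.   0.5  1. ]
print(np.maximum(x, 0.0))            # projection onto x >= 0: [x]_+
```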

B. Tirgul

We learned the optimality condition for $x^*$:

$$x^* \in S, \quad \nabla f(x^*)^T (x - x^*) \ge 0 \quad \forall x \in S$$

This requires no regularity conditions, but is hard to check and interpret.

Example 3 (non-negative orthant (Boyd p. 142, Bertsekas p. 195)).

$$\min_{x \ge 0}\ f(x)$$

Then $x^* \ge 0$ is a local minimum of a convex function $f(x)$ if and only if

$$\sum_i \frac{\partial f(x^*)}{\partial x_i}(x_i - x_i^*) \ge 0, \quad \forall x_i \ge 0$$

This condition is equivalent to

$$\frac{\partial f(x^*)}{\partial x_i} \ge 0, \qquad \frac{\partial f(x^*)}{\partial x_i} = 0 \ \text{ if } x_i^* > 0$$

Indeed, choose $x = [x_1^* \cdots x_j^* + 1 \cdots x_p^*]$ to get the first condition, and for some $x_j^* > 0$ choose $x = [x_1^* \cdots x_j^*/2 \cdots x_p^*]$ to get the second condition. The other direction is trivial. The second condition can be written as

$$\frac{\partial f(x^*)}{\partial x_i}\, x_i^* = 0 \quad \forall i$$

and is known as complementary slackness (either a constraint is inactive, in which case $f' = 0$ as in the unconstrained case, or it is active and we are on the boundary, which means that the derivative is not necessarily zero).

Example 4 (Projection onto the positives). Projection onto the non-negative cone (scalar version):

$$\min_{x \ge 0}\ (x - y)^2$$

The solution is clearly

$$x^* = \max\{y, 0\} = [y]_+$$

Proof by the optimality condition:

$$(y - x^*)(z - x^*) \le 0 \quad \forall z \ge 0$$

To show this, consider two cases. If $y > 0$ then $y - x^* = 0$ and we are done. Otherwise, we have $(y - 0)(z - 0)$, which is non-positive $\cdot$ non-negative $=$ non-positive.

Example 5 (Projection onto the span of a vector $a$). Consider the span of a fixed vector:

$$S = \{x : \exists c,\ x = ca\} \tag{7}$$

Let us compute the projection of an arbitrary vector onto it:

$$\min_{x \in S}\ \|x - y\|^2$$

Rewrite in terms of $c$:

$$\min_c\ \|ca - y\|^2$$

and solve:

$$\frac{d}{dc}\left(c^2 \|a\|^2 - 2c\, a^T y + \|y\|^2\right) = 0 \ \Rightarrow\ c = \frac{a^T y}{\|a\|^2} \ \Rightarrow\ x = \frac{a^T y}{\|a\|^2}\, a$$

Note that this is a linear operation.

Example 6 (Projection of a matrix onto the set of symmetric matrices (with the Frobenius norm)). We seek

$$X = \arg\min_{X : X = X^T} \|X - Y\|_{\text{fro}}^2$$

In vector notation, this is equivalent to

$$x = \arg\min_{\operatorname{mat}(x) = \operatorname{mat}(x)^T} \|x - y\|_2^2$$

where $x$ and $y$ are vector versions of $X$ and $Y$, respectively, and $\operatorname{mat}$ takes an $n^2$ vector and transforms it into an $n \times n$ matrix.

We need to start with a guess and then show that it is optimal by satisfying the optimality condition:

$$X = \frac{Y + Y^T}{2}$$

The optimality conditions state that $x$ is optimal iff it is feasible, i.e., $\operatorname{mat}(x) = \operatorname{mat}(x)^T$, and it satisfies

$$(y - x)^T (z - x) \le 0 \quad \forall\ \operatorname{mat}(z) \text{ symmetric}$$

In matrix notation, this means $X$ is optimal iff $X = X^T$ and

$$\operatorname{Tr}\left\{(Y - X)^T (Z - X)\right\} \le 0 \quad \forall\ Z \text{ symmetric}$$

Let us plug in our guess:

$$\operatorname{Tr}\left\{\left(Y - \frac{Y + Y^T}{2}\right)^T \left(Z - \frac{Y + Y^T}{2}\right)\right\} = \operatorname{Tr}\left\{\left(\frac{Y - Y^T}{2}\right)^T \left(Z - \frac{Y + Y^T}{2}\right)\right\} \le 0 \quad \forall\ Z \text{ symmetric}$$

and we get skew-symmetric times symmetric $=$ zero trace.

Note that this is a linear operation.

Example 7 (Projection of a symmetric matrix onto the semidefinite cone (with the Frobenius norm)). Due to symmetry, we have the eigendecomposition

$$Y = U D U^T$$

The projection onto the semidefinite cone is defined as

$$X = \arg\min_{X \succeq 0} \|X - Y\|_{\text{fro}}^2$$

The optimality conditions state that $X$ is optimal iff $X \succeq 0$ and

$$\operatorname{Tr}\left\{(Y - X)^T (Z - X)\right\} \le 0 \quad \forall Z \succeq 0$$

Let us guess a solution in which we just truncate the negative eigenvalues:

$$X = U [D]_+ U^T$$

It is easy to see that it satisfies the optimality condition. Clearly $X \succeq 0$, and

$$\operatorname{Tr}\left\{(Y - X)^T (Z - X)\right\} = \operatorname{Tr}\{(Y - X) Z\} - \operatorname{Tr}\left\{(Y - X)^T X\right\} = \text{non-positive} + 0 \le 0$$

Above we used the following properties (here $Y - X \preceq 0$):

$$x^T y = \operatorname{Tr}\left\{X^T Y\right\}, \qquad A \succeq 0,\ B \succeq 0 \ \Rightarrow\ \operatorname{Tr}\{AB\} \ge 0$$

The problem is the guessing! Next lesson we will do it without the guessing.
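A numerical sketch of this eigenvalue-truncation projection (my own illustration):

```python
# Project a symmetric Y onto the PSD cone by truncating negative eigenvalues.
import numpy as np

rng = np.random.default_rng(4)
Y = rng.standard_normal((5, 5))
Y = (Y + Y.T) / 2                    # symmetrize first (as in Example 6)

d, U = np.linalg.eigh(Y)             # Y = U diag(d) U^T
X = U @ np.diag(np.maximum(d, 0)) @ U.T   # X = U [D]_+ U^T

print(np.linalg.eigvalsh(X).min() >= -1e-12)   # X is PSD
```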
1) Linear programming: This used to be a full lecture, but due to the war we will just briefly touch on it.

Standard form:

$$\min_x\ c^T x \quad \text{s.t. } Gx \preceq h,\ Ax = b$$

Why work with a standard form? Stand on the shoulders of giants: off-the-shelf software. Of course, better performance may be obtained via dedicated solutions.

Diet problem

• A healthy diet contains $m$ different nutrients in quantities at least equal to $b$.
• We can compose such a diet by choosing nonnegative quantities $x$ of different foods.
• One unit quantity of food $j$ contains an amount $A_{ij}$ of nutrient $i$, and has a cost of $c_j$.
• We want to determine the cheapest diet that satisfies the nutritional requirements.

$$\min_x\ c^T x \quad \text{s.t. } Ax \ge b,\ x \ge 0$$

Power control

Consider a wireless communication system with $n$ transmitting and receiving units. The $i$th transmitter tries to reach the $i$th receiver, who also measures the interference from the other transmitters and noise. Given the channels between the units and signal to interference plus noise constraints, our goal is to minimize the total transmitted power:

$$\min_{p}\ \sum_i p_i \quad \text{s.t. } \log\left(1 + \frac{h_{ii} p_i}{\sum_{j \neq i} p_j h_{ij} + \sigma^2}\right) \ge \gamma_i, \quad p_i \ge 0$$

$$\min_{p}\ \sum_i p_i \quad \text{s.t. } \frac{h_{ii} p_i}{\sum_{j \neq i} p_j h_{ij} + \sigma^2} \ge e^{\gamma_i} - 1, \quad p_i \ge 0$$

$$\min_{p}\ \sum_i p_i \quad \text{s.t. } h_{ii} p_i \ge \sum_{j \neq i} p_j \left(e^{\gamma_i} - 1\right) h_{ij} + \left(e^{\gamma_i} - 1\right)\sigma^2, \quad p_i \ge 0$$

Piecewise linear minimization: a reasonable approximation for minimizing any convex function.

$$\min_x\ \max_i\ a_i^T x + b_i$$

can be reformulated as

$$\min_{x,t}\ t \quad \text{s.t. } \max_i\ a_i^T x + b_i \le t$$

$$\min_{x,t}\ t \quad \text{s.t. } a_i^T x + b_i \le t \ \forall i$$

$$\min_{x,t}\ t \quad \text{s.t. } Ax + b \preceq t\mathbf{1}$$

$L_1$ fitting:

$$\min_x\ \|Ax + b\|_1$$

$$\min_{x,t}\ \mathbf{1}^T t \quad \text{s.t. } t_i = |a_i^T x + b_i| \ \forall i$$

WLOG we can relax the equality to an inequality (it is always better to choose a lower point):

$$\min_{x,t}\ \mathbf{1}^T t \quad \text{s.t. } t_i \ge |a_i^T x + b_i| \ \forall i$$

Using

$$\{t, x : t \ge |f(x)|\} = \{t, x : -t \le f(x) \le t\}$$

(note that both automatically dictate $t \ge 0$):

$$\min_{x,t}\ \mathbf{1}^T t \quad \text{s.t. } -t \preceq Ax + b \preceq t$$
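The final LP is straightforward to pass to an off-the-shelf solver. A sketch with `scipy.optimize.linprog`, using synthetic $A, b$ (illustrative, not course data):

```python
# min ||Ax + b||_1 as an LP over (x, t): min 1^T t  s.t.  -t <= Ax + b <= t.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
m, n = 30, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

c = np.concatenate([np.zeros(n), np.ones(m)])   # objective: sum of t
A_ub = np.block([[A, -np.eye(m)],               #   Ax + b <= t
                 [-A, -np.eye(m)]])             # -(Ax + b) <= t
b_ub = np.concatenate([-b, b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * n + [(0, None)] * m)  # x free, t >= 0
x = res.x[:n]
print(np.abs(A @ x + b).sum(), res.fun)         # equal at the optimum
```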
5. Linear algebra

A. Lecture

1) Square and symmetric - EVD: Think of symmetric square matrices as operators.

A symmetric square $n \times n$ matrix partitions $\mathbb{R}^n$ into two subspaces:

$$\mathbb{R}^n = C(A) \oplus N(A)$$

Any vector $x \in \mathbb{R}^n$ can be written as

$$x = x_1 + x_2, \quad x_1 \in C(A),\ x_2 \in N(A),\ x_1^T x_2 = 0$$

Most square matrices can be diagonalized as

$$A = U D U^{-1}$$

When they are also symmetric, they can always be diagonalized as

$$A = U D U^T$$

where $U$ is orthonormal. This helps in identifying the subspaces:

$$A = \begin{bmatrix} U_1 & U_2 \end{bmatrix} \begin{bmatrix} D_1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} U_1 & U_2 \end{bmatrix}^T$$

and $U_1$ is a basis for $C(A)$, whereas $U_2$ is a basis for $N(A)$.


The column space and row space have equal dimension r D rank
Part 1
The nullspace N .A/ has dimension n ! r; N .AT / has dimension m ! r
That counting of basis vectors is obvious for the row reduced rref.A/. This matrix has r 53
nonzero rows and r pivot columns. The proof of Part 1 is in the reversibility of every
elimination step—to confirm that linear independence and dimension are not changed.

C .AT /
C .A/
Row space Column space
all AT y all Ax
dim r
Perpendicular Perpendicular dim r
Rn x T .AT y/ D 0 y T .Ax/ D 0 R m
dim n ! r
x in y in dim m ! r
Nullspace Nullspace of A T
Ax D 0
AT y D 0
N .A/ N .AT /

Figure 1: Dimensions and orthogonality for any m by n matrix A of rank r.

Part 2 says that the row space and nullspace are orthogonal complements. The orthog-
onality
Fig. comesStrang
1. Gilbert directly from
- The the Fundamental
Four equation Ax Subspaces.
D 0. Each Highly
x in therecommend
nullspace is orthogonal
watching to video on youtube.
Strang’s
each row:
! x is orthogonal to row 1
2 32 3 2 3
.row 1/ 0
Ax D 0 4 """ 5 4 x 5 D 4 "" 5
2) Rectangular -.row
SVD:m/ What happens
0 in !
non-square, non-symmetric
x is orthogonal to row m m × n matrices?
Thefour
The dimensions of C .AT /subspaces
fundamental and N .A/ add
of tolinear
n. Every vector in Rn is accounted for, by
algebra.
separating x into xrow C xnull .
Given Foran
them
90×ı
angle on the right
n matrix A ofside of Figure
rank r, we1, have
changethe
A tofollowing
AT . Every vector b D Ax
subspaces
in the column space is orthogonal to every solution of A y D 0. More briefly, .Ax/T y D
T

x T .AT y/ D 0. m
• Column space C(A) ∈ R - all the combinations of columns of A, dimension r.
C .AT / D N .A/? Orthogonal complements in Rn
Part 2
N .AT / D C .A/? Orthogonal complements in Rm n
C(A) = {Ax : x ∈ R }
2
• Null space N (A) ∈ Rn - all the vectors such that Ax = 0, dimension n − r.

N (A) = {x ∈ Rn : Ax = 0}

• Row space C(AT ) ∈ Rn - all the combinations of rows of A, dimension r.


• Left null space N (AT ) ∈ Rm - all the vectors such that xT A = 0, dimension m − r.

Together:
⊥
Rm : C(A) = N (AT )


Rn : C(AT ) = [N (A)]⊥

The main tool for working with non-square, non-symmetric matrices is the singular value decomposition (SVD):

$$A = U D V^T \tag{$*$}$$

where $U$ and $V$ are square unitary matrices and $D = \operatorname{diag}\{d_i\}$ is a rectangular diagonal matrix with non-negative singular values.

Four subspaces are involved, in possibly different dimensions:

$$A = \underbrace{\begin{bmatrix} U_1 & U_2 \end{bmatrix}}_{m \times r,\ \ m \times (m-r)} \begin{bmatrix} \Lambda_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{bmatrix} \underbrace{\begin{bmatrix} V_1 & V_2 \end{bmatrix}^T}_{n \times r,\ \ n \times (n-r)}$$

Note that this is a very general decomposition which holds for any matrix.

A non-efficient but intuitive derivation is via the EVD (this is not a proof). For simplicity, let us assume square matrices. Compute two EVDs:

$$A A^T = U D V^T V D^T U^T = U \operatorname{diag}\left\{d_i^2\right\} U^T$$
$$A^T A = V D^T U^T U D V^T = V \operatorname{diag}\left\{d_i^2\right\} V^T$$

and we have that

$$A = U D V^T$$
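A sketch of reading off the four subspaces from the SVD in NumPy (illustrative):

```python
# The four fundamental subspaces from the full SVD of a rank-deficient matrix.
import numpy as np

rng = np.random.default_rng(6)
m, n, r = 6, 4, 2
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank r

U, d, Vt = np.linalg.svd(A)          # full SVD: U is m x m, Vt is n x n
U1, U2 = U[:, :r], U[:, r:]          # column space / left null space bases
V1, V2 = Vt[:r].T, Vt[r:].T          # row space / null space bases

print(np.allclose(A @ V2, 0))        # V2 spans N(A)
print(np.allclose(U2.T @ A, 0))      # U2 spans N(A^T)
```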
3) Least squares: Probably the most studied (convex) optimization problem (Gauss, 1794). Endless applications: curve fitting, statistical estimation, minimum distance, etc.

$$\min_x\ \|Ax - b\|_2^2$$

This is of course an unconstrained convex optimization problem (with or without the square, as it is a norm or a positive definite quadratic form).

The LS optimality conditions (the normal equations) are:

$$\nabla f(x) = A^T (Ax - b) = 0$$

First, we assume that all the rows/columns are linearly independent.

• Case 1: A is a square invertible matrix. Ax = b has a unique solution which satisfies the
optimality conditions

x = A−1 b

• Case 2: Less equations than unknowns. Ax = b has infinite solutions all of which are
optimal as they satisfy the optimality conditions. The Hessian has zero eigenvalues as AT A
is a large matrix with small rank (rank deficient), and the problem is not strictly convex.

One possible solution: AAT is a small matrix, usually of full rank and invertible1 , and it is
easy to verify that one solution is
−1
x0 = AT AAT b

= A† b

where A† is known as the Moore-Penrose generalized inverse (aka pseudo-inverse) of A.


This is one solution and the rest can be characterized as

x = x0 + z,  z ∈ N(A)

An alternative approach to the same problem is to use other generalized inverses. Indeed, the general solution to the problem is

x = B^T (AB^T)^{-1} b


where B is some matrix so that AB T is invertible. What is the relation between B and z?
• Case 3: More equations than unknowns. Ax = b has no solution, but we can still search for the least squares approximation. Instead of Ax = b, which has more equations than unknowns and is therefore problematic, we solve the normal equations A^T A x = A^T b. When A^T A is invertible, the classical LS solution is

x_LS = (A^T A)^{-1} A^T b = A† b

To conclude, in all cases the LS solution is x = A† b.

It coincides with the inverse if A is square and invertible, but otherwise has different meanings.
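A small numerical illustration of the three cases (a sketch using numpy's pinv; the data is random):

    import numpy as np

    rng = np.random.default_rng(1)

    # Case 1: square invertible - the pseudo-inverse equals the inverse
    A = rng.standard_normal((3, 3)); b = rng.standard_normal(3)
    print(np.allclose(np.linalg.pinv(A) @ b, np.linalg.solve(A, b)))

    # Case 2: fewer equations than unknowns - one (minimum norm) solution of Ax = b
    A = rng.standard_normal((2, 4)); b = rng.standard_normal(2)
    x0 = np.linalg.pinv(A) @ b
    print(np.allclose(A @ x0, b))

    # Case 3: more equations than unknowns - the least squares solution
    A = rng.standard_normal((5, 3)); b = rng.standard_normal(5)
    x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(np.allclose(np.linalg.pinv(A) @ b, x_ls))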

What happens when the rows/columns are not linearly independent?

The solution is still A† b; the pseudo-inverse always exists and is unique. To gain intuition, let's begin with the scalar case: basically, we have a zero and cannot divide by zero.

min_x |ax − b|^2,    optimality condition: 2a(ax − b) = 0

The solution is clearly x = a^{-1} b, but what if a = 0? Then we can actually choose whatever we want for x. To be consistent with the previous solutions, we define

a† = { 1/a   a ≠ 0
     { 0     a = 0

and get the general solution x = a† b.

The generalization to diagonal (not necessarily square) matrices is straightforward.

Then, from diagonal to arbitrary matrices via the SVD as follows.

A^T A x = A^T b

V D^T D V^T x = V D^T U^T b

D^T D V^T x = D^T U^T b

D^T D x̃ = D^T b̃,    where x̃ = V^T x and b̃ = U^T b (the system is diagonalized)

which is exactly

d_i^2 x̃_i = d_i b̃_i



and has the solution

x̃i = d†i b̃i

or

V T x = D† U T b

x = V D† U T b = A† b

which yields the practical definition of pseudo-inverse

A† = V D† U T

For completeness, the formal definition of the pseudo-inverse of A is the unique matrix (which always exists) such that AA†A = A, A†AA† = A†, (AA†)^T = AA† and (A†A)^T = A†A.

4) Equality constraints (Boyd p141):

min f (x)
x:Ax=b

then, the optimality conditions are

∇f (x)T (y − x) ≥ 0 ∀ Ay = b

since x is feasible, any feasible y can be written as y = x + v where v ∈ N(A) and therefore

∇f (x)T v ≥ 0 ∀ v ∈ N (A)

this holds also for −v and therefore

∇f (x)T v = 0 ∀ v ∈ N (A)

which means

∇f (x) ⊥ N (A)

Lets get the same results by changing variables and eliminating the constraints.

min f (x0 + F z)
z

where x0 is some solution and AF = 0 so that F z ∈ N {A}. Using the chain rule, the optimality
condition is

F^T ∇f = 0

or

∇f^T F z = 0 ∀ z

∇f^T v = 0 ∀ v ∈ N(A)

5) Pseudoinverse- minimum norm solution:

min ∥x∥2 s.t. Ax = b

We will assume of course that the problem is feasible and there are infinite solutions.

The objective is non-differentiable, but we can take the square due to monotonicity

min xT x s.t. Ax = b

and the optimality condition reads 2x∗ ⊥ N (A).

The feasible set states that x∗ = A† b + v where v ∈ N (A), but the optimality condition holds
iff v = 0.

Therefore x∗ = A† b is the minimum norm solution.

Alternative proof via the change of variables x = x0 + z with z ∈ N(A): since x0 = A†b lies in the row space of A, we have x0^T z = 0 and

∥x∥_2^2 = ∥x0∥^2 + 2 x0^T z + ∥z∥^2

        = ∥x0∥^2 + ∥z∥^2

        ≥ ∥x0∥^2

and the bound is attained by choosing z = 0.

6) Projection onto an affine set:

P_S(z) = arg min_{x∈S} ∥x − z∥

R(A) = {x : x = Ay for some y}

where A is a tall matrix (more rows than columns).

min_{x : x = Ay for some y} ∥x − z∥

min_{x,y} ∥x − z∥  s.t.  x = Ay
We eliminate the linear constraints

min ∥Ay − z∥
y

and the solution is clearly

y ∗ = A† z

and in terms of x

x∗ = Ay ∗ = AA† z

so that

PR(A) (z) = AA† z

The matrix P = AA† is a projection matrix onto the range of A.

It is idempotent (P^2 = P), has 0/1 eigenvalues, and has the same eigenvectors as AA^T = U D D^T U^T.

The projection onto a subspace is a linear operation.

Homework: redo it via optimality conditions.
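These properties are easy to verify numerically (a sketch; A is an arbitrary random tall matrix):

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((5, 2))
    P = A @ np.linalg.pinv(A)                 # projection onto R(A)

    print(np.allclose(P @ P, P))              # idempotent
    print(np.allclose(np.sort(np.linalg.eigvalsh(P)), [0, 0, 0, 1, 1]))  # 0/1 eigenvalues
    z = rng.standard_normal(5)
    print(np.allclose(A.T @ (z - P @ z), 0))  # residual is orthogonal to R(A)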

B. Tirgul

First a colab on least squares and minimum norm solution.

1) Robust linear programming: One of the problems with optimization in engineering is that, unlike intuitive heuristic solutions, optimal solutions are usually non-robust to data uncertainty. Fortunately, we can sometimes fix this.

Quoting Nemirovski et al:

A case study reported in [7] shows that the phenomenon we have just described is not an
exception: in 13 of 90 NETLIB Linear Programming problems considered in this study, already
0.01%-perturbations of ugly coefficients result in violations of some constraints, as evaluated at
the nominal optimal solutions by more than 50%. In 6 of these 13 problems the magnitude of

constraint violations was over 100%, and in PILOT4 it was as large as 210,000%, that is, 7
orders of magnitude larger than the relative perturbations in the data.

The techniques presented in this book as applied to the NETLIB problems allow one to eliminate
the outlined phenomenon by passing out of the nominal optimal to robust optimal solutions. At
the 0.1%-uncertainty level, the price of this immunization against uncertainty (the increase in
the value of the objective when passing from the nominal to the robust solution), for every one
of the NETLIB problems, is less than 1%.

minx cT x
s.t. aTi x ≤ bi ∀ ai ∈ Si
where

Si = {āi + Pi u : ∥u∥ ≤ 1}

The constraint

aTi x ≤ bi ∀ ai ∈ Si

is equivalent to

max aTi x ≤ bi
ai ∈Si

or

ā_i^T x + max_{u:∥u∥≤1} u^T P_i^T x ≤ b_i

(note the transpose: a_i^T x = ā_i^T x + u^T P_i^T x). Using Cauchy-Schwarz

u^T P_i^T x ≤ ∥u∥ ∥P_i^T x∥ = ∥P_i^T x∥

which is achieved by choosing

u = P_i^T x / ∥P_i^T x∥
Thus, we get an SOCP

min_x  c^T x
s.t.   ā_i^T x + ∥P_i^T x∥ ≤ b_i
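A sketch of this SOCP in cvxpy (the problem data below is hypothetical, just to show the modeling):

    import cvxpy as cp
    import numpy as np

    rng = np.random.default_rng(3)
    n, m = 3, 5
    c = rng.standard_normal(n)
    a_bar = rng.standard_normal((m, n))
    b = rng.standard_normal(m) + 5.0
    P = [0.1 * rng.standard_normal((n, n)) for _ in range(m)]

    x = cp.Variable(n)
    constraints = [a_bar[i] @ x + cp.norm(P[i].T @ x, 2) <= b[i] for i in range(m)]
    cp.Problem(cp.Minimize(c @ x), constraints).solve()
    print(x.value)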

2) Semidefinite programming: A higher level in the conic programming hierarchy: LP, SOCP
and SDP.

Basically an LP with a different definition of the inequalities.

minx cT x
P
s.t. F0 + i xi F i ⪰ 0
Ax = b

LMI = Linear matrix inequality

Multiple LMIs via a single block diagonal LMI.

LP is a special case obtained by choosing scalar F0 and Fi.

SOCP is a special case by noting that ∥x∥_2 ≤ t if and only if

[ tI    x ]
[ x^T   t ] ⪰ 0

Minimization of maximum eigenvalue (symmetric)

A(x) = A0 + Σ_i x_i A_i,    A_i symmetric

We want to solve

min λmax (A(x))


x

minx,t t
s.t. λmax (A(x)) ≤ t

minx,t t
s.t. λi (A(x)) ≤ t ∀ i

{λi (A) ≤ t ∀ i} = {A ⪯ tI}



minx,t t
s.t. A(x) ⪯ tI
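A cvxpy sketch of this eigenvalue minimization (random symmetric data; cvxpy's lambda_max atom implements exactly the LMI epigraph above):

    import cvxpy as cp
    import numpy as np

    rng = np.random.default_rng(4)
    n, k = 4, 2
    sym = lambda M: (M + M.T) / 2
    A0 = sym(rng.standard_normal((n, n)))
    As = [sym(rng.standard_normal((n, n))) for _ in range(k)]

    x = cp.Variable(k)
    Ax = A0 + sum(x[i] * As[i] for i in range(k))
    prob = cp.Problem(cp.Minimize(cp.lambda_max(Ax)))
    prob.solve()
    print(prob.value)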

Minimization of spread of eigenvalues (symmetric)

A(x) = A0 + Σ_i x_i A_i,    A_i symmetric

We want to solve

min λmax (A(x)) − λmin (A(x))


x

minx,t1 ,t2 t1 − t2
s.t. λmax (A(x)) ≤ t1
λmin (A(x)) ≥ t2

minx,t1 ,t2 t1 − t2
s.t. t2 I ⪯ A(x) ⪯ t1 I

Matrix norm minimization (non-symmetric)

A(x) = A0 + Σ_i x_i A_i,    A_i non-symmetric

We want to solve

min ∥A(x)∥2
x

where ∥X∥_2 is the maximum singular value of X, i.e., the square root of the maximum eigenvalue of X^T X (and the maximum eigenvalue itself if X ⪰ 0). This is convex because it is a norm.


min_x λmax(A^T(x) A(x))

min_{x,t} t^2  s.t.  λmax(A^T(x) A(x)) ≤ t^2

min_{x,t} t  s.t.  A^T(x) A(x) ⪯ t^2 I

For t ≥ 0,

{A^T A ⪯ t^2 I} = { (A, t) : [ tI    A  ]
                             [ A^T   tI ] ⪰ 0 }

so we get the SDP

min_{x,t}  t
s.t.       [ tI       A(x) ]
           [ A^T(x)   tI   ] ⪰ 0

6. G RADIENT D ESCENT AND N EWTON

A. Lecture

1) Gradient Descent: Simplest and most natural method is gradient descent (GD):

xk+1 = xk − tk ∇f (xk ) k = 1, · · · K

where tk are step sizes (either fixed or adaptive).

Theorem 11. For small tk = t GD is a descent method.

Proof. A simple Taylor approximation:

f(x_k − t∇f(x_k)) ≈ f(x_k) − t∥∇f(x_k)∥^2

Exact line search:

Basically transforms a high dimensional problem into a sequence of line searches. The step size t_k is chosen using a line search along the gradient direction g = ∇f(x_k):

min_{t_k ≥ 0} f(x_k − t_k g)

ZigZag issue - steepest descent with exact line search always advances in orthogonal directions (not very useful). Proof: the optimal step size t_min satisfies:

∂/∂t f(x_k − t∇f(x_k)) |_{t=t_min} = 0

Using the chain rule:

∇f(x_k − t_min ∇f(x_k))^T (−∇f(x_k)) = 0

which means that

∇f(x_{k+1})^T ∇f(x_k) = 0

Backtracking:

In practice, an inexact line search is usually used, e.g., backtracking.

Basically, begin with t = 1 and multiply it by some β ∈ (0, 1), say β = 0.5, until we get descent. Unfortunately, descent alone does not work (theoretically; in practice it's more or less ok).

For example, consider [Bertsekas p30]

f(x) = { 3(1−x)^2/4 − 2(1−x)   x > 1
       { 3(1+x)^2/4 − 2(1+x)   x < −1
       { x^2 − 1               else

The iteration x_{k+1} = x_k − f'(x_k) is a descent method (it always chooses t = 1), but does not converge if |x_0| > 1 and jumps between ±1 even though the minimum is at x = 0. The conclusion: descent is not sufficient, we need a bit more. One approach is backtracking:

def backtracking(f, g, x, a, b):
    # a in (0, 0.5)
    # b in (0, 1)
    # g is the gradient of f at x
    t = 1.0
    # shrink t until the sufficient decrease (Armijo) condition holds
    while f(x - t * g) > f(x) - a * t * (g @ g):
        t = b * t
    return t

Strongly smooth and strongly convex analysis

Assume that f satisfies:

mI ⪯ ∇2 f (x) ⪯ M I

In standard convexity m = 0 and M = ∞.

Two important bounds:

f(y) ≤ f(x) + ∇f(x)^T (y − x) + (M/2)∥y − x∥^2,  ∀ y, x

f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m/2)∥y − x∥^2,  ∀ y, x
2

Taking the minimum with respect to y on both sides of the lower bound yields

min_y f(y) ≥ min_y { f(x) + ∇f(x)^T (y − x) + (m/2)∥y − x∥^2 } = f(x) − (1/2m)∥∇f(x)∥_2^2

(the right-hand minimum is attained at y_min = x − (1/m)∇f(x)), which leads to

f(x_min) − f(x) ≥ −(1/2m)∥∇f(x)∥_2^2

This equation quantifies our distance from optimality using ∥∇f(x)∥ (a stopping criterion).

Note that ∇f (x) = 0 yields optimality.

Using the upper bound with y = x − t∇f(x) yields

f(x − t∇f(x)) ≤ f(x) − t∥∇f(x)∥^2 + (M t^2/2)∥∇f(x)∥^2,  ∀ x    [∗ ∗ ∗]

and taking the minimum with respect to t on both sides yields

f(x − t_min ∇f(x)) ≤ f(x) − (1/2M)∥∇f(x)∥_2^2      (t_min = 1/M)

Subtracting f(x_min)

f(x − t_min ∇f(x)) − f(x_min) ≤ f(x) − f(x_min) − (1/2M)∥∇f(x)∥_2^2

On the other hand, the lower bound yields

∥∇f(x)∥_2^2 ≥ 2m (f(x) − f(x_min))

Together we get

f(x − t_min ∇f(x)) − f(x_min) ≤ (1 − m/M)(f(x) − f(x_min))

This is called linear convergence since the graph of log error vs. iterations is linear.

The convergence rate depends on the condition number of the Hessian, which is directly related to the ratio m/M. Later on, this will be the motivation for Newton.

2) Newton method: The 1D algorithm was written by Isaac Newton in 1669 and published in
1711. Joseph Raphson published a similar description in 1690 (a good friend of Newton’s).
Basically a root finding method applied to f ′ (x∗ ) = 0.

Assume that f is twice continuously differentiable and that we can efficiently compute f , f ′ and
f ′′ . Newton’s method is defined as follows: At the tth iteration, f is approximated as a quadratic
function around xt−1 :
f(x) ≈ f(x_{t−1}) + (x − x_{t−1}) f'(x_{t−1}) + (1/2)(x − x_{t−1})^2 f''(x_{t−1})

The value of x_t is the minimizer of this quadratic approximation:

x_t = arg min_x f(x_{t−1}) + (x − x_{t−1}) f'(x_{t−1}) + (1/2)(x − x_{t−1})^2 f''(x_{t−1})

    = x_{t−1} − f'(x_{t−1}) / f''(x_{t−1})

Can also be interpreted as a solution to a nonlinear equation via linearization of f':

f'(x_t) = 0

f'(x_{t−1}) + f''(x_{t−1})(x − x_{t−1}) = 0

solving for x yields

x_t = x_{t−1} − f'(x_{t−1}) / f''(x_{t−1})

Is it a descent method?

f(x_{t−1} − f'(x_{t−1})/f''(x_{t−1})) ≈ f(x_{t−1}) − [f'(x_{t−1})]^2 / f''(x_{t−1})

Yes, if f''(x_{t−1}) > 0, for example if f(x) is strictly convex. But just as well it can be an ascent method if f(x) is strictly concave.

Quadratic convergence: Let x^∗ be a local minimizer of f such that f is three times continuously differentiable in a neighborhood of x^∗ with f'(x^∗) = 0, f''(x^∗) > 0. Then the Newton iterates converge to x^∗ quadratically, provided that the starting point is close enough to x^∗.

x_t − x^∗ = x_{t−1} − x^∗ − f'(x_{t−1})/f''(x_{t−1})

          = [−f'(x_{t−1}) − f''(x_{t−1})(x^∗ − x_{t−1})] / f''(x_{t−1})

          = [f'(x^∗) − f'(x_{t−1}) − f''(x_{t−1})(x^∗ − x_{t−1})] / f''(x_{t−1})      (adding the zero term f'(x^∗) = 0)

          = [(1/2) f'''(x̄) / f''(x_{t−1})] (x^∗ − x_{t−1})^2      (2nd order Taylor remainder with some x̄)

|x_t − x^∗| ≤ K |x_{t−1} − x^∗|^2

Note that we assume that throughout the process f ′′ (xt−1 ) > 0 and that x∗ is not an end point.

The convergence is quadratic: the error is essentially squared (the number of accurate digits
roughly doubles) at each step.

Example 8. Consider the convex function f (x) = ex −10x. The minimun is clearly xopt = log10.
ex −10
The method yields xk+1 = xk − ex
= xk − 1 + 10e−x . Starting with x = xopt + 0.1, method
reaches Matlab’s 15 digits precision in 4 iterations.
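A numpy sketch of this example:

    import numpy as np

    x_opt = np.log(10)
    x = x_opt + 0.1
    for k in range(5):
        x = x - (np.exp(x) - 10) / np.exp(x)   # x - f'(x)/f''(x)
        print(k + 1, abs(x - x_opt))           # the error roughly squares each step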

Example 9. Consider the convex function f(x) = (2/3)|x|^{3/2}. The minimum is clearly x_opt = 0, but if we start Newton's method at x_0 = 1, the algorithm continues as x_k = 1 for even k and x_k = −1 for odd k, and never reaches the minimum.

Let's move to higher dimensions:

x_{k+1} = x_k − t_k [∇^2 f(x_k)]^{-1} ∇f(x_k)

Assumes invertibility of the Hessian.

Derivation - minimization/maximization of a local quadratic approximation


x_{k+1} = arg min_x f(x_k) + ∇f(x_k)^T [x − x_k] + (1/2)[x − x_k]^T ∇^2 f(x_k) [x − x_k]
Derivation - solution of linearized optimality conditions

0 = ∇f (xk+1 ) ≈ ∇f (xk ) + ∇2 f (xk ) [xk+1 − xk ]

3) Interior point method (barrier): Idea: starting with a strictly feasible point, improve it inside
the set while penalizing for getting out of the set.

minx f0 (x)
s.t. fi (x) ≤ 0
is equivalent to

min_x f0(x) + Σ_i I(f_i(x))

where we define the indicator function for nonpositive scalars as

I(u) = { 0   u ≤ 0
       { ∞   u > 0

A good approximation for I(u) is the log barrier

I_t(u) = −(1/t) log(−u)

it is convex, nondecreasing and equals ∞ outside the domain (u ≥ 0). Known gradient and Hessian.

Thus, we propose the following approximation

min_x f0(x) + Σ_i I_t(f_i(x))

which can be solved by Newton.

• The approximation accuracy improves with t.


• Solving the inner problem becomes harder with t.
• The barrier method: start with a bad approximation (small t) and arbitrary starting point,
then use this solution as a starting point to a better approximation with a larger t.

def barrier(f0, f_list, x, t, mu, eps):
    # Inputs: strictly feasible x, t > 0, mu > 1, eps > 0
    m = len(f_list)
    while m / t > eps:  # m/t bounds the duality gap
        # centering step: minimize f0(x) + sum_i I_t(f_i(x)) by Newton,
        # warm-started at the current x (the inner solver is left abstract here)
        x = centering_step(f0, f_list, t, x)
        t = mu * t
    return x

The interior point method can also be interpreted as iteratively solving the nonlinear KKT conditions by linearization around the last (easier) point; easier in the sense of relaxing the complementary slackness condition with a parameter t. This comes out of the math.

B. Tirgul

1) Gradient Descent - quadratic analysis: Let’s dive deeper into this and introduce the ideas
behind acceleration.

Let's start with minimization of a simple quadratic function which is both strongly convex and smooth:

min_x (1/2) x^T A x    (8)
70

with

mI ⪯ A ⪯ M I (9)

The optimal solution is clearly x = 0. Let's analyze how GD approaches it. GD with step size t yields

xk+1 = xk − tAxk (10)

= (I − tA)xk = · · · (11)

= (I − tA)k+1 x1 (12)

∥x_{k+1} − 0∥_2 = ∥x_{k+1}∥_2 = ∥(I − tA)^{k+1} x_1∥_2 ≤ ∥I − tA∥_2^{k+1} ∥x_1∥_2    (13)

Thus, the rate of convergence is controlled by

∥I − tA∥2 (14)

If A = λI then clearly we want t = 1/λ. That is: small t for large λ and vice versa.

In the general case, we need to choose one step size that will perform well under all possible
eigenvalues:

min_t max_i |1 − tλ_i|    (15)

The solution is t = 2/(m+M) (proof: the two extreme eigenvalues m and M attain equal values |1 − tm| = |1 − tM| with opposite signs inside the absolute value). This gives a rate of 1 − 2m/(m+M).

Approximately, this yields t = 1/M with a rate of 1 − m/M.
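A quick numpy check of this rate on a diagonal quadratic (a sketch; m and M are arbitrary):

    import numpy as np

    m, M = 1.0, 100.0
    A = np.diag([m, M])
    t = 2 / (m + M)
    x = np.ones(2)
    for _ in range(100):
        x = x - t * (A @ x)
    print(np.linalg.norm(x), (1 - 2 * m / (m + M)) ** 100)  # comparable decay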

2) Gradient Descent - Lipschitz analysis: Convergence analysis for Lipschitz functions:

Definition 11. A function f is L-Lipschitz on X if

|f(x) − f(y)| ≤ L∥x − y∥  ∀ x, y ∈ X

Intuitively, a sort of continuity where the function is limited in how fast it can change.

Examples:

• Every function that is defined on an interval and has a bounded first derivative.
• f(x) = |x|, even though it is not differentiable at zero.
• f(x) = x^2 and f(x) = e^x are not globally Lipschitz as their slope increases.

Theorem 12. If f is convex and L-Lipschitz then ∥∇f(x)∥ ≤ L.

Proof.

∇f(y)^T (x − y) ≤ f(x) − f(y) ≤ L∥x − y∥    (16)

now just choose x = y + ∇f(y).

Theorem 13. Let f be convex and L-Lipschitz. Subgradient descent starting at x_1 such that ∥x_1 − x^∗∥ ≤ R, with t = R/(L√K), satisfies

f( (1/K) Σ_{k=1}^K x_k ) − f(x^∗) ≤ RL/√K

Proof.

f(x_k) − f(x^∗) ≤ ∇f(x_k)^T (x_k − x^∗)      (convexity)

  = (1/t)(x_k − x_{k+1})^T (x_k − x^∗)

  = (1/2t)(∥x_k − x^∗∥^2 + ∥x_k − x_{k+1}∥^2 − ∥x_{k+1} − x^∗∥^2)      (2a^T b = ∥a∥^2 + ∥b∥^2 − ∥a − b∥^2)

  = (1/2t)(∥x_k − x^∗∥^2 − ∥x_{k+1} − x^∗∥^2) + (t/2)∥∇f(x_k)∥^2

  ≤ (1/2t)(∥x_k − x^∗∥^2 − ∥x_{k+1} − x^∗∥^2) + tL^2/2

summing for k = 1, · · · , K and dividing by K yields

(1/K) Σ_{k=1}^K f(x_k) − f(x^∗) ≤ (1/2tK)(∥x_1 − x^∗∥^2 − ∥x_{K+1} − x^∗∥^2) + tL^2/2

  ≤ (1/2tK)∥x_1 − x^∗∥^2 + tL^2/2

  = RL/(2√K) + RL/(2√K)
Finally, due to convexity f (mean) ≤ mean(f ).

Some comments: this is not a descent method (f is not differentiable); we use the average rather than the last point; the bound is dimension independent; the step sizes are very small and pessimistic (inversely proportional to √K).

A main advantage of GD is that the convergence speed is independent of the dimension size.

3) Acceleration: Momentum

The problem is clearly that we do not know where x will be, closer to v_min or v_max (the extreme eigendirections), and we need to prepare for the worst case. The idea behind acceleration is to add a sort of regularization that ensures robust performance all over. The full analysis is involved, so we will stick to a scalar quadratic with unknown scale and show how momentum performs well for all scales in some interval.
min_x (h/2) x^2    (17)

The minimum is clearly x = 0 with a zero value.


If we know h then we can choose t = 1/h and get an optimal rate of zero, meaning that we solve the problem in one step.

x_{k+1} = x_k − t∇f(x_k) = x_k − t h x_k = 0    (18)


If we know that h ∈ [m, M] then choosing t = 1/m will be bad when h = M and vice versa. The previous solution is to choose the average t = 2/(m+M), which performs well when h = (m+M)/2 but deteriorates otherwise.

Instead, momentum suggests to use

x_{k+1} = x_k − t∇f(x_k) + β(x_k − x_{k−1})

        = x_k − t h x_k + β(x_k − x_{k−1})    (19)

To analyze its performance, we will use an old trick from linear systems:

[ x_{k+1} ]   [ 1 − th + β   −β ] [ x_k     ]
[ x_k     ] = [ 1             0 ] [ x_{k−1} ]    (20)
which can be compactly written as

wk+1 = Bwk (21)

or

w_k = B^{k−1} w_1    (22)

The distance from zero is therefore bounded by ∥B^k∥_2 and is asymptotically dictated by the absolute values of the (possibly complex) eigenvalues of B:

rate ∝ max{|λ_1(B)|, |λ_2(B)|}    (23)



In particular, choosing

t = (2 + 2β)/(m + M),    β = 1 − 2/(√κ + 1)    (24)

yields eigenvalues of equal magnitude

|λ_1| = |λ_2| = √β    ∀ h ∈ [m, M]    (25)

Thus, the worst case rate of momentum, √β ≈ 1 − 1/√κ, is much better than the worst case of GD which was approximately 1 − 1/κ. Of course, when h ≈ (m+M)/2, GD is much better than momentum, but momentum is very robust and has a rather constant rate for all h ∈ [m, M]. Show graph.

In practice, simply choosing a small t with β = 0.9 yields a pair of eigenvalues with equal absolute values that are roughly √β.
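A scalar numpy sketch of the momentum recursion (19), using the constants from (24); the interval [m, M] is arbitrary:

    import numpy as np

    m, M = 1.0, 100.0
    kappa = M / m
    beta = 1 - 2 / (np.sqrt(kappa) + 1)
    t = (2 + 2 * beta) / (m + M)

    for h in [m, (m + M) / 2, M]:       # momentum is robust across the interval
        x_prev, x = 1.0, 1.0
        for _ in range(100):
            x, x_prev = x - t * h * x + beta * (x - x_prev), x
        print(h, abs(x))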

Momentum is intuitive and works great for quadratic functions. For more general convex functions, it can be extended to Nesterov's Accelerated Gradient Descent (NAG):

v_{k+1} = βv_k − α∇f(x_k + βv_k)    (26)

x_{k+1} = x_k + v_{k+1}    (27)

Very similar to momentum, but the gradient is computed after applying the momentum (a lookahead). Assuming a strongly smooth and strongly convex function, the rate improves from e^{−T/κ} in GD to e^{−T/√κ} in NAG.

References : Cornell cs4784 and Mitliagkas IFT6085



7. M INIMIZATION MAJORIZATION

A. Lecture

1) Main idea: A standard approach in optimization is to reduce a difficult problem into a sequence of simpler problems:

min f (x) (28)


x

with

x_{k+1} = arg min_x Q(x; x_k)    (29)

where the surrogate function satisfies

f (x) ≤ Q(x; x′ ) ∀ x, x′ (30)

and

f (x) = Q(x; x) ∀ x (31)

Basically, we minimize an upper bound (which is supposedly easier).

Usually, decoupled or linear/quadratic.

Under regularity conditions, this iteration is guaranteed to converge to a stationary point.

f (xk+1 ) ≤ Q(xk+1 ; xk ) ≤ Q(xk ; xk ) = f (xk ) (32)

MM is especially useful when the sub-problems have a simple and elegant solution.

Also, MM methods typically bypass the need for a step size.

2) Gradient Descent: Let’s re-derive gradient descent assuming ∇2 f (x) ⪯ M I:

f(x) = f(x') + ∇^T f(x')(x − x') + (1/2)(x − x')^T ∇^2 f(x'')(x − x')    (33)

f(x) ≤ f(x') + ∇^T f(x')(x − x') + (M/2)∥x − x'∥^2 = Q(x; x')    (34)

Therefore, our sub-problems are

x ← arg min_x f(x') + ∇^T f(x')(x − x') + (M/2)∥x − x'∥^2    (35)

In the unconstrained case, the minimizer has a closed form solution:

∇f(x') + M(x − x') = 0    (36)

which yields gradient descent with step size 1/M:

x = x' − (1/M)∇f(x')    (37)

3) Iterative thresholding: In constrained optimizations, the minimizers can be different

f (x) = g(x) + λ∥x∥1 (38)

with
∇2 g(x) ⪯ M I (39)

Let's bound the quadratic form as before and leave the L1 regularization as is:

Q(x; x') = g(x') + ∇^T g(x')(x − x') + (M/2)∥x − x'∥^2 + λ∥x∥_1    (40)

Therefore

x ← arg min_x ∇^T g(x')x + (M/2)∥x − x'∥^2 + λ∥x∥_1    (41)
What’s the main advantage of these subproblems?

The solution is the famous iterative (soft) thresholding algorithm

x ← T_{λ/M}( x' − (1/M)∇g(x') )    (42)

where

T_α(z) = (|z| − α)_+ · sign(z)    (43)
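A minimal numpy sketch of the resulting iteration (ISTA) for g(x) = (1/2)∥Ax − b∥^2 (random data, arbitrary λ):

    import numpy as np

    def soft_threshold(z, alpha):
        return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

    rng = np.random.default_rng(5)
    A = rng.standard_normal((20, 50)); b = rng.standard_normal(20)
    lam = 0.5
    M = np.linalg.norm(A, 2) ** 2          # largest eigenvalue of A^T A
    x = np.zeros(50)
    for _ in range(200):
        x = soft_threshold(x - (A.T @ (A @ x - b)) / M, lam / M)
    print(np.count_nonzero(x))             # sparse solution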

4) Iterative Reweighted Least Squares (IRLS): In IRLS we use a quadratic upper bound. For example, it is useful in logistic and robust regression:

|a| ≤ (1/2) a^2/|a'| + (1/2)|a'|    (44)

Thus,

∥y − Hx∥_1 ≤ (1/2) Σ_i (y_i − h_i^T x)^2 / |y_i − h_i^T x'| + const    (45)

and we get an iterative weighted least squares problem:

x ← arg min_x (y − Hx)^T W' (y − Hx)    (46)

whose optimality conditions are the weighted normal equations

(H^T W' H) x = H^T W' y    (47)

x ← (H^T W' H)^{-1} H^T W' y    (48)

where

W' = diag{ 1/|y_i − h_i^T x'| }    (49)
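A sketch of the resulting IRLS loop for min_x ∥y − Hx∥_1 (with a small damping term to avoid division by zero; the data is random):

    import numpy as np

    def irls_l1(H, y, iters=50, delta=1e-8):
        x = np.linalg.lstsq(H, y, rcond=None)[0]            # LS initialization
        for _ in range(iters):
            w = 1.0 / np.maximum(np.abs(y - H @ x), delta)  # weights from (49)
            HW = H.T * w                                    # H^T W'
            x = np.linalg.solve(HW @ H, HW @ y)             # weighted normal equations
        return x

    rng = np.random.default_rng(6)
    H = rng.standard_normal((30, 3))
    y = H @ np.array([1.0, -2.0, 3.0]) + rng.standard_normal(30)
    print(irls_l1(H, y))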

5) Convex - Concave: Minimization majorization is also highly useful in non-convex optimization. Specifically, many functions can be written as a difference of two convex functions, or a sum of a convex and a concave function:

f (x) = f1 (x) − f2 (x) (50)

where both f1 and f2 are convex. The beauty is that −f2(x) is concave and can always be upper bounded by its linear approximation:

f (x) ≤ f1 (x) − [f2 (x′ ) + ∇T f2 (x′ )(x − x′ )] = Q(x; x′ ) (51)

The resulting MM algorithm is then

x ← min f1 (x) − ∇T f2 (x′ )x (52)


x

For example, a classical special case is solving non-convex sparse linear systems. Using

log(a) ≤ log(a') + (a − a')/a',    a, a' > 0    (53)

we bound

log(ϵ + |x|) ≤ |x|/(ϵ + |x'|) + const    (54)
We solve

min_{x:Ax=b} Σ_i log(ϵ + |x_i|)    (55)

by iteratively reweighting L1 problems:

min_{x:Ax=b} Σ_i |x_i|/(ϵ + |x'_i|)    (56)

which is very intuitive.

6) Exponentiated Gradients: First, let us define the simplex S

S = { x : 0 ≤ x_i ≤ 1, Σ_i x_i = 1 }    (57)

Suppose we want to solve

min f (x) (58)


x∈S

Gradient descent via majorization will lead to

x ← arg min_{x∈S} f(x') + ∇^T f(x')(x − x') + (M/2)∥x − x'∥^2    (59)

But solving these L2 regularized optimizations on the simplex is not trivial (it was actually a project in 2022).

Mirror descent methods replace the L2 norm with a different distance function that is more suitable for the simplex. For this purpose, we define the KL divergence:

d(x; x') = Σ_i x_i log(x_i / x'_i)    (60)

and define

x ← arg min_{x∈S} f(x') + ∇^T f(x')(x − x') + (1/α) d(x; x')    (61)

where we assume α is small enough to ensure majorization (the exact value depends on the function f).
where we assume α is small enough to ensure majorization (the exact value depends on the
function f ).

Computing the proximal operator on the simplex with respect to KL is simple

min_{x∈S} a^T x + d(x; y)    (62)

where we assume y_i > 0. Indeed, let's ignore the positivity constraints and use the optimality condition with a linear constraint x^T 1 = 1 (we learned this a few weeks ago):

∂/∂x_i = 1 + log(x_i / y_i) + a_i = λ    (63)
so that

x_i = y_i e^{−a_i} e^{λ−1}    (64)

and in order to satisfy the constraint we choose λ as a normalization

x_i = y_i e^{−a_i} / Σ_j y_j e^{−a_j}    (65)

Plugging this solution into the general optimization with a = α∇f(x') yields the seminal exponentiated gradient iteration

x_i = x'_i e^{−α∇_i f(x')} / Σ_j x'_j e^{−α∇_j f(x')}    (66)
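A numpy sketch of iteration (66) on a toy quadratic over the simplex (a hypothetical instance):

    import numpy as np

    def eg_step(x, grad, alpha):
        w = x * np.exp(-alpha * grad)    # multiplicative update
        return w / w.sum()               # normalization keeps x on the simplex

    rng = np.random.default_rng(7)
    Q = rng.standard_normal((5, 5)); Q = Q.T @ Q
    x = np.ones(5) / 5
    for _ in range(500):
        x = eg_step(x, Q @ x, alpha=0.05)
    print(x, x.sum())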

8. D UALITY I

A. Lecture

1) Definitions and theorems:

max_{λ∈D} g(λ) ≤ gap ≤ min_{x∈P} f(x)

Each λi is associated with one explicit constraint.

Art! Many duals!

Why?

• One problem may be easier than the other, e.g., water filling.
• Physical/practical meaning - support vector machines, uplink vs downlink duality, maximum likelihood vs maximum entropy, anomaly detection vs. experiment design.
• Structure, optimality conditions, analysis.
• Lower bounds, analysis of existing solutions.
• Optimality gap recipes, certificates for impossibilities.
• Primal dual algorithms.

Consider the (not necessarily convex) optimization problem

p^∗ = min_x f0(x)
      s.t.  f_i(x) ≤ 0, i = 1, · · · , m
            h_j(x) = 0, j = 1, · · · , p
Define the Lagrangian

L(x; λ, µ) = f0(x) + Σ_i λ_i f_i(x) + Σ_j µ_j h_j(x)

and the dual function

g(λ, µ) = inf_x L(x; λ, µ)

This may be hard to solve (in closed form or in nonconvex).

The art is to add constraints implicitly or explicitly in order to find the dual function.

Theorem 14 (Weak duality). g(λ, µ) ≤ p∗ for any λ ⪰ 0 and µ.



Proof:

p^∗ = f0(x^∗)

    ≥ f0(x^∗) + Σ_i λ_i f_i(x^∗) + Σ_j µ_j h_j(x^∗)      (the added terms are nonpositive for feasible primal and dual variables)

    = L(x^∗; λ, µ)

    ≥ inf_x L(x; λ, µ)

    = g(λ, µ)

The optimal x^∗ is clearly primal feasible.

Note the essence of duality

positive · negative = negative

Notations: we say that λ and µ are dual feasible if λ ⪰ 0 and g(λ, µ) > −∞.

The maximal lower bound

d∗ = max g(λ, µ) s.t. λ⪰0


µ,λ

is called the dual program (and the original is the primal).

Theorem 15 (Weak duality). d∗ ≤ p∗ (also for non finite values ±∞)

Proof: the optimal dual solution is dual feasible.

• g(λ, µ) = −∞ if L is unbounded from below.


• g(λ, µ) is concave in λ and µ (min of affine). Easy to solve!
• No free lunch: the inner minimization can be difficult and the bound can be loose.

Theorem 16 (Strong duality). If the problem is convex and Slater's condition holds, then strong duality holds and d^∗ = p^∗.

Proof next week.

Slater’s condition: there exists a strictly feasible point.

Only the non-linear inequalities need to be strictly feasible, and the linear can be just feasible.

Saddle point interpretation:

p^∗ = min_x max_{λ≥0,µ} L(x; λ, µ)

d^∗ = max_{λ≥0,µ} min_x L(x; λ, µ)

The first inner maximization is exactly f0 with the domain defined as the feasible set:

max_{λ≥0,µ} L(x; λ, µ) = { f0(x)   x is feasible
                          { ∞       otherwise

The second inner minimization is exactly the definition of the dual function.

In general,

max_λ min_x L(x; λ) ≤ min_x max_λ L(x; λ)  ∀ L(x; λ)

and that's the weak duality.

Proof:

min_x L(x; λ) ≤ L(x; λ)  ∀ x, λ

max_λ min_x L(x; λ) ≤ max_λ L(x; λ)  ∀ x

and minimizing the right hand side over x gives the result.

Strong duality is the other direction, and in general it does not hold:

max_λ min_x L(x, λ) = min_x max_λ L(x, λ)      (strong duality)

Together, we get a saddle point in {x∗ , λ∗ }:

L(x∗ ; λ) ≤ L(x∗ ; λ∗ ) ≤ L(x; λ∗ ) ∀x, λ ≥ 0



Proof:

L(x, λ^∗) ≥ min_x L(x, λ^∗)      (minimum)

          = max_λ min_x L(x, λ)      (optimality of λ^∗)

          = L(x^∗, λ^∗)

          = min_x max_λ L(x, λ)      (strong duality)

          = max_λ L(x^∗, λ)      (optimality of x^∗)

          ≥ L(x^∗, λ)      (maximum)

2) Examples:

Example 10 (Linear programming).

min_x c^T x  s.t.  x ⪰ 0, Ax = b

L(x; λ, µ) = c^T x − λ^T x + µ^T(Ax − b)

           = [A^T µ − λ + c]^T x − b^T µ

A linear function is bounded from below only when it is identically zero. Therefore

g(λ, µ) = { −b^T µ   A^T µ − λ + c = 0
          { −∞       else

The dual program is therefore

max −bT µ s.t. λ ⪰ 0, AT µ − λ + c = 0


λ,µ

or by eliminating λ

max −bT µ s.t. AT µ + c ⪰ 0


As everything is linear, all we need for strong duality is feasibility.

The dual of LP is LP.

Example 11 (LP with box constraints). Sometimes it's better not to put dual variables on all the constraints (but just enough to make the inner problem solvable):

min cT x s.t. Ax = b, −1 ≤ x ≤ 1
x

can also be written as

min cT x s.t. Ax = b, ∥x∥∞ ≤ 1


x

This can be transformed into a standard LP yielding the following UGLY dual

max −bT λ − eT µ − eT ν s.t. c + AT λ + µ − ν = 0


λ,µ≥0,ν≥0

Instead, we can leave the box constraints implicit and put only one dual vector

L = c^T x + λ^T(Ax − b)

  = −b^T λ + x^T (A^T λ + c)
minimizing over x s.t. −1 ≤ x ≤ 1 yields

min_{−1≤x≤1} a x = −|a|

so that

max −bT λ − ∥AT λ + c∥1


λ

which is of course much nicer (yet exactly the same).

Note that the dual is concave, and the duality between the L1 and L∞ norms.

Example 12. Farkas lemma

This is a theory of alternatives - the logic behind duality. Basically gives us certificates for
impossibility results. Much stronger than showing possibility which is just by showing an example.
Similarly, to the fact that lower bounds to minimization problems are stronger than upper bounds
(which are any feasible example).

Theorem 17. Consider the two sets of inequalities.

(I) Ax ≤ 0, cT x < 0

(II) AT y + c = 0, y≥0

Then only one of them holds, i.e., each system is feasible if and only if the other is not.

Proof: LP duality with

(P ) min cT x s.t. Ax ≤ 0

(D) max 0 s.t. AT y + c = 0, y≥0



(P) is homogeneous, therefore:

p∗ = −∞ (if (I) is feasible) or p∗ = 0 (if it is infeasible).

(D) is also either d∗ = 0 if (II) is feasible or d∗ = −∞ if it is infeasible.

Due to p^∗ = d^∗, feasible (I) means infeasible (II) and vice versa.

Example 13 (Minimum norm solution).

min xT x s.t. Ax = b
x

L(x; µ) = x^T x + µ^T(Ax − b)

This is a convex function whose minimum is attained when 2x + A^T µ = 0, or x = −(1/2)A^T µ.

Plugging in this solution yields

g(µ) = −(1/4) µ^T A A^T µ − b^T µ

Note that, as expected, this is a concave function in µ. The dual program is given by

max_µ −(1/4) µ^T A A^T µ − b^T µ
Example 14 (Minimum volume covering ellipsoid). Think of anomaly detection in machine
learning or experiment design. Also very similar to Gaussian ML in graphical models (primal
is minus ML, dual is max entropy).

minX≻0 log |X −1 |
s.t. aTi Xai ≤ 1, ∀i
Note the implicit PSD domain constraint. It's actually a max determinant problem.

The Lagrangian is

L(X; λ) = −log|X| + Σ_i λ_i a_i^T X a_i − Σ_i λ_i

        = −log|X| + Tr{ X Σ_i λ_i a_i a_i^T } − Σ_i λ_i

The minimum with respect to X is when

−X^{-1} + Σ_i λ_i a_i a_i^T = 0

in which case

g(λ) = log | Σ_i λ_i a_i a_i^T | + n − Σ_i λ_i

and the dual program is

max_λ  log | Σ_i λ_i a_i a_i^T | + n − Σ_i λ_i
s.t.   λ ≥ 0

The constraints are linear, so all we need for strong duality is feasibility, which always holds in this problem.

Interestingly, this is related to logdet experiment design, where we pick the best experiments to minimize the logdet MSE in regression:

max_λ  log | Σ_i λ_i a_i a_i^T |
s.t.   λ ≥ 0
       Σ_i λ_i = 1

B. Tirgul

Example 15 (Min L1 norm).

min ∥Ax − b∥1


x

Has no constraints! Let's add some: z = Ax − b.

min ∥z∥1 s.t. z = Ax − b


x,z

Now let's write the Lagrangian

L = ∥z∥_1 + µ^T z − µ^T A x + µ^T b

  = Σ_i (|z_i| + µ_i z_i) − µ^T A x + µ^T b

Using

min_w |w| + µw = { 0    if |µ| ≤ 1
                 { −∞   else

we get

g(µ) = { µ^T b   if |µ_i| ≤ 1 ∀i and A^T µ = 0
       { −∞      else

which yields the dual program

maxµ µT b
s.t. |µi | ≤ 1 ∀i
AT µ = 0

You can also get a similar (equivalent) dual by formulating the primal problem as

min 1T t s.t. − t ≤ Ax − b ≤ t
x,t

When the original problem is convex and strong duality holds - these are all equivalent.

Example 16 (Rayleigh quotient). On rare occasions strong duality holds for nonconvex problems. Let A be a symmetric matrix and consider Rayleigh's quotient

min_x x^T A x / x^T x

What is the solution? The minimal eigenvalue and its eigenvector. Let's do this using duality. This is equivalent to the following constrained optimization
minx xT Ax
s.t. xT x = 1
which is nonconvex (due to the non pd quadratic objective and equality quad constraint). The
Lagrangian is

L(x; µ) = xT Ax − µxT x + µ

= xT [A − µI] x + µ

A quadratic form is bounded from below only if it is positive semidefinite, and therefore

maxµ µ
s.t. A − µI ⪰ 0
This is of course a simple SDP. Now, let's see if it makes sense. This means λ_i(A) ≥ µ for all i,
or λmin (A) ≥ µ, and therefore the maximal µ is exactly d∗ = λmin which is what we expected
from the beginning. Clearly, this means d∗ = p∗ since we can choose x = umin and obtain
p∗ = λmin .

Example 17 (LP relaxation). Suppose c_i ≥ 0 and c_i ≠ c_j for i ≠ j. We want to find the K largest elements (K is an integer), that is, solve
min_x  −Σ_i c_i x_i
s.t.   x_i ∈ {0, 1}
       Σ_i x_i = K    (67)

we relax it to

min_x  −Σ_i c_i x_i
s.t.   0 ≤ x_i ≤ 1
       Σ_i x_i = K    (68)

Let's prove the relaxation is tight via the KKT conditions. The Lagrangian is


L = −Σ_i c_i x_i − Σ_i v_i x_i + Σ_i u_i x_i − Σ_i u_i + λ Σ_i x_i − λK    (69)

The KKT conditions read

0 ≤ x_i ≤ 1    (70)

Σ_i x_i = K    (71)

v_i, u_i ≥ 0    (72)

v_i x_i = 0,  u_i (x_i − 1) = 0    (73)

−c_i − v_i + u_i + λ = 0    (74)

We claim that at the optimal solution xi ∈ {0, 1}. Assume otherwise that there exist indices such
that 0 < xk < 1 and the rest are integral.
One such index is impossible since this would contradict Σ_i x_i = K, as K is an integer.

More than one fractional index is impossible since due to the complementary slackness this
would mean vi = ui = 0 for these indices and then λ = ci for more than one value of ci ̸= cj
which is a contradiction.

Example 18. Equality constrained LS

minx ∥Ax − b∥2


s.t. Gx = h
where A and G are full rank. Solve for x.

Solution: one approach is by a change of variables. Alternatively, we can use Lagrange duality

L(x, ν) = ∥Ax − b∥2 + ν T Gx − ν T h

Its minimizer is

x = −(1/2)(A^T A)^{-1} (G^T ν − 2A^T b)

and the dual function is

g(ν) = −(1/4)(G^T ν − 2A^T b)^T (A^T A)^{-1} (G^T ν − 2A^T b) − ν^T h

The optimality conditions are

G x^∗ = h

2A^T(A x^∗ − b) + G^T ν^∗ = 0

Thus

x^∗ = (A^T A)^{-1} (A^T b − (1/2) G^T ν^∗)

From the first optimality condition

G (A^T A)^{-1} A^T b − (1/2) G (A^T A)^{-1} G^T ν^∗ = h

Solving for ν^∗ yields

ν^∗ = −2 [G (A^T A)^{-1} G^T]^{-1} (h − G (A^T A)^{-1} A^T b)

Example 19. Entropy, piecewise linear and log-sum-exp

First let's prove the well known fact that the uniform distribution maximizes the entropy (which is concave).
max_p −Σ_i p_i log(p_i)  s.t.  p_i ≥ 0, Σ_i p_i = 1

min_p Σ_i p_i log(p_i)  s.t.  p_i ≥ 0, Σ_i p_i = 1

L = Σ_i p_i log(p_i) − Σ_i λ_i p_i + µ Σ_i p_i − µ

∂L/∂p_i = log(p_i) + 1 − λ_i + µ = 0

If p_i = 0 then the logarithm is −∞ and the condition cannot be satisfied. All positive p_i's have λ_i = 0 and therefore p_i = e^{−µ−1}, which does not depend on i. Due to Σ_i p_i = 1 we get p_i = 1/m, with λ_i = 0 and µ = −1 − log(p_i). Thus,


min_{p≥0, Σ_i p_i = 1} Σ_i p_i log p_i = −log m

Now, consider the minimization of a convex piecewise linear function

pP W L = min max{aTi x + bi }
x i

To define a dual, we need to rewrite it as a constrained problem:

p_PWL = min_{t,x} t  s.t.  t ≥ a_i^T x + b_i ∀ i

whose dual is derived from the Lagrangian

L = t + Σ_i λ_i (a_i^T x + b_i − t)

and reads

d_PWL = max_{λ_i≥0} Σ_i λ_i b_i  s.t.  Σ_i λ_i = 1, Σ_i λ_i a_i = 0

On the other hand, the non-smooth max function can be approximated using the log-sum-exp function

p_GP = min_x log Σ_i e^{a_i^T x + b_i}

or

min_{x,z} log Σ_i e^{z_i}  s.t.  z_i = a_i^T x + b_i ∀ i

whose dual is derived from the Lagrangian

L = log Σ_i e^{z_i} + Σ_i λ_i (a_i^T x + b_i − z_i)

The Lagrangian is unbounded with respect to x unless Σ_i λ_i a_i = 0. The optimality condition with respect to z_i yields

e^{z_i} / Σ_j e^{z_j} − λ_i = 0

which means

e^{z_i} = c λ_i

for any positive constant c, with λ_i > 0 and Σ_i λ_i = 1 (otherwise the problem is unbounded). Substituting back gives

d_GP = max_λ Σ_i λ_i b_i − Σ_i λ_i log λ_i  s.t.  Σ_i λ_i = 1, Σ_i λ_i a_i = 0, λ_i ≥ 0

Show that

0 ≤ d_GP − d_PWL ≤ log m

Proof of the right inequality: Suppose λ^sub_i are optimal for the GP dual. Then they are also feasible for the PWL dual, with objective

d_PWL ≥ Σ_i b_i λ^sub_i = d_GP + Σ_i λ^sub_i log λ^sub_i ≥ d_GP − log m

since

min_{λ_i≥0, Σ_i λ_i = 1} Σ_i λ_i log λ_i = −log m

The left inequality is due to

max_i a_i^T x + b_i ≤ log Σ_i e^{a_i^T x + b_i}  ∀ x

In conclusion, the quality of the approximation is

pGP − log m ≤ pP W L ≤ pGP

Example 20. SVD problem. In the homework, you needed to prove that

∥q∥_2 = max_Z ∥Zq∥  s.t.  ∥Z∥_fro ≤ 1

There are different ways to solve this; the easiest is probably by LP and SVD. Instead, we will do it by duality, which is good practice.

First let's make everything quadratic by taking squares:

∥q∥_2^2 = max_Z ∥Zq∥^2  s.t.  ∥Z∥_fro^2 ≤ 1

Note that the problem is non-convex. Yet, we will show that strong duality holds. Indeed, this is the special case which we already discussed, namely quadratic programming with one quadratic constraint.

∥Zq∥_2^2 = Tr{Z q q^T Z^T}

∥Z∥_fro^2 = Tr{Z Z^T}

So the problem is bounded from above by (weak duality)

min_{λ≥0} max_Z Tr{Z q q^T Z^T} − λ Tr{Z Z^T} + λ

min_{λ≥0} max_Z Tr{Z (q q^T − λI) Z^T} + λ

min_{λ≥0} max_{z_i} Σ_i z_i^T (q q^T − λI) z_i + λ

The inner maximization is unbounded unless

q q^T − λI ⪯ 0

and therefore the dual is

∥q∥^2 = min_{λ≥0} λ  s.t.  λI ⪰ q q^T

whose solution is clearly λ^∗ = ∥q∥^2. It remains to show that there exists a feasible solution to the primal program which attains this bound. This is easy by choosing

Z = v q^T / (∥v∥_2 ∥q∥_2)
for any non-zero vector v.

Example 21. Support Vector Machine The goal is to find a separating hyperplane between two
sets of labeled points, {xi , yi } where xi are the data and yi ∈ ±1 are their labels. We want a
simple decision rule of the sort

ŷ_i = sign(x_i^T w + w_0)    (75)

which will lead to maximal margin

max_{w,w_0,M}  M
s.t.  y_i (x_i^T w + w_0) ≥ M  ∀i
      ∥w∥ = 1    (76)

Note that we need ∥w∥ = 1, otherwise we can just scale w, w_0 to get infinite margin. This is clearly a non-convex problem. But we can get rid of ∥w∥ = 1 by requiring

(1/∥w∥) y_i (x_i^T w + w_0) ≥ M    (77)

or equivalently

y_i (x_i^T w + w_0) ≥ M ∥w∥    (78)

These constraints are homogeneous, so we may as well arbitrarily choose ∥w∥ = 1/M and get

min_{w,w_0}  ∥w∥^2
s.t.  y_i (x_i^T w + w_0) ≥ 1  ∀i    (79)

This is a convex quadratic minimization subject to linear constraints which can be efficiently solved.
This is a convex quadratic minimization subject to linear constraints which can be efficiently
solved.

Let's try the dual. The Lagrangian is

L = (1/2)∥w∥^2 + Σ_i α_i (1 − y_i (x_i^T w + w_0))    (80)

The dual function is

g(α) = min_{w,w_0} (1/2)∥w∥^2 + Σ_i α_i (1 − y_i (x_i^T w + w_0))    (81)

      = { −(1/2)∥Σ_i α_i y_i x_i∥^2 + Σ_i α_i    if Σ_i α_i y_i = 0
        { −∞                                     else    (82)

Finally, the dual program is

max_α  −(1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j + Σ_i α_i
s.t.   Σ_i α_i y_i = 0
       α_i ≥ 0    (83)
Note that the dual depends on xi only through the inner product xTi xj and is therefore suitable
for high dimensions (even infinite kernels).

Let's do the KKT:

y_i (x_i^T w + w_0) ≥ 1    (84)

α_i ≥ 0,  Σ_i α_i y_i = 0    (85)

α_i (1 − y_i (x_i^T w + w_0)) = 0    (86)

w − Σ_i α_i y_i x_i = 0    (87)


From the complementary slackness, we see that if α_i > 0 then y_i (x_i^T w + w_0) = 1, i.e., x_i is on the boundary. Otherwise, α_i = 0. From the last KKT condition we get a characterization of w using the active α_i:

w = Σ_i α_i y_i x_i    (88)

Using it and the complementary slackness, we can also get a simple characterization of w0 .

The derivation above assumes that a separating hyperplane exists. This is clearly not always
true, and the common SVM allows a few errors:

min_{w,w_0,ξ_i}  ∥w∥^2 + C Σ_i ξ_i
s.t.  y_i (x_i^T w + w_0) ≥ 1 − ξ_i  ∀i
      ξ_i ≥ 0  ∀i    (89)

The Lagrangian is now (keeping the constraints ξ_i ≥ 0 implicit in the inner minimization)

L = (1/2)∥w∥^2 + C Σ_i ξ_i + Σ_i α_i (1 − ξ_i − y_i (x_i^T w + w_0))    (90)

and the dual function is

g(α) = { −(1/2)∥Σ_i α_i y_i x_i∥^2 + Σ_i α_i    if Σ_i α_i y_i = 0 and α_i ≤ C
       { −∞                                     else    (91)

and the dual program is

max_α  −(1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j + Σ_i α_i
s.t.   Σ_i α_i y_i = 0
       0 ≤ α_i ≤ C    (92)

which is the standard SVM formulation.
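A cvxpy sketch of the dual (92) on toy data (the data and C below are hypothetical):

    import cvxpy as cp
    import numpy as np

    rng = np.random.default_rng(8)
    X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
    y = np.hstack([-np.ones(20), np.ones(20)])
    C = 1.0

    Xy = y[:, None] * X                     # rows are y_i x_i^T
    a = cp.Variable(40)
    objective = cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(Xy.T @ a))
    cp.Problem(objective, [a @ y == 0, a >= 0, a <= C]).solve()
    w = Xy.T @ a.value                      # recover w from (88)
    print(w)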

9. D UALITY II

A. Lecture

We will begin with the theory behind strong duality and then more examples.

1) Separating and supporting hyperplanes:

Theorem 18. Let C and D be two convex sets that do not intersect. Then there exist a ̸= 0 and
b such that xT a ≤ b for all x ∈ C and xT a ≥ b for all x ∈ D.

We only prove a special case where there exist two points c ∈ C and d ∈ D that are closest to
each other in terms of Euclidean norm, i.e.,

∥c − d∥ = inf_{x∈C, y∈D} ∥x − y∥

In this case, we define

a = d − c,    b = (∥d∥^2 − ∥c∥^2)/2
and show that these define a separating hyperplane.

Specifically, let us show that aT x − b ≥ 0 on D (the other side is identical).

The hyperplane is

f(x) = (d − c)^T x − (∥d∥^2 − ∥c∥^2)/2

     = (d − c)^T (x − d) + (1/2)∥d − c∥^2
Suppose f (u) < 0 for some u ∈ D.

Then, clearly (d − c)T (u − d) < 0.

We start at d and make a short step in the direction of (u − d), i.e., take t(u − d).

We remain in D (d and u are in D and it is convex), but decrease the distance which is a
contradiction.

To prove that the distance decreases:

d/dt ∥d + t(u − d) − c∥^2 |_{t=0} = 2(d − c)^T (u − d) < 0

and for small positive t we get

∥d + t(u − d) − c∥^2 < ∥d − c∥^2



2) Strong duality:

Theorem 19 (Strong duality). If the problem is convex and Slater's condition holds, then strong duality holds and d^∗ = p^∗.

Proof:

p∗ = min f0 (x) s.t. fi (x) ≤ 0 ∀i


x

Under Slater’s condition that there exists an x̃ such that fi (x̃) < 0.

We want to prove that there exists a λ∗ ≥ 0 such that

g(λ∗ ) ≥ p∗

or
min_x f0(x) + Σ_i λ^∗_i f_i(x) ≥ p^∗

which is exactly

f0(x) + Σ_i λ^∗_i f_i(x) ≥ p^∗  ∀ x

For this purpose, we define two convex sets (A is convex by definition of convexity)

A = {(u, t) |∃x, fi (x) ≤ ui , f0 (x) ≤ t}

B = {(0, t) |t < p∗ }

We have

p∗ = min t s.t. (0, t) ∈ A

Due to optimality of p∗ , the sets A and B do not intersect.


 
By the separating hyperplane theorem, there exist λ̃, µ ̸= 0 and α such that

(I) : (u, t) ∈ A → λ̃T u + µt ≥ α

(II) : (u, t) ∈ B → λ̃T u + µt ≤ α



Due to the definition of A, if (u, t) ∈ A then we can add arbitrary positive numbers to them and
still be in A. Together with (I) this means λ̃ ≥ 0 and µ ≥ 0 (otherwise λT u + µt is unbounded
from below).

From (II), we have µt ≤ α for all t < p∗ . Therefore, µp∗ ≤ α.

Together,

µ f0(x) + Σ_i λ̃_i f_i(x) ≥ α ≥ µ p^∗  ∀ x

Why ∀x? For any x, define fi (x) = ui and f0 (x) = t so that (u, t) ∈ A and now use (I).

Assume that µ > 0, then we divide by µ and obtain the required result with λ∗ = λ̃/µ.

Now let's show that µ = 0 is impossible. Assume it holds, in contradiction; then apply this inequality to Slater's point

Σ_i λ̃_i f_i(x̃) ≥ 0

and get λ̃ = 0 since fi (x̃) < 0 which is a contradiction to the existence of a non-zero separating
hyperplane.

B. Tirgul

Continue examples from last week.



10. O PTIMALITY CONDITIONS

A. Lagrange multiplier theorem

Not in the course’s material

Theorem 20 (Necessary conditions). Let x∗ be a local minimum of f subject to h(x) = 0, and


assume that the constraint gradients ∇h1 (x∗ ), · · · , ∇hm (x∗ ) are linearly independent. Then,
there exists a unique vector λ∗ such that
∇f(x^∗) + Σ_i λ^∗_i ∇h_i(x^∗) = 0

That is, the gradient needs to be in the span of the ∇h_i(x^∗).

Very easy to interpret: We already know that if f is convex and the constraints are linear, Ax = b, then

∇f ∈ C(A^T)    (93)

is necessary and sufficient.

Locally, any continuously differentiable h can be linearized, and then A is just the Jacobian of h.

Intuitively, what happens in inequality constraints? We only need Lagrange multipliers for the
active ones and to check feasibility of the non-active ones.

B. Convex optimality conditions

∇f (x∗ )T (x − x∗ ) ≥ 0 ∀ x ∈ S

This requires no regularity conditions but is hard to check and interpret.

Example: non-negative orthant (Boyd p142, Bertsekas p195):

min f (x)
x≥0

Then x^∗ ≥ 0 is a local minimum of a convex function f(x) if and only if

Σ_i ∂f(x^∗)/∂x_i (x_i − x^∗_i) ≥ 0,  ∀ x ≥ 0

This condition is equivalent to

∂f(x^∗)/∂x_i ≥ 0

∂f(x^∗)/∂x_i = 0  if x^∗_i > 0
Proof:

Choose x = [x^∗_1 · · · x^∗_j + 1 · · · x^∗_p] to get the first condition.

For some x^∗_j > 0, choose x = [x^∗_1 · · · x^∗_j/2 · · · x^∗_p] to get the second condition.

The other direction is trivial.

The second condition is known as complementary slackness

∂f(x^∗)/∂x_i · x^∗_i = 0  ∀ i

Either a constraint is inactive, in which case ∂f/∂x_i = 0 as in the unconstrained case, or it is active and we are on the boundary, which means that the derivative is not necessarily zero.

KKT just makes this easier...



We consider the following problem

min f (x) s.t. gi (x) ≤ 0, hj (x) = 0


x

We define the KKT conditions as the existence of dual variables such that

(KKT ) x∗ ∈ S, Primal feasibility

λ∗ ≥ 0, Dual feasibility

λ∗i gi (x∗ ) = 0, Complementary slackness


∇f(x^∗) + Σ_i λ^∗_i ∇g_i(x^∗) + Σ_j µ^∗_j ∇h_j(x^∗) = 0.      Stationarity

Complementary slackness:

Either a constraint is active and then has a regular Lagrange multiplier, or we can omit it λi = 0.

Theorem 21 (Necessary). Under technical conditions, the KKT are necessary for optimality of
x∗ .

Conditions can be:

(1) nonconvex problems with linearly independent gradients of the g_i

(2) Strong duality with Slater.

Lets prove (2):

• Strong duality means that there exist x∗ and λ∗ , ν ∗ which are feasible, optimal to (P) and
(D) and have zero duality gap.
• Primal and dual feasibility are the first two KKT conditions.

• Zero duality gap is the third KKT condition:

f0(x^∗) = g(λ^∗)      (zero gap)    (94)

        = min_x f0(x) + Σ_i λ^∗_i f_i(x)    (95)

        ≤ f0(x^∗) + Σ_i λ^∗_i f_i(x^∗)      (min)    (96)

        ≤ f0(x^∗)      (f_i ≤ 0, λ^∗_i ≥ 0)    (97)

This will only work if the inequalities are tight which means λi fi = 0.
• Minmax optimality of x∗ is fourth KKT.

Theorem 22 (Sufficient). In convex problems (no technical conditions), the KKT are sufficient
for optimality of x∗ .

Proof:

The dual function at λ∗ , ν ∗ is a lower bound.

Due to the 3rd KKT, L(x∗ ; λ∗ , µ∗ ) = f (x∗ ) which means that the lower bound is tight.

Another way to show sufficiency is by showing that the KKT are sufficient for the general
necessary and sufficient convex optimality condition:

∇f (x∗ )T (x − x∗ ) ≥ 0 ∀ x ∈ S

Proof (without equality constraints):

Assume x∗ , λ∗ such that

f_i(x^∗) ≤ 0    (98)

λ^∗_i ≥ 0    (99)

λ^∗_i f_i(x^∗) = 0    (100)

∇f0(x^∗) + Σ_i λ^∗_i ∇f_i(x^∗) = 0    (101)

Let x be some feasible suboptimal point. Then

0 ≥ f_i(x) ≥ f_i(x^∗) + ∇f_i^T(x^∗)(x − x^∗)      (feasibility, then convexity)    (102)

Using λ^∗_i ≥ 0 we have

0 ≥ Σ_i λ^∗_i (f_i(x^∗) + ∇f_i^T(x^∗)(x − x^∗))    (103)

  = Σ_i λ^∗_i f_i(x^∗) + (Σ_i λ^∗_i ∇f_i(x^∗))^T (x − x^∗)      (first sum is 0, second sum is −∇f0(x^∗))    (104)

  = −∇f0^T(x^∗)(x − x^∗)    (105)

Example 22 (Water filling).


max_p Σ_i log(1 + h_i p_i)  s.t.  p ≥ 0, 1^T p = P

where h > 0.

The Lagrangian is

L(p; λ, µ) = −Σ_i log(1 + h_i p_i) − Σ_i λ_i p_i + µ Σ_i p_i − µP

The KKT conditions are

p_i ≥ 0,  Σ_i p_i = P

λ_i ≥ 0

λ_i p_i = 0

−h_i/(1 + h_i p_i) − λ_i + µ = 0

Using the 4th KKT condition we get

λ_i = µ − h_i/(1 + h_i p_i) ≥ 0

If p_i = 0 then λ_i = µ − h_i ≥ 0.

Else if p_i > 0 then λ_i = 0 and µ − h_i/(1 + h_i p_i) = 0, so that p_i = 1/µ − 1/h_i.

Altogether, we get

p_i(µ) = (1/µ − 1/h_i)_+

and we get to play with µ until Σ_i p_i(µ) = P (this is basically the dual line search).
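The total power Σ_i p_i(µ) is decreasing in µ, so bisection works; a numpy sketch:

    import numpy as np

    def waterfill(h, P, tol=1e-10):
        lo, hi = 1e-12, h.max()              # p(mu) = 0 once mu >= max h_i
        while hi - lo > tol:
            mu = (lo + hi) / 2
            if np.maximum(1 / mu - 1 / h, 0).sum() > P:
                lo = mu                      # total power too large -> increase mu
            else:
                hi = mu
        return np.maximum(1 / hi - 1 / h, 0)

    h = np.array([1.0, 2.0, 5.0])
    p = waterfill(h, P=1.0)
    print(p, p.sum())                        # allocates a total of P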

11. C ONVEX RELAXATION

A. Linear

Finding a point that satisfies many linear constraints

We consider a set of linear inequalities Ax + b ≤ 0 which are infeasible and we want to find an
x which satisfies as many constraints as possible.

Optimally, we would like to solve

min_x Σ_i I(a_i^T x + b_i)

where I(z) is the non-convex step function

I(z) = { 1   z ≥ 0
       { 0   else

which we approximate by the convex function

max{z, 0} = { z   z ≥ 0
            { 0   else

Thus, we need to solve

min_x Σ_i max{a_i^T x + b_i, 0}

min_{x,t} Σ_i t_i  s.t.  t_i ≥ max{a_i^T x + b_i, 0}  ∀ i

min_{x,t} Σ_i t_i  s.t.  t_i ≥ a_i^T x + b_i,  t_i ≥ 0  ∀ i

Example 23 (Compressed sensing). A different example where we look for an approximation of


a non-convex combinatorial optimization.

The problem is again solving a linear system of equations.

Ax = b

but assume we have fewer equations than unknowns, hence infinite solutions. We already learned the minimum L2 norm solution

x = A† b

In some settings (which are very common recently) we have additional prior knowledge that x is sparse, for parsimony or computational efficiency. Thus we would like to solve

min_x ∥x∥_0  s.t.  Ax = b

where

∥x∥_0 = number of non-zero elements

The L0 "norm" is non-convex and combinatorial; its standard convex surrogate is the L1 norm. Thus, we propose to solve

min_x ∥x∥_1  s.t.  Ax = b

which can be reformulated as the LP

min_{x,t} 1^T t  s.t.  Ax = b, −t ⪯ x ⪯ t

Binary problems

We want to solve
minx cT x
s.t. Gx ⪯ h
Ax = b
xi ∈ {0, 1} i∈I

we relax the condition to

0 ≤ xi ≤ 1 i ∈ I

Summary:

• Slack variables.
• Relaxation - underestimate function and then project.

PCA, Rayleigh Quotient

max_x x^T A x / x^T x

Clearly non-convex. But wlog we can assume x^T x = 1

max_x x^T A x  s.t.  x^T x = 1

This is the classical PCA formulation, where we seek normalized weights x so that x^T u, with u ∼ N(0, A), has maximal variance.

Let's try SDP relaxation

maxx,X Tr {AX}
s.t Tr {X} = 1
X = xxT

maxx,X Tr {AX}
s.t Tr {X} = 1
X⪰0
rank(X) = 1
and relax the nonconvex constraint
maxX Tr {AX}
s.t Tr {X} = 1
X⪰0

This will clearly yield an upper bound (we enlarge the constraint set of a maximization).

Homework: prove that the relaxation is tight and the optimal X can be achieved by rank one!

B. Semidefinite relaxation

Binary QP

minx xT Ax
s.t xi ∈ {−1, 1} ∀i

minx xT Ax
s.t x2i = 1 ∀ i

Now we can define Q = xxT

minQ Tr {AQ}
s.t Qii = 1 ∀ i
Q⪰0
rank(Q) = 1

minQ Tr {AQ}
s.t Qii = 1 ∀ i
Q⪰0

If the optimal solution is rank one then we are done.

Otherwise, we have a relaxation (lower bound) and can find an upper bound by:

• Sign of last column.


• Sign of principal eigenvector.
• Randomization.

Example 24 (Two-way partitioning problem).

minx xT W x
s.t. x2i = 1, ∀i
The constraint basically means {xi ∈ ±1}.

Known as NP hard.

Using Lagrange duality we can bound

L(x; µ) = x^T W x + Σ_i µ_i x_i^2 − Σ_i µ_i

        = x^T [W + diag{µ_i}] x − Σ_i µ_i

A quadratic form is bounded from below only if it is positive semidefinite and therefore

g(µ) = min_x L(x; µ) = { −Σ_i µ_i   W + diag{µ_i} ⪰ 0
                       { −∞         else
and the dual program is
P
maxµ − i µi
s.t. W + diag {µi } ⪰ 0
This already gives us a (highly non-trivial) lower bound.

Now let's do the bi-dual (the dual of the dual).

The Lagrangian is (it's a max problem, so the dual variables multiply the ≥ inequalities)

L = −Σ_i µ_i + Tr{ X (W + Σ_i µ_i e_i e_i^T) }

with the dual variable X ⪰ 0.

It is unbounded (in µ) unless

Tr{X e_i e_i^T} = X_{i,i} = 1

and we get the bidual

minX Tr {XW }
s.t. Xi,i = 1, ∀i
X⪰0

Always, bidual=dual since the dual is concave.



Also, bidual = rank relaxation of the primal


minX Tr {XW }
s.t. Xi,i = 1, ∀i
X⪰0
rank (X) = 1

In communication systems we have a similar maximum likelihood detection problem

min_x ∥y − Hx∥^2  s.t.  x_i^2 = 1, ∀i

The Lagrangian is

L(x; µ) = x^T H^T H x − 2y^T H x + y^T y + Σ_i µ_i x_i^2 − Σ_i µ_i

        = x^T [H^T H + diag{µ_i}] x − 2y^T H x + y^T y − Σ_i µ_i
and the dual program is

max_µ min_x L(x; µ)

or

max_{µ,t} t  s.t.  L(x; µ) ≥ t ∀ x

We have an infinite set of constraints. Before, in the homogeneous case, we used

x^T A x ≥ 0 ∀ x  ⇔  A ⪰ 0

In the non-homogeneous case we have the following simple result:

x^T A x + 2b^T x + c ≥ 0 ∀ x  ⇔  [ A    b ]
                                 [ b^T  c ] ⪰ 0

Proof: Prove that

x^T A x + 2b^T x + c ≥ 0 ∀ x  ⇔  [ A    b ]
                                 [ b^T  c ] ⪰ 0

Solution: The ⇐ direction is trivial by applying the quadratic form to the vector [x; 1]. For the ⇒ direction, take a vector [x; t] and first assume t ≠ 0. Substituting x/t into the inequality:

(1/t^2) x^T A x + (2/t) b^T x + c ≥ 0

Multiply by t^2 to get

x^T A x + 2b^T x t + c t^2 ≥ 0

which means that

[x^T t] [ A    b ] [ x ]
        [ b^T  c ] [ t ] ≥ 0,    t ≠ 0

For t = 0, use the left hand side with x and −x and sum, to get

x^T A x ≥ −c

Replacing x by αx and letting α → ∞, this can only hold if

x^T A x ≥ 0.

In our problem,

A = H^T H + diag{µ_i}    (106)

b = −H^T y    (107)

c = y^T y − Σ_i µ_i − t    (108)

Therefore, the dual program can be written as

max  t
s.t. [ H^T H + diag{µ_i}    −H^T y              ]
     [ −y^T H               y^T y − Σ_i µ_i − t ] ⪰ 0

and guess what, the dual of this SDP is derived from the Lagrangian

L = −t − Tr{ [ Z    z ] [ H^T H + diag{µ_i}    −H^T y              ]
             [ z^T  ζ ] [ −y^T H               y^T y − Σ_i µ_i − t ] }

  = −t − Tr{Z H^T H} − Tr{Z diag{µ_i}} + 2z^T H^T y − ζ y^T y + ζ Σ_i µ_i + ζ t

This Lagrangian is unbounded unless Z_{i,i} = ζ and ζ = 1. Therefore, as expected, the bidual is

min_{Z,z}  Tr{ [ Z    z ] [ H^T H   −H^T y ]
               [ z^T  1 ] [ −y^T H  y^T y  ] }
s.t.  Z_{i,i} = 1, ∀i
      [ Z    z ]
      [ z^T  1 ] ⪰ 0

Example 25 (Projection on PSD). The problem is

min ∥X − Y ∥2 (109)
X:X⪰0

where Y is a symmetric matrix.

Hint: try it first with scalars.

L(X, W ) = ∥X − Y ∥2 − Tr {W X} (110)

Take the derivative

2(X − Y) − W = 0    (111)

X^∗ = Y + (1/2)W    (112)

L(X^∗, W) = (1/4)∥W∥^2 − Tr{WY} − (1/2)∥W∥^2

          = −(1/4)∥W∥^2 − Tr{WY}    (113)

The dual program is therefore

max_{W: W⪰0}  −(1/4)∥W∥^2 − Tr{WY}    (114)

Sanity check: should be concave!

Another sanity check: primal is convex so usually bidual=primal.

L2(W, Z) = −(1/4)∥W∥^2 − Tr{WY} + Tr{ZW}

         = −(1/4)∥W∥^2 + Tr{W(Z − Y)}    (115)

−(1/2)W − Y + Z = 0    (116)

W^∗ = 2(Z − Y)    (117)

L2 (W ∗ , Z) = −∥Z − Y ∥2 + 2∥Z − Y ∥2

= ∥Z − Y ∥2 (118)

Bidual is identical to primal

min ∥Z − Y ∥2 (119)
Z:Z⪰0

Now, let's solve the problem:

• Usually, numerically.
• Sometimes, by a good guess.

Here, we have a good guess:

Y = U diag {yi } U T ⇒ X = U diag {xi } U T (120)

xi = [yi ]+ (121)

Therefore,

W = 2(X − Y) = U diag{w_i} U^T    (122)

w_i = 2([y_i]_+ − y_i) = { 0       y_i ≥ 0
                         { −2y_i   y_i < 0    (123)

Sanity check: the w_i are non-negative, so the dual constraint W ⪰ 0 is feasible.

It remains to show that the primal and dual objective coincide:

p^∗ = Σ_i ([y_i]_+ − y_i)^2 = Σ_{i: y_i<0} y_i^2

d^∗ = −(1/4) Σ_i w_i^2 − Σ_i w_i y_i = −(1/4) Σ_{i: y_i<0} 4y_i^2 − Σ_{i: y_i<0} (−2y_i) y_i = Σ_{i: y_i<0} y_i^2    (124)
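The guessed solution is simply eigenvalue clipping; a numpy sketch:

    import numpy as np

    rng = np.random.default_rng(9)
    Y = rng.standard_normal((4, 4)); Y = (Y + Y.T) / 2
    w, U = np.linalg.eigh(Y)
    X = U @ np.diag(np.maximum(w, 0)) @ U.T        # clip negative eigenvalues
    print(np.linalg.eigvalsh(X).min() >= -1e-12)   # X is PSD
    print(np.sum(np.minimum(w, 0) ** 2), np.linalg.norm(X - Y) ** 2)  # p* matches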

Example 26 (Procrustes problem).

min_R ∥AR − B∥_fro^2  s.t.  R R^T = I    (125)

It is equivalent to

min_R −2Tr{B^T A R}  s.t.  R R^T = I    (126)

and we relax it to

min_R −2Tr{B^T A R}  s.t.  R R^T ⪯ I    (127)

This is convex since it can be written as an SDP

min_R  −2Tr{B^T A R}
s.t.   [ I    R ]
       [ R^T  I ] ⪰ 0    (128)

Numerically, this is usually tight. Let's try to prove it. The Lagrangian is

L = −2Tr{B^T A R} − Tr{ Z [ I    R ]
                          [ R^T  I ] }    (129)

  = −2Tr{(B^T A + Z_12) R} − Tr{Z_11} − Tr{Z_22}    (130)

So the dual is

max_{Z⪰0}  −Tr{Z_11} − Tr{Z_22}
s.t.       Z_12 = −B^T A    (131)

Now, let's guess primal and dual feasible variables that give equal objectives.

Consider the SVD of −B^T A

−B^T A = U D V^T    (132)

and let

R^∗ = V U^T    (133)

Z^∗ = [ U D U^T    U D V^T ]
      [ V D U^T    V D V^T ]    (134)

which both give an optimal value of −2Tr{D}.

Example 27 (ML in Gaussian graphical models).

min_{X⪰0}  Tr{SX} − log|X|
s.t.       X_{i,j} = 0, ∀(i, j) ∈ E

Note the implicit domain constraint. The Lagrangian is

L(X; µ) = Tr{SX} − log|X| − Σ µ_{i,j} Tr{E_{i,j} X}

        = −log|X| + Tr{ X (S − Σ µ_{i,j} E_{i,j}) }
The minimum with respect to X is attained at

X = (S − Σ µ_{i,j} E_{i,j})^{-1}

if S − Σ µ_{i,j} E_{i,j} ≻ 0, and the Lagrangian is unbounded from below otherwise. In that case

g(µ) = { log|S − Σ µ_{i,j} E_{i,j}| + n    if S − Σ µ_{i,j} E_{i,j} ≻ 0
       { −∞                                else

and the dual program is

max_µ  log|S − Σ µ_{i,j} E_{i,j}| + n
s.t.   S − Σ µ_{i,j} E_{i,j} ≻ 0
Adding a variable Σ = S − Σ µ_{i,j} E_{i,j} we get the classical max entropy formulation

max_{Σ≻0}  log|Σ| + n
s.t.       Σ_{i,j} = S_{i,j}  ∀(i, j) ∉ E

The constraints are linear, so all we need for strong duality is feasibility, which always holds in this problem.

Example 28 (Projection on simplex).

min_x ∥y − x∥^2  s.t.  x ≥ 0, 1^T x = 1

Now the problematic (non-separable) constraint is 1T x = 1 so we put a Lagrange multiplier on


it

max_µ min_{x≥0} ∥y − x∥^2 + µ − µ 1^T x

which can also be written as

max_µ  µ + Σ_i f_i(y_i; µ)

where

f_i(y_i; µ) = min_{x_i≥0} |y_i − x_i|^2 − µ x_i

has a simple closed form solution:

x_i = { y_i + µ/2   if y_i + µ/2 ≥ 0
      { 0           else

Thus, we reduced the problem to a line search over µ!
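A numpy sketch of this line search via bisection (Σ_i x_i(µ) is nondecreasing in µ):

    import numpy as np

    def project_simplex(y, tol=1e-12):
        lo, hi = -2 * (y.max() + 1), 2 * (1 - y.min())   # brackets sum < 1 and sum > 1
        while hi - lo > tol:
            mu = (lo + hi) / 2
            if np.maximum(y + mu / 2, 0).sum() > 1:
                hi = mu
            else:
                lo = mu
        return np.maximum(y + (lo + hi) / 4, 0)

    y = np.array([0.3, -0.5, 1.7])
    x = project_simplex(y)
    print(x, x.sum())    # nonnegative, sums to 1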

Two other generalizations: quadratic inequality subject to linear constraints, and quadratic subject to quadratic (the S-lemma).

Similarly, but a bit more confusing, is SOCP duality (this is less known, which is unfortunate, as SOCP is very strong and much more efficient than SDP)

min_x  f^T x
s.t.   [ A_i x + b_i   ]
       [ c_i^T x + d_i ] ⪰_K 0

which is just a fancy way of writing

min_x  f^T x
s.t.   ∥A_i x + b_i∥ ≤ c_i^T x + d_i

The Lagrangian is

L = f^T x − Σ_i [ u_i ]^T [ A_i x + b_i   ]
                [ v_i ]   [ c_i^T x + d_i ]

  = f^T x − Σ_i u_i^T A_i x − Σ_i u_i^T b_i − Σ_i v_i c_i^T x − Σ_i v_i d_i

with the dual SOC constraints

[ u_i ]
[ v_i ] ⪰_K 0

which is just a fancy way of writing ∥u_i∥ ≤ v_i. The Lagrangian is unbounded from below unless Σ_i A_i^T u_i + v_i c_i = f, and the dual is

max_{u,v}  −Σ_i u_i^T b_i − Σ_i v_i d_i
s.t.       Σ_i A_i^T u_i + v_i c_i = f
           ∥u_i∥ ≤ v_i

which is again an SOCP.

Example 29 (Group LASSO).

min_x ∥x − g∥^2 + λ Σ_i ∥x_{G_i}∥

Note that this is a non-differentiable, non-separable (overlapping groups) optimization.

min_{x,t}  x^T x − 2x^T g + g^T g + λ Σ_i t_i
s.t.       t_i ≥ ∥E_i x∥
L = x^T x − 2x^T g + g^T g + λ Σ_i t_i − Σ_i u_i^T E_i x − Σ_i v_i t_i

The minimum with respect to x is x = g + (1/2) Σ_i E_i^T u_i, and the objective is unbounded from below unless v_i = λ. Therefore, the dual is

max_u  ∥g∥^2 − ∥g + (1/2) Σ_i E_i^T u_i∥^2
s.t.   ∥u_i∥ ≤ λ
Now, this optimization is much easier. The feasible set is separable and we can iteratively solve
for each ui . In fact, each of these iterations has a closed form solution:

max_{u_i}  −∥g̃_i − u_i∥^2
s.t.       ∥u_i∥ ≤ λ

whose solution is u_i = g̃_i if it is feasible and u_i = λ g̃_i/∥g̃_i∥ otherwise.
