Lectures 2023

CONTENTS

1 Introduction
  1.1 Lecture
    1.1.3 Convexity
  1.2 Tirgul
    1.2.1 Eigenvalues
2 Convex sets
  2.1 Lecture
  2.2 Tirgul
3 Convex functions
  3.1 Lecture
    3.1.1 Theory
  3.2 Tirgul
4.1 Lecture
4.2 Tirgul
5 Linear algebra
  5.1 Lecture
  5.2 Tirgul
6.1 Lecture
6.2 Tirgul
  6.2.3 Acceleration
7 Minimization majorization
  7.1 Lecture
8 Duality I
  8.1 Lecture
    8.1.2 Examples
  8.2 Tirgul
9 Duality II
  9.1 Lecture
  9.2 Tirgul
10 Optimality conditions
1. INTRODUCTION
A. Lecture
1) Background and history: IMPORTANT: These lecture notes are a rough summary of the
book “Convex Optimization” by Boyd and Vandenberghe. The book is amazing, free and
available online. I highly recommend using the book and not the summary which is only
meant for my personal teaching. All the rights clearly belong to Boyd and Vandenberghe.
The typos and errors are mine.
". . . the great watershed in optimization isn't between linearity and nonlinearity, but convexity and nonconvexity." R. Rockafellar, SIAM Review 1993
min_{x∈S} f (x)
s.t. g_i(x) ≤ 0, i = 1, . . . , m
Almost all engineering problems involve optimization: a goal and constraints. The art is to formulate them mathematically and to solve them (and the two are related).
Linear-Convex-Conic
Our goal
Image reconstruction
minx ∥y − Hx∥2
s.t. ∥x∥1 ≤ α
Inpainting
Inpainting (also known as image interpolation) refers to the application of sophisticated algo-
rithms to replace lost or corrupted parts of the image data (Facetune - remove small defects).
min_x Σ_{ij} ∥ (x_{i+1,j} − x_{ij} , x_{i,j+1} − x_{ij}) ∥²
s.t. x_{ij} = x_{ij}^known for the known pixels (i, j)

Portfolio selection: choose weights w minimizing the risk subject to a target return r:
min_w wᵀΣw
s.t. wᵀµ = r
     wᵀ1 = 1
Classification
Try to classify points x_i with labels y_i ∈ {±1} using an affine function; for example, digit recognition in license plates. The goal is
sign(wᵀx_i + b) = y_i
and the formulation is
min_w ∥w∥²
s.t. y_i (wᵀx_i + b) ≥ 1

Two-way partitioning
A difficult problem: assign points to two groups while minimizing various costs of separating two points:
min_w wᵀQw
s.t. w_i ∈ {±1}
A brute-force solution requires an exponential number of evaluations. We will find good bounds via duality, as well as approximations.
Books
2) Line search: One-dimensional optimization problems. They include all the main ingredients, are very intuitive, and are part of most high-dimensional algorithms.
Optimality
min f (x)
x∈S
A solution is a value of x for which f (x) is minimal. There are two kinds of minima (local and global).
Definition 1. x∗ is a global minimum of f (x) over the interval S if it is no worse than all other
x∈S
f (x∗ ) ≤ f (x), ∀x ∈ S.
Definition 2. x∗ is a local minimum of f (x) over the set S if it is no worse than its feasible neighbors, i.e., there exists an ϵ > 0 such that
f (x∗) ≤ f (x), ∀x ∈ S with |x − x∗| ≤ ϵ.
Theorem 1. If x∗ is a local minimum of a twice continuously differentiable f over S which is not an end point of S, then f ′(x∗) = 0 and f ′′(x∗) ≥ 0.
Proof. Suppose x∗ is a local minimum and is not an end point. We will show that f ′(x∗) = 0. Recall that
f ′(x∗) = lim_{α→0} [f (x∗ + α) − f (x∗)] / α
Due to local optimality, f (x∗ + α) − f (x∗) ≥ 0 for sufficiently small |α|, so
[f (x∗ + α) − f (x∗)] / α ≥ 0, ∀α > 0
[f (x∗ + α) − f (x∗)] / α ≤ 0, ∀α < 0
These inequalities can only hold together if f ′(x∗) = 0. Using a Taylor expansion
f (x∗ + α) − f (x∗) = αf ′(x∗) + (α²/2) f ′′(z)
for some z ∈ [x∗, x∗ + α]. Due to local optimality and f ′(x∗) = 0, we get
0 ≤ 2 [f (x∗ + α) − f (x∗)] / α² = f ′′(z).
Due to continuity of f ′′ (), this holds for all sufficiently small |α| only if f ′′ (x∗ ) ≥ 0.
3) Convexity: The previous result is only local. To get something global we need convexity.
The three types of convex one-dimensional sets are the bounded and unbounded intervals:
I = {x ≥ a}
I = {a ≤ x ≤ b}
I = {x ≤ b}
Recall the definition: a function f (x) is convex over an interval I if for every x, y ∈ I and λ ∈ [0, 1]
f (λx + (1 − λ) y) ≤ λf (x) + (1 − λ) f (y).
Note that since I is an interval, we have λx + (1 − λ) y ∈ I and the definition is well defined.
Theorem 2. A continuously differentiable function f (x) is convex over an interval I if and only if it lies above its linear approximation:
f (y) ≥ f (x) + f ′(x)(y − x), ∀x, y ∈ I.
To recover the convexity inequality from this condition, apply it at the point λx + (1 − λ)y against both x and y, multiply the inequalities by λ and 1 − λ, and add them up to obtain the required result.
Theorem 3. A twice continuously differentiable f (x) is convex over I if and only if
f ′′(x) ≥ 0, ∀x ∈ I
which is sufficient for convexity by the Taylor expansion above. On the other hand, assume f is convex and that f ′′(x) < 0 at some x; then we can choose y close enough such that f ′′(z) < 0 for all z ∈ [x, y], and obtain
f (y) − f (x) = (y − x)f ′(x) + [(y − x)²/2] f ′′(z) < (y − x)f ′(x)
contradicting Theorem 2.
A few examples:
• f (x) = eˣ is convex:
f ′(x) = eˣ, f ′′(x) = eˣ ≥ 0
• f (x) = xᵃ:
f ′(x) = axᵃ⁻¹, f ′′(x) = a(a − 1)xᵃ⁻², which is non-negative on R++ when a ≥ 1 or a ≤ 0
• f (x) = −log(x) is convex:
f ′(x) = −1/x, f ′′(x) = 1/x² ≥ 0
• Negative entropy f (x) = x log x is convex on R++:
f ′(x) = x · (1/x) + log x = 1 + log x, f ′′(x) = 1/x > 0
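The second-derivative computations above can be sanity-checked numerically. This is a minimal sketch (not a proof), assuming numpy is available; it verifies f ′′ ≥ 0 by central finite differences and the midpoint convexity inequality on a grid inside R++:

```python
import numpy as np

# Numerical sanity check of convexity for the examples above.
fns = {
    "exp": np.exp,
    "neg_log": lambda x: -np.log(x),
    "neg_entropy": lambda x: x * np.log(x),
}

def second_derivative(f, x, h=1e-4):
    # central finite difference approximation of f''(x)
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

xs = np.linspace(0.1, 5.0, 50)  # inside R++ so all three are defined
for name, f in fns.items():
    assert np.all(second_derivative(f, xs) >= 0), name
    # midpoint convexity: f((x+y)/2) <= (f(x)+f(y))/2
    x, y = xs[:-1], xs[1:]
    assert np.all(f((x + y) / 2) <= (f(x) + f(y)) / 2 + 1e-12), name
print("all convex on the tested grid")
```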
Using these definitions, we can now find the global minimum:
Theorem 4. If f (x) is a convex function over I, then any local minimum of f over I is also a global minimum.
Proof. Suppose that x is a local minimum but not global. Then there exists a y ≠ x such that f (y) < f (x). Using the convexity, for any θ ∈ (0, 1]
f (θy + (1 − θ)x) ≤ θf (y) + (1 − θ)f (x) < θf (x) + (1 − θ)f (x) = f (x)
where we have used f (y) < f (x) in the second inequality. Taking θ → 0 yields points arbitrarily close to x with strictly smaller values, which contradicts the local optimality of x.
Question: convexity is a sufficient condition for local = global; is it necessary? No! Consider, for example, quasi-convexity. However, convexity is the easiest and most studied condition, the easiest to generalize to higher dimensions, and the best understood (no saddle points, convex + convex is convex, etc.).
B. Tirgul
The Jacobian of a function f : Rⁿ → Rᵐ is the m × n matrix Df (x) of partial derivatives [Df (x)]_{ij} = ∂f_i/∂x_j.
Example:
f (x) = Ax, Df (x) = A.
Indeed, in the example we have (and this helps remember the sizes of the transposes)
Az = Ax + A (z − x).
In particular, if m = 1 then the transposed Jacobian is called the gradient vector (always a column)
∇f (x) = [Df (x)]ᵀ = [∂f (x)/∂x₁, . . . , ∂f (x)/∂xₙ]ᵀ
so that
f (z) ≈ f (x) + ∇f (x)ᵀ(z − x).
In the example
f (x) = aT x
Df (x) = aT
∇f (x) = a
f (x) = xᵀAx
zᵀAx + xᵀAz = ∇f (x)ᵀz
xᵀAᵀz + xᵀAz = ∇f (x)ᵀz
so that
∇f (x) = (A + Aᵀ)x
and for symmetric A
∇f (x) = 2Ax
Chain rule: for
f : Rⁿ → Rᵐ
g : Rᵐ → Rᵖ
h(x) = g(f (x)) : Rⁿ → Rᵖ
then
Dh(x) = Dg(f (x)) Df (x), i.e., ∇h(x) = [Df (x)]ᵀ∇g(f (x)).
Example
h(x) = g(Ax + b)
∇h(x) = AT ∇g(Ax + b)
The Hessian also has a chain rule but it is more cumbersome. In the special case of an affine transformation
g(x) = f (Ax + b)
we have
∇2 g(x) = AT ∇2 f (x)A
Exercise: show that
f (x) = √(xᵀQx) − bᵀx
is convex for Q ≻ 0. We have
∇f (x) = (xᵀQx)^(−1/2) Qx − b
∇²f (x) = −(1/2)(xᵀQx)^(−3/2) 2QxxᵀQ + (xᵀQx)^(−1/2) Q
 = −(xᵀQx)^(−3/2) QxxᵀQ + (xᵀQx)^(−1/2) Q
 = (xᵀQx)^(−3/2) [ (xᵀQx) Q − QxxᵀQ ]
To show positive semidefiniteness, write Q = UᵀU and let y = Ux, u = Uv:
vᵀ[ (xᵀQx) Q − QxxᵀQ ]v = (xᵀQx)(vᵀQv) − vᵀQxxᵀQv
 = (xᵀUᵀUx)(vᵀUᵀUv) − (vᵀUᵀUx)(xᵀUᵀUv)
 = (yᵀy)(uᵀu) − (uᵀy)² ≥ 0
by the Cauchy-Schwarz inequality.
1) Eigenvalues: eigenpairs satisfy
Av = λv
[A − λI] v = 0
so the eigenvalues solve
|A − λI| = 0
and when A is diagonalizable
A = U diag {λ_i} U⁻¹
Definition 5. Any real symmetric matrix is diagonalizable by an orthogonal matrix with real
eigenvalues and can be decomposed into A = U diag {λi } U T where U is orthogonal U T U =
U U T = I.
2D example:
det [ a − λ, c ; c, b − λ ] = (a − λ)(b − λ) − c² = λ² − (a + b)λ + (ab − c²) = 0
λ_{1,2} = [ (a + b) ± √((a + b)² − 4(ab − c²)) ] / 2
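The closed-form 2D eigenvalues can be checked against a numerical eigensolver. A small sketch, assuming numpy:

```python
import numpy as np

# Check the closed-form 2x2 eigenvalues against numpy on a random
# symmetric matrix [[a, c], [c, b]].
rng = np.random.default_rng(0)
a, b, c = rng.standard_normal(3)
A = np.array([[a, c], [c, b]])

# the discriminant (a+b)^2 - 4(ab - c^2) = (a-b)^2 + 4c^2 is never negative
disc = np.sqrt((a + b) ** 2 - 4 * (a * b - c**2))
lam = np.sort([(a + b - disc) / 2, (a + b + disc) / 2])

assert np.allclose(lam, np.linalg.eigvalsh(A))
```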
2) Positive definiteness: A concept that generalizes scalar positivity to matrices. A main ingre-
dient in convex optimization. Second order condition for convexity, and allows us to use linear
matrix inequalities.
It is not a full order (two matrices can satisfy neither ≻ nor ≺). It is not the only generalization (e.g., element-wise positive matrices), but it is the one that works with the eigenvalues (which are usually more interesting than the elements themselves).
Equivalent statements for a symmetric matrix A ≻ 0:
• xᵀAx > 0 for all x ≠ 0.
• It is diagonalizable (symmetric matrices always are), and its eigenvalues are positive. Proof: writing A = U DUᵀ with columns u_j,
u_jᵀ A u_j = u_jᵀ U DUᵀ u_j = d_j > 0
and conversely xᵀAx = Σ_j d_j (u_jᵀx)² > 0.
In the 2D example
λ_{1,2} = [ (a + b) ± √((a + b)² − 4(ab − c²)) ] / 2 > 0 ⇔ a + b > 0, ab − c² > 0
Proof: require that both the determinant and the trace be positive. Actually a > 0, ab − c² > 0 suffices.
• It has a Cholesky decomposition X = U T U where U is an upper triangular matrix with pos-
itive diagonal elements (the generalization of squared root). This is the standard numerical
method for testing positive definiteness.
[ a, c ; c, b ] = [ √a, 0 ; c/√a, √(b − c²/a) ] [ √a, c/√a ; 0, √(b − c²/a) ],  which requires a > 0, ab − c² > 0
• All the leading principal minors (determinant of upper left subblocks) are positive.
[ a, c ; c, b ] ≻ 0 ⇔ a > 0, ab − c² > 0
Later, we will generalize this using Schur’s Lemma.
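The equivalent statements above can be compared in code. A minimal sketch, assuming numpy: the three tests (eigenvalues, Cholesky, leading principal minors) agree on a positive definite and on an indefinite 2x2 example.

```python
import numpy as np

# Three equivalent positive-definiteness tests.
def is_pd_eig(A):
    return np.all(np.linalg.eigvalsh(A) > 0)

def is_pd_chol(A):
    # the standard numerical test: Cholesky succeeds iff A is PD
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

def is_pd_minors(A):
    # all leading principal minors positive (Sylvester's criterion)
    n = A.shape[0]
    return all(np.linalg.det(A[:k, :k]) > 0 for k in range(1, n + 1))

A = np.array([[2.0, 1.0], [1.0, 2.0]])   # a=2, b=2, c=1: ab - c^2 = 3 > 0
B = np.array([[1.0, 2.0], [2.0, 1.0]])   # ab - c^2 = -3 < 0, indefinite

for M, expected in [(A, True), (B, False)]:
    assert is_pd_eig(M) == is_pd_chol(M) == is_pd_minors(M) == expected
```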
Now we can write linear matrix inequalities: we say that A ≻ B if A − B ≻ 0, just like in
scalars.
2. CONVEX SETS
A. Lecture
We now begin to generalize the previous results to the multidimensional case. The first thing we
need to do is generalize the notion of an interval. We will deal with constrained optimizations
as
min f (x)
x∈S
In the scalar case, we showed that the problem is efficiently solvable if f (x) is a convex function
and S is a convex set.
Theoretically, we don't need convex sets, as we can define the function to be infinite outside of the domain. But this is not useful theoretically, numerically, or didactically. Moreover, we will see that the two concepts, convex functions and convex sets, are very similar; in fact, once you master one, the other is trivial. In practice, this is similar to the duality between constraints and penalties (which are completely equivalent in convex optimization).
min f (x)
x
1) Lines and segments: Suppose x1 ̸= x2 ∈ Rn are two vectors then the line passing through
them is defined as
y = θx₁ + (1 − θ) x₂, θ ∈ R
2) Affine sets: A set C ⊆ Rⁿ is affine if the line through any two distinct points in C lies in C:
x1 , x2 ∈ C, θ ∈ R → θx1 + (1 − θ) x2 ∈ C
Can be generalized to more than two points; an affine set is a subspace plus an offset, and its
dimension is defined as the dimension of the subspace.
C = {x : Ax = b} is an affine set:
x₁, x₂ ∈ C → Ax₁ = b, Ax₂ = b
→ A(θx₁ + (1 − θ) x₂) = θb + (1 − θ)b = b
→ θx₁ + (1 − θ) x₂ ∈ C
There is also a converse - every affine set can be expressed as the solution set of a system of
linear equations.
3) Convex sets: A set C ⊆ Rⁿ is convex if the line segment between any two points in C lies in C:
x1 , x2 ∈ C, 0 ≤ θ ≤ 1 → θx1 + (1 − θ) x2 ∈ C
• Can be generalized to more than two points (even infinite with an integral).
• In the most general form: If C is convex and x is a random vector in C with probability
one, then E [x] ∈ C.
• What is the difference between affine and convex? Both are linear combinations whose coefficients sum to one, but in the convex case the coefficients are also non-negative (like a weighted average).
• Hyperplane {x : aᵀx = b} is convex (indeed affine): if aᵀx = b and aᵀy = b then
aᵀ(θx + (1 − θ)y) = θaᵀx + (1 − θ)aᵀy = θb + (1 − θ)b = b.
• Halfspace {x : aᵀx ≤ b} is convex: if aᵀx ≤ b and aᵀy ≤ b then for θ ∈ [0, 1]
aᵀ(θx + (1 − θ)y) = θaᵀx + (1 − θ)aᵀy ≤ θb + (1 − θ)b = b.
• Euclidean ball {x : ∥x − x_c∥₂ ≤ r} is convex:
∥θx₁ + (1 − θ)x₂ − x_c∥₂ = ∥θ(x₁ − x_c) + (1 − θ)(x₂ − x_c)∥₂ ≤ θ∥x₁ − x_c∥₂ + (1 − θ) ∥x₂ − x_c∥₂ ≤ r
• Second order (ice cream, Lorentz, quadratic) cone {(x, t) : ∥x∥ ≤ t} is convex. DRAW.
• The positive semidefinite cone
S₊ⁿ = {X ∈ Sⁿ : X ⪰ 0} = {X ∈ Sⁿ : zᵀXz ≥ 0 ∀ z}
is convex. Proof:
zᵀ(θX₁ + (1 − θ) X₂) z = θzᵀX₁z + (1 − θ) zᵀX₂z ≥ 0
In 2D, the set of matrices
[ x, y ; y, z ] ⪰ 0
is the set
{x ≥ 0, z ≥ 0, xz ≥ y²}
4) Operations that preserve convexity: How to check whether a set is convex without going
back to the definitions.
• Intersection: if S₁ and S₂ are convex then so is S₁ ∩ S₂. Proof:
x ∈ S₁ ∩ S₂ → x ∈ S₁ and x ∈ S₂
y ∈ S₁ ∩ S₂ → y ∈ S₁ and y ∈ S₂
then
θx + (1 − θ) y ∈ S₁ and θx + (1 − θ) y ∈ S₂
→ θx + (1 − θ) y ∈ S₁ ∩ S₂
This means we can add convex constraints (we look for the intersection of them).
• Infinite intersection: if Sα is convex for all α ∈ A then so is ∩α∈A Sα (A does not need to
be convex). Example: the positive semidefinite cone can be expressed as the intersection of
an infinite number of halfspaces and is therefore convex
{X ∈ S n : X ⪰ 0} = ∩z̸=0 {X ∈ S n : z T Xz ≥ 0}
• If S₁ and S₂ are convex sets then their direct product S₁ × S₂ = {(x₁, x₂) : x₁ ∈ S₁, x₂ ∈ S₂} is convex.
• Affine functions: if S ⊆ Rⁿ is convex then its image under an affine function f (x) = Ax + b is convex.
Proof: given f₁ = Ax₁ + b and f₂ = Ax₂ + b in the image and f ′ = θf₁ + (1 − θ)f₂, we must show that there exists x′ ∈ S such that f ′ = Ax′ + b. But this is trivial since
x′ = θx₁ + (1 − θ)x₂ ∈ S
and
Ax′ + b = θ(f₁ − b) + (1 − θ)(f₂ − b) + b = θf₁ + (1 − θ)f₂ = f ′
as required.
Special cases are scaling
αS = {αx | x ∈ S}
and translation
S + a = {x + a | x ∈ S}
• Affine functions: if S ⊆ Rⁿ is convex then its inverse image under an affine function f (x) = Ax + b is convex:
f ⁻¹(S) = {x : Ax + b ∈ S}
Proof:
x₁ ∈ f ⁻¹(S) → Ax₁ + b ∈ S
x₂ ∈ f ⁻¹(S) → Ax₂ + b ∈ S
x = θx₁ + (1 − θ)x₂ → Ax + b = θ(Ax₁ + b) + (1 − θ)(Ax₂ + b) ∈ S → x ∈ f ⁻¹(S)
Example: the polyhedron {x : Ax ≤ b, Cx = d} is convex, e.g., the simplex {x : x_i ≥ 0, Σ_i x_i = 1}.
Example: Ellipsoid {x : (x − x_c)ᵀP⁻¹(x − x_c) ≤ 1} with P ≻ 0. Writing
P⁻¹ = UᵀU
P = U⁻ᵀU⁻¹
we have
{x : (x − x_c)ᵀP⁻¹(x − x_c) ≤ 1} = {x : U x − U x_c ∈ {u : ∥u∥ ≤ 1}}
 = {U⁻¹z + x_c : z ∈ {u : ∥u∥ ≤ 1}}
an affine image of the unit ball, and therefore convex.
Example: the hyperbolic cone is convex since it is the inverse image of the second order cone under an affine function. With P ≻ 0, P = UᵀU:
{x : xᵀP x ≤ (cᵀx)², cᵀx ≥ 0} = {x : (U x, cᵀx) ∈ {(z, t) : zᵀz ≤ t², t ≥ 0}}
• Perspective function: P (x) = x_{1:n}/x_{n+1}, defined on {x ∈ Rⁿ⁺¹ : x_{n+1} > 0}. Images of convex sets under P are convex: given p₁ = P (x) and p₂ = P (y) with x, y ∈ S and µ ∈ [0, 1], we need z ∈ S with P (z) = µp₁ + (1 − µ)p₂.
Proof: take
z = θx + (1 − θ) y
with
θ = µy_{n+1} / (µy_{n+1} + (1 − µ)x_{n+1})
where this construction with x_{n+1} > 0 and y_{n+1} > 0 guarantees that θ ∈ [0, 1] so that z ∈ S. Plugging into P yields
P (z) = P (θx + (1 − θ) y)
 = [θx_{1:n} + (1 − θ) y_{1:n}] / [θx_{n+1} + (1 − θ) y_{n+1}]
 =(∗) (µ/x_{n+1}) x_{1:n} + ((1 − µ)/y_{n+1}) y_{1:n}
 = µP (x) + (1 − µ)P (y)
 = µp₁ + (1 − µ)p₂
since
µ = θx_{n+1} / (θx_{n+1} + (1 − θ) y_{n+1}) ∈ [0, 1]
and therefore
µ/x_{n+1} = θ / (θx_{n+1} + (1 − θ) y_{n+1})   (∗)
(1 − µ)/y_{n+1} = (1 − θ) / (θx_{n+1} + (1 − θ) y_{n+1})   (∗)
B. Tirgul
Exercise: Show that a set is convex if and only if its intersection with any line is convex.
First direction: The intersection of two convex sets is convex. Therefore if S is a convex set,
the intersection of S with a line is convex.
Conversely, suppose the intersection of S with any line is convex. Take any two distinct points
x1 , x2 ∈ S. The intersection of S with the line through x1 and x2 is convex. Therefore, convex
combinations of x1 and x2 belong to the intersection, hence also to S.
A⪰0 ⇔ z T Az ≥ 0 ∀z
Exercise: show that
C = {x | xᵀAx + bᵀx + c ≤ 0}
is convex if A ⪰ 0.
By the previous exercise, it suffices to show that the intersection of C with an arbitrary line
{x + tv | t ∈ R}
is convex. We express
(x + tv)ᵀA(x + tv) + bᵀ(x + tv) + c
as
αt² + βt + γ
where
α = vᵀAv
β = bᵀv + 2xᵀAv
γ = c + bᵀx + xᵀAx
so the intersection is
{x + tv | αt² + βt + γ ≤ 0}
which is convex if α ≥ 0 (an upward parabola is non-positive over an interval). This holds for every v if A ⪰ 0.
Is the converse true? No! For example, take A = −1, b = 0 and c = −1, which makes the set C = R, which is of course convex.
You can already guess here what is really needed for the converse: it should hold for any c.
Now, show that the intersection of C and the hyperplane defined by g T x + h = 0 (where g ̸= 0)
is convex if there exists a λ such that A + λgg T ≻ 0.
What does this mean? A must be psd everywhere except possibly in the direction of g. Which condition is stronger?
As before, the set is convex if and only if its intersection with an arbitrary line is convex. The intersection is
{x + tv | αt² + βt + γ ≤ 0, δt + ϵ = 0}
where α, β, γ are as before and
δ = gᵀv
ϵ = gᵀx + h
If δ ≠ 0 the intersection is at most a single point and always convex; if δ = 0 we need α ≥ 0. So convexity of the intersection for every line requires
gᵀv = 0 ⇒ vᵀAv ≥ 0
Intuitively, we do not need A to be psd everywhere but just on v's that satisfy gᵀv = 0, i.e., that are orthogonal to g. Hence the condition. Indeed, if A + λggᵀ ≻ 0 then for any v with gᵀv = 0
0 ≤ vᵀ(A + λggᵀ)v = vᵀAv + λ · 0
Back to gradients:
Log-sum-exp f (x) = log Σ_i e^{x_i}: with z = e^x (elementwise), the Hessian is
∇²f (x) = diag {z}/(1ᵀz) − zzᵀ/(1ᵀz)²
and we must show
uᵀ [ diag {z}/(1ᵀz) − zzᵀ/(1ᵀz)² ] u ≥ 0
Choosing a = diag {√z} u and b = √z:
(1ᵀz)(uᵀdiag {z} u) − (uᵀz)² = (Σ_i z_i)(Σ_j u_j² z_j) − (Σ_i u_i z_i)(Σ_j u_j z_j)
 = (bᵀb)(aᵀa) − (aᵀb)² ≥ 0
by the Cauchy-Schwarz inequality.
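The Hessian formula above is easy to probe numerically. A small sketch, assuming numpy: build the matrix for a random x and check that it is PSD and that the all-ones vector lies in its null space (shifting x by a constant does not change the gradient).

```python
import numpy as np

# Numerical check that diag(z)/(1'z) - z z'/(1'z)^2, z = exp(x),
# is positive semidefinite.
rng = np.random.default_rng(1)
x = rng.standard_normal(6)
z = np.exp(x)
s = z.sum()
H = np.diag(z) / s - np.outer(z, z) / s**2

eigs = np.linalg.eigvalsh(H)
assert eigs.min() >= -1e-12               # PSD up to round-off
assert abs(H @ np.ones(6)).max() < 1e-12  # H @ 1 = z/s - z*(1'z)/s^2 = 0
```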
3. CONVEX FUNCTIONS
A. Lecture
1) Theory:
A function f is convex if its domain is a convex set and for every x, y in the domain and θ ∈ [0, 1]
f (θx + (1 − θ)y) ≤ θf (x) + (1 − θ)f (y).
This inequality is sometimes called Jensen's Inequality, and can be generalized to other convex combinations λ_i ≥ 0 such that Σ_i λ_i = 1, or in general even a continuum:
f (E [x]) ≤ E [f (x)]
Theorem 5. A function is convex if and only if it is convex when restricted to any line that
intersects its domain.
Proof: define g(t) = f (ty + (1 − t)x). If f (x) is convex then g(t) is a composition with an affine function and is therefore also convex. Note that
g(θt + (1 − θ)t̄) = f (θ[ty + (1 − t)x] + (1 − θ)[t̄y + (1 − t̄)x])
and using the convexity of f we get
g(θt + (1 − θ)t̄) ≤ θg(t) + (1 − θ)g(t̄).
Conversely, assume that g(t) is convex in t for all x and y. Then
f (θy + (1 − θ)x) = g(θ) ≤ θg(1) + (1 − θ)g(0) = θf (y) + (1 − θ)f (x).

Theorem 6 (first order condition). A continuously differentiable f is convex if and only if it lies above its tangents:
f (y) ≥ f (x) + ∇f (x)ᵀ(y − x), ∀x, y.
Proof: if f is convex then f (x + t(y − x)) ≤ (1 − t)f (x) + tf (y), so
f (y) ≥ f (x) + [f (x + t(y − x)) − f (x)]/t → f (x) + ∇f (x)ᵀ(y − x) as t → 0.
On the other hand, assume that the condition holds, then apply it to the points
ty + (1 − t)x
t̄y + (1 − t̄)x
to obtain
f (ty + (1 − t)x) ≥ f (t̄y + (1 − t̄)x) + ∇f (t̄y + (1 − t̄)x)ᵀ(ty + (1 − t)x − t̄y − (1 − t̄)x)
which shows that g(t) = f (ty + (1 − t)x) lies above its tangents and is therefore convex.
Theorem 7 (second order condition). A twice continuously differentiable f is convex if and only if
∇²f (x) ⪰ 0, ∀x.
Proof. Let us restrict it to a line g(t) = f (a + tb) and find a condition so that g(t) will be convex in t for all a and b. But that is clearly g″(t) = bᵀ∇²f (a + tb)b ≥ 0, which is equivalent to ∇²f (x) ⪰ 0 for all x.
Sublevel sets: if f is convex then S = {x : f (x) ≤ 0} is convex. Proof:
x, y ∈ S ⇒ f (x) ≤ 0, f (y) ≤ 0
and therefore
f (θx + (1 − θ)y) ≤ θf (x) + (1 − θ)f (y) ≤ 0
so that
θx + (1 − θ)y ∈ S.
Of course, we can replace 0 by any constant. This result is the standard way to represent convex sets in practice.
The converse is wrong. Indeed f (x) = −x² − 1 is nonconvex but its sublevel set {f (x) ≤ 0} is R, which is convex.
Definition 10. The graph of a function f (x) is the set of points [x; f (x)]. The epigraph is the set above the graph, i.e. {[x; t] : f (x) ≤ t}.
Theorem 8. A function f (x) is convex in x iff its epigraph {[x; t] : f (x) ≤ t} is convex in [x; t].
Proof: One direction is trivial: if f (x) is convex in x then f (x) − t is convex in (x, t) and therefore its sublevel set f (x) − t ≤ 0 is a convex set. Conversely, if the epigraph is convex, take [x; f (x)] and [y; f (y)] in it; convexity of the epigraph gives
[θx + (1 − θ)y; θf (x) + (1 − θ)f (y)] ∈ epi f
and therefore
f (θx + (1 − θ)y) ≤ θf (x) + (1 − θ)f (y)
as required.
Examples:
• Affine f (x) = aᵀx + b:
∇f = a, ∇²f = 0
convex (and concave).
• Quadratic f (x) = xᵀAx + bᵀx + c with symmetric A:
∇f = 2Ax + b, ∇²f = 2A
convex iff A ⪰ 0.
• Norms are convex (triangle inequality plus homogeneity). Very important example, since min_{x∈S} ∥y − Hx∥ may be more efficient using SOCP than the natural min_{x∈S} ∥y − Hx∥² using QP.
• Max functions: f (x) = max{x₁, · · · , xₙ} is convex [even though max is non-smooth; important in complexity and robust optimization]. Proof:
f (θx + (1 − θ)y) = max_i {θx_i + (1 − θ)y_i} ≤ θ max_i x_i + (1 − θ) max_i y_i
since the rhs allows more options (different maximum indices with respect to x and y).
This is the analog of intersections in sets which preserve convexity.
• Quadratic-over-linear f (x, y) = x²/y is convex on {(x, y) : y > 0}.
∇f = [ 2x/y ; −x²/y² ]
∇²f = (2/y³) [ y², −xy ; −xy, x² ] = (2/y³) [ y ; −x ][ y ; −x ]ᵀ ⪰ 0
Another proof: in sets we need to prove that the epigraph {(x, y, t) : y > 0, x²/y ≤ t} is convex, or using Schur's lemma that
{ (x, y, t) : y > 0, [ y, x ; x, t ] ⪰ 0 }
• Geometric mean f (x) = (Π_i x_i)^(1/n) is concave on Rⁿ₊₊.
• Log-determinant f (X) = logdet (X) is concave on X ≻ 0. We will find the gradient matrix by completing
f (X + D) ≈ f (X) + Tr {D∇f }:
log |X + D| = log |X| + log |I + X^(−1/2)DX^(−1/2)|
 = log |X| + Σ_i log (1 + λ_i)
 ≈ log |X| + Σ_i λ_i
 = log |X| + Tr { X^(−1/2)DX^(−1/2) } = log |X| + Tr { X⁻¹D }
where λ_i are the eigenvalues of X^(−1/2)DX^(−1/2), so that ∇f (X) = X⁻¹.
For concavity, restrict to a line
X = Z + tV
where Z and V are symmetric but not necessarily positive definite, but we only look at t's for which Z + tV ≻ 0. Without loss of generality, we can assume Z ≻ 0, i.e., t = 0 is in the domain. Then
g(t) = log |Z + tV |
 = log | Z^(1/2) (I + tZ^(−1/2)V Z^(−1/2)) Z^(1/2) |
 = log | I + tZ^(−1/2)V Z^(−1/2) | + log |Z|
 = Σ_i log(1 + tλ_i) + log |Z|
where λ_i are the (real) eigenvalues of Z^(−1/2)V Z^(−1/2). Now
g′(t) = Σ_i λ_i/(1 + tλ_i)
g″(t) = −Σ_i λ_i²/(1 + tλ_i)² ≤ 0
so g is concave along every line, and f is concave.
• Pointwise maximum and supremum: if f1 and f2 are convex then so is f (x) = max{f1 (x), f2 (x)}
(also maximum over infinite and non-convex sets).
We already proved this. This is the analog of intersection in sets.
– Distance to farthest point of a set C is convex:
f (x) = max ∥x − y∥
y∈C
– The sum of the two largest elements of x ∈ R³ can be written as a pointwise maximum of affine functions
f (x₁, x₂, x₃) = max{x₁ + x₂, x₁ + x₃, x₂ + x₃}
and is therefore convex.
• Composition f (x) = h(g(x)) with h : R → R and g : Rⁿ → R:
– f is convex if h is convex and non-decreasing (h″ ≥ 0 and h′ ≥ 0), and g is convex.
– f is convex if h is convex and non-increasing (h″ ≥ 0 and h′ ≤ 0), and g is concave.
– f is concave if h is concave and non-decreasing, and g is concave.
– f is concave if h is concave and non-increasing, and g is convex.
For example:
– If g is convex then e^{g(x)} is convex.
– If g is concave then log g(x) is concave.
– If g is concave and positive then 1/g(x) is convex.
• Convex partial minimization - if g(x, y) is jointly convex in (x, y) then f (x) = min_y g(x, y) is convex in x (also holds for convex constrained minimization).
For simplicity, we prove the min version and not the inf. Let y and y′ attain the minima:
f (x) = g(x, y)
f (x′) = g(x′, y′)
Then
f (θx + (1 − θ)x′) ≤ g(θx + (1 − θ)x′, θy + (1 − θ)y′) ≤ θg(x, y) + (1 − θ)g(x′, y′) = θf (x) + (1 − θ)f (x′).
Example: let
g(x, y) = xᵀAx + 2xᵀBy + yᵀCy
be jointly convex with C ≻ 0. Minimizing over y:
2Bᵀx + 2Cy = 0
y = −C⁻¹Bᵀx
f (x) = min_y g(x, y) = xᵀ(A − BC⁻¹Bᵀ)x
so partial minimization shows that the Schur complement satisfies
A − BC⁻¹Bᵀ ⪰ 0.
Example: The perspective of f (x) = xᵀx is g(x, t) = xᵀx/t, which is convex in both x and t (for t > 0).
• Matrix fractional function
f (x, Y ) = xᵀY⁻¹x
is convex on Rⁿ × Sⁿ₊₊. Proof via the epigraph:
epi f = { (x, Y, t) : Y ≻ 0, xᵀY⁻¹x ≤ t }
 = { (x, Y, t) : Y ≻ 0, [ Y, x ; xᵀ, t ] ⪰ 0 }
Example: a weighted norm with weights A ≻ 0 is defined as ∥x∥_A = √(xᵀAx). Show that it is concave in A ≻ 0. Proof: restrict it to a line A = B + tC; then we get √(xᵀBx + txᵀCx). The square root is concave in positive arguments and the argument is affine in t. An application is metric learning:
min_{A≻0} Σ_{i,j∈N} ∥x_i − x_j∥²_A
s.t. Σ_{i,j∉N} ∥x_i − x_j∥_A > 1
B. Tirgul
A. Lecture
Standard form
min_{x∈S} f₀(x)
s.t. f_i (x) ≤ 0
     h_i (x) = 0
The problem is convex if S is a convex set, fi are convex functions and hi (x) are affine functions.
• Change of variables: substituting x = ϕ(z) with an invertible mapping ϕ gives the equivalent problem
min_{z:ϕ(z)∈S} f₀(ϕ(z))
s.t. f_i (ϕ(z)) ≤ 0
     h_i (ϕ(z)) = 0
which is sometimes convex even when the original formulation is not. For example, fitting the amplitude and phase of a sinusoid A cos (2πf x_i + ϕ) to data y_i is nonconvex in (A, ϕ). Using
a₁ = A cos (ϕ)
a₂ = A sin (ϕ)
we obtain A cos (2πf x + ϕ) = a₁ cos (2πf x) − a₂ sin (2πf x), and a convex LS
min_{a₁,a₂} Σ_i |y_i − a₁ cos (2πf x_i) + a₂ sin (2πf x_i)|²
To return to polar:
A = √(a₁² + a₂²)
ϕ = tan⁻¹(a₂/a₁)
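The amplitude/phase change of variables above can be sketched in a few lines. Assuming numpy, with made-up values for the frequency, amplitude, and phase: fit (a₁, a₂) by linear least squares on noiseless data, then convert back to polar and recover the true parameters.

```python
import numpy as np

# Fit A*cos(2*pi*f*x + phi) by linear LS in (a1, a2), then back to polar.
f, A_true, phi_true = 0.1, 2.0, 0.7     # illustrative values
x = np.arange(50.0)
y = A_true * np.cos(2 * np.pi * f * x + phi_true)

# y = a1*cos(2*pi*f*x) - a2*sin(2*pi*f*x), a1 = A cos(phi), a2 = A sin(phi)
M = np.column_stack([np.cos(2 * np.pi * f * x), -np.sin(2 * np.pi * f * x)])
a1, a2 = np.linalg.lstsq(M, y, rcond=None)[0]

A_hat = np.hypot(a1, a2)
phi_hat = np.arctan2(a2, a1)
assert np.allclose([A_hat, phi_hat], [A_true, phi_true])
```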
• Monotonic transformation. For example: quadratic minimization vs. norm minimization.
• Eliminating linear equality constraints:
min_x f (x)
s.t. Ax = b
From linear algebra we know that there are three options for Ax = b:
– No solution in which case the problem is infeasible.
– Unique solution in which case it is also the optimal solution.
– Infinite solutions (usually when A is fat, i.e., has more columns than rows), in which case
x = x1 + x2
where
Ax1 = b, x2 = V z, AV = 0
Later, we will say that the columns of V span the null space of A.
Intuitive example:
min_{x∈Rⁿ} f (x₁, · · · , xₙ)
s.t. Σⁿ_{i=1} x_i = 0
can be reduced to an unconstrained problem by eliminating xₙ = −Σ_{i<n} x_i:
min_{x₁,···,xₙ₋₁} f (x₁, · · · , xₙ₋₁, −Σ_{i<n} x_i)
Note that the smaller problem remains convex as affine transformations preserve convexity.
• Introducing linear equality constraints - sometimes useful for clarity and duality:
min_x f (Ax + b) ⇔ min_{x,z} f (z) s.t. Ax + b = z
• Implicit/explicit constraints:
min_{x≥0} f (x) ⇔ min_x f (x) s.t. x ≥ 0
Theorem 9. Consider
min_{x∈S} f (x)
with f convex and differentiable and S convex. A point x ∈ S is optimal if and only if
∇f (x)ᵀ(y − x) ≥ 0 ∀ y ∈ S.
Proof. If the condition holds then, by the first order condition, f (y) ≥ f (x) + ∇f (x)ᵀ(y − x) ≥ f (x) for all y ∈ S, so x is optimal. Conversely, suppose x is optimal but the condition does not hold, i.e., there exists a y ∈ S such that
∇f (x)ᵀ(y − x) < 0.
Consider z(t) = ty + (1 − t)x. This point is feasible since x and y are. For small t > 0 we will have f (z(t)) < f (x), which is a contradiction to the optimality of x. To show this we note that
d f (z(t))/dt |_{t=0} = ∇f (x)ᵀ(y − x) < 0.
In the unconstrained case (S = Rⁿ), the condition reduces to
∇f (x) = 0
since it must hold for both y = x + z and y = x − z for any z. That is
∇f (x)ᵀz ≥ 0 ∀ z
∇f (x)ᵀz ≤ 0 ∀ z
In the scalar case over an interval I = [a, b], the condition is
f ′(x)(y − x) ≥ 0 ∀ y ∈ I.
Now if x is strictly inside the interval, then as before the condition reduces to
f ′(x) = 0.
If it is on the border, e.g., x = b, then for all y ∈ I we have y − x ≤ 0 and the condition reduces to
f ′(x) ≤ 0.
2) Projection onto a convex set: Probably the most common optimization over a convex set, used for example in Projected Gradient Descent. Define the projection of z onto a convex set S as
P_S(z) = argmin_{x∈S} ∥x − z∥².
For a non-convex set the projection need not be unique (counterexample: the center of a circle projects onto every point of the circle).
Non-expansive: Consider some x and y. Then P_S(x) and P_S(y) are in S and the optimality conditions yield
(x − P_S(x))ᵀ(P_S(y) − P_S(x)) ≤ 0
(y − P_S(y))ᵀ(P_S(x) − P_S(y)) ≤ 0
Adding the two and applying the Cauchy-Schwarz inequality gives ∥P_S(x) − P_S(y)∥ ≤ ∥x − y∥.
B. Tirgul
x∗ ∈ S, ∇f (x∗ )T (x − x∗ ) ≥ 0 ∀ x ∈ S
min f (x)
x≥0
Example 4 (Projection on the positives). Projection on the non-negative cone (scalar version):
min_{x≥0} (x − y)²
The solution is
x∗ = max{y, 0} = [y]₊
since the optimality condition
(y − x∗)(z − x∗) ≤ 0 ∀ z ≥ 0
holds. To show this consider two cases. If y > 0 then y − x∗ = 0 and we are done. Otherwise x∗ = 0 and we have (y − 0)(z − 0), which is non-positive · non-negative = non-positive.
Example 5 (Projection on span of a vector a). Consider the span of a fixed vector
S = {x : ∃c x = ca} (7)
min ∥x − y∥2
x∈S
Rewrite in terms of c:
min_c ∥ca − y∥²
and solve
d/dc [ c²∥a∥² − 2caᵀy + ∥y∥² ] = 0
c = aᵀy / ∥a∥²
x∗ = (aᵀy / ∥a∥²) a
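Both closed-form projections above (onto the non-negative orthant and onto the span of a vector) are one-liners. A minimal numpy sketch, verifying the optimality/orthogonality conditions derived in the examples:

```python
import numpy as np

# Closed-form projections onto the non-negative orthant and span(a).
rng = np.random.default_rng(3)
y = rng.standard_normal(5)
a = rng.standard_normal(5)

proj_pos = np.maximum(y, 0.0)        # [y]_+ elementwise
proj_span = (a @ y) / (a @ a) * a    # (a'y / ||a||^2) a

# optimality: (y - x*)'(z - x*) <= 0 for any feasible z in the orthant
z = np.abs(rng.standard_normal(5))   # an arbitrary non-negative point
assert (y - proj_pos) @ (z - proj_pos) <= 1e-12
# the span residual is orthogonal to a
assert abs(a @ (y - proj_span)) < 1e-12
```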
Example 6 (Projection of a matrix onto the set of symmetric matrices (with Frobenius norm)).
Due to symmetry, we have
where x and y are vector versions of X and Y , respectively, and mat takes an n2 vector and
transforms it into an n × n matrix.
We need to start with a guess and then show that it is optimal by satisfying the optimality
condition
X = (Y + Yᵀ)/2
Example 7 (Projection of a symmetric matrix onto the semidefinite cone (with Frobenius norm)). The optimality conditions state that x is optimal iff it is feasible, i.e., mat(x) ⪰ 0, and it satisfies (y − x)ᵀ(z − x) ≤ 0 for every feasible z. Guess the eigenvalue truncation
Y = U DUᵀ
X = U [D]₊ Uᵀ
and verify the condition using
xᵀy = Tr {XᵀY}
A ⪰ 0, B ⪰ 0 → Tr {AB} ≥ 0
The problem is the guessing! Next lesson we will do it without the guessing.
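The eigenvalue-truncation guess in Example 7 is easy to implement. A minimal sketch, assuming numpy: eigendecompose a symmetric matrix, clip the negative eigenvalues at zero, and check feasibility plus a spot comparison against another PSD candidate.

```python
import numpy as np

# Projection of a symmetric matrix onto the PSD cone (Frobenius norm):
# eigendecompose and truncate the negative eigenvalues.
rng = np.random.default_rng(4)
B = rng.standard_normal((4, 4))
Y = (B + B.T) / 2                    # a symmetric (generally indefinite) matrix

d, U = np.linalg.eigh(Y)
X = U @ np.diag(np.maximum(d, 0.0)) @ U.T

assert np.all(np.linalg.eigvalsh(X) >= -1e-12)   # feasible: X is PSD
# X should be at least as close to Y as an arbitrary PSD competitor
Z = np.eye(4)
assert np.linalg.norm(Y - X) <= np.linalg.norm(Y - Z) + 1e-12
```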
1) Linear programming: Used to be a full lecture, but due to the war we will just briefly touch on it.
Standard form
minx cT x
s.t. Gx ⪯ h
Ax = b
Why work with standard form? Stand on the shoulders of giants, off the shelves software.
Diet problem
minx cT x
s.t. Ax ≥ b
x≥0
Power control
Wireless communication system with n transmit and receiving units. The ith transmitter tries to
reach the ith receiver who also measures the interference from the other transmitters and noise.
Given the channels between the units and signal to noise plus interference constraints our goal
is to minimize the total transmitted power.
min_p Σ_i p_i
s.t. log (1 + h_ii p_i / (Σ_{j≠i} p_j h_ij + σ²)) ≥ γ_i
     p_i ≥ 0
The log constraint is monotonic, so this is equivalent to
min_p Σ_i p_i
s.t. h_ii p_i / (Σ_{j≠i} p_j h_ij + σ²) ≥ e^{γ_i} − 1
     p_i ≥ 0
and, multiplying through by the (positive) denominator, to the LP
min_p Σ_i p_i
s.t. h_ii p_i ≥ Σ_{j≠i} p_j (e^{γ_i} − 1) h_ij + (e^{γ_i} − 1) σ²
     p_i ≥ 0
Piecewise linear minimization - a reasonable approximation for minimizing any convex function:
min_x max_i (a_iᵀx + b_i)
can be reformulated as
min_{x,t} t s.t. Ax + b ⪯ t1
L1 fitting min_x ∥Ax + b∥₁:
min_{x,t} 1ᵀt
s.t. t_i = |a_iᵀx + b_i| ∀ i
wlog we can relax the equality to an inequality (it is always better to choose a lower point)
min_{x,t} 1ᵀt
s.t. t_i ≥ |a_iᵀx + b_i| ∀ i
Using |u| ≤ t ⇔ −t ≤ u ≤ t this becomes the LP
min_{x,t} 1ᵀt
s.t. −t ⪯ Ax + b ⪯ t
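The epigraph trick behind the L1 reformulation can be checked directly: for any fixed x, the smallest feasible t is t = |Ax + b| elementwise, so the relaxed problem has the same optimum as the original. A numpy sketch with random data:

```python
import numpy as np

# For fixed x, the best feasible t in  min 1't  s.t. -t <= Ax+b <= t
# is t = |Ax + b|, so the LP objective matches the L1 norm.
rng = np.random.default_rng(5)
A = rng.standard_normal((20, 3))
b = rng.standard_normal(20)
x = rng.standard_normal(3)

r = A @ x + b
t_best = np.abs(r)                   # smallest t with -t <= r <= t
assert np.all(-t_best <= r) and np.all(r <= t_best)
assert np.isclose(t_best.sum(), np.linalg.norm(r, 1))

# any other feasible t only increases the objective
t_other = t_best + np.abs(rng.standard_normal(20))
assert t_other.sum() >= np.linalg.norm(r, 1)
```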
5. LINEAR ALGEBRA
A. Lecture
Every x ∈ Rⁿ decomposes into orthogonal row-space and null-space components:
Rⁿ = C(Aᵀ) ⊕⊥ N (A)
x ∈ Rⁿ, x = x₁ + x₂, x₁ ∈ C(Aᵀ), x₂ ∈ N (A), x₁ᵀx₂ = 0
Recall the eigenvalue decompositions: in general A = U DU⁻¹ (when A is diagonalizable), and for symmetric A
A = U DUᵀ
where U is orthonormal.
[Fig. 1. Gilbert Strang - The Four Fundamental Subspaces: the row space C(Aᵀ) (all Aᵀy, dimension r) and the nullspace N (A) (Ax = 0, dimension n − r) are orthogonal complements in Rⁿ; the column space C(A) (all Ax, dimension r) and the nullspace of Aᵀ (dimension m − r) are orthogonal complements in Rᵐ. Highly recommend watching Strang's video on youtube.]

2) Rectangular SVD: What happens in non-square, non-symmetric m × n matrices? Given an m × n matrix A of rank r, we have the following fundamental subspaces of linear algebra:
• Column space C(A) ⊆ Rᵐ - all the combinations of the columns of A, dimension r:
C(A) = {Ax : x ∈ Rⁿ}
• Row space C(Aᵀ) ⊆ Rⁿ - all the combinations of the rows of A, dimension r.
• Null space N (A) ⊆ Rⁿ - all the vectors such that Ax = 0, dimension n − r:
N (A) = {x ∈ Rⁿ : Ax = 0}
• Left null space N (Aᵀ) ⊆ Rᵐ, dimension m − r.
Together:
Rᵐ: C(A) = [N (Aᵀ)]⊥
Rⁿ: C(Aᵀ) = [N (A)]⊥
The main tool for working with non-square non-symmetric matrices is the singular value decomposition (SVD):
A = U DVᵀ    (∗)
where U and V are square unitary matrices and D = diag {d_i} is a rectangular diagonal matrix with non-negative singular values.
A non-efficient but intuitive derivation is via the EVD (this is not a proof):
AᵀA = V DᵀUᵀU DVᵀ = V diag {d_i²} Vᵀ
3) Least squares: Probably the most studied (convex) optimization problem, Gauss 1794. End-
less applications - curve fitting, statistical estimation, minimum distance, etc...
min_x ∥Ax − b∥²
This is of course an unconstrained convex optimization problem (with or without the square, as it is a norm or a positive semidefinite quadratic form).
∇f (x) = AT (Ax − b) = 0
• Case 1: A is a square invertible matrix. Ax = b has a unique solution which satisfies the
optimality conditions
x = A−1 b
• Case 2: Fewer equations than unknowns. Ax = b has infinitely many solutions, all of which are optimal as they satisfy the optimality conditions. The Hessian has zero eigenvalues, as AᵀA is a large matrix with small rank (rank deficient), and the problem is not strictly convex. One possible solution: AAᵀ is a small matrix, usually of full rank and invertible¹, and it is easy to verify that one solution is
x₀ = Aᵀ(AAᵀ)⁻¹ b = A†b
while the general solution is
x = x₀ + z, z ∈ N {A}
An alternative approach to the same problem is to use other generalized inverses. Indeed,
the general solution to the problem is
x = B T (AB T )−1 b
¹ Unless some of the equations are linearly dependent.
where B is some matrix so that AB T is invertible. What is the relation between B and z?
• Case 3: More equations than unknowns. Ax = b has no solution, but we can still search for the least squares approximation: we solve the normal equations AᵀAx = Aᵀb instead of the overdetermined Ax = b. When AᵀA is invertible, the classical LS solution is
x_LS = (AᵀA)⁻¹Aᵀb = A†b
The pseudo-inverse coincides with the inverse if A is square and invertible, but otherwise has different meanings. The solution is still A†b, and the pseudo-inverse always exists and is unique. To gain intuition, let us begin with the scalar case ax = b: basically, a zero singular value means we have a zero and cannot divide by zero. If a ≠ 0 the solution is clearly x = a⁻¹b, but if a = 0 we can choose whatever we want for x. To be consistent with the previous solutions, we define
x† = 1/x if x ≠ 0, and x† = 0 if x = 0.
Using the SVD, the normal equations become
AᵀAx = Aᵀb
V DᵀDVᵀx = V DᵀUᵀb
DᵀDVᵀx = DᵀUᵀb
which is exactly solved by
Vᵀx = D†Uᵀb
or
x = V D†Uᵀb = A†b
so that
A† = V D†Uᵀ
For completeness, the formal definition of the pseudo-inverse of A is the unique matrix (which always exists) such that AA†A = A, A†AA† = A†, (AA†)ᵀ = AA† and (A†A)ᵀ = A†A.
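The SVD construction and the four defining conditions can be verified numerically. A minimal sketch, assuming numpy, on a deliberately rank-deficient matrix:

```python
import numpy as np

# Build A† = V D† U' from the SVD and check the Moore-Penrose conditions.
rng = np.random.default_rng(6)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 3))  # 5x3, rank 2

U, d, Vt = np.linalg.svd(A)
keep = d > 1e-10                       # invert only the nonzero singular values
d_pinv = np.where(keep, 1 / np.where(keep, d, 1), 0.0)
D_pinv = np.zeros((3, 5))
D_pinv[:len(d), :len(d)] = np.diag(d_pinv)
A_pinv = Vt.T @ D_pinv @ U.T

assert np.allclose(A_pinv, np.linalg.pinv(A))
assert np.allclose(A @ A_pinv @ A, A)              # A A† A = A
assert np.allclose(A_pinv @ A @ A_pinv, A_pinv)    # A† A A† = A†
assert np.allclose((A @ A_pinv).T, A @ A_pinv)     # A A† symmetric
assert np.allclose((A_pinv @ A).T, A_pinv @ A)     # A† A symmetric
```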
Equality constrained problems:
min_{x:Ax=b} f (x)
The optimality condition is
∇f (x)ᵀ(y − x) ≥ 0 ∀ y : Ay = b
Writing y = x + v with v ∈ N (A), and noting that ±v are both feasible directions:
∇f (x)ᵀv ≥ 0 ∀ v ∈ N (A)
∇f (x)ᵀv = 0 ∀ v ∈ N (A)
which means
∇f (x) ⊥ N (A)
Let us get the same results by changing variables and eliminating the constraints:
min f (x0 + F z)
z
where x0 is some solution and AF = 0 so that F z ∈ N {A}. Using the chain rule, the optimality
condition is
F T ∇f = 0
or
∇f ᵀF z = 0 ∀ z
∇f ᵀv = 0 ∀ v ∈ N {A}
Minimum norm problems: we will assume of course that the problem is feasible and that there are infinitely many solutions:
min ∥x∥ s.t. Ax = b
The objective is non-differentiable, but we can take the square due to monotonicity
min xᵀx s.t. Ax = b
The feasible set states that x∗ = A†b + v where v ∈ N (A), and the optimality condition holds iff v = 0. Indeed, writing x₀ = A†b ∈ C(Aᵀ) and z ∈ N (A),
∥x₀ + z∥² = ∥x₀∥² + ∥z∥² ≥ ∥x₀∥².
Projection onto the column space of A:
min_{x:x=Ay for some y} ∥x − z∥
or
min_{x,y} ∥x − z∥ s.t. x = Ay
We eliminate the linear constraints
min_y ∥Ay − z∥
whose solution is
y∗ = A†z
and in terms of x
x∗ = Ay∗ = AA†z
so that the projection matrix is P = AA†. It is a matrix with 0/1 eigenvalues and the same eigenvectors as AAᵀ = U DUᵀ, and it is idempotent (P² = P).
B. Tirgul
1) Robust linear programming: One of the problems with optimization in engineering is that, unlike intuitive heuristic solutions, optimized solutions are usually non-robust. Fortunately, we can sometimes fix this. Quoting a robust optimization textbook:
A case study reported in [7] shows that the phenomenon we have just described is not an
exception: in 13 of 90 NETLIB Linear Programming problems considered in this study, already
0.01%-perturbations of ugly coefficients result in violations of some constraints, as evaluated at
the nominal optimal solutions by more than 50%. In 6 of these 13 problems the magnitude of
constraint violations was over 100%, and in PILOT4 it was as large as 210,000%, that is, 7
orders of magnitude larger than the relative perturbations in the data.
The techniques presented in this book as applied to the NETLIB problems allow one to eliminate
the outlined phenomenon by passing out of the nominal optimal to robust optimal solutions. At
the 0.1%-uncertainty level, the price of this immunization against uncertainty (the increase in
the value of the objective when passing from the nominal to the robust solution), for every one
of the NETLIB problems, is less than 1%.
minx cT x
s.t. aTi x ≤ bi ∀ ai ∈ Si
where
Si = {āi + Pi u : ∥u∥ ≤ 1}
The constraint
aTi x ≤ bi ∀ ai ∈ Si
is equivalent to
max aTi x ≤ bi
ai ∈Si
or
āᵢᵀx + max_{u:∥u∥≤1} uᵀPᵢᵀx ≤ bᵢ

Since max_{∥u∥≤1} uᵀv = ∥v∥, the robust LP becomes the convex program

min_x cᵀx
s.t. āᵢᵀx + ∥Pᵢᵀx∥ ≤ bᵢ
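The key identity is that the worst case of the uncertain constraint has the closed form āᵀx + ∥Pᵀx∥. A small pure-Python check (the data here is illustrative, not from the text) comparing the closed form against sampled perturbations:

```python
import math, random

def worst_case_lhs(a_bar, P, x):
    # Worst case of a^T x over a = a_bar + P u, ||u|| <= 1:
    #   a_bar^T x + max_{||u||<=1} u^T (P^T x) = a_bar^T x + ||P^T x||.
    n = len(x)
    Ptx = [sum(P[i][j] * x[i] for i in range(n)) for j in range(len(P[0]))]
    return sum(ai * xi for ai, xi in zip(a_bar, x)) + math.sqrt(sum(v * v for v in Ptx))

random.seed(0)
a_bar, P, x = [1.0, 2.0], [[0.5, 0.0], [0.0, 0.1]], [1.0, -1.0]
wc = worst_case_lhs(a_bar, P, x)
best = -1e18
for _ in range(20000):
    u = [random.gauss(0, 1) for _ in range(2)]
    nrm = math.sqrt(sum(v * v for v in u))
    u = [v / nrm for v in u]          # random direction on the unit sphere
    a = [a_bar[i] + sum(P[i][j] * u[j] for j in range(2)) for i in range(2)]
    best = max(best, sum(ai * xi for ai, xi in zip(a, x)))
print(wc, best)  # the sampled maximum approaches the closed form from below
```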
2) Semidefinite programming: A higher level in the conic programming hierarchy: LP, SOCP
and SDP.
min_x cᵀx
s.t. F₀ + Σᵢ xᵢFᵢ ⪰ 0
     Ax = b

Example (eigenvalue minimization). Let

A(x) = A₀ + Σᵢ xᵢAᵢ,  with symmetric Aᵢ
We want to solve

min_{x,t} t  s.t.  λ_max(A(x)) ≤ t

which is

min_{x,t} t  s.t.  λᵢ(A(x)) ≤ t  ∀i

or, as an SDP,

min_{x,t} t  s.t.  A(x) ⪯ tI
Similarly, let

A(x) = A₀ + Σᵢ xᵢAᵢ,  with symmetric Aᵢ

We want to solve (minimizing the spread of the eigenvalues)

min_{x,t₁,t₂} t₁ − t₂
s.t. λ_max(A(x)) ≤ t₁
     λ_min(A(x)) ≥ t₂

which is the SDP

min_{x,t₁,t₂} t₁ − t₂
s.t. t₂I ⪯ A(x) ⪯ t₁I
Finally, let

A(x) = A₀ + Σᵢ xᵢAᵢ,  non-symmetric

We want to solve

min_x ∥A(x)∥₂

where ∥X∥₂ is the maximum singular value of X, i.e. the square root of the maximum eigenvalue of XᵀX (and simply the maximum eigenvalue when X ⪰ 0). This is convex because it is a norm:

min_x λ_max(Aᵀ(x)A(x))

min_{x,t} t²  s.t.  λ_max(Aᵀ(x)A(x)) ≤ t²

min_{x,t} t  s.t.  Aᵀ(x)A(x) ⪯ t²I

Using the Schur complement,

{AᵀA ⪯ t²I} = {(A, t) : [tI A; Aᵀ tI] ⪰ 0}

so we obtain the SDP

min_{x,t} t
s.t. [tI A(x); Aᵀ(x) tI] ⪰ 0
A. Lecture
1) Gradient Descent: The simplest and most natural method is gradient descent (GD):

x_{k+1} = x_k − t_k∇f(x_k),  k = 1, ⋯, K

Basically it transforms a high-dimensional problem into a sequence of line searches. The step size t_k is chosen using a line search:

min_{t_k ≥ 0} f(x_k − t_k g),  g = ∇f(x_k)
ZigZag issue: steepest descent with exact line search always advances in orthogonal successive directions (not very useful). Proof: the optimal step size t_min satisfies

(∂/∂t) f(x_k − t∇f(x_k))|_{t_min} = 0

Using the chain rule:

∇f(x_{k+1})ᵀ∇f(x_k) = 0
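For a quadratic f(x) = ½xᵀAx the exact step has the closed form t = gᵀg / gᵀAg, so the zigzag property can be checked numerically; a pure-Python sketch with an illustrative diagonal A:

```python
# Exact line search GD on f(x) = 0.5 x^T A x with diagonal A.
# With the exact step t = g^T g / g^T A g, consecutive gradients are orthogonal.
A = [1.0, 10.0]                       # eigenvalues of the (diagonal) Hessian
x = [1.0, 1.0]

def grad(v):
    return [A[i] * v[i] for i in range(2)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

orthos = []
for k in range(10):
    g = grad(x)
    t = dot(g, g) / sum(A[i] * g[i] * g[i] for i in range(2))  # exact minimizer of f(x - t*g)
    x = [x[i] - t * g[i] for i in range(2)]
    orthos.append(dot(grad(x), g))    # the zigzag property: ~0 every step
print(max(abs(v) for v in orthos))
```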
Backtracking: Basically, begin with t = 1 and multiply it by some β ∈ (0, 1), say β = 0.5, until we get descent. Unfortunately, requiring descent alone does not work theoretically (in practice it is more or less OK).

The iteration x_{k+1} = x_k − f′(x_k) is a descent method (it always chooses t = 1), but for some functions it does not converge if |x₀| > 1 and jumps between ±1, even though the minimum is at x = 0. The conclusion: descent is not sufficient, we need a bit more. One approach is backtracking with a sufficient-decrease condition:
function t = backtracking(f, g, x, a, b)
% a in (0, 0.5), b in (0, 1)
t = 1;
while f(x - t*g) > f(x) - a*t*g'*g
    t = b*t;
end
mI ⪯ ∇²f(x) ⪯ MI

The lower bound gives

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)∥y − x∥²,  ∀y, x

Taking the minimum with respect to y on both sides of the lower bound (the right-hand side is minimized at y_min = x − (1/m)∇f(x)) yields

f(y) ≥ f(x) − (1/2m)∥∇f(x)∥₂²,  ∀y, x

which leads to

f(x_min) − f(x) ≥ −(1/2m)∥∇f(x)∥₂²
This equation quantifies our distance from optimality using ∥∇f (x)∥ (a stopping criterion).
Together we get

f(x − t_min∇f(x)) − f(x_min) ≤ (1 − m/M)(f(x) − f(x_min))

This is called linear convergence since the log-error vs. iterations graph is linear. The convergence rate depends on the condition number of the Hessian, which is directly related to the ratio m/M and therefore defines the convergence rate. Later on, this will be the motivation for Newton.
2) Newton method: The 1D algorithm was written by Isaac Newton in 1669 and published in
1711. Joseph Raphson published a similar description in 1690 (a good friend of Newton’s).
Basically a root finding method applied to f ′ (x∗ ) = 0.
Assume that f is twice continuously differentiable and that we can efficiently compute f , f ′ and
f ′′ . Newton’s method is defined as follows: At the tth iteration, f is approximated as a quadratic
function around xt−1 :
f(x) ≈ f(x_{t−1}) + (x − x_{t−1})f′(x_{t−1}) + ½(x − x_{t−1})²f″(x_{t−1})

The value of x_t is the minimizer of this quadratic approximation:

x_t = argmin_x f(x_{t−1}) + (x − x_{t−1})f′(x_{t−1}) + ½(x − x_{t−1})²f″(x_{t−1})
    = x_{t−1} − f′(x_{t−1})/f″(x_{t−1})

Equivalently, x_t is the zero of the linear approximation of f′ around x_{t−1}, and at convergence f′(x_t) = 0.
Is it a descent method? For small steps,

f(x_{t−1} − f′(x_{t−1})/f″(x_{t−1})) ≈ f(x_{t−1}) − [f′(x_{t−1})]²/f″(x_{t−1})

Yes, if f″(x_{t−1}) > 0, for example if f(x) is strictly convex. But it can just as well be an ascent method if f(x) is strictly concave.
Quadratic convergence: Let x∗ be local minimizer of f such that f is three times continuously
differentiable in a neighborhood of x∗ with f ′ (x∗ ) = 0, f ′′ (x∗ ) > 0. Then the Newton iterates
converge to x∗ quadratically, provided that the starting point is close enough to x∗ .
x_t − x* = x_{t−1} − x* − f′(x_{t−1})/f″(x_{t−1})

         = [−f′(x_{t−1}) − f″(x_{t−1})(x* − x_{t−1})] / f″(x_{t−1})

         = [f′(x*) − f′(x_{t−1}) − f″(x_{t−1})(x* − x_{t−1})] / f″(x_{t−1})    (add a zero first term, f′(x*) = 0)

         = [½f‴(ξ)/f″(x_{t−1})] (x* − x_{t−1})²    (second-order Taylor remainder with some ξ)

so that

|x_t − x*| ≤ K|x_{t−1} − x*|²
Note that we assume that throughout the process f ′′ (xt−1 ) > 0 and that x∗ is not an end point.
The convergence is quadratic: the error is essentially squared (the number of accurate digits
roughly doubles) at each step.
Example 8. Consider the convex function f(x) = eˣ − 10x. The minimum is clearly x_opt = log 10. The method yields

x_{k+1} = x_k − (e^{x_k} − 10)/e^{x_k} = x_k − 1 + 10e^{−x_k}

Starting with x = x_opt + 0.1, the method reaches Matlab's 15-digit precision in 4 iterations.
Example 9. Consider the convex function f(x) = (2/3)|x|^{3/2}. The minimum is clearly x_opt = 0, but if we start Newton's method at x₀ = 1, the algorithm continues as x_k = 1 for even k and x_k = −1 for odd k, and never reaches the minimum.

The remedy, in any dimension, is damped Newton, which adds a line-search step size to the Newton direction:

x_{k+1} = x_k − t_k[∇²f(x_k)]⁻¹∇f(x_k)
3) Interior point method (barrier): Idea: starting with a strictly feasible point, improve it inside the set while penalizing for getting out of the set.

min_x f₀(x)
s.t. fᵢ(x) ≤ 0

is equivalent to

min_x f₀(x) + Σᵢ I₋(fᵢ(x))

where I₋(u) = 0 for u ≤ 0 and ∞ otherwise. The barrier method replaces I₋ with the smooth approximation I_t(u) = −(1/t)log(−u):
function x = barrier()
% Input:
%   strictly feasible x
%   t > 0
%   mu > 1
%   eps > 0
while m/t > eps        % m = number of inequality constraints
    x = argmin_x f0(x) + sum_i It(fi(x)), starting at x;
    t = mu*t;
end
The interior point method can also be interpreted as iteratively solving the nonlinear KKT conditions by linearization around the previous (easier) point, where "easier" means relaxing the complementary slackness condition with a parameter t. This comes out of the math.
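A tiny one-dimensional illustration (our own toy problem, not from the lecture): for min x s.t. x ≥ 1, the centering problem min_x t·x − log(x − 1) has the closed-form minimizer x(t) = 1 + 1/t, which approaches x* = 1 as the outer loop grows t:

```python
# Barrier method on: min x  s.t.  x >= 1  (true optimum x* = 1).
# Centering step min_x t*x - log(x-1) solved in closed form: x(t) = 1 + 1/t.
t, mu, eps = 1.0, 10.0, 1e-6
while 1.0 / t > eps:          # m/t (m = 1 constraint) bounds the suboptimality
    x = 1.0 + 1.0 / t         # inner "Newton solve", here exact
    t = mu * t
print(x)  # close to 1
```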
B. Tirgul
1) Gradient Descent - quadratic analysis: Let’s dive deeper into this and introduce the ideas
behind acceleration.
Let’s start with minimization of a simple quadratic function which is both strongly convex and
smooth:
min_x ½xᵀAx    (8)

with

mI ⪯ A ⪯ MI    (9)

The optimal solution is clearly x = 0. Let's analyze how GD approaches it. GD with step size t yields

x_{k+1} = x_k − tAx_k    (10)
        = (I − tA)x_k = ⋯    (11)
        = (I − tA)^k x₁    (12)

so the convergence is governed by

∥x_{k+1}∥ ≤ ∥I − tA∥₂^k ∥x₁∥    (13)

where

∥I − tA∥₂ = max_i |1 − tλᵢ(A)|    (14)
If A = λI then clearly we want t = 1/λ. That is: small t for large λ and vice versa. In the general case, we need to choose one step size that will perform well under all possible eigenvalues:

min_t max_i |1 − tλᵢ|

The solution is t = 2/(m + M) (proof: at the optimum the two extreme eigenvalues m and M attain equality, each choosing a different sign inside the absolute value). This gives a rate of 1 − 2m/(m + M). Approximately, this yields t = 1/M with a rate of 1 − m/M.
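The optimality of t = 2/(m + M) is easy to check numerically by scanning step sizes; a pure-Python sketch with illustrative m, M:

```python
# Worst-case contraction factor of GD on a quadratic with spectrum in [m, M]:
#   rho(t) = max(|1 - t*m|, |1 - t*M|), minimized at t = 2/(m + M).
m, M = 1.0, 10.0

def rho(t):
    return max(abs(1 - t * m), abs(1 - t * M))

t_opt = 2.0 / (m + M)
grid_best = min(rho(0.001 * k) for k in range(1, 2001))  # scan t in (0, 2]
print(rho(t_opt), 1 - 2 * m / (m + M), grid_best)  # no grid point beats t_opt
```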
2) Subgradient descent for Lipschitz functions: We say f is L-Lipschitz if |f(x) − f(y)| ≤ L∥x − y∥ for all x, y. Intuitively, this is a sort of continuity where the function is limited in how fast it can change.

Examples:
• Every function that is defined on an interval and has a bounded first derivative.
• f(x) = |x|, even though it is not differentiable at zero.
• f(x) = x² and f(x) = eˣ are not globally Lipschitz as their slope increases without bound.
Theorem 13. Let f be convex and L-Lipschitz. Subgradient descent starting at x₁ such that ∥x₁ − x*∥ ≤ R, with step size t = R/(L√K), satisfies

f( (1/K)Σ_{k=1}^K x_k ) − f(x*) ≤ RL/√K
Proof.
Some comments: this is not a descent method (f is not differentiable); we use the average iterate rather than the last point; the bound is dimension independent (a main advantage of (sub)gradient methods); and the step sizes are very small and pessimistic (inversely proportional to √K).
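A minimal sketch of Theorem 13 on the scalar example f(x) = |x| (so L = 1), checking that the averaged iterate meets the RL/√K bound:

```python
import math

# Subgradient method on f(x) = |x| with the Theorem-13 step t = R/(L*sqrt(K)).
# Here L = 1, x1 = 1, so R = 1 and the bound is 1/sqrt(K).
K = 400
t = 1.0 / math.sqrt(K)
x, total = 1.0, 0.0
for k in range(K):
    total += x
    sg = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)  # a subgradient of |x|
    x = x - t * sg
avg = total / K
print(abs(avg), 1.0 / math.sqrt(K))  # f(average) is below the RL/sqrt(K) bound
```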
3) Acceleration: Momentum.

The problem is clearly that we do not know where x will be, closer to v_min or v_max, and we need to prepare for the worst case. The idea behind acceleration is to add a sort of regularization that ensures robust performance all over. The full analysis is involved, so we will stick to a scalar quadratic with unknown scale and show how momentum performs well for all scales in some interval:

min_x (h/2)x²    (17)

Momentum (the heavy-ball method) adds a memory term to GD:

x_{k+1} = x_k − thx_k + β(x_k − x_{k−1})    (18)
To analyze its performance, we will use an old trick from linear systems
[x_{k+1}; x_k] = [1 − th + β  −β; 1  0] [x_k; x_{k−1}]    (20)

which can be compactly written as

w_{k+1} = Bw_k    (21)

or

w_{k+1} = B^k w₁    (22)

The distance from zero is therefore bounded by ∥B∥₂^k and is dictated by the absolute value of the (possibly complex) eigenvalues of B. In particular, choosing

t = (2 + 2β)/(m + M),  β = 1 − 2/(√κ + 1)    (24)

yields eigenvalues of equal magnitude

|λ₁| = |λ₂| = √β  ∀h ∈ [m, M]    (25)
Thus, the worst case rate of momentum, √β ≈ 1 − 1/√κ, is much better than the worst case of GD, which was approximately 1 − 1/κ. Of course, when h ≈ (m + M)/2, GD is much better than momentum, but momentum is very robust and has a rather constant rate for all h ∈ [m, M]. Show graph.

In practice, simply choosing a small t with β = 0.9 yields a pair of eigenvalues with equal absolute values that are roughly √β.
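The equal-magnitude claim (25) follows from det(B) = β (the product of the eigenvalues), and is easy to verify numerically; a sketch with illustrative m, M:

```python
import cmath, math

# Heavy-ball momentum on f(x) = (h/2)*x**2: the iteration matrix is
#   B = [[1 - t*h + beta, -beta], [1, 0]],  with det(B) = beta.
# With t = (2 + 2*beta)/(m + M) and beta = 1 - 2/(sqrt(kappa) + 1),
# both eigenvalues have magnitude sqrt(beta) for every h in [m, M].
m, M = 1.0, 100.0
kappa = M / m
beta = 1 - 2 / (math.sqrt(kappa) + 1)
t = (2 + 2 * beta) / (m + M)

mags = []
for h in [m, 0.5 * (m + M), M]:
    tr = 1 - t * h + beta                  # trace of B
    disc = cmath.sqrt(tr * tr - 4 * beta)  # complex inside the interval
    mags += [abs((tr + disc) / 2), abs((tr - disc) / 2)]
print(mags, math.sqrt(beta))  # all magnitudes equal sqrt(beta)
```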
Momentum is intuitive and works great for quadratic function. For more general convex function,
it can be extended to Nesterov’s Accelerated Gradient Descent (NAG):
Very similar to momentum, but the gradient is computed after applying the momentum step. Assuming a strongly smooth and strongly convex function, the rate improves from e^{−T/κ} in GD to e^{−T/√κ} in NAG.
7. MINIMIZATION MAJORIZATION

A. Lecture

In minimization majorization (MM) we iteratively minimize a majorizing surrogate:

x ← argmin_x Q(x; x′)

with

Q(x; x′) ≥ f(x)  ∀x

and

Q(x′; x′) = f(x′)

MM is especially useful when the sub-problems have a simple and elegant solution.
f(x) = f(x′) + ∇ᵀf(x′)(x − x′) + ½(x − x′)ᵀ∇²f(x″)(x − x′)    (33)

f(x) ≤ f(x′) + ∇ᵀf(x′)(x − x′) + (M/2)∥x − x′∥² = Q(x; x′)    (34)

Therefore, our sub-problems are

x ← argmin_x f(x′) + ∇ᵀf(x′)(x − x′) + (M/2)∥x − x′∥²    (35)

Setting the gradient to zero,

∇f(x′) + M(x − x′) = 0    (36)

which yields gradient descent with step size 1/M:

x = x′ − (1/M)∇f(x′)    (37)
Next, consider an L1-regularized problem min_x g(x) + λ∥x∥₁ with

∇²g(x) ⪯ MI    (39)

Let's bound the quadratic part as before and leave the L1 regularization as is:

Q(x; x′) = g(x′) + ∇ᵀg(x′)(x − x′) + (M/2)∥x − x′∥² + λ∥x∥₁    (40)

Therefore

x ← argmin_x ∇ᵀg(x′)x + (M/2)∥x − x′∥² + λ∥x∥₁    (41)
What’s the main advantage of these subproblems?
4) Iterative Reweighted Least Squares (IRLS): In IRLS we use a quadratic upper bound. For example, it is useful in logistic and robust regression:

|a| ≤ (1/2)·a²/|a′| + (1/2)|a′|    (44)

Thus,

∥y − Hx∥₁ ≤ (1/2)Σᵢ (yᵢ − hᵢᵀx)²/|yᵢ − hᵢᵀx′| + const    (45)

Minimizing this weighted least squares majorizer yields the normal equations

(HᵀW′H)x = HᵀW′y    (47)

where

W′ = diag( 1/|yᵢ − hᵢᵀx′| )    (49)
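In the scalar special case H = 1 (a column of ones), the L1 problem min_x Σᵢ|yᵢ − x| is solved by the median, and the IRLS update reduces to a weighted mean; a pure-Python sketch (data and the small eps guard are ours):

```python
# IRLS for min_x sum_i |y_i - x| (solution: the median of y).
# Each step solves the weighted least squares majorizer:
#   x <- sum_i w_i*y_i / sum_i w_i,   w_i = 1/|y_i - x'|  (guarded by eps).
y = [1.0, 2.0, 10.0]
x, eps = sum(y) / len(y), 1e-9
for _ in range(200):
    w = [1.0 / max(abs(yi - x), eps) for yi in y]
    x = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
print(x)  # close to the median, 2.0
```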
Another family of examples is minimizing a difference f₁(x) − f₂(x), where both f₁ and f₂ are convex. The beauty is that −f₂(x) is concave and can always be upper bounded by its linear approximation:

−f₂(x) ≤ −f₂(x′) − ∇ᵀf₂(x′)(x − x′)
For example, a classical special case is solving non-convex sparse linear systems: we solve

min_{x:Ax=b} Σᵢ log(ε + |xᵢ|)    (55)

Using

log(a) ≤ log(a′) + (a − a′)/a′,  a, a′ > 0    (53)

we bound

log(ε + |x|) ≤ |x|/(ε + |x′|) + const    (54)
Mirror descent methods replace the L2 norm with a different distance function that is more suitable for the simplex. For this purpose, we define the KL divergence:

d(x; x′) = Σᵢ xᵢ log(xᵢ/x′ᵢ)    (60)

and define

x ← argmin_{x∈S} f(x′) + ∇ᵀf(x′)(x − x′) + (1/α)d(x; x′)    (61)

where we assume α is small enough to ensure majorization (the exact value depends on the function f).
On the simplex, the sub-problem has the form min_x Σᵢ xᵢ log(xᵢ/yᵢ) + aᵀx subject to xᵀ1 = 1, x ≥ 0, where we assume yᵢ > 0. Indeed, let's ignore the positivity constraints and use the optimality condition with a linear constraint xᵀ1 = 1 (we learned this a few weeks ago):

∂/∂xᵢ :  1 + log(xᵢ/yᵢ) + aᵢ = λ    (63)

so that

xᵢ = yᵢe^{−aᵢ} / Σⱼ yⱼe^{−aⱼ}

Plugging this solution into the general optimization with a = α∇f(x′) and y = x′ yields the seminal exponentiated gradient iteration

xᵢ = x′ᵢe^{−α∇ᵢf(x′)} / Σⱼ x′ⱼe^{−α∇ⱼf(x′)}    (66)
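A minimal sketch of iteration (66) on a linear objective f(x) = cᵀx over the simplex (so ∇f = c; the data is illustrative); the iterates concentrate on the coordinate with smallest cᵢ:

```python
import math

# Exponentiated gradient on the simplex for f(x) = c^T x (grad f = c).
c = [3.0, 1.0, 2.0]
alpha = 0.5
x = [1.0 / 3] * 3                  # start at the uniform point
for _ in range(100):
    w = [xi * math.exp(-alpha * ci) for xi, ci in zip(x, c)]
    s = sum(w)
    x = [wi / s for wi in w]       # multiplicative update + normalization
print(x)  # nearly all mass on index 1, the minimizer of c
```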
8. DUALITY I
A. Lecture
Why?
• One problem may be easier than the other, e.g., water filling.
• Physical/practical meaning: support vector machines, uplink vs downlink duality, maximum likelihood vs maximum entropy, anomaly detection vs experiment design.
• Structure, optimality conditions, analysis.
• Lower bounds, analysis of existing solutions.
• Optimality gap recipe, certificates of impossibility.
• Primal-dual algorithms.
The art is to add constraints, implicitly or explicitly, in order to find the dual function.
Proof:

p* = f₀(x*)
   ≥ f₀(x*) + Σᵢ λᵢfᵢ(x*) + Σⱼ μⱼhⱼ(x*)    (the added term is non-positive for feasible primal and dual variables)
   = L(x*; λ, μ)
   ≥ inf_x L(x; λ, μ)
   = g(λ, μ)
Notations: we say that λ and µ are dual feasible if λ ⪰ 0 and g(λ, µ) > −∞.
Theorem 16 (Strong duality). If the problem is convex and Slater's condition holds, i.e., there exists a strictly feasible point, then strong duality holds and d* = p*.
Only the non-linear inequalities need to be strictly feasible, and the linear can be just feasible.
The first inner maximization is exactly f₀ with the domain defined as the feasible set:

max_{λ≥0,μ} L(x; λ, μ) = f₀(x)  x is feasible
                         ∞      otherwise

The second inner minimization is exactly the definition of the dual function.
In general, weak duality is the max-min inequality:

max_{λ⪰0,μ} min_x L(x; λ, μ) ≤ min_x max_{λ⪰0,μ} L(x; λ, μ)

Proof sketch of one direction:

⋯ = L(x*; λ*) ≥ L(x*; λ)    (λ* is the maximizer over λ)
2) Examples:

min_x cᵀx  s.t.  x ⪰ 0, Ax = b

L(x; λ, μ) = cᵀx − λᵀx + μᵀ(Ax − b)
           = [Aᵀμ − λ + c]ᵀx − bᵀμ

A linear function is bounded from below only when it is identically zero. Therefore

g(λ, μ) = −bᵀμ  Aᵀμ − λ + c = 0
          −∞    else

or, by eliminating λ ⪰ 0, the dual is

max_μ −bᵀμ  s.t.  Aᵀμ + c ⪰ 0
Example 11 (LP with box constraints). Sometimes its better not to put dual variables on all the
constraints (but just enough to make the inner problem solvable):
min cT x s.t. Ax = b, −1 ≤ x ≤ 1
x
This can be transformed into a standard LP, yielding an ugly dual. Instead, we can leave the box constraints implicit and put only one dual vector:

L = cᵀx + λᵀ(Ax − b)
  = −bᵀλ + xᵀ(Aᵀλ + c)

Minimizing over the box coordinate-wise using

−|a| = min_{−1≤x≤1} ax

so that

g(λ) = −bᵀλ − ∥Aᵀλ + c∥₁

Note that the dual is concave, and note the duality between the L1 and L∞ norms.
This is a theorem of alternatives, the logic behind duality. Basically it gives us certificates for impossibility results, which are much stronger than showing possibility (done by just exhibiting an example). Similarly, lower bounds on minimization problems are stronger than upper bounds (which any feasible point provides).
(I) Ax ≤ 0, cT x < 0
(II) AT y + c = 0, y≥0
Then only one of them holds, i.e., each system is feasible if and only if the other is not.
(P ) min cT x s.t. Ax ≤ 0
min xT x s.t. Ax = b
x
L(x; μ) = xᵀx + μᵀ(Ax − b)

Minimizing over x gives x = −½Aᵀμ, so g(μ) = −¼μᵀAAᵀμ − bᵀμ. Note that, as expected, this is a concave function in μ. The dual program is given by

max_μ −¼μᵀAAᵀμ − bᵀμ
Example 14 (Minimum volume covering ellipsoid). Think of anomaly detection in machine
learning or experiment design. Also very similar to Gaussian ML in graphical models (primal
is minus ML, dual is max entropy).
min_{X≻0} log|X⁻¹|
s.t. aᵢᵀXaᵢ ≤ 1, ∀i

Note the implicit PSD domain constraint. It's actually a max-determinant problem. The Lagrangian is

L(X; λ) = −log|X| + Σᵢ λᵢaᵢᵀXaᵢ − Σᵢ λᵢ
        = −log|X| + Tr{X Σᵢ λᵢaᵢaᵢᵀ} − Σᵢ λᵢ

Setting the gradient to zero gives X⁻¹ = Σᵢ λᵢaᵢaᵢᵀ, in which case

g(λ) = log|Σᵢ λᵢaᵢaᵢᵀ| + n − Σᵢ λᵢ

and the dual is

max_λ log|Σᵢ λᵢaᵢaᵢᵀ| + n − Σᵢ λᵢ
s.t. λ ≥ 0
The constraints are linear, so all we need for strong duality is feasibility, which always holds in this problem.
Interestingly, this is related to logdet experiment design where we pick the best experiments to
minimize the logdet MSE in regression
max_λ log|Σᵢ λᵢaᵢaᵢᵀ|
s.t. λ ≥ 0
     Σᵢ λᵢ = 1
B. Tirgul

Example 15 (L1 norm approximation). Consider min_x ∥Ax − b∥₁, written as min_{x,z} ∥z∥₁ s.t. z = Ax − b. The Lagrangian is

L = ∥z∥₁ + μᵀz − μᵀAx + μᵀb
  = Σᵢ (|zᵢ| + μᵢzᵢ) − μᵀAx + μᵀb
Using

min_w |w| + μw = 0   if |μ| ≤ 1
                 −∞  else

we get

g(μ) = μᵀb  if |μᵢ| ≤ 1 ∀i and Aᵀμ = 0
       −∞   else

and the dual program is

max_μ μᵀb
s.t. |μᵢ| ≤ 1 ∀i
     Aᵀμ = 0
You can also get a similar (equivalent) dual by formulating the primal problem as
min 1T t s.t. − t ≤ Ax − b ≤ t
x,t
When the original problem is convex and strong duality holds - these are all equivalent.
Example 16 (Rayleigh quotient). On rare occasions strong duality holds for nonconvex problems. Let A be a symmetric matrix and consider Rayleigh's quotient

min_x xᵀAx / xᵀx

What is the solution? The minimal eigenvalue, attained at the corresponding eigenvector. Let's derive this using duality. The problem is equivalent to the following constrained optimization

min_x xᵀAx
s.t. xᵀx = 1

which is nonconvex (due to the possibly indefinite quadratic objective and the quadratic equality constraint). The Lagrangian is

L(x; μ) = xᵀAx − μxᵀx + μ
        = xᵀ[A − μI]x + μ

which is bounded from below (with minimum μ) iff A − μI ⪰ 0, so the dual is

max_μ μ
s.t. A − μI ⪰ 0
This is of course a simple SDP. Now, lets see if it makes sense. This means λi (A) ≥ µ for all i,
or λmin (A) ≥ µ, and therefore the maximal µ is exactly d∗ = λmin which is what we expected
from the beginning. Clearly, this means d∗ = p∗ since we can choose x = umin and obtain
p∗ = λmin .
Example 17 (LP relaxation). Suppose cᵢ ≥ 0 and cᵢ ≠ cⱼ for i ≠ j. We want to find the K largest elements (K is an integer), that is, solve

min_x −Σᵢ cᵢxᵢ
s.t. xᵢ ∈ {0, 1}, Σᵢ xᵢ = K

We relax it to

min_x −Σᵢ cᵢxᵢ
s.t. 0 ≤ xᵢ ≤ 1    (68)
     Σᵢ xᵢ = K

The KKT conditions, with multipliers vᵢ for xᵢ ≥ 0, uᵢ for xᵢ ≤ 1 and λ for the sum constraint, are

0 ≤ xᵢ ≤ 1    (70)

Σᵢ xᵢ = K    (71)

vᵢ, uᵢ ≥ 0    (72)

vᵢxᵢ = 0,  uᵢ(xᵢ − 1) = 0    (73)

−cᵢ − vᵢ + uᵢ + λ = 0    (74)
We claim that at the optimal solution xi ∈ {0, 1}. Assume otherwise that there exist indices such
that 0 < xk < 1 and the rest are integral.
Exactly one fractional index is impossible, since this would contradict Σᵢ xᵢ = K with K an integer.

More than one fractional index is impossible since, by complementary slackness, we would have vᵢ = uᵢ = 0 for these indices, and then λ = cᵢ for more than one value of cᵢ, contradicting cᵢ ≠ cⱼ.
Example (equality constrained least squares). Consider min_x ∥Ax − b∥² s.t. Gx = h. Solution: one approach is a change of variables; alternatively, we can use Lagrange duality. The Lagrangian is L(x; ν) = ∥Ax − b∥² + νᵀ(Gx − h). Its minimizer is

x = −½(AᵀA)⁻¹(Gᵀν − 2Aᵀb)

and the dual function is

g(ν) = −¼(Gᵀν − 2Aᵀb)ᵀ(AᵀA)⁻¹(Gᵀν − 2Aᵀb) − νᵀh
The optimality conditions are

Gx* = h

2Aᵀ(Ax* − b) + Gᵀν* = 0

Thus

x* = (AᵀA)⁻¹(Aᵀb − ½Gᵀν*)

From the first optimality condition

G(AᵀA)⁻¹Aᵀb − ½G(AᵀA)⁻¹Gᵀν* = h

Solving for ν* yields

ν* = −2[G(AᵀA)⁻¹Gᵀ]⁻¹ (h − G(AᵀA)⁻¹Aᵀb)
First let's prove the well-known fact that the uniform distribution maximizes the entropy (which is concave):

max_p −Σᵢ pᵢ log(pᵢ)  s.t.  pᵢ ≥ 0, Σᵢ pᵢ = 1

or equivalently

min_p Σᵢ pᵢ log(pᵢ)  s.t.  pᵢ ≥ 0, Σᵢ pᵢ = 1

L = Σᵢ pᵢ log(pᵢ) − Σᵢ λᵢpᵢ + μ(Σᵢ pᵢ − 1)

∂L/∂pᵢ = log(pᵢ) + 1 − λᵢ + μ = 0
If pᵢ = 0 then the logarithm is −∞ and the condition cannot be satisfied, so all pᵢ are positive; by complementary slackness they have λᵢ = 0 and therefore pᵢ = e^{−μ−1}, which does not depend on i. Due to Σᵢ pᵢ = 1 we get pᵢ = 1/m.
p_PWL = min_x max_i {aᵢᵀx + bᵢ}

Reformulating as min_{x,t} t s.t. aᵢᵀx + bᵢ ≤ t ∀i, the Lagrangian is

L = t + Σᵢ λᵢ(aᵢᵀx + bᵢ − t)

and the dual is

d_PWL = max_{λᵢ≥0} Σᵢ λᵢbᵢ  s.t.  Σᵢ λᵢ = 1, Σᵢ λᵢaᵢ = 0
On the other hand, the non-smooth max function can be approximated using the log-sum-exp function:

p_GP = min_x log Σᵢ e^{aᵢᵀx + bᵢ}

or

min_{x,z} log Σᵢ e^{zᵢ}  s.t.  zᵢ = aᵢᵀx + bᵢ ∀i
whose Lagrangian is

L = log Σᵢ e^{zᵢ} + Σᵢ λᵢ(aᵢᵀx + bᵢ − zᵢ)

The Lagrangian is unbounded with respect to x unless Σᵢ λᵢaᵢ = 0. The optimality condition with respect to zᵢ yields

e^{zᵢ} / Σⱼ e^{zⱼ} − λᵢ = 0

which means

e^{zᵢ} = cλᵢ

for any positive constant c, with λᵢ > 0 and Σᵢ λᵢ = 1 (otherwise the problem is unbounded).
Plugging this back in, the constant c cancels and we obtain

d_GP = max_λ Σᵢ λᵢbᵢ − Σᵢ λᵢ log λᵢ  s.t.  Σᵢ λᵢ = 1, Σᵢ λᵢaᵢ = 0, λᵢ ≥ 0
Show that

0 ≤ d_GP − d_PWL ≤ log m

since

min_{λ≥0, Σᵢλᵢ=1} Σᵢ λᵢ log λᵢ = −log m
Example 20 (SVD problem). In the homework, you needed to prove that

∥q∥₂ = max_Z ∥Zq∥₂  s.t.  ∥Z∥_fro ≤ 1

There are different ways to solve this; the easiest is probably via the SVD. Instead, we will do it by duality, which is good practice. Note that the problem is non-convex. Yet we will show that strong duality holds. Indeed, this is the special case which we already discussed, namely quadratic programming with one quadratic constraint.
∥Zq∥₂² = Tr{Zqqᵀ Zᵀ},  ∥Z∥_fro² = Tr{ZZᵀ}

so the dual is

min_{λ≥0} max_Z Tr{Z(qqᵀ − λI)Zᵀ} + λ

or, writing the rows of Z as zᵢ,

min_{λ≥0} max_{zᵢ} Σᵢ zᵢᵀ(qqᵀ − λI)zᵢ + λ

The inner maximum is finite (and equals zero) iff

qqᵀ − λI ⪯ 0

i.e. λ ≥ ∥q∥₂², so the dual optimum is d* = ∥q∥₂², as required.
Example 21 (Support Vector Machine). The goal is to find a separating hyperplane between two sets of labeled points {xᵢ, yᵢ}, where xᵢ are the data and yᵢ ∈ ±1 are their labels. We want a simple decision rule of the sort

ŷ = sign(xᵀw + w₀)

and we seek the hyperplane maximizing the margin M, i.e. yᵢ(xᵢᵀw + w₀) ≥ M with ∥w∥ = 1. Note that we need ∥w∥ = 1, as otherwise we can just scale w, w₀ to get infinite margin. This is clearly a non-convex problem. But we can get rid of ∥w∥ = 1 by requiring

(1/∥w∥) · yᵢ(xᵢᵀw + w₀) ≥ M    (77)

or equivalently

yᵢ(xᵢᵀw + w₀) ≥ M∥w∥    (78)

These constraints are homogeneous, so we may as well arbitrarily choose ∥w∥ = 1/M and get

min_{w,w₀} ∥w∥²
s.t. yᵢ(xᵢᵀw + w₀) ≥ 1 ∀i    (79)
This is a convex quadratic minimization subject to linear constraints which can be efficiently
solved.
The KKT conditions of (79) are

yᵢ(xᵢᵀw + w₀) ≥ 1    (84)

αᵢ ≥ 0,  Σᵢ αᵢyᵢ = 0    (85)

αᵢ(1 − yᵢ(xᵢᵀw + w₀)) = 0    (86)

w − Σᵢ αᵢyᵢxᵢ = 0    (87)
From the complementary slackness, we see that if αᵢ > 0 then yᵢ(xᵢᵀw + w₀) = 1, i.e. xᵢ is on the boundary of the margin. Otherwise, αᵢ = 0. From the last KKT condition we get a characterization of w using the active αᵢ:
X
w= α i y i xi (88)
i
Using it and the complementary slackness, we can also get a simple characterization of w0 .
The derivation above assumes that a separating hyperplane exists. This is clearly not always
true, and the common SVM allows a few errors:
9. DUALITY II
A. Lecture
We will begin with the theory behind strong duality and then more examples.
Theorem 18. Let C and D be two convex sets that do not intersect. Then there exist a ̸= 0 and
b such that xT a ≤ b for all x ∈ C and xT a ≥ b for all x ∈ D.
We only prove a special case where there exist two points c ∈ C and d ∈ D that are closest to
each other in terms of Euclidean norm, i.e.,
∥c − d∥ = inf ∥x − y∥
x∈C,y∈D
The hyperplane is

f(x) = (d − c)ᵀx − (∥d∥² − ∥c∥²)/2

Note that on D we can write

f(x) = (d − c)ᵀ(x − d) + ½∥d − c∥²
Suppose f(u) < 0 for some u ∈ D. Starting at d, a short step in the direction u − d, i.e. d + t(u − d), remains in D (d and u are in D and it is convex) but decreases the distance to c, which is a contradiction.
2) Strong duality:
Theorem 19 (Strong duality). If the problem is convex and Slater's condition holds, i.e., there exists a strictly feasible point, then strong duality holds and d* = p*.
Proof:
Under Slater’s condition that there exists an x̃ such that fi (x̃) < 0.
g(λ∗ ) ≥ p∗
or
X
min f0 (x) + λ∗i fi (x) ≥ p∗
x
i
which is exactly
X
f0 (x) + λ∗i fi (x) ≥ p∗ ∀ x
i
For this purpose, we define two convex sets (A is convex by the convexity of the fᵢ):

A = {(u, t) : ∃x with fᵢ(x) ≤ uᵢ ∀i and f₀(x) ≤ t}
B = {(0, t) : t < p*}

We have A ∩ B = ∅, so there exists a separating hyperplane (λ̃, μ) ≠ 0 and a scalar α such that

(I)  λ̃ᵀu + μt ≥ α  ∀(u, t) ∈ A
(II) λ̃ᵀu + μt ≤ α  ∀(u, t) ∈ B

Due to the definition of A, if (u, t) ∈ A then we can add arbitrary positive numbers to u and t and still be in A. Together with (I) this means λ̃ ≥ 0 and μ ≥ 0 (otherwise λ̃ᵀu + μt would be unbounded from below on A).
From (II), μt ≤ α for all t < p*, hence μp* ≤ α. Together,

μf₀(x) + Σᵢ λ̃ᵢfᵢ(x) ≥ α ≥ μp*  ∀x

Why for all x? For any x, define uᵢ = fᵢ(x) and t = f₀(x), so that (u, t) ∈ A, and now use (I).
Assume that µ > 0, then we divide by µ and obtain the required result with λ∗ = λ̃/µ.
Now let's show that μ = 0 is impossible. Assume it holds; then applying the inequality to Slater's point x̃ gives

Σᵢ λ̃ᵢfᵢ(x̃) ≥ 0

Since fᵢ(x̃) < 0 and λ̃ ≥ 0, this forces λ̃ = 0, contradicting the existence of a non-zero separating hyperplane.
B. Tirgul
Very easy to interpret: we already know that if f is convex and the constraints are linear, h(x) = Ax − b = 0, then at the optimum

∇f ∈ C(Aᵀ)    (93)

(the row space of A). Locally, any continuous h can be linearized, and then Aᵀ is just the Jacobian.

Intuitively, what happens with inequality constraints? We only need Lagrange multipliers for the active ones, and to check feasibility of the non-active ones.
Recall the general optimality condition over a convex set S:

∇f(x*)ᵀ(x − x*) ≥ 0  ∀x ∈ S

Consider for example

min_{x≥0} f(x)

For each coordinate, either the constraint is inactive, in which case the corresponding partial derivative must vanish, or it is active and we are on the boundary, which means the derivative is not necessarily zero.
We define the KKT conditions as the existence of dual variables such that:

fᵢ(x*) ≤ 0, hⱼ(x*) = 0,   Primal feasibility
λ* ≥ 0,   Dual feasibility
λᵢ*fᵢ(x*) = 0,   Complementary slackness
∇f₀(x*) + Σᵢ λᵢ*∇fᵢ(x*) + Σⱼ μⱼ*∇hⱼ(x*) = 0,   Stationarity

Complementary slackness: either a constraint is active and then has a regular Lagrange multiplier, or we can omit it (λᵢ = 0).
Theorem 21 (Necessary). Under technical conditions, the KKT conditions are necessary for optimality of x*.
• Strong duality means that there exist x* and λ*, ν* which are feasible, optimal for (P) and (D), and have zero duality gap.
• Primal and dual feasibility are the first two KKT conditions.
• The zero-gap chain of inequalities only works if the inequalities are tight, which means λᵢfᵢ = 0 (complementary slackness).
• Minmax optimality of x* is the fourth KKT condition.
Theorem 22 (Sufficient). In convex problems (no technical conditions), the KKT are sufficient
for optimality of x∗ .
Proof:
Due to the 3rd KKT, L(x∗ ; λ∗ , µ∗ ) = f (x∗ ) which means that the lower bound is tight.
Another way to show sufficiency is to show that the KKT conditions imply the general necessary and sufficient convex optimality condition

∇f(x*)ᵀ(x − x*) ≥ 0  ∀x ∈ S

using primal feasibility fᵢ(x*) ≤ 0 (98) and dual feasibility λᵢ* ≥ 0 (99).
Example (water filling). Consider

max_p Σᵢ log(1 + hᵢpᵢ)  s.t.  pᵢ ≥ 0, Σᵢ pᵢ = P

where h > 0. The Lagrangian (of the minimization form) is

L(p; λ, μ) = −Σᵢ log(1 + hᵢpᵢ) − Σᵢ λᵢpᵢ + μ(Σᵢ pᵢ − P)
Complementary slackness and stationarity give

λᵢpᵢ = 0

−hᵢ/(1 + hᵢpᵢ) − λᵢ + μ = 0

Using the 4th KKT condition we get

λᵢ = μ − hᵢ/(1 + hᵢpᵢ) ≥ 0

If pᵢ = 0 then λᵢ = μ − hᵢ. Else, if pᵢ > 0, then λᵢ = 0 and μ − hᵢ/(1 + hᵢpᵢ) = 0, so that pᵢ = 1/μ − 1/hᵢ.
Altogether, we get

pᵢ(μ) = [1/μ − 1/hᵢ]₊

and we get to play with μ until Σᵢ pᵢ(μ) = P (this is basically the dual line search).
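The dual line search on μ can be done by bisection, since the total power is decreasing in μ; a pure-Python sketch with illustrative channel gains:

```python
# Water filling: p_i(mu) = max(1/mu - 1/h_i, 0), with mu chosen by bisection
# so that sum_i p_i(mu) = P.
h = [1.0, 2.0, 4.0]   # channel gains (illustrative)
P = 1.0               # power budget

def total_power(mu):
    return sum(max(1.0 / mu - 1.0 / hi, 0.0) for hi in h)

mu_lo, mu_hi = 1e-6, 1e6      # total_power is decreasing in mu
for _ in range(200):
    mu = 0.5 * (mu_lo + mu_hi)
    if total_power(mu) > P:
        mu_lo = mu            # too much power: raise the water level parameter
    else:
        mu_hi = mu
p = [max(1.0 / mu - 1.0 / hi, 0.0) for hi in h]
print(p, sum(p))  # weak channels may get zero power; strong ones get more
```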
A. Linear
We consider a set of linear inequalities Ax + b ≤ 0 which are infeasible and we want to find an
x which satisfies as many constraints as possible.
min_{x,t} Σᵢ tᵢ  s.t.  tᵢ ≥ max{aᵢᵀx + bᵢ, 0} ∀i

or equivalently the LP

min_{x,t} Σᵢ tᵢ  s.t.  tᵢ ≥ aᵢᵀx + bᵢ, tᵢ ≥ 0 ∀i
Consider again a linear system

Ax = b

but assume we have fewer equations than unknowns, hence infinitely many solutions. We already learned the minimum L2 norm solution

x = A†b

In some settings (which are very common recently) we have additional prior knowledge that x is sparse, for parsimony or computational efficiency. Thus we would like to solve

min_x ∥x∥₁  s.t.  Ax = b

where ∥x∥₁ = Σᵢ |xᵢ| is a convex surrogate for sparsity. This can be reformulated as

min_{x,t} 1ᵀt  s.t.  Ax = b, −t ⪯ x ⪯ t
Binary problems: we want to solve

min_x cᵀx
s.t. Gx ⪯ h
     Ax = b
     xᵢ ∈ {0, 1},  i ∈ I

and we relax the binary constraints to

0 ≤ xᵢ ≤ 1,  i ∈ I
Summary:
• Slack variables.
• Relaxation - underestimate function and then project.
max_x xᵀAx / xᵀx

Clearly non-convex, but w.l.o.g. we can assume xᵀx = 1:

max_x xᵀAx
s.t. xᵀx = 1

This is the classical PCA formulation, where we seek normalized weights x so that xᵀu, with u ∼ N(0, A), has maximal variance.
We lift the problem using X = xxᵀ:

max_{x,X} Tr{AX}
s.t. Tr{X} = 1
     X = xxᵀ

or equivalently

max_X Tr{AX}
s.t. Tr{X} = 1
     X ⪰ 0
     rank(X) = 1

and relax the nonconvex rank constraint:

max_X Tr{AX}
s.t. Tr{X} = 1
     X ⪰ 0

This will clearly yield an upper bound (we enlarge the feasible set).

Homework: prove that the relaxation is tight and the optimal X can be achieved by rank one!
B. Semidefinite relaxation

Binary QP:

min_x xᵀAx
s.t. xᵢ ∈ {−1, 1} ∀i

or

min_x xᵀAx
s.t. xᵢ² = 1 ∀i

Lifting with Q = xxᵀ:

min_Q Tr{AQ}
s.t. Qᵢᵢ = 1 ∀i
     Q ⪰ 0
     rank(Q) = 1

and relaxing the rank constraint:

min_Q Tr{AQ}
s.t. Qᵢᵢ = 1 ∀i
     Q ⪰ 0

If the optimal Q happens to be rank one, the relaxation is tight. Otherwise, we have a relaxation (lower bound) and can find an upper bound by rounding the solution back to a feasible binary point.
min_x xᵀWx
s.t. xᵢ² = 1, ∀i

The constraint basically means xᵢ ∈ ±1. This problem is known to be NP hard. The Lagrangian is

L(x; μ) = xᵀWx + Σᵢ μᵢ(xᵢ² − 1) = xᵀ(W + diag{μᵢ})x − Σᵢ μᵢ

A quadratic form is bounded from below only if it is positive semidefinite, and therefore

g(μ) = min_x L(x; μ) = −Σᵢ μᵢ  W + diag{μᵢ} ⪰ 0
                       −∞      else

and the dual program is

max_μ −Σᵢ μᵢ
s.t. W + diag{μᵢ} ⪰ 0

This already gives us a (highly non-trivial) lower bound.
The Lagrangian is (it's a max, so the dual variables are for ≥ inequalities)

L = −Σᵢ μᵢ + Tr{X(W + Σᵢ μᵢeᵢeᵢᵀ)}

Differentiating with respect to μᵢ yields

Tr{Xeᵢeᵢᵀ} = Xᵢ,ᵢ = 1

and the bidual is

min_X Tr{XW}
s.t. Xᵢ,ᵢ = 1, ∀i
     X ⪰ 0
A useful homogenization fact:

xᵀAx + 2bᵀx + c ≥ 0 ∀x  ⇔  [A b; bᵀ c] ⪰ 0

Indeed, the right-hand side means xᵀAx + 2bᵀxt + ct² ≥ 0 for all x and t; choosing t = 1 gives the left-hand side. Conversely, for t ≠ 0 substitute x → x/t and multiply by t² to get xᵀAx + 2bᵀxt + ct² ≥ 0, while for t = 0 the requirement xᵀAx ≥ 0 follows by a limiting argument.
In our problem,

b = −Hᵀy    (107)

c = yᵀy − Σᵢ μᵢ − t    (108)

with the constraints

Zᵢ,ᵢ = 1, ∀i

[Z z; zᵀ ζ] ⪰ 0
min ∥X − Y ∥2 (109)
X:X⪰0
L(X, W ) = ∥X − Y ∥2 − Tr {W X} (110)
2(X − Y ) − W = 0 (111)
1
X∗ = Y + W (112)
2
L(X*, W) = ¼∥W∥² − Tr{WY} − ½∥W∥²
          = −¼∥W∥² − Tr{WY}    (113)
The dual program is therefore

max_{W⪰0} −¼∥W∥² − Tr{WY}    (114)

Taking the dual of this dual, with a multiplier Z ⪰ 0 for the constraint W ⪰ 0:

L₂(W, Z) = −¼∥W∥² − Tr{WY} + Tr{ZW}
          = −¼∥W∥² + Tr{W(Z − Y)}    (115)

−½W + Z − Y = 0    (116)

W* = 2(Z − Y)    (117)
111
L₂(W*, Z) = −∥Z − Y∥² + 2∥Z − Y∥²
          = ∥Z − Y∥²    (118)

so the bidual is

min_{Z⪰0} ∥Z − Y∥²    (119)
How do we recover the primal solution from the dual?
• Usually, numerically.
• Sometimes, by a good guess.

For example, in the diagonal case Y = diag(y), the projection is

xᵢ = [yᵢ]₊    (121)

Therefore,

p* = Σᵢ ([yᵢ]₊ − yᵢ)² = Σ_{i:yᵢ<0} yᵢ²

Guessing wᵢ = [−2yᵢ]₊,

d* = −¼Σᵢ wᵢ² − Σᵢ wᵢyᵢ = −¼Σ_{i:yᵢ<0} 4yᵢ² − Σ_{i:yᵢ<0} (−2yᵢ)yᵢ = Σ_{i:yᵢ<0} yᵢ²    (124)
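The diagonal-case check in (124) is a two-line computation; a pure-Python sketch with illustrative y verifying p* = d*:

```python
# Diagonal PSD-projection duality check: for Y = diag(y),
#   primal: x_i = max(y_i, 0);   dual: w_i = max(-2*y_i, 0).
y = [0.5, -1.0, 2.0, -3.0]
x = [max(yi, 0.0) for yi in y]
w = [max(-2.0 * yi, 0.0) for yi in y]
p_star = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
d_star = sum(-0.25 * wi ** 2 - wi * yi for wi, yi in zip(w, y))
print(p_star, d_star)  # equal: strong duality, value = sum of y_i^2 over y_i < 0
```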
So the dual is

max_{Z⪰0} −Tr{Z₁₁} − Tr{Z₂₂}
s.t. Z₁₂ = −BᵀA    (131)

Now, let's guess primal and dual feasible variables that give equal objectives. Let

−BᵀA = UDVᵀ    (132)

and let

R* = VUᵀ    (133)

Z* = [UDUᵀ UDVᵀ; VDUᵀ VDVᵀ]    (134)

which both give an optimal value of −2Tr{D}.
Two other generalizations: a quadratic inequality subject to linear constraints, and quadratic subject to quadratic (the S-lemma).

Similar, but a bit more confusing, is SOCP (this is less known, which is unfortunate, as SOCP is very strong and much more efficient than SDP):

min_x fᵀx
s.t. [Aᵢx + bᵢ; cᵢᵀx + dᵢ] ⪰_K 0

which is just a fancy way of writing

min_x fᵀx
s.t. ∥Aᵢx + bᵢ∥ ≤ cᵢᵀx + dᵢ
The Lagrangian (with cone dual variables (uᵢ, vᵢ)) is

L = fᵀx − Σᵢ [uᵢ; vᵢ]ᵀ[Aᵢx + bᵢ; cᵢᵀx + dᵢ]
  = fᵀx − Σᵢ uᵢᵀAᵢx − Σᵢ uᵢᵀbᵢ − Σᵢ vᵢcᵢᵀx − Σᵢ vᵢdᵢ

and the dual is

max_{u,v} −Σᵢ uᵢᵀbᵢ − Σᵢ vᵢdᵢ
s.t. Σᵢ (Aᵢᵀuᵢ + vᵢcᵢ) = f
     ∥uᵢ∥ ≤ vᵢ
Consider, for example, the regularized least squares problem written with slack variables:

min_{x,t} xᵀx − 2xᵀg + gᵀg + λΣᵢ tᵢ
s.t. tᵢ ≥ ∥Eᵢx∥

Its Lagrangian is

L = xᵀx − 2xᵀg + gᵀg + λΣᵢ tᵢ − Σᵢ uᵢᵀEᵢx − Σᵢ vᵢtᵢ

The minimum with respect to x is attained at x = g + ½Σᵢ Eᵢᵀuᵢ, and the objective is unbounded from below in tᵢ unless vᵢ = λ, so the dual constraints reduce to

∥uᵢ∥ ≤ vᵢ = λ
Now this optimization is much easier: the feasible set is separable and we can iteratively solve for each uᵢ. In fact, each of these iterations has a closed-form solution:

max_{uᵢ} −∥g̃ᵢ − uᵢ∥²
s.t. ∥uᵢ∥ ≤ λ

whose solution is uᵢ = g̃ᵢ if it is feasible, and uᵢ = λg̃ᵢ/∥g̃ᵢ∥ otherwise.
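This inner step is just projection onto the Euclidean ball of radius λ; a minimal Python sketch (function name is ours):

```python
import math

def project_ball(g, lam):
    # Projection onto {u : ||u|| <= lam}:
    #   u = g if ||g|| <= lam, else lam * g / ||g||.
    n = math.sqrt(sum(v * v for v in g))
    if n <= lam:
        return list(g)
    return [lam * v / n for v in g]

print(project_ball([3.0, 4.0], 10.0))  # inside the ball: unchanged
print(project_ball([3.0, 4.0], 1.0))   # scaled to the boundary
```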