ISyE 3013: Optimization for Machine Learning
Lecture #03 2025-09-09
Instructor: Swati Padmanabhan, Teaching Assistant: Zhenyi Zhang
Last class, we studied second-order optimality conditions, both necessary and sufficient, for twice
continuously¹ differentiable functions. We saw that requiring positive semidefiniteness of the
Hessian at all points (a relaxation of the pointwise sufficiency condition) guarantees that every
local minimum is a global minimum. This special class of functions
falls under the umbrella of convexity. We then defined convex sets and used this definition to provide
a “secant inequality”-based definition of convex functions. In this lecture, we continue our study
of this very important class of functions and derive some of its equivalent characterizations.
3.1 Convex Sets, Convex Functions
Definition 3.D1. A set K ⊆ Rd is convex if for every pair of points x, y ∈ K, we have [x, y] ⊆ K.
Example 3.E1. Two important convex sets for us are: (1) halfspaces, and (2) ellipsoids.
Proof. (1) Convexity of a halfspace.
A halfspace is defined as the set H := {u ∈ Rd : a⊤u ≤ b}. Here, a is the normal vector that
points away from the halfspace. In order to prove that this set is convex, we must prove that given
any two points in H, the line segment joining them lies wholly inside H. Recall, what it means for
“a point to be contained in a set” is that “the point satisfies the equation (or inequality) describing
the set”.
[Figure: the halfspace x + y ≤ 1 in R², shaded below the line through (0, 1) and (1, 0); the point
(1, 1) lies outside it.]
Let us put this insight to work here. We start with two arbitrary points in H:
x, y ∈ H ⇐⇒ a⊤x ≤ b, a⊤y ≤ b. (3.1.1)
Now consider a z ∈ [x, y], i.e., z = λ · x + (1 − λ) · y for some λ ∈ [0, 1]. Then we have,
a⊤ z = a⊤ (λ · x + (1 − λ) · y) = λ · a⊤ x + (1 − λ) · a⊤ y ≤ λ · b + (1 − λ) · b = b,
where the third step uses Equation (3.1.1). Since this holds for all λ ∈ [0, 1], every point on the
line segment joining x and y lies in H, thereby proving the convexity of H.
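The argument above can be sanity-checked numerically. The sketch below (with the example choice a = (1, 1), b = 1, matching the halfspace x + y ≤ 1 in the figure; these values are illustrative, not from the notes) samples pairs of points in H and verifies that random convex combinations remain in H. This is supporting evidence, not a proof.

```python
import numpy as np

rng = np.random.default_rng(1)
a = np.array([1.0, 1.0])  # normal vector of H = {u : a^T u <= b} (example values)
b = 1.0

def in_halfspace(u):
    """Membership test for H, with a small tolerance for floating point."""
    return a @ u <= b + 1e-12

# Sample pairs x, y in H and check that a random convex combination stays in H.
checked = 0
while checked < 1000:
    x, y = rng.uniform(-5, 5, size=(2, 2))
    if not (in_halfspace(x) and in_halfspace(y)):
        continue
    lam = rng.random()
    assert in_halfspace(lam * x + (1 - lam) * y)
    checked += 1
print("checked 1000 sampled segments")
```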
(2) Convexity of an ellipsoid.
We now prove convexity of the ellipsoid. The ellipsoid is defined as the set
E := {x ∈ Rd : (x − xc)⊤ P(x − xc) ≤ 1}, P ≻ 0, xc ∈ Rd. (3.1.2)
For our purpose here, the following equivalent definition is useful:
E := {x ∈ Rd : ∥√P(x − xc)∥2 ≤ 1}, P ≻ 0, xc ∈ Rd. (3.1.3)
These notes have not been subjected to the usual scrutiny reserved for formal peer-reviewed publications. Thank
you for reporting any typos!
¹Please see the lecture notes for the precise conditions; we are simplifying some nuance in this summary.
The eigenvalues of the positive definite matrix P determine the lengths of the axes of the ellipsoid
(the semi-axis along the i-th eigenvector of P has length 1/√λi).
Why is E from Equation (3.1.3) a convex set? Let u, v ∈ E. This is equivalent to:
∥√P(u − xc)∥2 ≤ 1, ∥√P(v − xc)∥2 ≤ 1. (3.1.4)
Now consider a z ∈ [u, v], i.e., z = λ · u + (1 − λ) · v for some λ ∈ [0, 1]. Then we have,
∥√P(z − xc)∥2 = ∥√P(λ · u + (1 − λ) · v − xc)∥2
= ∥λ · √P(u − xc) + (1 − λ) · √P(v − xc)∥2
≤ λ · ∥√P(u − xc)∥2 + (1 − λ) · ∥√P(v − xc)∥2
≤ λ · 1 + (1 − λ) · 1 = 1,
where the second step writes xc = λ · xc + (1 − λ) · xc, the third uses the triangle inequality (and
absolute homogeneity of the norm), and the fourth uses Equation (3.1.4). Since z ∈ E for every
λ ∈ [0, 1], the set E is convex.
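A similar numerical sanity check works for the ellipsoid. The sketch below builds an example positive definite P (the specific P, center, and dimension are illustrative choices, not from the notes), forms √P via an eigendecomposition, samples points of E by mapping the unit ball through √P⁻¹, and checks that convex combinations stay in E.

```python
import numpy as np

rng = np.random.default_rng(0)

# An example positive definite P: A A^T + eps*I is PD for any square A.
A = rng.standard_normal((3, 3))
P = A @ A.T + 0.1 * np.eye(3)
xc = rng.standard_normal(3)

# Matrix square root of P via its eigendecomposition (valid since P ≻ 0).
w, V = np.linalg.eigh(P)
sqrtP = V @ np.diag(np.sqrt(w)) @ V.T

def in_ellipsoid(x):
    """Membership test from Equation (3.1.3): ||sqrt(P)(x - xc)||_2 <= 1."""
    return np.linalg.norm(sqrtP @ (x - xc)) <= 1 + 1e-12

def sample_point():
    """Sample a point of E by pulling back a point of the open unit ball."""
    d = rng.standard_normal(3)
    d *= rng.random() / np.linalg.norm(d)   # random direction, radius < 1
    return xc + np.linalg.solve(sqrtP, d)   # x = xc + sqrt(P)^{-1} d

for _ in range(1000):
    u, v = sample_point(), sample_point()
    lam = rng.random()
    z = lam * u + (1 - lam) * v
    assert in_ellipsoid(u) and in_ellipsoid(v) and in_ellipsoid(z)
print("all sampled convex combinations stayed inside E")
```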
For now, we have only proved the convexity of these sets. Throughout the course, we'll see them
play central roles in wide-ranging applications. As an example, the problem class of linear programs,
which we saw in the first lecture, is convex² partly because its feasible set is the intersection of
halfspaces (convexity is preserved under intersection). Similarly, ellipsoids are enormously useful
in data analysis (e.g., in identifying outliers), statistics (e.g., experiment design), in robotics (e.g.,
collision avoidance), and in approximating convex bodies (e.g., in the ellipsoid method, featured in
the newspaper article below). A wonderful resource on ellipsoids in optimization is [Tod16].
Definition 3.D2. A function f : K → R defined on a convex set K ⊆ Rd is convex (or convex over
K) if for every pair of points x, y ∈ K, we have
f (λ · x + (1 − λ) · y) ≤ λ · f (x) + (1 − λ) · f (y), for all λ ∈ [0, 1].
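Definition 3.D2 can also be tested numerically on sampled points. The helper below (a hypothetical `is_secant_convex`, not from the notes) checks the secant inequality on random pairs: passing all trials is merely supporting evidence for convexity, while a single violation certifies non-convexity.

```python
import numpy as np

rng = np.random.default_rng(2)

def is_secant_convex(f, lo=-10.0, hi=10.0, trials=10_000):
    """Test the secant inequality of Definition 3.D2 on random points.
    Returns False as soon as a violation is found."""
    for _ in range(trials):
        x, y = rng.uniform(lo, hi, size=2)
        lam = rng.random()
        lhs = f(lam * x + (1 - lam) * y)
        rhs = lam * f(x) + (1 - lam) * f(y)
        if lhs > rhs + 1e-9:
            return False
    return True

print(is_secant_convex(lambda x: x * x))  # convex: secant inequality holds
print(is_secant_convex(np.sin))           # not convex: violations exist
```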
Example 3.E2. Here are some convex functions.
[Figure: plots of four convex functions — f(x) = |x|, f(x) = ex, f(x, y) = x² + y², and
f(x, y) = |x| + |y|.]
²Note: until now, we defined convex sets and convex functions; convex programs will be introduced shortly.
We'll see later that there is a close link between convex functions and their epigraphs.
Problem 3.P1. Prove the convexity of: (1) f (x) = |x|, (2) f (x) = x2 .
With Definition 3.D2 in hand, we are now ready to revisit a big question we posed in the first
lecture: is there a class of functions for which a local minimum is in fact a global minimum? We
answer this question affirmatively in Theorem 3.T1 below.
3.1.1 Local Minima are Global Minima
Theorem 3.T1. When f is convex, any local minimizer x⋆ is a global minimizer of f . If, in
addition, f is differentiable, then any stationary point x⋆ is a global minimizer of f .
Proof. Let x⋆ be a local minimizer of f, and suppose, for contradiction, that x⋆ is not a global
minimizer, i.e., there exists z ̸= x⋆ with f(z) < f(x⋆). Consider the line segment joining the two
points, and let x ∈ (x⋆, z). Then
f(x) = f(λz + (1 − λ)x⋆) ≤ λf(z) + (1 − λ)f(x⋆) = λ(f(z) − f(x⋆)) + f(x⋆) < f(x⋆),
where the first step holds for some λ ∈ (0, 1) (since we chose x ∈ (x⋆, z)), the second step is by
Definition 3.D2, and the final step is because f(z) < f(x⋆) and λ > 0. Since x can be chosen
arbitrarily close to x⋆, this contradicts the local minimality of x⋆. To see the second part of the
claim, we revisit the definition of convexity:
λf (z) + (1 − λ)f (x⋆ ) ≥ f (λz + (1 − λ)x⋆ ),
where λ ∈ (0, 1). Rearranging terms and dividing throughout by λ (valid since λ > 0) yields:
f(z) − f(x⋆) ≥ (f(λz + (1 − λ)x⋆) − f(x⋆)) / λ.
The inequality above is preserved if we take λ → 0⁺:
f(z) − f(x⋆) ≥ lim_{λ→0⁺} (f(x⋆ + λ(z − x⋆)) − f(x⋆)) / λ
= (d/dλ) f(x⋆ + λ(z − x⋆)) |_{λ=0}
= ∇f(x⋆)⊤(z − x⋆),
where the second step is by the definition of the derivative of f restricted to a line (note that
λz + (1 − λ)x⋆ = x⋆ + λ(z − x⋆)), and the third by the connection between the directional derivative
and the gradient. At a stationary point, ∇f(x⋆) = 0, so the above gives f(z) ≥ f(x⋆) for every z,
which concludes both claims.
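The limiting step in this proof can be observed numerically. The sketch below uses an example convex quadratic f(x) = x⊤Ax with A ≻ 0 (an illustrative choice, not from the notes): the difference quotient along the segment from x⋆ to z approaches the directional derivative ∇f(x⋆)⊤(z − x⋆) as λ → 0⁺, and stays below f(z) − f(x⋆).

```python
import numpy as np

# Example convex quadratic f(x) = x^T A x with A ≻ 0, and its gradient.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
f = lambda x: x @ A @ x
grad = lambda x: 2 * A @ x

xstar = np.array([0.3, -0.7])
z = np.array([2.0, 1.5])

# The quotient (f(x* + λ(z - x*)) - f(x*)) / λ should approach
# the directional derivative ∇f(x*)^T (z - x*) as λ → 0+.
direc = grad(xstar) @ (z - xstar)
for lam in [1e-1, 1e-3, 1e-5]:
    dq = (f(xstar + lam * (z - xstar)) - f(xstar)) / lam
    print(lam, dq, direc)
```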
Problem 3.P2. Is a function that satisfies either of the following properties necessarily convex?
1. “every local minimum is a global minimum”.
2. “every stationary point is a global minimum”.
3.1.2 First-Order Taylor Approximator as Global Underestimator
In general, convex functions need not be differentiable. But when they are, they admit an alternate,
equivalent, first-order characterization, whose proof follows that of Theorem 3.T1.
Theorem 3.T2. Let f : K → R be a continuously differentiable function on convex set K ⊆ Rd .
Then f is convex on K if and only if f (y) ≥ f (x) + ∇f (x)⊤ (y − x) for any x, y ∈ K.
Proof. One direction follows the same idea as Theorem 3.T1. For the other direction, suppose that
the stated inequality holds for all points in K. We apply it to the two pairs of points (x, z) and
(y, z), where z = λx + (1 − λ)y for some λ ∈ [0, 1]:
f(x) ≥ f(z) + ∇f(z)⊤(x − z),
f(y) ≥ f(z) + ∇f(z)⊤(y − z). (3.1.5)
Multiplying the first inequality by λ, the second by (1 − λ), and adding gives:
λf(x) + (1 − λ)f(y) ≥ f(z) + ∇f(z)⊤(λx − λz + (1 − λ)y − (1 − λ)z),
which simplifies to the claim, since λx + (1 − λ)y − z = 0.
In plain English, Theorem 3.T2 states the following: the linear approximation of a convex function
at any point lies entirely below the function, everywhere. Thus, local information about a convex
function (its value and gradient at a point) tells us something about the function everywhere.
This is one of the most important properties of convex functions, since it is what enables the
development of fast (polynomial-time) algorithms for optimizing them: finding a minimum of an
arbitrary function can require exhaustive search, but this key fact enables a binary-search-like
strategy. We will see this powerful principle in action in all the algorithms in this class.
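As a quick numerical check of Theorem 3.T2, the sketch below uses the quadratic f(x) = 0.6·x² − 1.2·x + 1.6 from Example 3.E3 below and verifies on random points that every tangent line underestimates f. (For this particular f, the gap works out to 0.6·(y − x)² ≥ 0, so the inequality holds with no slack issues.)

```python
import numpy as np

rng = np.random.default_rng(3)

# The convex quadratic from Example 3.E3 and its derivative.
f = lambda x: 0.6 * x**2 - 1.2 * x + 1.6
fprime = lambda x: 1.2 * x - 1.2

# Theorem 3.T2: f(y) >= f(x) + f'(x) (y - x) for all x, y.
for _ in range(10_000):
    x, y = rng.uniform(-10, 10, size=2)
    assert f(y) >= f(x) + fprime(x) * (y - x) - 1e-9
print("every sampled tangent line underestimates f")
```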
Example 3.E3. In the plot below (for f(x) = 0.6 · x² − 1.2 · x + 1.6), we show the tangent
f(u) = f(x) + f′(x) · (u − x) at three different points, illustrating Theorem 3.T2.
[Figure: the parabola f(x) with tangent lines drawn at (x, f(x)), (y, f(y)), and (z, f(z)); each
tangent lies entirely below the curve.]
Problem 3.P3. Prove that a function is convex if and only if its epigraph is a convex set.
Remark 3.R1. Theorem 3.T2 provides an alternate proof of the second part of Theorem 3.T1:
since f(y) ≥ f(x) + ∇f(x)⊤(y − x) for all x, y ∈ dom(f), if for some x⋆ ∈ dom(f) we have
∇f(x⋆) = 0, then f(y) ≥ f(x⋆) for all y ∈ dom(f).
Problem 3.P4. In the previous lecture, we showed ex ≥ 1 + x using an application of Taylor’s
Theorem. Prove the same inequality using Theorem 3.T2.
Readings
The material in these notes is based on the following excellent sources: [BV04, Chapters 2 and 3]
and [NW06, Chapter 2.1].
References
[BV04] Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[NW06] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, 2006.
[Tod16] Michael J. Todd. Minimum-Volume Ellipsoids: Theory and Algorithms. SIAM, 2016.