
Convex functions

We have seen what it means for a set to be convex. In this set of notes, we start working towards what it means to be a convex function.

To define this concept rigorously, we must be specific about the subset


of RN where a function can be applied. Specifically, the domain
dom f of a function f : RN → RM is the subset of RN where f is
well-defined. We then say that a function f is convex if dom f is a
convex set, and
f (θx + (1 − θ)y) ≤ θf (x) + (1 − θ)f (y)
for all x, y ∈ dom f and 0 ≤ θ ≤ 1.
This inequality is easier to interpret with a picture. The left-hand side of the inequality above is simply the function f evaluated along the line segment between x and y. The right-hand side traces the straight line segment (the chord) between f(x) and f(y) as we move along this segment; for a convex function, this chord must lie above the graph of f.

[Figure: the chord θf(x) + (1 − θ)f(y) between the points (x, f(x)) and (y, f(y)) lies above the graph value f(θx + (1 − θ)y).]

We say that f is strictly convex if dom f is convex and

    f(θx + (1 − θ)y) < θf(x) + (1 − θ)f(y)

for all x, y ∈ dom f with x ≠ y and 0 < θ < 1.

Note also that we say that a function f is concave if −f is convex, and similarly for strictly concave functions. We are mostly interested in convex functions, but this is only because we are mostly restricting our attention to minimization problems. This restriction is justified because any maximization problem can be converted to a minimization problem by multiplying the objective function by −1. Everything that we say about minimizing convex functions also applies to maximizing concave ones.
We make a special note here that affine functions of the form

    f(x) = ⟨x, a⟩ + b

are both convex and concave (but neither strictly convex nor strictly concave). This is the only kind of function that has this property. (Why?)

Note that in the definition above, the domain matters. For example, f(x) = x^3 is convex if dom f = R+ = [0, ∞) but not if dom f = R.
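As a quick sanity check, one can sample the defining inequality numerically. The sketch below is in Python (using NumPy); the particular sample points and the tolerance are arbitrary illustrative choices. It finds no violation for f(x) = x^3 on R+, but does find one once negative points are allowed:

    import numpy as np

    f = lambda x: x**3
    theta = np.linspace(0.0, 1.0, 101)

    def violates(x, y):
        # True if f(theta*x + (1-theta)*y) > theta*f(x) + (1-theta)*f(y) anywhere on [0, 1]
        lhs = f(theta * x + (1 - theta) * y)
        rhs = theta * f(x) + (1 - theta) * f(y)
        return bool(np.any(lhs > rhs + 1e-12))

    print(violates(0.5, 3.0))    # False: the convexity inequality holds on R+
    print(violates(-2.0, 1.0))   # True: x^3 is not convex once negative x are allowed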

It will also sometimes be useful to consider the extension of f from dom f to all of R^N, defined as

    f̃(x) = f(x) for x ∈ dom f,    f̃(x) = +∞ for x ∉ dom f.

If f is convex on dom f, then its extension is also convex on R^N.

The epigraph
A useful notion that we will encounter later in the course is that of the epigraph of a function. The epigraph of a function f : R^N → R is the subset of R^{N+1} created by filling in the space above f:

    epi f = { (x, t) ∈ R^{N+1} : x ∈ dom f, f(x) ≤ t }.

[Figure: the epigraph epi f shaded above the graph of f.]

It is not hard to show that f is convex if and only if epi f is a convex set. This connection should help to illustrate how even though the definitions of a convex set and convex function might initially appear quite different, they actually follow quite naturally from each other.

Examples of convex functions

Here are some standard examples for functions on R:
• f(x) = x^2 is (strictly) convex.
• affine functions f(x) = ax + b are both convex and concave for a, b ∈ R.
• exponentials f(x) = e^{ax} are convex for all a ∈ R.

• powers x^α are:
  – convex on R+ for α ≥ 1,
  – concave on R+ for 0 ≤ α ≤ 1,
  – convex on R++ for α ≤ 0.
• |x|^α is convex on all of R for α ≥ 1.
• logarithms: log x is concave on R++ := {x ∈ R : x > 0}.
• the entropy function −x log x is concave on R++.

Here are some standard examples for functions on R^N:

• affine functions f(x) = ⟨x, a⟩ + b are both convex and concave on all of R^N.
• any valid norm f(x) = ‖x‖ is convex on all of R^N.
• if f1(x) and f2(x) are both convex, then the sum f1(x) + f2(x) is also convex.

A useful tool for showing that a function f : R^N → R is convex is the fact that f is convex if and only if the function g_v : R → R,

    g_v(t) = f(x + tv),    dom g_v = {t : x + tv ∈ dom f},

is convex for every x ∈ dom f and v ∈ R^N.

Example:
Let f(X) = − log det X with dom f = S^N_{++}, where S^N_{++} denotes the set of symmetric and (strictly) positive definite N × N matrices. For any X ∈ S^N_{++}, we know that

    X = U Λ U^T,

for some orthogonal U and diagonal, positive Λ, so we can define

    X^{1/2} = U Λ^{1/2} U^T,   and   X^{−1/2} = U Λ^{−1/2} U^T.

Now consider any symmetric matrix V and t such that X + tV ∈ S^N_{++}:

    g_V(t) = − log det(X + tV)
           = − log det( X^{1/2} (I + t X^{−1/2} V X^{−1/2}) X^{1/2} )
           = − log det X − log det(I + t X^{−1/2} V X^{−1/2})
           = − log det X − Σ_{i=1}^{N} log(1 + σ_i t),

where the σ_i are the eigenvalues of X^{−1/2} V X^{−1/2}. The function − log(1 + σ_i t) is convex in t, so the above is a sum of convex functions, which is convex.
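The last equality is easy to check numerically, and doing so is also a concrete instance of the restriction-to-a-line tool above. The sketch below is in Python (using NumPy); the dimension, the randomly built positive definite X, the symmetric V, and the step t are arbitrary illustrative choices. It compares −log det(X + tV) with −log det X − Σ_i log(1 + σ_i t):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 5
    A = rng.standard_normal((N, N))
    X = A @ A.T + N * np.eye(N)            # symmetric positive definite
    B = rng.standard_normal((N, N))
    V = (B + B.T) / 2                      # symmetric

    # eigenvalues sigma_i of X^{-1/2} V X^{-1/2}
    w, U = np.linalg.eigh(X)
    X_inv_half = U @ np.diag(w ** -0.5) @ U.T
    sigma = np.linalg.eigvalsh(X_inv_half @ V @ X_inv_half)

    t = 0.1                                # small enough that X + tV stays positive definite
    lhs = -np.linalg.slogdet(X + t * V)[1]
    rhs = -np.linalg.slogdet(X)[1] - np.sum(np.log(1 + sigma * t))
    print(np.isclose(lhs, rhs))            # True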

Operations that preserve convexity

There are a number of useful operations that we can perform on a convex function while preserving convexity. Some examples include:
• Positive weighted sum: A positive weighted sum of convex functions is also convex, i.e., if f1, . . . , fm are convex and w1, . . . , wm ≥ 0, then w1 f1 + . . . + wm fm is also convex.
• Composition with an affine function: If f : R^N → R is convex, then g : R^D → R defined by

    g(x) = f(Ax + b),

where A ∈ R^{N×D} and b ∈ R^N, is convex.

• Composition with scalar functions: Consider the function f(x) = h(g(x)), where g : R^N → R and h : R → R.
  – f is convex if g is convex and h is convex and non-decreasing.
    Example: e^{g(x)} is convex if g is convex.
  – f is convex if g is concave and h is convex and non-increasing.
    Example: 1/g(x) is convex if g is concave and positive.
• Max of convex functions: If f1 and f2 are convex, then f(x) = max(f1(x), f2(x)) is convex.

First-order conditions for convexity

We say that f is differentiable if dom f is an open set (all of R^N, for example), and the gradient

    ∇f(x) = [ ∂f(x)/∂x_1,  ∂f(x)/∂x_2,  . . . ,  ∂f(x)/∂x_N ]^T

exists for each x ∈ dom f. The gradient of a function is a core concept in optimization, and as such we review a little bit of what it means at the end of these notes.

The following characterization of convexity is an incredibly useful fact, and if we never had to worry about functions that were not differentiable, we might actually just take this as the definition of a convex function.

If f is differentiable, then it is convex if and only if

    f(y) ≥ f(x) + ∇f(x)^T (y − x)    (1)

for all x, y ∈ dom f .

[Figure: the affine function g(y) = f(x) + ∇f(x)^T (y − x) is tangent to f at y = x and lies below f(y) everywhere.]

This means that the linear approximation

    g(y) = f(x) + ∇f(x)^T (y − x)

is a global underestimator of f(y).

It is easy to show that f convex and differentiable ⇒ (1). Since f is convex,

    f(x + t(y − x)) ≤ (1 − t)f(x) + t f(y),    0 ≤ t ≤ 1,

and so

    f(y) ≥ f(x) + [f(x + t(y − x)) − f(x)] / t,    for all 0 < t ≤ 1.

Taking the limit as t → 0 on the right yields (1).

It is also true that (1) ⇒ f convex. To see this, choose arbitrary x, y and set z_θ = (1 − θ)x + θy; then (1) tells us

    f(w) ≥ f(z_θ) + ∇f(z_θ)^T (w − z_θ).

Applying this at w = y and multiplying by θ, then applying it at w = x and multiplying by (1 − θ) yields

    θf(y) ≥ θf(z_θ) + θ∇f(z_θ)^T (y − z_θ),

    (1 − θ)f(x) ≥ (1 − θ)f(z_θ) + (1 − θ)∇f(z_θ)^T (x − z_θ).

Adding these inequalities together establishes the result, since the gradient terms cancel: θ(y − z_θ) + (1 − θ)(x − z_θ) = 0.
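A quick numerical illustration of (1): the sketch below is in Python (using NumPy); the choice of f as the log-sum-exp function (a standard convex function) and the random test points are arbitrary illustrative choices. It checks that the linear approximation at x never exceeds f(y):

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):                      # log-sum-exp, a standard convex function
        return np.log(np.sum(np.exp(x)))

    def grad_f(x):
        e = np.exp(x)
        return e / e.sum()

    for _ in range(1000):
        x, y = rng.standard_normal(4), rng.standard_normal(4)
        assert f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-12   # (1) holds at every sampled pair
    print("first-order underestimator verified on 1000 random pairs")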

Second-order conditions for convexity

We say that f : R^N → R is twice differentiable if dom f is an open set, and the N × N Hessian matrix

    ∇^2 f(x) = [ ∂^2 f(x) / ∂x_i ∂x_j ],    i, j = 1, . . . , N,

exists for every x ∈ dom f.

If f is twice differentiable, then it is convex if and only if

    ∇^2 f(x) ⪰ 0   (i.e., ∇^2 f(x) ∈ S^N_+)

for all x ∈ dom f.

Note that for a one-dimensional function f : R → R, the above condition just reduces to f''(x) ≥ 0. You can prove the one-dimensional version relatively easily (although we will not do so here) using the first-order characterization of convexity described above and the definition of the second derivative. You can then prove the general case by considering the function g(t) = f(x + tv). To see how, note that if f is convex and twice differentiable, then so is g. Using the chain rule, we have

    g''(t) = v^T ∇^2 f(x + tv) v.

Since g is convex, the one-dimensional result above tells us that g''(0) ≥ 0, and hence v^T ∇^2 f(x) v ≥ 0. Since this has to hold for any v, this means that ∇^2 f(x) ⪰ 0. The proof that ∇^2 f(x) ⪰ 0 implies convexity follows a similar strategy.

In addition, if

    ∇^2 f(x) ≻ 0   (i.e., ∇^2 f(x) ∈ S^N_{++})   for all x ∈ dom f,

then f is strictly convex. The converse is not quite true; it is possible that f is strictly convex even if ∇^2 f(x) has eigenvalues that are zero at isolated points. For example, f(x) = |x|^3 is strictly convex but f''(0) = 0.

Standard examples (from [BV04])

Quadratic functionals:

    f(x) = (1/2) x^T P x + q^T x + r,

where P is symmetric, has

    ∇f(x) = P x + q,    ∇^2 f(x) = P,

so f(x) is convex iff P ⪰ 0.
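These formulas are easy to probe numerically. The sketch below is in Python (using NumPy); the particular P, q, r, and test point are arbitrary illustrative choices, with P built as M M^T so that it is positive semidefinite. It checks the gradient formula by central differences and the convexity test via the eigenvalues of P:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 4
    M = rng.standard_normal((N, N))
    P = M @ M.T                                   # symmetric positive semidefinite
    q = rng.standard_normal(N)
    r = 1.3

    f = lambda x: 0.5 * x @ P @ x + q @ x + r
    grad = lambda x: P @ x + q

    x = rng.standard_normal(N)
    delta = 1e-6
    fd = np.array([(f(x + delta * e) - f(x - delta * e)) / (2 * delta) for e in np.eye(N)])
    print(np.allclose(fd, grad(x), atol=1e-4))      # True: gradient formula matches finite differences
    print(np.all(np.linalg.eigvalsh(P) >= -1e-10))  # True: P is PSD, so f is convex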

Least-squares:

    f(x) = ‖Ax − b‖_2^2,

where A is an arbitrary M × N matrix, has

    ∇f(x) = 2A^T(Ax − b),    ∇^2 f(x) = 2A^T A,

and is convex for any A.

Quadratic-over-linear:
In R^2, if

    f(x) = x1^2 / x2,

then

    ∇f(x) = [ 2x1/x2,  −x1^2/x2^2 ]^T,

    ∇^2 f(x) = (2/x2^3) [ x2^2  −x1x2 ; −x1x2  x1^2 ] = (2/x2^3) [x2, −x1]^T [x2, −x1],

and so f is convex on R × R++ (x1 ∈ R, x2 > 0).
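The rank-one form makes the positive semidefiniteness immediate, since v v^T ⪰ 0 for any vector v and 2/x2^3 > 0 when x2 > 0. A tiny numerical check in Python (using NumPy; the test point is an arbitrary choice with x2 > 0):

    import numpy as np

    x1, x2 = 1.7, 0.4                      # any point with x2 > 0
    v = np.array([x2, -x1])
    H = (2 / x2**3) * np.outer(v, v)       # Hessian of x1^2 / x2 at (x1, x2)
    print(np.linalg.eigvalsh(H))           # one zero eigenvalue, one positive eigenvalue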

Strong convexity and smoothness

We say that a function f is strongly convex if there is a µ > 0 such that

    f(x) − (µ/2)‖x‖_2^2  is convex.    (2)

We call µ the strong convexity parameter and will sometimes say that f is µ-strongly convex. In a sense, what we are saying is that f is so convex that we can subtract off a quadratic function and still preserve convexity.

If f is differentiable, there is another interpretation of strong convexity. We have seen that an equivalent definition of regular convexity is that the linear approximation formed using the gradient at a point x is a global underestimator of the function — see (1) and the earlier picture. If f obeys (2), then we can form a quadratic global underestimator as

    f(y) ≥ f(x) + ∇f(x)^T (y − x) + (µ/2)‖y − x‖_2^2.    (3)

Here is a picture:

[Figure: the quadratic g(y) = f(x) + ∇f(x)^T (y − x) + (µ/2)‖y − x‖_2^2 touches f at y = x and lies below f(y) everywhere.]
We will show that (2) implies (3) in a future homework.

If f is twice differentiable, there is yet another interpretation of strong convexity. If f obeys (2), then we know that the Hessian of f(x) − (µ/2)‖x‖_2^2 does not have any negative eigenvalues, i.e.

    ∇^2 ( f(x) − (µ/2)‖x‖_2^2 ) ⪰ 0.

Thus (since ∇^2(‖x‖_2^2) = 2I),

    ∇^2 f(x) − µI ⪰ 0,   i.e.,   ∇^2 f(x) ⪰ µI.

This is just a fancy way of saying that the smallest eigenvalue of the Hessian ∇^2 f(x) is uniformly bounded below by µ for all x.

In addition to convexity, there is one more type of structure that we consider for functions f : R^N → R. We say that a differentiable f has a Lipschitz gradient if there is an L such that

    ‖∇f(x) − ∇f(y)‖_2 ≤ L‖x − y‖_2,   for all x, y.    (4)

This means that the gradient ∇f(x) does not change radically as we change x. Functions f that obey (4) are also referred to as L-smooth. This definition applies whether or not the function f is convex.

Whether or not f is convex, if it is L-smooth, there is a natural quadratic overestimator. Around any point x, we have the upper bound

    f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2)‖y − x‖_2^2.    (5)
Here is a picture:

[Figure: the quadratic g(y) = f(x) + ∇f(x)^T (y − x) + (L/2)‖y − x‖_2^2 touches f at y = x and lies above f(y) everywhere.]

We will show that (4) implies (5) in a future homework.

If f is twice differentiable, then there is another way to interpret L-smoothness. If f obeys (4), then we have a uniform upper bound on the largest eigenvalue of the Hessian at every point:

    ∇^2 f(x) ⪯ LI,   for all x.    (6)

This makes intuitive sense, as (4) tells us that the first derivative cannot change too quickly, so there must be some kind of bound on the second derivative. We will establish that (4) implies (6) (again, regardless of whether f is convex) in a future homework.
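For a concrete example, take the least-squares objective from earlier, f(x) = ‖Ax − b‖_2^2, whose Hessian is the constant matrix 2A^T A. Its extreme eigenvalues then give a strong convexity parameter µ and a smoothness constant L directly. A sketch in Python (using NumPy; the random A is an arbitrary choice, made tall so it has full column rank and hence µ > 0):

    import numpy as np

    rng = np.random.default_rng(0)
    M, N = 20, 5
    A = rng.standard_normal((M, N))

    H = 2 * A.T @ A                  # Hessian of ||Ax - b||_2^2 (the same at every x)
    eigs = np.linalg.eigvalsh(H)     # sorted in ascending order
    mu, L = eigs[0], eigs[-1]        # mu*I <= H <= L*I
    print(mu, L)                     # f is mu-strongly convex and L-smooth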

Review: The gradient

First, recall that a function f : R → R is differentiable if its derivative, defined as

    f'(x) = lim_{δ→0} [f(x + δ) − f(x)] / δ,

exists for all x ∈ dom f. To extend this notion to functions of multiple variables, we must first extend our notion of a derivative. For a function f : R^N → R that is defined on N-dimensional vectors, recall that the partial derivative with respect to x_n is

    ∂f(x)/∂x_n = lim_{δ→0} [f(x + δ e_n) − f(x)] / δ,

where e_n is the nth "standard basis element", i.e., the vector of all zeros with a single 1 in the nth entry.

The gradient of a function f : R^N → R is the vector of partial derivatives given by:

    ∇f(x) = [ ∂f(x)/∂x_1,  ∂f(x)/∂x_2,  . . . ,  ∂f(x)/∂x_N ]^T.

Similar to the scalar case, we say that f is differentiable if the gradient exists for each x ∈ dom f.
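The limit definitions also suggest a simple numerical approximation: replace the limit with a small but finite δ along each standard basis direction. A sketch in Python (using NumPy; the helper name, the step size δ, and the example function are arbitrary illustrative choices):

    import numpy as np

    def numerical_gradient(f, x, delta=1e-6):
        # approximate each partial derivative by a central difference quotient
        g = np.zeros_like(x, dtype=float)
        for n in range(x.size):
            e = np.zeros(x.size)
            e[n] = 1.0
            g[n] = (f(x + delta * e) - f(x - delta * e)) / (2 * delta)
        return g

    f = lambda x: np.sum(x**2) + np.sin(x[0])      # an arbitrary smooth function on R^N
    x = np.array([0.3, -1.2, 2.0])
    print(numerical_gradient(f, x))                # approx. [2*0.3 + cos(0.3), -2.4, 4.0]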

We will use the term gradient in two subtly different ways. Sometimes
we use ∇f (x) to describe a vector-valued function or a vector field,

i.e., a function that takes an arbitrary x ∈ R^N and produces another vector. When referring to this vector-valued function, we sometimes use the words gradient map, but sometimes we will overload the term "gradient"; we will use the notation ∇f(x) to refer to the vector given by the gradient map evaluated at a particular point x. So sometimes when we say "gradient" we mean a vector-valued function, and sometimes we mean a single vector, and in both cases we use the notation ∇f(x). Which one we mean will usually be obvious from the context.¹

Note that in some cases we will use the notation ∇_x f(x) to indicate that we are taking the gradient with respect to x. This can be helpful when f is a function of more variables than just x, but most of the time this is not necessary, so we will typically use the simpler ∇f(x).

Here we adopt the convention that the gradient is a column vector. This is the most common choice and is most convenient in this class, but some texts will instead treat the gradient as a row vector. The reason for this is to align with the standard convention for the Jacobian.² Thus, it is always worth double-checking what notation is being used when consulting outside resources.

¹ This is just like in the scalar case, where the notation f(x) can sometimes refer to the function f and sometimes to the function evaluated at x.
² The Jacobian of a vector-valued function f : R^N → R^M is the M × N matrix whose rows contain the partial derivatives of each output coordinate. In this course we will mostly be concerned with functions mapping to a single dimension, in which case the Jacobian would be the 1 × N matrix ∇f(x)^T, i.e., the gradient but treated as a row vector. Directly defining the gradient as a row vector instead of a column vector is thus more convenient in some contexts.

Interpretation of the gradient

The gradient is one of the most fundamental concepts of this course. We can interpret the gradient in many ways. One way to think of the gradient when evaluated at a particular point x is that it defines a linear mapping from R^N to R. Specifically, given a u ∈ R^N, we can use ∇f(x) to define a mapping of u to R by simply taking the inner product between the two vectors:

    ⟨u, ∇f(x)⟩.

What does this mapping tell us? It computes the directional derivative of f in the direction of u, i.e.,

    ⟨u, ∇f(x)⟩ = lim_{δ→0} [f(x + δu) − f(x)] / δ.    (7)

This tells us how fast f is changing at x when we move in the direction of u.
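Equation (7) is easy to test numerically: the inner product with the gradient should match a difference quotient with a small δ. A sketch in Python (using NumPy; the sample function, the point, and the direction are arbitrary illustrative choices):

    import numpy as np

    f      = lambda x: np.log(np.sum(np.exp(x)))       # a sample differentiable function
    grad_f = lambda x: np.exp(x) / np.sum(np.exp(x))

    rng = np.random.default_rng(0)
    x, u = rng.standard_normal(4), rng.standard_normal(4)

    delta = 1e-6
    quotient = (f(x + delta * u) - f(x)) / delta
    print(u @ grad_f(x), quotient)                     # the two values agree closely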

This fundamental fact is a direct consequence of Taylor's theorem (see the Technical Details section below). Specifically, let f : R^N → R be any differentiable function. Then for any u ∈ R^N, we can write

    f(x + u) = f(x) + ⟨u, ∇f(x)⟩ + h(u)‖u‖_2,

where h : R^N → R is some function satisfying h(u) → 0 as u → 0.

If we substitute δu in place of u above and rearrange, we obtain the identity

    ⟨u, ∇f(x)⟩ = [f(x + δu) − f(x) − h(δu)‖δu‖_2] / δ
               = [f(x + δu) − f(x)] / δ − h(δu)‖u‖_2.

Note that this holds for any δ > 0. Since h(δu) → 0 as δ → 0, we
can arrive at (7) by simply taking the limit as δ → 0.

A related way to think of ∇f(x) is as a vector that is pointing in the direction of steepest ascent, i.e., the direction in which f increases the fastest when starting at x. To justify this, note that we just observed that we can interpret ⟨u, ∇f(x)⟩ as measuring how quickly f increases when we move in the direction of u. How can we find the direction u that maximizes this quantity? You may recall that the Cauchy-Schwarz inequality tells us that

    |⟨u, ∇f(x)⟩| ≤ ‖∇f(x)‖_2 ‖u‖_2,

and that this holds with equality when u is co-linear with ∇f(x), i.e., when u points in the same direction as ∇f(x). Specifically, this implies that ∇f(x) is the direction of steepest ascent, and −∇f(x) is the direction of steepest descent.

More broadly, this characterizes the entire sets of ascent/descent directions. Suppose that f : R^N → R is differentiable at x. If u ∈ R^N is a vector obeying ⟨u, ∇f(x)⟩ < 0, then we say that u is a descent direction from x, meaning we can find a t > 0 small enough so that

    f(x + tu) < f(x).    (8)

Similarly, if ⟨u, ∇f(x)⟩ > 0, then we say that u is an ascent direction from x, as again for t > 0 small enough,

    f(x + tu) > f(x).

It should hopefully not be a huge stretch of the imagination to see that being able to compute the direction of steepest ascent (or steepest descent) will be useful in the context of finding a maximum/minimum of a function.

To show that ⟨u, ∇f(x)⟩ < 0 implies (8), we again use Taylor's theorem to get

    f(x + tu) = f(x) + t (⟨u, ∇f(x)⟩ + h(tu)‖u‖_2),

where now we have h(tu) → 0 as t → 0. For t > 0 small enough, we can make |h(tu)| · ‖u‖_2 < |⟨u, ∇f(x)⟩|, and so the term inside the parentheses above is negative if ⟨u, ∇f(x)⟩ is negative, and it is positive if ⟨u, ∇f(x)⟩ is positive.
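To make this concrete, the sketch below (in Python, using NumPy; the sample function, starting point, and step sizes are arbitrary illustrative choices) takes u = −∇f(x), which satisfies ⟨u, ∇f(x)⟩ = −‖∇f(x)‖_2^2 < 0 whenever the gradient is nonzero, and checks that small steps along u decrease f:

    import numpy as np

    f      = lambda x: np.sum((x - 1.0)**2) + np.exp(x[0])
    grad_f = lambda x: 2 * (x - 1.0) + np.exp(x[0]) * np.array([1.0, 0.0, 0.0])

    x = np.array([0.5, -2.0, 3.0])
    u = -grad_f(x)                       # a descent direction: <u, grad f(x)> < 0
    for t in [1e-1, 1e-2, 1e-3]:
        print(t, f(x + t * u) < f(x))    # True for each of these small step sizes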

Technical Details: Taylor’s Theorem

You might recall the mean-value theorem from your first calculus class. If f : R → R is a differentiable function on the interval [a, x], then there is a point inside this interval where the derivative of f matches the slope of the line drawn between f(a) and f(x). More precisely, there exists a z ∈ [a, x] such that

    f'(z) = [f(x) − f(a)] / (x − a).
Here is a picture:

[Figure: the secant line through (a, f(a)) and (x, f(x)) has the same slope as the tangent to f at the intermediate point z, i.e., f'(z) = (f(x) − f(a))/(x − a).]

We can re-arrange the expression above to say that there is some z between a and x such that

    f(x) = f(a) + f'(z)(x − a).

The mean-value theorem extends to derivatives of higher order; in this case it is known as Taylor's theorem. For example, suppose that f is twice differentiable on [a, x], and that the first derivative f' is continuous. Then there exists a z between a and x such that

    f(x) = f(a) + f'(a)(x − a) + (f''(z)/2)(x − a)^2.

In general, if f is k + 1 times differentiable, and the first k derivatives are continuous, then there is a point z between a and x such that

    f(x) = p_{k,a}(x) + (f^{(k+1)}(z) / (k+1)!) (x − a)^{k+1},

where p_{k,a}(x) is the degree-k Taylor polynomial of f around a:

    p_{k,a}(x) = f(a) + f'(a)(x − a) + (f''(a)/2)(x − a)^2 + · · · + (f^{(k)}(a)/k!)(x − a)^k.

These results give us a way to quantify the accuracy of the Taylor approximation around a point. For example, if f is twice differentiable with f' continuous, then

    f(x) = f(a) + f'(a)(x − a) + h_1(x)(x − a),

for a function h_1(x) that goes to zero as x goes to a:

    lim_{x→a} h_1(x) = 0.

In fact, you do not even need two derivatives for this to be true. If f has a single derivative, then we can find such an h_1. When f has two derivatives, then we have an explicit form for h_1:

    h_1(x) = (f''(z_x)/2)(x − a),

where z_x is the point returned by the (generalization of the) mean value theorem for a given x.

In general, if f has k derivatives, then there exists an h_k(x) with lim_{x→a} h_k(x) = 0 such that

    f(x) = p_{k,a}(x) + h_k(x)(x − a)^k.

All of the results above extend to functions of multiple variables. For example, if f : R^N → R is differentiable, then around any point a,

    f(x) = f(a) + ⟨x − a, ∇f(a)⟩ + h_1(x)‖x − a‖_2,

where h_1(x) → 0 as x approaches a from any direction. If f is twice differentiable and the first derivative is continuous, then there exists z on the line between a and x such that

    f(x) = f(a) + ⟨x − a, ∇f(a)⟩ + (1/2)(x − a)^T ∇^2 f(z)(x − a).

We will use these two particular multidimensional results in this course, referring to them generically as "Taylor's theorem".
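As a numerical sanity check of the h_1 characterization above in one dimension: the error of the linear approximation, divided by x − a, should vanish as x → a. A sketch in Python (using NumPy; the function and the base point a are arbitrary illustrative choices):

    import numpy as np

    f      = lambda x: np.exp(x) * np.sin(x)
    fprime = lambda x: np.exp(x) * (np.sin(x) + np.cos(x))

    a = 0.7
    for dx in [1e-1, 1e-2, 1e-3, 1e-4]:
        x = a + dx
        h1 = (f(x) - f(a) - fprime(a) * (x - a)) / (x - a)
        print(dx, h1)                    # h1 shrinks roughly in proportion to x - a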

References
[BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Georgia Tech ECE 6270 Notes by M. Davenport and J. Romberg. Last updated 16:20, October 12, 2022
