
6 Nonlinear Root-Finding, Optimization, and Adjoint Differentiation


The next part is based on these slides. Today, we want to talk about why we are computing derivatives in the first
place. In particular, we will drill down on this a little bit and then talk about computation of derivatives.

6.1 Newton’s Method


One common application of derivatives is to solve nonlinear equations via linearization.

6.1.1 Scalar Functions

For instance, suppose we have a scalar function f : R → R and we want to solve f(x) = 0 for a root x. Of course, we could solve such an equation explicitly in simple cases, such as when f is linear or quadratic, but if the function is something more arbitrary like f(x) = x³ − sin(cos x), you might not be able to obtain closed-form solutions. However, there is a nice way to obtain the solution approximately, to any accuracy you want, as long as you know approximately where the root is. The method we are talking about is known as Newton’s method, which is really a linear-algebra technique: it takes in the function and a guess for the root, approximates the function near that guess by a straight line (whose root is easy to find), and uses the root of that line as the new guess. In particular, the method (depicted in Fig. 5) is as follows:
• Linearize f (x) near some x using the approximation

f (x + δx) ≈ f (x) + f ′ (x)δx,

• solve the linear equation f(x) + f′(x)δx = 0 =⇒ δx = −f(x)/f′(x),

• and then use this to update the value of x we linearized near—i.e., letting the new x be

    xnew = x + δx = x − f(x)/f′(x).

Once you are close to the root, Newton’s method converges amazingly quickly. As discussed below, it asymptotically
doubles the number of correct digits on every step!
One may ask what happens when f ′ (x) is not invertible, for instance here if f ′ (x) = 0. If this happens, then
Newton’s method may break down! See here for examples of when Newton’s method breaks down.
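
To make the iteration concrete, here is a minimal Julia sketch (Julia being the language used in this course) of the scalar Newton iteration for the example f(x) = x³ − sin(cos x) above, with the derivative coded by hand; the function names, starting guess, and tolerance are our own illustrative choices rather than anything from the notes.

    # Minimal sketch of scalar Newton's method: repeat x ← x − f(x)/f′(x).
    f(x)  = x^3 - sin(cos(x))
    fp(x) = 3x^2 + sin(x) * cos(cos(x))   # hand-coded derivative f′(x)

    function newton_scalar(f, fp, x; tol = 1e-12, maxiter = 20)
        for _ in 1:maxiter
            δx = -f(x) / fp(x)      # root of the linearization f(x) + f′(x)δx = 0
            x += δx                 # the new guess is the root of the straight line
            abs(δx) < tol && break  # near the root, correct digits roughly double per step
        end
        return x
    end

    x = newton_scalar(f, fp, 1.0)   # starting guess x = 1.0
    @show x, f(x)                   # f(x) should now be ≈ 0

Each step costs just one evaluation of f and one of f′; no linear algebra is needed in the scalar case.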

6.1.2 Multidimensional Functions

We can generalize Newton’s method to multidimensional functions! Let f : Rn → Rn be a function which takes in
a vector and spits out a vector of the same size n. We can then apply a Newton approach in higher dimensions:
• Linearize f (x) near some x using the first-derivative approximation

    f(x + δx) ≈ f(x) + f′(x)δx,

  where f′(x) is the Jacobian matrix,

• solve the linear equation f(x) + f′(x)δx = 0 =⇒ δx = −f′(x)⁻¹f(x), where f′(x)⁻¹ is the inverse Jacobian,

Figure 5: Single step of the scalar Newton’s method to solve f(x) = 0 for an example nonlinear function f(x) = 2cos(x) − x + x²/10. Given a starting guess (x = 2.3 in this example), we use f(x) and f′(x) to form a linear
(affine) approximation of f , and then our next step xnew is the root of this approximation. As long as the initial
guess is not too far from the root, Newton’s method converges extremely rapidly to the exact root (black dot).

• and then use this to update the value of x we linearized near—i.e., letting the new x be

    xnew = xold − f′(x)⁻¹ f(x).

That’s it! Once we have the Jacobian, we can just solve a linear system on each step. This again converges
amazingly fast, doubling the number of digits of accuracy in each step. (This is known as “quadratic convergence.”)
However, there is a caveat: we need some starting guess for x, and the guess needs to be sufficiently close to the
root for the algorithm to make reliable progress. (If you start with an initial x far from a root, Newton’s method
can fail to converge and/or it can jump around in intricate and surprising ways—google “Newton fractal” for some
fascinating examples.) This is a widely used and very practical application of Jacobians and derivatives!
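
As a concrete illustration, here is a short Julia sketch of the multidimensional Newton iteration for a small made-up system in R², with the Jacobian coded by hand; the example equations, starting guess, and tolerance are our own choices, not something from the notes.

    using LinearAlgebra

    # Made-up example system f(x) = 0 for x ∈ R², with a hand-coded Jacobian f′(x).
    f(x) = [x[1]^2 + x[2]^2 - 4,          # circle of radius 2
            x[1] - exp(-x[2])]            # x₁ = exp(−x₂)
    J(x) = [2*x[1]  2*x[2];
            1.0     exp(-x[2])]

    function newton(f, J, x; tol = 1e-12, maxiter = 50)
        for _ in 1:maxiter
            δx = -(J(x) \ f(x))           # one linear solve per step: J(x) δx = −f(x)
            x += δx
            norm(δx) < tol && break
        end
        return x
    end

    x = newton(f, J, [1.0, 1.0])          # starting guess
    @show x, norm(f(x))                   # residual should be ≈ 0

Note that we never form f′(x)⁻¹ explicitly; the backslash solves the linear system (e.g. by LU factorization) on each step.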

6.2 Optimization
6.2.1 Nonlinear Optimization

A perhaps even more famous application of large-scale differentiation is to nonlinear optimization. Suppose we
have a scalar-valued function f : Rn → R, and suppose we want to minimize (or maximize) f . For instance, in
machine learning, we could have a big neural network (NN) with a vector x of a million parameters, and one tries to
minimize a “loss” function f that compares the NN output to the desired results on “training” data. The most basic
idea in optimization is to go “downhill” to make f as small as possible: if we can compute the gradient of this function f, then to go “downhill” we step in the direction −∇f, the direction of steepest descent, as depicted in Fig. 6.
Then, even if we have a million parameters, we can evolve all of them simultaneously in the downhill direction. It turns out that calculating all million derivatives costs about the same as evaluating the function at a point once (using reverse-mode/adjoint/left-to-right/backpropagation methods). Ultimately, this makes large-scale optimization practical for training neural nets, optimizing shapes of airplane wings, optimizing portfolios, etc.

Figure 6: A steepest-descent algorithm minimizes a function f (x) by taking successive “downhill” steps in the
direction −∇f . (In the example shown here, we are minimizing a quadratic function in two dimensions x ∈ R2 ,
performing an exact 1d minimization in the downhill direction for each step.) Steepest-descent algorithms can
sometimes “zig-zag” along narrow valleys, slowing convergence (which can be counteracted in more sophisticated
algorithms by “momentum” terms, second-derivative information, and so on).

Of course, there are many practical complications that make nonlinear optimization tricky (far more than can
be covered in a single lecture, or even in a whole course!), but we give some examples here.
• For instance, even though we can compute the “downhill direction”, how far do we need to step in that
direction? (In machine learning, this is sometimes called the “learning rate.”) Often, you want to take “as big
of a step as you can” to speed convergence, but you don’t want the step to be too big, because ∇f only gives you a local (first-order) approximation of f. There are many different ideas of how to determine this:

– Line search: using a 1D minimization to determine how far to step.


– A “trust region” bounding the step size (where we trust the derivative-based approximation of f ). There
are many techniques to evolve the size of the trust region as optimization progresses.

• We may also need to consider constraints, for instance minimizing f (x) subject to gk (x) ≤ 0 or hk (x) =
0, known as inequality/equality constraints. Points x satisfying the constraints are called “feasible”. One
typically uses a combination of ∇f and ∇gk to approximate (e.g. linearize) the problem and make progress
towards the best feasible point.
• If you just go straight downhill, you might “zig-zag” along narrow valleys, making convergence very slow. There
are a few options to combat this, such as “momentum” terms and conjugate gradients. Even fancier than
these techniques, one might estimate second-derivative “Hessian matrices” from a sequence of ∇f values—a
famous version of this is known as the BFGS algorithm—and use the Hessian to take approximate Newton

steps (for the root ∇f = 0). (We’ll return to Hessians in a later lecture.)
• Ultimately, there are a lot of techniques and a zoo of competing algorithms that you might need to experiment
with to find the best approach for a given problem. (There are many books on optimization algorithms, and
even a whole book can only cover a small slice of what is out there!)

Some parting advice: Often the main trick is less about the choice of algorithms than it is about finding the
right mathematical formulation of your problem—e.g. what function, what constraints, and what parameters should
you be considering—to match your problem to a good algorithm. However, if you have many (≫ 10) parameters,
try hard to use an analytical gradient (not finite differences), computed efficiently in reverse mode.
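
To make the basic idea concrete, here is a minimal Julia sketch of steepest descent with a crude backtracking rule (“halve the step until f actually decreases”), applied to an ill-conditioned quadratic like the one in Fig. 6. The objective, step-size rule, and stopping criterion are simplistic choices of ours for illustration, not a recommendation of a production algorithm.

    using LinearAlgebra

    # Example objective f(x) = ½ xᵀQx, with analytic gradient ∇f(x) = Qx.
    Q = Diagonal([1.0, 10.0])              # ill-conditioned ⇒ "zig-zag" behavior
    f(x)  = 0.5 * dot(x, Q * x)
    ∇f(x) = Q * x

    function steepest_descent(f, ∇f, x; α = 1.0, tol = 1e-8, maxiter = 10_000)
        for _ in 1:maxiter
            g = ∇f(x)
            norm(g) < tol && break         # stop when the gradient is (nearly) zero
            t = α
            while f(x - t * g) > f(x)      # backtracking: halve t until we go downhill
                t /= 2
            end
            x -= t * g                     # step in the steepest-descent direction −∇f
        end
        return x
    end

    @show steepest_descent(f, ∇f, [10.0, 1.0])   # converges to the minimum at the origin

In practice one would use a more careful line search (or the momentum / quasi-Newton ideas above), but the basic structure (compute ∇f, choose a step size, update all parameters at once) is the same.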

6.2.2 Engineering/Physical Optimization

There are many, many applications of optimization besides machine learning (fitting models to data). It is interesting to also consider engineering/physical optimization. (For instance, suppose you want to make an airplane
wing that is as strong as possible.) The general outline of such problems is typically:
1. You start with some design parameters p, e.g. describing the geometry, materials, forces, or other degrees of
freedom.

2. These p are then used in some physical model(s), such as solid mechanics, chemical reactions, heat transport,
electromagnetism, acoustics, etc. For example, you might have a linear model of the form A(p)x = b(p) for
some matrix A (typically very large and sparse).

3. The solution of the physical model is a solution x(p). For example, this could be the mechanical stresses,
chemical concentrations, temperatures, electromagnetic fields, etc.

4. The physical solution x(p) is the input into some design objective f (x(p)) that you want to improve/optimize.
For instance, strength, speed, power, efficiency, etc.

5. To maximize/minimize f (x(p)), one uses the gradient ∇p f , computed using reverse-mode/“adjoint” methods,
to update the parameters p and improve the design.
As a fun example, researchers have even applied “topology optimization” to design a chair, optimizing every voxel
of the design—the parameters p represent the material present (or not) in every voxel, so that the optimization
discovers not just an optimal shape but an optimal topology (how materials are connected in space, how many
holes there are, and so forth)—to support a given weight with minimal material. To see it in action, watch this
chair-optimization video. (People have applied such techniques to much more practical problems as well, from
airplane wings to optical communications.)

6.3 Reverse-mode “Adjoint” Differentiation


But what is adjoint differentiation—the method of differentiating that makes these applications actually feasible to
solve? Ultimately, it is yet another example of left-to-right/reverse-mode differentiation, essentially applying the
chain rule from outputs to inputs. Consider, for example, trying to compute the gradient ∇g of the scalar-valued
function
    g(p) = f(A(p)⁻¹b) = f(x),

where x solves A(p)x = b (e.g. a parameterized physical model as in the previous section) and f (x) is a scalar-valued
function of x (e.g. an optimization objective depending on our physics solution). For example, this could arise in

an optimization problem

    min_p g(p)   ⇐⇒   min_p f(x) subject to A(p)x = b,

for which the gradient ∇g would be helpful to search for a local minimum. The chain rule for g corresponds to the
following conceptual chain of dependencies:

change dg in g ←− change dx in x = A⁻¹b
               ←− change d(A⁻¹) in A⁻¹
               ←− change dA in A(p)
               ←− change dp in p,

which is expressed by the equations:

    dg = f′(x)[dx]                     (dg ←− dx)
       = f′(x)[d(A⁻¹) b]               (dx ←− d(A⁻¹))
       = −f′(x) A⁻¹ dA A⁻¹ b           (d(A⁻¹) ←− dA)
       = −vᵀ A′(p)[dp] A⁻¹ b           (dA ←− dp).

Here, we are defining the row vector vᵀ = f′(x)A⁻¹, and we have used the differential of a matrix inverse d(A⁻¹) = −A⁻¹ dA A⁻¹ from Sec. 7.3.
Grouping the terms left-to-right, we first solve the “adjoint” (transposed) equation Aᵀv = f′(x)ᵀ = ∇x f for v, and then we obtain dg = −vᵀ dA x. Because the derivative A′(p) of a matrix with respect to a vector is awkward to write explicitly, it is convenient to examine this object one parameter at a time. For any given parameter pk, ∂g/∂pk = −vᵀ(∂A/∂pk)x (and in many applications ∂A/∂pk is very sparse); here, “dividing by” ∂pk works because this is a scalar factor that commutes with the other linear operations. That is, it takes only two solves to get both g and ∇g: one for solving Ax = b to find g(p) = f(x), and another with Aᵀ for v, after which all of the derivatives ∂g/∂pk are just some cheap dot products.
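
The following Julia sketch spells out this two-solve recipe for a small hypothetical example in which A(p) = A₀ + Diagonal(p), so that each ∂A/∂pk has a single nonzero entry and ∂g/∂pk = −vk xk. The parameterization, objective f, and data here are our own illustrative choices, not part of the notes.

    using LinearAlgebra

    n  = 5
    A0 = Matrix(5.0I, n, n) + 0.1 * randn(n, n)   # some invertible "physics" matrix
    b, c = randn(n), randn(n)

    A(p)  = A0 + Diagonal(p)            # hypothetical dependence on p: ∂A/∂pk = ek ekᵀ
    f(x)  = dot(c, x)^2                 # scalar objective of the solution x
    ∇f(x) = 2 * dot(c, x) * c           # its gradient with respect to x

    function g_and_grad(p)
        Ap = A(p)
        x  = Ap \ b                     # "forward" solve:  A x = b
        v  = Ap' \ ∇f(x)                # "adjoint" solve:  Aᵀ v = ∇f(x)
        ∇g = [-v[k] * x[k] for k in 1:n]   # ∂g/∂pk = −vᵀ (∂A/∂pk) x = −vk xk here
        return f(x), ∇g
    end

    # quick finite-difference sanity check of one component of ∇g
    p  = rand(n)
    g0, ∇g = g_and_grad(p)
    δ  = 1e-6
    p2 = copy(p); p2[3] += δ
    @show (g_and_grad(p2)[1] - g0) / δ, ∇g[3]   # should agree to several digits
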
Note that you should not use right-to-left “forward-mode” derivatives with lots of parameters, because

    ∂g/∂pk = −f′(x) A⁻¹ (∂A/∂pk) x

represents one solve per parameter pk! As discussed in Sec. 8.4, right-to-left (a.k.a. forward mode) is better when there are only a few input parameters pk (perhaps just one) and many outputs, while left-to-right “adjoint” differentiation (a.k.a. reverse mode) is better when there are only a few output values (here, just one) and many input parameters. (In Sec. 8.1,
we will discuss using dual numbers for differentiation, and this also corresponds to forward mode.)
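
For contrast, here is what the forward-mode (dual-number) alternative looks like using the ForwardDiff package: it is perfectly appropriate for one or a few inputs, but the cost of ForwardDiff.gradient grows roughly in proportion to the number of parameters, which is exactly the scaling we are warning about. (The example functions are our own.)

    using ForwardDiff

    q(x) = x^3 - sin(cos(x))
    @show ForwardDiff.derivative(q, 1.0)      # forward mode: fine for a single input

    loss(p) = sum(abs2, p) + sin(p[1])        # toy function of many parameters
    p = randn(1000)
    @show ForwardDiff.gradient(loss, p)[1]    # works, but cost scales with the number of parameters
    # (reverse mode / adjoints would give the same gradient for roughly one function evaluation)
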
Another possibility that might come to mind is to use finite differences (as in Sec. 4), but you should not use
this if you have lots of parameters! Finite differences would involve a calculation of something like

    ∂g/∂pk ≈ [g(p + ϵek) − g(p)]/ϵ,
where ek is a unit vector in the k-th direction and ϵ is a small number. This, however, requires one solve for
each parameter pk , just like forward-mode differentiation. (It becomes even more expensive if you use fancier
higher-order finite-difference approximations in order to obtain higher accuracy.)

6.3.1 Nonlinear equations

You can also apply adjoint/reverse differentiation to nonlinear equations. For instance, consider the gradient of the
scalar function g(p) = f (x(p)), where x(p) ∈ Rn solves some system of n equations h(p, x) = 0 ∈ Rn . By the chain
rule,

    h(p, x) = 0 =⇒ (∂h/∂p) dp + (∂h/∂x) dx = 0 =⇒ dx = −(∂h/∂x)⁻¹ (∂h/∂p) dp.
(This is an instance of the Implicit Function Theorem: as long as ∂h/∂x is nonsingular, we can locally define a function x(p) from an implicit equation h = 0, here by linearization.) Hence,

    dg = f′(x)dx = −f′(x)(∂h/∂x)⁻¹ (∂h/∂p) dp,

where we define the row vector vᵀ = f′(x)(∂h/∂x)⁻¹.

Associating left-to-right again leads to a single “adjoint” equation: (∂h/∂x)ᵀ v = f′(x)ᵀ = ∇x f. In other words,
it again only takes two solves to get both g and ∇g—one nonlinear “forward” solve for x and one linear “adjoint”
solve for v! Thereafter, all derivatives ∂g/∂pk are cheap dot products. (Note that the linear “adjoint” solve involves
the transposed Jacobian ∂h/∂x. Except for the transpose, this is very similar to the cost of a single Newton step
to solve h = 0 for x. So the adjoint problem should be cheaper than the forward problem.)
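
Here is a brief Julia sketch of this recipe for a made-up componentwise system h(p, x) = x.^3 + p .* x - b = 0, solved by a few Newton steps; the Jacobians ∂h/∂x and ∂h/∂p happen to be diagonal for this example, and the system, objective, and tolerances are all our own illustrative choices.

    using LinearAlgebra

    n = 4
    b = randn(n)

    # made-up "physics": h(p, x) = x.^3 + p .* x - b = 0, componentwise
    h(p, x)    = x.^3 .+ p .* x .- b
    ∂h∂x(p, x) = Diagonal(3 .* x.^2 .+ p)   # Jacobian of h with respect to x (diagonal here)
    ∂h∂p(p, x) = Diagonal(x)                # Jacobian of h with respect to p

    f(x)  = sum(abs2, x)                    # objective, so g(p) = f(x(p))
    ∇f(x) = 2 .* x

    function solve_forward(p; x = zero(b), tol = 1e-12, maxiter = 100)
        for _ in 1:maxiter                  # Newton's method for h(p, x) = 0
            δx = -(∂h∂x(p, x) \ h(p, x))
            x += δx
            norm(δx) < tol && break
        end
        return x
    end

    function g_and_grad(p)
        x  = solve_forward(p)               # one nonlinear "forward" solve
        v  = ∂h∂x(p, x)' \ ∇f(x)            # one linear "adjoint" solve: (∂h/∂x)ᵀ v = ∇f(x)
        ∇g = -(∂h∂p(p, x)' * v)             # ∇g = −(∂h/∂p)ᵀ v, i.e. ∂g/∂pk = −xk vk here
        return f(x), ∇g
    end

    @show g_and_grad(1 .+ rand(n))
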

6.3.2 Adjoint methods and AD

If you use automatic differentiation (AD) systems, why do you need to learn this stuff? Doesn’t the AD do
everything for you? In practice, however, it is often helpful to understand adjoint methods even if you use automatic
differentiation. Firstly, it helps you understand when to use forward- vs. reverse-mode automatic differentiation.
Secondly, many physical models call large software packages written over the decades in various languages that
cannot be differentiated automatically by AD. You can typically correct this by just supplying a “vector–Jacobian
product” yᵀ dx for this physics, or even just part of the physics, and then AD will differentiate the rest and apply
the chain rule for you. Lastly, often models involve approximate calculations (e.g. for the iterative solution of
linear or nonlinear equations, numerical integration, and so forth), but AD tools often don’t “know” this and spend
extra effort trying to differentiate the error in your approximation; in such cases, manually written derivative rules
can sometimes be much more efficient. (For example, suppose your model involves solving a nonlinear system
h(x, p) = 0 by an iterative approach like Newton’s method. Naive AD will be very inefficient because it will
attempt to differentiate through all your Newton steps. Assuming that you converge your Newton solver to enough
accuracy that the error is negligible, it is much more efficient to perform differentiation via the implicit-function
theorem as described above, leading to a single linear adjoint solve.)
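
As a sketch of what “supplying a vector–Jacobian product” can look like in practice, here is a hypothetical example using the ChainRulesCore and Zygote packages (two commonly used Julia AD tools): we wrap the linear solve in a function and register a custom reverse-mode rule whose pullback is exactly the adjoint recipe from above, so the AD system never tries to differentiate through the solver’s internals. The model A(p) = A₀ + Diagonal(p) and all of the names here are our own illustrative choices.

    using LinearAlgebra, ChainRulesCore, Zygote

    n  = 5
    A0 = Matrix(5.0I, n, n) + 0.1 * randn(n, n)
    b  = randn(n)
    A(p) = A0 + Diagonal(p)               # hypothetical model: ∂A/∂pk = ek ekᵀ

    solve_physics(p) = A(p) \ b           # stand-in for a big black-box solver

    # Custom reverse-mode rule: the pullback is the adjoint solve, so AD treats
    # solve_physics as a single differentiable step.
    function ChainRulesCore.rrule(::typeof(solve_physics), p)
        Ap = A(p)
        x  = Ap \ b                                       # forward solve
        function solve_physics_pullback(xbar)
            v    = Ap' \ unthunk(xbar)                    # adjoint solve: Aᵀ v = x̄
            pbar = [-v[k] * x[k] for k in eachindex(p)]   # −vᵀ (∂A/∂pk) x for this A(p)
            return NoTangent(), pbar
        end
        return x, solve_physics_pullback
    end

    # Zygote differentiates the surrounding code and calls our rule for the solve:
    c = randn(n)
    loss(p) = dot(c, solve_physics(p))^2
    @show Zygote.gradient(loss, rand(n))[1]

This is, schematically, the same pattern libraries use when they register hand-written adjoint rules for linear or nonlinear solvers.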

6.3.3 Adjoint-method example

To finish off this section of the notes, we conclude with an example of how to use this “adjoint method” to compute
a derivative efficiently. Before working through the example, we first state the problem and highly recommend
trying it out before reading the solution.

Problem 38
Suppose that A(p) takes a vector p ∈ Rn−1 and returns the n × n tridiagonal real-symmetric matrix

             ⎡ a1  p1                  ⎤
             ⎢ p1  a2  p2              ⎥
    A(p) =   ⎢     p2   ⋱    ⋱         ⎥ ,
             ⎢          ⋱  an−1  pn−1  ⎥
             ⎣             pn−1  an    ⎦

where a ∈ Rn is some constant vector. Now, define a scalar-valued function g(p) by

    g(p) = (cᵀA(p)⁻¹b)²

for some constant vectors b, c ∈ Rn (assuming we choose p and a so that A is invertible). Note that, in
practice, A(p)−1 b is not computed by explicitly inverting the matrix A—instead, it can be computed in Θ(n)
(i.e., roughly proportional to n) arithmetic operations using Gaussian elimination that takes advantage of the
“sparsity” of A (the pattern of zero entries), a “tridiagonal solve.”
(a) Write down a formula for computing ∂g/∂p1 (in terms of matrix–vector products and matrix inverses).
(Hint: once you know dg in terms of dA, you can get ∂g/∂p1 by “dividing” both sides by ∂p1 , so that dA
becomes ∂A/∂p1 .)

(b) Outline a sequence of steps to compute both g and ∇g (with respect to p) using only two tridiagonal
solves x = A−1 b and an “adjoint” solve v = A−1 (something), plus Θ(n) (i.e., roughly proportional to n)
additional arithmetic operations.

(c) Write a program implementing your ∇g procedure (in Julia, Python, Matlab, or any language you want)
from the previous part. (You don’t need to use a fancy tridiagonal solve if you don’t know how to do this
in your language; you can solve A−1 (vector) inefficiently if needed using your favorite matrix libraries.)
Implement a finite-difference test: Choose a, b, c, p at random, and check that ∇g · δp ≈ g(p + δp) − g(p)
(to a few digits) for a randomly chosen small δp.

Problem 38(a) Solution: From the chain rule and the formula for the differential of a matrix inverse, we have dg = −2(cᵀA⁻¹b) cᵀA⁻¹ dA A⁻¹b (noting that cᵀA⁻¹b is a scalar so we can commute it as needed). Hence

    ∂g/∂p1 = −2(cᵀA⁻¹b) cᵀA⁻¹ (∂A/∂p1) A⁻¹b = vᵀ (∂A/∂p1) x = v1 x2 + v2 x1,

where vᵀ = −2(cᵀA⁻¹b) cᵀA⁻¹ and x = A⁻¹b, and where ∂A/∂p1 is the matrix whose only nonzero entries are the two 1’s at positions (1, 2) and (2, 1); we have simplified the result in terms of x and v for the next part.
Problem 38(b) Solution: Using the notation from the previous part and exploiting the fact that Aᵀ = A, we can choose v = A⁻¹[−2(cᵀx)c], which is a single tridiagonal solve. Given x and v, the results of our two Θ(n) tridiagonal solves, we can compute each component of the gradient as above by ∂g/∂pk = vk xk+1 + vk+1 xk for k = 1, . . . , n − 1, which costs Θ(1) arithmetic per k and hence Θ(n) arithmetic to obtain all of ∇g.
Problem 38(c) Solution: See the Julia solution notebook (Problem 1) from our IAP 2023 course (which calls
the function f rather than g).
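
For readers who prefer not to open the notebook, here is a minimal Julia sketch of the recipe from part (b) (not the official solution), using the SymTridiagonal type from LinearAlgebra for the Θ(n) solves, together with the finite-difference check requested in part (c).

    using LinearAlgebra

    # A(p): symmetric tridiagonal with diagonal a and off-diagonal p, as in Problem 38
    A(a, p) = SymTridiagonal(a, p)

    function g_and_grad(a, p, b, c)
        x  = A(a, p) \ b                       # tridiagonal solve, Θ(n)
        g  = dot(c, x)^2
        v  = A(a, p) \ (-2 * dot(c, x) * c)    # adjoint solve (here Aᵀ = A), also Θ(n)
        ∇g = [v[k] * x[k+1] + v[k+1] * x[k] for k in 1:length(p)]   # Θ(n) extra arithmetic
        return g, ∇g
    end

    # finite-difference check, as in part (c)
    n = 7
    a, b, c = 2 .+ rand(n), randn(n), randn(n)
    p  = 0.1 .* randn(n - 1)
    g, ∇g = g_and_grad(a, p, b, c)
    δp = 1e-7 .* randn(n - 1)
    @show dot(∇g, δp)                          # ∇g ⋅ δp ...
    @show g_and_grad(a, p + δp, b, c)[1] - g   # ... should match g(p+δp) − g(p) to a few digits

The two backslashes are the two tridiagonal solves; everything else is Θ(n) arithmetic.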

MIT OpenCourseWare
https://ocw.mit.edu

18.S096 Matrix Calculus for Machine Learning and Beyond


Independent Activities Period (IAP) 2023

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.
