LETTER Communicated by Barak Pearlmutter
Fast Curvature Matrix-Vector Products for Second-Order
Gradient Descent
Nicol N. Schraudolph
[email protected]
IDSIA, Galleria 2, 6928 Manno, Switzerland, and Institute of Computational Science,
ETH Zentrum, 8092 Zürich, Switzerland
We propose a generic method for iteratively approximating various second-order gradient steps—Newton, Gauss-Newton, Levenberg-Marquardt, and natural gradient—in linear time per iteration, using special curvature matrix-vector products that can be computed in $O(n)$. Two recent acceleration techniques for on-line learning, matrix momentum and stochastic meta-descent (SMD), implement this approach. Since both were originally derived by very different routes, this offers fresh insight into their operation, resulting in further improvements to SMD.
1 Introduction
Second-order gradient descent methods typically multiply the local gradient by the inverse of a matrix $\bar C$ of local curvature information. Depending on the specific method used, this $n \times n$ matrix (for a system with $n$ parameters) may be the Hessian (Newton's method), an approximation or modification thereof (e.g., Gauss-Newton, Levenberg-Marquardt), or the Fisher information (natural gradient—Amari, 1998). These methods may converge rapidly but are computationally quite expensive: the time complexity of common methods to invert $\bar C$ is $O(n^3)$, and iterative approximations cost at least $O(n^2)$ per iteration if they compute $\bar C^{-1}$ directly, since that is the time required just to access the $n^2$ elements of this matrix.
Note, however, that second-order gradient methods do not require $\bar C^{-1}$ explicitly: all they need is its product with the gradient. This is exploited by Yang and Amari (1998) to compute efficiently the natural gradient for multilayer perceptrons with a single output and one hidden layer: assuming independently and identically distributed (i.i.d.) gaussian input, they explicitly derive the form of the Fisher information matrix and its inverse for their system and find that the latter's product with the gradient can be computed in just $O(n)$ steps. However, the resulting algorithm is rather complicated and does not lend itself to being extended to more complex adaptive systems (such as multilayer perceptrons with more than one output or hidden layer), curvature matrices other than the Fisher information, or inputs that are far from i.i.d. gaussian.
Neural Computation 14, 1723–1738 (2002)  © 2002 Massachusetts Institute of Technology
In order to set up a general framework that admits such extensions (and indeed applies to any twice-differentiable adaptive system), we abandon the notion of calculating the exact second-order gradient step in favor of an iterative approximation. The following iteration efficiently approaches $\vec v = \bar C^{-1}\vec u$ for an arbitrary vector $\vec u$ (Press, Teukolsky, Vetterling, & Flannery, 1992, page 57):

$$\vec v_0 = 0; \qquad (\forall t \ge 0)\quad \vec v_{t+1} = \vec v_t + D(\vec u - \bar C \vec v_t), \tag{1.1}$$

where $D$ is a conditioning matrix chosen close to $\bar C^{-1}$ if possible. Note that if we restrict $D$ to be diagonal, all operations in equation 1.1 can be performed in $O(n)$ time, except (one would suppose) for the matrix-vector product $\bar C \vec v_t$.
In fact, there is an $O(n)$ method for calculating the product of an $n \times n$ matrix with an arbitrary vector—if the matrix happens to be the Hessian of a system whose gradient can be calculated in $O(n)$, as is the case for most adaptive architectures encountered in practice. This fast Hessian-vector product (Pearlmutter, 1994; Werbos, 1988; Møller, 1993) can be used in conjunction with equation 1.1 to create an efficient, iterative $O(n)$ implementation of Newton's method.
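To make iteration 1.1 concrete, here is a minimal sketch on a toy quadratic problem. All names and sizes are illustrative: an explicit matrix stands in for the fast curvature matrix-vector product, and a crude scalar conditioner $D = 1/\mathrm{tr}(\bar C)$ is used, which guarantees convergence for positive definite $\bar C$ (at the price of slow convergence compared to a good diagonal conditioner).

```python
import numpy as np

# Sketch of iteration 1.1: v converges to C^{-1} u without ever
# inverting C. The explicit matrix C here is a stand-in for a fast
# O(n) curvature matrix-vector product.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
C = A @ A.T + np.eye(5)           # symmetric positive definite curvature
u = rng.standard_normal(5)        # e.g., the gradient

def curvature_vector_product(v):
    return C @ v                  # stand-in for the fast product

D = 1.0 / np.trace(C)             # crude scalar conditioner, < 1/lambda_max
v = np.zeros(5)
for _ in range(1500):
    v = v + D * (u - curvature_vector_product(v))   # equation 1.1

# v now approximates the second-order step direction C^{-1} u
print(np.allclose(v, np.linalg.solve(C, u), atol=1e-6))
```

With a diagonal conditioner closer to $\bar C^{-1}$, far fewer iterations would be needed; the point is only that each iteration touches $n$ numbers, never $n^2$.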
Unfortunately, Newton's method has severe stability problems when used in nonlinear systems, stemming from the fact that the Hessian may be ill-conditioned and is not guaranteed to be positive definite. Practical second-order methods therefore prefer measures of curvature that are better behaved, such as the outer product (Gauss-Newton) approximation of the Hessian, a model-trust region modification of the same (Levenberg, 1944; Marquardt, 1963), or the Fisher information.
Below, we define these matrices in a maximum likelihood framework for regression and classification and describe $O(n)$ algorithms for computing the product of any of them with an arbitrary vector for neural network architectures. These curvature matrix-vector products are, in fact, cheaper still than the fast Hessian-vector product and can be used in conjunction with equation 1.1 to implement rapid, iterative, optionally stochastic $O(n)$ variants of second-order gradient descent methods. The resulting algorithms are very general, practical (i.e., sufficiently robust and efficient), far less expensive than the conventional $O(n^2)$ and $O(n^3)$ approaches, and—with the aid of automatic differentiation software tools—comparatively easy to implement (see section 4).
We then examine two learning algorithms that use this approach: matrix momentum (Orr, 1995; Orr & Leen, 1997) and stochastic meta-descent (Schraudolph, 1999b, 1999c; Schraudolph & Giannakopoulos, 2000). Since both methods were derived by entirely different routes, viewing them as implementations of iteration 1.1 will provide additional insight into their operation and suggest new ways to improve them.
2 Definitions and Notation
Network. A neural network with $m$ inputs, $n$ weights, and $o$ linear outputs is usually regarded as a mapping $\mathbb R^m \to \mathbb R^o$ from an input pattern $\vec x$ to the corresponding output $\vec y$, for a given vector $\vec w$ of weights. Here we formalize such a network instead as a mapping $N: \mathbb R^n \to \mathbb R^o$ from weights to outputs (for given inputs), and write $\vec y = N(\vec w)$. To extend this formalism to networks with nonlinear outputs, we define the output nonlinearity $M: \mathbb R^o \to \mathbb R^o$ and write $\vec z = M(\vec y) = M(N(\vec w))$. For networks with linear outputs, $M$ will be the identity map.
Loss function. We consider neural network learning as the minimization of a scalar loss function $L: \mathbb R^o \to \mathbb R$ defined as the log-likelihood $L(\vec z) \equiv -\log \Pr(\vec z)$ of the output $\vec z$ under a suitable statistical model (Bishop, 1995). For supervised learning, $L$ may also implicitly depend on given targets $\vec z^*$ for the outputs. Formally, the loss can now be regarded as a function $L(M(N(\vec w)))$ of the weights, for a given set of inputs and (if supervised) targets.
Jacobian and gradient. The Jacobian $J_F$ of a function $F: \mathbb R^m \to \mathbb R^n$ is the $n \times m$ matrix of partial derivatives of the outputs of $F$ with respect to its inputs. For a neural network defined as above, the gradient—the vector $\vec g$ of derivatives of the loss with respect to the weights—is given by

$$\vec g \;\equiv\; \frac{\partial}{\partial \vec w}\, L(M(N(\vec w))) \;=\; J'_{L\circ M\circ N} \;=\; J'_N\, J'_M\, J'_L, \tag{2.1}$$

where $\circ$ denotes function composition and $'$ the matrix transpose.
Matching loss functions. We say that the loss function $L$ matches the output nonlinearity $M$ iff $J'_{L\circ M} = A\vec z + \vec b$, for some $A$ and $\vec b$ not dependent on $\vec w$.¹ The standard loss functions used in neural network regression and classification—sum-squared error for linear outputs and cross-entropy error for softmax or logistic outputs—are all matching loss functions with $A = I$ (the identity matrix) and $\vec b = -\vec z^*$, so that $J'_{L\circ M} = \vec z - \vec z^*$ (Bishop, 1995, chapter 6). This will simplify some of the calculations described in section 4.
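The matching property is easy to verify numerically. The following sketch (illustrative, not from the paper) checks that for softmax outputs with cross-entropy loss, the gradient of the loss with respect to the pre-nonlinearity outputs $\vec y$ is exactly $\vec z - \vec z^*$:

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())
    return e / e.sum()

def cross_entropy(y, z_star):
    return -np.dot(z_star, np.log(softmax(y)))

y = np.array([0.5, -1.2, 2.0])        # linear outputs of N
z_star = np.array([0.0, 1.0, 0.0])    # one-hot target

# central-difference gradient of the loss with respect to y
eps = 1e-6
num_grad = np.array([
    (cross_entropy(y + eps * e, z_star) - cross_entropy(y - eps * e, z_star))
    / (2 * eps)
    for e in np.eye(3)
])

# matching loss: J'_{L o M} = z - z*
print(np.allclose(num_grad, softmax(y) - z_star, atol=1e-7))
```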
Hessian. The instantaneous Hessian $H_F$ of a scalar function $F: \mathbb R^n \to \mathbb R$ is the $n \times n$ matrix of second derivatives of $F(\vec w)$ with respect to its inputs $\vec w$:

$$H_F \;\equiv\; \frac{\partial J_F}{\partial \vec w'}, \quad \text{i.e.,} \quad (H_F)_{ij} \;=\; \frac{\partial^2 F(\vec w)}{\partial w_i\, \partial w_j}. \tag{2.2}$$
¹ For supervised learning, a similar if somewhat more restrictive definition of matching loss functions is given by Helmbold, Kivinen, and Warmuth (1996) and Auer, Herbster, and Warmuth (1996).
For a neural network as defined above, we abbreviate $H \equiv H_{L\circ M\circ N}$. The Hessian proper, which we denote $\bar H$, is obtained by taking the expectation of $H$ over inputs: $\bar H \equiv \langle H \rangle_{\vec x}$. For matching loss functions, $H_{L\circ M} = A J_M = J'_M A'$.
Fisher information. The instantaneous Fisher information matrix $F_F$ of a scalar log-likelihood function $F: \mathbb R^n \to \mathbb R$ is the $n \times n$ matrix formed by the outer product of its first derivatives:

$$F_F \;\equiv\; J'_F\, J_F, \quad \text{i.e.,} \quad (F_F)_{ij} \;=\; \frac{\partial F(\vec w)}{\partial w_i}\, \frac{\partial F(\vec w)}{\partial w_j}. \tag{2.3}$$

Note that $F_F$ always has rank one. Again, we abbreviate $F \equiv F_{L\circ M\circ N} = \vec g\,\vec g'$. The Fisher information matrix proper, $\bar F \equiv \langle F \rangle_{\vec x}$, describes the geometric structure of weight space (Amari, 1985) and is used in the natural gradient descent approach (Amari, 1998).
3 Extended Gauss-Newton Approximation
Problems with the Hessian. The use of the Hessian in second-order gradient descent for neural networks is problematic. For nonlinear systems, $\bar H$ is not necessarily positive definite, so Newton's method may diverge or even take steps in uphill directions. Practical second-order gradient methods should therefore use approximations or modifications of the Hessian that are known to be reasonably well behaved, with positive semidefiniteness as a minimum requirement.
Fisher information. One alternative that has been proposed is the Fisher information matrix $\bar F$ (Amari, 1998), which, being a quadratic form, is positive semidefinite by definition. On the other hand, $\bar F$ ignores all second-order interactions between system parameters, thus throwing away potentially useful curvature information. By contrast, we shall derive an approximation of the Hessian that is provably positive semidefinite even though it does make use of second derivatives to model Hessian curvature better.
Gauss-Newton. An entire class of popular optimization techniques for nonlinear least-squares problems, as implemented by neural networks with linear outputs and sum-squared loss function, is based on the well-known Gauss-Newton (also referred to as linearized, outer product, or squared Jacobian) approximation of the Hessian. Here we extend the Gauss-Newton approach to other standard loss functions—in particular, the cross-entropy loss used in neural network classification—in such a way that even though some second-order information is retained, positive semidefiniteness can still be proved.
Using the product rule, the instantaneous Hessian of our neural network model can be written as

$$H \;=\; \frac{\partial}{\partial \vec w}\,\big(J'_N\, J'_{L\circ M}\big) \;=\; J'_N\, H_{L\circ M}\, J_N \;+\; \sum_{i=1}^{o} (J_{L\circ M})_i\, H_{N_i}, \tag{3.1}$$

where $i$ ranges over the $o$ outputs of $N$, with $N_i$ denoting the subnetwork that produces the $i$th output. Ignoring the second term above, we define the extended, instantaneous Gauss-Newton matrix:

$$G \;\equiv\; J'_N\, H_{L\circ M}\, J_N. \tag{3.2}$$

Note that $G$ has rank $\le o$ (the number of outputs) and is positive semidefinite, regardless of the choice of architecture for $N$, provided that $H_{L\circ M}$ is.
$G$ models the second-order interactions among $N$'s outputs (via $H_{L\circ M}$) while ignoring those arising within $N$ itself ($H_{N_i}$). This constitutes a compromise between the Hessian (which models all second-order interactions) and the Fisher information (which ignores them all). For systems with a single linear output and sum-squared error, $G$ reduces to $F$. For multiple outputs, it provides a richer ($\mathrm{rank}(G) \le o$ versus $\mathrm{rank}(F) = 1$) model of Hessian curvature.
Standard Loss Functions. For the standard loss functions used in neural network regression and classification, $G$ has additional interesting properties:

First, the residual $J'_{L\circ M} = \vec z - \vec z^*$ vanishes at the optimum for realizable problems, so that the Gauss-Newton approximation, equation 3.2, of the Hessian, equation 3.1, becomes exact in this case. For unrealizable problems, the residuals at the optimum have zero mean; this will tend to make the last term in equation 3.1 vanish in expectation, so that we can still assume $\bar G \approx \bar H$ near the optimum.
Second, in each case we can show that $H_{L\circ M}$ (and hence $G$, and hence $\bar G$) is positive semidefinite. For linear outputs with sum-squared loss—that is, conventional Gauss-Newton—$H_{L\circ M} = J_M$ is just the identity $I$; for independent logistic outputs with cross-entropy loss, it is $\mathrm{diag}[\mathrm{diag}(\vec z)(\vec 1 - \vec z)]$, positive semidefinite because $(\forall i)\ 0 < z_i < 1$. For softmax output with cross-entropy loss, we have $H_{L\circ M} = \mathrm{diag}(\vec z) - \vec z\,\vec z'$, which is also positive semidefinite since $(\forall i)\ z_i > 0$ and $\sum_i z_i = 1$, and thus

$$(\forall \vec v \in \mathbb R^o)\quad \vec v'\,[\mathrm{diag}(\vec z) - \vec z\,\vec z'\,]\,\vec v \;=\; \sum_i z_i v_i^2 \;-\; \Big(\sum_i z_i v_i\Big)^2$$
$$=\; \sum_i z_i v_i^2 \;-\; 2 \sum_i z_i v_i \Big(\sum_j z_j v_j\Big) \;+\; \Big(\sum_j z_j v_j\Big)^2$$
$$=\; \sum_i z_i \Big(v_i - \sum_j z_j v_j\Big)^2 \;\ge\; 0. \tag{3.3}$$
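Both the semidefiniteness claim and the algebraic identity in equation 3.3 can be spot-checked numerically; the following sketch (illustrative, not part of the paper) does so for random softmax outputs and random directions:

```python
import numpy as np

# Spot check of equation 3.3: for any softmax output z, the matrix
# diag(z) - z z' is positive semidefinite, and v'[diag(z) - z z']v
# equals the weighted variance sum_i z_i (v_i - z.v)^2.
rng = np.random.default_rng(1)
for _ in range(100):
    y = rng.standard_normal(4)
    z = np.exp(y) / np.exp(y).sum()          # softmax output
    H = np.diag(z) - np.outer(z, z)          # H_{L o M} for softmax + CE
    eigvals = np.linalg.eigvalsh(H)
    assert eigvals.min() > -1e-12            # PSD up to round-off
    v = rng.standard_normal(4)
    lhs = v @ H @ v
    rhs = np.sum(z * (v - np.dot(z, v)) ** 2)
    assert np.isclose(lhs, rhs)              # the identity in eq. 3.3
print("ok")
```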
Model-Trust Region. As long as $G$ is positive semidefinite—as proved above for standard loss functions—the extended Gauss-Newton algorithm will not take steps in uphill directions. However, it may still take very large (even infinite) steps. These may take us outside the model-trust region, the area in which our quadratic model of the error surface is reasonable. Model-trust region methods restrict the gradient step to a suitable neighborhood around the current point.

One popular way to enforce a model-trust region is the addition of a small diagonal term to the curvature matrix. Levenberg (1944) suggested adding $\lambda I$ to the Gauss-Newton matrix $\bar G$; Marquardt (1963) elaborated the additive term to $\lambda\,\mathrm{diag}(\bar G)$. The Levenberg-Marquardt algorithm directly inverts the resulting curvature matrix; where affordable (i.e., for relatively small systems), it has become today's workhorse of nonlinear least-squares optimization.
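The two damping variants are one line each. In this illustrative sketch (sizes and $\lambda$ chosen arbitrarily), note that the Levenberg-damped step is always shorter than the raw Gauss-Newton step, which is precisely the model-trust effect:

```python
import numpy as np

# Levenberg adds a scaled identity to the curvature matrix;
# Marquardt scales its diagonal instead.
rng = np.random.default_rng(2)
J = rng.standard_normal((10, 3))     # Jacobian of a small LSQ problem
G = J.T @ J                          # Gauss-Newton curvature
g = rng.standard_normal(3)           # gradient
lam = 0.1

step_levenberg = np.linalg.solve(G + lam * np.eye(3), g)
step_marquardt = np.linalg.solve(G + lam * np.diag(np.diag(G)), g)

# the damped (Levenberg) step is shorter than the raw Gauss-Newton step
raw = np.linalg.solve(G, g)
print(np.linalg.norm(step_levenberg) < np.linalg.norm(raw))
```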
4 Fast Curvature Matrix-Vector Products
We now describe algorithms that compute the product of $F$, $G$, or $H$ with an arbitrary $n$-dimensional vector $\vec v$ in $O(n)$. They can be used in conjunction with equation 1.1 to implement rapid and (if so desired) stochastic versions of various second-order gradient descent methods, including Newton's method, Gauss-Newton, Levenberg-Marquardt, and natural gradient descent.
4.1 The Passes. The fast matrix-vector products are all constructed from the same set of passes in which certain quantities are propagated through all or part of our neural network model (comprising $N$, $M$, and $L$) in forward or reverse direction. For implementation purposes, it should be noted that automatic differentiation software tools² can automatically produce these passes from a program implementing the basic forward pass $f_0$.

² See http://www-unix.mcs.anl.gov/autodiff/.
Curvature Matrix-Vector Products 1729
$f_0$. This is the ordinary forward pass of a neural network, evaluating the function $F(\vec w)$ it implements by propagating activity (i.e., intermediate results) forward through $F$.

$r_1$. The ordinary backward pass of a neural network, calculating $J'_F\,\vec u$ by propagating the vector $\vec u$ backward through $F$. This pass uses intermediate results computed in the $f_0$ pass.
$f_1$. Following Pearlmutter (1994), we define the Gateaux derivative

$$R_{\vec v}(F(\vec w)) \;\equiv\; \left.\frac{\partial F(\vec w + r \vec v)}{\partial r}\right|_{r=0} \;=\; J_F\, \vec v, \tag{4.1}$$

which describes the effect on a function $F(\vec w)$ of a weight perturbation in the direction of $\vec v$. By pushing $R_{\vec v}$, which obeys the usual rules for differential operators, down into the equations of the forward pass $f_0$, one obtains an efficient procedure to calculate $J_F\,\vec v$ from $\vec v$. (See Pearlmutter, 1994, for details and examples.) This $f_1$ pass uses intermediate results from the $f_0$ pass.
$r_2$. When the $R_{\vec v}$ operator is applied to the $r_1$ pass for a scalar function $F$, one obtains an efficient procedure for calculating the Hessian-vector product $H_F\,\vec v = R_{\vec v}(J'_F)$. (See Pearlmutter, 1994, for details and examples.) This $r_2$ pass uses intermediate results from the $f_0$, $f_1$, and $r_1$ passes.
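Because equation 4.1 is a directional derivative, any implementation of the $r_2$ pass can be sanity-checked by differencing the gradient along $\vec v$. The toy function below is illustrative (a separable loss rather than a network), so its exact Hessian-vector product is available in closed form for comparison:

```python
import numpy as np

def loss(w):
    return np.sum(np.tanh(w) ** 2)

def grad(w):
    t = np.tanh(w)
    return 2 * t * (1 - t ** 2)      # analytic gradient of the toy loss

rng = np.random.default_rng(3)
w = rng.standard_normal(6)
v = rng.standard_normal(6)

# H v approximated as the directional derivative R_v(grad), eq. 4.1
eps = 1e-6
Hv_numeric = (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

# exact H v: the toy loss is separable, so H is diagonal with entries
# d/dw [2 t (1 - t^2)] = (2 - 6 t^2)(1 - t^2)
t = np.tanh(w)
Hv_exact = (2 - 6 * t ** 2) * (1 - t ** 2) * v
print(np.allclose(Hv_numeric, Hv_exact, atol=1e-6))
```

The fast $r_2$ pass computes the same quantity exactly, in one backward sweep, without the round-off error of finite differencing.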
4.2 The Algorithms. The first step in all three matrix-vector products is computing the gradient $\vec g$ of our neural network model by standard backpropagation:

Gradient. $\vec g \equiv J'_{L\circ M\circ N}$ is computed by an $f_0$ pass through the entire model ($N$, $M$, and $L$), followed by an $r_1$ pass propagating $\vec u = 1$ back through the entire model ($L$, $M$, then $N$). For matching loss functions, there is a shortcut: since $J'_{L\circ M} = A\vec z + \vec b$, we can limit the forward pass to $N$ and $M$ (to compute $\vec z$), then $r_1$-propagate $\vec u = A\vec z + \vec b$ back through just $N$.
Fisher Information. To compute $F\vec v = \vec g\,\vec g'\,\vec v$, multiply the gradient $\vec g$ by the inner product between $\vec g$ and $\vec v$. If there is no random access to $\vec g$ or $\vec v$—that is, their elements can be accessed only through passes like the above—the scalar $\vec g'\vec v$ can instead be calculated by $f_1$-propagating $\vec v$ forward through the model ($N$, $M$, and $L$). This step is also necessary for the other two matrix-vector products.
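The Fisher-vector product never requires forming the rank-one matrix $\vec g\,\vec g'$; a scalar inner product followed by a rescaling of $\vec g$ suffices. A trivial illustrative comparison:

```python
import numpy as np

# F v = g g' v = g * (g . v): an O(n) product with the instantaneous
# Fisher matrix, versus the naive O(n^2) construction for comparison.
rng = np.random.default_rng(4)
g = rng.standard_normal(1000)     # gradient
v = rng.standard_normal(1000)     # arbitrary vector

Fv_fast = g * np.dot(g, v)        # O(n)
Fv_slow = np.outer(g, g) @ v      # O(n^2), for verification only
print(np.allclose(Fv_fast, Fv_slow))
```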
Hessian. After $f_1$-propagating $\vec v$ forward through $N$, $M$, and $L$, $r_2$-propagate $R_{\vec v}(1) = 0$ back through the entire model ($L$, $M$, then $N$) to obtain $H\vec v = R_{\vec v}(\vec g)$ (Pearlmutter, 1994). For matching loss functions, the shortcut is to $f_1$-propagate $\vec v$ through just $N$ and $M$ to obtain $R_{\vec v}(\vec z)$, then $r_2$-propagate $R_{\vec v}(J'_{L\circ M}) = A\,R_{\vec v}(\vec z)$ back through just $N$.

Table 1: Choice of curvature matrix $C$ for various gradient descent methods, passes needed to compute gradient $\vec g$ and fast matrix-vector product $\bar C\vec v$, and associated cost (for a multilayer perceptron) in flops per weight and pattern.

    C    Method              f0    r1    f1    r2    Cost (for g and Cv)
         cost per pass:       2     3     4     7
    I    simple gradient      ✓     ✓                       6
    F    natural gradient     ✓     ✓    (✓)               10
    G    Gauss-Newton         ✓     ✓✓    ✓                14
    H    Newton's method      ✓     ✓     ✓     ✓          18
Gauss-Newton. Following the $f_1$ pass, $r_2$-propagate $R_{\vec v}(1) = 0$ back through $L$ and $M$ to obtain $R_{\vec v}(J'_{L\circ M}) = H_{L\circ M}\,J_N\,\vec v$, then $r_1$-propagate that back through $N$, giving $G\vec v$. For matching loss functions, we do not require an $r_2$ pass. Since

$$G \;=\; J'_N\, H_{L\circ M}\, J_N \;=\; J'_N\, J'_M\, A'\, J_N, \tag{4.2}$$

we can limit the $f_1$ pass to $N$, multiply the result with $A'$, then $r_1$-propagate it back through $M$ and $N$. Alternatively, one may compute the equivalent $G\vec v = J'_N\, A\, J_M\, J_N\, \vec v$ by continuing the $f_1$ pass through $M$, multiplying with $A$, then $r_1$-propagating back through $N$ only.
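Equation 3.2 can be checked directly on a tiny network. The sketch below (illustrative; the network, sizes, and the finite-difference Jacobian exist only for verification, since the whole point of the passes above is to avoid forming $J_N$) builds $G = J'_N\,H_{L\circ M}\,J_N$ for a two-layer net with softmax outputs and confirms its positive semidefiniteness:

```python
import numpy as np

rng = np.random.default_rng(5)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
x = rng.standard_normal(3)           # one fixed input pattern

def net(w):                          # N: weights -> linear outputs y
    W1_, W2_ = w[:12].reshape(4, 3), w[12:].reshape(2, 4)
    return W2_ @ np.tanh(W1_ @ x)

w = np.concatenate([W1.ravel(), W2.ravel()])

# Jacobian J_N by central differences (verification only, O(n) passes
# per column; the paper's f1/r1 passes avoid this entirely)
eps = 1e-6
J = np.array([(net(w + eps * e) - net(w - eps * e)) / (2 * eps)
              for e in np.eye(w.size)]).T          # shape (2, 20)

z = np.exp(net(w)); z /= z.sum()                   # softmax output
H_LM = np.diag(z) - np.outer(z, z)                 # Hessian of L o M
G = J.T @ H_LM @ J                                 # extended Gauss-Newton

v = rng.standard_normal(w.size)
print(v @ G @ v >= -1e-12)                         # G is PSD
```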
Batch Average. To calculate the product of a curvature matrix $\bar C \equiv \langle C \rangle_{\vec x}$, where $C$ is one of $F$, $G$, or $H$, with vector $\vec v$, average the instantaneous product $C\vec v$ over all input patterns $\vec x$ (and associated targets $\vec z^*$, if applicable) while holding $\vec v$ constant. For large training sets or nonstationary streams of data, it is often preferable to estimate $\bar C\vec v$ by averaging over "mini-batches" of (typically) just 5 to 50 patterns.
4.3 Computational Cost. Table 1 summarizes, for a number of gradient descent methods, their choice of curvature matrix $C$, the passes needed (for a matching loss function) to calculate both the gradient $\vec g$ and the fast matrix-vector product $\bar C\vec v$, and the associated computational cost in terms of floating-point operations (flops) per weight and pattern in a multilayer perceptron. These figures ignore certain optimizations (e.g., not propagating gradients back to the inputs) and assume that any computation at the network's nodes is dwarfed by that required for the weights.

Computing both gradient and curvature matrix-vector product is typically about two to three times as expensive as calculating the gradient alone.
In combination with iteration 1.1, however, one can use the $O(n)$ matrix-vector product to implement second-order gradient methods whose rapid convergence more than compensates for the additional cost. We describe two such algorithms in the following section.
5 Rapid Second-Order Gradient Descent
We know of two neural network learning algorithms that combine the O(n)
curvature matrix-vector product with iteration 1.1 in some form: matrix
momentum (Orr, 1995; Orr & Leen, 1997) and our own stochastic meta-
descent (Schraudolph, 1999b, 1999c; Schraudolph & Giannakopoulos, 2000).
Since both of these were derived by entirely different routes, we gain fresh
insight into their operation by examining how they implement equation 1.1.
5.1 Stochastic Meta-Descent. Stochastic meta-descent (SMD—Schraudolph, 1999b, 1999c) is a new on-line algorithm for local learning rate adaptation. It updates the weights $\vec w$ by the simple gradient descent

$$\vec w_{t+1} \;=\; \vec w_t - \mathrm{diag}(\vec p_t)\, \vec g. \tag{5.1}$$

The vector $\vec p$ of local learning rates is adapted multiplicatively,

$$\vec p_t \;=\; \mathrm{diag}(\vec p_{t-1})\; \max\!\left(\tfrac{1}{2},\; 1 + \mu\, \mathrm{diag}(\vec v_t)\, \vec g\right), \tag{5.2}$$

using a scalar meta-learning rate $\mu$. Finally, the auxiliary vector $\vec v$ used in equation 5.2 is itself updated iteratively via

$$\vec v_{t+1} \;=\; \lambda \vec v_t + \mathrm{diag}(\vec p_t)\,(\vec g - \lambda C \vec v_t), \tag{5.3}$$

where $0 \le \lambda \le 1$ is a forgetting factor for nonstationary tasks. Although derived as part of a dual gradient descent procedure (minimizing loss with respect to both $\vec w$ and $\vec p$), equation 5.3 implements an interesting variation of equation 1.1. SMD thus employs rapid second-order techniques indirectly to help adapt local learning rates for the gradient descent in weight space.
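The three coupled updates fit in a few lines. The following is a minimal sketch on a noisy quadratic loss, not a reproduction of the paper's experiments: the problem, noise level, and parameter values ($\mu$, $\lambda$, $\vec p_0$) are illustrative, and an explicit matrix product stands in for the fast $O(n)$ curvature matrix-vector product.

```python
import numpy as np

# SMD updates 5.1-5.3 on the noisy quadratic 0.5 (w - w*)' C (w - w*).
rng = np.random.default_rng(6)
n = 10
A = rng.standard_normal((n, n))
C = A @ A.T / n + np.eye(n)          # positive definite curvature
w_star = rng.standard_normal(n)

w = np.zeros(n)
p = np.full(n, 0.1)                  # local learning rates p
v = np.zeros(n)                      # auxiliary vector v
mu, lam = 0.05, 1.0                  # meta-learning rate, forgetting factor

for _ in range(500):
    g = C @ (w - w_star) + 0.01 * rng.standard_normal(n)  # noisy gradient
    p = p * np.maximum(0.5, 1 + mu * v * g)               # equation 5.2
    w = w - p * g                                         # equation 5.1
    Cv = C @ v                       # stand-in for the fast O(n) product
    v = lam * v + p * (g - lam * Cv)                      # equation 5.3

print(np.linalg.norm(w - w_star) < 0.2)
```

Note how the max(½, ·) floor in equation 5.2 caps any single-step reduction of a learning rate at a factor of two, as discussed next.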
Linearization. The learning rate update, equation 5.2, minimizes the system's loss with respect to $\vec p$ by exponentiated gradient descent (Kivinen & Warmuth, 1995), but has been relinearized in order to avoid the computationally expensive exponentiation operation (Schraudolph, 1999a). The particular linearization used, $e^u \approx \max(\varrho, 1 + u)$, is based on a first-order Taylor expansion about $u = 0$, bounded below by $0 < \varrho < 1$ so as to safeguard against unreasonably small (and, worse, negative) multipliers for $\vec p$. The value of $\varrho$ determines the maximum permissible learning rate reduction; we follow many other step-size control methods in setting this to $\varrho = \tfrac{1}{2}$, the ratio between optimal and maximum stable step size in a symmetric bowl. Compared to direct exponentiated gradient descent, our linearized version, equation 5.2, thus dampens radical changes (in both directions) to $\vec p$ that may occasionally arise due to the stochastic nature of the data.
Diagonal, Adaptive Conditioner. For $\lambda = 1$, SMD's update of $\vec v$, equation 5.3, implements equation 1.1 with the diagonal conditioner $D = \mathrm{diag}(\vec p)$. Note that the learning rates $\vec p$ are being adapted so as to make the gradient step $\mathrm{diag}(\vec p)\,\vec g$ as effective as possible. A well-adapted $\vec p$ will typically make this step similar to the second-order gradient $\bar C^{-1}\vec g$. In this restricted sense, we can regard $\mathrm{diag}(\vec p)$ as an empirical diagonal approximation of $\bar C^{-1}$, making it a good choice for the conditioner $D$ in iteration 1.1.
Initial Learning Rates. Although SMD is very effective at adapting local learning rates to changing requirements, it is nonetheless sensitive to their initial values. All three of its update rules rely on $\vec p$ for their conditioning, so initial values that are very far from optimal are bound to cause problems: divergence if they are too high, lack of progress if they are too low. A simple architecture-dependent technique such as tempering (Schraudolph & Sejnowski, 1996) should usually suffice to initialize $\vec p$ adequately; the fine tuning can be left to the SMD algorithm.
Model-Trust Region. For $\lambda < 1$, the stochastic fixpoint of equation 5.3 is no longer $\vec v \to C^{-1}\vec g$, but rather

$$\vec v \;\to\; \big[\lambda C + (1 - \lambda)\,\mathrm{diag}(\vec p)^{-1}\big]^{-1}\, \vec g. \tag{5.4}$$

This clearly implements a model-trust region approach, in that a diagonal matrix is being added (in small proportion) to $C$ before inverting it. Moreover, the elements along the diagonal are not all identical as in Levenberg's (1944) method, but scale individually as suggested by Marquardt (1963). The scaling factors are determined by $1/\vec p$ rather than $\mathrm{diag}(\bar C)$, as the Levenberg-Marquardt method would have it, but these two vectors are related by our above argument that $\vec p$ is a diagonal approximation of $\bar C^{-1}$. For $\lambda < 1$, SMD's iteration 5.3 can thus be regarded as implementing an efficient stochastic variant of the Levenberg-Marquardt model-trust region approach.
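The fixed point 5.4 follows by setting $\vec v_{t+1} = \vec v_t$ in equation 5.3 and rearranging, and it can be verified numerically. In the illustrative sketch below, $\vec g$, $\vec p$, and $C$ are held fixed so that iteration 5.3 becomes deterministic:

```python
import numpy as np

# Iterating 5.3 with fixed g, p, C converges to the fixed point 5.4:
# v -> [lam*C + (1 - lam) * diag(p)^-1]^{-1} g.
rng = np.random.default_rng(7)
n = 5
A = rng.standard_normal((n, n))
C = A @ A.T / n + np.eye(n)
g = rng.standard_normal(n)
p = rng.uniform(0.05, 0.2, n)        # fixed local learning rates
lam = 0.9                            # forgetting factor < 1

v = np.zeros(n)
for _ in range(2000):
    v = lam * v + p * (g - lam * (C @ v))     # equation 5.3

v_star = np.linalg.solve(lam * C + (1 - lam) * np.diag(1 / p), g)
print(np.allclose(v, v_star, atol=1e-8))
```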
Benchmark Setup. We illustrate the behavior of SMD with empirical data obtained on the "four regions" benchmark (Singhal & Wu, 1989): a fully connected feedforward network $N$ with two hidden layers of 10 units each (see Figure 1, right) is to classify two continuous inputs in the range $[-1, 1]$ into four disjoint, nonconvex regions (see Figure 1, left). We use the standard softmax output nonlinearity $M$ with matching cross-entropy loss $L$, meta-learning rate $\mu = 0.05$, initial learning rates $\vec p_0 = 0.1$, and a hyperbolic tangent nonlinearity on the hidden units. For each run, the 184 weights (including bias weights for all units) are initialized to uniformly random values in the range $[-0.3, 0.3]$. Training patterns are generated on-line by drawing independent, uniformly random input samples; since each pattern is seen only once, the empirical loss provides an unbiased estimate of generalization ability. Patterns are presented in mini-batches of 10 each so as to reduce the computational overhead associated with SMD's parameter updates 5.1, 5.2, and 5.3.³

Figure 1: The four regions task (left), and the network we trained on it (right).
Curvature Matrix. Figure 2 shows loss curves for SMD with $\lambda = 1$ on the four regions problem, starting from 25 different random initial states, using the Hessian, Fisher information, and extended Gauss-Newton matrix, respectively, for $C$ in equation 5.3. With the Hessian (left), 80% of the runs diverge—most of them early on, when the risk that $H$ is not positive definite is greatest. When we guarantee positive semidefiniteness by switching to the Fisher information matrix (center), the proportion of diverged runs drops to 20%; those runs that still diverge do so only relatively late. Finally, for our extended Gauss-Newton approximation (right), only a single run diverges, illustrating the benefit of retaining certain second-order terms while preserving positive semidefiniteness. (For comparison, we cannot get matrix momentum to converge at all on anything as difficult as this benchmark.)
Stability. In contrast to matrix momentum, the high stochasticity of $\vec v$ affects the weights in SMD only indirectly, being buffered—and largely averaged out—by the incremental update 5.2 of the learning rates $\vec p$. This makes SMD far more stable, especially when $G$ is used as the curvature matrix. Its residual tendency to misbehave occasionally can be suppressed further by slightly lowering $\lambda$ so as to create a model-trust region. By curtailing the memory of iteration 5.3, however, this approach can compromise the rapid convergence of SMD. Figure 3 illustrates the resulting stability-performance trade-off on the four regions benchmark:

Figure 2: Loss curves for 25 runs of SMD with $\lambda = 1$, when using the Hessian (left), the Fisher information (center), or the extended Gauss-Newton matrix (right) for $C$ in equation 5.3. Vertical spikes indicate divergence.

³ In exploratory experiments, comparative results when training fully on-line (i.e., pattern by pattern) were noisier but not substantially different.
When using the extended Gauss-Newton approximation, a small reduction of $\lambda$ to 0.998 (solid line) is sufficient to prevent divergence, at a moderate cost in performance relative to $\lambda = 1$ (dashed, plotted up to the earliest point of divergence). When the Hessian is used, by contrast, $\lambda$ must be set as low as 0.95 to maintain stability, and convergence is slowed much further (dash-dotted). Even so, this is still significantly faster than the degenerate case of $\lambda = 0$ (dotted), which in effect implements IDD (Harmon & Baird, 1996), to our knowledge the best on-line method for local learning rate adaptation preceding SMD.

From these experiments, it appears that memory (i.e., $\lambda$ close to 1) is key to achieving the rapid convergence characteristic of SMD. We are now investigating more direct ways to keep iteration 5.3 under control, aiming to ensure the stability of SMD while maintaining its excellent performance near $\lambda = 1$.
Figure 3: Average loss over 25 runs of SMD for various combinations of curvature matrix $C$ and forgetting factor $\lambda$. Memory ($\lambda \to 1$) accelerates convergence over the conventional memory-less case $\lambda = 0$ (dotted) but can lead to instability. With the Hessian $H$, all 25 runs remain stable up to $\lambda = 0.95$ (dot-dashed line); using the extended Gauss-Newton matrix $G$ pushes this limit up to $\lambda = 0.998$ (solid line). The curve for $\lambda = 1$ (dashed line) is plotted up to the earliest point of divergence.

5.2 Matrix Momentum. The investigation of asymptotically optimal adaptive momentum for first-order stochastic gradient descent (Leen & Orr, 1994) led Orr (1995) to propose the following weight update:

$$\vec w_{t+1} \;=\; \vec w_t + \vec v_{t+1}, \qquad \vec v_{t+1} \;=\; \vec v_t - \mu\,(\varrho_t\, \vec g + C \vec v_t), \tag{5.5}$$

where $\mu$ is a scalar constant less than the inverse of $\bar C$'s largest eigenvalue, and $\varrho_t$ a rate parameter that is annealed from one to zero. We recognize equation 1.1 with scalar conditioner $D = \mu$ and stochastic fixed point $\vec v \to -\varrho_t\, C^{-1}\vec g$; thus, matrix momentum attempts to approximate partial second-order gradient steps directly via this fast, stochastic iteration.
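A minimal sketch of equation 5.5 on a noiseless quadratic (where the iteration is well behaved; the stochastic nonlinear setting discussed below is where it struggles). The annealing schedule and all parameter values are illustrative assumptions, not taken from Orr (1995):

```python
import numpy as np

# Matrix momentum, equation 5.5, minimizing 0.5 (w - w*)' C (w - w*).
rng = np.random.default_rng(8)
n = 5
A = rng.standard_normal((n, n))
C = A @ A.T / n + np.eye(n)
w_star = rng.standard_normal(n)

w = np.zeros(n)
v = np.zeros(n)
mu = 0.9 / np.linalg.eigvalsh(C).max()   # below 1 / largest eigenvalue
for t in range(3000):
    g = C @ (w - w_star)                 # exact gradient (noiseless)
    rho = max(0.0, 1.0 - t / 1000)       # rate annealed from one to zero
    v = v - mu * (rho * g + C @ v)       # equation 5.5
    w = w + v

print(np.linalg.norm(w - w_star) < 1e-3)
```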
Rapid Convergence. Orr (1995) found that in the late, annealing phase of learning, matrix momentum converges at optimal (second-order) asymptotic rates; this has been confirmed by subsequent analysis in a statistical mechanics framework (Rattray & Saad, 1999; Scarpetta, Rattray, & Saad, 1999). Moreover, compared to SMD's slow, incremental adaptation of learning rates, matrix momentum's direct second-order update of the weights promises a far shorter initial transient before rapid convergence sets in. Matrix momentum thus looks like the ideal candidate for a fast $O(n)$ stochastic gradient descent method.
Instability. Unfortunately, matrix momentum has a strong tendency to diverge for nonlinear systems when far from an optimum, as is the case during the search phase of learning. Current implementations therefore rely on simple (first-order) stochastic gradient descent initially, turning on matrix momentum only once the vicinity of an optimum has been reached (Orr, 1995; Orr & Leen, 1997). The instability of matrix momentum is not caused by lack of semidefiniteness on behalf of the curvature matrix: Orr (1995) used the Gauss-Newton approximation, and Scarpetta et al. (1999) reached similar conclusions for the Fisher information matrix. Instead, it is thought to be a consequence of the noise inherent in the stochastic approximation of the curvature matrix (Rattray & Saad, 1999; Scarpetta et al., 1999).
Recognizing matrix momentum as implementing the same iteration 1.1 as SMD suggests that its stability might be improved in similar ways—specifically, by incorporating a model-trust region parameter $\lambda$ and an adaptive diagonal conditioner. However, whereas in SMD such a conditioner was trivially available in the vector $\vec p$ of local learning rates, here it is by no means easy to construct, given our restriction to $O(n)$ algorithms, which are affordable for very large systems. We are investigating several routes toward a stable, adaptively conditioned form of matrix momentum.
6 Summary
We have extended the notion of Gauss-Newton approximation of the Hessian from nonlinear least-squares problems to arbitrary loss functions, and shown that it is positive semidefinite for the standard loss functions used in neural network regression and classification. We have given algorithms that compute the product of either the Fisher information or our extended Gauss-Newton matrix with an arbitrary vector in $O(n)$, similar to but even cheaper than the fast Hessian-vector product described by Pearlmutter (1994).

We have shown how these fast matrix-vector products may be used to construct $O(n)$ iterative approximations to a variety of common second-order gradient algorithms, including the Newton, natural gradient, Gauss-Newton, and Levenberg-Marquardt steps. Applying these insights to our recent SMD algorithm (Schraudolph, 1999b)—specifically, replacing the Hessian with our extended Gauss-Newton approximation—resulted in improved stability and performance. We are now investigating whether matrix momentum (Orr, 1995) can similarly be stabilized through the incorporation of an adaptive diagonal conditioner and a model-trust region parameter.
Acknowledgments
I thank Jenny Orr, Barak Pearlmutter, and the anonymous reviewers for
their helpful suggestions, and the Swiss National Science Foundation for
the financial support provided under grant number 2000–052678.97/1.
References
Amari, S. (1985). Differential-geometrical methods in statistics. New York: Springer-
Verlag.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Compu-
tation, 10(2), 251–276.
Auer, P., Herbster, M., & Warmuth, M. K. (1996). Exponentially many local min-
ima for single neurons. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo
(Eds.), Advances in neural information processing systems, 8 (pp. 316–322). Cam-
bridge, MA: MIT Press.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon
Press.
Harmon, M. E., & Baird III, L. C. (1996). Multi-player residual advantage learn-
ing with general function approximation (Tech. Rep. No. WL-TR-1065). Wright-
Patterson Air Force Base, OH: Wright Laboratory, WL/AACF. Available on-
line: www.leemon.com/papers/sim tech/sim tech.pdf.
Helmbold, D. P., Kivinen, J., & Warmuth, M. K. (1996). Worst-case loss bounds
for single neurons. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.),
Advances in neural information processing systems, 8 (pp. 309–315). Cambridge,
MA: MIT Press.
Kivinen, J., & Warmuth, M. K. (1995). Additive versus exponentiated gradient
updates for linear prediction. In Proc. 27th Annual ACM Symposium on Theory
of Computing (pp. 209–218). New York: Association for Computing Machin-
ery.
Leen, T. K., & Orr, G. B. (1994). Optimal stochastic search and adaptive momen-
tum. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural
information processing systems, 6 (pp. 477–484). San Mateo, CA: Morgan Kauf-
mann.
Levenberg, K. (1944). A method for the solution of certain non-linear problems
in least squares. Quarterly of Applied Mathematics, 2(2), 164–168.
Marquardt, D. (1963). An algorithm for least-squares estimation of non-linear
parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2),
431–441.
Møller, M. F. (1993). Exact calculation of the product of the Hessian matrix of feedfor-
ward network error functions and a vector in O(n) time. (Tech. Rep. No. DAIMI
PB-432). Århus, Denmark: Computer Science Department, Århus University.
Available on-line: www.daimi.au.dk/PB/432/PB432.ps.gz.
Orr, G. B. (1995). Dynamics and algorithms for stochastic learning. Doc-
toral dissertation, Oregon Graduate Institute, Beaverton. Available on-line
ftp://neural.cse.ogi.edu/pub/neural/papers/orrPhDch1-5.ps.Z,
orrPhDch6-9.ps.Z.
Orr, G. B., & Leen, T. K. (1997). Using curvature information for fast stochastic
search. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural
information processing systems, 9. Cambridge, MA: MIT Press.
Pearlmutter, B. A. (1994). Fast exact multiplication by the Hessian. Neural Com-
putation, 6(1), 147–160.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical
recipes in C: The art of scientific computing (2nd ed.). Cambridge: Cambridge
University Press.
Rattray, M., & Saad, D. (1999). Incorporating curvature information into on-line
learning. In D. Saad (Ed.), On-line learning in neural networks (pp. 183–207).
Cambridge: Cambridge University Press.
Scarpetta, S., Rattray, M., & Saad, D. (1999). Matrix momentum for practical
natural gradient learning. Journal of Physics A, 32, 4047–4059.
Schraudolph, N. N. (1999a). A fast, compact approximation of the expo-
nential function. Neural Computation, 11(4), 853–862. Available on-line:
www.inf.ethz.ch/~schraudo/pubs/exp.ps.gz.
Schraudolph, N. N. (1999b). Local gain adaptation in stochastic gradient de-
scent. In Proceedings of the 9th International Conference on Artificial Neu-
ral Networks (pp. 569–574). Edinburgh, Scotland: IEE. Available on-line:
www.inf.ethz.ch/~schraudo/pubs/smd.ps.gz.
Schraudolph, N. N. (1999c). Online learning with adaptive local step sizes. In
M. Marinaro & R. Tagliaferri (Eds.), Neural Nets—WIRN Vietri-99: Proceedings
of the 11th Italian Workshop on Neural Nets (pp. 151–156). Berlin: Springer-
Verlag.
Schraudolph, N. N., & Giannakopoulos, X. (2000). Online independent com-
ponent analysis with local learning rate adaptation. In S. A. Solla, T. K.
Leen, & K.-R. Müller (Eds.), Advances in neural information processing sys-
tems, 12 (pp. 789–795). Cambridge, MA: MIT Press. Available on-line:
www.inf.ethz.ch/~schraudo/pubs/smdica.ps.gz.
Schraudolph, N. N., & Sejnowski, T. J. (1996). Tempering backpropagation
networks: Not all weights are created equal. In D. S. Touretzky, M. C.
Mozer, & M. E. Hasselmo (Eds.), Advances in neural information process-
ing systems (pp. 563–569). Cambridge, MA: MIT Press. Available on-line:
www.inf.ethz.ch/~schraudo/pubs/nips95.ps.gz.
Singhal, S., & Wu, L. (1989). Training multilayer perceptrons with the extended
Kalman filter. In D. S. Touretzky (Ed.), Advances in neural information processing
systems (pp. 133–140). San Mateo, CA: Morgan Kaufmann.
Werbos, P. J. (1988). Backpropagation: Past and future. In Proceedings of the IEEE
International Conference on Neural Networks, San Diego, 1988 (Vol. I, pp. 343–
353). Long Beach, CA: IEEE Press.
Yang, H. H., & Amari, S. (1998). Complexity issues in natural gradient descent
method for training multilayer perceptrons. Neural Computation, 10(8), 2137–
2157.
Received December 21, 2000; accepted November 12, 2001.