LETTER Communicated by Barak Pearlmutter
Fast Curvature Matrix-Vector Products for Second-Order
Gradient Descent
Nicol N. Schraudolph
[email protected]
IDSIA, Galleria 2, 6928 Manno, Switzerland, and Institute of Computational Science,
ETH Zentrum, 8092 Zürich, Switzerland
We propose a generic method for iteratively approximating various second-order gradient steps—Newton, Gauss-Newton, Levenberg-Marquardt, and natural gradient—in linear time per iteration, using special curvature matrix-vector products that can be computed in $O(n)$. Two recent acceleration techniques for on-line learning, matrix momentum and stochastic meta-descent (SMD), implement this approach. Since both were originally derived by very different routes, this offers fresh insight into their operation, resulting in further improvements to SMD.
1 Introduction
Second-order gradient descent methods typically multiply the local gradient by the inverse of a matrix $\bar C$ of local curvature information. Depending on the specific method used, this $n \times n$ matrix (for a system with $n$ parameters) may be the Hessian (Newton's method), an approximation or modification thereof (e.g., Gauss-Newton, Levenberg-Marquardt), or the Fisher information (natural gradient—Amari, 1998). These methods may converge rapidly but are computationally quite expensive: the time complexity of common methods to invert $\bar C$ is $O(n^3)$, and iterative approximations cost at least $O(n^2)$ per iteration if they compute $\bar C^{-1}$ directly, since that is the time required just to access the $n^2$ elements of this matrix.
Note, however, that second-order gradient methods do not require $\bar C^{-1}$ explicitly: all they need is its product with the gradient. This is exploited by Yang and Amari (1998) to compute efficiently the natural gradient for multilayer perceptrons with a single output and one hidden layer: assuming independently and identically distributed (i.i.d.) gaussian input, they explicitly derive the form of the Fisher information matrix and its inverse for their system and find that the latter's product with the gradient can be computed in just $O(n)$ steps. However, the resulting algorithm is rather complicated and does not lend itself to being extended to more complex adaptive systems (such as multilayer perceptrons with more than one output or hidden layer), curvature matrices other than the Fisher information, or inputs that are far from i.i.d. gaussian.
Neural Computation 14, 1723–1738 (2002)  © 2002 Massachusetts Institute of Technology
In order to set up a general framework that admits such extensions (and indeed applies to any twice-differentiable adaptive system), we abandon the notion of calculating the exact second-order gradient step in favor of an iterative approximation. The following iteration efficiently approaches $\vec v = \bar C^{-1}\vec u$ for an arbitrary vector $\vec u$ (Press, Teukolsky, Vetterling, & Flannery, 1992, page 57):

$$\vec v_0 = 0; \qquad (\forall t \ge 0)\quad \vec v_{t+1} = \vec v_t + D(\vec u - \bar C \vec v_t), \tag{1.1}$$

where $D$ is a conditioning matrix chosen close to $\bar C^{-1}$ if possible. Note that if we restrict $D$ to be diagonal, all operations in equation 1.1 can be performed in $O(n)$ time, except (one would suppose) for the matrix-vector product $\bar C \vec v_t$.
In fact, there is an $O(n)$ method for calculating the product of an $n \times n$ matrix with an arbitrary vector—if the matrix happens to be the Hessian of a system whose gradient can be calculated in $O(n)$, as is the case for most adaptive architectures encountered in practice. This fast Hessian-vector product (Pearlmutter, 1994; Werbos, 1988; Møller, 1993) can be used in conjunction with equation 1.1 to create an efficient, iterative $O(n)$ implementation of Newton's method.
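To make iteration 1.1 concrete, here is a minimal sketch on a toy quadratic problem. All names and sizes are illustrative: an explicit matrix stands in for the fast curvature matrix-vector product, and a crude scalar conditioner $D = 1/\mathrm{tr}(\bar C)$ is used, which guarantees convergence for positive definite $\bar C$ (at the price of slow convergence compared to a good diagonal conditioner).

```python
import numpy as np

# Sketch of iteration 1.1: v converges to C^{-1} u without ever
# inverting C. The explicit matrix C here is a stand-in for a fast
# O(n) curvature matrix-vector product.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
C = A @ A.T + np.eye(5)           # symmetric positive definite curvature
u = rng.standard_normal(5)        # e.g., the gradient

def curvature_vector_product(v):
    return C @ v                  # stand-in for the fast product

D = 1.0 / np.trace(C)             # crude scalar conditioner, < 1/lambda_max
v = np.zeros(5)
for _ in range(1500):
    v = v + D * (u - curvature_vector_product(v))   # equation 1.1

# v now approximates the second-order step direction C^{-1} u
print(np.allclose(v, np.linalg.solve(C, u), atol=1e-6))
```

With a diagonal conditioner closer to $\bar C^{-1}$, far fewer iterations would be needed; the point is only that each iteration touches $n$ numbers, never $n^2$.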
Unfortunately, Newton's method has severe stability problems when used in nonlinear systems, stemming from the fact that the Hessian may be ill-conditioned and is not guaranteed to be positive definite. Practical second-order methods therefore prefer measures of curvature that are better behaved, such as the outer product (Gauss-Newton) approximation of the Hessian, a model-trust region modification of the same (Levenberg, 1944; Marquardt, 1963), or the Fisher information.
Below, we define these matrices in a maximum likelihood framework for regression and classification and describe $O(n)$ algorithms for computing the product of any of them with an arbitrary vector for neural network architectures. These curvature matrix-vector products are, in fact, cheaper still than the fast Hessian-vector product and can be used in conjunction with equation 1.1 to implement rapid, iterative, optionally stochastic $O(n)$ variants of second-order gradient descent methods. The resulting algorithms are very general, practical (i.e., sufficiently robust and efficient), far less expensive than the conventional $O(n^2)$ and $O(n^3)$ approaches, and—with the aid of automatic differentiation software tools—comparatively easy to implement (see section 4).
We then examine two learning algorithms that use this approach: matrix momentum (Orr, 1995; Orr & Leen, 1997) and stochastic meta-descent (Schraudolph, 1999b, 1999c; Schraudolph & Giannakopoulos, 2000). Since both methods were derived by entirely different routes, viewing them as implementations of iteration 1.1 will provide additional insight into their operation and suggest new ways to improve them.
2 Definitions and Notation
Network. A neural network with $m$ inputs, $n$ weights, and $o$ linear outputs is usually regarded as a mapping $\mathbb R^m \to \mathbb R^o$ from an input pattern $\vec x$ to the corresponding output $\vec y$, for a given vector $\vec w$ of weights. Here we formalize such a network instead as a mapping $N: \mathbb R^n \to \mathbb R^o$ from weights to outputs (for given inputs), and write $\vec y = N(\vec w)$. To extend this formalism to networks with nonlinear outputs, we define the output nonlinearity $M: \mathbb R^o \to \mathbb R^o$ and write $\vec z = M(\vec y) = M(N(\vec w))$. For networks with linear outputs, $M$ will be the identity map.
Loss function. We consider neural network learning as the minimization of a scalar loss function $L: \mathbb R^o \to \mathbb R$ defined as the log-likelihood $L(\vec z) \equiv -\log \Pr(\vec z)$ of the output $\vec z$ under a suitable statistical model (Bishop, 1995). For supervised learning, $L$ may also implicitly depend on given targets $\vec z^*$ for the outputs. Formally, the loss can now be regarded as a function $L(M(N(\vec w)))$ of the weights, for a given set of inputs and (if supervised) targets.
Jacobian and gradient. The Jacobian $J_F$ of a function $F: \mathbb R^m \to \mathbb R^n$ is the $n \times m$ matrix of partial derivatives of the outputs of $F$ with respect to its inputs. For a neural network defined as above, the gradient—the vector $\vec g$ of derivatives of the loss with respect to the weights—is given by

$$\vec g \;\equiv\; \frac{\partial}{\partial \vec w}\, L(M(N(\vec w))) \;=\; J'_{L\circ M\circ N} \;=\; J'_N\, J'_M\, J'_L, \tag{2.1}$$

where $\circ$ denotes function composition and $'$ the matrix transpose.
Matching loss functions. We say that the loss function $L$ matches the output nonlinearity $M$ iff $J'_{L\circ M} = A\vec z + \vec b$, for some $A$ and $\vec b$ not dependent on $\vec w$.¹ The standard loss functions used in neural network regression and classification—sum-squared error for linear outputs and cross-entropy error for softmax or logistic outputs—are all matching loss functions with $A = I$ (the identity matrix) and $\vec b = -\vec z^*$, so that $J'_{L\circ M} = \vec z - \vec z^*$ (Bishop, 1995, chapter 6). This will simplify some of the calculations described in section 4.
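The matching property is easy to verify numerically. The following sketch (illustrative, not from the paper) checks that for softmax outputs with cross-entropy loss, the gradient of the loss with respect to the pre-nonlinearity outputs $\vec y$ is exactly $\vec z - \vec z^*$:

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())
    return e / e.sum()

def cross_entropy(y, z_star):
    return -np.dot(z_star, np.log(softmax(y)))

y = np.array([0.5, -1.2, 2.0])        # linear outputs of N
z_star = np.array([0.0, 1.0, 0.0])    # one-hot target

# central-difference gradient of the loss with respect to y
eps = 1e-6
num_grad = np.array([
    (cross_entropy(y + eps * e, z_star) - cross_entropy(y - eps * e, z_star))
    / (2 * eps)
    for e in np.eye(3)
])

# matching loss: J'_{L o M} = z - z*
print(np.allclose(num_grad, softmax(y) - z_star, atol=1e-7))
```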
Hessian. The instantaneous Hessian $H_F$ of a scalar function $F: \mathbb R^n \to \mathbb R$ is the $n \times n$ matrix of second derivatives of $F(\vec w)$ with respect to its inputs $\vec w$:

$$H_F \;\equiv\; \frac{\partial J_F}{\partial \vec w'}, \quad \text{i.e.,} \quad (H_F)_{ij} \;=\; \frac{\partial^2 F(\vec w)}{\partial w_i\, \partial w_j}. \tag{2.2}$$
¹ For supervised learning, a similar if somewhat more restrictive definition of matching loss functions is given by Helmbold, Kivinen, and Warmuth (1996) and Auer, Herbster, and Warmuth (1996).
For a neural network as defined above, we abbreviate $H \equiv H_{L\circ M\circ N}$. The Hessian proper, which we denote $\bar H$, is obtained by taking the expectation of $H$ over inputs: $\bar H \equiv \langle H \rangle_{\vec x}$. For matching loss functions, $H_{L\circ M} = A J_M = J'_M A'$.
Fisher information. The instantaneous Fisher information matrix $F_F$ of a scalar log-likelihood function $F: \mathbb R^n \to \mathbb R$ is the $n \times n$ matrix formed by the outer product of its first derivatives:

$$F_F \;\equiv\; J'_F\, J_F, \quad \text{i.e.,} \quad (F_F)_{ij} \;=\; \frac{\partial F(\vec w)}{\partial w_i}\, \frac{\partial F(\vec w)}{\partial w_j}. \tag{2.3}$$

Note that $F_F$ always has rank one. Again, we abbreviate $F \equiv F_{L\circ M\circ N} = \vec g\,\vec g'$. The Fisher information matrix proper, $\bar F \equiv \langle F \rangle_{\vec x}$, describes the geometric structure of weight space (Amari, 1985) and is used in the natural gradient descent approach (Amari, 1998).
3 Extended Gauss-Newton Approximation
Problems with the Hessian. The use of the Hessian in second-order gradient descent for neural networks is problematic. For nonlinear systems, $\bar H$ is not necessarily positive definite, so Newton's method may diverge or even take steps in uphill directions. Practical second-order gradient methods should therefore use approximations or modifications of the Hessian that are known to be reasonably well behaved, with positive semidefiniteness as a minimum requirement.
Fisher information. One alternative that has been proposed is the Fisher information matrix $\bar F$ (Amari, 1998), which, being a quadratic form, is positive semidefinite by definition. On the other hand, $\bar F$ ignores all second-order interactions between system parameters, thus throwing away potentially useful curvature information. By contrast, we shall derive an approximation of the Hessian that is provably positive semidefinite even though it does make use of second derivatives to model Hessian curvature better.
Gauss-Newton. An entire class of popular optimization techniques for nonlinear least-squares problems, as implemented by neural networks with linear outputs and sum-squared loss function, is based on the well-known Gauss-Newton (also referred to as linearized, outer product, or squared Jacobian) approximation of the Hessian. Here we extend the Gauss-Newton approach to other standard loss functions—in particular, the cross-entropy loss used in neural network classification—in such a way that even though some second-order information is retained, positive semidefiniteness can still be proved.
Using the product rule, the instantaneous Hessian of our neural network model can be written as

$$H \;=\; \frac{\partial}{\partial \vec w}\,\big(J'_N\, J'_{L\circ M}\big) \;=\; J'_N\, H_{L\circ M}\, J_N \;+\; \sum_{i=1}^{o} (J_{L\circ M})_i\, H_{N_i}, \tag{3.1}$$

where $i$ ranges over the $o$ outputs of $N$, with $N_i$ denoting the subnetwork that produces the $i$th output. Ignoring the second term above, we define the extended, instantaneous Gauss-Newton matrix:

$$G \;\equiv\; J'_N\, H_{L\circ M}\, J_N. \tag{3.2}$$

Note that $G$ has rank $\le o$ (the number of outputs) and is positive semidefinite, regardless of the choice of architecture for $N$, provided that $H_{L\circ M}$ is.
$G$ models the second-order interactions among $N$'s outputs (via $H_{L\circ M}$) while ignoring those arising within $N$ itself ($H_{N_i}$). This constitutes a compromise between the Hessian (which models all second-order interactions) and the Fisher information (which ignores them all). For systems with a single linear output and sum-squared error, $G$ reduces to $F$. For multiple outputs, it provides a richer ($\mathrm{rank}(G) \le o$ versus $\mathrm{rank}(F) = 1$) model of Hessian curvature.
Standard Loss Functions. For the standard loss functions used in neural network regression and classification, $G$ has additional interesting properties:

First, the residual $J'_{L\circ M} = \vec z - \vec z^*$ vanishes at the optimum for realizable problems, so that the Gauss-Newton approximation, equation 3.2, of the Hessian, equation 3.1, becomes exact in this case. For unrealizable problems, the residuals at the optimum have zero mean; this will tend to make the last term in equation 3.1 vanish in expectation, so that we can still assume $\bar G \approx \bar H$ near the optimum.
Second, in each case we can show that $H_{L\circ M}$ (and hence $G$, and hence $\bar G$) is positive semidefinite. For linear outputs with sum-squared loss—that is, conventional Gauss-Newton—$H_{L\circ M} = J_M$ is just the identity $I$; for independent logistic outputs with cross-entropy loss, it is $\mathrm{diag}[\mathrm{diag}(\vec z)(\vec 1 - \vec z)]$, positive semidefinite because $(\forall i)\ 0 < z_i < 1$. For softmax output with cross-entropy loss, we have $H_{L\circ M} = \mathrm{diag}(\vec z) - \vec z\,\vec z'$, which is also positive semidefinite since $(\forall i)\ z_i > 0$ and $\sum_i z_i = 1$, and thus

$$(\forall \vec v \in \mathbb R^o)\quad \vec v'\,[\mathrm{diag}(\vec z) - \vec z\,\vec z'\,]\,\vec v \;=\; \sum_i z_i v_i^2 \;-\; \Big(\sum_i z_i v_i\Big)^2$$
$$=\; \sum_i z_i v_i^2 \;-\; 2 \sum_i z_i v_i \Big(\sum_j z_j v_j\Big) \;+\; \Big(\sum_j z_j v_j\Big)^2$$
$$=\; \sum_i z_i \Big(v_i - \sum_j z_j v_j\Big)^2 \;\ge\; 0. \tag{3.3}$$
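Both the semidefiniteness claim and the algebraic identity in equation 3.3 can be spot-checked numerically; the following sketch (illustrative, not part of the paper) does so for random softmax outputs and random directions:

```python
import numpy as np

# Spot check of equation 3.3: for any softmax output z, the matrix
# diag(z) - z z' is positive semidefinite, and v'[diag(z) - z z']v
# equals the weighted variance sum_i z_i (v_i - z.v)^2.
rng = np.random.default_rng(1)
for _ in range(100):
    y = rng.standard_normal(4)
    z = np.exp(y) / np.exp(y).sum()          # softmax output
    H = np.diag(z) - np.outer(z, z)          # H_{L o M} for softmax + CE
    eigvals = np.linalg.eigvalsh(H)
    assert eigvals.min() > -1e-12            # PSD up to round-off
    v = rng.standard_normal(4)
    lhs = v @ H @ v
    rhs = np.sum(z * (v - np.dot(z, v)) ** 2)
    assert np.isclose(lhs, rhs)              # the identity in eq. 3.3
print("ok")
```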
Model-Trust Region. As long as $G$ is positive semidefinite—as proved above for standard loss functions—the extended Gauss-Newton algorithm will not take steps in uphill directions. However, it may still take very large (even infinite) steps. These may take us outside the model-trust region, the area in which our quadratic model of the error surface is reasonable. Model-trust region methods restrict the gradient step to a suitable neighborhood around the current point.

One popular way to enforce a model-trust region is the addition of a small diagonal term to the curvature matrix. Levenberg (1944) suggested adding $\lambda I$ to the Gauss-Newton matrix $\bar G$; Marquardt (1963) elaborated the additive term to $\lambda\,\mathrm{diag}(\bar G)$. The Levenberg-Marquardt algorithm directly inverts the resulting curvature matrix; where affordable (i.e., for relatively small systems), it has become today's workhorse of nonlinear least-squares optimization.
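The two damping variants are one line each. In this illustrative sketch (sizes and $\lambda$ chosen arbitrarily), note that the Levenberg-damped step is always shorter than the raw Gauss-Newton step, which is precisely the model-trust effect:

```python
import numpy as np

# Levenberg adds a scaled identity to the curvature matrix;
# Marquardt scales its diagonal instead.
rng = np.random.default_rng(2)
J = rng.standard_normal((10, 3))     # Jacobian of a small LSQ problem
G = J.T @ J                          # Gauss-Newton curvature
g = rng.standard_normal(3)           # gradient
lam = 0.1

step_levenberg = np.linalg.solve(G + lam * np.eye(3), g)
step_marquardt = np.linalg.solve(G + lam * np.diag(np.diag(G)), g)

# the damped (Levenberg) step is shorter than the raw Gauss-Newton step
raw = np.linalg.solve(G, g)
print(np.linalg.norm(step_levenberg) < np.linalg.norm(raw))
```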
4 Fast Curvature Matrix-Vector Products
We now describe algorithms that compute the product of $F$, $G$, or $H$ with an arbitrary $n$-dimensional vector $\vec v$ in $O(n)$. They can be used in conjunction with equation 1.1 to implement rapid and (if so desired) stochastic versions of various second-order gradient descent methods, including Newton's method, Gauss-Newton, Levenberg-Marquardt, and natural gradient descent.
4.1 The Passes. The fast matrix-vector products are all constructed from the same set of passes in which certain quantities are propagated through all or part of our neural network model (comprising $N$, $M$, and $L$) in forward or reverse direction. For implementation purposes, it should be noted that automatic differentiation software tools² can automatically produce these passes from a program implementing the basic forward pass $f_0$.

² See http://www-unix.mcs.anl.gov/autodiff/.
Curvature Matrix-Vector Products 1729
$f_0$. This is the ordinary forward pass of a neural network, evaluating the function $F(\vec w)$ it implements by propagating activity (i.e., intermediate results) forward through $F$.

$r_1$. The ordinary backward pass of a neural network, calculating $J'_F\,\vec u$ by propagating the vector $\vec u$ backward through $F$. This pass uses intermediate results computed in the $f_0$ pass.
$f_1$. Following Pearlmutter (1994), we define the Gateaux derivative

$$R_{\vec v}(F(\vec w)) \;\equiv\; \left.\frac{\partial F(\vec w + r \vec v)}{\partial r}\right|_{r=0} \;=\; J_F\, \vec v, \tag{4.1}$$

which describes the effect on a function $F(\vec w)$ of a weight perturbation in the direction of $\vec v$. By pushing $R_{\vec v}$, which obeys the usual rules for differential operators, down into the equations of the forward pass $f_0$, one obtains an efficient procedure to calculate $J_F\,\vec v$ from $\vec v$. (See Pearlmutter, 1994, for details and examples.) This $f_1$ pass uses intermediate results from the $f_0$ pass.
$r_2$. When the $R_{\vec v}$ operator is applied to the $r_1$ pass for a scalar function $F$, one obtains an efficient procedure for calculating the Hessian-vector product $H_F\,\vec v = R_{\vec v}(J'_F)$. (See Pearlmutter, 1994, for details and examples.) This $r_2$ pass uses intermediate results from the $f_0$, $f_1$, and $r_1$ passes.
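Because equation 4.1 is a directional derivative, any implementation of the $r_2$ pass can be sanity-checked by differencing the gradient along $\vec v$. The toy function below is illustrative (a separable loss rather than a network), so its exact Hessian-vector product is available in closed form for comparison:

```python
import numpy as np

def loss(w):
    return np.sum(np.tanh(w) ** 2)

def grad(w):
    t = np.tanh(w)
    return 2 * t * (1 - t ** 2)      # analytic gradient of the toy loss

rng = np.random.default_rng(3)
w = rng.standard_normal(6)
v = rng.standard_normal(6)

# H v approximated as the directional derivative R_v(grad), eq. 4.1
eps = 1e-6
Hv_numeric = (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

# exact H v: the toy loss is separable, so H is diagonal with entries
# d/dw [2 t (1 - t^2)] = (2 - 6 t^2)(1 - t^2)
t = np.tanh(w)
Hv_exact = (2 - 6 * t ** 2) * (1 - t ** 2) * v
print(np.allclose(Hv_numeric, Hv_exact, atol=1e-6))
```

The fast $r_2$ pass computes the same quantity exactly, in one backward sweep, without the round-off error of finite differencing.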
4.2 The Algorithms. The first step in all three matrix-vector products is computing the gradient $\vec g$ of our neural network model by standard backpropagation:

Gradient. $\vec g \equiv J'_{L\circ M\circ N}$ is computed by an $f_0$ pass through the entire model ($N$, $M$, and $L$), followed by an $r_1$ pass propagating $\vec u = 1$ back through the entire model ($L$, $M$, then $N$). For matching loss functions, there is a shortcut: since $J'_{L\circ M} = A\vec z + \vec b$, we can limit the forward pass to $N$ and $M$ (to compute $\vec z$), then $r_1$-propagate $\vec u = A\vec z + \vec b$ back through just $N$.
Fisher Information. To compute $F\vec v = \vec g\,\vec g'\,\vec v$, multiply the gradient $\vec g$ by the inner product between $\vec g$ and $\vec v$. If there is no random access to $\vec g$ or $\vec v$—that is, their elements can be accessed only through passes like the above—the scalar $\vec g'\vec v$ can instead be calculated by $f_1$-propagating $\vec v$ forward through the model ($N$, $M$, and $L$). This step is also necessary for the other two matrix-vector products.
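The Fisher-vector product never requires forming the rank-one matrix $\vec g\,\vec g'$; a scalar inner product followed by a rescaling of $\vec g$ suffices. A trivial illustrative comparison:

```python
import numpy as np

# F v = g g' v = g * (g . v): an O(n) product with the instantaneous
# Fisher matrix, versus the naive O(n^2) construction for comparison.
rng = np.random.default_rng(4)
g = rng.standard_normal(1000)     # gradient
v = rng.standard_normal(1000)     # arbitrary vector

Fv_fast = g * np.dot(g, v)        # O(n)
Fv_slow = np.outer(g, g) @ v      # O(n^2), for verification only
print(np.allclose(Fv_fast, Fv_slow))
```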
Hessian. After $f_1$-propagating $\vec v$ forward through $N$, $M$, and $L$, $r_2$-propagate $R_{\vec v}(1) = 0$ back through the entire model ($L$, $M$, then $N$) to obtain $H\vec v = R_{\vec v}(\vec g)$ (Pearlmutter, 1994). For matching loss functions, the shortcut is to $f_1$-propagate $\vec v$ through just $N$ and $M$ to obtain $R_{\vec v}(\vec z)$, then $r_2$-propagate $R_{\vec v}(J'_{L\circ M}) = A\,R_{\vec v}(\vec z)$ back through just $N$.

Table 1: Choice of curvature matrix $C$ for various gradient descent methods, passes needed to compute gradient $\vec g$ and fast matrix-vector product $\bar C\vec v$, and associated cost (for a multilayer perceptron) in flops per weight and pattern.

    C    Method              f0    r1    f1    r2    Cost (for g and Cv)
         cost per pass:       2     3     4     7
    I    simple gradient      ✓     ✓                       6
    F    natural gradient     ✓     ✓    (✓)               10
    G    Gauss-Newton         ✓     ✓✓    ✓                14
    H    Newton's method      ✓     ✓     ✓     ✓          18
Gauss-Newton. Following the $f_1$ pass, $r_2$-propagate $R_{\vec v}(1) = 0$ back through $L$ and $M$ to obtain $R_{\vec v}(J'_{L\circ M}) = H_{L\circ M}\,J_N\,\vec v$, then $r_1$-propagate that back through $N$, giving $G\vec v$. For matching loss functions, we do not require an $r_2$ pass. Since

$$G \;=\; J'_N\, H_{L\circ M}\, J_N \;=\; J'_N\, J'_M\, A'\, J_N, \tag{4.2}$$

we can limit the $f_1$ pass to $N$, multiply the result with $A'$, then $r_1$-propagate it back through $M$ and $N$. Alternatively, one may compute the equivalent $G\vec v = J'_N\, A\, J_M\, J_N\, \vec v$ by continuing the $f_1$ pass through $M$, multiplying with $A$, then $r_1$-propagating back through $N$ only.
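Equation 3.2 can be checked directly on a tiny network. The sketch below (illustrative; the network, sizes, and the finite-difference Jacobian exist only for verification, since the whole point of the passes above is to avoid forming $J_N$) builds $G = J'_N\,H_{L\circ M}\,J_N$ for a two-layer net with softmax outputs and confirms its positive semidefiniteness:

```python
import numpy as np

rng = np.random.default_rng(5)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
x = rng.standard_normal(3)           # one fixed input pattern

def net(w):                          # N: weights -> linear outputs y
    W1_, W2_ = w[:12].reshape(4, 3), w[12:].reshape(2, 4)
    return W2_ @ np.tanh(W1_ @ x)

w = np.concatenate([W1.ravel(), W2.ravel()])

# Jacobian J_N by central differences (verification only, O(n) passes
# per column; the paper's f1/r1 passes avoid this entirely)
eps = 1e-6
J = np.array([(net(w + eps * e) - net(w - eps * e)) / (2 * eps)
              for e in np.eye(w.size)]).T          # shape (2, 20)

z = np.exp(net(w)); z /= z.sum()                   # softmax output
H_LM = np.diag(z) - np.outer(z, z)                 # Hessian of L o M
G = J.T @ H_LM @ J                                 # extended Gauss-Newton

v = rng.standard_normal(w.size)
print(v @ G @ v >= -1e-12)                         # G is PSD
```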
Batch Average. To calculate the product of a curvature matrix $\bar C \equiv \langle C \rangle_{\vec x}$, where $C$ is one of $F$, $G$, or $H$, with vector $\vec v$, average the instantaneous product $C\vec v$ over all input patterns $\vec x$ (and associated targets $\vec z^*$, if applicable) while holding $\vec v$ constant. For large training sets or nonstationary streams of data, it is often preferable to estimate $\bar C\vec v$ by averaging over "mini-batches" of (typically) just 5 to 50 patterns.
4.3 Computational Cost. Table 1 summarizes, for a number of gradient descent methods, their choice of curvature matrix $C$, the passes needed (for a matching loss function) to calculate both the gradient $\vec g$ and the fast matrix-vector product $\bar C\vec v$, and the associated computational cost in terms of floating-point operations (flops) per weight and pattern in a multilayer perceptron. These figures ignore certain optimizations (e.g., not propagating gradients back to the inputs) and assume that any computation at the network's nodes is dwarfed by that required for the weights.

Computing both gradient and curvature matrix-vector product is typically about two to three times as expensive as calculating the gradient alone.
In combination with iteration 1.1, however, one can use the $O(n)$ matrix-vector product to implement second-order gradient methods whose rapid convergence more than compensates for the additional cost. We describe two such algorithms in the following section.
5 Rapid Second-Order Gradient Descent
We know of two neural network learning algorithms that combine the O(n)
curvature matrix-vector product with iteration 1.1 in some form: matrix
momentum (Orr, 1995; Orr & Leen, 1997) and our own stochastic meta-
descent (Schraudolph, 1999b, 1999c; Schraudolph & Giannakopoulos, 2000).
Since both of these were derived by entirely different routes, we gain fresh
insight into their operation by examining how they implement equation 1.1.
5.1 Stochastic Meta-Descent. Stochastic meta-descent (SMD—Schraudolph, 1999b, 1999c) is a new on-line algorithm for local learning rate adaptation. It updates the weights $\vec w$ by the simple gradient descent

$$\vec w_{t+1} \;=\; \vec w_t - \mathrm{diag}(\vec p_t)\, \vec g. \tag{5.1}$$

The vector $\vec p$ of local learning rates is adapted multiplicatively,

$$\vec p_t \;=\; \mathrm{diag}(\vec p_{t-1})\; \max\!\left(\tfrac{1}{2},\; 1 + \mu\, \mathrm{diag}(\vec v_t)\, \vec g\right), \tag{5.2}$$

using a scalar meta-learning rate $\mu$. Finally, the auxiliary vector $\vec v$ used in equation 5.2 is itself updated iteratively via

$$\vec v_{t+1} \;=\; \lambda \vec v_t + \mathrm{diag}(\vec p_t)\,(\vec g - \lambda C \vec v_t), \tag{5.3}$$

where $0 \le \lambda \le 1$ is a forgetting factor for nonstationary tasks. Although derived as part of a dual gradient descent procedure (minimizing loss with respect to both $\vec w$ and $\vec p$), equation 5.3 implements an interesting variation of equation 1.1. SMD thus employs rapid second-order techniques indirectly to help adapt local learning rates for the gradient descent in weight space.
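The three coupled updates fit in a few lines. The following is a minimal sketch on a noisy quadratic loss, not a reproduction of the paper's experiments: the problem, noise level, and parameter values ($\mu$, $\lambda$, $\vec p_0$) are illustrative, and an explicit matrix product stands in for the fast $O(n)$ curvature matrix-vector product.

```python
import numpy as np

# SMD updates 5.1-5.3 on the noisy quadratic 0.5 (w - w*)' C (w - w*).
rng = np.random.default_rng(6)
n = 10
A = rng.standard_normal((n, n))
C = A @ A.T / n + np.eye(n)          # positive definite curvature
w_star = rng.standard_normal(n)

w = np.zeros(n)
p = np.full(n, 0.1)                  # local learning rates p
v = np.zeros(n)                      # auxiliary vector v
mu, lam = 0.05, 1.0                  # meta-learning rate, forgetting factor

for _ in range(500):
    g = C @ (w - w_star) + 0.01 * rng.standard_normal(n)  # noisy gradient
    p = p * np.maximum(0.5, 1 + mu * v * g)               # equation 5.2
    w = w - p * g                                         # equation 5.1
    Cv = C @ v                       # stand-in for the fast O(n) product
    v = lam * v + p * (g - lam * Cv)                      # equation 5.3

print(np.linalg.norm(w - w_star) < 0.2)
```

Note how the max(½, ·) floor in equation 5.2 caps any single-step reduction of a learning rate at a factor of two, as discussed next.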
Linearization. The learning rate update, equation 5.2, minimizes the system's loss with respect to $\vec p$ by exponentiated gradient descent (Kivinen & Warmuth, 1995), but has been relinearized in order to avoid the computationally expensive exponentiation operation (Schraudolph, 1999a). The particular linearization used, $e^u \approx \max(\varrho, 1 + u)$, is based on a first-order Taylor expansion about $u = 0$, bounded below by $0 < \varrho < 1$ so as to safeguard against unreasonably small (and, worse, negative) multipliers for $\vec p$. The value of $\varrho$ determines the maximum permissible learning rate reduction; we follow many other step-size control methods in setting this to $\varrho = \tfrac{1}{2}$, the ratio between optimal and maximum stable step size in a symmetric bowl. Compared to direct exponentiated gradient descent, our linearized version, equation 5.2, thus dampens radical changes (in both directions) to $\vec p$ that may occasionally arise due to the stochastic nature of the data.
Diagonal, Adaptive Conditioner. For $\lambda = 1$, SMD's update of $\vec v$, equation 5.3, implements equation 1.1 with the diagonal conditioner $D = \mathrm{diag}(\vec p)$. Note that the learning rates $\vec p$ are being adapted so as to make the gradient step $\mathrm{diag}(\vec p)\,\vec g$ as effective as possible. A well-adapted $\vec p$ will typically make this step similar to the second-order gradient $\bar C^{-1}\vec g$. In this restricted sense, we can regard $\mathrm{diag}(\vec p)$ as an empirical diagonal approximation of $\bar C^{-1}$, making it a good choice for the conditioner $D$ in iteration 1.1.
Initial Learning Rates. Although SMD is very effective at adapting local learning rates to changing requirements, it is nonetheless sensitive to their initial values. All three of its update rules rely on $\vec p$ for their conditioning, so initial values that are very far from optimal are bound to cause problems: divergence if they are too high, lack of progress if they are too low. A simple architecture-dependent technique such as tempering (Schraudolph & Sejnowski, 1996) should usually suffice to initialize $\vec p$ adequately; the fine tuning can be left to the SMD algorithm.
Model-Trust Region. For $\lambda < 1$, the stochastic fixpoint of equation 5.3 is no longer $\vec v \to C^{-1}\vec g$, but rather

$$\vec v \;\to\; \big[\lambda C + (1 - \lambda)\,\mathrm{diag}(\vec p)^{-1}\big]^{-1}\, \vec g. \tag{5.4}$$

This clearly implements a model-trust region approach, in that a diagonal matrix is being added (in small proportion) to $C$ before inverting it. Moreover, the elements along the diagonal are not all identical as in Levenberg's (1944) method, but scale individually as suggested by Marquardt (1963). The scaling factors are determined by $1/\vec p$ rather than $\mathrm{diag}(\bar C)$, as the Levenberg-Marquardt method would have it, but these two vectors are related by our above argument that $\vec p$ is a diagonal approximation of $\bar C^{-1}$. For $\lambda < 1$, SMD's iteration 5.3 can thus be regarded as implementing an efficient stochastic variant of the Levenberg-Marquardt model-trust region approach.
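The fixed point 5.4 follows by setting $\vec v_{t+1} = \vec v_t$ in equation 5.3 and rearranging, and it can be verified numerically. In the illustrative sketch below, $\vec g$, $\vec p$, and $C$ are held fixed so that iteration 5.3 becomes deterministic:

```python
import numpy as np

# Iterating 5.3 with fixed g, p, C converges to the fixed point 5.4:
# v -> [lam*C + (1 - lam) * diag(p)^-1]^{-1} g.
rng = np.random.default_rng(7)
n = 5
A = rng.standard_normal((n, n))
C = A @ A.T / n + np.eye(n)
g = rng.standard_normal(n)
p = rng.uniform(0.05, 0.2, n)        # fixed local learning rates
lam = 0.9                            # forgetting factor < 1

v = np.zeros(n)
for _ in range(2000):
    v = lam * v + p * (g - lam * (C @ v))     # equation 5.3

v_star = np.linalg.solve(lam * C + (1 - lam) * np.diag(1 / p), g)
print(np.allclose(v, v_star, atol=1e-8))
```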
Benchmark Setup. We illustrate the behavior of SMD with empirical data obtained on the "four regions" benchmark (Singhal & Wu, 1989): a fully connected feedforward network $N$ with two hidden layers of 10 units each (see Figure 1, right) is to classify two continuous inputs in the range $[-1, 1]$ into four disjoint, nonconvex regions (see Figure 1, left). We use the standard softmax output nonlinearity $M$ with matching cross-entropy loss $L$, meta-learning rate $\mu = 0.05$, initial learning rates $\vec p_0 = 0.1$, and a hyperbolic tangent nonlinearity on the hidden units. For each run, the 184 weights (including bias weights for all units) are initialized to uniformly random values in the range $[-0.3, 0.3]$. Training patterns are generated on-line by drawing independent, uniformly random input samples; since each pattern is seen only once, the empirical loss provides an unbiased estimate of generalization ability. Patterns are presented in mini-batches of 10 each so as to reduce the computational overhead associated with SMD's parameter updates 5.1, 5.2, and 5.3.³

Figure 1: The four regions task (left), and the network we trained on it (right).
Curvature Matrix. Figure 2 shows loss curves for SMD with $\lambda = 1$ on the four regions problem, starting from 25 different random initial states, using the Hessian, Fisher information, and extended Gauss-Newton matrix, respectively, for $C$ in equation 5.3. With the Hessian (left), 80% of the runs diverge—most of them early on, when the risk that $H$ is not positive definite is greatest. When we guarantee positive semidefiniteness by switching to the Fisher information matrix (center), the proportion of diverged runs drops to 20%; those runs that still diverge do so only relatively late. Finally, for our extended Gauss-Newton approximation (right), only a single run diverges, illustrating the benefit of retaining certain second-order terms while preserving positive semidefiniteness. (For comparison, we cannot get matrix momentum to converge at all on anything as difficult as this benchmark.)
Stability. In contrast to matrix momentum, the high stochasticity of $\vec v$ affects the weights in SMD only indirectly, being buffered—and largely averaged out—by the incremental update 5.2 of the learning rates $\vec p$. This makes SMD far more stable, especially when $G$ is used as the curvature matrix. Its residual tendency to misbehave occasionally can be suppressed further by slightly lowering $\lambda$ so as to create a model-trust region. By curtailing the memory of iteration 5.3, however, this approach can compromise the rapid convergence of SMD. Figure 3 illustrates the resulting stability-performance trade-off on the four regions benchmark:

Figure 2: Loss curves for 25 runs of SMD with $\lambda = 1$, when using the Hessian (left), the Fisher information (center), or the extended Gauss-Newton matrix (right) for $C$ in equation 5.3. Vertical spikes indicate divergence.

³ In exploratory experiments, comparative results when training fully on-line (i.e., pattern by pattern) were noisier but not substantially different.
When using the extended Gauss-Newton approximation, a small reduction of $\lambda$ to 0.998 (solid line) is sufficient to prevent divergence, at a moderate cost in performance relative to $\lambda = 1$ (dashed, plotted up to the earliest point of divergence). When the Hessian is used, by contrast, $\lambda$ must be set as low as 0.95 to maintain stability, and convergence is slowed much further (dash-dotted). Even so, this is still significantly faster than the degenerate case of $\lambda = 0$ (dotted), which in effect implements IDD (Harmon & Baird, 1996), to our knowledge the best on-line method for local learning rate adaptation preceding SMD.

From these experiments, it appears that memory (i.e., $\lambda$ close to 1) is key to achieving the rapid convergence characteristic of SMD. We are now investigating more direct ways to keep iteration 5.3 under control, aiming to ensure the stability of SMD while maintaining its excellent performance near $\lambda = 1$.
Figure 3: Average loss over 25 runs of SMD for various combinations of curvature matrix $C$ and forgetting factor $\lambda$. Memory ($\lambda \to 1$) accelerates convergence over the conventional memory-less case $\lambda = 0$ (dotted) but can lead to instability. With the Hessian $H$, all 25 runs remain stable up to $\lambda = 0.95$ (dot-dashed line); using the extended Gauss-Newton matrix $G$ pushes this limit up to $\lambda = 0.998$ (solid line). The curve for $\lambda = 1$ (dashed line) is plotted up to the earliest point of divergence.

5.2 Matrix Momentum. The investigation of asymptotically optimal adaptive momentum for first-order stochastic gradient descent (Leen & Orr, 1994) led Orr (1995) to propose the following weight update:

$$\vec w_{t+1} \;=\; \vec w_t + \vec v_{t+1}, \qquad \vec v_{t+1} \;=\; \vec v_t - \mu\,(\varrho_t\, \vec g + C \vec v_t), \tag{5.5}$$

where $\mu$ is a scalar constant less than the inverse of $\bar C$'s largest eigenvalue, and $\varrho_t$ a rate parameter that is annealed from one to zero. We recognize equation 1.1 with scalar conditioner $D = \mu$ and stochastic fixed point $\vec v \to -\varrho_t\, C^{-1}\vec g$; thus, matrix momentum attempts to approximate partial second-order gradient steps directly via this fast, stochastic iteration.
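A minimal sketch of equation 5.5 on a noiseless quadratic (where the iteration is well behaved; the stochastic nonlinear setting discussed below is where it struggles). The annealing schedule and all parameter values are illustrative assumptions, not taken from Orr (1995):

```python
import numpy as np

# Matrix momentum, equation 5.5, minimizing 0.5 (w - w*)' C (w - w*).
rng = np.random.default_rng(8)
n = 5
A = rng.standard_normal((n, n))
C = A @ A.T / n + np.eye(n)
w_star = rng.standard_normal(n)

w = np.zeros(n)
v = np.zeros(n)
mu = 0.9 / np.linalg.eigvalsh(C).max()   # below 1 / largest eigenvalue
for t in range(3000):
    g = C @ (w - w_star)                 # exact gradient (noiseless)
    rho = max(0.0, 1.0 - t / 1000)       # rate annealed from one to zero
    v = v - mu * (rho * g + C @ v)       # equation 5.5
    w = w + v

print(np.linalg.norm(w - w_star) < 1e-3)
```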
Rapid Convergence. Orr (1995) found that in the late, annealing phase of learning, matrix momentum converges at optimal (second-order) asymptotic rates; this has been confirmed by subsequent analysis in a statistical mechanics framework (Rattray & Saad, 1999; Scarpetta, Rattray, & Saad, 1999). Moreover, compared to SMD's slow, incremental adaptation of learning rates, matrix momentum's direct second-order update of the weights promises a far shorter initial transient before rapid convergence sets in. Matrix momentum thus looks like the ideal candidate for a fast $O(n)$ stochastic gradient descent method.
Instability. Unfortunately, matrix momentum has a strong tendency to diverge for nonlinear systems when far from an optimum, as is the case during the search phase of learning. Current implementations therefore rely on simple (first-order) stochastic gradient descent initially, turning on matrix momentum only once the vicinity of an optimum has been reached (Orr, 1995; Orr & Leen, 1997). The instability of matrix momentum is not caused by lack of semidefiniteness on behalf of the curvature matrix: Orr (1995) used the Gauss-Newton approximation, and Scarpetta et al. (1999) reached similar conclusions for the Fisher information matrix. Instead, it is thought to be a consequence of the noise inherent in the stochastic approximation of the curvature matrix (Rattray & Saad, 1999; Scarpetta et al., 1999).
Recognizing matrix momentum as implementing the same iteration 1.1 as SMD suggests that its stability might be improved in similar ways—specifically, by incorporating a model-trust region parameter $\lambda$ and an adaptive diagonal conditioner. However, whereas in SMD such a conditioner was trivially available in the vector $\vec p$ of local learning rates, here it is by no means easy to construct, given our restriction to $O(n)$ algorithms, which are affordable for very large systems. We are investigating several routes toward a stable, adaptively conditioned form of matrix momentum.
6 Summary
We have extended the notion of Gauss-Newton approximation of the Hessian from nonlinear least-squares problems to arbitrary loss functions, and shown that it is positive semidefinite for the standard loss functions used in neural network regression and classification. We have given algorithms that compute the product of either the Fisher information or our extended Gauss-Newton matrix with an arbitrary vector in $O(n)$, similar to but even cheaper than the fast Hessian-vector product described by Pearlmutter (1994).

We have shown how these fast matrix-vector products may be used to construct $O(n)$ iterative approximations to a variety of common second-order gradient algorithms, including the Newton, natural gradient, Gauss-Newton, and Levenberg-Marquardt steps. Applying these insights to our recent SMD algorithm (Schraudolph, 1999b)—specifically, replacing the Hessian with our extended Gauss-Newton approximation—resulted in improved stability and performance. We are now investigating whether matrix momentum (Orr, 1995) can similarly be stabilized through the incorporation of an adaptive diagonal conditioner and a model-trust region parameter.
Acknowledgments
I thank Jenny Orr, Barak Pearlmutter, and the anonymous reviewers for
their helpful suggestions, and the Swiss National Science Foundation for
the financial support provided under grant number 2000–052678.97/1.
References
Amari, S. (1985). Differential-geometrical methods in statistics. New York: Springer-
Verlag.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Compu-
tation, 10(2), 251–276.
Auer, P., Herbster, M., & Warmuth, M. K. (1996). Exponentially many local min-
ima for single neurons. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo
(Eds.), Advances in neural information processing systems, 8 (pp. 316–322). Cam-
bridge, MA: MIT Press.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon
Press.
Harmon, M. E., & Baird III, L. C. (1996). Multi-player residual advantage learn-
ing with general function approximation (Tech. Rep. No. WL-TR-1065). Wright-
Patterson Air Force Base, OH: Wright Laboratory, WL/AACF. Available on-
line: www.leemon.com/papers/sim tech/sim tech.pdf.
Helmbold, D. P., Kivinen, J., & Warmuth, M. K. (1996). Worst-case loss bounds
for single neurons. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.),
Advances in neural information processing systems, 8 (pp. 309–315). Cambridge,
MA: MIT Press.
Kivinen, J., & Warmuth, M. K. (1995). Additive versus exponentiated gradient
updates for linear prediction. In Proc. 27th Annual ACM Symposium on Theory
of Computing (pp. 209–218). New York: Association for Computing Machin-
ery.
Leen, T. K., & Orr, G. B. (1994). Optimal stochastic search and adaptive momen-
tum. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural
information processing systems, 6 (pp. 477–484). San Mateo, CA: Morgan Kauf-
mann.
Levenberg, K. (1944). A method for the solution of certain non-linear problems
in least squares. Quarterly of Applied Mathematics, 2(2), 164–168.
Marquardt, D. (1963). An algorithm for least-squares estimation of non-linear
parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2),
431–441.
Møller, M. F. (1993). Exact calculation of the product of the Hessian matrix of feedfor-
ward network error functions and a vector in O(n) time. (Tech. Rep. No. DAIMI
PB-432). Århus, Denmark: Computer Science Department, Århus University.
Available on-line: www.daimi.au.dk/PB/432/PB432.ps.gz.
Orr, G. B. (1995). Dynamics and algorithms for stochastic learning. Doc-
toral dissertation, Oregon Graduate Institute, Beaverton. Available on-line
ftp://neural.cse.ogi.edu/pub/neural/papers/orrPhDch1-5.ps.Z,
orrPhDch6-9.ps.Z.
Orr, G. B., & Leen, T. K. (1997). Using curvature information for fast stochastic
search. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural
information processing systems, 9. Cambridge, MA: MIT Press.
Pearlmutter, B. A. (1994). Fast exact multiplication by the Hessian. Neural Com-
putation, 6(1), 147–160.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical
recipes in C: The art of scientific computing (2nd ed.). Cambridge: Cambridge
University Press.
Rattray, M., & Saad, D. (1999). Incorporating curvature information into on-line
learning. In D. Saad (Ed.), On-line learning in neural networks (pp. 183–207).
Cambridge: Cambridge University Press.
Scarpetta, S., Rattray, M., & Saad, D. (1999). Matrix momentum for practical
natural gradient learning. Journal of Physics A, 32, 4047–4059.
Schraudolph, N. N. (1999a). A fast, compact approximation of the expo-
nential function. Neural Computation, 11(4), 853–862. Available on-line:
www.inf.ethz.ch/~schraudo/pubs/exp.ps.gz.
Schraudolph, N. N. (1999b). Local gain adaptation in stochastic gradient de-
scent. In Proceedings of the 9th International Conference on Artificial Neu-
ral Networks (pp. 569–574). Edinburgh, Scotland: IEE. Available on-line:
www.inf.ethz.ch/~schraudo/pubs/smd.ps.gz.
Schraudolph, N. N. (1999c). Online learning with adaptive local step sizes. In
M. Marinaro & R. Tagliaferri (Eds.), Neural Nets—WIRN Vietri-99: Proceedings
of the 11th Italian Workshop on Neural Nets (pp. 151–156). Berlin: Springer-
Verlag.
Schraudolph, N. N., & Giannakopoulos, X. (2000). Online independent com-
ponent analysis with local learning rate adaptation. In S. A. Solla, T. K.
Leen, & K.-R. Müller (Eds.), Advances in neural information processing sys-
tems, 12 (pp. 789–795). Cambridge, MA: MIT Press. Available on-line:
www.inf.ethz.ch/~schraudo/pubs/smdica.ps.gz.
Schraudolph, N. N., & Sejnowski, T. J. (1996). Tempering backpropagation
networks: Not all weights are created equal. In D. S. Touretzky, M. C.
Mozer, & M. E. Hasselmo (Eds.), Advances in neural information process-
ing systems (pp. 563–569). Cambridge, MA: MIT Press. Available on-line:
www.inf.ethz.ch/~schraudo/pubs/nips95.ps.gz.
Singhal, S., & Wu, L. (1989). Training multilayer perceptrons with the extended
Kalman filter. In D. S. Touretzky (Ed.), Advances in neural information processing
systems (pp. 133–140). San Mateo, CA: Morgan Kaufmann.
Werbos, P. J. (1988). Backpropagation: Past and future. In Proceedings of the IEEE
International Conference on Neural Networks, San Diego, 1988 (Vol. I, pp. 343–
353). Long Beach, CA: IEEE Press.
Yang, H. H., & Amari, S. (1998). Complexity issues in natural gradient descent
method for training multilayer perceptrons. Neural Computation, 10(8), 2137–
2157.
Received December 21, 2000; accepted November 12, 2001.