
Neural Networks 11 (1998) 761–767

Contributed article

Automatic early stopping using cross validation: quantifying the criteria


Lutz Prechelt*
Fakultät für Informatik, Universität Karlsruhe, Karlsruhe, Germany

Received 28 May 1995; accepted 3 November 1997

Abstract

Cross validation can be used to detect when overfitting starts during supervised training of a neural network; training is then stopped before
convergence to avoid the overfitting (‘early stopping’). The exact criterion used for cross validation based early stopping, however, is chosen
in an ad-hoc fashion by most researchers or training is stopped interactively. To aid a more well-founded selection of the stopping criterion,
14 different automatic stopping criteria from three classes were evaluated empirically for their efficiency and effectiveness in 12 different
classification and approximation tasks using multi-layer perceptrons with RPROP training. The experiments show that, on average, slower
stopping criteria allow for small improvements in generalization (of the order of 4%), but cost about a factor of 4 more in training time.
© 1998 Elsevier Science Ltd. All rights reserved.

Keywords: Early stopping; Overfitting; Cross validation; Generalization; Empirical study; Supervised learning

1. Training for generalization

When training a neural network, one is usually interested in obtaining a network with optimal generalization performance. Generalization performance means small error on examples not seen during training.

Because standard neural network architectures such as the fully connected multi-layer perceptron almost always have too large a parameter space, such architectures are prone to overfitting (Geman et al., 1992). While the network seems to get better and better (the error on the training set decreases), at some point during training it actually begins to get worse again (the error on unseen examples increases).

There are basically two ways to fight overfitting: reducing the number of dimensions of the parameter space or reducing the effective size of each dimension. The parameters are usually the connection weights in the network. The corresponding techniques used in neural network training to reduce the number of parameters, i.e. the number of dimensions of the parameter space, are greedy constructive learning (e.g. Fahlman and Lebiere, 1990; Touretzky, 1990), pruning (e.g. Cun et al., 1990; Touretzky, 1990; Hassibi and Stork, 1993; Hanson et al., 1993; Levin et al., 1994; Cowan et al., 1994), or weight sharing (e.g. Nowlan and Hinton, 1992). The corresponding NN techniques for reducing the size of each parameter dimension are regularization, such as weight decay (e.g. Krogh and Hertz, 1992; Moody et al., 1992) and others (e.g. Weigend et al., 1991; Lippmann et al., 1991), or early stopping (Morgan and Bourlard, 1990; Touretzky, 1990). See also Reed (1993) and Fiesler (1994) for an overview and Finnoff et al. (1993) for an experimental comparison.

Early stopping is widely used because it is simple to understand and implement and has been reported to be superior to regularization methods in many cases, e.g. in Finnoff et al. (1993). The method can be used either interactively, i.e. based on human judgement, or automatically, i.e. based on some formal stopping criterion. However, such automatic stopping criteria are usually chosen in an ad-hoc fashion today. The present paper aims at providing some quantitative data to guide the selection among automatic stopping criteria. The means to achieve this goal is an empirical investigation of the behavior of 14 different criteria on 12 different learning problems.

The following sections discuss the problem of early stopping in general, formally introduce three classes of stopping criteria, and then describe the idea, set-up and results of the experimental study that measured the efficiency and the effectiveness of the criteria.

* Requests for reprints should be sent to Dr L. Prechelt. Tel: 0049 721 608 4068; Fax: 0049 721 608 7343; E-mail: [email protected].
0893-6080/98/$19.00 © 1998 Elsevier Science Ltd. All rights reserved. PII: S0893-6080(98)00010-0

Fig. 1. Idealized training and generalization error curves. Vertical: errors; horizontal: time.

2. Ideal and real generalization curves

In most introductory papers on supervised neural network training one can find a diagram similar to the one shown in Fig. 1. It is claimed to show the evolution over time of the per-example error on the training set and on a test set not used for training (the training curve and the generalization curve). Given this behavior, it is clear how to do early stopping using cross validation: (1) split the training data into a training set and a cross validation set, e.g. in a 2:1 proportion; (2) train only on the training set and evaluate the per-example error on the validation set once in a while, e.g. after every fifth epoch; (3) stop training as soon as the error on the cross validation set is higher than it was the last time it was checked; (4) use the weights the network had in that previous step as the result of the training run. This approach uses the cross validation set to anticipate the behavior on the test set (or in real use), assuming that the error on both will be similar.

However, the real situation is a lot more complex. Real generalization curves almost always have more than one local minimum. Baldi and Chauvin (1991) showed for linear networks with n inputs and n outputs that up to n such local minima are possible; for multi-layer networks, the situation is even worse. Thus, it is impossible in general to tell from the beginning of the curve whether the global minimum has already been seen or not, i.e. whether an increase in the generalization error indicates real overfitting or is just intermittent. Such a situation is shown in Fig. 2. This real generalization curve was measured during training of a two-hidden-layer network on the glass1 problem (see below). The curve exhibits as many as 16 local minima in the validation set error before severe overfitting begins at about epoch 400; of these local minima, four are the global minimum up to where they occur. The optimal stopping point in this example would be epoch 205. Note that stopping in epoch 400 compared with stopping shortly after the first 'deep' local minimum at epoch 45 trades about a seven-fold increase of learning time for an improvement of validation set performance by 1.1% (by finding the minimum at epoch 205). If representative training data is used, the validation error is an optimal estimation of the actual network performance; so we expect a 1.1% decrease of the generalization error in this case. Nevertheless, overfitting might sometimes go undetected because the validation set is not perfectly representative of the problem.

Unfortunately, this or any other generalization curve is not typical in the sense that all curves share the same qualitative behavior. Other curves might never reach a better minimum than the first, or than, say, the third; the mountains and valleys in the curve can be of very different width, height and shape. The only thing all curves seem to have in common is that the differences between the first and the following local minima, if any, are not huge. Theoretical analyses of the error curves cannot yet be carried out for the interesting cases, e.g. multi-layer perceptrons with sigmoid functions; today they are possible for simpler cases only, namely for linear networks (Baldi and Chauvin, 1991; Wang et al., 1994; Cowan et al., 1994).

Fig. 2. A real generalization error curve. Vertical: validation set error; horizontal: time (in training epochs).
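The naive four-step procedure from the beginning of this section can be sketched as follows. This is a minimal illustration, not code from the paper: `train_one_epoch` and `validation_error` are hypothetical caller-supplied stand-ins, and step (1), the 2:1 data split, is assumed to have happened already.

```python
import copy

def train_with_early_stopping(train_one_epoch, validation_error, weights,
                              eval_every=5, max_epochs=3000):
    """Naive cross-validation early stopping: train on the training set only,
    check the validation error every `eval_every` epochs, stop as soon as it
    rises relative to the previous check, and return the weights from that
    previous check."""
    prev_err, prev_weights = float("inf"), copy.deepcopy(weights)
    for epoch in range(1, max_epochs + 1):
        weights = train_one_epoch(weights)        # step (2): training set only
        if epoch % eval_every == 0:
            err = validation_error(weights)
            if err > prev_err:                    # step (3): stop on an increase
                break
            prev_err, prev_weights = err, copy.deepcopy(weights)
    return prev_weights, prev_err                 # step (4): previous best weights

# Toy stand-ins: the "weights" are a single number, the validation error is
# minimal at w = 1.0, and the training step pushes w upward past the optimum.
step = lambda w: w + 0.01
val_err = lambda w: (w - 1.0) ** 2
w_best, e_best = train_with_early_stopping(step, val_err, 0.0)  # stops near w = 1.0
```

As the rest of this section argues, stopping at the very first increase is usually premature on real generalization curves; the criteria introduced next relax exactly this decision.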

As we see, choosing a stopping criterion predominantly involves a tradeoff between training time and generalization error. However, some stopping criteria may typically find better tradeoffs than others. This leads to the question of which criterion to use with cross validation to decide when to stop training. The present work provides empirical data in order to give an answer.

3. Actual stopping criteria

There are a number of plausible stopping criteria. This work evaluates three classes of them.

To formally describe the criteria, we first need some definitions. Let E be the objective function (error function) of the training algorithm, for example the squared error. Then E_tr(t) is the average error per example over the training set, measured after epoch t. E_va(t) is the corresponding error on the validation set and is used by the stopping criterion. E_te(t) is the corresponding error on the test set; it is not known to the training algorithm, but characterizes the quality of the network resulting from training.

The value E_opt(t) is defined to be the lowest validation set error obtained in epochs up to t:

    E_{opt}(t) = \min_{t' \le t} E_{va}(t')

Now we define the generalization loss at epoch t to be the relative increase of the validation error over the minimum so far (in percent):

    GL(t) = 100 \cdot \left( \frac{E_{va}(t)}{E_{opt}(t)} - 1 \right)

A high generalization loss is one obvious candidate reason to stop training, because it directly indicates overfitting. This leads us to the first class of stopping criteria: stop as soon as the generalization loss exceeds a certain threshold. We define the class GL_\alpha as

    GL_\alpha: stop after the first epoch t with GL(t) > \alpha

However, we might want to suppress stopping if the training is still progressing very rapidly. The reasoning behind this approach is that when the training error still decreases quickly, generalization losses have a higher chance to be 'repaired'; we assume that overfitting does not begin until the error decreases only slowly. To formalize this notion, we define a training strip of length k to be a sequence of k epochs numbered n + 1, ..., n + k, where n is divisible by k. The training progress (in per thousand) measured after such a training strip is then

    P_k(t) = 1000 \cdot \left( \frac{\sum_{t' = t - k + 1}^{t} E_{tr}(t')}{k \cdot \min_{t' = t - k + 1}^{t} E_{tr}(t')} - 1 \right)

that is, "how much was the average training error during the strip larger than the minimum training error during the strip?" Note that this progress measure is high for unstable phases of training, where the training set error goes up instead of down. This is intended, because many training algorithms sometimes produce such 'jitter' by taking inappropriately large steps in weight space. The progress measure is, however, guaranteed to approach zero in the long run unless the training is globally unstable (e.g. oscillating).

Now we can define the second class of stopping criteria using the quotient of generalization loss and progress:

    PQ_\alpha: stop after the first end-of-strip epoch t with \frac{GL(t)}{P_k(t)} > \alpha

In the following we will always assume strips of length 5 and measure the cross validation error only at the end of each strip.

A third class of stopping criteria relies only on the sign of the changes in the generalization error. These criteria say "stop when the generalization error increased in s successive strips":

    UP_s: stop after epoch t if UP_{s-1} stops after epoch t - k and E_{va}(t) > E_{va}(t - k)
    UP_1: stop after the first end-of-strip epoch t with E_{va}(t) > E_{va}(t - k)

The idea behind this definition is that when the validation error has increased not only once, but during s consecutive strips, we assume that such increases indicate the beginning of final overfitting, independent of how large the increases actually are. The UP criteria have the advantage that they measure change locally, so they can directly be used in the context of pruning algorithms, where errors must be allowed to remain much higher than previous minima over long training periods.

Note that none of these criteria can guarantee termination. We thus complement them by the rule that training is stopped when the progress drops below 0.1 and also after at most 3000 epochs.

All stopping criteria have in common the way they are used: they decide to stop at some time t during training, and the result of the training is then the set of weights that exhibited the lowest validation error E_opt(t). Note that in order to implement this scheme, only one duplicate weight set is needed.
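The three criterion classes follow directly from the definitions in Section 3. The sketch below is illustrative (function names and the list-based representation of strips are my own, not the paper's); it assumes E_va and E_tr are sampled at the end of each strip of k = 5 epochs:

```python
def generalization_loss(e_va, e_opt):
    """GL(t) = 100 * (E_va(t) / E_opt(t) - 1): relative increase of the
    validation error over the minimum so far, in percent."""
    return 100.0 * (e_va / e_opt - 1.0)

def training_progress(strip_errors):
    """P_k(t) = 1000 * (sum of E_tr over the strip /
    (k * min of E_tr over the strip) - 1), in per thousand."""
    k = len(strip_errors)
    return 1000.0 * (sum(strip_errors) / (k * min(strip_errors)) - 1.0)

def stop_GL(alpha, e_va, e_opt):
    """Class GL_alpha: stop once the generalization loss exceeds alpha."""
    return generalization_loss(e_va, e_opt) > alpha

def stop_PQ(alpha, e_va, e_opt, strip_errors):
    """Class PQ_alpha: stop once GL(t) / P_k(t) exceeds alpha
    (evaluated at the end of a strip)."""
    return generalization_loss(e_va, e_opt) / training_progress(strip_errors) > alpha

def stop_UP(s, e_va_at_strip_ends):
    """Class UP_s: stop when the validation error rose at each of the
    last s strip boundaries, i.e. in s successive strips."""
    e = e_va_at_strip_ends
    if len(e) < s + 1:
        return False
    return all(e[-i] > e[-i - 1] for i in range(1, s + 1))

# A 10% generalization loss triggers GL_5 immediately, and also PQ_3
# when the training error has gone almost flat (low progress):
stop_GL(5, e_va=1.10, e_opt=1.00)                    # True: GL = 10 > 5
stop_PQ(3, 1.10, 1.00, [1.001, 1.0, 1.0, 1.0, 1.0])  # True: GL/P_k = 10/0.2 = 50
stop_UP(3, [1.00, 0.90, 0.95, 0.97, 0.99])           # True: three rises in a row
```

Note how the PQ example encodes the rationale given above: the same 10% loss would not trigger PQ_3 while the training error is still falling steeply, because a large P_k(t) shrinks the quotient.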

4. Design of the study

For most efficient use of training time, we would be interested in knowing which of these criteria will achieve how much generalization using how much training time on which kinds of problems. However, as said before, no direct mathematical analysis of the criteria with respect to these factors is possible today. Therefore, we resort to studying the criteria empirically.

To achieve a broad coverage, we use multiple different network topologies, multiple different learning tasks, and multiple different exemplars from each stopping criterion class. To keep the experiment feasible, only one training algorithm is used.

We are interested in answering the following questions:

1. Training time. How long will training take with each criterion, i.e. how fast or slow are they?
2. Efficiency. How much of this training time will be redundant, i.e. will occur after the finally found validation error minimum has been seen?
3. Effectiveness. How good will the resulting network performance be?
4. Tradeoffs. Which criteria provide the best time-performance tradeoff?
5. Quantification. How can the tradeoff be quantified?

To find the answers, we record for a large number of runs when each criterion would stop and what the associated network performance would be.

To measure network performance, we partition each dataset into two disjoint parts: training data and test data. The training data (and only that) is used for training the network; the test data is used to estimate the network performance after training has finished. The training data is further subdivided into a training set of examples used to adjust the network weights and a validation set of examples used to estimate network performance during training, as required by the stopping criteria. In the set-up described below, the validation set was never used for weight adjustment. This decision was made in order to obtain pure stopping criteria results. In a real application this would be a waste of training data and should be changed.

Twelve different problems were used, all from the Proben1 NN benchmark set (Prechelt, 1994). These problems form a sample of a quite broad class of domains, but are of course not universally representative of learning; see Prechelt (1994) for a discussion of how to characterize the Proben1 domains.

5. Experimental set-up

The stopping criteria examined were GL_1, GL_2, GL_3, GL_5, PQ_0.5, PQ_0.75, PQ_1, PQ_2, PQ_3, UP_2, UP_3, UP_4, UP_6, and UP_8. A series of simulations using all of the above criteria was run, in which all criteria were evaluated simultaneously, i.e. each single training run returned one result for each of the criteria. This approach reduces the variance of the estimation.

All runs were carried out using the RPROP training algorithm (Riedmiller and Braun, 1993) with the squared error function and the parameters η+ = 1.1, η− = 0.5, Δ ∈ 0.05...0.2 randomly per weight, Δ_max = 50, Δ_min = 0, initial weights −0.5...0.5 randomly. RPROP is a fast backpropagation variant similar in spirit to quickprop (Fahlman, 1988). It is about as fast as quickprop, but more stable without adjustment of the parameters. RPROP requires epoch learning, i.e. the weights are updated only once per epoch. Therefore, the algorithm is fast without parameter tuning for small training sets, but not recommendable for large training sets; that no parameter tuning is necessary for RPROP also helps to avoid the common methodological error of tuning parameters using the performance on the test sets.

The twelve problems have between 8 and 120 inputs, between 1 and 19 outputs, and between 214 and 7200 examples. All inputs and outputs are normalized to the range 0...1. Nine of the problems are classification tasks using 1-of-n output encoding (cancer, card, diabetes, gene, glass, heart, horse, soybean, and thyroid); three are approximation tasks (building, flare, and hearta). All problems are real datasets from realistic application domains.

The examples of each problem were partitioned into training (50%), validation (25%) and test set (25% of the examples) in three different random ways, resulting in 36 datasets. Each of these datasets was trained with 12 different feedforward network topologies: one-hidden-layer networks with 2, 4, 8, 16, 24, or 32 hidden nodes and two-hidden-layer networks with 2+2, 4+2, 4+4, 8+4, 8+8, or 16+8 hidden nodes in the first and second hidden layer, respectively; all these networks were fully connected, including all possible shortcut connections. For each of the network topologies and each dataset, two runs were made with linear output units and one with sigmoidal output units using the activation function f(x) = x/(1 + |x|). A popular rule of thumb recommends always using sigmoidal output units for classification tasks and linear output units for regression (approximation) tasks. This rule was not applied since it is too far from always being good; see Prechelt (1994).

Altogether, 1296 training runs were made for the comparison, giving 18 144 stopping criterion performance records for the 14 criteria. 270 of these records (or 1.5%), from 125 different runs, reached the 3000-epoch limit instead of using the stopping criterion itself.

6. Results and discussion

The results for each stopping criterion averaged over all 1296 runs are shown in Table 1. An explanation and interpretation of the entries in the table will now be given. Please note that much of the discussion is biased by the particular collection of criteria chosen for the study.

Table 1
Behavior of stopping criteria. S_GL2(C) is normalized training time and B_GL2(C) is normalized test error (both relative to GL_2); the S columns measure training time, the r, B, and P_g columns efficiency and effectiveness. For further description refer to the text.

C         S_Ĉ(C)   S_GL2(C)   r(C)     B_Ĉ(C)     B_GL2(C)   P_g(C)
UP_2      0.792    0.766      0.277    1.055      1.024      0.587
GL_1      0.956    0.823      0.308    1.044      1.010      0.680 a
UP_3      1.010    1.264      0.419    1.026 a    1.003      0.631
GL_2      1.237    1.000      0.514    1.034      1.000      0.723 a
UP_4      1.243    1.566      0.599    1.020 a    0.997      0.666
PQ_0.5    1.253    1.334      0.663    1.027      1.002      0.658
PQ_0.75   1.466    1.614      0.863    1.021      0.998      0.682
GL_3      1.550    1.450      0.712    1.025      0.994      0.748 a
PQ_1      1.635    1.796      1.038    1.018      0.994      0.704
UP_6      1.786    2.381      1.125    1.012 a    0.990      0.737
GL_5      2.014    2.013      1.162    1.021      0.991      0.772 a
PQ_2      2.184    2.510      1.636    1.012      0.990      0.768
UP_8      2.485    3.259      1.823    1.010      0.988      0.759
PQ_3      2.614    3.095      2.140    1.009      0.988      0.800

a Particularly good speed/performance tradeoffs.

6.1. Basic definitions

For each run, we define E_v(C) as the minimum validation set error found until criterion C indicates to stop; it is the error after epoch number t_m(C) (read: 'time of minimum'). E_t(C) is the corresponding test set error and characterizes network performance. Stopping occurs after epoch t_s(C) (read: 'time of stop'). A best criterion Ĉ of a particular run is one with minimum t_s among all those (of the criteria examined) with minimum E_v, i.e. a criterion that found the best validation set error fastest. There may be several best criteria, because multiple criteria may stop at the same epoch. Note that the criterion Ĉ does not really exist as such in general, because it changes from run to run. C is called 'good' in a particular run if E_v(C) = E_v(Ĉ), i.e. if it is among those that found the lowest validation set error, no matter how fast or slow. We now discuss the five questions raised earlier.

1. Training time. The slowness of a criterion C in a run, relative to another criterion x, is S_x(C) := t_s(C)/t_s(x), i.e. the relative total training time. As we see, the times relative to a fixed criterion, as shown in column S_GL2(C), vary by more than a factor of 4. Therefore, the decision for a particular stopping criterion influences training times dramatically, even if one considers only the range of criteria used here. In contrast, even the slowest criteria train only about 2.5 times as long as the fastest criterion of each run that finds the same result, as indicated in column S_Ĉ(C). This shows that the training times are not completely unreasonable even for the slower criteria, but do indeed pay off to some degree.

2. Efficiency. The redundancy of a criterion can be defined as r(C) := (t_s(C)/t_m(C)) − 1. It characterizes how long the training continues after the final solution has been found. r(C) = 0 would be perfect; r(C) = 1 means that the criterion trains twice as long as necessary. Low values indicate efficient criteria. As we see, the slower a criterion is, the less efficient it tends to get. Even the fastest criteria 'waste' about one-fifth of overall training time. The slower criteria train more than twice as long as would be necessary for finding the same solution.

3. Effectiveness. We define the 'badness' of a criterion C in a run relative to another criterion x as B_x(C) := E_t(C)/E_t(x), i.e. its relative error on the test set. P_g(C) is the fraction of the 1296 runs in which C was a good criterion; this is an estimate of the probability that C is good in a run. As we see from the P_g column, even the fastest criteria are fairly effective: they obtain a result as good as the best-of-that-run criteria in about 60% of the cases. On the other hand, even the slowest criteria are not at all infallible; they achieve about 80%. So, to obtain the best possible results, a conjunction of all three criteria classes has to be used. However, P_g says nothing about how far from the optimum the remaining runs are. Columns B_Ĉ(C) and B_GL2(C) indicate that these differences are usually rather small: column B_GL2(C) shows that even the criteria with the lowest error achieve only about 1% lower error on average than the relatively fast criterion GL_2. In column B_Ĉ(C) we see that even several only modestly slow criteria have just about 2% higher error on average than the best criteria of the same run.

4. Best tradeoffs. Despite the common overall trend, some criteria may be more cost-effective than others, i.e. provide better tradeoffs between training time and resulting network performance. Column B_Ĉ of the table suggests that the best tradeoffs between test set performance and training time are offered (in order of increasing willingness to spend lots of training time) by UP_3, UP_4, and UP_6, if one wants to optimize the expected network performance from a single run. If, on the other hand, one wants to make several runs and pick the network that seems best (based on its validation set error), P_g is the relevant metric and the GL criteria are preferable. The criteria with the best tradeoffs are marked in the table.
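The per-run quantities used in this discussion follow directly from their definitions. A small sketch (function names are mine; the example epoch numbers are the glass1 figures from Section 2):

```python
def slowness(t_stop_c, t_stop_x):
    """S_x(C) = t_s(C) / t_s(x): training time of C relative to criterion x."""
    return t_stop_c / t_stop_x

def redundancy(t_stop, t_min):
    """r(C) = t_s(C) / t_m(C) - 1: training spent after the best validation
    epoch; 0 is perfect, 1 means training twice as long as necessary."""
    return t_stop / t_min - 1.0

def badness(e_test_c, e_test_x):
    """B_x(C) = E_t(C) / E_t(x): test error of C relative to criterion x."""
    return e_test_c / e_test_x

# A criterion stopping at epoch 400 whose best validation epoch was 205
# trained about 95% longer than necessary:
r = redundancy(t_stop=400, t_min=205)   # ~0.951
```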

Fig. 3. Badness B_Ĉ(C) and P_g against slowness S_Ĉ(C) of criteria.

Fig. 3 illustrates these results. The upper curve corresponds to column B_Ĉ of the table (plotted against column S_Ĉ); local minima indicate criteria with the best tradeoffs. The lower curve corresponds to column P_g; local maxima indicate the criteria with the best tradeoffs.

5. Quantification. From columns S_GL2(C) and B_GL2(C) we can quantify the tradeoff involved in the selection of a stopping criterion as follows: in the range of criteria examined, we can roughly trade a 4% decrease in test set error (from 1.024 to 0.988) for about a four-fold increase in training time (from 0.766 to 3.095). Within this range, some criteria are somewhat better than others, but no panacea exists among the criteria examined in this study.

Attempts were also made to find out whether similar results hold for more specialized circumstances, such as only large or only small networks, only large or only small data sets, or only particular learning problems. To do this, a factor analysis was performed by reviewing appropriate subsets of the data separately. The results indicate that, generally, the same trends hold for specialized circumstances within the limits of the study. One notable exception was the fact that for very small networks the PQ criteria are more cost-effective than both the GL and the UP criteria for minimizing B_Ĉ(C). An explanation of this lies in the fact that such small networks do not overfit severely; in this case it is advantageous to take training progress into account as an additional factor to determine when to stop training.

7. Conclusion and further work

This work studied three classes of stopping criteria, namely GL, UP, and PQ, on a variety of learning problems. The results indicate that 'slower' criteria, which stop later than others, on average indeed lead to improved generalization compared to 'faster' ones. However, the training time that has to be expended for such improvements is rather long.

It remains an open question whether and how the above results apply to other training algorithms, other error functions, and in particular other problem domains. Future work should address these issues in order to provide clear quantitative engineering rules for network construction using early stopping. In particular, a theory should be built that quantitatively explains the empirical data. Such a theory would then have to be validated by further empirical studies. Only such a theory can overcome the inherent limitation of empirical work: the difficulty in generalizing the results to other situations.

For training set-ups similar to the one used in this work, the following rules can be used to select a stopping criterion:

1. use fast stopping criteria unless small improvements of network performance (e.g. 4%) are worth large increases of training time (e.g. a factor of 4);
2. to maximize the probability of finding a good solution (as opposed to maximizing the average quality of solutions), use a GL criterion;
3. to maximize the average quality of solutions, use a PQ criterion if the network overfits only very little, or an UP criterion otherwise.
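These three rules can be read as a small decision procedure. The sketch below encodes them; the specific criteria returned alongside each class are illustrative picks from Table 1 (my choice, not a recommendation made by the paper, which recommends the classes only):

```python
def choose_stopping_criterion(small_gain_worth_big_cost,
                              pick_best_of_several_runs,
                              overfits_only_little):
    """Encode the three selection rules as a lookup. Returns the
    recommended criterion class and, for illustration only, a
    hypothetical representative member of that class."""
    if not small_gain_worth_big_cost:
        return ("fast", "UP2")   # rule 1: prefer a fast criterion
    if pick_best_of_several_runs:
        return ("GL", "GL5")     # rule 2: maximize chance of a good run
    if overfits_only_little:
        return ("PQ", "PQ3")     # rule 3: little overfitting -> PQ
    return ("UP", "UP6")         # rule 3: otherwise -> UP
```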

References

Baldi, P. & Chauvin, Y., 1991. Temporal evolution of generalization during learning in linear networks. Neural Computation, 3, 589–603.
Cowan, J.D., Tesauro, G. & Alspector, J. (Eds.), 1994. Advances in Neural Information Processing Systems 6, Morgan Kaufman, San Mateo, CA.
Cun, Y.L., Denker, J.S. & Solla, S.A., 1990. Optimal brain damage. In: Touretzky, D.S. (Ed.), Advances in Neural Information Processing Systems 2, Morgan Kaufman, San Mateo, CA, pp. 598–605.
Fahlman, S.E., 1988. An empirical study of learning speed in back-propagation networks. Technical Report CMU-CS-88-162, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
Fahlman, S.E. & Lebiere, C., 1990. The cascade-correlation learning architecture. In: Touretzky, D.S. (Ed.), Advances in Neural Information Processing Systems 2, Morgan Kaufman, San Mateo, CA, pp. 524–532.
Fiesler, E., 1994. Comparative bibliography of ontogenic neural networks. In: Intl. Conf. on Artificial Neural Networks, Springer, London, UK.
Finnoff, W., Hergert, F. & Zimmermann, H.G., 1993. Improving model selection by nonconvergent methods. Neural Networks, 6, 771–783.
Geman, S., Bienenstock, E. & Doursat, R., 1992. Neural networks and the bias/variance dilemma. Neural Computation, 4, 1–58.
Hanson, S.J., Cowan, J.D. & Giles, C.L. (Eds.), 1993. Advances in Neural Information Processing Systems 5, Morgan Kaufman, San Mateo, CA.
Hassibi, B. & Stork, D.G., 1993. Second order derivatives for network pruning: optimal brain surgeon. In: Advances in Neural Information Processing Systems 5, Morgan Kaufman, San Mateo, CA, pp. 164–171.
Krogh, A. & Hertz, J.A., 1992. A simple weight decay can improve generalization. In: Advances in Neural Information Processing Systems 4, Morgan Kaufman, San Mateo, CA, pp. 950–957.
Levin, A.U., Leen, T.K. & Moody, J.E., 1994. Fast pruning using principal components. In: Advances in Neural Information Processing Systems 6, Morgan Kaufman, San Mateo, CA.
Lippmann, R.P., Moody, J.E. & Touretzky, D.S. (Eds.), 1991. Advances in Neural Information Processing Systems 3, Morgan Kaufman, San Mateo, CA.
Moody, J.E., Hanson, S.J. & Lippmann, R.P. (Eds.), 1992. Advances in Neural Information Processing Systems 4, Morgan Kaufman, San Mateo, CA.
Morgan, N. & Bourlard, H., 1990. Generalization and parameter estimation in feedforward nets: some experiments. In: Touretzky, D.S. (Ed.), Advances in Neural Information Processing Systems 2, San Mateo, CA, pp. 630–637.
Nowlan, S.J. & Hinton, G.E., 1992. Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4), 473–493.
Prechelt, L., 1994. PROBEN1—a set of benchmarks and benchmarking rules for neural network training algorithms. Technical Report 21/94, Fakultät für Informatik, Universität Karlsruhe, Germany. Anonymous FTP: /pub/papers/techreports/1994/1994-21.ps.gz on ftp.ira.uka.de.
Reed, R., 1993. Pruning algorithms—a survey. IEEE Transactions on Neural Networks, 4(5), 740–746.
Riedmiller, M. & Braun, H., 1993. A direct adaptive method for faster backpropagation learning: the RPROP algorithm. In: Proceedings of the IEEE International Conference on Neural Networks, San Francisco, CA, pp. 586–591.
Touretzky, D.S. (Ed.), 1990. Advances in Neural Information Processing Systems 2, Morgan Kaufman, San Mateo, CA.
Wang, C., Venkatesh, S.S. & Judd, J.S., 1994. Optimal stopping and effective machine complexity in learning. In: Advances in Neural Information Processing Systems 6, Morgan Kaufman, San Mateo, CA.
Weigend, A.S., Rumelhart, D.E. & Huberman, B.A., 1991. Generalization by weight-elimination with application to forecasting. In: Advances in Neural Information Processing Systems 3, Morgan Kaufman, San Mateo, CA, pp. 875–882.
