K-fold cross-validation divides the sample set into k parts: k-1 parts
are used as the training set and the remaining part as the validation
set, which is used to measure the error rate of the resulting
classifier or regression model. The procedure is normally repeated k
times, until each of the k parts has served as the validation set
exactly once.
The cross-validation methods are described in more detail below.
Cross Validation
Cross validation is a model evaluation method that is better than
simply evaluating residuals on the training data. The problem with
residual evaluations is that they do
not give an indication of how well the learner will do when it is
asked to make new predictions for data it has not already seen. One
way to overcome this problem is to not use the entire data set when
training a learner. Some of the data is removed before training
begins. Then when training is done, the data that was removed can
be used to test the performance of the learned model on ``new''
data. This is the basic idea for a whole class of model evaluation
methods called cross validation.
The holdout method is the simplest kind of cross validation. The
data set is separated into two sets, called the training set and
the testing set. The function approximator fits a function using
the training set only. Then the function approximator is asked to
predict the output values for the data in the testing set (it has
never seen these output values before). The errors it makes are
accumulated as before to give the mean absolute test set error,
which is used to evaluate the model. The advantage of this method
is that it is usually preferable to the residual method and takes
no longer to compute. However, its evaluation can have a high
variance. The evaluation may depend heavily on which data points
end up in the training set and which end up in the test set, and
thus the evaluation may be significantly different depending on how
the division is made.
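As a concrete illustration, here is a minimal sketch of the holdout
method in Python. The data set, the 30% split fraction, and the
straight-line model are assumptions made for this sketch, not part of
the text above.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative data: a noisy linear relationship (an assumption for
    # this sketch, not taken from the text).
    x = rng.uniform(0, 10, size=200)
    y = 3.0 * x + 2.0 + rng.normal(scale=2.0, size=x.size)

    # Hold out 30% of the points as the testing set; the fraction is an
    # arbitrary choice.
    idx = rng.permutation(x.size)
    n_test = int(0.3 * x.size)
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    # Fit a straight line using the training set only.
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)

    # Accumulate errors on the held-out points ("new" data for the
    # model) into the mean absolute test set error.
    pred = slope * x[test_idx] + intercept
    mae = np.mean(np.abs(y[test_idx] - pred))
    print(f"holdout mean absolute test error: {mae:.3f}")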
K-fold cross validation is one way to improve over the holdout
method. The data set is divided into k subsets, and the holdout
method is repeated k times. Each time, one of the k subsets is used
as the test set and the other k-1 subsets are put together to form
a training set. Then the average error across all k trials is
computed. The advantage of this method is that it matters less how
the data gets divided. Every data point gets to be in a test set
exactly once, and gets to be in a training set k-1 times. The
variance of the resulting estimate is reduced as k is increased.
The disadvantage of this method is that the training algorithm has
to be rerun from scratch k times, which means it takes k times as
much computation to make an evaluation. A variant of this method is
to randomly divide the data into a test and training set k
different times. The advantage of doing this is that you can
independently choose how large each test set is and how many trials
you average over.
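A minimal k-fold sketch along the same lines, again assuming an
illustrative data set and a straight-line model; only the splitting
and averaging logic is the point here.

    import numpy as np

    def k_fold_mae(x, y, k=5, seed=0):
        """Average mean absolute error over k folds for a line fit."""
        rng = np.random.default_rng(seed)
        # k roughly equal-sized subsets of shuffled indices
        folds = np.array_split(rng.permutation(len(x)), k)
        errors = []
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
            pred = slope * x[test_idx] + intercept
            errors.append(np.mean(np.abs(y[test_idx] - pred)))
        return np.mean(errors)  # average error across all k trials

    # Illustrative data, as in the holdout sketch above.
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=200)
    y = 3.0 * x + 2.0 + rng.normal(scale=2.0, size=x.size)
    print(f"5-fold CV mean absolute error: {k_fold_mae(x, y, k=5):.3f}")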
Leave-one-out cross validation is K-fold cross validation taken to
its logical extreme, with K equal to N, the number of data points
in the set. That means that N separate times, the function
approximator is trained on all the data except for one point and a
prediction is made for that point. As before, the average error is
computed and used to evaluate the model. The evaluation given by
leave-one-out cross validation error (LOO-XVE) is good, but at
first pass it seems very expensive to compute. Fortunately, locally
weighted learners can make LOO predictions just as easily as they
make regular predictions. That means computing the LOO-XVE takes no
more time than computing the residual error and it is a much better
way to evaluate models. We will see shortly that Vizier relies
heavily on LOO-XVE to choose its metacodes.
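The following sketch spells out the naive leave-one-out computation,
refitting N separate times. It does not use the locally weighted
shortcut mentioned above, which is specific to learners like Vizier's;
the data and the straight-line model are again assumptions.

    import numpy as np

    def loo_mae(x, y):
        """Naive leave-one-out error: refit once per held-out point."""
        n = len(x)
        abs_errors = np.empty(n)
        for i in range(n):
            mask = np.ones(n, dtype=bool)
            mask[i] = False  # leave point i out of the training data
            slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
            abs_errors[i] = abs(y[i] - (slope * x[i] + intercept))
        return abs_errors.mean()  # mean absolute LOO error (LOO-XVE)

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=100)
    y = 3.0 * x + 2.0 + rng.normal(scale=2.0, size=x.size)
    print(f"LOO mean absolute error: {loo_mae(x, y):.3f}")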
Figure 26: Cross validation checks how well a model generalizes to
new data
Fig. 26 shows an example of cross validation performing better than
residual error. The data set in the top two graphs is a simple
underlying function with significant noise. Cross validation tells
us that broad smoothing is best. The data set in the bottom two
graphs is a complex underlying function with no noise. Cross
validation tells us that very little smoothing is best for this
data set.
Now we return to the question of choosing a good metacode for data
set a1.mbl:
File -> Open -> a1.mbl
Edit -> Metacode -> A90:9
Model -> LOOPredict
Edit -> Metacode -> L90:9
Model -> LOOPredict
Edit -> Metacode -> L10:9
Model -> LOOPredict
LOOPredict goes through the entire data set and makes LOO
predictions for each point. At the bottom of the page it shows the
summary statistics including Mean LOO error, RMS LOO error, and
information about the data point with the largest error. The mean
absolute LOO-XVEs for the three metacodes given above (the same
three used to generate the graphs in fig. 25) are 2.98, 1.23, and
1.80. Those values show that global linear regression is the best
metacode of those three, which agrees with our intuitive feeling
from looking at the plots in fig. 25. If you repeat the above
operation on data set b1.mbl you'll get the values 4.83, 4.45, and
0.39, which also agree with our observations.
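For readers without Vizier, the same model-selection idea can be
sketched generically: compute the mean absolute LOO error for several
candidate models and keep the one with the smallest value. The
polynomial candidates and the synthetic data below are stand-ins for
the metacodes and data sets above, not reproductions of them.

    import numpy as np

    def loo_mae_poly(x, y, degree):
        """Mean absolute LOO error for a polynomial of the given degree."""
        n = len(x)
        errs = np.empty(n)
        for i in range(n):
            mask = np.ones(n, dtype=bool)
            mask[i] = False
            coeffs = np.polyfit(x[mask], y[mask], deg=degree)
            errs[i] = abs(y[i] - np.polyval(coeffs, x[i]))
        return errs.mean()

    # Illustrative data and three candidates of increasing flexibility.
    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(0, 10, size=80))
    y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

    for degree in (1, 3, 9):
        print(f"degree {degree}: mean absolute LOO error = "
              f"{loo_mae_poly(x, y, degree):.3f}")
    # The candidate with the smallest LOO error would be chosen.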
Cross-validation and bootstrapping are both methods for estimating
generalization error based on "resampling" (Weiss and Kulikowski 1991;
Efron and Tibshirani 1993; Hjorth 1994; Plutowski, Sakata, and White
1994; Shao and Tu 1995). The resulting estimates of generalization
error are often used for choosing among various models, such as
different network architectures.
Cross-validation
++++++++++++++++
In k-fold cross-validation, you divide the data into k subsets of
(approximately) equal size. You train the net k times, each time
leaving out one of the subsets from training, but using only the
omitted subset to compute whatever error criterion interests you. If
k equals the sample size, this is called "leave-one-out"
cross-validation. "Leave-v-out" is a more elaborate and expensive
version of cross-validation that involves leaving out all possible
subsets of v cases.
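If scikit-learn is available, the k-fold, leave-one-out, and
leave-v-out schemes described above correspond to its KFold,
LeaveOneOut, and LeavePOut splitters; the sketch below assumes a small
illustrative regression problem and a linear model.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import (KFold, LeaveOneOut, LeavePOut,
                                         cross_val_score)

    # Illustrative data; the estimator and error criterion are assumptions.
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(60, 1))
    y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=2.0, size=60)

    schemes = [
        ("5-fold", KFold(n_splits=5, shuffle=True, random_state=0)),
        ("leave-one-out", LeaveOneOut()),
        ("leave-2-out", LeavePOut(p=2)),  # all C(60, 2) subsets: expensive
    ]
    for name, cv in schemes:
        scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                                 scoring="neg_mean_absolute_error")
        print(f"{name}: mean absolute error = {-scores.mean():.3f}")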
Note that cross-validation is quite different from the "split-sample"
or "hold-out" method that is commonly used for early stopping in NNs.
In the split-sample method, only a single subset (the validation set)
is used to estimate the generalization error, instead of k different
subsets; i.e., there is no "crossing". While various people have
suggested that cross-validation be applied to early stopping, the
proper way of doing so is not obvious.
The distinction between cross-validation and split-sample validation
is extremely important because cross-validation is markedly superior
for small data sets; this fact is demonstrated dramatically by Goutte
(1997) in a reply to Zhu and Rohwer (1996). For an insightful
discussion of the limitations of cross-validatory choice among
several learning methods, see Stone (1977).
Jackknifing
+++++++++++
Leave-one-out cross-validation is also easily confused with
jackknifing. Both involve omitting each training case in turn and
retraining the network on the remaining subset. But cross-validation
is used to estimate generalization error, while the jackknife is used
to estimate the bias of a statistic. In the jackknife, you compute
some statistic of interest in each subset of the data. The average of
these subset statistics is compared with the corresponding statistic
computed from the entire sample in order to estimate the bias of the
latter. You can also get a jackknife estimate of the standard error
of a statistic. Jackknifing can be used to estimate the bias of the
training error and hence to estimate the generalization error, but
this process is more complicated than leave-one-out cross-validation
(Efron, 1982; Ripley, 1996, p. 73).
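A small sketch of the jackknife bias estimate may make the contrast
concrete. The statistic (the plug-in variance, which divides by n and
is therefore biased) and the data are assumptions chosen for
illustration.

    import numpy as np

    def jackknife_bias(data, statistic):
        """Jackknife bias estimate: (n - 1) times the difference between
        the mean of the leave-one-out statistics and the full-sample
        statistic."""
        n = len(data)
        full = statistic(data)
        loo_stats = np.array([statistic(np.delete(data, i)) for i in range(n)])
        return (n - 1) * (loo_stats.mean() - full)

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=0.0, scale=2.0, size=30)

    plug_in_var = lambda d: np.mean((d - d.mean()) ** 2)  # biased estimator
    bias = jackknife_bias(sample, plug_in_var)
    print(f"plug-in variance:        {plug_in_var(sample):.3f}")
    print(f"jackknife bias estimate: {bias:.3f}")
    print(f"bias-corrected variance: {plug_in_var(sample) - bias:.3f}")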
Choice of cross-validation method
+++++++++++++++++++++++++++++++++
Cross-validation can be used simply to estimate the generalization
error of a given model, or it can be used for model selection by
choosing one of several models that has the smallest estimated
generalization error. For example, you might use cross-validation to
choose the number of hidden units, or you could use cross-validation
to choose a subset of the inputs (subset selection). A subset that
contains all relevant inputs will be called a "good" subset, while
the subset that contains all relevant inputs but no others will be
called the "best" subset. Note that subsets are "good" and "best" in
an asymptotic sense (as the number of training cases goes to
infinity). With a small training set, it is possible that a subset
that is smaller than the "best" subset may provide better
generalization error.
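The subset-selection use of cross-validation can be sketched as an
exhaustive search over input subsets scored by k-fold error. The
four-input data set, the linear model, and the 5-fold choice are
assumptions for the example only.

    import itertools
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # Illustrative data: four candidate inputs, of which only the first
    # two actually influence the output.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))
    y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

    best_subset, best_mae = None, np.inf
    for r in range(1, X.shape[1] + 1):
        for subset in itertools.combinations(range(X.shape[1]), r):
            scores = cross_val_score(LinearRegression(), X[:, list(subset)], y,
                                     cv=5, scoring="neg_mean_absolute_error")
            mae = -scores.mean()
            if mae < best_mae:
                best_subset, best_mae = subset, mae

    print(f"selected inputs: {best_subset}, "
          f"CV mean absolute error: {best_mae:.3f}")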
Leave-one-out cross-validation often works well for estimating
generalization error for continuous error functions such as the mean
squared error, but it may perform poorly for discontinuous error
functions such as the number of misclassified cases. In the latter
case, k-fold cross-validation is preferred. But if k gets too small,
the error estimate is pessimistically biased because of the
difference in training-set size between the full-sample analysis and
the cross-validation analyses. (For model-selection purposes, this
bias can actually help; see the discussion below of Shao, 1993.) A
value of 10 for k is popular for estimating generalization error.
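As a closing sketch, 10-fold cross-validation can be used to estimate
a misclassification rate, the kind of discontinuous error criterion
mentioned above. The synthetic two-input classification problem and
the logistic-regression classifier are assumptions made for
illustration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Illustrative binary classification data.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

    # 10-fold estimate of the misclassification rate (1 - accuracy).
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    accuracy = cross_val_score(LogisticRegression(), X, y, cv=cv,
                               scoring="accuracy")
    print(f"estimated misclassification rate: {1.0 - accuracy.mean():.3f}")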