6. ESTIMATORS, BIAS AND VARIANCE
Foundational concepts such as parameter estimation, bias and variance are useful
to formally characterize notions of generalization, underfitting and overfitting.
Point Estimation
Point estimation is the attempt to provide the single “best” prediction of
some quantity of interest.
In general the quantity of interest can be a single parameter or a vector
of parameters in some parametric model, such as the weights in our
linear regression.
In order to distinguish estimates of parameters from their true value, our
convention will be to denote a point estimate of a parameter θ by θˆ.
Let {x(1), . . . , x(m)} be a set of m independent and identically distributed
(i.i.d.) data points. A point estimator or statistic is any function of the data:

θ̂ₘ = g(x(1), . . . , x(m)).

The definition does not require that g return a value that is close to
the true θ or even that the range of g is the same as the set of allowable
values of θ.
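To make the definition concrete, here is a minimal sketch in Python (NumPy assumed; the Gaussian generator and the value θ = 2.0 are illustrative assumptions): the sample mean is one possible choice of g, mapping m i.i.d. samples to a point estimate θ̂ₘ.

    import numpy as np

    rng = np.random.default_rng(0)

    theta = 2.0        # true parameter of the generating process (assumed)
    m = 100

    # m i.i.d. draws x(1), ..., x(m) from a Gaussian with mean theta.
    x = rng.normal(loc=theta, scale=1.0, size=m)

    # A point estimator is any function g of the data; here, the sample mean.
    def g(samples):
        return samples.mean()

    theta_hat = g(x)   # the point estimate computed from this dataset
    print(theta_hat)

Any other function of the samples, even a constant, would also qualify as a point estimator under this definition; the definition itself places no quality requirement on g.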
Function Estimation
Here we are trying to predict a variable y given an input vector x.
We assume that there is a function f(x) that describes the approximate
relationship between y and x. For example, we may assume that y = f(x) + ε,
where ε stands for the part of y that is not predictable from x.
In function estimation we are interested in approximating f
with a model or estimate f̂.
Function estimation is really just the same as estimating a parameter θ; the
function estimator f̂ is simply a point estimator in function space.
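As an illustration (a sketch with synthetic data; the weights and noise scale are assumed for the example), fitting a linear model by least squares can be read both ways: as point estimation of the weight vector w, and as selecting the function estimate f̂(x) = xᵀŵ from the space of linear functions.

    import numpy as np

    rng = np.random.default_rng(1)

    # Synthetic data following y = f(x) + ε, with f linear and ε noise.
    m = 200
    X = rng.normal(size=(m, 3))
    w_true = np.array([0.5, -1.0, 2.0])              # assumed true weights
    y = X @ w_true + rng.normal(scale=0.1, size=m)

    # Estimating f reduces to point estimation of the parameter vector w.
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

    def f_hat(x):
        # The function estimate induced by the point estimate w_hat.
        return x @ w_hat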
Bias
The bias of an estimator is defined as

bias(θ̂ₘ) = E(θ̂ₘ) − θ,

where the expectation is over the data (seen as samples of a random variable)
and θ is the true underlying value of the parameter.
An estimator θ̂ₘ is said to be unbiased if bias(θ̂ₘ) = 0, which implies that E(θ̂ₘ) = θ.
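A quick Monte Carlo check of the definition (a sketch; the Gaussian generator and trial counts are arbitrary choices): averaging θ̂ₘ over many independently drawn training sets approximates E(θ̂ₘ), and its gap from θ approximates the bias. The sample mean comes out approximately unbiased, while the 1/m sample variance shows its well-known bias of −σ²/m.

    import numpy as np

    rng = np.random.default_rng(2)
    true_mean, true_var = 0.0, 1.0
    m, n_trials = 20, 100_000

    # n_trials independent training sets, each of size m.
    X = rng.normal(true_mean, np.sqrt(true_var), size=(n_trials, m))

    mean_hat = X.mean(axis=1)     # sample-mean estimator, one per training set
    var_hat = X.var(axis=1)       # 1/m variance estimator (ddof=0)

    print("bias of sample mean: ", mean_hat.mean() - true_mean)  # ≈ 0
    print("bias of 1/m variance:", var_hat.mean() - true_var)    # ≈ -1/m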
Variance and Standard Error
The variance of an estimator is simply the variance Var(θ̂) of the estimate,
where the random variable is the training set. Alternatively, the square root
of the variance is called the standard error, denoted SE(θ̂).
The variance or the standard error of an estimator provides a measure of how
we would expect the estimate we compute from data to vary as we
independently resample the dataset from the underlying data generating
process.
When we compute any statistic using a finite number of samples, our estimate
of the true underlying parameter is uncertain, in the sense that we could have
obtained other samples from the same distribution and their statistics would
have been different.
The expected degree of variation in any estimator is a source of error that we
want to quantify.
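The following sketch quantifies this for the sample mean (NumPy assumed; the noise level σ = 2.0 and dataset sizes are illustrative): resampling many datasets from the same generating process and measuring the spread of the resulting estimates should recover the analytic standard error σ/√m.

    import numpy as np

    rng = np.random.default_rng(3)
    sigma, m, n_resamples = 2.0, 50, 50_000

    # Independently resample datasets from the same generating process.
    X = rng.normal(0.0, sigma, size=(n_resamples, m))
    estimates = X.mean(axis=1)

    print("empirical SE of the mean:  ", estimates.std())  # spread across datasets
    print("analytic SE, sigma/sqrt(m):", sigma / np.sqrt(m))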
Trading off Bias and Variance to Minimize Mean Squared Error
Bias and variance measure two different sources of error in an estimator.
Bias measures the expected deviation from the true value of the function or
parameter.
Variance, on the other hand, provides a measure of the deviation from the
expected estimator value that any particular sampling of the data is likely
to cause.
The most common way to negotiate this trade-off is to use cross-validation.
Empirically, cross-validation is highly successful on many real-world tasks.
Alternatively, we can compare the mean squared error (MSE) of the estimates:

MSE = E[(θ̂ₘ − θ)²] = Bias(θ̂ₘ)² + Var(θ̂ₘ).

Evaluating the MSE thus incorporates both the bias and the variance of an estimator.
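A sketch that checks this decomposition numerically (assumed setup: Gaussian data, m = 10), comparing the 1/m and 1/(m − 1) variance estimators: the biased 1/m estimator trades a small bias for lower variance and here achieves the lower overall MSE.

    import numpy as np

    rng = np.random.default_rng(4)
    true_var, m, n_trials = 1.0, 10, 200_000

    X = rng.normal(0.0, 1.0, size=(n_trials, m))

    for ddof, name in [(0, "1/m (biased)"), (1, "1/(m-1) (unbiased)")]:
        est = X.var(axis=1, ddof=ddof)
        bias = est.mean() - true_var
        var = est.var()
        mse = ((est - true_var) ** 2).mean()
        # Up to Monte Carlo noise, MSE should equal bias**2 + var.
        print(f"{name}: bias^2 + var = {bias**2 + var:.4f},  MSE = {mse:.4f}")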
Consistency
So far we have discussed the properties of various estimators for a training
set of fixed size.
We are also concerned with the behavior of an estimator as the amount of
training data grows. In particular, we would like that, as the number of
data points m in our dataset increases, our point estimates converge to the
true value of the corresponding parameters. Formally, we would like that

plim_{m→∞} θ̂ₘ = θ,

where plim indicates convergence in probability. This condition is known as
consistency.
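As an illustration of consistency (a sketch, with the sample mean standing in for θ̂ₘ and θ = 3.0 assumed), the estimates concentrate around the true value as m increases:

    import numpy as np

    rng = np.random.default_rng(5)
    theta = 3.0

    for m in [10, 100, 1_000, 10_000, 100_000]:
        x = rng.normal(theta, 1.0, size=m)
        # The sample mean is consistent: |theta_hat - theta| shrinks
        # in probability as the dataset grows.
        print(f"m={m:>6d}  theta_hat={x.mean():.4f}")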