INADMISSIBILITY OF THE USUAL ESTIMATOR FOR THE MEAN OF A MULTIVARIATE NORMAL DISTRIBUTION

CHARLES STEIN
STANFORD UNIVERSITY
1. Introduction
If one observes the real random variables X1, ···, Xn independently normally distributed with unknown means ξ1, ···, ξn and variance 1, it is customary to estimate ξi by Xi. If the loss is the sum of squares of the errors, this estimator is admissible for n ≤ 2, but inadmissible for n ≥ 3. Since the usual estimator is best among those which transform correctly under translation, any admissible estimator for n ≥ 3 involves an arbitrary choice. While the results of this paper are not in a form suitable for immediate practical application, the possible improvement over the usual estimator seems to be large enough to be of practical importance if n is large.
Let X be a random n-vector whose expected value is the completely unknown vector ξ and whose components are independently normally distributed with variance 1. We consider the problem of estimating ξ with the loss function L given by

(1) L(ξ, d) = (ξ − d)² = Σi (ξi − di)²
where d is the vector of estimates. In section 2 we give a short proof of the inadmissibility of the usual estimator

(2) d = ξ̂0(X) = X

for n ≥ 3. For n = 2, the admissibility of ξ̂0 is proved in section 4. For n = 1 the admissibility of ξ̂0 is well known (see, for example, [1], [2], [3]) and also follows from the result for n = 2. Of course, all of the results concerning this problem apply with obvious modifications if the assumption that the components of X are independently distributed with variance 1 is replaced by the condition that the covariance matrix Σ of X is known and nonsingular and the loss function (1) is replaced by

(3) L(ξ, d) = (ξ − d)′ Σ⁻¹ (ξ − d).
We shall give immediately below a heuristic argument indicating that the usual estimator ξ̂0 may be poor if n is large. With some additional precision, this could be made to
yield a discussion of the infinite dimensional case or a proof that for sufficiently large n
the usual estimator is inadmissible. We choose an arbitrary point in the sample space
independent of the outcome of the experiment and call it the origin. Of course, in the
way we have expressed the problem this choice has already been made, but in a correct
coordinate-free presentation, it would appear as an arbitrary choice of one point in an
affine space. Now
(4) X² = (X − ξ)² + ξ² + 2√(ξ²) Z
THIRD BERKELEY SYMPOSIUM: STEIN
where

(5) Z = ξ·(X − ξ)/√(ξ²)

has a univariate normal distribution with mean 0 and variance 1, and for large n, we have (X − ξ)² = n + Op(√n), so that

(6) X² = n + ξ² + Op(√(n + ξ²))

uniformly in ξ. (For the stochastic order notation op, Op, see [4].) Consequently, when we observe X² we know that ξ² is nearly X² − n. The usual estimator ξ̂0 would have us estimate ξ to lie outside of the convex set {ξ; ξ² ≤ X² − cn} (with c slightly less than 1) although we are practically sure that ξ lies in that set. It certainly seems more reasonable to cut X down at least by a factor of [(X² − n)/X²]^{1/2} to bring the estimate within that sphere. Actually, because of the curvature of the sphere combined with the uncertainty of our knowledge of ξ, the best factor, to within the approximation considered here, turns out to be (X² − n)/X². For, consider the class of estimators
(7) ξ̂(X) = [1 − h(X²/n)] X

where h is a continuous real-valued function with lim sup |h| < ∞. We have (with ρ² = ξ²/n)

(8) [ξ̂(X) − ξ]² = {[1 − h(X²/n)] X − ξ}²
 = [1 − h(X²/n)]² (X − ξ)² + ξ² h²(X²/n) − 2 h(X²/n)[1 − h(X²/n)] ξ·(X − ξ)
 = n [1 − 2h(1 + ρ²) + (1 + ρ²) h²(1 + ρ²)] + Op(√n).

This (without the remainder) attains its minimum of nρ²/(1 + ρ²) for h(1 + ρ²) = 1/(1 + ρ²). In these calculations we have not used the normality.
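The calculation above is easy to check by simulation. The sketch below is my own illustration, not part of the paper (the dimension, the value of ρ², and the replication count are arbitrary choices): it compares the risk of the usual estimator with that of the estimator (7) with h(t) = 1/t, that is, multiplication by the factor (X² − n)/X², whose risk should be close to nρ²/(1 + ρ²).

```python
import numpy as np

# Monte Carlo check of the heuristic: shrinking X by the factor (X^2 - n)/X^2
# should give risk near n*rho^2/(1 + rho^2) for large n.
rng = np.random.default_rng(0)
n = 1000                        # dimension (large)
rho2 = 1.0                      # rho^2 = xi^2 / n
xi = np.full(n, np.sqrt(rho2))  # a mean vector with squared length n*rho2

reps = 2000
X = xi + rng.standard_normal((reps, n))
X2 = np.sum(X**2, axis=1)

risk_usual = np.mean(np.sum((X - xi)**2, axis=1))   # should be near n
shrunk = ((X2 - n) / X2)[:, None] * X               # h(X^2/n) = n/X^2 in (7)
risk_shrunk = np.mean(np.sum((shrunk - xi)**2, axis=1))

print(round(risk_usual), round(risk_shrunk))        # roughly 1000 and 500
```

Here nρ²/(1 + ρ²) = 500, half the risk n = 1000 of the usual estimator.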
In section 3 we consider some of the properties of spherically symmetric estimators, that is, estimators of the form (7), for finite n. We show that a spherically symmetric estimator is admissible provided it is admissible as compared with other spherically symmetric estimators. This is essentially a special case of a result given by Karlin [11] and Kudō [12].
In section 4 we use the information inequality in the manner of [1] and [2] in order to obtain lower bounds to the mean squared error of a spherical estimator of the mean. In particular, for n = 2 this proves the admissibility of the usual estimator. For n ≥ 3 we obtain the bound (n − 2)²/ξ² for the asymptotic value of the possible improvement as ξ² → ∞, which is proved to be attainable in section 2.
In accordance with the results of section 3, a good spherically symmetric estimator is admissible for any n. However, roughly speaking, as n → ∞ it becomes less and less admissible, as in Robbins [7]. A simple way to obtain an estimator which is better for most practical purposes is to represent the parameter space (which is also essentially the sample space) as an orthogonal direct sum of two or more subspaces, also of large dimension, and apply spherically symmetric estimators separately in each. If the ρ²'s (squared length of the population mean divided by the dimension) are appreciably different for the selected subspaces, this estimator will be better than the spherically symmetric one. It is
unlikely that this estimator is admissible unless Bayes solutions (in the strict sense) are
used in the component subspaces, but it is also unlikely that its departure from admis-
sibility is important in practice.
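The splitting device just described is easy to illustrate numerically. In the sketch below (my own illustration; the block sizes, the means, and the positive-part truncation in `shrink` are arbitrary choices of mine, not prescriptions of the paper), the two subspaces have very different values of ρ², and shrinking them separately beats shrinking the whole space at once.

```python
import numpy as np

def shrink(x):
    """Spherically symmetric shrinkage toward the origin: multiply by
    (x^2 - n)/x^2 as in the heuristic of section 1 (truncated at 0)."""
    n, x2 = x.size, np.sum(x**2)
    return max(0.0, (x2 - n) / x2) * x

def shrink_by_blocks(x, blocks):
    """Apply the shrinkage separately on an orthogonal direct sum of
    coordinate subspaces (a particular choice of the decomposition)."""
    return np.concatenate([shrink(x[b]) for b in blocks])

rng = np.random.default_rng(1)
n = 1000
# two blocks with very different rho^2: large means versus zero means
xi = np.concatenate([np.full(n // 2, 3.0), np.zeros(n // 2)])
blocks = [slice(0, n // 2), slice(n // 2, n)]

losses_whole, losses_split = [], []
for _ in range(500):
    x = xi + rng.standard_normal(n)
    losses_whole.append(np.sum((shrink(x) - xi)**2))
    losses_split.append(np.sum((shrink_by_blocks(x, blocks) - xi)**2))

print(np.mean(losses_whole) > np.mean(losses_split))  # splitting helps here
```

With ρ² = 9 on one block and ρ² = 0 on the other, the blockwise risk is roughly 500·9/10, far below the whole-space risk of roughly 1000·4.5/5.5.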
In section 5 we consider very briefly a number of problems for which answers are
needed before the methods of this paper can be applied with confidence.
2. Inadmissibility of the usual estimator
For n ≥ 3, let X be a normally distributed random n-vector with unknown mean ξ and covariance matrix I, the identity matrix. In addition to the usual estimator ξ̂0 given by

(9) ξ̂0(X) = X,

we shall consider the estimator ξ̂1 given by

(10) ξ̂1(X) = (1 − b/(a + X²)) X

with a, b > 0. We shall show that for sufficiently small b and large a, ξ̂1 is strictly better than ξ̂0; in fact,

(11) Eξ [ξ̂1(X) − ξ]² < n = Eξ [ξ̂0(X) − ξ]²
for all ξ. To prove (11) let X = Y + ξ so that Y is normally distributed with mean 0 and covariance matrix I, the identity. Then

(12) Eξ [(1 − b/(a + X²)) X − ξ]² = E [Y − b(Y + ξ)/(a + (Y + ξ)²)]²
 = n − 2b E [Y·(Y + ξ)/(a + (Y + ξ)²)] + b² E [(Y + ξ)²/(a + (Y + ξ)²)²]
 < n − 2b E [(Y·(Y + ξ) − b/2)/(a + (Y + ξ)²)].
From the identity

(13) 1/(1 + x) = 1 − x + x²/(1 + x)

we find that

(14) 1/(a + (Y + ξ)²) = [1/(a + Y² + ξ²)] [1 − 2ξ·Y/(a + Y² + ξ²) + 4(ξ·Y)²/((a + Y² + ξ²)(a + (Y + ξ)²))].

Since the conditional mean of Y given Y² is 0, we find from (14) that

(15) E [(Y·(Y + ξ) − b/2)/(a + (Y + ξ)²)] = E [(Y² − b/2)/(a + Y² + ξ²)] − 2 E {E[(ξ·Y)² | Y²]/(a + Y² + ξ²)²}
 + 4 E [(ξ·Y)² (Y·(Y + ξ) − b/2)/((a + Y² + ξ²)² [a + (Y + ξ)²])].
But

(16) E [(ξ·Y)² | Y²] = ξ²Y²/n ≤ (a + Y² + ξ²) Y²/n,

so that

(17) E [(Y² − b/2)/(a + Y² + ξ²)] − 2 E {E[(ξ·Y)² | Y²]/(a + Y² + ξ²)²}
 ≥ E [((1 − 2/n) Y² − b/2)/(a + Y² + ξ²)]
 ≥ ((n − 2)/n) E [Y²/(a + Y² + ξ²)] − (b/2)/(a + ξ²)
 ≥ (n − 2 − b/2)/(a + ξ²) − (n − 2)(n + 2)/(a + ξ²)².
It is intuitively clear that the last term on the right-hand side of (15) is o[1/(a + ξ²)] uniformly in ξ as a → ∞. To give a detailed proof we observe that for ξ² ≤ a,

(18) E [(ξ·Y)³/((a + Y² + ξ²)² [a + (Y + ξ)²])] ≥ −E [|ξ·Y|³/((a + Y² + ξ²)² [a + (Y + ξ)²])]
 ≥ −E|ξ·Y|³/[(a + ξ²)² a] ≥ −c/[(a + ξ²)√a].

For ξ² > a, splitting the expectation according to whether |ξ·Y| < ¼ξ² (in which case a + (Y + ξ)² ≥ ½(a + ξ²)) or |ξ·Y| ≥ ¼ξ² (an event whose contribution is negligible because of the rapid decrease of the tail of the normal distribution), we obtain

(19) E [(ξ·Y)³/((a + Y² + ξ²)² [a + (Y + ξ)²])] ≥ −c′/(a + ξ²)^{3/2} − c″/[(a + ξ²)√a].

Combining (18) and (19), and observing that the term in (ξ·Y)²Y² is nonnegative, we have

(20) E [(ξ·Y)² Y·(Y + ξ)/((a + Y² + ξ²)² [a + (Y + ξ)²])] ≥ o[1/(a + ξ²)]

uniformly in ξ as a → ∞. Also

(21) (b/2) E [(ξ·Y)²/((a + Y² + ξ²)² [a + (Y + ξ)²])] ≤ (b/2) ξ²/[(a + ξ²)² a] = o[1/(a + ξ²)]

uniformly in ξ as a → ∞.
Thus from (12), (15), (17), (20), and (21) we find that

(22) Eξ [(1 − b/(a + X²)) X − ξ]² ≤ n − 2b(n − 2 − b/2)/(a + ξ²) + o[1/(a + ξ²)]

uniformly in ξ as a → ∞. Consequently if we take 0 < b < 2(n − 2) and a sufficiently large, this will be less than n for all ξ. If we take b = n − 2, then as ξ² → ∞, the improvement over the risk of the usual estimator is asymptotic to (n − 2)²/ξ². In section 4 we shall see that this is asymptotically the best possible improvement over the usual estimator in the neighborhood of ξ² = ∞.
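The domination (11) is easy to observe by simulation. In the sketch below (my own illustration) n, a, and the number of replications are arbitrary choices, with b = n − 2 inside the permitted range 0 < b < 2(n − 2).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
b = n - 2            # within the range 0 < b < 2(n - 2)
a = 50.0             # "a sufficiently large"; an illustrative choice

def risks(xi, reps=200_000):
    """Monte Carlo risks of the usual estimator (9) and of (10) at mean xi."""
    X = xi + rng.standard_normal((reps, n))
    X2 = np.sum(X**2, axis=1)
    est1 = (1.0 - b / (a + X2))[:, None] * X
    loss0 = np.sum((X - xi)**2, axis=1)      # usual estimator: risk n
    loss1 = np.sum((est1 - xi)**2, axis=1)
    return loss0.mean(), loss1.mean()

for length in [0.0, 2.0, 10.0]:
    xi = np.zeros(n)
    xi[0] = length
    r0, r1 = risks(xi)
    print(length, round(r0, 3), round(r1, 3))  # r1 < r0 for every xi
```

Because both losses are computed from the same samples, the comparison is paired and the small improvement 2b(n − 2 − b/2)/(a + ξ²) is visible even at large ξ².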
3. Spherically symmetrical estimators
We shall say that an estimator ξ̂ is spherically symmetrical (about the origin) if it is of the form

(23) ξ̂(x) = [1 − h(x²)] x

where h is a real-valued function. This is equivalent to requiring that for every orthogonal transformation g, g∘ξ̂∘g⁻¹ = ξ̂, that is, for all x

(24) g[ξ̂(g⁻¹x)] = ξ̂(x).

First, if ξ̂ is of the form (23), then

(25) g[ξ̂(g⁻¹x)] = g{[1 − h(x²)] g⁻¹x} = ξ̂(x).

Suppose conversely that ξ̂ satisfies (24) for all orthogonal g. In particular, for those g which are reflections in a subspace containing x, g[ξ̂(x)] = ξ̂(x). Consequently ξ̂(x) lies along x, that is,

(26) ξ̂(x) = [1 − h′(x)] x

for some real-valued function h′. Since a vector x can be taken into any other vector having the same squared length x² by an orthogonal transformation, this yields (23).
We shall show that if a spherically symmetric estimator ξ̂2 is admissible as compared with all other spherically symmetric estimators, then it is admissible (in the class of all
estimators). The proof is based on the compactness of the orthogonal group G and the continuity of the problem. It is similar to a proof for finite groups (see p. 228, [5], and p.
198, [6]). We shall only sketch the proof since a general result for compact groups will
appear elsewhere. Because of the convexity of the loss function (1) in the estimate d we
can confine our attention to nonrandomized procedures (see p. 186, [8]).
Suppose the estimator ξ̂ is strictly better than the spherically symmetric estimator ξ̂2, that is,

(27) Rξ̂(ξ) = Eξ [ξ̂(X) − ξ]² ≤ Eξ [ξ̂2(X) − ξ]² = Rξ̂2(ξ)

for all ξ with strict inequality for some ξ. Because of the continuity of Rξ̂ and Rξ̂2, strict inequality will hold for ξ in some nonempty open set S. Since ξ̂2 is spherically symmetric, (27) will remain true if ξ̂ is replaced by g∘ξ̂∘g⁻¹ with g orthogonal; in fact,

(28) Rg∘ξ̂∘g⁻¹(ξ) = Eξ {g[ξ̂(g⁻¹X)] − ξ}² = Eξ [ξ̂(g⁻¹X) − g⁻¹ξ]² = Rξ̂(g⁻¹ξ).

Thus, for fixed ξ ∈ S, the set of g for which Rg∘ξ̂∘g⁻¹(ξ) < Rξ̂2(ξ) will be a nonempty open set. Let μ be the invariant probability measure on G which assigns strictly positive measure to any nonempty open set (for the existence of such a measure see chapter 2, [10]). Then

(29) ξ̂′ = ∫ g∘ξ̂∘g⁻¹ dμ(g)

is spherically symmetric, and because of the convexity of the loss function (1) in d,

(30) Rξ̂′(ξ) ≤ ∫ Rg∘ξ̂∘g⁻¹(ξ) dμ(g) ≤ Rξ̂2(ξ)

with strict inequality for ξ ∈ S. This shows that ξ̂2 is not admissible in the class of all spherically symmetric estimators and completes the proof.
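The invariance fact behind (28), that the risk of a spherically symmetric estimator depends on ξ only through ξ², can be confirmed numerically. In this sketch (my own; the particular h is an arbitrary choice of mine), two mean vectors of equal squared length give the same simulated risk up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4

def est(x):
    # a spherically symmetric estimator of the form (23); h(t) = 1/(1 + t)
    # is an arbitrary choice
    x2 = np.sum(x**2, axis=-1, keepdims=True)
    return (1.0 - 1.0 / (1.0 + x2)) * x

def risk(xi, reps=400_000):
    X = xi + rng.standard_normal((reps, n))
    return np.mean(np.sum((est(X) - xi)**2, axis=1))

xi_a = np.array([2.0, 0.0, 0.0, 0.0])
xi_b = np.array([1.0, 1.0, 1.0, 1.0])   # an orthogonal image of xi_a: same xi^2
r_a, r_b = risk(xi_a), risk(xi_b)
print(abs(r_a - r_b))   # small: the risks agree up to simulation error
```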
4. Application of the information inequality
In this section we apply the information inequality, as in [1] and [2], to obtain an
upper bound for the possible improvement of a spherically symmetric estimator over the
usual one. In particular, with the aid of the result of section 3, we show that for n = 2
the usual estimator is admissible.
Let ξ̂ be any estimator of ξ with everywhere finite risk R and let b be the bias of ξ̂, that is,

(31) b(ξ) = Eξ ξ̂(X) − ξ.

Then by the information inequality

(32) R(ξ) ≥ b²(ξ) + Σi [Σj γij (δij + bij(ξ))]²

for any γ with Σj γij² = 1 for all i, where δij = 1 if i = j, 0 otherwise, and

(33) bij(ξ) = ∂bi(ξ)/∂ξj

with bi(ξ) the ith coordinate of b(ξ). Choosing

(34) γij = [δij + bij(ξ)] / {Σk [δik + bik(ξ)]²}^{1/2}

so as to maximize the right-hand side of (32), we find

(35) R(ξ) ≥ b²(ξ) + Σi,j [δij + bij(ξ)]²
 = b²(ξ) + n + 2 Σi bii(ξ) + Σi,j bij²(ξ).
In the spherically symmetrical case where ξ̂ has the form (23), b has the form

(36) b(ξ) = −φ(ξ²) ξ

where φ is a differentiable real-valued function. In this case, dropping the last term, (35) becomes

(37) R(ξ̂) ≥ n + ξ²φ²(ξ²) − 2nφ(ξ²) − 4ξ²φ′(ξ²).
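The passage from (35) and (36) to (37) is a routine differentiation, and it can be verified symbolically. In this sketch (my own check; the concrete φ is an arbitrary differentiable choice, not one singled out by the paper) sympy confirms the identity b² + 2 Σi bii = ξ²φ²(ξ²) − 2nφ(ξ²) − 4ξ²φ′(ξ²).

```python
import sympy as sp

# Verify b^2 + 2*sum_i b_ii = xi^2 phi^2 - 2 n phi - 4 xi^2 phi'
# for the bias (36) b(xi) = -phi(xi^2) xi, with the arbitrary
# differentiable choice phi(u) = c/(a + u) and n = 3.
n = 3
a, c = sp.symbols('a c', positive=True)
xi = sp.symbols('xi1:4', real=True)
t = sum(x**2 for x in xi)                  # t stands for xi^2

phi = lambda u: c / (a + u)
dphi = lambda u: -c / (a + u)**2           # phi'(u)

b = [-phi(t) * x for x in xi]              # bias vector (36)
lhs = sum(bi**2 for bi in b) + 2 * sum(sp.diff(b[i], xi[i]) for i in range(n))
rhs = t * phi(t)**2 - 2 * n * phi(t) - 4 * t * dphi(t)
print(sp.simplify(lhs - rhs))              # 0
```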
We first use (37) to prove that for n = 2 the usual estimator given by (2) is admissible. By the results of section 3, if ξ̂0 is not admissible there exists a spherically symmetric estimator ξ̂ which is strictly better, and therefore there exists a function φ not vanishing identically such that

(38) 2 ≥ R(ξ̂) ≥ 2 + ξ²φ²(ξ²) − 4φ(ξ²) − 4ξ²φ′(ξ²)

for all ξ² ≥ 0. Letting t = ξ² and ψ(t) = tφ(t) we find

(39) 0 ≥ ψ²(t) − 4tψ′(t)

for t > 0. This shows that ψ is a nondecreasing function. We shall show that (39) implies that ψ is identically 0. Suppose first that ψ(t0) < 0 for some t0 > 0. Then integrating the inequality

(40) ψ′(t)/ψ²(t) ≥ 1/(4t)

from t < t0 to t0 we obtain

(41) 1/ψ(t) − 1/ψ(t0) ≥ ¼ log (t0/t).

The left-hand side is bounded as t → 0 whereas the right-hand side approaches +∞ so that this is a contradiction. If on the other hand ψ(t0) > 0 for some t0 > 0, then

(42) 1/ψ(t0) − 1/ψ(t) ≥ ¼ log (t/t0)

for all t > t0. As t → ∞, the left-hand side is bounded and the right-hand side approaches +∞ so that we again have a contradiction.
Next we shall apply (35) to show that for n ≥ 3 there cannot exist c > (n − 2)² and ξ0 such that for all ξ² ≥ ξ0²,

(43) R(ξ̂) ≤ n − c/ξ².

[We have seen in section 2 that there is an estimator which yields an improvement over the usual estimator asymptotic to (n − 2)²/ξ² as ξ² → ∞.] It will suffice to show that the differential inequality

(44) n − c/ξ² ≥ n + ξ²φ²(ξ²) − 2nφ(ξ²) − 4ξ²φ′(ξ²),

obtained by combining (37) and (43), has no solution valid for all ξ² ≥ ξ0². To see that (44) has no solution, let

(45) φ(ξ²) = (n − 2)/ξ² + f(ξ²).

Then (44) becomes

(46) −[c − (n − 2)²]/ξ² ≥ ξ²f²(ξ²) − 4f(ξ²) − 4ξ²f′(ξ²).

Let t = ξ², ψ(t) = tf(t). Then

(47) −[c − (n − 2)²] ≥ ψ²(t) − 4tψ′(t),

that is,

(48) ψ′(t)/[a² + ψ²(t)] ≥ 1/(4t)

where a² = c − (n − 2)². From the inequality (39) (for all t ≥ t0), which is weaker than (47), we conclude that ψ(t) ≤ 0. Consequently for t0 < t,

(49) tan⁻¹ [ψ(t)/a] − tan⁻¹ [ψ(t0)/a] ≥ (a/4) log (t/t0).

The left-hand side is bounded (since ψ does not change sign) and the right-hand side approaches +∞ as t → ∞, which is a contradiction.
5. Miscellaneous remarks
In this section I shall indicate a few of the many problems which must be solved be-
fore the methods suggested in this paper can be applied with confidence in all situations
where they seem appropriate.
(i) It seems that similar improvements must be possible if the variance is unknown, but there is available a reasonable number of degrees of freedom for estimating the variance. Presumably the correction to the sample mean will be smaller with a given estimated variance than if that value were known to be the variance. If there are no additional degrees of freedom for estimating the variance it is clear that the usual estimator is admissible. For, if there is a better estimator ξ̂, we can (because of the convexity of the loss function) construct a continuous estimator ξ̂′ which is also better than ξ̂0 by taking, for example,

(50) ξ̂′(x) = (2π)^{−n/2} ∫ ξ̂(x + y) e^{−y²/2} dy.

Then there is an ε > 0 and a disc S of radius at least ε such that

(51) [ξ̂′(x) − x]² > ε²

for all x ∈ S. If the variance of each component is much less than ε²/n then the mean squared error of ξ̂0 will be small compared with ε² whereas that of ξ̂′ will not.
(ii) The (positive definite) matrix of the quadratic loss function may be different from
the inverse of the covariance matrix. It is intuitively clear that the usual estimator must
be inadmissible provided there are at least three characteristic roots which do not differ
excessively. However, because of the lack of spherical symmetry it seems difficult to
select a good estimator.
(iii) The covariance matrix may be wholly or partially unknown. Suppose for example
that the covariance matrix is completely unknown but there are enough degrees of free-
dom to estimate it. For simplicity suppose the matrix of the quadratic loss function is the
inverse of the covariance matrix. Again it seems likely that the usual estimator is in-
admissible. The problem of finding an admissible estimator better than the usual one is
complicated by the fact that there is at present no reason to believe that the usual esti-
mator of the covariance matrix is a good one.
(iv) At least two essentially different sequential problems suggest themselves. First
we may consider observing, one at a time, random n-vectors X(1), X(2), ··· independently normally distributed with common unknown mean vector ξ and the identity matrix as covariance. If we want to attain a certain upper bound to the mean squared error with
as small an expected number of observations as possible, it seems likely that we must
resort to a sequential scheme.
Also, consider the situation where we observe, one at a time, real random variables
X1, ···, Xn (with n fixed but large) independently normally distributed with unknown means ξ1, ···, ξn, and variance 1. Suppose we want to estimate the ξi with the sum of squared errors as loss, but we are forced to estimate ξi immediately after observing Xi. It is not clear whether it is admissible to estimate ξi to be Xi. However, if it is admissible,
case it should be possible to devise a scheme which will improve the estimate consider-
ably if the means are small without doing much harm if they are large.
(v) It is not clear whether, in testing problems, the usual test may be inadmissible for the reasons given in this paper. This uncertainty cannot arise in the problem of testing
for the variance of a normal distribution with unknown means as nuisance parameters
(see p. 503, [9]). However, in the case of distinguishing between two possible values of
the ratio of a mean to the standard deviation, with unknown means as nuisance parame-
ters, the situation is unclear. Also the inadvisability (but not inadmissibility) of using
spherical symmetry in a space of extremely high dimension is clear, at least if there is
any natural way of breaking up the space.
(vi) It is of some interest to examine the situation in which the result of a previous
experiment of the same type is taken as the origin. In this case (assuming the usual method, not that of this paper, has been applied to the previous experiment) ξ² is distributed as σ²χn², where σ² is the variance of each component in the first experiment, so that for large n, ξ² ≈ σ²n. Consequently the expected loss for the final estimate is nearly

(52) nξ²/(ξ² + n) = nσ²/(σ² + 1),
which is the loss that would be attained if the two experiments were combined with
weights inversely proportional to the variances. This method can be applied even if σ is unknown.
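Remark (vi) can be checked by simulation. The sketch below is my own concretization: the values of σ and n are arbitrary, and the shrinkage toward the previous result uses the factor (X² − n)/X² about the new origin. The per-component loss comes out near σ²/(σ² + 1).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
sigma = 1.5                        # per-component std. dev. in the first experiment
mu = rng.standard_normal(n)        # the true common mean vector

first = mu + sigma * rng.standard_normal(n)   # previous experiment: the new origin
X = mu + rng.standard_normal(n)               # current experiment, variance 1

centered = X - first               # relative to the origin, xi^2 is about sigma^2 n
x2 = np.sum(centered**2)
est = first + (x2 - n) / x2 * centered        # shrink toward the previous result

print(np.sum((est - mu)**2) / n)   # near sigma^2/(sigma^2 + 1)
```

For σ² = 2.25 the per-component loss is near 2.25/3.25 ≈ 0.69, the loss of the inverse-variance-weighted combination of the two experiments.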
(vii) Of course if we are interested in estimating only ξ1 the presence of other unknown means ξ2, ···, ξn cannot make our task any easier. Thus our gain in over-all mean squared error must be accompanied by a deterioration in the mean squared error for certain components. Let us investigate this situation to the crudest approximation. We suppose without essential loss of generality that ξ1 = √(ξ²), ξ2 = ··· = ξn = 0. Also we suppose n large and put ρ² = ξ²/n. Then the best spherically symmetric estimator gives nearly the same result as [1 − 1/(1 + ρ²)]X. Of course X1 = √(ξ²) + Y1 = ρ√n + Y1 where Y1 is normally distributed with mean 0 and variance 1. The bias introduced in the estimate of the first coordinate is thus approximately ρ√n/(1 + ρ²), which makes a contribution of nρ²/(1 + ρ²)² to the mean squared error of the estimate of this component. This attains its maximum of n/4 for ρ = 1. We notice that at this value of ρ, the squared errors of all other components combined add up to approximately the same amount n/4. For certain purposes this extreme concentration of the error in one component may be intolerable. This is one more reason why a space of extremely large dimension should be broken up before the methods of this paper are applied.
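The error concentration described in (vii) shows up clearly in simulation. The sketch below is mine; it uses the limiting factor ρ²/(1 + ρ²) directly rather than an estimated one, and reproduces the two contributions of about n/4 at ρ = 1.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
rho = 1.0
xi = np.zeros(n)
xi[0] = rho * np.sqrt(n)          # xi_1 = sqrt(xi^2), all other means 0

reps = 4000
X = xi + rng.standard_normal((reps, n))
factor = rho**2 / (1 + rho**2)    # the limiting factor 1 - 1/(1 + rho^2)
est = factor * X

mse_first = np.mean((est[:, 0] - xi[0])**2)                   # bias^2, about n/4
mse_rest = np.mean(np.sum((est[:, 1:] - xi[1:])**2, axis=1))  # also about n/4
print(round(mse_first, 1), round(mse_rest, 1))   # both near n/4 = 100
```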
(viii) Better approximations than we have given here will be needed before this method can be applied to obtain simultaneous confidence sets for the means. Nevertheless it
seems clear that we shall obtain confidence sets which are appreciably smaller geometri-
cally than the usual discs centered at the sample mean vector.
(ix) For certain loss functions, for example
(53) L(ξ, d) = sup_i |ξi − di|,
little or no improvement over the usual estimator may be possible.
6. Acknowledgments
About two years ago I worked with L. J. Savage and E. Seiden on the problem of ad-
missible strategies for the independent combination of two or more games. While no
definite results were obtained, the discussion led to some clarification of the problem. I
have also discussed the subject with J. Tukey and H. Robbins, who helped break down
my conviction that the usual procedure must be admissible. Some remarks made by T.
Harris and L. Moses when the paper was presented were also useful.
REFERENCES
[1] J. L. Hodges, Jr., and E. L. Lehmann, "Some applications of the Cramér-Rao inequality," Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley and Los Angeles, University of California Press, 1951, pp. 13-22.
[2] M. A. Girshick and L. J. Savage, "Bayes and minimax estimates for quadratic loss functions," Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley and Los Angeles, University of California Press, 1951, pp. 53-73.
[3] C. R. Blyth, "On minimax statistical decision procedures and their admissibility," Annals of Math. Stat., Vol. 22 (1951), pp. 22-42.
[4] H. B. Mann and A. Wald, "On stochastic limit and order relationships," Annals of Math. Stat., Vol. 14 (1943), pp. 217-226.
[5] D. Blackwell and M. A. Girshick, Theory of Games and Statistical Decisions, New York, John Wiley and Sons, 1954, pp. 226-228.
[6] L. J. Savage, The Foundations of Statistics, New York, John Wiley and Sons, 1954.
[7] H. Robbins, "Asymptotically subminimax solutions of compound statistical decision problems," Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley and Los Angeles, University of California Press, 1951, pp. 131-148.
[8] J. L. Hodges, Jr., and E. L. Lehmann, "Some problems in minimax point estimation," Annals of Math. Stat., Vol. 21 (1950), pp. 182-197.
[9] E. L. Lehmann and C. Stein, "Most powerful tests of composite hypotheses. I. Normal distributions," Annals of Math. Stat., Vol. 19 (1948), pp. 495-516.
[10] A. Weil, L'Intégration dans les Groupes Topologiques et ses Applications, Paris, Hermann, 1938.
[11] S. Karlin, "The theory of infinite games," Annals of Math., Vol. 58 (1953), pp. 371-401.
[12] H. Kudō, "On minimax invariant estimates of the translation parameter," Natural Science Report of the Ochanomizu University, Vol. 6 (1955), pp. 31-73.