A General Definition of Residuals
Royal Statistical Society and Wiley are collaborating with JSTOR to digitize, preserve and extend access to Journal of the Royal Statistical Society, Series B (Methodological).
SUMMARY
Residuals are usually defined in connection with linear models. Here a more
general definition is given and some asymptotic properties found. Some
illustrative examples are discussed, including a regression problem
involving exponentially distributed errors and some problems concerning
Poisson and binomially distributed observations.
1. INTRODUCTION
RESIDUALS are now widely used to assess the adequacy of linear models; see Anscombe
(1961) for a systematic discussion of significance tests based on residuals, and for
references to earlier work. A second and closely related application of residuals is
in time-series analysis, for example in examining the fit of an autoregressive model.
In the context of normal-theory linear models, the n x 1 vector of random variables Y is assumed to have the form
Y = Xβ + ε, (1)
where X is a known n x q matrix, β a q x 1 vector of unknown parameters and ε an n x 1 vector of independent normal errors of zero mean and constant variance. Least-squares fitting then defines the n x 1 vector of crude residuals R* by
Y = Xβ̂ + R*. (2)
Provided that the number of parameters is small compared with n, the properties of R* are nearly those of ε, i.e. R* should have approximately the properties of a random sample from a normal distribution. In fact, R* being linear in Y, the random variable R* has, under (1), a singular normal distribution and hence the properties of significance tests can be studied in some detail (Anscombe, 1961).
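The singular normal distribution of R* can be made explicit: writing H = X(X'X)^{-1}X' for the projection ("hat") matrix, R* = (I - H)ε under (1), so cov(R*) = σ²(I - H), of rank n - q. A minimal numerical sketch (with an invented design matrix, not taken from the paper) confirming the two defining properties:

```python
import numpy as np

n = 30
# Invented two-column design: intercept plus one regressor.
X = np.column_stack([np.ones(n), np.linspace(-1.0, 1.0, n)])

# Hat matrix H = X (X'X)^{-1} X'; under model (1), R* = (I - H) eps,
# so cov(R*) = sigma^2 (I - H), a singular matrix of rank n - 2.
H = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - H

print(round(np.trace(M), 6))               # 28.0 = n - 2, the rank of cov(R*)
print(float(np.abs(M @ X).max()) < 1e-10)  # True: R* annihilates the columns of X
```

The zero trace deficit of exactly q = 2 is what makes R* only approximately, never exactly, a random sample.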
The main types of departure from the model (1) likely to be of importance are:
(i) the presence of outliers;
(ii) the relevance of a further factor, omitted from (1), detected by plotting the
residuals against the levels of that factor;
(iii) non-linear regression on a factor already included in (1), detected by plotting the
residuals against the levels of that factor and obtaining a curved relationship;
(iv) correlation between different ε_i's, for example between ε_i's adjacent in time,
detected from scatter diagrams of suitable pairs of R*'s, or possibly from a periodo-
gram analysis of residuals;
(v) non-constancy of variance, detected by plotting residuals or squared residuals
against factors thought to affect the variance, or against fitted values;
A more formal approach to detecting such departures is to fit
either (a) a more general model containing one or more additional parameters and
reducing to (1) for particular values of the new parameters;
or (b) a different family of models, adequacy of fit being assessed, say, by the
maximum log likelihood achieved.
One example of (a) is the family of models considered by Box and Cox (1964),
in which model (1) is considered as applying to an unknown power of the original
observations. The advantages of the more formal techniques are that they have
sensitive significance tests associated with them and that they are directly constructive
in the sense that, if the initial model does not fit, a specific better-fitting model is
obtained immediately from the analysis. On the other hand, analysis of residuals,
especially by graphical techniques, does not require committal in advance to a
particular family of alternative models. It will indicate the nature of a departure
from the initial model, but not explicitly how to extend or replace the model. With
very extensive data, significance testing is relatively unimportant and the types of
departure that can be detected are more numerous than can be captured in advance
in a few simple parametric models. It is in such applications that the analysis of
residuals is likely to be most fruitful.
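As an illustration of how the crude residuals of (2) expose departure (v), here is a minimal sketch with invented data (none of it from the paper): the error standard deviation grows with x, and the size of the residuals then tracks the fitted values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: straight line y = 1 + 0.5 x, error scale growing with x
# (departure (v), non-constant variance).
n = 200
x = np.linspace(0.0, 10.0, n)
y = 1.0 + 0.5 * x + rng.normal(scale=0.2 + 0.1 * x, size=n)

# Least-squares fit; the crude residuals R* of equation (2).
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
resid = y - fitted

print(float(np.abs(X.T @ resid).max()))   # ~0: R* is orthogonal to the columns of X

# Departure (v) shows up as |R*| increasing with the fitted value.
corr = np.corrcoef(np.abs(resid), fitted)[0, 1]
print(corr)                               # positive for these data
```

A plot of resid against fitted would show the familiar fan shape; the correlation of |resid| with fitted is merely a crude numerical surrogate for that picture.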
The log likelihood is
L(β) = Σ_j log p_j(Y_j, β),
where p_j(Y_j, β) is the p.d.f. of Y_j. For a regular problem the maximum likelihood equation L'(β̂) = 0 is to first order
L'(β) + (β̂ - β) L''(β) = 0. (10)
Write
U^(j) = d log p_j/dβ,   V^(j) = d^2 log p_j/dβ^2,   I = Σ_j E(-V^(j)),
where I is the total information in the sample.
We thus have the standard first-order expression
β̂ - β ≈ U.(β)/I, (12)
where U.(β) = Σ_j U^(j), the dot indicating a sum over the sample.
To obtain a more refined answer, we replace (10) by the second-order equation
L'(β) + (β̂ - β) L''(β) + ½(β̂ - β)^2 L'''(β) = 0, (14)
where
J = Σ_j E(U^(j) V^(j)).
Also if
W^(j) = d^3 log p_j/dβ^3,   K = Σ_j E(W^(j)),
then
E{L'''(β)} = K.
Note that I, J, K refer to a total over the sample and are of order n.
Finally, a calculation similar to (15) shows that the final term in (14) has expectation ½K/I, whence (14) gives (Bartlett, 1953), for the terms of order 1,
-I E(β̂ - β) + (J + ½K)/I = 0,
i.e.
E(β̂ - β) = (J + ½K)/I^2.
With several unknown parameters and the summation convention applying to the suffixes r, s, t, u, the same argument replaces the first-order equation (12) by
∂L/∂β_r + (β̂_s - β_s) ∂^2L/∂β_r∂β_s + ½ (β̂_t - β_t)(β̂_u - β_u) ∂^3L/∂β_r∂β_t∂β_u = 0.
With R_i = h_i(Y_i, β̂), write
H_r^(i) = ∂h_i(Y_i, β)/∂β_r,   H_rs^(i) = ∂^2 h_i(Y_i, β)/∂β_r∂β_s. (22)
Thus
R_i = ε_i + (β̂_r - β_r) H_r^(i) + ½ (β̂_r - β_r)(β̂_s - β_s) H_rs^(i),
where on the right-hand side the summation convention does not apply to the superscript i.
Thus, to order n^{-1}, E(R_i), E(R_i^2) and E(R_i R_j) (i ≠ j) can each be evaluated as the corresponding moment of the ε's plus a correction of order n^{-1} involving I^{rs}, the biases b_r, and expectations of products of ε_i, U^(i), H_r^(i) and H_rs^(i); the detailed expressions are (25)-(27).
We can summarize (25), (26) and (27) as follows: to the order considered,
E(R_i) = E(ε_i) + a_i,   E(R_i R_j) = E(ε_i ε_j) + c_ij, (28)
where the a_i and c_ij are constants of order n^{-1}.
Secondly, for particular types of test statistic, the results can be used to obtain an approximation to its distribution. Thus if we consider a statistic of the form
T' = Σ_i R_i z_i,
its mean follows directly from (28); in principle, it is possible to evaluate higher moments of the R_i similarly and hence to reach an approximation to var(T').
Thirdly, we may use (28) to define a modified residual R'_i having more nearly the
properties of ε_i. How best to do this depends somewhat on the particular case, but
one fairly general procedure is to write
R'_i = (1 + k_i) R_i + l_i, (30)
where k_i, l_i are small constants. If we require that
E(R'_i) = E(ε_i),   E(R'_i^2) = E(ε_i^2),
then k_i and l_i are determined to the order considered; of course, it is easy to formulate even more ambitious aims. It is plausible, but not certain, that the modification (30), designed to produce residuals with approximately the same marginal mean and variance, is an advantage also from the point of view of plots to examine distributional form.
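The two moment conditions determine k_i and l_i in closed form to first order. A sketch of that algebra (the solver is mine, not the paper's, and the numerical a_i, c_ii are invented for illustration, with the unit-exponential targets E(ε) = 1, E(ε²) = 2):

```python
def modified_coeffs(m1, m2, a_i, c_ii):
    """First-order k_i, l_i in R'_i = (1 + k_i) R_i + l_i so that R'_i
    matches the target mean m1 = E(eps) and second moment m2 = E(eps^2),
    given E(R_i) = m1 + a_i and E(R_i^2) = m2 + c_ii with a_i, c_ii small.
    Solving a_i + k*m1 + l = 0 and c_ii + 2*k*m2 + 2*l*m1 = 0 gives:"""
    var = m2 - m1 * m1
    k = (2.0 * m1 * a_i - c_ii) / (2.0 * var)
    l = -a_i - k * m1
    return k, l

# Unit-exponential case: E(eps) = 1, E(eps^2) = 2; invented a_i, c_ii.
k, l = modified_coeffs(1.0, 2.0, a_i=0.05, c_ii=0.10)

# Check the two moment conditions, to first order in a_i and c_ii.
mean_R, msq_R = 1.05, 2.10                       # moments of the crude R_i
mean_Rp = (1 + k) * mean_R + l
msq_Rp = (1 + k) ** 2 * msq_R + 2 * (1 + k) * l * mean_R + l * l
print(mean_Rp, msq_Rp)   # close to the targets 1 and 2
```

For these numbers k = 0 and l = -0.05, i.e. the modification is a pure shift; in general both a rescaling and a shift are needed.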
6. AN EXAMPLE
We consider further the data of Feigl and Zelen (1965). The model (7) is
Y_i = β_1 exp{β_2(x_i - x̄)} ε_i.
We first find the biases in β̂_1, β̂_2. Since ε_i is exponentially distributed with unit mean, we have, writing d_j = x_j - x̄,
I_rs = n/β_1^2 (r = s = 1);   Σ_j d_j^2 (r = s = 2);   0 (r ≠ s).
Equations (20) therefore require also
K_{1tt} = Σ_j E(W_{1tt}^(j)) = 4n/β_1^3 (t = 1);   Σ_j d_j^2/β_1 (t = 2),
and give
b_1 = -β_1/(2n);
a similar calculation gives
b_2 = -Σ_j d_j^3 / {2 (Σ_j d_j^2)^2}.
TABLE 1
Leukemia data (Feigl and Zelen, 1965). Log white blood cell count, x_i; survival time in weeks, Y_i; crude and modified residuals, R_i and R'_i
x_i   Y_i   a_i   c_ii   R_i   R'_i
[table entries not reproduced]
If we equate these expressions to (33) and (34), take logarithms and expand, ignoring
higher-order terms in k_i and l_i, we find
One such test, although not normally the best, is based in effect on comparing the
variance with the square of the mean. Now
T* = Σ_i R'_i^2 = 31.07.
FIG. 1. Leukemia data. Modified residual, R'_i, versus expected exponential order
statistics. Straight line corresponds to unit exponential distribution.
Now if the residuals were obtained directly from a random sample of size n, i.e.
R_i = Y_i/Ȳ, and the test statistic is
T* = Σ_i R_i^2,
then it is easy to show that, to the order considered,
E(T*) = 2(n - 1).
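This approximation is easy to check by simulation. The sketch below is my own, taking n = 17 (the size of the Feigl-Zelen sample, an assumption here): it draws repeated unit-exponential samples, forms R_i = Y_i/Ȳ and averages T* = Σ R_i²:

```python
import random

random.seed(1)
n, reps = 17, 40_000          # n = 17 assumed to match the example
total = 0.0
for _ in range(reps):
    y = [random.expovariate(1.0) for _ in range(n)]
    ybar = sum(y) / n
    total += sum((yi / ybar) ** 2 for yi in y)   # T* = sum R_i^2, R_i = Y_i / Ybar
mean_T = total / reps
print(mean_T)                 # close to 2(n - 1) = 32
```

The simulated mean sits near 32, which also puts the observed value 31.07 comfortably in line with the unit-exponential assumption.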
7. POISSON DATA
In order to apply the methods of the preceding sections to Poisson-distributed
observations, we must first consider how to define residuals so as to obtain nearly
identically distributed variables. The difference from the earlier discussion is that,
there, the model was defined directly in terms of independently distributed random
variables. Here we proceed indirectly, defining R_i as
(a) (Y_i - μ̂_i)/μ̂_i^½,
(b) 2(Y_i^½ - μ̂_i^½),
or some similar function of Y_i and μ̂_i chosen to have nearly zero mean and unit variance,
and hence we can apply the results of Section 4 to obtain E(R_i) and E(R_i^2). To do so,
however, involves certain approximations and it is necessary to distinguish between
terms that become small when n, the number of distinct Poisson observations, is
large and those that become small when μ, the expectation of a typical observation,
is large. The general results of Section 5 refer to large n.
The biases b_s are given by the general formula (20), in which now
I_rs = Σ_j μ_j^{-1} (∂μ_j/∂β_r)(∂μ_j/∂β_s). (39)
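Definitions (a) and (b) are straightforward to compute. A small sketch (function name and data invented, not from the paper) for the simplest case in which every observation shares the common fitted mean μ̂ = Ȳ:

```python
from math import sqrt

def poisson_residuals(y, mu_hat):
    """Definitions (a) and (b) for Poisson observations: the standardized
    residual (Y - mu)/sqrt(mu) and the variance-stabilized square-root
    residual 2(sqrt(Y) - sqrt(mu)), each roughly mean 0, variance 1
    when mu is large."""
    pearson = [(yi - m) / sqrt(m) for yi, m in zip(y, mu_hat)]
    stab = [2.0 * (sqrt(yi) - sqrt(m)) for yi, m in zip(y, mu_hat)]
    return pearson, stab

# Single-mean model: mu_hat_i = ybar, so definition (a) sums to zero exactly.
y = [3, 5, 4, 6, 2]
ybar = sum(y) / len(y)
pearson, stab = poisson_residuals(y, [ybar] * len(y))
print(sum(pearson))   # 0.0
```

Definition (b) has no such exact constraint, which is one reason the two residuals behave differently when μ is small.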
8. BINOMIAL DATA
Blom (1954) suggests equation (42) as a normalizing transformation but does not
apply it. In order to simplify its application we have computed Table 2. This gives
values of ψ(u)/ψ(1), i.e. the incomplete beta function I_u(2/3, 2/3), which is symmetrical
about u = 0.5; multiplication by B(2/3, 2/3) = 2.0533 gives the value of (42). For example,
ψ(0.2) = 2.0533 x 0.257 = 0.528, ψ(0.8) = 2.0533 x (1 - 0.257) = 1.526.
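These values can be reproduced numerically. The sketch below (my own integration scheme, not the authors') evaluates ψ(u) as the incomplete beta integral with parameters (2/3, 2/3), which the constant 2.0533 = B(2/3, 2/3) indicates; the substitution t = s³ tames the endpoint singularity, and ψ(1) = 2ψ(0.5) by symmetry:

```python
def psi(u, steps=4000):
    """psi(u) = integral from 0 to u of t**(-1/3) * (1-t)**(-1/3) dt.
    The substitution t = s**3 removes the singularity at t = 0, giving
    the smooth integrand 3*s*(1 - s**3)**(-1/3) on [0, u**(1/3)]."""
    b = u ** (1.0 / 3.0)
    h = b / steps

    def f(s):
        return 3.0 * s * (1.0 - s ** 3) ** (-1.0 / 3.0)

    total = f(0.0) + f(b)                 # Simpson's rule
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * f(i * h)
    return total * h / 3.0

B = 2.0 * psi(0.5)         # B(2/3, 2/3), by symmetry about u = 1/2
print(round(B, 3))         # 2.053, matching the constant 2.0533 in the text
print(round(psi(0.2), 2))  # 0.53, matching psi(0.2) = 2.0533 * 0.257
```

The same routine regenerates any entry of Table 2 as psi(u)/B.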
Introducing the mean and variance of the transformed binomial variate, we define
TABLE 2
Values of ψ(u)/ψ(1), the incomplete beta function I_u(2/3, 2/3)
u    0.000 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009
0.00 0     0.007 0.012 0.015 0.018 0.021 0.024 0.027 0.029 0.032
0.01 0.034 0.036 0.038 0.040 0.043 0.045 0.046 0.048 0.050 0.052
0.02 0.054 0.056 0.058 0.059 0.061 0.063 0.064 0.066 0.068 0.069
0.03 0.071 0.072 0.074 0.075 0.077 0.079 0.080 0.082 0.083 0.084
0.04 0.086 0.087 0.089 0.090 0.092 0.093 0.094 0.096 0.097 0.098
0.05 0.100 0.101 0.102 0.104 0.105 0.106 0.108 0.109 0.110 0.112
0.06 0.113 0.114 0.115 0.117 0.118 0.119 0.120 0.122 0.123 0.124
0.07 0.125 0.126 0.128 0.129 0.130 0.131 0.132 0.134 0.135 0.136
0.08 0.137 0.138 0.139 0.141 0.142 0.143 0.144 0.145 0.146 0.147
0.09 0.149 0.150 0.151 0.152 0.153 0.154 0.155 0.156 0.157 0.158
0.10 0.160 0.161 0.162 0.163 0.164 0.165 0.166 0.167 0.168 0.169
0.11 0.170 0.171 0.172 0.173 0.174 0.176 0.177 0.178 0.179 0.180
0.12 0.181 0.182 0.183 0.184 0.185 0.186 0.187 0.188 0.189 0.190
0.13 0.191 0.192 0.193 0.194 0.195 0.196 0.197 0.198 0.199 0.200
0.14 0.201 0.202 0.203 0.204 0.205 0.206 0.207 0.208 0.209 0.210
0.15 0.211 0.212 0.213 0.214 0.214 0.215 0.216 0.217 0.218 0.219
0.16 0.220 0.221 0.222 0.223 0.224 0.225 0.226 0.227 0.228 0.229
0.17 0.230 0.231 0.232 0.232 0.233 0.234 0.235 0.236 0.237 0.238
0.18 0.239 0.240 0.241 0.242 0.243 0.244 0.244 0.245 0.246 0.247
0.19 0.248 0.249 0.250 0.251 0.252 0.253 0.254 0.254 0.255 0.256
0.20 0.257 0.258 0.259 0.260 0.261 0.262 0.262 0.263 0.264 0.265
0.21 0.266 0.267 0.268 0.269 0.270 0.270 0.271 0.272 0.273 0.274
0.22 0.275 0.276 0.277 0.277 0.278 0.279 0.280 0.281 0.282 0.283
0.23 0.284 0.284 0.285 0.286 0.287 0.288 0.289 0.290 0.290 0.291
0.24 0.292 0.293 0.294 0.295 0.296 0.296 0.297 0.298 0.299 0.300
0.25 0.301 0.302 0.302 0.303 0.304 0.305 0.306 0.307 0.308 0.308
0.26 0.309 0.310 0.311 0.312 0.313 0.313 0.314 0.315 0.316 0.317
0.27 0.318 0.318 0.319 0.320 0.321 0.322 0.323 0.323 0.324 0.325
0.28 0.326 0.327 0.328 0.328 0.329 0.330 0.331 0.332 0.333 0.333
0.29 0.334 0.335 0.336 0.337 0.338 0.338 0.339 0.340 0.341 0.342
0.30 0.342 0.343 0.344 0.345 0.346 0.347 0.347 0.348 0.349 0.350
0.31 0.351 0.351 0.352 0.353 0.354 0.355 0.355 0.356 0.357 0.358
0.32 0.359 0.360 0.360 0.361 0.362 0.363 0.364 0.364 0.365 0.366
0.33 0.367 0.368 0.368 0.369 0.370 0.371 0.372 0.372 0.373 0.374
0.34 0.375 0.376 0.376 0.377 0.378 0.379 0.380 0.380 0.381 0.382
0.35 0.383 0.384 0.384 0.385 0.386 0.387 0.388 0.388 0.389 0.390
0.36 0.391 0.392 0.392 0.393 0.394 0.395 0.396 0.396 0.397 0.398
0.37 0.399 0.400 0.400 0.401 0.402 0.403 0.403 0.404 0.405 0.406
0.38 0.407 0.407 0.408 0.409 0.410 0.411 0.411 0.412 0.413 0.414
0.39 0.414 0.415 0.416 0.417 0.418 0.418 0.419 0.420 0.421 0.422
0.40 0.422 0.423 0.424 0.425 0.425 0.426 0.427 0.428 0.429 0.429
0.41 0.430 0.431 0.432 0.433 0.433 0.434 0.435 0.436 0.436 0.437
0.42 0.438 0.439 0.440 0.440 0.441 0.442 0.443 0.443 0.444 0.445
0.43 0.446 0.447 0.447 0.448 0.449 0.450 0.450 0.451 0.452 0.453
0.44 0.454 0.454 0.455 0.456 0.457 0.457 0.458 0.459 0.460 0.461
0.45 0.461 0.462 0.463 0.464 0.464 0.465 0.466 0.467 0.468 0.468
0.46 0.469 0.470 0.471 0.471 0.472 0.473 0.474 0.474 0.475 0.476
0.47 0.477 0.478 0.478 0.479 0.480 0.481 0.481 0.482 0.483 0.484
0.48 0.485 0.485 0.486 0.487 0.488 0.488 0.489 0.490 0.491 0.491
0.49 0.492 0.493 0.494 0.495 0.495 0.496 0.497 0.498 0.498 0.499
The p.d.f. of Y_j is
p_j(y_j, β) = (m_j choose y_j) θ_j^{y_j} (1 - θ_j)^{m_j - y_j},
9. A FURTHER EXAMPLE
Dyke and Patterson (1952) present the analysis for a 2^4 factorial design of the
proportions of respondents who achieve good scores on cancer knowledge; some
details of the data are given in columns (1)-(3) of Table 3. They assume a logit
transformation of the proportions, the expected value of the transformed variate
being a linear function of parameters representing main effects and interactions.
Values of the parameters are estimated by maximum likelihood. We consider their
solution and apply the methods of Section 8 to examine residuals from the fitted
model.
Following Dyke and Patterson (with slight changes in notation) we write the model
as
θ_j(β) = {1 + exp(-2z_j)}^{-1},
where z_j = l_jr β_r, summed over the parameters, with l_jr = ±1. Then, to the order considered,
E(R_i^2) = 1 + c_ii.
Dyke and Patterson fit a model with five parameters, representing the overall
mean and the four main effects. They quote the values of Irs obtained in the course
of their solution and we use these values to calculate b8 and ct.
Values of c_ii, R_i and R'_i are given in Table 3. The biases b_s were all extremely small;
none exceeded 2½ per cent of the standard error of the estimate of the parameter.
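The fitting procedure just described can be sketched as Fisher scoring on θ_j(β) = {1 + exp(-2z_j)}^{-1}. Since the Dyke-Patterson counts are not reproduced in this text, the sketch below simulates counts from an assumed β (all numbers invented) and merely checks that the likelihood equations are solved; the score and information follow from dθ/dz = 2θ(1 - θ):

```python
import itertools, math, random

random.seed(2)

# 2^4 design: rows of +-1 levels; z_j = beta0 + sum_r l_jr * beta_r.
levels = [list(row) for row in itertools.product([-1, 1], repeat=4)]
design = [[1] + row for row in levels]        # overall mean + 4 main effects
p = 5

def theta(z):
    return 1.0 / (1.0 + math.exp(-2.0 * z))

# Simulated counts from an invented parameter vector (50 trials per cell).
beta_true = [0.3, 0.2, -0.1, 0.15, 0.05]
m = [50] * 16
y = [sum(random.random() < theta(sum(b * x for b, x in zip(beta_true, row)))
         for _ in range(mj))
     for row, mj in zip(design, m)]

# Fisher scoring: U_r = 2 sum_j (y_j - m_j theta_j) l_jr,
#                 I_rs = 4 sum_j m_j theta_j (1 - theta_j) l_jr l_js.
beta = [0.0] * p
for _ in range(30):
    th = [theta(sum(b * x for b, x in zip(beta, row))) for row in design]
    U = [2.0 * sum((y[j] - m[j] * th[j]) * design[j][r] for j in range(16))
         for r in range(p)]
    info = [[4.0 * sum(m[j] * th[j] * (1 - th[j]) * design[j][r] * design[j][s]
                       for j in range(16)) for s in range(p)] for r in range(p)]
    # Solve info * step = U by Gauss-Jordan elimination.
    A = [row_[:] + [Ur] for row_, Ur in zip(info, U)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [ar - f * ac for ar, ac in zip(A[r], A[c])]
    step = [A[r][p] / A[r][r] for r in range(p)]
    beta = [b + s for b, s in zip(beta, step)]

th = [theta(sum(b * x for b, x in zip(beta, row))) for row in design]
score = max(abs(2.0 * sum((y[j] - m[j] * th[j]) * design[j][r] for j in range(16)))
            for r in range(p))
print(score)   # effectively zero: the likelihood equations are satisfied
```

With the fitted θ_j in hand, the binomial residuals of Section 8 follow cell by cell.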
FIG. 2. 2^4 factorial design. Modified residual, R'_i, versus expected normal order
statistics. Straight line corresponds to unit normal distribution.
The modified residuals R'_i are plotted against the expected normal order statistics
in Fig. 2 and show agreement with the assumption of a standard normal distribution.
Dyke and Patterson go on to fit a model with extra parameters to represent the
interactions AD, BD and CD. Before proceeding to extend the model, we examine
the residuals R'_i from the simple main-effects model. If we regard the values of R'_i
as observations in a 2^4 design, we can analyse them in the usual way and obtain the
sums of squares given in Table 4. This is an unweighted analysis and, as such, can
be used only for guidance; the existence of any apparent effect can be established
only by fitting a model containing the appropriate parameters. Also, for the same
reason, the sums of squares due to main effects in Table 4 are not zero. Nevertheless,
the magnitude of the AC effect suggests it to be worth including in the model along
with AD, BD and CD; the three-factor interaction ACD also is large but has not been
included in the further analysis. We therefore fitted a model with nine parameters;
TABLE 3
Treatment   m_i   r_i   c_ii   R_i   R'_i
[table entries not reproduced]
TABLE 4
Sums of squares
Effect   R'_i   R_i
A        0.78   0.80
B        0.26   0.20
C        1.09   1.14
D        0.01   0.00
AB       0.04   0.04
AC       2.53   0.98
AD       1.85   1.57
BC       0.32   0.20
BD       2.06   1.69
CD       4.88   3.89
ABC      2.51   1.26
ABD      0.07   0.10
ACD      3.77   1.97
BCD      0.03   0.05
ABCD     0.03   0.03
the estimated values and standard errors are given in Table 5. Comparison of the
estimates with their standard errors confirms that the AC interaction is at least as
significant as the AD and BD interactions.
TABLE 5
[table entries not reproduced]
For comparison, the corresponding analysis on the crude residuals, Ri, is also given
in Table 4; it is interesting that the AC interaction does not stand out in this case.
Note, however, that 9 parameters are being fitted to 16 observations, so that the
applicability of the asymptotic formulae is in doubt.
ACKNOWLEDGEMENT
We are grateful to Mrs E. A. Chambers and Mr B. G. F. Springer for programming
the calculations. Their work was supported by the Science Research Council.
REFERENCES
ANSCOMBE, F. J. (1953). Contribution to the discussion of H. Hotelling's paper. J. R. Statist.
Soc. B, 15, 229-230.
(1961). Examination of residuals. Proc. 4th Berkeley Symp., 1, 1-36.
BARTLETT, M. S. (1953). Approximate confidence intervals. II. Biometrika, 40, 306-317.
BLOM, G. (1954). Transformations of the binomial, negative binomial, Poisson and χ² distri-
butions. Biometrika, 41, 302-316.
Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations. J. R. Statist. Soc. B,
26, 211-252.
Cox, D. R. and LEWIS, P. A. W. (1966). The Statistical Analysis of Series of Events. London:
Methuen.
DURBIN, J. and WATSON, G. S. (1950). Testing for serial correlation in least squares regression.
I. Biometrika, 37, 409-428.
(1951). Testing for serial correlation in least squares regression. II. Biometrika, 38, 159-178.
DYKE, G. V. and PATTERSON, H. D. (1952). Analysis of factorial arrangements when the data are
proportions. Biometrics, 8, 1-12.
FEIGL, P. and ZELEN, M. (1965). Estimation of exponential survival probabilities with concomitant
observation. Biometrics, 21, 826-838.
HALDANE, J. B. S. (1953). The estimation of two parameters from a sample. Sankhyā, 12, 313-320.
HALDANE, J. B. S. and SMITH, S. M. (1956). The sampling distribution of a maximum likelihood
estimate. Biometrika, 43, 96-103.
MOORE, P. G. (1957). Transformations to normality using fractional powers of the variate.
J. Am. Statist. Ass., 52, 237-246.
PEARSON, E. S. and HARTLEY, H. O. (1966). Biometrika Tables for Statisticians, 3rd ed. Cambridge
University Press.
SHENTON, L. R. and BOWMAN, K. (1963). Higher moments of a maximum-likelihood estimate.
J. R. Statist. Soc. B, 25, 305-317.
SHENTON, L. R. and WALLINGTON, P. A. (1962). The bias of moment estimators with an appli-
cation to the negative binomial distribution. Biometrika, 49, 193-204.
will be quite inadequate. A model which takes the underlying physical situation seriously
enough to influence the expectations might also allow it to affect the error structure, and
doing so is likely to lead to just the sort of problems dealt with here. The solutions to
such problems may well be more important than they would be in the linear case. If a
linear model is admittedly an over-simplification, we can tolerate even some systematic
departures from it, weighing these against the advantages of simplicity and ease of calcu-
lation; whereas the extra cost of the non-linear model may not be recouped by better
insight into the true situation if discrepancies between model and observations are allowed
to remain uninvestigated.
I would like to make a few specific comments on the paper, in particular on the second
example, and to ask the authors whether they have compared their incomplete beta-
function transform with any alternatives. The maximum-likelihood calculations can often
be viewed as weighted least squares following a transformation, and thus can readily be
persuaded to produce a set of quasi-residuals as a by-product. Even simpler, perhaps,
is a procedure I have myself found useful for informal investigation; it consists of tabulating
or plotting (observed - expected)/(expected)^½, the sum of squares of these quantities being
simply the goodness-of-fit χ². The surprising thing about the transformation tabulated in
the paper is perhaps its near linearity between 10 and 90 per cent.
I confess that I share the authors' reservations over treating 16 - 9 as a large number.
I would also like to ask whether they know any way of taking advantage of the fact that
their "ordered plots" should in the null situation not only give straight lines but straight
lines with known position and slope. Perhaps one might examine the residuals.
In summary, my reaction to this paper has been that which is so often stimulated by
papers in which Professor Cox has had a hand-a mixture of somewhat envious admiration
with an impatience to go away and apply the suggested methods to data a satisfactory
treatment of which has so far eluded me. I would like to propose a very hearty vote of
thanks to both authors.
Before statisticians had ready access to computers they had a reasonable degree of
control over residuals. This was primarily due to the fact that observations largely
resulted from statistically designed experiments and there was often prior information
about the nature of the experimental error. When this was not the case then computing
limitations usually restricted investigations to models with few parameters.
However, immediately computers became available for statistical analyses, multiple
linear regression and non-linear estimation programs were written which were capable
of analysing models involving a large number of parameters.
Initially most of these programs were applied somewhat indiscriminately, and generally
to data which had not been statistically planned. Residuals were listed but their analysis
was left to the statistician who probably contented himself with a quick scan for outliers
and distribution form. More recently these programs have attempted to automate the
examination of model adequacy.
In one I.C.I. program there is an option for considering the family of models introduced
by Box and Cox (1964). The paper describes this as a more formal approach. Procedures
are also available for dealing with outliers, for testing distributional form and for plotting
residuals with selected variables. It is also possible to test the hypothesis that the residuals
are distributed in the multivariable space according to a given law. In this case interest
might lie in detecting and defining regions of the space in which the residuals are signifi-
cantly correlated. Clump or Cluster Analysis is a possibility here.
In general the residuals which are analysed are those the paper refers to as Crude
Residuals and Professor Cox and Mrs Snell draw our attention to the fact that perhaps we
should work in terms of Modified Residuals. However, the paper comes to no definite
recommendation that we should use the Modified Residual except perhaps when the num-
ber of observations is small compared with the number of parameters to be estimated,
and I would agree with the authors that this is encouraging.
The paper suggests that further work might be undertaken in defining residuals related
to time series and component of variance problems. Certainly in such situations the
problems are more difficult. Box and Jenkins discuss the identification of non-stationary
time series models using combinations of differencing, regression, overparameterization
and simulation techniques. In industry the most frequently encountered non-stationary
model can be represented in terms of a first-order random walk generating process:
d_t = m_t + ε_t,   m_t = m_{t-1} + γ_t, (1)
where d_t is the observation at time t and the ε_t and γ_t are each a set of independently
identically distributed random variables with zero mean.
This model is a particular case of the polynomial random walk in which each derivative
is subjected to random impulses. The linear least-squares predictor for (1) is
d̂_{t+1} = d̂_t + λe_t,
where e_t = d_t - d̂_t is the one-step prediction error.
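A sketch of this predictor (taking e_t = d_t - d̂_t as the one-step prediction error, the usual convention for the steady model), verifying that the recursion is an exponentially weighted average of past observations:

```python
def ewma_forecasts(d, lam, f1=0.0):
    """One-step forecasts f_{t+1} = f_t + lam * (d_t - f_t) for the steady
    model: each new forecast moves a fraction lam of the way toward the
    latest observation."""
    fs = [f1]
    for dt in d:
        fs.append(fs[-1] + lam * (dt - fs[-1]))
    return fs

# Unrolling the recursion with f1 = 0 gives the weighted-average form
# f_{t+1} = lam * sum_k (1 - lam)**k * d_{t-k}.
d = [3.0, 4.0, 2.5, 5.0, 4.5]        # invented observations
lam = 0.4
fs = ewma_forecasts(d, lam)
direct = lam * sum((1 - lam) ** k * d[len(d) - 1 - k] for k in range(len(d)))
print(abs(fs[-1] - direct) < 1e-12)  # True
```

The residuals e_t from such a predictor, rather than deviations from a fixed mean, are the natural raw material for cusum charts in the non-stationary case.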
The vote of thanks was put to the meeting and carried unanimously.
expression of algorithms, using a notation and way of thinking closely resembling estab-
lished mathematics. The present implementation of a modified version of the language
as a computer coding language is known as APL. (The modest first letter has been
retained, though some of us think that TPL would be more fitting. The language should
not be confused with PL/1, developed for IBM's 360 series of computers.) APL has been
implemented experimentally at IBM's Thomas J. Watson Research Center, Yorktown
Heights, N.Y., for computation in conversational mode through typewriter terminals.
At present it has not been generally released.
Requirements for statistical computing are many and various, because the persons
who have occasion to do such computing have very diverse degrees of interest in statistics
and in computing. No one system or method can be satisfactory for all. A professional
statistician (like me) needs to be able to experiment freely in computing, and ought if
possible to be able to do so without a vast effort in mastering a computer language and
with little time spent in coding or otherwise specifying what he wants done. It seems to
me that, unlike previous general-purpose languages such as FORTRAN or ALGOL,
APL is sufficiently powerful and sufficiently easy to learn to meet this need. I have been
preparing a paper to show how the language can be used for typical statistical calculations,
to show enough of its character that a reader could have some basis for judging whether
to take an active interest. Far less than that can be said now.
Statistical computing, like other computing, requires negotiation of arrays. Various
languages and systems have been proposed permitting matrix operations to be called for
easily. APL also is designed to handle arrays. The unique feature of APL is the generality
of its definitions, leading to a high degree of consistency that not only makes the language
easy to remember but also gives it a peculiar dignity and reasonableness. One example
must suffice.
Any language or system designed to handle matrices must obviously encompass
matrix addition. It must permit a command like:
ADD A B,
where A and B are matrices of the same size. In APL matrix addition is denoted by
A+B.
What is peculiar is that this notation refers not to a special operation, matrix addition,
but to a general method of combining two arrays that are not necessarily matrices by a
function that is not necessarily addition. In fact A and B can be any two arrays of the same
size-vectors, or matrices, or rectangular arrays of any number of dimensions. The
function " + " can be replaced by any other standard function f having the same syntax,
that is by a symbol f such that for any scalar arguments A and B the combination A fB
is scalar. Thus if A and B are two 17-dimensional arrays of the same size, A x B means
the array of the same size formed by multiplying each element of A by the corresponding
element of B. Similarly for A * B (where * means exponentiation) and A I B (where I means
the residue function) and A = B (where = is the logical function taking the value 1 when
the arguments are equal and 0 otherwise), and so on for a dozen more functions. We
might say that for the cost of a capability in matrix addition we have obtained free many
other capabilities, many of which turn out to be just as useful. The whole of APL has
been constructed with this kind of generality.
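The APL idea described here, one generic scheme that combines equally shaped arrays by any scalar function, can be mimicked in a few lines (a toy sketch only, not APL semantics in full; there is, for instance, no rank polymorphism or scalar extension):

```python
def zip_with(f, a, b):
    """Combine two equally shaped nested lists elementwise by any scalar
    function f: the APL-style generalization in which 'A + B' is just one
    instance of a general array-combining scheme."""
    if isinstance(a, list):
        if len(a) != len(b):
            raise ValueError("arrays must have the same shape")
        return [zip_with(f, x, y) for x, y in zip(a, b)]
    return f(a, b)

A = [[1, 2], [3, 4]]
B = [[10, 20], [30, 40]]
print(zip_with(lambda x, y: x + y, A, B))         # [[11, 22], [33, 44]]
print(zip_with(lambda x, y: x * y, A, B))         # [[10, 40], [90, 160]]
print(zip_with(lambda x, y: int(x == y), A, A))   # [[1, 1], [1, 1]]
```

As in APL, the cost of implementing "matrix addition" once buys multiplication, exponentiation, comparison and every other scalar function for free.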
Whatever may be the fate of this particular implementation of APL, something like
it must surely eventually command widespread approval.
The relevance of computing to the paper by Professor Cox and Mrs Snell is that until
a computer has been adequately tamed, residuals have only theoretical interest. Such a
study was not done many years ago in the Fisherian era because computers were not
available then. I am eager to try out these methods, which promise to be a valuable
weapon in the continual struggle to fit theory to facts.
F^{-1}{i/(n + 1)} or F^{-1}{(i - ½)/n}
(if we ignore inter-residual correlations); but we may well sacrifice a lot of information
on u in relation to some other choice of plotting positions. This is well illustrated, for
example, by the case of uniformly distributed residuals with half-range σ.
The choice of plotting positions relates also to the question of model validation. We
are invited to consider the plotted ordered residuals in Figs. 1 and 2 as evidence of the
adequacy of the respective models for these examples. It is largely a subjective matter
how convincing we find the evidence in these two cases-personally Fig. 2 worries me a
little! More objective criteria of the adequacy of the model can of course be constructed
and applied to these plots, as the authors have mentioned. But again the properties, and
relative convenience, of relevant test statistics for this purpose are going to depend
(perhaps quite strongly) on the choice of plotting positions, xi.
My second and concluding point is something of a personal "hobby horse" and I will
be brief. In the Introduction the authors comment on the use of computers in studying
residuals. There is a growing tendency, in some circles, to regard the computer as a
substitute for common sense, thought and observation. I certainly would not presume to
criticize the authors on these grounds; the whole spirit of their paper denies this attitude;
nor would I dispute the essential value of the computer in large-scale studies of residuals.
But it would be unfortunate if the lack of a "suitable computer with graphical output
device" was allowed to discourage, or excuse, the study of residuals for model validation.
A pencil and piece of graph paper may still work wonders in this respect!
Dr S. C. PEARCE (East Malling Research Station): I studied the text of this paper with
the greatest interest and found it both stimulating and provocative. After a time I came to
suspect that the authors know rather more about the quantities, R'_i, than they say, so
perhaps I may provoke them in turn, hoping they will add to their remarks and so make an
excellent paper even better.
These quantities are derived from the crude residuals, R_i, by a transformation that is
intended to give them the same mean and variance as the random variables, ε_i. Accordingly
they go only part of the way towards the desired quantities, each of which would be an
estimate of one of the ε_i's. Where they first appear at equation (30), there are some cautionary
words about particular cases and fairly general procedures, but when they are actually used
at equation (35) the word is Hence; I suggest that it should be Let. The transformation
proposed is completely reasonable, but it is not unique for the purpose intended.
In fact the transformation in this instance proves to be rather too powerful, as is shown
by Fig. 1. In general, it will make small R_i into even smaller R'_i whereas it will increase
large R_i. In Fig. 1 we see that all the smallest residuals lie below the line while the largest
is awkwardly above it, and we should actually have done better not to have transformed
at all. Turning to Fig. 2, the transformation has again been too powerful; with one
exception all the negative residuals are too small and all the positive ones too large. Here
let me agree that these two Figures provide most impressive support for the essential
rightness of the authors' approach; my point is merely that there are in fact systematic
deviations, which may be due to the arbitrary element in the transformation.
However, there are other explanations possible. Perhaps the systematic deviations
result from a quantity having been badly estimated on account of some quirk in the data.
In that case the fact of the transformation having been too powerful in two instances may
be of no more importance than a coin having been tossed twice and come down heads on
both occasions. Alternatively everything may be the result of using R_i instead of R'_i or ε_i.
Admittedly the method of deriving the R'_i is reasonable, because each ε_i must be a function
of the values R_i and must depend chiefly upon the R_i to which it corresponds, but a trans-
formation such as the one used can hardly be exact. There is another point: how certain
can we be a priori that the random residuals, ε_i, are in fact distributed exponentially
(Fig. 1) or normally (Fig. 2)? (In the latter case, as a matter of fact, it is scarcely con-
ceivable that they should be.) For my own part I can see no way of judging where the
systematic deviations come from, but perhaps the authors can help.
I have rather trailed my coat because I would like to hear the extended comments of
the authors on this point. I would, however, advance a suggestion. Scandalous as the
suggestion may seem at a meeting of the Royal Statistical Society, there are occasions when
real data are a nuisance and fudged-up figures are better. This is perhaps an occasion for
simulation. If we knew for certain what the parameters and random residuals were, we
could apply the methods of this valuable paper and observe how the values of Ri actually
behave. It could be that we do not need to seek much further.
Suppose now that one observes Xt and Wt over some interval of time and wishes to estimate
the transfer function A. As an initial model one might suppose that et was a white noise
process uncorrelated with Xt, and estimate A by the usual technique of cross-spectral
analysis. However, if these assumptions were incorrect then the estimate of A could be
seriously in error, and any subsequent model fitted to the residuals may be invalid. In
such situations it would be nice if one could appeal to some type of "orthogonality"
property, but this is hardly likely to be applicable in many situations.
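The usual cross-spectral estimate referred to above can be sketched numerically. The sketch below uses invented, deliberately favourable assumptions (a frequency-flat transfer function A = 2 and white noise et uncorrelated with Xt), and forms the standard estimate A_hat(f) = f_xw(f)/f_xx(f) from Welch-type spectral estimates:

```python
import numpy as np
from scipy.signal import csd, welch

rng = np.random.default_rng(0)

# Invented example: W_t = A * X_t + e_t with a frequency-flat transfer
# function A = 2 and white noise e_t uncorrelated with X_t.
n = 8192
x = rng.standard_normal(n)
w = 2.0 * x + 0.1 * rng.standard_normal(n)

# Cross-spectral estimate of the transfer function:
#   A_hat(f) = f_xw(f) / f_xx(f)
f, p_xw = csd(x, w, nperseg=256)
_, p_xx = welch(x, nperseg=256)
a_hat = p_xw / p_xx

print(round(float(np.mean(np.abs(a_hat))), 2))  # close to the true gain 2.0
```

When the assumptions fail, for instance when et is correlated with Xt, the same computation still produces numbers, which is precisely why the estimate can be seriously in error without warning.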
Mrs SNELL replied briefly at the meeting and the authors subsequently replied more
fully in writing as follows:
We are grateful for the extremely constructive and helpful comments that have been
made and for the pinpointing of problems for further study. As Mr Healy has pointed out,
there are a number of commonly used transformations of the binomial distribution which
are for many purposes effectively linear functions of one another. The incomplete beta
transformation studied in the paper does, however, give an appreciably more linear plot
on probability paper than, for example, the inverse sine transformation, especially in the
range p = 0.1 to 0.3 and with small n, for example n = 10; in looking for systematic
departures this could be an advantage. Mr Harrison has raised in particular the question of the
behaviour of cusum charts plotted from residuals, and we agree that this deserves further
study, especially of the effect on the plot of correlation between different residuals.
We have deliberately in the paper put the main emphasis on the plotting of residuals
rather than on formal tests of significance. Professor Durbin and also Dr Priestley have
stressed the need for caution in applying significance tests to residuals. That this is a very
important point is clear from a consideration of the simple problem of examining in linear
regression the possible importance of an omitted regressor variable, say z. It is entirely
legitimate to make a graphical analysis by plotting residuals from the initial regression
relation directly against z. If, however, the significance of the regression on z is to be
tested by the usual formula the residuals of z must be used, as is in effect done in analysis
of covariance. In the more general situations contemplated in our paper the expected
value of linear or nearly linear test statistics calculated from residuals will be close to that
calculated from the Ei's, but the contribution of the covariance terms (28) to the variance
of such test statistics will in general be non-negligible, because the number of covariance
terms will be of order n2. A special case is Professor Durbin's example for the sum of
residuals, where the correct variance he gives follows directly from (28). That is, in such
applications it is essential to take into account the correction terms in (28) and with more
complex test statistics to develop appropriate extensions of them.
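The regression point made above can be checked numerically. The sketch below uses invented data and ordinary least squares only: the slope of the initial residuals on the residuals of z reproduces the coefficient of z in the full model exactly, as in analysis of covariance, whereas the slope on raw z is attenuated and would invalidate the usual test formula.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented illustrative data: y depends on x and on a second regressor z,
# where z is correlated with x (all coefficients are made up).
n = 100
x = rng.standard_normal(n)
z = 0.6 * x + rng.standard_normal(n)
y = 1.0 + 2.0 * x + 1.5 * z + rng.standard_normal(n)

def resid(target, X):
    # Least-squares residuals of target regressed on the columns of X.
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ coef

X1 = np.column_stack([np.ones(n), x])   # the initial model: intercept and x
r_y = resid(y, X1)                      # residuals from the initial fit
r_z = resid(z, X1)                      # the "residuals of z"

# Slope of r_y on the residuals of z: reproduces the coefficient of z in
# the full model exactly (the analysis-of-covariance adjustment).
slope_adjusted = float(r_y @ r_z / (r_z @ r_z))
# Slope of r_y on raw z: attenuated, so the usual test formula is invalid.
zc = z - z.mean()
slope_raw = float(r_y @ zc / (zc @ zc))

coef_full, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x, z]), y,
                                rcond=None)
print(round(slope_adjusted - float(coef_full[2]), 10), round(slope_raw, 3))
```

Plotting r_y directly against z remains perfectly legitimate as a graphical device; the distinction matters only once a formal significance test is attempted.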
Mr Walker has pointed out that the distribution of the Ei's could by transformation
be taken in any form, and he and Professor Tukey have enquired about the use of a log
transformation in the exponential regression model (7), converting it into a model with
additive error. It must be stressed that computationally there is no particular advantage
in such a transformation, unless a log normal distribution of error is assumed. Our reason
for keeping to the original scale was partly the simplicity of properties of the exponential
distribution and more importantly that we felt that in this application the precise form of
the distribution for small values is not important and that test statistics and plots that are
sensitive to the very small values are not good. We think that similar arguments of
simplicity and meaningfulness will often be applicable; see also the remarks at the begin-
ning of Section 7 of the paper.
Dr Loynes and Dr Mallows have sketched theoretical arguments that should be an
improvement on those that we have used and we look forward to seeing a detailed account
of their work. In particular they should throw light on some of the points raised by
Dr Pearce, with whom we agree that there is substantial scope for simulation in studying
these problems.
We agree with Dr Barnett that it is not always necessary to take the distribution of the
Ei's in standardized form, nor always to include its parameters in fitting the model. Often
however, as in our two examples, it will be useful to take the Ei's as having a completely
known distribution.
We accept the general points made by Dr Priestley; they make explicit some of the
reservations about residuals discussed in the introduction of our paper.
Professor Tukey has made a considerable number of cogent points. We agree that the
real usefulness of minor adjustments to the residuals is not established although, as
Professor Durbin's contribution makes clear, second-order properties are important when
it comes to tests. In the statement about Poisson variates that he queries, we had in mind
that no continuous relation is available. If the Ri's are to be plotted against external
variables we think that standardizing them to have equal means and variances is very
reasonable to avoid spurious regularities in the plots but in examining distributional form
the neglect of covariances is less easily defended, as indeed we indicated in the paper.
Finally we mention some relevant references that have come to our attention since we
wrote the paper. Dr Mallows has pointed out that the work of David and Johnson (1948)
has some bearing on the problem briefly mentioned at the end of Section 10. Also Theil
(1965, 1968) has suggested that for normal theory linear model problems adjusted residuals
with exactly the distribution of the true errors can be produced by attempting to find
residuals only at a suitably limited set of observational points.