Parameter Orthogonality and Approximate Conditional Inference

Author(s): D. R. Cox and N. Reid
Source: Journal of the Royal Statistical Society. Series B (Methodological), Vol. 49, No. 1 (1987), pp. 1-39
Published by: Blackwell Publishing for the Royal Statistical Society
Stable URL: http://www.jstor.org/stable/2345476
Accessed: 14/11/2011 00:15

J. R. Statist. Soc. B (1987) 49, No. 1, pp. 1-39

Parameter Orthogonality and Approximate Conditional Inference

D. R. COX†
Imperial College, London

and

N. REID
University of British Columbia, Vancouver

[Read before the Royal Statistical Society at a meeting organized by the Research Section on Wednesday, October 8th, 1986, Professor A. F. M. Smith in the Chair]

SUMMARY

We consider inference for a scalar parameter $\psi$ in the presence of one or more nuisance parameters. The nuisance parameters are required to be orthogonal to the parameter of interest, and the construction and interpretation of orthogonalized parameters is discussed in some detail. For purposes of inference we propose a likelihood ratio statistic constructed from the conditional distribution of the observations, given maximum likelihood estimates for the nuisance parameters. We consider to what extent this is preferable to the profile likelihood ratio statistic in which the likelihood function is maximized over the nuisance parameters. There are close connections to the modified profile likelihood of Barndorff-Nielsen (1983). The normal transformation model of Box and Cox (1964) is discussed as an illustration.

Keywords: ASYMPTOTIC THEORY; CONDITIONAL INFERENCE; LIKELIHOOD RATIO TEST; NORMAL TRANSFORMATION MODEL; NUISANCE PARAMETERS; ORTHOGONAL PARAMETERS

1. INTRODUCTION

The primary objective of this paper is to explore the connection between orthogonality of parameters and the asymptotic theory of conditional inference. Orthogonality is defined with respect to the expected Fisher information matrix as described in Section 2. In general it is not possible to have total parameter orthogonality at all parameter values but it is possible to obtain orthogonality of a scalar parameter of interest $\psi$ to a set of nuisance parameters. The concept of orthogonal parameters seems to have fairly broad implications and is discussed in some detail in Section 2 and illustrated with several examples in Section 3.

A widely used procedure for inference about a parameter in the presence of nuisance parameters is to replace the nuisance parameters in the likelihood function by their maximum likelihood estimates and examine the resulting profile likelihood as a function of the parameter of interest. This procedure is known to give inconsistent or inefficient estimates for problems with large numbers of nuisance parameters, which suggests that it may not be close to optimal for a small number of nuisance parameters, even though the likelihood ratio statistic with no nuisance parameters is in some sense optimal. We consider an approach to inference based on the conditional likelihood given maximum likelihood estimates of the orthogonalized parameters. To the extent that the maximum likelihood estimates of the nuisance parameters are complete sufficient statistics for the nuisance parameters, this conditional likelihood procedure generalizes the usual procedure for obtaining similar tests, described for example in Cox and Hinkley (1974, p. 134). There are close connections to the modified profile likelihood of Barndorff-Nielsen (1983, 1985b).

The conditional profile likelihood function is discussed and illustrated in Section 4, and a possible justification for preferring it to the usual profile likelihood function is presented in Section 4.3. Inference for the normal transformation model is discussed separately in Section 5. In Section 6 some further points and open questions are discussed.

† Address for correspondence: Department of Mathematics, Imperial College, London SW7 2BZ, UK.

© 1987 Royal Statistical Society

0035-9246/87/49001

2. ORTHOGONAL PARAMETERS

2.1. Introduction

We deal throughout with parametric problems for which the vector of observations is represented by an $n \times 1$ vector $Y$ of random variables having density $f_Y(y; \theta)$ depending on the $1 \times p$ vector $\theta$ of unknown parameters. We write $l(\theta)$ for the log-likelihood; depending on the context this will be either $\log f_Y(y; \theta)$ for given observations $y$, or the random variable $\log f_Y(Y; \theta)$. Occasionally we write $l_Y(\theta)$ to emphasize that the log-likelihood is derived from the density of $Y$. Our arguments will be informal without explicit attention to regularity conditions, these being essentially those required for the expansions needed for maximum likelihood theory in regular estimation problems.

If $\theta$ is partitioned into two vectors $\theta_1$ and $\theta_2$ of length $p_1$ and $p_2$ respectively, $p_1 + p_2 = p$, we define $\theta_1$ to be orthogonal to $\theta_2$ if the elements of the information matrix satisfy

$$ i_{\theta_s \theta_t} = \frac{1}{n} E\left( \frac{\partial l}{\partial \theta_s} \frac{\partial l}{\partial \theta_t}; \theta \right) = \frac{1}{n} E\left( -\frac{\partial^2 l}{\partial \theta_s \, \partial \theta_t}; \theta \right) = 0 \tag{1} $$
for $s = 1, \ldots, p_1$, $t = p_1 + 1, \ldots, p_1 + p_2$; this is to hold for all $\theta$ in the parameter space, and is sometimes called global orthogonality. Note that $i$ refers to information per observation, which will be assumed to be $O(1)$ as $n \to \infty$. If (1) holds at only one parameter value $\theta_0$, then the vectors $\theta_1$ and $\theta_2$ are said to be locally orthogonal at $\theta_0$. The most direct statistical interpretation of (1) is that the relevant components of the score statistic are uncorrelated.

The definition of orthogonality can be extended to more than two sets of parameters, and in particular $\theta$ is totally orthogonal if the information matrix is diagonal. While orthogonality can always be achieved locally, global orthogonality is possible only in special cases (Jeffreys, 1961, p. 208; Huzurbazar, 1950; Mitchell, 1962; Amari, 1985).

2.2. Consequences of Orthogonality

There are a number of statistical consequences of orthogonality which we now outline. For simplicity, suppose $\theta = (\psi, \lambda)$ has just two components. Then orthogonality of $\psi$ and $\lambda$ implies that

(i) the maximum likelihood estimates $\hat\psi$ and $\hat\lambda$ are asymptotically independent;
(ii) the asymptotic standard error for estimating $\psi$ is the same whether $\lambda$ is treated as known or unknown;
(iii) there may be simplifications in the numerical determination of $(\hat\psi, \hat\lambda)$; see Ross (1970) in the context of nonlinear regression.

A further property related to (iii) and of particular relevance for the present paper is

(iv) $\hat\psi_\lambda = \hat\psi(\lambda)$, the maximum likelihood estimate of $\psi$ when $\lambda$ is given, varies only slowly with $\lambda$.
To study (iv), we write the log-likelihood function near the maximum $(\hat\psi, \hat\lambda)$ as

$$ l(\psi, \lambda) = l(\hat\psi, \hat\lambda) + \tfrac{1}{2}\{ -n j_{\psi\psi} (\psi - \hat\psi)^2 - 2 n j_{\psi\lambda} (\psi - \hat\psi)(\lambda - \hat\lambda) - n j_{\lambda\lambda} (\lambda - \hat\lambda)^2 \} + O_p(\|\theta - \hat\theta\|^3), \tag{2} $$

where, for example, $n j_{\psi\psi} = [-\partial^2 l(\psi, \lambda)/\partial \psi^2]_{\hat\theta}$. Write $j_{\psi\psi} = i_{\psi\psi} + Z_{\psi\psi}/\sqrt{n}$, etc., where $Z_{\psi\psi}, \ldots$ are random variables of zero mean and $O_p(1)$ as $n \to \infty$. The dependence of $i$, $Z$ on $\theta$ is suppressed. We rewrite (2) in terms of $i$ and $Z$ and differentiate with respect to $\psi$, so that $\hat\psi_\lambda$ satisfies

$$ n i_{\psi\psi} (\hat\psi_\lambda - \hat\psi) + n i_{\psi\lambda} (\lambda - \hat\lambda) + \sqrt{n} Z_{\psi\psi} (\hat\psi_\lambda - \hat\psi) + \sqrt{n} Z_{\psi\lambda} (\lambda - \hat\lambda) + \ldots = 0, \tag{3} $$

where derivatives are evaluated at $(\hat\psi, \hat\lambda)$. Provided that random variables such as $\partial Z_{\psi\lambda}/\partial \psi$ are $O_p(1)$ and quantities such as $i$ are $O(1)$, and noting that $\hat\psi_\lambda - \hat\psi = O_p(1/\sqrt{n})$, the first term of (3) is apparently $O_p(\sqrt{n})$, whereas, if and only if $i_{\psi\lambda} = 0$, the remaining terms are $O_p(1)$ as $\lambda$ varies by an amount that is $O(1/\sqrt{n})$. It follows that the first term is in fact $O_p(1)$, requiring that $\hat\psi_\lambda - \hat\psi$ is $O_p(1/n)$. A similar proof holds if the parameters are not scalars. The argument is of course symmetric in $(\psi, \lambda)$, and we will use also the result $\hat\lambda_\psi - \hat\lambda = O_p(1/n)$ in later sections.

It is easy to see that if $\hat\psi_\lambda = \hat\psi$ for all $\lambda$, then $\lambda$ and $\psi$ are orthogonal parameters. Examples of families for which this holds are discussed in Barndorff-Nielsen (1978), an important class being regular exponential models with $\psi$ as part of the canonical parameter and $\lambda$ as the complementary part of the expectation parameter; see Example 3.2. It would, of course, be possible to have $\hat\psi_\lambda$ functionally independent of $\lambda$ and at the same time for the distribution and, in particular the standard error, of $\hat\psi_\lambda$ to depend strongly on $\lambda$.

Property (iv) is discussed also by Sweeting (1984b) in the context of location-scale models. A numerical illustration is provided in Section 3.5.

Note that from a pair $(\psi, \lambda)$ of orthogonal parameters other pairs could be obtained by suitable transformation. However, in this paper we shall regard $\psi$ as a preassigned parameter of particular relevance.
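Property (iv) can be seen numerically in a small case. The following is a minimal sketch, assuming two exponential samples whose means are written either as $(\phi, \psi\phi)$ or, in an orthogonal form, as $(\lambda\psi^{-1/2}, \lambda\psi^{1/2})$ (the model treated in Section 3.1). The sample means and the 10% shift in $\psi$ are hypothetical values chosen only for illustration.

```python
import math

def phi_hat_given_psi(psi, ybar1, ybar2):
    # Means (phi, psi*phi): maximizing the likelihood over phi for fixed psi
    # gives this closed form.
    return 0.5 * (ybar1 + ybar2 / psi)

def lambda_hat_given_psi(psi, ybar1, ybar2):
    # Means (lam/sqrt(psi), lam*sqrt(psi)): lam is orthogonal to psi.
    return 0.5 * (math.sqrt(psi) * ybar1 + ybar2 / math.sqrt(psi))

ybar1, ybar2 = 1.0, 2.0        # hypothetical sample means
psi_hat = ybar2 / ybar1        # overall maximum likelihood estimate of psi
psi_shifted = 1.1 * psi_hat    # move psi 10% away from its estimate

# Relative change in the constrained estimate of the nuisance parameter.
rel_change_phi = (phi_hat_given_psi(psi_shifted, ybar1, ybar2)
                  / phi_hat_given_psi(psi_hat, ybar1, ybar2) - 1.0)
rel_change_lambda = (lambda_hat_given_psi(psi_shifted, ybar1, ybar2)
                     / lambda_hat_given_psi(psi_hat, ybar1, ybar2) - 1.0)

print(rel_change_phi, rel_change_lambda)
```

In this sketch the non-orthogonal estimate moves to first order in the shift, while the orthogonal one moves only to second order, which is the behaviour that (iv) describes.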

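2.3. Construction of Orthogonal Parameters

As noted above, it is not in general possible to find a totally orthogonal parametrization. We now discuss the special case in which a scalar parameter $\psi$ is orthogonal to the remaining parameters $\lambda_1, \ldots, \lambda_q$. Typically $\psi$ will be the parameter of interest and $\lambda_1, \ldots, \lambda_q$ will be nuisance parameters, although it is possible that $\psi$ is the nuisance parameter and one or all of the components of $\lambda$ are the parameters of interest; see Example 3.5 below. In the notation of equation (1), $\theta_1 = \psi$, $\theta_2 = (\lambda_1, \ldots, \lambda_q)$.

The following argument generalizes Jeffreys (1961, p. 208) and Huzurbazar (1950); see also Amari (1985, p. 254). Suppose that initially the likelihood is specified in terms of $(\psi, \phi_1, \ldots, \phi_q)$. We then write $\phi_1 = \phi_1(\psi, \lambda), \ldots, \phi_q = \phi_q(\psi, \lambda)$, where $\lambda = (\lambda_1, \ldots, \lambda_q)$, and

$$ l(\psi, \lambda) = l^*\{\psi, \phi_1(\psi, \lambda), \ldots, \phi_q(\psi, \lambda)\}, $$

regarding $l^*$ as a function of $(\psi, \phi_1, \ldots, \phi_q)$. Then

$$ \frac{\partial l}{\partial \psi} = \frac{\partial l^*}{\partial \psi} + \sum_r \frac{\partial l^*}{\partial \phi_r} \frac{\partial \phi_r}{\partial \psi}, $$

$$ \frac{\partial^2 l}{\partial \psi \, \partial \lambda_t} = \sum_s \frac{\partial^2 l^*}{\partial \psi \, \partial \phi_s} \frac{\partial \phi_s}{\partial \lambda_t} + \sum_r \sum_s \frac{\partial^2 l^*}{\partial \phi_r \, \partial \phi_s} \frac{\partial \phi_s}{\partial \lambda_t} \frac{\partial \phi_r}{\partial \psi} + \sum_r \frac{\partial l^*}{\partial \phi_r} \frac{\partial^2 \phi_r}{\partial \psi \, \partial \lambda_t}. $$

On taking expectations the last term in the second derivative vanishes, so that the orthogonality equations are

$$ \sum_s \frac{\partial \phi_s}{\partial \lambda_t} \left( i^*_{\psi \phi_s} + \sum_r i^*_{\phi_r \phi_s} \frac{\partial \phi_r}{\partial \psi} \right) = 0, \qquad t = 1, \ldots, q, $$

where $i^*$ are the information measures calculated in the $(\psi, \phi)$ parametrization. We require that the transformation from $(\psi, \phi)$ to $(\psi, \lambda)$ have nonzero Jacobian; hence

$$ \sum_r i^*_{\phi_s \phi_r} \frac{\partial \phi_r}{\partial \psi} = -i^*_{\psi \phi_s}, \qquad s = 1, \ldots, q. \tag{4} $$

These partial differential equations determine the dependence of $\phi$ on $\psi$, but there is considerable arbitrariness in the dependence of $\phi$ on $\lambda$; see the examples. It is often convenient to take $\phi_1$ to depend only on $(\psi, \lambda_1)$, $\phi_2$ to depend on $(\psi, \lambda_1, \lambda_2)$, etc., and to aim to give the $\lambda_i$ meaningful interpretation in the context of the particular problem.

From (4) it is clear why in general we cannot obtain global orthogonality when $\psi$ is not a scalar. If $\psi = (\psi_1, \psi_2)$, we can use (4) independently to calculate $\partial \phi/\partial \psi_1$ and $\partial \phi/\partial \psi_2$, and there is in general no guarantee that the compatibility condition $\partial^2 \phi/\partial \psi_1 \partial \psi_2 = \partial^2 \phi/\partial \psi_2 \partial \psi_1$ is satisfied.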
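Whether a candidate reparametrization is orthogonal can be checked numerically from definition (1). The following is a minimal sketch for two independent exponential observations, for which the information about a mean $m$ is $1/m^2$; the cross information in any parametrization then follows from the chain rule. The function names, the finite-difference step and the evaluation points are assumptions made for illustration. The two models are the ratio and the difference of exponential means treated in Section 3.1.

```python
import math

def cross_information(means, psi, nuis, h=1e-6):
    """Expected cross information between psi and the nuisance parameter
    for two independent exponentials with the given mean function.
    Uses i = sum_k (dm_k/dpsi)(dm_k/dnuis) / m_k**2 with central differences."""
    m = means(psi, nuis)
    up, down = means(psi + h, nuis), means(psi - h, nuis)
    d_psi = [(a - b) / (2 * h) for a, b in zip(up, down)]
    up, down = means(psi, nuis + h), means(psi, nuis - h)
    d_nuis = [(a - b) / (2 * h) for a, b in zip(up, down)]
    return sum(dp * dn / mk ** 2 for dp, dn, mk in zip(d_psi, d_nuis, m))

def ratio_original(psi, phi):
    return (phi, psi * phi)

def ratio_orthogonal(psi, lam):
    # phi * psi**0.5 = lam solves equation (4) for the ratio model
    return (lam / math.sqrt(psi), lam * math.sqrt(psi))

def difference_original(psi, phi):
    return (phi, phi + psi)

def difference_orthogonal(psi, lam):
    # phi*(phi + psi)/(psi + 2*phi) = lam, i.e. 1/phi + 1/(phi + psi) = 1/lam;
    # take the positive root of the resulting quadratic in phi.
    c = 1.0 / lam
    phi = ((2.0 - c * psi) + math.sqrt(c * c * psi * psi + 4.0)) / (2.0 * c)
    return (phi, phi + psi)

for name, f, args in [("ratio, (psi, phi)", ratio_original, (2.0, 3.0)),
                      ("ratio, (psi, lambda)", ratio_orthogonal, (2.0, 3.0)),
                      ("difference, (psi, phi)", difference_original, (1.0, 2.0)),
                      ("difference, (psi, lambda)", difference_orthogonal, (1.0, 1.0))]:
    print(name, cross_information(f, *args))
```

The cross information is numerically zero in the two $(\psi, \lambda)$ parametrizations and nonzero in the original ones, as equation (4) requires.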


3. EXAMPLES

3.1. Exponential Distribution

Let $Y_1$ and $Y_2$ be exponential random variables with means $\phi$ and $\psi\phi$ respectively; the parameter of interest $\psi$ is the ratio of the means. The differential equation corresponding to (4) is

$$ \frac{2}{\phi^2} \frac{\partial \phi}{\partial \psi} = -\frac{1}{\psi\phi}, $$

with solution $\phi \psi^{1/2} = a(\lambda)$, where $a(\lambda)$ is an arbitrary function of $\lambda$. A convenient choice is $a(\lambda) = \lambda$; in the new parametrization $Y_1$ and $Y_2$ have means $\lambda \psi^{-1/2}$ and $\lambda \psi^{1/2}$, respectively. Note that for $n$ independent replications of $(Y_1, Y_2)$,

$$ \hat\lambda_\psi = \tfrac{1}{2} (\psi^{1/2} \bar{Y}_1 + \psi^{-1/2} \bar{Y}_2), \qquad \hat\lambda = (\bar{Y}_1 \bar{Y}_2)^{1/2}, $$

and $\hat\lambda_\psi$ has a distribution depending only on $\lambda$.

The extension to exponential regression has $Y_1, \ldots, Y_n$ independent exponential random variables with $E Y_i = \lambda \exp(-\psi z_i)$, where $z_i$ are given constants. Requiring $\Sigma z_i = 0$ ensures that $\lambda$ and $\psi$ are orthogonal. If we add on another explanatory variable to give $E Y_i = \lambda \exp(-\psi z_i - \beta x_i)$ and also require $\Sigma x_i = 0$, then $\lambda$ and $\psi$ are still orthogonal, as are $\lambda$ and $\beta$. Assuming $\psi$ is still the parameter of interest, we need the orthogonal expression of the nuisance parameter $\beta$ with respect to $\psi$. This is obtained by subtracting from $z_i$ its regression on $x_i$, giving, in the new parametrization,

$$ E Y_i = \lambda \exp[-\psi\{z_i - x_i (S_{xz}/S_{xx})\} - \beta x_i], $$

where $S_{xz} = \Sigma x_i z_i$ and $S_{xx} = \Sigma x_i^2$.

A different version of the two-sample problem concerns inference about the difference between two exponential means. Let $Y_1$, $Y_2$ be independent exponential random variables with means $\phi$ and $(\phi + \psi)$ respectively. The differential equation (4) gives

$$ \left\{ \frac{1}{\phi^2} + \frac{1}{(\phi + \psi)^2} \right\} \frac{\partial \phi}{\partial \psi} = -\frac{1}{(\phi + \psi)^2}; $$

this can be solved by separation of variables, leading to $a(\lambda) = \phi(\phi + \psi)/(\psi + 2\phi)$, where again $a(\lambda)$ is an arbitrary function of $\lambda$. In most of our examples we choose $a(\lambda) = \lambda$; in this example $a(\lambda) = e^\lambda$ might be more suitable.

3.2. Regular Exponential Families

Write $f(y; \theta) = \exp\{\theta_1 t_1 + \theta_2 t_2 - c(\theta) - d(y)\}$, where $(\theta_1, \theta_2)$ are components of the canonical parameter and $\{t_1(y), t_2(y)\}$ are the corresponding components of the sufficient statistic. Let $\eta = (\eta_1, \eta_2) = (E t_1, E t_2)$ be the expectation parameter. It is easy to verify directly, and is implicit in Amari (1982) and Barndorff-Nielsen (1983), that $\theta_1$ is orthogonal to $\eta_2$ and $\theta_2$ is orthogonal to $\eta_1$.

As a simple example the normal distribution with mean $\mu$ and variance $\tau$ has canonical parameter $(\mu/\tau, -1/(2\tau))$ and expectation parameter $(\mu, \mu^2 + \tau)$. Thus $\mu$ is orthogonal to $-1/(2\tau)$, hence to $\tau$, and $\mu/\tau$ is orthogonal to $\mu^2 + \tau$. The normal distribution will be studied separately as Example 3.3.

Another example is the gamma distribution with shape parameter $\psi$ and scale parameter $\phi$,

$$ f(y; \psi, \phi) = \phi^{-\psi} y^{\psi - 1} \exp(-y/\phi)/\Gamma(\psi). $$

The canonical parameter is $(-1/\phi, \psi)$ corresponding to $(y, \log y)$ and we have immediately that $\lambda = \psi\phi = EY$ is orthogonal to $\psi$. The new parametrization is

$$ f(y; \psi, \lambda) = (\psi/\lambda)^{\psi} y^{\psi - 1} \exp(-\psi y/\lambda)/\Gamma(\psi). $$

In this example $\hat\lambda_\psi = \bar{y}$ does not depend on $\psi$, although its distribution does.

These results are discussed in Barndorff-Nielsen (1978, p. 184), where he also shows that the dispersion parameter of a generalized linear model is orthogonal to the expectation parameter.

3.3. Normal Distribution

As noted above, the $(\mu, \tau)$ parametrization of the normal distribution is orthogonal. Note that $\hat\mu_\tau = \bar{y}$ does not depend on the nuisance parameter $\tau$ whereas $\hat\tau_\mu = n^{-1}\{\Sigma (y_i - \bar{y})^2 + n(\bar{y} - \mu)^2\}$ differs from $\hat\tau$ by $O_p(n^{-1})$. In the regression setting, the variance $\tau$ is orthogonal to the regression coefficients $\beta$; if the components of $\beta$ are to be orthogonal to each other the design matrix must be orthogonalized, as in the exponential regression example.

More generally, when $Y$ has a multivariate normal distribution with mean vector $X\beta$ and covariance matrix $V(\psi)$, then $\beta$ and $\psi$ are orthogonal, so long as they are functionally unrelated. This generalisation includes, in particular, components of variance models (Patterson and Thompson, 1971).

As an example of nonorthogonal parameters take $\tau$ and $\psi = (\mu - a)/\tau^{1/2}$, the latter determining the probability of an observation falling below the fixed tolerance level $a$. Then $\hat\psi_\tau = (\bar{y} - a)/\tau^{1/2}$ differs from $\hat\psi$ by $O_p(1/\sqrt{n})$. The parameter $\lambda$ that is orthogonal to $\psi$ is an arbitrary function of $(\psi^2 + 2)\tau$.
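The claim about the tolerance-level parameter can be checked numerically. The sketch below assumes the per-observation normal information in $(\mu, \tau)$, namely $\mathrm{diag}\{1/\tau, 1/(2\tau^2)\}$, and transforms it to a new parametrization with a finite-difference Jacobian. The tolerance level `A`, the step size and the evaluation point are hypothetical choices.

```python
import math

A = 1.0  # hypothetical fixed tolerance level a

def cross_info_normal(to_mu_tau, psi, nuis, h=1e-6):
    """Cross information between psi and a nuisance parameter for a normal
    observation, obtained from diag(1/tau, 1/(2 tau^2)) in (mu, tau) by the
    chain rule, with partial derivatives taken by central differences."""
    mu, tau = to_mu_tau(psi, nuis)
    up, down = to_mu_tau(psi + h, nuis), to_mu_tau(psi - h, nuis)
    dmu_dpsi, dtau_dpsi = [(a - b) / (2 * h) for a, b in zip(up, down)]
    up, down = to_mu_tau(psi, nuis + h), to_mu_tau(psi, nuis - h)
    dmu_dnuis, dtau_dnuis = [(a - b) / (2 * h) for a, b in zip(up, down)]
    return dmu_dpsi * dmu_dnuis / tau + dtau_dpsi * dtau_dnuis / (2 * tau ** 2)

def psi_tau(psi, tau):
    # nonorthogonal pair: psi = (mu - a)/sqrt(tau), nuisance parameter tau
    return (A + psi * math.sqrt(tau), tau)

def psi_lambda(psi, lam):
    # candidate orthogonal pair with lam = (psi^2 + 2) * tau
    tau = lam / (psi ** 2 + 2.0)
    return (A + psi * math.sqrt(tau), tau)

print(cross_info_normal(psi_tau, 0.7, 3.0))     # close to psi/(2 tau)
print(cross_info_normal(psi_lambda, 0.7, 3.0))  # close to zero
```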

3.4. Weibull Distribution

We take the index of the Weibull distribution as the parameter $\psi$, writing

$$ f(y; \psi, \phi) = (\psi/\phi)(y/\phi)^{\psi - 1} \exp\{-(y/\phi)^{\psi}\}. $$

Then $i_{\phi\phi} = (\psi/\phi)^2$, $i_{\psi\phi} = -\Gamma'(2)/\phi$, and the orthogonal nuisance parameter is $\lambda = \phi \exp\{\Gamma'(2)/\psi\}$. The survivor function in the new parametrization is

$$ 1 - F(y) = \exp\{-(y/\lambda)^{\psi} \exp(\Gamma'(2))\}. $$

The value of $\Gamma'(2)$ is $1 - \gamma$, where $\gamma = 0.577215\ldots$ is Euler's constant, so $1 - F(\lambda) \simeq 0.22$.

In practice it may be of more interest to estimate the rate parameter, treating $\psi$ as a nuisance parameter. A statistical interpretation of the above parametrization is that maximum likelihood estimation of the 80th percentile of the distribution depends very little on $\psi$; in particular $\hat\lambda$ will be nearly the same whether we assume an exponential distribution ($\psi = 1$), or estimate both parameters by maximum likelihood, provided that the true value is not very different from 1. Thus the maximum likelihood estimate of this percentile is in a rather special sense robust. This interpretation of orthogonality is discussed in more detail in the context of the normal transformation model.
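The constants quoted for the Weibull example can be reproduced with the standard library alone. This is a minimal sketch in which $\Gamma'(2)$ is approximated by a central difference of `math.gamma`; the step size and the values passed to the helper function are assumptions made for illustration.

```python
import math

# Gamma'(2) by central difference; the exact value is 1 - (Euler's constant).
h = 1e-5
dgamma_at_2 = (math.gamma(2 + h) - math.gamma(2 - h)) / (2 * h)

# Survivor function evaluated at y = lambda: exp{-exp(Gamma'(2))},
# which does not depend on the index psi.
survivor_at_lambda = math.exp(-math.exp(dgamma_at_2))

def orthogonal_scale(phi, psi):
    # lambda = phi * exp(Gamma'(2)/psi), the orthogonal nuisance parameter
    return phi * math.exp(dgamma_at_2 / psi)

print(dgamma_at_2, survivor_at_lambda)
print(orthogonal_scale(1.0, 1.0), orthogonal_scale(1.0, 2.0))
```

Since the survivor function at $\lambda$ is about 0.22 whatever the index, $\lambda$ sits near the upper quintile of the distribution, which is the sense in which it is described as close to the 80th percentile.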

3.5. Normal Transformation Model

We assume that for some non-zero $\psi$, $Y^{\psi}$ has a normal distribution with mean $\mu$, variance $\tau$. (The case $\psi = 0$ will be taken to correspond to $\log Y$.) The usual formulation of this model involves $(Y^{\psi} - 1)/\psi$ (Box and Cox, 1964) but the argument for that family is essentially the same. Although in practice interest will usually focus on the mean, and possibly the variance, for the present we look for a reparametrization of $\mu$ and $\tau$ to make them orthogonal to the transformation parameter $\psi$. In this model it is necessary that $Y$ be non-negative; this could be achieved by truncation but we will assume that the variance is sufficiently small relative to the mean that nonpositive observations have negligible probability.

To extend the argument more easily to the regression setting we change the notation for $\mu$ and $\tau$ to $\phi_1$ and $\phi_0$, respectively. The $\phi$ part of the information matrix, $i_{\phi\phi}$, is orthogonal, as in Example 3.3, but $i_{\psi\phi_1}$ and $i_{\psi\phi_0}$ can only be evaluated approximately, using

$$ E(Y^{\psi} \log Y) = \phi_1 \log \phi_1/\psi + O(\phi_0), \qquad E\{Y^{\psi} \log Y \, (Y^{\psi} - \phi_1)\} = \phi_0 (1 + \log \phi_1)/\psi + O(\phi_0^2). $$

The pair of differential equations to be solved is, approximately,

$$ \frac{1}{\phi_0} \frac{\partial \phi_1}{\partial \psi} = \frac{\phi_1 \log \phi_1}{\psi \phi_0}, \qquad \frac{1}{2\phi_0^2} \frac{\partial \phi_0}{\partial \psi} = \frac{1 + \log \phi_1}{\psi \phi_0}. $$

From the first equation $\phi_1 = \exp\{a(\lambda_1, \lambda_0)\psi\}$, and from the second equation $\phi_0/\phi_1^2 = \psi^2 b(\lambda_1, \lambda_0)$, where $a$ and $b$ are arbitrary functions of $(\lambda_1, \lambda_0)$. We choose $a(\lambda_1, \lambda_0) = \log \lambda_1$ and $b(\lambda_1, \lambda_0) = \lambda_0/\lambda_1^2$, so the model is represented in the form

$$ Y^{\psi} \sim N(\lambda_1^{\psi}, \lambda_0 \psi^2 \lambda_1^{2\psi - 2}). $$

Note that if $Y$ has mean $\lambda_1$ and variance $\lambda_0$, then $Y^{\psi}$ has approximately the normal distribution just given; this was the motivation for the particular choice of $a$, $b$ above.

We can use this for a simple numerical illustration of property (iv) of Section 2.2, the stability of maximum likelihood estimates of one parameter as another parameter varies. We have taken the set of 15 systolic blood pressures recorded by Cox and Snell (1981, Table E.1, col. 1). The mean is 176.9 mm Hg and the standard deviation is 20.56 mm Hg. As $\psi$ varies from 2 to $-2$ there is a large change in the estimated means and variances; in fact the means change by a factor of $10^9$. On the other hand the estimated $\lambda_1$ and $\lambda_0$ vary respectively from 178.0 to 173.6 and from 424 to 411, illustrating the considerable stability of the estimates of $\lambda_1$ and $\lambda_0$ with respect to changes in $\psi$.

The extension of this model to the regression setting proceeds as follows. Assume $Y_i^{\psi} \sim N(\Sigma_r x_{ir} \phi_r, \phi_0)$; then

$$ l(\phi, \psi; y) = -\frac{n}{2} \log \phi_0 - \frac{1}{2\phi_0} \sum_i \Big( y_i^{\psi} - \sum_r x_{ir} \phi_r \Big)^2 + n \log \psi + (\psi - 1) \sum_i \log y_i, $$

the last two terms being derived from the Jacobian of the transformation from $y_i^{\psi}$ to $y_i$. The computations are simplified if we assume that the matrix of explanatory variables has been standardized; then $i_{\phi\phi} = \mathrm{diag}\{1/\phi_0, \ldots, 1/\phi_0, 1/(2\phi_0^2)\}$, where the variance component is last. Approximating $E(Y^{\psi} \log Y)$ as before, we have

$$ i_{\psi\phi_0} \simeq -n^{-1} \sum_i \Big\{ \log\Big(\sum_s x_{is}\phi_s\Big) + 1 \Big\} \Big/ (\psi\phi_0), \qquad i_{\psi\phi_r} \simeq -n^{-1} \sum_i \Big(\sum_s x_{is}\phi_s\Big) \log\Big(\sum_s x_{is}\phi_s\Big) x_{ir} \Big/ (\psi\phi_0). $$

A further simplification is to take $x_{i1} = 1$, $\Sigma_i x_{ir} = 0$ for $r \geq 2$, so that $\phi_1$ is an overall mean. Assuming other effects to be relatively small, we have

$$ \log\Big(\sum_s x_{is}\phi_s\Big) = \log \phi_1 + \sum_{s=2}^{q} x_{is}\phi_s/\phi_1 + O(\phi_s^2), $$

giving approximately

$$ i_{\psi\phi_0} = -(1 + \log \phi_1)/(\psi\phi_0), \qquad i_{\psi\phi_1} = -\phi_1 \log \phi_1/(\psi\phi_0), \qquad i_{\psi\phi_r} = -\phi_r (1 + \log \phi_1)/(\psi\phi_0), \quad r = 2, \ldots, q. $$

One solution of the set of equations (4) gives

$$ \phi_1 = \lambda_1^{\psi}, \qquad \phi_r = \psi \lambda_1^{\psi} \lambda_r \quad (r = 2, \ldots, q), \qquad \phi_0 = \lambda_1^{2\psi - 2} \psi^2 \lambda_0. $$

The orthogonal expression of the model is thus

$$ Y_i^{\psi} \sim N\Big( \lambda_1^{\psi} + \psi \lambda_1^{\psi} \sum_{s=2}^{q} x_{is}\lambda_s, \; \lambda_1^{2\psi - 2} \psi^2 \lambda_0 \Big). \tag{5} $$

Note that $\lambda_1$ and $\lambda_0^{1/2}$ have the dimensions of $Y$ and $\lambda_2, \ldots, \lambda_q$ are dimensionless. Analysis of this model will be discussed in Section 5.

To discuss the statistical interpretation of these results we take a slightly broader setting. Suppose we have a model $f(y; \phi)$ involving an unknown parameter $\phi$ of interest and the model is enriched by a nuisance parameter $\psi$ in order to produce a more realistic model. One possibility is that there is a second model $g(y; \phi)$ and that $\psi$ indexes the exponential mixtures with density proportional to

$$ \{f(y; \phi)\}^{\psi} \{g(y; \phi)\}^{1 - \psi}. $$

We concentrate on estimating $\phi$, treating $\psi$ as essentially totally unknown. For this problem to have a clear meaning, $\phi$ should be defined so as to have an interpretation in some sense independent of $\psi$.

In some problems the components of $\phi$ may have a descriptive interpretation that is unaffected by the value of $\psi$; two examples are the components of the mean response vector and regression coefficients on some fixed scale. Then direct comparison of estimates of $\phi$ from different analyses is possible, even if different values of $\psi$ are used. In general such an interpretation is not available, and then a basis for comparison can be provided by expressing $\phi = \phi(\psi, \lambda)$, choosing $\lambda$ to be orthogonal to $\psi$. Estimates of $\phi$ for different values of $\psi$ can be compared via conversion to the corresponding estimate of $\lambda$. In particular, we might consider

(i) the overall maximum likelihood estimates $(\hat\psi, \hat\phi)$;
(ii) the maximum likelihood estimate of $\phi$ at $\psi = \psi_0$, say $\hat\phi_0$;
(iii) the maximum likelihood estimate of $\phi$ at some other, possibly data dependent, value $\tilde\psi$, say $\tilde\phi$.

By the parameter orthogonality, we have that $\hat\phi$, $\hat\phi_0$, and $\tilde\phi$ are approximately equivalent in the sense that if

$$ \hat\phi = \phi(\hat\psi, \hat\lambda), \qquad \hat\phi_0 = \phi(\psi_0, \hat\lambda_0), \qquad \tilde\phi = \phi(\tilde\psi, \tilde\lambda), $$

then $\hat\lambda$, $\hat\lambda_0$ and $\tilde\lambda$ are exactly or nearly the same. Whatever the choice of $\psi$ we would have reached nearly the same inference about $\phi$, after re-expressing the two estimates on the same $\psi$ scale.

For the normal transformation model the orthogonal parameters are the components of the mean vector and the variance for the untransformed observations: the above argument says that inference on two different $\psi$ scales should be compared via transformation to these parameters. Hinkley and Runger (1984) make essentially the same argument from a slightly different point of view: they rescale the observations in order that the maximum likelihood estimates of the regression coefficients $\beta$ do not depend strongly on the transformation parameter $\psi$. By property (iv) of Section 2.2 this implies that $\beta$ and $\psi$ are approximately orthogonal.
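The stability of the orthogonalized estimates in the normal transformation model can be illustrated with a short computation. This is a minimal sketch using the one-sample relations $\phi_1 = \lambda_1^{\psi}$ and $\phi_0 = \lambda_1^{2\psi-2}\psi^2\lambda_0$ from Section 3.5. The data below are synthetic values invented for illustration; they are not the Cox and Snell blood pressure data, so the numbers will not reproduce those quoted in the text.

```python
def estimates_on_scale(y, psi):
    """For fixed nonzero psi, return the maximum likelihood estimates
    (phi1, phi0) of the mean and variance of y**psi, together with the
    orthogonalized values (lam1, lam0) obtained by inverting
    phi1 = lam1**psi and phi0 = lam1**(2*psi - 2) * psi**2 * lam0."""
    z = [v ** psi for v in y]
    n = len(z)
    phi1 = sum(z) / n
    phi0 = sum((v - phi1) ** 2 for v in z) / n
    lam1 = phi1 ** (1.0 / psi)
    lam0 = phi0 / (psi ** 2 * lam1 ** (2 * psi - 2))
    return phi1, phi0, lam1, lam0

# Synthetic positive observations (hypothetical, for illustration only).
y = [150, 155, 160, 165, 170, 172, 175, 177, 180, 183, 186, 190, 195, 200, 210]

results = {psi: estimates_on_scale(y, psi) for psi in (-2.0, -1.0, 1.0, 2.0)}
for psi, (phi1, phi0, lam1, lam0) in results.items():
    print(psi, phi1, phi0, lam1, lam0)
```

On the transformed scale the estimated mean changes by many orders of magnitude as $\psi$ moves from $-2$ to 2, while the converted values of $\lambda_1$ and $\lambda_0$ stay close to the mean and variance of the untransformed observations.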


4. APPLICATION TO CONDITIONAL INFERENCE

4.1. Introduction

Conditioning plays at least two roles in the sampling theory of statistical inference; one to induce relevance of the probability calculations to the particular data under analysis, and the other to eliminate or reduce the effect of nuisance parameters. We concentrate here on the latter.

We deal only with problems in which the parameter $\psi$ of interest is a scalar. This is a nontrivial restriction, although it may be argued that at each stage of interpretation attention can often profitably be focussed on a single parameter describing one aspect of the system under study.

Suppose then that the nuisance parameters $\lambda_1, \ldots, \lambda_q$ have been defined to be orthogonal to $\psi$, as described in Section 2.3. Confidence intervals for $\psi$, the usual objective, are approached via consideration of tests of the null hypothesis $\psi = \psi_0$, where $\psi_0$ is a fixed but arbitrary value of $\psi$. An important general procedure for testing $\psi = \psi_0$ is based on the generalized likelihood ratio statistic

$$ w(\psi_0) = 2\{l(\hat\psi, \hat\lambda) - l(\psi_0, \hat\lambda_{\psi_0})\}, \tag{6} $$

treated as having an asymptotic chi-squared distribution with one degree of freedom, when $\psi = \psi_0$. The approximation to the null distribution can be improved by dividing by a suitable constant, the Bartlett adjustment (Barndorff-Nielsen and Cox, 1984) or, if equi-tailed tests are desired, by an adjustment for skewness (McCullagh, 1984; Barndorff-Nielsen, 1986). To obtain confidence intervals, it is useful to consider (6) as a function of $\psi_0$; the term $l(\psi_0, \hat\lambda_{\psi_0})$ in (6) is the log-profile likelihood function.

In simple cases the problem can be reduced to one without nuisance parameters. If for each fixed $\psi_0$ there is a complete sufficient statistic for $\lambda$, the likelihood ratio statistic (6) can be constructed from the conditional distribution of the observations given this statistic (Bartlett, 1937; Cox and Hinkley, 1974, p. 134). If the conditional distribution is free of $\lambda$, even when $\psi \neq \psi_0$, then the problem has been reduced to a one-parameter problem, and the optimality of (6) for such problems now holds among asymptotically equivalent procedures not depending on $\lambda$. Unfortunately, this approach typically only works in important but rather special problems with regular exponential families, with $\psi$ a component of the canonical parameter.

We now explore the extension of the conditional approach to more general problems. We will condition on the observed value of $\hat\lambda_{\psi_0}$, the maximum likelihood estimate of $\lambda$ given $\psi = \psi_0$. Because $\lambda$ is required to be orthogonal to $\psi$, the dependence of $\hat\lambda_{\psi_0}$ on $\psi_0$ is reduced. The resulting likelihood is closely related to Barndorff-Nielsen's modified profile likelihood (Barndorff-Nielsen, 1983, p. 351), especially when his approximation to the distribution of the maximum likelihood estimator is used. There are also connections with a long chain of work on conditional and marginal inference (Bartlett, 1936, 1937; Kalbfleisch and Sprott, 1970; Patterson and Thompson, 1971; Godambe and Thompson, 1974; Godambe, 1976; Lindsay, 1982). Note that for those normal theory problems in which the conditioning statistics are linear, conditional and marginal inference are equivalent.

In full exponential families the usual approach is to condition on the components of the sufficient statistic that correspond to the nuisance parameters. These of course are just the maximum likelihood estimates of the expectation parameters, which are orthogonal to the canonical parameters.

We wish to derive a conditional profile likelihood for $\psi$ using $\hat\lambda_{\psi_0}$ as the conditioning statistic. We write $\hat\lambda_0$ when no possibility of confusion exists. Transform $y$ to $(\hat\lambda_0, h)$, where $h$ is any convenient function of the observations, and write $J(\hat\lambda_0)$ for the Jacobian of the transformation. The conditional density given $\hat\lambda_0$ is then

$$ \frac{f_Y(y; \psi, \lambda) \, J(\hat\lambda_0)}{f_{\hat\Lambda_0}(\hat\lambda_0; \psi, \lambda)}, $$

where the denominator is the marginal density of $\hat\lambda_0$. This leads to a conditional version of the likelihood ratio statistic (6), still a function of $\lambda$, in the form

$$ 2\Big[ \sup_{\psi} \{l_Y(\psi, \lambda) - l_{\hat\Lambda_0}(\psi, \lambda)\} - \{l_Y(\psi_0, \lambda) - l_{\hat\Lambda_0}(\psi_0, \lambda)\} \Big]. $$

Note that the Jacobian $J(\hat\lambda_0)$ no longer appears, the precise choice of $h$ is irrelevant, and the answer is invariant under one-to-one transformations of $\lambda$. Finally replace $\lambda$ by $\hat\lambda_0$ to get the conditional profile likelihood

$$ w_c^*(\psi_0) = 2\Big[ \sup_{\psi} \{l_Y(\psi, \hat\lambda_0) - l_{\hat\Lambda_0}(\psi, \hat\lambda_0)\} - \{l_Y(\psi_0, \hat\lambda_0) - l_{\hat\Lambda_0}(\psi_0, \hat\lambda_0)\} \Big]. \tag{7} $$

To calculate this expression it is necessary to compute the marginal distribution of $\hat\lambda_0$ for values of $\psi$ different from $\psi_0$ and this typically involves a noncentral distribution. An alternative statistic can be derived by conditioning on $\hat\lambda_\psi$ rather than $\hat\lambda_0$ in the first term in (7), leading to

$$ 2\Big[ \sup_{\psi} \{l_Y(\psi, \hat\lambda_\psi) - l_{\hat\Lambda_\psi}(\psi, \hat\lambda_\psi) + \log\det J(\hat\lambda_\psi) - \log\det J(\hat\lambda_0)\} - \{l_Y(\psi_0, \hat\lambda_0) - l_{\hat\Lambda_0}(\psi_0, \hat\lambda_0)\} \Big], $$

which is frequently much easier to calculate exactly or approximately. The Jacobian term is $\log\det(\partial\hat\lambda_\psi/\partial\hat\lambda_0)$ which (because $\psi$ and $\lambda$ are orthogonal) is $O_p(1/n)$. We shall therefore ignore this term in what follows, defining

$$ w_c(\psi_0) = 2\Big[ \sup_{\psi} \{l_Y(\psi, \hat\lambda_\psi) - l_{\hat\Lambda_\psi}(\psi, \hat\lambda_\psi)\} - \{l_Y(\psi_0, \hat\lambda_0) - l_{\hat\Lambda_0}(\psi_0, \hat\lambda_0)\} \Big]. \tag{8} $$

A further advantage of (8) is that the first half of the formula does not depend on $\psi_0$. A disadvantage of (8) is its non-invariance under transformation of $\lambda$, although this non-invariance has been reduced by using the orthogonal parametrization. It is perhaps best to regard $\lambda$ as defined conceptually in some reference parametrization. A curious feature is that in $w_c$ two different conditioning events are used, although again the orthogonal parametrization has reduced the difference between them. The same feature arises in the discussion of locally most powerful similar tests; see Cox and Hinkley (1974, p. 146).

Applying the formula in Barndorff-Nielsen (1983) for the marginal distribution of $\hat\lambda_\psi$ under $\psi$ and of $\hat\lambda_0$ under $\psi_0$, we have the further approximation

$$ \tilde{w}_c(\psi_0) = 2\Big( \sup_{\psi} \big[ l_Y(\psi, \hat\lambda_\psi) - \tfrac{1}{2} \log\det\{n j_{\lambda\lambda}(\psi, \hat\lambda_\psi)\} \big] - \big[ l_Y(\psi_0, \hat\lambda_0) - \tfrac{1}{2} \log\det\{n j_{\lambda\lambda}(\psi_0, \hat\lambda_0)\} \big] \Big). \tag{9} $$

In (9) $j_{\lambda\lambda}$ is the per observation observed information matrix for the $\lambda$ components. Equation (9) implies that we can regard the effect of conditioning as modifying the objective function for computing the profile likelihood from $l_Y(\psi, \hat\lambda_\psi)$ to

$$ l_Y(\psi, \hat\lambda_\psi) - \tfrac{1}{2} \log\det\{n j_{\lambda\lambda}(\psi, \hat\lambda_\psi)\}. \tag{10} $$

The effect of the second term is to penalize values of $\psi$ for which the information about $\lambda$ is relatively large. It can be shown that the value $\hat\psi_c$ at which the supremum of (9) or (8) is achieved satisfies $\hat\psi_c - \hat\psi = O_p(1/n)$, so that, for some purposes, we write instead of (9),

$$ 2\{l_Y(\hat\psi, \hat\lambda) - l_Y(\psi_0, \hat\lambda_0)\} - \log\det\{n j_{\lambda\lambda}(\hat\psi, \hat\lambda)\} + \log\det\{n j_{\lambda\lambda}(\psi_0, \hat\lambda_0)\}. \tag{11} $$


Note that the term $\det(n j_{\lambda\lambda})$ can be computed as the product of the information determinant in the $(\phi, \psi)$ parametrization and the square of the determinant of the transformation matrix from $(\phi, \psi)$ to $(\lambda, \psi)$.

There is a complication in the derivation of (9) to (11) in that Barndorff-Nielsen's formula for the distribution of the maximum likelihood estimator in general requires conditioning on appropriate ancillary statistics. In the special case when no ancillary is needed for fixed $\psi$ the above argument applies directly. The same holds true if the ancillary statistic does not depend on $\psi$. These two possibilities cover many common cases, and all the examples in this paper. Otherwise there is an additional term in the approximate density and hence in (9) to (11), arising from the log-likelihood ratio of the distribution of the ancillary at $\psi$ and $\psi_0$. It is possible that these ancillary statistics can be approximated by maximum likelihood estimators of the orthogonal parameters, constructed possibly by embedding the model in a suitable exponential family. This would imply that the omitted terms are $O_p(1/n)$. We have not, however, explored this in detail.

The difference of (9) from Barndorff-Nielsen's modified profile likelihood is the use of orthogonal parameters, which allows us to ignore the term $|\partial\hat\lambda/\partial\hat\lambda_\psi|$. Parameter orthogonality is also essential in the asymptotic expansion of $\tilde w_c$ in Section 4.3. Although the factor $|\partial\hat\lambda/\partial\hat\lambda_\psi|$ may be difficult to compute, its inclusion ensures that the modified profile likelihood is parametrization invariant. In the special case (for example full exponential families) where the double saddlepoint approximation of Barndorff-Nielsen and Cox (1984) can be applied to the conditional density, the approximate conditional profile likelihood and the modified profile likelihood are both equal to this approximation; see Barndorff-Nielsen (1983, p. 353) and Jorgensen and Pedersen (1979, p. 309). For discussion of the modified profile likelihood derived from a marginal or conditional point of view, see Barndorff-Nielsen (1985b).

The expressions (7)-(11) are in decreasing order of preference from an intuitive point of view, although in many applications $\tilde w_c$ is the version most easily implemented. If $\tilde w_c = w_c$ this implies that the ancillary statistic discussed above does not depend on $\psi$, so that Barndorff-Nielsen's formula does give an approximation to the appropriate conditional density.

4.2. Examples
We now discuss a number of examples, to illustrate the implementation of the conditional likelihoods discussed in Section 4.1.

4.2.1. Normal Distribution
We first consider the parameter of interest to be the variance, $\tau$. In this case $w_c^* = w_c$, and the conditional profile likelihood is simply proportional to the $\chi^2_{n-1}$ density of $S/\tau$, where $S = \sum(y_i - \bar y)^2$, and $\hat\tau_c = S/(n-1)$. Both the approximate conditional likelihood and the modified profile likelihood are also proportional to the $\chi^2_{n-1}$ density, as the approximation formula is exact (Barndorff-Nielsen, 1983, Example 3.1). No new considerations arise in replacing the mean with a linear regression; $w_c$, $w_c^*$ and $\tilde w_c$ are now all proportional to the log of the $\chi^2_{n-q}$ density of $S/\tau$, where $S$ is the residual sum of squares after regression on $q$ explanatory variables.

Of more interest for illustrating some of the general points of Section 4.1 is the case where the mean $\mu$ is the parameter of interest. Computation of $w_c$ is fairly straightforward. We reduce by sufficiency to the joint density of $(\bar y, S)$, and transform to the joint distribution of $(\bar y, \hat\tau_\mu)$, with Jacobian $1/n$. The marginal density of $\hat\tau_\mu$ is proportional to a $\chi^2$ density, and the required conditional density is proportional to $\hat\tau_\mu^{-((n/2)-1)}$. This gives $w_c(\mu_0) = (n-2)\log\{1 + n(\bar y - \mu_0)^2/S\}$, a monotone function of the usual t-statistic. Note that the profile likelihood for this problem is $w(\mu_0) = n\log\{1 + n(\bar y - \mu_0)^2/S\}$. Again $w_c$ and $\tilde w_c$ are identical.

Analysis using the conditional distribution given $\hat\tau_{\mu_0}$ leads to the same result, i.e. $w_c^*(\mu_0)$ is a monotone function of the usual t-statistic, but the derivation is somewhat more difficult. The marginal density of $\hat\tau_{\mu_0}$ is noncentral $\chi^2$ with $n$ degrees of freedom and noncentrality parameter $n(\mu - \mu_0)^2/\tau$. The required conditional density is a function of $\mu$ and $\tau$, although it can be shown to lead to a similar test of the null value $\mu_0$ against values $\mu > \mu_0$ for all positive $\tau$ (Cox and Hinkley, 1974, p. 143). This approach can be extended to normal theory regression, although the details are somewhat more complicated.

A normal theory problem where the profile likelihood fails is the problem of weighted means (Neyman and Scott, 1948). Assume $\bar y_j$, $j = 1, \ldots, q$, are independently normally distributed with mean $\mu$ and variance $\tau_j/n_j$. The conditional density of $\bar y_1, \ldots, \bar y_q$, given $\hat\tau_{1\mu}, \ldots, \hat\tau_{q\mu}$, is proportional to $\prod_j \hat\tau_{j\mu}^{-(\frac{1}{2}n_j - 1)}$, where $n_j\hat\tau_{j\mu} = S_j + n_j(\bar y_j - \mu)^2$, and $S_j$ is the residual sum of squares from the $j$th sample. This gives
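$$w_c(\mu_0) = \sum_j (n_j - 2)\log\left[\{S_j + n_j(\bar y_j - \mu_0)^2\}/\{S_j + n_j(\bar y_j - \hat\mu_c)^2\}\right],$$

where $\hat\mu_c$ satisfies

$$\sum_j \frac{(n_j - 2)\,n_j(\bar y_j - \hat\mu_c)}{S_j + n_j(\bar y_j - \hat\mu_c)^2} = 0;$$
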
this is the estimate derived by Bartlett (1936). This solution can again be obtained via the modified profile likelihood by ignoring the term $|\partial\hat\tau/\partial\hat\tau_\mu|$ (Barndorff-Nielsen, 1983, Example 3.6). Since $w_c = \tilde w_c$, the approximate ancillary appearing in Barndorff-Nielsen's discussion of this example does not depend on $\mu$. The above estimating equation is also derived in Cox and Hinkley (1974, p. 147) from a slightly different point of view; see also Lindsay (1982). Note that $w_c$ leads directly to the "correct" answer, whereas the expression for $w_c^*$ involves the product of $q$ noncentral $\chi^2$ densities and is quite complicated.

4.2.2. Exponential Regression
We consider here the regression model with one covariate; $E(Y_i) = \lambda\exp(-\psi z_i)$, where
$\sum z_i = 0$. Then

$$l(\psi, \lambda) = -n\log\lambda - \lambda^{-1}\sum y_i\exp(\psi z_i),$$

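from which $\hat\lambda_\psi = n^{-1}\sum y_i\exp(\psi z_i)$ and $\hat\psi$ satisfies $\sum y_i z_i\exp(\hat\psi z_i) = 0$. The profile log-likelihood ratio evaluated at $\psi_0 = 0$ is $w(0) = 2\{-n\log(\sum (y_i/n)\exp(\hat\psi z_i)) + n\log\bar y\} = -2n(\log\hat\lambda - \log\hat\lambda_0)$, where $\hat\lambda = \hat\lambda_{\hat\psi}$ and $\hat\lambda_0 = \hat\lambda_{\psi_0}$. Both expressions (8) and (9) for the conditional profile likelihood have a one degree of freedom adjustment but lead to the same estimate of $\psi$: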
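$$w_c(0) = \tilde w_c(0) = -2(n-1)\log(\hat\lambda/\hat\lambda_0). \qquad (12)$$

The modified profile likelihood, including the term $|\partial\hat\lambda/\partial\hat\lambda_\psi|$, in this case proportional to a power of $\hat\lambda_\psi$, gives

$$-2(n-2)\log(\hat\lambda/\hat\lambda_0). \qquad (13)$$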
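The quantities in this example are simple enough to compute directly. The following is a minimal sketch, not the authors' code, assuming positive responses `y` and covariates `z` that sum to zero and take both signs. The function names and the bisection solver are illustrative choices. It evaluates $\hat\lambda_\psi$, solves the estimating equation for $\hat\psi$, and returns $w(0)$ together with the adjusted statistics (12) and (13), which differ from $w(0)$ only in the multiplier $n$, $n-1$ or $n-2$.

```python
import math

def lambda_hat(psi, y, z):
    """Constrained MLE of lambda at fixed psi: n^{-1} * sum(y_i * exp(psi * z_i))."""
    return sum(yi * math.exp(psi * zi) for yi, zi in zip(y, z)) / len(y)

def score(psi, y, z):
    """Estimating function sum(y_i * z_i * exp(psi * z_i)); it vanishes at psi_hat."""
    return sum(yi * zi * math.exp(psi * zi) for yi, zi in zip(y, z))

def psi_hat(y, z, tol=1e-12):
    """Solve score(psi) = 0 by bisection.

    Assumes y_i > 0 and z of both signs, so that the score is increasing in psi
    and changes sign.
    """
    lo, hi = -1.0, 1.0
    while score(lo, y, z) > 0.0:   # widen the bracket to the left if needed
        lo *= 2.0
    while score(hi, y, z) < 0.0:   # widen the bracket to the right if needed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid, y, z) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def statistics_at_zero(y, z):
    """Return psi_hat and the statistics w(0), (12) and (13) for testing psi_0 = 0."""
    n = len(y)
    ph = psi_hat(y, z)
    # lambda_hat(0) is the sample mean; the ratio is at most 1 because psi_hat
    # minimizes lambda_hat(psi), so all three statistics are non-negative.
    log_ratio = math.log(lambda_hat(ph, y, z) / lambda_hat(0.0, y, z))
    return {
        "psi_hat": ph,
        "w": -2.0 * n * log_ratio,
        "w_c": -2.0 * (n - 1) * log_ratio,
        "modified": -2.0 * (n - 2) * log_ratio,
    }
```

Since all three statistics are multiples of the same log ratio, they share the same maximizing value of $\psi$ and differ only in how strongly they weigh the evidence against $\psi_0 = 0$.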

To compute $w_c^*$ we need the marginal density of $\hat\lambda_0 = \bar y$ for an arbitrary value of $\psi$; then the conditional density needs to be maximized over $\psi$. Since the marginal density can only be evaluated approximately, it is quite cumbersome to compare the resulting expression for $w_c^*$ to $w(0)$, $w_c(0)$ and (13). A simpler approach is to approximate

$$2[\{l_Y(\psi, \hat\lambda_0) - l_{\hat\lambda_0}(\psi, \hat\lambda_0)\} - \{l_Y(\psi_0, \hat\lambda_0) - l_{\hat\lambda_0}(\psi_0, \hat\lambda_0)\}] \qquad (14)$$

which corresponds to the definition of $w_c^*$ in (7), but with $\hat\psi$ regarded as fixed at $\psi$, so that (14) is a function of $\psi$. The approximation to (14) is, letting $\psi = \delta/\sqrt n$ and writing $m_k$ for $n^{-1}\sum z_i^k y_i$,


4_2_

IVn 6(ml)

__

62

m2 m2

6~3 M3_

64 M4 62 02z 4 m4 + 2 Cz2 n

(15)

The corresponding expansions for the unmaximized analogues of $w$, $w_c$ and $\tilde w_c$ are actually quite straightforward to obtain from the above expressions, substituting $\hat\lambda_\psi$ for $\lambda$ and expanding in terms of $\psi = \delta/\sqrt n$. All three expressions agree with (15) in the leading, $O_p(1)$, term, and differ in the $O_p(1/n)$ terms.


In the two-sample version, letting $z_1 = \cdots = z_{n_1} = -n_2$ and $z_{n_1+1} = \cdots = z_n = n_1$, an exact solution is available. Writing $y_{1.}$ for the first sample total and $y_{..}$ for the combined sample total gives

$$f(y_{1.} \mid \hat\lambda_0 = y_{..}/n;\ \theta, \lambda) = y_{1.}^{n_1-1}(y_{..} - y_{1.})^{n_2-1}\exp\{-(\theta^{n_2} - \theta^{-n_1})\,y_{1.}/\lambda\}/c(y_{..};\ \theta, \lambda),$$

where $\theta = e^{-\psi}$ and $c$ is a normalizing constant. For $\lambda > 0$ this distribution has monotone likelihood ratio and gives a most powerful similar test of the null hypothesis $\theta = 1$ against alternatives $\theta > 1$, for all $\lambda > 0$. The two sample version of (14) can also be obtained directly. Curiously, the same uniformly most powerful similar test can be obtained by approximating this density by conditioning on $\hat\lambda_\psi$ under the alternative and $\hat\lambda_0$ under the null; i.e. by computing the exact version of $w_c$ rather than the exact version of $w_c^*$.
4.2.3. Gamma Distribution

The parameter of interest is taken to be the shape parameter $\psi$; as $\hat\lambda_\psi$ is independent of $\psi$ there is no difference between $w_c$ and $w_c^*$, and $\tilde w_c$ differs from these only in the approximation of the normalizing constant. The methods are best compared via the estimating equations for $\psi$. The profile likelihood gives the following equation for the maximum likelihood estimate:

$$\log\hat\psi - \Gamma'(\hat\psi)/\Gamma(\hat\psi) = \log(y_./n) - n^{-1}\sum\log y_i.$$

The conditional profile likelihood gives

$$\Gamma'(n\hat\psi_c)/\Gamma(n\hat\psi_c) - \Gamma'(\hat\psi_c)/\Gamma(\hat\psi_c) = \log y_. - n^{-1}\sum\log y_i;$$

the comparison of the two is clarified by writing $\Gamma'(n\psi_c)/\Gamma(n\psi_c) \simeq \log(n\psi_c) - 1/(2n\psi_c)$, which gives

$$\log\hat\psi_c - \Gamma'(\hat\psi_c)/\Gamma(\hat\psi_c) - 1/(2n\hat\psi_c) = \log(y_./n) - n^{-1}\sum\log y_i.$$

This is the same estimating equation that is obtained from the modified profile likelihood. In both cases one "degree of freedom" has been lost, in analogy with the normal variance example. This adjustment is motivated from a different point of view in McCullagh and Nelder (1983, p. 157); see also Sweeting (1981).

4.3. Comparison of Conditional and Unconditional Profile Likelihood Functions
We now consider how to assess whether the conditional profile likelihood statistic is preferable to the unconditional form. There are several bases for comparison, no one of which is wholly convincing in itself.

As noted in Section 4.1, in special problems of the exponential family conditioning on $\hat\lambda_{\psi_0}$ generates uniformly most powerful similar tests. We can expect this optimality to be nearly retained for distributions close to the exponential family.

Two possibilities we shall not consider in detail are to compare the distributions of $w$ and $w_c^*$ under the null hypothesis and under an appropriate one-sided alternative.
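Returning briefly to the gamma example of Section 4.2.3, the three estimating equations for the shape parameter are easy to compare numerically. The sketch below is illustrative only: the digamma routine (recurrence plus a standard asymptotic series), the geometric bisection solver and all function names are assumptions introduced here, not part of the paper. It solves the profile equation, the exact conditional equation, and the approximate equation with the $1/(2n\psi)$ correction.

```python
import math

def digamma(x):
    """Gamma'(x)/Gamma(x) for x > 0, via psi(x) = psi(x + 1) - 1/x and an
    asymptotic series once the argument is at least 6."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # log x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def estimating_functions(y):
    """Return the three functions of the shape s whose roots are the
    profile, conditional and approximate conditional estimates."""
    n = len(y)
    # c = log(mean) - mean(log), non-negative by Jensen's inequality
    c = math.log(sum(y) / n) - sum(math.log(v) for v in y) / n

    def profile(s):
        return math.log(s) - digamma(s) - c

    def conditional(s):
        # log y. = log n + log(mean), hence the extra log n
        return digamma(n * s) - digamma(s) - math.log(n) - c

    def approximate(s):
        return math.log(s) - digamma(s) - 1.0 / (2.0 * n * s) - c

    return profile, conditional, approximate

def solve(f, lo=1e-3, hi=1e6, iters=200):
    """Geometric bisection for a function that is positive at lo and negative at hi."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)
```

On data that are not too close to constant, the conditional and approximate roots should nearly coincide and both should fall below the profile root, reflecting the lost "degree of freedom"; the bracket `[1e-3, 1e6]` is an arbitrary choice that would need widening for extreme samples.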


With regard to the first, it would be of some interest as a matter of convenience rather than of fundamental principle to examine whether or not the distribution of $w_c^*$ is more nearly approximated by $\chi^2_1$ than that of $w$. It is known that the $\chi^2_1$ approximation to the distribution of $w$ can be improved by application of a Bartlett adjustment, but we have not investigated such adjustments for $w_c$ or $w_c^*$. Note, however, that in the normal theory linear model with the variance as the parameter of interest the Bartlett adjustment essentially allows for the loss of degrees of freedom due to estimating the regression parameters; this adjustment is automatically made by the conditional construction of $w_c^*$. In general the need for a large adjustment factor, i.e. $n^{-1}$ correction, would make the use of one or two terms of the asymptotic expansion suspect.

With regard to the second, we have not explored higher order approximations to power. The calculations involved are complex and unlikely to lead to a definitive answer.

Comparison with Bayesian calculations is likely to be helpful, and in this regard the results of Sweeting (1981, 1984a, 1984b) are particularly relevant. Sweeting's approximate posterior distributions for location and scale parameters lead to inferences very similar to those here, although the basis of the argument is quite different.

In the development below we examine directly the first two terms of the stochastic expansion of $w$ and $\tilde w_c$.

We will in our discussion concentrate on the conditional statistic $\tilde w_c$, although our original motivation was in terms of $w_c^*$. It seems likely that $\tilde w_c - w_c^* = O_p(1/n)$, but we have not proved this.

We assume that, if $\lambda$ is known, the optimality results mentioned in Section 4.1 justify the use of the ordinary likelihood ratio statistic which we denote by
$$w_k(\psi) = 2\{l(\hat\psi_\lambda, \lambda) - l(\psi, \lambda)\}.$$

We shall compare $w_k$ to the profile likelihood $w$ defined in (6) and to the approximate version of the conditional profile likelihood, $\tilde w_c$, defined in (10).

All three statistics have asymptotically a $\chi^2_1$ distribution under the null hypothesis. The differences $w_k(\psi_0) - w(\psi_0)$ and $w_k(\psi_0) - \tilde w_c(\psi_0)$ represent the loss from not knowing $\lambda$. We want this loss to be stochastically small. A major advantage of this approach is that the adoption of a very specific measure of the loss is unnecessary, at least for the analysis here.

Note that we have defined $w_k(\psi)$ in terms of the orthogonalized nuisance parameter $\lambda$ rather than in terms of an arbitrary nuisance parameter, $\phi$. This seems compelling, however, in that to regard $\phi$ as known would in general add appreciably to the information about $\psi$, whereas specification of $\lambda$ affects only second-order aspects of inference about $\psi$.

We begin by comparing $w(\psi_0)$ and $w_k(\psi_0)$ via suitable Taylor series expansions, calculating the term that is $O_p(1/\sqrt n)$. On expansion about $(\hat\psi, \hat\lambda)$ and $(\psi_0, \hat\lambda_0)$, we have
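$$w_k - w = 2[\{l(\hat\psi_\lambda, \lambda_0) - l(\hat\psi, \hat\lambda)\} - \{l(\psi_0, \lambda_0) - l(\psi_0, \hat\lambda_0)\}]$$
$$= -n(\hat\lambda - \lambda_0)^T j_{\lambda\lambda}(\hat\psi, \hat\lambda)(\hat\lambda - \lambda_0) + n(\hat\lambda_0 - \lambda_0)^T j_{\lambda\lambda}(\psi_0, \hat\lambda_0)(\hat\lambda_0 - \lambda_0) + O_p(1/n). \qquad (16)$$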

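The terms retained in (16) are $O_p(1/\sqrt n)$. In deriving this we have used orthogonality repeatedly to give both $(\hat\lambda_0 - \hat\lambda) = O_p(1/n)$ and $(\hat\psi_\lambda - \hat\psi) = O_p(1/n)$, and in the expansion of $l(\psi_0, \hat\lambda_0) - l(\psi_0, \lambda_0)$ we have written $\hat\lambda_0 - \lambda_0 = \hat\lambda_0 - \hat\lambda + \hat\lambda - \lambda_0$. It follows from expansion of $\hat\lambda_\psi$ as a function of $\psi$ that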

2- 2

-r)Z

(f

))i71(f0,A)/nn

-i;

-_

) (if(, ).)/ + OP(n

After vector. is and of where is a randomvector order1 in probability ai0(fr, A)/DA a fixed Z4A


some further expansionwe have _ i Wk- W 0)(A _,AA(O -n(;


+ n{2(;
1/)(Z-A/n)i-( + Op(1/n).

1A)/al](A-)
-

)T aiJ(,

))/J}iii,

SA

(17) V

To examine the structure of (17) we write

't -0

V_ii -2/ -

n,

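where $i_{\lambda\lambda} = i_{\lambda\lambda}^{1/2}(i_{\lambda\lambda}^{1/2})^T$, all the $i$'s are evaluated at $(\psi_0, \lambda_0)$, and $(V_\psi, V_\lambda)$ is a $1 \times (q + 1)$ vector of asymptotically independent standard normal random variables. Then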
Wk-W

v i-1/2 * * [VAi1 I2{ais(fO, i** {ai*(/0,

A)//0}(i-1/'2)TVT]

4)/4}iAA(ih

1"2)TVT

.n
-

2VVi 142

ZVd

(i7 1/2)TVT

Op(l/n).

(18)

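The corresponding term in the expansion of $w_k - \tilde w_c$ has one extra term, arising from

$$\log\det j_{\lambda\lambda}(\hat\psi, \hat\lambda) - \log\det j_{\lambda\lambda}(\psi_0, \hat\lambda_0)$$
$$= (\hat\psi - \psi_0)\{\partial\log\det i_{\lambda\lambda}(\psi, \lambda)/\partial\psi\}_{\psi=\psi_0} + O_p(1/n)$$
$$= (\hat\psi - \psi_0)\,\mathrm{trace}\{i_{\lambda\lambda}^{-1}(\partial i_{\lambda\lambda}/\partial\psi)_{\psi=\psi_0}\} + O_p(1/n)$$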
-vi*-i/2 * *

The trace in this expression is the expected value of the quadratic form in $V_\lambda$ in (18). The interpretation is probably most easily seen from the case where $\lambda$ is a scalar, when we can write
Wk Wkvc W=

,n

trace {i-

1(ai&A/aI)}

Op(1/n).

(19)

(aV.VA +
-

+ bV VA+ cV VV)/1n

Op(1/n)

(20)
(21)

={aV (VA

1) + bV VI + cV,V,}/2n + Op(1/n).

Note that to first order any of the $w$ statistics is equal to $V_\psi^2$. Suppose that we have collected some data and calculated one or other of $w$ and $\tilde w_c$. We would like to have calculated $w_k$ but this is not possible, essentially because $V_\lambda$ is totally unknown. We therefore consider the conditional representation of $w_k$ given $w$ or $\tilde w_c$; these are respectively of the form

$$w + (a'V_\lambda^2 + b'V_\lambda)/\sqrt n + O_p(1/n),$$
$$\tilde w_c + \{a'(V_\lambda^2 - 1) + b'V_\lambda\}/\sqrt n + O_p(1/n). \qquad (22)$$

On average, $\tilde w_c$ is closer to $w_k$ because $E V_\lambda^2 = 1$, although there is no uniform domination. The $O_p(1/\sqrt n)$ terms are a kind of bias, and the mean squares of these terms in (22) are respectively $(3a'^2 + b'^2)/n$ and $(2a'^2 + b'^2)/n$. Further, among all linear combinations of $w$ and $\tilde w_c$ the minimum possible mean square is $(2a'^2 + b'^2)/n$ and in general, unless $|a'| \ll |b'|$, the probability that the unknown $V_\lambda$ contributes a large discrepancy from the 'optimal' $w_k$ is greater for the unconditional version $w$ than for the conditional version $\tilde w_c$.
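The two mean squares quoted above follow from the standard normal moments $E V^2 = 1$, $E V^3 = 0$ and $E V^4 = 3$, and can be checked by simulation. The following is a small Monte Carlo sketch; the function name, the number of draws and the seed are arbitrary illustrative choices, and the returned values are the mean squares of the bracketed terms before division by $n$.

```python
import random

def mean_squares(a, b, n_draws=200_000, seed=0):
    """Monte Carlo estimates of E(a V^2 + b V)^2 and E{a (V^2 - 1) + b V}^2
    for V standard normal; the theoretical values are 3a^2 + b^2 and 2a^2 + b^2."""
    rng = random.Random(seed)
    total_unconditional = 0.0
    total_conditional = 0.0
    for _ in range(n_draws):
        v = rng.gauss(0.0, 1.0)
        total_unconditional += (a * v * v + b * v) ** 2
        total_conditional += (a * (v * v - 1.0) + b * v) ** 2
    return total_unconditional / n_draws, total_conditional / n_draws
```

The difference between the two values is $a'^2$, which is the squared mean of the $a'V_\lambda^2$ term: centring at $E V_\lambda^2 = 1$ removes exactly that contribution.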


An incidental comment is that the addition of an $O(1/\sqrt n)$ term to, say, $w_k$, only affects its distribution to order $1/n$, provided that the additional terms have conditional mean zero, given $w_k$, and mild regularity conditions are satisfied (Cox and Reid, 1987).

When $\lambda$ is a vector the more complicated formulae (18) and (19) hold. In the special case that the components of $\lambda$ are mutually orthogonal and all components have the same value of $\partial\log i_{\lambda_r\lambda_r}/\partial\psi$, then expressions corresponding to (22) are

$$w + \{a'(V_{\lambda_1}^2 + \cdots + V_{\lambda_q}^2) + b'\sum c_s V_{\lambda_s}\}/\sqrt n + O_p(1/n),$$
$$\tilde w_c + \{a'(V_{\lambda_1}^2 + \cdots + V_{\lambda_q}^2 - q) + b'\sum c_s V_{\lambda_s}\}/\sqrt n + O_p(1/n)$$

and the amount of 'bias' removed by $\tilde w_c$ is proportional to $q$.

In general a simple characterization of the amount of 'bias' does not seem possible. Note, however, that if $i_{\lambda\lambda}$ does not depend on $\psi$, then $w$ and $\tilde w_c$ have the same expansions to $O_p(1/\sqrt n)$.

In work unpublished at the time of writing, M. A. Aitkin and J. Hinde have proposed another method for deriving a likelihood function in the presence of nuisance parameters via a notion of canonical likelihood. It would be of interest to compare their method with the present ones via an expansion of the form (22).

5. TRANSFORMATIONS IN NORMAL THEORY REGRESSION


5.1. Introduction
We now discuss in more detail some aspects of inference in the normal transformation model introduced in Example 3.5. For some unknown $\psi$, the random variables $(Y_1^\psi, \ldots, Y_n^\psi)$

are assumed to be independent and normally distributed with mean $\phi_1$ and variance $\phi_0$ in the one sample case, and mean $\sum x_{is}\phi_s$ in the regression model. An approximately orthogonal parametrization of the model is given in equation (5) of Section 3.5, and a possible interpretation of the statistical implications of this parametrization is outlined. It is essentially the same parametrization developed by Hinkley and Runger (1984) from a different route.

5.2. Bayesian and Conditional Likelihood Analysis
The Bayesian analysis of Box and Cox (1964) used a data dependent prior for $(\phi, \psi)$, proportional to the $n$th root of the Jacobian of the transformation from $y$ to $y^\psi$. This was necessary because the relative sizes of the regression coefficients and variance depend strongly on the value of $\psi$, so that in the absence of any assumptions regarding $\psi$ it does not make sense to assign uniform improper priors for them. The logical status of data-dependent priors is unclear; see, for example, Nelder's contribution to the discussion of Box and Cox (1964). One method of avoiding them was suggested by Pericchi (1981) and modified by Sweeting (1984a) by an argument similar to that below, although expressed differently; see also Hinkley and Runger (1984).

Since the approximately orthogonal parameters are by construction weakly dependent on $\psi$, it seems reasonable to assign uniform improper priors for them. Since $\lambda_1$ is constrained to be positive, it is given the prior $d\lambda_1/\lambda_1$; similarly the prior for the orthogonal variance component $\lambda_0$ is $d\lambda_0/\lambda_0$. The remaining components $(\lambda_2, \ldots, \lambda_q)$ are assigned the joint prior $\prod d\lambda_s$. For the one-sample problem the likelihood is proportional to

$$f(y; \psi, \lambda_1, \lambda_0) \propto \lambda_1^{-n(\psi-1)}\lambda_0^{-n/2}\prod y_i^{\psi-1}\exp[-\{n(\bar y_\psi - \lambda_1^\psi)^2 + S_\psi\}/(2\psi^2\lambda_1^{2\psi-2}\lambda_0)],$$

where $\bar y_\psi$ and $S_\psi$ are the mean and residual sum of squares calculated from $y^\psi$. Integration over the prior $d\lambda_0/\lambda_0$ gives

$$f(y; \psi, \lambda_1) \propto \prod y_i^{\psi-1}\,|\psi|^n\{n(\bar y_\psi - \lambda_1^\psi)^2 + S_\psi\}^{-n/2} \qquad (23)$$


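and to obtain the contribution of the observations to the posterior density of $\psi$ we integrate (23) over $d\lambda_1/\lambda_1$:

$$f(y; \psi) \propto \prod y_i^{\psi-1}\,|\psi|^n\int \lambda_1^{-1}\{n(\bar y_\psi - \lambda_1^\psi)^2 + S_\psi\}^{-n/2}\,d\lambda_1.$$

After some computation this reduces to

$$\prod y_i^{\psi-1}\,|\psi|^{n-1}\,S_\psi^{-(n-1)/2}\,(\bar y_\psi)^{-1}[1 + \{S_\psi/(n\bar y_\psi^2)\}n^{-1} + O(n^{-2})].$$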

Note that $S_\psi/(n\bar y_\psi^2)$ is the squared coefficient of variation of the $y_i^\psi$. The computation for the linear regression model proceeds similarly, giving
f(y7fr)ocnHqSI(n-q)1

IYPIq

2) ( y2+0n ~(2(n - q - 2)

q(q +1)

+0(n)
4j3

(24)

using expression (1984a,eq. (6)). In thecorresponding theleadingterm withSweeting agreeing in prior(Box and Cox, 1964,equation22), theterm bracesin (24) is equal thedata dependent to 1, and theterm Hy-1 I'' In-ql( q = 'I n-q(y~l/ )y -4 is replacedby n-1)(n-q)ln = J(@/; y)(n-q)/n qny, 1 likelihoodis to use profile of directrouteto thecomputation theconditional The simplest to to (equation (9)); i.e. the expression be comparedto (2) is the versioncorresponding fv, matrix from /) to (A,0) is diagonalwith exp{l(i/, - log det(nj,j)}. The transformation (0, 2A,) ii f2r2-2'). The resulting expressionfor exp{l(/, 2.,) - 2 log entries( Oi/ 1, /i on terms not depending /, is, det(njhj)}, ignoring
.

ny -1

In-q-2

S-(n-q-2)121

I yV j {(q +2)V,-3}/*

(25)

The conditional profile likelihood defined by $w_c$ in equation (8) gives the same expression.

Expressions (24) and (25) were evaluated as functions of $\psi$ for the 3 x 4 x 4 factorial design discussed by Box and Cox (1964, Table 1). The Bayesian posterior density (24) has its mode at $\psi = -0.71$ and an equitailed 0.95 posterior interval is (-1.14, -0.27). The conditional profile likelihood (25) is maximized at $\psi = -0.68$ and a 0.95 confidence interval obtained from the $\chi^2_1$ approximation is (-1.09, -0.26). Box and Cox obtained -0.75 for the Bayesian and likelihood estimates of $\psi$, and corresponding intervals (-1.18, -0.32) and (-1.13, -0.37).

One advantage of the parameter orthogonalization is the approximate result
$$\{\mathrm{var}(\hat\psi)\}^{-1} = E(-\partial^2 l/\partial\psi^2), \qquad (26)$$

so the inversion of the full information matrix is unnecessary. The value of (26) can be used to measure the transformation potential of a set of data (Box and Cox, 1982), i.e. the extent to which it is feasible to determine a suitable transformation from the data. A complicated but elementary calculation gives that (26) is equal to
{4CVe + 2 CVA+ 4CV2 (1 + CA)}

Here CV
=

n-1 E

= n

4i

of component,cA is definedby is the squared coefficient variationfromthe regression of of n-1E Ai = CV4(1 + CA),and CVe = 1i var Yi/(EYi)2is thecoefficient variation theerror problem at component, / = 1; CV2/CV2 is a kindofsignalto noise ratio.In theone-sample
CVA = O.

6. DISCUSSION

The above development leaves open a number of issues some of which we raise in the form of questions.

(i) In Section 4 we concentrated on the relation between $w_k$, $w$ and $\tilde w_c$ rather than on the null distribution. The first two statistics can be modified by a Bartlett adjustment factor to have a $\chi^2_1$ distribution to $O(1/n^{3/2})$ (Barndorff-Nielsen and Cox, 1984). Is the same true of $\tilde w_c$ and can the adjustment factor be calculated via that of $w$ or $w_k$? Is an adjustment for skewness of $\tilde w_c$ available, to produce nearly equitailed confidence limits for $\psi$ (McCullagh, 1984; Barndorff-Nielsen, 1985a)?
(ii) Can stronger justification for the use of $\tilde w_c$ or $w_c^*$, or some other statistic, be produced, perhaps including asymptotic calculations to higher order?
(iii) Has conditioning on exact or approximate ancillary statistics been achieved by the proposed procedure?
(iv) Do the results in Sections 2 and 4 have a useful, possibly simpler, formulation in curved exponential families?
(v) Are there special problems for discrete data?
(vi) Can the discussion be extended to nonregular problems, for example those connected with the terminal of a distribution, and to general problems with a large number of nuisance parameters?
(vii) Are there implications when the objective is the prediction of future observations, rather than estimation?
(viii) Can the discussion usefully be extended to vector parameters of interest, where in general only local orthogonality is possible?
(ix) How should the differential equations determining $\lambda$ be handled when simple explicit solution is not feasible? What further conditions can usefully be imposed on $\lambda$ in general?
(x) What general implications for model and parameter definition and robustness can be drawn from the notion of parameter orthogonality?

This paper was substantially completed while both authors were visiting the Department of Statistics at the University of Waterloo, and we thank the department warmly for its hospitality. We especially thank the departmental secretary Lynda Hohner. We are grateful also to referees for constructive comments.
Support of the Science and Engineering Research Council and the Natural Sciences and Engineering Research Council of Canada is gratefully acknowledged.
REFERENCES
Amari, S.-I. (1982) Differential geometry of curved exponential families-curvatures and information loss. Ann. Statist., 10, 357-85.
Amari, S.-I. (1985) Differential geometry in statistics. New York: Springer Verlag.
Barndorff-Nielsen, O. E. (1978) Information and exponential families in statistical theory. New York: Wiley.
Barndorff-Nielsen, O. E. (1983) On a formula for the distribution of a maximum likelihood estimator. Biometrika, 70, 343-65.
Barndorff-Nielsen, O. E. (1985a) Confidence limits from $c|\hat j|^{1/2}\bar L$ in the single-parameter case. Scand. J. Statist., 12, 83-87.
Barndorff-Nielsen, O. E. (1985b) Properties of modified profile likelihood. In Contributions to Probability and Statistics in Honour of Gunnar Blom (J. Lanke and G. Lindgren, eds), pp. 25-38. Lund.
Barndorff-Nielsen, O. E. (1986) Inference on full or partial parameters, based on the standardized signed log likelihood ratio. Biometrika, 73, 307-22.
Barndorff-Nielsen, O. E. and Cox, D. R. (1984) Bartlett adjustments to the likelihood ratio statistic and the distribution of the maximum likelihood estimator. J. R. Statist. Soc. B, 46, 483-95.
Bartlett, M. S. (1936) The information available in small samples. Proc. Camb. Phil. Soc., 34, 33-40.
Bartlett, M. S. (1937) Properties of sufficiency and statistical tests. Proc. R. Soc. A, 160, 268-82.
Box, G. E. P. and Cox, D. R. (1964) An analysis of transformations (with discussion). J. R. Statist. Soc. B, 26, 211-52.
Box, G. E. P. and Cox, D. R. (1982) An analysis of transformations revisited, rebutted. J. Amer. Statist. Assoc., 77, 209-10.
Cox, D. R. (1980) Local ancillarity. Biometrika, 67, 273-8.
Cox, D. R. and Hinkley, D. V. (1974) Theoretical Statistics. London: Chapman and Hall.
Cox, D. R. and Reid, N. (1987) Approximations to noncentral distributions. Can. J. Statist., to appear.
Cox, D. R. and Snell, E. J. (1981) Applied Statistics. London: Chapman and Hall.
Efron, B. and Hinkley, D. V. (1978) Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information. Biometrika, 65, 457-87.
Godambe, V. P. (1976) Conditional likelihood and unconditional optimum estimating equations. Biometrika, 63, 277-84.
Godambe, V. P. and Thompson, M. E. (1974) Estimating equations in the presence of a nuisance parameter. Ann. Statist., 2, 568-71.
Hinkley, D. V. and Runger, G. (1984) The analysis of transformed data. J. Amer. Statist. Assoc., 79, 302-20.
Huzurbazar, V. S. (1950) Probability distributions and orthogonal parameters. Proc. Camb. Phil. Soc., 46, 281-4.
Jeffreys, H. (1961) Theory of Probability, 3rd ed. Oxford: Clarendon Press.
Jorgensen, B. and Pedersen, B. V. (1979) Contribution to discussion of paper by O. E. Barndorff-Nielsen and D. R. Cox. J. R. Statist. Soc. B, 41, 305.
Kalbfleisch, J. D. and Sprott, D. A. (1970) Application of likelihood methods to models involving large numbers of parameters (with discussion). J. R. Statist. Soc. B, 32, 175-208.
Lindsay, B. (1982) Conditional score functions: some optimality results. Biometrika, 69, 503-12.
McCullagh, P. (1984) Local sufficiency. Biometrika, 71, 233-44.
McCullagh, P. and Nelder, J. A. (1983) Generalized linear models. London: Chapman and Hall.
Mitchell, A. F. S. (1962) Sufficient statistics and orthogonal parameters. Proc. Camb. Phil. Soc., 58, 326-37.
Neyman, J. and Scott, E. L. (1948) Consistent estimates based on partially consistent observations. Econometrica, 16, 1-32.
Patterson, H. D. and Thompson, R. (1971) Recovery of interblock information when block sizes are unequal. Biometrika, 58, 545-54.
Pericchi, L. R. (1981) A Bayesian approach to transformations to normality. Biometrika, 68, 35-43.
Ross, G. J. S. (1970) The efficient use of function minimization in nonlinear maximum likelihood estimation. Appl. Statist., 19, 205-21.
Sweeting, T. J. (1981) Scale parameters: a Bayesian treatment. J. R. Statist. Soc. B, 43, 333-338.
Sweeting, T. J. (1984a) On the choice of prior distribution for the Box-Cox transformed linear model. Biometrika, 71, 127-134.
Sweeting, T. J. (1984b) Approximate inference in location-scale regression models. J. Amer. Statist. Assoc., 79, 847-852.

DISCUSSION OF THE PAPER BY PROFESSORS COX AND REID

Professor O. E. Barndorff-Nielsen (Aarhus University): The subject of inference on parameters of interest in the presence of nuisance parameters is at the core of statistics, and the paper before us adds substantially to our understanding and methodology for this subject.

The main points of the paper are the discussion of parameter orthogonality and its relevance for inference, and the definition and investigation of a new concept of conditional likelihood. Below I comment on these in turn.

Let $\psi$, of dimension $r$, denote the parameter of interest. As the authors demonstrate, and use extensively, if $\psi$ is one-dimensional it is generally possible to find a complementary parameter $\lambda = (\lambda_1, \ldots, \lambda_q)$ such that $\psi$ and $\lambda$ are orthogonal relative to the expected information metric on the parametric model $\mathcal{M}$. It is illuminating to view this result from a general geometric vantage point.

For this purpose, suppose $\mathcal{M}$ is an arbitrary differentiable manifold of dimension $q + 1$ and with metric tensor $\gamma$, and let $\psi$ be a real valued function on $\mathcal{M}$, the level sets $\mathcal{M}_\psi$ of $\psi$ being $q$-dimensional submanifolds of $\mathcal{M}$. At each point $p$ of each submanifold $\mathcal{M}_\psi$ we may place an infinitesimal line segment which contains $p$ and is $\gamma$-orthogonal to $\mathcal{M}_\psi$. It is intuitively plausible, and on account of Frobenius' theorem generally true, that these infinitesimal line segments connect up to form a bundle of one-dimensional differentiable curves, each curve cutting orthogonally through the submanifolds $\mathcal{M}_\psi$. Now, let $\lambda = (\lambda_1, \ldots, \lambda_q)$ be a parameter which is complementary and $\gamma$-orthogonal to $\psi$, i.e. $(\lambda, \psi)$ parametrizes $\mathcal{M}$ and when $\gamma$ is expressed in the $(\lambda, \psi)$ coordinates its mixed type elements are 0, i.e. $\gamma_{\lambda_s\psi}(\lambda, \psi) = 0$ for $s = 1, \ldots, q$.

Any such parameter $\lambda$ may be conceived as determining a coordinate system on a fixed, but arbitrary, $\mathcal{M}_{\psi_0}$ say, the $(\lambda, \psi)$ coordinates of an arbitrary point $p$ in $\mathcal{M}$ being obtained by finding the $\psi$ such that $p$ belongs to $\mathcal{M}_\psi$, and the $\lambda$ such that $p$ lies on that of the above-mentioned curves whose intersection point with $\mathcal{M}_{\psi_0}$ has coordinate $\lambda$. Thus the freedom in choice of orthogonal parameter $\lambda$ consists solely in the arbitrariness with which one may define a coordinate system on $\mathcal{M}_{\psi_0}$. If $(\phi, \psi) = (\phi_1, \ldots, \phi_q, \psi)$ is any parametrization of $\mathcal{M}$ then an orthogonal complementary parameter can be found by solving the system of equations

$$\sum_{s=1}^{q}\gamma_{\phi_s\phi_t}\,\partial\phi_s/\partial\psi = -\gamma_{\psi\phi_t}, \qquad t = 1, \ldots, q. \qquad (1)$$

I have benefitted from discussing this geometrical setting with Professor Suresh Moolgavkar.

1987]

Cox and Reid Discussionof thePaper by Professors

19

in statistical model follows, particular, when/ it that Returning thecaseof X being parametric to a is parameterwhich orthogonal / relative to is one-dimensionalcanequally a complementary we find A in (1986a,b). to theobserved information metric as defined Barndorff-Nielsen , from location-scale the model - -f((x For example, suppose ..., x,,is a sample x1, -)/oU), let a of Then, .. and ju be theconfiguration - A)/&, . -,(x ft-)/&) consider as theparameter interest. letting ((x1 g(x)= -log f(x) and solving with (1) y=j and t12[g"(aj) 2

Xa9g"(aj) 1 Lxavg "(aj) n+ a2g (aj

is onefinds A= a + u,u <-orthogonal ,u, that to where


u = {Xavg"(aj)}/{n+ Ea2g"(aj)}.

A general remark is that, whether one considers expected or observed information, as a point of principle inference on $\psi$ ought not to depend on the choice of orthogonal parameter $\lambda$.

Let $\omega$, of dimension $d = q + r$, be any parametrization of the statistical model $\mathcal{M}$. In their discussion of conditional likelihood Professors Cox and Reid refer to the formula $p(\hat\omega; \omega \mid a) \doteq p^*(\hat\omega; \omega \mid a)$, where $p^*(\hat\omega; \omega \mid a) = c|\hat j|^{1/2}\exp(l - \hat l)$, which provides an expression for the conditional distribution of the maximum likelihood estimator $\hat\omega$ given a complementary ancillary $a$, and to the related concept of modified profile likelihood for $\psi$, i.e.

$$L^0(\psi) = |\partial\hat\lambda/\partial\hat\lambda_\psi|\,|\tilde j_{\lambda\lambda}|^{-1/2}\,\tilde L(\psi) \qquad (2)$$
where $\omega = (\psi, \lambda)$, $\tilde L(\psi) = L(\psi, \hat\lambda_\psi)$ is the profile likelihood for $\psi$, $\tilde j_{\lambda\lambda}$ is observed information for $\lambda$, given $\psi$, and $\hat\lambda_\psi$ is considered as a function of $(\hat\psi, \hat\lambda, a)$. Before commenting on the Cox-Reid definition of conditional likelihood I wish to make a few remarks on modified profile likelihood and parameter orthogonality.

If $\lambda$ and $\psi$ are orthogonal then, by (iv) of Section 2.2, we may often ignore the factor $|\partial\hat\lambda/\partial\hat\lambda_\psi|$ in (2), which may be a considerable simplification. However, parametrization invariance of (2) is lost by this approximation.

The expression (2) was derived, using $p^*$ as above, by two routes in Barndorff-Nielsen (1983, 1985b): by reasoning of marginal inference on the assumption that $p(\hat\psi, \hat\lambda; \psi, \lambda \mid a)$ factorizes as $p(\hat\psi; \psi \mid a)\,p(\hat\lambda; \psi, \lambda \mid \hat\psi, a)$, and by reasoning of conditional inference on the assumption that $p(\hat\psi, \hat\lambda; \psi, \lambda \mid a)$ factorizes as $p(\hat\lambda; \psi, \lambda \mid a)\,p(\hat\psi; \psi \mid \hat\lambda, a)$. In the light of tonight's paper two other routes are now apparent. First, if $p(\hat\psi, \hat\lambda; \psi, \lambda \mid a)$ factorizes as $p(\hat\lambda_\psi; \psi, \lambda \mid a)\,p(\hat\psi; \psi \mid \hat\lambda_\psi, a)$, i.e. if for every fixed $\psi$ the statistic $(\hat\lambda_\psi, a)$ is sufficient for $\lambda$, then again (2) emerges by applying $p^*$ separately to the numerator and the denominator of $p(\hat\psi, \hat\lambda; \psi, \lambda \mid a)/p(\hat\lambda_\psi; \psi, \lambda \mid a)$. (We illustrate this further below, by an example based on the $\Gamma$-distribution.) This argument is rather similar to that leading to formula (9) of the paper. Second, suppose $\pi(\lambda)$ is a prior probability density function for $\lambda$. Then the posterior likelihood for $\psi$ is

$$L(\psi) = \int \pi(\lambda)\,e^{l(\psi, \lambda)}\,d\lambda.$$

In wide generality we may apply Laplace's method to this integral and we thus obtain the approximation

$$L(\psi) \doteq (2\pi)^{q/2}\,\pi(\hat\lambda_\psi)\,|\tilde j_{\lambda\lambda}|^{-1/2}\,\tilde L(\psi).$$

Provided $\psi$ and $\lambda$ are orthogonal this is generally close to $L^0(\psi)$ on account of 2.2(iv) (and ignoring constant factors).

The idea of conditional profile likelihood as defined by (7) or (8) is certainly appealing, although the lack of parametrization invariance of (8) is somewhat disconcerting.

However, it is not clear to me to what extent (7) or (8) offer an advantage in practice over the modified profile likelihood. An inherent difficulty lies in the need to calculate the marginal distribution of $\hat\lambda_\psi$, either exactly or to the appropriate order of approximation. One possibility, when working conditionally on the ancillary $a$, is to derive an approximate expression for the distribution of $\hat\lambda_\psi$ by integrating $p^*(\hat\psi, \hat\lambda_\psi; \psi, \lambda \mid a) = p^*(\hat\psi, \hat\lambda; \psi, \lambda \mid a)\,|\partial\hat\lambda/\partial\hat\lambda_\psi|$ with respect to $\hat\psi$, this being feasible to sufficient approximation by means of an asymptotic expansion for $p^*$, given in section 3 of Barndorff-Nielsen (1986a).

As another point of comparison with modified profile likelihood one may note that (7) and (8) are tied to the conditionality viewpoint whereas, as mentioned above, exactly the same expression (2) for $L^0$ is arrived


and of when inference, thatis relevant, by an argument conditional of at by an argument marginal Poisson variates xl this, To is when inference, that relevant. illustrate suppose andx2 areindependent + and with means 1 andY2, respectively, let' = pl + 12 andA= pl/(pl i2). Then/ andAareorthogonal p that from for profile likelihood / is equal to thelikelihood modelfor/, and themodified (Poisson) complicated, is of2, via distributon which quite proceed the would of However, calculation w*orwc model. basedon i/. likelihood and theresult wouldnotbe equalto themarginal the for function / alonewouldbe to consider a way Quitea different of defining likelihood-like ratio signed likelihood log standardized density r*isthe and standard normal function where isthe 0(r*) 0 list). in (1986-see mainpaper'sreference for/ as defined Barndorff-Nielsen mean whose by is Aninstructive example provided theF-distribution {(//)`/F(A)}x`I exp{- (A/l)x}, size and sample n> 1)wehavethat butnot2,issufficient value /. Here/ andAareorthogonal (for is 1,,, of two of in for given/. Thus, this A case,onlythesecond theabove-mentioned derivations modified of One finds from likelihood a viewpoint conditional inference applies. profile 1?(0) =-n
whereC(() =
02 1/2 (A)- Ic(2 )"I214f)

and

= xI + x2, i;= =2= xl/ (xl + x2). Inference on

in should clearlybe performed the marginal

example in tractable thepresent of this. not andonewould discard Calculation wc* or ofwc is notvery compare =7(0) + I log on the of (although nulldistributionA,doesnotdepend O), butonemight 70(O) = (1986) thatr,*, r,,- (log :j/r,, where 4 = 0( to log ,(r,,). It was shownin Barndorff-Nielsen the No of basicaspects thepaper. doubt paper uponthemost only here It hasbeenpossible to touch to and Itis and discussion investigations. a pleasure a privilege propose further considerable will generate a and paper. stimulating interesting to Cox a strong ofthanks Professors and Reidfor very vote the I pleased havebeenaskedto second voteof to Dr T. J.Sweeting of (UniversitySurrey):am very In I provoking. mydiscussionshould thought for which mehas beenvery for thanks tonight's paper, in and inference discusssome parameters Bayesian on like to concentrate the use of orthogonal work. with relationships thepresent I can four parametrization. identify to an reasons wishing consider orthogonal for There several are are Cox are all by mainreasons, ofwhich mentioned Professors and Reid;they usedas an aid to (i) of and (iii) (ii) computation approximation interpretation (iv) elimination nuisanceparameters. for The part in inference thesamereasons. first ofmy are parameters also useful Bayesian Orthogonal likelihood the (CPL)andanapproximate concerns relationship the between conditional profile discussion Barndorff-Nielson madeby Professor already and remarks likelihood, amplifies integrated Bayesian of of parameters. In the independence orthogonal partI consider question prior tonight. thesecond Let of of parameters. L(f, O) be the and Let / be a scalarparameter interest 4 a vector nuisance likelihood of and density 4 given/. The integrated prior likelihood function 7r(o0) theconditional I L/) of/ is L
= LX

_ case the factorI 0A/al* in (2) is equal to log F(A)/OA2 i- 1. In the present I

I 4) 7r(oi)dqo

suitable of (under andbytaking expansions logLQf, andlog7r( I0) about$,,oneobtains appropriate X) conditions) regularity + Op(n1)) (1) with and to a Thisis essentiallyLaplaceapproximationtheaboveintegral, maybe compared formula L In and density. thatpaperhowever, is posterior (4.1) in Tierney Kadane (1986)fora marginal than~,,. When4 and / are orthogonal, modeof 4, rather abouttheconditional posterior expanded we the term. CPL(10)isjustformula without first Butthen can replace in 7r( ,Itfr) theCox-Reid (1) 4,, It by f to thesame orderof aproximation. followsthatto Op(n` ), the CPL is equal to the integrated
- 1/2

/ IfO)L(Q, k) ljI4(P, k&) L(Q) = 7r(@,

(1

likelihood whenever the orthogonal parameters $\psi$ and $\phi$ are taken to be a priori independent. An interesting feature in this case is that, to $O_p(n^{-1})$, the posterior distribution of $\psi$ is free from the prior adopted for the nuisance parameter $\phi$. The above analysis explains the agreement between the CPL in Cox and Reid and the approximate marginal posterior density for the Gamma index in Sweeting (1981). Formula (1) also explains the discrepancy between formulae (24) and (25) for the integrated likelihood and CPL respectively. The

1987]

Discussionof thePaper by Professors Cox and Reid

21

4 leading term (24)is precisely buttheCPL cannot thesameto Op(n- here in (1), be since = (AO, A) ) is notexactly to orthogonal f andi(o) is notconstant. is readily It checked that here
and on multiplying (25) by this factor we recover (24).

Returning to the question of whether a priori independence is sensible for orthogonal parameters, we have seen that when this is the case things work out nicely to $O_p(n^{-1})$; the CPL agrees with the Bayesian likelihood for every smooth prior for $\phi$. Although one cannot argue that orthogonal parameters should always be taken a priori independent, in certain problems it does seem very natural to take them at least approximately independent. Consider again the normal transformation model. As the transformation index $\psi$ varies, one can identify directions in $(\psi, \phi)$ space along which there is very little local change in the model. Reparametrize so that these directions correspond to $\lambda$ = constant. If our prior opinion about $\lambda$ given $\psi$ is formed by considering the type of data we would expect to see, then we can argue that our beliefs about $\lambda$ given $\psi = \psi_0$ should be hardly affected when $\psi$ moves to a neighbouring value $\psi = \psi_1$. No compensation in $\lambda$ is required for this small change in $\psi$ to preserve the model. Such an argument is made in Sweeting (1984a), and it turns out that the resulting parametrization agrees with Cox and Reid's approximate orthogonal parametrization. This is not so surprising when one views the process of orthogonalizing to $\psi$ as one of finding directions of least model change under the information metric. Omitting details, local distance in model space is minimized at each point in the parameter space by moving in a direction satisfying the orthogonality equation (4). In model space, this amounts to moving from $M(\psi_0, \phi_0)$ in a direction orthogonal to the space $M(\psi_0 + d\psi_0, \phi)$. A direct "compensation" argument applies quite generally to arbitrary transformations and error distributions (Sweeting, 1985), and for the reasons given above the resulting parametrization should be the approximate orthogonal parametrization, which will normally be complex. It would be interesting to find other models for which a simple compensation argument can be made when an exact orthogonal parametrization is difficult. I am sure there will be many other interesting avenues of research arising from tonight's paper, and it gives me very great pleasure to second the vote of thanks.

The vote of thanks was passed by acclamation.

Professor R. L. Smith (University of Surrey): The three-parameter Weibull distribution
$$F(x; \theta, \phi, \alpha) = 1 - \exp\{-((x - \theta)/\phi)^{\alpha}\} \qquad (x > \theta),$$

0 is in within with theparameter interestan inference of harder those thepaper, than problem although thedomain thetheory, problem the for of being regular a > 2. J.Naylor I, ina paper yet as unpublished, compared have and and sampling-theoryBayesian analyses, with to a difficulty theformer that likelihood 0 endsto be very To try improve for flat. being theprofile on profile one likelihood a inference, maysolvetheorthogonalization equations = a(O,A,P),4 = (0,A, p) suchthat 84 where
f ()=2
r(--{

= 8(X), = f2()

(1)

6}
a{(-6 + (1-y)T 1-a)}
t

6(X) 12 r2

and F function. Hereyis Euler's constant, thegamma function, P thedigamma of The solution (1) is ofcourse quitestraightforward. Defining
f3(CX) =

f(2x) -

f1 (cx)

= g(ac)

exp

{-

f3(u)du}dv


where > 1 is arbitrary, haveone solution (1) in theform b we of


9(a)

= AO III0 +

t2( A ( ),(2) ),t

and another, the putting contants integration a different in theform of in place,


g(aX)=--

= Yf2(709*42

(3)

Thisdefines orthogonal two A third parametrizations. suggested analogy with generalized the by extreme valueparametrization and (Prescott Walden, 1980, 1983)is
F(x; 0,

i)

= eI[-{e (xpx)}] 1

>.

(4)

The resulting five forms of log profile likelihood, namely (a) the original, (b) the modified profile log likelihood, i.e. equation (10) of Cox and Reid, without any reparametrization, and (c)-(e) the modified profile likelihoods defined with respect to the three new parametrizations (2)-(4), have been tried on some data on strengths of glass fibre analyzed by Naylor and me. In Fig. D1, the unmodified profile log likelihood (a) is very flat, but (b) and (c) are even worse, having no local maximum within the range of values calculated. In contrast, (d) and (e) appear to do better in discriminating among the various values of $\theta$. Preliminary results from a simulation study confirm the picture suggested by Fig. D1, i.e. that, in terms of sampling properties, (d) and (e) are best with (b) and (c) worse than (a).
Fig. D1. Profile likelihoods for Weibull distribution (based on 63 fibre strengths).

and Thisexample be very behaved allows but somegeneral may specialised badly observations. Cox in valuable work drawing andReidhave attention the to importanceorthogonality, of performed thereby definitionmodified of Barndorff-Nielsen's likelihood. extending the profile However, example, specially of(c) with thebad performance contrasted (d),shows orthogonality no means whole that is by the story. In particular, different two different orthogonal parametrizations havevery may properties. Mr D. Firth concern important the (Imperial College, London): remarks My property ofSection (iv) to havesomerelevance question ofSection 6. 2.2,and perhaps (x)


Observe first a slightly different, apparently more direct route to property (iv), based on the approximation
0,A - )

au

=*|

(2

i)

au

@2 ;

P 10 0p |?0|

derived by expanding the score function $u(\psi, \lambda) = \partial l(\psi, \lambda)/\partial\psi$ rather than $l(\psi, \lambda)$ itself. Orthogonality implies $\partial u/\partial\lambda = O_p(\sqrt{n})$ and, provided $\lambda - \hat\lambda = O_p(1/\sqrt{n})$, arguments like those following (3) apply: all terms are $O_p(1)$, hence $\hat\psi_\lambda - \hat\psi = O_p(1/n)$. Note that the result is 'local' in that $\lambda$ is required to be within $O(1/\sqrt{n})$ of the true value; in particular, if $\lambda$ is fixed it must be the true value. The result may be extended in two stages. First it may be 'delocalized' by restricting attention to likelihoods that satisfy
= A)/1frA}0 forall E0,;{a02l(0,

f, A 4,

This is a stronger condition than orthogonality and implies, in particular, that the score equation $u(\psi, \lambda) = 0$ is an unbiased estimating equation for $\psi$ at every $\lambda$. Now consider arbitrary $\lambda$, no longer required to be near the true value; and suppose that $\hat\lambda$, rather than being the maximum likelihood estimate, is such that $\hat\lambda - \lambda = O_p(1/\sqrt{n})$. Then, with $\hat\psi = \hat\psi_{\hat\lambda}$, the behaviour of all quantities in the above expansion is as before, and in particular $\hat\psi_\lambda - \hat\psi = O_p(1/n)$.

An immediate further extension is to the situation where $\{u(\psi, \lambda): \lambda \in R\}$ is a more general family of estimating functions for $\psi$, not necessarily likelihood-based score functions; the condition required is still $E\{u(\psi, \lambda)\} = 0$ for all $\psi, \lambda$, i.e. $u(\psi, \lambda) = 0$ is an unbiased estimating equation for $\psi$ at every value of $\lambda$. The result $\hat\psi_\lambda - \hat\psi = O_p(1/n)$ implies in particular that the asymptotic (normal) distribution of a solution based on any fixed value $\lambda$ is the same as that of a solution based on a data-dependent value $\hat\lambda$, provided $\hat\lambda - \lambda = O_p(1/\sqrt{n})$. Consider two examples:

(i) $Y_1, \ldots, Y_n$ independent, $E(Y_i) = \psi x_i$, $\mathrm{var}(Y_i) = \{E(Y_i)\}^{\lambda_0}$ and $u(\psi, \lambda) = \sum (Y_i - \psi x_i)/(\psi x_i)^{\lambda - 1}$. Within this class, $u(\psi, \lambda_0)$ maximizes asymptotic efficiency; the same first-order efficiency is achieved if $\lambda_0$ is replaced by a $\sqrt{n}$-consistent estimate.

(ii) $Y_1, \ldots, Y_n$ i.i.d., $E(Y_i) = \psi$, $\mathrm{var}(Y_i) = 1$ and $u(\psi, \lambda) = \sum [\lambda (Y_i - \psi) + (1 - \lambda)\{(Y_i - \psi)^2 - 1\}]$. Provided third and fourth moments exist, asymptotic efficiency here is maximized by the choice $\lambda = (2 + \kappa_4)/(2 + \kappa_4 - \kappa_3)$; again $\sqrt{n}$-consistent estimates of $\kappa_3$ and $\kappa_4$ allow the same first-order efficiency to be achieved. This example is non-robust in the sense that the estimating equation is not generally unbiased under failure of the variance assumption.
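
The optimal weight in example (ii) can be checked with a short calculation. The sketch below assumes the estimating function is $\sum[\lambda(Y_i - \psi) + (1-\lambda)\{(Y_i - \psi)^2 - 1\}]$ with $\mathrm{var}(Y_i) = 1$, and uses the usual sandwich formula for the asymptotic variance of its root; the function names and the numerical cumulants (those of a standardised exponential variable, $\kappa_3 = 2$, $\kappa_4 = 6$) are illustrative choices, not part of the contribution.

```python
# Illustrative sketch; all names are hypothetical.

def asymptotic_variance(lam, k3, k4):
    """n * variance of the root of sum g(y_i) = 0, where
    g = lam*(y - psi) + (1 - lam)*((y - psi)**2 - 1) and var(Y) = 1."""
    # E[g^2]: var(y - psi) = 1, cov(y - psi, (y - psi)^2) = k3,
    # var((y - psi)^2) = k4 + 2 when the variance is 1.
    numerator = lam**2 + 2.0 * lam * (1.0 - lam) * k3 + (1.0 - lam)**2 * (k4 + 2.0)
    # E[dg/dpsi] = -lam, so the sandwich denominator is lam^2.
    return numerator / lam**2


def optimal_lambda(k3, k4):
    """Closed-form choice quoted in example (ii)."""
    return (2.0 + k4) / (2.0 + k4 - k3)


def grid_minimiser(k3, k4, lo=0.5, hi=3.0, steps=2501):
    """Brute-force minimiser of the asymptotic variance over a grid of lam."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda lam: asymptotic_variance(lam, k3, k4))


if __name__ == "__main__":
    k3, k4 = 2.0, 6.0  # standardised exponential distribution
    print(optimal_lambda(k3, k4), grid_minimiser(k3, k4))
    print(asymptotic_variance(1.0, k3, k4))  # lam = 1 is the sample mean
```

Under these assumptions the grid search and the closed form agree, and the minimised variance works out to $1 - \kappa_3^2/(2 + \kappa_4)$, which is below the value 1 attained by the sample mean whenever $\kappa_3 \neq 0$.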

Ms S. E. Hills (Nottingham University): I would like to make a practical point concerning the construction of orthogonal parameters. The authors have noted the problem that simple explicit solution of the differential equations for the orthogonal parameters may not be feasible, but there is also the case when an explicit solution is possible but the original nuisance parameters cannot be expressed in terms of the orthogonal parameters. An example is the Michaelis-Menten model in nonlinear regression. This model is usually specified as
$$Y_i = \frac{\alpha x_i}{\beta + x_i} + \epsilon_i \qquad (i = 1, \ldots, n)$$

where $\epsilon_i \sim N(0, \sigma^2)$ (assume $\sigma$ known). If $\beta$ is the parameter of interest, then a transformation of the form $(\alpha, \beta) \to (\beta, \lambda)$ is required so that $\beta$ and $\lambda$ are orthogonal. The differential equation to be solved will be
$$\sum_{i=1}^{n} \frac{x_i^2}{(\beta + x_i)^2}\, \frac{\partial \alpha}{\partial \beta} = \alpha \sum_{i=1}^{n} \frac{x_i^2}{(\beta + x_i)^3}$$

with solution
$$\alpha = \lambda \left\{ \sum_{i=1}^{n} \frac{x_i^2}{(\beta + x_i)^2} \right\}^{-1/2}.$$
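
A numerical check of this first case may be helpful. The sketch below takes the model to be $Y_i = \alpha x_i/(\beta + x_i) + \epsilon_i$ with known $\sigma$ and the solution to be $\alpha = \lambda\{\sum x_i^2/(\beta + x_i)^2\}^{-1/2}$, and evaluates the $(\beta, \lambda)$ cross term of the expected information by finite differences. The design points, parameter values and function names are hypothetical.

```python
# Illustrative sketch; names, design points and parameter values are hypothetical.

def s2(beta, xs):
    """Sum of x_i^2 / (beta + x_i)^2 over the design points."""
    return sum((x / (beta + x)) ** 2 for x in xs)


def mean_curve(beta, lam, xs):
    """Michaelis-Menten means written in the (beta, lam) parametrization."""
    alpha = lam / s2(beta, xs) ** 0.5
    return [alpha * x / (beta + x) for x in xs]


def cross_information(beta, lam, xs, sigma=1.0, h=1e-5):
    """(beta, lam) entry of the expected information for a normal
    regression with known sigma, using central differences of the means."""
    up_b, dn_b = mean_curve(beta + h, lam, xs), mean_curve(beta - h, lam, xs)
    up_l, dn_l = mean_curve(beta, lam + h, xs), mean_curve(beta, lam - h, xs)
    total = 0.0
    for ub, db, ul, dl in zip(up_b, dn_b, up_l, dn_l):
        total += ((ub - db) / (2 * h)) * ((ul - dl) / (2 * h))
    return total / sigma ** 2


def original_cross_information(alpha, beta, xs, sigma=1.0):
    """(alpha, beta) entry of the expected information in the original
    parametrization, for comparison."""
    return -alpha * sum(x ** 2 / (beta + x) ** 3 for x in xs) / sigma ** 2


if __name__ == "__main__":
    xs = [0.5, 1.0, 2.0, 4.0, 8.0]
    print(cross_information(2.0, 3.0, xs))           # numerically zero
    print(original_cross_information(1.5, 2.0, xs))  # clearly non-zero
```

The cross term vanishes up to finite-difference error in the new parametrization, whereas the original $(\alpha, \beta)$ cross term does not.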


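It is trivial to write $\alpha$ (and hence the likelihood function) in terms of $\beta$ and $\lambda$. If $\alpha$ is the parameter of interest, then a transformation of the form $(\alpha, \beta) \to (\alpha, \lambda)$ is required so that $\alpha$ and $\lambda$ are orthogonal. The differential equation to be solved will be
$$\alpha \sum_{i=1}^{n} \frac{x_i^2}{(\beta + x_i)^4}\, \frac{\partial \beta}{\partial \alpha} = \sum_{i=1}^{n} \frac{x_i^2}{(\beta + x_i)^3}$$

with solution
$$b(\lambda) = \alpha^3 \sum_{i=1}^{n} \frac{x_i^2}{(\beta + x_i)^3}.$$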

Theinverse this of transformation explicit therefore notpossible write likelihood is not and itis to the in theform A). l(ac, Canthe authors anyguidance towhen type situation occur howtoovercome give as this of will and it? Dr C. J.Skinner of I should to comment theroleoftheconcept (University Southampton): like on ofparameter in with orthogonalitymodel to reference theregression robustness, particular example. Let MO = {f(y; 4, f0); 0 E (D} denote specified a model. Thenit appears be ofsomeinterest to to study of 'orthogonal perturbations' M? within broadermodelsoftheform = {f(y; 4,O); 4 E D,t E T} M where E T, ,/, orthogonal 4 locally MOand 4 retains interpretation free / (c.f. is to on an in M q0 of 3.5).For if fT, indexing true the model assumed lieinM, is within 1/2) of f0 to as 0(nthen, in2.2,40),0, allwithin MLE of within is,inthis MO robust localperturbations to Op(n 1) andthe 4 sense, 4OT, and f are For example, MR be theclassofregression let Y' models - N(4)1 I x,o, 40)in Section andlet + 3.5 = MO refer a specific to choice'i, . Then+* = (43/02, ..*, 4)/02) has an interpretation of * free (increasing by r,/02 thesameeffect Y as increasing byoneunit, x2 has on whatever valueof ) the x, a of and,being function 2.., A, (5),is approximately in orthogonal f. Hence, thesense to in above, MLE of4Vin MOis robust localperturbationsM? in MR. to of Thisproperty becompared a seemingly may with stronger for result global perturbatons within ofM' thewider classGRofgeneralised linear Y in models, which depends x = (x2,..., x,)only a linear on via = combinationI + z x,+4, 4)1 x+, andfor wider + the classofpoint estimators (4)1, )) which of solve O a maximisation problem max(a, Y /(Yi,a + xib),where function .) is essentially b) the f(., arbitrary. to Subject a suitable oflarge law numbers estimators such converge ($, i), solution max(a, to the of b) an estimating equation: EO(Y,a + xb)with implied
COV[i2(Y $1 + X), X] =0 (1)

of M? in M.

under consistently misspecification in GR. Solomon ofMO (1984) gives special a caseofthis result. The small11 11 may 4) condition be replaced a condition elliptical of by on symmetrythemarginal distribution ofx (c.f. Brillinger, Ruud,1986). 1982; Mr N. G. Poison (Nottingham University): Tonight haveheardhowto makeinferences we about theparameter interest, in thepresence a nuisance of 0, of parameter, The authors A. propose use a to conditional likelihood. havealso heard We profile from Professor Barndorff-Nielsenadvocates who the modified profile likelihood. When model the possesses group a the structure, latter be represented a marginal can as likelihood

wheref2(u,v)= u/(u, UnderGR, (1) reduces an equationofform v)/av. to cov[h4)1+ x+, a + xl), x] = 0 which, assumingd 11 1I 1/@i, small in 3.5,gives first-order as a approximation + 11 /041,4+ cov[h1x+ h2x4, i x] = 0 so that oc+ ifvar(x) 0 and+* = +*. Hencetheglobal > robustness property +* is estimated that

(as mentioned on p. 10), the measure for $\lambda \mid \theta$ being the induced right invariant Haar measure.

This can be used to unify some of the comments about the Bayesian approach already made by Professor Cox, Professor Barndorff-Nielsen and Dr Sweeting. The Bayesian methodology is totally general: integrate out prior beliefs about $\lambda \mid \theta$. An appealing choice of prior when there are nuisance parameters is a reference prior, as defined by Bernardo (1979). It is also used as a reference point for other inferences, as an approximation to weak a priori information about $\lambda \mid \theta$. Applying Bernardo's criterion, orthogonality simplifies the asymptotic posterior for $\lambda \mid \theta$, yielding the result that the above measure is precisely the reference prior for $\lambda \mid \theta$. We therefore have the important theorem that the modified profile likelihood is precisely the Bayesian marginal likelihood with a reference prior for $\lambda \mid \theta$.


The methods of this paper are therefore essentially Bayesian. If the authors were to be consistent and also use a reference prior for $\theta$, they would find complete numerical agreement with reference Bayesian solutions. This has several important implications. First, we note that such priors avoid the marginalisation paradoxes of Dawid et al. (1973). Secondly, all of tonight's examples and previous ones given in the literature permit analytic computations for reference priors, extremely useful for Bayesians. Thirdly, questions raised in Bernardo's paper are also applicable here.

Two examples where the group structure is not present are the hyperboloid model of order 3 and the inverse Gaussian model (Barndorff-Nielsen, 1983). I would like to ask the authors how their methods apply here and if there are corresponding links with a Bayesian answer?

Finally, one of the most important applications of nuisance parameters is to model elaboration (for example, the Box-Cox transformation model). The Bayesian framework allows us to view such questions in a unified manner. Do the authors think their methods can be applied in as unified a manner, for example with $\lambda$ discrete or continuous, as the Bayesian methodology?

Dr Frank Critchley (University of Warwick): My reaction on reading this paper was one of awe and of wonder. "Or" because the authors propose $w^*$ or $w_c$ or $\bar{w}_c$ and wonder because I found myself genuinely wondering: "What does it all add up to?" In particular:

(i) Choice: How are we to choose among the various measures proposed? Are they all the same to $O_p(n^{-1})$? If so, may there be important differences in their leading coefficients (this being the basis put forward for preferring $w_c$ to $w$)? If so, when?

(ii) Practice: What are the relative and, indeed, absolute values of the measures in practice? By parameter orthogonality, the asymptotic conditional distribution of $\hat\theta_1$ given the observed $\hat\theta_2$ is just the asymptotic marginal distribution $N_{p_1}(\theta_1, i_{11}^{-1})$. In going beyond this simple case, the authors appear to be considering sub-asymptotic situations. In any event, this is the common practical situation. The key question here is: Which values of $n$ are sufficiently sub-asymptotic to make the more elaborate procedures worthwhile and yet sufficiently large to retain enough accuracy in the crucial approximation (2) on which rests the key advantage (iv) of parameter orthogonality?

(iii) Operation: How operational is it all? What about vector parameters of interest? How often are the differential equations (4) soluble analytically? When must the invariant $w^*$ be abandoned for the more pragmatic $w_c$ or $\bar{w}_c$? Professor Smith's contribution contains a graphic illustration of the potential losses associated with using these alternative measures.

It would be churlish not also to offer some neutral or positive remarks:

(i) The choice among the measures is indeed a multivariate one. No one measure dominates on all criteria. There are conflicts both between and within matters of principle and matters of practice. Within this latter set, we note the criterion of communicability to the client. Not all of the entries in the criteria by measures array are known (how close is $w^*$ to $w_c$?, ...). Further work would be valuable here.

(ii) Noting the localness of the approximation (2), it might be worth exploring multi-parameter extensions based on approximate global orthogonality in which the (average) size of $i_{\theta_1\theta_2}$ is minimized over some neighbourhood of $\theta = \theta_0$.

(iii) Can the freedom in choosing an orthogonal parameterisation be turned to good effect (e.g. by optimising the accuracy of (2) or the robustness in some sense of the overall procedure)?

(iv) In recently submitted papers, Critchley, Ford and co-workers have shown how strong Lagrangian theory both illuminates the theoretical properties of $w$ and gives substantial computational benefits in calculating interval estimates based on it. It will be of great interest to see how this theory applies to tonight's paper and, in particular, to (10).

(v) There are close links between tonight's paper and the local influence work of Cook (1986).
paper and thelocal influence to Answers any oftheabove questionswould be ofvalue.Without doubt,manyoftheseanswerswill as dependon thecontext, withtheprobableadvantageofi7vc overw whichdependsupon both Ia I b I and i ," beingmathematically of independent 4. In sum,I foundtonight's paper a valuable and thought-provoking to contribution what is one of our subject'smajor problems. is, therefore, surprising It not thatmuchworkremainsto be done.


Dr Ann F. S. Mitchell (Imperial College, London): Amari (1982, 1985) produces the orthogonal parameters of Section 3.2 for regular exponential families by a different approach to that of this paper. For parameter space $\Theta$, the family of densities $\{p(y; \theta), \theta \in \Theta\}$, satisfying the usual regularity conditions, is considered as a manifold in which the parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_r)$ play the role of co-ordinates, the information matrix entries $\{g_{ij}(\theta)\}$ form the metric tensor and the connections are the family of $\alpha$-connections of Amari (1982, 1985). If the manifold is $\pm\alpha_0$-flat for some real $\alpha_0$, there exist dual co-ordinate systems $(\theta, \eta)$ such that $\theta_i$ and $\eta_j$ are orthogonal for $i \neq j$; $i, j = 1, 2, \ldots, r$. The dual co-ordinates are related by Legendre transformations
$$\theta_i = \frac{\partial \phi(\eta)}{\partial \eta_i}, \qquad \eta_i = \frac{\partial \psi(\theta)}{\partial \theta_i},$$

where the potential functions $\psi(\theta)$ and $\phi(\eta)$ are such that
$$g_{ij}(\theta) = \frac{\partial^2 \psi(\theta)}{\partial \theta_i\, \partial \theta_j}, \qquad g^{ij}(\eta) = \frac{\partial^2 \phi(\eta)}{\partial \eta_i\, \partial \eta_j}$$

and
$$\psi(\theta) + \phi(\eta) - \sum_i \theta_i \eta_i = 0,$$

$\{g^{ij}(\eta)\}$ being the entries of the inverse of the information matrix in terms of the parameters $\eta$. Since regular exponential families are $\pm 1$ flat, the results in Section 3.2 for the case $r = 2$ follow at once in general and for the particular case of the normal distribution.

The normal distribution can also be regarded as belonging to an alternative class of distributions, namely the class of elliptic distributions with densities of the form
$$p(y; \mu, \sigma) = \frac{1}{\sigma}\, h\!\left\{ \frac{(y - \mu)^2}{\sigma^2} \right\}$$

for some function $h$ and location and scale parameters, $\mu$ and $\sigma$ ($\sigma > 0$), respectively. The class also includes, for example, the Cauchy, Student's $t$ on $k$ d.f. ($k > 1$) and the logistic distributions. In the multivariate context it has received much attention in studies of robustness of standard multivariate normal procedures. The Cauchy distribution has constant negative curvature for all $\alpha$-values and recent numerical work by Kyriakidis indicates that the logistic is not flat for any value of $\alpha$. However, when flatness can be demonstrated, the dual co-ordinate systems are
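$$\theta = \left( \frac{\mu}{\sigma^2},\; -\frac{1}{2\sigma^2} \right) \quad \text{and} \quad \eta = (a_h \mu,\; a_h \mu^2 + b_h \sigma^2),$$

where $a_h$ and $b_h$ are constants depending on the family under consideration. In particular, the Student's $t$ family on $k$ d.f. is $\pm (k + 5)/(k - 1)$ flat with
$$a_h = (k + 1)/(k + 3), \qquad b_h = k/(k + 3).$$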

Full details of the differential geometry properties of the class of elliptic distributions are given by Mitchell (1986).

The following contributions were received in writing, after the meeting.

Professor Shun-ichi Amari (University of Tokyo): A statistical model $M = \{f_Y(y; \psi, \phi)\}$ forms a geometric manifold with a coordinate system $(\psi, \phi)$ to specify a point (a distribution) in $M$. When one has interests only in $\psi$ but not in $\phi$, the set of distributions $S(\psi_0) = \{f_Y \mid \psi = \psi_0;\ \phi\ \text{arbitrary}\}$ forms a submanifold embedded in $M$. In such a case with nuisance parameters, geometry is more explicit in statistical inference, because the shape (more precisely the m- and e-curvatures) of $S(\psi_0)$ plays an important role. The present paper raises an interesting issue relating the conditionality principle and geometry. The authors propose new test statistics $w^*$ and its simplified version $w_c$. It is interesting and important


to study their characteristics. They are subject to an asymptotic chi-squared distribution, and the tests based on them are first-order efficient, and hence are automatically second-order efficient. The problem is to know the deficiency curve
$$\Delta P_T(t) = \lim_{n \to \infty} n\,[P^*(t) - P_T(t)]$$

of such a test $T$, where $P^*(t)$ is the envelope power function and $P_T(t)$ is the power function of test $T$ at $\psi = \psi_0 + t/\sqrt{n}$. Let us consider the problem in a curved exponential family for simplicity's sake. Then, the critical region of the test based on $w^*$ (or $w_c$) is bounded by a hypersurface determined from $w^*$ = const. ($w_c$ = const.) in the enveloping manifold which is identified with the sample space, where the constant is to be determined from the level condition. In the case of a two-sided test, it is bounded by two submanifolds, where we use the signed root of $w^*$. The characteristics of the test depend on the geometric features (angle and curvatures) of a family of submanifolds $w^*$ = const. (Kumon and Amari, 1983; Amari, 1985).

Here, we should distinguish two problems. One is how to choose the constant. Since $w^*$ (or $w_c$) is not subject to an exact chi-squared distribution, we need to adjust $w^*$ ($w_c$) (or equivalently the constant), such that the level condition (and bias condition in a two-sided case) are satisfied up to the term of order $n^{-1}$, as we do in the Bartlett adjustment. The other problem is concerned with the deficiency curve of a test after the adjustment has been done. The deficiency curve, when we do not know the true value of $\phi$, includes two additional terms; one being proportional to the square of the mixture curvature of $S(\psi_0)$ and the other proportional to the square of the exponential curvature of $M$ itself. Although we do not yet know the characteristics of the proposed tests, I believe that the differential geometrical methods developed in Amari and Kumon (1982), Kumon and Amari (1983) and Amari (1985) provide us with sufficient means of analysing these problems.

A final comment is that, even when there does not exist a global orthogonal parametrization, a locally orthogonal parameter $\lambda$, being orthogonal only at $\psi = \psi_0$, may be sufficient for the present asymptotic purpose. Such a parameter is easily derived as
)t 4,0)(k O,= + E (M*fi(
k,r
-

VOJk)-

We can add quadratic terms in $(\psi - \psi_0)^2$ such that not only the cross terms of the Fisher information but also its derivatives with respect to $\psi$ vanish at $\psi_0$.

Professor A. C. Atkinson (Imperial College, London): A major part of my interest in the work described in tonight's paper centres on the normal transformation model which Box and Cox (1964) write $y^{(\lambda)} = (y^{\lambda} - 1)/\lambda$. With this background it is a nuisance that Cox and Reid choose $\lambda$ to be the nuisance parameter.

1. The numerical example in Section 3.5 demonstrates the advantage of the orthogonal parameterization compared with the form investigated by Bickel and Doksum (1981). However, Box and Cox introduced the normalized transformation $z(\lambda) = (y^{\lambda} - 1)/(\lambda \dot{y}^{\lambda - 1})$, where $\dot{y}$ is the geometric mean of the observations. An appealing property of this transformation is that the dimension of $z(\lambda)$ is that of $y$. The resulting parametrization is approximately orthogonal. Are there other examples where physical arguments lead to a near orthogonal parametrization?

2. The profile log likelihood for the factorial experiment of Section 5.2 is pleasingly parabolic, as are those for several other examples plotted by Cook and Weisberg (1982, Section 2.4). Asymptotic procedures can then be expected to behave well. The other example given by Box and Cox, a factorial experiment on the failure of worsted yarn, however yields a profile log likelihood which is concave around $\lambda = 0$ but convex near $\lambda = +1$ (Atkinson, 1985, Fig. 6.2). What results are available on the concavity of profile log likelihoods? Do the corrections of Section 4.3 improve this curve?

3. There is a striking similarity between (26) and the extremely useful result of Patefield (1977), obtained without the use of parameter orthogonality. Lawrance (1987) uses Patefield's result to obtain a score statistic for transformations which has good distributional properties in the neighbourhood of the null hypothesis. Since this result uses observed information, rather than the expected information of (26), it gives a negative variance when $\lambda_0 = -1$ in the worsted data.
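
As a small illustration of the normalized transformation in point 1, the sketch below computes $z(\lambda) = (y^{\lambda} - 1)/(\lambda \dot{y}^{\lambda-1})$ and the corresponding profile log likelihood for a mean-only normal model. The function names and the mean-only model are assumptions made for illustration; the unit-Jacobian check reflects a standard property of the normalized form rather than a claim made in the contribution.

```python
import math

# Illustrative sketch; function names and the mean-only model are hypothetical.

def geometric_mean(ys):
    return math.exp(sum(math.log(y) for y in ys) / len(ys))


def normalized_box_cox(ys, lam):
    """z(lam) = (y^lam - 1) / (lam * gm^(lam - 1)), with the limit gm*log(y) at lam = 0."""
    gm = geometric_mean(ys)
    if abs(lam) < 1e-12:
        return [gm * math.log(y) for y in ys]
    return [(y ** lam - 1.0) / (lam * gm ** (lam - 1.0)) for y in ys]


def log_jacobian(ys, lam):
    """Sum of log(dz_i/dy_i) with the geometric mean held fixed; zero by construction."""
    gm = geometric_mean(ys)
    return sum((lam - 1.0) * (math.log(y) - math.log(gm)) for y in ys)


def profile_loglik(ys, lam):
    """Profile log likelihood (up to a constant) for a mean-only normal model
    fitted to z(lam); no extra Jacobian term is needed for the normalized form."""
    z = normalized_box_cox(ys, lam)
    n = len(z)
    zbar = sum(z) / n
    rss = sum((v - zbar) ** 2 for v in z)
    return -0.5 * n * math.log(rss / n)


if __name__ == "__main__":
    ys = [1.2, 3.5, 0.8, 2.0, 5.1]
    for lam in (-1.0, 0.0, 0.5, 1.0):
        print(lam, profile_loglik(ys, lam))
```

Because the Jacobian term is absorbed into the scaling by the geometric mean, profile log likelihoods at different values of $\lambda$ can be compared directly through their residual sums of squares.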


Mr B. J. R. Bailey (University of Southampton): I should like to compliment the authors on their generous provision of examples and here add yet another, but one based on discrete variates.

Suppose $X$ and $Y$ have independent binomial distributions such that $X \sim b(m, \theta_1)$ and $Y \sim b(n, \theta_2)$. If the parameter of interest is the odds ratio $\psi = \theta_1(1 - \theta_2)/\{(1 - \theta_1)\theta_2\}$ then the application of equation (4) leads to the orthogonal parameter $a(\lambda) = m\theta_1 + n\theta_2$. Setting $a(\lambda) = \lambda$, and maximizing the likelihood over $\lambda$, for fixed $\psi$, yields the conditioning statistic $\hat\lambda_\psi = x + y$, the usual ancillary statistic for this problem.

On the other hand, in epidemiology, the parameter of particular interest is the risk ratio $\psi = \theta_1/\theta_2$. Orthogonal to this is $\lambda = (1 - \theta_1)^m (1 - \theta_2)^n$, or any function of $\lambda$ such as the logarithm, and $\hat\lambda_\psi$ can be found explicitly as a slightly cumbersome function of $\psi$. However, conditioning on $\hat\lambda_\psi$ is extremely severe in that the possible values of the pair $(x, y)$ generally lead to different values of $\hat\lambda_\psi$, for fixed $\psi$. This is, of course, a problem likely to arise in many discrete cases. Grouping values of $\hat\lambda_\psi$ and then conditioning on the particular group observed does not seem to be practical if this has to be done for several values of $\psi$ and for values of $m$ and $n$ typically in the range 100-200. Is there an easier alternative?

Professor G. A. Barnard (Retired): There can be no general method for "elimination" of "nuisance parameters". The very term harks back to the idea that statistical inference involves an act of will. In a decision problem the form of answer may be governed by our wishes. But the inferences which may be drawn from a data-model combination must be dictated by the data available, along with the logical features of this combination. We may well wish to infer something about $\mu$ without reference to $\lambda$, but the data may not permit this. And if the data do not permit it, we owe it to our clients to say so.

Logical features may simplify a problem. For example we may be interested in the correlation between scores in a verbal test and in a mathematical test. Such scores are often reasonably taken to be bivariate normal, but the means and variances of each score are clearly affected by irrelevant factors, so that analysis of the data must logically be invariant under location and scale changes. Thus invariance considerations lead directly to the sample correlation coefficient as the only quantity of interest, with its marginal distribution providing the relevant likelihood function. Similar invariance features are relevant to the Neyman-Scott problem referred to by the authors, and to other cases.

In location-scale problems, conditioning on the configuration allows us to reduce the data to two pivotals, $t = (\bar{x} - \mu)/s_x$, $z = s_x/\sigma$, with known joint density $\phi(t, z)$. The failure of $\phi(t, z)$ to factorize means that inference about $\mu$ using the marginal distribution of $t$ is subject to the implicit assumption, often overlooked, that the data provide the only usable information about $\sigma$. By suitable choice of $\mu$ (noting the brief hint on p. 339 of Fisher's 1922 Phil. Trans. paper), $\mu$ and $\sigma$ can be made orthogonal, so that possession of a little information concerning $\sigma$, in addition to the data, does not seriously affect inferences about $\mu$. But the case is quite otherwise with the Behrens-Fisher and weighted mean problems. Here the variance ratio parameter $\rho$ should perhaps be called a "confounded nuisance parameter", since inferences about the parameter of interest cannot be separated from statements about $\rho$. Insistence on rigour requires making inferences conditional on $\rho$; but it will often be justifiable to introduce a range of priors for $\rho$. Then so long as we make clear to our clients the assumptions involved, and so long as the problems have enough of a routine character to allow some check on the priors used, inferences in which $\rho$ does not occur explicitly will be permissible.

The "top down" approach of the authors, working down from the asymptotic case, will be very useful in complex cases as indicating the kinds of assumption needed to make inferences of the form required. But an extension to several parameters of the "bottom up" approach of Sprott and Viveros (1984), who attempt to match the log-likelihood function up to the fourth term of its Taylor expansion, would also seem worth exploring.

Dr A. C. Davison (Imperial College, London): Professors Cox and Reid refer in the final section of their thought-provoking paper to the prediction of future observations. I would like to point out a curious similarity between the modified profile log likelihood (10) and recent work on predictive likelihood.

Suppose that the random variable $Y = y$, with probability density function $f(y \mid \theta)$, has been observed, and that the unobserved random variable $Z$ with conditional density $f(z \mid y, \theta)$ is to be predicted. The parameter $\theta$ is unknown. In Davison (1986, equation 6) I suggest as an approximate predictive likelihood for the predictand the exponent of
$$\log f(z, y \mid \hat\theta_z) - \tfrac{1}{2} \log \det j_{\theta\theta}(\hat\theta_z), \qquad (*)$$

regarded as a function of $z$. Here $\hat\theta_z$ and $j_{\theta\theta}$ are the maximum likelihood estimate of $\theta$ and observed


information based on both $y$ and $z$. Expression (*) looks very like the modified profile log likelihood (10) with $z$ and $\theta$ replacing $\psi$ and $\lambda$ of Cox and Reid, though of course the logical status of the unknown value $z$ of the random variable $Z$ differs from that of the unknown but fixed parameter $\psi$. In replying to the discussion of his paper, Butler (1986a) shows how to make (*) invariant to reparametrization of $\theta$ by adding to it
$$\log \det j_{\theta\theta}(\hat\theta_z) - \tfrac{1}{2} \log \det KK', \qquad (\dagger)$$
where $K = \partial^2 \log f(z, y \mid \theta)/\partial\theta\, \partial(y, z)$, evaluated at $\theta = \hat\theta_z$. The first term of (*) is the profile predictive log likelihood suggested by Mathiasen (1979) and Lejeune and Faulkenberry (1982). In many situations the relative contributions to the predictive log likelihood from the second term of (*) and from ($\dagger$) are small. Fig. D2 shows such a case, comparing the 3 terms for prediction of the maximum of $m = 10$ annual maximum daily river flows, based on a sample of 35 such flows of the River Nidd at Hunsingore Weir. The model is that the annual maxima are a sequence of independent observations with a common generalized extreme-value distribution. The modifications to the profile predictive likelihood from the second term of (*) and ($\dagger$) are in this case negligible, though as $m$ increases so does the effect of ($\dagger$).

As far as I know it is not yet clear how in general to base predictive confidence regions for $Z$ on (*), though some progress in this direction has been made recently by Butler (1986b). Perhaps soon a first- and second-order asymptotic theory for prediction, as well as estimation, will be available for the likelihood function.

Fig. D2. Information comparison for River Nidd data, $m = 10$ (horizontal axis: $z$, in m$^3$/s). Shown are profile predictive log likelihood (solid line), information matrix contribution $-\tfrac{1}{2}\log\det j_{\theta\theta}(\hat\theta_z)$ (small dashes), and Jacobian contribution $\log\det j_{\theta\theta}(\hat\theta_z) - \tfrac{1}{2}\log\det KK'$ (longer dashes).

Professor D. A. S. Fraser (York University; Universities of Toronto and Waterloo): Conditional inference, introduced by Fisher, but generally neglected, nurtured tenuously through connections to fiducial, ancillarity, and structural, is now receiving the attention it has seemingly long deserved and the present paper is a welcome and thorough examination of aspects of the topic.

The first three sections propose the information orthogonalization of nuisance parameters to a primary real parameter in order to obtain asymptotic independence for the corresponding m.l.e. estimates; this leads to the analysis of the primary parameter conditional on estimates of the nuisance parameters. Unfortunately, the authors do not directly pursue such conditional inference, which involves a real variable and a real parameter with some minimum effect from nuisance parameters. Such inference


is direct and straightforward on-the-real line, leading to tests and confidence regions, and a likelihood criterion is not needed.

Some recent work on conical tests (Massam and Fraser, 1985; Skovgaard, 1986) and on fibre analysis (Fraser, 1986) lead (in joint work with a Toronto colleague) to a sample space development of one-dimensional conditional tests; these seem to show agreement with the orthogonal-parameter approach when it is available. The majority of the examples in Section 3 are location/transformation models; some compounding of conditional distributions shows promise for further reducing the effect of nuisance parameters.

Sections 4 and 5 develop modifications to profile likelihood to obtain a likelihood assessment of parameter values, but do not provide conditional tests or confidence regions in any direct sense: the 'conditional inference' in the title of the paper might reasonably be changed to 'conditional likelihood'. The modifications to profile likelihood represent an insightful use of conditional distributions to address the difficulties found with profile likelihood itself.

The vectors $a$ for the regression model as given in Section 3.5 are of length $n$, which indicates modification in some formulas. The log-likelihood ratio statistic is essentially a negative log likelihood; thus in several places 'conditional (profile) likelihood' needs to have 'ratio statistic' added to be correct and not misleading.

Dr P. Harris (Liverpool Polytechnic): I have enjoyed reading this paper, and would like to make two brief comments. The first concerns discussion point (i) of Section 6 of the paper, namely the possibility that a Bartlett adjustment factor might exist for the test statistics introduced in Section 4. In particular consider the test statistic given at (11)
fy(O0)

= w(fr) + S,

wherew(qlo)is givenat (6) and S = log det{hj(i/i, 20)} - log det{uzz(/,2)}. A) value(qlo, to give parameter and let For convenience Abe a scalarparameter, expandS about thetrue S = (j/AJ -y u + (1AAA) -YAAU+ Y + Op(n3/2)

= all and = yAAA /), uA Vn(2 - 2o), yqAAn- 31203l/(aOfa2), = n- 312031/at3 Y denotes of arisingin theexpansion. the Op(n-1) terms appearsto introduce which v The term = (jA)- 1yAu,,U,,u, remains Op(n 1/2) whenil and Aare orthogonal, the whichnot onlyprevents calculation a of intotheO(n- 1) partofthenulldistribution iJ(qIf), quantity of in the beingexpressed but ofa Bartlett factor, also prevents O(n- 1) terms thenulldistribution f1v,(qo) arisesbecause the Op(n-1/2) part of theexpansionof S variables.The difficulty as sums of chi-squared than themore usual UiUjUk (i,j, k = i/ or A). containsa termlinearin ui(i= i or A),rather has the form of function OIJOO) of Assumingthe orthogonality / and A, the momentgenerating M(t) = (1 - 2t) 1/2{1 + (24n)-(4P + tOQ)} whereQ = 6(iZ,A) 2iA i**, iAA, = VnE(yz,), 4 = 2t(1- 2t) The of of of function thecumulants thederivatives thelog likelihoodfunction. and P is a complicated p; factor, ifQ = 0 thenp = 1 + (12n)-'P. For adjustment the termQ prevents calculationofa Bartlett of =0, so thatthe arrayof expectedvalues of the thirdderivatives the log Q to be zero we need i =, to factor be available. conditionfora Bartlett an needs to satisfy orthogonality likelihoodfunction to is My second comment thatif il is orthogonal A,thenin testing :/ = *? Ho: where = /n(i u,, (1) 2{1(&,2)- l(il0,2)} is when the null hypothesis withone degreeof freedom has the asymptotic chi-squareddistribution upon the use of (1), or a conditionalversionbased upon the true.Have the authorsany comments il of distributions y given1, in testing = *l ? conditional theymay be maximumlikelihoodestimator is not used in eitherteststatistic As the restricted 20 in where 0 is awkwardto estimate. convenient situations
WA(qo) =

Dr C. J. Lloyd (University of Melbourne): Barndorff-Nielsen's (1983) modified profile likelihood

$$L_{MP}(\psi) = L_P(\psi)\,|\partial\hat\lambda/\partial\hat\lambda_\psi|\,|j_{\lambda\lambda}(\psi, \hat\lambda_\psi)|^{-1/2}$$

has the nice property that it can be seen as an approximation to either a suitable conditional or marginal likelihood function. In particular, if we condition on $\hat\lambda_\psi$, then the approximate likelihood is conditional as discussed in 4.1. The orthogonal parametrisation corresponds to choosing $\lambda$ so that $\partial\hat\lambda/\partial\hat\lambda_\psi = 1 + O(n^{-1})$, and $j_{\lambda\lambda}$ is the $\lambda$ information either for fixed $\psi$ or ignoring $\psi$. This alone would seem


to justify the idea of orthogonality for giving the profile likelihood a standard form.

It would be nice to look at the conditional profile likelihood from the dual/marginal point of view. The most likely course is to let $Z(\hat\psi, \hat\lambda_\psi; \psi, \lambda) = F(\hat\psi \mid \hat\lambda_\psi; \psi, \lambda)$, where $F$ is a conditional distribution function, and substitute $\hat\lambda_\psi$ for $\lambda$ and consider the marginal likelihood of $Z$.

The distribution of $\hat\psi$ given $\hat\lambda_\psi$ does not generally lead to an unbiased score function, as pointed out by Lindsay (1983). The bias term is in fact $\log(\partial\hat\lambda/\partial\hat\lambda_\psi)$, so that bias is reduced when $\psi$, $\lambda$ are orthogonal. (This term is omitted in $w_c^*$.) There is also a problem of ambiguous inference in choosing the free statistic to be conditioned on. For example, if $(X, Y)$ are independent normal with means $(\lambda\cos\psi, \lambda\sin\psi)$ and variances 1, then $\hat\lambda(\psi) = \bar{X}\cos\psi + \bar{Y}\sin\psi$ and $Z(\psi) = \bar{X}\sin\psi - \bar{Y}\cos\psi$, the latter with $N(0, 1/n)$ distribution. Now $\lambda$ and $\psi$ are orthogonal and it turns out

$$w_c(\psi) = nZ(\psi)^2,$$

so the estimating equation is $Z(\psi) = 0$, which is unbiased. The density of $\bar{X}$ given $\hat\lambda(\psi)$ is $N(\hat\lambda(\psi)\cos\psi,\ n^{-1}\sin^2\psi)$, giving the conditional log likelihood

$$-2\log|\sin\psi| - nZ(\psi)^2,$$

so the estimating equation is

$$\partial\log|\sin\psi|/\partial\psi + nZ(\psi)\hat\lambda(\psi) = 0,$$

which is not unbiased. Also, if we use the density of $\bar{Y}$ given $\hat\lambda(\psi)$ a different likelihood results. The distribution of $\hat\psi$ given $\hat\lambda(\psi)$ leads to a conditional likelihood which differs from $w_c$ by the term $\log\cos(\hat\psi - \psi)$, which is the $O(n^{-1})$ term $\log\{\partial\hat\lambda(\psi)/\partial\hat\lambda\}$. Finally, an unbiased equation can be obtained by substituting for $\hat\lambda(\psi)$ after differentiation of the conditional likelihood, and this gives

$$\frac{\partial\log|\sin\psi|}{\partial\psi}\,[nZ^2(\psi) - 1] - nZ(\psi)\hat\lambda(\psi)$$

in the present example. This procedure always gives an unbiased equation under regularity; however it is not clear whether its solution corresponds to maximising any sensible objective function.
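The bias claims in Dr Lloyd's circular-mean example can be checked by simulation. The sketch below is an illustration only: it assumes the sample means are independent $N(\lambda\cos\psi, 1/n)$ and $N(\lambda\sin\psi, 1/n)$, takes the three estimating functions to be $Z(\psi)$, $\cot\psi + nZ(\psi)\hat\lambda(\psi)$ and $\cot\psi\,[nZ(\psi)^2 - 1] - nZ(\psi)\hat\lambda(\psi)$, and uses hypothetical function names and parameter values. At the true $\psi$ the first and third should average to about zero, while the second should average to about $\cot\psi$.

```python
import numpy as np


def estimating_functions(xbar, ybar, psi, n):
    """Evaluate three candidate estimating functions for psi."""
    lam_psi = xbar * np.cos(psi) + ybar * np.sin(psi)   # lambda-hat(psi)
    z = xbar * np.sin(psi) - ybar * np.cos(psi)         # Z(psi), N(0, 1/n)
    cot = np.cos(psi) / np.sin(psi)                     # d log|sin psi| / d psi
    g_orthogonal = z                                    # from w_c = n Z^2
    g_conditional = cot + n * z * lam_psi               # conditional score form
    g_adjusted = cot * (n * z**2 - 1) - n * z * lam_psi # substituted after differentiation
    return g_orthogonal, g_conditional, g_adjusted


def simulated_means(psi=0.7, lam=2.0, n=20, reps=200_000, seed=0):
    """Monte Carlo averages of the estimating functions at the true psi."""
    rng = np.random.default_rng(seed)
    sd = 1.0 / np.sqrt(n)
    xbar = rng.normal(lam * np.cos(psi), sd, size=reps)
    ybar = rng.normal(lam * np.sin(psi), sd, size=reps)
    return [float(g.mean()) for g in estimating_functions(xbar, ybar, psi, n)]


if __name__ == "__main__":
    print(simulated_means())
```

The averages are Monte Carlo estimates, so they match the theoretical values only up to simulation error.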

Professor T. A. Louis (Harvard School of Public Health): Whoever invented the label "nuisance parameter" was right on the mark. Parameters that are not of direct interest plague the analyst, whatever approach is taken. The frequentist needs to condition, profile, or ignore; the Bayesian requires priors that may be difficult to pin down. Cox and Reid have helped expose these difficulties and the simplification provided by parameter orthogonality. For example, their analysis of the normal transformation model helps clarify the controversy concerning the decision to incorporate or ignore uncertainty in estimating $\psi$ (the transformation parameter). I read their message as suggesting that any reasonable approach will work just fine if the parameters of interest are orthogonal to $\psi$. Individuals may have philosophical arguments, but their inferences will be similar.

Divorced from computational issues, these results imply that putting weak (ignorance) priors on nuisance parameters, and marginalizing, should produce reasonable frequentist inferences when (approximate) orthogonality holds. When it does not hold, uncertainty due to the nuisance parameters gets incorporated but the results depend to a greater degree on the prior. Generally, the applied context dictates parameters of interest, though (approximate) orthogonality may have a role in teaching us how to think about the application, much as do natural parameters in exponential families. This viewpoint makes the vector parameter case similar to the scalar case. Either approximate orthogonality holds or inferences are more difficult.

The paper succeeds at identifying an important research agenda. Section 4.1 presents a technical development underlying a radical form of double conditioning that should generate research intensity similar to that following Cox's 1972 approach to survival analysis.
The relation among profile, conditional, marginal, partial, and canonical likelihoods begs further study, as does the success of the new approach in incorporating the effects of vector nuisance parameters, and analysing parametric empirical Bayes models. Though this report is unlikely to have immediate impact on statistical practice, it adds to our understanding of the issues and approaches for inference in the presence of nuisance parameters. Generated research should have a large impact, and the authors are to be congratulated.

Dr J. N. S. Matthews (University of Oxford): I would like to raise a couple of issues from an area where the methods of this paper may find application.

In the analysis of continuous data from crossover trials ($n$ subjects, $p$ periods with $p$ typically between 3 and about 10), a model that is frequently used is:


$$y = X\beta + \varepsilon.$$

Here $X\beta$ includes period and treatment effects and, crucially, a parameter for each subject. Moreover it is usually assumed that $\varepsilon \sim N(0, \sigma^2 I_{np})$. However, it is often felt that this could lead to an inefficient analysis as there is likely to be within-subject dependence in the error term, making

$$\mathrm{var}(\varepsilon) = \sigma^2 I_n \otimes V(\rho_1, \ldots, \rho_k)$$

more realistic.

As was pointed out in the paper, $\beta$ and $(\sigma^2, \rho_1, \ldots, \rho_k)$ are orthogonal. This gives some comfort to users of the simpler model, as it means that changes in the specification of the dispersion matrix will have little effect on estimates of treatment contrasts. However we need estimates of the standard errors of these contrasts; can the authors clarify how much comfort they can offer on this point?

The second issue concerns the estimation of the parameters in the dispersion matrix, in particular when $k = 1$ and $V = V(\rho)$ corresponds to a stationary first-order autoregressive process. The need to estimate subject effects leads profile likelihood methods astray, giving badly biassed estimates of $\rho$. Patterson and Thompson (1971, 1974) overcame a similar problem in the estimation of variance components by using their method of restricted maximum likelihood. Applying this type of approach to the problem of estimating $\rho$ gives some interesting results, but there is room for considerable improvement, especially for larger positive values of the true autocorrelation coefficient and smaller values of $p$.

As the root of the problem is the presence of so many nuisance parameters, it seems possible that conditional profile likelihood methods may be able to contribute to this problem; I would be interested in the authors' views.

Dr P. McCullagh (University of Chicago): This is a comprehensive and detailed paper that deserves careful study. I have only two comments to make at this stage, both very brief.

First, orthogonality of the canonical parameter and the complementary expectation parameter in exponential families was demonstrated by Huzurbazar (1956), at least for two-parameter families.

Second, I'd be grateful if the authors would elaborate on the non-invariance of (8) under transformation of $\lambda$. How much leeway for transformation does orthogonality permit?

Professor Donald A.
Pierce (Oregon State University): There is one general point which, though simple, may have some significant practical extensions and bearing on the issues here. This has to do with settings where the parameter of interest $\psi$ is of more than one dimension and includes comparative effects. For example, consider the Weibull setting of Section 3.4, extended to the case of samples from several Weibull populations with the same shape parameter. It is easily seen that the vector of parameters consisting of ratios of the various scale parameters is orthogonal to the shape parameter. This is an instance of a more general result on location-scale parameter models, and those such as the Weibull which may be transformed to such. By the method of Section 2.3 a conventional location parameter $\psi$ of any location-scale family can be replaced by a certain quantile, say $\psi + k\sigma$, which is orthogonal to the scale parameter $\sigma$. But then for a multi-sample problem with common scale parameter, differences of these new quantiles are the same as differences of the original ones.

This remains true, I believe, if there are additional nuisance parameters defining the "shape" of the location-scale family. That is, the vector of differences of location parameters is orthogonal to both the scale and shape parameters. The result also extends to more general regression models in that if the location parameters are modelled as $\mu_i = \mu + z_i'\beta$, where $\sum z_i = 0$, then the vector $\beta$ is orthogonal to the scale and shape parameters.

I suspect that these observations, which have been made in special cases before, if not more generally, are of practical importance for the general consideration of orthogonality. It would be helpful for the authors to comment on whether they might be of any particular theoretical interest in the relation to their interesting results on conditional inference.

Mr G. J. S. Ross (Rothamsted Experimental Station): In advocating parameter transformations I have regarded orthogonality in itself as being of minor importance. The shape of the log likelihood
function may deviate dramatically from the quadratic approximation on which many optimisation procedures depend, whereas these procedures compute their own local orthogonalisations. For a Normal sample in the space of $(\mu, \sigma^2)$ likelihood contours are rounded triangles: a cube root transformation on $\sigma^2$ creates symmetry if $\mu = \bar{x}$ but not elsewhere; a better transformation is based on the 8th and 92nd percentiles, which are also orthogonal. For Negative Binomial samples the parameters $\mu$ and $K$ are orthogonal but contours are extremely skew with respect to the MLE: a reciprocal transformation of $K$ is a great improvement (Ross and Preece, 1985).


In anticipating suitable transformations to improve numerical estimability a concept of 'unrelatedness' rather than of orthogonality is invoked. True orthogonality may depend on both the model and the data, and cannot be achieved until it is too late to be of use, after the MLE has been found. Unrelatedness is like property (iv) of (2.2): a qualitative description of the expected shape of the likelihood function, because there is no reason why the estimate of $\psi$ should be seriously affected by the estimate of $\lambda$. With three or more parameters it is essential to think in this way because graphical aids are of little use. In fitting non-linear curves use can be made of widely spaced points on the curve representing the local position of the curve: they may not be perfectly orthogonal but they are a great improvement on the algebraic defining parameters. In fitting a mixture of two Normal distributions with equal variances but unknown means and proportions we can anticipate that a set of unrelated parameters would take account of (i) the general location of the data, (ii) the overall spread, (iii) the asymmetry relative to a single Normal, and (iv) the separation of the modes. Within the qualitative framework the actual parameters chosen are those that lead to tractable algebraic or algorithmic procedures (Ross 1970, 1975).

The property (ii) of (2.2) is related to measures of ill-conditioning of the dispersion matrix which are more helpful than a simple inspection of correlation coefficients but less complicated than an eigenvector analysis. The product of $i_{\psi\psi}$ with the corresponding element of the dispersion matrix gives an absolute quantity which is 1 for orthogonal parameters and infinity for totally dependent parameters. It is the ratio by which the variance of $\hat\psi$ would be reduced if the values of the other parameters were known, and is thus a very important diagnostic quantity. Provisionally I call this a 'variance multiplier'.

Dr D. A.
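Sprott (University of Waterloo): It is worth mentioning that Fisher (1922) defined the centre of location of a location-scale $(\theta, \sigma)$ family to be $\phi = \theta + k\sigma$ such that (1) is satisfied by $(\phi, \sigma)$.

Fisher (1961a, 1961b) also presented an "exact" solution to the weighted means problem for $q = 2$ samples. In this solution, the factor $n_j - 2$ in the expression for $w_j(\psi)$ is replaced by $n_j$. The likelihood produced by Kalbfleisch and Sprott (1970) has $n_j - 1$. Thus both of these solutions avoid the logical difficulty of being unable to cope with samples of size $n_j = 2$. It would therefore be of some interest to compare the frequency properties of these three solutions.

The solution involving $n_j - 2$ seems to have gained support because of Neyman and Scott's (1948) demonstration that it is more "efficient". However, their definition of the efficiency of an estimate $\hat\mu$ is in terms of its asymptotic variance $\sigma_{\hat\mu}^2$, a function of the $\sigma_j$'s. This is relevant only for asymptotic $N(0, 1)$ pivotals $(\hat\mu - \mu)/\sigma_{\hat\mu}$, which, being functions of the $\sigma_j$'s, are appropriate for estimating $\mu$ only when the $\sigma_j$'s are known. Such pivotals, and their efficiencies, would seem irrelevant for estimating $\mu$ when the $\sigma_j$'s are unknown, as in the weighted means problem. Thus the problem of assessing the behaviour of various solutions to this problem still seems open.

Finally, Viveros (1985) and Viveros and Sprott (1986) have used a different approach based on approximating the observed family of log likelihoods, up to the quartic term of their Taylor expansion, by simple functions like $\log t$ or $\log F$. This results in pivotals like $(\hat\theta - \theta)I_o^{1/2}$ having approximate $t$ or $\log F$ distributions, where $I_o$ is the observed Fisher information. This has produced very accurate results in small samples. In fact, in a location-scale model similar to that of Section 4.2.2, the approximating $\log F$ distribution of one of the pivotals is graphically indistinguishable from its exact conditional distribution.

Mr Jonathan Tawn (University of Surrey): I would like to congratulate the authors on this interesting and motivating paper. The aspect of the paper which has particularly interested me is the use of orthogonal parameters.

The authors suggest that we orthogonalise the nuisance parameters to the parameter of interest by using global orthogonality with respect to expected Fisher information. The key property of this orthogonality for conditional inference being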
F distribution one ofthepivotals graphically of indistinguishable itsexactconditional the Mr Jonathan I of Tawn (University Surrey): wouldliketo congratulate authorson thisinteresting has me interested is theuse oforthogonal and motivating paper.The aspectofthepaperwhich particularly parameters. of by to the The authorssuggestthatwe orthogonalise nuisanceparameters the parameter interest The key property this of with respectto expected Fisher informaton. using global orthogonality for inference being orthogonality conditional
A.~-

= OP 2?p=o( if

The obvious practical problem with this form of orthogonality being equations (4) are often impossible to solve in closed form. This suggests that maybe we should look for another concept of reparametrization which has property (*).

An interesting example which motivates the rest of my discussion arises in the non-regular estimation problem when the parameter of interest is the endpoint of a distribution. Suppose that the probability density $f(x; \psi, \lambda)$ is of the form
ac(x_x-i)
asx],if

0if

x< /i.

1<x<2



A.


Here the endpoint of the distribution exhibits 'orthogonality' properties, namely

and $\hat\psi$ and $\hat\lambda$ are asymptotically independent (Smith, 1985). This situation is not unique; similar orthogonality is obtained in bivariate extreme value theory involving a parameter on the boundary of the parameter space. In each case no concept of orthogonality in the expected information is possible as this is infinite.

What the examples suggest is that we work with orthogonality of the observed information, leading to data dependent parametrization. If we had global orthogonality of the observed information then

unfortunately such a situation can rarely be achieved. One particular case in which this can be achieved is example 3.1 in the paper.

10) Op(n-

Here

O=2.

I+

,n=

2,

q=

'Y

As iln does not tend to the authorsparametrization leads us to questionthe optimality their this of parametrization. I suggest use oflocal orthogonality theobserved the of information (i, 2).Undercertain at conditions property stillholds,hencewe obtain an equivalentof(4) (*)

Z]
t <1'| Hence,ifax#1, ,#: 1.

( Q~)X IQ()

s= 1

q.

We illustrate flexibility the solutionis thetwo parameter the of case = -(t

,=s(Y)=

S*(Y) say.

(/)1-GC=

lfl-P S*(Y) +n ).

say.

where $\alpha$ and $\beta$ can be chosen to make the parametrization a valid one, and to satisfy the conditions on $\lambda$. With such flexibility in the solution can other conditions be imposed on the parametrization to make it more optimal?

Drs S. H. Moolgavkar and R. L. Prentice (Fred Hutchinson Cancer Research Centre, Washington State, USA): The problem of parameter orthogonalization has a general solution provided by the theorem of Frobenius in differential geometry (Boothby, 1975, page 159). Let $(\psi, \phi)$ be a parametrization with $\psi$ the $k$-dimensional parameter of interest, $\phi$ an $(n - k)$-dimensional nuisance parameter. Then, it follows from Frobenius' theorem that a necessary and sufficient condition for the existence of a parametrization $(\psi, \lambda)$ with $\psi$ orthogonal to $\lambda$ is that the (vector) space of all vector fields orthogonal to $\phi$ (with respect to the Fisher information metric) be a Lie algebra. When $\psi$ is one-dimensional this condition is trivially satisfied.

As an example, consider an exponential family with density $f(y, \theta) = g(y)\exp\{y \cdot \theta - K(\theta)\}$, and suppose $\psi = (\theta_1, \theta_2, \ldots, \theta_k)$, $\phi = (\theta_{k+1}, \ldots, \theta_n)$. Then, for any (coordinate) tangent vector $\partial/\partial\theta_i$ and any arbitrary vector field $X$, the inner product with respect to the Fisher information metric, $\langle X, \partial/\partial\theta_i \rangle$, is given by

$$\langle X, \partial/\partial\theta_i \rangle = \langle \partial/\partial\theta_i, X \rangle = X\,\partial/\partial\theta_i\,(K(\theta)).$$

Now consider vector fields $X_1$, $X_2$ such that $\langle X_1, \partial/\partial\theta_i \rangle = \langle X_2, \partial/\partial\theta_i \rangle = 0$ for $i = k + 1, \ldots, n$. Then

$$\langle [X_1, X_2], \partial/\partial\theta_i \rangle = \langle (X_1X_2 - X_2X_1), \partial/\partial\theta_i \rangle = X_1(X_2\,\partial/\partial\theta_i\,K) - X_2(X_1\,\partial/\partial\theta_i\,K) = 0.$$

Thus the vector fields orthogonal to $\phi$ form a Lie algebra and by Frobenius' theorem there exists a parametrization $(\psi, \lambda)$ with $\psi$ orthogonal to $\lambda$, which is a well known result.

In general, of course, in an orthogonal parametrization $(\psi, \lambda)$, the Fisher information for $\psi$ will depend upon $\lambda$. The theorem of de Rham (Kobayashi and Nomizu, 1963, page 187) provides necessary and sufficient conditions for the Fisher information for each parameter to be independent of the other parameter.
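One familiar instance of the well known result for exponential families is the mixed parametrization, in which a canonical component is paired with the complementary expectation parameter. The sketch below is an illustration under stated assumptions, not the discussants' construction: it takes a two-parameter family with $K(\theta) = \log(1 + e^{\theta_1} + e^{\theta_2})$ (a single trinomial trial), sets $\psi = \theta_1$ and $\lambda = \partial K/\partial\theta_2$, and checks numerically that the off-diagonal element of the Fisher information in $(\psi, \lambda)$ vanishes. The function names are hypothetical.

```python
import numpy as np


def fisher_info_canonical(theta):
    """Hessian of K(theta) = log(1 + sum_j exp(theta_j)), i.e. diag(p) - p p'."""
    e = np.exp(theta)
    p = e / (1.0 + e.sum())
    return np.diag(p) - np.outer(p, p)


def fisher_info_mixed(theta):
    """Fisher information for (psi, lam) = (theta_1, dK/dtheta_2)."""
    i = fisher_info_canonical(theta)
    # Jacobian d(theta_1, theta_2)/d(psi, lam). Since d lam = i12 d theta_1
    # + i22 d theta_2, holding lam fixed gives d theta_2/d psi = -i12/i22,
    # and holding psi fixed gives d theta_2/d lam = 1/i22.
    jac = np.array([[1.0, 0.0],
                    [-i[0, 1] / i[1, 1], 1.0 / i[1, 1]]])
    return jac.T @ i @ jac


theta0 = np.array([0.3, -1.1])      # arbitrary illustrative parameter value
print(fisher_info_mixed(theta0))    # off-diagonal entries are zero up to rounding
```

The diagonal entries are $i_{11} - i_{12}^2/i_{22}$ and $1/i_{22}$, so in this parametrization the information for $\psi$ still depends on $\lambda$, in line with the remark that orthogonality alone does not make each parameter's information free of the other.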
The asymptotic distribution theory for ordinary profile likelihood procedures can be thought of as deriving from the asymptotic marginal distribution of $\hat\psi$. Inference procedures deriving instead from the asymptotic distribution of $\hat\psi$ given $\hat\lambda$ may provide more accurate approximations in moderate-sized


samples since such conditioning makes an accommodation for the difference between the estimated and the true $\lambda$-values. Orthogonality is a natural requirement in this setting in order to minimize any loss of information on $\psi$. Conditioning on $\hat\lambda$ leads to a likelihood ratio statistic which differs from Cox and Reid's expression (8) only in the final term; that is,

$$w_0(\psi_0) = 2[l_y(\hat\psi, \hat\lambda) - l_{\hat\lambda}(\hat\psi, \hat\lambda) - \{l_y(\psi_0, \hat\lambda_0) - l_{\hat\lambda}(\psi_0, \hat\lambda_0)\}].$$

This statistic behaves like (8) for Cox and Reid's examples, but is also invariant under transformations of $\lambda$.

Another approach would be to improve the approximation to the marginal distribution of $\hat\psi$. For example, suppose that the score $S(\psi_0) = \partial l_y(\psi_0, \hat\lambda_0)/\partial\psi_0$ has mean $E_0 = E(\psi_0)$ and variance $V_0 = V(\psi_0)$ at $\lambda = \hat\lambda_0$. One can then readily obtain an asymptotic $\chi^2_1$ distribution for

$$w_M(\psi_0) = 2\{l_y(\hat\psi, \hat\lambda) - l_y(\psi_0, \hat\lambda_0)\} + E_0 V_0^{-1} E_0.$$

Calculation of, or approximation to, $E_0$ and $V_0$ leads to another possibility for adjusting profile likelihood ratio tests.

The authors replied briefly at the meeting and subsequently more fully in writing as follows:

We are very grateful to all those who took part in the discussion for their thoughtful and wide-ranging comments.

Orthogonality of parameters is a major theme of the paper and the discussion contains many valuable comments on this. As we stressed in Section 2.2 a number of distinct, if interrelated, ideas are involved. Global orthogonality is helpful for interpreting the model, and in particular for studying the important topic of model robustness (Sweeting, Skinner, Atkinson, Barnard, Pierce). For example, the orthogonalized expression of the Behrens-Fisher model clearly indicates that the variance ratio is what Professor Barnard calls a confounded nuisance parameter. We agree with Mr Ross's valuable points and particularly that orthogonality on its own may not provide definitive solutions.

We agree with Professor Atkinson that ultimately physical interpretability is more important and that from this point of view approximate orthogonality may often be enough. But from the point of view of interpreting models, data dependent and local definitions are less than ideal, which is one reason for preferring our formulation of the transformation problem to that based on a data dependent scale of measurement. It is clear from Mr Tawn's discussion of Example 3.1 that either observed or expected information orthogonality is more important than property (iv).

As suggested by Professor Amari, for the more technical details involved in deriving "improved" inference for $\psi$ it is likely that local orthogonality is enough, and this may well be the basis of any numerical method of implementing the procedures in generality.

A version of approximate orthogonality based on local expansions can be used to study Miss Hills's question about the second formulation of the Michaelis-Menten model. We write $x_i = \bar{x} + d_i$, $c^2 = n^{-1}\sum d_i^2/\bar{x}^2$ and assume in the expansions that $c^2$ is small.
One findswiththe particular choice in Miss = Hills's notationof a3b(A) nx2) thatwe may take
_ =(X/A)J{ -+ C2(1 A+62)1.

Note that for some purposes, such as computing $\bar{w}$, an explicit expression for $\beta$ is not needed. From the point of view of model interpretation the dependence of the parameterization on the design is a misfortune.

An important aspect of orthogonality that we did not discuss is its interpretation via estimating equations, and the comments of Mr Firth and Dr Lloyd on this topic are most welcome. In a recent paper Liang (1986) discusses this in some detail; he shows that $|d\hat\lambda_\psi/d\psi| = o_p(1)$ and that the score function based on the conditional profile likelihood is, at least in special cases, more nearly unbiased than the score function from the profile likelihood.

As a number of contributors (Barndorff-Nielsen, Critchley, Smith and McCullagh) emphasize, and as we mentioned in the paper, our discussion is not exactly invariant under nonlinear changes in the orthogonalized nuisance parameter $\lambda$, although it is invariant to the order considered in the asymptotic expansions. Professor Smith's example is one where the choice does affect inference about $\psi$, presumably because there is so much uncertainty in the data that the form of the $O_p(n^{-1})$ term is important. Nearness of the problem to nonregularity may also be relevant. It is natural to aim to resolve this nonuniqueness by higher order expansions. While we have explored a number of such possibilities, none we have found so far is totally satisfactory and easily implemented. From one point of view the inclusion of the term $|\partial\hat\lambda_\psi/\partial\hat\lambda|$ is the natural way to restore invariance, thus leading to Barndorff-Nielsen's modified profile likelihood, although, as we discuss below, it is not clear that this is the most appropriate objective. As $|\partial\hat\lambda_\psi/\partial\hat\lambda|$ is often very difficult to calculate, we have investigated transformation of $\lambda$ to make the term


vary as slowly as possible with $\psi$. This leads for one-dimensional $\lambda$ to a differential equation for the preferred parameterization with solution
A*= |{i(AA,
X)/0411A

(P, X)}dX,

where $i_{\psi\psi\lambda} = n^{-1}E(\partial^3 l/\partial\psi^2\partial\lambda)$. For multidimensional $\lambda$ there results a system of $q$ partial differential equations to be solved by the method of characteristics.

We do not know the answer to Professor Barndorff-Nielsen's question whether or when modified profile likelihood is preferable to one of the conditional versions proposed. It is possible to compute our (8) or (9) from the data without explicitly assuming that one of the four sufficiency reductions holds, and if (8) is used the resulting likelihood is constructed from an exact conditional density. However, it may be important for the general theory to be rather explicit about the transformation from the minimal sufficient statistic to the maximum likelihood estimate; furthermore the error in using the approximation to the conditional likelihood in deriving the modified profile likelihood may be negligible for all practical purposes. In at least two examples the conditional likelihood seems to give better results than modified profile likelihood. The first is the weighted means example of Section 4.2.1, begging Professor Sprott's pertinent query about the "correct" solution for this, and the second is the very interesting example suggested by Dr Lloyd, a version of Fieller's problem concerning the ratio of normal means. In this example the exact versions of $w$, $w_c$, $\bar{w}_c$ and $w_c^*$ give identical likelihoods for $\psi$, whereas the modified profile likelihood, calculated from $p(\hat\psi \mid \hat\lambda_\psi)$, gives a different and apparently inferior version; see Dr Lloyd's remarks. However, $\bar{w}_c$ and $w_c$ must be calculated in the $\lambda$ parameterization in which the problem is formulated.

With the current strong interest in the differential geometric aspects of statistics it is pleasing, although not surprising, that there were a number of comments on geometry (Barndorff-Nielsen, Mitchell, Amari, Moolgavkar and Prentice). It seems likely that such considerations will be qualitatively helpful over the general second-order choice of test statistics, and hopefully on the issue discussed below concerning the particular version of orthogonalized nuisance parameter most appropriate. Some aspects of this are described by Amari (1985, Ch. 8). Our discussion, however, has been strongly influenced by the desire to handle particular examples, and here the role of differential geometry is less clear. The complexity of the calculations behind Dr Mitchell's interesting results serves to emphasize the difficulty of handling statistically simple situations. Thus in discussing multidimensional parameters of interest, it is probably easier to check the compatibility equations directly to investigate the possible existence of orthogonalized nuisance parameters than to use the geometric considerations so clearly summarized by Professors Moolgavkar and Prentice.

Professor Amari also raises the possibility of studying the characteristics of the proposed test statistics via differential geometric techniques. We find this a rather daunting task but look forward to further results from him and his associates. One important key question is whether $w_c$ can be adjusted by a Bartlett factor to satisfy Professor Amari's level condition. Dr Harris has summarized his very detailed calculations on this matter and it seems that the form of the $O_p(n^{-1/2})$ term in (21) means that $w_c$ in general cannot be simply adjusted to improve the $\chi^2$ approximation in the required way. In fact it can be shown that the addition of an $O_p(n^{-1/2})$ term to a chi-squared random variable can be corrected by a Bartlett factor only under a very special condition on the conditional variance of the added term.

Several contributions, especially those of Dr Sweeting, Mr Polson and Professor Louis, deal with the relation between our results and a Bayesian approach. Such parallels are valuable and are a natural extension of the work of Welch and Peers (1963).
It must always be of interest to examine problems from different viewpoints, but if the notion of a prior distribution is taken seriously as a way of injecting further knowledge into a discussion, the treatment of orthogonal parameters as independent will be far from inevitable. It would take us too far afield to treat Mr Polson's final question as other than rhetorical. Clearly the Bayesian formalism is very appealing.

Dr Bailey gives an interesting discrete example: see also Cox (1984), where the difference of probabilities, also of epidemiological interest, is briefly discussed. A slightly simpler version concerns the comparison of two Poisson variables, where interest in the difference of the means demands an approximate discussion, the orthogonal parameter being the ratio of the means. This would be appropriate if the first Poisson variable represented background emission alone and the second source plus background. The question raised by Dr Bailey concerning the degree of conditioning appropriate in discrete problems is important and puzzling. In a more general setting the amount of conditioning involved in determining the probability of some specified event is settled by a balance between the selectivity achieved in conditioning and the "noise" introduced by overconditioning, but it is hard to make that notion precise in the present context.

We agree with Professor Barnard that certain questions may have to be regarded as unanswerable,


or that there may be virtually no relevant information in the data; Professor Smith's flat likelihood may be an expression of this. Nevertheless we would be very reluctant to draw a strong distinction between questions that have an "exact" answer within some mathematical formalism and those that have only an approximate answer. Indeed one goal of our paper is to extend somewhat the availability of good approximations and to treat these and exact cases in a way that is unified. The example mentioned above concerning the difference of Poisson means is a case in point. Recent work shows that this is a case where the adjustment needed to ordinary profile likelihood is for most purposes negligible and that the confidence intervals obtained from the profile likelihood are satisfactory.

On the other hand, Dr Matthews's application is an important one where direct use of profile likelihood may be grossly misleading. Ms Marie Cruddas has investigated the simpler problem of estimating the autoregressive parameter $\rho$ in a first-order autoregressive process on the basis of $m$ small samples, the samples having different means but common $\rho$ and variance. Confidence intervals from the modified likelihood procedures perform well in simulations even for $m$ as small as 10, whereas intervals based on unmodified profile likelihood are strongly negatively biased.

We are grateful to Professor Fraser for his sympathetic comments on alternative but related viewpoints. We agree that if a reasonably simple one-dimensional statistic can be found whose distribution, conditional or otherwise, depends only on the parameter of interest then that provides an attractively direct and readily interpreted route to inference. One of the features of the use of confidence intervals via likelihood in such problems as the normal transformation model is that the search for such statistics is by-passed.

Several people (Barndorff-Nielsen, Barnard, Mitchell, Pierce and Sprott) pointed out the interesting orthogonality of $\mu + k\sigma$ and $\sigma$ for suitable $k$ in the location-scale model. As $k$ is a fairly complicated function of the distribution of the ancillary statistic (or of the ancillary statistic itself if observed information is used) Dr Pierce's observation that contrasts in the means are also orthogonal to $\sigma$ is particularly relevant. In the case of just two samples this is a special case of Dr Mitchell's result, because taking differences induces symmetry in the underlying distribution. We are not clear how this applies to the problem of more than two samples.

One motivation for our work was to extend methods that work well for exponential models and it would be interesting to examine their success for transformation models. Related to this are Mitchell's (1986) discussion of the geometry of elliptical models and the intriguing result that Barndorff-Nielsen's formula for the distribution of the maximum likelihood estimate is exact in transformation models and accurate to $O(n^{-3/2})$ in full exponential families. This relates to Mr Polson's question on the hyperboloid and inverse Gaussian models: the group structure is not needed to derive the modified profile and the conditional profile likelihood from saddlepoint approximation.

We are grateful to Dr Davison for a clear summary of the predictive likelihood approach. Formally at least one can identify the random variable to be predicted with the parameter to be estimated and obtain a correspondence between various approximate predictive likelihoods and modified or conditional profile likelihoods. This identification may prove valuable for clarifying aspects of both inference and prediction, and the relationship between them.

Finally it may help, in particular partly to assuage Dr Critchley's anxieties, to set out the broad qualitative objectives of our paper.

(i) Many problems of formal inference can be tackled only via approximate arguments. There are many different procedures equivalent to the first order of asymptotic theory. Often these procedures will in practice give virtually the same answer, but this is not always so and there is thus a need for a second-order approach to clarify the choice.

(ii) We are guided in part by the conditioning procedure that is often effective in exponential family problems, and in part by the need to adjust ordinary profile likelihood for its defects when there are appreciable numbers of nuisance parameters. In some examples the modification of ordinary profile likelihood is negligible, but in others such modification leads to improved likelihood-based inference procedures.

(iii) These considerations lead first to a formulation of the model in which nuisance parameters $\lambda$ are orthogonal to the parameter of interest, and then to a variety of conditional test statistics. Major outstanding questions include whether a requirement in addition to orthogonality can sensibly define $\lambda$ uniquely, and under what conditions any of the conditional procedures can be shown to have good statistical properties. We have concentrated on version (11), $w_v$, because it is often straightforward to calculate, and because its stochastic expansion can be directly compared with that of the profile likelihood ratio. The justification put forth is that among procedures equivalent to the first order (11) is in a certain sense as close as possible to the likelihood procedure for a known value of the nuisance parameters.

While further work is certainly needed before the central objectives are fully met in a simple, easily


implemented and conceptually compelling way, we have been much encouraged by the breadth and depth of the comments on our paper.

One of us (D.R.C.) is grateful for the hospitality of the Department of Statistics, University of Toronto, during some of the work on the reply.
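The remark in the reply that profile likelihood intervals are satisfactory for the difference of two Poisson means can be explored numerically. The following is a minimal sketch, not the authors' calculation: it computes the ordinary (unmodified) profile log likelihood for $\psi = \mu_2 - \mu_1$ from single counts `y1` and `y2`, both assumed positive, and finds the interval on which the likelihood ratio statistic stays below the 0.95 quantile of chi-squared with one degree of freedom. All function names are hypothetical, and the closed form for the constrained estimate of $\mu_1$ is the positive root of the score equation $y_1/\mu_1 + y_2/(\mu_1+\psi) = 2$.

```python
import math

CHI2_1_95 = 3.841459  # 0.95 quantile of chi-squared with 1 degree of freedom


def constrained_mu1(psi, y1, y2):
    """Maximum likelihood estimate of mu1 for fixed psi = mu2 - mu1.

    Solves 2*mu1**2 + (2*psi - y1 - y2)*mu1 - y1*psi = 0 and takes the
    larger root, which keeps both mu1 and mu1 + psi positive when the
    counts y1 and y2 are positive (an assumption of this sketch).
    """
    disc = (2.0 * psi + y1 - y2) ** 2 + 4.0 * y1 * y2
    return ((y1 + y2 - 2.0 * psi) + math.sqrt(disc)) / 4.0


def profile_loglik(psi, y1, y2):
    """Profile log likelihood of psi, dropping terms free of the parameters."""
    mu1 = constrained_mu1(psi, y1, y2)
    mu2 = mu1 + psi
    return y1 * math.log(mu1) - mu1 + y2 * math.log(mu2) - mu2


def deviance(psi, y1, y2):
    """Likelihood ratio statistic for psi; the overall maximum is at y2 - y1."""
    return 2.0 * (profile_loglik(y2 - y1, y1, y2) - profile_loglik(psi, y1, y2))


def profile_interval(y1, y2, crit=CHI2_1_95, tol=1e-10):
    """Interval {psi : deviance(psi) <= crit}, found by bracketing and bisection."""
    psi_hat = float(y2 - y1)

    def solve(direction):
        inner, step = psi_hat, 1.0
        outer = psi_hat + direction * step
        # Step outwards until the deviance exceeds the critical value.
        while deviance(outer, y1, y2) < crit:
            step *= 2.0
            outer = psi_hat + direction * step
        # Bisect between a point inside and a point outside the interval.
        while abs(outer - inner) > tol:
            mid = 0.5 * (inner + outer)
            if deviance(mid, y1, y2) < crit:
                inner = mid
            else:
                outer = mid
        return 0.5 * (inner + outer)

    return solve(-1.0), solve(1.0)
```

For instance, `profile_interval(10, 25)` returns the two values of $\psi$ at which the deviance equals the critical value; they lie on either side of the maximum likelihood estimate 15. The sketch makes no attempt at the modified or conditional profile likelihood discussed in the paper.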
REFERENCES IN THE DISCUSSION

Amari, S. (1985) Differential Geometrical Methods in Statistics. New York: Springer Verlag. (Springer Lecture Notes in Statistics, 28.)
Amari, S. and Kumon, M. (1983) Differential geometry of Edgeworth expansions in curved exponential family. Ann. Inst. Statist. Maths, 34A, 1-24.
Atkinson, A. C. (1985) Plots, Transformations and Regression. Oxford: University Press.
Barndorff-Nielsen, O. E. (1986a) Likelihood and observed geometries. Ann. Statist., 14, 856-873.
Barndorff-Nielsen, O. E. (1986b) Differential and integral geometry in statistical inference. In Differential Geometry in Statistical Inference (IMS Monograph, to appear).
Bernardo, J. M. (1979) Reference posterior distributions for Bayesian inference (with Discussion). J. R. Statist. Soc. B, 41, 113-147.
Bickel, P. and Doksum, J. (1981) An analysis of transformations revisited. J. Amer. Statist. Ass., 76, 296-311.
Boothby, W. M. (1975) An Introduction to Differentiable Manifolds and Riemannian Geometry. New York: Academic Press.
Brillinger, D. R. (1982) A generalised linear model with "Gaussian" regressor variables. In A Festschrift for Erich Lehmann (P. J. Bickel, K. A. Doksum and J. L. Hodges, eds). San Francisco: Wadsworth.
Butler, R. W. (1986a) Predictive likelihood inference with applications (with Discussion). J. R. Statist. Soc. B, 48, 1-38.
Butler, R. W. (1986b) Approximate pivots for prediction. Unpublished.
Cook, R. D. (1986) Assessment of local influence (with Discussion). J. R. Statist. Soc. B, 48, 133-169.
Cook, R. D. and Weisberg, S. (1982) Residuals and Influence in Regression. London: Chapman and Hall.
Cox, D. R. (1972) Regression models and life tables (with Discussion). J. R. Statist. Soc. B, 34, 197-220.
Cox, D. R. (1984) In discussion of paper by Yates, F. J. R. Statist. Soc. A, 147, 451.
Davison, A. C. (1986) Approximate predictive likelihood. Biometrika, 73, 323-332.
Dawid, A. P., Stone, M. and Zidek, J. (1973) Marginalization paradoxes in Bayesian and structural inference (with Discussion). J. R. Statist. Soc. B, 35, 189-233.
Fisher, R. A. (1922) On the mathematical foundations of theoretical statistics. Phil. Trans. Roy. Soc. London, A, 222, 309-368.
Fisher, R. A. (1961a) Sampling the reference set. Sankhya, 23, 3-8.
Fisher, R. A. (1961b) The weighted mean of two normal samples with unknown variance ratio. Sankhya, 23, 103-114.
Fraser, D. A. S. (1986) Fibre analysis and tangent analysis. Submitted to Statistical Papers.
Huzurbazar, V. S. (1956) Sufficient statistics and orthogonal parameters. Sankhya, 17, 217-220.
Kobayashi, S. and Nomizu, K. (1963) Foundations of Differential Geometry, Vol. 1. New York: Interscience.
Kumon, M. and Amari, S. (1983) Geometrical theory of higher order asymptotics of test, interval estimators and conditional inference. Proc. Roy. Soc. London, A, 397, 429-458.
Kyriakidis, E. (1986) M.Sc. Report, Imperial College, London.
Lawrance, A. J. (1987) The score statistic for regression transformation. Biometrika, 74, in the press.
Lejeune, M. and Faulkenberry, G. D. (1982) A simple predictive density function. J. Amer. Statist. Ass., 77, 654-657.
Liang, K. Y. (1986) Estimating functions and approximate conditional likelihood. Technical Report 610, Dept of Biostatistics, Johns Hopkins University.
Massam, H. and Fraser, D. A. S. (1985) Conical tests: observed levels of significance and confidence regions. Statistical Papers, 26, 1-17.
Mathiasen, P. E. (1977) Prediction functions. Scand. J. Statist., 6, 1-21.
Mitchell, A. F. S. (1986) Statistical manifolds of univariate elliptic distributions. Int. Statist. Rev., to appear.
Patefield, W. M. (1977) On the maximised likelihood function. Sankhya B, 39, 92-96.
Patterson, H. D. and Thompson, R. (1974) Maximum likelihood estimation of components of variance. Proc. 8th Int. Biometrics Conf., 197-207.
Prescott, P. and Walden, A. T. (1980) Maximum likelihood estimation of the parameters of the generalized extreme-value distribution. Biometrika, 67, 723-724.
Prescott, P. and Walden, A. T. (1983) Maximum likelihood estimation of the parameters of the three-parameter generalised extreme-value distribution from censored samples. J. Statist. Comput. Simul., 16, 241-250.
Ross, G. J. S. and Preece, D. A. (1985) The negative binomial distribution. The Statistician, 34, 323-336.
Ross, G. J. S. (1975) Simple non-linear modelling for the general user. Proc. 40th Session Int. Statist. Inst., 2, 585-593.
Ruud, P. A. (1986) Consistent estimation of limited dependent variable models despite misspecification of distribution. J. Econometrics, 32, 157-187.
Skovgaard, H. M. (1986) Saddlepoint expansions for directional test probabilities.
Smith, R. L. (1985) Maximum likelihood estimation in a class of nonregular cases. Biometrika, 72, 67-90.
Solomon, P. J. (1984) Effect of misspecification of regression models in the analysis of survival data. Biometrika, 71, 291-298.


Sprott, D. A. and Viveros, R. (1984) The interpretation of maximum likelihood estimation. Can. J. Statist., 12, 27-38.
Sweeting, T. J. (1985) Consistent prior distributions for transformed models. Bayesian Statistics 2, 755-762.
Tierney, L. and Kadane, J. B. (1986) Accurate approximations for posterior moments and marginal densities. J. Amer. Statist. Ass., 81, 82-86.
Viveros, R. (1985) Estimation in small samples. Ph.D. Thesis, Dept of Statist. and Actuarial Science, University of Waterloo.
Viveros, R. and Sprott, D. A. (1986) Allowance for skewness in maximum likelihood estimation with application to the location-scale model. Submitted for publication.
Welch, B. L. and Peers, H. W. (1963) On formulae for confidence points based on integrals of weighted likelihoods. J. R. Statist. Soc. B, 25, 318-329.
