
Experimental High Energy Physics Group

HEN-343
March 13, 2018
NIJMEGEN

Statistical Methods in Data Analysis


W. J. Metzger

Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Radboud Universiteit Nijmegen
Nijmegen, The Netherlands
These notes were initially prepared for a third-year
course for physics and chemistry students at the
Katholieke Universiteit Nijmegen (since September,
2004, known as Radboud Universiteit Nijmegen,
officially translated to ungrammatical English as
Radboud University Nijmegen).
The course was given for the first time in the Spring
semester of 1991. The author would like to thank
the students who followed the course then for their
cheerful forbearance under the sometimes chaotic
appearance of successive installments and multiple
versions of these notes. Since then these notes have
undergone numerous revisions and expansions to the
joy or dismay of later students.
The author would appreciate any comments which
could result in an improvement of the course. In
particular, he would like to know of any errors in the
equations or text.
Contents

1 Introduction
  1.1 Language
  1.2 Computer usage
  1.3 Some advice to the student

I Probability

2 Probability
  2.1 First principles
    2.1.1 Probability—What is it?
    2.1.2 Sampling
    2.1.3 Probability density function (p.d.f.)
    2.1.4 Cumulative distribution function (c.d.f.)
    2.1.5 Expectation values
    2.1.6 Moments
  2.2 More on Probability
    2.2.1 Conditional Probability
    2.2.2 More than one r.v.
    2.2.3 Correlation
    2.2.4 Dependence and Independence
    2.2.5 Characteristic Function
    2.2.6 Transformation of variables
    2.2.7 Multidimensional p.d.f. – matrix notation
  2.3 Bayes’ theorem
  2.4 Probability—What is it?, revisited
    2.4.1 Mathematical probability (Kolmogorov)
    2.4.2 Empirical or Frequency interpretation (von Mises)
    2.4.3 Subjective (Bayesian) probability
    2.4.4 Are we frequentists or Bayesians?

3 Some special distributions
  3.1 Bernoulli trials
  3.2 Binomial distribution
  3.3 Multinomial distribution
  3.4 Poisson distribution
  3.5 Exponential and Gamma distributions
  3.6 Uniform distribution
  3.7 Gaussian or Normal distribution
  3.8 Log-Normal distribution
  3.9 Multivariate Gaussian or Normal distribution
  3.10 Binormal or Bivariate Normal p.d.f.
  3.11 Cauchy (Breit-Wigner or Lorentzian) p.d.f.
  3.12 The χ2 p.d.f.
  3.13 Student’s t distribution
  3.14 The F-distribution
  3.15 Beta distribution
  3.16 Double exponential (Laplace) distribution
  3.17 Weibull distribution

4 Real p.d.f.’s
  4.1 Complications in real life
  4.2 Convolution

5 Central Limit Theorem
  5.1 The Theorem
  5.2 Implications for measurements

II Monte Carlo

6 Monte Carlo
  6.1 Random number generators
    6.1.1 True random number generators
    6.1.2 Pseudo-random number generators
  6.2 Monte Carlo integration
    6.2.1 Crude Monte Carlo
    6.2.2 Hit or Miss Monte Carlo
    6.2.3 Buffon’s needle, a hit or miss example
    6.2.4 Accuracy of Monte Carlo integration
    6.2.5 A crude example in 2 dimensions
    6.2.6 Variance reducing techniques
  6.3 Monte Carlo simulation
    6.3.1 Weighted events
    6.3.2 Rejection method
    6.3.3 Inverse transformation method
    6.3.4 Composite method
    6.3.5 Example
    6.3.6 Gaussian generator

III Statistics

7 Statistics—What is it/are they?

8 Parameter estimation
  8.1 Introduction
  8.2 Properties of estimators
    8.2.1 Bias
    8.2.2 Consistency
    8.2.3 Variance, efficiency
    8.2.4 Interpretation of the Variance
    8.2.5 Information and Likelihood
    8.2.6 Minimum Variance Bound
    8.2.7 Efficient estimators—the Exponential family
    8.2.8 Sufficient statistics
  8.3 Substitution methods
    8.3.1 Frequency substitution
    8.3.2 Method of Moments
    8.3.3 Descriptive statistics
    8.3.4 Generalized method of moments
    8.3.5 Variance of moments
    8.3.6 Transformation of the covariance matrix under a change of parameters
  8.4 Maximum Likelihood method
    8.4.1 Principle of Maximum Likelihood
    8.4.2 Asymptotic properties
    8.4.3 Change of parameters
    8.4.4 Maximum Likelihood vs. Bayesian inference
    8.4.5 Variance of maximum likelihood estimators
    8.4.6 Summary
    8.4.7 Extended Maximum Likelihood
    8.4.8 Constrained parameters
  8.5 Least Squares method
    8.5.1 Introduction
    8.5.2 The Linear Model
    8.5.3 Derivative formulation
    8.5.4 Gauss-Markov Theorem
    8.5.5 Examples
    8.5.6 Constraints in the linear model
    8.5.7 Improved measurements through constraints
    8.5.8 Linear Model with errors in both x and y
    8.5.9 Non-linear Models
    8.5.10 Summary
  8.6 Estimators for binned data
    8.6.1 Minimum Chi-Square
    8.6.2 Binned maximum likelihood
    8.6.3 Comparison of the methods
  8.7 Practical considerations
    8.7.1 Choice of estimator
    8.7.2 Bias reduction
    8.7.3 Variance of estimators—Jackknife and Bootstrap
    8.7.4 Robust estimation
    8.7.5 Detection efficiency and Weights
    8.7.6 Systematic errors

9 Confidence intervals
  9.1 Introduction
  9.2 Confidence belts
  9.3 Confidence bounds
  9.4 Normal confidence intervals
    9.4.1 σ known
    9.4.2 σ unknown
  9.5 Binomial confidence intervals
  9.6 Poisson confidence intervals
    9.6.1 Large N
    9.6.2 Small N — Confidence bounds
    9.6.3 Background
  9.7 Use of the likelihood function or χ2
  9.8 Fiducial intervals
  9.9 Credible (Bayesian) intervals
  9.10 Discussion of intervals
  9.11 Measurement of a bounded quantity
  9.12 Upper limit on the mean of a Poisson with background

10 Hypothesis testing
  10.1 Introduction
  10.2 Basic concepts
  10.3 Properties of tests
    10.3.1 Size
    10.3.2 Power
    10.3.3 Consistency
    10.3.4 Bias
    10.3.5 Distribution-free tests
    10.3.6 Choice of a test
  10.4 Parametric tests
    10.4.1 Simple Hypotheses
    10.4.2 Simple H0 and composite H1
    10.4.3 Composite hypotheses—same parametric family
    10.4.4 Composite hypotheses—different parametric families
  10.5 And if we are Bayesian?
  10.6 Goodness-of-fit tests
    10.6.1 Confidence level or p-value
    10.6.2 Relation between Confidence level and Confidence Intervals
    10.6.3 The χ2 test
    10.6.4 Use of the likelihood function
    10.6.5 Binned data
    10.6.6 Run test
    10.6.7 Tests free of binning
    10.6.8 But use your eyes!
  10.7 Non-parametric tests
    10.7.1 Tests of independence
    10.7.2 Tests of randomness
    10.7.3 Two-sample tests
    10.7.4 Two-Gaussian-sample tests
    10.7.5 Analysis of Variance

IV Bibliography

V Exercises
They say that Understanding ought to work by the rules of right
reason. These rules are, or ought to be, contained in Logic; but
the actual science of logic is conversant at present only with things
either certain, impossible, or entirely doubtful, none of which (for-
tunately) we have to reason on. Therefore the true logic for this
world is the calculus of Probabilities, which takes account of the
magnitude of the probability which is, or ought to be, in a reason-
able man’s mind.
—J. Clerk Maxwell

Chapter 1

Introduction

Statistics is a tool useful in the design, analysis and interpretation of experiments. Like any other tool, the more you understand how it works the better you can use it.
The fundamental laws of classical physics do not deal with statistics, nor with probability. Newton’s law of gravitation, F = G\frac{Mm}{r^2}, contains an exponent 2 in the denominator—exactly 2, not 2.000 ± 0.001. But where did the 2 come from? It came from analysis of many detailed and accurate astronomical observations of Tycho Brahe and others.
In “statistical” physics you have such a complicated situation that you treat it
in a “statistical” manner, although I would prefer to make a distinction between
statistics and probability and call it a probabilistic manner. In quantum mechanics
the probability is intrinsic to the theory rather than a mere convenience to get
around complexity.
Thus in studying physics you have no need of statistics, although in some sub-
jects you do need probability. But when you do physics you need to know what
measurements really mean. For that you need statistics.
Using probability we can start with a well defined problem and calculate the
chance of all possible outcomes of an experiment. With probability we can thus go
from theory to the data.
In statistics we are concerned with the inverse problem. From data we want to
infer something about physical laws. Statistics is sometimes called an art rather
than a science, and there is a grain of truth in it. Although there are standard
approaches, most of the time there is no “best” solution to a given problem. Our
most common tasks for statistics fall into two categories: parameter estimation and
hypothesis testing.
In parameter estimation we want to determine the value of some parameter in a
model or theory. For example, we observe that the force between two charges varies
with the distance r between them. We make a theory that F ∼ r^{−α} and want to
determine the value of α from experiment.
In hypothesis testing we have an hypothesis and we want to test whether that
hypothesis is true or not. An example is the Fermi theory of β-decay which predicts
the form of the electron’s energy spectrum. We want to know whether that is
correct. Of course we will not be able to give an absolute yes or no answer. We
will only be able to say how confident we are, e.g., 95%, that the theory is correct,
or rather that the theory predicts the correct shape of the energy spectrum. Here
the meaning of the 95% confidence is that if the theory is correct, and if we were
to perform the experiment many times, 95% of the experiments would appear to
agree with the theory and 5% would not.
Parameter estimation and hypothesis testing are not completely separate topics.
It is obviously nonsense to estimate a parameter if the theory containing the pa-
rameter does not agree with the data. Also the theory we want to test may contain
parameters; the test then is whether values for the parameters exist which allow
the theory to agree with the data.
Although the main subject of this course is statistics, it should be clear that
an understanding of statistics requires understanding probability. We will begin
therefore with probability. Having had probability, it seems only natural to also
treat, though perhaps briefly, Monte Carlo methods, particularly as they are often
useful not only in the design and understanding of an experiment but also can be
used to develop and test our understanding of probability and statistics.
There are a great many books on statistics. They vary greatly in content and
intended audience. Notation is by no means standard. In preparing these lectures I
have relied heavily on the following sources (sometimes to the extent of essentially
copying large sections):
• R. J. Barlow,1 a recent text book in the Manchester series. Most of what
you need to know is in this book, although the level is perhaps a bit low.
Nevertheless (or perhaps therefore), it is a pleasure to read.

• Siegmund Brandt,2 a good basic book at a somewhat higher level. Unfortunately, the FORTRAN sample programs it contains are rather old-fashioned. There is an emphasis on matrix notation. There is also a German edition.

• A. G. Frodesen, O. Skjeggestad, and H. Tøfte,3 also a good basic text (despite the words “particle physics” in the title) at a higher level. Unfortunately, it is out of print.
• W. T. Eadie et al.,4 or the second edition of this book by F. James,5 a book at an advanced level. It is difficult to use if you are not already fairly familiar with the subject.

• G. P. Yost,6 the lecture notes for a course at Imperial College, London. They
are somewhat short on explanation.

• Glen Cowan,7 a recent book at a level similar to these lectures. In fact, had
this book been available I probably would have used it rather than writing
these notes.

Other books of general interest are those of Lyons,8 Meyer,9 and Bevington.10
A comprehensive reference for almost all of probability and statistics is the three-
volume work by Kendall and Stuart11 . Since the death of Kendall, volumes 1 and 2
(now called 2a) are being kept up to date by others,12, 13 and a volume (2b) on
Bayesian statistics has been added.14 Volume 3 has been split into several small
books, “Kendall’s Library of Statistics”, covering many specialized topics. Another
classic of less encyclopedic scope is the one-volume book by Cramér15 .

1.1 Language

Statistics, like physics, has it own specialized terminology with words whose mean-
ing differs from the meaning in everyday use or the meaning in physics. An example
is the word estimate. In statistics “estimate” is used where the physicist would say
“determine” or “measure”, as in parameter estimation. The physicist or indeed
ordinary people tend to use “estimate” to mean little more than “guess” as in “I
would estimate that this room is about 8 meters wide.” We will generally use the
statisticians’ word.
Much of statistics has been developed in connection with population studies
(sociology, medicine, agriculture, etc.) and industrial quality control. One cannot
study the entire population; so one “draws a sample”. But the population exists.
In experimental physics the set of all measurements (or observations) forms the
“sample”. If we make more measurements we increase the size of the sample, but
we can never attain the “population”. The population does not really exist but is
an underlying abstraction. For us some terminology of the statisticians is therefore
rather inappropriate. We therefore sometimes make substitutions like the following:
“demographic” terminology        “physics” terminology
sample                           data (set)
draw a sample                    observe, measure
sample of size N                 N observations, N measurements
population                       observable space
population mean                  parent mean
                                   = mean of the underlying distribution
population variance, etc.        parent variance, etc.
sample mean                      sample mean = mean of the data
                                   = experimental mean or average

We will just say “mean” when it is clear from the context whether we are referring
to the parent or the sample mean.

1.2 Computer usage


In this day and age, data analysis without a computer is inconceivable. While there
exist (a great many) programs to perform statistical analyses of data, they tend to
be difficult to learn and/or limited in what they can do. Their use also tends to
induce a cook-book mentality. Consequently, we shall not use them, but will write
our own programs (in FORTRAN or C). In this way we will at least know what we are
doing. Subroutines will be provided for a few conceptually simple, but technically
complicated, tasks.
Data is often presented in a histogram (1 or 2 dimensional). Computer packages
to do this will be available.
As an aid to understanding it is often useful to use random numbers, i.e., perform
simple Monte Carlo (cf. Part II). On a computer there is generally a routine which
returns a “pseudo-random” number. What that actually is will be treated in section
6.1.2. An example of such use is to generate random numbers according to a given
distribution, e.g., uniformly between 0 and 1, and then to histogram some function
of these numbers.
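As a concrete sketch of this kind of exercise (added here for illustration; the particular function, seed and binning are arbitrary choices, not part of the course material), the following C program generates uniform random numbers with the standard library routine rand(), transforms them, and fills a simple histogram:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    /* Generate numbers uniform on (0,1], histogram x = -log(u); this x is in
       fact exponentially distributed (cf. the inverse transformation method,
       section 6.3.3).                                                        */
    int main(void)
    {
        enum { NBINS = 10 };
        int hist[NBINS] = {0};
        const double xmax = 5.0;                /* histogram covers 0 <= x < 5 */
        int i;

        srand(12345);                           /* fixed seed, reproducible    */
        for (i = 0; i < 100000; i++) {
            double u = (rand() + 1.0) / (RAND_MAX + 1.0);   /* uniform (0,1]   */
            double x = -log(u);                             /* function of u   */
            int bin = (int)(x / xmax * NBINS);
            if (bin >= 0 && bin < NBINS) hist[bin]++;       /* ignore overflow */
        }
        for (i = 0; i < NBINS; i++)
            printf("bin %2d: %d\n", i, hist[i]);
        return 0;
    }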
Parameter estimation (chapter 8) is often most conveniently done by numerically
finding the maximum (or minimum) of some function. Computer programs to do
this will also be available.

1.3 Some advice to the student


The goal of this course is not to provide a cook book of statistical data analysis.
Instead, we aim for some understanding of statistical techniques, of which there
are many. Lack of time will preclude rigorous proof (or sometimes any proof) of
results. Moreover, we will introduce some theoretical concepts, which will not seem
immediately useful, but which should put the student in a better position to go
beyond what is included in this course, as will almost certainly be necessary at
some time in his career. Further, we will point out the assumptions underlying,
and the limitations of, various techniques.
A major difficulty for the student is the diversity of the questions statistical tech-
niques are supposed to answer, which results in a plethora of methods. Moreover,
there is seldom a single “correct” method, and deciding which method is “best” is
not always straightforward, even after you have decided what you mean by “best”.
A further complication arises from what we mean by “probability”. There are
two major interpretations, “frequentist” (or “classical”) and “Bayesian” (or “sub-
jective”), which lead to two different ways to do statistics. While the emphasis
will be on the classical approach, some effort will go into the Bayesian approach as
well.
While there are many questions and many techniques, they are related. In order
to see the relationships, the student is strongly advised not to fall behind.
Finally, some advice to astronomers which is equally valid for physicists:

Whatever your choice of area, make the choice to live your professional
life at a high level of statistical sophistication, and not at the level—
basically freshman lab. level—that is the unfortunate common currency
of most astronomers. Thereby we will all move forward together.
—William H. Press16
Part I

Probability

“La théorie des probabilités n’est que
le bon sens reduit au calcul.”
(“Probability theory is nothing but common sense reduced to calculation.”)
—P.-S. de Laplace, “Mécanique Céleste”

Chapter 2

Probability

2.1 First principles

2.1.1 Probability—What is it?


We begin by taking the “frequentist” approach. A given experiment is assumed
to have a certain number of possible outcomes or events E. Suppose we repeat
the identical experiment N times and find outcome Ei Ni times. We define the
probability of outcome Ei to be
P(Ei) = \lim_{N \to \infty} \frac{Ni}{N}    (2.1)

We can also be more abstract. Intuitively, probability must have the following properties. Let Ω be the set of all possible outcomes.

Axioms:
1. P(Ω) = 1   (The experiment must have an outcome.)
2. 0 ≤ P(E),   E ∈ Ω
3. P(∪ Ei) = \sum_i P(Ei), for any set of disjoint Ei, Ei ∈ Ω   (Axiom of Countable Additivity)

It is straightforward to derive the following theorems:
1. P(E) = 1 − P(E*), where Ω = E ∪ E*, E and E* disjoint.
2. P(E) ≤ 1
3. P(∅) = 0, where ∅ is the null set.
4. If E1, E2 ∈ Ω and not necessarily disjoint, then
   P(E1 ∪ E2) = P(E1) + P(E2) − P(E1 ∩ E2)
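As a quick check of theorem 4 (a worked example added here), consider one throw of a fair die and take E1 = {2, 4, 6} (an even result) and E2 = {5, 6}. Then P(E1) = 1/2, P(E2) = 1/3 and P(E1 ∩ E2) = P({6}) = 1/6, so that

P(E1 ∪ E2) = 1/2 + 1/3 − 1/6 = 2/3

which is indeed the probability of the set {2, 4, 5, 6}.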

A philosopher once said, “It is necessary
for the very existence of science that
the same conditions always produce the same results.”
Well, they do not. —Richard P. Feynman

2.1.2 Sampling

We restrict ourselves to experiments where the outcome is one or more real numbers,
Xi . Repetition of the experiment will not always yield the same outcome. This
could be due to an inability to reproduce exactly the initial conditions and/or to
a probabilistic nature of the process under study, e.g., radioactive decay. The Xi
are therefore called random variables (r.v.), i.e., variables whose values cannot
be predicted exactly. Note that the word ‘random’ in the term ‘random variable’
does not mean that the allowed values of Xi are equiprobable, contrary to its use
in everyday speech. The set of possible values of Xi , which we have denoted Ω, is
called the sample space. A r.v. can be

• discrete: The sample space Ω is a set of discrete points. Examples are the
result of a throw of a die, the sex of a child (F=1, M=2), the age (in years)
of students studying statistics, names of people (Marieke=507, Piet=846).

• continuous: Ω is an interval or set of intervals. Examples are the frequency of radiation from a black body, the angle at which an electron is emitted from an atom in β-decay, the height of students studying statistics.

• a combination of discrete and continuous.

An experiment thus results in an outcome which is a set of real numbers Xi, which are random variables. They form a sampling of a parent ‘population’. Note the difference between the sample, the sample space and the population:

• The population is a list of all members of the population. Some members of the population may be identical.

• The sample space is the set of all possible results of the experiment (the
sampling). Identical results are represented by only one member of the set.
• The sample is a list of the results of a particular experiment. Some of the results may be identical. How often a particular result, i.e., a particular member of the sample space, occurs in the sample should be approximately proportional to how often that result occurs in the population.

The members of the population are equiprobable while the members of the sample
space are not necessarily equiprobable. The sample reflects the population which is
derived from the sample space according to some probability distribution, usually
called the parent (or underlying) probability distribution.

2.1.3 Probability density function (p.d.f.)


Conventionally, one uses a capital letter for the experimental result, i.e., the sam-
pling of a r.v. and the corresponding lower case letter for other values of the r.v.
Here are some examples of probability distributions:

• the throw of a die. The sample space is Ω = {1, 2, 3, 4, 5, 6}. The probability distribution is P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6, which gives a parent population of {1, 2, 3, 4, 5, 6}. An example of an experimental result is X = 3.

• the throw of a die having sides marked with one 1, two 2’s, and three 3’s. The sample space is Ω = {1, 2, 3}. The probability distribution is P(1) = 1/6, P(2) = 1/3, P(3) = 1/2. The parent population is {1, 2, 2, 3, 3, 3}. An experimental result is X = 3 (maybe).
In the discrete case we have a probability function, f(x), which is greater than zero for each value of x in Ω. From the axioms of probability,

\sum_{x \in Ω} f(x) = 1

P(A) ≡ P(X ∈ A) = \sum_{x \in A} f(x) ,   A ⊂ Ω

For a continuous r.v., the probability of any exact value is zero since there are
an infinite number of possible values. Therefore it is only meaningful to talk of the
probability that the outcome of the experiment, X, will be in a certain interval.
f (x) is then a probability density function (p.d.f.) such that
P(x ≤ X ≤ x + dx) = f(x) dx ,   \int f(x)\,dx = 1    (2.2)

Since most quantities of interest to us are continuous we will usually only treat
the continuous case unless the corresponding treatment of the discrete case is not
obvious. Usually going from the continuous to the discrete case is simply the re-
placement of integrals by sums. We will also use the term p.d.f. for f (x) although
in the discrete case it is really a probability rather than a probability density. Some
authors use the term ‘probability law’ instead of p.d.f., thus avoiding the mislead-
ing (actually wrong) use of the word ‘density’ in the discrete case. However, such
use of the word ‘law’ is misleading to a physicist, cf. Newton’s second law, law of
conservation of energy, etc.

2.1.4 Cumulative distribution function (c.d.f.)


The cumulative distribution function (c.d.f.) is the probability that the value of a
r.v. will be ≤ a specific value. The c.d.f. is denoted by the capital letter correspond-
ing to the small letter signifying the p.d.f. The c.d.f. is thus given by
F(x) = \int_{-\infty}^{x} f(x′)\,dx′ = P(X ≤ x)    (2.3)

Clearly, F (−∞) = 0 and F (+∞) = 1.


Properties of the c.d.f.:
• 0 ≤ F (x) ≤ 1
• F(x) is monotone non-decreasing.
• P (a ≤ X ≤ b) = F (b) − F (a)
• F(x) discontinuous at x implies P(X = x) = \lim_{\delta x \to 0} [F(x + δx) − F(x − δx)], i.e., the size of the jump.

• F (x) continuous at x implies P (X = x) = 0.


The c.d.f. can be considered to be more fundamental than the p.d.f. since the
c.d.f. is an actual probability rather than a probability density. However, in appli-
cations we usually need the p.d.f. Sometimes it is easier to derive first the c.d.f.
from which you get the p.d.f. by
f(x) = \frac{\partial F(x)}{\partial x}    (2.4)
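As a simple illustration (added here, anticipating the exponential distribution of section 3.5): for x ≥ 0 the c.d.f. F(x) = 1 − e^{−λx} gives

f(x) = \frac{\partial F(x)}{\partial x} = λ e^{−λx}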

2.1.5 Expectation values


Consider some single-valued function, u(x) of the random variable x for which f (x)
is the p.d.f. Then the expectation value of u(x) is defined:
E[u(x)] = \int_{-\infty}^{+\infty} u(x)\,f(x)\,dx    (2.5)
        = \int_{-\infty}^{+\infty} u(x)\,dF(x) ,   f(x) continuous    (2.6)

Properties of the expectation value:
• If k is a constant, then E[k] = k
• If k is a constant and u a function of x, then E[ku] = kE[u]
• If k1 and k2 are constants and u1 and u2 are functions of x, then E[k1 u1 + k2 u2] = k1 E[u1] + k2 E[u2], i.e., E is a linear operator.
Note that some books, e.g., Barlow1 , use the notation hu(x)i instead of E [u(x)].

2.1.6 Moments
Moments are certain special expectation values. The mth moment is defined (think
of the moment of inertia) as
E[x^m] = \int_{-\infty}^{+\infty} x^m f(x)\,dx    (2.7)

The moment is said to exist if it is finite. The most commonly used moment is the
(population or parent) mean,
µ ≡ E[x] = \int_{-\infty}^{+\infty} x\,f(x)\,dx    (2.8)

The mean is often a good measure of location, i.e., it frequently tells roughly where
the most probable region is, but not always.
[Figure: two example p.d.f.’s f(x) with the mean µ indicated, illustrating that the mean may or may not lie in the most probable region.]

In statistics we will see that the sample mean, x̄, the average of the result of a number of experiments, can be used to estimate the parent mean, µ, the mean of the underlying p.d.f.
Central moments are moments about the mean. The mth central moment is defined as

E[(x − µ)^m] = \int_{-\infty}^{+\infty} (x − µ)^m f(x)\,dx    (2.9)

If µ is finite, the first central moment is clearly zero. If f (x) is symmetric about its
mean, all odd central moments are zero.
The second central moment is called the variance. It is denoted by V[x], σx^2, or just σ^2.

σx^2 ≡ V[x] ≡ E[(x − µ)^2]    (2.10)
            = E[x^2] − µ^2    (2.11)

The square root of the variance, σ, is called the standard deviation. It is a measure
of the spread of the p.d.f. about its mean.
Since all symmetrical distributions have all odd central moments zero, the odd
central moments provide a measure of the asymmetry. The first central moment is
zero. The third central moment is thus the lowest order odd central moment. One
makes it dimensionless by dividing by σ 3 and defining the skewness as
γ1 ≡ \frac{E[(x − µ)^3]}{σ^3}    (2.12)

This is the definition of Fisher, which is the most common. However, be aware that other definitions exist, e.g., the Pearson skewness,

β1 ≡ \left( \frac{E[(x − µ)^3]}{σ^3} \right)^2 = γ1^2    (2.13)

The sharpness of the peaking of the p.d.f. is measured by the kurtosis (also spelled curtosis). There are two common definitions, the Pearson kurtosis,

β2 ≡ \frac{E[(x − µ)^4]}{σ^4}    (2.14)

and the Fisher kurtosis,

γ2 ≡ \frac{E[(x − µ)^4]}{σ^4} − 3 = β2 − 3    (2.15)

The −3 makes γ2 = 0 for a Gaussian. For this reason it is somewhat more convenient, and it is the definition we shall use. A p.d.f. with γ2 > 0 (< 0) is called leptokurtic (platykurtic) and is more (less) sharply peaked than a Gaussian, i.e., having higher (lower) tails.
Moments are often normalized in some other way than we have done with γ1
and γ2 , e.g., with the corresponding power of µ:
ck ≡ \frac{E[x^k]}{µ^k} ;   rk ≡ \frac{E[(x − µ)^k]}{µ^k}    (2.16)
It can be shown that if all central moments exist, the distribution is completely
characterized by them. In statistics we can estimate each parent moment by its
sample moment (cf. section 8.3.2) and so, in principle, reconstruct the p.d.f.
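As an illustration of estimating parent moments from data (anticipating section 8.3.2; this sketch is not part of the original notes, and the data array is made up), a small C program computing the sample mean, variance, skewness γ1 and kurtosis γ2:

    #include <stdio.h>
    #include <math.h>

    /* Sample estimates of the first few moments of a data set:
       mean, variance, Fisher skewness gamma_1 and Fisher kurtosis gamma_2.
       (Simple estimators; no bias corrections are applied.)              */
    int main(void)
    {
        double x[] = {4.9, 5.1, 5.0, 4.8, 5.3, 5.2, 4.7, 5.0};  /* made-up data */
        int n = sizeof(x) / sizeof(x[0]);
        double mean = 0.0, m2 = 0.0, m3 = 0.0, m4 = 0.0;
        int i;

        for (i = 0; i < n; i++) mean += x[i];
        mean /= n;

        for (i = 0; i < n; i++) {        /* central moments about the sample mean */
            double d = x[i] - mean;
            m2 += d * d;
            m3 += d * d * d;
            m4 += d * d * d * d;
        }
        m2 /= n;  m3 /= n;  m4 /= n;

        printf("mean     = %f\n", mean);
        printf("variance = %f\n", m2);
        printf("skewness = %f\n", m3 / pow(m2, 1.5));     /* gamma_1, eq. (2.12) */
        printf("kurtosis = %f\n", m4 / (m2 * m2) - 3.0);  /* gamma_2, eq. (2.15) */
        return 0;
    }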
Other attributes of a p.d.f.:
• mode: The location of a maximum of f (x). A p.d.f. can be multimodal.

• median: That value of x for which the c.d.f. F(x) = 1/2. The median is not always well defined, since there can be more than one such value of x.
[Figure: a c.d.f. F(x) and the corresponding p.d.f. f(x) for a case with a unique median, and for a case where F(x) = 1/2 over an interval so that the median is not unique.]

“If any one imagines that he knows something,
he does not yet know as he ought to know.”
—1 Corinthians 8:2

2.2 More on Probability

2.2.1 Conditional Probability


Suppose we restrict the set of results of our experiment (observations or events) to
a subset A ⊂ Ω. We denote the probability of an event E given this restriction by
P (E | A); we speak of “the probability of E given A.” Clearly this ‘conditional’
probability is greater than the probability without the restriction, P (E) (unless of
course A∗ , the complement of A, is empty). The probability must be renormalized
such that the probability that the condition is fulfilled is unity. The conditional
probability should have the following properties:
P(A | A) = 1    (renormalization)
P(A2 | A1) = P(A1 ∩ A2 | A1)

[Figure: Venn diagram of the sample space Ω with overlapping subsets A1 and A2.]

While the probability changes with the restriction, ratios of probabilities must not:

\frac{P(A1 ∩ A2 | A1)}{P(A1 | A1)} = \frac{P(A1 ∩ A2)}{P(A1)}

These requirements are met by the definition, assuming P(A1) > 0,

P(A2 | A1) ≡ \frac{P(A1 ∩ A2)}{P(A1)}    (2.17)

If P(A1) = 0, P(A2 | A1) makes no sense. Nevertheless, for completeness we define P(A2 | A1) = 0 if P(A1) = 0.
It can be shown that the conditional probability satisfies the axioms of probability.
It follows from the definition that

P(A1 ∩ A2) = P(A2 | A1) P(A1)

If P(A2 | A1) is the same for all A1, i.e., A1 and A2 are independent, then

P(A2 | A1) = P(A2)    and    P(A1 ∩ A2) = P(A1) P(A2)
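As a small worked example (added here), consider one throw of a fair die with A1 = {2, 4, 6} (an even result) and A2 = {4, 5, 6}. Then

P(A2 | A1) = \frac{P(A1 ∩ A2)}{P(A1)} = \frac{P(\{4, 6\})}{P(\{2, 4, 6\})} = \frac{1/3}{1/2} = \frac{2}{3}

which differs from P(A2) = 1/2; knowing that the result is even changes the probability of A2, so A1 and A2 are not independent.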

2.2.2 More than one r.v.


Joint p.d.f.
If the outcome is more than one r.v., say X1 and X2 , then the experiment is a
sampling of a joint p.d.f., f (x1 , x2 ), such that
P(x1 < X1 < x1 + dx1 , x2 < X2 < x2 + dx2) = f(x1, x2)\,dx1\,dx2    (2.18)

P(a < X1 < b , c < X2 < d) = \int_a^b dx1 \int_c^d dx2\, f(x1, x2)    (2.19)

Marginal p.d.f.
The marginal p.d.f. is the p.d.f. of just one of the r.v.’s; all dependence on the other
r.v.’s of the joint p.d.f. is integrated out:
f1(x1) = \int_{-\infty}^{+\infty} f(x1, x2)\,dx2    (2.20)

f2(x2) = \int_{-\infty}^{+\infty} f(x1, x2)\,dx1    (2.21)

Conditional p.d.f.
Suppose that there are two r.v.’s, X1 and X2, and a space of events Ω. Choosing a value x1 of X1 restricts the possible values of X2. Assuming f1(x1) > 0, f(x2 | x1) is then a p.d.f. of X2 given X1 = x1.

[Figure: the event space Ω in the (X1, X2) plane; fixing X1 = x1 selects the values of X2 along a vertical line.]

In the discrete case, from the definition of conditional probability (eq. 2.17), we have

f(x2 | x1) ≡ P(X2 = x2 | X1 = x1) = \frac{P(X2 = x2 ∩ X1 = x1)}{P(X1 = x1)}
           = \frac{P(X2 = x2 , X1 = x1)}{P(X1 = x1)} = \frac{f(x1, x2)}{f1(x1)}

The continuous case is, analogously,

f(x2 | x1) = \frac{f(x1, x2)}{f1(x1)}    (2.22)

Note that this conditional p.d.f. is a function of only one r.v., x2, since x1 is fixed. Of course, a different choice of x1 would give a different function. A conditional probability is then obviously calculated:

P(a < X2 < b | X1 = x1) = \int_a^b f(x2 | x1)\,dx2    (2.23)

This may also be written P(a < X2 < b | x1).
We can also compute conditional expectations:

E[u(x2) | x1] = \int_{-\infty}^{+\infty} u(x2)\,f(x2 | x1)\,dx2    (2.24)

For example, the conditional mean, E[x2 | x1], or the conditional variance, E[(x2 − E[x2 | x1])^2 | x1].
The generalization to more than two variables is straightforward, e.g.,

f(x2, x4 | x1, x3) = \frac{f(x1, x2, x3, x4)}{f13(x1, x3)} ,
   where   f13(x1, x3) = \int\!\!\int f(x1, x2, x3, x4)\,dx2\,dx4

2.2.3 Correlation
When an experiment results in more than one real number, i.e., when we are con-
cerned with more than one r.v. and hence the p.d.f. is of more than one dimension,
the r.v.’s may not be independent. Here are some examples:

• Let A =‘It is Sunday’, B =‘It is raining’. The probability of rain on Sunday is


the same as the probability of rain on any other day. A and B are independent.
But if A =‘It is December’, the situation is different. The probability of rain
in December is not the same as the probability of rain in all other months. A
and B are correlated.
• If you spend 42 hours each week at the university, the probability that at a
randomly chosen moment your head is at the university is 1/4. Similarly, the
probability that your feet are at the university is 1/4. The probability that
both your head and your feet are at the university is also 1/4 and not 1/16; the
locations of your head and your feet are highly correlated.
• Abram and Lot were standing at a road junction. The probability that Lot
would take the left-hand road was 1/2. The probability that Abram would take
the left-hand road was also 1/2. But the probability that they both would take
the left-hand road was zero.17
• The Fermi theory allows us to calculate the energy spectrum of the particles
produced in β-decay, e.g., n → p e⁻ ν̄e, from which we can calculate the prob-
ability that the proton will have more than, say 3/4, of the available energy.
We can also calculate the probability that the electron will have more than
3/4 of the available energy. But the probability that both the electron and the

proton will have more than 3/4 of the available energy is zero. The energies
of the electron and the proton are not independent. They are constrained by
the law of energy conservation.
Given a two-dimensional p.d.f. (the generalization to more dimensions is straightforward), f(x, y), the mean and variance of X, µX and σX^2, are given by

µX = E[X] = \int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty} x\,f(x, y)\,dx\,dy
σX^2 = E[(X − µX)^2]

A measure of the dependence of X on Y is given by the covariance, defined as

cov(X, Y) ≡ E[(X − µX)(Y − µY)]    (2.25)
          = E[XY] − µY E[X] − µX E[Y] + µX µY
          = E[XY] − µX µY    (2.26)

From the covariance we define a dimensionless quantity, the correlation coefficient

ρXY ≡ \frac{cov(X, Y)}{σX σY}    (2.27)

If σX = 0, then X ≡ µX and consequently E[XY] = µX E[Y] = µX µY, which means that cov(X, Y) = 0. In this case the above definition would give ρ indeterminate, and we define ρXY = 0.
It can be shown that ρ^2 ≤ 1, the equality holding if and only if X and Y are linearly related. The proof is left to the reader (exercise 7).
Note that while the mean and the standard deviation scale, the correlation coefficient is scale invariant, e.g.,

µ2X = 2µX   and   σ2X = 2σX

ρ2X,Y = \frac{cov(2X, Y)}{σ2X σY} = \frac{2\,cov(X, Y)}{2 σX σY} = ρXY

The correlation coefficient ρXY is a measure of how much the variables X and Y depend on each other. It is most useful when the contours of constant probability density, f(x, y) = k, are roughly elliptical, but not so useful when these contours have strange shapes:

[Figure: three sketches of contours of constant probability density in the (X, Y) plane, illustrating ρ > 0, ρ < 0, and ρ ≈ 0.]

In the last case, even though X and Y are clearly related, ρ ≈ 0. This can be seen as follows:

E[(X − µX) | y] = \int (x − µX) f(x | y)\,dx
                = \int (x − µX) \frac{f(x, y)}{fY(y)}\,dx
                = 0   for all y

Thus, the mean value of X is independent of y. Then,

cov(X, Y) = E[(X − µX)(Y − µY)]
          = \int (y − µY) \underbrace{\int (x − µX) f(x, y)\,dx}_{=0}\,dy
          = 0

Consequently, ρXY = 0.
However, if we change variables, e.g., by rotating, ρ, i.e., ρX′Y′, will no longer be 0.

[Figure: the same contours seen in rotated coordinates (X′, Y′), for which ρ > 0.]

Also in the elliptical case, such a change in variables can make ρ = 0.

[Figure: elliptical contours with ρ > 0 in the (X, Y) coordinates and ρ = 0 after rotation to (X′, Y′).]
In fact, it is always possible (also in n dimensions) to remove the correlation by a
change of variables (cf. section 2.2.7).
The correlation coefficient, ρ, measures the average linear change in the marginal
p.d.f. of one variable for a specified change in the other variable. This can be 0 even
when the variables clearly depend on each other. This occurs when a change in one
variable produces a change in the marginal p.d.f. of the other variable but no change
in its average, only in its shape. Thus zero correlation does not imply independence.
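The following small C program (an illustration added here, not part of the original notes) demonstrates this point by Monte Carlo: with X uniform on [−1, 1] and the completely dependent Y = X², the estimated covariance comes out consistent with zero.

    #include <stdio.h>
    #include <stdlib.h>

    /* Monte Carlo illustration that zero correlation does not imply
       independence: X uniform on [-1,1], Y = X*X depends completely on X,
       yet cov(X,Y) is (up to statistical fluctuations) zero.              */
    int main(void)
    {
        const int n = 1000000;
        double sx = 0.0, sy = 0.0, sxy = 0.0;
        int i;

        srand(12345);                     /* fixed seed for reproducibility */
        for (i = 0; i < n; i++) {
            double x = 2.0 * rand() / RAND_MAX - 1.0;   /* uniform in [-1,1] */
            double y = x * x;                           /* fully dependent   */
            sx  += x;
            sy  += y;
            sxy += x * y;
        }
        /* sample covariance: E[XY] - E[X]E[Y], cf. eq. (2.26) */
        printf("cov(X,Y) = %f\n", sxy / n - (sx / n) * (sy / n));
        return 0;
    }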

2.2.4 Dependence and Independence


We know from the definitions of conditional and marginal p.d.f.’s that

f(x1, x2) = f(x2 | x1) f1(x1)    (2.28)
and   f2(x2) = \int f(x1, x2)\,dx1
Hence   f2(x2) = \int f(x2 | x1) f1(x1)\,dx1

Now suppose that f(x2 | x1) does not depend on x1, i.e., is the same for all x1. Then

f2(x2) = f(x2 | x1) \underbrace{\int f1(x1)\,dx1}_{=1,\ \mathrm{normalization}} = f(x2 | x1)

Substituting this in (2.28) gives

f(x1, x2) = f1(x1) f2(x2)
The joint p.d.f. is then just the product of the marginal p.d.f.’s. We take this as the definition of independence:

r.v.’s X1 and X2 are independent  ≡  f(x1, x2) = f1(x1) f2(x2)
r.v.’s X1 and X2 are dependent    ≡  f(x1, x2) ≠ f1(x1) f2(x2)

We can easily derive two theorems:

Theorem: X1 and X2 are independent r.v.’s with joint p.d.f. f(x1, x2) if and only if f(x1, x2) = g(x1) h(x2) with g(x1) ≥ 0 and h(x2) ≥ 0 for all x1, x2 ∈ Ω.

=⇒ From the definition of independence, f can be written as the product of the marginal p.d.f.’s, which fulfill the requirement of being positive for all x1, x2 ∈ Ω.

⇐= Assume f(x1, x2) = g(x1) h(x2) with g and h positive. Then the marginal distributions are

   f1(x1) = \int g(x1) h(x2)\,dx2 = g(x1) \int h(x2)\,dx2 = c\,g(x1)
   and   f2(x2) = \int g(x1) h(x2)\,dx1 = h(x2) \int g(x1)\,dx1 = d\,h(x2)

   Hence, f(x1, x2) = g h = \frac{1}{cd}\, f1(x1) f2(x2). And, since f1 and f2 are normalized to 1, cd = 1. Q.E.D.

Note that g and h do not have to be the marginal p.d.f.’s; the only requirement is that their product equal the product of the marginals.

Theorem: If X1 and X2 are independent r.v.’s with marginal p.d.f.’s f1(x1) and f2(x2), then for functions u(x1) and v(x2), assuming all E’s exist,

   E[u(x1) v(x2)] = E[u(x1)] E[v(x2)]

=⇒ From the definition of expectation, and since X1 and X2 are independent,

   E[u(x1) v(x2)] = \int\!\!\int u(x1) v(x2)\, f(x1, x2)\,dx1\,dx2
                  = \int\!\!\int u(x1) v(x2) \underbrace{f1(x1) f2(x2)}_{= f(x1, x2)}\,dx1\,dx2
                  = \int u(x1) f1(x1)\,dx1 \int v(x2) f2(x2)\,dx2
                  = E[u(x1)]\,E[v(x2)]

A consequence of this last theorem is that X1, X2 independent implies

   cov(x1, x2) ≡ E[(x1 − µ1)(x2 − µ2)] = E[x1 − µ1]\,E[x2 − µ2] = 0

But the converse is not true.
2.2.5 Characteristic Function

So far we have only considered real r.v.’s. But from two real r.v.’s we can construct a complex r.v., Z = X + ıY, with expectation E[Z] = E[X] + ıE[Y].
The characteristic function of the p.d.f. f(x) is defined as the expectation of the complex quantity e^{ıtx}, t real:

φ(t) = E[e^{ıtx}] = \begin{cases} \int_{-\infty}^{+\infty} e^{ıtx} f(x)\,dx , & X continuous \\ \sum_k e^{ıtx_k} f(x_k) , & X discrete \end{cases}    (2.29)

For X continuous, φ(t) is the Fourier integral of f(x).
The characteristic function completely determines the p.d.f., since by inverting the Fourier transformation we regain f(x):

f(x) = \frac{1}{2π} \int_{-\infty}^{+\infty} φ(t)\,e^{−ıxt}\,dt    (2.30)
From the definition, it is clear that φ(0) = 1 and |φ(t)| ≤ 1.
The cumulative distribution function, or indeed the probability for any interval [x_{min}, x], can also be found from φ(t):

F(x) = \int_{x_{min}}^{x} f(x)\,dx = \int_{x_{min}}^{x} \frac{1}{2π} \int_{-\infty}^{+\infty} φ(t)\,e^{−ıxt}\,dt\,dx
     = \frac{1}{2π} \int_{-\infty}^{+\infty} φ(t) \int_{x_{min}}^{x} e^{−ıxt}\,dx\,dt
     = \frac{1}{2π} \int_{-\infty}^{+\infty} φ(t)\,\frac{1}{−ıt}\left[ e^{−ıxt} − e^{−ıx_{min}t} \right] dt
     = \frac{ı}{2π} \int_{-\infty}^{+\infty} φ(t)\,\frac{e^{−ıxt} − e^{−ıx_{min}t}}{t}\,dt

In the discrete case, f(x_k) is given by the difference in the probability of adjacent values of x,

f(x_k) = F(x_k) − F(x_{k−1})
       = \frac{ı}{2π} \int_{-\infty}^{+\infty} φ(t)\,\frac{e^{−ıtx_k} − e^{−ıtx_{k−1}}}{t}\,dt
The characteristic function is particularly useful in calculating moments. Differentiating φ(t) with respect to t and evaluating the result at t = 0 gives

\left. \frac{d^q φ(t)}{dt^q} \right|_{t=0} = \int_{-\infty}^{+\infty} (ıx)^q e^{0} f(x)\,dx = ı^q E[x^q]
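As a brief illustration (added here; the exponential p.d.f. is treated in section 3.5), for f(x) = λe^{−λx}, x ≥ 0,

φ(t) = \int_0^{\infty} e^{ıtx}\,λ e^{−λx}\,dx = \frac{λ}{λ − ıt}

and \left. \frac{dφ(t)}{dt} \right|_{t=0} = \left. \frac{ıλ}{(λ − ıt)^2} \right|_{t=0} = \frac{ı}{λ} = ı\,E[x], giving the familiar mean E[x] = 1/λ.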

The characteristic function can also be written in terms of the moments by means of a Taylor expansion:

φ(t) = E[e^{ıtx}] = E\left[ \sum_{r=0}^{\infty} \frac{(ıtx)^r}{r!} \right] = \sum_{r=0}^{\infty} \frac{(ıt)^r}{r!}\, E[x^r]    (2.31)
Some authors prefer, especially for discrete r.v.’s, to use the probability generating function instead of the characteristic function. It is in fact the same thing, just replacing e^{ıt} by z:

G(z) = E[z^x] = \begin{cases} \int_{-\infty}^{+\infty} z^x f(x)\,dx \\ \sum_k z^{x_k} f(x_k) \end{cases}

The moments are then found by differentiating with respect to z and evaluating at z = 1,

G′(1) = \left. \frac{dG(z)}{dz} \right|_{z=1} = \left. \int_{-\infty}^{+\infty} x\,z^{x−1} f(x)\,dx \right|_{z=1} = E[x]
G′′(1) = \left. \frac{d^2 G(z)}{dz^2} \right|_{z=1} = \left. \int_{-\infty}^{+\infty} x(x−1)\,z^{x−2} f(x)\,dx \right|_{z=1} = E[x(x−1)] = E[x^2] − E[x]

Thus the variance is given by

V[x] = E[x^2] − (E[x])^2 = G′′(1) + G′(1) − [G′(1)]^2

Another application of the characteristic function is to find the p.d.f. of sums of independent r.v.’s. Let x and y be r.v.’s. Then w = x + y is also an r.v. The characteristic function of w is

φ_w(t) = E[e^{ıtw}] = E[e^{ıt(x+y)}] = E[e^{ıtx} e^{ıty}]

If x and y are independent, this becomes

φ_w(t) = E[e^{ıtx}]\,E[e^{ıty}] = φ_x(t)\,φ_y(t)    (2.32)

Thus the characteristic function of the sum of independent r.v.’s is just the product of the individual characteristic functions.
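For example (a short illustration added here, anticipating section 3.4), the characteristic function of a Poisson distribution with mean µ is φ(t) = \exp[µ(e^{ıt} − 1)], so for the sum w of two independent Poisson r.v.’s with means µ1 and µ2,

φ_w(t) = \exp[µ_1(e^{ıt} − 1)]\,\exp[µ_2(e^{ıt} − 1)] = \exp[(µ_1 + µ_2)(e^{ıt} − 1)]

i.e., w is again Poisson distributed, with mean µ_1 + µ_2.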

2.2.6 Transformation of variables


We will treat the two-dimensional case. You can easily generalize to N dimensions.

Continuous p.d.f.
Given r.v.’s X1 , X2 from a p.d.f. f (x1 , x2 ) defined on a set A, we transform (X1 , X2 )
to (Y1 , Y2 ). Under this transformation the set A maps onto the set B.

[Figure: the transformation maps the set A in the (X1, X2) plane onto the set B in the (Y1, Y2) plane; a small subset a ⊂ A maps onto b ⊂ B.]

Let a ⊂ A be a small subset which the transformation maps onto b ⊂ B, i.e.,

(X1, X2) ∈ a → (Y1, Y2) ∈ b   such that   P(a) = P(b)

Then   P[(Y1, Y2) ∈ b] = P[(X1, X2) ∈ a] = \int\!\!\int_a f(x1, x2)\,dx1\,dx2
The transformation is given by

y1 = u1(x1, x2)
y2 = u2(x1, x2)

The transformation must be one-to-one. Then a unique inverse transformation exists:

x1 = w1(y1, y2)
x2 = w2(y1, y2)

(Actually the condition of one-to-one can be relaxed in some cases.) Assume also that all first derivatives of w1 and w2 exist. Then

P(a) = P(b)
\int\!\!\int_a f(x1, x2)\,dx1\,dx2 = \int\!\!\int_b f(w1(y1, y2), w2(y1, y2))\, |J|\,dy1\,dy2

where J is the Jacobian determinant (assumed known from calculus) and the absolute value is taken to ensure that the probability is positive,

J = J\!\left( \frac{w1, w2}{y1, y2} \right) = \begin{vmatrix} \frac{∂w1}{∂y1} & \frac{∂w2}{∂y1} \\ \frac{∂w1}{∂y2} & \frac{∂w2}{∂y2} \end{vmatrix}    (2.33)

Hence the p.d.f. in (Y1, Y2) is the p.d.f. in (X1, X2) times the Jacobian:

g(y1, y2) = f(w1(y1, y2), w2(y1, y2))\, |J|    (2.34)

Discrete p.d.f.
This is actually easier, since we can take the subsets a and b to contain just one point. Then

P(b) = P(Y1 = y1, Y2 = y2) = P(a) = P(X1 = w1(y1, y2), X2 = w2(y1, y2))

g(y1, y2) = f(w1(y1, y2), w2(y1, y2))

Note that there is no Jacobian in the discrete case.
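A small worked example (added here): take y1 = x1 + x2, y2 = x1 − x2, with inverse x1 = w1 = (y1 + y2)/2, x2 = w2 = (y1 − y2)/2. The Jacobian (2.33) is

J = \begin{vmatrix} 1/2 & 1/2 \\ 1/2 & −1/2 \end{vmatrix} = −\frac{1}{2} ,   |J| = \frac{1}{2}

so that g(y1, y2) = \frac{1}{2}\, f\!\left( \frac{y1 + y2}{2}, \frac{y1 − y2}{2} \right).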


2.2.7 Multidimensional p.d.f. – matrix notation

In this section we present the vector notation we will use for multidimensional p.d.f.’s. An n-dimensional random variable, i.e., the collection of the n r.v.’s x1, x2, ..., xn, is denoted by an n-dimensional column vector and its transpose by a row vector:

x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} ,   x^T = ( x_1\ x_2\ \ldots\ x_n )    (2.35)

If the r.v. x is distributed according to the p.d.f. f(x), the c.d.f. is

F(x) = \int_{-\infty}^{x_1} \ldots \int_{-\infty}^{x_n} f(x)\,dx ,   dx = dx_1\,dx_2 \ldots dx_n

The p.d.f. and the c.d.f. are related by

f(x) = \frac{∂^n}{∂x_1\,∂x_2 \ldots ∂x_n}\, F(x)

The moments about the origin of order l1, l2, ..., ln are

µ_{l_1, l_2, \ldots, l_n} = E\left[ x_1^{l_1} x_2^{l_2} \cdots x_n^{l_n} \right] = \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty} x_1^{l_1} x_2^{l_2} \cdots x_n^{l_n}\, f(x)\,dx

The mean of a particular r.v., e.g., x2, is

µ_2 = µ_{010\ldots0}

These means can be written as a vector, the mean of x:

µ = \begin{pmatrix} µ_1 \\ µ_2 \\ \vdots \\ µ_n \end{pmatrix}

Moments about the mean are

λ_{l_1, l_2, \ldots, l_n} = E\left[ (x_1 − µ_1)^{l_1} (x_2 − µ_2)^{l_2} \ldots (x_n − µ_n)^{l_n} \right]

The variances are, e.g.,

σ_1^2 = σ^2(x_1) = λ_{200\ldots00} = E\left[ (x_1 − µ_1)^2 \right]

and the covariances

σ_{ij} = cov(x_i, x_j) = E[(x_i − µ_i)(x_j − µ_j)] ,   i ≠ j

e.g., cov(x_1, x_2) = λ_{1100\ldots00}
The variances and covariances may be written as a matrix, called the covariance (or variance) matrix:

V = E\left[ (x − µ)(x − µ)^T \right] = \begin{pmatrix} σ_{11} & σ_{12} & \ldots & σ_{1n} \\ σ_{21} & σ_{22} & \ldots & σ_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ σ_{n1} & σ_{n2} & \ldots & σ_{nn} \end{pmatrix}    (2.36)

  = \begin{pmatrix} σ_1^2 & ρ_{12} σ_1 σ_2 & \ldots \\ ρ_{12} σ_1 σ_2 & σ_2^2 & \ldots \\ \vdots & \vdots & \ddots \\ ρ_{1n} σ_1 σ_n & ρ_{2n} σ_2 σ_n & \ldots \end{pmatrix}    (2.37)

where ρ_{ij} is the correlation coefficient for r.v.’s x_i and x_j:

ρ_{ij} ≡ \frac{σ_{ij}}{σ_i σ_j} = \frac{cov(x_i, x_j)}{\sqrt{σ_i^2 σ_j^2}}    (2.38)

The covariance matrix is clearly symmetric (σ_{ji} = σ_{ij}). As is well known in linear algebra, it is always possible to find a unitary transformation, U, of the r.v. x to the r.v. y = Ux such that the covariance matrix of y, V[y] = U V[x] U^T, is diagonal, which means that the y_i are uncorrelated.
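As a small illustration (added here), in two dimensions with σ_1 = σ_2 = σ and correlation ρ, the rotation

U = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ −1 & 1 \end{pmatrix}

gives

U V U^T = \begin{pmatrix} σ^2(1 + ρ) & 0 \\ 0 & σ^2(1 − ρ) \end{pmatrix}

i.e., the rotated variables y_1 = (x_1 + x_2)/\sqrt{2} and y_2 = (x_2 − x_1)/\sqrt{2} are uncorrelated.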

2.3 Bayes’ theorem


A ∩ B = B ∩ A. Hence, P (A ∩ B) = P (B ∩ A). From the definition of conditional
probability, eq. (2.17), P (A | B) ≡ P (A ∩ B)/P (B), it then follows that
P (A | B)P (B) = P (B | A)P (A) (2.39)
This simple theorem∗ of Rev. Thomas Bayes18 is quite innocuous. However it
has far-reaching consequences in one interpretation of probability, as we shall see
in the next section.

“When I use a word,” Humpty Dumpty said in a


rather scornful tone, “it means just what I
choose it to mean—neither more nor less.”
—Lewis Carroll, “Through the Looking Glass”

∗ Sometimes called the chain rule of probability, this theorem was first formulated by Rev. Bayes around 1759. The exact date is not known; the paper was published posthumously by his good friend Richard Price in 1763. Bayes’ formulation was only for P(A) uniform. The theorem was formulated in its present form by Laplace,19 who was apparently unaware of Bayes’ work. Laplace went on to apply20 it to problems in celestial mechanics, medical statistics and even, according to some accounts, to jurisprudence.
2.4. PROBABILITY—WHAT IS IT?, REVISITED 27

2.4 Probability—What is it?, revisited

We have used mathematical probability, which is largely due to Kolmogorov, to derive various properties of probability. In our minds we have so far an idea of what probability means, which we refer to as the frequency approach. In this section we shall first review these two topics and then discuss another interpretation of the meaning of probability, which we shall call subjective probability.

2.4.1 Mathematical probability (Kolmogorov)


In this approach21 we began with three axioms, from which we can derive everything.
We can calculate the probability of any complicated event for which we know the
a priori probabilities of its components. But this is simply mathematics. What
probability really means requires a connection to the real world. As Bayes wrote,22
It is not the business of the Mathematician to dispute whether quantities
do in fact ever vary in the manner that is supposed, but only whether
the notion of their doing so be intelligible; which being allowed, he has
the right to take it for granted, and then to see what deductions he
can make from that supposition... He is not inquiring how things are in
matter of fact, but supposing things to be in a certain way, what are the
consequences to be deduced from them; and all that is to be demanded
of him is, that his suppositions be intelligible, and his inferences just
from the suppositions he makes.
Bertrand Russell put it somewhat more succinctly:
Mathematics is the only science where one never knows what one is
talking about nor whether what is said is true.

2.4.2 Empirical or Frequency interpretation (von Mises)


In this approach, largely due to von Mises,23 probability is viewed as the limit of the
frequency of a result of an experiment or observation when the number of identical
experiments is very large, i.e.,
P(xi) = \lim_{N \to \infty} \frac{Ni}{N}    (2.40)

There are two shortcomings to this approach:

• P (xi ) is not just a property of the experiment. It also depends on the “collec-
tive” or “ensemble”, i.e., on the N repetitions of the experiment. For example,
if I take a resistor out of a box of resistors, the probability that I measure the
resistance of the resistor as 1 ohm depends not only on how the resistor was
made, but also on how all the other resistors in the box were made.

• The experiment must be repeatable, under identical conditions, but with dif-
ferent outcomes possible. This is a great restriction on the number of situa-
tions in which we can use the concept of probability. For example, what is
the probability that it will rain tomorrow? Such a question is meaningless for
the frequentists, since the experiment cannot be repeated!

2.4.3 Subjective (Bayesian) probability


This approach attempts to extend the notion of probability to the areas where the
experiment of the frequentists cannot be repeated. Probability here is a subjective
“degree of belief” which can be modified by observations. This was, in fact, the
interpretation of such pioneers in probability as Bayes and Laplace.
This approach takes Bayes’ theorem (2.39), which we repeat here,

P (A | B) P (B) = P (B | A) P (A)

and interprets A as a theory or hypothesis and B as a result or observation. P (A)


is then the probability that A is true, or, in other words, our “belief” in the theory.
Then Bayes’ theorem becomes

P (theory | result) P (result) = P (result | theory) P (theory)

Then

P(theory | result) = [P(result | theory) / P(result)] P(theory)
Here, P (theory) is our “belief” in the theory before doing the experiment, P (result |
theory) is the probability of getting the result if the theory is true, P (result) is the
probability of getting the result irrespective of whether the theory is true or not,
and P (theory | result) is our belief in the theory after having obtained the result.
This seems to make sense. We see that if the theory predicts the result with
high probability, i.e., P (result | theory) big, then P (theory | result), i.e., your belief
in the theory after the result, will be higher than it was before, P (theory), and vice
versa. However, if the result is likely even if the theory is not true, then your belief
in the theory will not increase by very much, since then P(result | theory)/P(result) is
not much greater than 1.
Suppose we want to determine some parameter of nature, λ, by doing an ex-
periment which has outcome Z. Further, suppose we know the conditional p.d.f. to
get Z given λ: f (z | λ). Our prior, i.e., before we do the experiment, belief about
λ is given by Pprior (λ). Now the probability of z, P (z), is just the marginal p.d.f.:
f1(z) = Σ_{λ′} f(z | λ′) Pprior(λ′). Then by Bayes’ theorem,

Pposterior(λ | z) = [f(z | λ) / f1(z)] Pprior(λ)   (2.41)

Or, if λ is a continuous variable, which in physics is most often the case,


fposterior(λ | z) = [f(z | λ) / f1(z)] fprior(λ)   (2.42)

where f1(z) = ∫ f(z | λ′) fprior(λ′) dλ′.
Given Pprior (λ) this is all OK. The problem here is: What is Pprior (λ)? By its
nature this is not known. Guessing the prior probability is clearly subjective and
unscientific. The usual prescription is
Bayes’ Postulate: If completely ignorant about Pprior (λ), take all values of λ as
equiprobable.
There are objections to this postulate:
• If we are completely ignorant about P (λ), how do we know Pprior (λ) is a
constant?
• A different choice of Pprior (λ) would give a different Pposterior .

• If we are ignorant about P (λ), we are also ignorant about P (λ²) or P (√λ)
or P (1/λ). Taking any of these as constant would imply a different Pprior (λ),
giving a different posterior probability.
These objections are usually answered by the assertion (supported by experience)
that Pposterior usually converges to about the same value after several experiments
irrespective of the initial choice of Pprior .

2.4.4 Are we frequentists or Bayesians?


First we note that it is in the sense of frequencies that the word ‘probability’ is used
in quantum mechanics and statistical physics. Turning to experimental results,
in the physical sciences, most experiments are, in principle, repeatable and the
problem can be stated to specify the “collective”. So the frequentist interpretation
is usually OK for us. Given the objections we have seen in the Bayesian approach,
particularly that of subjectivity, most physicists today, like mathematicians starting
in the mid-nineteenth century, would claim to be frequentists.
However in interpreting experimental results we often sound like Bayesians. For
example, you measure the mass of the electron to be 520 ± 10 keV/c2 , i.e., you
measured 520 keV/c2 with an apparatus with a resolution of 10 keV/c2 . You might
then say “The mass of the electron is probably close to 520 keV/c2 .” Or you might
say “The mass of the electron is between 510 and 530 keV/c2 with 68% probability.”
But this is not the frequentist’s probability—the experiment has not been repeated
an infinite or even a large number of times. It sounds much more like a Bayesian
probability: With a resolution, or ‘error’, of σ = 10 keV/c2 , the probability that we
will measure a mass m when the true value is me is
P(m | me) ∝ e^(−(m−me)²/2σ²)

Then by Bayes’ theorem, the probability that the true mass has the value me after
we have measured a value m is
P(me | m) = [P(m | me) / P(m)] Pprior(me)
          ∝ P(m | me)          assuming Pprior(me) = const.
          ∝ e^(−(m−me)²/2σ²)

In a frequentist interpretation of probability, the statement that the electron


has a certain mass with a certain probability is utter nonsense. The electron has
a definite mass: The probability that it has that mass is 1; the probability that it
has some other value is 0. Our only problem is that we do not know what the value
is. We can, nevertheless, make the statement “The mass of the electron is between
510 and 530 keV/c2 with 68% confidence.” Note that this differs from the Bayesian
statement above by just one word. This will be discussed further in the sections
on maximum likelihood (sect. 8.2.4) and confidence intervals (sect. 9), where what
exactly we mean by the word confidence will be explained.∗

“That’s a great deal to make one word mean,”


Alice said in a thoughtful tone.
“When I make a word do a lot of work like that,”
said Humpty Dumpty, “I always pay it extra.”
—Lewis Carroll, “Through the Looking Glass”

Fisher24 , introducing his prescription for confidence intervals, had this scathing comment on

Bayesian probability (referred to as inverse probability):


I know only one case in mathematics of a doctrine which has been accepted and
developed by the most eminent men of their time, and is now perhaps accepted by
men now living, which at the same time has appeared to a succession of sound writers
to be fundamentally false and devoid of foundation. Yet that is quite exactly the
position in respect of inverse probability. Bayes, who seems to have first attempted
to apply the notion of probability, not only to effects in relation to their causes but
also to causes in relation to their effects, invented a theory, and evidently doubted its
soundness, for he did not publish it during his life. It was posthumously published by
Price, who seems to have felt no doubt of its soundness. It and its applications must
have made great headway during the next 20 years, for Laplace takes for granted
in a highly generalised form what Bayes tentatively wished to postulate in a special
case.
Chapter 3

Some special distributions

We now examine some distributions which are frequently encountered in physics


and/or statistics. We begin with discrete distributions.

3.1 Bernoulli trials


A Bernoulli trial is an experiment with two possible outcomes, e.g., the toss of a
coin. The random variable is the outcome of the experiment, k:

outcome probability
‘success’, k = 1 p
‘failure’, k = 0 q = 1−p

The p.d.f. is
f(k; p) = p^k q^(1−k)   (3.1)
Note that we use a semicolon to separate the r.v. k from the parameter of the
distribution, p. This p.d.f. results in the moments and central moments:

E[k^m] = 1^m · p + 0^m · (1 − p) = p
E[(k − µ)^m] = (1 − p)^m p + (0 − p)^m (1 − p)
(the first term comes from k = 1, the second from k = 0)

In particular,

µ = p
V[k] = E[k²] − (E[k])² = p − p² = p(1 − p)


3.2 Binomial distribution


The binomial distribution gives the probability of k successes (ones) in n inde-
pendent Bernoulli trials each having a probability p of success. We denote this
distribution by B(k; n, p). The probability of k successes followed by n − k failures
is p^k q^(n−k). But the order of the successes and failures is unimportant. There are
(n choose k) = n!/[k!(n − k)!] different permutations. Therefore the p.d.f. is given by

B(k; n, p) = (n choose k) p^k (1 − p)^(n−k)   (3.2)
It has the following properties:
µ = E[k] = np                              (mean)
σ² = V[k] = np(1 − p)                      (variance)
γ1 = (1 − 2p)/√[np(1 − p)]                 (skewness)
γ2 = [1 − 6p(1 − p)]/[np(1 − p)]           (kurtosis)
φ(t) = [p e^(ıt) + (1 − p)]^n              (characteristic function)
We will derive the first of these properties and leave the rest as exercises.
µ = E[k] = Σ_{k=0}^{n} k B(k; n, p) = Σ_{k=0}^{n} k (n choose k) p^k (1 − p)^(n−k)
  = Σ_{k=0}^{n} k [n!/(k!(n − k)!)] p^k (1 − p)^(n−k)
  = np Σ_{k=1}^{n} [(n − 1)!/((k − 1)!(n − k)!)] p^(k−1) (1 − p)^(n−k)        (the k = 0 term is 0)
  = np Σ_{k′=0}^{n′} [n′!/(k′!(n′ − k′)!)] p^(k′) (1 − p)^(n′−k′)             with n′ = n − 1, k′ = k − 1
  = np [p + (1 − p)]^(n′)
  = np
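As a quick numerical check of these properties (a minimal sketch in Python, not part of the original notes; the function name binomial_pmf and the parameter values are my own), one can sum k B(k; n, p) and k² B(k; n, p) directly:

    from math import comb

    def binomial_pmf(k, n, p):
        """B(k; n, p) of eq. 3.2."""
        return comb(n, k) * p**k * (1 - p)**(n - k)

    n, p = 20, 0.3
    mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
    var  = sum(k**2 * binomial_pmf(k, n, p) for k in range(n + 1)) - mean**2
    print(mean, n * p)             # both ~ 6.0
    print(var, n * p * (1 - p))    # both ~ 4.2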
Many distributions have a reproductive property, i.e., the p.d.f. of the sum of
two or more independent r.v.’s, each distributed according to the same p.d.f., is the
same p.d.f. as for the individual r.v.’s although (usually) with different parameters.
Let X, Y be independent r.v.’s both distributed according to a binomial p.d.f.
with parameter p. Thus
f(x, y) = B(x; nx, p) B(y; ny, p) = (nx choose x) p^x (1 − p)^(nx−x) (ny choose y) p^y (1 − p)^(ny−y)
What is then the p.d.f. of the r.v. X +Y ? We change variables and, for convenience,
introduce new parameters:
new variables Z1 = X + Y Z2 = Y
inverse transformation X = Z1 − Z2 Y = Z2
new parameters nz1 = nx + ny nz2 = ny

The p.d.f. for the new variables is then

g(z1, z2) = f(z1 − z2, z2) = (nz1 − nz2 choose z1 − z2)(nz2 choose z2) p^(z1) (1 − p)^(nz1−z1)

The p.d.f. for Z1 = X + Y is the marginal of this. Hence we must sum over z2:

g1(z1) = Σ_{z2} g(z1, z2) = [Σ_{z2} (nz1 − nz2 choose z1 − z2)(nz2 choose z2)] p^(z1) (1 − p)^(nz1−z1)

For normalization the sum must be just (nz1 choose z1). Thus g1 is also a binomial p.d.f.:

g1(x + y) = B(z1; nz1, p) = B(x + y; nx + ny, p)

3.3 Multinomial distribution


This is the generalization of the binomial distribution to more than two outcomes.
Let there be m different outcomes, with probabilities pi . Consider n experiments
and let ki denote the number of experiments having outcome i. The p.d.f. is then
M(k1, k2, . . . , km; p1, p2, . . . , pm, n) = [n!/(k1! k2! · · · km!)] p1^(k1) p2^(k2) · · · pm^(km)   (3.3)
subject to the conditions
Σ_{i=1}^{m} pi = 1       and       Σ_{i=1}^{m} ki = n

We can write the multinomial p.d.f. in a more condensed form:


M(k; p, n) = n! ∏_{i=1}^{m} pi^(ki)/ki!   (3.4)

An example of application of this p.d.f. is a histogram of m bins with a prob-


ability of pi that the outcome of an experiment will be in the ith bin. Then for n
experiments, the probability that the numbers of entries in the bins will be given
by the ki is given by the multinomial p.d.f.
To calculate expectation values we make use of the binomial p.d.f.: For a given
bin, either an outcome is in it (probability pi) or not (probability 1 − pi = Σ_{j≠i} pj).
This is just the case of the binomial p.d.f. In other words, the marginal p.d.f. of
the multinomial is the binomial. Hence,

µi = E [ki ] = npi
σi2 = V [ki ] = npi (1 − pi )

Further, cov(ki, kj) = −n pi pj ,   i ≠ j


The correlation coefficient is then
ρij = cov(ki, kj)/(σi σj) = −√[ pi pj / ((1 − pi)(1 − pj)) ]
The correlation comes about because n is fixed: Σ ki = n. The ki are thus not
independent. If n is not fixed, i.e., n is a r.v., the bin contents are not correlated.
But then we do not have the multinomial p.d.f. but the Poisson p.d.f. for each bin.
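A small simulation illustrates this negative correlation between bin contents when n is fixed (a sketch using Python's standard library; the bin probabilities and counts are arbitrary and my own, not from the notes):

    import random

    m, n, trials = 3, 100, 5000
    p = [0.2, 0.3, 0.5]
    k0, k1, k0k1 = 0.0, 0.0, 0.0
    for _ in range(trials):
        outcomes = random.choices(range(m), weights=p, k=n)   # n experiments
        k = [outcomes.count(i) for i in range(m)]             # bin contents k_i
        k0 += k[0]; k1 += k[1]; k0k1 += k[0] * k[1]

    cov01 = k0k1 / trials - (k0 / trials) * (k1 / trials)
    print(cov01, -n * p[0] * p[1])   # both close to -6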
The characteristic function of the multinomial p.d.f. is
φ(t2, t3, . . . , tm) = [p1 + p2 e^(ıt2) + p3 e^(ıt3) + . . . + pm e^(ıtm)]^n

3.4 Poisson distribution


This p.d.f. applies to the situation where we detect events but do not know the
number of trials. An example is a radioactive source where we detect the decays
but do not detect the non-decays. The events are counted as a function of some
parameter x, e.g., the time of a decay. The probability of an event in an interval
∆x is assumed proportional to ∆x.
Now make ∆x so small that the probability of more than one event in the interval
∆x is negligible. Consider n such intervals. Let λ be the probability of an event
in the total interval n∆x. Assume λ 6= λ(x). Then the probability of an event in
∆x is p = λ/n. The probability of r events in the total interval, i.e., r of the n
subintervals contain one event, is given by the binomial p.d.f.
P(r; λ) = B(r; n, λ/n) = [n!/(r!(n − r)!)] (λ/n)^r (1 − λ/n)^(n−r)

Now n!/(n − r)! = n(n − 1)(n − 2) . . . (n − r + 1)   (r terms)
              ≈ n^r   since n ≫ r

and (1 − λ/n)^(n−r) ≈ (1 − λ/n)^n → e^(−λ)   as n → ∞

Hence, we arrive at the expression for the Poisson p.d.f.:

P(r; λ) = e^(−λ) λ^r / r!

We can check that P (r; λ) is properly normalized:


Σ_{r=0}^{∞} P(r; λ) = e^(−λ) Σ_{r=0}^{∞} λ^r/r! = e^(−λ) e^λ = 1

The mean is

µ = E[r] = Σ_{r=0}^{∞} r e^(−λ) λ^r/r! = λ e^(−λ) Σ_{r=1}^{∞} λ^(r−1)/(r − 1)!
         = λ e^(−λ) Σ_{r′=0}^{∞} λ^(r′)/r′!          (r′ = r − 1)
         = λ Σ_{r′=0}^{∞} P(r′; λ) = λ

Hence the Poisson p.d.f. is usually written

P(r; µ) = e^(−µ) µ^r / r!   (3.5)
It gives the probability of getting r events if the expected number (mean) is µ.
Further, you can easily show that the variance is equal to the mean:

σr2 = V [r] = µ (3.6)

Other properties:
γ1 = E[(r − µ)³]/σ³ = µ/µ^(3/2) = 1/√µ                        (skewness)
γ2 = E[(r − µ)⁴]/σ⁴ − 3 = (3µ² + µ)/µ² − 3 = 1/µ              (kurtosis)
φ(t) = Σ_{r=0}^{∞} e^(ıtr) P(r; µ) = Σ_{r=0}^{∞} e^(ıtr) µ^r e^(−µ)/r!
     = e^(−µ) Σ_{r=0}^{∞} (µe^(ıt))^r/r! = e^(−µ) exp(µe^(ıt))
φ(t) = exp[µ(e^(ıt) − 1)]                                     (characteristic function)

From the skewness we see that the p.d.f. becomes more symmetric as µ increases.
When calculating a series of Poisson probabilities, one can make use of the
recurrence formula P(r + 1) = [µ/(r + 1)] P(r).
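For example, a short sketch of this recurrence in Python (my own illustration, not from the notes):

    import math

    def poisson_probs(mu, r_max):
        """Return [P(0; mu), ..., P(r_max; mu)] using P(r+1) = mu/(r+1) * P(r)."""
        probs = [math.exp(-mu)]                 # P(0) = e^{-mu}
        for r in range(r_max):
            probs.append(probs[-1] * mu / (r + 1))
        return probs

    print(poisson_probs(0.61, 5))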

Reproductive property
The Poisson p.d.f. has a reproductive property: For independent r.v.’s X and Y ,
both Poisson distributed, the joint p.d.f. is

f(x, y) = µx^x µy^y e^(−µx) e^(−µy) / (x! y!) ,      x, y = 0, 1, 2, 3, . . .
To find the p.d.f. of X + Y we change variables

new variables Z1 = X + Y Z2 = Y
inverse transformation X = Z1 − Z2 Y = Z2

The joint p.d.f. of the new variables is then

g(z1, z2) = µx^(z1−z2) µy^(z2) e^(−µx) e^(−µy) / [(z1 − z2)! z2!]

The marginal p.d.f. for z1 is (using the fact that 0 ≤ z2 ≤ z1)

g1(z1) = Σ_{z2=0}^{z1} g(z1, z2) = [e^(−µx−µy)/z1!] Σ_{z2=0}^{z1} [z1!/((z1 − z2)! z2!)] µx^(z1−z2) µy^(z2)
       = (µx + µy)^(z1) e^(−(µx+µy)) / z1!

where the sum has been done using the binomial theorem, Σ_{z2} [z1!/((z1 − z2)! z2!)] µx^(z1−z2) µy^(z2) = (µx + µy)^(z1),
which has the form of a Poisson p.d.f. Q.E.D. We rewrite it

g(x + y) = (µx + µy)^(x+y) e^(−(µx+µy)) / (x + y)!

The p.d.f. of the sum of two Poisson distributed random variables is also Poisson
with µ equal to the sum of the µ’s of the individual Poissons. This can also be
easily shown using the characteristic function (exercise 12).

Examples
The Poisson p.d.f. is applicable when

• the events are independent, and

• the event rate is constant (= µ).

We give a number of examples:

• Thus the number of raisins per unit volume in raisin bread should be Poisson
distributed. The baker has mixed the dough thoroughly so that the raisins do
not stick together (independent) and are evenly distributed (constant event
rate).

• However, the number of zebras per unit area is not Poisson distributed (even
in those parts of the world where there are wild zebras), since zebras live in
herds and are thus not independently distributed.

• A classic example of Poisson statistics is the distribution of the number of


Prussian cavalry soldiers kicked to death by horses.25 In 10 different cavalry
corps over 20 years there were 122 soldiers kicked to death by horses. The
average is thus k̄ = 122/200 = 0.610 deaths/corps/year.

Assuming that the death rate is constant over the 20 year period and inde-
pendent of corps and that the deaths are independent (not all caused by one
killer horse) then the deaths should be Poisson distributed: the probability of
k deaths in one particular corps in one year is P (k; µ). Since the mean of P
is µ, we take the experimental average as an ‘estimate’ of µ. The distribution
should then be P (k; 0.61) and we should expect 200 × P (k; 0.61) occurrences
of k deaths in one year in one corps. The data:

k = number of deaths      actual number of times a        Poisson prediction
    in 1 corps in 1 year  corps had k deaths in 1 year    200 × P(k; 0.610)
0                         109                             108.67
1                          65                              66.29
2                          22                              20.22
3                           3                               4.11
4                           1                               0.63
5                           0                               0.08
total                     200                             200.00

The ‘experimental’ distribution agrees very well with the Poisson p.d.f. The
reader can verify that the experimental variance, estimated by (1/N) Σ (ki − k̄)²,
is 0.608, very close to the mean (0.610) as expected for a Poisson distribution.
(A short numerical check of this table is sketched at the end of these examples.)

• The number of entries in a given bin of a histogram when the (independent)


data are collected over a fixed time interval, i.e., when the total number of
entries in the histogram is not fixed.

However, if the rate of the basic process is not constant, the distribution may not
be Poisson, e.g.,

• The radioactive decay over a period of time significant compared with the
lifetime of the source.

• The radioactive decay of a small amount of material.

• The number of interactions produced by a beam consisting of a small number


of particles incident on a thick target.

In the first two examples the event rate decreases with time, in the third with
position. In the last two there is the further restriction that the number of events is
significantly restricted, as it can not exceed the number of atoms or beam particles,
while for the Poisson distribution the number extends to infinity.

• The number of people who die each year while operating a computer is also
not Poisson distributed. Although the probability of dying while operating
a computer may be constant, the number of people operating computers in-
creases each year. The event rate is thus not constant.
The Poisson p.d.f. requires that the events be independent. Consider the case
of a counter with a dead time of 1 µsec. This means that if a second particle
passes through the counter within 1 µsec after one which was recorded, the counter
is incapable of recording the second particle. Thus the detection of a particle is
not independent of the detection of other particles. If the particle flux is low, the
chance of a second particle within the dead time is so small that it can be neglected.
However, if the flux is high it cannot be. No matter how high the flux, the counter
cannot count more than 10⁶ particles per second. In high fluxes, the number of
particles detected in some time interval will not be Poisson distributed.
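The horse-kick table above is easy to reproduce (a minimal sketch in Python, not part of the notes; the expected counts 200 × P(k; 0.61) follow from eq. 3.5):

    import math

    mu, n_corps_years = 0.61, 200
    observed = {0: 109, 1: 65, 2: 22, 3: 3, 4: 1, 5: 0}
    for k, n_obs in observed.items():
        expected = n_corps_years * math.exp(-mu) * mu**k / math.factorial(k)
        print(k, n_obs, round(expected, 2))   # e.g. k=0 gives 108.67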

Radioactive decays – Poisson approximation of a Binomial


Let us examine the case of radioactive decays more closely. Consider a sample of
n radioactive atoms. In a time interval T some will decay, others will not. There
are thus two possibilities between which the n atoms are divided. The appropriate
p.d.f. is therefore the binomial. The probability that r atoms decay in time T is
thus
f(r) = B(r; n, p) = [n!/(r!(n − r)!)] p^r (1 − p)^(n−r)   (3.7)
where p is the probability for one atom to decay in time T . Of course, p depends
on the length of the time interval. In the following time interval n will be less but
the value of p will remain the same. But if n is large and p small, then n >> r and
the change in n can be neglected. Then
n!/(n − r)! = n(n − 1)(n − 2) · · · (n − r + 1)   (r terms)
           ≈ n^r
Also,

(1 − p)^(n−r) = 1 − p(n − r) + (p²/2!)(n − r)(n − r − 1) + . . .
             ≈ 1 − p(n − r) + (p²/2!)(n − r)² + . . .
             = e^(−p(n−r)) ≈ e^(−pn)
Substituting these approximations in (3.7) yields
f(r) = B(r; n, p) ≈ [(np)^r/r!] e^(−np) = P(r; np)
which is a Poisson p.d.f. with µ = np. This derivation is in fact only slightly different
from our previous one; the approximations involved here are the same.
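To see how good the approximation is, one can compare B(r; n, p) with P(r; np) numerically (a sketch, not from the notes; the values of n and p are arbitrary):

    from math import comb, exp, factorial

    n, p = 10000, 0.0002          # large n, small p, so np = 2
    for r in range(6):
        binom   = comb(n, r) * p**r * (1 - p)**(n - r)
        poisson = exp(-n * p) * (n * p)**r / factorial(r)
        print(r, binom, poisson)  # the two columns agree very closely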

3.5 Exponential and Gamma distributions


Radioactive decays (again): As discussed in the previous section, the probability
of r decays in time dt is given by the binomial p.d.f.:
P(r) = [n!/(r!(n − r)!)] p^r (1 − p)^(n−r)
where n is the number of undecayed atoms at the start of the interval. The proba-
bility that one atom decays is p, which of course depends on the length of the time
interval, dt. Now r is just the current value of −dn/dt, i.e., the number of atoms
which decay in dt equals the change in the number of undecayed atoms. Therefore,

E[dn/dt] = −E[r] = −np   (3.8)

Interchanging the order of the differentiation and the integration of the expectation
operator yields
dE[n]/dt = −np

Identifying the actual number with its expectation,

dn/dt = −np
n = n0 e^(−pt)   (3.9)

Thus the number of undecayed atoms falls exponentially. From this it follows that
the p.d.f. for the distribution of individual decay times (lifetimes) is exponential:
Exponential p.d.f.: Let f(t) be the p.d.f. for an individual atom to decay at
time t. The probability that it decays before time t is then F(t) = ∫_0^t f(t′) dt′. The
expected number of decays in time t is

E[n0 − n] = n0 F(t) = n0 ∫_0^t f(t′) dt′

Substituting for E [n] from equation 3.9 and differentiating results in the exponential
p.d.f.:
f(t; t0) = (1/t0) e^(−t/t0)   (3.10)
which gives, e.g., the probability that an individual atom will decay in time t. Note
that this is a continuous p.d.f.
Properties:

µ = E[t] = t0            γ1 = 2
σ² = V[t] = t0²          γ2 = 6
φ(x) = [1 − ıxt0]^(−1)

Since we could start timing at any point, in particular at the time of the first
event, f (t) is the p.d.f. for the time of the second event. Thus the p.d.f. of the time
interval between decays is also exponential. This is the special case of k = 1 of the
following situation:
Let us find the distribution of the time t for k atoms to decay. The r.v. T = Σ_{i=1}^{k} ti
is the sum of the time intervals between k successive decays. The ti are independent.
The c.d.f. for t is then just the probability that more than k atoms decay in time t:
F (t) = P (T ≤ t) = 1 − P (T > t)
Since the decays are Poisson distributed, the probability of m decays in the interval
t is
P(m) = (λt)^m e^(−λt) / m!
where λ = 1/t0 , and t0 is the mean lifetime of an atom. The probability of < k
decays is then
P(T > t) = Σ_{m=0}^{k−1} (λt)^m e^(−λt)/m! = ∫_{λt}^{∞} z^(k−1) e^(−z)/(k − 1)! dz

(The replacement of the sum by the integral can be found in any good book of
integrals.) Substituting the gamma function, Γ(k) = (k − 1)!, the c.d.f. becomes
F(t) = 1 − ∫_{λt}^{∞} z^(k−1) e^(−z)/Γ(k) dz = ∫_0^{λt} z^(k−1) e^(−z)/Γ(k) dz
Changing variables, y = z/λ,
F(t) = ∫_0^t λ^k y^(k−1) e^(−λy)/Γ(k) dy
The p.d.f. is then
f(t; k, λ) = dF/dt = λ^k t^(k−1) e^(−λt)/Γ(k) ,   t > 0,   (3.11)
which is called the gamma distribution. Some properties of this p.d.f. are

µ = E[t] = k/λ           γ1 = 2/√k
σ² = V[t] = k/λ²         γ2 = 6/k
φ(x) = [1 − ıx/λ]^(−k)

Note that the exponential distribution, f (t; 1, λ) = λe−λt , is the special case of the
gamma distribution for k = 1. The exponential distribution is also a special case
of the Weibull distribution (section 3.17).
Although in the above derivation k is an integer, the gamma distribution is, in
fact, more general: k does not have to be an integer. For λ = 1/2 and k = n/2, the
gamma distribution reduces to the χ²(n) distribution (section 3.12).
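The derivation can be illustrated by simulation (a sketch using Python's standard library; the parameter values are arbitrary and my own): the time for k decays, being the sum of k independent exponential intervals, should have mean k/λ and variance k/λ².

    import random
    from statistics import mean, variance

    k, lam, n_trials = 4, 2.0, 50000
    t_k = [sum(random.expovariate(lam) for _ in range(k)) for _ in range(n_trials)]
    print(mean(t_k), k / lam)          # both close to 2.0
    print(variance(t_k), k / lam**2)   # both close to 1.0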

3.6 Uniform distribution


The uniform distribution (also known as the rectangular distribution),
f(x; a, b) = 1/(b − a) ,   a ≤ x ≤ b       and       f(x) = 0 ,   elsewhere   (3.12)
is the p.d.f. of a r.v. distributed uniformly between a and b.
Properties:
µ = E[x] = ∫_a^b x/(b − a) dx = (b + a)/2                              mean
σ² = V[x] = ∫_a^b x²/(b − a) dx − µ² = (b − a)²/12                     variance
γ1 = E[(x − µ)³]/σ³ = 0                                                skewness
γ2 = E[(x − µ)⁴]/σ⁴ − 3 = −1.2                                         kurtosis
φ(t) = {sinh[½ ıt(b − a)] / [½ ıt(b − a)]} exp[½ ıt(b + a)]            characteristic function
Round-off errors in arithmetic calculations are uniformly distributed.

3.7 Gaussian or Normal distribution


This is probably the best known and most used p.d.f.
N(x; µ, σ²) = [1/√(2πσ²)] e^(−(x−µ)²/2σ²)   (3.13)

Some books use the notation N(x; µ, σ). The Gaussian distribution is symmetric
about µ, and σ is a measure of its width.
We name this distribution after Gauß, but in fact many people discovered it
and investigated its properties independently. The French name it after Laplace,
who had noted26 its main properties when Gauß was only six years old. The first
known reference to it, before Laplace was born, is by the Englishman A. de Moivre
in 1733,27 who, however, did not realize its importance and made no use of it. Its
importance in probability and statistics (cf. section 8.5) awaited Gauß28 (1809).
The origin of the name ‘normal’ is unclear. It certainly does not mean that
other distributions are abnormal.
Properties: (The first two justify the notation used for the two parameters of the
Gaussian.)
µ = E[x] = µ                                      mean
σ² = V[x] = σ²                                    variance
γ1 = γ2 = 0                                       skewness and kurtosis
E[(x − µ)^n] = 0 for n odd;
             = (n − 1)!! σ^n = n! σ^n / [2^(n/2) (n/2)!] for n even      central moments
where a!! ≡ 1 · 3 · 5 · · · a
φ(t) = exp[ıtµ − ½ t²σ²]                          characteristic function

When using the Gaussian, it is usually convenient to shift the origin, x → x′ =


x − µ to obtain
N(x′; 0, σ²) = [1/√(2πσ²)] e^(−x′²/2σ²)   (3.14)

We can also change the scale, x → z = (x − µ)/σ, defining a ‘standard’ variable,


i.e., a variable with µ = 0 and σ = 1. Then we obtain the unit Gaussian (also
called the unit Normal or standard Normal) p.d.f.:

N(z; 0, 1) = [1/√(2π)] e^(−z²/2)   (3.15)

which has the cumulative distribution (c.d.f.)

erf(z) ≡ [1/√(2π)] ∫_{−∞}^{z} e^(−x²/2) dx   (3.16)

which is called the error function or normal probability integral. The c.d.f. of
N(x; µ, σ²) is then erf((x − µ)/σ).
Some authors use the following definition of the error function instead of equa-
tion 3.16:
φ(z) ≡ (2/√π) ∫_0^z e^(−t²) dt   (3.17)

It is this definition which is used by the FORTRAN library function ERF(Z). Our
definition (3.16) is related to this definition by

erf(z) = ½ + ½ φ(z/√2)   (3.18)
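In Python, math.erf implements the definition of equation 3.17, so the c.d.f. of equation 3.16 can be obtained through relation 3.18. A minimal sketch (the function name erf_cdf is my own, not from the notes):

    import math

    def erf_cdf(z):
        """c.d.f. of the unit Gaussian, eq. 3.16, via relation 3.18."""
        return 0.5 + 0.5 * math.erf(z / math.sqrt(2))

    print(erf_cdf(0.0))   # 0.5
    print(erf_cdf(1.0))   # ~0.8413, i.e. P(z <= 1) for N(0, 1)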

The Gaussian as limiting case

The Gaussian distribution is so important because it is a limiting case of nearly all


commonly used p.d.f.’s. This is a consequence of the Central Limit Theorem, which
we will discuss shortly (cf. chapter 5). This relationship is shown for a number of
distributions in the following figure:
[Diagram: limiting relationships among distributions. The binomial B(k; N, p) goes to the
Poisson P(k; µ) for p → 0 with Np = µ constant, and to the normal N(x; µ, σ²) for N → ∞;
the Poisson goes to the normal for µ → ∞; the multinomial M(k; p, N), of which the binomial
is the m = 2 case, goes to the normal for N → ∞; Student’s t, the χ²(N) and the F-distribution
f(F; ν1, ν2) also approach the normal for large numbers of degrees of freedom and are exactly
related to each other for particular values of ν1 and ν2.]

Reproductive property
Since the Poisson p.d.f. has the reproductive property and since the Gaussian p.d.f.
is a limit of the Poisson, it should not surprise us that the Gaussian is also re-
productive: If X and Y are two independent r.v.’s distributed as N(x; µx , σx2 ) and
N(y; µy , σy2 ) then Z = X + Y is distributed as N(z; µz , σz2 ) with µz = µx + µy and
σz2 = σx2 + σy2 . The proof is left as an exercise (exercise 19).

3.8 Log-Normal distribution


If an r.v., y, is normally distributed with mean µ and variance σ 2 , then the r.v.,
x = ey , is distributed as
f(x; µ, σ²) = [1/√(2πσ²)] (1/x) exp[−½ (log x − µ)²/σ²]   (3.19)
As with the normal p.d.f., some authors consider σ, rather than σ 2 , as the parameter
of the p.d.f.
Properties:
E[x] = exp(µ + ½σ²)                      mean
V[x] = exp(2µ + σ²) [exp(σ²) − 1]        variance
Note that the parameters µ and σ 2 are not the mean and variance of the p.d.f.
of x, but rather the parameters of the corresponding normal p.d.f. of y = log x,
N(y; µ, σ 2).

3.9 Multivariate Gaussian or Normal distribution


Consider n random variables xi with expectations (means) µi , which we write as
vectors:

x = (x1, x2, . . . , xn)ᵀ       µ = (µ1, µ2, . . . , µn)ᵀ
The Gaussian is an exponential of a quadratic form in (x − µ). In generalizing
the Gaussian to more than one dimension, we replace (x − µ)²/σ² by the most general
n-dimensional quadratic form which is symmetric about the point µ,

−½ (x − µ)ᵀ A (x − µ)

We have written the −½ explicitly in order that A = 1/σ² in the one-dimensional
case. Since we have constructed this to be symmetric about µ, we must have that
E[x] = µ. Hence, E[x − µ] = 0, and

∫_{−∞}^{+∞} (x − µ) exp[−½ (x − µ)ᵀ A (x − µ)] dx = 0

By differentiating this with respect to µ we get (1 is the unit matrix)

∫_{−∞}^{+∞} [1 − (x − µ)(x − µ)ᵀ A] exp[−½ (x − µ)ᵀ A (x − µ)] dx = 0

Therefore,

E[1 − (x − µ)(x − µ)ᵀ A] = 0
E[(x − µ)(x − µ)ᵀ A] = 1

This expectation is just the definition of the covariance matrix, V (equation 2.36).
Hence V A = 1 or

A = V⁻¹
If the correlations between all the xi are zero, i.e., if all ρij, i ≠ j, are zero, then
V is diagonal with Vii = σi². Then A is also diagonal with Aii = 1/σi² and

exp[−½ (x − µ)ᵀ A (x − µ)] = exp{−½ [(x1 − µ1)²/σ1² + (x2 − µ2)²/σ2² + . . .]}
                           = exp[−(x1 − µ1)²/2σ1²] exp[−(x2 − µ2)²/2σ2²] · · ·
The p.d.f. is thus just the product of n 1-dimensional Gaussians. Thus all ρij = 0
implies that xi and xj are independent. As we have seen (sect. 2.2.4), this is not
true of all p.d.f.’s.

It remains to determine the normalization. The result is

N(x; µ, V) = [1/((2π)^(n/2) |V|^(1/2))] exp[−½ (x − µ)ᵀ V⁻¹ (x − µ)]   (3.20)

where |V| is the determinant of V. This assumes that V is non-singular, i.e.,
|V| ≠ 0. If V is singular, that means that two of the xi are completely correlated,
i.e., |ρij| = 1. In that case we can replace xj by a function of xi thus reducing the
dimension by one.
Comparison of equations 3.13 and 3.20 shows that an n-dimensional Gaussian
may be obtained from a 1-dimensional Gaussian by the following substitutions:

x → x             µ → µ
σ² → V            σ⁻² → V⁻¹
σ → |V|^(1/2)     1/√(2π) → 1/(2π)^(n/2)

These same substitutions are applicable for many (not all) cases of generalization
from 1 to n dimensions, as we might expect since the Gaussian p.d.f. is so often a
limiting case.

Multivariate Normal - summary:

p.d.f.            N(x; µ, V) = [1/((2π)^(n/2) |V|^(1/2))] exp[−½ (x − µ)ᵀ V⁻¹ (x − µ)]   (3.20)
mean              E[x] = µ
covariance        cov(x) = V ,   V[xi] = Vii ,   cov(xi, xj) = Vij
characteristic
function          φ(t) = exp(ı tᵀµ − ½ tᵀ V t)
Other interesting properties:
• Contours of constant probability density are given by
(x − µ)T V −1 (x − µ) = C , a constant

• Any section through the distribution, e.g., at xi = const., gives again a mul-
tivariate normal p.d.f. It has dimension n − 1. For the case xi = const., the
covariance matrix Vn−1 is obtained by removing the ith row and column from
V −1 and inverting the resulting submatrix.
• Any projection onto a lower space gives a marginal p.d.f. which is a multi-
variate normal p.d.f. with covariance matrix obtained by deleting appropriate
rows and columns of V . In particular, the marginal distribution of xi is
fi (xi ) = N(xi ; µi , σi2 )

• A set of variables, each of which is a linear function of a set of normally


distributed variables, has itself a multivariate normal p.d.f.

We will now examine a special case of the multivariate normal p.d.f., that for two
dimensions.

3.10 Binormal or Bivariate Normal p.d.f.


This is the multivariate normal p.d.f. for 2 dimensions. Using (x, y) instead of
(x1, x2), we have

V = ( σx²        ρσxσy )
    ( ρσxσy      σy²   )

V⁻¹ = [1/(σx²σy²(1 − ρ²))] (  σy²      −ρσxσy )
                           ( −ρσxσy     σx²   )

f(x, y) = [1/(2πσxσy√(1 − ρ²))] e^(−G/2)

where G = [1/(1 − ρ²)] { [(x − µx)/σx]² − 2ρ [(x − µx)/σx] [(y − µy)/σy] + [(y − µy)/σy]² }

Contours of constant probability density are given by setting the exponent equal to
a constant. These are ellipses, called covariance ellipses.
For G = 1, the extreme values of the ellipse are at µx ± σx and µy ± σy. The
larger the correlation, the thinner is the ellipse, approaching zero width as |ρ| → 1.
(Of course in the limit of ρ = ±1, G is infinite and we really have just 1 dimension.)
The orientation of the major axis of the ellipse is given by

tan 2θ = 2ρσxσy / (σx² − σy²)

[Figure: a covariance ellipse centred at (µx, µy), touching the lines x = µx ± σx and
y = µy ± σy, with its major axis at angle θ to the x-axis.]

Note that θ = ±45° only if σx² = σy²; θ = 0 if ρ = 0.
In calculating θ by taking the arctangent of the above equation, one must be
careful of quadrants. If the arctangent function is defined to lie between −π/2 and
π/2, then θ is the angle of the major axis if σx > σy; otherwise it is the angle
of the minor axis. It is therefore more convenient to use an arctangent function
defined between −π and π, such as the ATAN2(y,x) of some languages: 2θ =
ATAN2(2ρσxσy, σx² − σy²).
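A sketch of this prescription in Python (my own illustration with arbitrary numbers; eqs. 3.24 below give the standard deviations along the ellipse axes):

    import math

    sx, sy, rho = 2.0, 1.0, 0.8
    theta = 0.5 * math.atan2(2 * rho * sx * sy, sx**2 - sy**2)

    num = sx**2 * sy**2 * (1 - rho**2)            # numerator of eqs. 3.24
    su2 = num / (sy**2 * math.cos(theta)**2 - rho * sx * sy * math.sin(2 * theta)
                 + sx**2 * math.sin(theta)**2)
    sv2 = num / (sx**2 * math.cos(theta)**2 + rho * sx * sy * math.sin(2 * theta)
                 + sy**2 * math.sin(theta)**2)
    # ~23.4 degrees, sigma_u ~ 2.17, sigma_v ~ 0.55 for these values
    print(math.degrees(theta), math.sqrt(su2), math.sqrt(sv2))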

In the one-dimensional Gaussian the probability that x is within k standard


deviations of µ is given by
Z µ+kσ
P (µ − kσ ≤ x ≤ µ + kσ) = N(x; µ, σ 2 ) dx (3.21)
µ−kσ

which is an integral over the interval of x where G ≤ k. In two dimensions this


generalizes to the probability that (x, y) is within the ellipse corresponding to k
standard deviations, which is given∗ by
P(G ≤ k) = [1/(2πσxσy√(1 − ρ²))] ∫∫_{G≤k} e^(−G/2) dx dy   (3.22)

Some values:

k   2-dimensional   1-dimensional              2 × 1-dimensional
    P(G ≤ k)        P(µ − kσ ≤ x ≤ µ + kσ)     P(µx − kσ ≤ x ≤ µx + kσ) × P(µy − kσ ≤ y ≤ µy + kσ)
1   0.3934693       0.6826895                  0.466065
2   0.6321206       0.9544997                  0.911070
3   0.7768698       0.9973002                  0.994608
4   0.8646647       0.9999367                  0.999873
5   0.9179150       0.9999994                  0.999999
6   0.9502129

Note that the 2-dimensional probability for a given k is much less than the cor-
responding 1-dimensional probability. This is easily understood: the product of
the two 1-dimensional probabilities is the probability that (x, y) is in the rectangle
defined by µx − kσx ≤ x ≤ µx + kσx and µy − kσy ≤ y ≤ µy + kσy . The ellipse is
entirely within this rectangle and hence the probability of being within the ellipse
is less than the probability of being within the rectangle.
Since the covariance matrix is symmetric, there exists a unitary transformation
which diagonalizes it. In two dimensions this is the rotation matrix U,

U = ( cos θ   −sin θ )
    ( sin θ    cos θ )

This matrix transforms (x, y) to (u, v):

(u, v)ᵀ = U (x, y)ᵀ

[Figure: the rotated (u, v) axes at angle θ to the (x, y) axes; σu and σv are the standard
deviations along the major and minor axes of the covariance ellipse.]

∗ We will see in sect. 3.12 that G is a so-called χ² r.v. P(G ≤ k) can therefore also be found
from the c.d.f. of the χ² distribution, tables of which, as well as computer routines, are readily
available.

The new covariance matrix is U V U T .


Since the transformation is unitary, areas are preserved (Jacobian |J| = 1). Hence,

P [(x, y) inside ellipse] = P [(u, v) inside ellipse]

The standard deviations of u, v are then found from the transformed covariance
matrix. After some algebra we find

σu² = (σx² cos²θ − σy² sin²θ) / (cos²θ − sin²θ)   (3.23a)
σv² = (σy² cos²θ − σx² sin²θ) / (cos²θ − sin²θ)   (3.23b)

or

σu² = σx²σy²(1 − ρ²) / (σy² cos²θ − ρσxσy sin 2θ + σx² sin²θ)   (3.24a)
σv² = σx²σy²(1 − ρ²) / (σx² cos²θ + ρσxσy sin 2θ + σy² sin²θ)   (3.24b)

Or starting from the uncorrelated (diagonalized) variables (u,v), a rotation by θ to


the new variables x, y will give

σx² = σu² cos²θ + σv² sin²θ   (3.25a)
σy² = σv² cos²θ + σu² sin²θ   (3.25b)
ρ = [(σu² − σv²)/(σxσy)] sin θ cos θ   (3.25c)

Note that if ρ is fairly large, i.e., the ellipse is thin, just knowing σx and σy
would give a very wrong impression of how close a point (x, y) is to (µx , µy ).
The properties stated at the end of the previous section, regarding the condi-
tional and marginal distributions of the multivariate normal p.d.f. can be easily
verified for the bivariate normal. In particular, the marginal p.d.f. is

fx (x) = N(x; µx , σx2 ) (3.26)

and the conditional p.d.f. is

f(y | x) = f(y, x)/fx(x)
         = [1/(√(2πσy²) √(1 − ρ²))] exp{ −[y − µy − ρ(σy/σx)(x − µx)]² / [2σy²(1 − ρ²)] }
         = N( y; µy + ρ(σy/σx)(x − µx), σy²(1 − ρ²) )   (3.27)

3.11 Cauchy (Breit-Wigner or Lorentzian) p.d.f.


The Cauchy p.d.f. is
C(x; µ, α) = (1/πα) · 1/[1 + (x − µ)²/α²]   (3.28)

or in its ‘standard’ form with µ = 0 and α = 1,

C(x; 0, 1) = (1/π) · 1/(1 + x²)   (3.29)
It looks something like a Gaussian, but with bigger tails.
It is usually encountered in physics in a slightly different form as the Breit-Wigner
(or Lorentz) function which gives the distribution of particles of mass m due to a
resonance of mass M and width Γ:

f(m; M, Γ) = (1/2π) Γ / [(m − M)² + (Γ/2)²]

[Figure: a Gaussian (N) and a Cauchy (C) curve of comparable width; the Cauchy has
much larger tails.]

M is the mode and Γ the full width at half maximum (FWHM) of the distribution.
The Cauchy p.d.f. is a pathological distribution. Let us try to calculate the
mean:
E[x] = (1/πα) ∫_{−∞}^{+∞} x / [1 + (x − µ)²/α²] dx = (1/πα) ∫_{−∞}^{+∞} [(x − µ) + µ] / [1 + (x − µ)²/α²] dx
     = (α/π) ∫_{−∞}^{+∞} z/(1 + z²) dz + (µ/π) ∫_{−∞}^{+∞} 1/(1 + z²) dz ,       z = (x − µ)/α
     = (α/2π) [ln(1 + z²)]_{−∞}^{+∞} + (µ/π) π = +∞ − ∞ + µ
which is indeterminate. The mean does not exist! However, noting that the p.d.f.
is symmetric about µ, we can define the mean as
lim_{L→∞} ∫_{µ−L}^{µ+L} x C(x; µ, α) dx = µ

All higher moments are also divergent, and no such trick will allow us to define
them. In actual physical problems the distribution is truncated, e.g., by energy
conservation, and the resulting distribution is well-behaved.
The characteristic function of the Cauchy p.d.f. is
φ(t) = e^(−α|t| + ıµt)

The reproductive property of the Cauchy p.d.f. is rather unusual: x̄ = (1/n) Σ xi is
distributed according to the identical Cauchy p.d.f. as are the xi. (The proof is left
as an exercise.)

3.12 The χ2 p.d.f.


Let x be a vector of n independent r.v.’s, xi, each distributed normally with mean
µi and variance σi². Then the joint p.d.f. is

f(x; µ, σ) = ∏_{i=1}^{n} [1/(√(2π)σi)] exp{−½ [(xi − µi)/σi]²}
           = exp{−½ Σ_{i=1}^{n} [(xi − µi)/σi]²} ∏_{i=1}^{n} 1/(√(2π)σi)

The variable χ2 is defined:


χ²(n) = Σ_{i=1}^{n} [(xi − µi)/σi]²   (3.30)

Being a function of r.v.’s, χ2 is itself a r.v. The χ2 has a parameter n, which is


called the number of degrees of freedom (d.o.f.), since each of the r.v.’s, xi , is free
to vary independently of the others. Note that χ² is regarded as the variable, not
the square of a variable; one does not usually refer to χ = √(χ²).

χ2 with 1 d.o.f.
For example, for n = 1, letting z = (x − µ)/σ, the p.d.f. for z is N(z; 0, 1) and the
probability that z ≤ Z ≤ z + dz is

f(z) dz = [1/√(2π)] e^(−z²/2) dz

Let Q = Z 2 . (We use Q here instead of χ2 to emphasize that this is the variable.)
This is not a one-to-one transformation; both +Z and −Z go into +Q.

[Figure: the transformation Q = Z²; both +|Z| and −|Z| map onto the same value Q.]

The probability that Q is between q and q + dq is the sum of the probability that
Z is between z and z + dz around z = +√q and the probability that Z is between z
and z − dz around z = −√q. Therefore, we must add the p.d.f. obtained from the
+Z → q transformation to that obtained from the −Z → q transformation. The
Jacobians for these two transformations are (cf. section 2.2.6)

J± = d(±z)/dq = ±1/(2√q)

f(q) dq = [1/√(2π)] e^(−q/2) (|J+| + |J−|) dq = [1/√(2π)] e^(−q/2) [1/(2√q) + 1/(2√q)] dq = [1/√(2πq)] e^(−q/2) dq

Now Q was just χ². Hence the p.d.f. for χ² with 1 d.o.f. is

χ²(1) = f(χ²; 1) = [1/√(2πχ²)] e^(−χ²/2)   (3.31)

It may be confusing to use the same symbol, χ2 , for both the r.v. and its p.d.f., but
that’s life!

χ2 with 3 degrees of freedom

For n = 3, using standardized normal variables zi = (xi − µi)/σi, let

R² = χ² = z1² + z2² + z3²

The joint probability is then

g(z1, z2, z3) dz1 dz2 dz3 = [1/(2π)^(3/2)] e^(−R²/2) dz1 dz2 dz3

Think of R as the radius of a sphere in 3-dimensional space. Then, clearly, dz1 dz2 dz3 =
R² dR d cos θ dφ. To get the marginal p.d.f. for R, we integrate over cos θ and φ,
which gives a factor 4π. Hence, the probability that R is between R and R + dR is

f(R) dR = [2/√(2π)] R² e^(−R²/2) dR

Now χ² = R². Hence, dχ² = 2R dR and dR = dχ²/(2√(χ²)). Hence,

f(χ²; 3) dχ² = [2/√(2π)] χ² e^(−χ²/2) dχ²/(2√(χ²))

χ²(3) = f(χ²; 3) = [(χ²)^(1/2)/√(2π)] e^(−χ²/2)   (3.32)

χ2 with n degrees of freedom

For n degrees of freedom, the p.d.f. of χ2 is


χ²(n) = f(χ²; n) = (χ²)^(n/2 − 1) e^(−χ²/2) / [Γ(n/2) 2^(n/2)]   (3.33)

Properties:

mean                      µ = E[χ²(n)] = n
variance                  V[χ²(n)] = σ²_{χ²(n)} = 2n
mode (max.)               at χ²(n) = n − 2 for n ≥ 2 ;   at 0 for n ≤ 2
skewness                  γ1 = 2√(2/n)
kurtosis                  γ2 = 12/n
characteristic function   φ(t) = (1 − 2ıt)^(−n/2)

Reproductive property: Let χi² be a set of variables which are distributed as χ²(ni).
Then Σ χi² is distributed as χ²(Σ ni). This is obvious from the definition of χ²:
The variables χ1² and χ2² are, by definition,

χ1²(n1) = Σ_{i=1}^{n1} zi²       and       χ2²(n2) = Σ_{i=n1+1}^{n1+n2} zi²

Hence, their sum is

χ²_{n1+n2} = χ²_{n1} + χ²_{n2} = Σ_{i=1}^{n1+n2} zi²

which from the definition is a χ² of (n1 + n2) degrees of freedom.


Since the expectation of a χ2 (n) is n, the expectation of χ2 (n)/n is 1. The
quantity χ2 (n)/n is called a “reduced χ2 ”.
Asymptotically (for large n), the χ2 p.d.f. approaches the normal distribution
with mean n and variance 2n:

f(χ²; n) = χ²(n) → N(χ²; n, 2n)   (3.34)

A faster convergence occurs for the variable √(2χ²):

f(√(2χ²); n) = χ²(χ²; n) √(2χ²) → N(√(2χ²); √(2n − 1), 1)   (3.35)

This approximation is good for n greater than about 30.



General definition of χ2
If the n Gaussian variables are not independent, we can change variables such that
the covariance matrix is diagonalized. Since this is a unitary transformation, it
does not change the covariance ellipse G = k. In the diagonal case G ≡ χ2 . Hence,
χ2 = G also in the correlated case. Thus we can take

χ2 = (x − µ)T V −1 (x − µ) (3.36)

as the general definition of the random variable χ2 .
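A sketch of this definition in Python (my own illustration; the covariance values and the measured point are arbitrary), computing G = χ² for a single 2-dimensional measurement:

    sx, sy, rho = 2.0, 1.0, 0.8
    mux, muy = 10.0, 5.0
    x, y = 12.0, 5.5                               # the measured point

    # V^{-1} for the bivariate case (cf. section 3.10)
    det = sx**2 * sy**2 * (1 - rho**2)
    vinv = [[sy**2 / det, -rho * sx * sy / det],
            [-rho * sx * sy / det, sx**2 / det]]

    dx, dy = x - mux, y - muy
    chi2 = dx * vinv[0][0] * dx + 2 * dx * vinv[0][1] * dy + dy * vinv[1][1] * dy
    print(chi2)   # (x - mu)^T V^{-1} (x - mu), here 1.25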

3.13 Student’s t distribution


Consider an r.v., x, normally distributed with mean µ and standard deviation σ.
Then z = (x − µ)/σ is normally distributed with mean 0 and standard deviation 1. In
the normal p.d.f., the mean determines the origin and the standard deviation the
scale. By transforming to the standard variable z, both dependences are removed.
In analyzing data we may not know the σ of the p.d.f. We may then remove the
scale dependence by using the sample standard deviation, σ̂, instead of the parent
standard deviation. We may also not know the parent mean and will use the sample
mean, x̄, instead. For N independent xi (cf. equations 8.3, 8.7),

σ̂² = (1/N) Σ_{i=1}^{N} (xi − µ)² ,            using µ    (3.37a)

σ̂² = [1/(N − 1)] Σ_{i=1}^{N} (xi − x̄)² ,      using x̄ = (1/N) Σ xi    (3.37b)

In either case, nσ̂ 2 /σ 2 is a χ2 (n), i.e., is distributed according to the χ2 distribution


for n = N − k degrees of freedom, where k is 0 if µ is used and is 1 if x̄ is used,
since in the latter case only N − 1 of the terms in the sum are independent. This
is discussed in more detail in section 8.2.1.
We now seek the p.d.f. for the r.v.

x−µ (x − µ)/σ z
t= =q =q (3.38)
σ̂ (nσ̂ 2 /σ 2 )/n χ2 /n

Now z is a standard normal r.v. and χ2 is a χ2 (n). A Student’s t r.v. is thus the
ratio of a standard normal r.v. to the square root of a reduced χ2 r.v. The joint
p.d.f. for z and χ2 is then (equation 3.33)
f(z, χ²; n) dz dχ² = N(z; 0, 1) χ²(χ²; n) dz dχ² = [e^(−z²/2)/√(2π)] [(χ²)^(n/2 − 1) e^(−χ²/2) / (Γ(n/2) 2^(n/2))] dz dχ²

where we have assumed that z and χ2 are independent. This is certainly so if the
x has not been used in determining σ̂, and asymptotically so if n is large. Making
a change of variable, we transform this distribution to one for t and χ2 :
1 2 n−1
2 2
− χ2 (1+ tn )
f (t, χ2 ; n) dt dχ2 = √ n n/2
(χ ) 2 e dt dχ2
2πn Γ( 2 ) 2

Integrating this over all χ2 , we arrive finally at the p.d.f. for t, called Student’s
t distribution,
1 Γ( n+1 2
) 1
t(n) = f (t; n) = √ n 2 (3.39)
πn Γ( 2 ) (1 + tn )(n+1)/2
Properties:
mean        µ = E[t] = 0 ,   n > 1
variance    V[t] = σt² = n/(n − 2) ,   n > 2
skewness    γ1 = 0
kurtosis    γ2 = 6/(n − 4) ,   n > 4
moments     µr = n^(r/2) Γ((r + 1)/2) Γ((n − r)/2) / [Γ(1/2) Γ(n/2)] ,   r even and r < n
            µr = 0 ,   r odd and r ≤ n
            µr does not exist ,   otherwise.
Student’s t distribution is thus the p.d.f. of a r.v., t, which is the ratio of a standard
normal variable and the square root of a normalized χ² r.v., i.e., √(χ²(n)/n), of n
degrees of freedom. It was discovered29 by W. S. Gossett, a chemist working for the
Guinness brewery in Dublin, who in his spare time wrote articles on statistics under
the pseudonym∗ “Student”. The number of degrees of freedom, n, is not required to
be an integer. The t-distribution with non-integral n > 0 is useful in certain
applications, which is, however, beyond the scope of this course.

[Figure: t(t; n) for n = 1, 2, 3, 5, 10 and ∞; with increasing n the curves approach the
standard normal p.d.f.]

∗ In order to prevent competitors from learning about procedures at Guinness, it was the policy
of Guinness that articles by its employees be published under a pseudonym.

Student’s t distribution is symmetric about t = 0. It approaches the standard normal
distribution as the number of degrees of freedom, n, approaches infinity. For n = 1 it is identical to

the standard Cauchy p.d.f. As n → ∞, it approaches the standard normal distri-


bution. It thus has larger tails and a larger variance than the Gaussian, but not so
large as the Cauchy distribution.
We have constructed t from a single observation, x. In a similar way, a r.v. t can
be constructed for the mean of a number of r.v.’s each distributed normally with
mean µ and standard deviation σ. We know from the reproductive property of the
normal p.d.f. that x̄ is also normally distributed with mean µ but with a standard
deviation of σ/√N. Thus z = [(x̄ − µ)/σ] √N is a standard normal r.v. and hence

t = [(x̄ − µ)/σ̂] √N   (3.40)
is distributed as Student’s t with n degrees of freedom. It can be shown3 that x̄
and σ̂ 2 are independent.
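A simulation sketch of the statistic of eq. 3.40 (Python standard library; the sample size and number of trials are arbitrary choices of mine), using the sample mean and eq. 3.37b for σ̂:

    import math
    import random
    from statistics import mean, stdev, variance

    mu, sigma, N, trials = 5.0, 2.0, 6, 20000
    t_values = []
    for _ in range(trials):
        x = [random.gauss(mu, sigma) for _ in range(N)]
        t = (mean(x) - mu) * math.sqrt(N) / stdev(x)   # stdev uses the N-1 definition, eq. 3.37b
        t_values.append(t)

    n_dof = N - 1
    print(variance(t_values), n_dof / (n_dof - 2))     # both close to 5/3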

3.14 The F -distribution


Consider two random variables, χ21 and χ22 , distributed as χ2 with ν1 and ν2 degrees
of freedom, respectively. We define a new r.v., F , as the ratio of the two reduced
χ2 :
F = (χ1²/ν1) / (χ2²/ν2)   (3.41)
The p.d.f. of F may be derived by a method similar to that used for Student’s t
distribution: Start with the joint p.d.f. of the independent variables χ21 , χ22 ; make
a change of variables to F , v = χ22 ; and integrate out the v dependence. The result
is2

f(F; ν1, ν2) = √(ν1^ν1 ν2^ν2) [Γ((ν1 + ν2)/2) / (Γ(ν1/2) Γ(ν2/2))] · F^(ν1/2 − 1) / (ν2 + ν1F)^((ν1+ν2)/2)   (3.42)

This distribution is known by many names: Fisher-Snedecor distribution, Fisher


distribution, Snedecor distribution, variance ratio distribution, and F -distribution.
We could, of course, have written equation 3.41 with the ratio the other way
around. By convention, one usually puts the larger value on top so that F ≥ 1.
Properties:
mean       µ = E[F] = ν2/(ν2 − 2) ,   ν2 > 2
variance   V[F] = 2ν2²(ν1 + ν2 − 2) / [ν1(ν2 − 2)²(ν2 − 4)] ,   ν2 > 4

The distribution is positively skew and tends to normality as ν1, ν2 → ∞, but
only slowly (ν1, ν2 > 50).
The p.d.f. for Z = ½ ln F has a much faster approach to a Gaussian, with a mean
of ½(1/ν2 − 1/ν1) and variance ½(1/ν2 + 1/ν1).
The F -distribution is useful in various hypothesis tests (cf. sections 10.4.3 and
10.7.4). However, for the tests it may be more convenient to use
U = ν1F / (ν2 + ν1F)   (3.43)

which is a monotonic function of F and has a beta distribution (cf. section 3.15).

3.15 Beta distribution


This is a basic distribution for random variables bounded on both sides. Without
loss of generality the bounds are here taken as 0 ≤ x ≤ 1. It has two parameters
(not necessarily integers): n, m > 0. The p.d.f. is

f(x; n, m) = [Γ(n + m)/(Γ(n)Γ(m))] x^(m−1) (1 − x)^(n−1) ,   0 ≤ x ≤ 1   (3.44)
           = 0 ,   otherwise

Properties:
mean       µ = E[x] = m/(m + n)
variance   V[x] = mn / [(m + n)²(m + n + 1)]

For m = n = 1 this becomes the uniform p.d.f.


Do not confuse the beta distribution with the beta function,

β(y, z) = Γ(y)Γ(z)/Γ(y + z) = ∫_0^1 x^(y−1) (1 − x)^(z−1) dx ,   real y, z > 0

to which it is related, and from which the normalization of the p.d.f. is easily derived.

3.16 Double exponential (Laplace) distribution


This distribution is symmetric about the mean. Its tails fall off less sharply than
the Gaussian, but faster than the Cauchy distribution. Note that its first derivative
is discontinuous at x = µ.

f(x; µ, λ) = (λ/2) exp(−λ|x − µ|)   (3.45)

Properties:
mean                      µ = E[x] = µ
variance                  V[x] = 2/λ²
skewness                  γ1 = 0
kurtosis                  γ2 = 3
characteristic function   φ(t) = e^(ıtµ) λ²/(λ² + t²)

It can also be written


f(x; µ, σ²) = [1/√(2σ²)] exp(−√2 |x − µ|/σ)   (3.46)

3.17 Weibull distribution


Originally invented to describe failure rates in ageing lightbulbs, it describes a wide
variety of complex phenomena.
f(t; α, λ) = αλ(λt)^(α−1) e^(−(λt)^α) ,   real t ≥ 0 and α, λ > 0   (3.47)

Properties:
mean       µ = E[t] = (1/λ) Γ(1/α + 1)
variance   V[t] = (1/λ²) { Γ(2/α + 1) − [Γ(1/α + 1)]² }

The exponential distribution (equation 3.10) is a special case (α = 1), when the
probability of failure at time t is independent of t.
Chapter 4

Real p.d.f.’s

There are, of course, many other distributions which we have not discussed in the
previous section. We may introduce a few more later when needed. Now let us turn
to some complications which we will encounter in trying to use these distributions.

4.1 Complications in real life


So far we have treated probability and handled some ideal p.d.f.’s. Given the
p.d.f. for the physical process we want to study, we can, in principle, calculate the
probability of a given experimental result. There are, however, some complications:
• In real life the p.d.f. is quite likely not one of the ideal distributions we have
studied. It may be difficult to calculate. Or it may not even be known.
• The range of variables is never the −∞ to +∞ we have so blithely assumed.
Either it is limited by physics, e.g., conservation of energy, or by our appara-
tus, e.g., a given radio telescope only works in a certain range of frequencies,
in which case we must use the conditional p.d.f., f (x|xmin ≤ x ≤ xmax ).
While truncation is usually a complication, making the p.d.f. more difficult
to calculate (e.g., we must renormalize, which frequently can only be done
by numerical integration), occasionally it is welcome, e.g., the Cauchy p.d.f.
becomes well-behaved if truncated at µ ± a:
C(x; µ = 0, α = 1) = (1/π) · 1/(1 + x²)
     → C(x; 0, 1) / ∫_{−a}^{+a} C(x; 0, 1) dx = [1/(2 arctan a)] · 1/(1 + x²)

which has a finite variance (recall that the Cauchy p.d.f. did not):

V[x] = [1/(2 arctan a)] ∫_{−a}^{+a} x²/(1 + x²) dx = a/arctan a − 1


• The physical p.d.f. may be modified by the response of the detector. This
response must then be convoluted with the physical p.d.f. to obtain the p.d.f.
which is actually sampled.

“Now we see in a mirror dimly ...


Now I know in part ...”
—1 Corinthians 13:12

4.2 Convolution
Experimentally we often measure the sum of two (or more) r.v.’s. For example,
in the decay n → p e− ν̄e we want to measure the energy of the electron, which is
distributed according to a p.d.f. given by the theory of weak interactions, f1 (E1 ).
But we measure this energy with some apparatus, which has a certain resolution.
Thus we do not record the actual energy E1 of the electron but E1 + δ, where δ
is distributed according to the resolution function (p.d.f.) of the apparatus, f2 (δ).
What is then the p.d.f., f (E), of the quantity we record, i.e., E = E1 + δ? This
f (E) is called the (Fourier) convolution of f1 and f2 .
Assume E1 and δ to be independent. This may seem reasonable since E1 is from
the physical process (n decay) and δ is something extra added by the apparatus,
which has nothing at all to do with the decay itself. Then the joint p.d.f. is

f12 (E1 , δ) = f1 (E1 ) f2 (δ)

The c.d.f. of E = E1 + δ is then


F(E) = ∫∫_{E1+δ ≤ E} f1(E1) f2(δ) dE1 dδ
     = ∫_{−∞}^{+∞} dE1 f1(E1) ∫_{−∞}^{E−E1} dδ f2(δ)
     = ∫_{−∞}^{+∞} dE1 f1(E1) F2(E − E1)
or   = ∫_{−∞}^{+∞} dδ f2(δ) F1(E − δ)

The p.d.f. can then be calculated from the c.d.f.:


f(E) = dF(E)/dE = ∫_{−∞}^{+∞} f1(E1) f2(E − E1) dE1
or    = ∫_{−∞}^{+∞} f2(δ) f1(E − δ) dδ

The characteristic function is particularly useful in evaluating convolutions:


φf(t) = ∫ e^(ıtE) f(E) dE
      = ∫∫ e^(ıtE) f1(E1) f2(E − E1) dE1 dE
      = ∫∫ e^(ıtE1) f1(E1) e^(ıt(E−E1)) f2(E − E1) dE1 dE       since E = E1 + (E − E1)
      = φf1(t) φf2(t)   (4.1)

Thus, assuming that the r.v.’s are independent, the characteristic function of a
convolution is just the product of the individual characteristic functions. (This
probably looks rather familiar. We have already seen it in connection with the
reproductive property of distributions; in that case f1 and f2 were the same p.d.f.)
Recall that the characteristic function is a Fourier transform. Hence, a convolution,
E = E1 + δ, where δ is independent of E, is known as a Fourier convolution.
Another type of convolution, called the Mellin convolution, involves the product
of two random variables, e.g., E = E1 R1 . As we shall see, the Fourier convolution
is easily evaluated using the characteristic function, which is essentially a Fourier
transform of the p.d.f. Similarly, the Mellin convolution can be solved using the
Mellin transformation, but we shall not cover that here.
In the above example we have assumed a detector response independent of what
is being measured. In practice, the distortion of the input signal usually depends
on the signal itself. This can occur in two ways:
1. Detection efficiency. The chance of detecting an event with our apparatus
may depend on the properties of the event itself. For example, we want to
measure the frequency distribution of electromagnetic radiation incident on
the earth. But some of this radiation is absorbed by the atmosphere. Let
f (x) be the p.d.f. for the frequency, x, of incident radiation and let e(x) be
the probability that we will detect a photon of frequency x incident on the
earth. Both f and e may depend on other parameters, y, e.g., the direction
in space in which we look. The p.d.f. of the frequency of the photons which
we detect is

g(x) = ∫ f(x, y) e(x, y) dy / ∫∫ f(x, y) e(x, y) dx dy
2. Resolution. To continue with the above example, suppose the detector records
frequency x′ when a photon of frequency x is incident. Let r(x′ , x) be the p.d.f.
that this will occur. Then
g(x′) = ∫ r(x′, x) f(x) dx

In the case that r is just a function of x − x′ we get the simple convolution


handled above. Note that resolution effects can lead to values of x′ which lie

outside the physical range of x, e.g., an energy of a particle which is larger


than the maximum energy allowed by energy conservation. The Central Limit
Theorem (chapter 5) will tell us that the detector response, or resolution
function, is usually normally distributed for a given input to the detector:

r(x′, x) = [1/(√(2π)σ)] exp[−½ (x′ − x)²/σ²]
         = N(x′; x, σ²)   if σ is constant

However in practice σ often depends on x, in which case r(x′ , x) may still have
the above form, but is not really a Gaussian.
If the resolution function is Gaussian and if the physical p.d.f., f (x), is also
Gaussian, f (x) = N(x; µ, τ 2 ), then you can show, by using the reproductive
property of the Gaussian (exercise 19) or by evaluating the convolution using
the characteristic function (equation 4.1), that the p.d.f. for x′ is also normal:
g(x′) = ∫_{−∞}^{+∞} f(x) r(x′, x) dx = N(x′; µ, σ² + τ²)
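This last result is easy to check numerically (a sketch in Python with arbitrary parameter values of my own): smearing a Gaussian ‘physics’ distribution with a Gaussian resolution gives a distribution whose variance is the sum σ² + τ².

    import random
    from statistics import mean, variance

    mu, tau, sigma, n = 3.0, 0.5, 0.2, 100000
    x_true     = [random.gauss(mu, tau) for _ in range(n)]       # physical p.d.f. N(x; mu, tau^2)
    x_measured = [x + random.gauss(0.0, sigma) for x in x_true]  # add the resolution smearing

    print(mean(x_measured), mu)                     # ~3.0
    print(variance(x_measured), tau**2 + sigma**2)  # ~0.29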
Chapter 5

Central Limit Theorem

5.1 The Theorem


This is a very important theorem; you could call it the ‘central’ theorem of statistics.
It states:
Given n independent variables, xi, distributed according to p.d.f.’s, fi, having
mean µi and variance Vi = σi², then the p.d.f. for the sum of the xi, S ≡ Σ xi, has
expectation (mean) E[S] = Σ µi and variance V[S] = Σ Vi = Σ σi² and approaches
the normal p.d.f. N(S; Σ µi, Σ σi²) as n → ∞:

lim_{n→∞} f(S) → N(S; Σ_{i=1}^{n} µi, Σ_{i=1}^{n} σi²) ,       S = Σ_{i=1}^{n} xi   (5.1)

It must be emphasized that the mean and variance must exist.


It is left as an exercise to show that
µS = Σ µi   (5.2)
and σS² = V[S] = Σ Vi = Σ σi²   (5.3)
Proving the C.L.T. in the general case is a bit too difficult for us. We will only
demonstrate it for the restricted case where all the p.d.f’s are the same, fi = f .
Without loss of generality we can let µ = 0. Then σ 2 = E [x2 ]. We also assume not
only that the mean and variance of f are finite, but also that the expectations of
higher powers of x are finite such that we can expand the characteristic function
of f (equation 2.31):
φx(t) = E[e^(ıtx)] = 1 + (ıt)²σ²/2 + (ıt)³E[x³]/3! + . . .
      = 1 − σ²t²/2 + . . .

Let u = x/(σ√n). The p.d.f. for u has variance 1/n. Then

φu(t) = E[e^(ıtu)] = 1 − t²/(2n) + . . .

Now recall that the characteristic function of a sum of independent r.v.’s is the
product of the individual characteristic functions. Therefore, the characteristic
function of Su = Σ ui is

φSu(t) = [φu(t)]^n = [1 − t²/(2n) + . . .]^n

which in the limit n → ∞ is just an exponential:

φSu(t) = exp(−t²/2)

But this is just the characteristic function of the standard normal N(Su; 0, 1). Since
Su = Σ ui = S/(σ√n), the p.d.f. for S = Σ xi is the normal p.d.f. N(S; nµ, nσ²).
A corollary of the C.L.T.: Under the conditions of the C.L.T., the p.d.f. of $S/n$ approaches the normal p.d.f. as $n \to \infty$:
\[ \lim_{n\to\infty} f\left(\frac{S}{n}\right) = N\left( \frac{S}{n}; \frac{\sum \mu_i}{n}, \frac{\sum \sigma_i^2}{n^2} \right), \qquad S = \sum_{i=1}^n x_i \tag{5.4} \]

or in the case that all the $f_i$ are the same:
\[ \lim_{n\to\infty} f\left(\frac{S}{n}\right) = N\left( \frac{S}{n}; \mu, \frac{\sigma^2}{n} \right), \qquad S = \sum_{i=1}^n x_i \tag{5.5} \]

5.2 Implications for measurements


The C.L.T. shows why the Gaussian p.d.f. is so important. Most of what we measure
is in fact the sum of many r.v.’s. For example, you measure the length of a table with
a ruler. The length you measure depends on a lot of small effects: optical parallax,
calibration of the ruler, temperature, your shaking hand, etc. A digital meter has
electronic noise at various places in its circuitry. Thus, what you measure is not
only what you want to measure, but added to it a large number of (hopefully) small
contributions. If this number of small contributions is large the C.L.T. tells us that
their total sum is Gaussian distributed. This is often the case and is the reason
resolution functions are usually Gaussian. But if there are only a few contributions,
or if a few of the contributions are much larger than the rest, the C.L.T. is not
applicable, and the sum is not necessarily Gaussian.
Consider the passage of particles, e.g., an α particle, through matter. Usually the
α undergoes a large number of small-angle scatters producing a small net deflection.
This net deflection is Gaussian distributed since it results from a large number of
individual scatters. However occasionally there is a large-angle scattering; usually
not, but sometimes 1 and very rarely 2. The distribution of the scattering angle θ
when there has been one or more large-angle scatters will not be Gaussian, since

1 or 2 is not a large number. Instead, the p.d.f. for θ will be the convolution of
the Gaussian for the net deflection from many small-angle scatters with the actual
p.d.f. for the large-angle scatters. It will look something like:
[figure: the long-tailed p.d.f. of the scattering angle θ for events with one or more large-angle scatters]

Adding this to the Gaussian p.d.f. for the much more likely case of no large-angle
scatters will give a p.d.f. which looks almost like a Gaussian, but with larger tails:

[figure: the resulting p.d.f., nearly Gaussian but with long tails: the many small-angle scatterings give the Gaussian core, while the occasional large-angle scatterings add a non-Gaussian, long-tailed contribution]

This illustrates that the further you go from the mean, the worse the Gaussian
approximation is likely to be.
The C.L.T. also shows the effect of repeated measurements of a quantity. For
example, we measure the length of a table with a ruler. The variance of the p.d.f.
for 1 measurement is $\sigma^2$; the variance of the p.d.f. for an average of n measurements is $\sigma^2/n$. Thus σ is reduced by $\sqrt{n}$.
If a r.v. is the product of many factors, then its logarithm is a sum of equally
many terms. Assuming that the CLT holds for these terms, then the r.v. is asymp-
totically distributed as the log-normal distribution.
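To see the theorem at work numerically, the following small sketch (in Python with NumPy; the choice of language, of a uniform parent p.d.f. and of the sample sizes is ours, not part of the notes) averages n uniform r.v.'s many times and compares the observed spread of the averages with $\sigma/\sqrt{n}$:

    import numpy as np

    rng = np.random.default_rng(1)
    n_exp = 10000                    # number of simulated 'experiments'
    sigma = 1.0 / np.sqrt(12.0)      # standard deviation of a single U(0,1) measurement

    for n in (1, 4, 16, 64):
        x = rng.random((n_exp, n))   # each row is one experiment of n measurements
        means = x.mean(axis=1)
        # the spread of the averages should be close to sigma/sqrt(n),
        # and a histogram of 'means' looks increasingly Gaussian as n grows
        print(n, means.std(ddof=1), sigma / np.sqrt(n))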
“You can . . . never foretell what any one man
will do, but you can say with precision what an
average number will be up to. Individuals vary,
but percentages remain constant.”
—Arthur Conan Doyle: Sherlock Holmes in
“The Sign of Four”
Part II

Monte Carlo

Chapter 6

Monte Carlo

The term Monte Carlo is used for calculational techniques which make use of random
numbers. These techniques represent the solution of a problem as a parameter of
a hypothetical population, and use a random sequence of numbers to construct a
sample of the population, from which statistical estimates of the parameter are
obtained.
The Monte Carlo solution of a problem thus consists of three parts:
1. choice of the p.d.f. which describes the hypothetical population;
2. generation of a random sample of the hypothetical population using a random
sequence of numbers; and
3. statistical estimation of the parameter in question from the random sample.
It is no accident that these three steps correspond to the three parts of these lectures.
P.d.f.’s have been covered in part I; this part will cover the generation of a Monte
Carlo sample according to a given p.d.f.; and part III will treat statistical estimation,
which is done in the same way for Monte Carlo as for real samples.
If the solution of a problem is the number F , the Monte Carlo estimate of F
will depend on, among other things, the random numbers used in the calculation.
The introduction of randomness into an otherwise well-defined problem may seem
rather strange, but we shall see that the results can be very good.
After a short treatment of random numbers (section 6.1) we will treat a common
application of the Monte Carlo method, namely integration (section 6.2) for which
the statistical estimation is particularly simple. Then, in section 6.3 we will treat
methods to generate a Monte Carlo sample which can then be used with any of the
statistical methods of part III.

6.1 Random number generators


Random number generators may be classified as true random number generators or
as pseudo-random number generators.


6.1.1 True random number generators


True random number generators must be based on random physical processes, e.g.,

• the potential across a resistor, which arises from thermal noise.

• the time between the arrival of two cosmic rays.

• the number of radioactive decays in a fixed time interval.

An example of how we could use this last possibility is to turn on a counter for
a fixed time interval, long enough that the average number of decays is large. If
the number of detected decays is odd, we record a 1; if it is even, we record a 0.
We repeat this the number of times necessary to make up the fraction part of our
computer’s word (assuming a binary computer). We then have a random number
between 0 and 1.
Unfortunately, this procedure does not produce a uniform distribution if the
probability of an odd number of decays is not equal to that of an even number. To
remove this bias we could take pairs of bits: If both bits are the same, we reject
both bits; if they are different, we accept the second one. The probability that we
end up with a 1 is then the probability that we first get a zero and second a one;
the probability that we end up with a zero is the probability that we first get a
one and second a zero. Assuming no correlation between successive trials, these
probabilities are equal and we have achieved our goal.
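As an illustration of this bit-pairing trick, here is a minimal sketch in Python (the language and the deliberately biased 'coin' standing in for the odd/even decay counts are our own assumptions, not part of the notes):

    import random

    def biased_bit(p_one=0.6):
        # stand-in for 'odd number of decays', occurring with probability p_one
        return 1 if random.random() < p_one else 0

    def unbiased_bit():
        # take pairs of bits; reject equal pairs, keep the second bit otherwise
        while True:
            a, b = biased_bit(), biased_bit()
            if a != b:
                return b

    # assemble a 24-bit fraction between 0 and 1 from unbiased bits
    bits = [unbiased_bit() for _ in range(24)]
    r = sum(b / 2.0**(i + 1) for i, b in enumerate(bits))
    print(r)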
The main problem with such generators is that they are very slow. Not wanting
to have too dangerous a source, i.e., not too much more than the natural background
(cosmic rays are about 200 m$^{-2}$s$^{-1}$), nor too large a detector, it is clear that we will
have counting times of the order of milliseconds. For a 24-bit fraction, that means
24 counting intervals per real random number, or 96 intervals if we remove the bias.
Thus we can easily spend of the order of 1 second to generate 1 random number!
They are also not, by their very nature, reproducible, which can be a problem
when debugging a program.

6.1.2 Pseudo-random number generators


A pseudo-random number generator produces a sequence of numbers calculated
by a strictly mathematical procedure, which nonetheless appears random by some
statistical tests. Since the sequence is not really random, there will certainly exist
some other statistical test for which it will fail to appear random.
Several algorithms have been used to produce pseudo-random generators,30 de-
scriptions of which are beyond the scope of this course. In FORTRAN77, generators
have usually been introduced as functions with names such as RAN. The statement X
= RAN(0) assigns the next number in the random number sequence to the variable
X. The argument of the function is a dummy argument which is not used. The
generation proceeds from a ‘seed’, each number in the sequence acting as the seed

for the next. Usually there is a provision allowing the user to set the seed at the
start of his program and to find out what the seed is at the end. This feature allows
a new run to be made starting where the previous run left off. In FORTRAN90 this
is standardized by providing an intrinsic subroutine, random_number(h), which fills
the real variable (or array) h with pseudo-random numbers in the interval [0, 1).
A subroutine random_seed is also provided to input a seed or to inquire what the
seed is at any point in the calculation. However, no requirements are made on the
quality of the generated sequence, which will therefore depend on the compiler used.
In critical applications one may therefore prefer to use some other generator.
Recently, new methods have been developed resulting in pseudo-random number
generators far better than the old ones.31 In particular, the short periods (i.e., the number of values after which the sequence repeats itself) of the old generators have been greatly lengthened. For example, the generator RANMAR has a period of the order of $10^{43}$. The new generators
are generally written as subroutines returning an array of random numbers rather
than as a function, since the time to call a subroutine or invoke a function is of the
same order as the time to generate one number, e.g., CALL RANMAR (RVEC,90) to
obtain the next 90 numbers in the sequence in the array RVEC, which of course must
have a dimension of at least 90.
Some pseudo-random number generators generate numbers in the closed interval
[0, 1] rather than the open interval. Although it occurs very infrequently (once in
$2^{24}$ times on a 32-bit computer), the occurrence of an exact 0 can be particularly annoying
if you happen to divide by it. The open interval is therefore recommended.
Any one who considers arithmetical methods of producing
random digits is, of course, in a state of sin.
— John von Neumann

6.2 Monte Carlo integration


Much of this section has been taken from James30 and Lyons8 .
We want to evaluate the integral
\[ I = \int_a^b y(x)\, dx \tag{6.1} \]

We will discuss several Monte Carlo methods to do so.

6.2.1 Crude Monte Carlo


A trivial (certainly not the best) numerical method is to divide the interval (a, b)
into n subintervals and add up the areas of each subinterval using the value of y at
the middle of the interval:
\[ I = \frac{b-a}{n} \sum_{i=1}^n y(x_i), \qquad x_i = a + \left( i - \frac{1}{2} \right)\frac{b-a}{n} \]

An obvious Monte Carlo method, called crude Monte Carlo, is to do the same
sum, but with
xi = a + ri (b − a)
where the ri are random numbers uniformly distributed on the interval (0, 1).
More formally, the expectation of the function y(x) given a p.d.f. f (x) which is
non-zero in (a, b) is given by
\[ \mu_y = E[y] = \int_a^b y(x)\, f(x)\, dx \]

Since the available pseudorandom number generators sample a uniform distribution,


we take f (x) to be the uniform p.d.f. f (x) = 1/(b − a), a ≤ x ≤ b. Then
\[ \mu_y = E[y] = \frac{1}{b-a}\int_a^b y(x)\, dx = \frac{I}{b-a} \]
\[ \sigma_y^2 = V[y] = \frac{1}{b-a}\int_a^b (y - \mu_y)^2\, dx = \frac{1}{b-a}\int_a^b y^2\, dx - \mu_y^2 \]
Let us emphasize that µy and σy2 are the expectation and variance of the function
y(x) for a uniform p.d.f. Do not confuse them with the mean and variance of a
p.d.f.—y(x) is not a p.d.f.
Let yi = y(xi ) where the xi are distributed according to f (x), i.e., uniformly.
Then, by the C.L.T., the average of the n values yi approaches the normal distri-
bution for large n:
\[ N\left( \frac{\sum y_i}{n};\; \mu_y, \frac{\sigma_y^2}{n} \right) = N\left( \frac{\sum y_i}{n};\; \frac{I}{b-a}, \frac{\sigma_y^2}{n} \right) \]

We shall see in statistics (sect. 8.3) that an expectation value, e.g., E [y], can be
estimated by the sample mean of the quantity, $\bar{y} = \sum y_i / n$.
Thus by generating n values xi distributed uniformly in (a, b) and calculating

the sample mean, we determine the value of $I/(b-a)$ to an uncertainty $\sigma_y/\sqrt{n}$:
\[ I = \frac{b-a}{n}\sum_{i=1}^n y(x_i) \pm (b-a)\frac{\sigma_y}{\sqrt{n}} \tag{6.2} \]
In practice, if we do not know $\int y\, dx$, it is unlikely that we know $\int y^2\, dx$, which is necessary to calculate $\sigma_y$. However, we shall see that this too can be estimated from the Monte Carlo points (eq. 8.7):
\[ \widehat{\sigma^2} = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2 \]
Since n is large, we can replace $n-1$ by $n$. Multiplying out the sum we then get
\[ \widehat{\sigma^2} = \overline{y^2} - \bar{y}^2 \]

Hence the integral is estimated by


\[ I = \int_a^b y(x)\, dx = (b-a)\left( \bar{y} \pm \frac{1}{\sqrt{n}}\sqrt{\overline{y^2} - \bar{y}^2} \right) \tag{6.3} \]
Generalizing to more than one dimension is straightforward: Points are gen-
erated uniformly in the region of integration. The Monte Carlo estimate of the
integral is still given by equation 6.3 if the length of the interval, (b − a), is replaced
by the volume of the region of integration.
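A minimal sketch of crude Monte Carlo integration with the uncertainty estimate of equation 6.3 (Python with NumPy; the test integrand $\int_0^\pi \sin x\, dx = 2$ is our own choice):

    import numpy as np

    def crude_mc(y, a, b, n, rng):
        # crude Monte Carlo estimate of the integral of y on (a, b), eq. 6.3
        x = rng.uniform(a, b, n)
        yi = y(x)
        ybar = yi.mean()
        y2bar = (yi**2).mean()
        integral = (b - a) * ybar
        error = (b - a) * np.sqrt((y2bar - ybar**2) / n)
        return integral, error

    rng = np.random.default_rng(2)
    print(crude_mc(np.sin, 0.0, np.pi, 100000, rng))   # exact value is 2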

6.2.2 Hit or Miss Monte Carlo


Another method to evaluate the integral (6.1) is by hit or miss Monte Carlo. In this method two random numbers are required per evaluation of y(x). Let R[x, y] be a random number uniformly distributed on (x, y). Then generate a point in the rectangle defined by the minimum and maximum values of y and the limits of integration, a and b:
\[ x_i = R[a, b] \]
\[ y_i = R[y_{\min}, y_{\max}] \]
[figure: the curve y(x) on (a, b) inside the rectangle bounded by $y_{\min}$ and $y_{\max}$]
If you do not know $y_{\min}$ and $y_{\max}$, you must guess 'safe' values. The generated point is called a
'hit' if $y_i < y(x_i)$, a 'miss' if $y_i > y(x_i)$.
Then an estimate of I is given by the fraction of points which are hits:
\[ I = \frac{n_{\rm hits}}{n}(b-a)(y_{\max} - y_{\min}) + y_{\min}(b-a) \]
Since hit or miss is a binomial situation, the number of hits follows the binomial
p.d.f. with expectation E [nhits ] = np and variance V [nhits ] = np(1 − p), where p is
the probability of a hit. V [I] is trivially related to V [nhits ]:
\[ V[I] = \frac{1}{n^2}\, V[n_{\rm hits}]\,(b-a)^2 (y_{\max} - y_{\min})^2 = \frac{1}{n}\, p(1-p)(b-a)^2 (y_{\max} - y_{\min})^2 \]
The probability, p, of a hit can be estimated from the result: $\hat{p} = n_{\rm hits}/n$. Thus
\[ I = \frac{n_{\rm hits}}{n}(b-a)(y_{\max} - y_{\min}) + y_{\min}(b-a) \pm \sqrt{ \frac{n_{\rm hits}}{n}\left( 1 - \frac{n_{\rm hits}}{n} \right) } \, \frac{(b-a)(y_{\max} - y_{\min})}{\sqrt{n}} \tag{6.4} \]

Here too, the generalization to more than one dimension is straightforward:


Points are generated uniformly in the region of integration and the function value
is tested for a hit. The integral is then given by equation 6.4 with (b − a) replaced
by the volume of the region in which points were generated.
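For comparison, a sketch of hit or miss Monte Carlo implementing equation 6.4 (same assumptions as before: Python with NumPy and the test integrand $\sin x$ on $(0, \pi)$):

    import numpy as np

    def hit_or_miss(y, a, b, ymin, ymax, n, rng):
        # hit-or-miss Monte Carlo estimate of the integral of y on (a, b), eq. 6.4
        x = rng.uniform(a, b, n)
        r = rng.uniform(ymin, ymax, n)
        p_hat = np.count_nonzero(r < y(x)) / n          # fraction of hits
        integral = p_hat * (b - a) * (ymax - ymin) + ymin * (b - a)
        error = np.sqrt(p_hat * (1.0 - p_hat) / n) * (b - a) * (ymax - ymin)
        return integral, error

    rng = np.random.default_rng(3)
    print(hit_or_miss(np.sin, 0.0, np.pi, 0.0, 1.0, 100000, rng))   # exact value is 2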

6.2.3 Buffon’s needle, a hit or miss example


An early (1777) application of the Monte Carlo technique was to estimate the value
of π. This calculation, known as Buffon’s needle,32 proceeds as follows: Parallel lines
separated by distance d are drawn on the floor. A needle of length d is dropped on
the floor such that its position (distance of the center of the needle to the nearest
line) and its orientation (angle, θ, between the needle and a perpendicular to the
lines) are both distributed uniformly. If the needle lies across a line we have a hit,
otherwise a miss.
For a given θ, the chance of a hit is given by the conditional p.d.f.
\[ f(\mathrm{hit}\,|\,\theta) = \frac{\text{projected length of needle on a perpendicular}}{\text{distance between lines}} = \frac{d\cos\theta}{d} = \cos\theta \]
The chance of a hit irrespective of θ is then
\[ p = \int_0^{\pi/2} f(\mathrm{hit}\,|\,\theta)\, f(\theta)\, d\theta = \int_0^{\pi/2} \cos\theta \, \underbrace{\frac{1}{\pi/2 - 0}}_{f(\theta)}\, d\theta = \frac{2}{\pi} \]

Thus an estimate of 2/π is given by the estimator of p, namely p̂ = nhits /n and an


estimate of π by 2n/nhits .
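A sketch of Buffon's needle as a hit-or-miss calculation (Python with NumPy; the needle length equals the line spacing d, and all lengths are expressed in units of d):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 1000000
    theta = rng.uniform(0.0, np.pi / 2.0, n)   # orientation of the needle
    centre = rng.uniform(0.0, 0.5, n)          # distance of the centre to the nearest line
    hits = np.count_nonzero(centre <= 0.5 * np.cos(theta))
    print("pi estimate:", 2.0 * n / hits)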

6.2.4 Accuracy of Monte Carlo integration


The uncertainty of the Monte Carlo integration decreases, for both crude and hit
or miss Monte Carlo, with the number of points, n, as n−1/2 . However crude Monte
Carlo is usually more accurate than the hit or miss method. For example, take the
integral involved in Buffon’s needle. In crude Monte Carlo,
\[ \mu_y = E[y] = \frac{I}{b-a} = \frac{2}{\pi}\int_0^{\pi/2}\cos\theta\, d\theta = \frac{2}{\pi} \]
\[ V[y] = \frac{1}{b-a}\int_0^{\pi/2}\cos^2\theta\, d\theta - \mu_y^2 = \frac{1}{2} - \left(\frac{2}{\pi}\right)^2 = 0.0947 \]
The uncertainty of the estimation of I is then $\sqrt{0.0947}/\sqrt{n} = 0.308/\sqrt{n}$.
On the other hand, hit or miss yields, using $p = 2/\pi$:
\[ V[I] = \frac{1}{n}\, p(1-p)\left(\frac{\pi}{2}\right)^2 = \frac{0.571}{n} \]

The uncertainty of the estimation of I is then $\sqrt{0.571}/\sqrt{n} = 0.756/\sqrt{n}$, which is
considerably larger (more than a factor 2) than for crude Monte Carlo.
The uncertainty of Monte Carlo integration is compared with that of numerical
methods in the following table:

uncertainty in I calculated from n points

method                 1 dimension        d dimensions
Monte Carlo            $n^{-1/2}$         $n^{-1/2}$
trapezoidal rule       $n^{-2}$           $n^{-2/d}$
Simpson's rule         $n^{-4}$           $n^{-4/d}$
m-point Gauss rule     $n^{-(2m-1)}$      $n^{-(2m-1)/d}$

Thus we see that Monte Carlo integration converges much more slowly than
other methods, particularly for low numbers of dimensions. Only for greater than
8 dimensions does Monte Carlo converge faster than Simpson’s rule, and there is
always a Gauss rule which converges faster than Monte Carlo.
However, there are other considerations besides rate of convergence: The first is
the question of feasibility. For example, in 38 dimensions a 10-point Gauss method
converges at the same rate as Monte Carlo. However, in the Gauss method, the number of points is fixed, $n = m^d$, which in our example is $10^{38}$. The evaluation of even a very simple function requires on the order of microseconds on a fast computer. So $10^{38}$ is clearly not feasible. ($10^{32}$ sec. $\approx \pi \times 10^{24}$ years, while the age of the universe is only of order 12 Gyr.)
Another problem with numerical methods is the boundary of integration. If the
boundary is complicated, numerical methods become very difficult. This is, how-
ever, easily handled in Monte Carlo. One simply takes the smallest hyperrectangle
that will surround the region of integration and integrates over the hyperrectangle,
throwing away the points that fall outside the region of integration. This leads to
some inefficiency, but is straightforward. This is one of the chief advantages of the
Monte Carlo technique. An example is given by phase space integrals in particle
physics. Consider the decay n → pe− ν e , the neutron at rest. Calculations for this
decay involve 9 variables, px , py , pz for each of the 3 final-state particles. However
these variables are not independent, being constrained by energy and momentum
conservation, $\sum p_x = \sum p_y = \sum p_z = 0$ and $\sum E = m_n c^2$, where the energy of a particle is given by $E = \sqrt{m^2c^4 + p_x^2c^2 + p_y^2c^2 + p_z^2c^2}$. This complicated boundary
makes an integration by numerical methods difficult; it becomes practically im-
possible for more than a few particles. However Monte Carlo integration is quite
simple: one generates points uniformly in the integration variables, calculates the
energy and momentum components for each particle and tests whether momentum
and energy are conserved. If not, the point is simply rejected.
Another practical issue might be termed the growth rate. Suppose you have
performed an integration and then decide that it is not accurate enough. With

Monte Carlo you just have to generate some more points (starting your random
number generator where you left off the previous time). However, with the Gauss
rule, you have to go to a higher order m. All the previous points are then useless
and you have to start over.

6.2.5 A crude example in 2 dimensions


One of the advantages of Monte Carlo is the ease with which irregular integration regions can be handled. Consider a two-dimensional integral over a triangular region:
\[ I = \int_a^b dx \int_a^x dy\; g(x, y) \]
[figure: the triangular region of integration, $a \le y \le x$, $a \le x \le b$]
We give five ways of estimating this integral using crude Monte Carlo:

1. The obvious way:


(a) Choose xi = R[a, b].
(b) Choose yi = R[a, xi ].
(c) Sum the $g(x_i, y_i)$: $\displaystyle I = \frac{(b-a)^2}{2n}\sum_{i=1}^n g(x_i, y_i)$
This method, although obvious, is incorrect. This is because the points
(xi , yi ) are not uniformly distributed over the region of integration. There
are (approximately) the same number of points for a < x < (a + b)/2 as for
(a + b)/2 < x < b, while the areas differ by a factor 3.
2. Rejection method:
(a) Choose xi = R[a, b] and yi = R[a, b].
(b) Define a new function z(x, y) which is defined on the entire region for
which points are generated, but which has the same integral as g:

\[ z_i = \begin{cases} 0, & \text{if } y_i > x_i, \\ g(x_i, y_i), & \text{if } y_i < x_i. \end{cases} \]
Or, equivalently, reject the point if it does not lie in the region of inte-
gration, i.e., if yi > xi .
(c) Then sum the $z_i$: $\displaystyle I = \frac{(b-a)^2}{n}\sum_{i=1}^n z_i$

3. Rejection method (area of region of integration known): The above rejection


method results in a perhaps needlessly large error since we are using Monte
Carlo to estimate the integral of z, even where we know that z = 0. Another
way of looking at it is that we are using Monte Carlo to estimate what fraction,
fa , of the area of point generation is taken up by the area of integration.
Hence, if we know this fraction we can remove this contribution to the error
by simply rejecting the points not in the area of integration. We proceed as
follows:

(a) Choose xi = R[a, b] and yi = R[a, b].


(b) Reject the point if it does not lie in the region of integration, i.e., if
yi > xi .
(c) Then sum the g(xi , yi ) replacing the area of point generation by the area
of the region of integration, $f_a (b-a)^2$. In this example we know that $f_a = \frac{1}{2}$. The result is then
\[ I = \frac{1}{2}\,\frac{(b-a)^2}{n'}\sum_{i=1}^{n'} g(x_i, y_i) \]
where $n'$ is the number of generated points lying in the region of integration.

Both rejection methods are correct, but inefficient—both use only half of the
points. Sometimes this inefficiency can be overcome by a trick:

4. Folding method (a trick):

(a) Choose ui = R[a, b] and vi = R[a, b].


(b) Let xi = max(ui , vi ) and yi = min(ui , vi ).
(c) Then sum the $g_i$: $\displaystyle I = \frac{(b-a)^2}{2n}\sum_{i=1}^n g(x_i, y_i)$

This is equivalent to generating points uniformly over the whole square and
then folding the square about the diagonal so that all the points fall in the
triangular region of integration. The density of points remains uniform.

5. Weighting method. We generate points as in the “obvious”, but wrong,


method:

(a) Choose xi = R[a, b].


(b) Choose yi = R[a, xi ].
(c) But we make a weighted sum, the weight correcting for the unequal density of points (density $\sim \frac{1}{x-a}$):
\[ I = \frac{b-a}{n}\sum_{i=1}^n (x_i - a)\, g(x_i, y_i) \tag{6.5} \]

The derivation of this formula is left as an exercise (27).


This method is, in fact, an application of the technique of importance sampling
(cf. section 6.2.6). It may, or may not, be more efficient than folding, depending
on the function g. In particular, it will be more efficient when the variance of
(x − a)g is smaller than that of g.
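The following sketch compares the rejection method (no. 2) with the folding method (no. 4) for the triangular region (Python with NumPy; the test function g(x, y) = xy and the limits a = 0, b = 1 are our own choices, with exact integral 1/8):

    import numpy as np

    def g(x, y):
        return x * y                    # test integrand

    rng = np.random.default_rng(5)
    a, b, n = 0.0, 1.0, 100000

    # rejection method (no. 2): points outside y < x contribute zero
    x = rng.uniform(a, b, n)
    y = rng.uniform(a, b, n)
    z = np.where(y < x, g(x, y), 0.0)
    print("rejection:", (b - a)**2 * z.mean())

    # folding method (no. 4): fold the square about the diagonal
    u = rng.uniform(a, b, n)
    v = rng.uniform(a, b, n)
    xf, yf = np.maximum(u, v), np.minimum(u, v)
    print("folding:  ", (b - a)**2 * g(xf, yf).mean() / 2.0)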

6.2.6 Variance reducing techniques


As we have seen, Monte Carlo integration converges rather slowly with n compared
to the better numerical techniques. There are, however, several methods of reducing
the variance of the Monte Carlo estimation:

Stratification
In this approach we split the region of integration into two or more subregions.
Then the integral is just the sum of the integrals over the subregions, e.g., for two
subregions,
\[ I = \int_a^b y(x)\, dx = \int_a^c y(x)\, dx + \int_c^b y(x)\, dx \]
The variance of I is just the sum of the variances of the subregions. A good choice
of subregions and number of points in each region can result in a dramatic decrease
in V [I]. However, to make a good choice requires knowledge of the function. A
poor choice can increase the variance.
Some improvement can always be achieved by simply splitting the region into
subregions of equal size and generating the same number of points for each subre-
gion. We illustrate this, using crude Monte Carlo, for the case of two subregions:
For the entire region the variance is (from equation 6.2)
\[ V_1(I) = \frac{(b-a)^2}{n}\,\sigma_y^2 = \frac{(b-a)^2}{n}\left[ \frac{1}{b-a}\int_a^b y^2\, dx - \left( \frac{1}{b-a}\int_a^b y\, dx \right)^2 \right] \]
For two equal regions, the variance is the sum of the variances of the two regions:
\begin{eqnarray*} V_2(I) &=& \frac{[(b-a)/2]^2}{n/2}\left\{ \left[ \frac{2}{b-a}\int_a^c y^2\, dx - \left( \frac{2}{b-a}\int_a^c y\, dx \right)^2 \right] + \left[ \frac{2}{b-a}\int_c^b y^2\, dx - \left( \frac{2}{b-a}\int_c^b y\, dx \right)^2 \right] \right\} \\ &=& \frac{(b-a)^2}{2n}\left\{ \frac{2}{b-a}\int_a^b y^2\, dx - \frac{4}{(b-a)^2}\left[ \left( \int_a^c y\, dx \right)^2 + \left( \int_c^b y\, dx \right)^2 \right] \right\} \end{eqnarray*}

The improvement in variance is given by
\[ V_1(I) - V_2(I) = -\frac{1}{n}\left( \int_a^b y\, dx \right)^2 + \frac{2}{n}\left[ \left( \int_a^c y\, dx \right)^2 + \left( \int_c^b y\, dx \right)^2 \right] \]
Substituting $A = \int_a^c y\, dx$ and $B = \int_c^b y\, dx$,
\[ V_1(I) - V_2(I) = \frac{1}{n}\left[ -(A+B)^2 + 2\left( A^2 + B^2 \right) \right] = \frac{1}{n}(A-B)^2 \ge 0 \]

Thus some improvement in the variance is attained, although it may be arbitrarily


small. This improvement can be qualitatively understood as due to an increased
uniformity of the distribution of points.

Importance Sampling
We have seen that (in crude Monte Carlo) the variance of the estimate of the integral
is proportional to the variance of the function being integrated (eq. 6.2). Thus the
less variation in y, i.e., the more constant y(x) is, the more accurate the integral.
We can effectively achieve this by generating more points in regions of large y and
compensating for the higher density of points by reducing the value of y (i.e., giving
a smaller weight) accordingly. This was also the motivation for stratification.
In importance sampling we change variable in order to have an integral with
smaller variance:
\[ I = \int_a^b y(x)\, dx = \int_a^b \frac{y(x)}{g(x)}\, g(x)\, dx = \int_{G(a)}^{G(b)} \frac{y(x)}{g(x)}\, dG(x) \]
where
\[ G(x) = \int_a^x g(x)\, dx \]

Thus we must find a function g(x) such that


• g(x) is a p.d.f., i.e., everywhere positive and normalized such that G(b) = 1.

• G(x) is known analytically.

• Either G(x) can be inverted (solved for x) or a random number generator is


available which generates points (x) according to g(x).

• The ratio y(x)/g(x) is as nearly constant as possible and in any case more
constant than y(x), i.e., σy/g < σy .

We then choose values of G randomly between 0 and 1; for each G solve for x; and
sum y(x)/g(x). The weighting method of section 6.2.5 was really an application of
importance sampling.
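A sketch of importance sampling for a test integral of our own choosing, $I = \int_0^1 e^x\, dx = e - 1$, using $g(x) = 2(1+x)/3$, whose integral G(x) can be inverted analytically (Python with NumPy):

    import numpy as np

    rng = np.random.default_rng(6)
    n = 100000

    def y(x):
        return np.exp(x)                 # integrand; exact integral on (0,1) is e - 1

    def g(x):
        return 2.0 * (1.0 + x) / 3.0     # normalized weight function on (0,1)

    # crude Monte Carlo for comparison
    x = rng.random(n)
    print("crude:     ", y(x).mean(), "+-", y(x).std(ddof=1) / np.sqrt(n))

    # importance sampling: generate x according to g by inverting
    # G(x) = (2/3)(x + x**2/2), i.e. x = sqrt(1 + 3u) - 1 for u uniform on (0,1)
    u = rng.random(n)
    xg = np.sqrt(1.0 + 3.0 * u) - 1.0
    w = y(xg) / g(xg)
    print("importance:", w.mean(), "+-", w.std(ddof=1) / np.sqrt(n))

The second uncertainty should come out smaller than the first, reflecting the smaller variance of y/g compared to y.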
Although importance sampling is a useful technique, it suffers in practice from
a number of drawbacks:

• The class of functions g which are integrable and for which the integral can
be inverted analytically is small—essentially only the trigonometric functions,
exponentials, and polynomials. The inversion could in principle be done nu-
merically, but this introduces inaccuracies which may be larger than the gain
made in reducing the variance.

• It is very difficult in more than one dimension. In practice one usually uses a
g which is a product of one-dimensional functions.

• It can be unstable. If g becomes small in a region, y/g becomes very big and
hence the variance also. It is therefore dangerous to use a function g which is
0 in some region or which approaches 0 rapidly.

• Clearly y(x) must be rather well known in order to choose a good function g.

On the other hand, an advantage of this method is that singularities in y(x) can be
removed by choosing a g(x) having the same singularities.

Control Variates
This is similar to importance sampling except that instead of dividing by g(x), we
subtract it:
\[ I = \int y(x)\, dx = \int [y(x) - g(x)]\, dx + \int g(x)\, dx \]
Here, $\int g(x)\, dx$ must be known, and g is chosen such that $y - g$ has a smaller
variance than y. This method does not risk the instability of importance sampling.
Nor is it necessary to invert the integral of g(x).

Antithetic Variates
So far, we have always used Monte Carlo points which are independent. Here we
deliberately introduce a correlation. Recall that the variance of the sum of two
functions is

V [y1 (x) + y2 (x)] = V [y1 (x)] + V [y2 (x)] + 2 cov[y1 (x), y2 (x)]

Thus, if we can write


\[ I = \int_a^b y\, dx = \int_a^b (y_1 + y_2)\, dx \]

such that y1 and y2 have a large negative correlation, we can reduce the variance of
I. Clearly, we must understand the function y(x) in order to do this. It is difficult
to give general methods, but we will illustrate it with an example:
Suppose that we know that y(x) is a monotonically increasing function of x.
Then let $y_1 = \frac{1}{2} y(x)$ and $y_2 = \frac{1}{2} y\bigl(b - (x - a)\bigr)$. Clearly the integral of $(y_1 + y_2)$ is
just the integral of y. However, since y is monotonically increasing, y1 and y2 are
negatively correlated; when y1 is small, y2 is large and vice versa. If this negative
correlation is large enough, V [y1 + y2 ] < V [y].
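A sketch of the antithetic-variate trick for a monotonic test integrand of our own choosing, $e^x$ on (0, 1) (Python with NumPy):

    import numpy as np

    rng = np.random.default_rng(7)
    n = 100000
    a, b = 0.0, 1.0

    x = rng.uniform(a, b, n)
    plain = np.exp(x)                                   # ordinary crude Monte Carlo
    anti = 0.5 * (np.exp(x) + np.exp(b - (x - a)))      # y1 + y2, negatively correlated

    print("crude:     ", plain.mean(), "+-", plain.std(ddof=1) / np.sqrt(n))
    print("antithetic:", anti.mean(), "+-", anti.std(ddof=1) / np.sqrt(n))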

6.3 Monte Carlo simulation


References for this section are James30 and Lyons.8 For further details and additional
topics consult Rubinstein.33
Monte Carlo problems are usually classified as either integration or simulation.
We shall be concerned with simulating experiments in physics. This begins with
a theory or hypothesis about the physical process, i.e., with the assumption of
an underlying p.d.f., g(x′ ), which may then be modified by the response function,
r(x, x′ ), of the experimental apparatus. The expected p.d.f. of the observations is
then given by
\[ f(x) = \int g(x')\, r(x, x')\, dx' \]
The purpose of the simulation is to produce a set of n simulated or ‘fake’ data
points distributed according to f (x). These can be compared with the real data to
test the hypothesis. They can also be used in the planning stage of the experiment
to help in its design, e.g., to compare the use of different apparatus, and to test
software to be used in the analysis of the experiment.
Since these fake points are distributed according to f (x), they are in fact just
the points generated for the Monte Carlo integration of $\int f(x)\, dx$. Simulation is
thus, formally at least, equivalent to integration. The purpose is, however, usually
different. This means that often a different Monte Carlo method will be preferred
for simulation than for integration.
Although we will continue to use the term p.d.f. for f (x), for the purposes
of simulation the normalization is unimportant (at least if we are careful). It is,
however, essential that the function not be negative.
The p.d.f. that we wish to simulate, f (x), can be extremely complicated. The
underlying physical p.d.f., g(x′ ), may itself involve integrals which will be evaluated
by Monte Carlo in the course of the simulation, and the detector description may
consist of various stages, each depending on the previous one.
Monte Carlo simulation of such complex processes breaks them down into a
series of steps. At each step a particular outcome is chosen from a set of possi-
ble outcomes according to a given p.d.f., f (x). In other words, the outcome of
the step is a (pseudo-)random number generated according to f (x). But random
number generators generally produce uniformly distributed numbers. We therefore

must transform the uniformly distributed random numbers into random numbers
distributed according to the desired p.d.f. There are three basic methods to do this:

6.3.1 Weighted events


This method is analogous to that of crude Monte Carlo for integration. For a p.d.f.,
f (x), defined on the interval (a, b), points are generated uniformly in x and given
a weight, w. An event then consists of the values xi and wi = f (xi )(b − a). The
integral of f (x) over any subinterval of (a, b), e.g., (c, d) with a ≤ c < d ≤ b, is then
given by the sum of the weights of the events in that interval:
\[ \int_c^d f(x)\, dx = \frac{1}{n}\sum_{c < x_i < d} w_i \]
In particular, a weighted histogram of the xi (c and d are then the various bin
limits), represents the p.d.f. and can be directly compared with the data.
We have seen that integration by crude Monte Carlo gives a smaller variance
than the hit-or-miss method, and is therefore generally preferable. However in sim-
ulation it is usually deemed preferable not to have weighted events. One prefers to
have the Monte Carlo events as much as possible like the real events. In particular,
it is usually desirable that the Monte Carlo sample behave statistically like the real
event sample, e.g., the variance of the average of n Monte Carlo points should result
in the same variance as that of the average of n real points. This is not the case
with weighted events. The density of Monte Carlo points where f (x) is small is
the same as where f (x) is large, whereas in the real data the density of points is
proportional to f (x).

6.3.2 Rejection method


This method is analogous to the hit-or-miss method of Monte Carlo integration. As in hit-or-miss Monte Carlo, we generate points uniformly in x and in f(x):
\[ x_i = R[a, b] \]
\[ r_i = R[0, f_{\max}] \]
where $f_{\max}$ is the maximum value of f(x) in (a, b). Points for which $f(x_i) < r_i$ are then rejected.
[figure: f(x) on (a, b) below the bounding value $f_{\max}$, with a subinterval (c, d) indicated]
The integral over a subinterval (c, d) is then
\[ \int_c^d f(x)\, dx = \frac{n_{c<x<d}}{n}\,(b-a)\, f_{\max} \]

In hit-or-miss Monte Carlo we also introduced an $f_{\min}$. Since we knew the integral $\int_a^b f_{\min}\, dx$, it was not necessary to evaluate it by Monte Carlo. It was therefore better (more efficient) to use all the Monte Carlo points to evaluate $\int_a^b (f - f_{\min})\, dx$. But here we want to generate all the events for f, not just for $(f - f_{\min})$.
The difficulty with this method lies in knowing fmax . If we do not know it, then
we must guess a ‘safe’ value, i.e., a value which we are sure is larger than fmax . If
we choose fmax too safe, the method becomes inefficient. This method can be made
more efficient by choosing different values of fmax in different regions.
This method is the easiest method to use for complicated functions in many
dimensions.

6.3.3 Inverse transformation method


Continuous p.d.f.
This is like importance sampling with g(x) = f (x). The resulting integrand is just
the uniform distribution. We transform from f (x) to F :

f (x) dx = dF

where F (x) is just the c.d.f. of f (x),


\[ F(x) = \int_a^x f(x)\, dx \]

Instead of generating points uniform in x, we generate points uniformly distributed


in F between F (a) and F (b), which are 0 and 1, respectively, if f (x) is a p.d.f.
normalized on (a, b):
\[ u_i = R[F(a), F(b)] \]
and calculate the corresponding value of x,
\[ x_i = F^{-1}(u_i) \]
[figure: the c.d.f. F(x), rising from 0 at x = a to 1; a uniform value u is mapped onto $x = F^{-1}(u)$]

The xi are then distributed as f (x). To see this, recall the results on changing vari-
ables (sect. 2.2.6): For a transformation u → x = v(u) with inverse transformation
u = w(x), the p.d.f. for x is given by the p.d.f. for u, g(u), times the Jacobian, i.e.,

\[ \text{p.d.f. for } x = g(u)\left| \frac{\partial u}{\partial x} \right| = g\bigl(w(x)\bigr)\left| \frac{\partial w(x)}{\partial x} \right| \]

Here, $u = F(x)$, $x = F^{-1}(u)$ and u is distributed uniformly, i.e., $g(u) = 1$. The p.d.f. for x is then
\[ \left| \frac{\partial u}{\partial x} \right| = \frac{\partial F(x)}{\partial x} = f(x) \]

Hence, if g(u) is a uniform distribution, the p.d.f. for x is f (x), as desired.


The difficulties with this method are integrating f (x) to obtain F (x) and in-
verting F (x) to obtain F −1 (u). But if this can be done, this is usually the best
method.

If F is not one-to-one, we define
\[ x = F^{-1}_{\min}(u) = \min\bigl( x \text{ for which } F(x) \ge u \bigr) \]
[figure: a c.d.f. F(x) with a flat section; the uniform value u is mapped onto $x = F^{-1}_{\min}(u)$]
Discrete p.d.f.
For a discrete p.d.f., we can always use this method, since the c.d.f. is always easily
calculated. The probability of X = xk is P (X = xk ) = f (xk ). Then the c.d.f. is
\[ F_k = P(X \le x_k) = \sum_{i=1}^k f(x_i) \]

Taking u uniformly distributed between 0 and 1,


\[ P(F_{k-1} < u \le F_k) = \int_{F_{k-1}}^{F_k} du = F_k - F_{k-1} = f(x_k) \]
Thus, to generate a point, we
1. generate $u_i = R[0, 1]$
2. find the value of k such that $F_{k-1} < u_i \le F_k$
[figure: the staircase c.d.f. F(x), with levels $F_{k-1}$ and $F_k$ reached at $x_{k-1}$ and $x_k$]
Then $x_k$ is the desired value of x.
Step 2 of this procedure can involve a lot of steps. You can usually save computer
time by starting the comparison somewhere in the middle of the x-range, say at the
mean or mode, and then working up or down in x depending on u and Fk .
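A sketch of the discrete inverse transformation (Python with NumPy; the example values and probabilities are our own):

    import numpy as np

    rng = np.random.default_rng(8)
    xk = np.array([0, 1, 2, 3])              # possible values x_k
    fk = np.array([0.1, 0.2, 0.4, 0.3])      # their probabilities f(x_k)
    Fk = np.cumsum(fk)                       # the discrete c.d.f. F_k

    u = rng.random(100000)
    # for each u, find the smallest k with F_{k-1} < u <= F_k
    k = np.searchsorted(Fk, u)
    sample = xk[k]
    print(np.bincount(sample) / sample.size)   # should reproduce fk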

This is of interest not only for situations with a discrete p.d.f., but also for
cases where the p.d.f. is continuous, but not known analytically. The resolution
function of an apparatus is often determined experimentally and the resulting p.d.f.
expressed as a histogram.

6.3.4 Composite method


It may be advantageous to decompose the desired p.d.f. into a sum of p.d.f.’s which
are easier to generate:
\[ f(x) = \sum_k f_k(x) \tag{6.6} \]
Let
\[ \alpha_k = \frac{\int_a^b f_k(x)\, dx}{\sum_j \int_a^b f_j(x)\, dx} \tag{6.7} \]
Then $\sum \alpha_i = 1$, and $\alpha_k$ is the fraction of the points to be generated according to $f_k$.
In generating the points, we regard the index k as a discrete r.v. with probability
αk . We first generate u = R[0, 1] and use it to select k. Then we generate a value
xi according to fk (x) using one of the previous methods.
You might ask why not skip the first step and just generate exactly αk n points
according to fk for each k, where n is the total number of points. This was a method
(stratification) to improve the variance in Monte Carlo integration. The answer is
that the variance of the Monte Carlo sample would then be different from that of
a sample of n real events, while the purpose of simulation is usually to obtain a
Monte Carlo sample having the same statistical properties as real events.

6.3.5 Example
As an example of the above methods, we take the p.d.f.

\[ f(x) = 1 + x^2 \]

in the region (−1, 1). This could be an angular distribution with x = cos θ. We
note that f (x) is not normalized. We could, of course, normalize it, but choose not
to do so. For as we shall see, for the purpose of generating events the normalization
is not necessary.

Weighted events
This is completely trivial. We generate $x_i = R[-1, 1]$ and assign weight $w_i = 1 + x_i^2$.

Rejection method
We have $f_{\max} = 2$, $a = -1$, $b = +1$. Therefore, we generate
\[ x_i = R[-1, +1] = 2R[0, 1] - 1 \]
\[ r_i = R[0, 2] = 2R[0, 1] \]
and reject the point if $r_i > 1 + x_i^2$.
[figure: $f(x) = 1 + x^2$ on $(-1, +1)$ inside the bounding rectangle of height $f_{\max} = 2$]
Note that the efficiency of the generation is $\frac{\int_{-1}^{1}(1+x^2)\, dx}{(b-a)\, f_{\max}} = \frac{2}{3}$, i.e., 1/3 of the points are rejected.

Inverse transformation method


We have
\[ F(x) = \int_{-1}^x (1 + x^2)\, dx = \left[ x + \frac{x^3}{3} \right]_{-1}^x = x + \frac{x^3}{3} + \frac{4}{3} \]
Hence $F(-1) = 0$ and $F(1) = 8/3$. Therefore generate $u = R[0, 1]$. Then $\frac{8}{3}u$ is uniformly distributed on $[F(-1), F(+1)]$. The corresponding value of x is the solution of
\[ \frac{8}{3} u = F(x) = x + \frac{x^3}{3} + \frac{4}{3} \]
The solution is
\[ x_i = A + B, \qquad A = (4u - 2 + s)^{1/3}, \quad B = (4u - 2 - s)^{1/3}, \quad s = \sqrt{1 + 4(1 - 2u)^2} \]
Note that this requires calculating one square root and two cube roots per point.

Composite method
We write f (x) as the sum of simpler functions. In this case an obvious choice is
\[ f(x) = f_a(x) + f_b(x) \qquad \text{with} \qquad f_a(x) = 1 \quad \text{and} \quad f_b(x) = x^2 \]
The integrals of these functions are
\[ A_a = \int_{-1}^{+1} f_a(x)\, dx = 2 \qquad \text{and} \qquad A_b = \int_{-1}^{+1} f_b(x)\, dx = \left[ \frac{x^3}{3} \right]_{-1}^{+1} = \frac{2}{3} \]
Hence we want to generate from $f_a$ with probability $\frac{2}{2 + \frac{2}{3}} = \frac{3}{4}$ and from $f_b$ with probability $\frac{1}{4}$.
The first step is therefore to generate $v = R[0, 1]$.
• If $v \le \frac{3}{4}$ we generate from $f_a$:
\[ u_i = R[0, 1], \qquad x_i = 2u_i - 1 \]

• If $v > \frac{3}{4}$ we generate from $f_b$:
1. either by the rejection method:
\[ x_i = 2R[0, 1] - 1, \qquad r_i = R[0, f_{b\,\max}] = R[0, 1] \]
repeating until we find a point for which $r_i \le x_i^2$.
Note that the efficiency is $\frac{\int_{-1}^{1} x^2\, dx}{(b-a)\, f_{b\,\max}} = \frac{1}{3}$ for the points generated here (1/4 of the points). But it was 1 for the points distributed according to $f_a$. The net efficiency is thus $\frac{1}{3}\cdot\frac{1}{4} + 1\cdot\frac{3}{4} = \frac{5}{6}$, a small improvement over the 2/3 of the simple rejection method.
2. or by the inverse transformation method:
\[ F_b(x) = \int_{-1}^x x^2\, dx = \left[ \frac{x^3}{3} \right]_{-1}^x = \frac{x^3}{3} + \frac{1}{3}, \qquad F_b(-1) = 0, \quad F_b(1) = \frac{2}{3} \]
We generate $u_i = R[0, 1]$. Then $x_i$ is the solution of
\[ \frac{2}{3} u_i = \frac{x_i^3}{3} + \frac{1}{3} \]
Hence, $x_i = (2u_i - 1)^{1/3}$.
Note that we only have to calculate one cube root; and that only for 1/4
of the events. This is ∼12 times faster than the simple inverse trans-
formation method (assuming that square and cube roots take about the
same time).
In this example, the composite rejection method turned out to be the fastest
with the simple rejection method only slightly slower. The composite inverse trans-
formation method was much faster than the simple inverse transformation method,
but still much slower than the rejection method. These results should not be re-
garded as general. Which method is faster depends on the function f .
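A sketch of two of the generators for $f(x) = 1 + x^2$ discussed above, vectorized rather than event-by-event, which is our own choice (Python with NumPy):

    import numpy as np

    rng = np.random.default_rng(9)
    n = 100000

    # simple rejection method: f_max = 2 on (-1, +1)
    x = rng.uniform(-1.0, 1.0, n)
    r = rng.uniform(0.0, 2.0, n)
    sample_rej = x[r <= 1.0 + x**2]            # about 2/3 of the points survive

    # simple inverse transformation method: solve (8/3)u = x + x**3/3 + 4/3
    u = rng.random(n)
    s = np.sqrt(1.0 + 4.0 * (1.0 - 2.0 * u)**2)
    sample_inv = np.cbrt(4.0 * u - 2.0 + s) + np.cbrt(4.0 * u - 2.0 - s)

    print(sample_rej.mean(), sample_inv.mean())   # both should be near 0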

6.3.6 Gaussian generator


The Gaussian distribution is one of the most important in physics and statistics.
Many methods have been proposed to generate normally distributed points.

Using the Central Limit Theorem


By the C.L.T., the average of a large number of r.v.’s distributed according to almost
any p.d.f. will be normally distributed. In particular, for n r.v.’s, ui , distributed
uniformly between 0 and 1, the quantity, g,
\[ g = \frac{\sum_{i=1}^n u_i - \frac{n}{2}}{\sqrt{n/12}} \]

is approximately distributed as N(g; 0, 1) for large n. Proof of this is left to the


reader (exercise 26).
While simple to program, this generator is not particularly fast and has the
feature that the tails are truncated at ±nσ, which is usually undesirable. If the
absence of long tails is tolerable, this method is usually satisfactory for as few as
n = 12, where g reduces to
\[ g_+ = \sum_{i=1}^{12} u_i - 6 \]

Another disadvantage of this method is that it puts severe requirements on the cor-
relations between successive points of the random number generator, in particular
on correlations within groups of n successive values of ui .
A word of caution is perhaps appropriate for clever students who have undoubt-
edly noticed that instead of summing 12 ui and subtracting 6, we could have used

\[ g_- = \sum_{i=1}^{6} u_i - \sum_{i=7}^{12} u_i \]

So far, so good. But if you try to save computer time by generating both g+ and
g− with the same 12 values of ui , you are in trouble: g+ and g− are then highly
correlated.

A transformation method

Since the Gaussian p.d.f. cannot be integrated in terms of the usually available func-
tions, it is not straightforward to find a transformation from uniformly to Gaussian
distributed variables. There is, however, a clever method, which we give without
proof, to transform two independent variables, u1 and u2 , uniformly distributed on
(0,1) to two independent variables, g1 and g2 , which are normally distributed with
µ = 0 and σ 2 = 1:
\[ g_1 = \cos(2\pi u_2)\sqrt{-2\ln u_1} \]
\[ g_2 = \sin(2\pi u_2)\sqrt{-2\ln u_1} \]

This method is exact, but its speed can be improved upon by effectively gener-
ating the sine and cosine by a rejection method:

1. Generate uniform random numbers u1 and u2 on (0,1)

2. Calculate $r^2 = (2u_1 - 1)^2 + (2u_2 - 1)^2$.

3. If $r^2 > 1$, then reject $u_1$ and $u_2$ and go back to step 1.



4. Otherwise,
\[ g_1 = (2u_1 - 1)\sqrt{\frac{-2\ln r^2}{r^2}} \]
\[ g_2 = (2u_2 - 1)\sqrt{\frac{-2\ln r^2}{r^2}} \]

This saves the time of evaluating a sine and a cosine at the slight expense of rejecting
about 21% of the uniformly generated points.
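A sketch of this polar form of the Gaussian generator (Python, scalar version for clarity; the guard against an exact r² = 0 is our own, practically irrelevant, addition):

    import math
    import random

    def gauss_pair():
        # polar method: returns two independent N(0,1) random variables
        while True:
            v1 = 2.0 * random.random() - 1.0
            v2 = 2.0 * random.random() - 1.0
            r2 = v1 * v1 + v2 * v2
            if 0.0 < r2 <= 1.0:                  # rejects about 21% of the pairs
                factor = math.sqrt(-2.0 * math.log(r2) / r2)
                return v1 * factor, v2 * factor

    sample = [g for _ in range(5000) for g in gauss_pair()]
    mean = sum(sample) / len(sample)
    var = sum((g - mean)**2 for g in sample) / (len(sample) - 1)
    print(mean, var)                             # should be near 0 and 1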
Part III

Statistics

Chapter 7

Statistics—What is it/are they?

So far, we have considered probability theory. Once we have decided which p.d.f. is
appropriate to the problem, we can make direct calculations of the probability of any
set of outcomes. Apart from possible uncertainty about which p.d.f. is appropriate,
this is a straight-forward and mathematically well defined procedure.
The problem we now address is the inverse of this. We have a set of data
which have been sampled from some parent p.d.f. We wish to infer from the data
something about the parent p.d.f. Note that here we are assuming that the data
are independent, i.e., that the value of a particular datum does not depend on
the values of other data, and that all of the data sample the same p.d.f. The
statistician speaks of a sample of independent identically distributed iid random
variables. Usually this will be the case, and some of our methods will depend on
this.
The study of calculations using probability is sometimes called direct probability.
Statistical inference is sometimes called inverse probability, particularly in the case
of Bayesian methods.
We may think we know what the p.d.f. is apart from one or more parameters,
e.g., we think it is a Gaussian but want to determine its mean and standard devia-
tion. This is called parameter estimation. It is also called fitting since we want to
determine the value of the parameter such that the p.d.f. best ‘fits’ the data.
On the other hand, we may think we know the p.d.f. and want to know whether
we are right. This is called hypothesis testing. Usually both parameter estimation
and hypothesis testing are involved, since it makes little sense to try to determine the
parameters of an incorrect p.d.f. And frequently an hypothesis to be tested involves
some unknown parameter. Nevertheless, we will first treat these as separate topics.
A third topic is decision theory or classification.
For all of these topics we shall use statistical methods (or “statistics”), so-called
because they, statistical methods, make (it, statistics, makes) use of one or more
statistics.∗ A definition: A statistic is any function of the observations in a sample,



which does not depend on any of the unknown characteristics of the population
(parent p.d.f.). An example of a statistic is the sample mean, $\bar{x} = \sum x_i / n$. Each
observation, xi , is, in fact, itself a statistic. In other words, if you can calculate it
from the data plus known quantities, it is a statistic. “Statistics” is the branch of
applied mathematics which deals with statistics as just defined. Whether the word
statistics is singular or plural, thus depends on which meaning you intend.
We have seen in section 2.4 that there are two common interpretations of prob-
ability, which we have called frequentist and Bayesian. They give rise to two ap-
proaches to statistical inference, usually called classical or frequentist statistics (or
inference) and Bayesian inference. The word classical is something of a misnomer,
since the Bayesian interpretation is older (Bayes, Laplace). However, in the second
half of the 19th century science became more quantitative and objective, even in
such fields as biology (Darwin, evolution, heredity, Galton). This gave rise to the
frequentist interpretation and the development of frequentist statistics. By about
1935 frequentist statistics, which came to be known as classical statistics, had com-
pletely replaced Bayesian thinking. Since around 1960, however, Bayesian inference
has been making a comeback.
Probably most physicists would profess to being frequentists, and reflecting this,
as well as my own personal bias, the emphasis in the rest of this course will be on
classical statistics. However, there are situations where classical statistics is very
difficult, or even impossible, to use and where Bayesian statistics is comparatively
simple to apply. So, intermixed with classical statistics you will find some Bayesian
methods. This is rather unconventional; most books are firmly in one of the two
camps, and discussions between frequentists and Bayesians often take on aspects of
holy war. It also runs the risk of confusing the student—it is important to know
which you are doing.

To understand God’s thoughts we must study statistics,


for these are the measure of His purpose.
—Florence Nightingale

∗ It is perhaps interesting to note that the stat in statistics is the same as in state. Statists (advocates of statism, economic control and planning by a highly centralized state) collected data to better enable the state to run the economy. Such data, and quantities calculated from them, came to be called statistics.
Chapter 8

Parameter estimation

8.1 Introduction
In everyday speech, “estimation” means a rough and imprecise procedure leading
to a rough and imprecise result. You estimate when you cannot measure exactly.
This last sentence is also true in statistics, but only because you can never measure
anything exactly; there is always some uncertainty. In statistics, estimation is a
precise procedure leading to a result which may be imprecise, but the extent of the
precision is, in principle, known. Estimation in statistics has nothing to do with
approximation.
The goal of parameter estimation is then to make some sort of statement like
θ = a ± b where a is, on the basis of the data, the ‘best’ (in some sense) value
of the parameter θ and where it is ‘highly probable’ that the true value of θ lies
somewhere between a − b and a + b. We often call b the estimated error on a. If
we make a plot, this is represented by a point at θ = a with a bar running through
it from a − b to a + b, the ‘error bar’. It is usually assumed that the estimate of
θ is normally distributed, i.e., that the values of a obtained from many identical
experiments would form a normal distribution centered about the true value of θ
with standard deviation equal to b. The meaning of θ = a ± b is then that a is,
in some sense (to be discussed more fully later), the most likely value of θ and that in any case there is, again in some sense, a $\int_{a-b}^{a+b} N(x; a, b^2)\, dx = 68.3\%$ chance that the true value of θ lies in the interval $(a-b, a+b)$.∗ This is a special case of
a 68.3% ‘confidence interval’ (cf. chapter 9), i.e., an interval within which we are
68.3% confident that the true value lies. We shall see that error bars, or confidence
intervals may be difficult to estimate. Just as our estimate of θ has an ‘error’, so
too does our estimate of this ‘error’.
Suppose now that we have a set of numbers xi which are the result of our
experiment. This could, e.g., be n measurements of some quantity. Let θ be the


∗ Note that this is different from what an engineer usually means by a ± b, namely that b is the
tolerance on a, i.e., that the true value is guaranteed to be within (a − b, a + b).


true value of that quantity. The xi are clustered about θ in some way that depends
on the measuring process. We often assume that they are distributed normally
about the true value with a width given by the accuracy of the measurement.
It is worth noting the distinction many authors, e.g., Bevington10 , make be-
tween the words accuracy and precision, which in normal usage are synonymous.
Accuracy refers to how close a result is to the true value, whereas precision refers
to how reproducible the measurements are. Thus, a poorly calibrated apparatus
may result in measurements of high precision but poor accuracy. Other authors,
e.g., Eadie et al.,4 and James5 prefer to avoid these terms altogether since neither
term is well defined, and to speak only of the variance of the estimates.
Similarly, a distinction is sometimes∗ made between error, the difference be-
tween the estimate and the true value, and the uncertainty, the square root of the
variance of the estimate. Thus accurate means small error and precise means small
uncertainty. However, the use of the word ‘error’ to mean uncertainty is deeply
ingrained, and we (like most books) will not make the distinction. Note that with
the above distinction, the accuracy and the error are usually unknown, since the
true value is usually unknown.
So, we wish to estimate θ. To do this we need an estimator which is a function
of the measurements.
As stated in chapter 7, a statistic is, by definition, any function of the obser-
vations in a sample, φ(xi ), which does not depend on any of the unknown charac-
teristics of the population (parent p.d.f.). An example of a statistic is the sample
mean, $\bar{x} = \sum x_i / n$. In other words, if you can calculate it from the data plus known
quantities, it is a statistic.
Since a statistic is calculated from random variables, it is itself a r.v., but a
r.v. whose value depends on the particular sample, or set of data. Like all random
variables, it is distributed according to some p.d.f. Since the value of the statistic
depends on the sample, its p.d.f. is sometimes referred to as the sampling distri-
bution or sampling p.d.f. in order to distinguish it from the population or parent
p.d.f.
An estimator is (definition) a statistic, the value of which we will give as our
determination of some constant, θ, which is a property of the parent population
or parent p.d.f. We will generally denote an estimator of a variable by adding a
circumflex (ˆ) to the symbol of the variable. Thus θ̂ is an estimator of θ.
There are in general numerous estimators that one can construct for any θ. Here
are several estimators of the mean, µ, of the parent p.d.f., assuming n measurements,
xi :

1. $\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$, the sample mean. This is probably the most often used estimator of the mean, but it can be sensitive to mismeasured data.


∗ This is recommended by the International Standards Organization34.

2. $\hat{\mu} = \frac{1}{10}\sum_{i=1}^{10} x_i$, the sample mean of the first 10 points, ignoring the rest.

3. $\hat{\mu} = \frac{1}{n-1}\sum_{i=1}^n x_i$, i.e., $n/(n-1)$ times the sample mean.

4. $\hat{\mu} = 5$, i.e., throw away all the data and give the estimate as 5.

5. $\hat{\mu} = \sqrt[n]{\prod_{i=1}^n x_i}$

6. Make a histogram of the $x_i$ and take $\hat{\mu}$ as the midpoint of the bin containing the most events, i.e., a sort of sample mode. Note that the value will depend on the bin size.

7. $\hat{\mu} = [\min(x_i) + \max(x_i)]/2$, the midrange, i.e., the average of the smallest and the largest $x_i$. This is very sensitive to the tails of the distribution but may be the best estimator if the p.d.f. is nearly uniform.

8. $\hat{\mu} = \frac{2}{n}\sum_{i=1}^{n/2} x_{2i}$, the sample mean of the even numbered points, ignoring the odd numbered points.

9. $\hat{\mu} = \bar{\mu}_{\rm trimmed}$: discard the smallest and largest y% (e.g., 10%) of the data and then average. This is relatively insensitive to the tails of the distribution, but has a larger variance than the sample mean if there are no problems in the tails.

10. $\hat{\mu} = $ the sample median. This is less sensitive to statistical fluctuations in the tails, but it has a larger variance than the sample mean if the p.d.f. is a Gaussian.

Each of these is, by our definition, an estimator. Yet some are certainly better
than others. However, which is ‘best’ depends on the p.d.f. Which is ‘best’ may
also depend on the use we want to make of it. How do we choose which estimator
to use? In general we shall prefer an estimator which is ‘unbiased’, ‘consistent’, and
‘efficient’. We will discuss these and other properties of estimators in the following
section. In succeeding sections we will treat three general methods of constructing,
or choosing, estimators.

Nothing is easier than to invent


methods of estimation.
—R. A. Fisher

8.2 Properties of estimators

8.2.1 Bias
Since a statistic is a function of r.v.'s, it is itself a r.v. Therefore, it is distributed according to some p.d.f., and we can speak of its expectation value, $E[\hat\theta]$. For an estimator, making use of n observations, the bias $b_n$ is defined as the difference between the expectation of the estimator and the true value of the parameter:
\[ b_n(\hat\theta) = E[\hat\theta - \theta] = E[\hat\theta] - \theta \tag{8.1} \]
An estimator is unbiased if, for all n and θ, $b_n(\hat\theta) = 0$, i.e., if $E[\hat\theta] = \theta$. We
include n in this definition since we shall see that some estimators are unbiased
only asymptotically, i.e., only for n → ∞.

Mean In general, the sample mean, no. 1 in our list above, is an unbiased esti-
mator of the parent (true) mean:
 
\[ E[\hat\mu] = E[\bar{x}] = E\left[ \frac{1}{n}\sum x_i \right] = \frac{1}{n}\sum E[x_i] = \frac{1}{n}\, n E[x] = E[x] = \mu \tag{8.2} \]
On the other hand, the third estimator in our list is biased:
 
\[ E[\hat\mu] = E\left[ \frac{1}{n-1}\sum x_i \right] = \frac{n}{n-1}\,\mu \]
although the bias,
\[ b_n(\hat\mu) = \frac{n}{n-1}\,\mu - \mu = \frac{\mu}{n-1} \to 0, \quad \text{for large } n. \]
This estimator is thus asymptotically unbiased.
If we know the bias, we can construct a new estimator by correcting the old
one for its bias. For example, from no. 3 and its bias we construct no. 1 simply by
multiplying no. 3 by (n − 1)/n.
Lack of bias is a reason to prefer no. 1 to no. 3. However, nos. 2 and 8 are also
unbiased. The trimmed mean (no. 9) is unbiased if the parent p.d.f. is symmetric
about its mean. The sample median (no. 10) is also unbiased if the parent median
equals the parent mean. Similarly, nos. 6 and 7 will be unbiased for certain p.d.f.’s.

Variance Now suppose we want to estimate the variance of the parent p.d.f.
Assume that we know the true mean, µ. Usually this is not the case, but could be,
e.g., if we know that the p.d.f. is symmetric about some value. Then following our
above experience with the sample mean, we might expect the sample variance,
\[ s_1^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2 \tag{8.3} \]

to be a good estimator of the parent variance, σ 2 . (N.b., do not confuse the standard
deviation, σ, of the parent p.d.f. with the ‘error’ on µ̂.) Assume that the parent
variance, σ 2 , is finite (exists). Then
\begin{eqnarray*} E\left[s_1^2\right] &=& \frac{1}{n} E\left[ \sum (x_i - \mu)^2 \right] = \frac{1}{n} E\left[ \sum \left( x_i^2 - 2 x_i \mu + \mu^2 \right) \right] \\ &=& \frac{1}{n} E\left[ \sum x_i^2 - 2\mu \sum x_i + n\mu^2 \right] = \frac{1}{n} \left[ E\left[ \sum x_i^2 \right] - 2\mu E\left[ \sum x_i \right] + n\mu^2 \right] \\ &=& \frac{1}{n} \left[ n E\left[x^2\right] - 2 n \mu E[x] + n\mu^2 \right] = E\left[x^2\right] - 2\mu^2 + \mu^2 \\ &=& \sigma^2 + \mu^2 - 2\mu^2 + \mu^2, \qquad \text{since } \sigma^2 = E\left[x^2\right] - \mu^2 \\ &=& \sigma^2 \end{eqnarray*}

Thus $\widehat{\sigma^2} = s_1^2$ is an unbiased estimator of the variance of the parent p.d.f., $\sigma^2$, if µ
is known.
But usually µ is not known. We therefore try using our estimate of µ, µ̂ = x̄,
instead of µ:
\[ s_x^2 = \frac{1}{n}\sum (x_i - \bar{x})^2 = \frac{1}{n}\sum x_i^2 - \bar{x}^2 = \overline{x^2} - \bar{x}^2 \tag{8.4} \]
This has the expectation,
\[ E\left[s_x^2\right] = E\left[ \frac{\sum x_i^2}{n} - \left( \frac{\sum x_i}{n} \right)^2 \right] = \frac{1}{n} E\left[ \sum x_i^2 \right] - \frac{1}{n^2} E\left[ \left( \sum x_i \right)^2 \right] \tag{8.5} \]
The $x_i$ are independent. Hence $E\left[\sum x_i^2\right] = n E\left[x^2\right]$. Also,
\[ \sigma^2 = E\left[x^2\right] - \mu^2 \qquad \text{and} \qquad V\left[ \sum x_i \right] = E\left[ \left( \sum x_i \right)^2 \right] - \left( E\left[ \sum x_i \right] \right)^2 \]
Substituting in (8.5) gives
\[ E\left[s_x^2\right] = \frac{1}{n}\, n\left( \sigma^2 + \mu^2 \right) - \frac{1}{n^2}\left\{ V\left[ \sum x_i \right] + \left( E\left[ \sum x_i \right] \right)^2 \right\} \]
Using
\[ V\left[ \sum x_i \right] = \sum V[x_i] = n V[x] = n\sigma^2 \qquad \text{and} \qquad E\left[ \sum x_i \right] = n E[x] = n\mu \]
we find
\[ E\left[s_x^2\right] = \frac{1}{n}\left( n\sigma^2 + n\mu^2 \right) - \frac{1}{n^2}\left( n\sigma^2 + (n\mu)^2 \right) = \frac{n-1}{n}\,\sigma^2 \tag{8.6} \]

Thus $s_x^2$ is a biased estimator of $\sigma^2$. The reason is that, not knowing µ, we used our estimate of the mean, $\hat\mu = \bar{x}$, the sample mean. The spread of the data about the sample mean is clearly less than its spread about the true mean. Since the variance is the spread about the true mean, $s_x^2$ underestimates the true variance.
[figure: data points clustered more closely about the sample mean $\bar{x}$ than about the true mean µ]
This bias is easily removed. An unbiased estimator for the parent variance when the parent mean is unknown is
\[
s^2 = \frac{n}{n-1}\, s_x^2 = \frac{n}{n-1}\left(\overline{x^2} - \bar x^2\right) = \frac{1}{n-1}\sum(x_i-\bar x)^2 \tag{8.7}
\]

Note that the above calculations did not depend at all on what the parent p.d.f.
was, not even on the C.L.T.
If the p.d.f. is Gaussian or if n is large enough that the C.L.T. applies, let
\[
z_i = \frac{x_i - \bar x}{\sigma}
\]
Then
\[
\sum z_i^2 = \frac{1}{\sigma^2}\sum(x_i-\bar x)^2
\]
is distributed as χ² (section 3.12). There is one relationship among the $z_i$’s:
\[
\sum z_i = \frac{1}{\sigma}\sum(x_i-\bar x) = \frac{1}{\sigma}\left(\sum x_i - n\bar x\right) = 0
\]
which follows from the definition of x̄. Hence, the p.d.f. for $\sum z_i^2$ is a χ² of n − 1 degrees of freedom. Recall that E[χ²(n − 1)] = n − 1. This is another way of seeing that
\[
E[s^2] = E\left[\frac{\sigma^2}{n-1}\sum z_i^2\right] = \frac{\sigma^2}{n-1}\, E\left[\chi^2(n-1)\right] = \sigma^2
\]
i.e., that $\widehat{\sigma^2} = s^2$ is an unbiased estimator of σ² when µ is unknown.


This use of χ2 is of more than passing interest: In general, if we have n mea-
surements, xi , of a quantity, with k ≤ n relationships (constraints) among them,
P
then the χ2 constructed from the x2i will have n − k degrees of freedom.
The (n − 1) instead of n in s2 also makes sense in the limit n = 1. With only one
measurement of x, you have an estimate µ̂ = x of µ, but no estimate of the width
1
of the distribution. This is consistent with s2 = 1−1 (x − µ̂)2 = 00 = indeterminate.
However, if µ is known you do not have to use the measurement to estimate µ; you
can use it instead to estimate σ 2 . Hence s1 contains n instead of (n − 1).
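As a quick numerical illustration (a sketch, not part of the notes; it assumes Python with numpy is available and uses an arbitrary Gaussian parent p.d.f.), the following fragment repeats a small "experiment" many times and compares the average of the two sample-variance definitions with the true σ², confirming equations 8.6 and 8.7:

import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, n_exp = 0.0, 2.0, 5, 200_000

x = rng.normal(mu, sigma, size=(n_exp, n))     # n_exp repetitions of n measurements
xbar = x.mean(axis=1, keepdims=True)

s2_biased   = ((x - xbar) ** 2).sum(axis=1) / n        # s_x^2, eq. 8.4
s2_unbiased = ((x - xbar) ** 2).sum(axis=1) / (n - 1)  # s^2,   eq. 8.7

print("true sigma^2  :", sigma**2)
print("mean of s_x^2 :", s2_biased.mean())     # ~ (n-1)/n * sigma^2 = 3.2
print("mean of s^2   :", s2_unbiased.mean())   # ~ sigma^2 = 4.0

The biased estimator comes out low by the factor (n − 1)/n, exactly as derived above.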

8.2.2 Consistency
If we take more data, we should expect a better (more accurate) estimate of the
parameters. An estimator which converges to the true value with increasing n is
termed consistent.
Definition: An estimator, θ̂, of θ is consistent if for any ε > 0 (no matter how small),
\[
\lim_{n\to\infty} P\left(|\hat\theta - \theta| \ge \varepsilon\right) = 0 \tag{8.8}
\]

This is rather analogous to the definition of convergence of a series except that


here it is the probability of the deviation from the true value which approaches 0
rather than the deviation itself. This is therefore sometimes called convergence in
probability.
If θ̂ is an average of data which are distributed according to a p.d.f. for which the C.L.T. applies, then θ̂ is a consistent estimator, since the width of the p.d.f., $N(\bar x;\mu,\sigma^2/n)$, approaches 0 for n → ∞.
In our list of estimators of the mean no. 2 is clearly inconsistent. Nos. 1, 3,
and 8 are obviously consistent if the C.L.T. applies. No. 10 is consistent only if the
mean and median of the parent p.d.f. are equal. Likewise, the consistency of nos.
6, 7 and 9 depends on the p.d.f.
The usual example of an inconsistent estimator is the sample mean for the
Cauchy p.d.f., which, as we have seen, does not have a finite variance. The C.L.T.
does not then apply, and in fact x̄ is distributed just like x. Thus, x̄ does not
converge to anything! This illustrates the fact that an unbiased estimator is not
necessarily consistent.
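A minimal sketch of this inconsistency (not from the notes; it assumes numpy and uses the interquartile range of x̄ over repeated samples as a robust measure of spread, since the Cauchy variance does not exist):

import numpy as np

rng = np.random.default_rng(2)
iqr = lambda a: np.subtract(*np.percentile(a, [75, 25]))   # p75 - p25

for n in (10, 1_000, 100_000):
    cauchy_means = rng.standard_cauchy((100, n)).mean(axis=1)
    gauss_means  = rng.standard_normal((100, n)).mean(axis=1)
    print(f"n={n:6d}  IQR of x-bar: Cauchy {iqr(cauchy_means):7.3f}   "
          f"Gauss {iqr(gauss_means):8.5f}")

The Gaussian spread shrinks like 1/√n, while the Cauchy spread stays of order one no matter how large n becomes.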

8.2.3 Variance of an estimator, efficiency


An estimator is called efficient if it has a small variance, in particular if it has the
smallest possible variance (see the following section).
Repetition of an experiment generally results in a different value of our (consis-
tent) estimator. If the variance of the sampling p.d.f. of the estimator, which, for
convenience, we will call the variance of the estimator, is small, these values will
cluster closely about the true value, or, if the estimator is biased, about the biased
(i.e., wrong) value. We will see that in general the variance of an estimator depends
on the parent p.d.f., in particular, on the variance (σ 2 ) of the parent p.d.f.
For example, consider the variance of the sample mean. As we have seen (chap-
ter 5 and exercise 23),
 
\[
V[\bar x] = V\left[\frac{1}{n}\sum x_i\right] = \frac{1}{n^2}\sum V[x_i] = \frac{1}{n^2}\, nV[x] = \frac{\sigma^2}{n} \tag{8.9}
\]
Now consider the sample variance, which was defined in equation 8.7. Assuming that the $x_i$ follow a normal p.d.f. (or that n is large and the C.L.T. applies), the sample variance has variance
\[
V[s^2] = V\left[\frac{\sigma^2}{n-1}\sum\frac{(x_i-\bar x)^2}{\sigma^2}\right]
= \left(\frac{\sigma^2}{n-1}\right)^2 V\left[\sum z_i^2\right]
\]
where $z_i = \frac{x_i-\bar x}{\sigma}$. As we have seen (section 8.2.1), $\sum_{i=1}^{n} z_i^2$ is distributed as χ²(n − 1). Thus,
\[
V\left[\sum z_i^2\right] = V\left[\chi^2(n-1)\right] = 2(n-1)
\]
Hence,
\[
V[s^2] = \frac{2(\sigma^2)^2}{n-1} \tag{8.10}
\]
We see that the expressions for the variance of x̄ and s² both contain σ², the variance of the parent p.d.f., which we may not know. (If we do know it we certainly will not be interested in estimating it.) The usual procedure is to use instead our estimate of σ², s². Then the estimated variances of our estimates are
\[
\hat V[\bar x] = \frac{s^2}{n}\ ,\qquad \hat V[s^2] = \frac{2(s^2)^2}{n-1} \tag{8.11}
\]

Sometimes you do know σ 2 . We give two examples: (1) You average many mea-
surements of a quantity, e.g., the length of a table. The p.d.f. is then a convolution
of a δ-function about the true length with a resolution function for the measuring
apparatus, which is just a Gaussian centered about the true length with σ equal to
the resolution. But you have calibrated the measuring apparatus by measuring a
standard length a great many times. From this calibration you know σ 2 . So you
only need to estimate µ. (2) You are designing an experiment and you want to know
how many measurements you need to make in order to attain a given accuracy. You
then make reasonable assumptions about the p.d.f. and calculate what V will be
for the different assumptions about µ, σ 2 , and n.
To summarize, assuming that we do not know µ or σ², they are estimated by
\[
\hat\mu = \bar x \pm \sqrt{V[\bar x]}
\qquad\text{and}\qquad
\widehat{\sigma^2} = s^2 \pm \sqrt{V[s^2]} \tag{8.12a}
\]
\[
\hat\mu = \bar x \pm \sqrt{\frac{s^2}{n}}
\qquad\text{and}\qquad
\widehat{\sigma^2} = s^2 \pm s^2\sqrt{\frac{2}{n-1}} \tag{8.12b}
\]

Note that the ‘error’ on µ̂ has itself an error. By ‘error propagation’, which will be covered in section 8.3.6,
\[
V[s^2] = \left(\frac{\mathrm{d}s^2}{\mathrm{d}s}\right)^2 V[s] = (2s)^2\, V[s]
\]
Hence,
\[
V[s] = \frac{1}{4s^2}\, V[s^2] = \frac{1}{4s^2}\,\frac{2(s^2)^2}{n-1} = \frac{s^2}{2(n-1)}
\qquad\text{and}\qquad
\sqrt{V[s]} = \frac{s}{\sqrt{2(n-1)}}
\]

The error on the error on µ̂ is then (with δ indicating ‘error’)
\[
\delta(\delta\hat\mu) = \sqrt{V[\delta\hat\mu]}
= \sqrt{V\left[\sqrt{\frac{s^2}{n}}\,\right]}
= \frac{1}{\sqrt{n}}\sqrt{V\left[\sqrt{s^2}\right]}
= \frac{s}{\sqrt{2n(n-1)}}
= \frac{\delta\hat\mu}{\sqrt{2(n-1)}}
\]
Thus for n not too small, the error on the error on µ̂ is negligible.
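The following sketch (not part of the notes; it assumes numpy and uses hypothetical Gaussian measurements) puts equations 8.11–8.12 and the error on the error into code:

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(10.0, 0.5, size=100)        # hypothetical measurements
n = len(x)

mu_hat     = x.mean()
s2         = x.var(ddof=1)                 # unbiased s^2, eq. 8.7
err_mu     = np.sqrt(s2 / n)               # eq. 8.12b
err_s2     = s2 * np.sqrt(2.0 / (n - 1))   # eq. 8.12b
err_err_mu = err_mu / np.sqrt(2 * (n - 1)) # error on the error on mu-hat

print(f"mu-hat = {mu_hat:.4f} +- {err_mu:.4f}  (error on the error {err_err_mu:.5f})")
print(f"s^2    = {s2:.4f} +- {err_s2:.4f}")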

8.2.4 Interpretation of the Variance


We usually interpret V[q̂] = σ² as the “square of the expected error” of q̂ and we write q = q̂ ± δq where δq = σ. If the p.d.f. of q̂ is a Gaussian with variance σ², then the chance, in some sense, that the true value of q, $q_t$, is within q̂ − σ ≤ q_t ≤ q̂ + σ is
\[
P(\hat q - \sigma \le q_t \le \hat q + \sigma) = \int_{\hat q-\sigma}^{\hat q+\sigma} N(q;\hat q,\sigma^2)\,\mathrm{d}q \approx 0.68
\]

In exactly what sense this is so will be discussed in section 9.


We could have used some other quantity to indicate the ‘error’, e.g., the average of the absolute deviation, $\overline{|\hat q - q|}$, instead of $\sqrt{\overline{(\hat q - q)^2}}$. The variance is conventional for a number of reasons:

• It is low order and hence easy to calculate.

• It is sufficient in the case of a Gaussian, being one of the two parameters of the Gaussian, and the Gaussian is, by the C.L.T., often the asymptotic limit of the p.d.f.

• It is easily converted to a confidence interval in the Gaussian limit (cf. chapter 9).

When the p.d.f. of q̂ is non-Gaussian, one must be careful. If the p.d.f. is skewed, this can be indicated by stating asymmetric errors, but that is not foreseen in the propagation of errors. Also, for a non-Gaussian p.d.f., P(q̂ − σ ≤ q_t ≤ q̂ + σ) is usually not 68%. Nor is the probability of being within, e.g., 2σ the same in the non-Gaussian case as in the Gaussian case. Nor do the errors even have to be symmetric.
The propagation of errors (cf. section 8.3.6) is usually the least trustworthy
when there is a dependence on 1/q. Going to higher orders in the expansion does
not necessarily help because the resulting error, though perhaps more accurate, still
has the same problems resulting from skewness and the probability content of ±2σ.
These questions are often conveniently investigated by Monte Carlo methods. As
previously stated, the best cure for these problems is to rewrite the p.d.f. in terms
of the parameters you want to estimate.
We shall return to these questions when discussing confidence intervals (chapter
9) and hypothesis testing (chapter 10).

8.2.5 Information and Likelihood


The concepts ‘information’ and ‘likelihood’ will be useful in discussing the variance
of estimators. We introduce them now:
There are several different definitions of information. They are named after
the person who introduced them. We will use that of R. A. Fisher, which is then
referred to as the information of R. A. Fisher. However, since we will only treat this
one definition of information, we will simply refer to it as information. But bear in
mind that the word can have other definitions. We will see that Fisher’s definition
meets the following requirements, which we find necessary for what we would like
the word ‘information’ to mean:
1. The information should increase if we make more observations.
2. Data, which are irrelevant to the estimation of the parameters we wish to
estimate or to the hypothesis we wish to test, should contain no information.
Of course the same data may contain information for other parameters or
other tests.
3. The precision of the estimation or test should be greater if we have more
information.
Present-day, large-scale experiments usually produce a great amount of data of
which only a small part is useful for a given measurement or test. The information
contained in a datum can be used to decide whether to reject it in order to reduce
the amount of data to a manageable size. (It is difficult to work with data on 100
magnetic tapes; working with just one tape, or a small disk file is much easier.)
A good criterion for data reduction is to reject the maximum of data with the
minimum loss of information. This is usually a compromise, although the rejection
of some data may actually result in no loss of information.

Likelihood function: We observe a real random variable, X, sampled from a


p.d.f., f (x; θ), where θ is a parameter. The set of allowed values of X is denoted by
Ωθ , the subscript emphasizing the possible dependence on the parameter. Both X
and θ could be sets of values X and θ, not necessarily of the same dimension.
Consider a set of n independent observations of X, $x_i$. The joint p.d.f. of the $x_i$ is, since they are independent,
\[
L(x;\theta) = L(x_1, x_2, \ldots, x_n;\theta) = \prod_{i=1}^{n} f(x_i;\theta) \tag{8.13}
\]
The function L depends on both the measurements $x_i$ and on the parameters θ. However, after having done the experiment, the $x_i$ are fixed. Then L can be regarded as a function of θ only. L is called the likelihood function. We also define its logarithm,
\[
\ell \equiv \ln L(x_1, \ldots, x_n;\theta) = \sum_{i=1}^{n} \ln f(x_i;\theta) \tag{8.14}
\]

Information: The information (of R. A. Fisher) given about a parameter θ by an observation of the r.v. x is defined as the expectation
\[
I_x(\theta) = E\left[\left(\frac{\partial\ln L(x;\theta)}{\partial\theta}\right)^2\right]
= E\left[\left(\frac{\partial\ell}{\partial\theta}\right)^2\right]
= \int_{\Omega_\theta}\left(\frac{\partial\ln L(x;\theta)}{\partial\theta}\right)^2 L(x;\theta)\,\mathrm{d}x \tag{8.15}
\]
In the case where there are k parameters, the information is a k × k matrix:
\[
\left[I_x(\theta)\right]_{ij}
= E\left[\frac{\partial\ln L(x;\theta)}{\partial\theta_i}\,\frac{\partial\ln L(x;\theta)}{\partial\theta_j}\right]
= \int_{\Omega_\theta}\frac{\partial\ln L(x;\theta)}{\partial\theta_i}\,\frac{\partial\ln L(x;\theta)}{\partial\theta_j}\, L(x;\theta)\,\mathrm{d}x
\]
This definition of information may seem rather arbitrary, but we shall see that
it satisfies the three requirements stated above.

Score: Notation becomes more compact by introducing the score. We define the score of one measurement as
\[
S_1 \equiv \frac{\partial}{\partial\theta}\ln f(x;\theta) \tag{8.16}
\]
Note that the score, being a function of r.v.’s, is itself a r.v. The score of the entire sample is then defined to be the sum of the scores of each observation:
\[
S(x;\theta) \equiv \sum_{i=1}^{n} S_1(x_i;\theta) \tag{8.17}
\]
Then
\[
S(x;\theta) = \sum_{i=1}^{n}\frac{\partial}{\partial\theta}\ln f(x_i;\theta)
= \frac{\partial}{\partial\theta}\sum_{i=1}^{n}\ln f(x_i;\theta)
= \frac{\partial\ln L(x;\theta)}{\partial\theta}
\]
Summarizing,
\[
S(x;\theta) = \frac{\partial\ln L(x;\theta)}{\partial\theta} = \sum_{i=1}^{n} S_1(x_i;\theta) = \sum_{i=1}^{n}\frac{\partial}{\partial\theta}\ln f(x_i;\theta) \tag{8.18}
\]
This result combined with equation 8.15 shows that we can write the information of the sample x on the parameter θ as the expectation of the square of the score:
\[
I_x(\theta) = E\left[\left(S(x;\theta)\right)^2\right] \tag{8.19}
\]

If Ωθ is independent of θ, we can show that the expectation of the score is zero and we can derive another relation between the information and the score. Let us assume that

1. Ωθ is independent of θ, and

2. L(x; θ) is regular enough that we can interchange the order of $\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}$ and $\int\mathrm{d}x$.

If condition (1) holds, condition (2) will also generally hold for distributions encountered in physics. Now,
\[
E[S_1(x;\theta)] = E\left[\frac{\partial}{\partial\theta}\ln f(x;\theta)\right]
= \int\left[\frac{\partial}{\partial\theta}\ln f(x;\theta)\right] f(x;\theta)\,\mathrm{d}x
= \int\frac{1}{f(x;\theta)}\left[\frac{\partial}{\partial\theta}f(x;\theta)\right] f(x;\theta)\,\mathrm{d}x
= \int\frac{\partial}{\partial\theta}f(x;\theta)\,\mathrm{d}x
\]
Interchanging the order of integration and differentiation (assumption 2),
\[
E[S_1(x;\theta)] = \frac{\partial}{\partial\theta}\int f(x;\theta)\,\mathrm{d}x = \frac{\partial}{\partial\theta}\,1 = 0 \tag{8.20}
\]

since f(x; θ) is normalized for all values of θ. Hence,
\[
E[S(x;\theta)] = \sum E[S_1(x_i;\theta)] = 0 \tag{8.21}
\]
Using the fact that the variance of a quantity is given by V[a] = E[a²] − (E[a])², we see from equations 8.19 and 8.21 that
\[
I_x(\theta) = V[S(x;\theta)] \tag{8.22}
\]

We have shown above (equation 8.19) that in general the information on θ is equal to the expectation of the square of the score. Under the above two assumptions you can show (exercise 31) that the information is also given by
\[
I_x(\theta) = -E\left[\frac{\partial S(x;\theta)}{\partial\theta}\right] \tag{8.23}
\]

These results (equations 8.21 and 8.23) are very useful, but do not forget the as-
sumptions on which they depend.

Does I satisfy the requirements?  We can now show that the information increases with the number of independent observations. For n observations,
\[
I(\theta) = E\left[\left(\sum_{i=1}^{n} S_1(x_i;\theta)\right)^2\right]
= V\left[\sum_{i} S_1(x_i;\theta)\right] + \left\{E\left[\sum_{i} S_1(x_i;\theta)\right]\right\}^2
\]
where we have used the fact that V[a] = E[a²] − (E[a])². The second term is zero under the assumptions that Ωθ is independent of θ and that the order of differentiation and integration can be interchanged as in the previous paragraph (eq. 8.21). However, let us now relax these assumptions.
Since the $x_i$ are independent, the variance of the sum is just the sum of the variances. And since all the $x_i$ are sampled from the same p.d.f., the variance is the same for all i. A similar argument applies to the second term. Hence,
\[
I(\theta) = n\, V[S_1(x;\theta)] + n^2\left\{E[S_1(x;\theta)]\right\}^2 \tag{8.24}
\]

Following the same steps for n = 1 gives the same expression with n = 1. Hence,
the information increases with the number of observations, our first requirement for
information.
If the assumptions of the previous paragraph apply, the second term in the above
equation is zero by equation 8.20. Then,

I(θ) = n I1 (θ) (8.25)

and the information of n independent observations is just n times the information


of one observation. If the assumptions are not true, the second term may not be
zero but will still be positive; hence I will still increase with n.
For data which are irrelevant for the estimation of θ, the p.d.f. will not depend
on θ and the score will, from its definition (equations 8.16 and 8.17), be zero. This
implies that the information will also be zero, which was our second requirement
for information.
We now turn to the third requirement, the connection between the precision of
an estimator and the information.

8.2.6 Minimum Variance Bound


It turns out that there is a lower limit to the variance of an estimator under certain
general conditions.

Rao-Cramér inequality: Suppose that we have an estimator θ̂ of θ with bias $b_n(\hat\theta) = E[\hat\theta] - \theta$, that the variance $V[\hat\theta]$ is finite, and that the range of X does not depend on θ. Then


\[
\begin{aligned}
E\left[\hat\theta\, S(x;\theta)\right]
&= \int\!\cdots\!\int \hat\theta\left[\frac{\partial}{\partial\theta}\ln L(x;\theta)\right] L(x;\theta)\,\mathrm{d}x_1\ldots\mathrm{d}x_n \\
&= \int\!\cdots\!\int \hat\theta\,\frac{1}{L(x;\theta)}\left[\frac{\partial}{\partial\theta}L(x;\theta)\right] L(x;\theta)\,\mathrm{d}x_1\ldots\mathrm{d}x_n \\
&= \int\!\cdots\!\int \hat\theta\left[\frac{\partial}{\partial\theta}L(x;\theta)\right]\mathrm{d}x_1\ldots\mathrm{d}x_n \\
&= \int\!\cdots\!\int \hat\theta\,\frac{\partial}{\partial\theta}\prod_{i=1}^{n}\left[f(x_i;\theta)\,\mathrm{d}x_i\right] \\
&= \int\!\cdots\!\int \frac{\partial}{\partial\theta}\,\hat\theta\prod_{i=1}^{n}\left[f(x_i;\theta)\,\mathrm{d}x_i\right]
\end{aligned}
\]
The last step follows because θ̂ is a statistic and therefore does not depend on θ. Assuming that we can interchange the order of differentiation and integration, we find
\[
E\left[\hat\theta\, S(x;\theta)\right]
= \frac{\partial}{\partial\theta}\int\!\cdots\!\int \hat\theta\prod_{i=1}^{n}\left[f(x_i;\theta)\,\mathrm{d}x_i\right]
= \frac{\partial}{\partial\theta}E\left[\hat\theta\right]
= \frac{\partial}{\partial\theta}\left(\theta + b_n(\hat\theta)\right)
= 1 + \frac{\partial}{\partial\theta}b_n(\hat\theta)
\]
Both θ̂ and S(x; θ) are r.v.’s. Their covariance is
\[
\mathrm{cov}\left[S(x;\theta),\hat\theta(x)\right]
= E\left[S(x;\theta)\,\hat\theta(x)\right] - \underbrace{E\left[S(x;\theta)\right]}_{=0,\ \text{eq. 8.21}} E\left[\hat\theta(x)\right]
= 1 + \frac{\partial}{\partial\theta}b_n(\hat\theta)
\]
Therefore, their correlation coefficient is
\[
\rho^2 = \frac{\left\{\mathrm{cov}\left[S,\hat\theta\right]\right\}^2}{V[S]\,V[\hat\theta]}
= \frac{\left[1 + \frac{\partial}{\partial\theta}b_n(\hat\theta)\right]^2}{I(\theta)\,V[\hat\theta]}
\]
Since ρ² ≤ 1, we have
\[
\sigma^2(\hat\theta) = V[\hat\theta] \ge \frac{\left[1 + \frac{\partial}{\partial\theta}b_n(\hat\theta)\right]^2}{I(\theta)} \tag{8.26}
\]
Thus, there is a lower bound on the variance of the estimator. For a given set
of data and hence a given amount of information, I(θ), on θ, we can never find an
estimator with a lower variance.

The more information we have, the lower this bound is, in accordance with our
third requirement for information.
If the estimator is a constant, θ̂ = c, then the bias is b = c − θ and the minimum
variance is 0, which is not a very interesting bound since the variance of a constant
is always 0.
The inequality (8.26) is usually known as the Rao-Cramér inequality or the
Frechet inequality. It was discovered independently by a number of people including
Rao,35 Cramér,15 and Frechet. The first were Aitken and Silverstone.36 Although we
have assumed that the range of X is independent of θ and that we could interchange
the order of differentiation and integration, the result (8.26) can be obtained with
somewhat more general assumptions.11, 13
In general, we prefer unbiased estimators. In that case the inequality reduces to σ²(θ̂) ≥ 1/I(θ). This is also the case if the bias of the estimator does not depend on the true value of θ. For more than one parameter this result generalizes to
\[
\sigma^2(\hat\theta_i) \ge \left[I^{-1}(\theta)\right]_{ii} \tag{8.27}
\]
the diagonal element of the inverse of the information matrix.


We define the efficiency of the estimator as
\[
\epsilon(\hat\theta) = \frac{\sigma^2_{\min}(\hat\theta)}{\sigma^2(\hat\theta)} \le 1 \tag{8.28}
\]
which, for unbiased estimators, is just
\[
\epsilon(\hat\theta) = \frac{1}{\sigma^2(\hat\theta)\, I(\theta)} \le 1 \tag{8.29}
\]
An estimator whose variance is equal to the minimum variance given by equation 8.26, i.e., has ε(θ̂) = 1, is termed efficient. It is not always possible to construct an efficient estimator.

Examples:

Gaussian with known mean.  We have seen (section 8.2.1) that $\widehat{\sigma^2} = \sum(x_i-\mu)^2/n$ is an unbiased estimator of the variance of a Gaussian of known mean. It is easy to show (exercise 32) that it is also an efficient estimator.

Exponential.  Consider n independent observations from an exponential p.d.f.,
\[
f(x;\mu) = \frac{1}{\mu}\, e^{-x/\mu}\ ,\qquad \mu > 0
\]
We wish to estimate µ. We note that
\[
\ln f(x;\mu) = -\ln\mu - \frac{x}{\mu}
\]

The score of one observation is then
\[
S_1(x;\mu) = \frac{\partial}{\partial\mu}\left(-\ln\mu - \frac{x}{\mu}\right) = -\frac{1}{\mu} + \frac{x}{\mu^2}
\]
The information of one observation is then, using equation 8.19 or 8.23, the latter being applicable since the range of X is independent of µ,
\[
I_1(\mu) = E\left[\left(S_1(x;\mu)\right)^2\right] = -E\left[\frac{\partial S_1(x;\mu)}{\partial\mu}\right]
= -E\left[\frac{1}{\mu^2} - \frac{2x}{\mu^3}\right] = -\frac{1}{\mu^2} + \frac{2}{\mu^2} = \frac{1}{\mu^2}
\]
And the total information of the sample is
\[
I(\mu) = n\, I_1(\mu) = \frac{n}{\mu^2}
\]
If µ̂ is unbiased, its minimum variance is then 1/I = µ²/n. We try the sample mean as an estimator: µ̂ = x̄. We know (equation 8.2) that the sample mean is always an unbiased estimator of the mean. The variance of the sample mean is
\[
V[\bar x] = \frac{1}{n}V[x] = \frac{1}{n}\left(E[x^2] - \mu^2\right)
= \frac{1}{n}\left(\underbrace{\int_0^\infty x^2\,\frac{1}{\mu}e^{-x/\mu}\,\mathrm{d}x}_{=2\mu^2} - \mu^2\right) = \frac{\mu^2}{n}
\]
which is just the minimum variance found above. Thus the sample mean is an efficient estimator of the mean of an exponential p.d.f.
Note that the score is
\[
S(x;\mu) = \sum_{i=1}^{n} S_1(x_i;\mu) = -\frac{n}{\mu} + \frac{\sum x_i}{\mu^2} = -I(\mu)\,(\mu - \hat\mu)
\]
Thus the score is a linear function of the estimator. This is not a coincidence, but a general feature of unbiased efficient estimators, as we show in the next section.
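A numerical sketch of the exponential example (not from the notes; it assumes numpy and an arbitrary true mean): the variance of x̄ over many repeated "experiments" is compared with the minimum-variance bound 1/I(µ) = µ²/n derived above.

import numpy as np

rng = np.random.default_rng(4)
mu_true, n, n_exp = 3.0, 50, 100_000

x = rng.exponential(mu_true, size=(n_exp, n))
mu_hat = x.mean(axis=1)                    # sample mean of each experiment

print("empirical variance of mu-hat :", mu_hat.var())
print("minimum variance mu^2/n      :", mu_true**2 / n)

The two numbers agree, as expected for an efficient estimator.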

8.2.7 Efficient estimators—the Exponential family


In this section we shall show that an efficient estimator can be found if and only if
the p.d.f. is a member of a quite general class of functions known as the exponential
family.
The minimum variance bound was found using
\[
\rho^2 = \frac{\left\{\mathrm{cov}\left[S,\hat\theta\right]\right\}^2}{V[S]\,V[\hat\theta]} \le 1
\]

The equality ρ = ±1 corresponds to a linear relationship between the variables (exercise 7), i.e., a straight line on a graph of S vs. θ̂. Thus, assuming that the conditions of the minimum variance bound hold, an estimator θ̂ can be efficient if and only if it is a linear function of S, with the possible exception of regions where the probability is zero.
Let A(θ) and B(θ) be functions of θ, but not of x, and A′, B′ be their derivatives with respect to θ. Then we can write the linear relationship as
\[
\frac{\partial}{\partial\theta}\ln f(x;\theta) \equiv S = A'(\theta)\,\hat\theta(x) + B'(\theta) \tag{8.30}
\]
Since θ̂ is a statistic and hence depends only on x, integration over θ gives
\[
\ln f(x;\theta) = A(\theta)\,\hat\theta(x) + B(\theta) + K(x) \tag{8.31}
\]
where the integration constant K may depend on x but not on θ. Then, where the required normalization is included in B and/or K,
\[
f(x;\theta) = \exp\left[A(\theta)\,\hat\theta(x) + B(\theta) + K(x)\right] \tag{8.32}
\]

Any p.d.f. of the above form is said to belong to the exponential family. What
we have shown is that an efficient estimator can be found if and only if the p.d.f. is
of the exponential family where the estimator enters the exponent in the way shown
in equation 8.32.
Note that the efficient estimator is not necessarily unique since the product A · θ̂
can often be factored in more than one way. The estimator θ̂ will be an unbiased
estimator for some quantity, although not necessarily for the quantity we want to
estimate. It may also not be an estimator which we will be able to use. Let us now
calculate the expectation of θ̂ and see for what quantity it is an unbiased estimator:
From equation 8.30,
\[
\hat\theta = \frac{S(x;\theta)}{A'(\theta)} - \frac{B'(\theta)}{A'(\theta)}
\]
Since A′ and B′ do not depend on x, the expectation is then
\[
E\left[\hat\theta\right] = \frac{1}{A'(\theta)}E[S(x;\theta)] - \frac{B'(\theta)}{A'(\theta)}
\]
Since E[S(x; θ)] = 0, we have
\[
E\left[\hat\theta\right] = -\frac{\partial B(\theta)/\partial\theta}{\partial A(\theta)/\partial\theta} \tag{8.33}
\]
This is the quantity for which the θ̂ in equation 8.32 is an unbiased, efficient estimator.

If there are k parameters, θ, equation 8.32 generalizes to
\[
f(x;\theta) = \exp\left[A(\theta)\cdot\hat\theta(x) + B(\theta) + K(x)\right] \tag{8.34}
\]
The score for the i-th parameter is then
\[
S(x;\theta_i) = \frac{\partial}{\partial\theta_i}\ln f(x;\theta)
= \sum_j\frac{\partial A_j(\theta)}{\partial\theta_i}\,\hat\theta_j(x) + \frac{\partial B(\theta)}{\partial\theta_i}
\]
Taking the expectation, we arrive at the generalization of equation 8.33, which is a set of k equations:
\[
E\left[\hat\theta_i\right] = -\frac{\frac{\partial B(\theta)}{\partial\theta_i} + \sum_{j\ne i}E\left[\hat\theta_j\right]\frac{\partial A_j(\theta)}{\partial\theta_i}}{\frac{\partial A_i(\theta)}{\partial\theta_i}} \tag{8.35}
\]

Examples:

Gaussian.  As an example we take the normal p.d.f., N(x; µ, σ²), which has two parameters, θ = (µ, σ²). We write N(x; µ, σ²) in an exponential form:
\[
N(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right]
= \exp\left[\frac{\mu}{\sigma^2}\,x - \frac{1}{2\sigma^2}\,x^2 - \frac{1}{2}\left(\frac{\mu^2}{\sigma^2} + \ln(2\pi\sigma^2)\right)\right]
\]
For n independent observations the p.d.f. becomes
\[
\prod_{i=1}^{n} N(x_i;\mu,\sigma^2)
= \exp\left[\frac{n\mu}{\sigma^2}\,\bar x - \frac{n}{2\sigma^2}\,\overline{x^2} - \frac{n}{2}\left(\frac{\mu^2}{\sigma^2} + \ln(2\pi\sigma^2)\right)\right]
\]
from which we see that we can choose (in equation 8.34)
\[
A_1(\theta) = \frac{n\mu}{\sigma^2}\ ,\quad \hat\theta_1(x) = \bar x\ ;\qquad
A_2(\theta) = -\frac{n}{2\sigma^2}\ ,\quad \hat\theta_2(x) = \overline{x^2}\ ;\qquad
B(\theta) = -\frac{n}{2}\left(\frac{\mu^2}{\sigma^2} + \ln(2\pi\sigma^2)\right)\ ,\quad K(x) = 0
\]
Then (from equation 8.35):
\[
\frac{\partial A_1}{\partial\mu} = \frac{n}{\sigma^2}\ ,\qquad
\frac{\partial A_2}{\partial\mu} = 0\ ,\qquad
\frac{\partial B}{\partial\mu} = -\frac{n\mu}{\sigma^2}
\qquad\Longrightarrow\qquad
E\left[\hat\theta_1\right] = -\frac{-n\mu/\sigma^2}{n/\sigma^2} = \mu
\]
Thus θ̂₁ = x̄ is an efficient and unbiased estimator of µ.
\[
\frac{\partial A_1}{\partial\sigma^2} = -\frac{n\mu}{\sigma^4}\ ,\qquad
\frac{\partial A_2}{\partial\sigma^2} = \frac{n}{2\sigma^4}\ ,\qquad
\frac{\partial B}{\partial\sigma^2} = \frac{n\mu^2}{2\sigma^4} - \frac{n}{2\sigma^2}
\qquad\Longrightarrow\qquad
E\left[\hat\theta_2\right] = \mu^2 + \sigma^2
\]
Thus θ̂₂ = $\overline{x^2}$ is an efficient and unbiased estimator of µ² + σ². Hence, $\overline{x^2} - \mu^2 = \overline{(x-\mu)^2}$ is an efficient and unbiased estimator of σ². However, this is of use to us only if we know µ.

Note the role of the number of observations n. The likelihood function, L, is just the p.d.f. with each term in the exponent replaced by a sum of n terms. Thus L can be obtained from f by the replacements: x → x̄, x² → $\overline{x^2}$, etc., and A → nA, B → nB, and K → nK. But $-\frac{\partial B/\partial\theta}{\partial A/\partial\theta}$ is unchanged by these substitutions. Thus we can work with f instead of L, just replacing any function of x by its average in the expression for θ̂.

Binomial.  Discrete p.d.f.’s can also belong to the exponential family. As an example we take the binomial p.d.f.,
\[
f(k;n,\theta) = \binom{n}{k}\theta^k(1-\theta)^{n-k}
\]
which can be written
\[
f(k;n,\theta) = \exp\left[k\ln\left(\frac{\theta}{1-\theta}\right) + n\ln(1-\theta) + \ln\binom{n}{k}\right]
\]
With n fixed, there is just one parameter to estimate, θ:
\[
A(\theta) = \ln\left(\frac{\theta}{1-\theta}\right)\ ,\quad \hat\theta(k) = k\ ,\quad
B(\theta) = n\ln(1-\theta)\ ,\quad K(k) = \ln\binom{n}{k}
\]
The expectation of the estimator is
\[
E\left[\hat\theta\right] = -\frac{\partial B/\partial\theta}{\partial A/\partial\theta} = n\theta
\]
Thus k is an efficient, unbiased estimator of nθ, or k/n is an efficient, unbiased estimator of θ.

Which estimator is the best?  Returning to the list of 10 estimators for the mean at the start of the section, we can ask which of the 10 is the best. Unfortunately, there is no unique answer. In general we prefer unbiased, consistent and efficient estimators. We can clearly reject nos. 2, 3, 4, 5 and 8. Nor is no. 6, the sample mode, a good choice, even when the parent mode equals the parent mean, since it uses so little of the information. However, which of the others is ‘best’ depends on the parent p.d.f.
The sample mean is efficient for a normal p.d.f. However, for a uniform p.d.f. ($f(x;a,b) = \frac{1}{b-a}$) where the limits (a, b) are unknown, estimator no. 7, $\frac{1}{2}(x_{\min} + x_{\max})$, has a smaller variance than x̄.
No. 10, the sample median, has a larger variance than the sample mean for a Gaussian p.d.f., but for a ‘large-tailed Gaussian’ it can be smaller. No. 9, the trimmed sample mean, throws away information but may still be best, in particular if we think that points in the tails are largely due to mismeasurement.

8.2.8 Sufficient statistics


A statistic T (x) is said to be sufficient for the parameter θ if the conditional p.d.f. of
x, given T , f (x|T ), is independent of θ. (T and θ may of course be multidimensional
and of different dimensions.) In other words, T is sufficient if T contains all the
information on θ.
Clearly, T = x is a sufficient statistic since that is all the information we have—
on θ or on anything else. But this doesn’t help us very much. The importance
of sufficiency is in data reduction. If we have a sufficient statistic, T , of a smaller
dimension than the data, x, we can reduce the amount of data. This can be of
enormous practical advantage.
From n independent observations xi , one can construct m ≤ n independent
statistics t, t1 , t2 , . . . , tm−1 (in an infinite number of ways). From the definition of
marginal and conditional p.d.f.’s we can write the p.d.f. of these statistics as (cf.
equation 2.28)

f (t, t1 , t2 , . . . , tm−1 ; θ) = g(t; θ) h(t1, t2 , . . . , tm−1 ; θ|t) (8.36)

where g(t; θ) is the marginal p.d.f. of t and h is the conditional p.d.f. Now if
h is independent of θ, then clearly the t1 , t2 , . . . , tm−1 contribute nothing to our
knowledge of θ. If this is true for any set of ti and any m < n then t clearly
contains all the information on θ. We therefore define a sufficient statistic t as: t is
a sufficient statistic for θ if for any choice of t1 , t2 , . . . , tm−1 (which are independent
of t),
f (t, t1 , t2 , . . . , tm−1 ; θ) = g(t; θ) h(t1, t2 , . . . , tm−1 |t) (8.37)
Now, what does this mean in terms of the likelihood? The likelihood function
is the p.d.f. for x and is thus related to the f of equation 8.37 by a coordinate
transformation. Starting from equation 8.37, let ti = xi for i = 1, 2, . . . , n − 1.
Then
f (t, x1 , x2 , . . . , xn−1 ; θ) = g(t; θ) h(x1, x2 , . . . , xn−1 |t)
The p.d.f. in terms of x is then
\[
L(x;\theta) = g(t;\theta)\, h(x_1, x_2, \ldots, x_{n-1}|t)\, J\!\left(\frac{x_1,\ldots,x_n}{x_1,\ldots,x_{n-1},t}\right)
\]
which is, since the Jacobian does not involve θ, of the form
\[
L(x;\theta) = g(t;\theta)\, k(x) \tag{8.38}
\]

Conversely, starting from equation 8.38, we make the transformation

t = t(x1 , . . . , xn )
ti = ti (x1 , . . . , xn ) , i < m
ti = xi , i = m, . . . , n − 1

L(x; θ) dx then transforms to
\[
g(t;\theta)\, k(x)\, J\!\left(\frac{t,t_1,\ldots,t_{n-1}}{x_1,\ldots,x_n}\right)\mathrm{d}t\prod_{i=1}^{n-1}\mathrm{d}t_i
\]
which we integrate over $\mathrm{d}t_m\ldots\mathrm{d}t_{n-1}$ to obtain the p.d.f. f(t, t₁, …, t_{m−1}). Neither k nor J depend on θ. However, the integration limits for $t_m,\ldots,t_{n-1}$ ($x_m,\ldots,x_{n-1}$) could depend on θ. If not, it is clear that we obtain the form of equation 8.37. It turns out11, 13 that this is also true even when the integration limits do depend on θ.
Thus equations 8.37 and 8.38 are equivalent. If we can find a statistic t such that
the likelihood function can be written in the form of equation 8.38, t is a sufficient
statistic for θ.
The sufficient statistics for θ having the smallest dimension are called minimal
sufficient statistics for θ. One usually prefers a minimal sufficient statistic since
that gives the greatest data reduction.
We have seen that if we can write the p.d.f. in the exponential form of equation 8.34,
\[
f(x;\theta) = \exp\left[A(\theta)\cdot\hat\theta(x) + B(\theta) + K(x)\right]
\]
then θ̂ is an efficient estimator. Such a p.d.f. clearly factorizes like equation 8.38 with
\[
g(\hat\theta;\theta) = \exp\left[A(\theta)\cdot\hat\theta(x) + B(\theta)\right]\ ,\qquad
k(x) = \exp\left[K(x)\right]
\]

Thus, if the range of x does not depend on θ, θ̂(x) is not only an efficient estimator
of θ, but also a sufficient statistic for θ. If the range of x depends on θ, the
situation is more complicated. The reader is referred to Kendall and Stuart11, 13 for
the conditions of sufficiency.

8.3 Substitution methods


Now that we know something about the properties of estimators, let us turn to
the problem of constructing, or choosing, an estimator. There are three general
methods of estimation, which we will examine in turn. We begin with substitution
methods.

8.3.1 Frequency substitution


This is the simplest method. It is useful when the parameter to be estimated is
a frequency or the function of a frequency. It consists of simply estimating the
population (parent) frequency by the experimentally observed (sample) frequency.

Such estimators are also known as plug-in estimators, since the data are simply “plugged into” the parameter definition.
For example, if the underlying p.d.f. is a binomial, $B(x;n,p) = \binom{n}{x}p^x(1-p)^{n-x}$, we would estimate p by p̂ = x/n. This is unbiased since E[x] = np. It is also efficient since B is a member of the exponential family of p.d.f.’s, as we saw in section 8.2.7. And we would estimate a function of p, g(p), by g(p̂) = g(x/n). This method works well for large samples, where the C.L.T. assures us that the difference between x and np is a small fraction of np.
Advantages of this method are simplicity and the fact that the estimator is
usually consistent. Disadvantages are that the estimator may be biased and that it
may not have minimum variance. However, if it is biased, we may be able to reduce
the bias, or at least estimate its size by a series expansion:
Suppose that θ̂ is an unbiased estimator of θ. We wish to estimate some function of θ, g(θ). Following the above prescription, we use ĝ = g(θ̂). Then, expanding ĝ about the true value of θ, θ_t, assuming that the necessary derivatives exist,
\[
\hat g = g(\hat\theta) = g(\theta_t)
+ \left.\frac{\partial g(\theta)}{\partial\theta}\right|_{\theta=\theta_t}(\hat\theta - \theta_t)
+ \frac{1}{2}\left.\frac{\partial^2 g(\theta)}{\partial\theta^2}\right|_{\theta=\theta_t}(\hat\theta - \theta_t)^2 + \ldots
\]
Now we take the expectation. Since θ̂ is assumed unbiased, this gives simply,
\[
E[\hat g] = g(\theta_t) + \frac{1}{2}\,E\left[(\hat\theta - \theta_t)^2\right]\left.\frac{\partial^2 g(\theta)}{\partial\theta^2}\right|_{\theta=\theta_t} + \ldots
\]
Not knowing the true value θ_t, we can not calculate $E[(\hat\theta-\theta_t)^2]$. But we can estimate it by $V[\hat\theta]$. In the same spirit, we evaluate the derivative at θ = θ̂ instead of at θ = θ_t. Thus, to lowest order, there is a bias of approximately $\frac{1}{2}V[\hat\theta]\left.\frac{\partial^2 g(\theta)}{\partial\theta^2}\right|_{\theta=\hat\theta}$.
.

In the case of more than one parameter, θ, this becomes



X ∂g 1 XX ∂ 2 g
ĝ = g(θ̂) = g(θt ) + (θ̂i − θti ) + (θ̂i − θti )(θ̂j − θtj ) + ...
i ∂θi θ=θ 2 i j ∂θi ∂θj θ=θ
t t

1 XX ∂ 2 g
E [ĝ] = g(θt ) + Vij (θ̂) + ...
2 i j ∂θi ∂θj θ=θ̂
from which we deduce that

1X ∂ 2 g
ĝ1 = ĝ − Vij (8.39)
2 i,j ∂θi ∂θj θ=θ̂
has reduced bias, provided that the correction term is not large or rapidly varying.
If that is not true, it is not obvious that going to higher order terms in the expansion
would help, since the problem may come from using θ̂ instead of the true value in
the expansion. In that case more detailed investigation is needed, perhaps employ-
ing Monte Carlo techniques to test the behavior of the estimators under different
assumptions for θ.
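A sketch of the one-parameter bias correction (not from the notes; it assumes numpy and an arbitrary binomial example): the plug-in estimator of g(p) = p(1 − p) is g(p̂) = p̂(1 − p̂), the correction term is −½ V[p̂] g″(p̂) with g″ = −2, and V[p̂] is estimated by p̂(1 − p̂)/n.

import numpy as np

rng = np.random.default_rng(5)
p_true, n, n_exp = 0.3, 20, 500_000

k = rng.binomial(n, p_true, size=n_exp)
p_hat       = k / n
g_plugin    = p_hat * (1 - p_hat)               # plug-in estimate of g(p)
V_p_hat     = p_hat * (1 - p_hat) / n           # estimated variance of p-hat
g_corrected = g_plugin - 0.5 * V_p_hat * (-2.0) # eq. 8.39, one parameter

print("true g(p)               :", p_true * (1 - p_true))
print("mean of plug-in g-hat   :", g_plugin.mean())     # biased low
print("mean of corrected g-hat :", g_corrected.mean())  # bias greatly reduced

Averaged over many repetitions, the corrected estimator sits much closer to the true g(p) than the raw plug-in value.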

8.3.2 Method of Moments


The method
This is another substitution method. To estimate a function q of the parameter θ, we write q(θ) as a function of the moments of the p.d.f.:
\[
q(\theta) = g(m_1, m_2, \ldots)
\]
where $m_j = E[x^j]$. This can, of course, only be done if all the necessary moments exist. We then estimate q(θ) by replacing all the parent (population) moments, $m_j$, in g by the corresponding sample (experimental) moments. Thus,
\[
\hat q = g(\hat m_1, \hat m_2, \ldots)\ ,\qquad \hat m_j = \overline{x^j} = \frac{1}{n}\sum_i x_i^j \tag{8.40}
\]
In this notation m₁ = µ, the parent mean, and $\hat m_1 = \bar x$, the sample mean.
For example, to estimate the parent variance, V[x], we write the variance in terms of the moments: V[x] = σ² = m₂ − m₁². We then estimate the moments by the corresponding sample moments:
\[
\widehat{\sigma^2} = \hat m_2 - \hat m_1^2 = \frac{1}{n}\sum x_i^2 - \bar x^2 = \frac{1}{n}\sum(x_i - \bar x)^2
\]
As we have previously seen (equation 8.6), this estimator, which we have called $s_x^2$ (equation 8.4), is biased. Thus the method of moments does not necessarily give unbiased estimators.
As a second example, take the Poisson p.d.f. For this p.d.f., the population mean and the population variance are equal, µ = V[x]. Therefore, we could estimate the mean and the variance either by
\[
\hat\theta = \hat m_1 = \bar x
\qquad\text{or by}\qquad
\hat\theta = \hat m_2 - \hat m_1^2 = \frac{1}{n}\sum(x_i - \bar x)^2
\]

Thus the method of moments does not necessarily provide a unique estimator.
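A small sketch of this non-uniqueness (not from the notes; it assumes numpy and an arbitrary Poisson parameter): both moment estimators aim at the same θ, but repeated experiments show they do not have the same precision.

import numpy as np

rng = np.random.default_rng(6)
theta_true, n, n_exp = 4.0, 100, 50_000

x = rng.poisson(theta_true, size=(n_exp, n))
theta_from_mean = x.mean(axis=1)        # m1-hat
theta_from_var  = x.var(axis=1)         # m2-hat - m1-hat^2 (divides by n)

for name, est in [("sample mean", theta_from_mean), ("sample variance", theta_from_var)]:
    print(f"{name:16s}: average {est.mean():.4f}   spread {est.std():.4f}")

For the Poisson case the sample mean is the more precise of the two.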

Variance of sample moments

Of course, a moment estimator, like any estimator, is rather useless unless we also estimate its uncertainty. It can be easily shown (exercise 35) that in general, assuming that the moments exist,
\[
V[\hat m_k] = V\left[\frac{1}{n}\sum x_i^k\right] = \frac{1}{n}\left(m_{2k} - m_k^2\right) \tag{8.41}
\]
\[
\mathrm{cov}[\hat m_j, \hat m_k] = \frac{1}{n}\left(m_{j+k} - m_j m_k\right) \tag{8.42}
\]
We can estimate these variances and covariances by replacing the moments by their estimators and 1/n by 1/(n − 1) to remove the bias.

By the C.L.T. the average tends to its expectation under the assumption that the variance is finite. Moment estimators, being averages, are therefore consistent. A word of caution is in order if it is necessary to use higher-order moments: they are very sensitive to the tails of the distribution, which is the part of the distribution usually most affected by experimental difficulties.

8.3.3 Descriptive statistics


Moments provide a simple way to describe the data without making any assumption
about the parent p.d.f. Since the amount of data in present-day experiments is
usually far too large to publish, it is necessary to reduce it to a reasonable volume,
but in such a way that it remains useful.
In some cases we have a theory which is in agreement with the data and it is
enough that the experimental data agree with the expectation. In other cases we
have no theory and the purpose of the experiment is to provide data which can point
the way to a theory. The experimental moments of a distribution up to a certain
(not too high) order provide a set of numbers with which some future theory can
easily be compared.

8.3.4 Generalized method of moments


Instead of the moments $m_i = E[x^i]$, which are moments of the functions $x^i$, we can use moments of some other set of functions, $u_j(x)$. These moments, $E[u_j]$, are given by
\[
E[u_j] = \int u_j(x)\, f(x;\theta)\,\mathrm{d}x
\]
Thus we have a number of equations for $E[u_j]$ in terms of θ. We solve them for the θ in terms of the $E[u_j]$ and substitute the sample moments, $\bar u_j$, for the expectations to obtain our estimate of θ. We will always need at least as many equations, and hence at least as many functions $u_j$, as there are parameters to be estimated.
We take as an example the angular distribution of the decay of a vector meson into two pseudo-scalar mesons. The angles θ and φ of the decay products in the rest system of the vector meson are distributed as
\[
f(\cos\theta,\phi) = \frac{3}{4\pi}\left[\frac{1}{2}(1-\rho_{00}) + \frac{1}{2}(3\rho_{00}-1)\cos^2\theta
- \rho_{1,-1}\sin^2\theta\cos 2\phi
- \sqrt{2}\,\mathrm{Re}\,\rho_{10}\sin 2\theta\cos\phi\right]
\]
where the ρ’s are parameters to be estimated. The data consist of measurements of the angles, θᵢ and φᵢ, for n decays. From inspection of the above expression for f, we choose three functions to estimate the three parameters. The choice is not unique, but an obvious choice is as follows. We then compute the expectation of each of the functions:

function                          expectation
$u_1 = \cos^2\theta$              $E[u_1] = \frac{1}{5}(1 + 2\rho_{00})$
$u_2 = \sin^2\theta\cos 2\phi$    $E[u_2] = -\frac{4}{5}\rho_{1,-1}$
$u_3 = \sin 2\theta\cos\phi$      $E[u_3] = -\frac{4}{5}\sqrt{2}\,\mathrm{Re}\,\rho_{10}$

Replacing E[u_j] by the sample mean $\bar u_j = \frac{1}{n}\sum u_j(\cos\theta_i,\phi_i)$ gives, e.g.,
\[
-\frac{4}{5}\sqrt{2}\,\mathrm{Re}\,\hat\rho_{10} = \bar u_3 = \frac{1}{n}\sum_{i=1}^{n}\sin 2\theta_i\cos\phi_i
\]
which we solve for Re ρ̂₁₀.
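A sketch of how these moment estimates would be computed in practice (not from the notes; it assumes numpy; the angle arrays below are hypothetical, isotropically generated stand-in data, so the resulting ρ values are only illustrative). The errors use the sample-moment variance of eq. 8.44, propagated through the linear relations implied by the expectations above.

import numpy as np

rng = np.random.default_rng(7)
theta = np.arccos(rng.uniform(-1, 1, 5000))   # stand-in decay angles
phi   = rng.uniform(-np.pi, np.pi, 5000)
n = len(theta)

u1 = np.cos(theta) ** 2
u2 = np.sin(theta) ** 2 * np.cos(2 * phi)
u3 = np.sin(2 * theta) * np.cos(phi)

rho00    = (5 * u1.mean() - 1) / 2            # from E[u1] = (1 + 2 rho00)/5
rho1m1   = -5 * u2.mean() / 4                 # from E[u2] = -(4/5) rho1,-1
re_rho10 = -5 * u3.mean() / (4 * np.sqrt(2))  # from E[u3] = -(4/5) sqrt(2) Re rho10

err = lambda u, c: abs(c) * np.sqrt(u.var(ddof=1) / n)   # eq. 8.44, linearly propagated
print("rho00    =", rho00,    "+-", err(u1, 5 / 2))
print("rho1,-1  =", rho1m1,   "+-", err(u2, 5 / 4))
print("Re rho10 =", re_rho10, "+-", err(u3, 5 / (4 * np.sqrt(2))))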


This method is most elegant when the functions uj form an orthonormal set.
Then ∞ Z
X
f (x) = ai ui (x) and u∗j (x)uk (x) dx = δjk
i=0

The expectations are then


Z
E [u∗k (x)] = u∗k (x)f (x) dx = ak

Thus the estimate of the coefficient of the k th term is just the sample mean of the
(complex conjugate of the) k th function,

âk = u∗k

This estimator is unbiased and, by the C.L.T., asymptotically normally distributed


about ak .

8.3.5 Variance of moments


The variance of the k-th sample moment, generalized or not, is
\[
V_{kk} \equiv V[\bar u_k] = \frac{1}{n^2}V\left[\sum_{i=1}^{n}u_k(x_i)\right] = \frac{1}{n}V[u_k(x)]
= \frac{1}{n}E\left[\left(u_k(x) - E[u_k(x)]\right)^2\right] \tag{8.43}
\]
which reduces to equation 8.41 for ordinary moments, $u_k(x) = x^k$. This is estimated by replacing the expectations by the sample means to give
\[
\hat V_{kk} = \frac{1}{n}\,\frac{1}{n-1}\sum_{i=1}^{n}\left(u_k(x_i) - \bar u_k(x)\right)^2 = \frac{1}{n-1}\left(\overline{u_k^2} - \bar u_k^2\right) \tag{8.44}
\]

where we have used $\frac{1}{n-1}$ instead of $\frac{1}{n}$ in order to have an unbiased estimate. The general element of the covariance matrix is estimated by
\[
\hat V_{jk}[\bar u] = \frac{1}{n}\,\frac{1}{n-1}\sum_{i=1}^{n}\left(u_j(x_i) - \bar u_j(x)\right)\left(u_k(x_i) - \bar u_k(x)\right) \tag{8.45}
\]
\[
\phantom{\hat V_{jk}[\bar u]} = \frac{1}{n-1}\left(\overline{u_j u_k} - \bar u_j\bar u_k\right) \tag{8.46}
\]

8.3.6 Transformation of the covariance matrix under a change of parameters
Frequently it is not one of the moments that we want to estimate, but rather
some function of the moments, e.g., ρ̂00 = (5ū1 − 1)/2. We now examine how the
covariance matrix for the ūk transforms under such a change of parameter. This
topic is usually known as propagation of errors. This is, of course, applicable to
functions of any estimator, not just to moments.
We want to estimate θ which we write as a function of q, θ(q). We first find
an estimate of q, q̂, and an estimate of its variance, Vb [q̂]. To avoid possible mis-
understanding, we denote the true (unknown) value of q by qt . The true value of
θ is then θ(qt ). Our estimate of q, q̂, being a r.v., is of course distributed about
qt according to some p.d.f. We wish to (approximately) evaluate the variance of θ̂
from the variance of q̂. We assume that q̂ is an unbiased estimator of q, which is
true, at least asymptotically (C.L.T.), if q̂ is a moment.
We expand θ̂ about the true value of q. Then
\[
\hat\theta = \theta(\hat q) = \theta(q_t) + \left.\frac{\partial\theta}{\partial q}\right|_{q=q_t}(\hat q - q_t) + \ldots
\]
and
\[
E[\hat\theta] = \theta(q_t) + \left.\frac{\partial\theta}{\partial q}\right|_{q=q_t}E[(\hat q - q_t)] + \ldots
\]
Since q̂ is unbiased, E[(q̂ − q_t)] = 0. Thus, to first order, E[θ̂] = θ(q_t). Subtracting the second equation from the first gives, to first order,
\[
\hat\theta - E[\hat\theta] = \left.\frac{\partial\theta}{\partial q}\right|_{q=q_t}(\hat q - q_t)
\]
Hence,
\[
V[\hat\theta] \equiv E\left[\left(\hat\theta - E[\hat\theta]\right)^2\right]
= \left(\left.\frac{\partial\theta}{\partial q}\right|_{q=q_t}\right)^2 E\left[(\hat q - q_t)^2\right]
= \left(\left.\frac{\partial\theta}{\partial q}\right|_{q=q_t}\right)^2 V[\hat q] \tag{8.47}
\]

This can be estimated by substituting q̂ for q_t and our estimate $\hat V[\hat q]$ for V[q̂]:
\[
\hat V[\hat\theta] = \left(\left.\frac{\partial\theta}{\partial q}\right|_{q=\hat q}\right)^2 \hat V[\hat q] \tag{8.48}
\]
This technique works well only when second and higher order terms are small and when q̂ is unbiased.
We give a simple example, a function linear in q: θ(q) = A + Bq, so that ∂θ/∂q = B and
\[
V[\hat\theta] = B^2\, V[\hat q] \tag{8.49}
\]
The result is then, in fact, exact since the second and higher order derivatives are zero.
The general case is similar to our treatment of change of variables (section 2.2.6). Indeed, it is in principle better to transform the p.d.f. to a new p.d.f. in terms of the parameter we want to estimate, e.g., f(x; q) → g(x; θ). In particular it is nice if we can transform to a p.d.f. having θ as its mean (or other low order moment), since sample moments are unbiased estimators. However, in practice such a transformation may be difficult and it may be easier to estimate q than to estimate θ directly.
Consider now the p.d.f.’s for the estimators q̂ and θ̂. If the transformation θ = θ(q) is non-linear, the shape of the p.d.f. g(θ̂) is changed from that of f(q̂) by the Jacobian (|∂q/∂θ| in one dimension), as illustrated in the figure. In regions where dθ < dq, the probability piles up faster for θ than for q. Thus in the example the peak in g(θ̂) occurs below θ₁ = θ(E[q̂]).
In particular, if f(q̂) is normal, g(θ̂) is not normal, except for a linear transformation. This is a source of bias, which in the figure manifests itself as a long tail for g(θ̂) resulting in E[θ̂] > θ₁.
[Figure: the curve θ = θ(q) mapping the p.d.f. f(q̂) onto the p.d.f. g(θ̂); the non-linearity distorts g(θ̂), giving it a long tail and shifting its peak below θ₁.]
Now let us treat the multidimensional case, where q is of dimension n and θ is
of dimension m. Note that m ≤ n; otherwise not all θi will be independent and
there will be no unique solution. An example would be a p.d.f. for (x, y) for which
we want only to estimate some parameter of the (marginal) distribution for r. In
this case, n = 2 and m = 1.
We can then expand each θ̂ᵢ about its true value in the same manner as for the one-dimensional case, except that we now must introduce a sum over all parameters:
\[
\hat\theta_i \equiv \theta_i(\hat q) = \theta_i(q_t) + \sum_{k=1}^{n}\left.\frac{\partial\theta_i}{\partial q_k}\right|_{q=q_t}(\hat q_k - q_{t\,k}) + \ldots
\]

Assuming that q̂ᵢ is unbiased, its expectation is equal to the true value so that to first order,
\[
\left(\hat\theta_i - E[\hat\theta_i]\right)\left(\hat\theta_j - E[\hat\theta_j]\right)
= \sum_{k=1}^{n}\sum_{l=1}^{n}\left.\frac{\partial\theta_i}{\partial q_k}\right|_{q=q_t}\left.\frac{\partial\theta_j}{\partial q_l}\right|_{q=q_t}(\hat q_k - q_{t\,k})(\hat q_l - q_{t\,l})
\]
Taking expectations, and writing in matrix notation, we arrive at the generalization of equation 8.47:
\[
V[\hat\theta] = D^{\mathrm T}(\theta)\, V[\hat q]\, D(\theta) \tag{8.50}
\]
where
\[
D(\theta) = \begin{pmatrix}
\frac{\partial\theta_1}{\partial q_1} & \frac{\partial\theta_2}{\partial q_1} & \cdots & \frac{\partial\theta_m}{\partial q_1} \\
\frac{\partial\theta_1}{\partial q_2} & \frac{\partial\theta_2}{\partial q_2} & \cdots & \frac{\partial\theta_m}{\partial q_2} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial\theta_1}{\partial q_n} & \frac{\partial\theta_2}{\partial q_n} & \cdots & \frac{\partial\theta_m}{\partial q_n}
\end{pmatrix}_{q=q_t} \tag{8.51}
\]
As in the one-dimensional case we estimate this variance by replacing true values by their estimates to arrive at the generalization of equation 8.48:
\[
\hat V[\hat\theta] = \hat D^{\mathrm T}(\theta)\, \hat V[\hat q]\, \hat D(\theta) \tag{8.52}
\]
where $\hat D(\theta)$ is the same matrix of derivatives as in equation 8.51, but evaluated at $q = \hat q$:
\[
\hat D(\theta) = \begin{pmatrix}
\frac{\partial\theta_1}{\partial q_1} & \frac{\partial\theta_2}{\partial q_1} & \cdots & \frac{\partial\theta_m}{\partial q_1} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial\theta_1}{\partial q_n} & \frac{\partial\theta_2}{\partial q_n} & \cdots & \frac{\partial\theta_m}{\partial q_n}
\end{pmatrix}_{q=\hat q} \tag{8.53}
\]
Warning: D is not symmetric.
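A minimal sketch of equation 8.52 (not from the notes; it assumes numpy and uses hypothetical estimates q̂ = (x, y) with a hypothetical covariance matrix): the covariance matrix of θ̂ = (r, φ), with r = √(x² + y²) and φ = arctan(y/x), is obtained as D̂ᵀ V̂[q̂] D̂.

import numpy as np

x, y = 3.0, 4.0                          # hypothetical estimates q-hat
V_q  = np.array([[0.04, 0.01],           # hypothetical covariance of q-hat
                 [0.01, 0.09]])

r = np.hypot(x, y)
D = np.array([[x / r, -y / r**2],        # row: derivatives w.r.t. x (columns: r, phi)
              [y / r,  x / r**2]])       # row: derivatives w.r.t. y

V_theta = D.T @ V_q @ D                  # eq. 8.52
print("r   =", r,                  "+-", np.sqrt(V_theta[0, 0]))
print("phi =", np.arctan2(y, x),   "+-", np.sqrt(V_theta[1, 1]))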

8.4 Maximum Likelihood method


This method of parameter estimation is very general. It is often the simplest method
to use, particularly in complex cases, and maximum likelihood estimators have
certain desirable properties.

8.4.1 Principle of Maximum Likelihood


We have already met the likelihood function in section 8.2.5. We repeat its defi-
nition here: The likelihood function is the joint p.d.f. for n measurements x given
parameters θ:
L(x; θ) = L(x1 , x2 , . . . , xn ; θ) (8.54)

If the $x_i$ are independent, this is just the product of the p.d.f.’s for the individual $x_i$:
\[
L(x;\theta) = \prod_{i=1}^{n} f_i(x_i;\theta) \tag{8.55}
\]

where we have included a subscript i on f since it is not necessary that all the xi
have the same p.d.f.
In probability theory this p.d.f. expresses the probability that an experiment
identical to ours would result in the n observations x which we observed. In prob-
ability theory we know θ and the functions fi , and we calculate the probability of
certain results. In statistics this is turned around. We have done the experiment;
so we know a set of results, x. We (think we) know the p.d.f.’s, fi (x, θ). We want
to estimate θ.
We emphasize that L is not a p.d.f. for θ; if it were we would use the expectation
value of θ for θ̂. Instead we take eq. 8.54, replace θ by θ̂ and solve for θ̂ under the
condition that L is a maximum. In other words, our estimate, θ̂, of θ is that value of
θ which would make our experimental results the most likely of all possible results.
This is the Principle of Maximum Likelihood: The best estimate of a pa-
rameter θ is that value which maximizes the likelihood function. This can not be
proved without defining ‘best’. It can be shown that maximum likelihood (ml)
estimators have desirable properties. However, they are often biased. Whether the
ml estimator really is the ‘best’ estimator depends on the situation.
It is usually more convenient to work with
\[
\ell = \ln L \tag{8.56}
\]
since the product in eq. 8.55 becomes a sum in eq. 8.56. For independent $x_i$ this is
\[
\ell = \sum_{i=1}^{n}\ell_i\ ,\qquad \text{where } \ell_i = \ln f_i(x_i;\theta) \tag{8.57}
\]
Since L > 0, both L and ℓ have the same extrema, which are found from
\[
S_i \equiv \frac{\partial\ell}{\partial\theta_i} = \frac{1}{L}\frac{\partial L}{\partial\theta_i} = 0 \tag{8.58}
\]

where $S_i$ is the score function (section 8.2.5).

The maximum likelihood condition (8.58) finds an extremum which may be a minimum; so it is important to check. There may also be more than one maximum, in which case one usually takes the highest maximum. The maximum may also be at a physical boundary, in which case eq. (8.58) may not find it. Usually such problems do not occur for sufficiently large samples. However, this is not always the case.
[Figure: ℓ(θ) showing a local maximum and the largest maximum within the physical range of θ.]
Note that for the purpose of finding the maximum of L, it is not necessary
that L be normalized. Any factors not depending on θ can be thrown away. This
includes factors which depend on x but not on θ.

Example: n independent $x_i$, each distributed normally.
\[
L = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma_i}\exp\left[-\frac{1}{2}\left(\frac{x_i-\mu_i}{\sigma_i}\right)^2\right]
\]
\[
\ell = \sum_{i=1}^{n}\left[-\frac{1}{2}\ln(2\pi) - \ln\sigma_i - \frac{(x_i-\mu_i)^2}{2\sigma_i^2}\right]
\]

Suppose that all the µᵢ are the same, µᵢ = µ, but that the σᵢ are different, but known. This is the case if we make n measurements of the same quantity, each with a different precision, e.g., using different apparatus. The maximum likelihood condition (8.58) is then
\[
\frac{\partial\ell}{\partial\mu} = \sum\frac{x_i-\mu}{\sigma_i^2} = \sum\frac{x_i}{\sigma_i^2} - \sum\frac{\mu}{\sigma_i^2} = 0
\]
The solution of this equation is the ml estimate of µ:
\[
\hat\mu = \frac{\sum(x_i/\sigma_i^2)}{\sum(1/\sigma_i^2)} \tag{8.59}
\]
which is a weighted average, each $x_i$ weighted by $1/\sigma_i^2$.


The expectation of µ̂ is
\[
E[\hat\mu] = E\left[\frac{\sum(x_i/\sigma_i^2)}{\sum(1/\sigma_i^2)}\right]
= \frac{\sum\left(E[x_i]/\sigma_i^2\right)}{\sum(1/\sigma_i^2)}
= \frac{\sum(\mu/\sigma_i^2)}{\sum(1/\sigma_i^2)}
= \frac{\mu\sum(1/\sigma_i^2)}{\sum(1/\sigma_i^2)} = \mu
\]
from which we conclude that this estimate is unbiased. The variance of µ̂ is
\[
V[\hat\mu] = E[\hat\mu^2] - \left(E[\hat\mu]\right)^2 = E[\hat\mu^2] - \mu^2
= \frac{E\left[\left(\sum\frac{x_i}{\sigma_i^2}\right)^2\right]}{\left(\sum\frac{1}{\sigma_i^2}\right)^2} - \mu^2
= \frac{\sum_i\sum_j E\left[\frac{x_i x_j}{\sigma_i^2\sigma_j^2}\right]}{\left(\sum\frac{1}{\sigma_i^2}\right)^2} - \mu^2
\]
Since the $x_i$ are independent,
\[
E[x_i x_j] = \begin{cases}
E[x_i]\,E[x_j] = \mu_i\mu_j = \mu^2 & \text{if } i\ne j \\
E[x_i^2] = \sigma_i^2 + \mu^2 & \text{if } i = j
\end{cases}
\]

Therefore, having written the expectation of sums as the sum of expectations and having split the double sum into two parts,
\[
\begin{aligned}
V[\hat\mu] &= \left(\frac{1}{\sum(1/\sigma_i^2)}\right)^2\left[\sum_i\frac{\sigma_i^2+\mu^2}{\sigma_i^4} + \sum_i\sum_{j\ne i}\frac{\mu^2}{\sigma_i^2\sigma_j^2}\right] - \mu^2 \\
&= \left(\frac{1}{\sum(1/\sigma_i^2)}\right)^2\left[\sum_i\frac{1}{\sigma_i^2} + \mu^2\sum_i\frac{1}{\sigma_i^4} + \mu^2\sum_i\sum_{j\ne i}\frac{1}{\sigma_i^2\sigma_j^2}\right] - \mu^2 \\
&= \frac{1}{\sum(1/\sigma_i^2)} + \mu^2\underbrace{\frac{\sum_i\frac{1}{\sigma_i^4} + \sum_i\sum_{j\ne i}\frac{1}{\sigma_i^2\sigma_j^2}}{\left(\sum(1/\sigma_i^2)\right)^2}}_{=1} - \mu^2 \\
&= \frac{1}{\sum(1/\sigma_i^2)}
\end{aligned} \tag{8.60}
\]

It is curious that in this example V [µ̂] does not depend on the xi , but only on the
σi . This is not true in general.
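A sketch of the weighted-average estimator, eqs. 8.59 and 8.60 (not from the notes; it assumes numpy; the measurement values and resolutions below are hypothetical):

import numpy as np

x     = np.array([10.1, 9.8, 10.4, 10.0])   # hypothetical measurements of one quantity
sigma = np.array([0.1, 0.2, 0.4, 0.1])      # their known resolutions

w      = 1.0 / sigma**2
mu_hat = np.sum(w * x) / np.sum(w)          # eq. 8.59
V_mu   = 1.0 / np.sum(w)                    # eq. 8.60

print(f"mu-hat = {mu_hat:.4f} +- {np.sqrt(V_mu):.4f}")

As noted above, the quoted error depends only on the stated resolutions, not on the measured values themselves.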
We have seen (section 8.2.6) that the Rao-Cramér inequality sets a lower limit on the variance of an estimator. For an unbiased estimator the bound is 1/I, where I is the information. For µ,
\[
I(\mu) = -E\left[\frac{\partial S(\mu)}{\partial\mu}\right] = -E\left[\frac{\partial^2\ell}{\partial\mu^2}\right]
= -E\left[\frac{\partial}{\partial\mu}\left(\sum\frac{x_i}{\sigma_i^2} - \sum\frac{\mu}{\sigma_i^2}\right)\right]
= -E\left[-\sum\frac{1}{\sigma_i^2}\right] = \sum\frac{1}{\sigma_i^2}
\]
Thus V[µ̂] = I⁻¹(µ); the variance of µ̂ is the smallest possible. The ml estimator is efficient. This is in fact a general property of ml estimators: The ml estimator is efficient if an efficient estimator exists. We will now demonstrate this.

Properties of maximum likelihood estimators


We have seen in section 8.2.7 that an efficient, unbiased estimator is linearly related
to the score function. Assume that such an estimator of θ exists; call it T (x). Then

S(x, θ) = C(θ)T (x) + D(θ) (8.61)

From the maximum likelihood condition, S(x, θ̂) = 0, where θ̂ is the ml estimator
of θ. Hence the unbiased, efficient estimator T (x) is related to the ml estimator θ̂
by
\[
T(x) = -\frac{D(\hat\theta)}{C(\hat\theta)} \tag{8.62}
\]

We have also seen in section 8.2.6, equation 8.21, that E[S(x, θ)] = 0 under quite general conditions on f. Therefore, taking the expectation of equation 8.61,
\[
E[S(x,\theta)] = C(\theta)\,E[T(x)] + D(\theta) = 0
\]
Hence,
\[
E[T(x)] = -\frac{D(\theta)}{C(\theta)} \tag{8.63}
\]
This is true for any value of θ; in particular it is true for θ = θ̂, i.e., if the true value of θ is equal to the ml estimate of θ:
\[
E\left[T(x)\mid\hat\theta\right] = -\frac{D(\hat\theta)}{C(\hat\theta)} = T(x) \tag{8.64}
\]
It may seem strange to write $E[T(x)\mid\hat\theta]$ since T(x) does not depend on the value of θ. However, the expectation operator does depend on the value of θ. In fact, since T(x) is an unbiased estimator of θ,
\[
E[T(x)] = \int T(x)\, f(x,\theta)\,\mathrm{d}x = \theta \tag{8.65}
\]
Hence,
\[
E\left[T(x)\mid\hat\theta\right] = \hat\theta
\]
Combining this with equation 8.64 gives
\[
T(x) = \hat\theta \tag{8.66}
\]

Thus we have demonstrated that the ml estimator is efficient and unbiased if an


efficient, unbiased estimator exists.
If an unbiased, efficient estimator exists, we can derive the following properties:

1. From equations 8.63 and 8.65,
\[
D(\theta) = -\theta\, C(\theta)
\]
Substituting this and equation 8.66 in equation 8.61 yields
\[
S(x,\theta) = C(\theta)\left[\hat\theta - \theta\right] \tag{8.67}
\]

2. Assuming that the estimator is efficient means that the Rao-Cramér inequality, equation 8.26, becomes an equality. Collecting equations 8.19, 8.23, and 8.26, results in the variance of an unbiased, efficient estimator θ̂ given by
\[
V[\hat\theta] = \frac{1}{I(\theta)} = \frac{1}{E[S^2]} = -\frac{1}{E\left[\frac{\partial S}{\partial\theta}\right]} = -\frac{1}{E\left[\frac{\partial^2\ell}{\partial\theta^2}\right]}
\]

From (8.67),
\[
\frac{\partial S}{\partial\theta} = C'(\theta)\left[\hat\theta - \theta\right] - C(\theta) \tag{8.68}
\]
Since θ̂ is unbiased, E[θ̂] = θ_t, the true value of the parameter. Hence,
\[
E\left[\frac{\partial S}{\partial\theta}\right] = -C(\theta_t)
\qquad\text{and}\qquad
V[\hat\theta] = \frac{1}{C(\theta_t)} \tag{8.69}
\]
Hence, C(θ_t) > 0.

3. From equation 8.68, we also see that
\[
\left.\frac{\partial^2\ell}{\partial\theta^2}\right|_{\theta=\hat\theta} = \left.\frac{\partial S}{\partial\theta}\right|_{\theta=\hat\theta} = -C(\hat\theta)
\]
Since C(θ) > 0 in the region of the true value, this confirms that the extremum of ℓ, which we have used to determine θ̂, is in fact a maximum.

4. From equation 8.67 and the maximum likelihood condition (equation 8.58), we see that the ml estimator is the solution of
\[
0 = S(x,\theta) = C(\theta)\left[\hat\theta - \theta\right]
\]
Since C(θ) > 0 in the region of the true value, this equation can have only one solution, namely θ̂. Hence, the maximum likelihood estimator θ̂ is unique.
Let us return to the Gaussian example. But now assume not only that all µᵢ = µ but also all σᵢ = σ. Unlike the previous example, we now assume that σ is unknown. The likelihood condition gives
\[
\left.\frac{\partial\ell}{\partial\mu}\right|_{\hat\mu,\hat\sigma} = \sum\frac{x_i-\hat\mu}{\hat\sigma^2} = 0
\qquad\text{and}\qquad
\left.\frac{\partial\ell}{\partial\sigma}\right|_{\hat\mu,\hat\sigma} = \sum\left(-\frac{1}{\hat\sigma} + \frac{(x_i-\hat\mu)^2}{\hat\sigma^3}\right) = 0
\]
The first equation gives
\[
\hat\mu = \frac{1}{n}\sum x_i = \bar x
\]
Using this in the second equation gives
\[
\hat\sigma^2 = \frac{1}{n}\sum(x_i-\bar x)^2
\]
which, as we have previously seen (eq. 8.6), is a biased estimator of σ². This illustrates an important, though often forgotten, feature of ml estimators: They are often biased.
To summarize this section: The ml estimator is efficient and unbiased if such
an estimator exists. Unfortunately, that is not always the case.
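A sketch of the ml method done numerically (not from the notes; it assumes numpy and scipy are available, uses hypothetical Gaussian data, and minimizes −ln L with a general-purpose minimizer rather than solving the likelihood conditions analytically): the numerical maximum reproduces x̄ and the biased (1/n) sample variance derived above.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(10)
x = rng.normal(5.0, 2.0, size=40)            # hypothetical data
n = len(x)

def neg_log_L(par):
    mu, sigma = par
    if sigma <= 0:                           # keep the minimizer in the allowed region
        return np.inf
    return n * np.log(sigma) + np.sum((x - mu) ** 2) / (2 * sigma**2)

res = minimize(neg_log_L, x0=[x.mean(), x.std()], method="Nelder-Mead")
mu_ml, sigma_ml = res.x

print("mu   (ml, analytic) :", mu_ml, x.mean())
print("sigma^2 (ml)        :", sigma_ml**2)
print("biased 1/n variance :", x.var(ddof=0))   # equals the ml result
print("unbiased s^2        :", x.var(ddof=1))

Note that constant terms in ℓ (here −(n/2) ln 2π) were dropped, since they do not affect the position of the maximum.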

8.4.2 Asymptotic properties


Although, as we have seen in the previous section, the maximum likelihood estimator is efficient and unbiased if an efficient, unbiased estimator exists, in general the ml estimator is neither unbiased nor efficient. However, asymptotically, i.e., for a large number of independent measurements, it (usually) is both unbiased and efficient. To see this we expand the score about θ̂:
\[
S(x,\theta) = \frac{\partial}{\partial\theta}\sum\ln f(x_i,\theta)
\approx S(x,\hat\theta) + \left.\frac{\partial S}{\partial\theta}\right|_{\hat\theta}\left(\theta - \hat\theta\right) + \ldots
\]
By the maximum likelihood principle, S(x, θ̂) = 0. We assume that as n → ∞ higher order terms can be neglected. We are then left with
\[
S(x,\theta) \approx \left.\frac{\partial S}{\partial\theta}\right|_{\hat\theta}\left(\theta - \hat\theta\right)
= \left.\frac{\partial}{\partial\theta}\sum S_1(x_i,\theta)\right|_{\hat\theta}\left(\theta - \hat\theta\right)
= \sum\left.\frac{\partial S_1(x_i,\theta)}{\partial\theta}\right|_{\hat\theta}\left(\theta - \hat\theta\right)
\]
Replacing the sum by n times the sample mean,
\[
S(x,\theta) \approx n\,\overline{\left.\frac{\partial S_1}{\partial\theta}\right|_{\hat\theta}}\left(\theta - \hat\theta\right)
= n\,\overline{\left.\frac{\partial^2}{\partial\theta^2}\ln f(x_i,\theta)\right|_{\hat\theta}}\left(\theta - \hat\theta\right)
\]

Since the sample mean approaches the expectation as n → ∞ provided only that the variance is finite (C.L.T.), asymptotically
\[
\begin{aligned}
S(x,\theta) &\approx n\,E\left[\left.\frac{\partial S_1}{\partial\theta}\right|_{\hat\theta}\right]\left(\theta - \hat\theta\right)
= n\,E\left[\left.\frac{\partial^2}{\partial\theta^2}\ln f(x_i,\theta)\right|_{\hat\theta}\right]\left(\theta - \hat\theta\right) \\
&= E\left[\left.\frac{\partial}{\partial\theta}\sum S_1\right|_{\hat\theta}\right]\left(\theta - \hat\theta\right)
= E\left[\left.\frac{\partial^2}{\partial\theta^2}\sum\ln f(x_i,\theta)\right|_{\hat\theta}\right]\left(\theta - \hat\theta\right) \\
&= E\left[\left.\frac{\partial S}{\partial\theta}\right|_{\hat\theta}\right]\left(\theta - \hat\theta\right)
= E\left[\left.\frac{\partial^2\ell}{\partial\theta^2}\right|_{\hat\theta}\right]\left(\theta - \hat\theta\right) \\
&= -I(\hat\theta)\left(\theta - \hat\theta\right)
\end{aligned} \tag{8.70}
\]
the last step following from equation 8.23.
There are several consequences of equation 8.70:
• First we note that asymptotically, I(θ) = I(θ̂):
\[
I(\theta) = -E\left[\frac{\partial S}{\partial\theta}\right] = E\left[I(\hat\theta)\right] = I(\hat\theta)
\]
where the second step follows from equation 8.70 and the last step follows since I(θ̂) is itself an expectation and the expectation of an expectation is just the expectation itself.

• The result, equation 8.70, that θ̂ is linearly related to the score function, implies (section 8.2.7) that θ̂ is unbiased and efficient. This is an important asymptotic property of ml estimators.

• Further, we can integrate equation 8.70,
\[
\frac{\partial}{\partial\theta}\ln L = S(x,\theta) \approx -I(\hat\theta)(\theta - \hat\theta)
\]
over θ to find
\[
\ell = \ln L \approx -\frac{I(\hat\theta)}{2}\left(\hat\theta - \theta\right)^2 + \ln k \tag{8.71}
\]
where the integration constant, k, is just k = L(θ̂) = L_max. Exponentiating,
\[
L(\theta) \approx L_{\max}\exp\left[-\frac{1}{2}I(\hat\theta)(\hat\theta - \theta)^2\right] \propto N\!\left(\theta;\hat\theta, I^{-1}(\hat\theta)\right) \tag{8.72}
\]
Thus, asymptotically, L is proportional to a Gaussian function of θ with mean θ̂ and variance 1/I(θ̂).

Instead of starting with equation 8.70, we could use equation 8.67, which expresses the linear dependence of S on θ̂ for any efficient, unbiased estimator. Integrating equation 8.67 leads to
\[
L(\theta) = L_{\max}\exp\left[-\frac{1}{2}C(\theta)(\hat\theta - \theta)^2\right]
\]
which looks formally similar to equation 8.72 but is not, in fact, a Gaussian function since C depends on θ. Only asymptotically must C(θ) approach a constant, C(θ) → I(θ̂). Nevertheless, C(θ) may be constant for finite n, as we have seen in the example of using x̄ to estimate µ of a Gaussian (cf. section 8.2.7).
We emphasize again that, despite the form of equation 8.72, L is not a p.d.f.
for θ. It is an experimentally observed function. Nevertheless, the principle of
maximum likelihood tells us to take the maximum of L to determine θ̂, i.e., to
take θ̂ equal to the mode of L. In this approximation the mode of L is equal to
the mean, which is just θ̂. In other words the ml estimate is the same as what we
would find if we were to regard L as a p.d.f. for θ and use the expectation (mean)
of L to estimate θ.
Since asymptotically the ml estimator is unbiased and efficient, the Rao-Cramér bound is attained and V[θ̂] = I⁻¹(θ). Thus the variance is also that which we would have found treating L as a p.d.f. for θ.
We have shown that the ml estimator is, under suitable conditions, asymptot-
ically efficient and unbiased. Let us now specify these conditions (without proof)
more precisely:

1. The true value of θ must not be at the boundary of its allowed interval such that the maximum likelihood condition would not be satisfied, i.e., ∂L/∂θ must be zero at the maximum. [Figure: L(θ) still rising at the boundary θ_max of the allowed interval, where ∂L/∂θ ≠ 0.]

2. The p.d.f.’s defined by different values of θ must be distinct, i.e., two values of θ must not give p.d.f.’s whose ratio is not a function of θ. Otherwise there would be no way to decide between them.

3. The first three derivatives of ℓ = ln L must exist in the neighborhood of θ̂.

4. The information, I(θ) must be finite and positive definite.

8.4.3 Change of parameters


It is important to understand the difference between a change of parameters and a
change of variable. L(x; θ) is a p.d.f. for the random variable x. Under a change
of variable, x −→ y(x) and L(x; θ) −→ L′ (y; θ), the probability must be conserved.
Hence, L(x; θ) dx = L′ (y; θ) dy. This requirement results (cf. section 2.2.6) in

L′ (y; θ) = L(w(y); θ) |J|

where w is the inverse of the transformation x −→ y and J is the Jacobian of the


transformation.
However, for a change of parameters, θ −→ g(θ), the requirement that prob-
ability be conserved means that L(x; θ) dx = L′ (x; g) dx and consequently that
L(x; θ) = L′ (x; g). Thus the value of L is unchanged by the transformation from
θ to g(θ) and L′ is obtained from L simply be replacing θ by h(g) where h is the
inverse of the transformation θ → g. There is no Jacobian involved.
As in frequency substitution, the ml estimator of a function, g, of the parameter θ is just that function of the ml estimator, i.e.,
\[
\hat g(\theta) = g(\hat\theta)
\]
This occurs because, assuming $\frac{\partial\theta}{\partial g}$ exists,
\[
\frac{\partial L}{\partial g} = \frac{\partial L}{\partial\theta}\,\frac{\partial\theta}{\partial g}
\]
Then the maximum likelihood condition for θ, $\frac{\partial L}{\partial\theta} = 0$, implies that $\frac{\partial L}{\partial g} = 0$, which is just the maximum likelihood condition for g.

If ∂θ/∂g is zero at some value of θ, this can introduce additional solutions to the likelihood condition for g. This will not usually happen if g is a single-valued function of θ unless there are points of inflection. [Figure: g as a function of θ, with a point where ∂θ/∂g = 0.]
Note that θ̂ unbiased does not imply that ĝ = g(θ̂) is unbiased and vice versa. Asymptotically, both θ̂ and ĝ become unbiased and efficient (previous section), but they usually approach this at different rates.
In the case of more than one parameter, g(θ), the above generalizes to
∂L/∂gk = Σ_i (∂L/∂θi)(∂θi/∂gk) = (∂L/∂θ)ᵀ (∂θ/∂gk)   (8.73)

and the information matrix transforms as


Ijk(g) = (∂gj/∂θ)ᵀ I(θ) (∂gk/∂θ)   (8.74)

It is not necessary that θ and g have the same dimensions.
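
To make the invariance ĝ = g(θ̂) concrete, here is a minimal numerical sketch (our own illustration, not from the text; it assumes Python with NumPy and SciPy). It fits the mean τ of an exponential p.d.f. and, separately, the rate g = 1/τ, and checks that the two maximum likelihood estimates are related by ĝ = g(τ̂).

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(1)
    x = rng.exponential(scale=2.0, size=500)        # sample with true tau = 2 (invented data)

    def nll_tau(tau):                               # -ln L for f(x;tau) = exp(-x/tau)/tau
        return len(x) * np.log(tau) + x.sum() / tau

    def nll_rate(lam):                              # the same likelihood written in terms of g = 1/tau
        return -len(x) * np.log(lam) + lam * x.sum()

    tau_hat = minimize_scalar(nll_tau, bounds=(0.1, 10.0), method='bounded').x
    lam_hat = minimize_scalar(nll_rate, bounds=(0.01, 10.0), method='bounded').x
    print(tau_hat, 1.0 / lam_hat)                   # agree: the ml estimate transforms as g(theta_hat)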

8.4.4 Maximum Likelihood vs. Bayesian inference


Recall Bayes’ theorem (section 2.3). Assume that the parameter θ, which we wish
to estimate, can have only discrete values, θ1 , θ2 , . . . , θk . Applied to the estimation
of θ, Bayes’ theorem can be stated (cf. section 2.4.3)

Pposterior(θi | x) = [ P(x | θi) / P(x) ] Pprior(θi)   (8.75)

and it would seem reasonable to choose as our estimate θ̂ that value θi having
the largest Pposterior, i.e., the mode of the posterior probability.∗ Since Pposterior is
normalized, i.e., Σ_i Pposterior(θi|x) = 1, we see that P(x) = Σ_i P(x|θi) Pprior(θi) is
the constant which serves to normalize Pposterior. We also see that P(x|θi) is just
the likelihood, L(x; θi), apart from normalization.
In the absence of prior knowledge (belief) of θ, Bayes’ postulate tells us to assume
all values equally likely, i.e., Pprior(θi) = 1/k. Then the right-hand side of equation 8.75
is exactly L(x; θi) (apart from normalization) and maximizing Pposterior is the same
as maximizing L. Thus, Bayesian statistics leads to the same estimator as maximum
likelihood.

∗ The mode is not the only choice. A Bayesian could also choose the mean or the median, or
some other property of the posterior probability distribution. Asymptotically, of course, Pposterior
will be Gaussian, in which case the mode, mean, and median are the same.

In the more usual case of a continuous parameter, equation 8.75 must be rewrit-
ten in terms of probability densities:

fposterior(θ | x) = f(x | θ) fprior(θ) / ∫ f(x | θ) fprior(θ) dθ   (8.76)

Assuming Bayes’ postulate, fprior = constant, and again Bayesian statistics is equiv-
alent to maximum likelihood.
But now what happens if we want to estimate the parameter g = g(θ) rather
than θ? Assume that the transformation g(θ) is one-to-one. Then in the discrete
case we just replace θi by gi = g(θi) in equation 8.75. Bayes’ postulate again tells
us that Pprior = 1/k and the same maximum is found, resulting in ĝ = g(θ̂). However
in the continuous case, the change of parameter (cf. sections 2.2.6, 8.4.3) involves
a Jacobian, since in Bayesian statistics f is a p.d.f. for θ, or in other words, the ml
parameter is regarded as the variable of the p.d.f. Hence,

fposterior (g | x) = fposterior (θ | x) |J|

where J is the Jacobian of the transformation θ → g. But since the likelihood


function is a p.d.f. for x, not for θ, there is no Jacobian involved in rewriting L
using g instead of θ, i.e., L(x; g(θ)) = L(x; θ). Thus, assuming Bayes’ postulate
for g, fprior (g) = constant, the value of g which maximizes fposterior (g | x) is that
which maximizes L(x; θ)|J| rather than L(x; θ). Bayesian statistics and maximum
likelihood thus give different estimates of g. To obtain the same result in ml, the
Bayesian would have to use fprior (g) = fprior (θ)|J| rather than the uniform fprior (g)
suggested by Bayes’ postulate. In other words, Bayes’ postulate can only be applied
to θ or g, not simultaneously to both (except when θ and g are linearly related).
But how does one choose which?∗ This is one of the grounds which would cause
most physicists to prefer maximum likelihood to Bayesian parameter estimation.
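
The following sketch (our own, assuming Python with NumPy) illustrates the point numerically for an exponential decay sample summarized by n events with total observed time S: the maximum of L is at τ = S/n however the parameter is chosen, while a posterior built from a prior uniform in λ = 1/τ acquires a Jacobian when expressed in τ and peaks at a different value, roughly S/(n+2).

    import numpy as np

    n, S = 20, 41.7                     # number of decays and their summed times (invented numbers)

    def logL_tau(tau):                  # ln L in terms of tau; no Jacobian, since L is a p.d.f. for x
        return -n * np.log(tau) - S / tau

    def log_post_from_flat_lambda(tau):             # posterior from a prior flat in lambda = 1/tau,
        return logL_tau(tau) - 2.0 * np.log(tau)    # transformed to tau: |d lambda / d tau| = 1/tau^2

    tau = np.linspace(0.5, 5.0, 20001)
    print("ml estimate (any parametrization)   :", S / n)
    print("posterior mode, prior flat in tau   :", tau[np.argmax(logL_tau(tau))])                 # ~ S/n
    print("posterior mode, prior flat in lambda:", tau[np.argmax(log_post_from_flat_lambda(tau))]) # ~ S/(n+2)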

8.4.5 Variance of maximum likelihood estimators


We have seen that the variance of an efficient estimator is given by the Rao-Cramér
bound (equation 8.26). Assuming that θ̂ is efficient, substituting θ̂ for θ in this
equation gives an estimate of V[θ̂]. If, in addition, θ̂ is unbiased (or at least that
the bias does not depend on θ), this just becomes V[θ̂] = 1/I(θ̂). We recall that the
ml estimator is efficient if an efficient estimator exists, but that this is not always
the case. Nor is the ml estimator always unbiased.


∗ There are arguments for the choice of non-uniform priors (see, e.g., Jeffreys37) in certain
circumstances. However, they are not completely convincing and remain controversial.

If the estimator is unbiased and efficient


However, asymptotically the ml estimator is both unbiased and efficient. Assuming
this to be the case, and also assuming that the range of x does not depend on θ,
we can estimate the variance as follows:
1. Using equation 8.19, I(θ) = E[S²],

   V⁻¹[θ̂] = I(θ) = E[S²] = E[ (∂ℓ/∂θ)² ]

   which, for more than one parameter, generalizes to

   V⁻¹jk[θ̂] = E[ (∂ℓ/∂θj)(∂ℓ/∂θk) ]   (8.77)
If the sample consists of n independent events distributed according to the
p.d.f.’s fi (xi ; θ), the score is just the sum of the scores for the individual events
and

   V⁻¹[θ̂] = E[ ( Σ_{i=1}^n S1(xi; θ) )² ]

Performing the square and using the fact that the expectation of a sum is the
sum of the expectations, we get
V⁻¹[θ̂] = Σ_{i=1}^n E[ S1²(xi; θ) ] + Σ_{i=1}^n Σ_{j=1, j≠i}^n E[ S1(xi; θ) S1(xj; θ) ]

However, the cross terms are zero, which follows from the fact that for indepen-
dent xi the expectation of the product equals the product of the expectations
and from E [S1 (x; θ)] = 0 (equation 8.20). Therefore, generalizing to more
than one parameter,
" #
h i n
X
∂ ln fi (xi ; θ) ∂ ln fi (xi ; θ)
Vjk−1 θ̂ = E (8.78)
i=1 ∂θj ∂θk

Not knowing the true value of θ, we estimate this by evaluating it at θ = θ̂.


If all the fi are the same, equation 8.78 reduces to
" #
h i ∂ ln f (x; θ) ∂ ln f (x; θ)
Vjk−1 θ̂ = nE (8.79)
∂θj ∂θk

Rather than calculating the expectation and evaluating it at θ̂, we can estimate
the expectation value by the sample mean evaluated at θ̂:

V̂⁻¹jk[θ̂] = Σ_{i=1}^n (∂ ln f(xi;θ)/∂θj)(∂ ln f(xi;θ)/∂θk) |θ=θ̂   (8.80)

2. I is also given by equation 8.23:


" #
∂S ∂S ∂ 2 ℓ
I(θ) = −E =− =− (8.81)
∂θ ∂ θ̂ θ̂=θ ∂ θ̂2 θ̂=θ

where the second step follows from the linear dependence of S on θ̂ (equa-
tion 8.30) for an unbiased, efficient estimator. The variance is then estimated
by evaluating the derivative at θ = θ̂:

V̂⁻¹[θ̂] = −∂²ℓ/∂θ² |θ̂   (8.82)

In the case of more than one parameter, this becomes


" #
h i ∂2ℓ
Vjk−1 θ̂ = Ijk (θ) = −E (8.83)
∂θj ∂θk
which is estimated by

V̂⁻¹jk[θ̂] = Ijk(θ̂) = −∂²ℓ/∂θj∂θk |θ̂   (8.84)

which is the Hessian matrix∗ of −ℓ. For n independent events, all distributed
as f (x; θ), the expectations in equations 8.81 and 8.83 can be estimated by a
sample mean evaluated at θ̂. Thus

V̂⁻¹jk[θ̂] = −Σ_{i=1}^n ∂² ln f(xi;θ)/∂θj∂θk |θ̂   (8.85)

(i.e., −n times the sample mean of ∂² ln f(x;θ)/∂θj∂θk, evaluated at θ̂).

The expectation forms (8.77, 8.78, 8.79 and 8.84) are useful for estimating the error
we expect before doing the experiment, e.g., to decide how many events we need to
have in order to achieve a certain precision under various assumptions for θ. Both
the expectation and the sample mean forms (8.80 and 8.85) may be used after the
experiment has been done. It is difficult to give general guidelines on which method
is most reliable.

Example: Let us apply the two methods to the example of n independent xi


distributed normally with the same µ but different σi . Assume that the σi are
known. Recall that in this case
 !2 
ℓ = Σ_{i=1}^n [ −½ ln(2π) − ln σi − ½ ((xi − µ)/σi)² ]


∗ Mathematically it is conditions on the first derivative vector, ∂ℓ/∂θ̂, and on the Hessian matrix
that define the maximum of ℓ or the minimum of −ℓ. The Hessian matrix is positive (negative)
definite at a minimum (maximum) of the function and indefinite at a saddle point.

1. From equation 8.78,


 !2   
V⁻¹[µ̂] = Σ_{i=1}^n E[ (∂ ln fi(xi;µ,σi)/∂µ)² ] = Σ_{i=1}^n E[ ( −½ ∂/∂µ ((xi − µ)/σi)² )² ]

       = Σ_{i=1}^n E[ ( (1/σi)((xi − µ)/σi) )² ] = Σ_{i=1}^n (1/σi²) E[ ((xi − µ)/σi)² ]
         (since σi is just a parameter of fi, hence a constant)

       = Σ_{i=1}^n 1/σi²   (since this expectation is 1 for the normal p.d.f.)

2. Since ∂ℓ/∂µ = Σ_{i=1}^n (xi − µ)/σi², equation 8.84 yields

   V⁻¹[µ̂] = −∂²ℓ/∂µ² = Σ_{i=1}^n 1/σi²

 
Thus both methods give V[µ̂] = 1/Σ(1/σi²). This is the same result we found in
section 8.4.1, equation 8.60, where we calculated the variance explicitly from the
definition. This was, of course, to be expected since in this example µ̂ is unbiased
and efficient and the range of x is independent of µ.
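
As a numerical cross-check of this example (our own sketch, assuming Python with NumPy; the data are generated, not real), one can compare 1/Σ(1/σi²) with the estimate of equation 8.84 obtained from a numerical second derivative of ℓ at µ̂:

    import numpy as np

    rng = np.random.default_rng(2)
    sigma = rng.uniform(0.5, 2.0, size=100)          # known but unequal errors
    x = rng.normal(loc=5.0, scale=sigma)             # true mu = 5

    def ell(mu):                                     # log-likelihood up to constants
        return np.sum(-np.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2)

    mu_hat = np.sum(x / sigma**2) / np.sum(1.0 / sigma**2)

    # equation 8.84: V^-1[mu_hat] = -d^2 ell / d mu^2 at mu_hat (numerical second derivative)
    h = 1e-4
    d2 = (ell(mu_hat + h) - 2 * ell(mu_hat) + ell(mu_hat - h)) / h**2
    print(1.0 / np.sum(1.0 / sigma**2), -1.0 / d2)   # both give V[mu_hat]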

Variance using Bayesian inference


We have emphasized that L is the p.d.f. for x given θ and not the p.d.f. for θ given
x. However, using the Bayesian interpretation of probability (sections 2.4.3 and
8.4.4), these two conditional p.d.f.’s are related: By Bayes’ theorem,

fposterior (θ|x) ∝ f (x|θ) fprior (θ)

and f (x|θ) is just the likelihood function L(x; θ). If we are willing to accept Bayes’
postulate (for which there is no mathematical justification) and take the prior p.d.f.
for θ, fprior (θ), as uniform in θ (within possible physical limits), we have
fposterior(θ|x) = L(x; θ) / ∫ L(x; θ) dθ   (8.86)

where the explicit normalization in the denominator is needed to normalize fposterior,
since L is normalized by ∫ L dx = 1. Since, in Bayesian inference, L is regarded as
a p.d.f. for θ, the covariance matrix of θ̂,

Vjk[θ̂] = E[ (θ̂j − θj)(θ̂k − θk) ]   (8.87)

is given by

Vjk[θ̂] = ∫ (θ̂j − θj)(θ̂k − θk) L dθ / ∫ L dθ   (8.88)

If the integrals in equation 8.88 can not be easily performed analytically, we could
use Monte Carlo integration. Alternatively, we can estimate the expectation (8.87)
from the data. This is similar to Monte Carlo integration, but instead of Monte
Carlo points θ we use the data themselves. Assuming n independent observations
xi , we estimate each parameter for each observation separately, keeping all other
parameters fixed at θ̂. Thus, θ̂j(i) is the value of θ̂j that would be obtained from
using only the ith event. In other words, θ̂j(i) is the solution of

∂fi(xi; θ)/∂θj = 0 ,   with θk = θ̂k for all k ≠ j

With L regarded as a p.d.f. for θ, the θ̂j(i) are r.v.’s distributed according to L.
Their variance about θ thus estimates the variance of θ̂. However, not knowing θ
we must use our estimate of it. This leads to the following estimate of the covariance,
where in equation 8.87 the expectation has been replaced by an average over the
observations, θ̂ by the estimate from one observation θ̂j(i) , and θj by our estimate
θ̂j :
V̂jk[θ̂] = (1/n) Σ_{i=1}^n (θ̂j(i) − θ̂j)(θ̂k(i) − θ̂k)   (8.89)
Equation 8.88 is particularly easy to evaluate when L is a Gaussian. We have
seen that asymptotically L is a Gaussian function of θ (equation 8.72) and hence
that ℓ is parabolic (equation 8.71):
L = Lmax exp(−½Q²) ,   Q² = (θ̂ − θ)²/σ² ,   ℓ = ln L = ln Lmax − ½Q²   (8.90)

Then, using the Bayesian interpretation, it follows from equation 8.88 that V[θ̂] = σ² = I⁻¹(θ̂).
However, in the asymptotic limit it is not necessary to invoke the Bayesian
interpretation to obtain this result, since we already know from the asymptotic
efficiency of the ml estimator that V[θ̂] = I⁻¹(θ) = I⁻¹(θ̂).

A graphical method
In any case, if L is Gaussian, the values of θ for which Q² = (θ̂ − θ)²/σ² = 1, i.e., the
values of θ corresponding to 1 standard deviation “errors”, θ̂ − θ = ±σ, are just those
values, θ1, for which ℓ differs from ℓmax by 1/2. This provides another way to estimate
the uncertainty, δθ̂ = √V[θ̂], on θ̂: Find the value of θ, θ1, for which

ℓ1 = ℓ(θ1) = ℓmax − 1/2

[Figure: ℓ(θ) vs. θ, showing ℓmax at θ̂ and the values θ1, θ2 where ℓ has dropped to ℓ1 = ℓmax − 1/2 and to ℓ2 = ℓmax − 2.]

The error is then δθ̂ = |θ̂ − θ1|. This could be done graphically from a plot of ℓ vs. θ.
Similarly, two-standard-deviation errors (Q² = 4) could be found using ℓ2 = ℓmax − 2,
etc. (The change in ℓ corresponding to Q standard deviations is Q²/2.)
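
A minimal sketch of this Δℓ = 1/2 scan (our own example, assuming Python with NumPy and SciPy) for an exponential lifetime fit, where the small sample makes the errors visibly asymmetric:

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(3)
    x = rng.exponential(scale=1.5, size=50)        # small sample -> asymmetric errors (invented data)
    n, S = len(x), x.sum()

    def ell(tau):                                  # log-likelihood for f(x;tau) = exp(-x/tau)/tau
        return -n * np.log(tau) - S / tau

    tau_hat = S / n                                # analytic ml estimate
    ell_max = ell(tau_hat)

    # find the two values of tau where ell has dropped by 1/2 from its maximum
    lo = brentq(lambda t: ell(t) - (ell_max - 0.5), 1e-3, tau_hat)
    hi = brentq(lambda t: ell(t) - (ell_max - 0.5), tau_hat, 50.0)
    print(f"tau = {tau_hat:.3f}  -{tau_hat - lo:.3f}  +{hi - tau_hat:.3f}")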
But, what do we do if L is not Gaussian? We can be Bayesian and use equa-
tion 8.87 or 8.88. Not wanting to be Bayesian, we can use the following approach.
The two approaches will in general give different estimates of the variance, the
difference being smallest when L is nearly of a Gaussian form.
Recall that for efficient, unbiased estimators L can be Gaussian even for finite n.
Imagine a one-to-one transformation g(θ) from the parameter θ to a new parameter
g and suppose that ĝ is efficient and unbiased and hence that L(g) is normal. Such
a g may not exist, but for now we assume that it does. We have seen that ĝ = g(θ̂).
Let h be the inverse transformation, i.e., θ = h[g(θ)]. Since, by assumption, L(g)
is Gaussian, δg is given by a change of 1/2 in ℓ(g).
But, as we have seen in section 8.4.3, L(θ|x) = L(g(θ)|x) for all θ; there is no
Jacobian involved in going from L(θ) to L(g). This means that since we can find
δg from a change of 1/2 in ℓ(g), δθ will be given by the same change.

[Figure: ℓ(g) vs. g and ℓ(θ) vs. θ side by side, each showing the points (g1, g2 and θ1, θ2) where ℓ has fallen from ℓmax by 1/2 (ℓ1) and by 2 (ℓ2).]

L(θ) need not be a symmetric function of θ, in which case the errors on θ̂ are
asymmetric.
Note that we do not actually need to use the parameter g. We can find δθ
directly.
A problem is that such a g may not exist. Asymptotically both L(g) and L(θ) are
Gaussian. However, in general, L(g) and L(θ) will approach normality at different
rates. It is therefore plausible that there exists some g which is at least nearly
normally distributed for finite n. Since we never actually have to use g, we can only
adopt it as an assumption, realizing that the further away L for the ‘best’ g is from
normality, the less accurate will be our estimation of δθ.
This method of error estimation is easily extended to the case of more than
one parameter. If all estimators are efficient, L will be a multivariate normal. We
show the example of two parameters, θ1 and θ2 . The condition of a change of 1/2
in ℓ, i.e., Q2 = 1, gives an ellipse of constant L in θ2 vs. θ1 . A distinction must be

made, however, between the ‘error’ and the ‘reduced’ or ‘conditional error’, which
is the error if the values of the other parameters are all assumed to be equal to their
estimated values.
If, for example, θ2 is held fixed at θ̂2 and ℓ varied by 1/2, the conditional error, σ1c,
is found rather than the error σ1, which is the error that enters the multivariate
normal distribution. In practice, the maximum of ℓ, as well as the variation of ℓ by
1/2, are usually found on a computer using a search technique. However, since it is
easier (faster), the program may compute σc rather than σ. If the parameters are
uncorrelated, σc = σ. If parameters are correlated, the correlation should be stated
along with the errors, or in other words, the complete covariance matrix should be
stated, e.g., as σ1, σ2, and ρ, the correlation coefficient.
[Figure: the Δℓ = 1/2 contour (an ellipse) in the (θ1, θ2) plane centred on θ̂, showing the errors σ1, σ2 and the conditional errors σ1c, σ2c.]

8.4.6 Summary
• If the sample is large, maximum likelihood gives a unique, unbiased, minimum
variance estimate under certain general conditions. However ‘large’ is not well
defined. For finite samples the ml estimate may not be unique, unbiased, or
of minimum variance. In this case other estimators may be preferable.

• Maximum likelihood estimators are often the easiest to compute, especially


for complex problems. In many practical cases maximum likelihood is the
only tractable approach.

• Maximum likelihood estimators are sufficient, i.e., they use all the information
about the parameter that is contained in the data. In particular, for small
samples ml estimators can be much superior to methods which rely on binned
data.

• Maximum likelihood estimators are not necessarily robust. If you use the
wrong p.d.f., the ml estimate may be worse than that from some other
method.

• The maximum likelihood method gives no way of testing the validity of the
underlying theory, i.e., whether or not the assumed p.d.f. is the correct one.
In practice this is not so bad: You can always follow the maximum likelihood
estimation by a goodness-of-fit test. Such tests will be discussed in section
10.6.

And finally, a practical point: In complex situations, the likelihood condition
∂ℓ/∂θi = 0 can not be solved analytically. You then must code the likelihood function
and use computer routines to find its maximum. Very clever programs exist as
and use computer routines to find its maximum. Very clever programs exist as
pre-packaged routines for finding the minimum or maximum of a function. Do not
be tempted to write your own; take one from a good software library, e.g., that of
the Numerical Algorithms Group (NAG) or the MINUIT38 program from CERN.
Note that such programs usually search for a minimum instead of a maximum, so
put a minus sign before your ℓ. One usually writes a subroutine which calculates
the function for values of θ chosen by the program. The program needs a starting
value for θ. It evaluates the function at numerous points in θ space, determines
the most likely direction in this space to find the minimum (or maximum), and
proceeds to search until the minimum is found. The search can usually be speeded
up by also supplying a subroutine to calculate the derivatives of ℓ with respect to
the θi ; otherwise the program must do this numerically.
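
As an illustration of this workflow (a sketch only, not a prescription; it assumes Python with NumPy and SciPy rather than MINUIT or NAG), one codes −ℓ and hands it to a general-purpose minimizer together with a starting value:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(4)
    data = rng.normal(loc=1.2, scale=0.8, size=1000)    # invented data

    def neg_ell(params):                 # note the minus sign: the routine searches for a minimum
        mu, log_sigma = params           # fit ln(sigma) so the parameter is unbounded
        sigma = np.exp(log_sigma)
        return np.sum(np.log(sigma) + 0.5 * ((data - mu) / sigma) ** 2)

    res = minimize(neg_ell, x0=[0.0, 0.0])               # x0 is the required starting value
    print(res.x)                                         # estimates of mu and ln(sigma)
    print(np.sqrt(np.diag(res.hess_inv)))                # rough errors from the inverse Hessian of -ell

The inverse Hessian returned by the minimizer plays the role of the covariance estimate of section 8.4.5; here the second parameter is ln σ, whose error would still have to be propagated back to σ (cf. section 8.4.3).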

8.4.7 Extended Maximum Likelihood


Applied to n independent events from the same p.d.f., the likelihood method, as
discussed so far, is a method to determine parameters governing the shape of the
p.d.f. The number of events in the sample is not regarded as a variable.
Fermi proposed to extend the maximum likelihood method by including the
number of events as a parameter to be estimated. His motivation was the grand
canonical ensemble of statistical mechanics. In the canonical ensemble the number
of atoms or molecules is regarded as fixed while in the grand canonical ensemble
the number is free to vary.
To incorporate a variable number of events, the ordinary likelihood function is
multiplied by the Poisson p.d.f. expressing the probability of obtaining n events
when the expected number of events is ν. This expected number of events is then
another parameter to be estimated from the data. The likelihood becomes
L(x; θ) = Π_{i=1}^n f(xi; θ)   →   LE(x; θ, ν) = ( exp(−ν) ν^n / n! ) Π_{i=1}^n f(xi; θ)   (8.91)

ℓ(x; θ) = Σ_{i=1}^n ln f(xi; θ)   →   ℓE(x; θ, ν) = Σ_{i=1}^n ln f(xi; θ) − ν + n ln ν − ln n!

ℓE(x; θ, ν) = Σ_{i=1}^n ln[ ν f(xi; θ) ] − ν − ln n!

Or,   ℓE(x; θ, ν) = Σ_{i=1}^n ln g(xi; θ) − ν   (8.92)
i=1

where g = νf is the p.d.f. normalized to ν rather than to 1 and where we have


dropped the constant term ln n! since constant terms are irrelevant in finding the
maximum and the variance of estimators.

Just as the grand canonical ensemble can be used even for situations where the
number of molecules is in fact constant (non-permeable walls), so also the extended
maximum likelihood method. In particular, if there is no functional relationship
between ν and θ, the likelihood condition ∂ℓE /∂ν = 0 will lead to ν̂ = n. Also,
∂ℓE /∂θj = ∂ℓ/∂θj , which leads to identical estimators θ̂j as in the ordinary max-
imum likelihood method. Nevertheless, we may still prefer to use the extended
maximum likelihood method. It can happen that the p.d.f., f , is very difficult to
normalize, e.g., involving a lengthy numerical integration. Then, even though the
number of events is fixed, we can use the extended maximum likelihood method,
allowing the maximum likelihood principle to find the normalization. In this case,
the resulting estimate of ν should turn out to be the actual number of events n times
the normalization of f and the estimate of the other parameters to be the same as
would have been found using the ordinary maximum likelihood method. However,
the errors on the parameters will be overestimated since the method assumes that
ν can have fluctuations. This overestimation can be removed (cf. section 3.9) by
1. inverting the covariance matrix,
2. removing the row and column corresponding to ν,
3. inverting the resulting matrix.
This corresponds to fixing ν at the best value, ν̂. Thus we could also fix ν = ν̂ and
find the errors on θ̂ by the usual ml procedure.
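
The three-step recipe above is a few lines of linear algebra. A minimal sketch (ours, assuming Python with NumPy; the covariance matrix is invented for illustration):

    import numpy as np

    def drop_nuisance(cov, idx):
        """Fix parameter idx (e.g. nu) at its fitted value: invert the covariance
        matrix, delete the corresponding row and column, and invert back."""
        inv = np.linalg.inv(cov)
        inv = np.delete(np.delete(inv, idx, axis=0), idx, axis=1)
        return np.linalg.inv(inv)

    # toy 3x3 covariance from an extended-ml fit; last row/column belongs to nu (made-up numbers)
    cov = np.array([[0.04, 0.01, 0.02],
                    [0.01, 0.09, 0.03],
                    [0.02, 0.03, 1.00]])
    print(drop_nuisance(cov, idx=2))   # reduced covariance of the remaining parameters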

An example: Suppose we have an angular distribution containing N events, F


in the forward hemisphere and B = N − F in the backward hemisphere. In the
ordinary maximum likelihood method N is regarded as fixed. The p.d.f. for the
division of N events into F forward and B backward is the binomial p.d.f.:
L(F; p) = B(F; p, N) = [ N!/(F!B!) ] p^F (1 − p)^B
ℓ(F; p) = F ln p + B ln(1 − p) + ln N! − ln F!B!

The likelihood condition is then

∂ℓ/∂p = F/p − B/(1 − p) = 0   →   p̂ = F/(F + B) = F/N

Its variance is given by

V[p̂] = −[ ∂²ℓ/∂p² ]⁻¹ = [ F/p² + B/(1 − p)² ]⁻¹

which we estimate by replacing p by p̂:

V̂[p̂] = [ F/p̂² + B/(1 − p̂)² ]⁻¹ = [ N/p̂ + N/(1 − p̂) ]⁻¹ = [ N/(p̂(1 − p̂)) ]⁻¹ = p̂(1 − p̂)/N

The estimated numbers of forward and backward events, i.e., the estimate of the
expectation of the numbers of forward and backward events if the experiment were
repeated, are then

F̂ = Np̂ = F   and   B̂ = N(1 − p̂) = B

with variance

V[F̂] = V[Np̂] = N² V[p̂]

which is estimated by replacing V by V̂:

V̂[F̂] = N² V̂[p̂] = Np̂(1 − p̂) = N (F/N)(B/N) = FB/N

Similarly,

V[B̂] = V[N(1 − p̂)] = V[Np̂]   and   V̂[B̂] = FB/N

Further, F̂, B̂ are completely anticorrelated.


In extended maximum likelihood N is not constant, but Poisson distributed.
Hence,

LE = ( exp(−ν) ν^N / N! ) L = ( exp(−ν) ν^N / N! ) [ N!/(F!B!) ] p^F (1 − p)^B
ℓE = −ν + N ln ν − ln N! + F ln p + B ln(1 − p) + ln N! − ln F!B!
∂ℓE/∂ν = −1 + N/ν = 0   →   ν̂ = N

The likelihood condition for p, ∂ℓE/∂p = 0, gives p̂ = F/N, the same as in ordinary
likelihood. The variance of p̂ is also the same. For ν̂, the variance is found as follows:

∂²ℓE/∂ν² = −N/ν²   →   V[ν̂] = ν²/N

Estimating the variance by replacing ν with ν̂ gives V̂(ν̂) = N. Further,

∂²ℓE/∂ν∂p = 0   →   p̂ and ν̂ are uncorrelated.

The estimate of the number of forward events is F̂ = p̂ν̂ = F , with the variance
found by error propagation:
V[F̂] = p̂² V[ν̂] + ν̂² V[p̂] = (F²/N²) N + N² p̂(1 − p̂)/N = F²/N + FB/N = F

The result for B̂ is similar. Thus,


F̂ = F ± √F   and   B̂ = B ± √B

Alternatively, we can write the p.d.f. as a product of Poisson p.d.f.’s, one for
forward events and one for backward events (see exercise 13). Again, N is not
fixed. The parameters are now the expected numbers of forward, φ, and backward,
β, events. Then
LE = ( exp(−φ) φ^F / F! ) ( exp(−β) β^B / B! )

which leads to the same result:

F̂ = φ̂ = F ± √F   and   B̂ = β̂ = B ± √B

again with uncorrelated errors.


The constraint of fixed N leads to smaller, but correlated, errors in the ordi-
nary maximum likelihood method. The estimates of the numbers of forward and
backward events are, however, the same. Which method is correct depends on the
question we are asking. To find the fraction of backward events we should use ordi-
nary maximum likelihood. To find the number of backward events that we should
expect in repetitions of the experiment where the number of events can vary, we
should use extended maximum likelihood.
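
A quick Monte Carlo check of this distinction (our own sketch, assuming Python with NumPy): repeating the experiment with N fixed reproduces the binomial variance FB/N for the number of forward events, while letting N fluctuate Poisson-wise reproduces the extended-ml variance F.

    import numpy as np

    rng = np.random.default_rng(5)
    N, p = 1000, 0.3                        # true number of events and forward fraction (invented)

    # fixed-N experiments: F fluctuates binomially -> variance N p (1-p), i.e. "FB/N"
    F_fixed = rng.binomial(N, p, size=200000)
    # Poisson-N experiments: F is Poisson with mean N p -> variance N p, i.e. "F"
    F_pois = rng.poisson(N * p, size=200000)

    print(F_fixed.var(), N * p * (1 - p))   # ~ 210
    print(F_pois.var(),  N * p)             # ~ 300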

8.4.8 Constrained parameters


It often happens that the parameters to be estimated are constrained, for instance
by a physical law. The imposition of constraints always implies adding some in-
formation, and therefore the errors of the parameters are in general reduced. One
should therefore be careful not to add incorrect information. One should always test
that the data are indeed compatible with the constraints. For example, before fixing
a parameter at its theoretical value one should perform the fit with the parameter
free and check that the resulting estimate is compatible with the theoretical value.
Even if the theory is true, the data may turn out to give an incompatible value
because of some experimental bias. Testing the compatibility is usually a good way
to discover such experimental problems. How to do this will be discussed in sections
10.4 and 10.6.
The constraints may take the form of a set of equations

g(θ̂) = 0 (8.93)

The most efficient method to deal with such constraints is to change parameters
such that these equations become trivial. For example, if the constraint is

g(θ) = θ1 + θ2 − 1 = 0

we simply replace θ2 by 1 − θ1 in the likelihood function and maximize with respect


to θ1 .
Similarly, boundaries on a parameter, e.g., θl < θ < θh , can be imposed by the
transformation
θ = θl + ½ (sin ψ + 1)(θh − θl)
and maximizing L with respect to ψ.
When the θi are fractional contributions, subject to the constraints
0 ≤ θi ≤ 1 ;   Σ_{i=1}^k θi = 1

one can use the following transformation:

θ1 = ξ1
θ2 = (1 − ξ1 )ξ2
θ3 = (1 − ξ1 )(1 − ξ2 )ξ3
...
θk−1 = (1 − ξ1 )(1 − ξ2 )(1 − ξ3 ) · · · (1 − ξk−2 )ξk−1
θk = (1 − ξ1 )(1 − ξ2 )(1 − ξ3 ) · · · (1 − ξk−2 )(1 − ξk−1)

where the ξi are bounded by 0 and 1 using the method given above:
ξi = ½ (sin ψi + 1)
L is then maximized with respect to the k − 1 parameters ψi . A drawback of this
method is that the symmetry of the problem with respect to the parameters is lost.
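
For reference, a minimal sketch of this transformation (ours, in Python with NumPy; the function name is our own choice) that maps k − 1 unconstrained parameters ψi to k fractions θi satisfying both constraints:

    import numpy as np

    def fractions_from_psi(psi):
        """Map k-1 unconstrained angles psi to k fractions theta with
        0 <= theta_i <= 1 and sum(theta) = 1, using xi_i = (sin psi_i + 1)/2."""
        xi = 0.5 * (np.sin(np.asarray(psi)) + 1.0)
        theta, remainder = [], 1.0
        for x in xi:
            theta.append(remainder * x)      # theta_i = (1-xi_1)...(1-xi_{i-1}) xi_i
            remainder *= (1.0 - x)
        theta.append(remainder)              # theta_k takes whatever is left
        return np.array(theta)

    theta = fractions_from_psi([0.3, -1.2, 0.7])   # k = 4 fractions from 3 parameters
    print(theta, theta.sum())                      # fractions in [0,1], summing to 1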
In general, the above simple methods may be difficult to apply. One then turns
to the method of Lagrangian multipliers. Given the likelihood function L(x; θ) and
the constraints g(θ) = 0, one finds the extremum of

F (x; θ, α) = ln L(x; θ) + αT g(θ) (8.94)

with respect to θ and α. The likelihood condition (equation 8.58) becomes



∂F/∂θi |θ=θ̂, α=α̂ = ∂ℓ/∂θi |θ=θ̂ + α̂ᵀ ∂g(θ)/∂θi |θ=θ̂ = 0   (8.95)

∂F/∂αj |θ=θ̂, α=α̂ = g(θ̂) = 0   (8.96)
The estimators of θ found in this way clearly satisfy the constraints (equation 8.93).
They also have all the usual properties of maximum likelihood estimators.

To find the variances, we construct the matrix of the negative of the second
derivatives:

I ≡ −E ( ∂²F/∂θ∂θ′      ∂²F/∂θ∂α )  =  −E ( ∂²ℓ/∂θ∂θ′    ∂g/∂θ )  ≡  ( A    B )
       ( (∂²F/∂θ∂α)ᵀ    ∂²F/∂α²  )        ( (∂g/∂θ)ᵀ     0     )     ( Bᵀ   0 )   (8.97)
It can be shown4, 5 that the covariance matrix of the estimators is then given by
V[θ̂] = A⁻¹ − A⁻¹ B V[α̂] Bᵀ A⁻¹   (8.98)

V[α̂] = ( Bᵀ A⁻¹ B )⁻¹   (8.99)

The first term of V[θ̂] is the ordinary unconstrained covariance matrix; the second
term is the reduction in variance due to the additional information provided by the
constraints. We have implicitly assumed that I is not singular. This may not be the
case, e.g., when the constraint is necessary to define the parameters unambiguously.
One then adds another term to F ,

F′ = F − g²(θ)

and proceeds as above. The resulting inverse covariance matrix is usually non-
singular.4, 5
Computer programs which search for a maximum will generally perform better
if the constraints are handled correctly, rather than by some trick such as setting the
likelihood very small when the constraint is not satisfied, since this will adversely
affect the program’s estimation of derivatives. Also, use of Lagrangian multipliers
may not work with some programs, since the extremum can be a saddle point rather
than a maximum: a maximum with respect to θ, but a minimum with respect to α.
In such a case, “hill-climbing” methods will not be capable of finding the extremum.

8.5 Least Squares method

8.5.1 Introduction
We begin this subject by starting from maximum likelihood and treating the exam-
ple of n independent xi , each distributed normally with the same mean but different
σi . To estimate µ when all the σi are known we have seen that the likelihood func-
tion is
L = Π_{i=1}^n ( 1/(√(2π) σi) ) exp[ −½ ((xi − µ)/σi)² ]

ℓ = −(n/2) ln(2π) + Σ_{i=1}^n [ −ln σi − (xi − µ)²/(2σi²) ]

To maximize L, or ℓ, is equivalent to minimizing Σ_{i=1}^n (xi − µ)²/σi². If µ were known, this
quantity would be, assuming each point independent, a χ²(n). Since µ is unknown
we replace it by an estimate of µ, µ̂. There is then one relationship between the
terms of the χ² and therefore

χ² = Σ_{i=1}^n (xi − µ̂)²/σi²   (8.100)
is a χ2 not of n, but of n − 1 degrees of freedom.
The method of least squares takes as the estimator of a parameter that value
which minimizes χ2 . The least squares estimator is thus given by

∂χ²/∂µ |µ=µ̂ = −2 Σ_{i=1}^n (xi − µ̂)/σi² = 0

which gives the same estimator as did maximum likelihood (equation 8.59):

µ̂ = Σ(xi/σi²) / Σ(1/σi²)   (8.101)

Although in this example the least squares and maximum likelihood methods
result in the same estimator, this is not true in general, in particular if the p.d.f.
is not normal. We will see that although we arrived at the least squares method
starting from maximum likelihood, least squares is much more solidly based than
maximum likelihood. It is, perhaps as a consequence, also less widely applicable.
The method of least squares is a special case of a more general class of methods
whereby one uses some measure of distance, di (xi , θ), of a data point from its
expected value and minimizes the sum of the distances to obtain the estimate of θ.
Examples of d, in the context of our example, are
1. di(xi, θ) = |xi − µ̂|^α

2. di(xi, θ) = ( |xi − µ̂| / σi )^α
The difference between these two is that in the second case the distance is scaled by
the square root of the expected variance of the distance. If all these variances, σi2 ,
are the same, the two definitions are equivalent. It can be shown11, 13 that the first
distance measure with α = 1 leads to µ̂ given by the sample median. The second
distance measure with α = 2 is just χ2 .
The first publication in which least squares was used is by Legendre. In an
1805 paper entitled “Nouvelles méthodes pour la determination des orbites des
comètes” he writes:
Il faut ensuite, lorsque toutes les conditions du problême sont exprimées
convenablement, determiner les coëfficiens de manière à rendre les erreurs
les plus petites qu’il est possible. Pour cet effet, la méthode qui
me paraît la plus simple et la plus générale, consiste à rendre minimum
la somme des quarrés des erreurs.

[It is then necessary, once all the conditions of the problem have been suitably
expressed, to determine the coefficients so as to make the errors as small as
possible. For this purpose, the method which seems to me the simplest and the most
general consists in making the sum of the squares of the errors a minimum.]
Least squares was not the only method in use in those days (or now). In 1792
Laplace minimized the sum of absolute errors, although he later switched to least
squares. Bessel and Encke also used least squares. In 1831, Cauchy suggested,
“que la plus grande de toutes les erreurs, abstraction faite du signe, devienne un
minimum”, i.e., to minimize the maximum of the absolute values of the deviations,
max |xi − µ̂|. This ‘minimax’ principle gives a very robust estimation but is not
very efficient.4, 5
We have noted that the χ2 of equation 8.100 is a χ2 of n − 1 degrees of freedom.
Thus, if we were to repeat the identical experiment many times, the values of χ2
obtained would be distributed as χ2 (n − 1), provided that the assumed p.d.f. of
the xi is correct. We would not expect then to get a value of χ2 which would be
very improbable if the assumed p.d.f. were correct. This could provide a reason for
deciding that the assumed p.d.f. is incorrect. This built-in test of the validity of the
assumed p.d.f. is a feature which was missing in the maximum likelihood method.
We will return to this and other hypothesis tests in sections 10.4 and 10.6.
In the example we assumed that the xi were normally distributed about µ with
standard deviation σi . If this were not the case, the distribution of the quantity χ2
for repetitions of the experiment would not follow the expected χ2 (n − 1) distribu-
tion. Consequently, the chance of getting a particular value of χ2 would not be that
given by the χ2 distribution. In other words, the quantity that we have called χ2 is
a χ2 r.v. only if our assumption that the xi are distributed normally is correct.
Assuming that we have not rejected the p.d.f., we need to estimate the variance
of µ̂. First we can use error propagation (section 8.3.6) to calculate the variance of
µ̂, given by equation 8.101, from the variances of the xi . In our example µ̂ is linear
in the xi ; hence the method is exact (equation 8.49):
V[µ̂] = Σ (∂µ̂/∂xi)² V[xi] = [ 1/Σ(1/σi²) ]² Σ V[xi]/σi⁴ = 1/Σ(1/σi²)

which agrees with the variance found in the maximum likelihood method (equa-
tion 8.60).
We see that in this example (although not in general true) the variance of the
estimator does not depend on the value of χ2 . However, it does depend on the
shape of χ2 (µ):
χ²(µ) = Σ ( (xi − µ)/σi )²

∂χ²/∂µ |µ̂ = −Σ 2(xi − µ̂)/σi² = 0

∂²χ²/∂µ² |µ̂ = Σ 2/σi² = 2/V[µ̂]

All higher order derivatives are zero, a consequence of the efficiency of the estimator and
the linear relationship between χ² and ℓ. Thus the χ² is a parabola:

χ²(µ) = χ²(µ̂) + (µ̂ − µ)²/V[µ̂]

[Figure: the parabolic χ²(µ), with its minimum χ²min at µ = µ̂.]

Corresponding to what we did in the maximum likelihood method, we construct the
error on µ̂ by finding that value of µ for which χ²(µ) − χ²(µ̂) has a particular
value. From the above equation we see that a χ²-difference of 1 occurs when
(µ̂ − µ)² = V[µ̂], i.e., for those values of µ which are one standard deviation from
µ̂; more generally, a value of µ for which χ²(µ) − χ²(µ̂) = n² corresponds to an
n standard deviation difference from µ̂.

8.5.2 The Linear Model


In the preceding example we had a number of measurements of a fixed quantity.
Now let us suppose that we have a number of measurements yi of a quantity y which
depends on some other quantity x. Assume, for now, that the values xi are known
exactly, i.e., without error. For each xi , y is measured to be yi with expected error
σi . We assume that σi does not depend on yi .
One of the reasons for doing a fit to a curve is to enable us to predict the
most likely value of future measurements at a specified x. For example, we wish to
calibrate an instrument. Then the predictor variable x would be the value that the
instrument reads. The response variable y would be the true value. A fit averages
out the fluctuations in the individual readings as much as possible. This only works,
of course, if the form used for the curve in the fit is at least approximately correct.
Although we will use a one-dimensional predictor variable x, the generalization to
more dimensions is straightforward: x → x.
Assume now that we have a model for y vs. x in terms of certain parameters θ
which are coefficients of known functions of x:

y(x) = θ1 h1 (x) + θ2 h2 (x) + θ3 h3 (x) + . . . + θj hj (x) (8.102)

This is the curve which we fit to the data. There are k parameters, θj , to be
estimated. The important features of this model are that the hj are known, distin-
guishable, functions of x, single-valued over the range of x, and that y is linear in
the θj . The word ‘linear’ in the term ‘linear model’ thus refers to the parameters θj
and not to the variable x. In some cases the linear model is just an approximation
arrived at by retaining only the first few terms of a Taylor series. The functions hj
must be distinguishable, i.e., no hj may be expressible as a linear combination of
the other hj ; otherwise the corresponding θj will be indeterminate.

We want to determine the values of the θj for which the model (eq. 8.102) best
fits the measurements. We assume that any deviation of a point yi from this curve
is due to measurement error or some other unbiased effects beyond our control, but
whose distribution is known from previous study of the measuring process to have
variance σi2 . It need not be a Gaussian. We take as our measure of the distance of
the point yi from the hypothesized curve the squared distance in units of σi, as in
our example above.


The general term for this fitting procedure is ‘Regression Analysis’. This term
is of historical origin and like many such terms it is not particularly appropriate;
nothing regresses. The term is not much used in physics, where we prefer to speak of
least squares fits, but is still in common use in the social sciences and in statistics
books. Some authors make a distinction between regression analysis and least
squares, reserving the term regression for the case where the yi (and perhaps the
1

xi ) are means (or other descriptive statistics) of some random variable, e.g., y the
average height and x the average weight of Dutch male university students. The
mathematics is, however, the same.

[Figure: measurements yi at points xi scattered about the true curve y(x); at each xi the p.d.f. of yi is centred on y(xi).]

We assume that the actual measurements are described by

yi = y(xi) + ǫi = Σ_{j=1}^k θj hj(xi) + ǫi   (8.103)

where the unknown error on yi has the properties: E [ǫi ] = 0, V [ǫi ] = σi2 , and σi2 is
known. The ǫi do not have to be normally distributed for most of what we shall do;
where a Gaussian assumption is needed, we will say so. Note that if at each xi the
yi does not have a normal p.d.f., we may be able to transform to a set of variables
which does.
Further, we assume for simplicity that each yi is an independent measurement,
although correlations can easily be taken into account by making the error matrix
non-diagonal, as will be discussed. The xi may be chosen any way we wish, including
several xi which are equal. However, we shall see that we need at least k distinct
values of x to determine k parameters θj .

Estimator
The problem is now to determine the ‘best’ values of k parameters, θj , from n
measurements, (xi , yi ). The deviations from the true curve are ǫi . Therefore the
“χ2 ” is
Q² = Σ_{i=1}^n ǫi²/σi²   (8.104)

   = Σ_{i=1}^n ( (yi − y(xi))/σi )²

   = Σ_{i=1}^n (1/σi²) ( yi − Σ_{j=1}^k θj hj(xi) )²   (8.105)

This is a true χ2 , i.e., distributed as a χ2 p.d.f., only if the ǫi are normally dis-
tributed. To emphasize this we use the symbol Q2 instead of χ2 .
We do not know the actual value of Q2 , since we do not know the true values
of the parameters θj . The least squares method estimates θ by that value θ̂ which
minimizes Q2 . This is found from the k equations (l = 1, . . . , k)
 
∂Q²/∂θl = 2 Σ_{i=1}^n (1/σi²) ( yi − Σ_{j=1}^k θj hj(xi) ) ( −hl(xi) ) = 0

which we rewrite as
Σ_{i=1}^n [ hl(xi)/σi² ] Σ_{j=1}^k θ̂j hj(xi) = Σ_{i=1}^n [ yi/σi² ] hl(xi)   (8.106)

This is a set of k linear equations in k unknowns. They are called the normal
equations. It is easier to work in matrix notation. We write
     
y = (y1, ..., yn)ᵀ ;   θ = (θ1, ..., θk)ᵀ ;   ǫ = (ǫ1, ..., ǫn)ᵀ

H = ( h1(x1)  h2(x1)  ...  hk(x1) )
    ( h1(x2)  h2(x2)  ...  hk(x2) )
    (   ...     ...   ...    ...  )
    ( h1(xn)  h2(xn)  ...  hk(xn) )

Then H θ is the n-component column vector whose i-th element is Σ_{j=1}^k θj hj(xi),

and the model (eq. 8.103) can be rewritten

y = H θ + ǫ   (8.107)

Since E[ǫ] = 0, we obtain E[y] = H θ. In other words, the expectation value of
each measurement is exactly the value given by the model.
The errors σi2 can also be incorporated in a matrix, which is diagonal given our
assumption of independent measurements,
 
V[y] = diag( σ1², σ2², ..., σn² )

If the measurements are not independent, we incorporate that by setting the off-
diagonal elements to the covariances of the measurements. In this matrix notation,
the equations for Q2 (equations 8.104 and 8.105) become

Q2 = ǫT V −1 ǫ (8.108)
 T  
= y−Hθ V −1 y − H θ (8.109)

To find the estimates of θ we solve

∂Q²/∂θ = −2 HᵀV⁻¹ (y − H θ) = 0   (8.110)

which gives the normal equations corresponding to equations 8.106, but now in
matrix form:

Hᵀ V⁻¹ H θ̂ = Hᵀ V⁻¹ y   (8.111)
(k×n)(n×n)(n×k)(k×1)   (k×n)(n×n)(n×1)

where we have indicated the dimensions of the matrices. The normal equations are
solved by inverting the square matrix HᵀV⁻¹H, which is a symmetric matrix since
V is symmetric. The solution is then

θ̂ = (HᵀV⁻¹H)⁻¹ HᵀV⁻¹ y   (8.112)

It is useful to note that the actual sizes of the errors σi2 do not have to be known
to find θ̂; only their relative sizes. To see this, write V = σ 2 W , where σ 2 is an
arbitrary scale factor and insert this in equation 8.112. The factors σ 2 cancel; thus
σ 2 need not be known in order to determine θ̂.
Now let us evaluate the expectation of θ̂:

E[θ̂] = E[ (HᵀV⁻¹H)⁻¹ HᵀV⁻¹ y ] = (HᵀV⁻¹H)⁻¹ HᵀV⁻¹ E[y] = (HᵀV⁻¹H)⁻¹ HᵀV⁻¹ H θ = θ

Thus θ̂ is unbiased, assuming that the model is correct. This is true even for small
n. (Recall that maximum likelihood estimators are often biased for finite n.)
Procedures exist for solving the normal equations without the intermediate step
of matrix inversion. Such methods are usually preferable in that they usually suffer
less from round-off problems.
In some cases, it is more convenient to solve these equations by numerical ap-
proximation methods. As discussed at the end of section 8.4.6, programs exist to
find the minimum of a function. For simple cases like the linear problem we have
considered, use of such programs is not very wasteful of computer time, and its
simplicity decreases the probability of an experimenter’s error and probably saves
his time as well. If the problem is not linear, a case which we shall shortly discuss,
such an approach is usually best.
We have stated that there must be no linear relationship between the hj . If
there is, then the columns of H are not all independent, and since V is symmetric,
H T V −1 H will be singular. The best approach is then to eliminate some of the h’s
until the linear relationships no longer exist. Also, there must be at least k distinct
xi ; otherwise the same matrix will be singular.
Note that if the number of parameters k is equal to the number of distinct values
of x, i.e., n = k assuming all xi are distinct, then
 −1  −1
(HᵀV⁻¹H)⁻¹ = H⁻¹ V (Hᵀ)⁻¹

Substituting in equation 8.112 yields θ̂ = H⁻¹y, assuming that HᵀV⁻¹H is not
singular. Thus θ̂ is independent of the errors. The curve will pass through all the
points, if that is possible. It may not be possible; the assumed model may not be
correct.

Variance
The covariance matrix of the estimators is given by
V[θ̂] = [ (HᵀV⁻¹H)⁻¹ HᵀV⁻¹ ] V[y] [ (HᵀV⁻¹H)⁻¹ HᵀV⁻¹ ]ᵀ   (8.113)

(a (k×k) matrix, built from (k×n)(n×n)(n×k) factors)

This can be demonstrated by working out a simple example. Alternatively, it follows


from propagation of errors (section 8.3.6): Since we are converting from errors on
y to errors on θ̂, the matrix D (equation 8.53) is
 
D(θ̂) = ( ∂θ̂1/∂y1  ∂θ̂2/∂y1  ...  ∂θ̂k/∂y1 )
       ( ∂θ̂1/∂y2  ∂θ̂2/∂y2  ...  ∂θ̂k/∂y2 )
       (   ...       ...    ...     ...   )
       ( ∂θ̂1/∂yn  ∂θ̂2/∂yn  ...  ∂θ̂k/∂yn )

The elements of D are found by differentiating equation 8.112, which gives


 −1 
Dᵀij = Dji = ∂θ̂i/∂yj = [ (HᵀV⁻¹H)⁻¹ HᵀV⁻¹ ]ij   (8.114)

or

D = [ (HᵀV⁻¹H)⁻¹ HᵀV⁻¹ ]ᵀ   (8.115)
The covariance (equation 8.113) then follows from equation 8.52, V[θ̂] = Dᵀ V[y] D.
What we here call V[y] is what we previously just called V. It is a square,
symmetric matrix. Hence V⁻¹ is also square and symmetric and therefore (V⁻¹)ᵀ =
V⁻¹. For the same reason [ (HᵀV⁻¹H)⁻¹ ]ᵀ = (HᵀV⁻¹H)⁻¹. Therefore, equation
8.113 can be rewritten:
V[θ̂] = (HᵀV⁻¹H)⁻¹ HᵀV⁻¹ V V⁻¹H (HᵀV⁻¹H)⁻¹
      = (HᵀV⁻¹H)⁻¹ (HᵀV⁻¹H) (HᵀV⁻¹H)⁻¹

V[θ̂] = (HᵀV⁻¹H)⁻¹   (8.116)

Equation 8.112 for the estimator θ̂ and equation 8.116 for its variance constitute
the complete method of linear least squares.

σ 2 unknown
If V (y) is only known up to an overall constant, i.e., V = σ 2 W with σ 2 unknown,
it can be estimated from the minimum value of Q2 : Defining Q2 in terms of W , its
minimum value is given by equation 8.108 with θ = θ̂:
 T  
Q²min = (y − H θ̂)ᵀ W⁻¹ (y − H θ̂)   (8.117)

If the ǫi are normally distributed, Q² = σ²χ², where the χ² has n − k degrees of
freedom. The expectation of Q² is then

E[Q²] = E[σ²χ²] = σ²(n − k)

Therefore,

σ̂² = Q²min / (n − k)   (8.118)
is an unbiased estimate of σ 2 . It can be shown∗ that this result is true even when
the ǫi are not normally distributed.


∗ See Kendall & Stuart11, vol. II, section 19.9 and exercise 19.5.

Interpolation
Having found θ̂, we may wish to calculate the value of y for some particular value
of x. In fact, the reason for doing the fit is often to be able to interpolate or
extrapolate the data points to other values of x. This is done by substituting the
estimators in the model. The variance is found by error propagation, reversing the
procedure used above to find the variance of θ̂. The estimate ŷ0 of y at x = x0 and
its variance are therefore given by
ŷ0 = H0 θ̂   (8.119)

V[ŷ0] = H0 V[θ̂] H0ᵀ = H0 ( Hᵀ V[y]⁻¹ H )⁻¹ H0ᵀ   (8.120)
where H 0 = ( h1 (x0 ) h2 (x0 ) . . . hk (x0 ) ), i.e., the H-matrix for the single point
x0 .

8.5.3 Derivative formulation


We can derive the above results in another way. The covariance matrix can be
found from the derivatives of Q2 : Starting from equation 8.109,

∂Q²/∂θ |θ=θ̂ = −2 HᵀV⁻¹ (y − H θ̂)   (8.121)

∂²Q²/∂θ² |θ=θ̂ = +2 HᵀV⁻¹H = 2 V⁻¹[θ̂]   (8.122)
This is a very useful way to calculate the covariance, which we have already seen in
our simple example of repeated measurements of a fixed quantity in the introduction
(section 8.5.1).
In fact, the solution θ̂ can be written in terms of the derivatives of Q2 making
it unnecessary to construct H, V , and the associated matrix products. To see this
we substitute the second derivative, equation 8.122, in equation 8.112. Since we are
trying to find θ̂, we do not yet know it, and we can not evaluate the derivative at
θ = θ̂. We therefore evaluate it at some guessed value, θ0 . Thus,
 −1
θ̂ = 2 [ ∂²Q²/∂θ² |θ=θ0 ]⁻¹ HᵀV⁻¹ y

  = [ ∂²Q²/∂θ² |θ=θ0 ]⁻¹ · ( ∂²Q²/∂θ² |θ=θ0 · θ0 − ∂Q²/∂θ |θ=θ0 )

θ̂ = θ0 − [ ∂²Q²/∂θ² |θ=θ0 ]⁻¹ · ∂Q²/∂θ |θ=θ0   (8.123)

This is the Newton-Raphson method of solving the equations ∂Q²/∂θ = 0. It is exact,
i.e., independent of the choice of θ0, for the linear model where the form of Q² is a

parabola. In the non-linear case, the method can still be used, but iteratively; its
success will depend on how close θ 0 is to θ̂ and on how non-linear the problem is.
The derivative formulation for the least squares solution is frequently the most
convenient technique in practical problems. The derivatives we need are
∂Q²/∂θi = ∂/∂θi Σ_m ǫm²/σm² = 2 Σ_m (ǫm/σm²) ∂ǫm/∂θi

and

∂²Q²/∂θi∂θj = 2 Σ_m (1/σm²)(∂ǫm/∂θi)(∂ǫm/∂θj) + 2 Σ_m (ǫm/σm²) ∂²ǫm/∂θi∂θj

In the linear case, ∂²ǫm/∂θi∂θj = 0 and ∂ǫm/∂θi = −hi(xm). Thus, the necessary derivatives
are easy to compute.
Finally, we note that the minimum value of Q2 is given by

Q²(θ̂) = Q²(θ0) + ∂Q²/∂θ |θ=θ0 · (θ̂ − θ0) + ½ (θ̂ − θ0)ᵀ ∂²Q²/∂θ² |θ=θ0 (θ̂ − θ0)   (8.124)

where we have expanded Q2 (θ̂) about θ0 . Third and higher order terms are zero for
the linear model.
Just as in the example in the introduction to least squares, we can show, by
expanding Q2 about θ̂ that the set of values of θ given by Q2 (θ) = Q2min + 1 define
the one standard deviation errors on θ̂. This is the same as the geometrical method
to find the errors in maximum likelihood analysis (section 8.4.5), except that here
the difference in Q2 is 1 whereas the difference in ℓ was 1/2. This is because the
covariance matrix here is given by twice the inverse of the second derivative matrix,
whereas it was equal to the inverse of the second derivative matrix in the maximum
likelihood case.
So far we have made no use of the assumption that the ǫi are Gaussian dis-
tributed. We have only used the conditions E(ǫi ) = 0 and V [ǫi ] = σi2 known and
the linearity of the model.
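
As an illustration of equation 8.123 (a sketch of ours, assuming Python with NumPy; the data are invented), a single Newton-Raphson step from an arbitrary starting point θ0 lands exactly on the linear least squares solution for a weighted straight-line fit:

    import numpy as np

    rng = np.random.default_rng(6)
    x = np.linspace(0, 10, 25)
    sig = rng.uniform(0.2, 1.0, size=x.size)
    y = 1.0 + 0.5 * x + rng.normal(0, sig)              # "measurements" of a straight line

    H = np.column_stack([np.ones_like(x), x])           # h1 = 1, h2 = x
    Vinv = np.diag(1.0 / sig**2)

    def dQ2(theta):                                     # first derivative of Q^2 (equation 8.121)
        return -2.0 * H.T @ Vinv @ (y - H @ theta)

    d2Q2 = 2.0 * H.T @ Vinv @ H                         # second derivative, constant for the linear model

    theta0 = np.array([10.0, -3.0])                     # arbitrary starting guess
    theta_hat = theta0 - np.linalg.solve(d2Q2, dQ2(theta0))   # one exact Newton-Raphson step
    print(theta_hat)
    print(np.linalg.solve(H.T @ Vinv @ H, H.T @ Vinv @ y))    # direct solution (8.112) agrees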

8.5.4 Gauss-Markov Theorem


This is the theorem which provides the method of least squares with its firm mathe-
matical foundation. In 1812 Laplace showed that the method of least squares gives
unbiased estimates, irrespective of the parent distribution. Nine years later Gauss
proved that among the class of estimators which are both linear combinations of
the data and unbiased estimators of the parameters, the method of least squares
gives estimates having the least possible variance. This was treated more gener-
ally by Markov in 1912. It was extended in 1934 by Aitken to the case where the
observations are correlated and have different variances.
We will simply state the theorem without proof:∗ If E[ǫi] = 0 and the covariance
matrix of the ǫi, V[ǫ], is finite and fixed, i.e., independent of θ and y (it does not
have to be diagonal), then the least squares estimate θ̂ is unbiased and has the
smallest variance of all linear (in y), unbiased estimates, regardless of the p.d.f. for
the ǫi.

∗ For a proof, see, for example, Kendall & Stuart11, chapter 19 (Stuart et al.13, chapter 29), or
Eadie et al.4 (or James5).
Note that
• This theorem concerns only linear unbiased estimators. It may be possible,
particularly if ǫ is not normally distributed, to find a non-linear unbiased
estimator with a smaller variance. Biased estimators with a smaller variance
may also exist.
• Least squares does not in general give the same result as maximum likelihood
(unless the ǫi are Gaussian) even for linear models. In this case, linear least
squares is often to be preferred to linear maximum likelihood where appli-
cable and convenient, since linear least squares is unbiased and has smallest
variance. An exception may occur in small samples where the data must be
binned in order to do a least squares analysis, causing a loss of information.
• The assumptions are important: The measurement errors must have zero
mean and they must be homoscedastic (the technical name for constant vari-
ance). Non-zero means or heteroscedastic variances may reveal themselves in
the residuals, yi − f (xi ), cf. section 10.6.8.

8.5.5 Examples
A Straight-Line Fit
As an example of linear least squares we do a least squares fit of independent
measurements yi at points xi assuming the model y = a + bx. Thus,
 
θ = ( a, b )ᵀ ;   h(x) = ( 1, x )ᵀ ;   H is the n×2 matrix with i-th row ( 1  xi )

and   y = H θ + ǫ
Since the measurements are independent, the covariance matrix is diagonal with
Vii(y) = Vii(ǫ) = σi², and

Q² = ǫᵀ V⁻¹ ǫ = Σ_{i=1}^n ǫi²/σi² = Σ_{i=1}^n ( (yi − a − bxi)/σi )²
Hence, using the derivative method,
∂Q²/∂a = 0   →   â = [ Σ_{i=1}^n (yi − b̂xi)/σi² ] / [ Σ_{i=1}^n 1/σi² ]

∂Q²/∂b = 0   →   b̂ = [ Σ_{i=1}^n xi(yi − âxi)/σi² ] / [ Σ_{i=1}^n xi²/σi² ]

Solving, we find

b̂ = [ Σ(xiyi/σi²) Σ(1/σi²) − Σ(yi/σi²) Σ(xi/σi²) ] / [ Σ(xi²/σi²) Σ(1/σi²) − ( Σ(xi/σi²) )² ]

which can in turn be substituted in the expression for â.


Alternatively, we can solve the matrix equation,
 −1
θ̂ = (HᵀV⁻¹H)⁻¹ HᵀV⁻¹ y

which, of course, gives the same result.


Note that if all σi are the same, σi = σ, then

â = ȳ − b̂x̄   and   b̂ = ( (xy)‾ − x̄ȳ ) / ( (x²)‾ − x̄² )   (8.125)

These are the formulae which are programmed into many pocket calculators. As
such, they should only be used when the σi are all the same. These formulae are,
however, also applicable to the case where not all σi are the same, if the sample
average indicated by the bar is interpreted as meaning a weighted sample average
with weights given by 1/σi², e.g., ȳ = Σ(yi/σi²)/Σ(1/σi²). The proof is left as an exercise
(ex. 40).
Note that at least two of the xi must be different. Otherwise, the denominator
in the expression for b̂ is zero. This illustrates the general requirement that there
must be at least as many distinct values of xi as there are parameters in the model;
otherwise the matrix H T V −1 H will be singular.
The errors on the least squares estimates of the parameters are given by equation
8.122 or 8.116. With all σi the same, equation 8.116 gives
h i  −1  −1
V θ̂ = H T V −1 H = H TH σ2
  −1 
 1 x1
  P −1
1 . . . 1  .. ..  n x
= σ2  . .  = σ
2
P P 2i
x1 . . . xn xi xi
1 xn
 P 2 P   P 2 P 
σ2 xi − xi σ2 x − xi
= P 2 P 2
P = P 2
Pi
n xi − ( xi ) − x i n n (xi − x̄) − xi n

Thus, !  
V [â] cov(â,
h ib̂) σ2 x2 −x̄
=   (8.126)
cov(â, b̂) V b̂ n x2 − x̄2 −x̄ 1

Note that by translating the x-axis such that x̄ becomes zero, the estimates of the
parameters become uncorrelated.

Here too, it is possible to use this formula for the case where not all σi are the
same. Besides taking the bar as a weighted average, one must also replace σ 2 by
its weighted average,

σ̄² = Σ(σi²/σi²) / Σ(1/σi²) = n / Σ(1/σi²)   (8.127)
Note that the errors are smallest for the largest spread in the xi . Thus we will
attain the best estimates of the parameters by making measurements only at the
extreme values of x. This procedure is, however, seldom advisable since it makes it
impossible to test the validity of the model, as we shall see.
Having found â and b̂, we can calculate the value of y for any value of x by
simply substituting the estimators in the model. The estimate ŷ0 of y at x = x0 is
therefore given by
ŷ0 = â + b̂x0 (8.128)
We note in passing that this gives ŷ0 = ȳ for x0 = x̄. The variance of ŷ0 is found
by error propagation:
V[ŷ0] = V[â] + x0² V[b̂] + 2x0 cov(â, b̂)

Substituting from equation 8.126 gives


 
V[ŷ0] = (σ²/n) [ 1 + (x0 − x̄)² / ( (x²)‾ − x̄² ) ]   (8.129)

Thus, the closer x0 is to x̄, the smaller the error in ŷ0 .
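
A small numerical sketch of equations 8.125 and 8.129 (ours, assuming Python with NumPy; the data are generated for illustration), showing that the error on the interpolated value is smallest at x0 = x̄ and grows when extrapolating:

    import numpy as np

    rng = np.random.default_rng(7)
    x = np.linspace(1.0, 6.0, 12)
    sigma = 0.3                                        # all points have the same error here
    y = 2.0 - 0.4 * x + rng.normal(0, sigma, x.size)

    xbar, ybar = x.mean(), y.mean()
    x2bar, xybar = (x**2).mean(), (x * y).mean()
    b = (xybar - xbar * ybar) / (x2bar - xbar**2)      # equation 8.125
    a = ybar - b * xbar

    def var_y0(x0):                                    # equation 8.129
        return sigma**2 / len(x) * (1.0 + (x0 - xbar)**2 / (x2bar - xbar**2))

    print(a, b)
    for x0 in (xbar, 0.0, 12.0):                       # error smallest at xbar, larger on extrapolation
        print(x0, a + b * x0, np.sqrt(var_y0(x0)))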

A Polynomial Fit
To fit a parabola
y = a0 + a1x + a2x²

the matrix H is the n×3 matrix whose i-th row is ( 1  xi  xi² ). Assuming that all the σi
are equal, equation 8.112 becomes

      ( â0 )   ( Σ1     Σxi    Σxi² )⁻¹ ( Σyi    )
θ̂  = ( â1 ) = ( Σxi    Σxi²   Σxi³ )   ( Σxiyi  )
      ( â2 )   ( Σxi²   Σxi³   Σxi⁴ )   ( Σxi²yi )

with the sums running over i = 1, ..., n.

The extension to higher order polynomials is obvious. Unfortunately, there is no


simple method to invert such matrices, even though the form of the matrix appears

very regular and symmetric. Numerical inversion suffers from rounding errors when
the order of the polynomial is greater than six or seven.
One can hope to mitigate these problems by choosing a set of orthogonal poly-
nomials, e.g., Legendre or Tchebycheff (Chebyshev) polynomials, instead of powers
of x. The off-diagonal terms then involve products of orthogonal functions summed
over the events. The expectation of such products is zero, and hence the sum of
their products over a large number of events should be nearly zero. The matrix is
then nearly diagonal and less prone to numerical problems.
Even better is to find functions which are exactly orthogonal over the measured
data points, i.e., functions, ξ, for which
Σ_{i=1}^n ξj(xi) ξk(xi) (V⁻¹)ii = δjk

The matrix which has to be inverted, H T V −1 H, is then simply the unit matrix. An
additional feature of such a parametrization is that the estimates of the parameters
are independent; the covariance matrix for the parameters is diagonal. Such a set of
functions can always be found, e.g., using Schmidt’s orthogonalization method∗ or,
more simply, using Forsythe’s method.40 Its usefulness is limited to cases where we
are merely seeking a parametrization of the data (for the purpose of interpolation
or extrapolation) rather than seeking to estimate the parameters of a theoretical
model.

8.5.6 Constraints in the linear model


If the parameters to be estimated are constrained, we can, as in the maximum
likelihood case (section 8.4.8), try to write the model in terms of new parameters
which are unconstrained. Alternatively, we can use the more general method of
Lagrangian multipliers, which we will now discuss for least squares fits.
Suppose the model is y = H θ + ǫ for n observations yi and k parameters θj . The
Hij may take any form, e.g., Hij = xi^(j−1) for a polynomial fit to the observations yi
taken at points xi as in the previous section.
Suppose that the deviations ǫi have covariance matrix V and that the parameters
θ are subject to m linear constraints,

Σ_{j=1}^k ℓij θj = Ri ,   i = 1, . . . , m   (8.130)

or, in matrix notation,


Lθ = R (8.131)


∗ See, e.g., Margenau & Murphy39.

The least squares estimate of θ is then found using an m-component vector of Lagrangian
multipliers, 2λ, by finding the extremum of

Q² = (y − H θ)ᵀ V⁻¹ (y − H θ) + 2λᵀ (L θ − R)   (8.132)

where the first term is the usual Q2 and the second term represents the constraints.
Differentiating with respect to θ and with respect to λ, respectively, yields the
normal equations

HᵀV⁻¹H θ̂ + Lᵀ λ̂ = HᵀV⁻¹ y   (8.133a)
L θ̂ = R   (8.133b)

which can be combined to give

( C   Lᵀ ) ( θ̂ )   =   ( S )
( L   0  ) ( λ̂ )       ( R )   (8.134)

where

C = H T V −1 H (8.135)
S = H T V −1 y (8.136)

Assuming that both C and L C −1 LT can be inverted, the normal equations can be
solved for θ̂ and λ̂ giving3–5
    
( θ̂ )   =   ( F   Gᵀ ) ( S )
( λ̂ )       ( G   E  ) ( R )   (8.137)

where∗
 −1
W = ( L C⁻¹ Lᵀ )⁻¹   (8.138)
F = C⁻¹ − C⁻¹ Lᵀ W L C⁻¹ = ( 1 − C⁻¹ Lᵀ W L ) C⁻¹   (8.139)
G = W L C⁻¹   (8.140)
E = −W   (8.141)

The solutions can then be written

θ̂ = F S + Gᵀ R = F HᵀV⁻¹ y + Gᵀ R   (8.142)
λ̂ = G S + E R = G HᵀV⁻¹ ǫ   (8.143)


∗ Note that Eadie et al.4 contains a misprint in these equations.

The covariance matrix can be shown3–5 to be given by


$$V\!\left[\hat\theta\right] = F \qquad (8.144)$$
$$V\!\left[\hat\lambda\right] = W \qquad (8.145)$$
$$\mathrm{cov}\!\left(\hat\theta,\hat\lambda\right) = 0 \qquad (8.146)$$

In the unconstrained case the solution was $\hat\theta = C^{-1}S$ with covariance matrix $V[\hat\theta] = C^{-1}$. These results are recovered from the above equations by setting terms
involving L or R to zero. From equations 8.139 and 8.144 we see that the constraints
reduce the variance of the estimators, as should be expected since introducing con-
straints adds information. We also see that the constraints introduce (additional)
correlations between the θ̂i.
It can be shown4,5 that the θ̂ are unbiased, and that $E[\hat\lambda] = 0$, as expected.

8.5.7 Improved measurements through constraints


An important use of constraints in the linear model is to improve measurements.
As an example, suppose that one measures the three angles of a triangle. We know,
of course, that the sum of the three angles must be 180◦ . However, because of the
resolution of the measuring apparatus, it probably will not be. In particle physics
one often applies the constraints of energy and momentum conservation to the
measurements of the energies and momenta of particles produced in an interaction.∗
By using this knowledge we can obtain improved values of the measurements.
To do this, we make use of the linear model with constraints as developed in the
previous section. We assume that there is just one measurement of each quantity.
If there is more than one, they can be averaged and the average used in the fit. The
model is here the simplest imaginable, y = θ, i.e., what we want to estimate is the
response variable itself. The measurements are then described by (equation 8.103)
$$y_i = \sum_{j=1}^{n}\theta_j\,\delta_{ij} + \epsilon_i = \theta_i + \epsilon_i$$

Thus the matrix H is just the unit matrix, and, in the absence of constraints, the
normal equations have the trivial (and obvious) solution (equation 8.112)
$$\hat\theta = \left(H^T V^{-1} H\right)^{-1} H^T V^{-1} y = y$$

The best value of a measurement (θ̂i ) is just the measurement itself (yi ).
With m linear constraints (equation 8.130 or 8.131) the solution follows imme-
diately from the previous section by setting H = 1. The improved values of the
measurements are then the θ̂i . Note that the constraints introduce a correlation
between the measurements.

In particle physics this procedure is known as kinematical fitting since the constraints usually
express the kinematics of energy and momentum conservation.
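To make the procedure concrete, here is a minimal sketch in Python (numpy only; the angle values and errors are invented for illustration) that applies the constrained least-squares formulae of the previous section, with H = 1, to three measured angles of a triangle constrained to sum to 180°.

```python
import numpy as np

# Measured angles (degrees) and their standard deviations -- invented numbers.
y = np.array([63.0, 34.0, 85.0])          # measurements (sum = 182, not 180)
sigma = np.array([1.0, 2.0, 1.5])         # measurement errors
V = np.diag(sigma**2)                     # covariance matrix of the measurements

# Model y = H*theta with H = 1 (we estimate the measurements themselves),
# subject to the single linear constraint L*theta = R, i.e. the sum of angles = 180.
H = np.eye(3)
L = np.ones((1, 3))
R = np.array([180.0])

Vinv = np.linalg.inv(V)
C = H.T @ Vinv @ H                        # eq. 8.135
S = H.T @ Vinv @ y                        # eq. 8.136
Cinv = np.linalg.inv(C)
W = np.linalg.inv(L @ Cinv @ L.T)         # eq. 8.138
F = Cinv - Cinv @ L.T @ W @ L @ Cinv      # eq. 8.139 = covariance of theta-hat
G = W @ L @ Cinv                          # eq. 8.140

theta_hat = F @ S + G.T @ R               # eq. 8.142: the improved angles
print("improved angles:", theta_hat, " sum =", theta_hat.sum())
print("covariance of improved angles:\n", F)
```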

8.5.8 Linear Model with errors in both x and y


So far we have considered the xi to be known exactly. Now let us drop this restriction
and allow the xi as well as the yi to have errors: σx i and σy i , respectively.

Straight-line fit
We begin by treating the case of a straight-line fit, y = a + bx, from section 8.5.5.
As before, we take Q² as the sum of the squares of the distances between the fit line and the measured point, scaled by the error on this distance. However, this distance is not unique. This is illustrated in the figure, where the ellipse indicates the errors on x_i and y_i. For a point on the line, P_j, the distance to the measured point D is P_jD, and the error is the distance along this line from the point D to the error ellipse, R_jD:

[Figure: the measured point D = (x_i, y_i) with its error ellipse, the fitted line, points P_j on the line, and the intersections R_j of the segments P_jD with the ellipse.]

$$d_j = \frac{P_j D}{R_j D}$$

Since we want the minimum of Q2 , we also want to take the minimum of the dj ,
i.e., the minimum of
$$d_i^2 = \frac{(x - x_i)^2}{\sigma_{x_i}^2} + \frac{(y - y_i)^2}{\sigma_{y_i}^2} \qquad (8.147)$$
where we have assumed that the errors on xi and yi are uncorrelated. Substituting
y = a + bx and setting $\mathrm{d}d_i/\mathrm{d}x = 0$ results in the minimum distance being given by

$$d_{i\,\mathrm{min}}^2 = \frac{(y_i - a - bx_i)^2}{\sigma_{y_i}^2 + b^2\sigma_{x_i}^2}$$

This same result can be found by taking the usual definition of the distance,
$$d_i^2 = \left(\frac{y_i - y(x_i)}{\sigma_i}\right)^2 = \left(\frac{y_i - a - bx_i}{\sigma_i}\right)^2$$

where σi is no longer just the error on yi , σy i , but is now the error on yi − a − bxi
and is found by error propagation to be

$$\sigma_i^2 = \sigma_{y_i}^2 + b^2\sigma_{x_i}^2$$

Here the error propagation is exact since yi − a − bxi is linear in xi .



We must now find the minimum of


$$Q^2 = \sum_{i=1}^{n}\frac{(y_i - a - bx_i)^2}{\sigma_{y_i}^2 + b^2\sigma_{x_i}^2} \qquad (8.148)$$
The easiest method is to program it and use a minimization program. However, let's see how far we can get analytically.
Differentiating with respect to a gives
$$\frac{\partial Q^2}{\partial a} = -2\sum_{i=1}^{n}\frac{y_i - a - bx_i}{\sigma_{y_i}^2 + b^2\sigma_{x_i}^2}$$

Setting this to zero and solving for a results in


$$\hat a = \frac{\displaystyle\sum_{i=1}^{n}\frac{y_i - \hat b x_i}{\sigma_{y_i}^2 + \hat b^2\sigma_{x_i}^2}}{\displaystyle\sum_{i=1}^{n}\frac{1}{\sigma_{y_i}^2 + \hat b^2\sigma_{x_i}^2}}$$

We note that if all σx i = 0 this reduces to the expression found in section 8.5.5.
Unfortunately, the differentiation with respect to b is more complicated. In practice
it is most easily done numerically by choosing a series of values for b̂, calculating â
from the above formula and using these values of â and b̂ to calculate Q2 , repeating
the process until the minimum Q2 is found.
The errors on â and b̂ are most easily found from the condition that Q2 −Q2min = 1
corresponds to one standard deviation errors.
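The scan over b̂ described above is straightforward to program. The following sketch (Python with numpy; the data points and errors are invented) computes â analytically for each trial b̂, scans Q² on a grid, and reads off a rough one-standard-deviation interval from Q² − Q²min ≤ 1.

```python
import numpy as np

# Invented data with errors on both coordinates.
x  = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y  = np.array([1.1, 2.9, 3.8, 5.2, 5.9])
sx = np.full_like(x, 0.2)
sy = np.full_like(y, 0.3)

def a_hat(b):
    """Analytic a-hat for a given b (weighted mean of y_i - b*x_i)."""
    w = 1.0 / (sy**2 + b**2 * sx**2)
    return np.sum(w * (y - b * x)) / np.sum(w)

def Q2(b):
    a = a_hat(b)
    return np.sum((y - a - b * x)**2 / (sy**2 + b**2 * sx**2))

# Scan b on a grid and pick the minimum; refine the grid as needed.
b_grid = np.linspace(0.0, 2.0, 2001)
q = np.array([Q2(b) for b in b_grid])
b_best = b_grid[np.argmin(q)]
print("b-hat =", b_best, " a-hat =", a_hat(b_best), " Q2_min =", q.min())

# Rough 1-sigma interval on b from the condition Q2 - Q2_min <= 1.
inside = b_grid[q <= q.min() + 1.0]
print("b interval from Q2 - Q2_min <= 1:", inside.min(), "to", inside.max())
```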
If all σx i are the same and also all σy i are the same, the situation simplifies
considerably. The above expression for â becomes
â = ȳ − b̂x̄ (8.149)
and differentiation with respect to b leads to
$$\frac{\partial Q^2}{\partial b} = -2\sum_{i=1}^{n}\left[\frac{x_i\,(y_i - \hat a - \hat b x_i)}{\sigma_y^2 + \hat b^2\sigma_x^2} + \frac{\hat b\,\sigma_x^2\,(y_i - \hat a - \hat b x_i)^2}{\left(\sigma_y^2 + \hat b^2\sigma_x^2\right)^2}\right] = 0$$
∂b i=1 σy + b σx σy2 + b2 σx2
Substituting the expression for â into this equation then yields
b̂2 σx2 ∆xy − b̂ (σx2 ∆y 2 − σy2 ∆x2 ) − σy2 ∆xy = 0 (8.150)
where
∆x2 = x2 − x̄2
∆y 2 = y 2 − ȳ 2
∆xy = xy − x̄ȳ
This is a quadratic equation for b̂. Of the two solutions it turns out that the one
with a negative sign before the square root gives the minimum Q2 ; the one with
the plus sign gives the maximum Q2 of all straight lines passing through the point
(x̄, ȳ). We note that these solutions for â and b̂ reduce to those found in section
8.5.5 when there is no uncertainty on x (σx = 0).

In general
Now let us consider a more complicated case. Let us represent a data point by the vector $z_i = \binom{x_i}{y_i}$. If the model is a more complicated function than a straight line, or if there is a non-zero correlation between $x_i$ and $y_i$, the distance measure $d_i$ defined in equation 8.147 becomes

$$d_i^2 = (z_i^c - z_i)^T\,V_i^{-1}\,(z_i^c - z_i)$$
where $V_i$ is the covariance matrix for data point i,
$$V_i = \begin{pmatrix}\sigma_{x_i}^2 & \mathrm{cov}(x_i, y_i)\\ \mathrm{cov}(x_i, y_i) & \sigma_{y_i}^2\end{pmatrix}$$
and the point on the curve closest to $z_i$ is represented by $z_i^c = \binom{x_i^c}{y_i^c}$. The components of $z_i^c$ are related by the model: $y_i^c = H^T(x_i^c)\,\theta$, which can be regarded as constraints for the minimization of Q². We then use Lagrangian multipliers and
minimize
$$Q^2 = \sum_{i=1}^{n}\left[(z_i^c - z_i)^T\,V_i^{-1}\,(z_i^c - z_i) + \lambda_i\left(y_i^c - H^T(x_i^c)\,\theta\right)\right] \qquad (8.151)$$

with respect to the unknowns:

k parameters θ
n unknowns $x_i^c$
n unknowns $y_i^c$
n unknowns $\lambda_i$

by setting the derivatives of Q2 with respect to each of these unknowns equal to


zero. The solution of these 3n + k equations is usually quite messy and a numerical
search for the minimum Q2 is more practical.

8.5.9 Non-linear Models


For simplicity we again assume that the xi are exactly known.
If the deviations of the measurements yi from the true value y(xi ) are normally
distributed, the likelihood function is
 !2 
n
Y 1 1 yi − y(xi ; θ) 
L(y; θ) = √ exp −
i=1 2πσi 2 σi
 !2 
n
X
n − ln σi −
1 yi − y(xi ; θ) 
ℓ = ln L = − ln(2π) +
2 i=1 2 σi

and L is maximal when
$$Q^2 = \sum_{i=1}^{n}\left(\frac{y_i - y(x_i;\theta)}{\sigma_i}\right)^2$$

is minimal. Thus the least squares method yields the same estimates as the maxi-
mum likelihood method, and accordingly has the same desirable properties.
When the deviations are not normally distributed, the least squares method may
still be used, but it does not have such general optimal properties as to be useful for
small n. Even asymptotically, the estimators need not be of minimum variance.4, 5
In practice, the minimum of Q2 is usually most easily found numerically using
a search program such as MINUIT. However, an iterative solution3, 6 of the normal
equations (subject to constraints) may yield considerable savings in computer time.
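As an illustration of such a numerical search (here scipy.optimize.minimize stands in for a search program like MINUIT; the exponential model and the data are invented), a minimal sketch:

```python
import numpy as np
from scipy.optimize import minimize

# Invented measurements y_i at points x_i with errors sigma_i.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([7.8, 6.1, 4.9, 3.7, 3.1, 2.4])
sigma = np.full_like(y, 0.3)

def model(x, theta):
    # Model non-linear in its parameters: y(x; theta) = theta0 * exp(-theta1 * x)
    return theta[0] * np.exp(-theta[1] * x)

def Q2(theta):
    return np.sum(((y - model(x, theta)) / sigma)**2)

result = minimize(Q2, x0=[10.0, 0.5])     # numerical search for the minimum of Q2
theta_hat = result.x
# For Gaussian errors, cov(theta-hat) ~ 2 * (Hessian of Q2)^(-1); here we use the
# BFGS approximation to the inverse Hessian, so this is only a rough error estimate.
cov = 2.0 * result.hess_inv
print("theta-hat =", theta_hat)
print("approximate covariance:\n", cov)
print("Q2_min =", result.fun)
```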

8.5.10 Summary
The most important properties of the least squares method are

• In the linear model, it follows from the Gauss-Markov theorem that least
squares estimators have optimal properties: If the measurement errors have
zero expectation and finite, fixed variance, then the least squares estimators
are unbiased and have the smallest variance of all linear, unbiased estimators.

• If the errors are Gaussian, least squares estimators are the same as maximum
likelihood estimators.

• If the errors are Gaussian, the minimum value of Q2 provides a test of the
validity of the model, at least in the linear model (cf. sections 10.4.3 and
10.6.3).

• If the model is non-linear in the parameters and the errors are not Gaussian,
the least squares estimators usually do not have any optimal properties.

The least squares method discussed so far does not apply to histograms or other
binned data. Fitting to binned data is treated in section 8.6.

8.6 Estimators for binned data


The methods of parameter estimation treated so far were developed and applied
either to points (events) sampled from some p.d.f. (moments and maximum likeli-
hood) or to measurements, i.e., the results of some previous analysis (least squares
and maximum likelihood). Here we want to apply a least squares method to a
sample of events in order to estimate parameters of the underlying p.d.f., much as
we did with maximum likelihood.

8.6.1 Minimum Chi-Square


The astute reader will have noticed that the least squares method requires measure-
ments yi with variance V [yi ] for values xi of the predictor variable. What do we

do when the data are simply observations of the values of x for a sample of events?
This was easily treated in the maximum likelihood method. For a least squares
type of estimator we must transform this set of observations into estimates of y at
various values of x.
To do this we collect the observations into mutually exclusive and exhaustive
classes defined with respect to the variable x. (The extension to more than one
variable is straightforward.) An example of such a classification is a histogram and
we shall sometimes refer to the classes as bins, but the concept is more general than
a histogram. Assume that we have k classes and let πi be the probability, calculated
from the assumed p.d.f., that an observation falls in the ith class. Then
$$\sum_{i=1}^{k}\pi_i = 1$$

and the distribution of observations among the classes is a multinomial p.d.f. Let
n be the total number of observations and ni the number of observations in the ith
class. Then pi = ni /n is the fraction of observations in the ith class.
The minimum chi-square method consists of minimizing Pearson’s53 χ2 , which
we refer to here as Q21 ,
$$Q_1^2 = n\sum_{i=1}^{k}\frac{(p_i - \pi_i)^2}{\pi_i} = \sum_{i=1}^{k}\frac{(n_i - n\pi_i)^2}{n\pi_i} = n\left(\sum_{i=1}^{k}\frac{p_i^2}{\pi_i} - 1\right) \qquad (8.152)$$

The estimators θ̂j are then the solutions of


$$\frac{\partial Q_1^2}{\partial\theta_j} = \sum_{i=1}^{k}\frac{\partial Q_1^2}{\partial\pi_i}\frac{\partial\pi_i}{\partial\theta_j} = -n\sum_{i=1}^{k}\left(\frac{p_i}{\pi_i}\right)^2\frac{\partial\pi_i}{\partial\theta_j} = 0 \qquad (8.153)$$

This appears rather similar to the usual least squares method. The ‘measure-
ment’ is now the observed number of events in a bin, and the model is that there
should be nπi events in the bin. Recall (section 3.3) that the multinomial p.d.f.
has for the ith bin the expectation µi = nπi and variance σi2 = nπi (1 − πi ). For a
large number of bins, each with small probability πi , the variance is approximately
σi2 = nπi and the covariances, cov(ni , nj ) = −nπi πj , i ≠ j, are approximately zero.
The ‘error’ used in equation 8.152 is thus that expected from the model and is
therefore a function of the parameters. In the least squares method we assumed,
as a condition of the Gauss-Markov theorem, that σi2 was fixed. Since that is here
not the case, the Gauss-Markov theorem does not apply to minimum χ2 .
This use of the error expected from the model may seem rather surprising, but
nevertheless this is the definition of Q21 . We note that in least squares the error
was actually also an expected error, namely the error expected from the measuring
apparatus, not the error estimated from the measurement itself.

In practice, Q21 may be difficult to minimize owing to the dependence of the


denominator on the parameters. This consideration led to the modified minimum
chi-square method where one minimizes Q22 (Neyman’s χ2 ), which is defined using
an approximation of the observed, i.e., estimated, error, σi2 ≈ ni , which is valid for
large ni :
$$Q_2^2 = n\sum_{i=1}^{k}\frac{(p_i - \pi_i)^2}{p_i} = \sum_{i=1}^{k}\frac{(n_i - n\pi_i)^2}{n_i} = n\left(\sum_{i=1}^{k}\frac{\pi_i^2}{p_i} - 1\right) \qquad (8.154)$$

The estimators θ̂j are then the solutions of


$$\frac{\partial Q_2^2}{\partial\theta_j} = \sum_{i=1}^{k}\frac{\partial Q_2^2}{\partial\pi_i}\frac{\partial\pi_i}{\partial\theta_j} = 2n\sum_{i=1}^{k}\frac{\pi_i}{p_i}\frac{\partial\pi_i}{\partial\theta_j} = 0 \qquad (8.155)$$

From the approximations involved, it is clear that neither Q2 is a true χ2 for


finite n. However, both become a χ2 (k − s) asymptotically, where s is the number
of parameters which are estimated. Also, it can be shown11 that the estimators
found by both methods are ‘best asymptotically normal’ (BAN) estimators, i.e.,
that the estimators are consistent, asymptotically normally distributed, efficient (of
minimum variance), and that $\partial\hat\theta/\partial p_i$ exists and is continuous for all i. Both Q21 and Q22
thus lead asymptotically to estimators with optimal properties.
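A minimal numerical sketch of both estimators (Python; the truncated-exponential model, the bin edges and the counts are invented) in which π_i is computed from the hypothesized p.d.f. and Q²₁ or Q²₂ is minimized over a single parameter τ:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Invented histogram of n = 200 observations of x in [0, 5]: 10 equal bins.
edges  = np.linspace(0.0, 5.0, 11)
counts = np.array([52, 38, 27, 22, 17, 13, 10, 8, 7, 6])
n = counts.sum()

def pi_bins(tau):
    """Bin probabilities pi_i for an exponential p.d.f. truncated to [0, 5]."""
    cdf = 1.0 - np.exp(-edges / tau)
    return np.diff(cdf) / cdf[-1]          # normalize to the observed range

def Q1(tau):                               # Pearson's chi-square, eq. 8.152
    pi = pi_bins(tau)
    return np.sum((counts - n * pi)**2 / (n * pi))

def Q2(tau):                               # Neyman's chi-square, eq. 8.154
    pi = pi_bins(tau)
    return np.sum((counts - n * pi)**2 / counts)

for name, Q in (("Pearson Q1^2", Q1), ("Neyman  Q2^2", Q2)):
    res = minimize_scalar(Q, bounds=(0.1, 10.0), method="bounded")
    print(name, ": tau-hat =", res.x, " chi2_min =", res.fun)
```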

8.6.2 Binned maximum likelihood


Alternatively, one can use the maximum likelihood method on the binned data.
The multinomial p.d.f. (eq. 3.3) in our present notation is

$$f = \frac{n!}{n_1!\,n_2!\cdots n_k!}\,\pi_1^{n_1}\pi_2^{n_2}\cdots\pi_k^{n_k} = n!\prod_{i=1}^{k}\frac{\pi_i^{n_i}}{n_i!}$$

Dropping factors which are independent of the parameters, the log-likelihood which
is to be maximized is given by
$$\ell = \ln L = \sum_{i=1}^{k} n_i\ln\pi_i \qquad (8.156)$$

Note that in the limit of zero bin width this is identical to the usual log-likelihood
of equation 8.57. The estimators θ̂j are the values of θ for which ℓ is maximum and
are given by
$$\frac{\partial\ell}{\partial\theta_j} = \sum_{i=1}^{k} n_i\frac{\partial\ln\pi_i}{\partial\pi_i}\frac{\partial\pi_i}{\partial\theta_j} = n\sum_{i=1}^{k}\frac{p_i}{\pi_i}\frac{\partial\pi_i}{\partial\theta_j} = 0 \qquad (8.157)$$

These maximum likelihood estimators are also BAN.


This formulation assumes that the total number of observations, $n = \sum n_i$, is
fixed, as did the minimum chi-square methods of the previous section. If this is
not the case, the binned maximum likelihood method is easily extended. As in
section 8.4.7, the joint p.d.f. is multiplied by a Poisson p.d.f. for the total number
of observations. Equivalently (cf. exercise 13), we can write the joint p.d.f. as a
product of k Poisson p.d.f.’s:
$$f = \prod_{i=1}^{k}\frac{\nu_i^{n_i}\,e^{-\nu_i}}{n_i!}$$
where νi is the expected number of observations in bin i. This leads to
$$\ell_E = \sum_{i=1}^{k} n_i\ln\nu_i - \sum_{i=1}^{k}\nu_i \qquad (8.158)$$
In terms of the present notation, $\nu_i = \nu_{\mathrm{tot}}\,\pi_i$. But now $\nu_{\mathrm{tot}} = \sum\nu_i$ is not necessarily
equal to n.
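The extended binned log-likelihood ℓ_E of equation 8.158 can be maximized numerically in the same way; a minimal sketch (Python, reusing the invented histogram of the previous example, with ν_tot and τ both free):

```python
import numpy as np
from scipy.optimize import minimize

# Invented histogram: 10 equal bins on [0, 5].
edges  = np.linspace(0.0, 5.0, 11)
counts = np.array([52, 38, 27, 22, 17, 13, 10, 8, 7, 6])

def nu_bins(params):
    """Expected bin contents nu_i = nu_tot * pi_i for a truncated exponential."""
    nu_tot, tau = params
    cdf = 1.0 - np.exp(-edges / tau)
    pi = np.diff(cdf) / cdf[-1]
    return nu_tot * pi

def neg_lE(params):
    """Negative of the extended binned log-likelihood, eq. 8.158."""
    nu = nu_bins(params)
    return -(np.sum(counts * np.log(nu)) - np.sum(nu))

res = minimize(neg_lE, x0=[counts.sum(), 1.5], method="L-BFGS-B",
               bounds=[(1.0, None), (0.1, 10.0)])
print("nu_tot-hat, tau-hat =", res.x)
```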

8.6.3 Comparison of the methods


Asymptotically, all three of these methods are equivalent. How do we decide which
one to use? In a particular problem, one method could be easier to compute.
However, given the computer power most physicists have available, this is seldom
a problem. The question is then which method has the best behavior for finite n.

• Q21 requires a large number of bins with small πi for each bin in order to
neglect the correlations and to approximate the variance by nπi . Assuming
that the model is correct, this will mean that all ni must be small.

• In addition, Q22 requires all ni to be large in order that ni be a good estimate
of the variance. Thus the ni must be neither too large nor too small. In
particular, an ni = 0 causes Q22 to blow up.

• The binned maximum likelihood method does not suffer from such problems.

In view of the above, it is perhaps not surprising that the maximum likelihood
method usually converges faster to efficiency. In this respect the modified minimum
chi-square (Q22 ) is usually the worst of the three methods.11
One may still choose to minimize Q21 or Q22 , perhaps because the problem is linear so that the equations $\partial Q^2/\partial\theta_j = 0$ can be solved simply by a matrix inversion
instead of a numerical minimization. One must then ensure that there are no small
ni , which in practice is usually taken to mean that all ni must be greater than 5 or
10. Usually one attains this by combining adjacent bins. However, one can just as
well combine non-adjacent ones. Nor is there any requirement that all bin widths
be equal. One must simply calculate the πi properly, i.e., as the integral of the

p.d.f. over the bin, which is not always adequately approximated by the bin width
times the value of the p.d.f. at the center of the bin.
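Computing π_i as the integral of the p.d.f. over the bin usually amounts to a difference of c.d.f. values at the bin edges. A small sketch (Python with scipy.stats; a unit Gaussian and arbitrary unequal bins, chosen only for illustration) contrasts the exact bin probability with the midpoint-times-width approximation:

```python
import numpy as np
from scipy.stats import norm

# Assumed example: Gaussian p.d.f. with mu = 0, sigma = 1 and unequal bins.
edges = np.array([-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0])

# Exact bin probabilities: integral of the p.d.f. over each bin.
pi_exact = np.diff(norm.cdf(edges))

# Approximation: bin width times the p.d.f. at the bin center.
centers = 0.5 * (edges[:-1] + edges[1:])
pi_approx = np.diff(edges) * norm.pdf(centers)

for lo, hi, pe, pa in zip(edges[:-1], edges[1:], pi_exact, pi_approx):
    print(f"bin [{lo:5.1f},{hi:5.1f}]  exact = {pe:.4f}  midpoint approx = {pa:.4f}")
```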
Since the maximum likelihood method is usually preferred, we can ask why we
bin the data at all. Although binning is required in order to use a minimum chi-
square method, we can perfectly well do a maximum likelihood fit without binning.
Although binning loses information, it may still be desirable in the maximum like-
lihood method in order to save computing time when the data sample is very large.
In choosing the bin sizes one should pay particular attention to the amount of in-
formation that is lost. Large bins lose little information in regions where the p.d.f.
is nearly constant. Nor is much information lost if the bin size is small compared to
the experimental resolution in the measurement of x. It would seem best to try to
have the information content of the bins approximately equal. However, even with
this criterion the choice of binning is not unique. It is then wise to check that the
results do not depend significantly on the binning.

“There are nine and sixty ways of constructing tribal lays.


And – every – single – one – of – them – is – right!”
—Rudyard Kipling

8.7 Practical considerations

In this section we try to give some guidance on which method to use and to treat
some complications that arise in real life.

8.7.1 Choice of estimator


Criteria
Faced with different methods which lead to different estimators we must decide
which estimator to use. Eadie et al.4 and James5 give the following order of impor-
tance of various criteria for the estimators:

1. Consistency. The estimator should converge to the true value with increasing
numbers of observations. If this is not the case, a procedure to remove the
bias should be applied.

2. Minimum loss of information. When an estimator summarizes the results of


an experiment in a single number, it is of vital interest to subsequent users of
the estimate that no other number could contain more information about the
parameter of interest.

3. Minimum variance (efficiency). The smaller the variance of the estimator, the
more certain we are that it is near the true value of the parameter (assuming
it is unbiased).

4. Robustness. If the p.d.f. is not well known, or founded on unsafe assumptions,


it is desirable that the estimate be independent of, or insensitive to, departures
from the assumed p.d.f. In general, the information content of such estimates
is less since one chooses to ignore the information contained in the form of
the p.d.f.

5. Simplicity. When a physicist reads the published value of some parameter,


he usually presumes that the estimate of the parameter is unbiased, normally
distributed, and uncorrelated with other estimates. It is therefore desirable
that estimators have these simple properties. If the estimate is not simple,
it should be stated how it deviates from simplicity and not given as just a
number ± an error.

6. Minimum computer time. Although not fundamental, this may be of practical


concern.

7. Minimum loss of physicist’s time. This is also not fundamental; its importance
is frequently grossly overestimated.

Compromising between these criteria


The order of the desirable properties above reflects a general order of importance.
However, in some situations a somewhat different order would be better. For ex-
ample, the above list places more importance on minimum loss of information than
on minimum variance. These two criteria are related. The minimum variance is
bounded by the inverse of the information. However this limit is not always attain-
able. In such cases it is possible that two estimates t1 and t2 of θ are such that
I2 (θ) > I1 (θ) but V [t1 ] < V [t2 ]. The recommendation here is to choose t2 , the esti-
mate with the greater information. The reason is that, having more information, it
will be more useful later when the result of this experiment is combined with results
of other experiments. On the other hand, if decisions must be made, or conclusions
drawn, on the basis of just this one experiment, then it would be better to choose
t1 , the estimate with the smaller variance.

Obtaining simplicity
It may be worth sacrificing some information to obtain simplicity.
Estimates of several parameters can be made uncorrelated by diagonalizing the
covariance matrix and finding the corresponding linear combinations of the param-
eters. But the new parameters may lack physical meaning.
Techniques for bias removal will be discussed below (section 8.7.2).

When sufficient statistics exist, they should be used, since they can be estimated
optimally (cf. section 8.2.8).
Asymptotically, most usual estimators are unbiased and normally distributed.
The question arises how good the asymptotic approximation is in any specific case.
The following checks may be helpful:
• Check that the log-likelihood function or χ2 is a parabolic function of the
parameters.
• If one has two asymptotically efficient estimators, check that they give con-
sistent results. An example is the minimum chi-square estimate from two
different binnings of the data.
• Study the behavior of the estimator by Monte Carlo techniques, i.e., make
a large number of simulations of the experiment and apply the estimator to
each Monte Carlo simulation in order to answer questions such as whether the
estimate is normally distributed. However, this can be expensive in computer
time.
A change of parameters can sometimes make an estimator simpler. For instance
the estimate of θ2 = g(θ1 ) may be simpler than the estimate of θ1 . However, it is in
general impossible to remove both the bias and the non-normality of an estimator
in this way4, 5 .

Economic considerations
Economy usually implies fast computing. Optimal estimation is frequently iterative,
requiring much computer time. The following three approaches seek a compromise
between efficiency (minimum variance) and economic cost.
• Linear methods. The fastest computing is offered by linear methods, since
they do not require iteration. These methods can be used when the expected
values of the observations are linear functions of the parameters. Among linear
unbiased estimators, the least squares method is the best, which follows from
the Gauss-Markov theorem (section 8.5.4).
When doing empirical fits, rather than fits to a known (or hypothesized) p.d.f.,
choose a p.d.f. from the exponential family (section 8.2.7) if possible. This
leads to easy computing and has optimal properties.
• Two-step methods. Some computer time can be saved by breaking the prob-
lem into two steps:
1. Estimate the parameters by a simple, fast, inefficient method, e.g., the
moments method.
2. Use these estimates as starting values for an optimal estimation, e.g.,
maximum likelihood.

Although more physicist’s time may be spent in evaluating the results of the
first step, this might also lead to a better understanding of the problem.

• Three-step method.

1. Extract from the data a certain number of statistics which summarize


the observations compactly, and if possible in a way which increases in-
sight into the problem. For example, one can make a histogram, which
reduces the number of observations to the number of bins in the his-
togram. Another example is the summary of an angular distribution by
the coefficients of the expansion of the distribution in spherical harmon-
ics. These coefficients are rapidly estimated by the moments method
(section 8.3.2) and their physical meaning is clear.
2. Estimate the parameters of interest using this summary data. If the
summary data have an intuitive physical meaning this estimation may
be greatly simplified.
3. Use the preliminary estimates from the second step as starting values for
an optimal estimation directly from the original data.

The third step should not be forgotten. It is particularly important when the
information in the data is small (‘small statistics’). Because of the third step,
the second step does not have to be exact, but only approximate.

8.7.2 Bias reduction


We have already given a procedure for bias reduction in section 8.3.1 for the case
of an estimator ĝ which is calculated from an unbiased estimator θ̂ by a change of
variable ĝ = g(θ̂). Now let us consider two general methods.

Exact distribution of the estimate known


If the p.d.f. of the estimator is exactly known, the bias $b = E[\hat\theta] - \theta$ can be calculated. If b does not depend on the parameters, we can use the unbiased estimator $\hat\theta' = \hat\theta - b$ instead of the biased estimator $\hat\theta$. The variances of $\hat\theta'$ and $\hat\theta$ are the same since b is exactly known.
However, b is usually not exactly known since it usually depends on some of
the parameters of the p.d.f. It must therefore be estimated. Assuming that we can
make an unbiased estimate of the bias, b̂, the unbiased estimator of the parameter
is $\hat\theta' = \hat\theta - \hat b$, which results in a larger variance for $\hat\theta'$ than for $\hat\theta$.

Exact distribution of the estimate unknown


There is a straightforward method4, 5 to use in the case that the p.d.f. is not well
known or no unbiased estimate of b is possible. Suppose that θ̂ is a biased estimator

which is asymptotically unbiased (as maximum likelihood estimators frequently


are). Express θ̂ as a power series in 1/N, where N is the number of events. The leading term is then θ, independent of N. The $N^{-1}$ term is the leading bias term. Now split the data into two samples, each of N/2 events. Let the estimates from the two N/2 samples be $\hat\theta_1$ and $\hat\theta_2$. The expectation of the above expansion will be
$$E\left[\hat\theta\right] = \theta + \frac{1}{N}\beta + O\!\left(\frac{1}{N^2}\right)$$
$$E\left[\hat\theta_1\right] = E\left[\hat\theta_2\right] = \theta + \frac{2}{N}\beta + O\!\left(\frac{1}{N^2}\right)$$
Thus,
$$E\left[2\hat\theta - \tfrac{1}{2}(\hat\theta_1 + \hat\theta_2)\right] = \theta + O\!\left(\frac{1}{N^2}\right)$$
   
and we see that we have a method to reduce the bias from $O(1/N)$ to $O(1/N^2)$. The variance is, however, in general increased by a term of order 1/N.
A generalization of this method,11, 13 known as the jackknife,∗ estimates θ by

θ̂ = N θ̂N − (N − 1)θ̂N −1 (8.159)

where θ̂N is the estimator using all N events and θ̂N −1 is the average of the N
estimates possible using N − 1 events:
$$\hat\theta_{N-1} = \sum_{i=1}^{N}\hat\theta_i / N \qquad (8.160)$$

where θ̂i is the estimate obtained using all events except event i.
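A minimal sketch (Python) of the jackknife bias correction of equation 8.159, applied to a deliberately biased estimator — the maximum-likelihood variance estimator, which divides by N rather than N − 1; the data are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=20)    # invented data, true variance = 4
N = len(x)

def biased_var(sample):
    """ML variance estimator (divides by N); biased by a term of order 1/N."""
    return np.mean((sample - np.mean(sample))**2)

theta_N = biased_var(x)
# Average of the N estimates obtained by leaving out one event at a time.
theta_N1 = np.mean([biased_var(np.delete(x, i)) for i in range(N)])

theta_jack = N * theta_N - (N - 1) * theta_N1  # eq. 8.159
print("biased (ML) estimate :", theta_N)
print("jackknife estimate   :", theta_jack)
print("usual unbiased (1/(N-1)) estimate:", np.var(x, ddof=1))
```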
A more general method, of which the jackknife is an approximation, is the
bootstrap method introduced by Efron.41–43 Instead of using each subset of N − 1
observations, it uses samples of size M ≤ N randomly drawn, with replacement,
from the data sample itself. For details, see, e.g., Reference 43.

8.7.3 Variance of estimators—Jackknife and Bootstrap


The jackknife and the bootstrap of the previous section provide methods, albeit
computer intensive, to evaluate the variance of estimators, which may be used in
situations where the usual methods are unreliable, e.g., small statistics where the
asymptotic properties of ml estimators are questionable, non-Gaussian errors in
least-squares fits, or non-linear transformations of parameters (cf. section 8.3.6).


Named after a large folding pocket knife, this procedure, like its namesake, serves as a handy
tool in a variety of situations where specialized techniques may not be available.

The jackknife estimation of the variance of an estimator θ̂ is given by43, 44

$$\hat V_J\!\left[\hat\theta\right] = \frac{N-1}{N}\sum_{i=1}^{N}\left(\hat\theta_i - \hat\theta_{N-1}\right)^2 \qquad (8.161)$$

where the notation is the same as in the previous section.


While the jackknife is often a good method to estimate the variance of an estimator, it can fail miserably when the value of the estimator does not respond smoothly to small changes in the data. An example of such an estimator is the
median. Suppose the data consist of 13 points: 5 values smaller than 9; the values
9, 11, and 13; and 5 values larger than 13. The sample median is 11. Removing any
one of the smallest 6 values results in a median of 12 (midway between 11 and 13),
while removing any one of the largest 6 values results in a median of 10. Removal
of the middle value results in 11. There are only 3 different jackknife values; the
estimate does not change smoothly with changes in the data, but only in large steps.
This failure of the jackknife can be overcome by removing more points to make the
jackknife samples. This is known as the delete-d jackknife; the interested reader
is referred, e.g., to Reference 43.
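The failure for the median can be checked directly; the sketch below (Python) uses 13 values chosen to reproduce the example in the text and evaluates equation 8.161:

```python
import numpy as np

# 13 points: 5 values below 9, then 9, 11, 13, then 5 values above 13.
x = np.array([1, 2, 3, 4, 5, 9, 11, 13, 20, 21, 22, 23, 24], dtype=float)
N = len(x)

# Leave-one-out medians: only three distinct values occur (10, 11, 12).
loo = np.array([np.median(np.delete(x, i)) for i in range(N)])
print("distinct leave-one-out medians:", sorted(set(loo)))

# Jackknife variance estimate, eq. 8.161.
V_jack = (N - 1) / N * np.sum((loo - loo.mean())**2)
print("jackknife variance of the median:", V_jack)
```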
For bootstrap sample size, M, equal to the data sample size, N, there are $N^N$ distinct samples possible, which is very large even for moderate N. Then the
bootstrap sampling distribution for an estimator is a good approximation of the true
sampling distribution, converging to it as N → ∞ under fairly general conditions.
This method is something like Monte Carlo, but uses the data themselves instead
of a known (or hypothesized) distribution. The variance of θ̂ is then estimated by
the following procedure:

1. Select B independent bootstrap samples, each consisting of N data, drawn


with replacement from the real data sample. Usually B in the range 25–200
will suffice,43 but this can be checked by repeating for larger values of B until
the improvement is negligible.

2. Evaluate θ̂ for each bootstrap sample, giving B values, θ̂b .

3. The bootstrap estimate of the variance of θ̂ is then

$$\hat V_B\!\left[\hat\theta\right] = \frac{1}{B-1}\sum_{b=1}^{B}\left(\hat\theta_b - \overline{\hat\theta_b}\right)^2 \qquad (8.162)$$
where $\overline{\hat\theta_b} = \sum_{b=1}^{B}\hat\theta_b / B$.

Note that these two methods are applicable to non-parametric estimators as well
as parametric. If the estimators are the result of a parametric fit, e.g., ml, the B
bootstrap samples can be generated from the fitted distribution function, i.e., the

parametric estimate of the population, rather than from the data. The estimation
of the variance is again given by equation 8.162.
Limitation: It should be clear that the non-parametric bootstrap will not be
reliable when the estimator depends strongly on the tail of the distribution, as is
the case, e.g., with high-order moments. A bootstrap sample can never contain
points larger than the largest point in the data.
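A minimal non-parametric bootstrap sketch (Python) following the three steps above, with the sample median as the estimator; the data sample and the choice B = 200 are invented:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=50)      # invented data sample
N = len(x)

def estimator(sample):
    return np.median(sample)                  # estimator whose variance we want

B = 200                                       # number of bootstrap samples
theta_b = np.empty(B)
for b in range(B):
    resample = rng.choice(x, size=N, replace=True)   # draw with replacement
    theta_b[b] = estimator(resample)

# Bootstrap variance estimate, eq. 8.162.
V_boot = np.sum((theta_b - theta_b.mean())**2) / (B - 1)
print("theta-hat (full sample)      :", estimator(x))
print("bootstrap variance estimate  :", V_boot)
print("bootstrap standard deviation :", np.sqrt(V_boot))
```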

8.7.4 Robust estimation


When the form of the p.d.f. is not exactly known, the following questions arise:

1. What kind of parameters can be estimated without any assumption about the
form of the p.d.f.? Such estimators are usually called ‘distribution-free’. This
term may be misleading, for although the estimate itself does not depend on
the assumption of a p.d.f., its properties, e.g., the variance, do depend on the
actual (unknown) p.d.f.

2. How reliable are the estimates if the assumed form of the p.d.f. is not quite
correct?

Center of a symmetric distribution

There is relatively little known about robust estimation. The only case treated ex-
tensively in the literature is the estimation of the center of an unknown, symmetric
distribution. The center of a distribution may be defined by a ‘location parameter’
such as the mean, the median, the mode, the midrange, etc. Several of these esti-
mators were mentioned in section 8.1. The sample mean is the most obvious and
most often used estimator of location because

• By the central limit theorem it is consistent whenever the variance of the p.d.f.
is finite.

• It is optimal (unbiased and minimum variance) when the p.d.f. is a Gaussian.

However, if the distribution is not normal, the sample mean may not be the best
estimator. For symmetric distributions of finite range, e.g., the uniform p.d.f. or a
triangular p.d.f., the location is determined by specifying the end points of the dis-
tribution. The midrange is then an excellent estimator. However, for distributions
of infinite range, the midrange is a poor estimator.
The following table4, 5 shows asymptotic efficiencies, i.e., the ratio of the min-
imum variance bound to the variance of the estimator, of location estimators for
various p.d.f.’s.

Distribution           Sample median   Sample mean   Sample midrange
Normal                  0.64            1.00          0.00
Cauchy                  0.82            0.00          0.00
Double exponential      1.00            0.50          0.00
Uniform                 0.00            0.00          1.00

None of these three estimators is asymptotically efficient for all four distribu-
tions. Nor has any of these estimators a non-zero asymptotic efficiency for all four
distributions. As an example take a distribution which is the sum of a normal
distribution and a Cauchy distribution having the same mean:

f (x) = β N(x; µ, σ 2 ) + (1 − β) C(x; µ, α) , 0≤β≤1

Because of the Cauchy admixture, the sample mean has infinite variance, as we
see in the table, while the sample median has at worst (β = 1) a variance of
1/0.64 = 1.56 times the minimum variance bound. This illustrates that the median
is generally more robust than the mean.
Other methods to improve robustness involve ‘trimming’, i.e., throwing away
the highest and lowest points before using one of the above estimators. This is
particularly useful when there are large tails which come mostly from experimental
problems. Such methods are further discussed by Eadie et al.4, 5

Center of an asymmetric distribution


Consider the estimation of the center of a narrow ‘signal’ distribution superim-
posed on an unknown but wider ‘background’ distribution. The asymmetry of the
background makes it difficult to use any of the above-mentioned estimators.
A common technique is to parametrize the signal and background in some arbi-
trary way and to do a maximum likelihood or least squares fit to obtain optimum
values of the parameters, including the location parameter of interest. This is
a non-robust method because the location estimate depends on the background
parametrization and on correlations with other parameters.
A robust technique for this problem is to estimate the mode of the observed
distribution. The mode is nearly invariant under variations of a smooth background.
An obvious way to estimate the mode is to histogram the data and take the center
of the most populated bin. Such a method depends on the binning used. A better
method is given by the following procedure: Find the two observations which are
separated by the smallest distance, and choose the one which has the closer next
nearest neighbor. The estimate of the mode is then taken as the position of this
observation. A generalization of this method is that of k nearest neighbors, where
the density of observations at a given point is estimated by the reciprocal of the
distance between the smallest and largest of the k observations closest to the point.
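Both the closest-pair estimator and its k-nearest-neighbour generalization take only a few lines; the sketch below (Python; the data — a narrow Gaussian signal on a broad exponential background — are invented) is one possible reading of the procedure described above:

```python
import numpy as np

rng = np.random.default_rng(3)
# Invented data: a narrow Gaussian 'signal' at x = 5 on a broad exponential 'background'.
data = np.concatenate([rng.normal(5.0, 0.2, size=100),
                       rng.exponential(4.0, size=400)])

def mode_closest_pair(x):
    """Closest-pair estimator: take the two closest observations and keep
    the one whose next-nearest neighbour is closer."""
    s = np.sort(x)
    i = np.argmin(np.diff(s))            # indices i, i+1 are the closest pair
    def next_nearest(c):
        d = np.sort(np.abs(x - c))
        return d[2]                      # d[0] = 0 (itself), d[1] = its partner
    return min((s[i], s[i + 1]), key=next_nearest)

def mode_knn(x, k=15):
    """k-nearest-neighbour variant: density at a point ~ 1/(spread of its k
    closest observations); the mode estimate is the point of highest density."""
    best, best_density = None, -np.inf
    for c in x:
        nbrs = x[np.argsort(np.abs(x - c))[1:k + 1]]
        density = 1.0 / (nbrs.max() - nbrs.min())
        if density > best_density:
            best, best_density = c, density
    return best

print("closest-pair mode estimate :", mode_closest_pair(data))
print("k-NN mode estimate (k = 15):", mode_knn(data))
```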

8.7.5 Detection efficiency and Weights


We are often not able to observe directly the phenomenon we wish to study. The
apparatus generally introduces some distortion or bias, the effect of which must be
taken into account. Such distortion may take the form of a detection efficiency,
i.e., the apparatus may not detect all events and the efficiency of detection may
depend on the values of the variables being measured. This problem has already
been mentioned in section 4.2.
The method used to account for this distortion depends on the severity of the
problem. If the detection efficiency varies greatly over the range of the variables, it
will be necessary to treat the problem exactly in order to avoid losing a great deal
of information. On the other hand, if the detection efficiency is nearly uniform (say
to within 20%), an approximate method will suffice.

Maximum likelihood—ideal method


As already mentioned in section 4.2, the p.d.f. of the observations is the product of
the underlying physical p.d.f. and the efficiency function. It often happens that the
physical p.d.f. can be written as the product of two p.d.f.’s where the parameters
we want to estimate occur in only one of the two. For example, consider the
production of particles in an interaction. The energies of the produced particles
will not depend on where the interaction took place. The p.d.f. is then a product of
a p.d.f. for the place where the interaction takes place and a p.d.f. for the interaction
itself. Accordingly we write the physical p.d.f. as

f (x, y; θ, ψ) = p(x; θ) q(y; ψ)


where the p.d.f.’s p and q are, as usual, normalized: $\int p\,dx = \int q\,dy = 1$. Let e(x, y)
be the detector efficiency, i.e., the p.d.f. describing the probability that an event is
observed. Then the p.d.f. of the actual observations is
$$g(x, y;\theta,\psi) = \frac{p(x;\theta)\,q(y;\psi)\,e(x, y)}{\int p(x;\theta)\,q(y;\psi)\,e(x, y)\,dx\,dy}$$

Note that the efficiency may depend on both x and y. The likelihood of a given set
of observations is then
$$L(x_1,\ldots,x_N, y_1,\ldots,y_N;\theta,\psi) = \prod_{i=1}^{N} g(x_i, y_i;\theta,\psi) = \prod_{i=1}^{N} g_i$$
Hence,
$$\ell = \ln L = W + \sum_{i=1}^{N}\ln(e_i q_i) \qquad (8.163)$$
where
$$W = \sum_{i=1}^{N}\ln p_i - N\ln\int pqe\,dx\,dy \qquad (8.164)$$
and where $p_i = p(x_i;\theta)$

Suppose now that we are not interested in estimating ψ, but only θ. Then the second
term of equation 8.163 does not depend on the parameters and may be ignored. The
estimates θ̂ and their variances are then found in the usual way treating W as the
log-likelihood.
In practice, difficulties arise when pqe is not analytically normalized, but must
be normalized numerically by time-consuming Monte Carlo. Moreover, the results
depend on the form of q, which may be poorly known and of little physical interest.
For these reasons one prefers to find a way of eliminating q from the expressions.
Since this will exclude information, it will increase the variances, but at the same
time make the estimates more robust.

Troll, to thyself be true—enough.


—Ibsen, “Peer Gynt”

Maximum likelihood—approximate method


We replace W in equation 8.164 by
$$W' = \sum_{i=1}^{N}\frac{1}{e_i}\ln p_i \qquad (8.165)$$

Intuitively, the observation of an event with efficiency $e_i$ corresponds, in some sense, to $w_i = 1/e_i$ events having actually occurred. Then the likelihood for all of the events, i.e., the one which is observed and the ones which are not, is $p_i^{w_i}$, which results in W′.

Whatever the validity of this argument, it turns out4, 5 that the estimate θ̂
obtained by maximizing W ′ is, like the usual maximum likelihood estimate, asymp-
totically normally distributed about the true value. However, care must be taken
in evaluating the variance. Using the second derivative matrix of W ′ is wrong since
it assumes that $\sum_{i=1}^{N} w_i = N$ events have been observed. One approach to curing this problem is to renormalize the weights by using $w_i' = N w_i/\sum w_i$ instead of $w_i$. However, this is only satisfactory if the weights are all nearly equal.
The correct procedure, which we will not derive, results in4, 5
$$V\!\left[\hat\theta\right] = H^{-1} H' H^{-1} \qquad (8.166)$$

where the matrices H and H ′ are given by


" ! !#
1 ∂ ln p ∂ ln p
Hjk =E
e ∂θj ∂θk

" ! !#
′ 1 ∂ ln p ∂ ln p
Hjk =E 2
e ∂θj ∂θk

which may be estimated by the sample mean:


$$\hat H_{jk} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{e_i\,p_i^2}\left(\frac{\partial p_i}{\partial\theta_j}\right)\left(\frac{\partial p_i}{\partial\theta_k}\right) \qquad (8.167a)$$
$$\hat H'_{jk} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{e_i^2\,p_i^2}\left(\frac{\partial p_i}{\partial\theta_j}\right)\left(\frac{\partial p_i}{\partial\theta_k}\right) \qquad (8.167b)$$


evaluated at θ = θ̂ . If e is constant, this reduces to the usual estimator of the
covariance matrix given in equations 8.78 and 8.80.
Alternatively, one can estimate the matrix elements from the second derivatives:
" #
∂ 2 W ′ 1 XN
1 ∂ 2 ln pi
Hjk = − ; Ĥjk = − (8.168a)
∂θj ∂θk θ=θ̂ N i=1 ei ∂θj ∂θk θ=θ̂
" #
′ 1 ∂ 2 W ′ 1 XN
1 ∂ 2 ln pi
Hjk =− ; Ĥjk = − (8.168b)
e ∂θj ∂θk θ =θ̂ N i=1 e2i ∂θj ∂θk θ=θ̂

If e is constant, this reduces to the usual estimator of the covariance matrix given
in equations 8.84 and 8.85.

To summarize: Find the estimates θ̂ by maximizing W ′ (eq. 8.165). If possible
compute H and H′ by equation 8.167 or 8.168; if the derivatives are not known analytically, use equation 8.168, evaluating $\partial^2 W'/\partial\theta_j\,\partial\theta_k$ numerically. The covariance matrix is then given by equation 8.166.
It is clear from the above formulae that the appearance of one event with a
very large weight will ruin the method, since it will cause W ′ (equation 8.165) to
be dominated by one term and will make the variance very large. Accordingly, a
better estimate may be obtained by rejecting events with very large weights.

Minimum chi-square—approximate method


Consider a histogram with k bins containing $n_i$ events in the ith bin. Suppose that a model predicts the normalization $n = \sum n_i$ as well as the shape of the distribution. Denote the expected number of events in the ith bin by
$$a_i(\theta) = A(\theta)\,\frac{\int_i pqe\,dx}{\int pqe\,dx} \qquad (8.169)$$
where $A(\theta) = \sum a_i$ is the predicted total number of events and $\int_i$ indicates an integral over bin i.

The minimum chi-square and modified minimum chi-square formulae (section 8.6.1) become
$$Q_1^2 = \sum_{i=1}^{k}\frac{(n_i - a_i)^2}{a_i}\,, \qquad \frac{\partial Q_1^2}{\partial\theta} = -\sum_{i=1}^{k}\left[\left(\frac{n_i}{a_i}\right)^2 - 1\right]\frac{\partial a_i}{\partial\theta}$$
$$Q_2^2 = \sum_{i=1}^{k}\frac{(n_i - a_i)^2}{n_i}\,, \qquad \frac{\partial Q_2^2}{\partial\theta} = -2\sum_{i=1}^{k}\left(1 - \frac{a_i}{n_i}\right)\frac{\partial a_i}{\partial\theta}$$
So far, this is exact. Now let us introduce the approximate method by removing
the dependence on q from equation 8.169. Let bi be the predicted number of events in
bin i when e = 1. We want to correct the numbers bi using the known experimental
efficiency to obtain numbers ci such that

E [ci ] = ai (8.170)

From its definition, bi is given by


$$b_i = B(\theta)\,\frac{\int_i pq\,dx}{\int pq\,dx} = A(\theta)\,\frac{\int_i pq\,dx}{\int pqe\,dx}$$
where B(θ) is the total number of events predicted when e = 1. Combining this equation with equation 8.169, we find
$$a_i = b_i\,\frac{\int_i pqe\,dx}{\int_i pq\,dx}$$

The inverse of this ratio of integrals can be rewritten as


$$\frac{\int_i pqe\,w\,dx}{\int_i pqe\,dx} = E_i[w]$$
where $w = 1/e$ is the weight. This expectation can be estimated by the sample mean of the weights of the events in the bin:
$$\widehat{E[w]}_i = \frac{\sum_{j=1}^{n_i} w_{ij}}{n_i}$$
where $w_{ij}$ is the weight (1/e) of the jth event in the ith bin.
We now define
$$c_i = \frac{b_i\,n_i}{\sum_{j=1}^{n_i} w_{ij}}$$

From the preceding equations it is clear that this ci satisfies equation 8.170.
The expressions for Q2 then use ci instead of ai . Writing σi2 for ai in the case of
Q21 and for ni in the case of Q22 , both may be written as
$$Q^2 = \sum_{i=1}^{k}\frac{1}{\sigma_i^2}\left(n_i - b_i\frac{n_i}{\sum_j w_{ij}}\right)^2 = \sum_{i=1}^{k}\frac{1}{\sigma_i'^2}\left(\sum_j w_{ij} - b_i\right)^2$$

where
$$\frac{1}{\sigma_i'^2} = \frac{1}{\sigma_i^2}\left(\frac{n_i}{\sum_j w_{ij}}\right)^2$$

The ‘error’, σi′ , is then given by


 2   2 
ni n
 X  X
i
 
σi′2 = E  wij − bi   = E  wij   − b2i
j=1 j=1

hP i
ni
since E j=1 wij = bi . Further, one can show that
 2     
n
X i ni
X ni X
X ni h i
E
 wij   = E wij2  + E 
 wij wik  2 2
 ≈ E [ni ] E wi + bi
j=1 j=1 j=1 k=1
k6=j

$E[w_i^2]$ can be estimated by the sample mean
$$\widehat{E[w^2]}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} w_{ij}^2$$

E [ni ] can be estimated in two ways: from the model, which gives the minimum
chi-square method; or from the data, which gives the modified minimum chi-square
method. The resulting expressions for Q2 are
 2 
P ni
j=1 wij − bi
Xk
 
Ed
[ni ] = ci ; Q′2
1 =

 Pni 2
w

 (8.171)
ij
i=1 bi Pni
j=1
wij
j=1
 2 
P ni
k
X  j=1 wij − bi 
Ed
[ni ] = ni ; Q′2
2 =  Pni 2  (8.172)
i=1 j=1 wij

Clearly both Q′ approach the corresponding Q as the weights all approach 1. As in


the unweighted case, the minimum chi-square method (Q1 ) is better justified than
the modified minimum chi-square method (Q2 ). However, if bi is a linear function
of the parameters, the solution of the modified method can in principle be written
explicitly, which is much faster than a numerical minimization.

But who can discern his errors?


Clear thou me from hidden faults.
—Psalm 19.12

8.7.6 Systematic errors


If a meter has a random error, then its readings are distributed in some way about
the true value. If the error distribution is not specified further, you expect it to

be Gaussian. Thus if it is simply stated that the error is 1%, you expect that this
distribution will be a Gaussian distribution with a standard deviation of 1% of the
true value. The standard deviation of a single reading will be 1% of that reading.
But by making many (N) readings and averaging them, you obtain an estimate of
the true value which has a much smaller variance. Usually, the variance is reduced
by a factor 1/N, which follows from the central limit theorem.
If the meter has a systematic error such that it consistently reads 1% too high,
the situation is different. The readings are thus correlated. Averaging a large
number of readings will not decrease this sort of error, since it affects all the readings
in the same way. With more readings, the average will not converge to the true
value but to a value 1% higher. It is as though we had a biased estimator.
Systematic errors can be very difficult to detect. For example, we might measure
the voltage across a resistor for different values of current. If the systematic error
was 1 Volt, all the results would be shifted by 1 Volt in the same direction. If we
plotted the voltages against the currents, we would find a straight line, as expected.
However, the line would not pass through the origin. Thus, we could in principle
discover the systematic effect. On the other hand, with a systematic error of 1% on
the voltage, all points would be shifted by 1% in the same direction. The voltages
plotted against the currents would lie on a straight line and the line would pass
through the origin. The voltages would thus appear to be correctly measured.
But the slope of the line would be incorrect. This is the worst kind of systematic
error—one which cannot be detected statistically. It is truly a ‘hidden fault’.
The size of a systematic error may be known. For example, consider temperature
measurements using a thermocouple. You calibrate the thermocouple by measuring
its output voltages V1 and V2 for two known temperatures, T1 and T2 , using a volt-
meter of known resolution. You then determine some temperatures T by measuring
voltages V and using the proportionality of V to T to calculate T :
$$T = \frac{T_2 - T_1}{V_2 - V_1}\,(V - V_1) + T_1$$
The error on T will include a systematic contribution from the errors on V1 and V2
as well as a random error on V . In this example the systematic error is known.
In other cases the size of the systematic error is little more than a guess. Suppose
you are studying gases at various pressures and you measure the pressure using a
mercury manometer. Actually it only measures the difference in pressure between
atmospheric pressure and that in your vessel. For the value of the atmospheric
pressure you rely on that given by the nearest meteorological station. But how big
is the difference in the atmospheric pressure between the station at the time the
atmospheric pressure was measured and your laboratory at the time you did the
experiment?
Or, suppose you are measuring a (Gaussian) signal on top of a background. The
estimate of the signal (position, width, strength) may depend on the functional
form chosen for the background. If you do not know what this form is, you should

try various forms and assign systematic errors based on the resulting variations in
the estimates.

Experimental tips
To clear your experiment of ‘hidden faults’ you should begin in the design of the
experiment. Estimate what the systematic errors will be, and, if they are too large,
design a better experiment.
Build consistency checks into the experiment, e.g., check the calibration of an
instrument at various times during the course of the experiment.
Try to convert a systematic error into a random error. Many systematic effects
are a function of time. Examples are electronics drifts, temperature drifts, even
psychological changes in the experimenter. If you take data in an orderly sequence,
e.g., measuring values of y as a function of x in the order of increasing x, such drifts
are systematic. So mix up the order. By making the measurements in a random
order, these errors become random.
The correct procedure depends on what you are trying to measure. If there are
hysteresis effects in the apparatus, measuring or setting the value of a quantity,
e.g., a magnetic field strength, from above generally gives a different result than
setting it from below. Thus, if the absolute values are important such adjustments
should be done alternately from above and from below. On the other hand, if
only the differences are important, e.g., you are only interested in a slope, then all
adjustments should be made from the same side, as the systematic effect will then
cancel.

Error propagation with systematic errors


Having eliminated what systematic effects you can, you must evaluate the rest.
Different independent systematic errors are, since independent, added in quadra-
ture.∗ Since random and systematic errors are independent, they too can be added
in quadrature to give the total error. Nevertheless, the two types of error are often
quoted separately, e.g.,
R = −1.9 ± 0.1 ± 0.4
where (conventionally) the first error is statistical and the second systematic. Such a
statement is more useful to others, particularly if they want to combine your result
with other results which may have the same systematic errors. For this reason,
the various contributions to the systematic errors should also be given separately,
particularly those which could be common to other experiments. One also sees in
this example that more data would not help since the systematic error is much
larger than the statistical error.


This assumes that the errors are normally distributed. If you know this not to be the case,
you should try to combine the errors using the correct p.d.f.’s.

Error propagation is done using the covariance matrix in the usual way except
that we keep track of the statistical and systematic contributions to the error.
Suppose that we have two ‘independent’ measurements x1 and x2 with statistical
errors σ1 and σ2 and with a common systematic error s. For pedagogical purposes
we can think of the $x_i$ as being composed of two parts, $x_i = x_i^R + x_i^S$, where $x_i^R$ has only a random statistical error, $\sigma_i$, and $x_i^S$ has only a systematic error, s. Then $x_1^R$ and $x_2^R$ are completely independent and $x_1^S$ and $x_2^S$ are completely correlated. The variance of $x_i$ is then
$$V[x_i] = E\!\left[x_i^2\right] - \left(E[x_i]\right)^2 = E\!\left[\left(x_i^R + x_i^S\right)^2\right] - \left(E\!\left[x_i^R + x_i^S\right]\right)^2 = \sigma_i^2 + s^2$$
The covariance is
$$\mathrm{cov}(x_1, x_2) = E[x_1 x_2] - E[x_1]\,E[x_2] = E\!\left[\left(x_1^R + x_1^S\right)\left(x_2^R + x_2^S\right)\right] - E\!\left[x_1^R + x_1^S\right]E\!\left[x_2^R + x_2^S\right]$$
Each term involves four products. Those involving an $x_i^R$ cancel, leaving
$$\mathrm{cov}(x_1, x_2) = \mathrm{cov}(x_1^S, x_2^S) = s^2$$


Thus the covariance matrix is
 
$$V = \begin{pmatrix}\sigma_1^2 + s^2 & s^2 \\ s^2 & \sigma_2^2 + s^2\end{pmatrix}$$
So far we have considered systematic errors which are constants. They also
occur as fractions or percentages. The systematic error s is then not a constant but
proportional to the measurement (actually to the true value, but for small errors
the difference is by definition negligible): s = ǫx with, e.g., ǫ = 0.01 for a 1% error.
The above analysis is still valid: $x_1^S$ and $x_2^S$ are still completely correlated. The resulting covariance matrix is
$$V = \begin{pmatrix}\sigma_1^2 + \epsilon^2 x_1^2 & \epsilon^2 x_1 x_2 \\ \epsilon^2 x_1 x_2 & \sigma_2^2 + \epsilon^2 x_2^2\end{pmatrix}$$
Generalization is rather obvious. If there are several independent sources of
systematic error then they are added in quadrature. If there are more variables the
matrix is larger. For example, consider three variables with independent statistical
errors, a common systematic error s and in addition an independent systematic
error t which is shared by x1 and x2 but not x3 . The covariance matrix is then
 
$$V = \begin{pmatrix}\sigma_1^2 + s^2 + t^2 & s^2 + t^2 & s^2 \\ s^2 + t^2 & \sigma_2^2 + s^2 + t^2 & s^2 \\ s^2 & s^2 & \sigma_3^2 + s^2\end{pmatrix}$$

Least squares fit with systematic errors


Consider a least squares fit where the y-values have not only a statistical error σ,
but also a common systematic error s. The covariance matrix for y is then
$$V_{ij}[y] = \delta_{ij}\,\sigma^2 + s^2$$

This is just the covariance matrix previously considered in section 8.5.5 with the addition of s² to every element. As an example, consider a fit to a straight line, y = a + bx. Using this V and ǫ = y − a − bx in $Q^2 = \epsilon^T V^{-1}\epsilon$, and solving $\partial Q^2/\partial a = 0$ and $\partial Q^2/\partial b = 0$, leads to the same expressions for the estimators as before (equation
8.125). A common systematic shift of all points up or down clearly has no effect
on the slope, and therefore we expect the same variance for b̂ as before. However,
a systematic shift in y will affect the intercept; consequently, we expect a larger
variance for â.
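This behaviour is easy to verify numerically. The sketch below (Python; the data are invented) builds the covariance matrix δ_ij σ² + s² and performs the generalized least-squares fit θ̂ = (H^T V⁻¹H)⁻¹H^T V⁻¹y with and without the systematic term; the point estimates and the error on b̂ are unchanged, while the error on â grows.

```python
import numpy as np

# Invented straight-line data y = a + b*x with statistical error sigma
# and a common (fully correlated) systematic error s on every y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
sigma, s = 0.2, 0.3

H = np.column_stack([np.ones_like(x), x])          # model y = a + b*x

def gls(V):
    """Generalized least squares: returns (theta_hat, covariance)."""
    Vinv = np.linalg.inv(V)
    C = H.T @ Vinv @ H
    cov = np.linalg.inv(C)
    theta = cov @ H.T @ Vinv @ y
    return theta, cov

V_stat = sigma**2 * np.eye(len(x))                 # statistical errors only
V_full = V_stat + s**2                             # add s^2 to every element

for label, V in (("statistical only", V_stat), ("with systematic ", V_full)):
    theta, cov = gls(V)
    print(label, ": a-hat, b-hat =", theta, " errors =", np.sqrt(np.diag(cov)))
```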
Chapter 9

Confidence intervals

In the previous chapter we have discussed methods to estimate the values of un-
known parameters. As the uncertainty, or “error”, δ θ̂, on the estimate, θ̂, we have
been content to state the standard deviations and correlation coefficients of the
estimate as found from the covariance matrix or the estimated covariance matrix.
This is inadequate in certain cases, particularly when the sampling p.d.f., i.e., the
p.d.f. of the estimator is non-Gaussian. In this chapter our interest is to find the
range
θa ≤ θ ≤ θb
which contains the true value θt of θ with “probability” β. We shall see that when
the sampling p.d.f. is Gaussian, the interval [θa , θb ] for β = 68.3% is the same as
the interval of ±1 standard deviation about the estimated value.

9.1 Introduction
In parameter estimation we found an estimator θ̂ for a parameter θ and its variance $\sigma_{\hat\theta}^2 = V[\hat\theta]$, and we wrote the result as $\theta = \hat\theta \pm \sigma_{\hat\theta}$. Assuming a normal distribution
for θ̂, one is then tempted to say, as we did in section 8.2.4, that the probability is
68.3% that
θ̂ − σθ̂ ≤ θt ≤ θ̂ + σθ̂ (9.1)
Now, what does this statement mean? If we interpret it as 68.3% probability that
the value of θt is within the stated range, we are using Bayesian probability (cf.
section 2.4.4) with the assumption of uniform prior probability. This assumption is
not always justifiable and often is wrong, as is illustrated in the following example:
An empty dish is weighed on a balance. The result is 25.31 ± 0.14 g. A sample
of powder is placed on the dish, and the weight is again determined. The result is
25.51 ± 0.14 g. By subtraction and combination of errors, the weight of the powder
is found to be 0.20 ± 0.20 g. Our first conclusion is that the scientist should have
used a better balance. Next we try to determine some probabilities. From the


normal distribution, there is a probability of about 16% that a value lies lower than
µ − σ. In this example that means that there is a chance of about 16% that the
powder has negative weight (an anti-gravity powder!). The problem here is Bayes’
postulate of uniform prior probability. We should have incorporated in the prior
knowledge the fact that the weight must be positive, but we didn’t.
Let us avoid the problems of Bayesian prior probability and stick to the fre-
quentist interpretation. This will lead us to the concept of confidence intervals,
developed largely by Neyman,45 which give a purely frequentist interpretation to
equation 9.1. We shall return to the Bayesian interpretation in section 9.9.
Suppose we have a p.d.f. f (x; θ) which depends on one parameter θ. The prob-
ability content β of the interval [a, b] in X-space is
$$\beta = P(a \le X \le b) = \int_a^b f(x;\theta)\,dx \qquad (9.2)$$

Common choices for β are 68.3% (1σ), 95.4% (2σ), 99.7% (3σ), 90% (1.64σ), 95%
(1.96σ), and 99% (2.58σ), where the correspondence between percent and a number
of standard deviations (σ) assumes that f is a Gaussian p.d.f.
If the function f and the parameter θ are known we can calculate β for any a
and b. If θ is unknown we try to find another variable z = z(x, θ) such that its
p.d.f., g(z), is independent of θ. If such a z can be found, we can construct an
interval [za , zb ], where zx = z(x, θ), such that
$$\beta = P(z_a \le Z \le z_b) = \int_{z_a}^{z_b} g(z)\,dz \qquad (9.3)$$

It may then be possible to use this equation together with equation 9.2 to find an
interval [θ− , θ+ ] such that
P (θ− ≤ θt ≤ θ+ ) = β (9.4)
The meaning of this last equation must be made clear. Contrast the following
two quite similar statements:
1. The probability that θt is in the interval [θ− , θ+ ] is β.

2. The probability that the interval [θ− , θ+ ] contains θt is β.


The first sounds like θt is the r.v. and that the interval is fixed. This is incorrect—
we are frequentists here, and so θt is not a r.v. The second statement sounds like a
statement about θ− and θ+ , which is the correct meaning of equation 9.4. θ− and
θ+ are the results of the experiment, and hence r.v.’s. To put it slightly differently:
Performing the experiment as we have done, we have the probability, β, of finding
an interval, [θ− , θ+ ], which contains the (unknown) true value of θ, θt . If we were to
repeat the experiment many times, a fraction β of the experiments would yield an
interval containing the true value, i.e., an interval which “covers” the true value.
Turned around, this means that if we assert on the basis of our experiment that
the true value of θ lies in the interval [θ− , θ+ ], we will be right in a fraction β of
the cases. Thus, β expresses the degree of confidence (or belief) in our assertion;
hence the name confidence interval. The quantity β is known by various names:
confidence coefficient, coverage probability, confidence level. However, the
last term, “confidence level”, is inadvisable, since it is also used for a different
concept, which we will encounter in goodness-of-fit tests (cf. section 10.6).
The interval [θ− , θ+ ] corresponding to a confidence coefficient β is in general not
unique; many different intervals exist with the same probability content.
We can, of course, choose to state any one of these intervals. Commonly used
criteria to remove this arbitrariness are

1. Symmetric interval: θ̂ − θ− = θ+ − θ̂.

2. Shortest interval: θ+ − θ− is the smallest possible, given β.

3. Central interval: the probability content below and above the interval are
equal, i.e., P (θ < θ− ) = P (θ > θ+ ) = (1 − β)/2.

For a symmetric distribution having a single maximum these criteria are equivalent.
We usually prefer intervals satisfying one (or more) of these criteria. However, non-
central intervals will be preferred when there is some reason to be more careful on
one side than on the other, e.g., the amount of tritium emitted from a nuclear power
station.

Normally distributed estimators. To illustrate the above procedure: Let t(x)


be an estimator of a parameter having true value θ. As we have seen in the previous
chapter, many estimators are (at least asymptotically) normally distributed about
the true value. Then t is a r.v. distributed as N(t; θ, σ 2 ). Equation 9.2 is then
β = P(a ≤ T ≤ b) = ∫_a^b N(t; θ, σ²) dt = erf((b − θ)/σ) − erf((a − θ)/σ)   (9.5)

since the c.d.f. of the normal p.d.f. is the error function (cf. section 3.7).
If θ is not known, we can not evaluate the integral. Instead, assuming that σ is
known, we transform to the r.v. z = t − θ. The interval [c, d] for z corresponds to
the interval [θ + c, θ + d] for t. Hence, equation 9.3 becomes
β = P(θ + c ≤ T ≤ θ + d) = ∫_{θ+c}^{θ+d} N(t; θ, σ²) dt = erf(d/σ) − erf(c/σ)   (9.6)

We can, for a given β, now choose an interval [θ + c, θ + d] satisfying this equation.


Now t ≤ θ + d implies that θ ≥ t − d and t ≥ θ + c implies that t − c ≥ θ. Hence,
the above interval in t-space corresponds to the interval [t − d, t − c] in θ-space, and
we have the desired confidence interval for θ:

β = P (t − d ≤ θ ≤ t − c) (9.7)
Again, we emphasize that although this looks like a statement concerning the proba-
bility that θ is in this interval, it is not, but instead means that we have a probability
β of being right when we assert that θ is in this interval.
If neither θ nor σ is known, one chooses the standardized variable z = (t − θ)/σ. The
probability statement about Z is
β = P(c ≤ Z ≤ d) = ∫_c^d N(z; 0, 1) dz = erf(d) − erf(c)   (9.8)

which can be converted into a probability statement for θ:


β = P(t − dσ ≤ θ ≤ t − cσ)   (9.9)
For the normal distribution this conversion is easy, due to the symmetry of the
distribution between Z and θ. Note however that equation 9.9 does not help us
very much since we do not know σ. We will discuss this further in section 9.4.2.

9.2 Confidence belts


Now let us see how we construct confidence intervals for an arbitrary p.d.f.45 Suppose
that t(x) is an estimator of the parameter θ with p.d.f. f (t|θ). For a given value of
θ, there will be values of t, t− (θ) and t+ (θ) such that
β = P(t− ≤ T ≤ t+ ) = ∫_{t−}^{t+} f(t|θ) dt   (9.10)

These values of t then define an interval in t-space, [t− , t+ ], with probability content
β. Usually the choice of t− and t+ is not unique, but may be fixed by an additional
criterion, e.g., by requiring a central interval:
∫_{−∞}^{t−} f(t|θ) dt = (1 − β)/2 = ∫_{t+}^{+∞} f(t|θ) dt   (9.11)

We do not, of course, know the true value of θ, and hence we are unable to solve this equation for t− and t+ . Nevertheless, we can make a plot of t− (θ) and t+ (θ) vs. θ, which can also be viewed as a plot of, respectively, θ+ (t) and θ− (t) vs. t. The region between the t− and t+ curves is known as a confidence belt.
[Figure: the confidence belt formed by the curves t− (θ) and t+ (θ), equivalently θ+ (t) and θ− (t); for a measurement t̂ the values θ− (t̂) and θ+ (t̂) are read off vertically.]
For an unbiased, normally distributed estimator, as in the previous section, f(t|θ) = N(t; θ, σ²) and the lines for β = 0.683 would be, from equation 9.11, t− (θ) = θ − σ and t+ (θ) = θ + σ.
For any value of θ, the chance of finding a value of t in the interval [t− (θ), t+ (θ)]
is β, by construction. Conversely, having done an experiment giving a value t = t̂,
the values of θ− and θ+ corresponding to t+ = t̂ and t− = t̂ can be read off of the
plot as indicated. The interval [θ− , θ+ ] is then a confidence interval of probability
content β for θ. This can be seen as follows:
Suppose that θt is the true value of θ. A fraction β of experiments will then
result in a value of t in the interval [t− (θt ), t+ (θt )]. Any such value of t would yield,
by the above-indicated method, an interval [θ− , θ+ ] which would include θt . On the
other hand, the fraction 1 − β of experiments which result in a value of t not in the
interval [t− (θt ), t+ (θt )] would yield an interval [θ− , θ+ ] which would not include θt .
Thus the probability content of the interval [θ− , θ+ ] is also β.
To summarize, given a measurement t̂, the central β confidence interval (θ− ≤
θ ≤ θ+ ) is the solution of
∫_{−∞}^{t̂} f(t|θ+ ) dt = (1 − β)/2 = ∫_{t̂}^{+∞} f(t|θ− ) dt   (9.12)

If f (t) is a normal p.d.f., which is often (at least asymptotically, as we have seen in
chapter 8) the case, this interval is identical for β = 68.3% to [θ̂−σθ̂ < θ < θ̂+σθ̂ ]. If
f (t) is not Gaussian, the interval of ±1σ (σ 2 the variance of θ̂) does not necessarily
correspond to β = 68.3%. In this case the uncertainty should be given which does
correspond to β = 68.3%. Such an interval is not necessarily symmetric about θ̂.
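For a concrete f(t|θ), equation 9.12 can be solved numerically. The sketch below (an illustration added here, not part of the original text; Python with scipy is assumed) does so for a Gaussian estimator with known σ, where the c.d.f. is monotonic in θ so a simple root finder suffices; for β = 68.3% it reproduces θ̂ ± σ, as it must.

    # Sketch: central confidence interval by numerically inverting eq. 9.12,
    # here for f(t|theta) = N(t; theta, sigma^2) with known sigma.
    from scipy.stats import norm
    from scipy.optimize import brentq

    def central_interval(t_hat, beta, sigma, bracket=50.0):
        tail = 0.5 * (1.0 - beta)
        # upper end: P(T <= t_hat | theta_plus) = (1 - beta)/2
        theta_plus = brentq(lambda th: norm.cdf(t_hat, loc=th, scale=sigma) - tail,
                            t_hat - bracket, t_hat + bracket)
        # lower end: P(T >= t_hat | theta_minus) = (1 - beta)/2
        theta_minus = brentq(lambda th: norm.sf(t_hat, loc=th, scale=sigma) - tail,
                             t_hat - bracket, t_hat + bracket)
        return theta_minus, theta_plus

    print(central_interval(t_hat=5.0, beta=0.683, sigma=1.0))   # approximately (4.0, 6.0)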
In ‘pathological’ cases, the confidence belt may wiggle in such a way that the
resulting confidence interval consists of several disconnected pieces. While mathe-
matically correct, the use of such disconnected intervals may not be very meaningful.

9.3 Confidence bounds


As mentioned above, the choice of confidence interval is usually not unique. In
many cases we prefer a central interval. But sometimes an extremely non-central
interval is preferable from a physical standpoint. In particular, confidence bounds,
i.e., upper or lower limits, are useful when the ‘best’ value of a parameter is found
to be close to (or perhaps beyond) a physical boundary.
For an upper limit, t+ (θ) is chosen infinite (or equal to the maximum allowed
value of t). Then, the function t− (θ) is defined (equation 9.10) by
β = P(T > t− ) = ∫_{t−}^{+∞} f(t|θ) dt

For a measurement t̂, θ+ is read from this t− (θ) curve as in the previous section. In
other words, the upper limit, θ+ is the solution of
β = P(θ < θ+ ) = ∫_{t̂}^{+∞} f(t|θ+ ) dt   (9.13)

The statement is then that θ < θ+ with confidence β, and such an assertion will be
correct in a fraction β of the cases.
Lower limits are defined analogously: The lower limit θ− , for which θ > θ− with
confidence β, is found from
β = P(θ > θ− ) = ∫_{−∞}^{t̂} f(t|θ− ) dt   (9.14)

Note that we have defined these limits as > and <, whereas we used ≥ and ≤
for confidence intervals. Some authors also use ≥ and ≤ for confidence bounds. For
continuous estimators, this makes no difference. However, for discrete estimators,
e.g., a number of events, the integral over the p.d.f. of the estimator is replaced by
a sum, and then this difference is important. This will be discussed further for the
Poisson p.d.f. (section 9.6).

9.4 Normal confidence intervals


The example of a normally distributed estimator has already been discussed in the
introduction (section 9.1). There we saw that the situation is different depending
on whether σ is or is not known.

9.4.1 σ known
If the variance, σ 2 , of the estimator is known, the confidence interval is easily calcu-
lated, as shown in the introduction. Suppose we have n measurements of an exact
quantity, µ, like the mass of a ball, using an apparatus of known resolution, σa . The
estimate, µ̂ = x̄, of the quantity is then normally distributed as N(µ̂; µ, σ 2 = σa2 /n),
and confidence intervals (equation 9.7) are computed using σ and the error function
(equation 9.6). The central confidence belt is defined by straight lines correspond-
ing to t± = µ ± bσ, where b is the number of standard deviations corresponding to
probability β.
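A small numerical sketch of this case (added for illustration; the measurement values and the resolution are invented, and scipy is assumed):

    # Sketch: central beta confidence interval for mu from n measurements made
    # with known resolution sigma_a (Gaussian case, eqs. 9.6 and 9.7).
    import numpy as np
    from scipy.stats import norm

    def known_sigma_interval(x, sigma_a, beta=0.683):
        x = np.asarray(x, dtype=float)
        mu_hat = x.mean()                         # estimate of mu
        sigma_mu = sigma_a / np.sqrt(len(x))      # standard deviation of the estimator
        b = norm.ppf(0.5 * (1.0 + beta))          # number of standard deviations for beta
        return mu_hat - b * sigma_mu, mu_hat + b * sigma_mu

    x = [12.1, 11.8, 12.4, 12.0, 11.9]            # hypothetical measurements
    print(known_sigma_interval(x, sigma_a=0.3, beta=0.95))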

9.4.2 σ unknown
But suppose that we do not know the resolution of the apparatus. As shown in the
introduction, it is still possible to give a confidence interval, but only in terms of σ
(equations 9.8 and 9.9). Since σ is not known, this is not particularly useful.
Rather, the approach is to estimate σ from the data. In the simple example of a
set of n measurements of the same quantity, x, with an apparatus of constant, but
unknown resolution, σ, the mean is estimated by µ̂ = x̄. As we have seen (equation
8.7), the resolution is then estimated by
σ̂ = s = √[ Σᵢ (xᵢ − x̄)² / (n − 1) ]
and the variance of the estimator is estimated by

V[µ̂] = s²/n

Although z = (x − µ)/σ is distributed as a standard normal p.d.f., i.e., z² is distributed as χ², the corresponding variable for the case of unknown σ,
t = (x − µ)/σ̂ = [(x − µ)/σ]/(σ̂/σ) = z/(σ̂/σ)

is not. Instead, it follows Student’s t distribution (section 3.13). It is therefore not


correct to determine a confidence interval for µ from the normal p.d.f.
Qualitatively we can understand that the confidence region will be somewhat
larger with σ unknown than with σ known, since the region must also take into
account fluctuations of s from the true value of σ. It can be shown6, 11, 13 that the
central β-confidence interval is given by
µ± = µ̂ ± T(½(1 + β); n − 1) √V[µ̂]   (9.15)

The factor T is derived from the c.d.f. of Student’s t distribution. It is the value of
t for which the c.d.f. is equal to ½(1 + β):
∫_{−∞}^{T} t(x; n − 1) dx = ½(1 + β)   (9.16)

In the case of a least squares fit to measurements yi , all having the same (un-
known) Gaussian error σ, this generalizes to
θi± = θ̂i ± T(½(1 + β); n − k) √V[θ̂i ]   (9.17)

where n is the number of points and k the number of parameters in the model.
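The factor T(½(1 + β); n − 1) is just a quantile of Student's t distribution, so the interval of equations 9.15–9.16 is easy to compute. A sketch (scipy assumed; the data are invented and the function name is ours):

    # Sketch: central beta confidence interval for mu when sigma is unknown
    # (eqs. 9.15-9.16), using Student's t with n - 1 degrees of freedom.
    import numpy as np
    from scipy.stats import t as student_t

    def unknown_sigma_interval(x, beta=0.683):
        x = np.asarray(x, dtype=float)
        n = len(x)
        mu_hat = x.mean()
        s = x.std(ddof=1)                                  # estimate of sigma
        T = student_t.ppf(0.5 * (1.0 + beta), df=n - 1)    # eq. 9.16
        half_width = T * s / np.sqrt(n)                    # T * sqrt(V[mu_hat])
        return mu_hat - half_width, mu_hat + half_width

    x = [12.1, 11.8, 12.4, 12.0, 11.9]     # hypothetical measurements
    print(unknown_sigma_interval(x, beta=0.95))   # note the Student's t factor instead of the Gaussian one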

9.5 Binomial confidence intervals


For a binomial p.d.f., B(n; N, p), for which we want to estimate the parameter p, the
experimental observation is the number of successes, n, in N trials. The estimator
of p is then n/N.
For a given number of trials and various values of p, the confidence-belt diagram can be constructed as before using sums instead of integrals. Since the estimator of p, t = n/N, can take on only discrete values, the t− (p) and t+ (p) curves will have a staircase-like form. Also, it will not usually be possible to find an interval for β exactly equal to, say, 95%. One normally then takes the next higher possible value, i.e., one takes an interval with probability content slightly larger than 95%.
[Figure: staircase-shaped confidence belt, t− (p) and t+ (p) plotted vs. t = n/N.]

For example, to find the 95% central confidence interval for p, given that we
observe n successes in N trials, we first find the regions p < p+ and p > p− using
the discrete analogues of equations 9.13 and 9.14 to find 97.5% upper and lower
limits
P(p < p+ ) = Σ_{k=n+1}^{N} B(k; N, p+ ) ≥ 0.975   (9.18a)
P(p > p− ) = Σ_{k=0}^{n−1} B(k; N, p− ) ≥ 0.975   (9.18b)

The smallest value of p+ and the largest value of p− satisfying these equations give
the central 95% confidence interval [p− , p+ ]. In other words, we find the upper and
lower limits for 1 − (1 − β)/2 and then exclude these regions.
Using the ≥ in these equations rather than taking the values of p for which the
equality is most nearly satisfied means that if no value gives an equality, we take
the next larger value for p+ and the next smaller value for p− . This is known as
being conservative. It implies that for some values of p we have overcoverage,
which means that for some values of p the coverage probability is actually greater
than the 95% that we claim, i.e., that P (p− < p < p+ ) > 0.95 instead of = 0.95.
This is not desirable, but the alternative would be to have undercoverage for other
values of p. Since we do not know what the true value of p is—if we did know, we
would not be doing the experiment—the lesser of two evils is to accept overcoverage
in order to rule out undercoverage completely.
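Since p is a continuous parameter, the limits defined by equations 9.18a and 9.18b can be found with a root finder applied to the binomial c.d.f. A sketch (scipy assumed; the function name and example numbers are ours):

    # Sketch: central confidence interval for the binomial parameter p given n
    # successes in N trials, by solving eqs. 9.18a and 9.18b numerically.
    from scipy.stats import binom
    from scipy.optimize import brentq

    def binomial_central_interval(n, N, beta=0.95):
        tail = 0.5 * (1.0 - beta)                  # e.g. 0.025 for beta = 0.95
        # upper limit p_plus: P(k <= n | p_plus) = tail   (equivalent to eq. 9.18a)
        p_plus = 1.0 if n == N else brentq(
            lambda p: binom.cdf(n, N, p) - tail, 0.0, 1.0)
        # lower limit p_minus: P(k >= n | p_minus) = tail (equivalent to eq. 9.18b)
        p_minus = 0.0 if n == 0 else brentq(
            lambda p: 1.0 - binom.cdf(n - 1, N, p) - tail, 0.0, 1.0)
        return p_minus, p_plus

    print(binomial_central_interval(n=3, N=20, beta=0.95))   # roughly (0.03, 0.38)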

9.6 Poisson confidence intervals


9.6.1 Large N
If the number of observed events is large, the Poisson p.d.f. is well approximated
by a Gaussian, and the Gaussian p.d.f. may be used to determine the confidence
interval.

9.6.2 Small N — Confidence bounds


If the number of events is smaller, a confidence interval may be determined in the
same way as for the binomial p.d.f.
However, for very small numbers of events one frequently prefers to state upper
or lower limits. The Poisson p.d.f. is a particularly important case for such limits,
since many random processes follow the Poisson p.d.f. (section 3.4).
Some experiments search for rare or ‘forbidden’ processes and conclude by stat-
ing upper limits for their occurrence. For example, we may search for the decay
µ → eγ, which is forbidden in the standard theory of weak interactions, but which
would be allowed in various proposed generalizations of this theory. Detection of
such a decay would show that the standard theory was only an approximate theory,
and the rate, i.e., the fraction of µ’s which decay through this mode, would help to
choose among the various alternative theories. Usually such experiments find a few
events which are consistent with the searched-for process, but which are not neces-
sarily evidence for it because of possible background processes. The experimental
result is then stated as an upper limit for the process.
On the other hand, a theory may predict that some process must not be zero.
Then an experiment will seek to give a lower limit.
When n events have been observed, the β upper limit µ+ for the parameter µ
of the Poisson p.d.f. is, from equation 9.13, the solution of
β = P(µ < µ+ ) = Σ_{k=n+1}^{∞} P(k; µ+ ) = e^{−µ+} Σ_{k=n+1}^{∞} µ+^k / k!
  = 1 − Σ_{k=0}^{n} P(k; µ+ ) = 1 − e^{−µ+} Σ_{k=0}^{n} µ+^k / k!   (9.19)

The solution is easily found using the fact that the sum in the right-hand side of
equation 9.19 is related to the c.d.f. of the χ2 -distribution for 2(n + 1) degrees of
freedom.4, 5, 46 Thus,
1 − β = Σ_{k=0}^{n} P(k; µ+ ) = P[χ²(2n + 2) > 2µ+ ] = ∫_{2µ+}^{∞} χ²(2n + 2) dχ²   (9.20)

The upper limit µ+ can thus be found from a table of the c.d.f. of χ2 (2n + 2).
Lacking a table, equation 9.19 can be solved by iteration.
Let us emphasize, perhaps unnecessarily, exactly what the upper limit means: If
the true value of µ is really µ+ , the probability that a repetition of the experiment
will find a number of events which is as small or smaller than n is 1 − β; for a true
value of µ larger than µ+ , the chance is even smaller. Thus we say that we are
‘β confident’ that µ is less than µ+ . In making such statements, we will be right in
a fraction β of the cases.
Similarly, the β lower limit, µ− , is the solution of

β = Σ_{k=0}^{n−1} P(k; µ− ) = e^{−µ−} Σ_{k=0}^{n−1} µ−^k / k!   (9.21)

which can be found from the c.d.f. of the χ2 -distribution for 2n degrees of freedom.
Thus,
β = Σ_{k=0}^{n−1} P(k; µ− ) = P[χ²(2n) > 2µ− ] = ∫_{2µ−}^{∞} χ²(2n) dχ²   (9.22)

The fact that it is here 2n degrees of freedom instead of 2(n + 1) as for the upper
limit is because there are only n terms in the sum of equation 9.22 whereas there
were n + 1 terms in the upper limit case, equation 9.20.
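In practice the χ² quantiles of equations 9.20 and 9.22 are available in any statistics library, so neither a table nor an iteration is needed. A sketch (scipy assumed; function names are ours):

    # Sketch: beta upper and lower limits on a Poisson mean from n observed events,
    # using the chi-squared relations of eqs. 9.20 and 9.22.
    from scipy.stats import chi2

    def poisson_upper_limit(n, beta=0.95):
        # eq. 9.20:  P[chi2(2n+2) > 2*mu_plus] = 1 - beta
        return 0.5 * chi2.ppf(beta, 2 * (n + 1))

    def poisson_lower_limit(n, beta=0.95):
        # eq. 9.22:  P[chi2(2n) > 2*mu_minus] = beta   (requires n >= 1)
        return 0.5 * chi2.ppf(1.0 - beta, 2 * n)

    print(poisson_upper_limit(0))   # ~3.00: the familiar 95% upper limit for n = 0
    print(poisson_upper_limit(3))   # ~7.75
    print(poisson_lower_limit(3))   # ~0.82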

9.6.3 Background
As mentioned above, there is usually background to the signal. The background
is also Poisson distributed. The sum of the two Poisson-distributed quantities is
also Poisson distributed (section 3.7), with mean equal to the sum of the means of
the signal and background, µ = µs + µb . Assume that µb is known with negligible
error. However, we do not know the actual number of background events, nb , in our
experiment. We only know that nb ≤ n. If µb + µs is large we may approximate the
Poisson p.d.f. by a Gaussian and take the number of background events as n̂b ≈ µb .
Then µ̂s = n − n̂b = n − µb , with variance V [µ̂s ] = V [n] + V [n̂b ] = n + µb .
An upper limit may be found by replacing µ+ in equation 9.19 by (µ+ + µb ). A
lower limit may be found from equation 9.21 by a similar substitution. The results
are

µ+ = µ+ (no background) − µb   (9.23)
µ− = µ− (no background) − µb   (9.24)

A difficulty arises when the number of observed events is not large compared
to the expected number of background events. The situation is even worse when
the expected number of background events is greater than the number of events
observed. For small enough n and large enough µb , equation 9.23 will lead to a
negative upper limit. So, if you follow this procedure, you may end up saying
something like “the number of whatever-I-am-trying-to-find is less than −1 with
95% confidence.” To anyone not well versed in statistics this sounds like nonsense,
and you probably would not want to make such a silly sounding statement. Of
course, 95% confidence means that 5% of the time the statement is false. This is
simply one of those times, but still it sounds silly. We will return to this point in
section 9.12.
9.7 Use of the likelihood function or χ2


We have seen in section 8.4.5 how to estimate the variance of a maximum likelihood
estimator. Using the asymptotic normality of maximum likelihood estimators, we
can find confidence intervals as for any normally distributed quantity with known
variance (equations 9.6 and 9.7):
θ̂ − d ≤ θ ≤ θ̂ + c   with confidence   β = erf(d/σθ̂ ) − erf(c/σθ̂ )

With smaller samples it is usually most convenient to use the likelihood ratio
(difference in log likelihood) to estimate the confidence interval. Then, relying on
the assumption that a change of parameters would lead to a Gaussian likelihood
function (cf. section 8.4.5), the region for which ℓ > ℓmax − a2 /2, or equivalently
(cf. section 8.5.1) χ² < χ²min + a², corresponds to a probability content
β = ∫_{−a}^{+a} N(z; 0, 1) dz = erf(a) − erf(−a)
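In practice one often simply scans the log-likelihood and keeps the region within a²/2 of its maximum. The sketch below does this for an invented example (an exponential p.d.f. with made-up data); it assumes the region is connected, which is exactly what fails in the 'pathological' cases discussed next.

    # Sketch: confidence interval from the log-likelihood difference,
    # l > l_max - a^2/2, found by scanning l(theta) on a grid.
    import numpy as np

    data = np.array([0.7, 1.9, 0.4, 2.3, 1.1, 0.8, 1.5])   # hypothetical observations

    def log_likelihood(tau):
        # exponential p.d.f. f(x; tau) = exp(-x/tau)/tau; works elementwise on a grid
        return -len(data) * np.log(tau) - data.sum() / tau

    def interval_from_lnL(a=1.0):
        grid = np.linspace(0.3, 6.0, 5000)
        lnL = log_likelihood(grid)
        inside = grid[lnL > lnL.max() - 0.5 * a**2]
        return inside.min(), inside.max()      # assumes a single connected region

    print(interval_from_lnL(a=1.0))            # approximate 68.3% interval for tau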

In ‘pathological’ cases, i.e., cases where there is more than one maximum, as
pictured here, the situation is less clear. Applying the above procedure would lead
to disconnected intervals, whereas the interval for the transformed parameter would
give a single interval. It is sometimes said that it is nevertheless correct to state a
β confidence interval as
θ1 ≤ θ ≤ θ2   or   θ3 ≤ θ ≤ θ4
[Figure: a likelihood function ℓ(θ) with two maxima; the region ℓ > ℓmax − a²/2 consists of the two disconnected intervals [θ1 , θ2 ] and [θ3 , θ4 ].]
However, this statement seems to be the
result of confusing confidence intervals
with fiducial intervals (section 9.8). Be
that as it may, the usefulness of such in-
tervals is rather dubious, and in any case
gives an incomplete picture of the situation. One should certainly give more details
than just stating these intervals.
The application of other methods of estimating the variance of θ̂ to finding
confidence intervals for finite samples is discussed in some detail by Eadie et al.4
and James5 .

9.8 Fiducial intervals


Confidence intervals, as developed by Neyman and discussed in the previous sec-
tions, use a fully frequentist approach to probability. R. A. Fisher, a few years
earlier, had followed a somewhat different, also frequentist, approach to interval
estimation24 . His intervals are called fiducial intervals. A third approach is the
much older Bayesian one, which will be presented in the next section.
Fisher’s concept of information (section 8.2.5) is intimately related to the like-


lihood function. So too is his fiducial interval.
In section 8.4.2 we saw that asymptotically the likelihood function L(x; θ) be-
comes (under rather general assumptions) a Gaussian function of the parameters
θ. This does not mean (as we have repeatedly emphasized) that L is a p.d.f. for
θ. That only happens in a Bayesian interpretation, which we are not making here.
Recall that the principle of maximum likelihood, i.e., that the best estimate of θ
is that value of θ for which the likelihood function is a maximum, was not derived,
but assumed on intuitive grounds. In the same way we go now a step further and
assume, again intuitively, that L represents our level of credence in a value of θ. A
fiducial interval for a degree of credence β is defined as an interval [θ1 , θ2 ] such that
β = ∫_{θ1}^{θ2} L dθ / ∫_{−∞}^{+∞} L dθ   (9.25)

This procedure is supported by the connection we have seen (section 8.4.2)


between the asymptotic Gaussian shape of L and the variance of the maximum
likelihood estimator. And just as with the maximum likelihood method, the attrac-
tiveness of fiducial intervals is based on asymptotic properties.
As with confidence intervals, a supplementary criterion, such as a central inter-
val, is needed in addition to equation 9.25 to uniquely define a fiducial interval.
Often the confidence interval and fiducial interval approaches lead to the same
interval. However, the approach, and hence the meaning, is different. The confi-
dence approach says that if we assert that the true value is in a 95% interval we
will be right 95% of the time. However, in the fiducial approach the same assertion
means that we are 95% sure that we are right this time. This shift in emphasis is
the same as in the meaning of the likelihood function itself: We can regard L(x; θ)
as an elementary probability in which θ is fixed and x varies, i.e., as the p.d.f. for
the r.v. X. On the other hand, we can regard it as a likelihood in which x is fixed
and θ varies, as is done in the maximum likelihood method. Similarly, in interval
estimation, we can regard θ as a constant and set up containing intervals which are
random variables (the confidence interval approach); or we can regard the observa-
tions as fixed and set up intervals based on some undefined intensity of belief in the
values of the parameter generating the observations (the fiducial interval approach).
Today, fiducial intervals are seldom used, since they lack a firm mathematical
basis. If one is a frequentist, one generally prefers confidence intervals.

9.9 Credible (Bayesian) intervals


Confidence intervals are based on the frequentist interpretation of probability and
are statements about the probability of experimental results. Fiducial intervals are
also based on the frequentist interpretation of probability (the parameters θ have
fixed true values) but represent our credence (or belief) about the values of the
parameters. However, we may prefer to use Bayesian probability. In this case we can
construct intervals, [a, b], called credible intervals, Bayesian confidence intervals,
or simply Bayesian intervals, such that β is the probability that parameter θ is
in the interval:
β = P(a ≤ θ ≤ b) = ∫_a^b f(θ|x) dθ   (9.26)

where f (θ|x) is the Bayesian posterior p.d.f. As with confidence and fiducial inter-
vals, supplementary conditions, such as centrality, are needed to uniquely specify
the interval. We have seen in section 8.4.5 that, assuming Bayes’ postulate, f (θ|x)
is just the likelihood function L(x; θ), apart, perhaps, from normalization.

9.10 Discussion of intervals


We have presented three approaches to interval estimation: confidence intervals,
fiducial intervals, and credible (or Bayesian) intervals. In cases where the likelihood
function is a Gaussian function of the parameters, as is usually true asymptotically,
these approaches (with a suitable choice of prior in the Bayesian case) all lead to the
same interval. Though this is comforting, we must realize that in less ideal circum-
stances the intervals given by the different approaches may be different. This does
not mean that any of the approaches is wrong, but rather that they are answering
different questions or making different assumptions.
The virtue of the confidence interval approach is its firm grounding in frequentist
probability. The Bayesian approach is also firmly grounded, but loses something
in objectivity by its subjective Bayesian interpretation of probability as a degree of
belief. Further, it suffers from its need for an arbitrary choice of prior probability
(Bayes’ postulate). The fiducial approach is well-grounded only where its results
are identical to the other approaches. Extension to other cases is more a question
of intuition.
We thus are inclined to prefer the confidence interval approach even though it
is a very complicated procedure compared to the other approaches. However, the
confidence interval approach is unable to incorporate prior information, as we will
see in the next section.

9.11 Measurement of a bounded quantity


Let us return to the example in the introduction (section 9.1). A dish is weighed, a
sample is placed on the dish and the combination is weighed, and then the mass of
the sample is estimated by subtracting the mass of the dish from the mass of the
dish plus sample. If the mass of the sample is smaller than or comparable to the
resolution of the balance, the confidence interval [−∞, 0] will have a non-negligible
probability content. This is clearly ridiculous and comes about because we have
not made use of our knowledge that the mass must be positive. Such a situation
can also occur when we must subtract a number of background events from the
observed number of events to find the number of events in the signal; a number of
events also can not be negative.
The problem is how to incorporate this constraint (or prior knowledge) into
the confidence interval. In the confidence interval approach there is no way to
do this. The best we can do is to choose an interval which does not contain the
forbidden region (< 0 in our example). Consider the figure showing confidence
belts in section 9.2. Suppose that we know that θt > θmin . We can think of several
alternatives to the interval [θ− , θ+ ] when θ− < θ+ :

1. [θmin , θ+ ]. But this is the same interval we would have found using a confidence
belt with t+ shifted upwards such that the t+ curve passes through the point
(θmin , t̂ ). This confidence belt clearly has a smaller β. This places us in the
position of stating the same confidence for two intervals, the one completely
contained in, and smaller than, the other.
2. [θmin , θ+′′ ], where θ+′′ is the solution of t− (θ) = tmin , with tmin = t+ (θmin ). This
is the interval we would have stated had we found t̂ = t+ (θmin ). So, apparently
the fact that we found a lower value of t̂ does not mean anything—any value of
t̂ smaller than t+ (θmin ) leads to the same confidence interval! This procedure
is clearly unsatisfactory.
3. [θmin , θ+′ ], where θ+′ is determined from a new confidence belt constructed
such that the t+ curve passes through the point (θmin , t̂ ). The t− curve is
taken as that curve which together with this new t+ curve gives the required
β. This approach seems better than the previous two. However, it is still
unsatisfactory since it relies on the measurement to define the confidence
belt.

The situation is even worse if not only θ− (t̂) < θmin but also θ+ (t̂) > θmax . Then
we find ourselves in the absurd situation of, e.g., stating the conclusion of our
experiment as −0.2 < θ < 1.2 with 95% confidence when we know that 0 < θ < 1—
we are only 95% confident that θ is within its physical limits! The best procedure
to follow has been the subject of much interest lately among high energy physicists,
particularly those trying to measure the mass of the neutrino and those searching
for hypothetical new particles. The most reasonable procedure seems to be47 that of
Feldman and Cousins,48 who rediscovered a prescription previously given by Kendall
and Stuart.11
On the other hand, in the fiducial approach physical boundaries are easily in-
corporated. The likelihood function is simply set to zero for unphysical values of
the parameters and renormalized. Equation 9.25 is thus replaced by
β = ∫_{θ1}^{θ2} L dθ / ∫_{θmin}^{θmax} L dθ   (9.27)
Also the Bayesian approach has no difficulty in incorporating the physical limits.
They are naturally imposed on the prior probability. If the prior probability is
uniform within the physical limits, the result is the same interval as in the fiducial
approach (equation 9.27).
Note, however, that in order to combine with the results of other experiments,
the (nonphysical) estimate and its variance should be stated, as well as the con-
fidence interval. This, in fact, should also be done for quantities which are not
bounded.

9.12 Upper limit on the mean of a Poisson p.d.f. with background
In section 9.6.3 we introduced the problem of measuring an upper limit on the
number of (Poisson distributed) events for a particular process in the presence of
background. This is related to the problems of the previous section. The number of
events can not be negative; it is a bounded quantity. Within the classical confidence
limit approach, the most reasonable procedure here too is that of Feldman and
Cousins48 .
Another approach is to determine an upper limit by an extension of the argument
of section 9.6.46, 49 As in that section, let n be the number of events observed, µb the
expected number of background events, and µ+ the upper limit on µs . Then µ+ is
that value of µs such that any random repetition of the current experiment would,
if µs actually equals µ+ , result in more than n events and would also have nb ≤ n,
all with probability β. Thus, in equation 9.19 the sum, which is the probability of
≤ n events given µ = µ+ , is replaced by the same probability given µ = µb + µ+
normalized to the probability that nb ≤ n.
β = 1 − P(≤ n events)/P(≤ n background events)
  = 1 − [e^{−(µ+ + µb )} Σ_{k=0}^{n} (µ+ + µb )^k / k!] / [e^{−µb} Σ_{k=0}^{n} µb^k / k!]   (9.28)

This equation must be solved for µ+ . In practice this is best done numerically,
adjusting µ+ until the desired β is obtained. However, to incorporate the probability
that nb ≤ n, we have been Bayesian. The result is thus a credible upper limit rather
than a classical upper limit.
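A sketch of such a numerical solution of equation 9.28 (scipy assumed; the function name and the example numbers are ours):

    # Sketch: beta (credible) upper limit mu_plus from eq. 9.28, for n observed
    # events and a precisely known expected background mu_b.
    from scipy.stats import poisson
    from scipy.optimize import brentq

    def upper_limit_with_background(n, mu_b, beta=0.95):
        # eq. 9.28:  beta = 1 - P(<= n | mu_plus + mu_b) / P(<= n | mu_b)
        def excess(mu_plus):
            return 1.0 - poisson.cdf(n, mu_plus + mu_b) / poisson.cdf(n, mu_b) - beta
        return brentq(excess, 0.0, 100.0)

    print(upper_limit_with_background(n=0, mu_b=0.5))   # ~3.00: for n = 0 the background drops out
    print(upper_limit_with_background(n=3, mu_b=2.5))   # smaller than the 7.75 found without background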
When µb is not known to a negligible error, the same approach can be used.
However, we must integrate over the p.d.f. for nb . It is most convenient to use a
Monte Carlo technique. We generate a sample of Monte Carlo experiments taking µb
randomly distributed according to our knowledge of µb (usually normally) and with
a fixed µs . Experiments with nb > n are rejected. The sum in equation 9.19 or 9.21
is then estimated by the fraction of remaining Monte Carlo experiments satisfying
the corresponding probability. The process is repeated for different values of µs


until the desired value of β is found.50
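One possible reading of this Monte Carlo procedure, as a sketch (the numbers and names are illustrative only; our knowledge of µb is taken to be Gaussian):

    # Sketch (one reading of the procedure above): Monte Carlo estimate of the
    # probability entering eq. 9.19 when mu_b itself has a Gaussian uncertainty.
    import numpy as np
    rng = np.random.default_rng(seed=1)

    def p_le_n(mu_s, n_obs, mu_b_hat, sigma_b, n_mc=200_000):
        mu_b = rng.normal(mu_b_hat, sigma_b, n_mc)        # our knowledge of mu_b
        mu_b = mu_b[mu_b > 0.0]                           # keep physical values only
        n_b = rng.poisson(mu_b)                           # background events
        n_tot = n_b + rng.poisson(mu_s, size=mu_b.size)   # plus signal events
        keep = n_b <= n_obs                               # reject experiments with n_b > n
        return np.mean(n_tot[keep] <= n_obs)              # MC estimate of the sum in eq. 9.19

    # Raise mu_s until p_le_n drops to 1 - beta; that mu_s is the beta upper limit.
    for mu_s in (2.0, 4.0, 6.0, 8.0):
        print(mu_s, p_le_n(mu_s, n_obs=3, mu_b_hat=2.5, sigma_b=1.0))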
“Which way ought I to go to get from here?”
“That depends a good deal on where you want to get to,” said the Cat.
“I don’t much care where—” said Alice.
“Then it doesn’t matter which way you go,” said the Cat.
—Lewis Carroll, “Alice in Wonderland”

Chapter 10

Hypothesis testing

10.1 Introduction
In chapter 8 we were concerned with estimating parameters of a p.d.f. using a statis-
tic calculated from observations assumed to be distributed according to that p.d.f.
In chapter 9 we sought an interval which we were confident (to some specified de-
gree) contained the true value of the parameter. In this chapter we will be concerned
with whether some previously designated value of the parameter is compatible with
the observation, or even whether the assumed p.d.f. is compatible. In a sense, this
latter question logically precedes the estimation of the value of a parameter, since
if the p.d.f. is incompatible with the data there is little sense in trying to estimate
its parameters.
When the hypothesis under test concerns the value of a parameter, the problems
of hypothesis testing and parameter estimation are related and techniques of parameter
estimation will lead to analogous testing procedures. If little is known about the
value of a parameter, you will want to estimate it. However, if a theory predicts
it to have a certain value, you may prefer to test whether the data are compatible
with the predicted value. In either case you should be clear which you are doing.
That others are often confused about this is no excuse.

10.2 Basic concepts


The question here is thus one of hypothesis testing. We make some hypothesis
and want to use experimental observations to test whether it is correct. Not all
scientific hypotheses can be tested statistically. For instance, the hypothesis that

every particle in the universe attracts every other particle can not be tested sta-
tistically. Statistical hypotheses concern the distributions of observable random
variables. Suppose we have N such observations. We denote them by a vector x
in an N-dimensional space, Ω, called the sample space (section 2.1.2), which is the
space of all possible values of x, i.e., the space of all possible results of an experi-
ment. A statistically testable hypothesis is one which concerns the probability of a
particular observation X, P (X ∈ Ω).
Suppose that x consists of a number of independent measurements of a r.v., xi .
Let us give four examples of statistical hypotheses concerning x:

1. The xi are distributed normally with particular values of µ and σ.

2. The xi are distributed normally with a particular value of µ.

3. The xi are distributed normally.

4. The results of two experiments, x1i and x2i are distributed identically.

Each of these hypotheses says something about the distribution of probability over
the sample space and is hence statistically testable by comparison with observations.
Examples 1 and 2 specify a p.d.f. and certain values for one or both of its pa-
rameters. Such hypotheses are called parametric hypotheses. Example 3 specifies
the form of the p.d.f., but none of its parameters, and example 4 does not even
specify the form of the p.d.f. These are examples of non-parametric hypothe-
ses, i.e., no parameter is specified in the hypothesis. We shall mainly concentrate
on parametric hypotheses, leaving non-parametric hypotheses to section 10.7.
Examples 1 and 2 differ in that 1 specifies all of the parameters of the p.d.f.,
whereas 2 specifies only a subset of the parameters. When all of the parameters are
specified the hypothesis is termed simple; otherwise composite. If the p.d.f. has n
parameters, we can define an n-dimensional parameter space. A simple hypothesis
selects a unique point in this space. A composite hypothesis selects a subspace
containing more than one point. The number of parameters specified exactly by
the hypothesis is called the number of constraints. The number of unspecified
parameters is called the number of degrees of freedom of the hypothesis. Note the
similarity of terminology with that used in parameter estimation:

              Parameter Estimation                     Hypothesis Testing
n     =       number of observations                   number of parameters
k     =       number of parameters to be estimated     number of parameters specified by the hypothesis (constraints)
n − k =       number of degrees of freedom
To test an hypothesis on the basis of a random sample of observations, we must


divide the sample space Ω into two subspaces. If the observation x lies in one of these
subspaces, call it ω, we shall reject the hypothesis; if x lies in the complementary
region, ω ∗ = Ω − ω, we shall accept the hypothesis. The subspace ω is called the
critical region of the test, and ω ∗ is called the acceptance region.
A few words are in order regarding this terminology. In science we can never
completely reject or accept an hypothesis. Nevertheless, the words “reject” and
“accept” are in common usage. They should be understood as meaning “the ob-
servations are unfavorable” or “favorable” to the hypothesis. Since acceptance or
rejection is never certain, it is clear that we also need to be able to state our de-
gree of confidence in acceptance or rejection, just as when constructing confidence
intervals we did so with a specified confidence.
The hypothesis being tested is generally designated H0 and is called the null
hypothesis. For the time being, we will assume that H0 is a simple hypothesis,
i.e., it specifies the p.d.f. completely. We can then calculate the probability that
a random observation will fall in the critical region, and we can choose this region
such that this probability is equal to some pre-chosen value, α,

P (x ∈ ω|H0) = α (10.1)

This value α is thus the probability of rejecting H0 if H0 is true. It is called the


size of the test or the level of significance, although this latter term can be
misleading. For a discrete p.d.f. the possible values of α will also be discrete, while
for a continuous p.d.f. any value of α is possible.
In general, there will be many, often an infinity, of subspaces ω of the same size
α. Which of them should we use? In other words, which of all possible observations
should we regard as favoring and which as disfavoring H0 ?
To decide which subspace to take as ω, we need to know what the alternatives
are. It is perfectly possible that an observation is unlikely under H0 but even more
unlikely under an alternative hypothesis. Forced to choose between the two we
would not want to reject H0 . Thus whether we accept or reject H0 depends on
what the alternative hypothesis, usually designated H1 , is.
It should now be clear that a critical region (or, synonymously, a test) must be
judged by its properties both when H0 is true and when H0 is false. We want to
accept H0 if it is true and reject it if it is false. Our decision, i.e., acceptance or
rejection, can be wrong in two ways:
1. Error of the first kind, or loss, or false negative: H0 is true, but we
reject it.

2. Error of the second kind, or contamination, or false positive: H0 is


false, but we accept it.
The probability of making an error of the first kind is equal to the size of the
critical region, α. The probability of making an error of the second kind depends
on the alternative hypothesis and is denoted∗ by β:

P (x ∈ ω ∗|H1 ) = β (10.2)

The complementary probability,

P (x ∈ ω|H1) = 1 − β (10.3)

is called the power of the test of H0 against H1 . The specification of H1 when


giving the power is clearly essential since β depends on H1 .
Clearly, we would like a test to have small values of both α and β. However,
it is usually a trade-off: decreasing α frequently increases β and vice versa. Let us
consider two examples where H0 and H1 are both simple hypotheses.
Example 1. Consider H0 and H1 both of which hypothesize that the r.v. X
is normally distributed with standard deviation σ. The difference between the
hypotheses lies in the value of µ. For H0 it is µ0 and for H1 it is µ1 . We make two
independent observations x1 and x2 to test H0 against H1 .
The two observations can be represented by a point in Ω, which is a plane having
x1 and x2 as axes. The joint p.d.f. under H0 is a bivariate normal distribution
centered at the point A, i.e., at x1 = x2 = µ0 . The density of points about A in
the figure is meant to represent this p.d.f. Under H1 the p.d.f. is the same except
that it is centered at the point B, x1 = x2 = µ1 .
A test of H0 could be made by defining ω by the line P Q with H0 to be rejected if the point representing the observations lies above the line P Q. Another possible critical region is that between the lines CA and AD. Both of these regions have the same probability under H0 and hence the same size, α.
However, the values of β are much different. The first test will almost always reject H0 when H1 is true, while the second test will often wrongly accept H0 . Thus β is much larger for the second test, and hence the power of the first test is larger. It should be obvious that the more powerful test is preferable.
[Figure: the (x1 , x2 ) sample space, with the joint p.d.f. under H0 centered at A (x1 = x2 = µ0 ) and under H1 centered at B (x1 = x2 = µ1 ); the line P Q and the lines CA and AD delimit the two critical regions discussed in the text.]
Example 2. In the previous example the sample space was only two dimensions.
When the dimensionality is larger, it is inconvenient to formulate the test in terms of
the complete sample space. Rather, a small number (frequently one) of test statistics is defined and the test is formulated in terms of them. In fact, as we shall later see, in some cases a single test statistic provides the best test. Recall that a statistic is a function only of the observations and does not depend on any assumptions about the p.d.f.
Suppose that we want to distinguish K− p elastic scattering events from inelastic scattering events where a π 0 is produced. The hypotheses are then
H0 : K− p → K− p
H1 : K− p → K− pπ 0
If the experiment measures the momenta and energies of charged particles but does not detect neutral particles, a convenient test statistic is the missing mass, the mass of the neutral system in the final state. This is easily calculated from the energies and momenta of the initial- and final-state charged particles. The true value of the missing mass is M = 0 under H0 , and M = 135 MeV/c² under H1 . We can choose a critical region M > Mc . The corresponding loss and contamination are shown in the figure. The choice of Mc will be governed by balancing our interest in both small loss and small contamination.
[Figure: the p.d.f. f(M) of the missing mass under H0 (peaked at M = 0) and under H1 (peaked at Mπ0 ); for the critical region M > Mc , the loss α is the area of the H0 curve above Mc and the contamination β is the area of the H1 curve below Mc .]
∗ The symbols α and β are used by most authors for the probabilities of errors of the first and second kind. However, some authors use 1 − β where we use β.
Note that the actual contamination in our sample of elastic events depends on
the a priori abundance of inelastic events produced. If this is small compared to
that of elastic events, we can tolerate a large value of β.

10.3 Properties of tests


In parameter estimation we were faced with the problem of choosing the best esti-
mator. Here a similar situation arises: we seek the best test. To aid us, we examine
some properties of tests.

10.3.1 Size
In the previous section we defined (equation 10.1) the size, α, of a test as the
probability that the test would reject the null hypothesis when it is true. If H0 is
a simple hypothesis, the size of a test can be calculated. In other cases it is not
always possible. Clearly a test of unknown size is worthless.
10.3.2 Power
We have defined (equation 10.3) the power, 1 − β, of a test of one hypothesis H0
against another hypothesis H1 as the probability that the test would reject H0 when
H1 is true. If H1 is a simple hypothesis, the power of a test can be calculated. If
H1 is composite, the power can still be calculated, but is in general no longer a
constant but a function of the parameters.
Suppose that H0 and H1 specify the same p.d.f., the difference being the value
of the parameter θ:

H0 : θ = θ0
H1 : θ ≠ θ0

The contamination, β, is then a function of θ, as is the power:

p(θ) = 1 − β(θ) (10.4)

Note that by definition, p(θ0 ) = 1 − β(θ0 ) = α.


Tests may then be compared on the basis of their power function. If H0 and
H1 are both simple, the best test of size (at significance level) α is the test with
maximum power at θ = θ1 , the value specified by H1 . In the figure, test B has the
largest power for θ > θ′ and in particular at θ = θ1 , whereas test C is more powerful
for θ0 < θ < θ′ .
[Figure: power functions p(θ) of three tests A, B, and C of the same size α; B has the largest power at θ = θ1 , while C is more powerful for θ0 < θ < θ′ .]
If for a given value of θ a test is at least as powerful as any other possible test of the same size, it is called a most powerful (MP) test at that value of θ, and its critical region is called a best critical region (BCR). A test which is most powerful for all regions of θ under consideration is called a uniformly most powerful (UMP) test. Clearly, if a test is MP at θ1
and the test is independent of θ1 , then it is UMP. It is frequently not possible to
find an UMP test, although we will see in section 10.4.1 that if H0 and H1 are both
simple hypotheses, then an UMP test always exists. Unfortunately, in real life an
UMP test does not usually exist. An UMP test which is also unbiased (section
10.3.4) is called UMPU.

10.3.3 Consistency
A highly desirable property of a test is that, as the number of observations in-
creases, it should distinguish better between the hypotheses. A test is termed
consistent if the power tends to unity as the number of observations increases:
lim_{N→∞} P(x ∈ ω|H1 ) = 1
where x is the set of N observations and ω is the critical region under H0 . The power function thus tends to a step function as N → ∞.
[Figure: power functions p(θ) of a consistent test for increasing N, approaching a step function at θ0 .]

10.3.4 Bias
A test is biased if the power func-
tion is smaller at a value of θ corresponding to H1 than at the value, θ0 , specified
by H0 , i.e., when there exists a value θ for which

p(θ) = 1 − β(θ) < α ,   θ ≠ θ0

An example is test B at θ = θ1 in the figure. In such a case the chance of accepting


H0 is greater when θ = θ1 than when θ = θ0 , which means we are more likely to
accept H0 when it is false than when it is true. Such a test is clearly undesirable in
general.
In some situations it may be preferable to use a biased test. For example, test B may be chosen rather than test A if it is particularly important to be able to discriminate against θ = θ2 , where test B is more powerful than A. However, in so doing all discrimination between H0 and H1 in the region of θ1 is lost.
[Figure: power functions of tests A and B; the biased test B dips below α near θ = θ1 but is more powerful than A at θ = θ2 .]
The definition of a biased test can be formulated in a way which is also applicable
for composite hypotheses. Let H0 specify that θ is in some interval θ0 . Then a test
is unbiased if
P(x ∈ ω|θ) ≤ α for all θ ∈ θ0 ,   and   P(x ∈ ω|θ) ≥ α for all θ ∉ θ0
In real life it is usually possible to find an unbiased test.

10.3.5 Distribution-free tests


Most of the time we do not invent our own tests, but instead use some standard
test. To be ‘standard’, the distribution of the test statistic, and hence the size of
the critical region, must be independent of the p.d.f. specified by H0 . It can only
depend on whether H0 is true. Such a test is called distribution-free. An example
is the well-known Pearson’s χ2 test, which we shall meet shortly.
It should be emphasized that it is only the size or level of significance of the
test which does not depend on the distributions specified in the hypotheses. Other
properties of the test do depend on the p.d.f.’s. In particular, the power will depend
on the p.d.f. specified in H1 .

10.3.6 Choice of a test


Traditionally, the choice of a test is done by first specifying the loss α and then
choosing the test on the basis of the power. This procedure assumes that the risk
of an error of the first kind (loss) is a given constant, and that one only has to
minimize the risk of an error of the second kind (contamination).
However, this is frequently not the case. We want to have both kinds of errors as small as possible. It is then advantageous to take both α and β as variables in comparing the tests. Assume, for simplicity, that both H0 and H1 are simple, specifying θ0 and θ1 , respectively. Then, for a given test and a given value of α = p(θ0 ), one can determine β = 1 − p(θ1 ). Repeating for different values of α, a curve giving β as a function of α can be constructed, as shown in the figure.
[Figure: β vs. α curves for tests A, B, C and for the Neyman-Pearson (N-P) test; the curves for A and B cross at α = α1 .]
The dashed line in the figure corre-
sponds to 1 − β = α, so that all unbi-
ased test curves will lie entirely below this line, passing through the points (1,0)
and (0,1). Since we desire to have both α and β small, test C in the figure is clearly
inferior to the others for all values of α and β. If both H0 and H1 are simple, there
always exists a test (the Neyman-Pearson test, cf. section 10.4.1) which is at least
as good as any other test for all α and β. If this test is too complicated, or in the
case of composite hypotheses, one could be in the position of choosing, for example,
between tests A and B. Clearly, test A should be chosen for α < α1 and test B for
α > α1 .
If the hypotheses are composite, the figure can become a multidimensional dia-
gram with new axes corresponding to θ or to other unspecified parameters. Or each
test can be represented by a family of curves in the α-β plane.
A minor difficulty arises when discrete distributions are involved, since only a
discrete set of α’s are then available, and the α-β curves are discontinuous.
The above techniques allow one to choose the best test. Whether it is good
enough depends on the cost (in terms of such things as time and money) of making
an error, i.e., a wrong decision.

10.4 Parametric tests


10.4.1 Simple Hypotheses
The Neyman-Pearson test
When both H0 and H1 are simple hypotheses, the problem of finding the best critical
region (BCR), or most powerful (MP) test, of size α is particularly straightforward,
as was shown by Neyman and Pearson.51
We suppose that the r.v. x is distributed under H0 as f (x; θ0 ) and under H1 as
g(x; θ1 ). Then equations 10.1 and 10.3 can be written
P(x ∈ ωα | H0 ) = ∫_{ωα} f(x; θ0 ) dx = α   (10.5)
P(x ∈ ωα | H1 ) = ∫_{ωα} g(x; θ1 ) dx = 1 − β   (10.6)

We want to find the critical region ωα which, for a given value of α, maximizes
1 − β. Rewriting equation 10.6, we have
1 − β = ∫_{ωα} [g(x; θ1 )/f(x; θ0 )] f(x; θ0 ) dx = E_{ωα}[ g(x; θ1 )/f(x; θ0 ) | H0 ]
which is the expectation of g(x; θ1 )/f (x; θ0 ) in the region ωα assuming that H0 is
true. This will be maximal if we choose the region ωα as that region containing the
points x for which this ratio is the largest. In other words, we order the points x
according to this ratio and add these points to ω until ω has reached the size α.
The BCR thus consists of the points x satisfying
f(x; θ0 )/g(x; θ1 ) ≤ cα
where cα is chosen such that ωα is of size α (equation 10.5).
This ratio is, for a given set of data, just the ratio of the likelihood functions,
which is known as the likelihood ratio. We therefore use the test statistic
λ = L(x|H0 )/L(x|H1 )   (10.7)
and
reject H0 if λ ≤ cα
accept H0 if λ > cα
This is known as the Neyman-Pearson test.
An Example
As an example, consider the normal distribution treated in example 1 of section
10.2. Both H0 and H1 hypothesize a normal p.d.f. of the same variance, but different
means, µ0 under H0 and µ1 under H1 . The variance is, for both hypotheses, specified
as σ 2 . The case where the variance is not specified is treated in section 10.4.3. The
likelihood function under Hi for n observations is then
L(x|Hi ) = (2πσ²)^{−n/2} exp[ −(1/2) Σ_{j=1}^{n} (xj − µi )²/σ² ]
         = (2πσ²)^{−n/2} exp{ −(n/(2σ²)) [s² + (x̄ − µi )²] }
where x̄ and s² are the sample mean and sample variance, respectively. Hence, our test statistic is (equation 10.7)
λ = L(x|H0 )/L(x|H1 ) = exp{ (n/(2σ²)) [(x̄ − µ1 )² − (x̄ − µ0 )²] }
  = exp{ (n/(2σ²)) [2x̄(µ0 − µ1 ) + (µ1² − µ0²)] }
and the BCR is defined by λ ≤ cα or
x̄(µ0 − µ1 ) + ½(µ1² − µ0²) ≤ (σ²/n) ln cα
which becomes
x̄ ≥ ½(µ1 + µ0 ) − σ² ln cα / [n(µ1 − µ0 )]   if µ1 > µ0   (10.8)
x̄ ≤ ½(µ1 + µ0 ) + σ² ln cα / [n(µ0 − µ1 )]   if µ1 < µ0   (10.9)
Thus we see that the BCR is determined by the value of the sample mean. This
should not surprise us if we recall that x̄ was an efficient estimator of µ (section
8.2.7).
In applying the test, we reject H0 if µ1 > µ0 and x̄ is above a certain critical
value (equation 10.8), or if µ1 < µ0 and x̄ is below a certain critical value (equation
10.9).
To find this critical value, we recall that x̄ itself is a normally distributed r.v.
with mean µ and variance σ 2 /n. (This is the result of the central limit theorem,
but when the p.d.f. for x is normal, it is an exact result for all n.) We will treat the
case of µ1 > µ0 and leave the other case as an exercise for the reader.
For µ1 > µ0 , the right-hand side of equation 10.8 is just x̄α given by
√(n/(2πσ²)) ∫_{x̄α}^{∞} exp[ −(n/(2σ²)) (x̄ − µ0 )² ] dx̄ = α
Transforming to a standard normal variable,


z = (x̄ − µ0 )/(σ/√n)   (10.10)

we can rewrite this in terms of the standard normal integral, which is given by the
error function (section 3.7):
α = (1/√(2π)) ∫_{zα}^{∞} e^{−z²/2} dz = (1/√(2π)) ∫_{−∞}^{−zα} e^{−z²/2} dz = erf(−zα )   (10.11)

For example, for α = 0.05 we find in a table that zα = 1.645. For µ0 = 2, σ = 1,


and n = 25, this value of zα inserted in equation 10.10 yields x̄α = 2.33. Then if
x̄ > 2.33, we reject H0 with a level of significance of 5%.
The power of the test can also be easily computed in this example. It is
√(n/(2πσ²)) ∫_{x̄α}^{∞} exp[ −(n/(2σ²)) (x̄ − µ1 )² ] dx̄ = 1 − β
which, in terms of the error function and the zα defined above, can be written
1 − β = 1 − erf( (√n/σ)(µ0 − µ1 ) + zα ) = erf( (√n/σ)(µ1 − µ0 ) − zα )   (10.12)

We see that the power increases monotonically with both n and µ1 − µ0 .
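The numbers of this example are easily reproduced; note that the 'erf' of these notes is the c.d.f. of the standard normal distribution, i.e. norm.cdf below. A sketch (scipy assumed; the value µ1 = 2.5 used to evaluate the power is our own choice):

    # Sketch: critical value (eq. 10.10-10.11) and power (eq. 10.12) for the
    # Gaussian example above.
    import numpy as np
    from scipy.stats import norm

    def critical_value(mu0, sigma, n, alpha):
        z_alpha = norm.ppf(1.0 - alpha)        # erf(-z_alpha) = alpha
        return mu0 + z_alpha * sigma / np.sqrt(n)

    def power(mu0, mu1, sigma, n, alpha):
        z_alpha = norm.ppf(1.0 - alpha)
        return norm.cdf(np.sqrt(n) / sigma * (mu1 - mu0) - z_alpha)   # eq. 10.12

    print(critical_value(mu0=2.0, sigma=1.0, n=25, alpha=0.05))       # ~2.33, as in the text
    print(power(mu0=2.0, mu1=2.5, sigma=1.0, n=25, alpha=0.05))       # ~0.80 for the assumed mu1 = 2.5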

10.4.2 Simple H0 and composite H1


In the previous section we have seen how to construct the best test between two
simple hypotheses. Unfortunately, no such generally optimal method exists when
H0 and/or H1 is not simple.
Suppose that we want to test a simple H0 against a composite H1 . Let us first
treat an H1 which is just a collection of simple hypotheses, e.g., under H0 θ = θ0 ,
and under H1 θ = θ1 or θ2 or θ3 or . . . θn . We could imagine testing H0 against each
of these alternatives separately using a MP test as found in section 10.4.1. However,
this would lead in general to a different critical region in each case and most likely
to acceptance of H0 in some cases and rejection in others. We are therefore led to
inquire whether there exists one BCR for all the alternative values. A test using
such a BCR would be UMP.

UMP test for the exponential family


Unfortunately, an UMP test does not generally exist. One important case where
an UMP test does exist is when the p.d.f. of H0 and H1 is of the exponential
family (section 8.2.7), but then only for ‘one-sided’ tests.4, 5 We illustrate this for
the Gaussian p.d.f. from our results of section 10.4.1.
In that example we saw that for µ1 > µ0 a BCR was given by x̄ ≥ bα and for
µ1 < µ0 by x̄ ≤ aα . Thus if H1 contains only values greater than, or only values
less than µ0 , we have a (one-sided) UMP test, but not if H1 allows values of µ
on both sides of µ0 . In such cases we would intuitively expect that a compromise
critical region defined by x̄ ≤ aα/2 or x̄ ≥ bα/2 would give a satisfactory ‘two-
sided’ test, and this is what is usually used. It is, of course, less powerful than
the one-sided tests in their regions of applicability as is illustrated in the figure.

[Figure: power functions p(µ) for three tests of the same size: critical region in the upper tail only, in the lower tail only, and split equally between both tails; each one-sided test is the more powerful one on its own side of µ0 .]

Maximizing local power


If no UMP test exists, it can be a good idea to look for a test which is most
powerful in the neighborhood of the null hypothesis. This is the place where a test
will usually be least powerful. Consider the two simple hypotheses, both specifying
the same p.d.f.,
H0 : θ = θ0 H1 : θ = θ1 = θ0 + ∆
where ∆ is small.
The log-likelihood can be expanded about θ0 ,

ln L(x; θ1 ) = ln L(x; θ0 ) + ∆ ∂ln L/∂θ |_{θ=θ0} + . . .

Since we are treating two simple tests, we can use the Neyman-Pearson test (equa-
tion 10.7) to reject H0 if the likelihood ratio is smaller than some critical value:
λ = L(x|H0 )/L(x|H1 ) ≤ cα
This is equivalent to
ln L(x; θ0 ) − ln L(x; θ1 ) ≤ ln cα
or (assuming ∆ > 0)

∂ln L/∂θ |_{θ=θ0} ≥ kα ,   kα = −ln cα /∆

Now, if the observations are independent and identically distributed, we know from
section 8.2.5 that under H0 the expectation of L is a maximum and
E[ ∂ln L/∂θ |_{θ=θ0} ] = 0
E[ (∂ln L/∂θ)² ] = nI

for n independent observations, where I is the information on θ for 1 observation.


Under suitable conditions ∂ ∂θ
ln L
is approximately normally distributed with mean 0
and variance nI. The value of kα corresponding to a particular choice of size α can
then be found as in section 10.4.1 (equation 10.11):

α = erf(−zα) ,    where zα = kα/√(nI)
In this way, a locally most powerful test is approximately given by rejecting H0 if

(∂ln L/∂θ)|_{θ=θ0} ≥ zα √(nI) ,    α = erf(−zα)    (10.13)

10.4.3 Composite hypotheses—same parametric family


We now turn to the more general case where both H0 and H1 are composite hy-
potheses. We make a distinction between the case where the p.d.f.’s specified in
the hypotheses belong to one continuous family from the case where they belong to
distinct families. In the first case the only difference between the hypotheses is the
specification of the parameters, e.g.,

H0 : f (x; θ) , with θ < θ0


H1 : f (x; θ) , with θ > θ0

However, in the second case the p.d.f.’s are different and may even involve different
numbers of parameters. In this section we will treat the first case.

Likelihood ratio test


We have seen (section 8.4) that the maximum likelihood method gave estimators
which, under certain conditions, had desirable properties. A method of test con-
struction closely related to it is the likelihood ratio method proposed by Neyman

and Pearson52 in 1928. It has played a similar role in the theory of tests to that
of the maximum likelihood method in the theory of estimation. As we have seen
(sect. 10.4.1), this led to a MP test for simple hypotheses.
Assume that the N observations, x, are independent and that both hypotheses
specify the p.d.f. f (x; θ). Then the likelihood function is
L(x; θ) = ∏_{i=1}^{N} f(xi; θ)

We denote the total parameter space by Θ and a subspace of it by ν. Then the


hypotheses can be specified by
H0 : θ ∈ ν
H1 : θ ∈ Θ − ν
Examples, where for simplicity we assume that there are only two parameters θ =
(θ1 , θ2 ), are

Example 1:  H0: θ1 = a and θ2 = b          H1: θ1 ≠ a and/or θ2 ≠ b
Example 2:  H0: θ1 = c, θ2 unspecified     H1: θ1 ≠ c, θ2 unspecified
Example 3:  H0: θ1 + θ2 = d                H1: θ1 + θ2 ≠ d

In the first example H0 is in fact a simple hypothesis.


We use the term conditional maximum likelihood for the maximum of the like-
lihood function for θ in the region specified by H0 . Similarily, the unconditional
maximum likelihood is the maximum of the likelihood in the entire parameter space.
We define as the test statistic the maximum likelihood ratio, λ, as the ratio of
the conditional maximum likelihood to the unconditional maximum likelihood:
λ = Lν max(x; θ) / LΘ max(x; θ)    (10.14)
Clearly, 0 ≤ λ ≤ 1. Given what we know about the maximum likelihood method for
parameter estimation, it certainly seems reasonable that this statistic would provide
a reasonable test. In the limit of H0 and H1 both being simple, it is equivalent to the
Neyman-Pearson test (equation 10.7, section 10.4.1). The success of the maximum
likelihood ratio as a test statistic is due to the fact that it is always a function of
a sufficient statistic for the problem. Its main justification is its past success. It
has been found very frequently to result in a workable test with good properties, at
least for large sets of observations.
The hypotheses to be tested can usually be written in the form
H0 : θi = θi0 for i = 1, 2, . . . , r (denote this by θr = θr0 )
θj unspecified for j = 1, 2, . . . , s (denote this by θs )
H1 : θi ≠ θi0 for i = 1, 2, . . . , r (denote this by θr ≠ θr0 )
θj unspecified for j = 1, 2, . . . , s

Hypotheses which do not specify exact values for parameters, but rather relation-
ships between parameters, e.g., θ1 = θ2 , can usually be reformulated in terms of
other parameters, e.g., θ1′ = θ1 − θ2 = 0 and θ2′ = θ1 + θ2 unspecified. We can
introduce the more compact notation of L(x; θr , θs ), i.e., we write two vectors of
parameters, first those which are specified under H0 and second those which are not.
The unspecified parameters θs are sometimes referred to as ‘nuisance’ parameters.
In this compact notation, the test statistic can be rewritten as
 
λ = L(x; θr0, θ̂̂s) / L(x; θ̂r, θ̂s)    (10.15)

where θ̂̂s is the value of θs at the maximum of L in the restricted region ν, and θ̂r
and θ̂s are the values of θr and θs at the maximum of L in the full region Θ.
If H0 is true, we expect λ to be near to 1. The critical region will therefore be

λ ≤ cα (10.16)

where cα must be determined from the p.d.f. of λ, g(λ), under H0 . Thus, for a test
of size α, cα is found from

α = ∫_0^{cα} g(λ) dλ    (10.17)

It is thus necessary to know how λ is distributed. Furthermore, to perform this


integration, g(λ) must not depend on any of the unspecified (nuisance) parameters.
Luckily, this is so for most statistical problems.

Example: As an example, let us again take a normal p.d.f. with H0 specifying


the mean as µ = µ0 and H1 specifying µ ≠ µ0. Both hypotheses leave σ unspecified;
thus σ is a nuisance parameter. Then
L(x; µ, σ) = (2πσ²)^{−N/2} ∏_{i=1}^{N} exp[−½ ((xi − µ)/σ)²]

We have seen (section 8.4.1) that the unconditional maximum likelihood estimators
are

µ̂ = x̄ ,    σ̂² = s² = (1/N) Σ_{i=1}^{N} (xi − x̄)²

Thus, the unconditional likelihood is


 
L(x; µ̂, σ̂) = (2πs²)^{−N/2} exp(−N/2)

Under H0 , the maximum likelihood estimator is


σ̂̂² = (1/N) Σ_{i=1}^{N} (xi − µ0)² = s² + (x̄ − µ0)²
Therefore the conditional maximum likelihood is given by
L(x; µ0, σ̂̂) = {2π[s² + (x̄ − µ0)²]}^{−N/2} exp(−N/2)
The likelihood ratio is then
λ = {s² / [s² + (x̄ − µ0)²]}^{N/2}    (10.18)
Consequently,
λ^{2/N} = 1 / [1 + t²/(N − 1)] ,    t² = N(x̄ − µ0)² / [(1/(N−1)) Σ_{i=1}^{N} (xi − x̄)²]    (10.19)

This t is a Student’s t-statistic with N − 1 degrees of freedom (equation 3.40). We


see that λ is a monotonically decreasing function of t2 . Recall that the t-distribution
is symmetric about zero. The critical region, λ < λα , therefore corresponds to the
two regions t < t−α/2 and t > tα/2 . The values of t±α/2 corresponding to a particular
test size α can be found from the Student’s t-distribution, and from that value the
corresponding value of λα follows using the above equation. It can be shown that
this test is UMPU.11, 13
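As a minimal sketch of this example (assuming scipy is available; the data below are invented), one can compute the likelihood ratio of equation 10.18 and the equivalent Student's t of equation 10.19 directly:

    import numpy as np
    from scipy.stats import t as student_t

    def lr_test_mean(x, mu0):
        # likelihood-ratio test of H0: mu = mu0 with sigma unspecified (eqs. 10.18-10.19)
        x = np.asarray(x, dtype=float)
        N = len(x)
        s2 = np.mean((x - x.mean())**2)                       # maximum-likelihood variance estimate
        lam = (s2 / (s2 + (x.mean() - mu0)**2))**(N / 2)      # equation 10.18
        t2 = N * (x.mean() - mu0)**2 / (np.sum((x - x.mean())**2) / (N - 1))  # equation 10.19
        pvalue = 2 * student_t.sf(np.sqrt(t2), df=N - 1)      # two-sided p-value
        return lam, np.sqrt(t2), pvalue

    # example with made-up data
    rng = np.random.default_rng(1)
    lam, tval, p = lr_test_mean(rng.normal(0.3, 1.0, size=20), mu0=0.0)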

Asymptotic distribution of the likelihood ratio


In order to determine the critical region of the likelihood ratio, λ, it is necessary to
know how it is distributed under H0 . Sometimes we can find this distribution quite
easily, as in the example of the previous section. But often it is difficult, since the
distribution is unknown or since it is awkward to handle. One can sometimes use
Monte Carlo, but this is not always satisfactory. The usual procedure is to consider
the asymptotic distribution of the likelihood ratio, and use it as an approximation
to the true distribution.
We know that asymptotically the maximum likelihood estimator θ̂ attains the
minimum variance bound and that θ̂ becomes normally distributed according to the
likelihood function. Suppressing the normalization factor, the likelihood function is
of the form

L(x; θ) = L(x; θr, θs) ∝ exp[−½ (θ̂ − θ)^T I (θ̂ − θ)]    (10.20)
where I is the information matrix for θ,
 . 
I r .. I rs
 . 
I = 
 · · · .. · · · 
.
I T .. I
rs s

Thus, equation 10.20 can be written


L(x; θr, θs) ∝ exp{−½ [(θ̂r − θr)^T I_r (θ̂r − θr) + 2(θ̂r − θr)^T I_rs (θ̂s − θs) + (θ̂s − θs)^T I_s (θ̂s − θs)]}    (10.21)

At the maximum of L under H1 , θ̂r = θr and θ̂s = θs . Thus the exponent of


equation 10.21 is zero and equation 10.21 becomes L ∝ 1. Under H0, we must
replace θ̂s in equation 10.21 by θ̂̂s. At the maximum of L, we have θ̂̂s = θs. Thus,
under H0 equation 10.21 becomes

L ∝ exp[−½ (θ̂r − θ0r)^T I_r (θ̂r − θ0r)]
Taking the ratio, we find
 
λ = exp[−½ (θ̂r − θ0r)^T I_r (θ̂r − θ0r)]
or
−2 ln λ = (θ̂r − θ0r)^T I_r (θ̂r − θ0r)
From the property that L is normally distributed, it follows that −2 ln λ is
distributed as χ² with r degrees of freedom under H0, where r is the number of
parameters specified under H0 . For a test of size α, we therefore reject H0 if
−2 ln λ > χ²α ,    where ∫_{χ²α}^{∞} χ²(r) dχ² = α
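In practice this rejection decision is a one-line computation; the following sketch (an assumed illustration using scipy, with placeholder numbers) obtains the critical value and the p-value from the χ² distribution:

    from scipy.stats import chi2

    def lr_decision(minus2lnlam, r, alpha=0.05):
        # r = number of parameters fixed by H0
        critical = chi2.isf(alpha, df=r)        # chi-square critical value for size alpha
        pvalue = chi2.sf(minus2lnlam, df=r)     # probability of a value at least this large
        return minus2lnlam > critical, pvalue

    reject, p = lr_decision(minus2lnlam=7.2, r=2)   # illustrative numbers only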

Under H1 , it turns out that −2 ln λ is distributed as a non-central χ2 with r


degrees of freedom and non-centrality parameter

K = (θ̂r − θ0r)^T I_r (θ̂r − θ0r)

The non-central χ2 distribution, χ′2 (r, K), is the distribution of a sum of squares of variables
distributed normally with non-zero means and unit variance. It can be used to
calculate the power of the test:4, 5, 11, 13
p = 1 − β = ∫_{χ²α}^{∞} dF1

where F1 is the c.d.f. of χ′2 .


The asymptotic properties of the likelihood ratio test which have been found in
this section depend on the asymptotic properties of the likelihood function, which
in turn rest on regularity assumptions about the likelihood function. In particular,
we have assumed that the range of the p.d.f. does not depend on the value of a
parameter. Nevertheless, it turns out that under certain conditions −2 ln λ is even
then distributed as χ2 , but with 2r instead of r degrees of freedom.11, 13

Small sample behavior of the likelihood ratio


Although the asymptotic properties of the likelihood ratio for hypotheses having
p.d.f.’s of the same family are quite simple, the small sample behavior is not so
easy. The usual approach is to find a correction factor, f , such that −(2 ln λ)/f is
distributed as χ2 (r) even for small N.4, 5, 11, 13 Only the case of the linear model will
be treated here.

Linear model: A particular case is the linear model (section 8.5.2) in which the
N observations yi are assumed to be related to other observations xi , within random
errors ǫi , by a function linear in the k parameters θj ,
yi = y(xi) + ǫi = Σ_{j=1}^{k} θj hj(xi) + ǫi

We assume that the ǫi are normally distributed with mean 0 and variance σ 2 . We
wish to test whether the θj have the specified values θ0j , or more generally, whether
they satisfy some set of r linear constraints,

Aθ = b (10.22)

where A and b are specified under H0 . Under H1 , the θ may take on any set of
values not satisfying the constraints of equation 10.22.
The likelihood for both H0 and H1 is given by
  2 
N k
 1 X X 
L(x; θ) = (2πσ 2 )−N/2 exp − 2 yi − θj hj (xi ) 
2σ i=1 j=1
 
2 −N/2 1
= (2πσ ) exp − Q2
2
We now distinguish two cases:

Variance known. We first treat the case of known variance σ 2 . The esti-
mates of the parameters are given by the least squares solutions (section 8.5), with
constraints for H0 yielding θ̂0j and without constraints for H1 yielding θ̂1j . The
likelihood ratio, λ, is then given by
 2  2
N
X k
X N
X k
X
1 1
−2 ln λ = yi − θ̂0j hj (xi ) −  yi − θ̂1j hj (xi ) = Q20 − Q21
σ2 i=1 j=1 σ2 i=1 j=1
(10.23)
It has been shown4, 5 that the second term can be expressed as the first term plus a
quadratic form in the ǫi , and hence that −2 ln λ is distributed as a χ2 of r degrees
of freedom. This result is true exactly for all N, not just asymptotically. It also
holds if the errors are not independent but have a known covariance matrix.

The test thus consists of performing two least squares fits, one with and one
without the constraints of H0 . Each fit results in a value of Q2 , the difference of
which, Q0² − Q1², is a χ²(r). H0 is then rejected if Q0² − Q1² > χ²α, where ∫_{χ²α}^{∞} χ²(r) dχ² = α.
We can qualitatively understand this result in the following way: Asymptoti-
cally, Q0² is a χ²(N − k + r) and Q1² is a χ²(N − k). From the reproductive property
of the χ² distribution (section 3.12), the difference of these χ² is also a χ² with a
number of degrees of freedom equal to the difference of degrees of freedom of Q0²
and Q1², namely r. Thus the above result follows.
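The following Python sketch (an assumed illustration, not a prescription from the text) carries out this test for the simple case in which H0 fixes all k parameters at θ0, so that the constrained "fit" is just an evaluation at θ0 and r = k; the unconstrained fit is an ordinary least-squares fit:

    import numpy as np
    from scipy.stats import chi2

    def linear_model_test(H, y, sigma, theta0, alpha=0.05):
        # H: (N, k) design matrix with H[i, j] = h_j(x_i); sigma: known Gaussian error
        theta1, *_ = np.linalg.lstsq(H, y, rcond=None)      # unconstrained least-squares estimate
        Q1sq = np.sum((y - H @ theta1)**2) / sigma**2       # unconstrained chi-square
        Q0sq = np.sum((y - H @ theta0)**2) / sigma**2       # chi-square at the H0 values
        r = len(theta0)                                     # here r = k constraints
        pvalue = chi2.sf(Q0sq - Q1sq, df=r)
        return Q0sq - Q1sq, pvalue, (Q0sq - Q1sq) > chi2.isf(alpha, df=r)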

Variance unknown. If the variance σ 2 is unknown, it must be estimated from


the data. Under H0 the estimate of σ 2 is
 2
N k
1 X X
s20 = yi − θ̂0j h(xi )
N i=1 j=1

and the maximum likelihood becomes


 
L(x; H0) = (2π)^{−N/2} (s0²)^{−N/2} exp(−N/2)

The expressions for H1 are similar. The likelihood ratio is then


λ = (s1²/s0²)^{N/2}    (10.24)

or    λ^{−2/N} = 1 + (s0² − s1²)/s1²    (10.25)

It can be shown that (s0² − s1²)/σ² and s1²/σ² are independently distributed as
χ² with r and N − k degrees of freedom, respectively. The ratio,

F = [(N − k)/r] · (s0² − s1²)/s1²    (10.26)

is therefore distributed as the F-distribution (section 3.14). H0 is then rejected if
F > Fα, where ∫_{Fα}^{∞} F(r, N − k) dF = α.
However, under H1, (s0² − s1²)/σ² is distributed as a non-central χ². This leads to
a non-central F -distribution from which the power of the test can be calculated.11, 13

10.4.4 Composite hypotheses—different parametric families
When the p.d.f. specified by H1 can not be attained by varying the parameters
of the p.d.f. of H0 , we speak of different parametric families of functions. The

distribution of the likelihood ratio then usually turns out to depend on N as well
as on which hypothesis is true. The likelihood ratio can still be used as a test, but
these dependences must be properly taken into account.4, 5 The tests are therefore
more complicated.
The easiest method to treat this situation is to construct a comprehensive family
of functions
h(x; θ, φ, ψ) = (1 − θ)f (x; φ) + θg(x; ψ)
by introducing an additional parameter θ.
What we really want to test is H0 against H1 ,
H0 : f (x; φ) , φ unspecified
H1 : g(x; ψ) , ψ unspecified
Instead, we can use the composite function to test H0 against H1′ :
H0 : h(x; θ, φ, ψ) , θ = 0, φ, ψ unspecified
H1′ : h(x; θ, φ, ψ) , θ ≠ 0, φ, ψ unspecified
using the maximum likelihood ratio as in the previous section:
λ = L(x; θ = 0, φ̂̂, ψ̂̂) / L(x; θ̂, φ̂, ψ̂) = [ f(x; φ̂̂) / ((1 − θ̂)f(x; φ̂) + θ̂ g(x; ψ̂)) ]^N    (10.27)

Then under H0 , −2 ln λ is distributed asymptotically as χ2 (1) since one constraint


(θ = 0) has been imposed on the parameter space.
The power of the test can be found using the fact that, under H1 , −2 ln λ
is distributed as a non-central χ2 , χ′2 (1, K) with 1 degree of freedom and non-
centrality parameter K = θ2 /S where
S = E{ [f(x; φ) − g(x; ψ)]² / [(1 − θ)f(x; φ) + θ g(x; ψ)]² }    (10.28)
Since this test compares f (x; φ) with a mixture of f and g, it is not expected to be
very powerful.
In practice, one would also make a test of H1 against the mixture, i.e., define a
new H0′ corresponding to θ = 1, and test this against the mixture H1′ in the same
manner as above, hoping that H0 or H0′ , but not both, would be rejected.

10.5 And if we are Bayesian?


If we are Bayesian, our belief in (the probability of) H0 or H1 is simply given by
Bayes’ theorem. After an experiment giving result x, the probability of Hi (i = 0, 1)
is
P(Hi|x) = [P(x|Hi) / (P(x|H0) + P(x|H1))] Pp(Hi)    (10.29)

where Pp (Hi ) is the probability of Hi before (prior to) doing the experiment and
P (x|Hi ) is the probability of obtaining the result x if Hi is true, which is identical
to L(x|Hi ). We can compare P (H0 |x) and P (H1|x), e.g., by their ratio. If both H0
and H1 are simple hypotheses,

P(H0|x)/P(H1|x) = [P(x|H0)/P(x|H1)] · [Pp(H0)/Pp(H1)]    (10.30)
                = λ Pp(H0)/Pp(H1)    (10.31)

where λ is just the likelihood ratio (eq. 10.7). This leads to statements such as
“the probability of H0 is, e.g., 20 times that of H1 ”. Note, however, that here, as
always with Bayesian statistics, it is necessary to assign prior probabilities. In the
absence of any prior knowledge, Pp (H0 ) = Pp (H1 ). The test statistic is then λ, just
as in the Neyman-Pearson test (section 10.4.1). However now the interpretation is
a probability rather than a level of significance.
Suppose that H1 is a composite hypothesis where a parameter θ is unspecified.
Equation 10.30 remains valid, but with
P(x|H1) = ∫ f(x, θ|H1) dθ    (10.32)
        = ∫ P(x|θ, H1) f(θ|H1) dθ    (10.33)

Now, P (x|θ, H1 ) is identical to L(x; θ) under H1 and f (θ|H1 ) is just the prior p.d.f.
of θ under H1 . In practice, this may not be so easy to evaluate. Let us therefore
make some simplifying assumptions for the purpose of illustration. We know that
asymptotically L(x; θ) is proportional to a Gaussian function of θ (eq. 8.72). Let us
take a prior probability uniform between θmin and θmax and zero otherwise. Then,
with σ²_θ̂ the variance of the estimate, θ̂, of θ, equation 10.33 becomes

P(x|H1) = ∫ Lmax(x; θ) exp[−(θ − θ̂)²/(2σ²_θ̂)] · [1/(θmax − θmin)] dθ    (10.34)
        = [Lmax(x; θ)/(θmax − θmin)] ∫_{θmin}^{θmax} exp[−(θ − θ̂)²/(2σ²_θ̂)] dθ    (10.35)
        = Lmax(x; θ) σ_θ̂ √(2π) / (θmax − θmin)    (10.36)

where we have assumed that the tails of the Gaussian cut off by the integration
limits θmin , θmax are negligible. Thus equation 10.30 becomes

P(H0|x)/P(H1|x) = λ [Pp(H0)/Pp(H1)] (θmax − θmin)/(σ_θ̂ √(2π))    (10.37)

where λ is now the maximum likelihood ratio λ = L(x|H0 )/Lmax (x|H1 ). Note that
there is a dependence not only on the prior probabilities of H0 and H1 , but also on
the prior probability of the parameter θ.
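As an illustration of equation 10.37, the following sketch computes the posterior odds under the stated Gaussian-likelihood and uniform-prior assumptions. All numerical inputs are placeholders, and the function name is invented for the example:

    import numpy as np

    def posterior_odds(lam_max, sigma_theta, theta_min, theta_max,
                       prior_H0=0.5, prior_H1=0.5):
        # lam_max = L(x|H0) / Lmax(x|H1), the maximum likelihood ratio
        occam_factor = (theta_max - theta_min) / (sigma_theta * np.sqrt(2 * np.pi))
        return lam_max * (prior_H0 / prior_H1) * occam_factor    # equation 10.37

    odds = posterior_odds(lam_max=0.1, sigma_theta=0.05, theta_min=0.0, theta_max=1.0)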

Someone remarked to me once:


“Physicians shouldn’t say, ‘I have cured this man’,
but, ‘this man didn’t die under my care’.”
In physics too, instead of saying,
“I have explained such and such phenomenon”,
one might say, “I have determined causes for it
the absurdity of which cannot be conclusively proved.”
—Georg Christoph Lichtenberg

10.6 Goodness-of-fit tests


10.6.1 Confidence level or p-value
As in the previous section, we are concerned with testing an hypothesis H0 at
some significance level α. Again, H0 will be rejected if a test statistic has a value
which lies in the critical region ω. The difference with the previous section lies
in the alternative hypothesis H1 . Now H1 is simply not H0 , i.e., H1 is the set of
all possible alternatives to H0 . Thus H1 can not be formulated and consequently,
the chance of an error of the second kind can not be known. Nor can most of the
tests of the previous section (including the use of Bayesian probability) be applied,
involving as they do the likelihood ratio, for if we do not specify H1 , we can not
calculate the likelihood under H1 .
Goodness-of-fit tests compare the experimental data with the p.d.f. specified
under H0 and lead to the statement that the data are consistent or inconsistent
with H0 . Usually one states a confidence level, e.g., “The data are consistent with
H0 at a confidence level of 80%.” The confidence level∗ (cl), also known as
p-value,† is the size that the test would have if the critical region were such that
the test statistic were at the boundary between rejection and acceptance of H0 .
In other words, it is the probability, assuming H0 is true, of obtaining a value of
the test statistic as “bad” as or “worse” than that actually obtained. Thus, a high
confidence level means that if H0 is true there is a large chance of obtaining data
‘similar’ to ours. On the contrary, if cl is small there is a small chance, and H0


∗ Many authors use 1 − cl where we use cl.
† The preferable term is p-value, since it eliminates confusion with the confidence level of
confidence intervals (chapter 9), which, although related, is different. Nevertheless, the term
confidence level is widely used, especially by physicists.

can be rejected. Despite the suggestive “p”, the p-value is not a probability; it is a
random variable.
We shall only consider distribution-free tests, for the practical reason that they
are widely applicable. To apply a test, one needs to know the p.d.f. of the test
statistic in order to calculate the confidence level. For the well-known tests tables
and/or computer routines are widely available. For a specific problem it may be
possible to construct a better test, but it may not be so much better that it is worth
the effort.

10.6.2 Relation between Confidence level and Confidence Intervals
The same integrals are involved in confidence intervals and goodness-of-fit tests. To
illustrate this, consider a r.v., x, which is distributed normally, f (x) = N (x; µ, σ 2 ).
For n points, assuming σ 2 known, the estimator of the mean, t = x̄, is also nor-
mally distributed:
f (t) = N (t; µ, σ 2 /n)
The coverage probability (or confidence coefficient or confidence level) of the confi-
dence interval [µ− , µ+ ], e.g., for a central confidence interval from equation 9.12,
is given by equation 9.10,
β = ∫_{t−(µ)}^{t+(µ)} N(t; µ, σ²/n) dt    (10.38)

which holds for any value of µ.


If H0 states that x is distributed normally with mean µ = 0,

H0 : f (x) = N (x; 0, σ 2 ) or f (t) = N (t; 0, σ 2 /n)

and if the data give t = x̄, the confidence level or p-value (for a symmetric two-sided
test) is
cl = ∫_{−∞}^{−|x̄|} N(t; 0, σ²/n) dt + ∫_{+|x̄|}^{+∞} N(t; 0, σ²/n) dt
   = 1 − ∫_{−|x̄|}^{+|x̄|} N(t; 0, σ²/n) dt    (10.39)

Note the similarity of the integrals in equations 10.38 and 10.39. We see that
the coverage probability of the interval [−|x̄|, +|x̄|], β, is related to the p-value
by cl = 1 − β. However, for the confidence interval, the coverage probability
is specified first and the interval, [µ− , µ+ ], is the random variable, while for the
goodness-of-fit test the hypothesis is specified (µ = µ0 ) and the p-value is the r.v.
Referring to the confidence belt figure of section 9.2, and supposing that θt is
the hypothesized value of the parameter µ0 , t− (µ0 ) and t+ (µ0 ) are the values of

t̂ which would give cl = 1 − β. Put another way, if we decide to reject H0 if


cl < α, then the regions outside the confidence belt for β = 1 − α is the rejection
region. Thus the confidence belt defines the acceptance region of the corresponding
goodness-of-fit test.

10.6.3 The χ2 test


Probably the best known and most used goodness-of-fit test is the χ2 test. We
have already frequently alluded to it. We know (section 3.12) that the sum of the
squares of N standard normally distributed r.v.'s is itself a r.v. which is distributed
as χ²(N). Hence, assuming that our measurements, yi, have normally distributed
errors, σi, the sum

X² = Σ_{i=1}^{N} (yi − fi)²/σi²    (10.40)
where fi is the value that yi is predicted to have under H0 , will be distributed as
χ2 (N ). The cl is easily calculable from the χ2 distribution:
cl = ∫_{X²}^{∞} χ²(z; N) dz    (10.41)

This X 2 is just the quantity that is minimized in a least squares fit (where we
denoted it by Q2 ). In the linear model, assuming Gaussian errors, X 2 = Q2min is
still distributed as χ2 even though parameters have been estimated by the method.
However the number of degrees of freedom is reduced to N − k, where k is the
number of parameters estimated by the fit. If constraints have been used in the fit
(cf. section 8.5.6), the number of degrees of freedom is increased by the number of
constraints, since each constraint among the parameters reduces by one the number
of free parameters estimated by the fit. If the model is non-linear, X 2 = Q2min is
only asymptotically distributed as χ2 (N − k).
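A minimal sketch of this test (assuming scipy; the inputs are whatever measurements, predictions and errors one has at hand) follows directly from equations 10.40 and 10.41:

    import numpy as np
    from scipy.stats import chi2

    def chisq_test(y, f, sigma, n_fitted_params=0):
        # equation 10.40; ndf = N - k when k parameters were estimated by the fit
        X2 = np.sum(((np.asarray(y) - np.asarray(f)) / np.asarray(sigma))**2)
        ndf = len(y) - n_fitted_params
        return X2, ndf, chi2.sf(X2, df=ndf)     # equation 10.41: cl is the upper-tail probability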
It is sometimes argued that the χ2 test should be two-tailed rather than one-
tailed, i.e., that H0 should be rejected for unlikely small values of X 2 as well as
for unlikely large values. Arguments given for this practice are that such small
values are likely to have resulted from computational errors, overestimation of the
measurement errors σi , or biases (unintentional or not) in the data which have not
been accounted for in making the prediction. However, while an improbably small
value of X 2 might well make one suspicious that one or more of these considerations
had occurred (and indeed several instances of scientific fraud have been discovered
this way), such a low X 2 can not be regarded as a reason for rejecting H0 .

10.6.4 Use of the likelihood function


It is often felt that since the likelihood function is so useful in parameter estimation
and in the formulation of tests of hypotheses, it should also be useful as a goodness-
of-fit test. Frequently the statement is made that it can not be used for this purpose.
In fact, it can be used, but it is usually difficult to do so.

The problem is that in order to use the value of L as a test, we must know
how L is distributed in order to be able to calculate the confidence level. Suppose
that we have N independent observations, xi , each distributed as f (x). The log
likelihood is then just
ℓ = Σ_{i=1}^{N} ln f(xi)

If no parameter is estimated from the data, the mean of ℓ is just


E[ℓ] = ∫ [Σ_{i=1}^{N} ln f(xi)] L dx1 dx2 ... dxN = N ∫ ln f(x) f(x) dx

Similarly higher moments could be calculated, and from these moments (just the
first two if N is large and the central limit theorem is applicable) the distribution
of ℓ, g(ℓ), could be reconstructed. The confidence level would then be given by
cl = ∫_{−∞}^{ℓ} g(ℓ) dℓ    (10.42)

If parameters are estimated by maximum likelihood, the calculations become


much more complicated. A simple, but expensive, solution is to generate Monte
Carlo experiments. From each Monte Carlo experiment one calculates ℓ and thus
obtains an approximate distribution for ℓ from which the cl can be determined.

10.6.5 Binned data


We now consider tests of binned data.∗ Since binning data loses information, we
should expect such tests to be inferior to tests on individual data. Further, we must
be sure to have a sufficient number of events in each bin, since most of the desirable
properties of the tests are only true asymptotically.
However, binning the data removes the difficulty that H1 is completely unspec-
ified, since the number of events in a bin must be distributed multinomially. Thus
both H0 and H1 specify the multinomial p.d.f. Some or all of the parameters are
specified under H0 ; none of them are specified under H1 further than that they
are different from those specified under H0 .

Likelihood ratio test


Suppose that we have k bins with ni events in bin i and Σ_{i=1}^{k} ni = N. Let H0
be a simple hypothesis, i.e., all parameters are specified. Let pi be the probability
content of bin i under H0 and qi the probability content under the true p.d.f.,


∗ Although we use the term ‘binned’, which suggests a histogram, any classification of the
observations may be used. See also section 8.6.1.

which we of course do not know. The likelihood under H0 and under the true p.d.f.
are then, from the multinomial p.d.f., given by

L0(n|p) = N! ∏_{i=1}^{k} pi^{ni} / ni!

L(n|q) = N! ∏_{i=1}^{k} qi^{ni} / ni!

An estimate q̂i of the true probability content can be found by maximizing L(n|q)
subject to the constraint Σ_{i=1}^{k} qi = 1. The result∗ is

q̂i = ni / N
The test statistic is then the likelihood ratio (cf. section 10.4.3)
λ = L0(n|p) / L(n|q̂) = N^N ∏_{i=1}^{k} (pi/ni)^{ni}    (10.43)

The exact distribution of λ is not known. However, we have seen in section 10.4.3
that −2 ln λ is asymptotically distributed as χ2 (k − 1) under H0 , where the num-
ber of degrees of freedom, k − 1, is the number of parameters specified. The multi-
nomial p.d.f. has only k − 1 parameters (pi) because of the restriction Σ_{i=1}^{k} pi = 1.
If H0 is not simple, i.e., not all pi are specified, the test can still be used but the
number of degrees of freedom must be decreased accordingly.
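Note that equation 10.43 implies −2 ln λ = 2 Σ ni ln[ni/(N pi)]. A minimal sketch of the test for a simple H0 (assuming scipy; the function is an illustration only):

    import numpy as np
    from scipy.stats import chi2

    def binned_lr_test(n, p):
        # n: observed bin contents, p: bin probabilities specified by H0
        n = np.asarray(n, dtype=float)
        p = np.asarray(p, dtype=float)
        N = n.sum()
        mask = n > 0                                 # empty bins contribute zero to the sum
        minus2lnlam = 2.0 * np.sum(n[mask] * np.log(n[mask] / (N * p[mask])))
        return minus2lnlam, chi2.sf(minus2lnlam, df=len(n) - 1)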

Pearson’s χ2 test
The classic test for binned data is the χ2 test proposed by Karl Pearson53 in 1900.
It makes use of the asymptotic normality of a multinomial p.d.f. to find that under
H0 the statistic
X² = Σ_{i=1}^{k} (ni − Nπi)²/(Nπi)    (10.44)
is distributed asymptotically as χ2 (k − 1).
If H0 is not simple, its free parameters can be estimated, (section 8.6.1) by
the minimum chi-square method. In that method, the quantity which is minimized
with respect to the parameters (equation 8.152) is just Pearson’s X 2 . The mini-
mum value thus found therefore serves to test the hypothesis. It can be shown that
in this case X 2 is asymptotically distributed as χ2 (k − s − 1) where s is the num-
ber of parameters which are estimated. This is also true if the binned maximum

∗ This was derived for the binomial p.d.f. in section 8.4.7. It may be trivially extended to the
multinomial case by treating each bin separately as binomially distributed between that bin and
all the rest.

likelihood method (section 8.6.2) is used to estimate the parameters.11, 13 Simi-


larly, the quantity which is minimized in the modified minimum chi-square method
(equation 8.154) is also asymptotically distributed as χ2 (k − s − 1).
But what if we estimate the parameters by a different method? In particular, as
is frequently the case, what if we estimate the parameters by maximum likelihood
using the individual data rather than the binned data? It then turns out11, 13 that
X 2 is still distributed as χ2 , but with a number of degrees of freedom, d, between
that of the binned fit and the fully specified case, i.e., k − s − 1 ≤ d ≤ k − 1.
The exact number of degrees of freedom depends on the p.d.f. The test is then no
longer distribution free, although for large k and small s it is nearly so.
Equation 10.44 assumes that H0 only predicts the shape of the distribution, i.e.,
the probability, πi, that an event will be in bin i, with Σ πi = 1. If also the total
number of events is predicted by H0 , the distribution is no longer multinomial,
but rather a multinomial times a Poisson or, equivalently, the product of k Poisson
distributions. The test statistic is then
X² = Σ_{i=1}^{k} (ni − νi)²/νi    (10.45)

where, under H0 , νi is the mean (and variance) of the Poisson distribution for
bin i. Since each bin is independent, there are now k degrees of freedom, and X 2
is distributed asymptotically as χ2 (k − s).
Pearson’s χ2 test makes use of the squares of the deviations of the data from
that expected under H0 . Tests can be devised which use some other measure of
deviation, replacing the square of the absolute value of the deviation by some other
power and scaling the deviation or not by the expected variance. Such tests are,
however, beyond the scope of this course.

Choosing optimal bin size


If one is going to bin his data, he must define the bins. If the number of bins is
small, too much information may be lost. But a large number of bins may mean that
there are too few events per bin. Most of the results for binned data are only true
asymptotically, e.g., the normal limit of the multinomial p.d.f. or the distribution
of −2 ln λ or X 2 as χ2 .
There are, in fact, two questions which play a role here. The first is whether the
binning may be decided on the basis of the data; the second concerns the minimum
number of events per bin. At first glance it would seem that the bin boundaries
should not depend on the observations themselves, i.e., that we should decide on
the binning before looking at the data. If the bin boundaries depend on the data,
then the bin boundaries are random variables, and no provision has been made in
our formalism for fluctuations in the position of these boundaries. On the other
hand, the asymptotic formalism holds for any set of fixed bins, and so we might

expect that it does not matter which of these sets we happen to choose, and this
has indeed been shown to be so.11, 13
Intuitively, we could expect that we should choose bins which are equiprobable
under H0 . Pearson’s χ2 test is consistent (asymptotically unbiased) whatever the
binning, but for finite N it is not, in general, unbiased. It can be shown4, 5, 11, 13
that for equiprobable bins it is locally unbiased, i.e., unbiased against alternatives
which are very close to H0 , which is certainly a desirable property.
Having decided on equiprobable bins, the next question is how many bins.
Clearly, we must not make the number of bins k too large, since the multinor-
mal approximation to the multinomial p.d.f. will no longer be valid. A rough rule
which is commonly used is that no expected frequency, N pi , should be smaller
than ∼ 5. However, according to Kendall and Stuart,11 there seems to be no gen-
eral theoretical basis for this rule. Cochran goes even further and claims4 that the
asymptotic approximation remains good so long as not more than 20% of the bins
have an expected number of events between 1 and ∼ 5.
This does not necessarily mean that it is best to take k = N/5 bins. By
maximizing local power, one can try to arrive at an optimal number of bins. The
result4, 5 is

k = b [√2 (N − 1) / (λα + λ1−p0)]^{2/5}    (10.46)

where α = 1 − ∫_{−λα}^{+λα} N(x; 0, 1) dx is the size of the test for a standard normal
distribution and p0 is the local power. In general, for a simple hypothesis a value for b
between 2 and 4 is good, the best value depending on the p.d.f. under H0. Typical
values for k (with N/k in parentheses) using b = 2 are given in the following table:

                       p0 = 0.5    p0 = 0.8
  N = 200   α = 0.01   27 (7.4)    24 (8.3)
            α = 0.05   31 (6.5)    27 (7.4)
  N = 500   α = 0.01   39 (13)     35 (14)
            α = 0.05   45 (11)     39 (13)

We see from the table that there is only a mild sensitivity of the number of bins to α
and p0. For N = 200, 25–30 bins would be reasonable.
Thus we are led to the following recommendations for binning:

1. Determine the number of bins, k, using equation 10.46 with b ∼ 2 to 4.

2. If N/k turns out to be too small, decrease k to make N/k ≥ 5.

3. Define the bins to have equal probability content, either from the p.d.f. spec-
ified by H0 or from the data.

4. If parameters have to be estimated (H0 does not specify all parameters), use
maximum likelihood on the individual observations, but remember that the
test statistic is then only approximately distribution-free.

Note, however, that, regardless of the above prescription, if the p.d.f. under H0
does not include resolution effects, one should not choose bins much smaller than
the resolution.
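As an illustration of recommendation 3 above (a sketch only; the unit Gaussian c.d.f. under H0 and the value of k are assumptions of the example), equiprobable bins can be built by mapping the data through the c.d.f. specified by H0:

    import numpy as np
    from scipy.stats import norm, chi2

    def pearson_equiprobable(data, k):
        # equiprobable bins in x correspond to equal-width bins in u = F(x),
        # here with F taken to be the unit Gaussian c.d.f. (an assumption of this example)
        u = norm.cdf(np.asarray(data))
        idx = np.clip((u * k).astype(int), 0, k - 1)
        n = np.bincount(idx, minlength=k)
        expected = len(data) / k                  # N * p_i with p_i = 1/k
        X2 = np.sum((n - expected)**2 / expected)
        return X2, chi2.sf(X2, df=k - 1)          # simple H0: k - 1 degrees of freedom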
Even with the above prescription, the specification of the bins is still not unique.
The usual method in one dimension would be to define a bin as an interval in the
variable, bini = (xi , xi + δi ). However, there is nothing in the above prescription
to forbid defining a single bin as consisting of more than one (nonadjacent) interval.
This might even be desirable from the point of view H0 . For example, H0 might
specify a p.d.f. that is symmetric about 0, and we might only be interested in testing
this hypothesis against alternatives which are also symmetric about 0. Then it
would be appropriate to define bins as intervals in |x| rather than in x.
In more than one dimension the situation is more ambiguous. For example,
to construct equiprobable bins in two dimensions, the easiest way is to first find
equiprobable bins in x and then for each bin in x to find equiprobable bins in
y. This is easily generalized to more dimensions. However, one could equally well
choose first to construct bins in y and then in x, which in general would yield
different bins. One could also choose different numbers of bins in x than in y. The
choice depends on the individual situation. One should prefer smaller bins in the
variable for which H0 is most sensitive.
There is, obviously, one taboo: You must not try several different choices of
binning and choose the one which gives the best (or worst) confidence level.

10.6.6 Run test


χ2 tests make use of the squares of the deviations of the data from that expected
under H0 . Thus they only use the size of the deviations and ignore their signs.
However, the signs of the deviations are also important, and systematic deviations
of the same sign indicate that the hypothesis is unlikely, as is illustrated in the figure.

A test which uses only the sign of the devi-


ations is the run test. A run is defined as a
set of adjacent points all having the same sign
of deviation. The data and curve in the figure
have deviations AAABBBBBBAAA, where
A represents a positive and B a negative devi-
ation. There are thus three runs, which seems
rather small. We would expect the chance of an
A to equal that of a B and to show no correlation
between points if the hypothesis were true. This
implies that we should expect runs to be short;
a long run of 6 points as in the figure should be
unlikely. In fact, this expectation is strictly true only if H0 is a simple hypothesis.
To be more quantitative, let kA be the number of positive deviations and kB

the number of negative deviations. Let k = kA + kB . Given kA and kB , we can


calculate the probability that there will be r runs. If either kA or kB is zero, there
is necessarily only one run, and P(r = 1) = 1.
Given kA and kB , the number of different ways to arrange them is
C(k, kA) = k! / (kA! kB!)

Suppose that there are r runs. First, suppose that r is even and that the sequence
begins with an A. Then there are kA A-points and r/2 − 1 divisions between
them. For the example of the figure this is AAA|AAA. With kA A’s there are
kA − 1 places to put the first dividing line, since it can not go at the ends. Then
there are kA − 2 places to put the second dividing line, since it can not go at the
ends or next to the first dividing line. In total there are C(kA − 1, r/2 − 1) ways to arrange
the dividing lines among the A’s. There is a similar factor for arrangement of the
B’s and a factor 2 because we assumed we started with an A and it could just have
well been a B. Thus the probability of r runs, for r even, is
  
P(r) = 2 C(kA − 1, r/2 − 1) C(kB − 1, r/2 − 1) / C(k, kA)    (10.47)

Similarly, one finds for r odd


     
P(r) = [C(kA − 1, (r−3)/2) C(kB − 1, (r−1)/2) + C(kA − 1, (r−1)/2) C(kB − 1, (r−3)/2)] / C(k, kA)    (10.48)

From these it can be shown that the expectation and variance of r are

E[r] = 1 + 2 kA kB / k    (10.49)

V[r] = 2 kA kB (2 kA kB − k) / [k² (k − 1)]    (10.50)

The critical region of the test is defined as improbably low values of r, r < rα .
For kA and kB greater than about 10 or 15, one can use the Gaussian approx-
imation for r. For smaller numbers one can compute the probabilities directly
using equations 10.47 and 10.48. In our example, kA = kB = 6. From equa-
tions 10.49 and 10.50 we expect r = 7 with variance 2.73, or σ = 1.65. We
observe 3 runs, which differs from the expected number by 4/1.65 = 2.4 standard
deviations. Using the Gaussian approximation, this corresponds to a (one-tailed)
confidence level of 0.8%. Exact calculation using equations 10.47 and 10.48 yields
P (1) + P (2) + P (3) = 1.5%. Whereas the χ2 is acceptable (χ2 = 12 for 12
points), the run test suggests that the curve does not fit the data.
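The exact probabilities are easily computed; the following Python sketch (an assumed illustration) implements equations 10.47–10.50 and the one-tailed probability of observing r_obs or fewer runs:

    from math import comb

    def c(n, k):
        # binomial coefficient, defined to be zero outside the valid range
        return comb(n, k) if 0 <= k <= n else 0

    def prob_runs(r, kA, kB):
        denom = comb(kA + kB, kA)
        if r % 2 == 0:                               # equation 10.47
            m = r // 2
            return 2 * c(kA - 1, m - 1) * c(kB - 1, m - 1) / denom
        m = (r - 1) // 2                             # equation 10.48
        return (c(kA - 1, m - 1) * c(kB - 1, m) + c(kA - 1, m) * c(kB - 1, m - 1)) / denom

    def run_test(r_obs, kA, kB):
        k = kA + kB
        mean = 1 + 2 * kA * kB / k                                   # equation 10.49
        var = 2 * kA * kB * (2 * kA * kB - k) / (k**2 * (k - 1))     # equation 10.50
        pvalue = sum(prob_runs(r, kA, kB) for r in range(1, r_obs + 1))
        return mean, var, pvalue

    # the example in the text: kA = kB = 6 and 3 runs observed
    print(run_test(3, 6, 6))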

The run test is much less powerful than a χ2 test, using as it does much less
information. But the two tests are completely independent and hence they can
be combined. An hypothesis may have an acceptable χ2 , but still be wrong and
rejectable by the run test. Unfortunately, the run test is applicable only when H0
is simple. If parameters have been estimated from the data, the distribution of the
number of runs is not known and the test can not be applied.

10.6.7 Tests free of binning


Since binning loses information, we should expect tests which do not require binning
to be in principle better than tests which do.
The successful bin-free tests are based on the c.d.f., F (x), under H0 and consist
of in some way comparing this c.d.f. with the data. To do so involves the concept
of order statistics, which are just the observations, xi , ordered in some way, i.e.,
renumbered as x(j) . In one dimension this is trivial. For n observations, the order
statistics obey
x(1) ≤ x(2) ≤ . . . ≤ x(n)

In more than one dimension it is rather arbitrary, implying as it were a reduction of


the number of dimensions to one. Even in one dimension the ordering is not free of
ambiguity since we could equally well have ordered in descending order. We could
also make a change of variable which changes the order of the data.
We define the sample c.d.f. for n observations as

0
, x < x(1)

r
Sn (x) = , x(r) ≤ x < x(r+1)
n (10.51)

1 , x
(n) ≤ x

which is simply the fraction of the observations not exceeding x. Clearly, under
H0 , Sn (x) → F (x) as n → ∞. The tests consist of comparing Sn (x) with
F (x). We shall discuss two such tests, the Smirnov-Cramér-von Mises test and the
Kolmogorov test. Unfortunately, both are only applicable to simple hypotheses,
since the distribution of the test statistic is not distribution-free when parameters
have been estimated from the data.

Smirnov-Cramér-von Mises test

As a measure of the difference between Sn (x) and F (x) this test uses the statistic
W² = ∫_0^1 [Sn(x) − F(x)]² ψ(x) dF
   = ∫_{−∞}^{+∞} [Sn(x) − F(x)]² ψ(x) f(x) dx

with ψ(x) = 1. We see that W 2 is the expectation of [Sn (x) − F (x)]2 under
H0 . Inserting Sn (equation 10.51) and performing the integral results in
nW² = 1/(12n) + Σ_{i=1}^{n} [F(x(i)) − (2i − 1)/(2n)]²    (10.52)

The asymptotic distribution of nW² has been found, and from it critical regions have
been computed. Those corresponding to frequently used test sizes are given in the
following table. The asymptotic distribution is reached remarkably rapidly. To the
accuracy of this table,∗ the asymptotic limit is reached4, 5, 11 for n ≥ 3.

  Test size α    Rejection region nW² >
  0.10           0.347
  0.05           0.461
  0.01           0.743
  0.001          1.168
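A minimal sketch of equation 10.52 (assumed; the unit Gaussian is taken as the c.d.f. under H0 purely for illustration):

    import numpy as np
    from scipy.stats import norm

    def cramer_von_mises(x, cdf=norm.cdf):
        # nW^2 of equation 10.52 for the c.d.f. F specified by H0
        x = np.sort(np.asarray(x))
        n = len(x)
        i = np.arange(1, n + 1)
        return 1.0 / (12 * n) + np.sum((cdf(x) - (2 * i - 1) / (2 * n))**2)

    nW2 = cramer_von_mises(np.random.default_rng(0).normal(size=100))
    # e.g. reject at the 5% level if nW2 exceeds 0.461 (table above)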

Kolmogorov test
This test also compares Sn and F (x), but only uses the maximum difference: The
Kolmogorov (or Smirnov, or Kolmogorov-Smirnov) test statistic is the maximum
deviation of the observed distribution Sn (x) from the c.d.f. F (x) under H0 :
Dn = max {|Sn (x) − F (x)|} for all x (10.53)

The asymptotic distribution of Dn yields the critical regions shown in the table.
This approximation is considered satisfactory for more than about 80 observations.4, 5, 11
Computer routines also exist.†

  Test size α    Rejection region √n Dn >
  0.01           1.63
  0.05           1.36
  0.10           1.22
  0.20           1.07

Alternatively, one can take the maximum positive deviation,


Dn+ = max {+ [Sn (x) − F (x)]} for all x (10.54)
It can be shown that 4n(Dn+ )2 is distributed asymptotically as a χ2 of 2 degrees
of freedom. The same holds for Dn− ,
Dn− = max {− [Sn (x) − F (x)]} for all x (10.55)
Or, as proposed by Kuiper,56 one can use
V = Dn+ + Dn− (10.56)

∗ A more complete and more accurate table is given by Anderson and Darling,54 who also
consider the test statistic with ψ(x) = {F(x) [1 − F(x)]}^{−1}.
† See, e.g., Numerical Recipes.55

Asymptotic critical regions of V can be calculated.55, 57
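A minimal sketch (assumed) of the Kolmogorov statistic of equation 10.53; scipy.stats.kstest returns the same Dn together with a p-value, so in practice one rarely codes it by hand:

    import numpy as np
    from scipy.stats import norm, kstest

    def kolmogorov_D(x, cdf=norm.cdf):
        # Dn of equation 10.53; the maximum deviation occurs at a data point,
        # just before or just after the step of Sn
        x = np.sort(np.asarray(x))
        n = len(x)
        F = cdf(x)
        i = np.arange(1, n + 1)
        return np.max(np.maximum(i / n - F, F - (i - 1) / n))

    data = np.random.default_rng(2).normal(size=200)       # invented data
    D = kolmogorov_D(data)
    D_scipy, pvalue = kstest(data, norm.cdf)                # same Dn, plus a p-value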

The sensitivity of the Kolmogorov test to deviations from the c.d.f. is not in-
dependent of x. It is more sensitive around the median value and less sensitive
in the tails. This occurs because the difference |Sn (x) − F (x)| does not, under
H0, have a probability distribution that is independent of x. Rather, its variance
is proportional to F (x) [1 − F (x)], which is largest at F = 0.5. Consequently,
the significance of a large deviation in a tail is underweighted in the test. The Kol-
mogorov test therefore turns out to be more sensitive to departures of the data from
the median of H0 than to departures from the width. Various modifications of the
Kolmogorov test statistic have been proposed54, 58, 59 to ameliorate this problem.

Although the distribution of the test statistic, Dn , is generally unknown if pa-


rameters have been estimated from the data, there are cases where the distribution
has been calculated, e.g., when H0 specifies an exponential distribution whose mean
is estimated from the data.60 It also may be possible to determine the distribution
of the test statistic yourself, e.g., using Monte Carlo techniques.

10.6.8 But use your eyes!

A few words of caution are appropriate at this point. As illustrated by the figure
at the start of the section on the run test (section 10.6.6), one test may give an
acceptable value while another does not. Indeed, it is in the nature of statistics
that this must sometimes occur.

Also, a fit may be quite good over part of the range of the variable and quite bad
over another part. The resulting test value will be some sort of average goodness,
which can still have an acceptable value. And so: Do not rely blindly on a test. Use
your eyes. Make a plot and examine it.

There are several useful plots you can make. One is, as was done to illustrate
the run test, simply a plot of the data with the fit distribution superimposed. Of
course, the error bars should be indicated. It is then readily apparent if the fit
is bad only in some particular region, and frequently you get an idea of how to
improve the hypothesis. This is illustrated in the figure where the fit (dashed line)
in (a) is perfect, while in (b) higher order terms are clearly needed and in (c) either
higher orders or a discontinuity are required.

[Figure: data with fitted curves (dashed), y versus x, in three panels (a), (b), (c).]
Since it is easier to see departures from a horizontal straight line, you could
instead plot the residuals, yi − f (xi ), or even better, the residuals divided by their
error, (yi − f (xi ))/δ, where δ can be either the error on the data, or the expected
error from a fit.
It may happen that there is only one or just a few data points which account for
almost all the deviation from the fit. These are known as outliers. One is tempted
to throw such points away on the assumption that they are due to some catastrophic
error in the data taking, e.g., writing down 92 instead of 29. However, one must be
careful. Statistics can not really help here. You have to decide on the basis of what
you know about your apparatus. Automatic outlier rejection should be avoided. It
is said∗ that the discovery of the hole in the ozone layer above the south pole was
delayed several years because computer programs automatically rejected the data
which indicated its presence.

It’s not right to pick only what you like,


but to take all of the evidence.
—Richard P. Feynman

I don’t see the logic of rejecting data


just because they seem incredible.
—Sir Fred Hoyle


∗ Cited by Barlow1 from New Scientist, 31 March 1988.

10.7 Non-parametric tests


The main classes of non-parametric problems which can be solved by distribution-
free methods are

1. The two-sample problem. We wish to test whether two (or more generally k)
samples are distributed according to the same p.d.f.

2. Randomness. A series of n observations of a single variable is ordered in some


way, e.g., in the time at which the observation was made. We wish to test
that all of the observations are distributed according to the same p.d.f., i.e.,
that there has been no change in the p.d.f. as a function of, e.g., time.

3. Independence of variables. We wish to test that a bivariate (or multivariate)


distribution factorizes into two independent marginal distributions, i.e., that
the variables are independent (cf. section 2.2.4).

These are all hypothesis-testing problems, which are similar to the goodness-of-
fit problem in that the alternative hypothesis is simply not H0 .
The first two of the above problems are really equivalent to the third, even
though the first two involve observations of just one quantity. For problem 1, we
can combine the two samples xi(1) and xi(2) into one sample by defining a second
variable yi = 1 or 2 depending on whether xi is from the first or the second sample.
Independence of x and y is then equivalent to independence of the two samples. For
problem 2, suppose that the xi of problem 3 are just the observations of problem 2
and that the yi are the order of the observations. Then independence of xi and yi
is equivalent to no order dependence of the observations of problem 2. Let us begin
then with problem 3.

10.7.1 Tests of independence


We have a sample of observations consisting of pairs of real numbers, (x, y) dis-
tributed according to some p.d.f., f (x, y), with marginal p.d.f.’s, g(x) and h(y).
We wish to test
H0 : f (x, y) = g(x) h(y)

Sample correlation coefficient


An obvious test statistic is the sample correlation coefficient (cf. equation 2.27).
r = [(1/n) Σ_{i=1}^{n} xi yi − x̄ȳ] / (sx sy) = (⟨xy⟩ − x̄ȳ) / (sx sy)    (10.57)

where x̄ and ȳ are the sample means and sx and sy are the sample standard deviations of x
and y, respectively, and ⟨xy⟩ is the sample mean of the product xy. Under H0, x

and y are independent, which leads to the following expectations:


E[Σ xi yi] = Σ E[xi yi] = Σ E[xi] E[yi] = n E[x] E[y]

Since E [x̄ȳ] = E [x] E [y], it follows that

E [r] = 0

Higher moments of r can also be easily calculated. It turns out that the variance
is V[r] = 1/(n − 1). Thus, the first two moments are exactly equal to the moments
of the bivariate normal distribution with zero correlation. Further, the third and
fourth moments are asymptotically approximately equal to those of the normal
distribution. From this it follows11 that
t = r √[(n − 2)/(1 − r²)]    (10.58)
is distributed approximately as Student’s t-distribution with (n − 2) degrees of
freedom, the approximation being very accurate even for small n. The confidence
level can therefore be calculated from the t-distribution. H0 is then rejected for
large values of |t|.
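A minimal sketch (assumed) of this test, implementing equations 10.57 and 10.58 directly:

    import numpy as np
    from scipy.stats import t as student_t

    def correlation_test(x, y):
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = len(x)
        r = (np.mean(x * y) - x.mean() * y.mean()) / (x.std() * y.std())   # equation 10.57
        t = r * np.sqrt((n - 2) / (1 - r**2))                              # equation 10.58
        return r, t, 2 * student_t.sf(abs(t), df=n - 2)                    # two-sided p-value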

Rank tests
The rank of an observation xi is simply its position, j, among the order statistics
(cf. section 10.6.7), i.e., the position of xi when all the observations are ordered.
In other words,
rank(xi ) = j if x(j) = xi (10.59)
The relationship between statistics, order statistics and rank is illustrated in the
following table:

  i                            1     2     3     4     5     6
  statistic (measurement) xi   7.1   3.4   8.9   1.1   2.0   5.5
  order statistic x(i)         1.1   2.0   3.4   5.5   7.1   8.9
  rank rank(xi)                5     3     6     1     2     4

For each pair of observations (xi , yi ), the difference in rank

Di = rank(xi ) − rank(yi ) (10.60)

is calculated. Spearman’s rank correlation coefficient is then defined as


ρ = 1 − [6/(n³ − n)] Σ_{i=1}^{n} Di²    (10.61)

which can take on values between −1 and 1. If x and y are completely correlated,
xi and yi will have the same rank and Di will be zero, leading to ρ = 1. It can be
shown1, 11 that for large n (≥ 10) ρ has the same distribution as r in the previous
section, and Student’s t-distribution can be used, substituting ρ for r in equation
10.58.
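A minimal sketch (assumed): Spearman's ρ from equation 10.61, with scipy.stats.spearmanr as a cross-check (it handles ties by averaging ranks and supplies a p-value); the data are invented:

    import numpy as np
    from scipy.stats import rankdata, spearmanr

    def spearman_rho(x, y):
        D = rankdata(x) - rankdata(y)            # rank differences, equation 10.60
        n = len(D)
        return 1 - 6 * np.sum(D**2) / (n**3 - n) # equation 10.61

    rho, pvalue = spearmanr([7.1, 3.4, 8.9, 1.1, 2.0, 5.5],
                            [1.0, 0.5, 1.2, 0.1, 0.3, 0.9])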

10.7.2 Tests of randomness


Given n observations, xi , ordered according to some other variable, e.g., time, called
the trend variable, we wish to test whether the xi are random in, i.e., independent
of, the trend variable, t. H0 is then that all the xi are distributed according to the
same p.d.f.
As already remarked, we can test for randomness in the same way as for inde-
pendence by making a y-variable equal to the trend variable, yi = ti .
If the trend is assumed to be monotonic, additional tests are possible. The
reader is referred to Kendall and Stuart.11, 13

10.7.3 Two-sample tests


Given independent samples of n1 and n2 observations, we wish to test whether
they come from the same p.d.f. The hypothesis to be tested is thus

H0 : f1 (x) = f2 (x)

If both samples contain the same number of observations (n1 = n2 ), we can group
the two samples into one sample of pairs of observations and apply one of the tests
for independence. However, we can also adapt (without the restriction n1 = n2 )
any of the goodness-of-fit tests (section 10.6) to this problem.

Kolmogorov test
The Kolmogorov test (cf. section 10.6.7) adapted to the two-sample problem com-
pares the sample c.d.f.’s of the two samples. Equations 10.53-10.55 become

D_{n1n2} = max {|S_{n1}(x) − S_{n2}(x)|} for all x    (10.62)

D±_{n1n2} = max {± [S_{n1}(x) − S_{n2}(x)]} for all x    (10.63)

However, now the critical values given in section 10.6.7 are in terms of
√[n1n2/(n1 + n2)] D_{n1n2} rather than √n Dn, and 4 [n1n2/(n1 + n2)] (D±_{n1n2})² rather than 4n(D±_n)², respectively.

Run test
The two samples are combined keeping track of the sample from which each obser-
vation comes. Runs in the sample number, rather than in the sign of the deviation,

are then found. In the notation of section 10.6.6, A and B correspond to an obser-
vation coming from sample 1 and sample 2, respectively. The test then follows as
in section 10.6.6.

χ2 test
Consider two histograms with identical binning. Let nji be the number of entries
in bin i of histogram j. Each histogram has k bins and a total of Nj entries.
The Pearson χ2 statistic (equation 10.44) becomes a sum over all bins of both
histograms,
X² = Σ_{j=1}^{2} Σ_{i=1}^{k} (nji − Nj pi)²/(Nj pi)    (10.64)
Under H0 the probability content pi of bin i is the same for both histograms and
it is estimated from the combined histogram:
p̂i = (n1i + n2i)/(N1 + N2)
Substituting this for pi in equation 10.64 results, after some work, in
" #
2
k
1 X n21i k
1 X n22i
X = (N1 + N2 ) + −1 (10.65)
N1 i=1 n1i + n2i N2 i=1 n1i + n2i

In the usual limit of a large number of events in each bin, X 2 is distributed as a


χ2 (k − 1). The number of degrees of freedom is k − 1, since that is the number
of parameters specified by H0 . In other words, there are 2(k − 1) free bins, and
(k −1) parameters are estimated from the data, leaving (k −1) degrees of freedom.
This is directly generalizable to more than two histograms. For r histograms,
     
X² = (Σ_{j=1}^{r} Nj) [ Σ_{j=1}^{r} (1/Nj) Σ_{i=1}^{k} nji² / (Σ_{j'=1}^{r} nj'i) − 1 ]    (10.66)

which, for all nji large, behaves as χ2 with (r − 1)(k − 1) degrees of freedom.
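A minimal sketch (assumed) of the two-histogram comparison of equation 10.65, for histograms with identical binning and no empty combined bins:

    import numpy as np
    from scipy.stats import chi2

    def two_histogram_chi2(n1, n2):
        n1, n2 = np.asarray(n1, float), np.asarray(n2, float)
        N1, N2 = n1.sum(), n2.sum()
        tot = n1 + n2
        X2 = (N1 + N2) * (np.sum(n1**2 / tot) / N1 + np.sum(n2**2 / tot) / N2 - 1.0)
        k = len(n1)
        return X2, chi2.sf(X2, df=k - 1)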

Mann-Whitney test
As previously mentioned, the two-sample problem can be viewed as a test of inde-
pendence for which, as we have seen, rank tests can be used. A rank test appropriate
for this problem is the Mann-Whitney test, which is also known as the Wilcoxon∗
test, the rank sum test, or simply the U -test. Let the observations of the first

∗ Wilcoxon proposed the test before Mann and Whitney, but his name is also used for another
test, the Wilcoxon matched pairs test, which is different. The use of Mann-Whitney here eliminates
possible confusion.

sample be denoted xi and those of the second sample yi . Rank them together.
This results in a series like xyyxxyx. For each x value, count the number of y
values that follow it and add up these numbers. In the above example, there are
3 y values after the first x, 1 after the second, 1 after the third, and 0 after the
fourth. Their sum, which we call Ux is 5. Similarly, Uy = 3 + 3 + 1 = 7. In fact,
you only have to count for one of the variables, since
Ux + Uy = Nx Ny
Ux can be computed in another way, which may be more convenient, by finding the
total rank, Rx , of the x’s, which is the sum of the ranks of the xi . In the example
this is Rx = 1 + 4 + 5 + 7 = 17. Then Ux is given by
Ux = Nx Ny + Nx(Nx + 1)/2 − Rx    (10.67)
Under H0, one expects Ux = Uy = ½ Nx Ny. Asymptotically, Ux is distributed
normally1, 11 with mean ½ Nx Ny and variance (1/12) Nx Ny (Nx + Ny + 1), from which
(two-tailed) critical values may be computed. For small samples, one must resort
to tables.
This test can be easily extended11 to r samples: For each of the ½ r(r − 1) pairs
of samples, Ux is calculated (call it Upq for the samples p and q) and summed
U = Σ_{p=1}^{r} Σ_{q=p+1}^{r} Upq    (10.68)

Asymptotically U is distributed normally under H0 with mean and variance:


 
E[U] = ¼ [N² − Σ_{p=1}^{r} Np²]    (10.69)

V[U] = (1/72) [N²(2N + 3) − Σ_{p=1}^{r} Np²(2Np + 3)]    (10.70)

where N = Σ_{p=1}^{r} Np.
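A minimal sketch (assumed) of the two-sample test, using the rank-sum form of equation 10.67 and the asymptotic normal approximation; scipy.stats.mannwhitneyu provides the same test with refinements for ties and small samples:

    import numpy as np
    from scipy.stats import norm, rankdata

    def mann_whitney(x, y):
        Nx, Ny = len(x), len(y)
        ranks = rankdata(np.concatenate([x, y]))
        Rx = ranks[:Nx].sum()                          # total rank of the x's
        Ux = Nx * Ny + Nx * (Nx + 1) / 2 - Rx          # equation 10.67
        mean = Nx * Ny / 2
        var = Nx * Ny * (Nx + Ny + 1) / 12
        z = (Ux - mean) / np.sqrt(var)
        return Ux, 2 * norm.sf(abs(z))                 # two-tailed p-value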

10.7.4 Two-Gaussian-sample tests


The previous two-sample tests make no assumptions about the distribution of the
samples and are completely general. If we know something about the distribution
we can make more powerful tests. Often, thanks to the central limit theorem, the
distribution is (at least to a good approximation) Gaussian. If this is not the case,
a simple transformation such as x → ln x, x → x2 , or x → 1/x may result
in a distribution which is nearly Gaussian. If we are testing whether two samples
have the same distribution, testing the transformed distribution is equivalent to
testing the original distribution. We now consider tests for two samples under the
assumption that both are normally distributed.

Test of equal mean


As we have already done several times when dealing with normal distributions, we
distinguish between cases where the variance of the distributions is or is not known.

Known σ: Suppose we have two samples, xi and yi , both known to have a


Gaussian p.d.f. with variance σx2 and σy2 , respectively. If σx2 = σy2 , the hypothesis
that the two Gaussians are the same is equivalent to the hypothesis that their
means are the same, or that the difference in their means, θ = µx − µy , is zero.
An obvious test that the means are equal, also valid when σx² ≠ σy², is given by an
estimate of this difference, θ̂ = µ̂x − µ̂y , which has variance
V[θ̂] = V[µ̂x] + V[µ̂y] = σx²/Nx + σy²/Ny
We know that the difference of two normally distributed random variables is also
normally distributed. Therefore, θ̂ will be distributed as a Gaussian with variance
V[θ̂] and mean 0 or non-0 under H0 and H1, respectively. H0 is then rejected for
large |θ̂| and the size of the test follows from the integral of the Gaussian over the
critical region as in sections 10.4.1 and 10.4.2. This is, of course, just a question of
how many standard deviations θ̂ is from zero, and rejection of H0 if θ̂ is found to
be too many σ from zero.
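
As an illustration, a minimal Fortran sketch of this test; the data and the "known"
standard deviations are invented.

  ! Minimal sketch (data and the known standard deviations are invented).
  program equal_mean_known_sigma
    implicit none
    real, parameter :: sigx = 0.5, sigy = 0.8    ! known standard deviations
    real :: x(6) = (/ 10.2, 9.6, 10.5, 9.9, 10.1, 10.4 /)   ! hypothetical sample 1
    real :: y(4) = (/ 10.9, 11.3, 10.6, 11.0 /)             ! hypothetical sample 2
    real :: theta, vtheta, z

    theta  = sum(x)/size(x) - sum(y)/size(y)     ! theta-hat = mu-hat_x - mu-hat_y
    vtheta = sigx**2/size(x) + sigy**2/size(y)   ! V[theta-hat]
    z      = theta/sqrt(vtheta)                  ! number of sigma from zero

    print *, 'theta-hat =', theta, '   z =', z
  end program equal_mean_known_sigma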

Unknown σ: If the parent p.d.f. of each sample is known to be normal, but its
variance is unknown, we can estimate the variance for each sample:
        σ̂x² = Σ_{i=1}^{Nx} (xi − x̄)² / (Nx − 1) ;
        σ̂y² = Σ_{i=1}^{Ny} (yi − ȳ)² / (Ny − 1)                           (10.71)
A Student’s-t variable can then be constructed. Recall that such a r.v. is the ratio
of a standard Gaussian r.v. to the square root of a reduced χ² r.v. Under H0,
µx = µy and θ̂ = (x̄ − ȳ)/√(σx²/Nx + σy²/Ny) is normally distributed with mean 0
and variance 1. From equation 10.71 we see that

        χ² = (Nx − 1) σ̂x²/σx² + (Ny − 1) σ̂y²/σy²                          (10.72)

is distributed as χ² with Nx + Ny − 2 degrees of freedom, the loss of 2 degrees of
freedom coming from the determination of x̄ and ȳ. The ratio, θ̂/√(χ²/(Nx + Ny − 2)),
is then distributed as Student’s t. However, we can calculate this only if σx and
σy can be eliminated from the expression. This occurs if σx = σy, resulting in

        t = (x̄ − ȳ) / [ S √(1/Nx + 1/Ny) ]                                (10.73)

where

        S² = [ (Nx − 1) σ̂x² + (Ny − 1) σ̂y² ] / (Nx + Ny − 2)              (10.74)

Note that S² is in fact just the estimate of the variance obtained by combining
both samples.
We emphasize that this test rests on two assumptions: (1) that the p.d.f. of
both samples is Gaussian and (2) that both Gaussians have the same variance. The
latter can be tested (cf. section 10.7.4). As regards the former, it turns out that
this test is remarkably robust. Even if the parent p.d.f. is not Gaussian, this test is
a good approximation [11]. This was also the case for the sample correlation (section
10.7.1).
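
As an illustration, a minimal Fortran sketch of equations 10.71, 10.73 and 10.74
for two invented samples.

  ! Minimal sketch of equations 10.71, 10.73 and 10.74 (sample values invented).
  program two_sample_t
    implicit none
    real :: x(5) = (/ 10.1, 9.8, 10.4, 10.0, 9.9 /)   ! hypothetical sample 1
    real :: y(4) = (/ 10.6, 10.3, 10.8, 10.5 /)       ! hypothetical sample 2
    integer :: nx, ny
    real :: xbar, ybar, vx, vy, s2, t

    nx = size(x)
    ny = size(y)
    xbar = sum(x)/nx
    ybar = sum(y)/ny
    vx = sum((x - xbar)**2)/(nx - 1)                  ! sigma-hat_x**2  (eq. 10.71)
    vy = sum((y - ybar)**2)/(ny - 1)                  ! sigma-hat_y**2
    s2 = ((nx-1)*vx + (ny-1)*vy)/(nx + ny - 2)        ! pooled variance (eq. 10.74)
    t  = (xbar - ybar)/sqrt(s2*(1.0/nx + 1.0/ny))     ! eq. 10.73

    print *, 't =', t, '  with', nx+ny-2, ' degrees of freedom'
  end program two_sample_t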

Correlated samples: In the above we have assumed that the two samples are
uncorrelated. A common case where samples are correlated is in testing the effect of
some treatment. For example, the light transmission of a set of crystals is measured.
The crystals are then treated in some way and the light transmission is measured
again. One could compare the means of the sample before and after treatment.
However, we can introduce a correlation by using the simple mathematical relation
Σ xi − Σ yi = Σ (xi − yi). A crystal whose light transmission was lower than
the average before the treatment is likely also to be below the average after the
treatment, i.e., there is a positive correlation between the transmission before and
after. This reduces the variance of the before-after difference, θ: σθ² = σx² + σy² −
2ρσxσy. We do not have to know the correlation, or indeed σx or σy, but can
estimate the variance of θ = x − y directly from the data:

        σ̂θ² = (1/(N − 1)) Σ_{i=1}^{N} (θi² − θ̄²)                          (10.75)

Again we find a Student’s-t variable: θ̂ = θ̄ is normally distributed with variance
σθ²/N. Thus, √N θ̄/σθ is a standard normal r.v. Further, (N − 1)σ̂θ²/σθ² is a χ²
r.v. of N − 1 degrees of freedom. Hence, the ratio

        t = √N θ̄ / σ̂θ                                                     (10.76)

is a Student’s-t variable of N − 1 degrees of freedom, one degree of freedom being
lost by the determination of θ̄, a result already known from equation 3.40.
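
As an illustration, a minimal Fortran sketch of the paired test of equations 10.75
and 10.76 for invented before/after measurements.

  ! Minimal sketch of equations 10.75 and 10.76 (measurement values invented).
  program paired_t
    implicit none
    real :: before(5) = (/ 31., 28., 40., 35., 33. /)   ! hypothetical data
    real :: after(5)  = (/ 33., 30., 41., 38., 34. /)
    real :: d(5), dbar, vd, t
    integer :: n

    n = size(d)
    d = after - before                    ! per-pair differences theta_i
    dbar = sum(d)/n                       ! theta-bar
    vd = sum((d - dbar)**2)/(n - 1)       ! sigma-hat_theta**2  (eq. 10.75)
    t  = dbar*sqrt(real(n))/sqrt(vd)      ! eq. 10.76: t with n-1 d.o.f.

    print *, 't =', t, '  with', n-1, ' degrees of freedom'
  end program paired_t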

Test of equal variance


One could approach this problem as above for the means, i.e., estimate the variance
of each sample and compare their difference with zero. However, this requires
knowing the means or, if unknown, estimating them. Further, we must know how
this difference is distributed.

A more straightforward approach makes use of the F-distribution (cf. section
3.14), which is the p.d.f. for the ratio of two reduced χ² variables. For each sample,
the estimate of the variance (equation 8.3 or 8.7 depending on whether the mean is
known) divided by the true variance is related to a χ² (cf. equation 10.72). Thus

        F = [χx²/(Nx − 1)] / [χy²/(Ny − 1)] = (σ̂x²/σ²) / (σ̂y²/σ²)         (10.77)

is distributed as the F-distribution. The σ² cancels in this expression, and
consequently F can be calculated directly from the data. We could just as well have
used 1/F instead of F; it is also F-distributed, with the parameters interchanged.
By convention F is taken > 1. The parameters of the F-distribution are ν1 = Nx − 1,
ν2 = Ny − 1 if σ̂x² is in the numerator of equation 10.77.
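
As an illustration, a minimal Fortran sketch of equation 10.77 for two invented
samples; the larger variance estimate is put in the numerator, following the
convention above.

  ! Minimal sketch of equation 10.77 (sample values invented).
  program f_test_variance
    implicit none
    real :: x(6) = (/ 5.1, 4.7, 5.5, 4.9, 5.3, 5.0 /)   ! hypothetical sample 1
    real :: y(5) = (/ 4.2, 5.9, 5.1, 6.0, 4.4 /)        ! hypothetical sample 2
    integer :: nx, ny, nu1, nu2
    real :: vx, vy, f

    nx = size(x)
    ny = size(y)
    vx = sum((x - sum(x)/nx)**2)/(nx - 1)   ! variance estimate of sample 1
    vy = sum((y - sum(y)/ny)**2)/(ny - 1)   ! variance estimate of sample 2

    if (vx >= vy) then                      ! larger estimate in the numerator (F > 1)
       f = vx/vy;  nu1 = nx - 1;  nu2 = ny - 1
    else
       f = vy/vx;  nu1 = ny - 1;  nu2 = nx - 1
    end if

    print *, 'F =', f, '  with (', nu1, ',', nu2, ') degrees of freedom'
  end program f_test_variance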

“Never trust to general impressions, my boy,


but concentrate yourself upon details.”
—Arthur Conan Doyle: Sherlock Holmes in
“A Case of Identity”

10.7.5 Analysis of Variance


Analysis of Variance (AV or ANOVA), originally developed by R. A. Fisher in the
1920’s, is widely used in the social sciences, and there is much literature—entire
books—about it. In the physical sciences it is much less frequently used and so will
be only briefly treated here in the context of testing whether the means of k normal
samples are equal. The method is much more general. In particular, it can be used
for parameters in the linear model. As usual, Kendall and Stuart [11, 13] provide a
wealth of information.

The basic method: One-way classification


Given k samples, each normally distributed with the same unknown variance, σ²,
we want to test whether the means of all samples are the same. Suppose that sample
i contains Ni measurements and has a sample mean ȳi, which estimates its true
mean µi. Using all N = Σ_{i=1}^{k} Ni measurements we can calculate the overall
sample mean ȳ in order to estimate the overall true mean µ. The null hypothesis
is that µ = µi for all i.
If the µi differ we can expect the ȳi to differ more from ȳ than would be expected
from the variance of the parent Gaussian alone. Unfortunately, we do not know σ,

which would enable us to calculate this expectation. We can, however, estimate
σ from the data. We can do this in two ways: from the variation of y within
the samples and from the variation of ȳ between samples. The results of these
two determinations can be compared and tested for equality. To do this we will
construct an F variable (section 3.14). Recall that F is the ratio of two reduced
χ² variables.
The expected error on the mean estimated from Ni events is σ/√Ni. Therefore,
under H0

        χ²(k) = Σ_{i=1}^{k} (ȳi − µ)² / (σ²/Ni)

is distributed as χ²(k). Since µ is unknown, we replace it by its estimate (obtained
from the entire sample) to obtain a χ² of k − 1 degrees of freedom:

        χ²(k − 1) = Σ_{i=1}^{k} Ni (ȳi − ȳ)² / σ²                          (10.78)

A second χ² variable is obtained from the estimate of σ for each sample,

        σ̂i² = (1/(Ni − 1)) Σ_{j=1}^{Ni} (yj(i) − ȳi)²                      (10.79)

(where yj(i) is element j of sample i), by a weighted average:

        σ̂² = (1/(N − k)) Σ_{i=1}^{k} (Ni − 1) σ̂i²                          (10.80)

which is a generalization of equation 10.74. Then (N − k) σ̂²/σ² is a χ² r.v. with
N − k degrees of freedom, since k sample means, ȳi, have also been determined.
The ratio of these two χ² variables, normalized by dividing by their respective
numbers of degrees of freedom, is an r.v. distributed as F(k − 1, N − k):

        F = [ (1/(k−1)) Σ_{i=1}^{k} Ni (ȳi − ȳ)² ]
            / [ (1/(N−k)) Σ_{i=1}^{k} Σ_{j=1}^{Ni} (yj(i) − ȳi)² ]          (10.81)

If the hypothesis of equal means is false, the ȳi will be different and the numerator of
equation 10.81 will be larger than expected under H0 while the denominator, being
an average of the sample variances within the samples, will be unaffected (remember that
the true variance of all samples is known to be the same). Hence large values of F
are used to reject H0 with a confidence level determined from the one-tailed critical
values of the F distribution. If there are only two samples, this analysis is equivalent
to the previously described two-sample test using Student’s t distribution.
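
As an illustration, a minimal Fortran sketch of equation 10.81 for three invented
samples of unequal size.

  ! Minimal sketch of equation 10.81 for k = 3 samples (data invented); the
  ! samples are stored consecutively in one array with sizes in n(:).
  program one_way_anova
    implicit none
    integer, parameter :: k = 3
    real :: y(12) = (/ 10.1, 9.8, 10.4, 10.0,        &   ! sample 1
                       10.6, 10.3, 10.8,             &   ! sample 2
                       9.5, 9.9, 10.2, 9.7, 10.0 /)      ! sample 3
    integer :: n(k) = (/ 4, 3, 5 /)
    integer :: i, lo, hi, ntot
    real :: ybar, ymean, ssb, ssw, f

    ntot = sum(n)
    ybar = sum(y)/ntot                        ! overall sample mean
    ssb = 0.0                                 ! between-sample sum of squares
    ssw = 0.0                                 ! within-sample sum of squares
    lo = 1
    do i = 1, k
       hi = lo + n(i) - 1
       ymean = sum(y(lo:hi))/n(i)             ! sample mean ybar_i
       ssb = ssb + n(i)*(ymean - ybar)**2
       ssw = ssw + sum((y(lo:hi) - ymean)**2)
       lo = hi + 1
    end do
    f = (ssb/(k - 1))/(ssw/(ntot - k))        ! F(k-1, N-k) of eq. 10.81

    print *, 'F =', f, '  d.o.f. =', k-1, ntot-k
  end program one_way_anova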

Multiway analysis of variance


Let us examine the situation of the previous section in a slightly different way. An
estimate of the variance of the (Gaussian) p.d.f. is given by σ̂² = Q/(N − 1), where
the “sum of squares” (SS), denoted here by Q (in contrast to previous sections where
Q² was used), is given (cf. equation 8.118) by

        Q = (N − 1) σ̂² = Σ_{i=1}^{N} (yi − ȳ)²                             (10.82)

Under H0, Q/σ² is a χ² of N − 1 degrees of freedom. Equation 10.82 can be
rewritten

   Q = Σ_{i=1}^{k} Σ_{j=1}^{Ni} (yj(i) − ȳ)²                                (10.83)
     = Σ_{i=1}^{k} Σ_{j=1}^{Ni} (yj(i) − ȳi + ȳi − ȳ)²
     = Σ_{i=1}^{k} [ Σ_{j=1}^{Ni} { (yj(i) − ȳi)² + (ȳi − ȳ)² }
                     + 2 (ȳi − ȳ) Σ_{j=1}^{Ni} (yj(i) − ȳi) ]

The second term is zero since both its sums are equal:

        Σ_{j=1}^{Ni} yj(i) = Σ_{j=1}^{Ni} ȳi = Ni ȳi

Hence,

        Q = (N − 1) σ̂² = Σ_{i=1}^{k} (Ni − 1) σ̂i² + Σ_{i=1}^{k} Ni (ȳi − ȳ)²   (10.84)

There are thus two contributions to our estimate of the variance of the p.d.f.: The
first term is the contribution of the variance of the measurements within the samples;
the second is that of the variance between the samples. The number of degrees
of freedom is also partitioned. As we have seen in the previous section, the first and
second terms are related to χ² variables of N − k and k − 1 degrees of freedom,
respectively, and their sum, N − 1, is the number of degrees of freedom of the χ²
variable associated with σ̂².
Now suppose the samples are classified in some way such that each sample has
two indices, e.g., the date of measurement and the person performing the measure-
ment. We would like to partition the overall variance between the various sources:
the variance due to each factor (the date and the person) and the innate residual
variation. In other words, we seek the analog of equation 10.84 with three terms.
We then want to test whether the mean of the samples is independent of each factor
separately.

Of course, the situation can be more complicated. There can be more than two
factors. The classification is called “crossed” if there is a sample for all combinations
of factors. More complicated is the case of “nested” classification where this is not
the case. Further, the number of observations in each sample can be different. We
will only treat the simplest case, namely two-way crossed classification.
We begin with just one observation per sample. As an example, suppose that
there are a number of technicians who have among their tasks the weighing of
samples. As a check of the procedure, a reference sample is weighed once each day
by each technician. One wants to test (a) whether the balance is stable in time,
i.e., gives the same weight each day, and (b) that the weight found does not depend
on which technician performs the measurement.
In such a case the measurements can be placed in a table with each row cor-
responding to a different value of the first factor (the date) and each column to a
value of the second factor (the technician). Suppose that there are R rows and C
columns. The total number of measurements is then N = RC. We use subscripts
r and c to indicate the row and column, respectively. The sample means of row r
and column c are given, respectively, by

        ȳr. = (1/C) Σ_{c=1}^{C} yrc ;     ȳ.c = (1/R) Σ_{r=1}^{R} yrc       (10.85)

In this notation a dot replaces indices which are averaged over, except that the
dots are suppressed if all indices are averaged over (ȳ ≡ ȳ.. ). We now proceed as
in equations 10.82-10.84 to separate the variance (or more accurately, the sum of
squares, SS) between rows from the rest:
        Q = Σ_r Σ_c (yrc − ȳ)²                                              (10.86)
          = Σ_r Σ_c (yrc − ȳr.)² + C Σ_r (ȳr. − ȳ)²                         (10.87)

where C is, of course, the same for all rows and hence can be taken out of the sum
over r. The second term, to be denoted QR , is the contribution to the SS due to
variation between rows while the first term contains both the inter-column and the
innate, or residual, contributions.
We can, in the same way, separate the SS between columns from the rest. The result
can be immediately written down by exchanging columns and rows in equation
10.87:

        Q = Σ_c Σ_r (yrc − ȳ.c)² + R Σ_c (ȳ.c − ȳ)²                         (10.88)
The residual contribution, QW , to the SS can be obtained by subtracting the inter-
row and inter-column contributions from the total:

        QW = Q − QR − QC
           = Σ_r Σ_c (yrc − ȳ)² − C Σ_r (ȳr. − ȳ)² − R Σ_c (ȳ.c − ȳ)²

which, using the fact that

        Σ_r Σ_c yrc = C Σ_r ȳr. = R Σ_c ȳ.c = CR ȳ                          (10.89)

can be shown to be equal to

        QW = Σ_r Σ_c (yrc − ȳr. − ȳ.c + ȳ)²

We have thus split the variance into three parts. The number of degrees of freedom
also partitions:

        Two-way Crossed Classification – Single Measurements

        Factor     SS                                          d.o.f.
        Row        QR = C Σ_r (ȳr. − ȳ)²                       R − 1
        Column     QC = R Σ_c (ȳ.c − ȳ)²                       C − 1
        Residual   QW = Σ_r Σ_c (yrc − ȳr. − ȳ.c + ȳ)²         RC − R − C + 1
        Total      Q  = Σ_r Σ_c (yrc − ȳ)²                     RC − 1

Divided by their respective numbers of degrees of freedom, the SS are, under
H0, all estimators of σ². The hypotheses H0R, that the means of all rows are equal,
and H0C, similarly defined for columns, can be separately tested by the one-tailed
F-test using, respectively,

   FR = [QR/(R−1)] / [QW/((R−1)(C−1))] ,   FC = [QC/(C−1)] / [QW/((R−1)(C−1))]   (10.90)

Let us now look at this procedure somewhat more formally. What we have, in
fact, done is to use the following model for our measurements:

        yrc = µ + θr + ωc ,     Σ_r θr = Σ_c ωc = 0                         (10.91)

which is a linear model with R + C + 1 parameters subject to 2 constraints.
The measurements are then equal to µ + θr + ωc + εrc, where the measurement
errors, εrc, are assumed to be normally distributed with the same variance. The
hypothesis to be tested is that all the θr and ωc are 0. The θr and the ωc can be
tested separately. The least squares estimator for θr is

        θ̂r = (1/C) Σ_c yrc − µ̂ = ȳr. − ȳ                                   (10.92)

If all θr are zero, which is the case under H0, then

        χ² = Σ_r θ̂r² / (σ²/C) = C Σ_r (ȳr. − ȳ)² / σ² = QR / σ²             (10.93)

is a χ² of R − 1 degrees of freedom. However, since σ² is unknown, we cannot
use this χ² directly.
As shown above, a second, independent χ² can be found, namely QW/σ²,
which comes from that part of the sum of squares not due to inter-row or
inter-column variation. This χ² is then combined with that of equation 10.93 to
make an F-test for the hypothesis that all θr are zero. Similarly, an F-test can be
derived for the hypothesis that all ωc are zero. The method can be extended to
much more complicated linear models.
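
As an illustration, a minimal Fortran sketch of the single-measurement two-way
classification above, computing QR, QC, QW and the F ratios of equation 10.90
for an invented table of measurements.

  ! Minimal sketch of the two-way crossed classification with one measurement
  ! per class (data invented): R = 3 rows (dates), C = 4 columns (technicians).
  program two_way_anova
    implicit none
    integer, parameter :: r = 3, c = 4
    real :: y(r,c), rowmean(r), colmean(c)
    real :: ybar, qrow, qcol, qres, frow, fcol
    integer :: i, j

    ! hypothetical measurements, filled column by column by reshape
    y = reshape( (/ 10.1, 10.0, 10.2,   &
                    10.3,  9.9, 10.1,   &
                    10.0, 10.2, 10.4,   &
                    10.2, 10.1, 10.3 /), (/ r, c /) )

    ybar = sum(y)/(r*c)
    do i = 1, r
       rowmean(i) = sum(y(i,:))/c             ! ybar_r.
    end do
    do j = 1, c
       colmean(j) = sum(y(:,j))/r             ! ybar_.c
    end do

    qrow = c*sum((rowmean - ybar)**2)         ! QR, (R-1) d.o.f.
    qcol = r*sum((colmean - ybar)**2)         ! QC, (C-1) d.o.f.
    qres = 0.0                                ! QW, (R-1)(C-1) d.o.f.
    do i = 1, r
       do j = 1, c
          qres = qres + (y(i,j) - rowmean(i) - colmean(j) + ybar)**2
       end do
    end do

    frow = (qrow/(r-1))/(qres/((r-1)*(c-1)))  ! F for H0R (equal row means)
    fcol = (qcol/(c-1))/(qres/((r-1)*(c-1)))  ! F for H0C (equal column means)
    print *, 'F(rows) =', frow, '   F(columns) =', fcol
  end program two_way_anova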
However, we will go just one step further: two-way crossed classification with
several, K, observations per class. We limit ourselves to the same number, K,
for all classes. It is now possible to generalize the model by allowing “interaction”
between the factors. The model is
        yrck = µ + θr + ωc + υrc ,     Σ_r θr = Σ_c ωc = Σ_r Σ_c υrc = 0     (10.94)

where k is the index specifying the observation within class rc.


In our example of different technicians and different dates, the variance among
technicians can now depend on the date. (On a day when a technician does not feel
well, the measurements might show more variation.)
The null hypothesis that all θr , ωc , and υrc are zero is equivalent to three
hypotheses all being true, namely H0R that all θr are zero, a similar H0C for columns,
and H0I that all υrc are zero. These three hypotheses can all be tested separately.
Here too, the procedure of equations 10.82-10.84 can be followed with the ad-
dition of a sum over k. The result is the partition of the sum of squares over four
terms:

        Two-way Crossed Classification

        Factor        SS                                              d.o.f.
        Row           QR = CK Σ_r (ȳr.. − ȳ)²                         R − 1
        Column        QC = RK Σ_c (ȳ.c. − ȳ)²                         C − 1
        Interaction   QI = K Σ_r Σ_c (ȳrc. − ȳr.. − ȳ.c. + ȳ)²        RC − R − C + 1
        Residual      QW = Σ_r Σ_c Σ_k (yrck − ȳrc.)²                 RC(K − 1)
        Total         Q  = Σ_r Σ_c Σ_k (yrck − ȳ)²                    RCK − 1

where the averages are, e.g.,


        ȳr.. = (1/(CK)) Σ_c Σ_k yrck ,      ȳrc. = (1/K) Σ_k yrck

F -tests can be constructed using QR , QC , and QI together with QW .


Part IV

Bibliography


1. R. J. Barlow, Statistics: A Guide to the Use of Statistical Methods in the Physical
   Sciences (Wiley, 1989)

2. Siegmund Brandt, Data Analysis: Statistical and Computational Methods for Scien-
tists and Engineers, Third edition (Springer 1999)

3. A. G. Frodesen, O. Skjeggestad, and H. Tøfte, Probability and Statistics in Particle
   Physics (Universitetsforlaget, Bergen, 1979)

4. W. T. Eadie et al., Statistical Methods in Experimental Physics (North-Holland, 1971)

5. Frederick James, Statistical Methods in Experimental Physics, 2nd edition (World
   Scientific, 2006)

6. G. P. Yost, Lectures on Probability and Statistics (Lawrence Berkeley Laboratory
   report LBL-16993, 1984)

7. Glen Cowan, Statistical Data Analysis (Oxford University Press, 1998)

8. Louis Lyons, Statistics for Nuclear and Particle Physicists (Cambridge University
Press, 1986)

9. Stuart L. Meyer, Data Analysis for Scientists and Engineers (Wiley, 1975)

10. Philip R. Bevington, Data Reduction and Error Analysis for the Physical Sciences
(McGraw-Hill, 1969)

11. M. G. Kendall and A. Stuart, The Advanced Theory of Statistics (Griffin, vol. I, 4th
ed., 1977; vol. II, 4th ed., 1979; vol. III, 3rd ed., 1976)

12. Alan Stuart and Keith Ord, Kendall’s Advanced Theory of Statistics, vol. 1, Distri-
bution Theory (Arnold, 1994).

13. Alan Stuart, Keith Ord, and Steven Arnold, Kendall’s Advanced Theory of Statistics,
vol. 2A, Classical Inference and the Linear Model (Arnold, 1999).

14. Anthony O’Hagan, Kendall’s Advanced Theory of Statistics, vol. 2B, Bayesian Infer-
ence (Arnold, 1999).

15. Harald Cramér, Mathematical Methods of Statistics (Princeton Univ. Press, 1946)

16. William H. Press, “Understanding Data Better with Bayesian and Global Statistical
Methods”, preprint astro-ph/9604126 (1996).

17. anon., Genesis 13.9

18. T. Bayes, “An essay towards solving a problem in the doctrine of chances”, Phil.
Trans. Roy. Soc., liii (1763) 370; reprinted in Biometrika 45 (1958) 293.

19. P. S. de Laplace, “Mémoire sur la probabilité des causes par les évenements”, Mem.
Acad. Sci. Paris 6 (1774) 621; Eng. transl.: “Memoir on the probability of the causes
of events” with an introduction by S. M. Stigler, Statist. Sci. 1 (1986) 359.

20. P. S. de Laplace, Théorie analytique des probabilités (Courcier Imprimeur, Paris,
    1812); 3rd edition with supplements (1820).

21. A. N. Kolmogorov, Foundations of the Theory of Probability (Chelsea, 1950)

22. T. Bayes, An introduction to the Doctrine of Fluxions, and a Defence of the Mathe-
maticians Against the Objections of the Author of the Analyst (1736)

23. Richard von Mises, Wahrscheinlichkeit, Statistik und Wahrheit (1928); reprinted as
Probability, Statistics, and Truth (Dover, 1957)

24. R. A. Fisher, Proc. Cambridge Phil. Soc. 26 (1930) 528.

25. L. von Bortkiewicz, Das Gesetz der kleinen Zahlen (Teubner, Leipzig, 1898)

26. Pierre Simon de Laplace, Histoire de l’Académie (1783) 423.

27. A. de Moivre, Approximatio ad summam terminorum binomii (a + b)^n in seriem
    expansi (1733). An English translation is included in The Doctrine of Chances (second
    edition, 1738, and third edition, 1756); the third edition has been reprinted (Chelsea
    Publ. Co., New York, 1967)

28. K. F. Gauß, Theoria motus corporum celestium (Perthes, Hamburg, 1809); Eng.
transl., Theory of the Motion of the Heavenly Bodies Moving About the Sun in
Conic Sections (Dover, New York, 1963)

29. “Student”, Biometrika 6 (1908) 1.

30. F. James, Rep. on Progress in Physics 43 (1980) 1145.

31. F. James, Computer Phys. Comm. 60 (1990) 329.

32. G. Buffon, “Essai d’arithmétique morale,” Histoire naturelle, générale et particulière,
    Supplément 4 (1777) 46.

33. R. Y. Rubinstein, Simulation and the Monte Carlo Method (Wiley, 1981)

34. International Organization for Standardization (ISO), Guide to the expression of
    uncertainty in measurement (Geneva, 1993)

35. C. R. Rao, Bull. Calcutta Math. Soc. 37 (1945) 81.

36. A. C. Aitken and H. Silverstone, Proc. Roy. Soc. Edin. A 61 (1942) 186.

37. H. Jeffreys, Theory of Probability (Oxford Univ. Press, 1961)

38. CERN program library, entry D506;
    F. James and M. Roos, Computer Phys. Comm. 10 (1975) 343.

39. Henry Margenau and George Mosely Murphy, The Mathematics of Physics and Chem-
istry (vol. I, Van Nostrand, 1956)

40. G. Forsythe, J. Soc. Indust. Appl. Math. 5 (1957) 74.

41. B. Efron, Ann. Statist. 7 (1979) 1.

42. B. Efron, The Jackknife, the Bootstrap, and Other Resampling Plans (S.I.A.M., 1982)

43. B. Efron and R.J. Tibshirani, An Introduction to the Bootstrap (Chapman & Hall
1993)

44. J.W. Tukey, Ann. Math. Stat. 29 (1958) 614.

45. J. Neyman, Phil. Trans. Roy. Soc. London, series A, 236 (1937) 333.

46. Particle Data Group: ‘Review of Particle Physics’, Phys. Rev. D 54 (1996) 1.

47. Particle Data Group: ‘Review of Particle Physics’, Eur. Phys. J. C 15 (2000) 1.

48. G. J. Feldman and R. D. Cousins, Phys. Rev. D 57 (1998) 3873.

49. O. Helene, Nucl. Instr. Methods 212 (1983) 319.

50. W. J. Metzger, ‘Upper limits’, Internal report University of Nijmegen HEN-297 (1988)

51. J. Neyman and E. S. Pearson, Phil. Trans. A 231 (1933) 289.

52. J. Neyman and E. S. Pearson, Biometrika A 20 (1928) 175 and 263.

53. K. Pearson, Phil. Mag. 5 (1900) 157.

54. T. W. Anderson and D. A. Darling, Ann. Math. Stat. 23 (1952) 193.

55. William H. Press et al., Numerical Recipes in FORTRAN90: The Art of Scientific
Computing, Numerical Recipes in FORTRAN77: The Art of Scientific Computing,
Numerical Recipes in C++: The Art of Scientific Computing, Numerical Recipes in
C: The Art of Scientific Computing, (Cambridge Univ. Press).

56. N. H. Kuiper, Proc. Koninklijke Nederlandse Akademie van Wetenschappen, ser. A
    63 (1962) 38.

57. M. A. Stephens, J. Roy. Stat. Soc., ser. B 32 (1970) 115.

58. D. A. Darling, Ann. Math. Stat. 28 (1957) 823.

59. J. R. Michael, Biometrika 70,1 (1983) 11.

60. J. Durbin, Biometrika 62 (1975) 5.


Part V

Exercises


1. In statistics we will see that the moments of the parent distribution can be
   ‘estimated’, or ‘measured’, by calculating the corresponding moment of the
   data, e.g., x̄ = (1/n) Σ xi gives an estimate of the mean µ and
   √[ (1/n) Σ (xi − x̄)² ] estimates σ, etc.

(a) Histogram the following data using a suitable bin size.


90 90 79 84 78 91 88 90 85 80
88 75 73 79 78 79 67 83 68 60
73 79 69 74 76 68 72 72 75 60
61 66 66 54 71 67 75 49 51 57
62 64 68 58 56 79 63 68 64 51
58 53 65 57 59 65 48 54 55 40
49 42 36 46 40 37 53 48 44 43
35 39 30 41 41 22 28 36 39 51
These data will be available in a file, which can be read, e.g., in FORTRAN
by
READ(11,’(10F4.0)’) X
where X is an array defined by REAL X(80).
(b) Estimate the mean, standard deviation, skewness, mode, median and
FWHM (full width at half maximum) using the data and using the his-
togram bin contents and the central values of the bins.

You may find the FORTRAN subroutine FLPSOR useful: CALL FLPSOR(X,N),
where N is the dimension, e.g., 80, of the array X. After calling this routine,
the order of the elements of X will be in ascending order.

2. Verify by making a histogram of 1000 random numbers that your random


number generator indeed gives an approximately uniform distribution in the
interval 0 to 1.
Make a two-dimensional histogram using successive pairs of random numbers
for the x and y coordinates. Does this two-dimensional distribution also
appear uniform? Calculate the correlation coefficient between x and y.

3. Let Xi , i = 1, 2, ..., n, be n independent r.v.’s uniformly distributed between


0 and 1, i.e., the p.d.f. is f (x) = 1 for 0 ≤ x ≤ 1 and f (x) = 0 otherwise.
Let Y be the maximum of the n Xi : Y = max(X1 , X2 , ..., Xn ). Derive
the p.d.f. for Y , g(y). Hint: What is the c.d.f. for Y ?

4. For two r.v.’s, x and y, show that

V [x + y] = V [x] + V [y] + 2 cov(x, y)



5. Show that the skewness can be written


        γ1 = ( E[x³] − 3 E[x] E[x²] + 2 E[x]³ ) / σ³

6. The Chebychev Inequality. Assume that the p.d.f. for the r.v. X has mean µ
and variance σ 2 . Show that for any positive number k, the probability that
x will differ from µ by more than k standard deviations is less than or equal
   to 1/k², i.e., that

        P(|x − µ| ≥ kσ) ≤ 1/k²
7. Show that | cov(x, y)| ≤ σx σy , i.e., that the correlation coefficient, ρx,y =
cov(x, y)/σx σy , is in the range −1 ≤ ρ ≤ 1 and that ρ = ±1 if and only
if x and y are linearly related.

8. A beam of mesons, composed of 90% pions and 10% kaons, hits a Čerenkov
counter. In principle the counter gives a signal for pions but not for kaons,
thereby identifying any particular meson. In practice it is 95% efficient at
giving a signal for pions, and also has a 6% probability of giving an accidental
signal for a kaon. If a meson gives a signal, what is the probability that the
particle was a pion? If there is no signal, what is the probability that it was
a kaon?

9. Mongolian swamp fever (MSF) is such a rare disease that a doctor only expects
to meet it once in 10000 patients. It always produces spots and acute lethargy
in a patient; usually (60% of cases) they suffer from a raging thirst, and
occasionally (20% of cases) from violent sneezes. These symptoms can arise
from other causes: specifically, of patients who do not have MSF, 3% have
spots, 10% are lethargic, 2% thirsty, and 5% complain of sneezing. These four
probabilities are independent.
Show that if you go to the doctor with all these symptoms, the probability
of your having MSF is 80%. What is the probability if you have all these
symptoms except sneezing?

10. Suppose that an antimissile system is 99.5% efficient in intercepting incoming


ballistic missiles. What is the probability that it will intercept all of 100
missiles launched against it? How many missiles must an aggressor launch to
have a better than even chance of one or more penetrating the defenses? How
many missiles would be needed to ensure a better than even chance of more
than two missiles evading the defenses?

11. A student is trying to hitch a lift. Cars pass at random intervals, at an average
rate of 1 per minute. The probability of a car giving a student a lift is 1%.
What is the probability that the student will still be waiting:

(a) after 60 cars have passed?


(b) after 1 hour?

12. Show that the characteristic function of the Poisson p.d.f.,

        P(r; µ) = µ^r e^{−µ} / r!

    is

        φ(t) = exp[ µ (e^{ıt} − 1) ]

    Use the characteristic function to prove the reproductive property of the
    Poisson p.d.f.

13. A single number often used to characterize an angular distribution is the
    forward-backward ratio, F/B, or the forward-backward asymmetry, F/N, where
F is the number of events with cos θ > 0, B is the number of events with
cos θ < 0, and N = F + B is the total number of events. Assume that the
events are independent and that the event rate is constant, for both forward
and backward events.
Clearly, only two of the three variables, F , B, N , are independent. We can
regard this situation in two ways:

(a) The number of events N is Poisson distributed with mean µ and they are
split into F and B = N − F following a binomial p.d.f., B(F ; N, pF ),
i.e., the independent variables are N and F .
(b) The F events and B events are both Poisson distributed (with param-
eters µF and µB ), and the total is just their sum, i.e., the independent
variables are F and B.

Show that both ways lead to the same p.d.f.

14. Show that the Poisson p.d.f. tends to a Gaussian with mean µ and variance
σ 2 = µ for large µ, i.e.,

P (r; µ) −→ N (r; µ, µ)

For µ = 5.3, what is the probability of 2 or less events? Approximating


the discrete Poisson by the continuous Gaussian p.d.f., ≤ 2 should be re-
garded as < 2.5, half way between 2 and 3. What is the probability in this
approximation?

15. For a Gaussian p.d.f.:

(a) What is the probability of a value lying more than 1.23σ from the mean?

(b) What is the probability of a value lying more than 2.43σ above the
mean?
(c) What is the probability of a value lying less than 1.09σ below the mean?
(d) What is the probability of a value lying above a point 0.45σ below the
mean?
(e) What is the probability that a value lies more than 0.5σ but less than
1.5σ from the mean?
(f) What is the probability that a value lies above 1.2σ on the low side of
the mean, and below 2.1σ on the high side?
(g) Within how many standard deviations does the probability of a value
occurring equal 50%?
(h) How many standard deviations correspond to a one-tailed probability of
99%?

16. During a meteor shower, meteors fall at the rate of 15.7 per hour. What is the
probability of observing less than 5 in a given period of 30 minutes? What
value do you find if you approximate the Poisson p.d.f. by a Gaussian p.d.f.?

17. Four values (3.9, 4.5, 5.5, 6.1) are drawn from a normal p.d.f. whose mean is
known to be 4.9. The variance of the p.d.f. is unknown.

(a) What is the probability that the next value drawn from the p.d.f. will
have a value greater than 7.3?
(b) What is the probability that the mean of three new values will be between
3.8 and 6.0?

18. Let x and y be two independent r.v.’s, each distributed uniformly between 0
and 1. Define z± = x ± y.

(a) How are z+ and z− distributed?


(b) What is the correlation between z+ and z− ; between z+ and y?

It will probably help your understanding of this situation to use Monte Carlo
to generate points uniform in x and y and to make a two-dimensional his-
togram of z+ vs. z− .

19. Derive the reproductive property of the Gaussian p.d.f., i.e., show that if
x and y are independent r.v.’s distributed normally as N (x; µx , σx2 ) and
N (y; µy , σy2 ), respectively, then z = x + y is also normally distributed as
N (z; µz , σz2 ). Show that µz = µx + µy and σz2 = σx2 + σy2 . Derive also
    the p.d.f. for z = x − y, for z = (x + y)/2, and for z = x̄ = Σ_{i=1}^{n} xi / n
when all the xi are normally distributed with the same mean and variance.

20. For the bivariate normal p.d.f. for x, y with correlation coefficient ρ, trans-
form to variables u, v such that the covariance matrix is diagonal and show
that
        σu² = (σx² cos²θ − σy² sin²θ) / (cos²θ − sin²θ)

        σv² = (σy² cos²θ − σx² sin²θ) / (cos²θ − sin²θ)

    where tan 2θ = 2ρσxσy / (σx² − σy²)

21. Show that for the bivariate normal p.d.f., the conditional p.d.f., f (y|x), is a
normal p.d.f. with mean and variance,
        E[y|x] = µy + ρ (σy/σx) (x − µx)    and    V[y|x] = σy² (1 − ρ²)

22. For a three-dimensional Gaussian p.d.f. the contours of constant probability


are ellipsoids defined by constant

        G = (x − µ)^T V⁻¹ (x − µ)

Find the probability that a point is within the ellipsoid defined by G = 1.


23. Given n independent variables, xi, distributed according to fi having mean,
    µi, and variance, Vi = σi², show that S = Σ xi has mean µS = E[S] = Σ µi
    and variance V[S] = Σ Vi = Σ σi². What are the expected value and variance
    of the average of the xi, x̄ = (1/n) Σ xi ?
24. Derive the reproductive property of the Cauchy p.d.f. Does the p.d.f. of the
sum of n independent, Cauchy-distributed r.v.’s, approach the normal p.d.f.
in the limit n → ∞?
25. Let x and y be independent r.v.’s, each distributed normally with mean 0
and variances σx2 and σy2 , respectively.
(a) Derive the p.d.f. of the r.v. z = x/y.
(b) Describe a method to generate random numbers distributed as a standard
Cauchy p.d.f. Try it.
26. (a) Show that for n independent r.v.’s, xi , uniformly distributed between 0
    and 1, the p.d.f. for

        g = ( Σ_{i=1}^{n} xi − n/2 ) / √(n/12)

    approaches N(g; 0, 1) for n → ∞.



(b) Demonstrate the result (a) by generating by Monte Carlo the distribution
of g for n = 1, 2, 3, 5, 10, 50 and comparing it to N (g; 0, 1).
(c) If the xi are uniformly distributed in the intervals [0.0, 0.2] and [0.8, 1.0],
    i.e.,
         f(x) = 1/0.4 ,   0.0 ≤ x ≤ 0.2 or 0.8 ≤ x ≤ 1.0
              = 0 ,       otherwise,

what distribution will g approach? Demonstrate this by Monte Carlo as


in (b).

27. Show that the weighting method used in the two-dimensional example of crude
Monte Carlo integration (sect. 6.2.5, eq. 6.5) is in fact an application of the
technique of importance sampling.
28. Perform the integral I = ∫₀¹ x³ dx by crude Monte Carlo using 100, 200,
400, and 800 points. Estimate not only I, but also the error on I. Does the
error decrease as expected with the number of points used?
Repeat the determination of I 50 times using 200 (different) points each time
and histogram the resulting values of I. Does the histogram have the shape
that you expect? Also evaluate the integral by the following methods and
compare the error on I with that obtained by crude Monte Carlo:

(a) using hit or miss Monte Carlo and 200 points.


(b) using crude Monte Carlo and stratification, dividing the integration re-
gion in two, (0,0.5) and (0.5,1), and using f · 200 points in (0,0.5) and
(1 − f ) · 200 points in (0.5,1), where f = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7,
0.8, and 0.9. Plot the error on I vs. f .
(c) as (b) but for f =0.5 with various intervals, (0,c) and (c,1), for c = 0.1,
0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9. Make a plot of the estimated
error on I vs. c.
(d) using crude Monte Carlo and antithetic variables x and (1 − x) and 200
points.
(e) as (d) but with only 100 points.
(f) using importance sampling with the function g(x) = x2 and 200 points.

29. Generate 20000 Monte Carlo points with x > 0 distributed according to the
    distribution

        f(x) = (1/2) [ (1/τ) e^{−x/τ} + (1/λ) e^{−x/λ} ]

for τ = 3 and λ = 10. Do this for (a) the weighting, (b) the rejection,
and (c) the composite methods using inverse transformations. Which method

is easiest to program? Which is fastest? Make histograms of the resulting


distribution in each case and verify that the distribution is correct.
If you can only detect events with 1 < x < 10, what fraction of the events
will you detect? Suppose in addition, that your detector has a detection
efficiency given by

        e = 0 ,           if x < 1 or x > 10
          = (x − 1)/9 ,   if 1 < x < 10

How can you arrive at a histogram for the x-distribution of the events you
detect? There are various methods. Which should be the best?

30. Generate 1000 points, xi , from the Gaussian p.d.f. N (x; 10, 52 ). Use each
of the following estimators to estimate the mean of X: sample mean, sample
median, and trimmed sample mean (10%).
Repeat assuming we only measure values of X in the interval (5,25), i.e. if
an xi is outside this range, throw it away and generate a new value.
Repeat this all 25 times, histogramming each estimation of the mean. From
these histograms determine the variance of each of the six estimators.

31. Under the assumptions that the range of the r.v. X is independent of the
    parameter θ and that the likelihood, L(x; θ), is regular enough to allow
    interchanging ∂²/∂θ² and ∫ dx, derive equation 8.23,

        Ix(θ) = −E[ ∂S(x; θ)/∂θ ]
32. Show that the estimator σ̂² = Σ (xi − µ)² / n is an efficient estimator of the
    variance of a Gaussian p.d.f. of known mean by showing that its variance is
    equal to I⁻¹.

33. Using the method of section 8.2.7, find an efficient and unbiased estimator for
σ 2 of a normal p.d.f. when µ is known and there is thus only one parameter
for the distribution.

34. We count the number of decays in a fixed time interval, T . We do this N


times yielding the results ni , i = 1, ..., N . The source is assumed to consist
of a large number of atoms having a long half-life. The data, ni , are therefore
assumed to be distributed according to a Poisson p.d.f., the parameter of
which can be estimated by µ̂ = n̄ (section 8.3.2). Suppose, however, that we
want instead to estimate the probability of observing no decays in the time
interval, T .

(a) What is the estimator in the frequency method of estimation?



(b) Derive a less biased estimator.


(c) Derive the variances of both the estimator and the less biased estimator.

35. (a) Derive equations 8.41 and 8.42, i.e., show that the variance of the r-th
        sample moment is given by

            V[ x̄^r ] = ( E[x^{2r}] − (E[x^r])² ) / n

        and that

            cov[ x̄^r, x̄^q ] = ( E[x^{r+q}] − E[x^r] E[x^q] ) / n

    (b) Derive an expression in terms of sample moments to estimate the
        variance, V[σ̂²], of the moments estimator of the parent variance,
        σ̂² = m̂2 − m̂1².

36. We estimate the values of x and y by their sample means, x̄ and ȳ, which
have variances σx2 and σy2 . The covariance is zero. We want to estimate the
values of r and θ which are related to x and y by
        r² = x² + y²    and    tan θ = y/x

Following the substitution method, what are r̂ and θ̂? Find the variances and
covariance of r̂ and θ̂.

37. We measure x = 10.0 ± 0.5 and y = 2.0 ± 0.5. What is then our estimate
of x/y? Use Monte Carlo to investigate the validity of the error propagation.

38. We measure cos θ and sin θ, both with standard deviation σ. What is the
ml estimator for θ? Compare with the results of exercise 36.

39. Decay times of radioactive atoms are described by an exponential p.d.f. (equa-
tion 3.10):
        f(t; τ) = (1/τ) e^{−t/τ}

(a) Having measured the times ti of n decays, how would you estimate τ
and the variance V [τ̂ ] (1) using the moments method and (2) using the
maximum likelihood method? Which method do you prefer? Why?
(b) Generate 100 Monte Carlo events according to this p.d.f. with τ = 10,
(cf. exercise 29) and calculate τ̂ and V [τ̂ ] using both the moments
and the maximum likelihood methods. Are the results consistent with
τ = 10? Which method do you prefer? Why?

(c) Use a minimization program, e.g., MINUIT, to find the maximum of the
likelihood function for the Monte Carlo events of (39b). Evaluate V [τ̂ ]
using both the second-derivative matrix and the variation of l by 1/2.
Compare the results for τ̂ and V [τ̂ ] with those of (39b).
(d) Repeat (39b) 1000 times making histograms of the value of τ̂ and of the
estimate of the error on τ̂ for each method. Do you prefer the moments
or the maximum likelihood expression for V [τ̂ ]? Why?
(e) Suppose that we can only detect times t < 10. What is then the
likelihood function? Use a minimization program to find the maximum
of the likelihood function and thus τ̂ and its variance. Does this value
agree with τ = 10?
(f) Repeat (39b) and (39e) with 10000 Monte Carlo events.
40. Verify that a least squares fit of independent measurements to the model
y = a + bx results in estimates for a and b given by
        â = ȳ − b̂ x̄     and     b̂ = ( \overline{xy} − x̄ ȳ ) / ( \overline{x²} − x̄² )
where the bar indicates a weighted sample average with weights given by
1/σi2 , as stated in section 8.5.5.
41. Use the method of least squares to derive formulae to estimate the value (and
its error), y ± δy, from a set of n measurements, yi ± δyi . Assume that
the yi are uncorrelated. Comment on the relationship between these formulae
and those derived from ml (equations 8.59 and 8.60).
42. Perform a least squares fit of a parabola
y(x) = θ1 + θ2 x + θ3 x2
for the four independent measurements: 5 ± 2, 3 ± 1, 5 ± 1, 8 ± 2 measured
at the points x = −0.6, −0.2, 0.2, 0.6, respectively. Determine not only the
θ̂i and their covariances, but also calculate the value of y and its uncertainty
at x = 1.
To invert a matrix you can use the routine RSINV:
CALL RSINV (N,A,N,IFAIL)
where A is a symmetric, positive matrix of dimension (N,N). If the matrix
inversion is successful, IFAIL is returned as 0.
43. The three angles of a triangle are independently measured to be 63◦ , 34◦ ,
and 85◦ , all with a resolution of 1◦ .
(a) Calculate the least squares estimate of the angles subject to the require-
ment that their sum be 180◦ .

(b) Calculate the covariance matrix of the estimators.

44. Generate events as in exercise 39b. Histogram the times ti and use the two
minimum chi-square methods and the binned maximum likelihood method to
estimate the lifetime τ . Use a minimization program, e.g., MINUIT, to find
the minima and maximum. Compare the results of these three methods and
those of exercise 39b.

45. In section 8.7.4 is a table comparing the efficiencies of various location es-
timators for various distributions. Generate 10000 random numbers from a
standard normal distribution and estimate the mean using each of the esti-
mators in the table. Repeat this 1000 times making histograms of the values
of each estimator. The standard deviation of these histograms is an estimate
of the standard deviation of the estimator. Are these in the ratio expected
from the table?

46. Consider a long-lived radioactive source.

(a) In our detector it produces 389 counts in the first minute and 423 counts
in the second minute. Assuming a 100% efficient detector, what is the
best estimation of the activity of the source?
(b) What can you say about the best value and uncertainty for the activity
of the source from the following set of independent measurements?

1.08 ± 0.13 , 1.04 ± 0.07 , 1.13 ± 0.10 Bq.

47. A current is determined by measuring the voltage V across a standard re-


sistor. The voltmeter has a resolution σV and a systematic error sV . We
measure two currents using the same resistor and voltmeter. Since the resis-
tance is unchanged between the measurements, we regard its uncertainty as
entirely systematic. Find the covariance matrix for the two currents, which
are calculated using Ohm’s law, Ii = Vi /R.

48. We measure a quantity X 25 times using an apparatus of unknown but con-


stant resolution. The average value of the measurements is x̄ = 128. The
    estimate of the variance is s² = (1/24) Σ (x − x̄)² = 225. What is the 95%
confidence interval on the true value, µ, of the quantity X?

49. You want to determine the probability, p, that a student passes the statis-
tics exam. Since there are only two possible outcomes, pass and fail, the
appropriate p.d.f. is binomial, B(k; N, p).

(a) Construct the confidence belt for a 95% central confidence interval for p
for the case that 10 students take the exam and k pass, i.e., draw k+ (p)
and k− (p) curves on a p vs. k plot.

(b) Assume that 8 of the 10 pass. Find the 95% central confidence interval
from the plot constructed in (a) and by solving equation 9.18.

50. An experiment studying the decay of the proton (an extremely rare process, if
it occurs at all) observes 7 events in 1 year for a sample of 106 kg of hydrogen.

(a) Assume that there is no background. Give a 90% central confidence


interval and a 90% upper limit for the expected number of proton decays
and from these calculate the corresponding interval and limit for the
mean lifetime of the proton.
(b) Repeat (a) assuming that background processes are expected to con-
tribute an average of 3 events per year.
(c) Repeat (a) assuming 8 expected background events per year.

51. Construct a most powerful (MP) test for one observation, x, for the hypothesis
that X is distributed as a Cauchy distribution,
        f(x) = 1 / ( π [ 1 + (x − θ)² ] )

with θ = 0 under H0 and θ = 1 under H1 . What is the size of the test if
you decide to reject H0 when x > 0.5?

52. Ten students each measure the mass of a sample, each with an error of 0.2 g:

10.2 10.4 9.8 10.5 9.9 9.8 10.3 10.1 10.3 9.9 g

(a) Test the hypothesis that they are all measurements of a sample whose
true mass is 10.1 g.
(b) Test the hypothesis that they are all measurements of the same sample.

53. On Feb. 23, 1987, the Irvine-Michigan-Brookhaven experiment was counting


neutrino interactions in their detector. The time that the detector was on
was split into ten-second intervals, and the number of neutrino interactions
in each interval was recorded. The number of intervals containing i events is
shown in the following table. There were no intervals containing more than 9
events.

Number of events 0 1 2 3 4 5 6 7 8 9
Number of intervals 1042 860 307 78 15 3 0 0 0 1

This date was also the date that astronomers first saw the supernova SN 1987A.

(a) Test the hypothesis that the data are described by a Poisson distribution.

(b) Test the hypothesis that the data are described by the sum of two Poisson
distributions, one for a signal of 9 events within one ten-second interval,
and another for the background of ordinary cosmic neutrinos.

54. Marks on an exam are distributed over male and female students as follows
(it is left to your own bias to decide which group is male):

Group 1 39 18 3 22 24 29 22 22 27 28 23 48
Group 2 42 23 36 35 38 42 33

Assume that test scores are normally distributed within each group.

(a) Assume that the variance of the scores of both groups is the same, and
test the hypothesis that the mean is also the same for both groups.
(b) Test the assumption that the variance of the scores of both groups is the
same.

55. The light transmission of crystals is degraded by ionizing radiation. Folklore,


and some qualitative physics arguments, suggest that it can be (partially)
restored by annealing. To test this the light transmission of 7 crystals, which
have been exposed to radiation, is measured. The crystals are then annealed,
and their light transmission again measured. The results:

Crystal 1 2 3 4 5 6 7
Before 29 30 42 34 37 45 32
After 36 26 46 36 40 51 33
difference 7 −4 4 2 3 6 1

Assume that the uncertainty in the measurement of the transmission is nor-


mally distributed.

(a) Test whether the light transmission has improved using only the mean
of the before and after measurements.
(b) Test whether the light transmission has improved making use of the
measurements per crystal, i.e., using the differences in transmission.

For the following exercises you will be assigned a file containing the data to be
used. It will consist of 3 numbers per event, which may be read, e.g., in FORTRAN
by

READ(11,’(I5)’) NEVENTS
READ(11,’(3F10.7)’) ((E(I,IEV),I=1,3),IEV=1,NEVENTS)

The data may be thought of as being the measurement of the radioactive decay
of a neutral particle at rest into two new particles, one positive and one negative,
with

E(1,IEV) = x, the mass of the decaying particle as determined from the en-
ergies of the decay products. The mass values have a certain
spread due to the resolution of our apparatus and/or the Heisen-
berg uncertainty principle (for a very short-lived particle).
E(2,IEV) = cos θ, the cosine of the polar angle of the positive decay particle’s
direction.
E(3,IEV) = φ/π, the azimuthal angle, divided by π, of the positive decay par-
ticle’s direction. Division by π results in a more convenient
quantity to histogram.
Assume that the decay is of a vector meson to two pseudo-scalar mesons. The decay
angular distribution is then given by
        f(cos θ, φ) = (3/4π) [ (1/2)(1 − ρ00) + (1/2)(3ρ00 − 1) cos²θ
                               − ρ1,−1 sin²θ cos 2φ − √2 Re ρ10 sin 2θ cos φ ]

A1. Use the moments method to estimate the mass of the particle and the decay
parameters ρ00 , ρ1,−1 , and Reρ10 . Also estimate the variance and standard
deviation of the p.d.f. for x. Estimate also the errors of all of the estimates.

A2. Use the maximum likelihood method to estimate the decay parameters ρ00 ,
ρ1,−1 , and Reρ10 using a program such as MINUIT to find the maximum of the
likelihood function. Determine the errors on the estimates using the variation
of the likelihood.

A3. Assume that x is distributed normally. Determine µ and σ using maximum


likelihood. Also determine the covariance matrix of the estimates.

A4. Assume that x is distributed normally. Determine µ and σ using both the
minimum χ2 and the binned maximum likelihood methods. Do this twice,
once with narrow and once with wide bins. Compare the estimates and their
covariance matrix obtained with these two methods with each other and with
that of the previous exercise.

A5. Test the assumption of vector meson decay against the hypothesis of decay of
a scalar meson, in which case the angular distribution must be isotropic.

For the following exercises you will be assigned a file containing the data to be
used. It is the same situation as in the previous exercises except that it is somewhat
more realistic, having some background to the signal.

B1. From an examination of histograms of the data, make some reasonable hypothe-
ses as to the nature of the background, i.e., propose some functional form for
the background, fb (x) and fb (cos θ, φ).

B2. Modify your likelihood function to include your hypothesis for the background,
and use the maximum likelihood method to estimate the decay parameters
ρ00 , ρ1,−1 , and Reρ10 as well as the fraction of signal events. Also determine
the position of the signal, µ, and its width, σ, under the assumption that the
signal x is normally distributed. Determine the errors on the estimates using
the variation of the likelihood.

B3. Develop a way to use the moments method to estimate, taking into account
the background, the decay parameters ρ00 , ρ1,−1 , and Reρ10 . Estimate also
the errors of the estimates.

B4. Determine the goodness-of-fit of the fits in the previous two exercises. There
are several goodness-of-fit tests which could be applied. Why did you choose
the one you did?
