Statistical Methods in Data Analysis - W. J. Metzger
HEN-343
March 13, 2018
NIJMEGEN
Contents

1 Introduction
1.1 Language
1.2 Computer usage
1.3 Some advice to the student

I Probability

2 Probability
2.1 First principles
2.1.1 Probability—What is it?
2.1.2 Sampling
2.1.3 Probability density function (p.d.f.)
2.1.4 Cumulative distribution function (c.d.f.)
2.1.5 Expectation values
2.1.6 Moments
2.2 More on Probability
2.2.1 Conditional Probability
2.2.2 More than one r.v.
2.2.3 Correlation
2.2.4 Dependence and Independence
2.2.5 Characteristic Function
2.2.6 Transformation of variables
2.2.7 Multidimensional p.d.f. – matrix notation
2.3 Bayes’ theorem
2.4 Probability—What is it?, revisited
2.4.1 Mathematical probability (Kolmogorov)
2.4.2 Empirical or Frequency interpretation (von Mises)
2.4.3 Subjective (Bayesian) probability
2.4.4 Are we frequentists or Bayesians?
3.2 Binomial distribution
3.3 Multinomial distribution
3.4 Poisson distribution
3.5 Exponential and Gamma distributions
3.6 Uniform distribution
3.7 Gaussian or Normal distribution
3.8 Log-Normal distribution
3.9 Multivariate Gaussian or Normal distribution
3.10 Binormal or Bivariate Normal p.d.f.
3.11 Cauchy (Breit-Wigner or Lorentzian) p.d.f.
3.12 The χ2 p.d.f.
3.13 Student’s t distribution
3.14 The F-distribution
3.15 Beta distribution
3.16 Double exponential (Laplace) distribution
3.17 Weibull distribution
4 Real p.d.f.’s
4.1 Complications in real life
4.2 Convolution

II Monte Carlo

6 Monte Carlo
6.1 Random number generators
6.1.1 True random number generators
6.1.2 Pseudo-random number generators
6.2 Monte Carlo integration
6.2.1 Crude Monte Carlo
6.2.2 Hit or Miss Monte Carlo
6.2.3 Buffon’s needle, a hit or miss example
6.2.4 Accuracy of Monte Carlo integration
6.2.5 A crude example in 2 dimensions
6.2.6 Variance reducing techniques
6.3 Monte Carlo simulation
6.3.1 Weighted events
6.3.2 Rejection method
6.3.3 Inverse transformation method
6.3.4 Composite method
6.3.5 Example
6.3.6 Gaussian generator

III Statistics

7 Statistics—What is it/are they?
8 Parameter estimation
8.1 Introduction
8.2 Properties of estimators
8.2.1 Bias
8.2.2 Consistency
8.2.3 Variance, efficiency
8.2.4 Interpretation of the Variance
8.2.5 Information and Likelihood
8.2.6 Minimum Variance Bound
8.2.7 Efficient estimators—the Exponential family
8.2.8 Sufficient statistics
8.3 Substitution methods
8.3.1 Frequency substitution
8.3.2 Method of Moments
8.3.3 Descriptive statistics
8.3.4 Generalized method of moments
8.3.5 Variance of moments
8.3.6 Transformation of the covariance matrix under a change of parameters
8.4 Maximum Likelihood method
8.4.1 Principle of Maximum Likelihood
8.4.2 Asymptotic properties
8.4.3 Change of parameters
8.4.4 Maximum Likelihood vs. Bayesian inference
8.4.5 Variance of maximum likelihood estimators
8.4.6 Summary
8.4.7 Extended Maximum Likelihood
8.4.8 Constrained parameters
8.5 Least Squares method
8.5.1 Introduction
8.5.2 The Linear Model
8.5.3 Derivative formulation
8.5.4 Gauss-Markov Theorem
8.5.5 Examples
8.5.6 Constraints in the linear model
8.5.7 Improved measurements through constraints
8.5.8 Linear Model with errors in both x and y
8.5.9 Non-linear Models
8.5.10 Summary
8.6 Estimators for binned data
8.6.1 Minimum Chi-Square
8.6.2 Binned maximum likelihood
8.6.3 Comparison of the methods
8.7 Practical considerations
8.7.1 Choice of estimator
8.7.2 Bias reduction
8.7.3 Variance of estimators—Jackknife and Bootstrap
8.7.4 Robust estimation
8.7.5 Detection efficiency and Weights
8.7.6 Systematic errors
10.3.5 Distribution-free tests
10.3.6 Choice of a test
10.4 Parametric tests
10.4.1 Simple Hypotheses
10.4.2 Simple H0 and composite H1
10.4.3 Composite hypotheses—same parametric family
10.4.4 Composite hypotheses—different parametric families
10.5 And if we are Bayesian?
10.6 Goodness-of-fit tests
10.6.1 Confidence level or p-value
10.6.2 Relation between Confidence level and Confidence Intervals
10.6.3 The χ2 test
10.6.4 Use of the likelihood function
10.6.5 Binned data
10.6.6 Run test
10.6.7 Tests free of binning
10.6.8 But use your eyes!
10.7 Non-parametric tests
10.7.1 Tests of independence
10.7.2 Tests of randomness
10.7.3 Two-sample tests
10.7.4 Two-Gaussian-sample tests
10.7.5 Analysis of Variance

IV Bibliography

V Exercises
They say that Understanding ought to work by the rules of right
reason. These rules are, or ought to be, contained in Logic; but
the actual science of logic is conversant at present only with things
either certain, impossible, or entirely doubtful, none of which (for-
tunately) we have to reason on. Therefore the true logic for this
world is the calculus of Probabilities, which takes account of the
magnitude of the probability which is, or ought to be, in a reason-
able man’s mind.
—J. Clerk Maxwell
Chapter 1
Introduction
It is sometimes said that statistics is more an art than a science, and there is a grain of truth in it. Although there are standard
approaches, most of the time there is no “best” solution to a given problem. Our
most common tasks for statistics fall into two categories: parameter estimation and
hypothesis testing.
In parameter estimation we want to determine the value of some parameter in a
model or theory. For example, we observe that the force between two charges varies
with the distance r between them. We make a theory that F ∼ r^{−α} and want to
determine the value of α from experiment.
In hypothesis testing we have an hypothesis and we want to test whether that
hypothesis is true or not. An example is the Fermi theory of β-decay which predicts
the form of the electron’s energy spectrum. We want to know whether that is
correct. Of course we will not be able to give an absolute yes or no answer. We
will only be able to say how confident we are, e.g., 95%, that the theory is correct,
or rather that the theory predicts the correct shape of the energy spectrum. Here
the meaning of the 95% confidence is that if the theory is correct, and if we were
to perform the experiment many times, 95% of the experiments would appear to
agree with the theory and 5% would not.
Parameter estimation and hypothesis testing are not completely separate topics.
It is obviously nonsense to estimate a parameter if the theory containing the pa-
rameter does not agree with the data. Also the theory we want to test may contain
parameters; the test then is whether values for the parameters exist which allow
the theory to agree with the data.
Although the main subject of this course is statistics, it should be clear that
an understanding of statistics requires understanding probability. We will begin
therefore with probability. Having had probability, it seems only natural to also
treat, though perhaps briefly, Monte Carlo methods, particularly as they are often
useful not only in the design and understanding of an experiment but also can be
used to develop and test our understanding of probability and statistics.
There are a great many books on statistics. They vary greatly in content and
intended audience. Notation is by no means standard. In preparing these lectures I
have relied heavily on the following sources (sometimes to the extent of essentially
copying large sections):
• R. J. Barlow [1], a recent text book in the Manchester series. Most of what you need to know is in this book, although the level is perhaps a bit low. Nevertheless (or perhaps therefore), it is a pleasure to read.
• G. P. Yost [6], the lecture notes for a course at Imperial College, London. They are somewhat short on explanation.
• Glen Cowan [7], a recent book at a level similar to these lectures. In fact, had this book been available I probably would have used it rather than writing these notes.
Other books of general interest are those of Lyons [8], Meyer [9], and Bevington [10]. A comprehensive reference for almost all of probability and statistics is the three-volume work by Kendall and Stuart [11]. Since the death of Kendall, volumes 1 and 2 (now called 2a) are being kept up to date by others [12, 13], and a volume (2b) on Bayesian statistics has been added [14]. Volume 3 has been split into several small books, “Kendall’s Library of Statistics”, covering many specialized topics. Another classic of less encyclopedic scope is the one-volume book by Cramér [15].
1.1 Language
Statistics, like physics, has its own specialized terminology, with words whose meaning differs from the meaning in everyday use or the meaning in physics. An example
is the word estimate. In statistics “estimate” is used where the physicist would say
“determine” or “measure”, as in parameter estimation. The physicist or indeed
ordinary people tend to use “estimate” to mean little more than “guess” as in “I
would estimate that this room is about 8 meters wide.” We will generally use the
statisticians’ word.
Much of statistics has been developed in connection with population studies
(sociology, medicine, agriculture, etc.) and industrial quality control. One cannot
study the entire population; so one “draws a sample”. But the population exists.
In experimental physics the set of all measurements (or observations) forms the
“sample”. If we make more measurements we increase the size of the sample, but
we can never attain the “population”. The population does not really exist but is
an underlying abstraction. For us some terminology of the statisticians is therefore
rather inappropriate. We therefore sometimes make substitutions like the following:
We will just say “mean” when it is clear from the context whether we are referring
to the parent or the sample mean.
immediately useful, but which should put the student in a better position to go
beyond what is included in this course, as will almost certainly be necessary at
some time in his career. Further, we will point out the assumptions underlying,
and the limitations of, various techniques.
A major difficulty for the student is the diversity of the questions statistical tech-
niques are supposed to answer, which results in a plethora of methods. Moreover,
there is seldom a single “correct” method, and deciding which method is “best” is
not always straightforward, even after you have decided what you mean by “best”.
A further complication arises from what we mean by “probability”. There are
two major interpretations, “frequentist” (or “classical”) and “Bayesian” (or “sub-
jective”), which leads to two different ways to do statistics. While the emphasis
will be on the classical approach, some effort will go into the Bayesian approach as
well.
While there are many questions and many techniques, they are related. In order
to see the relationships, the student is strongly advised not to fall behind.
Finally, some advice to astronomers which is equally valid for physicists:
Whatever your choice of area, make the choice to live your professional
life at a high level of statistical sophistication, and not at the level—
basically freshman lab. level—that is the unfortunate common currency
of most astronomers. Thereby we will all move forward together.
—William H. Press [16]
Part I
Probability
“Probability theory is nothing but common sense reduced to calculation.”
—P.-S. de Laplace, “Mécanique Céleste”
Chapter 2
Probability
We can also be more abstract. Intuitively, probability must have the following
properties. Let Ω be the set of all possible outcomes.
Axioms:
1. P (Ω) = 1    (the experiment must have an outcome)
2. 0 ≤ P (E) ,   E ∈ Ω
3. P (∪i Ei ) = Σi P (Ei ) ,   for any set of disjoint Ei ∈ Ω    (axiom of countable additivity)
It is straightforward to derive the following theorems:
1. P (E) = 1 − P (E ∗ ), where Ω = E ∪ E ∗ , E and E ∗ disjoint.
2. P (E) ≤ 1
2.1.2 Sampling
We restrict ourselves to experiments where the outcome is one or more real numbers,
Xi . Repetition of the experiment will not always yield the same outcome. This
could be due to an inability to reproduce exactly the initial conditions and/or to
a probabilistic nature of the process under study, e.g., radioactive decay. The Xi
are therefore called random variables (r.v.), i.e., variables whose values cannot
be predicted exactly. Note that the word ‘random’ in the term ‘random variable’
does not mean that the allowed values of Xi are equiprobable, contrary to its use
in everyday speech. The set of possible values of Xi , which we have denoted Ω, is
called the sample space. A r.v. can be
• discrete: The sample space Ω is a set of discrete points. Examples are the
result of a throw of a die, the sex of a child (F=1, M=2), the age (in years)
of students studying statistics, names of people (Marieke=507, Piet=846).
• The sample space is the set of all possible results of the experiment (the
sampling). Identical results are represented by only one member of the set.
The members of the population are equiprobable while the members of the sample
space are not necessarily equiprobable. The sample reflects the population which is
derived from the sample space according to some probability distribution, usually
called the parent (or underlying) probability distribution.
• the throw of a die. The sample space is Ω = {1, 2, 3, 4, 5, 6}. The probability distribution is P (1) = P (2) = P (3) = P (4) = P (5) = P (6) = 1/6, which gives a parent population of {1, 2, 3, 4, 5, 6}. An example of an experimental result is X = 3.
• the throw of a die having sides marked with one 1, two 2’s, and three 3’s. The sample space is Ω = {1, 2, 3}. The probability distribution is P (1) = 1/6, P (2) = 1/3, P (3) = 1/2. The parent population is {1, 2, 2, 3, 3, 3}. An experimental result is X = 3 (maybe).
In the discrete case we have a probability function, f (x), which is greater than zero for each value of x in Ω. From the axioms of probability,
Σ_Ω f (x) = 1
P (A) ≡ P (X ∈ A) = Σ_A f (x) ,   A ⊂ Ω
For a continuous r.v., the probability of any exact value is zero since there are
an infinite number of possible values. Therefore it is only meaningful to talk of the
probability that the outcome of the experiment, X, will be in a certain interval.
f (x) is then a probability density function (p.d.f.) such that
P (x ≤ X ≤ x + dx) = f (x) dx ,   ∫_Ω f (x) dx = 1    (2.2)
Since most quantities of interest to us are continuous we will usually only treat
the continuous case unless the corresponding treatment of the discrete case is not
obvious. Usually going from the continuous to the discrete case is simply the re-
placement of integrals by sums. We will also use the term p.d.f. for f (x) although
in the discrete case it is really a probability rather than a probability density. Some
authors use the term ‘probability law’ instead of p.d.f., thus avoiding the mislead-
ing (actually wrong) use of the word ‘density’ in the discrete case. However, such
use of the word ‘law’ is misleading to a physicist, cf. Newton’s second law, law of
conservation of energy, etc.
2.1.6 Moments
Moments are certain special expectation values. The mth moment is defined (think of the moment of inertia) as
E [x^m] = ∫_{−∞}^{+∞} x^m f (x) dx    (2.7)
The moment is said to exist if it is finite. The most commonly used moment is the (population or parent) mean,
µ ≡ E [x] = ∫_{−∞}^{+∞} x f (x) dx    (2.8)
The mean is often a good measure of location, i.e., it frequently tells roughly where
the most probable region is, but not always.
[Figure: two example p.d.f.’s with the mean µ indicated; in the first the mean lies near the most probable region, in the second it does not.]
In statistics we will see that the sample mean, x̄, the average of the result of a
number of experiments, can be used to estimate the parent mean, µ, the mean of
the underlying p.d.f.
Central moments are moments about the mean. The mth central moment is defined as
E [(x − µ)^m] = ∫_{−∞}^{+∞} (x − µ)^m f (x) dx    (2.9)
If µ is finite, the first central moment is clearly zero. If f (x) is symmetric about its
mean, all odd central moments are zero.
The second central moment is called the variance. It is denoted by V [x], σx², or just σ².
σx² ≡ V [x] ≡ E [(x − µ)²]    (2.10)
            = E [x²] − µ²    (2.11)
The square root of the variance, σ, is called the standard deviation. It is a measure
of the spread of the p.d.f. about its mean.
Since all symmetrical distributions have all odd central moments zero, the odd
central moments provide a measure of the asymmetry. The first central moment is
zero. The third central moment is thus the lowest order odd central moment. One
makes it dimensionless by dividing by σ³ and defining the skewness as
γ1 ≡ E [(x − µ)³] / σ³    (2.12)
This is the definition of Fisher, which is the most common. However, be aware that other definitions exist, e.g., the Pearson skewness,
β1 ≡ [ E [(x − µ)³] / σ³ ]² = γ1²    (2.13)
The sharpness of the peaking of the p.d.f. is measured by the kurtosis (also spelled curtosis). There are two common definitions, the Pearson kurtosis,
β2 ≡ E [(x − µ)⁴] / σ⁴    (2.14)
and the Fisher kurtosis,
γ2 ≡ E [(x − µ)⁴] / σ⁴ − 3 = β2 − 3    (2.15)
The −3 makes γ2 = 0 for a Gaussian. For this reason it is somewhat more convenient, and is the definition we shall use. A p.d.f. with γ2 > 0 (< 0) is called leptokurtic (platykurtic) and is more (less) sharply peaked than a Gaussian, having higher (lower) tails.
Moments are often normalized in some other way than we have done with γ1 and γ2 , e.g., with the corresponding power of µ:
ck ≡ E [x^k] / µ^k ;    rk ≡ E [(x − µ)^k] / µ^k    (2.16)
It can be shown that if all central moments exist, the distribution is completely
characterized by them. In statistics we can estimate each parent moment by its
sample moment (cf. section 8.3.2) and so, in principle, reconstruct the p.d.f.
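As a concrete illustration of estimating parent moments by sample moments, the following minimal sketch (Python, assuming NumPy is available; the function name and the Gaussian test sample are ours, not from the notes) computes the sample mean, variance, skewness γ1 and kurtosis γ2 :

    import numpy as np

    def sample_moments(x):
        """Estimate mean, variance, skewness (gamma_1) and kurtosis (gamma_2)
        of a sample by replacing parent moments with sample moments."""
        x = np.asarray(x, dtype=float)
        mu = x.mean()                      # sample mean
        m2 = ((x - mu) ** 2).mean()        # second central moment
        m3 = ((x - mu) ** 3).mean()        # third central moment
        m4 = ((x - mu) ** 4).mean()        # fourth central moment
        gamma1 = m3 / m2 ** 1.5            # Fisher skewness, eq. (2.12)
        gamma2 = m4 / m2 ** 2 - 3.0        # Fisher kurtosis, eq. (2.15); 0 for a Gaussian
        return mu, m2, gamma1, gamma2

    # Example: for a large Gaussian sample, gamma1 and gamma2 should be near 0.
    rng = np.random.default_rng(1)
    print(sample_moments(rng.normal(loc=2.0, scale=3.0, size=100_000)))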
Other attributes of a p.d.f.:
• mode: The location of a maximum of f (x). A p.d.f. can be multimodal.
• median: That value of x for which the c.d.f. F (x) = 1/2. The median is not always well defined, since there can be more than one such value of x.
[Figure: two examples of a c.d.f. F (x) and the corresponding p.d.f. f (x). In the first the value of x with F (x) = 1/2 is unique; in the second there is a whole interval of medians.]
2.2 More on Probability

2.2.1 Conditional Probability

Restricting the outcomes to an event A1 that has occurred changes the probabilities; they must be renormalized such that
P (A | A) = 1    (renormalization)
P (A2 | A1 ) = P (A1 ∩ A2 | A1 )
[Figure: Venn diagram of the sample space Ω containing the overlapping events A1 and A2.]
While the probability changes with the restriction, ratios of probabilities must not:
P (A1 ∩ A2 | A1 ) / P (A1 | A1 ) = P (A1 ∩ A2 ) / P (A1 )
These requirements are met by the definition, assuming P (A1 ) > 0,
P (A2 | A1 ) ≡ P (A1 ∩ A2 ) / P (A1 )    (2.17)
If P (A1 ) = 0, P (A2 | A1 ) makes no sense. Nevertheless, for completeness we define P (A2 | A1 ) = 0 if P (A1 ) = 0.
It can be shown that the conditional probability satisfies the axioms of probability.
It follows from the definition that
P (A1 ∩ A2 ) = P (A2 | A1 ) P (A1 )
If P (A2 | A1 ) is the same for all A1 , i.e., A1 and A2 are independent, then
P (A2 | A1 ) = P (A2 )    and    P (A1 ∩ A2 ) = P (A1 ) P (A2 )
Marginal p.d.f.
The marginal p.d.f. is the p.d.f. of just one of the r.v.’s; all dependence on the other r.v.’s of the joint p.d.f. is integrated out:
f1 (x1 ) = ∫_{−∞}^{+∞} f (x1 , x2 ) dx2    (2.20)
f2 (x2 ) = ∫_{−∞}^{+∞} f (x1 , x2 ) dx1    (2.21)
Conditional p.d.f.
Suppose that there are two r.v.’s, X1 and X2 , and a space of events Ω.
[Figure: the region Ω in the (X1 , X2 ) plane; fixing X1 = x1 selects the values of X2 allowed for that x1 .]
Choosing a value x1 of X1 restricts the possible values of X2 . Assuming f1 (x1 ) > 0, then f (x2 | x1 ) is a p.d.f. of X2 given X1 = x1 .
In the discrete case, from the definition of conditional probability (eq. 2.17), we have
f (x2 | x1 ) ≡ P (X2 = x2 | X1 = x1 ) = P (X2 = x2 ∩ X1 = x1 ) / P (X1 = x1 ) = f (x1 , x2 ) / f1 (x1 )
In general,
f (x2 | x1 ) = f (x1 , x2 ) / f1 (x1 )    (2.22)
Note that this conditional p.d.f. is a function of only one r.v., x2 , since x1 is fixed. Of course, a different choice of x1 would give a different function. A conditional probability is then obviously calculated as
P (a < X2 < b | X1 = x1 ) = ∫_a^b f (x2 | x1 ) dx2    (2.23)
For example, with four r.v.’s,
f (x2 , x4 | x1 , x3 ) = f (x1 , x2 , x3 , x4 ) / f13 (x1 , x3 )
where f13 (x1 , x3 ) = ∫∫ f (x1 , x2 , x3 , x4 ) dx2 dx4
2.2.3 Correlation
When an experiment results in more than one real number, i.e., when we are con-
cerned with more than one r.v. and hence the p.d.f. is of more than one dimension,
the r.v.’s may not be independent. Here are some examples:
proton will have more than 3/4 of the available energy is zero. The energies
of the electron and the proton are not independent. They are constrained by
the law of energy conservation.
Given a two-dimensional p.d.f., f (x, y) (the generalization to more dimensions is straightforward), the mean and variance of X, µX and σX², are given by
µX = E [X] = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} x f (x, y) dx dy ,    σX² = E [(X − µX )²]
A measure of the dependence of X on Y is given by the covariance, defined as
cov(X, Y ) ≡ E [(X − µX )(Y − µY )]    (2.25)
           = E [XY ] − µY E [X] − µX E [Y ] + µX µY
           = E [XY ] − µX µY    (2.26)
From the covariance we define a dimensionless quantity, the correlation coefficient,
ρXY ≡ cov(X, Y ) / (σX σY )    (2.27)
If σX = 0, then X ≡ µX and consequently E [XY ] = µX E [Y ] = µX µY , which means that cov(X, Y ) = 0. In this case the above definition would leave ρ indeterminate, and we define ρXY = 0.
It can be shown that ρ2 ≤ 1, the equality holding if and only if X and Y are
linearly related. The proof is left to the reader (exercise 7).
Note that while the mean and the standard deviation scale, the correlation coefficient is scale invariant, e.g.,
ρ_{2X,Y} = cov(2X, Y ) / (σ_{2X} σY ) = 2 cov(X, Y ) / (2σX σY ) = ρ_{X,Y}
The correlation coefficient ρXY is a measure of how much the variables X and Y
depend on each other. It is most useful when the contours of constant probability
density, f (x, y) = k, are roughly elliptical, but not so useful when these contours
have strange shapes:
[Figure: three sketches of contours of f (x, y) illustrating ρ > 0, ρ < 0, and ρ ≈ 0.]
In the last case, even though X and Y are clearly related, ρ ≈ 0. This can be seen as follows:
E [(X − µX ) | y] = ∫ (x − µX ) f (x | y) dx = ∫ (x − µX ) f (x, y)/fY (y) dx = 0    for all y
Hence
cov(X, Y ) = E [(X − µX )(Y − µY )] = ∫ (y − µY ) [ ∫ (x − µX ) f (x, y) dx ] dy = 0
since the inner integral vanishes for all y. Consequently, ρXY = 0.
[Figure: a rotation of the axes from (X, Y ), where ρ > 0, to (X′, Y′), where ρ = 0.]
In fact, it is always possible (also in n dimensions) to remove the correlation by a
change of variables (cf. section 2.2.7).
The correlation coefficient, ρ, measures the average linear change in the marginal
p.d.f. of one variable for a specified change in the other variable. This can be 0 even
when the variables clearly depend on each other. This occurs when a change in one
variable produces a change in the marginal p.d.f. of the other variable but no change
in its average, only in its shape. Thus zero correlation does not imply independence.
Now suppose that f (x2 | x1 ) does not depend on x1 , i.e., is the same for all x1 . Then
f2 (x2 ) = ∫ f (x2 | x1 ) f1 (x1 ) dx1 = f (x2 | x1 ) ∫ f1 (x1 ) dx1 = f (x2 | x1 )
since f1 is normalized to 1. Substituting this in (2.28) gives
f (x1 , x2 ) = f1 (x1 ) f2 (x2 )
The joint p.d.f. is then just the product of the marginal p.d.f.’s. We take this as
the definition of independence:
r.v.’s X1 and X2 are independent ≡ f (x1 , x2 ) = f1 (x1 )f2 (x2 )
r.v.’s X1 and X2 are dependent ≡ f (x1 , x2 ) 6= f1 (x1 )f2 (x2 )
Note that g and h do not have to be the marginal p.d.f.’s; the only requirement
is that their product equal the product of the marginals.
Theorem: If X1 and X2 are independent r.v.’s with marginal p.d.f.’s f1 (x1 ) and f2 (x2 ), then for functions u(x1 ) and v(x2 ), assuming all expectations exist,
E [u(x1 ) v(x2 )] = E [u(x1 )] E [v(x2 )]
Proof: From the definition of expectation, and since X1 and X2 are independent, f (x1 , x2 ) = f1 (x1 ) f2 (x2 ), so
E [u(x1 ) v(x2 )] = ∫∫ u(x1 ) v(x2 ) f (x1 , x2 ) dx1 dx2 = [ ∫ u(x1 ) f1 (x1 ) dx1 ] [ ∫ v(x2 ) f2 (x2 ) dx2 ] = E [u(x1 )] E [v(x2 )]
Some authors prefer, especially for discrete r.v.’s, to use the probability generating function instead of the characteristic function. It is in fact the same thing, just replacing e^{ıt} by z:
G(z) = E [z^x] = ∫_{−∞}^{+∞} z^x f (x) dx    (continuous r.v.)
     = Σ_k z^{x_k} f (x_k )    (discrete r.v.)
The moments are then found by differentiating with respect to z and evaluating at z = 1:
G′(1) = dG(z)/dz |_{z=1} = ∫_{−∞}^{+∞} x z^{x−1} f (x) dx |_{z=1} = E [x]
G′′(1) = d²G(z)/dz² |_{z=1} = ∫_{−∞}^{+∞} x(x − 1) z^{x−2} f (x) dx |_{z=1} = E [x(x − 1)] = E [x²] − E [x]
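For example (a quick check of the formulae above, not worked out in the text), for the Poisson distribution of section 3.4 with mean µ,
G(z) = Σ_{r=0}^{∞} z^r e^{−µ} µ^r / r! = e^{µ(z−1)}
so that G′(1) = µ = E [r] and G′′(1) = µ² = E [r(r − 1)], giving E [r²] = µ² + µ and hence V [r] = E [r²] − (E [r])² = µ.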
Continuous p.d.f.
Given r.v.’s X1 , X2 from a p.d.f. f (x1 , x2 ) defined on a set A, we transform (X1 , X2 )
to (Y1 , Y2 ). Under this transformation the set A maps onto the set B.
[Figure: the transformation maps the region A of the (X1 , X2 ) plane onto the region B of the (Y1 , Y2 ) plane; a subset a of A maps onto a subset b of B.]
Let the transformation and its inverse be
y1 = u1 (x1 , x2 )        x1 = w1 (y1 , y2 )
y2 = u2 (x1 , x2 )        x2 = w2 (y1 , y2 )
and assume that the transformation is one-to-one. (Actually the condition of one-to-one can be relaxed in some cases.) Assume also that all first derivatives of w1 and w2 exist. Then
P (a) = P (b)
∫∫_a f (x1 , x2 ) dx1 dx2 = ∫∫_b f (w1 (y1 , y2 ), w2 (y1 , y2 )) |J| dy1 dy2
where J is the Jacobian determinant (assumed known from calculus) and the absolute value is taken to ensure that the probability is positive,
J = J( (w1 , w2 ) / (y1 , y2 ) ) = | ∂w1/∂y1   ∂w2/∂y1 |
                                  | ∂w1/∂y2   ∂w2/∂y2 |    (2.33)
Hence the p.d.f. in (Y1 , Y2 ) is the p.d.f. in (X1 , X2 ) times the Jacobian:
g(y1 , y2 ) = f (w1 (y1 , y2 ), w2 (y1 , y2 )) |J|
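A standard example (not taken from these notes): transform two independent standard normal r.v.’s to polar coordinates, x1 = r cos φ, x2 = r sin φ, so that w1 = r cos φ, w2 = r sin φ. The Jacobian is
J = | cos φ      sin φ   |
    | −r sin φ   r cos φ | = r
and therefore
g(r, φ) = f (x1 , x2 ) |J| = (1/2π) e^{−r²/2} r ,    r ≥ 0 ,  0 ≤ φ < 2π
i.e., φ is uniform and r follows the p.d.f. r e^{−r²/2}. This underlies one common Gaussian generator (cf. section 6.3.6).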
Discrete p.d.f.
This is actually easier, since we can take the subsets a and b to contain just one
point. Then
µ2 = µ010...0
The variances and covariances may be written as a matrix, called the covariance (or variance) matrix:
V = E [(x − µ)(x − µ)^T] = | σ11  σ12  ...  σ1n |
                           | σ21  σ22  ...  σ2n |    (2.36)
                           | ...  ...  ...  ... |
                           | σn1  σn2  ...  σnn |
                         = | σ1²         ρ12 σ1 σ2   ...  ρ1n σ1 σn |
                           | ρ12 σ1 σ2   σ2²         ...  ρ2n σ2 σn |    (2.37)
                           | ...         ...         ...  ...       |
                           | ρ1n σ1 σn   ρ2n σ2 σn   ...  σn²       |
where ρij is the correlation coefficient for r.v.’s xi and xj :
ρij ≡ σij / (σi σj ) = cov(xi , xj ) / √(σi² σj²)    (2.38)
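A minimal numerical sketch (Python, assuming NumPy; the matrix is purely illustrative, not from the text) of extracting the standard deviations and correlation coefficients of eq. (2.38) from a covariance matrix:

    import numpy as np

    # An illustrative 3x3 covariance matrix (hypothetical numbers).
    V = np.array([[4.0, 1.2, -0.6],
                  [1.2, 1.0,  0.3],
                  [-0.6, 0.3, 2.25]])

    sigma = np.sqrt(np.diag(V))           # standard deviations sigma_i
    rho = V / np.outer(sigma, sigma)      # rho_ij = V_ij / (sigma_i sigma_j), eq. (2.38)
    print(sigma)
    print(rho)                            # diagonal elements are 1 by construction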
∗ Sometimes called the chain rule of probability, this theorem was first formulated by Rev. Bayes around 1759. The exact date is not known; the paper was published posthumously by his good friend Richard Price in 1763. Bayes’ formulation was only for P (A) uniform. The theorem was formulated in its present form by Laplace [19], who was apparently unaware of Bayes’ work. Laplace went on to apply it [20] to problems in celestial mechanics, medical statistics and even, according to some accounts, to jurisprudence.
• P (xi ) is not just a property of the experiment. It also depends on the “collec-
tive” or “ensemble”, i.e., on the N repetitions of the experiment. For example,
if I take a resistor out of a box of resistors, the probability that I measure the
resistance of the resistor as 1 ohm depends not only on how the resistor was
made, but also on how all the other resistors in the box were made.
• The experiment must be repeatable, under identical conditions, but with dif-
ferent outcomes possible. This is a great restriction on the number of situa-
tions in which we can use the concept of probability. For example, what is
the probability that it will rain tomorrow? Such a question is meaningless for
the frequentists, since the experiment cannot be repeated!
P (A | B) P (B) = P (B | A) P (A)
Then
P (theory | result) = [ P (result | theory) / P (result) ] P (theory)
Here, P (theory) is our “belief” in the theory before doing the experiment, P (result |
theory) is the probability of getting the result if the theory is true, P (result) is the
probability of getting the result irrespective of whether the theory is true or not,
and P (theory | result) is our belief in the theory after having obtained the result.
This seems to make sense. We see that if the theory predicts the result with
high probability, i.e., P (result | theory) big, then P (theory | result), i.e., your belief
in the theory after the result, will be higher than it was before, P (theory), and vice
versa. However, if the result is likely even if the theory is not true, then your belief in the theory will not increase by very much, since then P (result | theory)/P (result) is not much greater than 1.
Suppose we want to determine some parameter of nature, λ, by doing an ex-
periment which has outcome Z. Further, suppose we know the conditional p.d.f. to
get Z given λ: f (z | λ). Our prior, i.e., before we do the experiment, belief about
λ is given by Pprior (λ). Now the probability of z, P (z), is just the marginal p.d.f.: f1 (z) = Σ_{λ′} f (z | λ′ ) Pprior (λ′ ). Then by Bayes’ theorem,
Pposterior (λ | z) = [ f (z | λ) / f1 (z) ] Pprior (λ)    (2.41)
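A minimal sketch of this updating rule for a discrete set of values of λ (Python, assuming NumPy; the flat prior and the Poisson choice for f (z | λ) are only illustrative assumptions):

    import numpy as np
    from math import exp, factorial

    # Possible values of the parameter lambda and a flat prior over them.
    lambdas = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    prior = np.full(len(lambdas), 1.0 / len(lambdas))

    def likelihood(z, lam):
        """f(z | lambda): here, as an illustration, a Poisson probability of observing z events."""
        return exp(-lam) * lam**z / factorial(z)

    z_observed = 4
    f_z_given_lam = np.array([likelihood(z_observed, lam) for lam in lambdas])
    f1_z = np.sum(f_z_given_lam * prior)        # marginal probability of z (denominator of eq. 2.41)
    posterior = f_z_given_lam * prior / f1_z    # Bayes' theorem, eq. (2.41)
    print(posterior, posterior.sum())           # posterior beliefs, summing to 1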
Then by Bayes’ theorem, the probability that the true mass has the value me after we have measured a value m is
P (me | m) = [ P (m | me ) / P (m) ] Pprior (me )
           ∝ P (m | me )    assuming Pprior (me ) = const.
           ∝ e^{−(m−me )²/2σ²}
∗ Fisher [24], introducing his prescription for confidence intervals, had this scathing comment on
Chapter 3

Some special distributions

3.1 Bernoulli distribution

A Bernoulli trial is an experiment with just two possible outcomes:

    outcome              probability
    ‘success’, k = 1     p
    ‘failure’, k = 0     q = 1 − p

The p.d.f. is
f (k; p) = p^k q^{1−k}    (3.1)
Note that we use a semicolon to separate the r.v. k from the parameter of the distribution, p. This p.d.f. results in the moments and central moments:
E [k^m] = 1^m · p + 0^m · (1 − p) = p
E [(k − µ)^m] = (1 − p)^m p + (0 − p)^m (1 − p)    (the two terms corresponding to k = 1 and k = 0)
In particular,
µ = p
V [k] = E [k²] − (E [k])² = p − p² = p(1 − p)
g(z1 , z2 ) = f (z1 − z2 , z2 ) = C(nz1 − nz2 , z1 − z2 ) C(nz2 , z2 ) p^{z1} (1 − p)^{nz1 − z1}
where C(n, k) denotes the binomial coefficient. The p.d.f. for Z1 = X + Y is the marginal of this. Hence we must sum over z2 :
g1 (z1 ) = Σ_{z2} g(z1 , z2 ) = p^{z1} (1 − p)^{nz1 − z1} Σ_{z2} C(nz1 − nz2 , z1 − z2 ) C(nz2 , z2 )
For normalization, the sum must be just C(nz1 , z1 ). Thus g1 is also a binomial p.d.f.:
µi = E [ki ] = npi
σi2 = V [ki ] = npi (1 − pi )
Now
n!/(n − r)! = n(n − 1)(n − 2) · · · (n − r + 1)    (r terms)  ≈ n^r    since n ≫ r
and
(1 − λ/n)^{n−r} ≈ (1 − λ/n)^n → e^{−λ}    as n → ∞
so that in this limit
P (r; λ) = e^{−λ} λ^r / r!
The mean is
µ = E [r] = Σ_{r=0}^{∞} r e^{−λ} λ^r / r! = λ e^{−λ} Σ_{r=1}^{∞} λ^{r−1}/(r − 1)! = λ e^{−λ} Σ_{r′=0}^{∞} λ^{r′}/r′!    (r′ = r − 1)
           = λ Σ_{r′=0}^{∞} P (r′; λ) = λ
Writing the mean as µ, the Poisson p.d.f. is
P (r; µ) = e^{−µ} µ^r / r!    (3.5)
It gives the probability of getting r events if the expected number (mean) is µ. Further, you can easily show that the variance is equal to the mean: V [r] = µ.
Other properties:
γ1 = E [(r − µ)³]/σ³ = µ/µ^{3/2} = 1/√µ    (skewness)
γ2 = E [(r − µ)⁴]/σ⁴ − 3 = (3µ² + µ)/µ² − 3 = 1/µ    (kurtosis)
φ(t) = Σ_{r=0}^{∞} e^{ıtr} P (r; µ) = e^{−µ} Σ_{r=0}^{∞} (µe^{ıt})^r/r! = e^{−µ} exp(µe^{ıt}) = exp[µ(e^{ıt} − 1)]    (characteristic function)
From the skewness we see that the p.d.f. becomes more symmetric as µ increases.
When calculating a series of Poisson probabilities, one can make use of the recurrence formula P (r + 1) = [µ/(r + 1)] P (r).
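A short sketch of this recurrence (Python; a minimal example of ours, not from the notes), which avoids computing large factorials explicitly:

    import math

    def poisson_probabilities(mu, r_max):
        """Return P(0), P(1), ..., P(r_max) for a Poisson distribution with mean mu,
        using the recurrence P(r+1) = mu/(r+1) * P(r)."""
        probs = [math.exp(-mu)]            # P(0) = e^{-mu}
        for r in range(r_max):
            probs.append(probs[-1] * mu / (r + 1))
        return probs

    # Example: mu = 0.61 and 200 corps-years, as in the horse-kick discussion of the text.
    print([round(200 * p, 1) for p in poisson_probabilities(0.61, 4)])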
Reproductive property
The Poisson p.d.f. has a reproductive property: For independent r.v.’s X and Y ,
both Poisson distributed, the joint p.d.f. is
new variables:  Z1 = X + Y ,  Z2 = Y ;     inverse transformation:  X = Z1 − Z2 ,  Y = Z2
The p.d.f. of the sum of two Poisson distributed random variables is also Poisson
with µ equal to the sum of the µ’s of the individual Poissons. This can also be
easily shown using the characteristic function (exercise 12).
Examples
The Poisson p.d.f. is applicable when
• Thus the number of raisins per unit volume in raisin bread should be Poisson
distributed. The baker has mixed the dough thoroughly so that the raisins do
not stick together (independent) and are evenly distributed (constant event
rate).
• However, the number of zebras per unit area is not Poisson distributed (even
in those parts of the world where there are wild zebras), since zebras live in
herds and are thus not independently distributed.
Assuming that the death rate is constant over the 20 year period and inde-
pendent of corps and that the deaths are independent (not all caused by one
killer horse) then the deaths should be Poisson distributed: the probability of
k deaths in one particular corps in one year is P (k; µ). Since the mean of P
is µ, we take the experimental average as an ‘estimate’ of µ. The distribution
should then be P (k; 0.61) and we should expect 200 × P (k; 0.61) occurrences
of k deaths in one year in one corps. The ‘experimental’ distribution agrees very well with the Poisson p.d.f. The reader can verify that the experimental variance, estimated by (1/N) Σ (ki − k̄)², is 0.608, very close to the mean (0.610), as expected for a Poisson distribution.
However, if the rate of the basic process is not constant, the distribution may not
be Poisson, e.g.,
• The radioactive decay over a period of time significant compared with the
lifetime of the source.
In the first two examples the event rate decreases with time, in the third with
position. In the last two there is the further restriction that the number of events is
significantly restricted, as it can not exceed the number of atoms or beam particles,
while for the Poisson distribution the number extends to infinity.
• The number of people who die each year while operating a computer is also
not Poisson distributed. Although the probability of dying while operating
a computer may be constant, the number of people operating computers in-
creases each year. The event rate is thus not constant.
The Poisson p.d.f. requires that the events be independent. Consider the case
of a counter with a dead time of 1 µsec. This means that if a second particle
passes through the counter within 1 µsec after one which was recorded, the counter
is incapable of recording the second particle. Thus the detection of a particle is
not independent of the detection of other particles. If the particle flux is low, the
chance of a second particle within the dead time is so small that it can be neglected.
However, if the flux is high it cannot be. No matter how high the flux, the counter
cannot count more than 106 particles per second. In high fluxes, the number of
particles detected in some time interval will not be Poisson distributed.
Interchanging the order of the differentiation and the integration of the expectation
operator yields
dE [n]/dt = −n p
Identifying the actual number with its expectation,
dn/dt = −n p    ⟹    n = n0 e^{−pt}    (3.9)
Thus the number of undecayed atoms falls exponentially. From this it follows that
the p.d.f. for the distribution of individual decay times (lifetimes) is exponential:
Exponential p.d.f.: Let f (t) be the p.d.f. for an individual atom to decay at time t. The probability that it decays before time t is then F (t) = ∫_0^t f (t′) dt′. The expected number of decays in time t is
E [n0 − n] = n0 F (t) = n0 ∫_0^t f (t′) dt′
Substituting for E [n] from equation 3.9 and differentiating results in the exponential p.d.f.:
f (t; t0 ) = (1/t0 ) e^{−t/t0}    (3.10)
which gives, e.g., the probability that an individual atom will decay in time t. Note that this is a continuous p.d.f.
Properties:
µ = E [t] = t0          γ1 = 2
σ² = V [t] = t0²        γ2 = 6
φ(x) = [1 − ıxt0 ]^{−1}
Since we could start timing at any point, in particular at the time of the first
event, f (t) is the p.d.f. for the time of the second event. Thus the p.d.f. of the time
interval between decays is also exponential. This is the special case of k = 1 of the
following situation:
Let us find the distribution of the time t for k atoms to decay. The r.v. T = Σ_{i=1}^{k} ti is the sum of the time intervals between k successive atoms. The ti are independent. The c.d.f. for t is then just the probability that at least k atoms decay in time t:
F (t) = P (T ≤ t) = 1 − P (T > t)
Since the decays are Poisson distributed, the probability of m decays in the interval t is
P (m) = (λt)^m e^{−λt} / m!
where λ = 1/t0 , and t0 is the mean lifetime of an atom. The probability of fewer than k decays is then
P (T > t) = Σ_{m=0}^{k−1} (λt)^m e^{−λt} / m! = ∫_{λt}^{∞} z^{k−1} e^{−z} / (k − 1)! dz
(The replacement of the sum by the integral can be found in any good book of integrals.) Substituting the gamma function, Γ(k) = (k − 1)!, the c.d.f. becomes
F (t) = 1 − ∫_{λt}^{∞} z^{k−1} e^{−z} / Γ(k) dz = ∫_0^{λt} z^{k−1} e^{−z} / Γ(k) dz
Changing variables, y = z/λ,
F (t) = ∫_0^{t} λ^k y^{k−1} e^{−λy} / Γ(k) dy
The p.d.f. is then
f (t; k, λ) = dF/dt = λ^k t^{k−1} e^{−λt} / Γ(k) ,   t > 0    (3.11)
which is called the gamma distribution. Some properties of this p.d.f. are
µ = E [t] = k/λ          γ1 = 2/√k
σ² = V [t] = k/λ²        γ2 = 6/k
φ(x) = [1 − ıx/λ]^{−k}
Note that the exponential distribution, f (t; 1, λ) = λe^{−λt}, is the special case of the gamma distribution for k = 1. The exponential distribution is also a special case of the Weibull distribution (section 3.17).
Although in the above derivation k is an integer, the gamma distribution is, in fact, more general: k does not have to be an integer. For λ = 1/2 and k = n/2, the gamma distribution reduces to the χ²(n) distribution (section 3.12).
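A small Monte Carlo sketch of this result (Python with NumPy; it assumes inverse-transform sampling of the exponential, which is treated later in section 6.3.3): the sum of k independent exponential waiting times should have mean k/λ and variance k/λ², as for the gamma distribution.

    import numpy as np

    rng = np.random.default_rng(42)
    lam, k, n_exp = 2.0, 5, 200_000

    # Exponential waiting times by inverse transform: t = -ln(u)/lambda, u uniform on (0,1).
    u = rng.random((n_exp, k))
    t_sum = (-np.log(1.0 - u) / lam).sum(axis=1)   # sum of k waiting times

    print(t_sum.mean(), k / lam)           # should be close to k/lambda
    print(t_sum.var(),  k / lam**2)        # should be close to k/lambda^2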
Some books use the notation N(x; µ, σ). The Gaussian distribution is symmetric
about µ, and σ is a measure of its width.
We name this distribution after Gauß, but in fact many people discovered it and investigated its properties independently. The French name it after Laplace, who had noted [26] its main properties when Gauß was only six years old. The first known reference to it, before Laplace was born, is by the Englishman A. de Moivre in 1733 [27], who, however, did not realize its importance and made no use of it. Its importance in probability and statistics (cf. section 8.5) awaited Gauß [28] (1809).
The origin of the name ‘normal’ is unclear. It certainly does not mean that other distributions are abnormal.
Properties: (The first two justify the notation used for the two parameters of the Gaussian.)
µ = E [x] = µ    (mean)
σ² = V [x] = σ²    (variance)
γ1 = γ2 = 0    (skewness and kurtosis)
E [(x − µ)^n] = 0 for n odd ;  = (n − 1)!! σ^n = n! σ^n / (2^{n/2} (n/2)!) for n even    (central moments)
where a!! ≡ 1 · 3 · 5 · · · a
φ(t) = exp(ıtµ − ½ t² σ²)    (characteristic function)
N(z; 0, 1) = (1/√2π) e^{−z²/2}    (3.15)
Its cumulative distribution defines the error function as used in these notes,
erf(z) ≡ (1/√2π) ∫_{−∞}^{z} e^{−x²/2} dx    (3.16)
Mathematical libraries usually use a different definition of the error function; it is that definition which is used by the FORTRAN library function ERF(Z). Our definition (3.16) is related to it by
erf(z) = 1/2 + (1/2) ERF(z/√2)    (3.18)
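The same relation can be used in Python (a minimal sketch, assuming that math.erf implements the conventional library definition, as the FORTRAN ERF does):

    from math import erf, sqrt

    def gauss_cdf(z):
        """P(X <= z) for a standard normal r.v., i.e. the 'erf' of eq. (3.16),
        expressed through the library error function as in eq. (3.18)."""
        return 0.5 + 0.5 * erf(z / sqrt(2.0))

    print(gauss_cdf(1.0))   # about 0.8413
    print(gauss_cdf(0.0))   # 0.5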
[Figure: diagram of limiting relationships among distributions: the multinomial M(k; p, N) of dimension m, the normal N(x; µ, σ²), Student’s t f (t; N), and related distributions, with the limits under which one goes over into another (N → ∞, µ → ∞, m = 2, ν1 → ∞, ν2 → ∞, N = ν2 , etc.) indicated on the arrows.]
Reproductive property
Since the Poisson p.d.f. has the reproductive property and since the Gaussian p.d.f. is a limit of the Poisson, it should not surprise us that the Gaussian is also reproductive: If X and Y are two independent r.v.’s distributed as N(x; µx , σx²) and N(y; µy , σy²), then Z = X + Y is distributed as N(z; µz , σz²) with µz = µx + µy and σz² = σx² + σy². The proof is left as an exercise (exercise 19).
exp[ −½ (x − µ)^T A (x − µ) ] = exp[ −½ ( (x1 − µ1)²/σ1² + (x2 − µ2)²/σ2² + ... ) ] = exp[ −(x1 − µ1)²/2σ1² ] exp[ −(x2 − µ2)²/2σ2² ] · · ·
The p.d.f. is thus just the product of n 1-dimensional Gaussians. Thus all ρij = 0 implies that xi and xj are independent. As we have seen (sect. 2.2.4), this is not true of all p.d.f.’s.
These same substitutions are applicable for many (not all) cases of generalization
from 1 to n dimensions, as we might expect since the Gaussian p.d.f. is so often a
limiting case.
p.d.f.:   N(x; µ, V ) = [ 1 / ((2π)^{n/2} |V|^{1/2}) ] exp[ −½ (x − µ)^T V^{−1} (x − µ) ]    (3.20)
mean:   E [x] = µ
covariance:   cov(x) = V ,   V [xi ] = Vii ,   cov(xi , xj ) = Vij
characteristic function:   φ(t) = exp[ ı t^T µ − ½ t^T V t ]
Other interesting properties:
• Contours of constant probability density are given by
(x − µ)T V −1 (x − µ) = C , a constant
• Any section through the distribution, e.g., at xi = const., gives again a mul-
tivariate normal p.d.f. It has dimension n − 1. For the case xi = const., the
covariance matrix Vn−1 is obtained by removing the ith row and column from
V −1 and inverting the resulting submatrix.
• Any projection onto a lower space gives a marginal p.d.f. which is a multi-
variate normal p.d.f. with covariance matrix obtained by deleting appropriate
rows and columns of V . In particular, the marginal distribution of xi is
fi (xi ) = N(xi ; µi , σi2 )
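A short sketch (Python with NumPy; the mean vector and covariance matrix are illustrative) of sampling from a multivariate normal and checking the covariance matrix and the marginal of one component:

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([1.0, -2.0])
    V = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

    x = rng.multivariate_normal(mu, V, size=100_000)
    print(np.cov(x, rowvar=False))          # should be close to V
    print(x[:, 0].mean(), x[:, 0].var())    # marginal of x1: close to mu_1 = 1 and V_11 = 2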
We will now examine a special case of the multivariate normal p.d.f., that for two
dimensions.
Contours of constant probability density are given by setting the exponent equal to
a constant. These are ellipses, called covariance ellipses.
For G = 1, the extreme values of the ellipse are at µx ± σx and µy ± σy . The larger the correlation, the thinner is the ellipse, approaching 0 as |ρ| → 1. (Of course, in the limit of ρ = ±1, G is infinite and we really have just 1 dimension.)
[Figure: the covariance ellipse in the (x, y) plane, centred at (µx , µy ) and inscribed in the rectangle µx ± σx , µy ± σy ; θ is the angle of the major axis.]
The orientation of the major axis of the ellipse is given by
tan 2θ = 2ρσx σy / (σx² − σy²)
Note that θ = ±45° only if σx² = σy² ;   θ = 0 if ρ = 0.
In calculating θ by taking the arctangent of the above equation, one must be careful of quadrants. If the arctangent function is defined to lie between −π/2 and π/2, then θ is the angle of the major axis if σx > σy ; otherwise it is the angle of the minor axis. It is therefore more convenient to use an arctangent function defined between −π and π, such as the ATAN2(y,x) of some languages: 2θ = ATAN2(2ρσx σy , σx² − σy²).
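A one-line check in Python (math.atan2 plays the role of the ATAN2 mentioned above; the numbers are illustrative):

    from math import atan2, degrees

    sigma_x, sigma_y, rho = 2.0, 1.0, 0.8
    two_theta = atan2(2.0 * rho * sigma_x * sigma_y, sigma_x**2 - sigma_y**2)
    theta = 0.5 * two_theta            # angle of the major axis of the covariance ellipse
    print(degrees(theta))              # about 23.4 degrees for these values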
Some values:

    2-dimensional   k   1-dimensional               2 × 1-dimensional
    P (G ≤ k)           P (µ − kσ ≤ x ≤ µ + kσ)     P (µx − kσx ≤ x ≤ µx + kσx) ×
                                                    P (µy − kσy ≤ y ≤ µy + kσy)
    0.3934693       1   0.6826895                   0.466065
    0.6321206       2   0.9544997                   0.911070
    0.7768698       3   0.9973002                   0.994608
    0.8646647       4   0.9999367                   0.999873
    0.9179150       5   0.9999994                   0.999999
    0.9502129       6
Note that the 2-dimensional probability for a given k is much less than the cor-
responding 1-dimensional probability. This is easily understood: the product of
the two 1-dimensional probabilities is the probability that (x, y) is in the rectangle
defined by µx − kσx ≤ x ≤ µx + kσx and µy − kσy ≤ y ≤ µy + kσy . The ellipse is
entirely within this rectangle and hence the probability of being within the ellipse
is less than the probability of being within the rectangle.
Since the covariance matrix is symmetric, there exists a unitary transformation which diagonalizes it. In two dimensions this is the rotation matrix U,
U = | cos θ   −sin θ |
    | sin θ    cos θ |
This matrix transforms (x, y) to (u, v):
(u, v)^T = U (x, y)^T
[Figure: the (u, v) axes, rotated by θ with respect to the (x, y) axes, are the principal axes of the covariance ellipse; σu and σv are the standard deviations along them, σx and σy along the original axes.]
∗ We will see in sect. 3.12 that G is a so-called χ² r.v. P (G ≤ k) can therefore also be found from the c.d.f. of the χ² distribution, tables of which, as well as computer routines, are readily available.
The standard deviations of u, v are then found from the transformed covariance matrix. After some algebra we find
σu² = σx² σy² (1 − ρ²) / (σy² cos²θ − ρσx σy sin 2θ + σx² sin²θ)    (3.24a)
σv² = σx² σy² (1 − ρ²) / (σx² cos²θ + ρσx σy sin 2θ + σy² sin²θ)    (3.24b)
Note that if ρ is fairly large, i.e., the ellipse is thin, just knowing σx and σy would give a very wrong impression of how close a point (x, y) is to (µx , µy ).
The properties stated at the end of the previous section, regarding the conditional and marginal distributions of the multivariate normal p.d.f., can be easily verified for the bivariate normal. In particular, the marginal p.d.f. is
All higher moments are also divergent, and no such trick will allow us to define
them. In actual physical problems the distribution is truncated, e.g., by energy
conservation, and the resulting distribution is well-behaved.
The characteristic function of the Cauchy p.d.f. is
φ(t) = e^{−α|t| + ıµt}
The reproductive property of the Cauchy p.d.f. is rather unusual: x̄ = (1/n) Σ xi is distributed according to the identical Cauchy p.d.f. as are the xi . (The proof is left as an exercise.)
χ2 with 1 d.o.f.
For example, for n = 1, letting z = (x − µ)/σ, the p.d.f. for z is N(z; 0, 1) and the
probability that z ≤ Z ≤ z + dz is
1 1 2
f (z) dz = √ e− 2 z dz
2π
Let Q = Z 2 . (We use Q here instead of χ2 to emphasize that this is the variable.)
This is not a one-to-one transformation; both +Z and −Z go into +Q.
[Figure: the transformation Q = Z²; both +|Z| and −|Z| map onto the same value Q.]
The probability that Q is between q and q + dq is the sum of the probability that Z is between z and z + dz around z = +√q and the probability that Z is between z and z − dz around z = −√q. Therefore, we must add the p.d.f.’s obtained from the two branches of the inverse transformation:
J± = d(±z)/dq = ±1/(2√q)
f (q) dq = (1/√2π) e^{−q/2} (|J+| + |J−|) dq = (1/√2π) e^{−q/2} [ 1/(2√q) + 1/(2√q) ] dq = (1/√(2πq)) e^{−q/2} dq
Hence,
χ²(1) = f (χ²; 1) = (1/√(2πχ²)) e^{−χ²/2}    (3.31)
It may be confusing to use the same symbol, χ2 , for both the r.v. and its p.d.f., but
that’s life!
χ² with 3 d.o.f.
For n = 3, the joint p.d.f. of three independent standard normal variables z1 , z2 , z3 depends only on R² = z1² + z2² + z3²:
g(z1 , z2 , z3 ) dz1 dz2 dz3 = (2π)^{−3/2} e^{−R²/2} dz1 dz2 dz3
so that, integrating over the angles,
f (R) dR = (2/√2π) R² e^{−R²/2} dR
Now χ² = R². Hence dχ² = 2R dR and dR = dχ²/(2√χ²). Hence,
f (χ²; 3) dχ² = (2/√2π) χ² e^{−χ²/2} dχ²/(2√χ²)
χ²(3) = f (χ²; 3) = [ (χ²)^{1/2}/√2π ] e^{−χ²/2}    (3.32)
Properties:
Reproductive property: Let χi² be a set of variables which are distributed as χ²(ni ). Then Σ χi² is distributed as χ²(Σ ni ). This is obvious from the definition of χ²: the variables χ1² and χ2² are, by definition,
χ1²(n1 ) = Σ_{i=1}^{n1} zi²    and    χ2²(n2 ) = Σ_{i=n1+1}^{n1+n2} zi²
General definition of χ2
If the n Gaussian variables are not independent, we can change variables such that
the covariance matrix is diagonalized. Since this is a unitary transformation, it
does not change the covariance ellipse G = k. In the diagonal case G ≡ χ2 . Hence,
χ2 = G also in the correlated case. Thus we can take
χ2 = (x − µ)T V −1 (x − µ) (3.36)
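A minimal sketch (Python with NumPy; the numbers are purely illustrative) of evaluating this quadratic form for correlated Gaussian variables:

    import numpy as np

    mu = np.array([1.0, 2.0])
    V = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
    x = np.array([1.8, 1.1])

    d = x - mu
    chi2 = d @ np.linalg.solve(V, d)    # (x - mu)^T V^{-1} (x - mu), eq. (3.36)
    print(chi2)                         # distributed as chi^2 with 2 d.o.f. if x ~ N(mu, V)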
σ̂² = (1/N) Σ_{i=1}^{N} (xi − µ)² ,    using µ    (3.37a)
σ̂² = (1/(N − 1)) Σ_{i=1}^{N} (xi − x̄)² ,    using x̄ = (1/N) Σ xi    (3.37b)
t = (x − µ)/σ̂ = [(x − µ)/σ] / √((nσ̂²/σ²)/n) = z / √(χ²/n)    (3.38)
Now z is a standard normal r.v. and χ² is a χ²(n). A Student’s t r.v. is thus the ratio of a standard normal r.v. to the square root of a reduced χ² r.v. The joint p.d.f. for z and χ² is then (equation 3.33)
f (z, χ²; n) dz dχ² = N(z; 0, 1) χ²(χ²; n) dz dχ² = [ e^{−z²/2}/√2π ] [ (χ²)^{n/2−1} e^{−χ²/2} / (Γ(n/2) 2^{n/2}) ] dz dχ²
where we have assumed that z and χ² are independent. This is certainly so if the x has not been used in determining σ̂, and asymptotically so if n is large. Making a change of variable, we transform this distribution to one for t and χ²:
f (t, χ²; n) dt dχ² = [ 1 / (√(2πn) Γ(n/2) 2^{n/2}) ] (χ²)^{(n−1)/2} e^{−(χ²/2)(1 + t²/n)} dt dχ²
Integrating this over all χ², we arrive finally at the p.d.f. for t, called Student’s t distribution,
t(n) = f (t; n) = (1/√(πn)) [ Γ((n+1)/2) / Γ(n/2) ] (1 + t²/n)^{−(n+1)/2}    (3.39)
Properties:
mean:   µ = E [t] = 0 ,   n > 1
variance:   V [t] = σt² = n/(n − 2) ,   n > 2
skewness:   γ1 = 0
kurtosis:   γ2 = 6/(n − 4) ,   n > 4
moments:   µr = n^{r/2} Γ((r+1)/2) Γ((n−r)/2) / [ Γ(1/2) Γ(n/2) ] ,   r even and r < n
           µr = 0 ,   r odd and r ≤ n
           µr does not exist otherwise.
Student’s t distribution is thus the p.d.f. of a r.v., t, which is the ratio of a standard normal variable and the square root of a normalized χ² r.v., i.e., √(χ²(n)/n), of n degrees of freedom. It was discovered [29] by W. S. Gossett, a brewer at Guinness, who published under the pseudonym ‘Student’.∗
[Figure: t(t; n) for n = 1, 2, 3, 5, 10 and n = ∞; as n increases, the distribution approaches the standard normal.]
∗ In order to prevent competitors from learning about procedures at Guinness, it was the policy of the brewery that its employees did not publish under their own names.
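A small Monte Carlo sketch (Python with NumPy; a minimal example of ours) of this construction, building t as the ratio of a standard normal variable to √(χ²(n)/n) and checking the variance n/(n − 2):

    import numpy as np

    rng = np.random.default_rng(3)
    n, n_exp = 5, 500_000

    z = rng.standard_normal(n_exp)
    chi2 = (rng.standard_normal((n_exp, n)) ** 2).sum(axis=1)   # chi^2 with n d.o.f.
    t = z / np.sqrt(chi2 / n)

    print(t.var(), n / (n - 2))   # variance of Student's t is n/(n-2) for n > 2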
which is a monotonic function of F and has a beta distribution (cf. section 3.15).
f (x; n, m) = [ Γ(n + m) / (Γ(n)Γ(m)) ] x^{m−1} (1 − x)^{n−1} ,   0 ≤ x ≤ 1    (3.44)
            = 0 ,   otherwise
Properties:
mean:   µ = E [x] = m/(m + n)
variance:   V [x] = mn / [ (m + n)² (m + n + 1) ]
The distribution takes its name from the beta function,
β(y, z) = Γ(y)Γ(z)/Γ(y + z) = ∫_0^1 x^{y−1} (1 − x)^{z−1} dx ,   real y, z > 0
to which it is related, and from which the normalization of the p.d.f. is easily derived.
f (x; µ, λ) = (λ/2) exp(−λ|x − µ|)    (3.45)
Properties:
mean:   E [x] = µ
variance:   V [x] = 2/λ²
skewness:   γ1 = 0
kurtosis:   γ2 = 3
characteristic function:   φ(t) = e^{ıtµ} λ²/(λ² + t²)
The exponential distribution (equation 3.10) is a special case (α = 1), when the
probability of failure at time t is independent of t.
Chapter 4
Real p.d.f.’s
There are, of course, many other distributions which we have not discussed in the
previous section. We may introduce a few more later when needed. Now let us turn
to some complications which we will encounter in trying to use these distributions.
which has a finite variance (recall that the Cauchy p.d.f. did not):
V [x] = [ 1/(2 arctan a) ] ∫_{−a}^{+a} x²/(1 + x²) dx = a/arctan a − 1
• The physical p.d.f. may be modified by the response of the detector. This
response must then be convoluted with the physical p.d.f. to obtain the p.d.f.
which is actually sampled.
4.2 Convolution
Experimentally we often measure the sum of two (or more) r.v.’s. For example,
in the decay n → pe− ν e we want to measure the energy of the electron, which is
distributed according to a p.d.f. given by the theory of weak interactions, f1 (E1 ).
But we measure this energy with some apparatus, which has a certain resolution.
Thus we do not record the actual energy E1 of the electron but E1 + δ, where δ
is distributed according to the resolution function (p.d.f.) of the apparatus, f2 (δ).
What is then the p.d.f., f (E), of the quantity we record, i.e., E = E1 + δ? This
f (E) is called the (Fourier) convolution of f1 and f2 .
Assume E1 and δ to be independent. This may seem reasonable since E1 is from
the physical process (n decay) and δ is something extra added by the apparatus,
which has nothing at all to do with the decay itself. Then the joint p.d.f. is f (E1 , δ) = f1 (E1 ) f2 (δ), and the p.d.f. of the sum E = E1 + δ is
f (E) = ∫ f1 (E1 ) f2 (E − E1 ) dE1 ,    with characteristic function    φ(t) = φ1 (t) φ2 (t)
Thus, assuming that the r.v.’s are independent, the characteristic function of a
convolution is just the product of the individual characteristic functions. (This
probably looks rather familiar. We have already seen it in connection with the
reproductive property of distributions; in that case f1 and f2 were the same p.d.f.)
Recall that the characteristic function is a Fourier transform. Hence, a convolution,
E = E1 + δ, where δ is independent of E, is known as a Fourier convolution.
Another type of convolution, called the Mellin convolution, involves the product
of two random variables, e.g., E = E1 R1 . As we shall see, the Fourier convolution
is easily evaluated using the characteristic function, which is essentially a Fourier
transform of the p.d.f. Similarly, the Mellin convolution can be solved using the
Mellin transformation, but we shall not cover that here.
In the above example we have assumed a detector response independent of what
is being measured. In practice, the distortion of the input signal usually depends
on the signal itself. This can occur in two ways:
1. Detection efficiency. The chance of detecting an event with our apparatus
may depend on the properties of the event itself. For example, we want to
measure the frequency distribution of electromagnetic radiation incident on
the earth. But some of this radiation is absorbed by the atmosphere. Let
f (x) be the p.d.f. for the frequency, x, of incident radiation and let e(x) be
the probability that we will detect a photon of frequency x incident on the
earth. Both f and e may depend on other parameters, y, e.g., the direction
in space in which we look. The p.d.f. of the frequency of the photons which
we detect is

    g(x) = \frac{\int f(x, y)\, e(x, y)\, dy}{\int\!\int f(x, y)\, e(x, y)\, dx\, dy}
2. Resolution. To continue with the above example, suppose the detector records
frequency x′ when a photon of frequency x is incident. Let r(x′ , x) be the p.d.f.
that this will occur. Then
    g(x') = \int r(x', x)\, f(x)\, dx

Often the resolution function is a Gaussian centered on the true value, r(x', x) = N(x'; x, σ^2). However, in practice σ often depends on x, in which case r(x', x) may still have the above form, but is not really a Gaussian.
If the resolution function is Gaussian and if the physical p.d.f., f (x), is also
Gaussian, f (x) = N(x; µ, τ 2 ), then you can show, by using the reproductive
property of the Gaussian (exercise 19) or by evaluating the convolution using
the characteristic function (equation 4.1), that the p.d.f. for x′ is also normal:
    g(x') = \int_{-\infty}^{+\infty} f(x)\, r(x', x)\, dx = N(x'; µ, σ^2 + τ^2)
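This reproductive property is also easy to verify by simulation: smearing values drawn from N(x; µ, τ^2) with a Gaussian resolution of width σ gives a sample of standard deviation \sqrt{σ^2 + τ^2}. A short sketch (parameter values arbitrary):

import numpy as np

rng = np.random.default_rng(3)
mu, tau, sigma = 5.0, 2.0, 1.0          # physical mean/width and resolution, chosen arbitrarily

x_true = rng.normal(mu, tau, 1_000_000)                   # sampled from the physical p.d.f.
x_meas = x_true + rng.normal(0.0, sigma, x_true.size)     # smeared by the resolution

print("measured mean:", x_meas.mean(), "  expected", mu)
print("measured std :", x_meas.std(),  "  expected", np.sqrt(sigma**2 + tau**2))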
Chapter 5
Central Limit Theorem
Now recall that the characteristic function of a sum of independent r.v.’s is the
product of the individual characteristic functions. Therefore, the characteristic
function of S_u = \sum u_i is

    φ_{S_u}(t) = [φ_u(t)]^n = \left[ 1 - \frac{t^2}{2n} + \ldots \right]^n \;\longrightarrow\; e^{-t^2/2}   as n → ∞

But this is just the characteristic function of the standard normal N(S_u; 0, 1). Since S_u = \sum u_i = (S − nµ)/(σ\sqrt{n}), the p.d.f. for S = \sum x_i is the normal p.d.f. N(S; nµ, nσ^2).
A corollary of the C.L.T.: Under the conditions of the C.L.T., the p.d.f. of S/n approaches the normal p.d.f. as n → ∞:

    \lim_{n\to\infty} f\!\left(\frac{S}{n}\right) = N\!\left(\frac{S}{n};\, \frac{\sum µ_i}{n},\, \frac{\sum σ_i^2}{n^2}\right) ,   S = \sum_{i=1}^{n} x_i        (5.4)
1 or 2 is not a large number. Instead, the p.d.f. for θ will be the convolution of
the Gaussian for the net deflection from many small-angle scatters with the actual
p.d.f. for the large-angle scatters. It will look something like:
[Figure: sketch of this convolved p.d.f. as a function of the scattering angle θ.]
Adding this to the Gaussian p.d.f. for the much more likely case of no large-angle
scatters will give a p.d.f. which looks almost like a Gaussian, but with larger tails:
[Figure: sketch versus θ: the central region is nearly Gaussian (many small-angle, no large-angle scatters); the long tails come from many small-angle scatterings giving Gaussian tails, plus some large-angle scatterings giving a non-Gaussian p.d.f.]
This illustrates that the further you go from the mean, the worse the Gaussian
approximation is likely to be.
The C.L.T. also shows the effect of repeated measurements of a quantity. For
example, we measure the length of a table with a ruler. The variance of the p.d.f.
for 1 measurement is σ^2; the variance of the p.d.f. for an average of n measurements is σ^2/n. Thus σ is reduced by \sqrt{n}.
If a r.v. is the product of many factors, then its logarithm is a sum of equally
many terms. Assuming that the CLT holds for these terms, then the r.v. is asymp-
totically distributed as the log-normal distribution.
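A small numerical sketch of this multiplicative form of the C.L.T. (the choice of factors, uniform on (0.5, 1.5), is arbitrary): the logarithm of the product of many such factors is very nearly Gaussian.

import numpy as np

rng = np.random.default_rng(4)
n_factors, n_samples = 50, 200_000

factors = rng.uniform(0.5, 1.5, size=(n_samples, n_factors))
product = factors.prod(axis=1)          # each r.v. is a product of 50 independent factors

logp = np.log(product)                  # sum of the log factors, so approximately normal
z = (logp - logp.mean()) / logp.std()
print("fraction of log(product) within 1 sigma:", np.mean(np.abs(z) < 1), "(Gaussian: 0.683)")
print("fraction within 2 sigma                :", np.mean(np.abs(z) < 2), "(Gaussian: 0.954)")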
“You can . . . never foretell what any one man
will do, but you can say with precision what an
average number will be up to. Individuals vary,
but percentages remain constant.”
—Arthur Conan Doyle: Sherlock Holmes in
“The Sign of Four”
Part II
Monte Carlo
Chapter 6
Monte Carlo
The term Monte Carlo is used for calculational techniques which make use of random
numbers. These techniques represent the solution of a problem as a parameter of
a hypothetical population, and use a random sequence of numbers to construct a
sample of the population, from which statistical estimates of the parameter are
obtained.
The Monte Carlo solution of a problem thus consists of three parts:
1. choice of the p.d.f. which describes the hypothetical population;
2. generation of a random sample of the hypothetical population using a random
sequence of numbers; and
3. statistical estimation of the parameter in question from the random sample.
It is no accident that these three steps correspond to the three parts of these lectures.
P.d.f.’s have been covered in part I; this part will cover the generation of a Monte
Carlo sample according to a given p.d.f.; and part III will treat statistical estimation,
which is done in the same way for Monte Carlo as for real samples.
If the solution of a problem is the number F , the Monte Carlo estimate of F
will depend on, among other things, the random numbers used in the calculation.
The introduction of randomness into an otherwise well-defined problem may seem
rather strange, but we shall see that the results can be very good.
After a short treatment of random numbers (section 6.1) we will treat a common
application of the Monte Carlo method, namely integration (section 6.2) for which
the statistical estimation is particularly simple. Then, in section 6.3 we will treat
methods to generate a Monte Carlo sample which can then be used with any of the
statistical methods of part III.
An example of how we could use this last possibility is to turn on a counter for
a fixed time interval, long enough that the average number of decays is large. If
the number of detected decays is odd, we record a 1; if it is even, we record a 0.
We repeat this the number of times necessary to make up the fraction part of our
computer’s word (assuming a binary computer). We then have a random number
between 0 and 1.
Unfortunately, this procedure does not produce a uniform distribution if the
probability of an odd number of decays is not equal to that of an even number. To
remove this bias we could take pairs of bits: If both bits are the same, we reject
both bits; if they are different, we accept the second one. The probability that we
end up with a 1 is then the probability that we first get a zero and second a one;
the probability that we end up with a zero is the probability that we first get a
one and second a zero. Assuming no correlation between successive trials, these
probabilities are equal and we have achieved our goal.
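This bias-removal trick is easily demonstrated in a few lines (a sketch in Python; the raw-bit probability p is an arbitrary choice standing in for the unknown odd/even asymmetry):

import numpy as np

rng = np.random.default_rng(5)
p = 0.6                                            # probability that a raw bit is 1 (biased source)
raw = (rng.uniform(size=2_000_000) < p).astype(int)

# Take the bits in non-overlapping pairs; reject equal pairs, otherwise keep the second bit.
pairs = raw.reshape(-1, 2)
keep = pairs[:, 0] != pairs[:, 1]
debiased = pairs[keep, 1]

print("raw bit frequency of 1      :", raw.mean())
print("debiased bit frequency of 1 :", debiased.mean())   # close to 0.5
print("fraction of pairs accepted  :", keep.mean())       # 2p(1-p) = 0.48 here

Note that only a fraction 2p(1 − p) ≤ 1/2 of the pairs survives, which is the price of removing the bias.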
The main problem with such generators is that they are very slow. Not wanting
to have too dangerous a source, i.e., not too much more than the natural background
(cosmic rays are about 200 m^{-2} s^{-1}), nor too large a detector, it is clear that we will
have counting times of the order of milliseconds. For a 24-bit fraction, that means
24 counting intervals per real random number, or 96 intervals if we remove the bias.
Thus we can easily spend of the order of 1 second to generate 1 random number!
They are also not, by their very nature, reproducible, which can be a problem
when debugging a program.
for the next. Usually there is a provision allowing the user to set the seed at the
start of his program and to find out what the seed is at the end. This feature allows
a new run to be made starting where the previous run left off. In FORTRAN90 this
is standardized by providing an intrinsic subroutine, random_number(h), which fills the real variable (or array) h with pseudo-random numbers in the interval [0, 1). A subroutine random_seed is also provided to input a seed or to inquire what the
seed is at any point in the calculation. However, no requirements are made on the
quality of the generated sequence, which will therefore depend on the compiler used.
In critical applications one may therefore prefer to use some other generator.
Recently, new methods have been developed resulting in pseudo-random number
generators far better than the old ones.31 In particular the short periods, i.e., that
the sequence repeats itself, of the old generators has been greatly lengthened. For
example the generator RANMAR has a period of the order of 10^43. The new generators
are generally written as subroutines returning an array of random numbers rather
than as a function, since the time to call a subroutine or invoke a function is of the
same order as the time to generate one number, e.g., CALL RANMAR (RVEC,90) to
obtain the next 90 numbers in the sequence in the array RVEC, which of course must
have a dimension of at least 90.
Some pseudo-random number generators generate numbers in the closed interval
[0, 1] rather than the open interval. Although it occurs very infrequently (once in
2^24 on a 32-bit computer), the occurrence of an exact 0 can be particularly annoying
if you happen to divide by it. The open interval is therefore recommended.
Any one who considers arithmetical methods of producing
random digits is, of course, in a state of sin.
— John von Neumann
An obvious Monte Carlo method, called crude Monte Carlo, is to do the same
sum, but with
xi = a + ri (b − a)
where the ri are random numbers uniformly distributed on the interval (0, 1).
More formally, the expectation of the function y(x) given a p.d.f. f (x) which is
non-zero in (a, b) is given by
    µ_y = E[y] = \int_a^b y(x)\, f(x)\, dx
We shall see in statistics (sect. 8.3) that an expectation value, e.g., E[y], can be estimated by the sample mean of the quantity, ȳ = \sum y_i / n.
Thus by generating n values xi distributed uniformly in (a, b) and calculating
the sample mean, we determine the value of I/(b − a) to an uncertainty σ_y/\sqrt{n}:

    I = \frac{b-a}{n} \sum_{i=1}^{n} y(x_i) \; ± \; (b-a)\, \frac{σ_y}{\sqrt{n}}        (6.2)
In practice, if we do not know \int y\, dx, it is unlikely that we know \int y^2\, dx, which is necessary to calculate σ_y. However, we shall see that this too can be estimated from the Monte Carlo points (eq. 8.7):
    \hat{σ}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - ȳ)^2
Since n is large, we can replace n − 1 by n. Multiplying out the sum we then get
    \hat{σ}^2 = \overline{y^2} - ȳ^2
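A minimal sketch of crude Monte Carlo with this estimated uncertainty (Python with NumPy; the integrand is an arbitrary test function, not one from the text):

import numpy as np

rng = np.random.default_rng(6)
a, b, n = 0.0, 1.0, 100_000

def y(x):                          # test integrand; exact integral on (0,1) is 2 - 5/e
    return np.exp(-x) * x**2

x = a + rng.uniform(size=n) * (b - a)       # crude Monte Carlo points
yi = y(x)

I = (b - a) * yi.mean()                     # estimate of the integral
sigma_y = yi.std(ddof=1)                    # sigma_y estimated from the same points
err = (b - a) * sigma_y / np.sqrt(n)        # uncertainty, equation 6.2

print(f"I = {I:.5f} +- {err:.5f}   (exact {2 - 5/np.e:.5f})")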
The uncertainty of the estimation of I is then \sqrt{0.571}/\sqrt{n} = 0.756/\sqrt{n}, which is considerably larger (more than a factor 2) than for crude Monte Carlo.
The uncertainty of Monte Carlo integration is compared with that of numerical methods in the following table, which gives the rate at which the uncertainty decreases with the number of points n for a d-dimensional integration:

    method                  uncertainty decreases as
    Monte Carlo             n^{-1/2}
    trapezoidal rule        n^{-2/d}
    Simpson's rule          n^{-4/d}
    m-point Gauss rule      n^{-(2m-1)/d}
Thus we see that Monte Carlo integration converges much more slowly than
other methods, particularly for low numbers of dimensions. Only for greater than
8 dimensions does Monte Carlo converge faster than Simpson’s rule, and there is
always a Gauss rule which converges faster than Monte Carlo.
However, there are other considerations besides rate of convergence: The first is
the question of feasibility. For example, in 38 dimensions a 10-point Gauss method
converges at the same rate as Monte Carlo. However, in the Gauss method, the
number of points is fixed, n = m^d, which in our example is 10^38. The evaluation of even a very simple function requires on the order of microseconds on a fast computer. So 10^38 is clearly not feasible. (10^32 sec ≈ π × 10^24 years, while the age of the universe is only of order 12 Gyr.)
Another problem with numerical methods is the boundary of integration. If the
boundary is complicated, numerical methods become very difficult. This is, how-
ever, easily handled in Monte Carlo. One simply takes the smallest hyperrectangle
that will surround the region of integration and integrates over the hyperrectangle,
throwing away the points that fall outside the region of integration. This leads to
some inefficiency, but is straightforward. This is one of the chief advantages of the
Monte Carlo technique. An example is given by phase space integrals in particle
physics. Consider the decay n → pe− ν e , the neutron at rest. Calculations for this
decay involve 9 variables, px , py , pz for each of the 3 final-state particles. However
these variables are not independent, being constrained by energy and momentum
conservation, \sum p_x = \sum p_y = \sum p_z = 0 and \sum E = m_n c^2, where the energy of a particle is given by E = \sqrt{m^2 c^4 + p_x^2 c^2 + p_y^2 c^2 + p_z^2 c^2}. This complicated boundary
makes an integration by numerical methods difficult; it becomes practically im-
possible for more than a few particles. However Monte Carlo integration is quite
simple: one generates points uniformly in the integration variables, calculates the
energy and momentum components for each particle and tests whether momentum
and energy are conserved. If not, the point is simply rejected.
Another practical issue might be termed the growth rate. Suppose you have
performed an integration and then decide that it is not accurate enough. With
Monte Carlo you just have to generate some more points (starting your random
number generator where you left off the previous time). However, with the Gauss
rule, you have to go to a higher order m. All the previous points are then useless
and you have to start over.
    I = \frac{1}{2}\, \frac{(b-a)^2}{n'} \sum_{i=1}^{n'} g(x_i, y_i)
Both rejection methods are correct, but inefficient—both use only half of the
points. Sometimes this inefficiency can be overcome by a trick:
This is equivalent to generating points uniformly over the whole square and
then folding the square about the diagonal so that all the points fall in the
triangular region of integration. The density of points remains uniform.
(c) But we make a weighted sum, the weight correcting for the unequal density of points (density ∼ 1/(x − a)):

        I = \frac{b-a}{n} \sum_{i=1}^{n} (x_i - a)\, g(x_i, y_i)        (6.5)
Stratification
In this approach we split the region of integration into two or more subregions.
Then the integral is just the sum of the integrals over the subregions, e.g., for two
subregions,

    I = \int_a^b y(x)\, dx = \int_a^c y(x)\, dx + \int_c^b y(x)\, dx
The variance of I is just the sum of the variances of the subregions. A good choice
of subregions and number of points in each region can result in a dramatic decrease
in V [I]. However, to make a good choice requires knowledge of the function. A
poor choice can increase the variance.
Some improvement can always be achieved by simply splitting the region into
subregions of equal size and generating the same number of points for each subre-
gion. We illustrate this, using crude Monte Carlo, for the case of two subregions:
For the entire region the variance is (from equation 6.2)
    V_1(I) = \frac{(b-a)^2}{n}\, σ_y^2 = \frac{(b-a)^2}{n} \left[ \frac{1}{b-a}\int_a^b y^2\, dx - \left( \frac{1}{b-a}\int_a^b y\, dx \right)^2 \right]
For two equal regions, the variance is the sum of the variances of the two regions:
(" Z Z 2 #
[(b − a)/2]2 2 c 2 c
V2 (I) = y 2 dx − y dx
n/2 b−a a b−a a
Z Z !2
2 b 2 b
+ y 2 dx − y dx
b−a c b−a c
6.2. MONTE CARLO INTEGRATION 79
!2
2 Z Z 2 Z
(b − a) 2 b 4 c b
= y 2 dx − y dx + y dx
2n b−a a (b − a)2 a c
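A numerical sketch of the effect (Python with NumPy, integrand chosen arbitrarily): splitting (a, b) into two equal halves with half the points in each already reduces the variance for a monotonically rising y(x).

import numpy as np

rng = np.random.default_rng(7)
a, b, n = 0.0, 1.0, 100_000
y = lambda x: x**2                          # illustrative integrand, exact integral 1/3

def crude(lo, hi, npts):
    # Crude MC estimate of the integral on (lo, hi) and the variance of that estimate.
    x = lo + rng.uniform(size=npts) * (hi - lo)
    vals = y(x)
    return (hi - lo) * vals.mean(), (hi - lo)**2 * vals.var(ddof=1) / npts

I1, V1 = crude(a, b, n)                     # one region, n points
c = 0.5 * (a + b)
Ia, Va = crude(a, c, n // 2)                # two equal subregions, n/2 points each
Ib, Vb = crude(c, b, n // 2)

print("crude     : I = %.5f   V = %.3g" % (I1, V1))
print("stratified: I = %.5f   V = %.3g" % (Ia + Ib, Va + Vb))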
Importance Sampling
We have seen that (in crude Monte Carlo) the variance of the estimate of the integral
is proportional to the variance of the function being integrated (eq. 6.2). Thus the
less variation in y, i.e., the more constant y(x) is, the more accurate the integral.
We can effectively achieve this by generating more points in regions of large y and
compensating for the higher density of points by reducing the value of y (i.e., giving
a smaller weight) accordingly. This was also the motivation for stratification.
In importance sampling we change variable in order to have an integral with
smaller variance:
    I = \int_a^b y(x)\, dx = \int_a^b \frac{y(x)}{g(x)}\, g(x)\, dx = \int_{G(a)}^{G(b)} \frac{y(x)}{g(x)}\, dG(x)

    where   G(x) = \int_a^x g(x')\, dx'
We choose g(x), normalized such that G(b) − G(a) = 1, in such a way that:

• The ratio y(x)/g(x) is as nearly constant as possible and in any case more constant than y(x), i.e., σ_{y/g} < σ_y.
We then choose values of G randomly between 0 and 1; for each G solve for x; and
sum y(x)/g(x). The weighting method of section 6.2.5 was really an application of
importance sampling.
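As a small worked sketch (not an example from the text): to estimate I = \int_0^1 e^x\, dx = e − 1 one can sample from g(x) = \frac{2}{3}(1 + x), whose integral G(x) = (2x + x^2)/3 is analytically invertible, and average y/g. The comparison with crude Monte Carlo on the same number of points shows the variance reduction:

import numpy as np

rng = np.random.default_rng(8)
n = 100_000

# Importance sampling: x = G^{-1}(u) with G(x) = (2x + x^2)/3, i.e. x = -1 + sqrt(1 + 3u)
u = rng.uniform(size=n)
x = -1.0 + np.sqrt(1.0 + 3.0 * u)
w = np.exp(x) / (2.0 * (1.0 + x) / 3.0)     # the ratio y(x)/g(x)
I, err = w.mean(), w.std(ddof=1) / np.sqrt(n)

# Crude Monte Carlo with the same number of points, for comparison
xc = rng.uniform(size=n)
Ic, errc = np.exp(xc).mean(), np.exp(xc).std(ddof=1) / np.sqrt(n)

print(f"importance sampling: {I:.5f} +- {err:.5f}")
print(f"crude Monte Carlo  : {Ic:.5f} +- {errc:.5f}   (exact {np.e - 1:.5f})")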
Although importance sampling is a useful technique, it suffers in practice from
a number of drawbacks:
• The class of functions g which are integrable and for which the integral can
be inverted analytically is small—essentially only the trigonometric functions,
exponentials, and polynomials. The inversion could in principle be done nu-
merically, but this introduces inaccuracies which may be larger than the gain
made in reducing the variance.
• It is very difficult in more than one dimension. In practice one usually uses a
g which is a product of one-dimensional functions.
• It can be unstable. If g becomes small in a region, y/g becomes very big and
hence the variance also. It is therefore dangerous to use a function g which is
0 in some region or which approaches 0 rapidly.
• Clearly y(x) must be rather well known in order to choose a good function g.
On the other hand, an advantage of this method is that singularities in y(x) can be
removed by choosing a g(x) having the same singularities.
Control Variates
This is similar to importance sampling except that instead of dividing by g(x), we
subtract it:

    I = \int y(x)\, dx = \int [y(x) - g(x)]\, dx + \int g(x)\, dx

Here, \int g(x)\, dx must be known, and g is chosen such that y − g has a smaller
variance than y. This method does not risk the instability of importance sampling.
Nor is it necessary to invert the integral of g(x).
Antithetic Variates
So far, we have always used Monte Carlo points which are independent. Here we
deliberately introduce a correlation. Recall that the variance of the sum of two
functions is
V [y1 (x) + y2 (x)] = V [y1 (x)] + V [y2 (x)] + 2 cov[y1 (x), y2 (x)]
If we can write y = y_1 + y_2 such that y_1 and y_2 have a large negative correlation, we can reduce the variance of
I. Clearly, we must understand the function y(x) in order to do this. It is difficult
to give general methods, but we will illustrate it with an example:
Suppose that we know that y(x) is a monotonically increasing function of x.
Then let y_1 = \frac{1}{2} y(x) and y_2 = \frac{1}{2} y(b − (x − a)). Clearly the integral of (y_1 + y_2) is
just the integral of y. However, since y is monotonically increasing, y1 and y2 are
negatively correlated; when y1 is small, y2 is large and vice versa. If this negative
correlation is large enough, V [y1 + y2 ] < V [y].
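A short sketch for a monotonically increasing integrand (e^x on (0, 1), chosen only for illustration):

import numpy as np

rng = np.random.default_rng(9)
a, b, n = 0.0, 1.0, 100_000
y = np.exp                                             # monotonically increasing test integrand

x = a + rng.uniform(size=n) * (b - a)
plain = (b - a) * y(x)                                 # ordinary term, one point
antithetic = (b - a) * 0.5 * (y(x) + y(b - (x - a)))   # y1 + y2 with x and its mirror point

print("exact                     :", np.e - 1)
print("plain      mean, variance :", plain.mean(), plain.var(ddof=1))
print("antithetic mean, variance :", antithetic.mean(), antithetic.var(ddof=1))

Each antithetic term costs two evaluations of y, so the fair comparison is the variance per function evaluation, which here is still much smaller than for the plain estimate.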
Random number generators deliver uniformly distributed numbers; to simulate events we must transform the uniformly distributed random numbers into random numbers
distributed according to the desired p.d.f. There are three basic methods to do this:
In the rejection (hit-or-miss) method one generates x_i = R[a, b] and r_i = R[0, f_max], and accepts x_i only if r_i ≤ f(x_i).
In hit-or-miss Monte Carlo integration we also introduced an f_min. Since we knew the integral \int_a^b f_min\, dx, it was not necessary to evaluate it by Monte Carlo. It was therefore better (more efficient) to use all the Monte Carlo points to evaluate \int_a^b (f − f_min)\, dx. But here we want to generate all the events for f, not just for (f − f_min).
The difficulty with this method lies in knowing fmax . If we do not know it, then
we must guess a ‘safe’ value, i.e., a value which we are sure is larger than fmax . If
we choose fmax too safe, the method becomes inefficient. This method can be made
more efficient by choosing different values of fmax in different regions.
This method is the easiest method to use for complicated functions in many
dimensions.
In the inverse transformation method we use the c.d.f., F(x), of the p.d.f. we want to generate, f(x)\, dx = dF. We generate

    u_i = R[F(a), F(b)]

and calculate the corresponding value of x,

    x_i = F^{-1}(u_i)

[Figure: the c.d.f. F(x); a uniformly generated u is mapped onto x = F^{-1}(u).]
The xi are then distributed as f (x). To see this, recall the results on changing vari-
ables (sect. 2.2.6): For a transformation u → x = v(u) with inverse transformation
u = w(x), the p.d.f. for x is given by the p.d.f. for u, g(u), times the Jacobian, i.e.,
    p.d.f. for x = g(u) \left| \frac{\partial u}{\partial x} \right| = g(w(x)) \left| \frac{\partial w(x)}{\partial x} \right|

In our case u is uniformly distributed, g(u) = 1, and w(x) = F(x), so the p.d.f. for x is \partial F(x)/\partial x = f(x), as claimed.
Discrete p.d.f.
For a discrete p.d.f., we can always use this method, since the c.d.f. is always easily
calculated. The probability of X = xk is P (X = xk ) = f (xk ). Then the c.d.f. is
    F_k = P(X ≤ x_k) = \sum_{i=1}^{k} f(x_i)
1. generate u_i = R[0, 1]

2. find the k for which F_{k-1} < u_i ≤ F_k
Then xk is the desired value of x.
Step 2 of this procedure can involve a lot of steps. You can usually save computer
time by starting the comparison somewhere in the middle of the x-range, say at the
mean or mode, and then working up or down in x depending on u and Fk .
This is of interest not only for situations with a discrete p.d.f., but also for
cases where the p.d.f. is continuous, but not known analytically. The resolution
function of an apparatus is often determined experimentally and the resulting p.d.f.
expressed as a histogram.
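A short sketch of this procedure for a p.d.f. given as a histogram (Python with NumPy; the bin contents are arbitrary). The cumulative sums play the role of F_k, and the search for F_{k−1} < u ≤ F_k picks the bin; the point is then spread uniformly inside the chosen bin:

import numpy as np

rng = np.random.default_rng(10)

contents = np.array([1.0, 3.0, 5.0, 4.0, 2.0])        # arbitrary histogram bin contents
edges = np.linspace(0.0, 1.0, len(contents) + 1)      # equal-width bins on (0, 1)

F = np.cumsum(contents) / contents.sum()              # discrete c.d.f. F_k

u = rng.uniform(size=1_000_000)
k = np.searchsorted(F, u)                             # bin with F_{k-1} < u <= F_k
x = edges[k] + rng.uniform(size=u.size) * (edges[k + 1] - edges[k])

generated, _ = np.histogram(x, bins=edges)
print("requested fractions:", contents / contents.sum())
print("generated fractions:", generated / u.size)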
6.3.5 Example
As an example of the above methods, we take the p.d.f.
f (x) = 1 + x2
in the region (−1, 1). This could be an angular distribution with x = cos θ. We
note that f (x) is not normalized. We could, of course, normalize it, but choose not
to do so. For as we shall see, for the purpose of generating events the normalization
is not necessary.
Weighted events
This is completely trivial. We generate x_i = R[−1, 1] and assign weight w_i = 1 + x_i^2.
Rejection method
We have f_max = 2, a = −1, b = +1. Therefore, we generate

    x_i = R[−1, +1] = 2 R[0, 1] − 1
    r_i = R[0, 2] = 2 R[0, 1]

and reject the point if r_i > 1 + x_i^2.

Note that the efficiency of the generation is

    \frac{\int_{-1}^{+1} (1 + x^2)\, dx}{(b - a)\, f_{max}} = \frac{2}{3}

i.e., 1/3 of the points are rejected.
Composite method
We write f (x) as the sum of simpler functions. In this case an obvious choice is
f (x) = fa (x) + fb (x) with fa (x) = 1 and fb (x) = x2
The integrals of these functions are
    A_a = \int_{-1}^{+1} f_a(x)\, dx = 2   and   A_b = \int_{-1}^{+1} f_b(x)\, dx = \left[ \frac{x^3}{3} \right]_{-1}^{+1} = \frac{2}{3}
Hence we want to generate from f_a with probability 2/(2 + 2/3) = 3/4 and from f_b with probability 1/4.

The first step is therefore to generate v = R[0, 1].

• If v ≤ 3/4 we generate from f_a:

    u_i = R[0, 1]
    x_i = 2 u_i − 1
• If v > 3/4 we generate from f_b:

  1. either by the rejection method:

        x_i = 2 R[0, 1] − 1
        r_i = R[0, f_{b,max}] = R[0, 1]

     repeating until we find a point for which r_i ≤ x_i^2.
     Note that the efficiency is

        \frac{\int_{-1}^{+1} x^2\, dx}{(b - a)\, f_{b,max}} = \frac{1}{3}

     for the points generated here (1/4 of the points). But it was 1 for the points distributed according to f_a. The net efficiency is thus \frac{1}{3} \cdot \frac{1}{4} + 1 \cdot \frac{3}{4} = \frac{5}{6}, a small improvement over the 2/3 of the simple rejection method.
2. or by the inverse transformation method:
        F_b(x) = \int_{-1}^{x} x'^2\, dx' = \left[ \frac{x'^3}{3} \right]_{-1}^{x} = \frac{x^3}{3} + \frac{1}{3} ,   F_b(−1) = 0 ,   F_b(+1) = \frac{2}{3}

     We generate u_i = R[0, 1]. Then x_i is the solution of

        \frac{2}{3} u_i = \frac{x_i^3}{3} + \frac{1}{3}

     Hence, x_i = (2 u_i − 1)^{1/3}.
     Note that we only have to calculate one cube root, and that only for 1/4 of the events. This is ∼ 12 times faster than the simple inverse transformation method (assuming that square and cube roots take about the same time).
In this example, the composite rejection method turned out to be the fastest
with the simple rejection method only slightly slower. The composite inverse trans-
formation method was much faster than the simple inverse transformation method,
but still much slower than the rejection method. These results should not be re-
garded as general. Which method is faster depends on the function f .
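For completeness, a sketch of the composite method as just described (with the inverse transformation for the f_b part), checked against the normalized p.d.f.:

import numpy as np

rng = np.random.default_rng(11)
n = 1_000_000

# Composite generation of f(x) = 1 + x^2 on (-1, 1):
# with probability 3/4 generate from f_a = 1 (uniform), otherwise from f_b = x^2.
v = rng.uniform(size=n)
u = rng.uniform(size=n)
x = np.where(v <= 0.75,
             2.0 * u - 1.0,                  # from f_a: uniform on (-1, 1)
             np.cbrt(2.0 * u - 1.0))         # from f_b: x = (2u - 1)^(1/3)

# Compare the histogram with the normalized p.d.f., (1 + x^2)/(8/3)
hist, edges = np.histogram(x, bins=50, range=(-1.0, 1.0), density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
expected = (1.0 + centres**2) / (8.0 / 3.0)
print("maximum relative deviation:", np.max(np.abs(hist / expected - 1.0)))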
Another disadvantage of this method is that it puts severe requirements on the cor-
relations between successive points of the random number generator, in particular
on correlations within groups of n successive values of ui .
A word of caution is perhaps appropriate for clever students who have undoubt-
edly noticed that instead of summing 12 ui and subtracting 6, we could have used
    g_- = \sum_{i=1}^{6} u_i - \sum_{i=7}^{12} u_i
So far, so good. But if you try to save computer time by generating both g+ and
g− with the same 12 values of ui , you are in trouble: g+ and g− are then highly
correlated.
A transformation method
Since the Gaussian p.d.f. cannot be integrated in terms of the usually available func-
tions, it is not straightforward to find a transformation from uniformly to Gaussian
distributed variables. There is, however, a clever method, which we give without
proof, to transform two independent variables, u1 and u2 , uniformly distributed on
(0,1) to two independent variables, g1 and g2 , which are normally distributed with
µ = 0 and σ 2 = 1:
    g_1 = \cos(2π u_2)\, \sqrt{-2 \ln u_1}
    g_2 = \sin(2π u_2)\, \sqrt{-2 \ln u_1}
This method is exact, but its speed can be improved upon by effectively gener-
ating the sine and cosine by a rejection method:
1. Generate u_1 = R[0, 1] and u_2 = R[0, 1].
2. Compute r^2 = (2u_1 − 1)^2 + (2u_2 − 1)^2.
3. If r^2 > 1, reject the pair and return to step 1.
4. Otherwise,

    g_1 = (2u_1 − 1) \sqrt{\frac{-2 \ln r^2}{r^2}}
    g_2 = (2u_2 − 1) \sqrt{\frac{-2 \ln r^2}{r^2}}
This saves the time of evaluating a sine and a cosine at the slight expense of rejecting
about 21% of the uniformly generated points.
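A vectorized sketch of this polar variant (Python with NumPy); the rejected fraction should come out close to 1 − π/4 ≈ 0.215:

import numpy as np

rng = np.random.default_rng(12)
n = 1_000_000

u1, u2 = rng.uniform(size=n), rng.uniform(size=n)
v1, v2 = 2.0 * u1 - 1.0, 2.0 * u2 - 1.0       # uniform on (-1, 1)
r2 = v1**2 + v2**2

accept = (r2 > 0.0) & (r2 < 1.0)              # keep only points inside the unit circle
v1, v2, r2 = v1[accept], v2[accept], r2[accept]

factor = np.sqrt(-2.0 * np.log(r2) / r2)
g1, g2 = v1 * factor, v2 * factor             # two independent N(0, 1) variables

print("rejected fraction:", 1.0 - accept.mean(), "  (1 - pi/4 =", 1.0 - np.pi / 4, ")")
print("means:", g1.mean(), g2.mean(), "  stds:", g1.std(), g2.std())
print("correlation:", np.corrcoef(g1, g2)[0, 1])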
Part III
Statistics
Chapter 7
Statistics—What is it/are they?
So far, we have considered probability theory. Once we have decided which p.d.f. is
appropriate to the problem, we can make direct calculations of the probability of any
set of outcomes. Apart from possible uncertainty about which p.d.f. is appropriate,
this is a straight-forward and mathematically well defined procedure.
The problem we now address is the inverse of this. We have a set of data
which have been sampled from some parent p.d.f. We wish to infer from the data
something about the parent p.d.f. Note that here we are assuming that the data
are independent, i.e., that the value of a particular datum does not depend on
the values of other data, and that all of the data sample the same p.d.f. The
statistician speaks of a sample of independent, identically distributed (iid) random
variables. Usually this will be the case, and some of our methods will depend on
this.
The study of calculations using probability is sometimes called direct probability.
Statistical inference is sometimes called inverse probability, particularly in the case
of Bayesian methods.
We may think we know what the p.d.f. is apart from one or more parameters,
e.g., we think it is a Gaussian but want to determine its mean and standard devia-
tion. This is called parameter estimation. It is also called fitting since we want to
determine the value of the parameter such that the p.d.f. best ‘fits’ the data.
On the other hand, we may think we know the p.d.f. and want to know whether
we are right. This is called hypothesis testing. Usually both parameter estimation
and hypothesis testing are involved, since it makes little sense to try to determine the
parameters of an incorrect p.d.f. And frequently an hypothesis to be tested involves
some unknown parameter. Nevertheless, we will first treat these as separate topics.
A third topic is decision theory or classification.
For all of these topics we shall use statistical methods (or “statistics”), so-called
because they, statistical methods, make (it, statistics, makes) use of one or more
statistics.∗ A definition: A statistic is any function of the observations in a sample,
which does not depend on any of the unknown characteristics of the population
(parent p.d.f.). An example of a statistic is the sample mean, x̄ = \sum x_i / n. Each
observation, xi , is, in fact, itself a statistic. In other words, if you can calculate it
from the data plus known quantities, it is a statistic. “Statistics” is the branch of
applied mathematics which deals with statistics as just defined. Whether the word
statistics is singular or plural, thus depends on which meaning you intend.
We have seen in section 2.4 that there are two common interpretations of prob-
ability, which we have called frequentist and Bayesian. They give rise to two ap-
proaches to statistical inference, usually called classical or frequentist statistics (or
inference) and Bayesian inference. The word classical is something of a misnomer,
since the Bayesian interpretation is older (Bayes, Laplace). However, in the second
half of the 19th century science became more quantitative and objective, even in
such fields as biology (Darwin, evolution, heredity, Galton). This gave rise to the
frequentist interpretation and the development of frequentist statistics. By about
1935 frequentist statistics, which came to be known as classical statistics, had com-
pletely replaced Bayesian thinking. Since around 1960, however, Bayesian inference
has been making a comeback.
Probably most physicists would profess to being frequentists, and reflecting this,
as well as my own personal bias, the emphasis in the rest of this course will be on
classical statistics. However, there are situations where classical statistics is very
difficult, or even impossible, to use and where Bayesian statistics is comparatively
simple to apply. So, intermixed with classical statistics you will find some Bayesian
methods. This is rather unconventional; most books are firmly in one of the two
camps, and discussions between frequentists and Bayesians often take on aspects of
holy war. It also runs the risk of confusing the student—it is important to know
which you are doing.
∗ It is perhaps interesting to note that the stat in statistics is the same as in state. Statists (advocates of statism, economic control and planning by a highly centralized state) collected data to better enable the state to run the economy. Such data, and quantities calculated from them, came to be called statistics.
Chapter 8
Parameter estimation
8.1 Introduction
In everyday speech, “estimation” means a rough and imprecise procedure leading
to a rough and imprecise result. You estimate when you cannot measure exactly.
This last sentence is also true in statistics, but only because you can never measure
anything exactly; there is always some uncertainty. In statistics, estimation is a
precise procedure leading to a result which may be imprecise, but the extent of the
precision is, in principle, known. Estimation in statistics has nothing to do with
approximation.
The goal of parameter estimation is then to make some sort of statement like
θ = a ± b where a is, on the basis of the data, the ‘best’ (in some sense) value
of the parameter θ and where it is ‘highly probable’ that the true value of θ lies
somewhere between a − b and a + b. We often call b the estimated error on a. If
we make a plot, this is represented by a point at θ = a with a bar running through
it from a − b to a + b, the ‘error bar’. It is usually assumed that the estimate of
θ is normally distributed, i.e., that the values of a obtained from many identical
experiments would form a normal distribution centered about the true value of θ
with standard deviation equal to b. The meaning of θ = a ± b is then that a is,
in some sense (to be discussed more fully later), the most likely value of θ and that in any case there is, again in some sense, a \int_{a-b}^{a+b} N(x; a, b^2)\, dx = 68.3% chance that the true value of θ lies in the interval (a − b, a + b).∗ This is a special case of
a 68.3% ‘confidence interval’ (cf. chapter 9), i.e., an interval within which we are
68.3% confident that the true value lies. We shall see that error bars, or confidence
intervals may be difficult to estimate. Just as our estimate of θ has an ‘error’, so
too does our estimate of this ‘error’.

∗ Note that this is different from what an engineer usually means by a ± b, namely that b is the tolerance on a, i.e., that the true value is guaranteed to be within (a − b, a + b).
Suppose now that we have a set of numbers xi which are the result of our
experiment. This could, e.g., be n measurements of some quantity. Let θ be the
true value of that quantity. The xi are clustered about θ in some way that depends
on the measuring process. We often assume that they are distributed normally
about the true value with a width given by the accuracy of the measurement.
It is worth noting the distinction many authors, e.g., Bevington10 , make be-
tween the words accuracy and precision, which in normal usage are synonymous.
Accuracy refers to how close a result is to the true value, whereas precision refers
to how reproducible the measurements are. Thus, a poorly calibrated apparatus
may result in measurements of high precision but poor accuracy. Other authors,
e.g., Eadie et al.,4 and James5 prefer to avoid these terms altogether since neither
term is well defined, and to speak only of the variance of the estimates.
Similarly, a distinction is sometimes∗ made between error, the difference be-
tween the estimate and the true value, and the uncertainty, the square root of the
variance of the estimate. Thus accurate means small error and precise means small
uncertainty. However, the use of the word ‘error’ to mean uncertainty is deeply
ingrained, and we (like most books) will not make the distinction. Note that with
the above distinction, the accuracy and the error are usually unknown, since the
true value is usually unknown.

∗ This is recommended by the International Standards Organization.34
So, we wish to estimate θ. To do this we need an estimator which is a function
of the measurements.
As stated in chapter 7, a statistic is, by definition, any function of the obser-
vations in a sample, φ(xi ), which does not depend on any of the unknown charac-
teristics of the population (parent p.d.f.). An example of a statistic is the sample
mean, x̄ = \sum x_i / n. In other words, if you can calculate it from the data plus known
quantities, it is a statistic.
Since a statistic is calculated from random variables, it is itself a r.v., but a
r.v. whose value depends on the particular sample, or set of data. Like all random
variables, it is distributed according to some p.d.f. Since the value of the statistic
depends on the sample, its p.d.f. is sometimes referred to as the sampling distri-
bution or sampling p.d.f. in order to distinguish it from the population or parent
p.d.f.
An estimator is (definition) a statistic, the value of which we will give as our
determination of some constant, θ, which is a property of the parent population
or parent p.d.f. We will generally denote an estimator of a variable by adding a
circumflex (ˆ) to the symbol of the variable. Thus θ̂ is an estimator of θ.
There are in general numerous estimators that one can construct for any θ. Here
are several estimators of the mean, µ, of the parent p.d.f., assuming n measurements,
x_i:

1. µ̂ = x̄ = \frac{1}{n} \sum_{i=1}^{n} x_i — The sample mean. This is probably the most often used estimator of the mean, but it can be sensitive to mismeasured data.
2. µ̂ = \frac{1}{10} \sum_{i=1}^{10} x_i — The sample mean of the first 10 points, ignoring the rest.

3. µ̂ = \frac{1}{n-1} \sum_{i=1}^{n} x_i — n/(n − 1) times the sample mean.

8. µ̂ = \frac{2}{n} \sum_{i=1}^{n/2} x_{2i} — The sample mean of the even-numbered points, ignoring the odd-numbered points.
Each of these is, by our definition, an estimator. Yet some are certainly better
than others. However, which is ‘best’ depends on the p.d.f. Which is ‘best’ may
also depend on the use we want to make of it. How do we choose which estimator
to use? In general we shall prefer an estimator which is ‘unbiased’, ‘consistent’, and
‘efficient’. We will discuss these and other properties of estimators in the following
section. In succeeding sections we will treat three general methods of constructing,
or choosing, estimators.
8.2.1 Bias
Since a statistic is a function of r.v.’s, it is itself a r.v. Therefore, it is distributed according to some p.d.f., and we can speak of its expectation value, E[θ̂]. For an estimator, making use of n observations, the bias b_n is defined as the difference between the expectation of the estimator and the true value of the parameter:

    b_n(θ̂) = E[θ̂] − θ = E[θ̂ − θ]        (8.1)

An estimator is unbiased if, for all n and θ, b_n(θ̂) = 0, i.e., if E[θ̂] = θ. We include n in this definition since we shall see that some estimators are unbiased only asymptotically, i.e., only for n → ∞.
Mean In general, the sample mean, no. 1 in our list above, is an unbiased esti-
mator of the parent (true) mean:
    E[µ̂] = E[x̄] = E\left[ \frac{1}{n} \sum x_i \right] = \frac{1}{n} \sum E[x_i] = \frac{1}{n}\, n E[x] = E[x] = µ        (8.2)
On the other hand, the third estimator in our list is biased:
    E[µ̂] = E\left[ \frac{1}{n-1} \sum x_i \right] = \frac{n}{n-1}\, µ

although the bias,

    b_n(µ̂) = \frac{n}{n-1} µ − µ = \frac{µ}{n-1} \; → \; 0 ,   for large n.
This estimator is thus asymptotically unbiased.
If we know the bias, we can construct a new estimator by correcting the old
one for its bias. For example, from no. 3 and its bias we construct no. 1 simply by
multiplying no. 3 by (n − 1)/n.
Lack of bias is a reason to prefer no. 1 to no. 3. However, nos. 2 and 8 are also
unbiased. The trimmed mean (no. 9) is unbiased if the parent p.d.f. is symmetric
about its mean. The sample median (no. 10) is also unbiased if the parent median
equals the parent mean. Similarly, nos. 6 and 7 will be unbiased for certain p.d.f.’s.
Variance Now suppose we want to estimate the variance of the parent p.d.f.
Assume that we know the true mean, µ. Usually this is not the case, but could be,
e.g., if we know that the p.d.f. is symmetric about some value. Then following our
above experience with the sample mean, we might expect the sample variance,
    s_1^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i − µ)^2        (8.3)
to be a good estimator of the parent variance, σ 2 . (N.b., do not confuse the standard
deviation, σ, of the parent p.d.f. with the ‘error’ on µ̂.) Assume that the parent
variance, σ 2 , is finite (exists). Then
    E[s_1^2] = \frac{1}{n} E\left[ \sum (x_i − µ)^2 \right] = \frac{1}{n} E\left[ \sum (x_i^2 − 2 x_i µ + µ^2) \right]
             = \frac{1}{n} E\left[ \sum x_i^2 − 2µ \sum x_i + nµ^2 \right] = \frac{1}{n} \left( E\left[ \sum x_i^2 \right] − 2µ\, E\left[ \sum x_i \right] + nµ^2 \right)
             = \frac{1}{n} \left( n E[x^2] − 2nµ\, E[x] + nµ^2 \right) = E[x^2] − 2µ^2 + µ^2
             = σ^2 + µ^2 − 2µ^2 + µ^2 ,   since σ^2 = E[x^2] − µ^2
             = σ^2
Thus \hat{σ^2} = s_1^2 is an unbiased estimator of the variance of the parent p.d.f., σ^2, if µ
is known.
But usually µ is not known. We therefore try using our estimate of µ, µ̂ = x̄,
instead of µ:
    s_x^2 = \frac{1}{n} \sum (x_i − x̄)^2 = \frac{1}{n} \sum x_i^2 − x̄^2 = \overline{x^2} − x̄^2        (8.4)
This has the expectation,
"P P #
h i xi 2
x2i
E s2x =E −
nn
hX i
1 2 1 X 2
= E xi − E xi (8.5)
n n
P
The xi are independent. Hence E [ x2i ] = nE [x2 ]. Also,
h i
σ 2 = E x2 − µ2
hX i h X i hX i2
and V xi = E ( xi )2 − E xi
we find
h i
1 1 2
E s2x = nσ 2 + nµ2 − nσ + (nµ)2
n n
1 2
= (n − 1) σ (8.6)
n
Thus s_x^2 is a biased (though asymptotically unbiased) estimator of σ^2. An unbiased estimator is obtained by multiplying by n/(n − 1):

    s^2 = \frac{n}{n-1}\, s_x^2 = \frac{n}{n-1} \left( \overline{x^2} − x̄^2 \right) = \frac{1}{n-1} \sum (x_i − x̄)^2        (8.7)
Note that the above calculations did not depend at all on what the parent p.d.f.
was, not even on the C.L.T.
If the p.d.f. is Gaussian or if n is large enough that the C.L.T. applies, let
    z_i = \frac{x_i − x̄}{σ}
Then
    \sum z_i^2 = \frac{1}{σ^2} \sum (x_i − x̄)^2

is distributed as χ^2 (section 3.12). There is one relationship among the z_i’s:

    \sum z_i = \frac{1}{σ} \sum (x_i − x̄) = \frac{1}{σ} \left( \sum x_i − n x̄ \right) = 0

which follows from the definition of x̄. Hence, the p.d.f. for \sum z_i^2 is a χ^2 of n − 1 degrees of freedom. Recall that E[χ^2(n − 1)] = n − 1. This is another way of seeing that

    E[s^2] = E\left[ \frac{σ^2}{n-1} \sum z_i^2 \right] = \frac{σ^2}{n-1} E[χ^2] = \frac{σ^2}{n-1}\,(n − 1) = σ^2
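These statements about the bias of s_x^2 and s^2 are easy to verify by simulation: repeat the ‘experiment’ many times and average the two estimators. A sketch with a Gaussian parent p.d.f. (parameter values arbitrary):

import numpy as np

rng = np.random.default_rng(13)
mu, sigma, n, n_exp = 0.0, 2.0, 5, 200_000

x = rng.normal(mu, sigma, size=(n_exp, n))     # n_exp experiments of n measurements each

s2_x = x.var(axis=1, ddof=0)                   # (1/n)     sum (x_i - xbar)^2 : biased
s2   = x.var(axis=1, ddof=1)                   # (1/(n-1)) sum (x_i - xbar)^2 : unbiased

print("true variance   :", sigma**2)
print("average of s_x^2:", s2_x.mean(), "  expected (n-1)/n sigma^2 =", (n - 1) / n * sigma**2)
print("average of s^2  :", s2.mean())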
8.2.2 Consistency
If we take more data, we should expect a better (more accurate) estimate of the
parameters. An estimator which converges to the true value with increasing n is
termed consistent.
Definition: An estimator, θ̂, of θ is consistent if for any ǫ > 0 (no matter how small), \lim_{n\to\infty} P(|θ̂ − θ| > ǫ) = 0.
    \hat{V}[x̄] = \frac{s^2}{n} ,   \hat{V}[s^2] = \frac{2 (s^2)^2}{n-1}        (8.11)
Sometimes you do know σ 2 . We give two examples: (1) You average many mea-
surements of a quantity, e.g., the length of a table. The p.d.f. is then a convolution
of a δ-function about the true length with a resolution function for the measuring
apparatus, which is just a Gaussian centered about the true length with σ equal to
the resolution. But you have calibrated the measuring apparatus by measuring a
standard length a great many times. From this calibration you know σ 2 . So you
only need to estimate µ. (2) You are designing an experiment and you want to know
how many measurements you need to make in order to attain a given accuracy. You
then make reasonable assumptions about the p.d.f. and calculate what V will be
for the different assumptions about µ, σ 2 , and n.
To summarize, assuming that we do not know µ or σ 2 , they are estimated by
    µ̂ = x̄ ± \sqrt{V[x̄]}            and   \hat{σ^2} = s^2 ± \sqrt{V[s^2]}        (8.12a)
       = x̄ ± \sqrt{\frac{s^2}{n}}            = s^2 ± \sqrt{\frac{2}{n-1}}\, s^2        (8.12b)
Note that the ‘error’ on µ̂ has itself an error. By ‘error propagation’, which will
be covered in section 8.3.6,
    V[s^2] = \left( \frac{d s^2}{d s} \right)^2 V[s] = (2s)^2\, V[s]

Hence,

    V[s] = \frac{1}{4 s^2} V[s^2] = \frac{1}{4 s^2}\, \frac{2 (s^2)^2}{n-1} = \frac{s^2}{2(n-1)}
    and   \sqrt{V[s]} = \frac{s}{\sqrt{2(n-1)}}
Thus for n not too small, the error on the error on µ̂ is negligible.
When the p.d.f. of q̂ is non-Gaussian one must be careful. If the p.d.f. is skewed,
this can be indicated by stating asymmetric errors. But that is not foreseen in the
propagation of errors. Also, for a non-Gaussian p.d.f., P(q̂ − σ ≤ q_t ≤ q̂ + σ) is usually
not 68%. Nor is the probability of being within, e.g., 2σ the same in the non-
Gaussian case as in the Gaussian case. Nor do the errors even have to be symmetric.
The propagation of errors (cf. section 8.3.6) is usually the least trustworthy
when there is a dependence on 1/q. Going to higher orders in the expansion does
not necessarily help because the resulting error, though perhaps more accurate, still
has the same problems resulting from skewness and the probability content of ±2σ.
These questions are often conveniently investigated by Monte Carlo methods. As
previously stated, the best cure for these problems is to rewrite the p.d.f. in terms
of the parameters you want to estimate.
We shall return to these questions when discussing confidence intervals (chapter
9) and hypothesis testing (chapter 10).
Score: Notation becomes more compact by introducing the score. We define the
score of one measurement as
    S_1 ≡ \frac{\partial}{\partial θ} \ln f(x; θ)        (8.16)
Note that the score, being a function of r.v.’s, is itself a r.v. The score of the entire
sample is then defined to be the sum of the scores of each observation:
    S(x; θ) ≡ \sum_{i=1}^{n} S_1(x_i; θ)        (8.17)
Then
    S(x; θ) = \sum_{i=1}^{n} \frac{\partial}{\partial θ} \ln f(x_i; θ) = \frac{\partial}{\partial θ} \sum_{i=1}^{n} \ln f(x_i; θ) = \frac{\partial \ln L(x; θ)}{\partial θ}

Summarizing,

    S(x; θ) = \frac{\partial \ln L(x; θ)}{\partial θ} = \sum_{i=1}^{n} S_1(x_i; θ) = \frac{\partial}{\partial θ} \sum_{i=1}^{n} \ln f(x_i; θ)        (8.18)
This result combined with equation 8.15 shows that we can write the information
of the sample x on the parameter θ as the expectation of the square of the score:
    I_x(θ) = E\left[ (S(x; θ))^2 \right]        (8.19)
1. Ωθ is independent of θ, and
2. L(x; θ) is regular enough that we can interchange the order of \partial^2/\partial θ_i \partial θ_j and \int dx.
If condition (1) holds, condition (2) will also generally hold for distributions en-
countered in physics. Now,
" # Z " #
∂ ∂
E [S1 (x; θ)] = E ln f (x; θ) = ln f (x; θ) f (x; θ) dx
∂θ ∂θ
Z " #
1 ∂
= f (x; θ) f (x; θ) dx
f (x; θ) ∂θ
Z
∂
= f (x; θ) dx
∂θ
Using the fact that the variance of a quantity is given by V[a] = E[a^2] − (E[a])^2, we see from equations 8.19 and 8.21 that the information is just the variance of the score, I_x(θ) = V[S(x; θ)]. These results (equations 8.21 and 8.23) are very useful, but do not forget the assumptions on which they depend.
Does I satisfy the requirements? We can now show that the information
increases with the number of independent observations. For n observations,
    I(θ) = E\left[ \left( \sum_{i=1}^{n} S_1(x_i; θ) \right)^2 \right]
         = V\left[ \sum_{i=1}^{n} S_1(x_i; θ) \right] + \left\{ E\left[ \sum_{i=1}^{n} S_1(x_i; θ) \right] \right\}^2
where we have used the fact that V [a] = E [a2 ] − (E [a])2 . The second term is
zero under the assumptions that Ωθ is independent of θ and that the order of
differentiation and integration can be interchanged as in the previous paragraph
(eq. 8.21). However, let us now relax these assumptions.
Since the xi are independent, the variance of the sum is just the sum of the
variances. And since all the xi are sampled from the same p.d.f., the variance is the
same for all i. A similar argument applies to the second term. Hence,

    I(θ) = n\, V[S_1(x; θ)] + n^2 \left( E[S_1(x; θ)] \right)^2
Following the same steps for n = 1 gives the same expression with n = 1. Hence,
the information increases with the number of observations, our first requirement for
information.
If the assumptions of the previous paragraph apply, the second term in the above
equation is zero by equation 8.20. Then,
The last step follows because θ̂ is a statistic and therefore does not depend on θ.
Assuming that we can interchange the order of differentiation and integration, we
find
    E[θ̂\, S(x; θ)] = \frac{\partial}{\partial θ} \int \cdots \int θ̂ \prod_{i=1}^{n} [f(x_i; θ)\, dx_i] = \frac{\partial}{\partial θ} E[θ̂] = \frac{\partial}{\partial θ} \left( θ + b_n(θ̂) \right) = 1 + \frac{\partial}{\partial θ} b_n(θ̂)
Both θ̂ and S(x; θ) are r.v.’s. Their covariance is
    \mathrm{cov}[S(x; θ), θ̂(x)] = E[S(x; θ)\, θ̂(x)] − \underbrace{E[S(x; θ)]}_{=0,\ \mathrm{eq.}\ 8.21} E[θ̂(x)] = 1 + \frac{\partial}{\partial θ} b_n(θ̂)
Therefore, their correlation coefficient is
    ρ^2 = \frac{\{\mathrm{cov}[S, θ̂]\}^2}{V[S]\, V[θ̂]} = \frac{\left[ 1 + \frac{\partial}{\partial θ} b_n(θ̂) \right]^2}{I(θ)\, V[θ̂]}

Since ρ^2 ≤ 1, we have

    σ^2(θ̂) = V[θ̂] ≥ \frac{\left[ 1 + \frac{\partial}{\partial θ} b_n(θ̂) \right]^2}{I(θ)}        (8.26)
Thus, there is a lower bound on the variance of the estimator. For a given set
of data and hence a given amount of information, I(θ), on θ, we can never find an
estimator with a lower variance.
The more information we have, the lower this bound is, in accordance with our
third requirement for information.
If the estimator is a constant, θ̂ = c, then the bias is b = c − θ and the minimum
variance is 0, which is not a very interesting bound since the variance of a constant
is always 0.
The inequality (8.26) is usually known as the Rao-Cramér inequality or the
Frechet inequality. It was discovered independently by a number of people including
Rao,35 Cramér,15 and Frechet. The first were Aitken and Silverstone.36 Although we
have assumed that the range of X is independent of θ and that we could interchange
the order of differentiation and integration, the result (8.26) can be obtained with
somewhat more general assumptions.11, 13
In general, we prefer unbiased estimators. In that case the inequality reduces to
σ^2(θ̂) ≥ 1/I(θ). This is also the case if the bias of the estimator does not depend on the true value of θ. For more than one parameter this result generalizes to

    σ^2(θ̂_i) ≥ \left[ I^{-1}(θ) \right]_{ii}        (8.27)
Examples:
Gaussian with known mean. We have seen (section 8.2.1) that \hat{σ^2} = \sum (x_i − µ)^2 / n is an unbiased estimator of the variance of a Gaussian of known mean. It is easy to show (exercise 32) that it is also an efficient estimator.
which is just the minimum variance found above. Thus the sample mean is an
efficient estimator of the mean of an exponential p.d.f.
Note that the score is
    S(x; µ) = \sum_{i=1}^{n} S_1(x_i; µ) = −\frac{n}{µ} + \frac{\sum x_i}{µ^2} = −I(µ)\, (µ − µ̂)
Thus the score is a linear function of the estimator. This is not a coincidence, but
a general feature of unbiased efficient estimators, as we show in the next section.
    \frac{\partial}{\partial θ} \ln f(x; θ) ≡ S = A'(θ)\, θ̂(x) + B'(θ)        (8.30)
where the integration constant K may depend on x but not on θ. Then, where the
required normalization is included in B and/or K,
    f(x; θ) = \exp\left[ A(θ)\, θ̂(x) + B(θ) + K(x) \right]        (8.32)
Any p.d.f. of the above form is said to belong to the exponential family. What
we have shown is that an efficient estimator can be found if and only if the p.d.f. is
of the exponential family where the estimator enters the exponent in the way shown
in equation 8.32.
Note that the efficient estimator is not necessarily unique since the product A · θ̂
can often be factored in more than one way. The estimator θ̂ will be an unbiased
estimator for some quantity, although not necessarily for the quantity we want to
estimate. It may also not be an estimator which we will be able to use. Let us now
calculate the expectation of θ̂ and see for what quantity it is an unbiased estimator:
From equation 8.30,
    θ̂ = \frac{S(x; θ)}{A'(θ)} − \frac{B'(θ)}{A'(θ)}

Since A' and B' do not depend on x, the expectation is then

    E[θ̂] = \frac{1}{A'(θ)} E[S(x; θ)] − \frac{B'(θ)}{A'(θ)}

and, since E[S(x; θ)] = 0,

    E[θ̂] = −\frac{B'(θ)}{A'(θ)} = −\frac{\partial B(θ)/\partial θ}{\partial A(θ)/\partial θ}        (8.33)
This is the quantity for which the θ̂ in equation 8.32 is an unbiased, efficient esti-
mator.
Examples:
Gaussian. As an example we take the normal p.d.f., N(x; µ, σ^2), which has two parameters, θ = (µ, σ^2). We write N(x; µ, σ^2) in an exponential form:
" #
2 1 1 (x − µ)2
N(x; µ, σ ) = √ √ exp −
2π σ 2 2 σ2
" !#
µ 1 2 1 µ2
= exp 2 x − 2 x − + ln(2πσ 2 )
σ 2σ 2 σ2
For n independent observations the p.d.f. becomes
n
" !#
Y
2 nµ n n µ2
N(xi ; µ, σ ) = exp 2 x̄ − 2 x2 − + ln(2πσ 2 )
i=1 σ 2σ 2 σ2
from which we see that we can choose (in equation 8.34)
A1 (θ) = nµ
σ2
θ̂1 (x) = x̄
n
A2 (θ) = − 2σ2 θ̂2 (x) = x2
2
B(θ) = − n2 µσ2 + ln(2πσ 2 )
K(x) = 0
Then (from equation 8.35)
∂A1
∂µ
= σn2 Thus θ̂1 = x̄ is an efficient and unbiased estimator of
∂A2
=0
∂µ −nµ/σ 2
∂B
∂µ
= −n σµ2 − =µ
n/σ 2
∂A1
∂σ2
= nµ
σ4
Thus θ̂2 = x2 is an efficient and unbiased estimator of
∂A2
∂σ2
= 2σn4 µ2 + σ 2 . Hence, x2 − µ2 = (x − µ)2 is an efficient and
∂B
= nµ
2
− n unbiased estimator of σ 2 . However, this is of use to us
∂σ2 2σ4 2σ2
only if we know µ.
Which estimator is the best? Returning to the list of 10 estimators for the
mean at the start of the section, we can ask which of the 10 is the best. Unfor-
tunately, there is no unique answer. In general we prefer unbiased, consistent and
efficient estimators. We can clearly reject nos. 2, 3, 4, 5 and 8. Nor is no. 6, the
sample mode, a good choice, even when the parent mode equals the parent mean,
since it uses so little of the information. However, which of the others is ‘best’
depends on the parent p.d.f.
The sample mean is efficient for a normal p.d.f. However, for a uniform p.d.f.
(f(x; a, b) = \frac{1}{b-a}) where the limits (a, b) are unknown, estimator no. 7, \frac{1}{2}(x_{min} + x_{max}), has a smaller variance than x̄.
No. 10, the sample median, has a larger variance than the sample mean for
a Gaussian p.d.f., but for a ‘large-tailed Gaussian’ it can be smaller. No. 9, the
trimmed sample mean, throws away information but may still be best, in particular
if we think that points in the tails are largely due to mismeasurement.
where g(t; θ) is the marginal p.d.f. of t and h is the conditional p.d.f. Now if
h is independent of θ, then clearly the t1 , t2 , . . . , tm−1 contribute nothing to our
knowledge of θ. If this is true for any set of ti and any m < n then t clearly
contains all the information on θ. We therefore define a sufficient statistic t as: t is
a sufficient statistic for θ if for any choice of t1 , t2 , . . . , tm−1 (which are independent
of t),
f (t, t1 , t2 , . . . , tm−1 ; θ) = g(t; θ) h(t1, t2 , . . . , tm−1 |t) (8.37)
Now, what does this mean in terms of the likelihood? The likelihood function
is the p.d.f. for x and is thus related to the f of equation 8.37 by a coordinate
transformation. Starting from equation 8.37, let ti = xi for i = 1, 2, . . . , n − 1.
Then
f (t, x1 , x2 , . . . , xn−1 ; θ) = g(t; θ) h(x1, x2 , . . . , xn−1 |t)
The p.d.f. in terms of x is then
    L(x; θ) = g(t; θ)\, h(x_1, x_2, \ldots, x_{n-1} | t)\, J\!\left( \frac{x_1, \ldots, x_n}{x_1, \ldots, x_{n-1}, t} \right)

which is, since the Jacobian does not involve θ, of the form

    L(x; θ) = g(t; θ)\, k(x)        (8.38)
Conversely, suppose the likelihood can be written in this form. Consider the transformation

    t = t(x_1, \ldots, x_n)
    t_i = t_i(x_1, \ldots, x_n) ,   i < m
    t_i = x_i ,   i = m, \ldots, n − 1

The joint p.d.f. of t, t_1, \ldots, t_{n-1} is then g(t; θ)\, k(x)\, J,
which we integrate over dtm . . . dtn−1 to obtain the p.d.f. f (t, t1 , . . . , tm−1 ). Neither
k nor J depend on θ. However, the integration limits for tm , . . . , tn−1 (xm , . . . , xn−1 )
could depend on θ. If not, it is clear that we obtain the form of equation 8.37. It
turns out11, 13 that this is also true even when the integration limits do depend on
θ.
Thus equations 8.37 and 8.38 are equivalent. If we can find a statistic t such that
the likelihood function can be written in the form of equation 8.38, t is a sufficient
statistic for θ.
The sufficient statistics for θ having the smallest dimension are called minimal
sufficient statistics for θ. One usually prefers a minimal sufficient statistic since
that gives the greatest data reduction.
We have seen that if we can write the p.d.f. in the exponential form of equa-
tion 8.34,

    f(x; θ) = \exp\left[ A(θ) \cdot θ̂(x) + B(θ) + K(x) \right]

then θ̂ is an efficient estimator. Such a p.d.f. clearly factorizes like equation 8.38 with

    g(θ̂; θ) = \exp\left[ A(θ) \cdot θ̂(x) + B(θ) \right] ,   k(x) = \exp[K(x)]
Thus, if the range of x does not depend on θ, θ̂(x) is not only an efficient estimator
of θ, but also a sufficient statistic for θ. If the range of x depends on θ, the
situation is more complicated. The reader is referred to Kendall and Stuart11, 13 for
the conditions of sufficiency.
Such estimators are also known as plug-in estimators, since the data are simply
“plugged into” the parameter definition.
For example, if the underlying p.d.f. is a binomial, B(x; n, p) = \binom{n}{x} p^x (1 − p)^{n−x},
we would estimate p by p̂ = x/n. This is unbiased since E [x] = np. It is also
efficient since B is a member of the exponential family of p.d.f.’s, as we saw in
section 8.2.7. And we would estimate a function of p, g(p), by g(p̂) = g(x/n). This
method works well for large samples where the C.L.T. assures us that the difference
between E [x] and np is a small fraction of np.
Advantages of this method are simplicity and the fact that the estimator is
usually consistent. Disadvantages are that the estimator may be biased and that it
may not have minimum variance. However, if it is biased, we may be able to reduce
the bias, or at least estimate its size by a series expansion:
Suppose that θ̂ is an unbiased estimator of θ. We wish to estimate some function
of θ, g(θ). Following the above prescription, we use ĝ = g(θ̂). Then, expanding ĝ
about the true value of θ, θt , assuming that the necessary derivatives exist,
    ĝ = g(θ̂) = g(θ_t) + (θ̂ − θ_t) \left. \frac{\partial g(θ)}{\partial θ} \right|_{θ=θ_t} + \frac{1}{2} (θ̂ − θ_t)^2 \left. \frac{\partial^2 g(θ)}{\partial θ^2} \right|_{θ=θ_t} + \ldots
Now we take the expectation. Since θ̂ is assumed unbiased, this gives simply,
    E[ĝ] = g(θ_t) + \frac{1}{2} E\left[ (θ̂ − θ_t)^2 \right] \left. \frac{\partial^2 g(θ)}{\partial θ^2} \right|_{θ=θ_t} + \ldots
Not knowing the true value θ_t, we can not calculate E[(θ̂ − θ_t)^2]. But we can estimate it by V[θ̂]. In the same spirit, we evaluate the derivative at θ = θ̂ instead of at θ = θ_t. Thus, to lowest order, there is a bias of approximately

    \frac{1}{2} V[θ̂] \left. \frac{\partial^2 g(θ)}{\partial θ^2} \right|_{θ=θ̂}
where m_j = E[x^j]. This can, of course, only be done if all the necessary moments
exist. We then estimate q(θ) by replacing all the parent (population) moments, mj ,
in g by the corresponding sample (experimental) moments. Thus,
    q̂ = g(\hat{m}_1, \hat{m}_2, \ldots) ,   \hat{m}_j = \overline{x^j} = \frac{1}{n} \sum_i x_i^j        (8.40)

In this notation m_1 = µ, the parent mean, and \hat{m}_1 = x̄, the sample mean.
For example, to estimate the parent variance, V [x], we write the variance in
terms of the moments: V[x] = σ^2 = m_2 − m_1^2. We then estimate the moments by the corresponding sample moments:

    \hat{σ^2} = \hat{m}_2 − \hat{m}_1^2 = \frac{1}{n} \sum x_i^2 − x̄^2 = \frac{1}{n} \sum (x_i − x̄)^2
As we have previously seen (equation 8.6), this estimator, which we have called s_x^2
(equation 8.4), is biased. Thus the method of moments does not necessarily give
unbiased estimators.
As a second example, take the Poisson p.d.f. For this p.d.f., the population mean
and the population variance are equal, µ = V [x]. Therefore, we could estimate the
mean and the variance either by

    θ̂ = \hat{m}_1 = x̄

or by

    θ̂ = \hat{m}_2 − \hat{m}_1^2 = \frac{1}{n} \sum (x_i − x̄)^2
Thus the method of moments does not necessarily provide a unique estimator.
By the C.L.T. the average tends to its expectation under the assumption that
the variance is finite. Moments estimators, being averages, are therefore consistent.
A word of caution is in order: If it is necessary to use higher order moments, you
should be cautious. They are very sensitive to the tails of the distribution, which
is the part of the distribution which is usually the most affected by experimental
difficulties.
Thus we have a number of equations for E [uj ] in terms of θ. We solve them for the
θ in terms of the E [uj ] and substitute the sample moments, ūj , for the expectations
to obtain our estimate of θ. We will always need at least as many equations, and
hence at least as many functions uj , as there are parameters to be estimated.
We take as an example the angular distribution of the decay of a vector meson
into two pseudo-scalar mesons. The angles θ and φ of the decay products in the
rest system of the vector meson are distributed as
    f(\cos θ, φ) = \frac{3}{4π} \left[ \frac{1}{2} (1 − ρ_{00}) + \frac{1}{2} (3ρ_{00} − 1) \cos^2 θ − ρ_{1,-1} \sin^2 θ \cos 2φ − \sqrt{2}\, \mathrm{Re}\,ρ_{10} \sin 2θ \cos φ \right]
where the ρ’s are parameters to be estimated. The data consist of measurements
of the angles, θi and φi , for n decays. From inspection of the above expression for
f, we choose three functions to estimate the three parameters. The choice is not unique; one convenient choice is:
    function                       expectation
    u_1 = \cos^2 θ                 E[u_1] = \frac{1}{5} (1 + 2ρ_{00})
    u_2 = \sin^2 θ \cos 2φ         E[u_2] = −\frac{4}{5} ρ_{1,-1}
    u_3 = \sin 2θ \cos φ           E[u_3] = −\frac{4}{5} \sqrt{2}\, \mathrm{Re}\,ρ_{10}

Replacing E[u_j] by the sample mean ū_j = \frac{1}{n} \sum u_j(\cos θ_i, φ_i) gives, e.g.,

    −\frac{4}{5} \sqrt{2}\, \mathrm{Re}\,\hat{ρ}_{10} = ū_3 = \frac{1}{n} \sum_{i=1}^{n} \sin 2θ_i \cos φ_i
Thus the estimate of the coefficient of the k th term is just the sample mean of the
(complex conjugate of the) k th function,
    â_k = \overline{u_k^*}
where we have used \frac{1}{n-1} instead of \frac{1}{n} in order to have an unbiased estimate. The
general element of the covariance matrix is estimated by
    \hat{V}_{jk}[ū] = \frac{1}{n}\, \frac{1}{n-1} \sum_{i=1}^{n} \left( u_j(x_i) − ū_j \right) \left( u_k(x_i) − ū_k \right)        (8.45)
                    = \frac{1}{n-1} \left( \overline{u_j u_k} − ū_j ū_k \right)        (8.46)
Hence,
    V[θ̂] ≡ E\left[ \left( θ̂ − E[θ̂] \right)^2 \right] = \left( \left. \frac{\partial θ}{\partial q} \right|_{q=q_t} \right)^2 E\left[ (q̂ − q_t)^2 \right] = \left( \left. \frac{\partial θ}{\partial q} \right|_{q=q_t} \right)^2 V[q̂]        (8.47)

This can be estimated by substituting q̂ for q_t and our estimate \hat{V}[q̂] for V[q̂]:

    \hat{V}[θ̂] = \left( \left. \frac{\partial θ}{\partial q} \right|_{q=q̂} \right)^2 \hat{V}[q̂]        (8.48)
This technique works well only when second and higher order terms are small and
when q̂ is unbiased.
We give a simple example, a function linear in q. The result is then, in fact, exact since the second and higher order derivatives are zero:

    θ(q) = A + Bq ,   \frac{\partial θ}{\partial q} = B ,   V[θ̂] = B^2\, V[q̂]        (8.49)
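Beyond the linear case, equation 8.47 is only a first-order approximation; a quick numerical check of how well it does, here for q̂ normally distributed and the non-linear function θ = 1/q (parameter values arbitrary, with q well away from zero):

import numpy as np

rng = np.random.default_rng(15)
q_true, sigma_q = 5.0, 0.2                       # arbitrary; q well away from 0

q_hat = rng.normal(q_true, sigma_q, 1_000_000)   # many imagined experiments, each giving q_hat
theta_hat = 1.0 / q_hat                          # theta = 1/q, a non-linear function of q

propagated = (1.0 / q_true**2)**2 * sigma_q**2   # (d theta / d q)^2 V[q], equation 8.47
print("variance of theta_hat (simulation):", theta_hat.var())
print("propagated variance (first order) :", propagated)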
The general case is similar to our treatment of change of variables (section
2.2.6). Indeed, it is in principle better to transform the p.d.f. to a new p.d.f. in
terms of the parameter we want to estimate, e.g., f (x; q) → g(x; θ). In particular
it is nice if we can transform to a p.d.f. having θ as its mean (or other low order
moment), since sample moments are unbiased estimators. However, in practice
such a transformation may be difficult and it may be easier to estimate q than to
estimate θ directly.
Consider now the p.d.f.’s for the estimators q̂ and θ̂. If the transformation θ = θ(q)
is non-linear, the shape of the p.d.f. g(θ̂) is changed from that of f(q̂) by the Jacobian
(|∂q/∂θ| in one dimension), as illustrated in the figure. In regions where dθ < dq,
the probability piles up faster for θ than for q. Thus in the example the peak in
g(θ̂) occurs below θ₁ = θ(E[q̂]).
In particular, if f(q̂) is normal, g(θ̂) is not normal, except for a linear transformation.
This is a source of bias, which in the figure manifests itself as a long tail for g(θ̂)
resulting in E[θ̂] > θ₁.
[Figure: the transformation θ = θ(q) maps the p.d.f. f(q̂), shown along the q-axis, onto g(θ̂), shown along the θ-axis; θ₁ = θ(E[q̂]).]
Now let us treat the multidimensional case, where q is of dimension n and θ is
of dimension m. Note that m ≤ n; otherwise not all θi will be independent and
there will be no unique solution. An example would be a p.d.f. for (x, y) for which
we want only to estimate some parameter of the (marginal) distribution for r. In
this case, n = 2 and m = 1.
We can then expand each θ̂i about its true value in the same manner as for the
one-dimensional case, except that we now must introduce a sum over all parameters:
$$\hat{\theta}_i \equiv \theta_i(\hat{q}) = \theta_i(q_t) + \sum_{k=1}^n \left.\frac{\partial\theta_i}{\partial q_k}\right|_{q=q_t}(\hat{q}_k - q_{t\,k}) + \ldots$$
Assuming that q̂i is unbiased, its expectation is equal to the true value so that to
first order,
$$\bigl(\hat{\theta}_i - E[\hat{\theta}_i]\bigr)\bigl(\hat{\theta}_j - E[\hat{\theta}_j]\bigr) = \sum_{k=1}^n\sum_{l=1}^n \left.\frac{\partial\theta_i}{\partial q_k}\right|_{q=q_t}\left.\frac{\partial\theta_j}{\partial q_l}\right|_{q=q_t}(\hat{q}_k - q_{t\,k})(\hat{q}_l - q_{t\,l})$$
where,
$$\widehat{D}(\theta) = \begin{pmatrix}\dfrac{\partial\theta_1}{\partial q_1} & \dfrac{\partial\theta_2}{\partial q_1} & \cdots & \dfrac{\partial\theta_m}{\partial q_1}\\[6pt] \dfrac{\partial\theta_1}{\partial q_2} & \dfrac{\partial\theta_2}{\partial q_2} & \cdots & \dfrac{\partial\theta_m}{\partial q_2}\\ \vdots & \vdots & \ddots & \vdots\\ \dfrac{\partial\theta_1}{\partial q_n} & \dfrac{\partial\theta_2}{\partial q_n} & \cdots & \dfrac{\partial\theta_m}{\partial q_n}\end{pmatrix}_{q=\hat{q}} \qquad (8.53)$$
Warning: D is not symmetric.
If the xi are independent, this is just the product of the p.d.f.’s for the individual
xi :
$$L(x;\theta) = \prod_{i=1}^n f_i(x_i;\theta) \qquad (8.55)$$
where we have included a subscript i on f since it is not necessary that all the xi
have the same p.d.f.
In probability theory this p.d.f. expresses the probability that an experiment
identical to ours would result in the n observations x which we observed. In prob-
ability theory we know θ and the functions fi , and we calculate the probability of
certain results. In statistics this is turned around. We have done the experiment;
so we know a set of results, x. We (think we) know the p.d.f.’s, fi (x, θ). We want
to estimate θ.
We emphasize that L is not a p.d.f. for θ; if it were we would use the expectation
value of θ for θ̂. Instead we take eq. 8.54, replace θ by θ̂ and solve for θ̂ under the
condition that L is a maximum. In other words, our estimate, θ̂, of θ is that value of
θ which would make our experimental results the most likely of all possible results.
This is the Principle of Maximum Likelihood: The best estimate of a pa-
rameter θ is that value which maximizes the likelihood function. This can not be
proved without defining ‘best’. It can be shown that maximum likelihood (ml)
estimators have desirable properties. However, they are often biased. Whether the
ml estimator really is the ‘best’ estimator depends on the situation.
It is usually more convenient to work with
ℓ = ln L (8.56)
since the product in eq. 8.55 becomes a sum in eq. 8.56. For independent xi this is
$$\ell = \sum_{i=1}^n \ell_i, \qquad\text{where } \ell_i = \ln f_i(x_i;\theta) \qquad (8.57)$$
Since L > 0, both L and ℓ have the same extrema, which are found from
$$S_i \equiv \frac{\partial\ell}{\partial\theta_i} = \frac{1}{L}\frac{\partial L}{\partial\theta_i} = 0 \qquad (8.58)$$
Suppose that all the µi are the same, µi = µ, and that the σi are different but
known. This is the case if we make n measurements of the same quantity, each
with a different precision, e.g., using different apparatus. The maximum likelihood
condition (8.58) is then
$$\frac{\partial\ell}{\partial\mu} = \sum\frac{x_i-\mu}{\sigma_i^2} = \sum\frac{x_i}{\sigma_i^2} - \sum\frac{\mu}{\sigma_i^2} = 0$$
Therefore, having written the expectation of sums as the sum of expectations and
having split the double sum into two parts,
$$V[\hat{\mu}] = \left(\frac{1}{\sum(1/\sigma_i^2)}\right)^2\left[\sum_i\frac{\sigma_i^2+\mu^2}{\sigma_i^4} + \sum_i\sum_{j\ne i}\frac{\mu^2}{\sigma_i^2\sigma_j^2}\right] - \mu^2$$
$$\phantom{V[\hat{\mu}]} = \left(\frac{1}{\sum(1/\sigma_i^2)}\right)^2\left[\sum_i\frac{1}{\sigma_i^2} + \mu^2\sum_i\frac{1}{\sigma_i^4} + \mu^2\sum_i\sum_{j\ne i}\frac{1}{\sigma_i^2\sigma_j^2}\right] - \mu^2$$
$$\phantom{V[\hat{\mu}]} = \frac{1}{\sum(1/\sigma_i^2)} + \mu^2\underbrace{\frac{\sum_i\frac{1}{\sigma_i^4} + \sum_i\sum_{j\ne i}\frac{1}{\sigma_i^2\sigma_j^2}}{\bigl(\sum(1/\sigma_i^2)\bigr)^2}}_{=1} - \mu^2$$
$$\phantom{V[\hat{\mu}]} = \frac{1}{\sum(1/\sigma_i^2)} \qquad (8.60)$$
It is curious that in this example V [µ̂] does not depend on the xi , but only on the
σi . This is not true in general.
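A minimal sketch of this weighted average and its variance (equations 8.59 and 8.60) in Python, with hypothetical measurements and errors:

```python
import numpy as np

# hypothetical measurements of the same quantity with different known errors
x     = np.array([10.3, 9.8, 10.1, 10.6])
sigma = np.array([0.3, 0.2, 0.4, 0.5])

w = 1.0 / sigma**2
mu_hat = np.sum(w * x) / np.sum(w)   # weighted mean (eq. 8.59)
V_mu   = 1.0 / np.sum(w)             # its variance  (eq. 8.60), independent of the x_i

print(mu_hat, np.sqrt(V_mu))
```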
We have seen (section 8.2.6) that the Rao-Cramér inequality sets a lower limit
on the variance of an estimator. For an unbiased estimator the bound is 1/I, where
I is the information. For µ,
" # " #
∂S(µ) ∂2ℓ
I(µ) = −E = −E
∂µ ∂µ2
" !#
∂ X xi X µ
= −E −
∂µ σi2 σi2
" #
X 1 X 1
= −E − 2
=
σi σi2
Thus V [µ̂] = I −1 (µ); the variance of µ̂ is the smallest possible. The ml estimator
is efficient. This is in fact a general property of ml estimators: The ml estimator
is efficient if an efficient estimator exists. We will now demonstrate this.
From the maximum likelihood condition, S(x, θ̂) = 0, where θ̂ is the ml estimator
of θ. Hence the unbiased, efficient estimator T (x) is related to the ml estimator θ̂
by
$$T(x) = -\frac{D(\hat{\theta})}{C(\hat{\theta})} \qquad (8.62)$$
We have also seen in section 8.2.6, equation 8.21, that E[S(x, θ)] = 0 under
quite general conditions on f. Therefore, taking the expectation of equation 8.61,
$$E[T(x)] = -\frac{D(\theta)}{C(\theta)} \qquad (8.63)$$
This is true for any value of θ; in particular it is true for θ = θ̂, i.e., if the true value
of θ is equal to the ml estimate of θ:
$$E\bigl[T(x)\,\big|\,\hat{\theta}\bigr] = -\frac{D(\hat{\theta})}{C(\hat{\theta})} = T(x) \qquad (8.64)$$
It may seem strange to write E[T(x)|θ̂] since T(x) does not depend on the value of
θ. However, the expectation operator does depend on the value of θ. In fact, since
T(x) is an unbiased estimator of θ,
$$E[T(x)] = \int T(x)\,f(x;\theta)\,dx = \theta \qquad (8.65)$$
Hence,
$$E\bigl[T(x)\,\big|\,\hat{\theta}\bigr] = \hat{\theta}$$
Combining this with equation 8.64 gives
$$T(x) = \hat{\theta} \qquad (8.66)$$
$$D(\theta) = -\theta\,C(\theta)$$
2. Assuming that the estimator is efficient means that the Rao-Cramér inequal-
ity, equation 8.26, becomes an equality. Collecting equations 8.19, 8.23, and
8.26, results in the variance of an unbiased, efficient estimator θ̂ given by
$$V\bigl[\hat{\theta}\bigr] = \frac{1}{I(\theta)} = \frac{1}{E[S^2]} = -\frac{1}{E\left[\frac{\partial S}{\partial\theta}\right]} = -\frac{1}{E\left[\frac{\partial^2\ell}{\partial\theta^2}\right]}$$
From (8.67),
$$\frac{\partial S}{\partial\theta} = C'(\theta)\,\bigl(\hat{\theta}-\theta\bigr) - C(\theta) \qquad (8.68)$$
Since θ̂ is unbiased, E[θ̂] = θt, the true value of the parameter. Hence,
$$E\left[\frac{\partial S}{\partial\theta}\right] = -C(\theta_t) \qquad\text{and}\qquad V\bigl[\hat{\theta}\bigr] = \frac{1}{C(\theta_t)} \qquad (8.69)$$
Hence, C(θt ) > 0.
3. From equation 8.68, we also see that
$$\left.\frac{\partial^2\ell}{\partial\theta^2}\right|_{\theta=\hat{\theta}} = \left.\frac{\partial S}{\partial\theta}\right|_{\theta=\hat{\theta}} = -C(\hat{\theta})$$
Since C(θ) > 0 in the region of the true value, this confirms that the extremum
of ℓ, which we have used to determine θ̂, is in fact a maximum.
4. From equation 8.67 and the maximum likelihood condition (equation 8.58),
we see that the ml estimator is the solution of
$$0 = S(x,\theta) = C(\theta)\,\bigl(\hat{\theta}-\theta\bigr)$$
Since C(θ) > 0 in the region of the true value, this equation can have only one
solution, namely θ̂. Hence, the maximum likelihood estimator θ̂ is unique.
Let us return to the Gaussian example. But now assume not only that all µi = µ
but also all σi = σ. Unlike the previous example, we now assume that σ is unknown.
The likelihood condition gives
$$\left.\frac{\partial\ell}{\partial\mu}\right|_{\hat{\mu},\hat{\sigma}} = \sum\frac{x_i-\hat{\mu}}{\hat{\sigma}^2} = 0$$
$$\left.\frac{\partial\ell}{\partial\sigma}\right|_{\hat{\mu},\hat{\sigma}} = \sum\left(-\frac{1}{\hat{\sigma}} + \frac{(x_i-\hat{\mu})^2}{\hat{\sigma}^3}\right) = 0$$
The first equation gives
$$\hat{\mu} = \frac{1}{n}\sum x_i = \bar{x}$$
Using this in the second equation gives
$$\hat{\sigma}^2 = \frac{1}{n}\sum (x_i-\bar{x})^2$$
which, as we have previously seen (eq. 8.6), is a biased estimator of σ 2 . This
illustrates an important, though often forgotten, feature of ml estimators: They
are often biased.
To summarize this section: The ml estimator is efficient and unbiased if such
an estimator exists. Unfortunately, that is not always the case.
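The bias of σ̂² (and the unbiasedness of µ̂ = x̄) is easy to verify by simulation; a minimal Monte Carlo sketch in Python (sample size and seed chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_exp = 5, 100000                   # small samples, many repeated "experiments"
true_mu, true_sigma = 0.0, 1.0

x = rng.normal(true_mu, true_sigma, size=(n_exp, n))
mu_hat     = x.mean(axis=1)                           # ml estimate of mu for each experiment
sigma2_hat = ((x - mu_hat[:, None])**2).mean(axis=1)  # ml estimate of sigma^2

print(mu_hat.mean())       # ~ 0   : mu^ is unbiased
print(sigma2_hat.mean())   # ~ (n-1)/n = 0.8 : sigma^2-hat is biased low
```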
Since the sample mean approaches the expectation as n → ∞ provided only that
the variance is finite (C.L.T.), asymptotically
$$S(x,\theta) \approx n\,E\left[\left.\frac{\partial S_1}{\partial\theta}\right|_{\hat{\theta}}\right]\bigl(\theta-\hat{\theta}\bigr) = n\,E\left[\left.\frac{\partial^2}{\partial\theta^2}\ln f(x_i,\theta)\right|_{\hat{\theta}}\right]\bigl(\theta-\hat{\theta}\bigr)$$
$$\phantom{S(x,\theta)} = E\left[\left.\frac{\partial}{\partial\theta}\sum S_1\right|_{\hat{\theta}}\right]\bigl(\theta-\hat{\theta}\bigr) = E\left[\left.\frac{\partial^2}{\partial\theta^2}\sum\ln f(x_i,\theta)\right|_{\hat{\theta}}\right]\bigl(\theta-\hat{\theta}\bigr)$$
$$\phantom{S(x,\theta)} = E\left[\left.\frac{\partial S}{\partial\theta}\right|_{\hat{\theta}}\right]\bigl(\theta-\hat{\theta}\bigr) = E\left[\left.\frac{\partial^2\ell}{\partial\theta^2}\right|_{\hat{\theta}}\right]\bigl(\theta-\hat{\theta}\bigr)$$
$$\phantom{S(x,\theta)} = -I(\hat{\theta})\,\bigl(\theta-\hat{\theta}\bigr) \qquad (8.70)$$
the last step following from equation 8.23.
There are several consequences of equation 8.70:
• First we note that asymptotically, I(θ) = I(θ̂):
$$I(\theta) = -E\left[\frac{\partial S}{\partial\theta}\right] = E\bigl[I(\hat{\theta})\bigr] = I(\hat{\theta})$$
where the second step follows from equation 8.70 and the last step follows
since I(θ̂) is itself an expectation and the expectation of an expectation is
just the expectation itself.
• The result, equation 8.70, that θ̂ is linearly related to the score function,
implies (section 8.2.7) that θ̂ is unbiased and efficient. This is an important
asymptotic property of ml estimators.
• We can integrate
$$\frac{\partial}{\partial\theta}\ln L = S(x,\theta) \approx -I(\hat{\theta})\,(\theta-\hat{\theta})$$
over θ to find
$$\ell = \ln L \approx -\frac{I(\hat{\theta})}{2}\bigl(\hat{\theta}-\theta\bigr)^2 + \ln k \qquad (8.71)$$
where the integration constant, k, is just k = L(θ̂) = Lmax. Exponentiating,
$$L(\theta) \approx L_{\max}\exp\left[-\frac{1}{2}\,I(\hat{\theta})\bigl(\hat{\theta}-\theta\bigr)^2\right] \propto N\bigl(\theta;\hat{\theta},I^{-1}(\hat{\theta})\bigr) \qquad (8.72)$$
Instead of starting with equation 8.70, we could use equation 8.67, which ex-
presses the linear dependence of S on θ̂ for any efficient, unbiased estimator. Inte-
grating equation 8.67 leads to
$$L(\theta) = L_{\max}\exp\left[-\frac{1}{2}\,C(\theta)\bigl(\hat{\theta}-\theta\bigr)^2\right]$$
which looks formally similar to equation 8.72 but is not, in fact, a Gaussian function
since C depends on θ. Only asymptotically must C(θ) approach a constant, C(θ) →
I(θ̂). Nevertheless, C(θ) may be constant for finite n, as we have seen in the example
of using x̄ to estimate µ of a Gaussian (cf. section 8.2.7).
We emphasize again that, despite the form of equation 8.72, L is not a p.d.f.
for θ. It is an experimentally observed function. Nevertheless, the principle of
maximum likelihood tells us to take the maximum of L to determine θ̂, i.e., to
take θ̂ equal to the mode of L. In this approximation the mode of L is equal to
the mean, which is just θ̂. In other words the ml estimate is the same as what we
would find if we were to regard L as a p.d.f. for θ and use the expectation (mean)
of L to estimate θ.
Since asymptotically the ml estimator is unbiased and efficient, the Rao-Cramér
bound is attained and V[θ̂] = I⁻¹(θ). Thus the variance is also that which we would
have found treating L as a p.d.f. for θ.
We have shown that the ml estimator is, under suitable conditions, asymptot-
ically efficient and unbiased. Let us now specify these conditions (without proof)
more precisely:
2. The p.d.f.’s defined by different values of θ must be distinct, i.e., two values
of θ must not give p.d.f.’s whose ratio is not a function of θ. Otherwise there
would be no way to decide between them.
$$\hat{g}(\theta) = g(\hat{\theta})$$
This occurs because, assuming ∂θ/∂g exists,
$$\frac{\partial L}{\partial g} = \frac{\partial L}{\partial\theta}\,\frac{\partial\theta}{\partial g}$$
If ∂θ/∂g is zero at some value of θ, this can introduce additional solutions to the
likelihood condition for g. This will not usually happen if g is a single-valued
function of θ unless there are points of inflection.
[Figure: g as a function of θ, with the value of θ where ∂θ/∂g = 0 indicated.]
Note that θ̂ unbiased does not imply that ĝ = g(θ̂) is unbiased and vice versa.
Asymptotically, both θ̂ and ĝ become unbiased and efficient (previous section), but
they usually approach this at different rates.
In the case of more than one parameter, g(θ), the above generalizes to
$$\frac{\partial L}{\partial g_k} = \sum_i\frac{\partial L}{\partial\theta_i}\,\frac{\partial\theta_i}{\partial g_k} = \left(\frac{\partial L}{\partial\theta}\right)^{\!T}\frac{\partial\theta}{\partial g_k} \qquad (8.73)$$
$$P_{\mathrm{posterior}}(\theta_i\,|\,x) = \frac{P(x\,|\,\theta_i)}{P(x)}\,P_{\mathrm{prior}}(\theta_i) \qquad (8.75)$$
and it would seem reasonable to choose as our estimate θ̂ that value θi having
the largest Pposterior, i.e., the mode of the posterior probability.∗ Since Pposterior is
normalized, i.e., Σᵢ Pposterior(θi|x) = 1, we see that P(x) = Σᵢ P(x|θi) Pprior(θi) is
the constant which serves to normalize Pposterior. We also see that P(x|θi) is just
the likelihood, L(x; θi), apart from normalization.
In the absence of prior knowledge (belief) of θ, Bayes’ postulate tells us to assume
all values equally likely, i.e., Pprior(θi) = 1/k. Then the right-hand side of equation 8.75
is exactly L(x; θi) (apart from normalization) and maximizing Pposterior is the same
as maximizing L. Thus, Bayesian statistics leads to the same estimator as maximum
likelihood.
∗
The mode is not the only choice. A Bayesian could also choose the mean or the median, or
some other property of the posterior probability distribution. Asymptotically, of course, Pposterior
will be Gaussian, in which case the mode, mean, and median are the same.
In the more usual case of a continuous parameter, equation 8.75 must be rewrit-
ten in terms of probability densities:
$$f_{\mathrm{posterior}}(\theta\,|\,x) = \frac{f(x\,|\,\theta)}{\int f(x\,|\,\theta)\,f_{\mathrm{prior}}(\theta)\,d\theta}\,f_{\mathrm{prior}}(\theta) \qquad (8.76)$$
Assuming Bayes’ postulate, fprior = constant, and again Bayesian statistics is equiv-
alent to maximum likelihood.
But now what happens if we want to estimate the parameter g = g(θ) rather
than θ? Assume that the transformation g(θ) is one-to-one. Then in the discrete
case we just replace θi by gi = g(θi ) in equation 8.75. Bayes’ postulate again tells
us that Pprior = k1 and the same maximum is found resulting in ĝ = g(θ̂). However
in the continuous case, the change of parameter (cf. sections 2.2.6, 8.4.3) involves
a Jacobian, since in Bayesian statistics f is a p.d.f. for θ, or in other words, the ml
parameter is regarded as the variable of the p.d.f. Hence,
∗
There are arguments for the choice of non-uniform priors (see, e.g., Jeffreys37 ) in certain
circumstances. However, they are not completely convincing and remain controversial.
Performing the square and using the fact that the expectation of a sum is the
sum of the expectations, we get
$$V^{-1}\bigl[\hat{\theta}\bigr] = \sum_{i=1}^n E\bigl[S_1^2(x_i;\theta)\bigr] + \sum_{i=1}^n\sum_{\substack{j=1\\ j\ne i}}^n E\bigl[S_1(x_i;\theta)\,S_1(x_j;\theta)\bigr]$$
However, the cross terms are zero, which follows from the fact that for indepen-
dent xi the expectation of the product equals the product of the expectations
and from E [S1 (x; θ)] = 0 (equation 8.20). Therefore, generalizing to more
than one parameter,
" #
h i n
X
∂ ln fi (xi ; θ) ∂ ln fi (xi ; θ)
Vjk−1 θ̂ = E (8.78)
i=1 ∂θj ∂θk
Rather than calculating the expectation and evaluating it at θ̂, we can estimate
the expectation value by the sample mean evaluated at θ̂:
$$\widehat{V^{-1}_{jk}}\bigl[\hat{\theta}\bigr] = \sum_{i=1}^n\left.\frac{\partial\ln f(x_i;\theta)}{\partial\theta_j}\right|_{\hat{\theta}}\left.\frac{\partial\ln f(x_i;\theta)}{\partial\theta_k}\right|_{\hat{\theta}} \qquad (8.80)$$
where the second step follows from the linear dependence of S on θ̂ (equa-
tion 8.30) for an unbiased, efficient estimator. The variance is then estimated
by evaluating the derivative at θ = θ̂:
$$\widehat{V^{-1}}\bigl[\hat{\theta}\bigr] = -\left.\frac{\partial^2\ell}{\partial\theta^2}\right|_{\hat{\theta}} \qquad (8.82)$$
which is the Hessian matrix∗ of −ℓ. For n independent events, all distributed
as f (x; θ), the expectations in equations 8.81 and 8.83 can be estimated by a
sample mean evaluated at θ̂. Thus
$$\widehat{V^{-1}_{jk}}\bigl[\hat{\theta}\bigr] = -\sum_{i=1}^n\left.\frac{\partial^2\ln f(x_i;\theta)}{\partial\theta_j\,\partial\theta_k}\right|_{\hat{\theta}} = -n\,\overline{\left.\frac{\partial^2\ln f(x;\theta)}{\partial\theta_j\,\partial\theta_k}\right|_{\hat{\theta}}} \qquad (8.85)$$
The expectation forms (8.77, 8.78, 8.79 and 8.84) are useful for estimating the error
we expect before doing the experiment, e.g., to decide how many events we need to
have in order to achieve a certain precision under various assumptions for θ. Both
the expectation and the sample mean forms (8.80 and 8.85) may be used after the
experiment has been done. It is difficult to give general guidelines on which method
is most reliable.
∗
Mathematically it is conditions on the first derivative vector, ∂ℓ/∂ θ̂, and on the Hessian matrix
that define the maximum of ℓ or the minimum of −ℓ. The Hessian matrix is positive (negative)
definite at a minimum (maximum) of the function and indefinite at a saddle point.
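In practice these estimates are computed directly from the data at θ̂. A minimal sketch (Python/NumPy) for the one-parameter exponential p.d.f. f(x; τ) = e^{−x/τ}/τ, chosen here purely as an illustration, evaluates both the score form (8.80) and the second-derivative form (8.85):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=2.0, size=200)   # illustrative data, true tau = 2

tau_hat = x.mean()                         # ml estimate for the exponential p.d.f.

# score and second derivative of ln f(x; tau) = -ln(tau) - x/tau, evaluated at tau_hat
s  = -1.0/tau_hat + x/tau_hat**2
d2 =  1.0/tau_hat**2 - 2.0*x/tau_hat**3

Vinv_score   = np.sum(s*s)     # eq. 8.80: sum of products of first derivatives
Vinv_hessian = -np.sum(d2)     # eq. 8.85: minus the second derivative of ell

print(1.0/Vinv_score, 1.0/Vinv_hessian, tau_hat**2/len(x))   # all ~ tau^2/n
```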
2. Since ∂ℓ/∂µ = Σⁿᵢ₌₁ (xᵢ − µ)/σᵢ², equation 8.84 yields
$$V^{-1}[\hat{\mu}] = -\frac{\partial^2\ell}{\partial\mu^2} = \sum_{i=1}^n\frac{1}{\sigma_i^2}$$
Thus both methods give V[µ̂] = 1/Σ(1/σᵢ²). This is the same result we found in
section 8.4.1, equation 8.60, where we calculated the variance explicitly from the
definition. This was, of course, to be expected since in this example µ̂ is unbiased
and efficient and the range of x is independent of µ.
and f (x|θ) is just the likelihood function L(x; θ). If we are willing to accept Bayes’
postulate (for which there is no mathematical justification) and take the prior p.d.f.
for θ, fprior (θ), as uniform in θ (within possible physical limits), we have
$$f_{\mathrm{posterior}}(\theta|x) = \frac{L(x;\theta)}{\int L(x;\theta)\,d\theta} \qquad (8.86)$$
where the explicit normalization in the denominator is needed to normalize fposterior,
since L is normalized by ∫L dx = 1. Since, in Bayesian inference, L is regarded as
a p.d.f. for θ, the covariance matrix of θ̂,
$$V_{jk}\bigl[\hat{\theta}\bigr] = E\Bigl[\bigl(\hat{\theta}_j-\theta_j\bigr)\bigl(\hat{\theta}_k-\theta_k\bigr)\Bigr] \qquad (8.87)$$
is given by
$$V_{jk}\bigl[\hat{\theta}\bigr] = \frac{\int\bigl(\hat{\theta}_j-\theta_j\bigr)\bigl(\hat{\theta}_k-\theta_k\bigr)\,L\,d\theta}{\int L\,d\theta} \qquad (8.88)$$
If the integrals in equation 8.88 can not be easily performed analytically, we could
use Monte Carlo integration. Alternatively, we can estimate the expectation (8.87)
from the data. This is similar to Monte Carlo integration, but instead of Monte
Carlo points θ we use the data themselves. Assuming n independent observations
xi , we estimate each parameter for each observation separately, keeping all other
parameters fixed at θ̂. Thus, θ̂j(i) is the value of θ̂j that would be obtained from
using only the ith event. In other words, θ̂j(i) is the solution of
$$\left.\frac{\partial f_i(x_i;\theta)}{\partial\theta_j}\right|_{\theta_k=\hat{\theta}_k,\;k\ne j} = 0$$
With L regarded as a p.d.f. for θ, the θ̂j(i) are r.v.’s distributed according to L.
Their variance about θ thus estimates the variance of θ̂. However, not knowing θ
we must use our estimate of it. This leads to the following estimate of the covariance,
where in equation 8.87 the expectation has been replaced by an average over the
observations, θ̂ by the estimate from one observation θ̂j(i) , and θj by our estimate
θ̂j :
$$\widehat{V}_{jk}\bigl[\hat{\theta}\bigr] = \frac{1}{n}\sum_{i=1}^n\bigl(\hat{\theta}_{j(i)}-\hat{\theta}_j\bigr)\bigl(\hat{\theta}_{k(i)}-\hat{\theta}_k\bigr) \qquad (8.89)$$
Equation 8.88 is particularly easy to evaluate when L is a Gaussian. We have
seen that asymptotically L is a Gaussian function of θ (equation 8.72) and hence
that ℓ is parabolic (equation 8.71):
$$L = L_{\max}\,e^{-\frac{1}{2}Q^2},\qquad Q^2 = \frac{(\hat{\theta}-\theta)^2}{\sigma^2},\qquad \ell = \ln L = \ln L_{\max} - \frac{1}{2}Q^2 \qquad (8.90)$$
Then, using the Bayesian interpretation, it follows from equation 8.88 that V[θ̂] =
σ² = I⁻¹(θ̂).
However, in the asymptotic limit it is not necessary to invoke the Bayesian
interpretation to obtain this result, since we already know from the asymptotic
efficiency of the ml estimator that V[θ̂] = I⁻¹(θ) = I⁻¹(θ̂).
A graphical method
In any case, if L is Gaussian, the values of θ for which Q² = (θ̂ − θ)²/σ² = 1, i.e., the
values of θ corresponding to 1 standard deviation “errors”, θ̂ − θ = ±σ, are just those
values, θ₁, for which ℓ differs from ℓmax by 1/2. This provides another way to estimate
the uncertainty, δθ̂ = √V[θ̂], on θ̂: find the value of θ, θ₁, for which
$$\ell_1 = \ell(\theta_1) = \ell_{\max} - \frac{1}{2}$$
[Figure: ℓ(θ) vs. θ, with ℓmax at θ̂ and the values θ₁, θ₂ at which ℓ has decreased to ℓ₁ = ℓmax − 1/2 and ℓ₂ = ℓmax − 2.]
The error is then δθ̂ = |θ̂ − θ₁|. This could be done graphically from a plot of ℓ vs. θ.
Similarly, two-standard-deviation errors (Q² = 4) could be found using ℓ₂ = ℓmax − 2,
etc. (The change in ℓ corresponding to Q standard deviations is Q²/2.)
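Numerically, this amounts to scanning ℓ(θ) for the points where it has fallen by 1/2 from its maximum. A minimal sketch (Python/SciPy; again an illustrative exponential sample, with τ as the parameter):

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(5)
x = rng.exponential(scale=2.0, size=50)      # illustrative data

def ell(tau):                                # log-likelihood for f(x;tau) = exp(-x/tau)/tau
    return np.sum(-np.log(tau) - x/tau)

tau_hat = x.mean()                           # analytic ml estimate
ell_max = ell(tau_hat)

# theta_1 below and above tau_hat where ell has dropped by 1/2 (one-standard-deviation errors)
lo = brentq(lambda t: ell(t) - (ell_max - 0.5), 0.1*tau_hat, tau_hat)
hi = brentq(lambda t: ell(t) - (ell_max - 0.5), tau_hat, 10.0*tau_hat)

print(tau_hat, tau_hat - lo, hi - tau_hat)   # asymmetric errors (minus, plus)
```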
But, what do we do if L is not Gaussian? We can be Bayesian and use equa-
tion 8.87 or 8.88. Not wanting to be Bayesian, we can use the following approach.
The two approaches will in general give different estimates of the variance, the
difference being smallest when L is nearly of a Gaussian form.
Recall that for efficient, unbiased estimators L can be Gaussian even for finite n.
Imagine a one-to-one transformation g(θ) from the parameter θ to a new parameter
g and suppose that ĝ is efficient and unbiased and hence that L(g) is normal. Such
a g may not exist, but for now we assume that it does. We have seen that ĝ = g(θ̂).
Let h be the inverse transformation, i.e., θ = h[g(θ)]. Since, by assumption, L(g)
is Gaussian, δg is given by a change of 1/2 in ℓ(g).
But, as we have seen in section 8.4.3, L(θ|x) = L(g(θ)|x) for all θ; there is no
Jacobian involved in going from L(θ) to L(g). This means that since we can find
δg from a change of 1/2 in ℓ(g), δθ will be given by the same change.
[Figure: ℓ(g) vs. g and ℓ(θ) vs. θ, showing ℓmax at ĝ and θ̂ and the values g₁, g₂ and θ₁, θ₂ at which ℓ has decreased to ℓ₁ and ℓ₂.]
L(θ) need not be a symmetric function of θ, in which case the errors on θ̂ are
asymmetric.
Note that we do not actually need to use the parameter g. We can find δθ
directly.
A problem is that such a g may not exist. Asymptotically both L(g) and L(θ) are
Gaussian. However, in general, L(g) and L(θ) will approach normality at different
rates. It is therefore plausible that there exists some g which is at least nearly
normally distributed for finite n. Since we never actually have to use g, we can only
adopt it as an assumption, realizing that the further away L for the ‘best’ g is from
normality, the less accurate will be our estimation of δθ.
This method of error estimation is easily extended to the case of more than
one parameter. If all estimators are efficient, L will be a multivariate normal. We
show the example of two parameters, θ1 and θ2 . The condition of a change of 1/2
in ℓ, i.e., Q2 = 1, gives an ellipse of constant L in θ2 vs. θ1 . A distinction must be
made, however, between the ‘error’ and the ‘reduced’ or ‘conditional error’, which
is the error if the values of the other parameters are all assumed to be equal to their
estimated values.
If, for example, θ₂ is held fixed at θ̂₂ and ℓ varied by 1/2, the conditional error, σ₁ᶜ,
is found rather than the error σ₁, which is the error that enters the multivariate
normal distribution. In practice, the maximum of ℓ, as well as the variation of ℓ by
1/2, are usually found on a computer using a search technique. However, since it is
easier (faster), the program may compute σᶜ rather than σ. If the parameters are
uncorrelated, σᶜ = σ. If parameters are correlated, the correlation should be stated
along with the errors, or in other words, the complete covariance matrix should be
stated, e.g., as σ₁, σ₂, and ρ, the correlation coefficient.
[Figure: the Q² = 1 contour (an ellipse) in the (θ₁, θ₂) plane centred on θ̂, showing the errors σ₁, σ₂ and the conditional errors σ₁ᶜ, σ₂ᶜ.]
8.4.6 Summary
• If the sample is large, maximum likelihood gives a unique, unbiased, minimum
variance estimate under certain general conditions. However ‘large’ is not well
defined. For finite samples the ml estimate may not be unique, unbiased, or
of minimum variance. In this case other estimators may be preferable.
• Maximum likelihood estimators are sufficient, i.e., they use all the information
about the parameter that is contained in the data. In particular, for small
samples ml estimators can be much superior to methods which rely on binned
data.
• Maximum likelihood estimators are not necessarily robust. If you use the
wrong p.d.f., the ml estimate may be worse than that from some other
method.
• The maximum likelihood method gives no way of testing the validity of the
underlying theory, i.e., whether or not the assumed p.d.f. is the correct one.
In practice this is not so bad: You can always follow the maximum likelihood
estimation by a goodness-of-fit test. Such tests will be discussed in section
10.6.
Just as the grand canonical ensemble can be used even for situations where the
number of molecules is in fact constant (non-permeable walls), so also the extended
maximum likelihood method. In particular, if there is no functional relationship
between ν and θ, the likelihood condition ∂ℓE /∂ν = 0 will lead to ν̂ = n. Also,
∂ℓE /∂θj = ∂ℓ/∂θj , which leads to identical estimators θ̂j as in the ordinary max-
imum likelihood method. Nevertheless, we may still prefer to use the extended
maximum likelihood method. It can happen that the p.d.f., f , is very difficult to
normalize, e.g., involving a lengthy numerical integration. Then, even though the
number of events is fixed, we can use the extended maximum likelihood method,
allowing the maximum likelihood principle to find the normalization. In this case,
the resulting estimate of ν should turn out to be the actual number of events n times
the normalization of f and the estimate of the other parameters to be the same as
would have been found using the ordinary maximum likelihood method. However,
the errors on the parameters will be overestimated since the method assumes that
ν can have fluctuations. This overestimation can be removed (cf. section 3.9) by
1. inverting the covariance matrix,
2. removing the row and column corresponding to ν,
3. inverting the resulting matrix.
This corresponds to fixing ν at the best value, ν̂. Thus we could also fix ν = ν̂ and
find the errors on θ̂ by the usual ml procedure.
The estimated numbers of forward and backward events, i.e., the estimates of the
expectations of the numbers of forward and backward events if the experiment were
repeated, are then F̂ = N p̂ and B̂ = N(1 − p̂), with variance
$$V[\hat{F}] = V[N\hat{p}] = N^2\,V[\hat{p}]$$
With the extended maximum likelihood method, where the total number of events is
not fixed,
$$L_E = \frac{e^{-\nu}\nu^N}{N!}\,L = \frac{e^{-\nu}\nu^N}{N!}\,\frac{N!}{F!\,B!}\,p^F(1-p)^B$$
$$\ell_E = -\nu + N\ln\nu - \ln N! + F\ln p + B\ln(1-p) + \ln N! - \ln F!\,B!$$
$$\frac{\partial\ell_E}{\partial\nu} = -1 + \frac{N}{\nu} = 0 \quad\longrightarrow\quad \hat{\nu} = N$$
$$\frac{\partial^2\ell_E}{\partial\nu\,\partial p} = 0 \quad\longrightarrow\quad \hat{p}\ \text{and}\ \hat{\nu}\ \text{are uncorrelated.}$$
The estimate of the number of forward events is F̂ = p̂ν̂ = F, with the variance
found by error propagation:
$$V[\hat{F}] = \hat{p}^2\,V[\hat{\nu}] + \hat{\nu}^2\,V[\hat{p}] = \frac{F^2}{N^2}\,N + N^2\,\frac{\hat{p}(1-\hat{p})}{N} = \frac{F^2}{N} + \frac{FB}{N} = F$$
Alternatively, we can write the p.d.f. as a product of Poisson p.d.f.’s, one for
forward events and one for backward events (see exercise 13). Again, N is not
fixed. The parameters are now the expected numbers of forward, φ, and backward,
β, events. Then
$$L_E = \frac{e^{-\phi}\phi^F}{F!}\,\frac{e^{-\beta}\beta^B}{B!}$$
which leads to the same result:
$$\hat{F} = \hat{\phi} = F\pm\sqrt{F} \qquad\text{and}\qquad \hat{B} = \hat{\beta} = B\pm\sqrt{B}$$
g(θ̂) = 0 (8.93)
The most efficient method to deal with such constraints is to change parameters
such that these equations become trivial. For example, if the constraint is
g(θ) = θ1 + θ2 − 1 = 0
$$\theta_1 = \xi_1$$
$$\theta_2 = (1-\xi_1)\,\xi_2$$
$$\theta_3 = (1-\xi_1)(1-\xi_2)\,\xi_3$$
$$\vdots$$
$$\theta_{k-1} = (1-\xi_1)(1-\xi_2)(1-\xi_3)\cdots(1-\xi_{k-2})\,\xi_{k-1}$$
$$\theta_k = (1-\xi_1)(1-\xi_2)(1-\xi_3)\cdots(1-\xi_{k-2})(1-\xi_{k-1})$$
where the ξᵢ are bounded by 0 and 1 using the method given above:
$$\xi_i = \frac{1}{2}\,(\sin\psi_i + 1)$$
L is then maximized with respect to the k − 1 parameters ψi . A drawback of this
method is that the symmetry of the problem with respect to the parameters is lost.
In general, the above simple methods may be difficult to apply. One then turns
to the method of Lagrangian multipliers. Given the likelihood function L(x; θ) and
the constraints g(θ) = 0, one finds the extremum of F(x; θ, α) = ℓ(x; θ) + αᵀg(θ) with respect to both θ and the Lagrangian multipliers α.
To find the variances, we construct the matrix of the negative of the second
derivatives:
$$I \equiv -E\begin{pmatrix}\dfrac{\partial^2 F}{\partial\theta\,\partial\theta^T} & \dfrac{\partial^2 F}{\partial\theta\,\partial\alpha}\\[6pt] \dfrac{\partial^2 F}{\partial\alpha\,\partial\theta^T} & \dfrac{\partial^2 F}{\partial\alpha^2}\end{pmatrix} = -E\begin{pmatrix}\dfrac{\partial^2\ell}{\partial\theta\,\partial\theta^T} & \dfrac{\partial g}{\partial\theta}\\[6pt] \left(\dfrac{\partial g}{\partial\theta}\right)^{\!T} & 0\end{pmatrix} \equiv \begin{pmatrix}A & B\\ B^T & 0\end{pmatrix} \qquad (8.97)$$
It can be shown4, 5 that the covariance matrix of the estimators is then given by
$$V\bigl[\hat{\theta}\bigr] = A^{-1} - A^{-1}B\,V[\hat{\alpha}]\,B^T A^{-1} \qquad (8.98)$$
$$V[\hat{\alpha}] = \bigl(B^T A^{-1} B\bigr)^{-1} \qquad (8.99)$$
The first term of V[θ̂] is the ordinary unconstrained covariance matrix; the second
term is the reduction in variance due to the additional information provided by the
constraints. We have implicitly assumed that I is not singular. This may not be the
case, e.g., when the constraint is necessary to define the parameters unambiguously.
One then adds another term to F,
$$F' = F - g^2(\theta)$$
and proceeds as above. The resulting inverse covariance matrix is usually non-singular.4, 5
Computer programs which search for a maximum will generally perform better
if the constraints are handled correctly, rather than by some trick such as setting the
likelihood very small when the constraint is not satisfied, since this will adversely
affect the program’s estimation of derivatives. Also, use of Lagrangian multipliers
may not work with some programs, since the extremum can be a saddle point rather
than a maximum: a maximum with respect to θ, but a minimum with respect to α.
In such a case, “hill-climbing” methods will not be capable of finding the extremum.
8.5.1 Introduction
We begin this subject by starting from maximum likelihood and treating the exam-
ple of n independent xi , each distributed normally with the same mean but different
σi . To estimate µ when all the σi are known we have seen that the likelihood func-
tion is
$$L = \prod_{i=1}^n\frac{1}{\sqrt{2\pi}\,\sigma_i}\exp\left[-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma_i}\right)^2\right]$$
$$\ell = -\frac{n}{2}\ln(2\pi) + \sum_{i=1}^n\left[-\ln\sigma_i - \frac{(x_i-\mu)^2}{2\sigma_i^2}\right]$$
To maximize L, or ℓ, is equivalent to minimizing Σⁿᵢ₌₁ (xᵢ − µ)²/σᵢ². If µ were known, this
quantity would be, assuming each point independent, a χ²(n). Since µ is unknown
we replace it by an estimate of µ, µ̂. There is then one relationship between the
terms of the χ² and therefore
$$\chi^2 = \sum_{i=1}^n\frac{(x_i-\hat{\mu})^2}{\sigma_i^2} \qquad (8.100)$$
is a χ² not of n, but of n − 1 degrees of freedom.
The method of least squares takes as the estimator of a parameter that value
which minimizes χ2 . The least squares estimator is thus given by
$$\left.\frac{\partial\chi^2}{\partial\mu}\right|_{\mu=\hat{\mu}} = -2\sum_{i=1}^n\frac{x_i-\hat{\mu}}{\sigma_i^2} = 0$$
which gives the same estimator as did maximum likelihood (equation 8.59):
$$\hat{\mu} = \frac{\sum x_i/\sigma_i^2}{\sum 1/\sigma_i^2} \qquad (8.101)$$
Although in this example the least squares and maximum likelihood methods
result in the same estimator, this is not true in general, in particular if the p.d.f.
is not normal. We will see that although we arrived at the least squares method
starting from maximum likelihood, least squares is much more solidly based than
maximum likelihood. It is, perhaps as a consequence, also less widely applicable.
The method of least squares is a special case of a more general class of methods
whereby one uses some measure of distance, di (xi , θ), of a data point from its
expected value and minimizes the sum of the distances to obtain the estimate of θ.
Examples of d, in the context of our example, are
1. $d_i(x_i,\theta) = |x_i-\hat{\mu}|^\alpha$
2. $d_i(x_i,\theta) = \left(\dfrac{|x_i-\hat{\mu}|}{\sigma_i}\right)^{\!\alpha}$
The difference between these two is that in the second case the distance is scaled by
the square root of the expected variance of the distance. If all these variances, σi2 ,
are the same, the two definitions are equivalent. It can be shown11, 13 that the first
distance measure with α = 1 leads to µ̂ given by the sample median. The second
distance measure with α = 2 is just χ2 .
The first publication in which least squares was used is by Legendre. In an
1805 paper entitled “Nouvelles méthodes pour la determination des orbites des
comètes” he writes:
It is necessary then, when all the conditions of the problem have been suitably
expressed, to determine the coefficients so as to make the errors as small as
possible. To this end, the method which
which agrees with the variance found in the maximum likelihood method (equa-
tion 8.60).
We see that in this example (although not in general true) the variance of the
estimator does not depend on the value of χ2 . However, it does depend on the
shape of χ2 (µ):
$$\chi^2(\mu) = \sum\left(\frac{x_i-\mu}{\sigma_i}\right)^2$$
$$\left.\frac{\partial\chi^2}{\partial\mu}\right|_{\hat{\mu}} = -\sum\frac{2\,(x_i-\hat{\mu})}{\sigma_i^2} = 0$$
$$\left.\frac{\partial^2\chi^2}{\partial\mu^2}\right|_{\hat{\mu}} = \sum\frac{2}{\sigma_i^2} = \frac{2}{V[\hat{\mu}]}$$
This is the curve which we fit to the data. There are k parameters, θj , to be
estimated. The important features of this model are that the hj are known, distin-
guishable, functions of x, single-valued over the range of x, and that y is linear in
the θj . The word ‘linear’ in the term ‘linear model’ thus refers to the parameters θj
and not to the variable x. In some cases the linear model is just an approximation
arrived at by retaining only the first few terms of a Taylor series. The functions hj
must be distinguishable, i.e., no hj may be expressible as a linear combination of
the other hj ; otherwise the corresponding θj will be indeterminate.
We want to determine the values of the θj for which the model (eq. 8.102) best
fits the measurements. We assume that any deviation of a point yi from this curve
is due to measurement error or some other unbiased effects beyond our control, but
whose distribution is known from previous study of the measuring process to have
variance σi2 . It need not be a Gaussian. We take as our measure of the distance of
the point yi from the hypothesized curve the squared distance in units of σi , as in
1 xi) are means (or other descriptive statistics) of some random variable, e.g., y the average height and x the average weight of Dutch male university students. The mathematics is, however, the same.
[Figure: a measurement yi at xi, the true curve y(x), and the p.d.f. of yi at xi.]
$$y_i = y(x_i) + \epsilon_i = \sum_{j=1}^k\theta_j h_j(x_i) + \epsilon_i \qquad (8.103)$$
where the unknown error on yi has the properties: E [ǫi ] = 0, V [ǫi ] = σi2 , and σi2 is
known. The ǫi do not have to be normally distributed for most of what we shall do;
where a Gaussian assumption is needed, we will say so. Note that if at each xi the
yi does not have a normal p.d.f., we may be able to transform to a set of variables
which does.
Further, we assume for simplicity that each yi is an independent measurement,
although correlations can easily be taken into account by making the error matrix
non-diagonal, as will be discussed. The xi may be chosen any way we wish, including
several xi which are equal. However, we shall see that we need at least k distinct
values of x to determine k parameters θj .
Estimator
The problem is now to determine the ‘best’ values of k parameters, θj , from n
measurements, (xi , yi ). The deviations from the true curve are ǫi . Therefore the
“χ2 ” is
$$Q^2 = \sum_{i=1}^n\frac{\epsilon_i^2}{\sigma_i^2} \qquad (8.104)$$
$$\phantom{Q^2} = \sum_{i=1}^n\left(\frac{y_i - y(x_i)}{\sigma_i}\right)^2$$
$$\phantom{Q^2} = \sum_{i=1}^n\frac{1}{\sigma_i^2}\left(y_i - \sum_{j=1}^k\theta_j h_j(x_i)\right)^2 \qquad (8.105)$$
This is a true χ2 , i.e., distributed as a χ2 p.d.f., only if the ǫi are normally dis-
tributed. To emphasize this we use the symbol Q2 instead of χ2 .
We do not know the actual value of Q2 , since we do not know the true values
of the parameters θj . The least squares method estimates θ by that value θ̂ which
minimizes Q2 . This is found from the k equations (l = 1, . . . , k)
$$\frac{\partial Q^2}{\partial\theta_l} = 2\sum_{i=1}^n\frac{1}{\sigma_i^2}\left(y_i-\sum_{j=1}^k\theta_j h_j(x_i)\right)\bigl(-h_l(x_i)\bigr) = 0$$
which we rewrite as
$$\sum_{i=1}^n\frac{h_l(x_i)}{\sigma_i^2}\sum_{j=1}^k\hat{\theta}_j h_j(x_i) = \sum_{i=1}^n\frac{y_i}{\sigma_i^2}\,h_l(x_i) \qquad (8.106)$$
This is a set of k linear equations in k unknowns. They are called the normal
equations. It is easier to work in matrix notation. We write
$$y = \begin{pmatrix}y_1\\ \vdots\\ y_n\end{pmatrix};\qquad \theta = \begin{pmatrix}\theta_1\\ \vdots\\ \theta_k\end{pmatrix};\qquad \epsilon = \begin{pmatrix}\epsilon_1\\ \vdots\\ \epsilon_n\end{pmatrix};\qquad H = \begin{pmatrix}h_1(x_1) & h_2(x_1) & \cdots & h_k(x_1)\\ h_1(x_2) & h_2(x_2) & \cdots & h_k(x_2)\\ \vdots & \vdots & \ddots & \vdots\\ h_1(x_n) & h_2(x_n) & \cdots & h_k(x_n)\end{pmatrix}$$
Then
$$H\theta = \begin{pmatrix}\sum_{j=1}^k\theta_j h_j(x_1)\\ \sum_{j=1}^k\theta_j h_j(x_2)\\ \vdots\\ \sum_{j=1}^k\theta_j h_j(x_n)\end{pmatrix}$$
$$y = H\theta + \epsilon \qquad (8.107)$$
Since E[ǫ] = 0, we obtain E[y] = Hθ. In other words, the expectation value of
each measurement is exactly the value given by the model.
The errors σi2 can also be incorporated in a matrix, which is diagonal given our
assumption of independent measurements,
$$V[y] = \begin{pmatrix}\sigma_1^2 & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & \sigma_n^2\end{pmatrix}$$
If the measurements are not independent, we incorporate that by setting the off-
diagonal elements to the covariances of the measurements. In this matrix notation,
the equations for Q² (equations 8.104 and 8.105) become
$$Q^2 = \epsilon^T V^{-1}\epsilon \qquad (8.108)$$
$$\phantom{Q^2} = \bigl(y - H\theta\bigr)^T V^{-1}\bigl(y - H\theta\bigr) \qquad (8.109)$$
$$\frac{\partial Q^2}{\partial\theta} = -2\,H^T V^{-1}\bigl(y - H\theta\bigr) = 0 \qquad (8.110)$$
which gives the normal equations corresponding to equations 8.106, but now in
matrix form:
$$H^T V^{-1} H\,\hat{\theta} = H^T V^{-1} y \qquad (8.111)$$
where the dimensions of the matrices are (k×n)(n×n)(n×k)(k×1) on the left-hand
side and (k×n)(n×n)(n×1) on the right. The normal equations are
solved by inverting the square matrix H T V −1 H, which is a symmetric matrix since
V is symmetric. The solution is then
$$\hat{\theta} = \bigl(H^T V^{-1} H\bigr)^{-1} H^T V^{-1} y \qquad (8.112)$$
It is useful to note that the actual sizes of the errors σi2 do not have to be known
to find θ̂; only their relative sizes. To see this, write V = σ 2 W , where σ 2 is an
arbitrary scale factor and insert this in equation 8.112. The factors σ 2 cancel; thus
σ 2 need not be known in order to determine θ̂.
Now let us evaluate the expectation of θ̂:
$$E\bigl[\hat{\theta}\bigr] = E\Bigl[\bigl(H^T V^{-1}H\bigr)^{-1}H^T V^{-1}y\Bigr] = \bigl(H^T V^{-1}H\bigr)^{-1}H^T V^{-1}E[y] = \bigl(H^T V^{-1}H\bigr)^{-1}H^T V^{-1}H\theta = \theta$$
Thus θ̂ is unbiased, assuming that the model is correct. This is true even for small
n. (Recall that maximum likelihood estimators are often biased for finite n.)
Procedures exist for solving the normal equations without the intermediate step
of matrix inversion. Such methods are usually preferable in that they usually suffer
less from round-off problems.
In some cases, it is more convenient to solve these equations by numerical ap-
proximation methods. As discussed at the end of section 8.4.6, programs exist to
find the minimum of a function. For simple cases like the linear problem we have
considered, use of such programs is not very wasteful of computer time, and its
simplicity decreases the probability of an experimenter’s error and probably saves
his time as well. If the problem is not linear, a case which we shall shortly discuss,
such an approach is usually best.
We have stated that there must be no linear relationship between the hj . If
there is, then the columns of H are not all independent, and since V is symmetric,
H T V −1 H will be singular. The best approach is then to eliminate some of the h’s
until the linear relationships no longer exist. Also, there must be at least k distinct
xi ; otherwise the same matrix will be singular.
Note that if the number of parameters k is equal to the number of distinct values
of x, i.e., n = k assuming all xi are distinct, then
$$\bigl(H^T V^{-1} H\bigr)^{-1} = H^{-1}\,V\,\bigl(H^T\bigr)^{-1}$$
Variance
The covariance matrix of the estimators is given by
$$V\bigl[\hat{\theta}\bigr] = \underbrace{\bigl(H^T V^{-1}H\bigr)^{-1}H^T V^{-1}}_{(k\times n)}\;\underbrace{V\bigl[y\bigr]}_{(n\times n)}\;\underbrace{\Bigl[\bigl(H^T V^{-1}H\bigr)^{-1}H^T V^{-1}\Bigr]^T}_{(n\times k)} \qquad (8.113)$$
or
$$D = \Bigl[\bigl(H^T V^{-1}H\bigr)^{-1}H^T V^{-1}\Bigr]^T \qquad (8.115)$$
The covariance (equation 8.113) then follows from equation 8.52, V[θ̂] = Dᵀ V[y] D.
What we here call V[y] is what we previously just called V. It is a square,
symmetric matrix. Hence V⁻¹ is also square and symmetric and therefore (V⁻¹)ᵀ = V⁻¹.
For the same reason [(HᵀV⁻¹H)⁻¹]ᵀ = (HᵀV⁻¹H)⁻¹. Therefore, equation 8.113 can
be rewritten:
$$V\bigl[\hat{\theta}\bigr] = \bigl(H^T V^{-1}H\bigr)^{-1}H^T V^{-1}\,V\,V^{-1}H\bigl(H^T V^{-1}H\bigr)^{-1} = \bigl(H^T V^{-1}H\bigr)^{-1}\bigl(H^T V^{-1}H\bigr)\bigl(H^T V^{-1}H\bigr)^{-1}$$
$$V\bigl[\hat{\theta}\bigr] = \bigl(H^T V^{-1}H\bigr)^{-1} \qquad (8.116)$$
Equation 8.112 for the estimator θ̂ and equation 8.116 for its variance constitute
the complete method of linear least squares.
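A minimal sketch of equations 8.112 and 8.116 in Python/NumPy, for a hypothetical set of measurements and the two-function model h = (1, x); the same code applies to any other choice of the hj:

```python
import numpy as np

# hypothetical measurements y_i with known errors sigma_i at points x_i
x     = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y     = np.array([1.1, 2.9, 5.2, 6.8, 9.1, 10.8])
sigma = np.array([0.3, 0.3, 0.4, 0.4, 0.5, 0.5])

# design matrix H for the linear model y = theta_1*h_1(x) + theta_2*h_2(x), h = (1, x)
H    = np.column_stack([np.ones_like(x), x])
Vinv = np.diag(1.0 / sigma**2)                    # V^-1 for independent measurements

A         = H.T @ Vinv @ H                        # normal-equations matrix
theta_hat = np.linalg.solve(A, H.T @ Vinv @ y)    # eq. 8.112, solved without explicit inversion
cov_theta = np.linalg.inv(A)                      # eq. 8.116

print(theta_hat)
print(np.sqrt(np.diag(cov_theta)))                # errors on the parameters
```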
σ² unknown
If V[y] is only known up to an overall constant, i.e., V = σ²W with σ² unknown,
it can be estimated from the minimum value of Q². Defining Q² in terms of W, its
minimum value is given by equation 8.108 with θ = θ̂:
$$Q^2_{\min} = \bigl(y - H\hat{\theta}\bigr)^T W^{-1}\bigl(y - H\hat{\theta}\bigr) \qquad (8.117)$$
Therefore,
$$\hat{\sigma}^2 = \frac{Q^2_{\min}}{n-k} \qquad (8.118)$$
is an unbiased estimate of σ². It can be shown∗ that this result is true even when
the ǫi are not normally distributed.
∗
See Kendall & Stuart11 , vol. II, section 19.9 and exercise 19.5.
Interpolation
Having found θ̂, we may wish to calculate the value of y for some particular value
of x. In fact, the reason for doing the fit is often to be able to interpolate or
extrapolate the data points to other values of x. This is done by substituting the
estimators in the model. The variance is found by error propagation, reversing the
procedure used above to find the variance of θ̂. The estimate ŷ0 of y at x = x0 and
its variance are therefore given by
$$\hat{y}_0 = H_0\,\hat{\theta} \qquad (8.119)$$
$$V[\hat{y}_0] = H_0\,V\bigl[\hat{\theta}\bigr]\,H_0^T = H_0\,\bigl(H^T V\bigl[y\bigr]^{-1}H\bigr)^{-1}H_0^T \qquad (8.120)$$
where H₀ = ( h₁(x₀)  h₂(x₀)  …  h_k(x₀) ), i.e., the H-matrix for the single point x₀.
This is the Newton-Raphson method of solving the equations ∂Q²/∂θ = 0. It is exact,
i.e., independent of the choice of θ₀, for the linear model, where the form of Q² is a
parabola. In the non-linear case, the method can still be used, but iteratively; its
success will depend on how close θ₀ is to θ̂ and on how non-linear the problem is.
The derivative formulation for the least squares solution is frequently the most
convenient technique in practical problems. The derivatives we need are
$$\frac{\partial Q^2}{\partial\theta_i} = \frac{\partial}{\partial\theta_i}\sum_m\frac{\epsilon_m^2}{\sigma_m^2} = 2\sum_m\frac{\epsilon_m}{\sigma_m^2}\frac{\partial\epsilon_m}{\partial\theta_i}$$
$$\text{and}\qquad \frac{\partial^2 Q^2}{\partial\theta_i\,\partial\theta_j} = 2\sum_m\frac{1}{\sigma_m^2}\frac{\partial\epsilon_m}{\partial\theta_i}\frac{\partial\epsilon_m}{\partial\theta_j} + 2\sum_m\frac{\epsilon_m}{\sigma_m^2}\frac{\partial^2\epsilon_m}{\partial\theta_i\,\partial\theta_j}$$
In the linear case, ∂²ǫₘ/∂θᵢ∂θⱼ = 0 and ∂ǫₘ/∂θᵢ = −hᵢ(xₘ). Thus, the necessary derivatives
are easy to compute.
Finally, we note that the minimum value of Q2 is given by
$$Q^2(\hat{\theta}) = Q^2(\theta_0) + \left.\frac{\partial Q^2}{\partial\theta}\right|_{\theta=\theta_0}\!\cdot(\hat{\theta}-\theta_0) + \frac{1}{2}\,(\hat{\theta}-\theta_0)^T\left.\frac{\partial^2 Q^2}{\partial\theta^2}\right|_{\theta=\theta_0}(\hat{\theta}-\theta_0) \qquad (8.124)$$
where we have expanded Q2 (θ̂) about θ0 . Third and higher order terms are zero for
the linear model.
Just as in the example in the introduction to least squares, we can show, by
expanding Q2 about θ̂ that the set of values of θ given by Q2 (θ) = Q2min + 1 define
the one standard deviation errors on θ̂. This is the same as the geometrical method
to find the errors in maximum likelihood analysis (section 8.4.5), except that here
the difference in Q2 is 1 whereas the difference in ℓ was 1/2. This is because the
covariance matrix here is given by twice the inverse of the second derivative matrix,
whereas it was equal to the inverse of the second derivative matrix in the maximum
likelihood case.
So far we have made no use of the assumption that the ǫi are Gaussian dis-
tributed. We have only used the conditions E(ǫi ) = 0 and V [ǫi ] = σi2 known and
the linearity of the model.
∗ For a proof, see, for example, Kendall & Stuart11, chapter 19 (Stuart et al.13, chapter 29), or
matrix of the ǫi , V [ǫ] is finite and fixed, i.e., independent of θ and y, (it does not
have to be diagonal), then the least squares estimate, θ̂ is unbiased and has the
smallest variance of all linear (in y), unbiased estimates, regardless of the p.d.f. for
the ǫi .
Note that
• This theorem concerns only linear unbiased estimators. It may be possible,
particularly if ǫ is not normally distributed, to find a non-linear unbiased
estimator with a smaller variance. Biased estimators with a smaller variance
may also exist.
• Least squares does not in general give the same result as maximum likelihood
(unless the ǫi are Gaussian) even for linear models. In this case, linear least
squares is often to be preferred to linear maximum likelihood where appli-
cable and convenient, since linear least squares is unbiased and has smallest
variance. An exception may occur in small samples where the data must be
binned in order to do a least squares analysis, causing a loss of information.
• The assumptions are important: The measurement errors must have zero
mean and they must be homoscedastic (the technical name for constant vari-
ance). Non-zero means or heteroscedastic variances may reveal themselves in
the residuals, yi − f (xi ), cf. section 10.6.8.
8.5.5 Examples
A Straight-Line Fit
As an example of linear least squares we do a least squares fit of independent
measurements yi at points xi assuming the model y = a + bx. Thus,
$$\theta = \begin{pmatrix}a\\ b\end{pmatrix};\qquad h = \begin{pmatrix}1\\ x\end{pmatrix};\qquad H = \begin{pmatrix}1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n\end{pmatrix}$$
and y = Hθ + ǫ.
Since the measurements are independent, the covariance matrix is diagonal with
Vii(y) = Vii(ǫ) = σi², and
$$Q^2 = \epsilon^T V^{-1}\epsilon = \sum_{i=1}^n\frac{\epsilon_i^2}{\sigma_i^2} = \sum_{i=1}^n\left(\frac{y_i - a - bx_i}{\sigma_i}\right)^2$$
Hence, using the derivative method,
$$\frac{\partial Q^2}{\partial a} = 0 \quad\longrightarrow\quad \hat{a} = \frac{1}{\sum_{i=1}^n 1/\sigma_i^2}\,\sum_{i=1}^n\frac{y_i - \hat{b}x_i}{\sigma_i^2}$$
$$\frac{\partial Q^2}{\partial b} = 0 \quad\longrightarrow\quad \hat{b} = \frac{1}{\sum_{i=1}^n x_i^2/\sigma_i^2}\,\sum_{i=1}^n\frac{x_iy_i - \hat{a}x_i}{\sigma_i^2}$$
Solving, we find
$$\hat{b} = \frac{\sum\frac{x_iy_i}{\sigma_i^2}\,\sum\frac{1}{\sigma_i^2} - \sum\frac{y_i}{\sigma_i^2}\,\sum\frac{x_i}{\sigma_i^2}}{\sum\frac{x_i^2}{\sigma_i^2}\,\sum\frac{1}{\sigma_i^2} - \left(\sum\frac{x_i}{\sigma_i^2}\right)^2}$$
$$\hat{a} = \bar{y} - \hat{b}\bar{x} \qquad\text{and}\qquad \hat{b} = \frac{\overline{xy}-\bar{x}\,\bar{y}}{\overline{x^2}-\bar{x}^2} \qquad (8.125)$$
These are the formulae which are programmed into many pocket calculators. As
such, they should only be used when the σi are all the same. These formulae are,
however, also applicable to the case where not all σi are the same if the sample
average indicated by the bar is interpreted as meaning a weighted sample average
with weights given by 1/σi², e.g., ȳ = Σ(yi/σi²) / Σ(1/σi²). The proof is left as an
exercise (ex. 40).
Note that at least two of the xi must be different. Otherwise, the denominator
in the expression for b̂ is zero. This illustrates the general requirement that there
must be at least as many distinct values of xi as there are parameters in the model;
otherwise the matrix H T V −1 H will be singular.
The errors on the least squares estimates of the parameters are given by equation
8.122 or 8.116. With all σi the same, equation 8.116 gives
$$V\bigl[\hat{\theta}\bigr] = \bigl(H^T V^{-1}H\bigr)^{-1} = \bigl(H^T H\bigr)^{-1}\sigma^2 = \sigma^2\begin{pmatrix}n & \sum x_i\\ \sum x_i & \sum x_i^2\end{pmatrix}^{-1} = \frac{\sigma^2}{n\sum x_i^2 - \bigl(\sum x_i\bigr)^2}\begin{pmatrix}\sum x_i^2 & -\sum x_i\\ -\sum x_i & n\end{pmatrix}$$
Thus,
$$\begin{pmatrix}V[\hat{a}] & \mathrm{cov}(\hat{a},\hat{b})\\ \mathrm{cov}(\hat{a},\hat{b}) & V[\hat{b}]\end{pmatrix} = \frac{\sigma^2}{n\,\bigl(\overline{x^2}-\bar{x}^2\bigr)}\begin{pmatrix}\overline{x^2} & -\bar{x}\\ -\bar{x} & 1\end{pmatrix} \qquad (8.126)$$
Note that by translating the x-axis such that x̄ becomes zero, the estimates of the
parameters become uncorrelated.
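A minimal sketch of the explicit formulae 8.125 and 8.126 in Python, assuming equal errors σ on illustrative data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
sigma = 0.3                                      # common error on the y_i

xbar, ybar   = x.mean(), y.mean()
x2bar, xybar = (x**2).mean(), (x*y).mean()

b_hat = (xybar - xbar*ybar) / (x2bar - xbar**2)  # eq. 8.125
a_hat = ybar - b_hat*xbar

n = len(x)
denom  = n * (x2bar - xbar**2)
V_a    = sigma**2 * x2bar / denom                # eq. 8.126
V_b    = sigma**2 / denom
cov_ab = -sigma**2 * xbar / denom

print(a_hat, b_hat, np.sqrt(V_a), np.sqrt(V_b), cov_ab)
```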
Here too, it is possible to use this formula for the case where not all σi are the
same. Besides taking the bar as a weighted average, one must also replace σ² by
its weighted average,
$$\overline{\sigma^2} = \frac{\sum\sigma_i^2/\sigma_i^2}{\sum 1/\sigma_i^2} = \frac{n}{\sum 1/\sigma_i^2} \qquad (8.127)$$
Note that the errors are smallest for the largest spread in the xi . Thus we will
attain the best estimates of the parameters by making measurements only at the
extreme values of x. This procedure is, however, seldom advisable since it makes it
impossible to test the validity of the model, as we shall see.
Having found â and b̂, we can calculate the value of y for any value of x by
simply substituting the estimators in the model. The estimate ŷ0 of y at x = x0 is
therefore given by
ŷ0 = â + b̂x0 (8.128)
We note in passing that this gives ŷ0 = ȳ for x0 = x̄. The variance of ŷ0 is found
by error propagation:
$$V[\hat{y}_0] = V[\hat{a}] + x_0^2\,V\bigl[\hat{b}\bigr] + 2x_0\,\mathrm{cov}(\hat{a},\hat{b})$$
A Polynomial Fit
To fit a parabola
$$y = a_0 + a_1 x + a_2 x^2$$
the matrix H is
$$H = \begin{pmatrix}1 & x_1 & x_1^2\\ 1 & x_2 & x_2^2\\ \vdots & \vdots & \vdots\\ 1 & x_n & x_n^2\end{pmatrix}$$
Assuming that all the σi are equal, equation 8.112 becomes
$$\hat{\theta} = \begin{pmatrix}\hat{a}_0\\ \hat{a}_1\\ \hat{a}_2\end{pmatrix} = \begin{pmatrix}\sum_i 1 & \sum_i x_i & \sum_i x_i^2\\ \sum_i x_i & \sum_i x_i^2 & \sum_i x_i^3\\ \sum_i x_i^2 & \sum_i x_i^3 & \sum_i x_i^4\end{pmatrix}^{-1}\begin{pmatrix}\sum_i y_i\\ \sum_i x_i y_i\\ \sum_i x_i^2 y_i\end{pmatrix}$$
very regular and symmetric. Numerical inversion suffers from rounding errors when
the order of the polynomial is greater than six or seven.
One can hope to mitigate these problems by choosing a set of orthogonal poly-
nomials, e.g., Legendre or Tchebycheff (Chebyshev) polynomials, instead of powers
of x. The off-diagonal terms then involve products of orthogonal functions summed
over the events. The expectation of such products is zero, and hence the sum of
their products over a large number of events should be nearly zero. The matrix is
then nearly diagonal and less prone to numerical problems.
Even better is to find functions which are exactly orthogonal over the measured
data points, i.e., functions, ξ, for which
$$\sum_{i=1}^n \xi_j(x_i)\,\xi_k(x_i)\,\bigl(V^{-1}\bigr)_{jk} = \delta_{jk}$$
The matrix which has to be inverted, H T V −1 H, is then simply the unit matrix. An
additional feature of such a parametrization is that the estimates of the parameters
are independent; the covariance matrix for the parameters is diagonal. Such a set of
functions can always be found, e.g., using Schmidt’s orthogonalization method∗ or,
more simply, using Forsythe’s method.40 Its usefulness is limited to cases where we
are merely seeking a parametrization of the data (for the purpose of interpolation
or extrapolation) rather than seeking to estimate the parameters of a theoretical
model.
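The numerical benefit of an (approximately) orthogonal basis is easy to see; the following sketch (Python/NumPy, with Legendre polynomials on data already scaled to [−1, 1], an illustrative choice) compares the conditioning of the normal-equations matrix HᵀH for ordinary powers and for Legendre polynomials:

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(-1.0, 1.0, size=200))   # data already scaled to [-1, 1]
deg = 8

H_pow = np.vander(x, deg + 1, increasing=True)  # columns 1, x, x^2, ...
H_leg = legendre.legvander(x, deg)              # columns P_0(x), P_1(x), ...

# normal-equations matrix H^T H (all sigma_i equal, so V^-1 is a common constant)
for H in (H_pow, H_leg):
    M = H.T @ H
    print(np.linalg.cond(M))   # the Legendre basis gives a much smaller condition number
```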
$$\sum_{j=1}^k \ell_{ij}\,\theta_j = R_i,\qquad i = 1,\ldots,m \qquad (8.130)$$
∗
See, e.g., Margenau & Murphy39 .
The least squares estimate of θ is then found using an m-component vector of
Lagrangian multipliers, 2λ, by finding the extremum of
$$Q^2 = \bigl(y - H\theta\bigr)^T V^{-1}\bigl(y - H\theta\bigr) + 2\lambda^T\bigl(L\,\theta - R\bigr) \qquad (8.132)$$
where the first term is the usual Q2 and the second term represents the constraints.
Differentiating with respect to θ and with respect to λ, respectively, yields the
normal equations
H T V −1 H θ̂ + LT λ̂ = H T V −1 y (8.133a)
L θ̂ = R (8.133b)
where
C = H T V −1 H (8.135)
S = H T V −1 y (8.136)
Assuming that both C and L C −1 LT can be inverted, the normal equations can be
solved for θ̂ and λ̂ giving3–5
$$\begin{pmatrix}\hat{\theta}\\ \hat{\lambda}\end{pmatrix} = \begin{pmatrix}F & G^T\\ G & E\end{pmatrix}\begin{pmatrix}S\\ R\end{pmatrix} \qquad (8.137)$$
where∗
$$W = \bigl(L\,C^{-1}L^T\bigr)^{-1} \qquad (8.138)$$
$$F = C^{-1} - C^{-1}L^T W L\,C^{-1} = \bigl(1 - C^{-1}L^T W L\bigr)\,C^{-1} \qquad (8.139)$$
$$G = W L\,C^{-1} \qquad (8.140)$$
$$E = -W \qquad (8.141)$$
$$\hat{\theta} = F S + G^T R = F H^T V^{-1} y + G^T R \qquad (8.142)$$
$$\hat{\lambda} = G S + E R = G\,H^T V^{-1}\epsilon \qquad (8.143)$$
∗
Note that Eadie et al.4 contains a misprint in these equations.
In the unconstrained case the solution was θ̂ = C⁻¹S with covariance matrix
V[θ̂] = C⁻¹. These results are recovered from the above equations by setting terms
involving L or R to zero. From equations 8.139 and 8.144 we see that the constraints
reduce the variance of the estimators, as should be expected, since introducing con-
straints adds information. We also see that the constraints introduce (additional)
correlations between the θ̂i.
It can be shown4, 5 that the θ̂ are unbiased, and that E[λ̂] = 0 as expected.
Thus the matrix H is just the unit matrix, and, in the absence of constraints, the
normal equations have the trivial (and obvious) solution (equation 8.112)
$$\hat{\theta} = \bigl(H^T V^{-1}H\bigr)^{-1}H^T V^{-1}y = y$$
The best value of a measurement (θ̂i ) is just the measurement itself (yi ).
With m linear constraints (equation 8.130 or 8.131) the solution follows imme-
diately from the previous section by setting H = 1. The improved values of the
measurements are then the θ̂i . Note that the constraints introduce a correlation
between the measurements.
∗
In particle physics this procedure is known as kinematical fitting since the constraints usually
express the kinematics of energy and momentum conservation.
Straight-line fit
We begin by treating the case of a straight-line fit, y = a + bx, from section 8.5.5.
As before, we take Q² as the sum of the squares of the distances between the fit
line and the measured point, scaled by the error on this distance. However, this
distance is not unique. This is illustrated in the figure, where the ellipse indicates
the errors on xi and yi. For a point on the line, Pj, the distance to D is PjD and
the error is the distance along this line from the point D to the error ellipse, RjD:
$$d_j = \frac{P_j D}{R_j D}$$
[Figure: the measured point D = (xi, yi) with its error ellipse, several points P₁, …, P₄ on the fit line, and the corresponding intersections R₁, …, R₄ of the lines PjD with the error ellipse.]
Since we want the minimum of Q², we also want to take the minimum of the dj,
i.e., the minimum of
$$d_i^2 = \frac{(x-x_i)^2}{\sigma_{x_i}^2} + \frac{(y-y_i)^2}{\sigma_{y_i}^2} \qquad (8.147)$$
where we have assumed that the errors on xi and yi are uncorrelated. Substituting
y = a + bx and setting d(dᵢ²)/dx = 0 results in the minimum distance being given by
$$d_{i\,\min}^2 = \frac{(y_i - a - bx_i)^2}{\sigma_{y_i}^2 + b^2\sigma_{x_i}^2}$$
This same result can be found by taking the usual definition of the distance,
$$d_i^2 = \left(\frac{y_i - y(x_i)}{\sigma_i}\right)^2 = \left(\frac{y_i - a - bx_i}{\sigma_i}\right)^2$$
where σi is no longer just the error on yi, σyᵢ, but is now the error on yi − a − bxi,
which is found by error propagation to be σi² = σyᵢ² + b²σxᵢ².
We note that if all σx i = 0 this reduces to the expression found in section 8.5.5.
Unfortunately, the differentiation with respect to b is more complicated. In practice
it is most easily done numerically by choosing a series of values for b̂, calculating â
from the above formula and using these values of â and b̂ to calculate Q2 , repeating
the process until the minimum Q2 is found.
The errors on â and b̂ are most easily found from the condition that Q2 −Q2min = 1
corresponds to one standard deviation errors.
If all σxᵢ are the same and also all σyᵢ are the same, the situation simplifies
considerably. The above expression for â becomes
$$\hat{a} = \bar{y} - \hat{b}\bar{x} \qquad (8.149)$$
and differentiation with respect to b leads to
$$\frac{\partial Q^2}{\partial b} = -2\sum_{i=1}^n\left[\frac{x_i\,(y_i-\hat{a}-\hat{b}x_i)}{\sigma_y^2+\hat{b}^2\sigma_x^2} + \frac{\hat{b}\,\sigma_x^2\,(y_i-\hat{a}-\hat{b}x_i)^2}{\bigl(\sigma_y^2+\hat{b}^2\sigma_x^2\bigr)^2}\right] = 0$$
Substituting the expression for â into this equation then yields
$$\hat{b}^2\,\sigma_x^2\,\Delta_{xy} - \hat{b}\,\bigl(\sigma_x^2\,\Delta_{y^2} - \sigma_y^2\,\Delta_{x^2}\bigr) - \sigma_y^2\,\Delta_{xy} = 0 \qquad (8.150)$$
where
$$\Delta_{x^2} = \overline{x^2}-\bar{x}^2,\qquad \Delta_{y^2} = \overline{y^2}-\bar{y}^2,\qquad \Delta_{xy} = \overline{xy}-\bar{x}\,\bar{y}$$
This is a quadratic equation for b̂. Of the two solutions it turns out that the one
with a negative sign before the square root gives the minimum Q2 ; the one with
the plus sign gives the maximum Q2 of all straight lines passing through the point
(x̄, ȳ). We note that these solutions for â and b̂ reduce to those found in section
8.5.5 when there is no uncertainty on x (σx = 0).
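A minimal sketch of this procedure for equal σx and σy (Python, illustrative data): form the Δ's, solve the quadratic equation 8.150 for b̂, keep the root that minimizes Q², and obtain â from equation 8.149:

```python
import numpy as np

x = np.array([1.0, 2.1, 2.9, 4.2, 5.0])
y = np.array([1.2, 2.0, 3.1, 3.9, 5.2])
sx, sy = 0.2, 0.3                           # common errors on x and on y

xbar, ybar = x.mean(), y.mean()
Dx2 = (x**2).mean() - xbar**2
Dy2 = (y**2).mean() - ybar**2
Dxy = (x*y).mean() - xbar*ybar

# quadratic (8.150):  b^2 sx^2 Dxy - b (sx^2 Dy2 - sy^2 Dx2) - sy^2 Dxy = 0
A = sx**2 * Dxy
B = -(sx**2 * Dy2 - sy**2 * Dx2)
C = -sy**2 * Dxy

def Q2(b):
    a = ybar - b*xbar                       # eq. 8.149
    return np.sum((y - a - b*x)**2 / (sy**2 + b**2 * sx**2))

disc  = np.sqrt(B*B - 4*A*C)
b_hat = min([(-B - disc)/(2*A), (-B + disc)/(2*A)], key=Q2)   # keep the minimizing root
a_hat = ybar - b_hat*xbar

print(a_hat, b_hat, Q2(b_hat))
```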
In general
Now let us consider a more complicated case. Let us represent a data point by the
vector zᵢ = (xᵢ, yᵢ)ᵀ. If the model is a more complicated function than a straight
line, or if there is a non-zero correlation between xi and yi, the distance measure di
defined in equation 8.147 becomes
$$d_i^2 = (z_i^c - z_i)^T\,V_i^{-1}\,(z_i^c - z_i)$$
where Vᵢ is the covariance matrix for data point i,
$$V_i = \begin{pmatrix}\sigma_{x_i}^2 & \mathrm{cov}(x_i,y_i)\\ \mathrm{cov}(x_i,y_i) & \sigma_{y_i}^2\end{pmatrix}$$
and the point on the curve closest to zᵢ is represented by zᵢᶜ = (xᵢᶜ, yᵢᶜ)ᵀ. The
components of zᵢᶜ are related by the model: yᵢᶜ = Hᵀ(xᵢᶜ)θ, which can be regarded as
constraints for the minimization of Q². We then use Lagrangian multipliers and
minimize
$$Q^2 = \sum_{i=1}^n\Bigl[(z_i^c - z_i)^T\,V_i^{-1}\,(z_i^c - z_i) + \lambda_i\bigl(y_i^c - H^T(x_i^c)\,\theta\bigr)\Bigr] \qquad (8.151)$$
with respect to the k parameters θ, the n unknowns xᵢᶜ, the n unknowns yᵢᶜ, and the
n Lagrangian multipliers λᵢ.
is minimal. Thus the least squares method yields the same estimates as the maxi-
mum likelihood method, and accordingly has the same desirable properties.
When the deviations are not normally distributed, the least squares method may
still be used, but it does not have such general optimal properties as to be useful for
small n. Even asymptotically, the estimators need not be of minimum variance.4, 5
In practice, the minimum of Q2 is usually most easily found numerically using
a search program such as MINUIT. However, an iterative solution3, 6 of the normal
equations (subject to constraints) may yield considerable savings in computer time.
8.5.10 Summary
The most important properties of the least squares method are
• In the linear model, it follows from the Gauss-Markov theorem that least
squares estimators have optimal properties: If the measurement errors have
zero expectation and finite, fixed variance, then the least squares estimators
are unbiased and have the smallest variance of all linear, unbiased estimators.
• If the errors are Gaussian, least squares estimators are the same as maximum
likelihood estimators.
• If the errors are Gaussian, the minimum value of Q2 provides a test of the
validity of the model, at least in the linear model (cf. sections 10.4.3 and
10.6.3).
• If the model is non-linear in the parameters and the errors are not Gaussian,
the least squares estimators usually do not have any optimal properties.
The least squares method discussed so far does not apply to histograms or other
binned data. Fitting to binned data is treated in section 8.6.
do when the data are simply observations of the values of x for a sample of events?
This was easily treated in the maximum likelihood method. For a least squares
type of estimator we must transform this set of observations into estimates of y at
various values of x.
To do this we collect the observations into mutually exclusive and exhaustive
classes defined with respect to the variable x. (The extension to more than one
variable is straightforward.) An example of such a classification is a histogram and
we shall sometimes refer to the classes as bins, but the concept is more general than
a histogram. Assume that we have k classes and let πi be the probability, calculated
from the assumed p.d.f., that an observation falls in the ith class. Then
$$\sum_{i=1}^k \pi_i = 1$$
and the distribution of observations among the classes is a multinomial p.d.f. Let
n be the total number of observations and ni the number of observations in the ith
class. Then pi = ni /n is the fraction of observations in the ith class.
The minimum chi-square method consists of minimizing Pearson’s53 χ², which
we refer to here as Q₁²,
$$Q_1^2 = n\sum_{i=1}^k\frac{(p_i-\pi_i)^2}{\pi_i} = \sum_{i=1}^k\frac{(n_i-n\pi_i)^2}{n\pi_i} \qquad (8.152)$$
$$\phantom{Q_1^2} = n\left(\sum_{i=1}^k\frac{p_i^2}{\pi_i} - 1\right)$$
This appears rather similar to the usual least squares method. The ‘measure-
ment’ is now the observed number of events in a bin, and the model is that there
should be nπi events in the bin. Recall (section 3.3) that the multinomial p.d.f.
has for the ith bin the expectation µi = nπi and variance σi2 = nπi (1 − πi ). For a
large number of bins, each with small probability πi , the variance is approximately
σi2 = nπi and the covariances, cov(ni , nj ) = −nπi πj , i 6= j, are approximately zero.
The ‘error’ used in equation 8.152 is thus that expected from the model and is
therefore a function of the parameters. In the least squares method we assumed,
as a condition of the Gauss-Markov theorem, that σi2 was fixed. Since that is here
not the case, the Gauss-Markov theorem does not apply to minimum χ2 .
This use of the error expected from the model may seem rather surprising, but
nevertheless this is the definition of Q21 . We note that in least squares the error
was actually also an expected error, namely the error expected from the measuring
apparatus, not the error estimated from the measurement itself.
For the binned maximum likelihood method the likelihood function is the multinomial p.d.f. itself,
$$f = \frac{n!}{n_1!\,n_2!\cdots n_k!}\,\pi_1^{n_1}\pi_2^{n_2}\cdots\pi_k^{n_k} = n!\prod_{i=1}^{k}\frac{\pi_i^{n_i}}{n_i!}$$
Dropping factors which are independent of the parameters, the log-likelihood which
is to be maximized is given by
$$\ell = \ln L = \sum_{i=1}^{k} n_i\ln\pi_i \qquad (8.156)$$
Note that in the limit of zero bin width this is identical to the usual log-likelihood
of equation 8.57. The estimators θ̂j are the values of θ for which ℓ is maximum and
are given by
$$\frac{\partial\ell}{\partial\theta_j} = \sum_{i=1}^{k} n_i\,\frac{\partial\ln\pi_i}{\partial\pi_i}\,\frac{\partial\pi_i}{\partial\theta_j} = n\sum_{i=1}^{k}\frac{p_i}{\pi_i}\,\frac{\partial\pi_i}{\partial\theta_j} = 0 \qquad (8.157)$$
• Q21 requires a large number of bins with small πi for each bin in order to
neglect the correlations and to approximate the variance by nπi . Assuming
that the model is correct, this will mean that all ni must be small.
• In addition, Q22 requires all ni to be large in order that ni be a good estimate of the variance. Thus the ni must be neither too large nor too small. In particular, an ni = 0 causes Q22 to blow up.
• The binned maximum likelihood method does not suffer from such problems.
In view of the above, it is perhaps not surprising that the maximum likelihood
method usually converges faster to efficiency. In this respect the modified minimum
chi-square (Q22 ) is usually the worst of the three methods.11
One may still choose to minimize Q21 or Q22 , perhaps because the problem is linear so that the equations ∂Q²/∂θj = 0 can be solved simply by a matrix inversion
instead of a numerical minimization. One must then ensure that there are no small
ni , which in practice is usually taken to mean that all ni must be greater than 5 or
10. Usually one attains this by combining adjacent bins. However, one can just as
well combine non-adjacent ones. Nor is there any requirement that all bin widths
be equal. One must simply calculate the πi properly, i.e., as the integral of the
p.d.f. over the bin, which is not always adequately approximated by the bin width
times the value of the p.d.f. at the center of the bin.
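A minimal Python sketch of a binned maximum likelihood fit along these lines, computing πi as the integral of the p.d.f. over each bin (via the c.d.f.) rather than as bin width times the p.d.f. at the bin centre; the exponential model and the simulated data are hypothetical.

import numpy as np
from scipy.stats import expon
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
data = rng.exponential(2.0, size=1000)
edges = np.linspace(0.0, 10.0, 21)
n_i, _ = np.histogram(data, bins=edges)

def neg_binned_loglike(tau):
    pi = np.diff(expon.cdf(edges, scale=tau))   # integral of the p.d.f. over each bin
    pi /= pi.sum()                              # account for the finite range
    return -np.sum(n_i * np.log(pi))            # -(sum_i n_i ln pi_i), eq. 8.156

fit = minimize_scalar(neg_binned_loglike, bounds=(0.5, 5.0), method="bounded")
print("tau_hat =", fit.x)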
Since the maximum likelihood method is usually preferred, we can ask why we
bin the data at all. Although binning is required in order to use a minimum chi-
square method, we can perfectly well do a maximum likelihood fit without binning.
Although binning loses information, it may still be desirable in the maximum like-
lihood method in order to save computing time when the data sample is very large.
In choosing the bin sizes one should pay particular attention to the amount of in-
formation that is lost. Large bins lose little information in regions where the p.d.f.
is nearly constant. Nor is much information lost if the bin size is small compared to
the experimental resolution in the measurement of x. It would seem best to try to
have the information content of the bins approximately equal. However, even with
this criterion the choice of binning is not unique. It is then wise to check that the
results do not depend significantly on the binning.
8.7 Practical considerations
In this section we try to give some guidance on which method to use and to treat
some complications that arise in real life.
1. Consistency. The estimator should converge to the true value with increasing
numbers of observations. If this is not the case, a procedure to remove the
bias should be applied.
3. Minimum variance (efficiency). The smaller the variance of the estimator, the
more certain we are that it is near the true value of the parameter (assuming
it is unbiased).
7. Minimum loss of physicist’s time. This is also not fundamental; its importance
is frequently grossly overestimated.
Obtaining simplicity
It may be worth sacrificing some information to obtain simplicity.
Estimates of several parameters can be made uncorrelated by diagonalizing the
covariance matrix and finding the corresponding linear combinations of the param-
eters. But the new parameters may lack physical meaning.
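A small numerical sketch of this diagonalization with numpy; the two-parameter covariance matrix is hypothetical.

import numpy as np

theta_hat = np.array([1.0, 2.0])
cov = np.array([[0.04, 0.03],
                [0.03, 0.09]])

eigvals, eigvecs = np.linalg.eigh(cov)    # cov = U diag(eigvals) U^T
phi_hat = eigvecs.T @ theta_hat           # uncorrelated linear combinations
cov_phi = eigvecs.T @ cov @ eigvecs       # diagonal (up to rounding)

print(phi_hat)
print(np.round(cov_phi, 12))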
Techniques for bias removal will be discussed below (section 8.7.2).
When sufficient statistics exist, they should be used, since they can be estimated
optimally (cf. section 8.2.8).
Asymptotically, most usual estimators are unbiased and normally distributed.
The question arises how good the asymptotic approximation is in any specific case.
The following checks may be helpful:
• Check that the log-likelihood function or χ2 is a parabolic function of the
parameters.
• If one has two asymptotically efficient estimators, check that they give con-
sistent results. An example is the minimum chi-square estimate from two
different binnings of the data.
• Study the behavior of the estimator by Monte Carlo techniques, i.e., make
a large number of simulations of the experiment and apply the estimator to
each Monte Carlo simulation in order to answer questions such as whether the
estimate is normally distributed. However, this can be expensive in computer
time.
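A short Python sketch of such a Monte Carlo study; the estimator (the sample mean of 20 exponentially distributed measurements) and all numbers are hypothetical.

import numpy as np

rng = np.random.default_rng(42)
tau_true, n_events, n_experiments = 2.0, 20, 5000

estimates = np.array([rng.exponential(tau_true, n_events).mean()
                      for _ in range(n_experiments)])

z = (estimates - estimates.mean()) / estimates.std(ddof=1)
print("mean of estimates:", estimates.mean())   # check for bias
print("spread           :", estimates.std(ddof=1))
print("skewness         :", np.mean(z**3))      # ~0 if normally distributed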
A change of parameters can sometimes make an estimator simpler. For instance
the estimate of θ2 = g(θ1 ) may be simpler than the estimate of θ1 . However, it is in
general impossible to remove both the bias and the non-normality of an estimator
in this way4, 5 .
Economic considerations
Economy usually implies fast computing. Optimal estimation is frequently iterative,
requiring much computer time. The following three approaches seek a compromise
between efficiency (minimum variance) and economic cost.
• Linear methods. The fastest computing is offered by linear methods, since
they do not require iteration. These methods can be used when the expected
values of the observations are linear functions of the parameters. Among linear
unbiased estimators, the least squares method is the best, which follows from
the Gauss-Markov theorem (section 8.5.4).
When doing empirical fits, rather than fits to a known (or hypothesized) p.d.f.,
choose a p.d.f. from the exponential family (section 8.2.7) if possible. This
leads to easy computing and has optimal properties.
• Two-step methods. Some computer time can be saved by breaking the prob-
lem into two steps:
1. Estimate the parameters by a simple, fast, inefficient method, e.g., the
moments method.
2. Use these estimates as starting values for an optimal estimation, e.g.,
maximum likelihood.
Although more physicist’s time may be spent in evaluating the results of the
first step, this might also lead to a better understanding of the problem.
• Three-step method.
The third step should not be forgotten. It is particularly important when the
information in the data is small (‘small statistics’). Because of the third step,
the second step does not have to be exact, but only approximate.
Thus,
$$E\left[2\hat\theta - \tfrac{1}{2}(\hat\theta_1 + \hat\theta_2)\right] = \theta + O\!\left(\frac{1}{N^2}\right)$$
and we see that we have a method to reduce the bias from $O(1/N)$ to $O(1/N^2)$. The variance is, however, in general increased by a term of order $1/N$.
A generalization of this method,11, 13 known as the jackknife,∗ estimates θ by
$$\hat\theta_J = N\hat\theta_N - (N-1)\,\hat\theta_{N-1}$$
where θ̂N is the estimator using all N events and θ̂N−1 is the average of the N estimates possible using N − 1 events:
$$\hat\theta_{N-1} = \frac{1}{N}\sum_{i=1}^{N}\hat\theta_i \qquad (8.160)$$
where θ̂i is the estimate obtained using all events except event i.
A more general method, of which the jackknife is an approximation, is the
bootstrap method introduced by Efron.41–43 Instead of using each subset of N − 1
observations, it uses samples of size M ≤ N randomly drawn, with replacement,
from the data sample itself. For details, see, e.g., Reference 43.
∗
Named after a large folding pocket knife, this procedure, like its namesake, serves as a handy
tool in a variety of situations where specialized techniques may not be available.
The jackknife also provides an estimate of the variance of θ̂,
$$\hat V_J\big[\hat\theta\big] = \frac{N-1}{N}\sum_{i=1}^{N}\left(\hat\theta_i - \hat\theta_{N-1}\right)^2 \qquad (8.161)$$
and the corresponding bootstrap estimate from the B bootstrap samples is
$$\hat V_B\big[\hat\theta\big] = \frac{1}{B-1}\sum_{b=1}^{B}\left(\hat\theta_b - \bar{\hat\theta}\right)^2\ ,\qquad\text{where } \bar{\hat\theta} = \frac{1}{B}\sum_{b=1}^{B}\hat\theta_b \qquad (8.162)$$
Note that these two methods are applicable to non-parametric estimators as well
as parametric. If the estimators are the result of a parametric fit, e.g., ml, the B
bootstrap samples can be generated from the fitted distribution function, i.e., the
parametric estimate of the population, rather than from the data. The estimation
of the variance is again given by equation 8.162.
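A minimal Python sketch of the bootstrap variance estimate of equation 8.162, drawing B samples of size N with replacement from the data; the data and the estimator (a trimmed mean) are hypothetical.

import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(10.0, 2.0, size=200)

def estimator(sample):
    return np.sort(sample)[10:-10].mean()     # a 5% trimmed mean, for example

B = 1000
boot = np.array([estimator(rng.choice(data, size=data.size, replace=True))
                 for _ in range(B)])

var_boot = np.sum((boot - boot.mean()) ** 2) / (B - 1)   # eq. 8.162
print("theta_hat =", estimator(data), "  bootstrap error =", np.sqrt(var_boot))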
Limitation: It should be clear that the non-parametric bootstrap will not be
reliable when the estimator depends strongly on the tail of the distribution, as is
the case, e.g., with high-order moments. A bootstrap sample can never contain
points larger than the largest point in the data.
1. What kind of parameters can be estimated without any assumption about the
form of the p.d.f.? Such estimators are usually called ‘distribution-free’. This
term may be misleading, for although the estimate itself does not depend on
the assumption of a p.d.f., its properties, e.g., the variance, do depend on the
actual (unknown) p.d.f.
2. How reliable are the estimates if the assumed form of the p.d.f. is not quite
correct?
There is relatively little known about robust estimation. The only case treated ex-
tensively in the literature is the estimation of the center of an unknown, symmetric
distribution. The center of a distribution may be defined by a ‘location parameter’
such as the mean, the median, the mode, the midrange, etc. Several of these esti-
mators were mentioned in section 8.1. The sample mean is the most obvious and
most often used estimator of location because
• By the central limit theorem it is consistent whenever the variance of the p.d.f.
is finite.
However, if the distribution is not normal, the sample mean may not be the best
estimator. For symmetric distributions of finite range, e.g., the uniform p.d.f. or a
triangular p.d.f., the location is determined by specifying the end points of the dis-
tribution. The midrange is then an excellent estimator. However, for distributions
of infinite range, the midrange is a poor estimator.
The following table4, 5 shows asymptotic efficiencies, i.e., the ratio of the min-
imum variance bound to the variance of the estimator, of location estimators for
various p.d.f.’s.
None of these three estimators is asymptotically efficient for all four distribu-
tions. Nor has any of these estimators a non-zero asymptotic efficiency for all four
distributions. As an example take a distribution which is the sum of a normal distribution and a Cauchy distribution having the same mean:
$$f(x) = (1-\beta)\,N(x;\mu,\sigma^2) + \beta\,C(x;\mu,\alpha)\ ,\qquad 0\le\beta\le 1$$
Because of the Cauchy admixture, the sample mean has infinite variance, as we
see in the table, while the sample median has at worst (β = 1) a variance of
1/0.64 = 1.56 times the minimum variance bound. This illustrates that the median
is generally more robust than the mean.
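This can be checked with a small Monte Carlo experiment in Python: for a normal distribution contaminated by a Cauchy component with the same centre, the sample mean is badly behaved while the sample median is not. The contamination fraction β and the other numbers are hypothetical.

import numpy as np
from scipy.stats import cauchy

rng = np.random.default_rng(3)
beta, n_events, n_experiments = 0.1, 100, 2000

means, medians = [], []
for _ in range(n_experiments):
    is_cauchy = rng.random(n_events) < beta
    x = np.where(is_cauchy,
                 cauchy.rvs(size=n_events, random_state=rng),
                 rng.normal(0.0, 1.0, n_events))
    means.append(x.mean())
    medians.append(np.median(x))

print("spread of sample means  :", np.std(means))    # inflated by rare huge values
print("spread of sample medians:", np.std(medians))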
Other methods to improve robustness involve ‘trimming’, i.e., throwing away
the highest and lowest points before using one of the above estimators. This is
particularly useful when there are large tails which come mostly from experimental
problems. Such methods are further discussed by Eadie et al.4, 5
Note that the efficiency may depend on both x and y. The likelihood of a given set
of observations is then
$$L(x_1,\ldots,x_N,\,y_1,\ldots,y_N;\theta,\psi) = \prod_{i=1}^{N} g(x_i,y_i;\theta,\psi) = \prod_{i=1}^{N} g_i$$
Hence,
$$\ell = \ln L = W + \sum_{i=1}^{N}\ln(e_i q_i) \qquad (8.163)$$
where
$$W = \sum_{i=1}^{N}\ln p_i - N\ln\int p\,q\,e\;dx\,dy \qquad (8.164)$$
and where $p_i = p(x_i;\theta)$.
Suppose now that we are not interested in estimating ψ, but only θ. Then the second
term of equation 8.163 does not depend on the parameters and may be ignored. The
estimates θ̂ and their variances are then found in the usual way treating W as the
log-likelihood.
In practice, difficulties arise when pqe is not analytically normalized, but must
be normalized numerically by time-consuming Monte Carlo. Moreover, the results
depend on the form of q, which may be poorly known and of little physical interest.
For these reasons one prefers to find a way of eliminating q from the expressions.
Since this will exclude information, it will increase the variances, but at the same
time make the estimates more robust.
Weighting the contribution of each event to W by wi = 1/ei results in W′ (equation 8.165). Whatever the validity of this argument, it turns out4, 5 that the estimate θ̂′ obtained by maximizing W′ is, like the usual maximum likelihood estimate, asymptotically normally distributed about the true value. However, care must be taken in evaluating the variance. Using the second derivative matrix of W′ is wrong since it assumes that
$$\sum_{i=1}^{N} w_i = N$$
events have been observed. One approach to curing this problem is to renormalize the weights by using wi′ = N wi / Σ wi instead of wi. However, this is only satisfactory if the weights are all nearly equal.
The correct procedure, which we will not derive, results in4, 5
$$V\big[\hat\theta{}'\big] = H^{-1} H' H^{-1} \qquad (8.166)$$
with
$$H_{jk} = E\left[\frac{1}{e}\left(\frac{\partial\ln p}{\partial\theta_j}\right)\!\left(\frac{\partial\ln p}{\partial\theta_k}\right)\right]\ ,\qquad H'_{jk} = E\left[\frac{1}{e^2}\left(\frac{\partial\ln p}{\partial\theta_j}\right)\!\left(\frac{\partial\ln p}{\partial\theta_k}\right)\right] \qquad (8.167)$$
evaluated at θ = θ̂′. If e is constant, this reduces to the usual estimator of the covariance matrix given in equations 8.78 and 8.80.
Alternatively, one can estimate the matrix elements from the second derivatives:
" #
∂ 2 W ′ 1 XN
1 ∂ 2 ln pi
Hjk = − ; Ĥjk = − (8.168a)
∂θj ∂θk θ=θ̂ N i=1 ei ∂θj ∂θk θ=θ̂
" #
′ 1 ∂ 2 W ′ 1 XN
1 ∂ 2 ln pi
Hjk =− ; Ĥjk = − (8.168b)
e ∂θj ∂θk θ =θ̂ N i=1 e2i ∂θj ∂θk θ=θ̂
If e is constant, this reduces to the usual estimator of the covariance matrix given
in equations 8.84 and 8.85.
To summarize: Find the estimates θ̂′ by maximizing W′ (eq. 8.165). If possible, compute H and H′ by equation 8.167 or 8.168; if the derivatives are not known analytically, use equation 8.168, evaluating $\partial^2 W'/\partial\theta_j\,\partial\theta_k$ numerically. The covariance matrix is then given by equation 8.166.
It is clear from the above formulae that the appearance of one event with a
very large weight will ruin the method, since it will cause W ′ (equation 8.165) to
be dominated by one term and will make the variance very large. Accordingly, a
better estimate may be obtained by rejecting events with very large weights.
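A minimal numerical sketch (Python) of the sandwich form V = H⁻¹H′H⁻¹ for a one-parameter weighted fit, assuming that W′ is the efficiency-weighted log-likelihood Σ wi ln pi with wi = 1/ei; H and H′ are computed here as unnormalized sums over events, which absorbs the factors of N appearing in equation 8.168. The exponential model and the efficiencies are hypothetical.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.exponential(2.0, size=2000)
e = 0.5 + 0.5 * rng.random(x.size)       # assumed per-event efficiencies
w = 1.0 / e                              # weights w_i = 1/e_i

def neg_wprime(tau):
    # W' = sum_i w_i ln p(x_i; tau) with p(x; tau) = exp(-x/tau)/tau
    return -np.sum(w * (-np.log(tau) - x / tau))

tau_hat = minimize_scalar(neg_wprime, bounds=(0.1, 10.0), method="bounded").x

dlnp  = -1.0 / tau_hat + x / tau_hat**2           # d ln p / d tau at tau_hat
d2lnp =  1.0 / tau_hat**2 - 2.0 * x / tau_hat**3  # d^2 ln p / d tau^2 at tau_hat

H      = -np.sum(w * d2lnp)              # analogue of H  (cf. eq. 8.168a), unnormalized
Hprime =  np.sum(w**2 * dlnp**2)         # analogue of H' (cf. eq. 8.167), unnormalized
var_tau = Hprime / H**2                  # V = H^-1 H' H^-1 for one parameter
print(tau_hat, np.sqrt(var_tau))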
E [ci ] = ai (8.170)
where $w = 1/e$ is the weight. This expectation can be estimated by the sample mean of the weights of the events in the bin:
$$\widehat{E[w]}_i = \frac{\sum_{j=1}^{n_i} w_{ij}}{n_i}$$
where wij is the weight (1/ei) of the jth event in the ith bin. We now define
$$c_i = \frac{b_i\,n_i}{\sum_{j=1}^{n_i} w_{ij}}$$
From the preceding equations it is clear that this ci satisfies equation 8.170.
The expressions for Q2 then use ci instead of ai . Writing σi2 for ai in the case of
Q21 and for ni in the case of Q22 , both may be written as
$$Q^2 = \sum_{i=1}^{k}\frac{1}{\sigma_i^2}\left(n_i - b_i\,\frac{n_i}{\sum_j w_{ij}}\right)^{\!2} = \sum_{i=1}^{k}\frac{1}{\sigma_i'^2}\left(\sum_j w_{ij} - b_i\right)^{\!2}$$
where
$$\frac{1}{\sigma_i'^2} = \frac{1}{\sigma_i^2}\left(\frac{n_i}{\sum_j w_{ij}}\right)^{\!2}$$
since $E\left[\sum_{j=1}^{n_i} w_{ij}\right] = b_i$. Further, one can show that
$$E\left[\left(\sum_{j=1}^{n_i} w_{ij}\right)^{\!2}\right] = \sum_{j=1}^{n_i} E\left[w_{ij}^2\right] + \sum_{j=1}^{n_i}\sum_{\substack{k=1\\ k\ne j}}^{n_i} E\left[w_{ij}w_{ik}\right] \approx E[n_i]\,E\left[w_i^2\right] + b_i^2$$
E [ni ] can be estimated in two ways: from the model, which gives the minimum
chi-square method; or from the data, which gives the modified minimum chi-square
method. The resulting expressions for Q2 are
$$\widehat{E[n_i]} = c_i:\qquad Q_1'^2 = \sum_{i=1}^{k}\frac{\left(\sum_{j=1}^{n_i} w_{ij} - b_i\right)^2}{\,b_i\sum_{j=1}^{n_i} w_{ij}^2\big/\sum_{j=1}^{n_i} w_{ij}\,} \qquad (8.171)$$
$$\widehat{E[n_i]} = n_i:\qquad Q_2'^2 = \sum_{i=1}^{k}\frac{\left(\sum_{j=1}^{n_i} w_{ij} - b_i\right)^2}{\sum_{j=1}^{n_i} w_{ij}^2} \qquad (8.172)$$
The distribution of random measurement errors is generally assumed to be Gaussian. Thus if it is simply stated that the error is 1%, you expect that this
distribution will be a Gaussian distribution with a standard deviation of 1% of the
true value. The standard deviation of a single reading will be 1% of that reading.
But by making many (N) readings and averaging them, you obtain an estimate of
the true value which has a much smaller variance. Usually, the variance is reduced
by a factor 1/N, which follows from the central limit theorem.
If the meter has a systematic error such that it consistently reads 1% too high,
the situation is different. The readings are thus correlated. Averaging a large
number of readings will not decrease this sort of error, since it affects all the readings
in the same way. With more readings, the average will not converge to the true
value but to a value 1% higher. It is as though we had a biased estimator.
Systematic errors can be very difficult to detect. For example, we might measure
the voltage across a resistor for different values of current. If the systematic error
was 1 Volt, all the results would be shifted by 1 Volt in the same direction. If we
plotted the voltages against the currents, we would find a straight line, as expected.
However, the line would not pass through the origin. Thus, we could in principle
discover the systematic effect. On the other hand, with a systematic error of 1% on
the voltage, all points would be shifted by 1% in the same direction. The voltages
plotted against the currents would lie on a straight line and the line would pass
through the origin. The voltages would thus appear to be correctly measured.
But the slope of the line would be incorrect. This is the worst kind of systematic
error—one which cannot be detected statistically. It is truly a ‘hidden fault’.
The size of a systematic error may be known. For example, consider temperature
measurements using a thermocouple. You calibrate the thermocouple by measuring
its output voltages V1 and V2 for two known temperatures, T1 and T2 , using a volt-
meter of known resolution. You then determine some temperatures T by measuring
voltages V and using the proportionality of V to T to calculate T :
$$T = \frac{T_2 - T_1}{V_2 - V_1}\,(V - V_1) + T_1$$
The error on T will include a systematic contribution from the errors on V1 and V2
as well as a random error on V . In this example the systematic error is known.
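A small Python sketch of this error propagation, separating the random contribution (from V) and the systematic contribution (from V1 and V2); all numbers are hypothetical.

import numpy as np

T1, T2 = 0.0, 100.0          # calibration temperatures
V1, V2 = 0.10, 5.10          # measured calibration voltages
sV1 = sV2 = 0.02             # voltmeter resolution: systematic for T
sV = 0.02                    # resolution of an individual reading: random

def temperature(V):
    return (T2 - T1) / (V2 - V1) * (V - V1) + T1

def temperature_errors(V):
    b = (T2 - T1) / (V2 - V1)                  # slope of the calibration line
    dT_dV  = b
    dT_dV1 = -b + b * (V - V1) / (V2 - V1)     # derivative w.r.t. V1
    dT_dV2 = -b * (V - V1) / (V2 - V1)         # derivative w.r.t. V2
    random_err     = abs(dT_dV) * sV
    systematic_err = np.sqrt((dT_dV1 * sV1) ** 2 + (dT_dV2 * sV2) ** 2)
    return random_err, systematic_err

print(temperature(2.60), temperature_errors(2.60))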
In other cases the size of the systematic error is little more than a guess. Suppose
you are studying gases at various pressures and you measure the pressure using a
mercury manometer. Actually it only measures the difference in pressure between
atmospheric pressure and that in your vessel. For the value of the atmospheric
pressure you rely on that given by the nearest meteorological station. But how big
is the difference in the atmospheric pressure between the station at the time the
atmospheric pressure was measured and your laboratory at the time you did the
experiment?
Or, suppose you are measuring a (Gaussian) signal on top of a background. The
estimate of the signal (position, width, strength) may depend on the functional
form chosen for the background. If you do not know what this form is, you should
try various forms and assign systematic errors based on the resulting variations in
the estimates.
Experimental tips
To clear your experiment of ‘hidden faults’ you should begin in the design of the
experiment. Estimate what the systematic errors will be, and, if they are too large,
design a better experiment.
Build consistency checks into the experiment, e.g., check the calibration of an
instrument at various times during the course of the experiment.
Try to convert a systematic error into a random error. Many systematic effects
are a function of time. Examples are electronics drifts, temperature drifts, even
psychological changes in the experimenter. If you take data in an orderly sequence,
e.g., measuring values of y as a function of x in the order of increasing x, such drifts
are systematic. So mix up the order. By making the measurements in a random
order, these errors become random.
The correct procedure depends on what you are trying to measure. If there are
hysteresis effects in the apparatus, measuring or setting the value of a quantity,
e.g., a magnetic field strength, from above generally gives a different result than
setting it from below. Thus, if the absolute values are important such adjustments
should be done alternately from above and from below. On the other hand, if
only the differences are important, e.g., you are only interested in a slope, then all
adjustments should be made from the same side, as the systematic effect will then
cancel.
∗ This assumes that the errors are normally distributed. If you know this not to be the case, you should try to combine the errors using the correct p.d.f.'s.
Error propagation is done using the covariance matrix in the usual way except
that we keep track of the statistical and systematic contributions to the error.
Suppose that we have two ‘independent’ measurements x1 and x2 with statistical
errors σ1 and σ2 and with a common systematic error s. For pedagogical purposes
we can think of the xi as being composed of two parts, $x_i = x_i^R + x_i^S$, where $x_i^R$ has only a random statistical error, σi, and $x_i^S$ has only a systematic error, s. Then $x_1^R$ and $x_2^R$ are completely independent and $x_1^S$ and $x_2^S$ are completely correlated. The variance of xi is then
$$V[x_i] = E\left[x_i^2\right] - \left(E[x_i]\right)^2 = E\left[\left(x_i^R + x_i^S\right)^2\right] - \left(E\left[x_i^R + x_i^S\right]\right)^2 = \sigma_i^2 + s^2$$
The covariance is
$$\mathrm{cov}(x_1,x_2) = E[x_1 x_2] - E[x_1]\,E[x_2] = E\left[\left(x_1^R + x_1^S\right)\left(x_2^R + x_2^S\right)\right] - E\left[x_1^R + x_1^S\right]E\left[x_2^R + x_2^S\right] = s^2$$
This is just the covariance matrix previously considered in section 8.5.5 with the addition of s² to every element. As an example, consider a fit to a straight line, y = a + bx. Using this V and ǫ = y − a − bx in Q² = ǫᵀV⁻¹ǫ, and solving ∂Q²/∂a = 0 and ∂Q²/∂b = 0, leads to the same expressions for the estimators as before (equation 8.125). A common systematic shift of all points up or down clearly has no effect on the slope, and therefore we expect the same variance for b̂ as before. However, a systematic shift in y will affect the intercept; consequently, we expect a larger variance for â.
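A short Python sketch of such a fit: the covariance matrix is the diagonal of statistical errors plus s² added to every element, and the generalized least squares solution gives the estimates and their covariance. The data points and errors are hypothetical.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.0, 2.8, 4.1, 5.2])
sigma = 0.1 * np.ones_like(x)             # independent statistical errors
s = 0.2                                   # common systematic error on each y

V = np.diag(sigma**2) + s**2 * np.ones((x.size, x.size))
H = np.column_stack([np.ones_like(x), x])     # design matrix for y = a + b*x

Vinv = np.linalg.inv(V)
cov_theta = np.linalg.inv(H.T @ Vinv @ H)     # covariance of (a_hat, b_hat)
theta = cov_theta @ H.T @ Vinv @ y

print("a, b =", theta)
print("sigma_a, sigma_b =", np.sqrt(np.diag(cov_theta)))
# Repeating with s = 0 changes sigma_a but leaves sigma_b essentially
# unchanged: a common shift of all points does not affect the slope.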
Chapter 9
Confidence intervals
In the previous chapter we have discussed methods to estimate the values of un-
known parameters. As the uncertainty, or “error”, δ θ̂, on the estimate, θ̂, we have
been content to state the standard deviations and correlation coefficients of the
estimate as found from the covariance matrix or the estimated covariance matrix.
This is inadequate in certain cases, particularly when the sampling p.d.f., i.e., the
p.d.f. of the estimator is non-Gaussian. In this chapter our interest is to find the
range
θa ≤ θ ≤ θb
which contains the true value θt of θ with “probability” β. We shall see that when
the sampling p.d.f. is Gaussian, the interval [θa , θb ] for β = 68.3% is the same as
the interval of ±1 standard deviation about the estimated value.
9.1 Introduction
In parameter estimation we found an estimator θ̂ for a parameter θ and its variance σθ̂² = V[θ̂], and we wrote the result as θ = θ̂ ± σθ̂. Assuming a normal distribution
for θ̂, one is then tempted to say, as we did in section 8.2.4, that the probability is
68.3% that
θ̂ − σθ̂ ≤ θt ≤ θ̂ + σθ̂ (9.1)
Now, what does this statement mean? If we interpret it as 68.3% probability that
the value of θt is within the stated range, we are using Bayesian probability (cf.
section 2.4.4) with the assumption of uniform prior probability. This assumption is
not always justifiable and often is wrong, as is illustrated in the following example:
An empty dish is weighed on a balance. The result is 25.31 ± 0.14 g. A sample
of powder is placed on the dish, and the weight is again determined. The result is
25.51 ± 0.14 g. By subtraction and combination of errors, the weight of the powder
is found to be 0.20 ± 0.20 g. Our first conclusion is that the scientist should have
used a better balance. Next we try to determine some probabilities. From the
normal distribution, there is a probability of about 16% that a value lies lower than
µ − σ. In this example that means that there is a chance of about 16% that the
powder has negative weight (an anti-gravity powder!). The problem here is Bayes’
postulate of uniform prior probability. We should have incorporated in the prior
knowledge the fact that the weight must be positive, but we didn’t.
Let us avoid the problems of Bayesian prior probability and stick to the fre-
quentist interpretation. This will lead us to the concept of confidence intervals,
developed largely by Neyman,45 which give a purely frequentist interpretation to
equation 9.1. We shall return to the Bayesian interpretation in section 9.9.
Suppose we have a p.d.f. f (x; θ) which depends on one parameter θ. The prob-
ability content β of the interval [a, b] in X-space is
$$\beta = P(a \le X \le b) = \int_a^b f(x;\theta)\,dx \qquad (9.2)$$
Common choices for β are 68.3% (1σ), 95.4% (2σ), 99.7% (3σ), 90% (1.64σ), 95%
(1.96σ), and 99% (2.58σ), where the correspondence between percent and a number
of standard deviations (σ) assumes that f is a Gaussian p.d.f.
If the function f and the parameter θ are known we can calculate β for any a
and b. If θ is unknown we try to find another variable z = z(x, θ) such that its
p.d.f., g(z), is independent of θ. If such a z can be found, we can construct an
interval [za , zb ], where zx = z(x, θ), such that
$$\beta = P(z_a \le Z \le z_b) = \int_{z_a}^{z_b} g(z)\,dz \qquad (9.3)$$
It may then be possible to use this equation together with equation 9.2 to find an
interval [θ− , θ+ ] such that
P (θ− ≤ θt ≤ θ+ ) = β (9.4)
The meaning of this last equation must be made clear. Contrast the following
two quite similar statements:
1. The probability that θt is in the interval [θ− , θ+ ] is β.
2. If the experiment were repeated many times, each time constructing the interval [θ−, θ+] in the same way, the interval would contain the true value θt in a fraction β of the cases. Thus, β expresses the degree of confidence (or belief) in our assertion;
hence the name confidence interval. The quantity β is known by various names:
confidence coefficient, coverage probability, confidence level. However, the
last term, “confidence level”, is inadvisable, since it is also used for a different
concept, which we will encounter in goodness-of-fit tests (cf. section 10.6).
The interval [θ− , θ+ ] corresponding to a confidence coefficient β is in general not
unique; many different intervals exist with the same probability content.
We can, of course, choose to state any one of these intervals. Commonly used
criteria to remove this arbitrariness are
3. Central interval: the probability content below and above the interval are
equal, i.e., P (θ < θ− ) = P (θ > θ+ ) = (1 − β)/2.
For a symmetric distribution having a single maximum these criteria are equivalent.
We usually prefer intervals satisfying one (or more) of these criteria. However, non-
central intervals will be preferred when there is some reason to be more careful on
one side than on the other, e.g., the amount of tritium emitted from a nuclear power
station.
since the c.d.f. of the normal p.d.f. is the error function (cf. section 3.7).
If θ is not known, we can not evaluate the integral. Instead, assuming that σ is
known, we transform to the r.v. z = t − θ. The interval [c, d] for z corresponds to
the interval [θ + c, θ + d] for t. Hence, equation 9.3 becomes
$$\beta = P(\theta+c \le T \le \theta+d) = \int_{\theta+c}^{\theta+d} N(t;\theta,\sigma^2)\,dt = \mathrm{erf}\!\left(\frac{d}{\sigma}\right) - \mathrm{erf}\!\left(\frac{c}{\sigma}\right) \qquad (9.6)$$
β = P (t − d ≤ θ ≤ t − c) (9.7)
Again, we emphasize that although this looks like a statement concerning the proba-
bility that θ is in this interval, it is not, but instead means that we have a probability
β of being right when we assert that θ is in this interval.
If neither θ nor σ is known, one chooses the standardized variable z = (t − θ)/σ. The probability statement about Z is
$$\beta = P(c \le Z \le d) = \int_c^d N(z;0,1)\,dz = \mathrm{erf}(d) - \mathrm{erf}(c) \qquad (9.8)$$
These values of t then define an interval in t-space, [t− , t+ ], with probability content
β. Usually the choice of t− and t+ is not unique, but may be fixed by an additional
criterion, e.g., by requiring a central interval:
$$\int_{-\infty}^{t_-} f(t|\theta)\,dt = \frac{1-\beta}{2} = \int_{t_+}^{+\infty} f(t|\theta)\,dt \qquad (9.11)$$
For any value of θ, the chance of finding a value of t in the interval [t− (θ), t+ (θ)]
is β, by construction. Conversely, having done an experiment giving a value t = t̂,
the values of θ− and θ+ corresponding to t+ = t̂ and t− = t̂ can be read off of the
plot as indicated. The interval [θ− , θ+ ] is then a confidence interval of probability
content β for θ. This can be seen as follows:
Suppose that θt is the true value of θ. A fraction β of experiments will then
result in a value of t in the interval [t− (θt ), t+ (θt )]. Any such value of t would yield,
by the above-indicated method, an interval [θ− , θ+ ] which would include θt . On the
other hand, the fraction 1 − β of experiments which result in a value of t not in the
interval [t− (θt ), t+ (θt )] would yield an interval [θ− , θ+ ] which would not include θt .
Thus the probability content of the interval [θ− , θ+ ] is also β.
To summarize, given a measurement t̂, the central β confidence interval (θ− ≤
θ ≤ θ+ ) is the solution of
$$\int_{-\infty}^{\hat t} f(t|\theta_+)\,dt = \frac{1-\beta}{2} = \int_{\hat t}^{+\infty} f(t|\theta_-)\,dt \qquad (9.12)$$
If f (t) is a normal p.d.f., which is often (at least asymptotically, as we have seen in
chapter 8) the case, this interval is identical for β = 68.3% to [θ̂−σθ̂ < θ < θ̂+σθ̂ ]. If
f (t) is not Gaussian, the interval of ±1σ (σ 2 the variance of θ̂) does not necessarily
correspond to β = 68.3%. In this case the uncertainty should be given which does
correspond to β = 68.3%. Such an interval is not necessarily symmetric about θ̂.
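A minimal Python sketch of solving equation 9.12 numerically for the central interval, here assuming a Gaussian sampling p.d.f. with known σ so that the result can be compared with θ̂ ± 1σ for β = 68.3%; the measured value is hypothetical.

import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

t_hat, sigma, beta = 3.2, 0.5, 0.683
tail = (1.0 - beta) / 2.0

# theta_plus :  P(t <= t_hat | theta_plus)  = (1-beta)/2
theta_plus = brentq(lambda th: norm.cdf(t_hat, loc=th, scale=sigma) - tail,
                    t_hat, t_hat + 10 * sigma)
# theta_minus:  P(t >= t_hat | theta_minus) = (1-beta)/2
theta_minus = brentq(lambda th: norm.sf(t_hat, loc=th, scale=sigma) - tail,
                     t_hat - 10 * sigma, t_hat)

print(theta_minus, theta_plus)      # close to t_hat -/+ sigma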
In ‘pathological’ cases, the confidence belt may wiggle in such a way that the
resulting confidence interval consists of several disconnected pieces. While mathe-
matically correct, the use of such disconnected intervals may not be very meaningful.
For a measurement t̂, θ+ is read from this t− (θ) curve as in the previous section. In
other words, the upper limit, θ+ is the solution of
$$\beta = P(\theta < \theta_+) = \int_{\hat t}^{+\infty} f(t|\theta_+)\,dt \qquad (9.13)$$
The statement is then that θ < θ+ with confidence β, and such an assertion will be
correct in a fraction β of the cases.
Lower limits are defined analogously: The lower limit θ− , for which θ > θ− with
confidence β, is found from
$$\beta = P(\theta > \theta_-) = \int_{-\infty}^{\hat t} f(t|\theta_-)\,dt \qquad (9.14)$$
Note that we have defined these limits as > and <, whereas we used ≥ and ≤
for confidence intervals. Some authors also use ≥ and ≤ for confidence bounds. For
continuous estimators, this makes no difference. However, for discrete estimators,
e.g., a number of events, the integral over the p.d.f. of the estimator is replaced by
a sum, and then this difference is important. This will be discussed further for the
Poisson p.d.f. (section 9.6).
9.4.1 σ known
If the variance, σ 2 , of the estimator is known, the confidence interval is easily calcu-
lated, as shown in the introduction. Suppose we have n measurements of an exact
quantity, µ, like the mass of a ball, using an apparatus of known resolution, σa . The
estimate, µ̂ = x̄, of the quantity is then normally distributed as N(µ̂; µ, σ 2 = σa2 /n),
and confidence intervals (equation 9.7) are computed using σ and the error function
(equation 9.6). The central confidence belt is defined by straight lines correspond-
ing to t± = µ ± bσ, where b is the number of standard deviations corresponding to
probability β.
9.4.2 σ unknown
But suppose that we do not know the resolution of the apparatus. As shown in the
introduction, it is still possible to give a confidence interval, but only in terms of σ
(equations 9.8 and 9.9). Since σ is not known, this is not particularly useful.
Rather, the approach is to estimate σ from the data. In the simple example of a
set of n measurements of the same quantity, x, with an apparatus of constant, but
unknown resolution, σ, the mean is estimated by µ̂ = x̄. As we have seen (equation
8.7), the resolution is then estimated by
$$\hat\sigma = s = \sqrt{\frac{n}{n-1}\,\overline{(x-\bar x)^2}}$$
$$V[\hat\mu] = \frac{s^2}{n}$$
Although z = (x − µ)/σ is distributed as a standard normal p.d.f., i.e., z² is distributed as χ², the corresponding variable for the case of unknown σ,
$$t = \frac{x-\mu}{\hat\sigma} = \frac{(x-\mu)/\sigma}{\hat\sigma/\sigma} = \frac{z}{\hat\sigma/\sigma}$$
is distributed as Student's t (section 3.13).
The factor T is derived from the c.d.f. of Student's t distribution. It is the value of t for which the c.d.f. is equal to ½(1 + β):
$$\int_{-\infty}^{T} t(x;\,n-1)\,dx = \frac{1}{2}(1+\beta) \qquad (9.16)$$
In the case of a least squares fit to measurements yi , all having the same (un-
known) Gaussian error σ, this generalizes to
$$\theta_{i\pm} = \hat\theta_i \pm T\!\left(\tfrac12(1+\beta);\,n-k\right)\sqrt{V\big[\hat\theta_i\big]} \qquad (9.17)$$
where n is the number of points and k the number of parameters in the model.
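For the simple case of n measurements of the same quantity, a short Python sketch of this construction (with T taken from the Student's t quantile of equation 9.16) might look as follows; the measurements are hypothetical.

import numpy as np
from scipy.stats import t as student_t

x = np.array([10.3, 9.8, 10.1, 10.6, 9.9, 10.2])
n, beta = x.size, 0.683

mu_hat = x.mean()
s = x.std(ddof=1)                                # estimated resolution
T = student_t.ppf(0.5 * (1.0 + beta), df=n - 1)  # c.d.f. equal to (1+beta)/2

half_width = T * s / np.sqrt(n)
print(mu_hat - half_width, mu_hat + half_width)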
For example, to find the 95% central confidence interval for p, given that we
observe n successes in N trials, we first find the regions p < p+ and p > p− using
the discrete analogues of equations 9.13 and 9.14 to find 97.5% upper and lower
limits
$$P(p < p_+) = \sum_{k=n+1}^{N} B(k;N,p_+) \ge 0.975 \qquad (9.18a)$$
$$P(p > p_-) = \sum_{k=0}^{n-1} B(k;N,p_-) \ge 0.975 \qquad (9.18b)$$
The smallest value of p+ and the largest value of p− satisfying these equations give
the central 95% confidence interval [p− , p+ ]. In other words, we find the upper and
lower limits for $1 - \frac{1-\beta}{2}$ and then exclude these regions.
Using the ≥ in these equations rather than taking the values of p for which the
equality is most nearly satisfied means that if no value gives an equality, we take
the next larger value for p+ and the next smaller value for p− . This is known as
being conservative. It implies that for some values of p we have overcoverage,
which means that for some values of p the coverage probability is actually greater
than the 95% that we claim, i.e., that P (p− < p < p+ ) > 0.95 instead of = 0.95.
This is not desirable, but the alternative would be to have undercoverage for other
values of p. Since we do not know what the true value of p is—if we did know, we
would not be doing the experiment—the lesser of two evils is to accept overcoverage
in order to rule undercoverage completely out.
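A minimal Python sketch of this construction, finding the smallest p₊ and the largest p₋ of equations 9.18 by solving the corresponding equalities numerically; the values n = 7 and N = 20 are hypothetical.

from scipy.stats import binom
from scipy.optimize import brentq

n, N, beta = 7, 20, 0.95
tail = (1.0 - beta) / 2.0            # 0.025 per side

# p_plus :  P(k <= n | p_plus)  = tail   (equivalent to eq. 9.18a)
p_plus  = brentq(lambda p: binom.cdf(n, N, p) - tail, 1e-9, 1 - 1e-9)
# p_minus:  P(k >= n | p_minus) = tail   (equivalent to eq. 9.18b)
p_minus = brentq(lambda p: binom.sf(n - 1, N, p) - tail, 1e-9, 1 - 1e-9)

print(p_minus, p_plus)               # central 95% interval for p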
When only a small number n of events is observed, one usually quotes an upper limit rather than a two-sided interval. The upper limit µ₊ with confidence β is defined by requiring that the probability of observing n or fewer events be 1 − β,
$$1-\beta = \sum_{k=0}^{n} P(k;\mu_+) = \sum_{k=0}^{n} e^{-\mu_+}\frac{\mu_+^k}{k!} \qquad (9.19)$$
The solution is easily found using the fact that the sum in the right-hand side of
equation 9.19 is related to the c.d.f. of the χ2 -distribution for 2(n + 1) degrees of
freedom.4, 5, 46 Thus,
$$1-\beta = \sum_{k=0}^{n} P(k;\mu_+) = P\!\left(\chi^2(2n+2) > 2\mu_+\right) = \int_{2\mu_+}^{\infty}\chi^2(2n+2)\,d\chi^2 \qquad (9.20)$$
The upper limit µ+ can thus be found from a table of the c.d.f. of χ2 (2n + 2).
Lacking a table, equation 9.19 can be solved by iteration.
Let us emphasize, perhaps unnecessarily, exactly what the upper limit means: If
the true value of µ is really µ+ , the probability that a repetition of the experiment
will find a number of events which is as small or smaller than n is 1 − β; for a true
value of µ larger than µ+ , the chance is even smaller. Thus we say that we are
‘β confident’ that µ is less than µ+ . In making such statements, we will be right in
a fraction β of the cases.
Similarly, the lower limit µ₋ with confidence β is defined by
$$\beta = \sum_{k=0}^{n-1} P(k;\mu_-) = \sum_{k=0}^{n-1} e^{-\mu_-}\frac{\mu_-^k}{k!} \qquad (9.21)$$
which can be found from the c.d.f. of the χ2 -distribution for 2n degrees of freedom.
Thus,
$$\beta = \sum_{k=0}^{n-1} P(k;\mu_-) = P\!\left(\chi^2(2n) > 2\mu_-\right) = \int_{2\mu_-}^{\infty}\chi^2(2n)\,d\chi^2 \qquad (9.22)$$
The fact that it is here 2n degrees of freedom instead of 2(n + 1) as for the upper
limit is because there are only n terms in the sum of equation 9.22 whereas there
were n + 1 terms in the upper limit case, equation 9.20.
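In Python the chi-square quantile is readily available, so equations 9.20 and 9.22 can be used directly instead of a table; the observed n = 3 is hypothetical.

from scipy.stats import chi2

def poisson_upper_limit(n, beta):
    # 1 - beta = P(chi2(2n+2) > 2*mu_plus)   (eq. 9.20)
    return 0.5 * chi2.ppf(beta, 2 * n + 2)

def poisson_lower_limit(n, beta):
    # beta = P(chi2(2n) > 2*mu_minus)        (eq. 9.22)
    return 0.5 * chi2.ppf(1.0 - beta, 2 * n)

n = 3
print(poisson_upper_limit(n, 0.90))    # ~6.68
print(poisson_lower_limit(n, 0.90))    # ~1.10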
9.6.3 Background
As mentioned above, there is usually background to the signal. The background
is also Poisson distributed. The sum of the two Poisson-distributed quantities is
also Poisson distributed (section 3.7), with mean equal to the sum of the means of
the signal and background, µ = µs + µb . Assume that µb is known with negligible
error. However, we do not know the actual number of background events, nb , in our
experiment. We only know that nb ≤ n. If µb + µs is large we may approximate the
Poisson p.d.f. by a Gaussian and take the number of background events as n̂b ≈ µb .
Then µ̂s = n − n̂b = n − µb , with variance V [µ̂s ] = V [n] + V [n̂b ] = n + µb .
An upper limit may be found by replacing µ+ in equation 9.19 by (µ+ + µb ). A
lower limit may be found from equation 9.21 by a similar substitution. The results
are
$$\mu_+ = \mu_+(\text{no background}) - \mu_b \qquad (9.23)$$
$$\mu_- = \mu_-(\text{no background}) - \mu_b \qquad (9.24)$$
A difficulty arises when the number of observed events is not large compared
to the expected number of background events. The situation is even worse when
the expected number of background events is greater than the number of events
observed. For small enough n and large enough µb , equation 9.23 will lead to a
negative upper limit. So, if you follow this procedure, you may end up saying
something like “the number of whatever-I-am-trying-to-find is less than −1 with
95% confidence.” To anyone not well versed in statistics this sounds like nonsense,
and you probably would not want to make such a silly sounding statement. Of
course, 95% confidence means that 5% of the time the statement is false. This is
simply one of those times, but still it sounds silly. We will return to this point in
section 9.12.
With smaller samples it is usually most convenient to use the likelihood ratio
(difference in log likelihood) to estimate the confidence interval. Then, relying on
the assumption that a change of parameters would lead to a Gaussian likelihood
function (cf. section 8.4.5), the region for which ℓ > ℓmax − a²/2, or equivalently (cf. section 8.5.1) χ² < χ²min + a², corresponds to a probability content
$$\beta = \int_{-a}^{+a} N(z;0,1)\,dz = \mathrm{erf}(a) - \mathrm{erf}(-a)$$
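A short Python sketch of this recipe: scan the log-likelihood and keep the region within a²/2 of its maximum; the exponential model and the simulated data are hypothetical.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.exponential(2.0, size=50)

def loglike(tau):
    return np.sum(-np.log(tau) - x / tau)

beta = 0.683
a = norm.ppf(0.5 * (1.0 + beta))            # a = 1 for beta = 68.3%

taus = np.linspace(1.0, 4.0, 2000)
ell = np.array([loglike(t) for t in taus])
inside = ell > ell.max() - 0.5 * a**2       # ell > ell_max - a^2/2
print(taus[inside][0], taus[inside][-1])    # approximate interval end points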
In ‘pathological’ cases, i.e., cases where there is more than one maximum, as
pictured here, the situation is less clear. Applying the above procedure would lead
to disconnected intervals, whereas the interval for the transformed parameter would
give a single interval. It is sometimes said that it is nevertheless correct to state a β confidence interval as
$$\theta_1 \le \theta \le \theta_2 \quad\text{or}\quad \theta_3 \le \theta \le \theta_4$$
[Figure: sketch of a log-likelihood ℓ(θ) with more than one maximum, giving the four crossing points θ1, θ2, θ3, θ4.]
However, this statement seems to be the result of confusing confidence intervals with fiducial intervals (section 9.8). Be that as it may, the usefulness of such intervals is rather dubious, and in any case gives an incomplete picture of the situation. One should certainly give more details than just stating these intervals.
The application of other methods of estimating the variance of θ̂ to finding
confidence intervals for finite samples is discussed in some detail by Eadie et al.4
and James5 .
parameters. However, we may prefer to use Bayesian probability. In this case we can
construct intervals, [a, b], called credible intervals, Bayesian confidence intervals,
or simply Bayesian intervals, such that β is the probability that parameter θ is
in the interval:
$$\beta = P(a \le \theta \le b) = \int_a^b f(\theta|x)\,d\theta \qquad (9.26)$$
where f (θ|x) is the Bayesian posterior p.d.f. As with confidence and fiducial inter-
vals, supplementary conditions, such as centrality, are needed to uniquely specify
the interval. We have seen in section 8.4.5 that, assuming Bayes’ postulate, f (θ|x)
is just the likelihood function L(x; θ), apart, perhaps, from normalization.
can also occur when we must subtract a number of background events from the
observed number of events to find the number of events in the signal; a number of
events also can not be negative.
The problem is how to incorporate this constraint (or prior knowledge) into
the confidence interval. In the confidence interval approach there is no way to
do this. The best we can do is to choose an interval which does not contain the
forbidden region (< 0 in our example). Consider the figure showing confidence
belts in section 9.2. Suppose that we know that θt > θmin . We can think of several
alternatives to the interval [θ− , θ+ ] when θ− < θ+ :
1. [θmin , θ+ ]. But this is the same interval we would have found using a confidence
belt with t+ shifted upwards such that the t+ curve passes through the point
(θmin , t̂ ). This confidence belt clearly has a smaller β. This places us in the
position of stating the same confidence for two intervals, the one completely
contained in, and smaller than, the other.
2. [θmin, θ+″], where θ+″ is the solution of t−(θ) = tmin, with tmin = t+(θmin). This
is the interval we would have stated had we found t̂ = t+ (θmin ). So, apparently
the fact that we found a lower value of t̂ does not mean anything—any value of
t̂ smaller than t+ (θmin ) leads to the same confidence interval! This procedure
is clearly unsatisfactory.
3. [θmin, θ+′], where θ+′ is determined from a new confidence belt constructed
such that the t+ curve passes through the point (θmin , t̂ ). The t− curve is
taken as that curve which together with this new t+ curve gives the required
β. This approach seems better than the previous two. However, it is still
unsatisfactory since it relies on the measurement to define the confidence
belt.
The situation is even worse if not only θ− (t̂) < θmin but also θ+ (t̂) > θmax . Then
we find ourselves in the absurd situation of, e.g., stating the conclusion of our
experiment as −0.2 < θ < 1.2 with 95% confidence when we know that 0 < θ < 1—
we are only 95% confident that θ is within its physical limits! The best procedure
to follow has been the subject of much interest lately among high energy physicists,
particularly those trying to measure the mass of the neutrino and those searching
for hypothetical new particles. The most reasonable procedure seems to be47 that of
Feldman and Cousins,48 who rediscovered a prescription previously given by Kendall
and Stuart.11
On the other hand, in the fiducial approach physical boundaries are easily in-
corporated. The likelihood function is simply set to zero for unphysical values of
the parameters and renormalized. Equation 9.25 is thus replaced by
$$\beta = \frac{\int_{\theta_1}^{\theta_2} L\,d\theta}{\int_{\theta_{\min}}^{\theta_{\max}} L\,d\theta} \qquad (9.27)$$
Also the Bayesian approach has no difficulty in incorporating the physical limits.
They are naturally imposed on the prior probability. If the prior probability is
uniform within the physical limits, the result is the same interval as in the fiducial
approach (equation 9.27).
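A minimal Python sketch of equation 9.27 for a Gaussian likelihood with a physical boundary at zero: the likelihood is truncated to the physical region, renormalized, and the upper limit is solved for numerically. The estimate and its error are hypothetical.

import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

theta_hat, sigma = -0.2, 0.5          # estimate below the physical limit
theta_min = 0.0

def L(theta):
    return np.exp(-0.5 * ((theta - theta_hat) / sigma) ** 2)

norm_const, _ = quad(L, theta_min, np.inf)

def beta(theta1, theta2):
    num, _ = quad(L, theta1, theta2)
    return num / norm_const

# 90% upper limit: theta2 such that beta(theta_min, theta2) = 0.90
upper = brentq(lambda t2: beta(theta_min, t2) - 0.90, 1e-6, 10.0)
print(upper)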
Note, however, that in order to combine with the results of other experiments,
the (nonphysical) estimate and its variance should be stated, as well as the con-
fidence interval. This, in fact, should also be done for quantities which are not
bounded.
This equation must be solved for µ+ . In practice this is best done numerically,
adjusting µ+ until the desired β is obtained. However, to incorporate the probability
that nb ≤ n, we have been Bayesian. The result is thus a credible upper limit rather
than a classical upper limit.
When µb is not known to a negligible error, the same approach can be used.
However, we must integrate over the p.d.f. for nb . It is most convenient to use a
Monte Carlo technique. We generate a sample of Monte Carlo experiments taking µb
randomly distributed according to our knowledge of µb (usually normally) and with
a fixed µs . Experiments with nb > n are rejected. The sum in equation 9.19 or 9.21
is then estimated by the fraction of remaining Monte Carlo experiments satisfying
Chapter 10
Hypothesis testing
10.1 Introduction
In chapter 8 we were concerned with estimating parameters of a p.d.f. using a statis-
tic calculated from observations assumed to be distributed according to that p.d.f.
In chapter 9 we sought an interval which we were confident (to some specified de-
gree) contained the true value of the parameter. In this chapter we will be concerned
with whether some previously designated value of the parameter is compatible with
the observation, or even whether the assumed p.d.f. is compatible. In a sense, this
latter question logically precedes the estimation of the value of a parameter, since
if the p.d.f. is incompatible with the data there is little sense in trying to estimate
its parameters.
When the hypothesis under test concerns the value of a parameter, the problems
of hypothesis testing and parameter testing are related and techniques of parameter
estimation will lead to analogous testing procedures. If little is known about the
value of a parameter, you will want to estimate it. However, if a theory predicts
it to have a certain value, you may prefer to test whether the data are compatible
with the predicted value. In either case you should be clear which you are doing.
That others are often confused about this is no excuse.
A hypothesis such as the statement that every particle in the universe attracts every other particle can not be tested sta-
tistically. Statistical hypotheses concern the distributions of observable random
variables. Suppose we have N such observations. We denote them by a vector x
in an N-dimensional space, Ω, called the sample space (section 2.1.2), which is the
space of all possible values of x, i.e., the space of all possible results of an experi-
ment. A statistically testable hypothesis is one which concerns the probability of a
particular observation X, P (X ∈ Ω).
Suppose that x consists of a number of independent measurements of a r.v., xi .
Let us give four examples of statistical hypotheses concerning x:
4. The results of two experiments, x1i and x2i are distributed identically.
Each of these hypotheses says something about the distribution of probability over
the sample space and is hence statistically testable by comparison with observations.
Examples 1 and 2 specify a p.d.f. and certain values for one or both of its pa-
rameters. Such hypotheses are called parametric hypotheses. Example 3 specifies
the form of the p.d.f., but none of its parameters, and example 4 does not even
specify the form of the p.d.f. These are examples of non-parametric hypothe-
ses, i.e., no parameter is specified in the hypothesis. We shall mainly concentrate
on parametric hypotheses, leaving non-parametric hypotheses to section 10.7.
Examples 1 and 2 differ in that 1 specifies all of the parameters of the p.d.f.,
whereas 2 specifies only a subset of the parameters. When all of the parameters are
specified the hypothesis is termed simple; otherwise composite. If the p.d.f. has n
parameters, we can define an n-dimensional parameter space. A simple hypothesis
selects a unique point in this space. A composite hypothesis selects a subspace
containing more than one point. The number of parameters specified exactly by
the hypothesis is called the number of constraints. The number of unspecified
parameters is called the number of degrees of freedom of the hypothesis. Note the
similarity of terminology with that used in parameter estimation:
$$P(x\in\omega\,|\,H_0) = \alpha \qquad (10.1)$$
$$P(x\in\omega^*|\,H_1) = \beta \qquad (10.2)$$
$$P(x\in\omega\,|\,H_1) = 1-\beta \qquad (10.3)$$
Example 2. In the previous example the sample space was only two dimensions.
When the dimensionality is larger, it is inconvenient to formulate the test in terms of
the complete sample space. Rather, a small number of test statistics is constructed from the observations and the test is formulated in terms of these.
∗ The symbols α and β are used by most authors for the probabilities of errors of the first and second kind.
10.3.1 Size
In the previous section we defined (equation 10.1) the size, α, of a test as the
probability that the test would reject the null hypothesis when it is true. If H0 is
a simple hypothesis, the size of a test can be calculated. In other cases it is not
always possible. Clearly a test of unknown size is worthless.
10.3.2 Power
We have defined (equation 10.3) the power, 1 − β, of a test of one hypothesis H0
against another hypothesis H1 as the probability that the test would reject H0 when
H1 is true. If H1 is a simple hypothesis, the power of a test can be calculated. If
H1 is composite, the power can still be calculated, but is in general no longer a
constant but a function of the parameters.
Suppose that H0 and H1 specify the same p.d.f., the difference being the value
of the parameter θ:
H0 : θ = θ0
H1 : θ ≠ θ0
10.3.3 Consistency
A highly desirable property of a test is that, as the number of observations in-
creases, it should distinguish better between the hypotheses. A test is termed
10.3. PROPERTIES OF TESTS 207
10.3.4 Bias
A test is biased if the power function is smaller at a value of θ corresponding to H1 than at the value, θ0, specified by H0, i.e., when there exists a value of θ for which 1 − β(θ) < 1 − β(θ0) = α.
the critical region, must be independent of the p.d.f. specified by H0 . It can only
depend on whether H0 is true. Such a test is called distribution-free. An example
is the well-known Pearson’s χ2 test, which we shall meet shortly.
It should be emphasized that it is only the size or level of significance of the
test which does not depend on the distributions specified in the hypotheses. Other
properties of the test do depend on the p.d.f.’s. In particular, the power will depend
on the p.d.f. specified in H1 .
How small is small enough depends on the cost (in terms of such things as time and money) of making an error, i.e., a wrong decision.
We want to find the critical region ωα which, for a given value of α, maximizes
1 − β. Rewriting equation 10.6, we have
$$1-\beta = \int_{\omega_\alpha}\frac{g(x;\theta_1)}{f(x;\theta_0)}\,f(x;\theta_0)\,dx = E_{\omega_\alpha}\!\left[\left.\frac{g(x;\theta_1)}{f(x;\theta_0)}\,\right|\,H_0\right]$$
which is the expectation of g(x; θ1 )/f (x; θ0 ) in the region ωα assuming that H0 is
true. This will be maximal if we choose the region ωα as that region containing the
points x for which this ratio is the largest. In other words, we order the points x
according to this ratio and add these points to ω until ω has reached the size α.
The BCR thus consists of the points x satisfying
$$\frac{f(x;\theta_0)}{g(x;\theta_1)} \le c_\alpha$$
where cα is chosen such that ωα is of size α (equation 10.5).
This ratio is, for a given set of data, just the ratio of the likelihood functions,
which is known as the likelihood ratio. We therefore use the test statistic
$$\lambda = \frac{L(x|H_0)}{L(x|H_1)} \qquad (10.7)$$
and
reject H0 if λ ≤ cα ;    accept H0 if λ > cα .
This is known as the Neyman-Pearson test.
An Example
As an example, consider the normal distribution treated in example 1 of section
10.2. Both H0 and H1 hypothesize a normal p.d.f. of the same variance, but different
means, µ0 under H0 and µ1 under H1 . The variance is, for both hypotheses, specified
as σ 2 . The case where the variance is not specified is treated in section 10.4.3. The
likelihood function under Hi for n observations is then
$$L(x|H_i) = (2\pi\sigma^2)^{-n/2}\exp\left[-\frac{1}{2}\sum_{j=1}^{n}\frac{(x_j-\mu_i)^2}{\sigma^2}\right] = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{n}{2\sigma^2}\left[s^2 + (\bar x - \mu_i)^2\right]\right\}$$
where x̄ and s2 are the sample mean and sample variance, respectively. Hence, our
test statistic is (equation 10.7)
$$\lambda = \frac{L(x|H_0)}{L(x|H_1)} = \exp\left\{\frac{n}{2\sigma^2}\left[(\bar x-\mu_1)^2 - (\bar x-\mu_0)^2\right]\right\} = \exp\left\{\frac{n}{2\sigma^2}\left[2\bar x(\mu_0-\mu_1) + (\mu_1^2-\mu_0^2)\right]\right\}$$
and the BCR is defined by λ ≤ cα or
$$\bar x(\mu_0-\mu_1) + \tfrac12(\mu_1^2-\mu_0^2) \le \frac{\sigma^2}{n}\ln c_\alpha$$
which becomes
$$\bar x \ge \tfrac12(\mu_1+\mu_0) - \frac{\sigma^2}{n}\,\frac{\ln c_\alpha}{\mu_1-\mu_0} \qquad\text{if } \mu_1 > \mu_0 \qquad (10.8)$$
$$\bar x \le \tfrac12(\mu_1+\mu_0) + \frac{\sigma^2}{n}\,\frac{\ln c_\alpha}{\mu_0-\mu_1} \qquad\text{if } \mu_1 < \mu_0 \qquad (10.9)$$
Thus we see that the BCR is determined by the value of the sample mean. This
should not surprise us if we recall that x̄ was an efficient estimator of µ (section
8.2.7).
In applying the test, we reject H0 if µ1 > µ0 and x̄ is above a certain critical
value (equation 10.8), or if µ1 < µ0 and x̄ is below a certain critical value (equation
10.9).
To find this critical value, we recall that x̄ itself is a normally distributed r.v.
with mean µ and variance σ 2 /n. (This is the result of the central limit theorem,
but when the p.d.f. for x is normal, it is an exact result for all n.) We will treat the
case of µ1 > µ0 and leave the other case as an exercise for the reader.
For µ1 > µ0 , the right-hand side of equation 10.8 is just x̄α given by
$$\sqrt{\frac{n}{2\pi\sigma^2}}\int_{\bar x_\alpha}^{\infty}\exp\left[-\frac{n}{2\sigma^2}(\bar x-\mu_0)^2\right]d\bar x = \alpha$$
we can rewrite this in terms of the standard normal integral, which is given by the
error function (section 3.7):
$$\alpha = \frac{1}{\sqrt{2\pi}}\int_{z_\alpha}^{\infty} e^{-z^2/2}\,dz = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{-z_\alpha} e^{-z^2/2}\,dz = \mathrm{erf}(-z_\alpha) \qquad (10.11)$$
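Numerically, the critical value for this one-sided test is x̄α = µ0 + zα σ/√n; a short Python sketch (with hypothetical numbers) for the case µ1 > µ0:

import numpy as np
from scipy.stats import norm

mu0, sigma, n, alpha = 5.0, 1.0, 25, 0.05

z_alpha = norm.ppf(1.0 - alpha)                 # P(Z > z_alpha) = alpha
xbar_alpha = mu0 + z_alpha * sigma / np.sqrt(n)

xbar_observed = 5.4
print("critical value:", xbar_alpha)
print("reject H0:", xbar_observed >= xbar_alpha)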
In that example we saw that for µ1 > µ0 a BCR was given by x̄ ≥ bα and for
µ1 < µ0 by x̄ ≤ aα . Thus if H1 contains only values greater than, or only values
less than µ0 , we have a (one-sided) UMP test, but not if H1 allows values of µ
on both sides of µ0 . In such cases we would intuitively expect that a compromise
critical region defined by x̄ ≤ aα/2 or x̄ ≥ bα/2 would give a satisfactory ‘two-
sided’ test, and this is what is usually used. It is, of course, less powerful than
the one-sided tests in their regions of applicability as is illustrated in the figure.
[Figure: the power p as a function of µ for tests of H0: µ = µ0, comparing a critical region split equally between both tails with critical regions in the lower tail only and in the upper tail only.]
Since we are treating two simple hypotheses, we can use the Neyman-Pearson test (equation 10.7) to reject H0 if the likelihood ratio is smaller than some critical value:
$$\lambda = \frac{L(x|H_0)}{L(x|H_1)} \le c_\alpha$$
This is equivalent to
$$\ln L(x;\theta_0) - \ln L(x;\theta_1) \le \ln c_\alpha$$
or (assuming ∆ > 0)
$$\left.\frac{\partial\ln L}{\partial\theta}\right|_{\theta=\theta_0} \ge k_\alpha\ ,\qquad k_\alpha = -\frac{\ln c_\alpha}{\Delta}$$
Now, if the observations are independent and identically distributed, we know from
section 8.2.5 that under H0 the expectation of L is a maximum and
" #
∂ ln L
E =0
∂θ θ=θ0
!2
∂ ln L
E = nI
∂θ
However, in the second case the p.d.f.’s are different and may even involve different
numbers of parameters. In this section we will treat the first case.
The method was introduced by Neyman and Pearson52 in 1928. It has played a similar role in the theory of tests to that
of the maximum likelihood method in the theory of estimation. As we have seen
(sect. 10.4.1), this led to a MP test for simple hypotheses.
Assume that the N observations, x, are independent and that both hypotheses
specify the p.d.f. f (x; θ). Then the likelihood function is
$$L(x;\theta) = \prod_{i=1}^{N} f(x_i;\theta)$$
Example 1:  H0: θ1 = a and θ2 = b          H1: θ1 ≠ a and/or θ2 ≠ b
Example 2:  H0: θ1 = c, θ2 unspecified     H1: θ1 ≠ c, θ2 unspecified
Example 3:  H0: θ1 + θ2 = d                H1: θ1 + θ2 ≠ d
Hypotheses which do not specify exact values for parameters, but rather relation-
ships between parameters, e.g., θ1 = θ2 , can usually be reformulated in terms of
other parameters, e.g., θ1′ = θ1 − θ2 = 0 and θ2′ = θ1 + θ2 unspecified. We can
introduce the more compact notation of L(x; θr , θs ), i.e., we write two vectors of
parameters, first those which are specified under H0 and second those which are not.
The unspecified parameters θs are sometimes referred to as ‘nuisance’ parameters.
In this compact notation, the test statistic can be rewritten as
$$\lambda = \frac{L\big(x;\,\theta_{r0},\,\hat{\hat\theta}_s\big)}{L\big(x;\,\hat\theta_r,\,\hat\theta_s\big)} \qquad (10.15)$$
where $\hat{\hat\theta}_s$ is the value of $\theta_s$ at the maximum of L in the restricted region ν, and $\hat\theta_r$ and $\hat\theta_s$ are the values of $\theta_r$ and $\theta_s$ at the maximum of L in the full region Θ.
If H0 is true, we expect λ to be near to 1. The critical region will therefore be
λ ≤ cα (10.16)
where cα must be determined from the p.d.f. of λ, g(λ), under H0 . Thus, for a test
of size α, cα is found from
$$\alpha = \int_0^{c_\alpha} g(\lambda)\,d\lambda \qquad (10.17)$$
We have seen (section 8.4.1) that the unconditional maximum likelihood estimators
are
$$\hat\mu = \bar x\ ,\qquad \hat\sigma^2 = s^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar x)^2$$
The non-central χ² distribution, χ′²(r, K), is the distribution of a sum of squares of variables distributed normally with non-zero means and unit variance. It can be used to calculate the power of the test:4, 5, 11, 13
$$p = 1-\beta = \int_{\chi^2_\alpha}^{\infty} dF_1$$
where F₁ is the c.d.f. of χ′²(r, K).
Linear model: A particular case is the linear model (section 8.5.2) in which the
N observations yi are assumed to be related to other observations xi , within random
errors ǫi , by a function linear in the k parameters θj ,
$$y_i = y(x_i) + \epsilon_i = \sum_{j=1}^{k}\theta_j h_j(x_i) + \epsilon_i$$
We assume that the ǫi are normally distributed with mean 0 and variance σ 2 . We
wish to test whether the θj have the specified values θ0j , or more generally, whether
they satisfy some set of r linear constraints,
Aθ = b (10.22)
where A and b are specified under H0 . Under H1 , the θ may take on any set of
values not satisfying the constraints of equation 10.22.
The likelihood for both H0 and H1 is given by
$$L(x;\theta) = (2\pi\sigma^2)^{-N/2}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i - \sum_{j=1}^{k}\theta_j h_j(x_i)\right)^{\!2}\right] = (2\pi\sigma^2)^{-N/2}\exp\left(-\tfrac12 Q^2\right)$$
We now distinguish two cases:
Variance known. We first treat the case of known variance σ 2 . The esti-
mates of the parameters are given by the least squares solutions (section 8.5), with
constraints for H0 yielding θ̂0j and without constraints for H1 yielding θ̂1j . The
likelihood ratio, λ, is then given by
$$-2\ln\lambda = \frac{1}{\sigma^2}\sum_{i=1}^{N}\left(y_i - \sum_{j=1}^{k}\hat\theta_{0j} h_j(x_i)\right)^{\!2} - \frac{1}{\sigma^2}\sum_{i=1}^{N}\left(y_i - \sum_{j=1}^{k}\hat\theta_{1j} h_j(x_i)\right)^{\!2} = Q_0^2 - Q_1^2 \qquad (10.23)$$
It has been shown4, 5 that the second term can be expressed as the first term plus a
quadratic form in the ǫi , and hence that −2 ln λ is distributed as a χ2 of r degrees
of freedom. This result is true exactly for all N, not just asymptotically. It also
holds if the errors are not independent but have a known covariance matrix.
The test thus consists of performing two least squares fits, one with and one
without the constraints of H0. Each fit results in a value of Q², the difference of which, Q²0 − Q²1, is a χ²(r). H0 is then rejected if Q²0 − Q²1 > χ²α, where $\int_{\chi^2_\alpha}^{\infty}\chi^2(r)\,d\chi^2 = \alpha$.
We can qualitatively understand this result in the following way: Asymptoti-
cally, Q20 is a χ2 (N − k + r) and Q21 is a χ2 (N − k). From the reproductive property
of the χ2 distribution (section 3.12), the difference of these χ2 is also a χ2 with a
number of degrees of freedom equal to the difference of degrees of freedom of Q20
and Q21 , namely r. Thus the above result follows.
It can be shown that $(s_0^2 - s_1^2)/\sigma^2$ and $s_1^2/\sigma^2$ are independently distributed as χ² with r and N − k degrees of freedom, respectively. The ratio,
$$F = \frac{N-k}{r}\,\frac{s_0^2 - s_1^2}{s_1^2} \qquad (10.26)$$
is therefore distributed as the F-distribution (section 3.14). H0 is then rejected if F > Fα, where $\int_{F_\alpha}^{\infty} F(r, N-k)\,dF = \alpha$.
However, under H1 , (s20 − s21 )/σ 2 is distributed as a non-central χ2 . This leads to
a non-central F -distribution from which the power of the test can be calculated.11, 13
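For the case of unknown variance, a corresponding sketch of the F test; s0² and s1²
are the residual sums of squares of the constrained and unconstrained fits, and all
numbers are invented:

    from scipy import stats

    N, k, r = 20, 2, 1           # observations, fitted parameters, constraints
    s0sq, s1sq = 9.3, 5.1        # illustrative residual sums of squares

    F = (N - k) / r * (s0sq - s1sq) / s1sq
    p_value = stats.f.sf(F, dfn=r, dfd=N - k)    # one-tailed: reject H0 for large F
    print(f"F = {F:.2f}, p-value = {p_value:.3f}")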
distribution of the likelihood ratio then usually turns out to depend on N as well
as on which hypothesis is true. The likelihood ratio can still be used as a test, but
these dependences must be properly taken into account.4, 5 The tests are therefore
more complicated.
The easiest method to treat this situation is to construct a comprehensive family
of functions
h(x; θ, φ, ψ) = (1 − θ)f (x; φ) + θg(x; ψ)
by introducing an additional parameter θ.
What we really want to test is H0 against H1 ,
H0 : f (x; φ) , φ unspecified
H1 : g(x; ψ) , ψ unspecified
Instead, we can use the composite function to test H0 against H1′ :
H0 : h(x; θ, φ, ψ) , θ = 0, φ, ψ unspecified
H1′ : h(x; θ, φ, ψ) , θ ≠ 0, φ, ψ unspecified
using the maximum likelihood ratio as in the previous section:
    λ = L(x; θ = 0, φ̂̂, ψ) / L(x; θ̂, φ̂, ψ̂) = ∏_{i=1}^{N} f(xi; φ̂̂) / [ (1 − θ̂) f(xi; φ̂) + θ̂ g(xi; ψ̂) ]        (10.27)
where Pp (Hi ) is the probability of Hi before (prior to) doing the experiment and
P (x|Hi ) is the probability of obtaining the result x if Hi is true, which is identical
to L(x|Hi ). We can compare P (H0 |x) and P (H1|x), e.g., by their ratio. If both H0
and H1 are simple hypotheses, this ratio is

    P(H0|x) / P(H1|x) = λ Pp(H0) / Pp(H1)

where λ is just the likelihood ratio (eq. 10.7). This leads to statements such as
“the probability of H0 is, e.g., 20 times that of H1 ”. Note, however, that here, as
always with Bayesian statistics, it is necessary to assign prior probabilities. In the
absence of any prior knowledge, Pp (H0 ) = Pp (H1 ). The test statistic is then λ, just
as in the Neyman-Pearson test (section 10.4.1). However now the interpretation is
a probability rather than a level of significance.
Suppose that H1 is a composite hypothesis where a parameter θ is unspecified.
Equation 10.30 remains valid, but with
    P(x|H1) = ∫ f(x, θ|H1) dθ                                                 (10.32)

            = ∫ P(x|θ, H1) f(θ|H1) dθ                                         (10.33)
Now, P (x|θ, H1 ) is identical to L(x; θ) under H1 and f (θ|H1 ) is just the prior p.d.f.
of θ under H1 . In practice, this may not be so easy to evaluate. Let us therefore
make some simplifying assumptions for the purpose of illustration. We know that
asymptotically L(x; θ) is proportional to a Gaussian function of θ (eq. 8.72). Let us
take a prior probability uniform between θmin and θmax and zero otherwise. Then,
with σθ̂2 the variance of the estimate, θ̂, of θ, equation 10.33 becomes
    P(x|H1) = ∫ Lmax(x; θ) exp[ −(θ − θ̂)²/(2σθ̂²) ] · [1/(θmax − θmin)] dθ      (10.34)

            = [Lmax(x; θ)/(θmax − θmin)] ∫_{θmin}^{θmax} exp[ −(θ − θ̂)²/(2σθ̂²) ] dθ   (10.35)

            = Lmax(x; θ) σθ̂ √(2π) / (θmax − θmin)                             (10.36)
where we have assumed that the tails of the Gaussian cut off by the integration
limits θmin, θmax are negligible. Thus equation 10.30 becomes

    P(H0|x) / P(H1|x) = λ [ (θmax − θmin)/(σθ̂ √(2π)) ] Pp(H0) / Pp(H1)

where λ is now the maximum likelihood ratio λ = L(x|H0)/Lmax(x|H1). Note that
there is a dependence not only on the prior probabilities of H0 and H1 , but also on
the prior probability of the parameter θ.
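A small numerical sketch of this comparison under the Gaussian approximation of
equation 10.36; all numbers are invented and the variable names are ours:

    import numpy as np

    lam = 0.15                        # maximum likelihood ratio L(x|H0)/Lmax(x|H1)
    sigma_theta = 0.2                 # width (std. dev.) of the estimate theta-hat
    theta_min, theta_max = 0.0, 5.0   # range of the uniform prior on theta
    prior_odds = 1.0                  # Pp(H0) / Pp(H1)

    # the broad prior on theta is penalized relative to the width actually
    # selected by the data (cf. equation 10.36)
    ockham = (theta_max - theta_min) / (sigma_theta * np.sqrt(2.0 * np.pi))
    posterior_odds = lam * ockham * prior_odds
    print(f"P(H0|x)/P(H1|x) = {posterior_odds:.2f}")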
∗ Many authors use 1 − cl where we use cl.
† The preferable term is p-value, since it eliminates confusion with the confidence level of
confidence intervals (chapter 9), which, although related, is different. Nevertheless, the term
confidence level is widely used, especially by physicists.
can be rejected. Despite the suggestive “p”, the p-value is not a probability; it is a
random variable.
We shall only consider distribution-free tests, for the practical reason that they
are widely applicable. To apply a test, one needs to know the p.d.f. of the test
statistic in order to calculate the confidence level. For the well-known tests tables
and/or computer routines are widely available. For a specific problem it may be
possible to construct a better test, but it may not be so much better that it is worth
the effort.
and if the data give t = x̄, the confidence level or p-value (for a symmetric two-sided
test) is
    cl = ∫_{−∞}^{−|x̄|} N(t; 0, σ²/n) dt + ∫_{+|x̄|}^{+∞} N(t; 0, σ²/n) dt

       = 1 − ∫_{−|x̄|}^{+|x̄|} N(t; 0, σ²/n) dt                                (10.39)
Note the similarity of the integrals in equations 10.38 and 10.39. We see that
the coverage probability of the interval [−|x̄|, +|x̄|], β, is related to the p-value
by cl = 1 − β. However, for the confidence interval, the coverage probability
is specified first and the interval, [µ− , µ+ ], is the random variable, while for the
goodness-of-fit test the hypothesis is specified (µ = µ0 ) and the p-value is the r.v.
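Numerically, equation 10.39 is a one-liner; a sketch with invented values of x̄, σ and n:

    import numpy as np
    from scipy import stats

    xbar, sigma, n = 0.8, 2.0, 25     # sample mean, known sigma, sample size (invented)
    # cl = 1 - integral of N(t; 0, sigma^2/n) from -|xbar| to +|xbar|
    cl = 2.0 * stats.norm.sf(abs(xbar), loc=0.0, scale=sigma / np.sqrt(n))
    print(f"two-sided p-value = {cl:.3f}")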
Referring to the confidence belt figure of section 9.2, and supposing that θt is
the hypothesized value of the parameter µ0 , t− (µ0 ) and t+ (µ0 ) are the values of
This X 2 is just the quantity that is minimized in a least squares fit (where we
denoted it by Q2 ). In the linear model, assuming Gaussian errors, X 2 = Q2min is
still distributed as χ2 even though parameters have been estimated by the method.
However the number of degrees of freedom is reduced to N − k, where k is the
number of parameters estimated by the fit. If constraints have been used in the fit
(cf. section 8.5.6), the number of degrees of freedom is increased by the number of
constraints, since each constraint among the parameters reduces by one the number
of free parameters estimated by the fit. If the model is non-linear, X 2 = Q2min is
only asymptotically distributed as χ2 (N − k).
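The confidence level then follows directly from the χ² c.d.f.; a minimal sketch in
which Q²min, N and k are assumed to come from such a least squares fit:

    from scipy import stats

    Q2min, N, k = 17.3, 20, 3                      # illustrative fit result
    p_value = stats.chi2.sf(Q2min, df=N - k)
    print(f"chi2/ndf = {Q2min/(N - k):.2f}, p-value = {p_value:.3f}")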
It is sometimes argued that the χ2 test should be two-tailed rather than one-
tailed, i.e., that H0 should be rejected for unlikely small values of X 2 as well as
for unlikely large values. Arguments given for this practice are that such small
values are likely to have resulted from computational errors, overestimation of the
measurement errors σi , or biases (unintentional or not) in the data which have not
been accounted for in making the prediction. However, while an improbably small
value of X 2 might well make one suspicious that one or more of these considerations
had occurred (and indeed several instances of scientific fraud have been discovered
this way), such a low X 2 can not be regarded as a reason for rejecting H0 .
The problem is that in order to use the value of L as a test, we must know
how L is distributed in order to be able to calculate the confidence level. Suppose
that we have N independent observations, xi , each distributed as f (x). The log
likelihood is then just
    ℓ = Σ_{i=1}^{N} ln f(xi)
Similarly higher moments could be calculated, and from these moments (just the
first two if N is large and the central limit theorem is applicable) the distribution
of ℓ, g(ℓ), could be reconstructed. The confidence level would then be given by
    cl = ∫_{−∞}^{ℓ} g(ℓ) dℓ                                                   (10.42)
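Since g(ℓ) is rarely known in closed form, it can also be obtained by Monte Carlo:
generate many samples of size N under H0, compute ℓ for each, and take the fraction
of simulated values below the observed ℓ. A sketch for an exponential H0 (the p.d.f.,
τ and N are assumptions made only for illustration):

    import numpy as np

    rng = np.random.default_rng(2)
    tau, N = 2.0, 50                            # assumed H0 and sample size

    def log_like(x, tau):
        # unbinned log likelihood for f(x) = (1/tau) exp(-x/tau)
        return np.sum(-np.log(tau) - x / tau)

    x_obs = rng.exponential(tau, N)             # stand-in for the real data
    ell_obs = log_like(x_obs, tau)

    # Monte Carlo distribution of ell under H0 and the p-value of equation 10.42
    ell_mc = np.array([log_like(rng.exponential(tau, N), tau) for _ in range(10000)])
    cl = np.mean(ell_mc < ell_obs)
    print(f"observed ell = {ell_obs:.1f}, confidence level = {cl:.3f}")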
∗ Although we use the term ‘binned’, which suggests a histogram, any classification of the
observations may be used. See also section 8.6.1.
which we of course do not know. The likelihood under H0 and under the true p.d.f.
are then, from the multinomial p.d.f., given by
    L0(n|p) = N! ∏_{i=1}^{k} pi^{ni} / ni!

    L(n|q) = N! ∏_{i=1}^{k} qi^{ni} / ni!
An estimate q̂i of the true probability content can be found by maximizing L(n|q)
subject to the constraint Σ_{i=1}^{k} qi = 1. The result∗ is

    q̂i = ni / N
The test statistic is then the likelihood ratio (cf. section 10.4.3)
    λ = L0(n|p) / L(n|q̂) = N^N ∏_{i=1}^{k} (pi / ni)^{ni}                     (10.43)
The exact distribution of λ is not known. However, we have seen in section 10.4.3
that −2 ln λ is asymptotically distributed as χ2 (k − 1) under H0 , where the num-
ber of degrees of freedom, k − 1, is the number of parameters specified. The multi-
nomial p.d.f. has only k − 1 parameters (pi) because of the restriction Σ_{i=1}^{k} pi = 1.
If H0 is not simple, i.e., not all pi are specified, the test can still be used but the
number of degrees of freedom must be decreased accordingly.
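A minimal sketch of this binned likelihood-ratio test; the bin contents and the
probabilities pi are invented:

    import numpy as np
    from scipy import stats

    n = np.array([12, 18, 25, 22, 15, 8])     # observed bin contents (illustrative)
    p = np.full(6, 1.0 / 6.0)                 # probabilities specified by H0
    N = n.sum()

    # -2 ln lambda from equation 10.43 (empty bins contribute zero)
    mask = n > 0
    minus2lnlam = -2.0 * np.sum(n[mask] * np.log(N * p[mask] / n[mask]))
    p_value = stats.chi2.sf(minus2lnlam, df=len(n) - 1)
    print(f"-2 ln lambda = {minus2lnlam:.2f}, p-value = {p_value:.3f}")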
Pearson’s χ2 test
The classic test for binned data is the χ2 test proposed by Karl Pearson53 in 1900.
It makes use of the asymptotic normality of a multinomial p.d.f. to find that under
H0 the statistic
    X² = Σ_{i=1}^{k} (ni − Nπi)² / (Nπi)                                      (10.44)
is distributed asymptotically as χ2 (k − 1).
If H0 is not simple, its free parameters can be estimated, (section 8.6.1) by
the minimum chi-square method. In that method, the quantity which is minimized
with respect to the parameters (equation 8.152) is just Pearson’s X 2 . The mini-
mum value thus found therefore serves to test the hypothesis. It can be shown that
in this case X 2 is asymptotically distributed as χ2 (k − s − 1) where s is the num-
ber of parameters which are estimated. This is also true if the binned maximum
∗ This was derived for the binomial p.d.f. in section 8.4.7. It may be trivially extended to the
multinomial case by treating each bin separately as binomially distributed between that bin and
all the rest.
where, under H0 , νi is the mean (and variance) of the Poisson distribution for
bin i. Since each bin is independent, there are now k degrees of freedom, and X 2
is distributed asymptotically as χ2 (k − s).
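A corresponding sketch of Pearson's test with the same invented bin contents; in the
SciPy routine the ddof argument reduces the number of degrees of freedom by the
number s of parameters estimated from the data:

    import numpy as np
    from scipy import stats

    n = np.array([12, 18, 25, 22, 15, 8])      # observed bin contents (illustrative)
    pi = np.full(6, 1.0 / 6.0)                 # probabilities under H0
    N, s = n.sum(), 0                          # s parameters estimated from the data

    X2 = np.sum((n - N * pi) ** 2 / (N * pi))  # equation 10.44
    p_value = stats.chi2.sf(X2, df=len(n) - s - 1)
    print(f"X2 = {X2:.2f}, p-value = {p_value:.3f}")

    # the same numbers via the library routine
    X2_lib, p_lib = stats.chisquare(n, f_exp=N * pi, ddof=s)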
Pearson’s χ2 test makes use of the squares of the deviations of the data from
that expected under H0 . Tests can be devised which use some other measure of
deviation, replacing the square of the absolute value of the deviation by some other
power and scaling the deviation or not by the expected variance. Such tests are,
however, beyond the scope of this course.
expect that it does not matter which of these sets we happen to choose, and this
has indeed been shown to be so.11, 13
Intuitively, we could expect that we should choose bins which are equiprobable
under H0 . Pearson’s χ2 test is consistent (asymptotically unbiased) whatever the
binning, but for finite N it is not, in general, unbiased. It can be shown4, 5, 11, 13
that for equiprobable bins it is locally unbiased, i.e., unbiased against alternatives
which are very close to H0 , which is certainly a desirable property.
Having decided on equiprobable bins, the next question is how many bins.
Clearly, we must not make the number of bins k too large, since the multinor-
mal approximation to the multinomial p.d.f. will no longer be valid. A rough rule
which is commonly used is that no expected frequency, N pi , should be smaller
than ∼ 5. However, according to Kendall and Stuart,11 there seems to be no gen-
eral theoretical basis for this rule. Cochran goes even further and claims4 that the
asymptotic approximation remains good so long as not more than 20% of the bins
have an expected number of events between 1 and ∼ 5.
This does not necessarily mean that it is best to take k = N/5 bins. By
maximizing local power, one can try to arrive at an optimal number of bins. The
result4, 5 is

    k = b [ √2 (N − 1) / (λα + λ1−p0) ]^{2/5}                                 (10.46)

where α = 1 − ∫_{−λα}^{+λα} N(x; 0, 1) dx is the size of the test for a standard
normal distribution and p0 is the local power. In general, for a simple hypothesis a
value for b between 2 and 4 is good, the best value depending on the p.d.f. under H0.
Typical values for k (N/k) using b = 2 are given in the following table. We see from
the table that there is only a mild sensitivity of the number of bins to α and p0.
For N = 200, 25–30 bins would be reasonable.

                                p0
       N       α         0.5           0.8
      200    0.01      27 (7.4)      24 (8.3)
             0.05      31 (6.5)      27 (7.4)
      500    0.01      39 (13)       35 (14)
             0.05      45 (11)       39 (13)
Thus we are led to the following recommendations for binning:
3. Define the bins to have equal probability content, either from the p.d.f. spec-
ified by H0 or from the data.
4. If parameters have to be estimated (H0 does not specify all parameters), use
maximum likelihood on the individual observations, but remember that the
test statistic is then only approximately distribution-free.
Note, however, that, regardless of the above prescription, if the p.d.f. under H0
does not include resolution effects, one should not choose bins much smaller than
the resolution.
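A sketch of how equiprobable bins might be constructed, either from the c.d.f.
specified by H0 or from empirical quantiles of the data; the standard-normal H0 and
the numbers are invented:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    x = rng.normal(0.0, 1.0, 200)               # data (illustrative)
    k = 25                                      # number of bins

    # bin edges with equal probability content under H0 = N(0,1)
    edges_H0 = stats.norm.ppf(np.linspace(0.0, 1.0, k + 1))
    edges_H0[0], edges_H0[-1] = x.min() - 1.0, x.max() + 1.0   # replace +-inf

    # or: equal probability content estimated from the data themselves
    edges_data = np.quantile(x, np.linspace(0.0, 1.0, k + 1))

    # each bin then has expected content N*p_i = N/k under H0
    n_obs, _ = np.histogram(x, bins=edges_H0)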
Even with the above prescription, the specification of the bins is still not unique.
The usual method in one dimension would be to define a bin as an interval in the
variable, bini = (xi , xi + δi ). However, there is nothing in the above prescription
to forbid defining a single bin as consisting of more than one (nonadjacent) interval.
This might even be desirable from the point of view of H0. For example, H0 might
specify a p.d.f. that is symmetric about 0, and we might only be interested in testing
this hypothesis against alternatives which are also symmetric about 0. Then it
would be appropriate to define bins as intervals in |x| rather than in x.
In more than one dimension the situation is more ambiguous. For example,
to construct equiprobable bins in two dimensions, the easiest way is to first find
equiprobable bins in x and then for each bin in x to find equiprobable bins in
y. This is easily generalized to more dimensions. However, one could equally well
choose first to construct bins in y and then in x, which in general would yield
different bins. One could also choose different numbers of bins in x than in y. The
choice depends on the individual situation. One should prefer smaller bins in the
variable for which H0 is most sensitive.
There is, obviously, one taboo: You must not try several different choices of
binning and choose the one which gives the best (or worst) confidence level.
Suppose that there are r runs. First, suppose that r is even and that the sequence
begins with an A. Then there are kA A-points and r/2 − 1 divisions between
them. For the example of the figure this is AAA|AAA. With kA A’s there are
kA − 1 places to put the first dividing line, since it can not go at the ends. Then
there are kA − 2 places to put the second dividing line, since it can not go at the
ends or next to the first dividing line. In total there are (kA − 1 choose r/2 − 1) ways to arrange
the dividing lines among the A’s. There is a similar factor for arrangement of the
B’s and a factor 2 because we assumed we started with an A and it could just as
well have been a B. Thus the probability of r runs, for r even, is
    P(r) = 2 (kA − 1 choose r/2 − 1)(kB − 1 choose r/2 − 1) / (k choose kA)   (10.47)
From these it can be shown that the expectation and variance of r are
    E[r] = 1 + 2 kA kB / k                                                    (10.49)

    V[r] = 2 kA kB (2 kA kB − k) / [k² (k − 1)]                               (10.50)
The critical region of the test is defined as improbably low values of r, r < rα .
For kA and kB greater than about 10 or 15, one can use the Gaussian approx-
imation for r. For smaller numbers one can compute the probabilities directly
using equations 10.47 and 10.48. In our example, kA = kB = 6. From equa-
tions 10.49 and 10.50 we expect r = 7 with variance 2.73, or σ = 1.65. We
observe 3 runs, which differs from the expected number by 4/1.65 = 2.4 standard
deviations. Using the Gaussian approximation, this corresponds to a (one-tailed)
confidence level of 0.8%. Exact calculation using equations 10.47 and 10.48 yields
P (1) + P (2) + P (3) = 1.5%. Whereas the χ2 is acceptable (χ2 = 12 for 12
points), the run test suggests that the curve does not fit the data.
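A sketch of the Gaussian approximation to the run test for the numbers of this
example (kA = kB = 6, 3 observed runs), using equations 10.49 and 10.50:

    import numpy as np
    from scipy import stats

    kA, kB, r_obs = 6, 6, 3
    k = kA + kB

    E_r = 1.0 + 2.0 * kA * kB / k                                   # equation 10.49
    V_r = 2.0 * kA * kB * (2.0 * kA * kB - k) / (k**2 * (k - 1.0))  # equation 10.50

    z = (r_obs - E_r) / np.sqrt(V_r)
    p_value = stats.norm.cdf(z)        # one-tailed: improbably few runs
    print(f"E[r] = {E_r:.1f}, sigma = {np.sqrt(V_r):.2f}, p = {p_value:.4f}")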
The run test is much less powerful than a χ2 test, using as it does much less
information. But the two tests are completely independent and hence they can
be combined. An hypothesis may have an acceptable χ2 , but still be wrong and
rejectable by the run test. Unfortunately, the run test is applicable only when H0
is simple. If parameters have been estimated from the data, the distribution of the
number of runs is not known and the test can not be applied.
which is simply the fraction of the observations not exceeding x. Clearly, under
H0 , Sn (x) → F (x) as n → ∞. The tests consist of comparing Sn (x) with
F (x). We shall discuss two such tests, the Smirnov-Cramér-von Mises test and the
Kolmogorov test. Unfortunately, both are only applicable to simple hypotheses,
since the distribution of the test statistic is not distribution-free when parameters
have been estimated from the data.
As a measure of the difference between Sn (x) and F (x) this test uses the statistic
    W² = ∫_0^1 [Sn(x) − F(x)]² ψ(x) dF

       = ∫_{−∞}^{+∞} [Sn(x) − F(x)]² ψ(x) f(x) dx
with ψ(x) = 1. We see that W 2 is the expectation of [Sn (x) − F (x)]2 under
H0 . Inserting Sn (equation 10.51) and performing the integral results in
    nW² = 1/(12n) + Σ_{i=1}^{n} [ F(x(i)) − (2i − 1)/(2n) ]²                  (10.52)
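Equation 10.52 is simple to evaluate; a sketch for an exponential H0, with data
generated only to have something to plug in. (Recent SciPy versions also provide
scipy.stats.cramervonmises, which returns the same statistic together with a p-value.)

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    x = rng.exponential(2.0, 40)               # illustrative data
    F = stats.expon(scale=2.0).cdf             # c.d.f. under H0

    n = x.size
    Fi = F(np.sort(x))                         # F(x_(i)) at the order statistics
    i = np.arange(1, n + 1)
    nW2 = 1.0 / (12.0 * n) + np.sum((Fi - (2.0 * i - 1.0) / (2.0 * n)) ** 2)
    print(f"n W^2 = {nW2:.3f}")                # compare to tabulated critical values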
Kolmogorov test
This test also compares Sn and F (x), but only uses the maximum difference: The
Kolmogorov (or Smirnov, or Kolmogorov-Smirnov) test statistic is the maximum
deviation of the observed distribution Sn (x) from the c.d.f. F (x) under H0 :
Dn = max {|Sn (x) − F (x)|} for all x (10.53)
The sensitivity of the Kolmogorov test to deviations from the c.d.f. is not in-
dependent of x. It is more sensitive around the median value and less sensitive
in the tails. This occurs because the difference |Sn(x) − F(x)| does not, under
H0, have a probability distribution that is independent of x. Rather, its variance
is proportional to F (x) [1 − F (x)], which is largest at F = 0.5. Consequently,
the significance of a large deviation in a tail is underweighted in the test. The Kol-
mogorov test therefore turns out to be more sensitive to departures of the data from
the median of H0 than to departures from the width. Various modifications of the
Kolmogorov test statistic have been proposed54, 58, 59 to ameliorate this problem.
A few words of caution are appropriate at this point. As illustrated by the figure
at the start of the section on the run test (section 10.6.6), one test may give an
acceptable value while another does not. Indeed, it is in the nature of statistics
that this must sometimes occur.
Also, a fit may be quite good over part of the range of the variable and quite bad
over another part. The resulting test value will be some sort of average goodness,
which can still have an acceptable value. And so: Do not rely blindly on a test. Use
your eyes. Make a plot and examine it.
There are several useful plots you can make. One is, as was done to illustrate
the run test, simply a plot of the data with the fit distribution superimposed. Of
course, the error bars should be indicated. It is then readily apparent if the fit
is bad only in some particular region, and frequently you get an idea of how to
improve the hypothesis. This is illustrated in the figure where the fit (dashed line)
in (a) is perfect, while in (b) higher order terms are clearly needed and in (c) either
higher orders or a discontinuity are required.
∗ Cited by Barlow1 from New Scientist, 31 March 1988.
1. The two-sample problem. We wish to test whether two (or more generally k)
samples are distributed according to the same p.d.f.
These are all hypothesis-testing problems, which are similar to the goodness-of-
fit problem in that the alternative hypothesis is simply not H0 .
The first two of the above problems are really equivalent to the third, even
though the first two involve observations of just one quantity. For problem 1, we
(1) (2)
can combine the two samples xi and xi into one sample by defining a second
variable yi = 1 or 2 depending on whether xi is from the first or the second sample.
Independence of x and y is then equivalent to independence of the two samples. For
problem 2, suppose that the xi of problem 3 are just the observations of problem 2
and that the yi are the order of the observations. Then independence of xi and yi
is equivalent to no order dependence of the observations of problem 1. Let us begin
then with problem 3.
where x̄ and ȳ are the sample means and sx and sy are the sample variances of x
and y, respectively, and xy is the sample mean of the product xy. Under H0, x and y
are independent and

    E[r] = 0
Higher moments of r can also be easily calculated. It turns out that the variance
is V[r] = 1/(n − 1). Thus, the first two moments are exactly equal to the moments
of the bivariate normal distribution with zero correlation. Further, the third and
fourth moments are asymptotically approximately equal to those of the normal
distribution. From this it follows11 that
    t = r √[ (n − 2)/(1 − r²) ]                                               (10.58)
is distributed approximately as Student’s t-distribution with (n − 2) degrees of
freedom, the approximation being very accurate even for small n. The confidence
level can therefore be calculated from the t-distribution. H0 is then rejected for
large values of |t|.
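A sketch of this test of independence based on the sample correlation coefficient;
the data are invented, and equation 10.58 is referred to Student's t with n − 2
degrees of freedom:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    n = 30
    x = rng.normal(size=n)
    y = 0.3 * x + rng.normal(size=n)              # toy, weakly correlated data

    r = np.corrcoef(x, y)[0, 1]
    t = r * np.sqrt((n - 2) / (1.0 - r**2))       # equation 10.58
    p_value = 2.0 * stats.t.sf(abs(t), df=n - 2)  # two-tailed
    print(f"r = {r:.3f}, t = {t:.2f}, p-value = {p_value:.3f}")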
Rank tests
The rank of an observation xi is simply its position, j, among the order statistics
(cf. section 10.6.7), i.e., the position of xi when all the observations are ordered.
In other words,
rank(xi ) = j if x(j) = xi (10.59)
The relationship between statistics, order statistics and rank is illustrated in the
following table:
i 1 2 3 4 5 6
statistic (measurement) xi 7.1 3.4 8.9 1.1 2.0 5.5
order statistic x(i) 1.1 2.0 3.4 5.5 7.1 8.9
rank rank(xi ) 5 3 6 1 2 4
which can take on values between −1 and 1. If x and y are completely correlated,
xi and yi will have the same rank and Di will be zero, leading to ρ = 1. It can be
shown1, 11 that for large n (≥ 10) ρ has the same distribution as r in the previous
section, and Student’s t-distribution can be used, substituting ρ for r in equation
10.58.
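The same machinery applies to the rank correlation; a sketch with invented data
(scipy.stats.spearmanr computes ρ and its own p-value, the rest follows equation
10.58 with ρ in place of r):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    n = 30
    x = rng.normal(size=n)
    y = x**3 + rng.normal(scale=0.5, size=n)      # monotonic but non-linear relation

    rho, p_scipy = stats.spearmanr(x, y)          # rank correlation coefficient
    t = rho * np.sqrt((n - 2) / (1.0 - rho**2))   # equation 10.58 with rho for r
    p_value = 2.0 * stats.t.sf(abs(t), df=n - 2)
    print(f"rho = {rho:.3f}, p = {p_value:.3f} (scipy: {p_scipy:.3f})")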
H0 : f1 (x) = f2 (x)
If both samples contain the same number of observations (n1 = n2 ), we can group
the two samples into one sample of pairs of observations and apply one of the tests
for independence. However, we can also adapt (without the restriction n1 = n2 )
any of the goodness-of-fit tests (section 10.6) to this problem.
Kolmogorov test
The Kolmogorov test (cf. section 10.6.7) adapted to the two-sample problem com-
pares the sample c.d.f.’s of the two samples. Equations 10.53-10.55 become
rather than n Dn and 4 [n1 n2/(n1 + n2)] (Dn1n2) rather than 4 n Dn, respectively.
Run test
The two samples are combined keeping track of the sample from which each obser-
vation comes. Runs in the sample number, rather than in the sign of the deviation,
are then found. In the notation of section 10.6.6, A and B correspond to an obser-
vation coming from sample 1 and sample 2, respectively. The test then follows as
in section 10.6.6.
χ2 test
Consider two histograms with identical binning. Let nji be the number of entries
in bin i of histogram j. Each histogram has k bins and a total of Nj entries.
The Pearson χ2 statistic (equation 10.44) becomes a sum over all bins of both
histograms,
    X² = Σ_{j=1}^{2} Σ_{i=1}^{k} (nji − Nj pi)² / (Nj pi)                     (10.64)
Under H0 the probability content pi of bin i is the same for both histograms and
it is estimated from the combined histogram:
    p̂i = (n1i + n2i) / (N1 + N2)
Substituting this for pi in equation 10.64 results, after some work, in

    X² = (N1 + N2) [ (1/N1) Σ_{i=1}^{k} n1i²/(n1i + n2i) + (1/N2) Σ_{i=1}^{k} n2i²/(n1i + n2i) − 1 ]        (10.65)
which, for all nji large, behaves as χ2 with (r − 1)(k − 1) degrees of freedom.
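A sketch of equation 10.65 for two histograms with identical binning; the bin
contents are invented:

    import numpy as np
    from scipy import stats

    n1 = np.array([23, 31, 42, 30, 19, 11])    # histogram 1 (illustrative)
    n2 = np.array([18, 25, 35, 38, 24, 14])    # histogram 2 (illustrative)
    N1, N2, k = n1.sum(), n2.sum(), len(n1)

    X2 = (N1 + N2) * (np.sum(n1**2 / (n1 + n2)) / N1
                      + np.sum(n2**2 / (n1 + n2)) / N2 - 1.0)   # equation 10.65
    p_value = stats.chi2.sf(X2, df=k - 1)      # (r-1)(k-1) with r = 2 histograms
    print(f"X2 = {X2:.2f}, p-value = {p_value:.3f}")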
Mann-Whitney test
As previously mentioned, the two-sample problem can be viewed as a test of inde-
pendence for which, as we have seen, rank tests can be used. A rank test appropriate
for this problem is the Mann-Whitney test, which is also known as the Wilcoxon∗
test, the rank sum test, or simply the U -test. Let the observations of the first
∗ Wilcoxon proposed the test before Mann and Whitney, but his name is also used for another
test, the Wilcoxon matched pairs test, which is different. The use of Mann-Whitney here eliminates
possible confusion.
sample be denoted xi and those of the second sample yi . Rank them together.
This results in a series like xyyxxyx. For each x value, count the number of y
values that follow it and add up these numbers. In the above example, there are
3 y values after the first x, 1 after the second, 1 after the third, and 0 after the
fourth. Their sum, which we call Ux is 5. Similarly, Uy = 3 + 3 + 1 = 7. In fact,
you only have to count for one of the variables, since
Ux + Uy = Nx Ny
Ux can be computed in another way, which may be more convenient, by finding the
total rank, Rx , of the x’s, which is the sum of the ranks of the xi . In the example
this is Rx = 1 + 4 + 5 + 7 = 17. Then Ux is given by
    Ux = Nx Ny + Nx(Nx + 1)/2 − Rx                                            (10.67)
Under H0, one expects Ux = Uy = ½ Nx Ny. Asymptotically, Ux is distributed
normally1, 11 with mean ½ Nx Ny and variance (1/12) Nx Ny (Nx + Ny + 1), from which
(two-tailed) critical values may be computed. For small samples, one must resort
to tables.
This test can be easily extended11 to r samples: For each of the ½ r(r − 1) pairs
of samples, Ux is calculated (call it Upq for the samples p and q) and summed:

    U = Σ_{p=1}^{r} Σ_{q=p+1}^{r} Upq                                         (10.68)
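A sketch using the SciPy implementation of the Mann-Whitney test, which returns the
U statistic and a p-value (exact or from the Gaussian approximation, depending on
the sample sizes and the SciPy version); the two samples are invented:

    import numpy as np
    from scipy import stats

    x = np.array([7.1, 3.4, 8.9, 5.5])         # sample 1 (illustrative)
    y = np.array([1.1, 2.0, 4.2])              # sample 2 (illustrative)

    U, p_value = stats.mannwhitneyu(x, y, alternative="two-sided")
    print(f"U = {U}, p-value = {p_value:.3f}")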
Unknown σ: If the parent p.d.f. of each sample is known to be normal, but its
variance is unknown, we can estimate the variance for each sample:
    σ̂x² = Σ_{i=1}^{Nx} (xi − x̄)² / (Nx − 1) ;      σ̂y² = Σ_{i=1}^{Ny} (yi − ȳ)² / (Ny − 1)        (10.71)
A Student’s-t variable can then be constructed. Recall that such a r.v. is the ratio
of a standard Gaussian r.v. to the square root of a reduced χ² r.v. Under H0,
µx = µy and θ̂ = (x̄ − ȳ)/√(σx²/Nx + σy²/Ny) is normally distributed with mean 0 and
variance 1. From equation 10.71 we see that
    χ² = (Nx − 1)σ̂x²/σx² + (Ny − 1)σ̂y²/σy²                                    (10.72)
Note that S 2 is in fact just the estimate of the variance obtained by combining
both samples.
We emphasize that this test rests on two assumptions: (1) that the p.d.f. of
both samples is Gaussian and (2) that both Gaussians have the same variance. The
latter can be tested (cf. section 10.7.4). As regards the former, it turns out that
this test is remarkably robust. Even if the parent p.d.f. is not Gaussian, this test is
a good approximation.11 This was also the case for the sample correlation (section
10.7.1).
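A sketch of this two-sample Student's-t test; scipy.stats.ttest_ind with
equal_var=True uses the pooled variance estimate, and the samples are invented:

    import numpy as np
    from scipy import stats

    x = np.array([10.2, 10.4, 9.8, 10.5, 9.9, 9.8])     # sample 1 (illustrative)
    y = np.array([10.6, 10.9, 10.3, 10.8, 10.5])        # sample 2 (illustrative)

    t, p_value = stats.ttest_ind(x, y, equal_var=True)  # Nx + Ny - 2 d.o.f.
    print(f"t = {t:.2f}, two-tailed p-value = {p_value:.3f}")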
Correlated samples: In the above we have assumed that the two samples are
uncorrelated. A common case where samples are correlated is in testing the effect of
some treatment. For example, the light transmission of a set of crystals is measured.
The crystals are then treated in some way and the light transmission is measured
again. One could compare the means of the sample before and after treatment.
However, we can introduce a correlation by using the simple mathematical relation
Σ xi − Σ yi = Σ (xi − yi). A crystal whose light transmission was lower than
the average before the treatment is likely also to be below the average after the
treatment, i.e., there is a positive correlation between the transmission before and
after. This reduces the variance of the before-after difference, θ: σθ2 = σx2 + σy2 −
2ρσx σy . We do not have to know the correlation, or indeed σx or σy , but can
estimate the variance of θ = x − y directly from the data:
    σ̂θ² = [1/(N − 1)] ( Σ_{i=1}^{N} θi² − N θ̄² )                              (10.75)
Again we find a Student’s-t variable: θ̂ = θ̄ is normally distributed with variance
σθ²/N. Thus, √N θ̄/σθ is a standard normal r.v. Further, (N − 1)σ̂θ²/σθ² is a χ²
r.v. of N − 1 degrees of freedom. Hence, the ratio

    t = θ̄ √N / σ̂θ                                                            (10.76)
is a Student’s-t variable of N − 1 degrees of freedom, one degree of freedom being
lost by the determination of θ̄, a result already known from equation 3.40.
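A sketch of this paired test, using the before/after light-transmission numbers that
appear in the crystal exercise at the end of these notes:

    import numpy as np
    from scipy import stats

    before = np.array([29., 30., 42., 34., 37., 45., 32.])
    after = np.array([36., 26., 46., 36., 40., 51., 33.])

    theta = after - before
    N = theta.size
    t = theta.mean() * np.sqrt(N) / theta.std(ddof=1)    # equation 10.76
    p_value = stats.t.sf(t, df=N - 1)                    # one-tailed: improvement
    print(f"t = {t:.2f}, p-value = {p_value:.3f}")
    # stats.ttest_rel(after, before) gives the same t with a two-tailed p-value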
    χ²(k) = Σ_{i=1}^{k} (ȳi − µ)² / (σ²/Ni)

    χ²(k − 1) = Σ_{i=1}^{k} Ni (ȳi − ȳ)² / σ²                                 (10.78)
(where yj^{(i)} is element j of sample i) by a weighted average:

    σ̂² = [1/(N − k)] Σ_{i=1}^{k} (Ni − 1) σ̂i²                                 (10.80)
If the hypothesis of equal means is false, the ȳi will be different and the numerator of
equation 10.81 will be larger than expected under H0 while the denominator, being
an average of the sample variance within samples, will be unaffected (remember that
the true variance of all samples is known to be the same). Hence large values of F
are used to reject H0 with a confidence level determined from the one-tailed critical
values of the F distribution. If there are only two samples, this analysis is equivalent
to the previously described two-sample test using Student’s t distribution.
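A sketch of this one-way analysis of variance using SciPy; the three samples are
invented:

    import numpy as np
    from scipy import stats

    y1 = np.array([10.1, 9.8, 10.3, 10.0])         # illustrative samples
    y2 = np.array([10.4, 10.6, 10.2, 10.5, 10.3])
    y3 = np.array([9.9, 10.0, 10.2])

    F, p_value = stats.f_oneway(y1, y2, y3)        # k-1 and N-k degrees of freedom
    print(f"F = {F:.2f}, p-value = {p_value:.3f}")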
The second term is zero since both its sums are equal:
    Σ_{j=1}^{Ni} yj^{(i)} = Σ_{j=1}^{Ni} ȳi = Ni ȳi
Hence,
    Q = (N − 1) σ̂² = Σ_{i=1}^{k} (Ni − 1) σ̂i² + Σ_{i=1}^{k} Ni (ȳi − ȳ)²       (10.84)
There are thus two contributions to our estimate of the variance of the p.d.f.: The
first term is the contribution of the variance of the measurements within the samples;
the second is that of the variance between the samples. Also the number of degrees
of freedom are partitioned. As we have seen in the previous section, the first and
second terms are related to χ2 variables of N − k and k − 1 degrees of freedom,
respectively, and their sum, N − 1, is the number of degrees of freedom of the χ2
variable associated with σ̂ 2 .
Now suppose the samples are classified in some way such that each sample has
two indices, e.g., the date of measurement and the person performing the measure-
ment. We would like to partition the overall variance between the various sources:
the variance due to each factor (the date and the person) and the innate residual
variation. In other words, we seek the analog of equation 10.84 with three terms.
We then want to test whether the mean of the samples is independent of each factor
separately.
Of course, the situation can be more complicated. There can be more than two
factors. The classification is called “crossed” if there is a sample for all combinations
of factors. More complicated is the case of “nested” classification where this is not
the case. Further, the number of observations in each sample can be different. We
will only treat the simplest case, namely two-way crossed classification.
We begin with just one observation per sample. As an example, suppose that
there are a number of technicians who have among their tasks the weighing of
samples. As a check of the procedure, a reference sample is weighed once each day
by each technician. One wants to test (a) whether the balance is stable in time,
i.e., gives the same weight each day, and (b) that the weight found does not depend
on which technician performs the measurement.
In such a case the measurements can be placed in a table with each row cor-
responding to a different value of the first factor (the date) and each column to a
value of the second factor (the technician). Suppose that there are R rows and C
columns. The total number of measurements is then N = RC. We use subscripts
r and c to indicate the row and column, respectively. The sample means of row r
and column c are given, respectively, by
    ȳr. = (1/C) Σ_{c=1}^{C} yrc ;        ȳ.c = (1/R) Σ_{r=1}^{R} yrc          (10.85)
In this notation a dot replaces indices which are averaged over, except that the
dots are suppressed if all indices are averaged over (ȳ ≡ ȳ.. ). We now proceed as
in equations 10.82-10.84 to separate the variance (or more accurately, the sum of
squares, SS) between rows from the rest:
    Q = Σ_r Σ_c (yrc − ȳ)²                                                    (10.86)

      = Σ_r Σ_c (yrc − ȳr.)² + C Σ_r (ȳr. − ȳ)²                               (10.87)
where C is, of course, the same for all rows and hence can be taken out of the sum
over r. The second term, to be denoted QR , is the contribution to the SS due to
variation between rows while the first term contains both the inter-column and the
innate, or residual, contributions.
We can, in the same way, separate the SS between columns from the rest. The result
can be immediately written down by exchanging columns and rows in equation 10.87:

    Q = Σ_c Σ_r (yrc − ȳ.c)² + R Σ_c (ȳ.c − ȳ)²                               (10.88)
The residual contribution, QW , to the SS can be obtained by subtracting the inter-
row and inter-column contributions from the total:
    QW = Q − QR − QC
       = Σ_r Σ_c (yrc − ȳ)² − C Σ_r (ȳr. − ȳ)² − R Σ_c (ȳ.c − ȳ)²
We have thus split the variance into three parts. The number of degrees of freedom
also partitions: N − 1 = RC − 1 = (R − 1) + (C − 1) + (R − 1)(C − 1).
Let us now look at this procedure somewhat more formally. What we have, in fact,
done is to use the following model for our measurements:

    yrc = µ + θr + ωc ,        Σ_r θr = Σ_c ωc = 0                            (10.91)
Bibliography
2. Siegmund Brandt, Data Analysis: Statistical and Computational Methods for Scien-
tists and Engineers, Third edition (Springer 1999)
8. Louis Lyons, Statistics for Nuclear and Particle Physicists (Cambridge University
Press, 1986)
9. Stuart L. Meyer, Data Analysis for Scientists and Engineers (Wiley, 1975)
10. Philip R. Bevington, Data Reduction and Error Analysis for the Physical Sciences
(McGraw-Hill, 1969)
11. M. G. Kendall and A. Stuart, The Advanced Theory of Statistics (Griffin, vol. I, 4th
ed., 1977; vol. II, 4th ed., 1979; vol. III, 3rd ed., 1976)
12. Alan Stuart and Keith Ord, Kendall’s Advanced Theory of Statistics, vol. 1, Distri-
bution Theory (Arnold, 1994).
13. Alan Stuart, Keith Ord, and Steven Arnold, Kendall’s Advanced Theory of Statistics,
vol. 2A, Classical Inference and the Linear Model (Arnold, 1999).
14. Anthony O’Hagan, Kendall’s Advanced Theory of Statistics, vol. 2B, Bayesian Infer-
ence (Arnold, 1999).
15. Harald Cramér, Mathematical Methods of Statistics (Princeton Univ. Press, 1946)
16. William H. Press, “Understanding Data Better with Bayesian and Global Statistical
Methods”, preprint astro-ph/9604126 (1996).
18. T. Bayes, “An essay towards solving a problem in the doctrine of chances”, Phil.
Trans. Roy. Soc., liii (1763) 370; reprinted in Biometrika 45 (1958) 293.
19. P. S. de Laplace, “Mémoire sur la probabilité des causes par les évenements”, Mem.
Acad. Sci. Paris 6 (1774) 621; Eng. transl.: “Memoir on the probability of the causes
of events” with an introduction by S. M. Stigler, Statist. Sci. 1 (1986) 359.
22. T. Bayes, An introduction to the Doctrine of Fluxions, and a Defence of the Mathe-
maticians Against the Objections of the Author of the Analyst (1736)
23. Richard von Mises, Wahrscheinlichkeit, Statistik und Wahrheit (1928); reprinted as
Probability, Statistics, and Truth (Dover, 1957)
25. L. von Bortkiewicz, Das Gesetz der kleinen Zahlen (Teubner, Leipzig, 1898)
28. K. F. Gauß, Theoria motus corporum coelestium (Perthes, Hamburg, 1809); Eng.
transl., Theory of the Motion of the Heavenly Bodies Moving About the Sun in
Conic Sections (Dover, New York, 1963)
33. R. Y. Rubinstein, Simulation and the Monte Carlo Method (Wiley, 1981)
36. A. C. Aitken and H. Silverstone, Proc. Roy. Soc. Edin. A 61 (1942) 186.
39. Henry Margenau and George Mosely Murphy, The Mathematics of Physics and Chem-
istry (vol. I, Van Nostrand, 1956)
42. B. Efron, The Jackknife, the Bootstrap, and Other Resampling Plans (S.I.A.M., 1982)
43. B. Efron and R.J. Tibshirani, An Introduction to the Bootstrap (Chapman & Hall
1993)
45. J. Neyman, Phil. Trans. Roy. Soc. London, series A, 236 (1937) 333.
46. Particle Data Group: ‘Review of Particle Physics’, Phys. Rev. D 54 (1996) 1.
47. Particle Data Group: ‘Review of Particle Physics’, Eur. Phys. J. C 15 (2000) 1.
50. W. J. Metzger, ‘Upper limits’, Internal report University of Nijmegen HEN-297 (1988)
55. William H. Press et al., Numerical Recipes in FORTRAN90: The Art of Scientific
Computing, Numerical Recipes in FORTRAN77: The Art of Scientific Computing,
Numerical Recipes in C++: The Art of Scientific Computing, Numerical Recipes in
C: The Art of Scientific Computing, (Cambridge Univ. Press).
Exercises
1. In statistics we will see that the moments of the parent distribution can be
‘estimated’, or ‘measured’, by calculating the corresponding moment of the
data, e.g., x̄ = (1/n) Σ xi gives an estimate of the mean µ and √[ (1/n) Σ (xi − x̄)² ]
estimates σ, etc.
You may find the FORTRAN subroutine FLPSOR useful: CALL FLPSOR(X,N),
where N is the dimension, e.g., 80, of the array X. After calling this routine,
the order of the elements of X will be in ascending order.
6. The Chebychev Inequality. Assume that the p.d.f. for the r.v. X has mean µ
and variance σ 2 . Show that for any positive number k, the probability that
x will differ from µ by more than k standard deviations is less than or equal
to 1/k2 , i.e., that
       P(|x − µ| ≥ kσ) ≤ 1/k²
7. Show that | cov(x, y)| ≤ σx σy , i.e., that the correlation coefficient, ρx,y =
cov(x, y)/σx σy , is in the range −1 ≤ ρ ≤ 1 and that ρ = ±1 if and only
if x and y are linearly related.
8. A beam of mesons, composed of 90% pions and 10% kaons, hits a Čerenkov
counter. In principle the counter gives a signal for pions but not for kaons,
thereby identifying any particular meson. In practice it is 95% efficient at
giving a signal for pions, and also has a 6% probability of giving an accidental
signal for a kaon. If a meson gives a signal, what is the probability that the
particle was a pion? If there is no signal, what is the probability that it was
a kaon?
9. Mongolian swamp fever (MSF) is such a rare disease that a doctor only expects
to meet it once in 10000 patients. It always produces spots and acute lethargy
in a patient; usually (60% of cases) they suffer from a raging thirst, and
occasionally (20% of cases) from violent sneezes. These symptoms can arise
from other causes: specifically, of patients who do not have MSF, 3% have
spots, 10% are lethargic, 2% thirsty, and 5% complain of sneezing. These four
probabilities are independent.
Show that if you go to the doctor with all these symptoms, the probability
of your having MSF is 80%. What is the probability if you have all these
symptoms except sneezing?
11. A student is trying to hitch a lift. Cars pass at random intervals, at an average
rate of 1 per minute. The probability of a car giving a student a lift is 1%.
What is the probability that the student will still be waiting:
    P(r; µ) = µ^r e^{−µ} / r!

is

    φ(t) = exp[ µ (e^{it} − 1) ]
Use the characteristic function to prove the reproductive property of the Pois-
son p.d.f.
(a) The number of events N is Poisson distributed with mean µ and they are
split into F and B = N − F following a binomial p.d.f., B(F ; N, pF ),
i.e., the independent variables are N and F .
(b) The F events and B events are both Poisson distributed (with param-
eters µF and µB ), and the total is just their sum, i.e., the independent
variables are F and B.
14. Show that the Poisson p.d.f. tends to a Gaussian with mean µ and variance
σ 2 = µ for large µ, i.e.,
P (r; µ) −→ N (r; µ, µ)
(a) What is the probability of a value lying more than 1.23σ from the mean?
(b) What is the probability of a value lying more than 2.43σ above the
mean?
(c) What is the probability of a value lying less than 1.09σ below the mean?
(d) What is the probability of a value lying above a point 0.45σ below the
mean?
(e) What is the probability that a value lies more than 0.5σ but less than
1.5σ from the mean?
(f) What is the probability that a value lies above 1.2σ on the low side of
the mean, and below 2.1σ on the high side?
(g) Within how many standard deviations does the probability of a value
occurring equal 50%?
(h) How many standard deviations correspond to a one-tailed probability of
99%?
16. During a meteor shower, meteors fall at the rate of 15.7 per hour. What is the
probability of observing less than 5 in a given period of 30 minutes? What
value do you find if you approximate the Poisson p.d.f. by a Gaussian p.d.f.?
17. Four values (3.9, 4.5, 5.5, 6.1) are drawn from a normal p.d.f. whose mean is
known to be 4.9. The variance of the p.d.f. is unknown.
(a) What is the probability that the next value drawn from the p.d.f. will
have a value greater than 7.3?
(b) What is the probability that the mean of three new values will be between
3.8 and 6.0?
18. Let x and y be two independent r.v.’s, each distributed uniformly between 0
and 1. Define z± = x ± y.
It will probably help your understanding of this situation to use Monte Carlo
to generate points uniform in x and y and to make a two-dimensional his-
togram of z+ vs. z− .
19. Derive the reproductive property of the Gaussian p.d.f., i.e., show that if
x and y are independent r.v.’s distributed normally as N (x; µx , σx2 ) and
N (y; µy , σy2 ), respectively, then z = x + y is also normally distributed as
N (z; µz , σz2 ). Show that µz = µx + µy and σz2 = σx2 + σy2 . Derive also
the p.d.f. for z = x − y, for z = (x + y)/2, and for z = x̄ = Σ_{i=1}^{n} xi/n
when all the xi are normally distributed with the same mean and variance.
20. For the bivariate normal p.d.f. for x, y with correlation coefficient ρ, trans-
form to variables u, v such that the covariance matrix is diagonal and show
that
       σu² = (σx² cos²θ − σy² sin²θ) / (cos²θ − sin²θ)

       σv² = (σy² cos²θ − σx² sin²θ) / (cos²θ − sin²θ)

    where tan 2θ = 2ρσxσy / (σx² − σy²)
21. Show that for the bivariate normal p.d.f., the conditional p.d.f., f (y|x), is a
normal p.d.f. with mean and variance,
       E[y|x] = µy + ρ (σy/σx)(x − µx)    and    V[y|x] = σy² (1 − ρ²)
G = (x − µ)T V −1 (x − µ)
(b) Demonstrate the result (a) by generating by Monte Carlo the distribution
of g for n = 1, 2, 3, 5, 10, 50 and comparing it to N (g; 0, 1).
(c) If the xi are uniformly distributed in the intervals [0.0, 0.2] and [0.8, 1.0],
i.e.,
        f(x) = 1/0.4 ,   0.0 ≤ x ≤ 0.2 or 0.8 ≤ x ≤ 1.0
             = 0 ,       otherwise,
27. Show that the weighting method used in the two-dimensional example of crude
Monte Carlo integration (sect. 6.2.5, eq. 6.5) is in fact an application of the
technique of importance sampling.
28. Perform the integral I = ∫_0^1 x³ dx by crude Monte Carlo using 100, 200,
400, and 800 points. Estimate not only I, but also the error on I. Does the
error decrease as expected with the number of points used?
Repeat the determination of I 50 times using 200 (different) points each time
and histogram the resulting values of I. Does the histogram have the shape
that you expect? Also evaluate the integral by the following methods and
compare the error on I with that obtained by crude Monte Carlo:
29. Generate 20000 Monte Carlo points with x > 0 distributed according to the
distribution

        f(x) = ½ [ (1/τ) e^{−x/τ} + (1/λ) e^{−x/λ} ]
for τ = 3 and λ = 10. Do this for (a) the weighting, (b) the rejection,
and (c) the composite methods using inverse transformations. Which method
How can you arrive at a histogram for the x-distribution of the events you
detect? There are various methods. Which should be the best?
30. Generate 1000 points, xi, from the Gaussian p.d.f. N(x; 10, 5²). Use each
of the following estimators to estimate the mean of X: sample mean, sample
median, and trimmed sample mean (10%).
Repeat assuming we only measure values of X in the interval (5,25), i.e. if
an xi is outside this range, throw it away and generate a new value.
Repeat this all 25 times, histogramming each estimation of the mean. From
these histograms determine the variance of each of the six estimators.
31. Under the assumptions that the range of the r.v. X is independent of the
parameter θ and that the likelihood, L(x; θ), is regular enough to allow
interchanging ∂²/∂θ² and ∫ dx, derive equation 8.23,

        Ix(θ) = −E[ ∂S(x; θ)/∂θ ]
32. Show that the estimator σ̂² = Σ_i (xi − µ)²/n is an efficient estimator of the
variance of a Gaussian p.d.f. of known mean by showing that its variance is
equal to I −1 .
33. Using the method of section 8.2.7, find an efficient and unbiased estimator for
σ 2 of a normal p.d.f. when µ is known and there is thus only one parameter
for the distribution.
35. (a) Derive equations 8.41 and 8.42, i.e., show that the variance of the r th
        sample moment is given by

            V[ \overline{x^r} ] = (1/n) ( E[x^{2r}] − (E[x^r])² )

        and that

            cov[ \overline{x^r}, \overline{x^q} ] = (1/n) ( E[x^{r+q}] − E[x^r] E[x^q] )

    (b) Derive an expression in terms of sample moments to estimate the vari-
        ance, V[σ̂²], of the moments estimator of the parent variance, σ̂² = m̂2 − m̂1².
36. We estimate the values of x and y by their sample means, x̄ and ȳ, which
have variances σx2 and σy2 . The covariance is zero. We want to estimate the
values of r and θ which are related to x and y by
        r² = x² + y²    and    tan θ = y/x
Following the substitution method, what are r̂ and θ̂? Find the variances and
covariance of r̂ and θ̂.
37. We measure x = 10.0 ± 0.5 and y = 2.0 ± 0.5. What is then our estimate
of x/y? Use Monte Carlo to investigate the validity of the error propagation.
38. We measure cos θ and sin θ, both with standard deviation σ. What is the
ml estimator for θ? Compare with the results of exercise 36.
39. Decay times of radioactive atoms are described by an exponential p.d.f. (equa-
tion 3.10):
        f(t; τ) = (1/τ) e^{−t/τ}
(a) Having measured the times ti of n decays, how would you estimate τ
and the variance V [τ̂ ] (1) using the moments method and (2) using the
maximum likelihood method? Which method do you prefer? Why?
(b) Generate 100 Monte Carlo events according to this p.d.f. with τ = 10,
(cf. exercise 29) and calculate τ̂ and V [τ̂ ] using both the moments
and the maximum likelihood methods. Are the results consistent with
τ = 10? Which method do you prefer? Why?
(c) Use a minimization program, e.g., MINUIT, to find the maximum of the
likelihood function for the Monte Carlo events of (39b). Evaluate V [τ̂ ]
using both the second-derivative matrix and the variation of l by 1/2.
Compare the results for τ̂ and V [τ̂ ] with those of (39b).
(d) Repeat (39b) 1000 times making histograms of the value of τ̂ and of the
estimate of the error on τ̂ for each method. Do you prefer the moments
or the maximum likelihood expression for V [τ̂ ]? Why?
(e) Suppose that we can only detect times t < 10. What is then the
likelihood function? Use a minimization program to find the maximum
of the likelihood function and thus τ̂ and its variance. Does this value
agree with τ = 10?
(f) Repeat (39b) and (39e) with 10000 Monte Carlo events.
40. Verify that a least squares fit of independent measurements to the model
y = a + bx results in estimates for a and b given by
        â = ȳ − b̂ x̄    and    b̂ = ( \overline{xy} − x̄ ȳ ) / ( \overline{x²} − x̄² )
where the bar indicates a weighted sample average with weights given by
1/σi2 , as stated in section 8.5.5.
41. Use the method of least squares to derive formulae to estimate the value (and
its error), y ± δy, from a set of n measurements, yi ± δyi . Assume that
the yi are uncorrelated. Comment on the relationship between these formulae
and those derived from ml (equations 8.59 and 8.60).
42. Perform a least squares fit of a parabola
y(x) = θ1 + θ2 x + θ3 x2
for the four independent measurements: 5 ± 2, 3 ± 1, 5 ± 1, 8 ± 2 measured
at the points x = −0.6, −0.2, 0.2, 0.6, respectively. Determine not only the
θ̂i and their covariances, but also calculate the value of y and its uncertainty
at x = 1.
To invert a matrix you can use the routine RSINV:
CALL RSINV (N,A,N,IFAIL)
where A is a symmetric, positive matrix of dimension (N,N). If the matrix
inversion is successful, IFAIL is returned as 0.
43. The three angles of a triangle are independently measured to be 63◦ , 34◦ ,
and 85◦ , all with a resolution of 1◦ .
(a) Calculate the least squares estimate of the angles subject to the require-
ment that their sum be 180◦ .
44. Generate events as in exercise 39b. Histogram the times ti and use the two
minimum chi-square methods and the binned maximum likelihood method to
estimate the lifetime τ . Use a minimization program, e.g., MINUIT, to find
the minima and maximum. Compare the results of these three methods and
those of exercise 39b.
45. In section 8.7.4 is a table comparing the efficiencies of various location es-
timators for various distributions. Generate 10000 random numbers from a
standard normal distribution and estimate the mean using each of the esti-
mators in the table. Repeat this 1000 times making histograms of the values
of each estimator. The standard deviation of these histograms is an estimate
of the standard deviation of the estimator. Are these in the ratio expected
from the table?
(a) In our detector it produces 389 counts in the first minute and 423 counts
in the second minute. Assuming a 100% efficient detector, what is the
best estimation of the activity of the source?
(b) What can you say about the best value and uncertainty for the activity
of the source from the following set of independent measurements?
49. You want to determine the probability, p, that a student passes the statis-
tics exam. Since there are only two possible outcomes, pass and fail, the
appropriate p.d.f. is binomial, B(k; N, p).
(a) Construct the confidence belt for a 95% central confidence interval for p
for the case that 10 students take the exam and k pass, i.e., draw k+ (p)
and k− (p) curves on a p vs. k plot.
(b) Assume that 8 of the 10 pass. Find the 95% central confidence interval
from the plot constructed in (a) and by solving equation 9.18.
50. An experiment studying the decay of the proton (an extremely rare process, if
it occurs at all) observes 7 events in 1 year for a sample of 106 kg of hydrogen.
51. Construct a most powerful (MP) test for one observation, x, for the hypothesis
that X is distributed as a Cauchy distribution,
1
f (x) =
π [1 + (x − θ)2 ]
with θ = 0 under H0 and θ = 1 under H1 . What is the size of the test if
you decide to reject H0 when x > 0.5?
52. Ten students each measure the mass of a sample, each with an error of 0.2 g:
10.2 10.4 9.8 10.5 9.9 9.8 10.3 10.1 10.3 9.9 g
(a) Test the hypothesis that they are all measurements of a sample whose
true mass is 10.1 g.
(b) Test the hypothesis that they are all measurements of the same sample.
Number of events 0 1 2 3 4 5 6 7 8 9
Number of intervals 1042 860 307 78 15 3 0 0 0 1
This date was also the date that astronomers first saw the supernova SN 1987A.
(a) Test the hypothesis that the data are described by a Poisson distribution.
(b) Test the hypothesis that the data are described by the sum of two Poisson
distributions, one for a signal of 9 events within one ten-second interval,
and another for the background of ordinary cosmic neutrinos.
54. Marks on an exam are distributed over male and female students as follows
(it is left to your own bias to decide which group is male):
Group 1 39 18 3 22 24 29 22 22 27 28 23 48
Group 2 42 23 36 35 38 42 33
Assume that test scores are normally distributed within each group.
(a) Assume that the variance of the scores of both groups is the same, and
test the hypothesis that the mean is also the same for both groups.
(b) Test the assumption that the variance of the scores of both groups is the
same.
Crystal 1 2 3 4 5 6 7
Before 29 30 42 34 37 45 32
After 36 26 46 36 40 51 33
difference 7 −4 4 2 3 6 1
(a) Test whether the light transmission has improved using only the mean
of the before and after measurements.
(b) Test whether the light transmission has improved making use of the
measurements per crystal, i.e., using the differences in transmission.
For the following exercises you will be assigned a file containing the data to be
used. It will consist of 3 numbers per event, which may be read, e.g., in FORTRAN
by
READ(11,’(I5)’) NEVENTS
READ(11,’(3F10.7)’) ((E(I,IEV),I=1,3),IEV=1,NEVENTS)
The data may be thought of as being the measurement of the radioactive decay
of a neutral particle at rest into two new particles, one positive and one negative,
with
E(1,IEV) = x, the mass of the decaying particle as determined from the en-
ergies of the decay products. The mass values have a certain
spread due to the resolution of our apparatus and/or the Heisen-
berg uncertainty principle (for a very short-lived particle).
E(2,IEV) = cos θ, the cosine of the polar angle of the positive decay particle’s
direction.
E(3,IEV) = φ/π, the azimuthal angle, divided by π, of the positive decay par-
ticle’s direction. Division by π results in a more convenient
quantity to histogram.
Assume that the decay is of a vector meson to two pseudo-scalar mesons. The decay
angular distribution is then given by
    f(cos θ, φ) = (3/4π) [ ½(1 − ρ00) + ½(3ρ00 − 1) cos²θ − ρ1,−1 sin²θ cos 2φ − √2 Re ρ10 sin 2θ cos φ ]
A1. Use the moments method to estimate the mass of the particle and the decay
parameters ρ00 , ρ1,−1 , and Reρ10 . Also estimate the variance and standard
deviation of the p.d.f. for x. Estimate also the errors of all of the estimates.
A2. Use the maximum likelihood method to estimate the decay parameters ρ00 ,
ρ1,−1 , and Reρ10 using a program such as MINUIT to find the maximum of the
likelihood function. Determine the errors on the estimates using the variation
of the likelihood.
A4. Assume that x is distributed normally. Determine µ and σ using both the
minimum χ2 and the binned maximum likelihood methods. Do this twice,
once with narrow and once with wide bins. Compare the estimates and their
covariance matrix obtained with these two methods with each other and with
that of the previous exercise.
A5. Test the assumption of vector meson decay against the hypothesis of decay of
a scalar meson, in which case the angular distribution must be isotropic.
For the following exercises you will be assigned a file containing the data to be
used. It is the same situation as in the previous exercises except that it is somewhat
more realistic, having some background to the signal.
B1. From an examination of histograms of the data, make some reasonable hypothe-
ses as to the nature of the background, i.e., propose some functional form for
the background, fb (x) and fb (cos θ, φ).
B2. Modify your likelihood function to include your hypothesis for the background,
and use the maximum likelihood method to estimate the decay parameters
ρ00 , ρ1,−1 , and Reρ10 as well as the fraction of signal events. Also determine
the position of the signal, µ, and its width, σ, under the assumption that the
signal x is normally distributed. Determine the errors on the estimates using
the variation of the likelihood.
B3. Develop a way to use the moments method to estimate, taking into account
the background, the decay parameters ρ00 , ρ1,−1 , and Reρ10 . Estimate also
the errors of the estimates.
B4. Determine the goodness-of-fit of the fits in the previous two exercises. There
are several goodness-of-fit tests which could be applied. Why did you choose
the one you did?