
Statistical Methods in Experimental Chemistry

Philip J. Grandinetti

January 20, 2000




Contents

1 Statistical Description of Experimental Data
  1.1 Random Error Distributions
      1.1.1 Univariate Data
            The Mean
            The Variance
            The Skewness
            The Kurtosis
      1.1.2 Bivariate Data
            The Covariance
      1.1.3 Probability
            Permutations
            Combinations
            Probability Distributions
            Binomial Distribution
            Poisson Distribution
            Gaussian Distribution
            Bi-dimensional Gaussian Distribution
            Multi-dimensional Gaussian Distribution
            Central Limit Theorem
  1.2 The χ² test of a distribution
  1.3 Systematic Errors
            Instrument Errors
            Personal Errors
            Method Errors
  1.4 Gross Errors
  1.5 Propagation of Errors
  1.6 Confidence Limits
      1.6.1 Student’s t-distribution
  1.7 The Two Sample Problem
      1.7.1 The t-test
            Comparing a measured mean to a true value
            Comparing two measured means
      1.7.2 Comparing Variances - The F-test
  1.8 Problems

2 Modeling of Data
  2.1 Maximum Likelihood
  2.2 A Simple Example of Linear Least-Squares
      2.2.1 Finding Best-Fit Model Parameters
      2.2.2 Finding Uncertainty in Model Parameters
      2.2.3 Finding Goodness of Fit
      2.2.4 What to do if you don’t know σ_yi?
  2.3 General Linear Least Squares
  2.4 General Non-Linear Least Squares
      2.4.1 Minimizing χ² without Derivative Information
            Downhill Simplex Method
      2.4.2 Minimizing χ² with Derivative Information
            Directional Derivatives and Gradients
            Steepest Descent Method
            Newton Raphson Method
            Marquardt Method
  2.5 Confidence Intervals
      2.5.1 Constant Chi-Squared Boundaries as Confidence Limits
      2.5.2 Getting Parameter Errors from χ² with Non-Gaussian Errors
            Monte Carlo Simulations
            Bootstrap Method
  2.6 Problems

A Computers
            The Bit
            Integers
            Signed Integer Numbers
            Alphanumeric Characters
            Floating point numbers
            Roundoff Errors
  A.1 Problems

B Programming in C - A quick tutorial
            Comments
            #include
            main()
            Another Example
            To program or not to program
  C.1 Problems


Chapter 1

Statistical Description of Experimental Data

A fundamental problem encountered by every scientist who performs a measurement is whether her measured quantity is correct. Put another way, how can she know the difference between her
measured value and the true value?

Error in Measurement = Measured Value − True Value

The basic difficulty is that to answer this very important question she needs to know the “true
value”. Of course, if she knew the true value she wouldn’t be making the measurement in the first
place! So, how can she resolve this “catch-22” situation?
The solution is for her to go back into the lab, perform the measurement many times on known
quantities, and study the errors. While she can never know the true value of her unknown quantity,
she can use her error measurements and the mathematics of probability theory to tell her the
probability that the true value for her unknown quantity lies within a given range.
Errors can be classified into two classes, random and systematic. Random errors are irrepro-
ducible, and usually are caused by a large number of uncorrelated sources. In contrast, systematic
errors are reproducible, and usually can be attributed to a single source. Random errors could also
be caused by a single source, particularly when the experimentalist cannot identify the source of her
errors. If she could identify the source she might be able to make the error behave systematically.
Often systematic errors can be eliminated by simply recalibrating the measurement device. You
may hear of a third class of errors called gross errors. These are described as errors that occur
occasionally and are large. We will discuss this particular type of error later on, however, these
errors technically do not define a new class of errors. All errors, regardless of the class, can be
studied and used to establish limits between which the true value lies.



Figure 1.1: (A) A set of 128 temperature measurements of a solution in a constant temperature
bath. (B) A normalized histogram of the measured values.

1.1 Random Error Distributions

1.1.1 Univariate Data

In this section we will assume that when we perform a measurement there are no systematic errors.
We are all familiar with random errors. For example, suppose you were measuring the temperature
of a solution in a constant temperature bath. You know that the temperature should be constant,
but on close inspection you find that it is not constant and appears to be randomly fluctuating
around the expected constant value (see Fig. 1.1a). What do you report for the temperature of the
solution? On one extreme we could report only the average value, and on the other extreme we
could report all the values measured in the form of a histogram of measured values (see Fig 1.1b).
To use the mathematics of probability theory in helping us understand errors we make the
assumption that our histogram of measured values is governed by an underlying probability distri-
bution called the parent distribution. In the limit of an infinite number of measurements our
histogram or sample distribution becomes the parent distribution. This point is illustrated in
Fig. 1.2. Notice how the histogram of measured values more closely follows the values predicted
by the parent distribution as the number of measurements used to construct the histogram in-
creases. Thus, our first goal in understanding the errors in our measurements is to learn the
underlying parent distribution¹, p(x), that predicts the spread in our measured values. Once
we know that distribution we can calculate the probability that the measured value lies within a
¹ We define our parent distribution so that it is normalized. That is, the area under the distribution is unity,

\int_{-\infty}^{\infty} p(x)\, dx = 1.    (1.1)



Figure 1.2: In (a), (b), and (c) are histograms constructed from three different sets of 4096 measure-
ments, all drawn from the same parent distribution. Shown from top to bottom are the histograms
constructed using only the first 128, the first 1024, and finally all 4096 measurements, respectively.
Notice how the histogram bin heights vary at low N and fall closer to the bin heights predicted by
the parent distribution for high N .

given range. That is,

P(x_-, x_+) = \int_{x_-}^{x_+} p(x)\, dx,    (1.2)

where p(x) is our parent distribution for the measured value x, and P (x− , x+ ) is the probability
that the measured value lies between x− and x+ . In general, x− and x+ are called the confidence
limits associated with a given probability P (x− , x+ ). We will have further discussions on confidence
limits later in the text.
Note that when you report confidence limits you are not reporting any information concerning
the shape of the parent distribution. It is often useful to have such information. In the interests
of not having to plot a histogram for every measurement we report we ask the question: Can we
mathematically describe the parent distribution for our measured values by a finite (and perhaps


even small) number of parameters? While the answer to this question depends on the particular
parent distribution, there are a few parameters that by convention are often used to describe (in
part, or sometimes completely) the parent distribution. These are the mean, variance, skewness,
and kurtosis.

The Mean describes the average value of the distribution. It is also called the first moment
about the origin. It is defined as:
\mu = \lim_{N \to \infty} \frac{1}{N} \sum_i x_i,    (1.3)

where N corresponds to the number of measurements xi and µ is the mean of the parent distribution.
Experimentally we cannot make an infinite number of measurements so we define the experimental
mean according to:
\bar{x} = \frac{1}{N} \sum_i x_i,    (1.4)

where x̄ is the mean of the experimental distribution. The mean is usually approximated to be the
“true value”, particularly when the distribution is symmetric and assuming no systematic errors.
If the distribution is not symmetric the mean is often supplemented with two other parameters,
the median and the mode. The median cuts the area of the parent distribution in half. That is,
\int_{-\infty}^{x_{\rm median}} p(x)\, dx = 1/2.    (1.5)

When the distribution is symmetric then the mean and median are the same. The mode is the
most probable value of the parent distribution. That is,

\frac{dp(x_{\rm mode})}{dx} = 0, \quad {\rm and} \quad \frac{d^2 p(x_{\rm mode})}{dx^2} < 0.    (1.6)

The Variance characterizes the width of the distribution. It is also called the second moment
about the mean. It is defined as:

\sigma^2 = \lim_{N \to \infty} \frac{1}{N} \sum_i^{N} (x_i - \mu)^2,    (1.7)

where σ 2 is the variance of the parent distribution. The experimental variance is defined as:

s^2 = \frac{1}{N-1} \sum_i^{N} (x_i - \bar{x})^2,    (1.8)



Figure 1.3: Position of the mean, median, and mode for an asymmetric distribution. Note that in a
symmetric distribution the mean and median are the same.

where s2 is the variance of the experimental parent distribution. σ and s are defined as the
standard deviation of the parent distribution and experimental parent distribution, respectively.
The variance is approximated as the “error dispersion” of the set of measurements, and the standard
deviation is approximated as the “uncertainty” in the “true value”. Notice that in the expression
for s² we use N − 1 in the denominator in contrast to N for σ². Without going into detail, the
difference is because s² is calculated using x̄ instead of µ. In practice, if the difference between N
and N − 1 in the denominator affects the conclusions you make about your data, then you probably
need to collect more data (i.e. increase N ) and reanalyze your data.
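To make this concrete, here is a minimal C sketch (in the spirit of the C tutorial in Appendix B) that computes the experimental mean and variance of Eqs. 1.4 and 1.8; the temperature readings in it are invented for illustration.

#include <stdio.h>
#include <math.h>

/* Experimental mean, Eq. 1.4 */
double mean(const double *x, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i];
    return sum / n;
}

/* Experimental variance, Eq. 1.8 (note the N - 1 denominator) */
double variance(const double *x, int n) {
    double xbar = mean(x, n), sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += (x[i] - xbar) * (x[i] - xbar);
    return sum / (n - 1);
}

int main(void) {
    /* hypothetical temperature readings (degrees C) */
    double t[] = {31.02, 31.15, 30.97, 31.08, 31.11, 30.94};
    int n = sizeof t / sizeof t[0];
    double s2 = variance(t, n);
    printf("mean = %.3f, s^2 = %.5f, s = %.3f\n", mean(t, n), s2, sqrt(s2));
    return 0;
}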

The Skewness characterizes the degree of asymmetry of a distribution around its mean. It is
also called the third moment about the mean. It is defined as:
{\rm skewness} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \mu}{\sigma} \right)^3.    (1.9)

Skewness is defined to have no dimensions. Positive skewness signifies a distribution with an asym-
metric tail extending out towards more positive x, while negative skewness signifies a distribution
whose tail extends out toward more negative x. A symmetric distribution should have zero skew-
ness (e.g. Gaussian). It is good practice to believe you have a statistically significant non-zero
skewness if it is larger than √(15/N).



Figure 1.4: Positive skewness signifies a distribution with an asymmetric tail extending out towards
more positive x, while negative skewness signifies a distribution whose tail extends out toward more
negative x.

The Kurtosis measures the relative peakedness or flatness of a distribution relative to a normal
(i.e., Gaussian) distribution. It is also called the fourth moment about the mean. It is defined as:

{\rm kurtosis} = \left[ \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \mu}{\sigma} \right)^4 \right] - 3.    (1.10)

Subtracting 3 makes the kurtosis zero for a normal distribution. A positive kurtosis is called
leptokurtic, a negative kurtosis is called platykurtic, and in between is called mesokurtic.
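A short C sketch of Eqs. 1.9 and 1.10 follows, with the experimental mean and standard deviation standing in for µ and σ, as one would do in practice; the data values are invented for illustration.

#include <stdio.h>
#include <math.h>

/* Sample skewness and excess kurtosis, Eqs. 1.9 and 1.10, with the
   experimental mean and standard deviation standing in for mu and sigma. */
void moments(const double *x, int n, double *skew, double *kurt) {
    double m = 0.0, s2 = 0.0;
    for (int i = 0; i < n; i++) m += x[i];
    m /= n;
    for (int i = 0; i < n; i++) s2 += (x[i] - m) * (x[i] - m);
    s2 /= n;                       /* population form, matching Eqs. 1.9-1.10 */
    double s = sqrt(s2), sk = 0.0, ku = 0.0;
    for (int i = 0; i < n; i++) {
        double z = (x[i] - m) / s;
        sk += z * z * z;
        ku += z * z * z * z;
    }
    *skew = sk / n;
    *kurt = ku / n - 3.0;          /* subtract 3 so a Gaussian gives zero */
}

int main(void) {
    double x[] = {1.2, 1.4, 1.1, 1.3, 1.2, 2.0, 1.3, 1.1};  /* made-up data */
    double skew, kurt;
    moments(x, sizeof x / sizeof x[0], &skew, &kurt);
    printf("skewness = %.3f, kurtosis = %.3f\n", skew, kurt);
    return 0;
}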

1.1.2 Bivariate Data


Let’s consider a slightly different measurement situation. In this situation when we make a mea-
surement we obtain not a single number, but a pair of numbers. That is, each measurement yields
an (x, y) pair. Just as in the one parameter case, there are random errors that will result in a
distribution of (x, y) pairs around the true (x, y) pair value. Also, like the one parameter case we
can construct a histogram of the N measured (x, y) pairs, except in this case our histogram will be
a three-dimensional plot of occurrence versus x versus y.
As before, in the limit of an infinite number of (x, y) pair measurements we would have the two
parameter parent distribution², p(x, y). The mean pair (µx, µy) is calculated as

\mu_x = \lim_{N \to \infty} \frac{1}{N} \sum_i x_i, \quad {\rm and} \quad \mu_y = \lim_{N \to \infty} \frac{1}{N} \sum_i y_i.    (1.12)
² We define our two parameter parent distribution so that it is normalized. That is, the area under the distribution is unity,

\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} p(x, y)\, dx\, dy = 1.    (1.11)


Figure 1.5: The Kurtosis measures the relative peakedness or flatness of a distribution relative to
a normal (i.e., Gaussian) distribution. A positive kurtosis is called leptokurtic, a negative kurtosis
is called platykurtic, and in between is called mesokurtic.

Similarly, the variances are calculated as

\sigma_x^2 = \lim_{N \to \infty} \frac{1}{N} \sum_i (x_i - \mu_x)^2, \quad {\rm and} \quad \sigma_y^2 = \lim_{N \to \infty} \frac{1}{N} \sum_i (y_i - \mu_y)^2.    (1.13)

The Covariance In the two parameter case, however, we need to consider another type of
variance called the covariance. It is defined as
\sigma_{xy}^2 = \lim_{N \to \infty} \frac{1}{N} \sum_i (x_i - \mu_x)(y_i - \mu_y).    (1.14)

If the errors in x and y are uncorrelated this term will be zero. What distinguishes these two
situations? Usually, a non-zero covariance implies that the two measured parameters are dependent
on each other, that is
y = F(x), or x = F(y),

or both x and y are dependent on a common third parameter, that is

x = F(z), or y = F(z).

If x is the temperature of a solution in a water bath in New York City, and y is the temperature
of a solution in a water bath in San Francisco, then it’s unlikely that the errors in x and y will be
correlated.



Figure 1.6: (a) Contour plot of a two parameter parent distribution having no correlation between
the two parameter errors. (b) Contour plot of a two parameter parent distribution having a strong
correlation between parameter errors.

This situation would have a distribution similar to the one in Fig. 1.6a. If, however,
x and y were the temperatures of two different solutions in the same water bath, then it would be
reasonable if their errors were correlated, as shown for the distribution in Fig. 1.6b.
Another parameter often used that is related to the covariance is the linear correlation coefficient
r. It is defined as
r_{xy} = \frac{\sigma_{xy}^2}{\sigma_x \sigma_y}.    (1.15)

An r_{xy}^2 value of 1 implies a complete linear correlation between the measured x and y parameters,
while an r_{xy}^2 value of zero implies no correlation whatsoever. It should be pointed out that even if
y is dependent on x in a non-linear fashion, the non-linear function may be fairly linear over the
small range of x and y associated with the random errors in their measurement.
A high covariance between two parameters means that any increase (or decrease) in the uncer-
tainty of one parameter will lead to a corresponding increase (or decrease) in the uncertainty of
the other parameter. This is illustrated in Fig. 1.7.
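A minimal C sketch of Eqs. 1.14 and 1.15 for a finite sample is shown below; the paired readings are invented, and the experimental means stand in for µx and µy.

#include <stdio.h>
#include <math.h>

/* Sample covariance and linear correlation coefficient, Eqs. 1.14 and 1.15,
   using experimental means in place of mu_x and mu_y. */
int main(void) {
    /* hypothetical paired temperature readings from two probes */
    double x[] = {31.02, 31.15, 30.97, 31.08, 31.11, 30.94};
    double y[] = {30.99, 31.18, 30.95, 31.05, 31.14, 30.92};
    int n = sizeof x / sizeof x[0];

    double mx = 0.0, my = 0.0;
    for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
    mx /= n; my /= n;

    double sxx = 0.0, syy = 0.0, sxy = 0.0;
    for (int i = 0; i < n; i++) {
        sxx += (x[i] - mx) * (x[i] - mx);
        syy += (y[i] - my) * (y[i] - my);
        sxy += (x[i] - mx) * (y[i] - my);
    }
    sxx /= n; syy /= n; sxy /= n;

    double r = sxy / (sqrt(sxx) * sqrt(syy));
    printf("covariance = %.6f, r = %.3f, r^2 = %.3f\n", sxy, r, r * r);
    return 0;
}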
Finding the probability that a measured (x, y) pair will lie within a given range is a more
complicated problem than in one-dimension. One could define a square region delimited by x− and
x+ on the x-axis and y− and y+ on the y-axis, such as
P(x_-, x_+, y_-, y_+) = \int_{x_-}^{x_+} \int_{y_-}^{y_+} p(x, y)\, dx\, dy,    (1.16)

as shown in Fig. 1.8. In the case of high correlation (i.e., r² close to 1), this approach will include



Figure 1.7: When the distributions of two parameters are strongly correlated, the width of the
distribution in one of the parameters will be greatly reduced if the values of the other parameter
are restricted to a smaller set of possible values.

regions of low probability, and thus doesn’t give an accurate representation of where the data is
likely to occur. Finding the probability that a measured (x, y) pair will lie within a given ellipse is
a better approach, which we will discuss later in the course.

1.1.3 Probability

Before we can go deeper with this statistical picture of random experimental errors, we need to
review the mathematics of probability theory. This will help us better understand the shapes of
parent distributions and their application to the study of random errors.
As you might have guessed, the mathematical theory of probability arose when a gambler
(the Chevalier de Méré) wanted to adjust the odds so he could be certain of winning. Thus he enlisted
the help of the famous French mathematicians Blaise Pascal and Pierre de Fermat.


Figure 1.8: Cartesian confidence limits of a two-dimensional probability distribution.


In principle, calculating the probability of an event is straightforward:

{\rm Probability} = \frac{\text{Number of outcomes that are successful (winning)}}{\text{Total number of outcomes (winning and losing)}}.    (1.17)

The difficulty lies in counting. In order to count the number of outcomes we appeal to combinatorics.

Permutations

A Permutation is an arrangement of outcomes in which the order is important. A common example


is the “Election” problem. Consider a club with 5 members, Joe, Cathy, Sally, Bob, and Pat. In
how many ways can we elect a president and a secretary? One solution is to make a tree, such as
the one below:

Joe:   (Joe, Cathy), (Joe, Sally), (Joe, Bob), (Joe, Pat)
Cathy: (Cathy, Joe), (Cathy, Sally), (Cathy, Bob), (Cathy, Pat)
Sally: (Sally, Joe), (Sally, Cathy), (Sally, Bob), (Sally, Pat)
Bob:   (Bob, Joe), (Bob, Cathy), (Bob, Sally), (Bob, Pat)
Pat:   (Pat, Joe), (Pat, Cathy), (Pat, Sally), (Pat, Bob)

Using such a tree diagram we can count that there are a total of 20 possible ways to elect a
president and a secretary in a club with 5 members. Another approach is to think of two boxes,
one for president and one for secretary. If you pick the president first and the secretary second,
then you have five choices for president and four choices for secretary. The total number of ways
is the product of the two numbers,

5 · 4 = 20.

What if we wanted to elect a president, secretary, and treasurer? In this case a tree would be
a lot of work. Using the boxes approach we would have 5 · 4 · 3 = 60 possibilities. In general, the
number of ways r objects can be selected in order from n objects is

{}_nP_r = n \cdot (n-1) \cdot (n-2) \cdots (n - r + 1),    (1.18)

or more generally written as

{}_nP_r = \frac{n!}{(n-r)!}.    (1.19)


For example, how many ways can a baseball coach arrange the batting order with 9 players?

Answer: {}_9P_9 = \frac{9!}{(9-9)!} = 9! = 362,880 ways.

Combinations

A Combination is an arrangement of outcomes in which the order is not important. A common


example is the “Committee” problem. Consider again our club with 5 members. In how many
ways can we form a three member committee? In this case the order is not important. That is,
{Joe, Cathy, Sally} = {Cathy, Joe, Sally} = {Cathy, Sally, Joe}. All arrangements of Cathy, Joe,
and Sally are equivalent.
The total number of combinations of n objects taken r at a time is

{}_nC_r = \frac{{}_nP_r}{r!} = \frac{n!}{r!(n-r)!}.    (1.20)

For example, there are {}_5C_3 = \frac{5!}{3!\,2!} = 10 possible combinations for a 3 member committee starting
with 5 members.
Here’s another example: What is the number of ways a 5 card hand can be drawn from a 52
card deck?
{}_{52}C_5 = \frac{52!}{5!\,47!} = 2,598,960 ways.

{}_nC_r is called the binomial coefficient, and is also often written as \binom{n}{r}. Both notations will be
used in this text.

Permutations and combinations cannot be used to solve every counting problem; however, they
can be extremely helpful for certain types of counting problems.
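For completeness, here is a small C sketch of Eqs. 1.19 and 1.20. It builds nPr and nCr from running products rather than full factorials, which sidesteps the calculator-overflow problem mentioned later in the Poisson section; the printed cases are the examples from this section.

#include <stdio.h>

/* nPr and nCr computed with running products instead of full factorials,
   so intermediate values stay small (cf. Eqs. 1.19 and 1.20). */
unsigned long long nPr(unsigned n, unsigned r) {
    unsigned long long p = 1;
    for (unsigned k = 0; k < r; k++)
        p *= (n - k);
    return p;
}

unsigned long long nCr(unsigned n, unsigned r) {
    unsigned long long c = 1;
    for (unsigned k = 1; k <= r; k++)
        c = c * (n - r + k) / k;   /* stays an integer at every step */
    return c;
}

int main(void) {
    printf("5P2  = %llu\n", nPr(5, 2));    /* 20: president and secretary   */
    printf("9P9  = %llu\n", nPr(9, 9));    /* 362880: batting orders        */
    printf("5C3  = %llu\n", nCr(5, 3));    /* 10: three-member committees   */
    printf("52C5 = %llu\n", nCr(52, 5));   /* 2598960: five-card hands      */
    return 0;
}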

Probability Distributions

Now that we know how to count, let’s go back to the original problem of calculating the probability
of an event. Our starting point is the equation
{\rm Probability} = \frac{\text{Number of successful outcomes}}{\text{Total number of outcomes}}.    (1.21)
What if the names of everyone in our 5 member club are thrown in a hat, and two names are drawn,
with the first becoming president and the second becoming secretary. With this approach, what is
the probability that Pat is chosen as president and Cathy as secretary?


To calculate this probability we first have to count the outcomes. In this case there is only one
successful outcome, namely Pat as president and Cathy as secretary. The total number of outcomes
can be calculated as the total number of permutations of drawing two names out of a group of five,
or
total number of outcomes = {}_5P_2 = 20.

Therefore, the probability is

{\rm Probability} = \frac{1}{20} = 0.05.
That is, there’s a 5% chance that Pat is chosen as president and Cathy as secretary.
Let’s look at another example. What is the probability of being dealt a particular five card
hand from a 52 card deck?

number of successful outcomes = 1,

total number of outcomes = {}_{52}C_5 = 2,598,960,

{\rm Probability} = \frac{1}{2,598,960} = 3.847692 \times 10^{-7}.

Yet another example: What is the probability of being dealt five cards that are all spades from
a 52 card deck?

number of successful outcomes = {}_{13}C_5 = \frac{13!}{5!\,8!} = 1,287,

total number of outcomes = {}_{52}C_5 = 2,598,960,

{\rm Probability} = \frac{1,287}{2,598,960} = 4.951980 \times 10^{-4}.

Binomial Distribution

Now let’s look at the probabilities associated with independent events with the same probabil-
ities. For example, if I roll a die ten times, what is the probability that only 3 will come up sixes?
The first question we have to ask is what is the probability of rolling a particular sequence with
only 3 sixes. For example, one possibility is

X, X, 6, X, X, X, 6, 6, X, X

where X is a roll that was not 6. The probability of this particular sequence of independent events
is

\frac{5}{6} \cdot \frac{5}{6} \cdot \frac{1}{6} \cdot \frac{5}{6} \cdot \frac{5}{6} \cdot \frac{5}{6} \cdot \frac{1}{6} \cdot \frac{1}{6} \cdot \frac{5}{6} \cdot \frac{5}{6} = \left( \frac{5}{6} \right)^7 \left( \frac{1}{6} \right)^3 = 1.292044 \times 10^{-3}.


The second question we ask is how many different ways can we roll only 3 sixes. Clearly, the
total number of possibilities (i.e., combinations) is {}_{10}C_3 or \binom{10}{3}. Assuming that all possible
combinations are equally probable, then to obtain the overall probability that I will roll only 3
sixes we simply multiply our calculated probability above by the number of combinations that give
only 3 sixes. That is,

P(\text{3 sixes out of 10 rolls}) = {}_{10}C_3 \left( \frac{5}{6} \right)^7 \left( \frac{1}{6} \right)^3 = 120 \cdot 1.292044 \times 10^{-3} = 0.15504536,
or roughly a 1 in 6.5 chance. Note that the solution to this problem would have been no different
if I had rolled ten different dice all at once and asked for the probability that only 3 came up sixes.
We can generalize this reasoning to the case where the probability of success is p (instead of
1/6), the probability of failure is (1 − p) (instead of 5/6), the number of trials is n (instead of 10),
and the number of successes is r (instead of 3). That is,
P(r, n, p) = \binom{n}{r} p^r (1-p)^{n-r}.    (1.22)
This distribution of probabilities, for r = 0, 1, 2, . . . , n, is called the binomial distribution.
In general, the mean and variance of a discrete distribution³ are given by

\mu_r = \sum_{r=0}^{r_{\max}} r\, P(r), \quad {\rm and} \quad \sigma_r^2 = \sum_{r=0}^{r_{\max}} (r - \mu_r)^2 P(r).    (1.24)

Thus, we find the mean of the binomial distribution to be

\mu = \sum_{r=0}^{n} r \binom{n}{r} p^r (1-p)^{n-r} = np,    (1.25)

and the variance of the binomial distribution to be

\sigma^2 = \sum_{r=0}^{n} (r - \mu)^2 \left[ \binom{n}{r} p^r (1-p)^{n-r} \right] = np(1-p).    (1.26)
For example, a coin is flipped 10 times. What is the probability of getting exactly r heads?
Since p = 1/2 and (1 − p) = 1/2 then we have
P(r, 10, 1/2) = \binom{10}{r} (1/2)^r (1/2)^{10-r} = \binom{10}{r} (1/2)^{10}.

³ For a continuous distribution we have

\mu_x = \int_{-\infty}^{\infty} x\, p(x)\, dx, \quad {\rm and} \quad \sigma_x^2 = \int_{-\infty}^{\infty} (x - \mu_x)^2 p(x)\, dx,    (1.23)

where x is a continuous variable.



Figure 1.9: The binomial distribution. (a) A symmetric case where p = 1/2 and n = 10. (b) An
asymmetric case where p = 1/6 and n = 10.

In this case we get a symmetric distribution (see Fig. 1.9a) with a mean of µ = 5 (i.e., 5 out of 10
times we will get heads), a variance of σ² = µ/2 = 2.5, and σ = √2.5 ≈ 1.58.

Let’s look at another example. A die is rolled 10 times. What is the probability of getting a
six r times? Since p = 1/6 and (1 − p) = 5/6 then we have

P(r, 10, 1/6) = \frac{10!}{r!(10-r)!} (1/6)^r (5/6)^{10-r}.

In this case we get an asymmetric distribution (see Fig. 1.9b) with a mean of µ = 10/6, a variance
of σ² = 50/36, and σ = √(50/36) ≈ 1.18.
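The dice and coin results above are easy to reproduce numerically. The following C sketch evaluates Eq. 1.22 for n = 10 and p = 1/6, and checks the mean and variance of Eqs. 1.25 and 1.26 by direct summation.

#include <stdio.h>
#include <math.h>

/* Binomial probability, Eq. 1.22: P(r,n,p) = C(n,r) p^r (1-p)^(n-r). */
double binom_coeff(int n, int r) {
    double c = 1.0;
    for (int k = 1; k <= r; k++)
        c *= (double)(n - r + k) / k;
    return c;
}

double binom_prob(int r, int n, double p) {
    return binom_coeff(n, r) * pow(p, r) * pow(1.0 - p, n - r);
}

int main(void) {
    int n = 10;
    double p = 1.0 / 6.0, mean = 0.0, var = 0.0;

    /* Reproduce the dice example: P(3 sixes in 10 rolls) ~ 0.155 */
    printf("P(3 sixes in 10 rolls) = %.6f\n", binom_prob(3, n, p));

    /* Check Eqs. 1.25 and 1.26: mean = np, variance = np(1-p) */
    for (int r = 0; r <= n; r++) mean += r * binom_prob(r, n, p);
    for (int r = 0; r <= n; r++) var  += (r - mean) * (r - mean) * binom_prob(r, n, p);
    printf("mean = %.4f (np = %.4f), variance = %.4f (np(1-p) = %.4f)\n",
           mean, n * p, var, n * p * (1.0 - p));
    return 0;
}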

Poisson Distribution

Let’s look at a problem that is more oriented towards physics and chemistry. If the probability
that an isolated excited state atom will emit a single photon in one second is 0.00050, what is the
probability that two photons would be emitted in one second from five identical non-interacting
excited state atoms?

Answer: P(2, 5, 0.0005) = \frac{5!}{2!\,3!} (0.00050)^2 (0.99950)^3 = 2.5 \times 10^{-6}.
While this is a straightforward calculation, we run into difficulties in this type of calculation when
working with a larger number of atoms (i.e. higher n). For example, how would you calculate the



Figure 1.10: The Poisson distribution.

probability that two photons would be emitted from 10,000 identical non-interacting excited state
atoms? My calculator cannot do factorial calculations with numbers greater than 69.
In this situation we can make an approximation for the binomial distribution in the limit that
n → ∞ and p → 0 such that np → a finite number. This approximation is called the Poisson
distribution and is given by
P_{\rm Poisson}(r, n, p) = \frac{(np)^r}{r!} e^{-np}.    (1.27)
This distribution most often describes the parent distribution when you are observing independent
random events that are occurring at a constant rate, such as photon counting experiments. The
mean of the Poisson distribution is
\mu = \sum_{r=0}^{\infty} r \left[ \frac{(np)^r}{r!} e^{-np} \right] = np,    (1.28)

and the variance of the Poisson distribution is


\sigma^2 = \sum_{r=0}^{\infty} (r - np)^2 \left[ \frac{(np)^r}{r!} e^{-np} \right] = np.    (1.29)


These two results are as expected for the binomial distribution in the limit that p → 0. We define
λ = np and rewrite the Poisson distribution as

P_{\rm Poisson}(r, \lambda) = \frac{\lambda^r}{r!} e^{-\lambda}.    (1.30)
The Poisson distribution gives the probability for obtaining r events in a given time interval with
the expected number of events being λ.
Let’s use the Poisson distribution to solve our problem above with 10,000 atoms. In this problem
n = 10, 000 and p = 0.00050 so λ = np = 5.0. Thus,

P(2, 5.0) = \frac{(5.0)^2}{2!} e^{-5.0} = 0.084.
There’s an 8.4% chance that two photons would be emitted from 10,000 identical non-interacting
excited state atoms.
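As a quick numerical check of Eq. 1.30, the following C sketch reproduces the photon-emission example (λ = np = 5) and verifies that the Poisson probabilities sum to one with mean λ; the loop limit of 100 terms is simply an arbitrary cutoff for the example.

#include <stdio.h>
#include <math.h>

/* Poisson probability, Eq. 1.30: P(r, lambda) = lambda^r e^(-lambda) / r! */
double poisson_prob(int r, double lambda) {
    double p = exp(-lambda);
    for (int k = 1; k <= r; k++)
        p *= lambda / k;          /* builds lambda^r / r! without overflow */
    return p;
}

int main(void) {
    /* Photon-emission example: n = 10000 atoms, p = 0.00050, lambda = np = 5 */
    double lambda = 10000 * 0.00050;
    printf("P(2 photons) = %.4f\n", poisson_prob(2, lambda));   /* ~0.084 */

    /* The probabilities over all r sum to 1, and the mean is lambda */
    double sum = 0.0, mean = 0.0;
    for (int r = 0; r < 100; r++) {
        double pr = poisson_prob(r, lambda);
        sum += pr;
        mean += r * pr;
    }
    printf("sum = %.6f, mean = %.4f\n", sum, mean);
    return 0;
}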

Gaussian Distribution In the limit of large n and p ≈ 0 the Poisson distribution serves as a
good approximation for the binomial distribution. Is there an approximation that holds for the
limit of large n when p is not close to zero? Yes, under these conditions we can use the Gaussian
distribution as an approximation for the binomial. That is,
P_{\rm Gaussian}(r, n, p) = \frac{1}{\sqrt{2\pi np(1-p)}} \exp\left\{ -\frac{1}{2} \frac{(r - np)^2}{np(1-p)} \right\}.    (1.31)

The mean of the Gaussian distribution is

\mu = \sum_{r=0}^{n} \frac{r}{\sqrt{2\pi np(1-p)}} \exp\left\{ -\frac{1}{2} \frac{(r - np)^2}{np(1-p)} \right\} = np,    (1.32)

and the variance of the Gaussian distribution is

\sigma^2 = \sum_{r=0}^{n} \frac{(r - np)^2}{\sqrt{2\pi np(1-p)}} \exp\left\{ -\frac{1}{2} \frac{(r - np)^2}{np(1-p)} \right\} = np(1-p).    (1.33)

Making the substitutions µ = np and σ² = np(1 − p) we can rewrite the Gaussian distribution
in the form

P_{\rm Gaussian}(r, \mu; \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{1}{2} \left( \frac{r - \mu}{\sigma} \right)^2 \right\}.    (1.34)
It turns out that the Gaussian distribution describes the distribution of random errors in many
kinds of measurements.


In the same way that we make the transition from quantized to continuous observables in going
from quantum to classical mechanics we replace the integer r with a continuous parameter x and
get

p_{\rm Gaussian}(x, \mu; \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right\}.    (1.35)
An important distinction in this transition is that p Gaussian now represents probability density
not probability. Recall from our earlier discussion that to get the probability that a continuous
measured quantity lies within a given range we need to integrate the probability density between
the limits that define the range. If your parent distribution is known to be Gaussian, and if you
choose your confidence limits to be equidistant from the mean, then the confidence limits for a
given probability can be written in the simple form

x± = µ ± zσ, (1.36)

where z is a factor determined by the percent confidence desired and is given in Table 1.1.
For example, z = 0.67 for 50% confidence; that is, 50% of the total area under the Gaussian curve
lies between ±0.67σ from the mean µ. Conversely, we can say that for a given single measurement,
x, the true mean µ will lie within the interval

µ = x ± zσ, (1.37)


Figure 1.11: Gaussian Distribution.


with a percent confidence determined by z.
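If one wants the confidence associated with an arbitrary z rather than the tabulated values, integrating Eq. 1.35 between µ − zσ and µ + zσ gives erf(z/√2). The C sketch below uses the standard library erf() to reproduce a few entries of Table 1.1.

#include <stdio.h>
#include <math.h>

/* Area under a Gaussian between mu - z*sigma and mu + z*sigma.
   Integrating Eq. 1.35 over that interval gives erf(z / sqrt(2)). */
double confidence(double z) {
    return erf(z / sqrt(2.0));
}

int main(void) {
    /* Reproduce a few entries of Table 1.1 */
    double zvals[] = {0.67, 1.64, 1.96, 2.58};
    for (int i = 0; i < 4; i++)
        printf("z = %.2f  ->  %.1f%% confidence\n", zvals[i], 100.0 * confidence(zvals[i]));
    return 0;
}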

Bi-dimensional Gaussian Distribution So far we’ve only discussed univariate data. Let’s
look at the case of a bivariate Gaussian distribution. Remember we have to take into account
the covariance (or linear correlation rxy ) between the two variables. The functional form of this
distribution is
p_{\rm Gaussian}(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1 - r_{xy}^2}} \exp\left\{ -\frac{1}{2(1 - r_{xy}^2)} \left[ \left( \frac{x - \mu_x}{\sigma_x} \right)^2 + \left( \frac{y - \mu_y}{\sigma_y} \right)^2 - 2 r_{xy} \left( \frac{x - \mu_x}{\sigma_x} \right)\left( \frac{y - \mu_y}{\sigma_y} \right) \right] \right\}.    (1.38)

If you take a cross-section through the distribution to obtain, for example, the distribution in
x for a fixed value of y, then the mean of this cross-section distribution will be

µx (y) = µx + rxy (σx /σy )(y − µy ), (1.39)

and the standard deviation will be


\sigma_x(y) = \sigma_x \sqrt{1 - r_{xy}^2}.    (1.40)


Figure 1.12: 50% of the area under a Gaussian distribution lies between the limits ±0.67 from the
mean. These limits define the 50% confidence limits for the true value.


% confidence z
50 0.67
60 0.84
75 1.15
90 1.64
95 1.96
97.5 2.24
99.0 2.58
99.5 2.81
99.95 3.50

Table 1.1: Percent Confidence (i.e., Area under Gaussian distribution) as a function of z values.

Multi-dimensional Gaussian Distribution In the more general case of multi-variate data we


have

p_{\rm Gaussian}(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}\sqrt{|\mathbf{V}|}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}) \cdot \mathbf{V}^{-1} \cdot (\mathbf{x} - \boldsymbol{\mu}) \right\}.    (1.41)

Here we use the symbol x to represent not one variable but m variables. For example, instead of the
variables (x, y, z) we use a 3-dimensional vector x or (x1, x2, x3). In general x is an m-dimensional
vector whose elements (x1, x2, x3, . . . , xm) represent the m variables in our multi-variate data.
Likewise µ is an m-dimensional vector whose elements (µx1, µx2, µx3, . . . , µxm) represent the means
of the m variables in our multi-variate data.
To represent the variance, however, we use an m × m symmetric matrix (i.e., V_{i,j} = V_{j,i})
so that we can include all the covariances between variables. For example, in the bi-variate
case we have

\mathbf{V} = \begin{pmatrix} \sigma_x^2 & \sigma_{xy}^2 \\ \sigma_{xy}^2 & \sigma_y^2 \end{pmatrix}.    (1.42)

Or in the three variable case we have

\mathbf{V} = \begin{pmatrix} \sigma_1^2 & \sigma_{12}^2 & \sigma_{13}^2 \\ \sigma_{12}^2 & \sigma_2^2 & \sigma_{23}^2 \\ \sigma_{13}^2 & \sigma_{23}^2 & \sigma_3^2 \end{pmatrix}.    (1.43)



Figure 1.13: Bivariate Gaussian Distribution.

V−1 , the inverse of the covariance matrix, is called the curvature matrix, α,

V−1 = α. (1.44)

We will learn more about the curvature matrix later when we discuss modeling of data.

Central Limit Theorem

Our earlier assumption that the random errors in our measurements are governed by a parent dis-
tribution seems to be reasonable in terms of the binomial theorem and the probabilities associated
with quantum measurements. Wouldn’t it be great if our measurement uncertainties were domi-
nated by just the uncertainty principle of quantum mechanics! But alas, we also have the workman
outside our building running a jackhammer that sends random vibrations into our instrument,
and/or the carelessness of Homer Simpson at the local power plant sending random voltage fluctu-
ations into our instrument, and/or a number of other seemingly random perturbations leading to
random errors in our measurements. Not being able to eliminate them we are still faced with the
task of determining the parent distribution governing our errors.
One helpful simplification is the Central Limit Theorem. This theorem says that if you
have different random error sources, each with their own parent distribution, then in the limit
that you have an infinite number of random error sources the final parent distribution for your
random errors will be Gaussian, even if none of the individual random error sources have Gaussian


parent distributions. An important condition is that the component errors have the same order
of magnitude and that no single source of error dominates all the others. Since you really can’t
know if you have a large number of random error sources you should be cautious about assuming
Gaussian errors. Nonetheless, the Gaussian distribution does seem to describe the errors in most
experiments, other than situations where the Poisson distribution should apply.
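A simple way to see the Central Limit Theorem at work is to simulate it. In the C sketch below each "measurement" is the sum of many uniformly distributed error sources (none of them Gaussian individually); the numbers of sources and measurements are arbitrary choices for the demonstration.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Illustration of the Central Limit Theorem: each "measurement" is the sum
   of many uniform (decidedly non-Gaussian) error sources, yet the skewness
   and excess kurtosis of the resulting distribution come out close to zero. */
int main(void) {
    const int nsources = 50;      /* error sources per measurement */
    const int nmeas = 100000;     /* number of simulated measurements */
    double m = 0.0, s2 = 0.0, skew = 0.0, kurt = 0.0;
    static double x[100000];

    srand(1);
    for (int i = 0; i < nmeas; i++) {
        double sum = 0.0;
        for (int j = 0; j < nsources; j++)
            sum += (double)rand() / RAND_MAX - 0.5;   /* uniform on [-0.5, 0.5] */
        x[i] = sum;
    }
    for (int i = 0; i < nmeas; i++) m += x[i];
    m /= nmeas;
    for (int i = 0; i < nmeas; i++) s2 += (x[i] - m) * (x[i] - m);
    s2 /= nmeas;
    for (int i = 0; i < nmeas; i++) {
        double z = (x[i] - m) / sqrt(s2);
        skew += z * z * z;
        kurt += z * z * z * z;
    }
    printf("skewness = %.3f, kurtosis = %.3f (both ~0 for a Gaussian)\n",
           skew / nmeas, kurt / nmeas - 3.0);
    return 0;
}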

1.2 The χ2 test of a distribution

If we suspect that a given set of observations comes from a particular parent distribution, how can
we test them for agreement with this distribution? For example, if I gave you two dice and asked you
to tell me if the dice are loaded⁴, what would you do?
You could roll the dice a large number of times and make a histogram of the results. You don’t
expect that you will roll 12 (i.e., both dice come up as 6) exactly 1/36 times the number of rolls,
but it should be close. In contrast, if the dice came up 12 in over half of the total rolls then
you might suspect that the dice are not obeying the statistics of the binomial distribution. How
much disagreement between the parent (binomial) distribution and our sample distribution can we
reasonably expect? To answer this question, we need a quantitative index of the difference between
the two distributions and a way to interpret this index.
Let’s consider a variable x that we’ve measured N times. We can construct a histogram of
measured x values using a bin width of dx. If the errors in x are governed by a parent distribution
p(x) then the expected number of x observations in the range dx is given by

h(xi ) = N p(xi )dx.

How do we find the uncertainty in the bin heights? Pay close attention, this part is tricky.
For each bin height h(xi ) there’s a probability distribution governing the distribution of measured
heights. If we construct many histograms from different groups of measurements, then we would
expect some distribution of bin heights around the expected bin heights. Recall that counting
events are typically governed by a Poisson distribution. So while the spread in x can be governed
by a particular parent distribution, the spread in the frequency of occurrence of a particular x value
is governed by the Poisson distribution. Recall that for a Poisson distribution the variance is equal
to the mean of the distribution. Thus we estimate the standard deviation of the bin heights as

\sigma_i(h(x_i)) = \sqrt{h(x_i)}.    (1.45)

⁴ One side of the die is weighted so that the probabilities for any given side coming up are not equal.


Now back to our original problem. The experimental histogram can be tested against a parent
distribution using the χ2 test:

\chi^2 = \sum_i^{N_{\rm bin}} \frac{[h(x_i) - N p(x_i)\, dx]^2}{h(x_i)}.    (1.46)

The χ2 test characterizes the dispersion of the observed frequencies from the expected frequencies.
For good agreement between two distributions you might expect that each bin height would differ
from the expected heights by one standard deviation. So we might expect χ2 to be equal to Nbin
for good agreement. More specifically, the expectation value for χ² is

\langle \chi^2 \rangle = \nu = N_{\rm bin} - n_c,    (1.47)

where ν is the degrees of freedom, Nbin is the number of sample frequencies, and nc is the number
of constraints or parameters calculated from the data to describe the potential parent distribution
p(x). Simply normalizing the parent distribution to the total number of events N is one constraint.


Figure 1.14: The uncertainty in the number of occurrences within a given bin is governed by the
Poisson distribution.


If you obtained the mean and variance from your data and used them to describe your parent
distribution then you have used another two constraints.
Often you will see the reduced χ², or

\chi_\nu^2 = \chi^2/\nu,    (1.48)

which has an expectation value of ⟨χ²_ν⟩ = 1. Large values of χ²_ν (i.e., ≫ 1) indicate poor agreement.
Very small values (i.e., ≪ 1) are also unreasonable, and imply something is wrong somewhere.
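A minimal C sketch of the χ² test of Eq. 1.46 is given below; the observed and expected bin counts are invented for illustration, and the only constraint assumed is the normalization to N.

#include <stdio.h>

/* Chi-squared comparison of an observed histogram with expected bin counts,
   Eq. 1.46.  The bin contents below are made up for the example; in practice
   h[i] are your observed counts and e[i] = N p(x_i) dx are the expected ones. */
int main(void) {
    double h[] = {  8, 21, 46, 59, 48, 23,  9 };   /* observed counts  */
    double e[] = { 10, 24, 44, 56, 44, 24, 12 };   /* expected counts  */
    int nbin = sizeof h / sizeof h[0];
    int nc = 1;                     /* one constraint: normalization to N */

    double chi2 = 0.0;
    for (int i = 0; i < nbin; i++) {
        double d = h[i] - e[i];
        chi2 += d * d / h[i];       /* sigma_i^2 ~ h(x_i), Eq. 1.45 */
    }
    int nu = nbin - nc;
    printf("chi2 = %.2f for nu = %d degrees of freedom (chi2_nu = %.2f)\n",
           chi2, nu, chi2 / nu);
    return 0;
}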
We can also use the χ2 test to compare two experimental distributions to decide whether or not
they were drawn from the same parent distribution. For example, to compare h(xi ) and g(xi ) you
would calculate
\chi^2 = \sum_i^{N_{\rm bin}} \frac{[g(x_i) - h(x_i)]^2}{g(x_i) + h(x_i)}.    (1.49)

If a given bin contains no occurrences for both histograms, then the sum over that bin is skipped
and the number of degrees of freedom (i.e., Nbin) is reduced by one. If the two experimental
distributions are constructed from different numbers of measurements then you need to scale the
individual distributions. In this case you would calculate

\chi^2 = \sum_i^{N_{\rm bin}} \frac{[\sqrt{H/G}\, g(x_i) - \sqrt{G/H}\, h(x_i)]^2}{g(x_i) + h(x_i)},    (1.50)

where H = \sum_i h(x_i) and G = \sum_i g(x_i). Having to scale the histograms for different numbers of
measurements is another constraint that will reduce the degrees of freedom by one.
Interpreting the χ² test result can be simpler if the χ² probability function is used,

p(\chi^2, \nu) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)} (\chi^2)^{(\nu-2)/2} e^{-\chi^2/2},    (1.51)

where Γ(x) is the Gamma function and is given by

\Gamma(x) = \int_0^{\infty} e^{-u} u^{x-1}\, du \quad {\rm for}\ x > 0.    (1.52)

p(χ², ν) is the distribution of χ² values as a function of the number of degrees of freedom, ν, and
has a mean of ν and a variance of 2ν. Q(χ²|ν) is the probability that another distribution would
give a higher (i.e., worse) χ² than your value χ²₀. It is calculated as follows:

Q(\chi_0^2 | \nu) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)} \int_{\chi_0^2}^{\infty} (\chi^2)^{(\nu-2)/2} e^{-\chi^2/2}\, d(\chi^2).    (1.53)

See table C.4 in Bevington for this function tabulated. If Q(χ2 |ν) is reasonably close to 1, then the
assumed distribution describes the spread of the data points well. If Q(χ2 |ν) is small the assumed



Figure 1.15: The χ2 distribution for ν = 1,2,4,6,8, and 10.

parent distribution is not a good estimate of the parent distribution. For example, for a sample
with 10 degrees of freedom (ν = 10) and χ2 = 2.56 the probability is 99% that another distribution
would give a higher (i.e., worse) χ2 (i.e., the distributions agree well). Another example, if (ν = 10)
and χ2 = 29.59 the probability is 0.1% that another distribution would give a higher (i.e., worse)
χ2 (i.e., the distributions don’t agree well).

1.3 Systematic Errors


Systematic errors lead to bias in a measurement technique. Bias is the difference between your
measured mean and the true value
bias = µB − xt (1.54)

Systematic errors are reproducible, and usually can be attributed to a single source. The sources
of error can be divided into three subclasses:

Instrument Errors Arise from imperfections in the measuring device. For example, uncalibrated
glassware, bad power supply, faulty Pentium chip, ... If these errors are found they can be corrected
by calibration. Periodic calibration of equipment is always desirable because the response of most



Figure 1.16: Method A has no bias so µA = xtrue .

instruments changes with time as a result of wear, corrosion, or mistreatment.

Personal Errors Arise from carelessness, inattention, other limitations. For example, missed
end points, poor record keeping, ... These errors can be minimized by care and self-discipline. It is
always a good habit to often check instrument readings, notebook entries, and calculations.

Method Errors Arise from non-ideal chemical or physical behavior of system under study. For
example, incomplete reactions, unknown decomposition, system nonlinearities, ... These errors are
the most difficult to detect. Some steps to take are analyses of standard samples, independent
analyses (i.e., another technique that confirms your result), and blank determinations, which can
reveal contaminants in the instrument.

1.4 Gross Errors

As we mentioned earlier these errors are not a third class of errors. They are either systematic or
random errors. It can be dangerous to reject gross errors (or outliers). There is no universal rule.
A common approach is the Q test. For more details see Rorabacher, Anal. Chem. 63, 139(1991).



Figure 1.17: Questionable measurements can be rejected if Q_exp = d/w is greater than the value of
Q_crit given in Table 1.2.

If Q_exp is greater than Q_crit then x_q can be rejected.

Q_{\rm exp} = \frac{d}{w} = \frac{|x_q - x_n|}{|x_1 - x_q|},    (1.55)

where d is the difference between the questionable result and its nearest neighbor and w is the
spread of the entire data set. That is, x_q is the questionable result, x_n is its nearest neighbor, and
x_1 is the result furthest from x_q. Ideally, the only valid reason for rejecting data from a small set
of data is knowledge that a mistake was made in the measurement.
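A small C sketch of the Q test of Eq. 1.55 follows; the data set is invented, and the critical value is taken from Table 1.2 for five observations at 95% confidence.

#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b) {
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

/* Q test, Eq. 1.55: Q_exp = |x_q - x_n| / |x_1 - x_q|, where x_q is the
   suspect value, x_n its nearest neighbor, and x_1 the value furthest from it. */
int main(void) {
    double x[] = {0.084, 0.089, 0.079, 0.081, 0.121};   /* made-up results */
    int n = sizeof x / sizeof x[0];
    qsort(x, n, sizeof x[0], cmp);

    double qhigh = (x[n - 1] - x[n - 2]) / (x[n - 1] - x[0]); /* test largest  */
    double qlow  = (x[1] - x[0]) / (x[n - 1] - x[0]);         /* test smallest */
    double qcrit = 0.710;          /* Table 1.2: n = 5, 95% confidence */

    printf("Q_exp (largest)  = %.3f %s\n", qhigh,
           qhigh > qcrit ? "-> reject at 95%" : "-> keep");
    printf("Q_exp (smallest) = %.3f %s\n", qlow,
           qlow > qcrit ? "-> reject at 95%" : "-> keep");
    return 0;
}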

1.5 Propagation of Errors


If we calculate a result using experimentally measured variables how do we relate the uncertainty
in our experimental variable to the uncertainty in our calculated variable? Recall that the area
under the error distribution curve between the confidence limits is proportional to how confident
we are (i.e., the uncertainty) in our measurements. In general, propagating errors requires us to

Number of observations    Qcrit (90% confidence)    Qcrit (95% confidence)    Qcrit (99% confidence)
3 0.941 0.970 0.994
4 0.765 0.829 0.926
5 0.642 0.710 0.821
6 0.560 0.625 0.740
7 0.507 0.568 0.680
8 0.468 0.526 0.634
9 0.437 0.493 0.598
10 0.412 0.466 0.568

Table 1.2: Qcrit values for rejecting data from a small set of data.


Figure 1.18: A Gaussian error distribution in the measured value x mapped through the linear
function y(x) = a + bx into an error distribution for the calculated value y. The confidence interval
in the calculated quantity corresponds to the same area under the distribution.
Figure 1.19: A Gaussian error distribution in the measured value x mapped through the nonlinear
function y(x) = (x − µ)² + b into an error distribution for the calculated value y. The confidence
interval in the calculated quantity corresponds to the same area under the distribution.


map our experimental error distribution curve through our calculation into an error distribution
curve for our calculated variable, and then find the interval in our calculated variable that gives us
the area that corresponds to our desired confidence. In Fig. 1.18 is an example where a Gaussian
experimental error distribution curve is mapped through a linear function to obtain a calculated
error distribution. In this example, the shape of the calculated error distribution is also Gaussian.
In Fig. 1.19 is another example where a Gaussian experimental error distribution curve is
mapped through a nonlinear function to obtain a calculated error distribution that is highly asym-
metric, and cannot be characterized by just a mean and variance. Clearly, it is important to realize
that the shape of the error distribution curve for the calculated variable depends on the functional
relationship between the experimentally measured variables and the calculated variables. This last
example would be one of the worst cases for error propagation. With this caveat, let’s look at error
propagation and assume that the error distributions are Gaussian and map through the calculations
without too significant distortions in shape.
Previously we defined the uncertainty in a measured value in terms of its standard deviation.
We can do the same here,

\sigma_y^2(u, v, \ldots) = \lim_{N \to \infty} \left[ \frac{1}{N} \sum_i \{ y(u_i, v_i, \ldots) - y(\bar{u}, \bar{v}, \ldots) \}^2 \right].    (1.56)

Here u_i, v_i, . . . are the experimental measurements. For example, we measured u_i = 2.25, 2.05, 1.85, 1.79,
and 2.12, which have a mean of ū = 2.01 and a variance of s²_u = 0.03632. What is s²_y(u) if
y(u) = 4u + 3? If we wanted to use Eq. 1.56 directly we would first calculate y(u_i).

ui yi
2.25 12.00
2.05 11.20
1.85 10.40
1.79 10.16
2.12 11.48

Then we calculate the average ȳ = 11.048 and the variance s²_y = 0.581.
Alternatively, we could do a Taylor series expansion of y(u, v, . . .) around y(ū, v̄, . . .) and use it
in our expression for σ²_y:

y(u_i, v_i, \ldots) = y(\bar{u}, \bar{v}, \ldots) + \frac{dy(u, v, \ldots)}{du}(u_i - \bar{u}) + \frac{dy(u, v, \ldots)}{dv}(v_i - \bar{v}) + \cdots + \text{higher-order terms}.

Assuming dy/du and dy/dv are not near zero, as they were in our parabola example earlier, then
we can neglect the higher-order terms and write

\sigma_y^2(u, v, \ldots) = \lim_{N \to \infty} \frac{1}{N} \sum_i \left\{ \frac{dy}{du}(u_i - \bar{u}) + \frac{dy}{dv}(v_i - \bar{v}) + \cdots \right\}^2.

Expanding the squared term in brackets we obtain

\sigma_y^2(u, v, \ldots) = \lim_{N \to \infty} \frac{1}{N} \sum_i \left[ (u_i - \bar{u})^2 \left( \frac{dy}{du} \right)^2 + (v_i - \bar{v})^2 \left( \frac{dy}{dv} \right)^2 + \cdots + 2(u_i - \bar{u})(v_i - \bar{v}) \left( \frac{dy}{du} \right)\left( \frac{dy}{dv} \right) + \cdots \right].

Breaking this into individual sums we have

\sigma_y^2(u, v, \ldots) = \left[ \lim_{N \to \infty} \frac{1}{N} \sum_i (u_i - \bar{u})^2 \right] \left( \frac{dy}{du} \right)^2 + \left[ \lim_{N \to \infty} \frac{1}{N} \sum_i (v_i - \bar{v})^2 \right] \left( \frac{dy}{dv} \right)^2 + \cdots + \left[ \lim_{N \to \infty} \frac{1}{N} \sum_i 2(u_i - \bar{u})(v_i - \bar{v}) \right] \left( \frac{dy}{du} \right)\left( \frac{dy}{dv} \right) + \cdots.

Finally we can write

\sigma_y^2(u, v, \ldots) = \sigma_u^2 \left( \frac{dy}{du} \right)^2 + \sigma_v^2 \left( \frac{dy}{dv} \right)^2 + \cdots + 2\sigma_{uv}^2 \left( \frac{dy}{du} \right)\left( \frac{dy}{dv} \right) + \cdots.    (1.57)

This is the error propagation equation. If u_i, v_i, . . . are all uncorrelated then σ²_uv = 0, and the
error propagation equation simplifies to

\sigma_y^2(u, v, \ldots) = \sigma_u^2 \left( \frac{dy}{du} \right)^2 + \sigma_v^2 \left( \frac{dy}{dv} \right)^2 + \cdots.    (1.58)
Thus, in our earlier example where y = 4u + 3, we can use the error propagation equation to
relate the variance in u to the variance in y. That is,

s_y^2 = 16 \cdot s_u^2 = 0.581.
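The agreement claimed above is easy to verify numerically. The C sketch below recomputes s²_y for y = 4u + 3 both directly from the transformed values and from Eq. 1.58, using the same five u_i values as in the text.

#include <stdio.h>

/* Error propagation for y = 4u + 3: compare the variance of the transformed
   values directly with s_y^2 = (dy/du)^2 s_u^2 = 16 s_u^2 (Eq. 1.58). */
int main(void) {
    double u[] = {2.25, 2.05, 1.85, 1.79, 2.12};
    int n = sizeof u / sizeof u[0];
    double ubar = 0.0, su2 = 0.0, ybar = 0.0, sy2 = 0.0;

    for (int i = 0; i < n; i++) ubar += u[i];
    ubar /= n;
    for (int i = 0; i < n; i++) su2 += (u[i] - ubar) * (u[i] - ubar);
    su2 /= (n - 1);

    for (int i = 0; i < n; i++) ybar += 4.0 * u[i] + 3.0;
    ybar /= n;
    for (int i = 0; i < n; i++) {
        double yi = 4.0 * u[i] + 3.0;
        sy2 += (yi - ybar) * (yi - ybar);
    }
    sy2 /= (n - 1);

    printf("s_u^2 = %.5f, direct s_y^2 = %.5f, propagated 16*s_u^2 = %.5f\n",
           su2, sy2, 16.0 * su2);
    return 0;
}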

Now let’s look at some specific functions y(u, v, . . .) and see how the error propagation equation
is applied.

Simple Sums and Differences:

y = u ± a, where a is a constant.

\frac{dy}{du} = 1, \qquad \sigma_y^2 = \sigma_u^2


Weighted Sums and Differences:

y = au ± bv, where a and b are constants.

\frac{dy}{du} = a \quad {\rm and} \quad \frac{dy}{dv} = \pm b, \qquad \sigma_y^2 = a^2\sigma_u^2 + b^2\sigma_v^2 \pm 2ab\,\sigma_{uv}^2

Multiplication and Division:

y = ±auw

\frac{dy}{du} = \pm aw \quad {\rm and} \quad \frac{dy}{dw} = \pm au, \qquad \sigma_y^2 = a^2w^2\sigma_u^2 + a^2u^2\sigma_w^2 + 2a^2uw\,\sigma_{uw}^2

If we divide both sides by y² we get something that might be more familiar, that is

\left( \frac{\sigma_y}{y} \right)^2 = \left( \frac{\sigma_u}{u} \right)^2 + \left( \frac{\sigma_w}{w} \right)^2 + 2\,\frac{\sigma_{uw}^2}{uw}.

σy /y is called the relative error in y, while σy is the absolute error in y.

Powers:

y = au^{\pm b}

\frac{dy}{du} = \pm abu^{\pm b - 1} = \pm \frac{by}{u}, \qquad \sigma_y^2 = \frac{\sigma_u^2}{u^2}\, b^2 y^2 \quad {\rm or} \quad \left( \frac{\sigma_y}{y} \right)^2 = b^2\, \frac{\sigma_u^2}{u^2}

Exponentials:

y = ae^{\pm bu}

\frac{dy}{du} = \pm abe^{\pm bu} = \pm by, \qquad \sigma_y^2 = \sigma_u^2\, b^2 y^2 \quad {\rm or} \quad \left( \frac{\sigma_y}{y} \right)^2 = b^2 \sigma_u^2


Logarithms:

y = a \ln(\pm bu)

\frac{dy}{du} = \frac{a}{\pm bu}(\pm b) = \frac{a}{u}, \qquad \sigma_y^2 = \sigma_u^2\, \frac{a^2}{u^2}

1.6 Confidence Limits


As we learned earlier, once you know the parent distribution governing the errors in your measurements
you can calculate the probability that a given measurement will lie within a certain interval
of possible values. That is,

P(x_-, x_+) = \int_{x_-}^{x_+} p(x)\, dx,    (1.59)

where P (x− , x+ ) is the probability that the measured value x lies between x− and x+ . The interval
between x− and x+ is called the confidence interval and x− and x+ are called confidence
limits.
Based on the Central Limit theorem we can often assume that our parent distribution is Gaus-
sian. In certain situations you may also know the variance of this Gaussian distribution but not the
mean. For example, most analytical balance manufacturers will print on the front of the balance
the standard deviation (σ) for any measurement made with the balance. Assuming the distribution
of errors is Gaussian, then given a single measurement, x, we can estimate the true mean µ
to lie within the interval

\mu = x \pm z\sigma,    (1.60)

with a percent confidence determined by z, as given in Table 1.1.


If we performed the measurement twice we can use the average of the two measurements as a
better estimate for µ, but how do we determine the confidence limits in this case? We use the error
propagation equation. Given
\bar{x} = \frac{x_1 + x_2}{2},    (1.61)

we can calculate the variance in x̄ given the variances in x₁ and x₂,

\sigma_{\bar{x}}^2 = \left( \frac{1}{2} \right)^2 \sigma_1^2 + \left( \frac{1}{2} \right)^2 \sigma_2^2.    (1.62)

Assuming σ₁² = σ₂² = σ² we obtain

\sigma_{\bar{x}}^2 = \sigma^2/2 \quad {\rm or} \quad \sigma_{\bar{x}} = \sigma/\sqrt{2}.    (1.63)



Figure 1.20: Student’s t distribution for ν = 1,2,3,4,6, and 10. The unit Gaussian is also shown.

In general, after N measurements the standard deviation of the mean is given by



σx = σ/ N . (1.64)

Therefore the confidence limits of the mean after N measurements is given by



µ = x ± zσ/ N . (1.65)

Note that this applies only if there are no systematic errors (or bias).

1.6.1 Student’s t-distribution


A problem with these equations is that they assume we already know σ, which requires an infinite
number of measurements. For a finite number of measurements we only have the experimental
variance s2 . In the case where our distribution was Gaussian with variance σ 2 , then we knew that
the distribution in z = (µ − x)/σ had a zero mean and unit variance. If we know our distribution
is Gaussian but only know s2 , then we need to know the distribution in t = (µ − x)/s. This
distribution was derived by William Gossett and is called Student’s t distribution. It is
Γ((ν + 1)/2) 1
p(t, ν) = √ · (1.66)
νπ · Γ(ν/2) 1 + (t /ν))(ν+1)/2
2

P. J. Grandinetti, January 20, 2000


1.6. CONFIDENCE LIMITS 39

Degrees of Freedom
ν =N −1 t(95%) t(99%)
1 12.7 63.7
2 4.30 9.92
3 3.18 5.84
4 2.78 4.60
5 2.57 4.03
6 2.45 3.71
7 2.36 3.50
10 2.23 3.17
30 2.04 2.75
60 2.00 2.66
∞ 1.96 2.58

Table 1.3: Student’s t values as a function of ν for 95 and 99 Percent Confidence (i.e., 95 and 99
Percent of the Area under Student’s t distribution).

In contrast to the z-distribution this expression depends on t and ν, the number of degrees of
freedom. As expected, in the limit that N → ∞ then t → z. In table 1.3 are the values for t as a
function of the degrees of freedom (i.e., ν = N − 1) for the 95% and 99% confidence limits. Using
Student’s t distribution we can use the average of N measurements and estimate the true mean µ
to lie within the interval

µ = x ± ts/ N , (1.67)
with a percent confidence determined by t and ν, as given in Table 1.3. From this table you will
notice that for a given confidence limit the t inflates the confidence limits to take into account a
finite number of measurements.
Let’s look at an example. A chemist obtained the following data for the alcohol content of a
sample of blood: 0.084%, 0.089%, and 0.079%. How would he calculate the 95% confidence limits
for the mean assuming there is no additional knowledge about the precision of the method? First
he would calculate the mean

x = (0.084% + 0.089% + 0.079%)/3 = 0.084%,

and then the standard deviation


s
(0.084% − 0.084%)2 + (0.089% − 0.084%)2 + (0.079% − 0.084%)2
s= = 0.0050%.
2

P. J. Grandinetti, January 20, 2000


40 CHAPTER 1. STATISTICAL DESCRIPTION OF EXPERIMENTAL DATA

Then for 2 degrees of freedom he uses t = 4.30 to calculate the 95% confidence limits for the mean
µ as
µ = 0.084% ± 0.012%.

How would the calculation change if on the basis of previous experience he knew that s → σ =
0.005% (i.e., he know the parent distribution variance)? In this case he would use z = 1.96 instead
of t = 4.30, and would calculate
µ = 0.084% ± 0.006%

for the 95% confidence limits.

1.7 The Two Sample Problem


1.7.1 The t-test
[Link] Comparing a measured mean to a true value

The best way to detect a systematic error in an experimental method of analysis is to analyze
a standard reference material and compare the results with the known true value. If there is a
difference between your measured mean and the true value how would you know if it’s due to
random error or to some reproducible systematic error? Using the expression

µ = x ± ts/ N (1.68)

we can rewrite it as

µ−x= ±ts/ N . (1.69)
| {z }
Critical Value

If |µ − x| is less than or equal to ts/ N then the difference is statistically insignificant at the
percent confidence determined by t. Otherwise the difference will be statistically significant.

A procedure for determination of sulfur in kerosenes was tested on a sample known from
preparation to contain 0.123% S. The results were %S= 0.112, 0.118, 0.115, and 0.119.
Is there a systematic error (or bias) in the method?
The first step is the calculate the mean.
0.112 + 0.118 + 0.115 + 0.119
x= = 0.116%S
4
Next we calculate the difference between our mean and the true value

x − µ = 0.116 − 0.123 = −0.007%S.

P. J. Grandinetti, January 20, 2000


1.7. THE TWO SAMPLE PROBLEM 41

The experimental standard deviation is


s
(0.112 − 0.116)2 + (0.118 − 0.116)2 + (0.115 − 0.116)2 + (0.119 − 0.116)2
s= = 0.0032.
3
At 95% confidence t has a value of 3.18 for N = 4. The critical value is
√ √
ts/ N = (3.18)(0.0032%)/ 4 = 0.0051%S.

Since the difference between our mean and the true value (0.007% S) is greater than
the critical value (0.0051% S) we conclude that a systematic error is present with 95%
confidence.

[Link] Comparing two measured means

Often chemical analyses are used to determine whether two materials are identical. Then the
question arises . . . How can you know if the difference in the mean of two sets of (allegedy identical)
analyses is either real and constitutes evidence that the samples are different or simply due to
random errors in the two data sets.
To solve this problem we use the same approach as in the last section. The difference between
the two experimental means can be compared to a critical value to determine if the difference is
statistically significant. s
N1 + N2
x1 − x2 = ±tspooled , (1.70)
N1 N2
| {z }
Critical Value
where s
s21 (n1 − 1) + s22 (n2 − 1)
spooled = . (1.71)
n 1 + n2 − 2
As before if |x1 − x2 | is less than or equal to the critical value then the difference is probably
due to random error, i.e., the sample are probably the same. If, however, |x1 − x2 | is greater than
the critical value then the difference is probably real, i.e., the samples are not identical.

Two wine bottles were analyzed for alcohol content to determine whether they were
from difference sources. Six analyses of the first bottle gave a mean alcohol content
of 12.61% and s = 0.0721%. Four analyses of the second bottle gave a mean alcohol
content of 12.53% and s = 0.0782%.
First we calculate spooled .
s
5 · (0.0721)2 + 3 · (0.0782)2
spooled = = 0.0744%.
8

P. J. Grandinetti, January 20, 2000


42 CHAPTER 1. STATISTICAL DESCRIPTION OF EXPERIMENTAL DATA

At 95% confidence t = 2.31 (for 8 degrees of freedom, i.e., (6 − 1) + (4 − 1) = 8). Then


s r
N1 + N2 6+4
tspooled = (2.31)(0.0744%) = 0.111%.
N1 N2 6·4

Since x1 − x2 = 12.61 − 12.53 = 0.08% we conclude that at the 95% confidence level
there is no difference in the alcohol content of the two wine bottles.

1.7.2 Comparing Variances - The F -test


Sometimes you need to compare two measured variances. For example, is one method of analysis
is more precise than another? Or, are the differences between the variances for the two methods of
analysis statistically significant?
To answer these types of questions we use the F -test. F is the ratio of the variance of sampling
A with νA = NA − 1 degrees of freedom and sampling B with νB = NB − 1 degrees of freedom,
that is
s2
Fexp = 2A . (1.72)
sB
In the limit of νA → ∞ and νB → ∞ then
2
σA
F = 2 . (1.73)
σB

0.8
(νΑ = 10,νΒ = 12)
0.7
0.6
(νΑ = 3,νΒ = 5)
0.5
0.4
p ( F)
0.3

0.2
0.1
0.0
0 1 2 3 4 5 6 7 8 9 10
F

Figure 1.21: F distribution for νA = 3 and νB = 5.

P. J. Grandinetti, January 20, 2000


1.7. THE TWO SAMPLE PROBLEM 43

An F value of 1 indicates that the two variances are the same, whereas F values differing from one
indicates that the two variances are different. When νA and νB are finite we will, of course, have
random errors such that even if the variances are identical we would often expect to obtain Fexp
values different from 1. How likely is it that Fexp differs from one when σA = σB ? That depends
on the shape of the distribution of F values. The parent distribution for F values is given by
Γ((νA + νB )/2) q νA νB F (νA /2)−1
p(F, νA , νB ) = νA · νB . (1.74)
Γ(νA /2)Γ(νB /2) (νB + νA F )(νA +νB )/2)
A plot of this distribution is given in Fig. 1.21. As you can see, for νA = 10 and νB = 12 an F value
of 2 is not that unlikely. The basic approach, then, is to integrate the F distribution from infinity
down to some critical F value to define the range of F values that cover, say the top 5% worst
F -values. It is common to use tables that will give you the critical F value (limit) that corresponds
to a given probability. In in Tables 1.5 and 1.4 are tables of critical F values that correspond to
2.5% and 5% significance.
Sometimes you will see two types of F tables called the one- and two-tailed F distribution
tables. If you are asking if the variance in A is greater than the variance in B? or if the variance
in B is greater than the variance in A? then you will use the one-tailed F distribution table. On
the other hand, if you are asking if the variance in A is not equal to the variance in B, then you
would use the two-tailed distribution table. In most reference texts, however, you will only find
the one-tailed F distribution table because the values for the two-tailed F distribution table can be
obtained from the one-tailed table. That is,

Pone-tailed = Ptwo-tailed /2. (1.75)

That is, the one tailed F table for 5% probability is the same as the two tailed F table for 10%
probability.
Let’s look at an example where we ask if the variance in A is not equal to the variance in B.

Two chemists determined the purity of a crude sample of ethyl acetate. Chemist A made
6 measurements with a variance of s2A = 0.00428 and Chemist B made 4 measurements
with a variance of s2B = 0.00983. Can we be 95% confident that these two chemists
differ in their measurement precision?
In this case we are not asking if A is better than B, but only if A and B are different.
Therefore we use the two-tailed F distribution table. When calculating the experimental
ratio Fexp , the convention is to define s2A to be the large of the two variances s2A and
s2B . With this definition Fexp will always be greater than 1. So we have

Fexp = 0.00983/0.00428 = 2.30.

P. J. Grandinetti, January 20, 2000


44 CHAPTER 1. STATISTICAL DESCRIPTION OF EXPERIMENTAL DATA

From the two-tailed table with νA = 3, νB = 5, and P = 0.05 we find Fcrit = 7.764.
Since 2.30 < 7.764 we conclude that the F test shows no difference at the 95% confidence
level.

Let’s look at another example.

A water sample is analyzed for silicon content by two methods, one is supposed to
have improved precision. The original method gives xA = 143.6 and s2A = 66.8 after 5
measurments. The newer method gives xB = 149.0 and s2B = 8.50 after 5 measurments.
Is the new method an improvement over the original at 95% confidence (i.e., less than
5% chance that they’re the same.).
First we calculate the F value from the measurements.

Fexp = 66.8/8.5 = 7.86.

Then we use the one-tailed table and find Fcrit = 6.388 for νA = 5 − 1 = 4 and
νB = 5 − 1 = 4. Since 7.86 > 6.388 we conclude that we are 95% confident that the
new method is better than the original.

ν1 =1 2 3 4 5 6 7 8 9 10 12 15 20 ∞
ν2
1 161.4 199.5 215.7 224.6 230.2 234.0 236.8 238.9 240.5 241.9 243.9 245.9 248.0 254.3
2 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.38 19.40 19.41 19.43 19.45 19.50
3 10.13 9.552 9.277 9.117 9.013 8.941 8.887 8.845 8.812 8.786 8.745 8.703 8.660 8.53
4 7.709 6.944 6.591 6.388 6.256 6.163 6.094 6.041 5.999 5.964 5.912 5.858 5.803 5.63
5 6.608 5.786 5.409 5.192 5.050 4.950 4.876 4.818 4.772 4.735 4.678 4.619 4.558 4.36
6 5.987 5.143 4.757 4.535 4.387 4.284 4.207 4.147 4.099 4.060 4.000 3.938 3.874 3.67
7 5.591 4.737 4.347 4.120 3.972 3.866 3.787 3.726 3.677 3.637 3.575 3.511 3.445 3.23
8 5.318 4.459 4.066 3.838 3.687 3.581 3.500 3.438 3.388 3.347 3.284 3.218 3.150 2.93
9 5.117 4.256 3.863 3.633 3.482 3.374 3.293 3.230 3.179 3.137 3.073 3.006 2.936 2.71
10 4.965 4.103 3.708 3.478 3.326 3.217 3.135 3.072 3.020 2.978 2.913 2.845 2.774 2.54
12 4.747 3.885 3.490 3.259 3.106 2.996 2.913 2.849 2.796 2.753 2.687 2.617 2.544 2.30
15 4.543 3.682 3.287 3.056 2.901 2.790 2.707 2.641 2.588 2.544 2.475 2.403 2.328 2.07
20 4.351 3.493 3.098 2.866 2.711 2.599 2.514 2.447 2.393 2.348 2.278 2.203 2.124 1.84

Table 1.4: Critical values of F for a one-tailed test P = 0.05 (or two-tailed test with P = 0.10). ν1
is the number of degrees of freedom of the numerator and ν2 is the number of degrees of freedom of
the denominator.

P. J. Grandinetti, January 20, 2000


1.7. THE TWO SAMPLE PROBLEM 45

ν1 =1 2 3 4 5 6 7 8 9 10 12 15 20 ∞
ν2
1 647.8 799.5 864.2 899.6 921.8 937.1 948.2 956.7 963.3 968.6 976.7 984.9 993.1 1018
2 38.51 39.00 39.17 39.25 39.30 39.33 39.36 39.37 39.39 39.40 39.41 39.43 39.45 39.50
3 17.44 16.04 15.44 15.10 14.88 14.73 14.62 14.54 14.47 14.42 14.34 14.25 14.17 13.90
4 12.22 10.65 9.979 9.605 9.364 9.197 9.074 8.980 8.905 8.844 8.751 8.657 8.560 8.26
5 10.01 8.434 7.764 7.388 7.146 6.978 6.853 6.757 6.681 6.619 6.525 6.428 6.329 6.02
6 8.813 7.260 6.599 6.227 5.988 5.820 5.695 5.600 5.523 5.461 5.366 5.269 5.168 4.85
7 8.073 6.542 5.890 5.523 5.285 5.119 4.995 4.899 4.823 4.761 4.666 4.568 4.467 4.14
8 7.571 6.059 5.416 5.053 4.817 4.652 4.529 4.433 4.357 4.295 4.200 4.101 3.999 3.67
9 7.209 5.715 5.078 4.718 4.484 4.320 4.197 4.102 4.026 3.964 3.868 3.769 3.667 3.33
10 6.937 5.456 4.826 4.468 4.236 4.072 3.950 3.855 3.779 3.717 3.621 3.522 3.419 3.08
12 6.554 5.096 4.474 4.121 3.891 3.728 3.607 3.512 3.436 3.374 3.277 3.177 3.073 2.72
15 6.200 4.765 4.153 3.804 3.576 3.415 3.293 3.199 3.123 3.060 2.963 2.862 2.756 2.40
20 5.871 4.461 3.859 3.515 3.289 3.128 3.007 2.913 2.837 2.774 2.676 2.573 2.464 2.09

Table 1.5: Critical values of F for a one-tailed test P = 0.025 (or two-tailed test with P = 0.05).
ν1 is the number of degrees of freedom of the numerator and ν2 is the number of degrees of freedom
of the denominator.

P. J. Grandinetti, January 20, 2000


46 CHAPTER 1. STATISTICAL DESCRIPTION OF EXPERIMENTAL DATA

1.8 Problems
1. Why is it often assumed that experimental random errors follow a Gaussian parent distribu-
tion?

2. In what type of experiments would you expect random errors to follow a Poisson parent
distribution?

3. Indicate whether the Skewness and Kurtosis are positive, negative, or zero for the following
distribution.
Relative Frequency

Measured Value

4. Describe the three classifications of systematic errors and what can be done to minimize each
one.

5. Describe the Q-test. When is it reasonable to reject data from a small set of data?

6. (a) Give a general explanation of confidence limits. (b) How would you find the 99% confidence
limits for a measured parameter that had the distribution in question 3 above?

7. The error propagation equation is


µ ¶2 µ ¶2 µ ¶µ ¶
∂y ∂y ∂y ∂y
σy2 (u, v, . . .) = σu2 + σv2 + · · · σuv
2
+ ···
∂u ∂v ∂u ∂v

Explain the assumptions involved in using this equation to relate the variances in u, v, . . . to
the variance in y.

8. The mean is defined as the first moment about the origin. Explain why the variance is defined
as the second moment about the mean and not the second moment about the origin.

P. J. Grandinetti, January 20, 2000


1.8. PROBLEMS 47

9. An negatively charged oxygen is coordinated by 4 potassium cations. 50% of the potassium


cations are ion exchanged out and replaced by sodium cations. Assuming that the sodi-
ums simply replace potassium cations randomly, what is the probability that the oxygen is
coordinated by four potassiums in the ion exchanged material?

10. We discussed two equations for obtaining the confidence limits for the mean after N mea-
surements of a variable. They were

µ=x± √ ,
N
and
ts
µ=x± √ .
N
If you had measured a concentration in triplicate, which equation would you use and why?
Describe a situation where you could correctly use the other equation for your triplicate
measurements.

11. In the χ2 test of a distribution explain what value is used for the variance of the bin heights
and why.

12. In your own words describe the Central Limit theorem and one experimental situation where
the experimental distribution of noise is not described by this theorem.

13. Given the mean and standard deviation from the parent distribution in question 3, would
it be correct if you calculated the 50% confidence limits for a single measurement using the
equation µ = x ± .67σ? Explain why or why not.

14. Draw a distribution that has a positive skewness.

15. Draw a distribution that is leptokurtic.

16. If the probability of getting an A in this course is 38%, what is the probability that 15 people
will receive an A if the class has 30 students enrolled?

17. (a) Explain the conditions under which the Poisson distribution is a good approximation for
the binomial distribution. (b) Explain the conditions under which the Gaussian distribution
is a good approximation for the binomial distribution.

18. If you had to guess which parent distribution governs the errors in your experiment before
you made a single measurement, which distribution would you guess and why?

P. J. Grandinetti, January 20, 2000


48 CHAPTER 1. STATISTICAL DESCRIPTION OF EXPERIMENTAL DATA

19. When using the chi-squared test to compare distributions A and B below, how many degrees
of freedom would you use to calculate χ2ν and why?

(A) 8
(B) 8

6 6
Probability

Probability
Density

Density
4 4

2 2

0 0
30.8 30.9 31.0 31.1 31.2 31.3 31.4 30.8 30.9 31.0 31.1 31.2 31.3 31.4

20. Given the following measurements 2.12, 2.20, 2.12, 2.39, and 2.25, use the Q-test and deter-
mine which, if any, of these numbers can be rejected at the 90% confidence level.

21. Seven measurements of the pH of a buffer solution gave the following results:

5.12, 5.20, 5.15, 5.17, 5.16, 5.19, 5.15

Calculate (i) the 95% and (ii) 99% confidence limits for the true pH. (Assume that there are
no systematic errors.)

22. The solubility product of barium sulfate is 1.3×10−10 , with a standard deviation of 0.1×10−10 .
Calculate the standard deviation of the calculated solubility of barium sulfate in water.

23. A 0.1 M solution of acid was used to titrate 10 mL of 0.1 M solution of alkali and the following
volumes of acid were recorded:

9.88, 10.18, 10.23, 10.39, 10.25 mL

Calculate the 95% confidence limits of the mean and use them to decide if there is any evidence
of systematic error.

24. Seven measurements of the concentration of thiol in a blood lysate gave:

1.84, 1.92, 1.94, 1.92, 1.85, 1.91, 2.07

Verify that 2.07 is not an outlier.

P. J. Grandinetti, January 20, 2000


1.8. PROBLEMS 49

25. The following data give the recovery of bromide from spiked samples of vegetable matter,
measured by using a gas-liquid chromatographic method. The same amount of bromide was
added to each specimen.
Tomato: 777 790 759 790 770 758 764 µg/g
Cucumber: 782 773 778 765 789 797 782 µg/g
(a) Test whether the recoveries from the two vegetables have variances which differ signifi-
cantly.
(b) Test whether the mean recovery rates differ significantly.

26. Calculate the probability of a rolling a one exactly 120 times in 720 tosses of a die.

27. Explain the circumstances under which the Student’s t distribution is used, and it’s relation-
ship to the Gaussian distribution.

28. (a) Give a general explanation of confidence limits. (b) How would you find the 99% confidence
limits for a measured parameter that had the distribution in question 3?

29. Give a real life example of a bivariate data set expected to have a linear correlation coefficient
2 σ σ near zero and an example of a bivariate data set expected to have r
rxy = σxy x y xy near
one. In both cases explain your reason for expecting the given rxy value.

30. A certain quantity was measured N times, and the mean and its standard deviation were
computed. If it is desired to increase the precision of the result (decrease σ) by a factor of
10, how many additional measurments should be made?

31. We have a crystal which has two types of impurities, one of which will absorb a particular
frequency photon without freeing an electron, and the other of which will free an electron
when absorbing the photon. There are the same number of each kind of impurity, but the
cross section for absorption by the impurity which doesn’t free an electron is 99 times bigger
than the cross section for the absorption by the the impurity which frees an electron. The
crystal is thick enough that all the photons are absorbed. Suppose 200 photons are incident.
What is the probability of at least three electrons being liberated.

32. Suppose we make two independent measurements of the same quantity but they have different
variances.

x1 (x0 , σ1 )
x2 (x0 , σ2 )

P. J. Grandinetti, January 20, 2000


50 CHAPTER 1. STATISTICAL DESCRIPTION OF EXPERIMENTAL DATA

How should we combine them linearly to get a result with the smallest possible variance?
That is, given
x = cx1 + (1 − c)x2 ,

find the c which minimizes σx2 .

33. The following values were obtained for the atomic weight of Cd.

112.23, 112.34, 112.30, 112.22, 112.32, 112.34

(a) Calculate the variance and standard deviation of the data. (b) Calculate the 90% and
95% confidence limits for the atomic weight. (c) Does the experimental average agree with
the accepted value of 112.41?

34. A change in the indicator used in a method was expected to improve its precision – a known
standard deviation of σ = 0.3317. Based on 20 determinations, the modified method had
an estimated standard deviation of s = 0.2490. Does this lower value represent a significant
improvement in precision at α = 0.05?

35. Four gas chromatographic assays on a sample of pentachlorophenol performed under the same
conditions are 88.2%, 86.7%, 87.9%, and 87.7%. Can we reject any of these assays based on
the Q-test?

36. A collection of diatomic molecules is excited with laser beam pulses. The molecules can ionize
giving up an electron or dissociate, giving up an ion. At each pulse, we measure the number
of ions (x) and the number of electrons (y) in appropriate detectors for each. Below are the
detector counts for each pulse:

Pulse # x y
1 45 68
2 54 90
3 36 50
4 85 145
5 55 88
6 29 37
7 41 63
8 75 133
9 63 95
10 42 67

P. J. Grandinetti, January 20, 2000


1.8. PROBLEMS 51

(a) What are the standard deviations for x and y?


(b) What is the linear correlation coefficient, rxy ? Is there a correlation between number of
ions (x) and the number of electrons (y)?
(c) With some fine tuning of the laser the standard deviation in the number of ions detected
can be reduced to virtually zero. To what value will the standard deviation in the
number of electrons detected be reduced?

P. J. Grandinetti, January 20, 2000


52 CHAPTER 1. STATISTICAL DESCRIPTION OF EXPERIMENTAL DATA

P. J. Grandinetti, January 20, 2000


Chapter 2

Modeling of Data

When a measured parameter varies systematically as another experimental parameter is varied,


then a functional relationship between the two parameters is implied. Often we have a mechanistic
picture describing this relationship. For example, in chemical kinetics we may know that the
reactant concentrations for a particular reaction
k
A −→ B

will vary systematically according to the relationship

[A]t = [A]0 e−kt ,

[A]0

e-kt
[A]t

Figure 2.1: By measuring the concentration [A]t as a function of time we can extract the rate
constant k and the initial concentration [A]0 .

53
54 CHAPTER 2. MODELING OF DATA

where [A]0 is the concentration at time t = 0, [A]t is the concentration at a later time t, and k is
the first-order rate constant. Knowing [A]0 and k, and assuming they are time independent, we can
then predict the value for [A]t at any arbitrary positive time. Conversely, from a measured series
of ([A]t , t) values we could extract [A]0 and k. In this example [A]t = [A]0 e−kt is our model that
describes an (x, y) data set, i.e., ([A]t , t), with the model parameters k and [A]0 . In this chapter we
focus on the general problem of determining the best possible model parameters for a given model
describing the relationship between experimentally measured parameters.
The basic approach in all cases is the same: you choose or design a “figure of merit” function
or merit function that measures the agreement between the data and the model with a particular
choice of parameters. The merit function is conventionally arranged so that small values represent
close agreement. The parameters of the model are then adjusted to achieve a minimum in the
merit function yielding the “best-fit” parameters. The adjustment process is thus a problem in
minimization in many dimensions.
Modeling of data, however, goes beyond just finding the “best-fit” parameters. As we have
discussed earlier our data are subject to measurement error. Our data will never exactly fit the
model even if the model is correct. Therefore, to decide if a particular model is appropriate for
your data you need to quantify the “goodness of fit”. Even if the model is correct, we need to know
the errors associated (i.e., confidence limits) with the “best-fit” parameters.
In summary, a useful fitting program should provide
1. Parameters.

2. Error estimates for the parameters.

3. A statistical measure of the goodness of fit.


If (3) suggests that the model is wrong, then obviously items (1) and (2) are probably worthless.
No self-respecting scientist does steps (1) without steps (2) and (3).

2.1 Maximum Likelihood


How do we decide on the merit function? We use the approach of Maximum likelihood. Suppose we
are fitting N data points (xi , yi ) where i = 1, . . . , N , to a model that has M adjustable parameters
aj , where j = 1, . . . M . The model predicts a functional relationship between the independent and
dependent variables, that is
y(x) = y(x, a1 , a2 , . . . , aM ).
What do we want to minimize to get the “best-fit” values for aj ? Actually, the starting point to
answer this question is to first ask the question: “What particular set of parameters (i.e., aj ) will

P. J. Grandinetti, January 20, 2000


2.1. MAXIMUM LIKELIHOOD 55

maximize the probability that the data set could have occurred?” That is, if the model is correct,
then the probability that any one particular data point is measured in an interval ∆y is

Pi (a1 , a2 , . . . , aM ) = ρ[yi − y(xi , a1 , a2 , . . . , aM )]∆y. (2.1)

For N data points the probability is

Y
N Y
N
Ptotal (a1 , a2 , . . . , aM ) = Pi (a1 , a2 , . . . , aM ) = ρ[yi − y(xi , a1 , a2 , . . . , aM )]∆y. (2.2)
i i

There’s a probability associated with the deviation from the model for every point in the data set.
We want to find the model parameters that maximize that probability. With the assumption of a
Gaussian error distribution we get
( " µ ¶2 # )
Y
N
1 1 yi − y(xi , a1 , a2 , . . . , aM )
P (a1 , a2 , . . . , aM ) = √ exp − ∆y . (2.3)
i σi 2π 2 σi
or
(N )( " µ ¶2 #)
Y ∆y 1XN
yi − y(xi , a1 , a2 , . . . , aM )
P (a1 , a2 , . . . , aM ) = √ exp − . (2.4)
i σi 2π 2 i σi

x
Figure 2.2: There’s a probability associated with the deviation from the model for every point in the
data set. We want to find the model parameters that maximize that probability.

P. J. Grandinetti, January 20, 2000


56 CHAPTER 2. MODELING OF DATA

Thus maximizing the probability P (a1 , a2 , . . . , aM ) is equivalent to minimizing the sum in the
exponential. We define this exponential as our merit function
N µ
X ¶
yi − y(xi , a1 , a2 , . . . , aM ) 2
χ2 = (2.5)
i
σi

This is the familiar least-squares fit. Remember the Central Limit Theorem is the main justification
for a Gaussian distribution and thus the least-squares fit. You should always check to make sure
that your noise has a Gaussian distribution before using least-squares as your merit function. From
a practical point of view, however, even in cases with non-Gaussian noise the χ2 merit function can
serve as a useful merit function.
We will first look at situations where we can analytically find the minimum in χ2 , and then
consider cases where the minimum in χ2 can only be found numerically. A starting point for
analytically finding the minimum of χ2 as a function of the parameters a1 , a2 , . . . , aM is to set its
derivative with respect to the parameters aj to zero. That is,
N µ ¶ Ã !
X yi − y(xi , a1 , a2 , . . . , aM ) dy(xi , a1 , a2 , . . . , aM )
= 0, (2.6)
i
σi daj

where j = 1, 2, . . . , M . This gives a set of M equations with the M unknowns aj .


Let’s start with one of the simplest χ2 minimization problems, that is, fitting data to a straight
line.

2.2 A Simple Example of Linear Least-Squares


Let’s say we obtain a set of N measurements yi , where i = 1, . . . , N , as another parameter xi is
varied. We also know that there should be a linear relationship between these two parameters, i.e.,

y(x, m, b) = mx + b.

Our goal is to determine the model parameters m and b that best describe the experimental data,
the uncertainties in m and b, and some measure of the goodness of fit or appropriateness of our
model function.

2.2.1 Finding Best-Fit Model Parameters


To extract the parameters m and b we construct our merit function
N µ
X ¶
yi − mxi − b 2
χ2 (m, b) =
i=1
σi

P. J. Grandinetti, January 20, 2000


2.2. A SIMPLE EXAMPLE OF LINEAR LEAST-SQUARES 57

Taking the derivative of χ2 with respect to m and b we get two equations


à !
∂χ2 (m, b) XN
yi − mxi − b
= −2 2 xi = 0
∂m i=1
σ i

à !
∂χ2 (m, b) XN
yi − mxi − b
= −2 = 0.
∂b i=1
σi2

Expanding the summations we obtain

∂χ2 (m, b) X N
xi yi XN
x2i XN
xi
= 2 − m 2 − b = 0.
∂m i=1
σi σ
i=1 i
σ2
i=1 i

∂χ2 (m, b) X N
yi XN
xi XN
1
= 2 − m 2 − b = 0,
∂b σ
i=1 i
σ
i=1 i
σ2
i=1 i

If we make the following definitions

X
N
1 X
N
xi X
N
x2i X
N
yi X
N
xi yi
S= , Sx = , Sxx = , Sy = , Sxy = .
i=1
σi2 i=1
σi2 i=1
σi2 i=1
σi2 i=1
σi2

then we get a pair of simultaneous equations1 .


)
mSxx + bSx = Sxy two equations,
mSx + bS = Sy two unknowns.

n linear equations with n unknowns can be solved using Cramer’s Rule. The solutions are
¯ ¯ ¯ ¯ ¯ ¯
1 ¯¯ S Sy ¯
¯ 1
¯ S
¯ y Sx ¯
¯
¯ S
¯ Sx ¯
¯
m= ¯ ¯; b = ¯ ¯; ∆ = ¯ ¯.
∆ ¯ Sx Sxy ¯ ∆ ¯ Sxy Sxx ¯ ¯ Sx Sxx ¯

Thus we get

SSxy − Sx Sy Sxx Sy − Sx Sxy


m= ; b= ; ∆ = S · Sxx − (Sx )2 . (2.7)
∆ ∆

1
Think of a pair of simultaneous equations as representing two straight lines. The solution is then the point of
intersection of the lines. This geometric picture helps to understand why sometimes there is no solution (i.e. parallel
lines) or sometimes there is an infinite set of solutions (i.e., both equations represent the same line).

P. J. Grandinetti, January 20, 2000


58 CHAPTER 2. MODELING OF DATA

2.2.2 Finding Uncertainty in Model Parameters


We still need error estimates on m and b and a measure of the goodness of fit. To get error estimates
for m and b we go back to the error propagation equation
µ ¶2 µ ¶2 µ ¶µ ¶
df df df df
σf2 (u, v, . . .) = σu2 + σv2 + · · · + 2σuv
2
+ ···. (2.8)
du dv du dv

Errors in m and b are produced by errors in yi , which we assume are taken all from the same parent
distribution with variance σy2i . We can also assume that σy2i ,yj = 0. Thus for σm
2 we can write

X
N µ ¶2
2 ∂m
σm = σy2i .
i
∂yi

∂m
To complete this derivation, we need to calculate . From Eq. (2.7) we can calculate
∂yi
µ ¶ µ ¶
∂m S ∂Sxy Sx ∂Sy
= − ,
∂yi ∆ ∂yi ∆ ∂yi

and since
∂Sxy xi ∂Sy 1
= 2, and = 2,
∂yi σi ∂yi σi

we obtain à ! à !
∂m S xi Sx 1
= − .
∂yi ∆ σi2 ∆ σi2
Now we can calculate à !2
X
N
S · xi − Sx
2
σm = σi2 .
i
σi2 ∆
With some additional algebra this expression can be simplified to

2 = S/∆.
σm (2.9)

We take the same approach for σb2 , starting with

X
N µ ¶2
∂b
σb2 = σy2i .
i
∂yi

We first evaluate µ ¶ µ ¶
∂b Sxx ∂Sy Sx ∂Sxy
= − ,
∂yi ∆ ∂yi ∆ ∂yi

P. J. Grandinetti, January 20, 2000


2.2. A SIMPLE EXAMPLE OF LINEAR LEAST-SQUARES 59

and since
∂Sy 1 ∂Sxy xi
= 2, and = 2,
∂yi σi ∂yi σi

we obtain
∂b Sxx − Sx · xi
= .
∂yi σi2 ∆
Now we can calculate à !2
X
N
Sxx − Sx · xi
σb2 = σi2 .
i
σi2 ∆
With some additional algebra this expression can be simplified to

σb2 = Sxx /∆. (2.10)

2
Next, we calculate the covariance σmb

X
N µ ¶µ ¶
2 ∂m ∂b
σmb = σy2i .
i
∂yi ∂yi

Details of the derivation are similar to before, and we obtain

2 = −S /∆.
σmb (2.11)
x

The covariance can also be rewritten in terms of the coefficient of the correlation between the
uncertainty in m and the uncertainty in b as
2
σmb p
rmb = = −Sx / Sxx · S. (2.12)
σm σb
2 values very close to one (> 0.99) indicate an exceptionally difficult χ2 minimiza-
In general, rmb
tion problem, and one which has been badly parameterized so that individual errors are not very
meaningful because they are so highly correlated.

2.2.3 Finding Goodness of Fit


Finally, we need to calculate some measure of the goodness of fit. One approach is to first determine
the number of degrees of freedom, i.e., ν = N − 2, where 2 takes into account the two parameters
m and b, and then calculate chi-squared reduced
µ ¶2
1 X N
yi − mxi − b
χ2ν (m, b) = .
N − 2 i=1 σi

P. J. Grandinetti, January 20, 2000


60 CHAPTER 2. MODELING OF DATA

Recall that for a good fit the expectation value of χ2ν is one. Of course there will be an expected
distribution in χ2 values with a width that depends on the number of degrees of freedom. So we
use our earlier expression for the χ2 parent distribution to calculate
Z ∞
1
(χ2 )(ν−2)/2 e−χ
2 /2
Q(χ20 |ν) = ν/2
d(χ2 ), (2.13)
2 Γ(ν/2) χ20

that is, the probability that a worse χ2 could be obtained.


A more common approach in the case of the linear least squares fit is to report the linear
correlation coefficient rxy as a measure of the goodness of fit. This parameter rxy is not to be
confused with rmb which is the linear correlation coefficient between m and b.
Using the expression
2
σxy
2
rxy = , (2.14)
σx σy
with our earlier definitions
1 X 1 X
σx2 = lim (xi − µx )2 , and σy2 = lim (yi − µy )2 ,
N →∞ N N →∞ N
i i

1 X
2
σxy = lim (xi − µx )(yi − µy ),
N →∞ N
i

we would expect to obtain a value of rxy2 close to one for a good fit, and a r 2 close to zero for a
xy
poor fit. Because of random fluctuations you will not, in general, get rxy = 0, even if there is no
correlation. As expected rxy is related to χ2 . The relationship is

X
N
χ2 = (1 − rxy
2
) (yi − y)2 . (2.15)
i

2.2.4 What to do if you don’t know σyi ?


How can we use our merit function
N µ
X ¶
2 yi − mxi − b 2
χ (m, b) =
i=1
σi

if we don’t know the value of σyi (i.e., σi )? This is a problem. The only solution is to assume that
the σyi values for each (xi , yi ) pair are identical and have the value of σy . Of course, we still don’t
know the value of σy , but now we can rewrite the expression above as

X
N
χ2 (m, b) · σy2 = (yi − mxi − b)2 .
i=1

P. J. Grandinetti, January 20, 2000


2.3. GENERAL LINEAR LEAST SQUARES 61

Instead of minimizing χ2 (m, b) we minimize {χ2 (m, b)·σy }. The only problem is that since we don’t
know the value for the average variance we can’t use {χ2 (m, b) · σy }min as an indicator of whether
our data is behaving properly or if a linear function is the appropriate function to describe your
data. In other words, we lose our measure of the goodness of fit. At this point you have to make a
leap of faith and assume that the data is behaving properly and that the linear function is correct.
© ª
Then we proceed as before taking the derivative of χ2 (m, b) · σy to get identical expressions for
m and b except we use the definitions

X
N X
N X
N X
N
S = N, Sx = xi , Sxx = x2i , Sy = yi , Sxy = xi yi ,
i=1 i=1 i=1 i=1

instead. To get the variance in m and b we need to know the variance in our measurements. The
best we can do in this case is assume that χ2min (m, b) = ν = N − 2 and estimate the variance from
our best fit as
1 X N
σy2 = (yi − mxi − b)2 .
N − 2 i=1
Using this estimate for the variance in y we can then calculate the variance and covariance in m
and b as
2 = σ 2 · (N/∆),
σm σb2 = σy2 · (Sxx /∆), and 2 = −S /∆.
σmb (2.16)
y x

2.3 General Linear Least Squares


We can generalize our approach of fitting a set of data points (xi , yi ) to a straight line to that of
fitting a set of data points to any function that is a linear combination of any M specified functions
of x. For example,
y(x) = a1 + a2 x + a3 x2 + · · · + aM xM −1 . (2.17)

This function is not linear in x but is linear in the parameters a1 , . . . , aM . In general, a function of
the form
X
M
y(x) = ak Xk (x), (2.18)
k=1

where X1 (x), . . . , XM (x) are arbitrary fixed functions of x, called the basis functions. Again, Xk (x)
can be a wildly non-linear function of x, and as long as the Xk (x) do not involve the parameters
ak , then the general linear least-squares approach can be used as long as the function is linear in
all the ak . Our general merit function is
" P #2
X
N
yi − M
2 k ak Xk (xi )
χ = . (2.19)
i=1
σi

P. J. Grandinetti, January 20, 2000


62 CHAPTER 2. MODELING OF DATA

Taking the derivative of χ2 with respect to each of the ak parameters we get


 " #2  " #
PM P
∂χ2 ∂ XN
yi − k ak Xk (xi )  = −2
X
N
yi − M
k ak Xk (xi )
= Xl (xi ) = 0
∂al ∂al i=1 σi i=1
σi2

We eliminate the −2, and expand this into two terms


(P )
X
N
yi Xl (xi ) X
N M
ak Xk (xi )Xl (xi )
− k
=0
i
σi2 i
σi2

which become
(N )
X
N
yi Xl (xi ) X
M X Xk (xi )Xl (xi )
= ak . (2.20)
i
σi2 k i
σi2

This is a set of M coupled linear equations, each indexed by the variable l = 1, . . . , M . For example,
for M = 2 we have
(N ) (N )
X
N
yi X1 (xi ) X X1 (xi )X1 (xi ) X X2 (xi )X1 (xi )
= a1 + a2 ,
i
σi2 i
σi2 i
σi2
(N ) (N )
X
N
yi X2 (xi ) X X1 (xi )X2 (xi ) X X2 (xi )X2 (xi )
= a1 + a2 .
i
σi2 i
σi2 i
σi2
This can be summarized in matrix form for the arbitrary M as
   
X
N
yi X1 (xi ) X
N
X1 (xi )X1 (xi ) X
N
XM (xi )X1 (xi )


 
 
···  


 i σi2   i σi2 i  σi2 a1
 ..   .. .   
 = ..
. ..  ·  .. , (2.21)
 .   .   . 
 N   N 
 X y X (x )   X X (x )X (x ) X XM (xi )XM (xi ) 
N
aM
 i M i   1 i M i
··· 
i
σi2 i
σi2 i
2 σi

or more compactly written as


~ = α · ~a.
β (2.22)
~ and α, what is ~a? The answer is simply given
From Eq. 2.22 we see our problem as simply given β
as
~a = α−1 · β.
~ (2.23)
Thus our job becomes, Given α what is α−1 ?
α is called the curvature matrix because of its relationship to the curvature of the χ2 function
in parameter space. That is, if we take the second derivative of χ2 we get

∂ 2 χ2 X
N
Xl (xi )Xk (xi )
=2 2 = 2α (2.24)
∂al ∂ak l
σ i

P. J. Grandinetti, January 20, 2000


2.3. GENERAL LINEAR LEAST SQUARES 63

From a computational point of view it helpful to note that the matrix α can be constructed
using the expression
α = AT · A, (2.25)

where A is a N × M matrix that is given by


 
X1 (x1 ) X2 (x1 ) XM (x1 )
 σ1 σ1 ··· σ1 
 
 
 
 X1 (x2 ) X2 (x2 ) XM (x2 ) 
 ··· 
 σ2 σ2 σ2 
 
A= . (2.26)
 
 .. .. .. .. 
 . . . . 
 
 
 
 
X1 (xN ) X2 (xN ) XM (xN )
σN σN ··· σN
~ can also be constructed using A according to
β

~ = AT · ~b,
β (2.27)

where  
y1 /σ1
 
 y1 /σ1 
~b = 
 ..

. (2.28)
 . 
 
yN /σN
With our definition of A we can rewrite our simultaneous linear equations as

(AT · A) · ~a = AT · ~b. (2.29)

Back to our original problem, once we know α−1 we can calculate ~a in a simple matrix multi-
plication. Now to get the uncertainty in our best-fit parameters we start with our expression for a
given best-fit parameter
X
M
−1
aj = αj,k βk . (2.30)
k=1
Using the equation above with the error propagation equation, which in this case is written

X
N µ ¶2
2 ∂aj
σ (aj ) = σy2i , (2.31)
i=1
∂yi

remembering that the only experimental uncertainties are in our measured y values, and we have
assumed that our x are known with complete certainty. Thus, we have to evaluate the partial

P. J. Grandinetti, January 20, 2000


64 CHAPTER 2. MODELING OF DATA

derivative in this expression. Since α−1 is independent of our measured y values, we only need to
~ with respect to y, that is,
take the derivative of β
µ ¶ X
M µ ¶
∂aj −1 ∂βk
= αj,k .
∂yi k=1
∂yi

~k we obtain
Recalling our expression for β
µ ¶
∂βk Xk (xi )
= .
∂yi σi2

Putting this all together we get


" #
X
M X
M X
N
Xk (xi )Xl (xi )
2 −1 −1
σ (aj ) = αj,k αl,j .
k=1 l=1 i=1
σi2

We recognize the term in brackets above as simply αk,l and can rewrite this as

X
M X
M
−1 −1
σ 2 (aj ) = αj,k αl,j αk,l .
k=1 l=1

This sum can be simplified to


−1
σ 2 (aj ) = αj,j . (2.32)

This is a beautifully simple result! The diagonal elements of the α−1 matrix are the variances of
−1
the fitted parameters. It also shouldn’t surprise you to learn that the off-diagonal elements αj,k are
the covariances between the parameters aj and ak . That is, one can similarly prove that

X
N µ ¶µ ¶
∂aj ∂ak
σ 2 (aj , ak ) = σi2
i=1
∂yi ∂yi

evaluates to
−1
σ 2 (aj , ak ) = αj,k . (2.33)

For this reason, the matrix α−1 is called the error matrix or covariance matrix.
As I mentioned earlier solving a set of simultaneous equations is geometrically equivalent to
finding the point where all lines associated with each equation intersect. If two or more lines are
parallel or colinear then the matrix is said to be singular. One of the most robust algorithms for
inverting a matrix like α is called singular value decomposition and is described in detail in the
book Numerical Recipes. In Numerical Recipes you can find canned subroutines for χ2 minimization
using singular value decomposition.

P. J. Grandinetti, January 20, 2000


2.4. GENERAL NON-LINEAR LEAST SQUARES 65

2.4 General Non-Linear Least Squares


Although our discussion is currently on general linear least squares there are situations where
parameters that appear non-linearly in a model function can be “linearized”. For example, the
function
y(x) = a · e−bx ,

contains a parameter b that is related to y in a non-linear fashion. However, this expression can be
rewritten as
log[y(x)] = log[a] − bx.

In this modified function, log[y] depends linearly on the parameters b and log[a]. Be aware, however,
that your Gaussian error distribution in y may not map through the equations as Gaussian error
distributions in model parameters. Also, it is important to watch out for non-parameters. For
example, the function
y(x) = a · exp(−bx + d),

as written appears to have three parameters a, b, and d. However, with a little rearranging we
obtain
y(x) = aed · exp(−bx).

The term aed amounts to a single pre-exponential parameter. If data was fit to this model function,
2 would be unity. If you linearized this
the correlation coefficient between errors in a and d, i.e., rad
function and tried to solve for a, b, and c your program would run into problems because this
example will have singular equations. A good singular value decomposition routine would alert you
to this problem.
If your model function doesn’t vary linearly with the model parameters, and you can’t linearize
the problem what do you do? Your χ2 merit function can still be used, however, you cannot, in
general, find an analytical solution for the parameters at the χ2 minimum. For example, if we take
the derivative of our χ2 merit function with respect to each model parameter,
à !µ ¶
∂χ2 X N
yi − y(xi , a1 , . . . , aM ) ∂y(xi , a1 , . . . , aM )
= = 0, (2.34)
∂ak i=1
σi2 ∂ak

where k = 1, . . . , M . We have M simultaneous equations, but not linear in parameters. Now what?
Well, the only solution is to roll up your sleeves and hunt for the minimum in χ2 numerically.
The first problem we run into is that there may be more than one minimum. This was never a
problem when the model function depended linearly on the parameters because then χ2 could only
depend on the square of the parameters at most, thus as a function of any one parameter χ2 always

P. J. Grandinetti, January 20, 2000


66 CHAPTER 2. MODELING OF DATA

looked like a parabola. Now, if the model function depends non-linearly on the parameters then χ2
can depend on the parameters in a non-quadratic manner. This means it may be possible for χ2 to
have multiple minima as a function of a parameter. So we have two problems: (1) How do we find
a minimum in χ2 ? and (2) How do we know that the minimum we find is the global minimum? If
we could simply see the χ2 surface then it would be easy, but that is computationally expensive.
The problem is like being in the dark with only an LED altimeter, and then trying to find the
quickest way to the valley floor starting somewhere on the mountainside. Actually, the problem
is a bit easier than that, although we can’t see, we can jump to any point in space and measure
the altitude. There is no sure fire way to know you’ve found the global minimum in χ2 . However,
simply plotting the raw data and the best-fit curve together is one of the best ways of knowing that
the minimum in χ2 found is reasonable. In fact, iteratively plotting the raw data and the model
function with different parameter values is a good way of finding starting parameter values near
the χ2 minimum before optimizing the fit for the absolute χ2 minimum and best-fit parameters.

There are two general categories for minimization of functions: (1) Those that only need to
evaluate the function to minimize the function, and (2) those that need to evaluate the function and
its derivatives to minimize the function. We will look at the application of both these approaches
to minimizing χ2 .

(a) (b)
χ2 (a) = a2 +a +

χ2(a)
χ2(a)
local
mimimum

global
mimimum

a a

Figure 2.3: (a) When the model function depended linearly on the parameters then χ2 could only
depend on the square of the parameters at most, thus as a function of any one parameter χ2 always
looked like a parabola. (b) If the model function depends non-linearly on the parameters then χ2
can depend on the parameters in a non-quadratic manner.

P. J. Grandinetti, January 20, 2000


2.4. GENERAL NON-LINEAR LEAST SQUARES 67

b χ2(a1,b1)

χ2(a2,b2)

χ2(a3,b3)

Figure 2.4: Three points (a1 , b1 ), (a2 , b2 ), and (a3 , b3 ) define a simplex in two dimensions. In
general, for minimization in M dimensions a simplex is composed of M + 1 points.

2.4.1 Minimizing χ2 without Derivative Information


[Link] Downhill Simplex Method

A simplex is a geometrical figure consisting, in M independent parameter dimensions, of M + 1


points (or vertices) and all their interconnecting line segments, polygonal faces, etc,... In a two
parameter minimization problem a simple is a triangle. In a three parameter problem it is a
(not-necessarily regular) tetrahedron.
How do we use a simplex to find a minimum in χ2 as a function of a and b? We start by
evaluating χ2 at each of the three vertices of the simplex.
χ2(a1,b1)

χ2(a2,b2)

χ2(a3,b3)

Then we project the vertex with the highest χ2 value through the midpoint on the opposite face,
keeping the area (or volume in higher dimensions) constant.
χ2highest

χ2new

If the new vertex is no longer the highest χ2 value then our simplex is now closer to the minimum.
If the new vertex becomes the lowest χ2 value then go back and reproject the old vertex twice as
far in the same direction.

P. J. Grandinetti, January 20, 2000


68 CHAPTER 2. MODELING OF DATA

χ2highest

χ2new

If the new vertex is still the lowest χ2 then redo it again (four times as far), and again (eight times
as far), as long as it keeps getting better. Once it gets worse, then go back to the best (lowest χ2 )
vertex. Then you have your new simplex that is closer to the minimum.
What if after trying first move the new vertex is still the highest χ2 vertex? Then reproject
again towards the midpoint of the opposite face, but only go half the distance toward the face.

χ2highest

χ2new

If this new vertex is now no longer the highest χ2 value you have a new simplex that’s closer to the
minimum. This new simplex will have a smaller area than the original.
If the half projection doesn’t work, then the last possibility is to just contract the two highest
vertices towards the lowest vertex.
χ2high

χ2new
χ2high
χ2new

χ2lowest

This new simplex will have a smaller area than the original.
With these rules you can slither down a χ2 surface to the minimum. Once all the vertex changes
are less than a small finite tolerance value, the search is done.
The Simplex method is an example of a minimization algorithm that doesn’t require derivatives.
It often isn’t the most efficient or quickest algorithm for finding a minimum, but it has the advantage
that it is very simple. If your calculate of χ2 takes little computer time relative to the simplex
algorithm, then the simplex method is not a bad approach to take. If you χ2 calculation takes a
long time compared to the minimization algorithm, however, then you should consider looking for
a more efficient method.

P. J. Grandinetti, January 20, 2000


2.4. GENERAL NON-LINEAR LEAST SQUARES 69

b s

3 1

Figure 2.5: Start from the highest vertex labeled s the next three new vertices are shown as the
simplex moves towards the minimum.

2.4.2 Minimizing χ2 with Derivative Information

Before we discuss methods that use derivatives, let’s review some basic calculus. Note that at
the maximum or minimum the first derivative is always zero. For minima the second derivative is
positive and for maxima the second derivative is negative.
How do we use this derivative information to find minima or maxima?

∂ χ (a0)
2

∂a

χ2(a)

a
0
a

2
If we were standing at point (χ2 (ao ), a0 ) and could obtain the derivative ∂χ∂a(a0 ) then from the sign
of the derivative we would know which way to go towards the minimum. If the derivative is positive
we go left, if it is negative we go right.
In this particular example we have a positive derivative so we go left and keep measuring the
derivative until it goes negative. Then we know we’ve got a minimum trapped between two “a”
values where the derivative is positive and negative. in the parlance of Numerical Recipes we have
“bracketed” the minimum between A and C.

P. J. Grandinetti, January 20, 2000


70 CHAPTER 2. MODELING OF DATA

χ2(a) χ2(a)

a a

∂ χ (a) ∂ χ (a)
2 2

∂a a ∂a a

∂ χ (a) ∂ χ (a)
2 2 2 2

∂a ∂a
2 a 2 a

Figure 2.6: The zero of the first-derivative can be used to find the extrema in χ2 , and the second-
derivative can be used to determine if the extrema are maxima or minima.

χ2(a) A
B

amin a
We can now evaluate χ2 (a) at a point between A and C, say B, and use these three points and the
assumption that χ2 (a) is parabolic near the minimum to calculate where the minimum should be

1 (B − A)2 [χ2 (B) − χ2 (C)] − (B − C)2 [χ2 (B) − χ2 (A)]


amin = B − · (2.35)
2 (B − A)[χ2 (B) − χ2 (C)] − (B − C)[χ2 (B) − χ2 (A)]

After this calculation we can calculate the derivative of χ2 (a) at amin to see if our assumption that
χ2 (a) is parabolic in this region is true, that is, the derivative should be zero at the minimum. If

P. J. Grandinetti, January 20, 2000


2.4. GENERAL NON-LINEAR LEAST SQUARES 71

not, we have at least narrowed in the bracket region around the minimum and we can iterate the
whole process until we converge on the minimum.
Another approach is to use the Newton-Raphson method, which assumes that we’re near
the minimum. Recall that at the minimum ∂χ2 (amin )/∂a = 0. So we can do a Taylor expansion
of this derivative near the minimum:
∂χ2 (a) ∂χ2 (a0 ) ∂ 2 χ2 (a0 )
= + (a − a0 ) ·
∂a ∂a ∂a2
If we set this expression to zero then we can find out how far we need to jump to get to the
minimum. That is,
1 ∂χ2 (a0 )
amin = a0 − 2 2 ·
∂ χ (a0 )/∂a2 ∂a
Remember that the sign of ∂χ2 (a0 )/∂a tells us in which direction is the minimum. The second
derivative of χ2 should be a positive quantity near the minimum, and the product of the inverse of
the second derivative with the first derivative of χ2 gives a number in the units of the parameter
a that is the step from a0 to amin . If this approximation (near minimum) is slightly off, then just
iterate again from the new point until the approach converges on the true minimum. Far from the
minimum, this approach can often blow up, and can move you further from the minimum. This is
because near the minimum, we expect the sign of the second derivative to be positive. If we are
far from the minimum, then this second derivative may be negative, and thus we get an increment
in a that takes us in the wrong direction. Therefore, when you’re far from the minimum, you can
still use the first derivative information, but not with the second derivative.
∂χ2 (a0 )
anew = a0 − positive constant × . (2.36)
∂a
This works well, you just need a good guess for the constant.
For one-dimensional minimization you are often better off using the “bracketed minimum”
described by Eq. 2.35 above. In a multidimensional minimization problem, however, this approach
is not optimal. In one-dimension all of this is pretty straightforward, however, our main concern is
minimization in M dimensions.
b ?
χ2(a,b)

P. J. Grandinetti, January 20, 2000


72 CHAPTER 2. MODELING OF DATA

If we are standing on this χ2 contour above, how do we use derivative information to tell us which
way to go. That it, it’s no longer a go left or go right problem anymore. We need to calculate the
derivative in all possible directions.
At this point we need to review the mathematics of directional derivatives and gradients.

[Link] Directional Derivatives and Gradients

Let’s consider the general problem: You’re standing at some point on the side of a hill (not the
top), and ask the question, “In which direction does the hill slope downward most steeply from
this point?” That is, in which direction would you slide if the surface of the hill was smooth ice?
Before we can answer this question, let’s look at how you would calculate the derivative along any
particular direction.

b χ2(a,b) (a1,b1)

(a0,b0)

If you’re standing at a particular point (a0 , b0 ), then the vector ~s associated with a particular
direction can be calculated using a second point (a1 , b1 ),

~s = (a1 , b1 ) − (a0 , b0 ).

The vector ~s can be written as the product of a unit vector ŵ times the scalar length s of the vector
~s,
~s = s × ŵ,
where q
s= (a1 − a0 )2 + (b1 − b0 )2 ,
and ŵ can be written in terms of the unit vectors in the direction of a and b coordinate axes

ŵ = u â + v b̂,

where u and v are scalar quantities normalized so that u2 + v 2 = 1. With these definitions we
can write
a1 = a0 + us,
b1 = b0 + vs.

P. J. Grandinetti, January 20, 2000


2.4. GENERAL NON-LINEAR LEAST SQUARES 73

So the derivative along the direction of the vector ~s is

∂χ2 (a, b) ∂χ2 ∂a ∂χ2 ∂b ∂χ2 ∂χ2


= · + · = ·u + · v.
∂s ∂a ∂s ∂b ∂s ∂a ∂b

~ 2 , defined
This result is simply the vector dot product between the vector ŵ and the gradient ∆χ
as
2 2
~ 2 = ∂χ (a, b) â + ∂χ (a, b) b̂.
∇χ
∂a ∂b
Thus " #
∂χ2 (a, b) ∂χ2 (a, b) ∂χ2 (a, b) h i
= â + ~ 2 · ŵ.
b̂ · uâ + v b̂ = ∆χ
∂s ∂a ∂b

~ 2 (a1 , . . . , aM ) is written
In M dimensions ∇χ

X
M
∂χ2 (a1 , . . . , aM )
~ 2 (a1 , . . . , aM ) =
∇χ âj ,
j=1
∂aj

where âj is a unit vector in direction of aj coordinate axis. So the directional derivative tells us
the derivative in any direction from point (a0 , b0 )

∂χ2 (a, b) ~ 2 · ŵ.


= ∇χ
∂s

Now recall that our original question was which direction does χ2 slope downwards most steeply?
To answer this we recall that the dot product between two vectors can be written

∂χ2 (a, b) ~ 2 ||ŵ| cos θ,


~ 2 · ŵ = |∆χ
= ∇χ
∂s

~ 2 . Since |ŵ| = 1 then


where θ is the angle between ŵ and ∆χ

∂χ2 (a, b) ~ 2 | cos θ.


= |∇χ
∂s

Now if we want the direction with the steepest (largest derivative) in χ2 , we look for the direction,
2
or angle θ that gives the maximum possible negative value for ∂χ ∂s . Clearly this is simply where
θ = 180◦ , and we get the result that just knowing the gradient of χ2 alone, i.e., ∇χ
~ 2 , gives us the
direction of steepest descent.
Another important property of the gradient is that it is always perpendicular to the surface or
lines of constant χ2 .

P. J. Grandinetti, January 20, 2000


74 CHAPTER 2. MODELING OF DATA

b χ2(a,b)

[Link] Steepest Descent Method

Given some initial parameters (a0 , b0 ) and the gradient of χ2 at that point, we can do just what
we did in the one-dimensional minimization problem, i.e., find points along the gradient direction
that bracket the minimum, then do a parabolic interpolation to the minimum along that direction.

~ 2 (~a)
~anext = ~a0 − positive constant × ∇χ

b χ2(a,b)
(a0,b0)

This is the ideal case of how we use the gradient information to increase the efficiency of our search
for the minimum. In practice, initial guesses are not so well placed.

b χ2(a,b)

(a0,b0)

In this example, the starting point and gradient doesn’t take us to the minimum directly. Here we
end up with a series of minimizations that end at the global minimum.

P. J. Grandinetti, January 20, 2000


2.4. GENERAL NON-LINEAR LEAST SQUARES 75

b χ2(a,b)

(a0,b0)

a
Of course, χ2 can also contain twisty valleys. The starting point above requires a few iterations before it gets into the region of the global minimum.
Steepest Descent works well far from the minimum, but as we get near the minimum, ∇χ2 gets near zero. We then start taking smaller and smaller steps, and the method becomes less efficient.
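
To make the procedure concrete, here is a minimal C sketch of a steepest-descent search for a two-parameter χ2(a, b). The function chi2() and the starting point are invented purely for illustration, the gradient is estimated by finite differences, and, for brevity, the bracketing-plus-parabolic line minimization described above is replaced by a simple rule that halves the step whenever a move fails to lower χ2.

#include <stdio.h>
#include <math.h>

/* hypothetical chi-squared surface, used only to illustrate the algorithm */
static double chi2(double a, double b) {
    return (a - 2.0)*(a - 2.0) + 10.0*(b + 1.0)*(b + 1.0);
}

int main(void) {
    double a = 10.0, b = 5.0;           /* initial guess (a0, b0) */
    double h = 1.0e-6, step = 0.5;
    int iter;

    for(iter = 0; iter < 1000; iter++) {
        /* estimate the gradient of chi2 by central finite differences */
        double ga = (chi2(a + h, b) - chi2(a - h, b)) / (2.0*h);
        double gb = (chi2(a, b + h) - chi2(a, b - h)) / (2.0*h);
        double norm = sqrt(ga*ga + gb*gb);
        if(norm < 1.0e-8) break;        /* gradient ~ 0: we are at the minimum */

        /* move along -gradient; halve the step until chi2 actually decreases */
        {
            double anew = a - step*ga/norm;
            double bnew = b - step*gb/norm;
            while(chi2(anew, bnew) >= chi2(a, b) && step > 1.0e-12) {
                step *= 0.5;
                anew = a - step*ga/norm;
                bnew = b - step*gb/norm;
            }
            a = anew;
            b = bnew;
        }
    }
    printf("minimum near a = %lf, b = %lf, chi2 = %lf\n", a, b, chi2(a, b));
    return 0;
}

Notice that once the search is close to the minimum the accepted step keeps shrinking, which is exactly the inefficiency described above.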

However, remember that if we know we're near the minimum, and we can approximate χ2 as parabolic, then we can use the Newton-Raphson method in M dimensions to jump straight to the minimum.

Newton-Raphson Method

First recall a Taylor expansion in many dimensions. In two dimensions, expanded about a point (a°, b°), we have
$$ \chi^2(a,b) = \chi^2(a^\circ,b^\circ) + \frac{\partial \chi^2(a^\circ,b^\circ)}{\partial a}(a-a^\circ) + \frac{\partial \chi^2(a^\circ,b^\circ)}{\partial b}(b-b^\circ) $$
$$ \qquad + \frac{1}{2}\left[ \frac{\partial^2 \chi^2(a^\circ,b^\circ)}{\partial a^2}(a-a^\circ)^2 + 2\frac{\partial^2 \chi^2(a^\circ,b^\circ)}{\partial a\,\partial b}(a-a^\circ)(b-b^\circ) + \frac{\partial^2 \chi^2(a^\circ,b^\circ)}{\partial b^2}(b-b^\circ)^2 \right] + \cdots \qquad (2.37) $$

In M dimensions for χ2(a1, . . . , aM), expanded about (a1°, . . . , aM°), we have
$$ \chi^2(a_1,\ldots,a_M) = \chi^2(a_1^\circ,\ldots,a_M^\circ) + \sum_k \frac{\partial \chi^2(a_1^\circ,\ldots,a_M^\circ)}{\partial a_k}(a_k - a_k^\circ) + \frac{1}{2}\sum_k\sum_l \frac{\partial^2 \chi^2(a_1^\circ,\ldots,a_M^\circ)}{\partial a_k\,\partial a_l}(a_k - a_k^\circ)(a_l - a_l^\circ) + \cdots \qquad (2.38) $$
We can write the expression for $\partial\chi^2(a_1,\ldots,a_M)/\partial a_j$ as
$$ \frac{\partial \chi^2(a_1,\ldots,a_M)}{\partial a_j} = \frac{\partial \chi^2(a_1^\circ,\ldots,a_M^\circ)}{\partial a_j} + \sum_k \frac{\partial^2 \chi^2(a_1^\circ,\ldots,a_M^\circ)}{\partial a_k\,\partial a_j}(a_k - a_k^\circ). \qquad (2.39) $$


Recall that the second derivative of χ2 is our old friend the curvature matrix [α]. In matrix form
this expression is written:
∇χ2 (a) = ∇χ2 (a◦ ) + [α] · (a − a◦ ) (2.40)

So if our current position is a◦ then we can find out how far we need to jump to get to the minimum
amin by setting the gradient to zero

∇χ2 (a) = ∇χ2 (a◦ ) + [α] · (amin − a◦ ) = 0. (2.41)

Solving for amin we get


amin = a◦ − [α]−1 · ∇χ2 (a◦ ) (2.42)
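
As an illustration of equation (2.42), here is a minimal C sketch for two parameters. The χ2 surface below is a made-up quadratic, the gradient and the second derivatives are estimated by finite differences, and the 2×2 matrix is inverted analytically. Because the surface is exactly parabolic, a single jump lands on the minimum. (The full second-derivative matrix is used directly here, so no factor of 1/2 appears.)

#include <stdio.h>

/* hypothetical chi-squared surface, quadratic near its minimum at (1, 4) */
static double chi2(double a, double b) {
    return 3.0*(a - 1.0)*(a - 1.0) + 2.0*(a - 1.0)*(b - 4.0) + 5.0*(b - 4.0)*(b - 4.0);
}

int main(void) {
    double a0 = 0.0, b0 = 0.0, h = 1.0e-4;

    /* gradient and second derivatives of chi2 at (a0, b0) by central differences */
    double ga  = (chi2(a0+h, b0) - chi2(a0-h, b0)) / (2.0*h);
    double gb  = (chi2(a0, b0+h) - chi2(a0, b0-h)) / (2.0*h);
    double Haa = (chi2(a0+h, b0) - 2.0*chi2(a0, b0) + chi2(a0-h, b0)) / (h*h);
    double Hbb = (chi2(a0, b0+h) - 2.0*chi2(a0, b0) + chi2(a0, b0-h)) / (h*h);
    double Hab = (chi2(a0+h, b0+h) - chi2(a0+h, b0-h)
                - chi2(a0-h, b0+h) + chi2(a0-h, b0-h)) / (4.0*h*h);

    /* amin = a0 - [H]^-1 . grad chi2, using the analytic inverse of a 2x2 matrix */
    double det  = Haa*Hbb - Hab*Hab;
    double amin = a0 - ( Hbb*ga - Hab*gb) / det;
    double bmin = b0 - (-Hab*ga + Haa*gb) / det;

    printf("one Newton-Raphson jump lands at a = %lf, b = %lf\n", amin, bmin);
    return 0;
}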

Marquardt Method

So we have two approaches, one for when we’re far away from the minimum, and one for when
we’re close to the minimum.
(1) Far from minimum - use Steepest Descent

(2) Close to minimum - use Newton Raphson

Noting the mathematical similarities between these two approaches


(1) anew = a◦ − positive constant · ∇χ2 (a◦ ),

(2) amin = a◦ − [α]−1 · ∇χ2 (a◦ ),

Marquardt suggested a way to integrate these two approaches into one approach.
Remember, in the one-dimensional case the Newton-Raphson method fails if we are far from
the minimum because the second derivative of χ2 could be negative, and then the method could
actually move away from the minimum. That is why the steepest descent approach uses a positive
constant instead of the second derivative. Once we’re near the minimum and we’re sure that the
second derivative is positive then we can use the second-derivative instead of the positive constant.
In the multi-dimensional case, we can’t use the matrix α if we’re far from the minimum. What
Marquardt realized is that we can still use the α matrix when we are far from the minimum as long
as we make the off-diagonal elements negligible compared to the diagonal elements. Recall that

$$ \alpha_{k,l} = \frac{1}{2}\,\frac{\partial^2 \chi^2}{\partial a_k\,\partial a_l} $$


and since
$$ \chi^2 = \sum_i^N \left[ \frac{y_i - y(x_i,\vec{a})}{\sigma_i} \right]^2 $$
we can calculate
$$ \frac{\partial^2 \chi^2}{\partial a_k\,\partial a_l} = 2\sum_i^N \frac{1}{\sigma_i^2}\left[ \frac{\partial y(x_i,\vec{a})}{\partial a_k}\,\frac{\partial y(x_i,\vec{a})}{\partial a_l} - \bigl[y_i - y(x_i,\vec{a})\bigr]\,\frac{\partial^2 y(x_i,\vec{a})}{\partial a_k\,\partial a_l} \right]. $$

The last term in this expression is generally ignored because the factor [yi − y(xi , ~a)] will vary
randomly and tend to cancel out in the sum over i. Therefore we have

$$ \frac{\partial^2 \chi^2}{\partial a_k\,\partial a_l} = 2\sum_i^N \frac{1}{\sigma_i^2}\left[ \frac{\partial y(x_i,\vec{a})}{\partial a_k}\,\frac{\partial y(x_i,\vec{a})}{\partial a_l} \right]. $$

Thus the diagonal elements of [α] are
$$ \alpha_{j,j} = \sum_i^N \frac{1}{\sigma_i^2}\left[ \frac{\partial y(x_i,\vec{a})}{\partial a_j} \right]^2. $$

Clearly the diagonal elements of [α] must always be positive since every term is squared. In contrast, the off-diagonal elements of [α] are
$$ \alpha_{j,k} = \sum_i^N \frac{1}{\sigma_i^2}\left[ \frac{\partial y(x_i,\vec{a})}{\partial a_j}\,\frac{\partial y(x_i,\vec{a})}{\partial a_k} \right]. $$

This term can be either positive or negative. So it is the off-diagonal terms in the [α] matrix that cause the Newton-Raphson method to fail when far from the minimum. Marquardt's idea was to redefine the [α] matrix as
$$ \alpha'_{j,j} = \alpha_{j,j}(1+\lambda), \qquad \alpha'_{j,k} = \alpha_{j,k} \;\; (j \neq k). \qquad (2.43) $$

This way, when we're far from the minimum we set λ to some large number. The diagonal elements of [α'] then dominate, so when we invert [α'] to get [α']⁻¹ the resulting step is small and points downhill, much like steepest descent, and we can use it to calculate the next set of parameters closer to the minimum,
$$ \vec{a}_{\rm next} = \vec{a}^\circ - [\alpha']^{-1}\cdot\vec{\nabla}\chi^2(\vec{a}^\circ). $$
Once we're near the minimum we set λ nearer to zero, so that [α'] reduces to [α], and the same expression becomes the Newton-Raphson jump,
$$ \vec{a}_{\rm min} = \vec{a}^\circ - [\alpha']^{-1}\cdot\vec{\nabla}\chi^2(\vec{a}^\circ). $$

The Marquardt recipe is


(1) Compute χ2 (a◦ )

(2) Pick a modest value for λ, say λ = 0.001

(3) Solve for [α0 ]−1 , calculate anext , and χ2 (anext ).

(4) If χ2 (anext ) ≥ χ2 (a◦ ) then increase λ by a factor of 10 and go back to step (3).

(5) If χ2 (anext ) < χ2 (a◦ ) then decrease λ by a factor of 10, set a◦ = anext and go back to step
(3).

Once the acceptable minimum has been found, set λ = 0 and compute [α0 ]−1 at the minimum
to get the variance and covariance matrix.
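
To make the recipe concrete, here is a minimal C sketch of a Marquardt fit for a made-up two-parameter model, y(x) = a0·exp(−a1·x), with a constant error σ in y. The model, the data, the starting guess, and the convergence test are all invented for illustration. The step is computed as [α']⁻¹·β with β_k = −½ ∂χ²/∂a_k; the factor of ½ here and the factor of ½ built into the definition of [α] cancel, so near the minimum this is the full Newton-Raphson jump.

#include <stdio.h>
#include <math.h>

#define NPTS 10

/* made-up model y(x) = a0 exp(-a1 x) and its derivatives with respect to a0 and a1 */
static double model(double x, const double a[2]) { return a[0]*exp(-a[1]*x); }
static void dmodel(double x, const double a[2], double dyda[2]) {
    dyda[0] = exp(-a[1]*x);
    dyda[1] = -a[0]*x*exp(-a[1]*x);
}

static double chi2(const double x[], const double y[], double sig, const double a[2]) {
    double s = 0.0; int i;
    for(i = 0; i < NPTS; i++) {
        double r = (y[i] - model(x[i], a))/sig;
        s += r*r;
    }
    return s;
}

/* curvature matrix alpha and the vector beta_k = -(1/2) dchi2/da_k */
static void alpha_beta(const double x[], const double y[], double sig, const double a[2],
                       double alpha[2][2], double beta[2]) {
    int i, j, k;
    for(j = 0; j < 2; j++) { beta[j] = 0.0; for(k = 0; k < 2; k++) alpha[j][k] = 0.0; }
    for(i = 0; i < NPTS; i++) {
        double dyda[2], dy = y[i] - model(x[i], a);
        dmodel(x[i], a, dyda);
        for(j = 0; j < 2; j++) {
            beta[j] += dy*dyda[j]/(sig*sig);
            for(k = 0; k < 2; k++) alpha[j][k] += dyda[j]*dyda[k]/(sig*sig);
        }
    }
}

int main(void) {
    /* made-up data roughly following y = 5 exp(-0.7 x), with sigma = 0.05 */
    double x[NPTS] = {0,1,2,3,4,5,6,7,8,9};
    double y[NPTS] = {5.1,2.4,1.3,0.55,0.33,0.14,0.08,0.05,0.02,0.01};
    double sig = 0.05;
    double a[2] = {1.0, 0.1};                 /* deliberately poor initial guess      */
    double lambda = 0.001;                    /* step (2): a modest starting lambda   */
    double old = chi2(x, y, sig, a);          /* step (1): chi2 at the starting point */
    int iter;

    for(iter = 0; iter < 100; iter++) {
        double alpha[2][2], beta[2], ap[2][2], det, da0, da1, anew[2], cnew;
        alpha_beta(x, y, sig, a, alpha, beta);

        /* boost the diagonal by (1 + lambda), leave off-diagonal elements alone */
        ap[0][0] = alpha[0][0]*(1.0 + lambda);  ap[1][1] = alpha[1][1]*(1.0 + lambda);
        ap[0][1] = alpha[0][1];                 ap[1][0] = alpha[1][0];

        /* step (3): solve [alpha'] . da = beta for the parameter step (2x2 inverse) */
        det = ap[0][0]*ap[1][1] - ap[0][1]*ap[1][0];
        da0 = ( ap[1][1]*beta[0] - ap[0][1]*beta[1]) / det;
        da1 = (-ap[1][0]*beta[0] + ap[0][0]*beta[1]) / det;
        anew[0] = a[0] + da0;  anew[1] = a[1] + da1;
        cnew = chi2(x, y, sig, anew);

        if(cnew >= old) {
            lambda *= 10.0;      /* step (4): chi2 went up, act more like steepest descent */
        } else {
            lambda /= 10.0;      /* step (5): chi2 went down, act more like Newton-Raphson */
            a[0] = anew[0];  a[1] = anew[1];
            if(old - cnew < 1.0e-8) { old = cnew; break; }
            old = cnew;
        }
    }
    printf("best fit: a0 = %lf, a1 = %lf, chi2 = %lf\n", a[0], a[1], old);
    return 0;
}

At convergence one would set λ = 0, rebuild [α'], and invert it to obtain the variance-covariance matrix, just as described above.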

2.5 Confidence Intervals

Once we have the variances of our M model parameters taken from the diagonal elements of the
covariance matrix it is a simple matter to calculate the standard deviations of each parameter and
report them as the 68.3% confidence limits. Or using the z-values in table 1.1 we can report the
confidence limits at any desired confidence level. The assumption implicit in all this is that there
is a linear relationship between the measured parameter (i.e. y) and each of the model parameters
(i.e. aj ). Even if y is a non-linear function of the model parameters this approach can still be valid
as long as the distribution in each of the model parameters is over a range where y is approximately
linearly related to each of the parameters.
In general, however, it may be necessary to construct the distribution for each of the model
parameters to confirm that each is indeed Gaussian. How is that done? Remember that maximum
likelihood was about finding the most probable set of parameters consistent with our model.

$$ \rho(a_1, a_2, \ldots, a_M) = \prod_i^N \rho\bigl(y_i - y(x_i, a_1, a_2, \ldots, a_M)\bigr). $$

This probability density describes the probability distribution for our model parameters. With our
assumption of Gaussian distributed errors in y it becomes
$$ \rho(a_1, a_2, \ldots, a_M) = \left\{ \prod_i^N \frac{1}{\sigma_i\sqrt{2\pi}} \right\}\cdot \exp\left[ -\frac{1}{2}\chi^2(a_1, a_2, \ldots, a_M) \right], $$


normalized so that the volume under the distribution is unity. For example, in a two parameter problem we have
$$ \rho(a,b) = \left\{ \prod_i^N \frac{1}{\sigma_i\sqrt{2\pi}} \right\}\cdot \exp\left[ -\frac{1}{2}\chi^2(a,b) \right]. $$
To get the probability distribution for a we only need to integrate over all values of b, that is,
$$ \rho(a) = \left\{ \prod_i^N \frac{1}{\sigma_i\sqrt{2\pi}} \right\}\cdot \int_{-\infty}^{\infty} \exp\left[ -\frac{1}{2}\chi^2(a,b) \right] db. \qquad (2.44) $$
Likewise, to get the probability distribution for b we only need to integrate over all values of a,
$$ \rho(b) = \left\{ \prod_i^N \frac{1}{\sigma_i\sqrt{2\pi}} \right\}\cdot \int_{-\infty}^{\infty} \exp\left[ -\frac{1}{2}\chi^2(a,b) \right] da. \qquad (2.45) $$

Once you have the individual probability distribution functions for a and b you can obtain the
confidence limits just as we did earlier.

2.5.1 Constant Chi-Squared Boundaries as Confidence Limits


When you’re dealing with more than one model parameter (which is often the case), it is convenient
to use confidence ellipses (or hyper-ellipses). A natural choice for such ellipses are the χ2 contour
boundaries. To determine the relationship between χ2 contour boundaries and percent confidence
it is useful to first consider the case of a single parameter problem, i.e., y = aX(x). In this case
the distribution in a is given by
$$ p(a) = \left\{ \prod_i^N \frac{1}{\sigma_i\sqrt{2\pi}} \right\}\cdot \exp\left[ -\frac{1}{2}\chi^2(a) \right]. \qquad (2.46) $$

We could also write this as
$$ p(a) = \frac{1}{\sigma_a\sqrt{2\pi}}\cdot \exp\left[ -\frac{1}{2}\left(\frac{a-\bar{a}}{\sigma_a}\right)^2 \right], \qquad (2.47) $$
where $\bar{a}$ is the best-fit (minimum-χ²) value of a.

I can make these two expressions more alike if I rewrite the first one as
$$ p(a) = \exp\left[ -\frac{1}{2}\chi^2_{\rm min}(a) \right]\cdot \left\{ \prod_i^N \frac{1}{\sigma_i\sqrt{2\pi}} \right\}\cdot \exp\left[ -\frac{1}{2}\Delta\chi^2(a) \right], \qquad (2.48) $$
where $\Delta\chi^2(a) = \chi^2(a) - \chi^2_{\rm min}(a)$. Written this way I can equate the terms in and out of the brackets to get
$$ \frac{1}{\sigma_a\sqrt{2\pi}} = \exp\left[ -\frac{1}{2}\chi^2_{\rm min}(a) \right]\cdot \prod_i^N \frac{1}{\sigma_i\sqrt{2\pi}} \qquad (2.49) $$


and
$$ \exp\left[ -\frac{1}{2}\left(\frac{a-\bar{a}}{\sigma_a}\right)^2 \right] = \exp\left[ -\frac{1}{2}\Delta\chi^2(a) \right]. \qquad (2.50) $$
Let's focus on what this last equality tells us. As expected, when $a = \bar{a}$ then $\Delta\chi^2(a=\bar{a}) = 0$. When we move plus or minus one standard deviation σa away from $\bar{a}$ we expect $\Delta\chi^2(a = \bar{a} \pm \sigma_a) = 1$. Or reversing this result, we can say that the values of a away from $\bar{a}$ that produce a change in χ² of one are the 68% confidence limits for a. Likewise, two standard deviations away gives $\Delta\chi^2(a = \bar{a} \pm 2\sigma_a) = 4$, or stated in reverse, values of a away from $\bar{a}$ that produce a change in χ² of four are the ≈ 95% confidence limits for a.
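
Here is a minimal C sketch of this idea for a one-parameter fit: step a away from its best-fit value until χ²(a) has risen by 1 above χ²_min and report the two crossing points as the 68.3% confidence limits. The χ²(a) curve below is hypothetical; in practice it would be evaluated from your own data and model.

#include <stdio.h>

/* hypothetical one-parameter chi-squared curve with its minimum at a = 3.5 */
static double chi2(double a) { return 12.0 + (a - 3.5)*(a - 3.5)/0.04; }

int main(void) {
    double abest = 3.5, chimin = chi2(3.5), da = 1.0e-4;
    double hi = abest, lo = abest;

    /* walk outward from the best-fit value until chi2 has risen by 1 */
    while(chi2(hi) - chimin < 1.0) hi += da;
    while(chi2(lo) - chimin < 1.0) lo -= da;

    printf("68.3%% confidence interval: %lf < a < %lf\n", lo, hi);
    /* for this made-up curve the limits are abest +/- 0.2, i.e. sigma_a = 0.2 */
    return 0;
}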
One needs to be careful when trying to quickly generalize this result. For example, when one similarly equates the bivariate Gaussian distribution to the maximum likelihood distribution for the two parameter problem, one obtains
$$ \exp\left\{ -\frac{1}{2(1-r_{ab}^2)}\left[ \left(\frac{a-\bar{a}}{\sigma_a}\right)^2 + \left(\frac{b-\bar{b}}{\sigma_b}\right)^2 - 2 r_{ab}\left(\frac{a-\bar{a}}{\sigma_a}\right)\left(\frac{b-\bar{b}}{\sigma_b}\right) \right] \right\} = \exp\left[ -\frac{1}{2}\Delta\chi^2(a,b) \right]. \qquad (2.51) $$
Although it’s more involved, one can show that values of a and b away from a and b that produce a
change in χ2 by six mark the ≈ 95% confidence limits for a and b. A general summary of how ∆χ2
should change to include a desired probability for a given number of model parameters is given in
table 2.1.

Number of                    Confidence Level
Parameters     50%     70%     90%     95%     99%
1 0.46 1.07 2.70 3.84 6.63
2 1.39 2.41 4.61 5.99 9.21
3 2.37 3.67 6.25 7.82 11.36
4 3.36 4.88 7.78 9.49 13.28
5 4.35 6.06 9.24 11.07 15.09
6 5.35 7.23 10.65 12.59 16.81
7 6.35 8.38 12.2 14.07 18.49
8 7.34 9.52 13.36 15.51 20.09
9 8.34 10.66 14.68 16.92 21.67
10 9.34 11.78 15.99 18.31 23.21
11 10.34 12.88 17.29 19.68 24.71

Table 2.1: ∆χ2 as a function of confidence Level and number of parameters.


2.5.2 Getting Parameter Errors from χ2 with Non-Gaussian Errors

There are situations, despite the central limit theorem, where the errors in your data are not
governed by a Gaussian parent distribution. In this case the χ2 function is still considered to be
a reasonable merit function for finding the best-fit model parameters. For determining the errors
in our model parameters, however, none of the previously discussed approaches are going to be
valid. In this situation we can use our χ2 minimization algorithms and one of the two approaches described below to determine the errors in the model parameters.
If experiments were cheap one could simply measure the entire (xi, yi) data set again and again and fit each new (xi, yi) data set for the model parameters.

Experimental Data Set 0  −−(χ² fit)−→  fitted parameters ~a(0)     (2.52)
Experimental Data Set 1  −−(χ² fit)−→  fitted parameters ~a(1)
Experimental Data Set 2  −−(χ² fit)−→  fitted parameters ~a(2)
        ⋮
Experimental Data Set n  −−(χ² fit)−→  fitted parameters ~a(n)

At the end of this process the n vectors of best fit parameters can be used to construct a multi-
dimensional histogram of occurrence versus each of the M model parameters in the vector of best
fit parameters ~a. Just as with the chi-squared probability we can project this multi-dimensional
histogram onto a particular parameter axis to get the probability histogram for that parameter.
If experiments are not cheap (which is usually the case) then one of the two approaches described
below can be used to obtain the model parameter errors.

Monte Carlo Simulations

One approach for extracting the uncertainty in best fit parameters is to do a Monte Carlo sim-
ulation of synthetic data sets. In this approach you start with the raw data and perform a χ2 fit
using a given model to get a vector of best fit parameters ~a(0) , where the vector has M elements
corresponding to the M different parameters in the model used to describe the data. You then use
these best fit parameters and the model to simulate a synthetic data set. To this data set you then
add noise based on whatever parent distribution is appropriate, to obtain a noisy synthetic
data set. You then fit this synthetic data set to obtain a vector of best fit parameters ~a(1) .


Experimental Data Set  −−(χ² fit)−→  parameters ~a(0)
  −−(simulate data with model and ~a(0))−→  Synthetic Data Set
  −−(add noise w/ desired distribution)−→  Noisy Synthetic Data Set
  −−(χ² fit)−→  Monte Carlo parameters ~a(1)

If you repeat the procedure n times, then you will have n vectors of best fit parameters obtained
from n synthetic noisy data set, all of which were simulated using the vector of best fit parameters
obtained from the original raw data. These n vectors of best fit parameters can be used to construct
a multi-dimensional histogram of occurrence versus each of the M model parameters in the vector of
best fit parameters ~a. Just as with the chi-squared probability we can project this multi-dimensional
histogram onto a particular parameter axis to get the probability histogram for that parameter.
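
Here is a minimal C sketch of the procedure for the simple case of a straight-line model with Gaussian noise; the "best fit" values a0, b0 and the noise level sigma_y below are made-up stand-ins for values obtained from a fit to real data. For brevity the sketch reports the mean and standard deviation of the refitted slope instead of building the full histogram.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NPTS  20
#define NSETS 1000

/* one sample from a zero-mean, unit-variance Gaussian (Box-Muller method) */
static double gauss(void) {
    const double pi = 3.141592653589793;
    double u1 = (rand() + 1.0)/(RAND_MAX + 2.0);
    double u2 = (rand() + 1.0)/(RAND_MAX + 2.0);
    return sqrt(-2.0*log(u1))*cos(2.0*pi*u2);
}

/* unweighted least-squares slope of y = a + b x */
static double fit_slope(const double x[], const double y[], int n) {
    double Sx = 0, Sy = 0, Sxx = 0, Sxy = 0;
    int i;
    for(i = 0; i < n; i++) { Sx += x[i]; Sy += y[i]; Sxx += x[i]*x[i]; Sxy += x[i]*y[i]; }
    return (n*Sxy - Sx*Sy)/(n*Sxx - Sx*Sx);
}

int main(void) {
    double a0 = 1.0, b0 = 0.5, sigma_y = 0.2;   /* pretend best-fit parameters a(0)   */
    double x[NPTS], y[NPTS], slope[NSETS];
    double mean = 0.0, var = 0.0;
    int i, k;

    for(i = 0; i < NPTS; i++) x[i] = i;

    for(k = 0; k < NSETS; k++) {
        /* synthetic data = model evaluated with a(0), plus noise from the parent distribution */
        for(i = 0; i < NPTS; i++) y[i] = a0 + b0*x[i] + sigma_y*gauss();
        slope[k] = fit_slope(x, y, NPTS);       /* refit each synthetic data set */
    }

    for(k = 0; k < NSETS; k++) mean += slope[k];
    mean /= NSETS;
    for(k = 0; k < NSETS; k++) var += (slope[k] - mean)*(slope[k] - mean);
    var /= (NSETS - 1);
    printf("Monte Carlo slope: %lf +/- %lf\n", mean, sqrt(var));
    return 0;
}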

Bootstrap Method

If you don’t know that your errors are distributed according to a Gaussian, Lorentzian, Poisson, . . .,
then the Monte Carlo Simulation might give you an incorrect distribution for model parameters.
The Bootstrap Approach is similar to the Monte Carlo simulation approach but it doesn’t assume
any underlying distribution. The basic idea is that information concerning the parent distribution is
already present in your raw data, and can be used to extract the distribution in model parameters.
The procedure is to create new data sets from the original by randomly drawing N data points with replacement from the raw data set. Because the drawing is done with replacement you do not simply get back the original data set: each resampled set will contain duplicate points, and on average about 37% of the original points will be left out of any given resample, but this doesn't matter. You simply fit all of these data sets and get a set of model parameters which are then histogrammed into a multidimensional plot. As before, the projection onto a particular axis gives you the probability distribution for that parameter.
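
Here is a minimal C sketch of the resampling step, again using a straight-line fit as a stand-in for whatever model and χ² routine you actually use; the data values are invented for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NPTS  20
#define NSETS 1000

/* unweighted least-squares slope of y = a + b x */
static double fit_slope(const double x[], const double y[], int n) {
    double Sx = 0, Sy = 0, Sxx = 0, Sxy = 0;
    int i;
    for(i = 0; i < n; i++) { Sx += x[i]; Sy += y[i]; Sxx += x[i]*x[i]; Sxy += x[i]*y[i]; }
    return (n*Sxy - Sx*Sy)/(n*Sxx - Sx*Sx);
}

int main(void) {
    /* the measured (x, y) data set; made-up numbers stand in for real data here */
    double x[NPTS], y[NPTS], xb[NPTS], yb[NPTS], slope[NSETS];
    double mean = 0.0, var = 0.0;
    int i, k;

    for(i = 0; i < NPTS; i++) { x[i] = i; y[i] = 2.0 + 0.3*i + 0.1*((i*37) % 11 - 5); }

    for(k = 0; k < NSETS; k++) {
        /* draw NPTS points from the original data WITH replacement; */
        /* some points appear more than once, others are left out    */
        for(i = 0; i < NPTS; i++) {
            int j = rand() % NPTS;
            xb[i] = x[j];
            yb[i] = y[j];
        }
        slope[k] = fit_slope(xb, yb, NPTS);    /* refit each resampled data set */
    }

    for(k = 0; k < NSETS; k++) mean += slope[k];
    mean /= NSETS;
    for(k = 0; k < NSETS; k++) var += (slope[k] - mean)*(slope[k] - mean);
    var /= (NSETS - 1);
    printf("bootstrap slope: %lf +/- %lf\n", mean, sqrt(var));
    return 0;
}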

2.6 Problems
1. What three things should a useful fitting procedure provide?

2. Choose and describe one approach to extract confidence limits from a least squares fit that
uses a model function that is non-linear in parameters. What assumptions would you be
making in your chosen approach?

3. What’s the main advantage(s) of a downhill simplex method over the Marquardt method for
minimizing χ2 ? When would you choose the simplex method over the Marquardt method,
and the Marquardt method over the simplex method?


4. Explain how Marquardt combined the steepest descent method with the Newton-Raphson
method for minimizing χ2 .

5. What is the definition of chi-squared reduced and what values of chi-squared reduced are
indicative of a reasonable fit?

6. Below a simplex is given with three vertices labeled a, b, and c. Make a copy of this figure
and draw the position of the next simplex vertex and label it d. Using this new simplex draw
the next simplex vertex and label it e. Continue drawing the next two simplexes labeling
the new vertices f and g. Stop at vertex g. Do not continue to the minimum!

[Figure: χ2 contour plot (contour levels labeled 1 through 6) with the three initial simplex vertices a, b, and c marked.]


7. In a general linear least squares fit taking the derivative of χ2 (a) with respect to all the
parameters, a, and setting these derivatives equal to zero yields a set of simultaneous linear
equations which can be written in matrix form as:

$$ \vec{\beta} = [\alpha]\cdot\vec{a}, $$
where
$$ \beta_j = \sum_i^N y_i X_j(x_i)/\sigma_i^2, $$
$\vec{a}$ is the vector of best-fit parameters, and
$$ \alpha_{j,k} = \frac{1}{2}\,\frac{\partial^2 \chi^2(\vec{a})}{\partial a_j\,\partial a_k} = \sum_i^N X_j(x_i)X_k(x_i)/\sigma_i^2. $$

The matrix [α] is called the curvature matrix. Explain, in general, how this set of simultaneous
equations is solved for the parameters, a, and their variances and covariances.

8. You have a data set of photon counts versus time in a fluorescence decay experiment. You want to fit this data to a non-linear model and extract a set of parameters. You also
know that the uncertainty (i.e., the noise) in your photon counts comes from a Poisson noise
distribution. Explain how you would extract the confidence limits for the non-linear model
parameters.

9. Draw examples of linear experimental data where the uncertainties in slope and intercept will
have (a) a high correlation coefficient, and (b) a low correlation coefficient.

10. Explain the justification for the χ2 function as the merit function when modeling data. When might it not be justified as a merit function?

11. When an (x, y) data set is fit to a straight line, i.e., y = mx + b, two linear correlation coefficients are usually calculated, i.e., $r^2_{m,b}$ and $r^2_{x,y}$. Give a general explanation of the linear correlation coefficient, and then explain the difference between $r^2_{m,b}$ and $r^2_{x,y}$.

12. When performing non-linear least squares analyses why is there a possibility of more than one
χ2 minimum, whereas in linear least squares analyses there can only be one χ2 minimum?

13. Explain how the Steepest Descent and Newton-Raphson Methods work, and the situations
in which each is the optimal approach.

14. What is the parameter λ in the Marquardt approach to χ2 minimization? How would you
vary this parameter when performing a χ2 minimization?


15. A useful fitting program should provide (1) the best fit parameters, (2) the errors in the best
fit parameters, and (3) a measure of the goodness of fit. If you measure an (xi , yi ) data set
but do not know the error in yi (i.e., σyi ) what assumptions could you make so that your
fitting program is still useful? What would be the limitations, if any, of your fitting program in this circumstance?

16. Below is an output from a Marquardt fitting routine of three Gaussian lineshapes.
Number of Sites: 3
Area1: 0.0222417
Position1: -30003.1
Linewidth1: 2479.92
Area2: 0.022251
Position2: 20000.5
Linewidth2: 2477.39
Area3: 0.0183341
Position3: 30000.5
Linewidth3: 2998.16

COVARIANCE Area1: Position1: Linewidth1: Area2: Position2: Linewidth2: Area3: Position3: Linewidth3:
Area1: 1.27778e-09 -1.33668e-11 -9.49809e-05 0 0 0 0 0 0
Position1: -1.33668e-11 10.5903 2.17454e-06 0 0 0 0 0 0
Linewidth1: -9.49809e-05 2.17454e-06 21.1806 0 0 0 0 0 0
Area2: 0 0 0 1.28097e-09 -1.96165e-07 -9.5499e-05 -1.86807e-11 -2.46691e-06 7.63039e-06
Position2: 0 0 0 -1.96165e-07 10.5911 0.0583074 2.12156e-06 0.228281 -0.818555
Linewidth2: 0 0 0 -9.5499e-05 0.0583074 21.3068 5.73072e-06 0.700233 -2.28917
Area3: 0 0 0 -1.86807e-11 2.12156e-06 5.73072e-06 1.05893e-09 2.38823e-07 -0.000116023
Position3: 0 0 0 -2.46691e-06 0.228281 0.700233 2.38823e-07 18.8708 -0.0946504
Linewidth3: 0 0 0 7.63039e-06 -0.818555 -2.28917 -0.000116023 -0.0946504 38.0012

(a) Report the three lineshape areas with 95% confidence limits.
(b) Explain why some off-diagonal elements are zero while others are not. Feel free to draw
examples of the lineshapes to support your arguments.
(c) Calculate the linear correlation coefficients between all three best fit lineshape areas.

17. Given an experimental (x, y) data set that is known to follow the model function

y(x) = A · exp(−b · x) (2.53)

draw examples of experimental (x, y) data sets where the uncertainties in A and b would be
expected to have (a) a high linear correlation coefficient and (b) a low linear correlation
coefficient.

18. Describe the approach you would use to extract confidence limits for the parameters obtained
in a least squares fit if you knew the error in your measurements did not arise from a Gaussian
parent distribution.


19. On the contour surface below are three vertices labeled a, b, and c. Draw the position of the next simplex vertex and label it d. Using this new simplex, draw the next simplex vertex and label it e. Continue drawing the next two simplexes, labeling the new vertices f and g. Stop at vertex g. Do not continue to the minimum!

[Figure: χ2 contour plot (contour levels labeled 5 through 80) with the three initial simplex vertices a, b, and c marked.]


20. On the contour surface below is a single point labeled (a0 , b0 ). From this point draw the path
taken by the steepest descent algorithm to the minimum.

[Figure: χ2 contour plot (contour levels labeled 5 through 80) with the starting point (a0,b0) marked.]



Appendix A

Computers

Computers have become very sophisticated devices over the last couple decades. You certainly
don’t need to know all the details of how computers work to use them, but you should be familiar
with certain concepts, especially when you’re doing scientific calculations on a computer, or working
with a computer interfaced to scientific equipment in the laboratory.

Simplified Computer Schematic


Figure A.1 shows a block schematic of a basic computer. Central to this computer design is the bus. The computer bus is analogous to a highway along which all the information being input, processed, and output travels.

[Figure: a bus connects the Central Processing Unit (where data are added, subtracted, multiplied, divided, etc.), Random Access Memory (where data are stored before and after calculations; transfers between RAM and CPU over the bus take only nanoseconds; contents are usually lost when power is turned off), Read Only Memory (permanent storage of the minimum data the computer needs to function), the hard disk and CD-ROM (the least expensive data storage devices; transfers take a few milliseconds), the display, keyboard, printer and other output devices (the interface to the user), and the A/D and D/A converters and digital interface (the interface to the experiment).]

Figure A.1: Block diagram of a computer.


A major breakthrough in computer design is attributed to Von Neumann, who suggested that
the numbers stored in memory be considered as both data and instructions for the CPU. Each
number stored in memory is given an address (i.e., another number that labels the location of a
number in memory). For example,

address contents
0001 0327
0002 0529
0003 1643
0004 0528
0005 0213
.. ..
. .
0105 8721

The contents include both data and instructions. A CPU usually executes instructions stored in sequential addresses until it reaches an instruction that makes it jump to a different location (address) in memory and start executing the instructions stored there.

How the CPU instructions are coded as numbers in memory depends entirely on what kind of
CPU it is, i.e., Pentium, PowerPC, R8000, etc... How the CPU data are coded as numbers and
characters in memory depends somewhat on the type of CPU, but there are some generally agreed
upon standards. Let’s first look at how binary numbers work, then we’ll look at how real world
data is usually represented in computers.

The Bit: The fundamental unit of memory in a computer is the binary digit or bit. It has two
possible values 0 or 1. A single bit can be used to represent a variable that has only two states,
like TRUE and FALSE, or ON and OFF, or YES and NO, etc...

Integers: To represent integer numbers other than 1 or 0 you need to use more bits. For example,


binary decimal
0 0
1 1
10 2
11 3
100 4
101 5
110 6
111 7
1000 8
1001 9
1010 10
.. ..
. .

This binary representation of positive integer decimals is an agreed upon standard. Obviously, to
represent bigger numbers you will need more bits. To convert a binary number to a decimal number
you associate each bit with a power of 2. For example, each bit in the binary number 10110011₂ is associated with a power of 2 as follows:

    bit:         1    0    1    1    0    0    1    1
    power of 2:  2⁷   2⁶   2⁵   2⁴   2³   2²   2¹   2⁰

To convert 10110011₂ into a decimal number you multiply each bit by its associated power of 2. That is,

    10110011₂ = 1 × 2⁷ + 0 × 2⁶ + 1 × 2⁵ + 1 × 2⁴ + 0 × 2³ + 0 × 2² + 1 × 2¹ + 1 × 2⁰
              = 128 + 0 + 32 + 16 + 0 + 0 + 2 + 1 = 179₁₀

A useful concept is the byte. A byte is a group of 8 bits. With 1 byte you can represent a positive integer between 0 and 2⁸ − 1, that is, between 0 and 255.
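
The conversion above is easy to code. Here is a short C sketch that converts a string of bits into a decimal integer; each new bit doubles the running value (shifting the previous bits up by one power of 2) before being added.

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *bits = "10110011";            /* the binary number 10110011 (base 2) */
    int i, n = (int)strlen(bits), value = 0;

    for(i = 0; i < n; i++)
        value = 2*value + (bits[i] - '0');    /* multiply by 2, then add the next bit */

    printf("%s (base 2) = %d (base 10)\n", bits, value);   /* prints 179 */
    return 0;
}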

Signed Integer Numbers The approach shown so far works fine for positive integers. To
negate an integer you take the two’s complement of the integer. Before we learn about the two’s
complement, let’s learn about the one’s complement. To make a one’s complement of a number
you simply change all the ones to zeroes, and all the zeroes to ones. For example,
1’s cmpl.
01011011 −→ 10100100


To take the two’s complement of a number simply add one to the one’s complement of a number.
That is,

1’s cmpl. +1
01011011 −→ 10100100 −→ 10100101

Now to represent, for example, −9 you start with 9, or 00001001₂, and take the two's complement to get −9, or 11110111₂. In general, when using this approach the leftmost bit is always 1 for negative numbers and always 0 for positive numbers.
As you would expect for a negative number, if you keep adding one to a negative number it will eventually become positive. For example, −1 is represented in binary as 11111111₂. If you add one to this binary number you get

        1 1 1 1 1 1 1 1
      +               1
      -----------------
    (1) 0 0 0 0 0 0 0 0
     ^-- this ninth bit is lost

Since a byte only has 8 bits the carry bit gets lost, and the stored number is zero, as expected for −1 + 1 = 0.
This raises an important point. For an 8-bit, or 1-byte, signed integer the minimum and maximum integers you can represent are −128 (10000000₂) and +127 (01111111₂), respectively. Therefore, when you write a program and declare a variable as a signed integer that is 1 byte long, you have to be careful you don't go outside these limits. Most programming languages won't tell you if you do.
If you need to work with bigger integers, the solution, of course, is to use more bytes for your
variables.
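
The short C sketch below illustrates both points. Note that the behavior of out-of-range signed values is formally implementation-defined in C; the comments describe what happens on the usual two's-complement machines.

#include <stdio.h>

int main(void) {
    signed char c = 127;              /* the largest value a 1-byte signed integer can hold */
    unsigned char nine = 9, minus_nine;

    /* adding 1 overflows the byte; on two's-complement machines the */
    /* stored bit pattern 10000000 is interpreted as -128            */
    c = c + 1;
    printf("127 + 1 stored in one signed byte = %d\n", (int)c);

    /* the two's complement of 9: flip every bit, then add one */
    minus_nine = (unsigned char)(~nine + 1);
    printf("two's complement of 00001001 is the bit pattern 11110111 (value %u as an unsigned byte)\n",
           (unsigned)minus_nine);
    return 0;
}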

Alphanumeric Characters: Let's look at how alphanumeric characters are represented. The agreed upon system is the ASCII representation. In this system alphanumeric characters are represented using 1 byte. For example,


character    ASCII code (decimal)
space        32
A            65
B            66
C            67
D            68
...          ...
a            97
b            98
c            99
...          ...
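
A short C sketch showing that a char really is stored as its numerical ASCII code; printing the same variable with %c and %d displays the character and its code.

#include <stdio.h>

int main(void) {
    char c;
    /* a char holds the 1-byte ASCII code; %c prints the character, %d prints the code */
    for(c = 'A'; c <= 'E'; c++)
        printf("character %c has ASCII code %d\n", c, c);
    printf("character %c has ASCII code %d\n", 'a', 'a');
    return 0;
}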

Floating point numbers: Floating point number representations vary somewhat from CPU to CPU. The general scheme goes something like this: the stored bit pattern, e.g.,

    1 0 1 1 0 0 1 1 0 0 1 1 1 1 1 1

is divided into a sign bit, a group of exponent bits, and a group of mantissa bits.

The critical issue is to make sure your program represents your numbers with more than enough significant figures. Always check your programming language for the precision associated with all types of floating point numbers.

Roundoff Errors When dealing with floating point calculations there is always error associated
with roundoff. For example, if a computer only stored real numbers to two significant figures,
then the product of 2.1 × 3.2 would be stored as 6.7 instead of 6.72. These errors accumulate as
the number of mathematical operations increases. One useful hint to avoid significant roundoff
errors is to avoid subtracting numbers that are nearly the same in magnitude. For example,
2.9876 − 2.9871 = 0.0005. The last digit could get rounded off and the answer might be zero in the
computer.
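
A short C sketch that makes both effects visible in single precision: the subtraction of two nearly equal numbers loses most of its significant figures, and repeatedly adding 0.1 (which has no exact binary representation) accumulates roundoff error.

#include <stdio.h>

int main(void) {
    /* single precision keeps only about 7 significant figures */
    float a = 2.9876f, b = 2.9871f;
    float diff = a - b;                     /* the exact answer is 0.0005 */
    float sum = 0.0f;
    int i;

    printf("2.9876 - 2.9871 in single precision = %.7e\n", diff);

    for(i = 0; i < 10; i++) sum += 0.1f;    /* 0.1 is not exact in binary */
    printf("0.1 added ten times = %.8f\n", sum);
    return 0;
}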

A.1 Problems
1. Express the unsigned binary (i.e. base 2) number 01011011₂ as a decimal (i.e. base 10) number.

2. How would most computers represent −1₁₀ as a 1-byte binary (i.e. base 2) number?


3. What is the maximum integer that can be represented with 4 bytes?

4. What is the maximum signed integer that can be represented with 8 bytes? What is the
minimum?



Appendix B

Programming in C - A quick tutorial

First and foremost I recommend that you have a copy of “The C Programming Language” by Kernighan and Ritchie. You may have others, but this book is essential.
C is a very easy language to learn to write. Unfortunately, it can be a very difficult language to read, depending on who's writing it. Since I strongly advocate writing programs that are easy to read, first and foremost I suggest that you use lots of comments!

Comments

Comments in C start with the characters /* and end with the characters */. For example,

/* This line is a comment */

#include

In practically every C program that does input or output you will find the line

#include "stdio.h"

at the beginning of the program. In general, an #include "filename" statement means to include
all the information that’s in the file “filename” into your source code at this point. In this example,
“stdio.h” is a file that tells your program about all the possible input/output subroutines that it
can call. Without this line your program would not know how to print, or open and close a file.
Sometimes you will see

#include <stdio.h>


where the filename is delimited by < and > symbols instead of quotes. This tells the C compiler
to look in the directory of standard C libraries for the file. If you use quotes the C compiler will
first look in the directory where your source code resides before going to the C compiler directory.

main()

Practically every C program has a function called main(), where execution begins. It is written

main() {

/* Source code goes between the curly braces */

}

For example, the classic program Hello, World is written

#include <stdio.h>
main() {

printf("Hello, World\n");

}
printf is a standard C library output function that prints the characters between the quotes.
The “\n” represents the newline character, and is printed as a newline not as the characters “\n”.
In C each statement is terminated by a semicolon, “;”.

Another Example

Now let's look at a C program that calculates a χ2 minimization best fit of (xi, yi) data to a straight line y = a + bx based on the equations:
$$ a = \frac{S_{xx}S_y - S_x S_{xy}}{\Delta}, \qquad b = \frac{S S_{xy} - S_x S_y}{\Delta}, \qquad \Delta = S S_{xx} - S_x^2, $$
$$ \sigma_a^2 = S_{xx}/\Delta, \qquad \sigma_b^2 = S/\Delta, \qquad \sigma_{ab}^2 = -S_x/\Delta, $$
where
$$ S = \sum_{i=1}^{N} \frac{1}{\sigma_i^2}, \quad S_x = \sum_{i=1}^{N} \frac{x_i}{\sigma_i^2}, \quad S_{xx} = \sum_{i=1}^{N} \frac{x_i^2}{\sigma_i^2}, \quad S_y = \sum_{i=1}^{N} \frac{y_i}{\sigma_i^2}, \quad S_{xy} = \sum_{i=1}^{N} \frac{x_i y_i}{\sigma_i^2}. $$


#include <stdio.h>/* You need to include this file to do input/output */


#include <math.h>/* You need to include this file to call math functions, like cos() or sqrt() */

main() {

/* First you need to declare all the variables and their types that you will use in your C program */

/* This line defines an array of type char that contains 32 characters (i.e. 32 Bytes) */
/* In this case the variable filename will hold the name of the file containing our data sets */
char filename[32];

/* This line defines a file variable. */


/* In C you need a file variable for every file you open to identify it */
FILE *fp;

/* This line defines two integer variables npts and i */


int npts, i;

/* This line defines two floating point variable arrays of double precision */
/* In this case the arrays contain 200 elements each */
double x[200], y[200];

/* These lines define other real variables used in the program */

double sigma_y, variance_y;


double S, Sx, Sy, Sxx, Sxy, Delta;
double a, b, sigma_a, sigma_b;
double variance_a, variance_b, variance_ab, chi, r2;
double aa, bb, probability;

/* Variable declaration is Done, now we get into the program */


/*************************************************************/

/* Inform user what this program does. */


printf("Least Squares fit of (x,y) data to a line y(x) = a + b x.\n\n");

/* Get filename containing (x,y) data set from user. */


/* Scan in a string, put first character in filename[0], and continue from there */
printf("What is the name of the file containing the (x,y) data?");
scanf("%s",&filename[0]);

/* Open filename containing (x,y) data set. We use the variable fp to identify our opened file. */
fp = fopen(filename,"r");
if(fp == 0) { /* If fopen returns zero, then file open failed. */
printf("Cannot open file %s\n",filename); /* If unopenable, tell user, then quit */
return(0);
}


/* Get total number of (x,y) points from user. */


printf("How many (x,y) points are contained in the data? ");
scanf("%d",&npts);

/* Get standard deviation in y from user. */


printf("What is the standard deviation in y? ");
scanf("%lf",&sigma_y);

/* Calculate Variance from Standard Deviation */


variance_y = sigma_y*sigma_y;

/* Check if this number is too low or too high. */


if((npts<3)||(npts>200)) {
printf("Number of points must be between 3 and 200\n\n");
return(0);
}

/* Read in the (x,y) data set into your arrays x[i] and y[i]. */
/* Print to verify you are reading data correctly */
printf("\nHere is the (x,y) data set\n");
for(i=0;i<npts;i=i+1) {
fscanf(fp,"\t%lf\t%lf",&x[i],&y[i]);
printf("x = %lf, \t y = %lf \n",x[i],y[i]);
}
/* Be polite - Always close the file after you’re done */
fclose(fp);

/* All the input data is ready, now we start the calculation */


/*************************************************************/

/* Calculate all the defined variables S, Sx, Sy, Sxx, and Sxy */
/* First zero all variables before loop */
S = Sx = Sy = Sxx = Sxy = 0.0;
for(i=0;i<npts;i=i+1) {
S = S + 1/variance_y;
Sx = Sx + x[i]/variance_y;
Sy = Sy + y[i]/variance_y;
Sxx = Sxx + x[i]*x[i]/variance_y;
Sxy = Sxy + x[i]*y[i]/variance_y;
}

printf("\n\nHere are the sums\n");


printf(" S = %lf\n", S);
printf(" Sx = %lf\n", Sx);
printf(" Sy = %lf\n", Sy);
printf(" Sxx = %lf\n",Sxx);
printf(" Sxy = %lf\n",Sxy);

/* Calculate Delta, intercept a, and slope b */


Delta = S * Sxx - Sx * Sx;


a = (Sxx * Sy - Sx * Sxy) / Delta;


b = (S * Sxy - Sx * Sy) / Delta;

/* Print out best fit parameters a and b */


printf("\n\nBest Fit parameters are ...\n\n");
printf("Intercept a = %lf\n",a);
printf("Slope b = %lf\n",b);

/* Calculate the Variance and Standard Deviation in a */


variance_a = Sxx/Delta;
sigma_a = sqrt(variance_a);

/* Calculate the Variance and Standard Deviation in b */


variance_b = S/Delta;
sigma_b = sqrt(variance_b);

/* Calculate the Covariance in a and b */


variance_ab = -Sx/Delta;

/* Print out Variance and Standard Deviation in a and b */


printf("\nVariance in a and b are ...\n\n");
printf("variance in a = %lf\n",variance_a);
printf("variance in b = %lf\n",variance_b);
printf("cvariance between a and b = %lf\n",variance_ab);
printf("\nStandard Deviation in a and b are ...\n\n");
printf("std. dev. in a = %lf\n",sigma_a);
printf("std. dev. in b = %lf\n",sigma_b);

/* Calculate chi squared reduced for this fit. */


chi = 0.0;
for(i=0;i<npts;i=i+1)
chi = chi + (y[i] - a - b* x[i])*(y[i] - a - b* x[i])/variance_y;

/* Still need to divide by the number of degrees of freedom, i.e. npts-2 */


chi = chi/(npts-2);

/* Print out Chi Squared Reduced for this fit. */


printf("\nChi Squared Reduced is %\lf\n", chi);

/* Calculate and Print out the correlation coefficient r2 for this fit. */
r2 = Sx*Sx/(S*Sxx);
printf("\nThe correlation coefficient is %\lf\n", r2);

return(1);
}


To program or not to program

Program when you can't do it any other way. That is, when no commercial program will do the job.

Program when you need speed. That is, when commercial programs exist that will do the job, but they are not optimized for your specific task.

Program when you want to understand computers better. In addition, few things help you understand theory better than writing a program to simulate a theoretical result.

Don't program if a program already exists that does the job: use it! Don't reinvent the wheel!



Appendix C

Mathematical Methods

Lorentzian Distribution
Another distribution of scientific importance, but not related to the binomial distribution, is the Lorentzian distribution,
$$ p_{\rm Lorentzian}(x, \mu; \Gamma) = \frac{1}{\pi}\cdot\frac{\Gamma/2}{(x-\mu)^2 + (\Gamma/2)^2}. \qquad (C.1) $$

An important feature of the Lorentzian distribution is that it doesn’t go to zero as fast as the
Gaussian. Later when we discuss maximum likelihood estimators we will see how this behavior can
be useful.
The mean µ of the Lorentzian is already parameterized in the function. The variance, however, is not parameterized in the function because it is not defined for the Lorentzian. That is,
$$ \sigma^2 = \int_{-\infty}^{\infty} (x-\mu)^2 \cdot \frac{1}{\pi}\cdot\frac{\Gamma/2\;dx}{(x-\mu)^2 + (\Gamma/2)^2} = \frac{1}{\pi}\,\frac{\Gamma^2}{4}\int_{-\infty}^{\infty} \frac{z^2}{z^2+1}\,dz, \qquad (C.2) $$
where $z = (x-\mu)/(\Gamma/2)$. The last integral is unbounded, so the variance diverges. This is why Γ, the full width at half
maximum (F.W.H.M.), is used to characterize the dispersion in the Lorentzian distribution.
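
For reference, here is a short C function evaluating equation (C.1); the check in main() uses the fact that the density falls to half its peak value at x = µ ± Γ/2.

#include <stdio.h>

/* Lorentzian probability density, equation (C.1) */
static double lorentzian(double x, double mu, double gamma) {
    const double pi = 3.141592653589793;
    double hwhm = gamma/2.0;
    return (1.0/pi)*hwhm/((x - mu)*(x - mu) + hwhm*hwhm);
}

int main(void) {
    double gamma = 2.0;
    /* peak height is 2/(pi*Gamma) at x = mu; half that height at x = mu +/- Gamma/2 */
    printf("p(mu)           = %lf\n", lorentzian(0.0, 0.0, gamma));
    printf("p(mu + Gamma/2) = %lf\n", lorentzian(1.0, 0.0, gamma));
    return 0;
}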

Working with Summations


$$ \sum_{n=1}^{5} n = 1 + 2 + 3 + 4 + 5 = 15. \qquad (C.3) $$
$$ \sum_{n=1}^{5} n^2 = 1 + 4 + 9 + 16 + 25 = 55. \qquad (C.4) $$
$$ \sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_n. \qquad (C.5) $$


[Figure: plot of the Lorentzian probability density versus x − µ, with the full width at half maximum Γ indicated at half the peak height.]

Figure C.1: Lorentzian Distribution.

$$ \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{j=1}^{n} y_j\right) = \sum_{i=1}^{n}\sum_{j=1}^{n} x_i y_j \qquad (C.6) $$
$$ \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{j=1}^{n} y_j\right) = (x_1 + x_2 + x_3 + \cdots + x_n)(y_1 + y_2 + y_3 + \cdots + y_n) $$
$$ = x_1 y_1 + x_1 y_2 + x_1 y_3 + \cdots + x_1 y_n $$
$$ \;\; + x_2 y_1 + x_2 y_2 + x_2 y_3 + \cdots + x_2 y_n $$
$$ \;\; + x_3 y_1 + x_3 y_2 + x_3 y_3 + \cdots + x_3 y_n $$
$$ \;\; + \cdots $$
$$ \;\; + x_n y_1 + x_n y_2 + x_n y_3 + \cdots + x_n y_n $$
$$ = \sum_{i=1}^{n}\sum_{j=1}^{n} x_i y_j \qquad (C.7) $$

Taylor Series Expansion


The Taylor series expansion for a function f(x) about a point a is
$$ f(x) = f(a) + (x-a)\frac{df(a)}{dx} + \frac{(x-a)^2}{2!}\frac{d^2 f(a)}{dx^2} + \cdots + \frac{(x-a)^n}{n!}\frac{d^n f(a)}{dx^n} + \cdots. \qquad (C.8) $$


When the expansion is done about the origin, i.e., a = 0, then the expansion is called the Maclaurin series,
$$ f(x) = f(0) + x\frac{df(0)}{dx} + \frac{x^2}{2!}\frac{d^2 f(0)}{dx^2} + \cdots + \frac{x^n}{n!}\frac{d^n f(0)}{dx^n} + \cdots. \qquad (C.9) $$
Some common examples of Maclaurin series expansions are:
$$ \sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots, \qquad (C.10) $$
$$ \cos x = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots, \qquad (C.11) $$
$$ e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots, \qquad (C.12) $$
$$ (1+x)^p = 1 + px + \frac{p(p-1)}{2!}x^2 + \frac{p(p-1)(p-2)}{3!}x^3 + \cdots. \qquad (C.13) $$
The beauty of the Taylor series expansion is that when the difference (x − a) is small (more specifically, |x − a| < 1), you can neglect many of the higher-order terms in the expansion and still obtain a good approximation to your function f(x) in the vicinity of a. For example, from the Maclaurin series expansion for sin x we can approximate sin x ≈ x in the limit that x is close to zero. Clearly this approximation is valid, since all the higher-order terms contain higher powers of x and so make insignificant contributions to the series sum when x is close to zero.
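
A short C sketch comparing sin x with the first one and two terms of its Maclaurin series for a few small arguments; the error of the one-term approximation grows roughly as x³/6.

#include <stdio.h>
#include <math.h>

int main(void) {
    double xs[3] = {0.05, 0.2, 0.5};
    int i;
    /* compare sin(x) with the truncated Maclaurin series x and x - x^3/3! */
    for(i = 0; i < 3; i++) {
        double x = xs[i];
        printf("x = %4.2lf   sin x = %.8lf   x = %.8lf   x - x^3/6 = %.8lf\n",
               x, sin(x), x, x - x*x*x/6.0);
    }
    return 0;
}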

C.1 Problems
1. For a Lorentzian distribution, why is the full width at half maximum used to characterize the dispersion in the distribution instead of the variance?

