Lecture Notes On Probability, Statistics & Linear Algebra
C. H. Taubes
Department of Mathematics
Harvard University
Cambridge, MA 02138
Spring, 2010
CONTENTS

1 Data Exploration
  1.1 Snowfall data
  1.2 Data mining
  1.3 Exercises
3 Conditional probability
  3.1 The definition of conditional probability
  3.2 Independent events
  3.3 Bayes' theorem
  3.4 Decomposing a subset to compute probabilities
  3.5 More linear algebra
  3.6 An iterated form of Bayes' theorem
  3.7 Exercises
4 Linear transformations
  4.1 Protein molecules
  4.2 Protein folding
6 Random variables
  6.1 The definition of a random variable
  6.2 Probability for a random variable
  6.3 A probability function on the possible values of f
  6.4 Mean and standard deviation for a random variable
  6.5 Random variables as proxies
  6.6 A biology example
  6.7 Independent random variables and correlation matrices
  6.8 Correlations and proteomics
  6.9 Exercises
12 P-values
  12.1 Point statistics
  12.2 P-value and bad choices
  12.3 A binomial example using DNA
  12.4 An example using the Poisson function
  12.5 Another Poisson example
  12.6 A silly example
  12.7 Exercises
  13.3 The mean and standard deviation
  13.4 The Chebychev theorem
  13.5 Examples of probability functions
  13.6 The Central Limit Theorem: Version 1
  13.7 The Central Limit Theorem: Version 2
  13.8 The three most important things to remember
  13.9 A digression with some comments on Equation (13.1)
  13.10 Exercises
14 Hypothesis testing
  14.1 An example
  14.2 Testing the mean
  14.3 Random variables
  14.4 The Chebychev and Central Limit Theorems for random variables
  14.5 Testing the variance
  14.6 Did Gregor Mendel massage his data?
  14.7 Boston weather 2008
  14.8 Exercises
15 Determinants
16 Eigenvalues in biology
  16.1 An example from genetics
  16.2 Transition/Markov matrices
  16.3 Another protein folding example
  16.4 Exercises
Preface
This is a very slight revision of the notes used for Math 19b in the Spring 2009 semester. These are written by Cliff
Taubes (who developed the course), but re-formatted and slightly revised for Spring 2010. Any errors you might find
were almost certainly introduced by these revisions and thus are not the fault of the original author.
I would be interested in hearing of any errors you do find, as well as suggestions for improvement of either the text or
the presentation.
Peter M. Garfield
[email protected]
CHAPTER ONE

Data Exploration
The subjects of Statistics and Probability concern the mathematical tools that are designed to deal with uncertainty. To
be more precise, these subjects are used in the following contexts:
What follows are some examples of scientific questions where the preceding issues are central and so statistics and
probability play a starring role.
• An extremely large meteor crashed into the earth at the time of the disappearance of the dinosaurs. The most
popular theory posits that the dinosaurs were killed by the ensuing environmental catastrophe. Does the fossil
record confirm that the disappearance of the dinosaurs was suitably instantaneous?
• We read in the papers that fat in the diet is “bad” for you. Do dietary studies of large populations support this
assertion?
• Do studies of gene frequencies support the assertion that all extant people are of 100% African descent?
• The human genome project claims to have determined the DNA sequences along the human chromosomes. How
accurate are the published sequences? How much variation should be expected between any two individuals?
Statistics and probability also play explicit roles in our understanding and modelling of diverse processes in the life
sciences. These are typically processes where the outcome is influenced by many factors, each with small effect, but
with significant total impact. Here are some examples:
Examples from chemistry: What is thermal equilibrium? Does it mean stasis?
Why are chemical reaction rates influenced by temperature? How do proteins fold correctly? How stable are the folded
configurations?
Examples from medicine: How many cases of flu should the health service expect to see this winter? How are cancer probabilities determined? Is hormone replacement therapy safe? Are anti-depressants safe?
An example from genomics: How are genes found in long stretches of DNA? How much DNA is dispensable?
An example from developmental biology: How does programmed cell death work; what cells die and what live?
Examples from genetics: What are the fundamental inheritance rules? How can genetics determine ancestral relationships?
An example from ecology: How are species abundance estimates determined from small samples?
To summarize: There are at least two uses for statistics and probability in the life sciences. One is to tease information
from noisy data, and the other is to develop predictive models in situations where chance plays a pivotal role. Note that
these two uses of statistics are not unrelated since a theoretical understanding of the causes for the noise can facilitate
its removal.
The rest of this first chapter focuses on the first of these two uses of statistics.
Year Inches Year Inches Year Inches Year Inches Year Inches Year Inches
1890 42.6 1910 40.6 1930 40.8 1950 29.7 1970 57.3 1990 19.1
1891 46.8 1911 31.6 1931 24.2 1951 39.6 1971 47.5 1991 22.0
1892 66.0 1912 19.4 1932 40.6 1952 29.8 1972 10.3 1992 83.9
1893 64.0 1913 39.4 1933 62.7 1953 23.6 1973 36.9 1993 96.3
1894 46.9 1914 22.3 1934 45.4 1954 25.1 1974 27.6 1994 14.9
1895 38.7 1915 79.2 1935 30.0 1955 60.9 1975 46.6 1995 107.6
1896 43.2 1916 54.2 1936 9.0 1956 52.0 1976 58.5 1996 51.9
1897 51.9 1917 45.7 1937 50.6 1957 44.7 1977 85.1 1997 25.6
1898 70.9 1918 21.1 1938 40.3 1958 34.1 1978 27.5 1998 36.4
1899 25.0 1919 73.4 1939 37.7 1959 40.9 1979 12.7 1999 24.9
1900 17.5 1920 34.1 1940 47.8 1960 61.5 1980 22.3 2000 45.9
1901 44.1 1921 37.6 1941 24.0 1961 44.7 1981 61.8 2001 15.1
1902 42.0 1922 68.5 1942 45.7 1962 30.9 1982 32.7
1903 72.9 1923 32.3 1943 27.7 1963 63.0 1983 43.0
1904 44.9 1924 21.4 1944 59.2 1964 50.4 1984 26.6
1905 37.6 1925 38.3 1945 50.8 1965 44.1 1985 18.1
1906 67.9 1926 60.3 1946 19.4 1966 60.1 1986 42.5
1907 26.2 1927 20.8 1947 89.2 1967 44.8 1987 52.6
1908 20.1 1928 45.5 1948 37.1 1968 53.8 1988 15.5
1909 37.0 1929 31.4 1949 32.0 1969 48.8 1989 39.2
in 1992, 1993 and 1995, I ask whether they indicate that winters in the more recent years are snowier than those in the first half of the record. Thus, I want to compare the snowfalls in the years 1890–1945 with those in the years 1946–2001.
[Two scatter plots of Snowfall (inches) against Year.]
Figure 1.1: Snowfall Data for Years 1890–1945 (left) and 1946–2001 (right)
Well, these differ by roughly 5 inches, but is this difference large enough to be significant? How much difference
should I tolerate so as to maintain that the snowfall amounts are “statistically” identical? How much difference in
standard deviations signals a significant difference in yearly snowfall?
I can also “bin” the data. For example, I can count how many years have total snowfall less than 10 inches, then how many 10–20 inches, how many 20–30 inches, etc. I can do this with the two halves of the data set and then compare bin heights. Here is the result:
1890–1945: 1 2 11 11 16 5 6 4 0 0 0
1946–2001: 0 8 11 9 11 7 5 0 3 1 1    (1.5)
Having binned the data, I am yet at a loss to decide if the difference in bin heights really signifies a distinct difference in snowfall between the two halves of the data set.
Thus, the two rank-sums differ by 16. But I am again faced with the following question: Is this difference significant? How big must the difference be to conclude that the first half of the 20th century had, in spite of 1995, more snow on average than the second half?
To elaborate now on this last question, consider that there is a hypothesis on the table:
The rank-sums for the two halves of the data set indicate that there is a significant difference between the
snowfall totals from the first half of the data set as compared with those from the second.
To use the numbers in (1.6) to analyze the validity of this hypothesis, I need an alternate hypothesis for comparison.
The comparison hypothesis plays the role here of the control group in an experiment. This “control” is called the null
hypothesis in statistics. In this case, the null-hypothesis asserts that the rankings are random. Thus, the null-hypothesis
is:
The 112 ranks are distributed amongst the years as if they were handed out by a blindfolded monkey
choosing numbers from a mixed bin.
Said differently, the null-hypothesis asserts that the rank-sums in (1.6) are statistically indistinguishable from those that I would obtain if I were to randomly select 56 numbers from the set {1, 2, 3, . . . , 112} to use for the rankings of the years in the first half of the data set, while using the remaining numbers for the second half.
An awful lot is hidden here in the phrase statistically indistinguishable. Here is what this phrase means in the case
at hand: I should compute the probability that the sum of 56 randomly selected numbers from the set {1, 2, . . . , 112}
differs from the sum of the 56 numbers that are left by at least 16. If this probability is very small, then I have some
indication that the snow fall totals for the years in the two halves of the data set differ in a significant way. If the
probability is high that the rank-sums for the randomly selected rankings differ by 16 or more, then the difference
indicated in (1.6) should not be viewed as indicative of some statistical difference between the snowfall totals for the
years in the two halves of the data set.
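This null-hypothesis probability can be estimated by Monte Carlo simulation. The following is a sketch, not part of the text; the function names, trial count, and seed are my choices:

```python
import random

def rank_sum_difference(first_half_ranks, n_total):
    """|sum of the first-half ranks minus the sum of the remaining ranks|."""
    total = n_total * (n_total + 1) // 2          # 1 + 2 + ... + n_total
    s1 = sum(first_half_ranks)
    return abs(s1 - (total - s1))

def null_probability(n_total=112, n_half=56, observed_diff=16,
                     trials=20000, seed=0):
    """Estimate P(rank-sum difference >= observed_diff) when the ranks
    are handed out at random, as the null-hypothesis asserts."""
    rng = random.Random(seed)
    ranks = list(range(1, n_total + 1))
    hits = 0
    for _ in range(trials):
        half = rng.sample(ranks, n_half)          # random 56 of the 112 ranks
        if rank_sum_difference(half, n_total) >= observed_diff:
            hits += 1
    return hits / trials

print(null_probability(trials=5000))
```

In runs of this sketch the estimate comes out close to 1, which already hints that a rank-sum difference of 16 is quite common under random rankings.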
Thus, the basic questions are:
• What is the probability in the case of the null-hypothesis that I should get a difference that is bigger than the
one that I actually got?
• What probability should be considered “significant”?
Of course, I can ask these same two questions for the bin data in (1.5). I can also ask analogs of these questions for the two means in (1.2) and for the two standard deviations in (1.4). However, because the bin data, as well as the means and standard deviations, deal with the snowfall amounts rather than with integer rankings, I would need a different sort of definition for the null-hypothesis in the latter cases.
In any event, take note that the first question is a mathematical one and the second is more of a value choice. The first question leads us to study the theory of probability, which is the topic of the next chapter. As for the second question, I can tell you that it is the custom these days to take 1/20 = 0.05 as the cut-off between what is significant and what isn't.
1.3 Exercises:
1. This exercise requires ten minutes of your time on two successive mornings. It also requires a clock that tells
time to the nearest second.
(a) On the first morning, before eating or drinking, record the following data: Try to estimate the passage of
precisely 60 seconds of time with your eyes closed. Thus, obtain the time from the clock, immediately
close your eyes and when you feel that 1 minute has expired, open them and immediately read the amount
of time that has passed on the clock. Record this as your first estimate for 1 minute of time. Repeat this
procedure ten time to obtain ten successive estimates for 1 minute.
(b) On the second morning, repeat this part (a), but first drink a caffeinated beverage such as coffee, tea, or a
cola drink.
(c) With parts (a) and (b) completed, you have two lists of ten numbers. Compute the means and standard
deviations for each of these data sets. Then, combine the data sets as two halves of a single list of 20
numbers and compute the rank-sums for the two lists. Thus, your rankings will run from 1 to 20. In the
event of a tie between two estimates, give both the same ranking and don’t use the subsequent ranking.
For example, if there is a tie for fifth, use 5 for both but give the next highest estimate 7 instead of 6.
2. Flip a coin 200 times. Use n1 to denote the number of heads that appeared in flips 1–10, use n2 to denote the number that appeared in flips 11–20, and so on. In this way, you generate twenty numbers, {n1, . . . , n20}. Compute the mean and standard deviation for the sets {n1, . . . , n10}, {n11, . . . , n20}, and {n1, . . . , n20}.
3. The table that follows gives the results of US congressional elections during the 6th year of a President’s term in
office. (Note: he had to be reelected.) A negative number means that the President’s party lost seats. Note that
there aren’t any positive numbers. Compute the mean and standard deviation for both the Senate and House of
Representatives. Compare these numbers with the line for the 2006 election.
Table 1.2: Number of seats gained by the president’s party in the election during his sixth year in office
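The tie rule in Exercise 1(c) is what is often called "standard competition ranking," and it can be sketched in code. The function name and the choice to rank the smallest value first are my assumptions:

```python
def competition_ranks(values):
    """Rank values smallest-to-largest. Tied values share a rank, and the
    rank(s) after a tie are skipped: a tie for fifth gives both estimates
    rank 5, and the next larger estimate gets rank 7, not 6."""
    ordered = sorted(values)
    first_index = {}
    for i, v in enumerate(ordered, start=1):
        first_index.setdefault(v, i)   # lowest position among the tied copies
    return [first_index[v] for v in values]

print(competition_ranks([62, 58, 58, 71]))   # [4, 2, 2, ...] style output
```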
CHAPTER TWO
Probability theory is the mathematics of chance and luck. To elaborate, its goal is to make sense of the following
question:
What is the probability of a given outcome from some set of possible outcomes?
For example, in the snow fall analysis of the previous chapter, I computed the rank-sums for the two halves of the
data set and found that they differed by 16. I then wondered what the probability was for such rank sums to differ by
more than 16 if the rankings were randomly selected instead of given by the data. We shall eventually learn what it
means to be “randomly selected” and how to compute such probabilities. However, this comes somewhat farther into
the course.
Sample space: A sample space is the set of all possible outcomes of the particular “experiment” of interest. For
example, in the rank-sum analysis of the snowfall data from the previous chapter, I should consider the sample space
to be the set of all collections of 56 distinct integers from the collection {1, . . . , 112}.
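As a sanity check on the size of this sample space (a sketch; the variable name is mine):

```python
import math

# Number of ways to choose which 56 of the 112 ranks go to the first half
n_rankings = math.comb(112, 56)
print(n_rankings)
```

The count is astronomically large (on the order of 10^32), which is why such probabilities are computed by theory or by random sampling rather than by listing outcomes.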
For a second example, imagine flipping a coin three times and recording the possible outcomes of the three flips. In
this case, the sample space is
Here is a third example: If you are considering the possible birthdates of a person drawn at random, the sample space
consists of the days of the year, thus the integers from 1 to 366. If you are considering the possible birthdates of two
people selected at random, the sample space consists of all pairs of the form (j, k) where j and k are integers from 1
to 366. If you are considering the possible birthdates of three people selected at random, the sample space consists of
all triples of the form (j, k, m) where j, k and m are integers from 1 to 366.
My fourth example comes from medicine: Suppose that you are a pediatrician and you take the pulse rate of a 1-year-old child. What is the sample space? I imagine that the number of beats per minute can be any number between 0 and some maximum, say 300.
To reiterate: The sample space is no more nor less than the collection of all possible outcomes for your experiment.
Events: An event is a subset of the sample space, thus a subset of possible outcomes for your experiment. In the
rank-sum example, where the sample space is the set of all collections of 56 distinct integers from 1 through 112,
here is one event: The subset of collections of 56 integers whose sum is 16 or more greater than the sum of those that
remain. Here is another event: The subset that consists of the 56 consecutive integers that start at 1. Notice that the
first event contains lots of collections of 56 integers, but the second event contains just {1, 2, . . . , 56}. So, the first
event has more elements than the second.
Consider the case where the sample space is the set of outcomes of two flips of a coin, thus S = {HH, HT, TH, TT}. For a small sample space such as this, one can readily list all of the possible events. In this case, there are 16 possible events. Here is the list: First comes the set with no elements, denoted by tradition as ∅. Then come the 4 sets with just one element: {HH}, {HT}, {TH}, {TT}. Next come the 6 two-element sets: {HH, HT}, {HH, TH}, {HH, TT}, {HT, TH}, {HT, TT}, {TH, TT}. Note that the order of the elements is of no consequence; the set {HH, HT} is the same as the set {HT, HH}. The point here is that we only care about the elements, not how they are listed. To continue, there are 4 distinct sets with three elements: {HH, HT, TH}, {HH, HT, TT}, {HH, TH, TT} and {HT, TH, TT}. Finally, there is the set with all of the elements, {HH, HT, TH, TT}.
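This enumeration can also be checked mechanically; a sketch (the string encoding of the outcomes is my assumption):

```python
from itertools import chain, combinations

S = ["HH", "HT", "TH", "TT"]

# The events are all subsets of S: sizes 0 through 4
events = list(chain.from_iterable(combinations(S, r) for r in range(len(S) + 1)))
print(len(events))   # 1 + 4 + 6 + 4 + 1 = 16
```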
Note that a subset of the sample space can have no elements, or one element, or two, . . . , up to and including all of the elements in the sample space. For example, if the sample space is that given in (2.1) for flipping a coin three times, then {HTH} is an event. Meanwhile, the event that a head appears on the first flip is {HTT, HHT, HTH, HHH}, a set with four elements. The event that four heads appear has no elements, and the event that there are fewer than four heads is the whole sample space. No matter what the original sample space, the event with no elements is called the empty set, and is denoted by ∅.
In the case where the sample space consists of the possible pulse rate measurements of a 1-year old, some events are:
The event that the pulse rate is greater than 100. The event that the pulse rate is between 80 and 85. The event that the
pulse rate is either between 100 and 110 or between 115 and 120. The event that the pulse rate is either 85 or 95 or
105. The event that the pulse rate is divisible by 3. And so on.
By the way, this last example illustrates the fact that there are many ways to specify the elements in the same event.
Consider, for example, the event that the pulse rate is divisible by 3. Let's call this event E. One way to describe E is to provide a list of all of its elements, thus E = {0, 3, 6, . . . , 300}. Or, I can use a more algebraic tone: E is the set of integers x such that 0 ≤ x ≤ 300 and x/3 ∈ {0, 1, 2, . . . , 100}. (See below for the definition of the symbol “∈”.) For that matter, I can describe E accurately using French, Japanese, Urdu, or most other languages.
To repeat: Any given subset of a given sample space is called an event.
Set Notation: Having introduced the notion of a subset of some set of outcomes, you need to become familiar with
some standard notation that is used in the literature when discussing subsets of sets.
(d) If no elements are shared by A and B, then these two sets are said to be disjoint. Thus, A and B are disjoint if
and only if A ∩ B = ∅.
(e) If A is given as a subset of a set S, then Ac denotes the subset of S whose elements are not in A. Thus, Ac and
A are necessarily disjoint and Ac ∪ A = S. The set Ac is called the complement of A.
(f) If a subset A is entirely contained in another subset, B, one writes A ⊂ B. For example, if A is an event in a
sample space S, then A ⊂ S.
(g) If an element, e, is contained in a set A, one writes e ∈ A. If e is not in A, one writes e ∉ A.
• P(S) = 1.
• P(A ∪ B) = P(A) + P(B) when A ∩ B = ∅.    (2.2)
Note that condition P(S) = 1 says that there is probability 1 of at least something happening. Meanwhile, the
condition P(A ∪ B) = P(A) + P(B) when A and B have no points in common asserts the following: The probability
of something happening that is in either A or B is the sum of the probabilities of something happening from A or
something happening from B.
To give an example, consider rolling a standard, six-sided die. If the die is rolled once, the sample space consists of the numbers {1, 2, 3, 4, 5, 6}. If the die is fair, then I would want to use the probability function that assigns the value 1/6 to each element. But, what if the die is not fair? What if it favors some numbers over others? Consider, for example, a probability function with P({1}) = 0, P({2}) = 1/3, P({3}) = 1/2, P({4}) = 1/6, P({5}) = 0 and P({6}) = 0. If this probability function is correct for my die, what is the most probable number to appear with one roll? Should I expect to see the number 5 show up at all? What follows is a more drastic example: Consider the probability function where P({1}) = 1 and P({2}) = P({3}) = P({4}) = P({5}) = P({6}) = 0. If this probability function is correct, I should expect only the number 1 to show up.
Let us explore a bit the reasoning behind the conditions for P that appear in equation (2.2). To start, you should
understand why the probability of an event is not allowed to be negative, nor is it allowed to be greater than 1. This is
to conform with our intuitive notion of what probability means. To elaborate, think of the sample space as the suite of
possible outcomes of an experiment. This can be any experiment, for example flipping a coin three times, or rolling a
die once, or measuring the pulse rate of a 1-year old child. An event is a subset of possible outcomes. Let us suppose
If the experiment is carried out a large number of times, with the conditions and set up the same each
time, then P(A) is a prediction for the fraction of those experiments where the outcome is in the set A.
As this fraction can be at worst 0 (no outcomes in the set A), or at best 1 (all outcomes in the set A), so P(A) should be a number that is not less than 0 nor more than 1.
Why should P(S) be equal to 1 in all cases? Well, by virtue of its very definition, the set S is supposed to be the set of
all possible outcomes of the experiment. The requirement for P(S) to equal 1 makes the probability function predict
that each outcome must come from our list of all possible outcomes.
The second condition that appears in equation (2.2) is less of a tautology. It is meant to model a certain intuition that
we all have about probabilities. Here is the intuition: The probability of a given event is the sum of the probabilities
of its constituent elements. For example, consider the case where the sample set is the set of possible outcomes when I roll a fair die. Thus, the probability is 1/6 for any given integer from {1, 2, 3, 4, 5, 6} appearing. Let A denote the event that 1 appears and B the event that 2 appears. I expect that the probability of either 1 or 2 appearing, thus of A ∪ B = {1, 2}, is 1/6 + 1/6 = 1/3. I would want my probability function to reflect this additivity. The second condition in equation (2.2) asserts no more nor less than this requirement.
By the way, the condition for A ∩ B = ∅ is meant to prevent over-counting. For an extreme example, suppose A = {1} and B is also {1}. Thus both have probability 1/6. Meanwhile, A ∪ B = {1} also, so P(A ∪ B) should be 1/6, not 1/6 + 1/6. Here is a somewhat less extreme example: Suppose that A = {1, 2} and B = {2, 3}. Both of these sets have probability 1/3. Their union is {1, 2, 3}. I expect that this set has probability 1/2, not 1/3 + 1/3 = 2/3. The reason I shouldn't use the formula P(A ∪ B) = P(A) + P(B) for the case where A = {1, 2} and B = {2, 3} is because the latter formula counts twice the probability of the shared integer 2; it counts it once from its appearance in A and again from its appearance in B.
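The over-counting, and the inclusion-exclusion correction that repairs it, can be seen in a few lines of code. This is a sketch using the fair-die example above; the dictionary representation of P is my framing, not the text's notation:

```python
from fractions import Fraction

P = {face: Fraction(1, 6) for face in range(1, 7)}   # fair die

def prob(event):
    """Probability of an event: sum the probabilities of its elements."""
    return sum(P[x] for x in event)

A, B = {1, 2}, {2, 3}
print(prob(A | B))                       # 1/2
print(prob(A) + prob(B))                 # 2/3: the shared outcome 2 is counted twice
print(prob(A) + prob(B) - prob(A & B))   # 1/2: subtracting P(A ∩ B) fixes it
```

The last line is exactly the rule P(A ∪ B) = P(A) + P(B) − P(A ∩ B) that appears below.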
For a second illustration, consider the case where S is the set of possible pulse rates for a 1-year-old child. Take A to be the event {100, . . . , 109} and B to be the event {120, . . . , 129}. Suppose that many years of pediatric medicine have given us a probability function, P, for this set. Suppose, in addition, that P(A) = 1/4 and P(B) = 1/16. What should we expect for the probability that a measured pulse rate is either in A or in B? That is, what is the probability that the pulse rate is in A ∪ B? Since A and B do not share elements (A ∩ B = ∅), you might expect that the probability of being in either set is the sum of the probability of being in A with that of being in B, thus 1/4 + 1/16 = 5/16.
Keeping this last example in mind, consider the set {109, 110, 111}. I'll call this set C. Suppose that our probability function says that P(C) = 1/64. I would not predict that P(A ∪ C) = P(A) + P(C) since A and C both contain the element 109. Thus, I can imagine that P(A) + P(C) over-counts the probability for A ∪ C since it counts the probability of 109 two times, once from its membership in A and again from its membership in C.
I started the discussion prior to equation (2.2) by asking that you imagine a particular sample space and then said that a probability function on this space is a rule that assigns to each event (each subset of the sample space) a number no less than zero and no greater than 1, subject to the rules that are depicted in (2.2). I expect that many of you are silently asking the following question:
To make this less abstract, consider again the case of rolling a six-sided die. The corresponding sample space is
S = {1, 2, 3, 4, 5, 6}. I noted above three different probability functions for S. The first assigned equal probability
to each element. The second and third assigned different probabilities to different elements. The fact is that there are
infinitely many probability functions to choose from. Which should be used?
To put the matter in even starker terms, consider the case where the sample space consists of the possible outcomes
of a single coin flip. Thus, S = {H, T }. A probability function on S is no more nor less than an assignment of one
number, P(H), that is not less than 0 nor greater than 1. Only one number is needed because the first line of (2.2)
makes P assign 1 to S, and the second line of (2.2) makes P assign 1 − P(H) to T . Thus, P(T ) = 1 − P(H). If you understand this last point, then it follows that there are as many probability functions for the set S = {H, T } as there are numbers between 0 and 1.
If you know what P assigns to each element in S, then you know P on every subset: Just add up the
probabilities that are assigned to its elements.
We’ll talk about the story when S isn’t finite later. Anyway, the preceding illustrates the more intuitive notion of
probability that we all have: It says simply that if you know the probability of every outcome, then you can compute
the probability of any subset of outcomes by summing up the probabilities of the outcomes that are in the given subset.
For example, in the case where my sample space S = {1, . . . , 112} and each integer in S has probability 1/112, then I
can compute that the probability of a blindfolded monkey picking either 1 or 2 is 1/112 + 1/112 = 1/56. Here I invoke the
second of the rules in (2.2) where A is the event that 1 is chosen and B is the event that 2 is chosen. A sequential use
of this same line of reasoning finds that the probability of picking an integer that is less than or equal to 10 is 10/112.
Here is a second example: Take S to be the set of outcomes for flipping a fair coin three times (as depicted in (2.1)). If
the coin is fair and if the three flips are each fair, then it seems reasonable to me that the situation is modeled using the
probability function, P, that assigns probability 1/8 to each element in the set S. If we take this version of P, then we
can use the rule in (2.2) to assign a probability to any given subset of S. For example, the subset given by
{HHT, HT H, T HH} has probability 3/8 since each of its three elements has probability 1/8, and so
P({HHT, HT H, T HH}) = P(HHT ) + P(HT H) + P(T HH) = 3/8.
To summarize: If the sample space is a set with finitely many elements, or is a discrete set (such as the positive integers), then
you can find the probability of any subset of the sample space if you know the probability for each element.
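This bookkeeping is easy to mechanize. Here is a minimal Python sketch (using the blindfolded-monkey sample space from above) that recovers a subset’s probability by summing element probabilities:

```python
# Probability of any subset of a finite sample space: sum the
# probabilities assigned to its elements.
from fractions import Fraction

def prob(event, p):
    """Sum the probabilities of the elements of `event`."""
    return sum(p[x] for x in event)

# 112 equally likely outcomes, as in the monkey example above.
p = {n: Fraction(1, 112) for n in range(1, 113)}

print(prob({1, 2}, p))             # 1/56
print(prob(set(range(1, 11)), p))  # 5/56, i.e. 10/112
```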
(a) P(∅) = 0.
(b) P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
(c) P(A) ≤ P(B) whenever A ⊂ B.
(d) P(B) = P(A ∩ B) + P(B ∩ Ac ).
(e) P(Ac ) = 1 − P(A).
In the preceding, Ac is the set of elements that are not in A. The set Ac is called the complement of A.
I want to stress that all of these conditions are simply translations into symbols of intuition that we all have about
probabilities. What follows are the respective English versions of (2.3).
Equation (2.3a):
This is to say that if S is, as required, the list of all possible outcomes, then at least one outcome must occur.
Equation (2.3b):
The probability that an outcome is in either A or B is the probability that it is in A plus the probability
that it is in B minus the probability that it is in both.
The point here is that if A and B have elements in common, then one overcounts in obtaining P(A ∪ B) by just
summing the two probabilities: the sum of P(A) and P(B) counts the elements that are in both A and B twice.
To see how this works, consider rolling a standard, six-sided die where the probabilities of any given
side appearing are all the same, thus 1/6. Now consider the case where A is the event that either 1 or 2 appears, while B
is the event that either 2 or 3 appears. The probability assigned to A is 1/3, and that assigned to B is also 1/3. Meanwhile,
A ∪ B = {1, 2, 3} has probability 1/2 and A ∩ B = {2} has probability 1/6. Since 1/2 = 1/3 + 1/3 − 1/6, the claim in (2.3b)
holds in this case. You might also consider (2.3b) in a case where A = B.
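For a finite sample space, (2.3b) can be checked mechanically. Here is a small Python sketch using the fair-die events just discussed:

```python
# Verify (2.3b): P(A ∪ B) = P(A) + P(B) − P(A ∩ B), for a fair die.
from fractions import Fraction

p = {n: Fraction(1, 6) for n in range(1, 7)}   # fair six-sided die

def prob(event):
    """Probability of an event: sum of its elements' probabilities."""
    return sum(p[x] for x in event)

A, B = {1, 2}, {2, 3}                          # the events from the text
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)
print(prob(A | B))                             # 1/2
```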
Equation (2.3c):
The probability of an outcome from A is no greater than that of an outcome from B in the case that all
outcomes from A are contained in the set B.
The point of (2.3c) is simply that if every outcome from A appears in the set B, then the probability that B occurs can
not be less than that of A. Consider for example the case of rolling one die that was just considered. Take A again to
be {1, 2}, but now take B to be the set {1, 2, 3}. Then P(A) is less than P(B) because B contains all of A’s elements
plus another. Thus, the probability of B occurring is the sum of the probability of A occurring and the probability of
the extra element occurring.
Equation (2.3d):
The probability of an outcome from the set B is the sum of the probability that the outcome is in the
portion of B that is contained in A and the probability that the outcome is in the portion of B that is not
contained in A.
This translation of (2.3d) says that if I break B into two parts, the part that is contained in A and the part that isn’t,
then the probability that some element from B appears is obtained by adding, first the probability that an element that
is both in A and B appears, and then the probability that an element appears that is in B but not in A. Here is an
example from rolling one die: Take A = {1, 2, 4, 5} and B = {1, 2, 3, 6}. Since B has four elements and each has
probability 1/6, B has probability 2/3. Now, the elements that are both in B and in A comprise the set {1, 2}, and
this set has probability 1/3. Meanwhile, the elements in B that are not in A comprise the set {3, 6}. This set also has
probability 1/3. Thus (2.3d) holds in this case because 1/3 + 1/3 = 2/3.
Equation (2.3e):
The probability of an outcome that is not in A is equal to 1 minus the probability that an outcome is in A.
To see why this is true, break the sample space up into two parts, the elements in A and the elements that are not in A.
The sum of the corresponding two probabilities must equal 1 since any given element is either in A or not. Consider
our die example where A = {1, 2}. Then Ac = {3, 4, 5, 6} and their probabilities do indeed add up to 1.
2.6 Exercises:
1. Suppose an experiment has three possible outcomes, labeled 1, 2, and 3. Suppose in addition, that you do the
experiment three successive times.
(a) Give the sample space for the possible outcomes of the three experiments.
2. Table 2.1 lists cholesterol levels for 24 subjects, measured before and three months after starting a vegetarian diet.
(a) Before doing any calculations, do you think Table 2.1 shows any evidence of an effect of a vegetarian diet
on cholesterol levels? Why or why not?
The sign test is a simple test of whether or not there is a real difference between two sets of numbers. In
this case, the first set consists of the 24 pre-diet measurements, and the second set consists of the 24 after diet
measurements. Here is how this test works in the case at hand: Associate + to a given measurement if the
cholesterol level increased, and associate − if it decreased. The result is a set of 24 symbols, each
either + or −. In this case, the number of + is 3 and the number of − is 21. One then
asks whether such an outcome is likely given that the diet has no effect. If the outcome is unlikely, then there is
reason to suspect that the diet makes a difference. Of course, this sort of thinking is predicated on our agreeing
on the meaning of the term “likely”, and on our belief that there are no as yet unknown reasons why the outcome
appeared as it did. To elaborate on the second point, one can imagine that the cholesterol change is due not so
much to the vegetarian nature of the diet, but to some factor in the diet that changed simultaneously with the
change to a vegetarian diet. Indeed, vegetarian diets can be quite bland, and so it may be the case that people use
more salt or pepper when eating vegetarian food. Could the cause be due to the change in condiment level? Or
perhaps people are hungrier sooner after such a diet, so they treat themselves to an ice cream cone a few hours
after dinner. Perhaps the change in cholesterol is due not to the diet, but to the daily ice cream intake.
1 Rosner, Bernard. Fundamentals of Biostatistics. 4th Ed. Duxbury Press, 1995.
Table 2.1: Cholesterol levels before and three months after starting a vegetarian diet
(b) To make some sense of the notion of “likely”, we need to consider a probability function on the set of
possible lists where each list has 24 symbols with each symbol either + or −. What is the sample space
for this set?
(c) Assuming that each subject had a 0.50 probability of an increase in cholesterol, what probability does the
resulting probability function assign to any given element in your sample space?
(d) Given the probability function you found in part (c), what is the probability of having no + appear in the
24?
(e) With this same probability function, what is the probability of only one + appearing?
An upcoming chapter explains how to compute the probability of any number of + appearing. Another chapter
introduces a commonly agreed upon definition for “likely”.
CHAPTER THREE
Conditional probability
The notion of conditional probability provides a very practical tool for computing probabilities of events. Here is the
context where this notion first appears: You have a sample space, S, with a probability function, P. Suppose that
A and B are subsets of S and that you have knowledge that the event represented by B has already occurred. Your
interest is in the probability of the event A given this knowledge about the event B. This conditional probability is
denoted by P (A | B); and it is often different from P(A).
Here is an example: Write down + if you measure your pulse rate to be greater than 70 beats per minute; but write
down − if you measure it to be less than or equal to 70 beats per minute. Make three measurements of your pulse
rate and so write down three symbols. The set of possible outcomes for the three measurements consists of the eight
element set
S = {+ + +, + + −, + − +, + − −, − + +, − + −, − − +, − − −}. (3.1)
Let A denote the event that all three symbols are +, and let B denote the event that the first symbol is +. Then
P (A | B) is the probability that all symbols are + given that the first one is +. If each of the eight elements has
the same probability, 1/8, then it should be the case that P (A | B) = 1/4 since there are four elements in B but only one
of these, (+ + +), is also in A. This is, in fact, the case given the formal definition that follows. Note that in this
example, P (A | B) ≠ P(A) since P(A) = 1/8.
Here is another hypothetical example: Suppose that you are a pediatrician and you get a phone call from a distraught
parent about a child that is having trouble breathing. One question that you ask yourself is: What is the probability
that the child is having an allergic reaction? Let’s denote by A the event that this is, indeed, the correct diagnosis. Of
course, it may be that the child has the flu, or a cold, or any number of diseases that make breathing difficult. Anyway,
in the course of the conversation, the parent remarks that the child has also developed a rash on its torso. Let us use B
to denote the event that the child has a rash. I expect that the probability the child is suffering from an allergic
reaction is much greater given that there is a rash. This is to say that P (A | B) > P(A) in this case. Or, consider an
alternative scenario, one where the parent does not remark on a rash, but remarks on a fever instead. In this case, I
would expect that the probability of the child suffering an allergic reaction is rather small since the symptoms point
more towards a cold or flu. This is to say that I now expect P (A | B) to be less than P(A).
Here is the formal definition: If P(B) > 0, then the conditional probability of A given B is defined to be
P (A | B) = P(A ∩ B)/P(B). (3.2)
You can check that this obeys all of the rules for being a probability. In English, this says:
The probability of an outcome occurring from A given that the outcome is known to be in B is the
probability of the outcome being in both A and B divided by the probability of the outcome being in B in
the first place.
Another way to view this notion is as follows: Since we are told that B has happened, one might expect that the
probability that A occurs is the fraction of B’s probability that is accounted for by the elements that are in both A and
B. This is just what (3.2) asserts. Indeed, P(A ∩ B) is the probability of the occurrence of an element that is in both
A and B, so the ratio P(A ∩ B)/P(B) is the fraction of B’s probability that comes from the elements that are both in
A and B.
For a simple example, consider the case where we roll a die with each face having the same probability of appearing.
Take B to be the event that an even number appears. Thus, B = {2, 4, 6}. I now ask: What is the probability that 2
appears given that an even number has appeared? Without the extra information, the probability that 2 appears is 1/6. If
I am told in advance that an even number has appeared, then I would say that the probability that 2 appears is 1/3. Note
that 1/3 = (1/6)/(1/2); and this is just what is said in (3.2) in the case that A = {2} and B = {2, 4, 6}.
To continue with this example, I can also ask for the probability that 1 or 3 appears given that an even number has
appeared. Set A = {1, 3} in this case. Without the extra information, we have P(A) = 1/3. However, as neither 1 nor 3
is an even number, A ∩ B = ∅. This is to say that A and B do not share elements. Granted this obvious fact, I would
say that P (A | B) = 0. This result is consistent with (3.2) because the numerator that appears on the right hand side
of (3.2) is zero in this case.
I might also consider the case where A = {1, 2, 4}. Here I have P(A) = 1/2. What should P (A | B) be? Well, A has
two elements from B, and since B has three elements, each with an equal probability of appearing, I
would expect P (A | B) = 2/3. To see what (3.2) predicts, note that A ∩ B = {2, 4} and this has probability 1/3. Thus,
(3.2)’s prediction for P (A | B) is (1/3)/(1/2) = 2/3 also.
What follows is another example with one die, but this die is rather pathological. In particular, imagine a six-sided
die, so the sample space is again the set {1, 2, 3, 4, 5, 6}. Now consider the case where P(1) = 1/21, P(2) = 2/21,
P(3) = 3/21, etc. In short, P(n) = n/21 when n ∈ {1, 2, 3, 4, 5, 6}. Let B again denote the set {2, 4, 6} and suppose
that A = {1, 2, 4}. What is P (A | B) in this case? Well, A has two of the elements in B. Now B’s probability is
2/21 + 4/21 + 6/21 = 12/21 and the elements from A account for 6/21, so I would expect that the probability of A given B is the
fraction of B’s probability that is accounted for by the elements of A, thus (6/21)/(12/21) = 1/2. This is just what is asserted
by (3.2).
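Definition (3.2) is easy to compute with once the element probabilities are listed. The sketch below reproduces both die examples, the fair one and the pathological one:

```python
# Conditional probability via (3.2): P(A | B) = P(A ∩ B) / P(B).
from fractions import Fraction

def cond_prob(A, B, p):
    """P(A | B), computed from the element probabilities p."""
    return sum(p[x] for x in A & B) / sum(p[x] for x in B)

A, B = {1, 2, 4}, {2, 4, 6}
fair = {n: Fraction(1, 6) for n in range(1, 7)}           # P(n) = 1/6
pathological = {n: Fraction(n, 21) for n in range(1, 7)}  # P(n) = n/21

print(cond_prob(A, B, fair))           # 2/3
print(cond_prob(A, B, pathological))   # 1/2
```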
What follows describes various common applications of conditional probabilities.
Note that P (A | B) P(B) = P(A ∩ B) = P (B | A) P(A). Dividing both sides by P(A) allows one to write
P (B | A) = P (A | B) P(B)/P(A). (3.4)
This is the simplest form of ‘Bayes theorem’. It tells us the probability of cause B given that outcome A has been
observed. What follows is a sample application: Suppose that I have two coins, one fair, with probability 1/2 of landing
heads, and one unfair, with probability 1/4 of landing heads. I choose one of the two coins at random, each with
probability 1/2, and flip it. The probability that heads appears is
(the probability of heads given the fair coin) × (the probability that the coin is fair)
plus
(the probability of heads given the unfair coin) × (the probability that the coin is unfair) .
Thus, I would say that the probability of heads in this case is (1/2 × 1/2) + (1/4 × 1/2) = 3/8. I trust that you notice here the
appearance of conditional probabilities.
In general, the use of conditional probabilities to compute unconditional probabilities arises in the following situa-
tion: Suppose that a sample space, S, is given. In the coin example above, I took S to be set with four elements
{(F, H), (F, T ), (U, H), (U, T )}, where the symbols have the following meaning: First, (F, H) denotes the case
where the coin is fair and heads appears and (F, T ) that where the coin is fair and tails appears. Meanwhile, (U, H)
denotes the case where the coin is unfair and heads appears, and (U, T ) that where the coin is unfair and tails appears.
In general, one looks for a ‘convenient’ decomposition of S as a union of disjoint events B1 ∪ B2 ∪ · · · ∪ BN ; granted
such a decomposition, the probability of any given event A can be written as
P(A) = P (A | B1 ) P(B1 ) + P (A | B2 ) P(B2 ) + · · · + P (A | BN ) P(BN ). (3.5)
In words:
The probability of an outcome from A is the probability that an outcome from A occurs given that B1
occurs times the probability of B1 , plus the probability that an outcome from A occurs given that B2
occurs times the probability of B2 , plus . . . etc.
The formula in (3.5) is useful only to the extent that the conditional probabilities P (A | B1 ), P (A | B2 ), . . . , P (A | BN )
and the probabilities of each Bk are easy to compute. This is what I mean by the use of the descriptive ‘convenient’
when I say that one should look for a ‘convenient’ decomposition of S as B1 ∪ B2 ∪ · · · ∪ BN .
By the way, do you recognize (3.5) as a linear equation? You might if you denote P(A) by y, each P(Bj ) by xj and
P (A | Bj ) by aj so that this reads
y = a1 x1 + a2 x2 + · · · + aN xN .
Thus, linear systems arise!
Here is why (3.5) holds: Remember that P (A | B) = P(A ∩ B)/P(B). Thus, P (A | B) · P(B) = P(A ∩ B). Therefore,
the right hand side of (3.5) states that
P(A) = P(A ∩ B1 ) + P(A ∩ B2 ) + · · · + P(A ∩ BN ).
This now says that the probability of A is obtained by summing the probabilities of the parts of A that appear in each
Bn . That such is the case follows when the Bn ’s don’t share elements but account for all of the elements of S. Indeed,
if, say B1 shared an element with B2 , then that element would be overcounted on the right-hand side of the preceding
equation. On the other hand, if the Bn ’s don’t account for all elements in S, then there might be some element in A
that is not accounted for by the sum on the right-hand side.
What follows is a sample application of (3.5): Suppose that I have six coins where the probability of heads on the first
is 1/2, that on the second is 1/4, that on the third is 1/8, that on the fourth is 1/16, that on the fifth is 1/32, and that on the last
is 1/64. Suppose that I label these coins by 1, 2, . . . , 6 so that the probability of heads on the m’th coin is 2^−m. Now I
also have my pathological die, the one where the probability of the face with number n ∈ {1, 2, . . . , 6} is n/21. I roll
my pathological die and the number that shows face up tells me what coin to flip. All of this understood, what is the
probability of the flipped coin showing heads?
To answer this question, I first note that the relevant sample space has 12 elements, these of the form (n, H) or (n, T ),
where n can be 1, 2, 3, 4, 5, or 6. This is to say that (n, H) is the event that the n’th coin is chosen and heads appears,
while (n, T ) is the event that the n’th coin is chosen and tails appears. The set A in this case is the event that H
appears. To use (3.5), I first decompose my sample space into 6 subsets, {Bn }1≤n≤6 , where Bn is the event that the
n’th coin is chosen. This is a ‘convenient’ decomposition because I know P(Bn ), this being n/21. I also know P (A | Bn ),
this being 2^−n. Granted this, then (3.5) finds that the probability of heads is equal to
(1/2)·(1/21) + (1/4)·(2/21) + (1/8)·(3/21) + (1/16)·(4/21) + (1/32)·(5/21) + (1/64)·(6/21) = 5/56.
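The arithmetic above can be verified with exact fractions; here is a minimal sketch of (3.5) for this example:

```python
# P(heads) by the total probability formula (3.5) for the six-coin example.
from fractions import Fraction

p_B = {n: Fraction(n, 21) for n in range(1, 7)}            # P(B_n): die selects coin n
p_A_given_B = {n: Fraction(1, 2**n) for n in range(1, 7)}  # P(A | B_n) = 2^(-n)

p_heads = sum(p_A_given_B[n] * p_B[n] for n in range(1, 7))
print(p_heads)   # 5/56
```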
Thus, it is enough for me to know the probabilities for the possible genotypes of the parents.
The point here is that the conditional probabilities {P (A | Bn )}1≤n≤5 are easy to compute. Note also that their
computation is based on theoretical considerations. This is to say that we made the hypothesis that ‘a parent gives
one of its two genes to the offspring, and that there is equal probability of giving one or the other.’ The formula given
above for P(wrinkled offspring) should be viewed as a prediction to be confirmed or not by experiments.
What follows is another example from biology. Suppose that I am concerned with a stretch of DNA of length N ,
and want to know the probability of not seeing the base guanine in this stretch. Let A denote the event that
there is no guanine in a particular length N stretch of DNA. Let B1 denote the event that there is no guanine in the
stretch of length N − 1, and let B2 denote the event that guanine does appear in this length N − 1 stretch. In this
case, P (A | B2 ) = 0 since A and B2 are disjoint. What about P (A | B1 )? This is the probability that guanine is not
in the N th site if none has appeared in the previous N − 1 sites. Of course, the probability of guanine appearing in
the N th site may or may not be affected by what is happening in the other sites. Let us make the hypothesis that each
of the four bases has equal probability to appear in any given site. Under this hypothesis, the probability of seeing no
guanine in the N th site is 3/4 since there are four bases in all and only one of them, guanine, is excluded. Thus, under
our hypothesis that each base has equal probability of appearing in any given site, we find that P (A | B1 ) = 3/4. This
understood, it then follows from (3.5) that P(A) = (3/4) P(B1 ).
Now we can compute P(B1 ) in an analogous fashion by considering the relative probability of no guanine in the
(N − 1)st site given that none appears in the previous N − 2 sites. Under our hypothesis of equal probabilities for the
bases, this gives P(B1 ) = 3/4 times the probability of no guanine in the first N − 2 sites. We can use this trick again to
compute the latter probability; it equals 3/4 times the probability of no guanine in the first N − 3 sites. Continuing in
this vein finds P(A) to be equal to the product of N copies of 3/4, thus (3/4)^N.
This computation of P(A) = (3/4)^N is now a theoretical prediction based on the hypothesis that the occurrence of any
given base in any given DNA site has the same probability as that of any other base. You are challenged to think of an
experiment that will test this prediction.
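One way to test the prediction without a laboratory is a computer simulation. The sketch below draws random stretches under the equal-probability hypothesis and compares the observed frequency of guanine-free stretches with (3/4)^N:

```python
# Monte Carlo check of P(no guanine in N sites) = (3/4)^N, under the
# hypothesis that each of the four bases is equally likely at every site.
import random

def no_guanine_fraction(N, trials, rng):
    """Fraction of simulated length-N stretches containing no G."""
    hits = 0
    for _ in range(trials):
        if all(rng.choice("ACGT") != "G" for _ in range(N)):
            hits += 1
    return hits / trials

rng = random.Random(0)            # fixed seed, for reproducibility
N = 5
estimate = no_guanine_fraction(N, 100_000, rng)
exact = 0.75 ** N                 # (3/4)^5 ≈ 0.2373
print(estimate, exact)            # the two agree to about two decimal places
```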
So, we have a 4 × 4 matrix M whose entry in row i and column j is P (Ai | Bj ). Now write each P(Ai ) as yi and each
P(Bj ) as xj , and these last four equations can be summarized by the assertion that
yi = Σ1≤j≤4 Mij xj for each i = 1, 2, 3, and 4.
using the original form of Bayes’ theorem. To compute P(A), I use (3.5). Together, these two equalities imply the
desired one:
P (B3 | A) = P (A | B3 ) P(B3 ) / [P (A | B1 ) · P(B1 ) + · · · + P (A | BN ) · P(BN )].
Of course, I make such an equation for any given P (Bk | A), and this gives the most general form of Bayes theorem:
P (Bk | A) = P (A | Bk ) P(Bk ) / [P (A | B1 ) · P(B1 ) + · · · + P (A | BN ) · P(BN )]. (3.6)
This equation provides the conditional probability of Bk given that A occurs from the probabilities of the various Bk
and the conditional probabilities that A occurs given that any one of these Bk occur.
To see how this works in practice, return to the example I gave above where I have six coins, these labeled
{1, 2, 3, 4, 5, 6}; and where the mth coin has probability 2^−m of landing with the head up when flipped. I also
have my pathological six-sided die, where the probability of the nth face appearing when rolled is n/21. As before, I
first roll the die and if the nth face appears, I flip the coin with the label n. I don’t tell you which coin was flipped, but
I do tell you that heads appeared. What is the probability that the coin #3 was flipped?
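The question just posed can be answered directly from (3.6); here is a sketch with exact fractions:

```python
# Bayes' theorem (3.6): which coin produced the observed heads?
from fractions import Fraction

p_B = {n: Fraction(n, 21) for n in range(1, 7)}            # P(B_n), from the die
p_A_given_B = {n: Fraction(1, 2**n) for n in range(1, 7)}  # P(A | B_n) = 2^(-n)

p_A = sum(p_A_given_B[n] * p_B[n] for n in range(1, 7))    # denominator of (3.6): 5/56
posterior = {n: p_A_given_B[n] * p_B[n] / p_A for n in range(1, 7)}
print(posterior[3])   # 1/5: the probability that coin #3 was flipped
```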
3.7 Exercises:
1. This exercise concerns the sample space S that is depicted in (3.1). If S represents the outcomes for three pulse
rate measurements of a given individual, it is perhaps more realistic to take the following probability function:
The function P assigns probability 1/3 to + + + and to − − − while assigning 1/18 to each of the remaining
elements.
(a) Is the event that + appears first independent of the event that + appears last?
(b) Is the event that + appears second independent of the event that + appears last?
(c) What is the conditional probability that + appears first given that − appears second?
2. Suppose that the probability of a student lying in the infirmary is 1%, and that the probability that a student
has an exam on any given day is 5%. Suppose as well that 6% of students with exams go to the infirmary. What
is the probability that a student in the infirmary has an exam on a given day?
3. Label the four bases that are used by DNA as {1, 2, 3, 4}.
(a) Granted this labeling, write down the sample space for the possible bases at two given sites on the molecule.
(b) Invent a probability function for this sample space.
(c) Let Aj for j = 1, 2, 3, 4 denote the event in this two-site sample space that the first site has the base j,
and let Bj for j = 1, . . . , 4 denote the analogous event for the second site. Use the definition of condi-
tional probability to explain why, for any probability function and for any k, P (A1 | Bk ) + P (A2 | Bk ) +
P (A3 | Bk ) + P (A4 | Bk ) must equal 1.
(d) Is there a choice for a probability function on the sample space that makes P (A1 | B1 ) = P (B1 | A1 ) in
the case that A1 and B1 are not independent? If so, give an example. If not, explain why.
4. This problem refers to the scenario that I described above where I have six coins, these labeled {1, 2, 3, 4, 5, 6};
and where the mth coin has probability 2^−m of landing with the head up when flipped. I also have my pathological
six-sided die, where the probability of the nth face appearing when rolled is n/21. As before, I first roll the
die and if the nth face appears, I flip the coin with the label n. I don’t tell you which coin was flipped, but I do
tell you that heads appeared.
(a) For n = 1, 2, 4, 5, and 6, give the probability that the coin with the label n was flipped.
(b) For what n, if any, is the event that the nth face appears independent from the event that heads appears.
5. Suppose that A and B are subsets of a sample space with a probability function, P. Suppose in addition that
P(A) = 4/5 and P(B) = 3/5. Explain why P (B | A) is at least 1/2.
6. For many types of cancer, early detection is the key to successful treatment. Prostate cancer is one of these. For
early detection, the National Cancer Institute suggests screening of patients using the Serum Prostate-Specific
Antigen (PSA) Test. There is controversy due to the lack of evidence showing that early detection of prostate
cancer and aggressive treatment of early cancers actually reduces mortality. Also, this treatment can often lead
to complications of impotence and incontinence.
Here is some terminology that you will meet if you go on to a career in medicine: The sensitivity of a test is the
probability of a positive test when the patient has the disease, and the specificity of a test is the probability of a
negative test when the patient does not have the disease. In the language of conditional probability,
• Sensitivity is the conditional probability of a positive test given that the disease is present.
• Specificity is the conditional probability of a negative test given that the disease is not present.
The standard PSA test to detect early stage prostate cancer has Sensitivity = 0.71 and Specificity = 0.91. Thus,
0.71 is the conditional probability of a positive test given that a person does have prostate cancer. And, 0.91 is
the conditional probability of a negative test given that a person does not have prostate cancer. Note for what
follows that roughly 0.7% of the male population is diagnosed with prostate cancer each year.
Granted this data, here is the question that this problem will answer:
If a patient receives a positive test for prostate cancer, what is the probability he truly has cancer?
To answer this question, let A denote the event that a person has a positive test, and let B denote the event that a
person has prostate cancer. This question is asking for the conditional probability of B given A; thus P (B | A).
The data above gives P (A | B) = 0.71 and P(B) = 0.007 and also P (Ac | B c ) = 0.91 where Ac is the event
that a person has a negative test, and B c is the event that a person does not have cancer. As the set of questions
that follow demonstrate, the information given is sufficient to answer the question posed above.
(a) Why is P (B | A) = (0.71) × (0.007)/P(A) ≈ 0.005/P(A)?
If you answered this, then the task is to find P(A).
(a) What is the sample space for the possible pairs of alleles that Erin could have inherited from her parents?
Now assume each allele is equally likely to be passed from Erin’s parents to Erin.
(b) Explain how this information gives a probability function on the sample space that you found in (a).
(c) Use the probability function from (b) to give the probability that Erin is albino.
(d) Given that Erin is not albino, what is the probability that she has the albino allele?
Mendel’s paper: http://www.mendelweb.org/Mendel.html
CHAPTER FOUR
Linear transformations
My purpose here is to give some examples of linear transformations that arise when thinking about probability as
applied to problems in biology. As you should recall, a linear transformation on Rn can be viewed as the effect of
multiplying vectors by a given matrix. If A is the matrix and ~v is any given vector, the transformation has the form
~v → A~v , where A~v is the vector with components
(A~v )j = Σk Ajk vk . (4.1)
Thus,
(A~v )1 = A11 v1 + A12 v2 + · · · + A1n vn
(A~v )2 = A21 v1 + A22 v2 + · · · + A2n vn
...
(A~v )n = An1 v1 + An2 v2 + · · · + Ann vn (4.2)
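The component formulas (4.1) and (4.2) translate directly into code; here is a minimal sketch, with plain Python lists standing in for vectors and matrices:

```python
# Matrix-vector multiplication, component by component as in (4.1)/(4.2).

def mat_vec(A, v):
    """Return Av, where (Av)_j = sum_k A[j][k] * v[k]."""
    return [sum(Ajk * vk for Ajk, vk in zip(row, v)) for row in A]

A = [[1, 2],
     [3, 4]]
v = [5, 6]
print(mat_vec(A, v))   # [17, 39]
```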
This equation says that the probability of seeing amino acid i in the 127th position is obtained by using the formula
in (3.5). In words:
The probability of seeing amino acid i in generation t + 1 is the conditional probability that the amino
acid at position 127 is i in the offspring given that it is 1 in the parent times the probability that amino
acid 1 appears in generation t, plus the conditional probability that the amino acid at position 127 is i in
the offspring given that it is 2 in the parent times the probability that amino acid 2 appears in generation
t, plus . . . etc.
The point here is that the vector p~(t) with entries pi (t) is changed via p~(t) → p~(t + 1) = A~p(t) after each generation.
Here, A is the matrix whose ij entry is Aij .
CHAPTER FIVE
What follows are some areas in biology and statistics where matrix products appear.
5.1 Genomics
Suppose that a given stretch of DNA coding for a cellular product is very mutable, so that there are some number, N ,
of possible sequences that can appear in any given individual in the population (this is called ‘polymorphism’). To
elaborate, a strand of DNA is a molecule that appears as a string of small, standard molecules that are bound end to
end. Each of these standard building blocks can be one of four, labeled C, G, A and T. The order in which they appear
along the strand determines any resulting cellular product that the given part of the DNA molecule might produce. For
example, AAGCTA may code for a different product than GCTTAA.
As it turns out, there are stretches of DNA where the code can be changed without damage to the individual. What
with inheriting genes from both parents, random mutations over the generations can then result in a population where
the codes on the given stretch of DNA vary from individual to individual. In this situation, the gene is called ‘poly-
morphic’. Suppose that there are N different possible codes for a given stretch. For example, if one is looking at just
one particular site along a particular DNA strand, there could be at most N = 4 possibilities at that site, namely C, G,
A or T. Looking at two sites gives N = 4 × 4 = 16 possibilities.
I am going to assume in what follows that the sites of interest along the DNA are inherited from parent to child only from
the mother or only from the father. Alternately, I will assume that I am dealing with a creature such as a bacterium that
reproduces asexually. This assumption simplifies the story that follows.
Let us now label the possible sequences for the DNA site under discussion by integers starting from 1 and going to N.
At any given generation t, let pj (t) denote the frequency of the appearance of the jth sequence in the population at
generation t. These frequencies then change from one generation to the next in the following manner: The probability
of any given sequence, say i, appearing in generation t + 1 can be written as a sum:
pi (t + 1) = P (i | 1) p1 (t) + P (i | 2) p2 (t) + · · · + P (i | N ) pN (t), (5.1)
where each P (i | j) can be viewed as the conditional probability that a parent with sequence j produces an offspring
with sequence i. This is to say that the probability of sequence i appearing in an individual in generation t + 1 is equal
to the probability that sequence 1 appears in the parent times the probability of a mutation that changes sequence 1 to
sequence i, plus the probability that sequence 2 appears in the parent times the probability of a mutation that changes
sequence 2 to sequence i, and so on.
We can write the suite of N versions of (5.1) using our matrix notation by thinking of the numbers {pj (t)}1≤j≤N
as defining a column vector, p~(t), in RN , and likewise the numbers {pi (t + 1)}1≤i≤N as defining a second column
vector, p~(t + 1) in RN . If I introduce the N × N matrix A whose entry in the jth column and ith row is P (i | j),
then (5.1) says in very cryptic shorthand:
p~(t + 1) = A p~(t).    (5.2)
I can sample the population at time t = T = now, and thus determine p~(T ), or at least the proxy that takes pi (T ) to
be the percent of people in the population today that have sequence i. One very interesting question is to determine
p~(T ′ ) at some point far in the past, thus T ′ < T . For example, if we find p~(T ′ ) such that all pi (T ′ ) are zero but a very
few, this then indicates that the population at time T ′ was extremely homogeneous, and thus presumably very small.
To determine p~(T 0 ), we use (5.2) in an iterated form:
p(T − 1) = AA~
p~(T ) = A~ p(T − 2)) = AAA~ p(T 0 ),
p(T − 3) = · · · = A · · · A~ (5.3)
where the final term has T − T 0 copies of A multiplying one after the other.
In a similar vein, we can use (5.2) to predict the distribution of the sequences in the population at any time T ′ > T
by iterating it to read

p~(T ′ ) = A · · · A p~(T ).    (5.4)

Here, the multiplication is by T ′ − T successive copies of the matrix A.
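The iteration in (5.2) and (5.4) is easy to carry out numerically. Here is a minimal Python sketch for a hypothetical two-sequence population; the mutation probability mu, the matrix entries, and the number of generations are made-up values for illustration only:

```python
# Sketch of (5.2) and (5.4): iterate p(t+1) = A p(t) for a hypothetical
# two-sequence population. A[i][j] plays the role of P(i | j); the mutation
# rate mu = 0.01 is an illustrative choice, not data.
mu = 0.01

# Column j holds the conditional probabilities P(i | j); each column sums to 1.
A = [[1 - mu, mu],
     [mu, 1 - mu]]

def step(A, p):
    """One generation: p_i(t+1) = sum_j P(i | j) p_j(t)."""
    n = len(p)
    return [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]

# Start with a perfectly homogeneous population: everyone has sequence 1.
p = [1.0, 0.0]
for _ in range(50):        # advance 50 generations: multiply by A fifty times
    p = step(A, p)

# Because the columns of A sum to 1, p remains a probability vector.
print(p, sum(p))
```

The same loop run backwards in time is not available this simply; recovering p~(T ′ ) from p~(T ) requires inverting A, which is part of what motivates the linear algebra in these notes.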
By the way, here is a bit of notation: This sort of sequence {p~(t)}t=0,1,... of vectors of probabilities is an example of
a Markov chain. In general, a Markov chain is a sequence of probabilities, {P(0), P(1), P(2), . . .}, where the N th
probability P(N ) depends only on its immediate predecessor, P(N − 1).
p1 (t + 1) = (1 − q)p1 (t) + (1 − q)p2 (t) and pN (t + 1) = qpN −1 (t) + qpN (t). (5.6)
Let us introduce the vector p~(t) in RN whose jth component is pj (t). Then (5.5) and (5.6) assert that p~(t + 1) is
obtained from p~(t) by the action of a linear transformation: p~(t + 1) = A p~(t), where A is the N × N matrix whose
only non-zero entries are:
p~(t) = A · · · A p~(0),    (5.9)

where A · · · A signifies t copies of A multiplied one after the other. By the way, a common shorthand for some n
copies of any given matrix, A, successively multiplying one after the other is A^n.
pj (t + 1) = qpj−1 (t) + (1 − q)pj+1 (t) when j ≥ 1 and p0 (t + 1) = (1 − q)(p1 (t) + p0 (t)) (5.10)
as long as α and β are well supplied. This last equation is another version of the one that is depicted in (5.2).
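The update rule in (5.10) can be run directly on a computer. In the sketch below, the value q = 0.4 and the truncation of the chain at a finite size are illustrative choices, not part of the model; the truncation is harmless as long as the run is short enough that no probability reaches the top of the array:

```python
# A sketch of (5.10): with probability q the count of gamma units goes up by
# one, with probability 1 - q it goes down (stopping at 0). q = 0.4 is an
# arbitrary test value.
q = 0.4
M = 50                 # truncate the chain at j = M (never reached below)
p = [0.0] * M
p[0] = 1.0             # at t = 0 there are no gamma units

def step(p, q):
    new = [0.0] * len(p)
    new[0] = (1 - q) * (p[1] + p[0])                # boundary case in (5.10)
    for j in range(1, len(p) - 1):
        new[j] = q * p[j - 1] + (1 - q) * p[j + 1]  # interior case in (5.10)
    return new

for _ in range(10):
    p = step(p, q)

print(sum(p))   # total probability is conserved
```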
5.5 Exercises:
Exercises 1–3 concern the example above where the position of the bacteria at any given time t is labeled by an integer,
x(t), in the set {1, . . . , N }. Don’t assume that we know where the bacteria is at t = 0.
4. Write down a version of (5.10) that would hold in the case that the enzyme when folded correctly produces
some L = 1, or L > 2 units of γ per unit time. To be more explicit, assume first that when the enzyme is
folded correctly, it only makes 1 unit of γ per unit time to see how (5.10) will change. Then see how (5.10)
must change if the enzyme makes 3 units per unit time. Finally, consider the case where it makes L units per
unit time and write (5.10) in terms of this number L.
SIX
Random variables
In favorable circumstances, the different outcomes of any given experiment have measurable properties that distinguish
them. Of course, if a given outcome has a certain probability, then this is also the case for any associated measurement.
The notion of a ‘random variable’ provides a mathematical framework for studying these induced probabilities on the
measurements.
Here is a simple example: Suppose that I have some large number, say 100, coins, all identical and all with probability
1/2 of landing heads when flipped. I am going to flip them all, and put a dollar in your bank account for each head that
appears. You can’t see me flipping the coin, but you can go to your bank tomorrow and measure the size of your bank
account. You might be interested in knowing the probability of your account increasing by any given amount. The
amount in your account is a random variable.
To explore this example a bit, note that the configuration space of possible outcomes from flipping 100 coins consists
of sequences that are 100 letters long, each letter being either H or T . There are 2^100 ≈ 1.3 × 10^30 elements in
the configuration space! If s is such a 100 letter sequence, let f (s) denote the number of heads that appear in the
sequence s. Thus, f (s) can be any integer from 0 through 100. The assignment s → f (s) is an example of a random
variable. It is a function of sorts on the configuration space. Of interest to you are the probabilities for the appearances
of the various possible values of this function f . This is to say that your concern is the probability function on the 101
element set {0, . . . , 100} that gives the probability of your bank account increasing by any given amount.
Here is another example that is slightly less contrived: You are in charge of stocking a lake with trout. You put some
large number, say N , of trout in a lake. Due to predation, the odds are 50–50 that any given trout will be eaten after
one year. Meanwhile, the trout do not breed for their first year, so you are interested in the number of trout that survive
to the second year. This number can be anywhere from 0 to N . What is the probability of finding a given number in
this range? Note that in the case that N = 100, this question is essentially identical to the question just posed about
your bank account.
The model for this is as follows: There is a sample space, S, whose elements consist of sequences of N letters, where
each letter is either a D (for dead) or L (for live). Thus, S has 2N elements. I assign to each element in S a number,
this the number of L’s in the sequence. This assignment of a number to each element in S is a function on S, and of
interest to me are the probabilities for the possible values of this function. Note that these probabilities are not for the
elements of S; rather they are for the elements in a different sample space, the set of integers from 0 through N .
is the ‘stop’ signal. This is the genetic code. The code is thus a function from a set with 64 elements to one with 21
elements.
Most often, random variables take real number values. For example, let S denote the 20 possible amino acids that can
occupy the 127th position from the end of a certain enzyme (a type of protein molecule) that helps the cell metabolize
the sugar glucose. Now, let f denote the function on S that measures the rate of glucose metabolism in growing
bacteria with the given enzyme at the given site. In this case, f associates to each element in a 20 element set a real
number.
In words:
The probability that f = r is the sum of the probabilities of those elements in S where f is equal to r.
For an example of what happens in (6.1), consider the situation that I described at the outset where I flip 100 coins
and pay you one dollar for each head that appears. As noted, the sample space S is the 2100 element set whose typical
element is a sequence, s, of 100 letters, each either H or T . The random variable, f , assigns to any given s ∈ S the
number of heads that appear in the sequence s.
To explore (6.1) in this case, let us agree that the probability of any given element in s is 2−100 . This is based on my
telling you that each of the 100 coins is fair. It also assumes that the appearance of H or T on any one coin has no
bearing on whether H or T appear on any other. (I can say this formally as follows: The event that H appears on any
given coin is independent from the event that H appears on any other coin.) Thus, no matter what s is, the value P(s)
that appears in (6.1) is equal to 2^−100. This understood, (6.1) asserts that the probability that f is equal to any given
integer r ∈ {0, . . . , 100} is obtained by multiplying 2^−100 times the number of elements in S that are sequences with
precisely r heads.
For example, P(f = 0) = 2^−100 because there is just one element in S with no heads at all, this the element T T · · · T .
Thus, it is a good bet that you will get at least one dollar. On the other hand, P(f = 100) is also 2^−100 since only
HH · · · H has 100 heads. So, it is a good bet that I will lose less than 100 dollars. Consider next the probability for
f to equal 1. There are 100 sequences from S with 1 head. These being HT T · · · T , T HT · · · T , . . . , T · · · T HT ,
T · · · T T H. Thus, P(f = 1) is 100 · 2^−100. This is still pretty small, on the order of 10^−28. How about the probability
for 2 dollars? In this case, there are (1/2) · 100 · 99 elements in S with two heads. If you buy this count, then P(f = 2) is
50 · 99 · 2^−100. We shall learn in a subsequent chapter that P(f = r) is (100 × 99 × · · · × (100 − r + 1))/(1 × 2 × · · · ×
r) × 2^−100.
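The counts in the preceding paragraph can be checked with a few lines of Python, using the binomial coefficient C(100, r) that the formula above anticipates:

```python
import math

# The coin-counting argument, done with math.comb: the number of 100-letter
# sequences with exactly r heads is C(100, r), and each sequence has
# probability 2**-100.
def prob_heads(r, n=100):
    return math.comb(n, r) / 2 ** n

# The counts worked out in the text:
assert prob_heads(0) == 2 ** -100                    # only TT...T
assert prob_heads(1) == 100 * 2 ** -100              # 100 one-head sequences
assert prob_heads(2) == (100 * 99 // 2) * 2 ** -100  # (1/2)(100)(99) sequences

print(prob_heads(1))   # about 7.9e-29, "on the order of 10^-28"
```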
For a second example, take S to be the set of 20 possible amino acids at the 127th position from the end of the glucose
metabolizing enzyme. Let f now denote the function from S to the 10 element set that is obtained by measuring to the
nearest 10% the fraction of glucose used in one hour by the growing bacteria. Number the elements of S from 1 to 20,
and suppose that P assigns the kth amino acid probability 1/10 if k ≤ 5, probability 1/20 if 6 ≤ k ≤ 10 and probability
1/40 if k > 10. Meanwhile, suppose that f (k) = 1 − k/10 if k ≤ 10 and f (k) = 0 if k > 10. This understood, it then
follows using (6.1) that P(f = n/10) is equal to

0 for n = 10,   1/10 for 5 ≤ n ≤ 9,   1/20 for 1 ≤ n ≤ 4,   and 3/10 for n = 0.    (6.2)
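The bookkeeping behind (6.2) can be verified directly; the following Python sketch pushes the probability function P on {1, . . . , 20} forward through f exactly as (6.1) prescribes, using exact fractions:

```python
from fractions import Fraction

# The probability function P on {1, ..., 20} from the text.
P = {k: Fraction(1, 10) if k <= 5 else
        Fraction(1, 20) if k <= 10 else
        Fraction(1, 40)
     for k in range(1, 21)}

def f(k):
    """f(k) = 1 - k/10 for k <= 10, and 0 otherwise."""
    return Fraction(10 - k, 10) if k <= 10 else Fraction(0)

# P(f = r) is the sum of P(k) over those k with f(k) = r, as in (6.1).
Pf = {}
for k, pk in P.items():
    Pf[f(k)] = Pf.get(f(k), Fraction(0)) + pk

assert Pf[Fraction(0)] == Fraction(3, 10)       # n = 0 case of (6.2)
assert Pf[Fraction(9, 10)] == Fraction(1, 10)   # n = 9 case
assert Pf[Fraction(1, 10)] == Fraction(1, 20)   # n = 1 case
assert sum(Pf.values()) == 1
```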
By the way, equation (6.1) can be viewed (at least in a formal sense) as a matrix equation in the following way:
Introduce a matrix by writing Ars = 1 if f (s) = r and Ars = 0 otherwise. Then, P(f = r) = Σs∈S Ars P(s).
These values define the probability function, Pf , on the set Sf = {2, . . . , 12}.
For a third example, consider again the case that is relevant to (6.2). The sum for the mean in this case is

1 · 0 + (9/10) · (1/10) + (8/10) · (1/10) + (7/10) · (1/10) + (6/10) · (1/10) + (5/10) · (1/10)
+ (4/10) · (1/20) + (3/10) · (1/20) + (2/10) · (1/20) + (1/10) · (1/20) + 0 · (3/10),

which equals 2/5. Thus, µ = 2/5. The standard deviation in this example is the number whose square is the sum

(9/25) · 0 + (25/100) · (1/10) + (16/100) · (1/10) + (9/100) · (1/10) + (4/100) · (1/10) + (1/100) · (1/10)
+ 0 · (1/20) + (1/100) · (1/20) + (4/100) · (1/20) + (9/100) · (1/20) + (4/25) · (3/10),

which equals 11/100. Thus, σ = √11/10 ≈ 0.33.
Y2 = X(1, 1)
Y3 = X(1, 2) + X(2, 1)
Y4 = X(1, 3) + X(2, 2) + X(3, 1)
Y5 = X(1, 4) + X(2, 3) + X(3, 2) + X(4, 1)
Y6 = X(1, 5) + X(2, 4) + X(3, 3) + X(4, 2) + X(5, 1)
Y7 = X(1, 6) + X(2, 5) + X(3, 4) + X(4, 3) + X(5, 2) + X(6, 1)
Y8 = X(2, 6) + X(3, 5) + X(4, 4) + X(5, 3) + X(6, 2)
Y9 = X(3, 6) + X(4, 5) + X(5, 4) + X(6, 3)
Y10 = X(4, 6) + X(5, 5) + X(6, 4)
Y11 = X(5, 6) + X(6, 5)
Y12 = X(6, 6)
Here X(a, b) is our unknown proxy for the probability function, P, on the sample space of pairs of the form (a, b)
where a and b are integers from 1 through 6.
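If one assumes fair, independent dice, so that X(a, b) = 1/36 for every pair, the equations above can be evaluated directly. The fairness assumption is mine for illustration; in the text the X(a, b) are unknowns. A Python sketch:

```python
from fractions import Fraction
from itertools import product

# Under the (assumed) fair-dice model, every pair (a, b) has X(a, b) = 1/36.
X = {(a, b): Fraction(1, 36) for a, b in product(range(1, 7), repeat=2)}

# Each Y_k sums X(a, b) over the pairs with a + b = k, as in the equations above.
Y = {}
for (a, b), x in X.items():
    Y[a + b] = Y.get(a + b, Fraction(0)) + x

assert Y[2] == Fraction(1, 36)   # only X(1, 1) contributes
assert Y[7] == Fraction(1, 6)    # six pairs sum to 7
assert sum(Y.values()) == 1
```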
To see how this works in the general case, suppose that S is a sample space and f a random variable on S. Suppose
that there is some finite set of possible values for f , these labeled as {r1 , . . . , rN }. When k ∈ {1, . . . , N }, let yk
denote the frequency that rk appears as the value for f in our experiments. Label the elements in S as {s1 , . . . , sn }.
y1 = a11 x1 + · · · + a1n xn
..
. (6.6)
yN = aN 1 x1 + · · · + aN n xn ,
where akj = 1 if f (sj ) = rk and akj = 0 otherwise. Note that this whole strategy is predicated on two things: First,
that the sample space is known. Second, that there is enough of a theoretical understanding to predict a priori the values
for the measurement f on each element in S.
To see something of this in action, consider first the example from the game of craps. In this case there are 11
equations for 36 unknowns, so there are infinitely many possible choices for the collection {X(a, b)} for any given
set {Y2 , . . . , Y12 }. Even so, the equations determine X(1, 1) and X(6, 6). If we expect that X(a, b) = X(b, a), then
there are 21 unknowns and the equations now determine X(1, 2) and X(5, 6) also.
Consider next the example from (6.2). For the sake of argument, suppose that the measured frequencies of P(f = n/10)
are exactly those given in (6.2). Label the possible values of f using r1 = 0, r2 = 1/10, · · · , r11 = 1. This done, the
relevant version of (6.6) is the following linear equation:
3/10 = x10 + · · · + x20
1/20 = x9
1/20 = x8
1/20 = x7
1/20 = x6
1/10 = x5
1/10 = x4
1/10 = x3
1/10 = x2
1/10 = x1
0=0
As you can see, this determines xj = P(sj ) for j ≤ 9, but there are infinitely many ways to assign the remaining
probabilities.
We shall see in subsequent lessons how to choose a ‘best possible’ solution of a linear equation that has more unknowns
than equations.
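To make the indeterminacy concrete: here is a Python sketch that keeps the pinned values x1, . . . , x9 and makes one particular (by no means canonical) choice for the rest, spreading the leftover 3/10 evenly over x10, . . . , x20. This even spread happens to minimize the sum of the squares of the remaining unknowns, which anticipates the ‘best possible’ notion developed later:

```python
from fractions import Fraction

# The determined unknowns x1, ..., x9 from the equations above.
x = {}
for j, val in [(1, Fraction(1, 10)), (2, Fraction(1, 10)), (3, Fraction(1, 10)),
               (4, Fraction(1, 10)), (5, Fraction(1, 10)), (6, Fraction(1, 20)),
               (7, Fraction(1, 20)), (8, Fraction(1, 20)), (9, Fraction(1, 20))]:
    x[j] = val

# Only the sum x10 + ... + x20 = 3/10 is constrained; spread it evenly as one
# arbitrary (minimum-norm) choice among infinitely many.
remaining = Fraction(3, 10)
for j in range(10, 21):
    x[j] = remaining / 11

assert sum(x.values()) == 1   # still a probability function on S
```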
where ars is obtained from the theoretical model. Note that our task then is to solve for the collection {P(s)},
effectively solving a version of the linear equation in (6.7).
If this is zero for all values of r and ρ, then f and g are independent. Consider first the case where r = 2 and ρ = 2.
Then the set where f = 2 and g = 2 consists of just the triple (1, 1, 1) and so it has probability 1/27. Meanwhile, the
set where f = 2 consists of 3 triples, (1, 1, 1), (1, 1, 2) and (1, 1, 3). Thus, it has probability 3/27. Likewise, the set
where g = 2 has probability 3/27 since it consists of the three elements (1, 1, 1), (2, 1, 1) and (3, 1, 1). The quantity
P(f = 2 and g = 2) − P(f = 2)P(g = 2) is equal to 2/81. Since this is not zero, the random variables f and g are not
independent.
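Since the sample space has only 27 elements, the computation above can be redone by brute force. A Python sketch, taking f (a, b, c) = a + b and g(a, b, c) = b + c as in the example:

```python
from fractions import Fraction
from itertools import product

# S is the 27-element set of triples (a, b, c) with entries in {1, 2, 3},
# each with probability 1/27.
S = list(product(range(1, 4), repeat=3))
P = Fraction(1, 27)

def prob(event):
    """Sum the probabilities of the elements of S where the event holds."""
    return sum(P for s in S if event(s))

p_joint = prob(lambda s: s[0] + s[1] == 2 and s[1] + s[2] == 2)
p_f = prob(lambda s: s[0] + s[1] == 2)
p_g = prob(lambda s: s[1] + s[2] == 2)

# Nonzero, so f and g are not independent.
assert p_joint - p_f * p_g == Fraction(2, 81)
```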
For a second example, take S, P and f as just described, but now consider the case where g is the random variable that
assigns c − b to the triple (a, b, c). In this case, g can take any integer in the range from −2 to 2. Consider f = 2 and
g = 0. In this case, the set where both f = 2 and g = 0 consists of (1, 1, 1) and so has probability 1/27. The set where
g = 0 consists of 9 elements, these of the form (a, b, b) where a and b can be any integers from 1 through 3. Thus,
P(g = 0) = 9/27 = 1/3. Since 1/27 = (1/9) · (1/3), the event that a + b = 2 is independent from the event that c − b = 0. Of
course, to see if f and g are independent random variables, we need to consider other values for f and for g.
To proceed with this task, consider the case where f = 2 and g = −2. The event that g = −2 consists of three
elements, these of the form (a, 3, 1) where a can be any integer from 1 to 3. As a consequence, the event that g = −2
has probability 3/27 = 1/9.
Label the values of f so that r1 = 0, r2 = 1/10, . . . , r10 = 9/10, r11 = 1. Meanwhile, label those of g in the order they
appear above, ρ1 = 0, ρ2 = 1 and ρ3 = 2. The correlation matrix in this case is an 11 × 3 matrix. For example, here
are the coefficients in the first row:

C11 = 11/50,   C12 = −3/25,   C13 = −3/25.
To explain, note that the event that f = 0 consists of the subset {10, . . . , 20} in the set of integers from 1 to 20. This set
is a subset of the event that g is zero since the latter set is {7, . . . , 20}. Thus, P(f = 0 and g = 0) = P(f = 0) = 3/10,
while there are no events where f is 0 and g is either 1 or 2.
By the way, this example illustrates something of the contents of the correlation matrix: If Ckj > 0, then the outcome
f = rk is relatively likely to occur when g = ρj . On the other hand, if Ckj < 0, then the outcome f = rk is unlikely
to occur when g = ρj . Indeed, in the most extreme case, the function f is never rk when g is ρj and so Ckj =
−P(f = rk )P(g = ρj ).
As an addendum to this discussion about correlation matrices, I say again that statisticians are wont to use a single
number to summarize behavior. In the case of correlations, they favor what is known as the correlation coefficient.
The latter, c(f, g), is obtained from the correlation matrix and is defined as follows:

c(f, g) = (1/(σ(f )σ(g))) Σk,j (rk − µ(f ))(ρj − µ(g)) Ckj .    (6.10)
Here, µ(f ) and σ(f ) are the respective mean and standard deviation of f , while µ(g) and σ(g) are their counterparts
for g.
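For a concrete instance of (6.10), the following Python sketch computes c(f, g) for the earlier triple example with f = a + b and g = b + c. Because each row and column of the correlation matrix sums to zero, the double sum in (6.10) reduces to the covariance of f and g, and that is what the code computes:

```python
from fractions import Fraction
from itertools import product

# Uniform probability on the 27 triples (a, b, c) with entries in {1, 2, 3}.
S = list(product(range(1, 4), repeat=3))
P = Fraction(1, 27)

f = lambda s: s[0] + s[1]
g = lambda s: s[1] + s[2]

mu_f = sum(P * f(s) for s in S)
mu_g = sum(P * g(s) for s in S)
var_f = sum(P * (f(s) - mu_f) ** 2 for s in S)
var_g = sum(P * (g(s) - mu_g) ** 2 for s in S)

# The double sum in (6.10) over the correlation matrix equals the covariance.
cov = sum(P * (f(s) - mu_f) * (g(s) - mu_g) for s in S)

c = float(cov) / (float(var_f) ** 0.5 * float(var_g) ** 0.5)
print(c)   # positive: f and g share the middle entry b
```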
6.9 Exercises:
1. A number from the three element set {−1, 0, 1} is selected at random; thus each of −1, 0 or 1 has probability 1/3
of appearing. This operation is repeated twice and so generates an ordered set (i1 , i2 ) where i1 can be any one
of −1, 0 or 1, and likewise i2 . Assume that these two selections are done independently so that the event that i2
has any given value is independent from the value of i1 .
(a) Write down the sample space that corresponds to the possible pairs {i1 , i2 }.
(b) Let f denote the random variable that assigns i1 + i2 to any given (i1 , i2 ) in the sample space. Write down
the probabilities P(f = r) for the various possible values of r.
(c) Compute the mean and standard deviation of f .
(d) Let g denote the random variable that assigns |i1 | + |i2 | to any given (i1 , i2 ). Write down the probabilities
P(g = ρ) for the various possible values of ρ.
(e) Compute the mean and standard deviation of g.
(f) Compute the correlation matrix for the pair (f, g).
(g) Which pairs of (r, ρ) with r a possible value for f and ρ one for g are such that the event f = r is
independent from the event g = ρ?
2. Let S denote the same sample space that you used in Problem 1, and let P denote some hypothetical probability
function on S. Label the elements in S by consecutive integers starting from 1, and also label the possible values
for f by consecutive integers starting from 1. Let xj denote P(sj ) where sj is the jth element of the sample
space. Meanwhile, let yk denote the P(f = rk ) where rk is the kth possible value for f . Write down the linear
equation that relates {yk } to {xj }.
3. Repeat Problem 1b through 1e in the case that the probability of selecting either −1 or 1 in any given selection
is 1/4 and that of selecting 0 is 1/2.
4. Suppose that N is a positive integer, and N selections are made from the set {−1, 0, 1}. Assume that these
are done independently so that the probability of any one number arising on the kth selection is independent of
any given number arising on any other selection. Suppose, in addition, that the probability of any given number
arising on any given selection is 1/3.
Wins 2009 2008 2007 2006 2005 2004 2003 2002 2001 2000 Count Wins
0 1 1 0
1 1 1 1 1 4 1
2 1 2 1 1 1 1 1 8 2
3 1 1 1 1 1 1 2 8 3
4 2 2 4 2 5 2 4 2 2 25 4
5 3 3 2 2 3 4 6 2 3 3 31 5
6 1 1 3 4 5 3 2 4 2 25 6
7 3 2 7 3 3 3 5 6 3 35 7
8 5 5 4 8 1 4 2 3 2 2 36 8
9 5 5 2 4 4 4 1 6 2 4 37 9
10 3 2 5 3 3 3 6 4 3 4 36 10
11 3 4 2 6 1 1 3 3 4 27 11
12 1 4 2 1 2 4 3 2 3 22 12
13 2 1 3 2 2 1 1 2 1 15 13
14 1 1 1 1 1 1 6 14
15 1 1 15
16 1 1 16
(a) Should you expect f15 and f16 to be independent random variables? Explain your answer.
(b) Should you expect f8 and any fk to be independent random variables? Explain your answer.
(c) From the table above, what are the mean and standard deviation of the data representing f8 for the last ten
years? See equations (1.1) and (1.3).
1 Note: The Houston Texans did not start play until the 2002 season, so for the first two seasons in our table there are only 31 teams. Also, there
were two tie games in this timespan: Eagles-Bengals in 2008 and Falcons-Steelers in 2002.
SEVEN
Here is a basic issue in statistics: Experiments are often done to distinguish various hypotheses about the workings of a
given system. This is to say that you have various models that are meant to predict the behavior of the system, and you
do experiments to see which model best predicts the experimental outcomes. The ultimate goal is to use the observed
data to find the correct model.
A more realistic approach is to use the observed data to generate a probability function on the set of models that
are under consideration. The assigned probability should give the odds for the correctness of a given model. There
are many ways to do this. Usually, any two of the resulting probability functions assign different probabilities to the
models. For this reason, if for no other, some serious thought must be given as to which (if any) to use. In any event, I
describe some of these methods in what follows.
To indicate the flavor of what is to follow, consider first an example. Here is some background: Suppose that a given
stretch of DNA on some chromosome has N sites. If I look at this stretch in different people, I will, in general,
not get the same sequence of DNA. There are typically some number of sites where the bases differ. Granted this
basic biology, I then sample this site in a large number of people. If I find that one particular DNA sequence occurs
more often than any other, I deem it to be the consensus sequence. Suppose for simplicity that there is, in fact, such a
consensus sequence. My data gives me more than just a consensus sequence; I also have numbers {p0 , p1 , p2 , . . . , pN }
where any given pk is the fraction of people whose DNA differs at k sites from the consensus sequence. For example,
p0 is the fraction of people whose DNA for this N site stretch is identical to the consensus sequence. For a second
example, p52 is the fraction of people whose DNA for this stretch differs at 52 sites from the consensus sequence.
I wish to understand how these numbers {p0 , p1 , . . . , pN } arise. If I believe that this particular stretch of DNA is ‘junk
DNA’, and so has no evolutionary function, I might propose the following: There is some probability for a substituted
DNA base to appear at any given site of this N site long stretch of DNA. The simplest postulate I can make along
these lines is that the probability of a substitution is independent of the particular site. I am interested in finding out
something about this probability of a substitution. To simplify matters, I look for probabilities of the form m/100 where
m ∈ {0, 1, 2, . . . , 100}. So, my basic question is this:

Given the data {p0 , . . . , pN }, what is the probability that a given θ = m/100 is the true probability for a
single site substitution in the given stretch of DNA?
The interesting thing here is that if I know this probability, thus the number θ from the set {0, 1/100, 2/100, . . . , 99/100, 1},
then I can predict the sequence {p0 , p1 , . . . , pN }. We will give a general formula for this in a later chapter. However,
for small values of N , you needn’t know the general formula as we can work things out directly.
Consider first N = 1. Then there is just p0 and p1 . The probability, given θ, for one site change is θ (by definition),
thus the probability for no site changes is 1 − θ. Hence, given θ, I would predict p1 = θ and p0 = 1 − θ.
I can then compare these numbers with what I observed, p0 and p1 . Doing so, I see that there is an obvious choice for
θ, that with p0 = 1 − θ and p1 = θ. Of course, if I restrict θ to fractions of 100, then I should choose the closest
such fraction to p1 . Note, by the way, that p0 + p1 = 1 by virtue of their definition as fractions, so the two conditions
on θ amount to one and the same thing.
Consider next the case N = 2. Given θ, I would then predict
To see why this is, think of flipping two coins, with heads = site change and tails = no site change; and with probability
equal to θ for any given coin to have heads. The probability of the first coin/site having heads is θ and of it having tails
is (1 − θ). Likewise for the second coin/site. So, they both end up tails with probability (1 − θ)2 . Likewise, they both
end up heads with probability θ2 . Meanwhile, there are two ways for one head to appear, either on the first coin/site,
or on the second. Each such event has probability θ(1 − θ), so the probability of one coin/site to have heads is twice
this.
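The coin-flip argument can be checked by enumerating all change/no-change patterns. In this Python sketch, the value θ = 0.3 is an arbitrary test value:

```python
from itertools import product

# For N sites, list every change/no-change pattern (1 = site change) and add
# up the probability of the patterns with exactly k changes.
def site_change_probs(N, theta):
    probs = [0.0] * (N + 1)
    for pattern in product([0, 1], repeat=N):
        p = 1.0
        for site in pattern:
            p *= theta if site else (1 - theta)
        probs[sum(pattern)] += p
    return probs

theta = 0.3                      # an arbitrary test value
p0, p1, p2 = site_change_probs(2, theta)
assert abs(p0 - (1 - theta) ** 2) < 1e-12
assert abs(p1 - 2 * theta * (1 - theta)) < 1e-12
assert abs(p2 - theta ** 2) < 1e-12
```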
This same sort of comparison with coin flipping leads to the following for the N = 3 case:
However, note that in the case N = 2 the data {p0 , p1 , p2 } need not lead to any obvious candidate for θ. This is
because the equations
need not have a solution. Even using the fact that p0 + p1 + p2 = 1, there are still two conditions for the one unknown,
θ. Consider, for example, p0 = 9/16, p1 = 3/16, p2 = 1/4. The equations p2 = θ^2 and p0 = (1 − θ)^2 lead to very different
choices for θ. Indeed, the equation p2 = θ^2 gives θ = 1/2 and the equation p0 = (1 − θ)^2 gives θ = 1/4! This same
problem gets worse as N gets ever larger.
What’s to be done? Here is one approach: Use the data {p0 , . . . , pN } to find a probability function on the possible
values of θ. This would be a probability function on the sample space Θ = {0, 1/100, 2/100, . . . , 99/100, 1} that would give
the probability for a given θ to be the true probability of a site substitution given the observed data {p0 , . . . , pN } and
given that my model of equal chance of a site substitution at each of the N sites along the DNA strand is correct.
I will use P to denote a probability function on Θ. This is to say that P(m/100) is the probability that P assigns to any
given θ = m/100. For example, P(3/100) is the probability that 3/100 is the true probability of a mutation occurring in any
given cell. Likewise, P(49/100) is the probability that 49/100 is the true probability of a mutation occurring in any given cell.
Imagine for the moment that I have a probability function, P, on the sample space Θ. I can then use P to make a
theoretical prediction of what the observed data should be. This uses the notion of conditional probability. Here is
how it works in the case N = 2:
Prob(k site changes) = P (k | 0) P(0) + P (k | 1/100) P(1/100) + · · · + P (k | 1) P(1),    (7.3)
where the notation uses P (k | θ) to denote the probability of there being k site changes given that θ is the correct
probability. Note that I use the conditional probability notation from Chapter 3, although the context here is slightly
different. Anyway, grant me this small abuse of my notation and agree to view P (k | θ) as an honest conditional
probability. Note in this regard that it obeys all of the required rules: Each P (k | θ) is non-negative, and the sum
P (0 | θ) + P (1 | θ) + · · · + P (N | θ) = 1 for each θ.
The key point here is that I know what these conditional probabilities P (k | θ) are. For example, in the case
N = 2 they are given in (7.1), and in the case N = 3 they are given in (7.2). Thus, the various versions of (7.3) for
k ∈ {0, . . . , N } can be computed once a candidate P is given.
Of course, I don’t yet have a probability function P. Indeed, I am looking for one; and I hope to use the experimental
data {p0 , . . . , pN } to find it. This understood, it makes some sense to consider only those versions of P that give the
correct experimental results. This is to say that I consider as reasonable only those probability functions on the sample
space that give Prob(k site changes) = pk for all choices of k ∈ {0, . . . , N }. For example, in the case N = 2, these
give the conditions
p0 = Σm=0,1,...,100 (1 − m/100)^2 P(m/100)

p1 = Σm=0,1,...,100 2(m/100)(1 − m/100) P(m/100)    (7.5)

p2 = Σm=0,1,...,100 (m/100)^2 P(m/100).
To make it even more explicit, suppose again that I measured p0 = 3/4, p1 = 3/16 and p2 = 1/16. Then what is written
in (7.5) would read:

3/4 = P(0) + (99/100)^2 P(1/100) + · · · + (1/100)^2 P(99/100)

3/16 = 2(1/100)(99/100)P(1/100) + 2(2/100)(98/100)P(2/100) + · · · + 2(99/100)(1/100)P(99/100)    (7.6)

1/16 = (1/100)^2 P(1/100) + · · · + (99/100)^2 P(99/100) + P(1).
Notice, by the way, that what is written here constitutes a system of 3 equations for the 101 unknown numbers,
{P(0), P(1/100), . . . , P(1)}. As we know from our linear algebra, there will be infinitely many solutions. In the general
case of a site with length N , the analog of (7.5) has N + 1 equations, with these reading
p0 = Σm=0,1,...,100 P (0 | m/100) P(m/100)

p1 = Σm=0,1,...,100 P (1 | m/100) P(m/100)

...    (7.7)

pN = Σm=0,1,...,100 P (N | m/100) P(m/100)
Given that I know what P (k | θ) is for any given k, this constitutes a system of N + 1 equations for the 101 unknowns
{P(0), P(1/100), . . . , P(1)}. In particular, if N > 100, then there will be more equations than unknowns and so there
may be no solutions! To make this look exactly like a system of linear equations, change the notation so as to denote
the conditional probability P (0 | m/100) as a0m , P (1 | m/100) as a1m , etc. Use xm for P(m/100). And, use y0 for p0 , y1 for
p1 , etc. This done, then (7.7) reads
y0 = a00 x0 + a01 x1 + · · ·
y1 = a10 x0 + a11 x1 + · · ·    (7.8)
...
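The forward direction of (7.7)/(7.8) is easy to set up in code once the conditional probabilities P (k | θ) are known. The sketch below computes y = Ax for N = 2 using the binomial form of P (k | θ) from (7.1); the uniform choice of x is only a placeholder to exercise the computation, not a recommended probability function:

```python
import math

# The forward map in (7.7)/(7.8): given a candidate x_m = P(m/100) on Theta,
# predict y_k = p_k. Here a_km = P(k | m/100) is the binomial probability of
# k changes among N sites.
N = 2
Theta = [m / 100 for m in range(101)]
x = [1 / 101] * 101              # placeholder: uniform probability on Theta

def a(k, theta, N):
    return math.comb(N, k) * theta ** k * (1 - theta) ** (N - k)

y = [sum(a(k, Theta[m], N) * x[m] for m in range(101)) for k in range(N + 1)]

print(y)   # the predicted (p0, p1, p2); they sum to 1 for any probability x
```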
In any event, here is a lesson from this:
In general, the equations in (7.5) for the case N = 2, or their analogs in (7.7) for N > 2 do not determine
a God-given probability function on the sample space.
The task of finding a probability function on the sample space from the experimental data, one that comes reasonably
close to solving (7.7) and makes good biological sense, is an example of what I call the statistical inverse problem.
Of course, this is wishful thinking without some scheme to obtain the various conditional probabilities P (θ | k).
The Bayesian makes the following rather ad-hoc proposal: Let’s use for P the formula

PBayes (θ | k) = P (k | θ)/Z(k).    (7.11)
Note that the presence of Z guarantees that the probabilities defined by the left-hand side of (7.12) sum to 1.
In effect, the Bayesian probability function on Θ is obtained by approximating the unknown conditional probability
for θ given outcome k by the known conditional probability of outcome k given θ (times the factor 1/Z(k)). This may
or may not be a good approximation. In certain circumstances it is, and in others, it isn’t.
7.3 An example
To see how this works in an example, consider the N = 2 case of the introduction. In this case, θ is one of the fractions
from the set Θ = {m/100}m=0,1,...,100 and (7.1) finds that P (0 | θ) = (1 − θ)^2, P (1 | θ) = 2θ(1 − θ) and P (2 | θ) = θ^2.
This understood, then
Z(0) = 1 + (99/100)^2 + (98/100)^2 + · · · + (1/100)^2 ≈ 33.835
Z(1) = 2(1/100)(99/100) + 2(2/100)(98/100) + · · · + 2(99/100)(1/100) ≈ 33.33    (7.13)
Z(2) = (1/100)^2 + (2/100)^2 + · · · + 1 ≈ 33.835.
Using this in (7.12) gives the Bayesian probability function whose value on m/100 is

PBayes (m/100) = (1 − m/100)^2 (1/33.835) p0 + 2(m/100)(1 − m/100)(1/33.33) p1 + (m/100)^2 (1/33.835) p2 .    (7.14)
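The numbers in (7.13) and (7.14) can be reproduced in a few lines of Python; the measured values p0 = 3/4, p1 = 3/16, p2 = 1/16 are the ones used earlier:

```python
import math

# (7.13) and (7.14) for the N = 2 example: compute the normalizations Z(k)
# and the Bayesian probability function on Theta.
thetas = [m / 100 for m in range(101)]
cond = {k: [math.comb(2, k) * t ** k * (1 - t) ** (2 - k) for t in thetas]
        for k in range(3)}                       # P(k | theta), as in (7.1)

Z = {k: sum(cond[k]) for k in range(3)}          # the sums in (7.13)

p = [3 / 4, 3 / 16, 1 / 16]                      # the measured p0, p1, p2
P_bayes = [sum(cond[k][m] / Z[k] * p[k] for k in range(3))
           for m in range(101)]                  # the 101 values in (7.14)

assert abs(Z[0] - 33.835) < 1e-9
assert abs(Z[1] - 33.33) < 1e-9
assert abs(sum(P_bayes) - 1.0) < 1e-9            # a genuine probability on Theta
```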
Assume that you calculate PBayes in a given situation. You must then ask: What in the world does it tell me about my
set of possible models? It gives a probability function on this set, but so what? There are lots of probability functions
on any given set. The fact is that before you use PBayes (or any other version of P), you need to think long and hard
about whether it is really telling you anything useful.
By the way, the formula in (7.14) for PBayes is another disguised version of what we are doing in linear algebra (as is
the formula in (7.12)). To unmask this underlying linear algebra, agree to change to the linear algebra book’s notation
and so use x0 to denote p0 , use x1 to denote p1 and x2 to denote p2 . Then use ym to denote P(m/100) and use am1 to
denote the number (1 − m/100)^2, use am2 to denote 2(m/100)(1 − m/100) and am3 to denote (m/100)^2. This done, then the 101
versions of (7.14) read
Gregor Johann Mendel (July 20, 1822 – January 6, 1884) was an Augustinian abbot who is often called
the “father of modern genetics” for his study of the inheritance of traits in pea plants. Mendel showed
that the inheritance of traits follows particular laws, which were later named after him. The significance
of Mendel’s work was not recognized until the turn of the 20th century. Its rediscovery prompted the
foundation of genetics.
What follows describes the experiment: A given pea plant, when self-pollinated, can have either round or angular
seeds. It can also have either yellow or green seeds. Mendel took 10 plants, where each plant can have either round or
angular seeds, and either yellow or green seeds. Each plant was self-pollinated, and Mendel kept track of the number
of seeds that were round and the number that were angular. He also kept track of the number that were yellow and the
number that were green. Here are the published results (from http://www.mendelweb.org/Mendel.html):
Experiment 1 Experiment 2
(Shape) (Color)
Plant # Round Angular Yellow Green
1 45 12 25 11
2 27 8 32 7
3 24 7 14 5
4 19 10 70 27
5 32 11 24 13
6 26 6 20 6
7 88 24 32 13
8 22 10 44 9
9 28 6 50 14
10 25 7 44 18
Totals 336 101 355 123
Using the totals at the bottom, one finds that round seeds appear 77% of the time. Meanwhile, yellow seeds appear 74%
of the time. Let us agree to use pr = 0.77 for the experimentally determined probability for a seed to be round, and
py = 0.74 for the experimentally determined probability for a seed to be yellow.
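These two frequencies come straight from the totals row of the table; here is a quick Python check of the arithmetic (the variable names are mine):

```python
# Observed totals from the bottom row of Mendel's table above.
round_, angular = 336, 101
yellow, green = 355, 123

p_r = round_ / (round_ + angular)   # observed frequency of round seeds
p_y = yellow / (yellow + green)     # observed frequency of yellow seeds

print(round(p_r, 2), round(p_y, 2))   # 0.77 0.74
```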
The conventional wisdom explains the numbers of round and angular seeds by postulating that each plant has two
different genes that convey instructions for seed shape. These are denoted here by ‘R’ (for round) and ‘a’ (for angular).
The parent in each case is hypothesized to be type Ra. The offspring inherits either type RR, Ra or aa, where one of
the genes, either R or a, comes from the pollen and the other from the ovule. The hypothesis is that R is dominant, so
types RR and Ra give round seeds, while offspring of type aa give angular seeds.
There is a similar story for the color of the seed. Each plant also has two genes to control seed color. These are denoted
here by ‘Y’ and ‘g’. The parent is type Yg, and the offspring can be of type YY, Yg or gg. The Y is postulated to be
the dominant of the two genes, so offspring of type YY or Yg have yellow seeds, while those of type gg have green
seeds.
The conventional wisdom assigns probability 1/2 for any given pollen grain or ovule to receive the dominant gene for
shape, and likewise probability 1/2 to receive the dominant gene for color. One should make it a habit to question
conventional wisdom! In particular, suppose that we wish to compare the conventional probability model with another
model, this giving some probability, θ, for any given pollen grain or ovule to have the dominant allele. To keep things
simple, I will again assume that θ can be any number in {m/100}m∈{0,1,...,100}. In this case, our set of models is again
the 101 element set Θ = {m/100}m∈{0,1,...,100}.
Note that the factor of Z = Pf(691 | 0) + Pf(691 | 1/100) + · · · + Pf(691 | 1) is necessary so as to guarantee that
the sum of the values of PML over all 101 elements in Θ is equal to 1. What is written in (7.15) is called by some a
maximum likelihood probability function on the set Θ.
Of course, this is all pretty abstract if Pf(691 | θ) can’t be computed. As we shall see in an upcoming chapter, it is, in
fact, computable:

Pf(691 | θ) = [(915 × 914 × · · · × 1) / ((224 × 223 × · · · × 1)(691 × 690 × · · · × 1))] (2θ − θ^2)^691 ((1 − θ)^2)^224.   (7.16)
In any event, this probability function assigns the greatest probability to the model whose version of Pf (691 | θ) is
largest amongst all models under consideration. This is to say that the model that PML makes most probable is the one
that gives the largest probability to the number 691.
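To see which grid model wins, one can evaluate (7.16) numerically. The sketch below (mine, not from the text) works with logarithms, since the binomial coefficient in (7.16) is astronomically large while the remaining factors are astronomically small; the normalizing constant Z is not needed just to locate the maximum:

```python
import math

N, n = 915, 691   # seedlings in total, and those showing the dominant trait

log_comb = math.log(math.comb(N, n))   # log of the coefficient in (7.16)

def log_pf(theta):
    # Logarithm of Pf(691 | theta); the endpoint models give probability 0.
    if theta in (0.0, 1.0):
        return float("-inf")
    u = 2 * theta - theta ** 2          # probability of the dominant trait
    return log_comb + n * math.log(u) + (N - n) * math.log(1 - u)

grid = [m / 100 for m in range(101)]    # the 101 models in Theta
best = max(grid, key=log_pf)
print(best)   # 0.51, the grid point nearest 1 - sqrt(224/915) ≈ 0.505
```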
To see this approach in a simpler case, suppose that instead of 915 seedlings, I only had 4, and suppose that three of
the four exhibited the dominant trait, and one of the four exhibited the recessive trait.
My sample space S for this simple example consists of 2^4 = 16 elements, these
• (0, 0, 0, 0)
• (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)
• (1, 1, 0, 0), (1, 0, 1, 0), (1, 0, 0, 1), (0, 1, 1, 0), (0, 1, 0, 1), (0, 0, 1, 1) (7.17)
• (1, 1, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 1)
• (1, 1, 1, 1).
With α denoting the probability of the recessive trait (so that a given entry is 1 with probability 1 − α), the relevant
probability function assigns
P = α^4 to (0, 0, 0, 0)
P = α^3(1 − α) to everything in the second row of (7.17)
P = α^2(1 − α)^2 to everything in the third row of (7.17)   (7.18)
P = α(1 − α)^3 to everything in the fourth row of (7.17)
P = (1 − α)^4 to (1, 1, 1, 1).
In particular, Pf(3 | θ) = 4α(1 − α)^3. In terms of θ, this says that Pf(3 | θ) = 4(1 − θ)^2(2θ − θ^2)^3.
The analog of (7.10) for this example sets

PML(θ) = (1/20.3175) · 4(1 − θ)^2(2θ − θ^2)^3.   (7.19)

In this case Z = 20.3175. I’ll leave it as an exercise for those who remember their one variable calculus to verify that
the function θ → PML(θ) has its maximum at θ = 1/2.
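Both the normalization and the location of the maximum are easy to confirm numerically; a short sketch:

```python
# Check of (7.19): sum 4(1 - t)^2 (2t - t^2)^3 over the 101 grid models to
# recover Z, then locate the maximizing model.
def f(t):
    return 4 * (1 - t) ** 2 * (2 * t - t ** 2) ** 3

grid = [m / 100 for m in range(101)]
Z = sum(f(t) for t in grid)
best = max(grid, key=f)

print(round(Z, 3))    # about 20.317, matching the text's Z
print(best)           # 0.5, the maximum likelihood model
print(round(f(0.5) / Z, 3), round(f(2 / 3) / Z, 3))   # roughly 0.021 and 0.015
```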
As with the Bayesian probability function, you must still ask yourself whether the assigned probabilities to the models
are useful or not. For example, the probability function in (7.19) finds PML(1/2) ≈ 0.02. Meanwhile, PML(2/3) ≈ 0.015.
Is it really the case that the probabilities for θ = 1/2 and θ = 2/3 should be so nearly equal? That these probabilities
are so close is due to the fact that the data from 4 seedlings is not nearly enough to distinguish these two models. This
is what I meant when I said at the outset that you must think hard about whether any probability function on your set
of possible models is worth looking at. The fact that these two probabilities are so close really says nothing about peas
and genetics, and everything about the fact that you don’t have enough data to discriminate.
This gets to the heart of the matter with regards to using statistics: A good deal of common sense must be used to
interpret the mathematics.
The point is that your data may not be sufficiently powerful to distinguish various models; and if this is the case,
then no amount of fancy mathematics or statistics is going to provide anything useful.
7.7 Exercises:
1. Suppose that the probability that any given pollen grain or ovule has the recessive allele is some number α ∈
[0, 1], and so the probability that either has the dominant gene is (1 − α).
(a) Consider an experiment where 4 seedlings are examined for the dominant or recessive trait. Thus, the
outcome of such an experiment is a vector in R^4 whose kth entry is 0 if the kth seedling has the recessive
trait and it is 1 if the kth seedling has the dominant trait. Let S denote the 2^4 = 16 element set of possible
experimental outcomes. Let s ∈ S denote an element with some m ∈ {0, . . . , 4} entries equal to 1 and the
remaining entries equal to zero. Explain why s has probability (1 − α)^m α^(4−m).
(b) Consider now the analogous experiment where 915 seedlings are examined for the dominant or recessive
trait. Thus, the outcome of such an experiment is a vector in R^915 whose kth entry is 0 if the kth seedling
has the recessive trait and it is 1 if the kth seedling has the dominant trait. Let S denote the 2^915 element
set of possible experimental outcomes. Let s ∈ S denote an element with some m ∈ {0, . . . , 915} entries
equal to 1 and the remaining entries equal to zero. Explain why s has probability (1 − α)^m α^(915−m).
(c) In the case of part (a) above, define the random variable f : S → {0, 1, 2, 3, 4} where f (s) is the sum of
the entries of s. Compute the induced probability function Pf on the set {0, 1, 2, 3, 4}.
CHAPTER
EIGHT
Imagine a complicated cellular process that involves some n genes that are presumably ‘turned on’ by some N other
genes that act at an earlier time. We want to test whether the effect of the early genes on the late genes involves a
complicated synergy, or whether their effects simply add. To do this, I could try to derive the consequences of one
or the other of these possibilities and then devise an experiment to see if the predicted consequences arise. Such an
experiment could vary the expression level of the early genes from their normal levels (either + or −) and see how the
variation in the level of expression of the late genes from their normal levels changes accordingly.
To elaborate on this strategy, note that when a gene is expressed, its genetic code (a stretch of the DNA molecule) is
used to code for a molecule much like DNA called mRNA. Here, the ‘m’ stands for ‘messenger’ and the RNA part is
a sequence of small molecules strung end to end. Any of these can be one of four and the resulting sequence along the
RNA string is determined by the original sequence on the coding stretch of DNA. This messenger RNA subsequently
attaches to the protein making part of a cell (a ‘ribosome’) where its sequence is used to construct a particular protein
molecule. In any event, the level of any given mRNA can be measured at any given time, and this level serves as a
proxy for the level of expression of the gene that coded it in the first place.
One more thing to note: The level of expression of a gene can often be varied with some accuracy in an experiment by
inserting into the cell nucleus certain tailored molecules to either promote or repress the gene expression. Such is the
magic of modern biotechnology.
To make a prediction that is testable by measuring early and late gene expression, let us suppose that the effects of
the early genes are simply additive and see where this assumption leads. For this purpose, label these early genes
by integers from 1 to N, and let uk denote the deviation, either positive or negative, of the level of expression of
the kth early gene from its normal level. For example, we can take uk to denote the deviation from normal of the
concentration of the mRNA that comes from the kth early gene.
Meanwhile, label the late genes by integers from 1 to n, and use pj to denote the deviation of the latter’s mRNA from
its normal level. If the effects of the early genes on the late genes are simply additive, we might expect that any given
pj has the form
pj = Aj1 u1 + Aj2 u2 + · · · + AjN uN , (8.1)
where each Ajk is a constant. This is to say that the level pj is a sum of factors, the first proportional to the amount
of the first early gene, the second proportional to the amount of the second early gene, and so on. Note that when
Ajk is positive, then the kth early gene tends to promote the expression of the jth late gene. Conversely, when Ajk is
negative, the kth early gene acts to repress the expression of the jth late gene.
If we use ~u to denote the N -component column vector whose kth entry is uk , and if we use p~ to denote the n-
component column vector whose jth component is pj , then the equation in (8.1) is the matrix equation p~ = A~u. Thus,
we see a linear transformation from an N -dimensional space to an n-dimensional one.
By the way, note that the relation predicted by (8.1) can, in principle, be tested by experiments that vary the levels of
the early genes and see if the levels of the late genes change in a manner that is consistent with (8.1). Such experiments
will also determine the values for the matrix entries {Ajk }. For example, to find A11 , vary u1 while keeping all k > 1
versions of uk equal to zero. Measure p1 as these variations are made and see if the ratio p1 /u1 is constant as u1
changes with each k > 1 version of uk equal zero. If so, the constant is the value to take for A11 . If this ratio is not
constant as these variations are made, then the linear model is wrong. One can do similar things with the other pj and
uk to determine all Ajk . One can then see about changing more than one uk from zero and see if the result conforms
to (8.1).
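The probing protocol just described is easy to simulate. In the sketch below (mine, with an invented matrix standing in for the unknown biology), varying u1 alone recovers A11 as the constant ratio p1/u1:

```python
# Sketch of the probing experiment described above. The matrix A_true is
# made up for illustration (it is not from any real measurement); it plays
# the role of the unknown biology that the experiment probes.
A_true = [[1.0, 1.0, 2.0],
          [2.0, 1.0, 1.0],
          [1.0, 0.0, -1.0]]

def respond(u):
    # The additive model (8.1): p_j = sum_k A_jk u_k.
    return [sum(A_true[j][k] * u[k] for k in range(3)) for j in range(3)]

# Vary u1 alone, keep u2 = u3 = 0, and check that p1/u1 stays constant.
ratios = [respond([u1, 0.0, 0.0])[0] / u1 for u1 in (0.5, 1.0, 2.0)]
print(ratios)   # [1.0, 1.0, 1.0]: a constant ratio, the estimate of A11
```

If the ratio were not constant across the trial deviations, the linear model itself would be in doubt, exactly as the text says.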
The question now arises as to the meaning of the kernel and the image of the linear transformation A from RN to Rn .
To make things explicit here, suppose that n and N are both equal to 3 and that A is the matrix

A = ( 1  1  2
      2  1  1
      1  0  −1 ).   (8.2)

As you can check, this matrix has kernel equal to the scalar multiples of the vector

(1, −3, 1).   (8.3)

Meanwhile, the image of A is spanned by the two vectors

(1, 2, 1) and (1, 1, 0).   (8.4)

Thus, it consists of all vectors in R^3 that can be written as a constant times the first vector in (8.4) plus another constant
times the second.
Here is the meaning of the kernel: Vary the early genes 1, 2 and 3 from their normal levels in the ratio u1/u3 = 1
and u2/u3 = −3 and there is no effect on the late genes. This is to say that if the expression levels of early genes 1
and 3 are increased by any given amount r while that of early gene 2 is decreased by 3r, then there is no change to
the levels of expression of the three late genes. In a sense, the decrease in the level of the second early
gene exactly offsets the effect of the equal increases in the levels of the first and third early genes.
As to the meaning of the image, what we find is that only certain deviations of the levels of expression of the three late
genes from their background values can be obtained by modifying the expression levels of the three early genes. For
example, both vectors in (8.4) are orthogonal to the vector

(1, −1, 1).   (8.5)

Thus, values of p1, p2 and p3 with the property that p1 + p3 ≠ p2 can not be obtained by any variation in the expression
levels of the three early genes. Indeed, the dot product of the vector p~ with the vector in (8.5) is p1 − p2 + p3 and this
must be zero in the case that p~ is a linear combination of the vectors in (8.4).
Granted that the matrix A is that in (8.2), then the preceding observation has the following consequence for the
biologist: If values of p1, p2 and p3 are observed in a cell with p1 + p3 ≠ p2, then the three early genes can not be the
sole causative agent for the expression levels of the three late genes.
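Both claims — that (1, −3, 1) spans the kernel direction of (8.2), and that every attainable p~ satisfies p1 + p3 = p2 — can be checked mechanically:

```python
# The matrix from (8.2).
A = [[1, 1, 2],
     [2, 1, 1],
     [1, 0, -1]]

def apply(M, v):
    # Matrix-vector product M v.
    return [sum(M[j][k] * v[k] for k in range(3)) for j in range(3)]

print(apply(A, [1, -3, 1]))   # [0, 0, 0]: the kernel vector of (8.3)

# Any output p = A u satisfies p1 - p2 + p3 = 0, i.e. p1 + p3 = p2.
u = [0.3, -1.2, 2.5]          # an arbitrary early-gene deviation
p = apply(A, u)
print(abs(p[0] - p[1] + p[2]) < 1e-9)   # True
```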
CHAPTER
NINE
My purpose here is to give some toy examples where the notion of dimension and coordinates appear in a biological
context.
9.1 Coordinates
Here is a hypothetical situation: Suppose that a cell has genes labeled {1, 2, 3}. The level of the corresponding product
then defines a vector in R^3 where the obvious coordinates, x1, x2 and x3, measure the respective levels of the products
of gene 1, gene 2 and gene 3. However, this might not be the most useful coordinate system. In particular, if some
subsets of genes are often turned on at the same time and in the same amounts, it might be better to change to a basis
where that subset gives one of the basis vectors. Suppose, for the sake of argument, that it is usually the case that the
level of the product from gene 2 is three times that of gene 1, while the level of the product of gene 3 is half that of
gene 1. This is to say that one usually finds x2 = 3x1 and x3 = (1/2)x1. Then it might make sense to switch from the
standard coordinate basis,
~e1 = (1, 0, 0), ~e2 = (0, 1, 0), ~e3 = (0, 0, 1),   (9.1)
to the coordinate system that uses a basis ~v1, ~v2 and ~v3 where

~v1 = (1, 3, 1/2), ~v2 = ~e2, ~v3 = ~e3.   (9.2)
To explain, suppose I measure some values for x1, x2 and x3. This then gives a vector,

~x = (x1, x2, x3) = x1~e1 + x2~e2 + x3~e3.   (9.3)
Now, I can also write this vector in terms of the basis in (9.2) as

~x = c1~v1 + c2~v2 + c3~v3.   (9.4)

With ~v1, ~v2 and ~v3 as in (9.2), the coordinates c1, c2 and c3 that appear in (9.4) are

c1 = x1, c2 = x2 − 3x1 and c3 = x3 − (1/2)x1.   (9.5)

As a consequence, the coordinate c2 describes the deviation of x2 from its usual value of 3x1. Meanwhile, the
coordinate c3 describes the deviation of x3 from its usual value of (1/2)x1.
Here is another example: Suppose now that there are again three genes with the levels of their corresponding products
denoted as x1, x2, and x3. Now suppose that it is usually the case that these levels are correlated in that x3 is generally
very close to 2x2 + x1. Any given set of measured values for these products determines now a column vector as
in (9.3). A useful basis in this case would be one where the coordinates c1, c2 and c3 have

c1 = x1, c2 = x2 and c3 = x3 − 2x2 − x1.   (9.6)

Thus, c3 again measures the deviation from the expected values. The basis with this property is that where

~v1 = (1, 0, 1), ~v2 = (0, 1, 2), and ~v3 = (0, 0, 1).   (9.7)

This is to say that if ~v1, ~v2, and ~v3 are as depicted in (9.7), and if ~x is then expanded in this basis as c1~v1 + c2~v2 + c3~v3,
then c1, c2 and c3 are given by (9.6).
To explain why (9.9) holds, take the equation ~x = c1~v1 + c2~v2 + c3~v3 and act on both sides by the linear transformation
A. According to (9.8), the left-hand side, A~x, is the vector ~c whose top component is c1, middle component is c2 and
bottom component is c3. This is to say that A~x = c1~e1 + c2~e2 + c3~e3. Meanwhile, the right-hand side of the resulting
equation is c1 A~v1 + c2 A~v2 + c3 A~v3. Thus,

c1~e1 + c2~e2 + c3~e3 = c1 A~v1 + c2 A~v2 + c3 A~v3.   (9.10)

Now, the two sides of (9.10) are supposed to be equal for all possible values of c1, c2 and c3. In particular, they are
equal when c1 = 1 and c2 = c3 = 0. For these choices, the equality in (9.10) asserts that ~e1 = A~v1; this the left-most
equality in (9.9). Likewise, setting c1 = c3 = 0 and c2 = 1 in (9.10) gives the equivalent of the middle equality
in (9.9); and setting c1 = c2 = 0 and c3 = 1 in (9.10) gives the equivalent of the right-most equality in (9.9).
9.3 Dimensions
What follows is an example of how the notion of dimension arises in a scientific context. Consider the situation in
Section 9.1, above, where the system is such that the levels of x2 and x3 are very nearly x2 ≈ 3x1 and x3 ≈ (1/2)x1. This
is to say that when we use the coordinates c1, c2 and c3 in (9.5), then |c2| and |c3| are typically very small. In this case,
a reasonably accurate model for the behavior of the three gene system can be had by simply assuming that c2 and
c3 are always measured to be identically zero. As such, the value of the coordinate c1 describes the system to great
accuracy. Since only one coordinate is needed to describe the system, it is said to be ‘1-dimensional’.
A second example is the system that is described by c1, c2 and c3 as depicted in (9.6). If it is always the case that x3
is very close to 2x2 + x1, then the system can be described with good accuracy with c3 set equal to zero. This done,
only the two coordinates c1 and c2 are needed, so the system is said to be ‘2-dimensional’.
9.4 Exercises:
1. Suppose that four genes have corresponding products with levels x1 , x2 , x3 and x4 where x4 is always very
close to x1 + 4x2 while x3 is always very close to 2x1 + x2 . Find a new set of basis vectors for R4 and
corresponding coordinates c1 , c2 , c3 and c4 with the following property: The values of x1 , x2 , x3 and x4 for
this four gene system are the points in the (c1 , c2 , c3 , c4 ) coordinate system where c3 and c4 are nearly zero.
2. Suppose that two genes are either ‘on’ or ‘off’, so that there are, effectively, just four states for the two gene
system, {++, +−, −+, −−}, where ++ means that both genes are on; +− means that the first is on and the
second is off; etc. Assume that these four states have respective probabilities 1/2, 1/6, 1/6, 1/6.
(a) Is the event that the first gene is on independent from the event that the second gene is on?
Now suppose that these two genes jointly influence the levels of two different products. The levels of the first
product are given by {3, 2, 1, 0} in the respective states ++, +−, −+, −−. The levels of the second are
{4, 2, 3, 1} in these same states.
(b) View the levels of the two products as random variables on the sample space S that consists of
{++, +−, −+, −−} with the probabilities as stated. Write down the mean and standard deviations for
these two random variables.
(c) Compute the correlation matrix in equation (6.8) of Chapter 6 for these two random variables to prove that
they are not independent.
CHAPTER
TEN
The term ‘Bayesian statistics’ has different meanings for different people. Roughly, Bayesian statistics reverses
‘causes’ and ‘effects’ so as to make an educated guess about the causes given the known effects. The goal is to
deduce a probability function on the set of possible causes granted that we have the probabilities of the various effects.
Take note that I have underlined the words ‘educated guess’. There are situations when the Bayesian strategy seems
reasonable, and others where it doesn’t.
This last equation can be written in terms of conditional probabilities as follows:

Pf(r) = Σ_{s∈S} P(r | s) P(s)   (10.2)

where P(r | s) is the conditional probability that f = r given that you are at the point s ∈ S. Of course, this just says
that P(r | s) is 1 if f(s) = r and 0 otherwise.
The problem faced by statisticians is to deduce P, or a reasonable approximation, given only knowledge of some
previously determined probability function, PW , on the set W . In effect, we want to find a probability function P on
S whose corresponding Pf is the known function PW .
Your typical Bayesian will derive a guess for P using the following strategy:
Step 1: Assume that there is some conditional probability, K(s; r), that gives the probability of obtaining any given
s from S granted that the value of f is r. If such a suite of conditional probabilities were available, then one could
take

Pguess(s) = Σ_{r∈W} K(s; r) PW(r).   (10.3)
The problem is that the points in W are the values of a function of the points in S, not vice-versa. Thus, there is often
no readily available K(s; r).
Step 2: A Bayesian is not deterred by this state of affairs. Rather, the Bayesian plows ahead by using what we have,
which is P (r | s). We know its values in all cases; it is 1 when f (s) = r and zero otherwise. Why not, asks the
Bayesian, take
K(s; r) = (1/Z(r)) P(r | s),   (10.4)

where Z(r) is the number of points in S on which f has value r. This is to say that

Z(r) = Σ_{s∈S} P(r | s).   (10.5)
To explain the appearance of Z(r), remember that a conditional probability of the form P (A | B) is a probability
function in its own right on the sample space S. Thus, P (S | B) must be 1 if S is the whole sample space. This need
not be the case for K(S; r) were the factor of 1/Z(r) absent.
Step 3: To summarize: Our typical Bayesian takes the following as a good guess for the probability function on S:
PBayes(s) = Σ_{r∈W} (1/Z(r)) P(r | s) PW(r).   (10.6)
Note that disentangling the definitions, there is really no summation involved in (10.6) because there is just one value
of r that makes P (r | s) non-zero for any given s, this the value r = f (s). Thus, (10.6) is a very roundabout way of
saying that
PBayes(s) = (1/Z(f(s))) PW(f(s)).   (10.7)
This is our Bayesian’s guess for the probability function on S.
• P(−2, TT) = 1
• P(0, HT) = P(0, TH) = 1   (10.8)
• P(2, HH) = 1

PBayes(HH) = q^2, PBayes(TT) = (1 − q)^2 and PBayes(HT) = PBayes(TH) = q(1 − q).   (10.9)

Ptrue(HH) = q/2, Ptrue(HT) = (1 − q)/2, Ptrue(TH) = q/2, and Ptrue(TT) = (1 − q)/2.   (10.10)
The frequencies of appearance of the three positions in W are now (1 − q)/2, 1/2, q/2. I use these three numbers for the
probabilities given by PW. As the conditional probabilities in (10.8) do not change, we can employ them in (10.6) to
find the Bayesian guess:

PBayes(TT) = (1 − q)/2, PBayes(HT) = PBayes(TH) = 1/4, and PBayes(HH) = q/2.

In particular, PBayes(coin #1 = H) = PBayes(HH) + PBayes(HT) = q/2 + 1/4.
This is also the probability PBayes(coin #2 = H) since PBayes(HT) = PBayes(TH). Now, note that

PBayes(HH) ≠ PBayes(coin #1 = H) · PBayes(coin #2 = H)

unless q = 1/2 since the left-hand side is q/2 and the right is (q/2 + 1/4)^2. Thus, the Bayesian finds that the event of
coin #1 = H is not independent of the event that coin #2 = H!! (Remember that events A and B are deemed
independent when P(A ∩ B) = P(A)P(B).)
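The whole computation — (10.7) applied to the two-coin example, followed by the independence check — fits in a few lines of Python; the value q = 0.3 is an arbitrary choice for illustration:

```python
# S = {HH, HT, TH, TT}; f counts (# heads) - (# tails); PW holds the
# frequencies of the three positions in W when coin #2 has bias q.
q = 0.3
S = ["HH", "HT", "TH", "TT"]
f = {s: s.count("H") - s.count("T") for s in S}

PW = {-2: (1 - q) / 2, 0: 1 / 2, 2: q / 2}
Z = {r: sum(1 for s in S if f[s] == r) for r in PW}   # points of S above r

P_bayes = {s: PW[f[s]] / Z[f[s]] for s in S}          # equation (10.7)
print(P_bayes)

# The Bayesian guess couples the two coins: P_bayes(HH) differs from the
# product P_bayes(coin1 = H) * P_bayes(coin2 = H) whenever q != 1/2.
p1H = P_bayes["HH"] + P_bayes["HT"]
p2H = P_bayes["HH"] + P_bayes["TH"]
print(P_bayes["HH"], p1H * p2H)   # about 0.15 versus 0.16: not equal
```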
Ptrue(a, b) = (a/21) · (b/21) = ab/441.   (10.14)

If the die has these probabilities, then the probabilities that result for the outcomes are

PW(2) = 1/441, PW(3) = 4/441, PW(4) = 10/441, PW(5) = 20/441,
PW(6) = 35/441, PW(7) = 56/441, PW(8) = 70/441, PW(9) = 76/441,   (10.15)
PW(10) = 73/441, PW(11) = 60/441, PW(12) = 36/441.
and so on.
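The fractions in (10.15) can be generated, and the arithmetic verified, with exact rational arithmetic:

```python
from fractions import Fraction

# A die weighted so that face a has probability a/21 (since 1+2+...+6 = 21),
# rolled twice independently: Ptrue(a, b) = ab/441, as in (10.14).
Ptrue = {(a, b): Fraction(a * b, 441)
         for a in range(1, 7) for b in range(1, 7)}

# PW(r) sums Ptrue over the pairs with a + b = r.
PW = {}
for (a, b), p in Ptrue.items():
    PW[a + b] = PW.get(a + b, Fraction(0)) + p

print(PW[7] == Fraction(56, 441), PW[12] == Fraction(36, 441))  # True True
print(sum(PW.values()))   # 1
```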
10.8 Exercises:
1. (a) Complete the table in (10.16) by computing the values of PBayes on the remaining pairs in S.
(b) According to PBayes , is the event of the first roll comes up 1 independent from the event that second roll
comes up 6? Justify your answer.
2. Compute the mean and standard deviation for the random variable a + b first using Ptrue from (10.15) and then
using PBayes .
3. Consider now the same sample space for rolling a die twice, but now suppose that the die is fair, and so each
number has probability 1/6 of turning up on any given roll.
(a) Compute the mean and standard deviation of the random variable a + b.
(b) Compute the mean and standard deviation for the random variable ab.
(c) Are the random variables a + b and ab independent? In this regard, remember that two random variables,
f and g, are said to be independent when P(f = r and g = s) = P(f = r)P(g = s) for all pairs (r, s)
where r is a possible value of f and s is a possible value of g. Justify your answer.
CHAPTER
ELEVEN
There are certain probability functions that serve as models that are commonly used when trying to decide if a given
phenomenon is ‘unexpected’ or not. This chapter describes those that arise most often.
• How do we translate our intuitive notion of the English term ‘random’ into a prediction for pt(x)?
• Granted we have a prediction, for each t and x, then how far must pt(x) be from its predicted   (11.1)
value before we must accept the fact that the bacterium is not moving according to our
preconceived notion of random?
These questions go straight to the heart of what is called the ‘scientific method’. We made a hypothesis: ‘The bacterium
moves left or right at random’. We want to first generate some testable predictions of the hypothesis (the first point
in (11.1)), and then compare these predictions with experiment. The second point in (11.1) asks for a criterion to use to
evaluate whether the experiment confirms or rejects our hypothesis.
The first question in (11.1) is the province of ‘probability theory’ and the second the province of ‘statistics’. This
chapter addresses the first question in (11.1), while aspects of the second are addressed in some of the subsequent
chapters.
Probabilities are defined using the probability function that assigns all elements the same
probability. If the set has L elements, then the probability of any given element appearing   (11.2)
is 1/L.
The probability function that assigns this constant value to all elements is called the uniform probability function.
Here is an archetypal example: A coin is flipped N times. Our sample space is the set S that consists of the 2^N possible
sequences (±1, ±1, . . . , ±1), where +1 is in the kth spot when the kth flip landed heads up, while −1 sits in this slot
when the kth flip landed tails up. If I assume that the appearance of heads on any flip has probability 1/2, and that the
appearance of heads on any subset of flips has no bearing on what happens in the other flips, then I would predict
that the frequency of appearance of any given sequence in S is 2^−N. This is to say that I would use the uniform
probability function on S to predict the frequency of appearance of any subset of its elements.
Note that after setting N = t, this same sample space describes all of the possibilities for the moves of our bacterium
from Section 11.1, above.
Here is another example: You go to a casino to watch people playing the game of ‘craps’. Remember that this game is
played by rolling two six-sided dice, and looking at the numbers that show on the top faces when the dice stop rolling.
The sample space for one play of the game is the set of 36 elements where each is of the form (a, b) for a, b integers
from 1 through 6. If I believe that the dice are ‘fair’ and that the appearance of any given number on one die has no
bearing on what appears on the other, then I would use the uniform probability function on S to predict the frequency
of appearance of any given outcome.
I might watch N games of craps played. In this case, the sample space is the set of 36^N possible sequences of the form
((a1, b1), . . . , (aN, bN)) where each ak and each bk is a number from 1 through 6. If I believe that the appearance of
any face on the dice is as likely as any other, and if I believe that the appearance of any sequence in any given subset of
the N games has no effect on what happens in the other games, then I would make my predictions for the frequencies
using the uniform probability function on this 36^N element sample space.
What follows is yet one more example: Look at a single stranded DNA molecule that is composed of N bases strung
end to end. As each site on the DNA molecule can be occupied by one of 4 bases, the sample space that describes the
various possibilities for this DNA molecule is the 4^N element set whose typical member is a sequence (θ1, . . . , θN)
where each θk is either A, C, G or T. If I believe that each such letter is equally likely to appear, and that the appearance
of a given letter in any one slot has no bearing on what happens in the other slots, then I would use the uniform
probability function to predict the frequencies of occurrences of the various letters in a length N strand of DNA.
You might think that the uniform probability distribution is frightfully dull – after all, how much can you say about a
constant?
The set Kn has N!/(n!(N − n)!) members.   (11.3)
In this regard, remember that k! is defined for any positive integer k as k(k − 1)(k − 2) · · · 1. Also, 0! is defined to be
equal to 1. For those who don’t like to take facts without proof, I explain in the last section below how to derive (11.3)
and also the formulae that follow.
By the way, N!/(n!(N − n)!) arises often enough in counting problems to warrant its own symbol, this

( N
  n ).   (11.4)
Here is another standard counting formula: Let b ≥ 1 be given, and let S denote the set of b^N elements of the form
(β1, . . . , βN), where each βk is now in {1, . . . , b}. For example, if b = 6, then S is the sample space for the list of faces
that appear when a six-sided die is rolled N times. If b = 4, then S is the sample space for the possible single strands
of DNA with N bases.
If N > b, then each element in S has at least one βk that is the same as another. If N ≤ b, then there can be elements
in S where no two βk are the same. Fix b and let Eb denote the subset of those N -tuples (β1 , . . . , βN ) where no two
βk are identical.
The set Eb has b!/(b − N)! members.   (11.5)

The case b = N in (11.5) provides the following:

A set with N elements has N! orderings.   (11.6)
Here, a set of elements is ‘ordered’ simply by listing them one after the other. For example, the set that consists
of 1 apple and 1 orange has two orderings, (apple, orange) and (orange, apple). The set that consist of the three
elements {apple, orange, grape} has six orderings, (apple, orange, grape), (apple, grape, orange), (orange, grape,
apple), (orange, apple, grape), (grape, apple, orange), (grape, orange, apple).
Here is a direct argument for (11.6): To count the number of orderings, note that there are N possible choices for the
first in line. With the first in line chosen, then N − 1 elements can be second in line. With the first two in line chosen,
there are N − 2 left that can be third in line. Continuing this reasoning leads to (11.6).
The Equal Probability Binomial: Let S denote the sample space with the 2^N elements of the form (α1, . . . , αN)
where each αk can be 1 or −1. For any given integer n ∈ {0, 1, . . . , N}, let Kn denote the event that there are
precisely n occurrences of +1 in the N-tuple (α1, . . . , αN). Then the uniform probability function on S assigns Kn
the probability

P(n) = 2^−N · N!/(n!(N − n)!).   (11.7)
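A sketch of (11.7) in Python; math.comb supplies the coefficient N!/(n!(N − n)!):

```python
import math

# Equal probability binomial (11.7): the chance of exactly n occurrences
# of +1 (say, heads) among N fair flips.
def p_equal(n, N):
    return math.comb(N, n) * 2.0 ** (-N)

N = 10
dist = [p_equal(n, N) for n in range(N + 1)]
print(dist[5])        # 252/1024 = 0.24609375, the most likely count
print(sum(dist))      # 1.0
```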
If we believe that the bacterium chooses left and right with equal probability, and that moves made in any subset of
the N steps have no bearing on those made in the remaining steps, then we should be comparing our experimentally
determined pt (x) with the N = t version of (11.8).
The Binomial Probability Function: The probability function P in (11.7) is an example of what is called the
binomial probability function on the set {0, . . . , N}. The ‘generic’ version of the binomial probability distribution
requires the choice of a number, q ∈ [0, 1]. With q chosen, the probability q-version assigns the following probability
to an integer n:

Pq(n) = (N!/(n!(N − n)!)) q^n (1 − q)^(N−n).   (11.9)
The probability function in (11.9) arises from the sample space S whose elements are the N -tuples with elements
(±1, . . . , ±1). In this regard, (11.9) arises in the case that the probability of seeing +1 in any given entry is q and
that of −1 in a given entry is 1 − q. Here, this probability function assumes that the occurrences of ±1 in any subset
of entries must have no bearing on the appearances of ±1 in the remaining entries. The probability function in (11.9)
describes the probability in this case for the set Kn ⊂ S of elements with n occurrences of +1 and N − n occurrences
of −1.
One can also view the probability function in (11.9) as follows: Define a random variable, g, on S, that assigns to any
given element the number of appearances of +1. Give S the probability function just described. Then (11.9) gives the
probability that g = n. To see why this probability is (11.9), note that the event that g = n is just our set Kn . With
this new probability, each element in Kn has probability q n (1 − q)N −n . As (11.3) gives the number of elements in
Kn , its probability is therefore given by (11.9).
The probability function in (11.9) is relevant to our bacterial walking scenario when we make the hypothesis that the bacterium moves to the right at any given step with probability q, thus to the left with probability 1 − q. I’ll elaborate on this in a subsequent chapter.
Here is another example: Suppose you roll a six-sided die N times. Now fix an integer n from the set {0, 1, . . . , N} and ask for the probability that precisely n ≤ N of the rolls are such that the number 6 appears. If I assume that the die is fair, and that the numbers that appear on any subset of rolls have no bearing on the numbers that appear in the remaining rolls, then the probability of n occurrences of 6 in N rolls of the die is given by the q = 1/6 version of (11.9). Here is why: I make an N element sequence (α1, . . . , αN) with each αk = 1 or −1 by setting αk = +1 when 6 appears on the kth roll of the die, and setting αk = −1 when 1, 2, 3, 4, or 5 appears on the kth roll of the die. The possible sequences of this form make a 2^N element set that I call S. As the probability is 1/6 for any given αk to equal +1, a given sequence from S with precisely n occurrences of +1 has probability (1/6)^n (5/6)^{N−n}. Meanwhile, the number of elements in S with n occurrences of +1 is given by (11.3). This understood, the probability of there being n occurrences of +1 is equal to the product of (11.3) with (1/6)^n (5/6)^{N−n}, and this is just the q = 1/6 version of (11.9).
To give an example, consider rolling a standard, six-sided die. If the die is rolled once, the sample space consists of the numbers {1, 2, 3, 4, 5, 6}. If the die is ‘fair’, then I would want to use the probability function that assigns probability 1/6 to each number.
Here is a final example: Consider a single-stranded DNA molecule made of N bases. Suppose that the probability that the base C appears in any given position is given by some q ∈ [0, 1]. If each base is equally likely, then q = 1/4.
The Poisson Probability Function: This is a probability function on the sample space, N = {0, 1, . . .}, the non-negative integers. As you can see, this sample space has an infinite number of elements. Even so, I define a probability function on N to be a function, P, with P(n) ∈ (0, 1) for each n, and such that

∑_{n=0,1,...} P(n) = P(0) + P(1) + P(2) + · · · = 1.   (11.10)
The Poisson probability enters when trying to decide if an observed cluster of occurrences of a particular phenomenon
is or is not due to chance. For example, suppose that on average, some number, ∆, of newborns in the United States
carry a certain birth defect. Suppose that some number, n, of such births are observed in 2006. Does this constitute an
unexpected clustering that should be investigated? If the defects are unrelated and if the causative agent is similar in
all cases over the years, then the probability of n occurrences in a given year should be very close to the value of the
τ = ∆ version of the Poisson function Pτ (n).
Here is another example from epidemiology: Suppose that the university health service sees an average of 10 cases
of pneumonia each winter. Granted that the conditions that prevail this coming winter are no different than those
in the past, and granted that cases of pneumonia are unrelated, what is the probability of there being 20 cases of
pneumonia this winter? If these assumptions are valid, then the τ = 10 Poisson function should be used to compute
this probability. In particular, P10(20) ≈ 0.0019.
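The τ = 10 Poisson probability of exactly 20 cases can be computed directly; a short sketch (the helper name poisson_pmf is mine):

```python
from math import exp, factorial

def poisson_pmf(n, tau):
    # P_tau(n) = tau^n e^(-tau) / n!, the Poisson probability function
    return tau**n * exp(-tau) / factorial(n)

print(poisson_pmf(20, 10))
```
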
What follows is a final example, this related to discerning whether patterns are ‘random’ or not. Consider, for example,
the sightings of ‘sea monsters’. In particular, you want to know if more sea monster sightings have occurred in the
Bermuda Triangle than can be accounted for by chance. One way to proceed is to catalogue all such sightings to
compute the average number of sightings per day per ship. Let us denote this average by δ. Now, estimate the number
of ‘ship days’ that are accounted for by ships while in the Bermuda Triangle. Let N denote this last number. If
sightings of sea monsters are unrelated and if any two ships on any two days in any two parts of the ocean are equally likely to sight a sea monster, then the probability of n sightings in the Bermuda Triangle is given by the τ = Nδ version of (11.11).
I give some further examples of how the Poisson function is used in a separate chapter.
where the term designated as ‘error’ limits to zero as n → ∞. For example, the ratio with n! as the numerator and √(2πn) e^{−n} n^n as the denominator is as follows for various choices of n:

n      n!/(√(2πn) e^{−n} n^n)
2      1.042
5      1.016
10     1.008
20     1.004
100    1.0008
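The ratios in this table can be reproduced with a few lines of Python (the function name is mine):

```python
from math import factorial, sqrt, pi, exp

def stirling_ratio(n):
    # n! divided by Stirling's approximation sqrt(2 pi n) e^(-n) n^n
    return factorial(n) / (sqrt(2 * pi * n) * exp(-n) * n**n)

for n in (2, 5, 10, 20, 100):
    print(n, round(stirling_ratio(n), 4))
```
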
Here, one should keep in mind that when S has an infinite number of elements, then µ and σ are defined only when the corresponding sums on the right sides of (11.15) and (11.16) are those of convergent series. The mean and standard deviation characterize any given probability function to some extent. More to the point, both the mean and standard deviation are often used in applications of probability and statistics.
The mean and standard deviation for the binomial probability on {0, 1, . . . , N} are µ = qN and σ = √(q(1 − q)N).
The Chebychev Theorem. Suppose that P is a given probability function on a subset of the integers,
{. . . , −1, 0, 1, . . .}, one with a well defined mean µ and standard deviation σ. For any R ≥ 1, the probability as-
signed to the set where |n − µ| > Rσ is less than R−2 .
For example, this says that the probability of being more than 2σ away from the mean is less than 1/4, and the probability of being more than 3σ away is less than 1/9.
This theorem justifies the focus in the literature on the mean and standard deviation, since knowing these two numbers
gives you rigorous bounds for probabilities without knowing anything else about the probability function!!
If you remember only one thing from these lecture notes, remember the Chebychev Theorem.
Here is the proof of the Chebychev theorem: Let S denote the sample space under consideration, and let E ⊂ S denote the set where |n − µ| > Rσ. The probability of E is then ∑_{n∈E} P(n). However, since |n − µ| > Rσ for n ∈ E, one has

1 ≤ |n − µ|^2/(R^2 σ^2)   (11.21)
on E. Thus,
∑_{n∈E} P(n) ≤ ∑_{n∈E} (|n − µ|^2/(R^2 σ^2)) P(n).   (11.22)
To finish the story, note that the right side of (11.22) is even larger when we allow the sum to include all points in S
instead of restricting only to points in E. Thus, we learn that
∑_{n∈E} P(n) ≤ ∑_{n} (|n − µ|^2/(R^2 σ^2)) P(n).   (11.23)

The definition of σ^2 from (11.16) can now be invoked to identify the sum on the right-hand side of (11.23) with R^{−2}.
Here is an example of how one might apply the Chebychev Theorem: I watch a six-sided die rolled 100 times and see the number 6 appear thirty times. I wonder how likely it is to see thirty or more appearances of the number 6 given that the die is fair. Under the assumption that the die is fair and that the faces that appear in any given subset of the 100 rolls have no bearing on those that appear in the remaining rolls, I can compute this probability using the q = 1/6 version of the N = 100 binomial probability function by computing the sum P1/6(30) + P1/6(31) + · · · + P1/6(100). Alternately, I can get an upper bound for this probability by using the Chebychev Theorem. In this regard, I note that the mean of the q = 1/6 and N = 100 binomial probability function is 100/6 = 50/3 = 16 2/3 and the standard deviation is (5/3)√5 ≈ 3.73. Now, 30 is 13 1/3 from the mean, and this is ≈ 3.58 standard deviations. Thus, the probability of seeing thirty or more appearances of the number 6 in 100 rolls is no greater than (3.58)^{−2} ≈ 0.08. As it turns out, the probability of seeing 6 appear thirty or more times is a good deal smaller than this.
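The exact tail sum P1/6(30) + · · · + P1/6(100) is straightforward to evaluate by machine, for comparison with the Chebychev bound (the helper name is mine):

```python
from math import comb

def binom_pmf(n, N, q):
    return comb(N, n) * q**n * (1 - q)**(N - n)

# Exact probability of thirty or more 6's in 100 rolls of a fair die,
# to compare with the Chebychev bound of roughly 0.08.
tail = sum(binom_pmf(n, 100, 1/6) for n in range(30, 101))
print(tail < 0.08)  # True: the exact value is far below the bound
```
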
Thus, the coefficient of xn of this polynomial representation of P gives us the probabilities that Pq assigns to the
integer n.
To see why (11.25) is the same as (11.24), consider multiplying out an N-fold product of the form:

(a1 x + b1)(a2 x + b2) · · · (aN x + bN).   (11.26)
A given term in the resulting sum can be labeled as (α1, . . . , αN) where αk = 1 if the kth factor in (11.26) contributed ak x, while αk = −1 if the kth factor contributed bk. The power of x for such a term is equal to the number of αk that are +1. There are N!/(n!(N − n)!) terms whose power of x is n; this is the number of elements in the set Kn that appears in (11.3). Thus, N!/(n!(N − n)!) terms contribute to the coefficient of x^n in (11.26). In the case of (11.25), all versions of ak are equal to q, and all versions of bk are equal to (1 − q), so each term that contributes to the coefficient of x^n is q^n (1 − q)^{N−n} x^n. As there are N!/(n!(N − n)!) of them, the coefficient of x^n is Pq(n), as (11.24) claims.
In general, the characteristic function for a probability function, P, on a subset of {0, 1, . . .} is the polynomial or infinite power series in x for which the coefficient of x^n is P(n). This is to say that

P(x) = P(0) + P(1)x + P(2)x^2 + · · · = ∑_{n=0,1,2,...} P(n) x^n.   (11.27)
This way of coding the probability function P is practical only to the extent that the series in (11.27) sums to a relatively simple function. I gave the q-binomial example above. In the case of the uniform probability function on {0, . . . , N}, the sum in (11.27) is the function (1/(N + 1)) (1 − x^{N+1})/(1 − x). I argue momentarily that the τ version of the Poisson probability function on {0, 1, 2, . . .} gives

P(x) = e^{(x−1)τ}.   (11.28)
In general, there are two salient features of a characteristic polynomial: First, the values of P, its derivative, and its second derivative at x = 1 determine µ and σ:

• P(1) = 1
• (d/dx)P |_{x=1} = µ
• (d^2/dx^2)P |_{x=1} = σ^2 + µ(µ − 1)   (11.29)
Second, the values of P, its derivative, and its higher order derivatives at x = 0 determine P since

(1/n!) (d^n/dx^n)P |_{x=0} = P(n).   (11.30)
This is how the function x → P(x) encodes all of the information that can be obtained from the given probability
function P.
To explain how (11.29) follows from the definition in (11.27), note first that P(1) = P(0) + P(1) + · · ·, and this is equal to 1 since the sum of the probabilities must be 1. Meanwhile, the derivative of P at x = 1 is 1 · P(1) + 2 · P(2) + · · ·, and this sum is the definition of the mean µ. With the help of (11.16), a very similar argument establishes the third point in (11.29).
I can use (11.29) to get my slick calculations of the mean and standard deviations for the binomial probability function.
In this case, P is given in (11.24) and so
(d/dx)P = Nq (qx + (1 − q))^{N−1}.   (11.31)
Set x = 1 here to find the mean equal to N q as claimed. Meanwhile
(d^2/dx^2)P = N(N − 1)q^2 (qx + (1 − q))^{N−2}.   (11.32)

Setting x = 1 here finds the right-hand side of (11.32) equal to N(N − 1)q^2. Granted this and the fact that µ = qN, the third point in (11.29) finds σ^2 equal to

N(N − 1)q^2 − N^2 q^2 + Nq = Nq(1 − q),   (11.33)
which is the asserted value.
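The claims µ = Nq and σ^2 = Nq(1 − q) can also be checked by brute force, summing n Pq(n) and (n − µ)^2 Pq(n) directly (this numerical check is mine):

```python
from math import comb

N, q = 100, 1/6
pmf = [comb(N, n) * q**n * (1 - q)**(N - n) for n in range(N + 1)]

mu = sum(n * p for n, p in enumerate(pmf))
var = sum((n - mu)**2 * p for n, p in enumerate(pmf))

print(round(mu, 6), round(N * q, 6))             # both 16.666667
print(round(var, 6), round(N * q * (1 - q), 6))  # both 13.888889
```
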
For the Poisson probability function, the characteristic polynomial is the infinite power series

Pτ(0) + xPτ(1) + x^2 Pτ(2) + · · · = (1 + xτ + (xτ)^2/2 + (xτ)^3/6 + · · · + (xτ)^n/n! + · · ·) e^{−τ}.   (11.34)

As can be seen by replacing τ in (11.12) with xτ, the sum in parentheses on the right is e^{xτ}. Thus, P(x) = e^{(x−1)τ} as claimed by (11.28). Note that the first derivative of this function at x = 1 is τ and the second derivative is τ^2. With (11.29), this last fact serves to justify the claim that the mean and the square of the standard deviation for the Poisson probability are both equal to τ.
The characteristic polynomial for a probability function is used to simplify seemingly hard computations in the manner
just illustrated in the cases where the defining sum in (11.27) can be identified with the power series expansion of a
well known function. The characteristic function has no practical use when the power series expansion is not that of a
simple function.
I will now make this last formula look like a matrix equation. For this purpose, fix some integer T ≥ 1 and make a T-component vector, m(N), whose coefficients are the values of mn for the cases that 1 ≤ n ≤ T. This equation asserts that m(N) = A m(N − 1) where A is the matrix with Ak,k and Ak,k−1 both equal to 1 and all other entries equal to zero. Iterating the equation m(N) = A m(N − 1) finds

m(N) = A^{N−1} m(1),   (11.36)

where m(1) is the vector with top component 1 and all others equal to zero.
Now, we don’t have the machinery to realistically compute A^{N−1}, so instead, let’s just verify that the expression in (11.3) gives the solution to (11.35). In this regard, note that m(N) is completely determined from m(1) using (11.36), and so if we believe that we have a set {m(1), m(2), . . .} of solutions, then we need only plug in our candidate and see if (11.36) holds. This is to say that in order to verify that (11.3) is correct, one need only check that the formula in (11.36) holds. This amounts to verifying that

N!/(n!(N − n)!) = (N − 1)!/((n − 1)!(N − n)!) + (N − 1)!/(n!(N − n − 1)!).   (11.37)
I leave this task to you.
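The algebra is the exercise, but a loop can at least confirm that (11.37), which is Pascal's identity, holds for small N (this check is mine, not part of the original notes):

```python
from math import factorial

def binom(N, n):
    # N!/(n!(N-n)!)
    return factorial(N) // (factorial(n) * factorial(N - n))

# Verify (11.37): N!/(n!(N-n)!) = (N-1)!/((n-1)!(N-n)!) + (N-1)!/(n!(N-n-1)!)
for N in range(2, 13):
    for n in range(1, N):
        assert binom(N, n) == binom(N - 1, n - 1) + binom(N - 1, n)
print("(11.37) holds for all 1 <= n < N <= 12")
```
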
Max Delbruck and Salvador Luria shared the 1969 Nobel Prize in Physiology or Medicine for an experiment, done in 1943, that distinguished between these two proposals as explanations for the evolution of immunity in bacteria.
The results conclusively favored Darwin over Lamarck.
1 Luria, SE, Delbruck, M., “Mutations of Bacteria from Virus Sensitivity to Virus Resistance” Genetics 28 (1943) 491-511.
Luria and Delbruck realized that µexp and σ²exp can distinguish between the Darwin and Lamarck proposals. To explain
their thinking, consider first what one would expect were the Lamarck proposal true. Let p denote the probability
that exposure to the virus causes a mutation in any given bacterium that allows the bacterium to survive the infection.
The probability of n surviving bacteria in a colony with N members should be given by the binomial probability function from (11.9) as defined using p in lieu of q. Thus, this probability is (N!/(n!(N − n)!)) p^n (1 − p)^{N−n}. In particular, the mean number of survivors should be pN and the square of the standard deviation should be p(1 − p)N.
This is to say that the Lamarck proposal predicts that the experimental data should satisfy

µexp ≈ pN and σ²exp ≈ p(1 − p)N.   (11.39)

Not knowing the value for p, one can nonetheless say that the Lamarck proposal predicts the following:

σ²exp/µexp ≈ 1 − p.   (11.40)
Note in particular that this number is independent of N and K; and in any event it is very close to 1 when p is small.
Small p is expected.
Now consider the Darwinian proposal: In this case, there is some small, but non-zero probability, I’ll call it p, for
the founding bacterium of any given colony to have a mutation that renders it immune to the virus. If the founder
of a colony has this mutation, then it is unaffected by the virus as are all of its descendants. So, a colony with an immune founder should have population 2^T = N after viral attack. If a colony is founded by a bacterium without this
mutation, then its population after the viral attack should be very small. Thus, in this ideal situation, the Darwinian
proposal predicts that each zj should be either nearly zero or nearly N .
If K colonies are started, then the probability that n of them are founded by an immune bacterium is given by the binomial probability function from (11.9) as defined now using p and K in lieu of q and N. Thus, this probability is (K!/(n!(K − n)!)) p^n (1 − p)^{K−n}. Note that the mean of this is pK and the square of the standard deviation is p(1 − p)K. As a consequence, the Darwin proposal predicts that µexp and σ²exp should be roughly
• µexp ≈ ∑_{0≤n≤K} (Probability of n immune founders)(K^{−1} nN)
• σ²exp ≈ ∑_{0≤n≤K} (Probability of n immune founders)(K^{−1} nN^2) − µ²exp.   (11.41)
These sums give the predictions µexp ≈ pN and σ²exp ≈ p(1 − p)N^2. As a consequence, the Darwin proposal predicts the ratio

σ²exp/µexp ≈ (1 − p)N.   (11.42)
Note in particular that if p is small, then this number is roughly N = 2^T while that in (11.40) is roughly 1.
Delbruck and Luria saw statistics that conclusively favored the Darwin proposal over the Lamarck one.
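A small simulation illustrates the gap between the two predicted ratios. The sketch below (all names and parameter values are mine, chosen only for illustration) draws K colonies under each proposal and compares σ²exp/µexp:

```python
import random

def variance_over_mean(z):
    mu = sum(z) / len(z)
    var = sum((x - mu)**2 for x in z) / len(z)
    return var / mu

rng = random.Random(1)
K, N, p = 2000, 512, 0.01

# Lamarck: each of the N bacteria in a colony independently mutates on
# exposure with probability p, so survivor counts are binomial.
lamarck = [sum(rng.random() < p for _ in range(N)) for _ in range(K)]

# Darwin (idealized, as in the text): a colony founded by an immune
# mutant grows to N survivors; otherwise essentially none survive.
darwin = [N if rng.random() < p else 0 for _ in range(K)]

print(variance_over_mean(lamarck))  # close to 1 - p, i.e. about 1
print(variance_over_mean(darwin))   # close to (1 - p) N, i.e. hundreds
```
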
0 0 1 1
(a) Present the steps of the reduced row echelon reduction of A to verify that it is invertible.
(b) Find A−1 using Fact 2.3.5 of Bretscher’s book Linear Algebra with Applications.
2. Let τ denote a fixed number in (0, 1). Now define a probability function, P, on the set {0, 1, 2, . . .} by setting
P(n) = (1 − τ )τ n .
(a) Verify that P(0) + P(1) + · · · = 1, and thus verify that P is a probability function.
(If you haven’t seen how to do this, set V (n) = 1+τ +τ 2 +· · ·+τ n . Prove that τ V (n) = V (n)−1+τ n+1 .
Then rearrange things to write V (n)(1−τ ) = (1−τ n+1 ). Now solve for V (n) and consider what happens
when n → ∞.)
(b) Sum the series P(0) + xP(1) + x^2 P(2) + · · · to verify that the characteristic function is P(x) = (1 − τ)/(1 − xτ).
(c) Use the formula in (11.29) to compute the mean and standard deviation of P.
(d) In the case τ = 1/2, the mean is 1 and the standard deviation is √2. As 6 ≥ µ + 3σ, the Chebychev Theorem asserts that the probability for the set {6, 7, . . .} should be less than 1/9. Verify this prediction by summing P(6) + P(7) + · · ·.
(e) In the case τ = 2/3, the mean is 2 and σ = √6. Verify the prediction of the Chebychev Theorem that {7, 8, . . .} has probability less than 1/4 by computing the sum P(7) + P(8) + · · ·.
3. This exercise fills in some of the details in the verification of (11.3).
(a) Multiply both sides of (11.37) by (n − 1)!(N − n − 1)! and divide both sides of the result by (N − 1)!
Give the resulting equation.
(b) Use this last result to verify (11.37).
4. Suppose that I have a bag with 6 red grapes and 5 green ones. I reach in with my eyes closed and pick a grape
at random. After looking at its color, I replace the grape, shake up the bag and redo this experiment, 10 times
in all. Let n be an integer between 0 and 10. Assume that any given grape is chosen at random each time, and
that my choices in any given subset of the experiments have no bearing on those of the remaining experiments.
What is the probability of choosing exactly n green grapes in the 10 experiments?
5. The purpose of this exercise is to explore the q = 1/2 version of the binomial probability function, thus the function P(n) that appears in (11.7).
(a) As explained in the text, P(n) is largest when n = N/2 in the case that N is even. Use Stirling’s formula to justify the claim that P(N/2) ≈ √(2/(πN)) when N is even and very large. Compare this last number with the true value given by (11.7) for N = 10 and N = 100.
(b) Suppose that N is a positive, even integer. The mean for the q = 1/2 binomial probability function for the set {0, 1, . . . , N} is µ = N/2 and the standard deviation is σ = (1/2)√N. Use the Chebychev theorem in the cases N = 36 and 64 to find an upper bound for the probability of the set of n ≤ 10. Then, use a calculator to compute this probability using the formula in (11.7).
6. The purpose of this problem is to explore the Poisson probability function in the context of discerning patterns.
Suppose that I am a police captain in Boston in charge of a 10 block by 10 block neighborhood, thus a neighborhood of 100 squares, each 1 block on a side. I note that over the years, my force was asked to
respond to an average of 500 calls from the neighborhood, thus an average of 5 per square block per year. The
Here is a question: What is the expected frequency of 4, 5, 6 and 7 game series if we assume that the two teams
have a 50-50 chance of winning any given game, and if we assume that the event of winning any one game is
independent of the event of winning any other. To answer this question, let A and N denote the two teams.
(a) What is the probability for A to win the first 4 games? Likewise, what is the probability for N?
(b) Explain why the probability for the World Series to last 4 games is 1/8 = 0.125.
(c) Explain why the probability for the World Series to last 5 games is equal to

2 · (4!/(3! 1!)) · (1/2)^4 · (1/2) = 1/4 = 0.25.
(d) Explain why the probability for the World Series to last 6 games is

2 · (5!/(3! 2!)) · (1/2)^5 · (1/2) = 5/16 = 0.3125.
(e) Explain why the probability for the World Series to last 7 games is

2 · (6!/(3! 3!)) · (1/2)^6 · (1/2) = 5/16 = 0.3125.
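All four probabilities in (b)–(e) come from a single pattern: the series lasts exactly g games when the eventual winner takes 3 of the first g − 1 games and then wins game g, with a factor of 2 for either team winning. A sketch (the function name is mine):

```python
from math import comb

def series_length_prob(g):
    # 2 * C(g-1, 3) * (1/2)^(g-1) * (1/2): winner takes 3 of the first
    # g-1 games, then wins game g; factor 2 for either team winning.
    return 2 * comb(g - 1, 3) * (1/2)**(g - 1) * (1/2)

for g in (4, 5, 6, 7):
    print(g, series_length_prob(g))  # 0.125, 0.25, 0.3125, 0.3125
```

The four values sum to 1, as they must, since every series lasts 4, 5, 6, or 7 games.
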
CHAPTER
TWELVE
P-values
My purpose in this chapter is to describe a criterion for deciding when data is “surprising”. For example, suppose that
you flip a coin 100 times and see 60 heads. Should you be surprised? Should you question your belief that the coin is
fair? More generally, how many heads should appear before you question the coin’s fairness? These are the sorts of
questions that I will address below.
Pq(n) = (N!/(n!(N − n)!)) q^n (1 − q)^{N−n}.   (12.1)
• Since we actually found m events in N trials, take the value of q that gives m for the mean of the probability function in (12.1). With reference to (11.17) in the previous chapter, this choice for q is m/N.
• Take q so that n = m is the integer with the maximal probability. If you recall (11.19) from the previous chapter, this entails taking q so that both

Pq(m)/Pq(m + 1) > 1 and Pq(m − 1)/Pq(m) < 1.   (12.2)

This then implies that m/(N + 1) < q < (m + 1)/(N + 1). Note that q = m/N satisfies these conditions.
With regard to my example of flipping a coin 100 times and seeing 60 heads appear, these arguments would lead me
to postulate that the probability of seeing some number, n, of heads, is given by the q = 0.6 version of (12.1).
12.2 P -value and bad choices
A different approach asks for the bad choices of q rather than the “best” choice. The business of ruling out various
choices for q is more in the spirit of the scientific method. Moreover, giving the unlikely choices for q is usually
much more useful to others than simply giving your favorite candidate. What follows explains how statisticians often
determine the likelihood that a given choice for q is realistic.
For this purpose, suppose that we have some value for q in mind. There is some general agreement that q is not a
reasonable choice when the following occurs:
There is small probability as computed by the q-version of (12.1) of there being the observed m occur-
rences of the event of interest.
To make the notion of small probability precise, statisticians have introduced the notion of the “P -value” of a mea-
surement. This is defined with respect to some hypothetical probability function, such as our q-version of (12.1). In
our case, the P-value of m is the probability for the subset of numbers n ∈ {0, 1, . . . , N} that are at least as far from the mean, qN, as is m. For example, m has P-value 1/2 in the case that (12.1) assigns probability 1/2 to the set of integers n that obey |n − Nq| ≥ |m − Nq|. A P-value that is less than 0.05 is deemed significant by statisticians. This is to say that if m has such a P-value, then q is deemed unlikely to be correct.
In general, the definition of the P-value for any measurement is along the same lines:
Definition: Suppose that a probability function on the set of possible measurements for some experiment is proposed.
The P -value of any given measurement is the probability for the subset of values that lie as far or farther from the
mean of this probability function than the given measurement. The P -value is deemed significant if it is smaller than
0.05.
Note here that this definition requires computing the probability of being both greater than the mean and less than the mean. There is also the notion of a one-sided P-value. This computes the probability of being on the same side of
the mean as the observed value, and at least as far from the mean as the observed value. Thus, if the observed value is
greater than the mean, the 1-sided P -value computes the probability that a measurement will be as large or larger than
the observed value. If the observed value is less than the mean, then the 1-sided P -value computes the probability of
being as small or smaller than the observed value. There are other versions of P -value used besides that defined above
and the 1-sided P -values. However, in these lecture notes, P -value means what is written in the preceding definition.
But, keep in mind when reading the literature or other texts that there are alternate definitions.
Consider now the example where I flip a coin 100 times and find 60 heads. If I want to throw doubt on the hypothesis
that the coin is fair, I should compute the P -value of 60 using the q = 12 version of the binomial probability function
in (12.1). This means computing P1/2(60) + P1/2(61) + · · · + P1/2(100) + P1/2(40) + P1/2(39) + · · · + P1/2(0).
The latter sum gives the probability of seeing a number of heads appear that is at least as far from the mean, 50, as is
60. My calculator finds that this sum is 0.057. Thus, the P -value of 60 is greater than 0.05 and so I hesitate to reject
the hypothesis that the coin is fair.
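That two-sided sum is easy to reproduce by machine (the helper below is mine):

```python
from math import comb

def pmf(n, N=100, q=1/2):
    return comb(N, n) * q**n * (1 - q)**(N - n)

# Two-sided P-value of 60 heads in 100 flips of a hypothetically fair
# coin: the probability of landing at least 10 away from the mean of 50.
pval = sum(pmf(n) for n in range(101) if abs(n - 50) >= 10)
print(round(pval, 3))  # 0.057
```
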
An upper bound for the P -value that uses only the mean and standard deviation of a probability function can be had
using the Chebychev theorem in Chapter 11. As you should recall, this theorem asserts that the probability of finding a measurement at distance Rσ or more from the mean is less than R^{−2}. Here, σ denotes the standard deviation of the given
probability function. Granted this, a measurement that differs from the mean by 5σ or more has probability less than
0.04 and so has a significant P -value. Such being the case, the 5σ bound is often used in lieu of the 0.05 bound.
To return to our binomial case, to say that m differs from the mean, Nq, by at least 5σ is to say that |m − Nq| ≥ 5 √(Nq(1 − q)).
The converse of this last statement is not true. An event can have P-value that is smaller than 0.05 yet differ from the mean by less than 5σ. Take for example the case where I flip a coin 100 times and now see 61 heads instead of 60. Under the assumption that the coin is fair, the P-value of 61 heads is approximately 0.035. This is less than 0.05. However, 61 is only 2.2σ from the mean.
where the sum is over all integers b from the set, B, of integers in {0, . . . , N} that obey |b − N/4| ≥ |n − N/4|.
Suppose, for instance, that we have a strand of N = 20 bases and see 10 appearances of G. Does this suggest that our hypothesis about G’s appearance being random in the sense just defined is incorrect? To answer this question, I would
compute the P -value of 10. This means computing the sum in (12.4) with N = 20 and B the subset in {0, . . . , 20} of
integers b with either b = 0 or b ≥ 10. As it turns out, this sum is close to 0.017. The P -value is less than 0.05 and so
we are told to doubt the hypothesis that the G appears at random.
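The sum behind that 0.017 figure can be reproduced directly (the helper name is mine):

```python
from math import comb

N, q = 20, 1/4

def pmf(b):
    return comb(N, b) * q**b * (1 - q)**(N - b)

# P-value of 10 G's: sum over b with |b - 5| >= 5, i.e. b = 0 or b >= 10.
pval = sum(pmf(b) for b in range(N + 1) if abs(b - N * q) >= 5)
print(round(pval, 3))  # 0.017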
As the sum in (12.4) can be difficult to compute in any given case, one can often make do with the upper bound for the P-value that is obtained from the Chebychev theorem. In this regard, you should remember the Chebychev theorem’s assertion that the probability of being R standard deviations from the mean is less than R^{−2}. Thus, Chebychev says that a measurement throws significant doubt on a hypothesis when it is 5 or more standard deviations from the mean.
In the case under consideration in (12.4), the standard deviation, σ, is √(3N)/4. In this case, the Chebychev theorem says that the set of integers b that obey |b − N/4| > R √(3N)/4 has probability less than R^{−2}. Taking R = 5, we see that a value for n has P-value less than 0.05 if it obeys |n − N/4| ≥ (5/4)√(3N). This result in the DNA example can be framed as follows: The measured fraction, n/N, of occurrences of G has significant P-value in our q = 1/4 random model if

|n/N − 1/4| > (5/4) √(3/N).   (12.5)
You should note here that as N gets bigger, the right-hand side of this last inequality gets smaller. Thus, as N gets bigger, the experiment must find the ratio n/N ever closer to 1/4 so as to forestall the death of our hypothesis about the random occurrences of the constituent molecules on the DNA strand.
Pτ(n) gives the probability of seeing n occurrences of a particular event in any given unit time interval when the occurrences are unrelated and they average τ per unit time.   (12.7)
What follows is an example that doesn’t come from biology but is nonetheless dear to my heart. I like to go star
gazing, and over the years, I have noted an average of 1 meteor per night. Tonight I go out and see 5 meteors. Is
this unexpected given the hypothesis that the appearances of any two meteors are unrelated? To test this hypothesis,
I should compute the P -value of n = 5 using the τ = 1 version of (12.6). Since the mean of Pτ is τ , this involves
computing
∑_{m≥5} (1/m!) e^{−1} = 1 − (1 + 1 + 1/2 + 1/6 + 1/24) e^{−1}.   (12.8)

My trusty computer can compute this, and I find that the P-value of 5 is approximately 0.004. Thus, my hypothesis of the unrelated and random
occurrence of meteors is unlikely to be true.
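The arithmetic in (12.8) takes one line (this check is mine):

```python
from math import exp

# P-value of 5 meteors when the mean is tau = 1: the probability of 5 or
# more, since no value can lie 4 or more below the mean of 1.
pval = 1 - (1 + 1 + 1/2 + 1/6 + 1/24) * exp(-1)
print(round(pval, 4))  # 0.0037
```
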
What follows is an example from biology, this very relevant to the theory behind the “genetic clocks” that predict
the divergence of modern humans from an African ancestor some 100, 000 years ago. I start the story with a brief
summary of the notion of a point mutation of a DNA molecule. This occurs as the molecule is copied for reproduction
when a cell divides; it involves the change of one letter in one place on the DNA string. Such changes, cellular
typographical errors, occur with very low frequency under non-stressful conditions. Environmental stresses tend to
increase the frequency of such mutations. In any event, under normal circumstances, the average point mutation rate
per site on a DNA strand, per generation has been determined via experiments. Let µ denote the latter. The average
number of point mutations per generation on a segment of DNA with N sites on it is thus µN . In T ≥ 1 generations,
the average number of mutations in this N -site strand is thus µN T .
Now, make the following assumptions:
• The occurrence of any one mutation on the given N -site strand has no bearing on the
occurrence of another.
• Environmental stresses are no different now than in the past, (12.9)
• The strand in question can be mutated at will with no effect on the organism’s repro-
ductive success.
Granted the latter, the probability of seeing n mutations in T generations on this N-site strand of DNA is given by the τ = µNT version of the Poisson probability:

(1/n!) (µNT)^n e^{−µNT}.   (12.10)
In the early 1980s, a leukemia cluster was identified in the Massachusetts town of Woburn. Three
companies, including W. R. Grace & Co., were accused of contaminating drinking water and causing
illnesses. There is no question that this tragedy had a profound impact on everyone it touched, particularly
the families of Woburn.
John Travolta and Robert Duvall play the roles of the lawyers for the folks in Woburn who brought forth the civil suit
against the companies.
Here are the numbers involved: Over twenty years, there were 20 cases of childhood leukemia in Woburn. On
average, there are 3, 500 children of the relevant ages in Woburn each year, so we are talking about 20 cases per
3, 500 × 20 = 70, 000 person-years. Given these numbers, there are two questions to ask:
(a) Is the number of cases, 20, so high as to render it very likely that the cases are not random occurrences, but due to some underlying cause?
(b) If the answer to (a) is yes, then are the leukemias caused by pollution from the companies that are named in the
suit?
We can’t say much about question (b), but statistics can help us with question (a). This is done by testing the significance of the hypothesis that this large number of cases is a random event. For this purpose, note that the sort of leukemia that is involved here occurs with an expected count of 13 per 100,000 person-years, which is 9.1 per 70,000 person-years. Thus, we need to ask whether 20 per 70,000 person-years is significant. If the probabilities here are well modeled by a Poisson probability, then the probability of seeing n ∈ {0, 1, . . .} cases per 70,000 person-years is

(1/n!) (9.1)^n e^{−9.1}.   (12.12)
In particular, of interest here is the P-value of 20. This is the probability of being 10.9 or more from the mean. The mean of this Poisson function is 9.1 and the standard deviation is √9.1 ≈ 3. Thus, 20 is roughly 4 standard deviations from the mean, which isn’t enough to use the Chebychev theorem. This being the case, I can still try to compute the P-value directly using its definition as the probability of getting at least as far from the mean as is 20. This is to say that
P-value(20) = Σ_{n≥20} (1/n!) (9.1)^n e^{−9.1} ≈ 0.0012.    (12.13)
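The tail sum in (12.13) is easy to reproduce numerically; a minimal sketch using only the standard library:

```python
import math

def poisson_upper_tail(mu, k):
    """P(X >= k) for X Poisson with mean mu, via 1 minus the sum over 0, ..., k-1."""
    term = math.exp(-mu)          # P(X = 0)
    below = term
    for n in range(1, k):
        term *= mu / n            # P(X = n) from P(X = n - 1)
        below += term
    return 1.0 - below

# Woburn: expected count 9.1, observed count 20
print(poisson_upper_tail(9.1, 20))  # close to the 0.0012 quoted in (12.13)
```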
Note that the mean increases when τ increases and it decreases when τ decreases.
To see what sort of value for τ gives a small P-value for 7730, I start by considering values of τ where the mean, µ, is less than half of 7730. Note that I can solve (12.15) for τ in terms of µ to see that this means looking for τ between 0 and 3865/3866. There are two reasons why I look at these values first:
• If I find τ in this set, I needn’t look further since (12.15) shows that the mean, µ, when viewed as a function of
τ , increases with increasing τ .
This is to say that the P-value of 7730 as defined using any given τ < 3865/3866 is

P-value(7730) = Σ_{n≥7730} (1 − τ)τ^n = τ^7730 Σ_{n≥0} (1 − τ)τ^n = τ^7730.    (12.16)
Since I want to find the smallest τ where the P-value of 7730 is 0.05, I solve τ^7730 = 0.05.
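The sentence above is cut off in this copy, but given (12.16) the condition being solved is presumably τ^7730 = 0.05, whose solution is immediate:

```python
# Solve tau**7730 = 0.05 for tau: the tau whose (12.16) P-value
# for 7730 equals the 0.05 significance threshold.
tau = 0.05 ** (1.0 / 7730)
print(tau)  # a number just below 1
```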
12.7 Exercises:
1. Define a probability function, P, on {0, 1, 2, . . . } by setting P(n) = (1/10)(9/10)^n. What is the P-value of 5 using this probability function?
2. Suppose we lock a monkey in a room with a word processor, come back some hours later and see that the monkey
has typed N lower case characters. Suppose this string of N characters contains precisely 10 occurrences of the
letter “e”. The monkey’s word processor keyboard allows 48 lower case characters including the space bar.
(a) Assume that the monkey is typing at random, and give a formula for the probability that 10 occurrences of
the letter “e” appear in the N characters.
(b) Use the Chebychev theorem to estimate how big to take N so that 10 appearances of the letter “e” has
significant P -value.
Here is a rather more difficult thought problem along the same line: Suppose that you pick up a copy of Tolstoy’s
novel War and Peace. The paperback version translated by Ann Dunnigan has roughly 1,450 pages. Suppose
that you find as you read that if you take away the spacing between letters, there is an occurrence of the string of
letters “professortaubesisajerk”. Note that spaces can be added here to make this a bona fide English sentence.
You might ask whether such a string is likely to occur at random in a book of 1,450 pages, or whether Tolstoy
discovered something a century or so ago that you are only now coming to realize is true. (Probabilistic analysis
of the sort introduced in this problem debunks those who claim to find secret, coded prophecies in the Bible and
other ancient texts.)
3. The Poisson probability function is often used to distinguish “random” from less than random patterns. This
problem concerns an example that involves spatial patterns. The issue concerns the distribution of appearances
of the letter “e” in a piece of writing. To this end, obtain two full page length columns of text from a Boston
Globe newspaper.
(a) Draw a histogram on a sheet of graph paper whose bins are labeled by the nonnegative integers, n =
0, 1, . . . . Make the height of the bin numbered by any given integer n equal to the number of lines in your
two newspaper columns that have n appearances of the letter “e”.
(b) Compute the total number of appearances of the letter “e”, and divide the latter number by the total number of lines in your two newspaper columns. Call this number τ. Plot on your graph paper (in a different color) a second histogram where the height of the nth bin is the function Pτ(n) as depicted in (12.6).
(c) Compute the standard deviation of your observed data as defined in (1.3) of Chapter 1 and give the ratio of this observed standard deviation to the standard deviation, √τ, for Pτ(n).
(a) Use a calculator to compute the actual P -value of 15 under the assumption that the probability of seeing
n cases in one year in England is determined by the τ = 10 version of the Poisson probability function.
In this regard, it is almost surely easier to first compute the probability of {6, 7, . . . , 14} and then subtract
the latter number from 1. However, if you do the computation in this way, you have to explain why the
resulting number is the P -value of 15.
(b) Compute the Chebychev theorem’s upper bound for the P -value of 15.
5. Suppose that cells from the skin are grown in low light conditions, and it is found that under these circum-
stances, there is some probability, p, for any given cell to exhibit some chromosome abnormality. Granted this,
suppose that 100 skin cells are taken at random from an individual and 2 are found to have some chromosome
abnormality. How small must p be to deem this number significant?
6. This problem returns to the World Series data that is given in Problem 8 of Chapter 11. The question here is
this: Does the data make it unlikely that, on average, each team in the World Series has a 50% probability of
winning any given game? To answer this question,
(a) Use the actual data to prove that the average number of games played in the World Series is 5.8.
(b) Use the probabilities given in Problem 8(b)–(e) of Chapter 11 to justify the following:
If each team has a 50-50 chance of winning any given game, then the mean of the number of
games played is 5.8125.
Thus, the average numbers of games are almost identical.
(c) Here is another approach: If the probability of a 7-game series is 5/16, one can ask for the P-value of the fact that there were 33 of them in 83 years. Explain why this question can be answered by studying the binomial probability function on the sample space {0, . . . , 83} with q = 5/16. In particular, explain why the P-value of 33 is

Σ_{n=33,...,83} (83 choose n) (5/16)^n (11/16)^{83−n} + Σ_{n=0,...,19} (83 choose n) (5/16)^n (11/16)^{83−n}.
CHAPTER THIRTEEN
So far, we have only considered probability functions on finite sets and on the set of non-negative integers. The task in this chapter is to introduce probability functions on the whole real line, R, or on a subinterval in R such as the interval where 0 ≤ x ≤ 1. Let me start with an example to motivate why such a definition is needed.
13.1 An example
Suppose that we have a parameter that we can vary in an experiment, say the concentration of sugar in an airtight,
enclosed Petri dish with photosynthesizing bacteria. Varying the initial sugar concentration, we measure the amount
of oxygen produced after one day. Let x denote the sugar concentration and y the amount of oxygen produced. We run
some large number, N , of versions of this experiment with respective sugar concentrations x1 , . . . , xN and measure
corresponding oxygen concentrations, y1 , . . . , yN .
Suppose that we expect, on theoretical grounds, a relation of the form y = cx + d to hold. In order to determine the
constants c and d, we find the least squares fit to the data {(xj , yj )}1≤j≤N .
Now, the differences,
∆1 = y1 − cx1 − d, ∆2 = y2 − cx2 − d, etc (13.1)
between the actual measurements and the least squares measurements should not be ignored. Indeed, these differences
might well carry information. Of course, you might expect them to be spread “randomly” on either side of 0, but then
what does it mean for a suite of real numbers to be random? More generally, how can we decide if their distribution
on the real line carries information?
I’ll say more about these points momentarily. Here is how you are supposed to use p(x) to find probabilities: If
U ⊂ [a, b] is any given subset, then

∫_{x∈U} p(x) dx    (13.3)
is meant to give the probability of finding the point x in the subset U . Granted this use of p(x), then the first constraint
in (13.2) forbids negative probabilities; meanwhile, the second guarantees that there is probability 1 of finding x
somewhere in the given interval [a, b].
A continuous probability function is often called a “probability distribution” since it signifies how probabilities are
distributed over the relevant portion of the real line. Note in this regard, that people often refer to the “cumulative
distribution function”. This function is the anti-derivative of p(x). It is often denoted as P (x) and is defined by
P(x) = ∫_a^x p(s) ds.    (13.4)

Thus, P(a) is zero, P(b) is one, and P′(x) = p(x). In this regard, P(x) is the probability that p assigns to the interval [a, x]. It is the probability of finding a point that is less than the given point x.
The functions p(x) = 1, p(x) = 2x, p(x) = 6(x − x²) and p(x) = (π/2) sin(πx) are all probability functions for the interval [0, 1].
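One can check numerically that each candidate density integrates to 1 over [0, 1]; note that the constants in front (the 6 on x − x², the π/2 on sin(πx)) are exactly what make the integrals equal 1. A sketch using a simple midpoint rule:

```python
import math

def integrate(p, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of p over [a, b]."""
    h = (b - a) / steps
    return sum(p(a + (i + 0.5) * h) for i in range(steps)) * h

densities = [
    lambda x: 1.0,                                      # uniform on [0, 1]
    lambda x: 2.0 * x,
    lambda x: 6.0 * (x - x * x),                        # the factor 6 normalizes x - x^2
    lambda x: (math.pi / 2.0) * math.sin(math.pi * x),  # the factor pi/2 normalizes sin(pi x)
]
for p in densities:
    print(integrate(p, 0.0, 1.0))  # each very close to 1
```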
This is the “average” value of x where p(x) determines the meaning of average. The standard deviation, σ, has its
square given by

σ² = ∫_a^b (x − µ)² p(x) dx.    (13.6)
Note that in the case that |a| or b is infinite, one must worry a bit about whether the integrals actually converge. We
won’t be studying examples in this course where this is an issue.
By the way, novices in probability theory often forget to put the factor of p(x) into the integrands when they compute
the mean or standard deviation. Many of you will make this mistake at some point.
Note that this theorem holds for any p(x) as long as both µ and σ are defined. Thus, the two numbers µ and σ give
you enough information to obtain upper bounds for probabilities without knowing anything more about p(x).
The Chebychev theorem justifies the ubiquitous focus on means and standard deviations.
The uniform probabilities: The simplest of the three is the uniform probability function on some finite interval.
Thus, a and b must be finite. In this case,

p(x) = 1/(b − a).    (13.7)

This probability function asserts that the probability of finding x in an interval of length L < b − a inside the interval [a, b] is equal to L/(b − a); thus it is proportional to L.
Here is an example where this case can arise: Suppose we postulate that bacteria in a petri dish can not sense the
direction of the source of a particular substance. We might then imagine that the orientation of the axis of the bacteria
with respect to the xy-coordinate system in the plane of the petri dish should be “random”. This is to say that the head
end of a bacteria is pointed at some angle, θ ∈ [0, 2π], and we expect that the particular angle for any given bacteria is
“random”. Should we have a lot of bacteria in our dish, this hypothesis implies that we must find that the percent of
them with head pointed between angles 0 ≤ α < β ≤ 2π is equal to (β − α)/(2π).
The mean and standard deviation for the uniform probability function are

µ = (1/2)(b + a)  and  σ = (b − a)/√12.    (13.8)

In this regard, note that the mean is the midpoint of the interval [a, b] (are you surprised?). For example, the uniform probability distribution on the interval [0, 1] has mean 1/2 and standard deviation 1/√12.
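The values in (13.8) for the uniform probability function on [0, 1] can be confirmed by direct numerical integration; a sketch:

```python
import math

def uniform_moments(a, b, steps=100000):
    """Mean and standard deviation of the uniform density 1/(b - a) on [a, b],
    computed by midpoint-rule integration, to compare with (13.8)."""
    h = (b - a) / steps
    p = 1.0 / (b - a)
    xs = [a + (i + 0.5) * h for i in range(steps)]
    mu = sum(x * p for x in xs) * h
    var = sum((x - mu) ** 2 * p for x in xs) * h
    return mu, math.sqrt(var)

mu, sigma = uniform_moments(0.0, 1.0)
print(mu, sigma)  # close to 1/2 and 1/sqrt(12), roughly 0.2887
```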
The Gaussian probabilities: These are probability functions on the whole of R. Any particular version is deter-
mined with the specification of two parameters, µ and σ. Here, µ can be any real number, but σ must be a positive real
number. The (µ, σ) version of the Gaussian probability function is

p(x) = (1/(√(2π) σ)) e^{−|x−µ|²/(2σ²)}.    (13.9)
If you have a graphing calculator and graph this function for some numerical choices of µ and σ, you will see that
the graph is the famous “bell-shaped” curve, but centered at the point µ and with the width of the bell given by σ. In
fact, µ is the mean of this probability function and σ is its standard deviation. Thus, small σ signifies that most of the
probability is concentrated at points very close to µ. Large σ signifies that the probability is spread out.
There is a theorem called the “Central Limit Theorem” that explains why the Gaussian probability function appears as
often as it does. This is a fantastically important theorem that is discussed momentarily.
The exponential probabilities: These are defined on the half line [0, ∞). There are various versions and the
specification of any one version is determined by the choice of a positive real number, µ. With µ chosen,

p(t) = (1/µ) e^{−t/µ}.    (13.10)
This one arises in the following context: Suppose that you are waiting for some particular “thing” to happen and you
know the following:
where σN = σ/√N. This is to say that for very large N the probability function for the possible values of x is very close to the Gaussian probability function with mean µ and with standard deviation σN = σ/√N.
Here is an example: Suppose a coin is flipped some N times. For any given k ∈ {1, 2, . . . , N }, let xk = 1 if the kth
flip is heads, and let xk = 0 if it is tails. Let x denote the average of these values, this as defined via (13.12). Thus,
x can take any value in the set {0, 1/N, 2/N, · · · , 1}. According to the Central Limit Theorem, the probabilities for the
values of x are, for very large N , essentially determined by the Gaussian probability function
pN(x) = (1/√(2π)) 2√N e^{−2N|x−1/2|²}.    (13.14)
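A quick simulation illustrates (13.14); the parameters here (N = 400 flips, 2000 repetitions) are made up for the sketch. The averages of N fair coin flips cluster around 1/2 with spread close to 1/(2√N):

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

def average_of_flips(n):
    """Average of n fair coin flips (1 for heads, 0 for tails)."""
    return sum(random.randint(0, 1) for _ in range(n)) / n

N = 400
trials = [average_of_flips(N) for _ in range(2000)]
print(statistics.mean(trials))   # near the mean 1/2
print(statistics.stdev(trials))  # near 1/(2*sqrt(N)) = 0.025
```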
By the way, here is something to keep in mind about the Central Limit Theorem: As N gets larger, the mean for the
Gaussian in (13.13) is unchanged, it is the same as that for the original probability function p that gives the probabilities
for the possible values of any given measurement. However, the standard deviation shrinks to zero in the limit that
N → ∞ since it is obtained from the standard deviation, σ, of p as σ/√N. Thus, the odds of finding the average,
x, some fixed distance from the mean µ decreases to zero in the limit that N → ∞. The Chebychev inequality also
predicts this. Indeed, the Chebychev inequality in this context asserts the following:
Fix a real number, r, and let pN(r) denote the probability that the average, x, of N measurements obeys |x − µ| > r. Then pN(r) ≤ (1/N)(σ/r)² when N is very large.    (13.16)
The use of (13.13) to approximate pN (r) when N is large suggests that pN (r) is much smaller than the Chebychev
upper bound from (13.16). Indeed, the sum of the integrals that appears in (13.13) in the case a = µ + r, b = ∞ and
in the case a = −∞ and b = µ − r can be proved no greater than
√(2/π) (1/√N) (σ/r) e^{−N r²/(2σ²)}.    (13.17)
To derive (13.17), I am using the following fact: Suppose that both κ and r are positive real numbers. Then
• ∫_{µ+r}^∞ (1/(√(2π) κ)) e^{−(x−µ)²/(2κ²)} dx ≤ (1/√(2π)) (κ/r) e^{−r²/(2κ²)}.

• ∫_{−∞}^{µ−r} (1/(√(2π) κ)) e^{−(x−µ)²/(2κ²)} dx ≤ (1/√(2π)) (κ/r) e^{−r²/(2κ²)}.    (13.18)
Here, I use κ as a generic stand-in for “σ”, since we will be applying this formula in instances where κ = σ, and in others where κ = σ/√N (such as in (13.17), when using the Central Limit Theorem). You are walked through the proof of the top inequality of (13.18) in one of the exercises at the end of this chapter. The bottom inequality is obtained from the top by changing variables of integration from x to 2µ − x.
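Both lines of (13.18) can be checked numerically; here is a sketch comparing the exact Gaussian tail (expressible through the standard library's math.erfc) with the right-hand bound, for µ = 0, κ = 1, and a few values of r:

```python
import math

def gaussian_tail(r, kappa=1.0):
    """Exact value of the integral in the top line of (13.18), with mu = 0."""
    return 0.5 * math.erfc(r / (kappa * math.sqrt(2.0)))

def tail_bound(r, kappa=1.0):
    """The right-hand side of (13.18)."""
    return (kappa / (math.sqrt(2.0 * math.pi) * r)) * math.exp(-r * r / (2.0 * kappa * kappa))

for r in (0.5, 1.0, 2.0, 5.0):
    print(r, gaussian_tail(r), tail_bound(r))  # the tail never exceeds the bound
```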
The possible values of fN(x) are {0, 1/N, 2/N, . . . , 1}. We are interested in the large N version of this function. What
almost always happens is that as N gets large, the graph of fN (x) gets closer and closer to the graph of the cumulative
Note that this version does not assume that the probability function for any given source of perturbation is the same as
that of any of the others. The term “not unreasonable” means that the sum of the squares of the standard deviations for
the probability functions of the perturbations is finite, as is the sum of the average values of x4 for these probability
functions.
The top two points tell you that Gaussian probability functions arise in most contexts, and the bottom point tells you
how to find upper bounds for probabilities as determined by a Gaussian probability function.
∆⃗ = y⃗ − A(Aᵀ A)⁻¹ Aᵀ y⃗.    (13.23)
13.10 Exercises:
1. (a) Sketch a graph of the Gaussian probability function for the values µ = 0 and σ = 1, 2, 4, and 8.
(b) Prove the inequality that is depicted in the top line of (13.18) by the following sequence of arguments:
(i) Change variables of integration using the substitution y = x − µ so as to write the integral as

∫_r^∞ (1/(√(2π) κ)) e^{−y²/(2κ²)} dy
(ii) Note that the integration range has y ≥ r, so the integral is no greater than

∫_r^∞ (1/(√(2π) κ)) (y/r) e^{−y²/(2κ²)} dy
(iii) Change variables in this integral by writing u = y²/(2κ²). Thus, du = (y/κ²) dy, and so this last integral becomes

(κ/(√(2π) r)) ∫_{r²/(2κ²)}^∞ e^{−u} du.
(iv) Complete the job by evaluating the integral in (iii).
2. This exercise works with the exponential probability function in (13.10).
(a) Verify by integration that the mean and standard deviations for p(x) in (13.10) are both µ.
(b) Use (13.10) to compute the probability that the desired event does not occur prior to time t0 > 0.
(c) Suppose t0, a and b are given with 0 < t0 < a < b. Prove that the conditional probability that the time of the desired event is in the interval [a, b], given that the event time is not before t0, is e^{−(a−t0)/µ} − e^{−(b−t0)/µ}.
3. Consider the uniform probability function on the interval [0, π]. Compute the mean and standard deviation for
the random variable x → sin(x). In this regard, remember that when x → f(x) is a random variable, then its mean is µf ≡ ∫_a^b f(x)p(x) dx, and the square of its standard deviation is σf² ≡ ∫_a^b (f(x) − µf)² p(x) dx.
4. This problem concerns the example in Section 13.1 above where we run N versions of an experiment with sugar
concentration xk in the kth experiment and measure the corresponding oxygen concentration yk . We then do a
least squares fit of the data to a line of the form y = ax + b. Then ∆j = yj − axj − b tells us how far yj is from the value predicted by the equation y = ax + b. We saw that the average of these ∆j's is zero; that is, the sum Σj ∆j is 0. Assume that the probability for any ∆j landing in any given interval U ⊂ (−∞, ∞) is determined by a Gaussian probability function, thus equal to

∫_{x∈U} (1/(√(2π) σ)) e^{−|x−µ|²/(2σ²)} dx.

Find formulae for µ and σ in terms of the quantities {∆j}1≤j≤N. Hint: See Chapter 1.
7. Throw away the 3 to the left of the decimal place in the decimal expansion of π given in the previous problem.
Group the 100 remaining digits into 10 groups of 10, by moving from left to right. Thus, the first group is
{1415926535} and the tenth group is {3421170679}. Let µk for k ∈ {1, . . . , 10} denote the mean of the kth
group.
(a) Compute each µk .
(b) Let S denote the set {0, . . . , 9} and let p denote the probability function on S that assigns 1/10 to each element. Compute the mean and standard deviation of the numbers in S as determined by p.
(c) Compute the mean and standard deviation of the set {µ1, . . . , µ10}.
(d) What does the Central Limit Theorem predict for the mean and the standard deviation of the set {µ1, . . . , µ10} if it is assumed that the probability of a digit appearing in any given decimal place is 1/10 and that the digits that appear in any subset of the 100 decimal places have no bearing on those that appear in the remaining places?
8. Let p(x) denote the version of the Gaussian probability function in (13.9) with µ = 0 and σ = 1.
(a) Use the relevant version of (13.18) to obtain an upper bound for the probability that p assigns to the set of
points on the line where |x| ≥ 5. These are the points that differ from the mean by 5 standard deviations.
(b) Use the Chebychev Theorem to obtain an upper bound for the probability that p assigns to the set of points
where |x| ≥ 5.
(c) Use a calculator to compute the ratio α/β where α is your answer to 8(a) and β is your answer to 8(b). (You will see that the Chebychev upper bound for the Gaussian probability function is much greater than the bound obtained using (13.18).)
CHAPTER FOURTEEN
Hypothesis testing
My purpose in this chapter is to say something about the following situation: You repeat some experiment a large num-
ber, N , times and each time you record the value of a certain key measurement. Label these values as {z1 , . . . , zN }.
A good theoretical understanding of both the experimental protocol and the biology should provide you with a hypo-
thetical probability function, x → p(x), that gives the probability that a measurement has value in any given interval
[a, b] ⊂ (−∞, ∞). Here, a < b and a = −∞ and b = ∞ are allowed. For example, if you think that the variations in
the values of zj are due to various small, unrelated, random factors, then you might propose that p(x) is a Gaussian,
thus a function that has the form
p(x) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}    (14.1)
for some suitable choice of µ and σ.
In any event, let’s suppose that you have some reason to believe that a particular p(x) should determine the probabilities
for the value of any given measurement. Here is the issue on the table:
Is it likely or not that N experiments will obtain a sequence {z1, . . . , zN} if the probability of any one measurement is really determined by p(x)?    (14.2)
If the experimental sequence, {z1 , . . . , zN } is “unlikely” for your chosen version of p(x), this suggests that your
understanding of the experiment is less than adequate.
I describe momentarily two ways to answer the question that is posed in (14.2).
14.1 An example
By way of an example, I ask you to consider the values of sin(x) for x an integer. As the function x → sin(x) has its
values between −1 and 1, it is tempting to make the following hypothesis:
The values of sin(x) on the set of integers are distributed in a random fashion in the interval
[−1, 1].
As a good scientist, I now proceed to test this hypothesis. To this end, I do 100 experiments, where the kth experiment
amounts to computing sin(k). Thus, the outcome of the kth experiment is zk = sin(k). For example, here are
z1 , . . . , z10 to three significant figures:
{0.841, 0.909, 0.141, −0.757, −0.959, −0.279, 0.657, 0.989, 0.412, −0.544} .
The question posed in (14.2) here asks the following: How likely is it that the 100 numbers {zk = sin(k)}1≤k≤100 are distributed at random in the interval [−1, 1] with “random” defined using the uniform probability function that has p(x) = 1/2 for all x?
14.2 Testing the mean
To investigate the question that is posed in (14.2), let µ denote the mean for p(x). I remind you that
µ = ∫_a^b x p(x) dx,    (14.3)
where −∞ ≤ a < b ≤ ∞ delineate the portion of R where p(x) is defined. If the probabilities for the measurements
(z1 , . . . , zN ) are determined by our chosen p(x), then you might expect their average,
z = (1/N)(z1 + z2 + · · · + zN)    (14.4)
to be reasonably close to µ. The size of the difference between z and µ sheds light on the question posed in (14.2).
To elaborate, I need to introduce the standard deviation, σ, for the probability function p(x). I remind you that
σ² = ∫_a^b (x − µ)² p(x) dx.    (14.5)
I will momentarily invoke the Central Limit Theorem. To this end, suppose that N is a positive integer. Let
{x1 , . . . , xN } denote the result of N measurements where no one measurement is influenced by any other, and where
the probability of each is actually determined by the proposed function p(x). Let
x ≡ (1/N) Σ_{1≤j≤N} xj    (14.6)
denote the average of the N measurements. The Central Limit Theorem asserts the following: If R ≥ 0, then the
probability that x obeys |x − µ| ≥ R σ/√N is approximately

2 (1/√(2π)) ∫_R^∞ e^{−s²/2} ds    (14.7)
when N is very large. Note that this last expression is less than
2 (1/√(2π)) (1/R) e^{−R²/2},    (14.8)
as can be seen by applying the version of (13.18) with µ = 0, κ = 1 and r = R.
With the preceding understood, I will use the Central Limit Theorem to estimate the probability that x is further from the mean of p(x) than the number z that is depicted in (14.4). This probability is the P-value of z under the hypothesis that p(x) describes the variations in the measurements of the N experiments. The Central Limit Theorem's estimate for this probability is given by the R = (√N/σ)|µ − z| version of (14.7).
If the number obtained by the R = (√N/σ)|µ − z| version of (14.7) is less than 1/20, and N is very large, then the P-value of our experimental mean z is probably “significant” and suggests that our understanding of our experiment is inadequate.
To see how this works in an example, consider the values of sin(x) for x an integer. The N = 100 version of z for the case where zk = sin(k) is −0.0013. Since I propose to use the uniform probability function to determine probabilities on [−1, 1], I should take µ = 0 and σ = 2/√12 = 1/√3 for determining R for use in (14.7). Granted that N = 100, I find that R = (√N/σ)|µ − z| ≈ 0.0225. As this is very much smaller than √2, I can't use (14.8) as an upper bound for the integral in (14.7). However, I have other tricks for estimating (14.7) and find it very close to 0.982. This is a good approximation to the probability that the average of 100 numbers drawn at random from [−1, 1] is further from 0 than 0.0013. In particular, this result doesn't besmirch my hypothesis about the random nature of the values of sin(x) on the integers.
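The numbers in this example are easy to reproduce; a sketch (the last line uses the fact that the integral in (14.7) equals erfc(R/√2)):

```python
import math

N = 100
zs = [math.sin(k) for k in range(1, N + 1)]
zbar = sum(zs) / N
print(round(zbar, 4))   # -0.0013

# Uniform probability on [-1, 1]: mu = 0 and sigma = 2/sqrt(12) = 1/sqrt(3)
sigma = 1.0 / math.sqrt(3.0)
R = math.sqrt(N) * abs(0.0 - zbar) / sigma
print(round(R, 3))      # about 0.022 (the text's 0.0225 uses the rounded zbar)

# For tiny R the probability (14.7) is close to 1
print(round(math.erfc(R / math.sqrt(2.0)), 3))  # 0.982
```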
This is the “average” of f as weighted by the probability p(x). The standard deviation of f is denoted by σf; it is non-negative and defined by setting

σf² = ∫_a^b (f(x) − µf)² p(x) dx.    (14.10)
The standard deviation measures the variation of f from its mean, µf . I assume in what follows that the integrals
in (14.9) and (14.10) are finite in the case that a = −∞ or b = ∞. In general, this need not be the case.
By way of example, when f (x) = x, then µf is the same as the mean, µ, of p that is defined in (14.3). Meanwhile, σf
in the case that f (x) = x is the same as the standard deviation that is defined in (14.5).
Here is another example: Take p(x) to be the Gaussian probability function on the whole real line with µ = 0 and σ = 1. Suppose that k is a positive integer. Then x → x^k is a random variable whose mean is zero when k is odd and equal to k!/(2^{k/2} (k/2)!) when k is even. Meanwhile, the square of the standard deviation for this random variable is (2k)!/(2^k k!) when k is odd, and equal to

(1/2^k) ((2k)!/k! − (k!/((k/2)!))²)

when k is even.
There is no need for you to remember the numbers from these examples, but it is important that you understand how
the mean and standard deviation for a random variable are defined.
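The even moments quoted above can be verified numerically; here is a sketch comparing a midpoint-rule integral of x^k against the closed form k!/(2^{k/2}(k/2)!) for the standard (µ = 0, σ = 1) Gaussian:

```python
import math

def gaussian_moment(k, lim=10.0, steps=200000):
    """Midpoint-rule value of the integral of x^k p(x) for the standard Gaussian."""
    h = 2.0 * lim / steps
    total = 0.0
    for i in range(steps):
        x = -lim + (i + 0.5) * h
        total += (x ** k) * math.exp(-x * x / 2.0)
    return total * h / math.sqrt(2.0 * math.pi)

for k in (2, 4, 6):
    exact = math.factorial(k) / (2 ** (k // 2) * math.factorial(k // 2))
    print(k, gaussian_moment(k), exact)  # 1, 3, and 15 for k = 2, 4, 6
```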
By the way, here is some jargon that you might run into: Suppose that p(x) is a probability function on a part of the real line, and suppose that µ is its mean. Then the average, ∫_a^b (x − µ)^k p(x) dx, for positive integer k is called the kth order moment of p(x).
14.4 The Chebychev and Central Limit Theorems for random variables
The mean and standard deviation for any given random variable are important for their use in the Chebychev Theorem
and the Central Limit Theorem. For example, the Chebychev Theorem in the context of a random variable says the
following:
Chebychev Theorem for Random Variables. Let x → p(x) denote a probability function on a part of the real
line, and let x → f (x) denote a random variable on the domain for p with mean µf and standard deviation σf as
determined by p. Let R be a positive number. Then, R12 is greater than the probability that p assigns to the set of
points where |f (x) − µf | ≥ Rσf .
The proof of this theorem is identical, but for a change of notation, to the proof given previously for the original
version of the Chebychev Theorem.
Here is the context for the random variable version of the Central Limit Theorem: You have a probability function,
x → p(x), that is defined on some or all of the real line. You also have a function, x → f (x), that is defined where
p is. You now interpret p and f in the following way: The possible “states” of a system that you are interested in are
labeled by the points in the domain of p, and the function p characterizes the probability for the system to be in the
state labeled by x. Meanwhile, f (x) is the value of a particular measurement when the system is in the state labeled
by x.
if −∞ ≤ s < r ≤ ∞ and the subset in question is the interval where s < x < r, then the probability that s < f < r is very close when N is large to

∫_s^r (1/(√(2π) σf/√N)) e^{−N(x−µf)²/(2σf²)} dx.
For example, if R ≥ 0, then the probability that x obeys |x − µf| ≥ R σf/√N is approximately

2 · (1/√(2π)) ∫_R^∞ e^{−s²/2} ds    (14.11)
when N is very large.
Here, I write the interval where p is defined as [a, b] with −∞ ≤ a < b ≤ ∞. You might recognize µf as the square
of the standard deviation of p(x). I am calling it by a different name because I am thinking of it as the mean of the
random variable f (x) = (x − µ)2 .
The random variable f (x) = (x − µ)2 can have a standard deviation as well as a mean. According to our definition
in (14.10), this standard deviation has square
σf² = ∫_a^b ((x − µ)² − µf)² p(x) dx = ∫_a^b (x − µ)⁴ p(x) dx − µf².    (14.13)
Put µf and σf away for a moment to bring in our experimental data. Our data gives us the N numbers {(z1 −
µ)2 , (z2 − µ)2 , . . . , (zN − µ)2 }. The plan is to use the random variable version of the Central Limit Theorem to
estimate a P -value. This is the P -value for the average of these N numbers,
Var ≡ (1/N) Σ_{1≤j≤N} (zj − µ)²,    (14.14)
under the assumption that p(x) determines the spread in the numbers {zj }1≤j≤N . I call Var the “experimentally
determined variance”.
According to the Central Limit Theorem, if the spread in the numbers {zj}1≤j≤N is really determined by p(x), then
the P -value of our experimentally determined variance is well approximated by the integral in (14.11) for the case that
R = (√N/σf) |Var − µf|.    (14.15)
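Here is a sketch of the whole variance test on simulated data. The setup is invented for illustration: p is taken to be the uniform probability function on [−1, 1], so µ = 0, µf = E[(x − µ)²] = 1/3, and σf² = E[(x − µ)⁴] − µf² = 1/5 − 1/9:

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is reproducible

N = 10000
mu, mu_f = 0.0, 1.0 / 3.0
sigma_f = math.sqrt(1.0 / 5.0 - 1.0 / 9.0)

zs = [random.uniform(-1.0, 1.0) for _ in range(N)]
var = sum((z - mu) ** 2 for z in zs) / N          # the experimental variance (14.14)
R = math.sqrt(N) * abs(var - mu_f) / sigma_f      # the value (14.15)
p_value = math.erfc(R / math.sqrt(2.0))           # the integral in (14.11)
print(var, R, p_value)
```

Since the data really are drawn from the hypothesized p(x), the printed P-value should usually not be significant.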
Soc. 26: 1–32; reprinted in Experiments in Plant Hybridization, Harvard University Press, Cambridge, MA, 1967).
2 Fisher, R. A., 1936, Has Mendel’s work been rediscovered? Ann. Sci. 1: 115–137.
N µ σ h |h − µ|/σ
round/wrinkled seed 565 188 11 193 0.5
green/yellow seed 519 173 11 166 0.6
colored/white flower 100 33 5 36 0.6
tall/dwarf plant 100 33 5 28 1
constricted pod or not 100 33 5 29 0.8
green/yellow pod #1 100 33 5 40 1.4
terminal flower or not 100 33 5 33 0
yellow/green pod #2 100 33 5 35 0.4
This has the following consequence: The probability of getting n ∈ {0, 1, . . . , 8} out of 8 measurements to lie within 1 standard deviation is given by the N = 8 binomial probability function using q = 0.68. This is p(n) = (8!/(n!(8−n)!)) (0.68)^n (0.32)^{8−n}. Note that the mean of this binomial function is 8(0.68) = 5.44. Thus, the P-value for getting 7 of 8 measurements within 1 standard deviation of the mean is

p(0) + p(1) + p(2) + p(3) + p(7) + p(8).

As it turns out, this sum is ≈ 0.29, thus greater than 0.05.
Looking back on this discussion, note that I invoked binomial probability functions in two different places. The first was to derive the right-most column in the table. This used a q = 1/3 binomial with values of N given by the left-most column of numbers in the table. The value of q came from the hypothesis about the manner in which genes are passed from parent plant to seedling. The second application of the binomial probability function used the q = 0.68 version with N = 8. I used this version to compute a P-value for having 7 of 8 measurements land within 1 standard deviation of the mean. The value 0.68 for q was justified by an appeal to the Central Limit theorem.
Figure 14.1: Temperature Extremes in Boston (NY Times, Sunday, January 4, 2009)
The grey swath in the background is meant to indicate the “normal range”. (Nothing is said about what this means.)
We hear all the time about “global warming”. Does this chart give evidence for warming in Boston during the year 2008? What follows describes what may be an overly simplistic way to tease an answer to this question from the chart. Let us assume that the center of the grey swath on each day represents the mean temperature on the given
day as computed by averaging over some large number of years. This gives us an ordered set of 366 numbers,
{T1, . . . , T366}. Meanwhile, the chart gives us, for each day in 2008, the actual mean temperature on that day, this being half of the sum of the indicated high and low temperatures on that day. This supplies us with a second sequence of 366
numbers, {t1 , . . . , t366 }. Subtracting the first sequence from the second gives us numbers (x1 = t1 − T1 , . . . , x366 =
t366 − T366 ). Take this set of 366 numbers to be the experimentally measured data.
The chart tells us that the average, µreal = (1/366)(x1 + · · · + x366 ), of the experimental data is equal to 0.5. We can
then ask if this average is consistent with the hypothesis that the sequence (x1 , . . . , x366 ) is distributed according to a
Gaussian distribution with mean zero and some as yet unknown standard deviation. Unfortunately, the chart doesn’t
say anything about the standard deviation. This understood, let us proceed with the assumption that the standard
deviation is half the vertical height of the grey swath, which is to say roughly 6◦ Fahrenheit. If we make this
assumption, then we are comparing the number 0.5 with what would be expected using the Central Limit theorem with
input a Gaussian with mean zero and standard deviation 6◦ . The Central Limit theorem asserts that the P -value of the
measured mean 0.5 as computed using N = 366 and this hypothetical Gaussian is ≈ 0.14, which is not significant.
We can also ask whether the variance of the data set (x1 , . . . , x366 ) is consistent with the assumption that these numbers are distributed according to a Gaussian with mean µ = 0 and standard deviation σ = 6◦ . This is to say that we are comparing the experimental variance σ²real = (1/366)(x1² + · · · + x366²) with the average of 366 distances from the mean as computed using the Central Limit theorem for the Gaussian with mean µ = 0 and σ = 6◦ .
14.8 Exercises:
1. Suppose that we expect that the x-coordinate of bacteria in our rectangular petri dish should be any value
between −1 and 1 with equal probability in spite of our having coated the x = 1 wall of the dish with a specific
chemical. We observe the positions of 900 bacteria in our dish and so obtain 900 values, {z1 , . . . , z900 }, for the
x-coordinates.
(a) Suppose the average, z̄ = (1/900) Σ1≤k≤900 zk , is 0.1. Use the Central Limit Theorem to obtain a theoretical upper bound based on our model of a uniform probability function for the probability that an average of 900 x-coordinates differs from 0 by more than 0.1.
(b) Suppose that the average of the squares, Var = (1/900) Σ1≤k≤900 zk² , equals 0.36. Use the Central Limit Theorem and (13.21) to obtain a theoretical upper bound based on our model of a uniform probability function for the probability that an average of the squares of 900 x-coordinates is greater than or equal to 0.36. (Note that I am not asking that it differ by a certain amount from the square of the standard deviation for the uniform probability function. If you compute the latter, you will be wrong by a factor of 2.)
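For reference, the arithmetic that part (a) calls for can be sketched in a few lines of Python; the only outside input is the standard deviation 1/√3 of the uniform probability function on [−1, 1]:

```python
from math import erf, sqrt

sigma = 1 / sqrt(3)            # standard deviation of the uniform function on [-1, 1]
N = 900
sigma_avg = sigma / sqrt(N)    # standard deviation of the average of N draws

z = 0.1 / sigma_avg            # the observed average, measured in units of sigma_avg
prob = 1 - erf(z / sqrt(2))    # Gaussian probability that |average| > 0.1
print(z, prob)
```

The observed average sits more than 5 standard deviations out, so the Gaussian estimate for the probability is extremely small.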
2. Use Stirling’s formula in Equation (11.14) to give approximate formulae for both (2k)!/(k! 2^k) and (1/2^{2k}) (2k)!/(k! k!) when k is large.
3. R. A. Fisher (see also reference2 above) discussed a second criticism of Mendel’s experimental data. This
involved the manner in which a given dominant phenotype plant was classified as being “homozygous dominant”
or “heterozygous dominant”. According to Fisher, Mendel used the following method on any given plant: He
germinated 10 seeds from the plant via self-pollination, and if all 10 of the resulting seedlings had the dominant
phenotype, he then labeled the plant as “homozygous dominant”. If one or more of the 10 seedlings had the
recessive phenotype, he labeled the plant as “heterozygous dominant”. Fisher pointed out that if Mendel really
did the classification in this manner, then he should have mislabeled some heterozygous dominant plants as
homozygous dominant. The following questions walk you through some of Fisher’s arguments.
(a) What is the probability for a heterozygous dominant plant to produce a seedling with the dominant pheno-
type?
(b) What binomial probability function should you use to compute the probability that a heterozygous plant
produces 10 consecutive dominant seedlings.
(c) Use the binomial probability function for (b) to compute the probability of any given heterozygous domi-
nant plant to produce 10 consecutive dominant seedlings.
(d) If any given plant has probability 1/3 to be homozygous dominant and thus probability 2/3 to be heterozygous dominant, what is the probability that Mendel would label any given plant as “homozygous dominant”? (To answer this, you can use conditional probabilities. To this end, suppose that you have a sample space of N plants. Use A to denote the subset of plants that are homozygous dominant, B to denote the subset that are heterozygous dominant, and C to denote the subset that Mendel designates as homozygous dominant.)
(e) Redo the second table in this chapter based on your answer to (c) of this chapter.
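The computations in parts (a)–(d) can be checked numerically. The sketch below assumes the standard Mendelian ratios: a self-pollinated heterozygous plant yields a dominant-phenotype seedling with probability 3/4, and a dominant-phenotype plant is homozygous with probability 1/3:

```python
# (a) P(dominant-phenotype seedling | heterozygous parent)
q = 3 / 4

# (b)/(c) the q = 3/4, N = 10 binomial: probability of 10 dominants in a row
p_ten = q ** 10

# (d) P(labeled "homozygous dominant"): either the plant really is homozygous
# (probability 1/3, and then every seedling is dominant), or it is heterozygous
# (probability 2/3) and all 10 seedlings happen to come out dominant anyway.
p_label = (1 / 3) + (2 / 3) * p_ten
print(round(p_ten, 4), round(p_label, 4))
```

The second number being visibly bigger than 1/3 is Fisher’s point: some heterozygous plants slip through the 10-seedling test.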
4. The 2006 election for the United States Senator in Virginia had the following outcome (according to cnn.com):
The total votes cast for the two candidates was 2,238,111, and the difference in the vote totals was 7,231. This was considered by the press to be an extremely tight election. Was it unreasonably close? Suppose that an election to choose one of two candidates is held with 2,238,111 voters. Suppose, in addition, that each voter casts his or her vote at random. If you answer correctly the questions (b) and (c) below, you will obtain an estimate for the probability as determined by this “vote at random” model that the difference between the two candidates is less than or equal to 7,231.
(a) Set N = 2,238,111. Let S denote the sample space whose elements are sequences of the form
{z1 , . . . , zN } where each zk is either 1 or −1. Use f to denote the random variable on S that is given
by the formula f (z1 , . . . , zN ) = z1 + z2 + · · · + zN . What is the mean and what is the standard deviation
of f ?
(b) Use the Central Limit Theorem to find a Gaussian probability function that can be used to estimate the probability that (1/N) f has value in any given interval on the real line.
(c) Use the Gaussian probability from (b) to compute the probability that |(1/N) f | ≤ 7,231/N . Note that this is the probability (as computed by this Gaussian) for |f | to be less than or equal to 7,231. (You can use a calculator if you like to compute the relevant integral.)
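For what it is worth, here is a numerical sketch of the estimate that parts (a)–(c) produce; it assumes the answer to (a), namely that f has mean 0 and standard deviation √N when each vote is an independent ±1:

```python
from math import erf, sqrt

N = 2238111
sigma = sqrt(N)           # standard deviation of f = z1 + ... + zN; the mean is 0
margin = 7231

z = margin / sigma
prob = erf(z / sqrt(2))   # Gaussian probability that |f| <= margin
print(z, prob)
```

Since the typical size of f in this model is √N ≈ 1,500 votes, the model gives probability very close to 1 for the margin to be no bigger than 7,231; a margin this small is entirely unremarkable under “vote at random”.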
FIFTEEN
Determinants
This chapter is meant to provide some cultural background to the story told in the linear algebra text book about
determinants. As the text explains, the determinant of a square matrix is non-zero if and only if the matrix is invertible.
The fact that the determinant signals invertibility is one of its principal uses. The other is the geometric fact observed
by the text that the absolute value of the determinant of the matrix is the factor by which the linear transformation
expands or contracts n-dimensional volumes. This chapter considers an invertibility question of a biological sort, an
application to a protein folding problem.
Recall that a protein molecule is a long chain of smaller molecules that are tied end to end. Each of these smaller
molecules can be any of twenty so-called amino acids. This long chain appears in a cell folded up on itself in a
complicated fashion. In particular, its interactions with the other molecules in the cell are determined very much by
the particular pattern of folding because any given fold will hide some amino acids on its inside while exhibiting others
on the outside. This said, one would like to be able to predict the fold pattern from knowledge of the amino acid that
occupies each site along the chain.
In all of this, keep in mind that the protein is constructed in the cell by a component known as a “ribosome”, and
this construction puts the chain together starting from one end by sequentially adding amino acids. As the chain is
constructed, most of the growing chain sticks out of the ribosome. If not stabilized by interactions with surrounding
molecules in the cell, a given link in the chain will bend this way or that as soon as it exits the ribosome, and so the
protein would curl and fold even as it is constructed.
To get some feeling for what is involved in predicting the behavior here, make the grossly simplifying assumption
that each of the amino acids in a protein molecule of length N can bend in one of 2 ways, but that the probability of
bending, say in the + direction for the nth amino acid is influenced by the direction of bend of its nearest neighbors,
the amino acids in sites n − 1 and n + 1. One might expect something like this for short protein molecules in as much
as the amino acids have electrical charges on them and so feel an electric force from their neighbors. As this force grows weaker with distance, their nearest neighbors will affect them the most. Of course, once the chain folds back
on itself, a given amino acid might find itself very close to another that is actually some distance away as measured by
walking along the chain.
In any event, let us keep things very simple and suppose that the probability, pn (t), at time t of the nth amino acid being in the + fold position evolves as
pn (t + 1) = an + An,n pn (t) + An,n−1 pn−1 (t) + An,n+1 pn+1 (t). (15.1)
Here, an is some fixed number between 0 and 1, and the numbers {an , An,n , An,n±1 } are constrained so that
0 ≤ an + An,n x + An,n−1 x− + An,n+1 x+ ≤ 1 (15.2)
for any choice between 0 and 1 of values for x, x− and x+ with x + x− + x+ ≤ 1. This constraint is necessary to
guarantee that pn (t + 1) is between 0 and 1 if each of pn (t), pn+1 (t) and pn−1 (t) is between 0 and 1. In this regard,
let me remind you that pn (t + 1) must be between 0 and 1 if it is the probability of something. My convention here
takes both A1,0 and AN,N +1 to be zero.
As for the values of the other coefficients, I will suppose that knowledge of the amino acid type that occupies site n
97
is enough to determine an and Ann , and that knowledge of the respective types that occupy sites n − 1 and n + 1 is
enough to determine An,n−1 and An,n+1 . In this regard, I assume access to a talented biochemist.
Granted these formulæ, the N -component vector p~(t) whose nth component is pn (t) evolves according to the rule:
p~(t + 1) = ~a + A p~(t) (15.3)
where ~a is that N -component vector whose nth entry is an , and where A is that N × N matrix whose only non-zero entries are {An,n , An,n±1 }1≤n≤N .
We might now ask if there exists an equilibrium probability distribution, an N -component vector, p~, with non-negative entries that obeys
p~ = ~a + A p~ (15.4)
If there is such a vector, then we might expect its entries to give us the probabilities for the bending directions of the
various links in the chain for the protein. From this, one might hope to compute the most likely fold pattern for the
protein.
To analyze (15.4), let us rewrite it as the equation
(I − A) p~ = ~a, (15.5)
where I here denotes the identity matrix; this is the matrix whose only non-zero entries lie on the diagonal, with all of the latter equal to 1. We know that there is some solution, p~, when det(I − A) ≠ 0. It is also unique in this case. Indeed, were there two solutions, p~ and p~′ , then
(I − A)(p~ − p~′ ) = 0. (15.6)
This implies (I − A) has a kernel, which is forbidden when I − A is invertible. Thus, to understand this version of the
protein folding problem, we need to consider whether the matrix I − A is invertible. As remarked at the outset, this is
the case if and only if it has non-zero determinant.
By the way, we must also confirm that the solution, p~, to (15.5) has its entries between 0 and 1 so as to use the entries
as probabilities.
In any event, to give an explicit example, consider the 3 × 3 case. Since A is tridiagonal, the matrix I − A is
1 − A11    −A12       0
−A21       1 − A22    −A23       (15.7)
0          −A32       1 − A33
Using the formulae from Chapter 6 of the linear algebra text book, its determinant is found to be
det(I − A) = (1 − A11 )(1 − A22 )(1 − A33 ) − (1 − A11 )A23 A32 − (1 − A33 )A12 A21 . (15.8)
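As a sanity check on (15.8), the following sketch compares the formula against a direct cofactor expansion for an arbitrarily chosen tridiagonal example; the numerical entries are hypothetical and carry no biochemical meaning:

```python
# Hypothetical tridiagonal coefficients A_{nm}; entries (1,3) and (3,1) are zero.
A = [[0.2, 0.1, 0.0],
     [0.1, 0.2, 0.1],
     [0.0, 0.1, 0.2]]

# The matrix M = I - A
M = [[(1.0 if i == j else 0.0) - A[i][j] for j in range(3)] for i in range(3)]

# Cofactor expansion of a general 3 x 3 determinant
def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# Formula (15.8), specialized to the tridiagonal case
formula = ((1 - A[0][0]) * (1 - A[1][1]) * (1 - A[2][2])
           - (1 - A[0][0]) * A[1][2] * A[2][1]
           - (1 - A[2][2]) * A[0][1] * A[1][0])
assert abs(det3(M) - formula) < 1e-12
print(formula)
```

Since the determinant here is non-zero, the equilibrium equation (15.5) has a unique solution for this choice of coefficients.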
SIXTEEN
Eigenvalues in biology
My goal in this chapter is to illustrate how eigenvalues can appear in problems from biology.
where P (n red survivors | m red parents) is the conditional probability that there are n red cells at the end of a cycle
given that there were m such cells the end of the previous cycle. If the ambient environmental conditions don’t change,
one would expect that these conditional probabilities are independent of time. As I explain below, they are, in fact,
computable from what we are given about this problem. In any event, let me use the shorthand A to denote the square
matrix of size N + 1 whose entry in row n and column m is P (n red survivors | m red parents). In this regard, note that n and m run from 0 to N , not from 1 to N + 1. Let p~(t) denote the vector in RN +1 whose nth entry is pn (t).
Then (16.1) reads
p~(t) = A p~(t − 1). (16.2)
This last equation would be easy to solve if we knew that p~(0) was an eigenvector of the matrix A. That is, if it were
the case that
A~p(0) = λ~ p(0) with λ some real number. (16.3)
Indeed, were this the case, then the t = 1 version of (16.2) would read p~(1) = λ p~(0). We could then use the t = 2 version of (16.2) to compute p~(2) and we would find that p~(2) = λ A p~(0) = λ² p~(0). Continuing in this vein, we would find that p~(t) = λ^t p~(0) and our problem would be solved.
Now, it is unlikely that p~(0) is going to be an eigenvector. However, even if p~(0) is not an eigenvector, it may be a linear combination of eigenvectors, say p~(0) = c1 ~e1 + c2 ~e2 + · · · with each ~ek an eigenvector of A whose eigenvalue is λk . In that case the t = 1 version of (16.2) gives p~(1) = c1 λ1 ~e1 + c2 λ2 ~e2 + · · · . We could then plug this into the t = 2 version of (16.2) to find that p~(2) = c1 λ1² ~e1 + c2 λ2² ~e2 + · · · ; and, iterating, p~(t) = c1 λ1^t ~e1 + c2 λ2^t ~e2 + · · · .
The last line above asserts that the probability is 1 of there being some number, either 0 or 1 or 2 or . . . or N of red
cells in the subsequent generation given m red cells in the initial generation.
A square matrix with this property is called a transition matrix, or sometimes a Markov matrix. When A is such a
matrix, the equation in (16.2) is called a Markov process.
Although we are interested in the eigenvalues of A, it is amusing to note that the transpose matrix, AT , has an
eigenvalue equal to 1 with the corresponding eigenvector being proportional to the vector, ~a, with each entry equal to
1. Indeed, the entry in the mth row and nth column of AT is Anm , this the entry of A in the nth row and mth column.
Thus, the mth entry of AT ~a is
If the lower line in (16.10) is used and if each ak is 1, then each entry of AT ~a is also 1. Thus, AT ~a = ~a.
As a “cultural” aside, what follows is the story on Anm in the example from Section 16.1. First, Anm = 0 if n is larger than 2m since m parent cells can spawn at most 2m survivors. For n ≤ 2m, consider that you have 2N cells of which 2m are red and you ask for the probability that a choice of N from the 2N cells results in n red ones. This is a counting problem that is much like those discussed in Section 11.3 although more complicated. The answer here is:
I can now introduce the vector p~(z) in Rn whose kth component is pk (z), and also the square matrix A with the
components Ak,j . Then (16.11) is the equation
p~(z) = A p~(z − 1). (16.12)
Note that as in the previous case, the matrix A is a Markov matrix. This is to say that each Akj is non-negative because
they are conditional probabilities; and
with corresponding eigenvalues λ1 = 1, λ2 = 1/4 and λ3 = 1/4. Note that the vectors in (16.15) are mutually orthogonal
and have norm 1, so they define an orthonormal basis for R3 . Thus, any given z version of p~(z) can be written as a
linear combination of the vectors in (16.15). Doing so, we write
where each ck (z) is some real number. In this regard, there is one point to make straight away, which is that c1 (z) must equal 1/√3 when the entries of p~(z) represent probabilities that sum to 1. To explain, keep in mind that the basis in (16.15) is an orthonormal basis, and this implies that ~e1 · p~ = c1 (z). However, since each entry of ~e1 is equal to 1/√3, this dot product is (1/√3)(p1 (z) + p2 (z) + p3 (z)) and so equals 1/√3 when p1 + p2 + p3 = 1 as is the case when these coefficients are probabilities.
In any event, if you plug the expression in (16.16) into the left side of (16.12) and use the analogous z − 1 version on
the right side, you will find that the resulting equation holds if and only if the coefficients obey
c1 (z) = c1 (z − 1), c2 (z) = (1/4) c2 (z − 1) and c3 (z) = (1/4) c3 (z − 1). (16.17)
Note that the equality here between c1 (z) and c1 (z − 1) is heartening in as much as both of them are supposed to equal 1/√3. Anyway, continue by iterating (16.17) by writing the z − 1 versions of ck in terms of the z − 2 versions, then the latter in terms of the z − 3 versions, and so on until you obtain
c1 (z) = c1 (0), c2 (z) = (1/4)^z c2 (0) and c3 (z) = (1/4)^z c3 (0). (16.18)
By the way, take note of how the probabilities for the three possible fold directions come closer and closer to being
equal as z increases even if the initial z = 0 probabilities were drastically skewed to favor one or the other of the three
directions. For example, suppose that
p~(0) = (1, 0, 0)^T . (16.20)
As a result, c1 (0) = 1/√3, c2 (0) = −1/√2 and c3 (0) = 1/√6. Thus, we have
p~(z) = (1/3)(1, 1, 1)^T + (1/4)^z (2/3, −1/3, −1/3)^T . (16.21)
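The matrix (16.14) is not reproduced just above, but, assuming it is the one determined by the stated orthonormal eigenvectors and eigenvalues 1, 1/4, 1/4 (which works out to a matrix with 1/2 on the diagonal and 1/4 off it), the closed form (16.21) can be verified by direct iteration:

```python
# Assumed reconstruction of the Markov matrix: eigenvalue 1 on (1,1,1)/sqrt(3)
# and eigenvalue 1/4 on the orthogonal complement gives A = I/4 + J/4,
# where J is the all-ones matrix.
A = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]

def step(p):
    # one application of p(z) = A p(z - 1)
    return [sum(A[i][j] * p[j] for j in range(3)) for i in range(3)]

def closed_form(z):
    # Equation (16.21): p(z) = (1/3)(1,1,1) + (1/4)^z (2/3, -1/3, -1/3)
    return [1/3 + (0.25 ** z) * c for c in (2/3, -1/3, -1/3)]

p = [1.0, 0.0, 0.0]          # p(0) from (16.20)
for z in range(1, 6):
    p = step(p)
    assert all(abs(a - b) < 1e-12 for a, b in zip(p, closed_form(z)))
print(p)
```

Already after five steps the three probabilities are nearly equal, illustrating the remark about the initial skew washing out.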
16.4 Exercises:
1. Multiply the matrix A in (16.14) against the vector p~(z) in (16.19) and verify that the result is equal to p~(z + 1)
as defined by replacing z by z + 1 in (16.19).
2. Let A denote the 2 × 2 matrix whose first row is (1/3, 2/3) and whose second row is (2/3, 1/3). The vectors ~e1 = (1/√2)(1, 1)^T and ~e2 = (1/√2)(−1, 1)^T are eigenvectors of A.
(a) Compute the eigenvalues corresponding to ~e1 and ~e2 .
(b) Suppose z ∈ {0, 1, . . . } and p~(z + 1) = A p~(z) for all z. Find p~(z) if p~(0) = (1/4, 3/4)^T .
(c) Find p~(z) if p~(0) = (0, 1)^T .
(d) Find p~(z) in the case that p~(z + 1) = (1, −1)^T + A p~(z) for all z and p~(0) = (0, 1)^T .
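If you want to check the eigenvector claims before starting part (a), a few lines of Python will do it; the entries below assume the 2 × 2 matrix has 1/3 on the diagonal and 2/3 off the diagonal:

```python
A = [[1/3, 2/3],
     [2/3, 1/3]]

def apply(M, v):
    return [sum(M[i][j] * v[j] for j in range(2)) for i in range(2)]

e1 = [1.0, 1.0]     # the direction of (1, 1)/sqrt(2)
e2 = [-1.0, 1.0]    # the direction of (-1, 1)/sqrt(2)

# A e1 should be a multiple of e1, and likewise for e2
w1, w2 = apply(A, e1), apply(A, e2)
lam1 = w1[0] / e1[0]
lam2 = w2[0] / e2[0]
assert all(abs(w - lam1 * v) < 1e-12 for v, w in zip(e1, w1))
assert all(abs(w - lam2 * v) < 1e-12 for v, w in zip(e2, w2))
print(lam1, lam2)
```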
3. Suppose that you have a model to explain your data that predicts the probability of a certain measurement having
any prescribed value. Suppose that this probability function has mean 1 and standard deviation 2.
(a) Give an upper bound for the probability of a measurement being greater than 5.
(b) Suppose that you average some very large number, N , of measurements that are taken in unrelated, but
identical versions of the same experimental set up. Write down a Gaussian probability function that you
can use to estimate the probability that the value of this average is greater than 5. In particular, give a
numerical estimate using this Gaussian function for N = 100.
(c) Let us agree that you will throw out your proposed model if it predicts that the probability for finding an average value that is greater than your measured average for 100 measurements is less than 1/20. If your measured average is 1.6 for 100 experiments, should you junk your model?
(d) Let k ≥ 5 and let z → fk (z) = z^k (1 − z)^{100−k} . Show that fk is an increasing function of z when z < 1/20. (Hint: Take its derivative with respect to z.)
(e) Since p ≤ 0.0015 ≤ 0.05 = 1/20, use the result from part (d) to conclude that the probability of finding 5 or more of the xk ’s out of 100 with xk > 3 is less than
Σ_{k=5}^{100} (100!/(k! (100 − k)!)) (0.0015)^k (0.9985)^{100−k} .
(f) We saw in Equation (11.19) of Chapter 11 that the terms in the preceding sum get ever smaller as k increases. Use a calculator to show that the k = 5 term and thus all higher k terms are smaller than 5 × 10−7 . Since there are fewer than 100 of these terms, prove that the sum of these terms is no greater than 0.00005.
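The claimed bound on the k = 5 term is easy to confirm numerically (a sketch; the value 0.0015 is the Gaussian estimate quoted in the exercise):

```python
from math import comb

p, n = 0.0015, 100

# The k = 5 term of the binomial sum
term5 = comb(n, 5) * p**5 * (1 - p)**(n - 5)

# The whole tail, k = 5 through 100
tail = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(5, n + 1))
print(term5, tail)
```

The tail sum comes out far below the crude 0.00005 estimate obtained by bounding every term by the k = 5 term.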
SEVENTEEN
The notion of a Markov matrix was introduced in Chapter 16. By way of a reminder, this is an N × N matrix A that obeys the following conditions:
Each entry Ajk is non-negative; and, for each k, the entries in the kth column sum to 1: A1k + A2k + · · · + AN k = 1. (17.1)
These two conditions are sufficient (and also necessary) for the interpretation of the components of A as conditional
probabilities. To this end, imagine a system with N possible states, labeled by the integers starting at 1. Then Ajk can
represent the conditional probability that the system is in state j at time t if it is in state k at time t − 1. In particular,
one sees Markov matrices arising in a dynamical system where the probabilities for the various states of the system at
any given time are represented by an N -component vector, p~(t), that evolves in time according to the formula
p~(t) = A p~(t − 1). (17.2)
All entries are non-negative, and the entries in each column sum to 1. Note that there is no constraint for the sum of
the entries in any given row.
A question that is often raised in the general context of (17.1) and (17.2) is whether the system has an equilibrium probability function, thus some non-zero vector p~∗ , with non-negative entries that sum to 1, and that obeys A p~∗ = p~∗ .
If so, there is the associated question of whether the t → ∞ limit of p~(t) must necessarily converge to the equilibrium
p~∗ .
Here are some facts that allow us to answer these questions.
I elaborate on these points in Section 17.2, below. Accept them for the time being.
The point is that in the decomposition
p~(0) = ~e1 + c2 ~e2 + · · · + cn ~en , (17.5)
the coefficient in front of ~e1 is necessarily equal to 1. The coefficient, ck , in front of any k ≥ 2 version of ~ek is not so constrained.
Here is why (17.5) must hold: In general, I can write p~(0) = c1 ~e1 + Σ2≤k≤n ck ~ek where c1 , c2 , . . . , cn are real numbers because I am assuming that A is diagonalizable, and the eigenvectors of any diagonalizable matrix comprise a basis. It
follows from this representation of p~(0) that the sum of its entries is obtained by adding the following numbers: First,
c1 times the sum of the entries of ~e1 , then c2 times the sum of the entries ~e2 , then c3 times the sum of the entries of
~e3 , and so on. However, the last point of (17.4) asserts that the sum of the entries of each k ≥ 2 version of ~ek is zero.
This means that the sum of the entries of p~(0) is c1 times the sum of the entries of ~e1 . Since the sum of the entries of
~e1 is 1 and since this is also the sum of the entries of p~(0), so c1 must equal 1. This is what is asserted by (17.5).
By way of an example, consider the 2 × 2 case where
A =
1/4   1/2
3/4   1/2      (17.6)
The eigenvalues in this case are 1 and −1/4 and the associated eigenvectors ~e1 and ~e2 in this case can be taken to be
~e1 = (2/5, 3/5)^T and ~e2 = (1, −1)^T . (17.7)
Thus, as t → ∞, we see that in the case where each entry of A is positive, the limit of p~(t) is the equilibrium eigenvector ~e1 . Note in particular that p~(t) at large t is very nearly ~e1 . Thus, if you are interested only in the large t behavior of p~(t), you need only find one eigenvector!
In our 2 × 2 example,
p~(t) = (2/5 + (−1/4)^t c2 , 3/5 − (−1/4)^t c2 )^T . (17.11)
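Iterating (17.2) with the matrix from (17.6) shows this convergence numerically; the sketch below starts from an arbitrary probability vector:

```python
# The matrix from (17.6): first row (1/4, 1/2), second row (3/4, 1/2)
A = [[0.25, 0.50],
     [0.75, 0.50]]

p = [1.0, 0.0]                 # any starting probability vector will do
for _ in range(25):
    p = [A[0][0] * p[0] + A[0][1] * p[1],
         A[1][0] * p[0] + A[1][1] * p[1]]

# The iterates converge to the equilibrium eigenvector e1 = (2/5, 3/5),
# since the other eigenvalue, -1/4, has absolute value less than 1.
assert abs(p[0] - 0.4) < 1e-10 and abs(p[1] - 0.6) < 1e-10
print(p)
```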
In the example provided by (17.3), the matrix has eigenvalues 1, 0 and − 14 . If I use ~e2 for the eigenvector with
eigenvalue 0 and ~e3 for the eigenvector with eigenvalue − 41 , then the solution with p~(0) = ~e1 + c2~e2 + c3~e3 is
p~(t) = ~e1 + c3 (−1/4)^t ~e3 for t > 0. (17.12)
As I remarked above, I need only find ~e1 to discern the large t behavior of p~(t); and in the example using the matrix in (17.3),
~e1 = (4/15, 7/15, 4/15)^T . (17.13)
Point 1: As noted, an equilibrium vector for A is a vector, p~, that obeys A~ p = p~. Thus, it is an eigenvector of A
with eigenvalue 1. Of course, we have also imposed other conditions, such as its entries must be non-negative and
they should sum to 1. Even so, the first item to note is that A does indeed have an eigenvector with eigenvalue 1. To
see why, observe that the vector w~ with entries all equal to 1 obeys
AT w~ = w~ (17.14)
by virtue of the second condition in (17.1). Indeed, if k is any given integer, then the kth entry of AT w~ is A1k + A2k + · · · + AN k and this sum is assumed to be equal to 1, which is the kth entry of w~ .
This point about AT is relevant since det(AT − λI) = det(A − λI) for any real number λ. Because AT − I is not
invertible, det(AT − I) is zero. Thus, det(A − I) is also zero and so A − I is not invertible. Thus it has a positive
dimensional kernel. Any non-zero vector in this kernel is an eigenvector for A with eigenvalue 1.
Point 3: I know from Point 1 that there is at least one non-zero eigenvector of A with eigenvalue 1. I also know,
this from Point 2, that either all of its entries are negative or else all are positive. If all are negative, I can multiply it
by −1 so as to obtain a new eigenvector of A with eigenvalue 1 that has all positive entries. Let r denote the sum of
the entries of the latter vector. If I now multiply this vector by 1r , I get an eigenvector whose entries are all positive
and sum to 1.
Point 4: To prove this point, let me assume, contrary to the assertion, that there are two non-zero vectors in the
kernel of A − I and one is not a multiple of the other. Let me call them ~v and ~u. As just explained, I can arrange that
both have only positive entries and that their entries sum to 1, this by multiplying each by an appropriate real number.
Now, if ~v is not equal to ~u, then some entry of one must differ from some entry of the other. For the sake of argument,
suppose that v1 < u1 . Since the entries sum to 1, this then means that some other entry of ~v must be greater than the
corresponding entry of ~u. For the sake of argument, suppose that v2 > u2 . As a consequence the vector ~v − ~u has
negative first entry and positive second entry. It is also in the kernel of A − I. But, these conclusions are untenable
since I already know that every vector in the kernel of A − I has either all positive entries or all negative ones. The
only escape from this logical nightmare is to conclude that ~v and ~u are equal.
This then demonstrates two things: First, there is a unique vector in the kernel of A − I whose entries are positive and
sum to 1. Second, any one vector in kernel A − I is a scalar multiple of any other and so this kernel has dimension 1.
λ (v1 + · · · + vn ) = (A11 + · · · + An1 )v1 + (A12 + · · · + An2 )v2 + · · · + (A1n + · · · + Ann )vn . (17.20)
λ(v1 + · · · + vn ) = v1 + · · · + vn . (17.21)
Thus, either λ = 1 or else the entries of ~v sum to zero. This is what is asserted by the second sentence of the final
point in (17.4).
Now suppose that λ > 1. In this case, the argument that I used in the discussion above for Point 2 can be reapplied
with only minor modifications to produce the ridiculous conclusion that something negative is equal to something
positive. To see how this works, remark that the conclusion that ~v ’s entries sum to zero implies that ~v has at least one negative entry and at least one positive one. For example, suppose that the first entry of ~v is negative and the rest are either zero or positive with at least one positive. Since ~v is an eigenvector with eigenvalue λ, we have
λ v1 = A11 v1 + A12 v2 + · · · + A1n vn , (17.22)
and thus
(λ − A11 )v1 = A12 v2 + · · · + A1n vn . (17.23)
Note that this last equation is the analog in the λ > 1 case of (17.15). Since λ > 1 and A11 < 1, the left-hand
side of (17.23) is negative. Meanwhile, the right-hand side is positive since each A1k that appears here is positive and
since at least one k ≥ 2 version of vk is positive.
Equation (17.19) has its λ > 1 analog too, this where the (1 − A11 − A21 ) is replaced by (λ − A11 − A21 ) and where (1 − A12 − A22 ) is replaced by (λ − A12 − A22 ). The general case where ~v has some m < n negative entries and the rest zero or positive is ruled out by these same sorts of arguments.
Consider now the case where λ ≤ −1. I can rule this out using the trick of introducing the matrix A2 = A · A. This is
done in three steps.
Step 1: If A obeys (17.1) then so does A2 . If all entries of A are positive, then this is also the case for A2 . To see
that all of this is true, note that the first point in (17.1) holds since each entry of A2 is a sum of products of the entries
of A and each of the latter is positive. As for the second point in (17.1), note that
Σ1≤m≤n (A²)mk = Σ1≤m≤n Σ1≤j≤n Amj Ajk . (17.24)
Now switch the orders of summing so as to make the right-hand side read
Σ1≤m≤n (A²)mk = Σ1≤j≤n ( Σ1≤m≤n Amj ) Ajk . (17.25)
The sum inside the parentheses is 1 for each j because A obeys the second point in (17.1). Thus, the right-hand side of (17.24) is equal to
Σ1≤j≤n Ajk , (17.26)
and such a sum is equal to 1, again due to the second point in (17.1).
Step 3: To see that −1 is not an eigenvalue for A, remember that if ~v were an eigenvector with this eigenvalue,
then its entries would sum to zero. But ~v would also be an eigenvector of A2 with eigenvalue 1 and we know that
the entries of any such eigenvector must either be all positive or all negative. Thus, A can’t have −1 as an eigenvalue
either.
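Step 1 can be spot-checked on a randomly generated example:

```python
import random

random.seed(0)
n = 4
# Build a random n x n Markov matrix column by column:
# positive entries, with each column summing to 1.
A = [[0.0] * n for _ in range(n)]
for k in range(n):
    w = [random.random() + 0.01 for _ in range(n)]
    s = sum(w)
    for m in range(n):
        A[m][k] = w[m] / s

# Square it
A2 = [[sum(A[m][j] * A[j][k] for j in range(n)) for k in range(n)]
      for m in range(n)]

# A^2 has positive entries and columns that sum to 1, so it is again Markov
assert all(A2[m][k] > 0 for m in range(n) for k in range(n))
assert all(abs(sum(A2[m][k] for m in range(n)) - 1.0) < 1e-12 for k in range(n))
print("A squared is again a Markov matrix")
```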
17.3 Exercises:
1. The purpose of this exercise is to walk you through the argument for the second point in (17.4). To start, assume that ~v obeys A~v = ~v and that the first k < n entries of ~v are negative or zero and the rest either zero or positive with at least one positive.
(a) Add the first k entries of the vector A~v and write the resulting equation asserting that the latter sum is
equal to that of the first k entries of ~v . In the case k = 2, this is (17.18).
(b) Rewrite the equation that you got from (a) so that all terms that involve v1 , v2 , . . ., and vk are on the
left-hand side and all terms that involve vk+1 , . . ., vn are on the right-hand side. In the case k = 2, this
is (17.19).
(c) Explain why the left-hand side of the equation that you get in (b) is negative or zero while the right-hand
side is positive.
(d) Explain why the results from (c) forces you to conclude that every eigenvector of A with eigenvalue 1 has
entries that are either all positive or all negative.
2. (a) Consider the matrix
A =
2/3   a    b
a     2/3  c
b     c    2/3
Find all possible values for a, b and c that make this a Markov matrix.
(b) Find the eigenvector for A with eigenvalue 1 with positive entries that sum to 1.
(c) As a check on your work in (a), prove that your values of a, b, c are such that A also has eigenvalue 1/2. Find two linearly independent eigenvectors for this eigenvalue.
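For a quick numerical check of one consistent choice (whether it is the only one is the point of part (a)), take a = b = c = 1/6; the sketch below verifies the Markov conditions and the two eigenvalue claims for that choice:

```python
a = b = c = 1 / 6
A = [[2/3, a, b],
     [a, 2/3, c],
     [b, c, 2/3]]

def apply(M, v):
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

# Markov conditions: non-negative entries, columns summing to 1
assert all(A[i][j] >= 0 for i in range(3) for j in range(3))
assert all(abs(sum(A[i][j] for i in range(3)) - 1.0) < 1e-12 for j in range(3))

# (1/3, 1/3, 1/3) is fixed by A, while (1, -1, 0) is scaled by 1/2
p = [1/3, 1/3, 1/3]
assert all(abs(w - v) < 1e-12 for v, w in zip(p, apply(A, p)))
v = [1.0, -1.0, 0.0]
assert all(abs(w - 0.5 * x) < 1e-12 for x, w in zip(v, apply(A, v)))
print("a = b = c = 1/6 passes the checks")
```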
3. This problem plays around with some of our probability functions.
(a) The exponential probability function is defined on the half line [0, ∞). The version with mean µ is the function x → p(x) = (1/µ) e^{−x/µ} . The standard deviation is also µ. If R > 0, what is the probability that x ≥ (R + 1)µ?
(b) Let Q(R) denote the function of R you just derived in (a). We know a priori that Q(R) is no greater than 1/R² and so R² Q(R) ≤ 1. What value of R maximizes the function R → R² Q(R), and give the value of R² Q(R) to two decimal places at this maximum. You can use a calculator for this last part.
(c) Let p(x) denote the Gaussian function with mean zero and standard deviation σ. Thus, p(x) = (1/(√(2π) σ)) e^{−x²/(2σ²)} . We saw in (13.21) that the probability, P(R), that x is greater than Rσ is less than (1/(√(2π) R)) e^{−R²/2} . We also know from the Chebychev Theorem that P(R) ≤ 1/R² . The ratio of the first of these upper bounds to the second is (R/√(2π)) e^{−R²/2} . At what value of R is this ratio at its largest value? Use a calculator to write this largest value.
(d) Let L > 0 and let x → p(x) denote the uniform probability function on the interval where −L ≤ x ≤ L. This probability has mean 0 and standard deviation L/√3. Suppose that R is larger than 1 but smaller than √3. What is the probability that x has distance RL/√3 or more from the origin?
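The two maximizations in (b) and (c) can be checked on a crude grid. The sketch below takes Q(R) = e^{−(R+1)} as the answer to (a) under the normalization (1/µ)e^{−x/µ}, and takes the ratio in (c) to be (R/√(2π)) e^{−R²/2}; both of those inputs are my reading of the exercise, not something computed here:

```python
from math import exp, sqrt, pi

Rs = [i / 1000 for i in range(1, 10001)]        # grid from 0.001 to 10

# (b): Q(R) = exp(-(R + 1)), so maximize R^2 exp(-(R + 1))
fb = lambda R: R * R * exp(-(R + 1))
Rb = max(Rs, key=fb)

# (c): ratio of the Gaussian tail bound to the Chebychev bound
fc = lambda R: (R / sqrt(2 * pi)) * exp(-R * R / 2)
Rc = max(Rs, key=fc)

print(Rb, round(fb(Rb), 2), Rc, round(fc(Rc), 2))
```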
EIGHTEEN
The previous chapter analyzed Markov matrices in some detail, but left open the question as to whether such a matrix
can have complex eigenvalues. My purpose here is to explain that such can be the case. I will then describe some of
their properties.
• A matrix with real entries has an even number of distinct, complex eigenvalues since any given complex eigen-
value must be accompanied by its complex conjugate.
• There are at most 2 eigenvalues for a 2 × 2 matrix: Either it has two real eigenvalues or one real eigenvalue with
algebraic multiplicity 2, or two complex eigenvalues, one the complex conjugate of the other.
• The number 1 is always an eigenvalue of a Markov matrix.
If you compute the characteristic polynomial, P(λ) = det(A − λI), you will find that it is equal to
P(λ) = − (λ³ − (3/2) λ² + (3/4) λ − 1/4) . (18.2)
Ordinarily, I would be at a loss to factor a generic cubic polynomial, but in this case, I know that 1 is a root, so I know
that λ − 1 divides P(λ) to give a quadratic polynomial. I can do this division and I find that
P(λ) = − (λ − 1) (λ2 − (1/2) λ + 1/4).    (18.3)
The roots of the quadratic polynomial λ → λ2 − (1/2) λ + 1/4 are roots of P. The roots of the quadratic polynomial can
be found (using the usual formula) to be

(1/2 ± √(1/4 − 1)) / 2 = 1/4 ± (√3/4) i.    (18.4)
You might complain that the matrix A here has some entries equal to zero, and it would be more impressive to see an
example where all entries of A are positive. If this is your attitude, then consider the Markov matrix

A = [ 1/2   1/16  7/16
      7/16  1/2   1/16     (18.5)
      1/16  7/16  1/2  ]

whose characteristic polynomial is − (λ3 − (3/2) λ2 + (171/256) λ − 43/256). The roots of the latter are 1 and 1/4 ± (3√3/16) i.
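A quick numerical check of this example is easy to do. The following sketch (using NumPy, which is not part of these notes) builds the matrix in (18.5) and confirms that its eigenvalues are 1 and 1/4 ± (3√3/16) i:

```python
import numpy as np

# The all-positive Markov matrix from (18.5); each column sums to 1.
A = np.array([[1/2,  1/16, 7/16],
              [7/16, 1/2,  1/16],
              [1/16, 7/16, 1/2 ]])

eigenvalues = np.sort_complex(np.linalg.eigvals(A))
# Expected roots of the characteristic polynomial: 1 and 1/4 +/- (3*sqrt(3)/16) i.
expected = np.sort_complex(np.array([1.0,
                                     1/4 + (3*np.sqrt(3)/16)*1j,
                                     1/4 - (3*np.sqrt(3)/16)*1j]))
print(np.allclose(eigenvalues, expected))   # → True
```

Since A has real entries, the two complex eigenvalues appear as a conjugate pair, in line with the first bullet point above.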
To see how this works in the general case, let’s again use A to denote our Markov matrix with all Ajk > 0. If λ is
a complex eigenvalue for A, then it must be a complex eigenvalue for AT . Let ~v denote a corresponding complex
eigenvector; thus AT ~v = λ~v . In terms of components, this says that

A1k v1 + A2k v2 + · · · + Ank vn = λ vk    (18.6)

for any k ∈ {1, 2, . . . , n}. Taking absolute values of both sides in (18.6) finds the inequality

A1k |v1 | + A2k |v2 | + · · · + Ank |vn | ≥ |λ| |vk |.    (18.7)
Here, I have used two facts about absolute values: First, the absolute value of λvk is the product of |λ| and |vk |. Indeed,
if a and b are any two complex numbers, then |ab| = |a| |b| which you can see by writing both a and b in polar form.
Thus, write a = reiθ and b = seiϕ with s and r non-negative. Then ab = rsei(θ+ϕ) and so the absolute value of ab is
rs which is also |a| |b|. Meanwhile, I used the fact that |a + b| ≤ |a| + |b| in an iterated fashion to obtain

|a1 + a2 + · · · + an | ≤ |a1 | + |a2 | + · · · + |an |    (18.8)

to deduce that the expression on the left side of (18.7) is no less than that on the right side. By the way, the fact that
|a + b| ≤ |a| + |b| holds for complex numbers is another way to say that the sum of the lengths of any two sides to a
triangle is no less than the length of the third side.
Consider the inequality depicted in (18.7) in the case that k is chosen so that

|vk | ≥ |vj | for each j ∈ {1, 2, . . . , n}.    (18.9)

Thus, vk has the largest absolute value of any entry of ~v . In this case, each |vj | that appears on the left side of (18.7) is
no larger than |vk |, so the left-hand side is even larger if each |vj | is replaced by |vk |. This done, then (18.7) finds that

(A1k + A2k + · · · + Ank ) |vk | ≥ |λ| |vk |.    (18.10)
Since A1k + A2k + · · · + Ank = 1, this last expression finds that |vk | ≥ |λ| |vk | and so 1 ≥ |λ|.
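The conclusion 1 ≥ |λ| is easy to sanity-check numerically. Here is a hedged sketch: build a random column-stochastic matrix with strictly positive entries and confirm that the eigenvalue of largest modulus is 1, while all the others have modulus strictly less than 1 (the size 5 and the seed are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random 5 x 5 Markov matrix with all entries positive:
# fill with positive numbers, then normalize every column to sum to 1.
M = rng.random((5, 5)) + 0.1
M /= M.sum(axis=0)

moduli = np.sort(np.abs(np.linalg.eigvals(M)))
print(moduli)
# The largest modulus is the eigenvalue 1; the rest are strictly smaller.
```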
Now, to see that |λ| is actually less than 1, let us see what is required if every one of the inequalities that were used to
go from (18.6) to (18.7) and from (18.7) to (18.10) are equalities. Indeed, if any one of them is a strict inequality, then
1 > |λ| is the result. Let’s work this task backwards: To go from (18.7) to (18.10) with equality requires that each
~p(t) = A ~p(t − 1) and so ~p(t) = At ~p(0). (18.11)
I gave an example from genetics of such a Markov chain in Chapter 15. What follows is a hypothetical example from
biochemistry.
There is a molecule, much like DNA, that plays a fundamental role in cell biology; it is called RNA.
Whereas DNA is composed of two strands intertwined as a double helix, a typical RNA molecule has just one long
strand, usually folded in a complicated fashion, that is composed of standard segments linked end to end. As with
DNA, each segment can be one of four kinds; the ones that occur in RNA are denoted as G, C, A and U. There are
myriad cellular roles for RNA and the study of these is arguably one of the hottest items these days in cell biology.
In any event, imagine that as you are analyzing the constituent molecules in a cell, you come across a long strand of
RNA and wonder if the sequence of segments, say AGACUA· · · , is “random” or not.
To study this question, you should know that a typical RNA strand is constructed by sequentially adding segments
from one end. Your talented biochemist friend has done some experiments and determined that in a test tube (in vitro,
as they say), the probability of using one of A, G, C, or U for the tth segment depends on which of A, C, G or U has
been used for the (t−1)st segment. This is to say that if we label A as 1, G as 2, C as 3 and U as 4, then the probability,
pj (t), of seeing the segment of the kind labeled j ∈ {1, 2, 3, 4} in the tth segment is given by

pj (t) = Aj1 p1 (t − 1) + Aj2 p2 (t − 1) + Aj3 p3 (t − 1) + Aj4 p4 (t − 1),    (18.12)

where Ajk denotes the conditional probability of a given segment being of the kind labeled by j if the previous segment
is of the kind labeled by k. For example, if your biochemist friend finds no bias toward one or the other base, then one
would expect that each Ajk has value 1/4. In any event, A is a Markov matrix, and if we introduce ~p(t) ∈ R4 to denote
the vector whose kth entry is pk (t), then the equation in (18.12) has the form of (18.11).
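As a concrete (and entirely made-up) illustration of (18.12), here is a short computation with a hypothetical 4 × 4 matrix of conditional probabilities; the only structural requirements are that the entries are non-negative and each column sums to 1:

```python
import numpy as np

# Hypothetical conditional probabilities A[j, k] = P(next segment is j | previous is k),
# with the labels 1, 2, 3, 4 (A, G, C, U) mapped to indices 0..3.  These numbers
# are invented for illustration; each column must sum to 1.
A = np.array([[0.40, 0.20, 0.25, 0.15],
              [0.10, 0.30, 0.25, 0.35],
              [0.30, 0.25, 0.20, 0.25],
              [0.20, 0.25, 0.30, 0.25]])
assert np.allclose(A.sum(axis=0), 1.0)

p = np.array([1.0, 0.0, 0.0, 0.0])   # say the strand starts with an A segment
for t in range(1, 6):
    p = A @ p                        # equation (18.12): p(t) = A p(t-1)
    print(t, p.round(4))
```

Each iterate p is again a probability vector, precisely because A is a Markov matrix.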
Now, those of you with some biochemistry experience might argue that to analyze the molecules that comprise a cell,
it is rather difficult to extract them without breakage. Thus, if you find a strand of RNA, you may not be seeing the
whole strand from start to finish and so the segment that you are labeling as t = 0 may not have been the starting
segment when the strand was made in the cell. Having said this, you would then question the utility of the ‘solution’,
p~(t) = At p~(0) since there is no way to know p~(0) if the strand has been broken. Moreover, there is no way to see if
the strand was broken.
As it turns out, this objection is a red herring of sorts because one of the virtues of a Markov chain is that the form
of p~(t) is determined solely by p~(t − 1). This has the following pleasant consequence: Whether our starting segment
is the original t = 0 segment, or some t = N > 0 segment makes no difference if we are looking at the subsequent
segments. To see why, let us suppose that the strand was broken at segment N and that what we are calling segment t
was originally segment t + N . Not knowing the strand was broken, our equation reads p~(t) = At p~(0). Knowing the
strand was broken, we must relabel and equate our original p~(t) with the vector p~(t + N ) that is obtained from the
starting vector, p~(0), of the unbroken strand by the equation p~(t + N ) = At+N p~(0).
Suppose now that the eigenvectors {~e1 , . . . , ~en } of A form a basis of Rn , and write

~p(0) = ~e1 + c2~e2 + · · · + cn~en ,    (18.13)

where ck is real if ~ek has a real eigenvalue, but complex when ~ek has a complex eigenvalue. With regards to the latter
case, since our vector p~(0) is real, the coefficients ck and ck0 must be complex conjugates of each other when the
corresponding ~ek and ~ek0 are complex conjugates also.
I need to explain why ~e1 has the factor 1 in front. This requires a bit of a digression: As you may recall from the
previous chapter, the vector ~e1 can be assumed to have purely positive entries that sum to 1. I am assuming that such
is the case. I also argued that the entries of any eigenvector with real eigenvalue less than 1 must sum to zero. This
must also be the case for any eigenvector with complex eigenvalue. Indeed, to see why, suppose that ~ek has eigenvalue
λ ≠ 1, either real or complex. Let ~v denote the vector whose entries all equal 1. Thus, ~v is the eigenvector of AT
with eigenvalue 1. Note that the dot product of ~v with any other vector is the sum of the other vector's entries. Keep
this last point in mind. Now, consider that the dot product of ~v with A~ek is, on the one hand, λ~v · ~ek , and on the other
(AT ~v ) · ~ek . As AT ~v = ~v , we see that λ~v · ~ek = ~v · ~ek , and so if λ ≠ 1, then ~v · ~ek = 0 and the sum of the entries of
~ek is zero.
Now, to return to the factor of 1 in front of ~e1 , remember that p~(0) is a vector whose components are probabilities, and
so they must sum to 1. Since the components of the vectors ~e2 , . . . , ~en sum to zero, this constraint on the sum requires
the factor 1 in front of ~e1 in (18.13).
With (18.13) in hand, it then follows that any given t > 0 version of p~(t) is given by
~p(t) = ~e1 + c2 λ2^t ~e2 + · · · + cn λn^t ~en .    (18.14)
Since |λ^t | = |λ|^t and each λ that appears in (18.14) has absolute value less than 1, we see that the large t versions of
~p(t) are very close to ~e1 . This is to say that

lim t→∞ ~p(t) = ~e1 .    (18.15)
This last fact demonstrates that as t increases along a Markov chain, there is less and less memory of the starting vector
p~(0). It is sort of like the aging process in humans: As t → ∞, a Markov chain approaches a state of complete senility,
a state with no memory of the past.
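The limit in (18.15) is easy to watch numerically. The sketch below iterates ~p(t) = A ~p(t − 1) for the matrix in (18.5); since that matrix has equal row sums as well as equal column sums, its eigenvector ~e1 is (1/3, 1/3, 1/3):

```python
import numpy as np

# The positive Markov matrix from (18.5).
A = np.array([[1/2,  1/16, 7/16],
              [7/16, 1/2,  1/16],
              [1/16, 7/16, 1/2 ]])

p = np.array([1.0, 0.0, 0.0])   # an arbitrary starting probability vector
for _ in range(200):            # iterate p(t) = A p(t-1)
    p = A @ p

# The limit is the eigenvector e1 with eigenvalue 1 whose entries sum to 1;
# here, because A is doubly stochastic, e1 = (1/3, 1/3, 1/3).
print(p.round(6))   # → [0.333333 0.333333 0.333333]
```

The other two eigenvalues have modulus √43/16 ≈ 0.41, so the memory of ~p(0) decays geometrically fast.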
18.5 Exercises:
1. Any 2 × 2 Markov matrix has the generic form

   [ a      1 − b
     1 − a  b     ],

   where a, b ∈ [0, 1]. Compute the characteristic polynomial of such a matrix and find expressions for its roots in terms of a and b. In doing so, you will verify
   that it has only real roots.
2. Let A denote the matrix in (18.1). Find the eigenvectors for A and compute ~p(1), ~p(2) and lim t→∞ ~p(t) in the case
   that ~p(t) = A ~p(t − 1) and ~p(0) = (1, 0, 0)T . Finally, write ~p(0) as a linear combination of the eigenvectors for A.
NINETEEN
Suppose that you take some large number of measurements of various facets of a system that you are studying. Let's say
that there are n facets and you take N ≫ n measurements under different conditions; thus, you generate a collection
of N vectors in Rn which you can label as {~x1 , ~x2 , . . . , ~xN }.
An issue now is whether some of the n facets that you measure are dependent on the rest. For example, suppose
that one facet is the temperature of the sample. If all of the other facets of the system are completely determined
by the temperature, then all of these N vectors will lie on or very near some curve in the n-dimensional space, a curve
parameterized as T → ~x(T ) by the temperature. On the other hand, if no facets are determined by any collection of
the others, then the N vectors could be spread in a more or less random fashion through a region of Rn . The point
here is that if you are interested in discovering relations between the various facets, then you would like to know if the
distribution of the N vectors is spread out or concentrated near some lower dimensional object – a curve or a surface or
some such. Any such concentration towards something less spread out indicates relations between the various facets.
For example, suppose n = 2. If you plot the endpoints of the vectors {~xk } in R2 and see the result as very much
lying near a particular curve (not necessarily a line), this says that the two facets are not able to vary in an independent
fashion. Indeed, if y = f (x) is the equation for the curve, it says that the variation in x determines the variation in y.
Now, for this n = 2 case, you can go and plot your N vectors and just look at the picture. However, for n > 2, this is
going to be hard to do. How then can you discern relationships when n > 2?
Now, consider the curve in this disk given in the parametric form by
t → (x = ct cos(t), y = ct sin(t)). (19.1)
Here c > 0 is a constant and 0 ≤ t ≤ 1/c is the parameter for the curve. For any given c, this curve is a spiral. The
radius, r(t), is equal to ct and the angle is t as measured from the positive x-axis in the anti-clockwise direction.
As c gets smaller, the spiral gets tighter and tighter; there are more and more turns before the curve hits the boundary
of the disk. Indeed, when c is very large, the spiral hardly turns and stays very close to the x-axis. When c = 1/(2π), the
spiral makes one complete turn before exiting the disk. When c = 1/(2πm) and m is an integer, the spiral makes m turns
before it exits.
Now, here is the problem: Suppose that our vectors are near some very small c version of (19.1). Since this spiral
makes a lot of turns in the disk, all points in the disk are pretty close to the spiral and so even a random collection of
points in the disk will find each point close to the spiral. In particular, our collection of vectors {~xk } will be close to
all small c versions of the spiral no matter what!!
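This phenomenon can be checked directly. The following sketch (the sample sizes and the value c = 0.01 are arbitrary choices of mine) sprinkles points uniformly in the unit disk and measures how far each one is from the c = 0.01 spiral; every point turns out to lie within roughly πc of it:

```python
import numpy as np

rng = np.random.default_rng(1)
c = 0.01                                    # a "very small c" version of (19.1)

# Sample the spiral t -> (ct cos(t), ct sin(t)), 0 <= t <= 1/c, densely.
t = np.linspace(0.0, 1.0 / c, 50_000)
spiral = np.column_stack([c * t * np.cos(t), c * t * np.sin(t)])

# 200 points sprinkled uniformly in the unit disk.
theta = rng.uniform(0.0, 2 * np.pi, 200)
rho = np.sqrt(rng.uniform(0.0, 1.0, 200))   # the sqrt keeps the density uniform
points = np.column_stack([rho * np.cos(theta), rho * np.sin(theta)])

# Distance from each random point to its nearest sampled spiral point.
dists = np.array([np.min(np.linalg.norm(spiral - p, axis=1)) for p in points])
print(dists.max())   # roughly pi*c, i.e. about 0.03
```

Adjacent turns of the spiral are only 2πc apart, so no point of the disk can be farther than about πc from the curve.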
Here is another example: Suppose that the points {~xk } are distributed randomly in a thin strip of width r ≪ 1 along
the diagonal y = x. Thus, each ~xk has coordinates x and y that obey |x − y| ≤ r. So, inside this strip, the points
are spread at random. Outside the strip, there are no points at all. If your experimental error is on the order of r,
then I would be happy to conclude that the points lie on the line y = x and thus the experiment indicates that the
y-measurement is completely determined by the x-measurement. On the other hand, if the experimental error is much
less than r, then I would not agree that the concentration near the y = x line signifies that y is determined by x. Maybe
some part of y is determined by x, but there is some part left over that is independent.
19.3 A method
Suppose we have data {~xk }1≤k≤N , each a vector in Rn . Here is a method to analyze whether the data near any given
~xj is concentrated near some lower dimensional subspace of Rn .
Step 1: You must choose a number, r, that is a reasonable amount greater than your experimental error. In
this regard, r should also be significantly less than the maximum distance between any two points in {~xk }. Thus,
r ≪ max j,k |~xj − ~xk |. This number r determines the scale on which you will be looking for the clustering of the
vectors.
Step 2: Let ~xj ∈ {~xk }. To see if the data is clustering around a lower dimensional subspace near ~xj , take all points
in {~xk } that have distance r or less from ~xj . Let m denote the number of such points. Allow me to relabel these points
as {~y1 , . . . , ~ym }. These are the only points from the collection {~xk } that will concern us while we look for clustering
around the given point ~xj at the scale determined by r.
Step 3: Let ~a denote the vector (1/m)(~y1 + · · · + ~ym ). This vector should be viewed as the center of the collection
{~y1 , . . . , ~ym }. For each index i = 1, . . . , m, set ~zi = ~yi − ~a. This just shifts the origin in Rn .
Step 4: View each i ∈ {1, . . . , m} version of ~zi as a matrix with 1 column and then introduce the transpose, ~ziT ,
which is a matrix with 1 row. Note that if ~z is a vector in Rn viewed as a 1-column matrix, then ~z can multiply ~zT
to give the square matrix ~z~zT . For example, if ~z has top entry 1 and all others 0, then ~z~zT has top left entry 1 and all
others zero. In general, the entry (~z~zT )ik is the product of the ith and kth entries of ~z.
Granted the preceding, introduce the matrix
A = (1/m) (~z1 ~z1T + · · · + ~zm ~zmT ).    (19.2)
This is a symmetric, n × n matrix, so it has n real eigenvalues which I will henceforth denote by {λ1 , . . . , λn }.
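Steps 1 through 4 are short to carry out in code. Here is a hedged sketch (the function name and the test data are mine, not the notes'):

```python
import numpy as np

def clustering_matrix(points, j, r):
    """Form the matrix A of (19.2) around points[j] at the scale r (Steps 2-4)."""
    x = np.asarray(points, dtype=float)
    near = x[np.linalg.norm(x - x[j], axis=1) <= r]      # Step 2: y_1, ..., y_m
    a = near.mean(axis=0)                                # Step 3: the center
    z = near - a                                         # shift the origin
    return (z[:, :, None] * z[:, None, :]).mean(axis=0)  # Step 4: (1/m) sum z_i z_i^T

# Made-up data: 100 points in R^3.
rng = np.random.default_rng(2)
data = rng.normal(size=(100, 3))
A = clustering_matrix(data, j=0, r=1.5)

# A is symmetric, so eigvalsh applies and all eigenvalues are real.
print(np.linalg.eigvalsh(A))
```

Since A is a sum of matrices of the form ~z ~zT , its eigenvalues are also non-negative.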
~v · (A~v ) ≤ (1/m) (|~z1 |2 + · · · + |~zm |2 ) |~v |2 .    (19.7)

Since each ~zk has norm less than r, the right-hand side of (19.7) is no larger than r2 |~v |2 . Thus, ~v · (A~v ) ≤ r2 |~v |2 , and so no eigenvalue of A can be greater than r2 .
Example 2: Suppose that m is large and that the vectors {~xk } all sit on a single line, and that they are evenly
distributed along the line. In particular, let ~z denote the unit tangent vector to the line, and suppose that ~zk =
(−1 + 2k/(m + 1)) r ~z . Thus, ~z1 = −((m − 1)/(m + 1)) r ~z and ~zm = ((m − 1)/(m + 1)) r ~z . This being the case,

A = (r2/m) Σk=1,...,m (−1 + 2k/(m + 1))2 ~z ~zT .    (19.10)
As it turns out, this sum can be computed in closed form and the result is

A = (r2/3) ((m − 1)/(m + 1)) ~z ~zT .    (19.11)
The matrix A has the form c~z~zT where c is a constant. Now any matrix Â = c~z~zT has the following property: If ~v is
any vector in Rn , then
Â~v = c~z (~z · ~v ) . (19.12)
Thus, Â~v = 0 if ~v is orthogonal to ~z, and Â~z = c~z. Hence  has 0 and c as its eigenvalues, where 0 has multiplicity
n − 1 and c has multiplicity 1.
In our case, this means that there is one eigenvalue that is on the order of r2/3 and the others are all zero. Thus, we
would say that the clustering here is towards a subspace of dimension 1, and this is precisely the case.
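Example 2 can be reproduced numerically; the sketch below builds the evenly spaced points, forms A, and compares its one nonzero eigenvalue against (19.11) (taking m = 200 and ~z along the first axis are arbitrary choices of mine):

```python
import numpy as np

m, r = 200, 1.0
z = np.array([1.0, 0.0, 0.0])          # unit tangent vector of the line in R^3

# The points z_k = (-1 + 2k/(m+1)) r z of Example 2, for k = 1, ..., m.
coeffs = -1.0 + 2.0 * np.arange(1, m + 1) / (m + 1)
Z = r * coeffs[:, None] * z[None, :]

A = (Z[:, :, None] * Z[:, None, :]).mean(axis=0)    # (1/m) sum z_k z_k^T
eigenvalues = np.sort(np.linalg.eigvalsh(A))

c = (r**2 / 3) * (m - 1) / (m + 1)     # the nonzero eigenvalue predicted by (19.11)
print(eigenvalues, c)                  # two zero eigenvalues and one equal to c
```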
Example 3: To generalize the preceding example, suppose that d ≥ 1, that V ⊂ Rn is a d-dimensional subspace,
and that the vectors {~zk }1≤k≤m all lie in V .
In this case, one can immediately deduce that A will have n − d orthonormal eigenvectors that have zero as eigenvalue.
Indeed, any vector in the orthogonal complement to V is in the kernel of A and so has zero eigenvalue. As this
orthogonal subspace has dimension n − d, so A has n − d linearly independent eigenvectors with eigenvalue 0.
Thus, we predict here that the clustering is towards a subspace whose dimension is no greater than d.
When m is very large, the Central Limit Theorem tells me that the matrix A in (19.2) is very close to the
matrix (r2/(d + 2)) P , where P here denotes the orthogonal projection on to the subspace in question.
Thus, it should have d eigenvalues that are very close to r2/(d + 2) and n − d eigenvalues that are much smaller than this
number. In particular, when m is large, then under the hypothesis that the vectors {~zk }1≤k≤m are sprinkled at random
in a d-dimensional subspace, the matrix A in (19.2) will be very likely to have d eigenvalues that are on the order of
r2/(d + 2) and n − d eigenvalues that are zero.
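A simulation bears this out (the dimensions, sample size, and seed below are arbitrary choices of mine): points sprinkled uniformly in a d-dimensional ball of radius r inside Rn give a matrix A with d eigenvalues near r2/(d + 2) and n − d eigenvalues equal to zero:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, m, r = 5, 2, 50_000, 1.0

# Uniform points in the d-dimensional ball of radius r, placed in the
# subspace spanned by the first d coordinate axes of R^n.
u = rng.normal(size=(m, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)               # uniform directions
radii = r * rng.uniform(0.0, 1.0, size=(m, 1)) ** (1.0 / d) # uniform in the ball
Z = np.zeros((m, n))
Z[:, :d] = u * radii
Z -= Z.mean(axis=0)                                         # Step 3: recenter

A = (Z[:, :, None] * Z[:, None, :]).mean(axis=0)            # the matrix of (19.2)
eigenvalues = np.sort(np.linalg.eigvalsh(A))
print(eigenvalues, r**2 / (d + 2))
```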
This application of the Central Limit Theorem explains my preference for the size distinction (1/(2(n + 2))) r2 between small
eigenvalues of A and eigenvalues that are of reasonable size. I think that this cut-off is rather conservative and one can
take a somewhat larger one.
19.7 Exercises:
1. The purpose of this exercise is to compute the average of the matrix in (19.13) over the disk of radius r in the
xy-plane. This average is the matrix, U , whose entries are
U11 = (1/(πr2)) ∬ x2 dx dy,   U22 = (1/(πr2)) ∬ y 2 dx dy,   and   U12 = U21 = (1/(πr2)) ∬ xy dx dy.
To compute U , change to polar coordinates (ρ, θ) where ρ ≥ 0 and θ ∈ [0, 2π] using the formula x = ρ cos(θ)
and y = ρ sin(θ). Show that
(a) U11 = (1/(πr2)) ∫0^2π ∫0^r ρ2 cos2 (θ) ρ dρ dθ,
(b) U22 = (1/(πr2)) ∫0^2π ∫0^r ρ2 sin2 (θ) ρ dρ dθ, and
(c) U12 = U21 = (1/(πr2)) ∫0^2π ∫0^r ρ2 sin(θ) cos(θ) ρ dρ dθ.
Next, use the formula cos(2θ) = 2 cos2 (θ) − 1 = 1 − 2 sin2 (θ) and sin(2θ) = 2 sin(θ) cos(θ) to do the angle
integrals first and so find that
(d) U11 = U22 = (1/r2) ∫0^r ρ3 dρ, and
(e) U12 = U21 = 0.
Finally, do the ρ-integral to find that U11 = U22 = (1/4) r2 and U12 = U21 = 0. Note that this means that U is the
d = 2 version of (r2/(d + 2)) Id .
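If you want to check the answer without doing the integrals, a Monte Carlo average over the disk gives the same numbers (the sample size and seed are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(4)
r, N = 1.0, 500_000

# Uniform samples from the disk of radius r (the sqrt keeps the density uniform).
theta = rng.uniform(0.0, 2 * np.pi, N)
rho = r * np.sqrt(rng.uniform(0.0, 1.0, N))
x, y = rho * np.cos(theta), rho * np.sin(theta)

U11, U22, U12 = np.mean(x * x), np.mean(y * y), np.mean(x * y)
print(U11, U22, U12)    # approximately r^2/4, r^2/4, and 0
```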