Probability for
Data Scientists
1st Edition
Juana Sánchez
University of California, Los Angeles
SAN DIEGO
Bassim Hamadeh, CEO and Publisher
Mieka Portier, Acquisitions Editor
Tony Paese, Project Editor
Sean Adams, Production Editor
Jess Estrella, Senior Graphic Designer
Alexa Lucido, Licensing Associate
Susana Christie, Developmental Editor
Natalie Piccotti, Senior Marketing Manager
Kassie Graves, Vice President of Editorial
Jamie Giganti, Director of Academic Publishing
Copyright © 2020 by Cognella, Inc. All rights reserved. No part of this publication may be reprinted, reproduced,
transmitted, or utilized in any form or by any electronic, mechanical, or other means, now known or hereafter
invented, including photocopying, microfilming, and recording, or in any information retrieval system without the
written permission of Cognella, Inc. For inquiries regarding permissions, translations, foreign rights, audio rights,
and any other forms of reproduction, please contact the Cognella Licensing Department at [email protected].
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Cover image and interior image copyright © 2018 Depositphotos/SergeyNivens; © 2017 Depositphotos/rfphoto;
© 2015 Depositphotos/creisinger; © 2014 Depositphotos/Neode; © 2013 Depositphotos/branex; © 2013 Deposit-
photos/vitstudio; © 2012 Depositphotos/oconner; © 2012 Depositphotos/scanrail; © 2016 Depositphotos/lamnee;
© 2012 Depositphotos/shirophoto.
Detailed Contents
PREFACE xvii
2 Building Blocks of Modern Probability Modeling 29
2.1 Learning the vocabulary of probability: experiments, sample spaces,
and events. 30
2.1.1 Exercises 32
2.2 Sets 33
2.2.1 Exercises 34
2.3 The sample space 35
2.3.1 A note of caution 37
2.3.2 Exercises 38
2.4 Events 39
2.5 Event operations 41
2.6 Algebra of events 46
2.6.1 Exercises 46
2.7 Probability of events 49
2.8 Mini quiz 49
2.9 R code 51
2.10 Chapter Exercises 52
2.11 Chapter References 55
5 Probability Models for a Single Discrete Random Variable 139
5.1 New representation of a familiar problem 139
5.2 Random variables 142
5.2.1 The probability mass function of a discrete random variable 142
5.2.2 The cumulative distribution function of a discrete random variable 146
5.2.3 Functions of a discrete random variable 147
5.2.4 Exercises 147
5.3 Expected value, variance, standard deviation and median of a discrete
random variable 148
5.3.1 The expected value of a discrete random variable 148
5.3.2 The expected value of a function of a discrete random variable 149
5.3.3 The variance and standard deviation of a discrete random variable 149
5.3.4 The moment generating function of a discrete random variable 150
5.3.5 The median of a discrete random variable 151
5.3.6 Variance of a function of a discrete random variable 151
5.3.7 Exercises 151
5.4 Properties of the expected value and variance of a linear function
of a discrete random variable 153
5.4.1 Short-cut formula for the variance of a random variable 154
5.4.2 Exercises 155
5.5 Expectation and variance of sums of independent random variables 156
5.5.1 Exercises 159
5.6 Named discrete random variables, their expectations, variances and moment
generating functions 159
5.7 Discrete uniform random variable 160
5.8 Bernoulli random variable 160
5.8.1 Exercises 161
5.9 Binomial random variable 161
5.9.1 Applicability of the Binomial probability mass function in Statistics 164
5.9.2 Exercises 164
5.10 The geometric random variable 166
5.10.1 Exercises 168
5.11 Negative Binomial random variable 169
5.11.1 Exercises 171
5.12 The hypergeometric distribution 171
5.12.1 Exercises 172
5.13 When to use binomial, when to use hypergeometric?
When to assume independence in sampling? 173
5.13.1 Implications for data science 174
5.14 The Poisson random variable 174
5.14.1 Exercises 178
Part II. Probability in Continuous Sample Spaces 221
In investigating the position in space of certain objects, “What is the probability that
the object is in a given region?” is a more appropriate question than “Is the object in
the given region?”
Parzen, 1960
Preface
Probability is the mathematical term for chance. Much of statistics, data science and
machine learning theory and practice rests on the concept of probability. The reason is
that any conclusion concerning a population based on a random sample from that popula-
tion is subject to uncertainty due to variability. It is probability theory that enables one to
proceed from mere description of data to inferences about populations. The conclusion of a
statistical data analysis is often stated in terms of probability. Understanding probability is
thus necessary to succeed as a statistician or data scientist in artificial intelligence, machine
learning or similar endeavors.
This book contains a mathematically sound but elementary introduction to the theory and
applications of probability. The book has been divided into two parts. Part I contains the basic
definitions, theorems, and methods in the context of discrete sample spaces, which makes
it accessible to readers with a good background in high school algebra and a little ability in
the reading and manipulation of mathematical symbols. Part II contains the corresponding
ideas in the continuous case, and is accessible to readers with a working knowledge of the
univariate and multivariate differential and integral calculus, and mastery of Part I. The book
is designed as a textbook for a one-quarter or one-semester introductory course, and can be
adapted to the needs of undergraduate students with diverse interests and backgrounds, but
it is detailed enough to be used as a self-learning tool by physics and life scientists, engineers,
mathematicians, statisticians, data scientists and others who have the necessary prepara-
tion. The text aims at helping the reader become fluent in formulating probability problems
mathematically so that they can be attacked by routine methods, in whatever applied field
the reader resides. In many of these fields of application, books on chance quickly jump to
the most advanced probability methods used in research without the proper apprenticeship
period. Probability is not to be learned as a cookbook, because then the reader will have
no idea how to start when encountering an unfamiliar problem in their field of application.
Numerous examples throughout the text show the reader how apparently very different
problems in remotely related contexts can be approached with the same methodology, and
how probability studies mathematical models of random physical, chemical, social and bio-
logical phenomena that are contextually unrelated but use the same probability methods.
For example, the law of large numbers is the foundation of social media, fire, earthquake
and automobile insurance, and gambling, to name a few.
Having those who have to deal with data, data science or statistics in mind, the main
goal of this book is to convey the importance of knowing about the many (the probability
distribution for random behavior) in order to predict individual behavior. The second learning
goal is to appreciate the principle of substitution, which allows the manipulation of basic
probabilities about the many to obtain more complex and powerful predictions. Lastly, the
book intends to make the reader aware of the fact that probability is a fundamental concept
in Statistics and Data Science, where statistical tests of hypotheses and predictions involve
the calculation of probabilities.
In part I, Chapters 1 to 6 review the origin of the mathematical study of probability, the
main concepts in modern probability theory, univariate and bivariate discrete probability
models and the multinomial distribution. Chapters 7–10 make up Part II. Sections that are
too specialized and more advanced are indicated and the author recommends passing them
without loss of continuity, or refers the reader to other sections of the book where they will be
explained in detail. To enhance the teaching and self-learning value of the book, all chapters
and many sections within chapters start with a challenging question to encourage readers
to assess their prior conceptions of chance problems. The reader should try to answer that
question and discuss it with peers. At the end of each chapter, the reader should go back to
that question and compare initial thoughts with thoughts after studying the chapter. Exercises
at the end of most sections of the book and at the end of each chapter give the reader an
opportunity to apply the methods and reasoning process that constitutes probability topic by
topic. Some of them invite research and broader considerations. Because random numbers are
used in many ways associated with computers nowadays, including the adaptive algorithms
used by social media to modify behavior, computer games, generation of synthetic data for
testing theories, and decision making in many fields, every chapter contains guided exercises
with the software R that involve random numbers.
Relevant references for further analysis found throughout the book will allow the reader
to continue training in the more advanced way of approaching probability after they finish
this book. There are so many fields of engineering and the physical, natural, and social sci-
ences to which probability theory has been applied that it is not possible to cite all of them.
Probability is also at the heart of modern financial and actuarial mathematics, thus exercises
in health care and insurance are also included.
The book is intended as a tribute to all those who have made an effort to make probabil-
ity theory accessible to a wide audience and those that are more specialized. Consequently,
the reader will find many examples and exercises from a wide array of sources. I am deeply
indebted to them. By bringing many of these authors to the reader’s attention I wish to direct
enquiries to sources with correct information and give students a sense of the depth and
breadth of thinking probabilistically and of how they can move to more difficult aspects of
the theory. If I have missed acknowledging or have misquoted some author, I hope the author
will bring this to my attention, and I apologize in advance.
In studying this book, the reader must make an effort to talk about what is or is not under-
stood with peers. Sharing results of experiments, chatting with colleagues about recent
discoveries, and learning a new technique from friends are common experiences for working
scientists.
Juana Sánchez
University of California, Los Angeles
June 2019
Part I
Probability in
Discrete Sample Spaces
an expectation for some finance random variable will read to the reader as different from
a problem that asks to compute an expectation for a biology variable. However, both the
biology and the finance problem will use the same method to compute the expectation.
Table 1.1
What do you think this table represents? What could it be used for? What kind of
things can you predict with it? Ask someone else the same questions and compare
your thoughts. Are you uncertain about your guess?
1.1 Measuring uncertainty
How often do you think about uncertainty? Have you ever tried to measure your uncertainty
about the outcome of some action you are planning to take in some way? For example,
when you were debating whether a prescribed medicine for a cold would lead to recovery?
Neglecting all possible influence of diet, stress, and financial problems, perhaps you found
online information claiming that 80% of all of those taking this medicine in the past year got
cured, and then you adopted this 80% as the measure of your uncertainty about the outcome
that would ensue if you take the medicine for your cold. Certainly, some individuals that took
the medicine recovered, and some did not, and you have no idea whether you will be among
the former; taking the medicine does not always lead to the same outcome. Taking a medicine
for a cold is a random or chance experiment. If the information online had said that 80% of
those who took this medicine in the past year died, certainly the decision you made would
perhaps have been different.

A probability is a number that gives a precise estimate of how certain we are about
something. (Everitt 1999)
However, we do not know that the die is physically fair, or that this model holds. A way
to find out is with data, the other approach to calculating probabilities. To obtain data, you
should complete the experiment proposed in Table 1.3, using what you think is a fair six-sided
die. First roll 10 times and stop. Compute the number of 6's you would expect to get, based
on the model, in 10 rolls, and look at how many you really got. Then roll 40 more
and stop. Now you will have accumulated 50 rolls. Count how many of those 50 rolls are 6
(include the ones in the first 10 and the ones in the last 40 rolls). Continue calculating how
Table 1.3 Data obtained by an experiment that consists of rolling a real six-sided die
to observe what the proportion of sixes converges to. We do not know if the die is fair
or not.
(1) Roll up to this number of rolls | (2) Expected number of sixes (based on model) | (3) Observed # of sixes | (4) Observed # minus expected # | (5) Expected proportion | (6) Observed proportion of 6's = (3)/(1) | Observed proportion minus expected proportion
10   | (1/6)(10)   |  |  | 1/6 |  |
50   | (1/6)(50)   |  |  | 1/6 |  |
100  | (1/6)(100)  |  |  | 1/6 |  |
200  | (1/6)(200)  |  |  | 1/6 |  |
300  | (1/6)(300)  |  |  | 1/6 |  |
400  | (1/6)(400)  |  |  | 1/6 |  |
500  | (1/6)(500)  |  |  | 1/6 |  |
600  | (1/6)(600)  |  |  | 1/6 |  |
700  | (1/6)(700)  |  |  | 1/6 |  |
800  | (1/6)(800)  |  |  | 1/6 |  |
900  | (1/6)(900)  |  |  | 1/6 |  |
1000 | (1/6)(1000) |  |  | 1/6 |  |
If your experiment is successful, the proportion of sixes that you get in column (6) will be
closer and closer to 1/6 (in column 5) as the number of tosses increases if the model given
in Table 1.2 is indeed a good model for the physical die you are using. But if you had tossed
a loaded die, the results you put in Table 1.3 will contradict the model in Table 1.2. Thus,
although we are not able to predict whether a single roll will give us the number 6 or not,
we are able to predict that a large number of rolls will give a 6 with a very stable proportion
of 1/6 if the die is fair, or other proportion if not.
It is common in data science to compare a probability model with data collected randomly.
If the model is correct, a large amount of collected data (by experimentation, like you will do to
complete Table 1.3) will support the model. If the model is incorrect, the data will not support it.
To do the comparison of models to reality, statisticians and data scientists collect a lot of data
when they can.
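The accumulation described in Table 1.3 can also be done by computer. The guided exercises in this book use R; purely as an illustrative sketch of the same experiment, here is a version in Python that rolls a fair six-sided die and reports the observed proportion of sixes at each checkpoint of the table:

```python
import random

random.seed(1)  # fix the sequence of rolls so the run is reproducible

# Checkpoints from Table 1.3: cumulative roll counts at which we pause.
checkpoints = [10, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
sixes = 0
rolls = 0
for n in checkpoints:
    # Roll enough additional times to reach the next checkpoint.
    while rolls < n:
        sixes += (random.randint(1, 6) == 6)
        rolls += 1
    # Column (6) of the table: observed proportion of sixes so far.
    print(f"{n:5d} rolls: observed {sixes / n:.3f}, expected 1/6 = {1/6:.3f}")
```

If the model is a good one for the die, the printed observed proportions drift toward 1/6 as the number of rolls grows, just as the last row of the table should show.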
Returning to the medicine example at the beginning of this chapter, and by analogy with the
die experiment, not much can be said by anyone about a particular individual in a large popu-
lation that took the medicine we were talking about but, thanks to probability theory, there is
Let me assume that I am told that some cows ruminate; I cannot infer logically from this
that any particular cow does so, though I should feel some way removed from absolute
disbelief, or even indifference to assent, upon the subject; but if I saw a herd of cows I
should feel more sure that some of them were ruminant than I did of the single cow, and
my assurance would increase with the numbers of the herd about which I had to form
an opinion. Here then we have a class of things as to the individuals of which we feel
quite in uncertainty, whilst as we embrace larger numbers in our assertions we attach
greater weight to our inferences. It is with such classes of things and such inferences
that the science of Probability is concerned. (Venn 1888)
The calculus of probability makes statistics possible and gives statistics a foundation. Data
scientists and statisticians think of probability models as models representing the population’s
random behavior. They constantly search in samples of data for what those probability models
are. Because their data may not be the whole population, they may even use probability further
to attach some error to their estimates. Probability is at the core of the search engines such
as Google or Yahoo that we use every day to gather information. The goal of social media
is to treat you like the average of the population, presenting to you a summary of the
combined behavior of many, based on the past behavior of other users. They try to predict
your behavior as an individual through what they know about everybody who used social
media before you. Your behavior in turn leads them to update their algorithms about
everyone. Probability theory also guides population genetics and genetic testing, medical
diagnoses, language processing, surveillance, quality control, climate change research, social
networks, psychology of people, and behavior of agents in video games, to name a few areas.
Probability theory is the background behind all scientific and social endeavors.
Students must obtain some knowledge of probability and must be able to tie this
concept to real scientific investigations if they are to understand science and the world
around them.
(Scheaffer 1995)
1.1.2 Exercises
Exercise 1. You are given a new twelve-sided die by the host of a party you are attending.
You are told that this die will be used to play a game after dinner in which you will lose $100
Exercise 2. You are uncertain about the outcome of taking your significant other to a new
restaurant to celebrate your birthday. Your significant other has never been to this restaurant
and the invitation has to be a complete surprise (but not a complete failure). How do you
decrease your uncertainty about the restaurant’s quality?
Exercise 3. Suppose you are an economist who has been teaching in an economics department
for quite some time. Someone asks you to choose between the following two things and earn
$1,000 if you get it right: (a) Predict whether a new hire, Shakir, in the reception office of an
economics department at a university will leave the job after a year (if you predict yes, and
the person leaves, you get the $1,000); (b) Predict whether there will be some (not needing to
give names) new hires among the 100 new hires in the reception offices of many economics
departments across the US who will leave the job after a year. Do you choose (a) or (b)? Why?
Exercise 4. An individual 45 years old chooses to live in a neighborhood that has cheap
housing but not a good safety and hygienic record. The individual is perfectly healthy, works
hard, has a new car, has a very clean house, and has never been harmed or inconvenienced
by anybody in the neighborhood. This individual is pretty much a mirror image of another
individual of the same age who lives in a very fancy gated neighborhood with lots of secu-
rity surveillance, who has the same health, the same car, the same job, and the same safety
record. An insurance company offers a life insurance to both. But the premium of the first
individual is much higher than that of the second individual. What explains that? Try to tie
your response to what we have discussed in this Section 1.1.
Exercise 5. Brian Tarran (2015) interviewed Dan Bouk, a historian who wrote a book about how
people see themselves as a statistical individual—one that is understood and interpreted
as the statistical whole, meaning as the average of everybody else (for example, a middle-
aged individual thinks there is a 40% chance of death by heart attack, a 20% chance of being hit
by a car, etc.). Think about the things you think about yourself, and think hard about where
those thoughts come from. How much is it based on data that you have seen on people your
age? List three or four things that you believe about yourself based on something you have
read about people your age (for example: risks, health items).
Exercise 6. Comment on what Jaron Lanier (2018) says in his recently published book:
Behavior modification, especially the modern kind implemented with gadgets like
smartphones, is a statistical effect, meaning it’s real but not comprehensively reliable;
over a population, the effect is more or less predictable, but for each individual it’s
impossible to say. (Lanier 2018)
Although probability theory today has about as much to do with games of chance as
geometry has to do with land surveying, the first paradoxes arose from popular games
of chance.
(Szekely 1986)
1.2.1 It all started with long (repeated) observations (experiments) that did
not conform with our intuition
When it comes to relative frequencies at which events occur, our intuition (you may call it
our a priori “model”) often does not conform to repeated observation. It is with this clash
that mathematical probability started (a clash would occur, for example, if Table 1.2 in this
chapter was contradicted by the relative frequency results that you will get in the last row
of column 6 of Table 1.3). These clashes still happen now (Stigler 2015). The reader is encour-
aged to look at Side Box 1.1 for a definition of relative frequency.
Box 1.1
Assuming your performance does not change between this quarter and the next, it can
be estimated that the probability that any of your quizzes will be 4 in the future is 3/8 or
37.5% or 0.375. Probability can be expressed in various forms: as fractions, percentages or
decimal fractions.
Box 1.2
The history of probability has been plagued since its beginning with examples where empirical
facts did not present the relative frequencies that were expected based on intuition (an a priori
model). In fact, the modern probability theory that you are going to study in this book is the
result of efforts by gamblers, mathematicians, social scientists, engineers and other scientists
to create a framework for thinking about the frequency of empirical facts so that we do not rely
solely on intuition or a priori models. When using a mathematical probability approach to think
about reality, we are bound to make fewer mistakes in our predictions.
Making decisions based on long observations (when we can) or based on models supported
by long observations, pays in data science, public policy, and our daily lives. Nowadays,
the term “evidence-based decision making” is very popular in many circles. For example,
knowing the usual frequency of SIDS (Sudden Infant Death Syndrome) deaths in each county
in a given state (possibly measured as deaths per hundred thousand) may help raise a flag
in an anomalous year that has an unusually large frequency.
Exercise 2. Test your intuition by thinking about this problem: If you roll a die three times,
what is the probability of getting at least one six? Again, do not look anywhere for an answer.
This question is just for you to assess your intuition or a priori model.
Exercise 3. A student of probability was asked to record the first digit of every number
encountered throughout a week. If the student bought a coffee for $3.45 the student would
record 3; if the student arrived at class at 10:05, the student would record 1, and so on. Phone
numbers, zip codes and student id numbers were not allowed. Then the student was asked to
write a table with the relative frequency of each first digit recorded. This student produced a
perfectly uniform table, which said that each number was equally likely to happen: relative
frequency of 1 was 1/9, relative frequency of 2, 1/9, and so on. Do you think this student
used observed data to do this homework?
The observation of many games like those dice games just mentioned made dice
players in the sixteenth and seventeenth century consider that there was a difference
between the relative frequencies, whether practically significant or not, and ask for
an explanation. If, playing with three dice, 9 and 10 points may each be obtained
in 6 different ways, they thought, why was there a difference between the relative
frequencies observed? Similarly, if playing with two dice, 7 and 8 each may be obtained
in 3 different ways, why was there a difference in the relative frequencies observed?
(Apostol 1969)
We could replicate the experience of the dice players playing the games of Section 1.2.3
by conducting an experiment with fair dice bought in some store. Equally likely numbers in
a single six-sided die, for example, is a reasonable model assumption if the information we
possess about the die is that it is symmetric or fair, and we do not possess any other infor-
mation. The observations and concerns of gamblers were based on that assumption. If the
dice used were fair, why were the frequencies observed in those games different from what they
expected based on their model?
In the case of the game consisting of rolling two dice, a repetition of the experiment would
consist of rolling two dice and recording the two numbers as a pair, for example (3,2), and
then, separately, the sum of the pair, in this case 5. Repetition of trials, say m times, and
recording how many trials gave a sum of 8 and how many of 7 out of the m trials would give
an approximation to the frequencies of 8 and 7. The number of repetitions, m, would have
to be large. Exercise 1 in Section 1.8 invites you to do that.
A trial of the experiment that would help us estimate the frequency of 9 and 10 for the
sum of the points in the roll of three dice would consist of rolling three dice and recording
the sum. Repetition of trials m times and recording the proportion of the m trials giving 9 or
10 would give us the approximation sought.
a. Determine the probability model to use, for example, a fair die (numbers 1 to 6 each
with the same probability of 1/6)
b. Define what a trial consists of, for example, roll a die twice
c. Determine what to record at each trial, for example, we will record the sum of the
numbers
d. Repeat a), b), c) many times, say 10000
e. Calculate what you are looking for, for example, what proportion of the 10000 trials
gave us a sum equal to 7.
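The five steps above can be sketched in code. The book's own simulations are written in R (see Section 1.7); as an illustrative sketch only, here is the same recipe in Python for the two-dice example:

```python
import random

random.seed(42)  # fix the rolls so the run is reproducible

# a. Probability model: a fair die, numbers 1 to 6, each with probability 1/6.
# b. A trial: roll the die twice.
# c. At each trial, record the sum of the two numbers.
# d. Repeat the trial many times, say 10000.
trials = 10000
sums = [random.randint(1, 6) + random.randint(1, 6) for _ in range(trials)]

# e. Calculate what we are looking for: the proportion of trials with sum 7.
proportion_7 = sums.count(7) / trials
print(f"Proportion of sums equal to 7: {proportion_7:.4f} "
      f"(classical model says 6/36 = {6/36:.4f})")
```

With 10,000 trials the simulated proportion lands very close to the classical value 6/36, illustrating why many repetitions are needed for an accurate approximation.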
Repeating a trial many times requires patience, and lots of time, but it is worth doing. To
achieve an accurate approximation requires many trials. For that reason, software is often
used to conduct many trials of a simulation. Section 1.7 introduces the free software R and
gives R code to conduct the simulation in Chapter Exercise 1.
Example 1.2.1
These days, applets created for the purpose of simulating, under known assumptions, can
be found on many web sites. For example, a dice tossing applet that you can find at http://
www.randomservices.org/random/apps/DiceExperiment.html allows you to do the simula-
tions needed to determine how to answer the questions posed by gamblers that occupy our
attention in this section 1.2. For example, by setting n = 2 (number of dice), options “fair”
and Y = sum, and stop = 100 (number of trials), you will see the computer tossing two dice
and showing to you what numbers come up, and you will see their sum. You will see that a
sum equal to 7 appears more often than a sum of 8, even though the differences between
the relative frequencies are small. You can then do the analysis with n = 3 to see what you
discover about the question posed at the beginning of Section 1.2.3. If you are curious, you
can explore further to see if the conclusions are different when the die is not fair.
1.2.5 Exercises
Exercise 1. We mentioned at the beginning of this chapter that the probability of an outcome
could be found by observing many times the experimental outcome and counting how many
of the many times observed the outcome occurred. But we also said we could just subjectively
make up the probability. Still, we could have a mental model of the probability not based on
observation but some other knowledge. In which of these three categories would you place
Exercise 2. “Forensics sports analytics” uses probability reasoning to help identify and eliminate
corruption within the sports sector (Paulden 2016). Chris Gray (2015), a tennis follower, wrote
an article where he presented a version of the widely used (in tennis) IID probability model
for a player, player A, winning a tennis game. He gave the following model which depends
on the probability of player A winning a point on serve (denoted by p, and assumed constant)
Paulden (2016) talks about an alternative version of this model, the O’Malley tennis for-
mulae. Gray’s and O’Malley’s models are based on assumptions about the game, but they
are also filled with probabilities that were obtained from past data on many players. How do
you think you could validate either of the models mentioned by these authors? Use concepts
seen in Sections 1.1 and 1.2 of this chapter to answer.
Exercise 3. Think of a situation where you had a very clear model of how often something
that interests you would happen and your model clashed with the evidence you obtained
from repeated observations.
Defining the probability of an event E as the long-run relative frequency of the event in a
large number of trials, m, is known as the frequentist definition of the probability of an event.
Example 1.2.2
I rolled a die 1,000,000 times and found that the number 6 came up 400,000 times. According
to the frequentist definition of probability, this means that we estimate the probability of a 6
to be 0.4. Because we simulated 1,000,000 rolls, we are almost convinced we are very close
to the true probability and can conclude that the die is not fair. By the law of large numbers,
we give high probability to the fact that

400000/1000000 − P(6)

is close to 0. P(6) means the true probability of 6, which based on our experimentation is very close
to 0.4.
Statisticians, data scientists, insurance companies, and managers of social media make
wise use of the law of large numbers in designing their methods to analyze data and their
policies and resources. The relative frequency with which something happens to a large
number of subjects, is a good approximation to the true probability that this something
happens to an individual.
1.2.7 Exercises
Exercise 1. Comment on the following statement: “I cannot predict one fair coin toss, but I
can predict quite accurately that the proportion of heads in 1,000 tosses of a fair coin will be
close to the theoretical probability of 1 / 2 assumed by the equally likely outcomes model.”
Back in the seventeenth century, it was clear by repeated experimentation (gambling) that
there was a difference in frequencies that did not conform to intuition. The law of large
numbers then made it clear that the relative frequencies obtained in repeated experimen-
tation should be trusted. How to reconcile observation with the model gamblers believed
in? How to translate that discrepancy into mathematics? What was wrong with the gamblers’
model? Between 1613 and 1623 Galileo Galilei gave an explanation in Sopra le Scoperte dei
Dadi (On a discovery concerning dice).
This implicitly assumes independence of the rolls, so that all 216 possible outcomes are
equally probable.
Although limited to this special case, Cardano and Galileo provided a theoretical counter-
part to the observed phenomena by modeling the situation.
In spite of the simplicity of the dice problem, several great mathematicians failed to solve
it because they forgot about the order of the cast. (This mistake is made quite frequently,
even today.) (Szekely 1986)
Chapters 2 and 3 of this book further discuss the role that the independence assumption
plays in the calculation of probabilities.
We have seen that gamblers observed a difference between relative frequencies, whether
significant or not, asked for an explanation and got an explanation from mathematicians.
The explanation just described is a precursor of the concepts of sample space, events and
random variables, three fundamental concepts of modern probability theory introduced in
Chapter 2.
Galileo's solution to the dice problem implicitly used what we now call the classical defi-
nition of the probability of an event E, namely

P(E) = (number of outcomes favorable to E) / (total number of logically possible outcomes).

Finding the probability entailed knowing all the logically possible cases and being able to
count the ones that were favorable. Implicitly, this assumed that all outcomes were equally
likely, and it also assumed independence. The mistake of the gamblers was that they were
not counting all the logically possible cases.
Example 1.3.1
In the case of the two dice, let’s go back to Table 1.1 to see that there are 36 logically possible
outcomes that we enumerated there.
If we call the case of a 7 "favorable," the number of favorable outcomes where the sum
is 7 is 6 out of 36, so the classical probability is 6/36, whereas the number of favorable out-
comes where the sum is 8 is 5, making the classical probability 5/36. Not a very significant
difference, yet a difference that helps explain the gamblers' observed difference. Denoting
probability by P,

P("sum of 2 dice is 7") = 6/36,  P("sum of two dice is 8") = 5/36.
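The counting in Example 1.3.1 can be checked with a few lines of R; this is a sketch that uses `expand.grid` to enumerate the 36 ordered outcomes of two dice:

```r
# Enumerate the 36 equally likely ordered outcomes of rolling two dice
outcomes <- expand.grid(die1 = 1:6, die2 = 1:6)
sums <- outcomes$die1 + outcomes$die2
sum(sums == 7) / nrow(outcomes)  # 6/36, about 0.167
sum(sums == 8) / nrow(outcomes)  # 5/36, about 0.139
```

Because every ordered pair is listed exactly once, dividing the count of favorable rows by 36 is exactly the classical definition of probability.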
Example 1.3.2
In the case of the three dice, let’s go back to our earlier discussion to see that there are 216
logically possible outcomes that we enumerated there.
If we call the case of a 9 "favorable," the number of favorable outcomes where the sum is
9 is 25 out of 216, making the probability of 9 equal to 25/216, whereas the number of favorable
outcomes where the sum is 10 is 27, making the probability of 10 equal to 27/216. Not a very
significant difference, yet a difference that helps explain the gamblers' observed difference.

P("sum of 3 dice is 9") = 25/216,  P("sum of 3 dice is 10") = 27/216.
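Galileo's count of the 216 ordered outcomes can be reproduced the same way (a sketch in R):

```r
# Enumerate the 216 equally likely ordered outcomes of rolling three dice
three <- expand.grid(die1 = 1:6, die2 = 1:6, die3 = 1:6)
s3 <- rowSums(three)
sum(s3 == 9)   # 25 favorable outcomes for a sum of 9
sum(s3 == 10)  # 27 favorable outcomes for a sum of 10
```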
Using rules that we learn in Chapter 3, we would support de Méré's calculation as follows:
P(at least one (6,6) in 24 throws) = 1 − P(no (6,6) in 24 throws) = 1 − (35/36)^24 = 0.4914039.
Alternatively, you could get the same answer by looking at Table 1.1 to find the proba-
bility of (6,6) and then using the complement rule and product rule for independent events
presented in Chapter 3 of this book.
This result indicates that the probability of getting at least one (6,6) is less than 0.5; it is
more likely that there will be no (6,6) pair in 24 throws.
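The arithmetic above is quick to verify in R (a sketch using the complement rule for independent throws):

```r
# Complement rule with 24 independent throws of two dice
p_no_66 <- 35/36                 # P(no (6,6) in a single throw)
p_at_least_one <- 1 - p_no_66^24 # P(at least one (6,6) in 24 throws)
p_at_least_one                   # about 0.4914, less than 0.5
```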
• Axiom 1. The probability of the biggest set, the sample space S, containing all possible
outcomes of an experiment, is 1.
• Axiom 2. The probability of an event is a number between 0 and 1.
• Axiom 3. If there are events that cannot happen simultaneously (are mutually exclusive),
the probability that at least one of them happens is the sum of their probabilities.
Measure theory is a theory of sets. Probability is a measure defined on sets. What is remark-
able is that the frequentist, the classical, and the subjective definitions of probability satisfy
the axioms. The assumption of the existence of a set function P, defined on the events of a
sample space S, and satisfying Axioms 1,2,3, constitutes the modern mathematical approach
to probability theory. Any function P satisfying the axioms is a probability function. With
those axioms, it is straightforward to prove the most important properties of probability,
which we do in Chapter 3.
Because P is a function defined on events, and events are, mathematically speaking, sets, it
is necessary to use the algebra of sets when studying probability. Chapter 2 guides your review
of the algebra of sets. The axiomatic approach allows us to talk about probability defined in
continuous sample spaces, and probability models defined on continuous random variables,
which we do in Chapters 7 and 8. But discrete sample spaces and discrete random variables
equally fall under the umbrella of the axiomatic approach. We study those in Chapters 2 to 6.
By probability modeling in data science we mean the act of using probability theory to model
what we are interested in measuring. The conclusions that we reach will be as valid as the
model is. Laplace (1749–1827) used to say that the most important questions of life are
indeed for the most part only problems of probability. In most of these problems, we build
models to describe conditions of uncertainty and provide tools to make decisions or draw
conclusions on the basis of such models.
Not only are probabilistic methods needed to deal with noisy measurements,
but many of the underlying phenomena, including the dynamic evolution of the
internet and the Web, are themselves probabilistic in nature. As in the systems
studied in statistical mechanics, regularities may emerge from the more or less
random interactions of myriad of small factors. Aggregation can only be captured
probabilistically.
(Baldi et al. 2003)
Example 1.4.1
Suppose there are two classes of email, good email and spam email. We let the random
variable Y = 1 if the email is good, and Y = 2 if the email is spam. Let W represent a new
email message. Our decision is to classify a new email message W containing the word
"urgent" into class 1, good email, if

P(W | Y = 1) P(Y = 1) > P(W | Y = 2) P(Y = 2).

Otherwise, the email W is classified as spam and rejected by the server. Why we use this
decision rule will become very clear to you after you study Chapter 3. The conditional
probabilities P(W | Y = 1) and P(W | Y = 2) and the prior probabilities P(Y = 1) and
P(Y = 2) are known and are based on past observations of the frequency of good and spam
messages and of the contents of good and spam messages.
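A minimal sketch of this decision rule in R, with made-up probabilities (all four numbers below are hypothetical, chosen only for illustration; in practice they would be estimated from past messages):

```r
# Hypothetical values, for illustration only
p_w_given_good <- 0.05  # assumed P(W | Y = 1): "urgent" appears in a good email
p_w_given_spam <- 0.40  # assumed P(W | Y = 2): "urgent" appears in a spam email
p_good <- 0.7           # assumed prior P(Y = 1)
p_spam <- 0.3           # assumed prior P(Y = 2)

# Classify as good email if P(W | Y = 1) P(Y = 1) > P(W | Y = 2) P(Y = 2)
classify <- if (p_w_given_good * p_good > p_w_given_spam * p_spam) "good" else "spam"
classify  # "spam" with these numbers: 0.035 < 0.12
```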
Another area of machine learning where probability plays a very important role is text
processing. Indexing, scoring, and categorization of text documents are required by search
engines such as Google (see http://www.stat.ucla.edu/~jsanchez/oid03/csstats/cs-stats.html).
The areas of application of probability mentioned should give you an idea of possible career
paths that can be pursued with sound skills in probabilistic reasoning like those you will acquire
by studying this book. There are many other career paths that will become apparent as you
study the book. Actuarial science, the science of insurance, for example, cannot be pursued
without first passing the first actuarial exam, for which this book prepares you well. At
http://q38101.questionwritertracker.com/EQERFHHR/ry.com you will find sample exams.
The dice problem has some links with 19th and 20th century microphysics. Suppose
that we play with particles instead of dice. Each face of the die represents a phase
cell on which the particles appear randomly and which characterizes the state of
the particles. Here dice is equivalent to the Maxwell-Boltzmann model of particles.
In this model (used mostly for gas molecules) every particle has the same chance of
reaching any cell, so in a list of equally probable events, the order must be taken into
account, just as in the dice problem. There is another model in which the particles
are indistinguishable, and for this reason the order must be left out of consideration
when counting the equally possible outcomes. This model is named after Bose and
Einstein. Using this terminology the point of the (dice paradox studied in this chapter),
is that dice are not of the Bose-Einstein but of Maxwell-Boltzmann type. It is worth
mentioning that none of these models are correct for bound electrons because in this
case, only one particle may occupy any cell. In dice-language it means that after having
thrown a 6 with one of the dice, we can not get another 6 on the other dice. This is the
Fermi-Dirac model. Now the question is which model is correct in a certain situation.
(Beside these three models, there are many others not mentioned here.) Generally we
can not choose any of the models only on the basis of pure logic. In most cases it is
experience or observation that settles the question. But in the case of dice, it is obvious
that the Maxwell-Boltzmann model is the correct one and at this moment that is all
we need.
(Szekely 1986, 3–4)
1.5 Probability is not just about games of chance and balls in urns
We have talked a lot about dice in this chapter. That is because the mathematical theory
of probability had its origin in questions that grew out of games of chance. The reader
will find more dice and even balls and urns in this book and in almost every probability
theory book that comes to the reader’s attention, but not because probability theory is
about them.
Figure 1.3 A simple six-sided die model helps clarify a rather complicated physics concept:
the 36 equally likely microstates (ordered outcomes) of two dice collapse into 11 macrostates,
the sums 2 through 12.
Source: http://hyperphysics.phy-astr.gsu.edu/hbase/Therm/entrop2.html.
The early experts in probability theory were forever talking about drawing colored
balls out of “urns.” This was not because people are really interested in jars or boxes
full of a mixed-up lot of colored balls, but because those urns full of balls could
often be designed so that they served as useful and illuminating models of important
real situations. In fact, the urns and balls are not themselves supposed real. They are
fictitious and idealized urns and balls, so that the probability of drawing out any one
ball is just the same as for any other.
(Weaver 1963, 73)
Example 1.5.1
In India in 2012, the probability of dying before age 15 was 22%. The parents of 5 children are
worried that dying before age 15 could happen to their children. One can think of a box with
100 balls, 22 of which are red and 78 of which are green. What is the probability of drawing,
in succession, 5 red balls with replacement? Would this box model simulate well the real
situation of dying before age 15, even though it is just a box with balls? Freedman, Pisani, and
Purves, authors of an introductory statistics book, introduced probability using box models
like this (Freedman, Pisani, and Purves 1998).
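The box model of Example 1.5.1 is easy to sketch in R, comparing the exact answer (22/100)^5 with a simulated relative frequency (the number of simulated trials, 10,000, is an arbitrary choice):

```r
# Box model: 22 red balls (dies before 15) and 78 green balls, 5 draws with replacement
p_exact <- (22/100)^5  # classical answer, about 0.000515
set.seed(1)
box <- rep(c("red", "green"), c(22, 78))
draws <- replicate(10000, all(sample(box, 5, replace = TRUE) == "red"))
mean(draws)            # simulated relative frequency, close to p_exact
```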
The reader should be warned that science books use different names for the same con-
cepts that we talk about in this book. A book in physics, another in psychology, another in
linguistics, for example, may use the same "rolling two dice" experiment model that
you saw in this chapter, yet each of them uses different names for the total number of out-
comes, for the number of sets, for the sum, and other such concepts that are very standard in
probability theory books. Physics, psychology, and linguistics require the background that
you are going to learn in this book to solve their seemingly unrelated problems. The fact is
not that probability theory consists of a bag of an endless number of tricks to solve problems
Question 1. You are playing with three fair six-sided dice. You are interested in the sum of
the points. Which is more favorable: 9 or 10? That is, if you had to bet on 9 or 10, which one
would you choose?
a. 9
b. 10
c. either one
Question 2. You are playing with two fair six-sided dice. You are interested in the sum of
the points. Which is more favorable: 7 or 8? That is, if you had to bet on 7 or 8, which one
would you choose?
a. 7
b. 8
c. either one
a. models
b. data
c. subjective opinion
d. all of the above
e. none of the above
Question 5. The classical definition of probability has some limitations. Which of the following
are some limitations?
a. It cannot be used when the outcomes are not equally likely.
b. It can only be used when there are finitely many or countably infinitely many outcomes.
c. It does not satisfy Kolmogorov’s axioms.
d. We could not double-check it with long observations.
a. counting not only the favorable partitions but also the number of permutations of
each partition.
b. using the law of large numbers.
c. using your subjective opinion.
d. taking into account that the number of possible outcomes is any of the numbers
from 3 to 18, that is, 16 outcomes. One of those outcomes, 14, is favorable, so 1/16
will be the correct probability.
Question 7. The dice model that reconciled observations with the intuition of seventeenth-cen-
tury gamblers is similar to what model for particles in physics?
a. Fermi-Dirac’s
b. Bose-Einstein’s
c. Maxwell-Boltzmann’s
d. Jaynes’
Question 8. Use the classical definition of probability to find the probability that in two rolls
of a four-sided die the sum is 5.
a. 1/5
b. 1/4
c. 1/3
d. 1/8
Question 9. The law of large numbers (LLN) added only what to the belief that more obser-
vations obviously give more accurate estimates of the chances?
a. The LLN showed that the probability that the estimate is close to the truth increases
with the number of trials.
b. The LLN tells us that we can be more certain that long observations give us accurate
estimates the more the observations made.
c. The LLN legitimizes the frequentist definition of probability.
d. All of the above.
a. calculate probabilities of outcomes that can take any value in an interval of the real line
b. use the same rules of probability that are consistent with axioms in both the discrete
and continuous outcomes scenario
c. none of the above
d. (a and b)
R and RStudio
R code is code that is understood by the software R, which is widely used by data scientists in
their day-to-day data analysis routines. It is also used to generate random numbers that
allow us to simulate many random phenomena.
We can simulate many rolls of three dice and compute the probability of the event of
interest in seconds using R.
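For example, a simulation of many rolls of three dice might look like the following sketch (the seed and number of trials are arbitrary choices):

```r
# Simulate 100,000 rolls of three fair dice and estimate P(sum = 9) and P(sum = 10)
set.seed(2020)
n <- 100000
rolls <- matrix(sample(1:6, 3 * n, replace = TRUE), ncol = 3)
sums3 <- rowSums(rolls)
mean(sums3 == 9)   # close to 25/216, about 0.116
mean(sums3 == 10)  # close to 27/216, about 0.125
```

By the frequentist definition of probability, these relative frequencies approach the classical answers 25/216 and 27/216 as the number of simulated rolls grows.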
R is free, open-source software that can be downloaded onto any computer. RStudio is
an interface that makes working with R much easier; to use it, R must be installed. R and
RStudio can be downloaded from
https://cran.r-project.org/
https://www.rstudio.com/
This gives R the order to sample 5 numbers from 1 to 6, where each number has probability
1/6 on every draw (sampling with replacement, guaranteed by typing replace = T).
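From the description above, the command in question is presumably a call to R's `sample` function along these lines (the seed is added only for reproducibility):

```r
set.seed(123)                # optional: makes the random draws reproducible
sample(1:6, 5, replace = T)  # five draws from 1 to 6, each number with probability 1/6
```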
1.7 R code
Exercise 1. You will do a simulation in this problem. A trial of this simulation consists of roll-
ing two fair six-sided dice of different colors. The numbers on both are recorded as a pair (a,b),
where a is the number on the first die and b is the number on the second. For example, you
could obtain (3,2), where 3 is the number on the first die, and 2 is the number on the second die.
You will do 125 trials, by hand or using software. If you use the R code given in section 1.7
you could do many more trials. Alternatively, you may use the applet introduced in
Example 1.2.1.
a. At each trial, record the sum of the two numbers. For example, if the outcome is (3,2)
the sum is 5. The sum is called a random variable because its value is not known until
you actually observe it; it is determined by chance. We will call this random
variable Y.
Table 1.4 below illustrates the process. For someone to double-check your numbers,
they need to see what they are, so a table of some of the trials is always recommended.
Record some of your trials in Table 1.4.
Trial | Outcome (a, b) | Sum Y
1 | |
2 | |
3 | |
4 | |
….. | |
125 | |

Total number of trials: ____  With Y = 7: ____  With Y = 8: ____
b. Based on the results recorded on Table 1.4, what proportion of the trials gave you a sum
equal to 7 and what proportion gave you a sum equal to 8? Compare with the result you
would get using the applet introduced in Example 1.2.1, run 10000 times. Explain the
difference using the frequentist definition of probability introduced in Section 1.2.6.
c. If you used the classical definition of probability introduced in Section 1.3, what would be
the probability that the sum of the two dice is 7? What assumption would you have to make?
Exercise 2. As we have seen in this chapter, Galileo, and Cardano before him, suggested that
in order to educate our intuition about the dice games of their time discussed in this chapter,
we should start by considering all the possible outcomes of the games. For example, the game
with the two dice has the outcomes and the corresponding values of the random variable
representing the sum indicated in Table 1.1.
If we assume that all faces of one die are equally likely to appear (the die is assumed
to be fair), then how frequently each outcome appears is given by counting
the number of times it appears (the number of favorable cases). The classical probability
would be that number divided by 36.
a. Write a table indicating in one column the value of the sum of the faces of two dice
and in the second column the number of times the sum appears divided by 36. Is 7
or 8 more frequent?
b. Is there a mathematical formula that would model the value of the sum of two dice?
Why did you write the formula you wrote? Talk to friends about it.
c. Create a table with all the possible outcomes of the roll of three dice and the value
of the sum associated with each outcome. In that table, you write (a,b,c), where a =
the number in the first roll, b = the number in the second roll and c = the number
in the third roll. The sum = a + b + c. Then write separately another table that has
in the first column, the value of the sum, and in the other, the relative frequency of
the sum. Is 9 or 10 more frequent?
Exercise 4. Suppose two players, A and B, toss a fair coin in turn. The winner is the first
player to throw a head. Do both players have an equal chance of winning the game? You may
investigate this question doing a simulation.
The probability model is a fair coin. A trial of the simulation consists of a game. For exam-
ple, A starts and gets a head in the first toss. Another example, A starts and gets a tail in the
first toss, B gets a tail in the second toss, and A gets a head on the third toss.
Repeat the trials 100 times recording whether A or B wins. At the end, compute the
relative frequency of A winning and the relative frequency of B winning. Then answer the
question asked.
Exercise 5. Suppose you are playing a game that involves flipping two balanced coins simul-
taneously. To win the game you must obtain “heads” on both coins. What is your classical
probability of winning the game? Explain.
Exercise 6. Esha and Sarah decide to play a dice rolling game. They take turns rolling two fair
dice and calculating the difference (larger number minus the smaller number) of the numbers
rolled. If the difference is 0, 1, or 2, Esha wins, and if the difference is 3, 4 or 5, Sarah wins.
Is this game fair? Explain your thinking.
Exercise 7. What is the proportion of three-letter words used in sports reporting? Write down
a thoughtful guess. Then design an experiment to find out.
Exercise 8. The molecule DNA determines the structure not only of cells but of entire organ-
isms as well. Every species is different due to differences in DNA. Even though DNA has
the same structure for every living thing, the major differences arise from the sequence of
compounds in the DNA molecule. The four base molecules that form the structure of DNA
are adenine, guanine, cytosine, and thymine, often referred to as A, G, C, and T for short. The
entire DNA sequence is formed of millions of such base molecules, so there are a lot of different
combinations and, hence, lots of different species of organisms.
Research what a palindrome is and come up with a strategy to conclude whether palin-
dromes are randomly placed in DNA or not.
Exercise 9. What does the forecast “60% chance of rain today” mean? Do you think the fore-
caster has erred if there is no rain today?
Apostol, Tom M. 1969. Calculus, Volume II (2nd edition). John Wiley & Sons.
Baldi, Pierre, Paolo Frasconi, and Padhraic Smyth. 2003. Modeling the Internet and the Web.
Probabilistic Methods and Algorithms. Wiley.
Browne, Malcolm W. 1998. “Following Benford’s Law, or Looking out for No. 1.” New York
Times, Aug. 4, 1998. http://www.nytimes.com/1998/08/04/science/following-benford-s-
law-or-looking-out-for-no-1.html
Carlton, Matthew A., and Jay L. Devore. 2017. Probability with Applications in Engineering,
Science and Technology, Second Edition. Springer Verlag.
Diaconis, Persi, and Brian Skyrms. 2018. Great Ideas about Chance. Princeton University Press.
Erickson, Kathy. 2006. NUMB3RS Activity: We’re Number 1! Texas Instruments Incorporated,
2006. https://education.ti.com/~/media/D5C7B917672241EEBD40601EE2165014
Everitt, Brian S. 1999. Chance Rules. New York: Springer Verlag.
Freedman, David, Robert Pisani, and Roger Purves. 1998. Statistics. Third Edition. W.W. Norton
and Company.
Goodman, Joshua, and David Heckerman. 2004. “Fighting Spam with Statistics.” Significance 1,
no. 2 (June): 69–72. https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2004.021.x
Goodman, William. 2016. “The promises and pitfalls of Benford’s law.” Significance 13, no. 3
(June): 38–41.
Gray, Chris. 2015. "Game, set and stats." Significance (February): 28–31.
Hald, Anders. 1990. A History of Probability and Statistics and Their Applications before 1750.
John Wiley & Sons.
Hill, Theodore P. 1999. “The Difficulty of Faking Data.” Chance 12, no. 3: 27–31.
Lanier, Jaron. 2018. Ten Arguments For Deleting your Social Media Accounts Right Now. New York:
Henry Holt and Company.
Paulden, Tim. 2016. “Smashing the Racket.” Significance 13, no. 3 (June): 16–21.
Scheaffer, Richard L. 1995. Introduction to Probability and Its Applications, Second Edition.
Duxbury Press.
Stigler, Stephen M. 2015. “Is probability easier now than in 1560?” Significance 12, no. 6
(December): 42–43.
Szekely, Gabor J. 1986. Paradoxes in Probability Theory and Mathematical Statistics. D. Reidel
Publishing Company.
Tarran, Brian. 2015. “The idea of using statistics to think about individuals is quite strange.”
Significance 12, no. 6 (December): 16–19.
Venn, John. 1888. The Logic of Chance. London, Macmillan and Co.
Weaver, Warren. 1963. Lady Luck: The Theory of Probability. Dover Publications, Inc. N.Y.
Figure 2.1 (pairs of genotypes):
aa aa Aa Aa
bb bB Bb BB
cc cc Cc Cc
dD dd DD Dd
ee ee Ee Ee
2.1 Learning the vocabulary of probability: experiments, sample
spaces, and events.
Probability theory assigns technical definitions to words we commonly use to mean other
things in everyday life.
In this chapter, we introduce the most important definitions relevant to probability modeling
of a random experiment. A probability model requires an experiment that defines a sample space,
S, and a collection of events which are subsets of S, to which probabilities can be assigned. We
talk about the sample space and events and their representation in this chapter, and introduce
probability in Chapter 3.
A most basic definition is that of a random experiment, that is, an experiment whose out-
come is uncertain. The term experiment is used in a wider sense than the usual notion of a
controlled laboratory experiment, such as one to find the cure for a disease, or the tossing
of coins and rolling of dice. It can mean a naturally occurring phenomenon (e.g., daily
measurement of river discharge, or counting hourly the number of visitors to a particular
web site), a scientific experiment (e.g., measuring the blood pressure of patients), or a sampling
experiment (e.g., drawing a random sample of students from a large university and recording
their GPAs). Throughout this book, the reader will encounter numerous experiments. Once
an experiment is well defined, we proceed to enumerate all its logically possible outcomes
in the most informative way and to define events that are logically possible. Only when this
is done can we talk about probability. This section serves as a preliminary introduction
to the main concepts; later sections in this chapter discuss each of the
concepts in more detail.
Denny and Gaines, in their book Chance in Biology, introduce fringeheads—fish that live in
the rocky substratum of the ocean. The authors describe how when an intruder fringehead
approaches the living shelter of another fringehead, the two individuals enter into a ritual of
mouth wrestling with their sharp teeth interlocked. This is a mechanism to establish domi-
nance, the authors add, and the larger of the two individuals wins the battle and takes over
the shelter, leaving the other homeless. Fringeheads are poor judges of size, thus they are
incapable of accurately evaluating the size of another individual until they begin to wrestle.
When they enter into this ritual they do not know what their luck will be. As the authors claim,
since a fringehead cannot predict the result of the wrestling experiment with complete certainty
before leaving the shelter to defend it, but has some notion of how frequently
it has succeeded in the past, these wrestling matches are random experiments. Every
time a fringehead repeats the experiment of defending its home, there is one of two possible
outcomes (an outcome is a specific output of the experiment). The set of all logically possible
outcomes of an experiment is called a sample space for that experiment. In the case of the
fringehead wrestling, there are only two possible elementary logical outcomes, success (s)
or failure ( f ), and these together form the sample space. We say that S = {s, f}. This S is,
to put it more technically, a discrete finite sample space. (Denny and Gaines 2000, 14)
Individual outcomes of an experiment are elementary events. For example, in the fringehead
example, s is an elementary event, and f is another elementary event. Elementary events can
be combined to form compound events.
An experiment, the logical outcomes of the experiment, and hence the sample space vary
depending on the problem being studied.
Example 2.1.1
For physicists, for example, a random experiment may consist of observing the number of
photons measured by a detector (Širca 2016), with sample space S = {0, 1, 2, ...}, the set of
nonnegative integers. An elementary event in the photon experiment is a single integer.
A compound event is, for example, the event A that the detector sees more than 10 photons,
with A = {11, 12, ...}.
Example 2.1.2
In genetics, consider another example explained by the Random Project (Siegrist 1997).
The sample space for the mating of gy, gy genotypes is:
S = { gg , gy , yg , yy }.
The genotypes gg and yy are called homozygous because the two alleles are the
same, while the genotypes gy and yg are called heterozygous because the two alleles
are different.
Example 2.1.3
Selecting a sample of three persons from a group of six people to form a committee of three
people is an experiment that may result in the choice of any one of the (6 choose 3) = 20
committees, while selecting a treasurer, a captain, and a typist out of the group of six is an
experiment that results in 6 × 5 × 4 = 120 outcomes.
Example 2.1.4
Playing a lottery where you must select five numbers from 49 is an experiment that has
(49 choose 5) = 1,906,884 possible outcomes.
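The counts in Examples 2.1.3 and 2.1.4 can be checked with R's built-in counting functions:

```r
choose(6, 3)   # 20 possible committees of 3 chosen from 6 people
6 * 5 * 4      # 120 ordered choices of treasurer, captain, and typist
choose(49, 5)  # 1906884 possible lottery outcomes
```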
Box 2.1
Math Tidbit
(6 choose 3) = 6!/(3! 3!) = (6 × 5 × 4 × 3 × 2 × 1)/((3 × 2 × 1)(3 × 2 × 1)) = 20.
2.1.1 Exercises
Exercise 1. Consider the experiment of monitoring the credit card activity of an individual to
detect whether fraud is committed. Write the sample space of this experiment.
Exercise 2. Write the sample space of each of the following experiments:
• Sampling three students at random to determine whether they have bought Football
Season tickets.
• Planting three tomato seeds and checking whether they germinate or not.
• Tossing a coin three times and checking whether head or tail appears.
Exercise 3. Consider the following experiment and list the outcomes of the sample space:
observing wolves in the wilderness until the first wounded wolf appears.
Exercise 4. Consider the following experiment and list the outcomes of its sample space:
screening people for malaria until the first 3 persons with malaria are found.
Exercise 5. This problem is inspired by Mosteller et al. (1961, 5). Suppose parents are classified
on the basis of one pair of genes, and that d represents a dominant gene and r represents
a recessive gene. Then a parent with genes dd is pure dominant, dr is hybrid, and rr is pure
recessive. The pure dominant and the hybrid are alike in appearance. Offspring receive one
gene from each parent, and are classified the same way. Write the sample space for the
mating of dr with rr.
Exercise 6. A Chess club is debating which two people in the club should be supported to
attend the world championship. They decide to select the two people at random by placing
their name in a box and drawing the two names without replacement. The people in this
club are:
2.2 Sets
In talking more formally from now on about the sample space S and events, we will need the
concept of a set. The mathematical theory of probability is now most effectively formulated
using the terminology and notation of sets. Events are sets. The foundation of probability
in set theory was laid in 1933 by the Russian probabilist A. Kolmogorov.
Example 2.2.1
Consider the set H of numbers that may result from the toss of a 6-sided die. We may list it as:
H = {1,2,3,4,5,6}.
Example 2.2.2
Let A = {"an odd number less than 7"}. A is a subset of H, in Example 2.2.1.

Definition 2.2.3
Two sets A and B are equal (A = B) if and only if they have exactly the same elements.
If one of the sets has an element not in the other, they are unequal and we write A ≠ B.

Example 2.2.3
The sets A = {2,4,6} and B = {4,6,2} are equal, but the sets W = {5,1,2} and T = {3,1,4}
are not equal.

The rest of this chapter makes extensive use of sets and their properties. Readers who need
a refresher in the theory of sets and Venn diagrams may benefit from first studying the lesson on
sets found at http://www.randomservices.org/random/foundations/Sets.html, a resource of
the Random Project (Siegrist 1997), and then coming back to this chapter, where we use the
theory under the assumption that the reader understands the concept of a set. The rest of the
chapter relies on definitions given in that resource and is devoted to naming and illustrating
sets as used in probability theory.
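Set equality as defined above can be checked directly in R, which treats vectors as sets for this purpose through `setequal`:

```r
# Set equality ignores the order in which elements are listed
setequal(c(2, 4, 6), c(4, 6, 2))  # TRUE: same elements, as in Example 2.2.3
setequal(c(5, 1, 2), c(3, 1, 4))  # FALSE: W and T do not have the same elements
```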
2.2.1 Exercises
Exercise 1. Do all the computational exercises at the end of the following lesson on sets that
you should review before you go on studying this chapter:
http://www.randomservices.org/random/foundations/Sets.html
Exercise 2. Consider the set of movies showing in the movie theaters of the town where you
live. List their names. Then create a subset containing the drama movies. If your town has
no movie theaters, look at the nearest town.
Exercise 3. An academic department has 13 members. Their names are Abbott, Cicirelli,
Cuellar, Liu, Pham, Mason, Danielian, Abe, Martinez, Mojica, Naseri, Engle, and Zaplan. From
this set, list the subset containing the names that start with M.
Exercise 4. One of the many tasks in machine learning is to classify objects or individuals
into subsets with common characteristics. What subsets of the periodic table could we make?
Exercise 5. (This exercise is based on an activity by Weber, Even, and Weaver (2015).) Two
people are playing a game of "Odd or Even" with two six-sided fair dice. A trial of the game
Suppose you play a game consisting of many trials. Regarding the points earned at the
end of the game, what is the set of possible outcomes? List the outcomes.
Box 2.2
Sample Spaces
A sample space is finite if it contains a finite number of elements. The outcomes of a single
wrestling match of the fringehead form a finite discrete sample space.
A sample space is countably infinite if its elements can be counted, i.e., can be put in
one-to-one correspondence with the positive integers. The number of wrestling matches
a fringehead will enter before its first loss is a discrete sample space with an infinite
number of logically possible outcomes, assuming the fringehead is immortal.
A sample space is uncountably infinite if it cannot be put in one-to-one correspondence
with the positive integers. The time it takes to complete a standard homework is an un-
countably infinite sample space.
Finite and countably infinite sample spaces are also called discrete sample spaces. Un-
countably infinite sample spaces are also called continuous sample spaces. Part I of this
book is about the former, Part II about the latter.
S = {the squirrel ate the grapes , the squirrel did not eat the grapes }
does not seem to be a good description of the possible outcomes of the random phenomenon
that interests us. Instead,
Example 2.3.1
“Receiving a letter grade after completing your intro probability class at a public American
university” like the one where the author works is a 7-outcome experiment. The sample space
of the logically possible outcomes of the experiment is
S = {A, B, C, D, F, P, NP}.
Example 2.3.2
Tossing three coins—a dime, a quarter, and a penny—in a row and keeping track of the sequence
of heads and tails that results is an 8-outcome experiment. Thus the sample space is the set
S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT},
which has eight elements and provides a list that represents the logically possible outcomes
of one toss, if we understand that the first letter in a triple designates the outcome for the
dime, the second letter that for the quarter, and the third letter that for the penny. Thus HTH
means that the dime fell heads, the quarter fell tails and the penny fell heads. Every logically
possible outcome of the experiment corresponds to exactly one element of the set S.
Sometimes, when the sample space is not too big, we could use trees as a tool to find out
what to put in the list of elements of S. For example, in the case of the three coins experi-
ment, Figure 2.2 shows that the way to construct the sample space for an experiment like
this is to first consider the possible outcomes for the first coin, then the ones for the second
and then the ones for the third.
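The tree construction just described can be mirrored in R (the language used in Section 2.9 of this chapter); the following sketch enumerates the eight triples by crossing the outcomes of the three coins:

```r
# Enumerate the 8-outcome sample space of the three-coin experiment
# (dime, quarter, penny), mirroring the tree in Figure 2.2.
coins <- c("H", "T")
grid <- expand.grid(dime = coins, quarter = coins, penny = coins)
# Collapse each row of the grid into a three-letter outcome such as "HTH"
outcomes <- apply(grid, 1, paste, collapse = "")
outcomes
length(outcomes)  # 8
```

Note that expand.grid varies its first factor fastest, so the triples come out in a different order than a hand-drawn tree would list them, but the set of outcomes is the same.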
where, for example, ssf denotes an outcome where the individuals from neighborhoods a
and b practice capoeira, and the one from neighborhood c does not.
Example 2.3.4
Consider the experiment of observing SAT scores for a student randomly chosen among those
that have taken the SAT. Note: SAT is a standardized test for college admissions. Scores are
multiples of 10, and therefore discrete numbers. There are three sections: reading, math,
and writing, each section with positive scores between 200 and 800. So the total possible
score is between 600 and 2400. Since the sample space is a very large set, we may describe
it by a rule instead of listing all its elements: S = {x : x = 600, 610, 620, . . . , 2400}.
2.3.2 Exercises
Exercise 1. Five students, numbered 1,2,3,4,5 to keep their identity secret, are competing
for the “best data scientist of the year” and the “best mathematical statistician of the year”
awards, offered by the undergraduate student association in their school. All five students
have applied for the two awards. A student can get at most one award. What are logically
possible outcomes of this experiment? Give the most informative listing of the sample space.
Explain the notation you use.
Exercise 3. As an international student coming to the United States, there are three different
student visas that a foreign student could be issued: F1 Visa, J1 Visa or M1 Visa. Descriptions
of these visas can be seen at https://www.internationalstudent.com/study_usa/preparation/
student-visa/. If we select three international students at random from the population of
international students at a particular point in time to observe their visa status, what would
be the sample space? List it. To simplify the notation, we use the letters F, J, and M, respec-
tively, for the types of visas.
Exercise 4. Sometimes, companies downsize by laying off the older workers. Consider an
experiment that consists of keeping track of layoffs in a major company that is under the scrutiny
of the Equal Employment Opportunity Commission until three employees older than 40 are
laid off. List some outcomes of the sample space of this experiment, at least six members of S.
Exercise 5. (This exercise is inspired by a related problem on pages 34–35 of Khilyuk, Chilingar,
and Rieke (2005), but uses the more recent EPA standards.) The Environmental Protection
Agency (EPA) in the United States evaluates air quality on the basis of the Air Quality Index
(AQI), which classifies air quality into six major categories: Good (AQI 0–50), Moderate
(AQI 51–100), Unhealthy for Sensitive Groups (AQI 101–150), Unhealthy (AQI 151–200), Very
Unhealthy (AQI 201–300), and Hazardous (AQI 301–500). The following document,
https://www3.epa.gov/airnow/aqi-technical-assistance-document-sept2018.pdf
2.4 Events
Example 2.4.1
Think of the experiment in Example 2.3.2, and consider the event “two heads,” whose outcomes
are A = {HHT, HTH, THH}. We recognize A as a subset of the sample space S. The subset A is
the mathematical counterpart of the event “two heads.”
We could define many other events in S, for example:
Each event described above is given precise mathematical meaning by the corresponding set.
Example 2.4.2
Box 2.3
Events and sets
Sets are the main building blocks of probability theory. The sample space S is a set, each of
the outcomes in S is a simple set (a set with one outcome, an elementary event), and a bigger
event is a set of outcomes of S. If the outcome of an experiment is contained in the collection
of outcomes in the event, then we say that the event has occurred. Thus an event can occur
in several ways: event A occurs if any of the outcomes in it happens.
Example 2.4.2
A clinical study can screen patients until it finds one with a disease, but budget allows only
at most four patients screened. We will list the sample space S, the event E that “a patient
with the disease is found” and the event B that “the patient with the disease is found in at
most 3 attempts.”
Let s denote that the disease is present and let f denote that no disease is present. Then
S = { s , fs , ffs , fffs , ffff },
where ffs means that three individuals are screened, the first two without the disease and
the third with the disease.
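A quick way to get a feel for this sample space is to simulate the screening in R. The sketch below assumes, purely for illustration, that each screened patient independently has the disease with probability p = 0.3 (no such probability is given in the example):

```r
# Simulate the screening experiment: screen up to four patients, stopping
# as soon as one with the disease (s) is found; f marks a disease-free patient.
set.seed(2)
screen_once <- function(p = 0.3) {
  out <- ""
  for (i in 1:4) {
    if (runif(1) < p) return(paste0(out, "s"))  # disease found, stop
    out <- paste0(out, "f")                     # no disease, screen another
  }
  out  # "ffff": four patients screened, no disease found
}
table(replicate(2000, screen_once()))
```

Every simulated run lands in one of the five outcomes s, fs, ffs, fffs, ffff listed above.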
Example 2.4.3
An experiment consists of searching the internet for an email address of the history teacher
you had in high school 25 years ago. The sample space of this experiment is:
Let the event A represent the seating arrangements with the Japanese-speaking students
not sitting together.
A = {(1001), (1010),(0101)}.
Set operations allow us to obtain new sets from subsets of the sample space. We consider in
this section some of the most important set operations as applied to events.
Consider two events A and B.
The union of events A and B is the event C consisting of the outcomes that are in at least
one of the events:
C = A ∪ B = { si ∈ S : si in A or si in B }.
The intersection of A and B, written A ∩ B, is the event consisting of the outcomes that are
in both events. The complement of A is the event consisting of the outcomes of S not in A:
Ac = { si ∈ S : si not in A }.
Example 2.5.1
Consider the sample space depicted in Figure 2.3 consisting of all the pairs of numbers that
you can get when you roll a red die and a white die (you do not know ahead of time whether
the dice are fair or not). We will associate with each outcome the sum of the points (as we
did in Table 1.1 of Chapter 1).
Let A be the event that the sum is smaller than or equal to 5. Let B be the event that the sum
is larger than 3 but smaller than 7. Then,
A = {(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (4, 1)},
B = {(1, 3), (1, 4), (1, 5), (2, 2), (2, 3), (2, 4), (3, 1), (3, 2), (3, 3), (4, 1), (4, 2), (5, 1)} ,
C=A∪B
= {(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (4, 1), (1, 5),
(2, 4), (3, 3), (4, 2), (5, 1)},
E = A ∩ B = {(1, 3), (1, 4), (2, 2), (2, 3), (3, 1), (3, 2), (4, 1)}, and
Ac = {(a, b ) in S : (a + b ) > 5}.
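These set operations can be checked mechanically in R. The sketch below rebuilds the sample space of the two dice and counts the outcomes in A, B, and A ∩ B, matching the counts of 10, 12, and 7 in the lists above:

```r
# Sample space: all 36 ordered pairs (red, white) from rolling two dice
S <- expand.grid(red = 1:6, white = 1:6)
sums <- S$red + S$white
A <- S[sums <= 5, ]              # sum at most 5
B <- S[sums > 3 & sums < 7, ]    # sum in {4, 5, 6}
E <- S[sums > 3 & sums <= 5, ]   # intersection A and B: sum in {4, 5}
nrow(A)  # 10 outcomes, as in the list for A
nrow(B)  # 12 outcomes, as in the list for B
nrow(E)  # 7 outcomes, as in the list for E = A intersect B
```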
Example 2.5.2
Suppose that key elections for the position of president of a country are held. There are three
voting districts, I, II, III. The winner of the election has to have won the majority of votes
in two of the three voting districts. Assume that there are two candidates, A and B, for the
position. The sample space for the outcome of the election, shown in Figure 2.4, is
S = {AAA, AAB, ABA, ABB, BAA, BAB, BBA, BBB},
where, for example, ABB means that candidate A won in district I, and B won in districts II
and III.
Let W be the event that A wins the election. The event
C = W ∩ T = { AAA, AAB}.
Figure 2.4 Venn diagrams for election
example 2.5.2.
The reader may want to mark these events in Figure 2.4
using Venn diagrams.
[Figure: a sample space partitioned into non-overlapping events E1, E2, E3, E4, E5, . . . , En−1, En.]
Partitions are very useful, allowing us to divide the sample space into small, non-overlapping
pieces, as in a puzzle. Visualizing partitions often helps with the proofs of the main
theorems of probability in Chapter 3.
Example 2.5.3
The possible partitions of the sample space S = {1, 2, 3, 4} are:
(i ) [{1}, {2,3,4}]; (ii ) [{2}, {1,3,4}]; (iii ) [{3}, {1,2,4}]; (iv ) [{4}, {1,2,3}]; (v ) [{1,2}, {3,4}];
(vi ) [{1,3}, {2,4}]; (vii ) [{1,4}, {2,3}]; (viii ) [{2,4}, {1}, {3}]; (ix ) [{3,4}, {1}, {2}]; ( x ) [{1,2,3,4}];
( xi ) [{1}, {2}, {3}, {4}]; (xii ) [{1,2}, {3}, {4}]; ( xiii ) [{1,3}, {2}, {4}];
( xiv ) [{1,4}, {2}, {3}]; ( xv ) [{2,3}, {1}, {4}].
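Checking whether a candidate family of sets is a partition is mechanical: the pieces must be non-empty, pairwise disjoint, and their union must be all of S. A small R helper (an illustration, not from the text) encodes the check for the sample space above:

```r
# Check whether a list of sets forms a partition of the sample space S:
# non-empty pieces, pairwise disjoint, union equal to S.
is_partition <- function(pieces, S) {
  flat <- unlist(pieces)
  all(lengths(pieces) > 0) &&
    length(flat) == length(unique(flat)) &&  # no element in two pieces
    setequal(flat, S)                        # the pieces cover all of S
}
is_partition(list(c(1, 2), c(3, 4)), 1:4)     # TRUE: partition (v) above
is_partition(list(c(1, 2), c(2, 3, 4)), 1:4)  # FALSE: the pieces overlap
```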
Example 2.5.4
Let us consider the logical possibilities for the next three games in which England
plays Russia in a FIFA World Cup. We can list the possibilities in terms of the winner of
each game:
S = {EEE, EEU, EUE, EUU, UEE, UEU, UUE, UUU},
where E denotes England and U denotes Russia. The outcome or simple event EUU means
that England wins the first game and Russia wins the next two games.
A partition of the sample space is made by the sets A = {“England wins exactly two games”},
B = {“Russia wins exactly two games”}, C = {“England wins three games”}, and D = {“Russia
wins three games”}.
Can you list the outcomes contained in each of these events?
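One way to organize the answer is to enumerate the eight outcomes in R and split them by the number of games England wins; the four resulting groups are exactly the sets A, B, C, and D:

```r
# Enumerate the eight outcomes of three England (E) vs. Russia (U) games
# and group them by the number of games England wins.
g <- c("E", "U")
S <- apply(expand.grid(g, g, g), 1, paste, collapse = "")
wins <- sapply(strsplit(S, ""), function(x) sum(x == "E"))
split(S, wins)  # the four groups form a partition of S
```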
Example 2.5.5
Visit the traffic light applet to visualize this problem. It requires a browser with Java and
can be found at http://statweb.stanford.edu/~susan/surprise/Car.html.
Set the applet to run at low speed and the number of lights to three. Then think about
the following problem:
Driving a customer, a taxi driver passes through a sequence of three intersections with
traffic lights. At each light, she either stops, s, or continues, c. The sample space is
S = {ccc, ccs, csc, css, scc, scs, ssc, sss}.
D = {csc , ccc }.
Box 2.5
The algebra of events is the algebra of sets. This algebra tells us the relations among sets
that are obtained by set operations. The following are important relations (of sets in general)
that help simplify calculations. For any three events, A, B, and C, defined on a sample space S,
Commutativity:
A ∪ B = B ∪ A and A ∩ B = B ∩ A .
Associativity:
A ∪ (B ∪ C ) = ( A ∪ B ) ∪ C and A ∩ (B ∩ C ) = ( A ∩ B ) ∩ C .
Distributive Laws:
A ∩ (B ∪ C ) = ( A ∩ B ) ∪ ( A ∩ C ) and A ∪ (B ∩ C ) = ( A ∪ B ) ∩ ( A ∪ C ).
De Morgan’s laws relate the three basic operations:
( A ∪ B )c = Ac ∩ B c and ( A ∩ B )c = Ac ∪ B c .
The proof of De Morgan’s laws can be found in several sources. See Ross (2010) or Siegrist
(1997).
All of the properties mentioned can be generalized to a number n > 2 of events. Let
E1 , E2 , . . . , En be events defined on the sample space S. Then,
(E1 ∩ E2 ∩ · · · ∩ En )c = E1c ∪ E2c ∪ · · · ∪ Enc and (E1 ∪ E2 ∪ · · · ∪ En )c = E1c ∩ E2c ∩ · · · ∩ Enc .
In prose form: the complement of the intersection of n events is the union of the complements
of these n events. And the complement of the union of n events equals the intersection
of the complements of these events.
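De Morgan's laws can be verified numerically on any small finite sample space. A sketch in R, taking S = {1, . . . , 20} with A the multiples of three and B the multiples of five:

```r
# Numeric check of De Morgan's laws on a small finite sample space
S <- 1:20
A <- S[S %% 3 == 0]                 # multiples of three
B <- S[S %% 5 == 0]                 # multiples of five
comp <- function(E) setdiff(S, E)   # complement relative to S
# (A union B)^c equals A^c intersect B^c
identical(sort(comp(union(A, B))), sort(intersect(comp(A), comp(B))))  # TRUE
# (A intersect B)^c equals A^c union B^c
identical(sort(comp(intersect(A, B))), sort(union(comp(A), comp(B)))) # TRUE
```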
2.6.1 Exercises
Exercise 1. If we pull a card from a deck like that in Figure 2.6 and consider the events A =
spade, B = heart, are these events mutually exclusive? Why?
If we pull a card from a deck and consider the events A = spade and B = ace, are these
events mutually exclusive? Why?
Exercise 2. Consider the sample space of Figure 2.7 with elements representing the ages at
which one can apply for an annual free ticket to an attraction park, i.e.,
S = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}.
The event A contains all ages that are multiples of three, while event B contains all ages
that are multiples of five. (i) Identify in the Venn diagrams the events A, B. (ii) List the
elements in the events A ∩ B, A ∩ B c, Ac ∩ B and ( A ∪ B )c, and determine whether these
events form a partition of S.
[Figure 2.6: a deck of cards, showing the four suits Clubs, Diamonds, Hearts, Spades.]
Figure 2.7 Ages of free-ticket-eligible individuals.
Exercise 3. Four components are connected to form a system as shown in Figure 2.8. The
subsystem 1–2 will function if both of its individual components function. The subsystem
3–4 functions if both of its individual components function. For the entire system to
function, at least one of the two subsystems must function. (i) List the outcomes in the
sample space. (ii) Let A be the event that the system does not work, and list the elements
of A. (iii) Let B be the event that the system works, and list its elements. What is the
relation between events A and B?
[Figure 2.8: the four-component system, with subsystem 1–2 in parallel with subsystem 3–4.]
Exercise 5. If the outcome of an experiment is the order of finish in a car race among
4 cars having post positions 1,2,3,4, then how many outcomes are there in the sample space
and how many outcomes are there in the event E consisting of outcomes in which car 3 wins
the race?
Exercise 6. A series of 3 jobs arrives at a computing center with 3 processors, and each job
could end up in any of the processors. List the members of the sample space and then list the members
of the event that all processors are occupied.
Box 2.6
The third building block of probability theory is a probability function defined on the sample
space, mapping events to the real numbers. Altogether, a sample space, the events defined in a
sample space and a probability function form a probability space. With a probability space
well defined, we can approach probability problems of any complexity.
Chapter 3 is dedicated to this third building block. To understand the material in Chapter 3,
it is important to first become proficient in all the material seen in Chapter 2.
Question 1. The building blocks of probability theory are (select all that apply):
Question 2. A partition of the sample space must satisfy which of the following?
Question 5. Consider two events A and B in a sample space S. The event A ∩ B is not empty.
The event
( A ∩ B c ) ∪ ( A ∩ B ) ∪ ( B ∩ Ac )
equals which of the following?
a. ( A ∩ B c ) ∪ ( A ∩ B )
b. ( A ∪ B )c
c. B
d. Ac
e. A ∪ B
Question 6. Two six-sided dice are rolled. Let A be the event that the sum is less than nine, and
let B be the event that the first number rolled is five. Events A and B are (select all that apply)
Question 9. (This problem is inspired by Pfeiffer (1965, 25).) A certain type of rocket is known
to fail for one of two reasons: (1) failure of the rocket engine because the fuel does not burn
evenly or (2) failure of the guidance system. Let the experiment consist of the firing of a
rocket of this type. We let A be the event that the rocket fails because of engine malfunction
and B the event the rocket fails because of guidance failure. The event F of a failure of the
rocket is thus given by: F = A ∪ B. Consider the following three events:
a. equal to F
b. equal to A ∩ B c
c. a partition of F
d. a partition of A ∪ B
2.9 R code
Being aware of the many possible outcomes that can arise in an experiment can be illustrated
with the matching problem.
An online dating service research team has matched individuals A1, A2, A3, and A4
with individuals B1, B2, B3, and B4, respectively. If they call for a date, the number after
the letter determines who they are matched with. That is, the research team thinks that
A1 matches well with B1, etc. The assistant who responds to date requests does not use
the information provided by the research team. Instead, the assistant assigns a date to
the A individuals at random. That is, for example, A1 is given a randomly chosen person
from the B pool of candidates. This is like putting four numbers in an urn and drawing
one number at a time at random, without replacement. In R, the activity done for the four in the
A pool is:
sample(1:4, 4, replace=F)
If we had gotten 3,1,4, and 2 as a result, this would mean that A1 gets B3, A2 gets B1,
A3 gets B4, and A4 gets B2.
Listing the possible outcomes of this experiment could take a lot of space if done by hand.
You could do many trials at once by using a for loop in R. For example, suppose you want
three trials. Then you would use the following code:
trials=matrix(rep(0,12),ncol=4)
for(i in 1:3){ # put each of the three trials in a row
trials[i,]=sample(1:4, 4, replace=F)
}
trials
A1 A2 A3 A4
trials=matrix(rep(0,400),ncol=4) # 100 trials need a 100-by-4 matrix
for(i in 1:100){ # This is doing 100 trials
trials[i,]=sample(1:4, 4, replace=F)
}
trials
• Question 3. How many elements are there in the event B = {“A1 and A2 get the
right date”}? How many times in your 100 trials did this event happen? To find
the latter, type in R
A1A2match=matrix(trials[trials[,2]==2&
trials[,1]==1],ncol=4)
nrow(A1A2match)
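The full sample space of the matching experiment can also be enumerated exactly rather than sampled. A recursive sketch (no extra packages assumed) lists all 4! = 24 equally likely assignments of the B pool to A1–A4:

```r
# Enumerate all permutations of a vector by recursion: fix each element
# in turn as the first entry and permute the rest.
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in perms(v[-i])) out <- c(out, list(c(v[i], p)))
  }
  out
}
assignments <- perms(1:4)
length(assignments)  # 24 equally likely assignments of dates
```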
Exercise 1. A company is allowed to interview candidates until two qualified candidates are
found. But budget constraints dictate that no more than 10 candidates can be interviewed.
List the outcomes in the sample space.
a. the set of all students who are double majors in EE and statistics
b. the set of all students who are in only one of those two majors
c. the set of all students who are in neither of those majors
Exercise 3. Consider the Venn diagrams and events A, B, associated with Figure 2.7. List the
elements in the following events:
a. Ac ∩ B c
b. (A ∩ B c) ∪ (B ∩ Ac)
c. Ac ∪ B c
d. B ∪ Ac
Exercise 4. Sketch the region corresponding to the event (A ∪ B)c ∩ C in Figure 2.10.
[Figure 2.10: Venn diagram of three events A, B, and C.]
Exercise 6. (This exercise is an adaptation of a problem by Ross 2010, page 108, problem 3.69.)
A certain organism possesses a pair of each of 5 different genes (which we will designate by
the first 5 letters of the English alphabet). Each gene appears in 2 forms (which we designate
by lowercase and capital letters). The capital letter will be assumed to be the dominant gene
in the sense that if an organism possesses the gene pair xX, then it will outwardly have the
appearance of the X gene. For instance, if X stands for brown eyes and x for blue eyes, then
an individual having either gene pair XX or Xx will have brown eyes, whereas one having
gene pair xx will have blue eyes. The characteristic appearance of an organism is called its
phenotype, whereas its genetic constitution is called its genotype. (Thus 2 organisms with
Exercise 7. The image in Exercise 4, Figure 2.10, shows a Venn diagram of three events
(A, B, and C), in a sample space. Each of the cells delimited by the solid curves represents
an event. All the cells shown comprise a partition of the sample space. Use the notation and
operations on sets learned in this chapter to list all the sets in the partition.
Exercise 8. It is possible to derive formulas for the number of elements in a set which is the
union of more than two sets, but usually it is easier to work with Venn diagrams. For example,
suppose that the data science club reports the following information about 30 of its members:
19 work part time, 17 take stats, 11 volunteer on Volunteer day, 12 work part time and take
stats, 7 volunteer and work part time, 5 take stats and volunteer and 2 volunteer, take stats,
and work part time. Using Figure 2.10, fill in the number of elements in each subset working
from the bottom of the list given in this problem to the top.
Exercise 9. (Based on Khilyuk, Chilingar, and Rieke 2005, page 37) A protect-the-bay program
is trying to prevent eutrophication (excessive nutrient enrichment that produces an increas-
ing biomass of phytoplankton and causes significant impact on water quality and marine
life). To measure biological water quality, the protect-the-bay program uses mean chlorophyll
concentration on the surface, mean chlorophyll concentration on the photic layer, and mean
chlorophyll concentration of the water column. If each of these is ranked as high or normal,
what are the possible outcomes in the sample space of biological water quality?
Exercise 10. A psychologist has some mice that come from lab A and some that come from
lab B. The psychologist ran 50 mice through a maze experiment and reported the follow-
ing: 25 mice were from lab A, 25 were previously trained, 20 turned left (at the first choice
point), 10 were previously trained lab A mice, 4 lab A mice turned left, 15 previously trained
mice turned left, and 3 previously trained lab A mice turned left. Draw an appropriate Venn
diagram and determine the number of lab B mice who were not previously trained and who
did not turn left. Put how many mice are in each piece of your Venn diagram and label your
events clearly. Make your plot very large, so we can clearly see the numbers that you write.
(Goldberg 1960, 25, problem 3.8)
Exercise 11. Persons are classified according to blood type and Rh quality by testing a blood
sample for the presence of three antigens: A, B, and Rh. Blood is of type AB if it contains
both antigens A and B, of type A if it contains A but not B, of type B if it contains B but not
A, and of type O if it contains neither A nor B. In addition, blood is classified as Rh+ if the Rh
Find: (i) the proportion of type A− persons; (ii) the proportion of O− persons; (iii) the
proportion of B+ persons. (Based on Goldberg 1960, 22–23)
Exercise 12. A tract of land in the Alabama Piedmont contains a number of dead shortleaf
pine trees, some of which had been killed by the littleleaf disease, some by the southern
pine beetle, and some by fire. Suppose that out of 500 trees,
What proportion of trees were killed by littleleaf disease? (Johnson 2000, chapter 7)
The taxicab problem was made famous by Tversky and Kahneman (1982).
These two psychologists studied the judgment of probability in people.
Before you research these two scholars and their taxicab problem, think
about it yourself and propose a solution plan. Revisit it again after you have
studied the chapter. Then you may research other interesting puzzles posed
by these authors. Here is the taxicab problem.
A cab was involved in a hit-and-run accident at night. Two cab companies, the
Green and the Blue, operate in the city. You are given the following information:
85% of the cabs in the city are Green and 15% are Blue.
A witness identified the cab as blue. The court tested the reliability of
the witness under the same circumstances that existed on the night of the
accident and concluded that the witness correctly identified each one of the
two colors 80% of the time and failed 20% of the time.
What is the probability that the cab involved in the accident was Blue
rather than Green? How would you answer this question?
3.1 Modern mathematical approach to probability theory
Chapter 2 described the mathematical notions with which we may state the postulates of
a mathematical model of a random phenomenon or experiment, namely an experiment, a
sample space, and a collection of events. To complete a probability model we need a proba-
bility measure that we will denote by P. This probability measure must assign, to each event
A in S, regardless of whether it is elementary or complex, a probability P(A). In this chapter,
we talk about this probability measure and the properties that it must have to allow us to
compute the probability of complex events from the probabilities of elementary events. The
results found in this chapter are powerful aids in decision-making under uncertainty.
As indicated in Chapters 1 and 2, Kolmogorov gave probability an axiomatic foundation,
thus making it mathematical and general enough to handle almost any problem that involves
uncertainty. The axioms found in Definition 3.1.1 are attributed to him.
Definition 3.1.1
Probability is a function P defined on subsets of the larger set containing all logically
possible outcomes, S, such that this function satisfies Kolmogorov’s axioms, which are:
1. If A is an event in the sample space S, P(A) ≥ 0 for all events A.
2. P(S) = 1 for the certain event S.
3. Axiom of countable additivity: If A1 , A2 , . . . is a collection of pairwise disjoint or
mutually exclusive events, all defined on a sample space S, then
P (A1 ∪ A2 ∪ · · · ) = P (A1 ) + P (A2 ) + · · · .
Axiom 3 is saying that the probability of the union of mutually exclusive events is the sum
of their probabilities.
This axiomatic approach made it easier for people from many different backgrounds and
levels of training to talk about probability. Regardless of how they obtained their probabilities
(experimentation, subjective, model-based), and how they defined probability (classical
definition, frequentist definition, subjective definition), the probability is a function defined
on the events in the sample space that must satisfy the axioms.
The assumption of the existence of a set function P, defined on the events of a sample
space S, and satisfying Axioms 1, 2, and 3, constitutes the modern mathematical approach
to probability theory. For any sample space S, many different probability functions can be
defined that satisfy the axioms.
Example 3.1.1 (Ross 2010)
Consider the experiment that consists of tossing a coin. Let’s find a couple of probability
functions and see how we do it. Let S = {H, T}. A reasonable probability function is
P ({H}) = P ({T}). What would P ({H}), P ({T}) have to be for P to be a probability function?
Since S = {H} ∪ {T}, and {H} and {T} are disjoint, we have by Axiom 3 and then by Axiom 2,
P ({H} ∪ {T}) = P (H) + P (T ) = 1, and therefore P ({H}) = P ({T}) = 1/2 is a probability
function that satisfies the axioms. Another good function is
P ({H }) = 1/3; P ({T }) = 2/3.
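For a finite sample space, checking that a proposed assignment of outcome probabilities satisfies the axioms reduces to two checks: every probability is non-negative (Axiom 1), and the outcome probabilities sum to one (Axioms 2 and 3 combined, since the outcomes are disjoint and their union is S). A sketch in R for the second probability function of Example 3.1.1:

```r
# Check Kolmogorov's axioms for a candidate probability function on S = {H, T}
p <- c(H = 1/3, T = 2/3)
all(p >= 0)              # Axiom 1: non-negativity
abs(sum(p) - 1) < 1e-12  # Axioms 2 and 3: probabilities of the disjoint
                         # outcomes add up to P(S) = 1
```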
Because there can be many probability functions defined on a sample space, the task of
the data scientist is to determine from incomplete data which probability function is behind
a particular experiment, i.e., generating the observed data. That is the task of the
mathematical statistician and applied probability modelers.
Example 3.1.2
Consider a game of darts played by throwing a dart at a board and receiving a score
corresponding to the number assigned to the region in which the dart lands. The probability of
the dart hitting a particular region is proportional to the area of the region. Thus, a bigger
region has a higher probability of being hit.
If we make the assumption that the board is always hit, then we have
area of region i
P ( scoring i points ) = .
area of dart board
The sum of the areas of the disjoint regions equals the area of the dart board. Thus the
probabilities assigned to the 5 outcomes sum up to 1 and satisfy the axioms of probability.
Dart game exercises appear in numerous probability textbooks, including Ross (2010). The
idea of finding theoretical probabilities by calculating how much of an area is covered by
random throwing of a dart is behind modern computational methods based on Markov chain
Monte Carlo (MCMC) simulation. The Metropolis–Hastings algorithm to compute posterior
distributions in a Bayesian context is an example.
The Metropolis–Hastings algorithm and other MCMC methods allow estimation of
probabilities when the probability models are very complex and cannot be handled by
mathematics alone.
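The area idea can be illustrated with a plain Monte Carlo sketch, far simpler than MCMC but resting on the same principle. The board radius (1) and the inner region's radius (1/2) below are assumptions chosen for illustration: darts thrown uniformly at the board land in the inner disk with probability equal to the ratio of the areas, 1/4.

```r
# Monte Carlo estimate of P(dart lands in inner disk of radius 1/2)
# on a unit-radius board; throws that miss the board are discarded.
set.seed(1)
n <- 100000
x <- runif(n, -1, 1); y <- runif(n, -1, 1)  # darts on the bounding square
on_board <- x^2 + y^2 <= 1                  # keep only darts on the board
inner    <- x^2 + y^2 <= 0.25               # darts in the inner disk
mean(inner[on_board])  # close to (area of inner disk)/(area of board) = 1/4
```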
(1) P(∅) = 0.
(2) If E1 , E2 , . . . , En are pairwise mutually exclusive events, then
P (E1 ∪ E2 ∪ · · · ∪ En ) = P (E1 ) + P (E2 ) + · · · + P (En ).
This follows from Axiom 3 by defining Ei to be the null event for all values of i greater
than n. Axiom 3 is equivalent to the equation above when the sample space is finite. However, the
added generality of Axiom 3 is necessary when the sample space consists of an infinite number
of points. Although not the subject of the first part of the book, this result will be meaningful in
the second part.
To prove that
P(Ac) = 1 − P(A),
realize that since A and Ac are disjoint and A ∪ Ac = S, then, by Axiom 3,
P(A ∪ Ac) = P(A) + P(Ac), and P(A ∪ Ac) = P(S) = 1 by Axiom 2, from where the statement
given in the theorem follows.
Theorem 3.1.2
If P is a probability function and A and B are any sets, then
1. P(B ∩ Ac) = P(B) − P(A ∩ B). The probability that “only B” happens.
2. P(A ∪ B) = P(A) + P(B) − P(A ∩ B). The probability that A or B happens.
3. If A is contained in B, then P(A) ≤ P(B).
Proof
1. Make B the union of two mutually exclusive events: B = (A ∩ B) ∪ (B ∩ Ac). Then by
Axiom 3,
P(B) = P(A ∩ B) + P(B ∩ Ac). It follows then that P(B ∩ Ac) = P(B) − P(A ∩ B).
Corollary 1
If A and B are disjoint, P(A ∪ B) = P(A) + P(B).
If E1 , E2 , . . . , Em are mutually exclusive, the probability of the event consisting of
their union is the sum of their probabilities.
Theorem 3.1.3
The probability of the union of n events equals the sum of the probabilities of these
events taken one at a time, minus the sum of the probabilities of these events taken
two at a time, plus the sum of the probabilities of these events taken three at a
time, and so on.
P (E1 ∪ E2 ∪ · · · ∪ En ) = P (E1 ) + P (E2 ) + · · · + P (En ) − ∑i<j P (Ei ∩ Ej )
+ · · · + (−1)r+1 ∑i1<i2<···<ir P (Ei1 ∩ Ei2 ∩ · · · ∩ Eir ) + · · · + (−1)n+1 P (E1 ∩ E2 ∩ · · · ∩ En ).
The summation ∑i1<i2<···<ir P (Ei1 ∩ Ei2 ∩ · · · ∩ Eir ) is taken over all of the (n choose r)
possible subsets of size r of the set {1, 2, . . . , n}.
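Inclusion-exclusion can be checked numerically for a small case. The sketch below takes three events on the two-dice sample space (the particular events are chosen arbitrarily for illustration) and compares both sides of the formula under equally likely outcomes:

```r
# Numeric check of inclusion-exclusion for three events, two fair dice
S <- expand.grid(d1 = 1:6, d2 = 1:6)
pr <- function(e) sum(e) / nrow(S)  # probability under equally likely outcomes
E1 <- S$d1 + S$d2 <= 5              # sum at most 5
E2 <- S$d1 == 1                     # first die shows 1
E3 <- S$d1 == S$d2                  # doubles
lhs <- pr(E1 | E2 | E3)
rhs <- pr(E1) + pr(E2) + pr(E3) -
  pr(E1 & E2) - pr(E1 & E3) - pr(E2 & E3) +
  pr(E1 & E2 & E3)
all.equal(lhs, rhs)  # TRUE: both sides agree
```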
Example 3.1.3
A data festival has received three fancy markers from one of the sponsors. The fancy mark-
ers will be given to three students chosen at random from the five winners of the Festival:
Ana (a), Betsy (b), Charles (c), Dai (d), and Ezra (e). What is the probability that both Ana and
Betsy are chosen, or both Charles and Ezra are chosen, or Betsy, Charles, and Dai are chosen?
The sample space consists of 60 possible ordered selections of the 3 winners, but each of the
10 possible sets of three individuals listed below appears 6 times in different orders, so we
can think of the possible outcomes as
{abc , abd , abe , acd , ace, ade , bcd , bce, bde , cde },
Example 3.1.4
Suppose that on a random week day, after dinner, Rabindranath watches CNN 2/3 of the time,
watches BBC 1/2 of the time, and watches both CNN and BBC 1/3 of the time. On a randomly
selected weekday evening (i) What is the probability that Rabindranath watches only CNN?
(ii) What is the probability that Rabindranath watches neither station?
Let A be the event that Rabindranath watches CNN and B the event that he watches BBC.
Then P(A) = 2/3, P(B) = 1/2, and P(A ∩ B) = 1/3.
(i ) P ( A ∩ B c ) = P ( A) − P ( A ∩ B ) = 2/3 − 1/3 = 1/3.
To answer question (ii) we make use of Theorems 3.1.1 and 3.1.2. First we compute
P ( A ∪ B ) = P ( A) + P (B ) − P ( A ∩ B ) = 2/3 + 1/2 − 1/3 = 5/6 by Theorem 3.1.2.
Then, using Theorem 3.1.1, we compute
(ii ) P (( A ∪ B )c ) = 1 − P ( A ∪ B ) = 1/6.
Theorem 3.1.4
Let E1, E2, ..., En , be events that form a partition of the sample space S as indicated
in Figure 3.1. Let B be any event, also shown in Figure 3.1. Then
P(B) = P(B ∩ E1) + P(B ∩ E2) + · · · + P(B ∩ En).
Exercise 5. In a remote organization, 28.1% among the total adult population are current
smokers. Moreover, 37.7% of the adult population works in a place indoors where there is
no rule against smoking at work. The proportion of all adults that are current smokers and
work in an indoor place with no rule against smoking at work is 5%. What is the probability
that a randomly selected adult in this remote organization is a current smoker or works in a
place with no ban on smoking at work?
Exercise 6. (This exercise is based on a similar problem from Ross (2010).) The Talented
Mr. Ripley is a mystery novel by Patricia Highsmith (1955). Mr. Ripley is a very lucky killer,
who gets away with two murders but sometimes gets anxious at the thought of being
discovered. There is no probability mentioned in the book, but it is lurking in the thought
process. Suppose Mr. Ripley thinks of these possibilities: A (the father will call me today), B
(the police are on the way to arrest me), A ∩ B (the father calls and the police are on the way
to arrest me), A ∪ B (at least one of A and B will happen). This person assesses: P(A) = 0.30,
P(B) = 0.40, P(A ∩ B) = 0.20, and P(A ∪ B) = 0.60 as answers. Are Mr. Ripley’s imaginary answers
consistent with the axioms of probability? Why or why not? If not, which axiom is violated?
Exercise 7. (This problem is from Khilyuk, Chilingar, and Rieke (2005, page 58).) Consider
an urban water-supply system. It can fail because of either the lack of water or damage of
supplying pipes. On any given day, the supply system can be in one of the following two
states: proper functioning (event A) or failure (event B). Reliability of the system can be
defined as the probability of proper functioning on any given day, P(A). Based on the
information presented below, one needs to evaluate the reliability of the water supply system.
Let W be the event of lack of water, with P(W) = 0.014. Let D be the event of line damage,
with P(D) = 0.030. It is known that P(D ∩ W ) = 0.011. Calculate P(A).
Exercise 8. A big department store offers the clients two types of payment style options.
One option, A, is paying in cash, and 40% of its customers choose that; the other option, B,
is paying with a credit card, which 60% of customers choose. It is possible to pay both with
credit card and cash, an option employed by 10% of customers. There are many other options
such as gift cards, debit cards, checks, and EBT cards. (i) What percentage of this department
store’s customers will use only one of the A, B options? (ii) What percentage of its customers
do not use either A or B?
Suppose we know, by some luck, the probabilities of each of the logically possible ele-
mentary outcomes of the experiment given in a discrete sample space, or we assume what
they are as, for example, when we assume that outcomes are equally likely. If si is a simple
outcome in the sample space S, and A is an event defined in S, then if any outcome in A
occurs, the event A occurs. As a consequence, the probability of A can be found by adding
the probabilities of all the simple outcomes of S that are in A, i.e.,
Box 3.1
Probability of an event
The probability of any event (no matter how complicated it may be) can be computed if:
a. you can identify the individual outcomes of the sample space that are in the event.
b. you know the probability of each of the outcomes in the sample space.
If those two conditions are satisfied, then the probability of the event is the sum of the
probabilities of the outcomes of S that are in the event. Sometimes we know the probabilities
of the outcomes in S, sometimes we have to compute them ourselves. A book on the theory of
probability like this helps you do the latter.
Example 3.2.1
After shipping five computers to users, a computer manufacturer realized that two out of
the five computers were not configured properly (i.e., were defective), without knowing
specifically which ones. The manufacturer allocated resources to recall only two randomly
chosen computers in succession out of the five for examination.
where ss reflects that the first computer was defective and the second computer recalled
was also defective.
The event A that the second recalled computer is a defective computer contains the
following outcomes:
A = {ss, fs}.
We were told by someone who went through the trouble of finding them out that the
probabilities of each outcome in the sample space are:
P(A) = 1/10 + 3/10 = 4/10.
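The outcome probabilities used above can be recovered by brute force. The sketch below (an illustration, not part of the original example) labels the five computers, enumerates all equally likely ordered recalls of two of them, and recomputes P(ss), P(fs), and P(A):

```python
from fractions import Fraction
from itertools import permutations

# Two of the five shipped computers are defective ("D"); three are fine ("F").
computers = ["D", "D", "F", "F", "F"]

# All ordered recalls of two distinct computers: 5 * 4 = 20 equally likely pairs.
pairs = list(permutations(range(5), 2))

def prob(pattern):
    """Probability of a recall pattern such as 'sf' (s = defective, f = not)."""
    hits = [p for p in pairs
            if all((computers[i] == "D") == (c == "s")
                   for i, c in zip(p, pattern))]
    return Fraction(len(hits), len(pairs))

p_ss, p_fs = prob("ss"), prob("fs")
p_A = p_ss + p_fs  # A: the second recalled computer is defective
print(p_ss, p_fs, p_A)  # 1/10 3/10 2/5
```

The enumeration reproduces P(A) = 4/10 without needing the multiplication rule, which is exactly the spirit of Box 3.1: list the outcomes in the event and add their probabilities.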
Example 3.2.2
A job requires that prospective employees go through a security clearance check to make sure
that they qualify for a job where they must sign a nondisclosure agreement. Let s represent
“passing the security clearance check” and f denote “not passing the test.” In a company that
narrows down the pool of qualified applicants to the three most qualified ones, the possible
outcomes of the security check are
P({sss}) = 8/27;  P(ssf) = P(sfs) = P(fss) = 4/27;  P(ffs) = P(fsf) = P(sff) = 2/27;  P(fff) = 1/27.
Let D be the event that two of the qualified candidates pass the security clearance check.
Then
P(D) = 4/27 + 4/27 + 4/27 = 12/27.
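The stated outcome probabilities are what one gets if each candidate passes independently with probability 2/3 (an assumption consistent with P({sss}) = 8/27). Under that assumption, a quick sketch recomputes P(D) by enumerating the eight outcomes:

```python
from fractions import Fraction
from itertools import product

# Assumed from the stated probabilities: each candidate passes independently
# with probability 2/3, so that P(sss) = (2/3)^3 = 8/27.
p_pass = Fraction(2, 3)

def prob(outcome):
    """Probability of an outcome string such as 'ssf' under independence."""
    result = Fraction(1)
    for c in outcome:
        result *= p_pass if c == "s" else 1 - p_pass
    return result

# D: exactly two of the three candidates pass the clearance check
p_D = sum(prob("".join(o)) for o in product("sf", repeat=3)
          if o.count("s") == 2)
print(p_D)  # 4/9 (= 12/27)
```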
Example 3.2.3
A person can be born in any of the 12 months of the year. Consider four persons. Each of
them can be born in any of the 12 months. In total, there are 12⁴ = 20,736 4-tuples, each
indicating the birth months of four people. If we assume each of those is equally likely to
happen, what is the probability that the four persons were born in different months?
We first figure out how many elements are in the event. For the first person, there could
be 12 months, for the second the remaining 11, for the third the remaining 10, and for the
fourth the remaining 9. So in total there are (12)(11)(10)(9) = 11,880 possible outcomes. If we
multiply 11,880 times 1/12⁴, we get 0.5729.
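The counting argument can be checked directly, here in a short Python sketch:

```python
from fractions import Fraction
from math import perm

# Number of 4-tuples of birth months with all four months different: 12*11*10*9
favorable = perm(12, 4)
total = 12 ** 4          # all equally likely 4-tuples of months
p_distinct = Fraction(favorable, total)
print(favorable, round(float(p_distinct), 4))  # 11880 0.5729
```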
Box 3.2
is 1/(49 choose 6), that is, the chance is 1 in 13,983,816 and the jackpot odds are 13,983,816 to
1.” (Henze and Riedwyl 1998, 14–16) One could argue that it is easier for people to
understand the dimensionality of their luck using this informal approach to describe
odds, but the reader should be aware that such definitions are not the probabilistic
definition of odds.
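As a check, a quick sketch computes the number of 6-number combinations and makes the box's point explicit: the probabilistic odds against the jackpot are 13,983,815 to 1, not the informal "13,983,816 to 1":

```python
from math import comb

n = comb(49, 6)            # number of possible six-number tickets
print(n)                   # 13983816

chance = 1 / n             # probability of hitting the jackpot
odds_against = (1 - chance) / chance
print(round(odds_against)) # 13983815 -- probabilistic odds against, i.e. n - 1
```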
3.2.1 Exercises
Exercise 1. When each outcome in the sample space is equally likely to happen we may use the
classical definition of probability to calculate the probability of events. Consider an individual
who seeks advice regarding one of two possible courses of action from three consultants,
who are each equally likely to be wrong or right. This individual follows the recommendation
of the majority. What is the probability that this individual makes the wrong decision?
(i) What is the probability that an individual measures calories? (ii) What is the probability
that an individual posts in a personal web site?
Exercise 3. In a population of 485 smokers, 238 of the smokers divorced and 247 of the non-
smokers did not divorce. Calculate the odds of a smoker being divorced to not being divorced.
Example 3.3.2
(This problem is from C ιrca (2016, 15).) The spin in a quantum system can have two projec-
tions: +1/2 (spin “up”) or -1/2 (spin “down”). The orientation of the spin is measured twice in
a row. We make the following event assignments:
If each of these pairs is equally likely, i.e., each has probability 1/4,
P(A) = P(B) = P(C) = 2/4 = 1/2.
The hard disk and the memory stick are independent units, hence
P (E ) = P (D ∩ M ) = P (D )P (M ) = (0.1)(0.5) = 0.05
There is a 5% chance that the student’s work will be lost. A better backup system should
be considered.
Example 3.3.4
Reliability is the probability that a system will work, given probabilities about the compo-
nents of this system. It applies to industry, or to standardized exams with more than one
part, and many other things. There are various types of configurations of the components
in a system. In a series system all the components are in series and they all have to work
for the system to work. If one component fails, the system fails. A robotic tool is given a
one-year guarantee after purchase. The tool contains 4 components in series, each of which
has a probability 0.01 of failing during the warranty period. The tool fails if any one of the
components fails. Assuming that the components fail independently, what is the probability
that the system works? That is, what is the reliability of this system? What is the probability
that the system fails?
Let Ei, i = 1,…,4 be the event that component i works. Then P (Ei ) = 0.99, i = 1,…,4.
Let E be the event that the system works. Then P(E) = P(E1 ∩ E2 ∩ E3 ∩ E4) = (0.99)⁴ = 0.960596.
The reliability of the system is then 0.960596.
The complement event (EC) that the system fails has then probability 0.039404.
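The series-reliability arithmetic above can be sketched in a few lines:

```python
# Series system: all four independent components must work for the tool to work.
p_component = 0.99   # each component survives the warranty period
n_components = 4

reliability = p_component ** n_components   # P(all four work)
p_failure = 1 - reliability                 # P(system fails)
print(round(reliability, 6), round(p_failure, 6))  # 0.960596 0.039404
```

Changing `n_components` or `p_component` shows how quickly series reliability degrades as components are added.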
Example 3.3.5
There are many cards problems in numerous probability books, because there are many games
that use cards, and it is interesting to use probability to predict possible winnings and losses
or perhaps just the outcomes. If the reader is not familiar with cards, the following image may
help see what the contents of a deck of cards is (See Figure 3.2). Assuming that the deck has
not been tampered with, each card has equal probability of being selected. One could define
many events around this deck. For example, let A be the event “selecting a Queen” and B be
the event “selecting a heart.” In the context of this section, we are interested in whether A
and B are independent. We use the definition.
P(A ∩ B) = 1/52;  P(A)P(B) = (4/52)(13/52) = 1/52.
Because P(A ∩ B) = P(A)P(B), we can conclude that the events A and B are independent.
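The independence check can be carried out exhaustively by building the deck in code, a quick sketch:

```python
from fractions import Fraction
from itertools import product

# Build the 52-card deck as (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))

A = {card for card in deck if card[0] == "Q"}        # "selecting a Queen"
B = {card for card in deck if card[1] == "hearts"}   # "selecting a heart"

def p(event):
    return Fraction(len(event), len(deck))

print(p(A & B), p(A) * p(B))  # 1/52 1/52 -- equal, so A and B are independent
```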
Figure 3.2: A standard deck of 52 playing cards, arranged by the four suits: Clubs, Diamonds, Hearts, and Spades.
3.3.1 Exercises
Exercise 1. (This is from Parzen (1960, 90).) Consider an automobile accident on a city street
in which car I stops suddenly and is hit from behind by car II. Suppose three persons,
whom we call A’, B’, and C’, witness the accident. Suppose the probability that each witness
has correctly observed that car I stopped suddenly is estimated by having the witnesses
observe a number of contrived incidents about which each is then questioned. Assume that
it is found that A’ has probability 0.9 of stating that car I stopped suddenly, B’ has probabil-
ity 0.8 of stating that car I stopped suddenly, and C’ has probability 0.7 of stating that car
I stopped suddenly. Let A, B, and C denote, respectively, the events that persons A’, B’, and
C’ will state that car I stopped suddenly. Assuming that A, B, and C are independent events,
what is the probability that (i) A’, B’, and C’ will state that car I stopped suddenly, (ii) exactly
two of them will state that car I stopped suddenly?
Exercise 2. Actuarial science is the science that estimates risks of insurance and other
financial endeavors. Actuaries’ first preliminary probability exam is close to the level of
this book. You may consider trying an online exam when you are done studying the book. These current
sample exams are at https://www.soa.org/Education/Exam-Req/Syllabus-Study-Materials/
edu-exam-p-online-sample.aspx One of the actuarial applications of probability is to mortality.
The probability of an individual’s death is important in the actuarial fields of life insurance
and pensions. Life insurance pays out when a death occurs and a pension pays until death
occurs. Therefore, calculating the expected present value of the liability to the insurer or
pension provider requires understanding of the probabilities of death. For more information,
see chapter 9 of Introduction to Actuarial and Financial Methods (Garret 2015).
To keep this problem simple, suppose an individual is alive at age 35. What is the proba-
bility that the person will die between the ages of 40 and 45? Assume that the probability
of dying in a given year is 0.1 and the probability of being alive is 0.9. (Usually, in actuarial
science, the probabilities vary each year or there are probabilities calculated from empirical data.)
Exercise 3. (This is Example 3.1.3 (applying Axiom 3) from Keeler and Steinhorst (2001).) Given
recent flooding between Town A and Town B, the local telephone company is assessing the
value of an independent trunk line between the two towns. The second line will fail inde-
pendently of the first because it will depend on different equipment and routing (assume a
regional disaster is unlikely). Under current conditions, the present line works 98 out of 100
times someone wishes to make a call. If the second line performs as well, what is the chance
that a caller will be able to get through?
Exercise 4. Consider again the deck of cards of Figure 3.2. Assume it is well shuffled. A card
will be dealt off the top of the deck. Let event A be “the card is the number 7” and let B be
the event “the card is a club.” Are these two events independent?
Exercise 5. A six-sided fair die is rolled once. Hence the probability that the number 6 turns
up on the die is p = 1/6. Is the probability of seeing a 6 equal to 2/6 if the die is rolled twice?
Exercise 6. A flight system consists of five radio communication devices in series. The system
will work if the five components work. Let Ri represent the event that component i works, i =
1,…,5. Then P(Ri) = 0.87 for all i. What is the probability that the system will work (also known
as the reliability of the system)?
Type O B A AB
Proportion 0.3712 0.3226 0.2288 0.0774
After defining what mutually exclusive blood types means for a randomly chosen person
from India, find the probability that two randomly chosen individuals from India share the
same blood type.
An event G could have P(G), but then after observing that event B occurs that probability
could change. We say that we update the probability of G after information on event B reaches
us. The former P(G) is the prior probability, also called the total probability of G. To mark the
distinction, the updated probability P(G | B) is called the posterior probability of G. Note that
P (G ∩ B ) = P (G | B )P (B )
and
P (G ∩ B ) = P (B | G )P (G ).
Example 3.4.1
In 2016–2017, approximately 32% of enrolled undergraduate students were recipients of
Pell Grants (https://trends.collegeboard.org/student-aid/figures-tables/undergraduate-
enrollment-and-percentage-receiving-pell-grants-over-time#Key%20Points). According to
Facts and Figures at the author’s institution, 34% of undergraduates at UCLA receive Pell
Grants. The probability that a randomly chosen student from the US receives a Pell Grant can
be approximated by 32%. But if we learn that the student is from UCLA, we should update
that probability to 34%.
Let A be the event that a randomly chosen undergraduate in the US receives a Pell Grant.
Let B be the event that an undergraduate student is at UCLA. Then
P(A) = 0.32,
P(A|B) = 0.34.
Example 3.4.2
Rossman and Short (1995) used the example of Table 3.1, which classifies members
of the 1994 US Senate according to their political party and sex. The table summarizes
their findings.
Table 3.1
Men Women Row Total
Republicans 42 2 44
Democrats 51 5 56
Column Total 93 7 100
Rossman and Short asked: Is it legitimate to say that “most Democratic senators are women”
and “most women senators are Democrats”? What conditional probability statements are these?
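The two quoted statements correspond to two different conditional probabilities, P(woman | Democrat) and P(Democrat | woman). A quick sketch computes both from the counts in Table 3.1:

```python
from fractions import Fraction

# Counts from Table 3.1 (1994 US Senate)
dem_men, dem_women = 51, 5
rep_men, rep_women = 42, 2
women = dem_women + rep_women      # 7 women senators
democrats = dem_men + dem_women    # 56 Democratic senators

p_woman_given_dem = Fraction(dem_women, democrats)  # P(woman | Democrat)
p_dem_given_woman = Fraction(dem_women, women)      # P(Democrat | woman)
print(p_woman_given_dem, p_dem_given_woman)  # 5/56 5/7
```

Only the second statement holds: 5/7 of women senators were Democrats, while only 5/56 of Democratic senators were women. Reversing the conditioning event changes the probability entirely.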
Example 3.4.3
(This example is from Keeler and Steinhorst (2001).) Many couples take advantage of ultra-
sound exams to determine the sex of their baby before it is born. Some couples prefer not
to know beforehand. In any case, ultrasound examination is not always accurate. About 1
in 5 predictions is wrong. In one medical group, the proportion of girls correctly identified
is 9 out of 10 and the proportion of boys correctly identified is 3 out of 4. The proportion
of girls born is 48 out of 100. What is the probability that a baby predicted to be a girl is
actually a girl?
Table 3.2
Girls Boys Row Total
Ultrasound says girl 432 130 562
Ultrasound says boy 48 390 438
Column Total 480 520 1000
From the numbers in the table you can compute the probability requested:
P(girl | ultrasound says girl) = 432/562 ≈ 0.769.
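The entries of Table 3.2 can themselves be rebuilt from the stated rates, assuming a hypothetical cohort of 1,000 births; the requested conditional probability then falls out directly:

```python
n = 1000
girls = round(0.48 * n)                  # 480 girls, 520 boys
boys = n - girls

says_girl_and_girl = round(0.9 * girls)  # 9 of 10 girls correctly identified: 432
says_girl_and_boy = boys - round(0.75 * boys)  # boys misread as girls: 130
says_girl = says_girl_and_girl + says_girl_and_boy  # 562 "ultrasound says girl"

p_girl_given_says_girl = says_girl_and_girl / says_girl
print(round(p_girl_given_says_girl, 4))  # 0.7687
```

Working with counts in a hypothetical cohort like this is often the easiest way to see a Bayes-rule calculation.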
3.4.2 An aid: Tree diagrams to visualize a sequence of events
Tree diagrams are another tool for computing conditional and total probabilities. They are
often useful in clarifying our thinking about sequences of events.
We will quote the detailed information about a probability tree found in Wild and Seber
(2000). Look at Figure 3.3 at the same time.
The probability written beside each line segment in the tree is the probability that the
right-hand event on the line segment occurs given the occurrence of all events that
have appeared along the path so far (reading from left to right). Each time a branching
occurs in the tree, we want to cover all eventualities, so the probabilities beside any
“fan” of line segments should add to unity.
Because the probability information on a line segment is conditional on what has
gone before, the order in which the tree branches should reflect the type of informa-
tion that is available. Unconditional probability information should go in the first set of
branches. The readily available probability information in the second branch depends
(i.e., is conditional on) what happened in the first branch and so on….
Rules for use are: (a) Multiply along a path to obtain the joint probability that all the
events in that path occur (using the multiplication rule for conditional probabilities).
Add the probabilities of all whole paths in which an event occurs to obtain the prob-
ability of that event occurring (using the addition rule for mutually exclusive events).
This gives the total probability of the event of interest.
Wild and Seber (2000) use the following example to illustrate the use of trees.
Example 3.4.4
In 1992, 14% of the population of Israel was Arabic, and of those, 52% were described as living
below the poverty line. On the other hand, 86% of the population of Israel was Jewish that
year, and of those 11% were described as living below the poverty level. Let B represent the
Figure 3.3: Tree diagram for Example 3.4.4. The first branch is ethnicity, e.g., Arabic (A) with P(A) = 0.14; the second branch is poverty status, e.g., “Not poor” with P(not poor | A) = 1 − 0.52 = 0.48, giving P(not poor and Arabic) = (0.48)(0.14).
event “poor,” i.e., living below the poverty line, A denote the event “Arabic,” and J denote the
event “Jewish.” A tree representation splits the population of Israel first on ethnicity,
because we have unconditional (or total) probabilities for this, and then on poverty, because
the information on this event is conditional on ethnic group.
In the tree, the total probability of poor is obtained by adding the joint probabilities of all
the branches in which the event “poor” appears. For example,
P(poor) = P(poor and Arabic) + P(poor and Jewish) = (0.14)(0.52) + (0.86)(0.11) = 0.1674.
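The tree rules quoted earlier (multiply along a path, add the whole paths containing the event) translate into a few lines of code, a minimal sketch:

```python
# Branch probabilities from Example 3.4.4
p_arabic, p_jewish = 0.14, 0.86          # unconditional: first set of branches
p_poor_given_arabic = 0.52               # conditional: second set of branches
p_poor_given_jewish = 0.11

# Multiply along each path, then add the paths in which "poor" occurs.
p_poor = p_poor_given_arabic * p_arabic + p_poor_given_jewish * p_jewish
print(round(p_poor, 4))  # 0.1674
```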
Table 3.3
                   Arabic (A)                    Jewish (J)                    Total Probability
Poor (B)           P(A ∩ B) = (0.14)(0.52)       P(J ∩ B) = (0.86)(0.11)       P(B) = P(A ∩ B) + P(J ∩ B)
Not poor (Bᶜ)      P(A ∩ Bᶜ) = (0.14)(0.48)      P(J ∩ Bᶜ) = (0.86)(0.89)      P(Bᶜ) = P(A ∩ Bᶜ) + P(J ∩ Bᶜ)
Total probability  P(A) = P(A ∩ B) + P(A ∩ Bᶜ)   P(J) = P(J ∩ B) + P(J ∩ Bᶜ)   1
We can see that we can calculate the same probabilities as with the tree. One thing should
be clear. In a two-by-two table, the conditional probabilities do not appear directly.
But
P(poor)P(Arabic) = (0.14 × 0.52 + 0.86 × 0.11)(0.14 × 0.52 + 0.14 × 0.48) = 0.023436.
Thus, because
0 ≤ P(A | B) = P(A ∩ B) / P(B) ≤ 1.
Second,
P(S | B) = P(S ∩ B) / P(B) = P(B) / P(B) = 1.
Third, if A1, A2, … are mutually exclusive events, then so are A1|B, A2|B, …, and
P(∪_{i=1}^∞ A_i | B) = P((∪_{i=1}^∞ A_i) ∩ B) / P(B) = Σ_{i=1}^∞ P(A_i ∩ B) / P(B) = Σ_{i=1}^∞ P(A_i | B).
The above axioms imply that all the properties of probability apply to conditional
probabilities.
P (C ∩ A ∩ B ) P ( A ∩ B|C )P (C )
P (C | A ∩ B ) = =
P( A ∩ B) P ( A ∩ B|C )P (C ) + P ( A ∩ B|C c )P (C c )
if P(A Ç B) > 0.
Example 3.4.2
Cells carry the 23 pairs of chromosomes present in the human body. Each chromosome has
a single DNA molecule in it. Each DNA molecule carries several hundreds, or thousands, of
genes. Chromosomes go in pairs in human beings. A pair of genes (one in each chromosome)
is necessary for a trait. This makes humans “diploid.” For example, the trait for hair color or
breast cancer status is determined by two genes, one in each pair of chromosomes.
There are many types of the same gene. Each type is called an “allele.” Two alleles are needed
for the trait. All the possible pairs of alleles for a trait constitute the set of genotypes for
the trait. The observable characteristics of the trait that result from the genotype is the set
of phenotypes.
In the 90s, when the study of genetic counseling for breast and ovarian cancer acquired
importance, it was believed that the trait “breast-ovarian cancer susceptibility” is determined
by a gene with two alleles in chromosome 13 or a gene with two alleles in chromosome 17.
At each of the loci where these genes reside we can assume that the Mendelian model of
segregation operates. And we can assume that the two genes are independent. The Mendelian
model that applies at each locus is illustrated in Table 3.4.
Table 3.4
Genotype i                     a1a1    a1a2       a2a2
Population probability         p²      2p(1 − p)  (1 − p)²
Penetrance: P(affected | i)    1       1          0
P(normal | i)                  0       0          1
P(offspring = i | parents):
(a1a1, a1a1)                   1       0          0
(a1a1, a1a2)                   1/2     1/2        0
(a1a2, a1a2)                   1/4     1/2        1/4
(a1a1, a2a2)                   0       1          0
(a1a2, a2a2)                   0       1/2        1/2
(a2a2, a2a2)                   0       0          1
3.4.6 Exercises
Exercise 1. At the University of California Los Angeles in 2017 there were 102,242 applicants
to the freshman class. Of these, 16,456 were admitted. Of these, 6,037 enrolled.
http://www.admission.ucla.edu/Prospect/Adm_fr/Frosh_Prof17.htm
If the pattern repeats for 2020, what is the probability that in 2020 an admitted
student enrolls?
Exercise 2. Of the graduate students at a university, 70% are engineering students and 30%
are students of other sciences. Suppose that 20% and 25% of the engineering and the “other”
population, respectively, smoke cigarettes. What is the probability that a randomly selected
graduate student is
Exercise 3. (This problem is from the Society of Actuaries (2007).) A public health researcher
examines the medical records of a group of 937 men who died in 1999 and discovers that
210 of the men died from causes related to heart disease. Moreover, 312 of the 937 men had
at least one parent who suffered from heart disease, and, of these 312 men, 102 died from
causes related to heart disease. Determine the probability that a man randomly selected from
this group died of causes related to heart disease, given that neither of his parents suffered
from heart disease.
Exercise 4. (This problem is inspired by Redelmeier and Yarnell (2013).) Do car crashes
increase in days close to Tax Day, which is April 15th in the United States? Consider 30 million
and 300 days in which the time of the year and the number of crashes were observed.
In 200 of those days, there were crashes and it was close to tax day; in 10 million of those
days there were no crashes and it was around tax day. In 100 of those days there were
crashes and the days were not around tax day, but were days of similar characteristics to
tax day otherwise. And finally, in 20 million of those days there were no crashes and no
tax days. What would be the estimate of the conditional probability of a crash given it is a
day close to tax day?
Exercise 5. (From Khilyuk, Chilingar, and Rieke (2005, 66).) Oysters are grown on three marine
farms for the purpose of pearl production. The first farm yields 20% of total production of
oysters, the second yields 30%, and the third yields 50%. The share of the oyster shells
containing pearls is 5% in the first farm, 2% in the second, and 1% in the third.
(i) What is the probability of event A that a randomly chosen shell contains a pearl? (ii) Under
Exercise 6. The approximately 100 million adult Americans (age 25 and over in 1985) were
roughly classified by education and age as follows (Statistical Abstract of the United States
1987, 122) (The numbers in the middle are proportions of adult Americans).
Age
25–35 years old 35–55 years old 55–100 years old
None 0.01 0.02 0.05
EDUCATION Primary 0.03 0.06 0.1
Secondary 0.18 0.21 0.15
College 0.07 0.08 0.04
(i) If an adult American is chosen at random, what is the probability of getting a 25–35-year-
old college graduate? (ii) What is the probability that a 35–55-year-old has completed
secondary education?
Exercise 8. About 52% of the population of China lived in urban areas in 2012. In 2012, the
upper-middle class accounted for just 14% of urban households, while the middle-middle class
accounted for almost 50%. About 56% of the urban upper-middle class bought electronics
and household appliances, as compared to 36% of the middle-middle class. If this continued
like this in the near future, what would be the probability that a randomly chosen household
in China is an urban upper-middle-class household that purchases appliances and electronics?
This information was obtained from https://www.mckinsey.com/industries/retail/our-insights/
mapping-chinas-middle-class.
There are many circumstances in which you would like to know the probability of an event,
but you cannot calculate it directly. You may be able to find it if you know its probability
under some conditions. The desired probability is a weighted average of the various condi-
tional probabilities. To see how we can achieve this, consider two events B and G defined in
the sample space.
B = (B ∩ G) ∪ (B ∩ Gᶜ).
By results in section 3.4, we can express the joint probabilities P(B ∩ G), P(B ∩ Gᶜ) in terms
of the conditional probabilities:
More generally, if we have a partition of the sample space into n events Gi, i = 1, 2, ..., n then
P(B) = Σ_{i=1}^n P(B | G_i)P(G_i).
Example 3.5.1
(Horgan (2009, Example 6.5).) Enquiries to an online computer system arrive on five commu-
nication lines. The percentage of messages received through each line are:
Line 1 2 3 4 5
% received 20 30 10 15 25
From past experience, it is known that the percentage of messages exceeding 100 char-
acters on the different lines are:
Line                        1   2   3   4   5
% exceeding 100 characters  40  60  20  80  90
P(A) = P(A | L1)P(L1) + P(A | L2)P(L2) + P(A | L3)P(L3) + P(A | L4)P(L4) + P(A | L5)P(L5) = 0.625.
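The total-probability computation above can be sketched directly from the two tables:

```python
# Line shares and P(A | line), where A = "message exceeds 100 characters"
p_line = {1: 0.20, 2: 0.30, 3: 0.10, 4: 0.15, 5: 0.25}
p_long_given_line = {1: 0.40, 2: 0.60, 3: 0.20, 4: 0.80, 5: 0.90}

# Law of total probability: weight each conditional probability by the line share.
p_A = sum(p_long_given_line[i] * p_line[i] for i in p_line)
print(round(p_A, 3))  # 0.625
```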
3.5.1 Exercises
Exercise 1. Automobile recalls by car manufacturers are associated mostly with three defects:
engine (E), brakes (B) and seats (Seats). A database of all recalls indicates that the probability
of each of these defects is:
Let R be the event that a car is recalled. It is known that P(R|E) = 0.9, P(R|B) = 0.8, P(R | Seats)
= 0.4 and P(R|other defect) = 0.4. What is the probability that a randomly chosen automobile
is recalled?
Exercise 3. (Chowdhury, Flentje, and Bhattacharya 2010, page 132). An earth dam may fail
due to one of three causes, namely: (a) overtopping; (b) slope failure; (c) piping and subsurface
erosion. The probabilities of failure due to these causes are respectively 0.7, 0.1, and 0.2. The
probability that overtopping will occur within the life of the dam is 10⁻⁵. The probability that
slope failure will take place is 10⁻⁴ and the probability that piping and subsurface erosion
will occur is 10⁻³. What is the probability of failure of the dam, assuming that there are no
other phenomena which can cause failure?
You are asked to identify the source of a defective part. You know that the part came
from one of three factories. Factory A produces 60% of the parts, factory B 30%, and
factory C 10% of the parts. It is known that 10% of the parts produced by factory A
are defective, 30% of the parts produced by factory B are defective, and 40% of the
parts produced by factory C are defective. Where did the defective part come from?
What do you think? Write your conclusion down on a piece of paper.
To evaluate your conclusion, do the following exercise: imagine that there are 100 parts.
Consider Table 3.5.
To guide your work completing the table, answer the following questions: (i) Of every 100
parts produced, how many were made by Factory A, B and C? Fill these in as the row totals
of the table. (ii) Of those parts produced by factory A, how many would you expect to be
defective? Repeat for Factories B and C, recording your results in the “Defective” column.
(iii) How many of the total of 100 parts in your table are defective? Enter the result as the
where P(G|B) is called the posterior probability of G, P(G) is the prior probability of G when
written in a Bayes rule formula, P(B|G) is the probability of B given G, and P(B) is
the total probability of B, calculated as indicated in Section 3.5.
It also follows that
P(B | G) = P(G | B)P(B) / P(G),
where now P(B|G) is the posterior probability of B given G, and P(B) is the prior
probability.
These last two results are called Bayes Theorem and reflect the relation between con-
ditional probabilities. If we know P(B|G) we can obtain P(G|B) with Bayes
theorem, and vice versa. Bayes rule indicates how probabilities change in light of evidence.
The author of Bayes Theorem was Thomas Bayes (1702–1761), a mathematician and minister
who published little in mathematics, but what he wrote has been very significant in numer-
ous decision making problems. One area where conditional probability and Bayes Theorem
play a very important role is criminology. Bayesian filtering, or recalculating the probability
of something given new information, plays a very important role in DNA processing and
solving crimes.
The conditional probability applet found at http://www.randomservices.org/random/
apps/ConditionalProbabilityExperiment.html illustrates Bayes theorem using Venn diagrams.
The purple area divided by the area of the event marked in color grey is the conditional
probability.
Example 3.6.1
As an example of application of Bayes theorem and conditional probability in Criminology,
the reader is encouraged to complete the activity at
https://education.ti.com/en/activity/detail?id=510617F0CEE24132860AC2E01779C503
Example 3.6.2
Bayes theorem is the basis for many filtering programs, most notably those that filter spam
from our e-mail inboxes. Spam is unsolicited commercial e-mail. Significance, a statistical mag-
azine published at the time by the Royal Statistical Society of the United Kingdom, reported
in an article written by Joshua Goodman and David Heckerman (Goodman and Heckerman,
2004) that about 50% of the mail on the internet at the time was spam. Given that it costs
very little to send spam, even a tiny response rate makes spam economically viable. It pays
to familiarize yourself with the problems and solutions associated with spam mail, because
you may end up paying a high cost. One of the most popular techniques for stopping spam
is Bayesian spam filtering. Many mail clients incorporate a Bayesian spam filter today. The
mentioned article explains how it works, at a basic level.
We present in this example a simplified version of a Bayesian spam filter.
A department in a major company keeps all emails received by employees. The first month
that the company did this, there were 10000 emails. The IT person concluded that during
that month:
• 90% of the emails that are spam contained the word sex in the subject field
• 7% of the emails that were not spam contained the word sex in the subject field
• 20% of the emails received were spam
This information is called in machine learning a “training set.” It can be used to create a
filter that would automatically decide which emails should not be allowed to enter the server
the next month. The filter would operate according to the following rule: if the probability
that an email that contains the word sex in the subject field is spam is larger than the prob-
ability that this email is not spam, reject the email. That seems like a good rule. However, we
do not know those probabilities. They must be calculated somehow. Bayesian spam filtering
offers a methodology for that.
P(spam | word sex) = P(word sex | spam)P(spam) / P(word sex) = 0.7627,
where
P(word sex) = P(word sex | spam)P(spam) + P(word sex | no spam)P(no spam).
Most messages contain many words. Actual spam filters compute many conditional
probabilities:
P(sex ∩ click ∩ … ∩ other words | spam message)
= P(word sex | spam)P(word click | spam) … P(other word | spam).
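The single-word filter described in this example amounts to a few lines of arithmetic, sketched here:

```python
# Training-set estimates from the example
p_spam = 0.20            # 20% of received emails were spam
p_sex_given_spam = 0.90  # spam messages containing "sex" in the subject
p_sex_given_ham = 0.07   # non-spam messages containing "sex" in the subject

# Total probability that the word "sex" appears in the subject field
p_sex = p_sex_given_spam * p_spam + p_sex_given_ham * (1 - p_spam)

# Bayes theorem: posterior probability that such a message is spam
p_spam_given_sex = p_sex_given_spam * p_spam / p_sex
print(round(p_spam_given_sex, 4))  # 0.7627
```

Since 0.7627 > 1 − 0.7627, the filter's rule would reject a message whose subject contains the word.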
Example 3.6.3
Pitman (1993) starts the discussion of conditional probability with the following example:
If you bet that 2 or more heads will appear in 3 tosses of a fair coin, you are more likely to
win the bet given the first toss lands heads than given the first toss lands tails. Why?
If there are at least two heads, then the following event A happens:
P(A|W) = 3/4.
because W = {HHH, HHT, HTH, HTT} can occur in 4 ways, and just 3 of these outcomes make
A occur. These three outcomes define the event {HHH, HHT, HTH}, which is the intersection
of A and W, denoted A ∩ W or simply AW. Similarly, if the event Wᶜ = “first toss lands
tails” occurs, event A happens only if the next two tosses land heads, with probability 1/4. So
P(A|Wᶜ) = 1/4.
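With only eight equally likely outcomes, both conditional probabilities can be checked by enumeration, a quick sketch:

```python
from fractions import Fraction
from itertools import product

tosses = ["".join(t) for t in product("HT", repeat=3)]  # 8 equally likely

A = {t for t in tosses if t.count("H") >= 2}  # at least two heads
W = {t for t in tosses if t[0] == "H"}        # first toss lands heads
Wc = set(tosses) - W                          # first toss lands tails

# With equally likely outcomes, P(A | W) = |A ∩ W| / |W|.
p_A_given_W = Fraction(len(A & W), len(W))
p_A_given_Wc = Fraction(len(A & Wc), len(Wc))
print(p_A_given_W, p_A_given_Wc)  # 3/4 1/4
```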
Conditional probabilities can be defined as follows in any setting with equally likely outcomes:
Example 3.6.4
Bayes theorem is widely used by statisticians in forensic science to identify drug traces,
fiber matching from clothes and guilt of suspects in light of evidence. In statistics, the given
By Bayes Theorem,
P(I | C) = P(C | I)P(I) / P(C) = (0.1)(0.5) / 0.125 = 0.4.
The probability that a convicted suspect was wearing Nikeia shoes is 0.4.
Example 3.6.5
(This example is from Berry 1996, 150).) Legal cases of disputed paternity in many countries
are resolved using blood tests. Laboratories make genetic determinations concerning the
mother, father, and alleged child. Most labs apply Bayes’ rule in communicating the testing
results. They calculate the probability that the alleged father is in fact the child’s father
given the genetic evidence.
Suppose you are on a jury considering a paternity suit brought by Suzy Smith’s mother against
Al Edged. The following is part of the background information. Suzy’s mother has blood type O
and Al Edged is type AB. All probability calculations are done conditional on this information.
You have other information as well. You hear testimony concerning whether Al Edged and
Suzy’s mother had sexual intercourse during the time that conception could have occurred,
about the timing and frequency of such intercourse, about Al Edged’s fertility, about the
possibility that someone else is the father, and so on. You put all this information together
in assessing the probability that Al is Suzy’s father.
The evidence of interest is Suzy's blood type. If it is O, then Al Edged is excluded from
paternity; he is not the father, unless there has been a gene mutation or a laboratory error.
Suzy’s blood type turns out to be B; call this event B. According to Bayes’ rule, if F is the event
that Al Edged is the father,
P(F|B) = P(B|F)P(F) / [P(B|F)P(F) + P(B|F^c)(1 − P(F))].
The P(F) comes from all the other, nonblood, evidence. Here are the possible values of the
posterior probability P(F|B) under different assumptions for P(F).
The reason such a large increase is possible is that Suzy’s paternal gene B is
relatively rare.
Blood banks and other laboratories that analyze genetic factors in paternity cases have a
name for the Bayes factor in favor of F: Paternity Index.
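The calculation labs carry out can be sketched as a function of the prior P(F); the two likelihoods below are hypothetical illustration values, not the ones Berry (1996) derives from the genetic evidence:

```r
# Sketch of the lab's Bayes' rule calculation of P(F|B).
# p_B_given_F and p_B_given_notF are hypothetical illustration values;
# the actual values come from the genetic testing.
posterior_paternity <- function(prior,
                                p_B_given_F = 0.50,
                                p_B_given_notF = 0.09) {
  (p_B_given_F * prior) /
    (p_B_given_F * prior + p_B_given_notF * (1 - prior))
}

# Posterior under different assumptions for the prior P(F)
posterior_paternity(c(0.10, 0.50, 0.90))
```

Because the paternal gene is assumed rare here (small P(B|F^c)), the posterior rises steeply even from modest priors, which is the point the example makes.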
By independence, P(H) = P(I)P(X), where P(X) is the probability that the Earth presents a
viable collision cross-section to a NEO, and P(I) is the probability that the orbits of the Earth
and a given NEO intersect. Since I and X are independent, we multiply their probabilities to
get P(H); X and I are the conditions needed to have a hazard.
3.6.2 Exercises
Exercise 1. (This exercise is from Bennet (1998, 3).) If a test to detect a disease whose prev-
alence is one in a thousand has a false positive rate of 5 percent, what is the chance that
a person found to have a positive test result actually has the disease, assuming you know
nothing about the person's symptoms or signs?
Exercise 2. In Example 3.5.1, compute the probability that a message that has more than
100 characters came from line 2.
Exercise 3. A bin contains 25 light bulbs. Let G be the set of light bulbs that are in good
condition and will function for at least 30 days. Let T be the set of light bulbs that are totally
defective and will not light up. And let D be the set of light bulbs that are partially defective
and will fail on the second day of use. If a randomly chosen bulb initially lights, what is the
probability that it will still be working after a week?
Exercise 4. Two methods, A and B, are available for teaching a certain industrial skill. The
failure rate is 30% for method A, and 10% for method B. Method B is more expensive, how-
ever, and hence is used only 20% of the time. Method A is used the other 80% of the time. A
worker is taught the skill by one of the two methods, but he fails to learn it correctly. What
is the probability that he was taught by using method A?
Exercise 5. On a US campus, 20% of data analysts use R, 30% use SPSS, and 50% use SAS. 20%
of the R programs run successfully as soon as they are typed, 70% of the SPSS programs run
successfully as soon as they are typed, and 80% of the SAS programs run successfully as soon
as they are typed.
Question 1. We are given that P(A) = 0.3, P(B) = 0.7 and P(A ∩ B) = 0.1. Thus
Question 2. Which of the following sequences is most likely to result from flipping a fair
coin five times?
a. HHHTT
b. THHTH
c. THTTT
d. HTHTH
e. All four sequences are equally likely
Question 3. Which of the following sequences is least likely to result from flipping a fair coin
five times?
a. HHHTT
b. THHTH
c. THTTT
d. HTHTH
e. All four sequences are equally likely
Question 4. A high end neighborhood has two types of residents. It is known that 30% of
residents are only architects, 50% are only runners, and 10% are both architects and runners.
Let A denote the event that a resident, randomly chosen, is an architect, and let R denote
the event that the resident is a runner. Then P(A^c ∩ R^c) is
a. 0.6
b. 0.5
c. 0.2
d. 0.1
a. 0.8
b. 0.3
c. 0.6
d. 0.5
e. 0.4
Question 6. The price of the stock of a very large company on each day goes up with prob-
ability p or down with probability (1 - p). The changes on different days are assumed to be
independent. Consider an experiment where we observe the price of the stock for three
days. And consider event A that the stock price goes up the first day. What is the probability
of event A?
a. p^3
b. 3(1 - p)^2 + (1 - p)^3
c. p^3 + 2p(1 - p)^2 + (1 - p)p^2
d. p^3 + 2p^2(1 - p)^2 + p(1 - p)^2
Question 7. In the Campos de Dalias in Almeria, Spain, 59.7% of the hectares in the agricultural
region are covered by invernaderos (greenhouses under which most agricultural production
takes place, using a very controlled and highly technological method) and the rest is farmed
with traditional agricultural methods. 30% of the hectares in the invernadero area are
dedicated to producing tomatoes. What is the probability that a randomly chosen hectare in the
agriculture region of Campos de Dalias is of the invernadero type and produces tomatoes?
(Junta de Andalucia 2016)
a. 0.1791
b. 0.08
c. 0.2
d. 0.12
a. 0.5
b. 0.1011
c. 0.6316
d. 0.3684
a. 0.62
b. 0.38
c. 0.84211
d. 0.1118
a. 0.03
b. 0.023
c. 0.3
d. 0.091
3.8 R code
Some theoretical probabilities are hard to get analytically. But doing a simulation may pro-
vide some insight and very accurate answers. In this chapter’s R session you will be doing a
simulation to calculate probabilities of matching. In this section, we will do the probability
version of the R activity in Chapter 2, but with a different application.
• Probability model: insert four numbers 1 to 4 in a box. The numbers will be drawn
without replacement.
• Trial: Select four numbers without replacement.
• What to record: whether all the numbers match the students’ computers (1 = yes,
0 = no).
• What to compute: number of yes/total number of trials.
trials = matrix(rep(0, 400000), ncol = 4, nrow = 100000)
matching = rep(0, 100000)
for (i in 1:100000) {
  trials[i, ] = sample(1:4, 4, replace = F)
  if (sum(trials[i, ] == sort(trials[i, ])) == 4) {
    matching[i] = 1
  } else {
    matching[i] = 0
  }
}
table(matching) # see how many full matches
head(trials) # double check your first numbers
head(matching) # double check your first numbers
1/24 # theoretical solution
sum(matching)/100000 # simulation answer
3.8.2 Exercises
Exercise 1. Using code used in Lesson 2, find the probability that student 1 gets the right
computer. Compare that with the theoretical probability.
Exercise 2. Modify the code given in section 3.8.1 to find the empirical probability that
there are no matches. What probability does your simulation produce? Compare it with the
theoretical probability.
Exercise 3. Modify the code given in section 3.8.1 to find the empirical probability that student
1 or student 3 gets their computer. Compare with the theoretical probability.
Exercise 1. Prove the following theorem: If E and F are independent, then so are the following
pairs of events: (a) E and F^c; (b) E^c and F; (c) E^c and F^c.
Exercise 4. Application to 2008 Elections. The following are results from Election Day.
[Note: We are assuming that the groups are mutually exclusive here and in what follows.]
From national results, we know that 52% of the votes went to Obama, 46% to McCain, and 2%
Exercise 5. (This problem is based on Society of Actuaries (2007, Problem 2).) The probability
that a visit to the dentist ends up in neither X-rays nor tooth pulling out is 35%. Typically, 30%
of visits with this prognosis result in the tooth being pulled out and 40% in just X-rays. Deter-
mine the probability that a visit to a dentist clinic results in both a tooth pulled out and X rays.
Exercise 6. (Society of Actuaries (2007).) You are given P(A ∪ B) = 0.7 and P(A ∪ B^c) = 0.9.
Determine P(A).
Exercise 8. (This problem is based on Winter and Carlton (2000, page 60).) A lie detector test
is accurate 70% of the time, meaning that for 30% of the times a suspect is telling the truth,
the test will conclude that the suspect is lying, and for 30% of the times a suspect is lying,
the test will conclude that he or she is telling the truth. A police detective arrested a suspect
who is one of only four possible perpetrators of a jewel theft. If the test result is positive,
what is the probability that the arrested suspect is actually guilty? Show work.
Exercise 10. An actuary studying the insurance preferences of automobile owners makes
the following conclusions: (i) an automobile owner is twice as likely to purchase collision
coverage as disability coverage, (ii) the event that an automobile owner purchases collision
coverage is independent of the event that he or she purchases disability coverage, and (iii) the
probability that an automobile owner purchases both collision and disability coverage is 0.15.
So the actuary asks: What is the probability that an automobile owner purchases neither
collision nor disability coverage?
Exercise 11. A collection of 100 computer programs was examined for various types of errors
(bugs). It was found that 20 of them had syntax errors, 10 had input/output (I/O) errors that
were not syntactical, five had other types of errors, six programs had both syntax errors
and I/O errors, three had both syntax errors and other errors, two had both I/O and other
errors, while one had all three types of errors. A program is selected at random from the
collection—that is, selected in such a way that each program was equally likely to be chosen.
Let Y be the event that the selected program has errors in syntax, I be the event that it has
I/O errors, and O the event that it has other errors. What is the probability that the randomly
chosen program has some type of error?
                      Yes      No
Center of the city    0.150    0.250
Suburbs               0.250    0.150
Rural areas           0.050    0.150
The table reflects the opinion of adults eligible to vote and is saying, for example, that 15%
of the town adults eligible to vote live in the center of the city and are in favor of the car
pool lane.
With this information, answer the following questions: (i) What is the probability that a
randomly chosen eligible voter disapproves of the car pool lane? (ii) What is the probability
that a randomly chosen eligible voter does not live in the center of the city and disapproves
of the car pool lane? (iii) What is the probability that a voter from the suburbs disapproves
of the car pool lane?
Exercise 14. (This problem is from Rossman and Short (1995).) Consider the case of Joseph
Jamieson, who was tried in a 1987 criminal trial in Pittsburgh’s Common Pleas Court on charges
of raping seven women in the Shadyside district of the city over a period from April 18, 1985,
to January 30, 1986. Fienberg (1990) reports that by analyzing body secretion evidence taken
from the scenes of the crimes, a forensic expert concluded that the assailant had the blood
characteristics and genetic markers of type B, secretor, PGM 2 + 1-. She further testified that
only .32% of the male population of Allegheny County had these blood characteristics and
that Jamieson himself was a type B, secretor, PGM 2 + 1-. The natural question to ask is how
a juror should update the probability of Jamieson’s guilt in light of this quantitative forensic
evidence (event E). Note that, in this case, Pr(E|G) = 1 and Pr(E|not G) = .0032, since if Jamieson
did not commit the crimes, then some other male in Allegheny County presumably did. Plug-
ging these into Bayes’ Theorem as presented above and simplifying leads to the expression
P(G|E) = P(E|G)P(G) / [P(E|G)P(G) + P(E|G^c)P(G^c)],
where Pr(G) represents the juror’s subjective assessment of Jamieson’s guilt prior to hearing
the forensic evidence. Calculate the posterior or updated probability of guilt for the following
values of the prior probability. The first value is given.
Note: employed and unemployed persons are by definition in the labor force. See the US
Bureau of Labor Statistics glossary. It is important to know the glossary to answer this problem.
Suppose that an arbitrarily selected person from the civilian noninstitutionalized popu-
lation was asked, in 1989, to fill out a questionnaire on employment. Find the probability of
the following events. (i) The person was in the labor force. (ii) The person was employed.
(iii) The person was employed and in the labor force. (iv) The person was not in the labor
force or was unemployed.
Separately, answer the following question: what is the probability that a person in the
labor force was employed?
Exercise 16. Here is some information about the first-year class at University of Washington
in a given year of the past. (a) 50% are mostly masculine; (b) 25% have a car; (c) 60% of those
with a car drive to school; (d) 40% are blonde; (e) 80% are from the state of Washington; (f)
10% are from Oregon; and (g) 5% are from California. If I pick a student at random, what is
the chance that this student is from outside the state of Washington?
Exercise 17. (This problem is based on Berry and Chastain (2004).) Testosterone (T) is the
naturally occurring male hormone produced primarily in the testes. Epitestosterone (E) is
an inactive form of testosterone that may serve as a storage substance or precursor that gets
converted to active T. The normal urinary range for the T/E ratio for any person has been set
by scientists to be 6:1 (meaning that 99% of normal men will have that or lower). (i) If 1% of
nonusers of testosterone as a doping agent have a urinary T/E ratio above the established
normal range, what would be the probability that the test for testosterone doping is a false
positive? How many of the 90,000 athletes tested annually would be accused of testosterone
doping even though they did not dope?
Anti-doping screening is done to detect whether an athlete has used testosterone as a
doping agent. In the context of a disease like, for example, AIDS, and a test to screen for
AIDS, we define sensitivity of a test as the value of the following probability: P(+ | D), i.e.,
the probability of a true positive result in the test for someone that has AIDS. We define
the specificity of the test as the P (- |no D), i.e., the probability that the person without the
disease gets a negative test result. (ii) Define sensitivity and specificity in the context of the
anti-doping screening.
Exercise 18. In 2002, a group of medical researchers reported that on average, 30 out of
every 10,000 people have colorectal cancer. Of these 30 people with colorectal cancer,
18 will have a positive hemoccult test. Of the remaining 9,970 people without colorectal
cancer, 400 will have a positive test. (i) If a randomly chosen person has a negative test
result, what is the probability that the person is free of colorectal cancer? (ii) If it is learned
that there were 2,400 patients tested, about how many should we expect to be free of
colorectal cancer?
Exercise 19. Of the new homes on the market in a neighborhood of California, 21% have
pools, 64% have garages, and 17% have both. (i) If you pick a house with a garage, what
is the probability that it has a pool? (ii) Are having a garage and a pool disjoint events?
(iii) Are having a garage and a pool independent events?
Exercise 20. A box contains three black tickets numbered 1, 2, 3, and three white tickets
numbered 1,2,3. One ticket will be drawn at random. You have to guess the number on the
ticket. You catch a glimpse of the ticket as it is drawn out of the box. You cannot make out
the number but see that the ticket is black. (i) What is the chance that the number on it will
be 2? (ii) The same but the ticket is white. (iii) Are color and number independent?
Exercise 21. Someone is going to toss a coin twice. If the coin lands heads on the second
toss, you win a dollar. (i) If the first toss is heads, what is your chance of winning the
dollar? (ii) If the first toss is tails, what is your chance of winning the dollar? (iii) Are the
tosses independent?
Exercise 22. (This exercise is from Ross (2010).) A certain organism possesses a pair of each
of 5 different genes (which we will denote by the first five letters of the English alphabet).
Each gene appears in 2 forms (which we designate by lowercase and capital letters). The
capital letter will be assumed to be the dominant gene in the sense that if an organism
possesses the gene pair xX, then it will outwardly have the appearance of the X gene. For
instance, if X stands for brown eyes and x for blue eyes, then an individual having either gene
pair XX or Xx will have brown eyes, whereas one having gene pair xx will have blue eyes.
The characteristic appearance of an organism is called its phenotype, whereas its genetic
constitution is called its genotype. (Thus 2 organisms with respective genotypes aA, bB, cc,
dD, ee and AA, BB, cc, DD, ee would have different genotypes but the same phenotype.) In
a mating between two organisms, each one contributes, at random, one of its gene pairs of
each type. The 5 contributions of an organism (one of each of the 5 types) are assumed to be
independent and are also independent of the contributions of its mate. In a mating between
Exercise 23.
A study of the relationship between smoking and lung cancer found that 238 individuals
smoked and had lung cancer, 247 individuals smoked and had no lung cancer, 374 individuals
did not smoke and had lung cancer, and 810 individuals did not smoke and did not have lung
cancer. There were a total of 1,669 people randomly chosen to participate in the study. Since
smoking is a risk factor for lung cancer, the epidemiology literature refers to probability as
risk when associated with a risk factor. For example, the risk of lung cancer among smokers is
the probability that a smoker has lung cancer. The risk of lung cancer among nonsmokers is
the probability that a nonsmoker has lung cancer. The relative risk is the ratio of those prob-
abilities. Are those (i) conditional, (ii) total, (iii) joint probabilities? Select one and calculate
the risks mentioned using the information given.
Exercise 24.
A blood test for hepatitis is 90% accurate. If a patient has hepatitis, the probability that the
test will be positive is 0.9 and if the patient does not have hepatitis the probability that the
test is negative is 90%. The rate of hepatitis in the general population is 1 in 10,000. Jaundice
is a medical condition with yellowing of the skin or whites of the eyes, arising from excess
of the pigment bilirubin and typically caused by obstruction of the bile duct, by liver disease,
or by excessive breakdown of red blood cells. The physician knows that this type of patient
has a probability of ½ of having hepatitis.
(i) What is the probability that a person who receives a positive blood test result actually
has hepatitis? (ii) A patient is sent for a blood test because he has lost his appetite and has
jaundice. If this person receives a positive test result, what is the probability that the patient
has hepatitis?
Exercise 25. A concert in Vienna is paid for by Visa (25% of customers), Mastercard (10% of
customers), American Express (15% of customers), Apple Pay (35%) and PayPal (15%). If we
choose two persons that will attend the concert and already bought tickets, what is the
probability that the two persons will have paid by PayPal?
Exercise 26. At the time of writing this book, the Brexit deal in the United Kingdom was
being debated. It turns out that the United Kingdom has tried before to leave the European
Union. The British National Referendum of 1975 asked whether the United Kingdom should
Exercise 27. The prosecutor's fallacy is assuming that P(A|B) = P(B|A). Under what conditions
would that equality be true?
Exercise 28. (This exercise is based on Skorupski and Wainer (2015).) In 1992, the population
of women in the United States was approximately 125 million. That year, 4,936 women were
murdered. Approximately 3.5 million women are battered every year. In 1992, 1,432 women
were murdered by their previous batterers. Let B be the event “woman battered by her
husband, boyfriend or lover,” M the event “woman murdered.” What is the probability that a
murdered woman was murdered by her batterer?
Exercise 29. (This problem is based on Schneiter (2012).) The following website contains
an activity for K–12 students to illustrate Buffon’s needle problem, as an example of
geometric probabilities.
Geometric probability is a field in which probabilities are computed as proportions of
areas (lengths or volumes) of geometric objects under specified conditions.
Go to http://www.amstat.org/education/stew and find the activity “Exploring Geometric
Probabilities with Buffon’s Coin Problem,” by Schneiter. Complete the student’s activity pages.
In addition to that, discuss the definition the author gives of “theoretical probability.” Is that
the only definition of probability?
Exercise 30. To do blind grading, a professor asks students to write a code on the front page
of their exam and the second page. The first page will be torn off. The code must have five
characters, each being one of the 26 letters of the alphabet (a–z) or any of the ten integers
(0–9). The code must start with a letter. If we select a student’s code at random, what is the
probability that the code starts with a vowel or ends with an odd number?
4.1 Sampling
via official statistics produced by governments, or polls and surveys produced by private
organizations, are based on probabilistic sampling. It is not coincidental, then, that a basic
random phenomenon with whose analysis we are concerned in probability theory is that of
finite sampling.
In the physical and life sciences and engineering, it is repeated experimentation that
helps us trust conclusions from experiments. It is not coincidental, either, that another
basic phenomenon with whose analysis we are concerned in probability theory is that of
repeated experimentation.
Chapter 4 is intended to give you a methodology to approach applied probability prob-
lems concerned with random samples and with repeated experimentation in a formal
way. This chapter presents a few models that are often used to solve a very wide array of
applied probability problems in the context of sampling from populations and repeated
sampling. The methods learned in this chapter may be applied to other probability problems
as well.
The chapter uses some notations from combinatorial analysis as the mathematical coun-
terpart of counting samples and repeated experimentation.
4.1.1 n-tuples
A basic tool for the construction of sample spaces in the context of sampling is the notion
of an n-tuple.
Definition 4.1.1
An n-tuple (o1, o2, …, on) is an array of n symbols, o1, o2, …, on, which are called,
respectively, the first component, the second component, and so on, up to the nth
component, of the n-tuple. The order in which the components of an n-tuple are written
is of importance (and consequently one sometimes speaks of ordered n-tuples). Two
n-tuples are identical if and only if they consist of the same components written in the
same order. The usefulness of n-tuples derives from the fact that they are convenient
devices for reporting the results of the drawing of a sample of size n.

Figure 4.2 Sampling from populations results in an n-tuple. Depending on the
sample drawn, the n-tuple will be different.
Statisticians do surveys and they sample from large populations in order to learn about the
population from the sample. Surveys use sampling without replacement. A simple random sample
is a sample drawn without replacement. However, if we make the assumption that the population
is infinite, then this process is equivalent to drawing with replacement. Statisticians have come up
with “finite population corrections” when this assumption is not valid, however. There is a whole
area called Sampling Theory that deals with these issues.
Box 4.2
Math tidbit
If M = 7 and n = 3,

M! = 7 × 6 × 5 × 4 × 3 × 2 × 1 = 5040.

Think of this as the number of ways in which a race with seven horses could end.

M!/(M − n)! = (7 × 6 × 5 × 4 × 3 × 2 × 1)/(4 × 3 × 2 × 1) = M(M − 1)…(M − n + 1) = 210.

Think of this as the number of ways in which we could select a first-prize winner
($100,000), second-prize winner ($50,000), and third-prize winner ($25,000) from seven
contestants. A convenient notation is (M)n, but there could be other notations for the same
thing.
Finally,

M!/((M − n)! n!) = (7 × 6 × 5 × 4 × 3 × 2 × 1)/((4 × 3 × 2 × 1) × (3 × 2 × 1))
= M(M − 1)…(M − n + 1)/n! = 35.

A convenient notation for that is the binomial coefficient, written (M choose n). This is
the number of ways in which we could select three people at random to get a free movie
pass to the latest summer blockbuster. The three are getting the same prize: (Albert,
Aakifah, Sidharta) is the same set as (Aakifah, Albert, Sidharta) and (Sidharta, Albert,
Aakifah); these are three of the 3! = 6 ways we can order these three names.
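The three counts in this tidbit can be checked directly in R:

```r
# Check the three counts in the math tidbit for M = 7 and n = 3
M <- 7; n <- 3
n_race_orders  <- factorial(M)                     # ways a 7-horse race can end
n_prize_orders <- factorial(M) / factorial(M - n)  # ordered 1st/2nd/3rd winners
n_sets         <- choose(M, n)                     # unordered sets of 3 winners
```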
(3,4,1),(3,4,2),(3,1,2),(3,1,4),(3,2,1),(3,2,4),(4,1,2),(4,1,3),(4,2,3),(4,2,1),(4,3,1),(4,3,2)}
And we can see that the number of ordered 3-tuples (the number of random samples) in
this sample space can be calculated, using the various notations introduced earlier,
(4)3 = 4 × 3 × 2 = 4!/1! = 24.
If we assume that each of the balls is equally likely to be chosen, which would be a reason-
able assumption if the balls are well mixed and the drawing is done at random, the probability
of obtaining the first number is 1/4. This leaves three balls equally likely to be drawn, giving
a probability of 1/3 for the second ball, and, by the same token, the third ball has probability
1/2. The probability of each of the 24 3-tuples is (1/4)(1/3)(1/2) = 1/24 = 0.04166667, and
we can see that each of the samples in the sample space would have the same probability.
When we multiply 24 by (1/24) we get 1, as we should. Recall that P(S) = 1, by axiom.
With this information, we can put the concepts learned so far at work to find probabilities
of events.
We said in Chapter 3, Section 3.2, that the probability of an event is the sum of the prob-
abilities of the outcomes in the event. Let A be the event that the numbers 1,2,3 are in the
sample. What is the probability of A? We observe that there are six samples with the number
(1,2,3), each sample with probability 1/24. So the probability is 6(1/24) = 1/4 = 0.25.
We also said in Chapter 1, Section 1.3, that when the outcomes are equally likely, we can
just calculate the probability alternatively by counting the number of favorable outcomes and
dividing by the total number of outcomes in the sample space. This result can also be seen
as the number of ways the numbers 1,2,3 can be ordered (3!) divided by the total number of
samples in the sample space, or
P(A) = 3!/24 = 6(1/24) = 1/4 = 0.25.
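The same answer can be obtained by brute-force enumeration of the 24 ordered samples; a minimal sketch in R:

```r
# Enumerate all ordered samples of size 3 drawn without replacement from
# balls numbered 1 to 4, and compute P(A) for the event A that the
# numbers 1, 2, 3 are in the sample
g <- expand.grid(b1 = 1:4, b2 = 1:4, b3 = 1:4)
no_repl <- g[g$b1 != g$b2 & g$b1 != g$b3 & g$b2 != g$b3, ]
n_samples <- nrow(no_repl)                              # 24 ordered samples
in_A <- apply(no_repl, 1, function(s) setequal(s, 1:3))
p_A <- sum(in_A) / n_samples                            # 6/24 = 0.25
```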
Example 4.1.4
If the sampling is done with replacement, then the 64 samples of size 3 from an urn containing
4 balls, numbered 1 to 4, can also be listed:
(1,4,1), (4,1,1), (2,2,1) (1,2,2), (2,1,2), (2,2,3), (2,3,2),(3,2,2), (2,2,4), (2,4,2), (4,2,2),
(3,3,1), (3,1,3), (1,3,3), (3,3,2), (3,2,3),(2,3,3) ,(3,3,4), (3,4,3), (4,3,3), (4,4,1), (4,1,4), (1,4,4),
(4,4,2), (4,2,4),(2,4,4), (4,4,3), (4,3,4), (3,4,4),
(2,3,4),(2,4,3), (3,2,4),(3,4,2),(4,2,3),(4,3,2) }.
We can see that the number of ordered n-tuples in this sample space can be calculated,
using the notations seen earlier, as 4 × 4 × 4 = 4^3 = 64 n-tuples. If we assume that each of the
balls is equally likely to be chosen, which would be a reasonable assumption if the balls are
well mixed and the drawing is done at random, the probability of obtaining the first number
is 1/4. Because the ball is put back in the urn, the probability of the second number is 1/4,
and, by the same token, the third ball has probability 1/4. The probability of
each of the 64 3-tuples is (1 / 4)(1 / 4)(1 / 4) = 1 / 64 = 0.015625, and we can see that each
of the samples in the sample space would have the same probability. Again, we check that
64 × 0.015625 = 1, because P(S) must be 1, by axiom.
With this information, we can put the concepts learned so far to work to find probabilities
of events, like we did in Example 4.1.3.
We said in Chapter 3 that the probability of an event is the sum of the probabilities of the
outcomes in the event.
Let A be the event that the numbers 1,2,3 are in the sample. What is the probability of A
now that we are sampling with replacement? We observe that there are six samples with the
numbers (1,2,3), each sample with probability 1/64. So the probability is 6(1/64) = 0.09375.
P(A) = 3!/64 = 6(1/64) = 0.09375.
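Enumerating the 64 ordered 3-tuples confirms this; a minimal sketch in R:

```r
# Sampling with replacement: all 4^3 = 64 ordered 3-tuples from balls 1:4.
# A sample contains the numbers 1, 2, 3 only if it is a permutation of them.
g <- expand.grid(b1 = 1:4, b2 = 1:4, b3 = 1:4)
in_A <- apply(g, 1, function(s) setequal(s, 1:3))
p_A <- sum(in_A) / nrow(g)   # 6/64 = 0.09375
```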
Table 4.1
Sets Samples
{(1,2,3)} (1,2,3),(1,3,2),(2,1,3),(2,3,1),(3,1,2),(3,2,1)
{(1,2,4)} (1,2,4),(1,4,2),(2,1,4),(2,4,1),(4,1,2),(4,2,1)
{(2,3,4)} (2,3,4),(2,4,3),(3,2,4),(3,4,2),(4,3,2),(4,2,3)
{(1,3,4)} (1,3,4),(1,4,3),(3,1,4),(3,4,1),(4,1,3),(4,3,1)
Which contains more information, the set notation or the listing of the corresponding
samples? To compute probabilities, the sample listing is the most informative. In practice,
without concern about the probability, it depends on what the sampling is done for. If the numbers
in the balls have been assigned to individuals (for example, Jean Claude got number 1, Ching
Ti got number 2, Francisca got number 3 and Rakiyah got number 4), and the drawing is done
to decide who will be the first, second, and third to be called when needed for combat, then
the samples are each representing distinct things. (1,2,3) means Jean Claude will go first if
there is need for someone in combat. Next time there is need, Ching Ti will go, and so on.
On the other hand, if the sample is (3,2,1), things look different for Francisca now because
she will go first. In other words, the information in sample (1,2,3) is not the same as that in
(3,2,1). If we just used the set notation we would be losing a lot of information.
If the drawing is done to select three people for a committee representing the school,
with no particular title for any of the members in the sample, then they might as well be
represented by the set, without loss of information about the content.
Returning to the combat situation: when sampling with replacement, it is possible that
the same person is called to combat repeatedly. For example, (1,1,1) means that Jean Claude
gets called to combat first, then the next time someone is needed he could be called again,
and the next time he would be the one going as well. Sample (4,1,4) means that the first time
it is Rakiyah who goes to combat, the second time it is Jean Claude, and the third time it is
Rakiyah again. Thus, which model to use, with or without replacement, depends on the
context of your problem.
Regardless of the specification used, when computing probability one must keep in mind
the larger sample specification to compute the probabilities correctly.

Box 4.3
Vietnam War draft
The Vietnam War draft lottery (https://www.usatoday.com/vietnam-war/draft-picker) was
sharply criticized by statisticians for not using a true probability sampling method. The
birth month and day were placed in the bowl in such a way that 18-year-old men born in
the last months of the year had lower draft numbers, and therefore a greater chance of
being drafted than those born earlier in the year. Most of the drafted soldiers would end
up fighting in the jungles of Vietnam. Norton Starr (Starr (1997)) gives a survey of some
statistical analyses done of the resulting samples
(https://www.tandfonline.com/doi/full/10.1080/10691898.1997.11910534).
The lottery has been mentioned in numerous textbooks and articles on some aspect of
probability; for example, Wild and Seber (2000, 145). Knowing probability theory well helps
understand the shortcomings that may transpire in some surveys and polls conducted
nowadays.

Example 4.1.5
Given the collection of 4 × 3 × 2 possible samples in Example 4.1.3, the number of sets can
be found by dividing 4 × 3 × 2 by 3 × 2 × 1 or 3!, where 3! represents the number of ways in
which the numbers 1,2,3 can be ordered. The notation that is usually adopted to represent
this operation is the binomial coefficient
\[ \frac{4 \times 3 \times 2}{3!} = \frac{(4)_3}{3!} = \frac{4!}{1!\,3!} = \binom{4}{3} = 4, \]
where \(\binom{4}{3}\) is read "4 choose 3." The probability of the event A can be written in terms of this notation. That is, the probability of obtaining the set {(1,2,3)} can be calculated as
\[ P(A) = \frac{6}{24} = \frac{1}{\binom{4}{3}} = \frac{1}{4}. \]
The number of sets of S of size k, multiplied by the number of samples of size k that can be
drawn without replacement from a subset of size k, is equal to the number of samples of size
k that can be drawn without replacement from an urn containing balls numbered 1, 2, 3, 4.
There are \(\binom{4}{3} = 4\) subsets of size 3 that can be formed, namely, {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}.
From each of these subsets one may draw, without replacement, 6 samples, so that there are twenty-four possible samples of size 3 to be drawn without replacement from an urn containing 4 balls.
\[ (5)_3 = 5 \times 4 \times 3 = 60 \text{ samples}. \]
We may assume that each of these samples is equally likely to occur. On the other hand, there are
\[ \binom{5}{3} = 10 \text{ sets of treatments}. \]
Notice that 10 × 6 = 60.
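Following the style of the R corner later in this chapter, the counting identities above can be checked with R's choose() and factorial() functions (the variable names below are ours, not the book's):

```r
# Ordered samples vs. unordered sets: samples = sets * k!, with k = 3 here
samples_4 <- 4 * 3 * 2               # ordered samples of size 3 from 4 balls
sets_4    <- choose(4, 3)            # unordered sets of size 3
stopifnot(samples_4 / factorial(3) == sets_4)   # 24 / 3! = 4

samples_5 <- 5 * 4 * 3               # ordered samples of size 3 from 5 treatments
sets_5    <- choose(5, 3)            # 10 sets of treatments
stopifnot(sets_5 * factorial(3) == samples_5)   # 10 * 6 = 60
```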
(i) Let A be the event "treatment a is chosen." Treatment a appears in \(\binom{4}{2} = 6\) sets. You may want to check that by either writing out the whole collection of possible samples, or by realizing that, forcing treatment a to be in the sample, the other two treatments can be chosen in 4 × 3 = 12 ordered ways, resulting in 12/2 = 6 different sets. So the probability of selecting treatment a is given by
\[ P(A) = \frac{\binom{4}{2}}{\binom{5}{3}} = \frac{6}{10}. \]
(ii) Let B be the event "treatments a and b are chosen." The probability that treatments a and b are chosen, using similar reasoning as in (i), is
\[ P(B) = \frac{\binom{3}{1}}{\binom{5}{3}} = \frac{3}{10}. \]
(iii) Let C be the event "b is chosen." The probability that at least one of a and b will be chosen has as its complement the event that neither of the two is chosen. If a and b are not chosen, there is only one set containing the other three treatments. The probability is
\[ P(A \cup C) = 1 - \frac{\binom{3}{3}}{\binom{5}{3}} = 1 - \frac{1}{10} = \frac{9}{10}. \]
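The three probabilities in (i)-(iii) can be verified numerically with choose(); a short sketch in the book's R style:

```r
p_a  <- choose(4, 2) / choose(5, 3)      # (i)   P(A) = 6/10
p_b  <- choose(3, 1) / choose(5, 3)      # (ii)  P(B) = 3/10
p_ab <- 1 - choose(3, 3) / choose(5, 3)  # (iii) P(at least one of a, b) = 9/10
c(p_a, p_b, p_ab)
```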
\[ (30)_3 = 30 \times 29 \times 28 = \frac{30!}{27!} = 24360 \text{ samples}. \]
There are 3 × 2 × 1 = 3! = 6 samples with the same numbers in them. Thus there are
\[ \binom{30}{3} = \frac{30!}{27!\,3!} = \frac{24360}{3!} = 4060 \text{ sets of 3 numbers}. \]
Let A denote the event "individuals 1, 2, 3 are in the sample." The probability of obtaining individuals 1, 2, 3 is
\[ P(A) = \frac{6}{24360} = \frac{1}{\binom{30}{3}} = \frac{1}{4060}. \]
Box 4.4
Sampling in statistics
Statisticians use sampling for different purposes. When sampling to conduct a survey and
gather data to learn about a population, the sampling is, in fact, done without replacement.
However, a common assumption is that the population is so large that extracting a random sample from a very large population is equivalent to drawing with replacement. The methods of mathematical statistics define a simple random sample as one drawn with replacement.
On the other hand, statisticians use a sampling method to determine the properties of the methods used by mathematical statistics. This sampling method, the bootstrap, relies on samples actually drawn with replacement from the observed sample.
Nowadays, with the large computing power in our hands, sampling is also the source of
modern machine learning methods such as Bayesian estimation with Markov chain Monte
Carlo methods.
Statistics and the tools of data science are now used in almost any type of enquiry regarding data. Ask yourself: what is the role of sampling in my area of interest? Do some research to find out. For example, in Section 4.1.5 we illustrate how the prototype urn model of
sampling has led to competing theories about the equilibrium state of a physical system in
physics. Similar competing models can be found, for example, in linguistics.
\[ P(1 \text{ poll needed}) = \frac{\binom{4}{2}}{10} = \frac{6}{10}. \]
\[ P(2 \text{ polls needed}) = \frac{\binom{3}{2}}{10} = \frac{3}{10}. \]
\[ P(3 \text{ polls needed}) = \frac{\binom{2}{2}}{10} = \frac{1}{10}. \]
4.1.5 Exercises
1. An urn contains 20 balls, numbered 1 to 20. Suppose we draw a ball from the urn one at
a time, until 6 balls have been drawn. How many samples (or n-tuples), can we draw and
what is the probability of a single sample when the drawing is done: (i) with replacement,
(ii) without replacement? Explain your answer.
2. An urn contains 20 balls, numbered 1 to 20. Suppose we draw a ball from the urn one at
a time, until 6 balls have been drawn. How many sets of 6 balls are there and what is the
probability of a set when the drawing is done: (i) with replacement, (ii) without replacement?
Explain your answer.
3. (Inspired by Roxy Peck (2008, page 285).) The instructor of a probability class which has
40 students enrolled comes to class daily with a little box containing balls numbered 1 to
40. The instructor also brings a roster that has the students’ names sorted by last name. The
first student in the list is number 1, the last one number 40. For example,
1 Ayala, Maria
2 Coelho, Brenda
3 Chen, Cynthia
…
Let’s see how the model is used in this context. There are many particles, M. Think of the num-
bered balls in the urn now as indicating the state j that a particle will occupy, j = 1,2,3,…,M.
Drawing a random sample of size n with replacement from the urn, for example, if n = 20
and M = 10 ,the sample could be (3,1,4,8,1,1,10,9,9,1,4,5,7,1,3,6,8,10,10,9),
is indicating that particle 1 is going to state 3, particle 2 is going to state 1, particle
3 goes to state 4, particle 4 goes to state 8, particle 5 goes to state 1 and so on. It is
also indicating a macroscopic state, i.e., the number of particles that go into each state
(5 particles are in microscopic state 1, 0 are in microscopic state 2, and 2 particles are in
microscopic state 3, and so on). We can rewrite the macrostate for the given example as
(n1 = 5, n2 = 0,n3 = 2,n4 = 2,n5 = 1,n6 = 1,n7 = 1,n8 = 2,n9 = 3,n10 = 3). In total, there
are 1020 allocations of 20 particles to 10 states. In Physics jargon, there are 1020 macrostates.
Since we are drawing with replacement, for the first particle we have M states, for the second
one we have M states, ..etc. Then obviously, there could be more than one particle in a given
state. If we consider the particles distinguishable, i.e., as if they were arriving in order and the
order of arrival matters, then this result is known as Maxwell-Boltzmann’s model. If nobody
is keeping track of the order of arrival (imagine they all arrive at once), the particles are not
distinguishable, and then we have Bose-Einstein model. If the sampling had been done with
replacement, and indistinguishable balls, we would have had the Fermi-Dirac model. In
Physics, it is considered that Maxwell-Boltzmann’s model is a good approximation to reality.
(Parzen 1960)
Parzen uses the physics example just described to illustrate the occupancy problem, and
contains more details about it.
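A microstate and its macrostate can be simulated exactly as described, by drawing with replacement; this is a minimal sketch using base R's sample() and tabulate(), with an arbitrary seed and our own variable names:

```r
set.seed(1)                                   # arbitrary seed, for reproducibility
M <- 10; n <- 20
micro <- sample(M, size = n, replace = TRUE)  # microstate: the state of each particle
macro <- tabulate(micro, nbins = M)           # macrostate: occupancy numbers n1, ..., nM
macro
```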
We may use the n-tuple sampling approach to solve probability problems where we inquire
about the diversity of the elements in the population.
Consider the following problem. Two balls are drawn without replacement from an urn containing six balls, of which four are white and two are red. Find the probability that (i) both balls will be white, (ii) both balls will be of the same color, (iii) at least one of the balls will be white.
To set up a mathematical model for the experiment described, using what we have already learned in Section 4.1 of this chapter, assume that the balls in the urn are distinguishable; in particular, assume that they are numbered 1 to 6. Let the white balls bear numbers 1 to 4, and let the red balls be numbered 5 and 6.
The sample space of the experiment, S, is the set of 6 × 5 = 30 2-tuples (o₁, o₂) whose components are any numbers 1 to 6, subject to the restriction that no two components of a 2-tuple are equal.
S = {(1,2),(1,3),(1,4),(1,5),(1,6),(2,1),(2,3),(2,4),(2,5),
(2,6),(3,1),(3,2),(3,4),(3,5),(3,6),(4,1),(4,2),(4,3),(4,5),
(4,6),(5,1),(5,2),(5,3),(5,4),(5,6),(6,1),(6,2),(6,3),(6,4),(6,5)} .
We may assume that all the samples are equally likely, i.e., each of them has probability 1/30.
Now let A be the event that both balls drawn are white, let B be the event that both balls
drawn are red, and let C be the event that at least one of the balls drawn is white. The problem at hand can then be stated as one of finding (i) P(A), (ii) P(A ∪ B), and (iii) P(C). It should be noted that C = B^c, so that P(C) = 1 − P(B). Further, A and B are mutually exclusive, so that P(A ∪ B) = P(A) + P(B).
Now, because the white balls bear numbers 1 to 4, the event A is
A = {(1,2), (1,3), (1,4), (2,1), (2,3), (2,4), (3,1), (3,2), (3,4), (4,1), (4,2), (4,3)},
and, because the red balls bear numbers 5 and 6, the event B is
B = {(5,6), (6,5)}.
Counting elements, P(A) = 12/30, and
\[ P(B) = \frac{2 \times 1}{6 \times 5} = \frac{2}{30} = \frac{\binom{2}{2}}{\binom{6}{2}} = \frac{1}{15} = 0.0666667. \]
So
\[ P(A \cup B) = P(A) + P(B) = \frac{12}{30} + \frac{2}{30} = \frac{14}{30}, \]
and
\[ P(C) = 1 - P(B) = 1 - 0.0666667 = 0.933333. \]
We could also solve the problem using the product rule; for example,
\[ P(A) = (4/6)(3/5) = 2/5 = 0.4, \qquad P(B) = (2/6)(1/5) = 1/15. \]
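As a quick check that the counting answers and the product-rule answers agree, a sketch in R:

```r
p_A <- 12/30                 # both white, by counting 2-tuples
p_B <- (2/6) * (1/5)         # both red, by the product rule
p_C <- 1 - p_B               # at least one white
stopifnot(all.equal(p_A, (4/6) * (3/5)))   # product rule gives the same P(A)
c(p_A, p_B, p_C)
```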
the proportion of white balls in the urn. The formula for P(A_k) can then be compactly written, in the case of sampling with replacement, as
\[ P(A_k) = \binom{n}{k} p^k (1-p)^{n-k}. \]
This formula is only approximately correct in the case of sampling without replacement,
but the approximation gets better as M increases.
Example 4.2.1
If we were to draw three cards from a box containing 52 cards, 26 of which are black, and we
draw without replacement, what is the probability of obtaining three black cards?
The urn size is M = 52 cards. There are Mb = 26 black cards and M − Mb = 26 other cards.
Let A₃ be the event that consists of obtaining 3 black cards.
\[ P(A_3) = \frac{\binom{26}{3}\binom{26}{0}}{\binom{52}{3}} = \frac{26(25)(24)}{52(51)(50)} = 0.1176471. \]
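The same number can be obtained in R with choose():

```r
p3 <- choose(26, 3) * choose(26, 0) / choose(52, 3)       # three black cards, no replacement
stopifnot(all.equal(p3, 26 * 25 * 24 / (52 * 51 * 50)))   # matches the falling-factorial form
round(p3, 7)
```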
Example 4.2.2
Acceptance sampling of a manufactured product. Suppose we are to inspect a lot of size M
of manufactured articles of some kind, such as light bulbs, screws, resistors, or anything else
that is manufactured to meet certain standards. An article that is below standards is said
to be defective. Let a sample of size n be drawn without replacement from the lot. A basic
Example 4.2.3
A box of chocolates is to be inspected for defects by a chocolatier in a chocolate factory.
Suppose that, in a box containing twenty chocolates, four are defective and sixteen are not
defective (note: a defective chocolate will be one that has some scratch or discoloration due
to mixing of chocolate types, etc.). A sample of two chocolates is to be selected randomly for
inspection without replacement. We will compute the probability that: (i) neither is defective,
(ii) at least one is defective, (iii) neither is defective given that at least one is nondefective.
The urn size is M = 20 chocolates. There are Md = 4 defective and M − Md = 16 nondefective.
Let A2 be the event that consists of obtaining 2 nondefective chocolates. The sample size is n = 2.
\[ (i)\quad P(A_2) = \frac{\binom{16}{2}\binom{4}{0}}{\binom{20}{2}} = \frac{16(15)}{20(19)} = 0.6315789. \]
Let B be the event “at least one of the two is defective,” which is the same event as “one
or two of them are defective.”
\[ (ii)\quad P(B) = \frac{\binom{16}{1}\binom{4}{1}}{\binom{20}{2}} + \frac{\binom{16}{0}\binom{4}{2}}{\binom{20}{2}} = \frac{2 \cdot 16(4)}{20(19)} + \frac{4(3)}{20(19)} = 0.36842. \]
Notice that B is the complement of A2.
Consider the event C, "at least one is nondefective." The probability of the event "neither is defective given that at least one is nondefective" is
\[ P(A_2 \mid C) = \frac{P(A_2 \cap C)}{P(C)} = \frac{P(A_2)}{P(C)} = \frac{0.6315789}{0.9684211} = 0.6521738, \]
since
\[ P(C) = (16/20)(15/19) + 2(16/20)(4/19) = 0.9684211. \]
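These hypergeometric probabilities can be checked with R's dhyper(), where x is the number of defectives drawn, m the number of defectives in the box, n the number of nondefectives, and k the sample size:

```r
p_A2 <- dhyper(0, m = 4, n = 16, k = 2)       # (i)  neither is defective
p_B  <- 1 - p_A2                              # (ii) at least one is defective
p_C  <- 1 - dhyper(2, m = 4, n = 16, k = 2)   #      at least one is nondefective
p_A2_given_C <- p_A2 / p_C                    # (iii) P(A2 | C)
round(c(p_A2, p_B, p_A2_given_C), 7)
```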
Example 4.2.3
Simple-minded game warder. Consider a fisherman who has caught 10 fish, 2 of which were
smaller than the law permits to be caught. A game warden inspects the catch by examining
two that he selects randomly from among the fish. What is the probability that he will not
select either of the undersized fish? This problem is an example of those previously stated,
involving sampling without replacement, with undersized fish playing the role of white balls,
Example 4.2.4
A simple-minded die. Another problem, which may be viewed in the same context but which
involves sampling with replacement, is the following. Let a fair six-sided die be tossed four
times. What is the probability that one will obtain the number 3 exactly twice in the four tosses?
This problem can be stated as one involving the drawing (with replacement) of balls from an
urn containing balls numbered 1 to 6, among which ball number 3 is white and the other balls
red (or, more strictly, nonwhite). In the notation of the problem introduced at the beginning of the section, this problem corresponds to the case M = 6, M_w = 1, n = 4, k = 2. Sampling with replacement, there are 6⁴ = 1296 samples. The number of ways to place the two whites among the four draws is \(\binom{4}{2}\) = 4 × 3/2 = 6.
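With replacement, there are 6 favorable placements of the two 3s, each with 5² = 25 possibilities for the other two tosses, so the probability is 150/1296; dbinom() gives the same number directly:

```r
p <- dbinom(2, size = 4, prob = 1/6)   # exactly two 3s in four tosses
stopifnot(all.equal(p, 150 / 1296))    # 6 placements * 5^2 non-3 outcomes, over 6^4
round(p, 5)
```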
Example 4.2.5
Five employees of a firm are ranked from 1 to 5 in their abilities to program a computer. Three
of these employees are selected to fill equivalent programming jobs. If all possible choices
of three (out of the five) are equally likely, find the probabilities of the following events: (i)
A = the employee ranked number 1 is selected, (ii) B = employees ranked 4 and 5 are selected
(i) Let an urn contain numbered balls, one for each of the ranked employees. The question can be rephrased as "what is the probability of obtaining k = 1 special employee if we sample n = 3 employees without replacement?"
\[ P(A) = \frac{\binom{1}{1}\binom{4}{2}}{\binom{5}{3}} = 0.6. \]
(ii) Let the urn now contain M = 5 employees, with M_d = 2 (special employees 4 and 5) and M − M_d = 3 (employees 1, 2, 3). The question can be rephrased as "what is the probability of obtaining k = 2 special employees if we sample n = 3 employees without replacement?"
\[ P(B) = \frac{\binom{2}{2}\binom{3}{1}}{\binom{5}{3}} = 0.3. \]
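Both answers follow the same choose() pattern in R:

```r
p_A <- choose(1, 1) * choose(4, 2) / choose(5, 3)   # (i)  employee ranked 1 is selected
p_B <- choose(2, 2) * choose(3, 1) / choose(5, 3)   # (ii) employees 4 and 5 are selected
c(p_A, p_B)
```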
Scientists conduct experiments repeatedly under identical conditions to learn about the effects
of medicines, the resistance to stress of materials, the way the brain responds to stimuli, to
name a few examples. The following formulation is one of the most clear formulations found
in the probability theory literature for beginners, and as such, we present it here verbatim.
We assume that an acceptable assignment of probabilities has been made to the simple events of S; i.e., to each {o_j} there is assigned a nonnegative number P({o_j}) in such a way that
\[ \sum_{j=1}^{N} P(\{o_j\}) = 1. \]
The outcomes do not need to be equally likely. Now let us think of performing this exper-
iment and then performing it again. The succession of two experiments is a new experiment
that we want to describe mathematically. In order to avoid confusing reference to original
experiments and this new experiment, it is convenient to refer to the original experiments
as trials and to describe the new experiment as made up of two trials, each represented by
(or corresponding to) the sample space S. This new experiment is mathematically defined, as
are all experiments, by a sample space. The elements (outcomes) of this new sample space
are all the ordered pairs (o_j, o_k) denoting the occurrence of outcome o_j at the first trial and o_k at the second trial. Thus the sample space for the experiment is the Cartesian product S × S. Since the sample space S for each of the two trials making up the experiment has N elements, there are N² ordered pairs in S × S.
Before probability questions can be answered for the experiment, we must make some acceptable assignment of probabilities to the N² simple events of S × S; i.e., we must assign a nonnegative number to {(o_j, o_k)} for each j and k in such a way that the sum of all N² numbers is 1. The two trials are said to be independent if
\[ P(\{(o_j, o_k)\}) = P(\{o_j\})P(\{o_k\}) \quad \text{for } j = 1, 2, \ldots, N \text{ and } k = 1, 2, \ldots, N. \]
This formula expresses the probability of the simple event {(o_j, o_k)} of S × S as the product of the probabilities of the simple events {o_j} and {o_k} of S.
The result we obtained for two trials can be generalized to any number of trials. That is, suppose n is a positive integer and let S_j (for j = 1, 2, …, n) be a sample space with outcomes \(o_1^{(j)}, o_2^{(j)}, \ldots, o_{N_j}^{(j)}\). By the experiment consisting of the succession of n trials, the first corresponding to S₁, the second to S₂, etc., we mean the sample space S₁ × S₂ × ⋯ × Sₙ, whose elements are all the N₁N₂⋯Nₙ ordered n-tuples (see Definition 4.1.1).
(Parzen 1960)
We will use Parzen’s formulation in the following examples of this section, with the under-
standing that in the rest of the book, after this section, the sample space made of several
trials will also be denoted by a simple S.
Example 4.3.1
The toss of a coin twice or the toss of two coins is an example of independent trials. Each
toss is a trial represented by the sample space S = {H ,T }. Suppose the two simple events
have been assigned the probabilities P({H}) = 2/3 and P({T}) = 1/3; i.e., the coin is not fair.
The outcomes of the two trials are given by S × S = {HH , HT ,TH ,TT }. If the two tosses are
independent, then the probability of each outcome is
\[ P(\{(HH)\}) = (2/3)^2; \quad P(\{(TT)\}) = (1/3)^2; \quad P(\{(HT)\}) = P(\{(TH)\}) = (1/3)(2/3). \]
Tossing two coins is equivalent to drawing a sample of two numbers with replacement from
an urn containing three numbers 1 to 3, where 1 = H, 2 = H and 3 = T.
Example 4.3.2
If we expand Example 4.3.1 and instead of 2 we do three trials,
where
\[ P(\{(HHH)\}) = (2/3)^3; \quad P(\{(TTT)\}) = (1/3)^3; \quad P(\{(HHT)\}) = P(\{(THH)\}) = P(\{(HTH)\}) = (1/3)(2/3)^2; \]
\[ P(\{(THT)\}) = P(\{(TTH)\}) = P(\{(HTT)\}) = (2/3)(1/3)^2. \]
Notice that three trials of this experiment “tossing the coin” is equivalent to drawing a
sample of three numbers with replacement from an urn containing three numbers 1 to 3,
where 1 = H, 2 = H and 3 = T.
\[ P(\{R\}) = \frac{1}{3}, \qquad P(\{W\}) = \frac{2}{3}. \]
\[ S \times S \times S \times S = \{RRRR, RRRW, RRWR, RRWW, RWRR, RWRW, RWWR, RWWW, WRRR, WRRW, WRWR, WRWW, WWRR, WWRW, WWWR, WWWW\}. \]
Example 4.3.4
From a population of n people, one person is selected at random. Another person is then
selected at random from the full group; i.e., we allow the same person to be selected at both
trials. Each selection (trial) is defined by the sample space S = {1, 2, …, n}, where each person is identified by a positive integer. Each of the n simple events of S is assigned probability 1/n; i.e., P({j}) = 1/n for j = 1, 2, …, n. The experiment made up of these two trials is called
selecting a sample of two with replacement from the population and is represented by the
Cartesian product set
S × S = {( j , k ) | j ∈ S , k ∈ S }.
To say that the two trials (i.e., the selection of the first person and the selection of the second person) are independent is to require the following assignment of probabilities to the simple events of S × S:
\[ P(\{(j,k)\}) = P(\{j\})P(\{k\}) = \frac{1}{n}\cdot\frac{1}{n} = \frac{1}{n^2}. \]
The independence of the trials means that each simple event of S × S is assigned the same probability 1/n². Thus we have the formal mathematical counterpart of our intuitive feeling that selecting a random sample of size two with replacement can be considered as a succession of two independent trials.
S = {D ,W }.
Let A be the event “at most 2 (2 or less) defectives”, which is the subset
p ≥ 0, q ≥ 0, p + q = 1.
Consider now n independent repeated Bernoulli trials, in which the word "repeated" is meant to indicate that the probabilities of success and failure remain the same throughout the trials. The sample space S × S × … × S contains 2ⁿ n-tuples (o₁, o₂, …, oₙ), in which each o_i is either an s or an f. The sample space is finite. The probability of every single n-tuple is
failures. Each such description has probability \(p^k q^{n-k}\). Thus the probability that n independent Bernoulli trials, with probability p for success and q = 1 − p for failure, will result in k successes and n − k failures (in which k = 0, 1, 2, …, n) is given by
\[ P(k \text{ successes} \mid n, p) = \binom{n}{k} p^k q^{n-k}. \]
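In R this formula is dbinom(); a sketch comparing it with the explicit choose() expression, for an arbitrary example of our choosing (n = 10, p = 0.3):

```r
n <- 10; p <- 0.3; k <- 0:n
manual <- choose(n, k) * p^k * (1 - p)^(n - k)   # the formula above, term by term
stopifnot(all.equal(manual, dbinom(k, size = n, prob = p)))
stopifnot(all.equal(sum(manual), 1))             # the k = 0, ..., n probabilities sum to 1
```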
Example 4.3.6
A system has five identical independent components connected in series. A component works
with probability p = 0.9. The system works if all the components work. What is the probability
that the system works?
Observing each component is a Bernoulli trial, where the probability of success is 0.9 if
the component works, and the probability of failure is 0.1, if the component does not work.
Observing the five components is a repeated experiment consisting of five independent
Bernoulli trials.
There are 2⁵ = 32 logically possible configurations of the system, each of them a sequence of five Bernoulli trials.
\[ P(5 \text{ working} \mid n = 5, p = 0.9) = \binom{5}{5}\, 0.9^5\, 0.1^0 = 0.59049. \]
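Equivalently, by independence the answer is just 0.9 multiplied by itself five times, as a one-line check confirms:

```r
p_system <- dbinom(5, size = 5, prob = 0.9)   # all five components work
stopifnot(all.equal(p_system, 0.9^5))         # independence: product of the five 0.9s
p_system
```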
Example 4.3.7
Davy Crockett was a member of the US Congress during the 1830s. There are three books supposedly written by Crockett: A Narrative of the Life of David Crockett, An Account of Col. Crockett's Tour to the North and Down East, and Col. Crockett's Exploits and Adventures in Texas,
the latter published after his death. There is serious doubt whether Crockett wrote any of
these books. Beyond these books, the only examples of Crockett’s supposed writings consist
We would conclude that there is a very small probability of observing the word “also” five
times in a random sample of 30 words from the Texas book. Salsburg and Salsburg do statis-
tical analysis using the binomial model and compare the probability that k is 5 in the Texas
book under the assumption p = 0.00051 and then under the assumption that p is some other
value suggested by the word data of the Texas book. The details of their statistical analysis
are beyond this book, and concern statistical inference, but let this example illustrate that
the models of probability theory play a very important role in disputed authorship.
4.3.2 Exercises
Exercise 1. Demonstrate that the approach used in Example 4.3.4 provides an acceptable
assignment of probabilities to the simple events of S × S.
Exercise 2. A pizza restaurant has five ovens. At least four ovens must be working in order to
meet customer demand on a given day. The probability of a particular oven working is 0.9.
We want to find out the probability of meeting customer demand.
Exercise 3. Factories produce millions of items. Thus, when we sample them to observe quality, we can pretend we are sampling with replacement, although in reality we are sampling at random (without replacement). That is the assumption in industrial reliability and other contexts where populations are very large. An incoming lot of silicon wafers is to be inspected for defectives by an engineer in a microchip manufacturing plant. Suppose that, in
a tray containing twenty wafers, four are defective and sixteen are working properly. Two
wafers are to be selected for inspection with replacement. After listing the sample space, find
the probabilities of the following events: (i) neither is defective, (ii) at least one is defective.
a. 1440
b. 36
c. 120
d. 10
Question 2. If we roll a die four times, how many samples are there?
a. 360
b. 760
c. 1296
d. 54
Question 3. An audition office has told candidates to prepare excerpts from 10 scripts, telling them that on the day of the audition they will be asked to perform a random selection of 5 of them. A hopeful actor has practiced only 7 of the scripts. To compute the probability that, on the day of the audition, the hopeful actor will be able to do 4 of the scripts, using the notation of the model seen in this chapter, what would be the value of M?
a. 5
b. 10
c. 7
d. 3
Question 4. A group of four iPhone fans wait in line for the next version of the iPhone to
be released. They slept outside the store because they were told that three of the first four
would be chosen randomly to obtain a free iPhone. The same person could not be chosen
twice, of course. These fans are numbered. The first one in line is number 1, the second is
number 2, the third is number 3, and the fourth is number 4. In how many ways could the
three be chosen?
a. 6
b. 4
c. 30
d. 3
a. 10,000
b. 100,000
c. 1,000
d. 100
Question 6. There are two types of workers in an office: 10 administrative assistants and 5
fund managers. Two workers will be chosen randomly to represent the office on the board
of directors and at the town’s city hall, one worker to each place. Let A be the event that two
fund managers are chosen. What is the probability of A?
a. 0.3156
b. 0.095238
c. 0.6954
d. 0.0910
Question 7. A gardener has been growing tomatoes. This individual takes a box containing 20
beautiful tomatoes to the local market with the hope of selling them. There are 8 tomatoes
in the box that look good, but they are from a neighbor’s garden. The person buying the
tomatoes will not be able to tell just by looking. A customer bought three randomly chosen
tomatoes from the box. What is the probability that two of the tomatoes were from the
neighbor’s garden?
a. 0.2947
b. 0.09824
c. 0.6197
d. 0.1131
Question 8. People arriving to a foreign country must usually pass Customs control. During
some periods, the authorities require that a certain number of people be chosen at random
and sent to the Customs office for further inspection. They then sit in some room waiting
for the authorities to check their records further. Supposedly the authorities are looking
for drug dealers that may have stolen someone else’s identity. Suppose that a randomly
chosen visitor to a major country with millions of visitors has probability 0.005 of being a
drug dealer that stole someone’s identity. Suppose that three people are randomly chosen.
What is the probability that two of the people are found to be drug dealers that have stolen
someone’s identity?
a. 0.1265
b. 0.0319
c. 0.0051
d. 0.00007462
a. 0.8145
b. 0.2181
c. 0.6891
d. 0.0434
Question 10. When going about their business of drawing random samples from populations
in order to learn about the population, statisticians prefer to use probability sampling. The
equivalent of sampling in practice is, in probability theory, an urn model. How many simple
random samples (samples without replacement) of 50 people can be chosen from a population
of 1,000,000 to estimate the average number of apps installed in smart phones per person?
a. \(\binom{1{,}000{,}000}{50}\)
b. \(\frac{1{,}000{,}000!}{50!}\)
c. 1,298,000,000
d. \(\binom{5{,}000{,}000}{50}\)
4.5 R corner
R exercise Birthdays.
How would we calculate theoretically (i.e., not by simulation with computers but by the exact mathematical solution) the probability that, in a room of five people, no two of them share the same birthday, assuming 365 days/year? Then how would we find the probability of the complement, "at least two people share a birthday"? Use the urn model appropriate for this situation.
Read the article attached at the end of this chapter to find out.
Here, we will use RStudio to do a simulation with a 12-sided fair die to estimate the probability that, in a room of five people, nobody shares a birth month. Use the following simulation in R. The probability model is a 12-sided fair die. A trial consists of rolling the die five times. We record "0" if some numbers are repeated and "1" if all numbers are different. At the end, we calculate the proportion of trials in which we got "1" (nobody shared a birth month).
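The simulation just described can be sketched as follows; the number of trials (10,000) and the seed are our choices, not part of the exercise:

```r
set.seed(123)                 # arbitrary seed, for reproducibility
n_trials <- 10000
ok <- numeric(n_trials)
for (i in 1:n_trials) {
  months <- sample(12, size = 5, replace = TRUE)    # roll the 12-sided die five times
  ok[i] <- as.numeric(length(unique(months)) == 5)  # 1 if no birth month repeats
}
mean(ok)   # estimate of the exact value (12*11*10*9*8)/12^5 = 0.3819
```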
Exercise 1. A first prize of $1000, a second prize of $500, and a third prize of $100 are offered
to the best three data mining projects presented at a major data mining competition. There
are 10 contestants. How many different outcomes are possible if (i) a contestant can receive
any number of awards and (ii) each contestant can receive at most one award?
Exercise 2. You roll two loaded six-sided dice. A single die has probability 0.3 of being a
1, probability 0.3 of being a 5, and the other numbers have probability 0.1. What is more
advantageous: betting on a sum of 7 or of 8?
Exercise 3. A bank opens at 9 a.m. on a regular working day with five bank tellers available
for assisting entering customers. If three customers are waiting at the door, and each cus-
tomer chooses a different teller, in how many ways can the tellers be chosen to assist the
three customers?
Exercise 5. If we were to draw three cards from a box containing 52 cards, 26 of which are
black, and we draw with replacement, then what is the probability of three black cards? What
is the probability if we draw without replacement?
Exercise 7. Suppose there are 40 alumni signed up to travel to Egypt with the university
alumni association during the summer. In this group of alumni, 25 received a bachelor of
science (BS) degree from the university and 15 received a master of science (MS) degree. We
must select a random sample of 7 people. (i) What is the probability that the sample contains
4 BS recipients and 3 MS recipients? (ii) What is the probability that there is at least one BS
in the sample?
Exercise 8. If there are 12 strangers in a room, what is the probability that no two of them
celebrate their birthdays in the same month?
Exercise 9. A Hollywood producer holds an audition for a movie. The executive office of the
studio sends interested parties a set of 15 excerpts from the script of the movie for them
to memorize with the information that the audition will consist of a random selection of 5
excerpts. If a candidate has memorized 10 of the excerpts, what is the probability that this
candidate will recall (i) all 5 excerpts or (ii) at least 4 of the excerpts?
We will first write the command to roll a red die and a green die, both fair and six-sided. This will be a 2-tuple containing as first number the roll of the red die and as second number the roll of the green die. The sample() function in R does the job of extracting n numbered balls from an urn containing M numbers.
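The command itself would look like this (the object name dice is ours); its arguments are explained next:

```r
# Roll the two dice once: draw 2 balls with replacement from balls numbered 1 to 6
dice <- sample(6, size = 2, prob = c(rep(1/6, 6)), replace = T)
dice
```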
The first argument, 6 (our M), stands for the balls numbered 1 to 6; size = n = 2 is the number of times we draw a ball; prob gives the probabilities assigned to each side of the die. Notice that we are saying: repeat 1/6 six times, which means that the number 1 has probability 1/6, the number 2 has probability 1/6, and so on. The argument replace = T is necessary because when rolling a die the model to use is the same as when drawing from an urn with six numbered balls with replacement.
It is not very interesting to see just one 2-tuple. As we saw in Section 4.1, there are Mⁿ = 6² = 36 logically possible samples of size 2 that can be drawn with replacement from an urn containing balls numbered 1 to 6. To see more 2-tuples, we can complicate the problem a little bit.
First we create a matrix where we will put the 2-tuples. Each row of the matrix will contain
a 2-tuple. We will generate 10 2-tuples so we will create storage for 10 rows, and 2 columns.
dice <- matrix(0, nrow = 10, ncol = 2)   # storage: one 2-tuple per row
for (i in 1:10) {
  dice[i, ] <- sample(6, size = 2, prob = c(rep(1/6, 6)), replace = T)
}
The for loop goes row by row, i = 1 to 10, and puts a new 2-tuple in the row. What we want to do in each row is written between { }. The result is the 10 2-tuples (or 10 rolls of the two dice). To view the samples you drew, type dice.
Exercise 12. Read the article in Appendix 4.1. Summarize in your own words, and then relate
the calculations they do to the formulas seen in this chapter.
Exercise 13. During one year (365 days) 23 earthquakes occurred over an area of interest.
What is the probability that two or more earthquakes occurred on the same day of the year?
(Note: This is the same problem as the birthday problem.)
• The people in the room are randomly chosen. Clearly, the answer would be very different
if the people were attending a convention of twins or of people born in December.
• Birthdays are uniformly distributed throughout the year. For some species of animals,
birthdays are mainly in the spring. But, for now at least, it seems reasonable to assume
that humans are about as likely to be born on one day of the year as on another.
• Ignore leap years and pretend there are only 365 possible birthdays. If someone was born
in a leap year on February 29, we simply pretend he or she doesn’t exist. Admittedly,
this is not very fair to those who were “leap year babies,” but we hope it is not likely
to change the answer to our problem by much.
Bruce Trumbo, Eric Seuss, and Clayton Schupp, “Computing Probabilities of Matching Birthdays,” STAT, the Magazine
for Students of Statistics, pp. 3-7. Copyright © 2005 by American Statistical Association. Reprinted with permission.
is the number of permutations of 365 objects taken 25 at a time, where repeated objects are
not permitted. Therefore,
William Feller, who first published this birthday matching problem in the days when this
kind of computation was not easy, shows a way to get an approximate result using tables of
logarithms. Today, statistical software can do the complex calculations easily, and even some
statistical calculators can do the numerical computation accurately and with little difficulty.
Figure 4.6 Two ways to calculate the probability of no matching birthdays among
25 people selected at random.
p <- numeric(50)
for (n in 1:50) {
  q <- 1 - (0:(n - 1))/365
  p[n] <- 1 - prod(q)
}
plot(p)
Figure 4.7 R code to calculate the probability of matching birthdays when the
number of people in the room ranges from 1 to 50.
In Figure 4.6, we show two ways to use the statistical software R to calculate the proba-
bility of no matches.
Of course, different values of n would give different probabilities of a match. With a
computer package like R that has built-in procedures for doing probability computations
and making graphs, it is easy to loop through various values of n and graph the relationship
between n and P(At Least 1 Match). Figure 4.7 shows the small amount of R code required, and
Figure 4.8 shows the resulting plot. (The labels and the reference lines were added later.)
By looking at the plot, we see the probability of at least one match increases from zero to
near one as the number of people in the room increases from 1 to 50. We can see that n =
23 is the smallest value of n for which P(At Least 1 Match) exceeds 1/2. The computations
show the probability for n = 23 to be 0.5073. A room with n = 50 randomly chosen people is
very likely to have at least one match. Indeed, for n = 50, the probability is 0.9704.
First, we must specify the population from which to sample. For us, this is the 365 days of the
year. In R, the notation 1:365 can be used to represent the list of these population elements.
Second, we have to specify how many elements of the population are to be drawn at
random. Here, we want 25.
Third, we have to say whether sampling is to be done with or without replacement. Because we want to allow for the possibility of matching birthdays, our answer is “with replacement.” In R, this is denoted as replace=TRUE. We put the 25 sampled birthdays into an ordered list called b. Altogether, the R code is

b <- sample(1:365, size = 25, replace = TRUE)
Each time R performs this instruction, we will get a different random list b. Below is the
result of one run. For easy reference, the numbers in brackets give the position along the list
of the first birthday in each line of output. For example, the 22nd person in this simulated
room was born on the 20th day of the year, January 20.
[1] 352 364 246 190 143 272 149
[8] 206 154 272 61 199 357 141
[15] 264 157 42 340 287 166 335
[22] 20 123 214 149
x <- 25 - length(unique(b))

For our run above, the list unique(b) is the same as b, but with the 10th and 25th birthdays removed. It is a list of the 23 unique birthdays since its length is 23. So the value of the random variable X for this simulated room is X = 25 − 23 = 2.
Figure 4.9 The simulated distribution of the number of birthday matches (X) in a room of 25 randomly chosen people. (The plot shows density against the number of birthday matches, 0 through 6.)
Figure 4.10 Cyclical pattern of birth frequencies in the United States for 36 consecutive months. (The plot shows a birth frequency index, roughly 95 to 105, against month.)
References
Peter Dalgaard. Introductory Statistics with R, Springer, 2002.
William Feller. An Introduction to Probability Theory and Its Applications, Vol. 1, 1950 (3rd ed.),
Wiley, 1968.
Thomas S. Nunnikhoven. “A birthday problem solution for nonuniform frequencies.” The
American Statistician, 46, 270–274, 1992.
Jim Pitman and Michael Camarri. “Limit distributions and random trees derived from the
birthday problem with unequal probabilities.” Electronic Journal of Probability, 5, Paper 2,
1–18, 2000.
Bruce E. Trumbo, Eric A. Suess, and Clayton W. Schupp. “Using R to Compute Probabilities of
Matching Birthdays.” 2004 Proceedings of the Joint Statistical Meetings [CD-ROM], Alexandria,
Virginia: American Statistical Association.
A florist stocks a perishable flower which costs 50 cents and which is sold at
a price of $1.50 on the first day it is in the shop. Any flowers not sold that first
day are worthless and are thrown away. Let X denote the number of flowers
that customers order in a randomly selected day. Records of numerous other
days in the past reveal that 0 flowers were sold 10% of the days, 1 flower 40%
of the days, 2 flowers 30% of the days and 3 flowers 20% of the days.
How many flowers should the florist stock in order to maximize the
expected value of the florist’s net profit?
(Goldberg 1960)
Are you happy with the number and the sex of siblings in your family? Would you have liked to have more or fewer siblings, or siblings of different sex?
We saw in Chapter 4 that if we were to draw at random a family of three siblings, the sample space of this experiment is:

S = {bbb, bbg, bgb, bgg, gbb, gbg, ggb, ggg},

where, for example, bgg is a family where the oldest is a boy, the second child is a girl and the third is also a girl. Notice that in Chapter 4 we used S1 × S2 × S3, where Si = {b, g}, to tag that sample space. But once that is learned, we make our life easier from now on by denoting the sample space of the sequence as S, implicitly knowing that it is the Cartesian product of the sample spaces of the individual trials.
The human sex ratio is a very active area of research in biology (Orzack 2016). Science tends to support the notion that P(b) = P(g) = 1/2, although it all depends on which probability model you assume. Some authors question that assumption, based on observation of families and government data (Carlton and Stansfield 2005). Under the assumption of even chance for each sex, we saw in Chapter 4 that the probabilities of each of those outcomes in S are, respectively:

(1/2)^3, (1/2)^3, (1/2)^3, (1/2)^3, (1/2)^3, (1/2)^3, (1/2)^3, (1/2)^3.
Table 5.1 The first two columns are the probability mass function of Y.

y    P(Y = y)                                   Event
0    (1/2)^3 = 1/8                              {ggg}
1    (1/2)^3 + (1/2)^3 + (1/2)^3 = 3/8          {bgg, gbg, ggb}
2    (1/2)^3 + (1/2)^3 + (1/2)^3 = 3/8          {bbg, bgb, gbb}
3    (1/2)^3 = 1/8                              {bbb}
We could also represent the probability of each value of Y by using the following formula:
P(Y = y | n = 3, p = 1/2) = (3 choose y) (1/2)^y (1 − 1/2)^(3−y),  y = 0, 1, 2, 3.
This formula would provide the part of Table 5.1 containing y and P (Y = y ). That formula
is the formula for the probability mass function of a binomial discrete random variable.
The first two columns of Table 5.1 represent in a table a probability mass function for Y. A
probability mass function gives all the unique possible values of the random variable and the prob-
ability of the events consisting of all outcomes to which we map that value of the random variable.
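The pmf formula above is built into R as dbinom, so a quick check with n = 3 and p = 1/2 reproduces the probabilities in Table 5.1:

```r
# P(Y = y | n = 3, p = 1/2) for y = 0, 1, 2, 3
y <- 0:3
p <- dbinom(y, size = 3, prob = 1/2)
p         # 0.125 0.375 0.375 0.125, i.e., 1/8, 3/8, 3/8, 1/8
sum(p)    # the probabilities of a pmf add to 1
```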
M_Y(t) = (pe^t + (1 − p))^n.
Knowing this, and the formula or the table, we can conclude that for the family of 3, the expected
number of boys is 3/2, the variance is 3/4, and the standard deviation is 0.866. We could also
compute the probability of less than 2 boys by extracting the information from the table:
P(Y < 2) = P(Y = 0) + P(Y = 1) = (1/2)^3 + (1/2)^3 + (1/2)^3 + (1/2)^3 = 1/2.
We cannot deny how convenient it would be to have formulas and tables like these for a host of random phenomena that might come our way. That is new jargon, but its usefulness will be seen shortly. An advantage of a probability model formula is its usefulness to model our uncertainty about many phenomena of similar nature and to help us make predictions and better decisions when faced with uncertainty. The above binomial model could be used to answer any question of the form: what is the probability of a given number of successes in n Bernoulli trials? For example,
σ² = Var(X) = Σ_x (x − µ)² P(X = x) is the variance of X,

M_X(t) = E(e^{tX}) = Σ_x e^{tx} P(X = x)

and can be used, by taking the derivative of the function with respect to t and evaluating it at t = 0, to obtain the expected value. If we take the second derivative and evaluate it at t = 0, we get E(X²).
We just saw in the previous section that all we want to know about the number of successes
in n Bernoulli trials can be conveniently summarized in a probability mass mathematical
function, its expectation, variance, standard deviation and moment generating function.
The reader may wonder whether we can do that for the other problems we have studied, such as the number of successes in trials that are not Bernoulli, for example, or other questions we may have about Bernoulli trials. Indeed, we can do that and more. The concepts of random variable and probability mass function are at the core of such possibility.

Definition 5.2.1
A function whose domain is a sample space and whose range is the set of real numbers is called a random variable. If the random variable is denoted by X and has as domain the sample space S = {o1, o2, ..., on}, then we write X({ok, ..., oj}) for the value of X that is shared by each of the elements in the event {ok, ..., oj}.

Random variables can be discrete or continuous. A discrete random variable can have a finite or countably infinite number of values. A continuous random variable has an uncountably infinite number of values.
It is conventional to represent a random variable by an upper case letter like X, Y, W, etc., and a particular value of the random variable by the corresponding lower case letter, x, y, w, etc.
Assigning probabilities to the possible values of the random variable requires that we
calculate the probability of all the outcomes in the sample space that result in that value of
X and add them up.
P(X = x) = Σ_{ok ∈ S: X(ok) = x} P(ok).
Once all the work is done to figure out the probability for each unique value of X, we present
the information in a discrete probability mass function table like Table 5.2:
Table 5.2
X 1 2 3 4 5 6
P(X = x) 1/36 3/36 5/36 7/36 9/36 11/36
This is called the probability mass function (pmf) of X and is usually given in the form of a table or a mathematical formula. The sum of the probabilities must be one and each probability must be equal to or greater than 0, in order to satisfy the Axioms introduced in Chapter 3.
This information can also be presented graphically, as in Figure 5.1, where the probability table of the random variable X is drawn. In the graph, values of X are on the horizontal axis and their probabilities on the vertical axis.

Figure 5.1 Probability mass function of Max(a,b).
Summation operators
The rules of summation are widely used when talking about the probability of a single discrete random variable. We review them here:
If x takes n values x1, x2, ..., xn then their sum is

Σ_{i=1}^{n} xi = x1 + x2 + ... + xn.

If a is a constant then

Σ_{i=1}^{n} a xi = a Σ_{i=1}^{n} xi,

Σ_{i=1}^{n} (xi − a)² = Σ_{i=1}^{n} (xi² + a² − 2a xi)
                     = Σ_{i=1}^{n} xi² + Σ_{i=1}^{n} a² − Σ_{i=1}^{n} 2a xi
                     = Σ_{i=1}^{n} xi² + na² − 2a Σ_{i=1}^{n} xi.
If X and Y are two variables, then the following two equalities hold:

Σ_{i=1}^{n} (xi + yi) = Σ_{i=1}^{n} xi + Σ_{i=1}^{n} yi,

Σ_{i=1}^{n} (a xi + b yi) = a Σ_{i=1}^{n} xi + b Σ_{i=1}^{n} yi.
The following function will be very important when we study the central limit theorem:

x̄ = (Σ_{i=1}^{n} xi)/n = (x1 + x2 + ... + xn)/n,

and

Σ_{i=1}^{n} (xi − x̄) = 0.
When the range of the summation is clear from the context, we abbreviate the notation and write Σ_i xi or Σ_x x.
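These summation rules are easy to verify numerically. A small sketch with an arbitrary vector of our own choosing:

```r
# Verify some summation rules with an arbitrary vector x and constant a
x <- c(2, 5, 7, 11)
a <- 3
n <- length(x)
sum(a * x) == a * sum(x)                                # TRUE: constants factor out
sum((x - a)^2) == sum(x^2) + n * a^2 - 2 * a * sum(x)   # TRUE: expanding the square
xbar <- sum(x) / n                                      # the sample mean
sum(x - xbar)                                           # 0: deviations about the mean
```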
Definition 5.2.3
Let X be a random variable. The cumulative distribution function F (cdf) of X is the function F: R → R defined by

F(x) = P(X ≤ x) = Σ_{k ≤ x} P(X = k).

That is, it is the sum of the probabilities of all values of X smaller than or equal to a specific value x.
If X is a discrete random variable with probability mass function P(X = x), then F is a step function.
In either case, F is monotonic increasing, i.e., F(a) ≤ F(b) whenever a ≤ b, and the limit of F to the left is 0 and to the right is 1. The cumulative distribution function may also be summarized in a table like Table 5.3, provided that the number of possible values of X is finite and not too many to list.
Example 5.2.2
For the experiment of Example 5.2.1, the cumulative distribution function is given in Table 5.3, where the computations done to arrive at the value of F(x) are given inside the table for convenience. Usually, those computations would be done outside the table.
Table 5.3 Cumulative probability of the max(a,b) in the roll of two six-sided fair dice
x    F(x)
1    1/36
2    3/36 + 1/36 = 4/36
3    5/36 + 3/36 + 1/36 = 9/36
4    7/36 + 5/36 + 3/36 + 1/36 = 16/36
5    9/36 + 7/36 + 5/36 + 3/36 + 1/36 = 25/36
6    11/36 + 25/36 = 36/36
F (3) = P ( X ≤ 3) = 9 / 36.
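The pmf and cdf of the maximum of two dice can be reproduced by enumerating the 36 equally likely rolls. A sketch in R (variable names are our own):

```r
# pmf and cdf of X = max(a, b) for the roll of two fair six-sided dice
rolls <- expand.grid(a = 1:6, b = 1:6)   # the 36 equally likely outcomes
x <- pmax(rolls$a, rolls$b)              # the maximum of each pair
pmf <- table(x) / 36                     # 1/36 3/36 5/36 7/36 9/36 11/36
cdf <- cumsum(pmf)                       # 1/36 4/36 9/36 16/36 25/36 36/36
cdf["3"]                                 # F(3) = 9/36 = 0.25
```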
Table 5.4 Probability mass function of number of amber alerts per year
x 0 1 2
P(X=x) 0.85 0.1 0.05
and if each Amber alert costs the city a value Y, where Y is a function of X , g( X ), as follows,
Y = 1000 + 200 X ,
then the probability mass function of Y is directly related to that of X, because Y takes, for
example, value 1000 when X = 0 and that happens with probability 0.85, which makes the
probability that Y = 1000 equal to 0.85. We can illustrate that with Table 5.5.

Table 5.5
x          0                        1                        2
y          1000 + 200(0) = 1000     1000 + 200(1) = 1200     1000 + 200(2) = 1400
P(X = x)   0.85                     0.1                      0.05
Thus,
P (Y > 1000) = P (Y = 1200) + P (Y = 1400) = P ( X = 1) + P ( X = 2) = 0.15.
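The same computation is immediate in R, since Y = g(X) inherits the probabilities of X:

```r
# pmf of X, the number of Amber alerts per year (Table 5.4)
x <- c(0, 1, 2)
px <- c(0.85, 0.10, 0.05)
y <- 1000 + 200 * x           # cost Y = g(X) = 1000 + 200X
py <- px                      # Y takes each value exactly when X does
sum(py[y > 1000])             # P(Y > 1000) = 0.15
```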
5.2.4 Exercises
Exercise 1. Let Y be a random variable denoting the sum of the roll of two fair six-sided dice.
Give a table with the probability mass function of Y in the first two columns, the cumulative distribution function of Y in the third column, and the corresponding sample space members in the last column. Use Tables 5.1 and 5.2 in Section 5.2 as examples.
Exercise 3. Daily tooth brushing by residents in a remote country was found to follow the probability mass function in Table 5.6, where the random variable Y represents the number of times residents brush their teeth per day.
Table 5.6 Probability mass function of frequency of teeth brushing per day
y 0 1 2 3
P(Y = y) 0.325 0.474 0.15 0.051
The value 0.474 means that 47.4% of the residents brush their teeth once per day. What is the
probability that a randomly chosen resident of this country brushes teeth at least twice a day?
Example 5.3.1
For the random variable X = max(a, b), the maximum value of the rolls of two fair six-sided dice seen in Examples 5.2.1 and 5.2.2, the expected value of X is, according to Definition 5.3.1,

E(X) = 1(1/36) + 2(3/36) + 3(5/36) + 4(7/36) + 5(9/36) + 6(11/36) = 161/36 ≈ 4.472.

Let g(X) = (X − µ)² represent the squared distance of a random variable from its expected value. This is a very special function of X. Some values of X will be very far from the expected value and others very close. A small variance will mean that most of the values are very close to the expected value, and few far.
We denote the variance with the Greek symbol σ². The smallest value that σ² can take is 0, when all the probability is concentrated at a single point (that is, when X takes on a constant value with probability 1). The variance becomes larger as the points with positive probability spread more.
• Large standard deviation means that many values of the random variable are very far
from the expected value, and few are close.
• Small standard deviation means that most of the values of the random variable are
concentrated around the expected value with very few far.
• The smallest value that the standard deviation can take is 0, when all the probability
is concentrated at a single point.
Table 5.8.
x          1     2     5
P(X = x)   1/8   1/4   5/8

M_X(t) = (1/8)e^t + (1/4)e^{2t} + (5/8)e^{5t}.
A nice feature of the moment generating function is that if you take the first derivative
with respect to t and evaluate it at t = 0, you get the expected value of the random variable.
If you take the second derivative and evaluate it at t = 0, you get E ( X 2 ) and so on.
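For the pmf in Table 5.8 this can be checked numerically. The sketch below approximates the derivative of M_X at t = 0 with a central difference (the step size h is an arbitrary choice of ours):

```r
# MGF of the pmf in Table 5.8 and a numerical check that M'(0) = E(X)
x <- c(1, 2, 5)
p <- c(1/8, 1/4, 5/8)
M <- function(t) sum(p * exp(t * x))   # moment generating function
h <- 1e-6                              # small step for a central difference
(M(h) - M(-h)) / (2 * h)               # approximately E(X) = 3.75
sum(x * p)                             # exact E(X) from the definition
```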
When the random variable is discrete, the median, like the expected value, may take values
that are not in the domain. For example, for the random variable X = max (a, b ), the maximum
value of the rolls of two fair six-sided dice seen in Examples 5.2.1 and 5.2.2,
However,

F(4) = 16/36 = 0.4444,  F(5) = 25/36 = 0.6944,

which means that the median is between X = 4 and X = 5.
5.3.7 Exercises
Exercise 1. A resident of Boston spends the summers in the Grand Tetons, Wyoming. Every day there is some expectation that a moose may pass in front of the house. Moose are wild animals that live around that area. The daily sighting (number of times seen) of moose has the probability mass function in Table 5.9, where X is the daily number of sightings:

x          0     1     2
P(X = x)   0.1   0.5   0.4

(i) In a randomly chosen day, what is the expected number of sightings and the standard deviation of sightings? (ii) Find the moment generating function of X. Then take the first derivative with respect to t, evaluate the first derivative at t = 0 and compare the result to the expected value of X found with the definition of expected value.
when a coin is tossed three times. Find (i) the probability mass function of X and compute
the (ii) expected value of X and the (iii) standard deviation of X. Find also the (iv) moment
generating function of X. A few elements of the pmf can be seen in Table 5.10.
x P(X = x)
-3 1/8 = 0.125
-1 3(1/8) = 0.375
Exercise 3. You have $1,000 and a certain commodity presently sells for $2 per ounce. Sup-
pose that after one week the commodity will sell for either $1 or $4 an ounce, with these
two possibilities being equally likely. If your objective is to maximize the expected amount
of money that you possess at the end of the week, what strategy should you employ?
Exercise 4. (This exercise is based on Christensen (2015).) You are the forecaster responsible for hurricane warnings on the southeast coast of the United States in September. Issuing a warning for a hurricane involves people taking shelter, business stopping, the area's economy pausing: a moderate cost of C dollars. You also know the preventable loss should a hurricane come and the area be unprepared: property damaged, lives lost, an extremely high loss of L dollars. Your weather forecast indicates a hurricane with probability p. (i) Should you issue a warning? Use expectations to create a decision rule. (ii) For what kinds of extreme weather warnings is the decision more likely to be “do not warn” than “warn”? Explain.
Quite often, we are interested not in a random variable per se, but in a function of a random variable. For example, a business selling a very specialized tool to automobile producers may be more interested in the profits made each week. Of course, the profits depend on how many tools are sold during the week, which is random. But the ultimate goal is to compute the expected profit per week and its give or take, the standard deviation of the profit. The contents of this section will be helpful in computing any linear function of a discrete random variable.
Theorem
Let X be a discrete random variable and let g(X ) be a function of X. Then
i. If g( X ) = k , where k is a constant,
E (k ) = k and Var (k ) = 0.
ii. If g(X) = kX, then

E(kX) = kE(X) and Var(kX) = k²Var(X).

iii. If g(X) = a + bX, then

E(a + bX) = a + bE(X) and Var(a + bX) = b²Var(X).
We can prove properties (i) and (iii) using the definitions of expected value and variance
given in Sections 5.3.1 and 5.3.3. The reader is asked to prove (ii) in the exercises. We always
use the definitions to prove results.
Proof
i. If g(X) = k, where k is a constant, E(k) = k and Var(k) = 0.
Using the definition in Section 5.3.1, for g(X) = k, and Math Tidbit box 5.3,

E(k) = Σ_x k P(x) = k Σ_x P(x) = k,

because the sum of the probabilities over all the values of X must equal 1.
The variance is

Var(k) = Σ_x (k − E(k))² P(x) = Σ_x (k − k)² P(x) = 0.

iii. If g(X) = a + bX, the variance is

σ²_{g(X)} = Var(g(X)) = Σ_x (b(x − E(X)))² P(x) = Σ_x b²(x − E(X))² P(x) = b² Σ_x (x − E(X))² P(x) = b² Var(X).
σ² = Var(X) = E(X²) − µ².

Proof

σ²_X = Σ_x (x − µ)² P(x) = Σ_x (x² + µ² − 2µx) P(x) = Σ_x x² P(x) + µ² Σ_x P(x) − 2µ Σ_x x P(x) = E(X²) + µ² − 2µ² = E(X²) − µ².
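The shortcut formula is easy to verify for any discrete pmf. For example, with the pmf of the maximum of two fair dice (Table 5.2):

```r
# Check Var(X) = E(X^2) - mu^2 for X = max of two fair dice (Table 5.2)
x <- 1:6
p <- c(1, 3, 5, 7, 9, 11) / 36
mu <- sum(x * p)                  # E(X) = 161/36
sum((x - mu)^2 * p)               # variance from the definition
sum(x^2 * p) - mu^2               # the shortcut gives the same number
```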
Example 5.4.1
The daily sales record for a car dealership is a random variable with the probability mass function given in Table 5.11. (i) Consider three independent days. What is the probability that in less than 2 of those days the number of sales will be at least 2? (ii) The price of a car is $3000 and total daily revenue is given by T = 3000X², where X is the number of cars sold. What is the expected total daily revenue?

x          0     1     2      3
P(X = x)   0.5   0.3   0.15   0.05
5.4.2 Exercises
Exercise 1. Prove that if g(X) = kX, then E(kX) = kE(X) and Var(kX) = k²Var(X).
Exercise 2. The manager of a cosmetic products stand in a department store knows that
the daily demand for the most expensive item in the stand, the “dramatically beautifying
moisturizing lotion” has the probability mass function in Table 5.12:
Table 5.12 Probability mass function of daily demand for expensive cosmetic item
Quantity demanded 0 1 2
Probability 0.1 0.5 0.4
Suppose that the bonus is $10 each time an item is bought. (i) What is the expected daily bonus from selling the expensive item? How much is this bonus expected to vary from one day to another? (ii) What is E(X³)?
Exercise 3. Weekly downtime of internet services from an internet service provider (in hours)
has expected value 0.5 and variance 0.25. Based on past experience, the data scientist of a
retailer store has calculated the loss function to the store from the downtime as
C = 30X + 2X²
where X is the amount of weekly downtime and C is cost. Find the expected cost.
Exercise 4. The number of calories burnt by a biker on a biking day depends on the number of
hours biking plus the fixed amount burnt by the regular functioning of the body to stay alive.
Based on past experience it is known that the calories burnt by a biker follows this function
Calories = 1000 + 200 X
where X is the number of hours biking. If X is a random variable with expected value 5 and vari-
ance 5, what is the expected number of calories burnt and the standard deviation of the calories?
A fair six-sided die is rolled 100 times. How can we calculate the expected sum and the variance of the sum of the numbers obtained?
If we assume that the rolls are independent and identically distributed, then we can calculate the expected sum of the 100 rolls and the variance of the sum of the 100 rolls as follows:

E(S100) = E(Σ_{i=1}^{100} Xi) = Σ_{i=1}^{100} E(Xi) = µ + µ + ... + µ = nµ = 100(7/2) = 350,

Var(S100) = Var(Σ_{i=1}^{100} Xi) = Σ_{i=1}^{100} Var(Xi) = σ² + σ² + ... + σ² = 100(2.916667) = 291.6667.
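A simulation makes these two numbers plausible. The number of repetitions and the seed below are arbitrary choices of ours:

```r
# Simulate S_100, the sum of 100 rolls of a fair die, many times
set.seed(2)                                # arbitrary seed
sums <- replicate(10000, sum(sample(1:6, 100, replace = TRUE)))
mean(sums)                                 # close to n * mu = 350
var(sums)                                  # close to n * sigma^2 = 291.67
```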
Dice are not the only context in the applications of probability where we are interested
in sums of random variables. For example,
• The study of the number of cyberattacks to a computer network might require knowl-
edge of the sum of the random number of cyberattacks each hour of the day. The
number of cyberattacks of each hour is a random variable and the sum of the indepen-
dent numbers of cyberattacks of each of the 24 hours of the day is a sum of random
variables giving us the total number of cyberattacks per day.
• The random total cost of a building project can be studied as the sum of the random
costs for the major independent components of the project.
• The random size of an animal population can be modeled as the sum of the random
sizes of the independent colonies within the population.
• At the end of the summer the total weight of seeds accumulated by a nest of
seed-gathering ants will vary from nest to nest. We may be interested in the sum of
the total weights of seeds of all nests.
• The total weight of people riding an elevator is important to know to prevent over-
loading the particular elevator.
• An insurance company may want to know the total yearly claim by all the automobile
policy holders.
Let's define the sum of n independent and identically distributed (same mean, same variance) random variables as follows:

Sn = X1 + X2 + ... + Xn = Σ_{i=1}^{n} Xi,

where S is for sum and n for how many random variables we are adding. Then, without proving it, we claim that:

E(Sn) = E(Σ_{i=1}^{n} Xi) = Σ_{i=1}^{n} E(Xi) = µ + µ + ... + µ = nµ,

Var(Sn) = Var(Σ_{i=1}^{n} Xi) = Σ_{i=1}^{n} Var(Xi) = σ² + σ² + ... + σ² = nσ².
Example 5.5.1
A stockbroker recommends three stocks (ORCL, AAPL, and GOOGL) to his firm's clients. His bonus will be $50,000, $30,000, $10,000, or nothing, depending on whether the prices of 3, 2, 1, or none of them go up next year. Suppose that the probability that each stock goes up next year is 1/2, and each stock's price behavior is independent of the price behavior of the other stocks. (i) Construct a table with the probability mass function of the bonus that the stockbroker will receive next year. (ii) If the stockbroker recommends the same stocks each year, and all other factors are constant, what is the expected total bonus received after five years of predictions; what is the standard deviation?
We will write Table 5.13 with the probability mass function of the stockbroker's bonus.

Bonus (x)   0     10,000   30,000   50,000
P(X = x)    1/8   3/8      3/8      1/8

From this table, E(X) = $21,250 and Var(X) = 235,937,500, so over the five years,

E(S5) = Σ_{i=1}^{5} E(Xi) = 5(21,250) = $106,250,

Var(S5) = Var(Σ_{i=1}^{5} Xi) = Σ_{i=1}^{5} Var(Xi) = 5(235,937,500) = 1,179,687,500,

SD(S5) = $34,346.58.
Example 5.5.2
A biologist has determined that the expected number of genetically modified crops to be
found in a randomly chosen farm of a remote region is 4 and the standard deviation is 1. If
three farms are independent and randomly chosen, what is the expected value and variance
of the total amount of genetically modified crops produced by the three farms together,
respectively?
E(S3) = Σ_{i=1}^{3} E(Xi) = 4 + 4 + 4 = 3(4) = 12,

Var(S3) = Var(Σ_{i=1}^{3} Xi) = Σ_{i=1}^{3} Var(Xi) = σ² + σ² + σ² = 3(1) = 3.
Example 5.5.3
We said earlier that the proof for the results regarding sums of independent random vari-
ables would be in Chapters 6 and 8. However, we can informally prove the results with an
example.
Consider the random variable X, which denotes the number of candies that a typical student at College Bliss eats in a typical day. Table 5.14 describes the probability mass function of X:
Table 5.14 Probability mass function of amount of candy eaten per day
x 0 1 2
P(X = x) 1/4 1/2 1/4
µ_X = 1,  σ² = 0.5.
Consider two randomly chosen unrelated students from this college. Let X 1 denote the
amount of candy consumed by student I and X 2 the amount consumed by student II.
We now list all possible values of the sum X 1 + X 2, the total amount of candy eaten by the
two next Monday, where the two random variables have the probability mass function given
above. The table below shows the possible values of the consumption of the two students
and the sum and the probability of that sum, which is obtained using the product rule for
independent events. For example, P(X1 = 0, X2 = 1) = P(X1 = 0)P(X2 = 1) = (1/4)(1/2) = 1/8.
X1, X2 (0,0) (0,1) (0,2) (1,0) (1,1) (1,2) (2,0) (2,1) (2,2)
S 2 = X1 + X2 0 1 2 1 2 3 2 3 4
We can regroup the results to obtain Table 5.15 of the probability mass function for the
unique values of the sum. We will make use of the addition rule for mutually exclusive
events.
S2 0 1 2 3 4
P ( S2 ) 1/16 4/16 6/16 4/16 1/16
E(S2) = Σ_{S2=0}^{4} S2 P(S2) = 2,

Var(S2) = Σ_{S2=0}^{4} (S2 − 2)² P(S2) = 1,

which agree with the general formulas

E(Sn) = nµ = 2(1) = 2,

Var(Sn) = nσ² = 2(0.5) = 1.
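The regrouping above is a convolution of the pmf with itself. A sketch in R, using the outer product of the probabilities (variable names are our own):

```r
# pmf of S2 = X1 + X2 for two independent students (Table 5.14)
x <- 0:2
p <- c(1/4, 1/2, 1/4)
joint <- outer(p, p)                    # P(X1 = i, X2 = j) by independence
s <- outer(x, x, "+")                   # the corresponding sums
pmf_s <- tapply(joint, s, sum)          # regroup: 1/16 4/16 6/16 4/16 1/16
sum(as.numeric(names(pmf_s)) * pmf_s)   # E(S2) = 2
```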
5.5.1 Exercises
Exercise 1. A painter purchases two types of paints, paint A and paint B. The amount of paint
A purchased per week, X 1 , has E ( X 1 ) = 40 gallons and Var ( X 1 ) = 4 . The amount of paint B
purchased, X 2 , has E ( X 2 ) = 65 gallons and Var ( X 2 ) = 8. Paint A costs $3 per gallon, whereas
paint B costs $5 per gallon. How much should the painter expect to spend next week on these two types of paint? What is the standard deviation of the amount spent? Assume that X1 and X2 are independent.
Exercise 2. A group of 50,000 loan requests made through the web site of a bank report an
average income of µ = $37,000 with a standard deviation of σ = $20,000. Furthermore, 20% of the requests report a gross income over $50,000. A group of 900 requests is chosen at
random to check their accuracy. Find the expected value and standard deviation of the total
income of the 900 loan requests.
Probability models play a very important role as mathematical abstractions of the uncertainty
we have about different measurements. Although we can reach an answer to our probability
questions with the methods we studied in Chapters 2 and 3, some experiments are so similar
in nature, and happen so often, that, for the sake of economy of time, it is worth having all
the information we need about a given experiment in an equation and methodology like that
presented in this chapter to extract from that equation everything we need. Many years of
research by probabilists and applied researchers went into arriving at those equations. We call
A random variable X has a discrete uniform distribution with N points, where N is a positive
integer, and possible distinct values x i , i = 1,2, …., N, if its probability mass function is given by:
P(xi) = 1/N,  i = 1, 2, ..., N.
E(X) = 0(1 − p) + 1(p) = p,

Var(X) = (0 − p)²(1 − p) + (1 − p)²(p) = p(1 − p).
Example 5.8.1
A genetic counselor is conducting genetic testing for a genetic trait. Previous research suggests that this trait is found in 1 out of every 8 people. He examines one randomly chosen person and observes whether that person has the genetic trait or not. The pmf is:

P(X = x) = 1 − 1/8 = 7/8, for x = 0,
P(X = x) = 1/8, for x = 1,

E(X) = 1/8,  Var(X) = (1/8)(7/8) = 7/64.
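A Bernoulli variable is a binomial with n = 1, so R's binomial functions apply. A quick simulation check of the mean and variance formulas (the sample size and seed are arbitrary choices of ours):

```r
# Simulate the genetic-trait indicator: Bernoulli with p = 1/8
set.seed(3)                           # arbitrary seed
x <- rbinom(100000, size = 1, prob = 1/8)
mean(x)                               # close to E(X) = 1/8 = 0.125
var(x)                                # close to p(1 - p) = 7/64 = 0.109375
```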
5.8.1 Exercises
Exercise 1. Find the moment generating function of a Bernoulli random variable using the
definition of moment generating function given in Section 5.3.4.
Exercise 2. In 2008, the Canadian city of Quebec reported that 10% of Quebec students are victims of acts of bullying at least once a week. We choose a student at random. (i) Write the Bernoulli model formula appropriate for this context. (ii) Define what the random variable represents. (iii) Give the expected value and variance of this random variable. (iv) Do you think that the model you wrote applies equally to young and old students? Write your thoughts about this, and if you think the models might be different, write your proposed models. (v) What variables may be most important in discriminating groups of bullied students?
A Park Ranger working in a large National Park is in charge of overseeing groups of ten visitors
to describe to them the regulations of the park. There is a probability of 1/3 that a visitor
will ask the park ranger a question. The next group of 10 has arrived. What is the probability
that at most two people ask a question?
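The question is answered by the binomial cumulative distribution, which R provides directly. A sketch:

```r
# X = number of visitors (out of 10) who ask a question, with p = 1/3
pbinom(2, size = 10, prob = 1/3)     # P(X <= 2), about 0.299
sum(dbinom(0:2, 10, 1/3))            # the same probability, term by term
```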
Var(X) = Σ_x (x − µX)² (n choose x) p^x q^(n−x) = npq,

E(a + bX) = a + b(np),

Var(a + bX) = b²np(1 − p).
There is a linear function of the binomial which is of particular interest in Statistics. This is:

Y = X/n.

By the linearity of expectations,

E(Y) = E(X)/n = p,

Var(Y) = Var(X)/n² = p(1 − p)/n.
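These two results can be checked by simulating the sample proportion Y = X/n. The values of n, p, the number of repetitions, and the seed below are illustrative choices of ours:

```r
# Y = X/n for X binomial with n = 50 and p = 0.3
set.seed(4)                       # arbitrary seed
n <- 50; p <- 0.3
y <- rbinom(100000, n, p) / n     # many simulated sample proportions
mean(y)                           # close to p = 0.3
var(y)                            # close to p(1 - p)/n = 0.0042
```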
Why these results? How did we use the properties of expectations to obtain them? The
reader will answer that in one of the exercises.
The parameters of the binomial distribution, p and n, determine the shape of the binomial distribution. As n stays fixed and p increases, the pmf goes from being right-skewed to left-skewed. As p stays fixed and n increases, the distribution becomes more bell shaped.
Example 5.9.2
Suppose that a large lot of fuses contains 10 percent defectives. If four fuses are randomly sam-
pled from the lot, find the probability that at least one fuse in the sample of four is defective.
p = probability of defective = 0.1,
n = number of fuses sampled = 4 .
Let X = “number of defective fuses in n = 4.”
P(X ≥ 1) = 1 − P(X = 0) = 1 − (4 choose 0)(0.1)⁰(0.9)⁴ = 1 − 0.6561 = 0.3439.
Example 5.9.3
A sequence of 4 bits is transmitted over a channel with a bit error rate of 0.2. What is the
probability that the number of erroneous bits is 2?
p = probability that a bit is erroneous = 0.2,
n = number of bits transmitted = 4,
X = number of erroneous bits in n = 4,

P(X = 2) = (4 choose 2)(0.2)²(0.8)² = 0.1536.
Example 5.9.4
Suppose that a large manufacturer of earbuds produces 10% defectives. Suppose that four
earbuds sampled from the lot were shipped to a customer, before being tested, on a guar-
antee basis. Assume that the cost of making the shipment good is given by C = 3X 2, where
X denotes the number of defectives in the shipment of four. Find the expected repair cost.
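Using the shortcut E(X²) = Var(X) + [E(X)]² from earlier in this chapter, with X binomial (n = 4, p = 0.1), the expected cost can be computed directly. A sketch of the computation:

```r
# Expected repair cost E(C) = E(3 X^2) for X binomial with n = 4, p = 0.1
x <- 0:4
p <- dbinom(x, size = 4, prob = 0.1)
sum(3 * x^2 * p)                       # E(3 X^2) = 1.56
3 * (4 * 0.1 * 0.9 + (4 * 0.1)^2)      # the same via 3 * (Var(X) + mu^2)
```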
5.9.2 Exercises
Exercise 1. Consider the following scenarios:
SCENARIO 1: A system with two independent components is such that the system fails
if at least one of the individual components fails. The probability that a component
fails is 0.04.
SCENARIO 2: A system with three independent components is such that the system
fails if at least one of the individual components fails. The probability that a compo-
nent fails is 0.04.
SCENARIO 3: A system with four independent components is such that the system fails
if at least one of the individual components fails. The probability that a component
fails is 0.04.
(i) For each scenario, provide the probability mass function in the form of a table, with one column indicating the values of the random variable X = number of failing components and the other column giving the probability of each value of X. Add a third column to each table indicating which events are used to compute the probabilities. In each of the scenarios, double check whether the formula

P(X = x) = C(n, x) p^x q^(n−x),  x = 0, 1, 2, …, n,

gives the same probabilities.
(ii) Add a column to your table indicating the computation done using the binomial formula.
Check that, in each of the scenarios,
∑_x (x − E(X))² P(X = x) = Var(X) = npq.
Exercise 2. According to the Centers for Disease Control and Prevention, 1 of every 4 deaths is caused by heart disease. If we choose 10 deaths from the registry at random, what is the probability that the sample would contain 3 heart disease deaths?
Exercise 3. The Physella Zionis, also known as wet-rock physa is a snail found in Zion Canyon, Utah,
and Orderville Canyon, along the North Fork of the Virgin River in Utah. This species is believed
to have mutated to adapt to the environment in the Canyons. The National Park has a small
information post at the entrance of the area where the snail lives indicating the history. What is
the probability that in a random sample of 10 visitors, at least one stopped to read the post about
the Physella Zionis, if the probability that one will stop is 0.2? This information may help the park
rangers decide whether perhaps the post should be moved to a more visible part of the trail.
Exercise 4. In some parts of Africa the prevalence of albinism is as high as 1 in 1,000. How
large a sample of individuals from these parts would we have to draw in order to sight at
least one person with albinism?
Exercise 5. Outbreaks of cholera, a highly contagious disease, are possible after a natural
disaster. For example, after the 2010 earthquake in Haiti, there was a cholera outbreak. If a
natural disaster such as that in Haiti were to occur in the future, and you were in charge of
sampling the population to screen for cholera, would you use the binomial model?
Exercise 7. The probability that the life length of a certain type of battery exceeds four hours
is 0.135. If three such batteries are in use in independent operating systems, what is the
probability that only one of the batteries will last more than four hours?
Exercise 8. Suppose that X is a binomial random variable with parameters p and n. Define Y = X/n. (i) Find E(Y) and the standard deviation of Y. (ii) What is E(Y²) equal to?
Exercise 9. In the old times, when there were no computers and calculators were not so good,
students had access to tables for the distributions. For example, for the Binomial distribution,
for any given n, the table would give the probability of X = 0, X = 1, …, X = n for values of
p of 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95. Suppose n is 3, for which value of
those p’s is P(X = 1) highest?
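Modern software makes such tables unnecessary. The following Python sketch (for illustration; the book's code section uses R) reproduces the P(X = 1) row for n = 3 across the listed values of p, from which the exercise can be answered by inspection:

```python
from math import comb

# P(X = 1) for X ~ Binomial(n = 3, p), over the p values in the exercise.
n, x = 3, 1
ps = [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
row = {p: comb(n, x) * p**x * (1 - p)**(n - x) for p in ps}
for p, prob in row.items():
    print(p, round(prob, 4))
```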
XXWhat experiment would the sample space given in the box below represent? Describe the experiment (note that the dots (…) mean that the sequences continue ad infinitum following a similar pattern).
H, TH, TTH, TTTH, TTTTH, TTTTTH, TTTTTTTH, TTTTTTTTH, …
Suppose we have a situation, like the coin toss, where there is only one of two possible out-
comes. Let X = 1 if the event of interest happens and X = 0 otherwise. We keep trying until
the event appears. Suppose that the trials are independent and of constant probability of the
event. So we have just described a sequence of Bernoulli trials. The trial in the sequence at which the event first occurs is often of interest. For example, we are going to conduct interviews
to hire a candidate proficient in Zulu to work for the government of South Africa. How many
interviews does it take to find the first qualified Zulu speaker? Let that interview (could be
the first, the second, the 20th, or other) be denoted by a random variable Y. We say that Y is
a geometric random variable, with probability mass function
P(Y = k) = (1 − p)^(k−1) p,  k = 1, 2, ….

Note that

∑_{k=1}^{∞} (1 − p)^(k−1) p = 1.
By computing the probabilities this way, we have counted all the probability of all possible
outcomes in the sample space. The sample space has a probability of 1. See Math tidbit 5.5
to review the geometric progression formula needed to reach this conclusion.
Using similar mathematics, we can prove that:
E(Y) = 1/p.

Similarly, we can show that:

Var(Y) = (1 − p)/p².

The moment generating function of Y is

M_Y(t) = p e^t / (1 − (1 − p) e^t).
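These facts about the geometric distribution can be verified numerically by truncating the infinite sums at a large K. In the Python sketch below (illustrative only; p = 0.2 and K = 10,000 are arbitrary choices) the truncation error is negligible:

```python
# Numerical check of the geometric facts above: the pmf sums to 1,
# E(Y) = 1/p, and Var(Y) = (1 - p)/p^2. p and K are arbitrary choices.
p, K = 0.2, 10_000
pmf = [(1 - p)**(k - 1) * p for k in range(1, K + 1)]

total = sum(pmf)
mean = sum(k * pk for k, pk in enumerate(pmf, start=1))
var = sum((k - mean)**2 * pk for k, pk in enumerate(pmf, start=1))

print(total)   # close to 1
print(mean)    # close to 1/p = 5
print(var)     # close to (1 - p)/p^2 = 20
```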
P(N_n = c) = (1 − 1/n)^(c−1) (1/n).
(ii) If the password is not replaced, then each trial is not independent of the other,
and the probability of hitting the right password changes as we eliminate passwords.
Therefore,
P(N_n = c) = (1 − 1/n)(1 − 1/(n−1)) ⋯ (1 − 1/(n−c+2)) · 1/(n−c+1).
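A sketch contrasting the two schemes (Python, for illustration; n = 8 is a hypothetical number of candidate passwords). Note that without replacement every trial number turns out to be equally likely, with probability 1/n:

```python
# Probability that the correct password is hit on trial c, out of n
# candidate passwords. n = 8 is a hypothetical value.
n = 8

def p_with_replacement(c, n):
    # c - 1 wrong guesses, then the right one: geometric with p = 1/n.
    return (1 - 1 / n)**(c - 1) * (1 / n)

def p_without_replacement(c, n):
    prob = 1.0
    for j in range(c - 1):            # c - 1 failures from a shrinking pool
        prob *= 1 - 1 / (n - j)
    return prob * 1 / (n - c + 1)     # success on trial c

for c in range(1, n + 1):
    print(c, p_with_replacement(c, n), p_without_replacement(c, n))
```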
5.10.1 Exercises
Exercise 1. The National Health and Nutrition Examination Survey (NHANES) examines about
5000 persons per year https://www.cdc.gov/nchs/nhanes/index.htm and constantly screens
diabetic people to detect people with type I diabetes. Let X be a random variable measuring
the number of diabetic people it takes to find the first Type I diabetic person. Assume that
the probability of a diabetic person having type I diabetes is 0.05. (i) List five or six outcomes
of the sample space. Use as notation: 1 for type I and 0 for not type I diabetic. Also provide
the probability of each of these outcomes and write a small table containing the first values
of X and P(X). (ii) Granted that populations are finite. But would a finite number of outcomes
in this sample space suffice to guarantee that all the axioms of probability hold? Explain
why or why not.
Exercise 2. A Statistics department finds that 20% of applicants for the Ph.D. have had prior
statistics courses. Applicants are reviewed at random from the pool. (i) Find the probability
that the first applicant having had statistics prior to applying is the sixth applicant. (ii) Sup-
pose that the first applicant who has had statistics courses prior to applying is accepted in
the program, and the applicant visits the department. Suppose each interview costs $100
and there are $350 in travel expenses. Find the expected value and standard deviation of the
cost of reviewing applicants until the first qualified applicant is found and visits.
Exercise 3. The probability of being able to access a disaster zone area by air after a hurricane
is 0.7. A helicopter keeps trying until access is achieved. (i) What is the probability that access
Exercise 4. The probability that a watch produced by a factory is defective is 0.01. In a recent
audit of the factory, products are inspected until the first defective is found. The first 10
trials have been found to be free of defectives (that is, it takes more than 10 trials to find a
defective). Calculate the probability that the first defective will occur in the 15th trial given
that it took more than 10 trials.
What experiment would the sample space given in the box below represent? Describe the experiment (note that (…) means that the sequences continue ad infinitum following a similar pattern).
HH, THH, THTH, HTTTH, TTTHTH, TTTHTTH, TTHTTTTTH, TTTTTTTHTTTTH, …
Let Y denote the number of the trial on which the rth success occurs in a sequence of
independent Bernoulli trials where the probability of success is p. For example, when do we
reach the first two heads?
P(Y = 2) = p(p) = P(HH),
P(Y = 3) = 2(1 − p)p(p) = P({THH, HTH}),
P(Y = 4) = 3(1 − p)²p(p) = P({HTTH, THTH, TTHH}),
…
P(Y = y) = C(y − 1, 1)(1 − p)^(y−2) p(p) = P({TTT…TTHH, HTT…TTTH, …}),
P (Y = y ) = P[{ first ( y − 1) trials contain (r − 1) successes and the last trial is a success}]
= P[ first ( y − 1) trials contain r − 1 successes ] P [ yth trial is a success].
The first probability statement is identical to the one that results in a binomial model for the probability of obtaining r − 1 successes in y − 1 trials. The last statement is just p. When combining the two, we get the final formula for the negative binomial:

P(Y = y) = C(y − 1, r − 1) p^r (1 − p)^(y−r),  y = r, r + 1, ….
We have Y = 6, r = 4, and p = 0.4. The random variable Y is the number of games it takes
B to win 4 games.
P(“B wins series in 6”) = C(5, 3) (0.4)^4 (0.6)^2 = 0.09216.
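The negative binomial pmf is easy to code directly. A Python sketch applied to this example (illustrative only; the book's code section uses R):

```python
from math import comb

# Negative binomial pmf: probability that the r-th success occurs on
# trial y in independent Bernoulli trials with success probability p.
def neg_binom_pmf(y, r, p):
    return comb(y - 1, r - 1) * p**r * (1 - p)**(y - r)

# Team B wins the series in exactly 6 games (r = 4, p = 0.4).
print(round(neg_binom_pmf(6, 4, 0.4), 5))   # 0.09216
```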
Exercise 2. On a hunting trip, the probability that a lion will find suitable prey is 0.8. What is
the probability that it takes 10 trips to find three suitable prey?
The following quote illustrates the use of a probability model to assess danger.
XXSeveral cases of anthrax including the deadly inhalation form of the disease were asso-
ciated with an envelope containing anthrax spores addressed to Senate Majority Leader
Thomas Daschle of South Dakota and found in a mailroom serviced by the U.S. Postal
Service distribution in Brentwood, MD, in the Autumn of 2001. In addition to the mailroom
serving Senator Daschle’s office, the Brentwood facility services approximately 3200 other
mailrooms in the district of Columbia and surrounding area. Once the letter was discov-
ered and cases of the disease emerged, a logical step for consideration in control of the
disease is to inspect a sample of the approximately 3200 other mailrooms served by the
Brentwood facility for the presence of anthrax spores.
Under simple random sampling without replacement, the conditional probability that
k of the n sampled mailrooms are contaminated with anthrax spores is given by the
hypergeometric distribution.
(Levy, Hsia, Jovanovic, and Passaro 2002, 19)
Statisticians use all kinds of probability models like those seen in this book to set up their statistical hypothesis tests. The calculations are complicated and require that the reader learn inferential statistics. However, after studying this section, the reader is
challenged to think: what in the hypergeometric model is the unknown that the statistician
is after in this anthrax case?
When trials are not independent Bernoulli trials, how can we calculate the probability
of k successes in n trials? We learned to do that in Chapter 4, Section 4.2.1. In the spirit of
Chapter 5, where we try to find a formula that summarizes all possible values and their
probabilities, a name was given to the formula we used in Chapter 4 to solve this problem.
The name given is the hypergeometric probability mass function.
The hypergeometric probability mass function is applicable to any situation in which a
population of N units can be considered to contain two mutually exclusive and exhaus-
tive subpopulations: subpopulation 1 having K units and subpopulation 2 containing N-K
units. Let Y be the number of units from subpopulation 1. The probability that a sample of n units contains k units from subpopulation 1 is

P(Y = k | N, n, K) = C(K, k) C(N − K, n − k) / C(N, n),  k = 0, 1, …, K.
The reader will recognize the above formula as the one arrived at when trying to figure out the probability of obtaining n draws from an urn in such a way that k elements are from population 1 and n − k from population 2. In the language of trials used in Chapter 4, Y represents the number of successes in n trials that are not independent and are obtained from a box containing M items, M_s of which are successes and M − M_s are failures. The reader will
realize that the formula we got in Chapter 4, namely,

P(Y = k) = C(M_s, k) C(M − M_s, n − k) / C(M, n),  k = 0, 1, …, M_s,

is the same hypergeometric probability mass function with N = M and K = M_s.
Example 5.12.1
A department store has ten discounted espresso coffee machines, four of which are defective.
A customer randomly selects five of the machines for purchase, to give to relatives for the
holidays. What is the probability that all five of the machines are non-defective?
Let Y be the number of non-defective machines among the five selected. Then

P(Y = 5) = C(6, 5) C(4, 0) / C(10, 5) = 6/252 = 0.0238.
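The computation can be carried out directly from the hypergeometric formula. A Python sketch (illustrative only; the book's code section uses R):

```python
from math import comb

# Hypergeometric pmf: k units from subpopulation 1 (size K) in a sample
# of n drawn without replacement from a population of size N.
def hypergeom_pmf(k, N, K, n):
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Example 5.12.1: N = 10 machines, K = 6 non-defective, n = 5 selected.
print(hypergeom_pmf(5, 10, 6, 5))   # 6/252 = 0.0238...
```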
5.12.1 Exercises
Exercise 1. A supermarket sells boxes containing five chirimoyas, two of which are spoiled
inside. If we were to randomly pick two chirimoyas from a box, what is the probability that
no more than one chirimoya is spoiled inside?
Exercise 2. Proposition 134 is on the ballot for the next election. In a small town of 50 people,
30 favor the proposition and 20 do not. A committee of 4 people is selected from this town.
Answer the following questions: (i) What is the probability that there will be no one in favor
Exercise 3. A museum curator collects paintings from local artists in sets of size 10 sent by
an intermediary. It is the curator’s policy to inspect 3 paintings randomly from a set and to
accept the set if all 3 satisfy the standards of the museum. If 30 percent of the sets have 4
unacceptable paintings and 70 percent have only 1, what proportion of the sets does the
curator reject?
Exercise 4. There are 20 data scientists and 23 statisticians attending the Annual Data Science
conference. Three of these 43 people are randomly chosen to take part in a panel discussion.
What is the probability that at least one statistician is chosen?
This is a question that always arises when talking about the hypergeometric probability
mass function. Binomial is for independent Bernoulli trials and for draws with replacement.
Hypergeometric is for draws without replacement. Where is the gray area? Consider the
following example.
Suppose we have a box that contains 1,000,000 red and 1,000,000 yellow balls. We are going to draw two balls at random without replacement. What is the probability of getting two red balls? Let R1 be the event that the first ball drawn is red, and let R2 be the event that the second ball drawn is red. Then:

P(R1R2) = P(R1)P(R2 | R1) = (1000000/2000000)(999999/1999999) = 0.2499999.

If we had drawn the two balls with replacement, the probability would have been

P(R1R2) = P(R1)P(R2) = (1000000/2000000)² = (1/2)² = 0.25.

The two answers are practically the same. Now suppose the box contains only 10 red and 10 yellow balls. Without replacement,

P(R1R2) = P(R1)P(R2 | R1) = (10/20)(9/19) = 0.2368.

With replacement,

P(R1R2) = P(R1)P(R2) = (10/20)² = (1/2)² = 0.25.
Now there is an appreciable difference. Thus, if the box (urn, population) is small, drawing
with and without replacement makes a difference in the results. The Binomial approximation
is not appropriate in this case. We cannot assume independence if we are not drawing with
replacement. But we can assume independence if the box we are drawing from is so large
that the composition of the box does not change noticeably.
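The point can be made concrete by computing both models side by side. In the Python sketch below (illustrative only), the hypergeometric and binomial answers nearly coincide for a large box but differ appreciably for a small one:

```python
from math import comb

def hypergeom(k, N, K, n):
    # k successes in n draws without replacement (K successes in pop. of N)
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def binom(k, n, p):
    # k successes in n independent draws with replacement
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Large box: 1,000,000 red out of 2,000,000 balls, draw 2.
print(hypergeom(2, 2_000_000, 1_000_000, 2))   # about 0.25
print(binom(2, 2, 0.5))                        # 0.25

# Small box: 10 red out of 20 balls, draw 2.
print(hypergeom(2, 20, 10, 2))                 # 9/38, about 0.2368
print(binom(2, 2, 0.5))                        # 0.25
```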
If (a) the probability that a sparrow lands at a given point on the landscape is the same every-
where, and (b) every point in the landscape is equally likely to be a landing point, then the
number of sparrows landing per square mile, as in Figure 5.2, is a Poisson random variable.
Those assumptions are equivalent to assuming that there is a large number of independent
sources affecting the landing or not landing. Alternatively, they are equivalent to assuming
that the landing is random.
The Poisson random variable is used to model the number of occurrences of a certain
event, in a unit of space or time, when the occurrences happen independently of each other
and occur with equal probability at every point in time and space. Examples are:
For the biologist, the Poisson distribution is just a model for how successes may be distrib-
uted in time and space in nature. Life gets interesting when the model doesn’t fit the data,
because then we learn that one or more of the main assumptions is false, hinting at the
existence of interesting biological processes. (For example, some individuals may be more prone to mutations than others, some fishers may be better catchers than others, or some plants may produce better quality seeds.) (Whitlock and Schluter 2009)
The alternative to a Poisson model is counts that are clumped together, meaning that one
occurrence (e.g., a sparrow landing) increases the probability of another occurrence nearby
(another sparrow landing nearby), as when there are contagious diseases. The distribution
of counts is more clustered than what is expected by chance. Another alternative is when
the occurrences are more dispersed than expected, as when for example, seeing a territorial
animal decreases the probability of seeing another animal nearby. There is more dispersion
than would be expected by chance.
The Poisson is a discrete random variable with probability mass function

P(X = k) = e^(−λ) λ^k / k!,  k = 0, 1, 2, ….

The parameter λ is larger than 0. Parameters are assumed constant. Different values of the parameter characterize different members of the Poisson family.
Example 5.14.1
As part of a National Parks survey, an aerial photograph of a region where cars are not allowed
could be taken by the National Parks Service. A grid of four hundred 100-meter squares could
be superimposed over the photograph to enable a study to be made of the distribution of
cars. The number of 100 meter squares containing 0,1,2,3, up to 6 cars could be recorded.
Suppose the following information was found:
Number of cars 0 1 2 3 4 5 6
Number of squares 278 92 25 4 0 0 1
Example 5.14.2
The reader is encouraged to read further into uses of the Poisson distribution, in particular
to model rare phenomena, such as for example “horse-kick” deaths or death by surgery in
the wrong location. The following web site https://onlinelibrary.wiley.com/doi/full/10.1111/
anae.13261 dwells on this application.
Example 5.14.3
Modern statistical software packages give us the power to perform many computations with
a few clever keystrokes and mouse clicks. However, this computing power does not come
without drawbacks. Specifically it has become all too easy to overlook data-entry errors
and/or nonsense codes meant to represent missing data, with many researchers including
these values in their final computations. Codes such as 9999 or -1 for missing data were
often used to indicate that a value was missing like systolic blood pressure (which should
Example 5.14.4
The number of emergency calls to 911 for health reasons per day in a certain road has a
Poisson distribution with expected value 4. There are only three locally owned ambulances
which are dispatched to the first three emergency calls locations, after that ambulances from
the neighboring town are called. On any given day, what is the probability of having to call
ambulances from the nearest town?
Let X be the number of emergency calls per day.
P(X > 3) = 1 − P(X ≤ 3) = 1 − [P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)]
= 1 − [e^(−4) + 4e^(−4) + 16e^(−4)/2 + 64e^(−4)/6] = 0.56653.
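A quick check of this computation in Python (illustrative; in R the same number is obtained with 1 - ppois(3, 4), as in Section 5.17):

```python
from math import exp, factorial

# P(X > 3) for X ~ Poisson(lambda = 4): complement of the first four terms.
lam = 4

def pois_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

p_more_than_3 = 1 - sum(pois_pmf(k, lam) for k in range(4))
print(round(p_more_than_3, 5))   # 0.56653
```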
So the probability of more than one case is 0.0047. That is, we would consider seeing two or more mumps cases in a month a very unlikely event if mumps cases are random and independent. More than two cases per month would be almost impossible to observe.
Mumps cases, like all cases of rare diseases, must be reported to CDC and are published in
the Morbidity and Mortality Weekly Report. In January 2006, Iowa reported 4 cases of mumps.
What is the probability of getting 4 or more cases of mumps under the model assumed? We
find, again using the Poisson model that
P ( X ≥ 4) = 0.000004.
Example 5.14.6
(This problem is from Ross (2010).) The number of times that a person contracts a cold in a given year is a Poisson random variable with parameter λ = 6. Suppose that a new wonder drug (based on large quantities of vitamin C) has just been marketed that reduces the Poisson parameter to λ = 3 for 75% of the population. For the other 25 percent of the population the
drug has no appreciable effect on colds. If an individual tries the drug for a year and has 2
colds in that time, how likely is it that the drug is beneficial for him or her?
Let B denote beneficial and X = the number of colds
The probability sought is
P(B | X = 2) = P(X = 2 | B)P(B) / [P(X = 2 | B)P(B) + P(X = 2 | Bᶜ)P(Bᶜ)]

= [(3² e^(−3)/2!)(3/4)] / [(3² e^(−3)/2!)(3/4) + (6² e^(−6)/2!)(1/4)] = 0.9377.
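The Bayes computation above can be checked numerically (Python sketch, for illustration):

```python
from math import exp, factorial

# Posterior probability that the drug is beneficial given 2 colds:
# prior P(B) = 3/4, likelihoods Poisson(3) and Poisson(6) at k = 2.
def pois_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

num = pois_pmf(2, 3) * 0.75
den = num + pois_pmf(2, 6) * 0.25
posterior = num / den
print(round(posterior, 4))   # 0.9377
```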
5.14.1 Exercises
Exercise 1. Almost every year, there is some incidence of volcanic activity on the island of
Japan. In 2005 there were 5 volcanic episodes, defined as either eruptions or sizable seismic
activity. Suppose the mean number of episodes is 2.4 per year. Let X be the number of episodes
in the next two years. (i) What model might you use to model X? (ii) What is the expected
number of episodes in the next two years period according to your model? (iii) What is the
probability that there will be no episodes in the next two years? (iv) What is the probability
that there are more than three episodes in this period?
Exercise 2. In a town with two public libraries, people borrow books from public library A
at a rate of one for every two minutes and, independently, from library B at a rate of two
for every two minutes. People tend to prefer borrowing from library A, with 60% of library
patrons preferring A, and 40% preferring B. Nobody is known to borrow books from both
libraries. The two public libraries are open all day. (i) What is the probability that no one
Exercise 4. Prove that the expected value of a Poisson random variable with parameter λ is λ. The following result from mathematics may pop up during the computations:

∑_{n=0}^{∞} kⁿ/n! = e^k.

Use it as needed.
E(20Y + 10Y² − 3Y³ + e^(tY)).
Exercise 6. Some online sellers allow free examination of products for a month. The customer
can return the product within a month and get a full refund. In the past, an average of 2 of
every 10 products sold by a seller are returned for a refund. Using the Poisson probability
distribution formula, find the probability that exactly 6 of the 40 products sold by this com-
pany on a given day will be returned for a refund.
Exercise 7. Suppose that X and Y are independent Poisson random variables with parameters
l1 and l2 , respectively. (i) What is the expected value of the sum of these two random vari-
ables? (ii) What is the variance of the sum of these two random variables?
5.15 The choice of probability models in data science
Genomics
DNA is a long, coded message made from a four letter alphabet: A, C, G, and T. It is believed
that some patterns may flag important sites on the DNA, such as the area on a virus’ DNA
that contains instructions for its reproduction. A particular type of pattern is a comple-
mentary palindrome, a sequence that reads in reverse as the complement of the forward
sequence. Palindromes were found in the area of replication of several viruses of the Herpes family. Nolan and Speed (2000) studied the DNA sequence of the human cytomegalovirus, a member of the herpes virus family. Places with clusters of palindromes were found along the DNA sequence, and they were suspected of being the origin of replication.
Those clusters were located. The locations are the numbers assigned to the pairs in the DNA sequence, which has 229,354 pairs of letters.
A question of interest posed by Nolan and Speed (2000) is whether the Poisson model
fits these data well. The Poisson model can help determine whether there are clusters or
not. Would you like to think how?
Solution: Divide the 229,354 locations into equal intervals. Count the number of palin-
dromes per interval. Create a table that shows X (the number of palindromes) and P(X) the
number of intervals containing that number of palindromes divided by the total number
of intervals. Calculate the average number of palindromes per interval and calculate the
Poisson probabilities for X, using a Poisson with the data average. If the numbers are sys-
tematically off, we can say that the Poisson does not fit.
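The procedure just described can be sketched in code. In the Python fragment below the palindrome counts per interval are made up for illustration (the real counts appear in Nolan and Speed (2000)); the logic, however, follows the solution above:

```python
from math import exp, factorial

# Hypothetical palindrome counts in 16 equal intervals (made-up data).
counts = [0, 2, 1, 3, 0, 1, 2, 1, 0, 2, 4, 1, 0, 1, 2, 0]

lam = sum(counts) / len(counts)   # average palindromes per interval

def pois_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

# Compare the observed relative frequency of each count with the
# Poisson probability; systematic discrepancies suggest a poor fit.
for k in range(max(counts) + 1):
    observed = counts.count(k) / len(counts)
    print(k, observed, round(pois_pmf(k, lam), 4))
```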
However, this type of exponential decay in the tail has been found to be not very appropriate for random variables describing the internet. Even before the internet, such distributions were found not very useful for modeling several hydraulic variables such as river runoff.
5.15.1 Zipf laws and the Internet. Scalability. Heavy tails distributions.
Consider the ranked web pages of a particular institution. A lower rank number means a more frequently visited page. Assign rank = 1 to the page with the highest frequency. The probability of requesting the rth ranked page is given by a power law called Zipf's law.
The assumption made about the distribution of the popular web pages has a lot of impli-
cations for web cache replacement algorithms.
A discrete power law distribution with coefficient γ > 1 is a distribution of the form

P(X = k) = C k^(−γ),  k = 1, 2, …
This equation describes the behavior of the random variable X for sufficiently large values
of X. The distribution for small values of X may deviate from the expression above.
Power law distributions decay polynomially for large values of the random variable. That is, the distribution decays polynomially as k^(−γ) for γ > 1. This means that in a power law distribution, rare events are not so rare.
To detect a power law in a log-log plot (log of probability vs. log of X) we should see a line with a slope determined by the coefficient γ.
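A small Python sketch of these two points (the coefficient value 2 and the truncation point are arbitrary choices, made so the normalizing constant can be computed):

```python
from math import log

# Discrete power law P(X = k) = C k^(-g), truncated at k_max so that the
# normalizing constant C can be computed. g = 2, k_max are assumptions.
g, k_max = 2.0, 10_000
C = 1 / sum(k**-g for k in range(1, k_max + 1))
pmf = {k: C * k**-g for k in range(1, k_max + 1)}

# On a log-log scale the pmf is a straight line with slope -g.
slope = (log(pmf[100]) - log(pmf[10])) / (log(100) - log(10))
print(round(slope, 4))   # -2.0
```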
Question 1. (This problem is from Moore (1996, 352).) Joe reads that one out of four eggs
contains salmonella bacteria, so he never uses more than three eggs in cooking. If eggs do
or do not contain salmonella independently of each other, the number of contaminated eggs
when Joe uses three chosen at random has the distribution
Question 2. A student of Probability proposed a discrete random variable Y which can take
the possible values 1, 2, and 3. The student claims that the following function is a probability
mass function for Y:
P(Y = y) = q^(2y) / (q² + q⁴ + q⁶),  y = 1, 2, 3;  q ≥ 0.
a. Bernoulli
b. Binomial
c. Poisson
d. Negative Binomial
Question 7. Suppose that 65% of the American public approves of the way the President is
handling the economy. A random sample of 8 adults is taken and Y is made to represent the
number who approve in that sample, a Binomial random variable with n = 8 and p = 0.65. A
student of survey sampling theory proposed looking instead at X = 8 − Y , another random
variable. The probability model for X is
a. Poisson(l = 8)
b. Geometric(p = 0.65)
c. Binomial(n = 8, p = 0.35)
d. Negative Binomial
Question 8. A biologist is examining frogs for a genetic trait. Previous research suggests that this trait is found in 1 out of every 8 frogs. She collects and examines 150 frogs chosen at random. How many frogs with the trait should she expect to find?
Question 9. An oil exploration firm is to drill ten wells, each of which has probability 0.1 of successfully striking recoverable oil. It costs $10,000 to drill each well, so there is a total fixed cost of $100,000. A successful well will bring oil worth $500,000. The expected value and standard deviation of the firm's gains are, respectively
Question 10. 20% of the applicants for a certain sales position are fluent in both Chinese and
Spanish. Suppose that four jobs requiring fluency in Chinese and Spanish are open. Find the
probability that two unqualified applicants are interviewed before finding the fourth qualified
applicant, if the applicants are interviewed sequentially and at random.
a. 0.87
b. 0.2
c. 0.0124
d. 0.541
5.17 R code
Example: To calculate P(X <= 10), given X follows a binomial distribution with n = 30,
p = 0.4, you can use the command:
pbinom(10, 30, 0.4)
#will give you
[1] 0.2914719
Example: To calculate P(10 < X < 13), given X follows a binomial distribution with n = 30, p = 0.4, note that P(10 < X < 13) = P(X <= 12) - P(X <= 10), so you can use the command:
pbinom(12, 30, 0.4) - pbinom(10, 30, 0.4)
Poisson distribution:
ppois(x, lambda)
This computes the probability that given the average rate of hits per unit of time (lambda)
there are x hits or less during a given unit of time.
Example: calculate P(X <= 2) given that the average rate per unit of time is 3 (lambda).
ppois(2,3)
# will give you
[1] 0.4231901
Example: Likewise, if you were interested in calculating the exact probability that there were
4 hits in one unit of time if the average is 3 hits per unit of time, you could use:
ppois(4,3)-ppois(3,3)
# will give you
[1] 0.1680314
Example: What happens if we want to calculate the probability that we get more than four
hits if the average amount of successes is three per unit of time? Remember that if you sum
up all of the probabilities for a distribution they equal 1. Thus if you calculate the probability
of having four or less hits and then take one minus that probability you get the probability
of having more than four hits. In R, it looks like:
1-ppois(4,3)
#will give you
[1] 0.1847368
1. A neat way to plot the probability mass function of a discrete random variable is a line plot. Write by hand the probability mass function and cumulative probability given below and then plot them with R using the following commands. Plot also the cumulative probability.
stacksize=0:7
probability=c(0.05,0.10,0.25,0.2,0.2,0.10,0.05,0.05)
plot(stacksize, probability, xlab="Stack size", ylab="probability", type="h")
cumprob=c(0.05,0.15,0.4,0.6,0.8,0.9,0.95,1)
plot(stacksize, cumprob, xlab="Stack size", ylab="Cumulative probability", type="S")
expectation=sum(stacksize*probability)
variance=sum((stacksize-expectation)^2 * probability)
standarddeviation=sqrt(variance)
Exercise 1. In a digital communication system, bits are transmitted over a channel in which
the bit error rate is assumed to be 0.0001. The transmitter sends each bit five times, and a
decoder takes a majority vote of the received bits to determine what the transmitted bit was.
Determine the probability that the receiver will make an incorrect decision.
Exercise 2. Prove that the moment generating function of the binomial distribution is

M_X(t) = (pe^t + 1 − p)^n.
Exercise 3. A large corporation is interested in reconsidering its retirement policy and pro-
ceeds to conduct a survey of a random sample of its employees. The survey asks each of the
surveyed employees whether they favor or not the current retirement policy. In a random
sample of 10 employees, what is the probability that all of them favor the current retirement
policy if overall 65% of all employees favor it?
Exercise 4. Jury deliberation in the jury room sometimes takes a long time. Jurors are randomly
selected members of the population. In a population where 90% of the population thinks that
someone that robs a convenience store at gun point is guilty, what is the probability that 5
out of 12 jurors will find the defendant of such a robbery case guilty?
Exercise 5. Show that the cumulative distribution function of a geometric random variable
equals
P(Y ≤ y) = 1 − (1 − p)^y.
Exercise 6. Coltron has an unfair coin (it’s unfair because it only comes up heads 40% of the
time). He bets his older brother Garrett that if he flips the coin 10 times he can’t get at least
6 heads. Garrett, unaware of the nature of the coin, agrees. What is the chance that Garrett defeats Coltron and flips at least 6 heads with the unfair coin? Do this problem theoretically
and also with R. Compare the answers.
Exercise 7. Zandree Rose is a car salesperson. On average she sells two brand new BMWs
every week. She gets a bonus if she sells at least three BMWs in any given week. Given a
seven-day span, what is the probability that she gets the bonus? What is the probability she
sells exactly three? Do this problem theoretically and also with R. Compare the answers.
Exercise 8. Let X be a random variable representing the roll of a fair six-sided die. Write
the probability mass function that corresponds to the roll of a fair die, find the expected
value, the standard deviation, the cumulative distribution function and the moment
generating function.
Exercise 10. A coin weighted so that P(H) = 2/3 and P(T) = 1/3 is tossed three times.
a. Write down the outcomes in the sample space S and the probability of each of the
individual outcomes.
b. Let X be the random variable which assigns to each point in S the largest number of
successive heads which occurs. Write the values of X under each of the outcomes in
the sample space S. For example, in a different context, if we observe, say, 10 boxes
of cereal in a supermarket in sequence to see if they contain a coupon inside, we could
observe cccnnncncc. In this case, the largest number of successive coupons is 3. We
observe successive coupons cc and c and ccc. The largest sequence is ccc.
c. Write the probability distribution of X in tabular form.
Exercise 11. (This problem is from Petrucelli, Nandram and Chen (1999, 211).) Ming’s Seafood
Shop stocks live lobsters. Ming pays $6.00 for each lobster and sells each one for $12.00.
The demand X for these lobsters in a given day has the following probability mass function.
x 0 1 2 3 4 5
P(X = x) 0.05 0.15 0.30 0.20 0.20 0.10
x 0 1 2 3
P(X = x) 0.1 0.2 0.3 0.4
Exercise 13. You have $1,000 and a certain commodity presently sells for $2 per ounce.
Suppose that after one week the commodity will sell for either $1 or $4 an ounce, with
these two possibilities being equally likely. If your objective is to maximize the expected
amount of the commodity that you possess at the end of the week, what strategy should
you employ?
Exercise 14. A typical slot machine has 3 dials, each with 20 symbols (cherries, lemons, plums,
oranges, bells, and bars). A typical set of dials is shown below. According to this table, of the
20 slots on dial 1, 7 are cherries, 3 are oranges, and so on. A typical payoff on a 1-unit bet
is as shown below. Compute the player’s expected winnings on a single play of the slot
machine. Assume that each dial acts independently.
Exercise 15. Prove that E(X) = np and Var(X) = np(1 - p) if X is a binomial random variable.
Exercise 16. The number of alpha particles emitted by a radioactive substance has an expected
value of 12 per square centimeter. If two 1-square-centimeter samples are independently selected,
find the probability that the two received 4 alpha particles. How many 1-square-centimeter
samples should be selected to establish a probability of approximately 0.95 that at least one
will contain one or more alpha particles?
Exercise 17. A department store classifies its charge customers as either high-volume or
low-volume purchasers. Ten percent are high-volume purchasers. If a sample of four custom-
ers is randomly selected, what is the chance that none of them is a high-volume purchaser?
Exercise 18. Assume that 13% of people are left-handed. If we select five people at random,
find the probability of each outcome below:
Exercise 19. Suppose we randomly select five cards without replacement from an ordinary
deck of playing cards. What is the probability of getting exactly two red cards?
Exercise 20. Let X be the number of bacterial colonies per cubic centimeter, a Poisson random
variable with expected value 3. (i) What is the probability that there is at least one bacterial
colony in a randomly chosen cubic centimeter? (ii) What is the probability that in five randomly
chosen cubic centimeters there is at least one cubic centimeter where the event of part (i) occurs?
Exercise 21. Show that if X is a random variable having a discrete uniform distribution with
N points, then

E(X) = (1/N) ∑_{i=1}^{N} x_i,   E(X^r) = (1/N) ∑_{i=1}^{N} x_i^r,   M_X(t) = (1/N) ∑_{i=1}^{N} e^{t x_i}.
Exercise 22. A student member collecting money for the Probability Club thinks that the
probability of collecting money from a person is 0.01 if nothing is offered in return, but it
is 0.4 if a doughnut from the Sinking Donuts store is given to the person donating. If this
week Sinking Donuts has offered to give a doughnut to everyone who donates money, how
many people will the student have to address to collect money?
Exercise 23. (This problem is based on Marchette and Wegman (2004, 13).) Cybersecurity is
of major concern these days. Rare is the large institution that has not received a virus or been
attacked on the internet. A network attack may occur by compromising multiple systems, which
are then used as platforms to mount a distributed attack. The victims send their responses
to the spoofed addresses of the compromised system, not to the attacker. The compromised
systems are chosen at random by the attacker. Assume that an attacker is selecting spoofed
IP addresses from N total addresses. Further, assume that a network sensor monitors n IP
addresses. The attacker sends k packets to the victim. A packet is the basic data part of the
internet. (i) What is the probability of detecting an attack? (ii) What is the probability of
seeing j packets?
Exercise 24. In a town, the probability that the air quality is good is p. If we choose n days
at random, what is the probability of finding 4 with good air quality?
Exercise 25. A certain water purification system contains five filters. Each one of the five filters
of the water-purification system functions independently with probability of 0.95. For good
water quality, at least 4 of the filters should be functioning. What is the probability that the
quality of the water is good?
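The filter exercise is a binomial tail probability; a quick numerical sketch in Python (ours, not part of the book):

```python
from math import comb

# P(at least 4 of 5 independent filters work), each working with probability 0.95.
p = 0.95
prob = sum(comb(5, k) * p**k * (1 - p) ** (5 - k) for k in (4, 5))
print(round(prob, 4))  # about 0.9774
```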
Exercise 26. Consider the following random variables and identify each type.
Exercise 27. A homework assignment has 12 problems. A quiz is designed containing a random
selection of 4 of these problems. If a student has figured out how to do 6 of the problems
in the homework, what is the probability that the student will answer correctly more than
2 questions on the quiz?
Exercise 29. Do extinctions occur randomly through the long fossil record of Earth’s history,
or are there periods in which extinction rates are unusually high (“mass extinctions”)
compared with background rates? Whitlock and Schluter (2009) give data on the number of
extinctions of marine invertebrate families in 76 blocks of time of similar duration.
Compute the expected number of blocks for the number of extinctions given above and
compare with the observed. Is there much difference between the two?
Find P ( X = 4).
Bain, Lee J., and Max Engelhardt. 1987. Introduction to Probability and Mathematical Statistics.
Duxbury Press.
Carlton, Matthew A., and William D. Stansfield. 2005. “Making Babies by the Flip of a Coin.”
The American Statistician 59, no. 2 (May): 180–182.
Christensen, Hannah. 2015. “Outlook for the weekend? Chilly, with a chance of confusion.”
Significance (October): 4–5.
Goldberg, Samuel. 1960. Probability: An Introduction. New York: Dover Publications, Inc.
Levy, Paul S., Jason Hsia, Borko Jovanovic, and Douglas Passaro. 2002. “Sampling Mailroom
for Presence of Anthrax Spores: A Curious Property of the Hypergeometric Distribution
under an Unusual Hypothesis Testing Scenario.” Chance 15, no. 2: 19–21.
Mansfield, Edwin. 1994. Statistics for Business and Economics: Methods and Applications. Fifth
Edition. W.W. Norton & Company.
Do you think people can tell if their significant other (SO) were cheating on
them? How would people feel if they accused their SO of cheating and found
out they were wrong? Do people themselves get away with cheating? How
would you conduct an investigation to answer these questions?
(Jane M. Watson 2011)
We are often interested in more than one characteristic of the outcome of an exper-
iment. If a conservation measure is implemented to determine its environmental
effect on plants and wild animals in the Santa Monica mountains, California, this
experiment could produce a wide array of outcomes, each outcome involving a
metric on plant abundance and animal abundance. Most of the time we would be
interested in how those are related. Moreover, before we implement the measure,
we would like to know the probability of the logically possible outcomes, to prevent
the implementation of a measure with devastating effects. Probability theory for
the relation between events, which we saw in Chapter 3 when we studied condi-
tional probability and independence, can be used to this end. Chapter 6 contains
the alternative random variable representation of metrics related to outcomes in
the sample space. In this chapter we focus on two metrics at a time: vector random
variables with two elements.
The reader is encouraged to review Sections 3.3 and 3.4 of this textbook before
embarking on the study of Chapter 6.
Definition 6.1.1
Let a sample space S = {o₁, o₂, o₃, …, o_N} be given together with an acceptable assignment
of probabilities to its simple events or outcomes of the experiment, o₁, o₂, …, o_N. Let X and
Y be two random variables defined on S. Then the function P whose value at the point
(X = x, Y = y) is given by

P(x, y) = P(X = x, Y = y) = ∑ P({o_i} ∈ S | X(o_i) = x and Y(o_i) = y),

is called the joint probability mass function of the random variables X and Y. (The domain
of the function is the set of all ordered pairs of real numbers, although P has nonzero values
for only a finite number of such pairs.) If the number of values of X and Y is not large, we
may represent P(x, y) in a two-way table, as in Examples 6.1.1 and 6.1.2.

The joint probability mass function must satisfy the axioms, namely: the joint probabilities
must add to 1, each joint probability is between 0 and 1, and the probability of the union of
mutually exclusive events is the sum of their probabilities.

Example 6.1.1
It is customary at American universities to employ students to do many of the services. A
campus ticket office employs three work-study students on Mondays. The students are
unrelated and come from different parts of campus. The first student works 10 a.m.–12 p.m.,
the second 12–2 p.m., and the third 2–4 p.m. Sometimes a student gets sick or does not
show up. Suppose each student is, independently, equally likely to show up or not. The
manager of the office must have a backup plan in the event that some of the students do
not show up to work that day. It would be convenient to have a model that gives the
probability of all possible events next Monday.

Let W indicate that a work-study student shows up to work on a given Monday, and A (for
absent) that the student does not show up. As usual, we refer to the sample space of the
experiment of observing the attendance of the three students on a randomly chosen Monday:

S = {WWW, WWA, WAW, WAA, AWW, AWA, AAW, AAA}.

The simple event WWW, for example, denotes the outcome where the first student shows
up to work, the second student shows up to work, and the third shows up to work. The
probability of this simple event is (1/2)(1/2)(1/2) = 1/8, if we are willing to assume that all
outcomes are equally likely. Similarly, we can compute the probability of the other simple
events, which is 1/8 for each of them.
Define the following random variables: let X = 1 if the student in the first shift shows up to
work and X = 0 otherwise, and let Y be the number of students that show up to work that Monday.
Of interest in this example is whether the fact that the student in the first shift is absent
has some effect on how many students show to work that day.
We want to determine not only the possible pairs of values of X and Y, but also the prob-
ability with which each such pair occurs. To say, for example, that the event consisting of
X taking value 0 and Y the value 1 occurs is to say that the event { AWA, AAW } occurs. The
probability of this event is therefore 2/8 or 1/4. We write

P(X = 0, Y = 1) = P({AWA, AAW}) = 1/8 + 1/8 = 1/4,
etc. In this way, we obtain the probabilities of all possible pairs of values of X and Y. These
probabilities are conveniently arranged in Table 6.1, in order to see the association between
the outcomes of the sample space and the values of the random variables.
Note that when we define random variables in this way, the equality sign is used as a
shorthand for “is the random variable whose value for any outcome (element of S) is”. The
distinction between the random variable and the value of the random variable should be
kept clearly in mind when, as here, the customary notation is somewhat misleading.
We list in Table 6.1 the values of these two random variables for each element of the
sample space S.
Table 6.1 Two random variables defined on the same sample space.
Outcome WWW WWA WAW WAA AWW AWA AAW AAA
X 1 1 1 1 0 0 0 0
Y 3 2 2 1 2 1 1 0
From the information in Table 6.1, and Definition 6.1.1, we obtain then the joint probability
mass function table of X and Y.
The resulting joint or bivariate distribution of X and Y , containing all values of P ( X = x , Y = y ),
can be seen in Table 6.2.
Table 6.2 Joint probability mass function of the random variables X and Y of Table 6.1.
The numbers in the cells are the joint probabilities of values of X and Y.
x \y 0 1 2 3
0 P(X = 0,Y = 0) = 1/8 P(X = 0,Y = 1) = 2/8 P(X = 0,Y = 2) = 1/8 P(X = 0,Y = 3) = 0
1 P(X = 1,Y = 0) = 0 P(X = 1,Y = 1) = 1/8 P(X = 1,Y = 2) = 2/8 P(X = 1,Y = 3) = 1/8
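Table 6.2 can also be reproduced mechanically by enumerating the sample space. A sketch in Python (the variable names and the use of exact fractions are ours; the book's computing examples use R):

```python
from itertools import product
from fractions import Fraction

# Enumerate the 8 equally likely Mondays; X = 1 if the first-shift student
# shows up (W), Y = total number of students who show up.
joint = {}
for outcome in product("WA", repeat=3):
    x = int(outcome[0] == "W")
    y = outcome.count("W")
    joint[(x, y)] = joint.get((x, y), Fraction(0)) + Fraction(1, 8)
print(joint[(1, 2)])  # 1/4, i.e. 2/8 as in Table 6.2
```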
We can also represent these results graphically as in Figure 6.1, where we draw a
three-dimensional chart in which P(X = x, Y = y) is the height of a vertical line drawn above the
point (x, y) in the horizontal x, y plane, the height being the value of the joint probability at
that point.
Probability Models for More Than One Discrete Random Variable 195

Once the joint probability mass table is constructed, we may answer questions relevant
to the problem at hand. For example, what is the probability that the student in the first
shift misses work and less than two students show up to work?

P(X = 0, Y < 2) = 1/8 + 2/8 = 3/8.

Figure 6.1 Graphical display of the joint probability mass function in Table 6.2.

Example 6.1.2
In trying to determine whether the employee in charge of checkout register 1 at a large
department store is much more desirable to customers than the employee in charge of
checkout register 2, the store conducts a study of the 6 p.m. rush hour over many days
that these two employees work. Let X be the number of customers in register 1 and Y the
number in register 2. The joint probability mass function is given below.
x \y 0 1 2 3
0 0.08 0.07 0.03 0.3
1 0.02 0.01 0.02 0.2
2 0.01 0.02 0.03 0.2
3 0 0 0.01 0.0
What is the probability that the employee in checkout register 1 has more customers than
the employee in register 2? We add the probabilities of the mutually exclusive events where
this happens.
P ( X > Y ) = P ( X = 1, Y = 0) + P ( X = 2, Y = 0) + P ( X = 2, Y = 1) + P ( X = 3, Y = 0)
+ P ( X = 3, Y = 1) + P ( X = 3, Y = 2)
= 0.02 + 0.01 + 0.02 + 0 + 0 + 0.01
= 0.06.
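The same tallying can be scripted; a Python sketch (ours) of the P(X > Y) computation, with the joint probabilities copied from the table above:

```python
# Joint pmf of Example 6.1.2; keys are (x, y) pairs.
P = {(0, 0): .08, (0, 1): .07, (0, 2): .03, (0, 3): .30,
     (1, 0): .02, (1, 1): .01, (1, 2): .02, (1, 3): .20,
     (2, 0): .01, (2, 1): .02, (2, 2): .03, (2, 3): .20,
     (3, 0): .00, (3, 1): .00, (3, 2): .01, (3, 3): .00}
p_more = sum(pr for (x, y), pr in P.items() if x > y)
print(round(p_more, 2))  # 0.06
```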
6.1.1 Exercises
Exercise 1. Two aeronautics companies (I and II) bid for contracts for space in a satellite navigation
system. A company that bids for a contract gets funded for that contract by the European
Union. Past information shows that firms I and II each get one contract with probability
1/9, firms I and II each get two contracts with probability 1/9, and firms I and II each get
three contracts with probability 1/9; any other distribution of the contracts between the two
companies also has probability 1/9. There are then a total of 9 outcomes. The sample space
would be represented by:
Exercise 2. Two species, A and B, affected by the same environmental factors, are being
studied to see if there is association between them. The species live in fruits. The random
variable X measures the number of species A per fruit, and the random variable Y measures
the number of species B per fruit. The joint probability mass function P(X, Y ) is given by the
following table.
x \y 0 1 2
0 0.40 0.1 0.1
1 0.1 0.1 0.02
2 0.1 0.02 0.03
3 0.01 0.01 0.01
What is the probability that the number of species B is larger than the number of species A?
Suppose that in Example 6.1.1 we are interested only in Y yet have to work with the joint
distribution of X and Y. Specifically, suppose we are interested in the event Y = 2, which is a
vertical slice of Table 6.2. Of course, as we learned in Chapter 3, by the law of total probability
(Section 3.5),

P(Y = 2) = P(Y = 2, X = 0) + P(Y = 2, X = 1) = 1/8 + 2/8 = 3/8.

That is, to obtain the total probability of the event Y = 2, we sum all the joint probabilities
of events in which Y takes the value 2.

Definition 6.2.1
Knowing the joint probability mass function of two random variables X and Y, we can
construct the total probability mass function of X, P(X), and the total probability mass
function of Y, P(Y), as follows:

P(x) = P(X = x) = ∑ P({o_i} ∈ S | X(o_i) = x),
P(y) = P(Y = y) = ∑ P({o_i} ∈ S | Y(o_i) = y).

Example 6.2.1. Example 6.1.1 (continued)
Consider Table 6.3. It is a copy of Table 6.2, but we have added a column and a row.
Table 6.3 The marginal probabilities of X and Y obtained by summing rows and columns
of the joint distribution.
x \y 0 1 2 3 P(X = x)
0 1/8 2/8 1/8 0 1/2
1 0 1/8 2/8 1/8 1/2
P(Y = y) 1/8 3/8 3/8 1/8 1
In Table 6.3, P(Y = 0) is obtained as the sum of the entries in the column headed
y = 0. By adding the entries in the other columns, we similarly find:

P(Y = 1) = ∑_x P(x, 1) = 3/8,
P(Y = 2) = ∑_x P(x, 2) = 3/8,
P(Y = 3) = ∑_x P(x, 3) = 1/8.
In this way, we obtain the total or marginal probability function of the random variables Y
from the joint probability table of X and Y. Since values of this probability function are written
in the lower margin of the joint table, the function is commonly called the marginal prob-
ability of Y, in spite of the fact that the adjective “marginal” is redundant. By adding across
the columns in the joint table, one similarly obtains the (marginal) probability mass function
of X. Marginal probabilities are total probabilities.
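The row and column sums can be checked in code; a small Python sketch (ours) of the marginals of Table 6.3:

```python
from fractions import Fraction as F

# Marginals are just row and column sums of the joint table (Table 6.3).
joint = {(0, 0): F(1, 8), (0, 1): F(2, 8), (0, 2): F(1, 8), (0, 3): F(0),
         (1, 0): F(0),    (1, 1): F(1, 8), (1, 2): F(2, 8), (1, 3): F(1, 8)}
pX = {x: sum(p for (a, y), p in joint.items() if a == x) for x in (0, 1)}
pY = {y: sum(p for (x, b), p in joint.items() if b == y) for y in range(4)}
print(pX[0], pY[1])  # 1/2 and 3/8
```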
It may be more familiar to the reader to have the marginal probability mass functions in
the usual table format that we used in Chapter 5. Table 6.4 contains them. This is the way
they should be presented; Table 6.3 is just to illustrate where they come from.

Table 6.4 Marginal pmfs of X and Y give us total probabilities of X and Y.
x 0 1
P(X = x) 1/2 1/2

y 0 1 2 3
P(Y = y) 1/8 3/8 3/8 1/8

Once the marginal probabilities have been found, it is not hard to go back to Chapter 5
and review how to compute the expected value, variance, moment generating function and
other functions of a single random variable. For example,

μ_X = E(X) = 1/2,  μ_Y = E(Y) = 12/8,  E(X²) = 1/2,  E(Y²) = 24/8.

P(Y = y) = (1/16)(2y² + 7),  y = 0, 1.
Exercise 2. In exercise 1, section 6.1.1, calculate the expected number of contracts going to
company I and the standard deviation of contracts going to company I.
Notice that all we need to show is that the condition fails in one cell of the table. Had it
been satisfied there, we would have had to keep checking, either until we found a cell where
it fails or until we verified that the condition holds in every cell.
6.3.1 Exercises
Exercise 1. Suppose that X and Y have the following joint probability mass function P(X, Y ).
x\y 0 2 4 6
10 0.04 0.08 0.08 0.05
15 0.12 0.24 0.24 0.15
Exercise 2. (This exercise is from Wonnacott and Wonnacott (1990).) A salesman has an 80%
chance of making a sale on each call. If three calls are to be made next month, let X be the
number of sales and Y the profit.
Where the profit Y is calculated as follows: Any sales on the first two calls yields a profit
of $100 each. By the time the third call is made, the original product has been replaced
by a new product whose sale yields a profit of $200. Thus, for example, the sequence (sale,
no sale, sale) would give Y = $300. (i) List the sample space. (ii) Tabulate and graph the
bivariate probability mass function of X and Y. (iii) Calculate the marginal distribution of
X and Y. (iv) What is the expected value of X and the expected value of Y? (v) Are X and Y
independent?
Exercise 3. Consider the planning of a 3-day vacation. The cost of the vacation R depends
upon both the number of days that the person spends the vacation outside the home (Y) and
whether the last day of the three was spent outside the home (X). P(X, Y) is given by the
following table.
x \y 0 1 2 3
0 0.04 0.08 0.08 0.05
1 0.12 0.24 0.24 0.15
(i) What is the expected cost? (ii) Compute the marginal pmf of X and Y. (iii) Are X and
Y independent?
Consider again the probability mass function of Table 6.2. If we are interested only
in the probability of Y when X = 1, how would we go about finding the distribution
of Y for that case when all we have to work with is the joint distribution of X and Y?
Do you think that the probability mass function of Y will be different when X = 0?
We add one final observation about the random variables X and Y in Table 6.2. It is clear from
the meaning of X and Y that knowing the value of X changes the probability that a given value
of Y occurs. For example, P(Y = 2) = 3/8. But if we are told that the value of X is 1, then the
conditional probability of the event Y = 2 becomes 1/2. For, by the definition of conditional
probability of two events, the conditional probability of the event Y = 2 given that X = 1 is

P(Y = 2 | X = 1) = P(X = 1, Y = 2) / P(X = 1) = (1/4) / (1/2) = 1/2.
We can also think of this as follows: knowing that X = 1 reduces the sample space to the
four outcomes where X = 1 ({WWW, WWA, WAW, WAA}). In this reduced sample space, only
two outcomes have Y = 2, namely {WWW, WAW}, and the probability of this last event is
(1 / 8) + (1 / 8) = 1 / 2 by axiom 3, since these two outcomes are mutually exclusive.
As we expect, the events X = 1 and Y = 2 are not independent: knowing that the first student
in the shift works on a Monday increases the probability of 2 students working on a Monday.
We introduce now a new type of univariate probability mass function: the conditional dis-
tributions. There are four conditional distributions of X, one for each value of Y. And there are
two conditional distributions of Y, one for each value of X. We will illustrate the extraction of
conditional probability mass functions from a joint probability mass function using the latter.
Table 6.5 Conditional pmfs of Y given a value of X, obtained from Table 6.3.

y P(Y = y | X = 0)
0 P(Y = 0 | X = 0) = P(X = 0, Y = 0)/P(X = 0) = (1/8)/(1/2) = 1/4
1 P(Y = 1 | X = 0) = P(X = 0, Y = 1)/P(X = 0) = (2/8)/(1/2) = 2/4
2 P(Y = 2 | X = 0) = P(X = 0, Y = 2)/P(X = 0) = (1/8)/(1/2) = 1/4
3 P(Y = 3 | X = 0) = P(X = 0, Y = 3)/P(X = 0) = 0/(1/2) = 0

y P(Y = y | X = 1)
0 P(Y = 0 | X = 1) = P(X = 1, Y = 0)/P(X = 1) = 0/(1/2) = 0
1 P(Y = 1 | X = 1) = P(X = 1, Y = 1)/P(X = 1) = (1/8)/(1/2) = 1/4
2 P(Y = 2 | X = 1) = P(X = 1, Y = 2)/P(X = 1) = (2/8)/(1/2) = 2/4
3 P(Y = 3 | X = 1) = P(X = 1, Y = 3)/P(X = 1) = (1/8)/(1/2) = 1/4
These pmfs are univariate probability mass functions. Thus, their expectation and variance
must be computed using the same methodology used for the marginal probability
mass functions and for a univariate mass function as seen in Chapter 5. However, the
notation must change to allow for the fact that each expectation depends on the value of the
other variable and the pmf is the conditional pmf. We will illustrate this idea by computing
the conditional expectation and variance of Y when X is 1. We obtain what we need from
Table 6.5, using the table for P(Y | X = 1):

E(Y | X = 1) = 0(0) + 1(1/4) + 2(2/4) + 3(1/4) = 2,
E(Y² | X = 1) = 0(0) + 1(1/4) + 4(2/4) + 9(1/4) = 9/2,
σ²_{Y|X=1} = E(Y² | X = 1) − (E(Y | X = 1))² = 9/2 − 4 = 1/2.
With this particular example understood, we can now proceed to discuss the general case
of any two random variables defined on the same sample space.
6.4.1 Exercises
Exercise 1. Find the following conditional probability mass functions obtained from
Table 6.2 and then compute the conditional expectations and conditional variances in
each case.
Consider the planning of undergraduate education for all three kids that a young couple
plans to have. The costs depend on the number of kids that go to college (X) and the
number of years each takes to complete the degree (Y). We denote the cost by C. Thus
C = g(X, Y). What will be the expected cost, μ_C? By how much could the actual cost
deviate from the expected cost?

We could tabulate the distribution of C and calculate:

E(C) = ∑_c c P(C = c).

Or we could use the joint distribution of X and Y, P(X, Y), directly:

E[g(X, Y)] = ∑_x ∑_y g(x, y) P(x, y).
Example 6.5.1
Suppose the joint probability mass function of X and Y for the planning couple is as follows
x \y 3 4 5
1 0.05 0.05 0
2 0.07 0.3 0.03
3 0 0.4 0.1
C = g( X , Y ) = 10 XY .
We can calculate the mean of C directly from P ( X , Y ). We will first use the joint probability
mass table as a tool to do the computations directly in each cell. We calculate in each cell
10 xyP ( X = x , Y = y ).
x \y 3 4 5
1 (10xy)(0.05) = 1.5 (10xy)(0.05) = 2 0
2 (10xy)(0.07) = 4.2 (10xy)(0.3) = 24 (10xy)(0.03) = 3
3 0 (10xy)(0.4) = 48 (10xy)(0.1) = 15
We can also find the same result by first finding the probability mass function of C, also
obtained from the table, as follows, and then compute the expected value of this univariate
discrete random variable.
c P(C = c)
10(1)(3) = 30 0.05
10(1)(4) = 40 0.05
60 0.07
80 0.3
100 0.03
120 0.4
150 0.1
E(C) = ∑_c c P(C = c) = 97.7.
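Both routes, the cell-by-cell computation and the derived pmf of C, can be checked in code; a Python sketch (ours):

```python
# E(C) for C = 10XY in Example 6.5.1: directly from the joint table,
# and again via the derived pmf of C. Both give 97.7.
joint = {(1, 3): .05, (1, 4): .05, (1, 5): 0,
         (2, 3): .07, (2, 4): .30, (2, 5): .03,
         (3, 3): 0,   (3, 4): .40, (3, 5): .10}
e_direct = sum(10 * x * y * p for (x, y), p in joint.items())
pmf_C = {}
for (x, y), p in joint.items():
    c = 10 * x * y
    pmf_C[c] = pmf_C.get(c, 0) + p
e_via_C = sum(c * p for c, p in pmf_C.items())
print(round(e_direct, 1), round(e_via_C, 1))  # 97.7 97.7
```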
We shall see in this section that if two random variables X and Y are defined on a sample
space S, then there are automatically many other random variables also defined on S. In partic-
ular, the sum X + Y and the product XY turn out to be especially important random variables.
g( X , Y ) = X + Y
g( X , Y ) = ( X − µx )(Y − µy )
Example 6.5.2
Consider the random variables X and Y of Table 6.2. The possible values of X and Y, together
with their joint probabilities, are given in that table. Let g( X , Y ) = X + Y .
From the joint probability Table 6.2, we can determine the possible values of random
variable U = X + Y as well as the probabilities with which each value occurs. For example,
P (U = 2) = P ( X = 0, Y = 2) +P ( X = 1, Y = 1) = 1 / 8 +1 / 8 = 1 / 4 .
In this way, we obtain the entries in the following probability table for the random variable
U = X + Y.
u 0 1 2 3 4
P(U = u) 1/8 1/4 1/4 1/4 1/8
E(X) = 0(1/2) + 1(1/2) = 1/2;  E(Y) = 0(1/8) + 1(3/8) + 2(3/8) + 3(1/8) = 3/2.
Observe that E ( X + Y ) = E ( X ) + E (Y ), a result that we will soon establish for all random
variables X and Y.
Example 6.5.3
If we define g( X , Y ) as the product rather than the sum of X and Y, then V = XY is a random
variable whose probability table is similarly found from Table 6.2:
v 0 1 2 3
P(V = v) 1/2 1/8 1/4 1/8
Observe that E(XY) ≠ E(X)E(Y) here. The equality is guaranteed when the two random
variables are independent, which is not the case in this example.
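The means of U = X + Y and V = XY can be computed straight from the joint table, anticipating Theorem 6.1; a Python sketch (ours):

```python
from fractions import Fraction as F

# From Table 6.2: means of U = X + Y and V = XY without tabulating U and V.
joint = {(0, 0): F(1, 8), (0, 1): F(2, 8), (0, 2): F(1, 8), (0, 3): F(0),
         (1, 0): F(0),    (1, 1): F(1, 8), (1, 2): F(2, 8), (1, 3): F(1, 8)}
eU = sum((x + y) * p for (x, y), p in joint.items())
eV = sum(x * y * p for (x, y), p in joint.items())
print(eU, eV)  # 2 and 1
```

Note that eV = 1 differs from E(X)E(Y) = (1/2)(3/2) = 3/4, in line with the remark above.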
You should note that what we do to determine the probability function of g( X , Y ) is col-
lect all possible pairs of X and Y values that lead to the same value of g( X , Y ) and add their
probabilities. But to compute the mean of g( X , Y ) , we could just take a short-cut and use the
joint probability table as a computational tool.
Theorem 6.1
Let X and Y be random variables with joint probability function P and let g be a function of
X and Y. Then
E[g(X, Y)] = ∑_x ∑_y g(x, y) P(x, y).
In words, we find E [ g( X , Y )] by moving from cell to cell in the joint probability table
of X and Y, multiplying the value of g(X, Y) corresponding to each cell by the probability
appearing in that cell, and then adding these products for all cells.
Example 6.5.4
Consider Example 6.1.1 and let us illustrate the use of the last formula by recalculating the
mean of X + Y and XY. We find directly from Table 6.2, moving across the first row and then
the second,
E(X + Y) = 0(1/8) + 1(1/4) + 2(1/8) + 3(0) + 1(0) + 2(1/8) + 3(1/4) + 4(1/8) = 2,
as before.
There is of course no need to write down terms that have zero factors. Indeed, any cell in
the joint probability table for which g( x j , y k ) = 0 can be skipped in computing E[ g( X , Y )]. Hence
we skip these and find
E(XY) = 1(1/8) + 2(1/4) + 3(1/8) = 1.
Theorem 6.1 enables us to prove the following extremely important and often-used results.
E ( X + Y ) = E ( X ) + E (Y )
In words, the expected sum of two random variables is equal to the sum of their means.
E (aX + bY ) = aE ( X ) + bE (Y )
Theorem 6.2
Let n be any positive integer. If X₁, X₂, X₃, …, X_n are any random variables defined on a
sample space S, and if a₁, a₂, a₃, …, a_n are any constants, then

E(a₁X₁ + a₂X₂ + ⋯ + a_nX_n) = a₁E(X₁) + a₂E(X₂) + ⋯ + a_nE(X_n).

Proof:
The result is true for n = 1 and n = 2 by the formulas above. The theorem is proved by
mathematical induction as soon as we show that if the theorem is true for any positive integer,
say n = k, then it is also true for the next integer, n = k + 1. Let us therefore assume that the
statement is true for n = k. That is, letting Y = a₁X₁ + a₂X₂ + ⋯ + a_kX_k, we are assuming:

E(Y) = a₁E(X₁) + a₂E(X₂) + ⋯ + a_kE(X_k).
Theorem 6.3
Let X and Y be independent random variables defined on a sample space S. Then
E ( XY ) = E ( X )E (Y ).
In words, the mean of the product of two independent random variables is equal to the
product of their means.
Proof
The proof needs you to recall that if two random variables are independent,
P(X, Y) = P(X)P(Y).
It is very important to note that the converse of Theorem 6.3 is false. As the following
example shows, it is possible to find this last result, (E(XY) = E(X)E(Y)), to be true for random
variables that are dependent.
Example 6.5.5
Suppose X has probability table
x −1 0 1
P(X = x) 1/4 1/2 1/4
Let Y = X 2 . Then X and Y are surely dependent, since the value of X determines the value
of Y. This dependence is obvious from the joint probability of X and Y, as given below
y \x −1 0 1
0 0 1/2 0
1 1/4 0 1/4
E ( XY ) = 0; E ( X ) = 0; E (Y ) = 1 / 2. So E ( XY ) = E ( X )E (Y ) but the two random variables are
not independent. We know that because, for example,
P(X = 0, Y = 0) = 1/2;
P(X = 0) = 1/2; P(Y = 0) = 1/2.
So P(X = 0)P(Y = 0) = 1/4.
Obviously, P ( X = 0, Y = 0) ≠ P ( X = 0)P (Y = 0), so the two random variables are not inde-
pendent, and yet
E ( XY ) = E ( X )E (Y ).
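Example 6.5.5 can be replayed in code; a Python sketch (ours) confirming that the product rule holds while independence fails:

```python
from fractions import Fraction as F

# X takes -1, 0, 1 with probabilities 1/4, 1/2, 1/4 and Y = X**2:
# E(XY) = E(X)E(Y) even though X and Y are dependent.
pX = {-1: F(1, 4), 0: F(1, 2), 1: F(1, 4)}
eX  = sum(x * p for x, p in pX.items())
eY  = sum(x**2 * p for x, p in pX.items())      # Y = X**2
eXY = sum(x**3 * p for x, p in pX.items())      # XY = X * X**2 = X**3
p_joint_00 = pX[0]                              # P(X = 0, Y = 0) = P(X = 0)
pY0 = pX[0]                                     # Y = 0 exactly when X = 0
print(eXY == eX * eY, p_joint_00 == pX[0] * pY0)  # True False
```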
6.5.1 Exercises
Exercise 1. In the example of the planning family, suppose that

C = X² + Y².

Find the following expected values: (a) E[(X − 2)(Y − 2)]; (b) E[(X − 2)²]; (c) E[(4X + 2Y)].
The covariance is used to measure how two variables X and Y vary together, using the familiar
concept of E[g(X, Y)], where g(X, Y) = (X − μ_X)(Y − μ_Y). To measure how two variables vary
together, we start with the deviations, multiply them, and then take the expectation.
When two variables are independent, the covariance is 0, but the reverse is not true in general.
A convenient short-cut formula is

σ_{X,Y} = Cov(X, Y) = E(XY) − μ_X μ_Y.
When large values of X and Y occur together, the deviations are both positive, and so their
product is positive. Similarly, when small values of X and Y occur together, both deviations are
negative, so their product is positive. If most of the joint probabilities were in this situation,
then the covariance would be positive and would summarize the positive relation between
the two variables.
When one deviation is positive and the other is negative, the calculated product is negative.
If high joint probabilities fall in those cases, the covariance would be negative.
The correlation coefficient, a number between −1 and 1, takes care of eliminating the units,
thus giving us a unitless measure of the direction and strength of a linear relation between
the two random variables. See Figure 6.2.

ρ = σ_{X,Y} / (σ_X σ_Y).

This expression neutralizes any change in the scale of X and Y. Correlation is independent
of the scale in which either X or Y is measured. Correlation is also always bounded:

−1 ≤ ρ ≤ +1.

Figure 6.2 Scatterplots illustrating a linear negative relation (ρ near −1), no linear relation
(ρ near 0), and a linear positive relation (ρ near +1).
Example 6.6.1

x\y	0	1	2	3
0	(0 − 1/2)(0 − 3/2)(1/8)	(0 − 1/2)(1 − 3/2)(2/8)	(0 − 1/2)(2 − 3/2)(1/8)	0
1	0	(1 − 1/2)(1 − 3/2)(1/8)	(1 − 1/2)(2 − 3/2)(2/8)	(1 − 1/2)(3 − 3/2)(1/8)

Let X and Y be random variables with joint probability as given in Table 6.2. In Example 6.2.1, we found that µx = 1/2 and µy = 3/2. In Example 6.5.3, we saw that E(XY) = 1. By the short-cut formula,

σX,Y = Cov(X, Y) = E(XY) − µx µy = 1 − (1/2)(3/2) = 1/4.

Notice that we could obtain the same number for the value of the covariance if we add the values of the cells given in the table at the top of this example. That is,

σX,Y = Cov(X, Y) = ∑x ∑y (X − E(X))(Y − E(Y)) P(X = x, Y = y) = 1/4.
The correlation coefficient of 0.57735027 means that there is a positive linear association between the random variables X and Y; it is neither very strong nor very weak.
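The computations of Example 6.6.1 can be checked with a few lines of Python. This is a sketch added for verification, not part of the original example; the joint probabilities are transcribed from Table 6.2.

```python
# Joint pmf of X and Y from Table 6.2: rows x = 0, 1; columns y = 0..3.
p = {(0, 0): 1/8, (0, 1): 2/8, (0, 2): 1/8, (0, 3): 0,
     (1, 0): 0,   (1, 1): 1/8, (1, 2): 2/8, (1, 3): 1/8}

ex  = sum(x * pr for (x, y), pr in p.items())        # E(X)  = 1/2
ey  = sum(y * pr for (x, y), pr in p.items())        # E(Y)  = 3/2
exy = sum(x * y * pr for (x, y), pr in p.items())    # E(XY) = 1
cov = exy - ex * ey                                  # 1 - 3/4 = 1/4

vx = sum((x - ex) ** 2 * pr for (x, y), pr in p.items())  # Var(X) = 1/4
vy = sum((y - ey) ** 2 * pr for (x, y), pr in p.items())  # Var(Y) = 3/4
rho = cov / (vx ** 0.5 * vy ** 0.5)                       # 1/sqrt(3) ~ 0.5774
print(cov, rho)
```

The printed correlation, 1/√3 ≈ 0.57735, matches the value quoted in the text.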
6.6.3 Exercises
Exercise 1. For the joint distribution in Example 6.5.1, calculate the covariance and correlation
using the definition and the short-cut formula. Is there a strong positive relation between
the number of kids that go to college and how long it takes to complete the degree?
Exercise 2. Let X be a random variable with the following probability mass function:
x –2 –1 1 2
P(X = x) 1/4 1/4 1/4 1/4
(i) Find the joint probability mass function of X and Y. (ii) Determine E ( XY ) and the value
of the correlation. (iii) Are the variables independent?
Now that we know the concept of covariance, we can study an important new result concerning the variance of the sum of two random variables. Consider the following function of two random variables:

g(X, Y) = X + Y.

Its variance is

Var(X + Y) = E[((X + Y) − E(X + Y))²]
= ∑x ∑y ((X + Y) − E(X + Y))² P(X = x, Y = y)
= ∑x ∑y ((X − µx) + (Y − µy))² P(X = x, Y = y)
= ∑x ∑y ((X − µx)² + (Y − µy)² + 2(X − µx)(Y − µy)) P(X = x, Y = y).

By bringing the summation operators into each term, we obtain the desired result:

Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y).
Example 6.7.1
Let’s revisit the sum of the rolls of two dice. What would be the expected sum and the variance?

E(X + Y) = E(X) + E(Y) = 3.5 + 3.5 = 7,
Var(X + Y) = Var(X) + Var(Y) = 35/12 + 35/12 = 35/6,

and we can see that the formulas are satisfied, knowing that the correlation between the rolls of two dice is 0, so the covariance term vanishes.
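The dice example can be verified by brute force, enumerating all 36 equally likely outcomes. A minimal sketch:

```python
from itertools import product

# Enumerate the 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))
es = sum(x + y for x, y in outcomes) / 36                # E(X+Y) = 7
var_s = sum((x + y - es) ** 2 for x, y in outcomes) / 36 # Var(X+Y)

# Variance of a single die: E(X^2) - E(X)^2 = 91/6 - 49/4 = 35/12
var_one = sum(x * x for x in range(1, 7)) / 6 - 3.5 ** 2
print(es, var_s, 2 * var_one)  # Var(X+Y) equals Var(X) + Var(Y) = 35/6
```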
6.7.1 Exercises
Exercise 1. Complete the steps needed to arrive from

Var(X + Y) = E[((X + Y) − E(X + Y))²]
= ∑x ∑y ((X + Y) − E(X + Y))² P(X = x, Y = y)
= ∑x ∑y ((X − µx) + (Y − µy))² P(X = x, Y = y)
= ∑x ∑y ((X − µx)² + (Y − µy)² + 2(X − µx)(Y − µy)) P(X = x, Y = y)

to

Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y).

Exercise 2. Show that, for constants a and b,

E(aX + bY) = aE(X) + bE(Y), and
Var(aX + bY) = a²Var(X) + b²Var(Y) + 2abCov(X, Y).

Exercise 3. Consider the cost

C = 0.04(X + Y),

where C is cost, X is the weight of the machine from plant A, and Y is the weight of the machine from plant B. Find the expected cost and the variance of the cost.

Exercise 4. Show that, for a constant a,

Cov(a + X, Y) = Cov(X, Y).
Australia’s Highway 1 is considered to be the longest national highway in the world at over
14,500 km or 9,000 miles and runs almost the entire way around the continent. There are
16 major intersections of this highway and therefore there are 16 major exit ramps. Suppose
the average number of cars per minute in each of these major ramps is 10. What is the joint
probability mass function of the number of cars for all ramps?
We denote by Xi, i = 1, 2, …, 16, the number of cars exiting each of the 16 ramps. Assuming that each Xi is Poisson with parameter λ = 10, and that the ramps are independent, we can compute the joint probability mass function as follows:

P(X1, X2, …, X16) = P(X1)P(X2)⋯P(X16) = 10^(x1 + x2 + ⋯ + x16) e^(−160) / (x1! x2! ⋯ x16!).
Using independence (a result we prove in this book only for two random variables, although it holds in general), we could show that

P(X1 < 2, X2 < 2, …, X16 < 2) = P(X1 < 2)P(X2 < 2)⋯P(X16 < 2)
= (10⁰e^(−10)/0! + 10¹e^(−10)/1!)¹⁶ = (e^(−10)(1 + 10))¹⁶.
The job of the statistician is very different from that of the probabilist. In the problem discussed above, a typical statistical problem would start not knowing the value of the parameter λ. The joint distribution above would then be viewed not as a function of the random variables, but rather as a function of the parameter λ. The function is then called the “likelihood function.” To estimate λ, the statistician maximizes the likelihood with respect to λ. The result is called the “maximum likelihood estimator.” The statistician then uses the rules for expectations and variances learned in probability to determine the properties of the estimator and its distribution. Decisions about whether the estimate is accurate, close to the true λ, are made based on those properties; given the uncertainty, probability theory plays a very important role in that decision.
Another typical problem encountered by statisticians is, for example, estimating the average number of customers appearing in a store during the first hour of the day. Why? Say a struggling store owner wants to manage personnel better; perhaps it is not necessary to have so many clerks during the first hour. It is very common to contract a statistician to help make that decision. The statistician will ask the store owner to bring records of the number of people entering during the first hour on a number of randomly selected days, say 250 days.

The statistician then will assume a model for

Y = the number of customers entering the store in the first hour of a random day.

The statistician assumes that Y has the same distribution on every one of the n = 250 days, and that the days are independent, so the joint distribution of all the observed numbers of customers, Y1, Y2, …, Yn, is:
P(y1, y2, …, yn) = ∏_{i=1}^{n} [λ^(yi) e^(−λ) / yi!].
The statistician then will design methods to estimate λ, which is the only unknown in that equation. Probability has contributed the model assumption and the procedure to find the joint distribution, since independence implies that you may multiply the marginals. But that is it. The statistician will provide the store owner with the best estimate of the average number of customers, and also the standard error of the estimate, so that the uncertainty can be taken into account in decision making.
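The statistician’s side of this story can be sketched in a few lines. For a Poisson model, the maximum likelihood estimate of λ is the sample mean, with standard error √(λ̂/n). The true λ = 7 below is a made-up value used only to generate illustrative data.

```python
import math
import random

random.seed(1)
true_lam, n = 7.0, 250  # hypothetical true rate and sample size

def poisson(lam):
    """Draw one Poisson variate (Knuth's multiplication method)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

y = [poisson(true_lam) for _ in range(n)]    # 250 simulated daily counts

lam_hat = sum(y) / n             # maximum likelihood estimate of lambda
se = math.sqrt(lam_hat / n)      # standard error of the estimate
print(lam_hat, se)
```

With 250 days of data, the estimate lands close to the true rate, and the standard error quantifies the remaining uncertainty.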
In general, if there are n objects that fall in one and only one of k categories, and we denote by Xi, i = 1, …, k, the number of objects in category i, and by pi, i = 1, …, k, the probability of an object being in category i, then

P(X1 = x1, …, Xk = xk) = [n! / (x1! ⋯ xk!)] (p1)^(x1) ⋯ (pk)^(xk).
In a random sample of 20 people, what is the probability that 2 are 65 years or older, 5 are
55–64 years old, 6 are 25–54 years old, 4 are 15–24 years old and 3 are 0–14 years old?
https://www.indexmundi.com/ecuador/demographics_profile.htm
Exercise 2. With the recent emphasis on solar energy, solar radiation has been carefully monitored at various sites in Florida. Among typical July days in Tampa, 30% have total radiation of at most 5 calories, 60% have total radiation of at most 6 calories, and 100% have
total radiation of at most 8 calories. A solar collector for a hot water system is to be run for
6 days. Find the probability that 3 days will produce no more than 5 calories each, 1 day will
produce between 5 and 6 calories, and 2 days will produce between 6 and 8 calories. What
assumptions must be true for your answer to be correct?
Question 1. A diagnostic test for the presence of a disease has two possible outcomes: 1 for
disease present and 0 for disease not present. Let X denote the disease state of a patient,
and let Y denote the outcome of the diagnostic test. The joint probability mass function of X and Y is given by
P ( X = 1, Y = 1) = 0.125.
Calculate the variance of the outcome of the diagnostic test for those with the disease.
a. 0.13
b. 0.15
c. 0.2
d. 0.51
e. 0.71
Question 2. The joint probability mass function of random variables X and Y is given as follows:

y\x	1	2
1	1/8	2/8
2	2/8	1/8
3	1/8	0
4	0	1/8
The Var (Y | X = 2) is:
a. 2/9
b. 2/3
c. 2
d. 1.5
e. 1/3
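Question 2 can be worked through by conditioning on the X = 2 column and renormalizing; a short sketch of that computation:

```python
# Joint pmf of Question 2: columns x = 1, 2; rows y = 1..4.
p = {(1, 1): 1/8, (2, 1): 2/8,
     (1, 2): 2/8, (2, 2): 1/8,
     (1, 3): 1/8, (2, 3): 0,
     (1, 4): 0,   (2, 4): 1/8}

# Condition on X = 2: renormalize that column by P(X = 2) = 1/2.
px2 = sum(pr for (x, y), pr in p.items() if x == 2)
cond = {y: pr / px2 for (x, y), pr in p.items() if x == 2}

ey = sum(y * q for y, q in cond.items())               # E(Y | X = 2) = 2
var = sum((y - ey) ** 2 * q for y, q in cond.items())  # Var(Y | X = 2) = 1.5
print(var)
```

The conditional distribution puts probability 1/2, 1/4, 0, 1/4 on y = 1, 2, 3, 4, giving Var(Y | X = 2) = 1.5, which matches option (d).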
Question 3. For the joint probability mass function of question 2, calculate Cov (2 + 3X , 4 − 2Y ).
a. 3
b. 6
c. 3/2
d. 0
Question 4. The joint probability mass function of two random variables X and Y is as follows:
The probability
P (1 £ X £ 2, Y £ 2)
is:
a. 0.6
b. 0.65
c. 0.56
d. 1
Question 5. Consider the planning of undergraduate education for all three kids that a young
couple plans to have. The costs depend on the number of kids that go to college (X) and
the number of years each takes to complete the degree (Y). We denote the cost by C. Thus
C = g( X , Y ).
x\y 3 4 5
1 0.05 0.05 0
2 0.07 0.3 0.03
3 0 0.4 0.1
C = g(X, Y) = 10X + 20Y.

What is the expected cost, E(C)?
a. 21
b. 1506
c. 104.2
d. 193
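The expected cost in Question 5 is a direct sum over the joint table; this sketch transcribes the probabilities from the table above.

```python
# Joint pmf of (X, Y) from Question 5; cost C = 10X + 20Y.
p = {(1, 3): 0.05, (1, 4): 0.05, (1, 5): 0.0,
     (2, 3): 0.07, (2, 4): 0.30, (2, 5): 0.03,
     (3, 3): 0.0,  (3, 4): 0.40, (3, 5): 0.10}

ec = sum((10 * x + 20 * y) * pr for (x, y), pr in p.items())
print(ec)  # expected cost: 104.2
```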
Question 6. Consider the planning of undergraduate education for all three kids that a young
couple plans to have. Use the information we used in Question 5. What is the variance of
the cost C?
a. 419.78
b. −31.004
c. −661.64
d. 143.96
Question 7. In the joint probability mass function of Question 5, calculate the probability that
X is larger than or equal to 3 and Y is larger than or equal to 4.
a. 1/2
b. 1/4
c. 3/7
d. 1/7
Question 8. Suppose that 25% of the people attending a popular gym live within 5 miles of
the gym, 55% live between 5 and 10 miles from the gym and 20% live more than 10 miles
from the gym. Suppose that 30 people are selected at random from the members of the gym.
What is the probability that 10 live within 5 miles, 10 live between 5 and 10 miles, and the
other 10 live more than 10 miles from the gym?
a. 0.001373087
b. 0.33145
c. 0.7814
d. 0.999
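Question 8 is a direct application of the multinomial probability mass function introduced in this section; a stdlib-only sketch:

```python
from math import factorial

n, counts, probs = 30, [10, 10, 10], [0.25, 0.55, 0.20]

# Multinomial coefficient: n! / (x1! x2! x3!), computed exactly.
coef = factorial(n)
for x in counts:
    coef //= factorial(x)

# pmf: coefficient times p1^x1 * p2^x2 * p3^x3
p = coef
for pi, xi in zip(probs, counts):
    p *= pi ** xi
print(p)
```

The result, about 0.00137, matches option (a).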
Question 9. Consider the following joint probability mass function of X and Y.
x\y 0 1 2 3
0 1/8 2/8 1/8 0
1 0 1/8 2/8 1/8
The E ( XY ) is
a. 0.5
b. 0.007
c. 1
d. 0.11
Question 10. In the same joint probability mass function of Question 9, what is the Cov ( X , Y )?
a. 0.25
b. 0.577
c. 0.75
d. 0.214
Exercise 2. Suppose that 15% of the families in a certain community have no car, 20% have
1 car, 35% have 2, and 30% have 3. Suppose, further, that in each family, each car is equally
likely (independently) to be a foreign or a domestic car. Let F be the number of foreign cars
and D the number of domestic cars. (i) Find the joint probability mass function of F and D,
showing your work. (ii) Write the marginal distribution for the number of foreign cars and
find expected number of foreign cars per family and the standard deviation. (iii) Write the
marginal distribution for the number of domestic cars and find expected number of domestic
cars per family and the standard deviation.
x 8 10 11
P(X = x) 1/4 1/4 1/2
y 4 6
P(Y = y) 1/2 1/2
Exercise 5. Daily sales records for a car dealership show that it will sell 0, 1, 2, or 3 cars, with
probabilities as listed
X = Number of sales 0 1 2 3
P(X = x) 0.5 0.3 0.15 0.05
(i) Find the probability distribution for the total number of sales in a 2-day period assuming
that the sales are independent from day to day. (ii) Find the probability that two or more
sales are made in the next two days.
Exercise 6. If individuals always married significant others that are 2 years younger than
themselves, what would be the correlation between the age of the two individuals in the
married couple?
Exercise 7. (Ibe 2014, Example 5.2) The joint probability mass function of two random vari-
ables X and Y is given by
P ( X = x , Y = y ) = k (2x + y ), x = 1, 2; y = 1, 2, 3
where k is a constant. (i) What is the value of k? (ii) Find the marginal probability mass
functions of X and Y. (iii) Are X and Y independent?
Exercise 8. (Bain 2014, Example 5.7) For two random variables, X and Y, we know the
following:
P(X ≤ 1, Y ≤ 1) = 1/8; P(X ≤ 1, Y ≤ 2) = 5/8; P(X ≤ 2, Y ≤ 1) = 1/4; P(X ≤ 2, Y ≤ 2) = 1.
(i) Determine the joint probability mass function of X and Y. (ii) Determine the marginal probability mass function of X. (iii) Determine the marginal probability mass function of Y.
Exercise 9. (Based on Page (1989, 130).) One 4-ohm resistor and two 8-ohm resistors are in a
box. A resistor is randomly drawn from the box and inspected; then it is replaced in the box
and a second is drawn. We will denote by X the resistance of the first one drawn from the
box and Y the resistance of the second. Since this is sampling with replacement, the result
of one test has no influence on the result of the other test. (i) Construct the joint probability
mass function of X and Y. (ii) What is the probability that X = Y ?
Bain, Lee J., and Max Engelhardt. 1987. Introduction to Probability and Mathematical Statistics.
Duxbury Press.
Goldberg, Samuel. 1960. Probability: An Introduction. New York: Dover Publications, Inc.
Ibe, Oliver C. 2014. Fundamentals of Applied Probability and Random Processes. 2nd edition.
Elsevier.
Page, Lavon B. 1989. Probability for Engineering with Applications to Reliability. Computer
Science Press.
Watson, Jane M. 2011. “Cheating partners, conditional probability and independence.”
Teaching Statistics. An International Journal for Teachers 33, no. 3 (Autumn): 66–70.
Wonnacott, Thomas H., and Ronald J. Wonnacott. 1990. Introductory Statistics for Business and Economics. 4th edition. Wiley and Sons.
Probability in Continuous
Sample Spaces
integration of functions of one variable. This makes this part of the book accessible to readers
with a good background in differential and integral calculus of one and several variables.
Supplementary sidebars with reviews of some of the mathematics, and references to resources for refreshing your calculus, are provided. References will be made to the sections of Part I where
the concepts were first introduced. Numerous references to authors who have written about
probability theory at the accessible level of this part can be found throughout the chapters.
As in the discrete case, the reader should be aware that notation varies by author and vocabulary for the same concept differs across disciplines, but the probability theory may be exactly the same in all of them. Probability theory is not a bag of different tricks to solve problems, but a very condensed set of a few tools that solves a bag of very different and contextually unrelated problems.
Pick up a ruler similar to the one drawn below, or larger, and ask 10 different persons to measure the length of a string like the one in Figure 7.2. Have each write the number on paper, keeping the record confidential. They should all use the same system of measurement: the decimal system or United States customary units.
Figure 7.1 Ruler.
Copyright © 2011 Depositphotos/karenr.
Figure 7.2 String.
Copyright © 2013 Depositphotos/Irina1977.
When all measurements are done, unveil the measurements made. Enter them in
this table, one measurement obtained by the 1st person, another by the 2nd person,
and so on.
1st 2nd 3rd 4th 5th 6th 7th 8th 9th 10th
What do you expect to happen? Is that what happened? What is the exact length
with this measurement instrument, or would you require an instrument with greater
precision? Can you measure the exact length of the line?
7.1 Coping with the dilemmas of continuous sample spaces
Up to this point, we have restricted attention to discrete sample spaces and discrete random
variables. But life is not always discrete. There are many problems in which a random variable
can have any value in an interval of the Real Line, meaning that there are an infinite number
of possible outcomes. For example, concentration of pollutant in some media is a continuous
random variable of interest in environmental monitoring. The amount by which a radioactive
particle decays is also a continuous random variable (Harris, 2014); the decay can take place
at any time and is caused by a nuclear reaction whose behavior is only predictable statisti-
cally. As Harris explains, it makes no sense to associate a finite nonzero probability with each
instant in time (there are an infinite number of such instants). It is more useful to recognize
that for a short time interval Δt the decay probability will have some value proportional to
Δt, and that the relevant quantity is the decay probability per unit time (computed in the
limit of small Δt). Since this probability of decay per unit time may differ at different times,
a general description for the decay within a time interval dt at time t must be of the form f(t) dt, where f can be called a density function. Then the overall probability of decay during the time interval from t1 to t2 will be given by

∫_{t1}^{t2} f(t) dt.
In other words, to deal with continuous random variables, we have to shift our thinking
from specific outcomes to intervals of outcomes in the real line. Intervals break up a con-
tinuous scale into a set of discrete chunks, and we use these discrete chunks the way we
previously used discrete outcomes. Thus, instead of thinking of events such as {X = 2}, we
focus on events such as {1 ≤ X ≤ 2}, and ask: what is the probability that a random variable
will lie within this range of values instead of at some particular point? (Denny and Gaines
(2000, 69))
Harris presents an interesting example that illustrates this point.
Example
(This example is from Harris (2014, 684).) “A particle of unit mass moves (assuming classical mechanics) subject to the potential V = x²/2, where x is the particle position, with a total energy E = 1/2 (dimensionless units). These conditions correspond to motion with kinetic energy T = (1 − x²)/2, so that when the particle is at x its velocity can be found from v²/2 = (1 − x²)/2, leading to v(x) = ±√(1 − x²). We see from this form for v(x) that the particle will move back and forth between turning points at x = ±1, and that it will move fastest at x = 0 and momentarily become stationary at x = ±1. The probability density for the particle’s position will be

f(x) = 1/(π√(1 − x²)), −1 < x < 1.
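This density has a closed-form antiderivative, so its probabilities can be checked directly. A minimal sketch, using the fact that (arcsin(x) + π/2)/π is an antiderivative of f on (−1, 1):

```python
import math

# cdf of the position density f(x) = 1/(pi * sqrt(1 - x^2)) on (-1, 1)
def F(x):
    return (math.asin(x) + math.pi / 2) / math.pi

total = F(1) - F(-1)        # total probability: 1
middle = F(0.5) - F(-0.5)   # P(-0.5 < X < 0.5) = 1/3
print(total, middle)
```

Note that the central interval (−0.5, 0.5) carries only 1/3 of the probability even though it is half the range: the particle is more likely to be found near the turning points, where it moves slowly.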
The reader may be tempted to deduce from the discussion so far that understanding this second part of the book requires knowing physics and every other discipline. But that is not the case. This example was presented here to illustrate how the new entities we are about to learn in this chapter had their genesis in applications.
Each area of science has its own problems and its corresponding probabilistic solutions to
those problems. There are many different density functions, but the reader does not need to
know them all. In this chapter, we hope to convey the methodology needed to handle all of
them once they have been given to us, and we study the ones that are most widely used in
a wide array of fields, i.e., the ones that have wide applicability as models for a wide range
of random phenomena. We will explore what we can do with them, and how we can use the
concepts learned in Part I.
For the purposes of this chapter, the reader needs only to keep in mind that the concept of a continuously distributed random variable is an idealization that allows calculus to be used as a technical tool. This gives models for chance phenomena involving continuous random variables (Pitman (2005, 259)).
In fact, once we accept the convention used to measure probability in sample spaces for
continuous random variables, the methodology will be the same as in the discrete case,
but, instead of using summation operators in the calculations, we will use integrals. And
regardless of which area of science we move in, we will be able to use the methodology,
knowing that only the complexity of the context, and the mathematical complexity of the
probability model can stop us.
The union of a countable collection of events consists of all simple outcomes s of S that are in at least one of the events:

∪_{i=1}^{∞} Ai = {s ∈ S | s ∈ Ai for some i}.

The intersection consists of all simple outcomes s of S that are in all of the events:

∩_{i=1}^{∞} Ai = {s ∈ S | s ∈ Ai for all i}.

Example 7.1.1
Consider the collection of events Ai = (i/(i + 1), 1], i = 0, 1, 2, …, in the sample space S = (0, 1]. Then

∪_{i=0}^{∞} Ai = ∪_{i=0}^{∞} (i/(i + 1), 1] = (0, 1] = S,
∩_{i=0}^{∞} Ai = ∩_{i=0}^{∞} (i/(i + 1), 1] = {1}.
Example 7.1.2
Consider the collection of events Ai = [i, i + 1), i = 0, 1, 2, … . Do these events form a partition of S = [0, ∞)? (See Definition 2.5.2 in Chapter 2 for the definition of a partition.)

Because the collection of events satisfies the two conditions (the events are mutually exclusive and their union is S), this collection of events is a partition of the sample space S defined above.
Some physical phenomena such as the exact time that a train arrives at a specified stop, the
lifetime of an atom, the stress of a beam, weight, height, the distance to the moon, and many
other measurements are values in the real line. The random variable representing them is
a continuous random variable. Probabilities for those random variables are areas under a
function that we call the density function. We then define this new entity.
Definition 7.2.1
Let X be a random variable. Let a be the smallest possible value of X (which could be −∞) and b the largest (which could be +∞); any real number in between is allowed. If there is a function f(x) such that

• f(x) ≥ 0, a ≤ x ≤ b,
• ∫_a^b f(x) dx = 1,
• and, for any event B = {k1 ≤ X ≤ k2} with [k1, k2] ⊆ [a, b],

P(B) = ∫_{k1}^{k2} f(x) dx,

then the function f is called the probability density function (pdf) of the random variable X, a ≤ X ≤ b, and we say that X is a continuous random variable. Because single points carry no probability, P(a < X ≤ b) = P(a ≤ X < b) = P(a < X < b).
Definition 7.2.2
For the continuous random variable X in [a, b] with density function f(x), the expected value is

µX = E(X) = ∫_a^b x f(x) dx,

the kth moment is

E(X^k) = ∫_a^b x^k f(x) dx,

and in general,

E(g(X)) = ∫_a^b g(x) f(x) dx.

The variance is

σ² = E[(X − µX)²] = ∫_a^b (x − µ)² f(x) dx = E(X²) − µ².

The cumulative distribution function is

F(x) = P(X ≤ x) = ∫_a^x f(t) dt,

where F is the notation we use to denote the cumulative distribution function (cdf).
We may use the cumulative distribution function to obtain the density function of X:

f(x) = dF(x)/dx.

The cumulative distribution function can also be used to compute probabilities:

P(a ≤ X ≤ b) = F(b) − F(a).

The moment-generating function is

MX(t) = E(e^(tX)) = ∫_a^b e^(tx) f(x) dx.

The reader does not need to delve deeper into how we made the transition from ∑ to ∫. But if there is interest, sources mentioned in Section 3.1 of this book discuss in detail how we ended up here.
In this section, the reader will see a full example without any context, to practice computing the usual quantities for a continuous random variable. We will use a question-and-answer approach.
Example 7.2.1
Let X be a continuous random variable with density function

f(x) = 3x², 0 ≤ x ≤ 1.

This formula means that X has range in the real-line interval [0, 1]. The pdf of X is positive only on the interval [0, 1]; outside that interval, f(x) is 0.
We will illustrate next how we would compute all those relevant quantities in Definition
7.2.2 with this example. At the end of the computations you will be able to see a graphical
display of some of the results.
Q: Is f(x) positive for all x and is the area under f(x) equal to 1? Are axioms of
probability satisfied?
We can prove that f(x) is a density function for the random variable X because we can
show that the area under the curve in the range of X is 1.
∫_0^1 3x² dx = [3x³/3]_0^1 = [x³]_0^1 = 1.
Notice that we do not need to take the integral from −∞ to ∞, because outside the
interval (0,1), the f(x) is 0, there is no area to account for.
Certainly, f(x) is positive for all values of X in the range of X.
Thus, f(x) is a legitimate density function for continuous random variable X. Axioms
of probability are satisfied.
The cumulative distribution function is F(x) = ∫_0^x 3t² dt = x³, 0 ≤ x ≤ 1. We can easily check that the derivative of F(x) with respect to x is indeed the density function of X:

F′(x) = 3x² = f(x).
Box 7.1
The cumulative distribution function of a continuous random variable X with density f(x) on [a, b] is

F(k) = P(X ≤ k) = ∫_a^k f(x) dx.

This function has the following properties:

• dF(x)/dx = f(x)
• If we know the cumulative distribution function, then

P(m < X < n) = F(n) − F(m).
Box 7.2
The 100qth percentile of X is the value c such that

F(c) = P(X ≤ c) = ∫_a^c f(x) dx = q.

For example, the 90th percentile is the value c such that

F(c) = P(X ≤ c) = ∫_a^c f(x) dx = 0.9.
When a distribution is skewed, the median is a better measure of average than the ex-
pected value, and the interquartile range gives a better idea of the spread of the distribution.
This is so because the expected value is too influenced by the tail values of the distribution,
and therefore, the variance, which depends also on the expected value, is influenced as well.
We can compute the probability of a continuous random variable X in any interval con-
tained in the range of X. For example,
P(0.5 < X < 0.8) = ∫_{0.5}^{0.8} 3x² dx = [x³]_{0.5}^{0.8} = 0.8³ − 0.5³ = 0.387.
To compute percentiles, you set the cumulative distribution function equal to the probability desired. For example, to find the 70th percentile, make

F(x) = 0.7,

that is, x³ = 0.7, so x = 0.7^(1/3) ≈ 0.888. (The median and quartiles are obtained the same way and are left as an exercise.)
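All of the quantities computed in Example 7.2.1 can be checked with a short script; the closed forms below (F(x) = x³, E(X) = 3/4, Var(X) = 3/80) follow from the integrals worked out above.

```python
# For f(x) = 3x^2 on [0, 1], the cdf is F(x) = x^3.
F = lambda x: x ** 3

p = F(0.8) - F(0.5)       # P(0.5 < X < 0.8) = 0.387
pct70 = 0.7 ** (1 / 3)    # solve x^3 = 0.7 -> x ~ 0.888
mean = 3 / 4              # E(X) = int_0^1 x * 3x^2 dx = 3/4
var = 3 / 5 - (3 / 4) ** 2  # E(X^2) - E(X)^2 = 3/80
print(p, pct70, mean, var)
```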
Figure 7.3 Figures correspond to Example 7.2.1. Probabilities are areas under the density function. A percentile is a value of X corresponding to some cumulative probability, and cumulative probabilities viewed in a cdf plot are values on the vertical axis of the plot.
7.2.1 Exercises
Exercise 1. (Based on Keeler and Steinhorst (2001).) An appliance repair firm has recorded the time it took to complete service calls during the past year. They found that the time (Y), in hours, varies according to the triangular density shown in Figure 7.4.

Figure 7.4 Triangular density of Y = hours, over the interval from 0 to 4.
(i) Find the height of the triangle. That is, find the height where the triangle crosses
the vertical axis.
(ii) If we consider a randomly selected call,
(a) Find the probability that the service call will take at most an hour.
(b) Find P( Y > 2 hours).
(iii) Is µ less than 2 or greater than 2? Explain.
Exercise 2. Let X be the time that it takes to drive between point A and point B during the
afternoon rush hour period on highway 4005. The density function of X is
f(x) = (1/2)x, 0 ≤ x ≤ 2.
(i) Calculate the value of the 70th percentile. (ii) Calculate the interquartile range. (iii) Cal-
culate P (0.5 £ X £ 1.5). (iv) Find the median. (v) Find the moment-generating function of X.
Exercise 3. The time it takes a seasoned runner to complete the Mountain High half-marathon
is a random variable X with cumulative distribution function
F(x) = (1/26)(2x² + x − 10), 2 ≤ x ≤ 4.
f ( x ) = 1.5(1 − x 2 ), 0 ≤ x ≤ 1.
(i) What is the expected landing point? (ii) What is the probability that the landing point
is before 0.4? (iii) Find the cumulative distribution function of X. (iv) What is the standard
deviation of X?
f ( x ) = a + bx 2 .
Exercise 7. The distribution of the amount of gravel (in tons) sold by a particular construction
supply company in a given week is a continuous random variable X with density function
f ( x ) = 1.5 (1 − x 2 ), 0 ≤ x ≤ 1
(i) How often does this company sell less than 0.4 tons per week? (ii) What is the variability
(in tons per week) of the amount sold each week? (iii) The company sells the gravel at a price
of $10,000 per ton. But keeping the gravel in storage during the week costs $2000 per ton.
On average, how much profit does the company make per week? (iv) The company keeps the
gravel at a storage place each week. Gravel not sold that week is thrown away. Since the
amount sold is a random variable, it doesn’t pay for the company to keep one ton stored all
the time. The company just keeps enough gravel to make sure that customers will have to
leave without gravel only 20% of the time. How much does the company store each week?
(v) Write down the cumulative distribution function of the random variable X and use it to
compute P (0.2 £ x £ 0.6). (vi) What is the probability that 2 out of 10 construction companies
will sell more than 0.4 tons per week?
Exercise 8. (This problem is based on Scheaffer (1995, 254, problem 5.29).) Daily total solar
radiation for a certain location in Florida during the month of October has the following
density function:
f(x) = (3/32)(x − 2)(6 − x), 2 ≤ x ≤ 6,
Exercise 10. (This exercise is based on Rice (2007).) Let X be the cosine of the angle at which
electrons are emitted in muon decay. X is a random variable with the following density function:
f(x) = (1 + αx)/2, −1 ≤ x ≤ 1,

where α is a constant that can take values −1 ≤ α ≤ 1. (i) Find the expected cosine of the
angle at which electrons are emitted as a function of α . (ii) Find the variance.
We said in Section 7.2, and when we studied discrete random variables in Chapter 5, that, in addition to the functions g(X) considered so far, we can find the expectation of any other function of X. For example, let f(x) be as in Example 7.2.1,

f(x) = 3x², 0 ≤ x ≤ 1,

and let Y = a + bX, where a and b are constants. What are the expected value and variance of Y? We use the definition of expectation and variance of a function of a random variable.
E(Y) = ∫_0^1 (a + bx) 3x² dx = a ∫_0^1 3x² dx + b ∫_0^1 x · 3x² dx = a + bE(X).

V(Y) = ∫_0^1 (a + bx − (a + bµX))² 3x² dx = b² ∫_0^1 (x − µX)² 3x² dx = b²Var(X).
How did we know earlier that we could compute the variance of X with the short-cut formula? For this example,

σX² = V(X) = ∫_0^1 (x − µ)² f(x) dx = ∫_0^1 (x − µ)² 3x² dx
= ∫_0^1 (x² + µ² − 2µx) 3x² dx
= ∫_0^1 x² · 3x² dx + µ² ∫_0^1 3x² dx − 2µ ∫_0^1 x · 3x² dx = E(X²) + µ²(1) − 2µ²
= E(X²) − µ².
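The moment integrals above can be approximated numerically as a sanity check; this sketch uses a simple midpoint rule rather than any particular integration library.

```python
# Midpoint-rule check of E(X), E(X^2), and the short-cut variance formula
# for the density f(x) = 3x^2 on [0, 1].
def integrate(g, a, b, n=100_000):
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

f = lambda x: 3 * x * x
mu = integrate(lambda x: x * f(x), 0, 1)        # E(X)   = 3/4
ex2 = integrate(lambda x: x * x * f(x), 0, 1)   # E(X^2) = 3/5
var = ex2 - mu ** 2                             # 3/80 = 0.0375
print(mu, ex2, var)
```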
7.3.1 Exercises
Exercise 1. Snowpack measurements on April 1st in a mountainous region have expected
value 17.7 inches and standard deviation 3 inches, based on measurements made in gauging
stations. In hydrology, there are lots of studies trying to measure the relation between spring
river discharge and snowpack. A study found that this relation is
Y = 1.04 + 1.03 X,
where X is snowpack depth on April 1st and Y is spring river discharge. Calculate the expected
river discharge and the standard deviation of the discharge.
Exercise 2. The proportion of time X during a 40-hour work week that a health worker spends
transporting blood samples from the clinic to the examination lab is a random variable with
density function
f ( x ) = 2x , 0 ≤ x ≤ 1.
The cost of transportation depends on the proportion of time used in transportation according to the following function:

Cost = 10 − 2X.
According to the Bureau of Labor Statistics of the United States, the median annual wage
for statisticians was $84,060 in May 2017. Overall employment of mathematicians and stat-
isticians is projected to grow 33 percent from 2016 to 2026, much faster than the average
for all occupations. Businesses will need these workers to analyze the increasing volume of
digital and electronic data.
https://www.bls.gov/ooh/math/mathematicians-and-statisticians.htm
Many statisticians choose to become consultants. When billing customers, they take into
account the time it takes to write the computer program they need to analyze data, the time
it takes to interpret the results, the time it takes to write the report for the customer, and the
time spent with the customer. That is a total of 4 different independent random variables, say X1, X2, X3, X4. Let’s assume that these random variables are identically distributed, each with mean µ and variance σ². What is the expected total amount of time spent on a project by a typical statistical consultant?
We can calculate the expected sum of the four times and the variance of the four times as follows:

E(S4) = E(∑_{i=1}^{4} Xi) = ∑_{i=1}^{4} E(Xi) = µ + µ + µ + µ = 4µ,

Var(S4) = Var(∑_{i=1}^{4} Xi) = ∑_{i=1}^{4} Var(Xi) = σ² + σ² + σ² + σ² = 4σ².
As we said in Section 5.5, sums of independent random variables are very important.
• Many credit cards offer rewards that consist of earning some amount of dollars for
every $100 dollars spent. Credit card companies know the distribution of the usual
expenses people make and can predict with a degree of certainty the expected total
amount of money back that a customer’s credit card will get.
• Research and Development (R&D) by corporations is an expensive multi-step process.
When companies engage in the process of inventing new processes and products they
must predict the cost at each step. The expected total cost will help the company
decide whether the R&D project is worth pursuing.
Let’s define the sum of n independent continuous random variables as follows:

Sn = X1 + X2 + … + Xn = ∑_{i=1}^{n} Xi,

where S is for sum and n is how many random variables we are adding. Then, without proving it, we claim that

E(Sn) = E(∑_{i=1}^{n} Xi) = ∑_{i=1}^{n} E(Xi) = µ + µ + … + µ = nµ,

Var(Sn) = Var(∑_{i=1}^{n} Xi) = ∑_{i=1}^{n} Var(Xi) = σ² + σ² + … + σ² = nσ².
Sums of independent and identically distributed random variables are of central importance
in both Probability and Statistics.
The proof of this result was done for two discrete random variables in Chapter 6.
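The claim E(Sn) = nµ and Var(Sn) = nσ² can be illustrated by simulation. The choice of Uniform(0, 1) summands below (µ = 1/2, σ² = 1/12) is an assumption made only for the sketch; any common distribution would do.

```python
import random

random.seed(0)
n, reps = 4, 200_000

# X_i ~ Uniform(0, 1): mu = 1/2, sigma^2 = 1/12, so for n = 4
# E(S_n) = 2 and Var(S_n) = 4/12 = 1/3.
sums = [sum(random.random() for _ in range(n)) for _ in range(reps)]

mean_s = sum(sums) / reps                             # ~ n * mu = 2
var_s = sum((s - mean_s) ** 2 for s in sums) / reps   # ~ n * sigma^2
print(mean_s, var_s)
```

With 200,000 replications, the simulated mean and variance land very close to nµ = 2 and nσ² = 1/3.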
7.4.1 Exercises
Exercise 1. Consider the density function of exercise 10 in Section 7.2.1. Consider the cosine
of the angle of 5 electrons, represented by 5 random variables X1, X2, X3, X4, X5. Of
interest is comparing the expected value and variance (as functions of α) of the following
functions of these random variables:
Y = 3 ( ∑_{i=1}^{5} Xi ) / 5

and

W = ( ∑_{i=1}^{5} Xi ) / 5
Scientists who deal with data had to come up with models that allow them to make prob-
abilistic predictions of random phenomena. Many phenomena have the same nature, and
hence can be modeled the same way, albeit with different model parameters. The models
described in this section have been found to be very useful in a wide range of applications.
It is impossible to go over all the probability models for all continuous random variables. We
emphasize in this section those that are widely adopted in many different areas of application.
We can go ahead and repeat the computations in 7.2 for each of them, or we could just use
what others have already proved and focus instead on the applications. Many web sites are
dedicated to describing these distributions' properties. However, a word of caution: they may
be using different parameterizations and notation, and that can be confusing for the begin-
ner. For this reason, we illustrate in this chapter how to obtain all those quantities ourselves.
There exist applets online that allow you to calculate probabilities with these
important distributions.
• You call a friend to ask for a ride. Your friend declares that it will take between 10 and
30 minutes to arrive to pick you up. What is your uncertainty in this scenario? What
uncertainty do we have regarding when the next earthquake will occur? How would
you model the uncertainty in these two scenarios? What kind of density function
would you use?
A continuous uniform random variable is used when the range of the random variable is known
but we have no other knowledge about possible values that the random variable can have
other than that the event may happen in any interval within the range. This random variable
is often used to model the random phase of a sinusoidal signal, in that a uniform distribution
of phase between 0 and 2π is frequently assumed. Also, in analog to digital conversion, this
random variable is used to describe the errors due to rounding off in the conversion process.
Definition 7.6.1
A random variable X is a uniform random variable on an interval (a, b) if its density f(x) is
constant on (a, b) and 0 elsewhere:

f(x) = 1/(b − a),  a ≤ x ≤ b
From this, we can derive all the relevant things that we have been focusing on. The
cumulative distribution function is

F(c) = P(X ≤ c) = (c − a)/(b − a),  a ≤ c ≤ b.

Percentiles are found by setting

F(c) = (c − a)/(b − a) = q

and solving for c. For example, the 90th percentile solves

F(c) = (c − a)/(b − a) = 0.9.
The expected value, variance, and moment generating function are

E(X) = (a + b)/2

Var(X) = (b − a)²/12

M(t) = (e^(tb) − e^(ta)) / (t(b − a))
The uniform distribution is also known as the “no knowledge” distribution.
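These formulas translate directly into code. The short sketch below (our own helper
functions, not from any library) computes the CDF, a percentile, the mean, and the variance
of a Uniform(a, b) random variable:

```python
def uniform_cdf(x, a, b):
    """F(x) = (x - a)/(b - a) on [a, b], clipped to 0 and 1 outside."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

def uniform_percentile(q, a, b):
    """Solve F(c) = (c - a)/(b - a) = q for c."""
    return a + q * (b - a)

def uniform_mean(a, b):
    return (a + b) / 2

def uniform_var(a, b):
    return (b - a) ** 2 / 12

print(uniform_percentile(0.9, 10, 30))  # 90th percentile of Uniform(10, 30): 28.0
print(uniform_mean(10, 30))             # 20.0
```

For instance, the 90th percentile of a Uniform(10, 30) random variable is 10 + 0.9(30 − 10) = 28.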
Example 7.6.1
A friend is to appear in front of your door to pick you up sometime within the next 10 to
30 minutes. You will be ready if this friend shows up within 5 minutes of 20 minutes after
you called. Find the probability that you will be ready, assuming that your friend arrives at a
random time within that interval. Let X be the time your friend shows up.
Assume that

f(x) = 1/(30 − 10),  10 < x < 30.

Then b = 30, a = 10, and

P(15 < X < 25) = (25 − 15)/(30 − 10) = 0.5
The uniform distribution is important in the study of Poisson processes. Given a Poisson rate
of appearance of an event in a unit interval, the way in which the event is scattered in the
interval is uniformly distributed. In other words, the location of the event is equally likely
to be at any point in the interval.
The uniform distribution is also very useful in Bayesian statistics modeling. It is often used
as an uninformative prior, when the researcher uses it to express the idea that there is no
prior knowledge about an event.
Example 7.6.2
A teacher teaches a one-hour class that starts at 10 a.m. and usually asks a clicker question
during the hour to assess whether students are following and are actively engaged in the lec-
ture discussion. The time at which the teacher asks the question varies and is never the same.
It depends on the lecture and the topic that the teacher thinks needs assessment throughout
the hour. Let x be the number of minutes after 10 a.m. at which the question is asked, and assume

f(x) = 1/(60 − 0),  0 ≤ x ≤ 60.

The probability that the question is asked between 10:20 and 10:40 is then

P(20 < x < 40) = (40 − 20)/60 = 1/3
Example 7.6.3
A commuter arrives at a train stop at 10:00 a.m., knowing that the train will arrive at some time
uniformly distributed between 10:00 a.m. and 10:30 a.m. If at 10:15 the train has not yet arrived,
what is the probability that the commuter will have to wait at least an additional 10 minutes?
This exercise is an illustration of the fact that the rules of probability learned in earlier
chapters will continue to be used with random variables, since random variables are repre-
senting just events in the sample space.
Let X be the random variable representing the waiting time, in minutes after 10:00 a.m.

f(x) = 1/(30 − 0),  0 ≤ x ≤ 30

The commuter will have to wait at least 10 additional minutes if the train arrives at 10:25
or later, so the probability is

P(X > 25 | X > 15) = P((X > 25) ∩ (X > 15)) / P(X > 15) = P(X > 25)/P(X > 15)
= (5/30)/(15/30) = 1/3
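The same conditional probability can be checked by simulation; the sketch below (our code,
with hypothetical names) estimates P(X > 25 | X > 15) for X ~ Uniform(0, 30) by counting:

```python
import random

def estimate_conditional(trials=200_000, seed=7):
    """Estimate P(X > 25 | X > 15) for X ~ Uniform(0, 30) by counting."""
    rng = random.Random(seed)
    beyond_15 = beyond_25 = 0
    for _ in range(trials):
        x = rng.uniform(0, 30)
        if x > 15:
            beyond_15 += 1
            if x > 25:
                beyond_25 += 1
    return beyond_25 / beyond_15

print(estimate_conditional())  # close to 1/3
```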
7.6.1 Exercises
Exercise 1. Food trucks along a stretch of a mountain pass in Italy are prohibited. The views
from that pass are beautiful, and lots of people stop there. Consequently, food trucks stop at
random locations so as not to be expected by the police. They stop randomly along a two-
kilometer part of the road that stretches across the best viewpoints and has as its midpoint
the highest point of the mountain. What is the probability that a tourist will find the straw-
berry stand of a food truck within three meters of the top of the mountain?
Exercise 2. The failure of a circuit board interrupts work by a computer system until a new
board is delivered. Delivery time, X, is uniformly distributed over the interval from one to
five days. The cost C of this failure and interruption consists of a fixed cost c0 for the new
part and a cost that increases proportionally to X², so that

C = c0 + c1 X²

Find the expected cost, E(C).
Exercise 3. You arrive at a bus stop at 10 o’clock, knowing that the bus will arrive at some
time uniformly distributed between 10:00 and 10:30. What is the probability that you will
have to wait longer than 10 minutes?
• People wait in line to enter the Catalina Express ferry to Catalina Island. People enter-
ing the United States when returning or coming from a foreign country must wait at
the passport checkpoint. Airplanes must wait in line to take off. Servicing people in
those lines requires that the capacity to serve them is in proper relation to the rate at
which they arrive. If the capacity is too small, the queue will be disproportionately
large and people and planes will wait an inordinate amount of time for service. On
the other hand, if the capacity is too large, much of the service capacity will be
underutilized or perhaps not utilized at all. How would you approach this problem?
Suppose an event happens at a constant rate λ. The random variable X measuring the time
until the first event (and the time between events) has range [0, ∞) and density function

f(x) = λe^(−λx),  x ≥ 0

F(x) = 1 − e^(−λx)

E(X) = 1/λ

Var(X) = 1/λ²

M_X(t) = λ/(λ − t)
Figure 7.5 shows an exponential density function. The key property of the exponential
random variable is that it is memoryless. For s > 0 and t > 0, the memoryless or Markov prop-
erty states that

P(X > s + t | X > s) = P(X > t).
Example 7.7.1
Traffic to an email server arrives in a random pattern (i.e., exponential arrival time) at a
rate of 240 emails per minute. The server has a transmission rate of 800 characters per
second. The message length distribution (including control characters) is approximately
exponential with an average length of 176 characters. Assume an M/M/1 queueing system
(i.e., exponential arrival times, exponential service time, and one server). What is the
probability that 10 or more messages are waiting to be transmitted? See Section 7.16.

Figure 7.5 The density of an exponential with λ = 1; the blue line marks the value of the
expected value on the X axis.

Example 7.7.2
The lifetime of a light bulb, the length of a phone call, and radioactive decay are other
examples of exponential random variables.
Example 7.7.3
Suppose that the amount of time it takes the bookstore to process the book purchase of a
student at the beginning of the school year, in minutes, is an exponential random variable
with parameter λ = 1/10.
Let X denote the length of processing the order in minutes. Then

f(x) = (1/10) e^(−x/10),  x ≥ 0.

• If someone arrives immediately ahead of you at the bookstore, then the probability
that you will have to wait more than 10 minutes is

∫_10^∞ f(x) dx = ∫_10^∞ (1/10) e^(−x/10) dx = [−e^(−x/10)]_10^∞ = e^(−1) = 0.368.

• The probability of waiting between 10 and 20 minutes is

P(10 < X < 20) = P(X < 20) − P(X < 10) = F(20) − F(10) = e^(−1) − e^(−2) = 0.233

• The standard deviation of the waiting time is

SD(X) = √(1/λ²) = 1/λ = 10.
• Each half hour that you waste in the bookstore line, you lose about $10 from your
work-study job at the statistics department. Compute the expected cost and the
standard deviation of that cost.
The cost, in dollars, is

C = (1/3) X.

Using the rules of expectations learned and used earlier,

E(C) = (1/3) E(X) = 10/3 = $3.333…

SD(C) = √((1/3)² Var(X)) = (1/3) SD(X) = $3.3333
7.7.1 Exercises
Box 7.3 Integration by parts
To prove that the values of the expected values regarding an exponential random variable
are what they are, you need to remember integration by parts. You will need that in the
exercises.

∫ u dv = uv − ∫ v du

Exercise 1. Suppose that an experimenter studies the lifespan of members of a colony of
bacteria. Let T be the lifespan of a randomly chosen member of the colony. Suppose the
lifespan of the bacteria has an exponential distribution with expected value 100 minutes.
What is the probability that a randomly chosen member of the colony lives more than
87 minutes?
Exercise 2. The time in hours required to repair a
machine is an exponentially distributed random
variable with expected value of two hours. (i) What is the probability that a repair time exceeds
two hours? (ii) A company has four machines identical to the one in (i) that need repairs. What
is the probability that two of them require a repair time that exceeds two hours?
Exercise 3. The service times at a teller window in a bank were found to follow an exponen-
tial distribution, with a mean of five minutes. A customer arrives at a window at 2:00 p.m.
(i) Find the probability that they will still be there at 2:06 p.m. Show work. (ii) Find the prob-
ability that the customer will still be there at 2:10 p.m., given that the customer was there at
2:06 p.m.
Exercise 4. Prove the Markov property of the exponential distribution using the definition of
conditional probability of an event.
Exercise 7. The time (in hours) required to repair a machine is exponentially distributed with
parameter λ = 1/2. Calculate (i) the probability that a repair time exceeds 2 hours. (ii) The
conditional probability that a repair takes at least 10 hours given that its duration exceeds
9 hours. (iii) the probability that the total repair time of 100 machines is greater than 180 hours.
M_X(t) = λ/(λ − t)

f(x) = e^(−x),  x ≥ 0.
The gamma is a very important random variable. It applies to metrics that are nonnegative
and have skewed distributions, but with less extreme decay than the exponential. In fact,
the exponential and the chi-square distributions are special cases of a gamma density.
You will research this random variable in Siegrist (1997). Go to http://www.randomservices.
org/random/special/Gamma.html and visit the applets and simulators provided by the author.
Try to do some of the exercises.
Example 7.8.1
The monthly salary of women who are in the labor force in a large town, Y, follows a gamma
distribution with α = 2000 and λ = 4. The expected salary of the women in this town is

E(Y) = α/λ = 2000/4 = 500.

f(y) = e^(−y) y³,  y ≥ 0.
Carl Friedrich Gauss (1777–1855), whom many mathematical historians consider to have been
the greatest mathematician of all time, was working as the royal surveyor for the King of
Prussia. Surveyors measure distances. For instance, a survey crew may measure a distance to
be 135.674 m. To tell if that is the correct distance, they would check their work by measuring
it again. The second time, they might get an answer of 135.677 m. So is it 135.674 m or
135.677 m? They would have to measure it again. The next time, they might get an answer
of 135.675 m. Which one is it? Each time they measured they got a different answer. Gauss
would have them measure it about 15 times, and they would get, for example
135.674; 135.677; 135.675; 135.675; 135.676; 135.672; 135.675; 135.674; 135.676; 135.675;
135.676; 135.674; 135.675; 135.676; 135.675
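These 15 numbers can be summarized numerically; the sketch below computes their mean and
standard deviation, the sort of "best value" and "typical error size" the normal model
describes:

```python
import statistics

# the crew's 15 repeated distance measurements, in meters
measurements = [
    135.674, 135.677, 135.675, 135.675, 135.676,
    135.672, 135.675, 135.674, 135.676, 135.675,
    135.676, 135.674, 135.675, 135.676, 135.675,
]

print(round(statistics.mean(measurements), 3))   # 135.675
print(round(statistics.stdev(measurements), 4))  # about 0.0012 m of spread
```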
f(x) = (1/√(2πσ²)) e^(−(x−µ)²/(2σ²)),  −∞ < x < ∞
Errors in measurement arise for a myriad of reasons (the environment, the instrument, facts
about the person making the measurement, the material being measured, to name a few).
An awareness of measurement error, and having a distribution that allows measuring the
probability of making an error of a certain size and that has an expected value, has proved
invaluable in the history of science and in the maintenance of the quality of products and
services. This awareness has made it possible to set government regulations and to build
more reliable production processes and measurements, for example.
Example 7.9.1
Kinney (2002, 66) talks of a machine that fills soft drink cans advertised to contain 12 oz of
soft drink but actually dispenses a volume (V) per can that is normally distributed with mean
11.8 oz and standard deviation 0.2 oz. Government regulations require that at least 99% of
the cans advertised as containing 12 oz of soft drink actually contain 11.6 oz or more. Quality
control statisticians are hired to use probability to design quality control experiments that
constantly monitor the filling process to guarantee that this regulation is respected. The
departure from 12 oz is due to chance. The quality control statisticians are not there to make
it 12 oz, but to make sure that there are no systematic departures from it, such as always
overfilling, that would indicate that the machine is broken.
Learn more about what statistical quality control statisticians do at:
https://www.britannica.com/topic/statistical-quality-control
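The regulation in this example can be checked directly. The normal CDF can be written with
the error function, which Python's standard library provides; the sketch below (our helper
function, not from the book) computes P(V ≥ 11.6) for V ~ N(11.8, 0.2²):

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# V ~ N(11.8, 0.2^2); the regulation asks for P(V >= 11.6) >= 0.99
p = 1 - normal_cdf(11.6, 11.8, 0.2)
print(round(p, 4))  # 0.8413 -- far below 0.99, so this process needs adjustment
```

Here (11.6 − 11.8)/0.2 = −1, so P(V ≥ 11.6) = P(Z ≥ −1) ≈ 0.84: the filling process as
described would not yet meet the 99% requirement.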
7.9.1 Which things other than measurement errors have a normal density?
The normal or Gaussian distribution was first observed by De Moivre, an eighteenth-century
mathematician. Later on, Laplace discovered that the normal or Gaussian model is also
followed by sums of random variables (discovering the famous Central Limit Theorem that
we study in chapter 9). The model was later used for measurements of people by Quetelet.
When used for measurements other than error, the normal density is just demonstrating the
variability of many measurements due to many factors affecting the value observed. This
variability arises for different reasons, for example, in the case of the height of people, many
factors (genetic, environmental) determine height. Height varies because people vary, not
because there is any error in measurement (well, perhaps the instrument is not accurate, and
then both are present, but you cannot say that one person measuring 150 cm and another
measuring 200 cm does so because there is an error; we look at the two people and one is
certainly taller than the other).
The reader should be aware of the fact that sometimes people carry over ancient interpre-
tations of things. Names such as standard deviation, for example, should be called standard
distance. The fact that a person is above average in height means that this person is at some
distance from the expected value for the population the person is coming from.
Similarly, the normal model arises naturally in nature when there are many unrelated fac-
tors affecting the metric we are interested in. The following website gives a very interesting
example in physics: dust particles moving in water. Why is the distance traveled by dust
particles in water normally distributed? The author of this applet simplifies the explanation
of Brownian Motion for you.
http://webphysics.davidson.edu/Applets/Galton/BallDrop.html
Example 7.9.4
(This example is from Samuels (2016).) When red blood cells are counted using a certain
electronic counter, the standard deviation of repeated counts of the same blood specimen
is about 0.8% of the true value, and the distribution of repeated counts is approximately
normal. For example, this means that if the true value is 5,000,000 cells/mm3, then the
standard deviation is 40,000.
(i) If the true value of the red blood count for a certain specimen is 5,000,000 cells/mm3,
what is the probability that the counter would give a reading between 4,900,000 and 5,100,000?
(ii) A hospital lab performs counts of many specimens every day. For what percentage of
these specimens does the reported blood count differ from the correct value by 2% or more?

Box 7.6 Maxwell's law of velocities
The velocity of a molecule (with mass M) in a gas at absolute temperature T, according to
Maxwell's law of velocities, obeys a normal probability law with parameters µ = 0 and
σ² = kT/M, where k is the physical constant called Boltzmann's constant. (Parzen (1960), 237.)
To calculate with an applet, see, for example, the Allan Rossman and Beth Chance applet.
http://www.rossmanchance.com/applets/NormCalc.html. For example, if X is a normal
random variable with mean 50 and standard deviation 3, you may enter 50 for mean and 3
for standard deviation and variable X. Do not check mark the third line. Then click on “Scale
to Fit” and you will see the normal curve drawn. Below check mark the first line and enter
the X, the Z, or the probability, and hit return to obtain the others. Do not check mark the
second line. You can change the > or < as well.
Example 7.9.5
A retailer is contemplating opening a bike shop in a residential area. This retailer does not
plan to sell over the internet. To know what kind of bikes to stock, the retailer needs to
know the age of the population where the store will be. If there are many kids, the retailer
would order bikes for kids. If there are many elderly people, the retailer would order bikes
suited to them.

Figure: Normal density for age; f(age) plotted against Age (20 to 80); P(45 < age < 55) = red area.
Example 7.9.7
Scores in an exam, which we will denote by X, are normally distributed with expected value
µX = 70 and standard deviation σX = 6. What is the probability that a randomly chosen student
will score higher than 76 on the exam?

P(X > 76) = P(Z > (76 − 70)/6) = P(Z > 1) = 1 − P(Z < 1) = 1 − 0.84 = 0.16.
Example 7.9.8
The time it takes a driver to react to the brake light on a decelerated vehicle is critical in
avoiding rear-end collisions. Someone suggests that reaction time for an in-traffic response
to a brake signal from standard brake lights can be modeled with a normal distribution with
expected value 1.25 seconds and standard deviation 0.46 seconds. What is the probability
that a driver’s reaction time will be between 1 and 1.75 seconds?
To find percentiles for the normal random variable we may also use the standard normal
random variable or use an applet or normal curve calculator. The steps are as follows:
You may compute percentiles with the Rossman/Chance applet by typing the probability
and leaving the < sign.
Another applet that is widely used is that of David Lane, which can be found at http://
onlinestatbook.com/2/calculators/normal_dist.html
Example 7.9.9
What is the 67th percentile in the standard normal random variable Z?
We first find the 0.67 area within the body of the table. We then identify that with a
Z value of 0.4399132. In other words,
P ( z ≤ 0.4399132) ≈ 0.67
2.58 = (c − 64)/0.78,

so c = 64 + 2.58 × 0.78 ≈ 66.01.
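Percentile computations like the two examples above can also be done without tables or
applets. The sketch below inverts the normal CDF by bisection (the helper functions are
ours); assuming the display above uses z = 2.58 as the 99.5th percentile of Z with µ = 64 and
σ = 0.78, the result agrees with c = 64 + 2.58 × 0.78:

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def normal_percentile(q, mu, sigma):
    """Invert the normal CDF by bisection over a wide bracket."""
    lo, hi = mu - 10 * sigma, mu + 10 * sigma
    for _ in range(100):
        mid = (lo + hi) / 2
        if normal_cdf(mid, mu, sigma) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(normal_percentile(0.67, 0, 1), 4))       # 0.4399, as in Example 7.9.9
print(round(normal_percentile(0.995, 64, 0.78), 2))  # 66.01
```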
Example 7.9.11
If X is N (μ = 2, σ2 = 0.2), then Y = 4X −2 is N(μ = 6, σ2 = 3.2). We use the linear property of the
expectation operator and the rules of expectations we know to prove this.
7.9.4 Exercises
Exercise 1. Wires manufactured for a certain computer system are specified to have a resistance
of between 0.10 and 0.17 ohms. The actual measured resistances of the wires produced by
company A have a normal probability density distribution, with expected value 0.13 ohms
and standard deviation 0.005 ohms. If three independent such wires are used in a single
system and all are selected from company A, what is the probability that they all will meet
the specifications?
Exercise 2. The temperatures in June in Los Angeles are distributed normally with mean
77° Fahrenheit and standard deviation 5° Fahrenheit. (i) What is the probability that the
Exercise 3. Among first-year students at a certain university, scores on the math SAT followed
the normal curve, with an average of 500 and a standard deviation of 100. At what percentile
was a student who scored 350?
Exercise 4. Prove that if X has normal density with mean µX and variance σX², then
Z = (X − µX)/σX has expected value 0 and standard deviation 1.
Exercise 5. Compute P(|X − 3| > 6).
Exercise 6. Statisticians use the theory we learn in probability about the normal density
function to determine whether data that they observe might follow the normal model. This
is how they operate. They use data tools to calculate the percentiles of the data, the mean
of the data, and the standard deviation of the data. Then they assume a theoretical normal
model with the same mean and the same standard deviation. The idea then is to compare
the percentiles of the model with the percentiles of the data set.
You are going to apply this technique to SAT scores. Research and find the average verbal
SAT score in 2017 in the United States. Find also the standard deviation and a few percentiles.
Compute the same percentiles for the normal model. Are they the same? Would you conclude
that the normal model is a good model for these data?
Exercise 7. How many standard deviations above and below the expected value do the quar-
tiles of any normal distribution lie?
Exercise 8. Federal Services advertises that its average delivery time is 30 hours. Federal
Services’ standard deviation of its delivery time is 5 hours. What is the probability that a
document arrives in less than 36 hours?
Exercise 9. Family branding occurs when a firm applies one brand name to its entire product
line, such as Levi’s. Individual branding occurs when a firm uses individual brand names for
its products, for example, Procter & Gamble’s Pringles, Crisco, and Tide. GSP Inc. is trying
family branding for a new toothpaste in 20 test cities. The mean and standard deviation of
units sold per week are 2,250 and 250 respectively. GSP is also test marketing the toothpaste
Exercise 10. The number of chocolate chips in an 18-ounce bag of Nabisco’s Chips Ahoy!
chocolate chip cookies was found to be normally distributed with mean 1,261 and standard
deviation 117.6, based on analysis of many bags back in the late 1990s. This was found
in response to Nabisco's "Chips Ahoy! 1000 Chips Challenge" (Warner and Rutledge 1999).
Nabisco asked for confirmation that there are at least 1,000 chips in every 18-ounce bag and
some contest participants used the normal model to find the answer. Other approaches to
answer Nabisco’s question were tried. Use the normal density assumption to find the proba-
bility that there are at least 1000 chips in a randomly chosen 18-ounce bag.
Exercise 11. The weight of anodized reciprocating pistons produced by a company follows
a Gaussian distribution with µ= 10 lb and standard deviation 0.2 lb. A sampling inspection
scheme designed by the quality control engineers calls for rejecting the heaviest 2.5% of the
pistons. What weight, in pounds, determines the overweight classification?
Example 7.9.1
It is known that 45% of home improvement loan applications are approved. If 500 applications
are chosen at random, what is the probability that less than 200 are approved?
X = # of drivers out of 1204 that run a stop sign or red light. X ~ Bin(n = 1204, p = 0.17)

E(X) = np = 1204(0.17) = 204.68

Var(X) = np(1 − p) = 1204(0.17)(0.83) = 169.8844
It is unlikely that there would be so many drivers running a stop sign or red light in a
sample of 1204 if 17% of drivers in the population do.
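For the loan-application question above (n = 500, p = 0.45, P(X < 200)), the normal
approximation can be compared with the exact binomial tail. A sketch with our own helper
(the 199.5 is the usual continuity correction, an extra refinement rather than something the
book applies here):

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

n, p = 500, 0.45
mu = n * p                          # 225.0
sigma = math.sqrt(n * p * (1 - p))  # about 11.12

# P(X < 200) = P(X <= 199), approximated at 199.5
approx = normal_cdf(199.5, mu, sigma)
# exact binomial tail for comparison
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(200))
print(round(approx, 4))  # about 0.0109
print(round(exact, 4))
```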
7.9.6 Exercises
Exercise 1. According to the Census Bureau, 15.3% of Californians live below poverty level.
A random sample of 1,000 Californians is taken. What is the probability that less than 450
people in the sample live below the poverty level?
Exercise 2. According to government data, 30% of married Americans marry after age 30. A
study of married people chooses a random sample of 400 married Americans and asks each
person in the sample their age at marriage. What is the probability that more than 200 people
in the sample married after age 30?
Exercise 3. Approximately 1.07% of the population in the United States has chronic hepatitis
C, according to the Center for Disease Control. If you randomly select individuals to test, how
many should be selected for the normal approximation to hold?
Exercise 4. In 1,000 flips of a fair coin, heads came up 560 times and tails 440 times. Are
these results consistent with a fair coin?
Exercise 5. Inspired by Mansfield (1994). Department stores in the United States offer store
credit cards that come with some perks. Customers with a store credit card can get a 20%
greater discount than customers without a store credit card. Department store Mysees knows
from past experience that on days when there is a substantial storewide sale and Mysees
offers an additional 20% off if the customer opens an account, 25% of the customers with-
out an account will open one. If 1,000 customers without the credit card visit Mysees on a
particular day, what is the probability that more than 300 open a credit card account?
Exercise 7. This exercise is adapted from Mansfield (1994, 206). Telephone companies use
probability to solve many kinds of engineering problems. This problem is about landlines,
the static phones servicing many homes. But it is applicable to cell phones.
A telephone exchange at A was to serve 2,000 telephones in a nearby exchange at B. It
is too expensive to install 2,000 trunk lines from A to B. Instead, the telephone company
decided to install trunk lines so that only 1 out of 100 calls would fail to find an unutilized
trunk line immediately at its disposal. Under typical conditions, the probability that one of
the 2,000 telephone subscribers will require a trunk line to B is 1/30 (this probability will be
different if there are natural disasters, or other hazardous events). The telephone company
wants to determine how many trunk lines it should install so that when 1 out of the 2,000
subscribers puts through a call requiring a trunk line to B during the busiest hour of the day,
he or she would find an unutilized trunk line to B immediately at the subscriber’s disposal
in 99 out of 100 cases.
(See also the version of this problem in Parzen (1960, 246).)
Waloddi Weibull (1887–1979) was a Swedish physicist. He was an inventor, engineer, and
professor. Weibull was interested in the strength properties of brittle material and conducted
experiments to measure such strength. He was well aware that no physical measurement is
exact. Thus he conducted experiments that consisted on measuring the strength of a mate-
rial under a given stress repeatedly. His measurements lead him to discover that material
strength under the same experimental conditions varied, as he expected, very much like Gauss
discovered that distances were different depending who measured them. But the normal
or Gaussian distribution was not a good mathematical model for Weibull’s measurements.
To have a reference that engineers could use, he came up with a mathematical probability
model that came to be known as the Weibull distribution (Weibull (1951)). With this model,
engineers could determine the proportion of times that measured strength would be found
to be less than a given value.
It turned out that Weibull’s model gained wide applicability in engineering as a model for
many types of strength of different materials. But the health sciences also found the model
useful to measure the strength of people after a medical surgery or some traumatic event.
The field of survival analysis relies to a large extent on the Weibull model to measure survival,
or time until death after a surgery.
Whereas the Gaussian distribution is symmetric, the Weibull distribution tends to be skewed
to the right.
The formula that Weibull found for his distribution was
f(x) = (α/β) (x/β)^(α−1) e^(−(x/β)^α),  x ≥ 0,
where α , β are positive shape and scale parameters, respectively. Changing β stretches or
compresses the x scale without changing the shape of the distribution. As in any density
function, the area under the curve represents probability.
Example 7.10.1
An engineer found that the Weibull distribution model that fits fracture strength, X, of silicon
nitride braze points had parameters α = 5 and β = 125. Find the quartiles of this distribution
and the value of the interquartile range.
The model is
f(x) = (5/125) (x/125)^(5−1) e^(−(x/125)^5),  x ≥ 0.
To have a reference for future problems that you may do with this distribution, it pays
to do the computations first with generic parameters and then plug in the values of the
parameters given.
For the first quartile, set

∫_0^c (α/β) (x/β)^(α−1) e^(−(x/β)^α) dx = 0.25

[−e^(−(x/β)^α)]_0^c = 0.25

which reduces to

1 − e^(−(c/β)^α) = 0.25

Simplifying further,

e^(−(c/β)^α) = 0.75

(c/β)^α = −log(0.75)

so

log(c) = (α log(β) + log(−log(0.75))) / α

which gives

c = e^((α log(β) + log(−log(0.75))) / α)
Substituting the values of the parameters given in the problem, we find that c = 97.42997.
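The closed form above is equivalent to c = β(−log(1 − q))^(1/α) for a percentile level q,
which is easy to check numerically (the helper function below is ours):

```python
import math

def weibull_percentile(q, alpha, beta):
    """Solve F(c) = 1 - exp(-(c/beta)**alpha) = q for c."""
    return beta * (-math.log(1 - q)) ** (1 / alpha)

q1 = weibull_percentile(0.25, 5, 125)
print(round(q1, 2))  # 97.43, matching c above
```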
We leave it as an exercise for the reader to find the 75th percentile and the interquartile range.
7.11.1 Exercises
Exercise 1. Consider the problem solved in Example 7.10.1. Complete the exercise by finding
the 75th percentile and then computing the interquartile range.
The beta distribution models random variables that take values between 0 and 1. A propor-
tion is that type of variable because a proportion is a number between 0 and 1. Thus, this
distribution is widely used in Bayesian modeling as a prior distribution of the parameter p of
a binomial distribution. But the beta has many other uses.
The beta is another density function that you will research in the random project of Siegrist
(1997), at this site http://www.randomservices.org/random/special/Beta.html. This author
has applets for you to see the shapes of the density. Study that section of the random project
for this section on the beta random variable. Do some of the exercises.
Have you heard the saying that 80% of wealth is in the hands of 20% of the population? This
has been known in many circles as the Pareto principle, after Vilfredo Pareto (1848–1923).
Twenty percent of the causes are responsible for 80% of the outcomes. The Pareto density
is widely used to measure income distributions, and income inequality indexes have been
built around it.
In this section, you will research the Pareto density function with the applets and simula-
tors found in Siegrist (1997). Visit http://www.randomservices.org/random/special/Pareto.
html to acquaint yourself with this density. Do some of the exercises.
When doing proofs and discovering new formulas, it helps to remember the formulas for
probability density functions and the property that the area under continuous density func-
tions is 1.
Example 7.15.1
Consider the following expression:
∞ ( z −t )2 t 2
1
∫e
− +
( t σ )2 /2 t µ
e e 2 2
dz .
2π −∞
Can it be simplified? It can, if you notice that there is this term embedded in the formula:
∞ ( z −t )2
1
∫
−
e 2
dz ,
−∞
2p
that is the area under a normal curve with mean t and variance 1, and therefore this integral is 1.
\[
e^{\frac{t^2}{2}}\, e^{(t\sigma)^2/2}\, e^{t\mu} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(z-t)^2}{2}}\, dz = e^{\frac{(1+\sigma^2)t^2}{2} + t\mu}
\]
The moment generating function of the normal random variable is of the form
\[
e^{\frac{(\mathrm{Variance})\, t^2}{2} + t\,(\mathrm{Mean})}
\]
In the placeholder for the variance we have (1 + σ²). So the expression that we started this example with is the moment generating function of a normal random variable with mean µ and variance (1 + σ²).
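The simplification can be checked numerically in R for an arbitrary choice of t, µ, and σ (the particular values below are mine, not the text's):

```r
# Left side: the integral we started with; right side: the claimed normal MGF.
t <- 0.7; mu <- 1.2; sigma <- 0.5
lhs <- integrate(function(z)
  exp(-(z - t)^2 / 2 + t^2 / 2) / sqrt(2 * pi) *
  exp((t * sigma)^2 / 2) * exp(t * mu), -Inf, Inf)$value
rhs <- exp((1 + sigma^2) * t^2 / 2 + t * mu)
all.equal(lhs, rhs)   # TRUE
```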
Question 1. In a large lecture course, the scores on the final examination followed the normal
curve closely. The average score was 60 points and three-fourths of the class got between
50 and 70 points. The SD of the scores was
Question 2. What is the constant k that makes the following function a valid density?
f(x) = kx⁹(1 − x)², 0 ≤ x ≤ 1.
a. 0.00151
b. 2
c. 660
d. 210
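To see where the answer comes from: the density kx⁹(1 − x)² has the form of a beta density with parameters 10 and 3, so k is the reciprocal of the beta function B(10, 3). A quick check in R:

```r
k <- 1 / beta(10, 3)   # reciprocal of the beta function B(10, 3)
k                      # 660
# Confirm that with this k the density integrates to 1:
integrate(function(x) k * x^9 * (1 - x)^2, 0, 1)$value   # 1
```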
Question 3. Let X be the time that it takes to drive between point A and point B during the
afternoon rush hour period on highway 4005. The density function of X is
f(x) = x/2, 0 < x < 2.
a. 1.414
b. 1.0
c. 5/16
d. 0.034
F(x) = (x − 2)/5, 2 ≤ x ≤ 7
Question 6. Suppose X is a normal random variable with mean µ and standard deviation σ.
Under what circumstances is X/σ the standard normal random variable?
a. µX is 0
b. σX = 1
c. µX is 1
d. σX = 0
Question 7. Let X be the change (in dollars per share) next year in the stock price of Apple.
A financial analyst found that X is normally distributed with mean µ = 2 and standard deviation
σ = 3. What is the probability of a price change larger than 2.5?
a. 0.1131
b. 0.4338
c. 0.0001
d. 0.1364
Question 9. The distribution of X, the time it takes women between 50 and 55 to run a 10k
race, is such that the event A (X between 40 and 60 minutes) has P(40 < X < 60) = 0.8; the
event B, taking between 50 and 90 minutes has P(50 < X < 90) = 0.2; and the event C, taking
between 50 and 60 minutes, has P(50 < X < 60) = 0.1. For a woman runner chosen at random
from the population in this age group, compute the probability that she is in at least one of
A or B.
a. 0.3
b. 0.9
c. 0.8
d. 0.2
Question 10. Consider a set of 20 independent and identically distributed uniform random
variables in the interval (0,1). What is the expected value and variance of the sum S of these
random variables?
7.16 R code
This simulation is about queues and waiting and service times. Think of yourself waiting in the
ATM line. You are the customer; the ATM is the service center, which takes more or less time
with each customer. Your total time in the system is the service time for your transaction plus
the time you had to wait in the queue, if any. Behind you there is a queue of other customers.
We will figure out the outcome of interactions in a complex system like this. It is very important
that you read and run the code line by line, except where indicated, to avoid errors. Also
keep notes of what is going on, and answer the questions asked.
[Figure: diagram of a queueing system, showing the arrival rate and interarrival times of arriving customers, the number of customers in the queue and in the server, and the waiting time in the queue and in the server.]
We want to see some of the numbers you got for interarrival times. So in R, after the
commands above, type
head(it)
s= c(rep(0,1000))
s=runif(1000,5,15)
head(s) # View first numbers in s
You may see the distribution of service times generated by doing a histogram:
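The histogram command itself is not printed here; in R it is simply hist(). A self-contained sketch (repeating the generation of s so it runs on its own; the titles are my own choice):

```r
s <- runif(1000, 5, 15)   # service times, generated as in the text
hist(s, main = "Service times", xlab = "minutes")
```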
Now extract what you need from the random numbers generated.
Using the it numbers, now create a new variable “arrival times” (at), that contains the cumu-
lated interarrival times of all customers. This will give the time of arrival of each customer.
That is, if, for example, the first four values of it are 15, 16, 17, and 8, the first four values of
the variable at should be: 15, 31, 48, and 56. That is, customer 1 in this example arrived at
minute 15, customer 2 arrived at minute 31, and so on.
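The R function that does this cumulative sum is cumsum; a sketch with the four example interarrival times above:

```r
it <- c(15, 16, 17, 8)   # example interarrival times
at <- cumsum(it)         # arrival times of each customer
at                       # 15 31 48 56
```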
Compare the first six numbers you obtained for it earlier in this handout with the numbers
just obtained for at. Is the computer doing the right job with your numbers? That is, is it
giving you the right arrival times for your customers? Show the computation the computer
is doing.
The information obtained so far about arrival times, at, and service time, s, will help us
simulate the operation of the queueing system as follows:
To start with, create a variable for the queueing time of each customer, qt, and another
for the exit time of each customer, exit. These are the time the customer waits in line after
arriving to the system, and the time at which the customer is done and leaves the system.
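Following the pre-allocation style used for s above, these two variables can be created with:

```r
qt = c(rep(0, 1000))     # queueing time of each customer, initially 0
exit = c(rep(0, 1000))   # exit time of each customer
```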
Now determine the queueing time and the exit time for each customer. The following
code works as follows: if customer i + 1 arrives before the arrival time at plus the
service time s plus the queueing time qt of the last customer i has elapsed, then customer
i + 1 has to wait in the queue (qt > 0); otherwise, the customer does not have to queue (qt = 0).
The program calculates the queueing time. Run this block of code all as one: copy and paste
the whole thing into R after you have typed it in your script file.
for(i in 1:999) {
  if(at[i] + s[i] + qt[i] <= at[i+1]) {
    qt[i+1] = 0   # server is free when customer i+1 arrives
  } else {
    qt[i+1] = at[i] + s[i] + qt[i] - at[i+1]   # wait until customer i is done
  }
  exit[i] = at[i] + s[i] + qt[i]   # exit = arrival + queueing + service
}
We know the service time, the queueing time and the exit time for each customer. So
we can now figure out the total time spent by each customer in the system. Let’s call it tts.
tts = c(rep(0,1000))
tts = exit - at # total time= exit time-arrival time.
We can do a histogram to see the distribution of the total time in the system as follows
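The command is the same hist() used earlier; shown here with stand-in data so the sketch runs on its own (in the activity, tts comes from your own simulation):

```r
tts <- rexp(1000, 1/12)   # stand-in for the simulated total times in the system
hist(tts, main = "Total time in the system", xlab = "minutes")
```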
Then we can find the sample mean, standard deviation and other summaries of the
generated data:
mean(tts)
sd(tts)
summary(tts)
Enter here the mean, median, first quartile, third quartile, minimum, maximum, and
standard deviation of the total time spent in the system by a customer.
Describe the shape of the histogram of total time in the system.
We would also like to know the average number of customers in the system. For that, we
first need to put the operation of the system in real time. The following program puts the
arrival times and exit times one after the other for each customer. Then, in a separate column,
a 1 is entered for an arrival (customer in) and a −1 for an exit. After that, the times of arrival
and exit are sorted from lowest to highest. The objective is to be able to count how many 1's
we have in an interval of time. Let us follow the code in more detail.
Odd row numbers of this last matrix give us the arrival time, and even numbers give us
the exit time. So we can add one column to the matrix, which tells us that with every arrival
there is a new customer, and with every departure one customer less.
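The program that builds realtime is not reproduced in this excerpt; one hypothetical way to construct it, consistent with the description (the stand-in at and exit lines exist only so the sketch runs on its own):

```r
n <- 1000
at <- cumsum(runif(n, 5, 15)); exit <- at + runif(n, 5, 15)  # stand-ins
realtime <- matrix(0, nrow = 2 * n, ncol = 2)
realtime[, 1] <- as.vector(rbind(at, exit))  # interleave at[1], exit[1], at[2], ...
realtime[, 2] <- rep(c(1, -1), times = n)    # +1 = arrival, -1 = exit
```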
The following two commands order the times of arrivals and exits and then help us see
at each point in time whether there was a customer.
oo = order(realtime[,1])
trace = realtime[oo,]
We can inspect this with the command head(trace):
head(trace)
The first time in column 1 (row 1) is the entry time of customer 1, which was the first line
of realtime. The second time in column 1 is the entry time of customer 2, which was the third
line of realtime. The third time in column 1 is the exit time of customer 1, which was line 2
of realtime. The fourth time in column 1 is the entry time of customer 3, which was line 5
in realtime. The fifth time in column 1 is the departure time of customer 2, which was line
4 in realtime. Finally, the last time in column 1 is the departure time of customer 3, which
was line 6 in realtime.
Now we would like to focus on the number of customers. So we create a variable that tells
us how many customers are in the system at each real time of entry or exit.
tracecum= cumsum(trace[,2])
We can see the result of cumsum looking at the first 6 we looked at earlier:
head(tracecum)
We can find the average number and other summaries of the number of customers at each
of the minutes in which action is taking place by typing in R:
mean(tracecum)
sd(tracecum)
summary(tracecum)
Of interest is whether the system converges or, instead, the service times and arrival times
are such that the number of customers in the system explodes. That could happen if the
service time is too slow relative to the arrival rate. So, playing with the parameter values
of the service time and the interarrival times, you can generate all sorts of behavior. Let us
see which behavior we generated.
plot( tracecum,type="l")
To see whether the behavior of the random variables analyzed changes when we change
the assumption about service time, repeat the analysis done in this activity, but this
time assume in the probability model that the service time is exponential with rate
λ = 1/10 instead of uniform(5,15), so that the mean service time of 10 minutes is
unchanged. The rest of the code should not change. Then answer all
the questions again. In particular, explain well what happens at the end by running
all the code several times and comparing the output you get. Is this new system more
likely to converge? Is the final picture different from the uniform service time case?
Experiment with the randomness of the situation by doing the simulation several
times. That way, you will appreciate how each time you run it the data is different
(since it is random).
lambda = 1/10
s = rexp(1000, lambda)
The rest of the code should be as for the uniform service time.
Note: This simulation has also benefitted from Goodman (1988).
Exercise 2. After completing a study, a company in Kansas City concluded the time its employ-
ees spend commuting to work each day is normally distributed with a mean equal to 15
minutes and a standard deviation equal to 3.5 minutes. One employee has indicated that she
commutes 22 minutes or more per day. (i) What is the probability that a randomly chosen
employee commutes that much? (ii) An employee commuting exactly 22 minutes per day is at
what percentile? (iii) The company will pay each employee the cost of commuting by giving
them 10 cents per minute. How much should the company budget per employee? What is
the variability of the cost of commuting per employee?
Exercise 3. Four fuses were shipped to a customer, before being tested, on a guarantee basis.
That is, if the customer finds some of the fuses defective, s/he may return the shipment for
repair. The number of defectives in a shipment of four fuses is a random variable Y with
E(Y ) = 0.4 and Var(Y ) = 0.36. The cost of repairing the defective fuses is given by
C = 3Y².
Exercise 5. Suppose that one is told that the time one has to wait for a bus on a certain bus
stop is a continuous random variable with a probability density function given by
f(x) = 4x − 2x² − 1, 0 ≤ x ≤ 2
Exercise 6. (This exercise is based on William F. Stout (1999).) A certain insect species has
a mean length of 1.2 centimeters and a standard deviation of 0.1 centimeters. If there are
estimated to be 10,000 of these insects in a terrarium, how many of them would be expected
to be less than 0.8 cm in length? Assume that length is normally distributed.
Exercise 7. Prove that if X is a random variable with mean µ = 3 and standard deviation
σ = 0.5, then
(X − µ)/σ
has mean 0 and standard deviation 1.
Under what conditions will this density be equal to the exponential density?
Exercise 9. Prove that the variance of a uniform random variable with range (a, b) is
\[ Var(X) = \frac{(b - a)^2}{12}. \]
Exercise 10. If X is a gamma random variable, what is the following equal to?
\[ \int_0^{\infty} x^{\alpha - 1} e^{-x/\beta}\, dx \]
Exercise 11. A six-sided die is tossed 200 times. The number 1 or 2 or 3 came up
125 times. Is this die fair?
Exercise 12. The length of human pregnancies from conception to birth varies according to a
normal distribution, with a mean µ = 266 days and standard deviation σ = 16 days. Calculate
the median length of pregnancies.
Exercise 14. Consider the cumulative distribution of the proportion of income in a country.
That is a distribution that has proportion of income on the horizontal axis and proportion of
persons that have that proportion of income or less on the vertical axis. With these criteria,
draw the cumulative distribution of income when there is complete equality, i.e., everybody
gets the same amount of income. In the same graph, draw the cumulative distribution of
income when there is inequality of income distribution. Explain why you drew your graph
that way.
Exercise 16. The coffee chain Starbucks created an app that supports mobile ordering at
7,400 of its stores in the United States, giving users the opportunity to order and pay for
their drinks before they even arrive at their local Starbucks. Starbucks estimates the typical
wait time given in the app will average around 3–5 minutes at most stores, with variability
depending on the number of mobile orders in the queue and what each order entails. After
the transaction is completed in the app, users can skip the line and instead head to the pick-up
counter where they can ask a barista for their order. Suppose that at one of the stores the
waiting time in seconds has moment generating function given by
M_X(t) = (1 − 200t)⁻¹.
(i) If you enter your order immediately after another customer, what is the probability that your
order will be ready in 300 seconds? (ii) If 300 seconds have passed and you arrive at the
counter and your coffee is not ready, what is the probability that you will have to wait an
additional 50 seconds?
Exercise 17. The length of life, X, of a fuse has the following probability density function:
\[ f(x) = \frac{1}{\theta} e^{-x/\theta}, \quad x \ge 0, \; \theta > 0. \]
Three such fuses operate independently. Find the joint density of their lengths of life.
Which probability density function model would be appropriate to consider for each of these
random variables?
Exercise 19. Suppose the distribution of math scores on the SAT test has a roughly unimodal
and symmetric distribution with mean equal to 500 and standard deviation equal to 100.
You happened to earn a 600 on the math part of the SAT. Where do you stand among all
students who took this math portion of the SAT? (i) Draw the distribution of scores and label
the points that are one SD, two SDs, and three SDs away from the mean. (ii) What percent-
age of students scored below 400? (iii) What percentage of students scored above 700? (iv)
What percentage of students scored between 500 and 600? (v) What percentage of students
scored above 620? (vi) What score did you have to receive to be above the 90th percentile?
What about the 10th percentile?
Exercise 20. In the United States, many universities require applicants to submit scores on
standardized tests, such as the SAT tests. The college your friend wants to apply to says that
while there is no minimum score required, the middle 50% of their students have SAT scores
between 1020 and 1220. You would feel confident if you knew her score was in the top 25%,
but unfortunately she took the ACT test, an alternative standardized test. How high must her
score be on the ACT to be comparable to the top quarter of equivalent SAT scores? (Note:
The mean SAT score for all college-bound seniors is about 1000 and the standard deviation
is about 200 points. For the same group, the ACT average is 20.8 with a standard deviation
of 4.8. You can assume both score distributions are nearly normal.)
Exercise 21. (This exercise is from Allen (1990, 113, 162).) The interactive computer system
at Gnu Glue has 20 communication lines to the central computer system. The lines operate
independently and the probability that any particular line is in use is 0.6. (i) What is the
probability that 10 or more lines are in use? Compare the exact binomial solution and the
normal approximation solution to this problem. (ii) Do the assumptions needed for the normal
approximation make sense?
Exercise 22. (This exercise is based on Degroot (1974, 247).) Suppose that F is a continuous
cumulative distribution function on the real line, and let a and b be numbers such that
F(a) = 0.3 and F(b) = 0.8. If 25 observations are selected at random from the distribution for
which the distribution function is F, what is the probability that six of the observed values will
be less than a, ten of the observed values will be between a and b, and nine of the observed
values will be greater than b?
Allen, Arnold O. 1990. Probability, Statistics and Queueing Theory with Computer Science
Applications. Academic Press, Inc.
Denny, Mark and Steven Gaines. 2000. Chance in Biology. Princeton University Press.
Goodman, R. 1988. Introduction to Stochastic Models. Benjamin/Cummings Publishing Co., Inc.
Hamming, Richard W. 1991. The Art of Probability. Addison Wesley-Publishing Company, Inc.
Harris, Frank E. 2014. Mathematics for Physical Sciences and Engineering. Elsevier.
Keeler, Carolyn and Steinhorst. 2001. “A New Approach to Learning Probability in the First Sta-
tistics Course.” Journal of Statistics Education 9, no. 3.
Kinney, John J. 2002. Statistics for Science and Engineering. Addison-Wesley.
Mansfield, Edwin. 1994. Statistics for Business and Economics: Methods and Applications.
Fifth Edition. W.W. Norton & Company.
Ott, Wayne R. 1995. Environmental Statistics and Data Analysis. Lewis Publishers.
Parzen, Emanuel. 1960. Modern Probability Theory and Its Applications. New York. John Wiley
and Sons, Inc.
Pitman, Jim. 1993. Probability. Springer Verlag.
Rice, John A. 2007. Mathematical Statistics and Data Analysis. Thomson Brooks/Cole.
Ross, Sheldon. 2010. A First Course in Probability. 8th Edition. Pearson Prentice Hall.
Samuels, Myra L., Jeffrey A. Witmer, and Andrew A. Schaffner. 2016. Statistics for the Life
Sciences. Fifth Edition. Pearson.
Scheaffer, Richard L. 1995. Introduction to Probability and Its Applications, Second edition.
Duxbury Press.
Siegrist, Kyle. 1997. The Random Project. http://www.randomservices.org/random/
Stout, William F. 1999. Statistics: Making Sense of Data. 3rd Edition. Mobius Communications Ltd.
Warner, Brad, and Jim Rutledge. 1999. “Checking the Chips Ahoy! Guarantee.” Chance, 12,
no. 1: 10–14.
Weibull, Waloddi. 1951. "A Statistical Distribution Function of Wide Applicability." Journal of
Applied Mechanics 18: 293–97.
A particular fast-food outlet is interested in the joint behavior of the random
variables X, defined as the total time between a customer’s arrival at the
store and the customer’s leaving the service window, and Y, the time that
the customer waits in line before reaching the service window. Because X
includes the time a customer waits in line, we must have X > Y. How do
you propose to find the probability that the time at the service window is
larger than 1 minute?
Suppose we are interested in the joint behavior of two continuous random variables,
X and Y, where X and Y might represent, for example,
• the mass of a body and the time of descent of this body from a given height
to the earth’s surface (keeping other things such as height, air density and
initial velocity of the body constant)
• methane concentration in a sample of the earth’s atmosphere and the
sample’s carbon dioxide concentration
• the diameter and length of logs on the deck of a sawmill
The fact that each of these pairs of random variables is obtained from the same
object is the motivation for wanting to study them together. The probability of
occurrence of the pair of random variables is controlled by the rules governing the
probabilities of multiple events.
The notion of a pdf of a continuous random variable X can be extended to the
notion of the pdf of two or more random variables, a surface in three or more
dimensions. We will focus on a pair of random variables, X and Y.
Definition 8.1.1
The joint density function of a pair of continuous random variables, f(x, y), is a function that
assigns probabilities to pairs of real numbers:
\[ P(X \in A, Y \in B) = P(a \le X \le b,\; c \le Y \le d) = \int_a^b \int_c^d f(x, y)\, dy\, dx. \]
Thus, the probability of an event involving the two random variables is a volume under the joint
bivariate density function.
c. The volume under the density function throughout the whole domain of the function
is 1. More formally, we say that
\[ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\, dx\, dy = 1, \]
where it is understood that the integration will be done throughout the range in which f(x, y)
is not 0.
Example 8.1.1.
Let
f(x, y) = 6x²y, 0 ≤ x ≤ 1, 0 ≤ y ≤ 1,
and let the event A = {(x, y) : 0 < x < 3/4, 1/3 < y < 1}.
Then
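The computation itself is not reproduced in this excerpt. The probability of A is the volume
\( P(A) = \int_0^{3/4} \int_{1/3}^{1} 6x^2 y\, dy\, dx = 3/8 \), which can be checked numerically in R:

```r
f <- function(x, y) 6 * x^2 * y
# Integrate over y first, then over x (sapply keeps the outer integrand vectorized).
inner <- function(x) sapply(x, function(xx)
  integrate(function(y) f(xx, y), 1/3, 1)$value)
integrate(inner, 0, 3/4)$value   # 0.375 = 3/8
```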
Example 8.1.2
The two indices BOD (biochemical oxygen demand) and DO (dissolved oxygen) are among
the parameters that determine the quality of water in a river. BOD is a relative measure of
8.1.1 Exercises
Exercise 1. An institution that prepares applicants for the sequence of actuarial exams yields
applicants that have two main characteristics. Let X denote the proportion of applicants who
feel very confident about their passing the first exam, and let Y denote the proportion of
applicants who feel confident about passing all of the exams. The joint pdf of X and Y can
be modeled by
f ( x , y ) = 2(1 − x ), 0 ≤ x ≤ 1, 0 ≤ y ≤ 1.
When we talked about the marginal or total probability of a particular value of a discrete
random variable in Chapter 6 we added the joint probabilities that involved that value of
the random variable in question. The equivalent operation in the continuous case is to add
the infinite number of joint probabilities that involve that value of the random variable. The
mathematical operation to do that in the continuous case is integration.
Consider two random variables X and Y. Then to compute probabilities for X we will add
over Y, and vice versa if we want to compute probabilities for Y:
\[ f(x) = \int_{-\infty}^{\infty} f(x, y)\, dy, \qquad f(y) = \int_{-\infty}^{\infty} f(x, y)\, dx, \]
where again, the integration will actually be done over the range of X and Y for which f(x, y) is
not 0.
With marginal density functions, we can compute total or marginal probabilities, mar-
ginal expectations and variances, and cumulative distribution functions for X and for Y.
For example,
\[ E(X) = \int_{-\infty}^{\infty} x f(x)\, dx, \qquad E(Y) = \int_{-\infty}^{\infty} y f(y)\, dy. \]
Notice that we may not, in general, obtain the joint pdf from the marginal pdfs. See
Section 8.3 for an exception to this statement.
Example 8.2.1
Let f(x, y) = 6x²y, 0 ≤ x ≤ 1, 0 ≤ y ≤ 1. Then
\[ f(x) = \int_0^1 6x^2 y\, dy = 3x^2, \quad 0 \le x \le 1, \]
\[ f(y) = \int_0^1 6x^2 y\, dx = 2y, \quad 0 \le y \le 1, \]
\[ E(X) = \int_0^1 x f(x)\, dx = \int_0^1 x\, 3x^2\, dx = \frac{3}{4}, \]
\[ E(Y) = \int_0^1 y f(y)\, dy = \int_0^1 y\, 2y\, dy = \frac{2}{3}, \]
\[ E(X^2) = \int_0^1 x^2 f(x)\, dx = \int_0^1 x^2\, 3x^2\, dx = \frac{3}{5}, \]
\[ E(Y^2) = \int_0^1 y^2 f(y)\, dy = \int_0^1 y^2\, 2y\, dy = \frac{1}{2}, \]
\[ \sigma_x^2 = E(X^2) - \mu_x^2 = \frac{3}{5} - \left(\frac{3}{4}\right)^2 = \frac{3}{80}, \]
\[ \sigma_y^2 = E(Y^2) - \mu_y^2 = \frac{1}{2} - \left(\frac{2}{3}\right)^2 = \frac{1}{18}. \]
We can use the marginal pdfs to compute probabilities as well. For example,
\[ P(X > 0.3) = \int_{0.3}^{1} 3x^2\, dx = 0.973. \]
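These moments can be verified numerically with R's integrate function; a quick check using the marginal density of X:

```r
fx <- function(x) 3 * x^2                                # marginal density of X
EX  <- integrate(function(x) x   * fx(x), 0, 1)$value    # 0.75  = 3/4
EX2 <- integrate(function(x) x^2 * fx(x), 0, 1)$value    # 0.6   = 3/5
EX2 - EX^2                                               # 0.0375 = 3/80
integrate(fx, 0.3, 1)$value                              # P(X > 0.3) = 0.973
```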
f(x, y) = 3x, 0 ≤ y ≤ x ≤ 1.
The probability that more than 50% of the tank is sold in a week is
\[ P(Y > 0.5) = \int_{0.5}^{1} \frac{3}{2}(1 - y^2)\, dy = \frac{5}{16}. \]
On the other hand,
\[ f(x) = \int_0^x 3x\, dy = 3x^2, \quad 0 \le x \le 1. \]
Example 8.2.3
Let
f(x, y) = 15e^{−3x−5y}, x ≥ 0, y ≥ 0.
The marginal densities are
f(x) = 3e^{−3x}, x ≥ 0,
f(y) = 5e^{−5y}, y ≥ 0.
8.2.1 Exercises
Exercise 1. Suppose X is the number of hours it takes for a new pair of blue jeans mail ordered
from a store to arrive to customers after ordering, and Y is the time it takes between order-
ing and the customer trying the new pair of jeans. The joint density of these two random
variables is
f(x, y) = 1/125000, 0 ≤ x ≤ y ≤ 500.
What is the probability that the delivery takes longer than 250 hours?
f(x, y) = 25e^{−5y}, 0 ≤ x ≤ 0.2, y ≥ 0
Exercise 3. Consider the joint density in Section 8.1.1, Exercise 1. Find (i) the marginal density
functions of X and Y; (ii) the expected value and variance of X, and the expected value and
variance of Y; and (iii) the cumulative distribution function of X and the cumulative distri-
bution function of Y.
Exercise 4. Consider Example 8.2.3. Find the cumulative distribution function of X and of Y.
Exercise 5. Consider the joint density function in Example 8.2.2. (i) Calculate
8.3 Independence
Do you remember what independence implied about two events A and B? About
two discrete random variables X, Y? Do you remember how independence was
defined in those contexts? List all the things you remember about these. Then check
Chapters 3 and 6 to see how much you remembered correctly.
Example 8.3.1
If, as in Example 8.2.1,
f(x, y) = 6x²y,
then
f(x) f(y) = 3x²(2y) = 6x²y = f(x, y),
so X and Y are independent random variables.
In contrast, for the density f(x, y) = 3x of Example 8.2.2, the marginal densities are
f(y) = (3/2)(1 − y²), 0 ≤ y ≤ 1, and f(x) = 3x², 0 ≤ x ≤ 1.
We can see that f(x) f(y) = (3/2)(1 − y²) · 3x² is not equal to f(x, y) = 3x. Therefore, X and Y
are not independent random variables.
8.3.1 Exercises
Exercise 1. Consider the random variables in Exercise 1, Section 8.1.1. Are X and Y indepen-
dent random variables?
Exercise 2. Consider the random variables in Example 8.2.3. Are X and Y independent
random variables?
Exercise 3. Prove that if two random variables X and Y are independent, then
If two variables are independent, then, by definition, it is not hard to see that f(x | y) = f(x)
and f(y | x) = f(y).
Conditional density functions allow us to make more interesting statements than we can
make with just the marginal densities. That is, we can now specialize the probabilities to
specific groups instead of offering generalized conclusions.
Example 8.4.1
f(x, y) = 2, 0 ≤ x ≤ y ≤ 1.
The marginal densities are
\[ f(y) = \int_0^y 2\, dx = 2y, \quad 0 \le y \le 1, \qquad f(x) = \int_x^1 2\, dy = 2(1 - x), \quad 0 \le x \le 1. \]
We can see that f(x, y) ≠ f(x) f(y) and therefore the two random variables are not inde-
pendent. The conditional density functions are:
\[ f(y \mid x) = \frac{f(x, y)}{f(x)} = \frac{2}{2(1 - x)} = \frac{1}{1 - x}, \quad x \le y \le 1, \]
\[ f(x \mid y) = \frac{f(x, y)}{f(y)} = \frac{2}{2y} = \frac{1}{y}, \quad 0 \le x \le y. \]
There are an infinite number of continuous conditional densities f(x | y), one for each value
of Y on the real line. Similarly, there are an infinite number of continuous conditional densities
f(y | x), one for each value of X on the real line.
Once we specialize a conditional density to the value of the other random variable, we can
compute conditional expectations, conditional variances, conditional probabilities, conditional
percentiles and so on.
A conditional density function is a univariate distribution.
Example 8.4.2
Let’s see what we can do with the conditional densities of Example 8.4.1.
First, we specialize to a value of X. Suppose we are interested in the situation when X = 0.5.
\[ f(y \mid X = 0.5) = \frac{1}{1 - 0.5} = 2, \quad 0.5 \le y \le 1. \]
\[ E(Y \mid X = 0.5) = \int_{0.5}^{1} 2y\, dy = 3/4. \]
We may also compute conditional probabilities. For example, the conditional probability
that Y is larger than 0.6 given that X is 0.5 is
\[ P(Y > 0.6 \mid X = 0.5) = \int_{0.6}^{1} 2\, dy = 0.8. \]
When X and Y are independent, the conditional densities equal the marginals:
\[ f(x \mid y) = \frac{f(x, y)}{f(y)} = \frac{f(x) f(y)}{f(y)} = f(x), \]
\[ f(y \mid x) = \frac{f(x, y)}{f(x)} = \frac{f(x) f(y)}{f(x)} = f(y). \]
Example 8.4.3
If two random variables X and Y have joint density
f(x, y) = 15e^{−3x−5y}, x ≥ 0, y ≥ 0,
then
f(x | y) = f(x) = 3e^{−3x}, x ≥ 0,
f(y | x) = f(y) = 5e^{−5y}, y ≥ 0.
Thus
P(X < 2 | Y = 4) = P(X < 2) = F_X(2) = 1 − e^{−6}.
8.4.2 Exercises
Exercise 1. Teachers get courses assigned to teach each semester. For each instructor, there
are the courses that the instructor can teach based on the skill set of the instructor, and there
are courses that the teacher would rather teach all the time, closer to their specialization.
(i) Given that 10% of the teachers teach the whole spectrum of courses, what is the
probability that fewer than 5% teach their favorite courses? (ii) What is the expected per-
centage of teachers teaching their favorite courses when the proportion teaching the whole
spectrum is 0.7?
Exercise 2. In Section 8.4, we constructed the joint density of two random variables using the
conditional and marginal density. For that problem, (i) what is the probability distribution of
the time at which the call is placed? (ii) When is the expected time for the call to be placed?
Most of the time, as happened with univariate continuous random variables, we are not
interested in the random variables per se, but in functions of them. For example, suppose we
take two strength measurements on the same section of a cable and we are interested in the
difference between the measurements. This may allow us to compare the two instruments used.
Definition 8.5.1
Let g(X, Y ) be a function of two continuous random variables. We define the expectation of this
function as follows:
\[ E(g(X, Y)) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y)\, f(x, y)\, dx\, dy, \]
where, as usual, the domain of integration will be just where the density is not 0.
Similarly, the variance is defined as follows:
\[ Var(g(X, Y)) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \big(g(x, y) - E(g(X, Y))\big)^2 f(x, y)\, dx\, dy = E\big(g(X, Y)^2\big) - \big(E(g(X, Y))\big)^2 \]
Example 8.5.1
Let’s consider again the case of the gasoline of Example 8.2.2. A quantity of interest to a gas
station is the difference between the amount stocked and the amount sold, because this
allows the gas station to predict shortages or excess inventory. Let D = X − Y. Then,
\[ E(D) = \int_0^1 x \int_0^x 3x\, dy\, dx - \int_0^1 y \int_y^1 3x\, dx\, dy = \int_0^1 x f(x)\, dx - \int_0^1 y f(y)\, dy = E(X) - E(Y) = \frac{3}{4} - \frac{3}{8} = \frac{3}{8}. \]
\[ Var(D) = \int_0^1 \int_0^x \big(x - y - (\mu_x - \mu_y)\big)^2\, 3x\, dy\, dx \]
\[ = \int_0^1 \int_0^x \big[(x - \mu_x)^2 + (y - \mu_y)^2 - 2(x - \mu_x)(y - \mu_y)\big]\, 3x\, dy\, dx \]
Example 8.5.2
A very special function of two random variables that is very helpful in computing the cor-
relation later on in the chapter is g(X, Y ) = XY (the product of the two random variables). In
the case of example 8.2.2,
\[ E(XY) = \int_0^1 \int_0^x xy\, 3x\, dy\, dx = \frac{3}{10} \]
Example 8.6.1
Continuing with Examples 8.2.2, 8.5.1 and 8.5.2, we know that E ( X ) = 3 / 4, E(Y ) = 3/8, E(XY ) =
3/10, Var(X) = 0.0375, Var(Y ) = 0.0594. Thus
\[ \rho = \frac{Cov(X, Y)}{\sigma_x \sigma_y} = \frac{E(XY) - E(X)E(Y)}{\sigma_x \sigma_y} = \frac{\frac{3}{10} - \frac{3}{4}\cdot\frac{3}{8}}{\sqrt{(0.0375)(0.0594)}} = 0.3972 \]
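The value 0.3972 can be checked by Monte Carlo simulation in R, sampling from f(x, y) = 3x by first drawing X from its marginal 3x² (inverse CDF: U^(1/3)) and then Y uniformly on (0, X), since f(y | x) = 1/x:

```r
set.seed(1)
u <- runif(1e6)
x <- u^(1/3)            # X has density 3x^2 on (0, 1), CDF x^3
y <- runif(1e6, 0, x)   # Y | X = x is uniform on (0, x)
cor(x, y)               # close to 0.3972
```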
W = a X + bY , T = cM + d N .
By definition of covariance,
All we have to do is substitute, expand, simplify, and group the terms that make sense to
put together inside the brackets.
Step 1.
Cov (W ,T ) = E[(aX + bY − E (aX + bY ))(cM + dN − E (cM + dN ))]
Step 2.
Cov(W, T) = E[a(X − E(X)) c(M − E(M)) + a(X − E(X)) d(N − E(N))
+ b(Y − E(Y)) c(M − E(M)) + b(Y − E(Y)) d(N − E(N))].
And now we bring the expectation operator inside the brackets, term by term, and simplify
a little bit further.
Step 3.
Cov(W, T) = acE[(X − E(X))(M − E(M))] + adE[(X − E(X))(N − E(N))]
+ bcE[(Y − E(Y))(M − E(M))] + bdE[(Y − E(Y))(N − E(N))].
Step 4.
Now we recognize definitions:
Cov(W, T) = ac Cov(X, M) + ad Cov(X, N) + bc Cov(Y, M) + bd Cov(Y, N).
f ( x , y ) = kxy , 0 ≤ x ≤ 1, 0 ≤ y ≤ 1.
(i) Determine the value of k. (ii) Find the correlation between X and Y.
Exercise 2. What would be the covariance of the following two functions of the random
variables X and Y?
W = aX + bY; T = cX + dY.
If X is the time your friend arrives to a party and Y is the time you arrive, what would
be the expected gap of time between your arrivals to a future party?
g(X, Y) = aX + bY.
The expected value of this sum is
E(g(X, Y)) = aE(X) + bE(Y).
Example 8.7.1
Let’s revisit the sum of the rolls of two dice. What would be the expected sum and the vari-
ance of the sum obtained when rolling two fair six-sided dice?
E ( X + Y ) = E ( X ) + E (Y ) = 3.5 + 3.5 = 7
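A quick simulation in R confirms the expected value, and it also estimates the variance, which for two independent fair dice is 2 × 35/12 ≈ 5.83:

```r
set.seed(1)
# Simulate 100,000 rolls of a pair of fair six-sided dice and sum them.
rolls <- sample(1:6, 1e5, replace = TRUE) + sample(1:6, 1e5, replace = TRUE)
mean(rolls)   # close to 7
var(rolls)    # close to 2 * 35/12 = 5.83
```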
\[ E(X + Y) = \int_y \int_x (x + y) f(x) f(y)\, dx\, dy \]
\[ = \int_y \int_x x\, f(x) f(y)\, dx\, dy + \int_y \int_x y\, f(x) f(y)\, dx\, dy \]
\[ = \int_y f(y) \int_x x\, f(x)\, dx\, dy + \int_x f(x) \int_y y\, f(y)\, dy\, dx \]
\[ = \int_y f(y)\, \mu_x\, dy + \int_x f(x)\, \mu_y\, dx \]
\[ = \mu_x + \mu_y \]
Similar calculations allow us to show that the variance of the sum is the sum of the variances.
\[ Var(X + Y) = \int_y \int_x \big(x + y - (\mu_x + \mu_y)\big)^2 f(x) f(y)\, dx\, dy \]
\[ = \int_y \int_x (x - \mu_x)^2 f(x) f(y)\, dx\, dy + \int_y \int_x (y - \mu_y)^2 f(x) f(y)\, dx\, dy + 0 \]
\[ = \int_y f(y) \int_x (x - \mu_x)^2 f(x)\, dx\, dy + \int_x f(x) \int_y (y - \mu_y)^2 f(y)\, dy\, dx \]
\[ = \int_y f(y)\, \sigma_x^2\, dy + \int_x f(x)\, \sigma_y^2\, dx \]
\[ = \sigma_x^2 + \sigma_y^2 \]
8.7.3 Exercises
Exercise 1. What is \( \int_y \int_x (x - \mu_x)(y - \mu_y) f(x) f(y)\, dx\, dy \) equal to?
f(x, y) = 3x, 0 ≤ y ≤ x ≤ 1
The random variable D = X − Y represents the amount left over at the end of the week. Find
the mean and variance of D.
Exercise 3. Prove that when X and Y are two random variables that are not independent,
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y).
If X and Y are two independent continuous random variables, we have seen in Definition 8.3.1
that f ( x , y ) = f ( x ) f ( y ). This condition generalizes to more than two random variables. If we
have n independent random variables, X1, X2, …, Xn, each of them with density f(xi), then
the joint density of all these random variables is
f(x1, x2, …, xn) = f(x1) f(x2) ⋯ f(xn)
Example 8.8.1
The lifetime of a system component is a random variable X with density function
f(x) = (1/3) e^(−x/3), x ≥ 0.
A system contains 6 of these components in series. The density function of the joint life-
time of all components is
f(x1, x2, …, x6) = (1/3)^6 e^(−(1/3)(x1 + x2 + … + x6))
Continuing with the example in the context of probability, i.e., the joint density and known
parameter, we could compute the probability that the system works longer than 10 years.
That will happen if all components last longer than 10 years.
P(X1 > 10, ……, X6 > 10) = ∫_10^∞ ∫_10^∞ …. ∫_10^∞ (1/3)^6 e^(−(1/3)(x1 + … + x6)) dx1 … dx6
When the parameter q is unknown, the statistician writes the component density as q e^(−qx), so the joint density of the six lifetimes is q^6 e^(−q ∑ xi). Maximizing this expression over q gives the estimator
q̂ = 6 / (x1 + x2 + … + x6).
Using the data values for the X’s the statistician gets an estimate. On the other hand,
the mathematical statistician uses properties of expectations to determine whether the
expected value of that estimator in general, for any n, for any random sample, has good
properties, i.e., gives good estimates of q.
Because of independence, all those integrals factor out, and the result is just
P(X1 > 10, ……, X6 > 10) = [ ∫_10^∞ (1/3) e^(−x/3) dx ]^6
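Because each factor is the same tail probability, the answer collapses to a single number. A quick illustrative check in Python (using the mean-3 exponential from the example):

```python
import math

# P(X > 10) for one exponential component with mean 3 is exp(-10/3);
# independence of the 6 components gives the 6th power, i.e. exp(-20).
p_one = math.exp(-10 / 3)
p_all_six = p_one ** 6
print(p_all_six)  # about 2.1e-09
```

So a series system of six such components almost never survives 10 years.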
8.8.1 Exercises
Exercise 1. The length of life of a fuse, X , has density
f(x) = e^(−x/q) / q, x > 0, q > 0.
Three such fuses operate independently. Find the joint density of their lengths of life,
simplifying it as much as possible.
The bivariate normal density function, and in general the multivariate normal, are perhaps
the most widely used distributions in statistics. In this section, we are going to describe the
form of the joint, marginal and conditional distributions. Operating with these distributions
is not different than operating with the generic ones we have been talking about in this
chapter. But because conditionals and marginal are normal densities, you would have to use
the same procedures we used with the normal density introduced in chapter 7.
The joint density of two bivariate normal random variables X and Y is
f(x, y) = [1 / (2π σx σy √(1 − ρ²))] exp{ −[ ((x − µx)/σx)² − 2ρ ((x − µx)/σx)((y − µy)/σy) + ((y − µy)/σy)² ] / (2(1 − ρ²)) }
Notice that this density function has five parameters that we have been concerned about
throughout our discussion in this chapter, namely: mean of X, mean of Y, standard deviation of X,
standard deviation of Y, and correlation between X and Y. The difference between the normal and
the other densities seen so far in this chapter is that all the parameters of the bivariate normal
appear in the joint density formula, but that was not the case in the other examples we have seen.
The marginal densities of X and Y are both normally distributed
f ( x ) ~ N (µx , σx ), f ( y ) ~ N (µy , σy ).
The conditional densities of X and Y are a little bit more complicated. They are normal
density functions as well, but their expectations and variances consist of formulas that
depend on the parameters:
f(x | y) ~ N( µx|y = µx + ρ (σx/σy)(y − µy), σx|y² = (1 − ρ²)σx² )
f(y | x) ~ N( µy|x = µy + ρ (σy/σx)(x − µx), σy|x² = (1 − ρ²)σy² )
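The conditional formulas can be packaged in a small helper function. This is an illustrative Python sketch (the function name is mine; the test values match Question 9 in the review questions of this chapter):

```python
import math

def conditional_normal(mu_x, mu_y, sd_x, sd_y, rho, x):
    """Mean and standard deviation of the normal density f(y | x)."""
    mu = mu_y + rho * (sd_y / sd_x) * (x - mu_x)
    sd = math.sqrt((1 - rho ** 2) * sd_y ** 2)
    return mu, sd

# mu_x = 3, mu_y = 2.5, sd_x = 0.5, sd_y = 0.4, rho = 0.4, x = 3.5
mu, sd = conditional_normal(3, 2.5, 0.5, 0.4, 0.4, 3.5)
print(mu, sd)  # 2.66 and about 0.3666
```

Swapping the roles of the arguments gives f(x | y) by symmetry.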
Example 8.9.1
At a certain university, the joint probability density function of X and Y, the grade point aver-
ages of students in the first and last year at school, respectively, is bivariate normal. From
the grades of past years, it is known that
Thus
P(Y ≥ 3.2 | X = 3.5) = P( Z > (3.2 − 2.66)/0.36666 ) = P(Z > 1.4727) = 0.072104.
8.9.1 Exercises
Exercise 1. A portfolio has two assets. The returns on the two assets are bivariate normal. The
return RA on asset A has expected value $20 and standard deviation $3. The return on asset B,
RB has expected value of $15 and standard deviation of $2. The correlation between the return
of the two assets is 0.7. Let a = 0.3 denote the share of wealth invested in asset A and b =
0.7 the share of wealth invested in asset B. Then the portfolio return is T = aRA + bRB. (i) Find
the expected T; (ii) Find the portfolio risk (standard deviation). (iii) Find the expected return
for asset A, when asset B has return equal to 14; (iv) Find the probability that the return for
asset A is larger than 16 when asset B has return equal to 14.
Exercise 2. Consider students at a university. Let X be their math SAT scores and Y their verbal
SAT scores. Suppose a study reveals that
µx = 600, σx² = 64, µy = 500, σy² = 60, ρ = 0.6.
What is the probability that a student with a verbal SAT score of 450 has a math SAT score
larger than 650?
Exercise 3. Scientists studied the relationship between the length of the body of a bullfrog
and how far it can jump. Mean body length is 149.64 mm and the standard deviation is 14.47
mm. The mean maximum jump is 103.99 cm and the standard deviation is 17.94 cm. The cor-
relation between body length and maximum jump is 0.28. The two random variables follow
a bivariate normal distribution. (i) What jump size should be expected in a bullfrog that is
140 mm long? (ii) What is the probability that jump size is larger than 100 in a randomly
chosen bullfrog?
Exercise 4. Students in a school were asked to participate in a study of the effects of a new
teaching method on reading skills of 10th graders. To determine the effectiveness of the new
method, a reading test was given to each student before applying the new method (pre-test)
and again after it (post-test). An improvement ≥ 300 is considered good. It is known that the average score in the pre-test
is 40, average score in the post-test is 40, standard deviation in the post-test is 6, standard
deviation in the pre-test is 5, and correlation between the scores in the two tests is 0.5. (i) What
is the probability that a student with a pretest score of 38 has good improvement? (ii) What is
the standard deviation of the improvement?
Exercise 5. A certain type of cable car has a maximum weight capacity X, with mean and standard
deviation of 5000 and 300 pounds, respectively. In a touristic site at high elevation, the cable
car loading Y has mean and standard deviation 4000 and 400 pounds, respectively. For any
given time that the cable car is in use, find the probability that it will be overloaded, assuming
that X and Y are independent and normally distributed. How does this problem differ from the
other problems done in this Chapter? What is common to other problems done in this chapter?
Exercise 6. We have said in Chapter 6 and in this Chapter that if two random variables are
independent, their covariance and therefore their correlation is 0. We have also said that the
converse is not generally true. An exception is the multivariate normal density function. If two
random variables are bivariate normal and their covariance is 0, then the random variables
are independent. Prove this result. Hint: you may want to start with the formula we gave for
the joint density function in Section 8.9.
Exercise 7. Let X denote the carapace length of common shrimp and let Y denote the postpinuou
position of the carapace. It is known that µX = 249, µY = 177.53, σX = 6.7, σY = 5.18, ρ = 0.83.
(i) Compute the probability that carapace length is larger than 260 for shrimp with post-
pinuou position equal to 170. (ii) What is the probability that a randomly chosen shrimp has
postpinuou position smaller than 180? (iii) What is the expected carapace length for shrimp
with postpinuou position equal to 170?
Question 1. When we talk about the joint density function of two random variables X, Y, ( f(x, y)),
for constants a and b,
P ( X < a, Y > b )
is
a. An area
b. A volume
c. Always 1
d. 0
Question 3. The future lifetimes (in months) of two components of a machine have the fol-
lowing joint density function:
f(x, y) = (6/125000)(50 − x − y), 0 < x < 50 − y < 50.
What is the probability that both components are still functioning 20 months from now?
a. ∫_0^20 ∫_0^20 (6/125000)(50 − x − y) dy dx
b. ∫_20^30 ∫_20^(50−x−y) (6/125000)(50 − x − y) dy dx
c. ∫_20^30 ∫_20^(50−x) (6/125000)(50 − x − y) dy dx
d. ∫_20^50 ∫_20^(50−x−y) (6/125000)(50 − x − y) dy dx
Question 5. If g( X ,Y ) = ( X − µX )(Y − µY ) and X and Y have joint density f(x, y), what will the
following calculation give us?
∫ ∫ (x − µX)(y − µY) f(x, y) dx dy
a. Cov ( g( X ,Y ))
b. Var ( XY )
c. Cov ( X ,Y )
d. Var ( X )Var (Y )
f ( x , y ) = 2, 0 ≤ y ≤ x ≤ 1.
Which of the following is the marginal probability density function of X?
f ( x , y ) = 6 x 2 y , 0 ≤ x ≤ 1, 0 ≤ y ≤ 1
E ( XY ) equals
a. 3/12
b. 4/9
c. 1/2
d. 7/8
f ( x , y ) = 6 x 2 y , 0 ≤ x ≤ 1, 0 ≤ y ≤ 1
a. 1
b. 0.7
c. 0.6
d. 0
Question 9. Consider the following conditional density function extracted from a joint bivariate
normal density function. A student thinks that there is a mistake, that the two variables were
uncorrelated. If that is the case, and everything else is correct, what is the new µy|x=3.5 and σy|x?
f(y | x) ~ N( µy|x=3.5 = 2.5 + 0.4 (0.4/0.5)(3.5 − 3) = 2.66, σy|x = √((1 − 0.4²) 0.4²) = 0.36666 )
f(x, y) = (6/5)(x + y²), 0 ≤ x ≤ 1, 0 ≤ y ≤ 1.
The probability that neither facility is busy more than one-quarter of the time is
a. 0.67
b. 0.0109
c. 0.0004
d. 0.11
8.11 R code
Gibbs sampling is a form of Markov Chain Monte Carlo simulation that consists of drawing
random numbers from conditional distributions repeatedly in order to obtain the marginal
distributions. It is usually used in Bayesian statistics to do inference about unknown param-
eters, but it can be used with any random variable.
Below you will find the code to generate random draws from two conditional density
functions. At the end you will see the marginal empirical distribution of each of the variables.
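The book's R code is not reproduced in this excerpt. The following is an illustrative Python sketch of the same idea, for a standard bivariate normal with correlation ρ = 0.6, whose conditionals are the normal densities of Section 8.9 (the variable names and sample size are mine):

```python
import math
import random

random.seed(1)
rho = 0.6
n_draws = 20000

x, y = 0.0, 0.0
xs = []
for _ in range(n_draws):
    # Alternate draws from the two conditionals:
    # X | Y = y ~ N(rho * y, 1 - rho^2), and symmetrically for Y | X = x.
    x = random.gauss(rho * y, math.sqrt(1 - rho ** 2))
    y = random.gauss(rho * x, math.sqrt(1 - rho ** 2))
    xs.append(x)

# The retained draws of X approximate its marginal, N(0, 1).
mean_x = sum(xs) / n_draws
var_x = sum((v - mean_x) ** 2 for v in xs) / n_draws
print(mean_x, var_x)
```

The sample mean and variance of the X draws land near 0 and 1, the parameters of the marginal density, even though we only ever sampled from conditionals.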
Exercise 1. Stores A and B, which belong to the same owner, are located in two different
towns. If the probability density function of the weekly profit of each store, in thousands of
dollars, is given by
f(x) = x/4, 1 ≤ x ≤ 3,
f(y) = y/4, 1 ≤ y ≤ 3,
and the profit of one store is independent of the other, what is the probability that next week
one store makes at least $500 more than the other store?
Exercise 2. Mary and Antonio plan an eight-hour hike in the mountains. Let X and Y denote the
time it takes, respectively, for Mary and Antonio to arrive at the top of the mountain. Assume
that Mary always arrives first, and that X and Y are uniform on the region 0 ≤ X ≤ Y ≤ 8. The
joint density is:
f(x, y) = 1/32, 0 ≤ x ≤ y ≤ 8.
What is the expected value of Y − X, which is the time interval between Mary and Antonio
arriving to the top?
f(x, y) = λ² e^(−λy), 0 ≤ x ≤ y ≤ ∞.
(i) Find the marginal density function of X and determine what family it is from, if one
of the important densities. (ii) Find the conditional density of Y when X = 5.
Exercise 5. Prove, using integration, that if X and Y are two random variables that are
independent,
E ( XY ) = E ( X )E (Y ).
Exercise 6. Prove, using integration, that if X and Y are any two random variables for which
the expectation exists,
E (aX + bY ) = aE ( X ) + bE (Y ).
Exercise 7. The length (X) of the leaf of a special kind of flower and the width (Y ) of the leaf
are bivariate normally distributed with
(i) What is the E (Y | X = 1 / 2) equal to? (ii) Complete what is missing in the parenthesis
containing the letters A, B, and C in the formula for the conditional density of width
given length = 1/2
f(y | x = 0.5) = [1 / (√(2π) (A))] exp{ −[y − (B)]² / (2(C)) }
Exercise 8. Suppose X1, X2, and X3 are three independent and identically distributed normal random
variables. Write the joint density of these random variables, simplifying it as much as possible.
Exercise 9. (This problem is from Degroot (1975, 178).) Suppose that X and Y are random
variables such that Var ( X ) = 9 , Var (Y ) = 4 , and correlation between X and Y is -1 / 6 .
(i) Determine Var ( X + Y ); (ii) determine Var ( X − 3Y + 4).
Exercise 10. Suppose X and Y are jointly uniform distributed on the square delimited by
0 ≤ x ≤ 1, 0 ≤ y ≤ 1. What is the probability of the event X > Y?
Exercise 11. (This exercise is from Page (1989, 134).) In a certain device, X represents the voltage
drop across a component and Y represents the current in another part of the device. The two
are related probabilistically, however. Let’s suppose that X and Y have a joint density function
f (x, y ) = c
(i) Find the value c must have to make this a joint density
(ii) Find the marginal densities of X and Y.
(iii) What is the probability that 3X is greater than Y?
Exercise 12. (This exercise is from Pitman (1993, 351).) Suppose X and Y are jointly uniformly
distributed in the region 0 < x < y < 1. (i) Find the joint density of X and Y. (ii) Find the mar-
ginal densities of X and Y. (iii) Are X and Y independent?
Exercise 13. (This exercise is from Grami (2016, chapter 4).) Suppose that two random variables
X and Y are independent and we have g(X, Y) = f(X)h(Y). Show that E(g(X, Y)) = E(f(X))E(h(Y)).
Until now, probabilities have been arrived at by assuming that the random variables
followed a particular distribution, or else the probabilities were given to us. But
how could we find probabilities if none of these conditions are satisfied? There
are theorems that allow us to find approximations to the probabilities for a single
random variable. We study these theorems here not only because they allow us
to approximate probabilities but also because they are very valuable when doing
proofs involving several random variables.
Markov’s inequality shows how to obtain probability bounds when the only thing
we know about the random variable is its expected value, and nothing else.
Example 9.1.1
The average number of fallen trees in a week at the Green Tree National Forest is
two. What is the probability that three or more trees will fall next week?
Theorem 9.1.1 Markov’s inequality
Let X be a nonnegative random variable with expected value µ= E(X) and let a be a con-
stant. Then
P(X ≥ a) ≤ E(X) / a
gives a bound on the tail of a nonnegative random variable when all we know is its expectation.
Because the bound uses no information beyond the expectation, it is usually far from the
actual probability. For the complement event, the theorem implies that
P(X < a) > 1 − E(X) / a
By its very nature, Markov’s inequality is conservative. A way to check this claim is to com-
pare what you obtain with the bound and what you would obtain if you knew the true
distribution of the random variable. Of course, you would not want to use Markov’s bound if
you know the probability distribution of the random variable; we are using this suggestion
to convince ourselves that Markov’s inequality works.
As the reader can see, we know only that the average is two. Let X be the number of trees
falling per week. According to Markov’s theorem,
P(X ≥ 3) ≤ 2/3 = 0.66666
Thus the probability that three or more trees will fall next week is at most 0.66666.
Suppose now that the actual distribution of X is Poisson with mean 2. Then, according to
the Poisson,
P(X ≥ 3) = 1 − P(X < 3) = 1 − ( 2^0 e^(−2)/0! + 2^1 e^(−2)/1! + 2^2 e^(−2)/2! ) = 0.323 < 0.66666
We can see that the probability bound proposed by Markov's theorem is twice as large as
the value obtained exactly using the Poisson, but Markov's inequality is not wrong: 0.323 is
indeed smaller than 2/3 ("at most 0.66666" means no more than 0.66666). Markov's inequality
does not tell us how much smaller the true probability is; it just gives us a point of reference.
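The comparison in this example is easy to reproduce; a short Python check:

```python
import math

# Markov: P(X >= 3) <= E(X)/3 with E(X) = 2.
markov_bound = 2 / 3

# Exact tail if X ~ Poisson(2): P(X >= 3) = 1 - P(X <= 2).
poisson_tail = 1 - sum(2 ** k * math.exp(-2) / math.factorial(k)
                       for k in range(3))
print(markov_bound, poisson_tail)  # 0.6667 versus 0.3233
```

The exact Poisson tail is about half the Markov bound, illustrating how conservative the bound can be.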
9.1.1 Exercises
Exercise 1. The time it takes a pedestrian to react to the change of a traffic light from red to
green is expected to be 10 seconds. What is the probability that the next pedestrian waiting
for the green light will take more than 12 seconds? Calculate your answer first by assuming
that you only know the expected value. Then calculate what the probability would be if it is
known that the reaction time is exponentially distributed with the same expected value of
10 seconds. Compare your answers.
Exercise 2. The average income per capita in Madagascar is about $400 per year. Find an
upper bound for the percentage of families with incomes over $1,000.
Chebyshev’s inequality shows how to obtain probability bounds when the only things we know
about the random variable are its expected value and the standard deviation, and nothing
else. The interval produced by Chebyshev's inequality follows from Markov's inequality. Like
Markov's, it is a conservative bound.
Theorem 9.2.1
Let X be any random variable with expected value µ= E(X ) and let k be a constant. Then
P{ |X − µ| < kσ } = P(µ − kσ < X < µ + kσ) ≥ 1 − 1/k²
Which implies, using the complement rule, that
P{ |X − µ| ≥ kσ } = P(X < µ − kσ or X > µ + kσ) ≤ 1/k²
Another version of the theorem is
P{ |X − µ| < k } = P(µ − k < X < µ + k) ≥ 1 − σ²/k²
Which implies, using the complement rule, that
P{ |X − µ| ≥ k } = P(X < µ − k or X > µ + k) ≤ σ²/k²
Proof
The proof of the theorem uses Markov's inequality, by making a in Theorem 9.1.1 equal to k²
and taking (X − µ)² as the nonnegative random variable.
P{ (X − µ)² ≥ k² } ≤ E(X − µ)² / k²
But (X − µ)² ≥ k² if and only if |X − µ| ≥ k. So
P{ |X − µ| ≥ k } ≤ σ²/k²
P{ |X − µ| ≥ σk } ≤ 1/k²
And
P{ |X − µ| < σk } ≥ 1 − 1/k²
Example 9.2.1
The daily consumption of carbohydrates by a healthy individual eating a healthy diet in a
given community averages 225 grams, with a standard deviation of 10 grams. (i) What can
be said about the fraction of individuals for which the carbohydrate intake falls between
205 and 245 grams?. (ii) Find the shortest interval about the mean certain to contain at least
90% of the individual daily carbohydrate intakes.
(i) The interval from 205 to 245 represents µ − 2σ to µ + 2σ with µ = 225 and σ = 10.
Thus k = 2 and 1 − 1/k² = 1 − 1/4 = 3/4. We conclude that at least 75% of all individuals have a
carbohydrate intake between 205 and 245 grams. So at most 25% fall outside that interval.
(ii) To find k, we must set 1 − 1/k² = 0.9. Then 1/k² = 0.1 and k² = 10, which implies that
k = √10 = 3.16.
The interval is µ − 3.16σ to µ + 3.16σ, or 225 − 3.16(10) to 225 + 3.16(10), or (193.4, 256.6).
We say that at least 90% of individuals in this community eat between 193.4 and 256.6 grams
of carbs, which implies that at most 10% eat outside that range.
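Both parts of the example come from the same inequality; a minimal Python sketch of the arithmetic:

```python
import math

mu, sigma = 225, 10

# (i) The interval (205, 245) is mu +/- 2*sigma, so k = 2.
fraction_inside = 1 - 1 / 2 ** 2          # at least 0.75

# (ii) Solve 1 - 1/k^2 = 0.9 for k, then build the interval.
k = math.sqrt(1 / (1 - 0.9))              # sqrt(10), about 3.16
interval = (mu - k * sigma, mu + k * sigma)
print(fraction_inside, interval)  # 0.75 and about (193.4, 256.6)
```

Note that Chebyshev guarantees these as lower bounds; the true fractions can be much higher for specific distributions.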
9.2.1 Exercises
Exercise 1. The average income per capita in Madagascar is about $400 and the standard
deviation of incomes is $400. Find the shortest interval about the mean certain to contain
at least 90% of the individual incomes.
Exercise 2. The number of machines used in a gym by the gym members during the peak hour
is closely monitored by the gym management, since it is critical to the efficient operation
of the gym. The number of machines in use averages 20 during peak hour, with a standard
deviation of 2. (i) Find an interval that includes at least 90% of the peak-hour figures for the
number of machines in use. (ii) In the advertisement of the gym, the management promises
that there will always be at least two machines available for a member in any peak hour. Is
the director safe in making this claim?
• When Gallup Polls (https://news.gallup.com) tell us that 40% of the population approve
of a particular candidate for the presidency of the United States, how accurate is that?
After all, they ask only a small group of people, usually no more than 5,000 people.
Let’s go back to Venn’s quote on cows in chapter 1, section 1. What does that quote have to
do with insurance?
When a person buys car insurance, insurance companies do not know how risky the person
is. There is no way for insurance companies to know whether people are accident prone or
not. But they have past data on drivers of the insured's age, living where they live, doing
similar things, and driving similar automobiles. They know that for that large group the
probability of accident is, say, p. According to that, they determine the insurance premium
for that person. However, insuring a single person or a small number of individuals like that
person only would not guarantee anything to the insurance company. In order to predict how
much the company will lose or gain, the insurance company needs to have a lot of policy
holders. The rate p will hold for many policyholders, but not for just a few. If that p is stable, the company
can predict its expected losses or gains. With a few policy holders, the proportion that have
an accident is hard to predict, as it fluctuates depending on the number of policy holders.
The insurance company is using the law of large numbers twice: (i) to set p and decide
what the individual policyholder’s premium will be (using information on many individuals in
the population) and (ii) to reduce risk exposure for the company (by issuing many automobile
policies). However, the law of large numbers is rendered less effective when risk-bearing
policyholders are dependent on one another. This is most easily seen in the health and fire
insurance industries, because diseases and fire can spread from one policy holder to another
if not properly contained. This problem is known as contagion. To read more about this issue,
go to Behind the Law of Large Numbers in the Insurance Industry | Investopedia https://
www.investopedia.com/articles/personal-finance/081616/behind-law-large-numbers-
insurance-industry.asp#ixzz5N275G4zj
As in insurance, the laws of probability are basic to the understanding of software engi-
neering and management.
The problem of insurance companies is no different from wanting to find out whether a die
is fair or not. If it is fair, repeatedly rolling the die a zillion times should give us the number 5
in 1/6 of the tosses, or close to it. We call that an empirical probability, because it varies
depending on the number of tosses we make. But as the number of tosses increases, the
empirical probability gets more stable. It is the number towards which the relative frequency
tends that we consider to be the true probability of getting a 5 in a roll of the die.
Box 9.1
The fundamental empirical fact upon which are based all applications of the theory of
probability. (Parzen 1960)
It is a striking fact that we can start with a random experiment about which little can
be predicted and, by taking averages, obtain an experiment in which the outcome can be
predicted with high degree of certainty. (Grinstead and Snell 1997)
Theorem 9.3.1
Let n be the number of trials of an experiment. Let p be the true probability of an event in
each trial and let p̂n be the empirical relative frequency of the event up to trial n. Then the law says
that for any ε > 0,
P(|p̂n − p| > ε) → 0 as n → ∞,
or,
P(|p̂n − p| < ε) → 1 as n → ∞.
Theorem 9.3.2
Let X1, X2, …, Xn be a sequence of independent random variables with finite expected
value µ = E(Xj) and finite variance σ² = V(Xj). Let Sn = X1 + X2 + … + Xn. Then for
any ε > 0,
P( |Sn/n − µ| > ε ) → 0 as n → ∞.
Put another way,
P( |Sn/n − µ| < ε ) → 1 as n → ∞.
And, again, Chebyshev’s theorem can be used to prove this version of the theorem. We will
leave the proof as an exercise.
It is the version of the law of large numbers in Theorem 9.3.2 that makes people sometimes
call the law of large numbers the “Law of Averages.”
Example 9.3.1
Let's toss a fair coin n times and let Sn be the number of heads in the n tosses. Then p̂ = Sn/n
represents the fraction of times heads appear in the n tosses. The law of large numbers
predicts that the outcome for p̂ will be near 1/2 for large n.
Example 9.3.2
Let’s now consider n rolls of a fair six-sided die and let X j denote the outcome of the jth roll.
The E(Xj) = 7/2. Let Sn = X1 + X2 + … + Xn be the sum of the first n rolls. Then for any ε > 0,
P( |Sn/n − 7/2| ≥ ε ) → 0 as n → ∞.
Put another way,
P( |Sn/n − 7/2| < ε ) → 1 as n → ∞.
Monte Carlo integration uses the law of large numbers to approximate an integral such as
I(g) = ∫_0^1 g(x) dx.
• Draw n random numbers from a uniform distribution defined on the interval [0,1].
Denote these numbers by x1, x2, …, xn.
• Compute Î(g) = (1/n) ∑_{i=1}^{n} g(xi) ≈ E[g(X)], by the law of large numbers, when n is large.
• Realize that E[g(X)] = ∫_0^1 g(x) f(x) dx = ∫_0^1 g(x) · 1 dx = I(g).
Example 9.3.3
Evaluate the integral
∫_0^1 cos(2πx²) dx
using Monte Carlo integration, and compare with the exact answer. We used R to do the
integration. See the R section to see how, as we increase n, the solution approaches 0.244127.
You can check yourself by going to Wolfram Alpha (www.wolframalpha.com) and typing:
integrate cos(2pix^2), from 0 to 1
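Here is an illustrative Python version of the same Monte Carlo computation (the book's own code is in the R section):

```python
import math
import random

random.seed(42)

# Monte Carlo estimate of the integral of cos(2*pi*x^2) over [0, 1]:
# average g(U) over uniform draws U, per the recipe above.
n = 1_000_000
estimate = sum(math.cos(2 * math.pi * random.random() ** 2)
               for _ in range(n)) / n
print(estimate)  # approaches 0.244127 as n grows
```

With a million draws the estimate typically agrees with 0.244127 to about three decimal places.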
9.3.2 Exercises
Exercise 1. Prove the following result using Chebyshev's theorem, where X̄ is the average of n
independent, identically distributed random variables with expected value µ and finite variance:
P(|X̄ − µ| > ε) → 0 as n → ∞.
Exercise 2. Jaron Lanier, author of the book “Ten Arguments for Deleting your Social Media
Accounts Right Now,” writes the following:
Behavior modification, especially the modern kind implemented with gadgets like smartphones,
is a statistical effect, meaning it’s real but not comprehensively reliable; over a population, the
effect is more or less predictable, but for each individual it’s impossible to say. To a degree,
you are an animal in a behaviorist’s experimental cage. But the fact that something is fuzzy
or approximate does not make it unreal. (Lanier 2018, p.11)
What does the author mean by statistical effect? What does this comment have in common
with the observations we made about insurance companies at the beginning of Section 9.3?
Exercise 3. As pointed out in Grinstead and Snell (1997, 308), although Chebyshev’s inequality
proves the law of large numbers, it is a crude inequality for the probabilities involved. This
problem is based on a problem in their book. Let X1, X2, …, Xn be a set of independent and
identically distributed uniform random variables defined on the interval [0,1]. Assume ε = 0.1.
How large must n be for P(|X̄ − µ| > ε) to approach 0?
Exercise 4. The law of large numbers is a probability statement, it tells us that we can be more
and more certain of being close to the value of E ( X ) if we compute the average or proportion
Exercise 5. There are two options: (a) You roll a die 100 times; if the number is less than four,
you win one dollar, and if the number is larger than three, you lose one dollar.
(b) You draw 100 times at random with replacement from a box containing a ticket worth
$1 and another ticket worth -$1.
Exercise 6. (This example is from Freedman, Pisani and Purves (1998).) Basketball players
who make several baskets in succession are described as having a "hot hand." Fans and players
have long believed in the hot hand phenomenon, which refutes the assumption that each
shot is independent of the next. What other explanation is possible for the hot hand?
As we have mentioned several times throughout the book, applications of probability often
call for the use of a random variable that is itself the sum or a linear combination of other
random variables. For example,
• The study of downtimes of computer systems might require knowledge of the sum of
the random downtimes each hour of the day. The downtime of each hour is a random
variable and the sum of the independent downtimes of each of the 24 hours of the
day is a sum of random variables giving us the total downtime per day.
• The random total cost of a building project can be studied as the sum of the random
costs for the major independent components of the project.
• The random size of an animal population can be modeled as the sum of the random
sizes of the independent colonies within the population.
• At the end of the summer the total weight of seeds accumulated by a nest of seed-gath-
ering ants will vary from nest to nest. We may be interested in the sum of the total
weights of seeds of all nests.
• The total weight of people riding an elevator is important to know to prevent over-
loading the particular elevator.
• An insurance company may want to know the total yearly claim by all the automobile
policy holders.
We have talked about sums of two discrete random variables in Chapter 6, and sums of
two continuous random variables in Chapter 8. In those chapters, we proved several results
regarding the expected value and the variance of the sum of two random variables. We
showed in those chapters that
E(X + Y) = E(X) + E(Y),
and, when X and Y are independent,
Var(X + Y) = Var(X) + Var(Y).
We also mentioned in Chapters 6 and 8 that those results extend to the sum of more than
two random variables although we did not prove it. The proof is beyond the scope of this
book, as it requires using joint distributions of more than two variables. Let’s denote the sum
of n independent random variables, i.e.,
Sn = X1 + X2 + … + Xn = ∑_{i=1}^{n} Xi,
where S is for sum and n for how many random variables we are adding. Then, without
proving it, we claim that
E(Sn) = E( ∑_{i=1}^{n} Xi ) = ∑_{i=1}^{n} E(Xi) = µ + µ + … + µ = nµ
Var(Sn) = Var( ∑_{i=1}^{n} Xi ) = ∑_{i=1}^{n} Var(Xi) = σ² + σ² + … + σ² = nσ²
Example 9.4.1
A fair six-sided die is rolled 100 times. Calculate the expected sum of the 100 rolls and the
variance of the sum of the 100 rolls.
E(S100) = E( ∑_{i=1}^{100} Xi ) = ∑_{i=1}^{100} E(Xi) = µ + µ + … + µ = nµ = 100(7/2) = 350,
Var(S100) = Var( ∑_{i=1}^{100} Xi ) = ∑_{i=1}^{100} Var(Xi) = σ² + σ² + … + σ² = 100(2.916667) = 291.6667.
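The two numbers in Example 9.4.1 can be checked by simulating many replications of the sum of 100 rolls; an illustrative Python sketch:

```python
import random

random.seed(3)

# Simulate S_100, the sum of 100 fair-die rolls, many times and compare
# the sample mean and variance of the sums with 350 and 291.67.
reps = 20000
sums = [sum(random.randint(1, 6) for _ in range(100)) for _ in range(reps)]

mean_s = sum(sums) / reps
var_s = sum((s - mean_s) ** 2 for s in sums) / reps
print(mean_s, var_s)  # near 350 and near 291.67
```

The sample values wobble around the theoretical ones, and the wobble shrinks as the number of replications grows.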
Example 9.4.2
Three flour mills receive raw corn in bulk. The amount of corn that one mill can process in
one day can be modeled as having an exponential distribution with a mean of 4 tons for
each mill. The variance of the total amount processed by the three mills in one day is
Var(S3) = Var( ∑_{i=1}^{3} Xi ) = ∑_{i=1}^{3} Var(Xi) = σ² + σ² + σ² = 3(16) = 48
Example 9.4.3
Random variable X denotes the number of classes that a typical student at College Bliss has
in a given Monday. The probability mass function of X is given below:
x         0    1    2
P(X = x)  1/4  1/2  1/4
List all possible values of the sum X 1 + X 2 , where the two random variables have the
probability mass function given above.
9.4.1 Exercises
Exercise 1. A machine has n identical components, each of which has an exponentially dis-
tributed lifetime T with expected value 10. What is the expected value and variance of the
total lifetime of all n components?
Exercise 2. The number of patients entering a randomly chosen hospital in a city is a Poisson
random variable with expected value 10 per day. If there are 25 similar hospitals in town,
what is the expected value and variance of the total number of patients entering hospitals
in a given day (assuming all hospitals are independent)?
This section goes beyond the results seen in section 9.4 and introduces the Central Limit
Theorem (CLT) which concerns the density function of the sum of several independent
random variables.
The CLT is at the core of most inference methods that statisticians apply to their data to
learn about populations, and is another indication of why the Gaussian distribution plays
such a prominent role in Statistics and data science.
Theorem 9.5.1
New in this section is the result that if a random variable Sn is itself the sum of a large num-
ber of independent and identically distributed random variables which may individually
have distributions of finite variance that are quite different from the normal distribution, i.e.
Sn = X1 + X2 + … + Xn = ∑_{i=1}^{n} Xi, then for each fixed value of z, as n tends to infinity,
P( (Sn − nµ) / (σ√n) > z )
approaches the probability that the standard normal random variable Z exceeds z. This
result is known as Central Limit Theorem (CLT). In practical terms, this means that if n is
large we may use the standard Gaussian density to approximate the answer to probability
questions about sums of independent and identically distributed random variables even if
we do not know what the distribution of X is.
Example 9.5.1
Suppose that the number of automobiles per single family housing unit, X, can be modeled
by a Poisson probability mass function with expected value E(X) = λ = 3. If there are 100
independent single family housing units in a town, the number of automobiles in each of
these units is a set of identically distributed Poisson random variables, X1, X2, …, Xn, with
expected value E(Xi) = λ = 3, i = 1, …, n. If we were asked what is the probability that the
total number of automobiles in town is larger than 400, we could use the normal curve (by
a consequence of the central limit theorem) to find that probability.
We first find the expected value and the variance of the sum.
n n
E ( S n ) = E X i =
∑ ∑E( X ) = µ+ µ+……. . + µ= n µ= 100(3) = 300,
i =1
i
i =1
By the CLT,
n
400 − 300
P ( S n > 400) = P X i > 400 = P Z >
∑ = P ( Z > 5.773) ≈ 0
i =1 300
Using moment generating functions, we could prove that the exact distribution of the sum
of the 100 Poisson random variables in this problem is Poisson with expected value 300 and
400 S100
300 e−300
P ( S100 > 400) = 1 − ∑
S100 =0
S100 !
≈ 0.
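Both the normal approximation and the exact Poisson tail in this example can be checked numerically. The sketch below is written in Python (the book's own code is in R; this is only an illustration); the log-scale Poisson pmf avoids overflow for λ = 300:

```python
import math

def phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

lam, n, cutoff = 3.0, 100, 400
mu = n * lam                      # E(S_n) = 300
var = n * lam                     # Var(S_n) = 300 (Poisson: mean = variance)
z = (cutoff - mu) / math.sqrt(var)
approx = 1.0 - phi(z)             # CLT approximation of P(S_n > 400)

# exact: S_100 is Poisson(300); accumulate the CDF on the log scale
cdf = sum(math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))
          for k in range(cutoff + 1))
exact = 1.0 - cdf
```

Both probabilities come out vanishingly small, which is what "≈ 0" means in the display above.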
Example 9.5.2
An insurance company has 10,000 automobile policyholders. The expected yearly claim per policyholder is $240, with a standard deviation of $800. What is the probability that the total yearly claims exceed $2.7 million?

E(Sn) = E(Σᵢ₌₁¹⁰⁰⁰⁰ Xi) = ΣE(Xi) = μ + μ + … + μ = 10000μ = 10000(240) = $2,400,000,
Var(Sn) = Var(Σᵢ₌₁¹⁰⁰⁰⁰ Xi) = ΣVar(Xi) = σ² + σ² + … + σ² = 10000σ² = 10000(800²).

By the CLT,

P(Sn > 2700000) = P(Σᵢ₌₁¹⁰⁰⁰⁰ Xi > 2700000) = P(Z > (2700000 − 2400000)/√(10000(800²))) = P(Z > 3.75) ≈ 0.
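A quick numeric check of this z-score and tail probability (Python sketch; the quantities are exactly those in the display above):

```python
import math

mean_claim, sd_claim, n = 240.0, 800.0, 10_000
total_mean = n * mean_claim              # $2,400,000
total_sd = math.sqrt(n) * sd_claim       # sqrt(10000 * 800^2) = 80,000
z = (2_700_000 - total_mean) / total_sd  # 3.75
tail = 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))  # P(Z > 3.75)
```

The tail probability is on the order of 10⁻⁴, effectively zero at this scale.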
Example 9.5.3
Resistors of a certain type have resistances that are exponentially distributed with rate parameter λ = 0.04 per Ohm, so each resistance has mean 1/0.04 = 25 Ohms. Fifty such independent resistors are connected in series, which causes total resistance in the circuit to be the sum of individual resistances. We will calculate the probability that the total resistance is larger than 1245. First,

E(Sn) = E(Σᵢ₌₁⁵⁰ Xi) = ΣE(Xi) = μ + μ + … + μ = 50μ = 50(1/0.04) = 1250,
Var(Sn) = Var(Σᵢ₌₁⁵⁰ Xi) = ΣVar(Xi) = σ² + σ² + … + σ² = 50σ² = 50(1/0.04²) = 31250.

By the CLT,

P(Sn > 1245) = P(Σᵢ₌₁⁵⁰ Xi > 1245) = P(Z > (1245 − 1250)/√31250) = P(Z > −0.028) = 0.5112823.
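Because the arithmetic here is easy to mistype, a numeric check helps (Python sketch): the mean of each Exponential(0.04) resistance is 25, so the sum has mean 1250, variance 50/0.04² = 31250, and the tail probability is about 0.511:

```python
import math

rate, n, cutoff = 0.04, 50, 1245.0
mean_sum = n / rate                  # 50 * 25 = 1250
var_sum = n / rate**2                # 50 * 625 = 31250
z = (cutoff - mean_sum) / math.sqrt(var_sum)    # ≈ -0.0283
p = 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))  # P(Z > z)
```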
Then E(X) = 0,

Var(X) = E(X²) − 0² = (−1)²/3 + 0²/3 + 1²/3 = 2/3,

and

SD(X) = √(2/3) = 0.8165.

The problem is to find

E(S10000) = 10000 E(X) = 0,

P(S10000 > 100) = P( (S10000 − 0)/81.65 > (100 − 0)/81.65 ) = 1 − P(Z < 1.22474) ≈ 0.11,

where 81.65 = √(10000(2/3)) is the standard deviation of S10000.
The sample mean

X̄n = Sn/n,

which can be expressed alternatively as X̄n = X1/n + X2/n + … + Xn/n, is itself a sum of random variables, and therefore follows theorem 9.5.1. We just need to put the right expected value and variance in the formula. That is,

P( (X̄n − μ)/(σ/√n) > z ) = P( (Sn − nμ)/(σ√n) > z )

approaches the probability that the standard normal random variable Z exceeds z. This is so because the factor of n in the X̄n does not affect the standardized variables.
Example 9.5.5
Let

X̄n = Sn/n = (Σᵢ₌₁ⁿ Xi)/n,

where Xi, i = 1, …, n, is a Poisson random variable with expected value λ. The expected value and variance of X̄n are:

E(X̄n) = λ,
Var(X̄n) = λ/n.
Example 9.5.6
The average hospital inpatient length of stay (the number of days that, on average, a person stayed in the hospital) in the United States was 4.5 days in 2017, with a standard deviation of approximately 7. What is the probability that a random sample of 30 patients will have an average stay longer than 6 days next year, if the information about 2017 still holds for next year?

P( (Σᵢ₌₁³⁰ Xi)/30 > 6 ) = P(Z > (6 − 4.5)/(7/√30)) = P(Z > 1.173691) = 1 − P(Z < 1.173691) = 0.121.
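Example 9.5.6's probability can be reproduced with a few lines (Python sketch):

```python
import math

mu, sigma, n = 4.5, 7.0, 30
se = sigma / math.sqrt(n)                       # standard error, ≈ 1.278
z = (6.0 - mu) / se                             # ≈ 1.1737
p = 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))  # P(Z > z) ≈ 0.12
```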
P( (Σᵢ₌₁¹⁵⁰ Xi)/150 > 12 ) = P(Z > (12 − 12)/(2.309401/√150)) = P(Z > 0) = 0.5.
Example 9.5.8
Approximately 16.2 percent of Americans purchase private individual health plans in the United States. If we take a random sample of 200 Americans, what is the probability that there will be more than 14 percent with a private individual health plan?

P(p̂ > 0.14) = P(Z > (0.14 − 0.162)/√(0.162(1 − 0.162)/200)) = P(Z > −0.844193) = 1 − P(Z < −0.844193) = 0.8.
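The standard deviation of p̂ here is √(0.162 · 0.838/200) ≈ 0.02605; a numeric check (Python sketch):

```python
import math

p, n = 0.162, 200
se = math.sqrt(p * (1 - p) / n)                    # ≈ 0.02605
z = (0.14 - p) / se                                # ≈ -0.844
prob = 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))  # P(Z > z) ≈ 0.80
```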
Table 9.1
The expected value of the sum of the two rolls is 7 and the standard deviation is 2.415229.

μ − 3σ = 7 − 3(2.415229) = −0.245687,
μ + 3σ = 7 + 3(2.415229) = 14.245687.
We can see that, by going three standard deviations to the left of the expected value, we
are putting ourselves in negative values, which are impossible, and by going three standard
deviations to the right of the expected value we are putting ourselves at higher values of
the random variable than are possible. These two results are an indication that the normal
approximation is not very good yet. We need to add many more dice for the normal curve to
be a good approximation to the distribution of the sum.
Similarly, if the problem involves the sample average X̄, we can check whether going beyond three standard deviations from the expected value puts us in forbidden values of X̄. In Example 9.5.6, the expected value is 4.5 and the standard deviation of X̄ is 7/√30 = 1.278. Three standard deviations to the left of 4.5 is 4.5 − 3(1.278) = 0.666. So it is reasonable to assume that the normal is a reasonable approximation, despite the skewness of the distribution of X. However, had the standard deviation been higher (which is not unheard of in this type of random variable), for example, 12, then the standard deviation of Σᵢ₌₁³⁰ Xi/30 would have been 12/√30 = 2.191, and three standard deviations to the left of 4.5 would have been 4.5 − 3(2.191) = −2.07, a value that is negative. But length of stay cannot be negative. This result would be an indication that the normal approximation has not been reached under the given conditions. A similar result could have been obtained if n were smaller than 30, for example, less than 15. Do you see why?
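The rule-of-thumb check described above is mechanical enough to script (Python sketch; σ = 7 and σ = 12 are the two cases discussed):

```python
# lower end of the three-standard-error interval around the mean of Xbar;
# a negative value flags a suspect normal approximation for length of stay
mean, n = 4.5, 30
lowers = {sigma: mean - 3 * sigma / n ** 0.5 for sigma in (7.0, 12.0)}
```

With σ = 7 the lower end is 0.666 (fine), while with σ = 12 it is about −2.07, an impossible length of stay.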
Example 9.5.11 How many Poisson random variables does it take for the CLT to
hold?
Figure 9.1 shows how the distribution of the sum of independent and identically distributed
X1 ,X2 ,¼. . , Xn Poisson random variables with parameter l = 1 approaches the Gaussian
density.
As a general rule, the more symmetric the distribution, and the thinner the tails, the faster
the approach to normality as n increases (Dinov, Christou and Sanchez 2008).
[Figure 9.1 panels (Density vs. Sum): For n = 2, For n = 5, For n = 20, For n = 50, and two panels for larger n.]
Figure 9.1 The figure shows that as we increase n, the distribution of the sum of n Poisson random variables with expected value 1 approaches the Gaussian distribution. The Gaussian is the continuous curve in blue.
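The convergence displayed in Figure 9.1 can be reproduced by simulation. The sketch below (Python, seeded for reproducibility) draws sums of n Poisson(1) variables and measures their sample skewness, which for a Poisson(n) sum is 1/√n and should shrink toward the Gaussian's symmetric value of 0 as n grows:

```python
import math, random
random.seed(42)

def pois1():
    # one Poisson(1) draw (Knuth's multiplication method)
    limit, k, prod = math.exp(-1.0), 0, 1.0
    while True:
        prod *= random.random()
        if prod < limit:
            return k
        k += 1

def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

sums = {n: [sum(pois1() for _ in range(n)) for _ in range(5000)]
        for n in (2, 50)}
skews = {n: skewness(s) for n, s in sums.items()}
# theory: skewness 1/sqrt(2) ≈ 0.71 for n = 2 and 1/sqrt(50) ≈ 0.14 for n = 50
```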
9.5.4 Combining the central limit theorem with other results seen earlier
Sometimes, we may be interested in comparing two sets of iid random variables. In that case,
it is helpful to remember what we reviewed at the beginning of Section 9.5.1 about Gaussian
random variables. The following example illustrates what we mean.
Example 9.5.12
Iron deficiency among infants is a major problem. The table below contains the average blood
hemoglobin levels at 12 months of infants following different feeding regimens: breastfed
infants, and baby formula without any iron supplements. Here are summary results. (Note:
none of the babies take both feeding regimens).
Group         μ      σ
Breast-fed    13.3   1.7
Formula       12.4   1.8
Let X1 , X2 ,¼. . , X100 represent the blood hemoglobin of 100 unrelated breast-fed
babies and let Y1 , Y2 ,¼. . , Y100 the blood hemoglobin of 100 unrelated formula babies.
What is the probability that the difference in average hemoglobin levels at 12 months is
bigger than 2?
We first identify the random variable needed. We can see that in this problem, we are being asked to compare two random variables, via the difference X̄n − Ȳn. We can use what we learned in this chapter to show that

σ(X̄n − Ȳn) = √Var(X̄n − Ȳn) = √(1.7²/100 + 1.8²/100) = 0.24758,

P(X̄n − Ȳn > 2) = P(Z > (2 − 0.9)/0.24758) = P(Z > 4.443) ≈ 0.
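A numeric check of the two-sample calculation (Python sketch):

```python
import math

mu_x, sd_x, mu_y, sd_y, n = 13.3, 1.7, 12.4, 1.8, 100
se_diff = math.sqrt(sd_x**2 / n + sd_y**2 / n)   # ≈ 0.24758
z = (2.0 - (mu_x - mu_y)) / se_diff              # ≈ 4.443
p = 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))   # P(Z > z), essentially 0
```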
sample values is called a summary statistic. The derivation of distributions for summary
Example 9.5.13
A physician studies a randomly selected group of 25 patients and gives them a drug that could
cause vasoconstriction. The physician conducting the study is trying to determine whether
there are adverse effects on systolic blood pressure due to taking the drug. The physician finds
that after taking the drug, the average blood pressure of the 25 patients is 124 mm Hg. The
physician then asks: What is the probability that this or a higher blood pressure would happen
if these patients had not taken the drug—that is, if the patients had remained like all patients
of their kind, who have a mean systolic blood pressure of 120 and standard deviation 10?
That probability is 0.023. Why?
The statistician then interprets the result as follows: 2.3% of random samples of 25 patients not taking the drug would have had an average systolic blood pressure of 124 or more without taking the drug, just by chance. That is a very small number of samples, hence it is rare to find an average of 124 in a random sample of 25. But the physician found it in the sample at hand. Therefore, it seems that the drug has some effect on systolic blood pressure.
The physician used the property of the distribution of X̄25, assumed to be approximately Gaussian with expected value 120 and standard deviation 10/√25, by the CLT. Statisticians call the distribution of X̄ the sampling distribution of X̄. The standard deviation of X̄ is called the standard error.

Box 9.3 Misconceptions
A common misconception is that as n goes to infinity the distribution of X follows the Gaussian distribution. The Central Limit Theorem says nothing about the distribution of one random variable. It is only a statement about the distribution of the sum Sn and the sample mean Sn/n. The CLT does not assume either that the population is large.
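The 0.023 quoted above follows from z = (124 − 120)/(10/√25) = 2 (Python sketch):

```python
import math

mu, sigma, n = 120.0, 10.0, 25
z = (124.0 - mu) / (sigma / math.sqrt(n))       # = 2.0
p = 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))  # P(Z >= 2) ≈ 0.023
```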
Example 9.5.14
A claim was made that 60% of the adult population thinks that there is too much violence
on television and a random sample of size n = 100 was drawn from this population to check
whether that is indeed true. A Poll was conducted. The people in the sample were asked: do
P(p̂ < 0.56) = P(Z < (0.56 − 0.6)/0.0489) = P(Z < −0.8179) = 0.2067.
So approximately 20% random samples of 100 adults taken from a population where 60%
think there is too much violence would have given 56% or less by chance. That is a large
percentage of samples. Therefore seeing a 56% is no statistical evidence against the claim
that 60% think there is too much violence. The 56% is a result of chance variation.
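Again a quick check (Python sketch); the standard error is √(0.6 · 0.4/100) ≈ 0.0490, and the 0.2067 in the text reflects rounding that standard error to 0.0489:

```python
import math

p0, n = 0.6, 100
se = math.sqrt(p0 * (1 - p0) / n)                  # ≈ 0.0490
z = (0.56 - p0) / se                               # ≈ -0.8165
prob = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # P(Z < z) ≈ 0.207
```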
Example 9.5.15
Go to the Reese’s pieces samples applet
http://statweb.calpoly.edu/chance/applets/Reeses/ReesesPieces.html
To run it, you need to click on the handle on the drawn candy dispenser, as if you were buying the candy from the machine. Get the distribution of p̂ for a sample size of n = 50 and p = 0.03. Draw 1000 samples to get a better approximation. Is the normal a good approximation to the distribution of p̂ in this case?
You will see that the normal curve does not fit well the distribution of p̂, which is centered around 0.03. The sample size is not large enough. That can be seen by the fact that the normal curve centered at 0.03 gives negative proportions (there is blue in the negative range), but we never get a negative proportion in the simulation (all the black dots are in the positive range), as they should be. Proportions are between 0 and 1. The parent distribution is too skewed (a Bernoulli with p = 0.03 is very skewed).
The standard deviation of p̂ is 0.0241. This means that if we go two standard deviations to the left of 0.03 we hit negative numbers.
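The applet's behavior can be imitated without the applet (Python sketch, seeded): simulate many sample proportions with n = 50 and p = 0.03 and note that none are negative, even though the fitted normal curve extends below zero:

```python
import random
random.seed(7)

n, p = 50, 0.03
sd = (p * (1 - p) / n) ** 0.5        # ≈ 0.0241
phats = [sum(random.random() < p for _ in range(n)) / n
         for _ in range(1000)]
normal_lower = p - 2 * sd            # negative: the normal curve spills below 0
```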
Exercise 2. A lot acceptance sampling plan for large lots calls for sampling 50 items and
accepting the lot if the number of nonconformances is no more than 5. Find the approximate
probability of acceptance if the true proportion of noncomformances in the lot is 10%.
Exercise 3. (This exercise is from Mosteller, Rourke and Thomas (1967, 333).) In crossing two pink flowers of a certain variety, the resulting flowers are white, red, or pink, and the probabilities that attach to these various outcomes are 1/4, 1/4, 1/2, respectively. If 300 flowers are obtained by crossing pink flowers of this variety, what is the probability that 90 or more of these flowers are white?
Exercise 4. Government statistics in the United Kingdom suggest that 20% of individuals live
below the poverty level (Purdam, Royston and Whitham (2017)). What percentage of individuals
in a random sample of 100 individuals are within two standard deviations of that proportion?
Exercise 5. According to Plewis (2014), Punjab produces 11.84% of India's cotton (2012 figures). Twenty-six percent of farmers in Punjab produce cotton (Bt cotton). The suicide rate among farmers (15+) is 13 per 100,000. Fifty-six percent of the farmers producing cotton produce genetically modified cotton.
Maharashtra produces 20.42% of India's cotton. Twenty percent of farmers produce cotton. The suicide rate is 46 per 100,000. Fifty-six percent of the farmers producing cotton produce genetically modified cotton.
What is the probability that, in a random sample of 100 Punjab farmers, less than 40
produce genetically modified cotton?
Exercise 6. According to the Department of Motor Vehicles (DMV), the entity in charge of pro-
viding driving licenses in the United States, it is illegal to drive with a blood alcohol content
of 0.08% or more if you are 21 or older. In the DMV’s guidelines to determine when a person
is driving under the influence, which can be found at https://www.dmv.ca.gov/portal/dmv/
detail/pubs/hdbk/actions_drink, it is indicated that fewer than five percent of the population weighing 100 pounds will exceed the 0.33 alcohol level. Assume this is accurate. If the Highway Patrol stops, on a random Friday night, 200 unrelated cars whose drivers each weigh one hundred pounds, what is the probability that six percent or more of the individuals stopped exceed the 0.33 alcohol level?
Exercise 7 (This exercise is from Kinney (2002, 75).) A bridge crossing a river can support
at most 85,000 lbs. Suppose that the weights of automobiles using the bridge have mean
Exercise 8. Resistors of a certain type have resistances that are exponentially distributed with
parameter l = 0.04. An operator connects 50 independent resistors in series, which causes
total resistance in the circuit to be the sum of individual resistances. Find the probability
that the total resistance is less than 1245.
We have been saying throughout the book that the parameters of the probability mass
functions and density functions studied are constant, and in all the chapters and exercises
done so far they have been constant. However, there are situations where that is not the
case. Bayesian Statistics for example assumes that the parameters are themselves random
variables. But without getting into Bayesian statistics, there are some theorems regarding
conditional expectations that help us compute expected values and variance of random
variables, when solving the problems otherwise would be very difficult.
These theorems are
E ( X ) = E[E ( X |Y )]
Var ( X ) = E[Var ( X |Y )] + Var [E ( X |Y )]
Example 9.6.1
A quality control plan for an assembly line involves sampling n = 10 finished items per day and counting Y, the number of defective items. If p denotes the probability of observing a defective item, then Y has a binomial distribution, when the number of items produced by the line is large. However, p varies from day to day and is assumed to have a uniform distribution on the interval from 0 to 1/4. What is the expected value of Y for any given day?
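By the first theorem, E(Y) = E[E(Y | p)] = E(10p) = 10 · (1/8) = 1.25, since a Uniform(0, 1/4) variable has mean 1/8. A Monte Carlo sketch (Python, seeded) agrees:

```python
import random
random.seed(3)

n, reps = 10, 20_000
total = 0
for _ in range(reps):
    p = random.uniform(0.0, 0.25)    # today's defect probability
    total += sum(random.random() < p for _ in range(n))  # Y | p ~ Binomial(10, p)
mc_mean = total / reps               # should be near E(Y) = 1.25
```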
We have seen the moment generating function in earlier chapters. There is also the proba-
bility generating function, G(t). This is defined as
G_X(t) = E(t^X).
For independent X and Y,
G_{X+Y}(t) = E(t^{X+Y}) = E(t^X)E(t^Y) = G_X(t)G_Y(t).
Example 9.7.1.
(This example is from Newman (1998).) Let's recall the discrete pmf for the roll of a six-sided die in Table 1.2 in Chapter 1.
X 1 2 3 4 5 6
P(X = x) 1/6 1/6 1/6 1/6 1/6 1/6
Then
G_X(t) = E(t^X) = (1/6)t + (1/6)t² + (1/6)t³ + (1/6)t⁴ + (1/6)t⁵ + (1/6)t⁶.
As we can see, this looks like a polynomial, whose coefficients are the probabilities and
the powers are the values of the random variable.
Suppose we roll the die twice and we are interested in the sum of the two rolls. Then
G_{X+Y}(t) = E(t^{X+Y}) = ((1/6)t + (1/6)t² + (1/6)t³ + (1/6)t⁴ + (1/6)t⁵ + (1/6)t⁶)²
= (1/36)t² + (2/36)t³ + (3/36)t⁴ + (4/36)t⁵ + (5/36)t⁶ + (6/36)t⁷ + (5/36)t⁸ + (4/36)t⁹ + (3/36)t¹⁰ + (2/36)t¹¹ + (1/36)t¹².
As we can see, the coefficients of this new polynomial contain the probabilities of the
values of the sums given in the exponents of the polynomial.
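The polynomial multiplication above is just a convolution of coefficient tables, which is easy to verify exactly (Python sketch using rational arithmetic):

```python
from fractions import Fraction

die = {k: Fraction(1, 6) for k in range(1, 7)}    # coefficients of G_X(t)

def pgf_product(a, b):
    # multiplying two PGFs convolves their coefficient dictionaries
    out = {}
    for i, pa in a.items():
        for j, pb in b.items():
            out[i + j] = out.get(i + j, Fraction(0)) + pa * pb
    return out

two_dice = pgf_product(die, die)   # coefficients of G_{X+Y}(t)
```

The coefficient of t⁷, for instance, comes out to 6/36 = 1/6, matching the expansion above.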
Question 1. Which of the following statements is NOT true according to the Central Limit
Theorem? Select all that apply.
a. The mean of a distribution of sample means is equal to the population mean divided
by the square root of the sample size.
b. The larger the sample size, the more the distribution of the sample means resembles
the shape of the original distribution of one random variable.
c. The mean of the distribution of sample means for samples of size n = 15 will be the
same as the mean of the distribution for samples of size n = 100.
d. The larger the sample size, the more the distribution of sample means will resemble
a normal distribution.
e. An increase in n will produce a distribution of sample means with a smaller
standard deviation.
a. The distribution of the sample obtained in one trial should be close to normal
b. If we plotted the distribution of the sample obtained in each of the 1000 trials, we
would have 1000 distributions that look like the normal.
c. The mean of the 500 random variables in each sample is close to 2000.
d. The distribution of the 1000 averages is exactly gamma.
e. The distribution of the 1000 averages is close to normal.
Question 3. Blood pressure in a population of very at risk people has expected value of 195
and a standard deviation of 20. Suppose you take a random sample of 100 of these people.
There would be a 68% chance that the average blood pressure would be between
Question 4. An airline knows that over the long run, 90% of passengers who reserve seats
show up for their flight. On a particular flight with 300 seats, the airline accepts 324 reser-
vations. Assuming that passengers show up independently of each other, what is the chance
that the flight will be overbooked?
a. 0.91
b. 0.455
c. 0.05297
d. 0.1
Question 5. The service times for customers coming through a checkout counter in a retail
store are independent random variables, with a mean of 1.5 minutes and a variance of 1.0
minute. Approximate the probability that 100 customers can be serviced in less than 2 hours
of total service time.
a. 0.4987
b. 0.5
c. 0.0013
d. 0.23
Question 7. The median age of residents of the United States is 31 years. If a survey of 100
randomly selected United States residents is taken, find the approximate probability that at
least 60 of them will be under 31 years of age.
a. 0.02
b. 0.5
c. 0.471
d. 5
a. exact probabilities that a random variable is in an interval in the real line
b. a bound for the probability that a random variable is in an interval in the real line
c. the expected value of a random variable
d. the variance of a random variable
f ( x , y ) = 6y , 0 ≤ y ≤ x ≤ 1
For this example, it is true that (circle all that apply)
Question 10. An online computer system is proposed. The manufacturer gives the information
that the mean response time is 10 seconds. Estimate the probability that the response time
will be more than 20 seconds.
Income
11.652
23.015
5.604
6.710
7.293
8.918
14.176
11.363
…..
…..
with each line representing the Income of a particular woman in the population.
N=
m=
s=
Distribution is:
2. Start simulation.
One trial: Draw a random sample of women from this population and analyze their income.
We now do 4 trials. Follow the steps indicated below.
2.1. Trial 1. Draw a random sample of 300 women from the population and look at the his-
togram of their incomes and the sample mean. We will use density histograms, because we
will be comparing histograms. Density histograms, like probability models for quantitative
variables, have proportion represented by the area under the curve. The area under the curve
represents the proportion of women.
sample2Income=sample(Income,300)
xbar2= mean(sample2Income)
hist(sample2Income,xlim=c(0,221),prob=T)
boxplot(sample2Income,ylim=c(0,221))
2.3. Trial 3. Now draw another random sample of 300 women from the same population and
repeat the analysis you did above.
sample3Income=sample(Income,300)
xbar3= mean(sample3Income)
hist(sample3Income,xlim=c(0,221),prob=T )
boxplot(sample3Income,ylim=c(0,221))
2.4. Trial 4. One more time: draw another random sample of 300 women from the same
population and repeat the analysis you did above.
sample4Income=sample(Income,300)
xbar4=mean(sample4Income)
hist(sample4Income,xlim=c(0,221),prob=T)
boxplot(sample4Income,ylim=c(0,221))
xbar1; xbar2; xbar3;xbar4
dev.off() # close the graph window; don't type this until you have copied your graph
Question 2. Does each of the four histograms resemble the population distribution? Sum-
marize their shape.
Question 3. Look at all your four boxplots. Are there outliers in any of them? What can you
conclude about where the majority of the salaries are in each sample if you excluded the
outliers (if you got any)?
Question 4. Let’s see now what are the sample means ( X 1 , X 2 , X 3 , X 4 ) that we obtained and
compare them. Write them down and describe whether they are very different or not. Can
you explain what you found?
Comment on the shape of the distribution that would be expected if the Central Limit
Theorem holds.
Exercise 2. Let X be a continuous random variable with the following density function:
1
f (x) = x , 0 ≤ x ≤ 2.
2
What is the probability that in 200 tosses of such a die we would get more than 120 odd
numbers? Show work.
Exercise 4. Consider Exercise 10 in Section 7.2.1. Assume that for this particular matter, a = 0.5.
Calculate the probability that the average angle of 100 emitted electrons is larger than 0.5.
Exercise 5. The monthly salary of women that are in the labor force in a large town, Y, fol-
lows a gamma distribution with expected value 500 and variance 125. The city is planning
to obtain a random sample of 100 women to obtain some information. The city is plan-
ning to ask the 100 women about their salary. (i) What is the probability that the average
salary of the 100 women in the sample is larger than 1,500? (ii) If, in this town, 20% of the
women have a monthly salary larger than 3,000, what is the probability that in the random
sample of 100 women more than 20% of them make a salary larger than 3,000? Show work.
(iii) Would W = (4/100) Σᵢ₌₁¹⁰⁰ Yi be an unbiased estimator of the average salary of the women in the
large town? (Note: “unbiased” means that the expected value of W equals the mean of the
population of women.) (iv)What is the joint distribution of the 100 random variables with
the same density function as in this problem?
Exercise 6. In 1,000 flips of a supposedly fair coin, heads came up 560 times and tails 440 times.
What is the probability that a number of heads that large or larger occurs if the coin is fair?
Exercise 7. Demonstration of the law of large numbers with simulation with R. Run the fol-
lowing code, and report in the answer to this problem only what is being asked.
1. Give the 20 rolls of a fair six-sided die, the approximate (empirical) discrete distribution
from those rolls and a plot of that distribution. Does it look like the probability mass
function for the roll of a fair six-sided die? How far are you from it?
plus20=sample(1:6,20,rep=T)
first40=append(first20,plus20)
first40 #See your numbers
table(first40)/40
X=1:6
plot(table(first40)/40, xlab="X", ylab="empirical probability (40 rolls)", ylim=c(0,1), type="h")
3. Keep rolling: roll another 60 times, and append these new numbers to the ones you
already got to have 100 numbers
more60=sample(1:6,60,rep=T)
first100=append(first40,more60)
first100 # see your numbers.
table(first100)/100
plot(table(first100)/100, xlab="X", ylab="empirical probability (100 rolls)", ylim=c(0,1), type="h")
4. Now roll an additional 1000 times, append these 1000 new numbers to the 100 you
already have obtained and find the distribution and plot of the 1100 numbers.
5. Explain what is happening in a, b, c, d. Do you think the law of large numbers is at
work? Why? How is this behavior different from what you would observe if you were
illustrating the Central Limit Theorem with a simulation. Explain the difference. (no
need to simulate).
What if the die had not been fair? What would be the final distribution you would observe after rolling 20, 100, and 1000 times? Rewrite the code. Most of the code will be the same except the first command. For example,
sample(1:6, 20, rep=T, prob=c(0.5, 0.05, 0.2, 0.05, 0.1, 0.1))
Chatterjee, Samprit, Mark S. Handcock, and Jeffrey S. Simonoff. 1995. A Casebook for a First
Course in Statistics and Data Analysis. John Wiley & Sons, Inc.
Dinov, Ivo, Nicolas Christou and Juana Sanchez. (2008). “Central Limit Theorem: New SOCR
Applet and Demonstration Activity.” Journal of Statistics Education Volume 16, Number 2
(2008) http://www.amstat.org/publications/jse/v16n2/dinov.html
Freedman, David, Robert Pisani, and Roger Purves. 1998. Statistics. Third Edition. W.W. Norton
& Company.
Grinstead, Charles M., and J. Laurie Snell. 1997. Introduction to Probability. Second revised
edition. American Mathematical Society.
Kinney, John J. 2002. Statistics for Science and Engineering. Addison-Wesley.
Lanier, Jaron. 2018. Ten Arguments for Deleting Your Social Media Accounts Right Now. New
York: Henry Holt and Company.
Mosteller, Frederic, Robert E. K. Rourke, and George B. Thomas. 1967. Probability and Statistics.
Addison-Wesley.
Newman, Donald J. 1998. Analytic Number Theory. Springer Verlag.
Parzen, Emanuel. 1960. Modern Probability Theory and Its Applications. New York. John Wiley
and Sons, Inc.
Pitman, Jim. 1993. Probability. Springer Texts in Statistics.
Plewis, Ian. 2014. “Indian farmer suicides. Is GM cotton to blame?” Significance 11, no. 1
(February): 14–18.
Purdam, Kingsley, Sam Royston, and Graham Whitham. 2017. "Measuring the 'Poverty Penalty' in the UK." Significance, August 2017: 34-37.
Rice, John A. 2007. Mathematical Statistics and Data Analysis. Thomson Brooks/Cole.
• Eight people who suffer from migraine headaches volunteer to take part in
a medical study of the effect of a new drug on migraine headaches. The
names of the volunteers are:
Your job is to allocate half of these people to the experimental group taking
the drug and the other half to the control group that will not take the drug. How
would you do it?
This is done on a daily basis by all clinical trials out there to determine the effects
of drugs on people. Clinical trials and intervention studies must allocate subjects
at random to treatment and control groups. The simplest way of allocation is by
selecting at random six people to be in the treatment group. The remaining ones
will be in the control group. The point for us is that random numbers are used for
the allocation, namely the following command in R would do the job.
sample(1:12, 6, replace=F)
10.2 What model fits your data?
The following list of 24 test scores has an average of approximately 50 and a standard
deviation of approximately 10.
29, 36, 37, 39, 41, 44, 47, 48, 49, 50, 50, 52, 52, 53, 54, 56, 58, 59, 62, 64, 65.
How many scores are within one standard deviation of the mean? Is that the number of
scores that you would have gotten using the normal model with the same mean and the
same standard deviation?
The number of scores between 40 and 60, within one standard deviation in the data,
is 16. The normal model predicts 0.68*24 = 16.32, so approximately correct.
Model fitting to data is one of the day-to-day activities of statisticians and data scientists.
One possible method goes as follows:
a. Data are available. For example, consider the baby boom data set presented in
Table 10.1 below. This data set can be downloaded from http://ww2.amstat.org/
publications/jse/datasets/babyboom.dat.txt
b. The Poisson distribution could be fit to the number of births per hour and the
empirical proportion of births found in the data each hour could be compared to
the theoretical number of births per hour predicted by a Poisson model, using
the average number of births per hour of the data as the proxy for the param-
eter of the theoretical Poisson. Dunn (1999) did this. The results are found in
Table 10.2.
c. The statisticians are not happy with just a table like that. They need to determine
some criteria to accept that table as indication that the Poisson model is a good fit.
To this end, statisticians design test statistics. These test statistics are summaries
that themselves are random variables and have sampling distributions. One test
statistic is the chi-square goodness-of-fit statistic, which follows a chi-square
density function.
Table 10.2 does not give the chi-square statistic. The reader should try to compute it after
studying Section 10.5. But bear in mind that the criterion for statistically determining whether
the Poisson fits a data set is probability based.
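As a preview of Section 10.5, the chi-square statistic itself is only a few lines of arithmetic. The sketch below uses invented hourly birth counts (NOT the babyboom data, just a placeholder) and compares cells 0, 1, 2, 3+ to Poisson expectations built from the sample mean:

```python
import math
from collections import Counter

# hypothetical hourly birth counts, for illustration only
observed = [0, 1, 2, 2, 1, 3, 0, 1, 2, 4, 1, 2,
            0, 3, 2, 1, 1, 2, 3, 0, 2, 1, 2, 2]
lam = sum(observed) / len(observed)      # sample mean as proxy for lambda

counts = Counter(min(x, 3) for x in observed)   # cells: 0, 1, 2, 3+
probs = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(3)]
probs.append(1.0 - sum(probs))                  # P(3 or more)
expected = [len(observed) * q for q in probs]

chisq = sum((counts.get(k, 0) - e) ** 2 / e for k, e in enumerate(expected))
```

A large value of `chisq` relative to the chi-square density would cast doubt on the Poisson fit; the comparison itself is probability based, as noted above.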
10.3 Communications
Example 10.3.1
(This exercise is based on Grami (2016, chapter 4).) In a binary symmetric communication
(BSC) channel, the input bits transmitted over the channel are either 0 or 1 with probabilities
p and 1 − p, respectively. Due to channel noise, errors are made.
If a channel is assumed to be symmetric, the probability of receiving 1 when 0 is transmitted
is the same as the probability of receiving 0 when 1 is transmitted.
The conditional probabilities of error are assumed to each be ε. Determine the probability of error, also known as the bit error rate, as well as the a posteriori probabilities.
The a priori probabilities of the transmitted bits are p and 1 − p, and the bit error rate is

P(error) = εp + ε(1 − p) = ε.

For ε = 0, i.e., when the channel is ideal, both a posteriori probabilities are one.
For ε = 1/2, the a posteriori probabilities are the same as the a priori probabilities.
For ε = 1, when the channel is most destructive, both a posteriori probabilities are zero.
It is very insightful to note that in the absence of a channel, the optimum receiver, which minimizes the average probability of error, P(error), would always decide in favor of the bit whose a priori probability is the greatest. Moreover, if P(error) > 1/2, that is, if more often than not an error is made, an inverter can then be employed to reduce the bit error rate to 1 − P(error) < 1/2, simply by turning a 1 into a 0 and a 0 into a 1.
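The error rate and the a posteriori probabilities follow directly from Bayes' rule; a small sketch (Python; p and ε are free parameters):

```python
def bsc(p0, eps):
    """Binary symmetric channel: p0 = P(send 0), eps = crossover probability.

    Returns the bit error rate and the a posteriori probabilities
    P(sent 0 | received 0) and P(sent 1 | received 1), via Bayes' rule.
    """
    bit_error_rate = eps * p0 + eps * (1 - p0)      # = eps
    post0 = (1 - eps) * p0 / ((1 - eps) * p0 + eps * (1 - p0))
    post1 = (1 - eps) * (1 - p0) / ((1 - eps) * (1 - p0) + eps * p0)
    return bit_error_rate, post0, post1
```

For ε = 0 both posteriors are 1, and for ε = 1/2 they collapse to the priors, matching the special cases above.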
10.3.1 Exercises
Exercise 1. In a digital communication system, bits are transmitted over a channel in which
the bit error rate is assumed to be 0.0001. The transmitter sends each bit five times, and a
decoder takes a majority vote of the received bits to determine what the transmitted bit was.
Determine the probability that the receiver will make an incorrect decision.
Exercise 2. The normal distribution represents the distribution of thermal noise in signal
transmission. For example, in a communication system, the noise level is modeled as a
Gaussian random variable with mean 0, and σ2 = 0.0001. What is the probability that the
noise level is larger than 0.01?
Before quantum mechanics, it was thought that the accuracy of any measurement was lim-
ited only by the accuracy of the instruments used. However, Werner Heisenberg showed that
no matter how accurate the instruments used, quantum mechanics limits the accuracy when
two properties are measured at the same time. For the moving electron, the properties or
variables are considered in pairs: momentum and position are one pair, and energy and time
are another. The conclusion that came from Heisenberg’s theory was that the more accurate
the measurement of the position of an electron or any particle for that matter, the more
inaccurate the measurement of momentum, and vice versa. In the most extreme case, 100%
accuracy of one variable would result in 0% accuracy in the other.
https://history.aip.org/history/exhibits/heisenberg/p08a.htm
http://www.umich.edu/~chem461/QMChap7.pdf
Step 1. We simulate where the electron is located by using R's random number generator.
Have the program generate random integers between 1 and 100. Numbers from 1 to 32
will represent an electron within 1 AU, numbers from 33 to 74 within 2 AU, and so on. You would
have to generate two numbers at once, each number representing the electron position of
one atom.
Step 2. One trial will consist of repeatedly generating the two numbers until one number is
between 1 and 32 and the other is between 95 and 100 (both must occur at the same time),
which is the correct configuration for the atoms to exchange electrons and bond.
Step 3. The quantity to keep track of is how many pairs of numbers the random number
generator had to produce until one number was between 1 and 32 and the other between
95 and 100.
Step 4. Repeat steps 2 and 3 many times. The simulation ends when ten successful trials,
or bonds, have occurred, which requires a large number of number generations. The high
number of trials makes the calculated probability of the two atoms bonding more accurate.
Step 5. The proportion of successful pairs, that is, the number of bonds divided by the total
number of pairs generated, estimates the probability that two atoms encountering each
other have the correct electron configuration to bond.
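The steps above can be sketched in code. The text suggests R's random number generator; the same procedure in Python, assuming the cutoffs 1–32 (within 1 AU) and 95–100 (the 3–4 AU region, matching the 6/100 used in the calculation that follows), is:

```python
import random

def estimate_bond_probability(n_bonds=10, seed=1):
    """Steps 2-5: generate pairs of integers in 1..100 until one is in
    1..32 (within 1 AU) and the other in 95..100 (3-4 AU); repeat until
    n_bonds successes, then return successes / total pairs generated."""
    rng = random.Random(seed)
    total_pairs = 0
    bonds = 0
    while bonds < n_bonds:
        a, b = rng.randint(1, 100), rng.randint(1, 100)
        total_pairs += 1
        # Either atom may hold the within-1-AU electron, which is the
        # source of the factor of 2 in the analytic calculation.
        if (a <= 32 and b >= 95) or (b <= 32 and a >= 95):
            bonds += 1
    return bonds / total_pairs

print(estimate_bond_probability())  # should be near (32/100)(6/100)(2) = 0.0384
```

With only ten bonds the estimate is noisy; raising `n_bonds` brings it closer to 0.0384.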
Without doing the simulation (which would take a long time), one can calculate the prob-
ability: (32/100)(6/100)(2) = 0.0192(2) = 0.0384.
You multiply, not add, the two probabilities because both events must occur at the same
time and the two positions are independent. Then you multiply by 2 because the 3–4 AU
electron and the within-1-AU electron can belong to either atom. Therefore you would
expect the atoms to bond about 4 times out of every 100 times they are put near each other,
or roughly 4% of the time.
A weather model predicts the rainfall for a certain month (a month = 31 days) in your area;
the predictions correspond to the expected counts in Table 10.3. After the month is over,
the actual rainfall is recorded, giving the observed counts.
Of interest is how good this model is. In other words, did the model predict the actual
rainfall well?
Statisticians use the chi-square test to answer this question. The test uses the probability
distribution of a chi-square random variable. Here is what they do.
First, they calculate a sample summary statistic, as in Table 10.3.
Table 10.3

Outcome              Observed (O)   Expected (E)   (O − E)²   (O − E)²/E
No rain                   13             15            4         0.266
Rain but < 1 inch         11             10            1         0.100
> 1 inch rain              7              6            1         0.166
When we add the numbers in the last column, the total is 0.532. This is the value of the
chi-square statistic with 2 degrees of freedom (3 categories − 1 = 2):

χ²₂ = 0.532.
The next thing is to look at P(χ²₂ > 0.532), the probability that a chi-square random variable
with 2 degrees of freedom exceeds the observed statistic. We find from the tables that this
probability is larger than 0.25. This means it is not unlikely that we would see, just by chance,
a difference from the expected values as large as the one observed. Thus, the statistician
would conclude that the forecasting model is not systematically deviating from the actual
weather; the discrepancies observed are what one would expect from chance variation alone.
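The calculation above can be reproduced in a few lines. For 2 degrees of freedom the chi-square upper-tail probability has the exact closed form exp(−x/2), so no table lookup is needed (variable names are illustrative):

```python
from math import exp

observed = [13, 11, 7]
expected = [15, 10, 6]

# Chi-square statistic: sum of (O - E)^2 / E over the categories.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# For 2 degrees of freedom, P(chi-square > x) = exp(-x / 2) exactly.
p_value = exp(-chi2 / 2)

print(round(chi2, 3))     # about 0.533 (0.532 in the table, after truncation)
print(round(p_value, 3))  # well above 0.25, so no evidence of systematic deviation
```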
Geographers know the coordinates of the points on earth that they are interested in studying.
For example, medical geographers are interested in the spatial patterns of mortality from
various diseases. If certain areas show higher-than-expected mortality, the geographers claim
to have found a pattern, a nonrandomness that must have an identifiable cause. Then
they seek the cause.
To conduct this type of analysis, geographers must design metrics and find the probabil-
ity distributions of these metrics. The probability distributions that they use must convey
distance and location somehow, and they must convey the correlation of the observations.
The task of geographical data analysis is not too different from that of time series data
analysis. Time series data is data that has been collected over time. Both in geographical
statistics and in time series analysis, the relevant models are multivariate, like those we
studied in Chapters 6 and 8 in this book. The correlations among the different observations
must be modeled, and that requires multivariate modeling.
Dunn, Peter K. 1999. "A Simple Dataset for Demonstrating Common Distributions." Journal of
Statistics Education 7 (3). American Statistical Association.
Grami, Ali. 2016. Introduction to Digital Communications. Cambridge: Academic Press.
Park, Dalnam. 2004. Final project for Stats 19. Reproduced with permission of the author.