MGMT 2262: Applied Business Statistics
Collection Editor:
Brad Quiring
Authors:
OpenStax
Lyryx Learning
Collette Lemieux
Brad Quiring
Online:
< http://cnx.org/content/col23820/1.2/ >
OpenStax-CNX
This selection and arrangement of content as a collection is copyrighted by Brad Quiring. It is licensed under the
Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).
Collection structure revised: December 19, 2017
PDF generated: December 19, 2017
For copyright and attribution information for the modules contained in this collection, see p. 325.
Table of Contents
1 Business Statistics - Module 1 - Data collection and descriptive statistics
1.1 Chapter 1: Introduction to Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Chapter 2: Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
2 Business Statistics - Module 2 - Probability
2.1 Chapter 3: Probability Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
2.2 Chapter 4: Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . 140
2.3 Chapter 5: Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
2.4 Chapter 6: Sampling Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
3 Business Statistics - Module 3 - Confidence Intervals and Hypothesis Tests
3.1 Chapter 7: Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
3.2 Chapter 8: Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
4 Business Statistics - Module 4 - Linear Regression and Correlation
4.1 Introduction to Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . 257
4.2 The Correlation Coefficient r . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
4.3 Testing the Significance of the Correlation Coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
4.4 Linear Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
4.5 The Regression Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
4.6 Interpretation of Regression Coefficients: Elasticity and Logarithmic Transformation . . . . . . . 289
4.7 Predicting with a Regression Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . 293
4.8 How to Use Microsoft Excel® for Regression Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
Attributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .325
Figure 1.1: We encounter statistics in our daily lives more often than we probably realize and from
many different sources, like the news. (credit: David Sim)
CHAPTER 1. BUSINESS STATISTICS - MODULE 1 - DATA COLLECTION AND DESCRIPTIVE STATISTICS
You are probably asking yourself the question, "When and where will I use statistics?" If you read any
newspaper, watch television, or use the Internet, you will see statistical information. There are statistics
about crime, sports, education, politics, and real estate. Typically, when you read a newspaper article or
watch a television news program, you are given sample information. With this information, you may make
a decision about the correctness of a statement, claim, or "fact." Statistical methods can help you make the
"best educated guess."
Since you will undoubtedly be given statistical information at some point in your life, you need to know
some techniques for analyzing the information thoughtfully. Think about buying a house or managing a
budget. Think about your chosen profession. The fields of economics, business, psychology, education,
biology, law, computer science, police science, and early childhood development require at least one course
in statistics.
Included in this chapter are the basic ideas and words of probability and statistics. You will soon
understand that statistics and probability work together. You will also learn how data are gathered and
how "good" data can be distinguished from "bad."
For example, we may wonder if there is a gap between how much men and women are paid for doing the
same job. This would be the problem we want to investigate. Before we do the investigation, we would
want to spend some time defining the problem. This could include defining terms (e.g. what do we mean by
paid? what constitutes the same job?). Then we would want to state a research question. A research
question is the overarching question that the study aims to address. In this example, our research question
might be: Does the gender wage gap exist?
Once we have the problem clearly defined, we need to figure out how we are going to study the problem.
This would include determining how we are going to collect the data for the study. Since it is unlikely we
are going to find out the salary and position of every employee in the world (i.e. the population), we need to
instead collect data from a subset of the whole (i.e. a sample). The process of how we will collect the data
is called the sampling technique. The overall plan of how the study is designed is called the sampling
design or methodology.
Once we have the methodology, we want to implement it and collect the actual data.
When we have the data, we will learn how to organize and summarize data. Organizing and summarizing
data is called descriptive statistics. Two ways to summarize data are by visually summarizing the data
(for example, a histogram) and by numerically summarizing the data (for example, the average). After we
have summarized the data, we will use formal methods for drawing conclusions from "good" data. The
formal methods are called inferential statistics. Statistical inference uses probability to determine how
confident we can be that our conclusions are correct.
Once we have summarized and analyzed the data, we want to see what kind of conclusions we can
draw. This would include attempting to answer the research question and recognizing the limitations of the
conclusions.
In this course, most of our time will be spent in the last two steps of the statistical analysis process
(i.e. organizing, summarizing and analyzing data). To understand the process of making inferences from the
data, we must also learn about probability. This will help us understand the likelihood of random events
occurring.
In statistics, we generally want to study a population. You can think of a population as a collection of
persons, things, or objects under study. The person, thing or object under study (i.e. the object of study) is
called the observational unit. What we are measuring or observing about the observational unit is called the variable.
We often use the letters X or Y to represent a variable. A specific instance of a variable is called data.
Example 1.1
Suppose our research question is Do current NHL forwards who make over $3 million a year score,
on average, more than 20 points a season?
The population would be all of the NHL forwards who make over $3 million a year and who are
currently playing in the NHL. The observational unit would be any forward that meets the criteria.
The variable is the number of points a forward in the population gets in a season. A data value would
be the actual number of points.
In the above example, it would be reasonable to look at the population when doing the statistical analysis.
But this is not always the case. For example, suppose you want to study the average profits of oil and gas
companies in the world. It might be very hard to get a list of all of the oil and gas companies in the world
and to get access to their financial reports. When the population is not easily accessible, we instead look at a
sample. The idea of sampling (the process of collecting the sample) is to select a portion (or subset) of
the larger population and study that portion (the sample) to gain information about the population. Data
are the result of sampling from a population.
Because it takes a lot of time and money to examine an entire population, sampling is a very practical
technique. If you wished to compute the overall grade point average at your school, it would make sense
to select a sample of students who attend the school. The data collected from the sample would be the
students' grade point averages. In federal elections, opinion poll samples of 1,000 to 2,000 people are taken.
The opinion poll is supposed to represent the views of the people in the entire country. Manufacturers of
canned carbonated drinks take samples to determine if a 16 ounce can contains 16 ounces of carbonated
drink.
It is important to note that though we might not know every member of the population, when we decide to
sample from it, the population is fairly static. Going back to the example of the NHL forwards, if we were to
gather the data for the population right now, that would be our fixed population. But if you took a sample
from that population and your friend took a sample from that population, it is not surprising that you and
your friend would get a different sample. That is, there is one population, but there are many, many different
samples that can be drawn from the population. How the samples vary from each other is called sampling variability. The idea
of sampling variability is a key concept in statistics and we will come back to it over and over again.
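To see sampling variability concretely, here is a minimal Python sketch (the population values are invented for illustration): two random samples drawn from the same fixed population almost always produce different sample means.

import random

random.seed(1)  # for a reproducible illustration

# Hypothetical fixed population: season point totals for 300 players
population = [random.randint(0, 90) for _ in range(300)]

# Two different random samples from the same fixed population
sample_a = random.sample(population, 30)
sample_b = random.sample(population, 30)

print(sum(sample_a) / len(sample_a))  # mean of sample A
print(sum(sample_b) / len(sample_b))  # mean of sample B (almost surely different)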
As mentioned above, a variable, or random variable, notated by capital letters such as X and Y, is a
characteristic of interest for each person or thing in a population. Data are the actual values of the variable.
Data and variables fall into two general types: either they are measuring something or they are not.
When a variable is measuring or counting something, it is called a quantitative variable and
the data is called quantitative. When a variable is not measuring or counting something, it is called a
categorical variable and the data is called categorical data. For a variable to be considered quantitative,
the distance between each number has to be fixed. In general, quantitative variables measure something and
take on values with equal units such as weight in pounds or number of people in a line. Categorical variables
place the person or thing into a category such as colour of car or opinion on topic.
Example 1.2
• In the NHL forwards example, the variable is quantitative as we are investigating the number of
points a player has.
• In the gender gap example, there were three variables: the salary, gender, and the position.
The salary is a quantitative variable as we are investigating the amount people make. Gender
is a categorical variable as we are categorizing someone's gender. Position is also categorical
as we are categorizing their type of employment.
• Sometimes, though, determining the type of a variable (i.e. quantitative or categorical) is not
cut and dried. In particular, Likert scales or rating scales are tricky to place. A Likert
scale is any scale where you are asked to state your opinion on a scale. For example, you
may be asked whether you strongly agree, agree, are neutral, disagree or strongly disagree with a
statement. Sometimes there is a number associated with the rating. For example, write 5 if
you strongly agree and 1 if you strongly disagree. Technically, Likert scale data is categorical,
as we are categorizing people's opinions and the number is just a short form for the
category.
tip: When you are asked to categorize the data or variable, first determine what the observational
unit is. Then determine the variable being studied. Then think about what the data will look like.
If the data is a number, then it is usually quantitative data (be wary of Likert scales). If the data
is a word or category, then it is categorical data.
Two words that come up often in statistics are mean and proportion. These are two examples of numerical
descriptive statistics. If you were to take three exams in your math classes and obtain scores of 86, 75, and
92, you would calculate your mean score by adding the three exam scores and dividing by three (your mean
score would be 84.3 to one decimal place). If, in your math class, there are 40 students and 22 are men, then
the proportion of men in the course is 55% and the proportion of women is 45%.
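A minimal Python sketch of these two calculations (the numbers come from the examples above):

# Mean of the three exam scores
scores = [86, 75, 92]
mean_score = sum(scores) / len(scores)
print(round(mean_score, 1))  # 84.3

# Proportion of men in a class of 40 students, 22 of whom are men
proportion_men = 22 / 40
print(proportion_men)  # 0.55, i.e. 55%; the proportion of women is 1 - 0.55 = 0.45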
From the sample data, we can calculate a statistic. A statistic is a number that represents a property
of the sample. For example, if we consider one math class to be a sample of the population of all math
classes, then the mean number of points earned by students in that one math class at the end of the term is
an example of a statistic. The statistic is an estimate of a population parameter, in this case the mean. A
parameter is a number that is a property of the population. Since we considered all math classes to be the
population, then the mean number of points earned per student over all the math classes is an example of
a parameter (i.e. the population mean). If we took a sample of students from the math class and found the
mean points earned per student in the sample, then we would have found a statistic (i.e. the sample mean).
One of the main concerns in the field of statistics is how accurately a statistic estimates a parameter.
The accuracy really depends on how well the sample represents the population. The sample must contain
the characteristics of the population in order to be a representative sample. We are interested in both the
sample statistic and the population parameter in inferential statistics. In a later chapter, we will use the
sample statistic to test the validity of the established population parameter.
Example 1.3
Determine what the key terms refer to in the following study. We want to know the average
(mean) amount of money first year college students spend at ABC College on school supplies that
do not include books. We randomly survey 100 first year students at the college. Three of those
students spent $150, $200, and $225, respectively.
Solution
The population is all first year students attending ABC College this term.
The sample could be all students enrolled in one section of a beginning statistics course at
ABC College (although this sample may not represent the entire population).
The parameter is the average (mean) amount of money spent (excluding books) by first year
college students at ABC College this term.
The statistic is the average (mean) amount of money spent (excluding books) by first year
college students in the sample.
The variable could be the amount of money spent (excluding books) by one first year student.
Let X = the amount of money spent (excluding books) by one first year student attending ABC
College.
The data are the dollar amounts spent by the first year students. Examples of the data are
$150, $200, and $225.
Example 1.4
Determine what the key terms refer to in the following study.
A study was conducted at a local college to analyze the average cumulative GPAs of students
who graduated last year. Fill in the letter of the phrase that best describes each of the items below.
1._____ Population 2._____ Statistic 3._____ Parameter 4._____ Sample 5._____
Variable 6._____ Data
a. all students who attended the college last year
b. the cumulative GPA of one student who graduated from the college last year
c. 3.65, 2.80, 1.50, 3.90
d. a group of students who graduated from the college last year, randomly selected
e. the average cumulative GPA of students who graduated from the college last year
f. all students who graduated from the college last year
g. the average cumulative GPA of students in the study who graduated from the college last year
Solution
1. f; 2. g; 3. e; 4. d; 5. b; 6. c
Example 1.5
Determine what the key terms refer to in the following study.
As part of a study designed to test the safety of automobiles, the National Transportation Safety
Board collected and reviewed data about the effects of an automobile crash on test dummies. Here
is the criterion they used:
Table 1.1
Speed at which cars crashed: 35 miles per hour
Location of driver (i.e. dummies): front seat
Cars with dummies in the front seats were crashed into a wall at a speed of 35 miles per hour.
We want to know the proportion of dummies in the driver's seat that would have had head injuries,
if they had been actual drivers. We start with a simple random sample of 75 cars.
Solution
The population is all cars containing dummies in the front seat.
The sample is the 75 cars, selected by a simple random sample.
The parameter is the proportion of driver dummies (if they had been real people) who would
have suffered head injuries in the population.
The statistic is the proportion of driver dummies (if they had been real people) who would have
suffered head injuries in the sample.
The variable X = the number of driver dummies (if they had been real people) who would
have suffered head injuries.
The data are either: yes, had head injury, or no, did not.
Example 1.6
Determine what the key terms refer to in the following study.
An insurance company would like to determine the proportion of all medical doctors who have
been involved in one or more malpractice lawsuits. The company selects 500 doctors at random
from a professional directory and determines the number in the sample who have been involved in
a malpractice lawsuit.
Solution
The population is all medical doctors listed in the professional directory.
The parameter is the proportion of medical doctors who have been involved in one or more
malpractice suits in the population.
The sample is the 500 doctors selected at random from the professional directory.
The statistic is the proportion of medical doctors who have been involved in one or more
malpractice suits in the sample.
The variable X = the number of medical doctors who have been involved in one or more
malpractice suits.
The data are either: yes, was involved in one or more malpractice lawsuits, or no, was not.
1.1.2.3 Chapter Review
The mathematical theory of statistics is easier to learn when you know the language. This module presents
important terms that will be used throughout the text.
1.1.2.4 HOMEWORK
For each of the following eight exercises, identify: a. the population, b. the sample, c. the parameter, d. the
statistic, e. the variable, and f. the data. Give examples where appropriate.
Exercise 1.1.2.3
A fitness center is interested in the mean amount of time a client exercises in the center each week.
Exercise 1.1.2.4 (Solution on p. 77.)
Ski resorts are interested in the mean age that children take their first ski and snowboard lessons.
They need this information to plan their ski classes optimally.
Exercise 1.1.2.5
A cardiologist is interested in the mean recovery period of her patients who have had heart attacks.
Exercise 1.1.2.6 (Solution on p. 77.)
Insurance companies are interested in the mean health costs each year of their clients, so that they
can determine the costs of health insurance.
Exercise 1.1.2.7
A politician is interested in the proportion of voters in his district who think he is doing a good
job.
Exercise 1.1.2.8 (Solution on p. 77.)
A marriage counselor is interested in the proportion of clients she counsels who stay married.
Exercise 1.1.2.9
Political pollsters may be interested in the proportion of people who will vote for a particular
cause.
Exercise 1.1.2.10 (Solution on p. 77.)
A marketing company is interested in the proportion of people who will buy a particular product.
Use the following information to answer the next three exercises: A Lake Tahoe Community College instruc-
tor is interested in the mean number of days Lake Tahoe Community College math students are absent from
class during a quarter.
Exercise 1.1.2.11
What is the population she is interested in?
a. all Lake Tahoe Community College students.
b. all Lake Tahoe Community College English students.
c. all Lake Tahoe Community College students in her classes.
d. all Lake Tahoe Community College math students.
Exercise 1.1.2.12 (Solution on p. 77.)
Consider the following: X = number of days a Lake Tahoe Community College math student is
absent. In this case, X is an example of a:
a. variable.
b. population.
c. statistic.
d. data.
Exercise 1.1.2.13
The instructor's sample produces a mean number of days absent of 3.5 days. This value is an
example of a:
a. parameter.
b. data.
c. statistic.
d. variable.
Example 1.9
You go to the supermarket and purchase three cans of soup (19 ounces tomato bisque, 14.1 ounces
lentil, and 19 ounces Italian wedding), two packages of nuts (walnuts and peanuts), four different
kinds of vegetables (broccoli, cauliflower, spinach, and carrots), and two desserts (16 ounces Cherry
Garcia ice cream and two pounds (32 ounces) chocolate chip cookies).
Problem
Name data sets that are quantitative discrete, quantitative continuous, and qualitative.
Solution
One Possible Solution:
• The three cans of soup, two packages of nuts, four kinds of vegetables and two desserts are
quantitative discrete data because you count them.
• The weights of the soups (19 ounces, 14.1 ounces, 19 ounces) are quantitative continuous data
because you measure weights as precisely as possible.
• Types of soups, nuts, vegetables and desserts are qualitative data because they are categorical.
note: You may collect data as numbers and report it categorically. For example, the quiz scores for
each student are recorded throughout the term. At the end of the term, the quiz scores are reported
as A, B, C, D, or F.
Example 1.11
A statistics professor collects information about the classification of her students as freshmen,
sophomores, juniors, or seniors. The data she collects are summarized in the pie chart Figure 1.2.
What type of data does this graph show?
Figure 1.2
Solution
This pie chart shows the students in each year, which is categorical data.
1.1.3.1 Sampling
Gathering information about an entire population often costs too much or is virtually impossible. Instead,
we use a sample of the population. To collect the sample, a sampling technique is used. Not all sampling
techniques are created equal, though. A good sampling technique meets the following criteria:
• The sample is collected randomly (chance, not human choice, determines who is selected).
• The sample is representative of the population.
• The sample is large enough.
note: Humans in general are not very random. Therefore, the randomness added to the sampling
technique cannot be someone randomly choosing something. The randomness has to come from
a random event (like rolling dice, flipping a coin, or using a random number generator).
A sample is representative if it shares similar characteristics to the population. For example, suppose that
the students at a university are distributed as follows by faculty:
• Business: 20%
• Arts: 25%
• Science and Engineering: 30%
• Nursing: 15%
• Education: 10%
Then a sample would be representative of this population if the distribution of the students' faculty in the
sample was similar to the population. It doesn't have to be exactly the same, but it should be close. A
random sample will generate a fairly representative sample, but it doesn't guarantee it.
note: What makes a sample representative depends on what is being studied. For example, if
we are looking at the average age of students at a university, making sure we get students from
each faculty would be important, but making sure we get students from various political affiliations
might not be.
Determining if a sample is large enough is a bit arbitrary and depends on the situation. In general, the
larger the sample size the better, but issues such as time and money need to be taken into account. You
don't want to interview 5,000 people when 50 people would do. In Chapter 7, we will look at a formula that
determines how many members of a population need to be in a sample depending on the level of error we
are comfortable with. Until then, as a general rule, if the data is quantitative, a sample of at least 30 is
usually good enough, while if the data is categorical, a sample of at least 100 is usually good enough.
In general, even if a sample is collected extremely well, it will not be perfectly representative of the pop-
ulation. The discrepancy between the sample and the population is called chance error due to sampling.
When dealing with samples, there will always be error. Statistics helps us to understand and even measure
this error. As a rule, the larger the random sample, the smaller the sampling error.
Areas of concern for sampling bias
When people publish their research, they include a description of their sampling technique. This is called
the methodology. When evaluating a sampling technique, check to see if the sample was collected randomly,
if it is representative of the population, and if the sample is large enough. Here are some examples of areas
of concern when looking at methodologies:
1. Undercoverage occurs when some members of the population are excluded from the process of selecting
the sample. For example, if no one from the faculty of nursing is included in the sample, then we would
say that the faculty of nursing is undercovered. This has been a specific concern in scientific research
over the years. For example, women have been traditionally excluded from drug studies because of
their menstrual cycles, but this results in the research only indicating how well the drug works for men.
2. Nonresponse bias occurs when a member of the population that is selected as part of the sample cannot
be contacted or refuses to participate. Have you ever refused to be part of a telephone study? If so,
you are contributing to nonresponse bias.
• Similar to nonresponse bias is voluntary response bias. Here a large segment of the population
is contacted and people choose to participate or not. Examples of this are mail-out surveys or
online polls. In these situations, usually the person is very invested in the issue so that is why
they take the time to answer. This results in non-representative samples.
• Response rate is a measure of how many people responded out of the total contacted. If the
response rate is low, then this suggests a very narrow segment of the population answered. This
would raise concerns about representativeness.
3. Asking potentially awkward questions might result in untruthful responses. This is called response
bias. For example, if you are asked if you have ever had a sexually transmitted infection, you may not
want to divulge that. One way to minimize response bias is to allow participants in a study to answer
the questions anonymously.
4. Improper wording of questions being asked might result in skewed answers. Here is an example of a
question that skews the results:
• Do you think it should be easier for seniors to make ends meet?
· Yes, they've worked hard and helped build our country
· No, seniors don't need any help or recognition
A famous example of a survey that had a very poor methodology was the incorrect prediction by the Literary
Digest that Alf Landon would beat Franklin Roosevelt in the 1936 US presidential election. Check out the following website for more
information: https://www.math.upenn.edu/∼deturck/m170/wk4/lecture/case2.html
Most statisticians and researchers use various methods of random sampling in an attempt to achieve a
good sample. This section will describe a few of the most common techniques: simple random sampling,
(proportional) stratified random sampling, cluster sampling, systematic random sampling, and convenience
sampling.
Simple random sampling
The easiest method to describe is called a simple random sample. In this technique, a random sample
is taken from the members of the population. This can be done by putting the names (or identier) of all
members of the population into a hat and pulling out those names (or identiers) to choose the sample.
Or the population can be numbered and a random number generator can choose the sample. Here, each
member of the population has an equal chance of being chosen. If the goal of the technique is to get a very
random sample, this is the best method to use. But it requires having a list of the whole population, which
is not always realistic.
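As a minimal sketch (assuming a hypothetical list of the whole population exists), a random number generator can stand in for the names-in-a-hat approach; Python's random.sample gives every member an equal chance of selection:

import random

# Hypothetical sampling frame: an identifier for every member of the population
population = ["student_{}".format(i) for i in range(1, 10001)]

# Simple random sample of size 100: every member has an equal chance
sample = random.sample(population, 100)
print(sample[:5])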
Stratified sampling and proportionate stratified sampling
If there are concerns that a random sample might not fully represent a population (e.g. one portion of
the population is small compared to another), the best sampling technique to use is stratified random
sampling. In this case, divide the population into groups called strata and then take a random sample from
each stratum. Each stratum needs to be mutually exclusive from any other strata. That means that each
member of the population can only belong to one stratum. For example, you could stratify (group) your
university population by faculty and then choose a simple random sample from each stratum (each faculty)
to get a stratified random sample. Using the students per faculty example above, if the sample size is 100,
to get a stratified sample, you would randomly select 20 students from each faculty (as there are 5 faculties
and 100 students, choose an equal number from each faculty).
If the size of the sample is proportionate to the size of the strata, this is called proportionate stratified
random sampling. If you wanted a proportionate stratified random sample for students by faculty, you
would randomly select 20 students from business, 25 students from arts, 30 from science and engineering,
15 from nursing, and 10 from education (i.e. proportional to the number of students in each faculty). This
technique is best used when there are large differences in the proportion of each group. For example, if the
faculty of business had 50% of the students and the faculty of nursing only had 1% of the students, it would
not be good to have an equal number of students from each faculty.
note: To randomly choose students from each faculty, a random sampling technique needs to be
used. This could be simple random sampling or using another technique listed below.
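A minimal sketch of proportionate stratified random sampling, using the hypothetical faculty percentages above (the member identifiers are invented):

import random

# Strata: students grouped by faculty, sized to match the percentages above
strata = {
    "Business": ["bus_{}".format(i) for i in range(2000)],                 # 20%
    "Arts": ["arts_{}".format(i) for i in range(2500)],                    # 25%
    "Science and Engineering": ["sci_{}".format(i) for i in range(3000)],  # 30%
    "Nursing": ["nur_{}".format(i) for i in range(1500)],                  # 15%
    "Education": ["edu_{}".format(i) for i in range(1000)],                # 10%
}

total = sum(len(members) for members in strata.values())
sample_size = 100
sample = []
for faculty, members in strata.items():
    # Proportionate allocation: the stratum's share of the population
    n = round(sample_size * len(members) / total)
    sample.extend(random.sample(members, n))  # random sample within each stratum

print(len(sample))  # 20 + 25 + 30 + 15 + 10 = 100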
Cluster sampling
Cluster sampling and stratified sampling are often confused. In each case, the population is divided into
groups. But in stratified sampling, a few people from all groups (strata) are chosen, while in cluster
sampling, all of the people from a few groups (clusters) are chosen.
To choose a cluster sample, divide the population into clusters (groups) and then randomly select some of
the clusters. All the members from these clusters are in the cluster sample. For example, if you were to divide
your university into departments (sub-sets of the faculty), then each department is a cluster. You can then
randomly select a few of those departments (clusters). For example, let's say you clustered the population
into ten clusters. You might then randomly select four of those clusters to sample. All of the members of
those four departments are your sample. Again, to randomly select the four departments, you have to use a
random sampling technique. Here, you could number all of the departments and then use a random number
generator to choose four of them. Cluster sampling can be very convenient as the members of the sample are
in one location (in the above example, the sample members are located in the four chosen departments). This can
save time and money. But it does present a real chance of undercoverage. If the four departments chosen are
only in the faculties of business and arts, then the other faculties are not included. This means that cluster
sampling can result in non-representative samples. This is only a good technique to use if the clusters are
very similar to each other and each cluster would be representative of the population.
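Here is a minimal sketch of cluster sampling under the same university scenario (the clusters and member identifiers are invented): all members of a few randomly chosen clusters form the sample.

import random

# Hypothetical population split into ten clusters (departments)
clusters = {"dept_{}".format(d): ["dept_{}_member_{}".format(d, m) for m in range(25)]
            for d in range(10)}

# Randomly select four clusters; every member of those clusters is sampled
chosen = random.sample(list(clusters), 4)
sample = [member for dept in chosen for member in clusters[dept]]
print(chosen, len(sample))  # four departments, 4 x 25 = 100 members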
Systematic random sampling
To choose a systematic random sample, randomly select a starting point and take every kth piece of data
from a list of the population. For example, suppose you have to do a phone survey and you must choose 400
names for the sample. Your phone book contains 20,000 residence listings. To perform systematic random
sampling, number the population from 1 to 20,000 and then use a random number generator to pick a number
that represents the first name in the sample. k is found by taking the population size (20,000) and dividing
by the size of the sample (400). In this case, this results in 50. Thus, from your random starting point,
choose every fiftieth name thereafter until you have a total of 400 names. If you reach the end of the list
before completing your sample you simply go back to the beginning of your phone book and keep going until
the sample is complete.
Be careful: k needs to be large enough to ensure that you cycle through all the names. Otherwise the
sample is not random. If k had been 10, then once the random starting point was chosen, only 4,000 names
would have had a chance of being chosen, which means that not everyone has an equal chance of being chosen. In this
case, any k of at least 50 would be appropriate. Systematic sampling is frequently chosen because it is a
simple method.
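A minimal sketch of the phone book example (listing positions stand in for names; the wraparound mirrors going back to the beginning of the book):

import random

population_size = 20000  # residence listings in the phone book
sample_size = 400

k = population_size // sample_size  # 20,000 / 400 = 50

start = random.randint(1, population_size)  # random starting point
# Take every kth listing, wrapping to the start of the book if needed
sample_positions = [(start + i * k - 1) % population_size + 1
                    for i in range(sample_size)]
print(len(set(sample_positions)))  # 400 distinct listings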
A variation of systematic random sampling is very useful when a list of the population does not exist.
For example, suppose you are doing a survey about people's satisfaction with a certain mall's hours. You
won't have a list of all of the people who go to the mall. Instead, you may stand at an entrance to the mall
and ask every fifth person who enters the mall to complete your survey. To ensure the sampling technique is
representative, you'll want to do the survey multiple times at multiple locations. To ensure that the sampling
technique is random, you'll want to randomly choose your starting times and locations.
Convenience sampling
A type of sampling that is non-random is convenience sampling. Convenience sampling involves using
results that are readily available. For example, a computer software store conducts a marketing study by
interviewing potential customers who happen to be in the store browsing through the available software. The
results of convenience sampling may be very good in some cases and highly biased (favour certain outcomes)
in others. This is not a valid sampling technique when it comes to statistical inference. That is, if the data
is collected using a convenience sample, then no conclusions can be made about the population from the
sample.
With replacement or without replacement
True random sampling is done with replacement. That is, once a member is picked, that member goes
back into the population and thus may be chosen more than once. However, for practical reasons, in most
populations, simple random sampling is done without replacement. Surveys are typically done without
replacement. That is, a member of the population may be chosen only once. Most samples are taken from
large populations and the sample tends to be small in comparison to the population. Since this is the case,
sampling without replacement is approximately the same as sampling with replacement because the chance
of picking the same individual more than once with replacement is very low.
To illustrate how small a chance it is, consider a university with a population of 10,000 people. Suppose
you want to pick a sample of 1,000 randomly for a survey. For any particular sample of 1,000, if you
are sampling with replacement,
• the chance of picking the first person is 1,000 out of 10,000 (0.1000);
• the chance of picking a different second person for this sample is 999 out of 10,000 (0.0999);
• the chance of picking the same person again is 1 out of 10,000 (very low).
If instead you are sampling without replacement,
• the chance of picking the first person for any particular sample is 1,000 out of 10,000 (0.1000);
• the chance of picking a different second person is 999 out of 9,999 (0.0999);
• you do not replace the first person before picking the next person.
Compare the fractions 999/10,000 and 999/9,999. For accuracy, carry the decimal answers to four decimal
places. To four decimal places, these numbers are equivalent (0.0999).
Sampling without replacement instead of sampling with replacement becomes a mathematical issue only
when the population is small. For example, if the population is 25 people, the sample is ten, and you are
sampling with replacement for any particular sample, then the chance of picking the first person is
ten out of 25, and the chance of picking a different second person is nine out of 25 (you replace the first
person).
If you sample without replacement, then the chance of picking the first person is ten out of 25, and
then the chance of picking the second person (who is different) is nine out of 24 (you do not replace the first
person).
Compare the fractions 9/25 and 9/24. To four decimal places, 9/25 = 0.3600 and 9/24 = 0.3750. To four
decimal places, these numbers are not equivalent.
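These comparisons are quick to verify; a minimal sketch:

# Large population: with vs. without replacement barely differs
print(round(999 / 10000, 4))  # 0.0999 (with replacement)
print(round(999 / 9999, 4))   # 0.0999 (without replacement)

# Small population: the difference is noticeable
print(round(9 / 25, 4))  # 0.3600 (with replacement)
print(round(9 / 24, 4))  # 0.3750 (without replacement)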
Example 1.12
A study is done to determine the average tuition that San Jose State undergraduate students pay
per semester. Each student in the following samples is asked how much tuition he or she paid for
the Fall semester. What is the type of sampling in each case?
a. A sample of 100 undergraduate San Jose State students is taken by organizing the students'
names by classification (freshman, sophomore, junior, or senior), and then selecting 25 stu-
dents from each.
b. A random number generator is used to select a student from the alphabetical listing of all
undergraduate students in the Fall semester. Starting with that student, every 50th student
is chosen until 75 students are included in the sample.
c. A completely random method is used to select 75 students. Each undergraduate student
in the fall semester has the same probability of being chosen at any stage of the sampling
process.
d. The freshman, sophomore, junior, and senior years are numbered one, two, three, and four,
respectively. A random number generator is used to pick two of those years. All students in
those two years are in the sample.
e. An administrative assistant is asked to stand in front of the library one Wednesday and to
ask the first 100 undergraduate students he encounters what they paid for tuition the Fall
semester. Those 100 students are the sample.
Solution
a. stratified; b. systematic; c. simple random; d. cluster; e. convenience
Example 1.13
Determine the type of sampling used (simple random, stratified, systematic, cluster, or convenience).
a. A soccer coach selects six players from a group of boys aged eight to ten, seven players from
a group of boys aged 11 to 12, and three players from a group of boys aged 13 to 14 to form
a recreational soccer team.
b. A pollster interviews all human resource personnel in five different high tech companies.
c. A high school educational researcher interviews 50 high school female teachers and 50 high
school male teachers.
d. A medical researcher interviews every third cancer patient from a list of cancer patients at a
local hospital.
e. A high school counselor uses a computer to generate 50 random numbers and then picks
students whose names correspond to the numbers.
f. A student interviews classmates in his algebra class to determine how many pairs of jeans a
student owns, on the average.
Solution
a. stratified; b. cluster; c. stratified; d. systematic; e. simple random; f. convenience
If we were to examine two samples representing the same population, even if we used random sampling
methods for the samples, they would not be exactly the same. Just as there is variation in data, there is
variation in samples. As you become accustomed to sampling, the variability will begin to seem natural.
Example 1.14
Suppose ABC College has 10,000 part-time students (the population). We are interested in the
average amount of money a part-time student spends on books in the fall term. Asking all 10,000
students is an almost impossible task.
Suppose we take two different samples.
First, we use convenience sampling and survey ten students from a first term organic chemistry
class. Many of these students are taking first term calculus in addition to the organic chemistry
class. The amount of money they spend on books is as follows:
$128; $87; $173; $116; $130; $204; $147; $189; $93; $153
The second sample is taken using a list of senior citizens who take P.E. classes and taking every
fifth senior citizen on the list, for a total of ten senior citizens. They spend:
$50; $40; $36; $15; $50; $100; $40; $53; $22; $22
It is unlikely that any student is in both samples.
Problem 1
a. Do you think that either of these samples is representative of (or is characteristic of) the entire
10,000 part-time student population?
Solution
a. No. The first sample probably consists of science-oriented students. Besides the chemistry
course, some of them are also taking first-term calculus. Books for these classes tend to be expensive.
Most of these students are, more than likely, paying more than the average part-time student for
their books. The second sample is a group of senior citizens who are, more than likely, taking
courses for health and interest. The amount of money they spend on books is probably much less
than the average part-time student. Both samples are biased. Also, in both cases, not all students
have a chance to be in either sample.
Problem 2
b. Since these samples are not representative of the entire population, is it wise to use the results
to describe the entire population?
Solution
b. No. For these samples, each member of the population did not have an equally likely chance of
being chosen.
Now, suppose we take a third sample. We choose ten different part-time students from the disciplines
of chemistry, math, English, psychology, sociology, history, nursing, physical education, art, and
early childhood development. (We assume that these are the only disciplines in which part-time
students at ABC College are enrolled and that an equal number of part-time students are enrolled in
each of the disciplines.) Each student is chosen using simple random sampling. Using a calculator,
random numbers are generated and a student from a particular discipline is selected if he or she
has a corresponding number. The students spend the following amounts:
$180; $50; $150; $85; $260; $75; $180; $200; $200; $150
Problem 3
c. Is the sample biased?
Solution
c. The sample is unbiased, but a larger sample would be recommended to increase the likelihood
that the sample will be close to representative of the population. However, for a biased sampling
technique, even a large sample runs the risk of not being representative of the population.
Students often ask if it is "good enough" to take a sample, instead of surveying the entire population.
If the survey is done well, the answer is yes.
Variation is present in any set of data. For example, 16-ounce cans of beverage may contain more or less
than 16 ounces of liquid. In one study, eight 16 ounce cans were measured and produced the following
amount (in ounces) of beverage:
15.8; 16.1; 15.2; 14.8; 15.8; 15.9; 16.0; 15.5
Measurements of the amount of beverage in a 16-ounce can may vary because different people make the
measurements or because the exact amount, 16 ounces of liquid, was not put into the cans. Manufacturers
regularly run tests to determine if the amount of beverage in a 16-ounce can falls within the desired range.
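A minimal sketch summarizing these eight measurements (the values come from the study above):

amounts = [15.8, 16.1, 15.2, 14.8, 15.8, 15.9, 16.0, 15.5]

mean_amount = sum(amounts) / len(amounts)
print(mean_amount)  # 15.6375 ounces, slightly under the 16-ounce label

# How far each can deviates from the labelled 16 ounces
print([round(x - 16, 1) for x in amounts])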
Be aware that as you take data, your data may vary somewhat from the data someone else is taking for
the same purpose. This is completely natural. However, if two or more of you are taking the same data and
get very different results, it is time for you and the others to reevaluate your data-taking methods and your
accuracy.
It was mentioned previously that two or more samples from the same population, taken randomly, and
having close to the same characteristics of the population will likely be dierent from each other. Suppose
Doreen and Jung both decide to study the average amount of time students at their college sleep each night.
Doreen and Jung each take samples of 500 students. Doreen uses systematic sampling and Jung uses cluster
sampling. Doreen's sample will be different from Jung's sample. Even if Doreen and Jung used the same
sampling method, in all likelihood their samples would be different. Neither would be wrong, however.
Think about what contributes to making Doreen's and Jung's samples different.
If Doreen and Jung took larger samples (i.e. the number of data values is increased), their sample results
(the average amount of time a student sleeps) might be closer to the actual population average. But still,
their samples would be, in all likelihood, different from each other. This variability in samples cannot be
stressed enough.
The size of a sample (often called the number of observations) is important. The examples you have seen in
this book so far have been small. Samples of only a few hundred observations, or even smaller, are sufficient
for many purposes. In polling, samples that are from 1,200 to 1,500 observations are considered large enough
and good enough if the survey is random and is well done. You will learn why when you study confidence
intervals.
Be aware that many large samples are biased. For example, call-in surveys are invariably biased, because
people choose to respond or not.
We need to evaluate the statistical studies we read about critically and analyze them before accepting the
results of the studies. We listed common problems with sampling techniques above. We re-iterate them here
and add a few additional ones.
• Problems with samples: A sample must be representative of the population. A sample that is not
representative of the population is biased. Biased samples that are not representative of the population
give results that are inaccurate and not valid.
• Self-selected samples: Responses only by people who choose to respond, such as call-in surveys, are
often unreliable.
• Sample size issues: Samples that are too small may be unreliable. Larger samples are better, if possible.
In some situations, having small samples is unavoidable and can still be used to draw conclusions.
Examples: crash testing cars or medical testing for rare conditions
• Undue influence: collecting data or asking questions in a way that influences the response
• Non-response or refusal of subject to participate: The collected responses may no longer be represen-
tative of the population. Often, people with strong positive or negative opinions may answer surveys,
which can affect the results.
• Causality: A relationship between two variables does not mean that one causes the other to occur.
They may be related (correlated) because of their relationship through a different variable.
• Self-funded or self-interest studies: A study performed by a person or organization in order to support
their claim. Is the study impartial? Read the study carefully to evaluate the work. Do not automati-
cally assume that the study is good, but do not automatically assume the study is bad either. Evaluate
it on its merits and the work done.
• Misleading use of data: improperly displayed graphs, incomplete data, or lack of context
• Confounding: When the effects of multiple factors on a response cannot be separated. Confounding
makes it difficult or impossible to draw valid conclusions about the effect of each factor.
1.1.3.6 Chapter Review
Data are individual items of information that come from a population or sample. Data may be classied as
qualitative, quantitative continuous, or quantitative discrete.
Because it is not practical to measure the entire population in a study, researchers use samples to represent
the population. A random sample is a representative group from the population chosen by using a method
that gives each individual in the population an equal chance of being included in the sample. Random
sampling methods include simple random sampling, stratified sampling, cluster sampling, and systematic
sampling. Convenience sampling is a nonrandom method of choosing a sample that often produces biased
data.
Samples that contain different individuals result in different data. This is true even when the samples
are well-chosen and representative of the population. When properly selected, larger samples model the
population more closely than smaller samples. There are many different potential problems that can affect
the reliability of a sample. Statistical data needs to be critically analyzed, not simply accepted.
1.1.3.7 HOMEWORK
For the following exercises, identify the type of data that would be used to describe a response (quantitative
discrete, quantitative continuous, or categorical), and give an example of the data.
Exercise 1.1.3.8 (Solution on p. 78.)
number of tickets sold to a concert
Exercise 1.1.3.9 (Solution on p. 78.)
percent of body fat
a. Using complete sentences, list three things wrong with the way the survey was conducted.
b. Using complete sentences, list three ways that you would improve the survey if it were to be
repeated.
a. A woman in the airport is handing out questionnaires to travelers asking them to evaluate
the airport's service. She does not ask travelers who are hurrying through the airport with
their hands full of luggage, but instead asks all travelers who are sitting near gates and not
taking naps while they wait.
b. A teacher wants to know if her students are doing homework, so she randomly selects rows
two and five and then calls on all students in row two and all students in row five to present
the solutions to homework problems to the class.
c. The marketing manager for an electronics chain store wants information about the ages of its
customers. Over the next two weeks, at each store location, 100 randomly selected customers
are given questionnaires to fill out asking for information about age, as well as about other
variables of interest.
d. The librarian at a public library wants to determine what proportion of the library users are
children. The librarian has a tally sheet on which she marks whether books are checked out
by an adult or a child. She records this data for every fourth patron who checks out books.
e. A political party wants to know the reaction of voters to a debate between the candidates. The
day after the debate, the party's polling staff calls 1,200 randomly selected phone numbers.
If a registered voter answers the phone or is available to come to the phone, that registered
voter is asked whom he or she intends to vote for and whether the debate changed his or her
opinion of the candidates.
a. Think about the state of the United States in 1936. Explain why a sample chosen from
magazine subscription lists, automobile registration lists, phone books, and club membership
lists was not representative of the population of the United States at that time.
b. What effect does the low response rate have on the reliability of the sample?
c. Are these problems examples of sampling error or nonsampling error?
d. During the same year, George Gallup conducted his own poll of 30,000 prospective voters.
His researchers used a method they called "quota sampling" to obtain survey answers from
specic subsets of the population. Quota sampling is an example of which sampling method
described in this module?
4 lastbaldeagle. 2013. On Tax Day, House to Call for Firing Federal Workers Who Owe Back Taxes. Opinion poll posted
online at: http://www.youpolls.com/details.aspx?id=12328 (accessed May 1, 2013).
In one study of whether scent affects learning, subjects completed a maze several times while
wearing floral-scented masks during some trials and unscented masks during others. Participants
were assigned at random to wear the floral mask during the
first three trials or during the last three trials. For each trial, researchers recorded the time it
took to complete the maze and the subject's impression of the mask's scent: positive, negative, or
neutral.
a. Describe the explanatory and response variables in this study.
b. What are the treatments?
c. Identify any lurking variables that could interfere with this study.
d. Is it possible to use blinding in this study?
Solution
a. The explanatory variable is scent, and the response variable is the time it takes to complete
the maze.
b. There are two treatments: a floral-scented mask and an unscented mask.
c. All subjects experienced both treatments. The order of treatments was randomly assigned so
there were no differences between the treatment groups. Random assignment eliminates the
problem of lurking variables.
d. Subjects will clearly know whether they can smell flowers or not, so subjects cannot be blinded
in this study. Researchers timing the mazes can be blinded, though. The researcher who is
observing a subject will not know which mask is being worn.
1.1.4.2 Chapter Review
A poorly designed study will not produce reliable data. There are certain key components that must be
included in every experiment. To eliminate lurking variables, subjects must be assigned randomly to different
treatment groups. One of the groups must act as a control group, demonstrating what happens when the
active treatment is not applied. Participants in the control group receive a placebo treatment that looks
exactly like the active treatments but cannot influence the response variable. To preserve the integrity of
the placebo, both researchers and subjects may be blinded. When a study is designed properly, the only
difference between treatment groups is the one imposed by the researcher. Therefore, when groups respond
differently to different treatments, the difference must be due to the influence of the explanatory variable.
An ethics problem arises when you are considering an action that benefits you or some cause you support,
hurts or reduces benefits to others, and violates some rule.7 Ethical violations in statistics are not always
easy to spot. Professional associations and federal agencies post guidelines for proper conduct. It is important
that you learn basic statistical procedures so that you can recognize proper data analysis.
7 Andrew Gelman, Open Data and Open Methods, Ethics and Statistics, http://www.stat.columbia.edu/∼gelman/research/published/ChanceE
(accessed May 1, 2013).
Figure 1.4: When you have large amounts of data, you will need to organize it in a way that makes
sense. These ballots from an election are rolled together with similar ballots to keep them organized.
(credit: William Greeson)
• Display data graphically and interpret graphs: pie charts, bar graphs, histograms and box
plots.
• Recognize, describe, calculate, and interpret measures of location: quartiles and percentiles.
• Recognize, describe, calculate, and interpret measures of centre: mean, median and mode.
• Recognize, describe, calculate, and interpret measures of variation: variance, standard deviation,
range, interquartile range and coefficient of variation.
Once you have collected data, what will you do with it? Data can be described and presented in many
different formats. For example, suppose you are interested in buying a house in a particular area. You may
have no clue about the house prices, so you might ask your real estate agent to give you a sample data set of
prices. Looking at all the prices in the sample often is overwhelming. A better way might be to look at the
median price and the variation of prices. The median and variation are just two ways that you will learn to
describe data. Your agent might also provide you with a graph of the data.
In this chapter, you will study visual and numerical ways to describe and display your data. This area
of statistics is called Descriptive Statistics. If you have collected 200 data values, just looking at them
won't tell anyone much about the data. Instead, you want to summarize the raw data in a way that helps you
better understand what's going on.
Categorical data is usually summarized using a visual representation like a pie chart or a bar graph. The
numerical summary for categorical data would be a percentage, fraction or decimal.
For quantitative data, it is a bit more involved. In general, there are three components to a good summary
of quantitative data: a visual representation, a measure of centre, and a measure of variation.
The visual representation can give you a sense of the centre and variation in the data, but it is especially
useful for determining the shape of the data. Is the data all clustered together? Are there a bunch of data
values on one side, but a few on the other? Do all of the data values occur with the same frequency? The
shape describes this. Histograms and box plots are both visual representations of quantitative data.
Measures of centre, also known as averages or measures of central tendency, provide a value
that gives us a sense of a typical value in the data set. This doesn't tell us about a specific member of the
population, but instead lets us know what the average one is like. Measures of centre we will learn about
include the mean, median, and mode.
Though a measure of centre tells us about a typical value in a data set, measures of variation tell
us how much the data values vary from each other. Are they all clumped together? Are they all spread
out? Measures of variation can tell us how consistent or how volatile the data is. If we are analyzing stock
prices, the more variation there is, the more volatile and risky the investment is. But the rewards
may be greater! Measures of variation that we will learn about include range, variance, standard deviation,
interquartile range, and the coefficient of variation.
When we describe the shape, centre, and variation of the data, we are describing the distribution of the
data. If we only focus on one aspect of the distribution (say the centre), then we miss out on some important
information, which is why we always want to consider all three aspects when summarizing quantitative data.
For example, suppose two stock prices have the same average price. If we only look at the average, we might
think they are equivalent. But if one of them has greater variation, then that one is more volatile
and riskier than the other.
Box plots (or box and whisker diagrams) are a special type of visual representation that includes both
visual and numerical elements. A box plot divides the data into quarters (or quartiles). Thus, a box plot
contains a measure of centre (the second quartile is the halfway point, called the median) and a measure of
variation (the distance between the first quartile and the third quartile is called the interquartile range). The
box plot can also give a sense of the data's shape. The box plot, then, is the only representation we will
see that gives a sense of the whole distribution in one representation (i.e. it gives a sense of centre, variation,
and shape). It also has the additional benefit of identifying outliers. Outliers are data values that are
abnormal. That is, they differ significantly from the other data values. A box plot shows if there are any
outliers.
This chapter will go over descriptive statistics by focusing on visual and numerical representations of
data. Though categorical data is discussed, the main focus will be on determining the distribution and
outliers for quantitative data.
The vast majority of the time when conducting statistical studies, we will only have access to sample
data. In this situation, we will want to analyze the sample data to see if we can come to any conclusions
about the population data. Once we make the leap from simply describing a sample to using that sample to
draw conclusions about the population, we are doing inferential statistics. These concepts and techniques
are covered in chapters seven and eight.
important: The distribution of sample data ideally mimics the distribution of the population.
But the smaller the sample size, the greater the potential for differences between the
two distributions. This means that, for a large enough sample size, the distribution of the sample
generally gives a good idea of the distribution of the population. This is an example of the law of large
numbers. In other words, if the sample size is large enough and the data is collected properly, then
the sample mean will most likely be a good estimate of the population mean, the sample standard
deviation will most likely be a good estimate of the population standard deviation, and the shape
of the sample data will most likely be a good estimate of the shape of the population.
Measures of centre or average give us a sense of what a typical value in a data set is. For example, the
average number of children in a family in Canada is 1.9. This means that a typical family will have about
1.9 children. Obviously, no family has exactly 1.9 children, but this gives a sense of how many children
families have on average. Further, some families may have 8 children. Others may have no children. The
measure of centre gives a sense of what is going on in the middle of the data set.
note: Even though you may wish to round an average to a whole number (especially when it is
about the number of people), this is neither necessary nor appropriate, as the average gives a sense
of the centre of the data, which is not necessarily an actual data value.
The "center" of a data set is a way of describing a typical value in a data set. The three most widely used
measures of the "center" of the data are the mean, median and mode.
To explain these three measures of centre, let's look at an example. Suppose we want to find the average
weight of 50 people. To calculate the mean weight of the 50 people, we would add the 50 weights together
and divide by 50. To find the median weight of the 50 people, order the data from least heavy to most heavy,
and find the weight that splits the data into two equal parts. The mode is the most commonly occurring
value. To find the mode, find the weight that occurs the most frequently.
This section provides more details on how to find the measures of centre, the notation for the measures,
and when it is best to use each measure.
note: Though the words mean and average are sometimes used interchangeably, they do not
necessarily mean the same thing. In general, average is any measure of centre, while the mean is
one specific measure of centre. For example, when people talk about the average housing price,
they are usually referring to the median house price.
1.2.2.1.1 Mean
The mean of a data set can be thought of as a balancing point (or fulcrum). If you think of numbers as
weighted, then the mean is the number that will balance the data values evenly. Suppose your data values
are 1, 2, 3, 4, 5. Then the number that balances the data is 3. To go a little deeper, the balance point is
3 because the total distance between 3 and the data values below it equals the total distance between 3 and
the data values above it, as shown in Figure 1.5.
9 This content is available online at <http://cnx.org/content/m64905/1.3/>.
Figure 1.5: To find the mean of this data, we need to find the number that balances the data equally
on both sides.
Let's try a harder example. Suppose our data values are 0, 1, 1, 2, 3, 3, 4, 6. The mean will be the
number such that the total distance to the data values below it and the total distance to the data values
above it are the same. Let's check whether 3 is the mean again. The distance between our suggested "mean" and
0 is 3; the distance between our "mean" and 1 is 2 (but there are two of them); and the distance between
our "mean" and 2 is 1. That is, the total distance between our "mean" and all of the data values below it is
3 + 2 + 2 + 1 = 8. If 3 is actually our mean, then the total distance between 3 and the data values above it will
also be 8. Let's check. The distance between our "mean" and 4 is 1; the distance between our "mean" and
6 is 3. The total distance above 3 is only 4. Therefore, 3 cannot be our mean as it doesn't balance our data.
note: The two data values of 3 were ignored as their distance from the suggested mean is 0.
Therefore, they would not change the answer if included.
From our calculations above, the choice of 3 was too big, as the lower side was too heavy. Let's try 2.5 as our
mean. If the mean is 2.5, then the distance between our "mean" and 0 is 2.5; the distance between our
"mean" and 1 is 1.5 (but there are two of them); the distance between our "mean" and 2 is 0.5. Thus the
total distance between our mean of 2.5 and the data values below it is 2.5 + 1.5 + 1.5 + 0.5 = 6. If 2.5 is
our mean, then the total distance above 2.5 should also be 6. The distance between our "mean" and 3 is
0.5 (but there are two of them); the distance between our "mean" and 4 is 1.5; the distance between our
"mean" and 6 is 3.5. Thus the total distance between our suggested mean of 2.5 and the data values above
it is 0.5 + 0.5 + 1.5 + 3.5 = 6! Therefore, 2.5 is the mean for this data.
Figure 1.6: To find the mean of this data, we need to find the number that balances the data equally
on both sides. Notice that the mean here is not a data value.
Thankfully we don't have to do these in-depth calculations and guesses each time. Instead, the formula
is pretty straightforward.
The Greek letter µ (pronounced "mew") represents the population mean. That is, it is the mean for
the population data.
Formula for Population Mean
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad (1.6)$$
The symbol used to represent the sample mean is an x with a bar over it (pronounced "x bar"): $\bar{x}$. It is the
mean of a sample of data from the population.
The sample mean is an estimate of the population mean. One of the requirements for the sample mean
to be a good estimate of the population mean is for the sample taken to be truly random.
Formula for Sample Mean
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (1.6)$$
To see how the formula works, consider the sample:
1; 1; 1; 2; 2; 3; 4; 4; 4; 4; 4
$$\bar{x} = \frac{1+1+1+2+2+3+4+4+4+4+4}{11} = 2.7 \qquad (1.6)$$
note: Since it is sample data, we use the symbol $\bar{x}$.
note: If the size of a random sample is increased, then the sample mean will more likely be a better
estimate of the population mean. Just because the sample size increases does not mean that the
sample mean for the larger sample must be a better estimate; it is only more likely to be one.
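For readers who want to check the arithmetic by computer, here is a minimal sketch of the sample mean formula in Python, using the sample from the worked example above:

```python
# A minimal sketch of the sample mean: x-bar = (1/n) * sum of the data values.
data = [1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4]

n = len(data)              # sample size n
x_bar = sum(data) / n      # the formula for the sample mean

print(round(x_bar, 1))     # prints 2.7, matching the worked example
```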
1.2.2.1.2 Median
On a road, the median is in the middle of the road. In statistics, the median is the middle data value (when
the data is in order).
You can quickly find the location or position of the median by using the expression $\frac{n+1}{2}$.
The letter n is the total number of data values in the sample. If n is an odd number, the median is
the middle value of the ordered data (ordered smallest to largest). If n is an even number, the median is
equal to the two middle values added together and divided by two after the data has been ordered. For
example, if the total number of data values is 97, then $\frac{n+1}{2} = \frac{97+1}{2} = 49$. The median is the 49th value in
the ordered data. If the total number of data values is 100, then $\frac{n+1}{2} = \frac{100+1}{2} = 50.5$. The median occurs
midway between the 50th and 51st values. The location of the median and the value of the median are not
the same. The upper case letter M is often used to represent the median. The next example illustrates the
location of the median and the value of the median.
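The location formula translates directly into code. Below is a short Python sketch that finds the median of an ordered data set using (n + 1)/2; the function name is ours, not part of any standard library:

```python
# Sketch: locate the median of ordered data using the (n + 1)/2 position.
def find_median(ordered):
    n = len(ordered)
    location = (n + 1) / 2                 # the location, not the value
    if location == int(location):          # n odd: a single middle value
        return ordered[int(location) - 1]  # subtract 1: lists are 0-indexed
    below = ordered[int(location) - 1]     # n even: average the two values
    above = ordered[int(location)]         # on either side of the location
    return (below + above) / 2

print(find_median([1, 2, 3, 4, 5]))      # location (5+1)/2 = 3, so M = 3
print(find_median([1, 2, 3, 4, 5, 6]))   # location 3.5, so M = (3+4)/2 = 3.5
```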
1.2.2.1.3 Mode
Another measure of the centre is the mode. The mode is the data value that occurs most frequently and at
least twice.
A data set can have:
• no mode,
• one mode (unimodal),
• two modes (bimodal), or
• many modes (multimodal).
note: The mode can be calculated for qualitative data as well as for quantitative data. For example,
if the data set is: red, red, red, green, green, yellow, purple, black, blue, the mode is red.
Example 1.16
AIDS data indicating the number of months a patient with AIDS lives after taking a new antibody
drug are as follows (smallest to largest):
3; 4; 8; 8; 10; 11; 12; 13; 14; 15; 15; 16; 16; 17; 17; 18; 21; 22; 22; 24; 24; 25; 26; 26; 27; 27; 29; 29;
31; 32; 33; 33; 34; 34; 35; 37; 40; 44; 44; 47
Calculate the mean, median and mode.
Solution
The calculation for the mean is:
$$\bar{x} = \frac{3+4+(8)(2)+10+11+12+13+14+(15)(2)+(16)(2)+\ldots+35+37+40+(44)(2)+47}{40} = 23.6$$
To find the median, M, first use the formula for the location. The location is:
$$\frac{n+1}{2} = \frac{40+1}{2} = 20.5$$
Starting at the smallest value, the median is located between the 20th and 21st values (the two 24s):
3; 4; 8; 8; 10; 11; 12; 13; 14; 15; 15; 16; 16; 17; 17; 18; 21; 22; 22; 24; 24; 25; 26; 26; 27; 27; 29; 29;
31; 32; 33; 33; 34; 34; 35; 37; 40; 44; 44; 47
$$M = \frac{24+24}{2} = 24$$
To find the mode, we first have to determine if any data values repeat. If no
data values repeat, there is no mode. Since 8 repeats, we know there is a mode. 8 repeats twice.
We need to check if any data value repeats more than twice. If a data value repeats more than
twice, then it is the mode. Since no data value repeats more than twice, every data value that repeats
twice is a mode.
Therefore, the modes are 8, 15, 16, 17, 22, 24, 26, 27, 29, 33, 34, and 44. This data set is multimodal.
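Python's statistics module can verify all three answers; a sketch (multimode requires Python 3.8 or later):

```python
# Verifying Example 1.16 with Python's statistics module.
from statistics import mean, median, multimode

months = [3, 4, 8, 8, 10, 11, 12, 13, 14, 15, 15, 16, 16, 17, 17, 18,
          21, 22, 22, 24, 24, 25, 26, 26, 27, 27, 29, 29, 31, 32, 33,
          33, 34, 34, 35, 37, 40, 44, 44, 47]

print(mean(months))       # 23.575, which rounds to 23.6
print(median(months))     # 24.0, the average of the 20th and 21st values
print(multimode(months))  # every value that occurs the maximum number of times
```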
Example 1.17
Suppose that in a small town of 50 people, one person earns $5,000,000 per year and the other
49 each earn $30,000. Which is the better measure of the "centre": the mean, the median or the
mode?
Solution
$$\bar{x} = \frac{5{,}000{,}000 + 49(30{,}000)}{50} = 129{,}400$$
M = 30,000
(There are 49 people who earn $30,000 and one person who earns $5,000,000.)
The mode is 30,000, as this data value occurs 49 times.
Since the median and mode are equal, let's focus on the median. The median is a better
measure of the "centre" than the mean because 49 of the values are 30,000 and one is 5,000,000.
The 5,000,000 is an outlier. The 30,000 gives us a better sense of the middle of the data.
• Outliers: We have defined outliers as data values that are significantly different from other
data values, but we have not provided a way of finding them. This will be discussed in the
next section. Regardless, we can see that 5 million is significantly different from 30 thousand
in the above example.
• Skew: When a data set has outliers, the outliers have the potential to skew the mean. In the
above example, the centre of the data is 30,000, but the mean is 129,400. Thus the outlier of
5 million is pulling the mean up. That is, it is skewing the centre value by pulling it to the
right on the number line.
Above we have described how to find each of the measures of centre. But how do you choose which measure
of centre to use in which situation? One option is to provide all three measures of centre, but sometimes
this can be overwhelming to the audience. Instead, you want to pick the one that best describes the
data. The following are some general guidelines for choosing the best measure of centre.
The mean is often the best measure of centre to use because it is the most well-known and familiar of
the measures of centre. It is also the only measure of centre that is computed using all of the sample values.
But the mean is susceptible to outliers. As was seen in Example 1.17, if there is an outlier, the mean can be
pulled in one direction away from the centre.
Outliers are data values that are significantly different from the other data values. In Example 1.17,
the outlier is 5 million, as it is significantly higher than the other data values. We will discuss how to find
outliers in the section on box plots (section 2.3).
If there is an outlier in the data set that is skewing the mean, the best measure of centre to use is the
median, as it is not susceptible to outliers.
But be careful: the presence of outliers does not necessarily mean that the median is the best measure
of centre. Here are a couple of examples where the mean still works well despite outliers:
1. Suppose there are 200 data values in a sample and one data value is an outlier; then the mean will
most likely not be greatly affected by the outlier.
2. Suppose a data set has two outliers, one high and one low. Then the outliers may balance out
and not affect the mean.
The mode is best used for categorical data, but can sometimes be used for quantitative data. For example,
in Example 1.17, the mode would be a good measure of centre because the majority of data values are the
same.
In Example 1.16, since there are no outliers, the mean is the best measure of centre to use. In Example 1.17,
since there is an outlier (5 million) and the mean and median are quite different, the median is the
best measure of centre to use.
The following tables compare the measures of centre.
Table 1.2: Is the measure susceptible to outliers?
Mean: yes
Median: no
Mode: no
Table 1.3: Is the measure computed using every data value?
Mean: yes
Median: no
Mode: no
Table 1.4
In this example, there are many different possible scenarios that could explain the discrepancy. But no
matter what the scenario is, the neighbour is picking his statistics to fit his situation.
One scenario: the neighbour may be picking and choosing which measure of centre to use. Suppose that
most people in the neighbourhood make around $20,000 a year, but there are a few people who live on the
street with the super nice view who make $300,000 a year. Then in the first case, when he says the average
income is $60,000, he has used the mean, which has been pulled higher by the outliers of $300,000. He chose
to use the mean to make the neighbourhood look more affluent than it really is.
But when he wanted to make the argument that the neighbourhood wasn't as affluent and should be in
a lower tax bracket, he changed which measure of centre to use. Instead he may have used the median
or mode because they aren't influenced by the outliers.
Another scenario: the neighbour may be choosing how he defines income to help make his point. In the
first case, he may have only used those who are employed to come up with the average salary. While in the
second case, he may have used all adults in the neighbourhood, including students living with their parents,
stay-at-home parents, retired people or people out of work. Their incomes may be very low or non-existent,
which would skew the average lower. In this scenario, he may be using the same measure of centre,
but is picking what he means by income to get the results he wants.
There are other possible scenarios. Can you think of any?
1.2.2.1.6 Skew
As has been noted above, if there are outliers in a data set, they can cause the mean to be pulled up or down
(i.e. be either higher or lower than expected). Outliers don't have to be
present for this to happen. Essentially, any time there are data values that cause the mean and median
to be significantly different, we say the data is skewed.
• If the mean is significantly larger than the median and the histogram has a long tail on the right, then
the data is right skewed or positively skewed.
• If the mean is significantly smaller than the median and the histogram has a long tail on the left, then
the data is left skewed or negatively skewed.
• If the mean and the median are approximately the same and the histogram has balanced tails, then
the data is symmetric.
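A quick way to check for skew, sketched below, is to compare the mean and the median directly; the small-town incomes from Example 1.17 give a clearly right-skewed result:

```python
# Sketch: flag skew by comparing the mean and the median.
from statistics import mean, median

incomes = [30_000] * 49 + [5_000_000]   # the data from Example 1.17

m, med = mean(incomes), median(incomes)
print(m, med)   # 129400 vs 30000
if m > med:
    print("mean > median: likely right (positively) skewed")
elif m < med:
    print("mean < median: likely left (negatively) skewed")
else:
    print("mean = median: roughly symmetric")
```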
Figure 1.7: These are "perfect" examples of skewness and symmetry. In reality, there may be multiple
modes, or the mean and median will be similar but not equal. These are provided only as examples.
An important characteristic of any set of data is the variation in the data. In some data sets, the data values
are concentrated closely near the mean; in other data sets, the data values are more widely spread out from
the mean. There are five measures of variation: range, standard deviation, variance, interquartile range and
coefficient of variation.
The range is the easiest to calculate. It is found by subtracting the minimum value in the data set
from the maximum value. Though the range is easy to calculate, it is very much affected by
outliers.
The interquartile range will be discussed in the section on box plots (section 2.3).
The most common measure of variation, or spread, is the standard deviation. The standard deviation
measures how far data values are from their mean, on average.
important: When talking about variation or variability in statistics, there are two different kinds:
variation within a sample and variation between samples.
When we discuss finding the standard deviation, range or any measure of variation of a sample, we
are discussing variation within a sample. In this case, we are looking at how the data values vary
from each other. Most of the time, when we talk about variation, this is what we are talking about.
We can also talk about how much different samples vary from each other. For example, we could
take multiple samples and find the sample mean of each sample. If we talk about how much the
means vary from each other, we are discussing variation between samples. We will discuss this
specific type of variation in Chapter 6.
The law of large numbers says that, for random samples, as the sample size increases, the
sample will more closely resemble the population. For example, as the sample size increases, the
sample standard deviation will approach the population standard deviation. Thus, the variation
within the sample will more closely mimic the variation within the population as the sample size
increases. At the same time, as the sample size increases, the sample means will approach the population mean.
Thus, there will be less variation between the sample means. This means that the variation between
samples decreases as the sample size increases. When we discuss sampling variability, we are
discussing variation between samples.
For this chapter, we are focusing on variation within a sample.
The standard deviation:
• provides a numerical measure of the overall amount of variation in a data set, and
• can be used to determine whether a particular data value is close to or far from the mean.
1.2.2.1.7.1.1 The standard deviation provides a measure of the overall variation in a data set
The standard deviation is small when the data are all concentrated close to the mean, exhibiting little
variation or spread. The standard deviation is larger when the data values are more spread out from the
mean, exhibiting more variation.
Suppose that we are studying the amount of time customers wait in line at the checkout at supermarket
A and supermarket B. It is known that the average wait time at both supermarkets is about five minutes.
At supermarket A, though, the standard deviation for the wait time is two minutes; at supermarket B the
standard deviation for the wait time is four minutes.
Because supermarket B has a higher standard deviation, we know that there is more variation in the
wait times at supermarket B. Overall, wait times at supermarket B are more spread out from the average;
wait times at supermarket A are more concentrated near the average. This means that at supermarket B,
you have a greater chance of having a short wait time, but also a greater chance of having a long wait time,
compared to supermarket A. That means the wait times are more volatile at supermarket B. On the other
hand, you will be waiting about the same amount of time at supermarket A. That means there are more
consistent wait times at supermarket A.
One way we could summarize the supermarket situation is as follows:
• A typical wait time at supermarket A is 5 minutes, give or take 2 minutes. This means that someone
typically has to wait 3 to 7 minutes in the checkout line.
• A typical wait time at supermarket B is 5 minutes, give or take 4 minutes. This means that someone
typically has to wait 1 to 9 minutes in the checkout line.
Here the term typical means common or normal. So normally people will wait between 3 to 7 minutes at
supermarket A, but there will be some people who only wait 2 minutes and some who wait 10 minutes at
the checkout. That is, the typical range only provides a sense of what is going on in the middle of the data;
there are values occurring outside of that range.
note: For the typical value, you can use any measure of centre. But for the give or take value,
you have to use standard deviation. No other measure of variation works.
note: The following explains how to calculate the standard deviation by hand. We will be using
computer software to do this, so it is not important to know this section in detail, but it is
helpful to know the basics of how the standard deviation is calculated in order to understand what
the standard deviation is.
If x is a number, then the difference "x − mean" is called its deviation. In a data set, there are as many
deviations as there are items in the data set. The deviations are used to calculate the standard deviation. If
the numbers belong to a population, in symbols a deviation is $x - \mu$. For sample data, in symbols a deviation
is $x - \bar{x}$.
The procedure to calculate the standard deviation depends on whether the numbers are the entire population
or are data from a sample. The calculations are similar, but not identical. Therefore the symbol
used to represent the standard deviation depends on whether it is calculated from a population or a sample.
The lower case letter s represents the sample standard deviation and the Greek letter σ (sigma, lower case)
represents the population standard deviation. If the sample has the same characteristics as the population,
then s should be a good estimate of σ.
To calculate the standard deviation, we need to calculate the variance first. The variance is the average
of the squares of the deviations (the $x - \bar{x}$ values for a sample, or the $x - \mu$ values for a population).
The symbol $\sigma^2$ represents the population variance; the population standard deviation σ is the square root
of the population variance. The symbol $s^2$ represents the sample variance; the sample standard deviation s
is the square root of the sample variance. You can think of the standard deviation as a special average of
the deviations.
If the numbers come from a census of the entire population and not a sample, when we calculate
the average of the squared deviations to find the variance, we divide by N, the number of items in the
population. If the data are from a sample rather than a population, when we calculate the average of the
squared deviations, we divide by n − 1, one less than the number of items in the sample.
Since the standard deviation is found by taking a square root, the standard deviation is always positive
or zero.
Since the variance is the square of the standard deviation, it is not helpful as a descriptive statistic. For
example, if you are looking at the weights of basketballs in kg, then the standard deviation will be in kg,
while the variance will be in kg². Thus the variance is not meaningful when trying to interpret the variation
in data. It is helpful later on in statistics, but at this point it is not.
Example 1.18
In a fifth grade class, the teacher was interested in the average age and the sample standard deviation
of the ages of her students. The following data are the ages for a SAMPLE of n = 20 fifth grade
students. The ages are rounded to the nearest half year:
9; 9.5; 9.5; 10; 10; 10; 10; 10.5; 10.5; 10.5; 10.5; 11; 11; 11; 11; 11; 11; 11.5; 11.5; 11.5
Table 1.5
The sample variance, $s^2$, is equal to the sum of the last column (9.7375) divided by the total
number of data values minus one (20 − 1):
$$s^2 = \frac{9.7375}{20-1} = 0.5125$$
The sample standard deviation s is equal to the square root of the sample variance:
$$s = \sqrt{0.5125} = 0.715891$$
which is rounded to two decimal places: s = 0.72.
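A sketch of the same calculation in Python, following the definition step by step (deviations, squared deviations, divide by n − 1, square root):

```python
# Sketch of Example 1.18: sample variance and standard deviation by hand.
ages = [9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5,
        10.5, 11, 11, 11, 11, 11, 11, 11.5, 11.5, 11.5]

n = len(ages)                                  # 20
x_bar = sum(ages) / n                          # 10.525
squared_deviations = [(x - x_bar) ** 2 for x in ages]

s_squared = sum(squared_deviations) / (n - 1)  # divide by n - 1 for a sample
s = s_squared ** 0.5                           # standard deviation = square root

print(round(sum(squared_deviations), 4))       # 9.7375, the sum of the last column
print(round(s_squared, 4))                     # 0.5125
print(round(s, 2))                             # 0.72
```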
1.2.2.1.7.2
The deviations show how spread out the data are about the mean. The data value 11.5 is farther from the
mean than the data value 11, which is indicated by the deviations 0.97 and 0.47. A positive deviation
occurs when the data value is greater than the mean, whereas a negative deviation occurs when the data
value is less than the mean. The deviation is −1.525 for the data value nine. If you add the deviations,
the sum is always zero. (For Example 1.18, there are n = 20 deviations.) So you cannot simply add the
deviations to get the spread of the data. By squaring the deviations, you make them positive numbers, and
the sum will also be positive. The variance, then, is the average squared deviation.
The variance is a squared measure and does not have the same units as the data. Taking the square root
solves the problem. The standard deviation measures the spread in the same units as the data.
Notice that instead of dividing by n = 20, the calculation divided by n − 1 = 20 − 1 = 19 because the
data is a sample. For the sample variance, we divide by the sample size minus one (n − 1). Why not divide
by n? The answer has to do with the population variance. The sample variance is an estimate of the
population variance. Based on the theoretical mathematics that lies behind these calculations, dividing
by (n − 1) gives a better estimate of the population variance.
The standard deviation, s or σ, is either zero or larger than zero. When the standard deviation is zero,
there is no spread; that is, all the data values are equal to each other. The standard deviation is small
when the data are all concentrated close to the mean, and larger when the data values show more variation
from the mean. When the standard deviation is a lot larger than zero, the data values are very spread out
about the mean; outliers can make s or σ very large.
The standard deviation is a very good measure of variation, but when comparing two data sets it is not
always the best, in particular if the means of the two data sets are different. Suppose you are comparing
the yearly salaries (excluding bonuses) of junior employees versus CEOs at oil and gas companies around
Alberta. The yearly salaries for the junior employees will be significantly smaller than the CEOs'. Let's
say the average salary for junior employees is $45,000, while for CEOs it is $500,000. Now suppose that the
standard deviation for both groups is $50,000. If we only looked at the standard deviation, we might say
that the variation in both groups is the same. But really, variation of $50,000 when the average salary is
$45,000 is quite a bit more than when the average salary is $500,000. That is, there is more relative variation in the junior
employees' salaries. The standard deviation doesn't capture this difference, but the coefficient of variation
does: it is a measure of relative variation. That is, it takes into account that bigger data values might
have a larger standard deviation, but that doesn't mean they have larger variation.
The coefficient of variation is found by expressing the standard deviation as a percentage of the mean:
$$\text{Coefficient of Variation} = \frac{s}{\bar{x}}\,(100\%) \qquad (1.7)$$
In the above example, the coefficient of variation would be:
$$\text{CofV for junior employees} = \frac{50{,}000}{45{,}000}\,(100\%) = 111.1\%$$
$$\text{CofV for CEOs} = \frac{50{,}000}{500{,}000}\,(100\%) = 10\%$$
The larger the coefficient of variation, the larger the relative variation. Thus, as a measure of relative
variation, the junior employees have significantly more relative variation (111.1%) compared to the CEOs
(10%).
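The formula is a one-liner in code; a sketch using the salary figures above:

```python
# Sketch: coefficient of variation as a percentage of the mean.
def coefficient_of_variation(std_dev, mean):
    return std_dev / mean * 100   # (s / x-bar) * 100%

print(round(coefficient_of_variation(50_000, 45_000), 1))   # 111.1 (junior employees)
print(round(coefficient_of_variation(50_000, 500_000), 1))  # 10.0 (CEOs)
```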
Here are some points about the coefficient of variation:
• If the standard deviation is larger than the mean, the coefficient of variation is bigger than 100%.
Table 1.6
Example 1.19
Suppose you are looking at two companies, and each company has 24 employees. At one company,
everybody except the CEO makes $30,000. The CEO makes $490,000. Thus, the data values would
be
$30,000; $30,000; $30,000; $30,000; $30,000; . . . ; $490,000
The second company has an interesting policy. Everybody who starts at the company makes
$30,000 a year, but as soon as someone else gets hired, they get paid $20,000 more. They only hire
one person at a time. So, the first person who was hired started at $30,000; then, when a second
person got hired, the first person's salary was raised to $50,000. When a third person got hired, the
first person's salary was raised to $70,000 while the salary of the second person hired was raised to
$50,000. This has been done 23 times. Therefore, their data values (i.e. salaries) would look like
this:
$30,000; $50,000; $70,000; $90,000; $110,000; . . . ; $490,000
Without doing any calculations, we can see that company one has fairly consistent salaries
except for the CEO, while company two has salaries that are more spread out.
The following table provides the count (i.e. sample size), mean, and the measures of variation
for the two companies.
Table 1.7
In the table above, notice that the range is the same for the two data sets. If we only looked
at the range, this would give a false sense that the amount of variation in the two data sets is the
same, but we know it isn't.
The standard deviation is measuring how much, on average, the data values vary from the mean.
For company one, 23 of the 24 data values deviate the same amount from the mean ($49,166.67
− $30,000 = $19,166.67), with only the $490,000 deviating a large amount from the mean.
For company two, two data values deviate by only $10,000 ($250,000 and $270,000), while
two data values deviate by a whopping $230,000 ($30,000 and $490,000).
In company one, 23 out of 24 data values deviate by less than $20,000. But for company two,
only 2 out of 24 deviate by less than $20,000. This suggests that company one will have a smaller
standard deviation than company two because there is less average deviation. This is supported
by MegaStat, which shows that the population standard deviation for company one is $91,920.10
versus company two, which has a population standard deviation of $138,443.73.
Notice that even though company one has an outlier (the CEO's salary), its standard deviation
is less than company two's. That is, the average variation from the mean is less for company one.
Thus, the presence of an outlier does not necessarily result in a larger standard deviation.
The story is different when we look at the coefficient of variation. For company one, it is
190.98%, while for company two it is 54.39%. This means that company one has larger relative
variation than company two. This is because company two has a higher mean than company one,
and thus the variation, relative to the mean, isn't as large as it is in company one.
In this situation, the best measure of variation to use would be the coefficient of variation, as
we are comparing two data sets with two different means. Based on this, company one has larger
relative variation than company two.
Notice that variance is not discussed here. As stated above, the variance is the square of the
standard deviation. Therefore, the units for variance in this example would be dollars squared ($²),
which makes no sense. Again, variance is not a useful descriptive statistic.
warning: Variation and variance might seem like the same word, but they aren't. Variation is a
general term used to discuss how much the data values vary from each other: how much spread there
is in the data, how consistent the data is, how volatile or risky the data is, and how much deviation
there is in the data values. It is an umbrella term. Variance is a specific type of variation: it
refers specifically to the square of the standard deviation. Therefore, it is incorrect to say "There
is a lot of variance in the data" or "The best measure of variance is . . .".
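The whole comparison can be reproduced with Python's statistics module. A sketch (the example quotes the population standard deviations, while the quoted CV percentages line up with the sample standard deviations, so both are computed here):

```python
# Sketch of Example 1.19: range, standard deviation and CV for both companies.
from statistics import mean, pstdev, stdev

company_one = [30_000] * 23 + [490_000]
company_two = list(range(30_000, 490_001, 20_000))  # 24 salaries, $20,000 apart

for name, salaries in (("one", company_one), ("two", company_two)):
    rng = max(salaries) - min(salaries)              # range = max - min
    pop_sd = pstdev(salaries)                        # population standard deviation
    cv = stdev(salaries) / mean(salaries) * 100      # CV from the sample std dev
    print(name, rng, round(pop_sd, 2), round(cv, 2))
# one 460000 91920.1   190.98
# two 460000 138443.73 54.39
```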
The standard deviation is useful when comparing data values that come from different data sets. If the
data sets have different means and standard deviations, then comparing the data values directly can be
misleading.
• For each data value, calculate how many standard deviations away from its mean the value is.
• Use the formula: value = mean + (#ofSTDEVs)(standard deviation); solve for #ofSTDEVs.
• $\#\text{ofSTDEVs} = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$
• Compare the results of this calculation.
#ofSTDEVs is often called a "z-score"; we can use the symbol z. In symbols, the formulas become:
Sample: $x = \bar{x} + zs$, so $z = \frac{x - \bar{x}}{s}$
Population: $x = \mu + z\sigma$, so $z = \frac{x - \mu}{\sigma}$
Table 1.8
Example 1.20
Two students, John and Ali, from different high schools, wanted to find out who had the highest
GPA when compared to his school. Which student had the highest GPA when compared to his
school?
Table 1.9
Solution
For each student, determine how many standard deviations (#ofSTDEVs) his GPA is away from
the average for his school. Pay careful attention to signs when comparing and interpreting the
answer.
$$z = \#\text{ofSTDEVs} = \frac{\text{value} - \text{mean}}{\text{standard deviation}} = \frac{x - \mu}{\sigma}$$
For John, $z = \frac{2.85 - 3.0}{0.7} = -0.21$
For Ali, $z = \frac{77 - 80}{10} = -0.3$
John has the better GPA when compared to his school because his GPA is 0.21 standard
deviations below his school's mean, while Ali's GPA is 0.3 standard deviations below his school's
mean.
John's z-score of −0.21 is higher than Ali's z-score of −0.3. For GPA, higher values are better,
so we conclude that John has the better GPA when compared to his school.
Table 1.10
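A sketch of the same comparison in Python, using the school figures from the solution above:

```python
# Sketch of Example 1.20: z-scores compare values from different distributions.
def z_score(value, mean, std_dev):
    return (value - mean) / std_dev

john = z_score(2.85, 3.0, 0.7)   # GPA 2.85, school mean 3.0, std dev 0.7
ali = z_score(77, 80, 10)        # GPA 77, school mean 80, std dev 10

print(round(john, 2), round(ali, 2))   # -0.21 -0.3
# John's z-score is higher (less negative), so his GPA is better
# relative to his school.
```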
1.2.2.2 Distributions
Now that we have learned about determining shape (histogram), centre (mean, median or mode), and
variation (standard deviation, coefficient of variation and range), we can describe the distribution of a
data set.
In Example 1.19, we examined the salaries for two different companies.
Though we have not drawn the histogram for either of these data sets, we can imagine what they will look
like to determine the shape. Company one will have one peak at $30,000 with an outlier at $490,000. This
will make it skewed to the right. For company two, each data value has the same frequency, which makes the
data uniform.
For company one, we would describe the distribution of salaries as skewed to the right (shape), centred
at $49,166.67 (mean), with variation of $91,920.10 (standard deviation).
For company two, we would describe the distribution of salaries as uniform (shape), centred at $260,000
(mean), with variation of $138,443.73 (standard deviation).
The mean and the median can be calculated to help you find the "centre" of a data set. The mean is the
best estimate for the actual data set, but the median is the best measurement when a data set contains
several outliers or extreme values. The mode will tell you the most frequently occurring datum (or data) in
your data set. The mean, median, and mode are extremely helpful when you need to analyze your data, but
if your data set consists of ranges which lack specific values, the mean may seem impossible to calculate.
However, the mean can be approximated if you add the lower boundary to the upper boundary and divide
by two to find the midpoint of each interval. Multiply each midpoint by the number of values found in the
corresponding range. Divide the sum of these values by the total number of data values in the set.
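The midpoint method described above can be sketched in a few lines of Python; the intervals and frequencies here are invented purely for illustration:

```python
# Sketch: approximate the mean of grouped data using interval midpoints.
# Each tuple is (lower boundary, upper boundary, frequency) -- made-up values.
intervals = [(0, 10, 4), (10, 20, 9), (20, 30, 7)]

total = sum(f for _, _, f in intervals)                       # number of data values
weighted = sum((lo + hi) / 2 * f for lo, hi, f in intervals)  # midpoint * frequency

print(weighted / total)   # (5*4 + 15*9 + 25*7) / 20 = 16.5
```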
The standard deviation can help you calculate the spread of data. There are different equations to use if
you are calculating the standard deviation of a sample or of a population.
• The standard deviation allows us to compare individual data values or classes to the data set mean numerically.
• $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}$ or $s = \sqrt{\frac{\sum f (x - \bar{x})^2}{n-1}}$ is the formula for calculating the standard deviation of a
sample. To calculate the standard deviation of a population, we would use the population mean, µ,
and the formula $\sigma = \sqrt{\frac{\sum (x-\mu)^2}{N}}$ or $\sigma = \sqrt{\frac{\sum f(x-\mu)^2}{N}}$.
1.2.2.5
Use the following information to answer the next three exercises: The following data show the lengths of
boats moored in a marina. The data are ordered from smallest to largest: 16; 17; 19; 20; 20; 21; 23; 24; 25;
25; 25; 26; 26; 27; 27; 27; 28; 29; 30; 32; 33; 33; 34; 35; 37; 39; 40
Exercise 1.2.2.2 (Solution on p. 79.)
Calculate the mean.
Exercise 1.2.2.3 (Solution on p. 79.)
Identify the median.
Exercise 1.2.2.4 (Solution on p. 79.)
Identify the mode.
Use the following information to answer the next three exercises: Sixty-five randomly selected car salespersons
were asked the number of cars they generally sell in one week. Fourteen people answered that they generally
sell three cars; nineteen generally sell four cars; twelve generally sell five cars; nine generally sell six cars;
eleven generally sell seven cars. Calculate the following:
Exercise 1.2.2.5 (Solution on p. 79.)
sample mean = $\bar{x}$ = _______
Exercise 1.2.2.6 (Solution on p. 79.)
median = _______
Exercise 1.2.2.7 (Solution on p. 79.)
mode = _______
Exercise 1.2.2.8 (Solution on p. 79.)
The following data are the distances between 20 retail stores and a large distribution center. The
distances are in miles.
29; 37; 38; 40; 58; 67; 68; 69; 76; 86; 87; 95; 96; 96; 99; 106; 112; 127; 145; 150
Use a computer to find the standard deviation and round to the nearest tenth.
            Javier | Ercilia
$\bar{x}$   6.0 km | 6.0 km
s           4.0 km | 7.0 km
Table 1.11
c. If the two histograms depict the distribution of values for each supervisor, which one depicts
Ercilia's sample? How do you know?
Figure 1.8
Use the following information to answer the next three exercises: We are interested in the number of years
students in a particular elementary statistics class have lived in California. The information in the following
table is from the entire section.
Years | Frequency | Years | Frequency
7 | 1 | 22 | 1
14 | 3 | 23 | 1
15 | 1 | 26 | 1
18 | 1 | 40 | 2
19 | 4 | 42 | 2
20 | 3 | |
Total = 20
Table 1.12
a. 19
b. 19.5
c. 14 and 20
d. 22.65
a. sample
b. entire population
c. neither
a. Organize the data into a chart with five intervals of equal width. Label the two columns
"Enrollment" and "Frequency."
b. Construct a histogram of the data.
c. What is the shape of the data? What does the shape tell you about the enrollment at these
community colleges?
d. What is the best measure of centre for this data and why? State the measure.
e. What is the best measure of variation for this data and why? State the measure.
f. If you were to build a new community college, what is the typical range for the enrollment?
Why would this information be helpful? What caveats would you want to think about when
you look at this typical range?
1 1 10 6
2 4 8 7
3 2 9 7
4 6 3 8
5 1 8 6
6 1 7 7
7 1 3 7
8 4 9 8
9 1 10 9
10 7 4 6
11 4 7 6
12 5 6 7
13 6 9 8
14 4 4 6
15 6 8 7
Table 1.13
Which label would you recommend as the new label for the Asian market? Support your decision
using the data.
Exercise 1.2.2.14 (Solution on p. 81.)
Three publicly traded telecommunications companies reported their monthly profit for the last
year. The results are presented below.
Table 1.14
1. Donna is close to retirement and wants to invest in one of the three companies. She doesn't
want to see her investment drop significantly, as she doesn't want to see her retirement savings
dwindle. Which company would you recommend she invest in and why?
2. What information is missing from the list that you might want to have to help you answer
the above question?
3. What information below is not necessary for making this decision?
Below are tables comparing the number of part-time and full-time students at De Anza College and Foothill
College enrolled for the spring 2010 quarter. The tables display counts (frequencies) and percentages or
proportions (relative frequencies). The percent columns make comparing the same categories in the colleges
easier. Displaying percentages along with the numbers is often helpful, but it is particularly important when
comparing sets of data that do not have the same totals, such as the total enrollments for both colleges in
this example. Notice how much larger the percentage for part-time students at Foothill College is compared
to De Anza College.
Table 1.15
Tables are a good way of organizing and displaying data. But graphs can be even more helpful in
understanding the data. There are no strict rules concerning which graphs to use. Two graphs that are used
to display categorical data are pie charts and bar graphs.
In a pie chart, categories of data are represented by wedges in a circle and are proportional in size to
the percent of individuals in each category.
In a bar graph, the length of the bar for each category is proportional to the number or percent of
individuals in each category. Bars may be vertical or horizontal.
Look at Figure 1.9 and Figure 1.10 and determine which graph (pie or bar) you think displays the
comparisons better.
It is a good idea to look at a variety of graphs to see which is the most helpful in displaying the data. We
might make different choices of what we think is the best graph depending on the data and the context.
Our choice also depends on what we are using the data for.
Figure 1.9
Figure 1.10
Bar graphs can also be used to summarize discrete quantitative data and categorical data. Bar graphs
consist of bars that are separated from each other. The bars can be rectangles or they can be rectangular
boxes (used in three-dimensional plots), and they can be vertical or horizontal. The bar graph in the
following exercise has age groups represented on the x-axis and proportions on the y-axis.
Exercise 1.2.3.1 (Solution on p. 81.)
By the end of 2011, Facebook had over 146 million users in the United States. Table 1.16 shows
three age groups, the number of users in each age group, and the proportion (%) of users in each
age group. Construct a bar graph using this data.
Table 1.16
1 15.5% 19.4%
2 12.2% 15.6%
3 9.8% 9.0%
4 17.4% 18.5%
5 22.8% 20.7%
6 22.3% 16.8%
Table 1.17
Twenty students were asked how many hours they worked per day. Their responses, in hours, are as follows:
5; 6; 3; 3; 2; 4; 7; 5; 2; 3; 5; 6; 5; 4; 4; 3; 5; 2; 5; 3.
Table 1.18: Frequency Table of Student Work Hours lists the different data values in ascending order and
their frequencies.
DATA VALUE | FREQUENCY
2 | 3
3 | 5
4 | 3
5 | 6
6 | 2
7 | 1
Table 1.18
A frequency is the number of times a value of the data occurs. According to Table 1.18: Frequency
Table of Student Work Hours, there are three students who work two hours, five students who work three
hours, and so on. The sum of the values in the frequency column, 20, represents the total number of students
included in the sample.
A relative frequency is the ratio (fraction or proportion) of the number of times a value of the data
occurs in the set of all outcomes to the total number of outcomes. To find the relative frequencies, divide
each frequency by the total number of students in the sample, in this case 20. Relative frequencies can be
written as fractions, percents, or decimals.
Frequency Table of Student Work Hours with Relative Frequencies
DATA VALUE | FREQUENCY | RELATIVE FREQUENCY
2 | 3 | 3/20 or 0.15
3 | 5 | 5/20 or 0.25
4 | 3 | 3/20 or 0.15
5 | 6 | 6/20 or 0.30
6 | 2 | 2/20 or 0.10
7 | 1 | 1/20 or 0.05
Table 1.19
The sum of the values in the relative frequency column of Table 1.19: Frequency Table of Student Work
Hours with Relative Frequencies is 20/20, or 1.
Cumulative relative frequency is the accumulation of the previous relative frequencies. To find the
cumulative relative frequencies, add all the previous relative frequencies to the relative frequency for the
current row, as shown in Table 1.20: Frequency Table of Student Work Hours with Relative and Cumulative
Relative Frequencies.
Frequency Table of Student Work Hours with Relative and Cumulative Relative Frequencies
DATA VALUE | FREQUENCY | RELATIVE FREQUENCY | CUMULATIVE RELATIVE FREQUENCY
2 | 3 | 3/20 or 0.15 | 0.15
3 | 5 | 5/20 or 0.25 | 0.15 + 0.25 = 0.40
4 | 3 | 3/20 or 0.15 | 0.40 + 0.15 = 0.55
5 | 6 | 6/20 or 0.30 | 0.55 + 0.30 = 0.85
6 | 2 | 2/20 or 0.10 | 0.85 + 0.10 = 0.95
7 | 1 | 1/20 or 0.05 | 0.95 + 0.05 = 1.00
Table 1.20
The last entry of the cumulative relative frequency column is one, indicating that one hundred percent
of the data has been accumulated.
note: Because of rounding, the relative frequency column may not always sum to one, and the last
entry in the cumulative relative frequency column may not be one. However, they each should be
close to one.
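The table-building procedure is easy to automate. A sketch using the student work-hours data, which reproduces the columns of Table 1.20:

```python
# Sketch: relative and cumulative relative frequency columns from raw data.
from collections import Counter

hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

freq = Counter(hours)        # frequency of each data value
n = len(hours)               # total number of data values

cumulative = 0.0
for value in sorted(freq):
    rf = freq[value] / n     # relative frequency
    cumulative += rf         # running total of relative frequencies
    print(value, freq[value], rf, round(cumulative, 2))
# The last cumulative entry is 1.0, as in Table 1.20.
```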
Table 1.21: Frequency Table of Soccer Player Height represents the heights, in inches, of a sample of 100
male semiprofessional soccer players.
Frequency Table of Soccer Player Height
HEIGHTS (INCHES) | FREQUENCY | RELATIVE FREQUENCY | CUMULATIVE RELATIVE FREQUENCY
60–61.99 | 5 | 5/100 = 0.05 | 0.05
62–63.99 | 3 | 3/100 = 0.03 | 0.05 + 0.03 = 0.08
64–65.99 | 15 | 15/100 = 0.15 | 0.08 + 0.15 = 0.23
66–67.99 | 40 | 40/100 = 0.40 | 0.23 + 0.40 = 0.63
68–69.99 | 17 | 17/100 = 0.17 | 0.63 + 0.17 = 0.80
70–71.99 | 12 | 12/100 = 0.12 | 0.80 + 0.12 = 0.92
72–73.99 | 7 | 7/100 = 0.07 | 0.92 + 0.07 = 0.99
74–75.99 | 1 | 1/100 = 0.01 | 0.99 + 0.01 = 1.00
Total = 100 | Total = 1.00
Table 1.21
The data in this table have been grouped into the following intervals:
• 60 to 61.99 inches
• 62 to 63.99 inches
• 64 to 65.99 inches
• 66 to 67.99 inches
• 68 to 69.99 inches
• 70 to 71.99 inches
• 72 to 73.99 inches
• 74 to 75.99 inches
In this sample, there are five players whose heights fall within the interval 59.95–61.95 inches, three players
whose heights fall within the interval 61.95–63.95 inches, 15 players whose heights fall within the interval
63.95–65.95 inches, 40 players whose heights fall within the interval 65.95–67.95 inches, 17 players whose
heights fall within the interval 67.95–69.95 inches, 12 players whose heights fall within the interval 69.95–
71.95, seven players whose heights fall within the interval 71.95–73.95, and one player whose height falls
within the interval 73.95–75.95. All heights fall between the endpoints of an interval and not at the endpoints.
Example 1.21
From Table 1.21: Frequency Table of Soccer Player Height, find the percentage of heights that
are less than 65.95 inches.
Solution
If you look at the first, second, and third rows, the heights are all less than 65.95 inches. There are
5 + 3 + 15 = 23 players whose heights are less than 65.95 inches. The percentage of heights less
than 65.95 inches is then 23/100 or 23%. This percentage is the cumulative relative frequency entry in
the third row.
RAINFALL (INCHES) | FREQUENCY | RELATIVE FREQUENCY | CUMULATIVE RELATIVE FREQUENCY
3–4.99 | 6 | 6/50 = 0.12 | 0.12
5–6.99 | 7 | 7/50 = 0.14 | 0.12 + 0.14 = 0.26
7–9.99 | 15 | 15/50 = 0.30 | 0.26 + 0.30 = 0.56
10–11.99 | 8 | 8/50 = 0.16 | 0.56 + 0.16 = 0.72
12–12.99 | 9 | 9/50 = 0.18 | 0.72 + 0.18 = 0.90
13–14.99 | 5 | 5/50 = 0.10 | 0.90 + 0.10 = 1.00
Total = 50 | Total = 1.00
Table 1.22
From Table 1.22, find the percentage of rainfall that is less than 9.01 inches.
Example 1.22
From Table 1.21: Frequency Table of Soccer Player Height, find the percentage of heights that
fall between 61.95 and 65.95 inches.
Solution
Add the relative frequencies in the second and third rows: 0.03 + 0.15 = 0.18 or 18%.
Example 1.23
Use the heights of the 100 male semiprofessional soccer players in Table 1.21: Frequency Table of
Soccer Player Height. Fill in the blanks and check your answers.
a. The percentage of heights that are from 67.95 to 71.95 inches is: ____.
b. The percentage of heights that are from 67.95 to 73.95 inches is: ____.
c. The percentage of heights that are more than 65.95 inches is: ____.
d. The number of players in the sample who are between 61.95 and 71.95 inches tall is: ____.
e. What kind of data are the heights?
f. Describe how you could gather this data (the heights) so that the data are characteristic of
all male semiprofessional soccer players.
Remember, you count frequencies. To find the relative frequency, divide the frequency by the
total number of data values. To find the cumulative relative frequency, add all of the previous
relative frequencies to the relative frequency for the current row.
Solution
a. 29%
b. 36%
c. 77%
d. 87
e. quantitative continuous
f. get rosters from each team and choose a simple random sample from each
Example 1.24
Nineteen people were asked how many miles, to the nearest mile, they commute to work each day.
The data are as follows: 2; 5; 7; 3; 2; 10; 18; 15; 20; 7; 10; 18; 5; 12; 13; 12; 4; 5; 10. Table 1.23:
Frequency of Commuting Distances was produced:
DATA | FREQUENCY | RELATIVE FREQUENCY | CUMULATIVE RELATIVE FREQUENCY
3 | 3 | 3/19 | 0.1579
4 | 1 | 1/19 | 0.2105
5 | 3 | 3/19 | 0.1579
7 | 2 | 2/19 | 0.2632
10 | 3 | 4/19 | 0.4737
12 | 2 | 2/19 | 0.7895
13 | 1 | 1/19 | 0.8421
15 | 1 | 1/19 | 0.8948
18 | 1 | 1/19 | 0.9474
20 | 1 | 1/19 | 1.0000
Table 1.23
Problem
Solution
a. No. The frequency column sums to 18, not 19. Not all cumulative relative frequencies are
correct.
b. False. The frequency for three miles should be one; for two miles (left out), two. The
cumulative relative frequency column should read: 0.1052, 0.1579, 0.2105, 0.3684, 0.4737,
0.6316, 0.7368, 0.7895, 0.8421, 0.9474, 1.0000.
c. 5/19
d. 7/19, 12/19, 7/19
1.2.3.2.3 Histograms
In the introduction, the idea of distribution was introduced. The distribution refers to the shape, centre and
variation of quantitative data. To determine the shape of the data, we need to look at a visual representation
of the data. The best visual representation to look at is the histogram.
note: Bar graphs and histograms look very similar. They both have bars whose heights represent
the frequency of the data. But bar graphs are used for categorical data and discrete quantitative
data (i.e. whole number data). Histograms are used for continuous quantitative data (i.e. numbers
with decimals) and sometimes discrete quantitative data as well. Since there is a gap between
categories and whole numbers, the bars in bar graphs do not touch. But for continuous data, there
is no gap between the numbers, so the bars for histograms do touch.
For most of the work you do in this book, you will use a histogram to display the data. One advantage of a
histogram is that it can readily display large data sets. The following explains how to make a histogram by
hand, but you can use statistical software to do this quite quickly.
A histogram consists of contiguous (adjoining) boxes. It has both a horizontal axis and a vertical
axis. The horizontal axis is labeled with what the data represents (for instance, distance from your home
to school). The vertical axis is labeled either frequency or relative frequency (or percent frequency or
probability). The graph will have the same shape with either label. The histogram (like the stemplot) can
give you the shape of the data, the center, and the spread of the data.
The relative frequency is equal to the frequency for an observed value of the data divided by the total
number of data values in the sample. (Remember, frequency is defined as the number of times an answer
occurs.) If:
• f = frequency
• n = total number of data values (or the sum of the individual frequencies), and
• RF = relative frequency,
then:
$$RF = \frac{f}{n} \qquad (1.10)$$
For example, if three students in Mr. Ahab's English class of 40 students received from 90% to 100%,
then f = 3, n = 40, and $RF = \frac{f}{n} = \frac{3}{40} = 0.075$. 7.5% of the students received 90–100%. 90–100% are
quantitative measures.
To construct a histogram, first decide how many bars or intervals, also called classes, will represent the
data. Many histograms consist of five to 15 bars or classes for clarity; the number of bars is your choice.
Choose a starting point for the first interval that is less than the smallest data value. A convenient starting
point is a lower value carried out to one more decimal place than the value with the most decimal places.
For example, if the value with the most decimal places is 6.1 and this is the smallest value, a convenient
starting point is 6.05 (6.1 − 0.05 = 6.05). We say that 6.05 has more precision. If the value with the most
decimal places is 2.23 and the lowest value is 1.5, a convenient starting point is 1.495 (1.5 − 0.005 = 1.495).
If the value with the most decimal places is 3.234 and the lowest value is 1.0, a convenient starting point
is 0.9995 (1.0 − 0.0005 = 0.9995). If all the data happen to be integers and the smallest value is two, then
a convenient starting point is 1.5 (2 − 0.5 = 1.5). Also, when the starting point and other boundaries are
carried to one additional decimal place, no data value will fall on a boundary. The next two examples go
into detail about how to construct a histogram using continuous data and how to create a histogram using
discrete data.
Example 1.25
The following data are the heights (in inches to the nearest half inch) of 100 male semiprofessional
soccer players. The heights are continuous data, since height is measured.
60; 60.5; 61; 61; 61.5
63.5; 63.5; 63.5
64; 64; 64; 64; 64; 64; 64; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5
66; 66; 66; 66; 66; 66; 66; 66; 66; 66; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5;
67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5
68; 68; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69.5; 69.5; 69.5; 69.5; 69.5
70; 70; 70; 70; 70; 70; 70.5; 70.5; 70.5; 71; 71; 71
72; 72; 72; 72.5; 72.5; 73; 73.5
74
The smallest data value is 60. Since the data with the most decimal places have one decimal
(for instance, 61.5), we want our starting point to have two decimal places. Since the numbers 0.5,
0.05, 0.005, etc. are convenient numbers, use 0.05 and subtract it from 60, the smallest value, for
the convenient starting point.
60 − 0.05 = 59.95, which is more precise than, say, 61.5 by one decimal place. The starting point
is, then, 59.95.
The largest value is 74, so 74 + 0.05 = 74.05 is the ending value.
Next, calculate the width of each bar or class interval. To calculate this width, subtract the
starting point from the ending value and divide by the number of bars (you must choose the number
of bars you desire). Suppose you choose eight bars.
$$\frac{74.05 - 59.95}{8} = 1.76 \qquad (1.10)$$
: We will round up to two and make each bar or class interval two units wide. Rounding up to
two is one way to prevent a value from falling on a boundary. Rounding to the next number is
often necessary even if it goes against the standard rules of rounding. For this example, using 1.76
as the width would also work. A guideline that is followed by some for the width of a bar or class
interval is to take the square root of the number of data values and then round to the nearest whole
number, if necessary. For example, if there are 150 values of data, take the square root of 150 and
round to 12 bars or intervals.
• 59.95
• 59.95 + 2 = 61.95
• 61.95 + 2 = 63.95
• 63.95 + 2 = 65.95
• 65.95 + 2 = 67.95
• 67.95 + 2 = 69.95
• 69.95 + 2 = 71.95
• 71.95 + 2 = 73.95
• 73.95 + 2 = 75.95
The heights 60 through 61.5 inches are in the interval 59.95–61.95. The heights that are 63.5 are
in the interval 61.95–63.95. The heights that are 64 through 64.5 are in the interval 63.95–65.95.
The heights 66 through 67.5 are in the interval 65.95–67.95. The heights 68 through 69.5 are in
the interval 67.95–69.95. The heights 70 through 71 are in the interval 69.95–71.95. The heights
72 through 73.5 are in the interval 71.95–73.95. The height 74 is in the interval 73.95–75.95.
The following histogram displays the heights on the x-axis and relative frequency on the y-axis.
Figure 1.11
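A histogram like the one in Figure 1.11 can be reproduced with standard plotting software. The sketch below uses Python with matplotlib (our choice of tool, not the text's); the class boundaries are the ones computed above, and the counts per class are our tally of the 100 heights listed in this example, which you can verify against the data:

import matplotlib.pyplot as plt

# Class boundaries computed above: start at 59.95, width 2, eight bars.
edges = [59.95 + 2 * i for i in range(9)]     # 59.95, 61.95, ..., 75.95
counts = [5, 3, 15, 40, 17, 12, 7, 1]         # tallied from the 100 heights
rel_freq = [c / 100 for c in counts]          # relative frequency = f / n

# Draw each bar with its left edge on a class boundary.
plt.bar(edges[:-1], rel_freq, width=2, align="edge", edgecolor="black")
plt.xlabel("Height (inches)")
plt.ylabel("Relative frequency")
plt.title("Histogram of Heights of 100 Semiprofessional Soccer Players")
plt.show()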
note: Visual representations should be numbered. As they are images, they would be numbered
as figures. For example, a histogram would be numbered Figure 3. This means it is the third
image in the document. This makes it easier to refer back to: In Figure 3, we can see that . . .
The title of the visual representation includes the name of the visual representation and the context:
Histogram of . . ..
The label that goes along the axis includes the variable and the unit: Variable (unit).
These three aspects combined will make it easy to refer to the image and let the reader of the image
know what the image is about.
A frequency table would be similarly titled and labelled, but since it is a table and not an image,
it would be referred to as Table 4 (meaning the fourth table in the document).
As you look through this textbook, notice how all of the images and tables are numbered as
described above.
1.2.3.2.3.1 Shape
The shape of the data helps us understand what kind of pattern the data has. For example, if all of the
data values have the same frequency, then the shape will be distinct (it is called uniform). If the data has a
skew in it, then that helps us understand the measure of centre better (to be discussed in the next section).
Overall, the shape helps us see how the data is behaving. Data that has similar shapes will behave in similar
ways.
The shape of the data set is determined by looking at a visual representation of the data, usually
the histogram. Common ways of describing the shape include whether it is symmetrical or not, how many
distinct peaks it has (unimodal, bimodal, multimodal), and whether the data has a tail only on one side
(skew).
Figure 1.12: Here are some examples of possible shapes that data can take
The above is provided to give you some ideas on how to describe the shape of data. But not all data sets
have a nice shape that fits into one of the above. Sometimes the data can only be described as non-symmetric.
It is important to remember that the very reason we develop a variety of methods to present data is to
develop insights into the subject of what the observations represent. We want to get a "sense" of the data.
Are the observations all very much alike or are they spread across a wide range of values, are they bunched
at one end of the spectrum or are they distributed evenly and so on. We are trying to get a visual picture
of the numerical data. Shortly we will develop formal mathematical measures of the data, but our visual
graphical presentation can say much. It can, unfortunately, also say much that is distracting, confusing and
simply wrong in terms of the impression the visual leaves. Many years ago Darrell Huff wrote the book How
to Lie with Statistics. It has been through more than 25 printings and sold more than one and one-half million
copies. His perspective was a harsh one; he used many actual examples that were designed to mislead. He
wanted to make people aware of such deception, but perhaps more importantly to educate so that others would
not make the same errors inadvertently.
Again, the goal is to enlighten with visuals that tell the story of the data. Pie charts have a number of
common problems when used to convey the message of the data. Too many pieces of the pie overwhelm the
reader; no more than perhaps five or six categories ought to be used to give an idea of the relative importance
of each piece. This is, after all, the goal of a pie chart: showing which subset matters most relative to the
others. If there are more components than this, then perhaps an alternative approach would be better, or
perhaps some can be consolidated into an "other" category. Pie charts cannot show changes over time,
although we see this attempted all too often. In federal, state, and city finance documents, pie charts are
often presented to show the components of revenue available to the governing body for appropriation: income
tax, sales tax, motor vehicle taxes and so on. In and of itself this is interesting information and can be nicely
done with a pie chart. The error occurs when two years are set side-by-side. Because the total revenues
change year to year, but the size of the pie is fixed, no real information is provided and the relative size of
each piece of the pie cannot be meaningfully compared.
Histograms can be very helpful in understanding the data. Properly presented, they can be a quick visual
way to present probabilities of different categories by the simple visual of comparing relative areas in each
category. Here the error, purposeful or not, is to vary the width of the categories. This of course makes
comparison to the other categories impossible. It inappropriately embellishes the importance of the category
with the expanded width, because it has a greater area, and thus visually "says" that that category has a
higher probability of occurrence.
Changing the units of measurement of the axis can smooth out a drop or accentuate one. If you want to
show large changes, then measure the variable in small units: pennies rather than thousands of dollars. And
of course, to continue the fraud, be sure that the axis does not begin at (0, 0). If it begins at (0, 0), then it
becomes apparent that the axis has been manipulated.
Again, the goal of descriptive statistics is to convey meaningful visuals that tell the story of the data.
Purposeful manipulation is fraud and unethical at the worst, but even at its best, making these types of
errors will lead to confusion in the analysis.
A bar graph is a chart that uses either horizontal or vertical bars to show comparisons among categories.
One axis of the chart shows the specific categories being compared, and the other axis represents a discrete
value. Some bar graphs present bars clustered in groups of more than one (grouped bar graphs), and others
show the bars divided into subparts to show cumulative effect (stacked bar graphs). Bar graphs are especially
useful when categorical data is being used, but they can also be used for quantitative discrete data.
A histogram is a graphic version of a frequency distribution. The graph consists of bars of equal width
drawn adjacent to each other. The horizontal scale represents classes of quantitative data values and the
vertical scale represents frequencies. The heights of the bars correspond to frequency values. Histograms are
typically used for large, continuous, quantitative data sets.
1.2.3.6
Season Number of Students Proportion of Population
Spring 8 24%
Summer 9 26%
Autumn 11 32%
Winter 6 18%
Table 1.24
Exercise 1.2.3.9
Construct a histogram for the following:
a.
Pulse Rates for Women Frequency
60–69 12
70–79 14
80–89 11
90–99 1
100–109 1
110–119 0
120–129 1
Table 1.26
b.
Actual Speed in a 30 MPH Zone Frequency
42–45 25
46–49 14
50–53 7
54–57 3
58–61 1
Table 1.27
c.
Tar (mg) in Nonfiltered Cigarettes Frequency
10–13 1
14–17 0
18–21 15
22–25 7
26–29 2
Table 1.28
1.2.3.7 Homework
Use the following information to answer the next two exercises: Suppose one hundred eleven people who
shopped in a special t-shirt store were asked the number of t-shirts they own costing more than $19 each.
a. 21
b. 59
c. 41
d. Cannot be determined
a. cluster
b. simple random
c. stratified
d. convenience
Measures of location help us to understand where data values are located relative to other data values. We've
already seen a measure of location: the median. It tells us what data value is in the middle of the data set.
The most common measure of position is a percentile. Percentiles divide ordered data into hundredths.
To score in the 90th percentile of an exam does not mean, necessarily, that you received 90% on a test. It
means that 90% of test scores are the same or less than your score and 10% of the test scores are the same
or greater than your test score. The median is the 50th percentile.
A special type of percentile is the quartile. Quartiles divide ordered data into quarters. The first
quartile, Q1, is the same as the 25th percentile, and the third quartile, Q3, is the same as the 75th percentile.
The median, M, is called both the second quartile and the 50th percentile.
A visual representation of measures of location is called a box plot.
In this section, we will learn how to find quartiles and use those quartiles to find the interquartile range
and outliers. Then we will visually represent this information on a box plot. Unlike histograms and bar
graphs, box plots require the use of numerical summaries. Thus the box plot is a representation that combines
both visual and numerical summaries of the data.
As described in the introduction, a common measure of location is the percentile. Percentiles are useful for
comparing values. For this reason, universities and colleges use percentiles extensively. One instance in
which colleges and universities use percentiles is when SAT results are used to determine a minimum testing
score that will be used as an acceptance factor. For example, suppose Duke accepts SAT scores at or above
the 75th percentile. That translates into a score of at least 1220.
Percentiles are mostly used with very large populations. Therefore, if you were to say that 90% of the
test scores are less (and not the same or less) than your score, it would be acceptable because removing one
particular data value is not significant.
The median is a number that measures the "center" of the data. You can think of the median as the
"middle value," but it does not actually have to be one of the observed values. It is a number that separates
ordered data into halves. Half the values are the same number or smaller than the median, and half the
values are the same number or larger. For example, consider the following data.
1; 11.5; 6; 7.2; 4; 8; 9; 10; 6.8; 8.3; 2; 2; 10; 1
Ordered from smallest to largest:
1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5
Since there are 14 observations, the median is between the seventh value, 6.8, and the eighth value, 7.2.
To find the median, add the two values together and divide by two.

(6.8 + 7.2)/2 = 7 (1.12)

The median is seven. Half of the values are smaller than seven and half of the values are larger than seven.
Quartiles are numbers that separate the data into quarters. Quartiles may or may not be part of the
data. To find the quartiles, first find the median or second quartile. The first quartile, Q1, is the middle
value of the lower half of the data, and the third quartile, Q3, is the middle value, or median, of the upper
half of the data. To get the idea, consider the same data set:
1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5
The median or second quartile is seven. The lower half of the data are 1, 1, 2, 2, 4, 6, 6.8. The middle
value of the lower half is two.
1; 1; 2; 2; 4; 6; 6.8
The number two, which is part of the data, is the first quartile. One-fourth of the entire set of values
are the same as or less than two and three-fourths of the values are more than two.
The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of the upper half is nine.
The third quartile, Q3, is nine. Three-fourths (75%) of the ordered data set are less than nine. One-
fourth (25%) of the ordered data set are greater than nine. The third quartile is part of the data set in this
example.
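Software packages use several different conventions for computing quartiles, so their results can differ slightly from one program to another. The following is a minimal Python sketch of the median-of-halves rule used in this section (split the ordered data at the median, excluding the median itself when the number of observations is odd):

from statistics import median

def quartiles(data):
    """Q1, median, Q3 by the median-of-halves rule used in this section."""
    values = sorted(data)
    half = len(values) // 2
    lower = values[:half]      # lower half (median excluded if n is odd)
    upper = values[-half:]     # upper half (median excluded if n is odd)
    return median(lower), median(values), median(upper)

data = [1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5]
print(quartiles(data))         # (2, 7.0, 9), matching the values above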
Possible Quartile Positions
Figure 1.13
As mentioned in the previous section, the interquartile range is a measure of variation. It is a number
that indicates the spread of the middle half or the middle 50% of the data. It is the difference between the
third quartile (Q3) and the first quartile (Q1):

IQR = Q3 − Q1

The IQR can help to determine potential outliers. A value is suspected to be a potential outlier
if it is less than (1.5)(IQR) below the first quartile or more than (1.5)(IQR) above the third
quartile. Potential outliers always require further investigation.

note: A potential outlier is a data point that is significantly different from the other data points.
These special data points may be errors or some kind of abnormality, or they may be a key to
understanding the data.
Example 1.26
For the following 13 real estate prices, calculate the IQR and determine if any prices are potential
outliers. Prices are in dollars.
389,950; 230,500; 158,000; 479,000; 639,000; 114,950; 5,500,000; 387,000; 659,000; 529,000; 575,000;
488,800; 1,095,000
Solution
Order the data from smallest to largest.
114,950; 158,000; 230,500; 387,000; 389,950; 479,000; 488,800; 529,000; 575,000; 639,000; 659,000;
1,095,000; 5,500,000
M = 488,800
Q1 = (230,500 + 387,000)/2 = 308,750
Q3 = (639,000 + 659,000)/2 = 649,000
IQR = 649,000 − 308,750 = 340,250
(1.5)(IQR) = (1.5)(340,250) = 510,375
1.5(IQR) less than the first quartile: Q1 − (1.5)(IQR) = 308,750 − 510,375 = −201,625
1.5(IQR) more than the third quartile: Q3 + (1.5)(IQR) = 649,000 + 510,375 = 1,159,375
No house price is less than −201,625. However, 5,500,000 is more than 1,159,375. Therefore,
5,500,000 is a potential outlier.
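The same outlier check can be scripted. This Python sketch (an illustration, using the median-of-halves quartile rule from the section above) flags the potential outlier in the real estate data:

from statistics import median

prices = [389950, 230500, 158000, 479000, 639000, 114950, 5500000,
          387000, 659000, 529000, 575000, 488800, 1095000]
prices.sort()

half = len(prices) // 2
q1 = median(prices[:half])      # 308750
q3 = median(prices[-half:])     # 649000
iqr = q3 - q1                   # 340250

low_fence = q1 - 1.5 * iqr      # -201625
high_fence = q3 + 1.5 * iqr     # 1159375
outliers = [p for p in prices if p < low_fence or p > high_fence]
print(outliers)                 # [5500000]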
Example 1.27
For the two data sets in the test scores example (Example 1.33), find the following:
Solution
The five-number summary for the day and night classes is
Table 1.29
a. The IQR for the day group is Q3 − Q1 = 82.5 − 56 = 26.5. The IQR for the night group is
Q3 − Q1 = 89 − 78 = 11.
The interquartile range (the spread or variability) for the day class is larger than the night
class IQR. This suggests more variation will be found in the day class's test scores.
b. Day class outliers are found using the IQR times 1.5 rule. So,
A percentile indicates the relative standing of a data value when data are sorted into numerical order from
smallest to largest. p percent of data values are less than or equal to the pth percentile. For example,
15% of data values are less than or equal to the 15th percentile.
A percentile may or may not correspond to a value judgment about whether it is "good" or "bad." The
interpretation of whether a certain percentile is "good" or "bad" depends on the context of the situation to
which the data applies. In some situations, a low percentile would be considered "good;" in other contexts
a high percentile might be considered "good". In many situations, there is no value judgment that applies.
Understanding how to interpret percentiles properly is important not only when describing data, but also
when calculating probabilities in later chapters of this text.
note: When writing the interpretation of a percentile in the context of the given data, the sentence
should contain the following information: the context of the situation, the data value that marks the
percentile, and the percent of individuals or items with data values at or below (and at or above) that value.
Example 1.28
On a timed math test, the first quartile for time it took to finish the exam was 35 minutes.
Interpret the first quartile in the context of this situation.
Solution
Twenty-five percent of students finished the exam in 35 minutes or less, and 75% of students
finished the exam in 35 minutes or more.
Example 1.29
On a 20 question math test, the 70th percentile for number of correct answers was 16. Interpret
the 70th percentile in the context of this situation.
Solution
Seventy percent of students answered 16 or fewer questions correctly, and 30% of students
answered 16 or more questions correctly.
Example 1.30
At a community college, it was found that the 30th percentile of credit units that students are
enrolled for is seven units. Interpret the 30th percentile in the context of this situation.
Solution
Thirty percent of students are enrolled in seven or fewer credit units, and 70% of students are
enrolled in seven or more credit units.
1.2.4.2.2 Outliers
The idea of potential outliers was discussed above. This section will look more in depth at how to find
outliers and how to categorize them.
Quartiles can also be used to determine if there are any outliers in a data set. To determine if there are
outliers, we need to first calculate the inner and outer fences. The fences define the boundary between a
normal data value and an abnormal data value (or outlier). Any data values that fall between the inner
fences are normal data values. Any data values that fall outside the inner fences are considered
outliers.
The fences are calculated as follows:
The inner fences are Q1 − IQR(1.5) and Q3 + IQR(1.5).
The outer fences are Q1 − IQR(3) and Q3 + IQR(3).
A mild outlier is any data value between the inner and outer fences.
An extreme outlier is any data value beyond the outer fences.
Example 1.31: Finding outliers
Sharpe Middle School is applying for a grant that will be used to add fitness equipment to the
gym. The principal surveyed 15 anonymous students to determine how many minutes a day the
students spend exercising. The results from the 15 anonymous students are shown.
Box plots (also called box-and-whisker plots or box-whisker plots) give a good graphical image of the
concentration of the data. They also show how far the extreme values are from most of the data.
To construct a box plot, use a horizontal or vertical number line and a rectangular box. The smallest
and largest data values label the endpoints of the axis. The first quartile marks one end of the box and the
third quartile marks the other end of the box. Approximately the middle 50 percent of the data fall
inside the box. The "whiskers" extend from the ends of the box to the smallest and largest data values.
The median or second quartile can be between the first and third quartiles, or it can be one, or the other,
or both. The box plot gives a good, quick picture of the data.
A box plot is constructed from the five-number summary (the minimum value, the first quartile, the
median, the third quartile, and the maximum value) and, if there are outliers, the fences. We use these
values to compare how close other data values are to them.
Figure 1.14: This is an example of a box plot. The box is in the middle and represents 50% of the data.
The circles on the right represent outliers and the dashed lines the fences. The outliers at approximately
22000 and 27000 are mild outliers, while the outlier at approximately 28500 is an extreme outlier.
To construct a box plot, use a horizontal or vertical number line and a rectangular box. The smallest and
largest data values label the endpoints of the axis. The first quartile marks one end of the box and the third
quartile marks the other end of the box. The median is represented by a line inside the box. The middle 50
percent of the data fall inside the box and the length of the box is the interquartile range.
The "whiskers" extend from the ends of the box to the first data values inside the fences. If there are no
outliers, these would be the minimum and maximum values. The outliers are represented by asterisks or dots
and fall either between the inner and outer fences (mild outlier) or outside the outer fences (extreme outlier).
Consider, again, this data set.
1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5
From the work done above, we know the five-number summary is 1, 2, 7, 9, 11.5. The IQR is 9 − 2 = 7,
so IQR(1.5) is 7 × 1.5 = 10.5. The lower inner fence is Q1 − IQR(1.5) = 2 − 10.5 = −8.5 and the upper inner
fence is Q3 + IQR(1.5) = 9 + 10.5 = 19.5. Since no data values are smaller than −8.5 or larger than 19.5,
there are no outliers in the data set.
Figure 1.15
The two whiskers extend from the first quartile to the smallest value and from the third quartile to the
largest value. The median is shown with a dashed line.

note: It is important to start a box plot with a scaled number line. Otherwise the box plot may
not be useful.
Example 1.32
The following data are the heights of 40 students (in inches) in a statistics class.
59; 60; 61; 62; 62; 63; 63; 64; 64; 64; 65; 65; 65; 65; 65; 65; 65; 65; 65; 66; 66; 67; 67; 68; 68; 69;
70; 70; 70; 70; 70; 71; 71; 72; 72; 73; 74; 74; 75; 77
Take this data and input it into Excel. Use the "Text to Columns" function in the "Data" menu
to separate the data into separate columns. Then copy the data, but when you paste it, use paste
special to "Transpose" the data so it is all in one column.
Now use whatever software you are using to find the five-number summary.
• Minimum value = 59
• Q1: First quartile = 64.75
• Q2: Second quartile or median = 66
• Q3: Third quartile = 70
• Maximum value = 77
Figure 1.16
note: The titles and labels for a box plot follow the same rules as they do for a histogram or a
bar graph.
• Interquartile Range: IQR = third quartile − first quartile = 70 − 64.75 = 5.25, which means
that the middle 50% (middle half) of the data has a range of 5.25 inches. This also means
the length of the box is 5.25.
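Instead of Excel, the same five-number summary can be computed in Python. The sketch below (our choice of tool) uses numpy, whose default percentile method (linear interpolation) matches the values given above:

import numpy as np

heights = [59, 60, 61, 62, 62, 63, 63, 64, 64, 64, 65, 65, 65, 65, 65,
           65, 65, 65, 65, 66, 66, 67, 67, 68, 68, 69, 70, 70, 70, 70,
           70, 71, 71, 72, 72, 73, 74, 74, 75, 77]

q1, med, q3 = np.percentile(heights, [25, 50, 75])   # linear interpolation
print(min(heights), q1, med, q3, max(heights))       # 59 64.75 66.0 70.0 77
print("IQR:", q3 - q1)                               # IQR: 5.25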
For some sets of data, some of the largest value, smallest value, first quartile, median, and third quartile
may be the same. For instance, you might have a data set in which the median and the third quartile are
the same. In this case, the diagram would not have a dotted line inside the box displaying the median. The
right side of the box would display both the third quartile and the median. For example, if the smallest
value and the first quartile were both one, the median and the third quartile were both five, and the largest
value was seven, the box plot would look like:
Figure 1.17
In this case, at least 25% of the values are equal to one. Twenty-five percent of the values are between
one and five, inclusive. At least 25% of the values are equal to five. The top 25% of the values fall between
five and seven, inclusive.
Example 1.33
Test scores for a college statistics class held during the day are:
99; 56; 78; 55.5; 32; 90; 80; 81; 56; 59; 45; 77; 84.5; 84; 70; 72; 68; 32; 79; 90
Test scores for a college statistics class held during the evening are:
98; 78; 68; 83; 81; 89; 88; 76; 65; 45; 98; 90; 80; 84.5; 85; 79; 78; 98; 90; 79; 81; 25.5
Problem (Solution on p. 84.)
a. Find the smallest and largest values, the median, and the first and third quartile for the day
class.
b. Find the smallest and largest values, the median, and the first and third quartile for the night
class.
c. For each data set, what percentage of the data is between the smallest value and the first
quartile? the first quartile and the median? the median and the third quartile? the third
quartile and the largest value? What percentage of the data is between the first quartile and
the largest value?
d. Create a box plot for each set of data. Use one number line for both box plots.
e. Which box plot has the widest spread for the middle 50% of the data (the data between the
first and third quartiles)? What does this mean for that set of data in comparison to the
other set of data?
The values that divide a rank-ordered set of data into 100 equal parts are called percentiles. Percentiles are
used to compare and interpret data. For example, an observation at the 50th percentile would be greater
than 50 percent of the other observations in the set. Quartiles divide data into quarters. The first quartile
(Q1) is the 25th percentile, the second quartile (Q2, or median) is the 50th percentile, and the third quartile
(Q3) is the 75th percentile. The interquartile range, or IQR, is the range of the middle 50 percent of the data
values. The IQR is found by subtracting Q1 from Q3, and can help determine outliers by using the following
two expressions.
• Q3 + IQR(1.5)
• Q1 − IQR(1.5)
Box plots are a type of graph that can help visually organize data. To graph a box plot the following data
points must be calculated: the minimum value, the rst quartile, the median, the third quartile, and the
maximum value. Once the box plot is graphed, you can display and compare distributions of data.
1.2.4.6
Figure 1.18
a. In complete sentences, describe what the shape of each box plot implies about the distribution
of the data collected.
b. Have more Americans or more Germans surveyed been to over eight foreign countries?
c. Compare the three box plots. What do they imply about the foreign travel of 20-year-old
residents of the three countries when compared to each other?
Figure 1.19
a. Think of an example (in words) where the data might fit into the above box plot. In 2–5
sentences, write down the example.
b. What does it mean to have the rst and second quartiles so close together, while the second
to third quartiles are far apart?
Figure 1.20
a. In complete sentences, describe what the shape of each box plot implies about the distribution
of the data collected for that car series.
b. Which group is most likely to have an outlier? Explain how you determined that.
c. Compare the three box plots. What do they imply about the age of purchasing a BMW from
the series when compared to each other?
d. Look at the BMW 5 series. Which quarter has the smallest spread of data? What is the
spread?
e. Look at the BMW 5 series. Which quarter has the largest spread of data? What is the
spread?
f. Look at the BMW 5 series. Estimate the interquartile range (IQR).
g. Look at the BMW 5 series. Are there more data in the interval 31 to 38 or in the interval 45
to 55? How do you know this?
h. Look at the BMW 5 series. Which interval has the fewest data in it? How do you know this?
i. 31–35
ii. 38–41
iii. 41–64
Age Group Percent of Community
0–17 18.9
18–24 8.0
25–34 22.8
35–44 15.0
45–54 13.1
55–64 11.9
65+ 10.3
Table 1.30
a. Construct a histogram of the Japanese-American community in Santa Clara County, CA. The
bars will not be the same width for this example. Why not? What impact does this have on
the reliability of the graph?
b. What percentage of the community is under age 35?
c. Which box plot most resembles the information above?
Figure 1.21
a. all people (maybe in a certain geographic area, such as the United States)
b. a group of the people
c. the proportion of all people who will buy the product
d. the proportion of the sample who will buy the product
e. X = the number of people who will buy it
a. The country was in the middle of the Great Depression and many people could not afford these luxury
items, and therefore they were not able to be included in the survey.
b. Samples that are too small can lead to sampling bias.
c. sampling error
d. stratified
Solution to Exercise 1.1.3.23 (p. 21)
Self-Selected Samples: Only people who are interested in the topic are choosing to respond. Sample Size
Issues: A sample with only 11 participants will not accurately represent the opinions of a nation.
Undue Influence: The question is worded in a specific way to generate a specific response. Self-Funded
or Self-Interest Studies: This question was generated to support one person's claim and it was designed to
get the answer that the person desires.
Solution to Exercise 1.2.2.1 (p. 41)
For Angie: z = (26.2 − 27.2)/0.8 = −1.25
For Beth: z = (27.3 − 30.1)/1.4 = −2
Solution to Exercise 1.2.2.2 (p. 43)
Mean: 16 + 17 + 19 + 20 + 20 + 21 + 23 + 24 + 25 + 25 + 25 + 26 + 26 + 27 + 27 + 27 + 28 + 29 +
30 + 32 + 33 + 33 + 34 + 35 + 37 + 39 + 40 = 738; 738/27 = 27.33
a. It is difficult to determine which survey is correct. Both surveys include the same number of shoppers
and the shoppers were randomly selected. We could look at how the random selection was done to see
if one of the sampling techniques would result in a more representative sample. But if they used the
same sampling technique, there is no way to know which sample is right. The only way would be to
take another, larger sample and see which of the two supervisors' samples most closely matches that
sample. But really we expect there to be sampling variability, so it is not really an appropriate question
to ask which is "correct".
b. Since the mean is the same for both samples, this suggests that it is fair to say that on average shoppers
travel 6.0 km to the mall. But the standard deviations are different. This suggests that it is not yet
clear how much variation there is from the 6.0 km.
c. Ercilia's data has a larger standard deviation. Therefore, on average, the data must be more spread
out from the mean than Javier's. This suggests (b) is the answer.
a.
Enrollment Frequency
0-4999 10
5000-9999 16
10000-14999 3
15000-19999 3
20000-24999 1
25000-29999 2
Table 1.31
Figure 1.22
c. The shape is skewed to the right, which means that there are a few community colleges that have greater
enrollment compared to most of the other colleges in the sample.
d. Since the mean (8,628.74) is being skewed (as it is larger than the median of 6,414), the median is the
best measure of centre.
e. Since we are only looking at one data set, the standard deviation is a good measure of variation. It is
6,943.88.
f. The typical range is 6,414 +/− 6,943.88 = −529.88 to 13,357.88. As there can't be negative students
enrolled, the typical range is 0 students to 13,357.88. Though there could be multiple caveats, one
concern is the rather large variation in the data. This means that community colleges have very different
enrollment rates. Perhaps looking at community colleges that are similar to the one I would like to
open would be more beneficial, as that population would be more representative of my community
college.
1. Any answer requires that you examine the amount of variation in the data set. The coefficient of
variation is the best measure to use to compare the variation, as the means are different.
Table 1.32
2. The information provided is only for one year. It would be helpful to know about their changes over
more than one year. Quartiles aren't provided. They could help examine the variation as well.
3. The median and the mode are not relevant as this is a question about variation. The mean is only
required as it is needed to find the coefficient of variation.
Figure 1.23
Figure 1.24
Figure 1.25
Figure 1.26
Figure 1.27
IQR = 158
Solution to Example 1.33, Problem (p. 71)
a. Min = 32
Q 1 = 56
M = 74.5
Q 3 = 82.5
Max = 99
b. Min = 25.5
Q 1 = 78
M = 81
Q 3 = 89
Max = 98
c. Day class: There are six data values ranging from 32 to 56: 30%. There are six data values ranging
from 56 to 74.5: 30%. There are five data values ranging from 74.5 to 82.5: 25%. There are five data
values ranging from 82.5 to 99: 25%. There are 16 data values between the first quartile, 56, and the
largest value, 99: 75%. Night class:
d.
Figure 1.28
e. The first data set has the wider spread for the middle 50% of the data. The IQR for the first data set
is greater than the IQR for the second set. This means that there is more variability in the middle
50% of the first data set.
Solution to Exercise 1.2.4.3 (p. 72)
Figure 1.29
When waiting in line at the DMV, the 85th percentile would be considered a long wait time compared to
the other people waiting, so Kenji would prefer a wait time corresponding to a lower percentile. 85% of
people at the DMV waited 32 minutes or less. 15% of people at the DMV waited 32 minutes or longer.
Solution to Exercise 1.2.4.6 (p. 73)
The manufacturer and the consumer would be upset. This is a large repair cost for the damages, compared
to the other cars in the sample. INTERPRETATION: 90% of the crash tested cars had damage repair costs
of $1700 or less; only 10% had damage repair costs of $1700 or more.
Solution to Exercise 1.2.4.7 (p. 73)
You can afford 34% of houses. 66% of the houses are too expensive for your budget. INTERPRETATION:
34% of houses cost $240,000 or less. 66% of houses cost $240,000 or more.
Solution to Exercise 1.2.4.8 (p. 73)
Figure 1.30
a. The shape of the China box plot suggests that, of those surveyed, everyone except one person visited
either 0 foreign countries or 5 foreign countries. For example, if 30 people were interviewed in China,
29 of them have visited no foreign country and one of them has visited 5 foreign countries, OR 29 of
them have visited 5 foreign countries and one of them has visited no foreign countries. It is unclear
from the box plot which way it is. In Germany, 50% of those surveyed have visited 8 or fewer countries.
Based on the position of the median, this suggests that there are many people in the survey who have
visited eight countries. This suggests the distribution will have a peak at 8 and will be non-symmetric.
In the USA, 50% of those surveyed have visited 2 or fewer countries. As there are no whiskers, this
suggests that 25% of the Americans surveyed have visited no foreign countries, which suggests a skew
to the right for the distribution.
b. 25% of Germans surveyed have been to more than 8 foreign countries. It is unclear what the percentage
is for Americans, but it is less than 25%. Therefore, Germany.
c. Germans in the survey have visited far more countries than Americans and the Chinese in the survey.
China has the least foreign travel.
a. Answers will vary. Possible answer: State University conducted a survey to see how involved its
students are in community service. The box plot shows the number of community service hours logged
by participants over the past year.
b. Because the first and second quartiles are close, the data in this quarter is very similar. There is not
much variation in the values. The data in the third quarter is much more variable, or spread out. This
is clear because the second quartile is so far away from the third quartile.
Solution to Exercise 1.2.4.12 (p. 74)
a. Each box plot is spread out more in the greater values. Each plot is skewed to the right, so the ages
of the top 50% of buyers are more variable than the ages of the lower 50%.
b. The BMW 3 series is most likely to have an outlier. It has the longest whisker.
c. Comparing the median ages, younger people tend to buy the BMW 3 series, while older people tend
to buy the BMW 7 series. However, this is not a rule, because there is so much variability in each data
set.
d. The second quarter has the smallest spread. There seems to be only a three-year difference between
the first quartile and the median.
e. The third quarter has the largest spread. There seems to be approximately a 14-year difference between
the median and the third quartile.
f. IQR ∼ 17 years
g. There is not enough information to tell. Each interval lies within a quarter, so we cannot tell exactly
where the data in that quarter is concentrated.
h. The interval from 31 to 35 years has the fewest data values. Twenty-five percent of the values fall in
the interval 38 to 41, and 25% fall between 41 and 64. Since 25% of values fall between 31 and 38, we
know that fewer than 25% fall between 31 and 35.
Solution to Exercise 1.2.4.13 (p. 75)
a.
Figure 1.31
f. The five-number summary is: Minimum = 8; First quartile = 33; Median = 44; Third quartile =
49.25; Maximum = 65.
a.
Figure 1.32: This is technically not a histogram as the bars aren't touching, but without the original
data this is the best that I could come up with unless I drew it by hand!
Figure 2.1: Meteor showers are rare, but the probability of them occurring can be calculated. (credit:
Navicore/Flickr)
It is often necessary to "guess" about the outcome of an event in order to make a decision. Politicians study
polls to guess their likelihood of winning an election. Teachers choose a particular course of study based on
what they think students can comprehend. Doctors choose the treatments needed for various diseases based
on their assessment of likely results. You may have visited a casino where people play games chosen because
of the belief that the likelihood of winning is good. You may have chosen your course of study based on the
probable availability of jobs.
You have, more than likely, used probability. In fact, you probably have an intuitive sense of probability.
Probability deals with the chance of an event occurring. Whenever you weigh the odds of whether or not to
do your homework or to study for an exam, you are using probability. In this chapter, you will learn how to
solve probability problems using a systematic approach.
As the number of repetitions of an experiment increases, the relative frequency obtained in
the experiment tends to become closer and closer to the theoretical probability. Even though the outcomes
do not happen according to any set pattern or order, overall, the long-term observed relative frequency will
approach the theoretical probability. (The word empirical is often used instead of the word observed.)
It is important to realize that in many situations, the outcomes are not equally likely. A coin or die may
be unfair, or biased. Two math professors in Europe had their statistics students test the Belgian one
Euro coin and discovered that in 250 trials, a head was obtained 56% of the time and a tail was obtained
44% of the time. The data seem to show that the coin is not a fair coin; more repetitions would be helpful
to draw a more accurate conclusion about such bias. Some dice may be biased. Look at the dice in a game
you have at home; the spots on each face are usually small holes carved out and then painted to make the
spots visible. Your dice may or may not be biased; it is possible that the outcomes may be affected by the
slight weight differences due to the different numbers of holes in the faces. Gambling casinos make a lot
of money depending on outcomes from rolling dice, so casino dice are made differently to eliminate bias.
Casino dice have flat faces; the holes are completely filled with paint having the same density as the material
that the dice are made out of so that each face is equally likely to occur. Later we will learn techniques to
use to work with probabilities for events that are not equally likely.
A key concept in probability is whether an event is likely or unlikely. A likely event is an event that
has a good chance of happening, while an unlikely event is rare. For example, it is likely to snow in Calgary
in the winter, but it is unlikely to snow in Calgary in the summer (it can happen, but it would be a rare
or strange event). In general, in statistics, unlikely events usually have a probability of less than 1% of
happening. Likely events usually have a probability of greater than 10% of happening. If the probability
of the event is between 1% and 10%, it is up to the statistician or researcher to make a call to determine
whether it is likely or unlikely.
"OR" Event:
An outcome is in the event A OR B if the outcome is in A or is in B or is in both A and B. For example,
let A = {1, 2, 3, 4, 5} and B = {4, 5, 6, 7, 8}. A OR B = {1, 2, 3, 4, 5, 6, 7, 8}. Notice that 4 and 5 are
NOT listed twice.
"AND" Event:
An outcome is in the event A AND B if the outcome is in both A and B at the same time. For example, let
A and B be {1, 2, 3, 4, 5} and {4, 5, 6, 7, 8}, respectively. Then A AND B = {4, 5}.
The complement of event A is denoted A′ (read "A prime"). A′ consists of all outcomes that are NOT
in A. Notice that P(A) + P(A′) = 1. For example, let S = {1, 2, 3, 4, 5, 6} and let A = {1, 2, 3, 4}. Then,
A′ = {5, 6}, P(A) = 4/6, P(A′) = 2/6, and P(A) + P(A′) = 4/6 + 2/6 = 1.
The conditional probability of A given B is written P(A|B). P(A|B) is the probability that event A
will occur given that the event B has already occurred. A conditional reduces the sample space. We
calculate the probability of A from the reduced sample space B. The formula to calculate P(A|B) is

P(A|B) = P(A AND B)/P(B)

where P(B) is greater than zero.
For example, suppose we toss one fair, six-sided die. The sample space S = {1, 2, 3, 4, 5, 6}. Let A =
face is 2 or 3 and B = face is even (2, 4, 6). To calculate P(A|B), we count the number of outcomes 2 or 3
in the sample space B = {2, 4, 6}. Then we divide that by the number of outcomes in B (rather than S).
We get the same result by using the formula. Remember that S has six outcomes.

P(A|B) = P(A AND B)/P(B) = [(the number of outcomes that are 2 or 3 and even in S)/6] / [(the
number of outcomes that are even in S)/6] = (1/6)/(3/6) = 1/3
Odds
The odds of an event presents the probability as a ratio of success to failure. This is common in various
gambling formats. Mathematically, the odds of an event can be defined as:

P(A)/(1 − P(A)) (2.1)

where P(A) is the probability of success and of course 1 − P(A) is the probability of failure. Odds are
always quoted as "numerator to denominator," e.g. 2 to 1. Here the probability of winning is twice that
of losing; thus, the probability of winning is 0.66. A probability of winning of 0.60 would generate odds in
favor of winning of 3 to 2. While the calculation of odds can be useful in gambling venues in determining
payoff amounts, it is not helpful for understanding probability or statistical theory.
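A one-line function makes the probability-to-odds conversion concrete (a sketch of the formula above; exact fractions avoid floating-point rounding):

from fractions import Fraction

def odds_in_favor(p):
    """Ratio of the probability of success to the probability of failure."""
    return p / (1 - p)

print(odds_in_favor(Fraction(2, 3)))   # 2    -> odds of 2 to 1
print(odds_in_favor(Fraction(3, 5)))   # 3/2  -> odds of 3 to 2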
Understanding Terminology and Symbols
It is important to read each problem carefully to think about and understand what the events are.
Understanding the wording is the first very important step in solving probability problems. Reread the problem
several times if necessary. Clearly identify the event of interest. Determine whether there is a condition
stated in the wording that would indicate that the probability is conditional; carefully identify the condition,
if any.
If the sample space is the set of whole numbers starting at one and less than 20, let event A = the even
numbers and event B = numbers greater than 13.
Solution
a. S = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19}
b. A = {2, 4, 6, 8, 10, 12, 14, 16, 18}, B = {14, 15, 16, 17, 18, 19}
c. P(A) = 9/19, P(B) = 6/19
d. A AND B = {14, 16, 18}, A OR B = {2, 4, 6, 8, 10, 12, 14, 15, 16, 17, 18, 19}
e. P(A AND B) = 3/19, P(A OR B) = 12/19
f. A′ = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19}; P(A′) = 10/19
g. P(A) + P(A′) = 1 (9/19 + 10/19 = 1)
h. P(A|B) = P(A AND B)/P(B) = 3/6, P(B|A) = P(A AND B)/P(A) = 3/9; no, they are not equal.
a. S = _____________________________
Let event A = the sum is even and event B = the first number is prime.
b. A = _____________________, B = _____________________
c. P(A) = _____________, P(B) = ________________
d. A AND B = ____________________, A OR B = ________________
e. P(A AND B) = _________, P(A OR B) = _____________
f. B′ = _____________, P(B′) = _____________
g. P(A) + P(A′) = ____________
h. P(A|B) = ___________, P(B|A) = _____________; are the probabilities equal?
Example 2.2
A fair, six-sided die is rolled. Describe the sample space S, identify each of the following events
with a subset of S and compute its probability (an outcome is the number of dots that show up).
Solution
a. T = {2}, P(T) = 1/6
b. A = {2, 4, 6}, P(A) = 1/2
c. B = {1, 2, 3}, P(B) = 1/2
d. A′ = {1, 3, 5}, P(A′) = 1/2
e. A|B = {2}, P(A|B) = 1/3
f. B|A = {2}, P(B|A) = 1/3
g. A AND B = {2}, P(A AND B) = 1/6
h. A OR B = {1, 2, 3, 4, 6}, P(A OR B) = 5/6
i. A OR B′ = {2, 4, 5, 6}, P(A OR B′) = 2/3
j. N = {2, 3, 5}, P(N) = 1/2
k. A six-sided die does not have seven dots. P(7) = 0.
k. A six-sided die does not have seven dots. P(7) = 0.
Example 2.3
Table 2.1 describes the distribution of a random sample S of 100 individuals, organized by gender
and whether they are right- or left-handed.
Right-handed Left-handed
Males 43 9
Females 44 4
Table 2.1
Problem
Let's denote the events M = the subject is male, F = the subject is female, R = the subject is
right-handed, L = the subject is left-handed. Compute the following probabilities:
a. P(M )
b. P(F)
c. P(R)
d. P(L)
e. P(M AND R)
f. P(F AND L)
g. P(M OR F)
h. P(M OR R)
i. P(F OR L)
j. P(M')
k. P(R|M )
l. P(F|L)
m. P(L|F)
Solution
a. P(M ) = 0.52
b. P(F) = 0.48
c. P(R) = 0.87
d. P(L) = 0.13
e. P(M AND R) = 0.43
f. P(F AND L) = 0.04
g. P(M OR F) = 1
h. P(M OR R) = 0.96
i. P(F OR L) = 0.57
j. P(M') = 0.48
k. P(R|M ) = 0.8269 (rounded to four decimal places)
l. P(F|L) = 0.3077 (rounded to four decimal places)
m. P(L|F) = 0.0833
In this module we learned the basic terminology of probability. The set of all possible outcomes of an
experiment is called the sample space. Events are subsets of the sample space, and they are assigned a
probability that is a number between zero and one, inclusive.
2.1.2.4
Use the following information to answer the next four exercises. A box is filled with several party favors. It
contains 12 hats, 15 noisemakers, ten finger traps, and five bags of confetti.
Let H = the event of getting a hat.
Let N = the event of getting a noisemaker.
Let F = the event of getting a finger trap.
Let C = the event of getting a bag of confetti.
Exercise 2.1.2.3
Find P(H ).
Exercise 2.1.2.4 (Solution on p. 186.)
Find P(N ).
Exercise 2.1.2.5
Find P(F).
Exercise 2.1.2.6 (Solution on p. 186.)
Find P(C).
Use the following information to answer the next six exercises. A jar of 150 jelly beans contains 22 red jelly
beans, 38 yellow, 20 green, 28 purple, 26 blue, and the rest are orange.
Let B = the event of getting a blue jelly bean
Let G = the event of getting a green jelly bean.
Let O = the event of getting an orange jelly bean.
Let P = the event of getting a purple jelly bean.
Let R = the event of getting a red jelly bean.
Let Y = the event of getting a yellow jelly bean.
Exercise 2.1.2.7
Find P(B).
Use the following information to answer the next six exercises. There are 23 countries in North America, 12
countries in South America, 47 countries in Europe, 44 countries in Asia, 54 countries in Africa, and 14 in
Oceania (Pacific Ocean region).
Let A = the event that a country is in Asia.
Let E = the event that a country is in Europe.
Let F = the event that a country is in Africa.
Let N = the event that a country is in North America.
Let O = the event that a country is in Oceania.
Let S = the event that a country is in South America.
Exercise 2.1.2.13
Find P(A).
Exercise 2.1.2.14 (Solution on p. 186.)
Find P(E).
Exercise 2.1.2.15
Find P(F).
Exercise 2.1.2.16 (Solution on p. 186.)
Find P(N ).
Exercise 2.1.2.17
Find P(O).
Exercise 2.1.2.18 (Solution on p. 186.)
Find P(S).
Exercise 2.1.2.19
What is the probability of drawing a red card in a standard deck of 52 cards?
Exercise 2.1.2.20 (Solution on p. 186.)
What is the probability of drawing a club in a standard deck of 52 cards?
Exercise 2.1.2.21
What is the probability of rolling an even number of dots with a fair, six-sided die numbered one
through six?
Exercise 2.1.2.22 (Solution on p. 186.)
What is the probability of rolling a prime number of dots with a fair, six-sided die numbered one
through six?
Use the following information to answer the next two exercises. You see a game at a local fair. You have to
throw a dart at a color wheel. Each section on the color wheel is equal in area.
Figure 2.2
Use the following information to answer the next ten exercises. On a baseball team, there are infielders and
outfielders. Some players are great hitters, and some players are not great hitters.
Let I = the event that a player is an infielder.
Let O = the event that a player is an outfielder.
Let H = the event that a player is a great hitter.
Let N = the event that a player is not a great hitter.
Exercise 2.1.2.25
Write the symbols for the probability that a player is not an outfielder.
Exercise 2.1.2.26 (Solution on p. 186.)
Write the symbols for the probability that a player is an outfielder or is a great hitter.
Exercise 2.1.2.27
Write the symbols for the probability that a player is an infielder and is not a great hitter.
Use the following information to answer the next two exercises. You are rolling a fair, six-sided number cube.
Let E = the event that it lands on an even number. Let M = the event that it lands on a multiple of three.
Exercise 2.1.2.39
What does P(E|M ) mean in words?
Exercise 2.1.2.40 (Solution on p. 187.)
What does P(E OR M ) mean in words?
2.1.2.5 Homework
Exercise 2.1.2.41
Figure 2.3
The graph in Figure 2.3 displays the sample sizes and percentages of people in different age and
gender groups who were polled concerning their approval of Mayor Ford's actions in office. The
total number in the sample of all the age groups is 1,045.
Explain what is wrong with each of the following statements.
a. If there is a 60% chance of rain on Saturday and a 70% chance of rain on Sunday, then there
is a 130% chance of rain over the weekend.
b. The probability that a baseball player hits a home run is greater than the probability that he
gets a successful hit.
Two events are independent if one of the following is true:
• P(A|B) = P(A)
• P(B|A) = P(B)
• P(A AND B) = P(A)P(B)
Two events A and B are independent if the knowledge that one occurred does not affect the chance the
other occurs. For example, the outcomes of two rolls of a fair die are independent events. The outcome of
the first roll does not change the probability for the outcome of the second roll. To show two events are
independent, you must show only one of the above conditions. If two events are NOT independent, then
we say that they are dependent.
Sampling may be done with replacement or without replacement.
• With replacement: If each member of a population is replaced after it is picked, then that member
has the possibility of being chosen more than once. When sampling is done with replacement, then
events are considered to be independent, meaning the result of the first pick will not change the
probabilities for the second pick.
• Without replacement: When sampling is done without replacement, each member of a population
may be chosen only once. In this case, the probabilities for the second pick are affected by the result
of the first pick. The events are considered to be dependent or not independent.
If it is not known whether A and B are independent or dependent, assume they are dependent until
you can show otherwise.
Example 2.4
You have a fair, well-shuffled deck of 52 cards. It consists of four suits. The suits are clubs,
diamonds, hearts and spades. There are 13 cards in each suit consisting of 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, J (jack), Q (queen), K (king) of that suit.
a. Sampling with replacement:
Suppose you pick three cards with replacement. The first card you pick out of the 52 cards is the
Q of spades. You put this card back, reshuffle the cards and pick a second card from the 52-card
deck. It is the ten of clubs. You put this card back, reshuffle the cards and pick a third card from
the 52-card deck. This time, the card is the Q of spades again. Your picks are {Q of spades, ten of
clubs, Q of spades}. You have picked the Q of spades twice. You pick each card from the 52-card
deck.
b. Sampling without replacement:
Suppose you pick three cards without replacement. The first card you pick out of the 52 cards is
the K of hearts. You put this card aside and pick the second card from the 51 cards remaining in
the deck. It is the three of diamonds. You put this card aside and pick the third card from the
remaining 50 cards in the deck. The third card is the J of spades. Your picks are {K of hearts, three
of diamonds, J of spades}. Because you have picked the cards without replacement, you cannot
pick the same card twice. The probability of picking the three of diamonds is called a conditional
probability because it is conditioned on what was picked first. This is true also of the probability
of picking the J of spades. The probability of picking the J of spades is actually conditioned on
both the previous picks.
You have a fair, well-shuffled deck of 52 cards consisting of four suits, with 13 cards in each suit:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, J (jack), Q (queen), K (king) of that suit. Three cards are picked at
random.
a. Suppose you know that the picked cards are Q of spades, K of hearts and Q of spades.
Can you decide if the sampling was with or without replacement?
b. Suppose you know that the picked cards are Q of spades, K of hearts, and J of
spades. Can you decide if the sampling was with or without replacement?
Example 2.5
You have a fair, well-shuffled deck of 52 cards. It consists of four suits. The suits are clubs,
diamonds, hearts, and spades. There are 13 cards in each suit consisting of 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, J (jack), Q (queen), and K (king) of that suit. S = spades, H = Hearts, D = Diamonds, C =
Clubs.
a. Suppose you pick four cards, but do not put any cards back into the deck. Your cards are
QS, 1D, 1C, QD.
b. Suppose you pick four cards and put each card back before you pick the next card. Your
cards are KH, 7D, 6D, KH.
Which of a. or b. did you sample with replacement and which did you sample without replacement?
A and B are mutually exclusive events if they cannot occur at the same time. This means that A and B
do not share any outcomes and P(A AND B) = 0.
For example, suppose the sample space S = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. Let A = {1, 2, 3, 4, 5}, B =
{4, 5, 6, 7, 8}, and C = {7, 9}. A AND B = {4, 5}. P(A AND B) = 2/10 and is not equal to zero. Therefore,
A and B are not mutually exclusive. A and C do not have any numbers in common, so P(A AND C) = 0.
Therefore, A and C are mutually exclusive.
If it is not known whether A and B are mutually exclusive, assume they are not until you can show
otherwise. The following examples illustrate these definitions and terms.
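Because mutually exclusive events share no outcomes, set intersections make the check mechanical. Here is a small Python sketch of the example above (illustrative only):

S = set(range(1, 11))       # sample space {1, ..., 10}
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7, 8}
C = {7, 9}

print(len(A & B) / len(S))  # P(A AND B) = 0.2, not zero -> A, B not mutually exclusive
print(len(A & C) / len(S))  # P(A AND C) = 0.0           -> A, C mutually exclusive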
Example 2.6
Flip two fair coins. Find the probabilities of the events.
a. Let F = the event of getting at most one tail (zero or one tail).
b. Let G = the event of getting two faces that are the same.
c. Let H = the event of getting a head on the first flip followed by a head or tail on the second
flip.
d. Are F and G mutually exclusive?
e. Let J = the event of getting all tails. Are J and H mutually exclusive?
Solution
The sample space is {HH, HT, TH, TT}, where H means heads and T means tails.
a. Zero (0) or one (1) tails occur when the outcomes HH, TH, HT show up. P(F) = 3/4
b. Two faces are the same if HH or TT show up. P(G) = 2/4
c. A head on the first flip followed by a head or tail on the second flip occurs when HH or HT
show up. P(H) = 2/4
d. F and G share HH, so P(F AND G) is not equal to zero (0). F and G are not mutually
exclusive.
e. Getting all tails occurs when tails shows up on both coins (TT). H's outcomes are HH and
HT.
J and H have nothing in common, so P(J AND H) = 0. J and H are mutually exclusive.
Example 2.7
Roll one fair, six-sided die. The sample space is {1, 2, 3, 4, 5, 6}. Let event A = a face is odd.
Then A = {1, 3, 5}. Let event B = a face is even. Then B = {2, 4, 6}. Let event C = odd faces
larger than two. Then C = {3, 5}. Let event E = all faces less than five. Then E = {1, 2, 3, 4}.
Problem
Are C and E mutually exclusive events? (Answer yes or no.) Why or why not?
No. C = {3, 5} and E = {1, 2, 3, 4}. P(C AND E) = 1/6. To be mutually exclusive, P(C AND E)
must be zero.
• Find P(C|A). This is a conditional probability. Recall that the event C is {3, 5} and event
A is {1, 3, 5}. To find P(C|A), find the probability of C using the sample space A. You have
reduced the sample space from the original sample space {1, 2, 3, 4, 5, 6} to {1, 3, 5}. So,
P(C|A) = 2/3.
Two events A and B are independent if any one of the following is true:
• P(A|B) = P(A)
• P(B|A) = P(B)
• P(A AND B) = P(A)P(B)
Example 2.8
Let event G = taking a math class. Let event H = taking a science class. Then, G AND H =
taking a math class and a science class. Suppose P(G) = 0.6, P(H ) = 0.5, and P(G AND H ) =
0.3. Are G and H independent?
If G and H are independent, then you must show ONE of the following:
• P(G|H ) = P(G)
• P(H |G) = P(H )
• P(G AND H ) = P(G)P(H )
note: The choice you make depends on the information you have. You could choose any of
the methods here because you have the necessary information.
Problem 1
a. Show that P(G|H ) = P(G).
Solution
P(G|H) = P(G AND H)/P(H) = 0.3/0.5 = 0.6 = P(G)
Problem 2
b. Show P(G AND H ) = P(G)P(H ).
Solution
P(G)P(H ) = (0.6)(0.5) = 0.3 = P(G AND H )
Since G and H are independent, knowing that a person is taking a science class does not change the
chance that he or she is taking a math class. If the two events were not independent (that is, if
they were dependent), then knowing that a person is taking a science class would change the chance
he or she is taking math. For practice, show that P(H|G) = P(H) to verify that G and H are
independent events.
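To collect the arithmetic of this example in one place, here is a small Python sketch of the independence checks (the variable names are ours; the probabilities are those given above):

# Given values from Example 2.8.
P_G, P_H, P_G_and_H = 0.6, 0.5, 0.3

P_G_given_H = P_G_and_H / P_H    # 0.6, which equals P(G)
P_H_given_G = P_G_and_H / P_G    # 0.5, which equals P(H)

# Showing any ONE of the three conditions is enough to conclude independence.
print(P_G_given_H == P_G)                     # True
print(P_H_given_G == P_H)                     # True
print(abs(P_G * P_H - P_G_and_H) < 1e-12)     # True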
A box contains six red marbles numbered one through six and four green marbles numbered one
through four. One marble is drawn at random. Let:
• R = a red marble
• G = a green marble
• O = an odd-numbered marble
• The sample space is S = {R1, R2, R3, R4, R5, R6, G1, G2, G3, G4}.
Example 2.9
Let event C = taking an English class. Let event D = taking a speech class.
Suppose P(C) = 0.75, P(D) = 0.3, P(C|D) = 0.75 and P(C AND D) = 0.225.
Justify your answers to the following questions numerically.
a. Find P(C|D).
b. Find P(D|C).
c. Are C and D independent?
d. Are C and D mutually exclusive?
Example 2.10
In a box there are three red cards and five blue cards. The red cards are marked with the numbers
1, 2, and 3, and the blue cards are marked with the numbers 1, 2, 3, 4, and 5. The cards are
well-shuffled. You reach into the box (you cannot see into it) and draw one card.
Let R = red card is drawn, B = blue card is drawn, E = even-numbered card is drawn.
The sample space S = {R1, R2, R3, B1, B2, B3, B4, B5}. S has eight outcomes.
• P(R) = 3/8. P(B) = 5/8. P(R AND B) = 0. (You cannot draw one card that is both red and
blue.)
• P(E) = 3/8. (There are three even-numbered cards: R2, B2, and B4.)
• P(E|B) = 2/5. (There are five blue cards: B1, B2, B3, B4, and B5. Out of the blue cards,
there are two even cards: B2 and B4.)
• P(B|E) = 2/3. (There are three even-numbered cards: R2, B2, and B4. Out of the even-
numbered cards, two are blue: B2 and B4.)
• The events R and B are mutually exclusive because P(R AND B) = 0.
• Let G = card with a number greater than 3. G = {B4, B5}. P(G) = 2/8. Let H = blue card
numbered between one and four, inclusive. H = {B1, B2, B3, B4}. P(G|H) = 1/4. (The only
card in H that has a number greater than three is B4.) Since 2/8 = 1/4, P(G) = P(G|H), which
means that G and H are independent.
Let A be the event that a fan is rooting for the away team.
Let B be the event that a fan is wearing blue.
Are the events of rooting for the away team and wearing blue independent? Are they
mutually exclusive?
Example 2.11
In a particular college class, 60% of the students are female. Fifty percent of all students in the
class have long hair. Forty-five percent of the students are female and have long hair. Of the female
students, 75% have long hair. Let F be the event that a student is female. Let L be the event
that a student has long hair. One student is picked randomly. Are the events of being female and
having long hair independent?
note: The choice you make depends on the information you have. You could use the first
or last condition on the list for this example. You do not know P(F|L) yet, so you cannot use the
second condition.
Solution 1
Check whether P(F AND L) = P(F)P(L). We are given that P(F AND L) = 0.45, but P(F)P(L) =
(0.60)(0.50) = 0.30. The events of being female and having long hair are not independent because
P(F AND L) does not equal P(F)P(L).
Solution 2
Check whether P(L|F) equals P(L). We are given that P(L|F) = 0.75, but P(L) = 0.50; they are
not equal. The events of being female and having long hair are not independent.
Interpretation of Results
The events of being female and having long hair are not independent; knowing that a student is
female changes the probability that a student has long hair.
Example 2.12
a. Toss one fair coin (the coin has two sides, H and T). The outcomes are ________. Count
the outcomes. There are ____ outcomes.
b. Toss one fair, six-sided die (the die has 1, 2, 3, 4, 5 or 6 dots on a side). The outcomes are
________________. Count the outcomes. There are ___ outcomes.
c. Multiply the two numbers of outcomes. The answer is _______.
d. If you flip one fair coin and follow it with the toss of one fair, six-sided die, the answer in
part c. is the number of outcomes (size of the sample space). What are the outcomes? (Hint:
Two of the outcomes are H1 and T6.)
e. Event A = heads (H ) on the coin followed by an even number (2, 4, 6) on the die.
A = {_________________}. Find P(A).
f. Event B = heads on the coin followed by a three on the die. B = {________}. Find
P(B).
g. Are A and B mutually exclusive? (Hint: What is P(A AND B)? If P(A AND B) = 0, then
A and B are mutually exclusive.)
h. Are A and B independent? (Hint: Is P(A AND B) = P(A)P(B)? If P(A AND B) =
P(A)P(B), then A and B are independent. If not, then they are dependent).
a. H and T; 2
b. 1, 2, 3, 4, 5, 6; 6
c. 2(6) = 12
d. T1, T2, T3, T4, T5, T6, H1, H2, H3, H4, H5, H6
e. A = {H2, H4, H6}; P(A) = 3/12
f. B = {H3}; P(B) = 1/12
g. Yes, because P(A AND B) = 0
h. P(A AND B) = 0. P(A)P(B) = (3/12)(1/12) = 3/144. Since P(A AND B) does not equal P(A)P(B), A and
B are dependent.
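A quick way to double-check these answers is to enumerate the whole sample space by machine. A possible Python sketch, using itertools.product to build the 12 coin-die outcomes:

from itertools import product

sample_space = list(product("HT", "123456"))   # 12 equally likely outcomes
A = [out for out in sample_space if out[0] == "H" and out[1] in "246"]
B = [out for out in sample_space if out[0] == "H" and out[1] == "3"]

print(len(A) / len(sample_space))   # 0.25   -> P(A) = 3/12
print(len(B) / len(sample_space))   # ~0.083 -> P(B) = 1/12
print(set(A) & set(B))              # set(): no shared outcomes, so mutually exclusive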
a. Compute P(T).
b. Compute P(T|F).
c. Are T and F independent?
d. Are F and S mutually exclusive?
e. Are F and S independent?
2.1.3.3 References
Lopez, Shane, Preety Sidhu. U.S. Teachers Love Their Lives, but Struggle in the Workplace. Gallup
Wellbeing, 2013. http://www.gallup.com/poll/161516/teachers-love-lives-struggle-workplace.aspx (accessed
May 2, 2013).
Data from Gallup. Available online at www.gallup.com/ (accessed May 2, 2013).
Two events A and B are independent if the knowledge that one occurred does not affect the chance the other
occurs. If two events are not independent, then we say that they are dependent.
In sampling with replacement, each member of a population is replaced after it is picked, so that member
has the possibility of being chosen more than once, and the events are considered to be independent. In
sampling without replacement, each member of a population may be chosen only once, and the events are
considered not to be independent. When events do not share outcomes, they are mutually exclusive of each
other.
If A and B are independent, P(A ∩ B) = P(A)P(B), P(A|B) = P(A) and P(B|A) = P(B).
If A and B are mutually exclusive, P(A ∪ B) = P(A) + P(B) and P(A AND B) = 0.
2.1.3.6
Exercise 2.1.3.10
E and F are mutually exclusive events. P(E) = 0.4; P(F) = 0.5. Find P(E |F).
Exercise 2.1.3.11 (Solution on p. 188.)
J and K are independent events. P(J|K) = 0.3. Find P(J).
Exercise 2.1.3.12
U and V are mutually exclusive events. P(U ) = 0.26; P(V ) = 0.37. Find:
a. P(U AND V ) =
b. P(U |V ) =
c. P(U OR V ) =
2.1.3.7 Homework
Use the following information to answer the next 12 exercises. The graph shown is based on more than
170,000 interviews done by Gallup that took place from January through December 2012. The sample
consists of employed Americans 18 years of age or older. The Emotional Health Index Scores are the sample
space. We randomly sample one Emotional Health Index Score.
Figure 2.4
Exercise 2.1.3.14
Find the probability that an Emotional Health Index Score is 82.7.
Exercise 2.1.3.15 (Solution on p. 188.)
Find the probability that an Emotional Health Index Score is 81.0.
Exercise 2.1.3.16
Find the probability that an Emotional Health Index Score is more than 81.
Exercise 2.1.3.17 (Solution on p. 188.)
Find the probability that an Emotional Health Index Score is between 80.5 and 82.
Exercise 2.1.3.18
If we know an Emotional Health Index Score is 81.5 or more, what is the probability that it is
82.7?
Exercise 2.1.3.19 (Solution on p. 188.)
What is the probability that an Emotional Health Index Score is 80.7 or 82.7?
Exercise 2.1.3.20
What is the probability that an Emotional Health Index Score is less than 80.2 given that it is
already less than 81?
Exercise 2.1.3.21 (Solution on p. 188.)
What occupation has the highest emotional index score?
Exercise 2.1.3.22
What occupation has the lowest emotional index score?
Exercise 2.1.3.23 (Solution on p. 188.)
What is the range of the data?
Exercise 2.1.3.24
Compute the average EHIS.
Exercise 2.1.3.25 (Solution on p. 188.)
If all occupations are equally likely for a certain individual, what is the probability that he or she
will have an occupation with lower than average EHIS?
Exercise 2.1.3.26
In a previous year, the weights of the members of the San Francisco 49ers and the Dallas Cowboys
were published in the San Jose Mercury News. The factual data are compiled into Table 2.2.
Shirt# | ≤ 210 | 211–250 | 251–290 | > 290
1–33 | 21 | 5 | 0 | 0
34–66 | 6 | 18 | 7 | 4
66–99 | 6 | 12 | 22 | 5
Table 2.2
For the following, suppose that you randomly select one player from the 49ers or Cowboys.
If having a shirt number from one to 33 and weighing at most 210 pounds were independent
events, then what should be true about P(Shirt# 1–33 | ≤ 210 pounds)?
Exercise 2.1.3.27 (Solution on p. 188.)
The probability that a male develops some form of cancer in his lifetime is 0.4567. The probability
that a male has at least one false positive test result (meaning the test comes back for cancer when
the man does not have it) is 0.51. Some of the following questions do not have enough information
for you to answer them. Write not enough information for those answers. Let C = a man develops
cancer in his lifetime and P = man has at least one false positive.
a. P(C) = ______
b. P(P|C) = ______
c. P(P|C') = ______
d. If a test comes up positive, based upon numerical values, can you assume that the man has cancer?
Justify numerically and explain why or why not.
Exercise 2.1.3.28
Given events G and H : P(G) = 0.43; P(H ) = 0.26; P(H AND G) = 0.14
If A and B are two events defined on a sample space, then: P(A ∩ B) = P(B)P(A|B). We can think of
the intersection symbol as substituting for the word "and".
This rule may also be written as: P(A|B) = P(A ∩ B)/P(B)
This equation is read as the probability of A given B equals the probability of A and B divided by the
probability of B.
If A and B are independent, then P(A|B) = P(A). Then P(A ∩ B) = P(A|B)P(B) becomes
P(A ∩ B) = P(A)P(B) because P(A|B) = P(A) if A and B are independent.
One easy way to remember the multiplication rule is that the word "and" means that the event has to
satisfy two conditions. For example, the name drawn from the class roster is to be both a female and a
sophomore. It is harder to satisfy two conditions than only one, and of course when we multiply fractions
the result is always smaller. This reflects the increasing difficulty of satisfying two conditions.
If A and B are defined on a sample space, then: P(A ∪ B) = P(A) + P(B) − P(A ∩ B). We can think of
the union symbol substituting for the word "or". The reason we subtract the intersection of A and B is to
keep from double counting elements that are in both A and B.
If A and B are mutually exclusive, then P (A ∩ B) = 0. Then P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
becomes P (A ∪ B) = P (A) + P (B).
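Since the two rules are purely mechanical, a short Python sketch may help; the numbers below are hypothetical, chosen only to illustrate the formulas:

# Hypothetical values for two events A and B (not from a specific example).
P_A, P_B, P_A_given_B = 0.40, 0.30, 0.50

# Multiplication rule: P(A and B) = P(B) * P(A|B)
P_A_and_B = P_B * P_A_given_B        # 0.15

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
P_A_or_B = P_A + P_B - P_A_and_B     # 0.55

print(P_A_and_B, P_A_or_B)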
Example 2.13
Klaus is trying to choose where to go on vacation. His two choices are: A = New Zealand and B
= Alaska
• Klaus can only afford one vacation. The probability that he chooses A is P(A) = 0.6 and the
probability that he chooses B is P(B) = 0.35.
• P(A ∩ B) = 0 because Klaus can only afford to take one vacation
• Therefore, the probability that he chooses either New Zealand or Alaska is P (A ∪ B) =
P (A) + P (B) = 0.6 + 0.35 = 0.95. Note that the probability that he does not choose to go
anywhere on vacation must be 0.05.
Example 2.14
Carlos plays college soccer. He makes a goal 65% of the time he shoots. Carlos is going to attempt
two goals in a row in the next game. A = the event Carlos is successful on his first attempt. P(A)
= 0.65. B = the event Carlos is successful on his second attempt. P(B) = 0.65. Carlos tends to
shoot in streaks. The probability that he makes the second goal given that he made the first goal is 0.90.
Problem 1
a. What is the probability that he makes both goals?
Solution
a. The problem is asking you to find P(A ∩ B) = P(B ∩ A). Since P(B|A) = 0.90: P(B ∩ A) =
P(B|A)P(A) = (0.90)(0.65) = 0.585
Carlos makes the first and second goals with probability 0.585.
Problem 2
b. What is the probability that Carlos makes either the first goal or the second goal?
Solution
b. The problem is asking you to find P(A ∪ B).
P(A ∪ B) = P(A) + P(B) - P(A ∩ B) = 0.65 + 0.65 - 0.585 = 0.715
Carlos makes either the first goal or the second goal with probability 0.715.
Problem 3
c. Are A and B independent?
Solution
c. No, they are not, because P(B ∩ A) = 0.585.
P(B)P(A) = (0.65)(0.65) = 0.423
0.423 ≠ 0.585 = P(B ∩ A)
So, P(B ∩ A) is not equal to P(B)P(A).
Problem 4
d. Are A and B mutually exclusive?
Solution
d. No, they are not because P(A ∩ B) = 0.585.
To be mutually exclusive, P(A ∩ B) must equal zero.
Example 2.15
A community swim team has 150 members. Seventy-five of the members are advanced swimmers.
Forty-seven of the members are intermediate swimmers. The remainder are novice swimmers.
Forty of the advanced swimmers practice four times a week. Thirty of the intermediate swimmers
practice four times a week. Ten of the novice swimmers practice four times a week. Suppose one
member of the swim team is chosen randomly.
Problem 1
a. What is the probability that the member is a novice swimmer?
Solution
a. 28/150
Problem 2
b. What is the probability that the member practices four times a week?
Solution
b. 80/150
Problem 3
c. What is the probability that the member is an advanced swimmer and practices four times a
week?
Solution
c. 40/150
Problem 4
d. What is the probability that a member is an advanced swimmer and an intermediate swimmer?
Are being an advanced swimmer and an intermediate swimmer mutually exclusive? Why or why
not?
Solution
d. P(advanced ∩ intermediate) = 0, so these are mutually exclusive events. A swimmer cannot
be an advanced swimmer and an intermediate swimmer at the same time.
Problem 5
e. Are being a novice swimmer and practicing four times a week independent events? Why or why
not?
Solution
e. No, these are not independent events.
P(novice ∩ practices four times per week) = 0.0667
P(novice)P(practices four times per week) = 0.0996
0.0667 ≠ 0.0996
Example 2.16
Felicity attends Modesto JC in Modesto, CA. The probability that Felicity enrolls in a math class
is 0.2 and the probability that she enrolls in a speech class is 0.65. The probability that she enrolls
in a math class given that she enrolls in a speech class is 0.25.
Let: M = math class, S = speech class, M |S = math given speech
Problem
a. Find P(M ∩ S). b. Find P(M ∪ S). c. Are M and S independent? d. Are M and S mutually exclusive?
Solution
a. 0.1625, b. 0.6875, c. No, d. No
Example 2.17
Studies show that about one woman in seven (approximately 14.3%) who live to be 90 will develop
breast cancer. Suppose that of those women who develop breast cancer, a test is negative 2%
of the time. Also suppose that in the general population of women, the test for breast cancer
is negative about 85% of the time. Let B = woman develops breast cancer and let N = tests
negative. Suppose one woman is selected at random.
Problem 1
a. What is the probability that the woman develops breast cancer? What is the probability that
the woman tests negative?
Solution
a. P(B) = 0.143; P(N ) = 0.85
Problem 2
b. Given that the woman has breast cancer, what is the probability that she tests negative?
Solution
b. P(N |B) = 0.02
Problem 3
c. What is the probability that the woman has breast cancer AND tests negative?
Solution
c. P(B ∩ N ) = P(B)P(N |B) = (0.143)(0.02) = 0.0029
Problem 4
d. What is the probability that the woman has breast cancer or tests negative?
Solution
d. P(B ∪ N ) = P(B) + P(N ) - P(B ∩ N ) = 0.143 + 0.85 - 0.0029 = 0.9901
Problem 5
e. Are having breast cancer and testing negative independent events?
Solution
e. No. P(N ) = 0.85; P(N |B) = 0.02. So, P(N |B) does not equal P(N ).
Problem 6
f. Are having breast cancer and testing negative mutually exclusive?
Solution
f. No. P(B ∩ N ) = 0.0029. For B and N to be mutually exclusive, P(B ∩ N ) must be zero.
Example 2.18
Refer to the information in Example 2.17. P = tests positive.
a. Given that a woman develops breast cancer, what is the probability that she tests positive?
Find P(P|B) = 1 - P(N|B).
b. What is the probability that a woman develops breast cancer and tests positive? Find P(B ∩
P) = P(P|B)P(B).
c. What is the probability that a woman does not develop breast cancer? Find P(B') = 1 -
P(B).
d. What is the probability that a woman tests positive for breast cancer? Find P(P) = 1 - P(N).
Solution
a. 0.98; b. 0.1401; c. 0.857; d. 0.15
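All four parts are one-line complement or multiplication-rule computations. A minimal Python sketch of the arithmetic, using the values from Example 2.17:

P_B = 0.143           # P(woman develops breast cancer)
P_N = 0.85            # P(test is negative)
P_N_given_B = 0.02    # P(negative | cancer)

print(1 - P_N_given_B)            # a. P(P|B) = 0.98
print((1 - P_N_given_B) * P_B)    # b. P(B AND P) ≈ 0.1401
print(1 - P_B)                    # c. P(B') = 0.857
print(1 - P_N)                    # d. P(P) ≈ 0.15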
a. Find P(B').
b. Find P(D ∩ B).
c. Find P(B|D).
d. Find P(D ∩ B').
e. Find P(D|B').
2.1.4.3 References
DiCamillo, Mark, Mervin Field. The Field Poll. Field Research Corporation. Available online at
http://www.field.com/fieldpollonline/subscribers/Rls2443.pdf (accessed May 2, 2013).
Rider, David, Ford support plummeting, poll suggests, The Star, September 14, 2011. Available on-
line at http://www.thestar.com/news/gta/2011/09/14/ford_support_plummeting_poll_suggests.html (ac-
cessed May 2, 2013).
Mayor's Approval Down. News Release by Forum Research Inc. Available online
at http://www.forumresearch.com/forms/News Archives/News Releases/74209_TO_Issues_-
_Mayoral_Approval_%28Forum_Research%29%2820130320%29.pdf (accessed May 2, 2013).
Roulette. Wikipedia. Available online at http://en.wikipedia.org/wiki/Roulette (accessed May 2,
2013).
Shin, Hyon B., Robert A. Kominski. Language Use in the United States: 2007. United States Cen-
sus Bureau. Available online at http://www.census.gov/hhes/socdemo/language/data/acs/ACS-12.pdf (ac-
cessed May 2, 2013).
Data from the Baseball-Almanac, 2013. Available online at www.baseball-almanac.com (accessed May 2,
2013).
Data from U.S. Census Bureau.
Data from the Wall Street Journal.
Data from The Roper Center: Public Opinion Archives at the University of Connecticut. Available online
at http://www.ropercenter.uconn.edu/ (accessed May 2, 2013).
Data from Field Research Corporation. Available online at www.field.com/fieldpollonline (accessed May
2, 2013).
The multiplication rule and the addition rule are used for computing the probability of A and B, as well
as the probability of A or B for two given events A, B defined on the sample space. In sampling with
replacement each member of a population is replaced after it is picked, so that member has the possibility
of being chosen more than once, and the events are considered to be independent. In sampling without
replacement, each member of a population may be chosen only once, and the events are considered to be
not independent. The events A and B are mutually exclusive events when they do not have any outcomes
in common.
2.1.4.6
Use the following information to answer the next ten exercises. Forty-eight percent of all California
registered voters prefer life in prison without parole over the death penalty for a person convicted of first
degree murder. Among Latino California registered voters, 55% prefer life in prison without parole over the
death penalty for a person convicted of first degree murder. 37.6% of all Californians are Latino.
In this problem, let:
• C = Californians (registered voters) preferring life in prison without parole over the death penalty for
a person convicted of first degree murder.
• L = Latino Californians
Suppose that one Californian is randomly selected.
Exercise 2.1.4.6
Find P(C).
Exercise 2.1.4.7 (Solution on p. 189.)
Find P(L).
Exercise 2.1.4.8
Find P(C |L).
Exercise 2.1.4.9 (Solution on p. 189.)
In words, what is C |L?
Exercise 2.1.4.10
Find P(L ∩ C).
Exercise 2.1.4.11 (Solution on p. 189.)
In words, what is L ∩ C?
Exercise 2.1.4.12
Are L and C independent events? Show why or why not.
Exercise 2.1.4.13 (Solution on p. 189.)
Find P(L ∪ C).
Exercise 2.1.4.14
In words, what is L ∪ C?
Exercise 2.1.4.15 (Solution on p. 189.)
Are L and C mutually exclusive events? Show why or why not.
2.1.4.7 Homework
Exercise 2.1.4.16
On February 28, 2013, a Field Poll Survey reported that 61% of California registered voters approved
of allowing two people of the same gender to marry and have regular marriage laws apply to
them. Among 18 to 39 year olds (California registered voters), the approval rating was 78%.
Six in ten California registered voters said that the upcoming Supreme Court's ruling about the
constitutionality of California's Proposition 8 was either very or somewhat important to them. Out
of those CA registered voters who support same-sex marriage, 75% say the ruling is important to
them.
In this problem, let:
• C = California registered voters who support same-sex marriage
• B = California registered voters who say the Supreme Court's ruling about the constitutionality of California's Proposition 8 is very or somewhat important to them
• A = California registered voters who are 18 to 39 years old
a. Find P(C).
b. Find P(B).
c. Find P(C |A).
d. Find P(B|C).
e. In words, what is C |A?
f. In words, what is B |C?
g. Find P(C ∩ B).
h. In words, what is C ∩ B?
i. Find P(C ∪ B).
j. Are C and B mutually exclusive events? Show why or why not.
• In early 2011, 60 percent of the population approved of Mayor Ford's actions in oce.
• In mid-2011, 57 percent of the population approved of his actions.
• In late 2011, the percentage of popular approval was measured at 42 percent.
Use the following information to answer the next three exercises. The casino game, roulette, allows the
gambler to bet on the probability of a ball, which spins in the roulette wheel, landing on a particular color,
number, or range of numbers. The table used to place bets contains 38 numbers, and each number is
assigned to a color and a range.
Exercise 2.1.4.18
Compute the probability of winning the following types of bets:
a. Betting on two lines that touch each other on the table as in 1-2-3-4-5-6
b. Betting on three numbers in a line, as in 1-2-3
c. Betting on one number
d. Betting on four numbers that touch each other to form a square, as in 10-11-13-14
e. Betting on two numbers that touch each other on the table, as in 10-11 or 10-13
f. Betting on 0-00-1-2-3
g. Betting on 0-1-2; or 0-00-2; or 00-2-3
Exercise 2.1.4.20
Compute the probability of winning the following types of bets:
a. Betting on a color
Exercise 2.1.4.22
Roll two fair dice separately. Each die has six faces.
Exercise 2.1.4.24
An experiment consists of first rolling a die and then tossing a coin.
b. Let A be the event that either a three or a four is rolled first, followed by landing a head on
the coin toss. Find P(A).
c. Let B be the event that the first and second tosses land on heads. Are the events A and
B mutually exclusive? Explain your answer in one to three complete sentences, including
numerical justification.
Exercise 2.1.4.26
Consider the following scenario:
Let P(C) = 0.4.
Let P(D) = 0.5.
Let P(C |D) = 0.6.
a. Rewrite the basic Addition Rule P(Y ∪ Z) = P(Y ) + P(Z) - P(Y ∩ Z) using the information
that Y and Z are independent events.
b. Use the rewritten rule to find P(Z) if P(Y ∪ Z) = 0.71 and P(Y) = 0.42.
Exercise 2.1.4.28
G and H are mutually exclusive events. P(G) = 0.5; P(H) = 0.3.
a. Explain why the following statement MUST be false: P(H |G) = 0.4.
b. Find P(H ∪ G).
c. Are G and H independent or dependent events? Explain in a complete sentence.
a. P(E′) = i. 0.8043
b. P(E) = ii. 0.623
c. P(S ∩ E′) = iii. 0.1957
d. P(S|E′) = iv. 0.1219
Table 2.3
Exercise 2.1.4.30
In 1994, the U.S. government held a lottery to issue 55,000 Green Cards (permits for non-citizens to
work legally in the U.S.). Renate Deutsch, from Germany, was one of approximately 6.5 million
people who entered this lottery. Let G = won green card.
a. What was Renate's chance of winning a Green Card? Write your answer as a probability
statement.
b. In the summer of 1994, Renate received a letter stating she was one of 110,000 finalists chosen.
Once the finalists were chosen, assuming that each finalist had an equal chance to win, what
was Renate's chance of winning a Green Card? Write your answer as a conditional probability
statement. Let F = was a finalist.
c. Are G and F independent or dependent events? Justify your answer numerically and also
explain why.
d. Are G and F mutually exclusive events? Justify your answer numerically and explain why.
Exercise 2.1.4.32
The following table of data obtained from www.baseball-almanac.com shows hit information for
four players. Suppose that one hit from the table is randomly selected.
Table 2.4
Are "the hit being made by Hank Aaron" and "the hit being a double" independent events?
a. Find the probability that a person has both type O blood and the Rh- factor.
b. Find the probability that a person does NOT have both type O blood and the Rh- factor.
Exercise 2.1.4.34
At a college, 72% of courses have final exams and 46% of courses require research papers. Suppose
that 32% of courses have a research paper and a final exam. Let F be the event that a course has
a final exam. Let R be the event that a course requires a research paper.
a. Find the probability that a course has a final exam or a research project.
b. Find the probability that a course has NEITHER of these two requirements.
a. Find the probability that a cookie contains chocolate or nuts (he can't eat it).
b. Find the probability that a cookie does not contain chocolate or nuts (he can eat it).
Exercise 2.1.4.36
A college finds that 10% of students have taken a distance learning class and that 40% of students
are part time students. Of the part time students, 20% have taken a distance learning class. Let D
= event that a student takes a distance learning class and E = event that a student is a part time
student.
A contingency table provides a way of portraying data that can facilitate calculating probabilities. The
table helps in determining conditional probabilities quite easily. The table displays sample values in relation
to two different variables that may be dependent or contingent on one another. Later on, we will use
contingency tables again, but in another manner.
Example 2.19
Suppose a study of speeding violations and drivers who use cell phones produced the following
fictional data:
Table 2.5
The total number of people in the sample is 755. The row totals are 305 and 450. The column
totals are 70 and 685. Notice that 305 + 450 = 755 and 70 + 685 = 755.
Calculate the following probabilities using the table.
Problem 1
a. Find P(Person is a car phone user).
Solution
a. (number of car phone users)/(total number in study) = 305/755
Problem 2
b. Find P(person had no violation in the last year).
Solution
b. (number that had no violation)/(total number in study) = 685/755
Problem 3
c. Find P(Person had no violation in the last year AND was a car phone user).
Solution
c. 280/755
Problem 4
d. Find P(Person is a car phone user OR person had no violation in the last year).
Solution
d. 305/755 + 685/755 − 280/755 = 710/755
Problem 5
e. Find P(Person is a car phone user GIVEN person had a violation in the last year).
Solution
e. 25/70 (The sample space is reduced to the number of persons who had a violation.)
Problem 6
f. Find P(Person had no violation last year GIVEN person was not a car phone user)
Solution
f. 405/450 (The sample space is reduced to the number of persons who were not car phone users.)
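The six answers can be reproduced directly from the cell counts. Here is a possible Python sketch of Example 2.19's table, with the individual counts read off from the totals and solutions given above:

# Cell counts reconstructed from the totals and answers in the text.
user_violation, user_none = 25, 280          # car phone users (row total 305)
nonuser_violation, nonuser_none = 45, 405    # non-users (row total 450)
total = 755

print((user_violation + user_none) / total)    # a. 305/755
print((user_none + nonuser_none) / total)      # b. 685/755
print(user_none / total)                       # c. 280/755
print((305 + 685 - 280) / total)               # d. 710/755
print(user_violation / 70)                     # e. 25/70 (violation column only)
print(nonuser_none / 450)                      # f. 405/450 (non-user row only)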
Table 2.6
Example 2.20
Table 2.7: Hiking Area Preference shows a random sample of 100 hikers and the areas of hiking
they prefer.
Sex The Coastline Near Lakes and Streams On Mountain Peaks Total
Female 18 16 ___ 45
Male ___ ___ 14 55
Total ___ 41 ___ ___
Table 2.7
Problem 1
a. Complete the table.
Solution
a.
Hiking Area Preference
Sex The Coastline Near Lakes and Streams On Mountain Peaks Total
Female 18 16 11 45
Male 16 25 14 55
Total 34 41 25 100
Table 2.8
Problem 2
b. Let F = being female and let C = preferring the coastline. Find P(F AND C) and P(F)P(C).
Are these two numbers the same? If they are, then F and C are independent. If they are not, then
F and C are not independent.
Problem 3 (Solution on p. 191.)
c. Find the probability that a person is male given that the person prefers hiking near lakes and
streams. Let M = being male, and let L = prefers hiking near lakes and streams.
1. Find P(F).
2. Find P(P).
3. Find P(F AND P).
Gender Lake Path Hilly Path Wooded Path Total
Female 45 38 27 110
Male 26 52 12 90
Total 71 90 39 200
Table 2.9
a. Out of the males, what is the probability that the cyclist prefers a hilly path?
b. Are the events being male and preferring the hilly path independent events?
Example 2.21
Muddy Mouse lives in a cage with three doors. If Muddy goes out the first door, the probability
that he gets caught by Alissa the cat is 1/5 and the probability he is not caught is 4/5. If he goes
out the second door, the probability he gets caught by Alissa is 1/4 and the probability he is not
caught is 3/4. The probability that Alissa catches Muddy coming out of the third door is 1/2 and the
probability she does not catch Muddy is 1/2. It is equally likely that Muddy will choose any of the
three doors so the probability of choosing each door is 1/3.
Door Choice
 Door One Door Two Door Three Total
Caught 1/15 1/12 1/6 ____
Not Caught 4/15 3/12 1/6 ____
Total ____ ____ ____ 1
Table 2.10
• The entry 4/15 = (4/5)(1/3) is P(Door One AND Not Caught)
Problem 2
b. What is the probability that Alissa does not catch Muddy?
Solution
b. 41/60
Problem 3
c. What is the probability that Muddy chooses Door One OR Door Two given that Muddy is
caught by Alissa?
Solution
c. 9/19
Example 2.22
Table 2.11: United States Crime Index Rates Per 100,000 Inhabitants 2008–2011 contains the
number of crimes per 100,000 inhabitants from 2008 to 2011 in the U.S.
Table 2.11
Problem
TOTAL each column and each row. Total data = 4,520.7
Solution
a. 0.0294, b. 0.1551, c. 0.7165, d. 0.2365, e. 0.2575
Obese 18 28 14
Normal 20 51 28
Underweight 12 25 9
Totals
Table 2.12
Sometimes, when the probability problems are complex, it can be helpful to graph the situation. Tree
diagrams can be used to visualize and solve conditional probabilities.
A tree diagram is a special type of graph used to determine the outcomes of an experiment. It consists of
"branches" that are labeled with either frequencies or probabilities. Tree diagrams can make some probability
problems easier to visualize and solve. The following example illustrates how to use a tree diagram.
Example 2.23
In an urn, there are 11 balls. Three balls are red (R) and eight balls are blue (B). Draw two balls,
one at a time, with replacement. "With replacement" means that you put the first ball back
in the urn before you select the second ball. The tree diagram using frequencies that shows all the
possible outcomes follows.
The first set of branches represents the first draw. The second set of branches represents the
second draw. Each of the outcomes is distinct. In fact, we can list each red ball as R1, R2, and R3
and each blue ball as B1, B2, B3, B4, B5, B6, B7, and B8. Then the nine RR outcomes can be
written as:
R1R1; R1R2; R1R3; R2R1; R2R2; R2R3; R3R1; R3R2; R3R3
The other outcomes are similar.
There are a total of 11 balls in the urn. Draw two balls, one at a time, with replacement. There
are 11(11) = 121 outcomes, the size of the sample space.
Problem 3
c. Using the tree diagram, calculate P(RB OR BR).
Solution
c. P(RB OR BR) = (3/11)(8/11) + (8/11)(3/11) = 48/121
Problem 4
d. Using the tree diagram, calculate P(R on 1st draw AND B on 2nd draw).
Solution
d. P(R on 1st draw AND B on 2nd draw) = P(RB) = (3/11)(8/11) = 24/121
Problem 5
e. Using the tree diagram, calculate P(R on 2nd draw GIVEN B on 1st draw).
Solution
e. P(R on 2nd draw GIVEN B on 1st draw) = P(R on 2nd|B on 1st) = 24/88 = 3/11
This problem is a conditional one. The sample space has been reduced to those outcomes that
already have a blue on the first draw. There are 24 + 64 = 88 possible outcomes (24 BR and 64
BB). Twenty-four of the 88 possible outcomes are BR. 24/88 = 3/11.
Problem 6
f. Using the tree diagram, calculate P(BB).
Solution
f. P(BB) = 64/121
Figure 2.7
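Because the draws are with replacement, all 121 ordered pairs are equally likely, so the tree's probabilities can be checked by brute-force enumeration. A possible Python sketch (the helper prob is ours):

from itertools import product

balls = ["R"] * 3 + ["B"] * 8                 # the 11 balls, treated as distinct
outcomes = list(product(balls, repeat=2))     # 11 x 11 = 121 ordered pairs

def prob(first, second):
    return sum(1 for a, b in outcomes if a == first and b == second) / len(outcomes)

print(prob("R", "R"))                    # 9/121
print(prob("R", "B") + prob("B", "R"))   # 48/121
print(prob("B", "B"))                    # 64/121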
Example 2.24
An urn has three red marbles and eight blue marbles in it. Draw two marbles, one at a time,
this time without replacement, from the urn. "Without replacement" means that you do not
put the first marble back before you select the second marble. Following is a tree diagram for this
situation. The branches are labeled with probabilities instead of frequencies. The numbers at the
ends of the branches are calculated by multiplying the numbers on the two corresponding branches,
for example, (3/11)(2/10) = 6/110.
note: If you draw a red on the first draw from the three red possibilities, there are two red marbles
left to draw on the second draw. You do not put back or replace the first marble after you have
drawn it. You draw without replacement, so that on the second draw there are ten marbles left
in the urn.
Calculate the following probabilities using the tree diagram.
Problem 1
a. P(RR) = ________
Solution
a. P(RR) = (3/11)(2/10) = 6/110
P(RB OR BR) = (3/11)(8/10) + (___)(___) = 48/110
Problem 5
e. Find P(BB).
Solution
e. P(BB) = (8/11)(7/10)
Problem 6
f. Find P(B on 2nd|R on 1st).
Solution
f. Using the tree diagram, P(B on 2nd|R on 1st) = P(R|B) = 10 .
8
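The same brute-force check works without replacement by enumerating ordered pairs of distinct marbles; itertools.permutations treats the 11 marbles as distinct objects, which is exactly what the counting argument above does. A minimal Python sketch:

from itertools import permutations

marbles = ["R"] * 3 + ["B"] * 8
outcomes = list(permutations(marbles, 2))     # 11 x 10 = 110 ordered pairs

def prob(first, second):
    return sum(1 for a, b in outcomes if a == first and b == second) / len(outcomes)

print(prob("R", "R"))                    # 6/110
print(prob("R", "B") + prob("B", "R"))   # 48/110
print(prob("B", "B"))                    # 56/110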
If we are using probabilities, we can label the tree in the following general way.
Figure 2.9
Example 2.25
A litter of kittens available for adoption at the Humane Society has four tabby kittens and five black
kittens. A family comes in and randomly selects two kittens (without replacement) for adoption.
Problem
a. What is the probability that both kittens are tabby?
a. (1/2)(1/2) b. (4/9)(4/9) c. (4/9)(3/8) d. (4/9)(5/9)
b. What is the probability that one kitten of each coloring is selected?
a. (4/9)(5/9) b. (4/9)(5/8) c. (4/9)(5/9) + (5/9)(4/9) d. (4/9)(5/8) + (5/9)(4/8)
c. What is the probability that a tabby is chosen as the second kitten when a black kitten was
chosen as the first?
d. What is the probability of choosing two kittens of the same color?
Solution
a. c, b. d, c. 4/8, d. 32/72
There are several tools you can use to help organize and sort data when calculating probabilities. Contingency
tables help display data and are particularly useful when calculating probabilities that have multiple
dependent variables.
A tree diagram uses branches to show the different outcomes of experiments and makes complex probability
questions easy to visualize.
Suppose you flip a coin ten times and each time it comes up heads. This might make you start to wonder if
there is something wrong with the coin. Perhaps it is a trick coin and is heads on both sides? Perhaps it is
imbalanced and more likely to come up heads than tails. You may also wonder what the probability
of getting ten heads in a row would be if the coin were fair.
Coin flipping is interesting because it is a random event. We cannot predict whether the next flip will
be heads or tails (assuming it isn't a trick coin). That means the outcome would be a random variable. A
random variable is any variable where the outcome is determined by a random event. The outcome is
also discrete because we count it. Above you flipped a coin ten times and counted the number of heads. A
discrete random variable is a variable whose outcome is determined by a random event and where we
count the outcomes. Other examples of discrete random variables include how many times you roll an even
number with a die out of ten rolls; how many customers enter a store during a five-minute interval; how
many times you draw a high card out of a deck of cards out of eight draws (without replacement).
In each of these situations (coin toss, rolling a die, number of customers, drawing cards), you could look at
each situation and, each time, come up with a new formula to find the probability of these events happening.
But this would take a lot of work and be inefficient. Instead, you would want to see if the situation can be
modelled by a distribution. A probability distribution provides the theoretical probabilities of all of the
possible events in a situation. For example, the following is a probability model of how many heads you can
get when you flip a fair coin three times:
# of heads Probability
0 0.125
1 0.375
2 0.375
3 0.125
Table 2.13
Notice that the probabilities range from 0 to 1 and that the sum of the probabilities is 1.
The above table could be determined by working out all of the possible outcomes (TTT, TTH, THT,
HTT, etc.), then counting how many heads were in each outcome. But again, that is time consuming.
Instead, you want to see if there is a probability distribution that models your situation that you can use.
For example, coin tossing can be modelled by the binomial distribution.
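Table 2.13 can be reproduced by enumerating the eight equally likely sequences of three flips. A short Python sketch:

from itertools import product

flips = list(product("HT", repeat=3))       # 8 equally likely sequences
for k in range(4):
    p = sum(1 for seq in flips if seq.count("H") == k) / len(flips)
    print(k, p)   # prints 0 0.125, 1 0.375, 2 0.375, 3 0.125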
The binomial distribution is an example of a model for discrete random variables. There are many other
models for discrete random variables including Poisson, geometric, hypergeometric, and discrete uniform to
name a few. Each distribution comes with a set of criteria and if a situation fits those criteria, then the
distribution can model it. That is, the distribution can produce theoretical probabilities for that situation.
In this chapter, we are going to learn about the binomial distribution, which is a model for discrete random
variables. In the next chapter, we will learn about the normal distribution, which is a model for continuous
random variables.
In particular, we want to use the binomial distribution to evaluate evidence. For example, going back to
the example of flipping a coin ten times and getting ten heads, we want to use the evidence (getting ten heads)
to determine whether we think there is something wrong with the coin.
Flipping a coin a certain number of times, let's say ten times, is a classic example of a binomial distribution.
What are the characteristics of flipping a coin that make it binomial?
Before we answer that question, let's get a bit of terminology out of the way. In probability theory, an
experiment is the actual process that you are investigating. In the coin flipping example, the experiment is
flipping a coin ten times. A trial is a specific instance of an experiment. Flipping a coin only once is
considered a trial.
Going back to the coin flipping example, let's assume that we are dealing with a fair coin (i.e. the probability
of getting a head or a tail is 50%). We've already discussed that when we count the number of heads that
this is an example of a discrete random variable. Notice that there are only two possible outcomes (heads or
tails). This is a key criterion for a binomial distribution (binomial derives from Latin for two terms). Also,
notice that the events are independent of each other. That is, if you get a head on one flip, that has no
impact on the probability of getting a head on the next flip. This also means that the probability of getting
a head remains constant. This is another key criterion for a binomial distribution. The other thing to notice
is that the number of times we flip the coin is fixed. We don't flip it until we get bored or run out of time.
Instead we flip it ten times. This means that the number of trials is fixed. This is the last key criterion of a
binomial distribution.
There are ve characteristics of a binomial experiment.
1. The variable being studied is random.
2. The outcomes of the variable are being counted.
3. There are a fixed number of trials. The letter n denotes the number of trials.
4. There are only two possible outcomes, called "success" and "failure," for each trial. π denotes the
probability of a success on one trial, and 1 − π denotes the probability of a failure on one trial.
5. The n trials are independent and are repeated using identical conditions. Because the n trials are
independent, the outcome of one trial does not help in predicting the outcome of another trial. Another
way of saying this is that for each individual trial, the probability, π, of a success and probability, 1 − π,
of a failure remain constant.
Other examples of binomial distributions:
• Counting the number of 2's that are rolled when you roll a die six times (trial = rolling a die, success
= rolling a 2, n = 6, π = 1/6 = 0.1667)
• Counting the number of times a jack is pulled out of a deck of cards (with replacement) when you pull
a card fifteen times (trial = pulling a card, success = pulling a jack, n = 15, π = 4/52 = 0.0769)
• Counting the number of times that you win a prize in Tim Hortons Roll up the Rim to Win contest
out of four cups (trial = checking a cup for a win, success = winning a prize, n = 4, π = 1/6 = 0.1667,
assuming no special rules (e.g. anniversary rules that changed the odds of winning))
Examples of situations that are not binomial include:
• Counting the number of times a jack is pulled out of a deck of cards (without replacement) when you
pull a card fifteen times. The fifth criterion is not met because the events are now dependent.
• Counting the number of times each number (1 to 6) is rolled when you roll a die fifty times. The fourth
criterion is not met because there are six possible outcomes instead of two.
• Counting the number of times that you win a prize in Tim Horton's Roll up the Rim to Win contest
out of however many cups you buy during the contest. Unless you know exactly how many you'll buy
during the contest, this would not meet the third criterion of having a fixed number of trials.
note: The Roll up the Rim example might not be binomial as it may fail the fifth criterion. At the
beginning of the contest, the odds of winning are determined by counting how many prizes there
are out of the total number of cups printed. As the contest goes on, the probability of winning may
change depending on how many people have already won. At the beginning of the contest, this
is also true but there are so many cups that it doesn't really matter (think back to the sampling
with replacement vs. without replacement in Chapter 1). Thus, this contest is only binomial at
the beginning of the contest.
2.2.2.2 Notation
Suppose we are working on a probability question and there are multiple probabilities that need to be found.
Then it gets time consuming to write out, for example, the probability that three rolls of a die will result
in at least one 2 or some variation over and over again. Instead we will use notation to reduce the work.
We can write the previous statement more quickly as P(X ≥ 1). The P( ) means the probability of. X is
the random variable being studied (in this case the number of times 2 has been rolled out of 3 rolls). X ≥
1 means we are looking at the number of times a 2 is rolled at least once.
It is important to define X. Otherwise, P(2 < X ≤ 5) could refer to any random variable and the person
reading the notation won't know what it means.
Just like a set of data, a binomial distribution has a mean and a standard deviation. For the binomial
distribution, these are given by the formulas:
µ = nπ
σ = √(nπ(1 − π))
Going back to the Tim Hortons example, we had n = 4 and π = 0.1667. Thus µ = 4 × 0.1667 = 0.6667
and σ = √(4 × 0.1667 × (1 − 0.1667)) = 0.745. This means that if we buy four random cups of Tim Hortons
coffee during the Roll Up the Rim contest, we will typically win 0.67 times, give or take 0.75. Thus, when
buying four cups of coffee, we will typically win between -0.08 and 1.42 times. Since we can't win negative
times, we will round the lower bound to 0. Therefore, when buying four cups of coffee, we will typically win
between 0 and 1.42 times.
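In code, the two formulas are one line each. A Python sketch using the Tim Hortons numbers above:

import math

n, pi = 4, 0.1667
mu = n * pi                             # about 0.667 wins on average
sigma = math.sqrt(n * pi * (1 - pi))    # about 0.745

print(mu, sigma)
print(max(mu - sigma, 0), mu + sigma)   # typical range: about 0 to 1.41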
Exercise 2.2.2.1 (Solution on p. 192.)
A market research study shows that 30% of all passengers on Canadian Airlines are business
travelers. A random sample of 20 passengers is taken.
1. Explain why the above situation satises the criteria of a binomial distribution. If there are
any issues with why this situation may not meet all of the criteria, discuss them. Define n, X,
and π.
2. For the random sample, determine the probability that:
A company looked at its hiring practices. In particular, they found that their hiring practices appear to
favour men over women. Based on past data, they have found that regardless of the number of applications
by women, seventy-five percent of hires are men. Due to this issue, they decide to implement a program.
In this program, the name and any identifying features that may indicate the gender of an applicant are
removed. For example, if the application says, "She executed a marketing campaign that increased revenue
by 30%," this would be changed to "They executed a marketing campaign that increased revenue by 30%."
The names on the applications were changed to an alpha-numeric identification (like AB-101). The company
claims that the program has worked, but they want to check the claim.
How will the company determine if the program has worked? One way to do this would be using statistics.
Now suppose that after a recent round of hiring, the proportion of men hired was 70%. Would this be
enough evidence that the program is working? 70% is definitely lower than 75%, but we know that there is
variability in sampling. This means that, prior to the program being implemented, even though around 75%
of hires were men overall, there may have been some rounds of hiring where 70% of hires were men and some
where it was 80%. It won't be 75% each time. Instead we expect it to be close to 75%. Therefore, if the
program has caused the hiring practices to change, would a recent round of hiring that results in the
proportion of men hired being 70% be enough evidence of change? What about 60%? What is the line
between normal variability from 75% and abnormal variability? Statistics helps us figure that out, and that
is how we evaluate evidence using statistics.
Let's say in a recent round of hires there were 30 new hires and 20 of these hires were men.
2.2.2.3.1 Skepticism
Any time we are trying to evaluate evidence, we always start from a position of skepticism. That is, we don't
want to assume what we are trying to show (i.e. the claim). If we do that, we may bias the investigation. To
illustrate, if you assume that your significant other is cheating on you, then this will colour all of the evidence
you find (why did they show up five minutes late from work? They must be cheating!). A well-known real-
world example of this position is the assumption in court that a defendant is innocent until proven guilty.
That is, criminal court cases start with the assumption of innocence.
In general, the position of skepticism is that nothing has changed, the program didn't work, the experiment
didn't work, the effect being studied isn't happening, etc.
In our example, we will assume that the program that the company implemented did not work. That is,
we are assuming that the proportion of hires that are men is still 75%. Another way of writing this is π =
75% (i.e. the population proportion).
2.2.2.3.2 Evidence
In a court case, evidence would be witness testimony, forensics evidence, expert testimony, etc.
In statistics, evidence is sample data. The evidence has been collected to evaluate the claim. In this case,
the evidence has been evaluated to determine if the program is working.
In our example, the sample data is the 20 men hired out of 30. This gives us a sample proportion of
20/30 = 66.67%. The symbol for the sample proportion is p̂ (read "p hat").
To evaluate the evidence, we want to determine the probability of observing the evidence (or even better
evidence against the assumption) assuming the assumption is true. Once we determine this probability, we
need to determine if the event is unlikely or not unlikely. That is, we want to determine whether it is
unlikely that we could have observed the evidence if the assumption is true, or whether it is not unlikely.
If it is unlikely to have observed the evidence, then most likely there is something wrong with the
assumption and the claim is likely true. If it is not unlikely to have observed the evidence, then we can't
actually conclude that there is something wrong with the assumption and we cannot conclude that the
claim is true.
To go back to the court case example, if you are a juror, you have to evaluate how unlikely or not unlikely
it is that the defendant would have had a heated argument with the victim, and was found covered in blood
and holding the murder weapon at the scene, if the defendant was innocent. If you think that it is unlikely
that all of the pieces of evidence could have happened if the defendant is innocent, then you would find the
defendant guilty. That is, the evidence calls into question the assumption. If you think that it is not
unlikely that all of these pieces of evidence could have happened if the defendant is innocent, then you would
find the defendant not guilty. Notice that we don't conclude that the defendant is innocent. That is, we
can't say that they are innocent; we can only say that they are not guilty.
note: When evaluating evidence, we are trying to evaluate the claim (i.e. not the position of
skepticism). Therefore, the evidence has been collected about the claim. No evidence has been
collected about the assumption. Therefore, our conclusion can only be about the claim and not the
assumption.
Therefore, if the probability is small and therefore unlikely, we can say that there is enough evidence to
suggest that the assumption is likely false (i.e. guilty).
If the probability is not small and therefore not unlikely, we can say that there is not enough evidence to
suggest that the assumption is false (i.e. not guilty).
note: In statistics, if the probability of an event happening is less than 1%, we say that the event
is unlikely to happen. If the probability is greater than 10%, we say the event is not unlikely to
happen. If the probability is between 1% and 10%, then it is up to the researcher to determine
whether they believe that the event is unlikely or not unlikely. Usually, the researcher decides on
the threshold between unlikely and not unlikely before performing the experiment or study.
In our example, to evaluate the evidence, we want to work out the probability that this company would
have hired 20 men out of 30 (or even better evidence against the assumption) if the proportion of men hired is
still 75%. That is, we want to find P(X ≤ 20 given π = 75%). Notice that this is a conditional probability
and the condition is the assumption.
What does "or even better evidence against the assumption" mean? It means that we don't just find the
probability of exactly 20 out of 30 men hired. We find the probability of at most 20 out of 30 men hired
because if the company hired 19, 18, 17, . . . men then that would be even better evidence that 75% is no
longer correct (as the sample proportion is getting more and more different from the assumed population
proportion).
Why do we look at "or even better evidence against the assumption"? Often the probability of exactly
one event happening is quite small. For example, the probability of getting exactly 10 heads out of 20 coin
tosses is 17.62%, even though that is the most likely event to occur. Therefore, if we only looked at the
probability of exactly one event happening (i.e. P(X = 20)) rather than P(X ≤ 20), we may come to the
false impression that an event is unlikely, when it could actually be explained by normal sampling variability.
To find the probability, we need to find an appropriate distribution that models the situation. In later
chapters, we will look at other models. Right now, the model we are going to use is the binomial distribution.
For us to use this model, we have to ensure that the situation meets all of the conditions of the binomial
distribution.
1. The variable being studied is random: This is not necessarily the case here as the applicants are not
random and the hiring process is not random. If we randomly selected 30 hires from a greater number
of hires, then it would be.
2. The outcomes of the variable are being counted: We are counting the number of men hired.
3. There are a xed number of trials: We are looking at 30 hires (n = 30)
4. There are only two possible outcomes: Either the hire is a man or the hire is not a man.
5. The n trials are independent and the probability of success and probability of failure remain constant:
This is true because we are assuming that the probability of hiring a man remains constant at 75%.
Though the first condition is not met, we can still use the binomial distribution to model the situation. That
the model is not perfectly met would be a limitation of the study. That means that we would want to put
a caveat at the end of our conclusion to state that this might reduce the accuracy of our results.
warning: If the conditions of randomness and independence may not be fully met, then we can
still utilize the binomial distribution. But we do have to be wary of the results. The other three
conditions do need to be met to use the binomial distribution.
Now that we have the model, we can find the probability. Using a computer program, we will use the binomial
distribution with n = 30 and π (or probability of occurrence) = 0.75. Then we will find P(X ≤ 20).
From the computer program, we get P(X ≤ 20) = 0.19659 = 19.659%. Again, this probability is found
under the assumption that the program has not worked (i.e. π = 75%).
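Any statistics package can produce this value. For concreteness, here is a self-contained Python sketch that computes P(X ≤ 20) directly from the binomial formula P(X = k) = C(n, k)π^k(1 − π)^(n−k); the function name binomial_cdf is ours:

from math import comb

def binomial_cdf(k, n, pi):
    # P(X <= k) for a binomial random variable with n trials
    # and probability of success pi on each trial.
    return sum(comb(n, x) * pi**x * (1 - pi)**(n - x) for x in range(k + 1))

print(binomial_cdf(20, 30, 0.75))   # about 0.1966, i.e. the 19.659% above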
The probability that we would have observed at most 20 hires that were men out of 30, under the assumption
that the program did not work, is 19.659%. Therefore, it is not unlikely that we could have observed this
evidence as the probability is greater than 10%. This means that having 20 out of 30 hires being men falls
within the normal sampling variability for this data.
Based on the evidence collected, there is not sufficient evidence to suggest that the program worked.
Notice we don't conclude that the program is not working.
note: In statistics, we never use the words "prove" or "true" when making a conclusion. All of
our conclusions are based on sample data that we are using to make a conclusion about the
population. Therefore, there is always the chance of error.
2.2.2.3.6 Example
Olivier has spent five years honing his archery skills in various seedy locales around the world. Now he has
returned to his city of birth to use these skills to take out criminals. One night while drinking vodka with
his friends, he boasts that he can shoot an arrow into the bullseye, blindfolded, at a distance of 50m, 90% of
the time.
"I don't believe you!" Jack, Olivier's best friend, slurred.
"I swear! I've really honed my skills," Olivier countered.
"But remember last week when we were in that darkened factory, you missed two of your shots!" Thelma,
Olivier's sister, countered.
"No. I meant to miss them."
Jack thought for a moment. "I think you are exaggerating and I'm going to test you."
"You're on!" Olivier sneered arrogantly.
To test that Olivier was exaggerating about his marksmanship, Jack set up a bunch of targets and
randomly had Olivier attempt the shot. Olivier hit the bullseye (blindfolded at a distance of 50m) 39 out of
50 times.
a. If Olivier is not exaggerating, how many times out of 50 do we typically expect him to hit the bullseye?
Write your answer as a range that takes into account variation.
Answer: We would expect Olivier to hit the bullseye 45 times give or take 2.121 times. This means a typical
range is 42.88 to 47.12 bullseyes out of 50.
b. Based on your answer in a., is 39 bullseyes out of 50 potentially abnormal? Explain.
Answer: Since 39 is outside of the range, it would be deemed atypical, but that does not necessarily mean
that it is abnormal.
c. What assumption do we need to make before determining whether the 39 out of 50 provides evidence
for or against Olivier exaggerating?
Answer: Since Jack wants to show that Olivier is exaggerating, we want to assume that Olivier is not
exaggerating. This means we want to assume π = 90%, where π is the proportion of bullseyes that Olivier
hits.
d. What model (i.e. distribution) will you use to test the evidence against the assumption? Explain why
it is the best model to use. Note: This situation might not completely fit the model, but explain why
it is still a reasonable model to use.
Answer: We will use the binomial distribution, since the situation meets its criteria:
• The variable being studied is random: Since Jack is randomly having Olivier take the shot, we can say
this is a random event.
• The outcomes of the variable are being counted: We are counting the number of bullseyes.
• There are a xed number of trials: We are looking at 50 shots with the bow and arrow.
• There are only two possible outcomes: Either the shot is a bullseye or it is not.
• The n trials are independent and the probability of success and probability of failure remain constant:
This is true because we are assuming that the probability of hitting the bullseye remains constant at
90%.
e. What probability do you need to find to evaluate the evidence against the assumption?
Answer: We need to find the probability that Olivier hits at most 39 out of 50 bullseyes, assuming his
accuracy is 90%. NOTE: We look at at most 39 because having fewer bullseyes is even better evidence that
Olivier is exaggerating (i.e. better evidence against the assumption).
f. Find the probability.
Answer: P (X ≤ 39 given π = 90%) = 0.00935 = 0.94% (from a computer program with n = 50 and π (or
probability of occurrence) = 90%).
g. In sentence form, explain what the probability you have found means in the context of the question.
Answer: The probability that Olivier hit at most 39 out of 50 bullseyes, under the assumption that he wasn't
exaggerating about his accuracy, is 0.94%.
h. Does the probability provide evidence to support whether Olivier is exaggerating or not? Explain.
Answer: Since the probability that we observed our sample data is less than 1%, it is unlikely that
Olivier is not exaggerating (i.e. that his accuracy is 90%). Therefore, it is likely that Olivier is exaggerating
and cannot hit the bullseye 90% of the time blindfolded from 50 m.
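aside: As an illustration (not part of the original example), the expected count, its give-or-take, and the probability for Olivier's 50 shots can be checked in Python with SciPy:

from scipy.stats import binom

n, pi = 50, 0.90
mu, sigma = binom.mean(n, pi), binom.std(n, pi)
print(mu, round(sigma, 3))             # 45.0 bullseyes, give or take 2.121
print(round(binom.cdf(39, n, pi), 5))  # P(X <= 39) is approximately 0.00935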
Exercise 2.2.2.2 (Solution on p. 198.)
As stated in a previous question, the chance of a CRA audit for a tax return with over $25,000
in income is about 2% per year. An employee at I&S Square, a company that helps individuals
do their yearly tax returns and helps if there is an audit, has noticed that people in Seba Beach,
Alberta appear to have a greater chance of an audit than the rest of Canadians. Out of a random
sample of 45 residents, four of them have been audited.
a. If the residents of Seba Beach are being audited fairly, how many residents out of 45 do we
typically expect to get audited in a year? Write your answer as a range that takes into account
variation.
b. Based on your answer in a), is 4 out of 45 audits potentially abnormal? Explain.
c. What assumption do we need to make before determining whether the 4 out of 45 audits is
unfair?
d. What model (i.e. distribution) will you use to test the assumption? Explain why it is the
best model to use. Note: This situation might not completely fit the model, but explain why
it is still a reasonable model to use.
e. What probability do you need to find to evaluate the assumption?
1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 3 3 3 4
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5
Table 2.14
Does the recent sample provide sufficient evidence to suggest that the proportion of customers
who are happy with their overall service when they call Bull has increased from 60%? Explain your
answer in detail.
note: What we can conclude when the probability is not unlikely: If the probability is greater
than 10%, then it means that it is not unlikely that we observed this evidence under the assumption.
We can NOT conclude that the assumption is likely true, as the evidence was collected to evaluate
the claim (not the assumption). Instead, we can only conclude that there is not enough evidence
to say that the claim is true. When the probability is not unlikely, we have really learned very
little about the claim.
A statistical experiment can be classified as a binomial experiment if the following conditions are met:
1. The variable being studied is random.
2. The outcomes of the variable are being counted.
3. There are a fixed number of trials. The letter n denotes the number of trials.
4. There are only two possible outcomes, called "success" and "failure," for each trial. π denotes the
probability of a success on one trial, and 1 − π denotes the probability of a failure on one trial.
5. The n trials are independent and are repeated using identical conditions. Because the n trials are
independent, the outcome of one trial does not help in predicting the outcome of another trial. Another
way of saying this is that for each individual trial, the probability, π , of a success and probability, 1-π ,
of a failure remain constant.
The outcomes of a binomial experiment fit a binomial probability distribution. The random variable X =
the number of successes obtained in the n independent trials. The mean of X can be calculated using the
formula µ = nπ, and the standard deviation is given by the formula σ = √(nπ (1 − π)).
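aside: These two formulas are easy to compute directly. A minimal Python sketch, added for illustration (not from the original text):

import math

def binomial_mean_sd(n, pi):
    # mu = n*pi and sigma = sqrt(n*pi*(1 - pi)) for a binomial random variable
    return n * pi, math.sqrt(n * pi * (1 - pi))

print(binomial_mean_sd(50, 0.90))  # (45.0, 2.1213...), matching the archery example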
To evaluate evidence, we must first begin from a position of skepticism (i.e. assume the opposite of what
we want to show). Then we must find a probability, which is the distance from the actual evidence to perfect
evidence against the assumption. We can then evaluate the probability by determining whether it is less
than 1% (which means it is unlikely the evidence occurred under the assumption) or if it is greater than 10%
(which means it is not unlikely the evidence occurred under the assumption). If the probability is deemed
unlikely, then we reject the assumption, which means there is enough evidence to support what we originally
wanted to show (the claim). If the probability is deemed not unlikely, then we do not reject the assumption,
which means there is not enough evidence to support what we originally wanted to show (the claim). In the
latter situation, we cannot make any conclusions about the assumption as the evidence was collected only
for the claim.
2.2.2.5 Practice
The first few exercises provided are from the textbook Business Statistics BSTA 200 Humber College
Version 2016RevA DRAFT 2016-04-04 by Alexander Holmes, Lyryx Learning:
http://cnx.org/contents/[email protected]
Use the following information to answer the next seven exercises: The Higher Education Research Institute
at UCLA collected data from 203,967 incoming first-time, full-time freshmen from 270 four-year colleges
and universities in the U.S. 71.3% of those students replied that, yes, they believe that same-sex couples
should have the right to legal marital status. Suppose that you randomly pick eight first-time, full-time
freshmen from the survey. You are interested in the number that believes that same-sex couples should have
the right to legal marital status.
Exercise 2.2.2.4 (Solution on p. 198.)
In words, define the random variable X.
Exercise 2.2.2.5 (Solution on p. 198.)
What values does the random variable X take on?
Exercise 2.2.2.6 (Solution on p. 198.)
Construct the probability distribution function (PDF). That is, fill in the table below. In the left
column put in the possible values for X. In the right column, put in the probability for exactly X,
i.e. P (X = x).
x P (X = x)
Table 2.15
Use the following information to answer the next three multiple choice questions: The probability that the
Calgary Flames will win any given game is 0.4617 based on a 45-year win history of 1,616 wins out of 3,500
games played (as of Sept. 2017). An upcoming monthly schedule contains 12 games.
Exercise 2.2.2.12 (Solution on p. 199.)
The expected number of wins for that upcoming month is:
a. 1.67
b. 12
c. 1616/3500
d. 5.54
a. 0.2178
b. 0.4167
c. 0.7664
d. 0.7116
a. 0.2176
b. 0.2762
c. 0.7238
d. 0.5062
Jenna finally broke the silence. "I don't know if I would even pick up this package. It just looks
so depressing. But what do we do?"
"Leticia wants to put the product in this new packaging in five stores that carry our products.
Based on previous sales numbers, we know that the stores sell 68% of the product we give them
in a two-week period."
"How does that help us? Do we just watch our sales plummet?" Jenna was sounding exasperated.
"I'm getting to that," Megan soothed. "Leticia is convinced this packaging will increase sales.
But what if we can show her that it doesn't? Let's put this packaging into the five stores and then
see how many kits were actually sold. I bet that we can show her that the sales went down."
"I don't see how that is useful. We still have to pay her stupid fee."
"You should read her contract more closely. She only gets paid if she can show that sales
increased. If they don't, then not only does she not get paid but she also has to pay for any
contractors (i.e. the design team)."
Jenna perked up visibly at this.
Over the next two weeks, five stores carried the new packaging. Megan and Jenna provided
each store with 100 kits. At the end of the two weeks, 306 of the kits were sold.
1. What is the observation unit? What is the variable? Categorize it.
2. What do Jenna and Megan want to show?
3. What assumption do Jenna and Megan need to make in order to investigate your answer in
question 2? Write your answer both as a sentence and as a probability.
4. What is the evidence that Jenna and Megan have found?
5. Describe the process that Jenna and Megan will go through to evaluate this evidence. Your
description should include (but is not limited to) what probability they will find and what
they will do with that probability once they've found it. Don't actually do the process (that
comes later). Just describe what they will do.
6. Jenna and Megan believe that the binomial distribution will be the best model to find the
required probability. Does this situation meet the criteria for a binomial? Examine each
criterion and comment on whether it is satisfied here or not.
7. Regardless of your answer above, use a binomial distribution to model this situation. Find
the appropriate probability to evaluate the evidence using MegaStat.
8. In sentence form, explain what the probability you have found means in the context of the
question. Do not make a conclusion yet. Instead explain what it is a probability of.
9. Now make a conclusion. In particular, answer this question: Is there enough evidence to
suggest that Leticia's new packaging has reduced sales? Justify your answer.
2. Can this situation be modelled by the binomial distribution? Support your answer by showing
why or why not this situation satisfies each of the five criteria of the binomial distribution.
3. After previous issues with horrible new logo launches, Baravalle only wants to go forward
if there is clear evidence that Logo 2 is preferred. Based on this, what level of significance
should they use? Explain your reasoning.
4. Regardless of your answer in b, assume that this situation satisfies the binomial distribution for
the remainder of the question. Use a computer program to find the appropriate probability
that will allow you to evaluate the evidence.
5. In sentence form, explain what the probability you have found means in the context of the
question. Do not make a conclusion yet. Instead explain what it is a probability of.
6. Based on the probability, determine whether Logo 2 is preferred significantly more than Logo
1. Explain your reasoning.
Figure 2.10: If you ask enough people about their shoe size, you will find that your graphed data is
shaped like a bell curve and can be described as normally distributed. (credit: Ömer Ünl)
The normal probability density function, a continuous distribution, is the most important of all the distri-
butions. It is widely used and even more widely abused. Its graph is bell-shaped. You see the bell curve
in almost all disciplines. Some of these include psychology, business, economics, the sciences, nursing, and,
of course, mathematics. Some of your instructors may use the normal distribution to help determine your
grade. Most IQ scores are normally distributed. Often real-estate prices fit a normal distribution.
The normal distribution is extremely important, but it cannot be applied to everything in the real world.
Remember here that we are still talking about the distribution of population data. This is a discussion of
probability and thus it is the population data that may be normally distributed, and if it is, then this is
how we can find probabilities of specific events just as we did for population data that may be binomially
distributed or Poisson distributed. This caution is here because in the next chapter we will see that the
normal distribution describes something very different from raw data and forms the foundation of inferential
statistics.
8 This content is available online at <http://cnx.org/content/m62345/1.1/>.
In this chapter, you will study the normal distribution, the standard normal distribution, and applications
associated with them.
The normal distribution has two parameters (two numerical descriptive measures), the mean (µ) and the
standard deviation (σ ). If X is a quantity to be measured that has a normal distribution with mean (µ) and
standard deviation (σ ), we designate this by writing the following formula of the normal probability density
function:
Figure 2.11
The curve is symmetrical about a vertical line drawn through the mean, µ. The mean is the same as the
median, which is the same as the mode, because the graph is symmetric about µ. As the notation indicates,
the normal distribution depends only on the mean and the standard deviation. Note that this is unlike
several probability density functions we have already studied, such as the Poisson, where the mean is equal
to µ and the standard deviation simply the square root of the mean, or the binomial, where p is used to
determine both the mean and standard deviation. Since the area under the curve must equal one, a change
in the standard deviation, σ , causes a change in the shape of the curve; the curve becomes fatter and wider
or skinnier and taller depending on σ. A change in µ causes the graph to shift to the left or right. This
means there are an infinite number of normal probability distributions. One of special interest is called the
standard normal distribution.
X ∼ N (µ, σ )
µ = the mean; σ = the standard deviation
2.3.2.1 Z -Scores
If X is a normally distributed random variable and X ∼ N(µ, σ), then the z-score is:
z = (x − µ)/σ (2.11)
The z-score tells you how many standard deviations the value x is above (to the right of) or
below (to the left of) the mean, µ. Values of x that are larger than the mean have positive z-scores,
and values of x that are smaller than the mean have negative z-scores. If x equals the mean, then x has a
z-score of zero.
Example 2.26
Suppose X ∼ N(5, 6). This says that x is a normally distributed random variable with mean µ =
5 and standard deviation σ = 6. Suppose x = 17. Then:
z = (x − µ)/σ = (17 − 5)/6 = 2 (2.11)
This means that x = 17 is two standard deviations (2σ) above or to the right of the mean µ =
5. The standard deviation is σ = 6.
Now suppose x = 1. Then: z = (x − µ)/σ = (1 − 5)/6 = −0.67 (rounded to two decimal places)
This means that x = 1 is 0.67 standard deviations (0.67σ) below or to the left of
the mean µ = 5.
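aside: A z-score is a one-line calculation. The following Python sketch (added for illustration only) reproduces both computations in this example:

def z_score(x, mu, sigma):
    # number of standard deviations x lies above (+) or below (-) the mean
    return (x - mu) / sigma

print(z_score(17, 5, 6))           # 2.0
print(round(z_score(1, 5, 6), 2))  # -0.67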
Example 2.27
Some doctors believe that a person can lose five pounds, on average, in a month by reducing his or
her fat intake and by exercising consistently. Suppose weight loss has a normal distribution. Let X
= the amount of weight lost (in pounds) by a person in a month. Use a standard deviation of two
pounds. X ∼ N (5, 2).
Problem
Suppose a person gained three pounds (a negative weight loss). Then z = __________. This
z-score tells you that x = 3 is ________ standard deviations to the __________ (right
or left) of the mean.
Solution
z = (x − µ)/σ = (−3 − 5)/2 = −4 (2.11)
z = −4. This z-score tells you that x = −3 is four standard deviations to the left of the mean.
9 This content is available online at <http://cnx.org/content/m64939/1.1/>.
Suppose the random variables X and Y have the following normal distributions: X ∼ N (5, 6) and Y ∼ N (2,
1). If x = 17, then z = 2. (This was previously shown.) If y = 4, what is z?
z = (y − µ)/σ = (4 − 2)/1 = 2 where µ = 2 and σ = 1.
The z-score for y = 4 is z = 2. This means that y = 4 is two standard deviations to the right of its
mean. Therefore, x = 17 and y = 4 are both two (of their own) standard deviations to the right of their
respective means.
The z-score allows us to compare data that are scaled differently. To understand the concept,
suppose X ∼ N (5, 6) represents weight gains for one group of people who are trying to gain weight in a
six-week period and Y ∼ N (2, 1) measures the same weight gain for a second group of people. A negative
weight gain would be a weight loss. Since x = 17 and y = 4 are each two standard deviations to the right
of their means, they represent the same, standardized weight gain relative to their means.
• About 68.26% of the x values lie between −1σ and +1σ of the mean µ (within one standard deviation
of the mean).
• About 95.44% of the x values lie between −2σ and +2σ of the mean µ (within two standard deviations
of the mean).
• About 99.73% of the x values lie between −3σ and +3σ of the mean µ (within three standard deviations
of the mean). Notice that almost all the x values lie within three standard deviations of the mean.
• The z-scores for +1σ and −1σ are +1 and −1, respectively.
• The z-scores for +2σ and −2σ are +2 and −2, respectively.
• The z-scores for +3σ and −3σ are +3 and −3, respectively.
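aside: These empirical-rule percentages can be verified with any normal probability function. A minimal Python sketch using SciPy (illustrative only):

from scipy.stats import norm

for k in (1, 2, 3):
    # probability a normal value lies within k standard deviations of its mean
    prob = norm.cdf(k) - norm.cdf(-k)
    print(k, round(prob, 4))  # 0.6827, 0.9545, 0.9973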
Figure 2.12
Example 2.28
The mean height of 15 to 18-year-old males from Chile from 2009 to 2010 was 170 cm with a
standard deviation of 6.28 cm. Male heights are known to follow a normal distribution. Let X =
the height of a 15 to 18-year-old male from Chile in 2009 to 2010. Then X ∼ N (170, 6.28).
Problem 1
a. Suppose a 15 to 18-year-old male from Chile was 168 cm tall from 2009 to 2010. The z-score
when x = 168 cm is z = _______. This z-score tells you that x = 168 is ________ standard
deviations to the ________ (right or left) of the mean _____ (What is the mean?).
Solution
a. −0.32, 0.32, left, 170
Problem 2
b. Suppose that the height of a 15 to 18-year-old male from Chile from 2009 to 2010 has a z-score
of z = 1.27. What is the male's height? The z-score (z = 1.27) tells you that the male's height is
________ standard deviations to the __________ (right or left) of the mean.
Solution
z = (x − µ)/σ = (x − 170)/6.28 = 1.27 → x = 1.27 × 6.28 + 170 = 177.98 (2.12)
b. 177.98, 1.27, right
Example 2.29
Suppose x has a normal distribution with mean 50 and standard deviation 6.
• About 68% of the x values lie between −1σ = (−1)(6) = −6 and +1σ = (1)(6) = 6 of the mean
50. The values 50 − 6 = 44 and 50 + 6 = 56 are within one standard deviation of the mean
50. The z-scores are −1 and +1 for 44 and 56, respectively.
• About 95% of the x values lie between −2σ = (−2)(6) = −12 and +2σ = (2)(6) = 12. The
values 50 − 12 = 38 and 50 + 12 = 62 are within two standard deviations of the mean 50.
The z-scores are −2 and +2 for 38 and 62, respectively.
• About 99.7% of the x values lie between −3σ = (−3)(6) = −18 and +3σ = (3)(6) = 18 of the
mean 50. The values 50 − 18 = 32 and 50 + 18 = 68 are within three standard deviations of
the mean 50. The z-scores are −3 and +3 for 32 and 68, respectively.
a.About 68% of the y values lie between what two values? These values are
________________. The z-scores are ________________, respec-
tively.
b.About 95% of the y values lie between what two values? These values are
________________. The z-scores are ________________, respec-
tively.
c.About 99.7% of the y values lie between what two values? These values are
________________. The z-scores are ________________, respec-
tively.
A z-score is a standardized value. Its distribution is the standard normal, Z ∼ N (0, 1). The mean of the
z-scores is zero and the standard deviation is one. If z is the z-score for a value x from the normal distribution
N (µ, σ ) then z tells you how many standard deviations x is above (greater than) or below (less than) µ.
a. 2.7
b. 5.3
c. 7.4
d. 2.1
Figure 2.13
The area to the right is then P(X > x) = 1 − P(X < x). Remember, P(X < x) = Area to the left of
the vertical line through x. P(X > x) = 1 − P(X < x) = Area to the right of the vertical line through x.
P(X < x) is the same as P(X ≤ x), and P(X > x) is the same as P(X ≥ x), for continuous distributions.
To find the probability for probability curves with a continuous random variable, we need to calculate the
area under the curve across the values of X we are interested in. For the normal distribution this seems a
difficult task given the complexity of the formula. There is, however, a simple way to get what we want.
We start knowing that the area under a probability curve is the probability.
Figure 2.14
This shows that the area between X1 and X2 is the probability, as stated in the formula: P (X1 ≤ x ≤ X2)
The mathematical tool needed to find the area under a curve is integral calculus. The integral of the
normal probability density function between the two points x1 and x2 is the area under the curve between
these two points and is the probability between these two points.
Doing these integrals is no fun and can be very time consuming. But now, remembering that there are
an infinite number of normal distributions out there, we can consider the one with a mean of zero and a
standard deviation of 1. This particular normal distribution is given the name Standard Normal Distribution.
Putting these values into the formula, it reduces to a very simple equation. We can now quite easily calculate
all probabilities for any value of x for this particular normal distribution, which has a mean of zero and a
standard deviation of 1. These have been produced and are available here in the text or everywhere on the
web. They are presented in various ways. The table in this text is the most common presentation and is set
up with probabilities for one-half the distribution beginning with zero, the mean, and moving outward. The
shaded area in the graph at the top of the table represents the probability from zero to the specific Z value
noted on the horizontal axis, Z.
The only problem is that even with this table, it would be a ridiculous coincidence that our data had
a mean of zero and a standard deviation of one. The solution is to convert the distribution we have with
its mean and standard deviation to this new Standard Normal Distribution. The Standard Normal has a
random variable called Z.
Using the standard normal table, typically called the normal table, to find the probability of one standard
deviation, go to the Z column, reading down to 1.0 and then read at column 0. That number, 0.3413, is the
probability from zero to 1 standard deviation. At the top of the table is the shaded area in the distribution
which is the probability for one standard deviation. The table has solved our integral calculus problem. But
only if our data has a mean of zero and a standard deviation of 1.
However, the essential point here is, the probability for one standard deviation on one normal distribution
is the same on every normal distribution. If the population data set has a mean of 10 and a standard deviation
of 5 then the probability from 10 to 15, one standard deviation, is the same as from zero to 1, one standard
deviation on the standard normal distribution. To compute probabilities, areas, for any normal distribution,
we need only to convert the particular normal distribution to the standard normal distribution and look up
the answer in the tables. As review, here again is the standardizing formula:
Z = (x − µ)/σ (2.14)
where Z is the value on the standard normal distribution, X is the value from a normal distribution one
wishes to convert to the standard normal, µ and σ are, respectively, the mean and standard deviation of that
population. Note that the equation uses µ and σ which denotes population parameters. This is still dealing
with probability so we always are dealing with the population, with known parameter values and a known
distribution. It is also important to note that because the normal distribution is symmetrical it does not
matter if the z-score is positive or negative when calculating a probability. One standard deviation to the
left (negative Z-score) covers the same area as one standard deviation to the right (positive Z-score). This
fact is why the Standard Normal tables do not provide areas for the left side of the distribution. Because of
this symmetry, the Z-score formula is sometimes written as:
Z = |x − µ|/σ (2.14)
where the vertical lines in the equation mean the absolute value of the number.
What the standardizing formula is really doing is computing the number of standard deviations X is from
the mean of its own distribution. The standardizing formula and the concept of counting standard deviations
from the mean is the secret of all that we will do in this statistics class. The reason this is true is that all
of statistics boils down to variation, and the counting of standard deviations is a measure of variation.
This formula, in many disguises, will reappear over and over throughout this course.
Example 2.30
The nal exam scores in a statistics class were normally distributed with a mean of 63 and a
standard deviation of ve.
Problem 1
a. Find the probability that a randomly selected student scored more than 65 on the exam.
b. Find the probability that a randomly selected student scored less than 85.
Solution
a. Let X = a score on the final exam. X ∼ N (63, 5), where µ = 63 and σ = 5
Draw a graph.
Figure 2.15
Z1 = (x1 − µ)/σ = (65 − 63)/5 = 0.4 (2.15)
P(x ≥ x 1 ) = P(Z ≥ Z 1 ) = 0.3446
The probability that any student selected at random scores more than 65 is 0.3446. Here is how
we found this answer.
The normal table provides probabilities from zero to the value Z1. For this problem the question
can be written as: P(X ≥ 65) = P(Z ≥ Z1), which is the area in the tail. To find this area the
formula would be 0.5 − P(X ≤ 65). One half of the probability is above the mean value because
this is a symmetrical distribution. The graph shows how to find the area in the tail by subtracting
that portion from the mean, zero, to the Z1 value. The final answer is: P(X ≥ 65) = P(Z ≥ 0.4) =
0.3446
z = (65 − 63)/5 = 0.4
Area between the mean of zero and Z1 is 0.1554
P(x > 65) = P(z > 0.4) = 0.5 − 0.1554 = 0.3446
Problem 2
Solution
b.
Z = (x − µ)/σ = (85 − 63)/5 = 4.4, which is larger than the maximum value on the Standard Normal Table.
A score of 85 is 4.4 standard deviations from the mean of 63, which is beyond the range of
the standard normal table. Therefore, the probability that one student scores less than 85 is
approximately one (or 100%).
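aside: Both parts of this example can be checked without the table. A minimal Python sketch with SciPy (added for illustration):

from scipy.stats import norm

mu, sigma = 63, 5
print(round(norm.sf(65, mu, sigma), 4))  # P(X > 65) = 0.3446
print(norm.cdf(85, mu, sigma))           # P(X < 85) is approximately 1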
Example 2.31
A personal computer is used for office work at home, research, communication, personal finances,
education, entertainment, social networking, and a myriad of other things. Suppose that the
average number of hours a household personal computer is used for entertainment is two hours per
day. Assume the times for entertainment are normally distributed and the standard deviation for
the times is half an hour.
Problem 1
a. Find the probability that a household personal computer is used for entertainment between 1.8
and 2.75 hours per day.
Solution
a. Let X = the amount of time (in hours) a household personal computer is used for entertainment.
X ∼ N (2, 0.5) where µ = 2 and σ = 0.5.
Find P(1.8 < x < 2.75).
The probability for which you are looking is the area between x = 1.8 and x = 2.75. P(1.8 <
x < 2.75) = 0.5886
Figure 2.16
Problem 2
b. Find the maximum number of hours per day that the bottom quartile of households uses a
personal computer for entertainment.
Solution
Figure 2.17
f (Z) = 0.5 − 0.25 = 0.25, therefore Z ≈ −0.675 (or just −0.67 using the table)
Z = (x − µ)/σ = (x − 2)/0.5 = −0.675, therefore x = −0.675 × 0.5 + 2 = 1.66 hours.
The maximum number of hours per day that the bottom quartile of households uses a personal
computer for entertainment is 1.66 hours.
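aside: Finding a value from a given area, as in part b, is an inverse-normal calculation. A minimal Python sketch with SciPy (illustrative only):

from scipy.stats import norm

# 25th percentile (bottom quartile boundary) of N(2, 0.5)
x = norm.ppf(0.25, loc=2, scale=0.5)
print(round(x, 2))  # about 1.66 hours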
Example 2.32
There are approximately one billion smartphone users in the world today. In the United States the
ages 13 to 55+ of smartphone users approximately follow a normal distribution with approximate
mean and standard deviation of 36.9 years and 13.9 years, respectively.
Problem 1
a. Determine the probability that a random smartphone user in the age range 13 to 55+ is between
23 and 64.7 years old.
Solution
a. 0.8186
Problem 2
b. Determine the probability that a randomly selected smartphone user in the age range 13 to
55+ is at most 50.8 years old.
Solution
b. 0.8413
Example 2.33
A citrus farmer who grows mandarin oranges finds that the diameters of mandarin oranges
harvested on his farm follow a normal distribution with a mean diameter of 5.85 cm and a standard
deviation of 0.24 cm.
Problem 1
a. Find the probability that a randomly selected mandarin orange from this farm has a diameter
larger than 6.0 cm. Sketch the graph.
Solution
Figure 2.18
Z1 = (6 − 5.85)/0.24 = 0.625 (2.18)
P(x ≥ 6) = P(z ≥ 0.625) = 0.2670
Problem 2
b. The middle 20% of mandarin oranges from this farm have diameters between ______ and
______.
Solution
f (Z) = 0.20/2 = 0.10, therefore Z ≈ ±0.25
Z = (x − µ)/σ = (x − 5.85)/0.24 = ±0.25 → x = ±0.25 × 0.24 + 5.85 = (5.79, 5.91)
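aside: The middle 20% runs from the 40th to the 60th percentile, which can be computed directly. A minimal Python sketch with SciPy (illustrative only):

from scipy.stats import norm

mu, sigma = 5.85, 0.24
print(round(norm.ppf(0.40, mu, sigma), 2))  # about 5.79 cm
print(round(norm.ppf(0.60, mu, sigma), 2))  # about 5.91 cm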
a. What percentage of their customers have daily balances less than $290?
b. What percentage of their customers have daily balances between $250 and $275?
c. What percentage of their customers have daily balances over $260?
d. The Bank is planning a special promotion where it is rewarding its customers whose balances
are in the top 15% with a free toaster. What account balance must a customer achieve in
order to qualify for a free toaster?
e. 68.26% of balances will be between what amounts?
f. What is the interquartile range for the account balances?
Population Sample
Sample size N n
Mean µx x̄
Standard deviation σx sx
Proportion π p̂
The population mean, population standard deviation, and sample standard deviation have a subscript
of x to demonstrate that they are the measure for the variable X. Though this is mostly notational, it does
become important later in this chapter.
Suppose we take many different random samples of 100 university students from a university that has an
equal number of men and women.
The number of women will vary amongst the samples. For example, one sample could have 45 women,
another sample could have 48 women, another sample could have 52 women, etc.
11 This content is available online at <http://cnx.org/content/m64933/1.1/>.
12 This content is available online at <http://cnx.org/content/m64932/1.2/>.
Though it could be possible that we get a random sample that only has 2 women in it, it would be pretty
unlikely. Instead, we would expect that most of the samples would have around 50 women in it with some
variation around that value.
Figure 1 is the result of a simulation that took 10,000 samples of size 100 from a population that had an
equal number of women and men. The horizontal axis is the number of women in each sample. The height of
each bar is the number of samples that had that many women.
Notice how the most common number of women is around 50 (i.e. the average), but there is variation
from that 50. Most samples have between 40 and 60 women.
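aside: A simulation like the one behind Figure 1 is easy to run. A minimal Python sketch using NumPy (illustrative only, not the software used to produce the figure):

import numpy as np

rng = np.random.default_rng(0)
# 10,000 samples of 100 students from a population with 50% women;
# each entry is the number of women in one sample
women_counts = rng.binomial(n=100, p=0.5, size=10_000)
print(women_counts.mean())                                   # close to 50
print(np.mean((women_counts >= 40) & (women_counts <= 60)))  # most samples fall here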
The variability among random samples of size n from the same population is called sampling variability.
A probability distribution that characterizes some aspect of sampling variability is termed a sampling
distribution. A sampling distribution is constructed by taking all possible samples of a size n from a
population. Then for each sample, a statistic is calculated (e.g. sample mean, sample proportion, sample
standard deviation). The sampling distribution is then created by making a graph of all of these statistics.
Actually constructing a sampling distribution is often very difficult. A medium sized university in Canada
might have 12,000 students. All possible samples of size 100 from that population would result in 5.87×10^249
unique samples! Think about that. One billion is 10^9. Google is named after a googol (10^100) because they
wanted Google to be associated with an immense amount of data. Yet a googol is smaller than the number of
possible samples of size 100 from the medium sized university. If we got a computer to find all possible samples,
it would take it over a billion years to find them! Therefore, actually constructing a true sampling distribution in
most situations is incredibly hard, incredibly time consuming, and not really worth it. Thus when we talk
about sampling distributions, we talk about a theoretical sampling distribution. That is, we theorize
what this sampling distribution would look like if it was possible to examine all possible samples.
Due to these limitations, we often look at an empirical sampling distribution instead of a theoretical
sampling distribution. An empirical sampling distribution is created by taking many samples from a
population and finding a statistic for each sample, but not doing this for all possible samples. The plot
shown in Figure 1 is an example of an empirical sampling distribution as it only contains 10,000 samples and
not all possible samples. The statistic in Figure 1 is the number of women, but we could have also looked at
the proportion of women.
In summary, a sampling distribution is a distribution of a statistic. This differs from other distributions,
like the population distribution, which are distributions for individual data values.
Suppose we take a random sample of 100 students from a medium sized university and we find that 75 of
them are women. Does this call into question the assumption that 50% of the students are women? This is
hard to figure out unless we know how likely it is that we could have found this random sample, assuming
that there are an equal number of men and women.
The sampling distribution helps us find this probability. From the empirical sampling distribution in
Figure 1 we can find that the probability of getting a random sample of 75 women, assuming that there are an
equal number of men and women, is 0.0000%. That is, it is really unlikely to get a random sample with 75
women out of 100 if there are an equal number of men and women in the population. Based on this, we can
be fairly confident that this university probably doesn't have an equal number of men and women. Instead,
it is more likely that there are more women than men at this university.
The process described above is called inferential statistics. Inferential statistics is used to make a
conclusion about the population (all students at the university) from a sample (100 students). In general, to
do any form of inferential statistics, we need to use a sampling distribution to either determine how likely or
unlikely a statistic is (in hypothesis testing) or to estimate a parameter from a statistic (confidence intervals).
Thus sampling distributions are the backbone of inferential statistics.
Note: What was described above about the proportion of women at a university should sound familiar.
In Chapter 4, we used the binomial distribution to determine how not unlikely or unlikely events were. The
binomial distribution was helping us understand the sampling distribution of proportions.
If we have access to the population, we can construct an empirical distribution from it. This can be done by
using computer software to pull random samples from a population. An example of one such tool is from the
Rossman Chance website, which has an applet that allows you to create an empirical sampling distribution
from a finite population: http://www.rossmanchance.com/applets/OneSample53.html
When constructing an empirical sampling distribution, it is important to keep the law of large numbers
in mind. That is, the more samples you take, the closer the empirical sampling distribution will be to the
theoretical sampling distribution. In general, empirical sampling distributions should be constructed from
at least 10,000 samples.
To get an idea of how an empirical sampling distribution is constructed, go to
http://onlinestatbook.com/stat_sim/sampling_dist/index.html
Example 2.34
The images/figures in this example were generated from David Lane's sampling distribution applet
that is part of the OnlineStatBook project.
Figure 1 (Figure 2.20) shows the histogram of the population we are going to generate an
empirical sampling distribution from. We call this population the parent population as it is the
population we are creating the sampling distribution from. Notice that the parent population is
skewed to the right.
14 This content is available online at <http://cnx.org/content/m64934/1.2/>.
15 http://www.rossmanchance.com/applets/OneSample53.html
16 http://onlinestatbook.com/stat_sim/sampling_dist/index.html
17 Online Statistics Education: A Multimedia Course of Study (http://onlinestatbook.com/). Project Leader: David M. Lane,
Rice University.
We are going to take multiple samples of size 10 from the parent population and look at the
statistic of the sample mean for each sample.
Here is the rst sample:
Now one sample mean is not enough to tell us what the sampling distribution looks like. So
let's take a few more samples. Let's take 5 more samples of size 10 and plot their sample means:
note: There are two sample sizes here. One is the size of the sample we are taking from the parent
population (10). The other is the number of samples we've taken (6). The first is the sample size
for the sample. The second is the sample size for the empirical sampling distribution.
Now let's take 10,000 samples of size 10 from the population and plot each of their sample means.
This is what we get:
Finally, let's take 100,000 samples of size 10 from the population and plot each of their sample
means. This is what we get:
Notice how there is no real difference between the distributions (shape, centre and variation)
in Figure 5 and Figure 6. This means that our empirical distribution is now giving us a good
sense of what the theoretical sampling distribution would look like. When this happens, this is
called convergence. That is, the empirical sampling distribution is converging on the theoretical
sampling distribution. As the sample size of the empirical sampling distribution increases, this is
expected to happen due to the law of large numbers.
2.4.3.1.1 Bootstrapping
Suppose we don't have access to the population. This can happen if the population is infinite (e.g. in
a manufacturing process), where the population is large (e.g. the population of Canada), or where most
researchers wouldn't have access to the population (e.g. a list of students at a university). Can we still
construct an empirical sampling distribution?
The answer is yes! To do this, we use a process called bootstrapping. Essentially bootstrapping follows
the same procedure as outlined in Example 1 (Example 2.34), but instead of using a parent population, we
use a parent sample. That is, we take a good sample from the population and use that to construct the
sampling distribution.
Again the law of large numbers applies. If the random sample from the population is large enough, then
the sample will most likely be a good estimate of the population. Then the empirical sampling distribution
generated by the sample will most likely be a good estimate of the theoretical sampling distribution of the
population.
note: Bootstrapping only works if the sample being used has been collected properly: the
sampling technique must ensure that the sample is random, the sample is representative of the
population, and the sample size is large enough. There are no set rules on how big the sample
needs to be, but for bootstrapping, the bigger the better.
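aside: A bootstrap sampling distribution takes only a few lines of code. The Python sketch below is illustrative only; the parent sample is invented for the example, standing in for a good random sample from the population:

import numpy as np

rng = np.random.default_rng(0)
parent_sample = rng.normal(50, 10, size=500)  # stands in for a good random sample

# Resample the parent sample with replacement many times and
# collect the statistic of interest (here, the sample mean)
boot_means = [rng.choice(parent_sample, size=parent_sample.size, replace=True).mean()
              for _ in range(10_000)]
print(np.mean(boot_means), np.std(boot_means))  # centre and spread of the empirical sampling distribution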
2.4.4.1 The central limit theorem for the sampling distribution for sample means
The sampling distribution for the sample means comes from a parent population that is comprised of quan-
titative data. Random samples of size n are taken from the parent population and the sample mean is
calculated for each sample. What will the distribution of the sample means look like? That is, what is the
shape of the distribution of sample means, where are the sample means centred, and what is the sampling
variability?
note: The following refers to the theoretical sampling distribution for the sample means. Further,
when sample size is mentioned, it is referring to the size of the sample taken from the population.
That is, it is not referring to how many dierent random samples have been taken.
As the sample means are estimating the population mean, it makes sense that the sample means are centred
around the population mean.
In the previous section, we saw the right skewed parent population in Figure 1 (Figure 2.20). The
population mean of that parent population is 8.08. Notice that the empirical sampling distributions shown
in Figures 5 (Figure 2.24) and 6 (Figure 2.25) are both centred around 8.08.
In general, the mean of the theoretical sampling distribution for the sample means equals the population
mean.
µx̄ = µx (2.25)
note: The variable for the sample means is x̄. That is why the subscript for the mean of the
sample means (µx̄) has changed.
2.4.4.1.2 What is the sampling variability? (or what is the variation in the sampling distribution)
Based on the law of large numbers, the sampling variability of the sample means will decrease as the sample
size increases. As the sample size increases, the sample means will become better and better estimates of
the population mean and, therefore, there will be less variability between them. That is, there will be more
variability between the sample means for samples of size 2 than there will be for samples of size 30.
Just like we can measure variability for individual data values, we can also measure variability for sample
means. We will use the standard deviation to measure the sampling variability. The standard deviation
of the sampling distribution for sample means is called the standard error of the sample means. It is
found with the following formula:
σx̄ = (σ/√n)·√((N − n)/(N − 1)) (2.25)
As the population size (N) increases, (N − n)/(N − 1) approaches 1 and no longer affects the standard
error. In that case the formula simplifies to:
σx̄ = σ/√n (2.25)
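aside: As a quick illustration of the standard error and the finite population correction (a sketch, not from the original text; the numbers are made up):

import math

def standard_error(sigma, n, N=None):
    # sigma/sqrt(n), times sqrt((N - n)/(N - 1)) when a finite N is given
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

print(standard_error(15, 100))          # 1.5
print(standard_error(15, 100, N=1000))  # about 1.424, slightly smaller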
19 The images/gures that follow were generated from David Lane's sampling distribution applet that is part of the OnlineS-
tatBook project
Online Statistics Education: A Multimedia Course of Study (http://onlinestatbook.com/). Project Leader: David M. Lane,
Rice University.
What will the sampling distribution for sample means look like?
Here's the answer:
• If the parent population is normal, then the sampling distribution for sample means will be normal.
Always.
• As the sample size of the samples being taken from the parent population increases, the sampling
distribution for sample means becomes more normal.
Since the population in Figure 1 (Figure 2.26) is not normally distributed, we would expect the sampling
distribution not to be normal for smaller sample sizes, but to be approximately normal for larger sample sizes.
Figure 2.27: Sampling distribution for Figure 1 (Figure 2.26) for samples of size 2
aside: For each of these empirical sampling distributions, 100,000 samples were taken of size n.
Therefore, we can be very confident that the empirical sampling distributions are good representations
of the theoretical sampling distributions.
Figure 2.28: Sampling distribution for Figure 1 (Figure 2.26) for samples of size 5
Figure 2.29: Sampling distribution for Figure 1 (Figure 2.26) for samples of size 10
Figure 2.30: Sampling distribution for Figure 1 (Figure 2.26) for samples of size 16
Figure 2.31: Sampling distribution for Figure 1 (Figure 2.26) for samples of size 20
Figure 2.32: Sampling distribution for Figure 1 (Figure 2.26) for samples of size 25
Figure 1 (Figure 2.26) (the parent population) is not even close to being normal, but notice that as the
sample size increases, the sampling distribution for sample means gets closer and closer to being normally
distributed!
In general, the closer the population is to being normally distributed, the faster the sampling distribution
gets closer to normal. Here faster means for a smaller sample size.
important: The central limit theorem states that regardless of the shape of the population, if the
sample size is greater than 30, the sampling distribution will be approximately normal.
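aside: The central limit theorem can be watched in action with a short simulation. This Python sketch (illustrative only) draws samples of increasing size from a strongly skewed population and reports the skewness of the sample means, which shrinks toward 0 (the skewness of a normal distribution) as n grows:

import numpy as np

rng = np.random.default_rng(0)
parent = rng.exponential(scale=2.0, size=100_000)  # a right-skewed parent population

for n in (2, 5, 30):
    means = rng.choice(parent, size=(10_000, n)).mean(axis=1)
    z = (means - means.mean()) / means.std()
    print(n, round(float(np.mean(z ** 3)), 2))  # skewness estimate for each n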
Population Sample Sampling distribution
Mean µx x̄ µx̄ = µx
Standard deviation σx sx σx̄ = σ/√n (standard error)
2.4.4.2 The central limit theorem for the sampling distribution for sample proportions
The sampling distribution for the sample proportions comes from a parent population that satises the
criteria of the binomial distribution. Random samples of size n are taken from the parent population and
the sample proportion is calculated for each sample. What will the distribution of the sample proportions look
like? That is, what is the shape of the distribution of sample proportions, where are the sample proportions
centred, and what is the sampling variability?
The sampling distribution for sample proportions has similar characteristics as the sampling distribution
for the sample means.
The shape of the sampling distribution of the sample proportions also becomes normal. Unlike for sample
means, though, the normality is not based on sample size, but on the number of successes (nπ) and
failures (n (1 − π)).
To illustrate, here are the empirical sampling distributions for proportions for various population pro-
portions. The sample size is 100 in each case and the number of samples taken is 10,000.
In Figure 8 (Figure 2.33) a, n =100 and π = 0.01. Therefore, the number of successes is 1 and the number
of failures is 99. The sampling distribution is skewed to the right.
In Figure 8 (Figure 2.33) b, n =100 and π = 0.20. Therefore, the number of successes is 20 and the
number of failures is 80. The sampling distribution is approximately normal.
In Figure 8 (Figure 2.33) c, n =100 and π = 0.60. Therefore, the number of successes is 60 and the
number of failures is 40. The sampling distribution is approximately normal.
In Figure 8 (Figure 2.33) d, n =100 and π = 0.96. Therefore, the number of successes is 96 and the
number of failures is 4. The sampling distribution is skewed to the left.
important: In general, the shape of the sampling distribution for sample proportions is approximately
normal if the number of successes and the number of failures are both at least 5.
If the sampling distribution for sample proportions is normal, we can find probabilities for the distribution
using two methods. The first method is using the binomial distribution. The second method is the normal
distribution. This might seem a bit strange as the binomial distribution is for discrete random variables
and the normal distribution is for continuous random variables. In reality, we use the normal distribution
to approximate probabilities for the sampling distribution for sample proportions. This is called the
normal approximation to the binomial distribution. To get the exact probability, one would need
to use the binomial distribution. But this can be cumbersome when the sample sizes are very large (e.g.
1000). Therefore, using the normal distribution can be beneficial, especially because it gives very accurate
approximations. In example 6.4 below we will investigate this further.
Further, when we begin to do inferential statistics, we won't know the population proportion (otherwise
inferential statistics wouldn't be necessary). Since we won't know π, it will be hard to use the binomial
distribution. Therefore, we use the normal approximation to the binomial distribution instead.
If we use a normal approximation to the binomial distribution, we need to know the mean and standard
deviation of the sampling distribution.
The mean of the sampling distribution for sample proportions is the population proportion.
µp̂ = π (2.33)
The standard deviation of the sampling distribution for sample proportions (or the standard error of
sample proportions) is found using the following formula:
σp̂ = √(π(1 − π)/n) (2.33)
Recall that the z-score for the sampling distribution for sample means is:
Z = (x̄ − µx)/σx̄ = (x̄ − µx)/(σx/√n) (2.33)
The z-score for the sampling distribution for sample proportions would be:
Z = (p̂ − µp̂)/σp̂ = (p̂ − π)/√(π(1 − π)/n) (2.33)
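aside: The z-score for a sample proportion is a direct calculation. A minimal Python sketch (the numbers below are hypothetical, chosen to mirror the video game ratings exercise that follows):

import math

def z_for_proportion(p_hat, pi, n):
    # z = (p_hat - pi) / sqrt(pi * (1 - pi) / n)
    return (p_hat - pi) / math.sqrt(pi * (1 - pi) / n)

# e.g. a sample proportion of 0.27 when pi = 0.30 and n = 100
print(round(z_for_proportion(0.27, 0.30, 100), 2))  # about -0.65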
a. Given this information, he first wants to know what the individual weight allowance is (i.e.
the per person average) that the gondola can withstand.
b. He also wants to know how likely it is that the individual weight of any randomly selected
male will exceed the individual weight allowance calculated above.
c. Finally, he wants to know how likely it would be that the average weight of a sample of 12
adult males would exceed the average individual weight allowance.
d. Based on your answers, do you think the manager should renovate the gondola? Is there any
further information that the manager would need?
a. Assuming that the proportion of parents that are well informed about video game ratings is
30%, what is the probability that you would observe a sample proportion of less than 27%?
Use the normal approximation of the binomial distribution to find your answer.
b. Based on your results, do you believe that this is enough evidence to suggest that less than
30% of parents are well informed about video game ratings? Explain your answer.
The following practice questions are from Lyryx Learning, Business Statistics I MGMT 2262 Mt Royal
University Version 2016 Revision A. OpenStax CNX. Sep 8, 2016.
http://cnx.org/contents/[email protected]
Use the following information to answer the next ten exercises: A manufacturer produces 25-pound lifting
weights. The lowest actual weight is 24 pounds, and the highest is 26 pounds. Each weight is equally likely
so the distribution of weights is uniform. A sample of 100 weights is taken. The standard deviation is 0.58
pounds.
Exercise 2.4.5.5 (Solution on p. 203.)
a. What is the distribution for the weights of one 25-pound lifting weight? What is the mean
and standard deviation?
b. What is the distribution for the mean weight of 100 25-pound lifting weights?
c. Find the probability that the mean actual weight for the 100 weights is less than 24.9.
a. What is the probability that the 49 fly balls traveled an average of less than 240 feet?
b. Find the 80th percentile of the distribution of the average of 49 fly balls.
a. Would you be surprised if the 36 taxpayers finished their Form 1040s in an average of more
than 12 hours? Explain why or why not in complete sentences.
b. Would you be surprised if one taxpayer finished his or her Form 1040 in more than 12 hours?
In a complete sentence, explain why.
a. Find the probability that the runner will average between 142 and 146 minutes in these 49
marathons.
b. Find the 80th percentile for the average of these 49 marathons.
c. Find the median of the average running times.
a. When the sample size is large, the mean of X̄ is approximately equal to the mean of X.
b. When the sample size is large, X̄ is approximately normally distributed.
c. When the sample size is large, the standard deviation of X̄ is approximately the same as the
standard deviation of X.
a. S = {(1,1), (1,2), (1,3), (1,4), (2,1), (2,2), (2,3), (2,4), (3,1), (3,2), (3,3), (3,4)}
b. A = {(1,1), (1,3), (2,2), (2,4), (3,1), (3,3)}
A OR B = {(1,1), (1,3), (2,1), (2,2), (2,3), (2,4), (3,1), (3,2), (3,3), (3,4)}
e. P(A AND B) = 1/3, P(A OR B) = 5/6
f. B′ = {(1,1), (1,2), (1,3), (1,4)}, P(B′) = 1/3
g. P(B) + P(B′) = 1
h. P(A|B) = P(A AND B)/P(B) = 1/2, P(B|A) = P(A AND B)/P(A) = 2/3, No.
a. P(L′) = P(S)
b. P(M OR S)
c. P(F AND L)
d. P(M |L)
e. P(L|M )
f. P(S|F)
g. P(F|L)
h. P(F OR L)
i. P(M AND S)
j. P(F)
a. You can't calculate the joint probability knowing the probability of both events occurring, which is not
in the information given; the probabilities should be multiplied, not added; and probability is never
greater than 100%
b. A home run, by definition, is a successful hit, so he has to have at least as many successful hits as home
runs.
a. With replacement
b. No
a. P(F) = 1/4
b. P(G) = 1/2
c. P(H) = 1/2
d. Yes
e. No
a. P(B|D) = 0.6667
b. P(D|B) = 0.5
c. No
d. No
a. P(T) = 1/4
b. P(T|F) = 1/2
c. No
d. No
e. Yes
Solution to Exercise 2.1.3.11 (p. 110)
P(J) = 0.3
Solution to Exercise 2.1.3.13 (p. 110)
P(Q AND R) = P(Q)P(R)
0.1 = (0.4)P(R)
P(R) = 0.25
Solution to Exercise 2.1.3.15 (p. 111)
0
Solution to Exercise 2.1.3.17 (p. 111)
0.3571
Solution to Exercise 2.1.3.19 (p. 111)
0.2142
Solution to Exercise 2.1.3.21 (p. 111)
Physician (83.7)
Solution to Exercise 2.1.3.23 (p. 112)
83.7 − 79.6 = 4.1
Solution to Exercise 2.1.3.25 (p. 112)
P(Occupation < 81.3) = 0.5
Solution to Exercise 2.1.3.27 (p. 112)
a. P(C) = 0.4567
b. not enough information
c. not enough information
d. No, because over half (0.51) of men have at least one false positive test
Solution to Exercise 2.1.3.29 (p. 112)
a. P(J OR K) = P(J) + P(K) − P(J AND K); 0.45 = 0.18 + 0.37 − P(J AND K); solve to find P(J
AND K) = 0.10
b. P(NOT (J AND K)) = 1 - P(J AND K) = 1 - 0.10 = 0.90
c. P(NOT (J OR K)) = 1 - P(J OR K) = 1 - 0.45 = 0.55
Solution to Exercise 2.1.4.1 (p. 114)
P(D |C) = 0.85
P(C ∩ D) = P(D ∩ C)
P(D ∩ C) = P(D|C)P(C) = (0.85)(0.75) = 0.6375
Helen makes the first and second free throws with probability 0.6375.
a. P(B′) = 0.60
b. P(D ∩ B) = P(D|B)P(B) = 0.20
c. P(B|D) = P(B ∩ D)/P(D) = (0.20)/(0.30) = 0.66
d. P(D ∩ B′) = P(D) − P(D ∩ B) = 0.30 − 0.20 = 0.10
e. P(D|B′) = P(D ∩ B′)/P(B′) = (0.10)/(0.60) = 0.1667
Solution to Exercise 2.1.4.7 (p. 120)
0.376
Solution to Exercise 2.1.4.9 (p. 120)
C|L means, given the person chosen is a Latino Californian, the person is a registered voter who prefers life
in prison without parole for a person convicted of first degree murder.
Solution to Exercise 2.1.4.11 (p. 120)
L ∩ C is the event that the person chosen is a Latino California registered voter who prefers life without
parole over the death penalty for a person convicted of first degree murder.
Solution to Exercise 2.1.4.13 (p. 120)
0.6492
Solution to Exercise 2.1.4.15 (p. 120)
No, because P(L ∩ C) does not equal 0.
Solution to Exercise 2.1.4.17 (p. 121)
b. 85
c. 23
d. 28
e. 68
f. No, because P(G ∩ E) does not equal 0.
Solution to Exercise 2.1.4.23 (p. 123)
c. Yes, A and B are mutually exclusive because they cannot happen at the same time; you cannot pick
a card that is both blue and also (red or green). P(A ∩ B) = 0
d. No, A and C are not mutually exclusive because they can occur at the same time. In fact, C includes
all of the outcomes of A; if the card chosen is blue it is also (red or blue). P(A ∩ C) = P(A) = 3/20
a. If Y and Z are independent, then P(Y ∩ Z) = P(Y )P(Z), so P(Y ∪ Z) = P(Y ) + P(Z) - P(Y )P(Z).
b. 0.5
Solution to Exercise 2.1.4.29 (p. 124)
a. iii; b. i; c. iv; d. ii
Solution to Exercise 2.1.4.31 (p. 125)
a. P(R) = 0.44
b. P(R|E) = 0.56
c. P(R|O) = 0.31
d. No, whether the money is returned is not independent of which class the money was placed in. There
are several ways to justify this mathematically, but one is that the money placed in economics classes
is not returned at the same overall rate; P(R|E) ≠ P(R).
e. No, this study definitely does not support that notion; in fact, it suggests the opposite. The money
placed in the economics classrooms was returned at a higher rate than the money placed in all classes
collectively; P(R|E) > P(R).
Solution to Exercise 2.1.4.33 (p. 126)
a. Let C = the event that the cookie contains chocolate. Let N = the event that the cookie contains
nuts.
b. P(C ∪ N ) = P(C) + P(N ) - P(C ∩ N ) = 0.36 + 0.12 - 0.08 = 0.40
1. P(F AND C) = 18/100 = 0.18
2. P(F)P(C) = (45/100)(34/100) = (0.45)(0.34) = 0.153
1. P(F) = 45/100
2. P(P) = 25/100
4. P(F OR P) = 45/100 + 25/100 − 11/100 = 59/100
a. P(H|M) = 52/90 = 0.5778
b. For M and H to be independent, show P(H|M) = P(H)
Door Choice
Door One Door Two Door Three Total
Caught 1/15 1/12 1/6 19/60
Not Caught 4/15 3/12 1/6 41/60
Total 5/15 4/12 2/6 1
Table 2.18
Tall Medium Short Totals
Obese 18 28 14 60
Normal 20 51 28 99
Underweight 12 25 9 46
Totals 50 104 51 205
Table 2.19
a. Row Totals: 60, 99, 46. Column totals: 50, 104, 51.
b. P(Tall) = 50/205 = 0.244
c. P(Obese AND Tall) = 18/205 = 0.088
d. P(Tall|Obese) = 18/60 = 0.3
e. P(Obese|Tall) = 18/50 = 0.36
f. P(Tall AND Underweight) = 12/205 = 0.0585
g. No. P(Tall) does not equal P(Tall|Obese).
There are 9 + 24 outcomes that have R on the first draw (9 RR and 24 RB). The sample space is then
9 + 24 = 33. 24 of the 33 outcomes have B on the second draw. The probability is then 24/33.
Solution to Exercise 2.1.5.4 (p. 134)
Total number of outcomes is 144 + 480 + 480 + 1,600 = 2,704.
P(FF) = 144/(144 + 480 + 480 + 1,600) = 144/2,704 = 9/169
Solution to Example 2.24, Problem 2 (p. 136)
b. P(RB OR BR) = (3/11)(8/10) + (8/11)(3/10) = 48/110
X P(X) P (X ≤ x)
0 0.00080 0.00080
1 0.00684 0.00764
2 0.02785 0.03548
3 0.07160 0.10709
4 0.13042 0.23751
5 0.17886 0.41637
6 0.19164 0.60801
7 0.16426 0.77227
8 0.11440 0.88667
9 0.06537 0.95204
10 0.03082 0.98286
11 0.01201 0.99486
12 0.00386 0.99872
13 0.00102 0.99974
14 0.00022 0.99996
15 0.00004 0.99999
16 0.00001 1.00000
17 0.00000 1.00000
18 0.00000 1.00000
19 0.00000 1.00000
20 0.00000 1.00000
Table 2.20
a. P(X=7) = 0.16426
b. P(10 ≤ X ≤ 14) = 0.04792 (highlight the values in the column P(X) for X from 10 to 14, then look
at the Sum in the lower right)
c. P(X ≥ 11) = 0.01714 (highlight the values in the column P(X ) for X from 11 and higher, then look
at the Sum in the lower right)
d. Change π to 0.7 and re-run the computer program. Look at when X is 5: P(X = 5) = 0.00004
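The text's "computer program" is not specified; as one illustration, the same values can be reproduced in
Python with scipy.stats, assuming the parameters behind Table 2.20 are n = 20 and π = 0.3 (these reproduce
P(X = 7) = 0.16426 and the mean of 6.0 quoted below):

    from scipy.stats import binom

    n, pi = 20, 0.3
    print(binom.pmf(7, n, pi))                          # a. P(X = 7), ~0.16426
    print(binom.cdf(14, n, pi) - binom.cdf(9, n, pi))   # b. P(10 <= X <= 14), ~0.04792
    print(binom.sf(10, n, pi))                          # c. P(X >= 11) = P(X > 10), ~0.01714
    print(binom.pmf(5, n, 0.7))                         # d. with pi = 0.7, P(X = 5), ~0.00004
    print(binom.mean(n, pi), binom.std(n, pi))          # 6.0 and ~2.049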
3. The mean is the same as the expected value, which is 6.0, and the standard deviation is 2.049. This
gives us a typical range of 3.951 to 8.049 for the number of business passengers in a random sample of
20 passengers.
a. We would expect 0.9 give or take 0.939 to be audited. This means a typical range is 0 to 1.8 residents
to be audited out of 45.
b. Since 4 is outside of the range, it would be deemed atypical, but that does not necessarily mean that
it is abnormal.
c. Since the employee at I&S Square wants to show that something strange is happening in Seba Beach,
they would want to assume that nothing strange is happening. That is, the rate of audits is the same in
Seba Beach as anywhere in Canada. This means we want to assume π = 2%, where π is the proportion
of people who are audited.
d. The binomial distribution.
e. We need to find the probability that at least 4 out of 45 residents are audited, assuming an audit rate
of 2%.
f. P (X ≥ 4 given π = 2%) = 0.01242 = 1.24% (from computer program with n = 45, π (or probability
of occurrence) = 2%).
g. The probability that at least 4 out of 45 Seba Beach residents are audited, under the assumption that
the audit rate is 2%, is 1.24%.
h. Since the probability that we observed our sample data is between 1% and 10%, we have to
determine whether the probability is unlikely or not unlikely. Since it is closer to 1% than 10%, we can
say that the sample data is unlikely to have occurred under the assumption. Therefore, the evidence
suggests that there is something wrong with the assumption. That is, there is evidence that the
residents of Seba Beach are being audited at a higher rate than the rest of Canada.
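As an illustration of part f, the tail probability can be computed in Python with scipy.stats (one possible
tool; the text leaves the program unspecified):

    from scipy.stats import binom

    n, pi = 45, 0.02
    print(binom.sf(3, n, pi))    # P(X >= 4) = P(X > 3), ~0.0124
    print(n * pi)                # expected number audited: 0.9
    print(binom.std(n, pi))      # ~0.939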
x P (X = x)
0 0.00005
1 0.0009
2 0.0080
3 0.0395
4 0.1227
5 0.2439
6 0.3030
7 0.2151
8 0.0668
Table 2.21
1. Observational unit: Five-minute make-up kit; Variable: Did it sell or not; Type of variable: Categorical
2. They want to show that the new packaging will decrease sales.
3. They need to assume the opposite of what they want to show. Therefore, they need to assume that the
new packaging does not decrease sales. Therefore, the proportion of kits sold stays the same at 68%.
4. They have found that out of 500 kits supplied, 306 of them have been sold.
5. They first need to start with an assumption (π = 68%). Then they need to come up with a model based
on this assumption. Once they have the model, they will use it to find the probability that the stores
sold at most 306 out of 500 kits, assuming that the new packaging has not decreased sales (i.e. the rate
stayed at 68%). Once they have the probability, they need to determine whether the event is likely or
unlikely. An event is unlikely if the probability is less than 1%. An event is likely if the probability is
more than 10%. If the event is unlikely, then it means that it is unlikely we observed the evidence under
the assumption. Since we know the evidence actually happened, that makes us question the assumption.
Thus, it is unlikely the assumption is true based off of the evidence. If the event is likely to happen,
then the assumption is likely to be true based off of the evidence.
6. • Is the data randomly collected? Most likely not. The 500 kits that we are looking at were not
randomly selected.
• Is the data discrete (countable)? As we are counting the number of kits that are sold the data is
discrete.
• Are the events independent? This may be a fair assumption for this study. Most likely the sale
of one kit is not dependent on whether another kit is sold. Though if two friends buy the kits
together or someone buys a bunch as presents, this is not the case, but in general it is more likely
independent than dependent.
• Are there a fixed number of trials? In this case, the number of trials would be the 500 kits with
the new packaging.
• Are there two possible outcomes? Either a kit is sold or it is not.
7. P(X ≤ 306) = 0.00077 = 0.077%
8. The probability that we observed at most 306 out of 500 kits sold, assuming the rate of sales is
68%, is 0.077%.
9. Since the probability is less than 1%, it is very unlikely that we would have observed this evidence
under the assumption. Since we actually observed the evidence but assumed that the rate was 68%,
what we have assumed is called into question. Therefore, it is unlikely that the assumption is true.
Therefore, it is likely that the new packaging has resulted in a decrease in sales.
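A minimal sketch of step 7's computation in Python with scipy.stats (our choice of tool; any statistics
program will do):

    from scipy.stats import binom

    # P(X <= 306) sold, with n = 500 kits and an assumed sell rate of 68%
    print(binom.cdf(306, 500, 0.68))   # ~0.00077, i.e. about 0.077%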
Solution to Exercise 2.2.2.18 (p. 152)
Observational unit: People who are familiar with Striking Donkey Coffee; Variable: Whether they prefer
Logo 2; Type of variable: Categorical.
1. They need to assume the opposite of what they want to show. This means they need to assume that
Logo 2 is NOT preferred significantly more than Logo 1. This would mean they are preferred equally.
Therefore, there is a 50% chance that someone will choose Logo 2.
2. • The data is collected randomly: Yes. It is a random sample of participants.
• The outcomes are counted: Yes. They count how many people like Logo 2.
• There are two possible outcomes: Yes. Either they prefer Logo 2 or they do not.
• There are a fixed number of trials: Yes. They asked 40 people.
• The trials are independent of each other: Yes. It is fair to assume that no participant's preference
is based on another participant's preference.
3. The more unlikely it is that we observed our evidence, the smaller the probability will be. This means
the smaller the probability, the more unlikely it is that the assumption (i.e. that there is no preference
between the logos) is true. Since the marketers want clear evidence that there is a preference, they
want a smaller probability, which would show it is unlikely that there is no preference between the
logos. The level of significance is the threshold between likely and unlikely. Thus, if they want clear
evidence, they want to set their threshold high, meaning they want to make it a small number. Since
the level of significance is between 1% and 10%, the lowest level of significance (meaning the highest
threshold of evidence) is 1%.
4. P (X ≥ 26) = 0.04035 = 4.04%
5. The probability that we observed at least 26 out of 40 people who preferred Logo 2, assuming that
there was no preference between the logos, is 4.04%.
6. Since the probability is greater than 1% (it is 4.04%), it is not unlikely that we observed at least
26 out of 40 people who preferred Logo 2, assuming that there was no preference between the logos.
Therefore, we do not reject that there was no preference between the logos. This suggests that Logo 2
is NOT preferred signicantly more than Logo 1.
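A sketch of step 4's computation in Python with scipy.stats (the text does not prescribe a program):

    from scipy.stats import binom

    # P(X >= 26) preferring Logo 2, with n = 40 and pi = 0.5 (no preference)
    print(binom.sf(25, 40, 0.5))   # P(X > 25) = P(X >= 26), ~0.04035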
a. About 68% of the values lie between the values 41 and 63. The z-scores are −1 and 1, respectively.
b. About 95% of the values lie between the values 30 and 74. The z-scores are −2 and 2, respectively.
c. About 99.7% of the values lie between the values 19 and 85. The z-scores are −3 and 3, respectively.
a. 0.6915
b. 0.3345
c. 0.8413
d. $300.70
e. $260 to $300
a. i. 0.0228
ii. 0.0082
iii. 0.7865
iv. 0.0669
b. 56407.8 km
c. 48073.4 km
Solution to Exercise 2.4.5.1 (p. 182)
a. Yes. As the sample size is greater than 30 (it is 100), we can assume that the sampling distribution of
the sample mean lifespan of the tires is normally distributed, regardless of the shape of the population
distribution, due to the central limit theorem.
b. x̄ = mean lifespan of a sample of 100 Old Baldy tires; µ_x̄ = µ_X = 50,000; σ_x̄ = σ_X/√n =
10,000/√100 = 1,000. Since we know that the sampling distribution is normally distributed, we can use
a computer program to calculate the probability P(x̄ < 49,000). From the computer program, we get
P(x̄ < 49,000) = 15.87%.
c. No. The probability that a random sample of 100 Old Baldy tires has a mean lifespan of less than
49,000 km is 15.87% (assuming Old Baldy's claim). This means that this event is likely to occur (as it
is greater than 10%) under the assumption that the tires last on average 50,000 km, and does not
provide sufficient evidence against Old Baldy's claim.
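A sketch of part b's calculation in Python with scipy.stats:

    from math import sqrt
    from scipy.stats import norm

    se = 10_000 / sqrt(100)                          # standard error of the mean
    print(norm.cdf(49_000, loc=50_000, scale=se))    # P(xbar < 49,000), ~0.1587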
Solution to Exercise 2.4.5.2 (p. 182)
a. Since we are using the binomial distribution, we are being asked to find the probability that at least
660 of the 1000 people in the poll will want to focus on connecting the system. The 660 comes from
66% of 1000. In other words, we are asked to find P(X ≥ 660), with n = 1000 and π = 62%. Using
a computer program, this yields a probability of 0.48%. This is found by highlighting all of the values
above 660 and including 660.
b. Since we are using the sampling distribution for sample proportions, we are asked to find the probability
that the sample proportion will be at least 66%. In other words, we are asked to find P(p̂ ≥ 0.66).
We can assume the sampling distribution for sample proportions is normal as the number of successes
(nπ = 1000 × 0.62 = 620) and the number of failures (n(1 − π) = 1000 × 0.38 = 380) are both at least
five. Therefore, we will use the normal distribution to find the probability with µ_p̂ = π = 0.62 and
σ_p̂ = √(π(1−π)/n) = √(0.62(1−0.62)/1000) = 0.01475. Therefore, using a computer program we find
P(p̂ ≥ 0.66) = 0.33%.
c. The two probabilities are quite close. They are only 0.15% apart. Therefore, the two methods give us
similar results.
d. It is unlikely that, if the proportion of residents who want to focus on connecting the bike system is
62%, a poll of 1000 people would result in a sample proportion of 66%. Therefore, it is unlikely
that the city of Montreal will choose to focus on connecting the system.
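A sketch comparing the two approaches in Python with scipy.stats, under the stated assumptions
(n = 1000, π = 0.62); small differences from the rounded values above are possible:

    from math import sqrt
    from scipy.stats import binom, norm

    n, pi = 1000, 0.62
    exact = binom.sf(659, n, pi)               # P(X >= 660), exact binomial, ~0.48%
    se = sqrt(pi * (1 - pi) / n)               # standard error of p-hat
    approx = norm.sf(0.66, loc=pi, scale=se)   # P(p-hat >= 0.66), normal model
    print(exact, approx)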
Solution to Exercise 2.4.5.4 (p. 183)
a. Since we are using the sampling distribution for sample proportions, we are asked to find the probability
that the sample proportion will be at most 27%. In other words, we are asked to find P(p̂ ≤ 0.27).
We can assume the sampling distribution for sample proportions is normal as the number of successes
(nπ = 1000 × 0.30 = 300) and the number of failures (n(1 − π) = 1000 × 0.70 = 700) are both at least
five. Therefore, we will use the normal distribution to find the probability with µ_p̂ = π = 0.30 and
σ_p̂ = √(π(1−π)/n) = √(0.3(1−0.3)/1000) = 0.01145. Therefore, using a computer program we find
P(p̂ ≤ 0.27) = 0.44%.
b. Since the probability that we would observe a sample proportion of at most 27% (assuming a population
proportion of 30%) is 0.44%, it is very unlikely we would have observed this evidence if the assumption
is true. Therefore, it is more likely that the population proportion is less than 30%. Thus there is
enough evidence to suggest that less than 30% of parents are well informed about video game ratings.
Solution to Exercise 2.4.5.5 (p. 184)
a. Uniform, with a mean of 25 and a standard deviation of 0.58 pounds. Remember that when a distribution
is uniform, all of the values are equally likely. Therefore the mean will be halfway between the lowest
value (24) and the highest value (26).
b. Normal with a mean of 25 and a standard deviation of 0.0577
c. 0.0416
Solution to Exercise 2.4.5.6 (p. 184)
0.0003
Solution to Exercise 2.4.5.7 (p. 184)
25.07
Solution to Exercise 2.4.5.8 (p. 184)
a. 0.0808
b. 256.01 feet
Solution to Exercise 2.4.5.9 (p. 184)
a. 0.6247
b. 146.68
c. 145 minutes
Solution to Exercise 2.4.5.11 (p. 184)
a. True. The mean of a sampling distribution of the means is approximately the mean of the data
distribution.
b. True. According to the Central Limit Theorem, the larger the sample, the closer the sampling
distribution of the means is to normal.
c. False. The standard deviation of the sampling distribution of the means decreases as the sample size
increases, so it does not stay approximately the same as the standard deviation of X.
• Find and interpret confidence intervals that estimate the population mean and the population
proportion.
• Understand the properties of the Student-t distribution.
• For confidence intervals for the population mean, determine whether to use the Student-t distribution
or the standard normal distribution as a model.
• Find the minimum sample size needed to estimate a parameter given a margin of error.
Suppose you wanted to estimate the true proportion of basketball shots you can make. You might attempt
a number of practice shots and divide the number of shots you made by the number of shots you attempted.
In this case, you would have obtained a point estimate for the true proportion.
A point estimate is a single value used to estimate a population parameter. For example, the sample
mean is a point estimate of the population mean. But point estimates do not give a sense of how much
error there is in an estimate. Thus, we instead want to provide an interval estimate for the population
parameter that takes into account error. The type of interval estimate we will learn about in this chapter
is called a confidence interval.
From our work on sampling distributions, we know that the sample mean probably won't be exactly the
population mean. Instead we expect it to be slightly larger or smaller than the population mean. But by
how much? The margin of error, denoted E, measures how much we expect the statistic to vary from the
parameter. The margin of error is computed by looking at how much variation there is in the sampling
distribution and at the level of confidence (discussed below).
To calculate a confidence interval, you take the statistic and you add and subtract the margin of error
from it. For example, if you are trying to estimate the population mean, you would take the sample mean
and add and subtract the margin of error from it: (x̄ − E, x̄ + E). This gives an interval of values that you
expect the population mean to fall between.
Example 3.1
A recent opinion poll asked Canadians their opinion of the work of the current Prime Minister of
Canada. 53% of Canadians approved of his work with a margin of error of 2.6%. The statistic is
a sample proportion of 53% and we are trying to estimate the true proportion of Canadians who
approved of the Prime Minister's work. We know that there will be error in that estimate and it has
been measured to be 2.6%. Therefore, we are estimating that the true proportion of all Canadians
who approve of the Prime Minister's work is between 53% ± 2.6%, or between 50.4% and 55.6%.
note: Though confidence intervals change depending on the sample, the parameter being
estimated is fixed. For example, on a specific day, the population mean rent of a two-bedroom
apartment in your town is a specific value. You are trying to estimate it, but it is fixed. The
confidence interval, on the other hand, changes depending on the sample you take. Suppose instead
of looking at the classified section of a newspaper, you looked at a rental website. Then the
sample might be different, which will result in a different confidence interval. Or suppose you stood
outside a mall entrance and asked every fifth person what they paid in rent for their two-bedroom
apartment; then your sample would be different, which will result in a different confidence interval.
These three different confidence intervals are all estimating the same thing, the population mean
rent of a two-bedroom apartment in your town, but since each of the samples is different, the
sample means will be different, which will result in different estimates. In short, the parameter
being estimated is not a random variable. But the confidence interval being used to estimate the
parameter varies depending on the random sample taken.
In the following sections, we will learn how to calculate the margin of error for the mean and proportion.
For each situation, we will use a different model to find the margin of error. It should be noted that all of
the models are based on the assumption that a random sample has been collected. Therefore, finding a
confidence interval based on the convenience sample of the rent in today's classified ads is not appropriate.
This is important to remember when you are critically assessing a confidence interval provided to you. No
matter how prettily the confidence interval is presented, if it was constructed from a non-random sample, it
is useless. It is like baking an apple pie from rotten apples. It might look good, but it is still rotten.
If you are trying to estimate how much it will cost to go on a trip to Montreal for five days, you can work
out with strong confidence the cost of the flight and hotels, but then you have to start making estimates
about how much food and entertainment will cost while you're there. You can get a pretty good estimate of
what it will cost, but your friend who you are trying to convince to come with you might want to know how
confident you are in that estimate. Are you the kind of person who just guesses at the cost of meals, or did
you look at restaurants' menus to come up with a sense of what meals cost in Montreal? Did you take into
account snacks? The cost of renting a car or taking the bus? Did you assume you were going to do an equal
number of free and paid admission activities? All of this affects the confidence you have in your estimate.
For a confidence interval, it is much easier to determine how much confidence we have in our estimate
because confidence intervals come with a level of confidence (or confidence level).
To understand the confidence level, let's go back to the two-bedroom apartment situation. Let's now
suppose that 100 people on the same day were very curious about determining the mean rent for two-bedroom
apartments in your town. Each of these 100 people went out and found their own random sample of fifty
people who rent two-bedroom apartments in your town. From these 100 samples, 100 confidence intervals
were calculated. Based on our work on sampling distributions, we know that the 100 sample means will
be close to the population mean (some might even be the same as the population mean), but some will
be closer and some will be farther. Thus some of the confidence intervals will be 'good' estimates of the
population mean rent for two-bedroom apartments (that is, the population mean will actually be included
in the confidence interval) and some will be 'bad' estimates (that is, the population mean won't actually
be included in the confidence interval). Since the population mean is unknown, none of the 100 people who
made these confidence intervals knows if their estimate is good or bad. Instead, they can only state how
confident they are in their estimate. That is, they can only state their level of confidence.
Suppose that all 100 people made 95% confidence intervals. What does that mean? Well, suppose a local
real estate company has actually worked out the population mean rent for two-bedroom apartments in your
town by finding out the rent for all two-bedroom apartments. Since they know the population mean, they
don't have to estimate it. They have found it to be $1200.
Figure 3.1 shows the 100 confidence intervals created by the 100 random samples and compares them to
the population mean. If the interval is yellow, then that means it is a good estimate. If it is red, then that
means it is a bad estimate. The yellow part in the middle represents the 95% confidence interval. The yellow
and the blue combined represent the 99% confidence interval.
Figure 3.1: 100 confidence intervals generated from 100 random samples of the rent of two-bedroom
apartments in your town
The above image was created using an applet from David Lane's Online Statistics Education: A Multimedia
Course of Study (http://onlinestatbook.com/), Rice University.
Notice that out of the 100 confidence intervals calculated, 93 of them are good estimates (contain $1200)
and seven of them are bad estimates (do not contain $1200). This is what the confidence level refers to.
That is, if you take many, many random samples of the same size and construct a confidence interval for
each of the samples, then the percentage of confidence intervals that contain the population mean is 95%
and the percentage that do not contain the population mean is 5%. Thus, the confidence level refers to the
probability that the process of creating a confidence interval results in the population parameter being in
the confidence interval. It is NOT the probability that the population mean falls in a specific confidence
interval. Remember that the population mean is fixed. Therefore, either the population mean does fall in
the confidence interval or it doesn't. Since there is no randomness to whether it falls in or not, there is
no probability associated with that event. Instead the level of confidence refers to the percent of confidence
intervals that contain the parameter being estimated if the study/experiment is repeated many, many times.
What has been described above is not an easy idea. Many people who have studied statistics are under
the false impression that the confidence level refers to the probability that the parameter is in the confidence
interval. Don't fret if this doesn't make complete sense to you right away. Give yourself some time to think
about it and process it.
As a note, the example provided in Figure 3.1 is a bit surprising. If you flip a fair coin 100 times, you
would expect around 50 heads and 50 tails, but due to sampling variability it would also be fair to
get 49 heads and 51 tails. It is the same thing with confidence intervals: we expect that out of 100 confidence
intervals around 95 of them contain the population mean and 5 of them don't, but it would be fair to
get 94 good estimates and 6 bad ones. Once again, the law of large numbers tells us that as the number of
samples increases, the closer we will get to the 95%. That is, if we take 1000 random samples instead of
100, the more likely it is that 95% will be good estimates and 5% will be bad.
The most common choices for confidence levels are 90%, 95%, and 99%, but you can choose the level of
confidence to be any percentage between 0.00001% and 99.99999%. You can't choose 100%, because that
would mean you know for sure that the population parameter falls within the confidence interval. You also
can't choose 0%, because that would mean you know for sure that the population parameter does not fall
within the confidence interval. If you knew for sure that the parameter falls (or does not fall) in the
confidence interval, you wouldn't bother constructing a confidence interval, because you would already know
the parameter. 90%, 95%, and 99% are common levels of confidence because they offer a high degree of
confidence.
How does the confidence level change the confidence interval? Think about the following two confidence
intervals for the mean age of students at your university: one estimate of 20 to 21 years old, and another
of 4 to 85 years old.
Which confidence interval are you more confident actually contains the population mean? Well, it is
pretty likely that the population mean age of students at your university is somewhere between 4 years old
and 85 years old, because the range is so wide that it most likely 'catches' the population mean.
In general, the larger the confidence level, the wider the confidence interval. That is, to increase the
confidence in the estimate, we make the confidence interval wider so that it is more likely to catch what we
are estimating. Think about the confidence interval like a net. The smaller the net, the less likely it is you'll
catch the fish. But the wider the net, the more likely it is that you will. Thus for the same sample, the 90%
confidence interval is narrower than the 99% confidence interval.
Thus, a 99% confidence interval is very reliable, but it gains reliability at the price of precision. That is,
its wideness might come at the expense of usefulness. Going back to the confidence interval for the mean age
of students at your university, we can be very confident that the population mean age is between 4 and 85
years old, but that doesn't actually help us understand what the population mean age is. We are less
confident in the estimate of 20 to 21 years old, but it provides us more useful information.
To summarize, higher degrees of confidence mean that we are more sure that the parameter falls in the
interval (i.e. more reliable). Lower degrees of confidence mean that the interval is smaller and thus gives us
a better idea of where the parameter in question is (i.e. more precise). See Figure 3.2.
Figure 3.2: Comparing different levels of confidence for the same random sample
The choice of a 95% level of confidence is most common because it provides a good balance between
precision and reliability.
The width of the confidence interval is determined by the margin of error, E. In general, the confidence
interval is calculated as follows: point estimate ± E.
The size of the margin of error determines the width of the confidence interval. That is, the bigger the
margin of error is, the wider the confidence interval.
Factors that affect the size of the confidence interval include the size of the sample, the amount of
variability in the data, and the confidence level.
As per the law of large numbers, the larger the sample size, the closer the statistic (or point estimate) is
to the parameter. Therefore, the larger the sample size, the less error there is between the statistic and the
parameter. This means that the margin of error is smaller for larger sample sizes taken from the
same population.
The greater the variability in the population, the greater the variability in the statistics. We saw this in
Chapter 6 when we determined that the standard deviation of the sampling distribution was related both to
the standard deviation of the population and the sample size. That is, the variation between the statistics
relies both on the variation in the population and the sample size. Thus, the margin of error is larger
in situations where there is more variability in the population.
As stated above, the larger the confidence level, the wider the confidence interval. Therefore, the margin
of error is larger for larger levels of confidence.
Here are some common misconceptions about confidence intervals, and why each one is wrong:
1. The confidence interval contains 95% of the data values. A confidence interval is an estimate
for a parameter (like the population mean or population proportion). Though the data values are used
to construct the confidence interval, the confidence interval does not tell us anything about the range
of the data values.
2. We are 95% confident that the sample mean is contained in the confidence interval. If
the confidence interval is for the population mean, then the sample mean has to be in the confidence
interval. In fact, it is right in the middle. Remember that the confidence interval for the population
mean is calculated as follows: (x̄ − E, x̄ + E). All confidence intervals contain the point estimate being
used to construct the confidence interval.
3. Increasing the sample size increases the width of the confidence interval. In fact, the
opposite happens. From the law of large numbers, we know that a larger sample size means that the
point estimate will likely be closer to the parameter being estimated. Therefore, as the sample size
increases, the margin of error decreases and the width of the confidence interval decreases.
4. A 90% confidence interval is wider than a 95% one for the same data. Again, it is the
opposite that happens. To become more confident in our estimate (i.e. increasing the level of
confidence), we widen the confidence interval. A wider confidence interval is a larger net, which makes
it more likely that we catch the parameter we are estimating.
The margin of error is found by taking into account the confidence level and the standard
error.
The next section examines how the margin of error is constructed for confidence intervals for the mean.
• The sampling distribution for sample means of the population we are investigating is approximately
normally distributed.
· If the sample size is greater than 30, then the central limit theorem tells us that we can assume
that the sampling distribution is approximately normal regardless of the population distribution.
Thus, if the sample size is greater than 30, we can use this model.
· If the sample size is less than 30, the central limit theorem does not guarantee that the sampling
distribution of the means will be normal. Therefore, to use this model the population
distribution needs to be approximately normal so that we know that the sampling distribution
for sample means is normal.
• The sample we are using to construct the confidence interval is a random sample.
To construct a confidence interval for the mean, collect a random sample from the population whose mean
is being estimated. Then calculate the sample mean.
The next step is to calculate the margin of error. To do this, we begin by finding out how much sampling
variability there is in the sampling distribution. That is, we determine how much variation we expect
between the sample means. This is found by calculating the standard error of the sampling distribution for
sample means:

σ_x̄ = σ_X/√n    (3.2)
Now we want to take into account the level of confidence. To do this, we construct a normal distribution
that is centred at the sample mean, x̄, whose standard deviation is the standard error of the mean, σ_X/√n.
The data values for this distribution are sample means. Therefore this is a sampling distribution for sample
means. This sampling distribution is an estimate of what the sampling distribution of the population will
look like:
Figure 3.3: Blue curve: the true sampling distribution for sample means, centred at µ_x̄ and with a
standard deviation of σ_X/√n. Red curve: an estimate of the true sampling distribution for sample means,
based on the mean of the random sample; it is centred at x̄ and has a standard deviation of σ_X/√n.
In Figure 3.3, the blue sampling distribution is the theoretical sampling distribution of the population,
which is unknown. The red sampling distribution is an estimate of the blue curve based on the sample mean
found from the random sample. We will use the red sampling distribution to estimate the population mean.
Using the red sampling distribution, we want to determine the interval of sample means that fall within
a specific percentage from the mean. The specific percentage is the confidence level.
Suppose that the confidence level is 95.44%. From the empirical rule, we know that 95.44% of data
values fall within 2 standard deviations of the mean for normally distributed data. Therefore, if we wanted
to construct a 95.44% confidence interval, we would take the sample mean and add and subtract two
standard deviations from it. Since we are dealing with a sampling distribution, the standard deviation we
are referring to is the standard error of the mean. Therefore, a 95.44% confidence interval is found by
calculating x̄ ± 2·σ_x̄ = x̄ ± 2·σ_X/√n. Thus for a 95.44% confidence interval, the margin of error is
E = 2·σ_X/√n.
If we wanted to find a 95% confidence interval, we would use the same process, but we would want a
slightly narrower interval. Therefore, instead of multiplying the standard error by 2, we would multiply it
by a slightly smaller number. To determine that number, we need to find how many standard deviations
away from the mean give an area of 95%. In other words, we need to find the z-score that gives an area
of 95%.
Figure 3.5: Standard normal curve with the area of the tails being 5%.
If the area in the middle of the curve is 95%, then the area of one tail is 2.5%. Using a computer program,
we can find this value to be ±1.96.
To do this, go to your computer program and go to the menu option that lets you find probabilities for
normal distributions. Then make the mean 0 and the standard deviation 1. Then switch from calculating
probabilities to finding z-values (like you are going to find a percentile). In the appropriate box, put 0.025
in for the area in the upper tail. When you hit enter, the program will give you 1.96 as the z-value for this
area.
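If you are working in Python rather than a menu-driven program, the same critical value comes from the
inverse-probability (percentile) function in scipy.stats:

    from scipy.stats import norm

    # z-value leaving 2.5% in the upper tail of the standard normal curve
    print(norm.ppf(1 - 0.025))   # ~1.96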
In general, the value that you multiply the standard error by is called the critical value and is denoted
by z_{α/2}, where α is the total area of the tails. (1 − α) × 100% is the level of confidence.
The margin of error is E = z_{α/2} × σ_X/√n.
The confidence interval is x̄ ± E. As it is an interval, always write it with the smaller number first (x̄ − E)
followed by the larger number (x̄ + E).
Exercise 3.1.4.1 (Solution on p. 250.)
Suppose that a random sample of 175 students from a university is taken; their average age is
21.34 years old and the population standard deviation is known to be 5.12 years.
1. Find the 95% confidence interval for the population mean age of all university students.
2. Interpret the confidence interval in the context of the question.
3. Explain what the level of confidence means in the context of the problem.
4. If we decreased the sample size to 100, what would you expect to happen to the confidence
interval? Explain your answer.
5. Suppose that an administrator at the university claims that this university caters to older
students and that the mean age is 23. Does the confidence interval support the claim?
• All of the means in the interval are equally likely. That is, each of the estimates of the population
mean in the interval has an equal chance of being correct. For example, 20.58 years old and 21.25
years old are both equally likely estimates of the population mean age.
• The sample mean of 21.34 is right in the middle of the interval.
• The margin of error is 0.759 and is found using the formula
E = z_{α/2} × σ_X/√n = 1.96 × 5.12/√175    (3.5)
• It is possible that the population mean is not captured by this confidence interval, but we wouldn't
know whether it is or not without knowing the population mean.
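A sketch of this interval in Python with scipy.stats, using the exercise's numbers:

    from math import sqrt
    from scipy.stats import norm

    n, xbar, sigma = 175, 21.34, 5.12
    E = norm.ppf(0.975) * sigma / sqrt(n)   # margin of error, ~0.759
    print(xbar - E, xbar + E)               # ~ (20.58, 22.10)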
3.1.4.1 Wait a second! If we don't know the population mean (µ_X), how do we know the population
standard deviation (σ_X) in the standard error formula???
1. In some long-running process (e.g. manufacturing), the standard deviation may be very static.
Therefore, the population standard deviation could be known even if the population mean isn't.
2. We don't know the population standard deviation, so instead we estimate it with the sample standard
deviation.
In most situations it is fairly unlikely that the population standard deviation will be known. Thus, we will
focus on situations where the population standard deviation is unknown. In that case, we will use the sample
standard deviation s to estimate the population standard deviation σ_X.
To use this model to construct a confidence interval, we need to again assume that the sampling distribution
is normal and that the sample was collected randomly. Just as we saw above, there are two general situations
that need to occur to ensure the sampling distribution is normal:
• If the sample size is greater than 30, then the central limit theorem tells us that we can assume that
the sampling distribution is approximately normal regardless of the population distribution. Thus, if
the sample size is greater than 30, we can use this model.
• If the sample size is less than 30, the central limit theorem does not guarantee that the sampling
distribution of the means will be normal. Therefore, to use this model the population distribution
needs to be approximately normal so that we know that the sampling distribution for sample
means is normal.
Since we don't know the population standard deviation, we will be using the sample standard deviation
to estimate σ_X. That means we are estimating the population mean using the sample mean and sample
standard deviation. This suggests that there may be more error in our estimate. To account for the greater
error, we want the confidence interval to be slightly wider. To do this, the margin of error needs to be
slightly bigger. The margin of error is the critical value × the standard error. The standard error is inherent
to the population and can't be changed, but the critical value can be. So instead of using the standard
normal distribution to find the critical value, we use the Student-t distribution.
Here is some information about the Student-t distribution.
Here is some information about the Student-t distribution.
• The Student-t distribution is a normal distribution with µ = 0 and σ > 1. The standard deviation of
the Student t distribution is dierent for dierent sample size. Remember that the standard normal
distribution is a normal distribution with µ = 0 and σ = 1. Therefore, the Student-t distribution is
centred at the same place as the standard normal distribution, but has greater variation so it is slightly
wider and shorter. See Figure 3.6.
• The smaller the sample size, the greater the variability is in the sampling distribution. When the
sample size is larger, there is less variability in the sampling distribution. These aspects are reected
in shape of the Student-t distribution.
• As the sample size n gets larger, the Student-t distribution gets closer to the standard normal distri-
bution.
The standard deviation of the Student-t distribution is based on the degrees of freedom, which in turn
are based on the sample size. The number of degrees of freedom for a sample corresponds to the number of
data values that can vary after certain restrictions have been imposed on all data values. Another way of
saying it is that the degrees of freedom are the number of components that need to be known before a
statistic is entirely determined. Depending on the model used, the degrees of freedom have a different
formula. For this model (i.e. confidence interval for one population mean), the degrees of freedom are the
sample size minus 1, i.e. n − 1.
As stated above, we want the confidence interval to be wider to take into account the larger variation
due to the estimate of the standard deviation. As you can see from the figure above, the Student-t
distribution is wider than the standard normal distribution. This means that the critical value for a 95%
confidence level will be greater than that for the standard normal: notice that the Student-t critical value
falls about halfway between ±2 and ±3, while the critical value for the standard normal distribution is
±1.96.
note: The Student-t distribution was created by William Gosset, an English statistician who worked
for Guinness breweries. While working for Guinness, Gosset developed the Student-t distribution but was
prohibited from publishing his work by his employers, who worried about trade secrets getting out. Thus
he published his work under the pseudonym 'Student' in 1908. The distribution, then, should really be
called the Gosset-t distribution.
The margin of error for this model is:

E = t_{α/2} × s/√n    (3.7)

The confidence interval is constructed in the same way: x̄ ± E.
Exercise 3.1.4.2 (Solution on p. 250.)
A manufacturer of AAA batteries wants to estimate the mean life expectancy of the batteries. It
is known that the life expectancy of such batteries is approximately normally distributed.
A random sample of 25 batteries has a mean of 44.25 hours and a standard deviation of 2.25
hours. Assume the population is normal.
1. Construct a 95% confidence interval for the mean life expectancy of all the AAA batteries
made by this manufacturer.
2. Interpret the 95% confidence interval.
3. If the confidence level is decreased to 90%, how does the confidence interval change?
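A sketch of part 1 in Python with scipy.stats, using the Student-t critical value with n − 1 = 24 degrees
of freedom:

    from math import sqrt
    from scipy.stats import t

    n, xbar, s = 25, 44.25, 2.25
    E = t.ppf(0.975, df=n - 1) * s / sqrt(n)   # t critical value, ~2.064
    print(xbar - E, xbar + E)                  # ~ (43.32, 45.18)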
important: As the degrees of freedom of the Student-t distribution increase (i.e. as the sample
size increases), the standard deviation of the Student-t distribution decreases and gets closer and
closer to 1. That is, as the sample size increases, the Student-t distribution gets closer and closer
to the standard normal distribution. Statisticians and researchers generally agree that for n ≥ 40,
the difference between the critical values of the Student-t distribution and the standard normal
distribution is negligible. Thus when n ≥ 40, the standard normal distribution can be
used to construct the confidence interval. (Some researchers switch to the Student-t distribution
when n > 30, and others use the Student-t distribution regardless of the sample size. This is a
matter of preference, but as most researchers agree with the n ≥ 40 rule, we will stick with it here.)
Figure 3.8 is a flow chart that indicates how to choose which model to use to construct a confidence
interval (CI) for the mean.
Figure 3.8: Flow chart for determining which model to use when constructing a confidence interval for
the mean
Determining an appropriate sample size is very important. Too small a sample may lead to poor results.
Too large a sample needlessly wastes time and money.
Prior to this section, we would have determined if a sample size was large enough simply by guessing.
Here we will learn a formula for finding the appropriate sample size based on the amount of error we will
accept in our results. This can be done by determining the minimum sample size needed to have a certain
margin of error. To do this, we solve for the sample size n in the margin of error formula.
E = z_{α/2} · s/√n
√n = (z_{α/2} · s)/E
n = ((z_{α/2} · s)/E)²    (3.8)
As we would always rather have one more object of study than one less, we will always round
up the result of this calculation. That is, if the result of the formula is 50.2, then we will round up to 51.
For example, suppose a pilot study yields the following sample values:
8.2; 9.1; 7.7; 8.6; 6.9; 11.2; 10.1; 9.9; 8.9; 9.2; 7.5; 10.5
How many participants should be in your study? A sketch of this calculation appears below.
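A sketch of the calculation in Python; the 95% confidence level and the margin of error E = 0.5 are
illustrative assumptions, since the question leaves them to you:

    import math
    import statistics
    from scipy.stats import norm

    pilot = [8.2, 9.1, 7.7, 8.6, 6.9, 11.2, 10.1, 9.9, 8.9, 9.2, 7.5, 10.5]
    s = statistics.stdev(pilot)    # sample standard deviation from the pilot data
    E = 0.5                        # assumed margin of error
    z = norm.ppf(0.975)            # assumed 95% confidence level
    n = (z * s / E) ** 2
    print(math.ceil(n))            # always round up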
• The variable being studied satisfies the conditions of the binomial distribution.
• The sampling distribution for sample proportions is approximately normal. This occurs if the number
of successes (n × π) is at least 5 and the number of failures (n × (1 − π)) is at least 5. As π is unknown,
this can be checked by determining if the number of successes and failures in the sample are both at
least 5.
The margin of error is found in a similar way to the margin of error for the mean. That is, it is the critical
value × the standard error. As we are assuming that the sampling distribution is approximately normal, we
will use the standard normal distribution to find the critical value. Since the variable being studied satisfies
the conditions of the binomial distribution, we know from Chapter 6 that the standard error of the sampling
distribution is √(π(1−π)/n). As we don't know π, since that is what we are trying to estimate, we estimate
π in the formula with the sample proportion p̂. This results in the estimate of the standard error being
√(p̂(1−p̂)/n).
If these conditions are met, then the formula for the margin of error is:

E = z_{α/2} × √(p̂(1−p̂)/n)    (3.8)
Example: Cell phones
Suppose that a market research firm is hired to estimate the percent of adults living in Vancouver who
have cell phones. Five hundred randomly selected adult residents of Vancouver are surveyed to determine
whether they have cell phones. Of the 500 people sampled, 421 responded yes; they own cell phones.
1. Using a 92% confidence level, compute a confidence interval estimate for the true proportion of adult
residents of this city who have cell phones.
2. Would it be appropriate to say that 85% of residents in Vancouver have a cell phone?
3. What does the confidence level tell us in the context of the question?
Solutions:
1. We can use the standard normal model for proportions to construct our confidence interval as the
variable (cell phone ownership) follows a binomial distribution (1: The variable is random (random
sample); 2: The outcomes are being counted (number of people who have cell phones); 3: There is a
fixed number of trials (500); 4: There are two possible outcomes (have a cell phone or don't have a cell
phone); 5: Though π is unknown, it is fair to assume that the proportion of people who have a cell
phone on a given day in Vancouver is very stable) and the sampling distribution for proportions is
normal as the number of successes is 421 and the number of failures is 79 (i.e. they are both greater
than 5). Use a computer program to construct the confidence interval. Input x as 421 (this may be in
the same place as the sample proportion, but when you input the whole number it will switch to x),
the sample size as 500, and the confidence level as 92%. Notice that you don't have to state whether it
is z or t as there is only one model for this situation. This gives the following output:
From this, we can see that the confidence interval for the proportion is 0.813 to 0.871.
2. To interpret the confidence interval, we would say that we are 92% confident that the proportion of
residents of Vancouver who own a cell phone is somewhere between 81.3% and 87.1%.
3. Since 85% is contained in the confidence interval, it is appropriate to say that the proportion of residents
in Vancouver who have a cell phone is 85%.
4. The confidence level means that if we took many random samples of Vancouver residents of size 500 and
constructed a confidence interval for each of these random samples, then 92% of these confidence
intervals would contain the population proportion of cell phone users, while 8% would not.
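A sketch of the computer program's calculation in Python with scipy.stats:

    from math import sqrt
    from scipy.stats import norm

    n, x, conf = 500, 421, 0.92
    p_hat = x / n                            # 0.842
    z = norm.ppf(1 - (1 - conf) / 2)         # critical value, ~1.75
    E = z * sqrt(p_hat * (1 - p_hat) / n)    # margin of error, ~0.029
    print(p_hat - E, p_hat + E)              # ~ (0.813, 0.871)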
A couple of notes about the confidence interval:
• The margin of error is 0.029 or 2.9%. The margin of error for a confidence interval for proportions has
to be less than 1 (or 100%). If the sample size is large enough, the margin of error should be quite small
(less than 10%).
• Since proportions can only range from 0 to 1 or 0% to 100%, the confidence interval can never exceed
these values. For example, if the sample proportion is 92% and the margin of error is 10%, then the
confidence interval would be 82% to 102%, but since the upper bound is impossible, we would truncate
the answer to 82% to 100%.
Just like with the mean, we want to determine an appropriate sample size to achieve a maximum amount
of error in our estimate for the population proportion.
To find the formula for n, we again solve for n in the formula for the margin of error. This results in the
following formula:

n = z_{α/2}² · p̂(1−p̂)/E²    (3.8)

To use this formula we need to know the margin of error, the confidence level, and the sample proportion.
Note: If no estimate for π exists, then use p̂ = 0.5.
Exercise 3.1.5.1 (Solution on p. 251.)
The Western Canada Communications Company is considering a bid to provide long-distance
phone service. You are asked to conduct a poll to estimate the percentage of consumers who are
satisfied with their current long-distance phone service. You want to be 90% confident that your
sample percentage is within 2.5 percentage points of the true population value, and a Roper poll
suggests that this percentage should be about 85%. How large must your sample be?
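A sketch of this exercise's mechanics in Python (the answer itself is on p. 251):

    import math
    from scipy.stats import norm

    z = norm.ppf(0.95)              # 90% confidence, ~1.645
    p, E = 0.85, 0.025
    n = z**2 * p * (1 - p) / E**2
    print(math.ceil(n))             # ~552 (553 if z is rounded to 1.645)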
In Chapter 7, you learned how to construct an estimate of a population parameter, such as a mean or
proportion, from a sample statistic. In this chapter we examine a related concept: investigating whether the
value of a population parameter has changed from what has previously been claimed or believed. Again, we
use the sample data for this investigation.
For example, it is commonly stated that adults should get 8 hours of sleep per night. Many of us may
suspect that the real average is lower. In conducting an investigation, since we don't yet have evidence to
the contrary, we will treat the mean of 8 as the prevailing claim. In other words, we must assume the true
population mean is 8 unless we can prove otherwise. In our attempt to find proof against the prevailing
claim, we would need to gather sample evidence.
Let's say that after gathering a large random sample (say, n = 50), you discover that the sample mean
number of hours slept per night is only 7.5. So is a sample mean of 7.5 hours proof that the true population
mean is not 8 hours, as claimed, but actually less? On the surface, it would appear so. However, recall from
Chapter 7 that every sample mean will be different from the true population mean. Some sample means
will be a little different and others will be very different.
Also, recall that all possible sample means taken from a population, plotted on a distribution, form a
sampling distribution of sample means. The mean or middle of this distribution will be the true population
mean, which at present we are assuming to be 8. And if 8 really is the true population mean, then most
sample means would be expected to be very close to 8, but some (those means near the tails of the
distribution) could be much lower or much higher than 8. The figure below shows a normal curve with a
mean of 8 and a standard error of 0.20. As the curve expands towards the tails, the number of observations
we would expect to see gets smaller and smaller. In other words, sample means that come from far out in
the tails of the distribution are considered rare or unlikely occurrences. So for this example, the question is
whether 7.5 is so far out into one of the tails that it would be considered an unlikely observation under the
assumption that the middle of this curve is actually 8.
To measure how far into the tail our sample mean of 7.5 is, we must use a familiar measuring tool called
a Z score (or a T score for smaller samples). Since we are assuming the mean or middle of our sampling
distribution is 8 (remember that 8 is our prevailing claim), we need to measure the number of Z scores
our sample mean of 7.5 is from 8. Recall from Chapter 6 that a variable's Z score is simply the number of
standard deviations the variable lies from the middle of the normal curve. Also recall that over 95% of a
normally shaped distribution will fall within two Z scores (two standard deviations) of the middle and over
99% will fall within three Z scores.
In hypothesis testing, any value falling more than two standard deviations from the middle would be
considered unlikely (less than 5% of all possible sample means will fall more than two standard deviations
from the middle). Any value falling more than three standard deviations from the middle would be
considered very unlikely (less than 0.5% of all possible sample means will fall more than three standard
deviations from the middle). If the standard error for our example is 0.2, then our sample has a Z score of
-2.5 ((7.5 − 8)/0.2). That is, our sample mean of 7.5 lies 2.5 standard deviations to the left of our
hypothesized population mean, well out into the left tail of the curve. So, it does appear that our sample
mean can be considered an unlikely occurrence. The conclusion then must be that if the true population
mean is actually 8, it would be unlikely for us to obtain a sample mean as small as 7.5. But since we did
obtain such a mean, we must therefore conclude the true population mean is less than 8.
Figure 3.9
Without getting into further technicalities at this point, we have shown that a hypothesis test seeks to
measure whether the sample evidence can be considered unlikely under the assumption that the prevailing
claim is true. If our answer is 'yes', then we have good reason to reject the prevailing claim. If our answer
is 'no', then we must let the prevailing claim stand, at least until stronger evidence against it is found.
In the next section we will break down the various steps in a hypothesis test.
When you perform a hypothesis test of a single population mean µ using a normal distribution (often
called a z-test), you take a large random sample from the population. When working with large samples,
you should recall from Chapter 6 that the Central Limit Theorem says that the sampling distribution of
means will be approximately normal even if the population from whence the sample came is not. For this
reason we can perform hypothesis tests using large samples and the normal distribution regardless of the
shape of the parent population.
Many statisticians prefer to use a t-distribution if the population standard deviation is unknown, even if
the samples are large. The reasoning behind this is that using the sample standard deviation in place of the
unknown population standard deviation adds an extra degree of potential error that can only be accounted
for by using a t-distribution. However, as noted in the previous chapter, it is common practice to use the
normal (z-based) sampling distribution when working with large samples. Specifically, when n > 40, we
will use the standard normal (z-based) distribution to conduct a hypothesis test.
When working with small samples, we will perform a hypothesis test of a single population mean µ
using a Student's t-distribution (often called a t-test). There are fundamental assumptions that need to
be met in order for the test to be considered valid. Most importantly, since the Central Limit Theorem
does not apply to small samples, we have no guarantee that the sampling distribution will be normally
shaped. For this reason, we can only perform means tests with small samples when we know the population
is normally distributed.
note: Please see Figure 6 in the previous chapter for further insight into how to determine which
sampling distribution is appropriate when conducting a hypothesis test of a population mean.
When you perform a hypothesis test of a single population proportion p, you take a random sample
from the population. You must meet the conditions for a binomial distribution, which are: there are a
certain number n of independent trials, the outcomes of any trial are success or failure, and each trial has
the same probability of a success p. The Central Limit Theorem says the shape of the binomial distribution
will approximate the shape of the normal distribution if the sample is sufficiently large. To ensure this, the
quantities np and nq must both be greater than five (np > 5 and nq > 5). Then the distribution of a
sample (estimated) proportion can be approximated by the normal distribution with µ = p and σ = √(pq/n).
Remember that q = 1 − p.
Going back to the standardizing formula, we can derive the test statistic for testing hypotheses concerning
means. We have already worked with the formula below when introduced to sampling distributions in
Chapter 6. You should, however, notice one small difference. When we perform hypothesis tests, we don't
know the population mean; we simply have a claim or belief about the mean, which may or may not be true.
Because the mean is hypothesized rather than known, we use a slightly different symbol in the equation, µ₀,
as seen below.

Z_c = (x̄ − µ₀)/(σ/√n)    (3.9)
This calculated Z is nothing more than the number of standard deviations that the sample mean is from
the hypothesized population mean. If the sample mean falls "too many" standard deviations from the
hypothesized mean, we conclude that the sample mean is unlikely to have come from a distribution centred
around the hypothesized mean.
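A sketch of this calculation for the sleep example above, in Python with scipy.stats:

    from scipy.stats import norm

    z_c = (7.5 - 8) / 0.2    # sample mean 7.5, claimed mean 8, standard error 0.2
    print(z_c)               # -2.5 standard errors below the claim
    print(norm.cdf(z_c))     # probability of a mean this small, ~0.0062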
So how do we know if a sample mean can be considered to have fallen "too many" standard deviations
away from a hypothesized mean? Obviously, we can't simply make this decision arbitrarily. Thankfully, we
have already been introduced this concept when we examined condence intervals in the previous chapter.
Just as we predetermine our level of condence before we compute an estimate of a population parameter, so
too must we predetermine how strong we need our sample evidence to be (i.e. how many standard deviations
away from the hypothesized population parameter it must lie) before we would be condent in rejecting the
null hypothesis. This predetermined level in hypothesis testing is called the level of signicance, and it is
simply 1- the level of condence. The level of signicance is denoted as alpha (α).
This level of significance delineates a set number of standard deviations between evidence that would be
considered unlikely and evidence that would be considered not unlikely under the assumption that the null
hypothesis is true. By way of example, say we set our level of significance at 5%. The corresponding Z-score
for a 5% level of significance is 1.645. This means that if our sample mean falls more than 1.645 standard
deviations away from the hypothesized middle of the distribution (i.e. the null hypothesis), we can conclude
the sample evidence is strong enough to be considered an unlikely event and we can therefore reject the null
hypothesis.
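The 1.645 figure is simply the Z-score that cuts off the upper 5% of the standard normal curve. A one-line check (Python, assuming the scipy library is available):

from scipy.stats import norm
print(norm.ppf(0.95))   # 1.6448..., the Z-score that cuts off the upper 5% tail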
Before proceeding further, it's worth reviewing this notion of a significance level from another perspective.
The significance level can be thought of as the allowable amount of error in our test. Just as a 95% confidence
level will produce an incorrect estimate 5% of the time, so will our hypothesis test with a level of significance
set at 5% produce an incorrect conclusion 5% of the time, at least theoretically. When we set the significance
level at, say, 5%, we are essentially saying that on our sampling distribution any sample mean that falls into
the top (or bottom) 5% of the tail would be considered strong evidence against the null hypothesis. This
does not mean the evidence is perfect, however. There is certainly the possibility that a sample mean that
falls into the top (or bottom) 5% of the tail could have come from a population in which the null hypothesis
is true. Indeed, that possibility is actually 5%. But 5% is a pretty small number, which is why we would say
the observance of such a sample mean must be considered an unlikely, but not impossible, event.
When the samples are small and we don't know the population standard deviation, we must use a Student's
t-distribution rather than a Z distribution to perform our tests. The new standardizing formula below will
be used to compute how many standard deviations our sample mean falls from the hypothesized middle of
the t-distribution.
tc = (x̄ − µ0) / (s/√n)    (3.10)
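As a quick illustration, here is a minimal sketch (Python with scipy; the claim and sample values are hypothetical, chosen only to exercise the formula) that computes the t test statistic and its one-tailed p-value:

import math
from scipy.stats import t

def t_test_statistic(x_bar, mu0, s, n):
    # Number of estimated standard errors the sample mean lies from mu0
    return (x_bar - mu0) / (s / math.sqrt(n))

# Hypothetical values: claim mu0 = 10, sample of n = 15 with mean 9.2, s = 1.8
tc = t_test_statistic(9.2, 10, 1.8, 15)
p_value = t.cdf(tc, df=15 - 1)   # lower-tail p-value for Ha: mu < 10
print(tc, p_value)               # about -1.721 and 0.054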
When conducting a hypothesis test on a proportion, we can use a Z-based test so long as the sample is
sufficiently large. A sample is considered large if np̂ and n(1 − p̂) both exceed 5. Even though we will perform
a Z-based test, because we are working with proportions the standardizing formula is quite different. In
the numerator, the hypothesized proportion is subtracted from the observed sample proportion. In the
denominator, the standard error is calculated by first multiplying the hypothesized proportion by 1 minus the
hypothesized proportion; then dividing the result by the sample size n; and finally taking the square root of
that result.
Z* = (p̂ − π0) / √(π0(1 − π0)/n)
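The same calculation in code (Python; the counts are hypothetical numbers chosen so the arithmetic is easy to follow):

import math

def z_star_proportion(p_hat, pi0, n):
    # Test statistic for a one-proportion z-test
    se = math.sqrt(pi0 * (1 - pi0) / n)   # standard error under H0
    return (p_hat - pi0) / se

# Hypothetical: 66 successes in a sample of 150 against a hypothesized 0.40
print(z_star_proportion(66 / 150, 0.40, 150))   # 1.0 (up to floating-point rounding)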
In order for a hypothesis test's results to be generalized to a population, certain requirements must be
satisfied.
When testing for a single population mean:
1. A Student's t-test should be used if the data come from a small, random sample and the population is
approximately normally distributed.
2. The normal z-test can be used if the data come from a large, random sample. The population does
not need to be normally distributed.
When testing a single population proportion, use a normal test for a single population proportion if the data
come from a random sample, fit the requirements for a binomial distribution, and the mean number of
successes and the mean number of failures satisfy the conditions np > 5 and nq > 5, where n is the sample
size, p is the probability of a success, and q is the probability of a failure.
3.2.2.6
3.2.2.7 Homework
Ha : The alternative hypothesis: This is the claim about the population that the researcher is trying
to show, and it is contradictory to H0 . It is what we conclude is likely to be true if our sample evidence
suggests that H0 is no longer valid. The alternative hypothesis says that something is different, that things
have changed. It must be supported by significant evidence to overthrow the assumption.
Since the null and alternative hypotheses are contradictory, you must examine evidence to decide if you
have enough evidence to reject the null hypothesis or not. Since we rarely have access to population data,
we must take our evidence from sample data.
Later we will discuss in more detail how to determine if the sample evidence can be considered strong
enough to support the alternative hypothesis. Once you have examined the sample evidence, you can
determine if it supports the alternative hypothesis or not and make your final decision. There are two
options for this decision: "reject H0" if the sample information favours the alternative hypothesis, or "fail
to reject H0" (also phrased "decline to reject H0") if the sample information is insufficient to reject the null
hypothesis. These conclusions are all based upon a level of significance that is set by the analyst.
Table 3.2 presents the various hypotheses in the relevant pairs. For example, if the null hypothesis is
equal to some value, the alternative has to be not equal to that value.
H0                              Ha
equal (=)                       not equal (≠)
greater than or equal to (≥)    less than (<)
less than or equal to (≤)       more than (>)
Table 3.2
note: As a mathematical convention, H0 always has a symbol with an equal sign in it. Ha never has a
symbol with an equal sign in it. The choice of symbol depends on the wording of the hypothesis test.
Example 3.2
H0 : The average amount of sleep adult Canadians get per night is greater than or equal to 8 hours.
Ha : The average amount of sleep adult Canadians get per night is less than 8 hours.
H0 : µ≥8
Ha : µ<8
Example 3.3
We want to test whether the mean GPA of students in Canadian universities is different from 2.0
(out of 4.0). The null and alternative hypotheses are:
H0 : µ = 2.0
Ha : µ ≠ 2.0
Example 3.4
We want to test if university students take more than four years to graduate from university, on
the average. The null and alternative hypotheses are:
H0 : µ ≤ 4
Ha : µ > 4
Example 3.5
We want to test if the proportion of Liberal supporters has dropped since the election.
Ho : The proportion of Liberal supporters is greater than or equal to 0.40
Ha : The proportion of Liberal supporters is less than 0.40.
H0 : π ≥ 0.40
Ha : π < 0.40
In a hypothesis test, sample data is evaluated in order to arrive at a decision about some type of claim
about a population parameter, such as the mean or proportion. If the sample provides strong evidence to
the contrary of the original claim, then the claim can be rejected in favour of the new claim. In a hypothesis
test, we:
1. Evaluate the null hypothesis, typically denoted with H0 . The null is not rejected unless the hypothesis
test shows otherwise. The null statement must always contain some form of equality (=, ≤ or ≥).
2. Always write the alternative hypothesis, typically denoted with Ha or H1 , using not equal, less than
or greater than symbols, i.e., (≠, <, or >).
3. If we reject the null hypothesis, then we can assume there is enough evidence to support the alternative
hypothesis.
4. Never state that a claim under the null hypothesis is proven true or false. Keep in mind the underlying
fact that hypothesis testing is based on probability laws; therefore, we can talk only in terms of non-
absolute certainties.
3.2.3.2
3.2.3.3 Homework
a. π = 0.20
b. π > 0.20
c. π < 0.20
d. π ≤ 0.20
that the doctor diagnosed your ulcer by taking in only a few pieces of evidence: talking to you and pressing
on your stomach. In other words, she was willing to reject the null hypothesis on relatively weak evidence.
Why? Because she knew that the prescription might help, and even if it didn't it would do you little harm.
And since it didn't help you after a month, she can now rule out an ulcer and focus on other, possibly less
benign, conditions. Keep in mind that if she had set alpha low, she likely would not have misdiagnosed you,
but she would also have sought much stronger evidence (possibly even invasive exploratory surgery) before
being willing to reject the null hypothesis. Obviously, in this case it made much more sense to risk a Type
II error and treat you for a condition that you don't actually have.
Summary
When you perform a hypothesis test, there are actually four possible outcomes depending on the actual truth
(or falseness) of the null hypothesis H0 and the decision to reject or not. The outcomes are summarized in
the following table:
                     H0 is actually true    H0 is actually false
Do not reject H0     Correct outcome        Type II error
Reject H0            Type I error           Correct outcome
Table 3.3
note: Determine both Type I and Type II errors for the following scenario:
Assume a null hypothesis, H0 , that states the percentage of adults with jobs is at least 88%.
Exercise 3.2.4.4 (Solution on p. 253.)
Identify the Type I and Type II errors from these four statements.
a. Not to reject the null hypothesis that the percentage of adults who have jobs is at
least 88% when that percentage is actually less than 88%.
b. Not to reject the null hypothesis that the percentage of adults who have jobs is at
least 88% when the percentage is actually at least 88%.
c. Reject the null hypothesis that the percentage of adults who have jobs is at least 88%
when the percentage is actually at least 88%.
d. Reject the null hypothesis that the percentage of adults who have jobs is at least 88%
when that percentage is actually less than 88%.
In every hypothesis test, the outcomes are dependent on a correct interpretation of the data. Incorrect
calculations or misunderstood summary statistics can yield errors that affect the results. A Type I error
occurs when a true null hypothesis is rejected. A Type II error occurs when a false null hypothesis is not
rejected.
The probabilities of these errors are denoted by the Greek letters α and β , for a Type I and a Type II
error respectively.
3.2.4.2
3.2.4.3 Homework
Exercise 3.2.4.9
For statements a-j in Exercise 9.109 (Exercise 3.2.4.8), answer the following in complete sentences.
Once you have set out your null and alternative hypotheses, you need to determine how strong your sample
evidence must be before you would be willing to reject the null hypothesis in favour of the alternative
hypothesis. The required strength of evidence is defined by the level of significance (α).
Once your level of significance has been set, you can then examine your sample evidence to determine its
strength, as measured by its p-value. This process will be discussed below.
Typical values for alpha range from 1% to 10% and will vary depending on a number of factors, including
conventions set by a particular industry or discipline and the relative risks of a Type I versus a Type II error,
as discussed in the previous section. In many cases, the choice of alpha may be left up to the analyst.
Unfortunately, without a peer review process, some analysts may be tempted to set alpha in a way that will
support their desired conclusion.
For example, if a pharmaceutical company stands to make millions of dollars on a new drug, it obviously
has a vested interest in offering proof that the drug is effective. The null hypothesis is that the drug is
not effective; the alternative is that it is. But what if the proof, as discovered by several rounds of
double-blind tests, turns out to be rather weak? This would normally lead the researcher to decide not to
reject the null hypothesis and conclude that the sample evidence is insufficiently strong for the drug to be
considered a success. If this were the conclusion, the drug should not be approved as an effective treatment.
But a company with millions already invested in the drug may be strongly determined to see it to market,
in spite of the test results. An unethical approach might be to simply move the goal posts to make it easier
to reject the null hypothesis (i.e. to make the proof look stronger than it is).
These goal posts, of course, are defined by the level of significance. In much scientific testing, the level
of significance is typically set at 1%, which means the sample evidence must be very strong before a null
hypothesis can be rejected. In this case, moving the goal posts could mean setting the level of significance
as high as 10%. This higher level of significance, as we shall see below, allows for weaker evidence to be used
in support of an alternative hypothesis.
In Figure 3.10 below, alpha has been set at 1%. As you can see, the sample evidence fails to cross over the
goal posts set by alpha and we would thus fail to reject the null hypothesis. The sample evidence
is not strong enough.
Figure 3.10
In the following figure, we have moved the goal posts by setting alpha at 10%, making it easier to reject
the null hypothesis. As you can see, the sample evidence now is strong enough to lead us to reject the null
hypothesis. Of course, in truth the evidence has not changed, but in the first instance we fail to reject the
null and in the second we do reject the null.
Figure 3.11
Thankfully, at least when it comes to pharmaceutical testing, there are objective, government-regulated
standards that cannot be easily manipulated by vested interests. However, there are instances where the
researcher is in control of choosing the level of significance. When this is the case, the choice should be made
ethically and with an honest consideration of the implications of Type I and Type II errors.
As a final note, the level of significance should never be chosen after the sample evidence has been
measured. This would be akin to allowing the home team to determine where the goal posts are after the
game has already begun.
Once the level of significance has been set, you can look more closely at the sample evidence to determine how
strong it is. As discussed earlier, this evidence is first measured by determining how far away your sample
mean or proportion is from the hypothesized mean or proportion. The measuring stick we use is called a
Z-score or a t-score, which is simply the number of standard deviations our sample mean or proportion lies
from the hypothesized middle of the sampling distribution.
Recall from earlier in this chapter the example we looked at regarding sleep habits. We hypothesized
that the mean number of hours adults sleep per night is 8. We then gathered sample evidence from 50
adults, where the sample mean was 7.5 and the standard deviation was 1.4 hours. The sampling distribution
for this scenario would then have a hypothesized middle of 8 and a standard error of 0.20 (i.e. 1.4/√50).
Does a sample mean of 7.5 provide sufficient proof that the true population mean is less than 8? To
investigate, we must first determine our level of significance. For now, we will use the default of 5%. This
means that if our sample mean falls into the lower 5% of the tail, it will be considered strong evidence against
the null hypothesis. We can now measure how many standard deviations (Z-scores, since we are working
with a large sample) 7.5 is from 8. This measurement is often called the test statistic. You may see it written
as Z* or t*.
Using our standardizing formula, we get Z* = (7.5 − 8.0)/(1.4/√50). The resulting Z-score is −2.5 (rounded
to one decimal). Based on the empirical rule, we know that any value with a Z-score of 2.5 (as an absolute
value) would fall well out into the lower or upper 5% of the tail and would thus be considered an unlikely
observation. That is, very few sample means taken from a population with a mean of 8 would have such an
extreme Z-score.
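Replicating this calculation in code (Python with scipy; the numbers are those of the sleep example above):

import math
from scipy.stats import norm

z_star = (7.5 - 8.0) / (1.4 / math.sqrt(50))   # test statistic
p_value = norm.cdf(z_star)                     # lower-tail p-value for Ha: mu < 8
print(round(z_star, 2), round(p_value, 4))     # -2.53  0.0058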
Our decision, in this case, would be to reject the null hypothesis (that the mean number of hours adults
sleep is 8) in favour of the alternative hypothesis (that the mean number of hours adults sleep is less than
8). Keep in mind we have not proven that adults sleep only 7.5 hours; this is never what we sought to prove. We only
sought to prove that they sleep less than 8 hours. Our sample mean of 7.5 is our evidence against the null
hypothesis. As it turned out, the empirical rule helped us conclude that a sample mean of 7.5 would be a very
unlikely finding if the true population mean were actually 8, which is why we rejected the null hypothesis.
While using Z-scores and t-scores can lead us to a correct decision, a more common and precise measuring
tool is preferred, called a p-value. To find the p-value of a sample mean or proportion, we simply need to
convert the test statistic into a probability. Specifically, the p-value seen below is the probability of getting
a sample mean of 7.5 or less from a population whose true mean is 8. As you can see, the resulting p-value
is extremely small, meaning that such an outcome would be extremely unlikely (well under a probability of
5%) to occur if the true mean is 8.
Figure 3.12
Be careful! The p-value is not the probability that the null hypothesis is true. It is the probability that
our sample mean could have come from a population in which the null hypothesis is true. And since this
probability is so small, we must conclude the null hypothesis is not true. In other words, our sample mean
is what is considered an unlikely event.
As a final example, suppose Didi and Ali are at a birthday party of a very wealthy friend. They hurry to be
first in line to grab a prize from a tall basket that they cannot see inside because they will be blindfolded.
There are 200 plastic bubbles in the basket and Didi and Ali have been told that there is only one with a
$100 bill. Didi is the first person to reach into the basket and pull out a bubble. Her bubble contains a $100
bill. The probability of this happening is 1/200 = 0.005.
In statistical language, 0.005 is akin to a p-value. Because this occurrence was unlikely to have happened
if there truly is only one $100 bill in the basket, Ali can conclude that what the two of them were told was
wrong and there are actually more $100 bills in the basket. A "rare event" has occurred (Didi getting the
$100 bill), so Ali doubts the assumption about only one $100 bill being in the basket.
Once you have determined the p-value associated with a sample mean or proportion, the next step is to
compare that p-value to the original level of significance.
When you make a decision to reject or not reject H0, do as follows:
If p-value < α, reject H0. The evidence provided by the sample data is significant. There is sufficient
evidence to conclude that H0 is an incorrect belief and that the alternative hypothesis, Ha, may be
correct.
If p-value ≥ α, do not reject H0. The evidence provided by the sample data is not significant. There is
not sufficient evidence to conclude that the alternative hypothesis, Ha, may be correct.
When you "do not reject H0", it does not mean that you have proven that H0 is true. It simply means
that the sample data have failed to provide sufficient evidence to cast serious doubt about the truthfulness
of H0.
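The decision rule itself is a single comparison. A tiny helper (Python) makes the logic explicit, using the values illustrated in the figure below:

def decision(p_value, alpha):
    # Compare the p-value to the level of significance
    if p_value < alpha:
        return "Reject H0: the sample evidence is significant."
    return "Do not reject H0: the evidence is not significant."

print(decision(0.006, 0.05))   # Reject H0: the sample evidence is significant.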
The figure below illustrates a p-value of 0.006 and a chosen level of significance of 0.05. As you can see,
the p-value is much smaller than alpha (further out into the tail), which indicates strong evidence against
the null hypothesis.
Figure 3.13
Conclusion: After you make your decision, write a thoughtful conclusion about the hypotheses in
terms of the given problem, making specific reference to the context. The example below should serve as a
summary and a guide for conducting a full eight-step hypothesis test on a population mean or proportion.
Example 1
Suppose Irene, who owns a top bakery in the city, claims that she has the best bread in the city by any
measure. Not only is her bread the tastiest, it is also the fluffiest and the tallest, averaging 15 cm in height.
Another baker, Jose, wishes to challenge Irene's claim that her bread is the tallest. As evidence, he will
provide a sample of 40 randomly selected loaves of his bread and have their heights measured in his attempt to
prove that his bread heights actually exceed 15 cm, on average. In doing so, he obtains a sample mean bread
height of 15.5 cm. He also knows from baking thousands of loaves that his variation is very low: specifically,
the standard deviation is 0.9 cm.
Step One
The null and alternative hypotheses are as follows:
Ho: µ = 15
Ha: µ > 15
Step Two
The sample evidence is as follows: sample mean = 15.5; population standard deviation = 0.9; sample
size = 40.
Step Three
The test considerations are as follows: We are using a large sample (n > 30) to conduct a means test. This
will require a sampling distribution of the mean, which the central limit theorem says will be approximately
normally shaped since our sample size exceeds 30. We will therefore do a Z-based test.
Step Four
The required strength of evidence can now be determined by considering the implications of a Type I
vs. a Type II error. In this context, Jose will make a Type I error if he concludes that his bread heights
average more than 15 cm when in fact they do not. He will make a Type II error if he concludes that his
bread heights do not average more than 15 cm when in fact they do. Which error is worse will depend on
where you are standing. Jose would consider a Type II error worse, whilst Irene would consider a Type I
error worse. To be fair, we will choose a level of significance of 5%, which is generally considered a good
balance between the two types of errors.
Reject Ho if p-value < 0.05
Step Five
Compute the test statistic as follows:
Z* = (15.5 − 15)/(0.9/√40) = 3.51
p-value = 0.0002
NOTE: The Excel function for computing a p-value is as follows:
=1-NORM.S.DIST(3.51,1)
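The same value can be obtained in Python (assuming the scipy library is available; Excel's NORM.S.DIST(z, 1) corresponds to norm.cdf(z)):

from scipy.stats import norm
p_value = 1 - norm.cdf(3.51)   # upper-tail area beyond Z* = 3.51
print(round(p_value, 4))       # 0.0002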
Step Six
Interpret the p-value in the context of the problem. Under the assumption that Jose's bread is no taller
than Irene's (that is, his bread averages only 15 cm), the probability of obtaining a sample of 40 with a mean
of 15.5 cm (or more) is only 0.0002 or 0.02%, which makes it a very unlikely event.
Step Seven
Make a decision by comparing your p-value to your level of signicance.
Since the p-value (0.0002) < 0.05, we can reject the null hypothesis.
Step Eight
Offer a final conclusion in sentence form: Therefore, we can conclude that Jose's bread averages more
than 15 cm and is indeed taller than Irene's.
Exercise 3.2.5.1: Practice Question One (Solution on p. 254.)
An auditing firm is looking at the travel expense claims for a large book retailer. The retailer's
books suggest that their average (µ) travel expenses were $1200 per person per year. A sample
of 64 random expense claims revealed an average of $1300. The population σ, based on an earlier
comprehensive audit, is $400. The sample suggests the books have understated the expense
claims. Identify the null and alternative hypotheses.
These questions were derived from Lyryx Learning, Business Statistics I MGMT 2262 Mt Royal University
Version 2016 Revision A. OpenStax CNX. Sep 8, 2016 http://cnx.org/contents/f3aefa9e-58d2-41ea-969f-
[email protected].
important: If a question has a set of data, please see the course site for the Excel file.
1. Question 1: The Specific Absorption Rate (SAR) for a cell phone measures the amount of radio
frequency (RF) energy absorbed by the user's body when using the handset. Every cell phone emits
RF energy. Different phone models have different SAR measures. To receive certification from the
Federal Communications Commission (FCC) for sale in the United States, the SAR level for a cell
phone must be no more than 1.6 watts per kilogram. Table 3.4 shows the highest SAR level for a
random selection of cell phone models as measured by the FCC. A recent study (entirely made up
for this question) has shown that if a cell phone's SAR level exceeds 0.9 watts per kilogram, there is
an increased chance of brain tumours for those that use this phone. An advocacy group wants to use
this new study to petition the FCC to change their regulations around the current allowable SAR levels.
Table 3.4
a. What is the variable being studied? Categorize it. Based on this, what descriptive statistic (mean
or proportion) is best for this situation?
b. Is it appropriate to assume that the sampling distribution is normal? Explain your reasoning
and provide evidence for your choice. Regardless of your answer in b), assume that the sampling
distribution is normal for the remaining questions.
c. The advocacy group will go forward with their petition if they can show that, on average, cell
phones have SAR rates that exceed 0.9 watts per kg. This advocacy group is run by an admin-
istrator who is very risk averse (meaning they will only go forward with the petition if there is a
lot of evidence). Determine whether the advocacy group should go forward with their petition by
performing an appropriate eight-step hypothesis test.
d. Find a confidence interval for the true (population) mean of the Specific Absorption Rates (SARs)
for cell phones. Choose a confidence level that complements the level of significance you have
chosen above.
e. Interpret the confidence interval in the context of the question.
f. Does the confidence interval suggest that the mean SAR exceeds 0.9? Compare your answer with
what you got for the hypothesis test. Do the confidence interval and hypothesis test support each
other? Explain your answer.
2. Question 2: A hospital is trying to cut down on emergency room wait times. In the past, they have
found that the average wait time is 1.4 hours for patients to be called back to be examined. They have
implemented a new triage protocol and are interested in seeing if it has changed the amount of time
patients must wait before being called back to be examined. An investigation committee randomly
surveyed 70 patients. The sample mean wait time was 1.5 hours with a sample standard deviation of
0.5 hours.
a. What is the variable being studied? Categorize it. Based on this, what descriptive statistic (mean
or proportion) is best for this situation?
b. Use an appropriate eight-step hypothesis test to determine if the average wait time for patients to be
called back to be examined has changed from 1.4 hours. Use a level of significance of 10%.
c. Is there a level of significance that causes you to change your decision?
d. Suppose the true population mean wait time is 1.4 hours; have you made an error in b)? If so,
what type?
e. Construct a 90% confidence interval for the population mean emergency room wait times.
f. Interpret the confidence interval in the context of the question.
g. If the investigation committee wants to increase its level of confidence and keep the margin of
error the same by taking another survey, what changes should it make?
h. If the investigation committee did another survey, kept the margin of error the same, and surveyed
200 people instead of 70, how would the level of confidence have to change? Why?
i. Suppose the investigation committee wanted their estimate of the population mean emergency room
wait times to be within 0.05 hours of the true mean. How many patients would they need to
interview?
3. Question 3: Twenty-five Americans were surveyed to determine the number of hours they spend
watching television each month. The results were as follows:
Table 3.5
Assume that the underlying population distribution is normal and the population standard deviation
is known to be 32 hours.
a. What is the variable being studied? Categorize it. Based on this, what descriptive statistic (mean
or proportion) is best for this situation?
b. The U.S. government has recently released a recommendation that Americans watch less than 150
hours of television per month. Based on this sample, is there enough evidence to suggest that,
on average, Americans are meeting this recommendation? Base your answer on an appropriate
eight-step hypothesis test. Use α = 5%.
c. Construct a 99% confidence interval for the population mean hours spent watching television per
month.
d. Interpret the confidence interval in the context of the question.
e. Explain what the confidence level means in the context of the question.
4. Question 4: The standard deviation of the weights of newborn elephants is known to be approximately
15 pounds. We wish to construct a 95% confidence interval for the mean weight of newborn elephant
calves. Fifty newborn elephants are weighed. The sample mean is 244 pounds. The sample standard
deviation is 11 pounds.
a. What model will you use to construct a confidence interval for the population mean? Explain
your reasoning by referring to the criteria for that model.
b. Construct a 95% confidence interval for the population mean weight of newborn elephants.
c. What will happen to the confidence interval obtained if 500 newborn elephants are weighed
instead of 50? Why?
d. Based on the confidence interval, is it fair to say that the average weight of a newborn elephant
exceeds 235 pounds? Explain your answer.
e. Does an appropriate hypothesis test support your decision in d)? Explain your answer by doing
the eight-step hypothesis test.
5. Question 5: A news magazine is investigating the changing dynamics in marriages. Historically, men
made many of the financial decisions, including the decision on whether to make major household
purchases (such as buying a new vehicle or doing a renovation), while women were left out of them.
To investigate whether this has changed, the magazine is considering doing a study to find out the
percentage of couples who are equally involved in making decisions about household purchases.
a. What is the variable being studied? Categorize it. Based on this, what descriptive statistic (mean
or proportion) is best for this situation?
b. When designing a study to determine this population proportion, what is the minimum number
you would need to survey to be 90% confident that the population proportion is estimated to
within 0.05?
c. If it were later determined that it was important to be more than 90% confident, how would it
affect the minimum number you need to survey? Why? Do not do any calculations. Suppose the
marketing company did do the survey. They randomly surveyed 200 households and found that
in 114 of them, the couple makes major household purchasing decisions together. A similar study
from the 1980s found that 46.5% of couples made major household purchasing decisions together.
d. Conduct an eight-step hypothesis test to determine whether there has been a significant increase
in the number of couples who make major household purchasing decisions together since the 1980s.
The editor of the magazine will only publish the article if there is ample evidence to support the
claim.
e. Construct a 95% confidence interval for the population proportion of couples who make major
household purchasing decisions together.
f. Interpret the confidence interval in the context of the question.
g. If the rate has increased, use the confidence interval to determine by how much the rate has
increased since the 1980s.
h. List two difficulties the company might have in obtaining random results, if this survey were done
by email.
6. Question 6: Suppose that an accounting firm has developed new software to help their clients do
their taxes more quickly. Based on a national survey, most people spend 24.4 hours a year completing
their personal income taxes. The accounting firm has a random sample of 100 of their clients
complete their 2016 income tax return using the new software. The sample mean time to complete the
tax returns is 23.6 hours with a standard deviation of 7.0 hours. The firm doesn't want to release the
software unless they are sure it will reduce the time it takes clients to do their taxes. The population
distribution is assumed to be normal.
a. What is the variable being studied? Categorize it. Based on this, what descriptive statistic (mean
or proportion) is best for this situation?
b. Conduct an appropriate eight-step hypothesis test to determine if, on average, the software has
reduced the time it takes clients to do their taxes.
c. Suppose the truth is that the software does help clients do their taxes faster. Has an error been
committed? If so, what type of error is it? Explain your answers.
d. Construct a 90% confidence interval for the population mean time to complete the tax forms.
e. Interpret the confidence interval in the context of the question.
f. Does the confidence interval support the results of the hypothesis test? Explain your answer.
g. If the firm wished to increase its level of confidence and keep the margin of error the same by
taking another survey, what changes should it make? Why?
h. If the firm did another survey, kept the margin of error the same, and only surveyed 49 people,
how would the level of confidence have to change? Why?
i. Suppose that the firm decided that it needed to be at least 96% confident of the population mean
length of time to within one hour. How would the number of people the firm surveys change?
Why?
7. Question 7: In 2013, it was determined that 21% of North Americans download music illegally. Public
Policy Polling is wondering whether that number has changed. They asked a random sample of adults
across North America about their downloading habits. When asked, 512 of the 2247 participants
admitted that they have illegally downloaded music.
a. Has the proportion of North Americans who illegally download music increased since 2013? Conduct
an appropriate eight-step hypothesis test to support your answer.
b. Create and interpret a 99% confidence interval for the true proportion of North American adults
who have illegally downloaded music.
c. This survey was conducted through automated telephone interviews on May 6 and 7 of this year.
The margin of error of the survey compensates for sampling error, or natural variability among
samples. List some factors that could affect the survey's outcome that are not covered by the
margin of error.
d. Without performing any calculations, describe how the confidence interval would change if the
confidence level changed from 99% to 90%.
e. Suppose Public Policy Polling wants to conduct the study again now. They want to keep the
same level of confidence as their last survey, but they want their results to be within 2% of the
true proportion of North American adults who have illegally downloaded music. What is the minimum
sample size they need to obtain this?
8. Question 8: A survey of the mean number of cents off that coupons give was conducted by randomly
surveying one coupon per page from the coupon sections of a recent San Jose Mercury News. The
following data were collected (in cents): 20; 75; 50; 65; 30; 55; 40; 40; 30; 55; 150; 40; 65; 40.
Assume the underlying distribution is approximately normal.
a. What is the variable being studied? Categorize it. Based on this, what descriptive statistic (mean
or proportion) is best for this situation?
b. Conduct an appropriate eight-step hypothesis test to determine if the mean number of cents off
a coupon is different from 50 cents. Use a level of significance of 3%.
c. What is the probability of committing a Type I error in the above hypothesis test?
d. Construct a 97% confidence interval for the population mean worth of coupons.
e. Interpret the confidence interval in the context of the question.
f. If many random samples of size 14 were taken, what percent of the confidence intervals constructed
should contain the population mean worth of coupons? Explain why.
1. a. The variable is the specific absorption rate. It is quantitative continuous data. The best descriptive
statistic for this type of data is the mean.
b. Since the sample size is less than 30, we can only assume the sampling distribution is normal if
the population distribution is close to being normal. Based on the normal curve plot and the
empirical rule, it appears that the sample is not normally distributed. The normal curve plot is
not a straight line and only 55.6% of the data fall within one standard deviation of the mean. This
conclusion is supported by a bimodal histogram. This suggests that the population distribution is
not normal, which means we cannot be certain the sampling distribution is normal. Regardless of
your answer in b), assume that the sampling distribution is normal for the remaining questions.
c. Step 1. State hypotheses both in sentence and numerical form. Define the symbols. H0 : on average,
cell phones have SAR rates that are 0.9 watts per kg, µ = 0.9; HA : on average, cell phones
have SAR rates that exceed 0.9 watts per kg, µ > 0.9
Step 2. State the evidence. n = 27, X = 0.989, s = 0.410
Step 3. State the model. Explain why you have chosen it.
• Sampling distribution of sample means is normal? Yes, as stated in the question.
• Population standard deviation is known? No
• Sample size is greater than 40? No
Therefore, since we need to estimate the population standard deviation using the sample
standard deviation and the sample size is small, we will use the t-based mean model.
Step 4. State the level of significance and why you have chosen it. State the decision rule. Since the
administrator is risk averse, they want to ensure that they have rejected H0 with a lot of
evidence. Therefore, the level of significance that requires the most evidence to reject H0 is
1%. If p < 1%, reject H0 . If p ≥ 1%, do not reject H0 .
Step 5. Evaluate the evidence (i.e. find the p-value using a computer program). p = 0.1357
Step 6. State what the p-value means in the context of the question. The probability that a sample
mean SAR of at least 0.989 is observed, under the assumption that the SAR rate is 0.9, is
13.57%.
Step 7. Make a decision. Since p (13.57%) is greater than α (1%), we do not reject H0 .
Step 8. Explain the result in a sentence that refers back to the context. There is not sufficient evidence
to suggest that, on average, cell phones have SAR rates that exceed 0.9 watts per kg, which
means the advocacy group should not go forward with their petition.
d. Since α = 1%, I will use a confidence level of 98% (for a one-tailed HT, use 1 − 2α to determine
the complementary CL): 0.793 to 1.18
e. We are 98% confident that the true population mean for SARs is somewhere between 0.793
watts/kg and 1.18 watts/kg.
f. Though there are possible values for the population mean that do exceed 0.9 watts/kg in the
CI, there are also values that do not exceed 0.9 watts/kg. Therefore, the CI would lead to an
inconclusive result, meaning it is not clear from the CI whether the population mean exceeds 0.9 or
not. This aligns with our hypothesis test that there is not enough evidence to suggest that the
population mean exceeds 0.9 watts/kg.
2. a. The variable is the emergency room wait times. It is quantitative continuous data. The best
descriptive statistic for this type of data is the mean.
b. Step 1. State hypotheses both in sentence and numerical form. Define the symbols. H0 : the average
wait time for patients to be called back to be examined is 1.4 hours, µ = 1.4; HA : the average
wait time for patients to be called back to be examined has changed from 1.4 hours, µ ≠ 1.4
Step 2. State the evidence. n = 70, X = 1.5, s = 0.5
Step 3. State the model. Explain why you have chosen it.
• Sampling distribution of sample means is normal? Yes as the sample size (70) is greater
than 30, the central limit theorem applies and the sampling distribution of sample means
is normally distributed.
to suggest that, on average, Americans are meeting the recommendation of watching less than
150 hours of television per month.
c. 133.2 to 166.1
d. We are 99% confident that the population mean time that Americans spend watching TV is
somewhere between 133.2 hours and 166.1 hours.
e. The confidence level means that if we took many random samples of size 25 from the population of
Americans and constructed many confidence intervals for each of these random samples, then 99%
of these confidence intervals will contain the population mean time Americans spend watching
TV, while 1% will not.
4. a. We know the sampling distribution for sample means is normal because the sample size is greater
than 30, as stated in the Central Limit Theorem. Therefore, we use either the Student-t or the
standard normal distribution. As the population standard deviation is known, we can use the
standard normal distribution (i.e. the z-based normal distribution).
b. 239.84 to 248.16
c. The confidence interval will get narrower because the margin of error will be smaller. The margin
of error is smaller because the amount of error between the sample means and the population
mean is smaller, as stated in the law of large numbers.
d. Yes; the estimated population mean weight of newborn elephants is 239.84 pounds to 248.16
pounds. Based on this, it is fair to say that the average weight exceeds 235 pounds, as both
bounds are larger than 235.
e. Step 1. State hypotheses both in sentence and numerical form. Define the symbols. H0 : on average,
newborn elephants weigh 235 pounds, µ = 235; HA : on average, newborn elephants weigh
more than 235 pounds, µ > 235
Step 2. State the evidence. n = 50, X = 244, σ = 15
Step 3. State the model. Explain why you have chosen it.
• Sampling distribution of sample means is normal? Yes as the sample size (50) is greater
than 30, the central limit theorem applies and the sampling distribution of sample means
is normally distributed.
• Population standard deviation is known? Yes
As the population standard deviation is known, we will use the z-based mean model.
Step 4. State the level of significance and why you have chosen it. State the decision rule. As the
confidence level in the previous question was 95% and we are attempting to verify the CI
with a HT, we should use an α of 2.5% (solve for alpha in 0.95 = 1 − 2α, for a one-tailed
HT). If p < 2.5%, reject H0 . If p ≥ 2.5%, do not reject H0 .
Step 5. Evaluate the evidence (i.e. find the p-value using a computer program). p = 1.10 × 10⁻⁵ =
0.000011
Step 6. State what the p-value means in the context of the question. The probability that a sample
mean weight of newborn elephants of at least 244 pounds is observed, under the assumption
that the mean weight of newborn elephants is 235 pounds, is 0.0011%.
Step 7. Make a decision. Since p (0.0011%) is less than α (2.5%), we reject H0 .
Step 8. Explain the result in a sentence that refers back to the context. There is sufficient evidence
to suggest that, on average, newborn elephants weigh more than 235 pounds.
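As a check, the test statistic and p-value from step 5 can be reproduced in a couple of lines (Python with scipy):

import math
from scipy.stats import norm

z = (244 - 235) / (15 / math.sqrt(50))   # test statistic, about 4.24
p = 1 - norm.cdf(z)                      # about 1.1e-05, matching step 5
print(z, p)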
5. a. The variable is whether a couple makes major household purchasing decisions together or
not. It is categorical nominal data. The best descriptive statistic for this type of data is a
proportion.
b. They would need to interview a minimum of 271 households. (Note: As no estimate of the
population proportion is provided, use 50%.)
c. The minimum number needed to survey would increase, because a higher confidence level
requires a larger critical value, and therefore a larger sample to keep the margin of error at 0.05.
d. Step 1. State hypotheses both in sentence and numerical form. Define the symbols. H0 : the proportion
of couples who make major household purchasing decisions together is unchanged
at 46.5%, π = 0.465; HA : the proportion of couples who make major household purchasing
decisions together is greater than 46.5%, π > 0.465
Step 2. State the evidence. n = 200, X = 114
Step 3. State the model. Explain why you have chosen it.
• Binomial distribution? Yes, because ...
· Data is being counted: Yes, counting the number of couples.
· Data is collected randomly: Says so in the question.
· There are only two outcomes: Either a couple makes household decisions together or they
don't.
· There are a fixed number of trials: 200.
· The trials are independent: As the sample is random, it is safe to say this is the case,
as how one couple makes decisions should not affect how another randomly selected
couple makes decisions.
Step 4. State the level of significance and why you have chosen it. State the decision rule. As the
editor needs strong evidence, we need to choose α to be small, i.e. 1%. If p < 1%, reject H0 . If
p ≥ 1%, do not reject H0 .
Step 5. Evaluate the evidence (i.e. find the p-value using a computer program). p = 0.00185
Step 6. State what the p-value means in the context of the question. The probability that at least
114 out of 200 couples make major purchasing decisions together, assuming the rate has not
changed since the 1980s, is 0.19%.
Step 7. Make a decision. Since p (0.19%) is less than α (1%), we reject H0 .
Step 8. Explain the result in a sentence that refers back to the context. There is sufficient evidence
to suggest that the proportion of couples who make major household purchasing decisions
together is greater than 46.5%.
e. 0.5014 to 0.6386
f. We are 95% confident that the true proportion of couples who make major household purchasing
decisions together is somewhere between 50.14% and 63.86%.
g. Based on the CI, the rate has increased by at least 3.6% and by at most 17.4%.
h. One issue is how the marketing company will develop the list of email addresses. Most likely they
will not have a complete list of all emails for all households. Second, the email will be sent
to a member of the household and not to the household as a whole, so one household may get
multiple surveys. Further, not everyone uses email, so the sample will miss those households.
6. a. The variable is the amount of time people take completing their tax forms. It is quantitative
continuous data. The best descriptive statistic for this type of data is the mean.
b. Conduct an appropriate eight-step hypothesis test to determine if, on average, the software has
reduced the time it takes clients to do their taxes.
Step 1. State hypotheses both in sentence and numerical form. Define the symbols. H0 : on average,
the software has not reduced the time it takes clients to do their taxes, µ = 24.4; HA : on
average, the software has reduced the time it takes clients to do their taxes, µ < 24.4
Step 2. State the evidence. n = 100, X = 23.6, s = 7.0
Step 3. State the model. Explain why you have chosen it.
• Sampling distribution of sample means is normal? Yes as the population distribution is
assumed to be normal, we know the sampling distribution of sample means is also normal.
should choose a small level of significance (i.e. 1%). If p < 1%, reject H0 . If p ≥ 1%, do not
reject H0 .
Step 5. Evaluate the evidence (i.e. find the p-value using a computer program). p = 0.1265
Step 6. State what the p-value means in the context of the question. The probability that a sample
mean time to complete tax returns of at most 23.6 hours is observed, under the assumption
that the mean time is 24.4 hours, is 12.65%.
Step 7. Make a decision. Since p (12.65%) is greater than α (1%), we do not reject H0 .
Step 8. Explain the result in a sentence that refers back to the context. There is not sufficient evidence
to suggest that, on average, the software has reduced the time it takes clients to do their taxes.
c. Since we have stated that there is not enough evidence that the software has reduced
the time it takes clients to do their taxes, when in fact it has, we have committed a Type II error.
d. 22.45 to 24.75
e. We are 90% confident that the true average time it takes for people to complete their tax forms
with this new software is somewhere between 22.45 hours and 24.75 hours.
f. The HT led us to state that there is not enough evidence that the average time has been reduced
from 24.4 hours. The CI supports this, as it contains the hypothesized mean of 24.4 hours.
g. If the level of confidence is increased, then the critical value in the margin of error would increase.
To keep the margin of error the same, either the standard deviation would need to decrease, or
the sample size would need to increase. As the standard deviation is inherent to the data, the
sample size needs to increase.
h. If the sample size decreases, then the margin of error increases. This means that to keep the
margin of error constant, the level of confidence would need to decrease. This would cause the
critical value to be smaller, which would compensate for the smaller sample size.
i. The firm would need to survey more people. Moving from 90% to 96% confidence increases the
critical value, and requiring the estimate to be within one hour fixes the margin of error, so the
required sample size must increase to compensate.
7. a. Step 1. State hypotheses both in sentence and numerical form. Define the symbols. H0 : the proportion
of North Americans who illegally download music has not increased since 2013, π = 0.21;
HA : the proportion of North Americans who illegally download music has increased since 2013,
π > 0.21
Step 2. State the evidence. n = 2247, X = 512
Step 3. State the model. Explain why you have chosen it. Normal distribution approximation of the
binomial distribution:
• Binomial distribution? Yes, because ...
· Data is being counted: Yes, counting the number of North Americans who illegally
download music.
· Data is collected randomly: Says so in the question.
· There are only two outcomes: Either a person illegally downloads music or they don't.
· There are a fixed number of trials: 2247.
· The trials are independent: As the sample is random, it is safe to say this is the case, as
whether one person downloads music illegally or not should not affect whether another
randomly selected person does.
Therefore, we will use the normal distribution approximation of the binomial distribution.
Step 4. State the level of significance and why you have chosen it. State the decision rule. As there
is no motivation stated in the study, I will choose a level of significance that is a balance
between rejecting and not rejecting H0, i.e. 5%. If p < 5%, reject H0 . If p ≥ 5%, do not
reject H0 .
Step 5. Evaluate the evidence (i.e. find the p-value using a computer program). p = 0.0208
Step 6. State what the p-value means in the context of the question. The probability that at least
512 out of 2247 North Americans admit that they have downloaded music illegally, assuming
the rate has not changed since 2013, is 2.08%.
1. We can use the standard normal model to find the confidence interval, because the sample was collected
randomly and, since the sample size is greater than 30 (it is 175), we can be very confident that the
sampling distribution for the sample means is normal due to the central limit theorem. To find the
confidence interval, use a computer program. Make sure to choose the z-model (instead of the t-model).
Input the sample size as 175, the sample mean as 21.34 and the standard deviation as 5.12. Choose
the level of confidence to be 95%. This gives the following output:
Table 3.6
From this, we can see that the confidence interval for the mean is 20.58 to 22.10.
2. To interpret the confidence interval, we would say that we are 95% confident that the population mean
age of students from this university is somewhere between 20.58 years old and 22.10 years old. That
is, we are estimating that the population mean age is somewhere between 20.58 years old and 22.10
years old.
3. The confidence level means that if we took many random samples of size 175 from the student body
of this university and constructed many confidence intervals for each of these random samples, then
95% of these confidence intervals will contain the population mean age for this university, while 5%
will not.
4. If the sample size is decreased to 100, we would expect that the confidence interval would get wider.
From the law of large numbers, we know there is more sampling variability in smaller samples. Thus
there is more potential for error between the sample mean and the population mean when the sample
size is smaller. The margin of error then is bigger to take this into account. This is supported by the
formula for the margin of error (zα/2 × σ/√n). Since we are dividing by √n, the margin of error
would be smaller for larger n and bigger for smaller n.
5. We have estimated that the population mean age is between 20.58 years old and 22.10 years old.
Therefore, based on our estimate, it is unlikely that the mean age at this university is 23 years old, as
23 does not fall within our estimate. The administrator's claim is most likely incorrect.
1. We can use the Student-t distribution model to construct the confidence interval, because the population
standard deviation is unknown (so we don't use the standard normal distribution), the sample
is collected randomly, and the sampling distribution of the sample means is normal because the population
distribution is normal. To find the confidence interval, use a computer program. Make sure to
choose the t-model (instead of the z-model). Input the sample size as 25, the sample mean as 44.25
and the standard deviation as 2.25. Choose the level of confidence to be 95%. This gives the following
output:
Table 3.7
From this, we can see that the confidence interval for the mean is 43.321 to 45.179.
2. To interpret the confidence interval, we would say that we are 95% confident that the true mean battery
life of this brand of AAA batteries is somewhere between 43.32 hours and 45.18 hours.
3. If the confidence level is decreased to 90%, we would expect that the confidence interval would get
narrower. A higher level of confidence is obtained by making the confidence interval wider. Therefore,
if the confidence level is decreased, then the confidence interval would get narrower.
Notice from the computer output that the critical value is 2.064 with 24 degrees of freedom (i.e. one less
than the sample size). If the population standard deviation were known, the critical value would be 1.96. To
reiterate, since we are estimating the population standard deviation with the sample standard deviation,
we know there is more room for error in the estimate. Therefore, we want the estimate (i.e. confidence
interval) to be slightly wider, thus the margin of error needs to be slightly bigger. This is done by using the
Student-t distribution, which results in bigger critical values for the same confidence level than would occur
with the standard normal distribution. In this case, 2.064.
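The two critical values quoted here can be checked directly (Python, assuming the scipy library is available):

from scipy.stats import norm, t
print(t.ppf(0.975, df=24))   # 2.0639..., t critical value with 24 degrees of freedom
print(norm.ppf(0.975))       # 1.9599..., z critical value for 95% confidence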
Solution to Exercise 3.1.4.3 (p. 217)
We know the confidence level (95%). The margin of error is stated by saying that we want the estimate
of the true mean to be within 0.2 hours. Thus the 0.2 hours is telling us how much error we want in the
estimate (i.e. E = 0.2). We do need to have a sense of the standard deviation, which we get from the
preliminary study. Using the 12 participants, we get a sample standard deviation of 1.29.
We can now use a computer program to do the calculation. From the question, we know the margin of
error (E) is 0.2, the standard deviation is 1.29, and the confidence level is 95%. When we input this into
the computer program, we get output similar to this.
Table 3.8
From this, we can see that to get our estimate within 0.2 hours of the true mean we would need a
sample size of at least 160 participants.
Solution to Exercise 3.1.5.1 (p. 219)
The confidence level is 90%, the sample proportion is 85%, and the amount of error we want in our estimate
(i.e. the margin of error) is 2.5%.
We can now use a computer program to do the calculation. From the question, we know the margin of
error (E) is 0.025 (remember to write it as a decimal), the sample proportion is 0.85, and the confidence
level is 90%. When we input this into the computer program, we get output similar to this.
Table 3.9
From this, we can see that we need to have at least 552 consumers in our sample.
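Both minimum sample sizes above can be verified directly from the margin-of-error formulas (a sketch in Python with scipy; remember to round up to the next whole participant):

import math
from scipy.stats import norm

def min_n_mean(conf, sigma, E):
    # Minimum n so the margin of error z * sigma / sqrt(n) is at most E
    z = norm.ppf(1 - (1 - conf) / 2)
    return math.ceil((z * sigma / E) ** 2)

def min_n_proportion(conf, p, E):
    # Minimum n for estimating a proportion to within E
    z = norm.ppf(1 - (1 - conf) / 2)
    return math.ceil(p * (1 - p) * (z / E) ** 2)

print(min_n_mean(0.95, 1.29, 0.2))           # 160 participants
print(min_n_proportion(0.90, 0.85, 0.025))   # 552 consumers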
Solution to Exercise 3.2.2.1 (p. 224)
A normal distribution or a Student's t-distribution
Solution to Exercise 3.2.2.2 (p. 224)
Use a Student's t-distribution
Solution to Exercise 3.2.2.3 (p. 224)
a normal distribution for a single population mean
Solution to Exercise 3.2.2.4 (p. 224)
It must be approximately normally distributed.
Solution to Exercise 3.2.2.5 (p. 224)
They must both be greater than five.
Solution to Exercise 3.2.2.6 (p. 224)
binomial distribution
a. H0 : π = 0.42
b. Ha : π < 0.42
Solution to Exercise 3.2.3.5 (p. 227)
a. H0 : µ = 34; Ha : µ ≠ 34
b. H0 : π ≤ 0.60; Ha : π > 0.60
c. H0 : µ ≥ 100,000; Ha : µ < 100,000
d. H0 : π = 0.29; Ha : π ≠ 0.29
e. H0 : π = 0.05; Ha : π < 0.05
f. H0 : µ ≤ 10; Ha : µ > 10
g. H0 : π = 0.50; Ha : π ≠ 0.50
h. H0 : µ = 6; Ha : µ ≠ 6
i. H0 : π ≥ 0.11; Ha : π < 0.11
j. H0 : µ ≤ 20,000; Ha : µ > 20,000
Solution to Exercise 3.2.3.6 (p. 227)
c
Solution to Exercise 3.2.4.1 (p. 230)
Type I error: The researcher thinks the blood cultures do contain traces of pathogen X, when in fact, they
do not.
Type II error: The researcher thinks the blood cultures do not contain traces of pathogen X, when in
fact, they do.
Solution to Exercise 3.2.4.2 (p. 230)
The error with the greater consequence is the Type II error: the patient will be thought well when, in fact,
he is sick, so he will not get treatment.
Solution to Exercise 3.2.4.3 (p. 230)
In this scenario, an appropriate null hypothesis would be H0 : the mean level of toxins is at most 800 µg;
H0 : µ ≤ 800 µg.
Type I error: The DMF believes that toxin levels are still too high when, in fact, toxin levels are at
most 800 µg. The DMF continues the harvesting ban.
Type II error: The DMF believes that toxin levels are within acceptable levels (are at most 800 µg)
when, in fact, toxin levels are still too high (more than 800 µg). The DMF lifts the harvesting ban. This
error could be the most serious. If the ban is lifted and clams are still toxic, consumers could possibly eat
tainted food.
In summary, the more dangerous error would be to commit a Type II error, because this error involves
the availability of tainted clams for consumption.
Solution to Exercise 3.2.4.4 (p. 230)
Type I error: c
Type II error: b
Solution to Exercise 3.2.4.5 (p. 231)
Type I: The mean price of mid-sized cars is $32,000, but we conclude that it is not $32,000.
Type II: The mean price of mid-sized cars is not $32,000, but we conclude that it is $32,000.
Solution to Exercise 3.2.4.6 (p. 231)
α = the probability that you think the bag cannot withstand -15 degrees F, when in fact it can
β = the probability that you think the bag can withstand -15 degrees F, when in fact it cannot
Solution to Exercise 3.2.4.7 (p. 231)
Type I: The procedure will go well, but the doctors think it will not.
Type II: The procedure will not go well, but the doctors think it will.
Solution to Exercise 3.2.4.8 (p. 231)
a. Type I error: We conclude that the mean is not 34 years, when it really is 34 years. Type II error: We
conclude that the mean is 34 years, when in fact it really is not 34 years.
b. Type I error: We conclude that more than 60% of Americans vote in presidential elections, when the actual percentage is at most 60%. Type II error: We conclude that at most 60% of Americans vote in presidential elections when, in fact, more than 60% do.
c. Type I error: We conclude that the mean starting salary is less than $100,000, when it really is at least
$100,000. Type II error: We conclude that the mean starting salary is at least $100,000 when, in fact,
it is less than $100,000.
d. Type I error: We conclude that the proportion of high school seniors who get drunk each month is not
29%, when it really is 29%. Type II error: We conclude that the proportion of high school seniors who
get drunk each month is 29% when, in fact, it is not 29%.
e. Type I error: We conclude that fewer than 5% of adults ride the bus to work in Los Angeles, when the percentage that do is really 5% or more. Type II error: We conclude that 5% or more adults ride the bus to work in Los Angeles when, in fact, fewer than 5% do.
f. Type I error: We conclude that the mean number of cars a person owns in his or her lifetime is more
than 10, when in reality it is not more than 10. Type II error: We conclude that the mean number of
cars a person owns in his or her lifetime is not more than 10 when, in fact, it is more than 10.
g. Type I error: We conclude that the proportion of Americans who prefer to live away from cities is not
about half, though the actual proportion is about half. Type II error: We conclude that the proportion
of Americans who prefer to live away from cities is half when, in fact, it is not half.
h. Type I error: We conclude that the duration of paid vacations each year for Europeans is not six weeks,
when in fact it is six weeks. Type II error: We conclude that the duration of paid vacations each year
for Europeans is six weeks when, in fact, it is not.
i. Type I error: We conclude that the proportion is less than 11%, when it is really at least 11%. Type
II error: We conclude that the proportion of women who develop breast cancer is at least 11%, when
in fact it is less than 11%.
j. Type I error: We conclude that the average tuition cost at private universities is more than $20,000,
though in reality it is at most $20,000. Type II error: We conclude that the average tuition cost at
private universities is at most $20,000 when, in fact, it is more than $20,000.
costly, it would be better to err on the side of making a Type I error over a Type II error. Therefore we will set alpha at 10%. Reject H0 if the p-value < 0.10.
Solution to Exercise 3.2.5.13 (p. 238)
Z* = (0.01 − 0.0057)/√((0.0057 × 0.9943)/2000) = 2.54; P-value = 0.011
Solution to Exercise 3.2.5.14 (p. 238)
Our P-value is 0.011. This means that the probability of getting a sample proportion of 0.01 (or more) from a population with a proportion of 0.0057 is only 1.1%. Given our level of significance, this would be considered an unlikely event.
Solution to Exercise 3.2.5.15 (p. 238)
Since the p-value of 0.011 is less than alpha of 0.10, we will reject the null hypothesis.
Solution to Exercise 3.2.5.16 (p. 238)
There is sufficient evidence to indicate that the proportion of responses differs as a result of the marketing campaign.
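The z-test walked through in the last four solutions can be reproduced directly. A minimal Python sketch (assuming a two-tailed test, consistent with the "differs" conclusion, and that scipy is available):

import math
from scipy.stats import norm

p0, p_hat, n = 0.0057, 0.01, 2000
se = math.sqrt(p0 * (1 - p0) / n)  # standard error under the null
z = (p_hat - p0) / se              # about 2.54
p_value = 2 * (1 - norm.cdf(z))    # two-tailed, about 0.011
print(round(z, 2), round(p_value, 3))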
Solution to Exercise 3.2.5.17 (p. 238)
Ho: µ = 12; Ha: µ < 12
Solution to Exercise 3.2.5.18 (p. 238)
A sample of 9 randomly chosen customers' boarding times reveals: n = 9; mean = 9.3; sample standard
deviation = 3.32.
Solution to Exercise 3.2.5.19 (p. 238)
We are investigating a hypothesis about a population mean using a small sample. The Central Limit Theorem does not apply to small samples, but we can expect the sampling distribution to be normally shaped if the population is also normal. This has been confirmed through previous studies. Thus we will conduct a t-based test.
Solution to Exercise 3.2.5.20 (p. 239)
A Type I error in this case would be for Charter Air to claim their boarding time is less than 12 minutes,
when in fact it is not. A Type II error in this case would be for Charter Air not to claim their boarding time
is less than 12 minutes, when in fact it is. A Type I error could lead to false advertising, which has both
ethical and legal implications, so it would be best to minimize the likelihood of making this type of error.
Alpha should be set at 1% (or at most 5%). Reject Ho if P-value < 0.01.
Solution to Exercise 3.2.5.21 (p. 239)
t* = (9.3 − 12)/(3.32/√9) = −2.43; P-value = 0.0203
Solution to Exercise 3.2.5.22 (p. 239)
The P-value > 0.01, so we will fail to reject the null hypothesis.
Solution to Exercise 3.2.5.23 (p. 239)
Our P-value is 0.0203. This means that the probability of getting a sample mean of 9.3 minutes (or less) from a population with a mean of 12 minutes is 2.03%. Given our level of significance, this would not be considered an unlikely event.
Solution to Exercise 3.2.5.24 (p. 239)
The P-value of 0.0203 > 0.01, so we will fail to reject the null hypothesis.
Solution to Exercise 3.2.5.25 (p. 239)
There is insufficient evidence to indicate that the mean boarding time is less than 12 minutes.
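The boarding-time t-test can likewise be checked in a few lines. A Python sketch (assuming a left-tailed test, matching Ha: µ < 12, and that scipy is available):

import math
from scipy.stats import t

n, xbar, s, mu0 = 9, 9.3, 3.32, 12
t_stat = (xbar - mu0) / (s / math.sqrt(n))  # about -2.44
p_value = t.cdf(t_stat, df=n - 1)           # left tail, about 0.02
print(round(t_stat, 2), round(p_value, 4))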
Figure 4.1: We encounter statistics in our daily lives more often than we probably realize and from many different sources, like the news. (credit: David Sim)
You are probably asking yourself the question, "When and where will I use statistics?" If you read any
newspaper, watch television, or use the Internet, you will see statistical information. There are statistics
about crime, sports, education, politics, and real estate. Typically, when you read a newspaper article or
watch a television news program, you are given sample information. With this information, you may make
a decision about the correctness of a statement, claim, or "fact." Statistical methods can help you make the
"best educated guess."
1 This content is available online at <http://cnx.org/content/m65507/1.1/>.
Since you will undoubtedly be given statistical information at some point in your life, you need to know
some techniques for analyzing the information thoughtfully. Think about buying a house or managing a
budget. Think about your chosen profession. The fields of economics, business, psychology, education, biology, law, computer science, police science, and early childhood development require at least one course in statistics.
Included in this chapter are the basic ideas and words of probability and statistics. You will soon understand that statistics and probability work together. You will also learn how data are gathered and how "good" data can be distinguished from "bad."
The correlation coefficient, ρ (pronounced rho), is the mathematical statistic for a population that provides us with a measurement of the strength of a linear relationship between the two variables. For a sample of data, the statistic, r, developed by Karl Pearson in the early 1900s, is an estimate of the population correlation and is defined mathematically as:
r = \frac{\frac{1}{n-1}\sum\left(X_{1i} - \bar{X}_1\right)\left(X_{2i} - \bar{X}_2\right)}{s_{x_1} s_{x_2}}   (4.1)

OR

r = \frac{\sum X_{1i}X_{2i} - n\,\bar{X}_1\bar{X}_2}{\sqrt{\left(\sum X_{1i}^2 - n\bar{X}_1^2\right)\left(\sum X_{2i}^2 - n\bar{X}_2^2\right)}}   (4.1)
where s_{x1} and s_{x2} are the standard deviations of the two independent variables X1 and X2, X̄1 and X̄2 are the sample means of the two variables, and X1i and X2i are the individual observations of X1 and X2. The correlation coefficient r ranges in value from −1 to 1. The second equivalent formula is often used because it may be computationally easier. As scary as these formulas look they are really just the ratio of the covariance between the two variables and the product of their two standard deviations. That is to say, it is a measure of relative variances.
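A short Python sketch makes the point that the formula is just the covariance scaled by the standard deviations; the data here are hypothetical, and the library shortcut should agree with the definitional calculation.

import numpy as np

# hypothetical paired observations
x1 = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
x2 = np.array([3.1, 5.9, 6.2, 8.8, 10.1])

# definitional formula: sample covariance over the product of standard deviations
cov = np.sum((x1 - x1.mean()) * (x2 - x2.mean())) / (len(x1) - 1)
r_manual = cov / (x1.std(ddof=1) * x2.std(ddof=1))

# library shortcut
r = np.corrcoef(x1, x2)[0, 1]
print(round(r_manual, 4), round(r, 4))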
In practice all correlation and regression analysis will be provided through computer software designed for these purposes. Anything more than perhaps half a dozen observations creates immense computational problems. It was because of this fact that correlation, and even more so regression, were not widely used research tools until after the advent of computing machines. Now the computing power required to analyze data using regression packages is deemed almost trivial by comparison to just a decade ago.
To visualize any linear relationship that may exist, review a scatter diagram of the standardized data. Figure 4.2 presents several scatter diagrams and the calculated value of r. In panels (a) and (b) notice that the data generally trend together, (a) upward and (b) downward. Panel (a) is an example of a positive correlation and panel (b) is an example of a negative correlation, or relationship. The sign of the correlation coefficient tells us if the relationship is positive or negative (inverse). If all the values of X1 and X2 are on a straight line the correlation coefficient will be either 1 or −1, depending on whether the line has a positive or negative slope, and the closer to one or negative one, the stronger the relationship between the two variables. BUT ALWAYS REMEMBER THAT THE CORRELATION COEFFICIENT DOES NOT TELL US THE SLOPE.
Figure 4.2
Remember, all the correlation coefficient tells us is whether or not the data are linearly related. In panel (d) the variables obviously have some type of very specific relationship to each other, but the correlation coefficient is zero, indicating no linear relationship exists.
If you suspect a linear relationship between X1 and X2 then r can measure how strong the linear relationship is.
What the VALUE of r tells us:
• The value of r is always between −1 and +1: −1 ≤ r ≤ 1.
• The size of the correlation r indicates the strength of the linear relationship between X1 and X2. Values of r close to −1 or to +1 indicate a stronger linear relationship between X1 and X2.
• If r = 0 there is absolutely no linear relationship between X1 and X2 (no linear correlation).
• If r = 1, there is perfect positive correlation. If r = −1, there is perfect negative correlation. In both these cases, all of the original data points lie on a straight line, whatever its slope. Of course, in the real world, this will not generally happen.
note: Strong correlation does not suggest that X1 causes X2 or X2 causes X1 . We say "correlation
does not imply causation."
4.2.1
Exercise 4.2.1 (Solution on p. 310.)
In order to have a correlation coecient between traits A and B, it is necessary to have:
a. one group of subjects, some of whom possess characteristics of trait A, the remainder possessing those of trait B
b. measures of trait A on one group of subjects and of trait B on another group
c. two groups of subjects, one which could be classied as A or not A, the other as B or not B
d. two groups of subjects, one which could be classied as A or not A, the other as B or not B
a. 81% of the variation in the money spent for repairs is explained by the age of the auto
b. 81% of money spent for repairs is unexplained by the age of the auto
c. 90% of the money spent for repairs is explained by the age of the auto
d. none of the above
a. 20
b. 16
c. 40
d. 80
a. plus and minus 10% from the means includes about 68% of the cases
b. one-tenth of the variance of one variable is shared with the other variable
c. one-tenth of one variable is caused by the other variable
d. on a scale from -1 to +1, the degree of linear relationship between the two variables is +.10
a. Approximately 0.9
b. Approximately 0.4
c. Approximately 0.0
d. Approximately -0.4
e. Approximately -0.9
The hypothesis test lets us decide whether the value of the population correlation coefficient ρ is "close to zero" or "significantly different from zero". We decide this based on the sample correlation coefficient r and the sample size n.
If the test concludes that the correlation coefficient is significantly different from zero, we say that the correlation coefficient is "significant."
• Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between X1 and X2 because the correlation coefficient is significantly different from zero.
• What the conclusion means: There is a significant linear relationship between X1 and X2. If the test concludes that the correlation coefficient is not significantly different from zero (it is close to zero), we say that the correlation coefficient is "not significant".
• Null Hypothesis H0: The population correlation coefficient IS NOT significantly different from zero. There IS NOT a significant linear relationship (correlation) between X1 and X2 in the population.
• Alternate Hypothesis Ha: The population correlation coefficient IS significantly different from zero. There IS a significant linear relationship (correlation) between X1 and X2 in the population.
Drawing a Conclusion
There are two methods of making the decision concerning the hypothesis. The test statistic to test this
hypothesis is:
t_c = \frac{r}{\sqrt{\frac{1-r^2}{n-2}}}   (4.2)

OR

t_c = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}   (4.2)
where the second formula is an equivalent form of the test statistic, n is the sample size, and the degrees of freedom are n − 2. This is a t-statistic and operates in the same way as other t tests. Calculate the t-value and compare it with the critical value from the t-table at the appropriate degrees of freedom and the level of confidence you wish to maintain. If the calculated value is in the tail, then we cannot accept the null hypothesis that there is no linear relationship between these two independent random variables. If the calculated t-value is NOT in the tail, then we cannot reject the null hypothesis that there is no linear relationship between the two variables.
A quick shorthand way to test correlations is the relationship between the sample size and the correlation.
If:
|r| \ge \frac{2}{\sqrt{n}}   (4.2)
then this implies that the correlation between the two variables demonstrates that a linear relationship exists and is statistically significant at approximately the 0.05 level of significance. As the formula indicates, there is an inverse relationship between the sample size and the required correlation for significance of a linear relationship. With only 10 observations, the required correlation for significance is 0.6325; for 30 observations the required correlation for significance decreases to 0.3651; and at 100 observations the required level is only 0.2000.
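Both the formal t-test and the shorthand rule are easy to code. A Python sketch with hypothetical values of r and n (assuming a two-tailed test and that scipy is available):

import math
from scipy.stats import t

r, n = 0.45, 30  # hypothetical sample correlation and sample size

t_c = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # about 2.67
p_value = 2 * (1 - t.cdf(abs(t_c), df=n - 2))       # two-tailed, about 0.013

# quick shorthand: significant at roughly the 0.05 level?
quick = abs(r) >= 2 / math.sqrt(n)  # 2/sqrt(30) is about 0.365, so True here
print(round(t_c, 2), round(p_value, 3), quick)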
Correlations may be helpful in visualizing the data, but are not appropriately used to "explain" a relationship between two variables. Perhaps no single statistic is more misused than the correlation coefficient. Citing correlations between health conditions and everything from place of residence to eye color has the effect of implying a cause and effect relationship. This simply cannot be accomplished with a correlation coefficient. The correlation coefficient is, of course, innocent of this misinterpretation. It is the duty of the analyst to use a statistic that is designed to test for cause and effect relationships and report only those results if they are intending to make such a claim. The problem is that passing this more rigorous test is difficult, so lazy and/or unscrupulous "researchers" fall back on correlations when they cannot make their case legitimately.
4.3.1.1
y = a + bx (4.2)
y = 3 + 2x (4.2)
The graph of a linear equation of the form y = a + bx is a straight line. Any line that is not vertical can
be described by this equation.
Example 4.2
Graph the equation y = 1 + 2x.
4 This content is available online at <http://cnx.org/content/m55718/1.12/>.
Figure 4.3
Figure 4.4
Example 4.3
Aaron's Word Processing Service (AWPS) does word processing. The rate for services is $32 per
hour plus a $31.50 one-time charge. The total cost to a customer depends on the number of hours
it takes to complete the job.
Problem
Find the equation that expresses the total cost in terms of the number of hours required to
complete the job.
Solution
Let x = the number of hours it takes to get the job done.
Let y = the total cost to the customer.
The $31.50 is a fixed cost. If it takes x hours to complete the job, then (32)(x) is the cost of the word processing only. The total cost is: y = 31.50 + 32x
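The cost equation translates directly into code. A tiny Python sketch (the 4-hour job is a hypothetical input):

def total_cost(hours):
    """AWPS total cost: $31.50 one-time charge plus $32 per hour."""
    return 31.50 + 32 * hours

print(total_cost(4))  # a hypothetical 4-hour job costs $159.50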
Figure 4.5: Three possible graphs of y = a + bx. (a) If b > 0, the line slopes upward to the right. (b)
If b = 0, the line is horizontal. (c) If b < 0, the line slopes downward to the right.
Example 4.4
Svetlana tutors to make extra money for college. For each tutoring session, she charges a one-time
fee of $25 plus $15 per hour of tutoring. A linear equation that expresses the total amount of money
Svetlana earns for each session she tutors is y = 25 + 15x.
Problem
What are the independent and dependent variables? What is the y-intercept and what is the
slope? Interpret them using complete sentences.
Solution
The independent variable (x) is the number of hours Svetlana tutors each session. The dependent
variable (y) is the amount, in dollars, Svetlana earns for each session.
The y-intercept is 25 (a = 25). At the start of the tutoring session, Svetlana charges a one-time
fee of $25 (this is when x = 0). The slope is 15 (b = 15). For each session, Svetlana earns $15 for
each hour she tutors.
4.4.1.1
1. The independent variables, xi, are all measured without error, and are fixed numbers that are independent of the error term. This assumption says in effect that Y is deterministic, the result of a fixed component X and a random error component ε.
2. The error term is a random variable with a mean of zero and a constant variance. The meaning of this is that the variances of the independent variables are independent of the value of the variable. Consider the relationship between personal income and the quantity of a good purchased as an example of a case where the variance is dependent upon the value of the independent variable, income. It is plausible that as income increases the variation around the amount purchased will also increase, simply because of the flexibility provided by higher levels of income. The assumption of constant variance with respect to the magnitude of the independent variable is called homoscedasticity. If the assumption fails, then it is called heteroscedasticity. Figure 4.6 shows the case of homoscedasticity where all three distributions have the same variance around the predicted value of Y regardless of the magnitude of X.
5 This content is available online at <http://cnx.org/content/m55838/1.30/>.
3. While the independent variables are all fixed values they are from a probability distribution that is normally distributed. This can be seen in Figure 4.6 by the shape of the distributions placed on the predicted line at the expected value of the relevant value of Y.
4. The independent variables are independent of Y, but are also assumed to be independent of the other X variables. The model is designed to estimate the effects of independent variables on some dependent variable in accordance with a proposed theory. The case where some of the independent variables are correlated is not unusual. There may be no cause and effect relationship among the independent variables, but nevertheless they move together. Take the case of a simple supply curve where quantity supplied is theoretically related to the price of the product and the prices of inputs. There may be multiple inputs that over time move together from general inflationary pressure. The input prices will therefore violate this assumption of regression analysis. This condition is called multicollinearity, which will be taken up in detail later.
5. The error terms are uncorrelated with each other. This situation arises from an effect of one error term on another error term. While not exclusively a time series problem, it is here that we most often see this case. An X variable in time period one has an effect on the Y variable, but this effect then has an effect in the next time period. This effect gives rise to a relationship among the error terms. This case is called autocorrelation, or self-correlation. The error terms are now not independent of each other, but rather have their own effect on subsequent error terms.
Figure 4.6 shows the case where the assumptions of the regression model are being satisfied. The estimated line is ŷ = a + bx. Three values of X are shown. A normal distribution is placed at each point where X equals the estimated line and the associated error at each value of Y. Notice that the three distributions are normally distributed around the point on the line, and further, the variation, variance, around the predicted value is constant, indicating homoscedasticity from assumption 2. Figure 4.6 does not show all the assumptions of the regression model, but it helps visualize these important ones.
Figure 4.6
Figure 4.7
This is the general form that is most often called the multiple regression model. So-called "simple"
regression analysis has only one independent (right-hand) variable rather than many independent variables.
Simple regression is just a special case of multiple regression. There is some value in beginning with simple
regression: it is easy to graph in two dimensions, difficult to graph in three dimensions, and impossible to graph in more than three dimensions. Consequently, our graphs will be for the simple regression case.
Figure 4.7 presents the regression problem in the form of a scatter plot graph of the data set where it is
hypothesized that Y is dependent upon the single independent variable X.
A basic relationship from Macroeconomic Principles is the consumption function. This theoretical re-
lationship states that as a person's income rises, their consumption rises, but by a smaller amount than
the rise in income. If Y is consumption and X is income in the equation below Figure 4.7, the regression
problem is, rst, to establish that this relationship exists, and second, to determine the impact of a change
in income on a person's consumption. The parameter β 1 was called the Marginal Propensity to Consume in
Macroeconomics Principles.
Each "dot" in Figure 4.7 represents the consumption and income of dierent individuals at some point in
time. This was called cross-section data earlier; observations on variables at one point in time across dierent
people or other units of measurement. This analysis is often done with time series data, which would be
the consumption and income of one individual or country at dierent points in time. For macroeconomic
problems it is common to use times series aggregated data for a whole country. For this particular theoretical
concept these data are readily available in the annual report of the President's Council of Economic Advisors.
The regression problem comes down to determining which straight line would best represent the data in Figure 4.8. Regression analysis is sometimes called "least squares" analysis because the method of determining which line best "fits" the data is to minimize the sum of the squared residuals of a line put through the data.
Figure 4.8:
Population Equation: C = β 0 + β 1 Income + ε
Estimated Equation: C = b0 + b1 Income + e
This figure shows the assumed relationship between consumption and income from macroeconomic theory. Here the data are plotted as a scatter plot and an estimated straight line has been drawn. From this graph we can see an error term, e1. Each data point also has an error term. Again, the error term is put into the equation to capture effects on consumption that are not caused by income changes. Such other effects might be a person's savings or wealth, or periods of unemployment. We will see how by minimizing the sum of these errors we can get an estimate for the slope and intercept of this line.
Consider the graph below. The notation has returned to that for the more general model rather than the specific case of the macroeconomic consumption function in our example.
Figure 4.9
The ŷ is read "y hat" and is the estimated value of y. (In Figure 4.8, Ĉ represents the estimated value of consumption because it is on the estimated line.) It is the value of y obtained using the regression line. ŷ is not generally equal to y from the data.
The term y0 − ŷ0 = e0 is called the "error" or residual. It is not an error in the sense of a mistake. The error term was put into the estimating equation to capture missing variables and errors in measurement that may have occurred in the dependent variables. The absolute value of a residual measures the vertical distance between the actual value of y and the estimated value of y. In other words, it measures the vertical distance between the actual data point and the predicted point on the line, as can be seen on the graph at point X0.
If the observed data point lies above the line, the residual is positive, and the line underestimates the
actual data value for y.
If the observed data point lies below the line, the residual is negative, and the line overestimates that
actual data value for y.
In the graph, y0 − ŷ0 = e0 is the residual for the point shown. Here the point lies above the line and the residual is positive. For each data point the residuals, or errors, are calculated as yi − ŷi = ei for i = 1, 2, 3, ..., n, where n is the sample size. Each |e| is a vertical distance.
The sum of the errors squared is the term obviously called Sum of Squared Errors (SSE).
Using calculus, you can determine the straight line that has the parameter values of b0 and b1 that minimize the SSE. When you make the SSE a minimum, you have determined the points that are on the line of best fit. It turns out that the line of best fit has the equation:
ŷ = b0 + b1x   (4.9)

where b_0 = \bar{y} - b_1\bar{x} and b_1 = \frac{\sum\left(x - \bar{x}\right)\left(y - \bar{y}\right)}{\sum\left(x - \bar{x}\right)^2} = \frac{cov(x,y)}{s_x^2}

The sample means of the x values and the y values are x̄ and ȳ, respectively. The best fit line always passes through the point (x̄, ȳ), called the point of means.
The slope b1 can also be written as:

b_1 = r_{y,x} \frac{s_y}{s_x}   (4.9)
where sy = the standard deviation of the y values, sx = the standard deviation of the x values, and r is the correlation coefficient between x and y.
The equations for b0 and b1 above are called the Normal Equations and come from another very important mathematical finding called the Gauss-Markov Theorem, without which we could not do regression analysis. The Gauss-Markov Theorem tells us that the estimates we get from using the ordinary least squares (OLS) regression method will result in estimates that have some very important properties. In the Gauss-Markov Theorem it was proved that a least squares line is BLUE, that is, a Best Linear Unbiased Estimator. Best is the statistical property that an estimator is the one with the minimum variance. Linear refers to the property of the type of line being estimated. An unbiased estimator is one whose estimating function has an expected mean equal to the mean of the population. (You will remember that the expected value of x̄ was equal to the population mean µ in accordance with the Central Limit Theorem. This is exactly the same concept here.)
Both Gauss and Markov were giants in the field of mathematics, and Gauss in physics too, in the 18th century and early 19th century. They barely overlapped chronologically and never in geography, but Markov's work on this theorem was based extensively on the earlier work of Carl Gauss. The extensive applied value of this theorem had to wait until the middle of this last century.
Using the OLS method we can now find the estimate of the error variance, that is, the variance of the error terms computed from the squared errors, e². This is sometimes called the standard error of the estimate. (Grammatically this is probably best said as the estimate of the error's variance.) The formula for the estimate of the error variance is:

s_e^2 = \frac{\sum\left(y_i - \hat{y}_i\right)^2}{n-k} = \frac{\sum e_i^2}{n-k}   (4.9)
where ŷi is the predicted value of y and yi is the observed value, and thus the term (yi − ŷi)² is the squared error that is to be minimized to find the estimates of the regression line parameters. This is really just
the variance of the error terms and follows our regular variance formula. One important note is that here
we are dividing by (n − k), which is the degrees of freedom. The degrees of freedom of a regression equation
will be the number of observations, n, reduced by the number of estimated parameters, which includes the
intercept as a parameter.
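A Python sketch of the calculation, on the same hypothetical data as above; note that k counts both estimated parameters, the intercept and the slope.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

e = y - (b0 + b1 * x)            # residuals
n, k = len(x), 2                 # k counts both estimated parameters: b0 and b1
s2_e = np.sum(e ** 2) / (n - k)  # estimate of the error variance
print(round(s2_e, 4))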
The variance of the errors is fundamental in testing hypotheses for a regression. It tells us just how tight the dispersion is about the line. As we will see shortly, the greater the dispersion about the line, meaning the larger the variance of the errors, the less probable it is that the hypothesized independent variable will be found to have a significant effect on the dependent variable. In short, the theory being tested will more likely fail if the variance of the error term is high. Upon reflection this should not be a surprise. As we tested hypotheses about a mean we observed that large variances reduced the calculated test statistic and thus it failed to reach the tail of the distribution. In those cases, the null hypotheses could not be rejected. If we cannot reject the null hypothesis in a regression problem, we must conclude that the hypothesized independent variable has no effect on the dependent variable.
A way to visualize this concept is to draw two scatter plots of x and y data along a predetermined line. The first will have little variance of the errors, meaning that all the data points will lie close to the line. Now do the same except the data points will have a large estimate of the error variance, meaning that the data points are scattered widely along the line. Clearly the confidence about a relationship between x and y is affected by this difference between the estimates of the error variance.
H0: β1 = 0   (4.9)
Ha: β1 ≠ 0   (4.9)
If we cannot reject the null hypothesis, we must conclude that our theory has no validity. If we cannot reject the null hypothesis that β1 = 0, then b1, the coefficient of Income, is zero and zero times anything is zero. Therefore the effect of Income on Consumption is zero. There is no relationship, as our theory had suggested.
Notice that we have set up the presumption, the null hypothesis, as "no relationship". This puts the burden of proof on the alternative hypothesis. In other words, if we are to validate our claim of finding a relationship, we must do so with a 90, 95, or 99 percent level of confidence. The status quo is ignorance, no relationship exists, and to be able to make the claim that we have actually added to our body of knowledge we must do so with significant probability of being correct. John Maynard Keynes got it right and thus was born Keynesian economics, starting with this basic concept in 1936.
The test statistic for this test comes directly from our old friend the standardizing formula:

t_c = \frac{b_1 - \beta_1}{S_{b_1}}   (4.9)
where b1 is the estimated value of the slope of the regression line, β 1 is the hypothesized value of beta,
in this case zero, and Sb1 is the standard deviation of the estimate of b1 . In this case we are asking how
many standard deviations is the estimated slope away from the hypothesized slope. This is exactly the same
question we asked before with respect to a hypothesis about a mean: how many standard deviations is the
estimated mean, the sample mean, from the hypothesized mean?
The test statistic is written as a Student's t distribution, but if the sample size is large enough so that the degrees of freedom are greater than 30, we may again use the normal distribution. To see why we can use the Student's t or normal distribution we have only to look at S_b1, the formula for the standard deviation of the estimate of b1:
S_{b_1} = \sqrt{\frac{S_e^2}{\sum\left(x_i - \bar{x}\right)^2}}   (4.9)

or

S_{b_1} = \sqrt{\frac{S_e^2}{(n-1)\,S_x^2}}   (4.9)
where S²e is the estimate of the error variance and S²x is the variance of the x values of the independent variable being tested.
We see that Se, the estimate of the error variance, is part of the computation. Because the estimate of the error variance is based on the assumption of normality of the error terms, we can conclude that the sampling distribution of the b's, the coefficients of our hypothesized regression line, are also normally distributed.
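The whole test of the slope can be assembled from the pieces above. A Python sketch on hypothetical data (assuming a two-tailed test of H0: β1 = 0 and that scipy is available):

import math
import numpy as np
from scipy.stats import t

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n, k = len(x), 2

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
s2_e = np.sum((y - (b0 + b1 * x)) ** 2) / (n - k)

s_b1 = math.sqrt(s2_e / np.sum((x - x.mean()) ** 2))  # standard error of b1
t_c = (b1 - 0) / s_b1                                 # test H0: beta1 = 0
p_value = 2 * (1 - t.cdf(abs(t_c), df=n - k))
print(round(t_c, 2), round(p_value, 4))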
One last note concerns the degrees of freedom of the test statistic, ν = n − k. Previously we subtracted 1 from the sample size to determine the degrees of freedom in a Student's t problem. Here we must subtract one degree of freedom for each parameter estimated in the equation. For the example of the consumption function we lose 2 degrees of freedom, one for b0, the intercept, and one for b1, the slope of the consumption function. Equivalently, the degrees of freedom can be written n − k − 1, where k now counts only the independent variables and the extra one is lost because of the intercept. If we were estimating an equation with three independent variables, we would lose 4 degrees of freedom: three for the independent variables, k, and one more for the intercept.
The decision rule for acceptance or rejection of the null hypothesis follows exactly the same form as in all our previous tests of hypothesis. Namely, if the calculated value of t (or Z) falls into the tails of the distribution, where the tails are defined by α, the required significance level in the test, we cannot accept the null hypothesis. If, on the other hand, the calculated value of the test statistic is not in the tails, we cannot reject the null hypothesis.
If we conclude that we cannot accept the null hypothesis, we are able to state with (1 − α) level of confidence that the slope of the line is given by b1. This is an extremely important conclusion. Regression analysis not only allows us to test if a cause and effect relationship exists, we can also determine the magnitude of that relationship, if one is found to exist. It is this feature of regression analysis that makes it so valuable. If models can be developed that have statistical validity, we are then able to simulate the effects of changes in variables that may be under our control, with some degree of probability of course. For example, if advertising is demonstrated to affect sales, we can determine the effects of changing the advertising budget and decide if the increased sales are worth the added expense.
4.5.3 Multicollinearity
Our discussion earlier indicated that like all statistical models, the OLS regression model has important assumptions attached. Each assumption, if violated, has an effect on the ability of the model to provide useful and meaningful estimates. The Gauss-Markov Theorem has assured us that the OLS estimates are unbiased and minimum variance, but this is true only under the assumptions of the model. Here we will look at the effects on OLS estimates if the independent variables are correlated. The other assumptions, and the methods to mitigate the difficulties they pose if they are found to be violated, are examined in econometrics courses. We take up multicollinearity because it is so often prevalent in economic models and it often leads to frustrating results.
The OLS model assumes that all the independent variables are independent of each other. This assump-
tion is easy to test for a particular sample of data with simple correlation coecients. Correlation, like much
in statistics, is a matter of degree: a little is not good, and a lot is terrible.
The goal of the regression technique is to tease out the independent impacts of each of a set of independent variables on some hypothesized dependent variable. If two independent variables are interrelated, that is, correlated, then we cannot isolate the effects on Y of one from the other. In an extreme case where x1 is a linear combination of x2, correlation equal to one, both variables move in identical ways with Y. In this case it is impossible to determine the variable that is the true cause of the effect on Y. (If the two variables were actually perfectly correlated, then mathematically no regression results could actually be calculated.)
The normal equations for the coefficients show the effects of multicollinearity on the coefficients. The correlation between x1 and x2, r²x1x2, appears in the denominator of both the estimating formulas for b1 and b2. If the assumption of independence holds, then this term is zero. This indicates that there is no effect of the correlation on the coefficient. On the other hand, as the correlation between the two independent variables increases, the denominator decreases, and thus the estimate of the coefficient increases. The correlation has the same effect on both of the coefficients of these two variables. In essence, each variable is taking part of the effect on Y that should be attributed to the collinear variable. This results in biased estimates.
Multicollinearity has a further deleterious impact on the OLS estimates. The correlation between the two independent variables also shows up in the formulas for the estimate of the variance for the coefficients.

s_{b_1}^2 = \frac{s_e^2}{(n-1)\,s_{x_1}^2\left(1 - r_{x_1 x_2}^2\right)}   (4.9)

s_{b_2}^2 = \frac{s_e^2}{(n-1)\,s_{x_2}^2\left(1 - r_{x_1 x_2}^2\right)}   (4.9)
Here again we see the correlation between x1 and x2 in the denominator of the estimates of the variance for the coefficients of both variables. If the correlation is zero as assumed in the regression model, then the formula collapses to the familiar ratio of the variance of the errors to the variance of the relevant independent variable. If, however, the two independent variables are correlated, then the variance of the estimate of the coefficient increases. This results in a smaller t-value for the test of hypothesis of the coefficient. In short, multicollinearity results in failing to reject the null hypothesis that the X variable has no impact on Y when in fact X does have a statistically significant impact on Y. Said another way, the large standard errors of the estimated coefficient created by multicollinearity suggest statistical insignificance even when the hypothesized relationship is strong.
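The inflation of the coefficient variance is easy to see numerically. A Python sketch plugging hypothetical values for s²e, n, and s²x1 into the formula above, for increasing correlation between the regressors:

# how correlation between x1 and x2 inflates the variance of b1
s2_e, n, s2_x1 = 4.0, 50, 2.0  # hypothetical error variance, n, variance of x1
for r12 in (0.0, 0.5, 0.9, 0.99):
    s2_b1 = s2_e / ((n - 1) * s2_x1 * (1 - r12 ** 2))
    print(f"r = {r12:4}: var(b1) = {s2_b1:.4f}")  # grows without bound as r -> 1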
R^2 = \frac{SSR}{SST}   (4.9)

where SSR is the regression sum of squares, the squared deviation of the predicted value of y from the mean value of y, Σ(ŷ − ȳ)², and SST is the total sum of squares, which is the total squared deviation of the dependent variable, y, from its mean value, including the error term, SSE, the sum of squared errors. Figure 4.10 shows
how the total deviation of the dependent variable, y, is partitioned into these two pieces.
Figure 4.10
Figure 4.10 shows the estimated regression line and a single observation, x1. Regression analysis tries to explain the variation of the data about the mean value of the dependent variable, y. The question is: why do the observations of y vary from the average level of y? The value of y at observation x1 varies from the mean of y by the difference (yi − ȳ). The sum of these differences squared is SST, the sum of squares total. The actual value of y at x1 deviates from the estimated value, ŷ, by the difference between the estimated value and the actual value, (yi − ŷ). We recall that this is the error term, e, and the sum of these errors squared is SSE, the sum of squared errors. The deviation of the predicted value of y, ŷ, from the mean value of y is (ŷ − ȳ) and is the SSR, sum of squares regression. It is called "regression" because it is the deviation explained by the regression. (Sometimes the SSR is called SSM, for sum of squares mean, because it measures the deviation from the mean value of the dependent variable, y, as shown on the graph.)
Because SST = SSR + SSE, we see that the multiple correlation coefficient is the percent of the variance, or deviation in y from its mean value, that is explained by the equation when taken as a whole. R² will vary between zero and 1, with zero indicating that none of the variation in y was explained by the equation and a value of 1 indicating that 100% of the variation in y was explained by the equation. For time series studies expect a high R² and for cross-section data expect a low R².
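The decomposition SST = SSR + SSE can be verified directly. A Python sketch on hypothetical data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1, b0 = np.polyfit(x, y, 1)             # least-squares slope and intercept
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)  # total sum of squares
sse = np.sum((y - y_hat) ** 2)     # sum of squared errors
ssr = sst - sse                    # regression (explained) sum of squares
r2 = ssr / sst
print(round(r2, 4))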
While a high R² is desirable, remember that it is the tests of the hypothesis concerning the existence of a relationship between a set of independent variables and a particular dependent variable that was the motivating factor in using the regression model. It is validating a cause and effect relationship developed by some theory that is the true reason that we chose the regression analysis. Increasing the number of independent variables will have the effect of increasing R². To account for this effect, the proper measure of the coefficient of determination is R̄², adjusted for degrees of freedom, to keep down mindless addition of independent variables.
There is no statistical test for R² and thus little can be said about the model using R² with our characteristic confidence level. Two models that have the same size of SSE, that is sum of squared errors, may have very different R² if the competing models have different SST, total sum of squared deviations. The goodness of fit of the two models is the same; they both have the same sum of squares unexplained, errors squared, but because of the larger total sum of squares on one of the models the R² differs. Again, the real value of regression as a tool is to examine hypotheses developed from a model that predicts certain relationships among the variables. These are tests of hypotheses on the coefficients of the model and not a game of maximizing R².
Another way to test the general quality of the overall model is to test the coefficients as a group rather than independently. Because this is multiple regression (more than one X), we use the F-test to determine if our coefficients collectively affect Y. The hypothesis is:
Ho: β1 = β2 = ··· = βi = 0
Ha: "at least one of the βi is not equal to 0"
If the null hypothesis cannot be rejected, then we conclude that none of the independent variables
contribute to explaining the variation in Y. Reviewing Figure 4.10 we see that SSR, the explained sum of
squares, is a measure of just how much of the variation in Y is explained by all the variables in the model.
SSE, the sum of the errors squared, measures just how much is unexplained. It follows that the ratio of these
two can provide us with a statistical test of the model as a whole. Remembering that the F distribution is a
ratio of Chi squared distributions and that variances are distributed according to Chi Squared, and the sum
of squared errors and the sum of squares are both variances, we have the test statistic for this hypothesis as:
F_c = \frac{SSR / k}{SSE / (n-k-1)}   (4.10)

where n is the number of observations and k is the number of independent variables. It can be shown that this is equivalent to:

F_c = \frac{n-k-1}{k} \cdot \frac{R^2}{1-R^2}   (4.10)
building from Figure 4.10, where R² is the coefficient of determination, which is also a measure of the goodness of fit of the model.
As with all our tests of hypothesis, we reach a conclusion by comparing the calculated F statistic with the critical value given our desired level of confidence. If the calculated test statistic, an F statistic in this case, is in the tail of the distribution, then we cannot accept the null hypothesis. By not being able to accept the null hypothesis we conclude that this specification of the model has validity, because at least one of the estimated coefficients is significantly different from zero.
An alternative way to reach this conclusion is to use the p-value comparison rule. The p-value is the area in the tail, given the calculated F statistic. In essence, the computer is finding the F value in the table for us. The computer regression output for the calculated F statistic is typically found in the ANOVA table section labeled "Significance F". How to read the output of an Excel regression is presented below. This is the probability of NOT accepting a false null hypothesis. If this probability is less than our pre-determined alpha error, then the conclusion is that we cannot accept the null hypothesis.
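The F statistic and its tail area can be computed from the sums of squares directly. A Python sketch on hypothetical data with one regressor (assuming scipy is available; the printed p-value corresponds to the "Significance F" entry in spreadsheet output):

import numpy as np
from scipy.stats import f

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical data, one regressor
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n, k = len(x), 1                          # k = number of independent variables

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
ssr = sst - sse

F_c = (ssr / k) / (sse / (n - k - 1))
p_value = 1 - f.cdf(F_c, k, n - k - 1)    # area in the tail beyond F_c
print(round(F_c, 2), round(p_value, 4))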
ŷ = b0 + b2x2 + b1x1   (4.10)
Figure 4.11
where x2 = 0, 1. X2 is the dummy variable and X1 is some continuous random variable. The constant, b0, is the y-intercept, the value where the line crosses the y-axis. When the value of X2 = 0, the estimated line crosses at b0. When the value of X2 = 1, the estimated line crosses at b0 + b2. In effect the dummy variable causes the estimated line to shift either up or down by the size of the effect of the characteristic captured by the dummy variable. Note that this is a simple parallel shift and does not affect the impact of the other independent variable, X1. This variable is a continuous random variable and predicts different values of y at different values of X1, holding constant the condition of the dummy variable.
An example of the use of a dummy variable is the work estimating the impact of gender on salaries. There is a full body of literature on this topic and dummy variables are used extensively. For this example the salaries of elementary and secondary school teachers for a particular state are examined. Using a homogeneous job category, school teachers, and a single state reduces many of the variations that naturally affect salaries, such as differential physical risk, cost of living in a particular state, and other working conditions. The estimating equation in its simplest form specifies salary as a function of various teacher characteristics that economic theory would suggest could affect salary. These would include education level as a measure of potential productivity, and age and/or experience to capture on-the-job training, again as a measure of productivity. Because the data are for school teachers employed in public school districts rather than workers in a for-profit company, the school district's average revenue per average daily student attendance is included as a measure of ability to pay. The results of the regression analysis using data on 24,916 school teachers are presented below.
Variable                                  Coefficient   Standard Error
Intercept                                 4269.9
Gender (male = 1)                         632.38        13.39
Total Years of Experience                 52.32         1.10
Years of Experience in Current District   29.97         1.52
Education                                 629.33        13.16
Total Revenue per ADA                     90.24         3.76
R̄²                                        .725
n                                         24,916
Table 4.1
The coefficients for all the independent variables are significantly different from zero, as indicated by the standard errors. Dividing each coefficient by its standard error results in a t-value greater than 1.96, which is the required level for 95% significance. The binary variable, our dummy variable of interest in this analysis, is gender, where male is given a value of 1 and female a value of 0. The coefficient is significantly different from zero with a dramatic t-statistic of 47 standard deviations. We thus cannot accept the null hypothesis that the coefficient is equal to zero. Therefore we conclude that there is a premium paid to male teachers of $632 after holding constant experience, education and the wealth of the school district in which the teacher is employed. It is important to note that these data are from some time ago and the $632 represents a six percent salary premium at that time. A graph of this example of dummy variables is presented below.
Figure 4.12
In two dimensions, salary is the dependent variable on the vertical axis and total years of experience was chosen as the continuous independent variable on the horizontal axis. Any of the other independent variables could have been chosen to illustrate the effect of the dummy variable. The relationship between total years of experience and salary has a slope of $52.32 per year of experience, and the estimated line has an intercept of $4,269 if the gender variable is equal to zero, for female. If the gender variable is equal to 1, for male, the coefficient for the gender variable is added to the intercept and thus the relationship between total years of experience and salary is shifted upward in parallel, as indicated on the graph. Also marked on the graph are various points for reference. A female school teacher with 10 years of experience receives a salary of $4,792 on the basis of her experience only, but this is still $109 less than a male teacher with zero years of experience.
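The points marked on the graph follow directly from the Table 4.1 coefficients. A Python sketch, under the simplifying assumption (apparently used for the figures quoted in the text) that the remaining regressors are held at zero:

# prediction from the Table 4.1 coefficients, holding current-district
# experience, education, and revenue per ADA at zero
def predicted_salary(total_experience, male):
    return 4269.9 + 632.38 * male + 52.32 * total_experience

female_10yr = predicted_salary(10, male=0)  # about $4,793 (the text's $4,792)
male_0yr = predicted_salary(0, male=1)      # about $4,902, roughly $109 more
print(round(female_10yr), round(male_0yr))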
A more complex interaction between a dummy variable and the dependent variable can also be estimated. It may be that the dummy variable has more than a simple shift effect on the dependent variable, but also interacts with one or more of the other continuous independent variables. While not tested in the example above, it could be hypothesized that the impact of gender on salary was not a one-time shift, but also impacted the value of additional years of experience on salary. That is, female school teachers' salaries were discounted at the start, and further did not grow at the same rate from the effect of experience as male school teachers' salaries did. This would show up as a different slope for the relationship between total years of experience for males than for females. If this is so, then female school teachers would not just start behind their male colleagues (as measured by the shift in the estimated regression line), but would fall further and further behind as time and experience increased.
The graph below shows how this hypothesis can be tested with the use of dummy variables and an
interaction variable.
Figure 4.13
The estimating equation shows how the slope of X1, the continuous random variable experience, contains two parts, b1 and b3. This occurs because the new variable X2X1, called the interaction variable, was created to allow for an effect on the slope of X1 from changes in X2, the binary dummy variable. Note that when the dummy variable, X2 = 0, the interaction variable has a value of 0, but when X2 = 1 the interaction variable has a value of X1. The coefficient b3 is an estimate of the difference in the coefficient of X1 when X2 = 1 compared to when X2 = 0. In the example of teachers' salaries, if there is a premium paid to male teachers that affects the rate of increase in salaries from experience, then the rate at which male teachers' salaries rise would be b1 + b3 and the rate at which female teachers' salaries rise would be simply b1. This hypothesis can be tested with the hypothesis:
H0: β3 = 0 | β1 = 0, β2 = 0   (4.13)
Ha: β3 ≠ 0 | β1 ≠ 0, β2 ≠ 0   (4.13)
This is a t-test using the test statistic for the parameter β3. If we cannot accept the null hypothesis that β3 = 0, we conclude there is a difference between the rate of increase for the group for whom the value of the binary variable is set to 1, males in this example. This estimating equation can be combined with our earlier one that tested only a parallel shift in the estimated line. The earnings/experience functions in Figure 4.13 are drawn for this case, with a shift in the earnings function and a difference in the slope of the function with respect to total years of experience.
Example 4.5
A random sample of 11 statistics students produced the following data, where x is the third exam score out of 80, and y is the final exam score out of 200. Can you predict the final exam score of a randomly selected student if you know the third exam score?
x (third exam score)   y (final exam score)
65                     175
67                     133
71                     185
71                     163
66                     126
75                     198
67                     153
70                     163
71                     159
69                     151
69                     159
(a)
(b)
Figure 4.14: (a) Table showing the scores on the final exam based on scores from the third exam. (b) Scatter plot showing the scores on the final exam based on scores from the third exam.
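The least-squares line for these data can be found with a few lines of Python; the fitted slope and intercept should come out near ŷ = −173.5 + 4.83x.

import numpy as np

x = np.array([65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69])            # third exam
y = np.array([175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159])  # final exam

b1, b0 = np.polyfit(x, y, 1)   # least-squares slope and intercept
r = np.corrcoef(x, y)[0, 1]    # sample correlation coefficient
print(round(b0, 2), round(b1, 2), round(r, 4))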
4.5.6
Exercise 4.5.1 (Solution on p. 311.)
Suppose that you have at your disposal the information below for each of 30 drivers. Propose
a model (including a very brief indication of symbols used to represent independent variables) to
explain how miles per gallon vary from driver to driver on the basis of the factors measured.
Information:
a. negative.
b. low.
c. heterogeneous.
d. between two measures that are unreliable.
Y = a + b1X + b2X²   (4.14)

where the values of X are simply squared and put into the equation as a separate variable.
There is much more in the way of econometric "tricks" that can bypass some of the more troublesome assumptions of the general regression model. This statistical technique is so valuable that further study would provide any student significant, statistically significant, dividends.
Price elasticity: \eta_p = \frac{\%\Delta Q}{\%\Delta P}   (4.14)

where η is the Greek lowercase letter eta, used to designate elasticity, and ∆ is read as "change".

Income elasticity: \eta_Y = \frac{\%\Delta Q}{\%\Delta Y}   (4.14)

Cross-price elasticity: \eta_{p_1} = \frac{\%\Delta Q_1}{\%\Delta P_2}   (4.14)

where P2 is the price of the substitute good.
Examining the price elasticity more closely, we can write the formula as:

\eta_p = \frac{\%\Delta Q}{\%\Delta P} = \frac{dQ}{dP}\cdot\frac{P}{Q} = b\,\frac{P}{Q}   (4.14)

where b is the estimated coefficient for price in the OLS regression.
6 This content is available online at <http://cnx.org/content/m64846/1.7/>.
The first form of the equation demonstrates the principle that elasticities are measured in percentage terms. Of course, the ordinary least squares coefficients provide an estimate of the impact of a unit change in the independent variable, X, on the dependent variable measured in units of Y. These coefficients are not elasticities, however, and this is shown in the second way of writing the formula, where dQ/dP, the derivative of the estimated demand function, is simply the slope of the regression line. Multiplying the slope by P/Q provides an elasticity measured in percentage terms.
Along a straight-line demand curve the percentage change, and thus the elasticity, changes continuously as the scale changes, while the slope, the estimated regression coefficient, remains constant. Going back to the demand for gasoline, a change in price from $3.00 to $3.50 was a 16 percent increase in price. If the beginning price were $5.00 then the same 50¢ increase would be only a 10 percent increase, generating a different elasticity. Every straight-line demand curve has a range of elasticities, starting at the top left, high prices, with large elasticity numbers, elastic demand, and decreasing as one goes down the demand curve, inelastic demand.
In order to provide a meaningful estimate of the elasticity of demand, the convention is to estimate the elasticity at the point of means. Remember that all OLS regression lines will go through the point of means. At this point is the greatest weight of the data used to estimate the coefficient. The formula to estimate an elasticity when an OLS demand curve has been estimated becomes:

\eta_p = b\,\frac{\bar{P}}{\bar{Q}}   (4.14)

where P̄ and Q̄ are the mean values of the data used to estimate b, the price coefficient.
The same method can be used to estimate the other elasticities for the demand function by using the
appropriate mean values of the other variables; income and price of substitute goods for example.
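The point-of-means calculation is a one-liner once the regression has been run. A Python sketch with hypothetical values for the estimated price coefficient and the sample means:

b = -2.0                   # hypothetical estimated price coefficient from OLS
P_bar, Q_bar = 3.25, 10.0  # hypothetical mean price and mean quantity

eta_p = b * P_bar / Q_bar  # elasticity at the point of means
print(eta_p)               # -0.65: demand is inelastic at the point of means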
good. All three of these cases can be estimated by transforming the data to logarithms before running the
regression. The resulting coecients will then provide a percentage change measurement of the relevant
variable.
To summarize, there are four cases:
Case 1: The ordinary least squares case begins with the linear model developed above:

Y = a + bX   (4.14)

where the coefficient of the independent variable, b = dY/dX, is the slope of a straight line and thus measures the impact of a unit change in X on Y, measured in units of Y.
Case 2: The underlying estimated equation is:

\frac{dY}{Y} = b\,dX   (4.14)

Multiplying by 100 to convert to percentages and rearranging terms gives:

100b = \frac{\%\Delta Y}{\text{unit } \Delta X}   (4.14)

100b is thus the percentage change in Y resulting from a unit change in X.
Case 3: In this case the question is: what is the unit change in Y resulting from a percentage change in X? What is the dollar loss in revenues from a five percent increase in price, or what is the total dollar cost impact of a five percent increase in labor costs? The estimated equation for this case would be:

dY = b\,d(\log X)   (4.14)

which, since d(\log X) = dX/X, is:

dY = b\,\frac{dX}{X}   (4.14)

Dividing by 100 to get percentages and rearranging terms gives:

\frac{b}{100} = \frac{dY}{100\,\frac{dX}{X}} = \frac{\text{unit } \Delta Y}{\%\Delta X}   (4.14)

Therefore, b/100 is the increase in Y measured in units from a one percent increase in X.
Case 4: This is the elasticity case where both the dependent and independent variables are converted to
logs before the OLS estimation. This is known as the log-log case or double log case, and provides us with
direct estimates of the elasticities of the independent variables. The estimated equation is:
d(logY) = b·d(logX)   (4.14)

thus:

(1/Y)·dY = b·(1/X)·dX   OR   dY/Y = b·(dX/X)   OR   b = (dY/dX)·(X/Y)   (4.14)
and b = (%∆Y)/(%∆X), our definition of elasticity. We conclude that we can directly estimate the elasticity of a variable through a double log transformation of the data. The estimated coefficient is the elasticity. It is common to use a double log transformation of all variables in the estimation of demand functions in order to get estimates of all the various elasticities of the demand curve.
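A minimal sketch of the double log case, using the same hypothetical data as above: after taking logs of both variables, the fitted slope is read directly as the price elasticity.

import numpy as np

price = np.array([3.00, 3.25, 3.50, 3.75, 4.00, 4.50])     # hypothetical
quantity = np.array([18.0, 17.0, 15.5, 14.8, 13.9, 12.0])  # hypothetical

# Regress log(Q) on log(P); the slope of the double log fit is the elasticity.
slope, intercept = np.polyfit(np.log(price), np.log(quantity), 1)
print(f"estimated price elasticity = {slope:.2f}")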
4.6.2
Exercise 4.6.1 (Solution on p. 311.)
In a linear regression, why do we need to be concerned with the range of the independent (X)
variable?
Exercise 4.6.2 (Solution on p. 311.)
Suppose one collected the following information where X is diameter of tree trunk and Y is tree
height.
X Y
4 8
2 4
8 18
6 22
10 30
6 8
Table 4.2
Regression equation: ŷi = −3.6 + 3.1·Xi
What is your estimate of the average height of all trees having a trunk diameter of 7 inches?
Exercise 4.6.3 (Solution on p. 311.)
The manufacturers of a chemical used in flea collars claim that under standard test conditions each additional unit of the chemical will bring about a reduction of 5 fleas (i.e. where Xj = amount of chemical and Yj = B0 + B1·Xj + Ej, H0: B1 = −5).
Suppose that a test has been conducted and results from a computer include:
Intercept = 60
Slope = −4
Standard error of the regression coefficient = 1.0
Degrees of Freedom for Error = 2000
95% Confidence Interval for the slope: (−5.96, −2.04)
Is this evidence consistent with the claim that the number of fleas is reduced at a rate of 5 fleas per unit of chemical?
will result in an estimate of the dependent variable which is minimum variance and unbiased. That is to say that from this equation comes the best unbiased point estimate of y given the values of x:

ŷ = b0 + b1X1i + · · · + bkXki   (4.14)
Remember that point estimates do not carry a particular level of probability, or level of confidence, because points have no width above which there is an area to measure. This is why we developed confidence intervals for the mean and proportion earlier. The same concern arises here. There are actually two different approaches to developing estimates of the effect of changes in the independent variable, or variables, on the dependent variable. The first approach measures the expected mean value of y resulting from a specific value of x. Here the question is: what is the mean impact on y that would result from multiple hypothetical experiments on y at this specific value of x? Remember that there is a variance around the estimated parameter of x, and thus each experiment will result in a somewhat different estimate of the predicted value of y.
The second approach to estimating the effect of a specific value of x on y treats the event as a single experiment: you choose x and multiply it by the coefficient, and that provides a single estimate of y. Because this approach acts as if there were a single experiment, the variance that exists in the parameter estimate is larger than the variance associated with the expected value approach.
The conclusion is that we have two different ways to predict the effect of values of the independent variable(s) on the dependent variable, and thus we have two different intervals. Both are correct answers to the question being asked, but there are two different questions. To avoid confusion, the first case, where we are asking for the expected value of the mean of the estimated y, is called a confidence interval, as we have named this concept before. The second case, where we are asking for the estimate of the impact on the dependent variable y of a single experiment using a value of x, is called the prediction interval. The test statistics for these two interval measures, within which the estimated value of y will fall, are:
Confidence Interval for the Expected Value of the Mean Value of y for x = xp:

ŷ = b0 + b1xp ± tα/2·se·√(1/n + (xp − x̄)²/sx)   (4.14)

Prediction Interval for an Individual y for x = xp:

ŷ = b0 + b1xp ± tα/2·se·√(1 + 1/n + (xp − x̄)²/sx)   (4.14)

Where se is the standard deviation of the error term and sx is the standard deviation of the x variable.
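As a hedged illustration of the two intervals (the data below are invented for this sketch), statsmodels in Python reports the confidence interval for the mean of y and the prediction interval for an individual y side by side:

import numpy as np
import statsmodels.api as sm

x = np.array([65.0, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69])   # hypothetical
y = np.array([175.0, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159])

model = sm.OLS(y, sm.add_constant(x)).fit()
X_new = np.column_stack([np.ones(2), [66.0, 73.0]])  # points at which to predict
frame = model.get_prediction(X_new).summary_frame(alpha=0.05)  # 95% level
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",   # confidence interval
             "obs_ci_lower", "obs_ci_upper"]])           # prediction interval

For any chosen x the obs_ci (prediction) interval is wider than the mean_ci (confidence) interval, and both widen as x moves away from x̄.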
The mathematical computations of these two test statistics are complex. Various computer regression software packages provide programs within the regression functions to provide answers to inquiries about estimated predicted values of y given chosen values of the x variable(s). It is important to know just which interval is being computed by the software package, because the difference in the size of the standard deviations will change the size of the estimated interval. This is shown in Figure 4.15.
Figure 4.15: Prediction and confidence intervals for the regression equation; 95% confidence level.
Figure 4.15 shows visually the difference the standard deviation makes in the size of the estimated intervals. The confidence interval, measuring the expected value of the dependent variable, is smaller than the prediction interval for the same level of confidence. The expected value method assumes that the experiment is conducted multiple times rather than just once, as in the other method. The logic here is similar, although not identical, to that discussed when developing the relationship between the sample size and the confidence interval using the Central Limit Theorem. There, as the number of experiments increased, the distribution narrowed and the confidence interval became tighter around the expected value of the mean.
It is also important to note that the intervals around a point estimate are highly dependent upon the range of data used to estimate the equation, regardless of which approach is being used for prediction. Remember that all regression equations go through the point of means, that is, the mean value of y and the mean values of all independent variables in the equation. As the value of x chosen to estimate the associated value of y moves further from the point of means, the width of the estimated interval around the point estimate increases. Choosing values of x beyond the range of the data used to estimate the equation poses even greater danger of creating estimates with little use: very large intervals and risk of error. Figure 4.16 shows this relationship.
Figure 4.16: Confidence interval for an individual value of x, Xp, at the 95% level of confidence
Figure 4.16 demonstrates the concern for the quality of the estimated interval, whether it is a prediction interval or a confidence interval. As the value chosen to predict y, Xp in the graph, moves further from the central weight of the data, x̄, we see the interval expand in width even while holding constant the level of confidence. This shows that the precision of any estimate will diminish as one tries to predict beyond the largest weight of the data, and most certainly will degrade rapidly for predictions beyond the range of the data. Unfortunately, this is just where most predictions are desired. They can be made, but the width of the confidence interval may be so large as to render the prediction useless. Only actual calculation and the particular application can determine this, however.
Example 4.6
Recall the third exam/final exam example (Example 4.5).
We found the equation of the best-fit line for the final exam grade as a function of the grade on the third exam. We can now use the least-squares regression line for prediction. Assume the coefficient for X was determined to be significantly different from zero.
Suppose you want to estimate, or predict, the mean final exam score of statistics students who received 73 on the third exam. The exam scores (x-values) range from 65 to 75. Since 73 is between the x-values 65 and 75, we feel comfortable substituting x = 73 into the equation. Then:

ŷ = −173.51 + 4.83(73) = 179.08   (4.16)

We predict that statistics students who earn a grade of 73 on the third exam will earn a grade of 179.08 on the final exam, on average.
Problem 1
a. What would you predict the final exam score to be for a student who scored a 66 on the third exam?
Solution
a. 145.27
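As a quick check, both predictions are just arithmetic on the fitted line (a one-line sketch):

# Evaluate the fitted line y-hat = -173.51 + 4.83x at the two exam scores.
for x in (73, 66):
    print(f"x = {x}: predicted final exam score = {-173.51 + 4.83 * x:.2f}")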
4.7.1
Exercise 4.7.1 (Solution on p. 312.)
True or False? If False, correct it: Suppose you are performing a simple linear regression of Y on X and you test the hypothesis that the slope β is zero against a two-sided alternative. You have n = 25 observations and your computed test (t) statistic is 2.6. Then your P-value is given by .01 < P < .02, which gives borderline significance (i.e. you would reject H0 at α = .02 but fail to reject H0 at α = .01).
Exercise 4.7.2 (Solution on p. 312.)
An economist is interested in the possible influence of "Miracle Wheat" on the average yield of wheat in a district. To do so he fits a linear regression of average yield per year against year after introduction of "Miracle Wheat" for a ten-year period.
The fitted trend line is:

ŷj = 80 + 1.5·Xj

(Yj: average yield in the j-th year after introduction)
(Xj: the j-th year after introduction)
a. What is the estimated average yield for the fourth year after introduction?
b. Do you want to use this trend line to estimate yield for, say, 20 years after introduction?
Why? What would your estimate be?
a. most
b. half
c. very little
d. one quarter
e. none of these
a. r = 1.18
b. r = −.77
c. r = .68
This section of this chapter is here in recognition that what we are now asking requires much more than a quick calculation of a ratio or a square root. Indeed, the use of regression analysis was almost non-existent before the middle of the last century, and it did not really become a widely used tool until perhaps the late 1960's and early 1970's. Even then, the computational ability of the largest IBM machines was laughable by today's standards. In the early days programs were developed by the researchers and shared. There was no market for something called "software", and certainly nothing called "apps", an entrant into the market only a few years old.
8 This content is available online at <http://cnx.org/content/m55852/1.18/>.
With the advent of the personal computer and the explosion of a vital software market, we have a number of regression and statistical analysis packages to choose from. Each has its merits. We have chosen Microsoft Excel because of its widespread availability both on college campuses and in the post-college marketplace. Stata is an alternative and has features that will be important for more advanced econometrics study if you choose to follow that path. Even more advanced packages exist, but they typically require the analyst to do some significant amount of programming to conduct the analysis. The goal of this section is to demonstrate how to use Excel to run a regression and then to do so with an example of a simple version of a demand curve.
The first step to doing a regression using Excel is to load the program into your computer. If you have Excel you have the Analysis ToolPak, although you may not have it activated. The program calls upon a significant amount of space, so it is not loaded automatically.
To activate the Analysis ToolPak follow these steps:
Click File > Options > Add-ins to bring up a menu of the add-in ToolPaks. Select Analysis
ToolPak and click GO next to Manage: excel add-ins near the bottom of the window. This will open
a new window where you click Analysis ToolPak (make sure there is a green check mark in the box) and
then click OK. Now there should be a Data Analysis option under the Data menu. These steps are presented in the following screenshots.
Figure 4.17
Figure 4.18
Figure 4.19
Figure 4.20
Click Data, then Data Analysis, and then click Regression and OK. Congratulations, you have made it to the regression window. The window asks for your inputs. Clicking the box next to the Y and X ranges will allow you to use the click-and-drag feature of Excel to select your input ranges. Excel has one odd quirk: the click-and-drag feature requires that the independent variables, the X variables, are all together, meaning that they form a single matrix. If your data are set up with the Y variable between two columns of X variables, Excel will not allow you to use click and drag. As an example, say Column A and Column C are independent variables and Column B is the Y variable, the dependent variable. Excel will not let you click and drag these data ranges. The solution is to move the column with the Y variable to Column A, and then you can click and drag. The same problem arises again if you want to run the regression with only some of the X variables. You will need to set up the matrix so all the X variables you wish to regress are in a tightly formed matrix. These steps are presented in the following screenshots.
Figure 4.21
Figure 4.22
Once you have selected the data for your regression analysis and told Excel which one is the dependent variable (Y) and which ones are the independent variables (X's), you have several choices as to the parameters and how the output will be displayed. Refer to screenshot Figure 4.22 under the Input section. If you check the labels box, the program will use the entry in the first cell of each variable's column as its name in the output. You can enter an actual name, such as price or income in a demand analysis, in row one of the Excel spreadsheet for each variable and it will be displayed in the output.
The level of significance can also be set by the analyst. This will not change the calculated t statistic, called the t stat, but it will alter the boundaries of the confidence intervals for the coefficients. A 95 percent confidence interval is always presented, but with a change in this setting you will also get intervals at other levels of confidence.
Excel also will allow you to suppress the intercept. This forces the regression program to minimize the residual sum of squares under the condition that the estimated line must go through the origin. This is done in cases where the model has no meaning at any value other than zero, zero for the start of the line. An example is an economic production function, a relationship between the number of units of an input, say hours of labor, and output. There is no meaning to positive output with zero workers.
Once the data are entered and the choices are made, click OK and the results will be sent to a separate new worksheet by default. The output from Excel is presented in a way typical of other regression packages. The first block of information gives the overall statistics of the regression: Multiple R, R Squared, and the R Squared adjusted for degrees of freedom, which is the one you want to report. You also get the Standard Error (of the estimate) and the number of observations in the regression.
The second block of information is titled ANOVA, which stands for Analysis of Variance. Our interest in this section is the column marked F. This is the calculated F statistic for the null hypothesis that all of the coefficients are equal to zero versus the alternative that at least one of the coefficients is not equal to zero. This hypothesis test was presented in 13.4 under "How Good is the Equation?" The next column gives the p value for this test under the title Significance F. If the p value is less than, say, 0.05 (the calculated F statistic is in the tail), we can say with 95 percent confidence that we cannot accept the null hypothesis that all the coefficients are equal to zero. This is a good thing: it means that at least one of the coefficients is significantly different from zero and thus does have an effect on the value of Y.
The last block of information contains the hypothesis tests for the individual coefficients. The estimated coefficients, the intercept and the slopes, are listed first, followed by each standard error (of the estimated coefficient) and then the t stat (the calculated Student's t statistic for the null hypothesis that the coefficient is equal to zero). We compare the t stat with the critical value of the Student's t, dependent on the degrees of freedom, and determine whether we have enough evidence to reject the null that the variable has no effect on Y. Remember that we have set up the null hypothesis as the status quo, and our claim that we know what caused the Y to change is in the alternative hypothesis. We want to reject the status quo and substitute our version of the world, the alternative hypothesis. The next column contains the p values for this hypothesis test, followed by the estimated upper and lower bounds of the confidence interval of the estimated slope parameter for the various levels of confidence set by us at the beginning.
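For readers who prefer a scripting tool, the same workflow can be sketched in Python with statsmodels; its summary mirrors the three Excel blocks (overall statistics, the ANOVA F test, and the per-coefficient table). The data here are simulated placeholders:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                       # two hypothetical X variables
y = 5 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=30)

results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.summary())               # R-squared, F stat, t stats, p values, CIs
print(results.conf_int(alpha=0.10))    # e.g. 90% confidence intervals as well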
Once your data are entered into the spreadsheet, it is always good to look at them. Examine the range, the means and the standard deviations. Use your understanding of descriptive statistics from the very first part of this course. In large data sets you will not be able to scan the data, but the Analysis ToolPak makes it easy to get the range, mean, standard deviations and other parameters of the distributions. You can also quickly get the correlations among the variables. Examine for outliers. Review the history. Did something happen? Was there a labor strike, a change in import fees, something that makes these observations unusual? Do not take the data without question. There may have been a typo somewhere; who knows without review.
Go to the regression window, enter the data, select the 95% confidence level, and click OK. You can include the labels in the input range if you have put a title at the top of each column, but be sure to check the labels box on the main regression page if you do.
The regression output should show up automatically on a new worksheet.
Figure 4.23
The first result presented is the R-Square, a measure of the strength of the correlation between Y and X1, X2, and X3 taken as a group. Our R-square here of 0.699, adjusted for degrees of freedom, means that 70% of the variation in Y, the demand for roses, can be explained by variations in X1, X2, and X3, the price of roses, the price of carnations, and income. There is no statistical test to determine the significance of an R². Of course a higher R² is preferred, but it is really the significance of the coefficients that will determine the value of the theory being tested and which will become part of any policy discussion if they are demonstrated to be significantly different from zero.
Looking at the third panel of output, we can write the equation as:

Y = b0 + b1X1 + b2X2 + b3X3 + e   (4.23)

where b0 is the intercept, b1 is the estimated coefficient on the price of roses, b2 is the estimated coefficient on the price of carnations, b3 is the estimated effect of income, and e is the error term. The equation is written in Roman letters indicating that these are the estimated values and not the population parameters, the β's.
Our estimated equation is:

Quantity of roses sold = 183,475 − 1.76·(Price of roses) + 1.33·(Price of carnations) + 3.03·(Income)   (4.23)
We first observe that the signs of the coefficients are as expected from theory. The demand curve is downward sloping, with the negative sign for the price of roses. Further, the signs of both the price of carnations and income coefficients are positive, as would be expected from economic theory.
Interpreting the coefficients can tell us the magnitude of the impact of a change in each variable on the demand for roses. It is the ability to do this which makes regression analysis such a valuable tool. The estimated coefficients tell us that an increase in the price of roses by one dollar will lead to a 1.76 unit reduction in the number of roses purchased. The price of carnations seems to play an important role in the demand for roses, as we see that increasing the price of carnations by one dollar would increase the demand for roses by 1.33 units, as consumers substitute away from the now more expensive carnations. Similarly, increasing per capita income by one dollar will lead to a 3.03 unit increase in roses purchased.
These results are in line with the predictions of economic theory with respect to all three variables included in this estimate of the demand for roses. It is important to have a theory first that predicts the significance, or at least the direction, of the coefficients. Without a theory to test, this research tool is not much more helpful than the correlation coefficients we learned about earlier.
We cannot stop there, however. We first need to check whether our coefficients are statistically significantly different from zero. We set up the hypotheses:

H0: β1 = 0   (4.23)

Ha: β1 ≠ 0   (4.23)

for all three coefficients in the regression. Recall from earlier that we will not be able to say definitively that our estimated b1 is the actual population value β1, but only, at a (1 − α)% level of confidence, whether we can reject the null hypothesis that β1 is zero. The analyst is making a claim that the price of roses causes an impact on quantity demanded, and indeed that each of the included variables has an impact on the quantity of roses demanded. The claim is therefore in the alternative hypothesis. It will take a high level of confidence, 0.95 in this case, to overthrow the null hypothesis, the status quo, that β = 0. In all regression hypothesis tests the claim is in the alternative, and the claim is that the theory has found a variable that has a significant impact on the Y variable.
The test statistic for this hypothesis follows the familiar standardizing formula, which counts the number of standard deviations, t, that the estimated value of the parameter, b1, is away from the hypothesized value, β0, which is zero in this case:

tc = (b1 − β0)/Sb1   (4.23)
The computer calculates this test statistic and presents it as the t stat. You can find this value to the right of the standard error of the coefficient estimate. The standard error of the coefficient for b1 is Sb1 in the formula. To reach a conclusion we compare this test statistic with the critical value of the Student's t at n − 3 − 1 = 29 degrees of freedom and α = 0.025 (a 5% significance level for a two-tailed test). Our t stat for b1 is approximately 5.90, which is greater than the critical value of about 2.05 from the t-table, so we cannot accept our null hypothesis of no effect. We conclude that price has a significant effect because the calculated t value is in the tail. We conduct the same test for b2 and b3. For each variable, we find that we cannot accept the null hypothesis of no relationship because the calculated t-statistics are in the tail for each case, that is, greater than the critical value. All variables in this regression have been determined to have a significant effect on the demand for roses.
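The individual coefficient test can be reproduced by hand. In this sketch the standard error is inferred from the t stat of about 5.90 reported above, so treat the numbers as approximations:

from scipy import stats

b1, se_b1, df = -1.76, 0.298, 29           # rose-price coefficient; SE = |b1|/5.90
t_stat = (b1 - 0.0) / se_b1                # standardizing formula with beta_0 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df)  # two-tailed p value
print(f"t stat = {t_stat:.2f}, p value = {p_value:.6f}")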
These tests tell us whether or not an individual coefficient is significantly different from zero, but they do not address the overall quality of the model. We have seen that the R squared adjusted for degrees of freedom indicates this model with these three variables explains 70% of the variation in quantity of roses demanded. We can also conduct a second test of the model taken as a whole. This is the F test presented in section 13.4 of this chapter. Because this is a multiple regression (more than one X), we use the F test to determine if our coefficients collectively affect Y. The hypothesis is:

H0: β1 = β2 = ... = βi = 0   (4.23)

versus the alternative that at least one of the coefficients is not equal to zero.
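A reported F statistic can likewise be turned into the Significance F value Excel prints. The F statistic below is hypothetical; k is the number of X variables and n the number of observations:

from scipy import stats

f_stat, k, n = 21.9, 3, 33                   # hypothetical values
p_value = stats.f.sf(f_stat, k, n - k - 1)   # upper-tail area of F(k, n-k-1)
print(f"Significance F = {p_value:.6f}")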
4.8.2
Exercise 4.8.1 (Solution on p. 312.)
A computer program for multiple regression has been used to fit ŷj = b0 + b1·X1j + b2·X2j + b3·X3j.
Part of the computer output includes:
i bi Sbi
0 8 1.6
1 2.2 .24
2 -.72 .32
3 0.005 0.002
Table 4.3
Table 4.4
Table 4.5
The t calculated for variables 1, 2, and 3 would be 5 or larger in absolute value, while that for variable 4 would be less than 1. For most significance levels, the hypothesis β1 = 0 would be rejected. But notice that this is for the case when X2, X3, and X4 have been included in the regression. For most significance levels, the hypothesis β4 = 0 would be retained for the case where X1, X2, and X3 are in the regression. Often this pattern of results will lead to computing another regression involving only X1, X2, X3, and examining the t ratios produced for that case.
Solution to Exercise 4.3.2 (p. 265)
c. those who score low on one test tend to score low on the other.
Solution to Exercise 4.4.1 (p. 266)
No, the graph is not a straight line; therefore, it is not a linear equation.
Solution to Exercise 4.4.2 (p. 268)
False. Since H0 : β = −1 would not be rejected at α = 0.05, it would not be rejected at α = 0.01.
Solution to Exercise 4.4.3 (p. 268)
True
Solution to Exercise 4.4.4 (p. 268)
d
Solution to Exercise 4.4.5 (p. 268)
Some variables seem to be related, so that knowing one variable's status allows us to predict the status of the other. This relationship can be measured and is called correlation. However, a high correlation between two variables in no way proves that a cause-and-effect relation exists between them. It is entirely possible that a third factor causes both variables to vary together.
Solution to Exercise 4.4.6 (p. 268)
True
Solution to Exercise 4.5.1 (p. 288)
Yj = b0 + b1 · X1 + b2 · X2 + b3 · X3 + b4 · X4 + b5 · X6 + ej
Solution to Exercise 4.5.2 (p. 288)
d. there is a perfect negative relationship between Y and X in the sample.
Solution to Exercise 4.5.3 (p. 288)
b. low
Solution to Exercise 4.6.1 (p. 293)
The precision of the estimate of the Y variable depends on the range of the independent (X) variable
explored. If we explore a very small range of the X variable, we won't be able to make much use of the
regression. Also, extrapolation is not recommended.
Solution to Exercise 4.6.2 (p. 293)
ŷ = −3.6 + (3.1·7) = 18.1
Solution to Exercise 4.6.3 (p. 293)
Most simply, since −5 is included in the confidence interval for the slope, we can conclude that the evidence is consistent with the claim at the 95% confidence level.
Using a t test:
H0: B1 = −5
HA: B1 ≠ −5
t_calculated = (−4 − (−5))/1 = 1
t_critical = ±1.96
Since |t_calculated| < t_critical, we retain the null hypothesis that B1 = −5.
Solution to Example 4.6, Problem 2 (p. 297)
b. The x values in the data are between 65 and 75. Ninety is outside of the domain of the observed x values in the data (independent variable), so you cannot reliably predict the final exam score for this student. (Even though it is possible to enter 90 into the equation for x and calculate a corresponding y value, the y value that you get will have a confidence interval that may not be meaningful.)
To understand just how unreliable the prediction can be outside of the range of x values observed in the data, make the substitution x = 90 into the equation:

ŷ = −173.51 + 4.83(90) = 261.19

The final exam score is predicted to be 261.19, yet the largest the final exam score can be is 200.
a. 80 + 1.5 · 4 = 86
b. No. Most business statisticians would not want to extrapolate that far. If someone did, the estimate
would be 110, but some other factors probably come into play with 20 years.
a. −.72, .32
b. the t value
c. the t value
a. The population value for β2 , the change that occurs in Y with a unit change in X2 , when the other
variables are held constant.
b. The population value for the standard error of the distribution of estimates of β2 .
c. .8, .1, 16 = 20 − 4.
Glossary
C Categorical Variable
variables that take on values that are names or labels
Cluster Sampling
a method for selecting a random sample and dividing the population into groups (clusters); use
simple random sampling to select a set of clusters. Every individual in the chosen clusters is
included in the sample.
Conditional Probability
the likelihood that an event will occur given that another event has already occurred
Contingency Table
the method of displaying a frequency distribution as a table with rows and columns to show how
two variables may be dependent (contingent) upon each other; the table provides an easy way to
calculate conditional probabilities.
D Data
a set of observations (a set of possible outcomes); most data used in statistical research can be
put into two groups: categorical (an attribute whose value is a label) or quantitative (an
attribute whose value is indicated by a number). Categorical data can be separated into two
subgroups: nominal and ordinal. Data is nominal if it cannot be meaningfully ordered. Data
is ordinal if the data can be meaningfully ordered. Quantitative data can be separated into two
subgroups: discrete and continuous. Data is discrete if it is the result of counting (such as the
number of students of a given ethnic group in a class or the number of books on a shelf). Data
is continuous if it is the result of measuring (such as distance traveled or weight of luggage)
Dependent Events
If two events are NOT independent, then we say that they are dependent.
Discrete Random Variable
a random variable (RV) whose outcomes are counted
Double-blinding
the act of blinding both the subjects of an experiment and the researchers who work with the
subjects
E Equally Likely
Each outcome of an experiment has the same probability.
Event
a subset of the set of all outcomes of an experiment; the set of all outcomes of an experiment is
called a sample space and is usually denoted by S. An event is an arbitrary subset in S. It can
contain one outcome, two outcomes, no outcomes (empty subset), the entire sample space, and
the like. Standard notations for events are capital letters such as A, B, C, and so on.
Experiment
a planned activity carried out under controlled conditions
Experimental Unit
any individual or object to be measured
Explanatory Variable
the independent variable in an experiment; the value controlled by researchers
F First Quartile
the value that is the median of the lower half of the ordered data set
Frequency Polygon
looks like a line graph but uses intervals to display ranges of large amounts of data
Frequency Table
a data representation in which grouped data is displayed along with the corresponding
frequencies
Frequency
the number of times a value of the data occurs
H Histogram
a graphical representation in x-y form of the distribution of data in a data set; x represents the
data and y represents the frequency, or relative frequency. The graph consists of contiguous
rectangles.
Hypothesis
a statement about the value of a population parameter; in the case of two hypotheses, the statement assumed to be true is called the null hypothesis (notation H0) and the contradictory statement is called the alternative hypothesis (notation Ha).
I Independent Events
The occurrence of one event has no effect on the probability of the occurrence of another event.
Events A and B are independent if one of the following is true:
• P(A|B) = P(A)
• P(B|A) = P(B)
• P(A ∩ B) = P(A)P(B)
Informed Consent
Any human subject in a research study must be cognizant of any risks or costs associated with
the study. The subject has the right to know the nature of the treatments included in the study,
their potential risks, and their potential benefits. Consent must be given freely by an informed, fit participant.
Institutional Review Board
a committee tasked with oversight of research programs that involve human subjects
Interval
also called a class interval; an interval represents a range of data and is used when displaying
large data sets
L Linear
a model that takes data and regresses it into a straight line equation.
Lurking Variable
a variable that has an effect on a study even though it is neither an explanatory variable nor a
response variable
M Mean
a number that measures the central tendency of the data; a common name for mean is 'average.' The term 'mean' is a shortened form of 'arithmetic mean.' By definition, the mean for a sample (denoted by x̄) is x̄ = (sum of all values in the sample)/(number of values in the sample), and the mean for a population (denoted by µ) is µ = (sum of all values in the population)/(number of values in the population).
Median
a number that separates ordered data into halves; half the values are the same number or smaller
than the median and half the values are the same number or larger than the median. The
median may or may not be part of the data.
Midpoint
the mean of an interval in a frequency table
Mode
the value that appears most frequently in a set of data
Multivariate
a system or model where more than one independent variable is being used to predict an
outcome. There can only ever be one dependent variable, but there is no limit to the number of
independent variables.
Mutually Exclusive
Two events are mutually exclusive if the probability that they both happen at the same time is
zero. If events A and B are mutually exclusive, then P(A ∩ B) = 0.
N Nonsampling Error
an issue that affects the reliability of sampling data other than natural variation; it includes a
variety of human errors including poor study design, biased sampling methods, inaccurate
information provided by study participants, data entry errors, and poor analysis.
Normal Distribution
a continuous random variable (RV) with pdf f(x) = (1/(σ√(2π)))·e^(−(x − µ)²/(2σ²)), where µ is the mean of the distribution and σ is the standard deviation; notation: X ∼ N(µ, σ). If µ = 0 and σ = 1, the RV, Z, is called the standard normal distribution.
Numerical Variable
variables that take on values that are indicated by numbers
O Outcome
a particular result of an experiment
P Paired Data Sets
two data sets that have a one-to-one relationship so that:
• each data set is the same size, and
• each data point in one data set is matched with exactly one point from the other set.
Parameter
a number that is used to represent a population characteristic and that generally cannot be
determined easily
Placebo
an inactive treatment that has no real effect on the explanatory variable
Population
all individuals, objects, or measurements whose properties are being studied
Probability
a number between zero and one, inclusive, that gives the likelihood that a specific event will occur; the foundation of statistics is given by the following 3 axioms (by A.N. Kolmogorov, 1930's): Let S denote the sample space and A and B two events in S. Then:
• 0 ≤ P(A) ≤ 1
• If A and B are any two mutually exclusive events, then P(A OR B) = P(A) + P(B).
• P(S) = 1
Proportion
the number of successes divided by the total number in the sample
Q Qualitative Data
See Data9 .
Quantitative Data
See Data.
R R Correlation Coefficient
A number between −1 and 1 that represents the strength and direction of the relationship
between X and Y. The value for r will equal 1 or −1 only if all the plotted points form a
perfectly straight line.
R² Coefficient of Determination
This is a number between 0 and 1 that represents the percentage variation of the dependent variable that can be explained by the variation in the independent variable. It is sometimes calculated by the equation R² = SSR/SST, where SSR is the Sum of Squares Regression and SST is the Sum of Squares Total. The appropriate coefficient of determination to be reported should always be adjusted for degrees of freedom first.
Random Assignment
the act of organizing experimental units into treatment groups using random methods
Random Sampling
a method of selecting a sample that gives every member of the population an equal chance of
being selected.
Relative Frequency
the ratio of the number of times a value of the data occurs in the set of all outcomes to the
number of all outcomes
9 http://cnx.org/content/m64281/latest/
Representative Sample
a subset of the population that has the same characteristics as the population
Residual or error
the value calculated from subtracting ŷ0 from y0: e0 = y0 − ŷ0. The absolute value of a residual measures the vertical distance between the actual value of y and the estimated value of y that appears on the best-fit line.
Response Variable
the dependent variable in an experiment; the value that is measured for change at the end of
an experiment
S Sample
a subset of the population studied
Sample Space
the set of all possible outcomes of an experiment
Sampling Bias
not all members of the population are equally likely to be selected
Sampling Error
the natural variation that results from selecting a sample to represent a larger population; this
variation decreases as the sample size increases, so selecting larger samples reduces sampling
error.
Sampling with Replacement
Once a member of the population is selected for inclusion in a sample, that member is returned to the population for the selection of the next individual; each member therefore has the possibility of being chosen more than once.
Sampling without Replacement
A member of the population may be chosen for inclusion in a sample only once. If chosen, the member is not returned to the population before the next selection.
Simple Random Sampling
a straightforward method for selecting a random sample; give each member of the population a
number. Use a random number generator to select a set of labels. These randomly selected
labels identify the members of your sample.
Skewed
used to describe data that are not symmetrical; when the right side of a graph looks chopped off compared to the left side, we say it is skewed to the left. When the left side of the graph looks chopped off compared to the right side, we say the data are skewed to the right. Alternatively: when the lower values of the data are more spread out, we say the data are skewed to the left. When the greater values are more spread out, the data are skewed to the right.
Standard Deviation
a number that is equal to the square root of the variance and measures how far data values are
from their mean; notation: s for sample standard deviation and σ for population standard
deviation.
Standard Normal Distribution
a continuous random variable (RV) X ∼ N (0, 1); when X follows the standard normal
distribution, it is often noted as Z ∼ N (0, 1).
Statistic
a numerical characteristic of the sample; a statistic estimates the corresponding population
parameter.
Stratified Sampling
a method for selecting a random sample used to ensure that subgroups of the population are
represented adequately; divide the population into groups (strata). Use simple random sampling
to identify a proportionate number of individuals from each stratum.
Student's t-Distribution
investigated and reported by William S. Gosset in 1908 and published under the pseudonym Student. The major characteristics of the random variable (RV) are:
• It is continuous and assumes any real value.
• The pdf is symmetrical about its mean of zero. However, it is more spread out and flatter at the apex than the normal distribution.
• It approaches the standard normal distribution as n gets larger.
• There is a "family" of t distributions: every representative of the family is completely defined by the number of degrees of freedom, which is one less than the number of data items.
Sum of Squared Errors (SSE)
the calculated value from adding up all the squared residual terms. The hope is that this value is
very small when creating a model.
Systematic Sampling
a method for selecting a random sample; list the members of the population. Use simple random
sampling to select a starting point in the population. Let k = (number of individuals in the
population)/(number of individuals needed in the sample). Choose every kth individual in the
list starting with the one that was randomly selected. If necessary, return to the beginning of
the population list to complete your sample.
T Test Statistic
the formula that counts the number of standard deviations on the relevant distribution that the estimated parameter is away from the hypothesized value
The Complement Event
The complement of event A consists of all outcomes that are NOT in A.
The Conditional Probability of A GIVEN B
P(A|B) is the probability that event A will occur given that the event B has already occurred.
The Intersection: the AND Event
An outcome is in the event A AND B if the outcome is in both A AND B at the same time.
The Union: the OR Event
An outcome is in the event A OR B if the outcome is in A or is in B or is in both A and B.
Treatments
different values or components of the explanatory variable applied in an experiment
V Variable
a characteristic of interest for each person or object in a population
Z z-score
the linear transformation of the form z = (x − µ)/σ; if this transformation is applied to any normal distribution X ∼ N(µ, σ), the result is the standard normal distribution Z ∼ N(0, 1). If this transformation is applied to any specific value x of the RV with mean µ and standard deviation σ, the result is called the z-score of x. The z-score allows us to compare data that are normally distributed but scaled differently. A z-score is the number of standard deviations a particular x is away from its mean value.
Attributions
Collection: MGMT 2262: Applied Business Statistics
Edited by: Brad Quiring
URL: http://cnx.org/content/col23820/1.2/
License: http://creativecommons.org/licenses/by/4.0/
Module: "Introduction Sampling and Data MtRoyal - Version2016RevA"
By: Lyryx Learning
URL: http://cnx.org/content/m62095/1.2/
Pages: 1-2
Copyright: Lyryx Learning
License: http://creativecommons.org/licenses/by/4.0/
Based on: Introduction
By: OpenStax, OpenStax Business Statistics
URL: http://cnx.org/content/m54035/1.2/
Module: "Denitions of Statistics, Probability, and Key Terms MRU - C Lemieux (2017)"
By: Collette Lemieux
URL: http://cnx.org/content/m64275/1.1/
Pages: 2-8
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/
Based on: Definitions of Statistics, Probability, and Key Terms MtRoyal - Version2016RevA
By: OpenStax, OpenStax Business Statistics, Lyryx Learning
URL: http://cnx.org/content/m62097/1.2/
Module: "Derived copy of Outcomes and the Type I and Type II Errors Hypothesis Testing with One
Sample MtRoyal - Version2016RevA"
Used here as: "Errors and Choosing a Level of Signicance"
By: Brad Quiring
URL: http://cnx.org/content/m65318/1.3/
Pages: 228-232
Copyright: Brad Quiring
License: http://creativecommons.org/licenses/by/4.0/
Based on: Outcomes and the Type I and Type II Errors Hypothesis Testing with One Sample MtRoyal
- Version2016RevA
By: OpenStax Business Statistics, Lyryx Learning
URL: http://cnx.org/content/m62369/1.1/
Module: "The Eight-Step Hypothesis Test"
By: Brad Quiring
URL: http://cnx.org/content/m65319/1.5/
Pages: 232-239
Copyright: Brad Quiring
License: http://creativecommons.org/licenses/by/4.0/
Module: "Practice questions"
Used here as: "Practice Questions for Chapters 7 and 8"
By: Collette Lemieux
URL: http://cnx.org/content/m65288/1.3/
Pages: 240-249
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/
Module: "Introduction to Regression"
By: Brad Quiring
URL: http://cnx.org/content/m65507/1.1/
Pages: 257-258
Copyright: Brad Quiring
License: http://creativecommons.org/licenses/by/4.0/
Based on: Introduction
By: OpenStax, OpenStax Business Statistics
URL: http://cnx.org/content/m54035/1.2/
Module: "The Correlation Coecient r"
By: OpenStax
URL: http://cnx.org/content/m55719/1.21/
Pages: 258-263
Copyright: Rice University
License: http://creativecommons.org/licenses/by/4.0/
Module: "Testing the Signicance of the Correlation Coecient"
By: OpenStax
URL: http://cnx.org/content/m55726/1.16/
Pages: 263-265
Copyright: Rice University
License: http://creativecommons.org/licenses/by/4.0/
About OpenStax-CNX
Rhaptos is a web-based collaborative publishing system for educational material.