MGMT 2262: Applied Business Statistics

Collection Editor:
Brad Quiring
Authors:
OpenStax
Lyryx Learning
Collette Lemieux
Brad Quiring

Online:
<http://cnx.org/content/col23820/1.2/>

OpenStax-CNX
This selection and arrangement of content as a collection is copyrighted by Brad Quiring. It is licensed under the
Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).
Collection structure revised: December 19, 2017
PDF generated: December 19, 2017
For copyright and attribution information for the modules contained in this collection, see p. 325.
Table of Contents
1 Business Statistics - Module 1 - Data collection and descriptive statistics
1.1 Chapter 1: Introduction to Statistics . . . 1
1.2 Chapter 2: Descriptive Statistics . . . 25
Solutions . . . 77
2 Business Statistics - Module 2 - Probability
2.1 Chapter 3: Probability Topics . . . 89
2.2 Chapter 4: Binomial Distribution . . . 140
2.3 Chapter 5: Normal Distribution . . . 154
2.4 Chapter 6: Sampling Distribution . . . 171
Solutions . . . 186
3 Business Statistics - Module 3 - Confidence Intervals and Hypothesis Tests
3.1 Chapter 7: Confidence Intervals . . . 205
3.2 Chapter 8: Hypothesis Testing . . . 220
Solutions . . . 250
4 Business Statistics - Module 4 - Linear Regression and Correlation
4.1 Introduction to Regression . . . 257
4.2 The Correlation Coefficient r . . . 258
4.3 Testing the Significance of the Correlation Coefficient . . . 263
4.4 Linear Equations . . . 265
4.5 The Regression Equation . . . 269
4.6 Interpretation of Regression Coefficients: Elasticity and Logarithmic Transformation . . . 289
4.7 Predicting with a Regression Equation . . . 293
4.8 How to Use Microsoft Excel® for Regression Analysis . . . 298
Solutions . . . 310
Glossary . . . 313
Index . . . 321
Attributions . . . 325

Available for free at Connexions <http://cnx.org/content/col23820/1.2>


Chapter 1

Business Statistics - Module 1 - Data collection and descriptive statistics

1.1 Chapter 1: Introduction to Statistics


1.1.1 Introduction - Sampling and Data - MtRoyal - Version2016RevA[1]

Figure 1.1: We encounter statistics in our daily lives more often than we probably realize and from
many different sources, like the news. (credit: David Sim)

By the end of this chapter, the student should be able to:

• Recognize and differentiate between key terms.
• Apply various types of sampling methods to data collection.
[1] This content is available online at <http://cnx.org/content/m62095/1.2/>.


You are probably asking yourself the question, "When and where will I use statistics?" If you read any
newspaper, watch television, or use the Internet, you will see statistical information. There are statistics
about crime, sports, education, politics, and real estate. Typically, when you read a newspaper article or
watch a television news program, you are given sample information. With this information, you may make
a decision about the correctness of a statement, claim, or "fact." Statistical methods can help you make the
"best educated guess."
Since you will undoubtedly be given statistical information at some point in your life, you need to know
some techniques for analyzing the information thoughtfully. Think about buying a house or managing a
budget. Think about your chosen profession. The fields of economics, business, psychology, education,
biology, law, computer science, police science, and early childhood development require at least one course
in statistics.
Included in this chapter are the basic ideas and words of probability and statistics. You will soon
understand that statistics and probability work together. You will also learn how data are gathered and
how "good" data can be distinguished from "bad."

1.1.2 Definitions of Statistics, Probability, and Key Terms - MRU - C Lemieux (2017)[2]
The science of statistics deals with the collection, analysis, interpretation, and presentation of data.
The process of statistical analysis follows these broad steps.

1. Defining the problem
2. Planning the study
3. Collecting the data for the study
4. Analysis of the data
5. Interpretations and conclusions based on the analysis

For example, we may wonder if there is a gap between how much men and women are paid for doing the
same job. This would be the problem we want to investigate. Before we do the investigation, we would
want to spend some time defining the problem. This could include defining terms (e.g. what do we mean by
"paid"? What constitutes the "same job"?). Then we would want to state a research question. A research
question is the overarching question that the study aims to address. In this example, our research question
might be: "Does the gender wage gap exist?"
Once we have the problem clearly defined, we need to figure out how we are going to study the problem.
This would include determining how we are going to collect the data for the study. Since it is unlikely we
are going to find out the salary and position of every employee in the world (i.e. the population), we need to
instead collect data from a subset of the whole (i.e. a sample). The process of how we will collect the data
is called the sampling technique. The overall plan of how the study is designed is called the sampling
design or methodology.
Once we have the methodology, we want to implement it and collect the actual data.
When we have the data, we will learn how to organize and summarize data. Organizing and summarizing
data is called descriptive statistics. Two ways to summarize data are by visually summarizing the data
(for example, a histogram) and by numerically summarizing the data (for example, the average). After we
have summarized the data, we will use formal methods for drawing conclusions from "good" data. The
formal methods are called inferential statistics. Statistical inference uses probability to determine how
confident we can be that our conclusions are correct.
Once we have summarized and analyzed the data, we want to see what kind of conclusions we can
draw. This would include attempting to answer the research question and recognizing the limitations of the
conclusions.
In this course, most of our time will be spent in the last two steps of the statistical analysis process
(i.e. organizing, summarizing and analyzing data). To understand the process of making inferences from the
data, we must also learn about probability. This will help us understand the likelihood of random events
occurring.

[2] This content is available online at <http://cnx.org/content/m64275/1.1/>.

1.1.2.1 Key Idea and Terms

In statistics, we generally want to study a population. You can think of a population as a collection of
persons, things, or objects under study. The person, thing or object under study (i.e. the object of study)
is called the observational unit. What we are measuring or observing about the observational unit is called
the variable. We often use the letters X or Y to represent a variable. A specific instance of a variable is
called data.
Example 1.1
Suppose our research question is "Do current NHL forwards who make over $3 million a year score,
on average, more than 20 points a season?"
The population would be all of the NHL forwards who make over $3 million a year and who are
currently playing in the NHL. The observational unit would be any forward that meets the criteria.
The variable is the number of points a forward in the population gets in a season. A data value would
be the actual number of points.
In the above example, it would be reasonable to look at the population when doing the statistical analysis.
But this is not always the case. For example, suppose you want to study the average profits of oil and gas
companies in the world. It might be very hard to get a list of all of the oil and gas companies in the world
and get access to their financial reports. When the population is not easily accessible, we instead look at a
sample. The idea of sampling (the process of collecting the sample) is to select a portion (or subset) of
the larger population and study that portion (the sample) to gain information about the population. Data
are the result of sampling from a population.
Because it takes a lot of time and money to examine an entire population, sampling is a very practical
technique. If you wished to compute the overall grade point average at your school, it would make sense
to select a sample of students who attend the school. The data collected from the sample would be the
students' grade point averages. In federal elections, opinion poll samples of 1,000 to 2,000 people are taken.
The opinion poll is supposed to represent the views of the people in the entire country. Manufacturers of
canned carbonated drinks take samples to determine if a 16 ounce can contains 16 ounces of carbonated
drink.
It is important to note that although we might not know everything about the population, it is fairly
static at the moment we decide to sample from it. Going back to the example of the NHL forwards, if we were
to gather the data for the population right now, that would be our fixed population. But if you took a sample
from that population and your friend took a sample from that population, it is not surprising that you and
your friend would get different samples. That is, there is one population, but there are many, many different
samples that can be drawn from the population. How the samples vary from each other is called sampling
variability. The idea of sampling variability is a key concept in statistics and we will come back to it over
and over again.
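
As a rough illustration of sampling variability, the Python sketch below builds a made-up population of season point totals (not real NHL data) and lets "you" and "your friend" each draw a simple random sample from it. The population size, the sample size of 25, the seeds, and the helper name draw_sample_mean are all assumptions chosen for the example.

```python
import random
from statistics import mean

# Hypothetical fixed population: season point totals for 200 forwards.
# These values are randomly generated for illustration only.
rng = random.Random(2017)
population = [rng.randint(5, 90) for _ in range(200)]
population_mean = mean(population)  # one fixed value for this population


def draw_sample_mean(seed, n=25):
    """Draw a simple random sample of size n and return its mean."""
    sampler = random.Random(seed)
    sample = sampler.sample(population, n)
    return mean(sample)


# You and a friend each take a sample from the same population.
your_mean = draw_sample_mean(seed=1)
friend_mean = draw_sample_mean(seed=2)

print(f"population mean:      {population_mean:.1f}")
print(f"your sample mean:     {your_mean:.1f}")
print(f"friend's sample mean: {friend_mean:.1f}")
```

The population mean is computed once and does not change, while each sample mean depends on which forwards happened to be selected. Rerunning with other seeds gives other sample means, which is the variability described above.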

note: Data is plural. Datum is singular.

As mentioned above, a variable, or random variable, notated by capital letters such as X and Y, is a
characteristic of interest for each person or thing in a population. Data are the actual values of the variable.
Data and variables fall into two general types: either they measure something or they do not.
When a variable is measuring or counting something, it is called a quantitative variable and
the data are called quantitative data. When a variable is not measuring or counting something, it is called a
categorical variable and the data are called categorical data. For a variable to be considered quantitative,
the distance between each number has to be fixed. In general, quantitative variables measure something and
take on values with equal units such as weight in pounds or number of people in a line. Categorical variables
place the person or thing into a category such as colour of car or opinion on topic.
Example 1.2


• In the NHL forwards example, the variable is quantitative as we are investigating the number of
points a player has.
• In the gender gap example, there were three variables: the salary, gender, and the position.
The salary is a quantitative variable as we are investigating the amount people make. Gender
is a categorical variable as we are categorizing someone's gender. Position is also categorical
as we are categorizing their type of employment.
• Sometimes, though, determining the type of a variable (i.e. quantitative or categorical) is not
clear-cut. In particular, Likert scales or rating scales are tricky to place. A Likert
scale is any scale on which you are asked to rate your opinion. For example, you
may be asked whether you strongly agree, agree, are neutral, disagree or strongly disagree with a
statement. Sometimes there is a number associated with the rating. For example, write 5 if
you strongly agree and 1 if you strongly disagree. Technically, Likert scale data are categorical,
as we are categorizing people's opinions and the number is just a short form for the
category.

tip: When you are asked to categorize the data or variable, first determine what the observational
unit is. Then determine the variable being studied. Then think about what the data will look like.
If the data is a number, then it is usually quantitative data (be wary of Likert scales). If the data
is a word or category, then it is categorical data.

Exercise 1.1.2.1 (Solution on p. 77.)


For the following research questions, state the observational unit, the variable being studied, and
the type of variable.
a. What is the average monthly temperature in Edmonton?
b. What is the highest belt colour that most students of karate earn in Canada?
c. What is the average weight of greyhound dogs?
d. What is the average gross profit of movies made in 2016?
e. What is the average user rating of Jessica Jones season 1 on IMDB?
f. What is the most common colour of car in Nova Scotia?

Two words that come up often in statistics are mean and proportion. These are two examples of numerical
descriptive statistics. If you were to take three exams in your math classes and obtain scores of 86, 75, and
92, you would calculate your mean score by adding the three exam scores and dividing by three (your mean
score would be 84.3 to one decimal place). If, in your math class, there are 40 students and 22 are men then
the proportion of men in the course is 55% and the proportion of women is 45%.
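
The arithmetic in this paragraph can be checked with a few lines of Python. This is only a sketch using the numbers quoted above; the variable names are made up for the example.

```python
from statistics import mean

# Exam scores from the text.
scores = [86, 75, 92]
mean_score = mean(scores)  # (86 + 75 + 92) / 3
print(round(mean_score, 1))  # 84.3

# A class of 40 students, 22 of whom are men.
class_size = 40
men = 22
prop_men = men / class_size                   # 22 / 40
prop_women = (class_size - men) / class_size  # 18 / 40
print(f"{prop_men:.0%} men, {prop_women:.0%} women")
```

A proportion is a count divided by the total, so the two proportions here must add up to 1 (that is, 100%).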
From the sample data, we can calculate a statistic. A statistic is a number that represents a property
of the sample. For example, if we consider one math class to be a sample of the population of all math
classes, then the mean number of points earned by students in that one math class at the end of the term is
an example of a statistic. The statistic is an estimate of a population parameter, in this case the mean. A
parameter is a number that is a property of the population. Since we considered all math classes to be the
population, then the mean number of points earned per student over all the math classes is an example of
a parameter (i.e. the population mean). If we took a sample of students from the math class and found the
mean points earned per student in the sample, then we would have found a statistic (i.e. the sample mean).
One of the main concerns in the field of statistics is how accurately a statistic estimates a parameter.
The accuracy really depends on how well the sample represents the population. The sample must contain
the characteristics of the population in order to be a representative sample. We are interested in both the
sample statistic and the population parameter in inferential statistics. In a later chapter, we will use the
sample statistic to test the validity of the established population parameter.
Example 1.3
Determine what the key terms refer to in the following study. We want to know the average
(mean) amount of money first year college students spend at ABC College on school supplies that
do not include books. We randomly survey 100 first year students at the college. Three of those
students spent $150, $200, and $225, respectively.
Solution
The population is all first year students attending ABC College this term.
The sample could be all students enrolled in one section of a beginning statistics course at
ABC College (although this sample may not represent the entire population).
The parameter is the average (mean) amount of money spent (excluding books) by first year
college students at ABC College this term.
The statistic is the average (mean) amount of money spent (excluding books) by first year
college students in the sample.
The variable could be the amount of money spent (excluding books) by one first year student.
Let X = the amount of money spent (excluding books) by one first year student attending ABC
College.
The data are the dollar amounts spent by the first year students. Examples of the data are
$150, $200, and $225.

Exercise 1.1.2.2 (Solution on p. 77.)


Determine what the key terms refer to in the following study. We want to know the
average (mean) amount of money spent on school uniforms each year by families with
children at Knoll Academy. We randomly survey 100 families with children in the school.
Three of the families spent $65, $75, and $95, respectively.

Example 1.4
Determine what the key terms refer to in the following study.
A study was conducted at a local college to analyze the average cumulative GPAs of students
who graduated last year. Fill in the letter of the phrase that best describes each of the items below.
1._____ Population
2._____ Statistic
3._____ Parameter
4._____ Sample
5._____ Variable
6._____ Data

a) all students who attended the college last year
b) the cumulative GPA of one student who graduated from the college last year
c) 3.65, 2.80, 1.50, 3.90
d) a group of students who graduated from the college last year, randomly selected
e) the average cumulative GPA of students who graduated from the college last year
f) all students who graduated from the college last year
g) the average cumulative GPA of students in the study who graduated from the college last year

Solution
1. f; 2. g; 3. e; 4. d; 5. b; 6. c

Example 1.5
Determine what the key terms refer to in the following study.
As part of a study designed to test the safety of automobiles, the National Transportation Safety
Board collected and reviewed data about the effects of an automobile crash on test dummies. Here
is the criterion they used:

Speed at which Cars Crashed    Location of drive (i.e. dummies)
35 miles/hour                  Front Seat

Table 1.1

Cars with dummies in the front seats were crashed into a wall at a speed of 35 miles per hour.
We want to know the proportion of dummies in the driver's seat that would have had head injuries,
if they had been actual drivers. We start with a simple random sample of 75 cars.
Solution
The population is all cars containing dummies in the front seat.
The sample is the 75 cars, selected by a simple random sample.
The parameter is the proportion of driver dummies (if they had been real people) who would
have suffered head injuries in the population.
The statistic is the proportion of driver dummies (if they had been real people) who would have
suffered head injuries in the sample.
The variable X = the number of driver dummies (if they had been real people) who would
have suffered head injuries.
The data are either: yes, had head injury, or no, did not.

Example 1.6
Determine what the key terms refer to in the following study.
An insurance company would like to determine the proportion of all medical doctors who have
been involved in one or more malpractice lawsuits. The company selects 500 doctors at random
from a professional directory and determines the number in the sample who have been involved in
a malpractice lawsuit.
Solution
The population is all medical doctors listed in the professional directory.
The parameter is the proportion of medical doctors who have been involved in one or more
malpractice suits in the population.
The sample is the 500 doctors selected at random from the professional directory.
The statistic is the proportion of medical doctors who have been involved in one or more
malpractice suits in the sample.
The variable X = the number of medical doctors who have been involved in one or more
malpractice suits.
The data are either: yes, was involved in one or more malpractice lawsuits, or no, was not.

1.1.2.2 References

The Data and Story Library, http://lib.stat.cmu.edu/DASL/Stories/CrashTestDummies.html (accessed May 1, 2013).

1.1.2.3 Chapter Review

The mathematical theory of statistics is easier to learn when you know the language. This module presents
important terms that will be used throughout the text.


1.1.2.4 HOMEWORK

For each of the following eight exercises, identify: a. the population, b. the sample, c. the parameter, d. the
statistic, e. the variable, and f. the data. Give examples where appropriate.
Exercise 1.1.2.3
A fitness center is interested in the mean amount of time a client exercises in the center each week.
Exercise 1.1.2.4 (Solution on p. 77.)
Ski resorts are interested in the mean age that children take their first ski and snowboard lessons.
They need this information to plan their ski classes optimally.
Exercise 1.1.2.5
A cardiologist is interested in the mean recovery period of her patients who have had heart attacks.
Exercise 1.1.2.6 (Solution on p. 77.)
Insurance companies are interested in the mean health costs each year of their clients, so that they
can determine the costs of health insurance.
Exercise 1.1.2.7
A politician is interested in the proportion of voters in his district who think he is doing a good
job.
Exercise 1.1.2.8 (Solution on p. 77.)
A marriage counselor is interested in the proportion of clients she counsels who stay married.
Exercise 1.1.2.9
Political pollsters may be interested in the proportion of people who will vote for a particular
cause.
Exercise 1.1.2.10 (Solution on p. 77.)
A marketing company is interested in the proportion of people who will buy a particular product.

Use the following information to answer the next three exercises: A Lake Tahoe Community College
instructor is interested in the mean number of days Lake Tahoe Community College math students are absent from
class during a quarter.

Exercise 1.1.2.11
What is the population she is interested in?

a. all Lake Tahoe Community College students
b. all Lake Tahoe Community College English students
c. all Lake Tahoe Community College students in her classes
d. all Lake Tahoe Community College math students

Exercise 1.1.2.12 (Solution on p. 78.)


Consider the following:
X = number of days a Lake Tahoe Community College math student is absent
In this case, X is an example of a:

a. variable.
b. population.
c. statistic.
d. data.

Exercise 1.1.2.13
The instructor's sample produces a mean number of days absent of 3.5 days. This value is an
example of a:


a. parameter.
b. data.
c. statistic.
d. variable.

1.1.3 Data, Sampling, and Variation - MRU - C Lemieux (2017)[3]


Data may come from a population or from a sample. Small letters like x or y generally are used to represent
data values. Most data can be put into the following categories:
• Categorical
• Quantitative
Categorical data (also called qualitative data) are the result of categorizing or describing attributes of a
population. Hair colour, blood type, ethnic group, the car a person drives, and the street a person lives on
are examples of qualitative data. Qualitative data are generally described by words or letters. For instance,
hair colour might be black, dark brown, light brown, blonde, grey, or red. Blood type might be AB+, O-,
or B+. Researchers often prefer to use quantitative data over qualitative data because it lends itself more
easily to mathematical analysis. For example, it does not make sense to find an average hair colour or
blood type.
There are two types of categorical data: nominal and ordinal. Nominal data is categorical data that
cannot be ordered in a meaningful way. For example, the colour of a car is categorical, but the order of the
colours is not meaningful. Ordinal data is categorical data that can be ordered in a meaningful way. For
example, the level of satisfaction someone has with their experience at a restaurant, ranging from "not at all
satisfied" to "completely satisfied", is ordinal.
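
One way to see the practical difference between nominal and ordinal data is to ask which summaries make sense. The Python sketch below is an illustration only: the two intermediate satisfaction labels, the survey responses, and the car colours are all made up for the example.

```python
from collections import Counter

# Ordinal data: the categories have a meaningful order.
# The two middle labels are assumed; the text names only the endpoints.
levels = [
    "not at all satisfied",
    "somewhat satisfied",
    "satisfied",
    "completely satisfied",
]
rank = {label: position for position, label in enumerate(levels)}

responses = [
    "satisfied",
    "completely satisfied",
    "somewhat satisfied",
    "satisfied",
    "not at all satisfied",
]

# Because the order is meaningful, we can sort the responses and
# pick the middle one (a median category for an odd sample size).
ordered = sorted(responses, key=rank.get)
middle_response = ordered[len(ordered) // 2]
print(middle_response)

# Nominal data: no meaningful order, so we only count categories.
car_colours = ["red", "black", "white", "black", "blue"]
most_common_colour = Counter(car_colours).most_common(1)[0][0]
print(most_common_colour)
```

Sorting the car colours would be possible in code (alphabetically, say), but the resulting order would carry no meaning, which is exactly what makes the data nominal.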
Quantitative data are always numbers. Quantitative data are the result of counting or measuring
attributes of a population. Amount of money, pulse rate, weight, number of people living in your town, and
number of students who take statistics are examples of quantitative data. Quantitative data may be either
discrete or continuous.
All data that are the result of counting are called quantitative discrete data. These data take on only
certain numerical values. If you count the number of phone calls you receive for each day of the week, you
might get values such as zero, one, two, or three.
All data that are the result of measuring are quantitative continuous data, assuming that we can
measure accurately. Time, distance, area, and so on are continuous: anything that can be subdivided and then
subdivided again and again is a continuous variable. If you and your friends carry backpacks with books in
them to school, the numbers of books in the backpacks are discrete data and the weights of the backpacks
are continuous data.
Example 1.7: Data Sample of Quantitative Discrete Data
The data are the number of books students carry in their backpacks. You sample five students.
Two students carry three books, one student carries four books, one student carries two books, and
one student carries one book. The numbers of books (three, four, two, and one) are the quantitative
discrete data.

Exercise 1.1.3.1 (Solution on p. 78.)


The data are the number of machines in a gym. You sample five gyms. One gym has 12
machines, one gym has 15 machines, one gym has ten machines, one gym has 22 machines,
and the other gym has 20 machines. What type of data is this?
[3] This content is available online at <http://cnx.org/content/m64281/1.1/>.


Example 1.8: Data Sample of Quantitative Continuous Data
The data are the weights of backpacks with books in them. You sample the same five students.
The weights (in pounds) of their backpacks are 6.2, 7, 6.8, 9.1, 4.3. Notice that backpacks carrying
three books can have different weights. Weights are quantitative continuous data because weights
are measured.

Exercise 1.1.3.2 (Solution on p. 78.)


The data are the areas of lawns in square feet. You sample five houses. The areas of the
lawns are 144 sq. feet, 160 sq. feet, 190 sq. feet, 180 sq. feet, and 210 sq. feet. What
type of data is this?

Example 1.9
You go to the supermarket and purchase three cans of soup (19 ounces tomato bisque, 14.1 ounces
lentil, and 19 ounces Italian wedding), two packages of nuts (walnuts and peanuts), four different
kinds of vegetable (broccoli, cauliflower, spinach, and carrots), and two desserts (16 ounces Cherry
Garcia ice cream and two pounds (32 ounces) chocolate chip cookies).
Problem
Name data sets that are quantitative discrete, quantitative continuous, and qualitative.
Solution
One Possible Solution:

• The three cans of soup, two packages of nuts, four kinds of vegetables and two desserts are
quantitative discrete data because you count them.
• The weights of the soups (19 ounces, 14.1 ounces, 19 ounces) are quantitative continuous data
because you measure weights as precisely as possible.
• Types of soups, nuts, vegetables and desserts are qualitative data because they are categorical.

Try to identify additional data sets in this example.
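
The classification in this example can be mimicked with a small rule of thumb in Python. This is a hedged sketch, not a general method: the function name data_type is hypothetical, and using the Python type (text, whole number, decimal) as a stand-in for "categorized, counted, or measured" is an assumption that only works when the data were recorded that way.

```python
def data_type(values):
    """Rule of thumb: words -> qualitative, counts -> discrete,
    measurements -> continuous. Assumes counts are stored as int
    and measurements as float."""
    if all(isinstance(v, str) for v in values):
        return "qualitative"
    if all(isinstance(v, int) for v in values):
        return "quantitative discrete"
    return "quantitative continuous"


# Data sets taken from the supermarket example.
item_counts = [3, 2, 4, 2]            # soups, nuts, vegetables, desserts
soup_weights_oz = [19.0, 14.1, 19.0]  # measured weights in ounces
soup_types = ["tomato bisque", "lentil", "Italian wedding"]

print(data_type(item_counts))
print(data_type(soup_weights_oz))
print(data_type(soup_types))
```

The rule breaks down when numbers are used as labels. Likert ratings coded 1 to 5, for instance, would be reported as discrete here even though they are categorical, so the judgement about how the data were obtained still has to come from the person doing the analysis.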


Example 1.10
The data are the colors of backpacks. Again, you sample the same five students. One student has
a red backpack, two students have black backpacks, one student has a green backpack, and one
student has a gray backpack. The colors red, black, black, green, and gray are qualitative data.

Exercise 1.1.3.3 (Solution on p. 78.)


The data are the colors of houses. You sample five houses. The colors of the houses are
white, yellow, white, red, and white. What type of data is this?

note: You may collect data as numbers and report it categorically. For example, the quiz scores for
each student are recorded throughout the term. At the end of the term, the quiz scores are reported
as A, B, C, D, or F.

Available for free at Connexions <http://cnx.org/content/col23820/1.2>


Exercise 1.1.3.4 (Solution on p. 78.)


Determine the correct data type (quantitative or qualitative) for the number of cars in a
parking lot. Indicate whether quantitative data are continuous or discrete.

Example 1.11
A statistics professor collects information about the classification of her students as freshmen,
sophomores, juniors, or seniors. The data she collects are summarized in the pie chart Figure 1.2.
What type of data does this graph show?

Figure 1.2

Solution
This pie chart shows the students in each year, which is categorical data.

Exercise 1.1.3.5 (Solution on p. 78.)


The registrar at State University keeps records of the number of credit hours students
complete each semester. The data he collects are summarized in the histogram. The class
boundaries are 10 to less than 13, 13 to less than 16, 16 to less than 19, 19 to less than
22, and 22 to less than 25.

Figure 1.3

What type of data does this graph show?

1.1.3.1 Sampling

Gathering information about an entire population often costs too much or is virtually impossible. Instead,
we use a sample of the population. To collect the sample, a sampling technique is used. Not all sampling
techniques are created equal, though. A good sampling technique meets the following criteria:

• The sample is collected randomly


• The sample is representative of the population
• The size of the sample is large enough
If a sampling technique does not meet these criteria, then it is not appropriate to make inferences from the
data. For example, it would not be appropriate to estimate the population mean from the sample mean.
A random sample reduces bias, promotes representativeness, and is a key component of sampling. To
do any scientific statistical analysis on sample data, the sample has to be randomly selected.
In a random sample, members of the population are selected in such a way that each has an equal chance
of being selected. To ensure that a sample is collected randomly, some element of randomness needs to be
included in the sampling technique. This can involve using dice to choose the time to start collecting data
or using a random number generator to pick names from a list of names.

note: Humans in general are not very random. Therefore, the randomness added to the sampling
technique cannot be someone randomly choosing something. The randomness has to come from
a random event (like rolling dice, flipping a coin, or using a random number generator).

A sample is representative if it shares similar characteristics to the population. For example, suppose that
the students at a university are distributed as follows by faculty:

• Business: 20%
• Arts: 25%
• Science and Engineering: 30%
• Nursing: 15%
• Education: 10%
Then a sample would be representative of this population if the distribution of the students' faculty in the
sample was similar to the population. It doesn't have to be exactly the same, but it should be close. A
random sample will generate a fairly representative sample, but it doesn't guarantee it.
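
To see what "close, but not exactly the same" can look like, the following Python sketch draws a simple random sample of 100 students from a hypothetical university of 10,000 students whose faculties follow the percentages above, and then tallies the faculties in the sample. The population size, the seed, and the variable names are assumptions made only for illustration.

```python
import random
from collections import Counter

# Faculty percentages from the example above.
percentages = {
    "Business": 20,
    "Arts": 25,
    "Science and Engineering": 30,
    "Nursing": 15,
    "Education": 10,
}

# Hypothetical population of 10,000 students, one faculty label per student.
population = [
    faculty
    for faculty, pct in percentages.items()
    for _ in range(pct * 100)
]

rng = random.Random(2024)             # fixed seed so the run is repeatable
sample = rng.sample(population, 100)  # simple random sample of 100 students
sample_counts = Counter(sample)

# With a sample of 100, each count is also the sample percentage.
for faculty, pct in percentages.items():
    print(f"{faculty}: population {pct}%, sample {sample_counts[faculty]}%")
```

Changing the seed gives a different sample, and the sample percentages usually land near the population percentages without matching them exactly.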

note: What makes a sample representative depends on what is being studied. For example, if
we are looking at the average age of students at a university, making sure we get students from
each faculty would be important, but making sure we get students from various political affiliations
might not be.

Determining if a sample is large enough is a bit arbitrary and depends on the situation. In general, the
larger the sample size the better, but issues such as time and money need to be taken into account. You
don't want to interview 5000 people, when 50 people would do. In Chapter 7, we will look at a formula that
determines how many members of a population need to be in a sample depending on the level of error we
are comfortable with. Until then, as a general rule, if the data is quantitative, a sample of at least 30 is
usually good enough, while if the data is categorical, a sample of at least 100 is usually good enough.
In general, even if a sample is collected extremely well, it will not be perfectly representative of the
population. The discrepancy between the sample and the population is called chance error due to sampling.
When dealing with samples, there will always be error. Statistics helps us to understand and even measure
this error. As a rule, the larger the random sample, the smaller the sampling error.
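
The link between sample size and sampling error can be explored with a small simulation. The sketch below is only an illustration: the population of 50,000 made-up values, the sample sizes, and the function name average_sampling_error are all assumptions, not part of the course material. It repeatedly draws random samples and records how far, on average, the sample mean falls from the population mean.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 50,000 made-up values centred near 100.
population = [rng.gauss(100, 15) for _ in range(50_000)]
population_mean = sum(population) / len(population)

def average_sampling_error(n, repetitions=200):
    """Average distance between the sample mean and the population mean."""
    total = 0.0
    for _ in range(repetitions):
        sample = rng.sample(population, n)  # random sample without replacement
        total += abs(sum(sample) / n - population_mean)
    return total / repetitions

errors = {n: average_sampling_error(n) for n in (30, 300, 3000)}
for n, err in errors.items():
    print(f"sample size {n}: average error {err:.2f}")
```

The average error shrinks as the sample size grows, but it never reaches zero.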
Areas of concern for sampling bias
When people publish their research, they include a description of their sampling technique. This is called
the methodology. When evaluating a sampling technique, check to see if the sample was collected randomly,
if it is representative of the population, and if the sample is large enough. Here are some examples of areas
of concern when looking at methodologies:

1. Undercoverage occurs when some members of the population are excluded from the process of selecting
the sample. For example, if no one from the faculty of nursing is included in the sample, then we would
say that the faculty of nursing is undercovered. This has been a specific concern in scientific research
over the years. For example, women have been traditionally excluded from drug studies because of
their menstrual cycles, but this results in the research only indicating how well the drug works for men.
2. Nonresponse bias occurs when a member of the population that is selected as part of the sample cannot
be contacted or refuses to participate. Have you ever refused to be part of a telephone study? If so,
you are contributing to nonresponse bias.
• Similar to nonresponse bias is voluntary response bias. Here a large segment of the population
is contacted and people choose to participate or not. Examples of this are mail-out surveys or
online polls. In these situations, usually the person is very invested in the issue so that is why
they take the time to answer. This results in non-representative samples.
• Response rate is a measure of how many people responded out of the total contacted. If the
response rate is low, then this suggests a very narrow segment of the population answered. This
would raise concerns about representativeness.
3. Asking potentially awkward questions might result in untruthful responses. This is called response
bias. For example, if you are asked if you have ever had a sexually transmitted infection, you may not
want to divulge that. One way to minimize response bias is to allow participants in a study to answer
the questions anonymously.
4. Improper wording of questions being asked might result in skewed answers. Here is an example of a
question that skews the results:
• Do you think it should be easier for seniors to make ends meet?
· Yes - they've worked hard and helped build our country
· No - seniors don't need any help or recognition


A famous example of a survey that had a very poor methodology was the incorrect prediction by the Literary
Digest that Landon would beat Roosevelt in the 1936 US election. Check out the following website for more
information: https://www.math.upenn.edu/~deturck/m170/wk4/lecture/case2.html

1.1.3.1.1 Sampling techniques

Most statisticians and researchers use various methods of random sampling in an attempt to achieve a
good sample. This section will describe a few of the most common techniques: simple random sampling,
(proportional) stratified random sampling, cluster sampling, and systematic random sampling. It also
describes convenience sampling, which is not random.
Simple random sampling
The easiest method to describe is called a simple random sample. In this technique, a random sample
is taken from the members of the population. This can be done by putting the names (or identifiers) of all
members of the population into a hat and pulling out those names (or identifiers) to choose the sample.
Or the population can be numbered and a random number generator can choose the sample. Here, each
member of the population has an equal chance of being chosen. If the goal of the technique is to get a very
random sample, this is the best method to use. But it requires having a list of the whole population, which
is not always realistic.
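
The "number the population and let a random number generator choose" approach can be sketched in Python. This is a minimal illustration rather than a prescribed method; the population of 1,000 numbered members, the sample size of 10, and the function name simple_random_sample are assumptions made for the example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Return n distinct members; every member is equally likely to be chosen."""
    rng = random.Random(seed)         # a seed only makes the example repeatable
    return rng.sample(population, n)  # draws without replacement

# Hypothetical population: members numbered 1 to 1,000.
population = list(range(1, 1001))
sample = simple_random_sample(population, 10, seed=1)
print(sample)
```

Note that the code needs the full list of the population before it can draw anything, which is the practical limitation mentioned above.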
Stratified sampling and proportionate stratified sampling
If there are concerns that a random sample might not fully represent a population (e.g. one portion of
the population is small compared to another), the best sampling technique to use is stratified random
sampling. In this case, divide the population into groups called strata and then take a random sample from
each stratum. The strata need to be mutually exclusive. That means that each
member of the population can only belong to one stratum. For example, you could stratify (group) your
university population by faculty and then choose a simple random sample from each stratum (each faculty)
to get a stratified random sample. Using the students per faculty example above, if the sample size is 100,
to get a stratified sample, you would randomly select 20 students from each faculty (as there are 5 faculties
and 100 students, choose an equal number from each faculty).
If the size of the sample from each stratum is proportionate to the size of the stratum, this is called proportionate stratified
random sampling. If you wanted a proportionate stratified random sample for students by faculty, you
would randomly select 20 students from business, 25 students from arts, 30 from science and engineering,
15 from nursing, and 10 from education (i.e. proportional to the number of students in each faculty). This
technique is best used when there are large differences in the proportion of each group. For example, if the
faculty of business had 50% of the students and the faculty of nursing only had 1% of the students, it would
not be good to have an equal number of students from each faculty.

note: To randomly choose students from each faculty, a random sampling technique needs to be
used. This could be simple random sampling or using another technique listed below.
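
The proportionate allocation described above (20, 25, 30, 15, and 10 students out of 100) can be reproduced with a short Python sketch. The rosters below are hypothetical (10,000 made-up students labelled by faculty), and simple random sampling is assumed within each stratum; neither detail comes from the text.

```python
import random

rng = random.Random(11)  # fixed seed so the run is repeatable

# Faculty percentages from the example above.
percentages = {
    "Business": 20,
    "Arts": 25,
    "Science and Engineering": 30,
    "Nursing": 15,
    "Education": 10,
}
sample_size = 100

# Hypothetical rosters: 10,000 students in total, labelled by faculty and number.
rosters = {
    faculty: [f"{faculty} student {i}" for i in range(1, pct * 100 + 1)]
    for faculty, pct in percentages.items()
}

# Proportionate allocation: a stratum's share of the sample equals its share of the population.
allocation = {faculty: sample_size * pct // 100 for faculty, pct in percentages.items()}

# Simple random sample within each stratum.
stratified_sample = {
    faculty: rng.sample(rosters[faculty], allocation[faculty])
    for faculty in percentages
}

print(allocation)
# {'Business': 20, 'Arts': 25, 'Science and Engineering': 30, 'Nursing': 15, 'Education': 10}
```

For the equal-allocation version of stratified sampling, the allocation line would instead assign 100 / 5 = 20 students to every faculty.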

Cluster sampling
Cluster sampling and stratified sampling are often confused. In each case, the population is divided into
groups. But in stratified sampling, a few people from all groups (strata) are chosen, while in cluster
sampling, all of the people from a few groups (clusters) are chosen.
To choose a cluster sample, divide the population into clusters (groups) and then randomly select some of
the clusters. All the members from these clusters are in the cluster sample. For example, if you were to divide
your university into departments (sub-sets of the faculty), then each department is a cluster. You can then
randomly select a few of those departments (clusters). For example, let's say you clustered the population
into ten clusters. You might then randomly select four of those clusters to sample. All of the members of
those four departments are your sample. Again, to randomly select the four departments, you have to use a
random sampling technique. Here, you could number all of the departments and then use a random number
generator to choose four of them. Cluster sampling can be very convenient as the members of the sample are
in one location (in the above example, the sample members are in the locations of the four departments). This can
save time and money. But it does present a real chance of undercoverage. If the four departments chosen are
only in the faculties of business and arts, then the other faculties are not included. This means that cluster
sampling can result in non-representative samples. This is only a good technique to use if the clusters are
very similar to each other and each cluster would be representative of the population.
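
The cluster sampling steps (number the clusters, randomly pick a few, keep everyone in them) can be sketched as follows. The ten departments of 30 students each are hypothetical and exist only to make the example runnable.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical population: ten departments (clusters) of 30 students each.
clusters = {
    f"Dept {d}": [f"D{d}-S{s}" for s in range(1, 31)]
    for d in range(1, 11)
}

# Randomly select four of the ten clusters.
chosen = rng.sample(sorted(clusters), 4)

# Every member of each chosen cluster is in the sample.
cluster_sample = [student for dept in chosen for student in clusters[dept]]

print(chosen)
print(len(cluster_sample))  # 120
```

Students in the six departments that were not chosen have no presence in the sample at all, which is the undercoverage risk described above.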
Systematic random sampling
To choose a systematic random sample, randomly select a starting point and take every kth piece of data
from a list of the population. For example, suppose you have to do a phone survey and you must choose 400
names for the sample. Your phone book contains 20,000 residence listings. To perform systematic random
sampling, number the population from 1 to 20,000 and then use a random number generator to pick a number
that represents the first name in the sample. k is found by taking the population size (20,000) and dividing
by the sample size (400). In this case, this results in 50. Thus, from your random starting point,
choose every fiftieth name thereafter until you have a total of 400 names. If you reach the end of the list
before completing your sample you simply go back to the beginning of your phone book and keep going until
the sample is complete.
Be careful: k needs to be large enough to ensure that you cycle through all the names. Otherwise the
sample is not random. If k had been 10, then once the random starting point was chosen only 4,000 names
had a chance of being chosen which means that not everyone has an equal chance of being chosen. In this
case, k = 50 works because 400 names spaced 50 apart span the entire list. Systematic sampling is frequently chosen because it is a
simple method.
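
The phone book example can be sketched in Python. The function name systematic_sample and the seed are assumptions for illustration; the population size (20,000) and sample size (400) are those of the example. Positions are numbered from 1, and the list wraps around to the beginning when the end is passed.

```python
import random

def systematic_sample(population_size, sample_size, seed=None):
    """Return the 1-based list positions chosen by systematic random sampling."""
    rng = random.Random(seed)
    k = population_size // sample_size       # sampling interval: 20,000 / 400 = 50
    start = rng.randint(1, population_size)  # random starting position
    # Step k positions at a time, wrapping to the start of the list at the end.
    return [(start - 1 + i * k) % population_size + 1 for i in range(sample_size)]

positions = systematic_sample(20_000, 400, seed=3)
print(positions[:5])
```

Because 400 steps of 50 cover exactly 20,000 positions, no name is selected twice and every name on the list has a chance of being chosen.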
A variation of systematic random sampling is very useful when a list of the population does not exist.
For example, suppose you are doing a survey about people's satisfaction with a certain mall's hours. You
won't have a list of all of the people who go to the mall. Instead, you may stand at an entrance to the mall
and ask every fifth person who enters the mall to complete your survey. To ensure the sampling technique is
representative, you'll want to do the survey multiple times at multiple locations. To ensure that the sampling
technique is random, you'll want to randomly choose your starting times and locations.
Convenience sampling
A type of sampling that is non-random is convenience sampling. Convenience sampling involves using
results that are readily available. For example, a computer software store conducts a marketing study by
interviewing potential customers who happen to be in the store browsing through the available software. The
results of convenience sampling may be very good in some cases and highly biased (favour certain outcomes)
in others. This is not a valid sampling technique when it comes to statistical inference. That is, if the data
is collected using a convenience sample, then no conclusions can be made about the population from the
sample.
With replacement or without replacement
True random sampling is done with replacement. That is, once a member is picked, that member goes
back into the population and thus may be chosen more than once. However, for practical reasons, in most
populations, simple random sampling is done without replacement. Surveys are typically done without
replacement. That is, a member of the population may be chosen only once. Most samples are taken from
large populations and the sample tends to be small in comparison to the population. Since this is the case,
sampling without replacement is approximately the same as sampling with replacement because the chance
of picking the same individual more than once with replacement is very low.
To illustrate how small the chance is, consider a university with a population of 10,000 people. Suppose
you want to pick a sample of 1,000 randomly for a survey. For any particular sample of 1,000, if you
are sampling with replacement,

• the chance of picking the first person is 1,000 out of 10,000 (0.1000);
• the chance of picking a different second person for this sample is 999 out of 10,000 (0.0999);
• the chance of picking the same person again is 1 out of 10,000 (very low).

If you are sampling without replacement,

• the chance of picking the first person for any particular sample is 1,000 out of 10,000 (0.1000);
• the chance of picking a different second person is 999 out of 9,999 (0.0999);
• you do not replace the first person before picking the next person.

Compare the fractions 999/10,000 and 999/9,999. For accuracy, carry the decimal answers to four decimal
places. To four decimal places, these numbers are equivalent (0.0999).
Sampling without replacement instead of sampling with replacement becomes a mathematical issue only
when the population is small. For example, if the population is 25 people, the sample is ten, and you are
sampling with replacement for any particular sample, then the chance of picking the first person is
ten out of 25, and the chance of picking a different second person is nine out of 25 (you replace the first
person).
If you sample without replacement, then the chance of picking the first person is ten out of 25, and
then the chance of picking the second person (who is different) is nine out of 24 (you do not replace the first
person).
Compare the fractions 9/25 and 9/24. To four decimal places, 9/25 = 0.3600 and 9/24 = 0.3750. To four
decimal places, these numbers are not equivalent.
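
The two comparisons above can be checked with a few lines of Python. This sketch only repeats the arithmetic from the text, rounded to four decimal places; the variable names are illustrative.

```python
# Chance of picking a different second person, with and without replacement.

# Large population: 10,000 people, sample of 1,000.
large_with = 999 / 10_000
large_without = 999 / 9_999

# Small population: 25 people, sample of ten.
small_with = 9 / 25
small_without = 9 / 24

print(f"large population: {large_with:.4f} vs {large_without:.4f}")
# large population: 0.0999 vs 0.0999
print(f"small population: {small_with:.4f} vs {small_without:.4f}")
# small population: 0.3600 vs 0.3750
```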
Example 1.12
A study is done to determine the average tuition that San Jose State undergraduate students pay
per semester. Each student in the following samples is asked how much tuition he or she paid for
the Fall semester. What is the type of sampling in each case?

a. A sample of 100 undergraduate San Jose State students is taken by organizing the students'
names by classification (freshman, sophomore, junior, or senior), and then selecting 25
students from each.
b. A random number generator is used to select a student from the alphabetical listing of all
undergraduate students in the Fall semester. Starting with that student, every 50th student
is chosen until 75 students are included in the sample.
c. A completely random method is used to select 75 students. Each undergraduate student
in the fall semester has the same probability of being chosen at any stage of the sampling
process.
d. The freshman, sophomore, junior, and senior years are numbered one, two, three, and four,
respectively. A random number generator is used to pick two of those years. All students in
those two years are in the sample.
e. An administrative assistant is asked to stand in front of the library one Wednesday and to
ask the first 100 undergraduate students he encounters what they paid for tuition the Fall
semester. Those 100 students are the sample.

Solution
a. stratified; b. systematic; c. simple random; d. cluster; e. convenience

Example 1.13
Determine the type of sampling used (simple random, stratified, systematic, cluster, or
convenience).

a. A soccer coach selects six players from a group of boys aged eight to ten, seven players from
a group of boys aged 11 to 12, and three players from a group of boys aged 13 to 14 to form
a recreational soccer team.
b. A pollster interviews all human resource personnel in five different high tech companies.
c. A high school educational researcher interviews 50 high school female teachers and 50 high
school male teachers.
d. A medical researcher interviews every third cancer patient from a list of cancer patients at a
local hospital.
e. A high school counselor uses a computer to generate 50 random numbers and then picks
students whose names correspond to the numbers.
f. A student interviews classmates in his algebra class to determine how many pairs of jeans a
student owns, on the average.

Solution
a. stratified; b. cluster; c. stratified; d. systematic; e. simple random; f. convenience

Exercise 1.1.3.6 (Solution on p. 78.)


Determine the type of sampling used (simple random, stratified, systematic, cluster, or
convenience).
A high school principal polls 50 freshmen, 50 sophomores, 50 juniors, and 50 seniors
regarding policy changes for after school activities.

If we were to examine two samples representing the same population, even if we used random sampling
methods for the samples, they would not be exactly the same. Just as there is variation in data, there is
variation in samples. As you become accustomed to sampling, the variability will begin to seem natural.
Example 1.14
Suppose ABC College has 10,000 part-time students (the population). We are interested in the
average amount of money a part-time student spends on books in the fall term. Asking all 10,000
students is an almost impossible task.
Suppose we take two different samples.
First, we use convenience sampling and survey ten students from a first term organic chemistry
class. Many of these students are taking first term calculus in addition to the organic chemistry
class. The amount of money they spend on books is as follows:
$128; $87; $173; $116; $130; $204; $147; $189; $93; $153
The second sample is taken using a list of senior citizens who take P.E. classes and taking every
fifth senior citizen on the list, for a total of ten senior citizens. They spend:
$50; $40; $36; $15; $50; $100; $40; $53; $22; $22
It is unlikely that any student is in both samples.
Problem 1
a. Do you think that either of these samples is representative of (or is characteristic of) the entire
10,000 part-time student population?
Solution
a. No. The first sample probably consists of science-oriented students. Besides the chemistry
course, some of them are also taking first-term calculus. Books for these classes tend to be expensive.
Most of these students are, more than likely, paying more than the average part-time student for
their books. The second sample is a group of senior citizens who are, more than likely, taking
courses for health and interest. The amount of money they spend on books is probably much less
than the average part-time student. Both samples are biased. Also, in both cases, not all students
have a chance to be in either sample.


Problem 2
b. Since these samples are not representative of the entire population, is it wise to use the results
to describe the entire population?
Solution
b. No. For these samples, each member of the population did not have an equally likely chance of
being chosen.

Now, suppose we take a third sample. We choose ten different part-time students from the disciplines
of chemistry, math, English, psychology, sociology, history, nursing, physical education, art, and
early childhood development. (We assume that these are the only disciplines in which part-time
students at ABC College are enrolled and that an equal number of part-time students are enrolled in
each of the disciplines.) Each student is chosen using simple random sampling. Using a calculator,
random numbers are generated and a student from a particular discipline is selected if he or she
has a corresponding number. The students spend the following amounts:
$180; $50; $150; $85; $260; $75; $180; $200; $200; $150
Problem 3
c. Is the sample biased?
Solution
c. The sample is unbiased, but a larger sample would be recommended to increase the likelihood
that the sample will be close to representative of the population. However, for a biased sampling
technique, even a large sample runs the risk of not being representative of the population.

Students often ask if it is "good enough" to take a sample, instead of surveying the entire population.
If the survey is done well, the answer is yes.
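
To make the differences among the three samples in Example 1.14 concrete, the sample means can be computed directly from the amounts listed. The sketch below uses only the data given in the example; the dictionary labels are informal descriptions added for readability.

```python
# Amounts (in dollars) spent on books, copied from Example 1.14.
samples = {
    "chemistry class (convenience)": [128, 87, 173, 116, 130, 204, 147, 189, 93, 153],
    "senior citizens in P.E. (every fifth name)": [50, 40, 36, 15, 50, 100, 40, 53, 22, 22],
    "ten disciplines (random)": [180, 50, 150, 85, 260, 75, 180, 200, 200, 150],
}

# Sample mean = sum of the values divided by the number of values.
means = {label: sum(values) / len(values) for label, values in samples.items()}

for label, mean in means.items():
    print(f"{label}: ${mean:.2f}")
# chemistry class (convenience): $142.00
# senior citizens in P.E. (every fifth name): $42.80
# ten disciplines (random): $153.00
```

The first two sample means differ by almost $100, which shows how strongly the choice of sampling technique can affect an estimate.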

Exercise 1.1.3.7 (Solution on p. 78.)


A local radio station has a fan base of 20,000 listeners. The station wants to know if its
audience would prefer more music or more talk shows. Asking all 20,000 listeners is an
almost impossible task.
The station uses convenience sampling and surveys the first 200 people they meet at one
of the station's music concert events. 24 people said they'd prefer more talk shows, and
176 people said they'd prefer more music.
Do you think that this sample is representative of (or is characteristic of) the entire 20,000
listener population?

1.1.3.2 Variation in Data

Variation is present in any set of data. For example, 16-ounce cans of beverage may contain more or less
than 16 ounces of liquid. In one study, eight 16 ounce cans were measured and produced the following
amount (in ounces) of beverage:
15.8; 16.1; 15.2; 14.8; 15.8; 15.9; 16.0; 15.5
Measurements of the amount of beverage in a 16-ounce can may vary because different people make the
measurements or because the exact amount, 16 ounces of liquid, was not put into the cans. Manufacturers
regularly run tests to determine if the amount of beverage in a 16-ounce can falls within the desired range.
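
A quick summary of the eight measurements shows the variation. The sketch below uses only the values listed above; the choice of summary numbers (mean, lowest, highest, and a count of cans under 16 ounces) is an illustrative assumption, not a test a manufacturer necessarily runs.

```python
# Measured contents (in ounces) of eight 16-ounce cans, from the study above.
ounces = [15.8, 16.1, 15.2, 14.8, 15.8, 15.9, 16.0, 15.5]

mean_ounces = sum(ounces) / len(ounces)
lowest, highest = min(ounces), max(ounces)
below_label = sum(1 for x in ounces if x < 16.0)  # cans holding less than the label amount

print(f"mean = {mean_ounces:.1f} oz")
# mean = 15.6 oz
print(f"lowest = {lowest} oz, highest = {highest} oz")
# lowest = 14.8 oz, highest = 16.1 oz
print(f"{below_label} of {len(ounces)} cans hold less than 16 oz")
# 6 of 8 cans hold less than 16 oz
```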


Be aware that as you take data, your data may vary somewhat from the data someone else is taking for
the same purpose. This is completely natural. However, if two or more of you are taking the same data and
get very different results, it is time for you and the others to reevaluate your data-taking methods and your
accuracy.

1.1.3.3 Variation in Samples

It was mentioned previously that two or more samples from the same population, taken randomly, and
having close to the same characteristics of the population will likely be different from each other. Suppose
Doreen and Jung both decide to study the average amount of time students at their college sleep each night.
Doreen and Jung each take samples of 500 students. Doreen uses systematic sampling and Jung uses cluster
sampling. Doreen's sample will be different from Jung's sample. Even if Doreen and Jung used the same
sampling method, in all likelihood their samples would be different. Neither would be wrong, however.
Think about what contributes to making Doreen's and Jung's samples different.
If Doreen and Jung took larger samples (i.e. the number of data values is increased), their sample results
(the average amount of time a student sleeps) might be closer to the actual population average. But still,
their samples would be, in all likelihood, different from each other. This variability in samples cannot be
stressed enough.
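
A small simulation can illustrate this variability. In the sketch below, the population of 10,000 nightly sleep times is made up, and both samples are drawn by simple random sampling (rather than the systematic and cluster methods in the story) to show that even the same method produces different samples. All names and numbers other than the sample size of 500 are assumptions.

```python
import random

rng = random.Random(5)  # fixed seed so the run is repeatable

# Hypothetical population: nightly sleep (in hours) for 10,000 students.
population = [rng.gauss(7.0, 1.2) for _ in range(10_000)]
population_mean = sum(population) / len(population)

# Two samples of 500 students drawn the same way from the same population.
doreen = rng.sample(population, 500)
jung = rng.sample(population, 500)

doreen_mean = sum(doreen) / len(doreen)
jung_mean = sum(jung) / len(jung)

print(f"population mean: {population_mean:.2f} hours")
print(f"Doreen's sample mean: {doreen_mean:.2f} hours")
print(f"Jung's sample mean: {jung_mean:.2f} hours")
```

The two sample means will typically be close to the population mean and to each other, yet the samples themselves contain different students.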

1.1.3.3.1 Size of a Sample

The size of a sample (often called the number of observations) is important. The examples you have seen in
this book so far have been small. Samples of only a few hundred observations, or even smaller, are sufficient
for many purposes. In polling, samples that are from 1,200 to 1,500 observations are considered large enough
and good enough if the survey is random and is well done. You will learn why when you study confidence
intervals.
Be aware that many large samples are biased. For example, call-in surveys are invariably biased, because
people choose to respond or not.

1.1.3.4 Critical Evaluation

We need to evaluate the statistical studies we read about critically and analyze them before accepting the
results of the studies. We listed common problems with sampling techniques above. We re-iterate them here
and add a few additional ones.

• Problems with samples: A sample must be representative of the population. A sample that is not
representative of the population is biased. Biased samples that are not representative of the population
give results that are inaccurate and not valid.
• Self-selected samples: Responses only by people who choose to respond, such as call-in surveys, are
often unreliable.
• Sample size issues: Samples that are too small may be unreliable. Larger samples are better, if possible.
In some situations, having small samples is unavoidable and can still be used to draw conclusions.
Examples: crash testing cars or medical testing for rare conditions
• Undue influence: collecting data or asking questions in a way that influences the response
• Non-response or refusal of subject to participate: The collected responses may no longer be
representative of the population. Often, people with strong positive or negative opinions may answer surveys,
which can affect the results.
• Causality: A relationship between two variables does not mean that one causes the other to occur.
They may be related (correlated) because of their relationship through a different variable.
• Self-funded or self-interest studies: A study performed by a person or organization in order to support
their claim. Is the study impartial? Read the study carefully to evaluate the work. Do not automati-
cally assume that the study is good, but do not automatically assume the study is bad either. Evaluate
it on its merits and the work done.
• Misleading use of data: improperly displayed graphs, incomplete data, or lack of context
• Confounding: When the effects of multiple factors on a response cannot be separated. Confounding
makes it difficult or impossible to draw valid conclusions about the effect of each factor.

1.1.3.5 References

Gallup-Healthways Well-Being Index. http://www.well-beingindex.com/default.asp (accessed May 1, 2013).
Gallup-Healthways Well-Being Index. http://www.well-beingindex.com/methodology.asp (accessed May 1, 2013).
Gallup-Healthways Well-Being Index. http://www.gallup.com/poll/146822/gallup-healthways-index-questions.aspx (accessed May 1, 2013).
Data from http://www.bookofodds.com/Relationships-Society/Articles/A0374-How-George-Gallup-Picked-the-President
Dominic Lusinchi, 'President' Landon and the 1936 Literary Digest Poll: Were Automobile and Telephone Owners to Blame? Social Science History 36, no. 1: 23-54 (2012), http://ssh.dukejournals.org/content/36/1/23.abstract (accessed May 1, 2013).
The Literary Digest Poll, Virtual Laboratories in Probability and Statistics http://www.math.uah.edu/stat/data/LiteraryDigest.html (accessed May 1, 2013).
Gallup Presidential Election Trial-Heat Trends, 1936-2008, Gallup Politics http://www.gallup.com/poll/110548/gallup-presidential-election-trialheat-trends-19362004.aspx#4 (accessed May 1, 2013).
The Data and Story Library, http://lib.stat.cmu.edu/DASL/Datafiles/USCrime.html (accessed May 1, 2013).
LBCC Distance Learning (DL) program data in 2010-2011, http://de.lbcc.edu/reports/2010-11/future/highlights.html#focus (accessed May 1, 2013).
Data from San Jose Mercury News

1.1.3.6 Chapter Review

Data are individual items of information that come from a population or sample. Data may be classified as
qualitative, quantitative continuous, or quantitative discrete.
Because it is not practical to measure the entire population in a study, researchers use samples to represent
the population. A random sample is a representative group from the population chosen by using a method
that gives each individual in the population an equal chance of being included in the sample. Random
sampling methods include simple random sampling, stratified sampling, cluster sampling, and systematic
sampling. Convenience sampling is a nonrandom method of choosing a sample that often produces biased
data.
Samples that contain dierent individuals result in dierent data. This is true even when the samples
are well-chosen and representative of the population. When properly selected, larger samples model the
population more closely than smaller samples. There are many dierent potential problems that can aect
the reliability of a sample. Statistical data needs to be critically analyzed, not simply accepted.

1.1.3.7 HOMEWORK

For the following exercises, identify the type of data that would be used to describe a response (quantitative
discrete, quantitative continuous, or categorical), and give an example of the data.
Exercise 1.1.3.8 (Solution on p. 78.)
number of tickets sold to a concert
Exercise 1.1.3.9 (Solution on p. 78.)
percent of body fat

Available for free at Connexions <http://cnx.org/content/col23820/1.2>



Exercise 1.1.3.10 (Solution on p. 78.)


favorite baseball team
Exercise 1.1.3.11 (Solution on p. 78.)
time in line to buy groceries
Exercise 1.1.3.12 (Solution on p. 78.)
number of students enrolled at Evergreen Valley College
Exercise 1.1.3.13 (Solution on p. 78.)
most-watched television show
Exercise 1.1.3.14 (Solution on p. 78.)
brand of toothpaste
Exercise 1.1.3.15 (Solution on p. 78.)
distance to the closest movie theatre
Exercise 1.1.3.16 (Solution on p. 78.)
age of executives in Fortune 500 companies
Use the following information to answer the next two exercises: A study was done to determine the age,
number of times per week, and the duration (amount of time) of resident use of a local park in Vancouver.
The first house in the neighbourhood around the park was selected randomly and then every 8th house in
the neighbourhood around the park was interviewed.
Exercise 1.1.3.17 (Solution on p. 78.)
Number of times per week is what type of data?

a. nominal qualitative ordinal


b. quantitative discrete
c. quantitative continuous
d. categorical nominal
e. categorical ordinal

Exercise 1.1.3.18 (Solution on p. 78.)


Duration (amount of time) is what type of data?

a. nominal qualitative ordinal


b. quantitative discrete
c. quantitative continuous
d. categorical nominal
e. categorical ordinal

Exercise 1.1.3.19 (Solution on p. 78.)


Airline companies are interested in the consistency of the number of babies on each flight, so that
they have adequate safety equipment. Suppose an airline conducts a survey. Over Thanksgiving
weekend, it surveys six flights from Montreal to Halifax to determine the number of babies on the
flights. It determines the amount of safety equipment needed by the result of that study.

a. Using complete sentences, list three things wrong with the way the survey was conducted.
b. Using complete sentences, list three ways that you would improve the survey if it were to be
repeated.

Exercise 1.1.3.20 (Solution on p. 79.)


Suppose you want to determine the mean number of cans of soda drunk each month by students
in their twenties at your school. Describe a possible sampling method in three to five complete
sentences. Make the description detailed.


Exercise 1.1.3.21 (Solution on p. 79.)


Name the sampling method used in each of the following situations:

a. A woman in the airport is handing out questionnaires to travelers asking them to evaluate
the airport's service. She does not ask travelers who are hurrying through the airport with
their hands full of luggage, but instead asks all travelers who are sitting near gates and not
taking naps while they wait.
b. A teacher wants to know if her students are doing homework, so she randomly selects rows
two and five and then calls on all students in row two and all students in row five to present
the solutions to homework problems to the class.
c. The marketing manager for an electronics chain store wants information about the ages of its
customers. Over the next two weeks, at each store location, 100 randomly selected customers
are given questionnaires to fill out asking for information about age, as well as about other
variables of interest.
d. The librarian at a public library wants to determine what proportion of the library users are
children. The librarian has a tally sheet on which she marks whether books are checked out
by an adult or a child. She records this data for every fourth patron who checks out books.
e. A political party wants to know the reaction of voters to a debate between the candidates. The
day after the debate, the party's polling staff calls 1,200 randomly selected phone numbers.
If a registered voter answers the phone or is available to come to the phone, that registered
voter is asked whom he or she intends to vote for and whether the debate changed his or her
opinion of the candidates.

Exercise 1.1.3.22 (Solution on p. 79.)


In advance of the 1936 Presidential Election, a magazine titled Literary Digest released the results
of an opinion poll predicting that the Republican candidate Alf Landon would win by a large margin.
The magazine sent postcards to approximately 10,000,000 prospective voters. These prospective
voters were selected from the subscription list of the magazine, from automobile registration lists,
from phone lists, and from club membership lists. Approximately 2,300,000 people returned the
postcards.

a. Think about the state of the United States in 1936. Explain why a sample chosen from
magazine subscription lists, automobile registration lists, phone books, and club membership
lists was not representative of the population of the United States at that time.
b. What effect does the low response rate have on the reliability of the sample?
c. Are these problems examples of sampling error or nonsampling error?
d. During the same year, George Gallup conducted his own poll of 30,000 prospective voters.
His researchers used a method they called "quota sampling" to obtain survey answers from
specific subsets of the population. Quota sampling is an example of which sampling method
described in this module?

Exercise 1.1.3.23 (Solution on p. 79.)


YouPolls is a website that allows anyone to create and respond to polls. One question posted April
15 asks:
Do you feel happy paying your taxes when members of the Obama administration are allowed
to ignore their tax liabilities? 4
As of April 25, 11 people responded to this question. Each participant answered NO!
Which of the potential problems with samples discussed in this module could explain this con-
nection?

4 lastbaldeagle. 2013. On Tax Day, House to Call for Firing Federal Workers Who Owe Back Taxes. Opinion poll posted
online at: http://www.youpolls.com/details.aspx?id=12328 (accessed May 1, 2013).


1.1.4 Experimental Design and Ethics  MtRoyal - Version2016RevA5


Does aspirin reduce the risk of heart attacks? Is one brand of fertilizer more effective at growing roses than
another? Is fatigue as dangerous to a driver as the influence of alcohol? Questions like these are answered
using randomized experiments. In this module, you will learn important aspects of experimental design.
Proper study design is essential for producing reliable, accurate data.
The purpose of an experiment is to investigate the relationship between two variables. When one variable
causes change in another, we call the first variable the independent variable or explanatory variable.
The affected variable is called the dependent variable or response variable. In a randomized experiment,
the researcher manipulates values of the explanatory variable and measures the resulting changes in the
response variable. The different values of the explanatory variable are called treatments. An experimental
unit is a single object or individual to be measured.
You want to investigate the effectiveness of vitamin E in preventing disease. You recruit a group of
subjects and ask them if they regularly take vitamin E. You notice that the subjects who take vitamin
E exhibit better health on average than those who do not. Does this prove that vitamin E is effective in
disease prevention? It does not. There are many differences between the two groups compared in addition
to vitamin E consumption. People who take vitamin E regularly often take other steps to improve their
health: exercise, diet, other vitamin supplements, choosing not to smoke. Any one of these factors could be
influencing health. As described, this study does not prove that vitamin E is the key to disease prevention.
Additional variables that can cloud a study are called lurking variables. In order to prove that the
explanatory variable is causing a change in the response variable, it is necessary to isolate the explanatory
variable. The researcher must design her experiment in such a way that there is only one difference between
groups being compared: the planned treatments. This is accomplished by the random assignment of
experimental units to treatment groups. When subjects are assigned treatments randomly, all of the potential
lurking variables are spread equally among the groups. At this point the only difference between groups is
the one imposed by the researcher. Different outcomes measured in the response variable, therefore, must be
a direct result of the different treatments. In this way, an experiment can prove a cause-and-effect connection
between the explanatory and response variables.
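
Random assignment is easy to carry out with software. The following is a minimal sketch in Python, not a prescribed procedure: the subject labels, the treatment names, and the helper `assign_treatments` are all hypothetical and chosen only for illustration. It shuffles the experimental units and then deals them out to the treatment groups in turn, so that chance alone decides which unit receives which treatment.

```python
import random

def assign_treatments(units, treatments, seed=None):
    """Randomly split experimental units among treatments, as evenly as possible."""
    rng = random.Random(seed)          # optional seed only makes the shuffle repeatable
    shuffled = list(units)
    rng.shuffle(shuffled)              # chance, not the researcher, decides the order
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        # Deal the shuffled units out to the treatments in turn.
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

# Hypothetical experiment: 20 subjects, two treatments.
subjects = [f"subject_{i}" for i in range(1, 21)]
groups = assign_treatments(subjects, ["vitamin E", "no supplement"], seed=1)

for treatment, members in groups.items():
    print(treatment, len(members), members)
```

Because the shuffle ignores everything about the subjects, habits such as exercise or smoking are not systematically tied to either group.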
The power of suggestion can have an important influence on the outcome of an experiment. Studies have
shown that the expectation of the study participant can be as important as the actual medication. In one
study of performance-enhancing drugs, researchers noted:
Results showed that believing one had taken the substance resulted in [performance] times almost as fast
as those associated with consuming the drug itself. In contrast, taking the drug without knowledge yielded
no significant performance increment.6
When participation in a study prompts a physical response from a participant, it is difficult to isolate the
effects of the explanatory variable. To counter the power of suggestion, researchers set aside one treatment
group as a control group. This group is given a placebo treatment, that is, a treatment that cannot influence
the response variable. The control group helps researchers balance the effects of being in an experiment
with the effects of the active treatments. Of course, if you are participating in a study and you know that
you are receiving a pill which contains no actual medication, then the power of suggestion is no longer a
factor. Blinding in a randomized experiment preserves the power of suggestion. When a person involved in
a research study is blinded, he or she does not know who is receiving the active treatment(s) and who is receiving
the placebo treatment. A double-blind experiment is one in which both the subjects and the researchers
involved with the subjects are blinded.
5 This content is available online at <http://cnx.org/content/m62317/1.1/>.
6 McClung, M. Collins, D. Because I know it will!: placebo effects of an ergogenic aid on athletic performance. Journal of
Sport & Exercise Psychology. 2007 Jun. 29(3):382-94. Web. April 30, 2013.

Example 1.15
The Smell & Taste Treatment and Research Foundation conducted a study to investigate whether
smell can affect learning. Subjects completed mazes multiple times while wearing masks. They
completed the pencil and paper mazes three times wearing floral-scented masks, and three times
with unscented masks. Participants were assigned at random to wear the floral mask during the
first three trials or during the last three trials. For each trial, researchers recorded the time it
took to complete the maze and the subject's impression of the mask's scent: positive, negative, or
neutral.
a. Describe the explanatory and response variables in this study.
b. What are the treatments?
c. Identify any lurking variables that could interfere with this study.
d. Is it possible to use blinding in this study?

Solution

a. The explanatory variable is scent, and the response variable is the time it takes to complete
the maze.
b. There are two treatments: a floral-scented mask and an unscented mask.
c. All subjects experienced both treatments. The order of treatments was randomly assigned so
there were no differences between the treatment groups. Random assignment eliminates the
problem of lurking variables.
d. Subjects will clearly know whether they can smell flowers or not, so subjects cannot be blinded
in this study. Researchers timing the mazes can be blinded, though. The researcher who is
observing a subject will not know which mask is being worn.

1.1.4.1 References

Vitamin E and Health, Nutrition Source, Harvard School of Public Health,
http://www.hsph.harvard.edu/nutritionsource/vitamin-e/ (accessed May 1, 2013).
Stan Reents. Don't Underestimate the Power of Suggestion, athleteinme.com,
http://www.athleteinme.com/ArticleView.aspx?id=1053 (accessed May 1, 2013).
Ankita Mehta. Daily Dose of Aspirin Helps Reduce Heart Attacks: Study, International Business
Times, July 21, 2011. Also available online at http://www.ibtimes.com/daily-dose-aspirin-helps-reduce-
heart-attacks-study-300443 (accessed May 1, 2013).
The Data and Story Library, http://lib.stat.cmu.edu/DASL/Stories/ScentsandLearning.html (accessed
May 1, 2013).
M.L. Jackson et al., Cognitive Components of Simulated Driving Performance: Sleep
Loss effect and Predictors, Accident Analysis and Prevention Journal, Jan no. 50 (2013),
http://www.ncbi.nlm.nih.gov/pubmed/22721550 (accessed May 1, 2013).
Earthquake Information by Year, U.S. Geological Survey. http://earthquake.usgs.gov/earthquakes/eqarchives/year/
(accessed May 1, 2013).
Fatality Analysis Report Systems (FARS) Encyclopedia, National Highway Traffic and Safety Administration.
http://www-fars.nhtsa.dot.gov/Main/index.aspx (accessed May 1, 2013).
Data from www.businessweek.com (accessed May 1, 2013).
Data from www.forbes.com (accessed May 1, 2013).
America's Best Small Companies, http://www.forbes.com/best-small-companies/list/ (accessed May 1,
2013).
U.S. Department of Health and Human Services, Code of Federal Regulations Title 45 Public Welfare
Department of Health and Human Services Part 46 Protection of Human Subjects revised January 15, 2009.
Section 46.111:Criteria for IRB Approval of Research.
April 2013 Air Travel Consumer Report, U.S. Department of Transportation, April 11 (2013),
http://www.dot.gov/airconsumer/april-2013-air-travel-consumer-report (accessed May 1, 2013).


Lori Alden, Statistics can be Misleading, econoclass.com, http://www.econoclass.com/misleadingstats.html
(accessed May 1, 2013).
Maria de los A. Medina, Ethics in Statistics, Based on Building an Ethics Module for Busi-
ness, Science, and Engineering Students by Jose A. Cruz-Cruz and William Frey, Connexions,
http://cnx.org/content/m15555/latest/ (accessed May 1, 2013).

1.1.4.2 Chapter Review

A poorly designed study will not produce reliable data. There are certain key components that must be
included in every experiment. To eliminate lurking variables, subjects must be assigned randomly to different
treatment groups. One of the groups must act as a control group, demonstrating what happens when the
active treatment is not applied. Participants in the control group receive a placebo treatment that looks
exactly like the active treatments but cannot influence the response variable. To preserve the integrity of
the placebo, both researchers and subjects may be blinded. When a study is designed properly, the only
difference between treatment groups is the one imposed by the researcher. Therefore, when groups respond
differently to different treatments, the difference must be due to the influence of the explanatory variable.
An ethics problem arises when you are considering an action that benefits you or some cause you support,
hurts or reduces benefits to others, and violates some rule. 7 Ethical violations in statistics are not always
easy to spot. Professional associations and federal agencies post guidelines for proper conduct. It is important
that you learn basic statistical procedures so that you can recognize proper data analysis.
7 Andrew Gelman, Open Data and Open Methods, Ethics and Statistics, http://www.stat.columbia.edu/∼gelman/research/published/ChanceE
(accessed May 1, 2013).


1.2 Chapter 2: Descriptive Statistics


1.2.1 Introduction  Descriptive Statistics  MRU - C Lemieux (2017)8

Figure 1.4: When you have large amounts of data, you will need to organize it in a way that makes
sense. These ballots from an election are rolled together with similar ballots to keep them organized.
(credit: William Greeson)

By the end of this chapter, the student should be able to:

• Display data graphically and interpret graphs: pie charts, bar graphs, histograms and box
plots.
• Recognize, describe, calculate, and interpret measures of location: quartiles and percentiles.
• Recognize, describe, calculate, and interpret measures of centre: mean, median and mode.
• Recognize, describe, calculate, and interpret measures of variation: variance, standard deviation,
range, interquartile range and coefficient of variation.

Once you have collected data, what will you do with it? Data can be described and presented in many
different formats. For example, suppose you are interested in buying a house in a particular area. You may
have no clue about the house prices, so you might ask your real estate agent to give you a sample data set of
prices. Looking at all the prices in the sample often is overwhelming. A better way might be to look at the
median price and the variation of prices. The median and variation are just two ways that you will learn to
describe data. Your agent might also provide you with a graph of the data.
8 This content is available online at <http://cnx.org/content/m64282/1.1/>.
In this chapter, you will study visual and numerical ways to describe and display your data. This area
of statistics is called Descriptive Statistics. If you have collected 200 data values, just looking at them
won't tell anyone much about the data. Instead, you want to summarize the raw data in a way that helps you
better understand what's going on.
Categorical data is usually summarized using a visual representation like a pie chart or a bar graph. The
numerical summary for categorical data would be a percentage, fraction or decimal.
For quantitative data, it is a bit more involved. In general, there are three components to a good summary
of quantitative data: a visual representation, a measure of centre, and a measure of variation.
The visual representation can give you a sense of the centre and variation in the data, but it is especially useful
for determining the shape of the data. Is the data all clustered together? Are there a bunch of data values on one
side, but a few on the other? Do all of the data values occur with the same frequency? The shape describes
this. Histograms and box plots are both visual representations of quantitative data.
Measures of centre, also known as averages or measures of central tendency, provide a value (or values)
that gives us a sense of a typical value in the data set. This doesn't tell us about a specific member of the
population, but instead lets us know what the average one is like. Measures of centre we will learn about
include the mean, median, and mode.
Though a measure of centre tells us about a typical value in a data set, measures of variation tell
us how much the data values vary from each other. Are they all clumped together? Are they all spread
out? Measures of variation can tell us how consistent or how volatile the data is. If we are analyzing stock
prices, the more variation there is, the more volatile and risky the investment is. But the rewards
may be greater! Measures of variation that we will learn about include range, variance, standard deviation,
interquartile range, and the coefficient of variation.
When we describe the shape, centre, and variation of the data, we are describing the distribution of the
data. If we only focus on one aspect of the distribution (say the centre), then we miss out on some important
information, which is why we always want to consider all three aspects when summarizing quantitative data.
For example, suppose two stock prices have the same average price. If we only look at the average, we might
think they are equivalent. But if one of them has greater variation, then that means that one is more volatile
and riskier than the other one.
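
The stock comparison above can be illustrated with a small sketch in Python. The two price lists are invented for illustration (they are not data from this text), and the range and sample standard deviation are used here only as two of the measures of variation named earlier.

```python
from statistics import mean, stdev

# Hypothetical closing prices (in dollars) for two stocks over five days.
stock_a = [48, 49, 50, 51, 52]
stock_b = [30, 40, 50, 60, 70]

mean_a, mean_b = mean(stock_a), mean(stock_b)   # same centre for both stocks
range_a = max(stock_a) - min(stock_a)           # range = largest value - smallest value
range_b = max(stock_b) - min(stock_b)
sd_a, sd_b = stdev(stock_a), stdev(stock_b)     # sample standard deviations

print(f"Stock A: mean = {mean_a}, range = {range_a}, standard deviation = {sd_a:.2f}")
print(f"Stock B: mean = {mean_b}, range = {range_b}, standard deviation = {sd_b:.2f}")
```

Both stocks have the same mean price, but stock B's prices are far more spread out, so looking at the centre alone would hide the difference in risk.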
Box plots (or box and whisker diagrams) are a special type of visual representation that includes both
visual and numerical elements. A box plot divides the data into quarters (or quartiles). Thus, a box plot
contains a measure of centre (the second quartile is the halfway point, called the median) and a measure of
variation (the distance between the first quartile and the third quartile is called the interquartile range). The
box plot can also give a sense of the data's shape. The box plot then is the only representation that we will
see that gives us a sense of the distribution all in one representation (i.e. gives a sense of centre, variation,
and shape). It also has an additional benefit of identifying outliers. Outliers are data values that are
abnormal. That is, they differ significantly from the other data values. A box plot shows if there are any
outliers.
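
As a rough illustration of the numbers behind a box plot, the sketch below uses Python's standard `statistics.quantiles` function on a made-up data set. Different textbooks and software packages use slightly different conventions for locating quartiles, so the values it returns for other data sets may not match a hand calculation done with the method taught in this course.

```python
from statistics import quantiles

# Hypothetical, already ordered data set of 11 values.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(data, n=4)

iqr = q3 - q1   # interquartile range: the spread of the middle half of the data

print(f"first quartile = {q1}, median = {q2}, third quartile = {q3}, IQR = {iqr}")
```

The second quartile is the median, and the interquartile range is the length of the box in the box plot.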
This chapter will go over descriptive statistics by focusing on visual and numerical representations of
data. Though categorical data is discussed, the main focus will be on determining the distribution and
outliers for quantitative data.
The vast majority of the time when conducting statistical studies, we will only have access to sample
data. In this situation, we will want to analyze the sample data to see if we can come to any conclusions
about the population data. Once we make the leap from simply describing a sample to using that sample to
draw conclusions about the population, we are doing inferential statistics. These concepts and techniques
are covered in chapters seven and eight.

important: The distribution of sample data ideally mimics the distribution of the population.
But the smaller the sample size the greater the potential for there to be differences between the
two distributions. This means that, for a large enough sample size, the distribution of the sample
generally gives a good idea of distribution of the population. This is an example of the law of large
numbers. In other words, if the sample size is large enough and the data is collected properly, then
the sample mean will most likely be a good estimate of the population mean, the sample standard
deviation will most likely be a good estimate of the population standard deviation, and the shape
of the sample data will most likely be a good estimate of the shape of the population.
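
The idea that larger samples tend to track the population more closely can be explored with a short simulation. The sketch below is only an illustration under stated assumptions: the population is invented (100,000 whole numbers between 1 and 100), the seed is arbitrary, and the sample sizes are chosen for convenience.

```python
import random
from statistics import mean

rng = random.Random(2262)   # arbitrary fixed seed so the run is repeatable

# Hypothetical population: 100,000 values drawn uniformly from 1 to 100.
population = [rng.randint(1, 100) for _ in range(100_000)]
population_mean = mean(population)

def sample_mean(n):
    """Mean of a simple random sample of size n, drawn without replacement."""
    return mean(rng.sample(population, n))

# Gap between the sample mean and the population mean for growing sample sizes.
gaps = {n: abs(sample_mean(n) - population_mean) for n in (10, 100, 1_000, 10_000)}

for n, gap in gaps.items():
    print(f"n = {n:>6}: sample mean misses the population mean by {gap:.2f}")
```

In most runs the gap shrinks as n grows, but a particular small sample can still land closer to the population mean than a larger one. A larger sample is only more likely to give a good estimate.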

1.2.2 Descriptive Statistics - Numerical Summaries of Data - MRU - C Lemieux9


By the end of this section, we want to be able to describe the distribution of quantitative data (i.e. shape,
centre and variation). In the previous section, we looked at the shape of quantitative data. This section
focuses on numerical summaries of data for quantitative data. In particular, it focuses on measures of centre
and measures of variation.
There are other numerical summaries of data called measures of location, which will be discussed in the
next section.

1.2.2.1 Measures of centre

Measures of centre or average give us a sense of what a typical value in a data set is. For example, the
average number of children in a family in Canada is 1.9. This means that a typical family will have about
1.9 children. Obviously, no family has exactly 1.9 children, but this gives a sense of how many children
families have on average. Further, some families may have 8 children. Others may have no children. The
measure of centre gives a sense of what is going on in the middle of the data set.

note: Even though you may wish to round an average to a whole number (especially when it is
about the number of people), this is neither necessary nor appropriate. The average gives a sense of the
centre of the data, which is not necessarily an actual data value.

The "centre" of a data set is a way of describing a typical value in a data set. The three most widely used
measures of the "centre" of the data are the mean, median and mode.
To explain these three measures of centre, let's look at an example. Suppose we want to find the average
weight of 50 people. To calculate the mean weight of the 50 people, we would add the 50 weights together
and divide by 50. To find the median weight of the 50 people, order the data from least heavy to most heavy,
and find the weight that splits the data into two equal parts. The mode is the most commonly occurring
value. To find the mode, find the weight that occurs the most frequently.
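
The three calculations just described can be sketched in Python with the standard `statistics` module. To keep the arithmetic easy to follow, the sketch uses six made-up weights rather than fifty; the values and the variable names are assumptions for illustration only.

```python
from statistics import mean, median, mode

# Hypothetical weights in kilograms (six people instead of fifty).
weights_kg = [60, 70, 70, 75, 85, 90]

mean_weight = mean(weights_kg)       # add the weights and divide by how many there are
median_weight = median(weights_kg)   # middle of the ordered data (average of 70 and 75)
mode_weight = mode(weights_kg)       # the weight that occurs most often

print(f"mean = {mean_weight}, median = {median_weight}, mode = {mode_weight}")
```

Here the three measures of centre give three different answers (75, 72.5 and 70), which is why it matters which one is reported.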
This section provides more details on how to find the measures of centre, the notation for the measures,
and when it is best to use each measure.

Though the words mean and average are sometimes used interchangeably, they do not
necessarily mean the same thing. In general, average is any measure of centre and mean is
a specific type of centre. Many people treat average and mean as the same thing, but this is not
always the case. For example, when people talk about average housing price, they are usually
referring to the median house price.

1.2.2.1.1 Mean

The mean of a data set can be thought of as a balancing point (or fulcrum). If you think of numbers as
weighted, then the mean is the number that will balance the data values evenly. Suppose your data values
are 1, 2, 3, 4, 5. Then the number that balances the data is 3. To go a little deeper, the balance point is
three because the distance between 3 and the data values less than it is equal to the distance between 3 and
the data values greater than it as shown in Figure 1.5.
9 This content is available online at <http://cnx.org/content/m64905/1.3/>.


Figure 1.5: To find the mean of this data, we need to find the number that balances the data equally
on both sides.

Let's try a harder example. Suppose our data values are 0, 1, 1, 2, 3, 3, 4, 6. The mean will be the
number such that the total distance to the data values below it and the total distance to the data values
above it are the same. Let's see if 3 is the mean again. Then the distance between our suggested "mean" and
0 is 3; the distance between our "mean" and 1 is 2 (but there are two of them); and the distance between
our "mean" and 2 is 1. That is, the total distance between our "mean" and all of the data values below it is
3+2+2+1 = 8. If 3 is actually our mean, then the total distance between 3 and the data values above it will
also be 8. Let's check. The distance between our "mean" and 4 is 1; the distance between our "mean" and
6 is 3. The total distance above 3 is only 4. Therefore, 3 cannot be our mean as it doesn't balance our data.

note: The two data values of 3 were ignored as their distance from the suggested mean is 0.
Therefore, they would not change the answer if included.

From our calculations above, the choice of 3 was too big as the lower side was too heavy. Let's try 2.5 as our
mean. If the mean is 2.5, then the distance between our "mean" and 0 is 2.5; the distance between our
"mean" and 1 is 1.5 (but there are two of them); the distance between our "mean" and 2 is 0.5. Thus the
total distance between our mean of 2.5 and the data values below it is 2.5 + 1.5 + 1.5 + 0.5 = 6. If 2.5 is
our mean, then the total distance above 2.5 should also be 6. The distance between our "mean" and 3 is
0.5 (but there are two of them); the distance between our "mean" and 4 is 1.5; the distance between our
"mean" and 6 is 3.5. Thus the total distance between our suggested mean of 2.5 and the data values above
it is 0.5 + 0.5 + 1.5 + 3.5 = 6. The two sides balance. Therefore, 2.5 is the mean for this data.
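
The balancing check worked through above can be repeated quickly in Python. This is only a sketch of the idea; the helper name `distances_from` is made up for illustration.

```python
def distances_from(data, candidate):
    """Return (total distance to values below, total distance to values above) the candidate."""
    below = sum(candidate - x for x in data if x < candidate)
    above = sum(x - candidate for x in data if x > candidate)
    return below, above

data = [0, 1, 1, 2, 3, 3, 4, 6]

print(distances_from(data, 3))     # lower side is heavier, so 3 is too big
print(distances_from(data, 2.5))   # both sides match, so 2.5 balances the data
print(sum(data) / len(data))       # the usual formula gives the same balance point
```

For a candidate of 3 the totals are 8 below and 4 above, and for 2.5 they are 6 and 6, matching the hand calculation.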


Figure 1.6: To find the mean of this data, we need to find the number that balances the data equally
on both sides. Notice that the mean here is not a data value.

Thankfully we don't have to do these in-depth calculations and guesses each time. Instead the formula
is pretty straightforward.
The Greek letter µ (pronounced "mew") represents the population mean. That is, it is the mean for
the population data.
Formula for Population Mean
\[ \mu = \frac{1}{N} \sum_{i=1}^{N} x_i \tag{1.6} \]

The letter used to represent the sample mean is an x with a bar over it (pronounced "x bar"): \(\bar{x}\). It is the
mean of a sample of data from the population.
The sample mean is an estimate of the population mean. One of the requirements for the sample mean
to be a good estimate of the population mean is for the sample taken to be truly random.
Formula for Sample Mean
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{1.6} \]
To see how the formula works, consider the sample:
1; 1; 1; 2; 2; 3; 4; 4; 4; 4; 4
\[ \bar{x} = \frac{1+1+1+2+2+3+4+4+4+4+4}{11} = \frac{30}{11} \approx 2.7 \tag{1.6} \]
Note: Since it is sample data, we use the symbol \(\bar{x}\).
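
The same calculation can be checked with a few lines of Python. This is a minimal sketch that follows the sample mean formula directly: add up the data values and divide by the number of values. The variable names are chosen only for illustration.

```python
sample = [1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4]

n = len(sample)        # number of data values in the sample
total = sum(sample)    # the sum of all the x values
x_bar = total / n      # sample mean

print(f"n = {n}, sum = {total}, sample mean = {x_bar:.1f}")
```

The sum is 30 and there are 11 values, so the sample mean is 30/11, which rounds to 2.7.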

If the size of a random sample is increased, then the sample mean will more likely be a better
estimate of the population mean.

Note: Just because the sample size increases does not mean that the sample mean for the larger
sample must be a better estimate. It is only that it is more likely to be a better estimate.

1.2.2.1.2 Median

On a road, the median is in the middle of the road. In statistics, the median is the middle data value (when
the data is in order).
You can quickly find the location or position of the median by using the expression \(\frac{n+1}{2}\).
The letter n is the total number of data values in the sample. If n is an odd number, the median is
the middle value of the ordered data (ordered smallest to largest). If n is an even number, the median is
equal to the two middle values added together and divided by two after the data has been ordered. For
example, if the total number of data values is 97, then \(\frac{n+1}{2} = \frac{97+1}{2} = 49\). The median is the 49th value in
the ordered data. If the total number of data values is 100, then \(\frac{n+1}{2} = \frac{100+1}{2} = 50.5\). The median occurs
midway between the 50th and 51st values. The location of the median and the value of the median are not
the same. The upper case letter M is often used to represent the median. The next example illustrates the
location of the median and the value of the median.
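As a minimal sketch of the difference between the location and the value of the median, the rule above can be written in Python. The function names and the two short lists at the end are hypothetical and are chosen only to show the odd and even cases.

```python
def median_location(n):
    """Position of the median in the ordered data: (n + 1) / 2."""
    return (n + 1) / 2


def median(values):
    """Value of the median: the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]  # odd n: a single middle value
    return (ordered[middle - 1] + ordered[middle]) / 2  # even n: average the two middle values


print(median_location(97))   # 49.0
print(median_location(100))  # 50.5
print(median([7, 1, 5]))     # 5
print(median([7, 1, 5, 3]))  # 4.0
```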

1.2.2.1.3 Mode

Another measure of the center is the mode. The mode is the data value that occurs most frequently and at
least twice.
A data set can have:

• no mode
• one mode (unimodal)
• two modes (bimodal)
• many modes (multimodal)

Consider the statistics exam scores for 20 students:


50; 53; 59; 59; 63; 63; 72; 72; 72; 72; 72; 76; 78; 81; 83; 84; 84; 84; 90; 93
The most frequent score is 72, which occurs five times. Mode = 72.

note: The mode can be calculated for qualitative data as well as for quantitative data. For example,
if the data set is: red, red, red, green, green, yellow, purple, black, blue, the mode is red.
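The definition of the mode (most frequent value, occurring at least twice) can be sketched in Python as follows. This is only an illustration; the function name `modes` is our own, and it returns a list so that bimodal and multimodal data sets are handled too.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs most often, provided it occurs at least twice."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest < 2:
        return []  # no value repeats, so there is no mode
    return [value for value, count in counts.items() if count == highest]


scores = [50, 53, 59, 59, 63, 63, 72, 72, 72, 72, 72, 76, 78, 81, 83, 84, 84, 84, 90, 93]
colours = ["red", "red", "red", "green", "green", "yellow", "purple", "black", "blue"]

print(modes(scores))     # [72]
print(modes(colours))    # ['red']
print(modes([1, 2, 3]))  # []
```

Because the function only counts occurrences, it works for qualitative data (the colours) as well as quantitative data (the exam scores).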

Example 1.16
AIDS data indicating the number of months a patient with AIDS lives after taking a new antibody
drug are as follows (smallest to largest):
3; 4; 8; 8; 10; 11; 12; 13; 14; 15; 15; 16; 16; 17; 17; 18; 21; 22; 22; 24; 24; 25; 26; 26; 27; 27; 29; 29;
31; 32; 33; 33; 34; 34; 35; 37; 40; 44; 44; 47;
Calculate the mean, median and mode.
Solution
The calculation for the mean is:
\[ \bar{x} = \frac{3+4+(8)(2)+10+11+12+13+14+(15)(2)+(16)(2)+\dots+35+37+40+(44)(2)+47}{40} = 23.6 \]
To find the median, M, first use the formula for the location. The location is:
\[ \frac{n+1}{2} = \frac{40+1}{2} = 20.5 \]
Starting at the smallest value, the median is located between the 20th and 21st values (the two 24s):
3; 4; 8; 8; 10; 11; 12; 13; 14; 15; 15; 16; 16; 17; 17; 18; 21; 22; 22; 24; 24; 25; 26; 26; 27; 27; 29; 29;
31; 32; 33; 33; 34; 34; 35; 37; 40; 44; 44; 47;
\[ M = \frac{24+24}{2} = 24 \]
To find the mode, we first have to determine if any data values repeat. If no
data values repeat, there is no mode. Since 8 appears twice, we know there is a mode.
We then need to check whether any data value appears more than twice. If one did, the value
with the highest count would be the mode. Since no data value appears more than twice, every
data value that appears twice is a mode.
Therefore, the modes are 8, 15, 16, 17, 22, 24, 26, 27, 29, 33, 34, 44. This data set is multimodal.
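The three calculations in Example 1.16 can be checked with a short sketch that uses Python's standard `statistics` module (an assumption on our part; the course itself uses other software). Note that `statistics.multimode` returns every value tied for the highest count, and would return all values if nothing repeated, which differs slightly from the "at least twice" rule used in this text; that difference does not matter for this data set.

```python
import statistics

# Months of survival from Example 1.16 (40 values, already ordered).
months = [3, 4, 8, 8, 10, 11, 12, 13, 14, 15, 15, 16, 16, 17, 17, 18, 21, 22, 22, 24,
          24, 25, 26, 26, 27, 27, 29, 29, 31, 32, 33, 33, 34, 34, 35, 37, 40, 44, 44, 47]

mean = statistics.mean(months)             # sum of the values divided by 40
median = statistics.median(months)         # mean of the 20th and 21st ordered values
mode_values = statistics.multimode(months)  # every value tied for the highest count

print(len(months), round(mean, 1), median)  # 40 23.6 24.0
print(mode_values)
```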

Example 1.17
Suppose that in a small town of 50 people, one person earns $5,000,000 per year and the other
49 each earn $30,000. Which is the better measure of the "center": the mean, the median or the
mode?
Solution
\[ \bar{x} = \frac{5{,}000{,}000 + 49(30{,}000)}{50} = 129{,}400 \]
\[ M = 30{,}000 \]
(There are 49 people who earn $30,000 and one person who earns $5,000,000.)
The mode is 30,000 as this data value occurs 49 times.
Since the median and mode are equal, let's focus on the median. The median is a better
measure of the "center" than the mean because 49 of the values are 30,000 and one is 5,000,000.
The 5,000,000 is an outlier. The 30,000 gives us a better sense of the middle of the data.
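To see how strongly the single outlier pulls the mean, the town's incomes can be rebuilt in a few lines of Python. This is only a sketch using the standard `statistics` module; the variable names are our own.

```python
import statistics

# One person earning $5,000,000 and 49 people earning $30,000 each.
incomes = [5_000_000] + [30_000] * 49

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)
mode_income = statistics.mode(incomes)

print(f"mean   = {mean_income:,.0f}")    # mean   = 129,400
print(f"median = {median_income:,.0f}")  # median = 30,000
print(f"mode   = {mode_income:,.0f}")    # mode   = 30,000

# Dropping the outlier moves the mean all the way back to 30,000,
# while the median and mode do not change.
without_outlier = [30_000] * 49
print(f"mean without outlier = {statistics.mean(without_outlier):,.0f}")
```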

important: The above example highlights two important ideas:

• Outliers: We have defined outliers as data values that are significantly different from other
data values, but we have not provided a way of finding them. This will be discussed in the
next section. Regardless, we can see that 5 million is significantly different from 30 thousand
in the above example.
• Skew: When a data set has outliers, the outliers have the potential to skew the mean. In the
above example, the centre of the data is 30,000, but the mean is 129,400. Thus the outlier of
5 million is pulling the mean up. That is, it is skewing the centre value by pulling it to the
right on the number line.

1.2.2.1.4 Comparing measures of centre

Above we have described how to find each of the measures of centre. But how do you choose which measure
of centre to use in which situation? One option is to provide all three measures of centre, but sometimes
this can be overwhelming to the audience. Instead you want to pick the one that best describes the
data. The following are some general guidelines for choosing the best measure of centre.
The mean is often the best measure of centre to use because it is the most well-known and familiar of
the measures of centre. It is also the only measure of centre that is computed using all of the sample values.
But the mean is susceptible to outliers. As was seen in Example 1.17, if there is an outlier, the mean can be
pulled in one direction away from the centre.
Outliers are data values that are significantly different from the other data values. In Example 1.17,
the outlier is 5 million as it is significantly higher than the other data values. We will discuss how to find
outliers in section 2.3 (Boxplots).
If there is an outlier in the data set that is skewing the mean, the best measure of centre to use is the
median as it is not susceptible to outliers.
But be careful: The presence of outliers does not necessarily mean that the median is the best measure
of centre. Here are a couple of examples where this is the case:


1. Suppose there are 200 data values in a sample and one data value is an outlier. Then the mean will
most likely not be affected by the outlier.
2. Suppose there is a data set that has outliers, but one is a high outlier and one is a low outlier. Then
the outliers may balance out and not affect the mean.
The mode is best used for categorical data, but can sometimes be used for quantitative data. For example,
in Example 1.17, the mode would be a good measure of centre because the majority of data values are the
same.
In Example 1.16, since there are no outliers, the mean is the best measure of centre to use. In
Example 1.17, since there is an outlier (5 million) and the mean and median are quite different, the median is the
best measure of centre to use.
The following tables compare the measures of centre.

Measure How Common

Mean most familiar


Median commonly used
Mode sometimes used

Table 1.2

Measure Every Score Used

Mean yes
Median no
Mode no

Table 1.3

Measure Affected by Outliers

Mean yes
Median no
Mode no

Table 1.4

1.2.2.1.5 How to mislead with averages

Consider the following situation:


As you arrive at an open house in your preferred new home location, a neighbour comes up to
you while he is walking his dog. "This is a great neighbourhood to live in! The average income in
this neighbourhood is $60,000," he tells you. You are pleased to hear how affluent the community
is. A year after you've moved into your new home, the same neighbour comes to your door and
asks you to sign a petition. "The city is overvaluing the homes in our neighbourhood again,
which means more taxes. The average income in this neighbourhood is $20,000. We can't afford
these increases." You dutifully sign the petition because you don't want to pay more taxes, but
you're also confused. Wasn't the average income a lot higher last year? What happened? Is your
neighbour a liar?


In this example, there are many different possible scenarios that could explain the discrepancy. But no
matter what the scenario is, the neighbour is picking his statistics to fit his situation.
One scenario: The neighbour may be picking and choosing which measure of centre to use. Suppose that
most people in the neighbourhood make around $20,000 a year, but there are a few people who live on the
street with the super nice view who make $300,000 a year. Then in the first case, when he says the average
income is $60,000, he has used the mean, which has been pulled higher by the outliers of $300,000. He chose
to use the mean to make the neighbourhood look more affluent than it really is.
But when he wanted to make the argument that the neighbourhood wasn't as affluent and should be in
a lower tax bracket, he changed which measure of centre to use. Instead he may have used the median
or mode because they aren't influenced by the outliers.
Another scenario: The neighbour may be choosing how he defines income to help make his point. In the
first case, he may have only used those who are employed to come up with the average salary. In the
second case, he may have used all adults in the neighbourhood including students living with their parents,
stay-at-home parents, retired people or people out of work. Their incomes may be very low or non-existent,
which would skew the average lower. In this scenario, he may be using the same measure of centre,
but is picking what he means by income to get the results he wants.
There are other possible scenarios. Can you think of any?

1.2.2.1.6 Skew

As has been noted above, if there are outliers in a data set, this can cause the mean to be pulled up or down
(i.e. be either higher than expected or lower than expected) by these outliers. Outliers don't have to be
present for this to happen. Essentially, any time that there are data values that cause the mean and median
to be significantly different, we say the data is skewed.
• If the mean is significantly larger than the median and the histogram has a long tail on the right, then
the data is right skewed or positively skewed.
• If the mean is significantly smaller than the median and the histogram has a long tail on the left, then
the data is left skewed or negatively skewed.
• If the mean and the median are approximately the same and the histogram has balanced tails, then
the data is symmetric.

Examples of skewness and symmetry

Figure 1.7: These are "perfect" examples of skewness and symmetry. In reality, there may be multiple
modes or the mean and median will be similar but not equal. These are provided to give an example.


1.2.2.1.7 Measures of variation

An important characteristic of any set of data is the variation in the data. In some data sets, the data values
are concentrated closely near the mean; in other data sets, the data values are more widely spread out from
the mean. There are five measures of variation: range, standard deviation, variance, interquartile range and
coefficient of variation.
The range is the easiest to calculate. It is found by subtracting the minimum value in the data set
from the maximum value in the data set. Though the range is easy to calculate, it is very much affected by
outliers.
The interquartile range will be discussed in the section on box plots (section 2.3).
The most common measure of variation, or spread, is the standard deviation. The standard deviation
measures how far data values are from their mean, on average.

important: When talking about variation or variability in statistics, there are two different kinds:
variation within a sample and variation between samples.
When we discuss finding the standard deviation, range or any measure of variation of a sample, we
are discussing variation within a sample. In this case, we are looking at how the data values vary
from each other. Most of the time, when we talk about variation this is what we are talking about.
We can also talk about how much different samples vary from each other. For example, we could
take multiple samples and find the sample mean of each sample. If we talk about how much the
means vary from each other, we are discussing variation between samples. We will discuss this
specific type of variation in Chapter 6.
The law of large numbers says that, for random samples, as the sample size increases, the
sample will more closely resemble the population. For example, as the sample size increases, the
sample standard deviation will approach the population standard deviation. Thus, the variation
within the sample will more closely mimic the variation within the population as the sample size
increases. But as the sample size increases, the sample means will approach the population mean.
Thus, there will be less variation between the sample means. This means that the variation between
samples decreases as the sample size increases. When we discuss sampling variability, we are
discussing variation between samples.
For this chapter, we are focusing on variation within a sample.
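The idea that variation between samples shrinks as the sample size grows can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the population of 10,000 wait times is invented, the sample sizes (25 and 400) and the number of repeated samples (200) are arbitrary choices, and the function name is our own.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch gives the same result each time

# Hypothetical population: 10,000 wait times in minutes, invented for illustration.
population = [random.uniform(0, 10) for _ in range(10_000)]


def spread_of_sample_means(sample_size, number_of_samples=200):
    """Draw repeated random samples and return the standard deviation of their means."""
    means = [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(number_of_samples)
    ]
    return statistics.stdev(means)


small = spread_of_sample_means(25)    # variation between means of small samples
large = spread_of_sample_means(400)   # variation between means of large samples
print(round(small, 3), round(large, 3))
print(large < small)  # the means of the larger samples vary less
```

The standard deviation inside each sample (variation within a sample) stays close to that of the population; it is the spread of the sample means (variation between samples) that shrinks.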

1.2.2.1.7.1 The standard deviation (and variance)

• provides a numerical measure of the overall amount of variation in a data set, and
• can be used to determine whether a particular data value is close to or far from the mean.

1.2.2.1.7.1.1 The standard deviation provides a measure of the overall variation in a data set

The standard deviation is small when the data are all concentrated close to the mean, exhibiting little
variation or spread. The standard deviation is larger when the data values are more spread out from the
mean, exhibiting more variation.
Suppose that we are studying the amount of time customers wait in line at the checkout at supermarket
A and supermarket B. It is known that the average wait time at both supermarkets is about five minutes.
At supermarket A, though, the standard deviation for the wait time is two minutes; at supermarket B the
standard deviation for the wait time is four minutes.
Because supermarket B has a higher standard deviation, we know that there is more variation in the
wait times at supermarket B. Overall, wait times at supermarket B are more spread out from the average;
wait times at supermarket A are more concentrated near the average. This means that at supermarket B,
you have a greater chance of having a short wait time, but also a greater chance of having a long wait time,


compared to supermarket A. That means the wait times are more volatile at supermarket B. On the other
hand, you will be waiting about the same amount of time on each visit to supermarket A. That means the
wait times are more consistent at supermarket A.
One way we could summarize the supermarket situation is as follows:

• A typical wait time at supermarket A is 5 minutes give or take 2 minutes. This means that someone
typically has to wait 3 to 7 minutes in the checkout line.
• A typical wait time at supermarket B is 5 minutes give or take 4 minutes. This means that someone
typically has to wait 1 to 9 minutes in the checkout line.

Here the term "typical" means common or normal. So normally people will wait between 3 and 7 minutes at
supermarket A, but there will be some people who wait only 2 minutes and some who wait 10 minutes at
the checkout. That is, the typical range only provides a sense of what is going on in the middle of the data,
but there are values occurring outside of that range.

note: For the "typical" value, you can use any measure of centre. But for the "give or take" value,
you have to use the standard deviation. No other measure of variation works.
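The "typical value, give or take" summary is simply the centre minus and plus one standard deviation. A minimal sketch follows; the function name `typical_range` is our own.

```python
def typical_range(centre, standard_deviation):
    """Return (low, high): the centre give or take one standard deviation."""
    return centre - standard_deviation, centre + standard_deviation


print(typical_range(5, 2))  # (3, 7)  supermarket A
print(typical_range(5, 4))  # (1, 9)  supermarket B
```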

1.2.2.1.7.1.2 Calculating the Standard Deviation

note: The following explains how to calculate the standard deviation by hand. We will be using
computer software to do this. Thus it is not important to know this section in detail, but it is
helpful to know the basics of how the standard deviation is calculated to help understand what the
standard deviation is.

If x is a number, then the difference "x − mean" is called its deviation. In a data set, there are as many
deviations as there are items in the data set. The deviations are used to calculate the standard deviation. If
the numbers belong to a population, in symbols a deviation is \(x - \mu\). For sample data, in symbols a deviation
is \(x - \bar{x}\).
The procedure to calculate the standard deviation depends on whether the numbers are the entire pop-
ulation or are data from a sample. The calculations are similar, but not identical. Therefore the symbol
used to represent the standard deviation depends on whether it is calculated from a population or a sample.
The lower case letter s represents the sample standard deviation and the Greek letter σ (sigma, lower case)
represents the population standard deviation. If the sample has the same characteristics as the population,
then s should be a good estimate of σ .
To calculate the standard deviation, we need to calculate the variance first. The variance is the average
of the squares of the deviations (the \(x - \bar{x}\) values for a sample, or the \(x - \mu\) values for a population).
The symbol \(\sigma^2\) represents the population variance; the population standard deviation \(\sigma\) is the square root
of the population variance. The symbol \(s^2\) represents the sample variance; the sample standard deviation s
is the square root of the sample variance. You can think of the standard deviation as a special average of
the deviations.
If the numbers come from a census of the entire population and not a sample, when we calculate
the average of the squared deviations to find the variance, we divide by N, the number of items in the
population. If the data are from a sample rather than a population, when we calculate the average of the
squared deviations, we divide by n − 1, one less than the number of items in the sample.

1.2.2.1.7.1.3 Formulas for the Sample Standard Deviation

• \( s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n-1}} \)
• For the sample standard deviation, the denominator is n − 1, that is, the sample size minus 1.


1.2.2.1.7.1.4 Formulas for the Population Standard Deviation

• \( \sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}} \)
• For the population standard deviation, the denominator is N, the number of items in the population.

Since the standard deviation is found by taking a square root, the standard deviation is always positive
or zero.
Since the variance is the square of the standard deviation, it is not helpful as a descriptive statistic. For
example, if you are looking at the weights of basketballs in kg, then the standard deviation will be in kg,
while the variance will be in kg^2. Thus the variance is hard to interpret when describing the variation
in data. It is helpful later on in statistics, but at this point it is not.
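The two standard deviation formulas above differ only in the denominator. As a minimal sketch (not the course software), they can be written out by hand in Python and compared with the standard library's `statistics.stdev` and `statistics.pstdev`. The eight data values are hypothetical and were chosen so that the arithmetic is easy to follow.

```python
import math
import statistics


def sample_std(values):
    """s: square root of the sum of squared deviations divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


def population_std(values):
    """sigma: square root of the sum of squared deviations divided by N."""
    big_n = len(values)
    mu = sum(values) / big_n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / big_n)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values; mean 5, squared deviations sum to 32

print(population_std(data))        # 2.0
print(round(sample_std(data), 3))  # 2.138

# The hand-written versions agree with the standard library.
print(math.isclose(sample_std(data), statistics.stdev(data)))       # True
print(math.isclose(population_std(data), statistics.pstdev(data)))  # True
```

For the same data, dividing by n − 1 always gives a slightly larger result than dividing by N.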
Example 1.18
In a fifth grade class, the teacher was interested in the average age and the sample standard deviation
of the ages of her students. The following data are the ages for a SAMPLE of n = 20 fifth grade
students. The ages are rounded to the nearest half year:
9; 9.5; 9.5; 10; 10; 10; 10; 10.5; 10.5; 10.5; 10.5; 11; 11; 11; 11; 11; 11; 11.5; 11.5; 11.5

\[ \bar{x} = \frac{9 + 9.5(2) + 10(4) + 10.5(4) + 11(6) + 11.5(3)}{20} = 10.525 \qquad (1.7) \]
The average age is 10.53 years, rounded to two places.
The variance may be calculated by using a table. Then the standard deviation is calculated by
taking the square root of the variance. We will explain the parts of the table after calculating s.

Data   Freq.   Deviations                Deviations²                (Freq.)(Deviations²)

x      f       (x − x̄)                   (x − x̄)²                   (f)(x − x̄)²
9      1       9 − 10.525 = −1.525       (−1.525)² = 2.325625       1 × 2.325625 = 2.325625
9.5    2       9.5 − 10.525 = −1.025     (−1.025)² = 1.050625       2 × 1.050625 = 2.101250
10     4       10 − 10.525 = −0.525      (−0.525)² = 0.275625       4 × 0.275625 = 1.1025
10.5   4       10.5 − 10.525 = −0.025    (−0.025)² = 0.000625       4 × 0.000625 = 0.0025
11     6       11 − 10.525 = 0.475       (0.475)² = 0.225625        6 × 0.225625 = 1.35375
11.5   3       11.5 − 10.525 = 0.975     (0.975)² = 0.950625        3 × 0.950625 = 2.851875
                                                                    The total is 9.7375

Table 1.5

The sample variance, \(s^2\), is equal to the sum of the last column (9.7375) divided by the total
number of data values minus one (20 − 1):
\[ s^2 = \frac{9.7375}{20-1} = 0.5125 \]
The sample standard deviation s is equal to the square root of the sample variance:
\[ s = \sqrt{0.5125} = 0.715891 \]
which is rounded to two decimal places, s = 0.72.
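The table-based calculation in Example 1.18 can be sketched in Python by working directly from the data values and their frequencies. This is only an illustration; the variable names are our own.

```python
import math

# Ages and their frequencies from Example 1.18.
ages = [9, 9.5, 10, 10.5, 11, 11.5]
freqs = [1, 2, 4, 4, 6, 3]

n = sum(freqs)                                        # 20 students
x_bar = sum(f * x for x, f in zip(ages, freqs)) / n   # frequency-weighted mean

# Last column of the table: frequency times squared deviation, then totalled.
weighted_squares = sum(f * (x - x_bar) ** 2 for x, f in zip(ages, freqs))

variance = weighted_squares / (n - 1)  # divide by n - 1 because the data are a sample
s = math.sqrt(variance)

print(n, x_bar, round(weighted_squares, 4), round(variance, 4), round(s, 2))
# 20 10.525 9.7375 0.5125 0.72
```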

1.2.2.1.7.2

1.2.2.1.7.2.1 Explanation of the standard deviation calculation shown in the table

The deviations show how spread out the data are about the mean. The data value 11.5 is farther from the
mean than is the data value 11, which is indicated by the deviations 0.975 and 0.475. A positive deviation
occurs when the data value is greater than the mean, whereas a negative deviation occurs when the data
value is less than the mean. The deviation is −1.525 for the data value nine. If you add the deviations,
the sum is always zero. (For Example 1.18, there are n = 20 deviations.) So you cannot simply add the
deviations to get the spread of the data. By squaring the deviations, you make them positive numbers, and
the sum will also be positive. The variance, then, is the average squared deviation.
The variance is a squared measure and does not have the same units as the data. Taking the square root
solves the problem. The standard deviation measures the spread in the same units as the data.
Notice that instead of dividing by n = 20, the calculation divided by n − 1 = 20 − 1 = 19 because the
data is a sample. For the sample variance, we divide by the sample size minus one (n − 1). Why not divide
by n? The answer has to do with the population variance. The sample variance is an estimate of the
population variance. Based on the theoretical mathematics that lies behind these calculations, dividing
by (n − 1) gives a better estimate of the population variance.
The standard deviation, s or σ, is either zero or larger than zero. When the standard deviation is zero,
there is no spread; that is, all the data values are equal to each other. The standard deviation is small
when the data are all concentrated close to the mean, and is larger when the data values show more variation
from the mean. When the standard deviation is a lot larger than zero, the data values are very spread out
about the mean; outliers can make s or σ very large.
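The statement that the deviations always add to zero follows directly from the definition of the sample mean. As a short worked derivation, using the same symbols as the formulas above:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0,
\qquad \text{since } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .
```

In Example 1.18, for instance, the frequency-weighted deviations are 1(−1.525) + 2(−1.025) + 4(−0.525) + 4(−0.025) + 6(0.475) + 3(0.975) = −5.775 + 5.775 = 0.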

1.2.2.1.7.3 Coefficient of variation

The standard deviation is a very good measure of variation, but when comparing two data sets it is not
always the best, in particular if the means of the two data sets are different. Suppose you are comparing
the yearly salaries (excluding bonuses) of junior employees versus CEOs at oil and gas companies around
Alberta. The yearly salaries for the junior employees will be significantly smaller than those of the CEOs. Let's
say the average salary for junior employees is $45,000 while for CEOs it is $500,000. Now suppose that the
standard deviation for both groups is $50,000. If we only looked at the standard deviation, we might say
that the variation in both groups is the same. But really, variation of $50,000 when the average salary is
$45,000 is quite a bit more than for a salary of $500,000. That is, there is more relative variation in the junior
employees' salaries. The standard deviation doesn't capture this difference. But the coefficient of variation
does, as it is a measure of relative variation. That is, it takes into account that bigger data values might
have a larger standard deviation without necessarily having larger relative variation.
The coefficient of variation is found by expressing the standard deviation as a percentage of the mean:
\[ \text{Coefficient of Variation} = \frac{s}{\bar{x}}\,(100\%) \qquad (1.7) \]
In the above example, the coefficient of variation would be:
\[ \text{CofV for junior employees} = \frac{50{,}000}{45{,}000}\,(100\%) = 111.1\% \qquad (1.7) \]
\[ \text{CofV for CEOs} = \frac{50{,}000}{500{,}000}\,(100\%) = 10\% \qquad (1.7) \]
The larger the coefficient of variation, the larger the relative variation. Thus, as a measure of relative
variation, the junior employees have significantly more relative variation (111.1%) compared to the CEOs
(10%).
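As a minimal sketch of the coefficient of variation formula, the salary comparison can be computed in Python using the figures given in the text (a standard deviation of $50,000 for both groups, and average salaries of $45,000 and $500,000). The function name is our own.

```python
def coefficient_of_variation(standard_deviation, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return standard_deviation / mean * 100


junior = coefficient_of_variation(50_000, 45_000)
ceo = coefficient_of_variation(50_000, 500_000)

print(f"{junior:.1f}%")  # 111.1%
print(f"{ceo:.1f}%")     # 10.0%
```

The same $50,000 of spread is more than the whole average salary for the junior employees, but only a tenth of the average salary for the CEOs.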
Here are some points about the coefficient of variation:

• The coefficient of variation is not affected by multiplicative changes of scale.
• The coefficient of variation is used to compare variation between data sets. This is very
important to remember. For multiple data sets, if the means are the same, you can compare the
standard deviations. BUT if the means are different, you MUST use the coefficient of variation to
compare the variation in the data sets.
• If the standard deviation is larger than the mean, the coefficient of variation is bigger than 100%.
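The first point above, that the coefficient of variation is not affected by multiplicative changes of scale, can be checked with a short sketch. The five salaries are hypothetical, and the comparison with an additive shift is our own addition for contrast.

```python
import math
import statistics


def cv(values):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


salaries_dollars = [30_000, 45_000, 52_000, 61_000, 75_000]   # hypothetical salaries
salaries_thousands = [x / 1000 for x in salaries_dollars]     # same data in thousands of dollars
salaries_plus_bonus = [x + 10_000 for x in salaries_dollars]  # every salary shifted up by $10,000

# Changing the unit multiplies the mean and the standard deviation by the same factor,
# so the ratio is unchanged.
print(math.isclose(cv(salaries_dollars), cv(salaries_thousands)))  # True

# Adding a constant raises the mean but leaves the standard deviation alone,
# so the coefficient of variation gets smaller.
print(cv(salaries_plus_bonus) < cv(salaries_dollars))  # True
```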

When to use which measure of variation


Measure                    When to use

Range                      The range is rarely the best measure of variation to use, but it is a good quick calculation of variation.
Standard deviation         This is the most common measure of variation. It is best used when finding the variation for one data set.
Variance                   As it is the square of the standard deviation, it is NEVER the best measure of variation to use. It is helpful in later topics in statistics though.
Interquartile range        This is not a very well-known measure of variation, but it is helpful in describing the range of the middle 50% of the data values. The interquartile range will be discussed in section 2.3 (box plots).
Coefficient of variation   This is not well known, but it is the best measure to use when comparing the variations of two or more data sets that have different measures of centre.

Table 1.6

Example 1.19
Suppose you are looking at two companies and each company has 24 employees. At one company,
everybody except the CEO makes $30,000. The CEO makes $490,000. Thus, the data values would
be
$30,000; $30,000; $30,000; $30,000; $30,000; . . . ; $490,000
The second company has an interesting policy. Everybody who starts at the company makes
$30,000 a year, but as soon as someone else gets hired, they get paid $20,000 more. They only hire
one person at a time. So, the first person who was hired started at $30,000, then when a second
person got hired, the first person's salary was raised to $50,000. When a third person got hired, the
first person's salary was raised to $70,000 while the salary of the second person hired was raised to
$50,000. This has been done 23 times. Therefore, their data values (i.e. salaries) would look like
this:
$30,000; $50,000; $70,000; $90,000; $110,000; . . . ; $490,000
Without doing any calculations, we can see that company one has fairly consistent salaries
except for the CEO, while company two has salaries that are more spread out.
The following table provides the count (i.e. sample size), mean, and the measures of variation
for the two companies.

                                Company One    Company Two

Count                           24             24
Mean                            49,166.67      260,000.00
Range                           460,000        460,000
Population standard deviation   91,920.10      138,443.73
Coefficient of Variation        190.98%        54.39%

Table 1.7


In the table above, notice that the range is the same for the two data sets. If we only looked
at the range, this would give a false sense that the amount of variation in the two data sets is the
same, but we know it isn't.
The standard deviation is measuring how much, on average, the data values vary from the mean.
For company one, 23 of the 24 data values deviate the same amount from the mean ($49,166.67 −
$30,000 = $19,166.67) with only the $490,000 deviating a large amount from the mean.
For company two, two data values deviate by only $10,000 ($250,000 and $270,000) while
two data values deviate by a whopping $230,000 ($30,000 and $490,000).
In company one, 23 out of 24 data values deviate by less than $20,000. But for company two,
only 2 out of 24 deviate by less than $20,000. This suggests that company one will have a smaller
standard deviation than company two because there is less average deviation. This is supported
by MegaStat, which shows that the population standard deviation for company one is $91,920.10
versus company two, which has a population standard deviation of $138,443.73.
Notice that even though company one has an outlier (the CEO's salary), the standard deviation
is less than that of company two. That is, the average variation from the mean is less for company one.
Thus, the presence of an outlier does not necessarily result in a larger standard deviation.
The story is different when we look at the coefficient of variation. For company one, it is
190.98%, while for company two, it is 54.39%. This means that company one has larger relative
variation than company two. This is because company two has a higher mean than company one
and thus the variation, relative to the mean, isn't as large as it is in company one.
In this situation, the best measure of variation to use would be the coefficient of variation as
we are comparing two data sets with two different means. Based on this, company one has larger
relative variation than company two.
Notice that variance is not discussed here. As stated above, the variance is the square of the
standard deviation. Therefore, the units for variance in this example would be dollars squared, which makes
no sense. Again, variance is not a useful descriptive statistic.
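The summary figures for the two companies can be recomputed from the salary lists with a short sketch using Python's standard `statistics` module (an assumption on our part; the text uses MegaStat). One further assumption is labelled in the code: the coefficient of variation is computed with the sample standard deviation in the numerator, as in the formula s divided by the sample mean, which is what reproduces the percentages quoted in the example.

```python
import statistics

# Company one: 23 employees at $30,000 plus the CEO at $490,000.
company_one = [30_000] * 23 + [490_000]
# Company two: $30,000, $50,000, ..., $490,000 in steps of $20,000.
company_two = list(range(30_000, 490_001, 20_000))


def summarize(salaries):
    """Count, mean, range, population SD, and coefficient of variation for one company."""
    mean = statistics.mean(salaries)
    return {
        "count": len(salaries),
        "mean": round(mean, 2),
        "range": max(salaries) - min(salaries),
        "population_sd": round(statistics.pstdev(salaries), 2),
        # Assumption: sample standard deviation (n - 1 denominator) in the numerator.
        "cv_percent": round(statistics.stdev(salaries) / mean * 100, 2),
    }


for name, salaries in [("Company One", company_one), ("Company Two", company_two)]:
    print(name, summarize(salaries))
```

Both lists have the same range, yet the standard deviations and coefficients of variation differ, which is the point of the example.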

warning: Variation and variance might seem like the same word but they aren't. Variation is a
general term used to discuss how much the data values vary from each other, how much spread there
is in the data, how consistent the data is, how volatile or risky the data is, and how much deviation
there is in the data values. It is an umbrella term. Variance is a specific type of variation. It
specifically refers to the square of the standard deviation. Therefore, it is incorrect to say, "There
is a lot of variance in the data" or "The best measure of variance is . . .".

1.2.2.1.7.4 Optional section: Comparing Values from Different Data Sets

The standard deviation is useful when comparing data values that come from different data sets. If the
data sets have different means and standard deviations, then comparing the data values directly can be
misleading.

• For each data value, calculate how many standard deviations away from its mean the value is.
• Use the formula: value = mean + (#ofSTDEVs)(standard deviation); solve for #ofSTDEVs.
• \( \#\text{ofSTDEVs} = \dfrac{\text{value} - \text{mean}}{\text{standard deviation}} \)
• Compare the results of this calculation.

#ofSTDEVs is often called a "z-score"; we can use the symbol z. In symbols, the formulas become:

Sample       \( x = \bar{x} + zs \)       \( z = \dfrac{x - \bar{x}}{s} \)
Population   \( x = \mu + z\sigma \)      \( z = \dfrac{x - \mu}{\sigma} \)

Table 1.8

Example 1.20
Two students, John and Ali, from different high schools, wanted to find out who had the highest
GPA when compared to his school. Which student had the highest GPA when compared to his
school?

Student GPA School Mean GPA School Standard Deviation

John 2.85 3.0 0.7


Ali 77 80 10

Table 1.9

Solution
For each student, determine how many standard deviations (#ofSTDEVs) his GPA is away from
the average, for his school. Pay careful attention to signs when comparing and interpreting the
answer.
\[ z = \#\text{ofSTDEVs} = \frac{\text{value} - \text{mean}}{\text{standard deviation}} = \frac{x - \mu}{\sigma} \]
For John, \( z = \#\text{ofSTDEVs} = \dfrac{2.85 - 3.0}{0.7} = -0.21 \)
For Ali, \( z = \#\text{ofSTDEVs} = \dfrac{77 - 80}{10} = -0.3 \)
John has the better GPA when compared to his school because his GPA is 0.21 standard
deviations below his school's mean while Ali's GPA is 0.3 standard deviations below his school's
mean.
John's z-score of −0.21 is higher than Ali's z-score of −0.3. For GPA, higher values are better,
so we conclude that John has the better GPA when compared to his school.
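The z-score comparison in Example 1.20 can be sketched in a few lines of Python. The function name `z_score` is our own; the inputs are the values from the table in the example.

```python
def z_score(value, mean, standard_deviation):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / standard_deviation


john = z_score(2.85, 3.0, 0.7)
ali = z_score(77, 80, 10)

print(round(john, 2), round(ali, 2))    # -0.21 -0.3
print("John" if john > ali else "Ali")  # John
```

Because the two schools use different grading scales, the raw GPAs (2.85 and 77) cannot be compared directly, but the z-scores can.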

Exercise 1.2.2.1 (Solution on p. 79.)


Two swimmers, Angie and Beth, from different teams, wanted to find out who had the
fastest time for the 50 meter freestyle when compared to her team. Which swimmer had
the fastest time when compared to her team?

Swimmer Time (seconds) Team Mean Time Team Standard Deviation

Angie 26.2 27.2 0.8


Beth 27.3 30.1 1.4

Table 1.10


1.2.2.2 Distributions

Now that we have learned about determining shape (histogram), centre (mean, median or mode), and
variation (standard deviation, coefficient of variation and range), we can describe the distribution of a
data set.
In Example 1.19, we examined the salaries for two different companies.
Though we have not done the histogram for either of these data sets, we can imagine what they will look
like to determine the shape. Company A will have one peak at $30,000 with an outlier at $490,000. This
will make it skewed to the right. For Company B each data value has the same frequency, which makes the
data uniform.
For Company A, we would describe the distribution of salaries to be skewed to the right (shape), centred
at $49,166.67 (mean) and have variation of $91,820.10 (standard deviation).
For Company B, we would describe the distribution of salaries to be uniform (shape), centred at $260,000
(mean) and have variation of $138,443.73 (standard deviation).

1.2.2.3 References

Data from The World Bank, available online at http://www.worldbank.org (accessed April 3, 2013).
Demographics: Obesity - adult prevalence rate. Indexmundi. Available online at
http://www.indexmundi.com/g/r.aspx?t=50&v=2228&l=en (accessed April 3, 2013).

1.2.2.4 Chapter Review

The mean and the median can be calculated to help you find the "center" of a data set. The mean is the
best estimate for the actual data set, but the median is the best measurement when a data set contains
several outliers or extreme values. The mode will tell you the most frequently occurring datum (or data) in
your data set. The mean, median, and mode are extremely helpful when you need to analyze your data, but
if your data set consists of ranges which lack specific values, the mean may seem impossible to calculate.
However, the mean can be approximated if you add the lower boundary to the upper boundary and divide
by two to find the midpoint of each interval. Multiply each midpoint by the number of values found in the
corresponding range. Divide the sum of these values by the total number of data values in the set.
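
The midpoint approximation described above can be sketched as follows. The three intervals and their frequencies are hypothetical values invented for illustration; they do not come from any table in this chapter.

```python
# Hypothetical grouped data: (lower boundary, upper boundary, frequency)
classes = [(0, 10, 4), (10, 20, 6), (20, 30, 10)]

# Total number of data values in the set
total = sum(f for _, _, f in classes)

# Midpoint of each interval times its frequency, summed over all intervals
weighted_sum = sum((low + high) / 2 * f for low, high, f in classes)

approx_mean = weighted_sum / total
print(f"Approximate mean of the grouped data: {approx_mean}")
```

The result is only an approximation because every value in an interval is treated as if it sat at the midpoint.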
The standard deviation can help you calculate the spread of data. There are different equations to use if
you are calculating the standard deviation of a sample or of a population.

• The standard deviation allows us to compare individual data or classes to the data set mean numerically.
• $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ or $s = \sqrt{\frac{\sum f (x - \bar{x})^2}{n - 1}}$ is the formula for calculating the standard deviation of a
sample. To calculate the standard deviation of a population, we would use the population mean, µ,
and the formula $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$ or $\sigma = \sqrt{\frac{\sum f (x - \mu)^2}{N}}$.
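
As a rough sketch, the frequency forms of the two formulas can be checked against Python's standard statistics module. The small frequency table below is hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math
import statistics

# Hypothetical frequency table: data value -> frequency
freq = {2: 1, 4: 3, 5: 2, 7: 1, 9: 1}

n = sum(freq.values())
mean = sum(x * f for x, f in freq.items()) / n

# Sum of f * (x - mean)^2, the numerator shared by both formulas
sum_sq = sum(f * (x - mean) ** 2 for x, f in freq.items())

s = math.sqrt(sum_sq / (n - 1))   # sample standard deviation (divide by n - 1)
sigma = math.sqrt(sum_sq / n)     # population standard deviation (divide by N)

# Expand the table back into raw data to compare with the library functions
data = [x for x, f in freq.items() for _ in range(f)]
print(f"s = {s:.4f}, statistics.stdev = {statistics.stdev(data):.4f}")
print(f"sigma = {sigma:.4f}, statistics.pstdev = {statistics.pstdev(data):.4f}")
```

The only difference between the two results is the divisor, which is why the sample value is always slightly larger for the same data.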


1.2.2.5

Use the following information to answer the next three exercises: The following data show the lengths of
boats moored in a marina. The data are ordered from smallest to largest: 16; 17; 19; 20; 20; 21; 23; 24; 25;
25; 25; 26; 26; 27; 27; 27; 28; 29; 30; 32; 33; 33; 34; 35; 37; 39; 40
Exercise 1.2.2.2 (Solution on p. 79.)
Calculate the mean.
Exercise 1.2.2.3 (Solution on p. 79.)
Identify the median.
Exercise 1.2.2.4 (Solution on p. 79.)
Identify the mode.

Use the following information to answer the next three exercises: Sixty-five randomly selected car salespersons
were asked the number of cars they generally sell in one week. Fourteen people answered that they generally
sell three cars; nineteen generally sell four cars; twelve generally sell five cars; nine generally sell six cars;
eleven generally sell seven cars. Calculate the following:
Exercise 1.2.2.5 (Solution on p. 79.)
sample mean = x̄ = _______
Exercise 1.2.2.6 (Solution on p. 79.)
median = _______
Exercise 1.2.2.7 (Solution on p. 79.)
mode = _______
Exercise 1.2.2.8 (Solution on p. 79.)
The following data are the distances between 20 retail stores and a large distribution center. The
distances are in miles.
29; 37; 38; 40; 58; 67; 68; 69; 76; 86; 87; 95; 96; 96; 99; 106; 112; 127; 145; 150
Use a computer to find the standard deviation and round to the nearest tenth.

1.2.2.6 Bringing It Together

Exercise 1.2.2.9 (Solution on p. 79.)


Javier and Ercilia are supervisors at a shopping mall. Each was given the task of estimating the
mean distance that shoppers live from the mall. They each randomly surveyed 100 shoppers. The
samples yielded the following information.

Javier Ercilia

x̄ 6.0 km 6.0 km
s 4.0 km 7.0 km

Table 1.11

a. How can you determine which survey was correct?
b. Explain what the difference in the results of the surveys implies about the data.
c. If the two histograms depict the distribution of values for each supervisor, which one depicts
Ercilia's sample? How do you know?

Figure 1.8

Use the following information to answer the next three exercises: We are interested in the number of years
students in a particular elementary statistics class have lived in California. The information in the following
table is from the entire section.

Number of years Frequency Number of years Frequency

7 1 22 1
14 3 23 1
15 1 26 1
18 1 40 2
19 4 42 2
20 3
Total = 20

Table 1.12

Exercise 1.2.2.10 (Solution on p. 80.)


What is the mode?

a. 19
b. 19.5
c. 14 and 20
d. 22.65

Exercise 1.2.2.11 (Solution on p. 80.)


Is this a sample or the entire population?

a. sample
b. entire population
c. neither

Exercise 1.2.2.12 (Solution on p. 80.)


A survey of enrollment at 35 community colleges across the United States yielded the following
figures:
6414; 1550; 2109; 9350; 21828; 4300; 5944; 5722; 2825; 2044; 5481; 5200; 5853; 2750; 10012;
6357; 27000; 9414; 7681; 3200; 17500; 9200; 7380; 18314; 6557; 13713; 17768; 7493; 2771; 2861;
1263; 7285; 28165; 5080; 11622

a. Organize the data into a chart with five intervals of equal width. Label the two columns
"Enrollment" and "Frequency."
b. Construct a histogram of the data.
c. What is the shape of the data? What does the shape tell you about the enrollment at these
community colleges?
d. What is the best measure of centre for this data and why? State the measure.
e. What is the best measure of variation for this data and why? State the measure.
f. If you were to build a new community college, what is the typical range for the enrollment?
Why would this information be helpful? What caveats would you want to think about when
you look at this typical range?

Exercise 1.2.2.13 (Solution on p. 81.)


You work for a soda pop company that is producing a new label for their Asian market. Three
different labels your company is considering are the same, except the colours are different. The
colour choices are blue, green and orange.
To determine which label consumers prefer, focus groups were done. One such focus group asked
15 participants to rate the cans from 1 to 10. A score of 1 means they hated the label and 10 means
they loved the label. The results follow.

Participant Blue Label Green Label Orange Label

1 1 10 6
2 4 8 7
3 2 9 7
4 6 3 8
5 1 8 6
6 1 7 7
7 1 3 7
8 4 9 8
9 1 10 9
10 7 4 6
11 4 7 6
12 5 6 7
13 6 9 8
14 4 4 6
15 6 8 7


Table 1.13

Which label would you recommend as the new label for the Asian market? Support your decision
using the data.
Exercise 1.2.2.14 (Solution on p. 81.)
Three publicly traded telecommunications companies reported their monthly profit for the last
year. The results are presented below.

Company A Company B Company C


Mean $10,930 $13,000 $34,450
Median $9,390 $13,500 $34,450
Mode None $13,000 and $20,000 $33,880
Standard deviation $4,196 $9,360 $4,116
Range $15,050 $42,150 $16,400

Table 1.14

1. Donna is close to retirement and wants to invest in one of the three companies. She doesn't
want to see her investment drop significantly as she doesn't want to see her retirement savings
dwindle. Which company would you recommend she invest in and why?
2. What information is missing from the list that you might want to have to help you answer
the above question?
3. What information in the table above is not necessary for making this decision?

1.2.3 Descriptive Statistics - Visual Representations of Data - MRU - C Lemieux (2017)¹⁰

1.2.3.1 Visual representations of categorical data

Below are tables comparing the number of part-time and full-time students at De Anza College and Foothill
College enrolled for the spring 2010 quarter. The tables display counts (frequencies) and percentages or
proportions (relative frequencies). The percent columns make comparing the same categories in the colleges
easier. Displaying percentages along with the numbers is often helpful, but it is particularly important when
comparing sets of data that do not have the same totals, such as the total enrollments for both colleges in
this example. Notice how much larger the percentage for part-time students at Foothill College is compared
to De Anza College.

Fall Term 2007 (Census day)

De Anza College Foothill College

Number Percent Number Percent


Full-time 9,200 40.9% Full-time 4,059 28.6%
Part-time 13,296 59.1% Part-time 10,124 71.4%
Total 22,496 100% Total 14,183 100%
10 This content is available online at <http://cnx.org/content/m64906/1.2/>.


Table 1.15

Tables are a good way of organizing and displaying data. But graphs can be even more helpful in
understanding the data. There are no strict rules concerning which graphs to use. Two graphs that are used
to display categorical data are pie charts and bar graphs.
In a pie chart, categories of data are represented by wedges in a circle and are proportional in size to
the percent of individuals in each category.
In a bar graph, the length of the bar for each category is proportional to the number or percent of
individuals in each category. Bars may be vertical or horizontal.
Look at Figure 1.9 and Figure 1.10 and determine which graph (pie or bar) you think displays the
comparisons better.
It is a good idea to look at a variety of graphs to see which is the most helpful in displaying the data. We
might make different choices of what we think is the best graph depending on the data and the context.
Our choice also depends on what we are using the data for.

(a) (b)

Figure 1.9


Figure 1.10

1.2.3.2 Visual Representations of Quantitative Data

1.2.3.2.1 Bar Graphs

Bar graphs can also be used to summarize discrete quantitative data and categorical data. Bar graphs
consist of bars that are separated from each other. The bars can be rectangles or they can be rectangular
boxes (used in three-dimensional plots), and they can be vertical or horizontal. The bar graph for the
following exercise has age groups represented on the x-axis and proportions on the y-axis.
Exercise 1.2.3.1 (Solution on p. 81.)
By the end of 2011, Facebook had over 146 million users in the United States. Table 1.16 shows
three age groups, the number of users in each age group, and the proportion (%) of users in each
age group. Construct a bar graph using this data.

Age groups    Number of Facebook users    Proportion (%) of Facebook users

13-25         65,082,280                  45%
26-44         53,300,200                  36%
45-64         27,885,100                  19%


Table 1.16

Exercise 1.2.3.2 (Solution on p. 82.)


Park City is broken down into six voting districts. The table shows the percent of the
total registered voter population that lives in each district as well as the percent total of
the entire population that lives in each district. Construct a bar graph that shows the
registered voter population by district.

District Registered voter population Overall city population

1 15.5% 19.4%
2 12.2% 15.6%
3 9.8% 9.0%
4 17.4% 18.5%
5 22.8% 20.7%
6 22.3% 16.8%

Table 1.17

1.2.3.2.2 Frequency tables

Twenty students were asked how many hours they worked per day. Their responses, in hours, are as follows:
5; 6; 3; 3; 2; 4; 7; 5; 2; 3; 5; 6; 5; 4; 4; 3; 5; 2; 5; 3.
Table 1.18: Frequency Table of Student Work Hours lists the different data values in ascending order and
their frequencies.

Frequency Table of Student Work Hours

DATA VALUE FREQUENCY

2 3
3 5
4 3
5 6
6 2
7 1

Table 1.18

A frequency is the number of times a value of the data occurs. According to Table 1.18: Frequency
Table of Student Work Hours, there are three students who work two hours, five students who work three
hours, and so on. The sum of the values in the frequency column, 20, represents the total number of students
included in the sample.
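
The counting behind Table 1.18 can be reproduced with a short sketch using Python's collections.Counter. The variable names are illustrative assumptions; the data are the twenty responses listed above.

```python
from collections import Counter

# The twenty responses (hours worked per day) listed in the text
hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

# Count how often each data value occurs, then sort by data value
freq = dict(sorted(Counter(hours).items()))

print("DATA VALUE  FREQUENCY")
for value, count in freq.items():
    print(f"{value:>10}  {count:>9}")
print(f"Total = {sum(freq.values())}")
```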


A relative frequency is the ratio (fraction or proportion) of the number of times a value of the data
occurs in the set of all outcomes to the total number of outcomes. To find the relative frequencies, divide
each frequency by the total number of students in the sample (in this case, 20). Relative frequencies can be
written as fractions, percents, or decimals.
Frequency Table of Student Work Hours with Relative Frequencies

DATA VALUE    FREQUENCY    RELATIVE FREQUENCY

2             3            3/20 or 0.15
3             5            5/20 or 0.25
4             3            3/20 or 0.15
5             6            6/20 or 0.30
6             2            2/20 or 0.10
7             1            1/20 or 0.05

Table 1.19

The sum of the values in the relative frequency column of Table 1.19: Frequency Table of Student Work
Hours with Relative Frequencies is 20/20, or 1.
Cumulative relative frequency is the accumulation of the previous relative frequencies. To find the
cumulative relative frequencies, add all the previous relative frequencies to the relative frequency for the
current row, as shown in Table 1.20: Frequency Table of Student Work Hours with Relative and Cumulative
Relative Frequencies.
Frequency Table of Student Work Hours with Relative and Cumulative Relative Frequencies

DATA VALUE    FREQUENCY    RELATIVE FREQUENCY    CUMULATIVE RELATIVE FREQUENCY

2             3            3/20 or 0.15          0.15
3             5            5/20 or 0.25          0.15 + 0.25 = 0.40
4             3            3/20 or 0.15          0.40 + 0.15 = 0.55
5             6            6/20 or 0.30          0.55 + 0.30 = 0.85
6             2            2/20 or 0.10          0.85 + 0.10 = 0.95
7             1            1/20 or 0.05          0.95 + 0.05 = 1.00

Table 1.20

The last entry of the cumulative relative frequency column is one, indicating that one hundred percent
of the data has been accumulated.
note: Because of rounding, the relative frequency column may not always sum to one, and the last
entry in the cumulative relative frequency column may not be one. However, they each should be
close to one.
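
The relative and cumulative relative frequency columns of Table 1.20 can be computed with the sketch below. It uses exact fractions so that the last cumulative entry is exactly one; with rounded decimals, as the note above says, it may only be close to one. The variable names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import accumulate

# Frequencies from Table 1.18: data value -> frequency
freq = {2: 3, 3: 5, 4: 3, 5: 6, 6: 2, 7: 1}
n = sum(freq.values())

# Relative frequency: each frequency divided by the total number of values
rel = [Fraction(f, n) for f in freq.values()]

# Cumulative relative frequency: running total of the relative frequencies
cum = list(accumulate(rel))

print("VALUE  FREQ  REL FREQ  CUM REL FREQ")
for value, f, r, c in zip(freq, freq.values(), rel, cum):
    print(f"{value:>5}  {f:>4}  {float(r):>8.2f}  {float(c):>12.2f}")
```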
Table 1.21: Frequency Table of Soccer Player Height represents the heights, in inches, of a sample of 100
male semiprofessional soccer players.
Frequency Table of Soccer Player Height


HEIGHTS (INCHES)    FREQUENCY    RELATIVE FREQUENCY    CUMULATIVE RELATIVE FREQUENCY

60-61.99            5            5/100 = 0.05          0.05
62-63.99            3            3/100 = 0.03          0.05 + 0.03 = 0.08
64-65.99            15           15/100 = 0.15         0.08 + 0.15 = 0.23
66-67.99            40           40/100 = 0.40         0.23 + 0.40 = 0.63
68-69.99            17           17/100 = 0.17         0.63 + 0.17 = 0.80
70-71.99            12           12/100 = 0.12         0.80 + 0.12 = 0.92
72-73.99            7            7/100 = 0.07          0.92 + 0.07 = 0.99
74-75.99            1            1/100 = 0.01          0.99 + 0.01 = 1.00
                    Total = 100  Total = 1.00

Table 1.21

The data in this table have been grouped into the following intervals:

• 60 to 61.99 inches
• 62 to 63.99 inches
• 64 to 65.99 inches
• 66 to 67.99 inches
• 68 to 69.99 inches
• 70 to 71.99 inches
• 72 to 73.99 inches
• 74 to 75.99 inches

In this sample, there are five players whose heights fall within the interval 59.95-61.95 inches, three players
whose heights fall within the interval 61.95-63.95 inches, 15 players whose heights fall within the interval
63.95-65.95 inches, 40 players whose heights fall within the interval 65.95-67.95 inches, 17 players whose
heights fall within the interval 67.95-69.95 inches, 12 players whose heights fall within the interval 69.95-71.95
inches, seven players whose heights fall within the interval 71.95-73.95 inches, and one player whose height falls
within the interval 73.95-75.95 inches. All heights fall between the endpoints of an interval and not at the endpoints.
Example 1.21
From Table 1.21: Frequency Table of Soccer Player Height, find the percentage of heights that
are less than 65.95 inches.
Solution
If you look at the first, second, and third rows, the heights are all less than 65.95 inches. There are
5 + 3 + 15 = 23 players whose heights are less than 65.95 inches. The percentage of heights less
than 65.95 inches is then 23/100 or 23%. This percentage is the cumulative relative frequency entry in
the third row.

Exercise 1.2.3.3 (Solution on p. 82.)


Table 1.22 shows the amount, in inches, of annual rainfall in a sample of towns.


Rainfall (Inches)    Frequency    Relative Frequency    Cumulative Relative Frequency

3-4.99               6            6/50 = 0.12           0.12
5-6.99               7            7/50 = 0.14           0.12 + 0.14 = 0.26
7-9.99               15           15/50 = 0.30          0.26 + 0.30 = 0.56
10-11.99             8            8/50 = 0.16           0.56 + 0.16 = 0.72
12-12.99             9            9/50 = 0.18           0.72 + 0.18 = 0.90
13-14.99             5            5/50 = 0.10           0.90 + 0.10 = 1.00
                     Total = 50   Total = 1.00

Table 1.22

From Table 1.22, find the percentage of rainfall that is less than 9.01 inches.

Example 1.22
From Table 1.21: Frequency Table of Soccer Player Height, find the percentage of heights that
fall between 61.95 and 65.95 inches.
Solution
Add the relative frequencies in the second and third rows: 0.03 + 0.15 = 0.18 or 18%.

Exercise 1.2.3.4 (Solution on p. 82.)


From Table 1.22, find the percentage of rainfall that is between 6.99 and 13.05 inches.

Example 1.23
Use the heights of the 100 male semiprofessional soccer players in Table 1.21: Frequency Table of
Soccer Player Height. Fill in the blanks and check your answers.

a. The percentage of heights that are from 67.95 to 71.95 inches is: ____.
b. The percentage of heights that are from 67.95 to 73.95 inches is: ____.
c. The percentage of heights that are more than 65.95 inches is: ____.
d. The number of players in the sample who are between 61.95 and 71.95 inches tall is: ____.
e. What kind of data are the heights?
f. Describe how you could gather this data (the heights) so that the data are characteristic of
all male semiprofessional soccer players.

Remember, you count frequencies. To find the relative frequency, divide the frequency by the
total number of data values. To find the cumulative relative frequency, add all of the previous
relative frequencies to the relative frequency for the current row.
Solution

a. 29%
b. 36%
c. 77%
d. 87
e. quantitative continuous
f. get rosters from each team and choose a simple random sample from each
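
The numerical answers to Examples 1.21, 1.22 and 1.23 can be checked with a short sketch. It assumes the class boundaries 59.95, 61.95, ..., 75.95 used in the text together with the frequencies from Table 1.21; the helper name count_between is an assumption made for illustration.

```python
from decimal import Decimal

# Class boundaries 59.95, 61.95, ..., 75.95 and the frequencies from Table 1.21
boundaries = [Decimal("59.95") + 2 * i for i in range(9)]
freqs = [5, 3, 15, 40, 17, 12, 7, 1]
total = sum(freqs)

def count_between(low, high):
    """Count the heights in the classes that span boundary `low` to boundary `high`."""
    i = boundaries.index(Decimal(low))
    j = boundaries.index(Decimal(high))
    return sum(freqs[i:j])

def percent_between(low, high):
    """Percentage of all heights that lie between the two boundaries."""
    return 100 * count_between(low, high) / total

print(percent_between("59.95", "65.95"))  # Example 1.21: less than 65.95 inches
print(percent_between("61.95", "65.95"))  # Example 1.22
print(percent_between("67.95", "71.95"))  # Example 1.23 a
print(percent_between("67.95", "73.95"))  # Example 1.23 b
print(percent_between("65.95", "75.95"))  # Example 1.23 c: more than 65.95 inches
print(count_between("61.95", "71.95"))    # Example 1.23 d: a count, not a percent
```

Boundaries are passed as strings and stored as Decimal values so that lookups such as 65.95 match exactly, which is not guaranteed with binary floating-point numbers.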

Example 1.24
Nineteen people were asked how many miles, to the nearest mile, they commute to work each day.
The data are as follows: 2; 5; 7; 3; 2; 10; 18; 15; 20; 7; 10; 18; 5; 12; 13; 12; 4; 5; 10. Table 1.23:
Frequency of Commuting Distances was produced:

Frequency of Commuting Distances

DATA    FREQUENCY    RELATIVE FREQUENCY    CUMULATIVE RELATIVE FREQUENCY

3       3            3/19                  0.1579
4       1            1/19                  0.2105
5       3            3/19                  0.1579
7       2            2/19                  0.2632
10      3            4/19                  0.4737
12      2            2/19                  0.7895
13      1            1/19                  0.8421
15      1            1/19                  0.8948
18      1            1/19                  0.9474
20      1            1/19                  1.0000

Table 1.23

Problem

a. Is the table correct? If it is not correct, what is wrong?


b. True or False: Three percent of the people surveyed commute three miles. If the statement
is not correct, what should it be? If the table is incorrect, make the corrections.
c. What fraction of the people surveyed commute five or seven miles?
d. What fraction of the people surveyed commute 12 miles or more? Less than 12 miles? Between
five and 13 miles (not including five and 13 miles)?

Solution

a. No. The frequency column sums to 18, not 19. Not all cumulative relative frequencies are
correct.
b. False. The frequency for three miles should be one; for two miles (left out), two; and for
18 miles, two. The cumulative relative frequency column should read: 0.1053, 0.1579, 0.2105,
0.3684, 0.4737, 0.6316, 0.7368, 0.7895, 0.8421, 0.9474, 1.0000.
c. 5/19
d. 7/19, 12/19, 7/19

Exercise 1.2.3.5 (Solution on p. 82.)


Table 1.22 represents the amount, in inches, of annual rainfall in a sample of towns. What
fraction of towns surveyed get between 11.03 and 13.05 inches of rainfall each year?

1.2.3.2.3 Histograms

In the introduction, the idea of distribution was introduced. The distribution refers to the shape, centre and
variation of quantitative data. To determine the shape of the data, we need to look at a visual representation
of the data. The best visual representation to look at is the histogram.

note: Bar graphs and histograms look very similar. They both have bars whose heights represent
the frequency of the data. But bar graphs are used for categorical data and discrete quantitative
data (i.e. whole number data). Histograms are used for continuous quantitative data (i.e. numbers
with decimals) and sometimes discrete quantitative data as well. Since there is a gap between
categories and whole numbers, the bars in bar graphs do not touch. But for continuous data, there
is no gap between the numbers, so the bars for histograms do touch.

For most of the work you do in this book, you will use a histogram to display the data. One advantage of a
histogram is that it can readily display large data sets. The following explains how to make a histogram by
hand, but you can use statistical software to do this quite quickly.
A histogram consists of contiguous (adjoining) boxes. It has both a horizontal axis and a vertical
axis. The horizontal axis is labeled with what the data represents (for instance, distance from your home
to school). The vertical axis is labeled either frequency or relative frequency (or percent frequency or
probability). The graph will have the same shape with either label. The histogram (like the stemplot) can
give you the shape of the data, the center, and the spread of the data.
The relative frequency is equal to the frequency for an observed value of the data divided by the total
number of data values in the sample. (Remember, frequency is defined as the number of times an answer
occurs.) If:

• f = frequency
• n = total number of data values (or the sum of the individual frequencies), and
• RF = relative frequency,

then:
$RF = \frac{f}{n}$ (1.10)

For example, if three students in Mr. Ahab's English class of 40 students received from 90% to 100%,
then f = 3, n = 40, and $RF = \frac{f}{n} = \frac{3}{40} = 0.075$. 7.5% of the students received 90-100%. 90-100% are
quantitative measures.
To construct a histogram, first decide how many bars or intervals, also called classes, represent the
data. Many histograms consist of five to 15 bars or classes for clarity. The number of bars needs to be chosen.
Choose a starting point for the first interval to be less than the smallest data value. A convenient starting
point is a lower value carried out to one more decimal place than the value with the most decimal places.
For example, if the value with the most decimal places is 6.1 and this is the smallest value, a convenient
starting point is 6.05 (6.1 − 0.05 = 6.05). We say that 6.05 has more precision. If the value with the most
decimal places is 2.23 and the lowest value is 1.5, a convenient starting point is 1.495 (1.5 − 0.005 = 1.495).
If the value with the most decimal places is 3.234 and the lowest value is 1.0, a convenient starting point
is 0.9995 (1.0 − 0.0005 = 0.9995). If all the data happen to be integers and the smallest value is two, then
a convenient starting point is 1.5 (2 − 0.5 = 1.5). Also, when the starting point and other boundaries are
carried to one additional decimal place, no data value will fall on a boundary. The next two examples go
into detail about how to construct a histogram using continuous data and how to create a histogram using
discrete data.
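
The rule for choosing a convenient starting point can be sketched as a small function. This is one possible reading of the rule, not a standard library routine: the function name starting_point is an assumption, and the data are passed as strings so that the number of decimal places written is preserved.

```python
from decimal import Decimal

def starting_point(values):
    """Smallest value minus half a unit of the finest decimal place in the data.

    `values` are strings such as "6.1" so that trailing decimals are kept.
    """
    decimals = [Decimal(v) for v in values]
    # The exponent is -k for a value written with k decimal places
    places = max(max(-d.as_tuple().exponent for d in decimals), 0)
    # Half a unit in the last place: 0.5 for integers, 0.05 for one decimal, ...
    half_unit = Decimal(5).scaleb(-(places + 1))
    return min(decimals) - half_unit

print(starting_point(["6.1", "7.3"]))      # smallest 6.1, one decimal place
print(starting_point(["2.23", "1.5"]))     # two decimal places, smallest 1.5
print(starting_point(["3.234", "1.0"]))    # three decimal places, smallest 1.0
print(starting_point(["2", "5", "9"]))     # integers, smallest 2
```

Apart from the four cases worked in the paragraph above, the input values (7.3, 5 and 9) are made-up companions added only so that each list has more than one entry.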
Example 1.25
The following data are the heights (in inches to the nearest half inch) of 100 male semiprofessional
soccer players. The heights are continuous data, since height is measured.
60; 60.5; 61; 61; 61.5
63.5; 63.5; 63.5
64; 64; 64; 64; 64; 64; 64; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5
66; 66; 66; 66; 66; 66; 66; 66; 66; 66; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5;
67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5
68; 68; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69.5; 69.5; 69.5; 69.5; 69.5
70; 70; 70; 70; 70; 70; 70.5; 70.5; 70.5; 71; 71; 71
72; 72; 72; 72.5; 72.5; 73; 73.5
74
The smallest data value is 60. Since the data with the most decimal places has one decimal
(for instance, 61.5), we want our starting point to have two decimal places. Since the numbers 0.5,
0.05, 0.005, etc. are convenient numbers, use 0.05 and subtract it from 60, the smallest value, for
the convenient starting point.
60 − 0.05 = 59.95 which is more precise than, say, 61.5 by one decimal place. The starting point
is, then, 59.95.
The largest value is 74, so 74 + 0.05 = 74.05 is the ending value.
Next, calculate the width of each bar or class interval. To calculate this width, subtract the
starting point from the ending value and divide by the number of bars (you must choose the number
of bars you desire). Suppose you choose eight bars.
$\frac{74.05 - 59.95}{8} = 1.76$ (1.10)

note: We will round up to two and make each bar or class interval two units wide. Rounding up to
two is one way to prevent a value from falling on a boundary. Rounding to the next number is
often necessary even if it goes against the standard rules of rounding. For this example, using 1.76
as the width would also work. A guideline that is followed by some for the number of bars or class
intervals is to take the square root of the number of data values and then round to the nearest whole
number, if necessary. For example, if there are 150 values of data, take the square root of 150 and
round to 12 bars or intervals.
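
The width calculation and the square-root guideline can be checked with a few lines of Python. This is a minimal sketch using the numbers from this example; the variable names are assumptions.

```python
import math
from decimal import Decimal

# Starting point, ending value and chosen number of bars from Example 1.25
start, end, bars = Decimal("59.95"), Decimal("74.05"), 8

raw_width = (end - start) / bars   # exact width before rounding
width = math.ceil(raw_width)       # round up to the next whole number
print(f"raw width = {raw_width}, rounded-up width = {width}")

# Square-root guideline for the number of bars with 150 data values
suggested_bars = round(math.sqrt(150))
print(f"suggested number of bars for 150 values: {suggested_bars}")
```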

The boundaries are:

• 59.95
• 59.95 + 2 = 61.95
• 61.95 + 2 = 63.95
• 63.95 + 2 = 65.95
• 65.95 + 2 = 67.95
• 67.95 + 2 = 69.95
• 69.95 + 2 = 71.95
• 71.95 + 2 = 73.95
• 73.95 + 2 = 75.95

The heights 60 through 61.5 inches are in the interval 59.95-61.95. The heights that are 63.5 are
in the interval 61.95-63.95. The heights that are 64 through 64.5 are in the interval 63.95-65.95.
The heights 66 through 67.5 are in the interval 65.95-67.95. The heights 68 through 69.5 are in
the interval 67.95-69.95. The heights 70 through 71 are in the interval 69.95-71.95. The heights
72 through 73.5 are in the interval 71.95-73.95. The height 74 is in the interval 73.95-75.95.
The following histogram displays the heights on the x-axis and relative frequency on the y-axis.

Figure 1.11
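
The numbers behind this histogram can be reproduced with the sketch below, which sorts the 100 heights into the eight class intervals and prints a rough text version of the bars. The (height, count) pairs are a compact transcription of the data listed in Example 1.25; the variable names and the text layout are illustrative assumptions, and statistical software would normally draw the actual figure.

```python
from decimal import Decimal

# (height, count) pairs transcribed from the data listed in Example 1.25
height_counts = [
    ("60", 1), ("60.5", 1), ("61", 2), ("61.5", 1),
    ("63.5", 3), ("64", 7), ("64.5", 8),
    ("66", 10), ("66.5", 11), ("67", 12), ("67.5", 7),
    ("68", 2), ("69", 10), ("69.5", 5),
    ("70", 6), ("70.5", 3), ("71", 3),
    ("72", 3), ("72.5", 2), ("73", 1), ("73.5", 1), ("74", 1),
]
heights = [Decimal(h) for h, c in height_counts for _ in range(c)]
n = len(heights)

# Starting point 59.95, class width 2, eight bars
start, width, bars = Decimal("59.95"), 2, 8
boundaries = [start + width * i for i in range(bars + 1)]

# No height can equal a boundary, so strict inequalities are safe
counts = [
    sum(1 for h in heights if low < h < high)
    for low, high in zip(boundaries, boundaries[1:])
]

for low, high, c in zip(boundaries, boundaries[1:], counts):
    print(f"{low}-{high}  {'#' * c}  {c / n:.2f}")
```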

note: Visual representations should be numbered. As they are images, they would be numbered
as figures. For example, a histogram would be numbered "Figure 3". This means it is the third
image in the document. This makes it easier to refer back to: "In Figure 3, we can see that . . ."
The title of the visual representation includes the name of the visual representation and the context:
"Histogram of . . .".
The label that goes along the axis includes the variable and the unit: "Variable (unit)".
These three aspects combined will make it easy to refer to the image and let the reader of the image
know what the image is about.
A frequency table would be similarly titled and labelled, but since it is a table and not an image,
it would be referred to as Table 4 (meaning the fourth table in the document).
As you look through this textbook, notice how all of the images and tables are numbered as
described above.


Exercise 1.2.3.6 (Solution on p. 83.)


The following data are the shoe sizes of 50 male students. The sizes are continuous data
since shoe size is measured. Construct a histogram and calculate the width of each bar or
class interval. Suppose you choose six bars.
9; 9; 9.5; 9.5; 10; 10; 10; 10; 10; 10; 10.5; 10.5; 10.5; 10.5; 10.5; 10.5; 10.5; 10.5
11; 11; 11; 11; 11; 11; 11; 11; 11; 11; 11; 11; 11; 11.5; 11.5; 11.5; 11.5; 11.5; 11.5; 11.5
12; 12; 12; 12; 12; 12; 12; 12.5; 12.5; 12.5; 12.5; 14

1.2.3.2.3.1 Shape

The shape of the data helps us understand what kind of pattern the data has. For example, if all of the
data values have the same frequency, then the shape will be distinct (it is called uniform). If the data has a
skew in it, then that helps us understand the measure of centre better (to be discussed in the next section).
Overall, the shape helps us see how the data is behaving. Data that has similar shapes will behave in similar
ways.
The shape of the data set is determined by looking at a visual representation of the data and usually
the histogram. Common ways of describing the shape include whether it is symmetrical or not, how many
distinct peaks it has (unimodal, bimodal, multimodal), and whether the data has a tail only on one side
(skew).

• Data is symmetric if the shape is the same on both sides of the centre.
• Skewed data has a "tail" on one side. This means that there are some data values that are far from
the centre but only on one side. This is a type of non-symmetric data.
• For a histogram, the term "modal" refers to the number of distinct peaks. You almost want to think
about mountain peaks. If there are multiple, distinct mountain peaks, then we say the data is
multi-modal. If there is only one distinct peak, then the data is uni-modal. Not all data has a distinct
peak.
• Uniform data occurs if the frequency of each interval is about the same. This will result in a flat-looking
histogram.
• A very important shape in statistics is the bell-curve (the shape in the first row, second column). This
shape is symmetric, uni-modal and looks like a bell. If data has this shape (and satisfies a few other
properties that will be discussed in Chapter 5), we call this data normal.

Here are some examples of different shapes of data:

Various shapes that data can have

Figure 1.12: Here are some examples of possible shapes that data can take

The above is provided to give you some ideas on how to describe the shape of data. But not all data sets
have a nice shape that fits into one of the above. Sometimes the data can only be described as non-symmetric.

1.2.3.3 How NOT to Lie with Statistics

It is important to remember that the very reason we develop a variety of methods to present data is to
develop insights into the subject of what the observations represent. We want to get a "sense" of the data.
Are the observations all very much alike or are they spread across a wide range of values? Are they bunched
at one end of the spectrum or are they distributed evenly? We are trying to get a visual picture
of the numerical data. Shortly we will develop formal mathematical measures of the data, but our visual
graphical presentation can say much. It can, unfortunately, also say much that is distracting, confusing and
simply wrong in terms of the impression the visual leaves. Many years ago Darrell Huff wrote the book How
to Lie with Statistics. It has been through 25 plus printings and sold more than one and one-half million
copies. His perspective was a harsh one and used many actual examples that were designed to mislead. He
wanted to make people aware of such deception, but perhaps more importantly to educate so that others do
not make the same errors inadvertently.
Again, the goal is to enlighten with visuals that tell the story of the data. Pie charts have a number of
common problems when used to convey the message of the data. Too many pieces of the pie overwhelm the
reader. No more than perhaps five or six categories should be used if the chart is to give an idea of the
relative importance of each piece. This is, after all, the goal of a pie chart: to show which subset matters most
relative to the others. If there are more components than this, then perhaps an alternative approach would be
better, or perhaps some can be consolidated into an "other" category. Pie charts cannot show changes over
time, although we see this attempted all too often. In federal, state, and city finance documents, pie charts
are often presented to show the components of revenue available to the governing body for appropriation:
income tax, sales tax, motor vehicle taxes and so on. In and of itself this is interesting information and can
be nicely done with a pie chart. The error occurs when two years are set side-by-side. Because the total
revenues change year to year, but the size of the pie is fixed, no real information is provided and the relative
size of each piece of the pie cannot be meaningfully compared.
Histograms can be very helpful in understanding the data. Properly presented, they can be a quick visual
way to present probabilities of different categories by the simple visual of comparing relative areas in each
category. Here the error, purposeful or not, is to vary the width of the categories. This of course makes
comparison to the other categories impossible. It does embellish the importance of the category with the
expanded width because it has a greater area, inappropriately, and thus visually "says" that that category
has a higher probability of occurrence.
Changing the units of measurement of the axis can smooth out a drop or accentuate one. If you want to
show large changes, then measure the variable in small units, pennies rather than thousands of dollars. And
of course to continue the fraud, be sure that the axis does not begin at zero, zero. If it begins at zero, zero,
then it becomes apparent that the axis has been manipulated.
Again, the goal of descriptive statistics is to convey meaningful visuals that tell the story of the data.
Purposeful manipulation is fraud and unethical at the worst, but even at its best, making these types of errors
will lead to confusion in the analysis.


1.2.3.5 Chapter Review

A bar graph is a chart that uses either horizontal or vertical bars to show comparisons among categories.
One axis of the chart shows the specific categories being compared, and the other axis represents a discrete
value. Some bar graphs present bars clustered in groups of more than one (grouped bar graphs), and others
show the bars divided into subparts to show cumulative effect (stacked bar graphs). Bar graphs are especially
useful when categorical data is being used, but they can also be used for quantitative discrete data.
A histogram is a graphic version of a frequency distribution. The graph consists of bars of equal width
drawn adjacent to each other. The horizontal scale represents classes of quantitative data values and the
vertical scale represents frequencies. The heights of the bars correspond to frequency values. Histograms are
typically used for large, continuous, quantitative data sets.


1.2.3.6

Exercise 1.2.3.7 (Solution on p. 83.)


The students in Ms. Ramirez's math class have birthdays in each of the four seasons. Table
1.24 shows the four seasons, the number of students who have birthdays in each season, and the
percentage (%) of students in each group. Construct a bar graph showing the percentage of students
in each group.

Seasons   Number of students   Proportion of population
Spring    8                    24%
Summer    9                    26%
Autumn    11                   32%
Winter    6                    18%

Table 1.24

Exercise 1.2.3.8 (Solution on p. 83.)


David County has six high schools. Each school sent students to participate in a county-wide science
competition. Table 1.25 shows the percentage breakdown of competitors from each school, and the
percentage of the entire student population of the county that goes to each school. Construct a bar
graph that shows the county-wide population percentage of students at each school.

High School   Science competition population   Overall student population
Alabaster     28.9%                            8.6%
Concordia     7.6%                             23.2%
Genoa         12.1%                            15.0%
Mocksville    18.5%                            14.3%
Tynneson      24.2%                            10.1%
West End      8.7%                             28.8%

Table 1.25

Exercise 1.2.3.9
Construct a histogram for the following:

a.
Pulse Rates for Women   Frequency
60–69                   12
70–79                   14
80–89                   11
90–99                   1
100–109                 1
110–119                 0
120–129                 1


Table 1.26

b.
Actual Speed in a 30 MPH Zone   Frequency
42–45                           25
46–49                           14
50–53                           7
54–57                           3
58–61                           1

Table 1.27

c.
Tar (mg) in Nonfiltered Cigarettes   Frequency
10–13                                1
14–17                                0
18–21                                15
22–25                                7
26–29                                2

Table 1.28

1.2.3.7 Homework

Use the following information to answer the next two exercises: Suppose one hundred eleven people who
shopped in a special t-shirt store were asked the number of t-shirts they own costing more than $19 each.

Exercise 1.2.3.10 (Solution on p. 84.)


The percentage of people who own at most three t-shirts costing more than $19 each is approximately:

a. 21
b. 59
c. 41
d. Cannot be determined

Exercise 1.2.3.11 (Solution on p. 84.)


If the data were collected by asking the first 111 people who entered the store, then the type of
sampling is:

a. cluster
b. simple random
c. stratified
d. convenience

1.2.4 Measures of Location and Box Plots – MRU – C Lemieux (2017)11


1.2.4.1 Introduction

Measures of location help us to understand where data values are located relative to other data values. We've
already seen a measure of location - the median. It tells us what data value is in the middle of the data set.
The most common measure of position is a percentile. Percentiles divide ordered data into hundredths.
To score in the 90th percentile of an exam does not mean, necessarily, that you received 90% on a test. It
means that 90% of test scores are the same as or less than your score and 10% of the test scores are the same
as or greater than your test score. The median is the 50th percentile.
Quartiles are a special type of percentile. Quartiles divide ordered data into quarters. The first
quartile, Q1, is the same as the 25th percentile, and the third quartile, Q3, is the same as the 75th percentile.
The median, M, is called both the second quartile and the 50th percentile.
A visual representation of measures of location is called a box plot.
In this section, we will learn how to find quartiles and use those quartiles to find the interquartile range
and outliers. Then we will visually represent this information on a box plot. Unlike histograms and bar
graphs, box plots require the use of numerical summaries. Thus the box plot is a representation that combines
both visual and numerical summaries of the data.

1.2.4.2 Measures of location

As described in the introduction, a common measure of location is the percentile. Percentiles are useful for
comparing values. For this reason, universities and colleges use percentiles extensively. One instance in
which colleges and universities use percentiles is when SAT results are used to determine a minimum testing
score that will be used as an acceptance factor. For example, suppose Duke accepts SAT scores at or above
the 75th percentile. That translates into a score of at least 1220.
Percentiles are mostly used with very large populations. Therefore, if you were to say that 90% of the
test scores are less (and not the same or less) than your score, it would be acceptable because removing one
particular data value is not significant.
The median is a number that measures the "center" of the data. You can think of the median as the
"middle value," but it does not actually have to be one of the observed values. It is a number that separates
ordered data into halves. Half the values are the same number or smaller than the median, and half the
values are the same number or larger. For example, consider the following data.
1; 11.5; 6; 7.2; 4; 8; 9; 10; 6.8; 8.3; 2; 2; 10; 1
Ordered from smallest to largest:
1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5
11 This content is available online at <http://cnx.org/content/m64907/1.1/>.


Since there are 14 observations, the median is between the seventh value, 6.8, and the eighth value, 7.2.
To find the median, add the two values together and divide by two.
(6.8 + 7.2) / 2 = 7    (1.12)
The median is seven. Half of the values are smaller than seven and half of the values are larger than seven.
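The calculation above can also be checked with software. The following is a minimal sketch in Python (our choice of language; the text itself does not prescribe one) that orders the data, averages the two middle values, and compares the result with the `median` function from Python's standard `statistics` module. The variable names are ours and purely illustrative.

```python
from statistics import median

# Data set from the text (in its original, unordered form).
data = [1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1]

ordered = sorted(data)
n = len(ordered)  # 14 observations

# With an even number of observations, the median is the average
# of the two middle values (the 7th and 8th values here).
lower_middle = ordered[n // 2 - 1]
upper_middle = ordered[n // 2]
by_hand = (lower_middle + upper_middle) / 2

print(lower_middle, upper_middle)  # the two middle values
print(by_hand)                     # median computed "by hand"
print(median(data))                # the library function should agree
```

Note that `median` sorts the data itself, so the values do not have to be ordered first.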
Quartiles are numbers that separate the data into quarters. Quartiles may or may not be part of the
data. To find the quartiles, first find the median or second quartile. The first quartile, Q1, is the middle
value of the lower half of the data, and the third quartile, Q3, is the middle value, or median, of the upper
half of the data. To get the idea, consider the same data set:
1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5
The median or second quartile is seven. The lower half of the data are 1, 1, 2, 2, 4, 6, 6.8. The middle
value of the lower half is two.
1; 1; 2; 2; 4; 6; 6.8
The number two, which is part of the data, is the first quartile. One-fourth of the entire set of values
is the same as or less than two and three-fourths of the values are more than two.
The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of the upper half is nine.
The third quartile, Q3, is nine. Three-fourths (75%) of the ordered data set are less than nine.
One-fourth (25%) of the ordered data set are greater than nine. The third quartile is part of the data set in this
example.
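The "median of each half" procedure described above can be sketched in a few lines of Python. This is only one way to compute quartiles: the function name `quartiles_by_halves` is hypothetical, and we assume that when the number of values is odd, the overall median is left out of both halves.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    Assumption: when the number of values is odd, the overall median
    is left out of both the lower half and the upper half.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)

data = [1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5]
q1, med, q3 = quartiles_by_halves(data)
print(q1, med, q3)
```

For this data set the function should reproduce the values found by hand: a first quartile of two and a third quartile of nine. Statistical software often uses a slightly different rule, so its quartiles may differ a little from the hand method.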
Possible Quartile Positions

Figure 1.13

As mentioned in the previous section, the interquartile range is a measure of variation. It is a number
that indicates the spread of the middle half or the middle 50% of the data. It is the difference between the
third quartile (Q3) and the first quartile (Q1).
IQR = Q3 - Q1


The IQR can help to determine potential outliers. A value is suspected to be a potential outlier
if it is more than (1.5)(IQR) below the first quartile or more than (1.5)(IQR) above the third
quartile. Potential outliers always require further investigation.

note: A potential outlier is a data point that is significantly different from the other data points.
These special data points may be errors or some kind of abnormality or they may be a key to
understanding the data.

Example 1.26
For the following 13 real estate prices, calculate the IQR and determine if any prices are potential
outliers. Prices are in dollars.
389,950; 230,500; 158,000; 479,000; 639,000; 114,950; 5,500,000; 387,000; 659,000; 529,000; 575,000;
488,800; 1,095,000
Solution
Order the data from smallest to largest.
114,950; 158,000; 230,500; 387,000; 389,950; 479,000; 488,800; 529,000; 575,000; 639,000; 659,000;
1,095,000; 5,500,000
M = 488,800
Q1 = (230,500 + 387,000) / 2 = 308,750
Q3 = (639,000 + 659,000) / 2 = 649,000
IQR = 649,000 - 308,750 = 340,250
(1.5)(IQR) = (1.5)(340,250) = 510,375
1.5(IQR) less than the first quartile: Q1 - (1.5)(IQR) = 308,750 - 510,375 = -201,625
1.5(IQR) more than the third quartile: Q3 + (1.5)(IQR) = 649,000 + 510,375 = 1,159,375
No house price is less than -201,625. However, 5,500,000 is more than 1,159,375. Therefore,
5,500,000 is a potential outlier.
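The steps of this solution can be retraced with a short Python sketch. It assumes the same hand method used in the solution (with 13 values, the median is left out of both halves); the variable names are ours.

```python
from statistics import median

prices = [389_950, 230_500, 158_000, 479_000, 639_000, 114_950, 5_500_000,
          387_000, 659_000, 529_000, 575_000, 488_800, 1_095_000]

ordered = sorted(prices)
half = len(ordered) // 2         # 13 values, so the median is ordered[6]
q1 = median(ordered[:half])      # median of the six values below the median
q3 = median(ordered[half + 1:])  # median of the six values above the median
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# A price is a potential outlier if it falls outside the two limits.
potential_outliers = [p for p in ordered if p < lower_limit or p > upper_limit]

print(q1, q3, iqr)
print(lower_limit, upper_limit)
print(potential_outliers)
```

The only price flagged should be 5,500,000, in agreement with the solution. The lower limit is negative, so no price can fall below it.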

Example 1.27
For the two data sets in the test scores example (Example 1.33), find the following:

a. The interquartile range. Compare the two interquartile ranges.
b. Any outliers in either set.

Solution
The five number summary for the day and night classes is

        Minimum   Q1   Median   Q3     Maximum
Day     32        56   74.5     82.5   99
Night   25.5      78   81       89     98

Table 1.29

a. The IQR for the day group is Q3 - Q1 = 82.5 - 56 = 26.5. The IQR for the night group is Q3
- Q1 = 89 - 78 = 11.
The interquartile range (the spread or variability) for the day class is larger than the night
class IQR. This suggests more variation will be found in the day class's test scores.
b. Day class outliers are found using the IQR times 1.5 rule. So,
Q1 - IQR(1.5) = 56 - 26.5(1.5) = 16.25
Q3 + IQR(1.5) = 82.5 + 26.5(1.5) = 122.25
Since the minimum and maximum values for the day class are greater than 16.25 and less
than 122.25, there are no outliers.
Night class outliers are calculated as:
Q1 - IQR(1.5) = 78 - 11(1.5) = 61.5
Q3 + IQR(1.5) = 89 + 11(1.5) = 105.5
For this class, any test score less than 61.5 is an outlier. Therefore, the scores of 45 and 25.5
are outliers. Since no test score is greater than 105.5, there is no upper end outlier.

1.2.4.2.1 Interpreting Percentiles, Quartiles, and Median

A percentile indicates the relative standing of a data value when data are sorted into numerical order from
smallest to largest. A percentage p of the data values is less than or equal to the pth percentile. For example,
15% of data values are less than or equal to the 15th percentile.

• Low percentiles always correspond to lower data values.
• High percentiles always correspond to higher data values.

A percentile may or may not correspond to a value judgment about whether it is "good" or "bad." The
interpretation of whether a certain percentile is "good" or "bad" depends on the context of the situation to
which the data applies. In some situations, a low percentile would be considered "good;" in other contexts
a high percentile might be considered "good". In many situations, there is no value judgment that applies.
Understanding how to interpret percentiles properly is important not only when describing data, but also
when calculating probabilities in later chapters of this text.
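To make the "less than or equal to" reading of a percentile concrete, here is a small Python sketch that counts the share of values at or below a given number. The function name `percent_at_or_below` is hypothetical, and the data are the 14 values used earlier in this section.

```python
def percent_at_or_below(values, x):
    """Percentage of the values that are less than or equal to x."""
    count = sum(1 for v in values if v <= x)
    return 100 * count / len(values)

data = [1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5]

print(percent_at_or_below(data, 2))  # share of values at or below Q1 = 2
print(percent_at_or_below(data, 7))  # share of values at or below the median, 7
print(percent_at_or_below(data, 9))  # share of values at or below Q3 = 9
```

With only 14 values, the shares at or below the quartiles come out near, but not exactly at, 25% and 75%. This is one reason percentiles are mostly used with large data sets.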

note: When writing the interpretation of a percentile in the context of the given data, the sentence
should contain the following information.

• information about the context of the situation being considered
• the data value (value of the variable) that represents the percentile
• the percent of individuals or items with data values below the percentile
• the percent of individuals or items with data values above the percentile.

Example 1.28
On a timed math test, the first quartile for time it took to finish the exam was 35 minutes.
Interpret the first quartile in the context of this situation.
Solution

• Twenty-five percent of students finished the exam in 35 minutes or less.
• Seventy-five percent of students finished the exam in 35 minutes or more.
• A low percentile could be considered good, as finishing more quickly on a timed exam is
desirable. (If you take too long, you might not be able to finish.)


Example 1.29
On a 20 question math test, the 70th percentile for number of correct answers was 16. Interpret
the 70th percentile in the context of this situation.
Solution

• Seventy percent of students answered 16 or fewer questions correctly.
• Thirty percent of students answered 16 or more questions correctly.
• A higher percentile could be considered good, as answering more questions correctly is
desirable.

Exercise 1.2.4.1 (Solution on p. 84.)


On a 60 point written assignment, the 80th percentile for the number of points earned
was 49. Interpret the 80th percentile in the context of this situation.

Example 1.30
At a community college, it was found that the 30th percentile of credit units that students are
enrolled for is seven units. Interpret the 30th percentile in the context of this situation.
Solution

• Thirty percent of students are enrolled in seven or fewer credit units.
• Seventy percent of students are enrolled in seven or more credit units.
• In this example, there is no "good" or "bad" value judgment associated with a higher or lower
percentile. Students attend community college for varied reasons and needs, and their course
load varies according to their needs.

1.2.4.2.2 Outliers

The idea of potential outliers was discussed above. This section looks in more depth at how to find
outliers and how to categorize them.
Quartiles can also be used to determine if there are any outliers in a data set. To determine if there are
outliers, we need to first calculate the inner and outer fences. The fences define the boundary between a
normal data value and an abnormal data value (or outlier). Any data values that fall between the inner
fences are normal data values. Any data values that fall outside the inner fences are considered
outliers.
The fences are calculated as follows:
The inner fences are Q1 - IQR(1.5) and Q3 + IQR(1.5).
The outer fences are Q1 - IQR(3) and Q3 + IQR(3).
A mild outlier is any data value between the inner and outer fences.
An extreme outlier is any data value beyond the outer fences.
Example 1.31: Finding outliers
Sharpe Middle School is applying for a grant that will be used to add fitness equipment to the
gym. The principal surveyed 15 anonymous students to determine how many minutes a day the
students spend exercising. The results from the 15 anonymous students are shown.
0 minutes; 40 minutes; 60 minutes; 30 minutes; 60 minutes; 10 minutes; 45 minutes; 30 minutes;
300 minutes; 90 minutes; 30 minutes; 120 minutes; 60 minutes; 0 minutes; 20 minutes
The five-number summary is determined to be: Min = 0; Q1 = 20; Med = 40; Q3 = 60; Max
= 300.
Are there any students who are exercising significantly more or less than the other students?
To answer this question, we have to determine if there are any outliers.
To do this, determine the inner fences.
The IQR is 60 - 20 = 40.
The lower inner fence is Q1 - IQR(1.5) = 20 - 40(1.5) = -40 and the upper inner fence is Q3
+ IQR(1.5) = 60 + 40(1.5) = 120. Thus, any student who exercises between -40 minutes and
120 minutes is exercising a normal amount of time (relative to the rest of the students). Since
someone can't exercise -40 minutes, this is really 0 minutes to 120 minutes. Therefore, 300 minutes
appears to be an outlier. But is it a mild outlier or an extreme outlier?
To determine if it is mild or extreme, we need to calculate the outer fence. We only need the
upper outer fence as there are no low outliers (no one exercised for less than -40 minutes). The
upper outer fence is Q3 + IQR(3) = 60 + 40(3) = 180. If the potential outlier is between 120
and 180 minutes, then it is a mild outlier (as it is between the upper inner and outer fences). If
it is more than 180 minutes, then it is an extreme outlier. In this case, 300 minutes is an extreme
outlier. This means that this student is exercising way more than the rest of their classmates!
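The mild/extreme classification can be written as a small Python function. This is a sketch under stated assumptions: the function name `classify` is ours, the quartiles are taken directly from the five-number summary in the example, and a value that lands exactly on a fence is treated as being inside it (the text does not say how to handle that boundary case).

```python
def classify(value, q1, q3):
    """Label a value as 'normal', 'mild outlier', or 'extreme outlier'."""
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr
    if inner_low <= value <= inner_high:
        return "normal"
    if outer_low <= value <= outer_high:
        return "mild outlier"
    return "extreme outlier"

minutes = [0, 40, 60, 30, 60, 10, 45, 30, 300, 90, 30, 120, 60, 0, 20]
q1, q3 = 20, 60  # quartiles from the five-number summary in the example

# Classify each distinct exercise time, smallest to largest.
for m in sorted(set(minutes)):
    print(m, classify(m, q1, q3))
```

Only the 300-minute value should be labelled an extreme outlier. A hypothetical value such as 150 minutes would fall between the upper inner fence (120) and the upper outer fence (180) and would be labelled a mild outlier.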

1.2.4.3 Box Plots

Box plots (also called box-and-whisker plots or box-whisker plots) give a good graphical image of the
concentration of the data. They also show how far the extreme values are from most of the data.
To construct a box plot, use a horizontal or vertical number line and a rectangular box. The smallest
and largest data values label the endpoints of the axis. The first quartile marks one end of the box and the
third quartile marks the other end of the box. Approximately the middle 50 percent of the data fall
inside the box. The "whiskers" extend from the ends of the box to the smallest and largest data values
that are not outliers.
The median or second quartile can be between the first and third quartiles, or it can be one, or the other,
or both. The box plot gives a good, quick picture of the data.
A box plot is constructed from the five-number summary (the minimum value, the first quartile, the
median, the third quartile, and the maximum value) and, if there are outliers, the fences. We use these
values to compare how close other data values are to them.


Example of a box plot

Figure 1.14: This is an example of a box plot. The box is in the middle and represents 50% of the data.
The circles on the right represent outliers and the dashed lines the fences. The outliers at approximately
22000 and 27000 are mild outliers, while the outlier at approximately 28500 is an extreme outlier.

To construct a box plot, use a horizontal or vertical number line and a rectangular box. The smallest and
largest data values label the endpoints of the axis. The first quartile marks one end of the box and the third
quartile marks the other end of the box. The median is represented by a line inside the box. The middle 50
percent of the data fall inside the box and the length of the box is the interquartile range.
The "whiskers" extend from the ends of the box to the first data values inside the fences. If there are no
outliers, this would be the minimum and maximum values. The outliers are represented by asterisks or dots and
fall either between the inner and outer fences (mild outlier) or outside the outer fences (extreme outlier).
Consider, again, this dataset.
1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5
From the work done above, we know the five number summary is 1, 2, 7, 9, 11.5. The IQR is 9 - 2 = 7.
IQR(1.5) is 7(1.5) = 10.5. The lower inner fence is Q1 - IQR(1.5) = 2 - 10.5 = -8.5 and the upper inner fence is
Q3 + IQR(1.5) = 9 + 10.5 = 19.5. Since no data values are smaller than -8.5 or larger than 19.5, there are no
outliers in the data set.

Figure 1.15

The two whiskers extend from the first quartile to the smallest value and from the third quartile to the
largest value. The median is shown with a dashed line.
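The rule for where the whiskers end can be sketched in Python. The function name `box_plot_parts` is hypothetical, the quartiles are passed in as already computed by hand, and values exactly on a fence are assumed to count as inside it. The second data set is the exercise-minutes data from Example 1.31, included to show a case with an outlier.

```python
def box_plot_parts(values, q1, q3):
    """Return (lower whisker end, upper whisker end, outliers).

    Whiskers stop at the most extreme data values that are still
    inside the inner fences; anything beyond a fence is an outlier.
    """
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return min(inside), max(inside), outliers

data = [1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5]
print(box_plot_parts(data, q1=2, q3=9))

minutes = [0, 40, 60, 30, 60, 10, 45, 30, 300, 90, 30, 120, 60, 0, 20]
print(box_plot_parts(minutes, q1=20, q3=60))
```

For the first data set there are no outliers, so the whiskers end at the minimum and maximum (1 and 11.5). For the exercise data the upper whisker should stop at 120 minutes, and 300 minutes is reported separately as an outlier.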

note: It is important to start a box plot with a scaled number line. Otherwise the box plot may
not be useful.


Example 1.32
The following data are the heights of 40 students (in inches) in a statistics class.
59; 60; 61; 62; 62; 63; 63; 64; 64; 64; 65; 65; 65; 65; 65; 65; 65; 65; 65; 66; 66; 67; 67; 68; 68; 69;
70; 70; 70; 70; 70; 71; 71; 72; 72; 73; 74; 74; 75; 77
Take this data and input it into Excel. Use the "Text to Columns" function in the "Data" menu
to separate the data into separate columns. Then copy the data, but when you paste it, use paste
special to "Transpose" the data so it is all in one column.
Now use whatever software you are using to find the five-number summary.

• Minimum value = 59
• Q1: First quartile = 64.75
• Q2: Second quartile or median= 66
• Q3: Third quartile = 70
• Maximum value = 77

Are there outliers? The IQR is 70 - 64.75 = 5.25.
IQR(1.5) = 7.875 (don't round until the end)
The lower inner fence is Q1 - IQR(1.5) = 64.75 - 7.875 = 56.875. Since the minimum value is 59,
there are no lower outliers.
The upper inner fence is Q3 + IQR(1.5) = 70 + 7.875 = 77.875. Since the maximum value is 77,
there are no upper outliers.
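As one illustration of doing this with software, the sketch below uses Python's standard `statistics.quantiles` function. The choice of `method="inclusive"` is our assumption: it is the setting that reproduces the quartiles quoted in this example, and other programs or settings may give slightly different values.

```python
from statistics import quantiles

heights = [59, 60, 61, 62, 62, 63, 63, 64, 64, 64, 65, 65, 65, 65, 65, 65,
           65, 65, 65, 66, 66, 67, 67, 68, 68, 69, 70, 70, 70, 70, 70, 71,
           71, 72, 72, 73, 74, 74, 75, 77]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(heights, n=4, method="inclusive")
five_number = (min(heights), q1, q2, q3, max(heights))

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [h for h in heights if h < lower_fence or h > upper_fence]

print(five_number)
print(iqr, lower_fence, upper_fence)
print(outliers)
```

Software usually interpolates between neighbouring data values, which is why the first quartile here is 64.75. The median-of-halves hand method would give 64.5 for the same data (the average of the 10th and 11th values). Either answer is acceptable as long as the method is used consistently.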
You can also use your computer program to create a box plot for the data.

Box plot of height of 40 students

Figure 1.16

note: The titles and labels for a box plot follow the same rules as they do for a histogram or a
bar graph.

What does the box plot tell us?

• Each quarter has approximately 25% of the data.
• The spreads of the four quarters are 64.75 - 59 = 5.75 (first quarter), 66 - 64.75 = 1.25
(second quarter), 70 - 66 = 4 (third quarter), and 77 - 70 = 7 (fourth quarter). So, the
second quarter has the smallest spread and the fourth quarter has the largest spread.
• Range = maximum value - the minimum value = 77 - 59 = 18, which means that from the
shortest to the tallest student there is a difference of 18 inches.
• Interquartile Range: IQR = third quartile - first quartile = 70 - 64.75 = 5.25, which means
that the middle 50% (middle half) of the data has a range of 5.25 inches. This also means
the length of the box is 5.25.
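The spreads listed above are differences between neighbouring entries of the five-number summary, so they can be computed in one step. A minimal Python sketch (variable names are ours):

```python
# Five-number summary of the 40 heights: min, Q1, median, Q3, max.
five_number = [59, 64.75, 66, 70, 77]

# Spread of each quarter = difference between consecutive summary values.
spreads = [hi - lo for lo, hi in zip(five_number, five_number[1:])]

value_range = five_number[-1] - five_number[0]  # maximum - minimum
iqr = five_number[3] - five_number[1]           # Q3 - Q1

print(spreads)
print(value_range, iqr)
```

The four spreads always add up to the range, and the two middle spreads add up to the IQR, which is a quick way to check the arithmetic.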

Exercise 1.2.4.2 (Solution on p. 84.)


The following data are the number of pages in 40 books on a shelf. Construct a box plot
using computer software, and state the interquartile range.
136; 140; 178; 190; 205; 215; 217; 218; 232; 234; 240; 255; 270; 275; 290; 301; 303; 315;
317; 318; 326; 333; 343; 349; 360; 369; 377; 388; 391; 392; 398; 400; 402; 405; 408; 422;
429; 450; 475; 512

For some sets of data, some of the largest value, smallest value, first quartile, median, and third quartile
may be the same. For instance, you might have a data set in which the median and the third quartile are
the same. In this case, the diagram would not have a dotted line inside the box displaying the median. The
right side of the box would display both the third quartile and the median. For example, if the smallest
value and the first quartile were both one, the median and the third quartile were both five, and the largest
value was seven, the box plot would look like:

Figure 1.17

In this case, at least 25% of the values are equal to one. Twenty-five percent of the values are between
one and five, inclusive. At least 25% of the values are equal to five. The top 25% of the values fall between
five and seven, inclusive.
Example 1.33
Test scores for a college statistics class held during the day are:
99; 56; 78; 55.5; 32; 90; 80; 81; 56; 59; 45; 77; 84.5; 84; 70; 72; 68; 32; 79; 90
Test scores for a college statistics class held during the evening are:
98; 78; 68; 83; 81; 89; 88; 76; 65; 45; 98; 90; 80; 84.5; 85; 79; 78; 98; 90; 79; 81; 25.5
Problem (Solution on p. 84.)

a. Find the smallest and largest values, the median, and the first and third quartile for the day
class.
b. Find the smallest and largest values, the median, and the first and third quartile for the night
class.
c. For each data set, what percentage of the data is between the smallest value and the first
quartile? the first quartile and the median? the median and the third quartile? the third
quartile and the largest value? What percentage of the data is between the first quartile and
the largest value?
d. Create a box plot for each set of data. Use one number line for both box plots.
e. Which box plot has the widest spread for the middle 50% of the data (the data between the
first and third quartiles)? What does this mean for that set of data in comparison to the
other set of data?

Exercise 1.2.4.3 (Solution on p. 85.)


The following data set shows the heights in inches for the boys in a class of 40 students.
66; 66; 67; 67; 68; 68; 68; 68; 68; 69; 69; 69; 70; 71; 72; 72; 72; 73; 73; 74
The following data set shows the heights in inches for the girls in a class of 40 students.
61; 61; 62; 62; 63; 63; 63; 65; 65; 65; 66; 66; 66; 67; 68; 68; 68; 69; 69; 69
Construct a box plot using computer software for each data set, and state which box plot
has the wider spread for the middle 50% of the data.


1.2.4.5 Chapter Review

The values that divide a rank-ordered set of data into 100 equal parts are called percentiles. Percentiles are
used to compare and interpret data. For example, an observation at the 50th percentile would be greater
than 50 percent of the other observations in the set. Quartiles divide data into quarters. The first quartile
(Q1) is the 25th percentile, the second quartile (Q2 or median) is the 50th percentile, and the third quartile (Q3)
is the 75th percentile. The interquartile range, or IQR, is the range of the middle 50 percent of the data
values. The IQR is found by subtracting Q1 from Q3, and can help determine outliers: a value is a potential
outlier if it lies above the first of the following expressions or below the second.

• Q3 + IQR(1.5)
• Q1 − IQR(1.5)
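The two outlier expressions can be checked quickly in code. The following is a minimal sketch, not part of the original text: the function name `iqr_fences` and the sample data are made up for illustration, and the "inclusive" quartile method is an assumption, since different textbooks and software packages compute quartiles slightly differently.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (q1, q3, iqr, lower_fence, upper_fence) for a list of numbers."""
    # method="inclusive" is an assumed convention; other tools may give
    # slightly different quartiles for the same data.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1, q3, iqr, q1 - k * iqr, q3 + k * iqr

# Hypothetical data set, chosen only to illustrate the rule.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]
q1, q3, iqr, low, high = iqr_fences(sample)

# Values below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR) are potential outliers.
outliers = [x for x in sample if x < low or x > high]
print(q1, q3, iqr, low, high, outliers)
```

In this made-up sample only the value 30 falls outside the two fences.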

Box plots are a type of graph that can help visually organize data. To graph a box plot the following data
points must be calculated: the minimum value, the first quartile, the median, the third quartile, and the
maximum value. Once the box plot is graphed, you can display and compare distributions of data.


1.2.4.6

Exercise 1.2.4.4 (Solution on p. 85.)


On an exam, would it be more desirable to earn a grade with a high or low percentile? Explain.
Exercise 1.2.4.5 (Solution on p. 85.)
Mina is waiting in line at the Department of Motor Vehicles (DMV). Her wait time of 32 minutes
is the 85th percentile of wait times. Is that good or bad? Write a sentence interpreting the 85th
percentile in the context of this situation.
Exercise 1.2.4.6 (Solution on p. 86.)
In a study collecting data about the repair costs of damage to automobiles in a certain type of
crash test, a certain model of car had $1,700 in damage and was in the 90th percentile. Should the
manufacturer and the consumer be pleased or upset by this result? Explain and write a sentence
that interprets the 90th percentile in the context of this problem.
Exercise 1.2.4.7 (Solution on p. 86.)
Suppose that you are buying a house. You and your realtor have determined that the most
expensive house you can afford is the 34th percentile. The 34th percentile of housing prices is
$240,000 in the town you want to move to. In this town, can you afford 34% of the houses or 66%
of the houses?
Exercise 1.2.4.8 (Solution on p. 86.)
Sixty-five randomly selected car salespersons were asked the number of cars they generally sell
in one week. Fourteen people answered that they generally sell three cars; nineteen generally sell
four cars; twelve generally sell five cars; nine generally sell six cars; eleven generally sell seven cars.
Construct a box plot for this data.
Exercise 1.2.4.9 (Solution on p. 86.)
Looking at your box plot in the exercise above, does it appear that the data are concentrated
together, spread out evenly, or concentrated in some areas, but not in others? How can you tell?
Exercise 1.2.4.10 (Solution on p. 86.)
In a survey of 20-year-olds in China, Germany, and the United States, people were asked the
number of foreign countries they had visited in their lifetime. The following box plots display the
results.

Figure 1.18


a. In complete sentences, describe what the shape of each box plot implies about the distribution
of the data collected.
b. Have more Americans or more Germans surveyed been to over eight foreign countries?
c. Compare the three box plots. What do they imply about the foreign travel of 20-year-old
residents of the three countries when compared to each other?

Exercise 1.2.4.11 (Solution on p. 86.)


Given the following box plot, answer the questions.

Figure 1.19

a. Think of an example (in words) where the data might fit into the above box plot. In 2–5
sentences, write down the example.
b. What does it mean to have the first and second quartiles so close together, while the second
to third quartiles are far apart?

Exercise 1.2.4.12 (Solution on p. 87.)


A survey was conducted of 130 purchasers of new BMW 3 series cars, 130 purchasers of new BMW
5 series cars, and 130 purchasers of new BMW 7 series cars. In it, people were asked the age they
were when they purchased their car. The following box plots display the results.

Figure 1.20


a. In complete sentences, describe what the shape of each box plot implies about the distribution
of the data collected for that car series.
b. Which group is most likely to have an outlier? Explain how you determined that.
c. Compare the three box plots. What do they imply about the age of purchasing a BMW from
the series when compared to each other?
d. Look at the BMW 5 series. Which quarter has the smallest spread of data? What is the
spread?
e. Look at the BMW 5 series. Which quarter has the largest spread of data? What is the
spread?
f. Look at the BMW 5 series. Estimate the interquartile range (IQR).
g. Look at the BMW 5 series. Are there more data in the interval 31 to 38 or in the interval 45
to 55? How do you know this?
h. Look at the BMW 5 series. Which interval has the fewest data in it? How do you know this?

i. 31–35
ii. 38–41
iii. 41–64

Exercise 1.2.4.13 (Solution on p. 87.)


The following data represents the number of passengers per flight on the AirBus from Calgary to
Edmonton for 24 flights.
8, 19, 22, 23, 29, 30, 34, 35, 37, 39, 41, 44, 44, 46, 46, 47, 48, 49, 50, 52, 54, 55, 61, 65

a. Generate the boxplot for this data.


b. Identify the outliers in the data. Are they low or high outliers? Are they extreme or mild
outliers?
c. Interpret the outliers in the context of the question.
d. What is the IQR? Interpret it in the context of the question.
e. Which quarter of the data is the most concentrated? The least concentrated?
f. What is the five-number summary (minimum, first quartile, median, third quartile, maximum)?

1.2.4.7 Bringing It Together

Exercise 1.2.4.14 (Solution on p. 88.)


Santa Clara County, CA, has approximately 27,873 Japanese-Americans. Their ages are as follows:

Age Group    Percent of Community

0–17         18.9
18–24        8.0
25–34        22.8
35–44        15.0
45–54        13.1
55–64        11.9
65+          10.3

Table 1.30


a. Construct a histogram of the Japanese-American community in Santa Clara County, CA. The
bars will not be the same width for this example. Why not? What impact does this have on
the reliability of the graph?
b. What percentage of the community is under age 35?
c. Which box plot most resembles the information above?

Figure 1.21


Solutions to Exercises in Chapter 1


Solution to Exercise 1.1.2.1 (p. 4)

a. Observational unit: Edmonton. Variable: Monthly Temperature. Type: Quantitative.


b. Observational unit: Student of karate in Canada. Variable: Highest colour of belt earned. Type:
Categorical.
c. Observational unit: Greyhounds. Variable: Weight. Type: Quantitative.
d. Observational unit: Movies made in 2016. Variable: Gross profit. Type: Quantitative.
e. Observational unit: Jessica Jones. Variable: User ratings. Type: Categorical.
f. Observational unit: Cars in Nova Scotia. Variable: Colour. Type: Categorical.

to Exercise 1.1.2.2 (p. 5): Try It Solutions


The population is all families with children attending Knoll Academy.
The sample is a random selection of 100 families with children attending Knoll Academy.
The parameter is the average (mean) amount of money spent on school uniforms by families with
children at Knoll Academy.
The statistic is the average (mean) amount of money spent on school uniforms by families in the sample.
The variable is the amount of money spent by one family. Let X = the amount of money spent on
school uniforms by one family with children attending Knoll Academy.
The data are the dollar amounts spent by the families. Examples of the data are $65, $75, and $95.
Solution to Exercise 1.1.2.4 (p. 7)

a. all children who take ski or snowboard lessons


b. a group of these children
c. the population mean age of children who take their first snowboard lesson
d. the sample mean age of children who take their first snowboard lesson
e. X = the age of one child who takes his or her first ski or snowboard lesson
f. values for X, such as 3, 7, and so on

Solution to Exercise 1.1.2.6 (p. 7)

a. the clients of the insurance companies


b. a group of the clients
c. the mean health costs of the clients
d. the mean health costs of the sample
e. X = the health costs of one client
f. values for X, such as 34, 9, 82, and so on

Solution to Exercise 1.1.2.8 (p. 7)

a. all the clients of this counselor


b. a group of clients of this marriage counselor
c. the proportion of all her clients who stay married
d. the proportion of the sample of the counselor's clients who stay married
e. X = the number of couples who stay married
f. yes, no

Solution to Exercise 1.1.2.10 (p. 7)

a. all people (maybe in a certain geographic area, such as the United States)
b. a group of the people
c. the proportion of all people who will buy the product
d. the proportion of the sample who will buy the product
e. X = the number of people who will buy it


f. buy, not buy

Solution to Exercise 1.1.2.12 (p. 7)


a
to Exercise 1.1.3.1 (p. 8): Try It Solutions
quantitative discrete data
to Exercise 1.1.3.2 (p. 9): Try It Solutions
quantitative continuous data
to Exercise 1.1.3.3 (p. 9): Try It Solutions
qualitative data
to Exercise 1.1.3.4 (p. 10): Try It Solutions
quantitative discrete
to Exercise 1.1.3.5 (p. 10): Try It Solutions
A histogram is used to display quantitative data: the numbers of credit hours completed. Because students
can complete only a whole number of hours (no fractions of hours allowed), this data is quantitative discrete.
to Exercise 1.1.3.6 (p. 16)
stratified
to Exercise 1.1.3.7 (p. 17): Try It Solutions
The sample probably consists more of people who prefer music because it is a concert event. Also, the
sample represents only those who showed up to the event earlier than the majority. The sample probably
doesn't represent the entire fan base and is probably biased towards people who would prefer music.
Solution to Exercise 1.1.3.8 (p. 19)
quantitative discrete, 150
Solution to Exercise 1.1.3.9 (p. 19)
quantitative continuous, 19.2%
Solution to Exercise 1.1.3.10 (p. 20)
categorical, Oakland A's
Solution to Exercise 1.1.3.11 (p. 20)
quantitative continuous, 7.2 minutes
Solution to Exercise 1.1.3.12 (p. 20)
quantitative discrete, 11,234 students
Solution to Exercise 1.1.3.13 (p. 20)
categorical, Dancing with the Stars
Solution to Exercise 1.1.3.14 (p. 20)
categorical, Crest
Solution to Exercise 1.1.3.15 (p. 20)
quantitative continuous, 8.32 miles
Solution to Exercise 1.1.3.16 (p. 20)
quantitative continuous, 47.3 years
Solution to Exercise 1.1.3.17 (p. 20)
b
Solution to Exercise 1.1.3.18 (p. 20)
c
Solution to Exercise 1.1.3.19 (p. 20)

a. The survey was conducted using six similar flights.
The survey would not be a true representation of the entire population of air travelers.
Conducting the survey on a holiday weekend will not produce representative results.
b. Conduct the survey during different times of the year.
Conduct the survey using flights to and from various locations.
Conduct the survey on different days of the week.


Solution to Exercise 1.1.3.20 (p. 20)


Answers will vary. Sample Answer: You could use a systematic sampling method. Stop the tenth person as
they leave one of the buildings on campus at 9:50 in the morning. Then stop the tenth person as they leave
a different building on campus at 1:50 in the afternoon.
Solution to Exercise 1.1.3.21 (p. 21)
a. convenience; b. cluster; c. stratified; d. systematic; e. simple random
Solution to Exercise 1.1.3.22 (p. 21)

a. The country was in the middle of the Great Depression and many people could not afford these luxury
items and therefore could not be included in the survey.
b. Samples that are too small can lead to sampling bias.
c. sampling error
d. stratified
Solution to Exercise 1.1.3.23 (p. 21)
Self-Selected Samples: Only people who are interested in the topic are choosing to respond. Sample Size
Issues: A sample with only 11 participants will not accurately represent the opinions of a nation.
Undue Influence: The question is worded in a specific way to generate a specific response. Self-Funded
or Self-Interest Studies: This question was generated to support one person's claim and it was designed to
get the answer that the person desires.
Solution to Exercise 1.2.2.1 (p. 41)
For Angie: z = (26.2 − 27.2) / 0.8 = −1.25
For Beth: z = (27.3 − 30.1) / 1.4 = −2
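The two z-scores above can be reproduced with a short sketch. This is only an illustration: the function name `z_score` and its parameter names are chosen here, and the numbers are read directly from the solution, where each z-score is (value − mean) / standard deviation.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / std_dev

# Numbers taken from the solution above.
angie = z_score(26.2, 27.2, 0.8)
beth = z_score(27.3, 30.1, 1.4)

# Rounding hides tiny floating-point differences.
print(round(angie, 2), round(beth, 2))
```

Both values are negative, so each result lies below its group mean; Beth's is further below (in standard deviation units) than Angie's.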
Solution to Exercise 1.2.2.2 (p. 43)
Mean: 16 + 17 + 19 + 20 + 20 + 21 + 23 + 24 + 25 + 25 + 25 + 26 + 26 + 27 + 27 + 27 + 28 + 29 +
30 + 32 + 33 + 33 + 34 + 35 + 37 + 39 + 40 = 738;
738 / 27 = 27.33

Solution to Exercise 1.2.2.3 (p. 43)


Median = 27
Solution to Exercise 1.2.2.4 (p. 43)
The most frequent lengths are 25 and 27, which occur three times. Mode = 25, 27
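As a check on the mean, median, and mode found in the three solutions above, here is a minimal sketch using Python's standard `statistics` module. The list is the 27 values from the mean calculation; the variable name `lengths` is chosen here for illustration.

```python
from statistics import mean, median, multimode

# The 27 values listed in the mean calculation above.
lengths = [16, 17, 19, 20, 20, 21, 23, 24, 25, 25, 25, 26, 26, 27,
           27, 27, 28, 29, 30, 32, 33, 33, 34, 35, 37, 39, 40]

data_mean = mean(lengths)        # sum of the values divided by 27
data_median = median(lengths)    # middle (14th) value of the sorted list
data_modes = multimode(lengths)  # every value tied for most frequent

print(round(data_mean, 2), data_median, data_modes)
```

`multimode` is used instead of `mode` because this data set has two modes.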
Solution to Exercise 1.2.2.5 (p. 43)
Mean = (14*3+19*4+12*5+9*6+11*7)/65 = 4.75
Solution to Exercise 1.2.2.6 (p. 43)
4
Solution to Exercise 1.2.2.7 (p. 43)
Mode = 4 (occurs 19 times)
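When data arrive as a frequency table, as in the three solutions above (14 salespersons selling three cars, 19 selling four, and so on), one simple approach is to expand the table into a full list first. The sketch below assumes that approach; the names `frequency` and `cars_sold` are illustrative.

```python
from statistics import mean, median, mode

# Cars sold per week -> number of salespersons, from the exercise.
frequency = {3: 14, 4: 19, 5: 12, 6: 9, 7: 11}

# Expand the table into one entry per salesperson (65 entries in total).
cars_sold = [cars for cars, count in frequency.items() for _ in range(count)]

weighted_mean = mean(cars_sold)  # same as (14*3 + 19*4 + ... + 11*7) / 65
middle_value = median(cars_sold)
most_common = mode(cars_sold)

print(len(cars_sold), round(weighted_mean, 2), middle_value, most_common)
```

Expanding the table is fine for small counts; for very large frequencies a weighted formula would avoid building the list.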
Solution to Exercise 1.2.2.8 (p. 43)
s = 34.5
Solution to Exercise 1.2.2.9 (p. 43)

a. It is difficult to determine which survey is correct. Both surveys include the same number of shoppers
and the shoppers were randomly selected. We could look at how the random selection was done to see
if one of the sampling techniques would result in a more representative sample. But if they used the
same sampling technique, there is no way to know which sample is right. The only way would be to
take another, larger sample and see which of the two supervisors' samples most closely matches that
sample. But really we expect there to be sampling variability, so it is not really appropriate
to ask which is "correct".
b. Since the mean is the same for both samples, it is fair to say that on average shoppers
travel 6.0 km to the mall. But the standard deviations are different, so it is not yet
clear how much variation there is from the 6.0 km.
c. Ercilia's data has a larger standard deviation. Therefore, on average, her data must be more spread
out from the mean than Javier's. This suggests (b) is the answer.


Solution to Exercise 1.2.2.10 (p. 44)


Mode = 19 (occurs 4 times)
Solution to Exercise 1.2.2.11 (p. 44)
b
Solution to Exercise 1.2.2.12 (p. 45)

a.
Enrollment Frequency

0-4999 10
5000-9999 16
10000-14999 3
15000-19999 3
20000-24999 1
25000-29999 2

Table 1.31

b. Histogram for enrollment at community colleges.

Figure 1.22

c. The shape is skewed to the right, which means that there are a few community colleges that have much greater
enrollment than most of the other colleges in the sample.
d. Since the mean (8,628.74) is pulled upward by the skew (it is larger than the median of 6,414), the median is the
best measure of centre.


e. Since we are only looking at one data set, the standard deviation is a good measure of variation. It is
6,943.88.
f. The typical range is 6,414 +/- 6,943.88 = -529.88 to 13,357.88. As there can't be negative students
enrolled, the typical range is 0 students to 13,357.88. Though there could be multiple caveats, one
concern is the rather large variation in the data. This means that community colleges have very different
enrollment rates. Perhaps looking at community colleges that are similar to the one I would like to
open would be more beneficial, as that population would be more representative of my community
college.

Solution to Exercise 1.2.2.13 (p. 45)


Label 1 is excluded as most people don't like it. The mean for label 2 and label 3 is the same. Label 2 could
be considered the better label because more people love it than label 3, but more people hate it. Label 3
could be considered a better label because the variation is less - nobody hates it, but nobody loves it. (Note:
Even though you are comparing two data sets, it is ok to look only at the standard deviation instead of the
coefficient of variation in this situation. Why?).
Choosing label 2 has greater risk (love/hate relationship). Choosing label 3 has less risk (most people
like it).
Solution to Exercise 1.2.2.14 (p. 46)
Note that this question is about risk, i.e. variation.

1. Any answer requires that you examine the amount of variation in the data set. The coefficient of
variation is the best measure to use to compare the variation as the means are different.

                           Company A   Company B   Company C

Coefficient of variation   38.39%      72%         11.95%

Table 1.32

2. The information provided is only for one year. It would be helpful to know about their changes over
more than one year. Quartiles aren't provided. They could help examine the variation as well.
3. The median and the mode are not relevant as this is a question about variation. The mean is only
required as it is needed to find the coefficient of variation.

Solution to Exercise 1.2.3.1 (p. 48)


Figure 1.23

to Exercise 1.2.3.2 (p. 49)

Figure 1.24

to Exercise 1.2.3.3 (p. 51): Try It Solutions


0.56 or 56%
to Exercise 1.2.3.4 (p. 52): Try It Solutions
0.30 + 0.16 + 0.18 = 0.64 or 64%


to Exercise 1.2.3.5 (p. 54): Try It Solutions


9/50
to Exercise 1.2.3.6 (p. 57)
Smallest value: 9
Largest value: 14
Convenient starting value: 9 − 0.05 = 8.95
Convenient ending value: 14 + 0.05 = 14.05
(14.05 − 8.95) / 6 = 0.85
The calculation suggests using 0.85 as the width of each bar or class interval. You can also use an
interval with a width equal to one.
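The class-width arithmetic above can be written as a small helper. This is a sketch only: the function name `class_width` and its parameters are invented here, with the offset of 0.05 and the six classes taken from the calculation in the solution.

```python
def class_width(smallest, largest, num_classes, offset=0.05):
    """Return (start, end, width) for equally wide histogram classes."""
    start = smallest - offset  # convenient starting value
    end = largest + offset     # convenient ending value
    width = (end - start) / num_classes
    return start, end, width

start, end, width = class_width(9, 14, 6)

# Rounding hides tiny floating-point differences.
print(round(start, 2), round(end, 2), round(width, 2))
```

Changing `num_classes` shows how the suggested bar width grows or shrinks with the number of bars.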
Solution to Exercise 1.2.3.7 (p. 61)

Figure 1.25

Solution to Exercise 1.2.3.8 (p. 61)


Figure 1.26

Solution to Exercise 1.2.3.10 (p. 62)


c
Solution to Exercise 1.2.3.11 (p. 63)
d
to Exercise 1.2.4.1 (p. 67)
Eighty percent of students earned 49 points or fewer. Twenty percent of students earned 49 or more points.
A higher percentile is good because getting more points on an assignment is desirable.
to Exercise 1.2.4.2 (p. 71)

Figure 1.27

IQR = 158
Solution to Example 1.33, Problem (p. 71)

a. Min = 32
Q1 = 56
M = 74.5
Q3 = 82.5
Max = 99
b. Min = 25.5
Q1 = 78


M = 81
Q3 = 89
Max = 98
c. Day class: There are six data values ranging from 32 to 56: 30%. There are six data values ranging
from 56 to 74.5: 30%. There are five data values ranging from 74.5 to 82.5: 25%. There are five data
values ranging from 82.5 to 99: 25%. There are 16 data values between the first quartile, 56, and the
largest value, 99: 75%. Night class:

d.

Figure 1.28

e. The first data set has the wider spread for the middle 50% of the data. The IQR for the first data set
is greater than the IQR for the second set. This means that there is more variability in the middle
50% of the first data set.
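The five-number summaries in parts (a) and (b) can be reproduced with the sketch below. It assumes the "median of each half" method for quartiles (for an odd number of values, the median is left out of both halves), which matches the values in this solution; other quartile conventions can give slightly different results. The function name `five_number_summary` is illustrative.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # With an odd count, skip the middle value when forming the upper half.
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return data[0], median(lower), median(data), median(upper), data[-1]

day = [99, 56, 78, 55.5, 32, 90, 80, 81, 56, 59,
       45, 77, 84.5, 84, 70, 72, 68, 32, 79, 90]
night = [98, 78, 68, 83, 81, 89, 88, 76, 65, 45, 98,
         90, 80, 84.5, 85, 79, 78, 98, 90, 79, 81, 25.5]

day_summary = five_number_summary(day)
night_summary = five_number_summary(night)

# IQR = Q3 - Q1, the spread of the middle 50% of each class.
day_iqr = day_summary[3] - day_summary[1]
night_iqr = night_summary[3] - night_summary[1]
print(day_summary, day_iqr)
print(night_summary, night_iqr)
```

Comparing `day_iqr` with `night_iqr` gives the same conclusion as part (e): the day class has the wider middle 50%.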
to Exercise 1.2.4.3 (p. 72)

Figure 1.29

IQR for the boys = 4


IQR for the girls = 5
The box plot for the heights of the girls has the wider spread for the middle 50% of the data.
Solution to Exercise 1.2.4.4 (p. 73)
It is better to earn a grade in a high percentile as that means that you have done better on the exam relative
to your classmates.
Solution to Exercise 1.2.4.5 (p. 73)
When waiting in line at the DMV, the 85th percentile would be a long wait time compared to the other
people waiting. 85% of people had shorter wait times than Mina. In this context, Mina would prefer a wait


time corresponding to a lower percentile. 85% of people at the DMV waited 32 minutes or less. 15% of
people at the DMV waited 32 minutes or longer.
Solution to Exercise 1.2.4.6 (p. 73)
The manufacturer and the consumer would be upset. This is a large repair cost for the damages, compared
to the other cars in the sample. INTERPRETATION: 90% of the crash tested cars had damage repair costs
of $1700 or less; only 10% had damage repair costs of $1700 or more.
Solution to Exercise 1.2.4.7 (p. 73)
You can afford 34% of houses. 66% of the houses are too expensive for your budget. INTERPRETATION:
34% of houses cost $240,000 or less. 66% of houses cost $240,000 or more.
Solution to Exercise 1.2.4.8 (p. 73)

Figure 1.30

Solution to Exercise 1.2.4.9 (p. 73)


More than 25% of salespersons sell four cars in a typical week. You can see this concentration in the box
plot because the first quartile is equal to the median. The top 25% and the bottom 25% are spread out
evenly; the whiskers have the same length.
Solution to Exercise 1.2.4.10 (p. 73)

a. The box plot for China suggests that all but one of the people surveyed gave the same answer, either 0 foreign
countries or 5 foreign countries. For example, if 30 people were interviewed in China, either 29 of them have
visited no foreign countries and one of them has visited 5 foreign countries, or 29 of them have visited
5 foreign countries and one of them has visited no foreign countries. The box plot alone does not show which
is the case. In Germany, 50% of those surveyed have visited 8 or fewer countries. Based on the
position of the median, this suggests that there are many people in the survey who have visited eight
countries. This suggests the distribution will have a peak at 8 and will be non-symmetric. In the USA,
50% of those surveyed have visited 2 or fewer countries. As there are no whiskers, this suggests that
25% of the Americans surveyed have visited no foreign countries, which suggests a skew to the right for
the distribution.
b. 25% of Germans surveyed have been to more than 8 foreign countries. It is unclear what the percentage
is for Americans, but it is less than 25%. Therefore, more Germans than Americans surveyed have been to
over eight foreign countries.
c. Germans in the survey have visited far more countries than the Americans and the Chinese in the survey.
China has the least foreign travel.

Solution to Exercise 1.2.4.11 (p. 74)

a. Answers will vary. Possible answer: State University conducted a survey to see how involved its


students are in community service. The box plot shows the number of community service hours logged
by participants over the past year.
b. Because the first and second quartiles are close, the data in this quarter is very similar. There is not
much variation in the values. The data in the third quarter is much more variable, or spread out. This
is clear because the second quartile is so far away from the third quartile.
Solution to Exercise 1.2.4.12 (p. 74)

a. Each box plot is spread out more in the greater values. Each plot is skewed to the right, so the ages
of the top 50% of buyers are more variable than the ages of the lower 50%.
b. The BMW 3 series is most likely to have an outlier. It has the longest whisker.
c. Comparing the median ages, younger people tend to buy the BMW 3 series, while older people tend
to buy the BMW 7 series. However, this is not a rule, because there is so much variability in each data
set.
d. The second quarter has the smallest spread. There seems to be only a three-year difference between
the first quartile and the median.
e. The third quarter has the largest spread. There seems to be approximately a 14-year difference between
the median and the third quartile.
f. IQR ∼ 17 years
g. There is not enough information to tell. Each interval lies within a quarter, so we cannot tell exactly
where the data in that quarter is concentrated.
h. The interval from 31 to 35 years has the fewest data values. Twenty-five percent of the values fall in
the interval 38 to 41, and 25% fall between 41 and 64. Since 25% of values fall between 31 and 38, we
know that fewer than 25% fall between 31 and 35.
Solution to Exercise 1.2.4.13 (p. 75)

a.

Figure 1.31

b. There is one mild low outlier of 8 passengers on a flight.
c. The outlier means that on this flight there were significantly fewer passengers (only 8) than there
are on other similar flights.
d. The IQR is 16.25 (from 33 to 49.25). This means that the middle 50% of the flights carried
between 33 and 49.25 passengers on the AirBus. This gives us a sense of the amount of variation in the number
of passengers.
e. The distance between the median and the third quartile (from 44 to 49.25) is the least (5.25). This
means that these 25% of data values are closely packed together. The distance between the
outlier and the first quartile is the largest (25 passengers), which means that these 25% of the data
values are the most spread out.


f. The five-number summary is: Minimum = 8; First quartile = 33; Median = 44; Third quartile =
49.25; Maximum = 65.
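The quartiles, IQR, and outlier in this solution can be checked with the sketch below. Two things are assumptions rather than statements from the text: the "inclusive" interpolation method of Python's `statistics.quantiles`, which happens to reproduce the quartiles 33 and 49.25 used here, and the common convention that mild outliers lie between 1.5 and 3 IQRs outside the quartiles while extreme outliers lie more than 3 IQRs outside.

```python
from statistics import quantiles

# Passengers per flight, from the exercise.
passengers = [8, 19, 22, 23, 29, 30, 34, 35, 37, 39, 41, 44,
              44, 46, 46, 47, 48, 49, 50, 52, 54, 55, 61, 65]

# Assumed quartile method; it matches the values given in the solution.
q1, med, q3 = quantiles(passengers, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: 1.5 IQR fences for mild, 3 IQR fences for extreme.
mild_low, mild_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
extreme_low, extreme_high = q1 - 3 * iqr, q3 + 3 * iqr

extreme = [x for x in passengers if x < extreme_low or x > extreme_high]
mild = [x for x in passengers
        if (x < mild_low or x > mild_high) and x not in extreme]

five_numbers = (min(passengers), q1, med, q3, max(passengers))
print(five_numbers, iqr, mild, extreme)
```

Under these assumptions the only flagged value is the flight with 8 passengers, and it is classified as a mild low outlier.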

Solution to Exercise 1.2.4.14 (p. 75)

a.

Figure 1.32: This is technically not a histogram as the bars aren't touching, but without the original
data this is the best that I could come up with unless I drew it by hand!

b. 49.7% of the community is under the age of 35.


c. Based on the information in the table, graph (a) most closely represents the data.



Chapter 2

Business Statistics - Module 2 - Probability

2.1 Chapter 3: Probability Topics


2.1.1 Chapter Overview1

Figure 2.1: Meteor showers are rare, but the probability of them occurring can be calculated. (credit:
Navicore/flickr)

By the end of this chapter, the student should be able to:

• Understand and use the terminology of probability.


• Determine whether two events are mutually exclusive and whether two events are independent.
1 This content is available online at <http://cnx.org/content/m62328/1.1/>.


• Calculate probabilities using the addition and multiplication rules.


• Construct and interpret contingency tables and tree diagrams.
• Understand the difference between likely and unlikely events.

It is often necessary to "guess" about the outcome of an event in order to make a decision. Politicians study
polls to guess their likelihood of winning an election. Teachers choose a particular course of study based on
what they think students can comprehend. Doctors choose the treatments needed for various diseases based
on their assessment of likely results. You may have visited a casino where people play games chosen because
of the belief that the likelihood of winning is good. You may have chosen your course of study based on the
probable availability of jobs.
You have, more than likely, used probability. In fact, you probably have an intuitive sense of probability.
Probability deals with the chance of an event occurring. Whenever you weigh the odds of whether or not to
do your homework or to study for an exam, you are using probability. In this chapter, you will learn how to
solve probability problems using a systematic approach.

2.1.2 Introduction to Probability2


Probability is a measure that is associated with how certain we are of outcomes of a particular experiment
or activity. An experiment is a planned operation carried out under controlled conditions. If the result is
not predetermined, then the experiment is said to be a chance experiment. Flipping one fair coin twice is
an example of an experiment.
A result of an experiment is called an outcome. The sample space of an experiment is the set of all
possible outcomes. Three ways to represent a sample space are: to list the possible outcomes, to create a
tree diagram, or to create a Venn diagram. The uppercase letter S is used to denote the sample space. For
example, if you flip one fair coin, S = {H, T} where H = heads and T = tails are the outcomes.
An event is any combination of outcomes. Upper case letters like A and B represent events. For example,
if the experiment is to flip one fair coin, event A might be getting at most one head. The probability of an
event A is written P(A).
The probability of any outcome is the long-term relative frequency of that outcome. Probabilities
are between zero and one, inclusive (that is, zero and one and all numbers between these values). P(A)
= 0 means the event A can never happen. P(A) = 1 means the event A always happens. P(A) = 0.5 means
that event A has a 50% chance of happening. For example, if you flip one fair coin repeatedly (from 20 to
2,000 to 20,000 times) the relative frequency of heads approaches 0.5 (the probability of heads).
Equally likely means that each outcome of an experiment occurs with equal probability. For example,
if you toss a fair, six-sided die, each face (1, 2, 3, 4, 5, or 6) is as likely to occur as any other face. If you
toss a fair coin, a Head (H ) and a Tail (T) are equally likely to occur. If you randomly guess the answer to
a true/false question on an exam, you are equally likely to select a correct answer or an incorrect answer.
To calculate the probability of an event A
when all outcomes in the sample space are equally
likely, count the number of outcomes for event A and divide by the total number of outcomes in the sample
space. For example, if you toss a fair dime and a fair nickel, the sample space is {HH, TH, HT, TT} where
T = tails and H = heads. The sample space has four outcomes. Let A = getting exactly one head. There are two
outcomes that meet this condition, {HT, TH}, so P(A) = 2/4 = 0.5.
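The counting rule above can be sketched in a few lines of Python. This is only an illustration and not part of the original module; the names `sample_space` and `probability` are chosen here for the example, and the rule applies only when all outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

# Sample space for tossing a fair dime and a fair nickel: one letter per coin.
sample_space = ["".join(pair) for pair in product("HT", repeat=2)]

def probability(event, space):
    """P(event) when every outcome in `space` is equally likely."""
    favorable = [outcome for outcome in space if outcome in event]
    return Fraction(len(favorable), len(space))

# Event A: exactly one head.
A = {outcome for outcome in sample_space if outcome.count("H") == 1}
p_A = probability(A, sample_space)
print(sorted(A), p_A)
```

Using `Fraction` keeps the answer exact (2/4 reduces to 1/2) instead of relying on floating-point division.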
Suppose you roll one fair six-sided die, with the numbers {1, 2, 3, 4, 5, 6} on its faces. Let event E =
rolling a number that is at least five. There are two outcomes {5, 6}. P(E) = 2/6. If you were to roll the
die only a few times, you would not be surprised if your observed results did not match the probability. If
you were to roll the die a very large number of times, you would expect that, overall, about 2/6 of the rolls would
result in an outcome of "at least five". You would not expect exactly 2/6. The long-term relative frequency
of obtaining this result would approach the theoretical probability of 2/6 as the number of repetitions grows
larger and larger.
This important characteristic of probability experiments is known as the law of large numbers which
states that as the number of repetitions of an experiment is increased, the relative frequency obtained in
2 This content is available online at <http://cnx.org/content/m62337/1.3/>.

Available for free at Connexions <http://cnx.org/content/col23820/1.2>



the experiment tends to become closer and closer to the theoretical probability. Even though the outcomes
do not happen according to any set pattern or order, overall, the long-term observed relative frequency will
approach the theoretical probability. (The word empirical is often used instead of the word observed.)
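The law of large numbers can be illustrated with a small simulation. The sketch below is not from the original text: it assumes Python's pseudo-random generator is an adequate stand-in for a fair die, and the function name `relative_frequency` and the seed value are arbitrary choices made for the example.

```python
import random

def relative_frequency(num_rolls, seed=0):
    """Simulate `num_rolls` rolls of a fair die; return the share of rolls that are 5 or 6."""
    rng = random.Random(seed)  # seeded only so that repeated runs give the same numbers
    hits = sum(1 for _ in range(num_rolls) if rng.randint(1, 6) >= 5)
    return hits / num_rolls

theoretical = 2 / 6
for n in (20, 2_000, 200_000):
    observed = relative_frequency(n)
    # Print the number of rolls, the observed frequency, and its gap from 2/6.
    print(n, round(observed, 4), round(abs(observed - theoretical), 4))
```

With a small number of rolls the observed frequency can sit well away from 2/6; with a very large number of rolls the gap is expected to be small, although it is rarely exactly zero.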
It is important to realize that in many situations, the outcomes are not equally likely. A coin or die may
be unfair, or biased. Two math professors in Europe had their statistics students test the Belgian one
Euro coin and discovered that in 250 trials, a head was obtained 56% of the time and a tail was obtained
44% of the time. The data seem to show that the coin is not a fair coin; more repetitions would be helpful
to draw a more accurate conclusion about such bias. Some dice may be biased. Look at the dice in a game
you have at home; the spots on each face are usually small holes carved out and then painted to make the
spots visible. Your dice may or may not be biased; it is possible that the outcomes may be affected by the
slight weight differences due to the different numbers of holes in the faces. Gambling casinos make a lot
of money depending on outcomes from rolling dice, so casino dice are made differently to eliminate bias.
Casino dice have flat faces; the holes are completely filled with paint having the same density as the material
that the dice are made out of so that each face is equally likely to occur. Later we will learn techniques to
use to work with probabilities for events that are not equally likely.

A key concept in probability is whether an event is likely or unlikely. A likely event is an event that
has a good chance of happening, while an unlikely event is rare. For example, it is likely to snow in Calgary
in the winter, but it is unlikely to snow in Calgary in the summer (it can happen, but it would be a rare
or strange event). In general, in statistics, unlikely events usually have a probability of less than 1% of
happening. Likely events usually have a probability of greater than 10% of happening. If the probability
of the event is between 1% and 10%, it is up to the statistician or researcher to make a call to determine
whether it is likely or unlikely.
"OR" Event:
An outcome is in the event A OR B if the outcome is in A or is in B or is in both A and B. For example,
let A = {1, 2, 3, 4, 5} and B = {4, 5, 6, 7, 8}. A OR B = {1, 2, 3, 4, 5, 6, 7, 8}. Notice that 4 and 5 are
NOT listed twice.

"AND" Event:
An outcome is in the event A AND B if the outcome is in both A and B at the same time. For example, let
A and B be {1, 2, 3, 4, 5} and {4, 5, 6, 7, 8}, respectively. Then A AND B = {4, 5}.


The complement of event A is denoted A' (read "A prime"). A' consists of all outcomes that are NOT
in A. Notice that P(A) + P(A') = 1. For example, let S = {1, 2, 3, 4, 5, 6} and let A = {1, 2, 3, 4}. Then,
A' = {5, 6}. P(A) = 4/6, P(A') = 2/6, and P(A) + P(A') = 4/6 + 2/6 = 1.
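The OR, AND, and complement events described above correspond to set union, intersection, and difference. The following sketch uses Python's built-in `set` type to reproduce the small examples from the text; the variable names (`a_or_b`, `A2`, and so on) are illustrative only, and the probabilities assume the outcomes in S are equally likely.

```python
from fractions import Fraction

A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7, 8}

a_or_b = A | B     # union: outcomes in A, in B, or in both (4 and 5 appear once)
a_and_b = A & B    # intersection: outcomes in both A and B

# Complement example: S = {1, ..., 6} and the event {1, 2, 3, 4}, called A2 here.
S = {1, 2, 3, 4, 5, 6}
A2 = {1, 2, 3, 4}
A2_complement = S - A2   # everything in S that is NOT in A2

p_A2 = Fraction(len(A2), len(S))
p_A2_complement = Fraction(len(A2_complement), len(S))
print(sorted(a_or_b), sorted(a_and_b), sorted(A2_complement), p_A2 + p_A2_complement)
```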
The conditional probability of A given B is written P(A|B). P(A|B) is the probability that event A
will occur given that the event B has already occurred. A conditional reduces the sample space. We
calculate the probability of A from the reduced sample space B. The formula to calculate P(A|B) is

P(A|B) = P(A AND B) / P(B)

where P(B) is greater than zero.
For example, suppose we toss one fair, six-sided die. The sample space S = {1, 2, 3, 4, 5, 6}. Let A =
face is 2 or 3 and B = face is even (2, 4, 6). To calculate P(A|B), we count the number of outcomes 2 or 3
in the sample space B = {2, 4, 6}. Then we divide that by the number of outcomes B (rather than S).
We get the same result by using the formula. Remember that S has six outcomes.

P(A|B) = P(A AND B) / P(B)
       = [(the number of outcomes that are 2 or 3 and even in S) / 6] / [(the number of outcomes that are even in S) / 6]
       = (1/6) / (3/6) = 1/3
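Both routes to P(A|B) in the die example can be checked with a short computation. This is a minimal sketch assuming equally likely outcomes; the variable names `by_counting` and `by_formula` are introduced here for illustration.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 3}       # face is 2 or 3
B = {2, 4, 6}    # face is even

# Method 1: count inside the reduced sample space B.
by_counting = Fraction(len(A & B), len(B))

# Method 2: P(A AND B) / P(B), with both probabilities taken over the full sample space S.
p_a_and_b = Fraction(len(A & B), len(S))
p_b = Fraction(len(B), len(S))
by_formula = p_a_and_b / p_b

print(by_counting, by_formula)
```

The two values agree because the common denominator (the six outcomes of S) cancels in the formula.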
Odds
The odds of an event present the probability as a ratio of success to failure. This is common in various
gambling formats. Mathematically, the odds of an event can be defined as:

P(A) / (1 − P(A))    (2.1)

where P(A) is the probability of success and 1 − P(A) is the probability of failure. Odds are
always quoted as "numerator to denominator," e.g. 2 to 1. Here the probability of winning is twice that
of losing; thus, the probability of winning is 2/3, or about 0.67. A probability of winning of 0.60 would generate odds in
favor of winning of 3 to 2. While the calculation of odds can be useful in gambling venues in determining
payoff amounts, it is not helpful for understanding probability or statistical theory.
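The conversion between a probability and odds in equation (2.1) can be sketched as follows. The helper names `odds_in_favor` and `probability_from_odds` are hypothetical, introduced only for this example, and exact fractions are used so that 0.60 reduces cleanly to 3 to 2.

```python
from fractions import Fraction

def odds_in_favor(p):
    """Return the odds P(A) / (1 - P(A)) as a (numerator, denominator) pair in lowest terms."""
    p = Fraction(p)
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    ratio = p / (1 - p)
    return ratio.numerator, ratio.denominator

def probability_from_odds(wins, losses):
    """Invert the odds: 'wins to losses' corresponds to wins / (wins + losses)."""
    return Fraction(wins, wins + losses)

print(odds_in_favor(Fraction(3, 5)))   # a winning probability of 0.60
print(probability_from_odds(2, 1))     # odds of 2 to 1
```

Passing `Fraction(3, 5)` rather than the float `0.6` is a deliberate choice: a binary float is only an approximation of 0.6, so the resulting ratio would not reduce to exactly 3 to 2.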
Understanding Terminology and Symbols
It is important to read each problem carefully to think about and understand what the events are.
Understanding the wording is the first very important step in solving probability problems. Reread the problem
several times if necessary. Clearly identify the event of interest. Determine whether there is a condition
stated in the wording that would indicate that the probability is conditional; carefully identify the condition,
if any.
If the sample space is pictured as a region containing the events A and B [figures not reproduced here], then
P(A|B) is found by looking only at the outcomes that involve B, and within B looking at the portion that
also involves A. That portion is the intersection of A and B.



Example 2.1
The sample space S is the whole numbers starting at one and less than 20.

a. S = _____________________________

Let event A = the even numbers and event B = numbers greater than 13.

b. A = _____________________, B = _____________________
c. P(A) = _____________, P(B) = ________________
d. A AND B = ____________________, A OR B = ________________
e. P(A AND B) = _________, P(A OR B) = _____________
f. A' = _____________, P(A') = _____________
g. P(A) + P(A') = ____________
h. P(A|B) = ___________, P(B|A) = _____________; are the probabilities equal?

Solution

a. S = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19}
b. A = {2, 4, 6, 8, 10, 12, 14, 16, 18}, B = {14, 15, 16, 17, 18, 19}
c. P(A) = 9/19, P(B) = 6/19
d. A AND B = {14, 16, 18}, A OR B = {2, 4, 6, 8, 10, 12, 14, 15, 16, 17, 18, 19}
e. P(A AND B) = 3/19, P(A OR B) = 12/19
f. A' = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19}; P(A') = 10/19
g. P(A) + P(A') = 1 (9/19 + 10/19 = 1)
h. P(A|B) = P(A AND B) / P(B) = 3/6, P(B|A) = P(A AND B) / P(A) = 3/9. No, the probabilities are not equal.

Exercise 2.1.2.1 (Solution on p. 186.)


The sample space S is the ordered pairs of two whole numbers, the first from one to
three and the second from one to four (Example: (1, 4)).

a. S = _____________________________

Let event A = the sum is even and event B = the first number is prime.

b. A = _____________________, B = _____________________
c. P(A) = _____________, P(B) = ________________
d. A AND B = ____________________, A OR B = ________________
e. P(A AND B) = _________, P(A OR B) = _____________
f. B' = _____________, P(B') = _____________
g. P(A) + P(A') = ____________
h. P(A|B) = ___________, P(B|A) = _____________; are the probabilities equal?

Example 2.2
A fair, six-sided die is rolled. Describe the sample space S, identify each of the following events
with a subset of S and compute its probability (an outcome is the number of dots that show up).

a. Event T = the outcome is two.


b. Event A = the outcome is an even number.
c. Event B = the outcome is less than four.
d. The complement of A.
e. A GIVEN B
f. B GIVEN A
g. A AND B
h. A OR B
i. A OR B'
j. Event N = the outcome is a prime number.
k. Event I = the outcome is seven.

Solution


a. T = {2}, P(T) = 1/6
b. A = {2, 4, 6}, P(A) = 1/2
c. B = {1, 2, 3}, P(B) = 1/2
d. A' = {1, 3, 5}, P(A') = 1/2
e. A|B = {2}, P(A|B) = 1/3
f. B|A = {2}, P(B|A) = 1/3
g. A AND B = {2}, P(A AND B) = 1/6
h. A OR B = {1, 2, 3, 4, 6}, P(A OR B) = 5/6
i. A OR B' = {2, 4, 5, 6}, P(A OR B') = 2/3
j. N = {2, 3, 5}, P(N) = 1/2
k. A six-sided die does not have seven dots. P(I) = 0.

Example 2.3
Table 2.1 describes the distribution of a random sample S of 100 individuals, organized by gender
and whether they are right- or left-handed.

            Right-handed    Left-handed
Males       43              9
Females     44              4

Table 2.1

Problem
Let's denote the events M = the subject is male, F = the subject is female, R = the subject is
right-handed, L = the subject is left-handed. Compute the following probabilities:

a. P(M )
b. P(F)
c. P(R)
d. P(L)
e. P(M AND R)
f. P(F AND L)
g. P(M OR F)
h. P(M OR R)
i. P(F OR L)
j. P(M')
k. P(R|M )
l. P(F|L)
m. P(L|F)

Solution

a. P(M ) = 0.52
b. P(F) = 0.48
c. P(R) = 0.87
d. P(L) = 0.13
e. P(M AND R) = 0.43
f. P(F AND L) = 0.04
g. P(M OR F) = 1


h. P(M OR R) = 0.96
i. P(F OR L) = 0.57
j. P(M') = 0.48
k. P(R|M ) = 0.8269 (rounded to four decimal places)
l. P(F|L) = 0.3077 (rounded to four decimal places)
m. P(L|F) = 0.0833
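The answers above all come from counting cells of Table 2.1. A minimal sketch of that bookkeeping follows; the dictionary layout and the helper name `p` are assumptions made for this illustration, not notation from the text.

```python
# Counts from Table 2.1, keyed by (gender, handedness); 100 individuals in total.
counts = {
    ("M", "R"): 43, ("M", "L"): 9,
    ("F", "R"): 44, ("F", "L"): 4,
}
total = sum(counts.values())

def p(predicate):
    """Share of the sample whose (gender, hand) cell satisfies `predicate`."""
    return sum(n for cell, n in counts.items() if predicate(*cell)) / total

p_M = p(lambda g, h: g == "M")
p_L = p(lambda g, h: h == "L")
p_M_and_R = p(lambda g, h: g == "M" and h == "R")
p_M_or_R = p(lambda g, h: g == "M" or h == "R")
p_F_and_L = p(lambda g, h: g == "F" and h == "L")

# Conditional probabilities: joint probability divided by the probability of the condition.
p_R_given_M = p_M_and_R / p_M
p_F_given_L = p_F_and_L / p_L

print(round(p_M_or_R, 4), round(p_R_given_M, 4), round(p_F_given_L, 4))
```

Note that the OR probability is obtained by counting every cell that is male or right-handed exactly once, which is why the 43 subjects who are both are not double counted.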

2.1.2.1 References

Countries List by Continent. Worldatlas, 2013. Available online at http://www.worldatlas.com/cntycont.htm (accessed May 2, 2013).

2.1.2.2 Chapter Review

In this module we learned the basic terminology of probability. The set of all possible outcomes of an
experiment is called the sample space. Events are subsets of the sample space, and they are assigned a
probability that is a number between zero and one, inclusive.

2.1.2.3 Formula Review

A and B are events


P(S) = 1 where S is the sample space
0 ≤ P(A) ≤ 1
P(A|B) = P(A ∩ B) / P(B)


2.1.2.4

Exercise 2.1.2.2 (Solution on p. 186.)


In a particular college class, there are male and female students. Some students have long hair
and some students have short hair. Write the symbols for the probabilities of the events for parts
a through j. (Note that you cannot find numerical answers here. You were not given enough
information to find any probability values yet; concentrate on understanding the symbols.)

• Let F be the event that a student is female.


• Let M be the event that a student is male.
• Let S be the event that a student has short hair.
• Let L be the event that a student has long hair.

a. The probability that a student does not have long hair.


b. The probability that a student is male or has short hair.
c. The probability that a student is a female and has long hair.
d. The probability that a student is male, given that the student has long hair.
e. The probability that a student has long hair, given that the student is male.
f. Of all the female students, the probability that a student has short hair.
g. Of all students with long hair, the probability that a student is female.
h. The probability that a student is female or has long hair.
i. The probability that a randomly selected student is a male student with short hair.
j. The probability that a student is female.

Use the following information to answer the next four exercises. A box is filled with several party favors. It
contains 12 hats, 15 noisemakers, ten finger traps, and five bags of confetti.
Let H = the event of getting a hat.
Let N = the event of getting a noisemaker.
Let F = the event of getting a finger trap.
Let C = the event of getting a bag of confetti.
Exercise 2.1.2.3
Find P(H ).
Exercise 2.1.2.4 (Solution on p. 186.)
Find P(N ).
Exercise 2.1.2.5
Find P(F).
Exercise 2.1.2.6 (Solution on p. 186.)
Find P(C).

Use the following information to answer the next six exercises. A jar of 150 jelly beans contains 22 red jelly
beans, 38 yellow, 20 green, 28 purple, 26 blue, and the rest are orange.
Let B = the event of getting a blue jelly bean.
Let G = the event of getting a green jelly bean.
Let O = the event of getting an orange jelly bean.
Let P = the event of getting a purple jelly bean.
Let R = the event of getting a red jelly bean.
Let Y = the event of getting a yellow jelly bean.
Exercise 2.1.2.7
Find P(B).


Exercise 2.1.2.8 (Solution on p. 186.)


Find P(G).
Exercise 2.1.2.9
Find P(P).
Exercise 2.1.2.10 (Solution on p. 186.)
Find P(R).
Exercise 2.1.2.11
Find P(Y ).
Exercise 2.1.2.12 (Solution on p. 186.)
Find P(O).

Use the following information to answer the next six exercises. There are 23 countries in North America, 12
countries in South America, 47 countries in Europe, 44 countries in Asia, 54 countries in Africa, and 14 in
Oceania (Pacific Ocean region).
Let A = the event that a country is in Asia.
Let E = the event that a country is in Europe.
Let F = the event that a country is in Africa.
Let N = the event that a country is in North America.
Let O = the event that a country is in Oceania.
Let S = the event that a country is in South America.
Exercise 2.1.2.13
Find P(A).
Exercise 2.1.2.14 (Solution on p. 186.)
Find P(E).
Exercise 2.1.2.15
Find P(F).
Exercise 2.1.2.16 (Solution on p. 186.)
Find P(N ).
Exercise 2.1.2.17
Find P(O).
Exercise 2.1.2.18 (Solution on p. 186.)
Find P(S).
Exercise 2.1.2.19
What is the probability of drawing a red card in a standard deck of 52 cards?
Exercise 2.1.2.20 (Solution on p. 186.)
What is the probability of drawing a club in a standard deck of 52 cards?
Exercise 2.1.2.21
What is the probability of rolling an even number of dots with a fair, six-sided die numbered one
through six?
Exercise 2.1.2.22 (Solution on p. 186.)
What is the probability of rolling a prime number of dots with a fair, six-sided die numbered one
through six?

Use the following information to answer the next two exercises. You see a game at a local fair. You have to
throw a dart at a color wheel. Each section on the color wheel is equal in area.


Figure 2.2

Let B = the event of landing on blue.


Let R = the event of landing on red.
Let G = the event of landing on green.
Let Y = the event of landing on yellow.
Exercise 2.1.2.23
If you land on Y, you get the biggest prize. Find P(Y ).
Exercise 2.1.2.24 (Solution on p. 186.)
If you land on red, you don't get a prize. What is P(R)?

Use the following information to answer the next ten exercises. On a baseball team, there are infielders and
outfielders. Some players are great hitters, and some players are not great hitters.
Let I = the event that a player is an infielder.
Let O = the event that a player is an outfielder.
Let H = the event that a player is a great hitter.
Let N = the event that a player is not a great hitter.
Exercise 2.1.2.25
Write the symbols for the probability that a player is not an outfielder.
Exercise 2.1.2.26 (Solution on p. 186.)
Write the symbols for the probability that a player is an outfielder or is a great hitter.
Exercise 2.1.2.27
Write the symbols for the probability that a player is an infielder and is not a great hitter.


Exercise 2.1.2.28 (Solution on p. 187.)


Write the symbols for the probability that a player is a great hitter, given that the player is an
infielder.
Exercise 2.1.2.29
Write the symbols for the probability that a player is an infielder, given that the player is a great
hitter.
Exercise 2.1.2.30 (Solution on p. 187.)
Write the symbols for the probability that of all the outfielders, a player is not a great hitter.
Exercise 2.1.2.31
Write the symbols for the probability that of all the great hitters, a player is an outfielder.
Exercise 2.1.2.32 (Solution on p. 187.)
Write the symbols for the probability that a player is an infielder or is not a great hitter.
Exercise 2.1.2.33
Write the symbols for the probability that a player is an outfielder and is a great hitter.
Exercise 2.1.2.34 (Solution on p. 187.)
Write the symbols for the probability that a player is an infielder.
Exercise 2.1.2.35
What is the word for the set of all possible outcomes?
Exercise 2.1.2.36 (Solution on p. 187.)
What is conditional probability?
Exercise 2.1.2.37
A shelf holds 12 books. Eight are fiction and the rest are nonfiction. Each is a different book with
a unique title. The fiction books are numbered one to eight. The nonfiction books are numbered
one to four. Randomly select one book.
Let F = event that book is fiction.
Let N = event that book is nonfiction.
What is the sample space?
Exercise 2.1.2.38 (Solution on p. 187.)
What is the sum of the probabilities of an event and its complement?

Use the following information to answer the next two exercises. You are rolling a fair, six-sided number cube.
Let E = the event that it lands on an even number. Let M = the event that it lands on a multiple of three.
Exercise 2.1.2.39
What does P(E|M ) mean in words?
Exercise 2.1.2.40 (Solution on p. 187.)
What does P(E OR M ) mean in words?

2.1.2.5 Homework

Exercise 2.1.2.41


Figure 2.3

The graph in Figure 2.3 displays the sample sizes and percentages of people in different age and
gender groups who were polled concerning their approval of Mayor Ford's actions in office. The
total number in the sample of all the age groups is 1,045.

a. Define three events in the graph.


b. Describe in words what the entry 40 means.
c. Describe in words the complement of the entry in part b.
d. Describe in words what the entry 30 means.
e. Out of the males and females, what percent are males?
f. Out of the females, what percent disapprove of Mayor Ford?
g. Out of all the age groups, what percent approve of Mayor Ford?
h. Find P(Approve|Male).
i. Out of the age groups, what percent are more than 44 years old?
j. Find P(Approve|Age < 35).

Exercise 2.1.2.42 (Solution on p. 187.)


Explain what is wrong with the following statements. Use complete sentences.

a. If there is a 60% chance of rain on Saturday and a 70% chance of rain on Sunday, then there
is a 130% chance of rain over the weekend.
b. The probability that a baseball player hits a home run is greater than the probability that he
gets a successful hit.

2.1.3 Independent and Mutually Exclusive Events3


Independent and mutually exclusive do not mean the same thing.
3 This content is available online at <http://cnx.org/content/m62329/1.2/>.


2.1.3.1 Independent Events

Two events are independent if the following are true:

• P(A|B) = P(A)
• P(B|A) = P(B)
• P(A AND B) = P(A)P(B)

Two events A and B are independent if the knowledge that one occurred does not affect the chance the
other occurs. For example, the outcomes of two rolls of a fair die are independent events. The outcome of
the first roll does not change the probability for the outcome of the second roll. To show two events are
independent, you need to show only one of the above conditions. If two events are NOT independent, then
we say that they are dependent.
Sampling may be done with replacement or without replacement.

• With replacement: If each member of a population is replaced after it is picked, then that member
has the possibility of being chosen more than once. When sampling is done with replacement, then
events are considered to be independent, meaning the result of the first pick will not change the
probabilities for the second pick.
• Without replacement: When sampling is done without replacement, each member of a population
may be chosen only once. In this case, the probabilities for the second pick are affected by the result
of the first pick. The events are considered to be dependent or not independent.

If it is not known whether A and B are independent or dependent, assume they are dependent until
you can show otherwise.

Example 2.4
You have a fair, well-shuffled deck of 52 cards. It consists of four suits. The suits are clubs,
diamonds, hearts and spades. There are 13 cards in each suit consisting of 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, J (jack), Q (queen), K (king) of that suit.
a. Sampling with replacement:
Suppose you pick three cards with replacement. The first card you pick out of the 52 cards is the
Q of spades. You put this card back, reshuffle the cards and pick a second card from the 52-card
deck. It is the ten of clubs. You put this card back, reshuffle the cards and pick a third card from
the 52-card deck. This time, the card is the Q of spades again. Your picks are {Q of spades, ten of
clubs, Q of spades}. You have picked the Q of spades twice. You pick each card from the 52-card
deck.
b. Sampling without replacement:
Suppose you pick three cards without replacement. The first card you pick out of the 52 cards is
the K of hearts. You put this card aside and pick the second card from the 51 cards remaining in
the deck. It is the three of diamonds. You put this card aside and pick the third card from the
remaining 50 cards in the deck. The third card is the J of spades. Your picks are {K of hearts, three
of diamonds, J of spades}. Because you have picked the cards without replacement, you cannot
pick the same card twice. The probability of picking the three of diamonds is called a conditional
probability because it is conditioned on what was picked first. This is true also of the probability
of picking the J of spades. The probability of picking the J of spades is actually conditioned on
both the previous picks.
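The two sampling schemes in Example 2.4 can be mimicked with Python's standard `random` module. This is a sketch only: the card labels, the `deck` list, and the seed are choices made for the illustration, and the specific cards drawn will differ from those in the example.

```python
import random

ranks = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = [f"{rank} of {suit}" for suit in suits for rank in ranks]

rng = random.Random(0)  # seeded only so that the sketch is repeatable

# With replacement: every pick is made from the full 52-card deck, so a card can repeat.
with_replacement = rng.choices(deck, k=3)

# Without replacement: a picked card is set aside, so the three cards are always distinct.
without_replacement = rng.sample(deck, k=3)

print(with_replacement)
print(without_replacement)
```

`random.choices` draws independently each time, which matches sampling with replacement; `random.sample` never returns the same position of the list twice, which matches sampling without replacement.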

Exercise 2.1.3.1 (Solution on p. 187.)


You have a fair, well-shuffled deck of 52 cards. It consists of four suits. The suits are
clubs, diamonds, hearts and spades. There are 13 cards in each suit consisting of 1, 2, 3,
4, 5, 6, 7, 8, 9, 10, J (jack), Q (queen), K (king) of that suit. Three cards are picked at
random.

a. Suppose you know that the picked cards are Q of spades, K of hearts and Q of spades.
Can you decide if the sampling was with or without replacement?
b. Suppose you know that the picked cards are Q of spades, K of hearts, and J of
spades. Can you decide if the sampling was with or without replacement?

Example 2.5
You have a fair, well-shuffled deck of 52 cards. It consists of four suits. The suits are clubs,
diamonds, hearts, and spades. There are 13 cards in each suit consisting of 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, J (jack), Q (queen), and K (king) of that suit. S = spades, H = Hearts, D = Diamonds, C =
Clubs.

a. Suppose you pick four cards, but do not put any cards back into the deck. Your cards are
QS, 1D, 1C, QD.
b. Suppose you pick four cards and put each card back before you pick the next card. Your
cards are KH, 7D, 6D, KH.

Which of a. or b. did you sample with replacement and which did you sample without replacement?

a. Without replacement; b. With replacement

Exercise 2.1.3.2 (Solution on p. 187.)


You have a fair, well-shuffled deck of 52 cards. It consists of four suits. The suits are
clubs, diamonds, hearts, and spades. There are 13 cards in each suit consisting of 1, 2, 3,
4, 5, 6, 7, 8, 9, 10, J (jack), Q (queen), and K (king) of that suit. S = spades, H = Hearts,
D = Diamonds, C = Clubs. Suppose that you sample four cards without replacement.
Which of the following outcomes are possible? Answer the same question for sampling
with replacement.

a. QS, 1D, 1C, QD
b. KH, 7D, 6D, KH
c. QS, 7D, 6D, KS

2.1.3.2 Mutually Exclusive Events

A and B are mutually exclusive events if they cannot occur at the same time. This means that A and B
do not share any outcomes and P(A AND B) = 0.
For example, suppose the sample space S = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. Let A = {1, 2, 3, 4, 5}, B =
{4, 5, 6, 7, 8}, and C = {7, 9}. A AND B = {4, 5}. P(A AND B) = 2/10, which is not equal to zero. Therefore,
A and B are not mutually exclusive. A and C do not have any numbers in common so P(A AND C) = 0.
Therefore, A and C are mutually exclusive.
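For events given as explicit sets of outcomes, the check just described amounts to asking whether the sets overlap. A minimal sketch, with the function name `mutually_exclusive` chosen here for illustration and every outcome assumed to have positive probability:

```python
def mutually_exclusive(event1, event2):
    """Two events are mutually exclusive when they share no outcomes."""
    return event1.isdisjoint(event2)

A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7, 8}
C = {7, 9}

print(mutually_exclusive(A, B))  # A and B share 4 and 5
print(mutually_exclusive(A, C))  # A and C have nothing in common
```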
If it is not known whether A and B are mutually exclusive, assume they are not until you can show
otherwise. The following examples illustrate these definitions and terms.


Example 2.6
Flip two fair coins. Find the probabilities of the events.

a. Let F = the event of getting at most one tail (zero or one tail).
b. Let G = the event of getting two faces that are the same.
c. Let H = the event of getting a head on the first flip followed by a head or tail on the second
flip.
d. Are F and G mutually exclusive?
e. Let J = the event of getting all tails. Are J and H mutually exclusive?

Solution
Look at the sample space {HH, HT, TH, TT}.

a. Zero (0) or one (1) tails occur when the outcomes HH, TH, HT show up. P(F) = 3/4
b. Two faces are the same if HH or TT show up. P(G) = 2/4
c. A head on the first flip followed by a head or tail on the second flip occurs when HH or HT
show up. P(H) = 2/4
d. F and G share HH so P(F AND G) is not equal to zero (0). F and G are not mutually
exclusive.
e. Getting all tails occurs when tails shows up on both coins (TT). H 's outcomes are HH and
HT.

J and H have nothing in common so P(J AND H ) = 0. J and H are mutually exclusive.

Exercise 2.1.3.3 (Solution on p. 187.)


A box has two balls, one white and one red. We select one ball, put it back in the box,
and select a second ball (sampling with replacement). Find the probability of the following
events:

a. Let F = the event of getting the white ball twice.
b. Let G = the event of getting two balls of different colors.
c. Let H = the event of getting white on the first pick.
d. Are F and G mutually exclusive?
e. Are G and H mutually exclusive?

Example 2.7
Roll one fair, six-sided die. The sample space is {1, 2, 3, 4, 5, 6}. Let event A = a face is odd.
Then A = {1, 3, 5}. Let event B = a face is even. Then B = {2, 4, 6}.

• Find the complement of A, A'. The complement of A, A', is B because A and B together
make up the sample space. P(A) + P(B) = P(A) + P(A') = 1. Also, P(A) = 3/6 and P(B)
= 3/6.
• Let event C = odd faces larger than two. Then C = {3, 5}. Let event D = all even faces
smaller than five. Then D = {2, 4}. P(C AND D) = 0 because you cannot have an odd and
even face at the same time. Therefore, C and D are mutually exclusive events.
• Let event E = all faces less than five. E = {1, 2, 3, 4}.


Problem
Are C and E mutually exclusive events? (Answer yes or no.) Why or why not?

No. C = {3, 5} and E = {1, 2, 3, 4}. P(C AND E) = 1/6. To be mutually exclusive, P(C AND E)
must be zero.

• Find P(C|A). This is a conditional probability. Recall that the event C is {3, 5} and event
A is {1, 3, 5}. To find P(C|A), find the probability of C using the sample space A. You have
reduced the sample space from the original sample space {1, 2, 3, 4, 5, 6} to {1, 3, 5}. So,
P(C|A) = 2/3.

Exercise 2.1.3.4 (Solution on p. 187.)


Let event A = learning Spanish. Let event B = learning German. Then A AND B =
learning Spanish and German. Suppose P(A) = 0.4 and P(B) = 0.2. P(A AND B) =
0.08. Are events A and B independent? Hint: You must show ONE of the following:

• P(A|B) = P(A)
• P(B|A) = P(B)
• P(A AND B) = P(A)P(B)

Example 2.8
Let event G = taking a math class. Let event H = taking a science class. Then, G AND H =
taking a math class and a science class. Suppose P(G) = 0.6, P(H ) = 0.5, and P(G AND H ) =
0.3. Are G and H independent?
To show that G and H are independent, you must show ONE of the following:

• P(G|H ) = P(G)
• P(H |G) = P(H )
• P(G AND H ) = P(G)P(H )

Note: The choice you make depends on the information you have. You could choose any of
the methods here because you have the necessary information.

Problem 1
a. Show that P(G|H ) = P(G).
Solution
P(G|H) = P(G AND H) / P(H) = 0.3 / 0.5 = 0.6 = P(G)

Problem 2
b. Show P(G AND H ) = P(G)P(H ).
Solution
P(G)P(H ) = (0.6)(0.5) = 0.3 = P(G AND H )


Since G and H are independent, knowing that a person is taking a science class does not change the
chance that he or she is taking a math class. If the two events had not been independent (that is,
they are dependent) then knowing that a person is taking a science class would change the chance
he or she is taking math. For practice, show that P(H |G) = P(H ) to show that G and H are
independent events.
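The product test P(G AND H) = P(G)P(H) used in this example is easy to automate. The sketch below is an illustration rather than part of the original module; the function name `independent` and the tolerance are assumptions, and the second call uses made-up numbers purely to show a failing case.

```python
import math

def independent(p_a, p_b, p_a_and_b, tol=1e-9):
    """Check P(A AND B) = P(A)P(B), allowing for floating-point rounding."""
    return math.isclose(p_a_and_b, p_a * p_b, abs_tol=tol)

# Math and science classes from Example 2.8: P(G) = 0.6, P(H) = 0.5, P(G AND H) = 0.3.
print(independent(0.6, 0.5, 0.3))

# Hypothetical values (not from the text) for which the product rule fails.
print(independent(0.75, 0.3, 0.2))
```

A tolerance is used instead of `==` because products of decimal probabilities are stored as binary floats and may differ from the stated value in the last few digits.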

Exercise 2.1.3.5 (Solution on p. 187.)


In a bag, there are six red marbles and four green marbles. The red marbles are marked
with the numbers 1, 2, 3, 4, 5, and 6. The green marbles are marked with the numbers 1,
2, 3, and 4.

• R = a red marble
• G = a green marble
• O = an odd-numbered marble
• The sample space is S = {R1, R2, R3, R4, R5, R6, G1, G2, G3, G4}.

S has ten outcomes. What is P(G AND O)?

Example 2.9
Let event C = taking an English class. Let event D = taking a speech class.
Suppose P(C) = 0.75, P(D) = 0.3, P(C|D) = 0.75 and P(C AND D) = 0.225.
Justify your answers to the following questions numerically.

a. Are C and D independent?


b. Are C and D mutually exclusive?
c. What is P(D|C)?

a. Yes, because P(C|D) = P(C).


b. No, because P(C AND D) is not equal to zero.
c. P(D|C) = P(C AND D) / P(C) = 0.225 / 0.75 = 0.3

Exercise 2.1.3.6 (Solution on p. 187.)


A student goes to the library. Let events B = the student checks out a book and D =
the student checks out a DVD. Suppose that P(B) = 0.40, P(D) = 0.30 and P(B AND
D) = 0.20.

a. Find P(B|D).
b. Find P(D|B).
c. Are B and D independent?
d. Are B and D mutually exclusive?


Example 2.10
In a box there are three red cards and five blue cards. The red cards are marked with the numbers
1, 2, and 3, and the blue cards are marked with the numbers 1, 2, 3, 4, and 5. The cards are
well-shuffled. You reach into the box (you cannot see into it) and draw one card.
Let R = red card is drawn, B = blue card is drawn, E = even-numbered card is drawn.
The sample space S = {R1, R2, R3, B1, B2, B3, B4, B5}. S has eight outcomes.

• P(R) = 3/8. P(B) = 5/8. P(R AND B) = 0. (You cannot draw one card that is both red and
blue.)
• P(E) = 3/8. (There are three even-numbered cards: R2, B2, and B4.)
• P(E|B) = 2/5. (There are five blue cards: B1, B2, B3, B4, and B5. Out of the blue cards,
there are two even cards: B2 and B4.)
• P(B|E) = 2/3. (There are three even-numbered cards: R2, B2, and B4. Out of the
even-numbered cards, two are blue: B2 and B4.)
• The events R and B are mutually exclusive because P(R AND B) = 0.
• Let G = card with a number greater than 3. G = {B4, B5}. P(G) = 2/8. Let H = blue card
numbered between one and four, inclusive. H = {B1, B2, B3, B4}. P(G|H) = 1/4. (The only
card in H that has a number greater than three is B4.) Since 2/8 = 1/4, P(G) = P(G|H), which
means that G and H are independent.
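
Because every card is equally likely, the probabilities in Example 2.10 can be found by counting outcomes. The sketch below is one possible way to do that in Python; the helper `prob` and the tuple encoding of the cards are our own assumptions, not part of the text.

```python
from fractions import Fraction

# Sample space from Example 2.10, encoded as (color, number) pairs
S = {("R", n) for n in (1, 2, 3)} | {("B", n) for n in (1, 2, 3, 4, 5)}

def prob(event, given=None):
    """P(event), or P(event | given), when all outcomes are equally likely."""
    space = S if given is None else given
    return Fraction(len(event & space), len(space))

R = {c for c in S if c[0] == "R"}        # red card
B = {c for c in S if c[0] == "B"}        # blue card
E = {c for c in S if c[1] % 2 == 0}      # even-numbered card
G = {c for c in S if c[1] > 3}           # number greater than 3
H = {c for c in B if 1 <= c[1] <= 4}     # blue card numbered 1 to 4

print(prob(R), prob(B), prob(R & B))     # 3/8 5/8 0
print(prob(E), prob(E, B), prob(B, E))   # 3/8 2/5 2/3
print(prob(G) == prob(G, H))             # True, so G and H are independent
```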

Exercise 2.1.3.7 (Solution on p. 188.)


In a basketball arena,

• 70% of the fans are rooting for the home team.


• 25% of the fans are wearing blue.
• 20% of the fans are wearing blue and are rooting for the away team.
• Of the fans rooting for the away team, 67% are wearing blue.

Let A be the event that a fan is rooting for the away team.
Let B be the event that a fan is wearing blue.
Are the events of rooting for the away team and wearing blue independent? Are they
mutually exclusive?

Example 2.11
In a particular college class, 60% of the students are female. Fifty percent of all students in the
class have long hair. Forty-five percent of the students are female and have long hair. Of the female
students, 75% have long hair. Let F be the event that a student is female. Let L be the event
that a student has long hair. One student is picked randomly. Are the events of being female and
having long hair independent?

• The following probabilities are given in this example:


• P(F) = 0.60; P(L) = 0.50
• P(F AND L) = 0.45
• P(L|F) = 0.75

NOTE: The choice you make depends on the information you have. You could use the first
or last condition on the list for this example. You do not know P(F|L) yet, so you cannot use the
second condition.


Solution 1
Check whether P(F AND L) = P(F)P(L). We are given that P(F AND L) = 0.45, but P(F)P(L) =
(0.60)(0.50) = 0.30. The events of being female and having long hair are not independent because
P(F AND L) does not equal P(F)P(L).
Solution 2
Check whether P(L|F) equals P(L). We are given that P(L|F) = 0.75, but P(L) = 0.50; they are
not equal. The events of being female and having long hair are not independent.
Interpretation of Results
The events of being female and having long hair are not independent; knowing that a student is
female changes the probability that a student has long hair.
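
Both solutions to Example 2.11 can be checked numerically. The following is a minimal sketch using the given values; the variable names are illustrative, and the last check simply confirms that the given numbers are consistent with P(L|F) = P(F AND L) / P(F).

```python
import math

# Values given in Example 2.11
p_f, p_l = 0.60, 0.50   # P(F), P(L)
p_f_and_l = 0.45        # P(F AND L)
p_l_given_f = 0.75      # P(L|F)

# Solution 1: compare P(F AND L) with P(F)P(L)
product = p_f * p_l
independent_by_product = math.isclose(p_f_and_l, product)

# Solution 2: compare P(L|F) with P(L)
independent_by_conditional = math.isclose(p_l_given_f, p_l)

# Consistency check on the given data
consistent = math.isclose(p_f_and_l / p_f, p_l_given_f)

print(round(product, 2), independent_by_product, independent_by_conditional)
```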

Exercise 2.1.3.8 (Solution on p. 188.)


Mark is deciding which route to take to work. His choices are I = the Interstate and F
= Fifth Street.

• P(I) = 0.44 and P(F) = 0.55


• P(I AND F) = 0 because Mark will take only one route to work.

What is the probability of P(I OR F)?

Example 2.12

a. Toss one fair coin (the coin has two sides, H and T). The outcomes are ________. Count
the outcomes. There are ____ outcomes.
b. Toss one fair, six-sided die (the die has 1, 2, 3, 4, 5 or 6 dots on a side). The outcomes are
________________. Count the outcomes. There are ___ outcomes.
c. Multiply the two numbers of outcomes. The answer is _______.
d. If you flip one fair coin and follow it with the toss of one fair, six-sided die, the answer in
part c is the number of outcomes (size of the sample space). What are the outcomes? (Hint:
Two of the outcomes are H1 and T6.)
e. Event A = heads (H ) on the coin followed by an even number (2, 4, 6) on the die.
A = {_________________}. Find P(A).
f. Event B = heads on the coin followed by a three on the die. B = {________}. Find
P(B).
g. Are A and B mutually exclusive? (Hint: What is P(A AND B)? If P(A AND B) = 0, then
A and B are mutually exclusive.)
h. Are A and B independent? (Hint: Is P(A AND B) = P(A)P(B)? If P(A AND B) =
P(A)P(B), then A and B are independent. If not, then they are dependent).

a. H and T; 2
b. 1, 2, 3, 4, 5, 6; 6
c. 2(6) = 12
d. T1, T2, T3, T4, T5, T6, H1, H2, H3, H4, H5, H6
e. A = {H2, H4, H6}; P(A) = 3/12
f. B = {H3}; P(B) = 1/12
g. Yes, because P(A AND B) = 0
h. P(A AND B) = 0. P(A)P(B) = (3/12)(1/12). P(A AND B) does not equal P(A)P(B), so A and
B are dependent.
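
The sample space in Example 2.12 can also be built by machine. The sketch below uses `itertools.product` to pair each coin face with each die face; the set encoding and the helper `p` are our own choices for illustration.

```python
from fractions import Fraction
from itertools import product

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]
S = set(product(coin, die))   # every (coin, die) pair

A = {(c, d) for (c, d) in S if c == "H" and d % 2 == 0}  # heads, then even
B = {("H", 3)}                                           # heads, then a three

def p(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event), len(S))

print(len(S))                    # 12
print(p(A), p(B), p(A & B))      # 1/4 1/12 0
print(p(A & B) == p(A) * p(B))   # False: A and B are dependent
```

Note that `Fraction` reduces 3/12 to 1/4 automatically.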


Exercise 2.1.3.9 (Solution on p. 188.)


A box has two balls, one white and one red. We select one ball, put it back in the box,
and select a second ball (sampling with replacement). Let T be the event of getting the
white ball twice, F the event of picking the white ball first, S the event of picking the
white ball in the second drawing.

a.Compute P(T).
b.Compute P(T|F).
c.Are T and F independent?
d.Are F and S mutually exclusive?
e.Are F and S independent?

2.1.3.3 References

Lopez, Shane, Preety Sidhu. U.S. Teachers Love Their Lives, but Struggle in the Workplace. Gallup
Wellbeing, 2013. http://www.gallup.com/poll/161516/teachers-love-lives-struggle-workplace.aspx (accessed
May 2, 2013).
Data from Gallup. Available online at www.gallup.com/ (accessed May 2, 2013).

2.1.3.4 Chapter Review

Two events A and B are independent if the knowledge that one occurred does not affect the chance the other
occurs. If two events are not independent, then we say that they are dependent.
In sampling with replacement, each member of a population is replaced after it is picked, so that member
has the possibility of being chosen more than once, and the events are considered to be independent. In
sampling without replacement, each member of a population may be chosen only once, and the events are
considered not to be independent. When events do not share outcomes, they are mutually exclusive of each
other.

2.1.3.5 Formula Review

If A and B are independent, P(A ∩ B) = P(A)P(B), P(A|B) = P(A) and P(B|A) = P(B).
If A and B are mutually exclusive, P(A ∪ B) = P(A) + P(B) and P(A AND B) = 0.


2.1.3.6

Exercise 2.1.3.10
E and F are mutually exclusive events. P(E) = 0.4; P(F) = 0.5. Find P(E |F).
Exercise 2.1.3.11 (Solution on p. 188.)
J and K are independent events. P(J|K) = 0.3. Find P(J).
Exercise 2.1.3.12
U and V are mutually exclusive events. P(U ) = 0.26; P(V ) = 0.37. Find:

a. P(U AND V ) =
b. P(U |V ) =
c. P(U OR V ) =

Exercise 2.1.3.13 (Solution on p. 188.)


Q and R are independent events. P(Q) = 0.4 and P(Q AND R) = 0.1. Find P(R).

2.1.3.7 Homework

Use the following information to answer the next 12 exercises. The graph shown is based on more than
170,000 interviews done by Gallup that took place from January through December 2012. The sample
consists of employed Americans 18 years of age or older. The Emotional Health Index Scores are the sample
space. We randomly sample one Emotional Health Index Score.


Figure 2.4

Exercise 2.1.3.14
Find the probability that an Emotional Health Index Score is 82.7.
Exercise 2.1.3.15 (Solution on p. 188.)
Find the probability that an Emotional Health Index Score is 81.0.
Exercise 2.1.3.16
Find the probability that an Emotional Health Index Score is more than 81.
Exercise 2.1.3.17 (Solution on p. 188.)
Find the probability that an Emotional Health Index Score is between 80.5 and 82.
Exercise 2.1.3.18
If we know an Emotional Health Index Score is 81.5 or more, what is the probability that it is
82.7?
Exercise 2.1.3.19 (Solution on p. 188.)
What is the probability that an Emotional Health Index Score is 80.7 or 82.7?
Exercise 2.1.3.20
What is the probability that an Emotional Health Index Score is less than 80.2 given that it is
already less than 81?
Exercise 2.1.3.21 (Solution on p. 188.)
What occupation has the highest emotional index score?


Exercise 2.1.3.22
What occupation has the lowest emotional index score?
Exercise 2.1.3.23 (Solution on p. 188.)
What is the range of the data?
Exercise 2.1.3.24
Compute the average EHIS.
Exercise 2.1.3.25 (Solution on p. 188.)
If all occupations are equally likely for a certain individual, what is the probability that he or she
will have an occupation with lower than average EHIS?

2.1.3.8 Bringing It Together

Exercise 2.1.3.26
A previous year, the weights of the members of the San Francisco 49ers and the Dallas Cowboys
were published in the San Jose Mercury News. The factual data are compiled into Table 2.2.

Shirt#    ≤ 210    211–250    251–290    290≤

1–33       21         5          0         0
34–66       6        18          7         4
66–99       6        12         22         5

Table 2.2

For the following, suppose that you randomly select one player from the 49ers or Cowboys.
If having a shirt number from one to 33 and weighing at most 210 pounds were independent
events, then what should be true about P(Shirt# 1–33 | ≤ 210 pounds)?
Exercise 2.1.3.27 (Solution on p. 188.)
The probability that a male develops some form of cancer in his lifetime is 0.4567. The probability
that a male has at least one false positive test result (meaning the test comes back positive for cancer
when the man does not have it) is 0.51. Some of the following questions do not have enough information
for you to answer them. Write "not enough information" for those answers. Let C = a man develops
cancer in his lifetime and P = man has at least one false positive.
a. P(C) = ______
b. P(P|C) = ______
c. P(P|C') = ______
d. If a test comes up positive, based upon numerical values, can you assume that man has cancer?
Justify numerically and explain why or why not.

Exercise 2.1.3.28
Given events G and H : P(G) = 0.43; P(H ) = 0.26; P(H AND G) = 0.14

a. Find P(H OR G).


b. Find the probability of the complement of event (H AND G).
c. Find the probability of the complement of event (H OR G).

Exercise 2.1.3.29 (Solution on p. 188.)


Given events J and K: P(J) = 0.18; P(K) = 0.37; P(J OR K) = 0.45


a. Find P(J AND K).


b. Find the probability of the complement of event (J AND K).
c. Find the probability of the complement of event (J OR K).

2.1.4 Two Basic Rules of Probability4


When calculating probability, there are two rules to consider when determining if two events are independent
or dependent and if they are mutually exclusive or not.

2.1.4.1 The Multiplication Rule

If A and B are two events defined on a sample space, then: P(A ∩ B) = P(B)P(A|B). We can think of
the intersection symbol as substituting for the word "and".
This rule may also be written as: P(A|B) = P(A ∩ B) / P(B), provided P(B) is not zero.
This equation is read as the probability of A given B equals the probability of A and B divided by the
probability of B.
If A and B are independent, then P(A|B) = P(A). In that case, P(A ∩ B) = P(A|B)P(B) becomes
P(A ∩ B) = P(A)P(B).
One easy way to remember the multiplication rule is that the word "and" means that the event has to
satisfy two conditions. For example, the name drawn from the class roster is to be both a female and a
sophomore. It is harder to satisfy two conditions than only one, and multiplying two probabilities, each
between 0 and 1, gives a product that is never larger than either factor. This reflects the increasing
difficulty of satisfying two conditions.
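
To see the multiplication rule at work, consider a small class roster. The roster below is hypothetical data invented purely for illustration; it does not come from the text. The sketch computes P(A ∩ B) by direct counting and again with the rule P(A ∩ B) = P(B)P(A|B), where A = female and B = sophomore.

```python
from fractions import Fraction

# Hypothetical class roster (illustrative data only, not from the text)
roster = [
    ("female", "sophomore"), ("female", "sophomore"), ("female", "junior"),
    ("male", "sophomore"), ("male", "junior"), ("male", "junior"),
    ("female", "freshman"), ("male", "freshman"),
]

n = len(roster)
female = [s for s in roster if s[0] == "female"]
soph = [s for s in roster if s[1] == "sophomore"]
both = [s for s in roster if s == ("female", "sophomore")]

p_female = Fraction(len(female), n)                     # P(A)
p_soph = Fraction(len(soph), n)                         # P(B)
p_female_given_soph = Fraction(len(both), len(soph))    # P(A|B)
p_both_direct = Fraction(len(both), n)                  # P(A ∩ B) by counting

# Multiplication rule: P(A ∩ B) = P(B) P(A|B)
p_both_rule = p_soph * p_female_given_soph

print(p_both_direct, p_both_rule)                # 1/4 1/4
print(p_both_direct <= min(p_female, p_soph))    # True
```

The last line illustrates that satisfying both conditions is never more likely than satisfying either one alone.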

2.1.4.2 The Addition Rule

If A and B are defined on a sample space, then: P(A ∪ B) = P(A) + P(B) − P(A ∩ B). We can think of
the union symbol substituting for the word "or". The reason we subtract the intersection of A and B is to
keep from double counting elements that are in both A and B.
If A and B are mutually exclusive, then P(A ∩ B) = 0. Then P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
becomes P(A ∪ B) = P(A) + P(B).
Example 2.13
Klaus is trying to choose where to go on vacation. His two choices are: A = New Zealand and B
= Alaska

• Klaus can only afford one vacation. The probability that he chooses A is P(A) = 0.6 and the
probability that he chooses B is P(B) = 0.35.
• P(A ∩ B) = 0 because Klaus can only afford to take one vacation.
• Therefore, the probability that he chooses either New Zealand or Alaska is P (A ∪ B) =
P (A) + P (B) = 0.6 + 0.35 = 0.95. Note that the probability that he does not choose to go
anywhere on vacation must be 0.05.
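
The addition rule is easy to wrap in a small function. The helper name `p_union` below is our own, and its third argument defaults to zero, which corresponds to the mutually exclusive case in Example 2.13.

```python
def p_union(p_a, p_b, p_a_and_b=0.0):
    """Addition rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)."""
    return p_a + p_b - p_a_and_b

# Example 2.13: New Zealand (A) or Alaska (B), mutually exclusive
p_either = p_union(0.6, 0.35)
p_neither = 1 - p_either   # complement: Klaus takes no vacation

print(round(p_either, 2), round(p_neither, 2))   # 0.95 0.05
```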

Example 2.14
Carlos plays college soccer. He makes a goal 65% of the time he shoots. Carlos is going to attempt
two goals in a row in the next game. A = the event Carlos is successful on his first attempt. P(A)
= 0.65. B = the event Carlos is successful on his second attempt. P(B) = 0.65. Carlos tends to
shoot in streaks. The probability that he makes the second goal given that he made the first goal is 0.90.

4 This content is available online at <http://cnx.org/content/m54220/1.11/>.


Problem 1
a. What is the probability that he makes both goals?
Solution
a. The problem is asking you to find P(A ∩ B) = P(B ∩ A). Since P(B|A) = 0.90: P(B ∩ A) =
P(B|A)P(A) = (0.90)(0.65) = 0.585
Carlos makes the first and second goals with probability 0.585.

Problem 2
b. What is the probability that Carlos makes either the first goal or the second goal?
Solution
b. The problem is asking you to find P(A ∪ B).
P(A ∪ B) = P(A) + P(B) - P(A ∩ B) = 0.65 + 0.65 - 0.585 = 0.715
Carlos makes either the first goal or the second goal with probability 0.715.

Problem 3
c. Are A and B independent?
Solution
c. No, they are not, because P(B ∩ A) = 0.585.
P(B)P(A) = (0.65)(0.65) = 0.423
0.423 ≠ 0.585 = P(B ∩ A)
So, P(B ∩ A) is not equal to P(B)P(A).

Problem 4
d. Are A and B mutually exclusive?
Solution
d. No, they are not because P(A ∩ B) = 0.585.
To be mutually exclusive, P(A ∩ B) must equal zero.
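
The four parts of Example 2.14 chain together: the multiplication rule gives P(A ∩ B), which then feeds the addition rule and both tests. A minimal sketch with the example's values (variable names are illustrative):

```python
import math

# Values given in Example 2.14
p_a = 0.65           # P(A): makes the first goal
p_b = 0.65           # P(B): makes the second goal
p_b_given_a = 0.90   # P(B|A)

p_both = p_b_given_a * p_a        # a. multiplication rule
p_either = p_a + p_b - p_both     # b. addition rule

independent = math.isclose(p_both, p_a * p_b)                  # c.
mutually_exclusive = math.isclose(p_both, 0.0, abs_tol=1e-12)  # d.

print(round(p_both, 3), round(p_either, 3), independent, mutually_exclusive)
```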

Exercise 2.1.4.1 (Solution on p. 188.)


Helen plays basketball. For free throws, she makes the shot 75% of the time. Helen must
now attempt two free throws. C = the event that Helen makes the first shot. P(C) =
0.75. D = the event Helen makes the second shot. P(D) = 0.75. The probability that
Helen makes the second free throw given that she made the first is 0.85. What is the
probability that Helen makes both free throws?


Example 2.15
A community swim team has 150 members. Seventy-five of the members are advanced swimmers.
Forty-seven of the members are intermediate swimmers. The remainder are novice swimmers.
Forty of the advanced swimmers practice four times a week. Thirty of the intermediate swimmers
practice four times a week. Ten of the novice swimmers practice four times a week. Suppose one
member of the swim team is chosen randomly.

Problem 1
a. What is the probability that the member is a novice swimmer?
Solution
a. 28/150

Problem 2
b. What is the probability that the member practices four times a week?
Solution
b. 80/150

Problem 3
c. What is the probability that the member is an advanced swimmer and practices four times a
week?
Solution
c. 40/150

Problem 4
d. What is the probability that a member is an advanced swimmer and an intermediate swimmer?
Are being an advanced swimmer and an intermediate swimmer mutually exclusive? Why or why
not?
Solution
d. P(advanced ∩ intermediate) = 0, so these are mutually exclusive events. A swimmer cannot
be an advanced swimmer and an intermediate swimmer at the same time.

Problem 5
e. Are being a novice swimmer and practicing four times a week independent events? Why or why
not?


Solution
e. No, these are not independent events.
P(novice ∩ practices four times per week) = 10/150 = 0.0667
P(novice)P(practices four times per week) = (28/150)(80/150) = 0.0996
0.0667 ≠ 0.0996
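
The counts in Example 2.15 can be organized in two small dictionaries, from which every answer follows. This is a sketch only; the dictionary layout and names are our own assumptions.

```python
from fractions import Fraction

total = 150
level_counts = {"advanced": 75, "intermediate": 47}
# The remainder are novice swimmers
level_counts["novice"] = total - sum(level_counts.values())

# Members at each level who practice four times a week
four_times = {"advanced": 40, "intermediate": 30, "novice": 10}

p_novice = Fraction(level_counts["novice"], total)           # a.
p_four = Fraction(sum(four_times.values()), total)           # b.
p_adv_and_four = Fraction(four_times["advanced"], total)     # c.
p_novice_and_four = Fraction(four_times["novice"], total)    # e.

print(p_novice, p_four, p_adv_and_four)            # 14/75 8/15 4/15
print(round(float(p_novice_and_four), 4))          # 0.0667
print(round(float(p_novice * p_four), 4))          # 0.0996
```

The fractions print in lowest terms, so 28/150 appears as 14/75, 80/150 as 8/15, and 40/150 as 4/15.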

Exercise 2.1.4.2 (Solution on p. 188.)


A school has 200 seniors of whom 140 will be going to college next year. Forty will be
going directly to work. The remainder are taking a gap year. Fifty of the seniors going to
college play sports. Thirty of the seniors going directly to work play sports. Five of the
seniors taking a gap year play sports. What is the probability that a senior is taking a
gap year?

Example 2.16
Felicity attends Modesto JC in Modesto, CA. The probability that Felicity enrolls in a math class
is 0.2 and the probability that she enrolls in a speech class is 0.65. The probability that she enrolls
in a math class given that she enrolls in a speech class is 0.25.
Let: M = math class, S = speech class, M|S = math given speech
Problem

a. What is the probability that Felicity enrolls in math and speech?


Find P(M ∩ S) = P(M |S)P(S).
b. What is the probability that Felicity enrolls in math or speech classes?
Find P(M ∪ S) = P(M ) + P(S) - P(M ∩ S).
c. Are M and S independent? Is P(M |S) = P(M )?
d. Are M and S mutually exclusive? Is P(M ∩ S) = 0?

Solution
a. 0.1625, b. 0.6875, c. No, d. No

Exercise 2.1.4.3 (Solution on p. 189.)


A student goes to the library. Let events B = the student checks out a book and D = the
student checks out a DVD. Suppose that P(B) = 0.40, P(D) = 0.30 and P(D|B) = 0.5.

a.Find P(B ∩ D).


b.Find P(B ∪ D).

Example 2.17
Studies show that about one woman in seven (approximately 14.3%) who live to be 90 will develop
breast cancer. Suppose that of those women who develop breast cancer, a test is negative 2%
of the time. Also suppose that in the general population of women, the test for breast cancer
is negative about 85% of the time. Let B = woman develops breast cancer and let N = tests
negative. Suppose one woman is selected at random.

Problem 1
a. What is the probability that the woman develops breast cancer? What is the probability that
woman tests negative?
Solution
a. P(B) = 0.143; P(N ) = 0.85

Problem 2
b. Given that the woman has breast cancer, what is the probability that she tests negative?
Solution
b. P(N |B) = 0.02

Problem 3
c. What is the probability that the woman has breast cancer AND tests negative?
Solution
c. P(B ∩ N ) = P(B)P(N |B) = (0.143)(0.02) = 0.0029

Problem 4
d. What is the probability that the woman has breast cancer or tests negative?
Solution
d. P(B ∪ N ) = P(B) + P(N ) - P(B ∩ N ) = 0.143 + 0.85 - 0.0029 = 0.9901

Problem 5
e. Are having breast cancer and testing negative independent events?
Solution
e. No. P(N ) = 0.85; P(N |B) = 0.02. So, P(N |B) does not equal P(N ).

Problem 6
f. Are having breast cancer and testing negative mutually exclusive?
Solution
f. No. P(B ∩ N ) = 0.0029. For B and N to be mutually exclusive, P(B ∩ N ) must be zero.


Exercise 2.1.4.4 (Solution on p. 189.)


A school has 200 seniors of whom 140 will be going to college next year. Forty will be
going directly to work. The remainder are taking a gap year. Fifty of the seniors going to
college play sports. Thirty of the seniors going directly to work play sports. Five of the
seniors taking a gap year play sports. What is the probability that a senior is going to
college and plays sports?

Example 2.18
Refer to the information in Example 2.17. P = tests positive.

a. Given that a woman develops breast cancer, what is the probability that she tests positive?
Find P(P|B) = 1 - P(N|B).
b. What is the probability that a woman develops breast cancer and tests positive? Find P(B ∩
P) = P(P|B)P(B).
c. What is the probability that a woman does not develop breast cancer? Find P(B′) = 1 -
P(B).
d. What is the probability that a woman tests positive for breast cancer? Find P(P) = 1 - P(N).

Solution
a. 0.98; b. 0.1401; c. 0.857; d. 0.15
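
Each answer in Example 2.18 comes from a complement or from the multiplication rule applied to the values in Example 2.17. A short sketch, with variable names of our own choosing:

```python
# Values given in Example 2.17
p_b = 0.143          # P(B): woman develops breast cancer
p_n = 0.85           # P(N): tests negative
p_n_given_b = 0.02   # P(N|B)

p_p_given_b = 1 - p_n_given_b     # a. complement of N, given B
p_b_and_p = p_p_given_b * p_b     # b. multiplication rule
p_not_b = 1 - p_b                 # c. complement of B
p_p = 1 - p_n                     # d. complement of N

print(round(p_p_given_b, 2), round(p_b_and_p, 4),
      round(p_not_b, 3), round(p_p, 2))   # 0.98 0.1401 0.857 0.15
```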

Exercise 2.1.4.5 (Solution on p. 189.)


A student goes to the library. Let events B = the student checks out a book and D = the
student checks out a DVD. Suppose that P(B) = 0.40, P(D) = 0.30 and P(D |B) = 0.5.

a.Find P(B′).
b.Find P(D ∩ B).
c.Find P(B|D).
d.Find P(D ∩ B′).
e.Find P(D|B′).

2.1.4.3 References

DiCamillo, Mark, Mervin Field. The Field Poll. Field Research Corporation. Available online at
http://www.field.com/fieldpollonline/subscribers/Rls2443.pdf (accessed May 2, 2013).
Rider, David. Ford support plummeting, poll suggests. The Star, September 14, 2011. Available online
at http://www.thestar.com/news/gta/2011/09/14/ford_support_plummeting_poll_suggests.html (accessed
May 2, 2013).
Mayor's Approval Down. News Release by Forum Research Inc. Available online
at http://www.forumresearch.com/forms/News Archives/News Releases/74209_TO_Issues_-_Mayoral_Approval_%28Forum_Research%29%2820130320%29.pdf (accessed May 2, 2013).
Roulette. Wikipedia. Available online at http://en.wikipedia.org/wiki/Roulette (accessed May 2,
2013).
Shin, Hyon B., Robert A. Kominski. Language Use in the United States: 2007. United States Census
Bureau. Available online at http://www.census.gov/hhes/socdemo/language/data/acs/ACS-12.pdf (accessed
May 2, 2013).
Data from the Baseball-Almanac, 2013. Available online at www.baseball-almanac.com (accessed May 2,
2013).
Data from U.S. Census Bureau.
Data from the Wall Street Journal.
Data from The Roper Center: Public Opinion Archives at the University of Connecticut. Available online
at http://www.ropercenter.uconn.edu/ (accessed May 2, 2013).
Data from Field Research Corporation. Available online at www.field.com/fieldpollonline (accessed May
2, 2013).

2.1.4.4 Chapter Review

The multiplication rule and the addition rule are used for computing the probability of A and B, as well
as the probability of A or B for two given events A, B defined on the sample space. In sampling with
replacement each member of a population is replaced after it is picked, so that member has the possibility
of being chosen more than once, and the events are considered to be independent. In sampling without
replacement, each member of a population may be chosen only once, and the events are considered to be
not independent. The events A and B are mutually exclusive events when they do not have any outcomes
in common.

2.1.4.5 Formula Review

The multiplication rule: P(A ∩ B) = P(A|B)P(B)


The addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)


2.1.4.6

Use the following information to answer the next ten exercises. Forty-eight percent of all California
registered voters prefer life in prison without parole over the death penalty for a person convicted of first
degree murder. Among Latino California registered voters, 55% prefer life in prison without parole over the
death penalty for a person convicted of first degree murder. 37.6% of all Californians are Latino.
In this problem, let:

• C = Californians (registered voters) preferring life in prison without parole over the death penalty for
a person convicted of first degree murder.
• L = Latino Californians
Suppose that one Californian is randomly selected.
Exercise 2.1.4.6
Find P(C).
Exercise 2.1.4.7 (Solution on p. 189.)
Find P(L).
Exercise 2.1.4.8
Find P(C |L).
Exercise 2.1.4.9 (Solution on p. 189.)
In words, what is C |L?
Exercise 2.1.4.10
Find P(L ∩ C).
Exercise 2.1.4.11 (Solution on p. 189.)
In words, what is L ∩ C?
Exercise 2.1.4.12
Are L and C independent events? Show why or why not.
Exercise 2.1.4.13 (Solution on p. 189.)
Find P(L ∪ C).
Exercise 2.1.4.14
In words, what is L ∪ C?
Exercise 2.1.4.15 (Solution on p. 189.)
Are L and C mutually exclusive events? Show why or why not.

2.1.4.7 Homework

Exercise 2.1.4.16
On February 28, 2013, a Field Poll Survey reported that 61% of California registered voters approved
of allowing two people of the same gender to marry and have regular marriage laws apply to
them. Among 18 to 39 year olds (California registered voters), the approval rating was 78%.
Six in ten California registered voters said that the upcoming Supreme Court's ruling about the
constitutionality of California's Proposition 8 was either very or somewhat important to them. Out
of those CA registered voters who support same-sex marriage, 75% say the ruling is important to
them.
In this problem, let:

• C = California registered voters who support same-sex marriage.


• B = California registered voters who say the Supreme Court's ruling about the constitution-
ality of California's Proposition 8 is very or somewhat important to them


• A = California registered voters who are 18 to 39 years old.

a. Find P(C).
b. Find P(B).
c. Find P(C |A).
d. Find P(B|C).
e. In words, what is C |A?
f. In words, what is B |C?
g. Find P(C ∩ B).
h. In words, what is C ∩ B?
i. Find P(C ∪ B).
j. Are C and B mutually exclusive events? Show why or why not.

Exercise 2.1.4.17 (Solution on p. 189.)


After Rob Ford, the mayor of Toronto, announced his plans to cut budget costs in late 2011, the
Forum Research polled 1,046 people to measure the mayor's popularity. Everyone polled expressed
either approval or disapproval. These are the results their poll produced:

• In early 2011, 60 percent of the population approved of Mayor Ford's actions in office.
• In mid-2011, 57 percent of the population approved of his actions.
• In late 2011, the percentage of popular approval was measured at 42 percent.

a. What is the sample size for this study?


b. What proportion in the poll disapproved of Mayor Ford, according to the results from late
2011?
c. How many people polled responded that they approved of Mayor Ford in late 2011?
d. What is the probability that a person supported Mayor Ford, based on the data collected in
mid-2011?
e. What is the probability that a person supported Mayor Ford, based on the data collected in
early 2011?

Use the following information to answer the next three exercises. The casino game, roulette, allows the
gambler to bet on the probability of a ball, which spins in the roulette wheel, landing on a particular color,
number, or range of numbers. The table used to place bets contains 38 numbers, and each number is
assigned to a color and a range.


Figure 2.5: (credit: lm8ker/wikibooks)

Exercise 2.1.4.18

a. List the sample space of the 38 possible outcomes in roulette.


b. You bet on red. Find P(red).
c. You bet on -1st 12- (1st Dozen). Find P(-1st 12-).
d. You bet on an even number. Find P(even number).
e. Is getting an odd number the complement of getting an even number? Why?
f. Find two mutually exclusive events.
g. Are the events Even and 1st Dozen independent?

Exercise 2.1.4.19 (Solution on p. 189.)


Compute the probability of winning the following types of bets:

a. Betting on two lines that touch each other on the table as in 1-2-3-4-5-6
b. Betting on three numbers in a line, as in 1-2-3
c. Betting on one number
d. Betting on four numbers that touch each other to form a square, as in 10-11-13-14
e. Betting on two numbers that touch each other on the table, as in 10-11 or 10-13
f. Betting on 0-00-1-2-3
g. Betting on 0-1-2; or 0-00-2; or 00-2-3

Exercise 2.1.4.20
Compute the probability of winning the following types of bets:

a. Betting on a color


b. Betting on one of the dozen groups


c. Betting on the range of numbers from 1 to 18
d. Betting on the range of numbers 19–36
e. Betting on one of the columns
f. Betting on an even or odd number (excluding zero)

Exercise 2.1.4.21 (Solution on p. 189.)


Suppose that you have eight cards. Five are green and three are yellow. The five green cards are
numbered 1, 2, 3, 4, and 5. The three yellow cards are numbered 1, 2, and 3. The cards are well
shuffled. You randomly draw one card.

• G = card drawn is green


• E = card drawn is even-numbered
a. List the sample space.
b. P(G) = _____
c. P(G|E) = _____
d. P(G ∩ E) = _____
e. P(G ∪ E) = _____
f. Are G and E mutually exclusive? Justify your answer numerically.

Exercise 2.1.4.22
Roll two fair dice separately. Each die has six faces.

a. List the sample space.


b. Let A be the event that either a three or four is rolled first, followed by an even number. Find
P(A).
c. Let B be the event that the sum of the two rolls is at most seven. Find P(B).
d. In words, explain what P(A|B) represents. Find P(A|B).
e. Are A and B mutually exclusive events? Explain your answer in one to three complete
sentences, including numerical justification.
f. Are A and B independent events? Explain your answer in one to three complete sentences,
including numerical justification.

Exercise 2.1.4.23 (Solution on p. 190.)


A special deck of cards has ten cards. Four are green, three are blue, and three are red. When a
card is picked, its color is recorded. An experiment consists of first picking a card and then
tossing a coin.

a. List the sample space.


b. Let A be the event that a blue card is picked first, followed by landing a head on the coin
toss. Find P(A).
c. Let B be the event that a red or green is picked, followed by landing a head on the coin toss.
Are the events A and B mutually exclusive? Explain your answer in one to three complete
sentences, including numerical justification.
d. Let C be the event that a red or blue is picked, followed by landing a head on the coin toss.
Are the events A and C mutually exclusive? Explain your answer in one to three complete
sentences, including numerical justification.

Exercise 2.1.4.24
An experiment consists of first rolling a die and then tossing a coin.

a. List the sample space.


b. Let A be the event that either a three or a four is rolled first, followed by landing a head on
the coin toss. Find P(A).
c. Let B be the event that the first and second tosses land on heads. Are the events A and
B mutually exclusive? Explain your answer in one to three complete sentences, including
numerical justification.

Exercise 2.1.4.25 (Solution on p. 190.)


An experiment consists of tossing a nickel, a dime, and a quarter. Of interest is the side the coin
lands on.

a. List the sample space.


b. Let A be the event that there are at least two tails. Find P(A).
c. Let B be the event that the first and second tosses land on heads. Are the events A and
B mutually exclusive? Explain your answer in one to three complete sentences, including
justification.

Exercise 2.1.4.26
Consider the following scenario:
Let P(C) = 0.4.
Let P(D) = 0.5.
Let P(C |D) = 0.6.

a. Find P(C ∩ D).


b. Are C and D mutually exclusive? Why or why not?
c. Are C and D independent events? Why or why not?
d. Find P(C ∪ D).
e. Find P(D |C).

Exercise 2.1.4.27 (Solution on p. 190.)


Y and Z are independent events.

a. Rewrite the basic Addition Rule P(Y ∪ Z) = P(Y) + P(Z) − P(Y ∩ Z) using the information
that Y and Z are independent events.
b. Use the rewritten rule to find P(Z) if P(Y ∪ Z) = 0.71 and P(Y) = 0.42.

Exercise 2.1.4.28
G and H are mutually exclusive events. P(G) = 0.5 and P(H) = 0.3.

a. Explain why the following statement MUST be false: P(H |G) = 0.4.
b. Find P(H ∪ G).
c. Are G and H independent or dependent events? Explain in a complete sentence.

Exercise 2.1.4.29 (Solution on p. 190.)


Approximately 281,000,000 people over age five live in the United States. Of these people,
55,000,000 speak a language other than English at home. Of those who speak another language at
home, 62.3% speak Spanish.
Let: E = speaks English at home; E′ = speaks another language at home; S = speaks Spanish.
Finish each probability statement by matching the correct answer.

Probability Statements    Answers

a. P(E′) =                i. 0.8043
b. P(E) =                 ii. 0.623
c. P(S ∩ E′) =            iii. 0.1957
d. P(S|E′) =              iv. 0.1219

Table 2.3

Exercise 2.1.4.30
In 1994, the U.S. government held a lottery to issue 55,000 Green Cards (permits for non-citizens to
work legally in the U.S.). Renate Deutsch, from Germany, was one of approximately 6.5 million
people who entered this lottery. Let G = won green card.
a. What was Renate's chance of winning a Green Card? Write your answer as a probability
statement.
b. In the summer of 1994, Renate received a letter stating she was one of 110,000 finalists chosen.
Once the finalists were chosen, assuming that each finalist had an equal chance to win, what
was Renate's chance of winning a Green Card? Write your answer as a conditional probability
statement. Let F = was a finalist.
c. Are G and F independent or dependent events? Justify your answer numerically and also
explain why.
d. Are G and F mutually exclusive events? Justify your answer numerically and explain why.

Exercise 2.1.4.31 (Solution on p. 190.)


Three professors at George Washington University did an experiment to determine if economists
are more selfish than other people. They dropped 64 stamped, addressed envelopes with $10 cash
in different classrooms on the George Washington campus. 44% were returned overall. From the
economics classes 56% of the envelopes were returned. From the business, psychology, and history
classes 31% were returned.
Let: R = money returned; E = economics classes; O = other classes
a. Write a probability statement for the overall percent of money returned.
b. Write a probability statement for the percent of money returned out of the economics classes.
c. Write a probability statement for the percent of money returned out of the other classes.
d. Is money being returned independent of the class? Justify your answer numerically and
explain it.
e. Based upon this study, do you think that economists are more selfish than other people?
Explain why or why not. Include numbers to justify your answer.

Exercise 2.1.4.32
The following table of data obtained from www.baseball-almanac.com shows hit information for
four players. Suppose that one hit from the table is randomly selected.

Name Single Double Triple Home Run Total Hits

Babe Ruth 1,517 506 136 714 2,873


Jackie Robinson 1,054 273 54 137 1,518
Ty Cobb 3,603 174 295 114 4,189
Hank Aaron 2,294 624 98 755 3,771
Total 8,471 1,577 583 1,720 12,351

Table 2.4

Are "the hit being made by Hank Aaron" and "the hit being a double" independent events?

a. Yes, because P(hit by Hank Aaron|hit is a double) = P(hit by Hank Aaron)


b. No, because P(hit by Hank Aaron|hit is a double) ≠ P(hit is a double)
c. No, because P(hit is by Hank Aaron|hit is a double) ≠ P(hit by Hank Aaron)
d. Yes, because P(hit is by Hank Aaron|hit is a double) = P(hit is a double)

Exercise 2.1.4.33 (Solution on p. 190.)


United Blood Services is a blood bank that serves more than 500 hospitals in 18 states. According
to their website, a person with type O blood and a negative Rh factor (Rh-) can donate blood to
any person with any blood type. Their data show that 43% of people have type O blood and 15%
of people have Rh- factor; 52% of people have type O or Rh- factor.

a. Find the probability that a person has both type O blood and the Rh- factor.
b. Find the probability that a person does NOT have both type O blood and the Rh- factor.

Exercise 2.1.4.34
At a college, 72% of courses have final exams and 46% of courses require research papers. Suppose
that 32% of courses have a research paper and a final exam. Let F be the event that a course has
a final exam. Let R be the event that a course requires a research paper.

a. Find the probability that a course has a final exam or a research project.
b. Find the probability that a course has NEITHER of these two requirements.

Exercise 2.1.4.35 (Solution on p. 190.)


In a box of assorted cookies, 36% contain chocolate and 12% contain nuts. Of those, 8% contain
both chocolate and nuts. Sean is allergic to both chocolate and nuts.

a. Find the probability that a cookie contains chocolate or nuts (he can't eat it).
b. Find the probability that a cookie does not contain chocolate or nuts (he can eat it).

Exercise 2.1.4.36
A college finds that 10% of students have taken a distance learning class and that 40% of students
are part-time students. Of the part-time students, 20% have taken a distance learning class. Let D
= event that a student takes a distance learning class and E = event that a student is a part-time
student.

a. Find P(D ∩ E).


b. Find P(E |D).
c. Find P(D ∪ E).
d. Using an appropriate test, show whether D and E are independent.
e. Using an appropriate test, show whether D and E are mutually exclusive.


2.1.5 Contingency Tables and Tree Diagrams5


2.1.5.1 Contingency Tables

A contingency table provides a way of portraying data that can facilitate calculating probabilities. The
table helps in determining conditional probabilities quite easily. The table displays sample values in relation
to two different variables that may be dependent or contingent on one another. Later on, we will use
contingency tables again, but in another manner.
Example 2.19
Suppose a study of speeding violations and drivers who use cell phones produced the following
fictional data:

                        Speeding violation    No speeding violation    Total
                        in the last year      in the last year

Cell phone user         25                    280                      305
Not a cell phone user   45                    405                      450
Total                   70                    685                      755

Table 2.5

The total number of people in the sample is 755. The row totals are 305 and 450. The column
totals are 70 and 685. Notice that 305 + 450 = 755 and 70 + 685 = 755.
Calculate the following probabilities using the table.

Problem 1
a. Find P(person is a cell phone user).
Solution
a. (number of cell phone users)/(total number in study) = 305/755

Problem 2
b. Find P(person had no violation in the last year).
Solution
b. (number that had no violation)/(total number in study) = 685/755

Problem 3
c. Find P(person had no violation in the last year AND was a cell phone user).
Solution
c. 280/755

5 This content is available online at <http://cnx.org/content/m62330/1.2/>.


Problem 4
d. Find P(person is a cell phone user OR person had no violation in the last year).
Solution
d. 305/755 + 685/755 − 280/755 = 710/755

Problem 5
e. Find P(person is a cell phone user GIVEN person had a violation in the last year).
Solution
e. 25/70 (The sample space is reduced to the number of persons who had a violation.)

Problem 6
f. Find P(person had no violation last year GIVEN person was not a cell phone user).
Solution
f. 405/450 (The sample space is reduced to the number of persons who were not cell phone users.)
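
The calculations in Example 2.19 can also be checked programmatically. The following is a minimal Python sketch, not part of the original text; the names counts, prob, and cond_prob are illustrative choices. It stores the four cell counts from Table 2.5 and computes each probability as a count divided by the size of the (possibly reduced) sample space.

```python
from fractions import Fraction

# Cell counts from Table 2.5: (row label, column label) -> number of people.
counts = {
    ("cell phone user", "violation"): 25,
    ("cell phone user", "no violation"): 280,
    ("not a cell phone user", "violation"): 45,
    ("not a cell phone user", "no violation"): 405,
}
total = sum(counts.values())  # everyone in the study


def prob(event):
    """P(event): matching count divided by the total number in the study."""
    matching = sum(n for key, n in counts.items() if event(*key))
    return Fraction(matching, total)


def cond_prob(event, given):
    """P(event GIVEN given): the sample space is reduced to the 'given' cells."""
    reduced = sum(n for key, n in counts.items() if given(*key))
    both = sum(n for key, n in counts.items() if given(*key) and event(*key))
    return Fraction(both, reduced)


def is_user(row, col):
    return row == "cell phone user"


def no_violation(row, col):
    return col == "no violation"


p_user = prob(is_user)                                        # Problem 1
p_no_violation = prob(no_violation)                           # Problem 2
p_both = prob(lambda r, c: is_user(r, c) and no_violation(r, c))   # Problem 3
p_either = prob(lambda r, c: is_user(r, c) or no_violation(r, c))  # Problem 4
p_user_given_violation = cond_prob(is_user, lambda r, c: c == "violation")  # Problem 5
p_no_violation_given_nonuser = cond_prob(
    no_violation, lambda r, c: r == "not a cell phone user"
)                                                             # Problem 6
```

Because Fraction reduces automatically, p_user prints as 61/151, which is the same value as 305/755.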

Exercise 2.1.5.1 (Solution on p. 191.)


Table 2.6 shows the number of athletes who stretch before exercising and how many had
injuries within the past year.

Injury in last year No injury in last year Total

Stretches 55 295 350


Does not stretch 231 219 450
Total 286 514 800

Table 2.6

a. What is P(athlete stretches before exercising)?
b. What is P(athlete stretches before exercising|no injury in the last year)?

Example 2.20
Table 2.7: Hiking Area Preference shows a random sample of 100 hikers and the areas of hiking
they prefer.


Hiking Area Preference

Sex The Coastline Near Lakes and Streams On Mountain Peaks Total

Female 18 16 ___ 45
Male ___ ___ 14 55
Total ___ 41 ___ ___

Table 2.7

Problem 1
a. Complete the table.
Solution
a.
Hiking Area Preference

Sex The Coastline Near Lakes and Streams On Mountain Peaks Total

Female 18 16 11 45
Male 16 25 14 55
Total 34 41 25 100

Table 2.8

Problem 2 (Solution on p. 191.)


b. Are the events "being female" and "preferring the coastline" independent events?
Let F = being female and let C = preferring the coastline.

1. Find P(F AND C).


2. Find P(F)P(C)

Are these two numbers the same? If they are, then F and C are independent. If they are not, then
F and C are not independent.
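
The comparison described in Problem 2 can be sketched in a few lines of Python. This is an illustrative check only, using the counts from the completed Table 2.8; the variable names are assumptions made for the example.

```python
from fractions import Fraction

# Counts taken from the completed hiking table (Table 2.8).
total = 100
female_total = 45
coastline_total = 34
female_and_coastline = 18

p_f = Fraction(female_total, total)                 # P(F)
p_c = Fraction(coastline_total, total)              # P(C)
p_f_and_c = Fraction(female_and_coastline, total)   # P(F AND C)

# F and C are independent exactly when P(F AND C) equals P(F)P(C).
product_of_marginals = p_f * p_c
independent = (p_f_and_c == product_of_marginals)

print(float(p_f_and_c), float(product_of_marginals), independent)
```

Exact fractions are used so that the equality test is not affected by floating-point rounding.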
Problem 3 (Solution on p. 191.)
c. Find the probability that a person is male given that the person prefers hiking near lakes and
streams. Let M = being male, and let L = prefers hiking near lakes and streams.

1. What word tells you this is a conditional?


2. Fill in the blanks and calculate the probability: P(___|___) = ___.
3. Is the sample space for this problem all 100 hikers? If not, what is it?

Problem 4 (Solution on p. 191.)


d. Find the probability that a person is female or prefers hiking on mountain peaks. Let F =
being female, and let P = prefers mountain peaks.

1. Find P(F).
2. Find P(P).
3. Find P(F AND P).


4. Find P(F OR P).

Exercise 2.1.5.2 (Solution on p. 191.)


Table 2.9 shows a random sample of 200 cyclists and the routes they prefer. Let M =
males and H = hilly path.

Gender Lake Path Hilly Path Wooded Path Total

Female 45 38 27 110
Male 26 52 12 90
Total 71 90 39 200

Table 2.9

a. Out of the males, what is the probability that the cyclist prefers a hilly path?
b. Are the events being male and preferring the hilly path independent events?

Example 2.21
Muddy Mouse lives in a cage with three doors. If Muddy goes out the first door, the probability
that he gets caught by Alissa the cat is 1/5 and the probability he is not caught is 4/5. If he goes
out the second door, the probability he gets caught by Alissa is 1/4 and the probability he is not
caught is 3/4. The probability that Alissa catches Muddy coming out of the third door is 1/2 and the
probability she does not catch Muddy is 1/2. It is equally likely that Muddy will choose any of the
three doors, so the probability of choosing each door is 1/3.

Door Choice

Caught or Not   Door One   Door Two   Door Three   Total

Caught          1/15       1/12       1/6          ____
Not Caught      4/15       3/12       1/6          ____
Total           ____       ____       ____         1

Table 2.10

• The first entry, 1/15 = (1/5)(1/3), is P(Door One AND Caught).
• The entry 4/15 = (4/5)(1/3) is P(Door One AND Not Caught).

Verify the remaining entries.

Problem 1 (Solution on p. 191.)


a. Complete the probability contingency table. Calculate the entries for the totals. Verify that
the lower-right corner entry is 1.


Problem 2
b. What is the probability that Alissa does not catch Muddy?
Solution
b. 41/60

Problem 3
c. What is the probability that Muddy chooses Door One OR Door Two given that Muddy is
caught by Alissa?
Solution
c. 9/19
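
The entries of the probability contingency table in Example 2.21 can be reproduced with exact fractions. The sketch below is illustrative, not part of the original text, and its dictionary and variable names are assumptions. Each joint probability is the door probability multiplied by the conditional probability of being caught (or not) at that door.

```python
from fractions import Fraction

p_door = Fraction(1, 3)  # each door is equally likely
p_caught_given_door = {
    "one": Fraction(1, 5),
    "two": Fraction(1, 4),
    "three": Fraction(1, 2),
}

# Joint probabilities: P(door AND caught) = P(door) * P(caught | door).
caught = {d: p_door * p for d, p in p_caught_given_door.items()}
not_caught = {d: p_door * (1 - p) for d, p in p_caught_given_door.items()}

# Row totals of the table.
p_caught = sum(caught.values())
p_not_caught = sum(not_caught.values())

# Conditional probability: reduce the sample space to the "Caught" row.
p_one_or_two_given_caught = (caught["one"] + caught["two"]) / p_caught

print(p_caught, p_not_caught, p_one_or_two_given_caught)
```

The two row totals add to 1, which is the lower-right corner entry of the table.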

Example 2.22
Table 2.11: United States Crime Index Rates Per 100,000 Inhabitants 2008–2011 contains the
number of crimes per 100,000 inhabitants from 2008 to 2011 in the U.S.

United States Crime Index Rates Per 100,000 Inhabitants 2008–2011

Year Robbery Burglary Rape Vehicle Total

2008 145.7 732.1 29.7 314.7


2009 133.1 717.7 29.1 259.2
2010 119.3 701 27.7 239.1
2011 113.7 702.2 26.8 229.6
Total

Table 2.11

Problem
TOTAL each column and each row. Total data = 4,520.7

a. Find P(2009 AND Robbery).


b. Find P(2010 AND Burglary).
c. Find P(2010 OR Burglary).
d. Find P(2011|Rape).
e. Find P(Vehicle|2008).

Solution
a. 0.0294, b. 0.1551, c. 0.7165, d. 0.2365, e. 0.2575
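
As a cross-check of Example 2.22, the row totals, column totals, and the five probabilities can be computed directly from the table entries. This is a minimal Python sketch with illustrative names (rates, row_total, col_total); it treats the rates as if they were counts, as the example does.

```python
# Entries of Table 2.11, keyed by year and then by crime type.
rates = {
    2008: {"Robbery": 145.7, "Burglary": 732.1, "Rape": 29.7, "Vehicle": 314.7},
    2009: {"Robbery": 133.1, "Burglary": 717.7, "Rape": 29.1, "Vehicle": 259.2},
    2010: {"Robbery": 119.3, "Burglary": 701.0, "Rape": 27.7, "Vehicle": 239.1},
    2011: {"Robbery": 113.7, "Burglary": 702.2, "Rape": 26.8, "Vehicle": 229.6},
}

row_total = {year: sum(row.values()) for year, row in rates.items()}
col_total = {
    crime: sum(row[crime] for row in rates.values()) for crime in rates[2008]
}
grand_total = sum(row_total.values())

p_a = rates[2009]["Robbery"] / grand_total                    # P(2009 AND Robbery)
p_b = rates[2010]["Burglary"] / grand_total                   # P(2010 AND Burglary)
p_c = (row_total[2010] + col_total["Burglary"]
       - rates[2010]["Burglary"]) / grand_total               # P(2010 OR Burglary)
p_d = rates[2011]["Rape"] / col_total["Rape"]                 # P(2011 | Rape)
p_e = rates[2008]["Vehicle"] / row_total[2008]                # P(Vehicle | 2008)

for label, p in zip("abcde", (p_a, p_b, p_c, p_d, p_e)):
    print(label, round(p, 4))
```

Note that parts d and e divide by a column total and a row total respectively, because the condition reduces the sample space.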


Exercise 2.1.5.3 (Solution on p. 191.)


Table 2.12 relates the weights and heights of a group of individuals participating in an
observational study.

Weight/Height Tall Medium Short Totals

Obese 18 28 14
Normal 20 51 28
Underweight 12 25 9
Totals

Table 2.12

a. Find the total for each row and column.
b. Find the probability that a randomly chosen individual from this group is Tall.
c. Find the probability that a randomly chosen individual from this group is Obese and
Tall.
d. Find the probability that a randomly chosen individual from this group is Tall given
that the individual is Obese.
e. Find the probability that a randomly chosen individual from this group is Obese
given that the individual is Tall.
f. Find the probability that a randomly chosen individual from this group is Tall and
Underweight.
g. Are the events Obese and Tall independent?

2.1.5.2 Tree Diagrams

Sometimes, when the probability problems are complex, it can be helpful to graph the situation. Tree
diagrams can be used to visualize and solve conditional probabilities.

2.1.5.2.1 Tree Diagrams

A tree diagram is a special type of graph used to determine the outcomes of an experiment. It consists of
"branches" that are labeled with either frequencies or probabilities. Tree diagrams can make some probability
problems easier to visualize and solve. The following example illustrates how to use a tree diagram.
Example 2.23
In an urn, there are 11 balls. Three balls are red (R) and eight balls are blue (B). Draw two balls,
one at a time, with replacement. "With replacement" means that you put the first ball back
in the urn before you select the second ball. The tree diagram using frequencies that shows all the
possible outcomes follows.


Figure 2.6: Total = 64 + 24 + 24 + 9 = 121

The first set of branches represents the first draw. The second set of branches represents the
second draw. Each of the outcomes is distinct. In fact, we can list each red ball as R1, R2, and R3
and each blue ball as B1, B2, B3, B4, B5, B6, B7, and B8. Then the nine RR outcomes can be
written as:
R1R1; R1R2; R1R3; R2R1; R2R2; R2R3; R3R1; R3R2; R3R3
The other outcomes are similar.
There are a total of 11 balls in the urn. Draw two balls, one at a time, with replacement. There
are 11(11) = 121 outcomes, the size of the sample space.

Problem 1 (Solution on p. 192.)


a. List the 24 BR outcomes: B1R1, B1R2, B1R3, ...
Problem 2
b. Using the tree diagram, calculate P(RR).
Solution
b. P(RR) = (3/11)(3/11) = 9/121

Problem 3
c. Using the tree diagram, calculate P(RB OR BR).
Solution
c. P(RB OR BR) = (3/11)(8/11) + (8/11)(3/11) = 48/121

Problem 4
d. Using the tree diagram, calculate P(R on 1st draw AND B on 2nd draw).

Solution
d. P(R on 1st draw AND B on 2nd draw) = P(RB) = (3/11)(8/11) = 24/121

Problem 5
e. Using the tree diagram, calculate P(R on 2nd draw GIVEN B on 1st draw).
Solution
e. P(R on 2nd draw GIVEN B on 1st draw) = P(R on 2nd|B on 1st) = 24/88 = 3/11

This problem is a conditional one. The sample space has been reduced to those outcomes that
already have a blue on the first draw. There are 24 + 64 = 88 possible outcomes (24 BR and 64
BB). Twenty-four of the 88 possible outcomes are BR. 24/88 = 3/11.

Problem 6
f. Using the tree diagram, calculate P(BB).
Solution
f. P(BB) = 64/121

Problem 7 (Solution on p. 192.)


g. Using the tree diagram, calculate P(B on the 2nd draw given R on the first draw).
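
The frequencies in the tree diagram for Example 2.23 can be reproduced by listing all 121 ordered outcomes. The sketch below is an illustration only (the helper names color and count are assumptions); it labels the balls R1 to R3 and B1 to B8 as in the example and counts outcomes by color.

```python
from fractions import Fraction
from itertools import product

# Three red and eight blue balls, labeled as in the example.
balls = ["R1", "R2", "R3"] + [f"B{i}" for i in range(1, 9)]

# With replacement: every ordered pair is possible, including repeats.
outcomes = list(product(balls, repeat=2))
n = len(outcomes)  # 11 * 11


def color(ball):
    return ball[0]  # "R" or "B"


def count(first, second):
    """Number of outcomes whose first and second draws have the given colors."""
    return sum(1 for a, b in outcomes if color(a) == first and color(b) == second)


p_rr = Fraction(count("R", "R"), n)
p_rb_or_br = Fraction(count("R", "B") + count("B", "R"), n)
p_rb = Fraction(count("R", "B"), n)
p_bb = Fraction(count("B", "B"), n)

# Conditional: restrict to outcomes with blue on the first draw (BR and BB).
p_r2_given_b1 = Fraction(count("B", "R"), count("B", "R") + count("B", "B"))
```

The four counts (9, 24, 24, and 64) add to 121, matching the total shown in Figure 2.6.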

Exercise 2.1.5.4 (Solution on p. 192.)


In a standard deck, there are 52 cards. 12 cards are face cards (event F) and 40 cards are
not face cards (event N ). Draw two cards, one at a time, with replacement. All possible
outcomes are shown in the tree diagram as frequencies. Using the tree diagram, calculate
P(FF).


Figure 2.7

Example 2.24
An urn has three red marbles and eight blue marbles in it. Draw two marbles, one at a time,
this time without replacement, from the urn. "Without replacement" means that you do not
put the first marble back before you select the second marble. Following is a tree diagram for this
situation. The branches are labeled with probabilities instead of frequencies. The numbers at the
ends of the branches are calculated by multiplying the numbers on the two corresponding branches,
for example, (3/11)(2/10) = 6/110.


Figure 2.8: Total = (56 + 24 + 24 + 6)/110 = 110/110 = 1

note: If you draw a red on the first draw from the three red possibilities, there are two red marbles
left to draw on the second draw. You do not put back or replace the first marble after you have
drawn it. You draw without replacement, so that on the second draw there are ten marbles left
in the urn.
Calculate the following probabilities using the tree diagram.

Problem 1
a. P(RR) = ________
Solution
a. P(RR) = (3/11)(2/10) = 6/110

Problem 2 (Solution on p. 192.)


b. Fill in the blanks:

P(RB OR BR) = (3/11)(8/10) + (___)(___) = 48/110

Problem 3 (Solution on p. 192.)


c. P(R on 2nd|B on 1st) =
Problem 4
d. Fill in the blanks.
P(R on 1st AND B on 2nd) = P(RB) = (___)(___) = 24/110
Solution
d. P(R on 1st AND B on 2nd) = P(RB) = (3/11)(8/10) = 24/110

Problem 5
e. Find P(BB).
Solution
e. P(BB) = (8/11)(7/10) = 56/110

Problem 6
f. Find P(B on 2nd|R on 1st).
Solution
f. Using the tree diagram, P(B on 2nd|R on 1st) = P(B|R) = 8/10.

If we are using probabilities, we can label the tree in the following general way.

• P(R|R) here means P(R on 2nd|R on 1st)


• P(B|R) here means P(B on 2nd|R on 1st)
• P(R|B) here means P(R on 2nd|B on 1st)
• P(B|B) here means P(B on 2nd|B on 1st)
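
The without-replacement probabilities in Example 2.24 can be checked by enumerating every ordered pair of distinct marbles. This is a minimal sketch under the assumption that the marbles are labeled R1 to R3 and B1 to B8; the labels and helper names are illustrative, not from the original text.

```python
from fractions import Fraction
from itertools import permutations

marbles = ["R1", "R2", "R3"] + [f"B{i}" for i in range(1, 9)]

# Without replacement: ordered pairs of two different marbles.
outcomes = list(permutations(marbles, 2))
n = len(outcomes)  # 11 * 10


def count(first, second):
    """Number of outcomes whose first and second draws have the given colors."""
    return sum(1 for a, b in outcomes if a[0] == first and b[0] == second)


p_rr = Fraction(count("R", "R"), n)
p_rb = Fraction(count("R", "B"), n)
p_br = Fraction(count("B", "R"), n)
p_bb = Fraction(count("B", "B"), n)

# P(B on 2nd | R on 1st): restrict to outcomes that start with a red marble.
p_b2_given_r1 = Fraction(count("R", "B"), count("R", "R") + count("R", "B"))
```

The four joint probabilities add to 1, in agreement with the total given in the caption of Figure 2.8.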


Exercise 2.1.5.5 (Solution on p. 192.)


In a standard deck, there are 52 cards. Twelve cards are face cards (F) and 40 cards
are not face cards (N ). Draw two cards, one at a time, without replacement. The tree
diagram is labeled with all possible probabilities.

Figure 2.9

a. Find P(FN OR NF).
b. Find P(N|F).
c. Find P(at most one face card).
Hint: "At most one face card" means zero or one face card.
d. Find P(at least one face card).
Hint: "At least one face card" means one or two face cards.

Example 2.25
A litter of kittens available for adoption at the Humane Society has four tabby kittens and five black
kittens. A family comes in and randomly selects two kittens (without replacement) for adoption.


Problem

a. What is the probability that both kittens are tabby?

a. (1/2)(1/2)  b. (4/9)(4/9)  c. (4/9)(3/8)  d. (4/9)(5/9)

b. What is the probability that one kitten of each coloring is selected?

a. (4/9)(5/9)  b. (4/9)(5/8)  c. (4/9)(5/9) + (5/9)(4/9)  d. (4/9)(5/8) + (5/9)(4/8)

c. What is the probability that a tabby is chosen as the second kitten when a black kitten was
chosen as the first?
d. What is the probability of choosing two kittens of the same color?

Solution
a. c, b. d, c. 4/8, d. 32/72
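
The answers to Example 2.25 follow from multiplying along the branches of a without-replacement tree. The short Python sketch below is an illustrative check; the variable names are assumptions made for the example.

```python
from fractions import Fraction

tabby, black = 4, 5
total = tabby + black  # nine kittens

# After the first kitten is taken, eight remain for the second selection.
p_tt = Fraction(tabby, total) * Fraction(tabby - 1, total - 1)   # both tabby
p_bb = Fraction(black, total) * Fraction(black - 1, total - 1)   # both black
p_one_each = (Fraction(tabby, total) * Fraction(black, total - 1)
              + Fraction(black, total) * Fraction(tabby, total - 1))

# Given a black kitten was chosen first, all four tabbies remain among eight.
p_tabby2_given_black1 = Fraction(tabby, total - 1)

p_same_color = p_tt + p_bb

print(p_tt, p_one_each, p_tabby2_given_black1, p_same_color)
```

The events "same color" and "one of each coloring" are complements, so their probabilities add to 1.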

Exercise 2.1.5.6 (Solution on p. 192.)


Suppose there are four red balls and three yellow balls in a box. Three balls are drawn
from the box without replacement. What is the probability that one ball of each coloring
is selected?

2.1.5.3 References

Blood Types. American Red Cross, 2013. Available online at
http://www.redcrossblood.org/learn-about-blood/blood-types (accessed May 3, 2013).
Data from the National Center for Health Statistics, part of the United States Department of Health and
Human Services.
Data from United States Senate. Available online at www.senate.gov (accessed May 2, 2013).
Human Blood Types. United Blood Services, 2011. Available online at
http://www.unitedbloodservices.org/learnMore.aspx (accessed May 2, 2013).
Haiman, Christopher A., Daniel O. Stram, Lynn R. Wilkens, Malcom C. Pike, Laurence N.
Kolonel, Brien E. Henderson, and Loïc Le Marchand. Ethnic and Racial Differences in the Smoking-
Related Risk of Lung Cancer. The New England Journal of Medicine, 2013. Available online at
http://www.nejm.org/doi/full/10.1056/NEJMoa033250 (accessed May 2, 2013).
Samuel, T. M. Strange Facts about RH Negative Blood. eHow Health, 2013. Available online at
http://www.ehow.com/facts_5552003_strange-rh-negative-blood.html (accessed May 2, 2013).
United States: Uniform Crime Report – State Statistics from 1960–2011. The Disaster Center. Available
online at http://www.disastercenter.com/crime/ (accessed May 2, 2013).
Data from Clara County Public H.D.
Data from the American Cancer Society.
Data from The Data and Story Library, 1996. Available online at http://lib.stat.cmu.edu/DASL/
(accessed May 2, 2013).
Data from the Federal Highway Administration, part of the United States Department of Transportation.
Data from the United States Census Bureau, part of the United States Department of Commerce.
Data from USA Today.
Environment. The World Bank, 2013. Available online at
http://data.worldbank.org/topic/environment (accessed May 2, 2013).
Search for Datasets. Roper Center: Public Opinion Archives, University of Connecticut, 2013. Available
online at http://www.ropercenter.uconn.edu/data_access/data/search_for_datasets.html (accessed
May 2, 2013).

2.1.5.4 Chapter Review

There are several tools you can use to help organize and sort data when calculating probabilities.
Contingency tables help display data and are particularly useful when calculating probabilities that involve
multiple dependent variables.
A tree diagram uses branches to show the different outcomes of experiments and makes complex probability
questions easy to visualize.

2.2 Chapter 4: Binomial Distribution


2.2.1 Introduction to Discrete Probability Distributions6
2.2.1.1 Introduction

2.2.1.1.1 Chapter Objectives

By the end of this chapter, the student should be able to:

• Recognize the binomial probability distribution and apply it appropriately.


• Be able to evaluate evidence using the binomial distribution.

Suppose you flip a coin ten times and each time it comes up heads. This might make you start to wonder if
there is something wrong with the coin. Perhaps it is a trick coin with heads on both sides? Perhaps it is
imbalanced and more likely to come up heads than tails. You may also wonder what the probability
of getting 10 heads in a row would be if the coin were fair.
Coin flipping is interesting because it is a random event. We cannot predict whether the next flip will
be heads or tails (assuming it isn't a trick coin). That means the outcome is a random variable. A
random variable is any variable whose outcome is determined by a random event. The outcome is
also discrete because we count it. Above, you flipped a coin ten times and counted the number of heads. A
discrete random variable is a variable whose outcome is determined by a random event and where we
count the outcomes. Other examples of discrete random variables include how many times you roll an even
number with a die out of ten rolls; how many customers enter a store during a five-minute interval; and how
many times you draw a high card from a deck of cards in eight draws (without replacement).

6 This content is available online at <http://cnx.org/content/m64919/1.1/>.
In each of these situations (tossing a coin, rolling a die, counting customers, drawing cards), you could
come up with a new formula each time to find the probability of these events happening.
But this would take a lot of work and be inefficient. Instead, you would want to see if the situation can be
modelled by a distribution. A probability distribution provides the theoretical probabilities of all of the
possible events in a situation. For example, the following is a probability model of how many heads you can
get when you flip a fair coin three times:

# of heads Probability

0 0.125
1 0.375
2 0.375
3 0.125

Table 2.13

Notice that the probabilities range from 0 to 1 and that the sum of the probabilities is 1.
The above table could be determined by working out all of the possible outcomes (TTT, TTH, THT,
HTT, etc.), then counting how many heads were in each outcome. But again, that is time consuming.
Instead, you want to see if there is a probability distribution that models your situation that you can use.
For example, coin tossing can be modelled by the binomial distribution.
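
To illustrate the "work out every outcome" approach described above, the following Python sketch lists all eight outcomes of three flips of a fair coin and tallies the number of heads in each. It is an illustration only; the variable names are assumptions, and it relies on the eight outcomes being equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All ordered outcomes of three flips: ('H','H','H'), ('H','H','T'), ...
outcomes = list(product("HT", repeat=3))

# Tally how many outcomes contain 0, 1, 2, or 3 heads.
heads_tally = Counter(outcome.count("H") for outcome in outcomes)

# Each outcome is equally likely for a fair coin, so divide by the total.
distribution = {
    heads: Fraction(tally, len(outcomes))
    for heads, tally in sorted(heads_tally.items())
}

for heads, p in distribution.items():
    print(heads, float(p))
```

The printed values can be compared with Table 2.13, and the probabilities add to 1.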
The binomial distribution is an example of a model for discrete random variables. There are many other
models for discrete random variables including Poisson, geometric, hypergeometric, and discrete uniform to
name a few. Each distribution comes with a set of criteria, and if a situation fits those criteria, then the
distribution can model it. That is, the distribution can produce theoretical probabilities for that situation.

important: In simplest terms, a theoretical probability is determined by using a formula, while an
experimental probability is found by actually carrying out the experiment. For example, if you flip a coin 3 times
and get 2 heads, then the experimental probability of heads is 2/3 = 0.6667, whereas the theoretical
probability of heads for a fair coin is 0.5. The theoretical probability is also called the long-run probability,
because the more times you repeat the experiment, the closer the experimental results will get to the theoretical
probability. This is an example of the law of large numbers.
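
The law of large numbers can be illustrated with a small simulation. The sketch below is not part of the original text: it repeats the three-flip experiment many times and records the fraction of experiments that produce exactly two heads, which can be compared with the theoretical value 0.375 from Table 2.13. The function name and the fixed seed are assumptions made so the run is repeatable.

```python
import random


def experimental_probability(num_experiments, seed=0):
    """Fraction of three-flip experiments that give exactly two heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(num_experiments):
        # A draw below 0.5 counts as heads for a fair coin.
        heads = sum(rng.random() < 0.5 for _ in range(3))
        if heads == 2:
            hits += 1
    return hits / num_experiments


for n in (10, 100, 10_000, 100_000):
    print(n, experimental_probability(n))
```

With only a handful of experiments the estimate can be far from 0.375; as the number of experiments grows it tends to settle near the theoretical value.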

In this chapter, we are going to learn about the binomial distribution, which is a model for discrete random
variables. In the next chapter, we will learn about the normal distribution, which is a model for continuous
random variables.
In particular, we want to use the binomial distribution to evaluate evidence. For example, going back to
the example of flipping a coin ten times and getting ten heads, we want to use the evidence (getting ten heads)
to determine whether we think there is something wrong with the coin.

2.2.2 The Binomial Distribution7


2.2.2.1 Binomial distribution

Flipping a coin a certain number of times, let's say ten times, is a classic example of a binomial distribution.
What are the characteristics of flipping a coin that make it binomial?
7 This content is available online at <http://cnx.org/content/m64918/1.4/>.


Before we answer that question, let's get a bit of terminology out of the way. In probability theory, an
experiment is the actual process that you are investigating. In the coin-flipping example, the experiment is
flipping a coin ten times. A trial is one repetition within an experiment. Flipping a coin only once is
considered a trial.
Going back to the coin-flipping example, let's assume that we are dealing with a fair coin (i.e. the probability
of getting a head or a tail is 50%). We've already discussed that counting the number of heads
gives a discrete random variable. Notice that there are only two possible outcomes (heads or
tails). This is a key criterion for a binomial distribution (binomial derives from the Latin for "two terms"). Also,
notice that the flips are independent of each other. That is, if you get a head on one flip, that has no
impact on the probability of getting a head on the next flip. This also means that the probability of getting
a head remains constant. This is another key criterion for a binomial distribution. The other thing to notice
is that the number of times we flip the coin is fixed. We don't flip it until we get bored or run out of time.
Instead we flip it ten times. This means that the number of trials is fixed. This is the last key criterion of a
binomial distribution.
There are five characteristics of a binomial experiment.
1. The variable being studied is random.
2. The outcomes of the variable are being counted.
3. There is a fixed number of trials. The letter n denotes the number of trials.
4. There are only two possible outcomes, called "success" and "failure," for each trial. π denotes the
probability of a success on one trial, and 1 − π denotes the probability of a failure on one trial.
5. The n trials are independent and are repeated using identical conditions. Because the n trials are
independent, the outcome of one trial does not help in predicting the outcome of another trial. Another
way of saying this is that for each individual trial, the probability, π, of a success and probability, 1 − π,
of a failure remain constant.
Other examples of binomial distributions:
• Counting the number of 2's that are rolled when you roll a die six times (trial = rolling a die, success
= rolling a 2, n = 6, π = 1/6 = 0.1667)
• Counting the number of times a jack is pulled out of a deck of cards (with replacement) when you pull
a card fifteen times (trial = pulling a card, success = pulling a jack, n = 15, π = 4/52 = 0.0769)
• Counting the number of times that you win a prize in the Tim Hortons Roll up the Rim to Win contest
out of four cups (trial = checking a cup for a win, success = winning a prize, n = 4, π = 1/6 = 0.1667,
assuming no special rules, e.g. anniversary rules that changed the odds of winning)
Examples of situations that are not binomial include:
• Counting the number of times a jack is pulled out of a deck of cards (without replacement) when you
pull a card fifteen times. The fifth criterion is not met because the events are now dependent.
• Counting the number of times each number (1 to 6) is rolled when you roll a die fifty times. The fourth
criterion is not met because there are six possible outcomes instead of two.
• Counting the number of times that you win a prize in the Tim Hortons Roll up the Rim to Win contest
out of however many cups you buy during the contest. Unless you know exactly how many you'll buy
during the contest, this would not meet the third criterion of having a fixed number of trials.

note: The Roll up the Rim example might not be binomial as it may fail the fifth criterion. At the
beginning of the contest, the odds of winning are determined by counting how many prizes there
are out of the total number of cups printed. As the contest goes on, the probability of winning may
change depending on how many people have already won. At the beginning of the contest, this
is also true but there are so many cups that it doesn't really matter (think back to the sampling
with replacement vs. without replacement in Chapter 1). Thus, this contest is only binomial at
the beginning of the contest.

Available for free at Connexions <http://cnx.org/content/col23820/1.2>



2.2.2.2 Notation

Suppose we are working on a probability question and there are multiple probabilities that need to be found.
Then it gets time consuming to write out, for example, "the probability that three rolls of a die will result
in at least one 2" or some variation over and over again. Instead we will use notation to reduce the work.
We can write the previous statement more quickly as P(X ≥ 1). The P( ) means "the probability of". X is
the random variable being studied (in this case the number of times a 2 is rolled in 3 rolls). X ≥
1 means that a 2 is rolled at least once.
It is important to define X. Otherwise, P(2 < X ≤ 5) could refer to any random variable and the person
reading the notation won't know what it means.
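To see what the notation stands for numerically, here is a small Python sketch that evaluates P(X ≥ 1) for three rolls of a die. It assumes a fair die (π = 1/6) and uses the standard binomial formula; the helper binomial_pmf is a hypothetical name introduced only for this illustration.

```python
from math import comb


def binomial_pmf(k: int, n: int, pi: float) -> float:
    """Return P(X = k) for n independent trials with success probability pi."""
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)


# X = number of 2's rolled in three rolls of a fair die.
n, pi = 3, 1 / 6

# P(X >= 1) adds up the probabilities of X = 1, 2 and 3.
p_at_least_one = sum(binomial_pmf(k, n, pi) for k in range(1, n + 1))

# The same event is "not zero 2's", so 1 - P(X = 0) must give the same value.
p_complement = 1 - binomial_pmf(0, n, pi)

print(f"P(X >= 1) = {p_at_least_one:.4f}")  # prints P(X >= 1) = 0.4213
```

Writing the event as a condition on X makes it clear which values of X are being added together.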

2.2.2.2.1 Mean and standard deviation of the binomial distribution

Just like a set of data, a binomial distribution has a mean and a standard deviation. For the binomial
distribution, these are given by the formulas:
µ = nπ
σ = √(nπ(1 − π))
Going back to the Tim Hortons example, we had n = 4 and π = 0.1667. Thus µ = 4 × 0.1667 = 0.6667
and σ = √(4 × 0.1667 × (1 − 0.1667)) = 0.745. This means that if we buy four random cups of Tim Hortons
coffee during the Roll Up the Rim contest, we will typically win 0.67 times, give or take 0.75. Thus, when
buying four cups of coffee, we will typically win between -0.08 and 1.42 times. Since we can't win a negative
number of times, we will round the lower bound to 0. Therefore, when buying four cups of coffee, we will typically win
between 0 and 1.42 times.
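The calculation above can be sketched in a few lines of Python. The function names binomial_mean_sd and typical_range are hypothetical, and capping the upper bound at n is our own assumption (the text only rounds the lower bound up to 0).

```python
from math import sqrt


def binomial_mean_sd(n: int, pi: float) -> tuple[float, float]:
    """Return (mu, sigma) for a binomial distribution with n trials and success probability pi."""
    mu = n * pi
    sigma = sqrt(n * pi * (1 - pi))
    return mu, sigma


def typical_range(n: int, pi: float) -> tuple[float, float]:
    """Return mu minus/plus one sigma, kept inside the possible counts 0..n."""
    mu, sigma = binomial_mean_sd(n, pi)
    low = max(0.0, mu - sigma)       # a count cannot be negative
    high = min(float(n), mu + sigma)  # assumption: a count cannot exceed n
    return low, high


# Tim Hortons example: four cups, pi = 1/6.
mu, sigma = binomial_mean_sd(4, 1 / 6)
low, high = typical_range(4, 1 / 6)
print(f"mu = {mu:.4f}, sigma = {sigma:.3f}, typical range = {low:.2f} to {high:.2f}")
```

The upper bound comes out at about 1.41 rather than 1.42 because the code adds the unrounded values, whereas the text rounds µ and σ to two decimals before adding.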
Exercise 2.2.2.1 (Solution on p. 192.)
A market research study shows that 30% of all passengers on Canadian Airlines are business
travelers. A random sample of 20 passengers is taken.

1. Explain why the above situation satisfies the criteria of a binomial distribution. If there are
any issues with why this situation may not meet all of the criteria, discuss them. Define n, X
and π .
2. For the random sample, determine the probability that:

a. Exactly seven of the passengers are business travelers.


b. From ten to fourteen (inclusive) are business travelers.
c. At least eleven of these passengers are business travelers.
d. Five of these passengers are NOT travelling on business.

3. What is the typical range of business passengers in a random sample of 20?

2.2.2.3 Evaluating evidence using the binomial distribution

A company looked at its hiring practices. In particular, they found that their hiring practices appear to
favour men over women. Based on past data, they have found that regardless of the number of applications
by women, seventy-five percent of hires are men. Due to this issue, they decide to implement a program.
In this program, the name and any identifying features that may indicate the gender of an applicant are
removed. For example, if the application says, "She executed a marketing campaign that increased revenue
by 30%," this would be changed to "They executed a marketing campaign that increased revenue by 30%."
The names on the applications were changed to an alphanumeric identification (like AB-101). The company
claims that the program has worked, but they want to check the claim.
How will the company determine if the program has worked? One way to do this would be using statistics.



144 CHAPTER 2. BUSINESS STATISTICS - MODULE 2 - PROBABILITY

Now suppose that after a recent round of hiring, the proportion of men hired was 70%. Would this be
enough evidence that the program is working? 70% is definitely lower than 75%, but we know that there is
variability in sampling. This means that even if, prior to the program being implemented, around 75% of hires were
men, there may have been some rounds of hiring where 70% of hires were men and some where 80% were. It won't
be 75% each time. Instead we expect it to be close to 75%. Therefore, if the program has caused the hiring
practices to change, would a recent round of hiring in which the proportion of men hired was 70%
be enough evidence of change? What about 60%? Where is the line between normal variability around 75%
and abnormal variability? Statistics helps us figure that out, and that is how we evaluate evidence using
statistics.
Let's say in a recent round of hires there were 30 new hires and 20 of these hires were men.

2.2.2.3.1 Skepticism

Any time we are trying to evaluate evidence, we always start from a position of skepticism. That is, we don't
want to assume what we are trying to show (i.e. the claim). If we do that, we may bias the investigation. To
illustrate, if you assume that your significant other is cheating on you, then this will colour all of the evidence
you find (why did they show up five minutes late from work? They must be cheating!). A well-known real-world
example of this position is the assumption in court that a defendant is innocent until proven guilty.
That is, criminal court cases start with the assumption of innocence.
In general, the position of skepticism is that nothing has changed, the program didn't work, the experiment
didn't work, the effect being studied isn't happening, etc.
In our example, we will assume that the program that the company implemented did not work. That is,
we are assuming that the proportion of hires that are men is still 75%. Another way of writing this is π =
75% (where π is the population proportion).

2.2.2.3.2 Evidence

In a court case, evidence would be witness testimony, forensics evidence, expert testimony, etc.
In statistics, evidence is sample data. The evidence has been collected to evaluate the claim. In this case,
the evidence has been collected to determine if the program is working.
In our example, the sample data is the 20 men hired out of 30. This gives us a sample proportion of
20/30 = 66.67%. The symbol for the sample proportion is p̂ (said "p hat").

2.2.2.3.3 Evaluating evidence in statistics

To evaluate the evidence, we want to determine the probability of observing the evidence (or even better
evidence against the assumption) assuming the assumption is true. Once we determine this probability, we
need to determine if the event is unlikely or not unlikely. That is, we want to determine if it is unlikely that we could
have observed the evidence if the assumption is true, or if it is not unlikely that we observed the evidence if
the assumption is true. If it is unlikely to have observed the evidence, then most likely there is something
wrong with the assumption and the claim is likely true. If it is not unlikely to have observed the evidence,
then we can't actually conclude that there is something wrong with the assumption and we cannot conclude
that the claim is true.
To go back to the court case example, if you are a juror, you have to evaluate how unlikely or not unlikely
it is that the defendant would have had a heated argument with the victim, and been found covered in blood
and holding the murder weapon at the scene, if the defendant was innocent. If you think that it is unlikely
that all of the pieces of evidence could have happened if the defendant is innocent, then you would find the
defendant guilty. That is, the evidence calls into question the assumption. If you think that it is not
unlikely that all of these pieces of evidence could have happened if the defendant is innocent, then you would
find the defendant not guilty. Notice that we don't conclude that the defendant is innocent. That is, we
can't say that they are innocent; we can only say that they are not guilty.


note: When evaluating evidence, we are trying to evaluate the claim (i.e. not the position of
skepticism). Therefore, the evidence has been collected about the claim. No evidence has been collected
about the assumption. Therefore, our conclusion can only be about the claim and not the
assumption.

Therefore, if the probability is small and therefore unlikely, we can say that there is enough evidence to
suggest that the assumption is likely false (i.e. guilty).
If the probability is not small and therefore not unlikely, we can say that there is not enough evidence to
suggest that the assumption is false (i.e. not guilty).

note: In statistics, if the probability of an event happening is less than 1%, we say that the event
is unlikely to happen. If the probability is greater than 10%, we say the event is not unlikely to
happen. If the probability is between 1% and 10%, then it is up to the researcher to determine
whether they believe that the event is unlikely or not unlikely. Usually, the researcher decides on
the threshold between unlikely and not unlikely before performing the experiment or study.
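The rule of thumb in this note can be written as a small decision function. This is only a sketch: the function name classify_probability and the label used for the 1% to 10% zone are hypothetical, and the two cut-offs are left as parameters because the note says the researcher chooses the threshold in advance.

```python
def classify_probability(
    p: float, unlikely_below: float = 0.01, not_unlikely_above: float = 0.10
) -> str:
    """Label a probability using the 1% and 10% guidelines from the note."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a probability must be between 0 and 1")
    if p < unlikely_below:
        return "unlikely"
    if p > not_unlikely_above:
        return "not unlikely"
    # Between the two cut-offs the decision is left to the researcher.
    return "researcher's call"


for p in (0.005, 0.05, 0.20):
    print(f"{p:.3f} -> {classify_probability(p)}")
```

Fixing the cut-offs before looking at the data keeps the decision from being influenced by the result.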

In our example, to evaluate the evidence, we want to work out the probability that this company would
have hired 20 men out of 30 (or even better evidence against the assumption) if the proportion of men hired is
still 75%. That is, we want to find P(X ≤ 20, given π = 75%). Notice that this is a conditional probability
and the condition is the assumption.
What does "or even better evidence against the assumption" mean? It means that we don't just find the
probability of exactly 20 out of 30 hires being men. We find the probability of at most 20 out of 30 hires being men
because if the company hired 19, 18, 17, . . . men then that would be even better evidence that 75% is no
longer correct (as the sample proportion is getting more and more different from the assumed population
proportion).
Why do we look at "or even better evidence against the assumption"? Often the probability of exactly
one event happening is quite small. For example, the probability of getting exactly 10 heads out of 20 coin
tosses is 17.62%, even though that is the most likely event to occur. Therefore, if we only looked at the
probability of exactly one event happening (i.e. P(X = 20)) rather than P(X ≤ 20), we may come to the
false impression that an event is unlikely, when it could actually be explained by normal sampling variability.
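The coin-toss figure above can be checked with a short Python sketch. It assumes a fair coin (π = 0.5) and the standard binomial formula; binomial_pmf is a hypothetical helper name. The last lines apply the same idea to the hiring example to show how small the probability of one exact count is.

```python
from math import comb


def binomial_pmf(k: int, n: int, pi: float) -> float:
    """Return P(X = k) for n independent trials with success probability pi."""
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)


# Exactly 10 heads in 20 tosses of a fair coin.
p_ten_heads = binomial_pmf(10, 20, 0.5)

# 10 heads is the single most likely count, yet its probability is well below 50%.
most_likely = max(range(21), key=lambda k: binomial_pmf(k, 20, 0.5))

# Hiring example: probability of exactly 20 men out of 30 when pi = 0.75.
p_exactly_20 = binomial_pmf(20, 30, 0.75)

print(f"P(X = 10 heads) = {p_ten_heads:.4f}, most likely count = {most_likely}")
print(f"P(X = 20 men)   = {p_exactly_20:.4f}")
```

Looking only at the exact count would make an ordinary result look rarer than it is, which is why the cumulative probability P(X ≤ 20) is used instead.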

2.2.2.3.4 Finding the probability

To find the probability, we need to find an appropriate distribution that models the situation. In later
chapters, we will look at other models. Right now, the model we are going to use is the binomial distribution.
For us to use this model, we have to ensure that the situation is meeting all of the conditions of the binomial
distribution.

1. The variable being studied is random: This is not necessarily the case here as the applicants are not
random and the hiring process is not random. If we randomly selected 30 hires from a greater number
of hires, then it would be.
2. The outcomes of the variable are being counted: We are counting the number of men hired.
3. There are a fixed number of trials: We are looking at 30 hires (n = 30).
4. There are only two possible outcomes: Either the hire is a man or the hire is not a man.
5. The n trials are independent and the probability of success and probability of failure remain constant:
This is true because we are assuming that the probability of hiring a man remains constant at 75%.

Though the first condition is not met, we can still use the binomial distribution to model the situation. That
the model is not perfectly met would be a limitation of the study. That means that we would want to put
a caveat at the end of our conclusion to state that this might reduce the accuracy of our results.

warning: If the conditions of randomness and independence may not be fully met, then we can
still utilize the binomial distribution. But we do have to be wary of the results. The other three
conditions do need to be met to use the binomial distribution.


Now that we have the model, we can find the probability. In a computer program, we will use the binomial
distribution with n = 30 and π (or probability of occurrence) = 0.75. Then we will find P(X ≤ 20).
From the computer program, we get P(X ≤ 20) = 0.19659 = 19.659%. Again, this probability is found
under the assumption that the program has not worked (i.e. π = 75%).
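The text does not say which computer program is used at this point, so the following is only one possible way to obtain the number: a plain Python sketch that adds up the binomial probabilities for X = 0, 1, . . ., 20. The helper name binomial_cdf is hypothetical.

```python
from math import comb


def binomial_cdf(k: int, n: int, pi: float) -> float:
    """Return P(X <= k) by summing P(X = i) for i = 0..k."""
    return sum(comb(n, i) * pi**i * (1 - pi) ** (n - i) for i in range(k + 1))


# Hiring example: 30 hires, assumed proportion of men still 75%.
n, pi = 30, 0.75
p_at_most_20 = binomial_cdf(20, n, pi)

print(f"P(X <= 20) = {p_at_most_20:.5f}")
```

The result should agree with the value reported above, up to rounding in the last digit.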

2.2.2.3.5 Evaluating the probability

The probability that we would have observed at most 20 hires that were men out of 30, under the assumption
that the program did not work, is 19.659%. Therefore, it is not unlikely that we could have observed this
evidence as the probability is greater than 10%. This means that having 20 out of 30 hires being men falls
within the normal sampling variability for this data.
Based on the evidence collected, there is not sufficient evidence to suggest that the program worked.
Notice we don't conclude that the program is not working.

note: In statistics, we never use the words "prove" or "true" when making a conclusion. All of
our conclusions are based on sample data that we are using to make a conclusion about the
population. Therefore, there is always the chance of error.

2.2.2.3.6 Example

Olivier has spent five years honing his archery skills in various seedy locales around the world. Now he has
returned to his city of birth to use these skills to take out criminals. One night while drinking vodka with
his friends, he boasts that he can shoot an arrow into the bullseye, blindfolded at a distance of 50m, 90% of
the time.
"I don't believe you!" Jack, Olivier's best friend, slurred.
"I swear! I've really honed my skills," Olivier countered.
"But remember last week when we were in that darkened factory, you missed two of your shots!" Thelma,
Olivier's sister, countered.
"No. I meant to miss them."
Jack thought for a moment. "I think you are exaggerating and I'm going to test you."
"You're on!" Olivier sneered arrogantly.
To test whether Olivier was exaggerating about his marksmanship, Jack set up a bunch of targets and
randomly had Olivier attempt the shot. Olivier hit the bullseye (blindfolded at a distance of 50m) 39 out of
50 times.

a. If Olivier is not exaggerating, how many times out of 50 do we typically expect him to hit the bullseye?
Write your answer as a range that takes into account variation.

Answer: We would expect Olivier to hit the bullseye 45 times give or take 2.121 times. This means a typical
range is 42.88 to 47.12 bullseyes out of 50.
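The range in this answer follows from µ = nπ and σ = √(nπ(1 − π)) with n = 50 and π = 0.90. A minimal Python sketch of that arithmetic is shown below; the variable names are ours.

```python
from math import sqrt

n, pi = 50, 0.90   # 50 shots, claimed accuracy of 90%
observed = 39      # bullseyes Olivier actually hit

mu = n * pi                      # expected number of bullseyes
sigma = sqrt(n * pi * (1 - pi))  # typical variation around the mean
low, high = mu - sigma, mu + sigma

# Does the observed count fall inside the typical range?
is_typical = low <= observed <= high

print(f"mu = {mu:.0f}, sigma = {sigma:.3f}, range = {low:.2f} to {high:.2f}")
print(f"39 within typical range: {is_typical}")
```

The check on the last line anticipates part b): 39 sits below the lower end of the range.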

b. Based on your answer in a), is 39 out of 50 times potentially abnormal? Explain.

Answer: Since 39 is outside of the range, it would be deemed atypical, but that does not necessarily mean
that it is abnormal.

c. What assumption do we need to make before determining whether the 39 out of 50 provides evidence
for or against Olivier exaggerating?

Answer: Since Jack wants to show that Olivier is exaggerating, we want to assume that Olivier is not
exaggerating. This means we want to assume π = 90%, where π is the proportion of bullseyes that Olivier
hits.


d. What model (i.e. distribution) will you use to test the evidence against the assumption? Explain why
it is the best model to use. Note: This situation might not completely fit the model, but explain why
it is still a reasonable model to use.

Answer: The situation satisfies the conditions of the binomial distribution:

• The variable being studied is random: Since Jack is randomly having Olivier take the shot, we can say
this is a random event.
• The outcomes of the variable are being counted: We are counting the number of bullseyes.
• There are a fixed number of trials: We are looking at 50 shots with the bow and arrow.
• There are only two possible outcomes: Either the shot is a bullseye or it is not.
• The n trials are independent and the probability of success and probability of failure remain constant:
This is true because we are assuming that the probability of hitting the bullseye remains constant at
90%.

e. What probability do you need to find to evaluate the evidence against the assumption?

Answer: We need to find the probability that Olivier hits at most 39 out of 50 bullseyes, assuming his
accuracy is 90%. NOTE: We look at "at most 39" because having fewer bullseyes is even better evidence that
Olivier is exaggerating (i.e. better evidence against the assumption).

f. Find that probability.

Answer: P(X ≤ 39, given π = 90%) = 0.00935 = 0.94% (from computer program with n = 50, π (or probability
of occurrence) = 90%).
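As a cross-check on this answer, the sketch below computes the probability two ways in plain Python: directly as "at most 39 hits" with π = 0.90, and as the equivalent event "at least 11 misses" with a miss probability of 0.10. The helper binomial_cdf is a hypothetical name, and this is not necessarily how the computer program referred to above does the calculation.

```python
from math import comb


def binomial_cdf(k: int, n: int, pi: float) -> float:
    """Return P(X <= k) by summing P(X = i) for i = 0..k."""
    return sum(comb(n, i) * pi**i * (1 - pi) ** (n - i) for i in range(k + 1))


n, pi = 50, 0.90

# Direct route: at most 39 bullseyes out of 50.
p_at_most_39_hits = binomial_cdf(39, n, pi)

# Equivalent route: at most 39 hits means at least 11 misses,
# which is 1 - P(at most 10 misses).
p_at_least_11_misses = 1 - binomial_cdf(10, n, 1 - pi)

print(f"P(X <= 39 hits)    = {p_at_most_39_hits:.5f}")
print(f"P(>= 11 misses)    = {p_at_least_11_misses:.5f}")
```

Both routes describe the same event, so the two printed values should match.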

g. In the context of the problem, interpret the probability.

Answer: The probability that Olivier hit at most 39 out of 50 bullseyes, under the assumption that he wasn't
exaggerating about his accuracy is 0.94%.

h. Does the probability provide evidence to support whether Olivier is exaggerating or not? Explain.

Answer: Since the probability of observing our sample data (or even better evidence against the assumption)
is less than 1%, it is unlikely that Olivier is not exaggerating (i.e. that his accuracy is 90%). Therefore, it is likely that Olivier is exaggerating
and cannot hit the bullseye 90% of the time blindfolded from 50m.
Exercise 2.2.2.2 (Solution on p. 198.)
As stated in a previous question, the chance of a CRA audit for a tax return with over $25,000
in income is about 2% per year. An employee at I&S Square, a company that helps individuals
do their yearly tax returns and helps if there is an audit, has noticed that people in Seba Beach,
Alberta appear to have a greater chance of an audit than the rest of Canadians. Out of a random
sample of 45 residents, four of them have been audited.

a. If the residents of Seba Beach are being audited fairly, how many residents out of 45 do we
typically expect to get audited in a year? Write your answer as a range that takes into account
variation.
b. Based on your answer in a), is 4 out of 45 audits potentially abnormal? Explain.
c. What assumption do we need to make before determining whether the 4 out of 45 audits is
unfair?
d. What model (i.e. distribution) will you use to test the assumption? Explain why it is the
best model to use. Note: This situation might not completely fit the model, but explain why
it is still a reasonable model to use.
e. What probability do you need to find to evaluate the assumption?


f. Find that probability.


g. In the context of the problem, interpret the probability.
h. Does the probability provide evidence to support or refute whether residents of Seba Beach
are being unfairly audited? Explain.

Exercise 2.2.2.3 (Solution on p. 198.)


Executives at Bull, a Canadian-owned cell phone company, are not very happy with their current
customer satisfaction surveys. Using a Likert scale, they surveyed a very large sample of clients
who phoned Bull and spoke to a customer service representative. They have determined that only
60% of customers rate their overall satisfaction with the service they received at 4 or higher. That
is, they either strongly agreed or agreed with the statement, "I am happy with the overall customer
service I received during my most recent call to Bull."
They feel that this is too low as 40% of customers were not happy with their service. To address
these issues, they've brought in a consultant who has suggested that customers are happier with
their service if they feel they've built a rapport with the customer service representative. Thus,
Bull has decided to train their customer service representatives to start each call with a short
conversation. As customers are from across Canada and it would be bad if the conversations were
generic, to help their customer service representatives build rapport, a short notice shows up on
their screens before they take the call that contains suggested conversation topics for the area the
person is calling from. For example, it might include information about weather in the local area
and how the local sporting team has done in their most recent game.
After the customer service representatives have been trained in how to make small talk to
build rapport, a random sample of sixty customers who called Bull and spoke to a customer service
representative is taken. The participants are asked the same question about their overall satisfaction
with their customer service phone call as stated above. The results of the survey are listed below:

1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 3 3 3 4
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5

Table 2.14

Does the recent sample provide sufficient evidence to suggest that the proportion of customers
who are happy with their overall service when they call Bull has increased from 60%? Explain your
answer in detail.

note: What we can conclude when the probability is not unlikely: If the probability is greater
than 10%, then it means that it is not unlikely that we observed this evidence under the assumption.
We can NOT conclude that the assumption is likely true as the evidence was collected to evaluate
the claim (not the assumption). Instead, we can only conclude that there is not enough evidence
to say that the claim is true. When the probability is not unlikely, we have really learned very
little about the claim.

2.2.2.4 Chapter Review

A statistical experiment can be classified as a binomial experiment if the following conditions are met:

1. The variable being studied is random.


2. The outcomes of the variable are being counted.


3. There are a fixed number of trials. The letter n denotes the number of trials.
4. There are only two possible outcomes, called "success" and "failure," for each trial. π denotes the
probability of a success on one trial, and 1-π denotes the probability of a failure on one trial.
5. The n trials are independent and are repeated using identical conditions. Because the n trials are
independent, the outcome of one trial does not help in predicting the outcome of another trial. Another
way of saying this is that for each individual trial, the probability, π , of a success and probability, 1-π ,
of a failure remain constant.

The outcomes of a binomial experiment fit a binomial probability distribution. The random variable X =
the number of successes obtained in the n independent trials. The mean of X can be calculated using the
formula µ = nπ, and the standard deviation is given by the formula σ = √(nπ(1 − π)).
To evaluate evidence, we must first begin from a position of skepticism (i.e. assume the opposite of what
we want to show). Then we must find the probability of observing the actual evidence, or even better
evidence against the assumption, if the assumption is true. We can then evaluate the probability by determining whether it is less
than 1% (which means it is unlikely the evidence occurred under the assumption) or if it is greater than 10%
(which means it is not unlikely the evidence occurred under the assumption). If the probability is deemed
unlikely, then we reject the assumption, which means there is enough evidence to support what we originally
wanted to show (the claim). If the probability is deemed not unlikely, then we do not reject the assumption,
which means there is not enough evidence to support what we originally wanted to show (the claim). In the
latter situation, we cannot make any conclusions about the assumption as the evidence was collected only
for the claim.

2.2.2.5 Practice

The first few exercises provided are from the textbook Business Statistics - BSTA 200 - Humber
College - Version 2016RevA - DRAFT 2016-04-04 by Alexander Holmes, Lyryx Learning:
http://cnx.org/contents/[email protected]
Use the following information to answer the next seven exercises: The Higher Education Research Institute
at UCLA collected data from 203,967 incoming first-time, full-time freshmen from 270 four-year colleges
and universities in the U.S. 71.3% of those students replied that, yes, they believe that same-sex couples
should have the right to legal marital status. Suppose that you randomly pick eight first-time, full-time
freshmen from the survey. You are interested in the number that believe that same-sex couples should have
the right to legal marital status.
Exercise 2.2.2.4 (Solution on p. 198.)
In words, define the random variable X.
Exercise 2.2.2.5 (Solution on p. 198.)
What values does the random variable X take on?
Exercise 2.2.2.6 (Solution on p. 198.)
Construct the probability distribution function (PDF). That is, fill in the table below. In the left
column put in the possible values for X. In the right column, put in the probability for exactly X,
i.e. P(X = x).


x P (X = x)

Table 2.15

Exercise 2.2.2.7 (Solution on p. 199.)


On average (µ), how many would you expect to answer yes?
Exercise 2.2.2.8 (Solution on p. 199.)
What is the standard deviation (σ )?
Exercise 2.2.2.9 (Solution on p. 199.)
What is the probability that at most ve of the freshmen reply yes?
Exercise 2.2.2.10 (Solution on p. 199.)
What is the probability that at least two of the freshmen reply yes?
Exercise 2.2.2.11 (Solution on p. 199.)
A school newspaper reporter decides to randomly survey 12 students to see if they will attend Tet
(Vietnamese New Year) festivities this year. Based on past years, she knows that 18% of students
attend Tet festivities. We are interested in the number of students who will attend the festivities.

a. In words, define the random variable X.


b. List the values that X may take on.
c. How many of the 12 students do we expect to attend the festivities?
d. Find the probability that at most four students will attend.
e. Find the probability that more than two students will attend.

Use the following information to answer the next three multiple choice questions: The probability that the
Calgary Flames will win any given game is 0.4617 based on a 45-year win history of 1,616 wins out of 3,500
games played (as of Sept. 2017). An upcoming monthly schedule contains 12 games.
Exercise 2.2.2.12 (Solution on p. 199.)
The expected number of wins for that upcoming month is:

a. 1.67
b. 12
c. 3500
1616

d. 5.54

Let X = the number of games won in that upcoming month.


Exercise 2.2.2.13 (Solution on p. 199.)
What is the probability that the Calgary Flames win exactly six games in that upcoming month?


a. 0.2178
b. 0.4167
c. 0.7664
d. 0.7116

Exercise 2.2.2.14 (Solution on p. 199.)


What is the probability that the Calgary Flames win at least five games in that upcoming month?

a. 0.2176
b. 0.2762
c. 0.7238
d. 0.5062

Exercise 2.2.2.15 (Solution on p. 199.)


The chance of a Canada Revenue Agency audit for a tax return with over $25,000 in income is
about 2% per year. We are interested in the expected number of audits a person with that income
has in a 20-year period. Assume each year is independent.

a. In words, define the random variable X.


b. List the values that X may take on.
c. How many audits are expected in a 20-year period?
d. Find the probability that a person is not audited at all.
e. Find the probability that a person is audited more than twice.

Exercise 2.2.2.16 (Solution on p. 199.)


According to The World Bank, only 9% of the population of Uganda had access to electricity as
of 2009. Suppose we randomly sample 150 people in Uganda. Let X = the number of people who
have access to electricity.

a. Calculate the mean and standard deviation of X.


b. Find the probability that 15 people in the sample have access to electricity.
c. Find the probability that at most ten people in the sample have access to electricity.
d. Find the probability that more than 25 people in the sample have access to electricity.

Exercise 2.2.2.17 (Solution on p. 200.)


Jenna and Megan looked at the new packaging.
"I guess it looks ok," Megan hedged.
"The design team says that this new packaging really sells the time-saving nature of the kit."
"But it's kinda off-putting." They continued to stare at the new packaging. Jenna and Megan
had developed a make-up kit called `5 minute make-up', which was aimed at women on the go who
wanted to put `their face on' but a lot quicker than they usually did. Their target market was new
moms, moms with full-time jobs, full-time students working full-time, . . . In other words, anyone
who didn't have 30 minutes every morning to do their make-up. Their little start-up was doing
well. They'd arranged for their product to be produced, had made how-to videos on YouTube, and
were starting to get their products put into stores. Their dream placement was in Sephora.
Now that they were more established, they had decided to hire a marketing expert who could
help take them to the next level. The first thing that Leticia suggested was to change the packaging.
She argued that their old packaging didn't convey the premise of the product clearly enough. With
the help of a design team, Leticia had come up with a packaging that showed a harried woman
with hair everywhere and bags under her eyes looking overwhelmed. But when you flip the package
over, the woman was now perfectly put together: `only five-minute make-up can save you from
being a hot mess'.


Jenna finally broke the silence. "I don't know if I would even pick up this package. It just looks
so depressing. But what do we do?"
"Leticia wants to put the product in this new packaging in five stores that carry our products.
Based on previous sales numbers, we know that the stores sell 68% of the product we give them
in a two-week period."
"How does that help us? Do we just watch our sales plummet?" Jenna was sounding exasperated.
"I'm getting to that," Megan soothed. "Leticia is convinced this packaging will increase sales.
But what if we can show her that it doesn't? Let's put this packaging into the five stores and then
see how many kits were actually sold. I bet that we can show her that the sales went down."
"I don't see how that is useful. We still have to pay her stupid fee."
"You should read her contract more closely. She only gets paid if she can show that sales
increased. If they don't, then not only does she not get paid but she also has to pay for any
contractors (i.e. the design team)."
Jenna perked up visibly at this.
Over the next two weeks, five stores carried the new packaging. Megan and Jenna provided
each store with 100 kits. At the end of the two weeks, 306 of the kits were sold.
1. What is the observation unit? What is the variable? Categorize it.
2. What do Jenna and Megan want to show?
3. What assumption do Jenna and Megan need to make in order to investigate your answer in
question 2? Write your answer both as a sentence and as a probability.
4. What is the evidence that Jenna and Megan have found?
5. Describe the process that Jenna and Megan will go through to evaluate this evidence. Your
description should include (but is not limited to) what probability they will find and what
they will do with that probability once they've found it. Don't actually do the process (that
comes later). Just describe what they will do.
6. Jenna and Megan believe that the binomial distribution will be the best model to find the
required probability. Does this situation meet the criteria for a binomial? Examine each
criterion and comment on whether it is satisfied here or not.
7. Regardless of your answer above, use a binomial distribution to model this situation. Find
the appropriate probability to evaluate the evidence using MegaStat.
8. In sentence form, explain what the probability you have found means in the context of the
question. Do not make a conclusion yet. Instead explain what it is a probability of.
9. Now make a conclusion. In particular, answer this question: Is there enough evidence to
suggest that Leticia's new packaging has reduced sales? Justify your answer.

Exercise 2.2.2.18 (Solution on p. 200.)


Striking Donkey Coffee recently sold an 80% stake in their company to Baravalle, an Italian coffee
conglomerate. Striking Donkey's logo is simplistic. Baravalle wants to maintain brand recognition,
but also wants to put their stamp on the company. In particular, Baravalle is known for its modern
and stylish advertisements.
Designers and marketers at Baravalle have worked tirelessly for the last month to come up with
two revised Striking Donkey logos (not included, because they are top secret). They are referred to as
Logo 1 and Logo 2.
Now they want to determine whether customers show any preference for either logo. To do this,
they asked a random sample of 40 customers who were familiar with the Striking Donkey Brand
which logo they prefer. Participants had to make a choice between the logos.
The results of the study were that 26 out of the 40 participants preferred Logo 2.
The marketers at Baravalle now want to do a statistical analysis to determine whether Logo 2
is preferred significantly more than Logo 1.
1. What assumption do you need to start with when determining whether Logo 2 is preferred
significantly more than Logo 1? State your answer both in a sentence and mathematically.


2. Can this situation be modelled by the binomial distribution? Support your answer by showing
why this situation does or does not satisfy each of the five criteria of the binomial distribution.
3. After previous issues with horrible new logo launches, Baravalle only wants to go forward
if there is clear evidence that Logo 2 is preferred. Based on this, what level of significance
should they use? Explain your reasoning.
4. Regardless of your answer in question 2, assume that this situation satisfies the binomial distribution for
the remainder of the question. Use a computer program to find the appropriate probability
that will allow you to evaluate the evidence.
5. In sentence form, explain what the probability you have found means in the context of the
question. Do not make a conclusion yet. Instead explain what it is a probability of.
6. Based on the probability, determine whether Logo 2 is preferred significantly more than Logo
1. Explain your reasoning.


2.3 Chapter 5: Normal Distribution


2.3.1 Introduction to the Normal Distribution8

Figure 2.10: If you ask enough people about their shoe size, you will find that your graphed data is
shaped like a bell curve and can be described as normally distributed. (credit: Ömer Ünl)

By the end of this chapter, the student should be able to:

• Recognize the normal probability distribution and apply it appropriately.


• Recognize the standard normal probability distribution and apply it appropriately.
• Compare normal probabilities by converting to the standard normal distribution.

The normal probability density function, a continuous distribution, is the most important of all the
distributions. It is widely used and even more widely abused. Its graph is bell-shaped. You see the bell curve
in almost all disciplines. Some of these include psychology, business, economics, the sciences, nursing, and,
of course, mathematics. Some of your instructors may use the normal distribution to help determine your
grade. Most IQ scores are normally distributed. Often real-estate prices fit a normal distribution.
The normal distribution is extremely important, but it cannot be applied to everything in the real world.
Remember here that we are still talking about the distribution of population data. This is a discussion of
probability, and thus it is the population data that may be normally distributed. If it is, then this is
how we can find probabilities of specific events, just as we did for population data that may be binomially
distributed or Poisson distributed. This caution is here because in the next chapter we will see that the
normal distribution describes something very different from raw data and forms the foundation of inferential
statistics.
8 This content is available online at <http://cnx.org/content/m62345/1.1/>.

In this chapter, you will study the normal distribution, the standard normal distribution, and applications
associated with them.
The normal distribution has two parameters (two numerical descriptive measures), the mean (µ) and the
standard deviation (σ). If X is a quantity to be measured that has a normal distribution with mean (µ) and
standard deviation (σ), we designate this by writing the following formula of the normal probability density
function:

Figure 2.11
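
The formula shown in the figure does not survive in this copy of the text. As an aid, the standard form of the normal probability density function for X ∼ N(µ, σ) is sketched below in LaTeX. It is supplied from the general definition of the distribution and is assumed, not confirmed, to match the original figure.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}},
\qquad -\infty < x < \infty
```

Only µ and σ appear as parameters, which is consistent with the statement that the distribution depends only on the mean and the standard deviation.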

The curve is symmetrical about a vertical line drawn through the mean, µ. The mean is the same as the
median, which is the same as the mode, because the graph is symmetric about µ. As the notation indicates,
the normal distribution depends only on the mean and the standard deviation. Note that this is unlike
several probability density functions we have already studied, such as the Poisson, where the mean is equal
to µ and the standard deviation simply the square root of the mean, or the binomial, where p is used to
determine both the mean and standard deviation. Since the area under the curve must equal one, a change
in the standard deviation, σ, causes a change in the shape of the curve; the curve becomes fatter and wider
or skinnier and taller depending on σ. A change in µ causes the graph to shift to the left or right. This
means there are an infinite number of normal probability distributions. One of special interest is called the
standard normal distribution.
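
The effect of µ and σ on the curve can be checked numerically. The following is a minimal sketch, assuming Python 3.8 or later (for statistics.NormalDist in the standard library); the helper area_under_pdf and the three example distributions are illustrative choices, not part of the text.

```python
from statistics import NormalDist


def area_under_pdf(dist, lo, hi, steps=20_000):
    """Approximate the area under dist.pdf between lo and hi (trapezoid rule)."""
    width = (hi - lo) / steps
    total = 0.5 * (dist.pdf(lo) + dist.pdf(hi))
    for i in range(1, steps):
        total += dist.pdf(lo + i * width)
    return total * width


narrow = NormalDist(mu=0, sigma=1)   # smaller sigma: skinnier and taller
wide = NormalDist(mu=0, sigma=2)     # larger sigma: fatter and wider
shifted = NormalDist(mu=3, sigma=1)  # same shape as `narrow`, moved to the right

# The total area stays (approximately) one regardless of sigma.
area_narrow = area_under_pdf(narrow, -8, 8)
area_wide = area_under_pdf(wide, -16, 16)
print(round(area_narrow, 4), round(area_wide, 4))

# The peak is lower when sigma is larger.
print(narrow.pdf(0) > wide.pdf(0))

# Changing mu only shifts the curve: the peak height is unchanged.
print(abs(shifted.pdf(3) - narrow.pdf(0)) < 1e-12)

# The curve is symmetric about the mean.
print(abs(narrow.pdf(-1.3) - narrow.pdf(1.3)) < 1e-12)
```

The integration limits of eight standard deviations either side of the mean are an arbitrary cutoff; the area beyond them is negligible.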

2.3.1.1 Formula Review

X ∼ N(µ, σ)
µ = the mean; σ = the standard deviation


2.3.2 The Standard Normal Distribution9


The standard normal distribution is a normal distribution of standardized values called z-scores.
A z-score is measured in units of the standard deviation. For example, if the mean of a normal
distribution is five and the standard deviation is two, the value x = 11 is three standard deviations above
(or to the right of) the mean. The calculation is as follows:
x = µ + (z)(σ) = 5 + (3)(2) = 11
The z-score is three.
The mean for the standard normal distribution is zero, and the standard deviation is one. What this
does is dramatically simplify the mathematical calculation of probabilities. Take a moment and substitute
zero and one in the appropriate places in the above formula and you can see that the equation collapses into
one that can be much more easily solved using integral calculus. The transformation z = (x − µ)/σ produces the
distribution Z ∼ N(0, 1). The value x comes from a known normal distribution with known mean µ and
known standard deviation σ. The z-score tells how many standard deviations a particular x is away from
the mean.
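
To make the substitution concrete, the sketch below starts from the standard form of the normal density (assumed here, since the formula figure is not reproduced in this copy) and sets µ = 0 and σ = 1:

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}
\quad\xrightarrow{\;\mu = 0,\ \sigma = 1\;}\quad
f(z) = \frac{1}{\sqrt{2\pi}}\; e^{-z^{2}/2}
```

The resulting density has no free parameters, which is why a single table of areas can serve every normal distribution once values are converted to z-scores.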

2.3.2.1 Z-Scores
If X is a normally distributed random variable and X ∼ N(µ, σ), then the z-score is:
z = (x − µ)/σ (2.11)
The z-score tells you how many standard deviations the value x is above (to the right of) or
below (to the left of) the mean, µ. Values of x that are larger than the mean have positive z-scores,
and values of x that are smaller than the mean have negative z-scores. If x equals the mean, then x has a
z-score of zero.
Example 2.26
Suppose X ∼ N(5, 6). This says that X is a normally distributed random variable with mean µ =
5 and standard deviation σ = 6. Suppose x = 17. Then:
z = (x − µ)/σ = (17 − 5)/6 = 2 (2.11)
This means that x = 17 is two standard deviations (2σ) above or to the right of the mean µ =
5. The standard deviation is σ = 6.
Now suppose x = 1. Then: z = (x − µ)/σ = (1 − 5)/6 = −0.67 (rounded to two decimal places)
This means that x = 1 is 0.67 standard deviations (−0.67σ) below or to the left of
the mean µ = 5.
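
The two calculations in this example can be reproduced with a few lines of Python. This is a minimal sketch; the function name z_score is an illustrative choice, not something defined in the text.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


# X ~ N(5, 6): mean 5, standard deviation 6
z_17 = z_score(17, mu=5, sigma=6)  # above the mean, so positive
z_1 = z_score(1, mu=5, sigma=6)    # below the mean, so negative

print(z_17)           # 2.0
print(round(z_1, 2))  # -0.67
```

The sign of the result carries the direction: a negative z-score means the value is to the left of the mean.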

Example 2.27
Some doctors believe that a person can lose five pounds, on average, in a month by reducing his or
her fat intake and by exercising consistently. Suppose weight loss has a normal distribution. Let X
= the amount of weight lost (in pounds) by a person in a month. Use a standard deviation of two
pounds. X ∼ N(5, 2).
Problem
Suppose a person gained three pounds (a negative weight loss). Then z = __________. This
z-score tells you that x = −3 is ________ standard deviations to the __________ (right
or left) of the mean.
Solution

z = (x − µ)/σ = (−3 − 5)/2 = −4 (2.11)
9 This content is available online at <http://cnx.org/content/m64939/1.1/>.


z = −4. This z-score tells you that x = −3 is four standard deviations to the left of the mean.

Suppose the random variables X and Y have the following normal distributions: X ∼ N(5, 6) and Y ∼ N(2,
1). If x = 17, then z = 2. (This was previously shown.) If y = 4, what is z?
z = (y − µ)/σ = (4 − 2)/1 = 2, where µ = 2 and σ = 1.

The z-score for y = 4 is z = 2. This means that y = 4 is two standard deviations to the right of the
mean. Therefore, x = 17 and y = 4 are both two (of their own) standard deviations to the right of their
respective means.
The z-score allows us to compare data that are scaled differently. To understand the concept,
suppose X ∼ N(5, 6) represents weight gains for one group of people who are trying to gain weight in a
six-week period and Y ∼ N(2, 1) measures the same weight gain for a second group of people. A negative
weight gain would be a weight loss. Since x = 17 and y = 4 are each two standard deviations to the right
of their means, they represent the same standardized weight gain relative to their means.

Exercise 2.3.2.1 (Solution on p. 201.)


Fill in the blanks.
Jerome averages 16 points a game with a standard deviation of four points. X ∼ N(16, 4).
Suppose Jerome scores ten points in a game. The z-score when x = 10 is −1.5. This score
tells you that x = 10 is _____ standard deviations to the ______ (right or left) of
the mean ______ (What is the mean?).

The Empirical Rule


If X is a random variable and has a normal distribution with mean µ and standard deviation σ , then the
Empirical Rule says the following:

• About 68.26% of the x values lie between −1σ and +1σ of the mean µ (within one standard deviation
of the mean).
• About 95.44% of the x values lie between −2σ and +2σ of the mean µ (within two standard deviations
of the mean).
• About 99.73% of the x values lie between −3σ and +3σ of the mean µ (within three standard deviations
of the mean). Notice that almost all the x values lie within three standard deviations of the mean.
• The z-scores for +1σ and −1σ are +1 and −1, respectively.
• The z-scores for +2σ and −2σ are +2 and −2, respectively.
• The z-scores for +3σ and −3σ are +3 and −3, respectively.

The empirical rule is also known as the 68-95-99.7 rule.
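
The percentages in the Empirical Rule can be recomputed from the standard normal cumulative distribution. The following is a minimal sketch, assuming Python 3.8 or later for statistics.NormalDist; the helper name within is illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1


def within(k):
    """Probability that a normal value lies within k standard deviations of its mean."""
    return Z.cdf(k) - Z.cdf(-k)


for k in (1, 2, 3):
    print(k, round(within(k), 4))
```

The computed values agree with the percentages quoted above to within about 0.01 percentage points. The small differences arise because 68.26% and 95.44% come from doubling four-decimal table entries (2 × 0.3413 and 2 × 0.4772).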


Figure 2.12

Example 2.28
The mean height of 15 to 18-year-old males from Chile from 2009 to 2010 was 170 cm with a
standard deviation of 6.28 cm. Male heights are known to follow a normal distribution. Let X =
the height of a 15 to 18-year-old male from Chile in 2009 to 2010. Then X ∼ N (170, 6.28).

Problem 1
a. Suppose a 15 to 18-year-old male from Chile was 168 cm tall from 2009 to 2010. The z-score
when x = 168 cm is z = _______. This z-score tells you that x = 168 is ________ standard
deviations to the ________ (right or left) of the mean _____ (What is the mean?).
Solution

z = (x − µ)/σ = (168 − 170)/6.28 = −0.32 (2.12)
a. −0.32, 0.32, left, 170

Problem 2
b. Suppose that the height of a 15 to 18-year-old male from Chile from 2009 to 2010 has a z-score
of z = 1.27. What is the male's height? The z-score (z = 1.27) tells you that the male's height is
________ standard deviations to the __________ (right or left) of the mean.
Solution

z = (x − µ)/σ = (x − 170)/6.28 = 1.27 → x = 1.27 ∗ 6.28 + 170 = 177.98 (2.12)
b. 177.98, 1.27, right
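
Both directions of the calculation in this example (from x to z, and from z back to x) can be sketched as follows. The function names z_score and x_from_z are illustrative and are not defined in the text.

```python
def z_score(x, mu, sigma):
    """Standardize x: how many standard deviations it lies from the mean."""
    return (x - mu) / sigma


def x_from_z(z, mu, sigma):
    """Invert the z-score formula: x = mu + z * sigma."""
    return mu + z * sigma


MU, SIGMA = 170, 6.28  # height in cm, X ~ N(170, 6.28)

z_168 = z_score(168, MU, SIGMA)    # part a
height = x_from_z(1.27, MU, SIGMA)  # part b

print(round(z_168, 2))   # -0.32
print(round(height, 2))  # 177.98
```

Applying z_score to the height returned by x_from_z gives back 1.27 (up to floating-point rounding), which is a quick way to confirm the two formulas are inverses of each other.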


Exercise 2.3.2.2 (Solution on p. 201.)


In 2012, 1,664,479 students took the SAT exam. The distribution of scores in the verbal
section of the SAT had a mean µ = 496 and a standard deviation σ = 114. Let X = a
SAT exam verbal section score in 2012. Then X ∼ N (496, 114).
Find the z-scores for x1 = 325 and x2 = 366.21. Interpret each z-score. What can you
say about x1 = 325 and x2 = 366.21?

Example 2.29
Suppose x has a normal distribution with mean 50 and standard deviation 6.

• About 68% of the x values lie between −1σ = (−1)(6) = −6 and 1σ = (1)(6) = 6 of the mean
50. The values 50 − 6 = 44 and 50 + 6 = 56 are within one standard deviation of the mean
50. The z-scores are −1 and +1 for 44 and 56, respectively.
• About 95% of the x values lie between −2σ = (−2)(6) = −12 and 2σ = (2)(6) = 12 of the mean 50. The
values 50 − 12 = 38 and 50 + 12 = 62 are within two standard deviations of the mean 50.
The z-scores are −2 and +2 for 38 and 62, respectively.
• About 99.7% of the x values lie between −3σ = (−3)(6) = −18 and 3σ = (3)(6) = 18 of the
mean 50. The values 50 − 18 = 32 and 50 + 18 = 68 are within three standard deviations of
the mean 50. The z-scores are −3 and +3 for 32 and 68, respectively.
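
The three intervals in this example follow one pattern, µ ± kσ for k = 1, 2, 3, which can be sketched in a few lines. The function name empirical_intervals is an illustrative choice, not something from the text.

```python
def empirical_intervals(mu, sigma):
    """Return {k: (mu - k*sigma, mu + k*sigma)} for k = 1, 2, 3 standard deviations."""
    return {k: (mu - k * sigma, mu + k * sigma) for k in (1, 2, 3)}


intervals = empirical_intervals(mu=50, sigma=6)

# Approximate share of values inside each interval, per the 68-95-99.7 rule.
coverage = {1: "about 68%", 2: "about 95%", 3: "about 99.7%"}
for k, (low, high) in intervals.items():
    print(f"{coverage[k]} of values lie between {low} and {high}")
```

The same function can be reused for the exercises that follow by changing mu and sigma.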

Exercise 2.3.2.3 (Solution on p. 201.)


Suppose X has a normal distribution with mean 25 and standard deviation five. Between
what values of x do 68% of the values lie?

Exercise 2.3.2.4 (Solution on p. 201.)


The scores on a college entrance exam have an approximate normal distribution with
mean, µ = 52 points and a standard deviation, σ = 11 points.

a. About 68% of the y values lie between what two values? These values are
________________. The z-scores are ________________, respectively.
b. About 95% of the y values lie between what two values? These values are
________________. The z-scores are ________________, respectively.
c. About 99.7% of the y values lie between what two values? These values are
________________. The z-scores are ________________, respectively.


2.3.2.2 References

Blood Pressure of Males and Females. StatCrunch, 2013. Available online at
http://www.statcrunch.com/5.0/viewreport.php?reportid=11960 (accessed May 14, 2013).
The Use of Epidemiological Tools in Conflict-affected populations: Open-access educational resources for
policy-makers: Calculation of z-scores. London School of Hygiene and Tropical Medicine, 2009. Available
online at http://conict.lshtm.ac.uk/page_125.htm (accessed May 14, 2013).
2012 College-Bound Seniors Total Group Profile Report. CollegeBoard, 2012. Available online at
http://media.collegeboard.com/digitalServices/pdf/research/TotalGroup-2012.pdf (accessed May 14, 2013).
Digest of Education Statistics: ACT score average and standard deviations by sex and race/ethnicity
and percentage of ACT test takers, by selected composite score ranges and planned fields of study:
Selected years, 1995 through 2009. National Center for Education Statistics. Available online at
http://nces.ed.gov/programs/digest/d09/tables/dt09_147.asp (accessed May 14, 2013).
Data from the San Jose Mercury News.
Data from The World Almanac and Book of Facts.
List of stadiums by capacity. Wikipedia. Available online at
https://en.wikipedia.org/wiki/List_of_stadiums_by_capacity (accessed May 14, 2013).
Data from the National Basketball Association. Available online at www.nba.com (accessed May 14,
2013).

2.3.2.3 Chapter Review

A z-score is a standardized value. Its distribution is the standard normal, Z ∼ N (0, 1). The mean of the
z-scores is zero and the standard deviation is one. If z is the z-score for a value x from the normal distribution
N (µ, σ ) then z tells you how many standard deviations x is above (greater than) or below (less than) µ.

2.3.2.4 Practice Questions

Exercise 2.3.2.5 (Solution on p. 201.)


In a normal distribution, x = 3 and z = 0.67. This tells you that x = 3 is ____ standard
deviations to the ____ (right or left) of the mean.
Exercise 2.3.2.6 (Solution on p. 201.)
In a normal distribution, x = 5 and z = 3.14. This tells you that x = 5 is ____ standard
deviations to the ____ (right or left) of the mean.
Exercise 2.3.2.7 (Solution on p. 201.)
About what percent of x values from a normal distribution lie within one standard deviation (left
and right) of the mean of that distribution?
Exercise 2.3.2.8 (Solution on p. 201.)
About what percent of the x values from a normal distribution lie within two standard deviations
(left and right) of the mean of that distribution?
Exercise 2.3.2.9 (Solution on p. 201.)
About what percent of x values lie between the second and third standard deviations (both sides)?
Use the following information to answer the next two multiple choice exercises: The patient recovery time
from a particular surgical procedure is normally distributed with a mean of 5.3 days and a standard deviation
of 2.1 days.
Exercise 2.3.2.10 (Solution on p. 201.)
What is the median recovery time?

a. 2.7
b. 5.3
c. 7.4


d. 2.1

Exercise 2.3.2.11 (Solution on p. 201.)


What is the z-score for a patient who takes ten days to recover?
a. 1.5
b. 0.2
c. 2.2
d. 7.3

Exercise 2.3.2.12 (Solution on p. 201.)


Wesley Crusher was tasked with exploring the Selcundi Drema sector. He found a new species
of tribbles. In his final report, he stated, "Though tribbles vary in size and dimension, the middle
99.73% of them weigh between 4 and 7.2 kg and follow a normal distribution." Based on this, what
are the mean and standard deviation for the weight of tribbles? Choose the best answer.
a. mean = 5.6 kg, standard deviation = 1.07 kg
b. mean = 5.6 kg, standard deviation = 0.53 kg
c. mean = 5.6 kg, standard deviation = 0.8 kg
d. mean = 99.73 kg, standard deviation = 3.2 kg
e. There is not enough information to determine this.

2.3.3 Using the Normal Distribution10


The shaded area in the following graph indicates the area to the right of x. This area is represented by the
probability P(X > x). Normal tables, computers, and calculators provide or calculate the probability P(X
> x).

Figure 2.13

10 This content is available online at <http://cnx.org/content/m64940/1.2/>.


The area to the right is then P(X > x) = 1 − P(X < x). Remember, P(X < x) = Area to the left of
the vertical line through x. P(X > x) = 1 − P(X < x) = Area to the right of the vertical line through x.
P(X < x) is the same as P(X ≤ x) and P(X > x) is the same as P(X ≥ x) for continuous distributions.

2.3.3.1 Calculations of Probabilities

To find the probability for probability curves with a continuous random variable we need to calculate the
area under the curve across the values of X we are interested in. For the normal distribution this seems a
difficult task given the complexity of the formula. There is, however, a simple way to get what we want.
We start knowing that the area under a probability curve is the probability.

Figure 2.14

This shows that the area between x1 and x2 is the probability as stated in the formula: P(x1 ≤ X ≤ x2)
The mathematical tool needed to find the area under a curve is integral calculus. The integral of the
normal probability density function between the two points x1 and x2 is the area under the curve between
these two points and is the probability between these two points.
Doing these integrals is no fun and can be very time consuming. But now, remembering that there are
an infinite number of normal distributions out there, we can consider the one with a mean of zero and a
standard deviation of 1. This particular normal distribution is given the name Standard Normal Distribution.
Putting these values into the formula reduces it to a much simpler equation. We can now quite easily calculate
all probabilities for any value of x for this particular normal distribution, which has a mean of zero and a
standard deviation of 1. These probabilities have been tabulated and are available in this text and widely on the
web. They are presented in various ways. The table in this text is the most common presentation and is set
up with probabilities for one-half the distribution beginning with zero, the mean, and moving outward. The
shaded area in the graph at the top of the table represents the probability from zero to the specific Z value
noted on the horizontal axis, Z.
The only problem is that even with this table, it would be a ridiculous coincidence that our data had
a mean of zero and a standard deviation of one. The solution is to convert the distribution we have with
its mean and standard deviation to this new Standard Normal Distribution. The Standard Normal has a
random variable called Z.
Using the standard normal table, typically called the normal table, to find the probability between the mean
and one standard deviation above it, go to the Z column, read down to 1.0, and then read across to column 0.
That number, 0.3413, is the probability from zero to 1 standard deviation. At the top of the table is the shaded
area in the distribution, which is the probability for one standard deviation. The table has solved our integral
calculus problem. But only if our data has a mean of zero and a standard deviation of 1.
However, the essential point here is, the probability for one standard deviation on one normal distribution
is the same on every normal distribution. If the population data set has a mean of 10 and a standard deviation
of 5 then the probability from 10 to 15, one standard deviation, is the same as from zero to 1, one standard
deviation on the standard normal distribution. To compute probabilities, areas, for any normal distribution,
we need only to convert the particular normal distribution to the standard normal distribution and look up
the answer in the tables. As review, here again is the standardizing formula:
Z = (x − µ)/σ (2.14)

where Z is the value on the standard normal distribution, x is the value from a normal distribution one
wishes to convert to the standard normal, and µ and σ are, respectively, the mean and standard deviation of that
population. Note that the equation uses µ and σ, which denote population parameters. This is still dealing
with probability, so we are always dealing with the population, with known parameter values and a known
distribution. It is also important to note that because the normal distribution is symmetrical, it does not
matter if the z-score is positive or negative when looking up the area between the mean and that z-score. One
standard deviation to the left (negative Z-score) covers the same area as one standard deviation to the right
(positive Z-score). This fact is why the Standard Normal tables do not provide areas for the left side of the
distribution. Because of this symmetry, the Z-score formula is sometimes written as:

Z = |x − µ|/σ (2.14)

where the vertical bars denote the absolute value of the number.
What the standardizing formula is really doing is computing the number of standard deviations X is from
the mean of its own distribution. The standardizing formula and the concept of counting standard deviations
from the mean is the secret of all that we will do in this statistics class. The reason this is true is that all
of statistics boils down to variation, and the counting of standard deviations is a measure of variation.
This formula, in many disguises, will reappear over and over throughout this course.
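
The table lookup and the claim that one standard deviation covers the same area on every normal distribution can both be checked numerically. This is a minimal sketch, assuming Python 3.8 or later for statistics.NormalDist; the helper table_value is an illustrative stand-in for a half-table lookup and is not part of the text.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1


def table_value(z):
    """Area between the mean (zero) and |z| under the standard normal curve.

    This mimics a half-table that lists areas from zero outward, so the
    sign of z does not matter.
    """
    return Z.cdf(abs(z)) - 0.5


# Area from zero to one standard deviation on the standard normal.
area_z = table_value(1.0)

# Same question on a population with mean 10 and standard deviation 5:
# the area from 10 to 15 is also one standard deviation wide.
X = NormalDist(mu=10, sigma=5)
area_x = X.cdf(15) - X.cdf(10)

print(round(area_z, 4))  # 0.3413
print(round(area_x, 4))  # 0.3413
print(table_value(-1.0) == table_value(1.0))  # symmetry: True
```

Because table_value takes the absolute value of z, it reflects the symmetry argument made above: the area between the mean and z is the same on either side.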
Example 2.30
The final exam scores in a statistics class were normally distributed with a mean of 63 and a
standard deviation of five.

Problem 1
a. Find the probability that a randomly selected student scored more than 65 on the exam.
b. Find the probability that a randomly selected student scored less than 85.
Solution
a. Let X = a score on the final exam. X ∼ N(63, 5), where µ = 63 and σ = 5
Draw a graph.


Then, find P(x > 65).


P(x > 65) = 0.3446

Figure 2.15

Z1 = (x1 − µ)/σ = (65 − 63)/5 = 0.4 (2.15)
P(x ≥ x1) = P(Z ≥ Z1) = 0.3446
The probability that any student selected at random scores more than 65 is 0.3446. Here is how
we found this answer.
The normal table provides probabilities from zero to the value Z1. For this problem the question
can be written as: P(X ≥ 65) = P(Z ≥ Z1), which is the area in the tail. To find this area the
formula would be 0.5 − P(63 ≤ X ≤ 65). One half of the probability is above the mean value because
this is a symmetrical distribution. The graph shows how to find the area in the tail by subtracting
from 0.5 the portion between the mean, zero, and the Z1 value. The final answer is: P(X ≥ 65) = P(Z ≥ 0.4) =
0.3446
z = (65 − 63)/5 = 0.4
The area from the mean of zero to Z1 is 0.1554
P(x > 65) = P(z > 0.4) = 0.5 − 0.1554 = 0.3446

Problem 2

Solution
b.
Z = (x − µ)/σ = (85 − 63)/5 = 4.4, which is larger than the maximum value on the Standard Normal Table.
A score of 85 is 4.4 standard deviations above the mean of 63, which is beyond the range of
the standard normal table. Therefore, the probability that one student scores less than 85 is
approximately one (or 100%).
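
Both parts of this example can be cross-checked without a table. The following is a minimal sketch, assuming Python 3.8 or later for statistics.NormalDist; the variable names are illustrative.

```python
from statistics import NormalDist

scores = NormalDist(mu=63, sigma=5)  # final exam scores, X ~ N(63, 5)

# Part a: area to the right of 65 is one minus the area to the left.
p_more_than_65 = 1 - scores.cdf(65)

# Part b: area to the left of 85, which lies 4.4 standard deviations above the mean.
p_less_than_85 = scores.cdf(85)

print(round(p_more_than_65, 4))  # 0.3446
print(round(p_less_than_85, 4))  # 1.0
```

The second result is not exactly one, but it rounds to one at four decimal places, which matches the conclusion that the probability is approximately 100%.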

Exercise 2.3.3.1 (Solution on p. 201.)


The golf scores for a school team were normally distributed with a mean of 68 and a
standard deviation of three.
Find the probability that a randomly selected golfer scored less than 65.

Example 2.31
A personal computer is used for office work at home, research, communication, personal finances,
education, entertainment, social networking, and a myriad of other things. Suppose that the
average number of hours a household personal computer is used for entertainment is two hours per
day. Assume the times for entertainment are normally distributed and the standard deviation for
the times is half an hour.

Problem 1
a. Find the probability that a household personal computer is used for entertainment between 1.8
and 2.75 hours per day.
Solution
a. Let X = the amount of time (in hours) a household personal computer is used for entertainment.
X ∼ N (2, 0.5) where µ = 2 and σ = 0.5.
Find P(1.8 < x < 2.75).
The probability for which you are looking is the area between x = 1.8 and x = 2.75. P(1.8 <
x < 2.75) = 0.5886


Figure 2.16

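P(1.8 ≤ x ≤ 2.75) = P(Z1 ≤ Z ≤ Z2), where Z1 = (1.8 − 2)/0.5 = −0.4 and Z2 = (2.75 − 2)/0.5 = 1.5.
From the normal table, the area from zero to 0.4 is 0.1554 and the area from zero to 1.5 is 0.4332, so the
probability is 0.1554 + 0.4332 = 0.5886.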


The probability that a household personal computer is used between 1.8 and 2.75 hours per
day for entertainment is 0.5886.
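
A probability between two values is the difference of two left-hand areas. The sketch below checks the result above under that approach, assuming Python 3.8 or later for statistics.NormalDist; the variable names are illustrative.

```python
from statistics import NormalDist

usage = NormalDist(mu=2, sigma=0.5)  # hours per day, X ~ N(2, 0.5)

# z-scores of the two endpoints
z1 = (1.8 - usage.mean) / usage.stdev   # about -0.4
z2 = (2.75 - usage.mean) / usage.stdev  # 1.5

# P(1.8 < X < 2.75) = P(X < 2.75) - P(X < 1.8)
p_between = usage.cdf(2.75) - usage.cdf(1.8)

print(round(z1, 2), round(z2, 2))  # -0.4 1.5
print(round(p_between, 4))         # 0.5886
```

Because one endpoint is below the mean and the other above, a half-table lookup adds the two areas, while the cumulative approach used here subtracts them; both give the same probability.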

Problem 2
b. Find the maximum number of hours per day that the bottom quartile of households uses a


personal computer for entertainment.


Solution
b. To find the maximum number of hours per day that the bottom quartile of households uses a
personal computer for entertainment, find the 25th percentile, k, where P(x < k) = 0.25.

Figure 2.17

f(Z) = 0.5 − 0.25 = 0.25, therefore Z ≈ −0.675 (or just −0.67 using the table).
Z = (x − µ)/σ = (x − 2)/0.5 = −0.675, therefore x = −0.675 ∗ 0.5 + 2 = 1.66 hours.
The maximum number of hours per day that the bottom quartile of households uses a personal
computer for entertainment is 1.66 hours.
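
Finding a percentile is the reverse of finding a probability: start from the area and recover the value. The following is a minimal sketch using the inverse cumulative distribution, assuming Python 3.8 or later for statistics.NormalDist; the variable names are illustrative.

```python
from statistics import NormalDist

usage = NormalDist(mu=2, sigma=0.5)  # hours per day, X ~ N(2, 0.5)

# z-score that leaves 25% of the area to its left
z_25 = NormalDist().inv_cdf(0.25)

# 25th percentile k, found two equivalent ways
k_from_z = usage.mean + z_25 * usage.stdev  # x = mu + z * sigma
k_direct = usage.inv_cdf(0.25)              # inverse CDF on X itself

print(round(z_25, 3))      # -0.674
print(round(k_from_z, 2))  # 1.66
print(round(k_direct, 2))  # 1.66
```

The computed z-score of about −0.674 sits between the table-based −0.67 and the interpolated −0.675 used above, and all three lead to the same answer at two decimal places.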

Exercise 2.3.3.2 (Solution on p. 201.)


The golf scores for a school team were normally distributed with a mean of 68 and a
standard deviation of three. Find the probability that a golfer scored between 66 and 70.

Example 2.32
There are approximately one billion smartphone users in the world today. In the United States the
ages 13 to 55+ of smartphone users approximately follow a normal distribution with approximate
mean and standard deviation of 36.9 years and 13.9 years, respectively.

Problem 1
a. Determine the probability that a random smartphone user in the age range 13 to 55+ is between
23 and 64.7 years old.


Solution
a. 0.8186

Problem 2
b. Determine the probability that a randomly selected smartphone user in the age range 13 to
55+ is at most 50.8 years old.
Solution
b. 0.8413
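
The solutions above are stated without working. One way to arrive at them is sketched here, assuming Python 3.8 or later for statistics.NormalDist and taking the stated mean and standard deviation at face value; the variable names are illustrative.

```python
from statistics import NormalDist

ages = NormalDist(mu=36.9, sigma=13.9)  # smartphone user ages in years

# Part a: 23 is one standard deviation below the mean, 64.7 is two above.
z_low = (23 - ages.mean) / ages.stdev
z_high = (64.7 - ages.mean) / ages.stdev
p_between = ages.cdf(64.7) - ages.cdf(23)

# Part b: "at most 50.8" is the area to the left of 50.8 (one standard deviation above).
p_at_most = ages.cdf(50.8)

print(round(z_low, 2), round(z_high, 2))  # -1.0 2.0
print(round(p_between, 4))                # 0.8186
print(round(p_at_most, 4))                # 0.8413
```

Since the endpoints fall on whole numbers of standard deviations, the same answers follow from the table entries for z = 1 and z = 2: 0.3413 + 0.4772 = 0.8185 (0.8186 without table rounding) and 0.5 + 0.3413 = 0.8413.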

Example 2.33
A citrus farmer who grows mandarin oranges finds that the diameters of mandarin oranges
harvested on his farm follow a normal distribution with a mean diameter of 5.85 cm and a standard
deviation of 0.24 cm.

Problem 1
a. Find the probability that a randomly selected mandarin orange from this farm has a diameter
larger than 6.0 cm. Sketch the graph.
Solution


Figure 2.18

Z1 = (6 − 5.85)/0.24 = 0.625 (2.18)
P(x ≥ 6) = P(z ≥ 0.625) = 0.2660

Problem 2
b. The middle 20% of mandarin oranges from this farm have diameters between ______ and
______.


Solution
f(Z) = 0.20/2 = 0.10, therefore Z ≈ ±0.25
Z = (x − µ)/σ = (x − 5.85)/0.24 = ±0.25 → x = ±0.25 ∗ 0.24 + 5.85 = (5.79, 5.91)
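
Both parts of this example can be checked numerically. This is a minimal sketch, assuming Python 3.8 or later for statistics.NormalDist; the variable names are illustrative, and values computed this way may differ from table-based values in the third or fourth decimal place.

```python
from statistics import NormalDist

diameters = NormalDist(mu=5.85, sigma=0.24)  # diameter in cm

# Problem 1: probability that a diameter exceeds 6.0 cm (right-tail area).
p_larger = 1 - diameters.cdf(6.0)

# Problem 2: the middle 20% runs from the 40th to the 60th percentile,
# leaving 10% of the area on each side of the mean.
low = diameters.inv_cdf(0.40)
high = diameters.inv_cdf(0.60)

print(round(p_larger, 3))
print(round(low, 2), round(high, 2))  # 5.79 5.91
```

The two percentiles are symmetric about the mean of 5.85, which mirrors the ± in the hand calculation.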

2.3.3.2 References

Naegele's rule. Wikipedia. Available online at http://en.wikipedia.org/wiki/Naegele's_rule (accessed May
14, 2013).
403: NUMMI. Chicago Public Media & Ira Glass, 2013. Available online at
http://www.thisamericanlife.org/radio-archives/episode/403/nummi (accessed May 14, 2013).
Scratch-Off Lottery Ticket Playing Tips. WinAtTheLottery.com, 2013. Available online at
http://www.winatthelottery.com/public/department40.cfm (accessed May 14, 2013).
Smart Phone Users, By The Numbers. Visual.ly, 2013. Available online at
http://visual.ly/smart-phone-users-numbers (accessed May 14, 2013).
Facebook Statistics. Statistics Brain. Available online at
http://www.statisticbrain.com/facebook-statistics/ (accessed May 14, 2013).

2.3.3.3 Practice questions

Exercise 2.3.3.3 (Solution on p. 201.)


A local bank has determined that the daily balances of the chequing accounts of their customers
are normally distributed with a mean of $280 and a standard deviation of $20.

a. What percentage of their customers have daily balances less than $290?
b. What percentage of their customers have daily balances between $250 and $275?
c. What percentage of their customers have daily balances over $260?
d. The bank is planning a special promotion where it is rewarding its customers whose balances
are in the top 15% with a free toaster. What account balance must a customer achieve in
order to qualify for a free toaster?
e. The middle 68.26% of balances will be between what two amounts?
f. What is the interquartile range for the account balances?

Exercise 2.3.3.4 (Solution on p. 202.)


The Old Baldy Tire Company is introducing a new steel belted radial tire. Old Baldy's engineering
department has estimated that the average life of this tire will be 50,000 km and that the standard
deviation of these tires will be 5000 km. It is assumed that the useful life of these tires follows a
normal distribution.

a. What is the probability that:


i. These tires will last for longer than 60,000 km?
ii. These tires will last for less than 38,000 km?
iii. These tires will last for between 45,000 and 58,000 km?
iv. These tires will last for between 39,000 and 43,000 km?
b. The Old Baldy Tire Company is considering offering a tire guarantee that each new set of
tires will last a certain number of kilometers. If the tires fail to last the specified number
of kilometers a new set of tires will be provided to the purchaser for free. The Old Baldy
Tire Company wants to ensure that no more than 10% of the tires produced qualify for this
guarantee. For how many kilometers should these tires be guaranteed to last?
c. 35% of tires will last less than how many kilometers?

2.4 Chapter 6: Sampling Distribution


2.4.1 Chapter Overview 11
note: By the end of the chapter, the student should be able to:

• Distinguish between a sample, a population and a sampling distribution.


• Know the characteristics of the sampling distribution.
• Recognize sampling distribution problems.
• Apply and interpret the central limit theorem for both means and proportions.

2.4.2 Introduction to Sampling Distributions 12


When we take a random sample from a population, we expect that there is going to be some variability (i.e.
sampling variability) between the information the sample gives us and the whole population. That is, we
might find that the sample mean and the population mean are different. We may also find that if we take
multiple random samples of size n, the sample mean for each sample is different. This chapter
looks at how we can better understand the sampling variability in statistics.
Before we go on, here is a reminder of a few terms and symbols.
A parameter is a descriptive measure of the population (e.g. population mean, population standard
deviation, population proportion).
A statistic is a descriptive measure of the sample (e.g. sample mean, sample standard deviation, sample
proportion).

Measure               Population    Sample

Size                  $N$           $n$
Mean                  $\mu_x$       $\bar{x}$
Standard deviation    $\sigma_x$    $s_x$
Proportion            $\pi$         $\hat{p}$

Table 2.16: Table of important symbols

The population mean, population standard deviation, and sample standard deviation have a subscript
of x to demonstrate that they are the measure for the variable X. Though this is mostly notational, it does
become important later in this chapter.

2.4.2.1 What is a sampling distribution?

Suppose we take many different random samples of 100 university students from a university that has an
equal number of men and women.
The number of women will vary amongst the samples. For example, one sample could have 45 women,
another sample could have 48 women, another sample could have 52 women, etc.
11 This content is available online at <http://cnx.org/content/m64933/1.1/>.
12 This content is available online at <http://cnx.org/content/m64932/1.2/>.


Though it could be possible that we get a random sample that only has 2 women in it, it would be pretty
unlikely. Instead, we would expect that most of the samples would have around 50 women in them, with some
variation around that value.
Figure 1 (Figure 2.19) is the result of a simulation that took 10,000 samples of size 100 from a population that had an
equal number of women and men. The horizontal axis is the number of women in each sample. The height of
each bar is the number of samples that had that many women.

Figure 2.19: Number of women in each sample of size 100

Notice how the most common number of women is around 50 (i.e. the average), but there is variation
from that 50. Most samples have between 40 and 60 women.
The variability among random samples of size n from the same population is called sampling variability.
A probability distribution that characterizes some aspect of sampling variability is termed a sampling
distribution. A sampling distribution is constructed by taking all possible samples of size n from a
population. Then for each sample, a statistic is calculated (e.g. sample mean, sample proportion, sample
standard deviation). The sampling distribution is then created by making a graph of all of these statistics.
Actually constructing a sampling distribution is often very difficult. A medium sized university in Canada
might have 12,000 students. All possible samples of size 100 from that population would result in 5.87 × 10^249
unique samples! Think about that. One billion is 10^9. Google is named after a googol (10^100) because they
wanted Google to be associated with an immense amount of data. Yet a googol is far smaller than the number of
possible samples of size 100 from the medium sized university. If we got a computer to find all possible samples, it would
take it over a billion years to find them 13 ! Therefore, actually constructing a true sampling distribution in
most situations is incredibly hard, incredibly time consuming, and not really worth it. Thus when we talk
about sampling distributions, we talk about a theoretical sampling distribution. That is, we theorize
what this sampling distribution would look like if it were possible to examine all possible samples.
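
The count of possible samples quoted above is the number of ways to choose 100 students out of 12,000. As a rough check, the following sketch uses Python's `math.comb` (available from Python 3.8); the variable names are illustrative.

```python
import math

population_size = 12_000   # students at the medium sized university
sample_size = 100          # students in each sample

# Number of distinct samples of 100 students: "12,000 choose 100".
n_samples = math.comb(population_size, sample_size)

googol = 10 ** 100
print(f"about {n_samples:.2e} possible samples")
print("more than a googol:", n_samples > googol)
```

Python integers have no fixed size limit, so the full 250-digit count is computed exactly before being shown in scientific notation.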
Due to these limitations, we often look at an empirical sampling distribution instead of a theoretical
sampling distribution. An empirical sampling distribution is created by taking many samples from a
population and finding a statistic for each sample, but not doing this for all possible samples. The plot
shown in Figure 1 (Figure 2.19) is an example of an empirical sampling distribution as it only contains 10,000 samples and not all
possible samples. The statistic in Figure 1 (Figure 2.19) is the number of women, but we could have also looked at the proportion
of women.
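
A simulation of this kind can be sketched in a few lines. The code below is not the simulation used to produce the figure; it assumes a hypothetical population of 12,000 students (half women), uses only Python's standard library, and fixes the random seed so the run is repeatable.

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 12,000 students, half women (1) and half men (0).
population = [1] * 6_000 + [0] * 6_000

n_samples = 10_000   # number of samples in the empirical sampling distribution
sample_size = 100    # size of each sample

# For each random sample, record the statistic: the number of women.
women_counts = [sum(random.sample(population, sample_size)) for _ in range(n_samples)]

# Tally how many samples produced each count (the heights of the bars).
distribution = Counter(women_counts)

mean_count = sum(women_counts) / n_samples
share_40_to_60 = sum(1 for c in women_counts if 40 <= c <= 60) / n_samples
print(f"mean number of women: {mean_count:.2f}")
print(f"share of samples with 40 to 60 women: {share_40_to_60:.3f}")
```

Plotting `distribution` as a bar chart would give a picture similar in shape to the figure: centred near 50, with most samples falling between 40 and 60 women.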
In summary, a sampling distribution is a distribution of a statistic. This differs from other distributions,
like the population distribution, which are distributions for individual data values.


2.4.2.2 Why do we care about sampling distributions?

Suppose we take a random sample of 100 students from a medium sized university and we find that 75 of
them are women. Does this call into question the assumption that 50% of the students are women? This is
hard to figure out unless we know how likely it is that we could have found this random sample, assuming
that there are an equal number of men and women.
The sampling distribution helps us find this probability. From the empirical sampling distribution in
Figure 1 (Figure 2.19), we find that the probability of getting a random sample with 75 women, assuming that there are an
equal number of men and women, is 0.0000% to four decimal places. That is, it is really unlikely to get a random sample of 75
women out of 100 if there are an equal number of men and women in the population. Based on this, we can
be fairly confident that this university probably doesn't have an equal number of men and women. Instead,
it is more likely that there are more women than men at this university.
The process described above is called inferential statistics. Inferential statistics is used to make a
conclusion about the population (all students at the university) from a sample (100 students). In general, to
do any form of inferential statistics, we need to use a sampling distribution to either determine how likely or
unlikely a statistic is (in hypothesis testing) or to estimate a parameter from a statistic (confidence intervals).
Thus sampling distributions are the backbone of inferential statistics.
Note: What was described above about the proportion of women at a university should sound familiar.
In Chapter 4, we used the binomial distribution to determine how likely or unlikely events were. The
binomial distribution was helping us understand the sampling distribution of proportions.
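
To connect the university example to the binomial distribution, the probability of 75 women in a sample of 100 can be computed directly. This is a minimal sketch that assumes the binomial model applies (each student sampled is a woman with probability 0.5, independently of the others); the function name is illustrative.

```python
from math import comb

n = 100     # students in the sample
p = 0.5     # assumed proportion of women in the population

def binomial_pmf(k: int) -> float:
    """Probability of exactly k women in a sample of n students."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_exactly_75 = binomial_pmf(75)
p_at_least_75 = sum(binomial_pmf(k) for k in range(75, n + 1))

print(f"P(X = 75)  = {p_exactly_75:.2e}")
print(f"P(X >= 75) = {p_at_least_75:.2e}")
```

Both probabilities are far below one in a million, which is why an empirical sampling distribution built from 10,000 samples would be expected to contain no sample with 75 women.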

2.4.3 Constructing Empirical Sampling Distributions 14


2.4.3.1 How to construct an empirical sampling distribution

If we have access to the population, we can construct an empirical sampling distribution from it. This can be done by
using computer software to pull random samples from a population. An example of one such tool is from the
Rossman Chance website, which has an applet that allows you to create an empirical sampling distribution
from a finite population: http://www.rossmanchance.com/applets/OneSample53.html 15
When constructing an empirical sampling distribution, it is important to keep the law of large numbers
in mind. That is, the more samples you take, the closer the empirical sampling distribution will be to the
theoretical sampling distribution. In general, empirical sampling distributions should be constructed from
at least 10,000 samples.
To get an idea of how an empirical sampling distribution is constructed, go to
http://onlinestatbook.com/stat_sim/sampling_dist/index.html 16
Example 2.34
The images/figures in this example were generated from David Lane's sampling distribution applet
that is part of the OnlineStatBook project 17 .
Figure 1 (Figure 2.20) shows the histogram of the population we are going to generate an
empirical sampling distribution from. We call this population the parent population as it is the
population we are creating the sampling distribution from. Notice that the parent population is
skewed right.
14 This content is available online at <http://cnx.org/content/m64934/1.2/>.
15 http://www.rossmanchance.com/applets/OneSample53.html
16 http://onlinestatbook.com/stat_sim/sampling_dist/index.html
17 Online Statistics Education: A Multimedia Course of Study (http://onlinestatbook.com/). Project Leader: David M. Lane,
Rice University.


Figure 2.20: Parent population

We are going to take multiple samples of size 10 from the parent population and look at the
statistic of the sample mean for each sample.
Here is the first sample:

Figure 2.21: Sample of size 10 from the parent population

This is the sample mean of the sample:

Figure 2.22: Sample mean for one sample of size 10

Now one sample mean is not enough to tell us what the sampling distribution looks like. So
let's take a few more samples. Let's take 5 more samples of size 10 and plot their sample means:


Figure 2.23: Six sample means from parent population

This is still a pretty small sample.

note: There are two sample sizes here. One is the size of the sample we are taking from the parent
population (10). The other is the number of samples we've taken (6). The first is the sample size
for the sample. The second is the sample size for the empirical sampling distribution.

Now let's take 10,000 samples of size 10 from the population and plot each of their sample means.
This is what we get:

Figure 2.24: 10,000 sample means from parent population

Finally, let's take 100,000 samples of size 10 from the population and plot each of their sample
means. This is what we get:

Figure 2.25: 100,000 sample means from parent population

Notice how there is no real difference between the distributions (shape, centre and variation)
in Figure 5 (Figure 2.24) and Figure 6 (Figure 2.25). This means that our empirical distribution is now giving us a good
sense of what the theoretical sampling distribution would look like. This is
called convergence. That is, the empirical sampling distribution is converging on the theoretical
sampling distribution. As the sample size of the empirical sampling distribution increases, this is
expected to happen due to the law of large numbers.
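
The same experiment can be sketched in code. The applet's population is not available here, so the sketch below assumes a hypothetical right-skewed parent population generated from an exponential distribution; only the procedure (many samples of size 10, one sample mean each) follows the example.

```python
import random
from statistics import fmean, pstdev

random.seed(2)  # fixed seed so the sketch is repeatable

# Hypothetical right-skewed parent population (a stand-in for the applet's data).
parent = [random.expovariate(1 / 8) for _ in range(10_000)]

def empirical_sample_means(n_samples, sample_size=10):
    """Draw n_samples random samples and return the mean of each one."""
    return [fmean(random.sample(parent, sample_size)) for _ in range(n_samples)]

summary = {}
for n_samples in (6, 10_000, 100_000):
    means = empirical_sample_means(n_samples)
    summary[n_samples] = (fmean(means), pstdev(means))
    print(f"{n_samples:>7} samples: centre {summary[n_samples][0]:.2f}, "
          f"spread {summary[n_samples][1]:.2f}")
```

With only 6 samples the centre and spread are unstable, while the results for 10,000 and 100,000 samples should be nearly identical, which is the convergence described above.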

2.4.3.1.1 Bootstrapping

Suppose we don't have access to the population. This can happen if the population is infinite (e.g. in
a manufacturing process), if the population is very large (e.g. the population of Canada), or if most
researchers wouldn't have access to the population (e.g. a list of students at a university). Can we still
construct an empirical sampling distribution?
The answer is yes! To do this, we use a process called bootstrapping. Essentially, bootstrapping follows
the same procedure as outlined in Example 1 (Example 2.34), but instead of using a parent population, we
use a parent sample. That is, we take a good sample from the population and use that to construct the
sampling distribution.
Again the law of large numbers applies. If the random sample from the population is large enough, then
the sample will most likely be a good estimate of the population. Then the empirical sampling distribution
generated by the sample will most likely be a good estimate of the theoretical sampling distribution of the
population.

note: Bootstrapping only works if the sample being used has been collected properly: the
sampling technique must ensure that the sample is random, that it is representative of the
population, and that the sample size is large enough. There are no set rules on how big the sample
needs to be, but for bootstrapping the bigger the better.
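
A minimal bootstrapping sketch is shown below. It assumes a hypothetical parent sample of 200 observations (generated here only for illustration) and follows the common convention of resampling with replacement, with each bootstrap sample the same size as the parent sample.

```python
import random
from statistics import fmean, stdev

random.seed(3)  # fixed seed so the sketch is repeatable

# Hypothetical parent sample: 200 observations collected from the population.
parent_sample = [random.gauss(50, 10) for _ in range(200)]

n_resamples = 10_000

# Each bootstrap sample is drawn from the parent sample WITH replacement
# and has the same size as the parent sample.
bootstrap_means = [
    fmean(random.choices(parent_sample, k=len(parent_sample)))
    for _ in range(n_resamples)
]

print(f"parent sample mean:        {fmean(parent_sample):.2f}")
print(f"centre of bootstrap means: {fmean(bootstrap_means):.2f}")
print(f"spread of bootstrap means: {stdev(bootstrap_means):.2f}")
```

The list `bootstrap_means` plays the role of the empirical sampling distribution; how well it represents the theoretical one depends on how good the parent sample is.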

2.4.4 The Central Limit Theorem 18


Another way to determine what the sampling distribution looks like is by using theory. The main theory that
helps us understand the characteristics of the sampling distribution is called the central limit theorem.
The central limit theorem is an incredibly useful and powerful theorem. The theorem tells us about the
distribution of many different sampling distributions. But be careful! The central limit theorem cannot
always be applied, and it only applies to sampling distributions.

2.4.4.1 The central limit theorem for the sampling distribution for sample means

The sampling distribution for the sample means comes from a parent population that is comprised of
quantitative data. Random samples of size n are taken from the parent population and the sample mean is
calculated for each sample. What will the distribution of the sample means look like? That is, what is the
shape of the distribution of sample means, where are the sample means centred, and what is the sampling
variability?

note: The following refers to the theoretical sampling distribution for the sample means. Further,
when sample size is mentioned, it is referring to the size of the sample taken from the population.
That is, it is not referring to how many different random samples have been taken.

18 This content is available online at <http://cnx.org/content/m64935/1.3/>.


2.4.4.1.1 Where are the sample means centred?

As the sample means are estimating the population mean, it makes sense that the sample means are centred
around the population mean.
In the previous section, we saw the right skewed parent population in Figure 1 (Figure 2.20). The
population mean of that parent population is 8.08. Notice that the empirical sampling distributions shown
in Figures 5 (Figure 2.24) and 6 (Figure 2.25) are both centred around 8.08.
In general, the mean of the theoretical sampling distribution for the sample means equals the population
mean.

\[ \mu_{\bar{x}} = \mu_x \tag{2.25} \]

note: The variable for the sample means is $\bar{x}$. That is why the subscript for the mean of the
sample means ($\mu_{\bar{x}}$) has changed.

2.4.4.1.2 What is the sampling variability? (or what is the variation in the sampling distribution)

Based on the law of large numbers, the sampling variability of the sample means will decrease as the sample
size increases. As the sample size increases, the sample means will become better and better estimates of
the population mean and, therefore, there will be less variability between them. That is, there will be more
variability between the sample means for samples of size 2 than there will be for samples of size 30.
Just like we can measure variability for individual data values, we can also measure variability for sample
means. We will use the standard deviation to measure the sampling variability. The standard deviation
of the sampling distribution for sample means is called the standard error of the sample means. It is
found with the following formula:
\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}} \tag{2.25} \]
As the population size (N) increases, $\frac{N - n}{N - 1}$ approaches 1 and no longer affects the standard error.
The formula then simplifies to:
\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \tag{2.25} \]
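
To see how little the population-size term matters for a large population, the two formulas can be compared numerically. The function name and the values below (σ = 20, n = 100, and the population sizes) are hypothetical and chosen only for illustration.

```python
import math
from typing import Optional

def standard_error(sigma: float, n: int, population_size: Optional[int] = None) -> float:
    """Standard error of the sample mean.

    If population_size (N) is given, the factor sqrt((N - n) / (N - 1))
    is applied; otherwise N is treated as very large.
    """
    se = sigma / math.sqrt(n)
    if population_size is not None:
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return se

# Hypothetical values: sigma = 20, samples of size 100.
print(standard_error(20, 100))                          # very large population
print(standard_error(20, 100, population_size=12_000))  # N = 12,000
print(standard_error(20, 100, population_size=200))     # N = 200
```

For N = 12,000 the result is almost the same as the simplified formula, while for N = 200 (where the sample is half the population) the standard error is noticeably smaller.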

2.4.4.1.3 What is the shape of the distribution?

This is actually a really interesting question.


Suppose the parent population looks like this 19 :

19 The images/figures that follow were generated from David Lane's sampling distribution applet that is part of the
OnlineStatBook project.
Online Statistics Education: A Multimedia Course of Study (http://onlinestatbook.com/). Project Leader: David M. Lane,
Rice University.


Figure 2.26: Parent population

What will the sampling distribution for sample means look like?
Here's the answer:

• If the parent population is normal, then the sampling distribution for sample means will be normal.
Always.
• As the size of the samples taken from the parent population increases, the sampling distribution
for sample means becomes more normal.

Since the population in Figure 1 (Figure 2.26) is not normally distributed, we would expect that the sampling
distribution will not be normal for smaller sample sizes, but will be approximately normal for larger sample sizes.

Figure 2.27: Sampling distribution for Figure 1 (Figure 2.26) for samples of size 2

aside: For each of these empirical sampling distributions, 100,000 samples of size n were taken.
Therefore, we can be very confident that the empirical sampling distributions are good
representations of the theoretical sampling distributions.

Figure 2.28: Sampling distribution for Figure 1 (Figure 2.26) for samples of size 5


Figure 2.29: Sampling distribution for Figure 1 (Figure 2.26) for samples of size 10

Figure 2.30: Sampling distribution for Figure 1 (Figure 2.26) for samples of size 16

Figure 2.31: Sampling distribution for Figure 1 (Figure 2.26) for samples of size 20

Figure 2.32: Sampling distribution for Figure 1 (Figure 2.26) for samples of size 25


Figure 1 (Figure 2.26) (the parent population) is not even close to being normal, but notice that as the
sample size increases, the sampling distribution for sample means gets closer and closer to being normally
distributed!
In general, the closer the population is to being normally distributed, the faster the sampling distribution
gets closer to normal. Here, faster means at a smaller sample size.

important: The central limit theorem states that, regardless of the shape of the population, the
sampling distribution for sample means becomes approximately normal as the sample size increases.
As a rule of thumb, if the sample size is greater than 30, the sampling distribution will be approximately normal.
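
The effect of sample size on shape can be illustrated numerically. The sketch below assumes a hypothetical right-skewed parent population (an exponential distribution, not the population in the figures) and measures the skewness of the sample means; a value near 0 indicates a symmetric, more normal-looking shape.

```python
import random
from statistics import fmean, pstdev

random.seed(4)  # fixed seed so the sketch is repeatable

def skewness(values):
    """Sample skewness: 0 for a symmetric distribution, positive when skewed right."""
    centre, spread = fmean(values), pstdev(values)
    return fmean(((v - centre) / spread) ** 3 for v in values)

def sample_means(sample_size, n_samples=20_000):
    """Means of n_samples random samples from a right-skewed (exponential) parent."""
    return [
        fmean(random.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

skew_by_size = {n: skewness(sample_means(n)) for n in (2, 5, 10, 25)}
for n, skew in skew_by_size.items():
    print(f"sample size {n:>2}: skewness of sample means = {skew:.2f}")
```

The skewness should shrink steadily as the sample size grows, mirroring the way the histograms above look more and more normal.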

Measure               Population    Sample       Sampling distribution for the sample mean

Mean                  $\mu_x$       $\bar{x}$    $\mu_{\bar{x}} = \mu_x$
Standard deviation    $\sigma_x$    $s_x$        $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ (standard error)

Table 2.17: Summary of measures

2.4.4.2 The central limit theorem for the sampling distribution for sample proportions

The sampling distribution for the sample proportions comes from a parent population that satisfies the
criteria of the binomial distribution. Random samples of size n are taken from the parent population and
the sample proportion is calculated for each sample. What will the distribution of the sample proportions look
like? That is, what is the shape of the distribution of sample proportions, where are the sample proportions
centred, and what is the sampling variability?
The sampling distribution for sample proportions has similar characteristics to the sampling distribution
for the sample means.

2.4.4.2.1 Where are the sample proportions centred?

They are centred around the population proportion.

2.4.4.2.2 What is the sampling variability?

It decreases as the sample size increases.

2.4.4.2.3 What is the shape?

The shape of the sampling distribution for sample proportions also becomes normal. Unlike for sample
means, though, the normality is not based on sample size alone, but on the expected number of successes (nπ) and
failures (n(1 − π)).
To illustrate, here are the empirical sampling distributions for proportions for various population
proportions. The sample size is 100 in each case and the number of samples taken is 10,000.


Figure 2.33: Empirical sampling distributions for sample proportions

In Figure 8 (Figure 2.33) a, n = 100 and π = 0.01. Therefore, the expected number of successes is 1 and the expected number
of failures is 99. The sampling distribution is skewed to the right.
In Figure 8 (Figure 2.33) b, n = 100 and π = 0.20. Therefore, the expected number of successes is 20 and the expected
number of failures is 80. The sampling distribution is approximately normal.
In Figure 8 (Figure 2.33) c, n = 100 and π = 0.60. Therefore, the expected number of successes is 60 and the expected
number of failures is 40. The sampling distribution is approximately normal.
In Figure 8 (Figure 2.33) d, n = 100 and π = 0.96. Therefore, the expected number of successes is 96 and the expected
number of failures is 4. The sampling distribution is skewed to the left.

important: In general, the shape of the sampling distribution for sample proportions is approximately
normal if the expected number of successes (nπ) and the expected number of failures (n(1 − π)) are both at least 5.
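
The rule of thumb above is easy to express as a check. In this sketch the function name and the `minimum` parameter are illustrative; the four proportions are the ones used in the panels of the figure.

```python
def is_approximately_normal(n: int, pi: float, minimum: float = 5) -> bool:
    """Rule of thumb: expected successes and failures must both be at least `minimum`."""
    expected_successes = n * pi
    expected_failures = n * (1 - pi)
    return expected_successes >= minimum and expected_failures >= minimum

# The four panels of the figure: n = 100 with different population proportions.
for pi in (0.01, 0.20, 0.60, 0.96):
    print(pi, is_approximately_normal(100, pi))
```

The check fails for π = 0.01 (too few expected successes) and for π = 0.96 (too few expected failures), which matches the two skewed panels.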

If the sampling distribution for sample proportions is normal, we can find probabilities for the distribution
using two methods. The first method is using the binomial distribution. The second method is the normal
distribution. This might seem a bit strange as the binomial distribution is for discrete random variables
and the normal distribution is for continuous random variables. In reality, we use the normal
distribution to approximate probabilities for the sampling distribution for sample proportions. This is called the
normal approximation to the binomial distribution. To get the exact probability, one would need
to use the binomial distribution. But this can be cumbersome when the sample sizes are very large (e.g.
1000). Therefore, using the normal distribution can be beneficial, especially because it gives accurate
approximations for large samples. In example 6.4 below we will investigate this further.
Further, when we begin to do inferential statistics, we won't know the population proportion
(otherwise inferential statistics wouldn't be necessary). Since we won't know π, it will be hard to use the binomial
distribution. Therefore, we use the normal approximation to the binomial distribution instead.
If we use a normal approximation to the binomial distribution, we need to know the mean and standard
deviation of the sampling distribution.


The mean of the sampling distribution for sample proportions is the population proportion.

\[ \mu_{\hat{p}} = \pi \tag{2.33} \]
The standard deviation of the sampling distribution for sample proportions (or the standard error of
sample proportions) is found using the following formula:
\[ \sigma_{\hat{p}} = \sqrt{\frac{\pi (1 - \pi)}{n}} \tag{2.33} \]
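
As an illustration of these two formulas, the sketch below compares the normal approximation with the exact binomial probability for n = 100 and π = 0.20. The cut-off of 0.25 for the sample proportion is a hypothetical value chosen for the comparison, and the code uses only Python's standard library.

```python
from math import comb, sqrt
from statistics import NormalDist

n, pi = 100, 0.20          # sample size and population proportion
p_hat = 0.25               # sample proportion of interest (hypothetical)

# Mean and standard error of the sampling distribution for sample proportions.
mean_p_hat = pi
se_p_hat = sqrt(pi * (1 - pi) / n)

# Normal approximation to P(p_hat <= 0.25).
approx = NormalDist(mean_p_hat, se_p_hat).cdf(p_hat)

# Exact binomial probability of at most 25 successes out of 100.
max_successes = round(p_hat * n)
exact = sum(comb(n, k) * pi**k * (1 - pi)**(n - k) for k in range(max_successes + 1))

print(f"standard error: {se_p_hat:.4f}")
print(f"normal approximation: {approx:.4f}, exact binomial: {exact:.4f}")
```

The two probabilities are close but not identical at this sample size; the agreement generally improves as n grows.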

2.4.5 Calculating Probabilities for Sampling Distributions: A Series of Examples 20


If a sampling distribution is normally distributed, then we can find probabilities for the sampling distribution
using the normal distribution just like we did in Chapter 5.
The z-score for the sampling distribution for sample means would be:
\[ Z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu_x}{\frac{\sigma_x}{\sqrt{n}}} \tag{2.33} \]
The z-score for the sampling distribution for sample proportions would be:
\[ Z = \frac{\hat{p} - \mu_{\hat{p}}}{\sigma_{\hat{p}}} = \frac{\hat{p} - \pi}{\sqrt{\frac{\pi (1 - \pi)}{n}}} \tag{2.33} \]
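
The two z-score formulas translate directly into code. In the sketch below the function names are illustrative, the mean example uses hypothetical values (µ = 280, σ = 20, n = 25), and the proportion example revisits the earlier question of 75 women in a sample of 100 when π = 0.50.

```python
from math import sqrt
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def z_for_mean(x_bar, mu, sigma, n):
    """z-score of a sample mean: (x_bar - mu) / (sigma / sqrt(n))."""
    return (x_bar - mu) / (sigma / sqrt(n))

def z_for_proportion(p_hat, pi, n):
    """z-score of a sample proportion: (p_hat - pi) / sqrt(pi * (1 - pi) / n)."""
    return (p_hat - pi) / sqrt(pi * (1 - pi) / n)

# Hypothetical mean example: mu = 280, sigma = 20, n = 25, P(x_bar < 275).
z_mean = z_for_mean(275, 280, 20, 25)
p_mean = standard_normal.cdf(z_mean)

# Proportion example: pi = 0.50, n = 100, P(p_hat >= 0.75).
z_prop = z_for_proportion(0.75, 0.50, 100)
p_prop = 1 - standard_normal.cdf(z_prop)

print(f"z = {z_mean:.2f}, P(x_bar < 275) = {p_mean:.4f}")
print(f"z = {z_prop:.2f}, P(p_hat >= 0.75) = {p_prop:.2e}")
```

Note that the denominator in each case is the standard error, not the standard deviation of individual values; dividing by σ alone is a common mistake in sampling distribution problems.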

Exercise 2.4.5.1 (Solution on p. 202.)


The Old Baldy Tire Company is introducing a new steel belted radial tire. Old Baldy's engineering
department has estimated that the mean life of this tire will be 50,000 km and that the standard
deviation of these tires will be 10,000 km. Suppose a large number of random samples of 100 tires
is taken. The shape of the population distribution is unknown.
1. Can we assume the distribution of the mean life of these tires will be normal? Explain.
2. Regardless of your result in part 1, assume that we are dealing with a normal distribution. Find
the probability that the mean life of a random sample of 100 tires is less than 49,000 km.
3. A competitor of Old Baldy's takes a random sample of 100 tires and finds their mean life
to be 49,000 km. Based on this data, they claim that the engineering department of Old
Baldy's has exaggerated the mean life of their new tires. Do you support the competitor's
claim? Explain.

Exercise 2.4.5.2 (Solution on p. 202.)


The maintenance manager at a popular mountain resort is trying to determine if the aging gondola
is in need of some renovation, or perhaps outright replacement. Right now, the maximum load of
the gondola is 900 kilograms or 12 persons. The manager knows that the average weight of North
Americans has been on the rise for several years and wishes to test what the probabilities might be
of this gondola exceeding the maximum capacity.
Since the operators don't currently look at gender (just numbers), the manager is concerned
about what might happen if the worst-case scenario were to occur: 12 large adult males were allowed
on the gondola at the same time.
To investigate this further the manager did some research into the current average weight of
adult males and discovered that it is about 80 kilograms. He also knows that adult weight tends
to be normally distributed by gender, with a standard deviation for males of about 12 kilograms.
20 This content is available online at <http://cnx.org/content/m64931/1.2/>.


a. Given this information, he first wants to know what the individual weight allowance is (i.e.
the per person average) that the gondola can withstand.
b. He also wants to know how likely it is that the individual weight of any randomly selected
male will exceed the individual weight allowance calculated above.
c. Finally, he wants to know how likely it would be that the average weight of a sample of 12
adult males would exceed the average individual weight allowance.
d. Based on your answers, do you think the manager should renovate the gondola? Is there any
further information that the manager would need?

Exercise 2.4.5.3 (Solution on p. 202.)


The city of Montreal has an extensive bike lane system. In fact, it is one of the largest in North
America. But many cyclists find that even with all of the bike lanes, it is still hard to get around the
city on a bike. In particular, there are many lanes that run east/west, but few that run north/south.
Thus, they are encouraging the city council to focus less on adding lots of kilometers to the system
and more on making sure that the current system properly connects all parts of the city.
The city council will only go forward with this idea if at least 66% of the residents support
focusing on connecting the system rather than expanding the system.
Suppose that 62% of residents do support connecting the system rather than expanding it.
What is the probability that a random sample of 1000 residents will have a sample proportion of
at least 66%?

a. Find the above probability using the binomial distribution.


b. Find the above probability by using the sampling distribution for sample proportions.
c. Compare the two answers. Do they give similar answers?
d. Based on your answers, do you think that it is possible that the city of Montreal will choose
to focus on connecting the bike path system?

Exercise 2.4.5.4 (Solution on p. 203.)


Video games are gaining more and more popularity. Children often try to convince their parents
to buy games even when they are not appropriate. For example, they may want to play a very
violent game that is not appropriate for their age group. To help parents out, video games have
rating categories to suggest age appropriateness. But how aware are parents of these categories?
To investigate this, you conduct a survey of Canadian families that have young children who
play video games. You show parents three video game covers that have the category rating clearly
marked on it. You then ask the parents whether the games would be appropriate for children and
why. If the parent correctly identifies which games are appropriate for their children and refers to
the ratings in making their choice, you categorize the parents as well informed.
Suppose that we want to use your results to justify the claim that less than 30 percent of parents
are well informed about video game ratings. In your random sample of 1000 parents, you actually
found that 27 percent of the parents that you polled were well informed about video game ratings.

a. Assuming that the proportion of parents that are well informed about video game ratings is
30%, what is the probability that you would observe a sample proportion of less than 27%?
Use the normal approximation of the binomial distribution to find your answer.
b. Based on your results, do you believe that this is enough evidence to suggest that less than
30% of parents are well informed about video game ratings? Explain your answer.


2.4.5.1 Practice questions

The following practice questions are from Lyryx Learning, Business Statistics I  MGMT 2262  Mt Royal
University  Version 2016 Revision A. OpenStax CNX. Sep 8, 2016 http://cnx.org/contents/f3aefa9e-58d2-
[email protected]
Use the following information to answer the next ten exercises: A manufacturer produces 25-pound lifting
weights. The lowest actual weight is 24 pounds, and the highest is 26 pounds. Each weight is equally likely
so the distribution of weights is uniform. A sample of 100 weights is taken. The standard deviation is 0.58
pounds.
Exercise 2.4.5.5 (Solution on p. 203.)

a. What is the distribution for the weight of one 25-pound lifting weight? What is the mean
and standard deviation?
b. What is the distribution for the mean weight of 100 25-pound lifting weights?
c. Find the probability that the mean actual weight for the 100 weights is less than 24.9.

Exercise 2.4.5.6 (Solution on p. 203.)


Find the probability that the mean actual weight for the 100 weights is greater than 25.2.
Exercise 2.4.5.7 (Solution on p. 203.)
Find the 90th percentile for the mean weight for the 100 weights.
Exercise 2.4.5.8 (Solution on p. 203.)
Suppose that the distance of fly balls hit to the outfield (in baseball) is normally distributed with
a mean of 250 feet and a standard deviation of 50 feet. We randomly sample 49 fly balls.

a. What is the probability that the 49 balls traveled an average of less than 240 feet?
b. Find the 80th percentile of the distribution of the average of 49 fly balls.

Exercise 2.4.5.9 (Solution on p. 203.)


According to the Internal Revenue Service, the average length of time for an individual to complete
(keep records for, learn, prepare, copy, assemble, and send) IRS Form 1040 is 10.53 hours (without
any attached schedules). The distribution is unknown. Let us assume that the standard deviation
is two hours. Suppose we randomly sample 36 taxpayers.

a. Would you be surprised if the 36 taxpayers finished their Form 1040s in an average of more
than 12 hours? Explain why or why not in complete sentences.
b. Would you be surprised if one taxpayer finished his or her Form 1040 in more than 12 hours?
In a complete sentence, explain why.

Exercise 2.4.5.10 (Solution on p. 203.)


Suppose that a category of world-class runners are known to run a marathon (26 miles) in an
average of 145 minutes with a standard deviation of 14 minutes. Consider 49 of the races. Let $\bar{X}$ =
the average of the 49 races.

a. Find the probability that the runner will average between 142 and 146 minutes in these 49
marathons.
b. Find the 80th percentile for the average of these 49 marathons.
c. Find the median of the average running times.

Exercise 2.4.5.11 (Solution on p. 203.)


Determine which of the following are true and which are false. Then, in complete sentences, justify
your answers.


a. When the sample size is large, the mean of $\bar{X}$ is approximately equal to the mean of X.
b. When the sample size is large, $\bar{X}$ is approximately normally distributed.
c. When the sample size is large, the standard deviation of $\bar{X}$ is approximately the same as the
standard deviation of X.


Solutions to Exercises in Chapter 2


to Exercise 2.1.2.1 (p. 94)

a. S = {(1,1), (1,2), (1,3), (1,4), (2,1), (2,2), (2,3), (2,4), (3,1), (3,2), (3,3), (3,4)}
b. A = {(1,1), (1,3), (2,2), (2,4), (3,1), (3,3)}

B = {(2,1), (2,2), (2,3), (2,4), (3,1), (3,2), (3,3), (3,4)}


c. P(A) = 1/2, P(B) = 2/3
d. A AND B = {(2,2), (2,4), (3,1), (3,3)}

A OR B = {(1,1), (1,3), (2,1), (2,2), (2,3), (2,4), (3,1), (3,2), (3,3), (3,4)}
e. P(A AND B) = 1/3, P(A OR B) = 5/6
f. B′ = {(1,1), (1,2), (1,3), (1,4)}, P(B′) = 1/3
g. P(B) + P(B′) = 1
h. P(A|B) = P(A AND B)/P(B) = 1/2, P(B|A) = P(A AND B)/P(A) = 2/3, No.
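
The set calculations in this solution can be checked by brute-force enumeration. The following is only an illustrative sketch: it copies the outcomes of S, A and B as listed above, assumes all 12 outcomes are equally likely, and uses a helper named `prob` that is our own invention rather than anything from the text.

```python
from fractions import Fraction

# Sample space and events copied from the solution above.
# Assumption: the 12 outcomes are equally likely.
S = {(i, j) for i in (1, 2, 3) for j in (1, 2, 3, 4)}
A = {(1, 1), (1, 3), (2, 2), (2, 4), (3, 1), (3, 3)}
B = {(i, j) for (i, j) in S if i >= 2}  # same 8 outcomes as the listed set B


def prob(event, given=None):
    """Classical probability of `event`, optionally conditional on `given`."""
    space = S if given is None else given
    return Fraction(len(event & space), len(space))


p_a = prob(A)
p_b = prob(B)
p_a_and_b = prob(A & B)
p_a_or_b = prob(A | B)
p_a_given_b = prob(A, given=B)
p_b_given_a = prob(B, given=A)

print(p_a, p_b, p_a_and_b, p_a_or_b, p_a_given_b, p_b_given_a)
```

Using `Fraction` keeps every answer exact, so the results can be compared directly with the fractions in parts c to h.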

Solution to Exercise 2.1.2.2 (p. 97)

a. P(L′) = P(S)
b. P(M OR S)
c. P(F AND L)
d. P(M |L)
e. P(L|M )
f. P(S|F)
g. P(F|L)
h. P(F OR L)
i. P(M AND S)
j. P(F)

Solution to Exercise 2.1.2.4 (p. 97)


P(N) = 15/42 = 5/14 = 0.36
Solution to Exercise 2.1.2.6 (p. 97)
P(C) = 5/42 = 0.12
Solution to Exercise 2.1.2.8 (p. 98)
P(G) = 20/150 = 2/15 = 0.13
Solution to Exercise 2.1.2.10 (p. 98)
P(R) = 22/150 = 11/75 = 0.15
Solution to Exercise 2.1.2.12 (p. 98)
P(O) = (150−22−38−20−28−26)/150 = 16/150 = 8/75 = 0.11
Solution to Exercise 2.1.2.14 (p. 98)
P(E) = 47/194 = 0.24
Solution to Exercise 2.1.2.16 (p. 98)
P(N) = 23/194 = 0.12
Solution to Exercise 2.1.2.18 (p. 98)
P(S) = 12/194 = 6/97 = 0.06
Solution to Exercise 2.1.2.20 (p. 98)
13/52 = 1/4 = 0.25
Solution to Exercise 2.1.2.22 (p. 98)
3/6 = 1/2 = 0.5
Solution to Exercise 2.1.2.24 (p. 99)
P(R) = 4/8 = 0.5


Solution to Exercise 2.1.2.26 (p. 99)


P(O OR H )
Solution to Exercise 2.1.2.28 (p. 100)
P(H |I)
Solution to Exercise 2.1.2.30 (p. 100)
P(N |O)
Solution to Exercise 2.1.2.32 (p. 100)
P(I OR N )
Solution to Exercise 2.1.2.34 (p. 100)
P(I)
Solution to Exercise 2.1.2.36 (p. 100)
The likelihood that an event will occur given that another event has already occurred.
Solution to Exercise 2.1.2.38 (p. 100)
1
Solution to Exercise 2.1.2.40 (p. 100)
the probability of landing on an even number or a multiple of three
Solution to Exercise 2.1.2.42 (p. 101)

a. You can't calculate the joint probability without knowing the probability of both events occurring, which is not
in the information given; the probabilities should be multiplied, not added; and probability is never
greater than 100%.
b. A home run by definition is a successful hit, so he has to have at least as many successful hits as home
runs.

to Exercise 2.1.3.1 (p. 102)

a. With replacement
b. No

Solution to Exercise 2.1.3.2 (p. 103)


without replacement: 1. Possible; 2. Impossible, 3. Possible
with replacement: 1. Possible; 2. Possible, 3. Possible
to Exercise 2.1.3.3 (p. 104)

a. P(F) = 1/4
b. P(G) = 1/2
c. P(H) = 1/2
d. Yes
e. No

to Exercise 2.1.3.4 (p. 105)


P(A|B) = P(A AND B)/P(B) = 0.08/0.2 = 0.4 = P(A)
The events are independent because P(A|B) = P(A).
to Exercise 2.1.3.5 (p. 106)
Event G and O = {G1, G3}
P(G and O) = 2/10 = 0.2
to Exercise 2.1.3.6 (p. 106)

a. P(B|D) = 0.6667
b. P(D|B) = 0.5
c. No
d. No


to Exercise 2.1.3.7 (p. 107)


P(B|A) = 0.67
P(B) = 0.25
So P(B) does not equal P(B|A) which means that B and A are not independent (wearing blue and
rooting for the away team are not independent). They are also not mutually exclusive, because P(B AND
A) = 0.20, not 0.
to Exercise 2.1.3.8 (p. 108)
Because P(I AND F) = 0,
P(I OR F) = P(I) + P(F) - P(I AND F) = 0.44 + 0.56 - 0 = 1
Solution to Exercise 2.1.3.9 (p. 109)

a. P(T) = 1/4
b. P(T|F) = 1/2
c. No
d. No
e. Yes
to Exercise 2.1.3.11 (p. 110)
P(J) = 0.3
to Exercise 2.1.3.13 (p. 110)
P(Q AND R) = P(Q)P(R)
0.1 = (0.4)P(R)
P(R) = 0.25
Solution to Exercise 2.1.3.15 (p. 111)
0
Solution to Exercise 2.1.3.17 (p. 111)
0.3571
to Exercise 2.1.3.19 (p. 111)
0.2142
to Exercise 2.1.3.21 (p. 111)
Physician (83.7)
to Exercise 2.1.3.23 (p. 112)
83.7 − 79.6 = 4.1
to Exercise 2.1.3.25 (p. 112)
P(Occupation < 81.3) = 0.5
to Exercise 2.1.3.27 (p. 112)

a. P(C) = 0.4567
b. not enough information
c. not enough information
d. No, because over half (0.51) of men have at least one false positive test
to Exercise 2.1.3.29 (p. 112)

a. P(J OR K) = P(J) + P(K) − P(J AND K); 0.45 = 0.18 + 0.37 − P(J AND K); solve to find P(J
AND K) = 0.10
b. P(NOT (J AND K)) = 1 - P(J AND K) = 1 - 0.10 = 0.90
c. P(NOT (J OR K)) = 1 - P(J OR K) = 1 - 0.45 = 0.55
to Exercise 2.1.4.1 (p. 114)
P(D |C) = 0.85
P(C ∩ D) = P(D ∩ C)
P(D ∩ C) = P(D|C)P(C) = (0.85)(0.75) = 0.6375
Helen makes the first and second free throws with probability 0.6375.


to Exercise 2.1.4.2 (p. 116)


P = (200−140−40)/200 = 20/200 = 0.1
to Exercise 2.1.4.3 (p. 116)

a. P(B ∩ D) = P(D |B)P(B) = (0.5)(0.4) = 0.20.


b. P(B ∪ D) = P(B) + P(D) − P(B ∩ D) = 0.40 + 0.30 − 0.20 = 0.50
to Exercise 2.1.4.4 (p. 118)
Let A = student is a senior going to college.
Let B = student plays sports.
P(A) = 140/200
P(B|A) = 50/140

P(A ∩ B) = P(B|A)P(A)

P(A ∩ B) = (50/140)(140/200) = 50/200 = 1/4
to Exercise 2.1.4.5 (p. 118)

a. P(B′) = 0.60
b. P(D ∩ B) = P(D|B)P(B) = 0.20
c. P(B|D) = P(B ∩ D)/P(D) = (0.20)/(0.30) = 0.67
d. P(D ∩ B′) = P(D) − P(D ∩ B) = 0.30 − 0.20 = 0.10
e. P(D|B′) = P(D ∩ B′)/P(B′) = (P(D) − P(D ∩ B))/(0.60) = (0.10)/(0.60) = 0.17
Solution to Exercise 2.1.4.7 (p. 120)
0.376
Solution to Exercise 2.1.4.9 (p. 120)
C|L means, given the person chosen is a Latino Californian, the person is a registered voter who prefers life
in prison without parole for a person convicted of first degree murder.
Solution to Exercise 2.1.4.11 (p. 120)
L ∩ C is the event that the person chosen is a Latino California registered voter who prefers life without
parole over the death penalty for a person convicted of first degree murder.
Solution to Exercise 2.1.4.13 (p. 120)
0.6492
Solution to Exercise 2.1.4.15 (p. 120)
No, because P(L ∩ C) does not equal 0.
Solution to Exercise 2.1.4.17 (p. 121)

a. The Forum Research surveyed 1,046 Torontonians.


b. 58%
c. 42% of 1,046 = 439 (rounding to the nearest integer)
d. 0.57
e. 0.60.
Solution to Exercise 2.1.4.19 (p. 122)

a. P(Betting on two lines that touch each other on the table) = 6/38
b. P(Betting on three numbers in a line) = 3/38
c. P(Betting on one number) = 1/38
d. P(Betting on four numbers that touch each other to form a square) = 4/38
e. P(Betting on two numbers that touch each other on the table) = 2/38
f. P(Betting on 0-00-1-2-3) = 5/38
g. P(Betting on 0-1-2; or 0-00-2; or 00-2-3) = 3/38

Solution to Exercise 2.1.4.21 (p. 123)

a. {G1, G2, G3, G4, G5, Y 1, Y 2, Y 3}


b. 5/8
c. 2/3
d. 2/8
e. 6/8
f. No, because P(G ∩ E) does not equal 0.
Solution to Exercise 2.1.4.23 (p. 123)

Note: The coin toss is independent of the card picked first.

a. {(G,H), (G,T), (B,H), (B,T), (R,H), (R,T)}
b. P(A) = P(blue)P(head) = (3/10)(1/2) = 3/20
c. Yes, A and B are mutually exclusive because they cannot happen at the same time; you cannot pick
a card that is both blue and also (red or green). P(A ∩ B) = 0
d. No, A and C are not mutually exclusive because they can occur at the same time. In fact, C includes
all of the outcomes of A; if the card chosen is blue it is also (red or blue). P(A ∩ C) = P(A) = 3/20

Solution to Exercise 2.1.4.25 (p. 124)

a. S = {(HHH), (HHT), (HTH), (HTT), (THH), (THT), (TTH), (TTT)}
b. 4/8
c. Yes, because if A has occurred, it is impossible to obtain two tails. In other words, P(A ∩ B) = 0.
Solution to Exercise 2.1.4.27 (p. 124)

a. If Y and Z are independent, then P(Y ∩ Z) = P(Y )P(Z), so P(Y ∪ Z) = P(Y ) + P(Z) - P(Y )P(Z).
b. 0.5
Solution to Exercise 2.1.4.29 (p. 124)
a. iii; a. i; a. iv; a. ii
Solution to Exercise 2.1.4.31 (p. 125)

a. P(R) = 0.44
b. P(R|E) = 0.56
c. P(R|O) = 0.31
d. No, whether the money is returned is not independent of which class the money was placed in. There
are several ways to justify this mathematically, but one is that the money placed in economics classes
is not returned at the same overall rate; P(R|E) ≠ P(R).
e. No, this study definitely does not support that notion; in fact, it suggests the opposite. The money
placed in the economics classrooms was returned at a higher rate than the money placed in all classes
collectively; P(R|E) > P(R).
Solution to Exercise 2.1.4.33 (p. 126)

a. P(type O ∪ Rh-) = P(type O) + P(Rh-) - P(type O ∩ Rh-)


0.52 = 0.43 + 0.15 − P(type O ∩ Rh-); solve to find P(type O ∩ Rh-) = 0.06
6% of people have type O, Rh- blood
b. P(NOT(type O ∩ Rh-)) = 1 - P(type O ∩ Rh-) = 1 - 0.06 = 0.94
94% of people do not have type O, Rh- blood
Solution to Exercise 2.1.4.35 (p. 126)

a. Let C = the event that the cookie contains chocolate. Let N = the event that the cookie contains
nuts.
b. P(C ∪ N ) = P(C) + P(N ) - P(C ∩ N ) = 0.36 + 0.12 - 0.08 = 0.40


c. P(NEITHER chocolate NOR nuts) = 1 - P(C ∪ N ) = 1 - 0.40 = 0.60

to Exercise 2.1.5.1 (p. 128)

a. P(athlete stretches before exercising) = 350/800 = 0.4375
b. P(athlete stretches before exercising|no injury in the last year) = 295/514 = 0.5739

Solution to Example 2.20, Problem 2 (p. 129)


b.

1. P(F AND C) = 18/100 = 0.18
2. P(F)P(C) = (45/100)(34/100) = (0.45)(0.34) = 0.153

P(F AND C) ≠ P(F)P(C), so the events F and C are not independent.

Solution to Example 2.20, Problem 3 (p. 129)


c.

1. The word 'given' tells you that this is a conditional.


2. P(M|L) = 25/41
3. No, the sample space for this problem is the 41 hikers who prefer lakes and streams.

Solution to Example 2.20, Problem 4 (p. 129)


d.

1. P(F) = 45/100
2. P(P) = 25/100
3. P(F AND P) = 11/100
4. P(F OR P) = 45/100 + 25/100 − 11/100 = 59/100

Solution to Exercise 2.1.5.2 (p. 130)

a. P(H|M) = 52/90 = 0.5778
b. For M and H to be independent, show P(H|M) = P(H)

P(H|M) = 0.5778, P(H) = 90/200 = 0.45

P(H |M ) does not equal P(H ) so M and H are NOT independent.

Solution to Example 2.21, Problem 1 (p. 130)


a.

Door Choice

Caught or Not   Door One   Door Two   Door Three   Total
Caught          1/15       1/12       1/6          19/60
Not Caught      4/15       3/12       1/6          41/60
Total           5/15       4/12       2/6          1

Table 2.18

to Exercise 2.1.5.3 (p. 132)


Weight/Height Tall Medium Short Totals

Obese 18 28 14 60
Normal 20 51 28 99
Underweight 12 25 9 46
Totals 50 104 51 205

Table 2.19

a. Row Totals: 60, 99, 46. Column totals: 50, 104, 51.
b. P(Tall) = 50/205 = 0.244
c. P(Obese AND Tall) = 18/205 = 0.088
d. P(Tall|Obese) = 18/60 = 0.3
e. P(Obese|Tall) = 18/50 = 0.36
f. P(Tall AND Underweight) = 12/205 = 0.0585
g. No. P(Tall) does not equal P(Tall|Obese).
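
The marginal, joint and conditional probabilities above all come from the same counts, so they can be computed straight from Table 2.19. This is a hedged sketch rather than part of the original solution: the dictionary layout and variable names are our own choices.

```python
from fractions import Fraction

# Counts from Table 2.19: rows are weight classes, columns are height classes.
table = {
    "Obese":       {"Tall": 18, "Medium": 28, "Short": 14},
    "Normal":      {"Tall": 20, "Medium": 51, "Short": 28},
    "Underweight": {"Tall": 12, "Medium": 25, "Short": 9},
}

row_totals = {w: sum(cols.values()) for w, cols in table.items()}
col_totals = {h: sum(table[w][h] for w in table) for h in ("Tall", "Medium", "Short")}
n = sum(row_totals.values())  # grand total

p_tall = Fraction(col_totals["Tall"], n)                                   # marginal
p_obese_and_tall = Fraction(table["Obese"]["Tall"], n)                     # joint
p_tall_given_obese = Fraction(table["Obese"]["Tall"], row_totals["Obese"])  # conditional
p_obese_given_tall = Fraction(table["Obese"]["Tall"], col_totals["Tall"])

# Tall and Obese are independent only if P(Tall) equals P(Tall|Obese).
independent = p_tall == p_tall_given_obese

print(n, float(p_tall), float(p_tall_given_obese), independent)
```

The conditional probabilities simply swap the denominator from the grand total to the relevant row or column total.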

Solution to Example 2.23, Problem 1 (p. 133)


a. B1R1; B1R2; B1R3; B2R1; B2R2; B2R3; B3R1; B3R2; B3R3; B4R1; B4R2; B4R3; B5R1; B5R2;
B5R3; B6R1; B6R2; B6R3; B7R1; B7R2; B7R3; B8R1; B8R2; B8R3

Solution to Example 2.23, Problem 7 (p. 134)


g. P(B on 2nd draw|R on 1st draw) = 8/11

There are 9 + 24 outcomes that have R on the first draw (9 RR and 24 RB). The sample space is then
9 + 24 = 33. 24 of the 33 outcomes have B on the second draw. The probability is then 24/33 = 8/11.
to Exercise 2.1.5.4 (p. 134)
Total number of outcomes is 144 + 480 + 480 + 1600 = 2,704.
P(FF) = 144/(144 + 480 + 480 + 1,600) = 144/2,704 = 9/169
Solution to Example 2.24, Problem 2 (p. 136)
b. P(RB OR BR) = (3/11)(8/10) + (8/11)(3/10) = 48/110

Solution to Example 2.24, Problem 3 (p. 137)


c. P(R on 2nd|B on 1st) = 3/10

to Exercise 2.1.5.5 (p. 138)

a. P(FN OR NF) = 480/2,652 + 480/2,652 = 960/2,652 = 80/221
b. P(N|F) = 40/51
c. P(at most one face card) = (480 + 480 + 1,560)/2,652 = 2,520/2,652
d. P(at least one face card) = (132 + 480 + 480)/2,652 = 1,092/2,652
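
The counts 2,652, 480, 1,560 and 132 are the numbers of ordered two-card outcomes when drawing without replacement. The sketch below reconstructs them under the assumption (inferred from those counts, not stated in this solution) that the deck has 52 cards of which 12 are face cards; all variable names are illustrative.

```python
from fractions import Fraction

# Assumption: standard 52-card deck with 12 face cards, two cards drawn without replacement.
DECK, FACE = 52, 12
NON_FACE = DECK - FACE

total = DECK * (DECK - 1)        # ordered pairs of distinct cards
ff = FACE * (FACE - 1)           # face, then face
fn = FACE * NON_FACE             # face, then non-face
nf = NON_FACE * FACE             # non-face, then face
nn = NON_FACE * (NON_FACE - 1)   # non-face, then non-face

p_exactly_one_face = Fraction(fn + nf, total)
p_n_given_f = Fraction(NON_FACE, DECK - 1)   # 51 cards remain after a face card is drawn
p_at_most_one_face = Fraction(fn + nf + nn, total)
p_at_least_one_face = Fraction(ff + fn + nf, total)

print(total, ff, fn, nn)
print(p_exactly_one_face, p_n_given_f, p_at_most_one_face, p_at_least_one_face)
```

The four branch counts add up to `total`, which is a quick way to confirm that no outcome was missed in the tree diagram.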

to Exercise 2.1.5.6 (p. 139)


(4/7)(3/6) + (3/7)(4/6)
Solution to Exercise 2.2.2.1 (p. 143)

1. The situation is a binomial distribution because:


• It represents a random variable as the sample is randomly selected.


• It is a discrete variable as we are counting the number of business travellers. X is the number of
business travellers in the sample.
• There is a fixed number of trials (n = 20).
• There are only two options: the passenger is a business traveller (success) or they are not a business
traveller (failure).
• In a random sample, whether one passenger is a business traveller does not affect the probability of
another passenger being a business traveller. Therefore, the probability of success remains constant:
π = 30% = 0.3

2. Use a computer program to come up with the following output.

X P(X) P (X ≤ x)
0 0.00080 0.00080
1 0.00684 0.00764
2 0.02785 0.03548
3 0.07160 0.10709
4 0.13042 0.23751
5 0.17886 0.41637
6 0.19164 0.60801
7 0.16426 0.77227
8 0.11440 0.88667
9 0.06537 0.95204
10 0.03082 0.98286
11 0.01201 0.99486
12 0.00386 0.99872
13 0.00102 0.99974
14 0.00022 0.99996
15 0.00004 0.99999
16 0.00001 1.00000
17 0.00000 1.00000
18 0.00000 1.00000
19 0.00000 1.00000
20 0.00000 1.00000

Table 2.20

a. P(X=7) = 0.16426
b. P(10 ≤ X ≤ 14) = 0.04792 (highlight the values in the column P(X) for X from 10 to 14, then look
at the Sum in the lower right)
c. P(X ≥ 11) = 0.01714 (highlight the values in the column P(X) for X from 11 and higher, then look
at the Sum in the lower right)
d. Change π to 0.7 and re-run the computer program, then look at the row where X is 5. P(X = 5) = 0.00004


3. The mean is the same as the expected value, which is 6.0, and the standard deviation is 2.049. This
gives us a typical range of 3.951 to 8.049 for the number of business passengers in a random
sample of 20 passengers.
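
The "computer program" output in Table 2.20 can be approximated with a few lines of code. This is a minimal sketch using only the Python standard library, not the software the text refers to; `binom_pmf` is a helper name we made up, and n = 20 and π = 0.3 are taken from the solution above.

```python
from math import comb, sqrt


def binom_pmf(k, n, p):
    """P(X = k) for a binomial variable with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 20, 0.3
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

p_exactly_7 = pmf[7]            # part a
p_10_to_14 = sum(pmf[10:15])    # part b: X = 10, 11, 12, 13, 14
p_at_least_11 = sum(pmf[11:])   # part c

mean = n * p                    # expected value
sd = sqrt(n * p * (1 - p))      # standard deviation

# Print the table: x, P(X = x), P(X <= x)
cumulative = 0.0
for k, prob_k in enumerate(pmf):
    cumulative += prob_k
    print(f"{k:2d} {prob_k:.5f} {cumulative:.5f}")
```

Summing slices of `pmf` plays the same role as highlighting a range of cells and reading the Sum in the spreadsheet.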

Solution to Exercise 2.2.2.2 (p. 147)

a. We would expect 0.9 give or take 0.939 to be audited. This means a typical range is 0 to 1.8 residents
to be audited out of 45.
b. Since 4 is outside of the range, it would be deemed atypical, but that does not necessarily mean that
it is abnormal.
c. Since the employee at I&S Square wants to show that something strange is happening in Seba Beach,
they would want to assume that nothing strange is happening. That is, the rate of audits is the same in
Seba Beach as anywhere in Canada. This means we want to assume π = 2%, where π is the proportion
of people who are audited.
d. The binomial distribution:

• The variable being studied is random: We are looking at a random sample.


• The outcomes of the variable are being counted: We are counting the number of audits.
• There are a fixed number of trials: We are looking at 45 residents.
• There are only two possible outcomes: Either the resident is audited or they are not.
• The n trials are independent and the probability of success and probability of failure remain constant:
This is true because we are assuming that the probability of being audited remains constant at 2%.

e. We need to nd the probability that at least 4 out of 45 residents are audited, assuming an audit rate
of 2%.
f. P (X ≥ 4 given π = 2%) = 0.01242 = 1.24% (from computer program with n = 45, π (or probability
of occurrence) = 2%).
g. The probability that at least 4 out of 45 Seba Beach residents are audited, under the assumption that
the audit rate is 2%, is 1.24%.
h. Since the probability that we observed our sample data is between 1% and 10%, then we have to
determine if the probability is unlikely or not unlikely. Since it is closer to 1% than 10%, we can
say that the sample data is unlikely to have occurred under the assumption. Therefore, the evidence
suggests that there is something wrong with the assumption. That is, there is evidence that the
residents of Seba Beach are being audited at a higher rate than the rest of Canada.
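
Parts e to g amount to an upper-tail binomial probability computed under the assumption π = 2%. The sketch below shows one way to reproduce that number with the standard library; the function name `binom_tail_at_least` is hypothetical, and the 1% and 10% cut-offs are the ones used in this solution.

```python
from math import comb


def binom_tail_at_least(k, n, p):
    """P(X >= k) for a binomial variable with n trials and success probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumed audit rate of 2%, 45 residents, at least 4 audited (values from the solution).
p_value = binom_tail_at_least(4, 45, 0.02)

if p_value < 0.01:
    verdict = "unlikely"
elif p_value > 0.10:
    verdict = "not unlikely"
else:
    verdict = "between 1% and 10%: judgment call"

print(f"P(X >= 4) = {p_value:.5f} ({verdict})")
```

The result lands in the grey zone between the two thresholds, which is why part h has to argue that it is closer to 1% than to 10%.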

Solution to Exercise 2.2.2.3 (p. 148)


To be skeptical, we want to assume that the program has not worked (i.e. π stayed at 60%). The evidence
is 41 out of 60 customers gave a rating of 4 or 5. Perfect evidence that the program worked would be 60
out of 60 happy customers in every single sample. The probability that we would observe at least 41 out
of 60 customers who gave a score of four or five regarding their overall satisfaction, assuming that the new
program has not worked, is 11.70% (from a computer program with n=60, π =0.60). Therefore, it is not
unlikely that we observed the evidence that we did, under the assumption the program did not work. This
means that we cannot conclude that the program worked.
Solution to Exercise 2.2.2.4 (p. 149)
X = the number that reply yes
Solution to Exercise 2.2.2.5 (p. 149)
0, 1, 2, 3, 4, 5, 6, 7, 8
Solution to Exercise 2.2.2.6 (p. 149)


x P (X = x)
0 0.00005
1 0.0009
2 0.0080
3 0.0395
4 0.1227
5 0.2439
6 0.3030
7 0.2151
8 0.0668

Table 2.21

Solution to Exercise 2.2.2.7 (p. 150)


5.7
Solution to Exercise 2.2.2.8 (p. 150)
1.2795
Solution to Exercise 2.2.2.9 (p. 150)
0.4151
Solution to Exercise 2.2.2.10 (p. 150)
0.9990
Solution to Exercise 2.2.2.11 (p. 150)

a. X = the number of students who will attend Tet.


b. 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
c. 2.16
d. 0.9511
e. 0.3702

Solution to Exercise 2.2.2.12 (p. 150)


d. 5.54
Solution to Exercise 2.2.2.13 (p. 150)
a
Solution to Exercise 2.2.2.14 (p. 151)
c
Solution to Exercise 2.2.2.15 (p. 151)

a. X = the number of audits in a 20-year period


b. 0, 1, 2, . . ., 20
c. 0.4
d. 0.6676
e. 0.0071

Solution to Exercise 2.2.2.16 (p. 151)



a. Mean = np = 150(0.09) = 13.5; Standard Deviation = $\sqrt{npq} = \sqrt{150(0.09)(0.91)} \approx 3.5050$

b. P(x = 15) = 0.0988


c. P(x ≤ 10) = 0.1987
d. P(x > 25) = 0.0009


Solution to Exercise 2.2.2.17 (p. 151)

1. Observational unit: Five-minute make-up kit; Variable: Did it sell or not; Categorize: Categorical
2. They want to show that the new packaging will decrease sales.
3. They need to assume the opposite of what they want to show. Therefore, they need to assume that the
new packaging does not decrease sales. Therefore, the proportion of kits sold stays the same at 68%.
4. They have found that out of 500 kits supplied, 306 of them have been sold.
5. They first need to start with an assumption (π = 68%). Then they need to come up with a model based
on this assumption. Once they have the model, they will use it to find the probability that the stores
sold at most 306 out of 500 kits, assuming that the new packaging has not decreased sales (i.e. π stayed
at 68%). Once they have the probability, they need to determine whether the event is likely or unlikely.
An event is unlikely if the probability is less than 1%. An event is likely if the probability is more
than 10%. If the event is unlikely, then it means that it is unlikely we observed the evidence under the
assumption. Since we know the evidence actually happened, that makes us question the assumption.
Thus, it is unlikely the assumption is true based on the evidence. If the event is likely to happen,
then the assumption is likely to be true based on the evidence.
6. • Is the data randomly collected? Most likely not. The 500 kits that we are looking at were not
randomly selected.
• Is the data discrete (countable)? As we are counting the number of kits that are sold the data is
discrete.
• Are the events independent? This may be a fair assumption for this study. Most likely the sale
of one kit is not dependent on whether another kit is sold. Though if two friends buy the kits
together or someone buys a bunch as presents, this is not the case, but in general it is more likely
independent than dependent.
• Are there a fixed number of trials? In this case, the number of trials would be the 500 kits with
the new packaging.
• Are there two possible outcomes? Either a kit is sold or it is not.
7. P (X ≤ 306) = 0.00077 = 0.077%
8. The probability that we observed at most 306 out of 500 kits sold, assuming the rate of sales is
68%, is 0.077%.
9. Since the probability is less than 1%, it is very unlikely that we would have observed this evidence
under the assumption. Since we actually observed the evidence but assumed that the rate was 68%,
what we have assumed is called into question. Therefore, it is unlikely that the assumption is true.
Therefore, it is likely that the new packaging has resulted in a decrease in sales.
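
Because the claim here is a decrease, the relevant probability is a lower tail: the chance of selling at most 306 of 500 kits if the sales rate were still 68%. The following is an illustrative sketch only; `binom_cdf` is our own helper name, and the numbers 306, 500 and 0.68 are taken from steps 4 and 5 of this solution.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for a binomial variable with n trials and success probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(0, k + 1))


# Assumption being tested: the sales rate stayed at 68%.
p_lower_tail = binom_cdf(306, 500, 0.68)

print(f"P(X <= 306) = {p_lower_tail:.5f}")
print("below the 1% threshold:", p_lower_tail < 0.01)
```

Python's exact integer `comb` avoids the overflow problems that a naive factorial formula would run into at n = 500.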
Solution to Exercise 2.2.2.18 (p. 152)
Observational unit: People who are familiar with Striking Donkey Coffee; Variable: Whether they prefer
Logo 2; Type of variable: Categorical.

1. They need to assume the opposite of what they want to show. This means they need to assume that
Logo 2 is NOT preferred significantly more than Logo 1. This would mean they are preferred equally.
Therefore, there is a 50% chance that someone will choose Logo 2.
2. • The data is collected randomly: Yes. It is a random sample of participants.
• The outcomes are counted: Yes. They count how many people like Logo 2.
• There are two possible outcomes: Yes. Either they prefer Logo 2 or they do not.
• There are a fixed number of trials: Yes. They asked 40 people.
• The trials are independent of each other: Yes. It is fair to assume that no participant's preference
is based on another participant's preference.
3. The more unlikely it is that we observed our evidence, the smaller the probability will be. This means,
the smaller the probability, the more unlikely it is that the assumption (i.e. that there is no preference
between the logos) is true. Since the marketers want clear evidence that there is a preference, they
want a smaller probability, which would show it is unlikely that there is no preference between the logos.
The level of significance is the threshold between likely and unlikely. Thus, if they want clear evidence,
they want to set their threshold high, meaning they want to make it a small number. Since the level
of significance is between 1% and 10%, the lowest level of significance (meaning the highest threshold
of evidence) is at 1%.
4. P (X ≥ 26) = 0.04035 = 4.04%
5. The probability that we observed at least 26 out of 40 people who preferred Logo 2, assuming that
there was no preference between the logos, is 4.04%.
6. Since the probability is greater than 1% (it is 4.04%), it is not unlikely that we observed at least
26 out of 40 people who preferred Logo 2, assuming that there was no preference between the logos.
Therefore, we do not reject that there was no preference between the logos. This suggests that Logo 2
is NOT preferred significantly more than Logo 1.

to Exercise 2.3.2.1 (p. 157)


1.5, left, 16
to Exercise 2.3.2.2 (p. 159)
The z-score for x1 = 325 is z1 = −1.14.
The z-score for x2 = 366.21 is z2 = −1.14.
Student 2 scored closer to the mean than Student 1 and, since they both had negative z-scores, Student
2 had the better score.
to Exercise 2.3.2.3 (p. 159)
between 20 and 30.
to Exercise 2.3.2.4 (p. 159)

a. About 68% of the values lie between the values 41 and 63. The z-scores are −1 and 1, respectively.
b. About 95% of the values lie between the values 30 and 74. The z-scores are −2 and 2, respectively.
c. About 99.7% of the values lie between the values 19 and 85. The z-scores are −3 and 3, respectively.

Solution to Exercise 2.3.2.5 (p. 160)


0.67, right
Solution to Exercise 2.3.2.6 (p. 160)
3.14, left
Solution to Exercise 2.3.2.7 (p. 160)
about 68%
Solution to Exercise 2.3.2.8 (p. 160)
about 95.44%
Solution to Exercise 2.3.2.9 (p. 160)
about 4%
Solution to Exercise 2.3.2.10 (p. 160)
b
Solution to Exercise 2.3.2.11 (p. 161)
c
Solution to Exercise 2.3.2.12 (p. 161)
b
to Exercise 2.3.3.1 (p. 165)
normalcdf(−10^99, 65, 68, 3) = 0.1587
to Exercise 2.3.3.2 (p. 167)
normalcdf(66,70,68,3) = 0.4950
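
For readers without a graphing calculator, the two `normalcdf` results above can be reproduced with Python's standard library. This is a sketch under the assumption that `normalcdf(lower, upper, mean, sd)` returns the area under the normal curve between `lower` and `upper`; the variable names are ours.

```python
from statistics import NormalDist

# Normal distribution with mean 68 and standard deviation 3 (values from the calls above).
dist = NormalDist(mu=68, sigma=3)

# Equivalent of normalcdf(-10^99, 65, 68, 3): everything to the left of 65.
p_below_65 = dist.cdf(65)

# Equivalent of normalcdf(66, 70, 68, 3): area between 66 and 70.
p_66_to_70 = dist.cdf(70) - dist.cdf(66)

print(round(p_below_65, 4), round(p_66_to_70, 4))
```

The huge negative lower bound on the calculator is just a stand-in for negative infinity, which is why a single `cdf` call is enough for the first probability.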
Solution to Exercise 2.3.3.3 (p. 170)

a. 0.6915
b. 0.3345
c. 0.8413
d. $300.70
e. $260 to $300


f. IQR = 293.5-266.5 = $27


Solution to Exercise 2.3.3.4 (p. 170)

a. i. 0.0228
ii. 0.0082
iii. 0.7865
iv. 0.0669
b. 56407.8 km
c. 48073.4 km
Solution to Exercise 2.4.5.1 (p. 182)

a. Yes. As the sample size is greater than 30 (it is 100), we can assume that the sampling distribution of
the sample mean lifespan of the tires is normally distributed regardless of the shape of the population
distribution due to the central limit theorem.
b. $\bar{x}$ = mean lifespan of 100 Old Baldy tires, $\mu_{\bar{x}} = \mu_x = 50{,}000$, $\sigma_{\bar{x}} = \sigma_x/\sqrt{n} = 10{,}000/10 = 1000$. Since we
know that the sampling distribution is normally distributed, we can use a computer program to calculate the probability
$P(\bar{X} < 49{,}000)$. From the computer program, we get $P(\bar{X} < 49{,}000) = 15.87\%$.
c. No. The probability that a random sample of 100 Old Baldy tires has a mean lifespan of less than 49,000 km is
15.87% (assuming Old Baldy's claim). This means that this event is likely to occur (as it is greater than
10%) under the assumption that the tires last on average 50,000 km, so it does not provide sufficient
evidence against Old Baldy's claim.
Solution to Exercise 2.4.5.2 (p. 182)

a. 900kg/12 people = 75kg/person


b. Since this is about an individual, we will use $\mu_x$ and $\sigma_x$. As stated in the question, we know that
the population is normally distributed. From this, a computer program calculated that P (X ≥ 75) =
66.15%, where X is the weight of an individual person on the gondola.
c. Now, we are being asked about the mean for 12 people. Therefore, this question is about finding a
probability for the sampling distribution for sample means. Therefore, we will use $\bar{x}$ = mean weight
of 12 people, $\mu_{\bar{x}} = \mu_x = 80$, $\sigma_{\bar{x}} = \sigma_x/\sqrt{n} = 12/3.46 = 3.46$. Since the population distribution is normal,
we know that the sampling distribution will also be normal (regardless of the sample size). Therefore,
we can use a computer program to calculate the probability $P(\bar{X} > 75)$. We get 92.55%.
d. The probability found in c) is the probability that the average mass of 12 adult males will exceed the
maximum individual weight for the gondola. The next question is how likely is it that there will be
12 adult males on the gondola? The manager should do further research to determine this before
making a decision. While waiting for the results, the manager should implement a policy where any
large groups of males are broken up and are required to take the lift in separate groups. I.e. break up
a group of 12 males into two groups of 6 males.
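
The probabilities in parts b and c can be checked with any program that evaluates normal probabilities. The sketch below uses Python's standard-library `statistics.NormalDist` as one possible "computer program" (the course does not name a specific program, so this choice is an assumption). The mean of 80 kg and standard deviation of 12 kg are the values used in the solution above.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 80, 12, 12   # population mean and sd (kg), number of passengers

# Part b: one individual, so use the population distribution directly.
individual = NormalDist(mu, sigma)
p_individual = 1 - individual.cdf(75)

# Part c: the mean of 12 people, so use the sampling distribution of the
# sample mean, whose standard deviation is the standard error sigma/sqrt(n).
standard_error = sigma / sqrt(n)
sample_mean = NormalDist(mu, standard_error)
p_mean = 1 - sample_mean.cdf(75)

print(f"P(X >= 75)    = {p_individual:.4f}")
print(f"P(X-bar > 75) = {p_mean:.4f}")
```

The only difference between the two calculations is the standard deviation passed to the normal model: the population standard deviation for an individual, and the smaller standard error for the mean of 12 people.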
Solution to Exercise 2.4.5.3 (p. 183)

a. Since we are using the binomial distribution, we are being asked to find the probability that at least
660 of the 1000 people in the poll will want to focus on connecting the system. The 660 comes from
66% of 1000. In other words, we are asked to find $P(X \geq 660)$, with n = 1000 and π = 62%. Using
a computer program, this yields a probability of 0.48%. This is found by highlighting all of the values
from 660 upward, including 660.
b. Since we are using the sampling distribution for sample proportions, we are asked to find the probability
that the sample proportion will be at least 66%. In other words, we are asked to find $P(\hat{p} \geq 0.66)$.
We can assume the sampling distribution for sample proportions is normal as the number of successes
($n\pi = 1000 \times 0.62 = 620$) and the number of failures ($n(1 - \pi) = 1000 \times 0.38 = 380$) are both at
least five. Therefore, we will use the normal distribution to find the probability with
$\mu_{\hat{p}} = \pi = 0.62$ and
$\sigma_{\hat{p}} = \sqrt{\frac{\pi(1-\pi)}{n}} = \sqrt{\frac{0.62(1-0.62)}{1000}} = 0.01535$.
Therefore, using a computer program we find $P(\hat{p} \geq 0.66) = 0.46\%$.
c. The two probabilities are quite close. They are only about 0.02% apart. Therefore, the two methods give us
similar results.
d. If the proportion of residents who want to focus on connecting the bike system is 62%, it is unlikely
that a poll of 1000 people would result in a sample proportion of at least 66%. Therefore, it is unlikely
that the city of Montreal will choose to focus on connecting the system.
Solution to Exercise 2.4.5.4 (p. 183)

a. Since we are using the sampling distribution for sample proportions, we are asked to find the probability
that the sample proportion will be at most 27%. In other words, we are asked to find $P(\hat{p} \leq 0.27)$.
We can assume the sampling distribution for sample proportions is normal as the number of successes
($n\pi = 1000 \times 0.30 = 300$) and the number of failures ($n(1 - \pi) = 1000 \times 0.70 = 700$) are both at
least five. Therefore, we will use the normal distribution to find the probability with
$\mu_{\hat{p}} = \pi = 0.30$ and
$\sigma_{\hat{p}} = \sqrt{\frac{\pi(1-\pi)}{n}} = \sqrt{\frac{0.3(1-0.3)}{1000}} = 0.01449$.
Therefore, using a computer program we find $P(\hat{p} \leq 0.27) = 1.92\%$.
b. Since the probability that we would observe a sample proportion of at most 27% (assuming a population
proportion of 30%) is 1.92%, it is unlikely we would have observed this evidence if the assumption
is true. Therefore, it is more likely that the population proportion is less than 30%. Thus there is
enough evidence to suggest that less than 30% of parents are well informed about video game ratings.
Solution to Exercise 2.4.5.5 (p. 184)

a. Uniform with a mean of 25 and a standard deviation of 0.58 pounds. Remember when a distribution
is uniform all of the values are equally likely. Therefore the mean will be halfway between the lowest
value (24) and the highest value (26).
b. Normal with a mean of 25 and a standard deviation of 0.0577
c. 0.0416
Solution to Exercise 2.4.5.6 (p. 184)
0.0003
Solution to Exercise 2.4.5.7 (p. 184)
25.07
Solution to Exercise 2.4.5.8 (p. 184)

a. 0.0808
b. 256.01 feet
Solution to Exercise 2.4.5.9 (p. 184)

a. Yes. I would be surprised, because the probability is almost 0.


b. No. I would not be totally surprised because the probability is 0.2312
Solution to Exercise 2.4.5.10 (p. 184)

a. 0.6247
b. 146.68
c. 145 minutes
Solution to Exercise 2.4.5.11 (p. 184)

a. True. The mean of a sampling distribution of the means is approximately the mean of the data
distribution.


b. True. According to the Central Limit Theorem, the larger the sample, the closer the sampling
distribution of the means becomes normal.
c. The standard deviation of the sampling distribution of the means will decrease as the sample size
increases, because it equals the standard deviation of X divided by $\sqrt{n}$.


Chapter 3

Business Statistics - Module 3 - Confidence Intervals and Hypothesis Tests

3.1 Chapter 7: Confidence Intervals

3.1.1 Introduction to Confidence Intervals¹
From Chapter 6, we know that if we take many samples of the same size from a population and calculate the
sample means, the sample means will be clustered around the population mean, but many of them won't be
exactly the same as the population mean. Therefore, we can estimate the population mean using a sample
mean, but we expect there to be a certain amount of error in that estimate. To determine that error, we
can look at the standard error. That is, we can look at the amount of variation between the sample means.
In this chapter, we will use this information about how sample means behave to help us make estimates
about the population mean of unknown populations. We will also do this with sample proportions and
population proportions. That is, the goal of this chapter is to make inferences about the population from
sample data. This is our first foray into inferential statistics.
By the end of this section, the student should be able to:

• Find and interpret confidence intervals that estimate the population mean and the population
proportion.
• Understand the properties of the Student-t distribution.
• Determine, for confidence intervals for the population mean, whether to use the Student-t
distribution or the standard normal distribution as a model.
• Find the minimum sample size needed to estimate a parameter given a margin of error.

3.1.2 What are Confidence Intervals?²


Suppose you are trying to determine the mean rent of a two-bedroom apartment in your town. You might
look in the classified section of the newspaper, write down several rents listed, and average them together.
This provides a point estimate of the true mean. If you are trying to determine the percentage of times you
make a basket when shooting a basketball, you might count the number of shots you make and divide that
by the number of shots you attempted. In this case, you would have obtained a point estimate for the true
proportion.
1 This content is available online at <http://cnx.org/content/m65024/1.1/>.
2 This content is available online at <http://cnx.org/content/m65023/1.2/>.
A point estimate is a single value used to estimate a population parameter. For example, the sample
mean is a point estimate of the population mean. But point estimates do not give a sense of how much
error there is in an estimate. Thus, we instead want to provide an interval estimate for the population
parameter that takes this error into account. The type of interval estimate we will learn about in this
chapter is called a confidence interval.
From our work on sampling distributions, we know that the sample mean probably won't be exactly the
population mean. Instead we expect it to be slightly larger or smaller than the population mean. But by
how much? The margin of error, denoted E, measures how much we expect the statistic to vary from the
parameter. The margin of error is computed by looking at how much variation is in the sampling distribution
and the level of confidence (discussed below).
To calculate a confidence interval, you take the statistic and you add and subtract the margin of error
from it. For example, if you are trying to estimate the population mean, you would take the sample mean
and add and subtract the margin of error from it: $(\bar{x} - E, \bar{x} + E)$. This gives an interval of values
that you expect the population mean to fall between.
Example 3.1
A recent opinion poll asked Canadians their opinion of the work of the current Prime Minister of
Canada. 53% of Canadians approved of his work with a margin of error of 2.6%. The statistic is
a sample proportion of 53% and we are trying to estimate the true proportion of Canadians who
approved of the Prime Minister's work. We know that there will be error in that estimate and it has
been measured to be 2.6%. Therefore, we are estimating that the true proportion of all Canadians
who approve of the Prime Minister's work is 53% ± 2.6%, that is, between 50.4% and 55.6%.
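
The arithmetic in Example 3.1 can be sketched in a few lines of Python. The function name `interval_from_margin` is a hypothetical helper introduced only for illustration; it simply subtracts the margin of error from the point estimate and adds it to the point estimate.

```python
def interval_from_margin(point_estimate, margin_of_error):
    """Return (lower, upper) for point estimate ± margin of error."""
    lower = point_estimate - margin_of_error
    upper = point_estimate + margin_of_error
    return lower, upper

# Example 3.1: sample proportion of 53% with a margin of error of 2.6%
lower, upper = interval_from_margin(0.53, 0.026)
print(f"{lower:.1%} to {upper:.1%}")
```

The same helper applies to a sample mean: only the point estimate and the margin of error change.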

note: Confidence intervals change depending on the sample, but the parameter being
estimated is fixed. For example, on a specific day, the population mean rent of a two-bedroom
apartment in your town is a specific value. You are trying to estimate it, but it is fixed. The
confidence interval, on the other hand, changes depending on the sample you take. Suppose instead
of looking at the classified section of a newspaper, you looked at a rental website. Then the
sample might be different, which will result in a different confidence interval. Or suppose you stood
outside a mall entrance and asked every fifth person what they paid in rent for their two-bedroom
apartment. Then your sample would be different, which will result in a different confidence interval.
These three different confidence intervals are all estimating the same thing, the population mean
rent of a two-bedroom apartment in your town, but since each of the samples is different, the
sample means will be different, which will result in different estimates. In short, the parameter
being estimated is not a random variable. But the confidence interval being used to estimate the
parameter varies depending on the random sample taken.

In the following sections, we will learn how to calculate the margin of error for the mean and proportion.
For each situation, we will use a different model to find the margin of error. It should be noted that all of
the models are based on the assumption that a random sample has been collected. Therefore, finding a
confidence interval based on the convenience sample of the rent in today's classified ads is not appropriate.
This is important to remember when you are critically assessing a confidence interval provided to you. No
matter how prettily the confidence interval is presented, if it was constructed from a non-random sample, it
is useless. It is like baking an apple pie from rotten apples. It might look good, but it is still rotten.

3.1.2.1 Why is it called a confidence interval?

If you are trying to estimate how much it will cost to go on a trip to Montreal for five days, you can work
out with strong confidence the cost of the flight and hotels, but then you have to start making estimates
about how much food and entertainment will cost while you're there. You can get a pretty good estimate of
what it will cost, but your friend who you are trying to convince to come with you might want to know how
confident you are in that estimate. Are you the kind of person who just guesses at the cost of meals or did
you look at restaurants' menus to come up with a sense of what meals cost in Montreal? Did you take into
account snacks? The cost of renting a car or taking the bus? Did you assume you were going to do an equal
number of free and paid admission activities? All of this affects the confidence you have in your estimate.
For a confidence interval, it is much easier to determine how much confidence we have in our estimate
because confidence intervals come with a level of confidence (or confidence level).
To understand the confidence level, let's go back to the two-bedroom apartment situation. Let's now
suppose that 100 people on the same day were very curious about determining the mean rent for two-bedroom
apartments in your town. Each of these 100 people went out and found their own random sample of fifty
people who rent two-bedroom apartments in your town. From these 100 samples, 100 confidence intervals
were calculated. Based on our work on sampling distributions, we know that the 100 sample means will
be close to the population mean (some might even be the same as the population mean), but some will
be closer and some will be farther. Thus some of the confidence intervals will be 'good' estimates of the
population mean rent for two-bedroom apartments (that is, the population mean will actually be included
in the confidence interval) and some will be 'bad' estimates (that is, the population mean won't actually
be included in the confidence interval). Since the population mean is unknown, none of the 100 people who
made these confidence intervals knows if their estimate is good or bad. Instead, they can only state how
confident they are in their estimate. That is, they can only state their level of confidence.
Suppose that all 100 people made 95% confidence intervals. What does that mean? Well, suppose a local
real estate company has actually worked out the population mean rent for two-bedroom apartments in your
town by finding out the rent for all two-bedroom apartments. Since they know the population mean, they
don't have to estimate it. They have found it to be $1200.
Figure 3.1 shows the 100 confidence intervals created by the 100 random samples and compares them to
the population mean. If the interval is yellow, then it is a good estimate. If it is red, then it is a bad
estimate. The yellow part in the middle represents the 95% confidence interval. The yellow
and the blue combined represent the 99% confidence interval.


Figure 3.1: 100 confidence intervals generated from 100 random samples of the rent of two-bedroom
apartments in your town

The above image was created using an applet from David Lane's onlinestatbook.com³
Notice that out of the 100 confidence intervals calculated, 93 of them are good estimates (contain $1200)
and seven of them are bad estimates (do not contain $1200). This is what the confidence level refers to.
That is, if you take many, many random samples of the same size and construct a confidence interval for
each of the samples, then the percentage of confidence intervals that contain the population mean is 95%
and the percentage that do not contain the population mean is 5%. Thus, the confidence level refers to the
probability that the process of creating a confidence interval results in the population parameter being in
the confidence interval. It is NOT the probability that the population mean falls in a specific confidence
interval. Remember that the population mean is fixed. Therefore, either the population mean does fall in
the confidence interval or it doesn't. Since there is no randomness to whether it does fall or not, there is
no probability associated with that event. Instead the level of confidence refers to the percent of confidence
intervals that contain the parameter being estimated if the study/experiment is repeated many, many times.
What has been described above is not an easy idea. Many people who have studied statistics are under
the false impression that the confidence level refers to the probability that the parameter is in the confidence
interval. Don't fret if this doesn't make entire sense to you right away. Give yourself some time to think
about it and process it.
3 Online Statistics Education: A Multimedia Course of Study (http://onlinestatbook.com/). Project Leader: David M. Lane,
Rice University.
As a note, the example provided in Figure 3.1 may seem a bit surprising, since only 93 of the 100 intervals
contain the population mean. If you flip a fair coin 100 times, you would expect around 50 heads and 50
tails, but due to sampling variability it would not be unusual to get 49 heads and 51 tails. It is the same
with confidence intervals: we expect that for 100 confidence intervals around 95 of them contain the
population mean and 5 of them don't, but it would not be unusual to get 94 good estimates and 6 bad ones.
Once again, the law of large numbers tells us that as the number of samples increases, the closer the
percentage of good estimates will get to 95%. That is, if we take 1000 random samples instead of 100, it is
more likely that close to 95% will be good estimates and close to 5% will be bad.
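
The repeated-sampling idea can be sketched as a small simulation. This is only an illustration under stated assumptions: the population of rents is taken to be normal with the mean of $1200 from the text and a hypothetical standard deviation of $300, each sample has 50 rents, and each interval is the sample mean plus or minus 1.96 standard errors (the 95% margin of error for a known population standard deviation, which is developed in Section 3.1.4). The function name `coverage` and the seed are arbitrary choices.

```python
import random
from math import sqrt

def coverage(num_samples, n=50, mu=1200, sigma=300, z=1.96, seed=1):
    """Fraction of intervals (sample mean ± z * sigma/sqrt(n)) that contain mu."""
    rng = random.Random(seed)           # fixed seed so the run is repeatable
    margin = z * sigma / sqrt(n)        # same margin of error for every sample
    hits = 0
    for _ in range(num_samples):
        # Draw one random sample of n rents and compute its mean.
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        # A "good" interval is one that contains the population mean.
        if sample_mean - margin <= mu <= sample_mean + margin:
            hits += 1
    return hits / num_samples

for repetitions in (100, 1000, 10000):
    print(repetitions, coverage(repetitions))
```

With only 100 intervals the fraction of good estimates can easily differ from 0.95 by a few percentage points; with many more intervals it settles close to 0.95.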

3.1.2.2 Common choices for confidence levels

The most common choices for confidence levels are 90%, 95%, and 99%, but you can choose the level of
confidence to be any percentage between 0.00001% and 99.99999%. You can't choose 100%, because that
would mean you know for sure that the population parameter falls within the confidence interval. You also
can't choose 0%, because that would mean you know for sure that the population parameter does not fall
within the confidence interval. If you knew for sure the parameter falls (or does not fall) in the confidence
interval, you wouldn't be bothering to construct a confidence interval, because you would already know the
parameter. 90%, 95%, and 99% are common levels of confidence because they offer a high degree of confidence.
How does the confidence level change the confidence interval? Think about the following two confidence
intervals for the mean age of students at your university:

4 years old to 85 years old

20 years old to 21 years old

Which confidence interval are you more confident actually contains the population mean? Well, it is
pretty likely that the population mean age of students at your university is somewhere between 4 years old
and 85 years old, because the range is so wide that it most likely 'catches' the population mean.
In general, the larger the confidence level, the wider the confidence interval. That is, to increase the
confidence in the estimate, we make the confidence interval wider so that it is more likely to catch what we
are estimating. Think about the confidence interval like a net. The smaller the net, the less likely it is you'll
catch the fish. But the wider the net, the more likely it is that you will. Thus for the same sample, the 90%
confidence interval is narrower than the 99% confidence interval.
Thus, a 99% confidence interval is very reliable, but it gains reliability at the price of precision. That is,
its width might come at the expense of usefulness. Going back to the confidence interval for the mean age
of students at your university, we can be very confident that the population mean age is between 4 and 85
years old, but that doesn't actually help us understand what the population mean age is. We are less confident
in the estimate of 20 to 21 years old, but it provides more useful information.
To summarize, higher degrees of confidence mean that we are more sure that the parameter falls in the
interval (i.e. more reliable). Lower degrees of confidence mean that the interval is narrower and thus gives us
a better idea of where the parameter in question is (i.e. more precise). See Figure 3.2.


Figure 3.2: Comparing different levels of confidence for the same random sample

The choice of a 95% level of confidence is most common because it provides a good balance between
precision and reliability.

3.1.2.3 What else affects the width of a confidence interval?

The width of the confidence interval is determined by the margin of error, E. In general, the confidence
interval is calculated as follows:

(point estimate − E, point estimate + E)

The size of the margin of error determines the width of the confidence interval. That is, the bigger the
margin of error is, the wider the confidence interval.
Factors that affect the width of the confidence interval include the size of the sample, the amount of
variability in the data, and the confidence level.
As per the law of large numbers, the larger the sample size, the closer the statistic (or point estimate) is
to the parameter. Therefore, the larger the sample size, the less error there is between the statistic and the
parameter. This means that the margin of error is smaller for larger sample sizes taken from the
same population.
The greater the variability in the population, the greater the variability in the statistics. We saw this in
Chapter 6 when we determined that the standard deviation of the sampling distribution was related both to
the standard deviation of the population and the sample size. That is, the variation between the statistics
relied both on the variation in the population and the sample size. Thus, the margin of error is larger
in situations where there is more variability in the population.
As stated above, the larger the confidence level, the wider the confidence interval. Therefore, the margin
of error is larger for larger levels of confidence.
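
The three effects can be illustrated numerically. The sketch below is an assumption-laden preview rather than part of the original text: it uses the margin of error for a mean with a known population standard deviation, E = z × σ/√n, which is developed in Section 3.1.4, and the numbers (σ = 10 or 20, n = 100 or 400) are hypothetical. The critical value z comes from Python's standard-library `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(confidence, sigma, n):
    """E = z * sigma / sqrt(n), with z the two-tailed standard normal critical value."""
    alpha = 1 - confidence                      # total area in the two tails
    z = NormalDist().inv_cdf(1 - alpha / 2)     # leaves alpha/2 in the upper tail
    return z * sigma / sqrt(n)

baseline = margin_of_error(0.95, sigma=10, n=100)
bigger_sample = margin_of_error(0.95, sigma=10, n=400)       # larger n
more_variability = margin_of_error(0.95, sigma=20, n=100)    # larger sigma
higher_confidence = margin_of_error(0.99, sigma=10, n=100)   # larger confidence level

print(f"baseline (95%, sigma=10, n=100): {baseline:.3f}")
print(f"larger sample (n=400):           {bigger_sample:.3f}")
print(f"more variability (sigma=20):     {more_variability:.3f}")
print(f"higher confidence (99%):         {higher_confidence:.3f}")
```

Quadrupling the sample size halves the margin of error, doubling the variability doubles it, and raising the confidence level from 95% to 99% increases it.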

3.1.2.4 Common misconceptions about confidence intervals

1. The confidence interval contains 95% of the data values. A confidence interval is an estimate
for a parameter (like the population mean or population proportion). Though the data values are used
to construct the confidence interval, the confidence interval does not tell us anything about the range
of the data values.
2. We are 95% confident that the sample mean is contained in the confidence interval. If
the confidence interval is for the population mean, then the sample mean has to be in the confidence
interval. In fact, it is right in the middle. Remember that the confidence interval for the population
mean is calculated as follows: $(\bar{x} - E, \bar{x} + E)$. All confidence intervals contain the point estimate
being used to construct the confidence interval.
3. Increasing the sample size increases the width of the confidence interval. In fact, the
opposite happens. From the law of large numbers, we know that a larger sample size means that the
point estimate will likely be closer to the parameter being estimated. Therefore, as the sample size
increases, the margin of error decreases and the width of the confidence interval decreases.
4. A 90% confidence interval is wider than a 95% confidence interval for the same data. Again, it
is the opposite that happens. To become more confident in our estimate (i.e. increasing the level of
confidence), we widen the confidence interval. A wider confidence interval is a larger net, which makes
it more likely that we catch the parameter we are estimating.

3.1.3 The Basic Premise of Constructing a Confidence Interval⁴


In the above section, we discussed at length what a confidence interval is. Now we are going to discuss how
to construct and interpret one.
A confidence interval is constructed by taking the point estimate and adding and subtracting the margin
of error. The margin of error is constructed by looking at the level of confidence and the amount of variation
between the point estimates. For example, the margin of error for a confidence interval for a population
mean is found by looking at the level of confidence (which the researcher determines) and the amount of
variation between the sample means. The amount of variation between the sample means is the amount
of variation in the sampling distribution for sample means, i.e. the standard error. Thus a confidence
interval is always constructed from the appropriate sampling distribution.
This is helpful in two ways:

• From our work in Chapter 6, we know what the standard error is for both the sample mean
($\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{n}}$) and the sample proportion
($\sigma_{\hat{p}} = \sqrt{\frac{\pi(1-\pi)}{n}}$).
• From our work in Chapter 6, we know what the shape of the sampling distribution will be from the
Central Limit Theorem.

The margin of error is found by taking into account the confidence level and the standard
error.
The next section examines how the margin of error is constructed for confidence intervals for the mean.
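
As a reminder of the two standard error formulas from Chapter 6, here is a minimal Python sketch. The function names and the example numbers (σ = 12 with n = 36, and π = 0.5 with n = 100) are hypothetical and chosen only because the arithmetic is easy to check by hand.

```python
from math import sqrt

def standard_error_mean(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / sqrt(n)

def standard_error_proportion(pi, n):
    """Standard error of the sample proportion: sqrt(pi * (1 - pi) / n)."""
    return sqrt(pi * (1 - pi) / n)

se_mean = standard_error_mean(sigma=12, n=36)         # 12 / 6
se_prop = standard_error_proportion(pi=0.5, n=100)    # sqrt(0.25 / 100)

print(f"standard error of the mean:       {se_mean:.3f}")
print(f"standard error of the proportion: {se_prop:.3f}")
```

Both functions shrink as n grows, which is why larger samples lead to smaller margins of error.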

3.1.4 The Confidence Interval Estimate of a Population Mean⁵


There are multiple models for finding the confidence interval for the mean. The models we will be looking
at rely on the sampling distribution being approximately normal. If that is not the case, then we cannot use
these models.
Therefore, the following section relies on the following assumptions:

• The sampling distribution for sample means of the population we are investigating is approximately
normally distributed.
4 This content is available online at <http://cnx.org/content/m65026/1.1/>.
5 This content is available online at <http://cnx.org/content/m65027/1.1/>.


· If the sample size is greater than 30, then the central limit theorem tells us that we can assume
that the sampling distribution is approximately normal regardless of the population distribution.
Thus, if the sample size is greater than 30, we can use this model.
· If the sample size is 30 or less, the central limit theorem does not guarantee that the sampling
distribution of the means will be normal. Therefore, to use this model the population
distribution needs to be approximately normal so that we know that the sampling distribution
for sample means is normal.
• The sample we are using to construct the confidence interval is a random sample.

To construct a confidence interval for the mean, collect a random sample from the population whose mean
is being estimated. Then calculate the sample mean.
The next step is to calculate the margin of error. To do this, we begin by finding out how much sampling
variability there is in the sampling distribution. That is, we determine how much variation we expect between
the sample means. This is found by calculating the standard error of the sampling distribution for sample
means:

$\sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}}$ (3.2)

Now we want to take into account the level of confidence. To do this, we construct a normal distribution
that is centred at the sample mean, $\bar{x}$, whose standard deviation is the standard error of the mean,
$\frac{\sigma_X}{\sqrt{n}}$. The data values for this distribution are sample means. Therefore this is a sampling
distribution for sample means. This sampling distribution is an estimate of what the sampling distribution of
the population will look like:

Figure 3.3: Blue curve: True sampling distribution for sample means, centred at $\mu_x$ and with a
standard deviation of $\frac{\sigma_X}{\sqrt{n}}$. Red curve: Estimate of the true sampling distribution for
sample means based on the mean of the random sample. It is centred at $\bar{x}$ and has a standard deviation
of $\frac{\sigma_X}{\sqrt{n}}$.

In Figure 3.3, the blue sampling distribution is the theoretical sampling distribution of the population,
which is unknown. The red sampling distribution is an estimate of the blue curve based on the sample mean
found from the random sample. We will use the red sampling distribution to estimate the population mean.
Using the red sampling distribution, we want to determine the interval of sample means, centred at the
mean, that contains a specific percentage of the distribution. The specific percentage is the confidence level.
Suppose that the confidence level is 95.44%. From the empirical rule, we know that 95.44% of data
values fall within 2 standard deviations of the mean for normally distributed data. Therefore, if we wanted
to construct a 95.44% confidence interval, we would take the sample mean and add and subtract two standard
deviations from it. Since we are dealing with a sampling distribution, the standard deviation we are referring
to is the standard error of the mean. Therefore, a 95.44% confidence interval is found by calculating
$\bar{X} \pm 2 \cdot \sigma_{\bar{X}} = \bar{X} \pm 2 \cdot \frac{\sigma_X}{\sqrt{n}}$. Thus for a 95.44% confidence
interval, the margin of error is $E = 2 \cdot \frac{\sigma_X}{\sqrt{n}}$.


Figure 3.4: 95.44% confidence interval for the mean

If we wanted to find a 95% confidence interval, we would use the same process, but we would want a
slightly narrower interval. Therefore, instead of multiplying the standard error by 2, we would multiply it
by a slightly smaller number. To determine by what number, we would need to find out how many standard
deviations away from the mean results in an area of 95%. In other words, we would need to find the z-score
that gives a central area of 95%.

Figure 3.5: Standard normal curve with the area of the tails being 5%.

If the area in the middle of the curve is 95%, then the area of each tail is 2.5%. Using a computer program,
we can find this value to be ±1.96.
To do this, go to your computer program and go to the menu option that lets you find probabilities for
normal distributions. Then make the mean 0 and the standard deviation 1. Then switch from calculating
probabilities to finding z-values (like you are going to find a percentile). In the appropriate box, put 0.025
in for the area in the upper tail. When you hit enter, the program will give you 1.96 as the z-value for this
area.
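
The same lookup can be sketched in Python, used here as just one possible "computer program" (an assumption, since the text does not name one). `statistics.NormalDist().inv_cdf` takes the area to the left of the z-value, so an upper-tail area of 0.025 corresponds to a left area of 0.975. The helper name `critical_value` is hypothetical.

```python
from statistics import NormalDist

def critical_value(confidence):
    """z-value that leaves (1 - confidence)/2 in each tail of the standard normal curve."""
    tail_area = (1 - confidence) / 2
    return NormalDist(mu=0, sigma=1).inv_cdf(1 - tail_area)

for level in (0.90, 0.95, 0.9544, 0.99):
    print(f"{level:.2%} confidence: z = {critical_value(level):.3f}")

# Going the other way: the area within 2 standard deviations of the mean.
area_within_2 = NormalDist().cdf(2) - NormalDist().cdf(-2)
print(f"area within 2 standard deviations: {area_within_2:.4f}")
```

The 95% level gives a value of about 1.96, slightly smaller than the 2 used for the 95.44% interval, which is why the 95% interval is slightly narrower.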
In general, the value that you multiply the standard error by is called the critical value and is denoted
by $z_{\alpha/2}$, where α is the total area of the tails. $(1 - \alpha) \times 100\%$ is the level of confidence.
The margin of error is $E = z_{\alpha/2} \times \frac{\sigma_X}{\sqrt{n}}$.
The confidence interval is $\bar{x} \pm E$. As it is an interval, always write it with the smaller number first
($\bar{x} - E$) followed by the larger number ($\bar{x} + E$).
Exercise 3.1.4.1 (Solution on p. 250.)
Suppose that a random sample of 175 students from a university is taken and their average age is
21.34 years old and the population standard deviation is known to be 5.12 years.

1. Find the 95% confidence interval for the population mean age of all university students.
2. Interpret the confidence interval in the context of the question.
3. Explain what the level of confidence means in the context of the problem.
4. If we decreased the sample size to 100, what would you expect to happen to the confidence
interval? Explain your answer.
5. Suppose that an administrator at the university claims that this university caters to older
students and that the mean age is 23. Does the confidence interval support the claim?

A few notes about the above confidence interval:

• All of the means in the interval are equally likely. That is, each of the estimates of the population
mean in the interval has an equal chance of being correct. For example, 20.58 years old and 21.25
years old are both equally likely estimates of the population mean age.
• The sample mean of 21.34 is right in the middle of the interval.
• The margin of error is 0.759 and is found using the formula

$z_{\alpha/2} \times \frac{\sigma_X}{\sqrt{n}} = 1.96 \times \frac{5.12}{\sqrt{175}}$ (3.5)

• It is possible that the population mean is not captured by this confidence interval, but we wouldn't
know whether it is or not without knowing the population mean.
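
The margin of error quoted in the notes can be reproduced with a short sketch. It assumes the figures given in Exercise 3.1.4.1 (sample mean 21.34, known population standard deviation 5.12, n = 175, 95% confidence); the function name `z_interval` is hypothetical, and Python's `statistics.NormalDist` stands in for whatever program the course uses.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for a mean when the population sd (sigma) is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # critical value
    margin = z * sigma / sqrt(n)                         # E = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin, margin

lower, upper, margin = z_interval(sample_mean=21.34, sigma=5.12, n=175)
print(f"E = {margin:.3f}, interval = ({lower:.2f}, {upper:.2f})")
```

The interval is reported with the smaller number first, and the sample mean sits exactly halfway between the two endpoints.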

3.1.4.1 Wait a second! If we don't know the population mean ($\mu_x$), how do we know the population
standard deviation ($\sigma_x$) in the standard error formula???

That's a really good question. The actual formula for the population standard deviation involves knowing
the population mean: $\sigma_x = \sqrt{\frac{\sum (X-\mu)^2}{N}}$. Therefore, if we don't know the population
mean, how do we know the population standard deviation?
There are two possible answers to this:

1. In some long-running processes (e.g. manufacturing), the standard deviation may be very stable.
Therefore, the population standard deviation could be known even if the population mean isn't.
2. We don't know the population standard deviation, so instead we estimate it with the sample standard
deviation.

In most situations, the population standard deviation is unlikely to be known. Thus, we will
focus on situations where the population standard deviation is unknown. In that case, we will use the sample
standard deviation s to estimate the population standard deviation σx .

3.1.4.2 Student-t distribution

To use this model to construct a confidence interval, we need to again assume that the sampling distribution
is normal and that the sample was collected randomly. Just as we saw above, there are two general situations
in which we can treat the sampling distribution as normal:

• If the sample size is greater than 30, then the central limit theorem tells us that we can assume that
the sampling distribution is approximately normal regardless of the population distribution. Thus, if
the sample size is greater than 30, we can use this model.
• If the sample size is less than 30, the central limit theorem does not guarantee that the sampling
distribution of the means will be normal. Therefore, to use this model the population distribution
needs to be approximately normal so that we know that the sampling distribution for sample
means is normal.

Since we don't know the population standard deviation, we will be using the sample standard deviation
to estimate σx . That means we are estimating the population mean using the sample mean and sample
standard deviation. This suggests that there may be more error in our estimate. To account for the greater
error, we want the confidence interval to be slightly wider. To do this the margin of error needs to be slightly
bigger. The margin of error is the critical value × the standard error. The standard error is inherent to the
population and can't be changed, but the critical value can be. So instead of using the standard normal
distribution to find the critical value, we use the Student-t distribution.⁶
Here is some information about the Student-t distribution.

• The Student-t distribution is a symmetric, bell-shaped distribution that resembles a normal
distribution, with µ = 0 and σ > 1. The standard deviation of the Student-t distribution is different
for different sample sizes. Remember that the standard normal distribution is a normal distribution
with µ = 0 and σ = 1. Therefore, the Student-t distribution is centred at the same place as the
standard normal distribution, but has greater variation, so it is slightly wider and shorter. See
Figure 3.6.
• The smaller the sample size, the greater the variability in the sampling distribution. When the
sample size is larger, there is less variability in the sampling distribution. These aspects are reflected
in the shape of the Student-t distribution.
• As the sample size n gets larger, the Student-t distribution gets closer to the standard normal
distribution.

Figure 3.6: Comparison of Student-t distribution with standard normal distribution

The standard deviation of the Student-t distribution is based on the degrees of freedom, which in turn
are based on the sample size. The number of degrees of freedom for a sample corresponds to the number of
data values that can vary after certain restrictions have been imposed on all data values. Put another
way, the degrees of freedom are the number of components that need to be known before a statistic
is entirely determined. The formula for the degrees of freedom depends on the model used. For
this model (i.e. confidence interval for one population mean), the degrees of freedom are the sample size
minus 1, i.e. n − 1.
As stated above, we want the confidence interval to be wider to take into account the larger
variation due to the estimate of the standard deviation. As you can see from the figure above, the Student-t
distribution is wider than the standard normal distribution. This means that the critical value for a 95%
confidence level will be greater than that for the standard normal. See the image below.
6 The Student-t distribution was created by William Gosset, an English statistician who worked for the Guinness brewery.
While working for Guinness, Gosset developed the Student-t distribution, but was prohibited from publishing his work by his
employers, who worried about trade secrets getting out. Thus he published his work under the pseudonym 'Student' in 1908.
The distribution, then, should really be called the Gosset-t distribution.


Figure 3.7: Critical value for Student-t distribution with n = 5

Notice that the critical value falls between ±2 and ±3 (about ±2.78 for n − 1 = 4 degrees of freedom
at the 95% confidence level), whereas the critical value for the standard normal distribution is ±1.96.
The margin of error for this model is:

$$E = t_{\alpha/2} \times \frac{s}{\sqrt{n}} \qquad (3.7)$$

The confidence interval is constructed in the same way: $\bar{x} \pm E$.
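To see the t-based margin of error in action, here is a rough Python sketch. The standard library has no Student-t distribution, so the sketch approximates the critical value numerically (Simpson's rule plus bisection); in practice you would read it from a t-table or statistical software. All function names are our own, and the sample values (mean 50.0, s = 4.0, n = 16) are made up for illustration.

```python
from math import exp, lgamma, pi, sqrt


def t_pdf(x, df):
    """Density of the Student-t distribution with df degrees of freedom."""
    coeff = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] and symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_critical(confidence, df):
    """Approximate t_{alpha/2} by bisection on the CDF."""
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def t_confidence_interval(mean, s, n, confidence=0.95):
    """Return (t, margin_of_error, lower, upper) with df = n - 1."""
    t = t_critical(confidence, n - 1)
    margin = t * s / sqrt(n)
    return t, margin, mean - margin, mean + margin


# Critical value for the situation in Figure 3.7 (n = 5, so df = 4).
print(f"t critical (df = 4): {t_critical(0.95, 4):.3f}")

# Hypothetical sample: mean 50.0, s = 4.0, n = 16.
t, margin, lower, upper = t_confidence_interval(50.0, 4.0, 16)
print(f"t = {t:.3f}, E = {margin:.3f}, CI = ({lower:.3f}, {upper:.3f})")
```

Both critical values come out larger than 1.96, which is what makes the t-based interval wider than the z-based one.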
Exercise 3.1.4.2 (Solution on p. 250.)
A manufacturer of AAA batteries wants to estimate the mean life expectancy of the batteries. It
is known that the life expectancy of such batteries is typically normally distributed.
A random sample of 25 batteries has a mean of 44.25 hours and a standard deviation of 2.25
hours. Assume the population is normal.

1. Construct a 95% confidence interval for the mean life expectancy of all the AAA batteries
made by this manufacturer.
2. Interpret the 95% confidence interval.
3. If the confidence level is decreased to 90%, how does the confidence interval change?

important: As the degrees of freedom of the Student-t distribution increase (i.e. as the sample
size increases), the standard deviation of the Student-t distribution decreases and gets closer and
closer to 1. That is, as the sample size increases, the Student-t distribution gets closer and closer
to the standard normal distribution. Statisticians and researchers generally agree that for n ≥ 40,
the difference between the critical values of the Student-t distribution and the standard normal
distribution is negligible.⁷ Thus when n > 40, the standard normal distribution can be
used to construct the confidence interval.

Figure 3.8 is a flow chart that indicates how to choose which model to use to construct a confidence
interval (CI) for the mean.
7 Some researchers switch to the standard normal distribution when n > 30, and other researchers use the Student-t
distribution regardless of the sample size. This is a matter of preference, but as most researchers agree with the n ≥ 40 rule, we
will stick with it here.


Figure 3.8: Flow chart for determining which model to use when constructing a confidence interval for
the mean

3.1.4.3 Sample Size Determination

Determining an appropriate sample size is very important. Too small a sample may lead to poor results.
Too large a sample needlessly wastes time and money.
Prior to this section, we would have determined if a sample size was large enough simply by guessing.
Here we will learn a formula for finding the appropriate sample size based on the amount of error we will
accept in our results. This can be done by determining the minimum sample size needed to have a certain
margin of error. To do this, we solve for the sample size n in the margin of error formula.

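$$E = z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$$

$$\sqrt{n} = \frac{z_{\alpha/2} \cdot s}{E}$$

$$n = \left(\frac{z_{\alpha/2} \cdot s}{E}\right)^2 \qquad (3.8)$$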

As we would always rather have one more object of study than one fewer, we will always round
up the result of this calculation. That is, if the result of the formula is 50.2, then we will round up to 51.
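The formula and the round-up rule can be sketched in Python as follows. The function name `sample_size_for_mean` and the inputs (an estimated standard deviation of 1.5 and a desired margin of error of 0.25) are hypothetical values chosen for illustration only.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_mean(s, margin, confidence=0.95):
    """Minimum n so that z_{alpha/2} * s / sqrt(n) <= margin.

    Hypothetical helper: s is an estimate of the standard deviation
    (for example from a pilot study) and margin is the desired E.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Always round up: one more object of study rather than one fewer.
    return ceil((z * s / margin) ** 2)


n = sample_size_for_mean(s=1.5, margin=0.25)
print(n)
```

With these made-up inputs the unrounded result is a little over 138, so the required sample size is 139.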
A couple of notes about the formula:

1. Since n is unknown we can't use t. Think about why this is so.


2. We still need to have a sense of the standard deviation to use this formula. As such, we will often do
a preliminary study to estimate the standard deviation.

Exercise 3.1.4.3 (Solution on p. 251.)


You plan to do a study of hypnotherapy to determine how effective it is in increasing the number
of hours of sleep participants get each night. To do this you will measure the number of hours of
sleep for each of the participants after they've done hypnotherapy. You want to ensure that your
estimate for the mean number of hours of sleep is within 0.2 hours of the true mean with a 95%
level of confidence. Prior to doing the full study, you do a pilot study with 12 participants, which
provides the following data:

8.2; 9.1; 7.7; 8.6; 6.9; 11.2; 10.1; 9.9; 8.9; 9.2; 7.5; 10.5 (3.8)
How many participants should be in your study?


3.1.5 The Confidence Interval Estimate of a Population Proportion⁸


Here we want to construct a confidence interval to estimate the population proportion π based on the
point estimate of the sample proportion $\hat{p}$.
Confidence intervals for a proportion are constructed by taking the point estimate $\hat{p}$ and adding and
subtracting the margin of error E: $\hat{p} \pm E$.
There is more than one model for constructing a confidence interval for the population proportion. The
model we will discuss here has the following criteria:

• The variable being studied satisfies the conditions of the binomial distribution.
• The sampling distribution for sample proportions is approximately normal. This occurs if the number
of successes (n × π) is at least 5 and the number of failures (n × (1 − π)) is at least 5. As π is unknown,
this can be checked by determining if the number of successes and failures in the sample are both at
least 5.

The margin of error is found in a similar way to the margin of error for the mean. That is, it is the critical value
× the standard error. As we are assuming that the sampling distribution is approximately normal, we will
use the standard normal distribution to find the critical value. Since the variable being studied satisfies the
conditions of the binomial distribution, we know from Chapter 6 that the standard error of the sampling
distribution is $\sqrt{\frac{\pi(1-\pi)}{n}}$.
As we don't know π, as that is what we are trying to estimate, we will estimate
π in the formula with the sample proportion $\hat{p}$. This results in the following estimate of the standard error:

$$\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

If these conditions are met, then the formula for the margin of error is:

$$E = z_{\alpha/2} \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \qquad (3.8)$$
Example: Cell phones
Suppose that a market research firm is hired to estimate the percent of adults living in Vancouver who
have cell phones. Five hundred randomly selected adult residents in Vancouver are surveyed to determine
whether they have cell phones. Of the 500 people sampled, 421 responded yes - they own cell phones.

1. Using a 92% confidence level, compute a confidence interval estimate for the true proportion of adult
residents of this city who have cell phones.
2. Would it be appropriate to say that 85% of residents have a cell phone in Vancouver?
3. What does the confidence level tell us in the context of the question?

Solutions:

1. We can use the standard normal model for proportions to construct our confidence interval as the
variable (cell phone ownership) follows a binomial distribution (1: The variable is random (random
sample); 2: The outcomes are being counted (number of people who have cell phones); 3: There is a
fixed number of trials (500); 4: There are two possible outcomes (have cell phone or don't have cell
phone); 5: Though π is unknown it is fair to assume that the proportion of people who have a cell
phone on a given day in Vancouver is very stable) and the sampling distribution for proportions is
normal as the number of successes is 421 and the number of failures is 79 (i.e. they are both greater
than 5). Use a computer program to construct the confidence interval. Input x as 421 (this may be in
the same place as the sample proportion, but when you input the whole number it will switch to x),
the sample size as 500, and the confidence level as 92%. Notice that you don't have to state whether it
is z or t as there is only one model for this situation. This gives the following output:

8 This content is available online at <http://cnx.org/content/m65028/1.1/>.

92%     confidence level
1.751   z
0.029   margin of error
0.813   lower confidence limit
0.871   upper confidence limit

Table 3.1

From this, we can see that the confidence interval for the proportion is 0.813 to 0.871.
2. To interpret the confidence interval, we would say that we are 92% confident that the proportion of residents
of Vancouver who own a cell phone is somewhere between 81.3% and 87.1%.
3. Since 85% is contained in the confidence interval, it is appropriate to say that the proportion of residents
in Vancouver who have a cell phone is 85%.
4. The confidence level means that if we took many random samples of Vancouver residents of size 500 and
constructed a confidence interval for each of these random samples, then 92% of these confidence
intervals would contain the population proportion of cell phone users, while 8% would not.
A couple of notes about the confidence interval:
• The margin of error is 0.029 or 2.9%. The margin of error for a confidence interval for proportions has
to be less than 1 (or 100%). If the sample size is large enough, the margin of error should be quite small
(less than 10%).
• Since proportions can only range from 0 to 1 or 0% to 100%, the confidence interval can never exceed
these values. For example, if the sample proportion is 92% and the margin of error is 10%, then the
confidence interval would be 82% to 102%, but since the upper bound is impossible, we would truncate
the interval to 82% to 100%.
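The output in Table 3.1 can be reproduced with a short Python sketch using only the standard library. The function name `proportion_confidence_interval` is our own; the inputs (421 successes out of 500, 92% confidence) come from the cell phone example, and the clamping to the range 0 to 1 follows the last note above.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence):
    """Return (z, margin_of_error, lower, upper) for a proportion.

    Hypothetical helper; it refuses to run when the sample has fewer
    than 5 successes or fewer than 5 failures.
    """
    failures = n - successes
    if successes < 5 or failures < 5:
        raise ValueError("need at least 5 successes and 5 failures")
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    # A proportion cannot fall below 0 or above 1.
    lower = max(0.0, p_hat - margin)
    upper = min(1.0, p_hat + margin)
    return z, margin, lower, upper


z, margin, lower, upper = proportion_confidence_interval(421, 500, 0.92)
print(f"{z:.3f} z")
print(f"{margin:.3f} margin of error")
print(f"{lower:.3f} lower confidence limit")
print(f"{upper:.3f} upper confidence limit")
```

Rounded to three decimal places, the four printed values should match the rows of Table 3.1.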

3.1.5.1 Determining sample size

Just like with the mean, we want to determine an appropriate sample size to achieve a maximum amount of
error in our estimate for the population proportion.
To find the formula for n, we again solve for n in the formula for the margin of error. This results in the
following formula:

$$n = \frac{z_{\alpha/2}^2 \, \hat{p}(1-\hat{p})}{E^2} \qquad (3.8)$$

To use this formula we need to know the margin of error, the confidence level and the sample proportion.
Note: If no estimate for π exists, then use $\hat{p} = 0.5$.
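As a sketch of how this formula might be applied, the Python function below defaults to $\hat{p} = 0.5$ when no prior estimate is supplied and rounds up, as we did for the mean. The function name and the example inputs (a 3 percentage point margin of error at 95% confidence, and a prior estimate of 0.2) are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(margin, confidence=0.95, p_hat=0.5):
    """Minimum n for a desired margin of error on a proportion.

    Hypothetical helper; p_hat defaults to 0.5, the value to use
    when no estimate of the population proportion exists.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * p_hat * (1 - p_hat) / margin ** 2)


n_no_estimate = sample_size_for_proportion(margin=0.03)
n_with_estimate = sample_size_for_proportion(margin=0.03, p_hat=0.2)
print(n_no_estimate, n_with_estimate)
```

Using 0.5 gives the largest possible value of $\hat{p}(1-\hat{p})$, so it produces the most conservative (largest) sample size; any other estimate, such as the 0.2 used here, leads to a smaller n.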
Exercise 3.1.5.1 (Solution on p. 251.)
The Western Canada Communications Company is considering a bid to provide long-distance
phone service. You are asked to conduct a poll to estimate the percentage of consumers who are
satisfied with their current long-distance phone service. You want to be 90% confident that your
sample percentage is within 2.5 percentage points of the true population value, and a Roper poll
suggests that this percentage should be about 85%. How large must your sample be?


3.2 Chapter 8: Hypothesis Testing


3.2.1 Introduction to One Population Hypothesis Testing⁹
3.2.1.1 What are hypothesis tests?

In chapter 7, you learned how to construct an estimate of a population parameter, such as a mean or
proportion, from a sample statistic. In this chapter we examine a related concept: investigating whether the
value of a population parameter has changed from what has previously been claimed or believed. Again, we
use the sample data for this investigation.
For example, it is commonly stated that adults should get 8 hours of sleep per night. Many of us may
suspect that the real average is lower. In conducting an investigation, since we don't yet have evidence to
the contrary, we will treat the mean of 8 as the prevailing claim. In other words, we must assume the true
population mean is 8 unless we can prove otherwise. In our attempt to find proof against the prevailing
claim, we would need to gather sample evidence.
Let's say that after gathering a large random sample (say, n = 50), you discover that the sample mean
number of hours slept per night is only 7.5. So is a sample mean of 7.5 hours proof that the true population
mean is not 8 hours, as claimed, but actually less? On the surface, it would appear so. However, recall from
chapter 7 that every sample mean will be different from the true population mean. Some sample means will
be a little different and others will be very different.
Also, recall that all possible sample means taken from a population, plotted on a distribution, are called a
sampling distribution of sample means. The mean or middle of this distribution will be the true population
mean, which at present we are assuming to be 8. And if 8 really is the true population mean, then most sample
means would be expected to be very close to 8, but some (those means near the tails of the distribution)
could be much lower or much higher than 8. The figure below shows a normal curve with a mean of 8 and a
standard error of 0.20. As the curve extends towards the tails, the number of observations we would expect
to see gets smaller and smaller. In other words, sample means that come from far out in the tails of the
distribution are considered rare or unlikely occurrences. So for this example, the question is whether 7.5 is
so far out into one of the tails that it would be considered an unlikely observation under the assumption
that the middle of this curve is actually 8.
To measure how far into the tail our sample mean of 7.5 is, we must use a familiar measuring tool called
a Z score (or a T score for smaller samples). Since we are assuming the mean or middle of our sampling
distribution is 8 (remember that 8 is our prevailing claim), we need to measure how many Z scores
our sample mean of 7.5 lies from 8. Recall from chapter 6 that a variable's Z score is simply the number of
standard deviations the variable lies from the middle of the normal curve. Also recall that over 95% of a
normally shaped distribution will fall within two Z scores (two standard deviations) of the middle and over
99% will fall within three Z scores.
In hypothesis testing, any value falling more than two standard deviations from the middle would be
considered unlikely (less than 5% of all possible sample means will fall more than two standard deviations
from the middle). Any value falling more than three standard deviations from the middle would be considered
very unlikely (less than 0.5% of all possible sample means will fall more than three standard deviations from
the middle). If the standard error for our example is 0.2, then our sample has a Z score of −2.5, that is, (7.5 − 8)/0.2.
In other words, our sample mean of 7.5 lies 2.5 standard deviations to the left of our hypothesized population mean,
well out into the left tail of the curve. So, it does appear that our sample mean can be considered an unlikely
occurrence. The conclusion then must be that if the true population mean is actually 8 it would be unlikely
for us to obtain a sample mean as small as 7.5. But since we did obtain such a mean, we must therefore
conclude the true population mean is less than 8.
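The sleep example can be checked numerically. The sketch below uses Python's standard library to compute the Z score and the left-tail probability; the variable names are our own, and the values (claimed mean 8, sample mean 7.5, standard error 0.2) are those used in the example.

```python
from statistics import NormalDist

claimed_mean = 8.0     # the prevailing claim
sample_mean = 7.5      # observed sample mean
standard_error = 0.2   # standard error of the sampling distribution

# Number of standard errors the sample mean lies from the claimed mean.
z = (sample_mean - claimed_mean) / standard_error

# Probability of a sample mean this low or lower if the claim is true.
left_tail = NormalDist().cdf(z)

print(f"z = {z:.2f}")
print(f"P(sample mean <= 7.5 if the true mean is 8) = {left_tail:.4f}")
```

The tail probability comes out well under 1%, which is why a sample mean of 7.5 is treated as an unlikely observation when the true mean is assumed to be 8.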
9 This content is available online at <http://cnx.org/content/m65311/1.2/>.


Figure 3.9

Without getting into further technicalities at this point, we have shown that a hypothesis test seeks to
measure whether the sample evidence can be considered unlikely under the assumption that the prevailing
claim is true. If our answer is 'yes', then we have good reason to reject the prevailing claim. If our answer
is 'no', then we must let the prevailing claim stand, at least until stronger evidence against it is found.
In the next section we will break down the various steps in a hypothesis test.

3.2.2 The Distribution Needed for Hypothesis Testing¹⁰


In chapter 6, we discussed sampling distributions, which are used for hypothesis testing. We will perform
hypothesis tests of a population mean using two particular sampling distributions: a normal distribution
or a Student's t-distribution. We will perform hypothesis tests of a population proportion using a normal
sampling distribution that has been approximated from a binomial situation.

3.2.2.1 Central Limit Theorem Revisited

When you perform a hypothesis test of a single population mean µ using a normal distribution (often
called a z-test), you take a large random sample from the population. When working with large samples,
you should recall from chapter 6 that the Central Limit Theorem says that the sampling distribution of means
will be approximately normal even if the population from which the sample came is not. For this reason
we can perform hypothesis tests using large samples and the normal distribution regardless of the shape of
the parent population.
Many statisticians prefer to use a t-distribution if the population standard deviation is unknown, even if
the samples are large. The reasoning behind this is that using the sample standard deviation in place of the
unknown population standard deviation adds an extra degree of potential error that can only be accounted
for by using a t-distribution. However, as noted in the previous chapter, it is common practice to use the
normal (Z-based) sampling distribution when working with large samples. Specifically, when n > 40, we will
use the standard normal (z-based) distribution to conduct a hypothesis test.
When working with small samples, we will perform a hypothesis test of a single population mean µ
using a Student's t-distribution (often called a t-test). There are fundamental assumptions that need to
be met in order for the test to be considered valid. Most importantly, since the Central Limit Theorem does not
apply to small samples, we have no guarantee that the sampling distribution will be normally shaped. For
this reason, we can only perform means tests with small samples when we know the population is normally
distributed.

10 This content is available online at <http://cnx.org/content/m65305/1.3/>.

note: Please see Figure 6 in the previous chapter for further insight into how to determine which
sampling distribution is appropriate when conducting a hypothesis test of a population mean.

When you perform a hypothesis test of a single population proportion p, you take a random sample
from the population. You must meet the conditions for a binomial distribution, which are: there are a
certain number n of independent trials, the outcomes of any trial are success or failure, and each trial has
the same probability of a success p. The Central Limit Theorem says the shape of the binomial distribution
will approximate the shape of the normal distribution if the sample is sufficiently large. To ensure this, the
quantities np and nq must both be greater than five (np > 5 and nq > 5). Then the binomial distribution of
a sample (estimated) proportion can be approximated by the normal distribution with µ = p and $\sigma = \sqrt{\frac{pq}{n}}$.
Remember that q = 1 − p.

3.2.2.2 Large Sample Hypothesis Test for the Mean

Going back to the standardizing formula we can derive the test statistic for testing hypotheses concerning
means. We have already worked with the formula below when introduced to sampling distributions in
Chapter 6. You should, however, notice one small difference. When we perform hypothesis tests, we don't
know the population mean; we simply have a claim or belief about the mean, which may or may not be true.
Because the mean is hypothesized rather than known, we use a slightly different symbol in the equation, µ0 ,
as seen below.

$$Z_c = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \qquad (3.9)$$
This calculated Z is nothing more than the number of standard deviations that the sample mean is from
the hypothesized population mean. If the sample mean falls "too many" standard deviations from the
hypothesized mean we conclude that the sample mean is unlikely to have come from a distribution centred
around the hypothesized mean.
So how do we know if a sample mean can be considered to have fallen "too many" standard deviations
away from a hypothesized mean? Obviously, we can't simply make this decision arbitrarily. Thankfully, we
have already been introduced to this concept when we examined confidence intervals in the previous chapter.
Just as we predetermine our level of confidence before we compute an estimate of a population parameter, so
too must we predetermine how strong we need our sample evidence to be (i.e. how many standard deviations
away from the hypothesized population parameter it must lie) before we would be confident in rejecting the
null hypothesis. This predetermined level in hypothesis testing is called the level of significance, and it is
simply 1 − the level of confidence. The level of significance is denoted as alpha (α).
This level of significance delineates a set number of standard deviations between evidence that would be
considered unlikely and evidence that would be considered not unlikely under the assumption that the null
hypothesis is true. By way of example, say we set our level of significance at 5%. The corresponding Z score
for a 5% level of significance in a one-tailed test is 1.645. This means that if our sample mean falls more than 1.645 standard
deviations away from the hypothesized middle of the distribution (i.e. the null hypothesis), we can conclude
the sample evidence is strong enough to be considered an unlikely event and we can therefore reject the null
hypothesis.
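The link between a level of significance and its critical Z score can be sketched in Python with the standard library. The function name `z_critical` is our own, and the two-tailed option is included only for comparison with the 1.96 used for 95% confidence intervals in the previous chapter.

```python
from statistics import NormalDist


def z_critical(alpha, two_tailed=False):
    """Critical Z score for a given level of significance.

    Hypothetical helper: a one-tailed test puts all of alpha in one
    tail, while a two-tailed test splits alpha across both tails.
    """
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)


one_tailed = z_critical(0.05)
two_tailed = z_critical(0.05, two_tailed=True)
print(f"one-tailed, alpha = 5%: {one_tailed:.3f}")
print(f"two-tailed, alpha = 5%: {two_tailed:.3f}")
```

The one-tailed value rounds to the 1.645 quoted above, and the two-tailed value rounds to 1.96.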
Before proceeding further, it's worth reviewing this notion of a significance level from another perspective.
The significance level can be thought of as the allowable amount of error in our test. Just as a 95% confidence
level will produce an incorrect estimate 5% of the time, so will our hypothesis test, with a level of significance
set at 5%, produce an incorrect conclusion 5% of the time, at least theoretically. When we set the significance
level at, say, 5%, we are essentially saying that on our sampling distribution any sample mean that falls into
the top (or bottom) 5% of the tail would be considered strong evidence against the null hypothesis. This
does not mean the evidence is perfect, however. There is certainly the possibility that a sample mean that
falls into the top (or bottom) 5% of the tail could have come from a population in which the null hypothesis
is true. Indeed that possibility is actually 5%. But 5% is a pretty small number, which is why we would say
the observance of such a sample mean must be considered an unlikely, but not impossible, event.

3.2.2.3 Small Sample Hypothesis Tests for the Mean

Because the samples are small and we don't know the population standard deviation, we must use a Student
t-distribution rather than a Z distribution to perform our tests. The new standardizing formula below will
be used to compute how many standard deviations our sample mean falls from the hypothesized middle of
the t-distribution.

$$t_c = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \qquad (3.9)$$
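A minimal Python sketch of this calculation is shown below. The function name `t_statistic` and all of the numbers are hypothetical: a sample of n = 16 with mean 52.0 and standard deviation 4.0, tested against a hypothesized mean of 50.0. The critical value 1.753 is the usual t-table entry for a one-tailed test at the 5% level with 15 degrees of freedom.

```python
from math import sqrt


def t_statistic(sample_mean, hypothesized_mean, s, n):
    """Return (t_c, degrees_of_freedom) for a one-sample t-test."""
    standard_error = s / sqrt(n)
    t_c = (sample_mean - hypothesized_mean) / standard_error
    return t_c, n - 1


t_c, df = t_statistic(sample_mean=52.0, hypothesized_mean=50.0, s=4.0, n=16)

# One-tailed critical value at alpha = 5% for df = 15, from a t-table.
t_table_value = 1.753
reject_null = t_c > t_table_value

print(f"t_c = {t_c:.3f} with {df} degrees of freedom")
print(f"reject the null hypothesis: {reject_null}")
```

Here the calculated value exceeds the table value, so in this made-up example the sample evidence would be considered strong enough to reject the null hypothesis.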

3.2.2.4 Large Sample Tests for the Proportion

When conducting a hypothesis test on a proportion, we can use a Z-based test so long as the sample is
sufficiently large. A sample is considered large if $n\hat{p}$ and $n(1-\hat{p})$ both exceed 5. Even though we will perform
a Z-based test, because we are working with proportions, the standardizing formula is quite different. In
the numerator, the hypothesized proportion is subtracted from the observed sample proportion. In the
denominator, the standard error is calculated by first multiplying the hypothesized proportion by 1 − the
hypothesized proportion; then dividing the result by the sample size n; and finally taking the square root of that result.

$$Z^* = \frac{\hat{p} - \pi_0}{\sqrt{\frac{\pi_0(1-\pi_0)}{n}}}$$
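The steps described above can be sketched in Python as follows. The function name `proportion_z_statistic` and the inputs (118 successes in a sample of 200, tested against a hypothesized proportion of 0.5) are hypothetical and chosen only to illustrate the arithmetic.

```python
from math import sqrt


def proportion_z_statistic(successes, n, pi_0):
    """Z statistic for a one-sample test of a proportion.

    The standard error uses the hypothesized proportion pi_0,
    not the sample proportion.
    """
    p_hat = successes / n
    standard_error = sqrt(pi_0 * (1 - pi_0) / n)
    return (p_hat - pi_0) / standard_error


z_star = proportion_z_statistic(successes=118, n=200, pi_0=0.5)
print(f"Z* = {z_star:.3f}")
```

In this made-up example the sample proportion is 0.59, which lies roughly 2.5 standard errors above the hypothesized value of 0.5.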

3.2.2.5 Chapter Review

In order for a hypothesis test's results to be generalized to a population, certain requirements must be
satisfied.
When testing for a single population mean:

1. A Student's t-test should be used if the data come from a small, random sample and the population is
approximately normally distributed.
2. The normal z-test can be used if the data come from a large, random sample. The population does
not need to be normally distributed.

When testing a single population proportion, use a normal test for a single population proportion if the data
come from a random sample, fit the requirements for a binomial distribution, and the mean number of
successes and the mean number of failures satisfy the conditions np > 5 and nq > 5, where n is the sample
size, p is the probability of a success, and q is the probability of a failure.


3.2.2.6

Exercise 3.2.2.1 (Solution on p. 252.)


Which two distributions can you use in hypothesis testing for the mean in this chapter?
Exercise 3.2.2.2 (Solution on p. 252.)
Which distribution do you use when the sample size is small, the standard deviation is not known
and you are testing one population mean?
Exercise 3.2.2.3 (Solution on p. 252.)
A population has a mean of 25 and a standard deviation of five. The sample mean is 24, and the
sample size is 108. What distribution should you use to perform a hypothesis test?
Exercise 3.2.2.4 (Solution on p. 252.)
You are performing a hypothesis test of a single population mean using a Student's t-distribution.
What must you assume about the distribution of the data?
Exercise 3.2.2.5 (Solution on p. 252.)
You are performing a hypothesis test of a single population proportion. What must be true about
the quantities $n\hat{p}$ and $n(1-\hat{p})$?
Exercise 3.2.2.6 (Solution on p. 252.)
You are performing a hypothesis test of a single population proportion. The data come from which
distribution?

3.2.2.7 Homework

Exercise 3.2.2.7 (Solution on p. 252.)


It is believed that Medicine Hat Community College (MHCC) Intermediate Accounting students get
more than seven hours of sleep per night, on average. A survey of 22 MHCC Intermediate Accounting
students generated a mean of 7.24 hours with a standard deviation of 1.93 hours. At a level of
significance of 5%, do MHCC Intermediate Accounting students get more than seven hours of sleep
per night, on average? The distribution to be used for this test is $\bar{X} \sim$ ________________

a. $Z\left(7.24, \frac{1.93}{\sqrt{22}}\right)$
b. $Z(7.24, 1.93)$
c. $t$ with 22 df
d. $t$ with 21 df

3.2.3 The Null and Alternative Hypotheses¹¹


The actual test begins by considering two hypotheses. They are called the null hypothesis and the
alternative hypothesis. These hypotheses contain opposing viewpoints.
H0 : The null hypothesis: The null hypothesis is the opposite of what the researcher is trying to show.
It is the assumption made about a population parameter, such as the mean or proportion. It is a statement
that we will assume to be true until we can find strong evidence to the contrary. You can think of the
null hypothesis as the assumption that nothing has changed, nothing is different. If you find evidence that
suggests the assumption is not valid, then you will reject the assumption about the population parameter
in favour of a claim. If you do not find enough evidence that suggests the assumption is not valid, then you
do not have enough evidence to support the claim, but that does not mean the assumption is valid.
11 This content is available online at <http://cnx.org/content/m65304/1.2/>.

Available for free at Connexions <http://cnx.org/content/col23820/1.2>



Ha : The alternative hypothesis: This is the claim about the population that the researcher is trying
to show and it is contradictory to H0 . It is what we conclude is likely to be true if our sample evidence
suggests that H0 is no longer valid. The alternative hypothesis says that something is different, that things
have changed. It must be supported by significant evidence to overthrow the assumption.
Since the null and alternative hypotheses are contradictory, you must examine evidence to decide if you
have enough evidence to reject the null hypothesis or not. Since we rarely have access to population data,
we must take our evidence from sample data.
Later we will discuss in more detail how to determine if the sample evidence can be considered strong
enough to support the alternative hypothesis. Once you have examined the sample evidence, you can
determine if it supports the alternative hypothesis or not and make your final decision. There are two
options for this decision. They are "reject H0" if the sample information favours the alternative hypothesis
or "fail to reject H0" (also written "decline to reject H0") if the sample information is insufficient to reject
the null hypothesis. These conclusions are all based upon a level of significance that is set by the analyst.
Table 3.2 presents the various hypotheses in the relevant pairs. For example, if the null hypothesis is
equal to some value, the alternative has to be not equal to that value.

H0                               Ha
equal (=)                        not equal (≠)
greater than or equal to (≥)     less than (<)
less than or equal to (≤)        more than (>)

Table 3.2

Note: As a mathematical convention, H0 always has a symbol with an equal sign in it. Ha never has a
symbol with an equal sign in it. The choice of symbol depends on the wording of the hypothesis test.

Example 3.2
H0 : The average amount of sleep adult Canadians get per night is greater than or equal to 8 hours.
Ha : The average amount of sleep adult Canadians get per night is less than 8 hours.
H0 : µ≥8
Ha : µ<8
Example 3.3
We want to test whether the mean GPA of students in Canadian universities is different from 2.0
(out of 4.0). The null and alternative hypotheses are:
H0 : µ = 2.0
Ha : µ ≠ 2.0
Example 3.4
We want to test if university students take more than four years to graduate from university, on
the average. The null and alternative hypotheses are:
H0 : µ ≤ 4
Ha : µ > 4
Example 3.5
We want to test if the proportion of Liberal supporters has dropped since the election.
H0 : The proportion of Liberal supporters is greater than or equal to 0.40.
Ha : The proportion of Liberal supporters is less than 0.40.
H0 : π ≥ 0.40
Ha : π < 0.40
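
The pairing rule in Table 3.2 can be sketched in a few lines of Python. This is only an illustration: the function name state_hypotheses and the dictionary NULL_FOR_ALTERNATIVE are hypothetical names chosen here, not part of the course material. The idea is that once the symbol in Ha is known, the symbol in H0 follows automatically.

```python
# Pairing from Table 3.2: the symbol used in Ha determines the symbol used in H0.
NULL_FOR_ALTERNATIVE = {"≠": "=", "<": "≥", ">": "≤"}


def state_hypotheses(parameter: str, value: float, alternative: str) -> tuple[str, str]:
    """Return (H0, Ha) as strings, given the symbol that appears in Ha."""
    null_symbol = NULL_FOR_ALTERNATIVE[alternative]
    return (
        f"H0: {parameter} {null_symbol} {value}",
        f"Ha: {parameter} {alternative} {value}",
    )


# Example 3.2: the claim to be shown is "less than 8 hours".
h0, ha = state_hypotheses("µ", 8, "<")
print(h0)
print(ha)
```

Passing "≠" or ">" instead reproduces the pairs used in Example 3.3 and Example 3.4.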


3.2.3.1 Chapter Review

In a hypothesis test, sample data is evaluated in order to arrive at a decision about some type of claim
about a population parameter, such as the mean or proportion. If the sample provides strong evidence to
the contrary of the original claim, then the claim can be rejected in favour of the new claim. In a hypothesis
test, we:

1. Evaluate the null hypothesis, typically denoted with H0 . The null is not rejected unless the hypothesis
test shows otherwise. The null statement must always contain some form of equality (=, ≤ or ≥)
2. Always write the alternative hypothesis, typically denoted with Ha or H1 , using not equal, less than
or greater than symbols, i.e., (≠, <, or >).
3. If we reject the null hypothesis, then we can assume there is enough evidence to support the alternative
hypothesis.
4. Never state that a claim under the null hypothesis is proven true or false. Keep in mind the underlying
fact that hypothesis testing is based on probability laws; therefore, we can talk only in terms of non-
absolute certainties.


3.2.3.2

Exercise 3.2.3.1 (Solution on p. 252.)


You are testing that the mean speed of your cable Internet connection is more than three Megabits
per second. What is the random variable? Describe in words.
Exercise 3.2.3.2 (Solution on p. 252.)
Canadian families have an average of two children. What is the random variable? Describe in
words.
Exercise 3.2.3.3 (Solution on p. 252.)
A sociologist claims the probability that a person picked at random visiting the CN Tower in
Toronto is a tourist is 0.83. You want to test to see if the proportion is actually less. What is the
random variable? Describe in words.
Exercise 3.2.3.4 (Solution on p. 252.)
In a population of fish, approximately 42% are female. A test is conducted to see if, in fact, the
proportion is less. State the null and alternative hypotheses.

3.2.3.3 Homework

Exercise 3.2.3.5 (Solution on p. 252.)


Some of the following statements refer to the null hypothesis, some to the alternate hypothesis.
Hint: pay attention to whether the statement states or implies an equality. If so, it refers to the
null hypothesis.
State the null hypothesis, H0 , and the alternative hypothesis, Ha , in terms of the appropriate
parameter (µ or π ).

a. The mean number of years Canadians work before retiring is 34.


b. At most 60% of Canadians vote in federal elections.
c. The mean starting salary for U of A graduates is at least $100,000 per year.
d. Twenty-nine percent of high school seniors get drunk each month.
e. Fewer than 5% of adults ride the bus to work in Calgary.
f. The mean number of cars a person owns in her lifetime is not more than ten.
g. About half of Canadians prefer to live away from cities, given the choice.
h. Europeans have a mean paid vacation each year of six weeks.
i. The chance of developing breast cancer is under 11% for women.
j. Private universities' mean tuition cost is more than $20,000 per year.

Exercise 3.2.3.6 (Solution on p. 252.)


A statistics instructor believes that fewer than 20% of Lethbridge Community College (LCC)
students attended the opening night midnight showing of the latest Harry Potter movie. She
surveys 84 of her students and finds that 11 attended the midnight showing. An appropriate
alternative hypothesis is:

a. π = 0.20
b. π > 0.20
c. π < 0.20
d. π ≤ 0.20



3.2.4 Errors and Choosing a Level of Significance12


Errors in Hypothesis Testing
Any time we reject a claim (H0), there is a possibility we were wrong. Rejecting an H0 that is actually true
is known as a Type I error. For example, when we send someone who is innocent to jail, we have committed
a Type I error; we have rejected a null hypothesis (that the person is innocent) that is actually true. If
making such an error is costly (financially, to someone's well-being or otherwise), we would want to severely
limit the possibility of this kind of error occurring. Conversely, any time we fail to reject a claim (H0),
there is also a possibility we were wrong. If a claim is actually false but we fail to reject that claim, we
have committed what is known as a Type II error. If a Type II error is deemed to be more costly than a
Type I error, we would strive to limit the possibility of this kind of error occurring.
How? Recall from Chapter 7 that we can decide in advance how confident we wish to be in our confidence
interval estimates. We do something similar in hypothesis testing by choosing what is known as a level of
significance. The level of significance, identified by the Greek letter alpha (α), is simply 1 minus our level
of confidence. So a 95% level of confidence has a corresponding level of significance of 5%. In terms of a
Type I error, an alpha of 5% is the probability that our test could lead to rejecting a null hypothesis that
is actually true. As mentioned above, if a Type I error is deemed very costly, we may wish to reduce alpha
to as low as 1%. This means that the probability our test could lead to rejecting a null hypothesis
that is actually true is only 1%. So why not set alpha at 0%? That way we would never make a Type I
error. Setting alpha at 0% would require us to have perfect evidence before we would be able to reject the
null hypothesis. Imagine if this were the case in a court trial. The judge would instruct the jury not to
convict unless the evidence of guilt was absolutely perfect and all jury members were 100% certain of the
defendant's guilt. If this were the case, we would rarely send anyone to jail and we would have a lot more
dangerous people roaming our streets. In short, it is unreasonable to demand that sample evidence provide
perfect proof against the null hypothesis.
The probability of a Type II error is denoted by the Greek letter beta (β). Unfortunately, we cannot
predetermine beta in the same way we do with alpha, but we do know the two types of errors share an
inverse relationship: all else being equal, the lower we set alpha, the higher beta becomes, and vice versa.
Back to our courtroom example. If we reduced the probability of making a Type I error to 0, as we said, we
would allow almost everyone to go free, even if they were guilty, for lack of perfect evidence. When we send
a person guilty of a serious crime back on the street, we have committed a Type II error: we have failed to
reject a null hypothesis that is actually false. And since the judge set alpha at 0 (that is, he demanded
perfect proof of guilt before being willing to convict), he has sent beta soaring. Almost no one will be
convicted. Since we can't set beta in advance, we must set alpha relatively high (for example, 10%) when
our priority is to reduce the risk of a Type II error.
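
The inverse relationship between alpha and beta can be illustrated numerically. The sketch below is not part of the original text and rests on assumptions chosen only for illustration: a lower-tailed Z-based test of H0: µ ≥ 8, a standard deviation of 1.4, a sample of 50, and a "true" mean of 7.6. In practice the true mean is unknown, which is exactly why beta cannot be set in advance. The function name beta_lower_tail is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def beta_lower_tail(alpha, mu0, mu_true, sigma, n):
    """Probability of failing to reject H0: mu >= mu0 when the true mean is mu_true."""
    se = sigma / sqrt(n)
    # We reject H0 when the sample mean falls below this cutoff.
    cutoff = mu0 + NormalDist().inv_cdf(alpha) * se
    # Type II error: the sample mean lands at or above the cutoff even though H0 is false.
    return 1 - NormalDist(mu_true, se).cdf(cutoff)


# Illustrative (assumed) values; lowering alpha raises beta.
for alpha in (0.10, 0.05, 0.01):
    beta = beta_lower_tail(alpha, mu0=8.0, mu_true=7.6, sigma=1.4, n=50)
    print(f"alpha = {alpha:.2f}  ->  beta = {beta:.3f}")
```

Under these assumptions, each reduction in alpha produces a larger beta, which mirrors the courtroom example: the more proof we demand before convicting, the more guilty defendants go free.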
To illustrate further, let's say a certain medical condition is easy to treat with a drug that poses little
danger and has few side effects. Let's also say this condition is relatively hard to diagnose because it shares
symptoms with several other conditions. A stomach ulcer is one example. The doctor tests you for an ulcer
by looking for evidence, such as pressing on your stomach and discussing your symptoms. As best as she
can tell, she decides there is a good chance you have an ulcer. She prescribes a drug and off you go. After
one month, your symptoms persist and so you re-visit the doctor, who then rules out her earlier diagnosis
in favour of a new one. What has happened here is that in her initial diagnosis the doctor made a
Type I error. She rejected the null hypothesis (that you don't have an ulcer) in favour of the alternative
hypothesis that you do have an ulcer. As it turns out, she was wrong. She prescribed a drug that would
not help you for a condition you do not have. Before getting too anxious about the medical system, keep
in mind that this is a fairly common practice in diagnosing relatively benign conditions that can be treated
easily. The old saying, "Take two aspirin and call me in the morning" sums this approach up well. Recall
that the doctor diagnosed your ulcer by taking in only a few pieces of evidence: talking to you and pressing
on your stomach. In other words, she was willing to reject the null hypothesis on relatively weak evidence.
Why? Because she knew that the prescription might help, and even if it didn't it would do you little harm.
And since it didn't help you after a month, she can now rule out an ulcer and focus on other, possibly less
benign, conditions. Keep in mind that if she had set alpha low, she likely would not have misdiagnosed you,
but she would also have sought much stronger evidence (possibly even invasive exploratory surgery) before
being willing to reject the null hypothesis. Obviously, in this case it made much more sense to risk a Type
I error and treat you for a condition that you don't actually have.

12 This content is available online at <http://cnx.org/content/m65318/1.3/>.
Summary
When you perform a hypothesis test, there are actually four possible outcomes depending on the actual truth
(or falseness) of the null hypothesis H0 and the decision to reject or not. The outcomes are summarized in
the following table:

STATISTICAL DECISION      H0 IS ACTUALLY...
                          True               False
Cannot reject H0          Correct outcome    Type II error
Cannot accept H0          Type I error       Correct outcome

Table 3.3

The four possible outcomes in the table are:

1. The decision is "cannot reject H0" when H0 is true (correct decision).
2. The decision is "cannot accept H0" when H0 is true (incorrect decision known as a Type I error).
This case is described as "rejecting a good null". As we will see later, it is this type of error that we
will guard against by setting the probability of making such an error. The goal is to NOT take an
action that is an error.
3. The decision is "cannot reject H0" when, in fact, H0 is false (incorrect decision known as a Type
II error). This is called "accepting a false null". In this situation you have allowed the status quo to
remain in force when it should be overturned. As we will see, the null hypothesis has the advantage in
competition with the alternative.
4. The decision is "cannot accept H0" when H0 is false (correct decision whose probability is called
the Power of the Test).

Each of the errors occurs with a particular probability. The Greek letters α and β represent the
probabilities.
α = probability of a Type I error = P(Type I error) = probability of rejecting the null hypothesis
when the null hypothesis is true.
β = probability of a Type II error = P(Type II error) = probability of not rejecting the null hypothesis
when the null hypothesis is false.
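
The meaning of α as a long-run error rate can be checked with a small simulation. This is a sketch under assumed conditions that are not from the text: samples of 50 are drawn from a normal population whose mean really is 8 (so H0 is true), the standard deviation of 1.4 is treated as known, and a lower-tailed Z-based test is run at α = 5%. The function name type_i_error_rate is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def type_i_error_rate(alpha=0.05, mu0=8.0, sigma=1.4, n=50, trials=5000, seed=1):
    """Share of simulated samples that reject H0 even though H0 is true."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    se = sigma / sqrt(n)
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population in which the null hypothesis holds.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (sum(sample) / n - mu0) / se
        p_value = NormalDist().cdf(z)  # lower-tailed test
        if p_value < alpha:
            rejections += 1            # rejecting a true H0 is a Type I error
    return rejections / trials


print(f"Observed Type I error rate: {type_i_error_rate():.3f}")
```

Under these assumptions the observed rate should settle close to the chosen alpha, which is what "α = P(Type I error)" means in practice.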
The following are examples of Type I and Type II errors.
Example 3.6
Suppose the null hypothesis, H0 , is: Frank's rock climbing equipment is safe.
Type I error: Frank thinks that his rock climbing equipment may not be safe when, in fact,
it really is safe. Type II error: Frank thinks that his rock climbing equipment may be safe when,
in fact, it is not safe.
α = probability that Frank thinks his rock climbing equipment may not be safe when, in
fact, it really is safe. β = probability that Frank thinks his rock climbing equipment may be safe
when, in fact, it is not safe.
Notice that, in this case, the error with the greater consequence is the Type II error. (If Frank
thinks his rock climbing equipment is safe, he will go ahead and use it.)


This is a situation described as "accepting a false null".

Exercise 3.2.4.1 (Solution on p. 252.)


Suppose the null hypothesis, H0 , is: the blood cultures contain no traces of pathogen X.
State the Type I and Type II errors.

Exercise 3.2.4.2 (Solution on p. 253.)


Suppose the null hypothesis, H0 , is: a patient is not sick. Which type of error has the
greater consequence, Type I or Type II?

Exercise 3.2.4.3 (Solution on p. 253.)


Red tide is a bloom of poison-producing algae (a few different species of a class of plankton
called dinoflagellates). When the weather and water conditions cause these blooms,
shellfish such as clams living in the area develop dangerous levels of a paralysis-inducing
toxin. In Massachusetts, the Division of Marine Fisheries (DMF) monitors levels of the
toxin in shellfish by regular sampling of shellfish along the coastline. If the mean level of
toxin in clams exceeds 800 µg (micrograms) of toxin per kg of clam meat in any area, clam
harvesting is banned there until the bloom is over and levels of toxin in clams subside.
Describe both a Type I and a Type II error in this context, and state which error has the
greater consequence.

Determine both Type I and Type II errors for the following scenario:
Assume a null hypothesis, H0 , that states the percentage of adults with jobs is at least 88%.
Exercise 3.2.4.4 (Solution on p. 253.)
Identify the Type I and Type II errors from these four statements.
a. Not to reject the null hypothesis that the percentage of adults who have jobs is at
least 88% when that percentage is actually less than 88%.
b. Not to reject the null hypothesis that the percentage of adults who have jobs is at
least 88% when the percentage is actually at least 88%.
c. Reject the null hypothesis that the percentage of adults who have jobs is at least 88%
when the percentage is actually at least 88%.
d. Reject the null hypothesis that the percentage of adults who have jobs is at least 88%
when that percentage is actually less than 88%.

3.2.4.1 Chapter Review

In every hypothesis test, the outcomes are dependent on a correct interpretation of the data. Incorrect
calculations or misunderstood summary statistics can yield errors that affect the results. A Type I error
occurs when a true null hypothesis is rejected. A Type II error occurs when a false null hypothesis is not
rejected.
The probabilities of these errors are denoted by the Greek letters α and β , for a Type I and a Type II
error respectively.


3.2.4.2

Exercise 3.2.4.5 (Solution on p. 253.)


The mean price of mid-sized cars in a region is $32,000. A test is conducted to see if the claim is
true. State the Type I and Type II errors in complete sentences.
Exercise 3.2.4.6 (Solution on p. 253.)
For Exercise 3.2.4.5, what are α and β in words?
Exercise 3.2.4.7 (Solution on p. 253.)
A group of doctors is deciding whether or not to perform an operation. Suppose the null hypothesis,
H0 , is: the surgical procedure will go well. State the Type I and Type II errors in complete sentences.

3.2.4.3 Homework

Exercise 3.2.4.8 (Solution on p. 253.)


State the Type I and Type II errors in complete sentences given the following statements.

a. The mean number of years Americans work before retiring is 34.


b. At most 60% of Americans vote in presidential elections.
c. The mean starting salary for San Jose State University graduates is at least $100,000 per
year.
d. Twenty-nine percent of high school seniors get drunk each month.
e. Fewer than 5% of adults ride the bus to work in Los Angeles.
f. The mean number of cars a person owns in his or her lifetime is not more than ten.
g. About half of Americans prefer to live away from cities, given the choice.
h. Europeans have a mean paid vacation each year of six weeks.
i. The chance of developing breast cancer is under 11% for women.
j. Private universities' mean tuition cost is more than $20,000 per year.

Exercise 3.2.4.9
For statements a-j in Exercise 3.2.4.8, answer the following in complete sentences.

a. State a consequence of committing a Type I error.


b. State a consequence of committing a Type II error.

Exercise 3.2.4.10 (Solution on p. 254.)


When a new drug is created, the pharmaceutical company must subject it to testing before receiving
the necessary permission from the Food and Drug Administration (FDA) to market the drug.
Suppose the null hypothesis is the drug is unsafe. What is the Type II Error?

a. To conclude the drug is safe when, in fact, it is unsafe.


b. Not to conclude the drug is safe when, in fact, it is safe.
c. To conclude the drug is safe when, in fact, it is safe.
d. Not to conclude the drug is unsafe when, in fact, it is unsafe.

Exercise 3.2.4.11 (Solution on p. 254.)


It is believed that Lake Tahoe Community College (LTCC) Intermediate Algebra students get
less than seven hours of sleep per night, on average. A survey of 22 LTCC Intermediate Algebra
students generated a mean of 7.24 hours with a standard deviation of 1.93 hours. At a level of
significance of 5%, do LTCC Intermediate Algebra students get less than seven hours of sleep per
night, on average?
The Type II error is not to reject that the mean number of hours of sleep LTCC students get
per night is at least seven when, in fact, the mean number of hours


a. is more than seven hours.


b. is at most seven hours.
c. is at least seven hours.
d. is less than seven hours.

3.2.5 The Eight-Step Hypothesis Test13


3.2.5.1 P-values and the Level of Significance

Once you have set out your null and alternative hypotheses, you need to determine how strong your sample
evidence must be before you would be willing to reject the null hypothesis in favour of the alternative
hypothesis. The required strength of evidence is defined by the level of significance (α).
Once your level of significance has been set, you can then examine your sample evidence to determine its
strength, as measured by its p-value. This process will be discussed below.

3.2.5.1.1 Ethical Implications of Choosing a Level of Significance

Once you have set out your null and alternative hypotheses, you need to determine how strong your sample
evidence must be before you would be confident in rejecting the null hypothesis in favour of the alternative
hypothesis. The required strength of evidence is defined by the level of significance (α).
Typical values for alpha range from 1% to 10% and will vary depending on a number of factors, including
conventions set by a particular industry or discipline and the relative risks of a Type I versus a Type II error,
as discussed in the previous section. In many cases, the choice of alpha may be left up to the analyst.
Unfortunately, without a peer review process, some analysts may be tempted to set alpha in a way that will
support their desired conclusion.
For example, if a pharmaceutical company stands to make millions of dollars on a new drug, it obviously
has a vested interest in offering proof that the drug is effective. The null hypothesis is that the drug is
not effective, and the alternative is that it is. But what if the proof, as discovered by several rounds of
double-blind tests, turns out to be rather weak? This would normally lead the researcher to decide not to
reject the null hypothesis and conclude that the sample evidence is insufficiently strong for the drug to be
considered a success. If this were the conclusion, the drug should not be approved as an effective treatment.
But a company with millions already invested in the drug may be strongly determined to see it to market,
in spite of the test results. An unethical approach might be to simply move the goal posts to make it easier
to reject the null hypothesis (i.e., to make the proof look stronger than it is).
These goal posts, of course, are defined by the level of significance. In much scientific testing, the level
of significance is typically set at 1%, which means the sample evidence must be very strong before a null
hypothesis can be rejected. In this case, moving the goal posts could mean setting the level of significance
as high as 10%. This higher level of significance, as we shall see below, allows for weaker evidence to be used
in support of an alternative hypothesis.
In Figure 3.10 below, alpha has been set at 1%. As you can see, the sample evidence fails to cross over the
goal posts set by alpha and we would thus fail to reject the null hypothesis. The sample evidence
is not strong enough.
13 This content is available online at <http://cnx.org/content/m65319/1.5/>.


Figure 3.10

In the following figure we have moved the goal posts by setting alpha at 10%, making it easier to reject
the null hypothesis. As you can see, the sample evidence now is strong enough to lead us to reject the null
hypothesis. Of course, in truth the evidence has not changed, but in the first instance we fail to reject the
null and in the second we do reject the null.

Figure 3.11
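
The effect of moving the goal posts can be sketched in Python. The numbers here are assumptions made purely for illustration and do not come from the figures: an upper-tailed Z-based test and a hypothetical test statistic of 1.75. The function name goal_post is also hypothetical. The point is that the evidence stays fixed while the cutoff moves.

```python
from statistics import NormalDist


def goal_post(alpha):
    """Critical Z value for an upper-tailed test at significance level alpha."""
    return NormalDist().inv_cdf(1 - alpha)


z_star = 1.75  # hypothetical test statistic; identical evidence in both cases
for alpha in (0.01, 0.10):
    cutoff = goal_post(alpha)
    verdict = "reject H0" if z_star > cutoff else "fail to reject H0"
    print(f"alpha = {alpha:.2f}: goal post at Z = {cutoff:.2f} -> {verdict}")
```

With these assumed values the same evidence falls short of the 1% goal post but clears the 10% goal post, which is why alpha must be chosen on its merits rather than to obtain a preferred result.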

Thankfully, at least when it comes to pharmaceutical testing, there are objective, government-regulated
standards that cannot be easily manipulated by vested interests. However, there are instances where the
researcher is in control of choosing the level of significance. When this is the case, the choice should be made
ethically and with an honest consideration of the implications of Type I and Type II errors.
As a final note, the level of significance should never be chosen after the sample evidence has been
measured. This would be akin to allowing the home team to determine where the goal posts are after the
game has already begun.

3.2.5.1.2 Examining the Sample Evidence

Once the level of significance has been set, you can look more closely at the sample evidence to determine how
strong it is. As discussed earlier, this evidence is first measured by determining how far away your sample
mean or proportion is from the hypothesized mean or proportion. The measuring stick we use is called a
Z-score or a t-score, which is simply the number of standard deviations our sample mean or proportion lies
from the hypothesized middle of the sampling distribution.
Recall from earlier in this chapter the example we looked at regarding sleep habits. We hypothesized
that the mean number of hours adults sleep per night is 8. We then gathered sample evidence from 50
adults, where the sample mean was 7.5 hours and the standard deviation was 1.4 hours. The sampling
distribution for this scenario would then have a hypothesized middle of 8 and a standard error of about
0.20 (i.e., 1.4/√50).
Does a sample mean of 7.5 provide sufficient proof that the true population mean is less than 8? To
investigate, we must first determine our level of significance. For now, we will use the default of 5%. This
means that if our sample mean falls into the lower 5% tail, it will be considered strong evidence against
the null hypothesis. We can now measure how many standard deviations (Z-scores, since we are working
with a large sample) 7.5 is from 8. This measurement is often called the test statistic. You may see it written
as Z* or t*.
Using our standardizing formula, we get Z* = (7.5 - 8.0) / (1.4/√50). The resulting Z-score is -2.5 (rounded
to one decimal). Based on the empirical rule we know that any value with a Z-score of 2.5 (as an absolute
value) would fall well out into the lower or upper 5% of the tail and would thus be considered an unlikely
observation. That is, very few sample means taken from a population with a mean of 8 would have such an
extreme Z-score.
Our decision, in this case, would be to reject the null hypothesis (that the mean number of hours adults
sleep is 8) in favour of the alternative hypothesis (that the mean number of hours adults sleep is less than
8). Keep in mind we have not shown that they sleep only 7.5 hours; this is never what we sought to show.
We sought only to show that they sleep less than 8 hours. Our sample mean of 7.5 is our evidence against
the null hypothesis. As it turned out, the empirical rule helped us conclude that a sample mean of 7.5 would
be a very unlikely finding if the true population mean were actually 8, which is why we rejected the null
hypothesis.
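
The arithmetic for the sleep example can be reproduced with a short Python sketch. The variable names are chosen here for illustration; the values (sample mean 7.5, hypothesized mean 8, standard deviation 1.4, sample size 50) are the ones used above.

```python
from math import sqrt

x_bar = 7.5   # sample mean
mu0 = 8.0     # hypothesized population mean
s = 1.4       # standard deviation
n = 50        # sample size

standard_error = s / sqrt(n)             # spread of the sampling distribution
z_star = (x_bar - mu0) / standard_error  # test statistic

print(f"standard error = {standard_error:.2f}")
print(f"Z* = {z_star:.2f}")
```

Run as written, this gives a standard error of about 0.20 and a test statistic of about -2.53, consistent with the rounded value of -2.5 used in the discussion.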

3.2.5.1.3 Measuring Sample Evidence with P-Values

While using Z-scores and t-scores can lead us to a correct decision, a more common and precise measuring
tool is preferred, called a p-value. To find the p-value of a sample mean or proportion we simply need to
convert the test statistic into a probability. Specifically, the p-value seen below is the probability of getting
a sample mean of 7.5 or less from a population whose true mean is 8. As you can see, the resulting p-value
is extremely small, meaning that such an outcome would be extremely unlikely (well under a probability of
5%) to occur if the true mean is 8.


Figure 3.12

Be careful! The p-value is not the probability that the null hypothesis is true. It is the probability of
obtaining a sample mean at least as extreme as ours from a population in which the null hypothesis is true.
Since this probability is so small, we reject the null hypothesis. In other words, if the null hypothesis
were true, our sample mean would be considered an unlikely event.
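
Converting the test statistic into a p-value can be sketched with Python's standard library. This is an illustration rather than a prescribed method; it assumes the lower-tailed Z-based test from the sleep example (sample mean 7.5, hypothesized mean 8, standard deviation 1.4, sample size 50).

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu0, s, n = 7.5, 8.0, 1.4, 50

z_star = (x_bar - mu0) / (s / sqrt(n))

# Lower-tailed test: the p-value is the area to the left of Z*
# under the standard normal curve.
p_value = NormalDist().cdf(z_star)

print(f"Z* = {z_star:.2f}, p-value = {p_value:.4f}")
```

The result is a p-value of roughly 0.006, well below a 5% level of significance.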

3.2.5.1.4 P-values and Unlikely Events

As a final example, suppose Didi and Ali are at a birthday party of a very wealthy friend. They hurry to be
first in line to grab a prize from a tall basket that they cannot see inside because they will be blindfolded.
There are 200 plastic bubbles in the basket and Didi and Ali have been told that there is only one with a
$100 bill. Didi is the first person to reach into the basket and pull out a bubble. Her bubble contains a $100
bill. The probability of this happening is 1/200 = 0.005.
In statistical language, 0.005 is akin to a p-value. Because this occurrence was unlikely to have happened
if there truly is only one $100 bill in the basket, Ali can conclude that what the two of them were told was
wrong and there are actually more $100 bills in the basket. A "rare event" has occurred (Didi getting the
$100 bill), so Ali doubts the assumption about only one $100 bill being in the basket.

3.2.5.2 The Decision and Conclusion

Once you have determined the p-value associated with a sample mean or proportion, the next step is to
compare that p-value to the original level of significance.
When you make a decision to reject or not reject H0, do as follows:
If p-value < α, reject H0. The evidence provided by the sample data is significant. There is sufficient
evidence to conclude that H0 is an incorrect belief and that the alternative hypothesis, Ha, may be
correct.
If p-value ≥ α, do not reject H0. The evidence provided by the sample data is not significant. There is
not sufficient evidence to conclude that the alternative hypothesis, Ha, may be correct.
When you "do not reject H0", it does not mean that you have proven that H0 is true. It simply means
that the sample data have failed to provide sufficient evidence to cast serious doubt about the truthfulness
of H0.
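
The decision rule can be written as a small Python function. The name decide and the input checks are assumptions added for illustration; the rule itself is the one stated above, including the fact that a p-value exactly equal to α does not lead to rejection.

```python
def decide(p_value: float, alpha: float) -> str:
    """Apply the decision rule: reject H0 only when the p-value is below alpha."""
    if not (0 <= p_value <= 1 and 0 < alpha < 1):
        raise ValueError("p_value must be in [0, 1] and alpha in (0, 1)")
    return "reject H0" if p_value < alpha else "do not reject H0"


print(decide(0.006, 0.05))  # strong evidence against H0
print(decide(0.05, 0.05))   # p-value equal to alpha: H0 is not rejected
print(decide(0.20, 0.05))   # weak evidence against H0
```

Note that "do not reject H0" is deliberately not worded as "accept H0", in keeping with the caution above.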


The figure below illustrates a p-value of 0.006 and a chosen level of significance of 0.05. As you can see,
the p-value is much smaller than alpha (further out into the tail), which indicates strong evidence against
the null hypothesis.

Figure 3.13

Conclusion: After you make your decision, write a thoughtful conclusion about the hypotheses in
terms of the given problem, making specific reference to the context. The example below should serve as a
summary and a guide for conducting a full eight-step hypothesis test on a population mean or proportion.

3.2.5.3 Conducting the Hypothesis Test

In this course, we stress an eight-step process for conducting a hypothesis test.

1. Determine and record H0 and Ha, as discussed earlier in this chapter.
2. Record the sample evidence that you will be using to challenge H0. For a means test, your evidence
will consist of the sample mean, the sample (or population) standard deviation, and the sample size.
For a proportions test, your evidence will consist of the sample proportion and the sample size.
3. State the test considerations. Looking at the sample evidence and any stated assumptions,
determine the correct test procedure.
4. State the required strength of evidence. Consider the implications of a Type I vs. a Type II
error in choosing your level of significance, as well as any ethical considerations.
5. Calculate the test statistic. Using the sample evidence, compute Z* or t* and the associated
p-value.
6. Discuss what the p-value measures in context and whether the test statistic can be considered
an unlikely or a likely event within the context of the problem.
7. Make a decision. Compare the p-value to the required strength of evidence (alpha)
and determine if you can reject or fail to reject the null hypothesis.
8. Offer a concluding sentence. Using accessible language, summarize your conclusion in sentence
form within the context of the problem.


Example 1
Suppose Irene, who owns a top bakery in the city, claims that she has the best bread in the city by any
measure. Not only is her bread the tastiest, it is also the fluffiest and the tallest, averaging 15 cm in height.
Another baker, Jose, wishes to challenge Irene's claim that her bread is the tallest. As evidence, he will
take a sample of 40 randomly selected loaves of his bread and have their heights measured in an attempt to
show that his bread heights actually exceed 15 cm, on average. In doing so, he obtains a sample mean bread
height of 15.5 cm. He also knows from baking thousands of loaves that his variation is very low: specifically,
the standard deviation is 0.9 cm.
Step One
The null and alternative hypotheses are as follows:
Ho: µ = 15
Ha: µ > 15
Step Two
The sample evidence is as follows: sample mean = 15.5; population standard deviation = 0.9; sample
size = 40.
Step Three
The test considerations are as follows: We are using a large sample (n > 30) to conduct a means test. This
will require a sampling distribution of the mean, which the central limit theorem says will be approximately
normally shaped since our sample size exceeds 30. Since the population standard deviation is also known, we will do a Z-based test.
Step Four
The required strength of evidence can now be determined by considering the implications of a Type I
vs. a Type II error. In this context, Jose will make a Type I error if he concludes that his bread heights
average more than 15 cm when in fact they do not. He will make a Type II error if he concludes that his
bread heights do not average more than 15 cm when in fact they do. Which error is worse will depend on
where you are standing. Jose would consider a Type II error worse, whilst Irene would consider a Type I
error worse. To be fair, we will choose a level of significance of 5%, which is generally considered a good
balance between the two types of errors.
Reject Ho if p-value < 0.05
Step Five
Compute the test statistic as follows:
Z* = (15.5 - 15)/(0.9/√40) = 3.51
p-value = 0.0002
NOTE: The Excel function for computing a p-value is as follows:
=1-NORM.S.DIST(3.51,1)
Step Six
Interpret the p-value in the context of the problem. Under the assumption that Jose's bread is no taller
than Irene's (that is, his bread averages only 15 cm), the probability of obtaining a sample of 40 with a mean
of 15.5 cm (or more) is only 0.0002 or 0.02%, which makes it a very unlikely event.
Step Seven
Make a decision by comparing your p-value to your level of significance.
Since the p-value (0.0002) < 0.05, we can reject the null hypothesis.
Step Eight
Offer a final conclusion in sentence form: Therefore we can conclude that Jose's bread averages more
than 15 cm and is indeed taller than Irene's.
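
Steps Five through Seven of Example 1 can also be checked outside of Excel. The following is a minimal sketch in Python using only the standard library; the function name `z_test_right_tailed` is an assumption made for illustration, and the inputs are the values stated in the example.

```python
from math import sqrt
from statistics import NormalDist


def z_test_right_tailed(sample_mean, mu0, sigma, n):
    """Return (Z*, p-value) for H0: mu = mu0 against Ha: mu > mu0."""
    standard_error = sigma / sqrt(n)               # sigma / sqrt(n)
    z_star = (sample_mean - mu0) / standard_error  # test statistic
    p_value = 1.0 - NormalDist().cdf(z_star)       # area in the right tail
    return z_star, p_value


# Evidence from Example 1: sample mean 15.5, sigma 0.9, n 40, H0: mu = 15
z_star, p_value = z_test_right_tailed(15.5, 15.0, 0.9, 40)
alpha = 0.05

print(f"Z* = {z_star:.2f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "do not reject H0")
```

The right-tail area is used here because the alternative hypothesis is µ > 15; a left-tailed or two-tailed test would need a different tail calculation.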
Exercise 3.2.5.1: Practice Question One (Solution on p. 254.)
An auditing firm is looking at the travel expense claims for a large book retailer. The retailer's
books suggest that their average (µ) travel expenses were $1200 per person per year. A sample
of 64 random expense claims revealed an average of $1300. The population σ, based on an earlier
comprehensive audit, is $400. The sample suggests the books have understated the expense
claims. Identify the null and alternative hypotheses.


Exercise 3.2.5.2 (Solution on p. 254.)
State your evidence.
Exercise 3.2.5.3 (Solution on p. 254.)
Identify all test considerations and then determine the appropriate test.
Exercise 3.2.5.4 (Solution on p. 254.)
Consider the implications of both Type I and Type II errors and then decide on an appropriate
level of significance. State your decision rule.
Exercise 3.2.5.5 (Solution on p. 254.)
Calculate the test statistics.
Exercise 3.2.5.6 (Solution on p. 254.)
Define what the p-value is measuring in the context of the problem.
Exercise 3.2.5.7 (Solution on p. 254.)
Make a decision.
Exercise 3.2.5.8 (Solution on p. 254.)
Draw a final conclusion in sentence form.
Exercise 3.2.5.9: Practice Question Two (Solution on p. 254.)
A charitable organization wanted to see if a new form of mail marketing would change the percentage
of people who replied. In the past, the percentage of people who would reply to mail marketing
was 1 in 175. A sample of 2000 letters was sent out. A total of 20 people responded. Is there any
significant change in the percentage of respondents? Identify the null and alternative hypotheses.
Exercise 3.2.5.10 (Solution on p. 254.)
State your evidence.
Exercise 3.2.5.11 (Solution on p. 254.)
Identify all test considerations and then determine the appropriate test.
Exercise 3.2.5.12 (Solution on p. 254.)
Consider the implications of both Type I and Type II errors and then decide on an appropriate
level of significance. State your decision rule.
Exercise 3.2.5.13 (Solution on p. 255.)
Calculate the test statistics.
Exercise 3.2.5.14 (Solution on p. 255.)
Define what the p-value is measuring in the context of the problem.
Exercise 3.2.5.15 (Solution on p. 255.)
Make a decision by comparing the p-value to α.
Exercise 3.2.5.16 (Solution on p. 255.)
Draw a final conclusion in sentence form.
Exercise 3.2.5.17: Practice Question Three (Solution on p. 255.)
Charter Air claims that its new executive boarding service has improved the time it takes for
business passengers to purchase tickets, store luggage and board the plane. They believe that the mean
time is now less than the previous time of 12 minutes. A sample of 9 customers of this new exclusive service
indicates that the mean is 9.3 minutes with a standard deviation of 3.32 minutes. Previous
studies have revealed that boarding times tend to follow a normal distribution. Identify the null
and alternative hypotheses.
Exercise 3.2.5.18 (Solution on p. 255.)
State your evidence.
Exercise 3.2.5.19 (Solution on p. 255.)
Identify all test considerations and then determine the appropriate test.


Exercise 3.2.5.20 (Solution on p. 255.)
Consider the implications of both Type I and Type II errors and then decide on an appropriate
level of significance. State your decision rule.
Exercise 3.2.5.21 (Solution on p. 255.)
Calculate the test statistics.
Exercise 3.2.5.22 (Solution on p. 255.)
Make a decision by comparing the p-value to α.
Exercise 3.2.5.23 (Solution on p. 255.)
Define what the p-value is measuring in the context of the problem.
Exercise 3.2.5.24 (Solution on p. 255.)
Make a decision by comparing the p-value to alpha.
Exercise 3.2.5.25 (Solution on p. 255.)
Draw a final conclusion in sentence form.
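
For a proportions test such as the one in Practice Question Two, the test statistic and p-value can be sketched as follows. This is a hedged illustration rather than the course's prescribed method: the function name `one_proportion_z_test` and its `tail` parameter are assumptions, the inputs shown are hypothetical (not taken from the exercises), and the sketch assumes the sample is large enough for the normal approximation to apply.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(successes, n, p0, tail="two"):
    """Return (Z*, p-value) for H0: population proportion = p0.

    tail: "two" (Ha: not equal), "right" (Ha: greater), "left" (Ha: less).
    """
    p_hat = successes / n
    # The standard error uses the hypothesized proportion p0, not p_hat
    standard_error = sqrt(p0 * (1.0 - p0) / n)
    z_star = (p_hat - p0) / standard_error
    cdf = NormalDist().cdf(z_star)
    if tail == "right":
        p_value = 1.0 - cdf
    elif tail == "left":
        p_value = cdf
    elif tail == "two":
        p_value = 2.0 * min(cdf, 1.0 - cdf)
    else:
        raise ValueError("tail must be 'two', 'right' or 'left'")
    return z_star, p_value


# Hypothetical numbers for illustration only
z_star, p_value = one_proportion_z_test(successes=60, n=100, p0=0.5, tail="right")
print(f"Z* = {z_star:.2f}, p-value = {p_value:.4f}")
```

The choice of tail should follow the alternative hypothesis: a question asking whether a percentage has "changed" calls for the two-tailed version.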


3.2.6 Practice Questions for Chapters 7 and 8 (footnote 14)


3.2.6.1 Practice questions for Chap. 7 & 8

These questions were derived from Lyryx Learning, Business Statistics I - MGMT 2262 -
Mt Royal University - Version 2016 Revision A. OpenStax CNX. Sep 8, 2016 http://cnx.org/contents/f3aefa9e-58d2-41ea-969f-
[email protected].

important: If a question has a set of data, please see the course site for the Excel file.

note: Solutions are at the end of the chapter.

1. Question 1: The Specific Absorption Rate (SAR) for a cell phone measures the amount of radio
frequency (RF) energy absorbed by the user's body when using the handset. Every cell phone emits
RF energy. Different phone models have different SAR measures. To receive certification from the
Federal Communications Commission (FCC) for sale in the United States, the SAR level for a cell
phone must be no more than 1.6 watts per kilogram. Table 3.4 shows the highest SAR level for a
random selection of cell phone models as measured by the FCC. A recent study has shown that if a
cell phone's SAR level exceeds 0.9 watts per kilogram, there is an increased chance of brain tumours
for those who use this phone (footnote 15). An advocacy group wants to use this new study to petition the FCC
to change its regulations around the current allowable SAR levels.

Phone model        | SAR (W/kg) | Phone model       | SAR (W/kg) | Phone model            | SAR (W/kg)
Apple iPhone 4S    | 1.11       | LG Ally           | 1.36       | Pantech Laser          | 0.74
BlackBerry Pearl   | 1.48       | LG AX275          | 1.34       | Samsung Character      | 0.5
BlackBerry Tour    | 1.43       | LG Cosmos         | 1.18       | Samsung Epic 4G Touch  | 0.4
Cricket TXTM8      | 1.3        | LG CU515          | 1.3        | Samsung M240           | 0.867
HP/Palm Centro     | 1.09       | LG Trax CU575     | 1.26       | Samsung Messenger III  | 0.68
HTC One V          | 0.455      | Motorola Q9h      | 1.29       | Samsung Nexus S        | 0.51
HTC Touch Pro 2    | 1.41       | Motorola Razr2 V8 | 0.36       | Samsung SGH-A227       | 1.13
Huawei M835 Ideos  | 0.82       | Motorola Razr2 V9 | 0.52       | SGH-a107 GoPhone       | 0.3
Kyocera DuraPlus   | 0.78       | Motorola V195s    | 1.6        | Sony W350a             | 1.48

Table 3.4

a. What is the variable being studied? Categorize it. Based on this, what descriptive statistic (mean
or proportion) is best for this situation?
b. Is it appropriate to assume that the sampling distribution is normal? Explain your reasoning
and provide evidence for your choice. Regardless of your answer in b), assume that the sampling
distribution is normal for the remaining questions.
c. The advocacy group will go forward with their petition if they can show that, on average, cell
phones have SAR rates that exceed 0.9 watts per kg. This advocacy group is run by an admin-
istrator who is very risk averse (meaning they will only go forward with the petition if there is a
lot of evidence). Determine whether the advocacy group should go forward with their petition by
performing an appropriate eight-step hypothesis test.
14 This content is available online at <http://cnx.org/content/m65288/1.3/>.
15 This is completely made-up.


d. Find a confidence interval for the true (population) mean of the Specific Absorption Rates (SARs)
for cell phones. Choose a confidence level that complements the level of significance you have
chosen above.
e. Interpret the confidence interval in the context of the question.
f. Does the confidence interval suggest that the mean SAR exceeds 0.9? Compare your answer with
what you got for the hypothesis test. Do the confidence interval and hypothesis test support each
other? Explain your answer.
2. Question 2: A hospital is trying to cut down on emergency room wait times. In the past, they have
found that the average wait time is 1.4 hours for patients to be called back to be examined. They have
implemented a new triage protocol and are interested in seeing if it has changed the amount of time
patients must wait before being called back to be examined. An investigation committee randomly
surveyed 70 patients. The sample mean wait time was 1.5 hours with a sample standard deviation of
0.5 hours.
a. What is the variable being studied? Categorize it. Based on this, what descriptive statistic (mean
or proportion) is best for this situation?
b. Use an appropriate eight-step hypothesis test to determine if the average wait time for patients to be
called back to be examined has changed from 1.4 hours. Use a level of significance of 10%.
c. Is there a level of significance that causes you to change your decision?
d. Suppose the true population mean wait time is 1.4 hours. Have you made an error in b)? If so,
what type?
e. Construct a 90% confidence interval for the population mean emergency room wait time.
f. Interpret the confidence interval in the context of the question.
g. If the investigation committee wants to increase its level of confidence and keep the margin of
error the same by taking another survey, what changes should it make?
h. If the investigation committee did another survey, kept the margin of error the same, and surveyed
200 people instead of 70, how would the level of confidence have to change? Why?
i. Suppose the investigation committee wanted its estimate of the population mean emergency room
wait time to be within 0.05 hours of the true mean. How many patients would they need to
interview?
3. Question 3: Twenty-five Americans were surveyed to determine the number of hours they spend watching
television each month. The results were as follows:

207 188 168 122 107
122 173 190 140 129
205 169 163 118 142
150 130 123 129  97
156 118 150 129 216

Table 3.5

Assume that the underlying population distribution is normal and the population standard deviation
is known to be 32 hours.
a. What is the variable being studied? Categorize it. Based on this, what descriptive statistic (mean
or proportion) is best for this situation?
b. The U.S. government has recently released a recommendation that Americans watch less than 150
hours of television per month. Based on this sample, is there enough evidence to suggest that,
on average, Americans are meeting this recommendation? Base your answer on an appropriate
eight-step hypothesis test. Use α = 5%.


c. Construct a 99% confidence interval for the population mean hours spent watching television per
month.
d. Interpret the confidence interval in the context of the question.
e. Explain what the confidence level means in the context of the question.
4. Question 4: The standard deviation of the weights of newborn elephants is known to be approximately
15 pounds. We wish to construct a 95% confidence interval for the mean weight of newborn elephant
calves. Fifty newborn elephants are weighed. The sample mean is 244 pounds. The sample standard
deviation is 11 pounds.
a. What model will you use to construct a confidence interval for the population mean? Explain
your reasoning by referring to the criteria for that model.
b. Construct a 95% confidence interval for the population mean weight of newborn elephants.
c. What will happen to the confidence interval obtained if 500 newborn elephants are weighed
instead of 50? Why?
d. Based on the confidence interval, is it fair to say that the average weight of a newborn elephant
exceeds 235 pounds? Explain your answer.
e. Does an appropriate hypothesis test support your decision in d)? Explain your answer by doing
the eight-step hypothesis test.
5. Question 5: A news magazine is investigating the changing dynamics in marriages. Historically, men
made many of the financial decisions, including the decision on whether to make major household
purchases (such as buying a new vehicle or doing a renovation), while women were left out of them.
To investigate whether this has changed, the magazine is considering doing a study to find out the
percentage of couples who are equally involved in making decisions about household purchases.
a. What is the variable being studied? Categorize it. Based on this, what descriptive statistic (mean
or proportion) is best for this situation?
b. When designing a study to determine this population proportion, what is the minimum number
you would need to survey to be 90% confident that the population proportion is estimated to
within 0.05?
c. If it were later determined that it was important to be more than 90% confident, how would it
affect the minimum number you need to survey? Why? Do not do any calculations.
Suppose the marketing company did do the survey. They randomly surveyed 200 households and found that
in 114 of them, the couple makes major household purchasing decisions together. A similar study
from the 1980s found that 46.5% of couples made major household purchasing decisions together.
d. Conduct an eight-step hypothesis test to determine whether there has been a significant increase
in the proportion of couples who make major household purchasing decisions together since the 1980s.
The editor of the magazine will only publish the article if there is ample evidence to support the
claim.
e. Construct a 95% confidence interval for the population proportion of couples who make major
household purchasing decisions together.
f. Interpret the confidence interval in the context of the question.
g. If the rate has increased, use the confidence interval to determine by how much the rate has
increased since the 1980s.
h. List two difficulties the company might have in obtaining random results, if this survey were done
by email.
6. Question 6: Suppose that an accounting firm has developed new software to help its clients do
their taxes more quickly. Based on a national survey, people spend an average of 24.4 hours completing
their personal income taxes each year. The accounting firm has a random sample of 100 of its clients
complete their 2016 income tax return using the new software. The sample mean time to complete the
tax returns is 23.6 hours with a standard deviation of 7.0 hours. The firm doesn't want to release the
software unless they are sure it will reduce the time it takes clients to do their taxes. The population
distribution is assumed to be normal.


a. What is the variable being studied? Categorize it. Based on this, what descriptive statistic (mean
or proportion) is best for this situation?
b. Conduct an appropriate eight-step hypothesis test to determine if, on average, the software has
reduced the time it takes clients to do their taxes.
c. Suppose the truth is that the software does help clients do their taxes faster. Has an error been
committed? If so, what type of error is it? Explain your answers.
d. Construct a 90% confidence interval for the population mean time to complete the tax forms.
e. Interpret the confidence interval in the context of the question.
f. Does the confidence interval support the results of the hypothesis test? Explain your answer.
g. If the firm wished to increase its level of confidence and keep the margin of error the same by
taking another survey, what changes should it make? Why?
h. If the firm did another survey, kept the margin of error the same, and only surveyed 49 people,
how would the level of confidence have to change? Why?
i. Suppose that the firm decided that it needed to be at least 96% confident of the population mean
length of time to within one hour. How would the number of people the firm surveys change?
Why?
7. Question 7: In 2013, it was determined that 21% of North Americans download music illegally. Public
Policy Polling is wondering whether that number has changed. They asked a random sample of adults
across North America about their downloading habits. When asked, 512 of the 2247 participants
admitted that they have illegally downloaded music.
a. Has the proportion of North Americans who illegally download music increased since 2013? Con-
duct an appropriate eight-step hypothesis test to support your answer.
b. Create and interpret a 99% confidence interval for the true proportion of North American adults
who have illegally downloaded music.
c. This survey was conducted through automated telephone interviews on May 6 and 7 of this year.
The margin of error of the survey compensates for sampling error, or natural variability among
samples. List some factors that could affect the survey's outcome that are not covered by the
margin of error.
d. Without performing any calculations, describe how the confidence interval would change if the
confidence level changed from 99% to 90%.
e. Suppose Public Policy Polling wants to conduct the study again now. They want to keep the
same level of confidence as their last survey, but they want their results to be within 2% of the
true proportion of North American adults who have illegally downloaded music. What is the minimum
sample size they need to obtain this?
8. Question 8: A survey of the mean number of cents off that coupons give was conducted by randomly
surveying one coupon per page from the coupon sections of a recent San Jose Mercury News. The
following data were collected (in cents): 20; 75; 50; 65; 30; 55; 40; 40; 30; 55; 150; 40; 65; 40.
Assume the underlying distribution is approximately normal.
a. What is the variable being studied? Categorize it. Based on this, what descriptive statistic (mean
or proportion) is best for this situation?
b. Conduct an appropriate eight-step hypothesis test to determine if the mean number of cents off
a coupon is different from 50 cents. Use a level of significance of 3%.
c. What is the probability of committing a type I error in the above hypothesis test?
d. Construct a 97% confidence interval for the population mean worth of coupons.
e. Interpret the confidence interval in the context of the question.
f. If many random samples were taken of size 14, what percent of the confidence intervals constructed
should contain the population mean worth of coupons? Explain why.
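
Several of the questions above ask for a z-based confidence interval or for the minimum sample size needed to reach a given margin of error. The relationships involved (margin of error = critical value × standard error, solved for n when a sample size is required) can be sketched as follows. The function names and the example inputs are hypothetical and chosen only for illustration; the sketch assumes a z-based model is appropriate, and the proportion version uses 0.5 as a conservative guess when no estimate is available.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_critical(confidence):
    """Two-sided critical value, e.g. about 1.645 for 90% confidence."""
    return NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)


def z_interval(sample_mean, sd, n, confidence):
    """Z-based confidence interval for a population mean."""
    margin = z_critical(confidence) * sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin


def sample_size_for_mean(sd, margin, confidence):
    """Smallest n such that z * sd / sqrt(n) is no larger than margin."""
    return ceil((z_critical(confidence) * sd / margin) ** 2)


def sample_size_for_proportion(margin, confidence, p_guess=0.5):
    """Smallest n such that z * sqrt(p(1-p)/n) is no larger than margin."""
    z = z_critical(confidence)
    return ceil(z ** 2 * p_guess * (1.0 - p_guess) / margin ** 2)


# Hypothetical inputs for illustration only
low, high = z_interval(sample_mean=100.0, sd=15.0, n=36, confidence=0.95)
print(f"95% CI: {low:.2f} to {high:.2f}")
print(sample_size_for_mean(sd=15.0, margin=2.0, confidence=0.95))
print(sample_size_for_proportion(margin=0.03, confidence=0.95))
```

Sample sizes are always rounded up, since rounding down would leave the margin of error slightly larger than requested.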


3.2.6.1.1 Solutions to Practice questions

1. a. The variable is the specific absorption rate. It is quantitative continuous data. The best descriptive
statistic for this type of data is the mean.
b. Since the sample size is less than 30, we can only assume the sampling distribution is normal if
the population distribution is close to being normal. Based on the normal curve plot and the
empirical rule, it appears that the sample is not normally distributed. The normal curve plot is
not a straight line and only 55.6% of the data fall within one standard deviation of the mean. This
conclusion is supported by a bimodal histogram. This suggests that the population distribution is
not normal, which means we cannot be certain the sampling distribution is normal. Regardless of
your answer in b), assume that the sampling distribution is normal for the remaining questions.
c. Step 1. State hypotheses both in sentence and numerical form. Define the symbols. H0 : on average,
cell phones have SAR rates that are 0.9 watts per kg, µ = 0.9; HA : on average, cell phones
have SAR rates that exceed 0.9 watts per kg, µ > 0.9
Step 2. State the evidence. n = 27, x̄ = 0.989, s = 0.410
Step 3. State the model. Explain why you have chosen it.
• Sampling distribution of sample means is normal? Yes, as stated in the question.
• Population standard deviation is known? No
• Sample size is greater than 40? No
Therefore, since we need to estimate the population standard deviation using the sample
standard deviation and the sample size is small, we will use the t-based mean model.
Step 4. State the level of significance and why you have chosen it. State the decision rule. Since the
administrator is risk averse, they want to ensure that they have rejected H0 with a lot of
evidence. Therefore, the level of significance that requires the most evidence to reject H0 is
1%. If p < 1%, reject H0 . If p ≥ 1%, do not reject H0 .
Step 5. Evaluate the evidence (i.e. find the p-value using a computer program). p = 0.1357
Step 6. State what the p-value means in the context of the question. The probability that a sample
mean SAR of at least 0.989 is observed, under the assumption that the mean SAR is 0.9, is
13.57%.
Step 7. Make a decision. Since p (13.57%) is greater than α (1%), we do not reject H0 .
Step 8. Explain the result in a sentence that refers back to the context. There is not sufficient evidence
to suggest that, on average, cell phones have SAR rates that exceed 0.9 watts per kg, which
means the advocacy group should not go forward with their petition.
d. Since α = 1%, I will use a confidence level of 98% (for a one-tailed HT, use 1-2*alpha to determine
the complementary CL): 0.793 to 1.18
e. We are 98% confident that the true population mean for SARs is somewhere between 0.793
watts/kg and 1.18 watts/kg.
f. Though there are possible values for the population mean that do exceed 0.9 watts/kg in the
CI, there are also values that do not exceed 0.9 watts/kg. Therefore, the CI would lead to an
inconclusive result, meaning it is not clear from the CI whether the population mean exceeds 0.9 or
not. This aligns with our hypothesis test, which found that there is not enough evidence to suggest that the
population mean exceeds 0.9 watts/kg.
2. a. The variable is the emergency room wait time. It is quantitative continuous data. The best
descriptive statistic for this type of data is the mean.
b. Step 1. State hypotheses both in sentence and numerical form. Define the symbols. H0 : the average
wait time for patients to be called back to be examined is 1.4 hours, µ = 1.4; HA : the average
wait time for patients to be called back to be examined has changed from 1.4 hours, µ ≠ 1.4
Step 2. State the evidence. n = 70, x̄ = 1.5, s = 0.5
Step 3. State the model. Explain why you have chosen it.
• Sampling distribution of sample means is normal? Yes. As the sample size (70) is greater
than 30, the central limit theorem applies and the sampling distribution of sample means
is approximately normally distributed.


• Population standard deviation is known? No
• Sample size is greater than 40? Yes
Therefore, since we need to estimate the population standard deviation using the sample
standard deviation, but the sample size is large enough that the difference between the t-based
and z-based model is minimal, we will use the z-based mean model.
Step 4. State the level of significance and why you have chosen it. State the decision rule. As stated
in the question, use 10%. If p < 10%, reject H0 . If p ≥ 10%, do not reject H0 .
Step 5. Evaluate the evidence (i.e. find the p-value using a computer program). p = 0.0943
Step 6. State what the p-value means in the context of the question. The probability of observing a
sample mean wait time at least as far from 1.4 hours as 1.5 hours (in either direction, hence
the one-tail probability times 2), under the assumption that the mean wait time is 1.4, is 9.43%.
Step 7. Make a decision. Since p (9.43%) is less than α (10%), we reject H0 .
Step 8. Explain the result in a sentence that refers back to the context. There is sufficient evidence to
suggest that the average wait time for patients to be called back to be examined has changed
from 1.4 hours.
c. Yes. For any level of significance at or below the p-value (α ≤ 0.0943, for example 5%), we would
change our decision to do not reject H0 .
d. Yes. We have concluded that the mean has changed from 1.4, but the truth is that the mean has
stayed the same. Therefore, we have made an error. As we have incorrectly rejected H0 , it is a
type I error.
e. 1.402 to 1.598
f. We are 90% confident that the population average wait time in the emergency room is somewhere
between 1.4 hours and 1.6 hours.
g. If the level of confidence is increased then the critical value in the margin of error would increase.
To keep the margin of error the same, either the standard deviation would need to decrease, or
the sample size would need to increase. As the standard deviation is inherent to the data, the
sample size needs to increase.
h. If the sample size increases, then the margin of error decreases. This means that to keep the
margin of error constant, the level of confidence would need to increase. This would cause the
critical value to be bigger, which would compensate for the larger sample size.
i. They would need to interview at least 271 patients.
3. a. The variable is the number of hours Americans spend watching TV. It is quantitative discrete
data. The best descriptive statistic for this type of data is the mean.
b. Step 1. State hypotheses both in sentence and numerical form. Define the symbols. H0 : on average,
Americans are not meeting this recommendation, µ = 150; HA : on average, Americans are
meeting this recommendation, µ < 150
Step 2. State the evidence. n = 25, x̄ = 149.64, σ = 32
Step 3. State the model. Explain why you have chosen it.
• Sampling distribution of sample means is normal? Yes. The preamble states that the population
is normally distributed. As the population distribution is assumed to be normal, we
know the sampling distribution of sample means is also normal, even though the sample
size is less than 30.
• Population standard deviation is known? Yes
As the population standard deviation is known, we will use the z-based mean model.
Step 4. State the level of significance and why you have chosen it. State the decision rule. The level
of significance is provided in the question. If p < 5%, reject H0 . If p ≥ 5%, do not reject H0 .
Step 5. Evaluate the evidence (i.e. find the p-value using a computer program). p = 0.4776
Step 6. State what the p-value means in the context of the question. The probability that a sample
mean number of hours of TV watched of at most 149.64 hours is observed, under the
assumption that the mean number of hours watching TV is 150, is 47.76%.
Step 7. Make a decision. Since p (47.76%) is greater than α (5%), we do not reject H0 .
Step 8. Explain the result in a sentence that refers back to the context. There is not sufficient evidence


to suggest that, on average, Americans are meeting the recommendation of watching less than
150 hours of television per month.
c. 133.2 to 166.1
d. We are 99% confident that the population mean time that Americans spend watching TV is
somewhere between 133.2 hours and 166.1 hours.
e. The confidence level means that if we took many random samples of size 25 from the population of
Americans and constructed a confidence interval for each of these random samples, then about 99%
of these confidence intervals would contain the population mean time Americans spend watching
TV, while about 1% would not.
4. a. We know the sampling distribution for sample means is normal because the sample size is greater
than 30 as stated in the Central Limit Theorem. Therefore, we use either the Student-t or the
standard normal distributions. As the population standard deviation is known, we can use the
standard normal distribution (i.e z-based normal distribution).
b. 239.84 to 248.16
c. The condence interval will get narrower because the margin of error will be smaller. The margin
of error is smaller because the amount of error between the sample means and the population
mean is smaller as stated in the law of large numbers.
d. Yes, the estimated population mean weight of newborn elephants is 239.84 pounds to 248.16
pounds. Based on this, it is fair to say that the average weight exceeds 235 pounds, as both
bounds are larger than 235.
Step
e. 1. State hypotheses both in sentence and numerical form. Dene the symbols. H0 : on average,
newborn elephants weigh 235 pounds, µ = 235; HA : on average, newborn elephants weigh
exceeds 235 pounds, µ > 235
Step 2. State the evidence. n = 50, X = 244, σ = 15
Step 3. State the model. Explain why you have chosen it.
• Sampling distribution of sample means is normal? Yes as the sample size (50) is greater
than 30, the central limit theorem applies and the sampling distribution of sample means
is normally distributed.
• Population standard deviation is known? Yes
As the population standard deviation is known, we will use the z-based mean model.
Step 4. State the level of significance and why you have chosen it. State the decision rule. As the
confidence level in the previous question was 95% and we are attempting to verify the CI
with a HT, we should use an α of 2.5% (solve for α in 0.95 = 1 − 2α, for a one-tailed
HT). If p < 2.5%, reject H0 . If p ≥ 2.5%, do not reject H0 .
Step 5. Evaluate the evidence (i.e. find the p-value using a computer program). p = 1.10E−5 =
1.10 × 10^−5 = 0.000011
Step 6. State what the p-value means in the context of the question. The probability that a sample
mean weight of newborn elephants is at least 244 pounds is observed, under the assumption
that the mean weight of newborn elephants is 235, is 0.0011%.
Step 7. Make a decision. Since p (0.0011%) is less than α (2.5%), we reject H0 .
Step 8. Explain the result in a sentence that refers back to the context. There is sufficient evidence
to suggest that, on average, newborn elephants weigh more than 235 pounds.
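The p-value in Step 5 can be checked with a short calculation. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` stands in for the "computer program" the text refers to; the function name `upper_tail_z_test` is illustrative and not from the source.

```python
from math import sqrt
from statistics import NormalDist


def upper_tail_z_test(sample_mean, null_mean, sigma, n):
    """Return (z, p) for H0: mu = null_mean against HA: mu > null_mean."""
    standard_error = sigma / sqrt(n)          # sigma is the known population SD
    z = (sample_mean - null_mean) / standard_error
    p_value = 1.0 - NormalDist().cdf(z)       # area to the right of z
    return z, p_value


# Evidence from Step 2: n = 50, sample mean = 244, sigma = 15
z, p = upper_tail_z_test(244, 235, 15, 50)
print(f"z = {z:.2f}, p = {p:.2e}")
```

The decision rule from Step 4 is then a single comparison of `p` against 0.025.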
5. a. The variable is whether a couple makes major household purchasing decisions together or
not. It is categorical nominal data. The best descriptive statistic for this type of data is a
proportion.
b. They would need to interview a minimum of 271 households. (Note: As no estimate of the
population proportion is provided, use 50%.)
c. If it were later determined that it was important to be more than 90% confident and a new survey
were commissioned, how would it affect the minimum number you need to survey? Why?
d. Step 1. State hypotheses both in sentence and numerical form. Define the symbols. H0 : the
proportion of couples who make major household purchasing decisions together is unchanged


at 46.5%, π = 0.465; HA : the proportion of couples who make major household purchasing
decisions together is greater than 46.5%, π > 0.465
Step 2. State the evidence. n = 200, X = 114
Step 3. State the model. Explain why you have chosen it.
• Binomial distribution? Yes, because ...
· Data is being counted: Yes. Counting the number of couples.
· Data is collected randomly: Says so in question
· There are only two outcomes: Either couple makes household decisions together or they
don't.
· There are a fixed number of trials: 200
· The trials are independent: As the sample is random, it is safe to say this is the case
as how one couple makes decisions should not affect how another randomly selected
couple makes decisions.
Step 4. State the level of significance and why you have chosen it. State the decision rule. As the
editor needs strong evidence, we need to choose α to be small, i.e. 1%. If p < 1%, reject H0 . If
p ≥ 1%, do not reject H0 .
Step 5. Evaluate the evidence (i.e. find the p-value using a computer program). p = 0.00185
Step 6. State what the p-value means in the context of the question. The probability that at least
114 out of 200 couples make major purchasing decisions together, assuming the rate has not changed
since the 1980s, is 0.19%.
Step 7. Make a decision. Since p (0.19%) is less than α (1%), we reject H0 .
Step 8. Explain the result in a sentence that refers back to the context. There is sufficient evidence
to suggest that the proportion of couples who make major household purchasing decisions
together is greater than 46.5%.
e. 0.5014 to 0.6386
f. We are 95% confident that the true proportion of couples who make major household purchasing
decisions together is somewhere between 50.14% and 63.86%.
g. Based on the CI, the rate has increased by at least 3.6 percentage points and by at most 17.4
percentage points.
h. One issue is how the marketing company will develop the list of email addresses. Most likely they
will not have a complete list of all emails for all households. Second, the email will be sent
to a member of the household and not to the household as a whole. Thus one household may get
multiple surveys. Further, not everyone uses email, so the sample will miss those households.
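The binomial p-value in part d (the probability of at least 114 successes in 200 trials when π = 0.465) can be summed directly. This is a sketch under the assumption that the exact binomial tail is what the computer program reports; the helper name `binomial_upper_tail` is hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Evidence from part d: n = 200 couples, X = 114, null proportion 0.465
p_value = binomial_upper_tail(114, 200, 0.465)
print(f"p = {p_value:.5f}")
```

The result should land close to the value reported in Step 5, and in any case well below the 1% level of significance chosen in Step 4.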
6. a. The variable is the amount of time people take completing their tax forms. It is quantitative
continuous data. The best descriptive statistic for this type of data is the mean.
b. Conduct an appropriate eight-step hypothesis test to determine if, on average, the software has
reduced the time it takes clients to do their taxes.
Step 1. State hypotheses both in sentence and numerical form. Define the symbols. H0 : on average,
the software has not reduced the time it takes clients to do their taxes, µ = 24.4; HA : on
average, the software has reduced the time it takes clients to do their taxes, µ < 24.4
Step 2. State the evidence. n = 100, X = 23.6, s = 7.0
Step 3. State the model. Explain why you have chosen it.
• Sampling distribution of sample means is normal? Yes as the population distribution is
assumed to be normal, we know the sampling distribution of sample means is also normal.
• Population standard deviation is known? No
• Sample size greater than 40? Yes
Therefore, since we need to estimate the population standard deviation using the sample
standard deviation but the sample size is large enough that the difference between the
z-based and t-based models is minimal, we will use the z-based mean model.
Step 4. State the level of significance and why you have chosen it. State the decision rule. Since the
firm doesn't want to release the software unless they are very confident that it works, they


should choose a small level of significance (i.e. 1%). If p < 1%, reject H0 . If p ≥ 1%, do not
reject H0 .
Step 5. Evaluate the evidence (i.e. find the p-value using a computer program). p = 0.1265
Step 6. State what the p-value means in the context of the question. The probability that a sample
mean time to complete tax returns of at most 23.6 hours is observed, under the assumption
that the mean time is 24.4, is 12.65%.
Step 7. Make a decision. Since p (12.65%) is greater than α (1%), we do not reject H0 .
Step 8. Explain the result in a sentence that refers back to the context. There is not sufficient evidence
to suggest that, on average, the software has reduced the time it takes clients to do their taxes.
c. Since we have stated that there is not enough evidence that the software has reduced
the time it takes clients to do their taxes, when in fact it has, we have committed a Type II error.
d. 22.45 to 24.75
e. We are 90% confident that the true average time it takes for people to complete their tax forms
with this new software is somewhere between 22.45 hours and 24.75 hours.
f. The HT has led us to state that there is not sufficient evidence that the average time has been
reduced from 24.4 hours. The CI supports this, as it contains the hypothesized population mean of 24.4 hours.
g. If the level of confidence is increased, then the critical value in the margin of error would increase.
To keep the margin of error the same, either the standard deviation would need to decrease, or
the sample size would need to increase. As the standard deviation is inherent to the data, the
sample size needs to increase.
h. If the sample size decreases, then the margin of error increases. This means that to keep the
margin of error constant, the level of confidence would need to decrease. This would cause the
critical value to be smaller, which would compensate for the smaller sample size.
i. It would not change the number of people needed to be interviewed. The level of confidence and
the sample size are independent of each other.
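The lower-tailed p-value from part b and the 90% interval from part d can be reproduced together, which also makes the comparison in part f concrete. The sketch below assumes the z-based model chosen in Step 3 and uses only the standard library; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Evidence from Step 2: n = 100, sample mean = 23.6, s = 7.0; null mean = 24.4
n, sample_mean, s, null_mean = 100, 23.6, 7.0, 24.4
standard_error = s / sqrt(n)

# Lower-tailed test (HA: mu < 24.4): area to the left of z
z = (sample_mean - null_mean) / standard_error
p_value = NormalDist().cdf(z)

# 90% confidence interval: 5% in each tail
z_crit = NormalDist().inv_cdf(0.95)
margin = z_crit * standard_error
ci = (sample_mean - margin, sample_mean + margin)

print(f"z = {z:.3f}, p = {p_value:.4f}")
print(f"90% CI: {ci[0]:.2f} to {ci[1]:.2f}")
print("CI contains 24.4:", ci[0] < null_mean < ci[1])
```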
7. a. Step 1. State hypotheses both in sentence and numerical form. Define the symbols. H0 : the
proportion of North Americans who illegally download music has not increased since 2013, π = 0.21; HA :
the proportion of North Americans who illegally download music has increased since 2013,
π > 0.21
Step 2. State the evidence. n = 2247, X = 512
Step 3. State the model. Explain why you have chosen it. Normal distribution approximation of the
binomial distribution:
• Binomial distribution? Yes, because ...
· Data is being counted: Yes. Counting the number of North Americans who illegally
download music.
· Data is collected randomly: Says so in question
· There are only two outcomes: Either person illegally downloads music or they don't.
· There are a fixed number of trials: 2247
· The trials are independent: As the sample is random, it is safe to say this is the case as
whether one person downloads music illegally or not should not affect whether another
randomly selected person does.
Therefore, we will use the normal distribution approximation of the binomial distribution.
Step 4. State the level of significance and why you have chosen it. State the decision rule. As there
is no motivation stated in the study, I will choose a level of significance that is a balance
between rejecting and not rejecting H0, i.e. 5%. If p < 5%, reject H0 . If p ≥ 5%, do not
reject H0 .
Step 5. Evaluate the evidence (i.e. find the p-value using a computer program). p = 0.0208
Step 6. State what the p-value means in the context of the question. The probability that at least
512 out of 2247 North Americans admit that they have downloaded music illegally, assuming
the rate has not changed since 2013, is 2.08%.


Step 7. Make a decision. Since p (2.08%) is less than α (5%), we reject H0 .


Step 8. Explain the result in a sentence that refers back to the context. There is sufficient evidence
to suggest that the proportion of North Americans who illegally download music has increased
since 2013.
b. We are 99% confident that the true proportion of North Americans that download music illegally is
somewhere between 20.51% and 25.07%.
c. Some people may not want to admit to having downloaded music illegally. It is unclear how
PPP got the list of phone numbers. This list could miss cell phone users and thus would not be
representative.
d. The confidence interval would get narrower.
e. 2919
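The interval quoted in part b follows from the usual large-sample formula for a proportion, p̂ ± z × √(p̂(1 − p̂)/n). The following sketch assumes that formula is what the computer program applies; `proportion_ci` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence):
    """Large-sample (z-based) confidence interval for a population proportion."""
    p_hat = successes / n
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_crit * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Evidence from part a: 512 of 2247 respondents, 99% confidence
lower, upper = proportion_ci(512, 2247, 0.99)
print(f"99% CI: {lower:.4f} to {upper:.4f}")
```

Calling the same function with a lower confidence level returns a narrower interval, because the critical value shrinks.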
8. a. The variable is the number of cents off that coupons give. It is quantitative discrete data. The
best descriptive statistic for this type of data is the mean.
b. Step 1. State hypotheses both in sentence and numerical form. Define the symbols. H0 : the mean
number of cents off a coupon is the same as 50 cents, µ = 50; HA : the mean number of cents off
a coupon is different from 50 cents, µ ≠ 50
Step 2. State the evidence. n = 14, X = 53.93, s = 31.63
Step 3. State the model. Explain why you have chosen it.
• Sampling distribution of sample means is normal? Yes as the population distribution is
assumed to be normal, we know the sampling distribution of sample means is also normal.
• Population standard deviation is known? No
• Sample size greater than 40? No
Therefore, since we need to estimate the population standard deviation using the sample
standard deviation and the sample size is small, we will use the t-based mean model.
Step 4. State the level of significance and why you have chosen it. State the decision rule. Level of
significance in the question is stated to be 3%. If p < 3%, reject H0 . If p ≥ 3%, do not reject
H0 .
Step 5. Evaluate the evidence (i.e. find the p-value using a computer program). p = 0.6499
Step 6. State what the p-value means in the context of the question. The probability that a
sample mean number of cents off a coupon at least as far from 50 as 53.929 (in either direction)
is observed, under the assumption that the mean number of cents is 50, is 64.99%.
Step 7. Make a decision. Since p (64.99%) is greater than α (3%), we do not reject H0 .
Step 8. Explain the result in a sentence that refers back to the context. There is not sufficient evidence
to suggest that the mean number of cents off a coupon is different from 50 cents.
c. It is the level of significance, 3%.
d. 33.335 to 74.522
e. We are 97% confident that the mean number of cents off that coupons give is somewhere between
33.3 cents and 74.5 cents.
f. 97% of them would contain the population mean, while 3% would not. This is determined by the
confidence level.
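The two-sided p-value in part b comes from the Student-t distribution with n − 1 = 13 degrees of freedom. Python's standard library has no t-distribution, so the sketch below integrates the t density numerically with Simpson's rule. This is an illustrative stand-in for statistical software, and the function names are assumptions rather than anything from the source.

```python
from math import gamma, pi, sqrt


def t_pdf(x, df):
    """Density of the Student-t distribution with df degrees of freedom."""
    coeff = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3      # half the area lies to the right of 0


# Evidence from Step 2: n = 14, sample mean = 53.93, s = 31.63; null mean = 50
n, sample_mean, s, null_mean = 14, 53.93, 31.63, 50
t_stat = (sample_mean - null_mean) / (s / sqrt(n))
p_two_sided = 2 * t_upper_tail(abs(t_stat), n - 1)   # both tails for HA: mu != 50
print(f"t = {t_stat:.3f}, p = {p_two_sided:.4f}")
```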


Solutions to Exercises in Chapter 3


Solution to Exercise 3.1.4.1 (p. 213)

1. We can use the standard normal model to find the confidence interval, because the sample was collected
randomly and, since the sample size is greater than 30 (it is 175), we can be very confident that the
sampling distribution for the sample means is normal due to the central limit theorem. To find the
confidence interval, use a computer program. Make sure to choose the z-model (instead of the t-model).
Input the sample size as 175, the sample mean as 21.34 and the standard deviation as 5.12. Choose
the level of confidence to be 95%. This gives the following output:

95% confidence level
1.96 z
0.759 margin of error
20.581 lower confidence limit
22.099 upper confidence limit

Table 3.6

From this, we can see that the confidence interval for the mean is 20.58 to 22.10.
2. To interpret the confidence interval, we would say that we are 95% confident that the population mean
age of students from this university is somewhere between 20.58 years old and 22.10 years old. That
is, we are estimating that the population mean age is somewhere between 20.58 years old and 22.10
years old.
3. The confidence level means that if we took many random samples of size 175 from the student body
of this university and constructed many confidence intervals for each of these random samples, then
95% of these confidence intervals will contain the population mean age for this university, while 5%
will not.
4. If the sample size is decreased to 100, we would expect that the confidence interval would get wider.
From the law of large numbers, we know there is more sampling variability in smaller samples. Thus
there is more potential for error between the sample mean and the population mean when the sample
size is smaller. The margin of error then is bigger to take this into account. This is supported by the
formula for the margin of error (zα/2 × σ/√n). Since we are dividing by √n, the margin of error
would be smaller for larger n and bigger for smaller n.
5. We have estimated that the population mean age is between 20.58 years old and 22.10 years old.
Therefore, based on our estimate, it is unlikely that the mean age of this university is 23 years old as
23 does not fall within our estimate. The administrator's claim is most likely incorrect.
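The effect of sample size described in part 4 can be seen by evaluating the margin of error formula for n = 175 and n = 100. This is a minimal sketch using the standard library. For n = 100 it assumes, purely for illustration, that the sample mean and standard deviation stay at 21.34 and 5.12.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(sigma, n, confidence=0.95):
    """z-based margin of error: z_(alpha/2) * sigma / sqrt(n)."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z_crit * sigma / sqrt(n)


sample_mean, sigma = 21.34, 5.12
for n in (175, 100):
    e = margin_of_error(sigma, n)
    print(f"n = {n}: margin = {e:.3f}, "
          f"interval = {sample_mean - e:.3f} to {sample_mean + e:.3f}")
```

The first line of output matches Table 3.6; the second shows the wider interval that a smaller sample would give.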

Solution to Exercise 3.1.4.2 (p. 216)

1. We can use the Student-t distribution model to construct the confidence interval, because the
population standard deviation is unknown (so we don't use the standard normal distribution), the sample
is collected randomly, and the sampling distribution of the sample means is normal because the
population distribution is normal. To find the confidence interval, use a computer program. Make sure to
choose the t-model (instead of the z-model). Input the sample size as 25, the sample mean as 44.25
and the standard deviation as 2.25. Choose the level of confidence to be 95%. This gives the following
output:


95% confidence level
2.064 t
24 degrees of freedom
0.929 margin of error
43.321 lower confidence limit
45.179 upper confidence limit

Table 3.7

From this, we can see that the confidence interval for the mean is 43.321 to 45.179.
2. To interpret the confidence interval, we would say that we are 95% confident that the true mean battery
life of this brand of AAA batteries is somewhere between 43.32 hours and 45.18 hours.
3. If the confidence level is decreased to 90%, we would expect that the confidence interval would get
narrower. A higher level of confidence is obtained by making the confidence interval wider. Therefore,
if the confidence level is decreased, then the confidence interval would get narrower.

Notice from the computer output that the critical value is 2.064 with 24 degrees of freedom (i.e. one less
than the sample size). If the population standard deviation were known, the critical value would be 1.96. To
reiterate, since we are estimating the population standard deviation with the sample standard deviation,
we know there is more room for error in the estimate. Therefore, we want the estimate (i.e. confidence
interval) to be slightly wider, thus the margin of error needs to be slightly bigger. This is done by using the
Student-t distribution, which results in bigger critical values for the same confidence level as would occur for
the standard normal distribution. In this case, 2.064.
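The comparison between the two critical values can be made concrete with a few lines of arithmetic. The sketch below takes the critical values 2.064 and 1.96 directly from the text rather than computing them; the variable names are illustrative.

```python
from math import sqrt

# Inputs from the exercise: n = 25, sample mean = 44.25, s = 2.25
n, sample_mean, s = 25, 44.25, 2.25
t_crit = 2.064   # from Table 3.7: 95% confidence, 24 degrees of freedom
z_crit = 1.96    # critical value if the population SD were known

standard_error = s / sqrt(n)
t_margin = t_crit * standard_error
z_margin = z_crit * standard_error

print(f"t-based: margin = {t_margin:.3f}, "
      f"interval = {sample_mean - t_margin:.3f} to {sample_mean + t_margin:.3f}")
print(f"z-based margin would be {z_margin:.3f} (narrower)")
```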
Solution to Exercise 3.1.4.3 (p. 217)
We know the confidence level (95%). The margin of error is stated by saying that we want the estimate
of the true mean to be within 0.2 hours. Thus the 0.2 hours is telling us how much error we want in the
estimate (i.e. E = 0.2). We do need to have a sense of the standard deviation, which we get from the
preliminary study. Using the 12 participants, we get a sample standard deviation of 1.29.
We can now use a computer program to do the calculation. From the question, we know the margin of
error (E) is 0.2, the standard deviation is 1.29, and the confidence level is 95%. When we input this into
the computer program, we get output similar to this.

95% confidence level
1.96 z
159.814 sample size
160 rounded up

Table 3.8

From this, we can see that to get our estimate within 0.2 hours of the true mean we would need a
sample size of at least 160 participants.
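The output in Table 3.8 corresponds to solving the margin of error formula for n, that is n = (z × σ / E)², and rounding up. A minimal sketch, assuming the standard library's `NormalDist` supplies the critical value (the function name is hypothetical):

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_mean(sigma, margin, confidence):
    """Unrounded sample size needed to estimate a mean to within +/- margin."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return (z_crit * sigma / margin) ** 2


raw = sample_size_for_mean(sigma=1.29, margin=0.2, confidence=0.95)
needed = ceil(raw)   # always round up so the margin of error is not exceeded
print(f"unrounded: {raw:.3f}, participants needed: {needed}")
```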
Solution to Exercise 3.1.5.1 (p. 219)
The confidence level is 90%, the sample proportion is 85%, and the amount of error we want in our estimate
(i.e. the margin of error) is 2.5%.
We can now use a computer program to do the calculation. From the question, we know the margin of
error (E) is 0.025 (remember to write it as a decimal), the sample proportion is 0.85, and the confidence
level is 90%. When we input this into the computer program, we get output similar to this.


90% confidence level
1.645 z
551.931 sample size
552 rounded up

Table 3.9

From this, we can see that we need to have at least 552 consumers in our sample.
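For a proportion, the corresponding formula is n = (z / E)² × p(1 − p), again rounded up. The sketch below reproduces Table 3.9 under that assumption; `sample_size_for_proportion` is an illustrative name.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(p_estimate, margin, confidence):
    """Unrounded sample size needed to estimate a proportion to within +/- margin."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return (z_crit / margin) ** 2 * p_estimate * (1 - p_estimate)


raw = sample_size_for_proportion(p_estimate=0.85, margin=0.025, confidence=0.90)
needed = ceil(raw)
print(f"unrounded: {raw:.3f}, consumers needed: {needed}")
```

When no estimate of the proportion is available, passing `p_estimate=0.5` gives the largest (most conservative) sample size.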
Solution to Exercise 3.2.2.1 (p. 224)
A normal distribution or a Student's t-distribution
Solution to Exercise 3.2.2.2 (p. 224)
Use a Student's t-distribution
Solution to Exercise 3.2.2.3 (p. 224)
a normal distribution for a single population mean
Solution to Exercise 3.2.2.4 (p. 224)
It must be approximately normally distributed.
Solution to Exercise 3.2.2.5 (p. 224)
They must both be greater than five.
Solution to Exercise 3.2.2.6 (p. 224)
binomial distribution

Solution to Exercise 3.2.2.7 (p. 224)


d
Solution to Exercise 3.2.3.1 (p. 227)
The random variable is the mean Internet speed in Megabits per second.
Solution to Exercise 3.2.3.2 (p. 227)
The random variable is the mean number of children a Canadian family has.
Solution to Exercise 3.2.3.3 (p. 227)
The random variable is the proportion of people who are tourists picked at random at the CN Tower.
Solution to Exercise 3.2.3.4 (p. 227)

a. H0 : π = 0.42
b. Ha : π < 0.42
Solution to Exercise 3.2.3.5 (p. 227)

a. H0 : µ = 34; Ha : µ ≠ 34
b. H0 : π ≤ 0.60; Ha : π > 0.60
c. H0 : µ ≥ 100,000; Ha : µ < 100,000
d. H0 : π = 0.29; Ha : π ≠ 0.29
e. H0 : π = 0.05; Ha : π < 0.05
f. H0 : µ ≤ 10; Ha : µ > 10
g. H0 : π = 0.50; Ha : π ≠ 0.50
h. H0 : µ = 6; Ha : µ ≠ 6
i. H0 : π ≥ 0.11; Ha : π < 0.11
j. H0 : µ ≤ 20,000; Ha : µ > 20,000
Solution to Exercise 3.2.3.6 (p. 227)
c
Solution to Exercise 3.2.4.1 (p. 230)
Type I error: The researcher thinks the blood cultures do contain traces of pathogen X, when in fact, they
do not.


Type II error: The researcher thinks the blood cultures do not contain traces of pathogen X, when in
fact, they do.
Solution to Exercise 3.2.4.2 (p. 230)
The error with the greater consequence is the Type II error: the patient will be thought well when, in fact,
he is sick, so he will not get treatment.
Solution to Exercise 3.2.4.3 (p. 230)
In this scenario, an appropriate null hypothesis would be H0 : the mean level of toxins is at most 800 µg,
H0 : µ0 ≤ 800 µg.
Type I error: The DMF believes that toxin levels are still too high when, in fact, toxin levels are at
most 800 µg. The DMF continues the harvesting ban.
Type II error: The DMF believes that toxin levels are within acceptable levels (are at most 800 µg)
when, in fact, toxin levels are still too high (more than 800 µg). The DMF lifts the harvesting ban. This
error could be the most serious. If the ban is lifted and clams are still toxic, consumers could possibly eat
tainted food.
In summary, the more dangerous error would be to commit a Type II error, because this error involves
the availability of tainted clams for consumption.
Solution to Exercise 3.2.4.4 (p. 230)
Type I error: c
Type II error: b
Solution to Exercise 3.2.4.5 (p. 231)
Type I: The mean price of mid-sized cars is $32,000, but we conclude that it is not $32,000.
Type II: The mean price of mid-sized cars is not $32,000, but we conclude that it is $32,000.
Solution to Exercise 3.2.4.6 (p. 231)
α = the probability that you think the bag cannot withstand -15 degrees F, when in fact it can
β = the probability that you think the bag can withstand -15 degrees F, when in fact it cannot
Solution to Exercise 3.2.4.7 (p. 231)
Type I: The procedure will go well, but the doctors think it will not.
Type II: The procedure will not go well, but the doctors think it will.
Solution to Exercise 3.2.4.8 (p. 231)

a. Type I error: We conclude that the mean is not 34 years, when it really is 34 years. Type II error: We
conclude that the mean is 34 years, when in fact it really is not 34 years.
b. Type I error: We conclude that more than 60% of Americans vote in presidential elections, when the
actual percentage is at most 60%. Type II error: We conclude that at most 60% of Americans vote in
presidential elections when, in fact, more than 60% do.
c. Type I error: We conclude that the mean starting salary is less than $100,000, when it really is at least
$100,000. Type II error: We conclude that the mean starting salary is at least $100,000 when, in fact,
it is less than $100,000.
d. Type I error: We conclude that the proportion of high school seniors who get drunk each month is not
29%, when it really is 29%. Type II error: We conclude that the proportion of high school seniors who
get drunk each month is 29% when, in fact, it is not 29%.
e. Type I error: We conclude that fewer than 5% of adults ride the bus to work in Los Angeles, when the
percentage that do is really 5% or more. Type II error: We conclude that 5% or more adults ride the
bus to work in Los Angeles when, in fact, fewer than 5% do.
f. Type I error: We conclude that the mean number of cars a person owns in his or her lifetime is more
than 10, when in reality it is not more than 10. Type II error: We conclude that the mean number of
cars a person owns in his or her lifetime is not more than 10 when, in fact, it is more than 10.
g. Type I error: We conclude that the proportion of Americans who prefer to live away from cities is not
about half, though the actual proportion is about half. Type II error: We conclude that the proportion
of Americans who prefer to live away from cities is half when, in fact, it is not half.


h. Type I error: We conclude that the duration of paid vacations each year for Europeans is not six weeks,
when in fact it is six weeks. Type II error: We conclude that the duration of paid vacations each year
for Europeans is six weeks when, in fact, it is not.
i. Type I error: We conclude that the proportion is less than 11%, when it is really at least 11%. Type
II error: We conclude that the proportion of women who develop breast cancer is at least 11%, when
in fact it is less than 11%.
j. Type I error: We conclude that the average tuition cost at private universities is more than $20,000,
though in reality it is at most $20,000. Type II error: We conclude that the average tuition cost at
private universities is at most $20,000 when, in fact, it is more than $20,000.

Solution to Exercise 3.2.4.10 (p. 231)


b
Solution to Exercise 3.2.4.11 (p. 231)
d
Solution to Exercise 3.2.5.1 (p. 237)
Ho: µ = $1200; Ha: µ > $1200
Solution to Exercise 3.2.5.2 (p. 238)
A sample of 64 random expense claims revealed an average of $1300. The population σ, based on an earlier
comprehensive audit, is $400.
Solution to Exercise 3.2.5.3 (p. 238)
We are investigating a hypothesis about a population mean using a large sample. Central Limit Theorem
says the sampling distribution will be normally shaped for sample sizes over 30. Thus we will conduct a
Z-based test.
Solution to Exercise 3.2.5.4 (p. 238)
A Type I error in this case would be for the auditor to accuse the bookstore of exaggerating its expense
claims when in fact it has not. A Type II error in this case would be for the auditor to not accuse the
bookstore of exaggerating its expense claims when in fact it has. A Type I error could lead to a wrongful
conviction for tax fraud so it would be best to minimize the likelihood of making this type of error. Alpha
should be set at 1% (or at most 5%). Reject Ho if P-value < 0.01.
Solution to Exercise 3.2.5.5 (p. 238)

Z* = (1300 − 1200)/(400/√64) = 2.00; P-value = 0.0228
Solution to Exercise 3.2.5.6 (p. 238)
Our P-value is 0.0228. This means that the probability of getting a sample mean of $1300 (or more) from
a population with a mean of $1200 is 2.28%. Given our level of significance, this would be considered a not
unlikely event. The P-value > 0.01, so we will fail to reject the null hypothesis.
Solution to Exercise 3.2.5.7 (p. 238)
Since our P-value of 0.0228 is greater than alpha of 0.01, we fail to reject the null hypothesis.
Solution to Exercise 3.2.5.8 (p. 238)
There is insufficient evidence to indicate that the average yearly travel expenditure per person is
greater than $1200.
Solution to Exercise 3.2.5.9 (p. 238)
HO: π = 0.0057; HA: π ≠ 0.0057
Solution to Exercise 3.2.5.10 (p. 238)
n = 2000; number of success = 20
Solution to Exercise 3.2.5.11 (p. 238)
We are testing if there has been a change in the proportion of successes within a population. To ensure
that a large sample z-test is valid, we must ensure that both np and n(1-p) are greater than 5. In this case np = 20 and
n(1-p) = 1980, so the central limit theorem says the sampling distribution of sample proportions will follow an
approximately normal shape. Thus we will conduct a z-based test.
Solution to Exercise 3.2.5.12 (p. 238)
A Type I error in this context would conclude the campaign has been successful when in fact it has not. A
Type II error would conclude the campaign has not been successful when in fact it has. If the campaign is


costly, it would be better to err on the side of making a Type II error over a Type I error. Therefore we will
set alpha at 10%. Reject Ho if the p-value < 0.10.
Solution to Exercise 3.2.5.13 (p. 238)
Z* = (0.01 − 0.0057)/sqrt((0.0057*0.9943)/2000) = 2.55; P-value = 0.011
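The test statistic and two-tailed P-value above can be verified as follows. This is a sketch using the standard library's normal distribution; the standard error uses the null proportion 0.0057, as in the formula above, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Evidence: 20 responses out of n = 2000; null proportion 0.0057
n, successes, null_p = 2000, 20, 0.0057
p_hat = successes / n

standard_error = sqrt(null_p * (1 - null_p) / n)   # computed under H0
z = (p_hat - null_p) / standard_error

# Two-tailed test (HA: proportion != 0.0057): double the upper-tail area
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, p = {p_value:.3f}")
```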
Solution to Exercise 3.2.5.14 (p. 238)
Our P-value is 0.011. This means that the probability of getting a sample proportion of 0.01 (or more)
from a population with a proportion of 0.0057 is only 1.1%. Given our level of significance, this would be
considered an unlikely event.
Solution to Exercise 3.2.5.15 (p. 238)
Since the p-value of 0.011 is less than alpha of 0.10, we will reject the null hypothesis.
Solution to Exercise 3.2.5.16 (p. 238)
There is sufficient evidence to indicate that the proportion of responses differs as a result of the marketing
campaign.
Solution to Exercise 3.2.5.17 (p. 238)
Ho: µ = 12; Ha: µ < 12
Solution to Exercise 3.2.5.18 (p. 238)
A sample of 9 randomly chosen customers' boarding times reveals: n = 9; mean = 9.3; sample standard
deviation = 3.32.
Solution to Exercise 3.2.5.19 (p. 238)
We are investigating a hypothesis about a population mean using a small sample. The Central Limit Theorem
does not apply to small samples, but we can expect the sampling distribution to be normally shaped if the
population is also normal. This has been confirmed through previous studies. Thus we will conduct a t-based
test.
Solution to Exercise 3.2.5.20 (p. 239)
A Type I error in this case would be for Charter Air to claim their boarding time is less than 12 minutes,
when in fact it is not. A Type II error in this case would be for Charter Air not to claim their boarding time
is less than 12 minutes, when in fact it is. A Type I error could lead to false advertising, which has both
ethical and legal implications, so it would be best to minimize the likelihood of making this type of error.
Alpha should be set at 1% (or at most 5%). Reject Ho if P-value < 0.01.
Solution to Exercise 3.2.5.21 (p. 239)

t* = (9.3 − 12)/(3.32/√9) = −2.44; P-value = 0.0203
Solution to Exercise 3.2.5.22 (p. 239)
The P-value > 0.01, so we will fail to reject the null hypothesis.
Solution to Exercise 3.2.5.23 (p. 239)
Our P-value is 0.0203. This means that the probability of getting a sample mean of 9.3 minutes (or less) from
a population with a mean of 12 minutes is 2.03%. Given our level of significance, this would be considered
a not unlikely event.
Solution to Exercise 3.2.5.24 (p. 239)
The P-value of 0.0203 > 0.01, so we will fail to reject the null hypothesis.
Solution to Exercise 3.2.5.25 (p. 239)
There is insufficient evidence to indicate that the mean boarding time is less than 12 minutes.



Chapter 4

Business Statistics - Module 4 - Linear Regression and Correlation

4.1 Introduction to Regression1

Figure 4.1: We encounter statistics in our daily lives more often than we probably realize and from
many different sources, like the news. (credit: David Sim)

You are probably asking yourself the question, "When and where will I use statistics?" If you read any
newspaper, watch television, or use the Internet, you will see statistical information. There are statistics
about crime, sports, education, politics, and real estate. Typically, when you read a newspaper article or
watch a television news program, you are given sample information. With this information, you may make
a decision about the correctness of a statement, claim, or "fact." Statistical methods can help you make the
"best educated guess."
1 This content is available online at <http://cnx.org/content/m65507/1.1/>.


Since you will undoubtedly be given statistical information at some point in your life, you need to know
some techniques for analyzing the information thoughtfully. Think about buying a house or managing a
budget. Think about your chosen profession. The fields of economics, business, psychology, education,
biology, law, computer science, police science, and early childhood development require at least one course
in statistics.
Included in this chapter are the basic ideas and words of probability and statistics. You will soon
understand that statistics and probability work together. You will also learn how data are gathered and
how "good" data can be distinguished from "bad."

4.2 The Correlation Coefficient r2


As we begin this section we note that the type of data we will be working with has changed. Perhaps
unnoticed, all the data we have been using have been for a single variable. The data may have come from
two samples, but they were still univariate. The type of data described in the examples above, and used in
any model of cause and effect, is bivariate data: "bi" for two variables. In practice, statisticians often use
multivariate data, meaning many variables.
For our work we can classify data into three broad categories: time series data, cross-section data, and
panel data. We met the first two very early on. Time series data measure a single unit of observation, say
a person, a company, or a country, as time passes. At least two characteristics will be measured,
say the person's income, the quantity of a particular good they buy, and the price they paid. This would be
three pieces of information in one time period, say 1985. If we followed that person across time we would
have those same pieces of information for 1985, 1986, 1987, etc. This would constitute a time series data
set. If we did this for 10 years we would have 30 pieces of information concerning this person's consumption
habits of this good for the past decade, and we would know their income and the price they paid.
A second type of data set is for cross-section data. Here the variation is not across time for a single unit
of observation, but across units of observation during one point in time. For a particular period of time we
would gather the price paid, amount purchased, and income of many individual people.
A third type of data set is panel data. Here a panel of units of observation is followed across time. If we
take our example from above we might follow 500 people, the unit of observation, through time, ten years,
and observe their income, price paid and quantity of the good purchased. If we had 500 people and data for
ten years for price, income and quantity purchased we would have 15,000 pieces of information. These types
of data sets are very expensive to construct and maintain. They do, however, provide a tremendous amount
of information that can be used to answer very important questions. As an example, what is the effect on
the labor force participation rate of women as their family of origin, mother and father, age? Or are there
differential effects on health outcomes depending upon the age at which a person started smoking? Only
panel data can give answers to these and related questions because we must follow multiple people across
time. The work we do here however will not be fully appropriate for data sets such as these.
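The counts quoted above for the three kinds of data set follow from simple multiplication. The sketch below reproduces them; the variable names are illustrative, and the cross-section size of 500 people is an assumption (the text only says "many individual people").

```python
# Count the pieces of information in each kind of data set.
variables = 3   # income, price paid, quantity purchased
years = 10      # length of the time dimension
people = 500    # size of the panel (also assumed for the cross-section)

time_series_points = 1 * years * variables       # one person followed over time
cross_section_points = people * 1 * variables    # many people, one time period
panel_points = people * years * variables        # many people followed over time

print(time_series_points, cross_section_points, panel_points)
```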
Beginning with a set of data with two independent variables we ask the question: are these related? One
way to visually answer this question is to create a scatter plot of the data. We could not do that before when
we were doing descriptive statistics because those data were univariate. Now we have bivariate data so we
can plot in two dimensions. Three dimensions are possible on a flat piece of paper, but become very hard
to fully conceptualize. Of course, more than three dimensions cannot be graphed although the relationships
can be measured mathematically.
To provide mathematical precision to the measurement of what we see we use the correlation coefficient.
The correlation tells us something about the co-movement of two variables, but nothing about why this
movement occurred. Formally, correlation analysis assumes that both variables being analyzed are
independent variables. This means that neither one causes the movement in the other. Further, it means
that neither variable is dependent on the other, or for that matter, on any other variable. Even with these
limitations, correlation analysis can yield some interesting results.
2 This content is available online at <http://cnx.org/content/m55719/1.21/>.


The correlation coefficient, ρ (pronounced rho), is the mathematical statistic for a population that
provides us with a measurement of the strength of a linear relationship between the two variables. For a sample
of data, the statistic, r, developed by Karl Pearson in the early 1900s, is an estimate of the population
correlation and is defined mathematically as:
  
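\[ r = \frac{\frac{1}{n-1}\sum \left(X_{1i} - \bar{X}_1\right)\left(X_{2i} - \bar{X}_2\right)}{s_{x_1} s_{x_2}} \] (4.1)

OR

\[ r = \frac{\sum X_{1i} X_{2i} - n \bar{X}_1 \bar{X}_2}{\sqrt{\left(\sum X_{1i}^2 - n \bar{X}_1^2\right)\left(\sum X_{2i}^2 - n \bar{X}_2^2\right)}} \] (4.1)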

where $s_{x_1}$ and $s_{x_2}$ are the standard deviations of the two independent variables X1 and X2,
$\bar{X}_1$ and $\bar{X}_2$ are the sample means of the two variables, and X1i and X2i are the individual
observations of X1 and X2. The correlation coefficient r ranges in value from -1 to 1. The second equivalent
formula is often used because it may be computationally easier. As scary as these formulas look they are
really just the ratio of the covariance between the two variables and the product of their two standard
deviations. That is to say, it is a measure of relative variances.
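As a check that the two formulas for r agree, here is a minimal Python sketch that computes both on a small data set. The function names and the five data points are made up for illustration; they do not come from the text.

```python
import math

def pearson_r_definition(x1, x2):
    """r as the sample covariance divided by the product of the sample standard deviations."""
    n = len(x1)
    mean1, mean2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2)) / (n - 1)
    s1 = math.sqrt(sum((a - mean1) ** 2 for a in x1) / (n - 1))
    s2 = math.sqrt(sum((b - mean2) ** 2 for b in x2) / (n - 1))
    return cov / (s1 * s2)

def pearson_r_computational(x1, x2):
    """r from sums of products and sums of squares (the second, equivalent formula)."""
    n = len(x1)
    mean1, mean2 = sum(x1) / n, sum(x2) / n
    numerator = sum(a * b for a, b in zip(x1, x2)) - n * mean1 * mean2
    denominator = math.sqrt(
        (sum(a * a for a in x1) - n * mean1 ** 2)
        * (sum(b * b for b in x2) - n * mean2 ** 2)
    )
    return numerator / denominator

# Hypothetical observations, chosen only to keep the arithmetic small.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 5, 4, 5]

r_a = pearson_r_definition(x1, x2)
r_b = pearson_r_computational(x1, x2)
print(round(r_a, 4), round(r_b, 4))
```

For these data the deviations give a cross-product sum of 6 and sums of squares of 10 and 6, so both versions return 6 divided by the square root of 60, about 0.77.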
In practice all correlation and regression analysis will be provided through computer software designed for
these purposes. Anything more than perhaps half a dozen observations creates immense computational
problems when worked by hand. It was because of this fact that correlation, and even more so, regression,
were not widely used research tools until after the advent of computing machines. Now the computing power
required to analyze data using regression packages is deemed almost trivial by comparison to just a decade ago.
To visualize any linear relationship that may exist, review a scatter diagram of the
standardized data. Figure 4.2 presents several scatter diagrams and the calculated value of r. In panels (a) and
(b) notice that the data generally trend together, (a) upward and (b) downward. Panel (a) is an example
of a positive correlation and panel (b) is an example of a negative correlation, or relationship. The sign of
the correlation coefficient tells us if the relationship is a positive or negative (inverse) one. If all the values
of X1 and X2 are on a straight line the correlation coefficient will be either 1 or -1 depending on whether
the line has a positive or negative slope, and the closer to one or negative one the stronger the relationship
between the two variables. BUT ALWAYS REMEMBER THAT THE CORRELATION COEFFICIENT
DOES NOT TELL US THE SLOPE.


Figure 4.2

Remember, all the correlation coefficient tells us is whether or not the data are linearly related. In panel
(d) the variables obviously have some type of very specific relationship to each other, but the correlation
coefficient is zero, indicating no linear relationship exists.
If you suspect a linear relationship between X1 and X2 then r can measure how strong the linear
relationship is.
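Two points made above can be checked numerically: r does not reveal the slope of a line, and r can be zero even when the variables are exactly related in a nonlinear way. The following is a small sketch with made-up data; the helper `pearson_r` and the chosen lines and curve are illustrative assumptions, not the data behind Figure 4.2.

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient computed from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]

r_shallow = pearson_r(x, [0.1 * v for v in x])    # points on a line with slope 0.1
r_steep = pearson_r(x, [10 * v for v in x])       # points on a line with slope 10
r_falling = pearson_r(x, [7 - 3 * v for v in x])  # points on a line with slope -3
r_curve = pearson_r(x, [v ** 2 for v in x])       # exact but nonlinear: y = x squared

print(r_shallow, r_steep, r_falling, r_curve)
```

The two rising lines give essentially the same r despite slopes that differ by a factor of 100, the falling line gives r of about -1, and the parabola gives r of zero because its positive and negative deviations cancel.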
What the VALUE of r tells us:
• The value of r is always between -1 and +1: -1 ≤ r ≤ 1.
• The size of the correlation r indicates the strength of the linear relationship between X1 and X2.
Values of r close to -1 or to +1 indicate a stronger linear relationship between X1 and X2.
• If r = 0 there is absolutely no linear relationship between X1 and X2 (no linear correlation).
• If r = 1, there is perfect positive correlation. If r = -1, there is perfect negative correlation. In both
these cases, all of the original data points lie on a straight line: ANY straight line no matter what the
slope. Of course, in the real world, this will not generally happen.

What the SIGN of r tells us


• A positive value of r means that when X1 increases, X2 tends to increase and when X1 decreases, X2
tends to decrease (positive correlation).
• A negative value of r means that when X1 increases, X2 tends to decrease and when X1 decreases, X2
tends to increase (negative correlation).


note: Strong correlation does not suggest that X1 causes X2 or X2 causes X1 . We say "correlation
does not imply causation."


4.2.1
Exercise 4.2.1 (Solution on p. 310.)
In order to have a correlation coefficient between traits A and B, it is necessary to have:

a. one group of subjects, some of whom possess characteristics of trait A, the remainder
possessing those of trait B
b. measures of trait A on one group of subjects and of trait B on another group
c. two groups of subjects, one which could be classified as A or not A, the other as B or not B
d. two groups of subjects, one which could be classified as A or not A, the other as B or not B

Exercise 4.2.2 (Solution on p. 310.)


Define the Correlation Coefficient and give a unique example of its use.
Exercise 4.2.3 (Solution on p. 310.)
If the correlation between age of an auto and money spent for repairs is +.90

a. 81% of the variation in the money spent for repairs is explained by the age of the auto
b. 81% of money spent for repairs is unexplained by the age of the auto
c. 90% of the money spent for repairs is explained by the age of the auto
d. none of the above

Exercise 4.2.4 (Solution on p. 310.)


Suppose that college grade-point average and verbal portion of an IQ test had a correlation of .40.
What percentage of the variance do these two have in common?

a. 20
b. 16
c. 40
d. 80

Exercise 4.2.5 (Solution on p. 310.)


True or false? If false, explain why: The coefficient of determination can have values between -1
and +1.
Exercise 4.2.6 (Solution on p. 310.)
True or False: Whenever r is calculated on the basis of a sample, the value which we obtain for r
is only an estimate of the true correlation coefficient which we would obtain if we calculated it for
the entire population.
Exercise 4.2.7 (Solution on p. 310.)
Under a "scatter diagram" there is a notation that the coefficient of correlation is .10. What does
this mean?

a. plus and minus 10% from the means includes about 68% of the cases
b. one-tenth of the variance of one variable is shared with the other variable
c. one-tenth of one variable is caused by the other variable
d. on a scale from -1 to +1, the degree of linear relationship between the two variables is +.10

Exercise 4.2.8 (Solution on p. 310.)


The correlation coefficient for X and Y is known to be zero. We then can conclude that:

a. X and Y have standard distributions


b. the variances of X and Y are equal
c. there exists no relationship between X and Y


d. there exists no linear relationship between X and Y


e. none of these

Exercise 4.2.9 (Solution on p. 310.)


What would you guess the value of the correlation coefficient to be for the pair of variables:
"number of man-hours worked" and "number of units of work completed"?

a. Approximately 0.9
b. Approximately 0.4
c. Approximately 0.0
d. Approximately -0.4
e. Approximately -0.9

Exercise 4.2.10 (Solution on p. 310.)


In a given group, the correlation between height measured in feet and weight measured in pounds
is +.68. Which of the following would alter the value of r?

a. height is expressed in centimeters.
b. weight is expressed in kilograms.
c. both of the above will affect r.
d. neither of the above changes will affect r.

4.3 Testing the Significance of the Correlation Coefficient3


The correlation coefficient, r, tells us about the strength and direction of the linear relationship between X1
and X2.
The sample data are used to compute r, the correlation coefficient for the sample. If we had data for
the entire population, we could find the population correlation coefficient. But because we have only sample
data, we cannot calculate the population correlation coefficient. The sample correlation coefficient, r, is our
estimate of the unknown population correlation coefficient.

ρ = population correlation coefficient (unknown)

r = sample correlation coefficient (known; calculated from sample data)

The hypothesis test lets us decide whether the value of the population correlation coefficient ρ is "close to
zero" or "significantly different from zero". We decide this based on the sample correlation coefficient r and
the sample size n.
If the test concludes that the correlation coefficient is significantly different from zero, we
say that the correlation coefficient is "significant."

• Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship
between X1 and X2 because the correlation coefficient is significantly different from zero.
• What the conclusion means: There is a significant linear relationship between X1 and X2.

If the test concludes that the correlation coefficient is not significantly different from zero (it is close to
zero), we say that the correlation coefficient is "not significant".

3 This content is available online at <http://cnx.org/content/m55726/1.16/>.


4.3.1 Performing the Hypothesis Test


• Null Hypothesis: H0: ρ = 0
• Alternate Hypothesis: Ha: ρ ≠ 0

What the Hypotheses Mean in Words

• Null Hypothesis H0: The population correlation coefficient IS NOT significantly different from zero.
There IS NOT a significant linear relationship (correlation) between X1 and X2 in the population.
• Alternate Hypothesis Ha: The population correlation coefficient is significantly different from zero.
There is a significant linear relationship (correlation) between X1 and X2 in the population.
Drawing a Conclusion
There are two methods of making the decision concerning the hypothesis. The test statistic to test this
hypothesis is:

\[ t_c = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}} \] (4.2)

OR

\[ t_c = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} \] (4.2)
where the second formula is an equivalent form of the test statistic, n is the sample size, and the degrees of
freedom are n − 2. This is a t-statistic and operates in the same way as other t tests. Calculate the t-value and
compare it with the critical value from the t-table at the appropriate degrees of freedom and the level of
confidence you wish to maintain. If the calculated value is in the tail, then we cannot accept the null hypothesis
that there is no linear relationship between these two independent random variables. If the calculated t-value
is NOT in the tail, then we cannot reject the null hypothesis that there is no linear relationship between the
two variables.
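The decision rule can be sketched in a few lines of Python. The sample values r = 0.80 and n = 10 are hypothetical, and the critical value 2.306 is the standard two-tailed t-table entry for 8 degrees of freedom at the 0.05 level, typed in by hand rather than computed.

```python
import math

def correlation_t_statistic(r, n):
    """Test statistic t_c = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical sample results (not taken from the text).
r = 0.80
n = 10

t_c = correlation_t_statistic(r, n)

# Two-tailed critical value for alpha = 0.05 and df = n - 2 = 8, from a t-table.
t_critical = 2.306

# The calculated value is "in the tail" when it exceeds the critical value in absolute size.
reject_null = abs(t_c) > t_critical
print(round(t_c, 3), reject_null)
```

With these numbers the calculated value is about 3.77, which lies in the tail, so the null hypothesis of no linear relationship would not be accepted.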
A quick shorthand way to test correlations is the relationship between the sample size and the correlation.
If:

\[ |r| \geq \frac{2}{\sqrt{n}} \] (4.2)

then this implies that the correlation between the two variables demonstrates that a linear relationship exists
and is statistically significant at approximately the 0.05 level of significance. As the formula indicates, there
is an inverse relationship between the sample size and the required correlation for significance of a linear
relationship. With only 10 observations, the required correlation for significance is 0.6325, for 30 observations
the required correlation for significance decreases to 0.3651 and at 100 observations the required level is only
0.2000.
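The three thresholds quoted above can be reproduced directly from the rule of thumb. A minimal sketch follows; the function name `required_correlation` is made up for illustration.

```python
import math

def required_correlation(n):
    """Approximate |r| needed for significance at about the 0.05 level: 2 / sqrt(n)."""
    return 2 / math.sqrt(n)

# Thresholds for the three sample sizes mentioned in the text.
thresholds = {n: round(required_correlation(n), 4) for n in (10, 30, 100)}
print(thresholds)
```

The required correlation falls as the sample grows, which is the inverse relationship the formula describes.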
Correlations may be helpful in visualizing the data, but are not appropriately used to "explain" a
relationship between two variables. Perhaps no single statistic is more misused than the correlation coefficient.
Citing correlations between health conditions and everything from place of residence to eye color has the
effect of implying a cause and effect relationship. This simply cannot be accomplished with a correlation
coefficient. The correlation coefficient is, of course, innocent of this misinterpretation. It is the duty of the
analyst to use a statistic that is designed to test for cause and effect relationships and report only those
results if they are intending to make such a claim. The problem is that passing this more rigorous test is
difficult, so lazy and/or unscrupulous "researchers" fall back on correlations when they cannot make their
case legitimately.


4.3.1.1

Exercise 4.3.1 (Solution on p. 310.)


Define a t Test of a Regression Coefficient, and give a unique example of its use.
Exercise 4.3.2 (Solution on p. 311.)
The correlation between scores on a neuroticism test and scores on an anxiety test is high and
positive; therefore

a. anxiety causes neuroticism


b. those who score low on one test tend to score high on the other.
c. those who score low on one test tend to score low on the other.
d. no prediction from one test to the other can be meaningfully made.

4.4 Linear Equations4


Linear regression for two variables is based on a linear equation with one independent variable. The equation
has the form:

y = a + bx (4.2)

where a and b are constant numbers.


The variable x is the independent variable, and y is the dependent variable. Another way to
think about this equation is a statement of cause and effect. The X variable is the cause and the Y variable
is the hypothesized effect. Typically, you choose a value to substitute for the independent variable and then
solve for the dependent variable.
Example 4.1
The following examples are linear equations.

y = 3 + 2x (4.2)

y = −0.01 + 1.2x (4.2)

The graph of a linear equation of the form y = a + bx is a straight line. Any line that is not vertical can
be described by this equation.
Example 4.2
Graph the equation y = 1 + 2x.
4 This content is available online at <http://cnx.org/content/m55718/1.12/>.


Figure 4.3

Exercise 4.4.1 (Solution on p. 311.)


Is the following an example of a linear equation? Why or why not?

Figure 4.4

Example 4.3
Aaron's Word Processing Service (AWPS) does word processing. The rate for services is $32 per
hour plus a $31.50 one-time charge. The total cost to a customer depends on the number of hours
it takes to complete the job.
Problem
Find the equation that expresses the total cost in terms of the number of hours required to
complete the job.


Solution
Let x = the number of hours it takes to get the job done.
Let y = the total cost to the customer.
The $31.50 is a fixed cost. If it takes x hours to complete the job, then (32)(x) is the cost of
the word processing only. The total cost is: y = 31.50 + 32x
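The cost equation can be written as a small function and evaluated for a few job lengths. This is only an illustration; the function name and the sample hours are assumptions.

```python
def total_cost(hours):
    """Total AWPS charge in dollars: a $31.50 one-time charge plus $32 per hour."""
    return 31.50 + 32 * hours

# Evaluate the equation for a few hypothetical job lengths.
for hours in (0, 1, 3):
    print(hours, total_cost(hours))
```

A job of zero hours returns only the fixed charge, and each additional hour adds $32 to the total.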

4.4.1 Slope and Y -Intercept of a Linear Equation


For the linear equation y = a + bx, b = slope and a = y-intercept. From algebra recall that the slope is
a number that describes the steepness of a line, and the y-intercept is the y coordinate of the point (0, a)
where the line crosses the y-axis. From calculus the slope is the first derivative of the function. For a linear
function the slope is dy / dx = b where we can read the mathematical expression as "the change in y (dy)
that results from a change in x (dx) = b * dx".

Figure 4.5: Three possible graphs of y = a + bx. (a) If b > 0, the line slopes upward to the right. (b)
If b = 0, the line is horizontal. (c) If b < 0, the line slopes downward to the right.

Example 4.4
Svetlana tutors to make extra money for college. For each tutoring session, she charges a one-time
fee of $25 plus $15 per hour of tutoring. A linear equation that expresses the total amount of money
Svetlana earns for each session she tutors is y = 25 + 15x.
Problem
What are the independent and dependent variables? What is the y-intercept and what is the
slope? Interpret them using complete sentences.
Solution
The independent variable (x) is the number of hours Svetlana tutors each session. The dependent
variable (y) is the amount, in dollars, Svetlana earns for each session.
The y-intercept is 25 (a = 25). At the start of the tutoring session, Svetlana charges a one-time
fee of $25 (this is when x = 0). The slope is 15 (b = 15). For each session, Svetlana earns $15 for
each hour she tutors.
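The interpretation of the intercept and slope can be verified numerically: the intercept is the value of y at x = 0, and the slope is the change in y divided by the change in x between any two points. A short sketch, with the two comparison points (1 hour and 3 hours) chosen arbitrarily:

```python
def earnings(hours):
    """Svetlana's earnings per session in dollars: y = 25 + 15x."""
    return 25 + 15 * hours

# The y-intercept is the value of y when x = 0.
intercept = earnings(0)

# The slope is (change in y) / (change in x) between any two points on the line.
slope = (earnings(3) - earnings(1)) / (3 - 1)

print(intercept, slope)
```

Any other pair of distinct hours would give the same slope, because the rate of change of a straight line is constant.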


4.4.1.1

Exercise 4.4.2 (Solution on p. 311.)


True or False? If False, correct it: Suppose a 95% confidence interval for the slope β of the straight
line regression of Y on X is given by -3.5 < β < -0.5. Then a two-sided test of the hypothesis
H0 : β = −1 would result in rejection of H0 at the 1% level of significance.
Exercise 4.4.3 (Solution on p. 311.)
True or False: It is safer to interpret correlation coefficients as measures of association rather than
causation because of the possibility of spurious correlation.
Exercise 4.4.4 (Solution on p. 311.)
We are interested in finding the linear relation between the number of widgets purchased at one
time and the cost per widget. The following data has been obtained:
X: Number of widgets purchased - 1, 3, 6, 10, 15
Y: Cost per widget (in dollars) - 55, 52, 46, 32, 25
Suppose the regression line is ŷ = −2.5x + 60. We compute the average price per widget if 30
are purchased and observe which of the following?

a. ŷ = 15 dollars; obviously, we are mistaken; the prediction ŷ is actually +15 dollars.
b. ŷ = 15 dollars, which seems reasonable judging by the data.
c. ŷ = −15 dollars, which is obvious nonsense. The regression line must be incorrect.
d. ŷ = −15 dollars, which is obvious nonsense. This reminds us that predicting Y outside the
range of X values in our data is a very poor practice.

Exercise 4.4.5 (Solution on p. 311.)


Discuss briefly the distinction between correlation and causality.
Exercise 4.4.6 (Solution on p. 311.)
True or False: If r is close to + or -1, we shall say there is a strong correlation, with the tacit
understanding that we are referring to a linear relationship and nothing else.

4.4.2 Chapter Review


The most basic type of association is a linear association. This type of relationship can be defined
algebraically by the equations used, numerically with actual or predicted data values, or graphically from a
plotted curve. (Lines are classified as straight curves.) Algebraically, a linear equation typically takes the
form y = mx + b, where m and b are constants, x is the independent variable, and y is the dependent
variable. In a statistical context, a linear equation is written in the form y = a + bx, where a and b are the
constants. This form is used to help readers distinguish the statistical context from the algebraic context. In
the equation y = a + bx, the constant b that multiplies the x variable (b is called a coefficient) is called
the slope. The slope describes the rate of change between the independent and dependent variables; in other
words, it describes the change that occurs in the dependent variable as the independent
variable is changed. In the equation y = a + bx, the constant a is called the y-intercept. Graphically, the
y-intercept is the y coordinate of the point where the graph of the line crosses the y axis. At this point
x = 0.
The slope of a line is a value that describes the rate of change between the independent and dependent
variables. The slope tells us how the dependent variable (y) changes for every one unit increase in the
independent (x) variable, on average. The y-intercept is used to describe the dependent variable when
the independent variable equals zero. Graphically, the slope is represented by three line types in elementary
statistics.


4.5 The Regression Equation5


Regression analysis is a statistical technique that can test the hypothesis that a variable is dependent upon
one or more other variables. Further, regression analysis can provide an estimate of the magnitude of the
impact of a change in one variable on another. This last feature, of course, is all important in predicting
future values.
Regression analysis is based upon a functional relationship among variables and further, assumes that
the relationship is linear. This linearity assumption is required because, for the most part, the theoretical
statistical properties of non-linear estimation are not well worked out yet by the mathematicians and
econometricians. This presents us with some difficulties in economic analysis because many of our theoretical
models are nonlinear. The marginal cost curve, for example, is decidedly nonlinear as is the total cost
function, if we are to believe in the effect of specialization of labor and the Law of Diminishing Marginal Product.
There are techniques for overcoming some of these difficulties, exponential and logarithmic transformation
of the data for example, but at the outset we must recognize that standard ordinary least squares (OLS)
regression analysis will always use a linear function to estimate what might be a nonlinear relationship.
The general linear regression model can be stated by the equation:

\[ y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i \] (4.5)


where β0 is the intercept, the βi's are the slopes between Y and the appropriate Xi, and ε (pronounced epsilon)
is the error term that captures errors in measurement of Y and the effect on Y of any variables missing from
the equation that would contribute to explaining variations in Y. This equation is the theoretical population
equation and therefore uses Greek letters. The equation we will estimate will have the Roman equivalent
symbols. This is parallel to how we kept track of the population parameters and sample statistics before.
The symbol for the population mean was µ and for the sample mean $\bar{X}$, and for the population standard
deviation was σ and for the sample standard deviation was s. The equation that will be estimated with a
sample of data for two independent variables will thus be:

yi = b0 + b1 x1i + b2 x2i + ei (4.5)


As with our earlier work with probability distributions, this model works only if certain assumptions hold.
These are that the Y is normally distributed, the errors are also normally distributed with a mean of zero and
a constant standard deviation, and that the error terms are independent of the size of X and independent
of each other.

4.5.1 Assumptions of the Ordinary Least Squares Regression Model


Each of these assumptions needs a bit more explanation. If one of these assumptions fails to be true, then
it will have an effect on the quality of the estimates. Some of the failures of these assumptions can be fixed
while others result in estimates that quite simply provide no insight into the questions the model is trying
to answer or worse, give biased estimates.

5 This content is available online at <http://cnx.org/content/m55838/1.30/>.

1. The independent variables, xi, are all measured without error, and are fixed numbers that are
independent of the error term. This assumption is saying in effect that Y is deterministic, the result of a
fixed component X and a random error component ε.
2. The error term is a random variable with a mean of zero and a constant variance. This means
that the variance of the error term does not depend on the value of the independent variable.
Consider the relationship between personal income and the quantity of a good purchased as an example
of a case where the variance is dependent upon the value of the independent variable, income. It is
plausible that as income increases the variation around the amount purchased will also increase simply
because of the flexibility provided with higher levels of income. The assumption of constant variance
with respect to the magnitude of the independent variable is called homoscedasticity. If the assumption
fails, then it is called heteroscedasticity. Figure 4.6 shows the case of homoscedasticity where all three
distributions have the same variance around the predicted value of Y regardless of the magnitude of
X.
3. While the independent variables are all fixed values they are from a probability distribution that is
normally distributed. This can be seen in Figure 4.6 by the shape of the distributions placed on the
predicted line at the expected value of the relevant value of Y.
4. The independent variables are independent of Y, but are also assumed to be independent of the other
X variables. The model is designed to estimate the effects of independent variables on some dependent
variable in accordance with a proposed theory. The case where two or more of the independent
variables are correlated is not unusual. There may be no cause and effect relationship among the
independent variables, but nevertheless they move together. Take the case of a simple supply curve
where quantity supplied is theoretically related to the price of the product and the prices of inputs.
There may be multiple inputs that may over time move together from general inflationary pressure.
The input prices will therefore violate this assumption of regression analysis. This condition is called
multicollinearity, which will be taken up in detail later.
5. The error terms are uncorrelated with each other. A violation arises when one error
term has an effect on another error term. While not exclusively a time series problem, it is here that we most
often see this case. An X variable in time period one has an effect on the Y variable, but this effect
then has an effect in the next time period. This effect gives rise to a relationship among the error
terms. This case is called autocorrelation (self-correlation). The error terms are now not independent
of each other, but rather have their own effect on subsequent error terms.
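To make assumption 2 concrete, the following sketch simulates error terms under constant variance and under variance that grows with X, then compares the spread of the errors at high X with the spread at low X. Everything here is an illustrative assumption: the range of X, the sample size, the standard deviations, and the cutoff of 5.5 are made up, and the simulation is not a formal test for heteroscedasticity.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

def spread_ratio(error_sd, draws=5000):
    """Ratio of the error standard deviation at high X to that at low X."""
    low, high = [], []
    for _ in range(draws):
        x = random.uniform(1, 10)             # hypothetical independent variable
        e = random.gauss(0, error_sd(x))      # error term with mean zero
        (low if x < 5.5 else high).append(e)  # split the errors by size of X
    return statistics.stdev(high) / statistics.stdev(low)

# Homoscedastic case: the error standard deviation is the same at every X.
ratio_constant = spread_ratio(lambda x: 2.0)

# Heteroscedastic case: the error standard deviation grows with X.
ratio_growing = spread_ratio(lambda x: 0.5 * x)

print(round(ratio_constant, 2), round(ratio_growing, 2))
```

A ratio near 1 is what constant variance looks like; a ratio well above 1 signals that the errors fan out as X increases, as in the income example.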

Figure 4.6 shows the case where the assumptions of the regression model are being satisfied. The estimated
line is ŷ = a + bx. Three values of X are shown. A normal distribution is placed at the point on the estimated
line for each of these values of X, showing the error associated with each value of Y. Notice that the three
distributions are normally distributed around the point on the line, and further, the variation, variance,
around the predicted value is constant, indicating homoscedasticity from assumption 2. Figure 4.6 does not
show all the assumptions of the regression model, but it helps visualize these important ones.


Figure 4.6


Figure 4.7

This is the general form that is most often called the multiple regression model. So-called "simple"
regression analysis has only one independent (right-hand) variable rather than many independent variables.
Simple regression is just a special case of multiple regression. There is some value in beginning with simple
regression: it is easy to graph in two dimensions, difficult to graph in three dimensions, and impossible
to graph in more than three dimensions. Consequently, our graphs will be for the simple regression case.
Figure 4.7 presents the regression problem in the form of a scatter plot graph of the data set where it is
hypothesized that Y is dependent upon the single independent variable X.
A basic relationship from Macroeconomic Principles is the consumption function. This theoretical
relationship states that as a person's income rises, their consumption rises, but by a smaller amount than
the rise in income. If Y is consumption and X is income in the equation below Figure 4.7, the regression
problem is, first, to establish that this relationship exists, and second, to determine the impact of a change
in income on a person's consumption. The parameter $\beta_1$ was called the Marginal Propensity to Consume in
Macroeconomics Principles.
Each "dot" in Figure 4.7 represents the consumption and income of different individuals at some point in
time. This was called cross-section data earlier; observations on variables at one point in time across different
people or other units of measurement. This analysis is often done with time series data, which would be
the consumption and income of one individual or country at different points in time. For macroeconomic
problems it is common to use time series aggregated data for a whole country. For this particular theoretical
concept these data are readily available in the annual report of the President's Council of Economic Advisors.
The regression problem comes down to determining which straight line would best represent the data in
Figure 4.8. Regression analysis is sometimes called "least squares" analysis because the method of
determining which line best "fits" the data is to minimize the sum of the squared residuals of a line put through
the data.

Figure 4.8:
Population Equation: $C = \beta_0 + \beta_1\,\text{Income} + \varepsilon$
Estimated Equation: $C = b_0 + b_1\,\text{Income} + e$

This figure shows the assumed relationship between consumption and income from macroeconomic theory.
Here the data are plotted as a scatter plot and an estimated straight line has been drawn. From this graph
we can see an error term, $e_1$. Each data point also has an error term. Again, the error term is put into the
equation to capture effects on consumption that are not caused by income changes. Such other effects might
be a person's savings or wealth, or periods of unemployment. We will see how by minimizing the sum of
these squared errors we can get an estimate for the slope and intercept of this line.

Consider the graph below. The notation has returned to that for the more general model rather than
the specific case of the Macroeconomic consumption function in our example.

Figure 4.9

The $\hat{y}$ is read "y hat" and is the estimated value of y. (In Figure 4.8 $\hat{C}$ represents the estimated
value of consumption because it is on the estimated line.) It is the value of y obtained using the regression
line. $\hat{y}$ is not generally equal to y from the data.
The term $y_0 - \hat{y}_0 = e_0$ is called the "error" or residual. It is not an error in the sense of a mistake. The
error term was put into the estimating equation to capture missing variables and errors in measurement that
may have occurred in the dependent variables. The absolute value of a residual measures the vertical
distance between the actual value of y and the estimated value of y. In other words, it measures the vertical
distance between the actual data point and the predicted point on the line as can be seen on the graph at
point $X_0$.

If the observed data point lies above the line, the residual is positive, and the line underestimates the
actual data value for y.
If the observed data point lies below the line, the residual is negative, and the line overestimates the
actual data value for y.
In the graph, $y_0 - \hat{y}_0 = e_0$ is the residual for the point shown. Here the point lies above the line and the
residual is positive. For each data point the residuals, or errors, are calculated as $y_i - \hat{y}_i = e_i$ for i = 1, 2, 3, ...,
n, where n is the sample size. Each |e| is a vertical distance.
The sum of the squared errors is called, naturally enough, the Sum of Squared Errors (SSE): $SSE = \sum e_i^2$.
Using calculus, you can determine the straight line that has the parameter values of $b_0$ and $b_1$ that
minimize the SSE. When you make the SSE a minimum, you have determined the points that are on the
line of best fit. It turns out that the line of best fit has the equation:

$$\hat{y} = b_0 + b_1 x \tag{4.9}$$
where $b_0 = \bar{y} - b_1 \bar{x}$ and
$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{\operatorname{cov}(x, y)}{s_x^2}$$
The sample means of the x values and the y values are $\bar{x}$ and $\bar{y}$, respectively. The best fit line always
passes through the point $(\bar{x}, \bar{y})$, called the point of means.
The slope $b_1$ can also be written as:
$$b_1 = r_{y,x} \left( \frac{s_y}{s_x} \right) \tag{4.9}$$
where $s_y$ = the standard deviation of the y values, $s_x$ = the standard deviation of the x values, and r is
the correlation coefficient between x and y.
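
The two expressions for the slope can be checked against each other numerically. The sketch below is only an illustration: the function names and the five income and consumption figures are invented for this purpose and do not come from the text.

```python
from math import sqrt

def ols_line(x, y):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of cross-deviations; denominator: sum of squared x-deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def slope_from_correlation(x, y):
    """Return the slope computed as r * (s_y / s_x)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    r = cov_xy / (s_x * s_y)
    return r * s_y / s_x

# Hypothetical income and consumption figures, invented for illustration.
income = [10.0, 20.0, 30.0, 40.0, 50.0]
consumption = [9.0, 16.0, 22.0, 31.0, 37.0]

b0, b1 = ols_line(income, consumption)
b1_alt = slope_from_correlation(income, consumption)
print(round(b0, 4), round(b1, 4), round(b1_alt, 4))
```

For these made-up numbers both routes give a slope of 0.71 and an intercept of 1.7, and the fitted line passes through the point of means (30, 23).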
These equations are called the Normal Equations and come from another very important mathematical
finding called the Gauss-Markov Theorem without which we could not do regression analysis. The
Gauss-Markov Theorem tells us that the estimates we get from using the ordinary least squares (OLS) regression
method will result in estimates that have some very important properties. In the Gauss-Markov Theorem
it was proved that a least squares line is BLUE, which is, Best, Linear, Unbiased, Estimator. Best is the
statistical property that an estimator is the one with the minimum variance. Linear refers to the property
of the type of line being estimated. An unbiased estimator is one whose estimating function has an expected
mean equal to the mean of the population. (You will remember that the expected value of the sample mean,
$\mu_{\bar{x}}$, was equal
to the population mean µ in accordance with the Central Limit Theorem. This is exactly the same concept
here).
Both Gauss and Markov were giants in the field of mathematics, and Gauss in physics too: Gauss in the late 18th
and early 19th centuries, Markov in the late 19th and early 20th. They did not overlap chronologically (Gauss died
in 1855, the year before Markov was born) and never in geography, but Markov's
work on this theorem was based extensively on the earlier work of Carl Gauss. The extensive applied value
of this theorem had to wait until the middle of this last century.
Using the OLS method we can now find the estimate of the error variance, which is the variance of
the errors, computed from the squared errors, $e^2$. Its square root is sometimes called the standard error of the estimate. (Grammatically
this is probably best said as the estimate of the error's variance.) The formula for the estimate of the error
variance is:
$$s_e^2 = \frac{\sum (y_i - \hat{y}_i)^2}{n - k} = \frac{\sum e_i^2}{n - k} \tag{4.9}$$
where $\hat{y}$ is the predicted value of y and y is the observed value, and thus the term $(y_i - \hat{y}_i)^2$ is the squared
error that is to be minimized to find the estimates of the regression line parameters. This is really just
the variance of the error terms and follows our regular variance formula. One important note is that here
we are dividing by (n − k), which is the degrees of freedom. The degrees of freedom of a regression equation
will be the number of observations, n, reduced by the number of estimated parameters, which includes the
intercept as a parameter.
The variance of the errors is fundamental in testing hypotheses for a regression. It tells us just how
tight the dispersion is about the line. As we will see shortly, the greater the dispersion about the line,
meaning the larger the variance of the errors, the less probable that the hypothesized independent variable
will be found to have a significant effect on the dependent variable. In short, the theory being tested will
more likely fail if the variance of the error term is high. Upon reflection this should not be a surprise. As we
tested hypotheses about a mean we observed that large variances reduced the calculated test statistic and
thus it failed to reach the tail of the distribution. In those cases, the null hypothesis could not be rejected.
If we cannot reject the null hypothesis in a regression problem, we must conclude that the data give no evidence
that the hypothesized independent variable has an effect on the dependent variable.
A way to visualize this concept is to draw two scatter plots of x and y data along a predetermined line.
The first will have little variance of the errors, meaning that all the data points will lie close to the line.
Now do the same except the data points will have a large estimate of the error variance, meaning that the
data points are scattered widely about the line. Clearly the confidence about a relationship between x and
y is affected by this difference in the estimate of the error variance.
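
The two scatter plots described above can also be compared numerically. The following is a minimal sketch under stated assumptions: the predetermined line y = 2 + 0.5x, the fixed pattern of disturbances, and the function names are all invented for illustration, and k = 2 counts the two estimated parameters (intercept and slope).

```python
def fit_line(x, y):
    """Return (b0, b1) for the least-squares line through the data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1

def error_variance(x, y, k=2):
    """Estimate of the error variance: SSE / (n - k), k = estimated parameters."""
    b0, b1 = fit_line(x, y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return sse / (len(x) - k)

# Hypothetical data scattered about the predetermined line y = 2 + 0.5x.
x = list(range(1, 11))
pattern = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]   # fixed, made-up disturbances
tight = [2 + 0.5 * xi + 0.2 * p for xi, p in zip(x, pattern)]
wide = [2 + 0.5 * xi + 2.0 * p for xi, p in zip(x, pattern)]

s2_tight = error_variance(x, tight)
s2_wide = error_variance(x, wide)
print(s2_tight < s2_wide)
```

Because the wide data set uses disturbances ten times as large, its estimated error variance is one hundred times that of the tight data set, even though both are scattered about the same line.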

4.5.2 Testing the Parameters of the Line


The whole goal of the regression analysis was to test the hypothesis that the dependent variable, Y, was in
fact dependent upon the values of the independent variables as asserted by some foundation theory, such as
the consumption function example. Looking at the estimated equation under Figure 4.8, we see that this
amounts to determining the values of b0 and b1 . Notice that again we are using the convention of Greek
letters for the population parameters and Roman letters for their estimates.
The regression analysis output provided by the computer software will produce an estimate of $b_0$ and $b_1$,
and any other b's for other independent variables that were included in the estimated equation. The issue is
how good are these estimates? In order to test a hypothesis concerning any estimate, we have found that we
need to know the underlying sampling distribution. It should come as no surprise at this stage in the course
that the answer is going to be the normal distribution. This can be seen by remembering the assumption
that the error term in the population, ε, is normally distributed. If the error term is normally distributed
and the variances of the estimates of the equation parameters, $b_0$ and $b_1$, are determined by the variance of
the error term, it follows that the parameter estimates are also normally distributed. And
indeed this is just the case.
We can see this by the creation of the test statistic for the test of hypothesis for the slope parameter,
$\beta_1$ in our consumption function equation. To test whether or not Y does indeed depend upon X, or in our
example, that consumption depends upon income, we need only test the hypothesis that $\beta_1$ equals zero.
This hypothesis would be stated formally as:

$$H_0: \beta_1 = 0 \tag{4.9}$$

$$H_a: \beta_1 \neq 0 \tag{4.9}$$
If we cannot reject the null hypothesis, we must conclude that the data do not support our theory. If we cannot
reject the null hypothesis that $\beta_1 = 0$ then $b_1$, the coefficient of Income, is not statistically different from zero, and zero times anything is
zero. We therefore treat the effect of Income on Consumption as zero. The data show no evidence of the relationship our theory had
suggested.
Notice that we have set up the presumption, the null hypothesis, as "no relationship". This puts the
burden of proof on the alternative hypothesis. In other words, if we are to validate our claim of finding a
relationship, we must do so with a level of confidence of 90, 95, or 99 percent. The status quo is
ignorance, no relationship exists, and to be able to make the claim that we have actually added to our body
of knowledge we must do so with a significant probability of being correct. John Maynard Keynes got it right
and thus was born Keynesian economics starting with this basic concept in 1936.

The test statistic for this test comes directly from our old friend the standardizing formula:
$$t_c = \frac{b_1 - \beta_1}{S_{b_1}} \tag{4.9}$$
where $b_1$ is the estimated value of the slope of the regression line, $\beta_1$ is the hypothesized value of beta,
in this case zero, and $S_{b_1}$ is the standard deviation of the estimate of $b_1$. In this case we are asking how
many standard deviations is the estimated slope away from the hypothesized slope. This is exactly the same
question we asked before with respect to a hypothesis about a mean: how many standard deviations is the
estimated mean, the sample mean, from the hypothesized mean?
The test statistic follows a Student's t distribution, but if the sample size is large enough so that
the degrees of freedom are greater than 30 we may again use the normal distribution. To see why we can
use the Student's t or normal distribution we have only to look at $S_{b_1}$, the formula for the standard deviation
of the estimate of $b_1$:

$$S_{b_1} = \sqrt{\frac{S_e^2}{\sum (x_i - \bar{x})^2}} \tag{4.9}$$

or

$$S_{b_1} = \sqrt{\frac{S_e^2}{(n - 1)\, S_x^2}} \tag{4.9}$$
where $S_e^2$ is the estimate of the error variance and $S_x^2$ is the variance of the x values of the
independent variable being tested.
We see that $S_e^2$, the estimate of the error variance, is part of the computation. Because the estimate
of the error variance is based on the assumption of normality of the error terms, we can conclude that
the sampling distributions of the b's, the coefficients of our hypothesized regression line, are also normally
distributed.
One last note concerns the degrees of freedom of the test statistic, ν = n − k − 1. Previously we subtracted 1
from the sample size to determine the degrees of freedom in a Student's t problem. Here we must subtract
one degree of freedom for each parameter estimated in the equation. For the example of the consumption
function we lose 2 degrees of freedom, one for $b_0$, the intercept, and one for $b_1$, the slope of the consumption
function. The degrees of freedom are therefore n − k − 1, where k is the number of independent variables and
the extra one is lost because of the intercept. (This is the same count as the n − k used in the formula for the
estimate of the error variance, where k counted every estimated parameter, including the intercept.) If we were estimating an equation with three independent
variables, we would lose 4 degrees of freedom: three for the independent variables, k, and one more for the
intercept.
The decision rule for acceptance or rejection of the null hypothesis follows exactly the same form as in
all our previous tests of hypothesis. Namely, if the calculated value of t (or Z) falls into the tails of the
distribution, where the tails are defined by α, the required significance level in the test, we cannot accept the
null hypothesis. If, on the other hand, the calculated value of the test statistic does not fall into the tails, that is, it lies outside the critical region,
we cannot reject the null hypothesis.
If we conclude that we cannot accept the null hypothesis, we are able to state with (1 − α) level of
confidence that the slope of the line is given by $b_1$. This is an extremely important conclusion. Regression
analysis not only allows us to test if a cause and effect relationship exists, we can also determine the magnitude
of that relationship, if one is found to exist. It is this feature of regression analysis that makes it so valuable.
If models can be developed that have statistical validity, we are then able to simulate the effects of changes
in variables that may be under our control, with some degree of probability, of course. For example, if
advertising is demonstrated to affect sales, we can determine the effects of changing the advertising budget
and decide if the increased sales are worth the added expense.
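
The test of the slope can be sketched in a few lines of code. This is an illustration only, under several assumptions: the data are invented and use a fixed, non-random pattern of disturbances, the standard deviation of the slope is taken as the square root of the error variance divided by the sum of squared deviations of x, and the critical value comes from the normal distribution, which the text allows when the degrees of freedom exceed 30. The function name `slope_t_test` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def slope_t_test(x, y, beta_hyp=0.0, alpha=0.05):
    """Return (b1, S_b1, t_c, reject) for a two-tailed test of H0: beta_1 = beta_hyp."""
    n = len(x)
    k = 1                                   # one independent variable
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s2_e = sse / (n - k - 1)                # estimate of the error variance
    s_b1 = sqrt(s2_e / sxx)                 # standard deviation of the slope estimate
    t_c = (b1 - beta_hyp) / s_b1
    # Normal approximation to the critical value, used here because df > 30.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return b1, s_b1, t_c, abs(t_c) > z_crit

# Hypothetical sample of 40 observations with made-up, non-random disturbances.
x = [float(i) for i in range(1, 41)]
y = [5.0 + 0.8 * xi + ((7 * i) % 11 - 5) for i, xi in enumerate(x, start=1)]

b1, s_b1, t_c, reject = slope_t_test(x, y)
print(round(b1, 3), round(t_c, 1), reject)
```

With 38 degrees of freedom the calculated t lies far out in the tail, so the null hypothesis of a zero slope cannot be accepted for this invented sample.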

4.5.3 Multicollinearity
Our discussion earlier indicated that like all statistical models, the OLS regression model has important
assumptions attached. Each assumption, if violated, has an effect on the ability of the model to provide
useful and meaningful estimates. The Gauss-Markov Theorem has assured us that the OLS estimates are
unbiased and minimum variance, but this is true only under the assumptions of the model. Here we will look
at the effects on OLS estimates if the independent variables are correlated. The other assumptions and the
methods to mitigate the difficulties they pose if they are found to be violated are examined in Econometrics
courses. We take up multicollinearity because it is so often prevalent in Economic models and it often leads
to frustrating results.
The OLS model assumes that all the independent variables are independent of each other. This
assumption is easy to test for a particular sample of data with simple correlation coefficients. Correlation, like much
in statistics, is a matter of degree: a little is not good, and a lot is terrible.
The goal of the regression technique is to tease out the independent impacts of each of a set of independent
variables on some hypothesized dependent variable. If two independent variables are interrelated, that is,
correlated, then we cannot isolate the effects on Y of one from the other. In an extreme case where $x_1$ is a
linear combination of $x_2$, correlation equal to one, both variables move in identical ways with Y. In this case
it is impossible to determine the variable that is the true cause of the effect on Y. (If the two variables were
actually perfectly correlated, then mathematically no regression results could actually be calculated.)
The normal equations for the coefficients show the effects of multicollinearity on the coefficients.

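$$b_1 = \frac{s_y \left( r_{x_1 y} - r_{x_1 x_2}\, r_{x_2 y} \right)}{s_{x_1} \left( 1 - r_{x_1 x_2}^2 \right)} \tag{4.9}$$

$$b_2 = \frac{s_y \left( r_{x_2 y} - r_{x_1 x_2}\, r_{x_1 y} \right)}{s_{x_2} \left( 1 - r_{x_1 x_2}^2 \right)} \tag{4.9}$$

$$b_0 = \bar{y} - b_1 \bar{x}_1 - b_2 \bar{x}_2 \tag{4.9}$$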

The squared correlation between $x_1$ and $x_2$, $r_{x_1 x_2}^2$, appears in the denominator of the estimating formulas
for both $b_1$ and $b_2$. If the assumption of independence holds, then this term is zero. This indicates that there
is no effect of the correlation on the coefficient. On the other hand, as the correlation between the two
independent variables increases the denominator decreases, which tends to inflate the estimate of the coefficient.
The correlation has the same effect on both of the coefficients of these two variables. In essence, each variable
is taking part of the effect on Y that should be attributed to the collinear variable. This results in unreliable
estimates of the separate effects.
Multicollinearity has a further deleterious impact on the OLS estimates. The correlation between the
two independent variables also shows up in the formulas for the estimate of the variance for the coefficients.

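$$s_{b_1}^2 = \frac{s_e^2}{(n - 1)\, s_{x_1}^2 \left( 1 - r_{x_1 x_2}^2 \right)} \tag{4.9}$$

$$s_{b_2}^2 = \frac{s_e^2}{(n - 1)\, s_{x_2}^2 \left( 1 - r_{x_1 x_2}^2 \right)} \tag{4.9}$$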

Here again we see the correlation between $x_1$ and $x_2$ in the denominator of the estimates of the variance
for the coefficients for both variables. If the correlation is zero as assumed in the regression model, then the
formula collapses to the familiar ratio of the variance of the errors to the variance of the relevant independent
variable. If however the two independent variables are correlated, then the variance of the estimate of the
coefficient increases. This results in a smaller t-value for the test of hypothesis of the coefficient. In short,
multicollinearity can result in failing to reject the null hypothesis that the X variable has no impact on Y when
in fact X does have an impact on Y. Said another way, the large standard errors of the
estimated coefficient created by multicollinearity suggest statistical insignificance even when the hypothesized
relationship is strong.
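
The effect of the correlation on the variance of a coefficient, and so on its t-value, can be tabulated directly from the variance formula above. The sketch below uses hypothetical inputs chosen only to keep the arithmetic simple, and it holds the estimated coefficient fixed at 0.5 so that only the correlation changes. The function name `slope_variance` is an assumption of this example.

```python
from math import sqrt

def slope_variance(s2_e, n, s2_x, r12):
    """Variance of a slope estimate with two regressors: s2_e / ((n-1) s2_x (1 - r12^2))."""
    return s2_e / ((n - 1) * s2_x * (1 - r12 ** 2))

# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
s2_e, n, s2_x1, b1 = 4.0, 51, 2.0, 0.5

for r12 in (0.0, 0.5, 0.9, 0.99):
    var_b1 = slope_variance(s2_e, n, s2_x1, r12)
    t_value = b1 / sqrt(var_b1)             # test of H0: beta_1 = 0
    print(f"r = {r12:4.2f}  var(b1) = {var_b1:.4f}  t = {t_value:.2f}")

t_uncorrelated = b1 / sqrt(slope_variance(s2_e, n, s2_x1, 0.0))
t_collinear = b1 / sqrt(slope_variance(s2_e, n, s2_x1, 0.9))
```

With no correlation the t-value is 2.5 and the coefficient is significant at the 5 percent level; with a correlation of 0.9 the same coefficient has a t-value of about 1.09 and would be judged insignificant.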

4.5.4 How Good is the Equation?


In the last section we concerned ourselves with testing the hypothesis that the dependent variable did indeed
depend upon the hypothesized independent variable or variables. It may be that we find an independent
variable that has some effect on the dependent variable, but it may not be the only one, and it may not even
be the most important one. Remember that the error term was placed in the model to capture the effects
of any missing independent variables. It follows that the error term may be used to give a measure of the
"goodness of fit" of the equation taken as a whole in explaining the variation of the dependent variable, Y.
The multiple correlation coefficient, also called the coefficient of multiple determination or the
coefficient of determination, is given by the formula:

$$R^2 = \frac{SSR}{SST} \tag{4.9}$$
where SSR is the regression sum of squares, the squared deviation of the predicted value of y from the mean
value of y, $(\hat{y} - \bar{y})$, and SST is the total sum of squares, which is the total squared deviation of the dependent
variable, y, from its mean value, including the error term, SSE, the sum of squared errors. Figure 4.10 shows
how the total deviation of the dependent variable, y, is partitioned into these two pieces.

Figure 4.10

Figure 4.10 shows the estimated regression line and a single observation, $x_1$. Regression analysis tries to
explain the variation of the data about the mean value of the dependent variable, y. The question is, why do
the observations of y vary from the average level of y? The value of y at observation $x_1$ varies from the mean
of y by the difference $(y_i - \bar{y})$. The sum of these differences squared is SST, the sum of squares total. The
actual value of y at $x_1$ deviates from the estimated value, $\hat{y}$, by the difference between the estimated value
and the actual value, $(y_i - \hat{y})$. We recall that this is the error term, e, and the sum of these errors squared is SSE,
sum of squared errors. The deviation of the predicted value of y, $\hat{y}$, from the mean value of y is $(\hat{y} - \bar{y})$, and
the sum of these deviations squared is the SSR, sum of squares regression. It is called regression because it is the deviation explained by the
regression. (Sometimes the SSR is called SSM for sum of squares mean because it measures the deviation
from the mean value of the dependent variable, y, as shown on the graph.)
Because SST = SSR + SSE we see that the multiple correlation coefficient is the percent of the
variance, or deviation in y from its mean value, that is explained by the equation when taken as a whole.
$R^2$ will vary between zero and 1, with zero indicating that none of the variation in y was explained by the
equation and a value of 1 indicating that 100% of the variation in y was explained by the equation. For time
series studies expect a high $R^2$ and for cross-section data expect a low $R^2$.
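
The partition of the total sum of squares can be verified on a small example. The sketch below is only illustrative: the five income and consumption figures are invented and the function name is hypothetical.

```python
def sums_of_squares(x, y):
    """Return (SST, SSR, SSE) for the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                 # total
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained by the line
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # unexplained
    return sst, ssr, sse

# Hypothetical income and consumption figures, invented for illustration.
income = [10.0, 20.0, 30.0, 40.0, 50.0]
consumption = [9.0, 16.0, 22.0, 31.0, 37.0]

sst, ssr, sse = sums_of_squares(income, consumption)
r_squared = ssr / sst
print(round(sst, 4), round(ssr, 4), round(sse, 4), round(r_squared, 4))
```

For these numbers SST = 506, SSR = 504.1 and SSE = 1.9, so the two pieces add up to the total and the equation explains about 99.6 percent of the variation in y.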
While a high $R^2$ is desirable, remember that it is the tests of the hypothesis concerning the existence
of a relationship between a set of independent variables and a particular dependent variable that was the
motivating factor in using the regression model. It is validating a cause and effect relationship developed
by some theory that is the true reason that we chose the regression analysis. Increasing the number of
independent variables will have the effect of increasing $R^2$. To account for this effect the proper measure of
the coefficient of determination is the $\bar{R}^2$, adjusted for degrees of freedom, to keep down mindless addition
of independent variables.
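
The text does not give the formula for the adjusted measure. A commonly used definition, stated here as an assumption rather than as the authors' own formula, penalizes $R^2$ by the ratio of the degrees of freedom, with n the number of observations and k the number of independent variables. The numbers in the worked line are hypothetical.

```latex
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - k - 1}
% Hypothetical example: R^2 = 0.80, n = 30 observations, k = 3 independent variables
\bar{R}^2 = 1 - (0.20)\,\frac{29}{26} \approx 0.777
```

Under this definition, adding a variable raises $\bar{R}^2$ only if the gain in $R^2$ outweighs the degree of freedom that is lost.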
There is no statistical test for the $R^2$ and thus little can be said about the model using $R^2$ with our
characteristic confidence level. Two models that have the same size of SSE, that is sum of squared errors,
may have very different $R^2$ if the competing models have different SST, total sum of squared deviations.
The goodness of fit of the two models is the same; they both have the same sum of squares unexplained,
errors squared, but because of the larger total sum of squares on one of the models the $R^2$ differs. Again,
the real value of regression as a tool is to examine hypotheses developed from a model that predicts certain
relationships among the variables. These are tests of hypotheses on the coefficients of the model and not a
game of maximizing $R^2$.
Another way to test the general quality of the overall model is to test the coefficients as a group rather
than independently. Because this is multiple regression (more than one X), we use the F-test to determine
if our coefficients collectively affect Y. The hypothesis is:
$H_0: \beta_1 = \beta_2 = \cdots = \beta_i = 0$
$H_a$: "at least one of the $\beta_i$ is not equal to 0"
If the null hypothesis cannot be rejected, then we conclude that none of the independent variables
contribute to explaining the variation in Y. Reviewing Figure 4.10 we see that SSR, the explained sum of
squares, is a measure of just how much of the variation in Y is explained by all the variables in the model.
SSE, the sum of the errors squared, measures just how much is unexplained. It follows that the ratio of these
two can provide us with a statistical test of the model as a whole. Remembering that the F distribution is a
ratio of Chi squared distributions and that variances are distributed according to Chi Squared, and the sum
of squared errors and the sum of squares are both variances, we have the test statistic for this hypothesis as:
$$F_c = \frac{SSR / k}{SSE / (n - k - 1)} \tag{4.10}$$

where n is the number of observations and k is the number of independent variables. It can be shown that
this is equivalent to:
$$F_c = \frac{n - k - 1}{k} \cdot \frac{R^2}{1 - R^2} \tag{4.10}$$
building from Figure 4.10 where $R^2$ is the coefficient of determination which is also a measure of the
goodness of the model.
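
The equivalence of the two forms of the F statistic can be confirmed with a quick calculation. The sums of squares, sample size and number of variables below are hypothetical values picked for easy arithmetic, and the function names are invented for this sketch.

```python
def f_from_sums(ssr, sse, n, k):
    """F statistic as (SSR / k) / (SSE / (n - k - 1))."""
    return (ssr / k) / (sse / (n - k - 1))

def f_from_r_squared(r_squared, n, k):
    """The same statistic written in terms of R^2."""
    return ((n - k - 1) / k) * (r_squared / (1 - r_squared))

# Hypothetical values: 34 observations, 3 independent variables.
ssr, sse = 600.0, 400.0
n, k = 34, 3
r_squared = ssr / (ssr + sse)           # SST = SSR + SSE

f_sums = f_from_sums(ssr, sse, n, k)
f_r2 = f_from_r_squared(r_squared, n, k)
print(round(f_sums, 4), round(f_r2, 4))
```

Both routes give F = 15 for these numbers. The calculated value would then be compared with the critical F value for k and n − k − 1 degrees of freedom.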
As with all our tests of hypothesis, we reach a conclusion by comparing the calculated F statistic with
the critical value given our desired level of confidence. If the calculated test statistic, an F statistic in this
case, is in the tail of the distribution, then we cannot accept the null hypothesis. By not being able to accept
the null hypothesis we conclude that this specification of this model has validity, because at least one of the
estimated coefficients is significantly different from zero.
An alternative way to reach this conclusion is to use the p-value comparison rule. The p-value is the
area in the tail, given the calculated F statistic. In essence, the computer is finding the F value in the table
for us. The computer regression output for the calculated F statistic is typically found in the ANOVA table
section labeled "significance F". This is
the probability of obtaining an F statistic at least as large as the calculated one if the null hypothesis were true. If this probability is less than our pre-determined
alpha error, then the conclusion is that we cannot accept the null hypothesis. How to read the output of an Excel regression is presented below.

4.5.5 Dummy Variables


Thus far the analysis of the OLS regression technique assumed that the independent variables in the models
tested were continuous random variables. There are, however, no restrictions in the regression model against
independent variables that are binary. This opens the regression model for testing hypotheses concerning
categorical variables such as gender, race, region of the country, before a certain date, after a certain date
and innumerable others. These categorical variables take on only two values, 1 and 0, success or failure,
from the binomial probability distribution. The form of the equation becomes:

$$\hat{y} = b_0 + b_2 x_2 + b_1 x_1 \tag{4.10}$$

Figure 4.11

where $x_2$ = 0, 1. $X_2$ is the dummy variable and $X_1$ is some continuous random variable. The constant,
$b_0$, is the y-intercept, the value where the line crosses the y-axis. When the value of $X_2$ = 0, the estimated
line crosses at $b_0$. When the value of $X_2$ = 1 then the estimated line crosses at $b_0 + b_2$. In effect the dummy
variable causes the estimated line to shift either up or down by the size of the effect of the characteristic
captured by the dummy variable. Note that this is a simple parallel shift and does not affect the impact
of the other independent variable, $X_1$. This variable is a continuous random variable and predicts different
values of y at different values of $X_1$ holding constant the condition of the dummy variable.
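
The parallel shift can be seen by evaluating the estimated equation with the dummy variable set to 0 and then to 1. The coefficients below are hypothetical and the function name `predict` is invented for this sketch.

```python
def predict(x1, x2, b0, b1, b2):
    """Estimated value y_hat = b0 + b2*x2 + b1*x1, with x2 restricted to 0 or 1."""
    if x2 not in (0, 1):
        raise ValueError("the dummy variable must be 0 or 1")
    return b0 + b2 * x2 + b1 * x1

# Hypothetical coefficients, chosen only for illustration.
b0, b1, b2 = 10.0, 2.0, 3.0

for x1 in (0.0, 5.0, 10.0):
    base = predict(x1, 0, b0, b1, b2)       # line for the group coded 0
    shifted = predict(x1, 1, b0, b1, b2)    # line for the group coded 1
    print(x1, base, shifted, shifted - base)
```

At every value of the continuous variable the gap between the two lines equals the dummy coefficient, here 3, while the slope of 2 is the same for both groups.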

An example of the use of a dummy variable is the work estimating the impact of gender on salaries.
There is a full body of literature on this topic and dummy variables are used extensively. For this example
the salaries of elementary and secondary school teachers for a particular state are examined. Using a
homogeneous job category, school teachers, and a single state reduces many of the variations that naturally affect
salaries such as differential physical risk, cost of living in a particular state, and other working conditions.
The estimating equation in its simplest form specifies salary as a function of various teacher characteristics
that economic theory would suggest could affect salary. These would include education level as a measure of
potential productivity, and age and/or experience to capture on-the-job training, again as a measure of
productivity. Because the data are for school teachers employed in public school districts rather than workers in
a for-profit company, the school district's average revenue per average daily student attendance is included
as a measure of ability to pay. The results of the regression analysis using data on 24,916 school teachers
are presented below.

Earnings Estimate for Elementary and Secondary School Teachers

Variable                                   Regression Coefficients (b)   Standard Errors of the Estimates
                                                                         for Teacher's Earnings Function (s_b)
Intercept                                  4269.9
Gender (male = 1)                          632.38                        13.39
Total Years of Experience                  52.32                         1.10
Years of Experience in Current District    29.97                         1.52
Education                                  629.33                        13.16
Total Revenue per ADA                      90.24                         3.76
Adjusted R-squared ($\bar{R}^2$)           .725
n                                          24,916

Table 4.1

The coefficients for all the independent variables are significantly different from zero as indicated by the
standard errors. Dividing each coefficient by its standard error results in a t-value greater than 1.96, which is
the required level for 95% significance. The binary variable, our dummy variable of interest in this analysis, is
gender where male is given a value of 1 and female given a value of 0. The coefficient is significantly different
from zero with a dramatic t-statistic of 47 standard deviations. We thus cannot accept the null hypothesis
that the coefficient is equal to zero. Therefore we conclude that there is a premium paid to male teachers of
$632 after holding constant experience, education and the wealth of the school district in which the teacher
is employed. It is important to note that these data are from some time ago and the $632 represents a six
percent salary premium at that time. A graph of this example of dummy variables is presented below.
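
The t-values can be recomputed from the figures in Table 4.1. In this sketch the coefficients and standard errors are copied from the table, while the dictionary layout and the use of 1.96 as the large-sample critical value are choices made for the illustration.

```python
# Coefficients and standard errors copied from Table 4.1.
table_4_1 = {
    "Gender (male = 1)": (632.38, 13.39),
    "Total Years of Experience": (52.32, 1.10),
    "Years of Experience in Current District": (29.97, 1.52),
    "Education": (629.33, 13.16),
    "Total Revenue per ADA": (90.24, 3.76),
}

CRITICAL_VALUE = 1.96   # two-tailed, 5% level, large-sample (normal) value

# t-value for H0: beta = 0 is the coefficient divided by its standard error.
t_values = {name: b / s_b for name, (b, s_b) in table_4_1.items()}
for name, t in t_values.items():
    print(f"{name:42s} t = {t:6.2f}  significant: {abs(t) > CRITICAL_VALUE}")
```

Every ratio exceeds 1.96, and the gender coefficient gives a t-value of about 47, in line with the figure quoted in the text.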

Figure 4.12

In two dimensions, salary is the dependent variable on the vertical axis and total years of experience was
chosen for the continuous independent variable on the horizontal axis. Any of the other independent variables
could have been chosen to illustrate the effect of the dummy variable. The relationship between total years
of experience and salary has a slope of $52.32 per year of experience and the estimated line has an intercept of $4,269 if
the gender variable is equal to zero, for female. If the gender variable is equal to 1, for male, the coefficient
for the gender variable is added to the intercept and thus the relationship between total years of experience
and salary is shifted upward in parallel as indicated on the graph. Also marked on the graph are various points
for reference. A female school teacher with 10 years of experience receives a salary of $4,792 on the basis of
her experience only, but this is still $109 less than a male teacher with zero years of experience.
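
The reference points on the graph follow from the coefficients in Table 4.1. The sketch below sets every other variable to zero, as the comparison in the text does. The function name `salary` is invented for the example.

```python
# Coefficients from Table 4.1; every other variable is held at zero here.
INTERCEPT = 4269.9
GENDER_PREMIUM = 632.38      # added when the dummy variable equals 1 (male)
EXPERIENCE_SLOPE = 52.32     # dollars per year of total experience

def salary(years_experience, male):
    """Predicted salary from the intercept, the gender dummy and total experience."""
    return INTERCEPT + GENDER_PREMIUM * male + EXPERIENCE_SLOPE * years_experience

female_10_years = salary(10, male=0)
male_0_years = salary(0, male=1)
gap = male_0_years - female_10_years
print(round(female_10_years, 2), round(male_0_years, 2), round(gap, 2))
```

The gap works out to about $109, as stated. The female salary comes to $4,793.10 with the full intercept of 4269.9; the $4,792 in the text appears to use the intercept truncated to $4,269.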
A more complex interaction between a dummy variable and the dependent variable can also be estimated.
It may be that the dummy variable has more than a simple shift effect on the dependent variable, but also
interacts with one or more of the other continuous independent variables. While not tested in the example
Available for free at Connexions <http://cnx.org/content/col23820/1.2>


285

above, it could be hypothesized that the impact of gender on salary was not a one-time shift, but impacted
the value of additional years of experience on salary also. That is, female school teacher's salaries were
discounted at the start, and further did not grow at the same rate from the eect of experience as for male
school teachers. This would show up as a dierent slope for the relationship between total years of experience
for males than for females. If this is so then females school teachers would not just start behind their male
colleagues (as measured by the shift in the estimated regression line), but would fall further and further
behind as time and experienced increased.
The graph below shows how this hypothesis can be tested with the use of dummy variables and an
interaction variable.

Figure 4.13

The estimating equation shows how the slope of $X_1$, the continuous random variable experience, contains
two parts, $b_1$ and $b_3$. This occurs because the new variable $X_2 X_1$, called the interaction variable, was
created to allow for an effect on the slope of $X_1$ from changes in $X_2$, the binary dummy variable. Note that
when the dummy variable $X_2 = 0$ the interaction variable has a value of 0, but when $X_2 = 1$ the interaction
variable has a value of $X_1$. The coefficient $b_3$ is an estimate of the difference in the coefficient of $X_1$ when
$X_2 = 1$ compared to when $X_2 = 0$. In the example of teachers' salaries, if there is a premium paid to male
teachers that affects the rate of increase in salaries from experience, then the rate at which male teachers'
salaries rise would be $b_1 + b_3$ and the rate at which female teachers' salaries rise would be simply $b_1$. This
can be tested with the hypotheses:

$$H_0: \beta_3 = 0 \mid \beta_1 = 0,\ \beta_2 = 0 \tag{4.13}$$

$$H_a: \beta_3 \neq 0 \mid \beta_1 \neq 0,\ \beta_2 \neq 0 \tag{4.13}$$
This is a t-test using the test statistic for the parameter $\beta_3$. If we cannot accept the null hypothesis that
$\beta_3 = 0$ we conclude there is a difference in the rate of increase for the group for whom the value of the
binary variable is set to 1, males in this example. This estimating equation can be combined with our earlier
one that tested only a parallel shift in the estimated line. The earnings/experience functions in Figure 4.13
are drawn for this case with a shift in the earnings function and a difference in the slope of the function with
respect to total years of experience.
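A minimal sketch of the estimating equation with both a shift dummy and an interaction variable follows. The values of b0, b1 and b2 are borrowed from the earlier teacher salary example purely for scale, and b3 = 10 is a hypothetical number chosen for illustration; it is not an estimate from the text.

```python
def salary(x1: float, x2: int, b0: float, b1: float, b2: float, b3: float) -> float:
    """Estimating equation with a shift dummy (x2) and an interaction term (x2 * x1)."""
    interaction = x2 * x1   # 0 when x2 = 0, equal to x1 when x2 = 1
    return b0 + b1 * x1 + b2 * x2 + b3 * interaction


def experience_slope(x2: int, b1: float, b3: float) -> float:
    """Slope with respect to experience: b1 for x2 = 0, b1 + b3 for x2 = 1."""
    return b1 + b3 * x2


# b3 is hypothetical; the other values only set a familiar scale.
B0, B1, B2, B3 = 4269.0, 52.32, 632.0, 10.0

print(experience_slope(0, B1, B3))   # slope for the group with x2 = 0
print(experience_slope(1, B1, B3))   # slope for the group with x2 = 1
# One more year of experience raises the x2 = 1 salary by b1 + b3:
print(salary(11, 1, B0, B1, B2, B3) - salary(10, 1, B0, B1, B2, B3))
```

With a nonzero b3 the two lines are no longer parallel, so the gap between the groups widens as experience grows.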
Example 4.5
A random sample of 11 statistics students produced the following data, where x is the third exam
score out of 80, and y is the final exam score out of 200. Can you predict the final exam score of a
randomly selected student if you know the third exam score?


x (third exam score)    y (final exam score)
65                      175
67                      133
71                      185
71                      163
66                      126
75                      198
67                      153
70                      163
71                      159
69                      151
69                      159
(a)

(b)

Figure 4.14: (a) Table showing the scores on the final exam based on scores from the third exam. (b)
Scatter plot showing the scores on the final exam based on scores from the third exam.


4.5.6
Exercise 4.5.1 (Solution on p. 311.)
Suppose that you have at your disposal the information below for each of 30 drivers. Propose
a model (including a very brief indication of symbols used to represent independent variables) to
explain how miles per gallon vary from driver to driver on the basis of the factors measured.
Information:

1. miles driven per day
2. weight of car
3. number of cylinders in car
4. average speed
5. miles per gallon
6. number of passengers

Exercise 4.5.2 (Solution on p. 311.)


Consider a sample least squares regression analysis between a dependent variable (Y) and an
independent variable (X). A sample correlation coefficient of −1 (minus one) tells us that

a. there is no relationship between Y and X in the sample
b. there is no relationship between Y and X in the population
c. there is a perfect negative relationship between Y and X in the population
d. there is a perfect negative relationship between Y and X in the sample.

Exercise 4.5.3 (Solution on p. 311.)


In correlational analysis, when the points scatter widely about the regression line, this means that
the correlation is

a. negative.
b. low.
c. heterogeneous.
d. between two measures that are unreliable.

4.5.7 Chapter Review


It is hoped that this discussion of regression analysis has demonstrated the tremendous potential value it has
as a tool for testing models and helping to better understand the world around us. The regression model has
its limitations, especially the requirement that the underlying relationship be approximately linear. To the
extent that the true relationship is nonlinear it may be approximated with a linear relationship or with nonlinear
forms of transformations that can be estimated with linear techniques. Double logarithmic transformation
of the data will provide an easy way to test this particular shape of the relationship. A reasonably good
quadratic form (the shape of the total cost curve from Microeconomics Principles) can be generated by the
equation:

$$Y = a + b_1 X + b_2 X^2 \tag{4.14}$$
where the values of X are simply squared and put into the equation as a separate variable.
There is much more in the way of econometric "tricks" that can bypass some of the more troublesome
assumptions of the general regression model. This statistical technique is so valuable that further study
would provide any student significant, statistically significant, dividends.


4.6 Interpretation of Regression Coefficients: Elasticity and Logarithmic Transformation 6
As we have seen, the coefficient of an equation estimated using OLS regression analysis provides an estimate
of the slope of a straight line that is assumed to be the relationship between the dependent variable and at
least one independent variable. From calculus, the slope of the line is the first derivative and tells us the
magnitude of the impact of a one unit change in the X variable upon the value of the Y variable measured
in the units of the Y variable. As we saw in the case of dummy variables, this can show up as a parallel
shift in the estimated line or even a change in the slope of the line through an interaction variable. Here we
wish to explore the concept of elasticity and how we can use a regression analysis to estimate the various
elasticities in which economists have an interest.
The concept of elasticity is borrowed from engineering and physics where it is used to measure a material's
responsiveness to a force, typically a physical force such as a stretching/pulling force. It is from here that
we get the term "elastic" band. In economics, the force in question is some market force such as a change
in price or income. Elasticity is measured as a percentage change/response in both engineering applications
and in economics. The value of measuring in percentage terms is that the units of measurement do not
play a role in the value of the measurement, which allows direct comparison between elasticities. As an
example, if the price of gasoline increased, say, 50 cents from an initial price of $3.00 and generated a decline
in monthly consumption for a consumer from 50 gallons to 48 gallons, we calculate the elasticity to be about 0.25.
The price elasticity is the percentage change in quantity resulting from some percentage change in price.
A roughly 16 percent increase in price has generated only a 4 percent decrease in demand: 16% price change →
4% quantity change, or .04/.16 = .25. This is called an inelastic demand, meaning a small response to the
price change. This comes about because there are few if any real substitutes for gasoline; perhaps public
transportation, a bicycle or walking. Technically, of course, the percentage change in demand from a price
increase is a decline in demand, thus price elasticity is a negative number. The common convention, however,
is to talk about elasticity as the absolute value of the number. Some goods have many substitutes: pears
for apples, for plums, for grapes, etc. The elasticity for such goods is larger than one and they are called
elastic in demand. Here a small percentage change in price will induce a large percentage change in quantity
demanded. The consumer will easily shift the demand to the close substitute.
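The gasoline arithmetic can be reproduced directly. The sketch below uses only the prices and quantities given above; the helper name pct_change is illustrative. The text rounds the price change to 16 percent, which gives 0.25, while the unrounded ratio is 0.24.

```python
def pct_change(old: float, new: float) -> float:
    """Proportional change from old to new (0.04 means 4 percent)."""
    return (new - old) / old


price_change = pct_change(3.00, 3.50)     # about +0.167
quantity_change = pct_change(50, 48)      # -0.04

# Convention: report price elasticity as an absolute value.
elasticity = abs(quantity_change / price_change)
print(round(price_change, 3))      # 0.167
print(round(quantity_change, 3))   # -0.04
print(round(elasticity, 2))        # 0.24

is_inelastic = elasticity < 1      # True: a small response to the price change
```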
While this discussion has been about price changes, any of the independent variables in a demand equation
will have an associated elasticity. Thus, there is an income elasticity that measures the sensitivity of demand
to changes in income: not much for the demand for food, but very sensitive for yachts. If the demand
equation contains a term for substitute goods, say candy bars in a demand equation for cookies, then the
responsiveness of demand for cookies from changes in prices of candy bars can be measured. This is called
the cross-price elasticity of demand and to an extent can be thought of as brand loyalty from a marketing
view. How responsive is the demand for Coca-Cola to changes in the price of Pepsi?
Now imagine the demand for a product that is very expensive. Again, the measure of elasticity is in
percentage terms, thus the elasticity can be directly compared to that for gasoline: an elasticity of 0.25 for
gasoline conveys the same information as an elasticity of 0.25 for a $25,000 car. Both goods are considered by
the consumer to have few substitutes and thus have inelastic demand curves, elasticities less than one.
The mathematical formulae for various elasticities are:

$$\text{Price elasticity: } \eta_p = \frac{\%\Delta Q}{\%\Delta P} \tag{4.14}$$
Where $\eta$ is the lowercase Greek letter eta, used to designate elasticity, and $\Delta$ is read as "change".

$$\text{Income elasticity: } \eta_Y = \frac{\%\Delta Q}{\%\Delta Y} \tag{4.14}$$
6 This content is available online at <http://cnx.org/content/m64846/1.7/>.


Where Y is used as the symbol for income.

$$\text{Cross-price elasticity: } \eta_{p_1} = \frac{\%\Delta Q_1}{\%\Delta P_2} \tag{4.14}$$
Where P2 is the price of the substitute good.
Examining the price elasticity more closely, we can write the formula as:

$$\eta_p = \frac{\%\Delta Q}{\%\Delta P} = \frac{dQ}{dP}\left(\frac{P}{Q}\right) = b\left(\frac{P}{Q}\right) \tag{4.14}$$
Where b is the estimated coefficient for price in the OLS regression.
The first form of the equation demonstrates the principle that elasticities are measured in percentage
terms. Of course, the ordinary least squares coefficients provide an estimate of the impact of a unit change
in the independent variable, X, on the dependent variable measured in units of Y. These coefficients are not
elasticities, however, and are shown in the second way of writing the formula for elasticity as $\frac{dQ}{dP}$, the
derivative of the estimated demand function, which is simply the slope of the regression line. Multiplying the
slope by $\frac{P}{Q}$ provides an elasticity measured in percentage terms.
Along a straight-line demand curve the percentage change, and thus elasticity, changes continuously as the
scale changes, while the slope, the estimated regression coefficient, remains constant. Going back to the
demand for gasoline, a change in price from $3.00 to $3.50 was a roughly 16 percent increase in price. If the
beginning price were $5.00 then the same 50¢ increase would be only a 10 percent increase, generating a
different elasticity. Every straight-line demand curve has a range of elasticities starting at the top left, high
prices, with large elasticity numbers, elastic demand, and decreasing as one goes down the demand curve,
inelastic demand.
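To see the elasticity change along a line with a fixed slope, the sketch below assumes a hypothetical straight-line demand curve, Q = 62 - 4P, drawn through the two gasoline points used above (50 gallons at $3.00 and 48 gallons at $3.50). The linear extension to other prices is an assumption for illustration only.

```python
# Hypothetical linear demand curve through (P=3.00, Q=50) and (P=3.50, Q=48).
A, B = 62.0, -4.0   # intercept and slope; B plays the role of the coefficient b


def quantity(price: float) -> float:
    """Quantity demanded on the assumed line Q = A + B * P."""
    return A + B * price


def point_elasticity(price: float) -> float:
    """Absolute value of b * (P / Q) evaluated at the given price."""
    return abs(B * price / quantity(price))


for p in (3.0, 5.0, 10.0, 12.0):
    print(p, quantity(p), round(point_elasticity(p), 2))
```

The slope is -4 at every price, yet the point elasticity rises from 0.24 at $3.00 to above one at $10.00, moving from the inelastic to the elastic part of the same line.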
In order to provide a meaningful estimate of the elasticity of demand the convention is to estimate the
elasticity at the point of means. Remember that all OLS regression lines will go through the point of means.
At this point is the greatest weight of the data used to estimate the coefficient. The formula to estimate an
elasticity when an OLS demand curve has been estimated becomes:

$$\eta_p = b\left(\frac{\bar{P}}{\bar{Q}}\right) \tag{4.14}$$
Where $\bar{P}$ and $\bar{Q}$ are the mean values of the data used to estimate b, the price coefficient.
The same method can be used to estimate the other elasticities for the demand function by using the
appropriate mean values of the other variables; income and price of substitute goods for example.
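As a rough illustration of the point-of-means calculation, the Python below fits a simple OLS line to a small, made-up set of price and quantity observations and then evaluates b times the ratio of mean price to mean quantity. The data and the helper name ols are hypothetical; only the formula comes from the text.

```python
def ols(x, y):
    """Simple least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Hypothetical demand data (price in dollars, quantity in gallons).
prices = [3.0, 3.5, 4.0, 4.5, 5.0]
quantities = [50.0, 48.0, 46.0, 44.0, 42.0]

a, b = ols(prices, quantities)
p_bar = sum(prices) / len(prices)
q_bar = sum(quantities) / len(quantities)
elasticity_at_means = b * p_bar / q_bar   # negative; report abs() by convention

print(round(b, 2), round(elasticity_at_means, 3))
```

The same pattern applies to income or cross-price elasticities by swapping in the relevant coefficient and the mean of that variable.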

4.6.1 Logarithmic Transformation of the Data


Ordinary least squares estimates typically assume that the population relationship among the variables is
linear, thus of the form presented in The Regression Equation (Section 4.5). In this form the interpretation
of the coefficients is as discussed above; quite simply, the coefficient provides an estimate of the impact of
a one unit change in X on Y measured in units of Y. It does not matter just where along the line one
wishes to make the measurement because it is a straight line with a constant slope and thus a constant estimated
level of impact per unit change. It may be, however, that the analyst wishes to estimate not the simple unit
measured impact on the Y variable, but the magnitude of the percentage impact on Y of a one unit change
in the X variable. Such a case might be how a unit change in experience, say one year, affects not the
absolute amount of a worker's wage, but the percentage impact on the worker's wage. Alternatively, it
may be that the question asked is the unit measured impact on Y of a specific percentage increase in X. An
example may be "by how many dollars will sales increase if the firm spends X percent more on advertising?"
The third possibility is the case of elasticity discussed above. Here we are interested in the percentage impact
on quantity demanded for a given percentage change in price, or income or perhaps the price of a substitute
good. All three of these cases can be estimated by transforming the data to logarithms before running the
regression. The resulting coefficients will then provide a percentage change measurement of the relevant
variable.
To summarize, there are four cases:

1. Unit ∆X → Unit ∆Y (Standard OLS case)
2. Unit ∆X → %∆Y
3. %∆X → Unit ∆Y
4. %∆X → %∆Y (elasticity case)

Case 1: The ordinary least squares case begins with the linear model developed above:

$$Y = a + bX \tag{4.14}$$
where the coefficient of the independent variable, $b = \frac{dY}{dX}$, is the slope of a straight line and thus measures
the impact of a unit change in X on Y measured in units of Y.
Case 2: The underlying estimated equation is:

$$\log(Y) = a + bX \tag{4.14}$$

The equation is estimated by converting the Y values to logarithms and using OLS techniques to estimate
the coefficient of the X variable, b. This is called a semi-log estimation. Again, differentiating both sides of
the equation allows us to develop the interpretation of the X coefficient b:

$$d(\log Y) = b\,dX \tag{4.14}$$

$$\frac{dY}{Y} = b\,dX \tag{4.14}$$
Multiplying by 100 to convert to percentages and rearranging terms gives:

$$100b = \frac{\%\Delta Y}{\text{Unit } \Delta X} \tag{4.14}$$
100b is thus approximately the percentage change in Y resulting from a unit change in X.
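A small sketch of the semi-log case, assuming natural logarithms (which the differential d(log Y) = dY/Y requires) and a made-up wage series that grows 5 percent per year of experience in log terms. The data and the helper name ols are hypothetical.

```python
import math


def ols(x, y):
    """Simple least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Hypothetical data: wage = 20000 * exp(0.05 * years of experience).
years = list(range(11))
wages = [20000 * math.exp(0.05 * t) for t in years]

# Regress log(wage) on years: the semi-log estimation.
a, b = ols(years, [math.log(w) for w in wages])

approx_pct = 100 * b                    # the 100b reading from the text
exact_pct = 100 * (math.exp(b) - 1)     # exact change implied by the log model
print(round(b, 4), round(approx_pct, 2), round(exact_pct, 2))
```

For small coefficients the 100b reading and the exact percentage change are close; the gap widens as b grows.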
Case 3: In this case the question is "what is the unit change in Y resulting from a percentage change in
X?" What is the dollar loss in revenues of a five percent increase in price or what is the total dollar cost
impact of a five percent increase in labor costs? The estimated equation for this case would be:

$$Y = a + b\log(X) \tag{4.14}$$

Here the calculus differential of the estimated equation is:

$$dY = b\,d(\log X) \tag{4.14}$$

$$dY = b\,\frac{dX}{X} \tag{4.14}$$
Dividing by 100 to get percentages and rearranging terms gives:

$$\frac{b}{100} = \frac{dY}{100\,\frac{dX}{X}} = \frac{\text{Unit } \Delta Y}{\%\Delta X} \tag{4.14}$$

Therefore, $\frac{b}{100}$ is the increase in Y measured in units from a one percent increase in X.
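The same idea can be sketched for Case 3, again assuming natural logarithms. The advertising and sales figures are invented so that sales = 1000 + 500 log(advertising) holds exactly; the helper name ols is illustrative.

```python
import math


def ols(x, y):
    """Simple least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Hypothetical data: sales = 1000 + 500 * log(advertising).
advertising = [100.0, 200.0, 400.0, 800.0, 1600.0]
sales = [1000 + 500 * math.log(x) for x in advertising]

# Regress Y on log(X).
a, b = ols([math.log(x) for x in advertising], sales)

per_one_percent = b / 100   # unit change in Y per one percent change in X
# Direct check: raise advertising by one percent, from 200 to 202.
actual_change = (a + b * math.log(202.0)) - (a + b * math.log(200.0))
print(round(per_one_percent, 3), round(actual_change, 3))
```

The two printed numbers differ slightly because b/100 comes from a differential, so it is an approximation for a finite one percent change.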


Case 4: This is the elasticity case where both the dependent and independent variables are converted to
logs before the OLS estimation. This is known as the log-log case or double log case, and provides us with
direct estimates of the elasticities of the independent variables. The estimated equation is:

$$\log Y = a + b\log X \tag{4.14}$$

Differentiating we have:

$$d(\log Y) = b\,d(\log X) \tag{4.14}$$

$$b\,d(\log X) = b\,\frac{1}{X}\,dX \tag{4.14}$$
thus:

$$\frac{1}{Y}\,dY = b\,\frac{1}{X}\,dX \quad\text{OR}\quad \frac{dY}{Y} = b\,\frac{dX}{X} \quad\text{OR}\quad b = \frac{dY}{dX}\left(\frac{X}{Y}\right) \tag{4.14}$$
and $b = \frac{\%\Delta Y}{\%\Delta X}$, our definition of elasticity. We conclude that we can directly estimate the elasticity of a
variable through double log transformation of the data. The estimated coefficient is the elasticity. It is
common to use double log transformation of all variables in the estimation of demand functions to get
estimates of all the various elasticities of the demand curve.
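A minimal sketch of the double log case, assuming natural logarithms and a made-up constant-elasticity demand curve Q = 100 P^(-0.25). The data and the helper name ols are hypothetical; the point is only that the fitted slope on the logged data is the elasticity itself.

```python
import math


def ols(x, y):
    """Simple least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Hypothetical constant-elasticity demand: Q = 100 * P ** -0.25.
prices = [2.0, 2.5, 3.0, 3.5, 4.0]
quantities = [100 * p ** -0.25 for p in prices]

# Regress log(Q) on log(P): the log-log estimation.
a, b = ols([math.log(p) for p in prices], [math.log(q) for q in quantities])

print(round(b, 4))             # estimated price elasticity
print(round(math.exp(a), 2))   # recovers the scale constant
```

Unlike the straight-line demand curve, this form has the same elasticity at every price, so no point of means is needed to report it.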


4.6.2
Exercise 4.6.1 (Solution on p. 311.)
In a linear regression, why do we need to be concerned with the range of the independent (X)
variable?
Exercise 4.6.2 (Solution on p. 311.)
Suppose one collected the following information where X is diameter of tree trunk and Y is tree
height.

X Y
4 8
2 4
8 18
6 22
10 30
6 8

Table 4.2

Regression equation: $\hat{y}_i = -3.6 + 3.1 \cdot X_i$
What is your estimate of the average height of all trees having a trunk diameter of 7 inches?
Exercise 4.6.3 (Solution on p. 311.)
The manufacturers of a chemical used in flea collars claim that under standard test conditions
each additional unit of the chemical will bring about a reduction of 5 fleas (i.e. where $X_j$ =
amount of chemical and $Y_j = B_0 + B_1 \cdot X_j + E_j$, $H_0: B_1 = -5$).
Suppose that a test has been conducted and results from a computer include:
Intercept = 60
Slope = −4
Standard error of the regression coefficient = 1.0
Degrees of Freedom for Error = 2000
95% Confidence Interval for the slope: −2.04, −5.96
Is this evidence consistent with the claim that the number of fleas is reduced at a rate of 5 fleas
per unit chemical?

4.7 Predicting with a Regression Equation7


One important value of an estimated regression equation is its ability to predict the effects on Y of a change
in one or more values of the independent variables. The value of this is obvious. Careful policy cannot be
made without estimates of the effects that may result. Indeed, it is the desire for particular results that
drives the formation of most policy. Regression models can be, and have been, invaluable aids in forming
such policies.
The Gauss-Markov theorem assures us that the point estimate of the impact on the dependent variable
derived by putting in the equation the hypothetical values of the independent variables one wishes to simulate
7 This content is available online at <http://cnx.org/content/m56649/1.17/>.


will result in an estimate of the dependent variable which is minimum variance and unbiased. That is to say
that from this equation comes the best unbiased point estimate of y given the values of x.

$$\hat{y} = b_0 + b_1 X_{1i} + \cdots + b_k X_{ki} \tag{4.14}$$
Remember that point estimates do not carry a particular level of probability, or level of confidence, because
points have no width above which there is an area to measure. This was why we developed confidence
intervals for the mean and proportion earlier. The same concern arises here also. There are actually two
different approaches to the issue of developing estimates of the effect of changes in the independent variable, or variables,
on the dependent variable. The first approach wishes to measure the expected mean value of y from a
specific change in the value of x: this specific value implies the expected value. Here the question is: what is
the mean impact on y that would result from multiple hypothetical experiments on y at this specific value
of x? Remember that there is a variance around the estimated parameter of x and thus each experiment will
result in a bit of a different estimate of the predicted value of y.
The second approach to estimate the effect of a specific value of x on y treats the event as a single
experiment: you choose x and multiply it times the coefficient and that provides a single estimate of y.
Because this approach acts as if there were a single experiment the variance that exists in the parameter
estimate is larger than the variance associated with the expected value approach.
The conclusion is that we have two different ways to predict the effect of values of the independent
variable(s) on the dependent variable and thus we have two different intervals. Both are correct answers to
the question being asked, but there are two different questions. To avoid confusion, the first case, where we
are asking for the expected value of the mean of the estimated y, is called a confidence interval as we
have named this concept before. The second case, where we are asking for the estimate of the impact on the
dependent variable y of a single experiment using a value of x, is called the prediction interval. The test
statistics for these two interval measures within which the estimated value of y will fall are:
Confidence Interval for Expected Value of Mean Value of y for x = xp

(4.14)

Prediction Interval for an Individual y for x = xp

(4.14)

Where se is the standard deviation of the error term and sx is the standard deviation of the x variable.
The mathematical computations of these two test statistics are complex. Various computer regression
software packages provide programs within the regression functions to provide answers to inquiries about
estimated predicted values of y given various values chosen for the x variable(s). It is important to know just
which interval is being tested in the computer package because the difference in the size of the standard
deviations will change the size of the interval estimated. This is shown in Figure 4.15.
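As a sketch of how the two intervals differ in practice, assume the standard simple-regression forms, written with the same $s_e$ and $s_x$ defined above:

$$\hat{y}_p \pm t_{\alpha/2}\, s_e \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{(n-1)\,s_x^2}} \quad \text{(mean value of y)}$$

$$\hat{y}_p \pm t_{\alpha/2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{(n-1)\,s_x^2}} \quad \text{(individual y)}$$

The Python below applies these assumed forms to the third exam/final exam data of Example 4.5. The critical value 2.262 (95 percent, 9 degrees of freedom) is read from a t table and supplied as a constant, and all function and variable names are illustrative.

```python
import math

# Third exam (x) and final exam (y) scores from Example 4.5.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]
T_CRIT = 2.262   # t value for 95% confidence with n - 2 = 9 d.f. (from a t table)

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)   # equals (n - 1) * s_x**2
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Standard deviation of the error term, using n - 2 degrees of freedom.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(sse / (n - 2))


def intervals(x_p: float):
    """Return (y_hat, confidence half-width, prediction half-width) at x_p."""
    y_hat = b0 + b1 * x_p
    core = 1 / n + (x_p - x_bar) ** 2 / sxx
    ci_half = T_CRIT * s_e * math.sqrt(core)
    pi_half = T_CRIT * s_e * math.sqrt(1 + core)
    return y_hat, ci_half, pi_half


for x_p in (69, 73, 75):
    y_hat, ci_half, pi_half = intervals(x_p)
    print(x_p, round(y_hat, 1), round(ci_half, 1), round(pi_half, 1))
```

The extra 1 under the square root is what makes the prediction interval wider than the confidence interval at every value of x, and the squared distance from the mean of x is what widens both intervals away from the center of the data.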


Figure 4.15: Prediction and confidence intervals for regression equation; 95% confidence level.

Figure 4.15 shows visually the difference the standard deviation makes in the size of the estimated
intervals. The confidence interval, measuring the expected value of the dependent variable, is smaller than
the prediction interval for the same level of confidence. The expected value method assumes that the
experiment is conducted multiple times rather than just once as in the other method. The logic here is
similar, although not identical, to that discussed when developing the relationship between the sample size
and the confidence interval using the Central Limit Theorem. There, as the number of experiments increased,
the distribution narrowed and the confidence interval became tighter around the expected value of the mean.
It is also important to note that the intervals around a point estimate are highly dependent upon the range
of data used to estimate the equation regardless of which approach is being used for prediction. Remember
that all regression equations go through the point of means, that is, the mean value of y and the mean
values of all independent variables in the equation. As the value of x chosen to estimate the associated
value of y moves further from the point of means the width of the estimated interval around the point estimate
increases. Choosing values of x beyond the range of the data used to estimate the equation poses even
greater danger of creating estimates with little use: very large intervals and risk of error. Figure 4.16 shows
this relationship.


Figure 4.16: Confidence interval for an individual value of x, Xp, at 95% level of confidence

Figure 4.16 demonstrates the concern for the quality of the estimated interval whether it is a prediction
interval or a confidence interval. As the value chosen to predict y, Xp in the graph, is further from the
central weight of the data, $\bar{X}$, we see the interval expand in width even while holding constant the level of
confidence. This shows that the precision of any estimate will diminish as one tries to predict beyond the
largest weight of the data and most certainly will degrade rapidly for predictions beyond the range of the
data. Unfortunately, this is just where most predictions are desired. They can be made, but the width of
the confidence interval may be so large as to render the prediction useless. Only actual calculation and the
particular application can determine this, however.
Example 4.6
Recall the third exam/final exam example (Example 4.5).
We found the equation of the best-fit line for the final exam grade as a function of the grade
on the third exam. We can now use the least-squares regression line for prediction. Assume the
coefficient for X was determined to be significantly different from zero.
Suppose you want to estimate, or predict, the mean final exam score of statistics students who
received 73 on the third exam. The exam scores (x-values) range from 65 to 75. Since 73 is
between the x-values 65 and 75, we feel comfortable substituting x = 73 into the equation. Then:

$$\hat{y} = -173.51 + 4.83(73) = 179.08 \tag{4.16}$$
We predict that statistics students who earn a grade of 73 on the third exam will earn a grade of
179.08 on the final exam, on average.
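The substitution can be wrapped in a small function that also encodes the range check described above. This is only a sketch: the coefficients and the 65 to 75 range are those given in the example, while the function name and the choice to raise an error outside the range are illustrative design decisions.

```python
# Rounded coefficients and observed x-range from the example.
B0, B1 = -173.51, 4.83
X_MIN, X_MAX = 65, 75


def predict_final(third_exam: float) -> float:
    """Predicted mean final exam score for a given third exam score."""
    if not X_MIN <= third_exam <= X_MAX:
        # Outside the observed x-values the line is an extrapolation.
        raise ValueError("third exam score is outside the range of the data (65 to 75)")
    return B0 + B1 * third_exam


print(round(predict_final(73), 2))   # 179.08
```

Refusing to extrapolate is a conservative choice; an analyst could instead return the value together with a warning and a suitably wide interval.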
Problem 1
a. What would you predict the final exam score to be for a student who scored a 66 on the third
exam?
Solution
a. 145.27

Problem 2 (Solution on p. 311.)
b. What would you predict the final exam score to be for a student who scored a 90 on the third
exam?


4.7.1
Exercise 4.7.1 (Solution on p. 312.)
True or False? If False, correct it: Suppose you are performing a simple linear regression of Y on
X and you test the hypothesis that the slope β is zero against a two-sided alternative. You have
n = 25 observations and your computed test (t) statistic is 2.6. Then your P-value is given by .01
< P < .02, which gives borderline significance (i.e. you would reject H0 at α = .02 but fail to reject
H0 at α = .01).
Exercise 4.7.2 (Solution on p. 312.)
An economist is interested in the possible influence of "Miracle Wheat" on the average yield of
wheat in a district. To do so he fits a linear regression of average yield per year against year after
introduction of "Miracle Wheat" for a ten year period.
The fitted trend line is
$\hat{y}_j = 80 + 1.5 \cdot X_j$
($Y_j$: Average yield in the jth year after introduction)
($X_j$: jth year after introduction).

a. What is the estimated average yield for the fourth year after introduction?
b. Do you want to use this trend line to estimate yield for, say, 20 years after introduction?
Why? What would your estimate be?

Exercise 4.7.3 (Solution on p. 312.)


An interpretation of r = 0.5 is that the following part of the Y-variation is associated with which
variation in X:

a. most
b. half
c. very little
d. one quarter
e. none of these

Exercise 4.7.4 (Solution on p. 312.)


Which of the following values of r indicates the most accurate prediction of one variable from
another?

a. r = 1.18
b. r = −.77
c. r = .68

4.8 How to Use Microsoft Excel ® for Regression Analysis 8

This section of this chapter is here in recognition that what we are now asking requires much more than a
quick calculation of a ratio or a square root. Indeed, the use of regression analysis was almost nonexistent
before the middle of the last century and did not really become a widely used tool until perhaps the late
1960s and early 1970s. Even then the computational ability of even the largest IBM machines was laughable
by today's standards. In the early days programs were developed by the researchers and shared. There was
no market for something called "software" and certainly nothing called "apps", an entrant into the market
only a few years old.
8 This content is available online at <http://cnx.org/content/m55852/1.18/>.


With the advent of the personal computer and the explosion of a vital software market we have a number of
regression and statistical analysis packages to choose from. Each has its merits. We have chosen Microsoft
Excel because of its widespread availability both on college campuses and in the post-college marketplace.
Stata is an alternative and has features that will be important for more advanced econometrics study if you
choose to follow this path. Even more advanced packages exist, but typically require the analyst to do a
significant amount of programming to conduct their analysis. The goal of this section is to demonstrate how
to use Excel to run a regression and then to do so with an example of a simple version of a demand curve.
The first step to doing a regression using Excel is to load the program into your computer. If you have
Excel you have the Analysis ToolPak although you may not have it activated. The program calls upon a
significant amount of space so is not loaded automatically.
To activate the Analysis ToolPak follow these steps:
Click File > Options > Add-ins to bring up a menu of the add-in ToolPaks. Select Analysis
ToolPak and click Go next to Manage: Excel Add-ins near the bottom of the window. This will open
a new window where you click Analysis ToolPak (make sure there is a green check mark in the box) and
then click OK. Now there should be an Analysis group under the Data tab. These steps are presented in
the following screen shots.


Figure 4.17


Figure 4.18


Figure 4.19


Figure 4.20

Click Data then Data Analysis and then click Regression and OK. Congratulations, you have made
it to the regression window. The window asks for your inputs. Clicking the box next to the Y and X ranges
will allow you to use the click and drag feature of Excel to select your input ranges. Excel has one odd
quirk: the click and drag feature requires that the independent variables, the X variables, are all
together, meaning that they form a single matrix. If your data are set up with the Y variable between two
columns of X variables Excel will not allow you to use click and drag. As an example, say Column A and
Column C are independent variables and Column B is the Y variable, the dependent variable. Excel will
not allow you to click and drag the data ranges. The solution is to move the column with the Y variable to
column A and then you can click and drag. The same problem arises again if you want to run the regression
with only some of the X variables. You will need to set up the matrix so all the X variables you wish to
regress are in a tightly formed matrix. These steps are presented in the following screen shots.

Figure 4.21


Figure 4.22

Once you have selected the data for your regression analysis and told Excel which one is the dependent
variable (Y) and which ones are the independent variables (X's), you have several choices as to the parameters
and how the output will be displayed. Refer to screen shot Figure 4.22 under Input section. If you check
the labels box the program will use the first entry in each variable's column as its name in the
output. You can enter an actual name, such as price or income in a demand analysis, in row one of the Excel
spreadsheet for each variable and it will be displayed in the output.
The confidence level can also be set by the analyst. This will not change the calculated t statistic,
called t stat, or the p value for the calculated t statistic. It will alter the boundaries of the
confidence intervals for the coefficients. A 95 percent confidence interval is always presented, but with a
change in this you will also get a second set of intervals at the level of confidence you entered.
Excel also will allow you to suppress the intercept. This forces the regression program to minimize the
residual sum of squares under the condition that the estimated line must go through the origin. This is done
in cases where the model has no meaning unless the line starts at the origin, the point (0, 0).


An example is an economic production function that is a relationship between the number of units of an
input, say hours of labor, and output. There is no meaning of positive output with zero workers.
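To see what suppressing the intercept does to the estimate, here is a minimal Python sketch that sits outside the Excel workflow. It assumes a single X variable, and the labor and output figures are made up for illustration; the function names are hypothetical.

```python
def fit_with_intercept(x, y):
    """Ordinary least squares for one X variable: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def fit_through_origin(x, y):
    """Least squares with the intercept suppressed: returns the slope only."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


# Hypothetical production data: hours of labor and units of output.
hours = [10, 20, 30, 40, 50]
output = [22, 41, 58, 83, 99]

print(fit_with_intercept(hours, output))
print(fit_through_origin(hours, output))
```

With the intercept suppressed, the slope is the sum of X times Y divided by the sum of X squared, so the fitted line is forced through (0, 0).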
Once the data are entered and the choices are made click OK and the results will be sent to a separate
new worksheet by default. The output from Excel is presented in a way typical of other regression package
programs. The first block of information gives the overall statistics of the regression: Multiple R, R Squared,
and the R squared adjusted for degrees of freedom, which is the one you want to report. You also get the
Standard error (of the estimate) and the number of observations in the regression.
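The adjustment for degrees of freedom can be sketched in a few lines of Python. This uses the standard formula for adjusted R squared with n observations and k independent variables; the input values below are hypothetical and are not taken from the Excel output in this section.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjust R squared for degrees of freedom.

    n is the number of observations and k is the number of X variables.
    """
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Hypothetical values: R squared of 0.75 from 33 observations and 3 X variables.
print(round(adjusted_r_squared(0.75, 33, 3), 4))
```

The adjusted value is never larger than the unadjusted R squared, and the gap widens as more X variables are added to a small sample.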
The second block of information is titled ANOVA which stands for Analysis of Variance. Our interest
in this section is the column marked F. This is the calculated F statistic for the null hypothesis that all of
the coefficients are equal to zero versus the alternative that at least one of the coefficients is not equal to
zero. This hypothesis test was presented in 13.4 under How Good is the Equation? The next column gives
the p value for this test under the title Significance F. If the p value is less than say 0.05 (the calculated
F statistic is in the tail) we can say with 95% confidence that we cannot accept the null hypothesis that
all the coefficients are equal to zero. This is a good thing: it means that at least one of the coefficients is
significantly different from zero and thus has an effect on the value of Y.
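As a rough illustration of where the number in the F column comes from, the sketch below computes the F statistic from the regression and residual sums of squares in an ANOVA table. The sums of squares are invented for the example, and the function name is hypothetical.

```python
def f_statistic(ss_regression, ss_residual, n, k):
    """F statistic for the null hypothesis that all slope coefficients are zero.

    n is the number of observations and k is the number of X variables.
    """
    ms_regression = ss_regression / k          # mean square, regression
    ms_residual = ss_residual / (n - k - 1)    # mean square, residual
    return ms_regression / ms_residual


# Hypothetical ANOVA values for 33 observations and 3 X variables.
print(f_statistic(ss_regression=300.0, ss_residual=290.0, n=33, k=3))
```

Each sum of squares is divided by its degrees of freedom before the ratio is taken, which is why both n and k are needed.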
The last block of information contains the hypothesis tests for the individual coefficients. The estimated
coefficients, the intercept and the slopes, are first listed and then each standard error (of the estimated
coefficient) followed by the t stat (calculated student's t statistic for the null hypothesis that the coefficient
is equal to zero). We compare the t stat and the critical value of the student's t, dependent on the degrees
of freedom, and determine if we have enough evidence to reject the null that the variable has no effect on Y.
Remember that we have set up the null hypothesis as the status quo and our claim that we know what caused
the Y to change is in the alternative hypothesis. We want to reject the status quo and substitute our version
of the world, the alternative hypothesis. The next column contains the p values for this hypothesis test
followed by the estimated upper and lower bound of the confidence interval of the estimated slope parameter
for various levels of confidence set by us at the beginning.

4.8.1 Estimating the Demand for Roses


Here is an example of using the Excel program to run a regression for a specific case: estimating
the demand for roses. We are trying to estimate a demand curve, and economic theory tells us which
variables we expect to affect how much of a good we buy. The relationship between the price of a good and the
quantity demanded is the demand curve. Beyond that we have the demand function that includes other
relevant variables: a person's income, the price of substitute goods, and perhaps other variables such as
season of the year or the price of complementary goods. Quantity demanded will be our Y variable, and
Price of roses, Price of carnations and Income will be our independent variables, the X variables.
For all of these variables theory tells us the expected relationship. For the price of the good in question,
roses, theory predicts an inverse relationship, the negatively sloped demand curve. Theory also predicts the
relationship between the quantity demanded of one good, here roses, and the price of a substitute, carnations
in this example. Theory predicts that this should be a positive or direct relationship; as the price of the
substitute falls we substitute away from roses to the cheaper substitute, carnations. A reduction in the
price of the substitute generates a reduction in demand for the good being analyzed, roses here. A reduction
that generates a reduction is a positive relationship. For normal goods, theory also predicts a positive
relationship with income; as our incomes rise we buy more of the good, roses. We expect these results because
that is what is predicted by a hundred years of economic theory and research. Essentially we are testing these
century-old hypotheses. The data gathered was determined by the model that is being tested. This should
always be the case. One is not doing inferential statistics by throwing a mountain of data into a computer
and asking the machine for a theory. Theory first, test follows.
These data here are national average prices and income is the nation's per capita personal income.
Quantity demanded is total national annual sales of roses. These are annual time series data; we are
tracking the rose market for the United States from 1984-2017, 33 observations.
Because of the quirky way Excel requires the data to be entered into the regression package it is best
to have the independent variables, price of roses, price of carnations and income next to each other on the
spreadsheet. Once your data are entered into the spreadsheet it is always good to look at the data. Examine
the range, the means and the standard deviations. Use your understanding of descriptive statistics from the
very first part of this course. In large data sets you will not be able to scan the data. The Analysis ToolPak
makes it easy to get the range, mean, standard deviations and other parameters of the distributions. You
can also quickly get the correlations among the variables. Examine for outliers. Review the history. Did
something happen? Was there a labor strike, change in import fees, something that makes these observations
unusual? Do not take the data without question. There may have been a typo somewhere; who knows
without review.
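The same first look at the data can be sketched outside Excel. The Python example below computes the range, mean and standard deviation of a column and the correlation between two columns, using only the standard library. The price and quantity figures are invented for illustration and are not the rose data used in this section.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for one column of data."""
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def pearson_r(x, y):
    """Sample correlation coefficient between two equally long columns."""
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical columns: price of roses and quantity sold.
price_roses = [2.10, 2.25, 2.40, 2.60, 2.85]
quantity = [11000, 10600, 10300, 9800, 9400]

print(summarize(price_roses))
print(round(pearson_r(price_roses, quantity), 3))
```

An unexpectedly large range or a correlation with the wrong sign is a prompt to go back and check the observations before running the regression.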
Go to the regression window, enter the data and select 95% confidence level and click OK. You can
include the labels in the input range if you have put a title at the top of each column, but be sure to click
the labels box on the main regression page if you do.
The regression output should show up automatically on a new worksheet.

Figure 4.23

The first result presented is the R-Square, a measure of the strength of the correlation between Y and
X1, X2, and X3 taken as a group. Our R-square here of 0.699, adjusted for degrees of freedom, means that
70% of the variation in Y, demand for roses, can be explained by variations in X1, X2, and X3, Price of
roses, Price of carnations and Income. There is no statistical test to determine the significance of an R².
Of course a higher R² is preferred, but it is really the significance of the coefficients that will determine the
value of the theory being tested and which will become part of any policy discussion if they are demonstrated
to be significantly different from zero.
Looking at the third panel of output we can write the equation as:

$$Y = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + e$$

where b0 is the intercept, b1 is the estimated coefficient on price of roses, b2 is the estimated coefficient
on price of carnations, b3 is the estimated effect of income and e is the error term. The equation is written
in Roman letters indicating that these are the estimated values and not the population parameters, β's.
Our estimated equation is:

$$\text{Quantity of roses sold} = 183{,}475 - 1.76\,(\text{Price of roses}) + 1.33\,(\text{Price of carnations}) + 3.03\,(\text{Income})$$
We first observe that the signs of the coefficients are as expected from theory. The demand curve is downward
sloping with the negative sign for the price of roses. Further the signs of both the price of carnations and
income coefficients are positive as would be expected from economic theory.


Interpreting the coefficients can tell us the magnitude of the impact of a change in each variable on the
demand for roses. It is the ability to do this which makes regression analysis such a valuable tool. The
estimated coefficients tell us that an increase in the price of roses by one dollar will lead to a 1.76 unit
reduction in the number of roses purchased. The price of carnations seems to play an important role in the
demand for roses as we see that increasing the price of carnations by one dollar would increase the demand
for roses by 1.33 units as consumers would substitute away from the now more expensive carnations. Similarly,
increasing per capita income by one dollar will lead to a 3.03 unit increase in roses purchased.
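One way to check this reading of the coefficients is to plug values into the estimated equation and change one variable at a time. The Python sketch below uses the coefficients reported above; the prices and income passed in are hypothetical, and the function name is made up for the example.

```python
def predicted_quantity(price_roses, price_carnations, income):
    """Estimated demand for roses from the fitted equation in this section."""
    return (183475
            - 1.76 * price_roses
            + 1.33 * price_carnations
            + 3.03 * income)


# Hypothetical starting point: rose price, carnation price, per capita income.
base = predicted_quantity(3.00, 2.00, 30000)

# Raise only the price of roses by one dollar, holding the others constant.
higher_price = predicted_quantity(4.00, 2.00, 30000)

print(round(higher_price - base, 2))  # -1.76
```

Because the other two variables are held constant, the change in predicted quantity is exactly the coefficient on the variable that moved.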
These results are in line with the predictions of economic theory with respect to all three variables
included in this estimate of the demand for roses. It is important to have a theory first that predicts the
significance or at least the direction of the coefficients. Without a theory to test, this research tool is not
much more helpful than the correlation coefficients we learned about earlier.
We cannot stop there, however. We first need to check whether our coefficients are statistically
significantly different from zero. We set up a hypothesis of:

$$H_0: \beta_1 = 0$$

$$H_a: \beta_1 \neq 0$$
for all three coefficients in the regression. Recall from earlier that we will not be able to definitively say that
our estimated b1 is the actual population value of β1, but rather only that, with a (1-α) level of confidence,
we can or cannot reject the null hypothesis that β1 is equal to zero. The
analyst is making a claim that the price of roses causes an impact on quantity demanded. Indeed, that
each of the included variables has an impact on the quantity of roses demanded. The claim is therefore
in the alternative hypotheses. It will take a very large probability, 0.95 in this case, to overthrow the null
hypothesis, the status quo, that β = 0. In all regression hypothesis tests the claim is in the alternative and
the claim is that the theory has found a variable that has a significant impact on the Y variable.
The test statistic for this hypothesis follows the familiar standardizing formula which counts the number
of standard deviations, t, that the estimated value of the parameter, b1, is away from the hypothesized value,
β0, which is zero in this case:

$$t_c = \frac{b_1 - \beta_0}{S_{b_1}}$$
The computer calculates this test statistic and presents it as t stat. You can find this value to the right
of the standard error of the coefficient estimate. The standard error of the coefficient for b1 is Sb1 in the
formula. To reach a conclusion we compare this test statistic with the critical value of the student's t at
degrees of freedom n-3-1 = 29, and alpha = 0.025 (5% significance level for a two-tailed test). Our t stat for
b1 is approximately 5.90 in absolute value, which is greater than 2.045 (the critical value for 29 degrees of
freedom in the t-table), so we cannot accept our null hypothesis of no effect. We conclude that Price has a
significant effect because the calculated t value is in the tail. We conduct the same test for b2 and b3. For
each variable, we find that we cannot accept the null hypothesis of no relationship because the calculated
t-statistics are in the tail for each case, that is, greater in absolute value than the critical value. All
variables in this regression have been determined to have a significant effect on the demand for roses.
These tests tell us whether or not an individual coefficient is significantly different from zero, but do not
address the overall quality of the model. We have seen that the R squared adjusted for degrees of freedom
indicates this model with these three variables explains 70% of the variation in quantity of roses demanded.
We can also conduct a second test of the model taken as a whole. This is the F test presented in section
13.4 of this chapter. Because this is a multiple regression (more than one X), we use the F-test to determine
if our coefficients collectively affect Y. The hypothesis is:

$$H_0: \beta_1 = \beta_2 = \dots = \beta_i = 0$$

$$H_a: \text{at least one of the } \beta_i \text{ is not equal to } 0$$


Under the ANOVA section of the output we find the calculated F statistic for this hypothesis. For this
example the F statistic is 21.9. Again, comparing the calculated F statistic with the critical value given our
desired level of significance and the degrees of freedom will allow us to reach a conclusion.
The best way to reach a conclusion for this statistical test is to use the p-value comparison rule. The
p-value is the area in the tail, given the calculated F statistic. In essence the computer is finding the F
value in the table for us and calculating the p-value. In the Summary Output under Significance F is this
probability. For this example, it is calculated to be 2.6 × 10^-5, or 2.6 with the decimal point moved five
places to the left (0.000026). This is an almost infinitesimal level of probability and is certainly less than
our alpha level of .05 for a 5 percent level of significance.
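The p-value comparison rule is simple enough to write down as a short Python sketch. The Significance F value is the one reported above; the function name and the default alpha of 0.05 are choices made for this illustration.

```python
def reject_null(p_value, alpha=0.05):
    """p-value comparison rule: reject the null when p is smaller than alpha."""
    return p_value < alpha


# Significance F reported for the rose regression, written in scientific notation.
significance_f = 2.6e-5

print(significance_f == 0.000026)   # the two notations are the same number
print(reject_null(significance_f))  # compare against alpha = 0.05
```

The same function applies to the p values reported for the individual coefficients in the last block of the output.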
By not being able to accept the null hypothesis we conclude that this specification of this model has
validity because at least one of the estimated coefficients is significantly different from zero. Since F-calculated
is greater than F-critical, we cannot accept H0, meaning that X1, X2 and X3 together have a significant effect
on Y.
The development of computing machinery and the software useful for academic and business research has
made it possible to answer questions that just a few years ago we could not even formulate. Data is available
in electronic format and can be moved into place for analysis in ways and at speeds that were unimaginable
a decade ago. The sheer magnitude of data sets that can today be used for research and analysis gives us a
higher quality of results than in days past. Even with only an Excel spreadsheet we can conduct very high
level research. This section gives you the tools to conduct some of this very interesting research with the
only limit being your imagination.


4.8.2
Exercise 4.8.1 (Solution on p. 312.)
A computer program for multiple regression has been used to fit $\hat{y}_j = b_0 + b_1 X_{1j} + b_2 X_{2j} + b_3 X_{3j}$.
Part of the computer output includes:

i bi Sbi
0 8 1.6
1 2.2 .24
2 -.72 .32
3 0.005 0.002

Table 4.3

a. Calculation of confidence interval for b2 consists of _______ ± (a student's t value) (_______)
b. The confidence level for this interval is reflected in the value used for _______.
c. The degrees of freedom available for estimating the variance are directly concerned with the
value used for _______

Exercise 4.8.2 (Solution on p. 312.)


An investigator has used a multiple regression program on 20 data points to obtain a regression
equation with 3 variables. Part of the computer output is:

Variable Coefficient Standard Error of bi


1 0.45 0.21
2 0.80 0.10
3 3.10 0.86

Table 4.4

a. 0.80 is an estimate of ___________.


b. 0.10 is an estimate of ___________.
c. Assuming the responses satisfy the normality assumption, we can be 95% confident that the
value of β2 is in the interval, _______ ± [t.025 · _______], where t.025 is the critical
value of the student's t distribution with ____ degrees of freedom.


Solutions to Exercises in Chapter 4


Solution to Exercise 4.2.1 (p. 262)
d
Solution to Exercise 4.2.2 (p. 262)
A measure of the degree to which variation of one variable is related to variation in one or more other
variables. The most commonly used correlation coefficient indicates the degree to which variation in one
variable is described by a straight line relation with another variable.
Suppose that sample information is available on family income and years of schooling of the head of
the household. A correlation coefficient = 0 would indicate no linear association at all between these two
variables. A correlation of 1 would indicate perfect linear association (where all variation in family income
could be associated with schooling and vice versa).
Solution to Exercise 4.2.3 (p. 262)
a. 81% of the variation in the money spent for repairs is explained by the age of the auto
Solution to Exercise 4.2.4 (p. 262)
b. 16
Solution to Exercise 4.2.5 (p. 262)
The coefficient of determination is r² with 0 ≤ r² ≤ 1, since -1 ≤ r ≤ 1.
Solution to Exercise 4.2.6 (p. 262)
True
Solution to Exercise 4.2.7 (p. 262)
d. on a scale from -1 to +1, the degree of linear relationship between the two variables is +.10
Solution to Exercise 4.2.8 (p. 262)
d. there exists no linear relationship between X and Y
Solution to Exercise 4.2.9 (p. 263)
Approximately 0.9
Solution to Exercise 4.2.10 (p. 263)
d. neither of the above changes will affect r.
Solution to Exercise 4.3.1 (p. 265)
Definition:
A t test is obtained by dividing a regression coefficient by its standard error and then comparing the
result to critical values for Student's t with Error df. It provides a test of the claim that βi = 0 when all
other variables have been included in the relevant regression model.
Example:
Suppose that 4 variables are suspected of influencing some response. Suppose that the results of fitting
Yi = β0 + β1 X1i + β2 X2i + β3 X3i + β4 X4i + ei include:

Variable   Regression coefficient   Standard error of regression coefficient
1          -3                       .5
2          +2                       .4
3          +1                       .02
4          -.5                      .6

Table 4.5

t calculated for variables 1, 2, and 3 would be 5 or larger in absolute value while that for variable 4 would
be less than 1. For most significance levels, the hypothesis β1 = 0 would be rejected. But, notice that this
is for the case when X2, X3, and X4 have been included in the regression. For most significance levels, the
hypothesis β4 = 0 would be continued (retained) for the case where X1, X2, and X3 are in the regression.


Often this pattern of results will result in computing another regression involving only X1 , X2 , X3 , and
examination of the t ratios produced for that case.
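The t ratios described in this solution can be checked with a short Python sketch. It takes the regression coefficients for variables 1 to 4 as -3, +2, +1 and -0.5 and their standard errors as 0.5, 0.4, 0.02 and 0.6, which is the reading of Table 4.5 that matches the statement about the t values; the variable and function names are hypothetical.

```python
def t_ratio(coefficient, standard_error, hypothesized=0.0):
    """Number of standard errors between the estimate and the hypothesized value."""
    return (coefficient - hypothesized) / standard_error


# Coefficients and standard errors for variables 1 to 4 (assumed reading of Table 4.5).
coefficients = {1: -3.0, 2: 2.0, 3: 1.0, 4: -0.5}
standard_errors = {1: 0.5, 2: 0.4, 3: 0.02, 4: 0.6}

t_ratios = {
    variable: t_ratio(coefficients[variable], standard_errors[variable])
    for variable in coefficients
}

for variable, t in t_ratios.items():
    print(variable, round(t, 2))
```

Variables 1, 2 and 3 give t ratios of 5 or more in absolute value, while variable 4 gives a ratio below 1 in absolute value, which is why it is the candidate to drop.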
Solution to Exercise 4.3.2 (p. 265)
c. those who score low on one test tend to score low on the other.
Solution to Exercise 4.4.1 (p. 266)
No, the graph is not a straight line; therefore, it is not a linear equation.
Solution to Exercise 4.4.2 (p. 268)
False. Since H0 : β = −1 would not be rejected at α = 0.05, it would not be rejected at α = 0.01.
Solution to Exercise 4.4.3 (p. 268)
True
Solution to Exercise 4.4.4 (p. 268)
d
Solution to Exercise 4.4.5 (p. 268)
Some variables seem to be related, so that knowing one variable's status allows us to predict the status of
the other. This relationship can be measured and is called correlation. However, a high correlation between
two variables in no way proves that a cause-and-effect relation exists between them. It is entirely possible
that a third factor causes both variables to vary together.
Solution to Exercise 4.4.6 (p. 268)
True
Solution to Exercise 4.5.1 (p. 288)
Yj = b0 + b1 · X1 + b2 · X2 + b3 · X3 + b4 · X4 + b5 · X6 + ej
Solution to Exercise 4.5.2 (p. 288)
d. there is a perfect negative relationship between Y and X in the sample.
Solution to Exercise 4.5.3 (p. 288)
b. low
Solution to Exercise 4.6.1 (p. 293)
The precision of the estimate of the Y variable depends on the range of the independent (X) variable
explored. If we explore a very small range of the X variable, we won't be able to make much use of the
regression. Also, extrapolation is not recommended.
Solution to Exercise 4.6.2 (p. 293)
ŷ = −3.6 + (3.1 · 7) = 18.1
Solution to Exercise 4.6.3 (p. 293)
Most simply, since −5 is included in the confidence interval for the slope, we can conclude that the evidence
is consistent with the claim at the 95% confidence level.
Using a t test:
H0 : B1 = −5
HA : B1 ≠ −5
$$t_{calculated} = \frac{-5 - (-4)}{1} = -1$$
tcritical = ±1.96
Since |tcalc| < |tcrit| we retain the null hypothesis that B1 = −5.
Solution to Example 4.6, Problem 2 (p. 297)
b. The x values in the data are between 65 and 75. Ninety is outside of the domain of the observed x values
in the data (independent variable), so you cannot reliably predict the final exam score for this student.
(Even though it is possible to enter 90 into the equation for x and calculate a corresponding y value, the y
value that you get will have a confidence interval that may not be meaningful.)

To understand how unreliable the prediction can be outside of the x values observed
in the data, make the substitution x = 90 into the equation.

ŷ = −173.51 + 4.83 (90) = 261.19


The final-exam score is predicted to be 261.19. The largest the final-exam score can be is 200.
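The check in this solution can be sketched as a small Python function that returns the prediction together with a flag saying whether x lies inside the observed range. The equation and the range of 65 to 75 come from the solution above; the function name and the idea of returning a flag are assumptions made for the illustration.

```python
def predict_final_exam(x, x_min=65, x_max=75):
    """Predict the final-exam score and report whether x is inside the observed range."""
    y_hat = -173.51 + 4.83 * x
    within_observed_range = x_min <= x <= x_max
    return y_hat, within_observed_range


score, reliable = predict_final_exam(90)
print(round(score, 2), reliable)

MAX_SCORE = 200  # largest possible final-exam score, as stated above
print(score > MAX_SCORE)
```

A prediction flagged as outside the observed range should be treated as an extrapolation, however plausible the number looks.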

Solution to Exercise 4.7.1 (p. 298)


True.
tcritical (df = 23, two-tailed, α = .02) = ± 2.5
tcritical (df = 23, two-tailed, α = .01) = ± 2.8
Solution to Exercise 4.7.2 (p. 298)

a. 80 + 1.5 · 4 = 86
b. No. Most business statisticians would not want to extrapolate that far. If someone did, the estimate
would be 110, but some other factors probably come into play with 20 years.

Solution to Exercise 4.7.3 (p. 298)


d. one quarter
Solution to Exercise 4.7.4 (p. 298)
b. r = −.77
Solution to Exercise 4.8.1 (p. 309)

a. −.72, .32
b. the t value
c. the t value

Solution to Exercise 4.8.2 (p. 309)

a. The population value for β2 , the change that occurs in Y with a unit change in X2 , when the other
variables are held constant.
b. The population value for the standard error of the distribution of estimates of β2 .
c. .8, .1, 16 = 20 − 4.
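For part c, the interval itself can be worked out with a few lines of Python. The estimate 0.80, the standard error 0.10 and the 16 degrees of freedom come from the exercise; the critical value 2.120 is the tabled student's t value for 16 degrees of freedom with 0.025 in each tail, and the function name is hypothetical.

```python
def confidence_interval(estimate, standard_error, t_critical):
    """Return the (lower, upper) bounds of estimate ± t_critical · standard_error."""
    margin = t_critical * standard_error
    return estimate - margin, estimate + margin


T_CRITICAL_DF16 = 2.120  # from a t table: 16 degrees of freedom, 0.025 per tail

lower, upper = confidence_interval(0.80, 0.10, T_CRITICAL_DF16)
print(round(lower, 3), round(upper, 3))
```

Zero lies well outside this interval, which agrees with the large t ratio of 0.80 / 0.10 = 8 for the second variable.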


Glossary

A a is the symbol for the Y-Intercept


Sometimes written as b0, because when writing the theoretical linear model β0 is used to
represent a coefficient for a population.
Average
also called mean or arithmetic mean; a number that describes the central tendency of the data

B b is the symbol for Slope


The word coefficient will be used regularly for the slope, because it is a number that will always
be next to the letter x. It will be written as b1 when a sample is used, and β1 will be used with
a population or when writing the theoretical linear model.
Binomial Distribution
a discrete random variable (RV) that arises from Bernoulli trials. There are a fixed number, n, of
independent trials. Independent means that the result of any trial (for example, trial 1) does
not affect the results of the following trials, and all trials are conducted under the same
conditions. Under these circumstances the binomial RV X is defined as the number of successes
in n trials. The notation is: X ∼ B(n, p). The mean is $\mu = np$ and the standard deviation is
$\sigma = \sqrt{npq}$. The probability of exactly x successes in n trials is
$P(X = x) = \binom{n}{x} p^x q^{n-x}$.
Bivariate
two variables are present in the model where one is the cause or independent variable and the
other is the effect or dependent variable.
Blinding
not telling participants which treatment a subject is receiving
Box plot
a graph that gives a quick picture of the middle 50% of the data

C Categorical Variable
variables that take on values that are names or labels
Cluster Sampling
a method for selecting a random sample and dividing the population into groups (clusters); use
simple random sampling to select a set of clusters. Every individual in the chosen clusters is
included in the sample.
Conditional Probability
the likelihood that an event will occur given that another event has already occurred
Contingency Table
the method of displaying a frequency distribution as a table with rows and columns to show how
two variables may be dependent (contingent) upon each other; the table provides an easy way to
calculate conditional probabilities.


Continuous Random Variable


a random variable (RV) whose outcomes are measured; the height of trees in the forest is a
continuous RV.
Control Group
a group in a randomized experiment that receives an inactive treatment but is otherwise managed
exactly as the other groups
Convenience Sampling
a nonrandom method of selecting a sample; this method selects individuals that are easily
accessible and may result in biased data.
Critical Value
The t or Z value set by the researcher that measures the probability of a Type I error, α.

D Data
a set of observations (a set of possible outcomes); most data used in statistical research can be
put into two groups: categorical (an attribute whose value is a label) or quantitative (an
attribute whose value is indicated by a number). Categorical data can be separated into two
subgroups: nominal and ordinal. Data is nominal if it cannot be meaningfully ordered. Data
is ordinal if the data can be meaningfully ordered. Quantitative data can be separated into two
subgroups: discrete and continuous. Data is discrete if it is the result of counting (such as the
number of students of a given ethnic group in a class or the number of books on a shelf). Data
is continuous if it is the result of measuring (such as distance traveled or weight of luggage)
Dependent Events
If two events are NOT independent, then we say that they are dependent.
Discrete Random Variable
a random variable (RV) whose outcomes are counted
Double-blinding
the act of blinding both the subjects of an experiment and the researchers who work with the
subjects

E Equally Likely
Each outcome of an experiment has the same probability.
Event
a subset of the set of all outcomes of an experiment; the set of all outcomes of an experiment is
called a sample space and is usually denoted by S. An event is an arbitrary subset in S. It can
contain one outcome, two outcomes, no outcomes (empty subset), the entire sample space, and
the like. Standard notations for events are capital letters such as A, B, C, and so on.
Experiment
a planned activity carried out under controlled conditions
Experimental Unit
any individual or object to be measured
Explanatory Variable
the independent variable in an experiment; the value controlled by researchers

F First Quartile
the value that is the median of the lower half of the ordered data set
Frequency Polygon
looks like a line graph but uses intervals to display ranges of large amounts of data
Frequency Table
a data representation in which grouped data is displayed along with the corresponding
frequencies
Frequency
the number of times a value of the data occurs

H Histogram
a graphical representation in x-y form of the distribution of data in a data set; x represents the
data and y represents the frequency, or relative frequency. The graph consists of contiguous
rectangles.
Hypothesis
a statement about the value of a population parameter; in the case of two hypotheses, the statement
assumed to be true is called the null hypothesis (notation H0) and the contradictory statement
is called the alternative hypothesis (notation Ha).

I Independent Events
The occurrence of one event has no effect on the probability of the occurrence of another event.
Events A and B are independent if one of the following is true:
• P(A|B) = P(A)
• P(B|A) = P(B)
• P(A ∩ B) = P(A)P(B)
Informed Consent
Any human subject in a research study must be cognizant of any risks or costs associated with
the study. The subject has the right to know the nature of the treatments included in the study,
their potential risks, and their potential benefits. Consent must be given freely by an informed,
fit participant.
Institutional Review Board
a committee tasked with oversight of research programs that involve human subjects
Interval
also called a class interval; an interval represents a range of data and is used when displaying
large data sets

L Linear
a model that takes data and regresses it into a straight line equation.
Lurking Variable
a variable that has an effect on a study even though it is neither an explanatory variable nor a
response variable

M Mean
a number that measures the central tendency of the data; a common name for mean is 'average.'
The term 'mean' is a shortened form of 'arithmetic mean.' By definition, the mean for a sample
(denoted by $\bar{x}$) is $\bar{x} = \frac{\text{Sum of all values in the sample}}{\text{Number of values in the sample}}$,
and the mean for a population (denoted by µ) is
$\mu = \frac{\text{Sum of all values in the population}}{\text{Number of values in the population}}$.
Median
a number that separates ordered data into halves; half the values are the same number or smaller
than the median and half the values are the same number or larger than the median. The
median may or may not be part of the data.
Midpoint
the mean of an interval in a frequency table
Mode
the value that appears most frequently in a set of data
Multivariate
a system or model where more than one independent variable is being used to predict an
outcome. There can only ever be one dependent variable, but there is no limit to the number of
independent variables.
Mutually Exclusive
Two events are mutually exclusive if the probability that they both happen at the same time is
zero. If events A and B are mutually exclusive, then P(A ∩ B) = 0.

N Nonsampling Error
an issue that affects the reliability of sampling data other than natural variation; it includes a
variety of human errors including poor study design, biased sampling methods, inaccurate
information provided by study participants, data entry errors, and poor analysis.
Normal Distribution
a continuous random variable (RV) with pdf $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, where $\mu$ is the mean of the
distribution and $\sigma$ is the standard deviation; notation: $X \sim N(\mu, \sigma)$. If $\mu = 0$ and $\sigma = 1$, the
RV, Z, is called the standard normal distribution.
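
The density formula in the Normal Distribution entry can be evaluated directly. The sketch below is one possible way to do so; the function name `normal_pdf` and the sample inputs are illustrative assumptions, not part of the text.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma) at x, written term by term from the pdf formula."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

# Height of the standard normal curve at its mean (mu = 0, sigma = 1).
peak_standard = normal_pdf(0.0)

# The curve is symmetric about the mean: equal distances give equal heights.
left = normal_pdf(4.0, mu=5.0, sigma=2.0)
right = normal_pdf(6.0, mu=5.0, sigma=2.0)

print(peak_standard, left, right)
```

Note that the second argument of N(µ, σ) is treated as the standard deviation, following the notation used in this glossary.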
Numerical Variable
variables that take on values that are indicated by numbers

O Outcome
a particular result of an experiment

P Paired Data Set
two data sets that have a one-to-one relationship so that:
• both data sets are the same size, and
• each data point in one data set is matched with exactly one point from the other set.
Parameter
a number that is used to represent a population characteristic and that generally cannot be
determined easily
Placebo
an inactive treatment that has no real effect on the explanatory variable
Population
all individuals, objects, or measurements whose properties are being studied
Probability
a number between zero and one, inclusive, that gives the likelihood that a specific event will
occur; the foundation of statistics is given by the following 3 axioms (by A.N. Kolmogorov,
1930s): Let S denote the sample space and let A and B be two events in S. Then:
• 0 ≤ P(A) ≤ 1
• If A and B are any two mutually exclusive events, then P(A OR B) = P(A) + P(B).
• P(S) = 1
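
The three axioms above can be checked on a small example. The sketch below assumes a hypothetical experiment (one roll of a fair six-sided die with equally likely outcomes); the helper `prob` and the events `A` and `B` are illustrative names only.

```python
from fractions import Fraction

# Hypothetical experiment: one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a subset of the sample space), assuming equally likely outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {1, 2}  # roll a 1 or a 2
B = {5, 6}  # roll a 5 or a 6

# A and B share no outcomes, so they are mutually exclusive.
mutually_exclusive = (A & B == set())

p_A = prob(A)
p_A_or_B = prob(A | B)
p_S = prob(sample_space)

print(mutually_exclusive, p_A, p_A_or_B, p_S)
```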
Proportion
the number of successes divided by the total number in the sample

Q Qualitative Data
See Data (footnote 9).
Quantitative Data

R R – Correlation Coefficient
A number between −1 and 1 that represents the strength and direction of the relationship
between X and Y. The value for r will equal 1 or −1 only if all the plotted points form a
perfectly straight line.
$R^2$ – Coefficient of Determination
This is a number between 0 and 1 that represents the percentage variation of the dependent
variable that can be explained by the variation in the independent variable. Sometimes
calculated by the equation $R^2 = \frac{SSR}{SST}$, where SSR is the Sum of Squares Regression and SST is
the Sum of Squares Total. The appropriate coefficient of determination to be reported should
always be adjusted for degrees of freedom first.
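
To make the two entries above concrete, the sketch below computes r and the unadjusted ratio SSR/SST for a single predictor. The five data pairs are hypothetical, and the least-squares slope and intercept formulas are the standard ones for simple linear regression, used here as an assumption about the model.

```python
import math

# Hypothetical paired data (not from the text).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
sst = sum((yi - y_bar) ** 2 for yi in y)  # Sum of Squares Total

# Correlation coefficient: strength and direction of the linear relationship.
r = s_xy / math.sqrt(s_xx * sst)

# Least-squares line and its predicted values.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * xi for xi in x]

# Sum of Squares Regression and the (unadjusted) coefficient of determination.
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
r_squared = ssr / sst

print(r, r_squared)
```

With one independent variable, the ratio SSR/SST equals the square of r. The adjustment for degrees of freedom mentioned in the entry is not shown here.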
Random Assignment
the act of organizing experimental units into treatment groups using random methods
Random Sampling
a method of selecting a sample that gives every member of the population an equal chance of
being selected.
Relative Frequency
the ratio of the number of times a value of the data occurs in the set of all outcomes to the
number of all outcomes
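
A minimal sketch of the relative frequency calculation, assuming a small hypothetical list of survey responses; `collections.Counter` supplies the frequencies.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical survey responses (not from the text).
responses = ["yes", "no", "yes", "yes", "undecided", "no", "yes", "no"]

counts = Counter(responses)  # frequency of each value
total = len(responses)       # number of all outcomes

# Relative frequency = frequency of a value / number of all outcomes.
relative_frequency = {value: Fraction(count, total) for value, count in counts.items()}

for value, rf in relative_frequency.items():
    print(value, rf)
```

The relative frequencies of all values always add up to one.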
9 http://cnx.org/content/m64281/latest/
Representative Sample
a subset of the population that has the same characteristics as the population
Residual or error
the value calculated from subtracting $y_0 - \hat{y}_0 = e_0$. The absolute value of a residual measures the
vertical distance between the actual value of y and the estimated value of y that appears on the
best-fit line.
Response Variable
the dependent variable in an experiment; the value that is measured for change at the end of
an experiment

S Sample
a subset of the population studied
Sample Space
the set of all possible outcomes of an experiment
Sampling Bias
not all members of the population are equally likely to be selected
Sampling Error
the natural variation that results from selecting a sample to represent a larger population; this
variation decreases as the sample size increases, so selecting larger samples reduces sampling
error.
Sampling with Replacement
If each member of a population is replaced after it is picked, then that member has the possibility
of being chosen more than once.
Sampling with Replacement
Once a member of the population is selected for inclusion in a sample, that member is returned
to the population for the selection of the next individual.
Sampling without Replacement
A member of the population may be chosen for inclusion in a sample only once. If chosen, the
member is not returned to the population before the next selection.
Sampling without Replacement
When sampling is done without replacement, each member of a population may be chosen only
once.
Simple Random Sampling
a straightforward method for selecting a random sample; give each member of the population a
number. Use a random number generator to select a set of labels. These randomly selected
labels identify the members of your sample.
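
The sampling entries above can be sketched with Python's standard `random` module. The population of 20 numbered members, the sample size of 5, and the seed are all hypothetical choices made for illustration.

```python
import random

# Hypothetical population: members labelled 1 through 20.
population = list(range(1, 21))
rng = random.Random(2017)  # fixed seed so the sketch is repeatable

# Sampling with replacement: each pick is returned to the population,
# so the same member can appear more than once.
with_replacement = rng.choices(population, k=5)

# Sampling without replacement: a member can be chosen only once.
# Drawing labels this way also gives a simple random sample of size 5.
without_replacement = rng.sample(population, k=5)

print(with_replacement)
print(without_replacement)
```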
Skewed
used to describe data that is not symmetrical; when the right side of a graph looks chopped off
compared to the left side, we say it is skewed to the left. When the left side of the graph looks
chopped off compared to the right side, we say the data is skewed to the right. Alternatively:
when the lower values of the data are more spread out, we say the data are skewed to the left.
When the greater values are more spread out, the data are skewed to the right.
Standard Deviation
a number that is equal to the square root of the variance and measures how far data values are
from their mean; notation: s for sample standard deviation and σ for population standard
deviation.
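
As a small worked example of the two notations, the sketch below computes σ (dividing by N) and s (dividing by n − 1) for the same hypothetical data, using the standard `statistics` module and then the definition directly.

```python
import math
import statistics

# Hypothetical data values (not from the text).
data = [2, 4, 4, 4, 5, 5, 7, 9]

population_sd = statistics.pstdev(data)  # sigma: divides by N
sample_sd = statistics.stdev(data)       # s: divides by n - 1

# The sample value again, written out as the square root of the variance.
mean = sum(data) / len(data)
sample_variance = sum((x - mean) ** 2 for x in data) / (len(data) - 1)
sample_sd_by_hand = math.sqrt(sample_variance)

print(population_sd, sample_sd, sample_sd_by_hand)
```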
Standard Normal Distribution
a continuous random variable (RV) X ∼ N (0, 1); when X follows the standard normal
distribution, it is often noted as Z ∼ N (0, 1).
Statistic
a numerical characteristic of the sample; a statistic estimates the corresponding population
parameter.
Stratified Sampling
a method for selecting a random sample used to ensure that subgroups of the population are
represented adequately; divide the population into groups (strata). Use simple random sampling
to identify a proportionate number of individuals from each stratum.
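
One way the stratified procedure could be sketched in code is shown below. The function name `stratified_sample`, the three strata, and the rounding rule for each stratum's share are assumptions made for illustration.

```python
import random

def stratified_sample(strata, sample_size, seed=0):
    """Draw a proportionate stratified sample.

    strata maps a stratum name to a list of its members.
    Rounding can make the total differ slightly from sample_size.
    """
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        share = round(sample_size * len(members) / population_size)
        # Simple random sampling (without replacement) within the stratum.
        sample[name] = rng.sample(members, share)
    return sample

# Hypothetical population split into three strata of 60, 30, and 10 members.
strata = {
    "first-year": [f"F{i}" for i in range(60)],
    "second-year": [f"S{i}" for i in range(30)],
    "third-year": [f"T{i}" for i in range(10)],
}
chosen = stratified_sample(strata, sample_size=10)
print({name: len(members) for name, members in chosen.items()})
```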
Student's t-Distribution
investigated and reported by William S. Gosset in 1908 and published under the pseudonym
Student. The major characteristics of the random variable (RV) are:
• It is continuous and can assume any real value.
• The pdf is symmetrical about its mean of zero. However, it is more spread out and flatter at
the apex than the normal distribution.
• It approaches the standard normal distribution as n gets larger.
• There is a "family" of t distributions: every representative of the family is completely
defined by the number of degrees of freedom, which is one less than the number of data items.
Sum of Squared Errors (SSE)
the calculated value from adding up all the squared residual terms. The hope is that this value is
very small when creating a model.
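
A short sketch of the SSE calculation follows. The five data pairs are hypothetical, and the fitted line ŷ = 2.2 + 0.6x is assumed as given (it is the least-squares line for these particular points).

```python
# Hypothetical observed data (not from the text).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def predict(xi):
    """Assumed fitted line: y_hat = 2.2 + 0.6 * x."""
    return 2.2 + 0.6 * xi

# Residual: actual value minus the value estimated by the line.
residuals = [yi - predict(xi) for xi, yi in zip(x, y)]

# Sum of Squared Errors: add up all the squared residuals.
sse = sum(e ** 2 for e in residuals)

print(residuals, sse)
```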
Systematic Sampling
a method for selecting a random sample; list the members of the population. Use simple random
sampling to select a starting point in the population. Let k = (number of individuals in the
population)/(number of individuals needed in the sample). Choose every kth individual in the
list starting with the one that was randomly selected. If necessary, return to the beginning of
the population list to complete your sample.
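
The systematic sampling steps above might be sketched as follows. The function name, the population of 100 numbered members, and the choice to round k down to a whole number are assumptions for illustration.

```python
import random

def systematic_sample(population, n, seed=0):
    """Select n members by taking every kth individual from a random start."""
    rng = random.Random(seed)
    size = len(population)
    k = size // n                # assumption: N / n rounded down to a whole step
    start = rng.randrange(size)  # randomly selected starting point
    # The modulo wraps back to the beginning of the list when the end is reached.
    return [population[(start + i * k) % size] for i in range(n)]

# Hypothetical population list of 100 numbered members.
members = list(range(1, 101))
sample = systematic_sample(members, n=10)
print(sample)
```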

T Test Statistic
The formula that counts the number of standard deviations, on the relevant distribution, that the
estimated parameter is away from the hypothesized value.
The Complement Event
The complement of event A consists of all outcomes that are NOT in A.
The Conditional Probability of A GIVEN B
P(A|B) is the probability that event A will occur given that the event B has already occurred.
The Intersection: the AND Event
An outcome is in the event A AND B if the outcome is in both A AND B at the same time.
The Union: the OR Event
An outcome is in the event A OR B if the outcome is in A or is in B or is in both A and B.
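
The complement, conditional, AND, and OR entries above map naturally onto set operations. The sketch below assumes a hypothetical roll of a fair six-sided die and uses the standard relationship P(A|B) = P(A AND B) / P(B), which is not stated in these entries.

```python
from fractions import Fraction

# Hypothetical experiment: one roll of a fair six-sided die.
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is greater than 3

def prob(event):
    """Probability of an event, assuming equally likely outcomes in S."""
    return Fraction(len(event), len(S))

complement_A = S - A  # all outcomes that are NOT in A
A_and_B = A & B       # outcomes in both A and B
A_or_B = A | B        # outcomes in A, in B, or in both

# Conditional probability of A GIVEN B.
p_A_given_B = prob(A_and_B) / prob(B)

print(prob(complement_A), prob(A_and_B), prob(A_or_B), p_A_given_B)
```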
Treatments
different values or components of the explanatory variable applied in an experiment
Tree Diagram
a useful visual representation of a sample space and events in the form of a tree with branches
marked by possible outcomes together with associated probabilities (frequencies, relative
frequencies)
Type I Error
The decision is to reject the null hypothesis when, in fact, the null hypothesis is true.
Type II Error
The decision is not to reject the null hypothesis when, in fact, the null hypothesis is false.

V Variable
a characteristic of interest for each person or object in a population

X X – the independent variable
This will sometimes be referred to as the predictor variable, because these values were measured
in order to determine what possible outcomes could be predicted.

Y Y – the dependent variable
Also, using the letter y represents actual values while $\hat{y}$ represents predicted or estimated
values. Predicted values will come from plugging in observed x values into a linear model.

Z z-score
the linear transformation of the form $z = \frac{x - \mu}{\sigma}$; if this transformation is
applied to any normal distribution X ∼ N(µ, σ), the result is the standard normal distribution Z
∼ N(0, 1). If this transformation is applied to any specific value x of the RV with mean µ and
standard deviation σ, the result is called the z-score of x. The z-score allows us to compare data
that are normally distributed but scaled differently. A z-score is the number of standard
deviations a particular x is away from its mean value; it is negative when x is below the mean.
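
A minimal sketch of the z-score calculation, assuming two hypothetical exams graded on different scales; the function name `z_score` and all of the numbers are illustrative.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

# Hypothetical scores on two exams with different scales.
exam_a = z_score(85, mu=70, sigma=10)     # 15 points above a mean of 70
exam_b = z_score(620, mu=500, sigma=100)  # 120 points above a mean of 500

# A value below the mean gives a negative z-score.
below_mean = z_score(60, mu=70, sigma=10)

print(exam_a, exam_b, below_mean)
```

Comparing the two z-scores, rather than the raw scores, shows which result is further above its own mean in standard deviation units.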


Attributions
Collection: MGMT 2262: Applied Business Statistics
Edited by: Brad Quiring
URL: http://cnx.org/content/col23820/1.2/
License: http://creativecommons.org/licenses/by/4.0/
Module: "Introduction  Sampling and Data  MtRoyal - Version2016RevA"
By: Lyryx Learning
URL: http://cnx.org/content/m62095/1.2/
Pages: 1-2
Copyright: Lyryx Learning
License: http://creativecommons.org/licenses/by/4.0/
Based on: Introduction
By: OpenStax, OpenStax Business Statistics
URL: http://cnx.org/content/m54035/1.2/

Module: "Definitions of Statistics, Probability, and Key Terms  MRU - C Lemieux (2017)"
By: Collette Lemieux
URL: http://cnx.org/content/m64275/1.1/
Pages: 2-8
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/
Based on: Definitions of Statistics, Probability, and Key Terms  MtRoyal - Version2016RevA
By: OpenStax, OpenStax Business Statistics, Lyryx Learning
URL: http://cnx.org/content/m62097/1.2/

Module: "Data, Sampling, and Variation  MRU - C Lemieux (2017)"


By: Collette Lemieux
URL: http://cnx.org/content/m64281/1.1/
Pages: 8-21
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/
Based on: Data, Sampling, and Variation  MtRoyal - Version2016RevA
By: OpenStax, OpenStax Business Statistics, Lyryx Learning
URL: http://cnx.org/content/m62319/1.2/
Module: "Experimental Design and Ethics  MtRoyal - Version2016RevA"
By: Lyryx Learning
URL: http://cnx.org/content/m62317/1.1/
Pages: 21-24
Copyright: Lyryx Learning
License: http://creativecommons.org/licenses/by/4.0/
Based on: Experimental Design and Ethics
By: OpenStax, OpenStax Business Statistics
URL: http://cnx.org/content/m54056/1.3/


Module: "Introduction  Descriptive Statistics  MRU - C Lemieux (2017)"


By: Collette Lemieux
URL: http://cnx.org/content/m64282/1.1/
Pages: 25-27
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/
Based on: Introduction  Descriptive Statistics  MtRoyal - Version2016RevA
By: OpenStax, OpenStax Business Statistics, Lyryx Learning
URL: http://cnx.org/content/m62325/1.1/
Module: "Descriptive Statistics - Numerical Summaries of Data - MRU - C Lemieux"
By: Collette Lemieux
URL: http://cnx.org/content/m64905/1.3/
Pages: 27-46
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/
Based on: Measures of the Center of the Data  MtRoyal - Version2016RevA
By: OpenStax, OpenStax Business Statistics, Lyryx Learning
URL: http://cnx.org/content/m62320/1.2/

Module: "Descriptive Statistics - Visual Representations of Data - MRU - C Lemieux (2017)"


By: Collette Lemieux
URL: http://cnx.org/content/m64906/1.2/
Pages: 46-63
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/
Based on: Display Data  Descriptive Statistics  MtRoyal - Version2016RevA
By: OpenStax, OpenStax Business Statistics, Lyryx Learning
URL: http://cnx.org/content/m62323/1.1/

Module: "Measures of Location and Box Plots  MRU  C Lemieux (2017)"


By: Collette Lemieux
URL: http://cnx.org/content/m64907/1.1/
Pages: 63-76
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/
Based on: Box Plots  MtRoyal - Version2016RevA
By: OpenStax, OpenStax Business Statistics, Lyryx Learning
URL: http://cnx.org/content/m61599/1.2/
Module: "Introduction  Probability Topics  MtRoyal - Version2016RevA"
Used here as: "Chapter Overview"
By: Lyryx Learning
URL: http://cnx.org/content/m62328/1.1/
Pages: 89-90
Copyright: Lyryx Learning
License: http://creativecommons.org/licenses/by/4.0/
Based on: Introduction
By: OpenStax, OpenStax Business Statistics
URL: http://cnx.org/content/m54216/1.1/


Module: "Terminology  Probability Topics  MtRoyal - Version2016RevA"


Used here as: "Introduction to Probability"
By: Lyryx Learning
URL: http://cnx.org/content/m62337/1.3/
Pages: 90-101
Copyright: Lyryx Learning
License: http://creativecommons.org/licenses/by/4.0/
Based on: Terminology
By: OpenStax, OpenStax Business Statistics
URL: http://cnx.org/content/m54218/1.8/
Module: "Independent and Mutually Exclusive Events  Probaility Topics  MtRoyal - Version2016RevA"
Used here as: "Independent and Mutually Exclusive Events"
By: Lyryx Learning
URL: http://cnx.org/content/m62329/1.2/
Pages: 101-113
Copyright: Lyryx Learning
License: http://creativecommons.org/licenses/by/4.0/
Based on: Independent and Mutually Exclusive Events
By: OpenStax, OpenStax Business Statistics
URL: http://cnx.org/content/m54219/1.2/
Module: "Two Basic Rules of Probability"
By: OpenStax
URL: http://cnx.org/content/m54220/1.11/
Pages: 113-126
Copyright: Rice University
License: http://creativecommons.org/licenses/by/4.0/
Based on: Two Basic Rules of Probability
By: OpenStax
URL: http://cnx.org/content/m46947/1.4/
Module: "Contingency Tables and Tree Diagrams  Probability Topics  MtRoyal - Version2016RevA"
Used here as: "Contingency Tables and Tree Diagrams"
By: Lyryx Learning
URL: http://cnx.org/content/m62330/1.2/
Pages: 126-140
Copyright: Lyryx Learning
License: http://creativecommons.org/licenses/by/4.0/
Based on: Contingency Tables and Probability Trees
By: OpenStax, OpenStax Business Statistics
URL: http://cnx.org/content/m54259/1.7/

Module: "Introduction to Discrete Probability Distributions - MRU - C Lemieux (2017)"


Used here as: "Introduction to Discrete Probability Distributions"
By: Collette Lemieux
URL: http://cnx.org/content/m64919/1.1/
Pages: 140-141
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/


Module: "Binomial distribution - MRU - C Lemieux"


Used here as: "The Binomial Distribution"
By: Collette Lemieux
URL: http://cnx.org/content/m64918/1.4/
Pages: 141-153
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/
Module: "Introduction  The Normal Distribution  Mt Royal University  Version 2016RevA"
Used here as: "Introduction to the Normal Distribution"
By: Lyryx Learning
URL: http://cnx.org/content/m62345/1.1/
Pages: 154-155
Copyright: Lyryx Learning
License: http://creativecommons.org/licenses/by/4.0/
Based on: Introduction
By: OpenStax, OpenStax Business Statistics
URL: http://cnx.org/content/m54495/1.6/
Module: "The Standard Normal Distribution The Normal Distribution MRU - C Lemieux"
Used here as: "The Standard Normal Distribution"
By: Collette Lemieux
URL: http://cnx.org/content/m64939/1.1/
Pages: 155-161
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/
Based on: The Standard Normal Distribution The Normal Distribution  Mt Royal University  Version
2016RevA
By: OpenStax, OpenStax Business Statistics, Lyryx Learning
URL: http://cnx.org/content/m62346/1.1/
Module: "Using the Normal Distribution The Normal Distribution  MRU - C Lemieux"
Used here as: "Using the Normal Distribution"
By: Collette Lemieux
URL: http://cnx.org/content/m64940/1.2/
Pages: 161-171
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/
Based on: Using the Normal Distribution The Normal Distribution  Mt Royal University  Version
2016RevA
By: OpenStax, OpenStax Business Statistics, Lyryx Learning
URL: http://cnx.org/content/m62347/1.1/
Module: "Introduction - Sampling distributions - MRU - C Lemieux"
Used here as: "Chapter Overview"
By: Collette Lemieux
URL: http://cnx.org/content/m64933/1.1/
Page: 171
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/


Module: "Introduction to Sampling Distributions"


By: Collette Lemieux
URL: http://cnx.org/content/m64932/1.2/
Pages: 171-173
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/
Module: "Constructing empirical sampling distributions - MRU - C Lemieux"
Used here as: "Constructing Empirical Sampling Distributions"
By: Collette Lemieux
URL: http://cnx.org/content/m64934/1.2/
Pages: 173-176
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/
Module: "Central Limit Theorem - MRU - C Lemieux"
Used here as: "The Central Limit Theorem"
By: Collette Lemieux
URL: http://cnx.org/content/m64935/1.3/
Pages: 176-182
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/
Module: "A series of examples - Calculating probabilities for sampling distributions - MRU - C Lemieux"
Used here as: "Calculating Probabilities for Sampling Distributions – A Series of Examples"
By: Collette Lemieux
URL: http://cnx.org/content/m64931/1.2/
Pages: 182-185
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/
Module: "Introduction to confidence intervals - MRU - C Lemieux"
Used here as: "Introduction to Confidence Intervals"
By: Collette Lemieux
URL: http://cnx.org/content/m65024/1.1/
Page: 205
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/
Module: "What are confidence intervals? - MRU - C Lemieux"
Used here as: "What are Confidence Intervals?"
By: Collette Lemieux
URL: http://cnx.org/content/m65023/1.2/
Pages: 205-211
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/
Module: "Basic premise of constructing a confidence interval - MRU - C Lemieux"
Used here as: "The Basic Premise of Constructing a Confidence Interval"
By: Collette Lemieux
URL: http://cnx.org/content/m65026/1.1/
Page: 211
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/


Module: "Confidence interval for the mean - MRU - C Lemieux"
Used here as: "The Confidence Interval Estimate of a Population Mean"
By: Collette Lemieux
URL: http://cnx.org/content/m65027/1.1/
Pages: 211-217
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/
Module: "Confidence interval for proportion - MRU - C Lemieux"
Used here as: "The Confidence Interval Estimate of a Population Proportion"
By: Collette Lemieux
URL: http://cnx.org/content/m65028/1.1/
Pages: 217-219
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/
Module: "Introduction to one population hypothesis testing"
Used here as: "Introduction to One Population Hypothesis Testing"
By: Brad Quiring
URL: http://cnx.org/content/m65311/1.2/
Pages: 220-221
Copyright: Brad Quiring
License: http://creativecommons.org/licenses/by/4.0/
Module: "Derived copy of Distribution Needed for Hypothesis Testing - Hypothesis Testing with One Sample - MtRoyal - Version2016RevA"
Used here as: "The Distribution Needed for Hypothesis Testing"
By: Brad Quiring
URL: http://cnx.org/content/m65305/1.3/
Pages: 221-224
Copyright: Brad Quiring
License: http://creativecommons.org/licenses/by/4.0/
Based on: Distribution Needed for Hypothesis Testing - Hypothesis Testing with One Sample - MtRoyal - Version2016RevA
By: OpenStax Business Statistics, Lyryx Learning
URL: http://cnx.org/content/m62286/1.1/
Module: "Derived copy of Null and Alternative Hypotheses"
Used here as: "The Null and Alternative Hypotheses"
By: Brad Quiring
URL: http://cnx.org/content/m65304/1.2/
Pages: 224-228
Copyright: Brad Quiring
License: http://creativecommons.org/licenses/by/4.0/
Based on: Null and Alternative Hypotheses
By: OpenStax Business Statistics
URL: http://cnx.org/content/m55606/1.8/


Module: "Derived copy of Outcomes and the Type I and Type II Errors - Hypothesis Testing with One Sample - MtRoyal - Version2016RevA"
Used here as: "Errors and Choosing a Level of Significance"
By: Brad Quiring
URL: http://cnx.org/content/m65318/1.3/
Pages: 228-232
Copyright: Brad Quiring
License: http://creativecommons.org/licenses/by/4.0/
Based on: Outcomes and the Type I and Type II Errors - Hypothesis Testing with One Sample - MtRoyal - Version2016RevA
By: OpenStax Business Statistics, Lyryx Learning
URL: http://cnx.org/content/m62369/1.1/
Module: "The Eight-Step Hypothesis Test"
By: Brad Quiring
URL: http://cnx.org/content/m65319/1.5/
Pages: 232-239
Copyright: Brad Quiring
License: http://creativecommons.org/licenses/by/4.0/
Module: "Practice questions"
Used here as: "Practice Questions for Chapters 7 and 8"
By: Collette Lemieux
URL: http://cnx.org/content/m65288/1.3/
Pages: 240-249
Copyright: Collette Lemieux
License: http://creativecommons.org/licenses/by/4.0/
Module: "Introduction to Regression"
By: Brad Quiring
URL: http://cnx.org/content/m65507/1.1/
Pages: 257-258
Copyright: Brad Quiring
License: http://creativecommons.org/licenses/by/4.0/
Based on: Introduction
By: OpenStax, OpenStax Business Statistics
URL: http://cnx.org/content/m54035/1.2/
Module: "The Correlation Coefficient r"
By: OpenStax
URL: http://cnx.org/content/m55719/1.21/
Pages: 258-263
Copyright: Rice University
License: http://creativecommons.org/licenses/by/4.0/
Module: "Testing the Significance of the Correlation Coefficient"
By: OpenStax
URL: http://cnx.org/content/m55726/1.16/
Pages: 263-265
Copyright: Rice University
License: http://creativecommons.org/licenses/by/4.0/


Module: "Linear Equations"
By: OpenStax
URL: http://cnx.org/content/m55718/1.12/
Pages: 265-268
Copyright: Rice University
License: http://creativecommons.org/licenses/by/4.0/
Based on: Linear Equations
By: OpenStax
URL: http://cnx.org/content/m47100/1.2/
Module: "The Regression Equation"
By: OpenStax
URL: http://cnx.org/content/m55838/1.30/
Pages: 269-288
Copyright: Rice University
License: http://creativecommons.org/licenses/by/4.0/
Based on: The Regression Equation
By: OpenStax
URL: http://cnx.org/content/m47117/1.3/
Module: "Interpretation of Regression Coefficients: Elasticity and Logarithmic Transformation"
By: OpenStax
URL: http://cnx.org/content/m64846/1.7/
Pages: 289-293
Copyright: Rice University
License: http://creativecommons.org/licenses/by/4.0/
Module: "Predicting with a Regression Equation"
By: OpenStax
URL: http://cnx.org/content/m56649/1.17/
Pages: 293-298
Copyright: Rice University
License: http://creativecommons.org/licenses/by/4.0/

Module: "How to Use Microsoft Excel® for Regression Analysis"
By: OpenStax
URL: http://cnx.org/content/m55852/1.18/
Pages: 298-309
Copyright: Rice University
License: http://creativecommons.org/licenses/by/4.0/


MGMT 2262: Applied Business Statistics
This collection covers three modules: data and sampling; probability topics; confidence intervals and hypothesis tests for one population.

About OpenStax-CNX
Rhaptos is a web-based collaborative publishing system for educational material.
