
Nilkanta Sir Merged PDF

This document discusses business statistics and introduces R for statistical analysis. It covers key topics such as defining data and different data types, summarizing categorical data using frequency distributions and tables, and providing examples of quantitative business applications of statistics in accounting, finance, marketing, production, and economics. The objectives of the course are outlined as learning to present, analyze, interpret data, use probability concepts, make inferences from samples, and understand real-world applications of statistics.

Uploaded by

NABANIT GIRI

Business Statistics with R

An Introduction
Dr. Nilakantan Narasinganallur Ph.D.
Need for statistics

Statistics is the study of data with analytical measures to understand the structure underlying the data.

Analysis leads to understanding and to the ability to forecast the future.

Use of the word statistics in singular & plural

In the singular, it stands for the subject and its body of knowledge (BoK).
• Statistics is defined as “the art and science of collecting, analysing, presenting, and interpreting data.”
• In a business environment, information gives managers and decision makers a better understanding of the environment.

In the plural, it indicates measures derived from samples.

A statistic is a measure derived from a sample, as opposed to a parameter, which describes a population.


Real world – through the looking glass
We see different reports in the newspapers and magazines.
• The median price of a house in a metro is Rs 15 lakh compared with Rs 2 lakh in an underdeveloped rural area (Livemint, 5th Dec 2016).
• Indians spend more time on the daily office commute than people in most countries in the world, with more than two hours on the road every day, according to a report by office commute platform MoveinSy (ET, 3rd Sep 2019).
• NCAA president Myles Brand reported that college athletes are earning degrees at record rates. Latest figures show that 79% of all men and women student-athletes graduate (Associated Press, October 15, 2008).
What do these newspaper reports indicate about Statistics and its importance?
• Statistical measures are used to explain phenomena
• The raw data collected are invisible to us
• Measures can be of different types – in original units or percentages
• Anything more? – can be discussed
Learning objectives
We define the objectives of this course as follows:
• To learn to present, analyze, and interpret data.
• To use concepts of probability in business situations.
• To make inferences from samples drawn from large datasets.
Statistics and Probability are interrelated.
• Data can be collected and interpreted on a standalone basis.
• Uncertainty in the real world leads to the need for probability theory.
• We will visit probability theory in a subsequent module.
To understand the happenings in the world
• We need these measures.
• We need to learn the methods and techniques to analyse the data and interpret the results well.
Business applications
• Accounting – Audit firms use sampling procedures while conducting audits for their clients (sampling and inference).
• Finance – Financial analysts use a variety of statistical information to guide their investment recommendations (price/earnings ratios and dividend yields).
• Marketing – Electronic scanners at retail checkout counters collect data which can be used for market research (market basket analysis).
• Production – Emphasis on quality makes quality control an important application of statistics in production (process charts, control charts, six sigma).
• Economics – Economists frequently provide forecasts about the future of the economy or some aspect of it (producer price index, unemployment rate, capacity utilization).
What is data?
• Define data
• Data types & examples
• Raw data examples
• Data & information – what is different?
• Use of summaries
• Use of tables & charts
Categorical data
• Nominal category
• Labels or names or numbers representing categories
• Ordinal data
• Numbers carrying partial meanings
• Data collation in frequency distribution
• Data visualisation in bar charts, pie charts
Quantitative data
• Interval type
• Examples –
• Ratio type
• Examples -
Raw data
Let us look at some hypothetical data on purchases of soft drinks, identified in terms of the brands. The data from a sample of 100 purchases are presented in the next slide.
In order to derive meaningful information from the data, what do we do?
Summarizing categorical data
Frequency distribution
• A frequency distribution is a tabular summary of data showing the number (frequency) of items in each of several nonoverlapping classes.
• We can collate this data in a frequency distribution. How?
• The table is produced in the next slide.
• Relative Frequency
Raw Data on softdrink purchases - sample of 100
mirinda, 7 up, frooti, pepsi, fanta, coca cola, mirinda, diet pepsi, frooti, sprite
thumsup, maaza, limca, coca cola, pepsi, mirinda, diet pepsi, 7 up, 7 up, limca
mirinda, diet pepsi, thumsup, frooti, mountain dew, mountain dew, mountain dew, limca, pepsi, pepsi
pepsi, maaza, sprite, diet pepsi, mirinda, limca, thumsup, coca cola, maaza, coca cola
7 up, limca, coca cola, mountain dew, coca cola, maaza, limca, mountain dew, thumsup, fanta
pepsi, mountain dew, mountain dew, mountain dew, fanta, frooti, fanta, thumsup, mountain dew, slice
slice, maaza, sprite, maaza, thumsup, 7 up, mirinda, pepsi, fanta, fanta
coca cola, slice, thumsup, frooti, thumsup, mirinda, 7 up, 7 up, limca, frooti
7 up, mirinda, diet pepsi, frooti, mirinda, sprite, mountain dew, frooti, sprite, slice
diet pepsi, mirinda, thumsup, mountain dew, mountain dew, frooti, maaza, pepsi, sprite, frooti
Frequency distribution – softdrink purchases
soft drink      frequency
sprite          6
thumsup         9
pepsi           8
diet pepsi      6
coca cola       7
limca           7
mirinda         10
fanta           6
7 up            8
mountain dew    12
maaza           7
frooti          10
slice           4
sum             100
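The collation step can be sketched in a few lines of Python. This is only an illustration, not the method used in the course (which works in Excel and R): it tallies the 100 purchases from the raw-data slide with `collections.Counter` and also prints the relative frequency of each brand. The variable names are assumptions made for this sketch.

```python
from collections import Counter

# The 100 purchases from the raw-data slide; order does not matter for a tally.
RAW = """
mirinda, 7 up, frooti, pepsi, fanta, coca cola, mirinda, diet pepsi, frooti, sprite
thumsup, maaza, limca, coca cola, pepsi, mirinda, diet pepsi, 7 up, 7 up, limca
mirinda, diet pepsi, thumsup, frooti, mountain dew, mountain dew, mountain dew, limca, pepsi, pepsi
pepsi, maaza, sprite, diet pepsi, mirinda, limca, thumsup, coca cola, maaza, coca cola
7 up, limca, coca cola, mountain dew, coca cola, maaza, limca, mountain dew, thumsup, fanta
pepsi, mountain dew, mountain dew, mountain dew, fanta, frooti, fanta, thumsup, mountain dew, slice
slice, maaza, sprite, maaza, thumsup, 7 up, mirinda, pepsi, fanta, fanta
coca cola, slice, thumsup, frooti, thumsup, mirinda, 7 up, 7 up, limca, frooti
7 up, mirinda, diet pepsi, frooti, mirinda, sprite, mountain dew, frooti, sprite, slice
diet pepsi, mirinda, thumsup, mountain dew, mountain dew, frooti, maaza, pepsi, sprite, frooti
"""

# Split on commas and line breaks, dropping empty fragments.
purchases = [p.strip() for p in RAW.replace("\n", ",").split(",") if p.strip()]

frequency = Counter(purchases)   # brand -> number of purchases
n = len(purchases)

for brand, count in frequency.most_common():
    print(f"{brand:<14}{count:>4}{count / n:>8.2f}")
print(f"{'sum':<14}{n:>4}")
```

Dividing each count by `n` gives the relative frequency that the slide lists as the next topic.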
Government document
• subject: inviting applications for setting up of unit in SEEPZ SEZ – growth rate envisaged in 5 years (10 marks)

YEAR      EXPORT    GROWTH%
1         100
2         150       50
3         200       33.33333
4         175       -12.5
5         210       20
average             22.70833
CAGR                16.00%
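The two summary figures in the table can be checked with a short Python sketch. It assumes the usual definitions: year-on-year growth is the percentage change from the previous year, and CAGR is `(last / first) ** (1 / periods) - 1`. The number of periods is a modelling choice. Using 5 periods reproduces the 16.00% shown in the table, while counting only the four intervals between year 1 and year 5 gives a higher figure. The function and variable names are illustrative.

```python
exports = [100, 150, 200, 175, 210]   # export values for years 1 to 5

# Year-on-year growth in percent; there is no growth figure for year 1.
growth = [(curr / prev - 1) * 100 for prev, curr in zip(exports, exports[1:])]
average_growth = sum(growth) / len(growth)


def cagr(first, last, periods):
    """Compound annual growth rate in percent over the given number of periods."""
    return ((last / first) ** (1 / periods) - 1) * 100


cagr_5 = cagr(exports[0], exports[-1], periods=5)   # matches the table's figure
cagr_4 = cagr(exports[0], exports[-1], periods=4)   # four intervals between year 1 and year 5

print([round(g, 5) for g in growth])
print(round(average_growth, 5), round(cagr_5, 2), round(cagr_4, 2))
```

The arithmetic average of the growth rates overstates the compounded rate because it ignores the effect of the negative year on the base.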
Data and Data Sets
• Data are the facts and figures collected, analyzed, and summarized for presentation and interpretation.
• All the data collected in a particular study are referred to as the data set for the study.
Elements, Variables, and Observations
• Elements are the entities on which data are collected.
• A variable is a characteristic of interest for the elements.
• The set of measurements obtained for a particular element is called an observation.
• A data set with n elements contains n observations.
• The total number of data values in a complete data set is the number of elements multiplied by the number of variables.
Data, Data Sets, Elements, Variables, and Observations

Company        Stock Exchange   Annual Sales ($M)   Earnings per share ($)
Dataram        NQ               73.10               0.86
EnergySouth    N                74.00               1.67
Keystone       N                365.70              0.86
LandCare       NQ               111.40              0.33
Psychemedics   N                17.60               0.13

The company names identify the elements, each column is a variable, each row is an observation, and the whole table is the data set.
Scales of Measurement
Scales of measurement include
• Nominal
• Ordinal
• Interval
• Ratio
The scale determines the amount of information contained in the data.
The scale indicates the data summarization and statistical analyses that are most appropriate.
Scales of Measurement
Nominal scale

Data are labels or names used to identify an attribute of the element.

A nonnumeric label or numeric code may be used.

Example

Students of a university are classified by the school in which they are enrolled using a nonnumeric label such as first year, second year, third year etc. or Business, Humanities, Education, and so on.

Alternatively, a numeric code could be used for the school variable (e.g. 1 denotes Business, 2 denotes Humanities, 3 denotes Education, and so on).
Scales of Measurement
Ordinal scale

The data have the properties of nominal data, and the order or rank of the data is meaningful.

A nonnumeric label or numeric code may be used.

Example

Students of a class are classified by their class standing using a nonnumeric label such as first class, second class, or third class.

Alternatively, a numeric code could be used for the class standing variable (e.g. 1 denotes first class, 2 denotes second class, and so on).

Question

What is the fine difference between nominal and ordinal scales?
Scales of Measurement
Interval scale

The data have the properties of ordinal data, and the interval between observations is expressed in terms of a fixed unit of measure.

Interval data are always numeric.

Example

Meesha has a SNAP score of 39/60, while Kavin has a SNAP score of 37/60. Meesha thus scored 2 points more than Kavin.
Scales of Measurement
Ratio scale
• Data have all the properties of interval data and the ratio of two values is meaningful.
• Ratio data are always numerical.
• Zero value is included in the scale.
Example 1:
The price of a book at a retail store is Rs. 200, while the price of the same book sold online is Rs. 100. The ratio property shows that retail stores charge twice the online price.
Example 2:
The temperature outside is 35 degrees Celsius. It was 40 yesterday. Is this interval or ratio scale?
Categorical and Quantitative Data
• Data can be further classified as being categorical or quantitative.
• The statistical analysis that is appropriate depends on whether the data for the variable are categorical or quantitative.
• In general, there are more alternatives for statistical analysis when the data are quantitative.
Categorical Data
• Labels or names are used to identify an attribute of each element
• Often referred to as qualitative data
• Use either the nominal or ordinal scale of measurement
• Can be either numeric or nonnumeric
• Appropriate statistical analyses are rather limited
Quantitative Data
• Quantitative data indicate how many or how much.
• Quantitative data are always numeric.
• Ordinary arithmetic operations are meaningful for quantitative data.
Cross-Sectional Data

Cross-sectional data are collected at the same or approximately the same point in time.

Example

Data detailing the number of building permits issued in November 2020 in each of the districts of Maharashtra.
Time Series Data
Time series data are collected over several time periods.

Example
Data detailing the number of building permits issued in Mumbai during each of the last 36 months.

Graphs of time series data help analysts
• understand what happened in the past,
• identify any trends over time, and
• project future levels for the time series.
Time Series Data
[Figure: graph of time series data]
Data Sources
Existing Sources
• Internal company records – almost any department
• Business database services – exchange databases
• Government agencies – Census and CSO data, WHO
• Industry associations – CII, BCC, other industry associations in India
• Special-interest organizations – Graduate Management Admission Council (GMAC)
• Internet – more and more firms
Descriptive Statistics – A Preview
Most of the statistical information in newspapers, magazines, company reports, and other publications consists of data that are summarized and presented in a form that is easy to understand.

Such summaries of data, which may be tabular, graphical, or numerical, are referred to as descriptive statistics.

Example

The manager of Harsha Auto would like to have a better understanding of the cost of parts used in the engine tune-ups performed in her shop. You are asked to examine 30 customer invoices for tune-ups. The costs of parts, rounded to the nearest rupee, are listed on the next slide.
Example: Harsha Auto Repair

Sample of Parts Cost (Rs.) for 30 Tune-ups
6370 5460 6510 3990 5250 3640
4970 4830 5040 6230 4620 5250
7280 5180 4340 4760 6790 7350
5950 6790 6160 4760 5810 4760
4340 5740 6860 7070 5530 7350

• What should be your approach?
• Put the data into a frequency distribution
Steps
• Calculate the minimum and maximum.
• Provide a guess of the number of class intervals, between 6 & 10.
• Calculate the number of class intervals (CI) with a formula, Sturges' rule:
• $k = 1 + 3.322 \log_{10} N$
• In this case, k =
Harsha Auto Repair – frequency distribution
lower   upper   Frequency   % frequency
3600    4249    2           7%
4250    4899    7           23%
4900    5549    7           23%
5550    6199    4           13%
6200    6849    5           17%
6850    7499    5           17%
Total           30          100%
[Figure: histogram of Frequency against Bin, with bins 4249, 4899, 5549, 6199, 6849, 7499, More]
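As a rough check of the steps for the Harsha Auto example, the Python sketch below applies Sturges' rule in its standard form, k = 1 + 3.322 log10(N), and then counts the 30 costs into classes. The starting point of 3600 and the class width of 650 are read off the frequency table; rounding k to the nearest whole number is an assumption of this sketch.

```python
import math

costs = [
    6370, 5460, 6510, 3990, 5250, 3640,
    4970, 4830, 5040, 6230, 4620, 5250,
    7280, 5180, 4340, 4760, 6790, 7350,
    5950, 6790, 6160, 4760, 5810, 4760,
    4340, 5740, 6860, 7070, 5530, 7350,
]

n = len(costs)
k = 1 + 3.322 * math.log10(n)   # Sturges' rule, roughly 5.9 for n = 30
num_classes = round(k)          # assumption: round to the nearest integer

lower, width = 3600, 650        # taken from the frequency table
counts = [0] * num_classes
for cost in costs:
    counts[(cost - lower) // width] += 1   # index of the class containing this cost

for i, count in enumerate(counts):
    lo = lower + i * width
    print(f"{lo}-{lo + width - 1}: {count:>2} ({100 * count / n:.0f}%)")
```

With six classes of width 650 the counts agree with the table, which suggests the table was built with the same class limits.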
Numerical Descriptive Statistics
The most common numerical descriptive statistic is the mean (or average).

The mean is a measure of the central tendency, or central location, of the data for a variable.

Harsha’s mean cost of parts, based on the 30 tune-ups studied, is Rs. 5633 (found by summing up the 30 cost values and then dividing by 30).
Statistical Inference – A Preview
Population: The set of all elements of interest in a particular study.

Sample: A subset of the population.

Statistical inference: The process of using data obtained from a sample to make estimates and test hypotheses about the characteristics of a population.

Census: Collecting data for the entire population.

Sample survey: Collecting data for a sample.
Process of Statistical Inference
Example: Harsha Auto
• Step 1: The population consists of all tune-ups. The average cost of parts is unknown.
• Step 2: A sample of 30 engine tune-ups is examined.
• Step 3: The sample data provide a sample average parts cost of Rs. 5633 per tune-up.
• Step 4: The sample average is used to estimate the population average.
Analytics – A Preview
Analytics is the scientific process of transforming data into insight for making better decisions.
Techniques:
• Descriptive analytics: This describes what has happened in the past.
• Predictive analytics: Use models constructed from past data to predict the future or to assess the impact of one variable on another.
• Prescriptive analytics: The set of analytical techniques that yield a best course of action.
Big data and Data Mining
Big data: A large and complex data set.

Three V’s of big data:
• Volume: Amount of available data
• Velocity: Speed at which data is collected and processed
• Variety: Different data types
Data warehousing
Data warehousing is the process of capturing, storing, and maintaining the data.
• Organizations obtain large amounts of data on a daily basis by means of magnetic card readers, bar code scanners, point of sale terminals, and touch screen monitors.
• Wal-Mart captures data on 20-30 million transactions per day.
• Visa processes 6,800 payment transactions per second.
Data Mining
• Methods for developing useful decision-making information from large databases.
• Using a combination of procedures from statistics, mathematics, and computer science, analysts “mine the data” to convert it into useful information.
• The most effective data mining systems use automated procedures to discover relationships in the data and predict future outcomes prompted by general and even vague queries by the user.
Data Mining Applications
• The major applications of data mining have been made by companies with a strong consumer focus such as retail, financial, and communication firms.
• Data mining is used to identify related products that customers who have already purchased a specific product are also likely to purchase (and then pop-ups are used to draw attention to those related products).
• Data mining is also used to identify customers who should receive special discount offers based on their past purchasing volumes.
Data Mining Requirements
• Statistical methodology such as multiple regression, logistic regression, and correlation are heavily used.
• Also needed are computer science technologies involving artificial intelligence and machine learning.
• A significant investment in time and money is required as well.
Data Mining Model Reliability
• Finding a statistical model that works well for a particular sample of data does not necessarily mean that it can be reliably applied to other data.
• With the enormous amount of data available, the data set can be partitioned into a training set (for model development) and a test set (for validating the model).
• There is, however, a danger of overfitting the model to the point that misleading associations and conclusions appear to exist.
• Careful interpretation of results and extensive testing is important.
Ethical Guidelines for Statistical Practice
• In a statistical study, unethical behavior can take a variety of forms including:
• Improper sampling
• Inappropriate analysis of the data
• Development of misleading graphs
• Use of inappropriate summary statistics
• Biased interpretation of the statistical results
• One should strive to be fair, thorough, objective, and neutral when collecting, analyzing, and presenting data.
• As a consumer of statistics, one should also be aware of the possibility of unethical behavior by others.
Summarizing Categorical Data
• Frequency Distribution
• Relative Frequency Distribution
• Percent Frequency Distribution
• Bar Chart
• Pie Chart
Frequency Distribution
• A frequency distribution is a tabular summary of data showing
the number (frequency) of observations in each of several
non-overlapping categories or classes.
• The objective is to provide insights about the data that cannot
be quickly obtained by looking only at the original data.

• Example: Marada Inn
• Guests staying at Marada Inn were asked to rate the quality of their accommodations as being excellent, above average, average, below average, or poor.
• The ratings provided by a sample of 20 guests are:
Frequency Distribution

Ratings from the 20 guests:
Below Average    Average          Above Average
Above Average    Above Average    Above Average
Above Average    Below Average    Below Average
Average          Poor             Poor
Above Average    Excellent        Above Average
Average          Above Average    Average
Above Average    Average

Rating           Frequency
Poor             2
Below Average    3
Average          5
Above Average    9
Excellent        1
Total            20
Relative & Percent Frequency Distribution

• The relative frequency of a class is the fraction or proportion of the total number of data items belonging to the class.
• Relative frequency of a class = $\frac{\text{Frequency of the class}}{n}$
• A relative frequency distribution is a tabular summary of a set of data showing the relative frequency for each class.
• The percent frequency of a class is the relative frequency multiplied by 100.
• A percent frequency distribution is a tabular summary of a set of data showing the percent frequency for each class.

Rating           Relative Frequency   Percent Frequency
Poor             .10                  10
Below Average    .15                  15
Average          .25                  25
Above Average    .45                  45
Excellent        .05                  5
Total            1.00                 100
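The definitions above translate directly into code. The following Python sketch is illustrative only (the slides work in Excel): it tallies the 20 Marada Inn ratings and derives the relative and percent frequency of each class. The variable names are assumptions.

```python
from collections import Counter

ratings = [
    "Below Average", "Average", "Above Average",
    "Above Average", "Above Average", "Above Average",
    "Above Average", "Below Average", "Below Average",
    "Average", "Poor", "Poor",
    "Above Average", "Excellent", "Above Average",
    "Average", "Above Average", "Average",
    "Above Average", "Average",
]

# Classes listed in their natural order, from worst to best.
order = ["Poor", "Below Average", "Average", "Above Average", "Excellent"]

frequency = Counter(ratings)
n = len(ratings)

relative = {r: frequency[r] / n for r in order}        # fraction of the n items
percent = {r: 100 * frequency[r] / n for r in order}   # relative frequency x 100

for r in order:
    print(f"{r:<15}{frequency[r]:>3}{relative[r]:>7.2f}{percent[r]:>6.0f}")
```

The relative frequencies always sum to 1 and the percent frequencies to 100, which is a quick sanity check on any such table.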
Bar Chart
• A bar chart is a graphical display for depicting qualitative data.
• On one axis (usually the horizontal axis), we specify the labels that are used for each of the classes.
• A frequency, relative frequency, or percent frequency scale can be used for the other axis (usually the vertical axis).
• Using a bar of fixed width drawn above each class label, we extend the height appropriately.
Bar Chart – let us do it in Excel!
[Figure: bar chart “Marada Inn Quality Ratings”; vertical axis: Frequency (1 to 10); horizontal axis: Quality Rating (Poor, Below Average, Average, Above Average, Excellent)]
Pareto Diagram – Let us try drawing this!
• In quality control, bar charts are used to identify the most important causes of problems.
• When the bars are arranged in descending order of height from left to right (with the most frequently occurring cause appearing first) the bar chart is called a Pareto diagram.
• This diagram is named for its founder, Vilfredo Pareto, an Italian economist.
Pie Chart – Let us draw this for Marada Inn!

• The pie chart is a commonly used graphical display for presenting relative frequency and percent frequency distributions for categorical data.
• First draw a circle; then use the relative frequencies to subdivide the circle into sectors that correspond to the relative frequency for each class.
• Since there are 360 degrees in a circle, a class with a relative frequency of .25 would consume .25(360) = 90 degrees of the circle.
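The sector arithmetic can be sketched for all five Marada Inn classes. This small Python example assumes the frequencies from the frequency distribution above and multiplies each relative frequency by 360; the dictionary names are illustrative.

```python
frequency = {
    "Poor": 2,
    "Below Average": 3,
    "Average": 5,
    "Above Average": 9,
    "Excellent": 1,
}
n = sum(frequency.values())

# Sector angle in degrees = relative frequency x 360.
angles = {rating: 360 * count / n for rating, count in frequency.items()}

for rating, angle in angles.items():
    print(f"{rating:<15}{angle:>6.0f} degrees")
```

The angles add up to 360 degrees, so the sectors fill the circle exactly.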
Pie Chart
[Figure: pie chart “Marada Inn Quality Ratings”: Excellent 5%, Poor 10%, Below Average 15%, Average 25%, Above Average 45%]
Example: Marada Inn
Insights Gained from the Preceding Pie Chart
• One-half of the customers surveyed gave Marada a quality rating of “above average” or “excellent” (looking at the left side of the pie). This might please the manager.
• For each customer who gave an “excellent” rating, there were two customers who gave a “poor” rating (looking at the top of the pie). This should displease the manager.
Summarizing Quantitative Data
• Frequency Distribution
• Relative Frequency and Percent Frequency Distributions
• Dot Plot
• Histogram
• Cumulative Distributions
• Stem-and-Leaf Display

Example: Harsha Auto Repair – let us do it in Excel!
Sample of Parts Cost (Rs.) for 30 Tune-ups
6370 5460 6510 3990 5250 3640
4970 4830 5040 6230 4620 5250
7280 5180 4340 4760 6790 7350
5950 6790 6160 4760 5810 4760
4340 5740 6860 7070 5530 7350

• What should be your approach?
• Put the data into a frequency distribution
Steps
• Calculate the minimum and maximum.
• Provide a guess of the number of class intervals, between 6 & 20.
• Calculate the number of class intervals (CI) with a formula, Sturges' rule:
• $k = 1 + 3.322 \log_{10} N$
• In this case, k =
Histogram
• Another common graphical display of quantitative data is a histogram.
• The variable of interest is placed on the horizontal axis.
• A rectangle is drawn above each class interval with its height corresponding to the interval’s frequency, relative frequency, or percent frequency.
• Unlike a bar graph, a histogram has no natural separation between rectangles of adjacent classes.
Histograms Showing Skewness
• Symmetric
• Left tail is the mirror image of the right tail
• Example: Heights of People
[Figure: histogram; vertical axis: Relative Frequency, 0 to .35]

Histograms Showing Skewness
• Moderately Skewed Left
• A longer tail to the left
• Example: Exam Scores
[Figure: histogram; vertical axis: Relative Frequency, 0 to .35]

Histograms Showing Skewness
• Moderately Skewed Right
• A longer tail to the right
• Example: Housing Values
[Figure: histogram; vertical axis: Relative Frequency, 0 to .35]

Histograms Showing Skewness
• Highly Skewed Right
• A very long tail to the right
• Example: Executive Salaries
[Figure: histogram; vertical axis: Relative Frequency, 0 to .35]
Cumulative Distributions
• Cumulative frequency distribution – shows the number of items with values less than or equal to the upper limit of each class.
• Cumulative relative frequency distribution – shows the proportion of items with values less than or equal to the upper limit of each class.
• Cumulative percent frequency distribution – shows the percentage of items with values less than or equal to the upper limit of each class.
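A brief Python sketch of the three cumulative distributions, using the class upper limits and frequencies from the Harsha Auto frequency distribution. The running totals are built with `itertools.accumulate`; the variable names are illustrative.

```python
from itertools import accumulate

upper_limits = [4249, 4899, 5549, 6199, 6849, 7499]   # class upper limits (Rs.)
frequencies = [2, 7, 7, 4, 5, 5]                      # tune-ups in each class

n = sum(frequencies)
cumulative = list(accumulate(frequencies))            # items <= each upper limit
cum_relative = [c / n for c in cumulative]            # as a proportion of n
cum_percent = [100 * c / n for c in cumulative]       # as a percentage of n

for limit, c, rel, pct in zip(upper_limits, cumulative, cum_relative, cum_percent):
    print(f"<= {limit}: {c:>2}  {rel:.2f}  {pct:.0f}%")
```

The last entry of each cumulative column must equal n, 1, and 100 respectively.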
Learnings
Descriptive Statistics

Dr. Nilakantan Narasinganallur Ph.D.
• If the measures are computed for data from
a sample, they are called sample statistics.
• If the measures are computed for data from
a population, they are called population
parameters.
• A sample statistic is referred to as the point
estimator of the corresponding population
parameter.
Numerical Measures
• Mean
• Median
• Mode
• Weighted mean
• Geometric mean
• Percentiles
• Quartiles
• Quantiles

Arithmetic Mean
• Most important measure of location
• Provides a measure of central location
• Mean of a data set is the average of its data values.
• Sample mean is the point estimator of population mean.
• Affected by extreme values/outliers

Sample mean: $\bar{x} = \frac{\sum x_i}{n}$
• where: $\sum x_i$ = sum of the values of the n observations, n = number of observations in the sample

Population mean: $\mu = \frac{\sum x_i}{N}$
• where: $\sum x_i$ = sum of the values of the N observations, N = number of observations in the population
Sample Mean $\bar{x}$
• Example: Apartment Rents – Let us work out the mean in Excel
Seventy efficiency apartments were randomly sampled in a college town.
The monthly rents for these apartments are listed below.

545 715 530 690 535 700 560 700 540 715
540 540 540 625 525 545 675 545 550 550
565 550 625 550 550 560 535 560 565 580
550 570 590 572 575 575 600 580 670 565
700 585 680 570 590 600 649 600 600 580
670 615 550 545 625 635 575 650 580 610
610 675 590 535 700 535 545 535 530 540

Sample Mean $\bar{x}$
• Example: Apartment Rents
$\bar{x} = \frac{\sum x_i}{n} = \frac{41,356}{70} = 590.80$
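The same calculation as a minimal Python sketch. The 70 rents are entered in ascending order as they appear on the later slides; the order makes no difference to the mean.

```python
rents = [
    525, 530, 530, 535, 535, 535, 535, 535, 540, 540,
    540, 540, 540, 545, 545, 545, 545, 545, 550, 550,
    550, 550, 550, 550, 550, 560, 560, 560, 565, 565,
    565, 570, 570, 572, 575, 575, 575, 580, 580, 580,
    580, 585, 590, 590, 590, 600, 600, 600, 600, 610,
    610, 615, 625, 625, 625, 635, 649, 650, 670, 670,
    675, 675, 680, 690, 700, 700, 700, 700, 715, 715,
]

n = len(rents)           # number of observations in the sample
total = sum(rents)       # sum of the x_i
sample_mean = total / n  # x-bar

print(n, total, f"{sample_mean:.2f}")
```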
Median
• The median of a data set is the value in the middle when the data items are arranged in ascending order.
• Whenever a data set has extreme values, the median is the preferred measure of central location.
• The median is the measure of location most often reported for annual income and property value data.
• A few extremely large incomes or property values can inflate the mean.
Median
• For an odd number of observations:

26 18 27 12 14 27 19 7 observations

12 14 18 19 26 27 27 in ascending order

The median is the middle value. Median = 19

Median
• For an even number of observations:

26 18 27 12 14 27 30 19 8 observations

12 14 18 19 26 27 27 30 in ascending order

The median is the average of the middle two values.


Median = (19 + 26)/2 = 22.5
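Both cases can be covered by one small function. The sketch below is a plain-Python illustration of the rule on the two preceding slides; the function name is an assumption, and Python's built-in `statistics.median` follows the same rule.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the middle pair


odd_case = median([26, 18, 27, 12, 14, 27, 19])
even_case = median([26, 18, 27, 12, 14, 27, 30, 19])
print(odd_case, even_case)
```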

Median
• Example: Apartment Rents
Averaging the 35th and 36th data values:
Median = (575 + 575)/2 = 575

525 530 530 535 535 535 535 535 540 540
540 540 540 545 545 545 545 545 550 550
550 550 550 550 550 560 560 560 565 565
565 570 570 572 575 575 575 580 580 580
580 585 590 590 590 600 600 600 600 610
610 615 625 625 625 635 649 650 670 670
675 675 680 690 700 700 700 700 715 715

Note: Data is in ascending order.

Trimmed Mean
• Another measure, sometimes used when extreme values are present, is the trimmed mean.
• It is obtained by deleting a percentage of the smallest and largest values from a data set and then computing the mean of the remaining values.
• For example, the 5% trimmed mean is obtained by removing the smallest 5% and the largest 5% of the data values and then computing the mean of the remaining values.
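A possible Python sketch of the trimmed mean follows. The slides do not say how to handle a percentage that is not a whole number of observations, so rounding the count down is an assumption here. The 20 values are hypothetical and chosen only to show the effect of one outlier.

```python
def trimmed_mean(values, percent):
    """Mean after dropping `percent`% of the values from each end of the sorted data."""
    ordered = sorted(values)
    # Number of values removed from each end; rounding down is an assumption.
    k = int(len(ordered) * percent / 100)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical data: the integers 1 to 19 plus one extreme value.
values = list(range(1, 20)) + [1000]

plain_mean = sum(values) / len(values)
trimmed = trimmed_mean(values, 5)   # drops the smallest and the largest value
print(plain_mean, trimmed)
```

The single extreme value pulls the ordinary mean far above the bulk of the data, while the 5% trimmed mean stays close to it.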
Mode
• The mode of a data set is the value that occurs with greatest frequency.
• The greatest frequency can occur at two or more different values.
• If the data have exactly two modes, the data are bimodal.
• If the data have more than two modes, the data are multimodal.
Mode
• Example: Apartment Rents
550 occurred most frequently (7 times)
Mode = 550

525 530 530 535 535 535 535 535 540 540
540 540 540 545 545 545 545 545 550 550
550 550 550 550 550 560 560 560 565 565
565 570 570 572 575 575 575 580 580 580
580 585 590 590 590 600 600 600 600 610
610 615 625 625 625 635 649 650 670 670
675 675 680 690 700 700 700 700 715 715

Note: Data is in ascending order.
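The mode can be found by counting occurrences and keeping every value that reaches the highest count, which also handles the bimodal and multimodal cases. The Python sketch below is illustrative; the function name is an assumption.

```python
from collections import Counter

rents = [
    525, 530, 530, 535, 535, 535, 535, 535, 540, 540,
    540, 540, 540, 545, 545, 545, 545, 545, 550, 550,
    550, 550, 550, 550, 550, 560, 560, 560, 565, 565,
    565, 570, 570, 572, 575, 575, 575, 580, 580, 580,
    580, 585, 590, 590, 590, 600, 600, 600, 600, 610,
    610, 615, 625, 625, 625, 635, 649, 650, 670, 670,
    675, 675, 680, 690, 700, 700, 700, 700, 715, 715,
]


def modes(values):
    """Return (sorted list of most frequent values, their frequency)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top), top


rent_modes, rent_count = modes(rents)
print(rent_modes, rent_count)

# A small bimodal example: both 1 and 2 occur twice.
print(modes([1, 1, 2, 2, 3]))
```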

Weighted Mean
• In some instances the mean is computed by giving each observation a weight that reflects its relative importance.
• The choice of weights depends on the application.
• The weights might be the number of credit hours earned for each grade, as in GPA.
• In other weighted mean computations, quantities such as pounds, dollars, or volume are frequently used.
Weighted Mean
$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$
where: $x_i$ = value of observation i, $w_i$ = weight for observation i

Numerator: sum of the weighted data values

Denominator: sum of the weights

If data is from a population, $\mu$ replaces $\bar{x}$.
Weighted Mean
• Example: Construction Wages
Ron Butler, a home builder, is looking over the expenses he incurred for a
house he just built. For the purpose of pricing future projects, he would like to
know the average wage ($/hour) he paid the workers he employed. Listed
below are the categories of worker he employed, along with their respective
wage and total hours worked.
Worker Wage ($/hr) Total Hours
Carpenter 21.60 520
Electrician 28.72 230
Laborer 11.80 410
Painter 19.75 270
Plumber 24.16 160

Weighted Mean
• Example: Construction Wages
Worker        x_i      w_i     w_i x_i
Carpenter     21.60    520     11232.0
Electrician   28.72    230     6605.6
Laborer       11.80    410     4838.0
Painter       19.75    270     5332.5
Plumber       24.16    160     3865.6
Total                  1590    31873.7

$\bar{x} = \frac{\sum w_i x_i}{\sum w_i} = \frac{31,873.7}{1,590} = 20.0464 \approx \$20.05$

FYI, equally-weighted (simple) mean = $21.21
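The construction wages calculation, written as a short Python sketch. The wages play the role of the values x_i and the hours worked are the weights w_i; the variable names are illustrative.

```python
wages = [21.60, 28.72, 11.80, 19.75, 24.16]   # x_i: wage in $/hour
hours = [520, 230, 410, 270, 160]             # w_i: total hours worked

weighted_sum = sum(w * x for w, x in zip(hours, wages))   # numerator
total_weight = sum(hours)                                  # denominator
weighted_mean = weighted_sum / total_weight

simple_mean = sum(wages) / len(wages)   # every worker category weighted equally

print(f"{weighted_sum:.1f} / {total_weight} = {weighted_mean:.4f}")
print(f"simple mean = {simple_mean:.2f}")
```

The weighted mean is lower than the simple mean because the lowest-paid category, laborer, accounts for a large share of the hours.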
Geometric Mean
• The geometric mean is calculated by finding the nth root of the product of n values.
• It is often used in analyzing growth rates in financial data (where using the arithmetic mean will provide misleading results).
• It should be applied anytime you want to determine the mean rate of change over several successive periods (be it years, quarters, weeks, . . .).
• Other common applications include: changes in populations of species, crop yields, pollution levels, and birth and death rates.
Geometric Mean
$\bar{x}_g = \sqrt[n]{(x_1)(x_2)\cdots(x_n)} = [(x_1)(x_2)\cdots(x_n)]^{1/n}$
Geometric Mean
• Example: Rate of Return
Period   Return (%)   Growth Factor
1        -6.0         0.940
2        -8.0         0.920
3        -4.0         0.960
4        2.0          1.020
5        5.4          1.054

$\bar{x}_g = \sqrt[5]{(.94)(.92)(.96)(1.02)(1.054)} = [.89254]^{1/5} = .97752$
Average growth rate per period is (.97752 - 1)(100) = -2.248%
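The rate-of-return example can be reproduced with a few lines of Python. Each percentage return is first converted to a growth factor (1 + return/100), and the geometric mean is the fifth root of their product. `math.prod` requires Python 3.8 or later; the variable names are illustrative.

```python
import math

returns_pct = [-6.0, -8.0, -4.0, 2.0, 5.4]            # return per period, in percent
growth_factors = [1 + r / 100 for r in returns_pct]   # 0.94, 0.92, 0.96, 1.02, 1.054

product = math.prod(growth_factors)
geometric_mean = product ** (1 / len(growth_factors))   # nth root of the product
avg_growth_pct = (geometric_mean - 1) * 100             # mean rate of change per period

print(f"{product:.5f} {geometric_mean:.5f} {avg_growth_pct:.3f}%")
```

For comparison, the arithmetic mean of the five returns is -2.12%, which understates the compounded loss per period.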
Percentiles
• A percentile provides information about how the data are spread over the
interval from the smallest value to the largest value.
• Admission test scores for colleges and universities are frequently reported in
terms of percentiles.
• The pth percentile of a data set is a value such that at least p percent
of the items take on this value or less and at least (100 - p) percent of
the items take on this value or more.

Percentiles
• Arrange the data in ascending order.
• Compute Lp, the location of the pth percentile.

Lp = (p/100)(n + 1)

80th Percentile
• Example: Apartment Rents
Lp = (p/100)(n + 1) = (80/100)(70 + 1) = 56.8
(the 56th value plus .8 times the
difference between the 57th and 56th values)
80th Percentile = 635 + .8(649 – 635) = 646.2
525 530 530 535 535 535 535 535 540 540
540 540 540 545 545 545 545 545 550 550
550 550 550 550 550 560 560 560 565 565
565 570 570 572 575 575 575 580 580 580
580 585 590 590 590 600 600 600 600 610
610 615 625 625 625 635 649 650 670 670
675 675 680 690 700 700 700 700 715 715

80th Percentile
• Example: Apartment Rents
“At least 80% of the items take on a value of 646.2 or less.” 56/70 = .8 or 80%
“At least 20% of the items take on a value of 646.2 or more.” 14/70 = .2 or 20%

Quartiles
• Quartiles are specific percentiles.
• First Quartile = 25th Percentile
• Second Quartile = 50th Percentile = Median
• Third Quartile = 75th Percentile

Third Quartile (75th Percentile)
• Example: Apartment Rents
Lp = (p/100)(n + 1) = (75/100)(70 + 1) = 53.25
(the 53rd value plus .25 times the
difference between the 54th and 53rd values)
Third quartile = 625 + .25(625 – 625) = 625

Measures of Variability
• It is often desirable to consider measures of variability (dispersion), as
well as measures of location.
• For example, in choosing supplier A or supplier B we might consider not only
the average delivery time for each, but also the variability in delivery time
for each.

Measures of Variability
• Range
• Interquartile Range
• Variance
• Standard Deviation
• Coefficient of Variation

Range
• The range of a data set is the difference between the largest and smallest
data values.
Range = Largest value – Smallest value
• It is the simplest measure of variability.
• It is very sensitive to the smallest and largest data values.

Range
• Example: Apartment Rents
Range = largest value - smallest value
Range = 715 - 525 = 190

Interquartile Range
• The interquartile range of a data set is the difference between the third
quartile and the first quartile.
• It is the range for the middle 50% of the data.
• It overcomes the sensitivity to extreme data values.

Interquartile Range (IQR)
• Example: Apartment Rents
3rd Quartile (Q3) = 625
1st Quartile (Q1) = 545
IQR = Q3 - Q1 = 625 - 545 = 80

Variance
• The variance is a measure of variability that utilizes all the data.
• It is based on the difference between the value of each observation ($x_i$)
and the mean ($\bar{x}$ for a sample, $\mu$ for a population).
• The variance is useful in comparing the variability of two or more variables.

Variance
• The variance is the average of the squared differences between each data
value and the mean.
• The variance is computed as follows:

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ for a sample
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ for a population

Standard Deviation
• The standard deviation of a data set is the positive square root of the
variance.
• It is measured in the same units as the data, making it more easily
interpreted than the variance.

Standard Deviation
• The standard deviation is computed as follows:

$s = \sqrt{s^2}$ for a sample
$\sigma = \sqrt{\sigma^2}$ for a population

Coefficient of Variation
• The coefficient of variation indicates how large the standard deviation is in
relation to the mean.
• The coefficient of variation is computed as follows:
$\left(\frac{s}{\bar{x}} \times 100\right)\%$ for a sample
$\left(\frac{\sigma}{\mu} \times 100\right)\%$ for a population

Sample Variance, Standard Deviation,
And Coefficient of Variation
• Example: Apartment Rents
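• Variance
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 2{,}996.16$
• Standard Deviation
$s = \sqrt{s^2} = \sqrt{2{,}996.16} = 54.74$
• Coefficient of Variation
$\left(\frac{s}{\bar{x}} \times 100\right)\% = \left(\frac{54.74}{590.80} \times 100\right)\% = 9.27\%$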
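
The three sample statistics can be checked directly against the 70 apartment rents. The following is a sketch in Python (an assumed language, since the deck works in Excel and R) that applies the sample formulas with the n - 1 divisor; variable names are illustrative.

```python
import math

rents = [
    525, 530, 530, 535, 535, 535, 535, 535, 540, 540,
    540, 540, 540, 545, 545, 545, 545, 545, 550, 550,
    550, 550, 550, 550, 550, 560, 560, 560, 565, 565,
    565, 570, 570, 572, 575, 575, 575, 580, 580, 580,
    580, 585, 590, 590, 590, 600, 600, 600, 600, 610,
    610, 615, 625, 625, 625, 635, 649, 650, 670, 670,
    675, 675, 680, 690, 700, 700, 700, 700, 715, 715,
]

n = len(rents)
mean = sum(rents) / n

# Sample variance: squared deviations from the mean, divided by n - 1.
sample_variance = sum((x - mean) ** 2 for x in rents) / (n - 1)
sample_sd = math.sqrt(sample_variance)

# Coefficient of variation: standard deviation as a percentage of the mean.
cv_percent = sample_sd / mean * 100

print(f"mean={mean:.2f} s^2={sample_variance:.2f} s={sample_sd:.2f} CV={cv_percent:.2f}%")
```

Small differences in the last digit of the coefficient of variation can arise depending on whether the standard deviation is rounded before dividing.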

Measures of Distribution Shape,
Relative Location, and Detecting Outliers
• Distribution Shape
• z-Scores
• Chebyshev’s Theorem
• Empirical Rule
• Detecting Outliers
• Five-number summary
• Box plots

Distribution Shape: Skewness
• An important measure of the shape of a distribution is called skewness.
• The formula for the skewness of sample data is
$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left(\frac{x_i - \bar{x}}{s}\right)^3$

• Skewness can be easily computed using statistical software.
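
As one illustration of computing skewness with software, the sample formula can be written out by hand. This Python sketch (an assumed language; the deck uses Excel and R) applies it to the apartment rent data used throughout this section.

```python
import math

rents = [
    525, 530, 530, 535, 535, 535, 535, 535, 540, 540,
    540, 540, 540, 545, 545, 545, 545, 545, 550, 550,
    550, 550, 550, 550, 550, 560, 560, 560, 565, 565,
    565, 570, 570, 572, 575, 575, 575, 580, 580, 580,
    580, 585, 590, 590, 590, 600, 600, 600, 600, 610,
    610, 615, 625, 625, 625, 635, 649, 650, 670, 670,
    675, 675, 680, 690, 700, 700, 700, 700, 715, 715,
]

n = len(rents)
mean = sum(rents) / n
s = math.sqrt(sum((x - mean) ** 2 for x in rents) / (n - 1))  # sample std. dev.

# Sample skewness: n / ((n-1)(n-2)) times the sum of cubed standardized values.
skewness = n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in rents)

print(f"skewness = {skewness:.2f}")
```

A positive result indicates a right-skewed distribution, with the mean above the median.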

Distribution Shape: Skewness
• Symmetric (not skewed)
• Skewness is zero.
• Mean and median are equal.
[Histogram of relative frequency for a symmetric distribution; Skewness = 0]

Distribution Shape: Skewness
• Moderately Skewed Left
• Skewness is negative.
• Mean will usually be less than the median.

[Histogram of relative frequency for a moderately left-skewed distribution; Skewness = -.31]

Distribution Shape: Skewness
• Moderately Skewed Right
• Skewness is positive.
• Mean will usually be more than the median.

[Histogram of relative frequency for a moderately right-skewed distribution; Skewness = .31]

Distribution Shape: Skewness
• Highly Skewed Right
• Skewness is positive (often above 1.0).
• Mean will usually be more than the median.

[Histogram of relative frequency for a highly right-skewed distribution; Skewness = 1.25]

Distribution Shape: Skewness
• Example: Apartment Rents – Let us draw the graphs in Excel
Seventy efficiency apartments were randomly sampled in a college town. The
monthly rent prices for the apartments are listed below in ascending order.

525 530 530 535 535 535 535 535 540 540
540 540 540 545 545 545 545 545 550 550
550 550 550 550 550 560 560 560 565 565
565 570 570 572 575 575 575 580 580 580
580 585 590 590 590 600 600 600 600 610
610 615 625 625 625 635 649 650 670 670
675 675 680 690 700 700 700 700 715 715

Distribution Shape: Skewness
• Example: Apartment Rents
[Histogram of relative frequency for the apartment rents; Skewness = .92]

z-Scores
• The z-score is often called the standardized value.
• It denotes the number of standard deviations a data value $x_i$ is from the mean.
$z_i = \frac{x_i - \bar{x}}{s}$
• Excel’s STANDARDIZE function can be used to compute the z-score.

z-Scores
• An observation’s z-score is a measure of the relative location of the observation
in a data set.
• A data value less than the sample mean will have a z-score less than zero.
• A data value greater than the sample mean will have a z-score greater than
zero.
• A data value equal to the sample mean will have a z-score of zero.

z-Scores – calculate in Excel!
• Example: Apartment Rents
• z-Score of Smallest Value (525)
$z_i = \frac{x_i - \bar{x}}{s} = \frac{525 - 590.80}{54.74} = -1.20$

Standardized Values for Apartment Rents


-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93
-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75
-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47
-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20
-0.20 -0.11 -0.01 -0.01 -0.01 0.17 0.17 0.17 0.17 0.35
0.35 0.44 0.62 0.62 0.62 0.81 1.06 1.08 1.45 1.45
1.54 1.54 1.63 1.81 1.99 1.99 1.99 1.99 2.27 2.27
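
The table of standardized values can be regenerated from the raw rents. The sketch below is in Python (an assumption; the slide itself suggests Excel's STANDARDIZE) and uses the sample mean and sample standard deviation.

```python
import math

rents = [
    525, 530, 530, 535, 535, 535, 535, 535, 540, 540,
    540, 540, 540, 545, 545, 545, 545, 545, 550, 550,
    550, 550, 550, 550, 550, 560, 560, 560, 565, 565,
    565, 570, 570, 572, 575, 575, 575, 580, 580, 580,
    580, 585, 590, 590, 590, 600, 600, 600, 600, 610,
    610, 615, 625, 625, 625, 635, 649, 650, 670, 670,
    675, 675, 680, 690, 700, 700, 700, 700, 715, 715,
]

n = len(rents)
mean = sum(rents) / n
s = math.sqrt(sum((x - mean) ** 2 for x in rents) / (n - 1))

# z-score: number of sample standard deviations each value lies from the mean.
z_scores = [(x - mean) / s for x in rents]

# Print ten values per row, mirroring the layout of the table above.
for start in range(0, n, 10):
    print(" ".join(f"{z:5.2f}" for z in z_scores[start:start + 10]))
```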

Chebyshev’s Theorem
• At least $(1 - 1/z^2)$ of the items in any data set will be within z standard
deviations of the mean, where z is any value greater than 1.
• Chebyshev’s theorem requires z > 1, but z need not be an integer.

Chebyshev’s Theorem
• At least 75% of the data values must be within z = 2 standard
deviations of the mean.
• At least 89% of the data values must be within z = 3 standard
deviations of the mean.
• At least 94% of the data values must be within z = 4 standard
deviations of the mean.

Chebyshev’s Theorem
• Example: Apartment Rents
Let z = 1.5 with $\bar{x}$ = 590.80 and s = 54.74
At least $(1 - 1/(1.5)^2)$ = 1 - 0.44 = 0.56 or 56% of the rent values must be
between
$\bar{x}$ - z(s) = 590.80 - 1.5(54.74) = 509
and
$\bar{x}$ + z(s) = 590.80 + 1.5(54.74) = 673
(Actually, 86% of the rent values are between 509 and 673.)
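
To see how conservative Chebyshev's bound is, the guaranteed minimum can be compared with the actual share of rents inside the interval. This is a Python sketch (an assumed language; names are illustrative) for z = 1.5.

```python
import math

rents = [
    525, 530, 530, 535, 535, 535, 535, 535, 540, 540,
    540, 540, 540, 545, 545, 545, 545, 545, 550, 550,
    550, 550, 550, 550, 550, 560, 560, 560, 565, 565,
    565, 570, 570, 572, 575, 575, 575, 580, 580, 580,
    580, 585, 590, 590, 590, 600, 600, 600, 600, 610,
    610, 615, 625, 625, 625, 635, 649, 650, 670, 670,
    675, 675, 680, 690, 700, 700, 700, 700, 715, 715,
]

n = len(rents)
mean = sum(rents) / n
s = math.sqrt(sum((x - mean) ** 2 for x in rents) / (n - 1))

z = 1.5                                   # any value greater than 1
chebyshev_bound = 1 - 1 / z ** 2          # guaranteed minimum proportion
low, high = mean - z * s, mean + z * s    # interval of z std. devs. around the mean
inside = sum(low <= x <= high for x in rents)

print(f"bound: at least {chebyshev_bound:.0%}; observed: {inside}/{n} = {inside / n:.0%}")
```

The bound holds for any data set, so the observed share can be well above it, as it is here.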

Empirical Rule
• When the data are believed to approximate a bell-shaped distribution:
• The empirical rule can be used to determine the percentage of data
values that must be within a specified number of standard deviations
of the mean.
• The empirical rule is based on the normal distribution, which is
covered in Chapter 6.

Empirical Rule
For data having a bell-shaped distribution:
• 68.26% of the values of a normal random variable are within +/- 1
standard deviation of its mean.
• 95.44% of the values of a normal random variable are within +/- 2
standard deviations of its mean.
• 99.72% of the values of a normal random variable are within +/- 3
standard deviations of its mean.
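
The three percentages can be recomputed from the standard normal distribution. The Python sketch below (an assumed language) uses the identity P(|Z| ≤ k) = erf(k/√2); the helper name `prob_within` is hypothetical. The results differ from the slide in the second decimal, most likely because the slide's figures come from a rounded normal table.

```python
import math

def prob_within(k):
    """P(|Z| <= k) for a standard normal variable Z, via the error function."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within +/- {k} standard deviation(s): {prob_within(k):.2%}")
```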

Empirical Rule
[Figure: bell-shaped curve centered at μ, with 68.26% of values within μ ± 1σ,
95.44% within μ ± 2σ, and 99.72% within μ ± 3σ]

Detecting Outliers
• An outlier is an unusually small or unusually large value in a data set.
• A data value with a z-score less than -3 or greater than +3 might be considered
an outlier.
• It might be:
• an incorrectly recorded data value
• a data value that was incorrectly included in the data set
• a correctly recorded data value that belongs in the data set

Detecting Outliers
• Example: Apartment Rents
• The most extreme z-scores are -1.20 and 2.27
• Using |z| > 3 as the criterion for an outlier, there are no outliers in this data
set.
Outlier detection – with IQR-in Excel!
• IQR method identifies outliers by setting up a “fence” outside of Q1
and Q3. Values falling outside of this fence are considered outliers.
• We build this fence by taking 1.5 times the IQR, then subtracting this
value from Q1 and adding this value to Q3. This gives us the minimum
and maximum fence posts that we compare each observation to.
• Any observations that are more than 1.5 IQR below Q1 or more than
1.5 IQR above Q3 are considered outliers. This is the method that
Minitab uses to identify outliers by default.
Five-Number Summaries and Box Plots
• Summary statistics and easy-to-draw graphs can be used to quickly
summarize large quantities of data.
• Two tools that accomplish this are five-number summaries and box plots.

Five-Number Summary – Apartment Rents
1. Smallest Value

2. First Quartile

3. Median

4. Third Quartile

5. Largest Value

Five-Number Summary- Let’s do it in excel!
• Example: Apartment Rents
Lowest Value = 525
First Quartile = 545
Median = 575
Third Quartile = 625
Largest Value = 715

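
Outside Excel, the same five numbers can be obtained with Python's standard library (an assumed choice of language). In this sketch, `statistics.quantiles` with its default `method="exclusive"` places quartiles at positions based on n + 1, which matches the Lp = (p/100)(n + 1) rule used in this section.

```python
import statistics

rents = [
    525, 530, 530, 535, 535, 535, 535, 535, 540, 540,
    540, 540, 540, 545, 545, 545, 545, 545, 550, 550,
    550, 550, 550, 550, 550, 560, 560, 560, 565, 565,
    565, 570, 570, 572, 575, 575, 575, 580, 580, 580,
    580, 585, 590, 590, 590, 600, 600, 600, 600, 610,
    610, 615, 625, 625, 625, 635, 649, 650, 670, 670,
    675, 675, 680, 690, 700, 700, 700, 700, 715, 715,
]

# Three cut points that split the data into four equal-probability groups.
q1, median, q3 = statistics.quantiles(rents, n=4, method="exclusive")

five_number_summary = (min(rents), q1, median, q3, max(rents))
print(five_number_summary)
```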

Box Plot
• A box plot is a graphical summary of data that is based on a five-number
summary.
• A key to the development of a box plot is the computation of the median and
the quartiles Q1 and Q3.
• Box plots provide another way to identify outliers.

Box Plot – Let’s do it in excel!
• Example: Apartment Rents
• A box is drawn with its ends located at the first and third quartiles.
• A vertical line is drawn in the box at the location of the median (second
quartile).

[Box plot on an axis from 500 to 725: box from Q1 = 545 to Q3 = 625, with the
median line at Q2 = 575]

Box Plot
• Limits are located (not drawn) using the interquartile range (IQR).
• Data outside these limits are considered outliers.
• The location of each outlier is shown with the symbol * .

Box Plot
• Example: Apartment Rents
• The lower limit is located 1.5(IQR) below Q1.
Lower Limit: Q1 - 1.5(IQR) = 545 - 1.5(80) = 425
• The upper limit is located 1.5(IQR) above Q3.

Upper Limit: Q3 + 1.5(IQR) = 625 + 1.5(80) = 745


• There are no outliers (values less than 425 or greater than 745) in the
apartment rent data.
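
The limit calculation and the outlier check can be sketched in Python (an assumed language; the deck uses Excel). The quartiles come from `statistics.quantiles` with `method="exclusive"`, which follows the (n + 1) location rule used earlier.

```python
import statistics

rents = [
    525, 530, 530, 535, 535, 535, 535, 535, 540, 540,
    540, 540, 540, 545, 545, 545, 545, 545, 550, 550,
    550, 550, 550, 550, 550, 560, 560, 560, 565, 565,
    565, 570, 570, 572, 575, 575, 575, 580, 580, 580,
    580, 585, 590, 590, 590, 600, 600, 600, 600, 610,
    610, 615, 625, 625, 625, 635, 649, 650, 670, 670,
    675, 675, 680, 690, 700, 700, 700, 700, 715, 715,
]

q1, _, q3 = statistics.quantiles(rents, n=4, method="exclusive")
iqr = q3 - q1

# Limits ("fences") sit 1.5 IQRs beyond the quartiles.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = [x for x in rents if x < lower_limit or x > upper_limit]
print(lower_limit, upper_limit, outliers)
```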

Box Plot
• Example: Apartment Rents
• Whiskers (dashed lines) are drawn from the ends of the box to
the smallest and largest data values inside the limits.

[Box plot on an axis from 500 to 725: whiskers extend to the smallest value
inside limits = 525 and the largest value inside limits = 715]

Practice with
Excel- list of
completed
activity!
Probability
Theory
Dr. Nilakantan Narasinganallur
Ph.D.
Before we get into new topic, let’s recap! (1/2)
• Why should outliers be identified?
• We understand from theory that the mean is affected by extreme values.
• The median is much less affected.
• In a given data set, how do we establish this difference in the properties of
mean and median?
• Data: 35, 45, 67, 89, 98, 87, 76, 65, 54, 43, 350, 105
• Mean = 92.8
• Median = 71.5
• Now let us remove the extreme value 350.
• Mean = 69.5
• Median = 67
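
One way to establish the difference is simply to recompute both measures with and without the extreme value. The Python sketch below (an assumed language; the deck uses Excel and R) works from the twelve values listed on this slide.

```python
import statistics

data = [35, 45, 67, 89, 98, 87, 76, 65, 54, 43, 350, 105]
without_extreme = [x for x in data if x != 350]   # drop the extreme value

mean_all = statistics.mean(data)
median_all = statistics.median(data)
mean_trimmed = statistics.mean(without_extreme)
median_trimmed = statistics.median(without_extreme)

print(f"with 350:    mean={mean_all:.1f}, median={median_all}")
print(f"without 350: mean={mean_trimmed:.1f}, median={median_trimmed}")
```

The mean moves by far more than the median when the single extreme value is dropped, which is the property the slide asks about.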


Before we get into new topic, let’s recap! (2/2)
• Mumbai RTO has recently introduced an online learners’ test.
• If cleared, LL will be issued.
• At four Regional RTOs (RRTOs), the performance is reported as below:

RRTO        Total     %
Tardeo      26604     8
Wadala      24031    13
Andheri     12962    11
Borivili    18204    10

• 10% of those who took the online test failed.
• Failures at the four RRTOs are given.
• What is the overall percentage for Mumbai as a whole?
• Is the figure of 10% correct?
Introduction to Probability
• Experiments, Counting Rules, and Assigning Probabilities
• Events and Their Probability
• Some Basic Relationships of Probability
• Conditional Probability
• Bayes’ Theorem

Uncertainties
• Managers often base their decisions on an analysis of
uncertainties such as the following:
• What are the chances that sales will decrease if we increase prices?
• What is the likelihood a new assembly method will increase productivity?
• What are the odds that a new investment will be profitable?

Probability
• Probability is a numerical measure of the likelihood that an event
will occur.
• Probability values are always assigned on a scale from 0 to 1.
• A probability near zero indicates an event is quite unlikely to
occur.
• A probability near one indicates an event is almost certain to
occur.

Probability as a Numerical Measure
of the Likelihood of Occurrence
[Figure: probability scale from 0 to 1, showing increasing likelihood of occurrence]
• Probability near 0: the event is very unlikely to occur.
• Probability of .5: the occurrence of the event is just as likely as it is unlikely.
• Probability near 1: the event is almost certain to occur.

Statistical Experiments
• What is the difference between statistical and scientific
experiments?

• In statistical experiments, probability determines outcomes.


• Even though the experiment is repeated in exactly the same way,
an entirely different outcome may occur.
• For this reason, statistical experiments are sometimes called
random experiments.

An Experiment and Its Sample Space
• An experiment is any process that generates well-defined
outcomes.
• The sample space for an experiment is the set of all experimental
outcomes.
• An experimental outcome is also called a sample point.

An Experiment and Its Sample Space
Experiment Experiment Outcomes
Toss a coin Head, tail
Inspect a part Defective, non-defective
Conduct a sales call Purchase, no purchase
Roll a die 1, 2, 3, 4, 5, 6
Play a football game Win, lose, tie

An Experiment and Its Sample Space
• Example: Bradley Investments
Bradley has invested in two stocks, Markley Oil and Collins Mining. Bradley has
determined that the possible outcomes of these investments three months from now are
as follows.

Investment Gain or Loss in 3 Months (in $1000s)
Markley Oil    Collins Mining
    10               8
     5              -2
     0
   -20

A Counting Rule for Multiple-Step Experiments
• If an experiment consists of a sequence of k steps in which there are
n1 possible results for the first step, n2 possible results for the second
step, and so on, then the total number of experimental outcomes is
given by (n1)(n2) . . . (nk).
• A helpful graphical representation of a multiple-step experiment is a
tree diagram.

A Counting Rule for Multiple-Step Experiments
• Example: Bradley Investments
 Bradley Investments can be viewed as a two-step experiment. It involves two stocks,
each with a set of experimental outcomes.

Markley Oil: n1 = 4
Collins Mining: n2 = 2
Total Number of
Experimental Outcomes: n1n2 = (4)(2) = 8

Tree Diagram
• Example: Bradley Investments
Markley Oil    Collins Mining    Experimental
(Stage 1)      (Stage 2)         Outcomes
Gain 10        Gain 8            (10, 8)      Gain $18,000
Gain 10        Lose 2            (10, -2)     Gain $8,000
Gain 5         Gain 8            (5, 8)       Gain $13,000
Gain 5         Lose 2            (5, -2)      Gain $3,000
Even           Gain 8            (0, 8)       Gain $8,000
Even           Lose 2            (0, -2)      Lose $2,000
Lose 20        Gain 8            (-20, 8)     Lose $12,000
Lose 20        Lose 2            (-20, -2)    Lose $22,000
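
The branches of the tree can also be enumerated programmatically. This is a small Python sketch (an assumed language; variable names are illustrative) that applies the counting rule for multiple-step experiments and lists the net result of each outcome in $1000s.

```python
from itertools import product

markley_oil = [10, 5, 0, -20]   # stage 1 results, in $1000s
collins_mining = [8, -2]        # stage 2 results, in $1000s

# Every (stage 1, stage 2) pair is one experimental outcome.
outcomes = list(product(markley_oil, collins_mining))
net_result = {outcome: sum(outcome) for outcome in outcomes}

# Counting rule: n1 * n2 outcomes in total.
assert len(outcomes) == len(markley_oil) * len(collins_mining)

for outcome, net in net_result.items():
    label = "Gain" if net >= 0 else "Lose"
    print(f"{outcome}: {label} ${abs(net) * 1000:,}")
```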


Counting Rule for Combinations
• Number of Combinations of N Objects Taken n at a Time
• A second useful counting rule enables us to count the number of experimental
outcomes when n objects are to be selected from a set of N objects.

$C_n^N = \binom{N}{n} = \frac{N!}{n!(N-n)!}$

where: N! = N(N - 1)(N - 2) . . . (2)(1)
n! = n(n - 1)(n - 2) . . . (2)(1)
0! = 1

Counting Rule for Permutations
• Number of Permutations of N Objects Taken n at a Time
• A third useful counting rule enables us to count the number of
experimental outcomes when n objects are to be selected from
a set of N objects, where the order of selection is important.

$P_n^N = n!\binom{N}{n} = \frac{N!}{(N-n)!}$

where: N! = N(N - 1)(N - 2) . . . (2)(1)
n! = n(n - 1)(n - 2) . . . (2)(1)
0! = 1
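
Both counting rules are available in Python's `math` module (Python 3.8 or later; the language is an assumption, as the deck uses Excel and R). The values N = 5 and n = 2 below are hypothetical and chosen only for illustration.

```python
import math

N, n = 5, 2   # hypothetical example: choose 2 objects from a set of 5

combinations = math.comb(N, n)   # order ignored: N! / (n! (N - n)!)
permutations = math.perm(N, n)   # order matters: N! / (N - n)!

# The two rules differ by a factor of n!, the orderings of the chosen objects.
assert permutations == math.factorial(n) * combinations

print(combinations, permutations)
```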

Assigning Probabilities
• Basic Requirements for Assigning Probabilities
1. The probability assigned to each experimental outcome must
be between 0 and 1, inclusively.
$0 \le P(E_i) \le 1$ for all i
where: $E_i$ is the ith experimental outcome and $P(E_i)$ is its probability

Assigning Probabilities
• Basic Requirements for Assigning Probabilities
2. The sum of the probabilities for all experimental outcomes must
equal 1.
P(E1) + P(E2) + . . . + P(En) = 1
where: n is the number of experimental outcomes

Probability puzzle-discussion and solution
 You are going by car to a small town in the
interior of India. You come to a four-road
junction where the direction indicator is
uprooted and lying down by the wayside.
 There are three roads leading away from the
junction.
 There is a board indicating the traffic in each
direction. You know the town you intend to
visit is more famous and gets the maximum
traffic.
 There is a man sitting alone on a bench near
the junction.
 What will you do to find the correct road to
your destination?
 Alternatives?
 1. Can follow traffic details. 2. Ask the old
man. 3. Choose one of the roads at random.
 Solve the puzzle.
Assigning Probabilities
• Classical Method
Assigning probabilities based on the assumption of equally likely
outcomes

• Relative Frequency Method


Assigning probabilities based on experimentation or historical data
• Subjective Method
Assigning probabilities based on judgment

Classical Method
• Example: Rolling a Die
If an experiment has n possible outcomes, the classical method will assign a probability
of 1/n to each outcome.

Experiment: Rolling a die


Sample Space: S = {1, 2, 3, 4, 5, 6}
Probabilities: Each sample point has a 1/6 chance of occurring

Relative Frequency Method
• Example: Lucas Tool Rental
Lucas Tool Rental would like to assign probabilities to the number of
car polishers it rents each day. Office records show the following
frequencies of daily rentals for the last 40 days.
Number of Number
Polishers Rented of Days
0 4
1 6
2 18
3 10
4 2
Relative Frequency Method
• Example: Lucas Tool Rental
Each probability assignment is given by dividing the frequency (number of days) by
the total frequency (total number of days).

Number of Number
Polishers Rented of Days Probability
0 4 .10 = 4/40
1 6 .15
2 18 .45
3 10 .25
4 2 .05
40 1.00
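
The relative frequency assignment amounts to dividing each count by the total. A short Python sketch (an assumed language; names are illustrative) using the Lucas Tool Rental records:

```python
# Number of polishers rented per day -> number of days observed.
days_by_rentals = {0: 4, 1: 6, 2: 18, 3: 10, 4: 2}

total_days = sum(days_by_rentals.values())
probabilities = {k: days / total_days for k, days in days_by_rentals.items()}

# Both basic requirements hold: each value is in [0, 1] and they sum to 1.
assert all(0 <= p <= 1 for p in probabilities.values())
assert abs(sum(probabilities.values()) - 1) < 1e-12

for rentals, p in probabilities.items():
    print(f"{rentals} polishers: {p:.2f}")
```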
Subjective Method
• When economic conditions or a company’s circumstances change rapidly
it might be inappropriate to assign probabilities based solely on
historical data.
• We can use any data available as well as our experience and intuition,
but ultimately a probability value should express our degree of belief
that the experimental outcome will occur.
• The best probability estimates often are obtained by combining the
estimates from the classical or relative frequency approach with the
subjective estimate.

Subjective Method
• Example: Bradley Investments
An analyst made the following probability estimates.

Exper. Outcome Net Gain or Loss Probability


(10, 8) $18,000 Gain .20
(10, -2) $8,000 Gain .08
(5, 8) $13,000 Gain .16
(5, -2) $3,000 Gain .26
(0, 8) $8,000 Gain .10
(0, -2) $2,000 Loss .12
(-20, 8) $12,000 Loss .02
(-20, -2) $22,000 Loss .06
1.00
Events and Their Probabilities
• An event is a collection of sample points.
• The probability of any event is equal to the sum of the
probabilities of the sample points in the event.
• If we can identify all the sample points of an experiment and
assign a probability to each, we can compute the probability of an
event.

Events and Their Probabilities
• Example: Bradley Investments
Event M = Markley Oil Profitable
M = {(10, 8), (10, -2), (5, 8), (5, -2)}
P(M) = P(10, 8) + P(10, -2) + P(5, 8) + P(5, -2)
= .20 + .08 + .16 + .26
= .70

Events and Their Probabilities
• Example: Bradley Investments
Event C = Collins Mining Profitable
C = {(10, 8), (5, 8), (0, 8), (-20, 8)}
P(C) = P(10, 8) + P(5, 8) + P(0, 8) + P(-20, 8)
= .20 + .16 + .10 + .02
= .48

Some Basic Relationships of Probability
 There are some basic probability relationships that can be used to compute the probability
of an event without knowledge of all the sample point probabilities.

Complement of an Event
Union of Two Events
Intersection of Two Events

Mutually Exclusive Events

Complement of an Event
• The complement of event A is defined to be the event consisting of
all sample points that are not in A.
• The complement of A is denoted by $A^c$.
[Venn diagram: sample space S containing event A; the region outside A is $A^c$]

Birthday puzzle
• In probability theory, the birthday problem or birthday paradox concerns the
probability that, in a set of N randomly chosen people, some pair of them will
have the same birthday.
• How many people do you need before the odds are good (greater than 50%) that
at least two of them share a birthday?
• What is the probability that two members of your class have the same birthday?
• Solution: Let’s figure the probability that no one shares a birthday and
subtract it from 1. It is calculated by counting all the ways that N people
won’t share a birthday and dividing by the number of possible birthdays they
could have.
• For example, two people could have 365×365 birthday combinations. That’s the
denominator. To count the numerator, imagine that the first person gets to
choose their birthday. They can pick from 365 days. The second person can also
pick their birthday, but can’t share a birthday with the first person. They’ve
got 364 days to choose from. So the chance that two people don’t share a
birthday is (365×364)/365². Subtract that from 1 and you get what you expect:
that there’s a 1 in 365 chance that two people share a birthday.
• For three people, the denominator is 365³ and the numerator is 365×364×363.
The probability that no two of N people share a birthday is:
P(N) = [365 × 364 × · · · × (365−N+1)] / 365^N
• The probability that at least two of them share a birthday is 1 − P(N).
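
The puzzle can be answered numerically by applying the complement rule for increasing group sizes. The Python sketch below (an assumed language) makes the usual simplifying assumptions of 365 equally likely birthdays and no leap years; the function name is hypothetical.

```python
def prob_shared_birthday(n_people, days=365):
    """P(at least two of n_people share a birthday) = 1 - P(no one shares)."""
    p_none_shared = 1.0
    for i in range(n_people):
        # The (i+1)-th person must avoid the i birthdays already taken.
        p_none_shared *= (days - i) / days
    return 1 - p_none_shared

# Smallest group size for which the probability exceeds 50%.
group_size = 1
while prob_shared_birthday(group_size) <= 0.5:
    group_size += 1

print(group_size, round(prob_shared_birthday(group_size), 4))
```

To answer the class question, call `prob_shared_birthday` with the number of students in the class.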
Union of Two Events
• The union of events A and B is the event containing all sample points
that are in A or B or both.
• The union of events A and B is denoted by A ∪ B.
[Venn diagram: sample space S with overlapping events A and B; the union is the
whole region covered by either event]

Union of Two Events
• Example: Bradley Investments
Event M = Markley Oil Profitable
Event C = Collins Mining Profitable
M ∪ C = Markley Oil Profitable or Collins Mining Profitable (or both)
M ∪ C = {(10, 8), (10, -2), (5, 8), (5, -2), (0, 8), (-20, 8)}
P(M ∪ C) = P(10, 8) + P(10, -2) + P(5, 8) + P(5, -2) + P(0, 8) + P(-20, 8)
= .20 + .08 + .16 + .26 + .10 + .02
= .82

Intersection of Two Events
• The intersection of events A and B is the set of all sample points that
are in both A and B.
• The intersection of events A and B is denoted by A ∩ B.
[Venn diagram: sample space S with overlapping events A and B; the intersection
of A and B is the overlapping region]

Intersection of Two Events
• Example: Bradley Investments
Event M = Markley Oil Profitable
Event C = Collins Mining Profitable
M ∩ C = Markley Oil Profitable and Collins Mining Profitable
M ∩ C = {(10, 8), (5, 8)}
P(M ∩ C) = P(10, 8) + P(5, 8)
= .20 + .16
= .36

Addition Law
• The addition law provides a way to compute the probability of
event A, or B, or both A and B occurring.
• The law is written as:
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

Addition Law
• Example: Bradley Investments
Event M = Markley Oil Profitable
Event C = Collins Mining Profitable
M ∪ C = Markley Oil Profitable or Collins Mining Profitable
We know: P(M) = .70, P(C) = .48, P(M ∩ C) = .36
Thus: P(M ∪ C) = P(M) + P(C) - P(M ∩ C)
= .70 + .48 - .36
= .82
(This result is the same as that obtained earlier
using the definition of the probability of an event.)
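
Events map naturally onto sets of sample points, so the union, the intersection, and the addition law can all be checked in a few lines. This is a Python sketch (an assumed language; names are illustrative) using the analyst's probability estimates for the eight Bradley outcomes.

```python
# Sample points (Markley Oil, Collins Mining) in $1000s with their probabilities.
prob = {
    (10, 8): 0.20, (10, -2): 0.08, (5, 8): 0.16, (5, -2): 0.26,
    (0, 8): 0.10, (0, -2): 0.12, (-20, 8): 0.02, (-20, -2): 0.06,
}

def p(event):
    """Probability of an event = sum of the probabilities of its sample points."""
    return sum(prob[point] for point in event)

M = {pt for pt in prob if pt[0] > 0}   # Markley Oil profitable
C = {pt for pt in prob if pt[1] > 0}   # Collins Mining profitable

p_union = p(M | C)          # set union models "M or C (or both)"
p_intersection = p(M & C)   # set intersection models "M and C"
p_addition_law = p(M) + p(C) - p_intersection

print(round(p(M), 2), round(p(C), 2), round(p_intersection, 2), round(p_union, 2))
```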
Mutually Exclusive Events
• Two events are said to be mutually exclusive if the events have no
sample points in common.
• Two events are mutually exclusive if, when one event occurs, the
other cannot occur.

[Venn diagram: sample space S with events A and B drawn as non-overlapping regions]

Mutually Exclusive Events
• If events A and B are mutually exclusive, P(A ∩ B) = 0.
• The addition law for mutually exclusive events is:
P(A ∪ B) = P(A) + P(B)

Conditional Probability
• The probability of an event given that another event has occurred is
called a conditional probability.
• The conditional probability of A given B is denoted by P(A|B).
• A conditional probability is computed as follows:
$P(A|B) = \frac{P(A \cap B)}{P(B)}$

Conditional Probability
• Example: Bradley Investments
Event M = Markley Oil Profitable
Event C = Collins Mining Profitable
P(C|M) = Collins Mining Profitable given Markley Oil Profitable
We know: P(M ∩ C) = .36, P(M) = .70
Thus: $P(C|M) = \frac{P(C \cap M)}{P(M)} = \frac{.36}{.70} = .5143$

Multiplication Law
• The multiplication law provides a way to compute the probability of
the intersection of two events.
• The law is written as:
P(A ∩ B) = P(B)P(A|B)

Multiplication Law
• Example: Bradley Investments
Event M = Markley Oil Profitable
Event C = Collins Mining Profitable
M ∩ C = Markley Oil Profitable and Collins Mining Profitable
We know: P(M) = .70, P(C|M) = .5143
Thus: P(M ∩ C) = P(M)P(C|M)
= (.70)(.5143)
= .36
(This result is the same as that obtained earlier
using the definition of the probability of an event.)

Joint Probability Table
                        Collins Mining
Markley Oil             Profitable (C)   Not Profitable (C^c)   Total
Profitable (M)               .36                .34              .70
Not Profitable (M^c)         .12                .18              .30
Total                        .48                .52             1.00

• Joint probabilities appear in the center of the table.
• Marginal probabilities appear in the margins of the table.

Independent Events
• If the probability of event A is not changed by the existence of
event B, we would say that events A and B are independent.
• Two events A and B are independent if:
P(A|B) = P(A) or P(B|A) = P(B)

Multiplication Law for Independent Events
• The multiplication law also can be used as a test to see if two events
are independent.
• The law is written as:
P(A ∩ B) = P(A)P(B)

Multiplication Law for Independent Events
• Example: Bradley Investments
Event M = Markley Oil Profitable
Event C = Collins Mining Profitable
Are events M and C independent?
Does P(M ∩ C) = P(M)P(C) ?
We know: P(M ∩ C) = .36, P(M) = .70, P(C) = .48
But: P(M)P(C) = (.70)(.48) = .34, not .36
Hence: M and C are not independent.
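
The conditional probability and the independence test can be sketched together. The Python snippet below (an assumed language; variable names are illustrative) starts from the three probabilities given for the Bradley example.

```python
import math

p_m = 0.70          # P(M): Markley Oil profitable
p_c = 0.48          # P(C): Collins Mining profitable
p_m_and_c = 0.36    # P(M and C): both profitable

# Conditional probability: P(C | M) = P(C and M) / P(M).
p_c_given_m = p_m_and_c / p_m

# Independence would require P(M and C) to equal P(M) * P(C).
product_of_marginals = p_m * p_c
independent = math.isclose(p_m_and_c, product_of_marginals)

print(round(p_c_given_m, 4), round(product_of_marginals, 3), independent)
```

Equivalently, P(C|M) differs from P(C) = .48, so knowing that Markley Oil is profitable changes the probability for Collins Mining.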

Mutual Exclusiveness and Independence
• Do not confuse the notion of mutually exclusive events with that
of independent events.
• Two events with nonzero probabilities cannot be both mutually
exclusive and independent.
• If one mutually exclusive event is known to occur, the other cannot occur;
thus, the probability of the other event occurring is reduced to zero (and
they are therefore dependent).
• Two events that are not mutually exclusive might or might not be independent.

A conversation in Deep Space 9 (S3, Ep15: Destiny)
• A short conversation between Chief Engineer O’Brien and Cardassian scientist Gilora.
• Gilora: What happened to these couplings?
• O’Brien: What? Oh, I made a few modifications.
• Gilora: Well, these relays don’t have as much carrying capacity as before. They
won’t be able to handle the signal load from the transceiver.
• O’Brien: Well, in order to bring the system up to Starfleet code, I had to take
out the couplings to make room for a secondary backup.
• Gilora: Starfleet code requires a second backup?
• O’Brien: In case the first one fails.
• Gilora: What are the chances that both the primary system and its backup fail at
the same time?
• O’Brien: Well, it is very unlikely, but in a crunch, I wouldn’t like to be caught
without a second backup.
An exercise in Reliability
• Reliability is complementary to probability of failure, i.e. ... For example,
if two components are arranged in parallel, each with reliability
R1 = R2 = 0.9, that is, F1 = F2 = 0.1, the resultant probability of failure is
F = 0.1 × 0.1 = 0.01. The resultant reliability is R = 1 – 0.01 = 0.99.
• Now let us understand the conversation in DS9.
• Primary and backup are arranged in parallel. Assume each with reliability
Rp = Rb = 0.98. Then Fp = Fb = 0.02. The resultant probability of failure is
F = 0.02 × 0.02 = 0.0004. The resultant reliability is R = 1 – 0.0004 = 0.9996.
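
The parallel-system arithmetic generalizes to any number of redundant components. The Python sketch below (an assumed language; the function name is hypothetical) multiplies the individual failure probabilities, which is valid only if the components fail independently.

```python
import math

def parallel_reliability(reliabilities):
    """Reliability of components in parallel, assuming independent failures.

    The system fails only if every component fails, so the failure
    probabilities multiply and the result is 1 minus that product.
    """
    system_failure = math.prod(1 - r for r in reliabilities)
    return 1 - system_failure

print(round(parallel_reliability([0.9, 0.9]), 4))     # two-component example
print(round(parallel_reliability([0.98, 0.98]), 4))   # primary plus one backup
```

Adding a third entry to the list models O’Brien’s second backup under the same independence assumption.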
Bayes’ Theorem
• Often, we begin probability analysis with initial or prior probabilities.
• Then, from a sample, special report, or a product test we obtain some
additional information.
• Given this information, we calculate revised or posterior probabilities.
• Bayes’ theorem provides the means for revising the prior probabilities.

Prior Probabilities → New Information → Application of Bayes’ Theorem → Posterior Probabilities

Bayes’ Theorem
• Example: L. S. Clothiers
A proposed shopping center will provide strong competition for downtown businesses
like L. S. Clothiers. If the shopping center is built, the owner of L. S. Clothiers feels it
would be best to relocate to the shopping center.

The shopping center cannot be built unless a zoning change is approved by the
town council. The planning board must first make a recommendation, for or
against the zoning change, to the council.

Prior Probabilities
• Example: L. S. Clothiers
Let:
A1 = town council approves the zoning change
A2 = town council disapproves the change
Using subjective judgment:
P(A1) = .7, P(A2) = .3

New Information
• Example: L. S. Clothiers
The planning board has recommended against the zoning change. Let B denote the
event of a negative recommendation by the planning board.

Given that B has occurred, should L. S. Clothiers revise the probabilities that
the town council will approve or disapprove the zoning change?

Conditional Probabilities
• Example: L. S. Clothiers
Past history with the planning board and the town council indicates
the following:
P(B|A1) = .2 and P(B|A2) = .9
Hence: $P(B^c|A_1) = .8$ and $P(B^c|A_2) = .1$

Tree Diagram
• Example: L. S. Clothiers
Town Council    Planning Board      Experimental Outcomes
P(A1) = .7      P(B|A1) = .2        P(A1 ∩ B) = .14
P(A1) = .7      P(B^c|A1) = .8      P(A1 ∩ B^c) = .56
P(A2) = .3      P(B|A2) = .9        P(A2 ∩ B) = .27
P(A2) = .3      P(B^c|A2) = .1      P(A2 ∩ B^c) = .03
                                    Total = 1.00

Bayes’ Theorem
• To find the posterior probability that event Ai will occur given that
event B has occurred, we apply Bayes’ theorem.

$P(A_i|B) = \frac{P(A_i)P(B|A_i)}{P(A_1)P(B|A_1) + P(A_2)P(B|A_2) + \cdots + P(A_n)P(B|A_n)}$

• Bayes’ theorem is applicable when the events for which we want to compute
posterior probabilities are mutually exclusive and their union is the entire
sample space.

Posterior Probabilities
• Example: L. S. Clothiers
Given the planning board’s recommendation not to approve the zoning change, we revise the
prior probabilities as follows:

$P(A_1|B) = \frac{P(A_1)P(B|A_1)}{P(A_1)P(B|A_1) + P(A_2)P(B|A_2)}$
$= \frac{(.7)(.2)}{(.7)(.2) + (.3)(.9)}$
$= .34$
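
The same revision can be carried out for both events at once. This Python sketch (an assumed language; dictionary names are illustrative) multiplies each prior by its conditional probability, sums the joint probabilities to get P(B), and divides.

```python
priors = {"A1": 0.7, "A2": 0.3}        # P(Ai): council approves / disapproves
likelihoods = {"A1": 0.2, "A2": 0.9}   # P(B | Ai): negative recommendation given Ai

# Joint probabilities P(Ai and B) via the multiplication law.
joint = {event: priors[event] * likelihoods[event] for event in priors}

# P(B): total probability of the new information.
p_b = sum(joint.values())

# Posterior probabilities P(Ai | B) via Bayes' theorem.
posteriors = {event: joint[event] / p_b for event in joint}

for event in priors:
    print(f"{event}: prior={priors[event]:.2f}, joint={joint[event]:.2f}, "
          f"posterior={posteriors[event]:.4f}")
```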

58
Posterior Probabilities
• Example: L. S. Clothiers
The planning board’s recommendation is good news for L. S.
Clothiers. The posterior probability of the town council approving
the zoning change is .34 compared to a prior probability of .70.

59
Bayes’ Theorem: Tabular Approach
• Example: L. S. Clothiers
• Step 1
Prepare the following three columns:
Column 1 - The mutually exclusive events for which posterior
probabilities are desired.
Column 2 - The prior probabilities for the events.
Column 3 - The conditional probabilities of the new information
given each event.

60
Bayes’ Theorem: Tabular Approach
• Example: L. S. Clothiers
• Step 1
(1)       (2)              (3)              (4)    (5)
          Prior            Conditional
Events    Probabilities    Probabilities
Ai        P(Ai)            P(B|Ai)
A1        .7               .2
A2        .3               .9
Total     1.0

61
Bayes’ Theorem: Tabular Approach
• Example: L. S. Clothiers
• Step 2
Prepare the fourth column:
Column 4
Compute the joint probabilities for each event and the new
information B by using the multiplication law.
Multiply the prior probabilities in column 2 by the corresponding
conditional probabilities in column 3. That is, P(Ai ∩ B) = P(Ai) P(B|Ai).

62
Bayes’ Theorem: Tabular Approach
• Example: L. S. Clothiers
• Step 2
(1)       (2)              (3)              (4)              (5)
          Prior            Conditional      Joint
Events    Probabilities    Probabilities    Probabilities
Ai        P(Ai)            P(B|Ai)          P(Ai ∩ B)
A1        .7               .2               .14 = .7(.2)
A2        .3               .9               .27 = .3(.9)
Total     1.0

63
Bayes’ Theorem: Tabular Approach
• Example: L. S. Clothiers
• Step 2 (continued)
We see that there is a .14 probability of the town council
approving the zoning change and a negative recommendation by
the planning board.
There is a .27 probability of the town council disapproving the
zoning change and a negative recommendation by the planning
board.

64
Bayes’ Theorem: Tabular Approach
• Example: L. S. Clothiers
• Step 3
Sum the joint probabilities in Column 4. The sum is the probability of the new information, P(B). The sum .14 + .27 shows an overall probability of .41 of a negative recommendation by the planning board.

65
Bayes’ Theorem: Tabular Approach
• Example: L. S. Clothiers
• Step 3
(1)       (2)              (3)              (4)              (5)
          Prior            Conditional      Joint
Events    Probabilities    Probabilities    Probabilities
Ai        P(Ai)            P(B|Ai)          P(Ai ∩ B)
A1        .7               .2               .14
A2        .3               .9               .27
Total     1.0                               P(B) = .41

66
Bayes’ Theorem: Tabular Approach
• Example: L. S. Clothiers
• Step 4
Prepare the fifth column:
Column 5
Compute the posterior probabilities using the basic relationship
of conditional probability.
$$P(A_i \mid B) = \frac{P(A_i \cap B)}{P(B)}$$

The joint probabilities P(Ai ∩ B) are in column 4 and the probability P(B) is the sum of column 4.

67
Bayes’ Theorem: Tabular Approach
• Example: L. S. Clothiers
• Step 4
(1)       (2)              (3)              (4)              (5)
          Prior            Conditional      Joint            Posterior
Events    Probabilities    Probabilities    Probabilities    Probabilities
Ai        P(Ai)            P(B|Ai)          P(Ai ∩ B)        P(Ai|B)
A1        .7               .2               .14              .3415 = .14/.41
A2        .3               .9               .27              .6585 = .27/.41
Total     1.0                               P(B) = .41       1.0000
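
The four steps of the tabular approach translate directly into a few lines of code. The sketch below is illustrative only: it uses Python's standard library, and the variable names (priors, conditionals, joints, posteriors) are assumptions chosen to mirror columns 2 to 5 of the table.

```python
# Tabular Bayes update for the L. S. Clothiers example.
priors = {"A1": 0.7, "A2": 0.3}        # column 2: P(Ai)
conditionals = {"A1": 0.2, "A2": 0.9}  # column 3: P(B | Ai)

# Column 4: joint probabilities P(Ai ∩ B) = P(Ai) * P(B | Ai)
joints = {event: priors[event] * conditionals[event] for event in priors}

# P(B) is the sum of column 4
p_b = sum(joints.values())

# Column 5: posterior probabilities P(Ai | B) = P(Ai ∩ B) / P(B)
posteriors = {event: joints[event] / p_b for event in joints}

for event in priors:
    print(f"{event}: prior={priors[event]:.2f}  joint={joints[event]:.2f}  "
          f"posterior={posteriors[event]:.4f}")
print(f"P(B) = {p_b:.2f}")
```

Adding a third event would only require one more entry in each of the two input dictionaries, provided the priors still sum to 1.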

68
Probability Distributions
Dr. Nilakantan Narasinganallur Ph.D.
Probability Distributions
• Random Variables
• Developing Discrete Probability Distributions
• Expected Value and Variance
• Binomial Probability Distribution
• Normal Probability Distribution

2
Random Variables
• A random variable is a numerical description of the outcome of an
experiment.
• A discrete random variable may assume either a finite number of values
or an infinite sequence of values.
• A continuous random variable may assume any numerical value in an
interval or collection of intervals.

3
Discrete Random Variable
with a Finite Number of Values
• Example: Auto Distributor
Let x = number of cars sold at the outlet in one day,
where x can take on 5 values (0, 1, 2, 3, 4)

We can count the cars sold, and there is a finite upper limit on the
number that might be sold (which is the number of cars in stock).

4
Discrete Random Variable
with an Infinite Sequence of Values
• Example: Auto Distributor
Let x = number of customers arriving in one day,
where x can take on the values 0, 1, 2, . . .

We can count the customers arriving, but there is no finite upper limit on the number that might arrive.

5
Random Variables
Question                       Random Variable x                    Type
Family size                    x = Number of dependents             Discrete
                                   reported on tax return
Distance from home to store    x = Distance in miles from           Continuous
                                   home to the store site
Own dog or cat                 x = 1 if own no pet;                 Discrete
                                 = 2 if own dog(s) only;
                                 = 3 if own cat(s) only;
                                 = 4 if own dog(s) and cat(s)

6
Discrete Probability Distributions
• The probability distribution for a random variable describes how probabilities are distributed over the values of the random variable.
• We can describe a discrete probability distribution with a table, graph, or formula.

7
Discrete Probability Distributions
• Two types of discrete probability distributions will be introduced.
• First type: uses the rules of assigning probabilities to experimental outcomes to determine probabilities for each value of the random variable.
• Second type: uses a special mathematical formula to compute the probabilities for each value of the random variable.

8
Discrete Probability Distributions

• The probability distribution is defined by a probability function, denoted by f(x), that provides the probability for each value of the random variable.

• The required conditions for a discrete probability function are: f(x) ≥ 0 and Σf(x) = 1

9
Discrete Probability Distributions

• There are three methods for assigning probabilities to random variables: the classical method, the subjective method, and the relative frequency method.
• The use of the relative frequency method to develop discrete probability distributions leads to what is called an empirical discrete distribution (example on next slide).

10
Discrete Probability Distributions
• Example: Auto Distributor
Using past data on car sales, a tabular representation of the probability distribution for sales was developed.
Units Sold    Number of Days    x    f(x)
0             80                0    .40 = 80/200
1             50                1    .25
2             40                2    .20
3             10                3    .05
4             20                4    .10
Total         200                    1.00
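
As a small illustration of the relative frequency method, the table above can be rebuilt from the raw day counts. This is a sketch in Python; the names days_by_units_sold and f are assumptions made for the example.

```python
# Relative frequency method: f(x) = (days with x cars sold) / (total days observed).
days_by_units_sold = {0: 80, 1: 50, 2: 40, 3: 10, 4: 20}

total_days = sum(days_by_units_sold.values())
f = {x: days / total_days for x, days in days_by_units_sold.items()}

# Required conditions for a discrete probability function.
assert all(p >= 0 for p in f.values())
assert abs(sum(f.values()) - 1.0) < 1e-12

for x, p in f.items():
    print(f"f({x}) = {p:.2f}")
```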

11
Discrete Probability Distributions
• Example: Auto Distributor

[Figure: graphical representation of the probability distribution. Horizontal axis: values of random variable x (car sales), 0 to 4. Vertical axis: probability, .10 to .50.]

12
Discrete Probability Distributions

• In addition to tables and graphs, a formula that gives the probability function, f(x), for every value of x is often used to describe a probability distribution.
• Several discrete probability distributions specified by formulas are the discrete-uniform, binomial, Poisson, and hypergeometric distributions.
• We will cover the binomial distribution.

13
Expected Value
• The expected value, or mean, of a random variable is a measure of its
central location.
E(x) = μ = Σ x f(x)
• The expected value is a weighted average of the values the random
variable may assume. The weights are the probabilities.
• The expected value does not have to be a value the random variable can
assume.

14
Variance and Standard Deviation
• The variance summarizes the variability in the values of a random variable.
Var(x) = σ² = Σ(x − μ)² f(x)
• The variance is a weighted average of the squared deviations of a random variable from its mean. The weights are the probabilities.
• The standard deviation, σ, is defined as the positive square root of the variance.


15
Expected Value
• Example: Auto distributor
x f(x) xf(x)
0 .40 .00
1 .25 .25
2 .20 .40
3 .05 .15
4 .10 .40
E(x) = 1.20 = expected number of cars sold in a day

16
Variance
• Example: Auto Distributor
x    x − μ    (x − μ)²    f(x)    (x − μ)² f(x)
0    -1.2     1.44        .40     .576
1    -0.2     0.04        .25     .010
2     0.8     0.64        .20     .128
3     1.8     3.24        .05     .162
4     2.8     7.84        .10     .784
Variance of daily sales = σ² = 1.660
Standard deviation of daily sales = σ = 1.2884 cars
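
The expected value and variance tables can be checked with a short calculation. The following is a minimal Python sketch; the dictionary f simply restates the auto distributor probabilities, and the variable names are assumptions for illustration.

```python
import math

# Probability function for daily car sales (auto distributor example).
f = {0: 0.40, 1: 0.25, 2: 0.20, 3: 0.05, 4: 0.10}

# E(x) = sum of x * f(x): a probability-weighted average of the values.
mean = sum(x * p for x, p in f.items())

# Var(x) = sum of (x - mean)^2 * f(x): weighted average of squared deviations.
variance = sum((x - mean) ** 2 * p for x, p in f.items())

# The standard deviation is the positive square root of the variance.
std_dev = math.sqrt(variance)

print(f"E(x) = {mean:.2f}, Var(x) = {variance:.3f}, sd = {std_dev:.4f}")
```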

17
Binomial Probability Distribution
• Four Properties of a Binomial Experiment
1. The experiment consists of a sequence of n identical trials.
2. Two outcomes, success and failure, are possible on each trial.
3. The probability of a success, denoted by p, does not change from trial to
trial. (This is referred to as the stationarity assumption.)
4. The trials are independent.

18
Binomial Probability Distribution

• Our interest is in the number of successes occurring in the n trials.
• We let x denote the number of successes occurring in the n trials.

19
Binomial Probability Distribution

• Binomial Probability Function


$$f(x) = \frac{n!}{x!\,(n-x)!}\; p^{x} (1-p)^{(n-x)}$$

where:
x = the number of successes
p = the probability of a success on one trial
n = the number of trials
f(x) = the probability of x successes in n trials
n! = n(n – 1)(n – 2) ….. (2)(1)

20
Binomial Probability Distribution
• Binomial Probability Function

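$$f(x) = \frac{n!}{x!\,(n-x)!}\; p^{x} (1-p)^{(n-x)}$$

The first factor, n!/(x!(n − x)!), is the number of experimental outcomes providing exactly x successes in n trials.
The second factor, p^x (1 − p)^(n − x), is the probability of a particular sequence of trial outcomes with x successes in n trials.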

21
Binomial Probability Distribution
• Example: Actis Hospital
Actis Hospital is concerned about a low retention rate for its employees. In recent years, management has seen a turnover of 10% of the hourly employees annually.
Thus, for any hourly employee chosen at random, management estimates a probability of 0.1 that the person will not be with the company next year.
Choosing 3 hourly employees at random, what is the probability that 1 of them will leave the company this year?

22
Binomial Probability Distribution
• Example: Actis Hospital
• The probability of the first employee leaving and the second and third employees staying, denoted (S, F, F), where a success S means the employee leaves and a failure F means the employee stays, is given by
p(1 – p)(1 – p)
• With a .10 probability of an employee leaving on any one trial, the probability of an employee leaving on the first trial and not on the second and third trials is given by
(.10)(.90)(.90) = (.10)(.90)² = .081


23
Binomial Probability Distribution
• Example: Actis Hospital
• Two other experimental outcomes result in one success and two
failures. The probabilities for all three experimental outcomes involving
one success follow.

Experimental Outcome    Probability of Experimental Outcome
(S, F, F)               p(1 – p)(1 – p) = (.1)(.9)(.9) = .081
(F, S, F)               (1 – p)p(1 – p) = (.9)(.1)(.9) = .081
(F, F, S)               (1 – p)(1 – p)p = (.9)(.9)(.1) = .081
                        Total = .243

24
Binomial Probability Distribution
• Example: Actis Hospital
Using the probability function:
Let: p = .10, n = 3, x = 1

$$f(x) = \frac{n!}{x!\,(n-x)!}\; p^{x} (1-p)^{(n-x)}$$
$$f(1) = \frac{3!}{1!\,(3-1)!}\,(0.1)^{1}(0.9)^{2} = 3(.1)(.81) = .243$$
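
The same calculation can be written as a small function. This sketch uses Python's standard library (math.comb gives n!/(x!(n − x)!)); the function name binomial_pmf is an assumption made for the example.

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials,
    each with success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Actis Hospital: n = 3 employees, p = .10 chance that each one leaves.
f1 = binomial_pmf(1, 3, 0.10)
print(f"f(1) = {f1:.3f}")

# The probabilities over x = 0, 1, 2, 3 should sum to 1.
total = sum(binomial_pmf(x, 3, 0.10) for x in range(4))
```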

25
Binomial Probability Distribution
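• Example: Actis Hospital
Tree diagram (L = leaves, S = stays):
1st Worker     2nd Worker     3rd Worker    x    Prob.
Leaves (.1)    Leaves (.1)    L (.1)        3    .0010
Leaves (.1)    Leaves (.1)    S (.9)        2    .0090
Leaves (.1)    Stays (.9)     L (.1)        2    .0090
Leaves (.1)    Stays (.9)     S (.9)        1    .0810
Stays (.9)     Leaves (.1)    L (.1)        2    .0090
Stays (.9)     Leaves (.1)    S (.9)        1    .0810
Stays (.9)     Stays (.9)     L (.1)        1    .0810
Stays (.9)     Stays (.9)     S (.9)        0    .7290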
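
The tree can also be enumerated by brute force, which makes it easy to see that the eight branches collapse into the binomial probabilities. This is an illustrative Python sketch; the labels "L" and "S" and the name distribution are assumptions for the example.

```python
from collections import defaultdict
from itertools import product

p_leave = 0.1  # probability that any one worker leaves

# Walk every branch of the tree: each of 3 workers either Leaves or Stays.
distribution = defaultdict(float)
for outcome in product(("L", "S"), repeat=3):
    prob = 1.0
    for worker in outcome:
        prob *= p_leave if worker == "L" else 1 - p_leave
    # x is the number of workers who leave on this branch.
    distribution[outcome.count("L")] += prob

for x in sorted(distribution):
    print(f"P(x = {x}) = {distribution[x]:.4f}")
```

The three branches with x = 1 contribute .0810 each, which is where the total of .243 comes from.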

26
Binomial Probabilities and Cumulative Probabilities
• Statisticians have developed tables that give probabilities and
cumulative probabilities for a binomial random variable.
• These tables can be found in some statistics textbooks.
• With modern calculators and the capability of statistical software
packages, such tables are almost unnecessary.

27
Binomial Probability Distribution
• Using Tables of Binomial Probabilities

p
n x .05 .10 .15 .20 .25 .30 .35 .40 .45 .50
3 0 .8574 .7290 .6141 .5120 .4219 .3430 .2746 .2160 .1664 .1250
1 .1354 .2430 .3251 .3840 .4219 .4410 .4436 .4320 .4084 .3750
2 .0071 .0270 .0574 .0960 .1406 .1890 .2389 .2880 .3341 .3750
3 .0001 .0010 .0034 .0080 .0156 .0270 .0429 .0640 .0911 .1250

28
Binomial Probability Distribution
• Expected Value
E(x) = μ = np

• Variance
Var(x) = σ² = np(1 – p)

• Standard Deviation
$$\sigma = \sqrt{np(1-p)}$$

29
Binomial Probability Distribution
• Example: Actis Hospital
• Expected Value
E(x) = np = 3(.1) = .3 employees out of 3

• Variance
Var(x) = np(1 – p) = 3(.1)(.9) = .27

• Standard Deviation
$$\sigma = \sqrt{3(.1)(.9)} = .52 \text{ employees}$$
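
A quick numerical check of the three formulas for the Actis Hospital values, sketched in Python (the variable names are assumptions for illustration):

```python
import math

n, p = 3, 0.1  # three employees, each with a .10 probability of leaving

expected_value = n * p            # E(x) = np
variance = n * p * (1 - p)        # Var(x) = np(1 - p)
std_dev = math.sqrt(variance)     # positive square root of the variance

print(f"E(x) = {expected_value:.1f}, Var(x) = {variance:.2f}, sd = {std_dev:.2f}")
```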

30
Continuous Probability Distributions
• A continuous random variable can assume any value in an interval on the real line or in a collection of intervals.
• It is not possible to talk about the probability of the random variable assuming a particular value.
• Instead, we talk about the probability of the random variable assuming a value within a given interval.


31
Continuous Probability Distributions
• The probability of the random variable assuming a value within some given
interval from x1 to x2 is defined to be the area under the graph of the
probability density function between x1 and x2.
[Figure: probability density functions f(x) for the uniform, normal, and exponential distributions, each with the interval from x1 to x2 marked on the x axis.]

32
Area as a Measure of Probability
• The area under the graph of f(x) and probability are identical.
• This is valid for all continuous random variables.
• The probability that x takes on a value between some lower value x1 and some higher value x2 can be found by computing the area under the graph of f(x) over the interval from x1 to x2.


33
Normal Probability Distribution
• The normal probability distribution is the most important
distribution for describing a continuous random variable.
• It is widely used in statistical inference.
• It has been used in a wide variety of applications including:
• Heights of people • Test scores
• Rainfall amounts • Scientific measurements

• Abraham de Moivre, a French mathematician, derived the normal curve in 1733 as an approximation to the binomial distribution; the result later appeared in his book The Doctrine of Chances.
• However, the normal distribution is usually credited to Gauss, who arrived at it independently.

34
Normal Probability Distribution
• Normal Probability Density Function
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-(x-\mu)^2 / 2\sigma^2}$$

where: μ = mean
σ = standard deviation
π = 3.14159
e = 2.71828

35
Normal Probability Distribution
• Characteristics
The distribution is symmetric; its skewness measure is zero.

36
Normal Probability Distribution
• Characteristics
The entire family of normal probability distributions is defined by its mean μ and its standard deviation σ.

[Figure: normal curve over the x axis, with the mean μ at the center and the standard deviation σ indicating the spread.]

37
Normal Probability Distribution
• Characteristics
The highest point on the normal curve is at the mean, which is also the
median and mode.

38
Normal Probability Distribution
• Characteristics
The mean can be any numerical value: negative, zero, or positive.

[Figure: normal curves with means of -10, 0, and 25 on the x axis.]

39
Normal Probability Distribution
• Characteristics
The standard deviation determines the width of the curve: larger values result in wider, flatter curves.

[Figure: normal curves with σ = 15 and σ = 25.]

40
Normal Probability Distribution
• Characteristics
Probabilities for the normal random variable are given by areas under the
curve. The total area under the curve is 1 (.5 to the left of the mean and
.5 to the right).

[Figure: normal curve over the x axis with an area of .5 on each side of the mean.]

41
Normal Probability Distribution
• Characteristics (basis for the empirical rule)

68.27% of values of a normal random variable are within +/- 1 standard deviation of its mean.
95.45% of values of a normal random variable are within +/- 2 standard deviations of its mean.
99.73% of values of a normal random variable are within +/- 3 standard deviations of its mean.
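
The areas behind the empirical rule can be computed directly from the standard normal cumulative distribution. A minimal sketch using Python's statistics.NormalDist (the dictionary name within is an assumption for the example); printed tables that round intermediate values can differ from these figures in the last digit.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area within k standard deviations of the mean: P(-k <= z <= k).
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within +/- {k} sd: {area:.4%}")
```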

42
Normal Probability Distribution
• Characteristics (basis for the empirical rule)
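[Figure: normal curve over the x axis marked at μ – 3σ, μ – 2σ, μ – 1σ, μ, μ + 1σ, μ + 2σ, and μ + 3σ, with 68.27% of the area within μ ± 1σ, 95.45% within μ ± 2σ, and 99.73% within μ ± 3σ.]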

43
Standard Normal Probability Distribution
• Characteristics
A random variable having a normal distribution with a mean of 0 and a
standard deviation of 1 is said to have a standard normal probability
distribution.

44
Standard Normal Probability Distribution
• Characteristics
The letter z is used to designate the standard normal random variable.

[Figure: standard normal curve centered at z = 0 with σ = 1.]

45
Standard Normal Probability Distribution
• Converting to the Standard Normal Distribution
$$z = \frac{x - \mu}{\sigma}$$

We can think of z as a measure of the number of standard deviations x is from μ.

46
Standard Normal Probability Distribution
• Example: Car Zone
Car Zone sells auto parts and supplies including a popular multi-grade
motor oil. When the stock of this oil drops to 20 gallons, a replenishment
order is placed.
The store manager is concerned that sales are being lost due to
stockouts while waiting for a replenishment order.

47
Standard Normal Probability Distribution
• Example: Car Zone
It has been determined that demand during replenishment lead-time is
normally distributed with a mean of 15 gallons and a standard deviation
of 6 gallons.
The manager would like to know the probability of a stockout during
replenishment lead-time. In other words, what is the probability that
demand during lead-time will exceed 20 gallons?
P(x > 20) = ?

48
Standard Normal Probability Distribution
• Solving for the Stockout Probability
Step 1: Convert x to the standard normal distribution.
z = (x − μ)/σ = (20 − 15)/6 = .83

Step 2: Find the area under the standard normal curve to the left of z = .83.
Use the appropriate formula in Excel.

49
Using Excel to Compute Normal Probabilities
• Excel has two functions for computing cumulative probabilities and x values for any normal distribution:
• NORM.DIST is used to compute the cumulative probability given an x value.
• NORM.INV is used to compute the x value given a cumulative probability.


50
Standard Normal Probability Distribution
• Solving for the Stockout Probability
Step 3: Compute the area under the standard normal
curve to the right of z = .83.

P(z > .83) = 1 – P(z < .83) = 1 – .7967 = .2033
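
For readers working outside Excel, the three steps can be reproduced with Python's statistics.NormalDist. This is a sketch under the example's stated assumptions (mean 15, standard deviation 6, reorder point 20); the variable names are illustrative. Rounding z to two decimals, as a printed table requires, gives a slightly different answer from the unrounded calculation.

```python
from statistics import NormalDist

# Demand during replenishment lead-time, in gallons.
demand = NormalDist(mu=15, sigma=6)
reorder_point = 20

# Step 1: convert x to z.
z = (reorder_point - demand.mean) / demand.stdev

# Steps 2 and 3 with z rounded to two decimals, as when reading a table.
p_stockout_table = 1 - NormalDist().cdf(round(z, 2))

# The same probability without rounding z.
p_stockout = 1 - demand.cdf(reorder_point)

print(f"z = {z:.2f}")
print(f"P(x > 20) with z rounded: {p_stockout_table:.4f}")
print(f"P(x > 20) unrounded:      {p_stockout:.4f}")
```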

51
Standard Normal Probability Distribution
• Solving for the Stockout Probability

[Figure: standard normal curve; the area to the left of z = .83 is .7967 and the area to the right is 1 – .7967 = .2033.]

52
Standard Normal Probability Distribution
If the manager of Car Zone wants the probability of a stockout during replenishment lead-time to be no more than .05, what should the reorder point be?

(Hint: Given a probability, we can use the standard normal table in an inverse fashion to find the corresponding z value.)
We will use the Excel function NORMSINV() or NORMINV() (NORM.S.INV() or NORM.INV() in current versions).


53
Standard Normal Probability Distribution
• Solving for the Reorder Point

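Step 1: Find the z value that cuts off an area of .05 in the right tail of the standard normal distribution: z.05 = 1.645.

[Figure: standard normal curve; area .9500 to the left of z.05 and area .0500 to the right.]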

54
Standard Normal Probability Distribution
• Solving for the Reorder Point
Step 2: Convert z.05 to the corresponding value of x.

x = μ + z.05 σ
= 15 + 1.645(6)
= 24.87 or 25

A reorder point of 25 gallons will place the probability of a stockout during lead time at (slightly less than) .05.
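
The inverse lookup can be sketched in Python with NormalDist.inv_cdf, which plays the same role as Excel's inverse normal functions. The variable names below are assumptions for illustration; rounding the reorder point up to a whole gallon keeps the stockout probability at or below the .05 target.

```python
import math
from statistics import NormalDist

demand = NormalDist(mu=15, sigma=6)  # lead-time demand in gallons

# Step 1: z value with an area of .05 in the upper tail (.95 to its left).
z_05 = NormalDist().inv_cdf(0.95)

# Step 2: convert z back to gallons.
reorder_point = demand.mean + z_05 * demand.stdev

# Round up to whole gallons and confirm the resulting stockout probability.
reorder_gallons = math.ceil(reorder_point)
p_stockout = 1 - demand.cdf(reorder_gallons)

print(f"z.05 = {z_05:.3f}, reorder point = {reorder_point:.2f} gallons")
print(f"P(stockout) at {reorder_gallons} gallons = {p_stockout:.4f}")
```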

55
Normal Probability Distribution
• Solving for the Reorder Point

[Figure: normal curve of lead-time demand x with mean 15; the probability of no stockout during replenishment lead-time (area to the left of x = 24.87) is .95 and the probability of a stockout (area to the right) is .05.]

56
Standard Normal Probability Distribution
• Solving for the Reorder Point
By raising the reorder point from 20 gallons to 25 gallons on hand, the probability of a stockout decreases from about .20 to .05. This is a significant decrease in the chance that Car Zone will be out of stock and unable to meet a customer’s desire to make a purchase.
Learnings?
Sampling Theory & Estimation
Dr. Nilakantan Narasinganallur Ph.D.
Review – Bayes’ theorem
1. The prior probabilities for events A1, A2, and A3 are P(A1) = .20, P(A2) = .50, and P(A3) = .30. The conditional probabilities of event B given A1, A2, and A3 are P(B | A1) = .50, P(B | A2) = .40, and P(B | A3) = .30.
• a. Compute P(B ∩ A1), P(B ∩ A2), and P(B ∩ A3).
• b. Apply Bayes’ theorem, equation (4.19), to compute the posterior probability P(A2 | B).
• c. Use the tabular approach to applying Bayes’ theorem to compute P(A1 | B), P(A2 | B), and P(A3 | B).
Ans: a. .10, .20, .09; b. .51; c. .26, .51, .23

2. A local bank reviewed its credit card policy with the intention of recalling some of its credit cards. In the past approximately 5% of cardholders defaulted, leaving the bank unable to collect the outstanding balance. Hence, management established a prior probability of .05 that any particular cardholder will default. The bank also found that the probability of missing a monthly payment is .20 for customers who do not default. Of course, the probability of missing a monthly payment for those who default is 1.
• a. Given that a customer missed one or more monthly payments, compute the posterior probability that the customer will default.
• b. The bank would like to recall its card if the probability that a customer will default is greater than .20. Should the bank recall its card if the customer misses a monthly payment? Why or why not?
Binomial distribution
Consider a binomial experiment with two trials and p = .4.
• a. Compute the probability of one success, f(1).
• b. Compute f(0).
• c. Compute f(2).
• d. Compute the probability of at least one success.
• e. Compute the expected value, variance, and standard deviation.
• Ans: f(1) = 0.48, f(0) = 0.36, f(2) = 0.16, P(x >= 1) = 0.64, E(x) = 0.8, V(x) = 0.48, s.d. = 0.6928.

• A Harris Interactive survey for InterContinental Hotels & Resorts asked respondents, “When traveling internationally, do you generally venture out on your own to experience culture, or stick with your tour group and itineraries?” The survey found that 23% of the respondents stick with their tour group (USA Today, January 21, 2004).
• a. In a sample of six international travelers, what is the probability that two will stick with their tour group?
• b. In a sample of six international travelers, what is the probability that at least two will stick with their tour group?
• c. In a sample of 10 international travelers, what is the probability that none will stick with the tour group?
• Ans: a. .2789, b. .4181, c. .0733
Normal Distribution
Given that z is a standard normal random variable, compute the following probabilities.
• a. P(−1.98 ≤ z ≤ .49)
• b. P(.52 ≤ z ≤ 1.22)
• c. P(−1.75 ≤ z ≤ −1.04)
• Ans: a. P(−1.98 ≤ z ≤ .49) = P(z ≤ .49) − P(z ≤ −1.98) = .6879 − .0239 = .6640
• b. P(.52 ≤ z ≤ 1.22) = P(z ≤ 1.22) − P(z ≤ .52) = .8888 − .6985 = .1903
• c. P(−1.75 ≤ z ≤ −1.04) = P(z ≤ −1.04) − P(z ≤ −1.75) = .1492 − .0401 = .1091

Given that z is a standard normal random variable, find z for each situation.
• a. The area to the left of z is .9750.
• b. The area between 0 and z is .4750.
• c. The area to the left of z is .7291.
• d. The area to the right of z is .1314.
• e. The area to the left of z is .6700.
• f. The area to the right of z is .3300.
• Ans: a. z = 1.96, b. z = 1.96, c. z = .61, d. z = 1.12, e. z = .44, f. z = .44

Trading volume on the New York Stock Exchange is heaviest during the first half hour (early morning) and last half hour (late afternoon) of the trading day. The early morning trading volumes (millions of shares) for 13 days in January and February are shown here (Barron’s, January 23, 2006; February 13, 2006; and February 27, 2006).
• 214 163 265 194 180 202 198 212 201 174 171 211 211
• The probability distribution of trading volume is approximately normal.
• a. Compute the mean and standard deviation to use as estimates of the population mean and standard deviation.
• b. What is the probability that, on a randomly selected day, the early morning trading volume will be less than 180 million shares?
• c. What is the probability that, on a randomly selected day, the early morning trading volume will exceed 230 million shares?
• d. How many shares would have to be traded for the early morning trading volume on a particular day to be among the busiest 5% of days?
• Ans: a. 200, 26.04, b. .2206, c. .1251, d. 242.84 million
Sampling and Sampling Distributions
• Selecting a Sample
• Point Estimation
• Introduction to Sampling Distributions
• Sampling Distribution of x̄
• Sampling Distribution of p̄
• Other Sampling Methods

5
Introduction
• An element is the entity on which data are collected.
• A population is a collection of all the elements of interest.
• A sample is a subset of the population.
• The sampled population is the population from which the sample is drawn.
• A frame is a list of the elements that the sample will be selected from.

6
Introduction
• The reason we select a sample is to collect data to answer a research question
about a population.
• The sample results provide only estimates of the values of the population
characteristics.
• The reason is simply that the sample contains only a portion of the
population.
• With proper sampling methods, the sample results can provide “good”
estimates of the population characteristics.

7
Selecting a Sample
• Sampling from a Finite Population
• Sampling from an Infinite Population

8
Sampling from a Finite Population
• Finite populations are often defined by lists such as:
• Organization membership roster
• Credit card account numbers
• Inventory product numbers
• A simple random sample of size n from a finite population of size N is a sample
selected such that each possible sample of size n has the same probability of
being selected.

9
Sampling from a Finite Population
• Replacing each sampled element before selecting subsequent elements is called sampling with replacement.
• Sampling without replacement is the procedure used most often.
• In large sampling projects, computer-generated random numbers are often used to automate the sample selection process.

10
Sampling from a Finite Population
• Example: Symbioses College
Symbioses College received 900 applications for admission in the upcoming year from prospective students. The applicants were numbered, from 1 to 900, as their applications arrived. The Director of Admissions would like to select a simple random sample of 30 applicants.

11
Sampling from a Finite Population
• Example: Symbioses College
Step 1: Assign a random number to each of the 900 applicants.
The random numbers generated by Excel’s RAND function follow a uniform probability distribution between 0 and 1.
Step 2: Select the 30 applicants corresponding to the 30 smallest random numbers.
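
The two-step procedure above can be sketched in Python, with random.Random standing in for Excel's RAND function. The seed value and the variable names are assumptions made so the example is reproducible; any seed would give a valid simple random sample.

```python
import random

rng = random.Random(42)   # fixed seed only so the example is repeatable
applicants = range(1, 901)  # applicants numbered 1 to 900

# Step 1: assign a uniform(0, 1) random number to each applicant.
random_numbers = {applicant: rng.random() for applicant in applicants}

# Step 2: take the 30 applicants with the 30 smallest random numbers.
sample = sorted(applicants, key=random_numbers.get)[:30]

print(sorted(sample))
```

Python's built-in random.sample(range(1, 901), 30) draws a simple random sample without replacement in a single call; the longer version is shown because it mirrors the slide's procedure.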
Sampling from an Infinite Population
• Sometimes we want to select a sample, but find it is not possible to obtain a list of all elements in the population.
• As a result, we cannot construct a frame for the population.
• Hence, we cannot use the random number selection procedure.
• Most often this situation occurs in infinite population cases.

13
Sampling from an Infinite Population
• Populations are often generated by an ongoing process where there is no upper
limit on the number of units that can be generated.
• Some examples of on-going processes, with infinite populations, are:
• parts being manufactured on a production line
• transactions occurring at a bank
• telephone calls arriving at a technical help desk
• customers entering a store

14
Sampling from an Infinite Population
• In the case of an infinite population, we must select a random sample in order
to make valid statistical inferences about the population from which the
sample is taken.
• A random sample from an infinite population is a sample selected such that
the following conditions are satisfied.
• Each element selected comes from the population of interest.
• Each element is selected independently.

15
Practice - Sampling from a finite population

1. Assume a finite population has 350 elements. Using the last three digits of each of the following five-digit random
numbers (e.g., 601, 022, 448, . . . ), determine the first four elements that will be selected for the simple random
sample.
• 98601 73022 83448 02147 34229 27553 84147 93289 14209
• Ans: 22, 147, 229, 289
2. Indicate which of the following situations involve sampling from a finite population and which involve sampling
from an infinite population. In cases where the sampled population is finite, describe how you would construct a
frame.
• a. Obtain a sample of licensed drivers in the state of New York.
• b. Obtain a sample of boxes of cereal produced by the Breakfast Choice company.
• c. Obtain a sample of cars crossing the Golden Gate Bridge on a typical weekday.
• d. Obtain a sample of students in a statistics course at Indiana University.
• e. Obtain a sample of the orders that are processed by a mail-order firm.
Ans: a. finite; b. infinite; c. infinite; d. finite; e. infinite
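
The selection rule in practice question 1 can be traced in code. This Python sketch assumes sampling without replacement, so a repeated element number is skipped; the random numbers are kept as strings to preserve the leading zero in 02147, and the variable names are illustrative.

```python
# Five-digit random numbers from practice question 1.
random_numbers = ["98601", "73022", "83448", "02147", "34229",
                  "27553", "84147", "93289", "14209"]
population_size = 350
sample_size = 4

selected = []
for number in random_numbers:
    element = int(number[-3:])  # use the last three digits
    # Keep only valid element numbers that have not been selected already.
    if 1 <= element <= population_size and element not in selected:
        selected.append(element)
    if len(selected) == sample_size:
        break

print(selected)
```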
Point Estimation
• Point estimation is a form of statistical inference.
• In point estimation we use the data from the sample to compute a value of a sample statistic that serves as an estimate of a population parameter.
• We refer to x̄ as the point estimator of the population mean μ.
• s is the point estimator of the population standard deviation σ.
• p̄ is the point estimator of the population proportion p.
17
Point Estimation
• Example: Symbioses College
Recall that Symbioses College received 900 applications from prospective
students. The application form contains a variety of information including the
individual’s Scholastic Aptitude Test (SAT) score and whether or not the
individual desires on-campus housing.
At a meeting in a few hours, the Director of Admissions would like to
announce the average SAT score and the proportion of applicants that want to
live on campus, for the population of 900 applicants.

18
Point Estimation
• Example: Symbioses College
However, the necessary data on the applicants have not yet been
entered in the college’s computerized database. So, the Director decides to
estimate the values of the population parameters of interest based on sample
statistics. The sample of 30 applicants is selected using computer-generated
random numbers.

19
Point Estimation
• x̄ as Point Estimator of μ
$$\bar{x} = \frac{\sum x_i}{30} = \frac{50{,}520}{30} = 1684$$
• s as Point Estimator of σ
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{29}} = \sqrt{\frac{210{,}512}{29}} = 85.2$$
• p̄ as Point Estimator of p
$$\bar{p} = 20/30 = .67$$

Note: Different random numbers would have identified a different sample which would have resulted in different point estimates.

20
Point Estimation
• Once all the data for the 900 applicants were entered in the college’s database,
the values of the population parameters of interest were calculated.
• Population Mean SAT Score
$$\mu = \frac{\sum x_i}{900} = 1697$$
• Population Standard Deviation for SAT Score
$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{900}} = 87.4$$
• Population Proportion Wanting On-Campus Housing
$$p = 648/900 = .72$$

21
Summary of Point Estimates
Obtained from a Simple Random Sample
Population Parameter                    Parameter Value    Point Estimator                       Point Estimate
μ = Population mean SAT score           1697               x̄ = Sample mean SAT score             1684
σ = Population std. deviation           87.4               s = Sample standard deviation         85.2
    for SAT score                                              for SAT score
p = Population proportion wanting       .72                p̄ = Sample proportion wanting         .67
    campus housing                                             campus housing

22
Practical Advice
• The target population is the population we want to make inferences about.

• The sampled population is the population from which the sample is actually
taken.
• Whenever a sample is used to make inferences about a population, we
should make sure that the targeted population and the sampled population
are in close agreement.

23
Practice – Point Estimation

3. A simple random sample of 5 months of sales data provided the following information:
• Month: 1 2 3 4 5
• Units Sold: 94 100 85 94 92
• a. Develop a point estimate of the population mean number of units sold per month.
• b. Develop a point estimate of the population standard deviation.
• Ans: a. sample mean = 93, b. sample sd = 5.39
4. A sample of 50 Fortune 500 companies (Fortune, April 14, 2003) showed 5 were based in New York, 6 in California, 2 in Minnesota, and 1 in Wisconsin.
• a. Develop an estimate of the proportion of Fortune 500 companies based in New York.
• b. Develop an estimate of the number of Fortune 500 companies based in Minnesota.
• c. Develop an estimate of the proportion of Fortune 500 companies that are not based in these four states.
• Ans: a. 0.10, b. 20, c. 0.72
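
The point estimates in practice questions 3 and 4 can be checked with Python's statistics module. In this sketch statistics.stdev divides by n − 1, which matches the sample standard deviation s; the variable names are assumptions for illustration.

```python
import statistics

# Question 3: five months of unit sales.
units_sold = [94, 100, 85, 94, 92]
x_bar = statistics.mean(units_sold)   # point estimate of the population mean
s = statistics.stdev(units_sold)      # sample standard deviation (n - 1 divisor)

# Question 4: 50 sampled companies out of the Fortune 500.
sample_size = 50
p_bar_new_york = 5 / sample_size              # proportion based in New York
minnesota_estimate = (2 / sample_size) * 500  # estimated count in Minnesota
p_bar_other = (sample_size - (5 + 6 + 2 + 1)) / sample_size

print(f"x-bar = {x_bar}, s = {s:.2f}")
print(f"New York: {p_bar_new_york:.2f}, Minnesota: {minnesota_estimate:.0f}, "
      f"other states: {p_bar_other:.2f}")
```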
Sampling Distribution of x̄
• Process of Statistical Inference
1. Population with mean μ = ?
2. A simple random sample of n elements is selected from the population.
3. The sample data provide a value for the sample mean x̄.
4. The value of x̄ is used to make inferences about the value of μ.

25
Sampling Distribution of x̄
• The sampling distribution of x̄ is the probability distribution of all possible values of the sample mean x̄.
• Expected Value of x̄
E(x̄) = μ
where: μ = the population mean
• When the expected value of the point estimator equals the population parameter, we say the point estimator is unbiased.

26
Sampling Distribution of x̄
• We will use the following notation to define the standard deviation of the sampling distribution of x̄.

$\sigma_{\bar{x}}$ = the standard deviation of x̄
σ = the standard deviation of the population
n = the sample size
N = the population size

27
• Standard Deviation of x̄
Finite population:
\[ \sigma_{\bar{x}} = \sqrt{\frac{N-n}{N-1}}\left(\frac{\sigma}{\sqrt{n}}\right) \]
Infinite population:
\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \]
• A finite population is treated as being infinite if n/N < .05.
• \(\sqrt{(N-n)/(N-1)}\) is the finite population correction factor.
• \(\sigma_{\bar{x}}\) is referred to as the standard error of the mean.
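The two standard error formulas and the n/N rule can be combined in one small helper. This is a sketch only: the function name `standard_error` and the numbers in the usage lines are hypothetical and chosen for illustration.

```python
import math

def standard_error(sigma, n, N=None):
    """Standard error of the mean; N=None means an infinite population."""
    se = sigma / math.sqrt(n)
    # Apply the finite population correction factor only when the sample
    # is not a negligible share of the population (n/N of .05 or more).
    if N is not None and n / N >= 0.05:
        se *= math.sqrt((N - n) / (N - 1))
    return se

# Hypothetical values: sigma = 10, n = 25.
print(standard_error(10, 25))            # infinite population
print(standard_error(10, 25, N=200))     # n/N = .125, correction applied
print(standard_error(10, 25, N=10_000))  # n/N = .0025, treated as infinite
```

The correction factor is always below 1 for n > 1, so the finite-population standard error is never larger than σ/√n.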
• When the population has a normal distribution, the sampling distribution of x̄ is normally distributed for any sample size.
• In most applications, the sampling distribution of x̄ can be approximated by a normal distribution whenever the sample size is 30 or more.
• In cases where the population is highly skewed or outliers are present, samples of size 50 may be needed.
• The sampling distribution of x̄ can be used to provide probability information about how close the sample mean x̄ is to the population mean μ.
Central Limit Theorem
• When the population from which we are selecting a random sample does not have a normal distribution, the central limit theorem is helpful in identifying the shape of the sampling distribution of x̄.

CENTRAL LIMIT THEOREM
In selecting random samples of size n from a population, the sampling distribution of the sample mean x̄ can be approximated by a normal distribution as the sample size becomes large.
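A small simulation can make the theorem concrete. The sketch below assumes a right-skewed exponential population with mean 1 and standard deviation 1 (an illustrative choice, not taken from the slides) and checks that the means of repeated samples of size 30 center on μ with spread close to σ/√n.

```python
import math
import random
import statistics

rng = random.Random(42)   # fixed seed so the sketch is reproducible

def sample_means(n, repetitions=2000):
    """Means of repeated samples of size n from a skewed (exponential) population."""
    return [statistics.mean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(repetitions)]

means_n30 = sample_means(30)
center = statistics.mean(means_n30)    # should be near the population mean, 1
spread = statistics.stdev(means_n30)   # should be near sigma / sqrt(n)

print(round(center, 3), round(spread, 3), round(1 / math.sqrt(30), 3))
```

Plotting a histogram of `means_n30` would show a roughly bell-shaped curve even though the underlying population is strongly skewed.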

Sampling Distribution of x̄
• Example: Symbioses College
[Figure: sampling distribution of x̄ for SAT scores, centered at E(x̄) = 1697]
\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{87.4}{\sqrt{30}} = 15.96 \]
• What is the probability that a simple random sample of 30 applicants will provide an estimate of the population mean SAT score that is within ±10 of the actual population mean μ?
• In other words, what is the probability that x̄ will be between 1687 and 1707?
Step 1: Calculate the z-value at the upper endpoint of the interval.
z = (1707 - 1697)/15.96 = .63
Step 2: Find the area under the curve to the left of the upper endpoint.
P(z < .63) = .7357
[Figure: sampling distribution of x̄ for SAT scores with \(\sigma_{\bar{x}}\) = 15.96; the area to the left of 1707 is .7357]
Step 3: Calculate the z-value at the lower endpoint of the interval.
z = (1687 - 1697)/15.96 = -.63
Step 4: Find the area under the curve to the left of the lower endpoint.
P(z < -.63) = .2643
[Figure: sampling distribution of x̄ for SAT scores with \(\sigma_{\bar{x}}\) = 15.96; the area to the left of 1687 is .2643]
Step 5: Calculate the area under the curve between the lower and upper endpoints of the interval.
P(-.63 < z < .63) = P(z < .63) - P(z < -.63)
= .7357 - .2643
= .4714
The probability that the sample mean SAT score will be between 1687 and 1707 is:
P(1687 < x̄ < 1707) = .4714
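The five steps can be reproduced without a z table. This is a minimal sketch using `statistics.NormalDist` from Python's standard library; because it does not round z to .63 before the look-up, it gives roughly .469 rather than the table-based .4714.

```python
import math
from statistics import NormalDist

mu, sigma, n = 1697, 87.4, 30                  # values from the Symbioses College example
se = sigma / math.sqrt(n)                      # standard error of the mean
sampling_dist = NormalDist(mu=mu, sigma=se)    # normal model for the sample mean

# Area between the lower and upper endpoints of the interval.
p_within_10 = sampling_dist.cdf(1707) - sampling_dist.cdf(1687)

print(round(se, 2), round(p_within_10, 4))
```

The small gap between the two answers comes only from rounding z to two decimals for the table look-up.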

[Figure: sampling distribution of x̄ for SAT scores with \(\sigma_{\bar{x}}\) = 15.96; the area between 1687 and 1707 is .4714]
Relationship Between the Sample Size and the Sampling Distribution of x̄
• Example: Symbioses College
• Suppose we select a simple random sample of 100 applicants instead of the 30 originally considered.
• E(x̄) = μ regardless of the sample size. In our example, E(x̄) remains at 1697.
• Whenever the sample size is increased, the standard error of the mean \(\sigma_{\bar{x}}\) is decreased. With the increase in the sample size to n = 100, the standard error of the mean is decreased from 15.96 to:
\[ \sigma_{\bar{x}} = \sqrt{\frac{N-n}{N-1}}\left(\frac{\sigma}{\sqrt{n}}\right) = \sqrt{\frac{900-100}{900-1}}\left(\frac{87.4}{\sqrt{100}}\right) = .9433(8.74) = 8.2 \]
• The finite population correction factor is applied here because n/N = 100/900 = .11 is not below .05.
[Figure: two sampling distributions of x̄ centered at E(x̄) = 1697; with n = 100, \(\sigma_{\bar{x}}\) = 8.2 (narrower curve); with n = 30, \(\sigma_{\bar{x}}\) = 15.96 (wider curve)]
• Recall that when n = 30, P(1687 < x̄ < 1707) = .4714.
• We follow the same steps to solve for P(1687 < x̄ < 1707) when n = 100 as we showed earlier when n = 30.
• Now, with n = 100, P(1687 < x̄ < 1707) = .7776.
• Because the sampling distribution with n = 100 has a smaller standard error, the values of x̄ have less variability and tend to be closer to the population mean than the values of x̄ with n = 30.
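The effect of the larger sample can be sketched in code as well. The helper name `prob_within` is hypothetical; N = 900 is the population size that appears in the standard error calculation for n = 100. Without rounding z, the results are about .469 and .775, close to the table-based .4714 and .7776.

```python
import math
from statistics import NormalDist

mu, sigma, N = 1697, 87.4, 900

def prob_within(n, margin=10):
    """P(mu - margin < sample mean < mu + margin) for a sample of size n."""
    se = sigma / math.sqrt(n)
    if n / N >= 0.05:                        # finite population correction needed
        se *= math.sqrt((N - n) / (N - 1))
    dist = NormalDist(mu, se)
    return dist.cdf(mu + margin) - dist.cdf(mu - margin)

p30 = prob_within(30)
p100 = prob_within(100)
print(round(p30, 4), round(p100, 4))
```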

[Figure: sampling distribution of x̄ for SAT scores with \(\sigma_{\bar{x}}\) = 8.2; the area between 1687 and 1707 is .7776]
Illustration of CLT
Notes and comments
• 1. While discussing the sampling distribution of the mean for the Symbioses College problem, we used the values of the population mean μ = 1697 and the population standard deviation σ = 87.4, which were known. However, usually the values of the population mean μ and the population standard deviation σ that are needed to determine the sampling distribution of x̄ will be unknown. Later, we will study how the sample mean x̄ and the sample standard deviation s are used when μ and σ are unknown.
• 2. The theoretical proof of the central limit theorem requires independent
observations in the sample. This condition is met for infinite populations and for
finite populations where sampling is done with replacement. Although the central
limit theorem does not directly address sampling without replacement from finite
populations, general statistical practice applies the findings of the central limit
theorem when the population size is large.
Other Sampling Methods
• Stratified Random Sampling
• Cluster Sampling
• Systematic Sampling
• Convenience Sampling
• Judgment Sampling

Stratified Random Sampling
• The population is first divided into groups of elements called strata.
• Each element in the population belongs to one and only one stratum.
• Best results are obtained when the elements within each stratum are as much alike as possible (i.e. a homogeneous group).
• A simple random sample is taken from each stratum.
• Formulas are available for combining the stratum sample results into one population parameter estimate.
• Advantage: If strata are homogeneous, this method is as “precise” as simple random sampling but with a smaller total sample size.
• Example: The basis for forming the strata might be department, location, age, industry type, and so on.
Cluster Sampling
• The population is first divided into separate groups of elements called clusters.
• Ideally, each cluster is a representative small-scale version of the population (i.e. heterogeneous group).
• A simple random sample of the clusters is then taken.
• All elements within each sampled (chosen) cluster form the sample.
• Example: A primary application is area sampling, where clusters are city blocks or other well-defined areas.
• Advantage: The close proximity of elements can be cost effective (i.e. many sample observations can be obtained in a short time).
• Disadvantage: This method generally requires a larger total sample size than simple or stratified random sampling.
Systematic Sampling
• If a sample size of n is desired from a population containing N elements, we might sample one element for every N/n elements in the population.
• We randomly select one of the first N/n elements from the population list.
• We then select every (N/n)th element that follows in the population list.
• This method has the properties of a simple random sample, especially if the list of the population elements is a random ordering.
• Advantage: The sample usually will be easier to identify than it would be if simple random sampling were used.
• Example: Selecting every 100th listing in a telephone book after the first randomly selected listing.
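A systematic sample can be sketched as a random start followed by a fixed step. The function name and the 1,000-entry list below are hypothetical stand-ins for a population list such as a telephone directory; the sketch assumes n is no larger than N and rounds the interval N/n down to a whole number.

```python
import random

def systematic_sample(population, n, rng=random):
    """Pick every k-th element after a random start, with k = N // n."""
    k = len(population) // n          # sampling interval N/n, rounded down
    start = rng.randrange(k)          # random position among the first k elements
    return [population[start + i * k] for i in range(n)]

listing = list(range(1, 1001))        # hypothetical list of 1,000 entries
sample = systematic_sample(listing, 10, random.Random(7))
print(sample)
```

With N = 1,000 and n = 10 the interval is 100, which matches the "every 100th listing" example.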

Convenience Sampling
• It is a nonprobability sampling technique. Items are included in the sample without known probabilities of being selected.
• The sample is identified primarily by convenience.
• Example: A professor conducting research might use student volunteers to constitute a sample.
• Advantage: Sample selection and data collection are relatively easy.
• Disadvantage: It is impossible to determine how representative of the population the sample is.
Judgment Sampling
• The person most knowledgeable on the subject of the study selects elements of the population that he or she feels are most representative of the population.
• It is a nonprobability sampling technique.
• Example: A reporter might sample three or four senators, judging them as reflecting the general opinion of the senate.
• Advantage: It is a relatively easy way of selecting a sample.
• Disadvantage: The quality of the sample results depends on the judgment of the person selecting the sample.
Recommendation
• It is recommended that probability sampling methods (simple random, stratified, cluster, or systematic) be used.
• For these methods, formulas are available for evaluating the “goodness” of the sample results in terms of the closeness of the results to the population parameters being estimated.
• An evaluation of the goodness cannot be made with non-probability (convenience or judgment) sampling methods.
Random sampling in Excel
• If a list of the elements in a population is available in an Excel file, Excel can be used to select a simple random
sample. For example, a list of the top 100 metropolitan areas in the United States and Canada is provided in column
A of the data set MetAreas (Places Rated Almanac—The Millennium Edition 2000). Column B contains the overall
rating of each metropolitan area. The first 10 metropolitan areas in the data set and their corresponding ratings are
shown in Table 7.6. Assume that you would like to select a simple random sample of 30 metropolitan areas in order
to do an in-depth study of the cost of living in the United States and Canada.
• In the MetAreas data set, labels are in row 1 and the 100 metropolitan areas are in rows 2 to 101. The following steps
can be used to select a simple random sample of 30 metropolitan areas.
• Step 1. Enter =RAND() in cell C2
• Step 2. Copy cell C2 to cells C3:C101
• Step 3. Select any cell in Column C
• Step 4. Click the Home tab on the Ribbon
• Step 5. In the Editing group, click Sort & Filter
• Step 6. Click Sort Smallest to Largest
• The random sample of 30 metropolitan areas appears in rows 2 to 31 of the reordered data set. The random numbers
in column C are no longer necessary and can be deleted if desired.
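The same sort-by-random-number idea can be sketched outside Excel. The area names below are hypothetical placeholders for column A of the MetAreas data set, which is not reproduced here.

```python
import random

rng = random.Random(2000)                             # seeded so the sketch is reproducible
areas = [f"Metro area {i}" for i in range(1, 101)]    # placeholder for the 100 names in column A

keyed = [(rng.random(), area) for area in areas]      # Steps 1-2: one random number per row
keyed.sort()                                          # Steps 3-6: sort smallest to largest
sample = [area for _, area in keyed[:30]]             # the first 30 rows form the sample

print(sample[:5])
```

In Python, `random.sample(areas, 30)` draws a simple random sample directly; the longer version is shown only because it mirrors the Excel steps.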
Case discussion - MeadWestvaco Corporation (1/2)
• MeadWestvaco Corporation, a leading producer of packaging, coated and specialty papers, consumer and office products, and specialty chemicals, employs more than 30,000 people. It operates worldwide in 29 countries and serves customers located in approximately 100 countries. MeadWestvaco holds a leading position in paper production, with an annual capacity of 1.8 million tons. The company’s products include textbook paper, glossy magazine paper, beverage packaging systems, and office products. MeadWestvaco’s internal consulting group uses sampling to provide a variety of information that enables the company to obtain significant productivity benefits and remain competitive.
• For example, MeadWestvaco maintains large woodland holdings, which supply the trees, or raw material, for many of the company’s products. Managers need reliable and accurate information about the timberlands and forests to evaluate the company’s ability to meet its future raw material needs. What is the present volume in the forests? What is the past growth of the forests? What is the projected future growth of the forests? With answers to these important questions MeadWestvaco’s managers can develop plans for the future, including long-term planting and harvesting schedules for the trees.
• How does MeadWestvaco obtain the information it needs about its vast forest holdings? Data collected from sample plots throughout the forests are the basis for learning about the population of trees owned by the company. To identify the sample plots, the timberland holdings are first divided into three sections based on location and types of trees. Using maps and random numbers, MeadWestvaco analysts identify random samples of 1/5- to 1/7-acre plots in each section of the forest.
• MeadWestvaco foresters collect data from these sample plots to learn about the forest population. Foresters throughout the organization participate in the field data collection process. Periodically, two-person teams gather information on each tree in every sample plot. The sample data are entered into the company’s continuous forest inventory (CFI) computer system. Reports from the CFI system include a number of frequency distribution summaries containing statistics on types of trees, present forest volume, past forest growth rates, and projected future forest growth and volume. Sampling and the associated statistical summaries of the sample data provide the reports essential for the effective management of MeadWestvaco’s forests and timberlands.
Case discussion - MeadWestvaco Corporation (2/2)
Based on the above description, discuss the following:
a. Simple random sampling procedure
b. Sample selection process
c. Estimation of parameters with sample statistics
Estimation
Dr. Nilakantan Narasinganallur Ph.D.
Interval Estimation
• Population Mean: σ Known
• Population Mean: σ Unknown
• Determining the Sample Size
• Population Proportion
Margin of Error and the Interval Estimate
• A point estimator cannot be expected to provide the exact value of the population parameter.
• An interval estimate can be computed by adding and subtracting a margin of error to the point estimate.
Point Estimate ± Margin of Error
• The purpose of an interval estimate is to provide information about how close the point estimate is to the value of the parameter.
• The general form of an interval estimate of a population mean is
x̄ ± Margin of Error
Interval Estimate of a Population Mean: σ Known
• In order to develop an interval estimate of a population mean, the margin of error must be computed using either:
• the population standard deviation σ, or
• the sample standard deviation s
• σ is rarely known exactly, but often a good estimate can be obtained based on historical data or other information.
• We refer to such cases as the σ known case.
• There is a 1 - α probability that the value of a sample mean will provide a margin of error of \(z_{\alpha/2}\sigma_{\bar{x}}\) or less.
[Figure: sampling distribution of x̄. The central region contains 1 - α of all x̄ values and extends \(z_{\alpha/2}\sigma_{\bar{x}}\) on either side of μ, with area α/2 in each tail. An interval built around an x̄ in the central region includes μ; an interval built around an x̄ in either tail does not include μ.]
• Interval Estimate of μ
\[ \bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \]
where: x̄ is the sample mean
1 - α is the confidence coefficient
\(z_{\alpha/2}\) is the z value providing an area of α/2 in the upper tail of the standard normal probability distribution
σ is the population standard deviation
n is the sample size
• Values of \(z_{\alpha/2}\) for the Most Commonly Used Confidence Levels

Confidence Level    α      α/2     Table Look-up Area    z(α/2)
90%                 .10    .05     .9500                 1.645
95%                 .05    .025    .9750                 1.960
99%                 .01    .005    .9950                 2.576
Meaning of Confidence
• Because 90% of all the intervals constructed using \(\bar{x} \pm 1.645\sigma_{\bar{x}}\) will contain the population mean, we say we are 90% confident that the interval \(\bar{x} \pm 1.645\sigma_{\bar{x}}\) includes the population mean μ.
• We say that this interval has been established at the 90% confidence level.
• The value .90 is referred to as the confidence coefficient.
Interval Estimate of a Population Mean: σ Known
• Example: Discount Sounds
Discount Sounds has 260 retail outlets throughout the United States. The firm is evaluating a potential location for a new outlet, based in part on the mean annual income of the individuals in the marketing area of the new location.
• A sample of size n = 36 was taken; the sample mean income is $41,100. The population is not believed to be highly skewed. The population standard deviation is estimated to be $4,500, and the confidence coefficient to be used in the interval estimate is .95.
• Example: Discount Sounds
95% of the sample means that can be observed are within \(\pm 1.96\sigma_{\bar{x}}\) of the population mean μ. The margin of error is:
\[ z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 1.96\left(\frac{4{,}500}{\sqrt{36}}\right) = 1{,}470 \]
Thus, at 95% confidence, the margin of error is $1,470.
Interval estimate of μ is:
$41,100 ± $1,470
or
$39,630 to $42,570

We are 95% confident that the interval contains the population mean.
• Example: Discount Sounds

Confidence Level    Margin of Error    Interval Estimate
90%                 1,234              39,866 to 42,334
95%                 1,470              39,630 to 42,570
99%                 1,932              39,168 to 43,032

In order to have a higher degree of confidence, the margin of error and thus the width of the confidence interval must be larger.
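The three rows of the Discount Sounds table can be regenerated from the interval formula. This is a sketch under the slide's assumptions (x̄ = 41,100, σ = 4,500, n = 36); the helper name `z_interval` is illustrative, and `NormalDist().inv_cdf` stands in for the z table look-up.

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence):
    """Margin of error and interval estimate of the mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z value with area alpha/2 in the upper tail
    margin = z * sigma / math.sqrt(n)
    return margin, (x_bar - margin, x_bar + margin)

for level in (0.90, 0.95, 0.99):
    margin, (low, high) = z_interval(41_100, 4_500, 36, level)
    print(f"{level:.0%}: margin {margin:,.0f}, interval {low:,.0f} to {high:,.0f}")
```

Raising the confidence level increases z and therefore widens the interval, which is the trade-off the table illustrates.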

Interval Estimate of a Population Mean: σ Known
• Adequate Sample Size
• In most applications, a sample size of n = 30 is adequate.
• If the population distribution is highly skewed or contains outliers, a sample size of 50 or more is recommended.
• If the population is not normally distributed but is roughly symmetric, a sample size as small as 15 will suffice.
• If the population is believed to be at least approximately normal, a sample size of less than 15 can be used.
Practice – interval estimation of μ, σ known
1. A simple random sample of 50 items from a population with σ = 6 resulted in a sample mean of 32.
• a. Provide a 90% confidence interval for the population mean.
• b. Provide a 95% confidence interval for the population mean.
• c. Provide a 99% confidence interval for the population mean.
• Ans: a. 30.6 to 33.4, b. 30.34 to 33.66, c. 29.81 to 34.19.
2. A 95% confidence interval for a population mean was reported to be 152 to 160. If σ = 15, what sample size
was used in this study?
• Ans: 54.
Interval Estimate of a Population Mean: σ Unknown
• If an estimate of the population standard deviation σ cannot be developed prior to sampling, we use the sample standard deviation s to estimate σ.
• This is the σ unknown case.
• In this case, the interval estimate for μ is based on the t distribution.
• (We’ll assume for now that the population is normally distributed.)
t Distribution
• William Gosset, writing under the name “Student”, is the founder of the t distribution.
• Gosset was an Oxford graduate in mathematics and worked for the Guinness Brewery in Dublin.
• He developed the t distribution while working on small-scale materials and temperature experiments.
t Distribution
• The t distribution is a family of similar probability distributions.
• A specific t distribution depends on a parameter known as the degrees of
freedom.
• Degrees of freedom refer to the number of independent pieces of
information that go into the computation of s.

• A t distribution with more degrees of freedom has less dispersion.
• As the degrees of freedom increases, the difference between the t distribution and the standard normal probability distribution becomes smaller and smaller.
[Figure: standard normal distribution compared with t distributions with 20 and 10 degrees of freedom, all centered at 0 on the z, t axis]
t Distribution
• For more than 100 degrees of freedom, the standard normal z value
provides a good approximation to the t value.
• The standard normal z values can be found in the infinite degrees (∞ ) row
of the t distribution table.

t Distribution

Degrees               Area in Upper Tail
of Freedom    .20     .10     .05     .025    .01     .005
.             .       .       .       .       .       .
50            .849    1.299   1.676   2.009   2.403   2.678
60            .848    1.296   1.671   2.000   2.390   2.660
80            .846    1.292   1.664   1.990   2.374   2.639
100           .845    1.290   1.660   1.984   2.364   2.626
∞             .842    1.282   1.645   1.960   2.326   2.576
(bottom row is standard normal z values)
Interval Estimate of a Population Mean: σ Unknown
• Interval Estimate
\[ \bar{x} \pm t_{\alpha/2}\frac{s}{\sqrt{n}} \]
where: x̄ = the sample mean
1 - α = the confidence coefficient
\(t_{\alpha/2}\) = the t value providing an area of α/2 in the upper tail of a t distribution with n - 1 degrees of freedom
s = the sample standard deviation
n = the sample size
• Example: Apartment Rents
A reporter for a student newspaper is writing an article on the cost of off-campus housing. A sample of 16 one-bedroom apartments within a half-mile of campus resulted in a sample mean of $750 per month and a sample standard deviation of $55.
Let us provide a 95% confidence interval estimate of the mean rent per month for the population of one-bedroom apartments within a half-mile of campus. We will assume this population to be normally distributed.
• At 95% confidence, α = .05, and α/2 = .025.
• \(t_{.025}\) is based on n - 1 = 16 - 1 = 15 degrees of freedom.

Degrees               Area in Upper Tail
of Freedom    .20     .10     .05     .025    .01     .005
15            .866    1.341   1.753   2.131   2.602   2.947
16            .865    1.337   1.746   2.120   2.583   2.921
17            .863    1.333   1.740   2.110   2.567   2.898
18            .862    1.330   1.734   2.101   2.552   2.878
19            .861    1.328   1.729   2.093   2.539   2.861
.             .       .       .       .       .       .
• Interval Estimate
\[ \bar{x} \pm t_{.025}\frac{s}{\sqrt{n}} = 750 \pm 2.131\left(\frac{55}{\sqrt{16}}\right) = 750 \pm 29.30 \]
We are 95% confident that the mean rent per month for the population of one-bedroom apartments within a half-mile of campus is between $720.70 and $779.30.
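The apartment rent interval can be checked with a few lines of code. Python's standard library has no t quantile function, so this sketch takes the table value t = 2.131 (15 degrees of freedom, upper-tail area .025) as an input; the helper name `t_interval` is illustrative.

```python
import math

def t_interval(x_bar, s, n, t_value):
    """Interval estimate x_bar ± t * s / sqrt(n); t_value is read from a t table."""
    margin = t_value * s / math.sqrt(n)
    return margin, (x_bar - margin, x_bar + margin)

# Apartment Rents example: n = 16, sample mean 750, sample standard deviation 55.
margin, (low, high) = t_interval(750, 55, 16, t_value=2.131)
print(f"margin {margin:.2f}, interval {low:.2f} to {high:.2f}")
```

If a statistics package is available, the t value can be computed from the degrees of freedom instead of being typed in from the table.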

Interval Estimate of a Population Mean: σ Unknown
• Adequate Sample Size
• Usually, a sample size of n = 30 is adequate when using the expression \(\bar{x} \pm t_{\alpha/2}\, s/\sqrt{n}\) to develop an interval estimate of a population mean.
• If the population distribution is highly skewed or contains outliers, a sample size of 50 or more is recommended.
• If the population is not normally distributed but is roughly symmetric, a sample size as small as 15 will suffice.
• If the population is believed to be at least approximately normal, a sample size of less than 15 can be used.
Practice – interval estimation of μ, σ unknown

3. The following sample data are from a normal population: 10, 8, 12, 15, 13, 11, 6, 5.

• a. What is the point estimate of the population mean?

• b. What is the point estimate of the population standard deviation?

• c. With 95% confidence, what is the margin of error for the estimation of the population mean?

• d. What is the 95% confidence interval for the population mean?

• Ans: a. 10, b. 3.464, c. 2.9, d. 7.1 to 12.9

4. A simple random sample with n = 54 provided a sample mean of 22.5 and a sample standard deviation of 4.4.

• a. Develop a 90% confidence interval for the population mean.

• b. Develop a 95% confidence interval for the population mean.

• c. Develop a 99% confidence interval for the population mean.

• d. What happens to the margin of error and the confidence interval as the confidence level is increased?

• Ans: a. 21.5 to 23.5, b. 21.3 to 23.7, c. 20.9 to 24.1, d. a larger margin of error and wider interval.
Summary of Interval Estimation Procedures for a Population Mean
• Can the population standard deviation σ be assumed known?
• Yes (σ known case): use \(\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\)
• No (σ unknown case): use the sample standard deviation s to estimate σ, then use \(\bar{x} \pm t_{\alpha/2}\frac{s}{\sqrt{n}}\)
Sample Size for an Interval Estimate of a Population Mean
• Let E = the desired margin of error.
• E is the amount added to and subtracted from the point estimate to obtain
an interval estimate.
• If a desired margin of error is selected prior to sampling, the sample size
necessary to satisfy the margin of error can be determined.

• Margin of Error
\[ E = z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \]
• Necessary Sample Size
\[ n = \frac{(z_{\alpha/2})^2\sigma^2}{E^2} \]
• The Necessary Sample Size equation requires a value for the population standard deviation σ.
• If σ is unknown, a preliminary or planning value for σ can be used in the equation.
1. Use the estimate of the population standard deviation computed in a previous study.
2. Use a pilot study to select a preliminary sample and use the sample standard deviation from that sample.
3. Use judgment or a “best guess” for the value of σ.
Sample Size for an Interval Estimate of a Population Mean
• Example: Discount Sounds
Recall that Discount Sounds is evaluating a potential location
for a new retail outlet, based in part, on the mean annual income
of the individuals in the marketing area of the new location.
Suppose that Discount Sounds’ management team wants an estimate of
the population mean such that there is a .95 probability that the sampling
error is $500 or less.
How large a sample size is needed to meet the required precision?

\[ E = z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 500 \]
At 95% confidence, \(z_{.025}\) = 1.96. Recall that σ = 4,500.
\[ n = \frac{(1.96)^2(4{,}500)^2}{(500)^2} = 311.17 \approx 312 \]
A sample of size 312 is needed to reach a desired precision of ± $500 at 95% confidence.
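The sample size formula, including the rule of rounding up to the next whole number, can be sketched as a small function. The name `required_sample_size` is illustrative; `NormalDist().inv_cdf` replaces the z table look-up.

```python
import math
from statistics import NormalDist

def required_sample_size(sigma, margin, confidence=0.95):
    """Smallest n whose margin of error is at most `margin` for a planning value sigma."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin) ** 2)   # always round up, never down

# Discount Sounds: planning value sigma = 4,500, desired margin of error 500.
n_needed = required_sample_size(4_500, 500)
print(n_needed)
```

Rounding up matters: a sample of 311 would give a margin of error slightly larger than $500.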

Practice – sample size estimation
5. The range for a set of data is estimated to be 36.
• a. What is the planning value for the population standard deviation?
• b. At 95% confidence, how large a sample would provide a margin of error of 3?
• c. At 95% confidence, how large a sample would provide a margin of error of 2?
Ans: a. planning value for σ = range/4 = 36/4 = 9. b. n = 35. c. n = 78.
6. The average cost of a gallon of unleaded gasoline in Greater Cincinnati was reported to be $2.41 (The Cincinnati Enquirer, February 3, 2006). During periods of rapidly
changing prices, the newspaper samples service stations and prepares reports on gasoline prices frequently. Assume the standard deviation is $.15 for the price of a gallon of
unleaded regular gasoline, and recommend the appropriate sample size for the newspaper to use if they wish to report a margin of error at 95% confidence.
• a. Suppose the desired margin of error is $.07.
• b. Suppose the desired margin of error is $.05.
• c. Suppose the desired margin of error is $.03.
Ans: a. 18, b. 35, c. 97.
Food Lion
• Founded in 1957 as Food Town, Food Lion is one of the largest supermarket chains in the United States, with 1300 stores in 11 Southeastern and Mid-Atlantic states. The company sells more than 24,000 different products and offers nationally and regionally advertised brand-name merchandise, as well as a growing number of high-quality private label products manufactured especially for Food Lion. The company maintains its low price leadership and quality assurance through operating efficiencies such as standard store formats, innovative warehouse design, energy-efficient facilities, and data synchronization with suppliers.
• Food Lion looks to a future of continued innovation, growth, price leadership, and service to its customers. Being in an inventory-intense business, Food Lion made the decision to adopt the LIFO (last-in, first-out) method of inventory valuation. This method matches current costs against current revenues, which minimizes the effect of radical price changes on profit and loss results.
• In addition, the LIFO method reduces net income thereby reducing income taxes during periods of inflation. Food Lion establishes a LIFO index for each of seven inventory pools: Grocery, Paper/Household, Pet Supplies, Health & Beauty Aids, Dairy, Cigarette/Tobacco, and Beer/Wine. For example, a LIFO index of 1.008 for the Grocery pool would indicate that the company’s grocery inventory value at current costs reflects a 0.8% increase due to inflation over the most recent one-year period.
• A LIFO index for each inventory pool requires that the year-end inventory count for each product be valued at the current year-end cost and at the preceding year-end cost. To avoid excessive time and expense associated with counting the inventory in all 1200 store locations, Food Lion selects a random sample of 50 stores. Year-end physical inventories are taken in each of the sample stores. The current-year and preceding-year costs for each item are then used to construct the required LIFO indexes for each inventory pool.
• For a recent year, the sample estimate of the LIFO index for the Health & Beauty Aids inventory pool was 1.015. Using a 95% confidence level, Food Lion computed a margin of error of .006 for the sample estimate. Thus, the interval from 1.009 to 1.021 provided a 95% confidence interval estimate of the population LIFO index. This level of precision was judged to be very good.
• Based on the above case description, discuss the following:
a. Margin of error
b. Constructing interval estimation
c. Interpret interval estimates
Learnings?
Hypothesis Testing
Dr. Nilakantan Narasinganallur Ph.D.
Review problems – discussion
• 7.42 Many newspapers and private organisations conduct surveys on colleges and rank them according to their performance in their ranking methodology (example: Education World rankings 2021-22, https://www.educationworld.in/ew-india-higher-education-rankings-2021-22/). In this issue EW has presented league tables ranking the country’s best private autonomous, government autonomous and non-autonomous colleges, and Top 100 private engineering colleges.
• You would like to take a sample of 30 colleges out of 100 for a follow-up study of their students. Numbering these colleges from 1 to 100, use the RAND function in Excel and select the random sample of 30 colleges for your study.
• Ans: Discussion.
• 7.44 BusinessWeek surveyed MBA alumni 10 years after graduation (BusinessWeek, September 22, 2003). One finding was that alumni spend an average of $115.50 per week eating out socially. You have been asked to conduct a follow-up study by taking a sample of 40 of these MBA alumni. Assume the population standard deviation is $35.
• a. Show the sampling distribution of x̄, the sample mean weekly expenditure for the 40 MBA alumni.
• b. What is the probability the sample mean will be within $10 of the population mean?
• c. Suppose you find a sample mean of $100. What is the probability of finding a sample mean of $100 or less? Would you consider this sample to be an unusually low spending group of alumni? Why or why not?
• Ans: a. Normal with E(x̄) = 115.50 and standard deviation = 5.53, b. .9298, c. .0026
Hypothesis Testing
• Developing Null and Alternative Hypotheses
• Type I and Type II Errors
• Population Mean: σ Known
• Population Mean: σ Unknown
Hypothesis Testing
• Hypothesis testing can be used to determine whether a statement about the value of a population parameter should or should not be rejected.
• The null hypothesis, denoted by H0, is a tentative assumption about a population parameter.
• The alternative hypothesis, denoted by Ha, is the opposite of what is stated in the null hypothesis.
• The hypothesis testing procedure uses data from a sample to test the two competing statements indicated by H0 and Ha.
Developing Null and Alternative Hypotheses
• It is not always obvious how the null and alternative hypotheses should be formulated.
• Care must be taken to structure the hypotheses appropriately so that the test conclusion provides the information the researcher wants.
• The context of the situation is very important in determining how the hypotheses should be stated.
• In some cases it is easier to identify the alternative hypothesis first. In other cases the null is easier.
• Correct hypothesis formulation will take practice.
Developing Null and Alternative Hypotheses
• Alternative Hypothesis as a Research Hypothesis
• Many applications of hypothesis testing involve an attempt to gather
evidence in support of a research hypothesis.
• In such cases, it is often best to begin with the alternative hypothesis and
make it the conclusion that the researcher hopes to support.
• The conclusion that the research hypothesis is true is made if the sample
data provide sufficient evidence to show that the null hypothesis can be
rejected.

Developing Null and Alternative Hypotheses
• Alternative Hypothesis as a Research Hypothesis
• Example: A new teaching method is developed that is believed to be better than the current method.
• Alternative Hypothesis: The new teaching method is better.
• Null Hypothesis: The new method is no better than the old method.
Developing Null and Alternative Hypotheses
• Alternative Hypothesis as a Research Hypothesis
• Example: A new sales force bonus plan is developed in an attempt to increase sales.
• Alternative Hypothesis: The new bonus plan increases sales.
• Null Hypothesis: The new bonus plan does not increase sales.
Developing Null and Alternative Hypotheses
• Alternative Hypothesis as a Research Hypothesis
• Example: A new drug is developed with the goal of lowering blood pressure more than the existing drug.
• Alternative Hypothesis: The new drug lowers blood pressure more than the existing drug.
• Null Hypothesis: The new drug does not lower blood pressure more than the existing drug.
Developing Null and Alternative Hypotheses
• Null Hypothesis as an Assumption to be Challenged
• We might begin with a belief or assumption that a statement about the
value of a population parameter is true.

• We then use a hypothesis test to challenge the assumption and determine if there is statistical evidence to conclude that the assumption is incorrect.
• In these situations, it is helpful to develop the null hypothesis first.

Developing Null and Alternative Hypotheses
• Null Hypothesis as an Assumption to be Challenged
• Example: The label on a soft drink bottle states that it contains 67.6 fluid ounces.
• Null Hypothesis: The label is correct. μ ≥ 67.6 ounces.
• Alternative Hypothesis: The label is incorrect. μ < 67.6 ounces.
Summary of Forms for Null and Alternative Hypotheses
about a Population Mean
• The equality part of the hypotheses always appears in the null hypothesis.
• In general, a hypothesis test about the value of a population mean μ must take one of the following three forms (where μ0 is the hypothesized value of the population mean).

H0: μ ≥ μ0, Ha: μ < μ0 (one-tailed, lower-tail)
H0: μ ≤ μ0, Ha: μ > μ0 (one-tailed, upper-tail)
H0: μ = μ0, Ha: μ ≠ μ0 (two-tailed)
Null and Alternative Hypotheses
• Example: Metro EMS
A major west coast city provides one of the most comprehensive
emergency medical services in the world. Operating in a multiple hospital
system with approximately 20 mobile medical units, the service goal is to
respond to medical emergencies with a mean time of 12 minutes or less.
The director of medical services wants to formulate a hypothesis test that
could use a sample of emergency response times to determine whether or
not the service goal of 12 minutes or less is being achieved.

Null and Alternative Hypotheses
H0: μ ≤ 12 The emergency service is meeting the response goal; no follow-up action is necessary.
Ha: μ > 12 The emergency service is not meeting the response goal; appropriate follow-up action is necessary.
where: μ = mean response time for the population of medical emergency requests
Type I Error
• Because hypothesis tests are based on sample data, we must allow for the possibility of errors.
• A Type I error is rejecting H0 when it is true.
• The probability of making a Type I error when the null hypothesis is true as an equality is called the level of significance.
• Applications of hypothesis testing that only control the Type I error are often called significance tests.

Type II Error
• A Type II error is accepting H0 when it is false.
• It is difficult to control for the probability of making a Type II error.
• Statisticians avoid the risk of making a Type II error by using “do not reject H0” and not “accept H0”.
Type I and Type II Errors
Conclusion | Population Condition: H0 True (μ ≤ 12) | Population Condition: H0 False (μ > 12)
Accept H0 (Conclude μ ≤ 12) | Correct Decision | Type II Error
Reject H0 (Conclude μ > 12) | Type I Error | Correct Decision
p-Value Approach to One-Tailed Hypothesis Testing
• The p-value is the probability, computed using the test statistic, that measures the support (or lack of support) provided by the sample for the null hypothesis.
• If the p-value is less than or equal to the level of significance α, the value of the test statistic is in the rejection region.
• Reject H0 if the p-value ≤ α.

Suggested Guidelines for Interpreting p-Values
• Less than .01: Overwhelming evidence to conclude Ha is true.
• Between .01 and .05: Strong evidence to conclude Ha is true.
• Between .05 and .10: Weak evidence to conclude Ha is true.
• Greater than .10: Insufficient evidence to conclude Ha is true.
Lower-Tailed Test About a Population Mean: σ Known
• p-Value Approach
[Figure: sampling distribution of z = (x̄ − μ0)/(σ/√n). The lower-tail area α = .10 lies below −zα = −1.28; the test statistic z = −1.46 gives p-value = .0721. Because p-value < α, reject H0.]
Upper-Tailed Test About a Population Mean: σ Known
• p-Value Approach
[Figure: sampling distribution of z = (x̄ − μ0)/(σ/√n). The upper-tail area α = .04 lies above zα = 1.75; the test statistic z = 2.29 gives p-value = .011. Because p-value < α, reject H0.]
Critical Value Approach to One-Tailed Hypothesis Testing
• The test statistic z has a standard normal probability distribution.
• We can use the standard normal probability distribution table to find the z-value with an area of α in the lower (or upper) tail of the distribution.
• The value of the test statistic that establishes the boundary of the rejection region is called the critical value for the test.
• The rejection rule is:
• Lower tail: Reject H0 if z ≤ −zα
• Upper tail: Reject H0 if z ≥ zα
Lower-Tailed Test About a Population Mean: σ Known
• Critical Value Approach
[Figure: sampling distribution of z = (x̄ − μ0)/(σ/√n). With α = .10, the rejection region is the lower tail below −zα = −1.28; do not reject H0 elsewhere.]
Upper-Tailed Test About a Population Mean: σ Known
• Critical Value Approach
[Figure: sampling distribution of z = (x̄ − μ0)/(σ/√n). With α = .05, the rejection region is the upper tail above zα = 1.645; do not reject H0 elsewhere.]
Steps of Hypothesis Testing
Step 1. Develop the null and alternative hypotheses.
Step 2. Specify the level of significance α.
Step 3. Collect the sample data and compute the value of the test statistic.

p-Value Approach
Step 4. Use the value of the test statistic to compute the p-value.
Step 5. Reject H0 if p-value ≤ α.

Critical Value Approach
Step 4. Use the level of significance α to determine the critical value and the rejection rule.
Step 5. Use the value of the test statistic and the rejection rule to determine whether to reject H0.
One-Tailed Tests About a Population Mean: σ Known
• Example: Metro EMS
The response times for a random sample of 40 medical emergencies were
tabulated. The sample mean is 13.25 minutes. The population standard
deviation is believed to be 3.2 minutes.
The EMS director wants to perform a hypothesis test, with a .05 level of
significance, to determine whether the service goal of 12 minutes or less is
being achieved.

One-Tailed Tests About a Population Mean: σ Known
• p-Value and Critical Value Approaches
1. Develop the hypotheses. H0: μ ≤ 12, Ha: μ > 12
2. Specify the level of significance. α = .05
3. Compute the value of the test statistic.
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{13.25 - 12}{3.2/\sqrt{40}} = 2.47$$
One-Tailed Tests About a Population Mean: σ Known
• p-Value Approach
4. Compute the p-value.
For z = 2.47, cumulative probability = .9932.
p-value = 1 - .9932 = .0068
5. Determine whether to reject H0.
Because p-value = .0068 < α = .05, we reject H0. There is sufficient statistical evidence to infer that Metro EMS is not meeting the response goal of 12 minutes.
One-Tailed Tests About a Population Mean: σ Known
• p-Value Approach
[Figure: sampling distribution of z = (x̄ − μ0)/(σ/√n). The upper-tail area α = .05 lies above zα = 1.645; the test statistic z = 2.47 gives p-value = .0068. Because p-value < α, reject H0.]
One-Tailed Tests About a Population Mean: σ Known
• Critical Value Approach
4. Determine the critical value and rejection rule.
For α = .05, z.05 = 1.645
Reject H0 if z ≥ 1.645
5. Determine whether to reject H0.
Because 2.47 > 1.645, we reject H0. There is sufficient statistical evidence to infer that Metro EMS is not meeting the response goal of 12 minutes.
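The Metro EMS test can be reproduced in a few lines. This is a minimal sketch using Python's standard-library `statistics.NormalDist`; the variable names are illustrative, and the inputs are the values given in the example.

```python
from math import sqrt
from statistics import NormalDist

# Metro EMS: H0: mu <= 12 versus Ha: mu > 12 (upper-tail test, sigma known)
xbar, mu0, sigma, n, alpha = 13.25, 12.0, 3.2, 40, 0.05

z = (xbar - mu0) / (sigma / sqrt(n))       # test statistic
p_value = 1 - NormalDist().cdf(z)          # upper-tail area
z_crit = NormalDist().inv_cdf(1 - alpha)   # critical value z_alpha

reject = p_value <= alpha                  # same decision as z >= z_crit
print(round(z, 2), round(p_value, 4), round(z_crit, 3), reject)
```

Both approaches lead to the same decision because p-value ≤ α exactly when z ≥ zα.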

p-Value Approach to Two-Tailed Hypothesis Testing
• Compute the p-value using the following three steps:
1. Compute the value of the test statistic z.
2. If z is in the upper tail (z > 0), compute the probability that z is greater than or equal to the value of the test statistic. If z is in the lower tail (z < 0), compute the probability that z is less than or equal to the value of the test statistic.
3. Double the tail area obtained in step 2 to obtain the p-value.
• The rejection rule: Reject H0 if the p-value ≤ α.
Critical Value Approach to Two-Tailed Hypothesis Testing
• The critical values will occur in both the lower and upper tails of the standard normal curve.
• Use the standard normal probability distribution table to find zα/2 (the z-value with an area of α/2 in the upper tail of the distribution).
• The rejection rule is: Reject H0 if z ≤ −zα/2 or z ≥ zα/2.
Two-Tailed Tests About a Population Mean: σ Known
• Example: Glow Toothpaste
The production line for Glow toothpaste is designed to fill tubes with a
mean weight of 6 oz. Periodically, a sample of 30 tubes will be selected in
order to check the filling process.
Quality assurance procedures call for the continuation of the filling
process if the sample results are consistent with the assumption that the
mean filling weight for the population of toothpaste tubes is 6 oz.; otherwise
the process will be adjusted.

Two-Tailed Tests About a Population Mean: σ Known
• Example: Glow Toothpaste
Assume that a sample of 30 toothpaste tubes provides a sample mean of
6.1 oz. The population standard deviation is believed to be 0.2 oz.
Perform a hypothesis test, at the .03 level of significance, to help
determine whether the filling process should continue operating or be
stopped and corrected.

Two-Tailed Tests About a Population Mean: σ Known
• p-Value and Critical Value Approaches
1. Determine the hypotheses. H0: μ = 6, Ha: μ ≠ 6
2. Specify the level of significance. α = .03
3. Compute the value of the test statistic.
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{6.1 - 6}{0.2/\sqrt{30}} = 2.74$$
Two-Tailed Tests About a Population Mean: σ Known
• p-Value Approach
4. Compute the p-value.
For z = (6.1 - 6)/(0.2/√30) = 2.74, cumulative probability = .9969
p-value = 2(1 - .9969) = .0062
5. Determine whether to reject H0.
Because p-value = .0062 < α = .03, we reject H0. There is sufficient statistical evidence to infer that the alternative hypothesis is true (i.e. the mean filling weight is not 6 ounces).
Two-Tailed Tests About a Population Mean: σ Known
• Critical Value Approach
4. Determine the critical value and rejection rule.
For α/2 = .03/2 = .015, z.015 = 2.17
Reject H0 if z ≤ -2.17 or z ≥ 2.17
5. Determine whether to reject H0.
Because z = (6.1 - 6)/(0.2/√30) = 2.74 > 2.17, we reject H0. There is sufficient statistical evidence to infer that the alternative hypothesis is true (i.e. the mean filling weight is not 6 ounces).
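Recomputing the Glow Toothpaste test directly from the stated data (x̄ = 6.1 oz, σ = 0.2 oz, n = 30, α = .03) is a useful check on the arithmetic. The sketch below uses only the standard library; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Glow Toothpaste: H0: mu = 6 versus Ha: mu != 6 (two-tailed, sigma known)
xbar, mu0, sigma, n, alpha = 6.1, 6.0, 0.2, 30, 0.03

z = (xbar - mu0) / (sigma / sqrt(n))            # test statistic
tail_area = 1 - NormalDist().cdf(abs(z))        # area beyond |z| in one tail
p_value = 2 * tail_area                         # double it for a two-tailed test
z_crit = NormalDist().inv_cdf(1 - alpha / 2)    # z_{alpha/2}

reject = p_value <= alpha                       # equivalently |z| >= z_crit
print(round(z, 2), round(p_value, 4), round(z_crit, 2), reject)
```

With these inputs the test statistic is about 2.74, which falls in the upper rejection region, so the filling process would be stopped and adjusted.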

Two-Tailed Tests About a Population Mean: σ Known
• Critical Value Approach
[Figure: sampling distribution of z = (x̄ − μ0)/(σ/√n). Rejection regions of area α/2 = .015 lie below −2.17 and above 2.17; do not reject H0 between them.]
Confidence Interval Approach to Two-Tailed Tests About a Population Mean
• Select a simple random sample from the population and use the value of the sample mean x̄ to develop the confidence interval for the population mean μ. (Confidence intervals are covered in Chapter 8.)
• If the confidence interval contains the hypothesized value μ0, do not reject H0. Otherwise, reject H0. (Actually, H0 should be rejected if μ0 happens to be equal to one of the end points of the confidence interval.)
Confidence Interval Approach to Two-Tailed Tests About a Population Mean
• The 97% confidence interval for μ is
$$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 6.1 \pm 2.17\left(\frac{.2}{\sqrt{30}}\right) = 6.1 \pm .07924$$
or 6.02076 to 6.17924
• Because the hypothesized value for the population mean, μ0 = 6, is not in this interval, the hypothesis-testing conclusion is that the null hypothesis, H0: μ = 6, can be rejected.
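The interval and the resulting decision can be verified with a short calculation. This is a sketch with illustrative variable names, using the example's values and the standard library only.

```python
from math import sqrt
from statistics import NormalDist

# 97% confidence interval for the mean filling weight (sigma known)
xbar, mu0, sigma, n, alpha = 6.1, 6.0, 0.2, 30, 0.03

z_half = NormalDist().inv_cdf(1 - alpha / 2)    # z_{alpha/2}, about 2.17
margin = z_half * sigma / sqrt(n)               # margin of error
lower, upper = xbar - margin, xbar + margin

# Reject H0 when the hypothesized mean lies outside the interval
reject = not (lower <= mu0 <= upper)
print(round(lower, 4), round(upper, 4), reject)
```

Since 6 lies below the lower limit of roughly 6.02, the interval approach rejects H0, in agreement with a test statistic of z = 2.74.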

Tests About a Population Mean: σ Unknown
• Test Statistic:
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
• This test statistic has a t distribution with n - 1 degrees of freedom.

Tests About a Population Mean: σ Unknown
• Rejection Rule: p-Value Approach
Reject H0 if p-value ≤ α
• Rejection Rule: Critical Value Approach
H0: μ ≥ μ0 Reject H0 if t ≤ −tα
H0: μ ≤ μ0 Reject H0 if t ≥ tα
H0: μ = μ0 Reject H0 if t ≤ −tα/2 or t ≥ tα/2
p -Values and the t Distribution
• The format of the t distribution table provided in most statistics textbooks does
not have sufficient detail to determine the exact p-value for a hypothesis test.
• However, we can still use the t distribution table to identify a range for the p-
value.
• An advantage of computer software packages is that the computer output will
provide the p-value for the t distribution.

Example: Highway Patrol
• One-Tailed Test About a Population Mean: σ Unknown
A State Highway Patrol periodically samples vehicle speeds at various locations on a particular roadway. The sample of vehicle speeds is used to test the hypothesis H0: μ ≤ 65.
The locations where H0 is rejected are deemed the best locations for radar traps. At Location F, a sample of 64 vehicles shows a mean speed of 66.2 mph with a standard deviation of 4.2 mph. Use α = .05 to test the hypothesis.
One-Tailed Test About a Population Mean: σ Unknown
• p-Value and Critical Value Approaches
1. Determine the hypotheses. H0: μ ≤ 65, Ha: μ > 65
2. Specify the level of significance. α = .05
3. Compute the value of the test statistic.
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{66.2 - 65}{4.2/\sqrt{64}} = 2.286$$
One-Tailed Test About a Population Mean: σ Unknown
• p-Value Approach
4. Compute the p-value.
For t = 2.286, the p-value must be less than .025 (for t = 1.998) and greater than .01 (for t = 2.387).
.01 < p-value < .025
5. Determine whether to reject H0.
Because p-value < α = .05, we reject H0. We are at least 95% confident that the mean speed of vehicles at Location F is greater than 65 mph.
One-Tailed Test About a Population Mean: σ Unknown
• Critical Value Approach
4. Determine the critical value and rejection rule.
For α = .05 and d.f. = 64 - 1 = 63, t.05 = 1.669
Reject H0 if t ≥ 1.669
5. Determine whether to reject H0.
Because 2.286 > 1.669, we reject H0. We are at least 95% confident that the mean speed of vehicles at Location F is greater than 65 mph. Location F is a good candidate for a radar trap.
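The Highway Patrol test statistic is easy to recompute. Python's standard library has no t distribution, so this sketch takes the critical value 1.669 from the t table (as the example does) instead of computing it; variable names are illustrative.

```python
from math import sqrt

# Highway Patrol, Location F: H0: mu <= 65 versus Ha: mu > 65 (sigma unknown)
xbar, mu0, s, n = 66.2, 65.0, 4.2, 64

t = (xbar - mu0) / (s / sqrt(n))   # test statistic with n - 1 = 63 d.f.
df = n - 1
t_crit = 1.669                     # t_.05 for 63 d.f., read from the t table

reject = t >= t_crit               # upper-tail rejection rule
print(round(t, 3), df, reject)
```

Statistical software would report the exact p-value for this t statistic, which is the advantage noted above over table lookup.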

One-Tailed Test About a Population Mean: σ Unknown
[Figure: t distribution with 63 degrees of freedom. The rejection region (α = .05) lies above tα = 1.669; the test statistic t = 2.286 gives p-value < .025. Because p-value < α, reject H0.]
Inference about two populations
Dr. Nilakantan Narasinganallur Ph.D.
Inference About Means and Proportions with Two Populations
• Inferences About the Difference Between Two Population Means: σ1 and σ2 Known
• Inferences About the Difference Between Two Population Means: σ1 and σ2 Unknown
• Inferences About the Difference Between Two Population Means: Matched Samples

Inferences About the Difference Between Two Population Means: σ1 and σ2 Known
• Interval Estimation of μ1 – μ2
• Hypothesis Tests About μ1 – μ2
Estimating the Difference Between Two Population Means
• Let μ1 equal the mean of population 1 and μ2 equal the mean of population 2.
• The difference between the two population means is μ1 - μ2.
• To estimate μ1 - μ2, we will select a simple random sample of size n1 from population 1 and a simple random sample of size n2 from population 2.
• Let x̄1 equal the mean of sample 1 and x̄2 equal the mean of sample 2.
• The point estimator of the difference between the means of the populations 1 and 2 is x̄1 − x̄2.
Sampling Distribution of x̄1 − x̄2
• Expected Value
$$E(\bar{x}_1 - \bar{x}_2) = \mu_1 - \mu_2$$
• Standard Deviation (Standard Error)
$$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
where: σ1 = standard deviation of population 1
σ2 = standard deviation of population 2
n1 = sample size from population 1
n2 = sample size from population 2
Interval Estimation of μ1 - μ2: σ1 and σ2 Known
• Interval Estimate
$$\bar{x}_1 - \bar{x}_2 \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
where: 1 - α is the confidence coefficient
Interval Estimation of μ1 - μ2: σ1 and σ2 Known
• Example: Par, Inc.
Par, Inc. is a manufacturer of golf equipment and has developed a new golf ball that has been designed to provide “extra distance.”
In a test of driving distance using a mechanical driving device, a sample of Par golf balls was compared with a sample of golf balls made by Rap, Ltd., a competitor. The sample statistics appear below.

| Sample #1: Par, Inc. | Sample #2: Rap, Ltd.
Sample Size | 120 balls | 80 balls
Sample Mean | 295 yards | 278 yards

Based on data from previous driving distance tests, the two population standard deviations are known with σ1 = 15 yards and σ2 = 20 yards.

Let us develop a 95% confidence interval estimate of the difference between the mean driving distances of the two brands of golf ball.
Estimating the Difference Between Two Population Means
• Population 1 (Par, Inc. golf balls): μ1 = mean driving distance of Par golf balls
• Population 2 (Rap, Ltd. golf balls): μ2 = mean driving distance of Rap golf balls
• μ1 – μ2 = difference between the mean distances
• Simple random sample of n1 Par golf balls: x̄1 = sample mean distance for the Par golf balls
• Simple random sample of n2 Rap golf balls: x̄2 = sample mean distance for the Rap golf balls
• x̄1 − x̄2 = point estimate of μ1 – μ2
Point Estimate of μ1 - μ2
Point estimate of μ1 - μ2 = x̄1 − x̄2 = 295 - 278 = 17 yards
where:
μ1 = mean distance for the population of Par, Inc. golf balls
μ2 = mean distance for the population of Rap, Ltd. golf balls
Interval Estimation of μ1 - μ2: σ1 and σ2 Known
$$\bar{x}_1 - \bar{x}_2 \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = 17 \pm 1.96\sqrt{\frac{(15)^2}{120} + \frac{(20)^2}{80}}$$
17 ± 5.14 or 11.86 yards to 22.14 yards
We are 95% confident that the difference between the mean driving distances of Par, Inc. balls and Rap, Ltd. balls is 11.86 to 22.14 yards.
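The Par, Inc. interval can be reproduced as follows. This is a minimal standard-library sketch with illustrative variable names; the inputs are the sample statistics and known standard deviations from the example.

```python
from math import sqrt
from statistics import NormalDist

# Par, Inc. (sample 1) versus Rap, Ltd. (sample 2), sigma1 and sigma2 known
x1, n1, sigma1 = 295.0, 120, 15.0
x2, n2, sigma2 = 278.0, 80, 20.0
confidence = 0.95

point_estimate = x1 - x2
se = sqrt(sigma1**2 / n1 + sigma2**2 / n2)            # standard error of x1 - x2
z_half = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
margin = z_half * se

lower, upper = point_estimate - margin, point_estimate + margin
print(round(margin, 2), round(lower, 2), round(upper, 2))
```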

Hypothesis Tests About μ1 - μ2: σ1 and σ2 Known
• Hypotheses
H0: μ1 – μ2 ≥ D0, Ha: μ1 – μ2 < D0 (left-tailed)
H0: μ1 – μ2 ≤ D0, Ha: μ1 – μ2 > D0 (right-tailed)
H0: μ1 – μ2 = D0, Ha: μ1 – μ2 ≠ D0 (two-tailed)
• Test Statistic
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$
Hypothesis Tests About μ1 - μ2: σ1 and σ2 Known
• Example: Par, Inc.
Can we conclude, using α = .01, that the mean driving distance of Par, Inc. golf balls is greater than the mean driving distance of Rap, Ltd. golf balls?

Hypothesis Tests About μ1 - μ2: σ1 and σ2 Known
• p-Value and Critical Value Approaches
1. Develop the hypotheses.
H0: μ1 - μ2 ≤ 0 (right-tailed test)
Ha: μ1 - μ2 > 0
where:
μ1 = mean distance for the population of Par, Inc. golf balls
μ2 = mean distance for the population of Rap, Ltd. golf balls
2. Specify the level of significance. α = .01
Hypothesis Tests About μ1 - μ2: σ1 and σ2 Known
• p-Value and Critical Value Approaches
3. Compute the value of the test statistic.
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} = \frac{(295 - 278) - 0}{\sqrt{\frac{(15)^2}{120} + \frac{(20)^2}{80}}} = \frac{17}{2.62} = 6.49$$
Hypothesis Tests About μ1 - μ2: σ1 and σ2 Known
• p-Value Approach
4. Compute the p-value.
For z = 6.49, the p-value < .0001.
5. Determine whether to reject H0.
Because p-value < α = .01, we reject H0. At the .01 level of significance, the sample evidence indicates the mean driving distance of Par, Inc. golf balls is greater than the mean driving distance of Rap, Ltd. golf balls.
Hypothesis Tests About μ1 - μ2: σ1 and σ2 Known
• Critical Value Approach
4. Determine the critical value and rejection rule.
For α = .01, z.01 = 2.33
Reject H0 if z ≥ 2.33
5. Determine whether to reject H0.
Because z = 6.49 > 2.33, we reject H0. The sample evidence indicates the mean driving distance of Par, Inc. golf balls is greater than the mean driving distance of Rap, Ltd. golf balls.
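A short check of the Par, Inc. right-tailed test follows. The sketch keeps the standard error unrounded, so it yields z ≈ 6.48, whereas the worked example rounds the denominator to 2.62 and reports 6.49; the conclusion is unaffected. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# H0: mu1 - mu2 <= 0 versus Ha: mu1 - mu2 > 0, sigma1 and sigma2 known
x1, n1, sigma1 = 295.0, 120, 15.0
x2, n2, sigma2 = 278.0, 80, 20.0
d0, alpha = 0.0, 0.01

se = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
z = ((x1 - x2) - d0) / se                   # test statistic
p_value = 1 - NormalDist().cdf(z)           # upper-tail area
z_crit = NormalDist().inv_cdf(1 - alpha)    # about 2.33

reject = z >= z_crit
print(round(z, 2), p_value < 0.0001, round(z_crit, 2), reject)
```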

Inferences About the Difference Between Two Population Means: σ1 and σ2 Unknown
• Interval Estimation of μ1 – μ2
• Hypothesis Tests About μ1 – μ2

Interval Estimation of μ1 - μ2: σ1 and σ2 Unknown
When σ1 and σ2 are unknown, we will:
• use the sample standard deviations s1 and s2 as estimates of σ1 and σ2, and
• replace zα/2 with tα/2.
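Interval Estimation of μ1 - μ2: σ1 and σ2 Unknown
• Interval Estimate
$$\bar{x}_1 - \bar{x}_2 \pm t_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
where the degrees of freedom for tα/2 are:
$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{s_2^2}{n_2}\right)^2}$$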
Difference Between Two Population Means: σ1 and σ2 Unknown
• Example: Specific Motors
Specific Motors of Detroit has developed a new automobile known as the M car. 24 M cars and 28 J cars (from Japan) were road tested to compare miles-per-gallon (mpg) performance. The sample statistics are shown below.

| Sample #1: M Cars | Sample #2: J Cars
Sample Size | 24 cars | 28 cars
Sample Mean | 29.8 mpg | 27.3 mpg
Sample Std. Dev. | 2.56 mpg | 1.81 mpg

Let us develop a 90% confidence interval estimate of the difference between the mpg performances of the two models of automobile.
Point Estimate of μ1 - μ2
Point estimate of μ1 - μ2 = x̄1 − x̄2 = 29.8 - 27.3 = 2.5 mpg
where:
μ1 = mean miles-per-gallon for the population of M cars
μ2 = mean miles-per-gallon for the population of J cars
Interval Estimation of μ1 - μ2: σ1 and σ2 Unknown
The degrees of freedom for tα/2 are:
$$df = \frac{\left(\frac{(2.56)^2}{24} + \frac{(1.81)^2}{28}\right)^2}{\frac{1}{24 - 1}\left(\frac{(2.56)^2}{24}\right)^2 + \frac{1}{28 - 1}\left(\frac{(1.81)^2}{28}\right)^2} = 40.59 \approx 41$$
With α/2 = .05 and df = 41, tα/2 = 1.683
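The degrees-of-freedom formula is tedious by hand, so a small helper is convenient. The function name `welch_df` is an illustrative choice, not something defined in the slides; the computation follows the formula above term by term.

```python
def welch_df(s1: float, n1: int, s2: float, n2: int) -> float:
    """Approximate degrees of freedom when sigma1 and sigma2 are unknown."""
    v1 = s1**2 / n1                  # s1^2 / n1
    v2 = s2**2 / n2                  # s2^2 / n2
    numerator = (v1 + v2) ** 2
    denominator = v1**2 / (n1 - 1) + v2**2 / (n2 - 1)
    return numerator / denominator

# Specific Motors: 24 M cars (s = 2.56 mpg) and 28 J cars (s = 1.81 mpg)
df = welch_df(2.56, 24, 1.81, 28)
print(round(df, 2), round(df))
```

The slides round 40.59 to 41 before looking up the t value.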

Interval Estimation of μ1 - μ2: σ1 and σ2 Unknown
$$\bar{x}_1 - \bar{x}_2 \pm t_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = 29.8 - 27.3 \pm 1.683\sqrt{\frac{(2.56)^2}{24} + \frac{(1.81)^2}{28}}$$
2.5 ± 1.051 or 1.449 to 3.551 mpg
We are 90% confident that the difference between the miles-per-gallon performances of M cars and J cars is 1.449 to 3.551 mpg.
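The 90% interval can be checked numerically. Because the standard library has no t distribution, the sketch below plugs in the table value t = 1.683 (df = 41) used in the example; variable names are illustrative.

```python
from math import sqrt

# Specific Motors: M cars (sample 1) versus J cars (sample 2)
x1, s1, n1 = 29.8, 2.56, 24
x2, s2, n2 = 27.3, 1.81, 28
t_half = 1.683                       # t_{.05} for df = 41, from the t table

se = sqrt(s1**2 / n1 + s2**2 / n2)   # estimated standard error of x1 - x2
margin = t_half * se
lower = (x1 - x2) - margin
upper = (x1 - x2) + margin
print(round(margin, 3), round(lower, 3), round(upper, 3))
```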

Hypothesis Tests About μ1 - μ2: σ1 and σ2 Unknown
• Hypotheses
H0: μ1 – μ2 ≥ D0, Ha: μ1 – μ2 < D0 (left-tailed)
H0: μ1 – μ2 ≤ D0, Ha: μ1 – μ2 > D0 (right-tailed)
H0: μ1 – μ2 = D0, Ha: μ1 – μ2 ≠ D0 (two-tailed)
• Test Statistic
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
Hypothesis Tests About μ1 - μ2: σ1 and σ2 Unknown
• Example: Specific Motors
Can we conclude, using a .05 level of significance, that the miles-per-gallon (mpg) performance of M cars is greater than the miles-per-gallon performance of J cars?

Hypothesis Tests About μ1 - μ2: σ1 and σ2 Unknown
• p-Value and Critical Value Approaches
1. Develop the hypotheses.
H0: μ1 - μ2 ≤ 0 (right-tailed test)
Ha: μ1 - μ2 > 0
where:
μ1 = mean mpg for the population of M cars
μ2 = mean mpg for the population of J cars
2. Specify the level of significance. α = .05
3. Compute the value of the test statistic.
$$t = \frac{(29.8 - 27.3) - 0}{\sqrt{\frac{(2.56)^2}{24} + \frac{(1.81)^2}{28}}} = 4.003$$
Hypothesis Tests About μ1 - μ2: σ1 and σ2 Unknown
• p-Value Approach
4. Compute the p-value.
The degrees of freedom for t are:
$$df = \frac{\left(\frac{(2.56)^2}{24} + \frac{(1.81)^2}{28}\right)^2}{\frac{1}{24 - 1}\left(\frac{(2.56)^2}{24}\right)^2 + \frac{1}{28 - 1}\left(\frac{(1.81)^2}{28}\right)^2} = 40.59 \approx 41$$
Because t = 4.003 > t.05 = 1.683, the p-value < .05.
(In fact, the p-value < .005.)
Hypothesis Tests About μ1 - μ2: σ1 and σ2 Unknown
• p-Value Approach
5. Determine whether to reject H0.
Because p-value < α = .05, we reject H0. We are at least 95% confident that the miles-per-gallon (mpg) performance of M cars is greater than the miles-per-gallon performance of J cars.

Hypothesis Tests About μ1 - μ2: σ1 and σ2 Unknown
• Critical Value Approach
4. Determine the critical value and rejection rule.
For α = .05 and df = 41, t.05 = 1.683
Reject H0 if t ≥ 1.683
5. Determine whether to reject H0.
Because 4.003 > 1.683, we reject H0. We are at least 95% confident that the miles-per-gallon (mpg) performance of M cars is greater than the miles-per-gallon performance of J cars.
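The test statistic for the Specific Motors comparison can be recomputed as shown below. As before, the critical value 1.683 is taken from the t table because the standard library does not provide a t distribution; variable names are illustrative.

```python
from math import sqrt

# H0: mu1 - mu2 <= 0 versus Ha: mu1 - mu2 > 0, sigma1 and sigma2 unknown
x1, s1, n1 = 29.8, 2.56, 24          # M cars
x2, s2, n2 = 27.3, 1.81, 28          # J cars
d0 = 0.0

se = sqrt(s1**2 / n1 + s2**2 / n2)
t = ((x1 - x2) - d0) / se            # test statistic
t_crit = 1.683                       # t_.05 for df = 41, from the t table

reject = t >= t_crit
print(round(t, 3), reject)
```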

Inferences About the Difference Between Two Population Means:
Matched Samples
• With a matched-sample design each sampled item provides a pair of data
values.
• This design often leads to a smaller sampling error than the independent-
sample design because variation between sampled items is eliminated as a
source of sampling error.

Inferences About the Difference Between Two Population Means:
Matched Samples
• Example: Express Deliveries
A Chicago-based firm has documents that must be quickly distributed to
district offices throughout the U.S. The firm must decide between two
delivery services, UPX (United Parcel Express) and INTEX (International
Express), to transport its documents.

Inferences About the Difference Between Two Population Means:
Matched Samples
• Example: Express Deliveries
In testing the delivery times of the two services, the firm sent two reports
to a random sample of its district offices with one report carried by UPX and
the other report carried by INTEX. Do the data on the next slide indicate a
difference in mean delivery times for the two services? Use a .05 level of
significance.

Inferences About the Difference Between Two Population Means:
Matched Samples
Delivery Time (Hours)
District Office UPX INTEX Difference
Seattle 32 25 7
Los Angeles 30 24 6
Boston 19 15 4
Cleveland 16 15 1
New York 15 13 2
Houston 18 15 3
Atlanta 14 15 -1
St. Louis 10 8 2
Milwaukee 7 9 -2
Denver 16 11 5

Inferences About the Difference Between Two Population Means: Matched Samples
• p-Value and Critical Value Approaches
1. Develop the hypotheses.
H0: μd = 0
Ha: μd ≠ 0
Let μd = the mean of the difference values for the two delivery services for the population of district offices
Inferences About the Difference Between Two Population Means: Matched Samples
• p-Value and Critical Value Approaches
2. Specify the level of significance. α = .05
3. Compute the value of the test statistic.
$$\bar{d} = \frac{\sum d_i}{n} = \frac{7 + 6 + \cdots + 5}{10} = 2.7$$
$$s_d = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}} = \sqrt{\frac{76.1}{9}} = 2.9$$
$$t = \frac{\bar{d} - \mu_d}{s_d/\sqrt{n}} = \frac{2.7 - 0}{2.9/\sqrt{10}} = 2.94$$
Inferences About the Difference Between Two Population Means: Matched Samples
• p-Value Approach
4. Compute the p-value.
For t = 2.94 and df = 9, the p-value is between .01 and .02. (This is a two-tailed test, so we double the upper-tail areas of .005 and .01.)
5. Determine whether to reject H0.
Because p-value < α = .05, we reject H0. We are at least 95% confident that there is a difference in mean delivery times for the two services.
Inferences About the Difference Between Two Population Means: Matched Samples
• Critical Value Approach
4. Determine the critical value and rejection rule.
For α = .05 and df = 9, t.025 = 2.262.
Reject H0 if t ≤ -2.262 or t ≥ 2.262
5. Determine whether to reject H0.
Because t = 2.94 > 2.262, we reject H0. We are at least 95% confident that there is a difference in mean delivery times for the two services.
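The matched-sample calculation works on the ten differences only. The sketch below recomputes d̄, s_d and t from the delivery-time table with the standard library; the critical value 2.262 is the table value quoted above, and the variable names are illustrative. Keeping s_d unrounded gives t ≈ 2.936 instead of the 2.94 obtained with s_d = 2.9.

```python
from math import sqrt
from statistics import mean, stdev

# Delivery times in hours, in the order listed in the table (Seattle ... Denver)
upx   = [32, 30, 19, 16, 15, 18, 14, 10, 7, 16]
intex = [25, 24, 15, 15, 13, 15, 15, 8, 9, 11]

diffs = [u - i for u, i in zip(upx, intex)]   # paired differences d_i
n = len(diffs)
d_bar = mean(diffs)                           # sample mean difference
s_d = stdev(diffs)                            # sample std. dev. (n - 1 divisor)

t = (d_bar - 0) / (s_d / sqrt(n))             # H0: mu_d = 0, df = n - 1 = 9
t_crit = 2.262                                # t_.025 for df = 9, from the t table
reject = abs(t) >= t_crit                     # two-tailed rejection rule
print(diffs, round(d_bar, 1), round(s_d, 2), round(t, 3), reject)
```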

Learnings?
Chi-square & Cross tabulation
Dr. Nilakantan Narasinganallur Ph.D.
Tests of Goodness of Fit, Independence, and Multiple Proportions
• Testing For Equality of Three or More Population Proportions
• Goodness of Fit Test
• Test of Independence
Tests of Goodness of Fit, Independence, and Multiple Proportions
• We introduce three additional hypothesis-testing procedures.
• The test statistic and the distribution used are based on the chi-square (χ²) distribution.
• In all cases, the data are categorical.
Testing the Equality of Population Proportions for Three or More Populations
Using the notation
p1 = population proportion for population 1
p2 = population proportion for population 2
pk = population proportion for population k

The hypotheses for the equality of population proportions for k ≥ 3 populations are as follows:

H0: p1 = p2 = . . . = pk
Ha: Not all population proportions are equal
Testing the Equality of Population Proportions
for Three or More Populations
• If H0 cannot be rejected, we cannot detect a difference among the k population proportions.

• If H0 can be rejected, we can conclude that not all k population proportions are equal.

• Further analyses can be done to conclude which population proportions are significantly different from
others.

Testing the Equality of Population Proportions
for Three or More Populations
• Example: Finger Lakes Homes

Finger Lakes Homes manufactures three models of prefabricated homes, a two-story colonial, a log
cabin, and an A-frame. To help in product-line planning, management would like to compare the customer
satisfaction with the three home styles.

p1 = proportion likely to repurchase a Colonial for the population of Colonial owners


p2 = proportion likely to repurchase a Log Cabin for the population of Log Cabin owners
p3 = proportion likely to repurchase an A-Frame for the population of A-Frame owners

Testing the Equality of Population Proportions
for Three or More Populations
• We begin by taking a sample of owners from each of the three populations.

• Each sample contains categorical data indicating whether the respondents are likely or not likely to
repurchase the home.

Testing the Equality of Population Proportions for Three or More Populations
• Observed Frequencies (sample results)

Likely to Repurchase | Colonial | Log | A-Frame | Total
Yes | 97 | 83 | 80 | 260
No | 38 | 18 | 44 | 100
Total | 135 | 101 | 124 | 360
Testing the Equality of Population Proportions for Three or More Populations
• Next, we determine the expected frequencies under the assumption H0 is correct.

Expected Frequencies Under the Assumption H0 is True
$$e_{ij} = \frac{(\text{Row } i \text{ Total})(\text{Column } j \text{ Total})}{\text{Total Sample Size}}$$

• If a significant difference exists between the observed and expected frequencies, H0 can be rejected.
Testing the Equality of Population Proportions for Three or More Populations
• Expected Frequencies (computed)

Likely to Repurchase | Colonial | Log | A-Frame | Total
Yes | 97.50 | 72.94 | 89.56 | 260
No | 37.50 | 28.06 | 34.44 | 100
Total | 135 | 101 | 124 | 360
Testing the Equality of Population Proportions for Three or More Populations
• Next, compute the value of the chi-square test statistic.
$$\chi^2 = \sum_i \sum_j \frac{(f_{ij} - e_{ij})^2}{e_{ij}}$$
where: fij = observed frequency for the cell in row i and column j
eij = expected frequency for the cell in row i and column j under the assumption H0 is true

Note: The test statistic has a chi-square distribution with k – 1 degrees of freedom, provided the expected frequency is 5 or more for each cell.
Testing the Equality of Population Proportions for Three or More Populations
• Computation of the Chi-Square Test Statistic.

Likely to Repurchase | Home Owner | Obs. Freq. fij | Exp. Freq. eij | Diff. (fij - eij) | Sqd. Diff. (fij - eij)² | Sqd. Diff. / Exp. Freq. (fij - eij)²/eij
Yes | Colonial | 97 | 97.50 | -0.50 | 0.2500 | 0.0026
Yes | Log Cab. | 83 | 72.94 | 10.06 | 101.1142 | 1.3862
Yes | A-Frame | 80 | 89.56 | -9.56 | 91.3086 | 1.0196
No | Colonial | 38 | 37.50 | 0.50 | 0.2500 | 0.0067
No | Log Cab. | 18 | 28.06 | -10.06 | 101.1142 | 3.6041
No | A-Frame | 44 | 34.44 | 9.56 | 91.3086 | 2.6509
Total | | 360 | 360 | | | χ² = 8.6700
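The expected frequencies and the chi-square statistic follow mechanically from the observed table. A minimal sketch in plain Python, with the observed counts from the Finger Lakes Homes example (rows Yes/No, columns Colonial, Log Cabin, A-Frame); variable names are illustrative.

```python
# Observed frequencies: rows = likely to repurchase (Yes, No),
# columns = Colonial, Log Cabin, A-Frame
observed = [[97, 83, 80],
            [38, 18, 44]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# e_ij = (row i total)(column j total) / total sample size
expected = [[r * c / total for c in col_totals] for r in row_totals]

# chi-square = sum over all cells of (f_ij - e_ij)^2 / e_ij
chi2 = sum((f - e) ** 2 / e
           for f_row, e_row in zip(observed, expected)
           for f, e in zip(f_row, e_row))

df = len(col_totals) - 1      # k - 1 degrees of freedom for k populations
print([[round(e, 2) for e in row] for row in expected], round(chi2, 2), df)
```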

12
Testing the Equality of Population Proportions
for Three or More Populations
• Rejection Rule

p-value approach: Reject H0 if p-value < 

Critical value approach: Reject H0 if 𝜒 2 > 𝜒𝛼2

where  is the significance level and there are k - 1


degrees of freedom

13
Testing the Equality of Population Proportions
for Three or More Populations
• Rejection Rule (using α = .05)

Reject H0 if p-value < .05 or χ² > 5.991

With α = .05 and k - 1 = 3 - 1 = 2 degrees of freedom

[Figure: chi-square distribution; do not reject H0 for χ² below 5.991, reject H0 for χ² above 5.991]

14
Testing the Equality of Population Proportions
for Three or More Populations
• Conclusion Using the p-Value Approach

Area in Upper Tail .10 .05 .025 .01 .005

χ² Value (df = 2) 4.605 5.991 7.378 9.210 10.597

Because χ² = 8.670 is between 7.378 and 9.210, the area in the upper tail of
the distribution is between .025 and .01.

The p-value < α. We can reject the null hypothesis.

(Actual p-value is .0131)

15
Testing the Equality of Population Proportions
for Three or More Populations
• We have concluded that the population proportions for the three populations of home owners are
not equal.

• To identify where the differences between population proportions exist, we will rely on a multiple
comparisons procedure.

16
Multiple Comparisons Procedure
• We begin by computing the three sample proportions.

Colonial: $\bar{p}_1 = 97/135 = .7185$

Log Cabin: $\bar{p}_2 = 83/101 = .8218$

A-Frame: $\bar{p}_3 = 80/124 = .6452$

• We will use a multiple comparison procedure known as the Marascuilo procedure.

17
Multiple Comparisons Procedure
• Marascuilo Procedure

We compute the absolute value of the pairwise difference between sample proportions.

Colonial and Log Cabin: $|\bar{p}_1 - \bar{p}_2| = |.7185 - .8218| = .1033$

Colonial and A-Frame: $|\bar{p}_1 - \bar{p}_3| = |.7185 - .6452| = .0733$

Log Cabin and A-Frame: $|\bar{p}_2 - \bar{p}_3| = |.8218 - .6452| = .1766$

18
Multiple Comparisons Procedure
• Critical Values for the Marascuilo Pairwise Comparison

For each pairwise comparison compute a critical value as follows:

$$CV_{ij} = \sqrt{\chi^2_{\alpha,\,k-1}}\;\sqrt{\frac{\bar{p}_i(1 - \bar{p}_i)}{n_i} + \frac{\bar{p}_j(1 - \bar{p}_j)}{n_j}}$$

For α = .05 and k = 3: χ² = 5.991

19
Multiple Comparisons Procedure
• Pairwise Comparison Tests

Pairwise Comparison       |p̄i - p̄j|   CVij    Significant if |p̄i - p̄j| > CVij
Colonial vs. Log Cabin    .1033        .1329   Not Significant
Colonial vs. A-Frame      .0733        .1415   Not Significant
Log Cabin vs. A-Frame     .1766        .1405   Significant
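The pairwise comparisons above can be checked with a short script. This is a minimal sketch in Python (an assumption, not part of the original slides); the critical chi-square value 5.991 is taken from the table rather than computed, and the dictionary names are illustrative.

```python
import math
from itertools import combinations

# Sample data from the slides: (number likely to repurchase, sample size).
samples = {
    "Colonial": (97, 135),
    "Log Cabin": (83, 101),
    "A-Frame": (80, 124),
}
chi2_crit = 5.991  # table value for alpha = .05 with k - 1 = 2 degrees of freedom

# Sample proportion for each population.
p_bar = {name: yes / size for name, (yes, size) in samples.items()}

results = {}
for a, b in combinations(samples, 2):
    diff = abs(p_bar[a] - p_bar[b])
    var_sum = (p_bar[a] * (1 - p_bar[a]) / samples[a][1]
               + p_bar[b] * (1 - p_bar[b]) / samples[b][1])
    # Marascuilo critical value for this pair.
    cv = math.sqrt(chi2_crit) * math.sqrt(var_sum)
    results[(a, b)] = (diff, cv, diff > cv)
    verdict = "Significant" if diff > cv else "Not Significant"
    print(f"{a} vs. {b}: |diff| = {diff:.4f}, CV = {cv:.4f}, {verdict}")

significant_pairs = [pair for pair, (_, _, sig) in results.items() if sig]
```

Only the Log Cabin vs. A-Frame difference exceeds its critical value, in line with the table.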

20
Goodness of Fit Test:
Multinomial Probability Distribution
1. State the null and alternative hypotheses.

H0: The population follows a multinomial distribution with specified probabilities for each of
the k categories

Ha: The population does not follow a multinomial distribution with specified probabilities for
each of the k categories

21
Goodness of Fit Test:
Multinomial Probability Distribution
2. Select a random sample and record the observed frequency, fi , for each of the k categories.

3. Assuming H0 is true, compute the expected frequency, ei , in each category by


multiplying the category probability by the sample size.

22
Goodness of Fit Test:
Multinomial Probability Distribution
4. Compute the value of the test statistic.

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$$

where: fi = observed frequency for category i
ei = expected frequency for category i
k = number of categories

Note: The test statistic has a chi-square distribution with k - 1 df provided
that the expected frequencies are 5 or more for all categories.
Goodness of Fit Test:
Multinomial Probability Distribution
5. Rejection rule:
p-value approach: Reject H0 if p-value < α

Critical value approach: Reject H0 if $\chi^2 > \chi^2_\alpha$

where α is the significance level and there are k - 1 degrees of freedom

24
Multinomial Distribution Goodness of Fit Test
• Example: Finger Lakes Homes (A)

Finger Lakes Homes manufactures four models of prefabricated homes, a two-story colonial, a log
cabin, a split-level, and an A-frame. To help in production planning, management would like to
determine if previous customer purchases indicate that there is a preference in the style selected.

25
Multinomial Distribution Goodness of Fit Test
• Example: Finger Lakes Homes (A)

The number of homes sold of each model for 100 sales over the past two years is shown below.

Model    Colonial   Log Cabin   Split-Level   A-Frame
# Sold   30         20          35            15

26
Multinomial Distribution Goodness of Fit Test
• Hypotheses

H0: pC = pL = pS = pA = .25

Ha: The population proportions are not pC = .25, pL = .25, pS = .25, and pA = .25

where:
pC = population proportion that purchase a colonial
pL = population proportion that purchase a log cabin
pS = population proportion that purchase a split-level
pA = population proportion that purchase an A-frame

27
Multinomial Distribution Goodness of Fit Test
• Rejection Rule

Reject H0 if p-value < .05 or χ² > 7.815.

With α = .05 and k - 1 = 4 - 1 = 3 degrees of freedom

[Figure: chi-square distribution; do not reject H0 for χ² below 7.815, reject H0 for χ² above 7.815]

28
Multinomial Distribution Goodness of Fit Test
• Expected Frequencies

e1 = .25(100) = 25 e2 = .25(100) = 25
e3 = .25(100) = 25 e4 = .25(100) = 25
• Test Statistic

$$\chi^2 = \frac{(30-25)^2}{25} + \frac{(20-25)^2}{25} + \frac{(35-25)^2}{25} + \frac{(15-25)^2}{25} = 1 + 1 + 4 + 4 = 10$$
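The goodness of fit calculation can be sketched as follows. This is an illustrative Python snippet (an assumption; the slides do the arithmetic by hand), with the critical value 7.815 read from the chi-square table for 3 degrees of freedom.

```python
# Observed sales by model for the 100 homes in the Finger Lakes Homes (A) example.
observed = {"Colonial": 30, "Log Cabin": 20, "Split-Level": 35, "A-Frame": 15}

# Under H0 every model has probability .25.
hypothesized = {name: 0.25 for name in observed}

n = sum(observed.values())

# Expected frequency = category probability * sample size.
expected = {name: prob * n for name, prob in hypothesized.items()}

# chi-square = sum over categories of (f_i - e_i)^2 / e_i
chi2 = sum((observed[name] - expected[name]) ** 2 / expected[name] for name in observed)

chi2_crit = 7.815  # table value for alpha = .05 with k - 1 = 3 degrees of freedom
reject_h0 = chi2 > chi2_crit

print(f"chi-square = {chi2:.1f}, reject H0: {reject_h0}")
```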

29
Multinomial Distribution Goodness of Fit Test
• Conclusion Using the p-Value Approach

Area in Upper Tail .10 .05 .025 .01 .005

χ² Value (df = 3) 6.251 7.815 9.348 11.345 12.838

Because χ² = 10 is between 9.348 and 11.345, the area in the upper tail of
the distribution is between .025 and .01.

The p-value < α. We can reject the null hypothesis.

30
Multinomial Distribution Goodness of Fit Test
• Conclusion Using the Critical Value Approach

χ² = 10 > 7.815

We reject, at the .05 level of significance, the assumption that there is
no home style preference.

31
Test of Independence
1. Set up the null and alternative hypotheses.

H0: The column variable is independent of the row variable

Ha: The column variable is not independent of the row variable

2. Select a random sample and record the observed frequency, fij , for each cell of the contingency table.

3. Compute the expected frequency, eij , for each cell.

$$e_{ij} = \frac{(\text{Row } i \text{ Total})(\text{Column } j \text{ Total})}{\text{Sample Size}}$$

32
Test of Independence
4. Compute the test statistic.

$$\chi^2 = \sum_i \sum_j \frac{(f_{ij} - e_{ij})^2}{e_{ij}}$$

5. Determine the rejection rule.

Reject H0 if p-value < α or $\chi^2 > \chi^2_\alpha$,

where α is the significance level and, with n rows and m columns, there
are (n - 1)(m - 1) degrees of freedom.

33
Test of Independence
• Example: Finger Lakes Homes (B)

Each home sold by Finger Lakes Homes can be classified according to price and to style. Finger
Lakes’ manager would like to determine if the price of the home and the style of the home are
independent variables.

34
Test of Independence
• Example: Finger Lakes Homes (B)

The number of homes sold for each model and price for the past two years is shown below. For
convenience, the price of the home is listed as either less than $200,000 or more than or equal to
$200,000.

Price        Colonial   Log Cabin   Split-Level   A-Frame

< $200,000   18         6           19            12
≥ $200,000   12         14          16            3

35
Test of Independence
• Hypotheses

H0: Price of the home is independent of the style of the home that is purchased

Ha: Price of the home is not independent of the style of the home that is purchased

36
Test of Independence
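• Observed Frequencies (with row and column totals)

Price      Colonial   Log Cabin   Split-Level   A-Frame   Total
< $200K    18         6           19            12        55
≥ $200K    12         14          16            3         45
Total      30         20          35            15        100

• Expected Frequencies, eij = (Row i Total)(Column j Total)/(Sample Size)

Price      Colonial   Log Cabin   Split-Level   A-Frame   Total
< $200K    16.50      11.00       19.25         8.25      55
≥ $200K    13.50       9.00       15.75         6.75      45
Total      30         20          35            15        100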

37
Test of Independence
• Rejection Rule

With α = .05 and (2 - 1)(4 - 1) = 3 d.f., $\chi^2_\alpha = 7.815$

Reject H0 if p-value < .05 or χ² > 7.815

• Test Statistic

$$\chi^2 = \frac{(18 - 16.5)^2}{16.5} + \frac{(6 - 11)^2}{11} + \cdots + \frac{(3 - 6.75)^2}{6.75} = .1364 + 2.2727 + \cdots + 2.0833 = 9.149$$
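To see every term of the sum rather than only the first two and the last, the test of independence can be run end to end. This is a minimal Python sketch (an assumption, not from the slides). The helper `chi2_upper_tail_3df` is a hypothetical name; it uses a closed form for the chi-square upper-tail area that is valid only for exactly 3 degrees of freedom.

```python
import math

# Observed frequencies: rows = price (< $200,000, >= $200,000),
# columns = style (Colonial, Log Cabin, Split-Level, A-Frame).
observed = [
    [18, 6, 19, 12],
    [12, 14, 16, 3],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# e_ij = (row i total)(column j total) / sample size
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Contribution of each cell to the chi-square statistic.
terms = [
    [(f - e) ** 2 / e for f, e in zip(f_row, e_row)]
    for f_row, e_row in zip(observed, expected)
]
chi2 = sum(sum(row) for row in terms)

df = (len(observed) - 1) * (len(observed[0]) - 1)


def chi2_upper_tail_3df(x):
    """Upper-tail area of a chi-square distribution with exactly 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_upper_tail_3df(chi2)
smallest_expected = min(min(row) for row in expected)

print(f"chi-square = {chi2:.3f}, df = {df}, p-value = {p_value:.4f}")
print(f"smallest expected frequency = {smallest_expected}")
```

The statistic comes out near 9.149 with a p-value near .0274, and the smallest expected frequency (6.75) satisfies the "5 or more" condition.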

38
Test of Independence
• Conclusion Using the p-Value Approach

Area in Upper Tail .10 .05 .025 .01 .005

χ² Value (df = 3) 6.251 7.815 9.348 11.345 12.838

Because χ² = 9.149 is between 7.815 and 9.348, the area in the upper tail of the
distribution is between .05 and .025.

The p-value < α. We can reject the null hypothesis.

(Actual p-value is .0274)

39
Test of Independence
• Conclusion Using the Critical Value Approach

χ² = 9.149 > 7.815

We reject, at the .05 level of significance, the assumption that the price of the home is
independent of the style of home that is purchased.

40
Simple Regression
Dr. Nilakantan Narasinganallur Ph.D.
Simple Linear Regression

• Simple Linear Regression Model


• Least Squares Method
• Coefficient of Determination
• Model Assumptions
• Testing for Significance

2
Simple Linear Regression
• Managerial decisions often are based on the relationship between two or
more variables.
• Regression analysis can be used to develop an equation showing how the
variables are related.
• The variable being predicted is called the dependent variable and is denoted
by y. It is also called the predicted variable or the response variable.
• The variables being used to predict the value of the dependent variable are
called the independent variables and are denoted by x. They are also called
predictor variables or explanatory variables.

3
Simple Linear Regression
• Simple linear regression involves one independent variable and one
dependent variable.
• The relationship between the two variables is approximated by a
straight line.
• Regression analysis involving two or more independent variables is
called multiple regression. Note: there is always one dependent variable,
while there can be several independent variables.
• While independent variables can be categorical or quantitative, the
dependent variable must be quantitative in linear regression.

4
Simple Linear Regression Model
• The equation that describes how y is related to x and an error term is called
the regression model.
• The simple linear regression model is:

y = β0 + β1x + ε

where:
β0 and β1 are called parameters of the model,
ε is a random variable called the error term.

5
Simple Linear Regression Equation
• The simple linear regression equation is:

E(y) = β0 + β1x

• Graph of the regression equation is a straight line.
• β0 is the y intercept of the regression line.
• β1 is the slope of the regression line.
• E(y) is the expected value of y for a given x value.

6
Simple Linear Regression Equation
• Positive Linear Relationship

[Figure: regression line with E(y) on the vertical axis; intercept β0; slope β1 is positive]

7
Simple Linear Regression Equation
• Negative Linear Relationship

[Figure: regression line with E(y) on the vertical axis; intercept β0; slope β1 is negative]

8
Simple Linear Regression Equation
• No Relationship

[Figure: regression line with E(y) on the vertical axis; intercept β0; slope β1 is 0]

9
Estimated Simple Linear Regression Equation
• The estimated simple linear regression equation, based on sample data, is:

$$\hat{y} = b_0 + b_1 x$$

• The graph is called the estimated regression line.
• b0 is the y intercept of the line.
• b1 is the slope of the line.
• ŷ is the estimated value of y for a given x value.

10
Estimation Process
1. Regression Model: y = β0 + β1x + ε
   Regression Equation: E(y) = β0 + β1x
   Unknown Parameters: β0, β1
2. Sample Data: (x1, y1), . . . , (xn, yn)
3. Estimated Regression Equation: ŷ = b0 + b1x
   Sample Statistics: b0, b1
4. b0 and b1 provide estimates of β0 and β1

11
Least squares method - diagram
Least Squares Method
• Least Squares Criterion

$$\min \sum (y_i - \hat{y}_i)^2$$

where:
yi = observed value of the dependent variable for the i th observation
ŷi = estimated value of the dependent variable for the i th observation

13
Least Squares Method
• Slope for the Estimated Regression Equation

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

where:
xi = value of independent variable for i th observation
yi = value of dependent variable for i th observation
x̄ = mean value for independent variable
ȳ = mean value for dependent variable

14
Least Squares Method
• y-Intercept for the Estimated Regression Equation

$$b_0 = \bar{y} - b_1 \bar{x}$$

15
Simple Linear Regression
• Example: Reed Auto Sales
Reed Auto periodically has a special week-long sale. As part of
the advertising campaign Reed runs one or more television
commercials during the weekend preceding the sale. Data from a
sample of 5 previous sales are shown on the next slide.

16
Simple Linear Regression
• Example: Reed Auto Sales

Number of TV Ads (x)   Number of Cars Sold (y)
1                      14
3                      24
2                      18
1                      17
3                      27
Σx = 10                Σy = 100
x̄ = 2                  ȳ = 20

17
Estimated Regression Equation
• Slope for the Estimated Regression Equation

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{20}{4} = 5$$

• y-Intercept for the Estimated Regression Equation

$$b_0 = \bar{y} - b_1 \bar{x} = 20 - 5(2) = 10$$

• Estimated Regression Equation

$$\hat{y} = 10 + 5x$$
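The least squares formulas can be applied directly to the Reed Auto data. The following is a minimal Python sketch (an assumption; the slides use Excel); the function name `least_squares` is illustrative.

```python
# Reed Auto Sales sample: number of TV ads (x) and number of cars sold (y).
tv_ads = [1, 3, 2, 1, 3]
cars_sold = [14, 24, 18, 17, 27]


def least_squares(x, y):
    """Return (b0, b1) for the estimated regression equation y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(tv_ads, cars_sold)
print(f"y_hat = {b0:g} + {b1:g}x")
```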

18
Using Excel’s Chart Tools for
Scatter Diagram & Estimated Regression Equation
[Figure: Reed Auto Sales scatter diagram with estimated regression line y = 5x + 10; x axis: TV Ads, y axis: Cars Sold]

19
Coefficient of Determination
• Relationship Among SST, SSR, SSE
SST = SSR + SSE

$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error

20
Coefficient of Determination
• The coefficient of determination is:

r2 = SSR/SST

where:
SSR = sum of squares due to regression
SST = total sum of squares

21
Coefficient of Determination
r² = SSR/SST = 100/114 = .8772

The regression relationship is very strong; 87.72% of the variability
in the number of cars sold can be explained by the linear relationship
between the number of TV ads and the number of cars sold.

22
Using Excel to Compute the Coefficient of Determination
• Adding r 2 Value to Scatter Diagram
[Figure: Reed Auto Sales scatter diagram with estimated regression line y = 5x + 10 and R² = 0.8772; x axis: TV Ads, y axis: Cars Sold]

23
Sample Correlation Coefficient

$$r_{xy} = (\text{sign of } b_1)\sqrt{\text{Coefficient of Determination}}$$

$$r_{xy} = (\text{sign of } b_1)\sqrt{r^2}$$

where:
b1 = the slope of the estimated regression equation $\hat{y} = b_0 + b_1 x$

24
Sample Correlation Coefficient

$$r_{xy} = (\text{sign of } b_1)\sqrt{r^2}$$

The sign of b1 in the equation $\hat{y} = 10 + 5x$ is "+".

$$r_{xy} = +\sqrt{.8772} = +.9366$$
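The sums of squares, the coefficient of determination, and the sample correlation coefficient for the Reed Auto example can be verified together. This is a minimal Python sketch (an assumption, not from the slides) that takes the estimated equation ŷ = 10 + 5x as given.

```python
import math

tv_ads = [1, 3, 2, 1, 3]
cars_sold = [14, 24, 18, 17, 27]
b0, b1 = 10.0, 5.0  # estimated regression equation from the slides

y_bar = sum(cars_sold) / len(cars_sold)
y_hat = [b0 + b1 * x for x in tv_ads]

sst = sum((y - y_bar) ** 2 for y in cars_sold)               # total sum of squares
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)                 # due to regression
sse = sum((y - yh) ** 2 for y, yh in zip(cars_sold, y_hat))  # due to error

r_squared = ssr / sst
# The correlation coefficient takes the sign of the slope b1.
r_xy = math.copysign(math.sqrt(r_squared), b1)

print(f"SST = {sst:g}, SSR = {ssr:g}, SSE = {sse:g}")
print(f"r^2 = {r_squared:.4f}, r_xy = {r_xy:+.4f}")
```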

25
Assumptions About the Error Term ε
1. The error ε is a random variable with mean of zero.
2. The variance of ε, denoted by σ², is the same for all values of the
independent variable.
3. The values of ε are independent.
4. The error ε is a normally distributed random variable.

26
Testing for Significance
• To test for a significant regression relationship, we must conduct a
hypothesis test to determine whether the value of β1 is zero.
• Two tests are commonly used:

t Test and F Test

• Both the t test and F test require an estimate of σ², the variance of ε in the
regression model.

27
Testing for Significance
• An Estimate of σ²
The mean square error (MSE) provides the estimate of σ², and the notation
s² is also used.

$$s^2 = \text{MSE} = \frac{\text{SSE}}{n - 2}$$

where:

$$\text{SSE} = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - b_0 - b_1 x_i)^2$$

28
Testing for Significance
• An Estimate of σ
• To estimate σ, we take the square root of s².
• The resulting s is called the standard error of the estimate.

$$s = \sqrt{\text{MSE}} = \sqrt{\frac{\text{SSE}}{n - 2}}$$

29
Testing for Significance: t Test
• Hypotheses

H0: β1 = 0
Ha: β1 ≠ 0

• Test Statistic

$$t = \frac{b_1}{s_{b_1}} \quad \text{where} \quad s_{b_1} = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$$

30
Testing for Significance: t Test
• Rejection Rule

Reject H0 if p-value < α
or t < $-t_{\alpha/2}$ or t > $t_{\alpha/2}$

where:
$t_{\alpha/2}$ is based on a t distribution
with n - 2 degrees of freedom

31
Testing for Significance: t Test
1. Determine the hypotheses. H0: β1 = 0, Ha: β1 ≠ 0

2. Specify the level of significance. α = .05

3. Select the test statistic. $t = b_1 / s_{b_1}$

4. State the rejection rule. Reject H0 if p-value < .05
or |t| > 3.182 (with 3 degrees of freedom)

32
Testing for Significance: t Test
5. Compute the value of the test statistic.

$$t = \frac{b_1}{s_{b_1}} = \frac{5}{1.08} = 4.63$$

6. Determine whether to reject H0. With 3 degrees of freedom, t = 4.541 provides
an area of .01 in the upper tail. Because t = 4.63 > 4.541, the two-tailed
p-value is less than .02. (Also, t = 4.63 > 3.182.) We can reject H0.

33
Confidence Interval for β1
• We can use a 95% confidence interval for β1 to test the hypotheses just used in
the t test.
• H0 is rejected if the hypothesized value of β1 is not included in the confidence
interval for β1.

34
Confidence Interval for β1
• The form of a confidence interval for β1 is:

$$b_1 \pm t_{\alpha/2}\, s_{b_1}$$

where
b1 is the point estimator,
$t_{\alpha/2}\, s_{b_1}$ is the margin of error, and
$t_{\alpha/2}$ is the t value providing an area of α/2 in the upper tail of a
t distribution with n - 2 degrees of freedom

35
Confidence Interval for β1
• Rejection Rule

Reject H0 if 0 is not included in the confidence interval for β1.

• 95% Confidence Interval for β1

$b_1 \pm t_{\alpha/2}\, s_{b_1}$ = 5 ± 3.182(1.08) = 5 ± 3.44, or 1.56 to 8.44

• Conclusion
0 is not included in the confidence interval. Reject H0.
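The t test and the confidence interval for the slope use the same ingredients, so both can be checked in one pass. This is a minimal Python sketch (an assumption; the slides rely on tables and Excel). The critical value 3.182 is taken from the t table for 3 degrees of freedom rather than computed.

```python
import math

tv_ads = [1, 3, 2, 1, 3]
cars_sold = [14, 24, 18, 17, 27]
b0, b1 = 10.0, 5.0  # estimated regression equation from the slides
t_crit = 3.182      # t table value for alpha/2 = .025 with n - 2 = 3 degrees of freedom

n = len(tv_ads)
x_bar = sum(tv_ads) / n

# Standard error of the estimate: s = sqrt(SSE / (n - 2)).
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(tv_ads, cars_sold))
s = math.sqrt(sse / (n - 2))

# Estimated standard deviation of b1.
sxx = sum((x - x_bar) ** 2 for x in tv_ads)
s_b1 = s / math.sqrt(sxx)

t_stat = b1 / s_b1
margin = t_crit * s_b1
conf_int = (b1 - margin, b1 + margin)

reject_by_t = abs(t_stat) > t_crit
reject_by_ci = not (conf_int[0] <= 0 <= conf_int[1])

print(f"s = {s:.5f}, s_b1 = {s_b1:.4f}, t = {t_stat:.2f}")
print(f"95% CI for the slope: {conf_int[0]:.2f} to {conf_int[1]:.2f}")
```

Both decision rules agree, as they must: zero lies outside the interval exactly when |t| exceeds the critical value.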

36
Testing for Significance: F Test
• Hypotheses

H0: β1 = 0
Ha: β1 ≠ 0

• Test Statistic

F = MSR/MSE

37
Testing for Significance: F Test
• Rejection Rule

Reject H0 if
p-value < α
or F > $F_\alpha$

where:
$F_\alpha$ is based on an F distribution with
1 degree of freedom in the numerator and
n - 2 degrees of freedom in the denominator

38
Testing for Significance: F Test
1. Determine the hypotheses. H0: β1 = 0, Ha: β1 ≠ 0

2. Specify the level of significance. α = .05

3. Select the test statistic. F = MSR/MSE

4. State the rejection rule. Reject H0 if p-value < .05 or
F > 10.13 (with 1 d.f. in numerator and 3 d.f. in denominator)

39
Testing for Significance: F Test
5. Compute the value of the test statistic.

F = MSR/MSE = 100/4.667 = 21.43

6. Determine whether to reject H0.

F = 17.44 provides an area of .025 in the upper tail. Thus, the p-value
corresponding to F = 21.43 is less than .025. Hence, we reject H0.
The statistical evidence is sufficient to conclude that we have a significant
relationship between the number of TV ads aired and the number of cars
sold.
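The F statistic follows directly from the sums of squares. The sketch below is illustrative Python (an assumption, not part of the slides); it also checks the general fact that in simple linear regression F equals the square of the t statistic for the slope. The critical value 10.13 is read from the F table.

```python
# Sums of squares and sample size from the Reed Auto example.
ssr, sse, n = 100.0, 14.0, 5
p = 1  # number of independent variables in simple linear regression

msr = ssr / p            # mean square due to regression
mse = sse / (n - p - 1)  # mean square error, n - 2 degrees of freedom
f_stat = msr / mse

f_crit = 10.13           # F table value, alpha = .05, 1 and 3 degrees of freedom
reject_h0 = f_stat > f_crit

# With b1 = 5 and sum((x_i - x_bar)^2) = 4, t = b1 / (s / sqrt(4)), so t^2 = b1^2 * 4 / MSE.
t_squared = (5.0 ** 2) * 4.0 / mse

print(f"F = {f_stat:.2f}, t^2 = {t_squared:.2f}, reject H0: {reject_h0}")
```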

40
Some Cautions about the Interpretation of
Significance Tests
• Rejecting H0: β1 = 0 and concluding that the relationship between x and y
is significant does not enable us to conclude that a cause-and-effect
relationship is present between x and y.

• Just because we are able to reject H0: β1 = 0 and demonstrate statistical
significance does not enable us to conclude that there is a linear relationship
between x and y.

41
Excel Practice – example 1 – Anscombe’s regression
• Anscombe’s quartet comprises four data sets that have nearly identical
simple descriptive statistics, yet have very different distributions and
appear very different when graphed.
— Wikipedia
• The data sets are given in excel file – Anscombe’s regression template.
• Interpretation of results
• Data set 1 - linear regression fits well.
• Data set 2 - linear regression does not fit, as the scatter plot shows a
non-linear relationship.
• Data set 3 - shows outliers, and linear regression (LR) is sensitive to outliers.
• Data set 4 - shows outliers, and LR is sensitive to outliers.
• Conclusion - Anscombe’s data sets, though specifically created, lead us to
conclude that it is necessary to visualize the data well before creating a
linear regression or any other model.
LR - example 2 - Pete’s Pizza
• Data were collected from a sample of 10 Pete’s Pizza Parlor
restaurants located near college campuses.
• Our objective is to identify whether there is a relationship between student
population and pizza sales in the outlets.
• First draw a scatter diagram: student population (independent
variable) on the x axis and quarterly sales (dependent variable) on the y
axis.
• The scatter diagram enables us to observe the data graphically and to
draw preliminary conclusions about the possible relationship between
the variables.
• What preliminary conclusions can be drawn from the scatter
diagram?
• Quarterly sales appear to be higher at campuses with larger student
populations. In addition, for these data the relationship between the
size of the student population and quarterly sales appears to be
approximately linear.
• A positive linear relationship is indicated between x and y.

Restaurant   Student population (1000s)   Quarterly sales ($1000s)
1            2                            58
2            6                            105
3            8                            88
4            8                            118
5            12                           117
6            16                           137
7            20                           157
8            20                           169
9            22                           149
10           26                           202
Pete’s Pizza - results of LR
• Results interpretation
• Is the LR model significant?
• Yes. This is concluded from the F value and Significance F.
• Are the coefficients of regression significant?
• Yes. The p-values for the intercept (0.000187) and the slope (2.55E-05) are
both below .05.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.950123
R Square            0.902734
Adjusted R Square   0.890575
Standard Error      13.82932
Observations        10

ANOVA
             df   SS      MS       F          Significance F
Regression   1    14200   14200    74.24837   2.55E-05
Residual     8    1530    191.25
Total        9    15730

                             Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
Intercept                    60             9.226035         6.503336   0.000187   38.72473    81.27527    38.72473      81.27527
student population (1000s)   5              0.580265         8.616749   2.55E-05   3.661906    6.338094    3.661906      6.338094
LR – example 3 – 100 m race timings
• We have collected the data on Race timing records for 100 m race.
• First, we plot the relationship between the timings (Y) and passage of
time (x).
• The data are provided in a raw form, inconvenient for data
processing.
• We need to convert this into data in excel, which will be convenient
for data processing.
• This is done in an excel sheet, which can be accessed here.
100 m race records ( 1910-2010)
additional matters

• Using the Estimated Regression Equation for Estimation and Prediction


• Excel’s Regression Tool
• Residual Analysis: Validating Model Assumptions
• Outliers and Influential Observations

47
Using the Estimated Regression Equation
for Estimation and Prediction
• A confidence interval is an interval estimate of the mean value of y for a given
value of x.
• A prediction interval is used whenever we want to predict an individual value of
y for a new observation corresponding to a given value of x.
• The margin of error is larger for a prediction interval.

48
Using the Estimated Regression Equation
for Estimation and Prediction
• Confidence Interval Estimate of E(y*)

$$\hat{y}^* \pm t_{\alpha/2}\, s_{\hat{y}^*}$$

• Prediction Interval Estimate of y*

$$\hat{y}^* \pm t_{\alpha/2}\, s_{pred}$$

where:
confidence coefficient is 1 - α and $t_{\alpha/2}$ is based
on a t distribution with n - 2 degrees of freedom

49
Point Estimation
If 3 TV ads are run prior to a sale, we expect the mean number of cars sold
to be:

𝑦ො = 10 + 5 3 = 25 cars

50
Confidence Interval for E(y*)
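• Estimate of the Standard Deviation of ŷ*

$$s_{\hat{y}^*} = s\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

$$s_{\hat{y}^*} = 2.16025\sqrt{\frac{1}{5} + \frac{(3-2)^2}{(1-2)^2 + (3-2)^2 + \cdots + (3-2)^2}}$$

$$s_{\hat{y}^*} = 2.16025\sqrt{\frac{1}{5} + \frac{1}{4}} = 1.4491$$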

51
Confidence Interval for E(y*)
The 95% confidence interval estimate of the mean number of cars sold
when 3 TV ads are run is:

$$\hat{y}^* \pm t_{\alpha/2}\, s_{\hat{y}^*}$$

25 ± 3.1824(1.4491)
25 ± 4.61

20.39 to 29.61 cars

52
Prediction Interval for y*
• Estimate of the Standard Deviation of an Individual Value of y*

$$s_{pred} = s\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

$$s_{pred} = 2.16025\sqrt{1 + \frac{1}{5} + \frac{1}{4}}$$

$$s_{pred} = 2.16025(1.20416) = 2.6013$$

53
Prediction Interval for y*
The 95% prediction interval estimate of the number of cars sold in one
particular week when 3 TV ads are run is:

$$\hat{y}^* \pm t_{\alpha/2}\, s_{pred}$$

25 ± 3.1824(2.6013)
25 ± 8.28

16.72 to 33.28 cars
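The two intervals differ only by the extra "1 +" under the square root, which is easy to see when both are computed side by side. This is a minimal Python sketch (an assumption, not from the slides); the t value 3.1824 is the table value used above, and names such as `x_star` are illustrative.

```python
import math

tv_ads = [1, 3, 2, 1, 3]
cars_sold = [14, 24, 18, 17, 27]
b0, b1 = 10.0, 5.0  # estimated regression equation from the slides
t_crit = 3.1824     # t table value for alpha/2 = .025 with n - 2 = 3 degrees of freedom
x_star = 3          # number of TV ads for which we estimate and predict

n = len(tv_ads)
x_bar = sum(tv_ads) / n
sxx = sum((x - x_bar) ** 2 for x in tv_ads)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(tv_ads, cars_sold))
s = math.sqrt(sse / (n - 2))

y_hat_star = b0 + b1 * x_star

# Shared term: 1/n + (x* - x_bar)^2 / sum((x_i - x_bar)^2)
h_star = 1 / n + (x_star - x_bar) ** 2 / sxx

s_mean = s * math.sqrt(h_star)      # standard deviation of y_hat*
s_pred = s * math.sqrt(1 + h_star)  # standard deviation of an individual y*

conf_int = (y_hat_star - t_crit * s_mean, y_hat_star + t_crit * s_mean)
pred_int = (y_hat_star - t_crit * s_pred, y_hat_star + t_crit * s_pred)

print(f"confidence interval: {conf_int[0]:.2f} to {conf_int[1]:.2f}")
print(f"prediction interval: {pred_int[0]:.2f} to {pred_int[1]:.2f}")
```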

54
Using Excel’s Regression Tool
• Up to this point, you have seen how Excel can be used for various parts of a
regression analysis.
• Excel also has a comprehensive tool in its Data Analysis package called
Regression.
• The Regression tool can be used to perform a complete regression analysis.

55
Using Excel’s Regression Tool
• Excel Output (top portion)
A B C
9
10 Regression Statistics
11 Multiple R 0.936585812
12 R Square 0.877192982
13 Adjusted R Square 0.83625731
14 Standard Error 2.160246899
15 Observations 5
16

56
Using Excel’s Regression Tool
• Excel Output (middle portion)
A B C D E F
16
17 ANOVA
18 df SS MS F Significance F
19 Regression 1 100 100 21.4286 0.018986231
20 Residual 3 14 4.66667
21 Total 4 114
22

57
Using Excel’s Regression Tool
• Excel Output (bottom-left portion)
A B C D E
22
23 Coeffic. Std. Err. t Stat P-value
24 Intercept 10 2.36643 4.2258 0.02424
25 TV Ads 5 1.08012 4.6291 0.01899
26
Note: Columns F-I are not shown.

58
Using Excel’s Regression Tool
• Excel Output (bottom-right portion)
A B F G H I
22
23 Coeffic. Low. 95% Up. 95% Low. 95.0% Up. 95.0%
24 Intercept 10 2.46895 17.53105 2.46895044 17.5310496
25 TV Ads 5 1.562562 8.437438 1.56256189 8.43743811
26
Note: Columns C-E are hidden.

59
Residual Analysis
• If the assumptions about the error term ε appear questionable, the hypothesis
tests about the significance of the regression relationship and the interval
estimation results may not be valid.
• The residuals provide the best information about ε.
• Residual for observation i

$$y_i - \hat{y}_i$$

• Much of the residual analysis is based on an examination of graphical plots.

60
Residual Plot Against x
• If the assumption that the variance of ε is the same for all values of x is valid,
and the assumed regression model is an adequate representation of the
relationship between the variables, then the residual plot should give an
overall impression of a horizontal band of points.

61
Residual Plot Against x
[Figure: residual (y - ŷ) plotted against x; Good Pattern]

62
Residual Plot Against x
[Figure: residual (y - ŷ) plotted against x; Nonconstant Variance]

63
Residual Plot Against x
[Figure: residual (y - ŷ) plotted against x; Model Form Not Adequate]

64
Residual Plot Against x
• Residuals
Observation Predicted Cars Sold Residuals
1 15 -1
2 25 -1
3 20 -2
4 15 2
5 25 2

65
Residual Plot Against x
[Figure: TV Ads Residual Plot; x axis: TV Ads (0 to 4), y axis: Residuals (-3 to 3)]

66
Standardized Residuals
• Standardized Residual for Observation i

$$\frac{y_i - \hat{y}_i}{s_{y_i - \hat{y}_i}}$$

where: $s_{y_i - \hat{y}_i} = s\sqrt{1 - h_i}$

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2}$$

67
Standardized Residual Plot
• The standardized residual plot can provide insight about the assumption that
the error term ε has a normal distribution.
• If this assumption is satisfied, the distribution of the standardized residuals
should appear to come from a standard normal probability distribution.

68
Standardized Residual Plot
• Standardized Residuals

Standardized
Observation Predicted y Residual Residual
1 15 -1 -0.5345
2 25 -1 -0.5345
3 20 -2 -1.0690
4 15 2 1.0690
5 25 2 1.0690
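The values in the table do not come out of the leverage-based formula on the earlier slide. They match each residual divided by sqrt(SSE/(n - 1)), which appears to be how Excel's Regression tool reports "Standard Residuals" (stated here as an assumption about Excel, checked only against these five numbers). The following Python sketch computes both versions so the difference is visible.

```python
import math

tv_ads = [1, 3, 2, 1, 3]
cars_sold = [14, 24, 18, 17, 27]
b0, b1 = 10.0, 5.0  # estimated regression equation from the slides

n = len(tv_ads)
x_bar = sum(tv_ads) / n
sxx = sum((x - x_bar) ** 2 for x in tv_ads)

predicted = [b0 + b1 * x for x in tv_ads]
residuals = [y - yh for y, yh in zip(cars_sold, predicted)]
sse = sum(e ** 2 for e in residuals)
s = math.sqrt(sse / (n - 2))  # standard error of the estimate

# Version 1: formula from the slides, using the leverage h_i of each observation.
leverage = [1 / n + (x - x_bar) ** 2 / sxx for x in tv_ads]
std_resid_formula = [e / (s * math.sqrt(1 - h)) for e, h in zip(residuals, leverage)]

# Version 2: residual divided by the sample standard deviation of the residuals
# (assumed to be Excel's "Standard Residuals"; it reproduces the table values).
s_resid = math.sqrt(sse / (n - 1))
std_resid_excel = [e / s_resid for e in residuals]

for i, (e, f, x) in enumerate(zip(residuals, std_resid_formula, std_resid_excel), start=1):
    print(f"obs {i}: residual = {e:+.0f}, formula = {f:+.4f}, Excel-style = {x:+.4f}")
```

Either way, all standardized residuals for this small sample stay well inside the range -2 to +2.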

69
Standardized Residual Plot
• Standardized Residual Plot

[Figure: standardized residual plot; x axis: Cars Sold (0 to 30), y axis: Standard Residuals (-1.5 to 1.5)]

RESIDUAL OUTPUT

Observation   Predicted Y   Residuals   Standard Residuals
1             15            -1          -0.534522
2             25            -1          -0.534522
3             20            -2          -1.069045
4             15            2           1.069045
5             25            2           1.069045

70
Standardized Residual Plot
• All of the standardized residuals are between -1.5 and +1.5, indicating that
there is no reason to question the assumption that ε has a normal distribution.

71
Outliers and Influential Observations
• Detecting Outliers
• An outlier is an observation that is unusual in comparison with the other
data.
• Minitab classifies an observation as an outlier if its standardized residual
value is < -2 or > +2.
• This standardized residual rule sometimes fails to identify an unusually large
observation as being an outlier.
• This rule’s shortcoming can be circumvented by using studentized deleted
residuals.
• The |i th studentized deleted residual| will be larger than the |i th
standardized residual|.

72
Next topic
• Multiple Regression
Multiple Regression
Dr. Nilakantan Narasinganallur Ph.D.
Multiple Regression
• Multiple Regression Model
• Least Squares Method
• Multiple Coefficient of Determination
• Model Assumptions
• Testing for Significance
• Using the Estimated Regression Equation for Estimation and Prediction
• Categorical Independent Variables
• Residual Analysis
• Logistic Regression

2
Multiple Regression
• In this chapter we continue our study of regression analysis by considering
situations involving two or more independent variables.
• This subject area, called multiple regression analysis, enables us to consider
more factors and thus obtain better estimates than are possible with simple
linear regression.

3
Multiple Regression Model
• Multiple Regression Model
The equation that describes how the dependent variable y is related
to the independent variables x1, x2, . . . xp and an error term is:
y = β0 + β1x1 + β2x2 + . . . + βpxp + ε

where:
β0, β1, β2, . . . , βp are the parameters, and
ε is a random variable called the error term

4
Multiple Regression Equation
• Multiple Regression Equation
The equation that describes how the mean value of y is related to x1,
x2, . . . xp is:
E(y) = β0 + β1x1 + β2x2 + . . . + βpxp

5
Estimated Multiple Regression Equation
• Estimated Multiple Regression Equation

ŷ = b0 + b1x1 + b2x2 + . . . + bpxp

A simple random sample is used to compute sample statistics b0, b1, b2, . . . , bp
that are used as the point estimators of the parameters β0, β1, β2, . . . , βp.

6
Estimation Process
1. Multiple Regression Model: y = β0 + β1x1 + β2x2 + . . . + βpxp + ε
   Multiple Regression Equation: E(y) = β0 + β1x1 + β2x2 + . . . + βpxp
   Unknown parameters are β0, β1, β2, . . . , βp
2. Sample Data: x1, x2, . . . , xp, y
3. Estimated Multiple Regression Equation: ŷ = b0 + b1x1 + b2x2 + . . . + bpxp
   Sample statistics are b0, b1, b2, . . . , bp
4. b0, b1, b2, . . . , bp provide estimates of β0, β1, β2, . . . , βp

7
Least Squares Method
• Least Squares Criterion

$$\min \sum (y_i - \hat{y}_i)^2$$

• Computation of Coefficient Values


The formulas for the regression coefficients b0, b1, b2, . . . bp involve the
use of matrix algebra. We will rely on computer software packages to
perform the calculations.
The emphasis will be on how to interpret the computer output rather
than on how to make the multiple regression computations.

8
Multiple Regression Model
• Example: Programmer Salary Survey
A software firm collected data for a sample of 20 computer programmers.
A suggestion was made that regression analysis could be used to determine if
salary was related to the years of experience and the score on the firm’s
programmer aptitude test.
The years of experience, score on the aptitude test, and corresponding
annual salary ($1000s) for a sample of 20 programmers are shown on the next
slide.

9
Multiple Regression Model
Exper. Test Salary Exper. Test Salary
(Yrs.) Score ($1000s) (Yrs.) Score ($1000s)
4 78 24.0 9 88 38.0
7 100 43.0 2 73 26.6
1 86 23.7 10 75 36.2
5 82 34.3 5 81 31.6
8 86 35.8 6 74 29.0
10 84 38.0 8 87 34.0
0 75 22.2 4 79 30.1
1 80 23.1 6 94 33.9
6 83 30.0 3 70 28.2
6 91 33.0 3 89 30.0

10
Multiple Regression Model
Suppose we believe that salary (y) is related to the years of experience
(x1) and the score on the programmer aptitude test (x2) by the following
regression model:
y = b0 + b1x1 + b2x2 + e

where
y = annual salary ($1000s)
x1 = years of experience
x2 = score on programmer aptitude test

11
Solving for the Estimates of β0, β1, β2
[Diagram: Input Data (x1, x2, y: 4 78 24; 7 100 43; . . . ; 3 89 30) → Computer Package for Solving Multiple Regression Problems → Least Squares Output (b0, b1, b2, R², etc.)]

12
Solving for the Estimates of β0, β1, β2
• Regression Equation Output

Predictor Coef SE Coef T p


Constant 3.17394 6.15607 0.5156 0.61279
Experience 1.4039 0.19857 7.0702 1.9E-06
Test Score 0.25089 0.07735 3.2433 0.00478

13
Estimated Regression Equation

SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE)

(Note: Predicted salary will be in thousands of dollars.)
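The slides leave the coefficient calculation to software. For two independent variables the least squares normal equations reduce to a 2 x 2 system in the centered sums of squares and cross-products, which can be solved without matrix libraries. The following is a minimal Python sketch of that approach (an assumption; it is not how the slides' output was produced), using the 20 observations from the programmer salary table. The helper name `centered_sum` is illustrative.

```python
# (years of experience, aptitude test score, annual salary in $1000s)
data = [
    (4, 78, 24.0), (7, 100, 43.0), (1, 86, 23.7), (5, 82, 34.3), (8, 86, 35.8),
    (10, 84, 38.0), (0, 75, 22.2), (1, 80, 23.1), (6, 83, 30.0), (6, 91, 33.0),
    (9, 88, 38.0), (2, 73, 26.6), (10, 75, 36.2), (5, 81, 31.6), (6, 74, 29.0),
    (8, 87, 34.0), (4, 79, 30.1), (6, 94, 33.9), (3, 70, 28.2), (3, 89, 30.0),
]

n = len(data)
means = [sum(row[k] for row in data) / n for k in range(3)]


def centered_sum(i, j):
    """Sum of (v_i - mean_i)(v_j - mean_j) over all observations."""
    return sum((row[i] - means[i]) * (row[j] - means[j]) for row in data)


s11, s22, s12 = centered_sum(0, 0), centered_sum(1, 1), centered_sum(0, 1)
s1y, s2y = centered_sum(0, 2), centered_sum(1, 2)

# Solve the 2 x 2 normal equations by Cramer's rule:
#   s11*b1 + s12*b2 = s1y
#   s12*b1 + s22*b2 = s2y
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
b0 = means[2] - b1 * means[0] - b2 * means[1]

print(f"SALARY = {b0:.3f} + {b1:.3f}(EXPER) + {b2:.3f}(SCORE)")

# Example prediction (in $1000s) for 4 years of experience and a test score of 78.
predicted = b0 + b1 * 4 + b2 * 78
print(f"predicted salary: {predicted:.2f}")
```

With more than two independent variables the same idea applies, but a general linear solver is the practical choice.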

14
Interpreting the Coefficients
• In multiple regression analysis, we interpret each regression coefficient as
follows:
bi represents an estimate of the change in y corresponding to a one-unit
increase in xi when all other independent variables are held
constant.

15
Interpreting the Coefficients

b1 = 1.404

Salary is expected to increase by $1,404 for each additional year of
experience (when the variable score on programmer aptitude test is
held constant).

16
Interpreting the Coefficients

b2 = 0.251

Salary is expected to increase by $251 for each additional point scored on
the programmer aptitude test (when the variable years of experience is
held constant).

17
Multiple Coefficient of Determination
• Relationship Among SST, SSR, SSE

SST = SSR + SSE

$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error

18
Multiple Coefficient of Determination
• ANOVA Output

Analysis of Variance

SOURCE DF SS MS F P
Regression 2 500.3285 250.164 42.76 0.000
Residual Error 17 99.45697 5.850
Total 19 599.7855

19
Multiple Coefficient of Determination

R2 = SSR/SST

R2 = 500.3285/599.7855 = .83418

20
Adjusted Multiple Coefficient of Determination
• Adding independent variables, even ones that are not statistically significant,
causes the prediction errors to become smaller, thus reducing the sum of
squares due to error, SSE.
• Because SSR = SST – SSE, when SSE becomes smaller, SSR becomes larger,
causing R2 = SSR/SST to increase.
• The adjusted multiple coefficient of determination compensates for the
number of independent variables in the model.

21
Adjusted Multiple Coefficient of Determination

R_a^2 = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}

R_a^2 = 1 - (1 - .834179) \frac{20 - 1}{20 - 2 - 1} = .814671
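As a check on the arithmetic, the two coefficients of determination can be computed from the ANOVA sums of squares. This is a small sketch; the function names are illustrative, and the inputs are the SSR and SST values from the ANOVA output with n = 20 observations and p = 2 independent variables.

```python
def r_squared(ssr, sst):
    """Multiple coefficient of determination: R2 = SSR/SST."""
    return ssr / sst


def adjusted_r_squared(r2, n, p):
    """Adjusted R2 for n observations and p independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


ssr, sst = 500.3285, 599.7855   # from the ANOVA output
n, p = 20, 2

r2 = r_squared(ssr, sst)
r2_adj = adjusted_r_squared(r2, n, p)
print(f"R2 = {r2:.5f}, adjusted R2 = {r2_adj:.6f}")
```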

22
Assumptions About the Error Term ε
• The error ε is a random variable with mean of zero.
• The variance of ε, denoted by σ², is the same for all values of the
independent variables.
• The values of ε are independent.
• The error ε is a normally distributed random variable reflecting the deviation
between the y value and the expected value of y given by
β0 + β1x1 + β2x2 + . . . + βpxp.

23
Testing for Significance
• In simple linear regression, the F and t tests provide the same conclusion.
• In multiple regression, the F and t tests have different purposes.

24
Testing for Significance: F Test
• The F test is used to determine whether a significant relationship exists
between the dependent variable and the set of all the independent variables.
• The F test is referred to as the test for overall significance.

25
Testing for Significance: t Test
• If the F test shows an overall significance, the t test is used to determine
whether each of the individual independent variables is significant.
• A separate t test is conducted for each of the independent variables in the
model.
• We refer to each of these t tests as a test for individual significance.

26
Testing for Significance: F Test
Hypotheses: H0: β1 = β2 = . . . = βp = 0
Ha: One or more of the parameters is not equal to zero

Test Statistic: F = MSR/MSE

Rejection Rule: Reject H0 if p-value < α or if F ≥ Fα, where Fα is based
on an F distribution with p d.f. in the numerator and
n - p - 1 d.f. in the denominator.

27
F Test for Overall Significance
Hypotheses: H0: β1 = β2 = 0
Ha: One or both of the parameters is not equal to zero.

Rejection Rule: For α = .05 and d.f. = 2, 17; F.05 = 3.59
Reject H0 if p-value < .05 or F > 3.59

28
F Test for Overall Significance
• ANOVA Output
Analysis of Variance

SOURCE DF SS MS F P
Regression 2 500.3285 250.164 42.76 0.000
Residual Error 17 99.45697 5.850
Total 19 599.7855

(The P column gives the p-value used to test for overall significance.)

29
F Test for Overall Significance
Test Statistic: F = MSR/MSE = 250.16/5.85 = 42.76

Conclusion: p-value < .05, so we can reject H0. (Also, F = 42.76 > 3.59)
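The F statistic can be rebuilt from the sums of squares and degrees of freedom in the ANOVA table. The sketch below assumes the critical value F.05 = 3.59 quoted on the slide rather than computing it; the function name is illustrative.

```python
def f_statistic(ssr, sse, n, p):
    """Overall-significance F statistic: MSR/MSE."""
    msr = ssr / p             # mean square due to regression
    mse = sse / (n - p - 1)   # mean square due to error
    return msr / mse


F_CRITICAL = 3.59  # F.05 with 2 and 17 d.f., taken from the slide

f_value = f_statistic(ssr=500.3285, sse=99.45697, n=20, p=2)
reject_h0 = f_value > F_CRITICAL
print(f"F = {f_value:.2f}, reject H0: {reject_h0}")
```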

30
Testing for Significance: t Test
Hypotheses: H0: βi = 0
Ha: βi ≠ 0

Test Statistic: t = \frac{b_i}{s_{b_i}}

Rejection Rule: Reject H0 if p-value < α or if t < -tα/2 or t > tα/2,
where tα/2 is based on a t distribution with n - p - 1
degrees of freedom.

31
t Test for Significance of Individual Parameters
Hypotheses: H0: βi = 0
Ha: βi ≠ 0

Rejection Rule: For α = .05 and d.f. = 17, t.025 = 2.11
Reject H0 if p-value < .05, or if t < -2.11 or t > 2.11

32
t Test for Significance of Individual Parameters
• Regression Equation Output

Predictor Coef SE Coef T p


Constant 3.17394 6.15607 0.5156 0.61279
Experience 1.4039 0.19857 7.0702 1.9E-06
Test Score 0.25089 0.07735 3.2433 0.00478

(The T and p columns give the t statistic and p-value used to test for
the individual significance of “Test Score”.)

34
t Test for Significance of Individual Parameters
Test Statistics: t = \frac{b_1}{s_{b_1}} = \frac{1.4039}{.1986} = 7.07

t = \frac{b_2}{s_{b_2}} = \frac{.25089}{.07735} = 3.24

Conclusions: Reject both H0: β1 = 0 and H0: β2 = 0.
Both independent variables are significant.
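The same t statistics can be reproduced from the coefficient and standard-error columns of the regression output. This is a minimal sketch; the critical value 2.11 is the t.025 value for 17 d.f. quoted on the earlier slide, and the variable names are illustrative.

```python
# (coefficient, standard error) pairs from the regression output
estimates = {
    "Experience": (1.4039, 0.19857),
    "Test Score": (0.25089, 0.07735),
}
T_CRITICAL = 2.11  # t.025 with 17 d.f., taken from the slide

t_stats = {name: b / se for name, (b, se) in estimates.items()}
significant = {name: abs(t) > T_CRITICAL for name, t in t_stats.items()}

for name, t in t_stats.items():
    print(f"{name}: t = {t:.2f}, significant: {significant[name]}")
```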

35
Testing for Significance: Multicollinearity
• The term multicollinearity refers to the correlation among the independent
variables.
• When the independent variables are highly correlated (say, |r| > .7), it is not
possible to determine the separate effect of any particular independent
variable on the dependent variable.
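One simple screen for multicollinearity is the sample correlation between pairs of independent variables. The sketch below applies it to the experience and test-score columns of the programmer salary survey (the 20 observations tabulated in the Programmer Salary Survey example); the function name `pearson_r` is an illustrative helper, and the .7 cutoff is the rule of thumb stated above.

```python
from math import sqrt

# Years of experience and aptitude test scores for the 20 sampled programmers.
exper = [4, 7, 1, 5, 8, 10, 0, 1, 6, 6, 9, 2, 10, 5, 6, 8, 4, 6, 3, 3]
score = [78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
         88, 73, 75, 81, 74, 87, 79, 94, 70, 89]


def pearson_r(x, y):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson_r(exper, score)
highly_correlated = abs(r) > 0.7
print(f"r = {r:.3f}, highly correlated: {highly_correlated}")
```

For these data the correlation is well below the .7 rule of thumb, so multicollinearity between the two predictors does not appear to be a concern.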

36
Testing for Significance: Multicollinearity
• If the estimated regression equation is to be used only for predictive
purposes, multicollinearity is usually not a serious problem.
• Every attempt should be made to avoid including independent variables that
are highly correlated.

37
Using the Estimated Regression Equation
for Estimation and Prediction
• The procedures for estimating the mean value of y and predicting an
individual value of y in multiple regression are similar to those in simple
regression.
• We substitute the given values of x1, x2, . . . , xp into the estimated regression
equation and use the corresponding value of ŷ as the point estimate.

38
Using the Estimated Regression Equation
for Estimation and Prediction
• The formulas required to develop interval estimates for the mean value of y
and for an individual value of y are beyond the scope of the textbook.
• Software packages for multiple regression will often provide these interval
estimates.

39
Residual Analysis
• For simple linear regression the residual plot against ŷ and the residual plot
against x provide the same information.
• In multiple regression analysis it is preferable to use the residual plot against ŷ
to determine if the model assumptions are satisfied.

40
Standardized Residual Plot Against ŷ
• Standardized residuals are frequently used in residual plots for purposes of:
• Identifying outliers (typically, standardized residuals < -2 or > +2)
• Providing insight about the assumption that the error term ε has a normal
distribution
• The computation of the standardized residuals in multiple regression analysis is
too complex to be done by hand.
• Excel’s Regression tool can be used.

41
Standardized Residual Plot Against ŷ
• Residual Output

Observation Predicted Y Residuals Standard Residuals


1 27.89626 -3.89626 -1.771707
2 37.95204 5.047957 2.295406
3 26.02901 -2.32901 -1.059048
4 32.11201 2.187986 0.994921
5 36.34251 -0.54251 -0.246689
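The outlier rule of thumb (standardized residual below -2 or above +2) is easy to apply once the residual output is available. The sketch below uses only the five rows shown in the residual output above; the variable names are illustrative.

```python
# (observation, predicted y, residual, standardized residual)
residual_output = [
    (1, 27.89626, -3.89626, -1.771707),
    (2, 37.95204, 5.047957, 2.295406),
    (3, 26.02901, -2.32901, -1.059048),
    (4, 32.11201, 2.187986, 0.994921),
    (5, 36.34251, -0.54251, -0.246689),
]

# Flag observations whose standardized residual is outside [-2, +2].
outliers = [obs for obs, _, _, std_res in residual_output if abs(std_res) > 2]
print("Possible outliers:", outliers)
```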

42
Standardized Residual Plot Against ŷ
[Figure: Standardized Residual Plot. Vertical axis: Standard Residuals (-3 to 3); horizontal axis: Predicted Salary (0 to 50).]

43
Categorical Independent Variables
• In many situations we must work with categorical independent variables such
as gender (male, female), method of payment (cash, check, credit card), etc.
• For example, x2 might represent gender where x2 = 0 indicates male and x2 = 1
indicates female.
• In this case, x2 is called a dummy or indicator variable.

44
Categorical Independent Variables
• Example: Programmer Salary Survey
As an extension of the problem involving the computer programmer salary
survey, suppose that management also believes that the annual salary is
related to whether the individual has a graduate degree in computer science
or information systems.
The years of experience, the score on the programmer aptitude test,
whether the individual has a relevant graduate degree, and the annual
salary ($1000) for each of the sampled 20 programmers are shown on
the next slide.

45
Categorical Independent Variables
Exper.  Test          Salary      Exper.  Test          Salary
(Yrs.)  Score  Degr.  ($1000)     (Yrs.)  Score  Degr.  ($1000)
4       78     No     24.0        9       88     Yes    38.0
7       100    Yes    43.0        2       73     No     26.6
1       86     No     23.7        10      75     Yes    36.2
5       82     Yes    34.3        5       81     No     31.6
8       86     Yes    35.8        6       74     No     29.0
10      84     Yes    38.0        8       87     Yes    34.0
0       75     No     22.2        4       79     No     30.1
1       80     No     23.1        6       94     Yes    33.9
6       83     No     30.0        3       70     No     28.2
6       91     Yes    33.0        3       89     No     30.0

46
Categorical Independent Variables
• Regression Equation Output

ŷ = b0 + b1x1 + b2x2 + b3x3

where:
ŷ = annual salary ($1000)
x1 = years of experience
x2 = score on programmer aptitude test
x3 = 0 if individual does not have a graduate degree,
     1 if individual does have a graduate degree
(x3 is a dummy variable)

47
Categorical Independent Variables
• ANOVA Output
Analysis of Variance

SOURCE DF SS MS F P
Regression 3 507.8960 169.299 29.48 0.000
Residual Error 16 91.8895 5.743
Total 19 599.7855

R2 = 507.896/599.7855 = .8468 (Previously, R2 = .8342)

R_a^2 = 1 - (1 - .8468) \frac{20 - 1}{20 - 3 - 1} = .8181 (Previously, Adjusted R2 = .815)

48
Categorical Independent Variables
• Regression Equation Output

Predictor Coef SE Coef T p


Constant 7.945 7.382 1.076 0.298
Experience 1.148 0.298 3.856 0.001
Test Score 0.197 0.090 2.191 0.044
Grad. Degr. 2.280 1.987 1.148 0.268

Grad. Degr. is not significant (p = 0.268 > .05).
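To show where coefficients like these come from, the sketch below fits the three-variable model by least squares using the normal equations (X'X)b = X'y and the 20 survey observations, with the graduate degree coded as a 0/1 dummy. It is written in plain Python for illustration only; in practice a statistical package would be used, and the helper names `solve` and `ols` are assumptions, not part of the original material.

```python
def solve(a, b):
    """Solve the linear system a·x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * pv for v, pv in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least squares coefficients [b0, b1, ...] via the normal equations."""
    x = [[1.0] + list(r) for r in rows]          # prepend intercept column
    k = len(x[0])
    xtx = [[sum(xi[i] * xi[j] for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * yi for xi, yi in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)


# (experience, test score, graduate degree: 1 = Yes / 0 = No, salary in $1000)
data = [
    (4, 78, 0, 24.0), (7, 100, 1, 43.0), (1, 86, 0, 23.7), (5, 82, 1, 34.3),
    (8, 86, 1, 35.8), (10, 84, 1, 38.0), (0, 75, 0, 22.2), (1, 80, 0, 23.1),
    (6, 83, 0, 30.0), (6, 91, 1, 33.0), (9, 88, 1, 38.0), (2, 73, 0, 26.6),
    (10, 75, 1, 36.2), (5, 81, 0, 31.6), (6, 74, 0, 29.0), (8, 87, 1, 34.0),
    (4, 79, 0, 30.1), (6, 94, 1, 33.9), (3, 70, 0, 28.2), (3, 89, 0, 30.0),
]

coefs = ols([row[:3] for row in data], [row[3] for row in data])
for name, b in zip(["Constant", "Experience", "Test Score", "Grad. Degr."], coefs):
    print(f"{name:12s} {b:.3f}")
```

The fitted coefficients should agree with the Coef column of the regression output above to rounding.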

49
More Complex Categorical Variables
• If a categorical variable has k levels, k - 1 dummy variables are required, with
each dummy variable being coded as 0 or 1.
• For example, a variable with levels A, B, and C could be represented by x1 and
x2 values of (0, 0) for A, (1, 0) for B, and (0, 1) for C.
• Care must be taken in defining and interpreting the dummy variables.

50
More Complex Categorical Variables
• For example, a variable indicating level of education could be represented by x1
and x2 values as follows:

Highest
Degree x1 x2
Bachelor’s 0 0
Master’s 1 0
Ph.D. 0 1
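The k - 1 coding scheme in the table can be sketched as a small function. This is an illustrative helper (the name `dummy_code` is an assumption); it treats the first level in the list as the baseline, which is coded as all zeros.

```python
def dummy_code(value, levels):
    """Return k - 1 indicator values; the first level is the baseline."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return tuple(1 if value == level else 0 for level in levels[1:])


degree_levels = ["Bachelor's", "Master's", "Ph.D."]  # baseline listed first

for degree in degree_levels:
    x1, x2 = dummy_code(degree, degree_levels)
    print(f"{degree:11s} x1 = {x1}, x2 = {x2}")
```

With this coding, each dummy variable's coefficient is interpreted relative to the baseline level (here, a Bachelor's degree).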

51
Modeling Curvilinear Relationships
• Example: Sales of Laboratory Scales
A manufacturer of laboratory scales wants to investigate the relationship
between the length of employment of their salespeople and the number of
scales sold.
The table on the next slide gives the number of months each salesperson
has been employed by the firm (x) and the number of scales sold (y) by 15
randomly selected salespersons.

52
Modeling Curvilinear Relationships
• Example: Sales of Laboratory Scales

Months Sales Months Sales

41 275 40 189
106 296 51 235
76 317 9 83
104 376 12 112
22 162 6 67
12 150 56 325
85 367 19 189
111 308

53
Modeling Curvilinear Relationships
• Excel’s Chart tools can be used to develop a scatter diagram and fit a straight
line to bivariate data.
• The estimated regression equation and the coefficient of determination for
simple linear regression can also be developed.
• The results of using Excel’s Chart tools to fit a line to the data are shown on
the next slide.

54
Modeling Curvilinear Relationships
• Chart Tools Output

55
Modeling Curvilinear Relationships
• The scatter diagram indicates a possible curvilinear relationship between the
length of time employed and the number of scales sold.
• So, we develop a multiple regression model with two independent variables: x
and x².

y = β0 + β1x + β2x² + ε

• This model is often referred to as a second-order polynomial or a quadratic
model.

56
Modeling Curvilinear Relationships
• Excel’s Chart tools can be used to fit a polynomial curve to the data. (Dialog
box is on next slide.)
• To get the dialog box, position the mouse pointer over any data point in the
scatter diagram and right-click.
• The estimated multiple regression equation and multiple coefficient of
determination for this second-order model are also obtained.

57
Modeling Curvilinear Relationships
• Chart Tools
Dialog Box

58
Modeling Curvilinear Relationships
• Chart Tools Output

59
Modeling Curvilinear Relationships
• Excel’s Chart tools output does not provide any means for testing the
significance of the results, so we need to use Excel’s Regression tool.
• We will treat the values of x² as a second independent variable (called
MonthsSq on the next slide).

60
Modeling Curvilinear Relationships
• Second Independent Variable (MonthsSq) Added

Months MonthsSq Sales Months MonthsSq Sales

41 1681 275 40 1600 189


106 11236 296 51 2601 235
76 5776 317 9 81 83
104 10816 376 12 144 112
22 484 162 6 36 67
12 144 150 56 3136 325
85 7225 367 19 361 189
111 12321 308
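The same idea can be sketched outside Excel: square the Months column to create MonthsSq, then fit both the straight-line and the second-order model by least squares and compare their R2 values. The code below is a plain-Python illustration using the normal equations; the helper names are assumptions, and a statistical package would normally be used instead.

```python
def solve(a, b):
    """Solve the linear system a·x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * pv for v, pv in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(columns, y):
    """Return (coefficients, R2) for a least squares fit with an intercept."""
    x = [[1.0] + list(row) for row in zip(*columns)]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    b = solve(xtx, xty)
    y_hat = [sum(bj * xj for bj, xj in zip(b, r)) for r in x]
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return b, 1 - sse / sst


months = [41, 106, 76, 104, 22, 12, 85, 111, 40, 51, 9, 12, 6, 56, 19]
sales = [275, 296, 317, 376, 162, 150, 367, 308, 189, 235, 83, 112, 67, 325, 189]
months_sq = [m ** 2 for m in months]          # the MonthsSq column

b_lin, r2_lin = fit([months], sales)
b_quad, r2_quad = fit([months, months_sq], sales)
print(f"Linear R2 = {r2_lin:.4f}, quadratic R2 = {r2_quad:.4f}")
print("Quadratic coefficients (b0, b1, b2):", [round(b, 4) for b in b_quad])
```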

61
Modeling Curvilinear Relationships
• Excel’s Regression Tool Output

We should be pleased with the fit provided by the estimated multiple
regression equation.

62
Modeling Curvilinear Relationships
• Excel’s Regression Tool Output

The overall model is significant (p-value for the F test is 8.75E-07)

63
Modeling Curvilinear Relationships
• Excel’s Regression Tool Output

We can conclude that adding MonthsSq to the model is significant.

64
