Nilkanta Sir Merged PDF
An Introduction
Dr. Nilakantan Narasinganallur Ph.D.
Statistics – study of data with analytical measures to understand the
structure underlying the data
Business applications
• Marketing: Electronic scanners at retail checkout counters collect data that can be used for market research, e.g. market basket analysis.
• Production: Emphasis on quality makes quality control an important application of statistics in production, e.g. process charts, control charts, Six Sigma.
• Economics: Economists frequently provide forecasts about the future of the economy or some aspect of it, e.g. the producer price index, unemployment rate, capacity utilization.
What is data?
• Define data
• Data types & examples
• Raw data examples
• Data & information – what is different?
• Use of summaries
• Use of tables & charts
Categorical data
• Nominal category
• Labels or names or numbers representing categories
• Ordinal data
• Numbers carrying partial meanings
• Data collation in frequency distribution
• Data visualisation in bar charts, pie charts
Quantitative data
• Interval type
• Examples –
• Ratio type
• Examples –
Raw data
Let us look at some hypothetical data on purchases of soft drinks,
identified in terms of the brands. The data from a sample of 100
purchases are presented in the next slide.
In order to derive meaningful information from the data, what do we do?
Summarizing: Frequency distribution
[Table: raw sample of 100 soft drink purchases, recorded by brand: slice, mountain dew, maaza, sprite, thumsup, 7 up, mirinda, pepsi, diet pepsi, fanta, coca cola, frooti, limca]
Frequency distribution – soft drink purchases
soft drink      frequency
sprite          6
thumsup         9
pepsi           8
diet pepsi      6
coca cola       7
limca           7
mirinda         10
fanta           6
7 up            8
mountain dew    12
maaza           7
frooti          10
slice           4
sum             100
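The step from raw purchases to a frequency distribution can be sketched in Python. This is only an illustrative sketch, not the method used on the slides: the short `purchases` list is a hypothetical stand-in for the 100-purchase sample, and `frequency_distribution` is a helper name chosen here.

```python
from collections import Counter

# Hypothetical stand-in for the raw sample: one brand name per purchase.
purchases = ["sprite", "pepsi", "sprite", "frooti", "pepsi", "sprite", "limca", "frooti"]

def frequency_distribution(items):
    """Return (category, frequency, percent frequency) rows, most frequent first."""
    counts = Counter(items)
    total = sum(counts.values())
    return [(brand, n, 100 * n / total) for brand, n in counts.most_common()]

rows = frequency_distribution(purchases)
for brand, n, pct in rows:
    print(f"{brand:<10} {n:>3} {pct:6.1f}%")
```

The same tally applied to the full sample would reproduce the table above, with the frequencies summing to 100.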
Government YEAR EXPORT GROWTH%
document 1 100
13
Elements are the entities on which data are collected.
14
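Data, Data Sets, Elements, Variables, and Observations
Company        Stock Exchange   Annual Sales ($M)   Earnings per share ($)
Dataram        NQ               73.10               0.86
EnergySouth    N                74.00               1.67
Keystone       N                365.70              0.86
LandCare       NQ               111.40              0.33
Psychemedics   N                17.60               0.13
• Element names: the companies in the first column
• Variables: Stock Exchange, Annual Sales, Earnings per share
• Observation: the set of measurements for one element (one row)
• Data set: the entire table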
Scales of Measurement
Scales of measurement include
• Nominal
• Ordinal
• Interval
• Ratio
The scale determines the amount of information contained in the data.
Scales of Measurement: Nominal scale
Example
Alternatively, a numeric code could be used for the school variable (e.g.
1 denotes Business, 2 denotes Humanities, 3 denotes Education, and so
on).
17
Scales of Measurement: Ordinal scale
The data have the properties of nominal data and the order or rank of the data is meaningful.
Example
Alternatively, a numeric code could be used for the class standing variable
(e.g. 1 denotes Freshman, 2 denotes Sophomore, and so on).
Question ?
18
Interval scale
19
Scales of Measurement: Ratio scale
• Data have all the properties of interval data and the ratio of two values is meaningful.
• Ratio data are always numerical.
• Zero value is included in the scale.
Example 1:
The price of a book at a retail store is Rs. 200, while the price of the same book sold online is Rs. 100. The ratio property shows that the retail store charges twice the online price.
Example 2:
The temperature outside is 35 degrees Celsius. It was 40 yesterday. Is this interval or ratio scale?
Categorical and Quantitative Data
• Data can be further classified as being categorical or quantitative.
• The statistical analysis that is appropriate depends on whether the data for the variable are categorical or quantitative.
• In general, there are more alternatives for statistical analysis when the data are quantitative.
Categorical Data
• Labels or names are used to identify an attribute of each element
• Often referred to as qualitative data
• Use either the nominal or ordinal scale of measurement
• Can be either numeric or nonnumeric
• Appropriate statistical analyses are rather limited
Quantitative Data
• Quantitative data indicate how many or how much.
• Quantitative data are always numeric.
• Ordinary arithmetic operations are meaningful for quantitative data.
Cross-Sectional Data
Example
Time Series Data
Example
Data detailing the number of building permits issued in
Mumbai during each of the last 36 months.
25
Graph of Time Series Data
Time Series
Data
26
Existing Sources
27
Most of the statistical information in newspapers, magazines, company
reports, and other publications consists of data that are summarized and
presented in a form that is easy to understand.
28
Example: Harsha Auto Repair
29
Harsha Auto Repair: frequency distribution
lower   upper   Frequency   % frequency
3600    4249    2           7%
…
6850    7499    5           17%
Total           30          100%
[Figure: histogram of frequency by bin, with upper class limits 4249, 4899, 5549, 6199, 6849, 7499]
Numerical Descriptive Statistics
• The most common numerical descriptive statistic is the mean (or average).
• The mean is a measure of the central tendency, or central location, of the data for a variable.
Statistical Inference: A Preview
• Population: The set of all elements of interest in a particular study.
• Statistical inference: The process of using data obtained from a sample to make estimates and test hypotheses about the characteristics of a population.
• Census: Collecting data for the entire population.
Process of Statistical Inference
Example: Harsha Auto
33
Analytics is the scientific process of transforming
data into insight for making better decisions.
Techniques:
34
Big data: Large and complex data set.
35
Data warehousing
• Data warehousing is the process of capturing, storing, and maintaining the data.
• Organizations obtain large amounts of data on a daily basis by means of magnetic card readers, bar code scanners, point of sale terminals, and touch screen monitors.
• Wal-Mart captures data on 20-30 million transactions per day.
• Visa processes 6,800 payment transactions per second.
Data Mining
• Methods for developing useful decision-making information from large databases.
• Using a combination of procedures from statistics, mathematics, and computer science, analysts “mine the data” to convert it into useful information.
• The most effective data mining systems use automated procedures to discover relationships in the data and predict future outcomes prompted by general and even vague queries by the user.
Data Mining Applications
• The major applications of data mining have been made by companies with a strong consumer focus such as retail, financial, and communication firms.
• Data mining is used to identify related products that customers who have already purchased a specific product are also likely to purchase (and then pop-ups are used to draw attention to those related products).
• Data mining is also used to identify customers who should receive special discount offers based on their past purchasing volumes.
Data Mining Requirements
• Statistical methodology such as multiple regression, logistic regression, and correlation are heavily used.
• Also needed are computer science technologies involving artificial intelligence and machine learning.
• A significant investment in time and money is required as well.
Data Mining Model Reliability
• Finding a statistical model that works well for a particular sample of data does not necessarily mean that it can be reliably applied to other data.
• With the enormous amount of data available, the data set can be partitioned into a training set (for model development) and a test set (for validating the model).
• There is, however, a danger of overfitting the model to the point that misleading associations and conclusions appear to exist.
• Careful interpretation of results and extensive testing are important.
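The partition into a training set and a test set can be sketched as follows. This is a minimal illustration only: the 80/20 split, the fixed seed, and the helper name `train_test_split` are assumptions made here, and `records` is a hypothetical stand-in for real observations.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split it into (training, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(100))  # hypothetical stand-in for 100 observations
train, test = train_test_split(records)
print(len(train), "training records,", len(test), "test records")
```

A model would be developed on `train` only and then checked on `test`, which is what guards against reading too much into an overfitted model.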
40
Ethical Guidelines for Statistical Practice
• In a statistical study, unethical behavior can take a variety of forms including:
• Improper sampling
• Inappropriate analysis of the data
• Development of misleading graphs
• Use of inappropriate summary statistics
• Biased interpretation of the statistical results
• One should strive to be fair, thorough, objective, and neutral when collecting, analyzing, and presenting data.
• As a consumer of statistics, one should also be aware of the possibility of unethical behavior by others.
Summarizing Categorical Data
• Frequency Distribution
• Percent Frequency Distribution
• Pie Chart
Frequency Distribution
• A frequency distribution is a tabular summary of data showing
the number (frequency) of observations in each of several
non-overlapping categories or classes.
• The objective is to provide insights about the data that cannot
be quickly obtained by looking only at the original data.
43
Frequency Distribution
44
Relative & Percent Frequency Distribution
Bar Chart
[Figure: bar chart of frequency (vertical axis) by quality rating (horizontal axis): Poor, Below Average, Average, Above Average, Excellent]
Pareto Diagram – Let us try drawing this!
• In quality control, bar charts are used to identify the most important causes of problems.
• When the bars are arranged in descending order of height from left to right (with the most frequently occurring cause appearing first), the bar chart is called a Pareto diagram.
• This diagram is named for its founder, Vilfredo Pareto, an Italian economist.
Pie Chart – Let us draw this for Marada Inn!
• The pie chart is a commonly used graphical display for presenting relative frequency and percent frequency distributions for categorical data.
• First draw a circle; then use the relative frequencies to subdivide the circle into sectors that correspond to the relative frequency for each class.
• Since there are 360 degrees in a circle, a class with a relative frequency of .25 would consume .25(360) = 90 degrees of the circle.
Pie Chart: Marada Inn Quality Ratings
Poor            10%
Below Average   15%
Average         25%
Above Average   45%
Excellent        5%
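The sector sizes for the Marada Inn pie chart follow from the 360-degree rule. The sketch below uses the percent frequencies shown on the slide; the helper name `sector_angles` is chosen here for illustration.

```python
# Percent frequencies for the Marada Inn quality ratings, as shown on the slide.
ratings = {"Poor": 10, "Below Average": 15, "Average": 25, "Above Average": 45, "Excellent": 5}

def sector_angles(percent_by_class):
    """Convert percent frequencies into pie-sector angles in degrees."""
    return {label: pct * 360 / 100 for label, pct in percent_by_class.items()}

angles = sector_angles(ratings)
for label, angle in angles.items():
    print(f"{label:<14} {angle:6.1f} degrees")
```

The "Average" class, with a relative frequency of .25, takes up 90 degrees, matching the rule stated above.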
Insights Gained from the Preceding Pie Chart
51
Summarizing Quantitative Data
• Frequency Distribution
• Relative Frequency and Percent Frequency Distributions
• Dot Plot
• Histogram
• Cumulative Distributions
• Stem-and-Leaf Display
52
Example: Harsha Auto Repair – let us do in Excel!
Sample of Parts Cost (Rs.) for 30 Tune-ups
6370 5460 6510 3990 5250 3640
4970 4830 5040 6230 4620 5250
7280 5180 4340 4760 6790 7350
5950 6790 6160 4760 5810 4760
4340 5740 6860 7070 5530 7350
• What should be your approach?
• Put the data into a frequency distribution
Steps
• Calculate the minimum and maximum.
• Provide a guess of the number of class intervals, between 6 and 20.
• Calculate the number of class intervals (CI) with a formula, Sturges' rule:
• k = 1 + 3.322 log10(N)
• In this case, k =
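The steps above can be sketched in Python as a cross-check on the Excel work. The data are the 30 parts costs from the slide; the lower limit 3600 and class width 650 are taken from the frequency table on the slides, while rounding k up to a whole number and the helper names are assumptions made here.

```python
import math

# Parts cost (Rs.) for the 30 tune-ups listed on the slide.
costs = [
    6370, 5460, 6510, 3990, 5250, 3640,
    4970, 4830, 5040, 6230, 4620, 5250,
    7280, 5180, 4340, 4760, 6790, 7350,
    5950, 6790, 6160, 4760, 5810, 4760,
    4340, 5740, 6860, 7070, 5530, 7350,
]

def sturges_classes(n):
    """Sturges' rule, k = 1 + 3.322 * log10(n), rounded up (rounding is an assumption)."""
    return math.ceil(1 + 3.322 * math.log10(n))

def class_frequencies(data, lower, width, k):
    """Count the values falling in each of k equal-width classes starting at lower."""
    counts = [0] * k
    for x in data:
        counts[int((x - lower) // width)] += 1
    return counts

k = sturges_classes(len(costs))
counts = class_frequencies(costs, lower=3600, width=650, k=k)
for i, c in enumerate(counts):
    start = 3600 + i * 650
    print(f"{start}-{start + 649}: {c:2d} ({100 * c / len(costs):.0f}%)")
```

The first and last classes come out at 2 and 5 tune-ups, which agrees with the 7% and 17% rows of the slide's table.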
Harsha Auto Repair: frequency distribution
lower   upper   Frequency   % frequency
3600    4249    2           7%
…
6850    7499    5           17%
Total           30          100%
[Figure: histogram of frequency by bin, with upper class limits 4249, 4899, 5549, 6199, 6849, 7499]
Histogram
• Another common graphical display of quantitative data is a histogram.
• The variable of interest is placed on the horizontal axis.
• A rectangle is drawn above each class interval with its height corresponding to the interval’s frequency, relative frequency, or percent frequency.
[Figure: histogram; vertical axis: relative frequency]
Histograms Showing Skewness
• Moderately Skewed Left
• A longer tail to the left
• Example: Exam Scores
[Figure: histogram moderately skewed left; vertical axis: relative frequency]
Histograms Showing Skewness
• Moderately Right Skewed
• A Longer tail to the right
• Example: Housing Values
[Figure: histogram moderately skewed right; vertical axis: relative frequency]
Histograms Showing Skewness
• Highly Skewed Right
• A very long tail to the right
• Example: Executive Salaries
[Figure: histogram highly skewed right; vertical axis: relative frequency]
Cumulative Distributions
• Cumulative frequency distribution – shows the number of items with values less than or equal to the upper limit of each class.
• Cumulative relative frequency distribution – shows the proportion of items with values less than or equal to the upper limit of each class.
• Cumulative percent frequency distribution – shows the percentage of items with values less than or equal to the upper limit of each class.
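The three cumulative distributions can be sketched from a list of class frequencies. The frequencies below are tallied here from the 30 Harsha Auto Repair parts costs using the slide's class limits; only the first and last values (2 and 5) are legible in the extracted table, so treat the middle four as derived rather than quoted.

```python
from itertools import accumulate

# Upper class limits from the slide; frequencies tallied from the 30 parts costs.
upper_limits = [4249, 4899, 5549, 6199, 6849, 7499]
frequencies = [2, 7, 7, 4, 5, 5]

total = sum(frequencies)
cumulative = list(accumulate(frequencies))           # cumulative frequency
cumulative_rel = [c / total for c in cumulative]     # cumulative relative frequency
cumulative_pct = [100 * r for r in cumulative_rel]   # cumulative percent frequency

for limit, c, r, p in zip(upper_limits, cumulative, cumulative_rel, cumulative_pct):
    print(f"<= {limit}: {c:2d}  {r:.2f}  {p:5.1f}%")
```

The last row always reaches the full sample: 30 items, a proportion of 1, and 100%.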
Learnings
Descriptive Statistics
2
Arithmetic Mean
• Most important measure of location
• Provides a measure of central location
• Mean of a data set is the average of its data values.
• Sample mean is the point estimator of population mean.
• Affected by extreme values/outliers
Sample mean:
$$\bar{x} = \frac{\sum x_i}{n}$$
545 715 530 690 535 700 560 700 540 715
540 540 540 625 525 545 675 545 550 550
565 550 625 550 550 560 535 560 565 580
550 570 590 572 575 575 600 580 670 565
700 585 680 570 590 600 649 600 600 580
670 615 550 545 625 635 575 650 580 610
610 675 590 535 700 535 545 535 530 540
4
Sample Mean
• Example: Apartment Rents
$$\bar{x} = \frac{\sum x_i}{n} = \frac{41{,}356}{70} = 590.80$$
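The sample mean for the apartment rents can be checked with a few lines of Python. The values are the 70 rents from the slides (listed here in ascending order); `sample_mean` is a helper name chosen for illustration.

```python
# The 70 monthly apartment rents from the slides, in ascending order.
rents = [
    525, 530, 530, 535, 535, 535, 535, 535, 540, 540,
    540, 540, 540, 545, 545, 545, 545, 545, 550, 550,
    550, 550, 550, 550, 550, 560, 560, 560, 565, 565,
    565, 570, 570, 572, 575, 575, 575, 580, 580, 580,
    580, 585, 590, 590, 590, 600, 600, 600, 600, 610,
    610, 615, 625, 625, 625, 635, 649, 650, 670, 670,
    675, 675, 680, 690, 700, 700, 700, 700, 715, 715,
]

def sample_mean(values):
    """Sum of the data values divided by the number of values."""
    return sum(values) / len(values)

total = sum(rents)
x_bar = sample_mean(rents)
print(f"sum = {total}, n = {len(rents)}, mean = {x_bar:.2f}")
```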
Median
• The median of a data set is the value in the middle when the data items are arranged in ascending order.
• For an odd number of observations:
7 observations: 26 18 27 12 14 27 19
in ascending order: 12 14 18 19 26 27 27
The median is the middle value: Median = 19
Median
• For an even number of observations:
8 observations: 26 18 27 12 14 27 30 19
in ascending order: 12 14 18 19 26 27 27 30
The median is the average of the middle two values: Median = (19 + 26)/2 = 22.5
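The odd and even cases can be sketched in one small function. The two samples are the ones shown on the slides; the function itself is an illustrative sketch (Python's built-in `statistics.median` gives the same results).

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_sample = [26, 18, 27, 12, 14, 27, 19]        # 7 observations
even_sample = [26, 18, 27, 12, 14, 27, 30, 19]   # 8 observations

print(median(odd_sample))
print(median(even_sample))
```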
Median
• Example: Apartment Rents
Averaging the 35th and 36th data values:
Median = (575 + 575)/2 = 575
525 530 530 535 535 535 535 535 540 540
540 540 540 545 545 545 545 545 550 550
550 550 550 550 550 560 560 560 565 565
565 570 570 572 575 575 575 580 580 580
580 585 590 590 590 600 600 600 600 610
610 615 625 625 625 635 649 650 670 670
675 675 680 690 700 700 700 700 715 715
9
Trimmed Mean
• Another measure, sometimes used when extreme values are present, is the trimmed mean.
• It is obtained by deleting a percentage of the smallest and largest values from a data set and then computing the mean of the remaining values.
• For example, the 5% trimmed mean is obtained by removing the smallest 5% and the largest 5% of the data values and then computing the mean of the remaining values.
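A trimmed mean can be sketched as below. The sample is hypothetical (ten values with one extreme value of 95), and truncating the number of trimmed values to a whole number is an assumption; software packages differ in how they handle a fractional count.

```python
def trimmed_mean(values, fraction):
    """Mean after dropping the given fraction of values from each end of the sorted data."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values trimmed from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical sample with one extreme value.
sample = [12, 14, 15, 15, 16, 17, 18, 18, 19, 95]

plain_mean = sum(sample) / len(sample)
mean_10pct = trimmed_mean(sample, 0.10)
print(plain_mean, mean_10pct)
```

Dropping one value from each end removes the extreme value 95, so the trimmed mean sits much closer to the bulk of the data than the ordinary mean does.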
Mode
• The mode of a data set is the value that occurs with greatest frequency.
• The greatest frequency can occur at two or more different values.
• If the data have exactly two modes, the data are bimodal.
• If the data have more than two modes, the data are multimodal.
Mode
• Example: Apartment Rents
550 occurred most frequently (7 times)
Mode = 550
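Finding the mode, including the bimodal case, can be sketched with the standard library. Both samples below are short hypothetical lists, not the full rent data, and `describe_modes` is a helper name chosen here.

```python
from statistics import multimode

def describe_modes(values):
    """Return the list of modes and a label based on how many there are."""
    modes = multimode(values)
    label = {1: "unimodal", 2: "bimodal"}.get(len(modes), "multimodal")
    return modes, label

one_mode = [550, 550, 560, 575, 550]   # hypothetical: 550 occurs most often
two_modes = [1, 1, 2, 2, 3]            # hypothetical: 1 and 2 tie

print(describe_modes(one_mode))
print(describe_modes(two_modes))
```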
Weighted Mean
• In some instances the mean is computed by giving each observation a weight that reflects its relative importance.
• The choice of weights depends on the application.
• The weights might be the number of credit hours earned for each grade, as in GPA.
• In other weighted mean computations, quantities such as pounds, dollars, or volume are frequently used.
$$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$$
where: x_i = value of observation i
w_i = weight for observation i
Numerator: sum of the weighted data values
Denominator: sum of the weights
Weighted Mean
• Example: Construction Wages
Worker        x_i     w_i     w_i x_i
Carpenter     21.60   520     11232.0
Electrician   28.72   230      6605.6
Laborer       11.80   410      4838.0
Painter       19.75   270      5332.5
Plumber       24.16   160      3865.6
Total                 1590    31873.7
$$\bar{x} = \frac{\sum w_i x_i}{\sum w_i} = \frac{31{,}873.7}{1{,}590} = 20.0464 \approx \$20.05$$
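The construction wages example can be reproduced as a short sketch. The wages and weights are the ones in the table; `weighted_mean` is a helper name chosen for illustration.

```python
# Wage (x_i) and weight (w_i) for each worker type, from the table.
wages = {"Carpenter": 21.60, "Electrician": 28.72, "Laborer": 11.80,
         "Painter": 19.75, "Plumber": 24.16}
weights = {"Carpenter": 520, "Electrician": 230, "Laborer": 410,
           "Painter": 270, "Plumber": 160}

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of w_i."""
    numerator = sum(weights[k] * values[k] for k in values)
    denominator = sum(weights[k] for k in values)
    return numerator / denominator

x_bar_w = weighted_mean(wages, weights)
print(f"weighted mean wage = {x_bar_w:.2f}")
```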
Geometric Mean
• The geometric mean is calculated by finding the nth root of the product of n values.
$$\bar{x}_g = \sqrt[n]{(x_1)(x_2)\cdots(x_n)} = [(x_1)(x_2)\cdots(x_n)]^{1/n}$$
Geometric Mean
• Example: Rate of Return
Period   Return (%)   Growth Factor
1        -6.0         0.940
2        -8.0         0.920
3        -4.0         0.960
4         2.0         1.020
5         5.4         1.054
$$\bar{x}_g = \sqrt[5]{(.94)(.92)(.96)(1.02)(1.054)} = [.89254]^{1/5} = .97752$$
Average growth rate per period is (.97752 - 1)(100) = -2.248%
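The rate-of-return example can be verified with a short sketch. The five returns are taken from the table; converting each return to a growth factor as 1 + r/100 follows the table's third column, and `geometric_mean` is a helper name chosen here.

```python
import math

returns_pct = [-6.0, -8.0, -4.0, 2.0, 5.4]           # period returns from the table
growth_factors = [1 + r / 100 for r in returns_pct]  # 0.940, 0.920, 0.960, 1.020, 1.054

def geometric_mean(values):
    """nth root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

x_g = geometric_mean(growth_factors)
avg_growth_pct = (x_g - 1) * 100
print(f"geometric mean = {x_g:.5f}, average growth = {avg_growth_pct:.3f}%")
```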
20
Percentiles
• A percentile provides information about how the data are spread over the
interval from the smallest value to the largest value.
• Admission test scores for colleges and universities are frequently reported in
terms of percentiles.
• The pth percentile of a data set is a value such that at least p percent
of the items take on this value or less and at least (100 - p) percent of
the items take on this value or more.
21
Percentiles
• Arrange the data in ascending order.
• Compute Lp, the location of the pth percentile.
Lp = (p/100)(n + 1)
22
80th Percentile
• Example: Apartment Rents
Lp = (p/100)(n + 1) = (80/100)(70 + 1) = 56.8
(the 56th value plus .8 times the
difference between the 57th and 56th values)
80th Percentile = 635 + .8(649 – 635) = 646.2
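The location rule Lp = (p/100)(n + 1) and the interpolation step can be sketched as follows. The data are the 70 sorted rents; the function name and the handling of locations that fall before the first or after the last value are assumptions made for this sketch.

```python
# The 70 monthly apartment rents from the slides, in ascending order.
rents = [
    525, 530, 530, 535, 535, 535, 535, 535, 540, 540,
    540, 540, 540, 545, 545, 545, 545, 545, 550, 550,
    550, 550, 550, 550, 550, 560, 560, 560, 565, 565,
    565, 570, 570, 572, 575, 575, 575, 580, 580, 580,
    580, 585, 590, 590, 590, 600, 600, 600, 600, 610,
    610, 615, 625, 625, 625, 635, 649, 650, 670, 670,
    675, 675, 680, 690, 700, 700, 700, 700, 715, 715,
]

def percentile(sorted_values, p):
    """pth percentile using the location Lp = (p/100)(n + 1) with linear interpolation."""
    n = len(sorted_values)
    location = (p / 100) * (n + 1)
    i = int(location)          # whole part: position of the lower neighbour (1-based)
    frac = location - i        # fractional part used for interpolation
    if i < 1:
        return sorted_values[0]
    if i >= n:
        return sorted_values[-1]
    lower, upper = sorted_values[i - 1], sorted_values[i]
    return lower + frac * (upper - lower)

p80 = percentile(rents, 80)
print(f"80th percentile = {p80:.1f}")
```

The same function with p = 75 gives the third quartile used later for the box plot.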
80th Percentile
• Example: Apartment Rents
“At least 80% of the items take on a value of 646.2 or less.” 56/70 = .8 or 80%
“At least 20% of the items take on a value of 646.2 or more.” 14/70 = .2 or 20%
Quartiles
• Quartiles are specific percentiles.
• First Quartile = 25th Percentile
• Second Quartile = 50th Percentile = Median
• Third Quartile = 75th Percentile
Third Quartile (75th Percentile)
• Example: Apartment Rents
Lp = (p/100)(n + 1) = (75/100)(70 + 1) = 53.25
(the 53rd value plus .25 times the
difference between the 54th and 53rd values)
Third quartile = 625 + .25(625 – 625) = 625
Measures of Variability
• It is often desirable to consider measures of variability (dispersion), as well as measures of location.
• For example, in choosing supplier A or supplier B we might consider not only the average delivery time for each, but also the variability in delivery time for each.
Measures of Variability
• Range
• Interquartile Range
• Variance
• Standard Deviation
• Coefficient of Variation
28
Range
• The range of a data set is the difference between the largest and smallest data values.
• Example: Apartment Rents
Range = largest value - smallest value = 715 - 525 = 190
Interquartile Range
• The interquartile range (IQR) of a data set is the difference between the third quartile and the first quartile.
• Example: Apartment Rents
IQR = Q3 - Q1 = 625 - 545 = 80
Variance
• The variance is a measure of variability that utilizes all the data.
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \quad \text{for a sample}$$
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \quad \text{for a population}$$
Standard Deviation
• The standard deviation of a data set is the positive square root of the variance.
• It is measured in the same units as the data, making it more easily interpreted than the variance.
• The standard deviation is computed as follows:
$$s = \sqrt{s^2} \quad \text{for a sample}$$
$$\sigma = \sqrt{\sigma^2} \quad \text{for a population}$$
Coefficient of Variation
• The coefficient of variation indicates how large the standard deviation is in
relation to the mean.
• The coefficient of variation is computed as follows:
$$\left(\frac{s}{\bar{x}} \times 100\right)\% \quad \text{for a sample}$$
$$\left(\frac{\sigma}{\mu} \times 100\right)\% \quad \text{for a population}$$
Sample Variance, Standard Deviation, and Coefficient of Variation
• Example: Apartment Rents
• Variance
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 2{,}996.16$$
• Standard Deviation
$$s = \sqrt{s^2} = \sqrt{2{,}996.16} = 54.74$$
• Coefficient of Variation
$$\left(\frac{s}{\bar{x}} \times 100\right)\% = \left(\frac{54.74}{590.80} \times 100\right)\% = 9.27\%$$
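The three measures can be computed directly from the 70 rents. This is a sketch using the sample formulas above (division by n - 1); `sample_variability` is a helper name chosen here. Note that the slide's 9.27% uses the rounded value s = 54.74, so the unrounded computation can differ in the last digit.

```python
import math

# The 70 monthly apartment rents from the slides, in ascending order.
rents = [
    525, 530, 530, 535, 535, 535, 535, 535, 540, 540,
    540, 540, 540, 545, 545, 545, 545, 545, 550, 550,
    550, 550, 550, 550, 550, 560, 560, 560, 565, 565,
    565, 570, 570, 572, 575, 575, 575, 580, 580, 580,
    580, 585, 590, 590, 590, 600, 600, 600, 600, 610,
    610, 615, 625, 625, 625, 635, 649, 650, 670, 670,
    675, 675, 680, 690, 700, 700, 700, 700, 715, 715,
]

def sample_variability(values):
    """Return (sample variance, sample standard deviation, coefficient of variation in %)."""
    n = len(values)
    x_bar = sum(values) / n
    s2 = sum((x - x_bar) ** 2 for x in values) / (n - 1)
    s = math.sqrt(s2)
    cv = s / x_bar * 100
    return s2, s, cv

s2, s, cv = sample_variability(rents)
print(f"variance = {s2:.2f}, standard deviation = {s:.2f}, CV = {cv:.2f}%")
```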
Measures of Distribution Shape,
Relative Location, and Detecting Outliers
• Distribution Shape
• z-Scores
• Chebyshev’s Theorem
• Empirical Rule
• Detecting Outliers
• Five-number summary
• Box plots
39
Distribution Shape: Skewness
• An important measure of the shape of a distribution is called skewness.
• The formula for the skewness of sample data is
$$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left(\frac{x_i - \bar{x}}{s}\right)^3$$
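The skewness formula above can be applied to the apartment rents as a sketch. The function name is chosen here; s is the sample standard deviation (division by n - 1), which is what the formula assumes.

```python
import math

# The 70 monthly apartment rents from the slides, in ascending order.
rents = [
    525, 530, 530, 535, 535, 535, 535, 535, 540, 540,
    540, 540, 540, 545, 545, 545, 545, 545, 550, 550,
    550, 550, 550, 550, 550, 560, 560, 560, 565, 565,
    565, 570, 570, 572, 575, 575, 575, 580, 580, 580,
    580, 585, 590, 590, 590, 600, 600, 600, 600, 610,
    610, 615, 625, 625, 625, 635, 649, 650, 670, 670,
    675, 675, 680, 690, 700, 700, 700, 700, 715, 715,
]

def sample_skewness(values):
    """n / ((n-1)(n-2)) times the sum of cubed standardized deviations."""
    n = len(values)
    x_bar = sum(values) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))
    cubes = sum(((x - x_bar) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubes

skew = sample_skewness(rents)
print(f"skewness = {skew:.2f}")
```

A positive result indicates a longer tail to the right, consistent with the mean (590.80) exceeding the median (575).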
Distribution Shape: Skewness
• Symmetric (not skewed)
• Skewness is zero.
• Mean and median are equal.
Skewness = 0
[Figure: symmetric histogram; vertical axis: relative frequency]
Distribution Shape: Skewness
• Moderately Skewed Left
• Skewness is negative.
• Mean will usually be less than the median.
[Figure: histogram moderately skewed left; vertical axis: relative frequency]
Distribution Shape: Skewness
• Moderately Skewed Right
• Skewness is positive.
• Mean will usually be more than the median.
[Figure: histogram moderately skewed right; vertical axis: relative frequency]
Distribution Shape: Skewness
• Highly Skewed Right
• Skewness is positive (often above 1.0).
• Mean will usually be more than the median.
[Figure: histogram highly skewed right; vertical axis: relative frequency]
Distribution Shape: Skewness
• Example: Apartment Rents – Let us draw the graphs in Excel
Seventy efficiency apartments were randomly sampled in a college town. The
monthly rent prices for the apartments are listed below in ascending order.
525 530 530 535 535 535 535 535 540 540
540 540 540 545 545 545 545 545 550 550
550 550 550 550 550 560 560 560 565 565
565 570 570 572 575 575 575 580 580 580
580 585 590 590 590 600 600 600 600 610
610 615 625 625 625 635 649 650 670 670
675 675 680 690 700 700 700 700 715 715
45
Distribution Shape: Skewness
• Example: Apartment Rents
Skewness = .92
[Figure: histogram of apartment rents; vertical axis: relative frequency]
z-Scores
• The z-score is often called the standardized value.
• It denotes the number of standard deviations a data value x_i is from the mean.
$$z_i = \frac{x_i - \bar{x}}{s}$$
• Excel’s STANDARDIZE function can be used to compute the z-score.
47
z-Scores
• An observation’s z-score is a measure of the relative location of the observation
in a data set.
• A data value less than the sample mean will have a z-score less than zero.
• A data value greater than the sample mean will have a z-score greater than
zero.
• A data value equal to the sample mean will have a z-score of zero.
48
z-Scores – calculate in Excel!
• Example: Apartment Rents
• z-Score of Smallest Value (525)
$$z_i = \frac{x_i - \bar{x}}{s} = \frac{525 - 590.80}{54.74} = -1.20$$
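As a cross-check on the Excel calculation, the z-score can be sketched in Python. The mean and standard deviation are the rounded values from the slides; `z_score` and `is_outlier` are helper names chosen here, and the |z| > 3 cutoff is the criterion the slides use for a possible outlier.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

def is_outlier(z, cutoff=3):
    """Flag a value whose z-score is beyond the cutoff in either direction."""
    return abs(z) > cutoff

mean, sd = 590.80, 54.74           # sample mean and standard deviation from the slides
z_smallest = z_score(525, mean, sd)
z_largest = z_score(715, mean, sd)
print(f"{z_smallest:.2f} {z_largest:.2f}")
```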
Chebyshev’s Theorem
• At least (1 - 1/z²) of the items in any data set will be within z standard
deviations of the mean, where z is any value greater than 1.
• Chebyshev’s theorem requires z > 1, but z need not be an integer.
50
Chebyshev’s Theorem
• At least 75% of the data values must be within z = 2 standard
deviations of the mean.
• At least 89% of the data values must be within z = 3 standard
deviations of the mean.
• At least 94% of the data values must be within z = 4 standard
deviations of the mean.
51
Chebyshev’s Theorem
• Example: Apartment Rents
Let z = 1.5 with x̄ = 590.80 and s = 54.74
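The Chebyshev bound for this example can be worked through in a short sketch. The inputs z = 1.5, mean 590.80 and s = 54.74 are from the slide; the function name is chosen here.

```python
def chebyshev_bound(z):
    """Minimum proportion of values within z standard deviations of the mean (z > 1)."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2

mean, s, z = 590.80, 54.74, 1.5
bound = chebyshev_bound(z)
lower, upper = mean - z * s, mean + z * s
print(f"at least {bound:.0%} of rents lie between {lower:.2f} and {upper:.2f}")
```

So at least about 56% of the rents must fall between roughly 509 and 673. The bound is a guaranteed minimum for any data set; the actual proportion in a given sample can be considerably higher.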
52
Empirical Rule
• When the data are believed to approximate a bell-shaped distribution:
• The empirical rule can be used to determine the percentage of data
values that must be within a specified number of standard deviations
of the mean.
• The empirical rule is based on the normal distribution, which is
covered in Chapter 6.
53
Empirical Rule
For data having a bell-shaped distribution:
• 68.26% of the values of a normal random variable are within +/- 1
standard deviation of its mean.
• 95.44% of the values of a normal random variable are within +/- 2
standard deviations of its mean.
• 99.72% of the values of a normal random variable are within +/- 3
standard deviations of its mean.
54
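Empirical Rule
[Figure: bell-shaped curve with the x-axis marked at μ - 3σ, μ - 2σ, μ - 1σ, μ, μ + 1σ, μ + 2σ, μ + 3σ; 68.26% of values lie within ±1σ, 95.44% within ±2σ, and 99.72% within ±3σ]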
Detecting Outliers
• An outlier is an unusually small or unusually large value in a data set.
• A data value with a z-score less than -3 or greater than +3 might be considered
an outlier.
• It might be:
• an incorrectly recorded data value
• a data value that was incorrectly included in the data set
• a correctly recorded data value that belongs in the data set
56
Detecting Outliers
• Example: Apartment Rents
• The most extreme z-scores are -1.20 and 2.27
• Using |z| > 3 as the criterion for an outlier, there are no outliers in this data
set.
Standardized Values for Apartment Rents
-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93
-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75
-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47
-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20
-0.20 -0.11 -0.01 -0.01 -0.01 0.17 0.17 0.17 0.17 0.35
0.35 0.44 0.62 0.62 0.62 0.81 1.06 1.08 1.45 1.45
1.54 1.54 1.63 1.81 1.99 1.99 1.99 1.99 2.27 2.27
57
Outlier detection with IQR – in Excel!
• IQR method identifies outliers by setting up a “fence” outside of Q1
and Q3. Values falling outside of this fence are considered outliers.
• We build this fence by taking 1.5 times the IQR, then subtracting this
value from Q1 and adding this value to Q3. This gives us the minimum
and maximum fence posts that we compare each observation to.
• Any observations that are more than 1.5 IQR below Q1 or more than
1.5 IQR above Q3 are considered outliers. This is the method that
Minitab uses to identify outliers by default.
Five-Number Summaries and Box Plots
• Summary statistics and easy-to-draw graphs can be used to quickly
summarize large quantities of data.
• Two tools that accomplish this are five-number summaries and box plots.
59
Five-Number Summary – Apartment Rents
1. Smallest Value
2. First Quartile
3. Median
4. Third Quartile
5. Largest Value
60
Five-Number Summary- Let’s do it in excel!
• Example: Apartment Rents
Lowest Value = 525
First Quartile = 545
Median = 575
Third Quartile = 625
Largest Value = 715
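Alongside the Excel exercise, the five-number summary can be sketched with Python's standard library. `statistics.quantiles` with `method="exclusive"` interpolates at positions based on n + 1, which corresponds to the location rule Lp = (p/100)(n + 1) used in these slides; the helper name `five_number_summary` is chosen here.

```python
from statistics import quantiles

# The 70 monthly apartment rents from the slides, in ascending order.
rents = [
    525, 530, 530, 535, 535, 535, 535, 535, 540, 540,
    540, 540, 540, 545, 545, 545, 545, 545, 550, 550,
    550, 550, 550, 550, 550, 560, 560, 560, 565, 565,
    565, 570, 570, 572, 575, 575, 575, 580, 580, 580,
    580, 585, 590, 590, 590, 600, 600, 600, 600, 610,
    610, 615, 625, 625, 625, 635, 649, 650, 670, 670,
    675, 675, 680, 690, 700, 700, 700, 700, 715, 715,
]

def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest)."""
    q1, q2, q3 = quantiles(values, n=4, method="exclusive")
    return min(values), q1, q2, q3, max(values)

summary = five_number_summary(rents)
print(summary)
```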
Box Plot
• A box plot is a graphical summary of data that is based on a five-number
summary.
• A key to the development of a box plot is the computation of the median and
the quartiles Q1 and Q3.
• Box plots provide another way to identify outliers.
62
Box Plot – Let’s do it in excel!
• Example: Apartment Rents
• A box is drawn with its ends located at the first and third quartiles.
• A vertical line is drawn in the box at the location of the median (second
quartile).
[Figure: box plot on an axis from 500 to 725, with Q1 = 545, Q2 = 575, Q3 = 625]
Box Plot
• Limits are located (not drawn) using the interquartile range (IQR).
• Data outside these limits are considered outliers.
• The location of each outlier is shown with the symbol * .
Box Plot
• Example: Apartment Rents
• The lower limit is located 1.5(IQR) below Q1.
Lower Limit: Q1 - 1.5(IQR) = 545 - 1.5(80) = 425
• The upper limit is located 1.5(IQR) above Q3.
Upper Limit: Q3 + 1.5(IQR) = 625 + 1.5(80) = 745
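The box plot limits and the outlier check can be sketched as follows. Q1 = 545 and Q3 = 625 are the quartiles from the slides; the value 760 in the second check is hypothetical, added only to show what a flagged outlier looks like, and the function names are chosen here.

```python
def iqr_limits(q1, q3, multiplier=1.5):
    """Lower and upper limits located multiplier * IQR outside the quartiles."""
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

def find_outliers(values, q1, q3):
    """Values falling outside the IQR limits."""
    lower, upper = iqr_limits(q1, q3)
    return [x for x in values if x < lower or x > upper]

limits = iqr_limits(545, 625)
print(limits)

# Smallest and largest apartment rents: both lie inside the limits.
print(find_outliers([525, 715], 545, 625))

# Hypothetical extra rent of 760, shown only to illustrate a flagged value.
print(find_outliers([525, 715, 760], 545, 625))
```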
Box Plot
• Example: Apartment Rents
• Whiskers (dashed lines) are drawn from the ends of the box to
the smallest and largest data values inside the limits.
[Figure: box plot with whiskers on an axis from 500 to 725]
Practice with Excel – list of completed activity!
Probability
Theory
Dr. Nilakantan Narasinganallur
Ph.D.
Before we get into a new topic, let’s recap! (1/2)
Why should outliers be identified?
• We understand from theory that the mean is affected by extreme values.
• The median is not affected.
• In a given data set, how do we establish this difference in the properties of mean and median?
With the extreme value included:
Mean = 92.8
Median = 66.5
Now let us remove the extreme value 350:
Mean = 64.9
Median = 65.5
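One way to see this difference is to compute both measures with and without the extreme value. The slide's underlying data set is not given, so the list below is hypothetical; only the extreme value 350 is borrowed from the slide.

```python
from statistics import mean, median

# Hypothetical data; only the extreme value 350 comes from the slide.
with_outlier = [60, 62, 64, 65, 66, 67, 68, 70, 72, 350]
without_outlier = [x for x in with_outlier if x != 350]

print(mean(with_outlier), median(with_outlier))
print(mean(without_outlier), median(without_outlier))
```

In this illustration the mean drops sharply once 350 is removed, while the median barely moves, which mirrors the pattern in the slide's figures.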
4
Uncertainties
• Managers often base their decisions on an analysis of uncertainties such as the following:
• What are the chances that sales will decrease if we increase prices?
• What is the likelihood a new assembly method will increase productivity?
• What are the odds that a new investment will be profitable?
5
Probability
• Probability is a numerical measure of the likelihood that an event
will occur.
• Probability values are always assigned on a scale from 0 to 1.
• A probability near zero indicates an event is quite unlikely to
occur.
• A probability near one indicates an event is almost certain to
occur.
6
Probability as a Numerical Measure
of the Likelihood of Occurrence
Increasing Likelihood of Occurrence
Probability: 0 .5 1
7
Statistical Experiments
• What is the difference between statistical and scientific
experiments?
8
An Experiment and Its Sample Space
• An experiment is any process that generates well-defined
outcomes.
• The sample space for an experiment is the set of all experimental
outcomes.
• An experimental outcome is also called a sample point.
9
An Experiment and Its Sample Space
Experiment              Experimental Outcomes
Toss a coin             Head, tail
Inspect a part          Defective, non-defective
Conduct a sales call    Purchase, no purchase
Roll a die              1, 2, 3, 4, 5, 6
Play a football game    Win, lose, tie
10
An Experiment and Its Sample Space
• Example: Bradley Investments
Bradley has invested in two stocks, Markley Oil and Collins Mining. Bradley has
determined that the possible outcomes of these investments three months from now are
as follows.
12
A Counting Rule for Multiple-Step Experiments
• Example: Bradley Investments
Bradley Investments can be viewed as a two-step experiment. It involves two stocks,
each with a set of experimental outcomes.
Markley Oil: n1 = 4
Collins Mining: n2 = 2
Total Number of
Experimental Outcomes: n1n2 = (4)(2) = 8
13
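Tree Diagram
• Example: Bradley Investments
[Figure: tree diagram with Markley Oil (Stage 1) and Collins Mining (Stage 2) branches leading to the experimental outcomes, e.g. (10, 8) gain $18,000; (10, -2) gain $8,000; (5, 8) gain $13,000]
Counting Rule for Combinations
• Number of Combinations of N Objects Taken n at a Time
$$C_n^N = \binom{N}{n} = \frac{N!}{n!(N-n)!}$$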
Counting Rule for Permutations
• Number of Permutations of N Objects Taken n at a Time
• A third useful counting rule enables us to count the number of experimental outcomes when n objects are to be selected from a set of N objects, where the order of selection is important.
$$P_n^N = n!\binom{N}{n} = \frac{N!}{(N-n)!}$$
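The three counting rules can be tried out with Python's `math` module. The multiple-step example uses the Bradley Investments counts from the slides (n1 = 4, n2 = 2); the choice of N = 5 and n = 2 for combinations and permutations is hypothetical, picked only for illustration.

```python
import math

# Multiple-step experiment: Bradley Investments, n1 = 4 and n2 = 2 outcomes.
step_outcomes = [4, 2]
total_outcomes = math.prod(step_outcomes)

# Hypothetical selection problem: choose n = 2 objects from N = 5.
N, n = 5, 2
combinations = math.comb(N, n)   # order of selection does not matter
permutations = math.perm(N, n)   # order of selection matters

print(total_outcomes, combinations, permutations)
```

Each combination of 2 objects can be ordered in 2! = 2 ways, which is why the permutation count is twice the combination count here.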
Assigning Probabilities
• Basic Requirements for Assigning Probabilities
1. The probability assigned to each experimental outcome must
be between 0 and 1, inclusively.
0 ≤ P(Ei) ≤ 1 for all i
17
Assigning Probabilities
• Basic Requirements for Assigning Probabilities
2. The sum of the probabilities for all experimental outcomes must
equal 1.
P(E1) + P(E2) + . . . + P(En) = 1
where: n is the number of experimental outcomes
18
Probability puzzle-discussion and solution
You are going by car to a small town in the
interior of India. You come to a four-road
junction where the direction indicator is
uprooted and lying down by the wayside.
There are three roads leading away from the
junction.
There is a board indicating the traffic in each
direction. You know the town you intend to
visit is more famous and gets the maximum
traffic.
There is a man sitting alone on a bench near
the junction.
What will you do to find the correct road to
your destination?
Alternatives?
1. Follow the traffic details.
2. Ask the old man.
3. Choose one of the roads at random.
Solve the puzzle.
Assigning Probabilities
• Classical Method
Assigning probabilities based on the assumption of equally likely
outcomes
20
Classical Method
• Example: Rolling a Die
If an experiment has n possible outcomes, the classical method will assign a probability
of 1/n to each outcome.
21
Relative Frequency Method
• Example: Lucas Tool Rental
Lucas Tool Rental would like to assign probabilities to the number of
car polishers it rents each day. Office records show the following
frequencies of daily rentals for the last 40 days.
Number of Number
Polishers Rented of Days
0 4
1 6
2 18
3 10
4 2
22
Relative Frequency Method
• Example: Lucas Tool Rental
Each probability assignment is given by dividing the frequency (number of days) by
the total frequency (total number of days).
Number of Number
Polishers Rented of Days Probability
0 4 .10 = 4/40
1 6 .15
2 18 .45
3 10 .25
4 2 .05
40 1.00
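The relative frequency calculation above can be sketched in a few lines of Python. The frequencies are the ones in the Lucas Tool Rental table; the variable and function names are illustrative assumptions.

```python
# Frequencies of daily rentals over the last 40 days (from the table above).
days_by_rentals = {0: 4, 1: 6, 2: 18, 3: 10, 4: 2}

def relative_frequencies(freq):
    """Divide each frequency by the total frequency."""
    total = sum(freq.values())
    return {x: count / total for x, count in freq.items()}

probs = relative_frequencies(days_by_rentals)
for polishers, p in probs.items():
    print(polishers, round(p, 2))
# The assigned probabilities should add up to 1.
print(round(sum(probs.values()), 2))
```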
23
Subjective Method
• When economic conditions or a company’s circumstances change rapidly
it might be inappropriate to assign probabilities based solely on
historical data.
• We can use any data available as well as our experience and intuition,
but ultimately a probability value should express our degree of belief
that the experimental outcome will occur.
• The best probability estimates often are obtained by combining the
estimates from the classical or relative frequency approach with the
subjective estimate.
24
Subjective Method
• Example: Bradley Investments
An analyst made the following probability estimates.
26
Events and Their Probabilities
• Example: Bradley Investments
Event M = Markley Oil Profitable
M = {(10, 8), (10, -2), (5, 8), (5, -2)}
P(M) = P(10, 8) + P(10, -2) + P(5, 8) + P(5, -2)
= .20 + .08 + .16 + .26
= .70
27
Events and Their Probabilities
• Example: Bradley Investments
Event C = Collins Mining Profitable
C = {(10, 8), (5, 8), (0, 8), (-20, 8)}
P(C) = P(10, 8) + P(5, 8) + P(0, 8) + P(-20, 8)
= .20 + .16 + .10 + .02
= .48
28
Some Basic Relationships of Probability
There are some basic probability relationships that can be used to compute the probability
of an event without knowledge of all the sample point probabilities.
Complement of an Event
Union of Two Events
Intersection of Two Events
29
Complement of an Event
• The complement of event A is defined to be the event consisting of
all sample points that are not in A.
• The complement of A is denoted by Ac.
Sample
Event A Ac Space S
Venn Diagram
30
Birthday puzzle
In probability theory, the birthday problem or birthday paradox concerns the probability that, in a set of n randomly chosen people, some pair of them will have the same birthday.
How many people do you need before the odds are good (greater than 50%) that at least two of them share a birthday?
What is the probability that two members of your class have the same birthday?
Solution: Let’s figure the odds that no one shares a birthday and invert that. The odds are calculated by counting all the ways that N people won’t share a birthday and dividing by the number of possible birthdays they could have.
For example, two people could have 365×365 birthday combinations. That’s the denominator. To count the numerator, imagine that the first person gets to choose their birthday. They can pick from 365 days. The second person can also pick their birthday, but can’t share a birthday with the first person. They’ve got 364 days to choose from. So the chance that two people don’t share a birthday is (365×364)/365². Subtract that from 1 and you get what you expect: that there’s a 1 in 365 chance that two people share a birthday.
For three people, the denominator is 365³ and the numerator is 365×364×363. The formula for the probability that N people all have different birthdays is:
$$P(N) = \frac{365 \times 364 \times \cdots \times (365-N+1)}{365^N}$$
The probability that at least two of the N people share a birthday is 1 − P(N).
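The puzzle can be explored numerically. The following is a minimal sketch, assuming 365 equally likely birthdays and ignoring leap years; the function names are illustrative.

```python
def prob_shared_birthday(n: int, days: int = 365) -> float:
    """Probability that at least two of n people share a birthday."""
    p_no_match = 1.0
    for k in range(n):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_no_match *= (days - k) / days
    return 1.0 - p_no_match

def smallest_group(threshold: float = 0.5) -> int:
    """Smallest group size whose sharing probability exceeds threshold."""
    n = 1
    while prob_shared_birthday(n) <= threshold:
        n += 1
    return n

print(prob_shared_birthday(2))   # equals 1/365
print(smallest_group())          # 23
```

Under these assumptions, a group of 23 people is enough for the chance of a shared birthday to exceed 50%.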
Union of Two Events
• The union of events A and B is the event containing all sample points
that are in A or B or both.
• The union of events A and B is denoted by A ∪ B.
Sample
Event A Event B Space S
Venn Diagram
32
Union of Two Events
• Example: Bradley Investments
Event M = Markley Oil Profitable
Event C = Collins Mining Profitable
M ∪ C = Markley Oil Profitable
or Collins Mining Profitable (or both)
M ∪ C = {(10, 8), (10, -2), (5, 8), (5, -2), (0, 8), (-20, 8)}
P(M ∪ C) = P(10, 8) + P(10, -2) + P(5, 8) + P(5, -2) + P(0, 8) + P(-20, 8)
= .20 + .08 + .16 + .26 + .10 + .02
= .82
33
Intersection of Two Events
• The intersection of events A and B is the set of all sample points that
are in both A and B.
• The intersection of events A and B is denoted by A ∩ B.
Intersection of A and B
Sample
Event A Event B Space S
Venn Diagram
34
Intersection of Two Events
• Example: Bradley Investments
Event M = Markley Oil Profitable
Event C = Collins Mining Profitable
M ∩ C = Markley Oil Profitable and Collins Mining Profitable
M ∩ C = {(10, 8), (5, 8)}
P(M ∩ C) = P(10, 8) + P(5, 8)
= .20 + .16
= .36
35
Addition Law
• The addition law provides a way to compute the probability of
event A, or B, or both A and B occurring.
• The law is written as:
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
36
Addition Law
• Example: Bradley Investments
Event M = Markley Oil Profitable
Event C = Collins Mining Profitable
M ∪ C = Markley Oil Profitable or Collins Mining Profitable
We know: P(M) = .70, P(C) = .48, P(M ∩ C) = .36
Thus: P(M ∪ C) = P(M) + P(C) - P(M ∩ C)
= .70 + .48 - .36
= .82
(This result is the same as that obtained earlier
using the definition of the probability of an event.)
37
Mutually Exclusive Events
• Two events are said to be mutually exclusive if the events have no
sample points in common.
• Two events are mutually exclusive if, when one event occurs, the
other cannot occur.
Sample
Event A Event B Space S
Venn Diagram
38
Mutually Exclusive Events
• If events A and B are mutually exclusive, P(A ∩ B) = 0.
39
Conditional Probability
• The probability of an event given that another event has occurred is
called a conditional probability.
• The conditional probability of A given B is denoted by P(A|B).
• A conditional probability is computed as follows :
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
40
Conditional Probability
• Example: Bradley Investments
Event M = Markley Oil Profitable
Event C = Collins Mining Profitable
P(C|M) = Collins Mining Profitable given Markley Oil Profitable
We know: P(M ∩ C) = .36, P(M) = .70
Thus:
$$P(C \mid M) = \frac{P(C \cap M)}{P(M)} = \frac{.36}{.70} = .5143$$
41
Multiplication Law
• The multiplication law provides a way to compute the probability of
the intersection of two events.
• The law is written as:
P(A ∩ B) = P(B)P(A|B)
42
Multiplication Law
• Example: Bradley Investments
Event M = Markley Oil Profitable
Event C = Collins Mining Profitable
M ∩ C = Markley Oil Profitable and Collins Mining Profitable
We know: P(M) = .70, P(C|M) = .5143
Thus: P(M ∩ C) = P(M)P(C|M)
= (.70)(.5143)
= .36
(This result is the same as that obtained earlier
using the definition of the probability of an event.)
43
Joint Probability Table
Collins Mining
Markley Oil Profitable (C) Not Profitable (Cc) Total
44
Independent Events
• If the probability of event A is not changed by the existence of
event B, we would say that events A and B are independent.
• Two events A and B are independent if:
P(A|B) = P(A) or P(B|A) = P(B)
45
Multiplication Law for Independent Events
• The multiplication law also can be used as a test to see if two events
are independent.
• The law is written as:
P(A ∩ B) = P(A)P(B)
46
Multiplication Law for Independent Events
• Example: Bradley Investments
Event M = Markley Oil Profitable
Event C = Collins Mining Profitable
Are events M and C independent?
Does P(M ∩ C) = P(M)P(C) ?
We know: P(M ∩ C) = .36, P(M) = .70, P(C) = .48
But: P(M)P(C) = (.70)(.48) = .336, not .36
Hence: M and C are not independent.
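The Bradley Investments calculations (addition law, conditional probability, and the independence check) can be reproduced with a short sketch. It uses only the six sample-point probabilities quoted on the slides above; the dictionary layout and names are illustrative assumptions.

```python
# Sample-point probabilities quoted on the slides, keyed by
# (Markley Oil result, Collins Mining result) in $1000s.
# Only the six points that belong to event M or event C are listed.
p = {
    (10, 8): 0.20, (10, -2): 0.08,
    (5, 8): 0.16, (5, -2): 0.26,
    (0, 8): 0.10, (-20, 8): 0.02,
}

M = {pt for pt in p if pt[0] > 0}   # Markley Oil profitable
C = {pt for pt in p if pt[1] > 0}   # Collins Mining profitable

def prob(event):
    """Probability of an event = sum of its sample-point probabilities."""
    return sum(p[pt] for pt in event)

p_m, p_c = prob(M), prob(C)
p_both = prob(M & C)                 # intersection
p_either = p_m + p_c - p_both        # addition law
p_c_given_m = p_both / p_m           # conditional probability

print(round(p_m, 2), round(p_c, 2), round(p_both, 2))
print(round(p_either, 2))
print(round(p_c_given_m, 4))
# Independence would require P(M and C) to equal P(M) * P(C).
print(round(p_m * p_c, 3), abs(p_both - p_m * p_c) < 1e-9)
```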
47
Mutual Exclusiveness and Independence
• Do not confuse the notion of mutually exclusive events with that
of independent events.
• Two events with nonzero probabilities cannot be both mutually
exclusive and independent.
• If one mutually exclusive event is known to occur, the other
cannot occur; thus, the probability of the other event occurring
is reduced to zero (and they are therefore dependent).
• Two events that are not mutually exclusive might or might not be
independent.
48
A conversation in Deep Space 9/S-3:Ep15- Destiny
A short conversation between Chief Engineer O’Brien and Cardassian scientist Gilora.
Gilora: What happened to these couplings?
O’Brien: What? Oh, I made a few modifications.
Gilora: Well, these relays don’t have as much carrying capacity as before. They won’t be able to
handle the signal load from the transceiver.
O’Brien: Well, in order to bring the system up to Starfleet code, I had to take out the couplings
to make room for a secondary backup.
Gilora: Starfleet code requires a second backup?
O’Brien: In case the first one fails.
Gilora: What are the chances that both the primary system and its backup fail at the same time?
O’Brien: Well, it is very unlikely, but in a crunch, I wouldn’t like to be caught without a second
backup.
An exercise in Reliability
Reliability is complementary to probability of failure, i.e. ...
For example, if two components are arranged in parallel, each with reliability R1 = R2 = 0.9, that is, F1 = F2 = 0.1, the resultant probability of failure is F = 0.1 × 0.1 = 0.01. The resultant reliability is R = 1 – 0.01 = 0.99.
Now let us understand the conversation in DS9.
Primary and backup are arranged in parallel. Assume each has reliability Rp = Rb = 0.98. Then Fp = Fb = 0.02. The resultant probability of failure is F = 0.02 × 0.02 = 0.0004. The resultant reliability is R = 1 – 0.0004 = 0.9996.
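The reliability exercise can be sketched as follows, assuming the components fail independently (this is what allows the failure probabilities to be multiplied). The function name and the three-component case for the "second backup" are illustrative assumptions.

```python
def parallel_reliability(reliabilities):
    """Reliability of components in parallel, assuming independent failures.

    The system fails only if every component fails.
    """
    failure = 1.0
    for r in reliabilities:
        failure *= (1.0 - r)
    return 1.0 - failure

print(round(parallel_reliability([0.9, 0.9]), 4))      # 0.99
print(round(parallel_reliability([0.98, 0.98]), 4))    # 0.9996
# Hypothetical: primary, backup, and a second backup, each at 0.98.
print(round(parallel_reliability([0.98, 0.98, 0.98]), 6))
```

With a third component of the same assumed reliability, the failure probability drops from 0.0004 to 0.02 × 0.02 × 0.02 = 0.000008.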
Bayes’ Theorem
• Often, we begin probability analysis with initial or prior probabilities.
• Then, from a sample, special report, or a product test we obtain some
additional information.
• Given this information, we calculate revised or posterior probabilities.
• Bayes’ theorem provides the means for revising the prior probabilities.
Prior Probabilities → New Information → Application of Bayes’ Theorem → Posterior Probabilities
Bayes’ Theorem
• Example: L. S. Clothiers
A proposed shopping center will provide strong competition for downtown businesses
like L. S. Clothiers. If the shopping center is built, the owner of L. S. Clothiers feels it
would be best to relocate to the shopping center.
52
Prior Probabilities
• Example: L. S. Clothiers
Let:
A1 = town council approves the zoning change
A2 = town council disapproves the change
Using subjective judgment:
P(A1) = .7, P(A2) = .3
53
New Information
• Example: L. S. Clothiers
The planning board has recommended against the zoning change. Let B denote the
event of a negative recommendation by the planning board.
54
Conditional Probabilities
• Example: L. S. Clothiers
Past history with the planning board and the town council indicates
the following:
P(B|A1) = .2 and P(B|A2) = .9
Hence: P(Bc|A1) = .8 and P(Bc|A2) = .1
55
Tree Diagram
• Example: L. S. Clothiers
Town Council      Planning Board      Experimental Outcomes
P(A1) = .7        P(B|A1) = .2        P(A1 ∩ B) = .14
                  P(Bc|A1) = .8       P(A1 ∩ Bc) = .56
P(A2) = .3        P(B|A2) = .9        P(A2 ∩ B) = .27
                  P(Bc|A2) = .1       P(A2 ∩ Bc) = .03
                                      Total = 1.00
Bayes’ Theorem
• To find the posterior probability that event Ai will occur given that
event B has occurred, we apply Bayes’ theorem.
$$P(A_i \mid B) = \frac{P(A_i)\,P(B \mid A_i)}{P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2) + \cdots + P(A_n)P(B \mid A_n)}$$
57
Posterior Probabilities
• Example: L. S. Clothiers
Given the planning board’s recommendation not to approve the zoning change, we revise the
prior probabilities as follows:
$$P(A_1 \mid B) = \frac{P(A_1)P(B \mid A_1)}{P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2)} = \frac{(.7)(.2)}{(.7)(.2) + (.3)(.9)} = .34$$
58
Posterior Probabilities
• Example: L. S. Clothiers
The planning board’s recommendation is good news for L. S.
Clothiers. The posterior probability of the town council approving
the zoning change is .34 compared to a prior probability of .70.
59
Bayes’ Theorem: Tabular Approach
• Example: L. S. Clothiers
• Step 1
Prepare the following three columns:
Column 1 - The mutually exclusive events for which posterior
probabilities are desired.
Column 2 - The prior probabilities for the events.
Column 3 - The conditional probabilities of the new information
given each event.
60
Bayes’ Theorem: Tabular Approach
• Example: L. S. Clothiers
• Step 1
(1) (2) (3) (4) (5)
Prior Conditional
Events Probabilities Probabilities
Ai P(Ai) P(B|Ai)
A1 .7 .2
A2 .3 .9
1.0
61
Bayes’ Theorem: Tabular Approach
• Example: L. S. Clothiers
• Step 2
Prepare the fourth column:
Column 4
Compute the joint probabilities for each event and the new
information B by using the multiplication law.
Multiply the prior probabilities in column 2 by the corresponding
conditional probabilities in column 3. That is, P(Ai ∩ B) = P(Ai)P(B|Ai).
62
Bayes’ Theorem: Tabular Approach
• Example: L. S. Clothiers
• Step 2
(1)       (2)              (3)              (4)              (5)
          Prior            Conditional      Joint
Events    Probabilities    Probabilities    Probabilities
Ai        P(Ai)            P(B|Ai)          P(Ai ∩ B)
A1        .7               .2               .14 = .7(.2)
A2        .3               .9               .27
          1.0
63
Bayes’ Theorem: Tabular Approach
• Example: L. S. Clothiers
• Step 2 (continued)
We see that there is a .14 probability of the town council
approving the zoning change and a negative recommendation by
the planning board.
There is a .27 probability of the town council disapproving the
zoning change and a negative recommendation by the planning
board.
64
Bayes’ Theorem: Tabular Approach
• Example: L. S. Clothiers
• Step 3
Sum the joint probabilities in Column 4. The sum is the
probability of the new information, P(B). The sum .14 + .27
shows an overall
probability of .41 of a negative recommendation by the
planning board.
65
Bayes’ Theorem: Tabular Approach
• Example: L. S. Clothiers
• Step 3
(1)       (2)              (3)              (4)              (5)
          Prior            Conditional      Joint
Events    Probabilities    Probabilities    Probabilities
Ai        P(Ai)            P(B|Ai)          P(Ai ∩ B)
A1        .7               .2               .14
A2        .3               .9               .27
          1.0                               P(B) = .41
66
Bayes’ Theorem: Tabular Approach
• Example: L. S. Clothiers
• Step 4
Prepare the fifth column:
Column 5
Compute the posterior probabilities using the basic relationship
of conditional probability.
$$P(A_i \mid B) = \frac{P(A_i \cap B)}{P(B)}$$
67
Bayes’ Theorem: Tabular Approach
• Example: L. S. Clothiers
• Step 4
(1)       (2)              (3)              (4)              (5)
          Prior            Conditional      Joint            Posterior
Events    Probabilities    Probabilities    Probabilities    Probabilities
Ai        P(Ai)            P(B|Ai)          P(Ai ∩ B)        P(Ai|B)
A1        .7               .2               .14              .3415 = .14/.41
A2        .3               .9               .27              .6585
          1.0                               P(B) = .41       1.0000
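The four steps of the tabular approach map directly onto a few lines of code. The sketch below reproduces the L. S. Clothiers table; the function name and return layout are illustrative assumptions.

```python
def bayes_table(priors, conditionals):
    """Tabular Bayes: returns (joint probabilities, P(B), posteriors).

    priors       -- P(Ai) for mutually exclusive, exhaustive events Ai
    conditionals -- P(B|Ai) for the new information B
    """
    # Step 2: joint probabilities P(Ai and B) = P(Ai) * P(B|Ai)
    joints = [p * c for p, c in zip(priors, conditionals)]
    # Step 3: P(B) is the sum of the joint probabilities
    p_b = sum(joints)
    # Step 4: posterior P(Ai|B) = P(Ai and B) / P(B)
    posteriors = [j / p_b for j in joints]
    return joints, p_b, posteriors

joints, p_b, posteriors = bayes_table([0.7, 0.3], [0.2, 0.9])
print([round(j, 2) for j in joints])        # [0.14, 0.27]
print(round(p_b, 2))                        # 0.41
print([round(q, 4) for q in posteriors])    # [0.3415, 0.6585]
```

The same function handles any number of events, for example the three-event review exercise with priors .20, .50, .30.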
68
Probability Distributions
Dr. Nilakantan Narasinganallur Ph.D.
Probability Distributions
• Random Variables
• Developing Discrete Probability Distributions
• Expected Value and Variance
• Binomial Probability Distribution
• Normal Probability Distribution
2
Random Variables
• A random variable is a numerical description of the outcome of an
experiment.
• A discrete random variable may assume either a finite number of values
or an infinite sequence of values.
• A continuous random variable may assume any numerical value in an
interval or collection of intervals.
3
Discrete Random Variable
with a Finite Number of Values
• Example: Auto Distributor
Let x = number of cars sold at the outlet in one day,
where x can take on 5 values (0, 1, 2, 3, 4)
We can count the cars sold, and there is a finite upper limit on the
number that might be sold (which is the number of cars in stock).
4
Discrete Random Variable
with an Infinite Sequence of Values
• Example: Auto Distributor
Let x = number of customers arriving in one day,
where x can take on the values 0, 1, 2, . . .
We can count the customers arriving, but there is
no finite upper limit on the number that might arrive.
Random Variables
Question Random Variable x Type
6
Discrete Probability Distributions
• The probability distribution for a random variable describes how probabilities are distributed over the values of the random variable.
• We can describe a discrete probability distribution with a table, graph, or formula.
Discrete Probability Distributions
• Two types of discrete probability distributions will be introduced.
• First type: uses the rules of assigning probabilities to experimental outcomes to determine probabilities for each value of the random variable.
• Second type: uses a special mathematical formula to compute the probabilities for each value of the random variable.
Discrete Probability Distributions
• Example: Auto Distributor
Using past data on car sales, a tabular representation
of the probability distribution for sales was developed.
Units Sold    Number of Days    x    f(x)
0             80                0    .40 = 80/200
1             50                1    .25
2             40                2    .20
3             10                3    .05
4             20                4    .10
Total         200                    1.00
11
Discrete Probability Distributions
• Example: Auto Distributor
[Figure: graphical representation of the probability distribution. Vertical axis: probability (0 to .50); horizontal axis: values of random variable x (car sales), 0 to 4.]
Expected Value
• The expected value, or mean, of a random variable is a measure of its
central location.
$E(x) = \mu = \sum x f(x)$
• The expected value is a weighted average of the values the random
variable may assume. The weights are the probabilities.
• The expected value does not have to be a value the random variable can
assume.
14
• The variance summarizes the variability in the values of a random variable.
• The variance is a weighted average of the squared deviations of a random variable from its mean. The weights are the probabilities.
$$\mathrm{Var}(x) = \sigma^2 = \sum (x - \mu)^2 f(x)$$
• The standard deviation, σ, is defined as the positive square root of the variance.
Variance
• Example: Auto Distributor
x    x − μ    (x − μ)²    f(x)    (x − μ)² f(x)
0    -1.2     1.44        .40     .576
1    -0.2     0.04        .25     .010
2     0.8     0.64        .20     .128
3     1.8     3.24        .05     .162
4     2.8     7.84        .10     .784
Variance of daily sales = σ² = 1.660
Standard deviation of daily sales = σ = 1.2884 cars
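The expected value and variance for the Auto Distributor example can be checked with a short sketch. The probabilities come from the sales table; the variable names are illustrative assumptions.

```python
import math

# Probability distribution of daily car sales (from the Auto Distributor table).
f = {0: 0.40, 1: 0.25, 2: 0.20, 3: 0.05, 4: 0.10}

# Expected value: weighted average of the values, weights = probabilities.
mean = sum(x * fx for x, fx in f.items())
# Variance: weighted average of squared deviations from the mean.
variance = sum((x - mean) ** 2 * fx for x, fx in f.items())
std_dev = math.sqrt(variance)

print(round(mean, 2))       # 1.2
print(round(variance, 3))   # 1.66
print(round(std_dev, 4))    # 1.2884
```

The mean of 1.2 cars per day is the value subtracted in the x − μ column of the variance table, and it is not itself a value the random variable can take.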
17
Binomial Probability Distribution
• Four Properties of a Binomial Experiment
1. The experiment consists of a sequence of n identical trials.
2. Two outcomes, success and failure, are possible on each trial.
3. The probability of a success, denoted by p, does not change from trial to
trial. (This is referred to as the stationarity assumption.)
4. The trials are independent.
18
Binomial Probability Distribution
19
Binomial Probability Distribution
where:
x = the number of successes
p = the probability of a success on one trial
n = the number of trials
f(x) = the probability of x successes in n trials
n! = n(n – 1)(n – 2) ….. (2)(1)
20
Binomial Probability Distribution
• Binomial Probability Function
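$$f(x) = \frac{n!}{x!\,(n-x)!}\; p^x (1-p)^{(n-x)}$$
• The first factor, n!/(x!(n − x)!), is the number of experimental outcomes providing exactly x successes in n trials.
• The second factor, p^x (1 − p)^(n − x), is the probability of a particular sequence of trial outcomes with x successes in n trials.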
21
Binomial Probability Distribution
• Example: Actis Hospital
Actis Hospital is concerned about a low retention rate for its employees. In recent years, management has seen a turnover of 10% of the hourly employees annually.
Thus, for any hourly employee chosen at random, management estimates a probability of 0.1 that the person will not be with the company next year.
Choosing 3 hourly employees at random, what is the probability that 1 of them will leave the company this year?
• Example: Actis Hospital
• The probability of the first employee leaving and the second and third employees staying, denoted (S, F, F), is given by
p(1 – p)(1 – p)
Experimental Outcome    Probability of Experimental Outcome
(S, F, F)               p(1 – p)(1 – p) = (.1)(.9)(.9) = .081
(F, S, F)               (1 – p)p(1 – p) = (.9)(.1)(.9) = .081
(F, F, S)               (1 – p)(1 – p)p = (.9)(.9)(.1) = .081
                        Total = .243
Binomial Probability Distribution
• Example: Actis Hospital
Using the probability function:
Let: p = .10, n = 3, x = 1
$$f(x) = \frac{n!}{x!\,(n-x)!}\; p^x (1-p)^{(n-x)}$$
$$f(1) = \frac{3!}{1!\,(3-1)!}\,(0.1)^1 (0.9)^2 = .243$$
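The binomial probability function is easy to evaluate directly. The sketch below computes the full distribution for the Actis Hospital example (n = 3, p = .10); the function name is an illustrative assumption, and `math.comb` requires Python 3.8 or later.

```python
import math

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 3, 0.10
for x in range(n + 1):
    print(x, round(binomial_pmf(x, n, p), 4))

# Probability that exactly 1 of the 3 employees leaves.
print(round(binomial_pmf(1, n, p), 3))   # 0.243
# Mean and variance from the binomial shortcuts np and np(1 - p).
print(round(n * p, 2), round(n * p * (1 - p), 2))
```

The four probabilities printed by the loop match the n = 3, p = .10 column of a binomial table.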
25
Binomial Probability Distribution
• Example: Evans Electronics
1st Worker     2nd Worker     3rd Worker    x    Prob.
Leaves (.1)    Leaves (.1)    L (.1)        3    .0010
                              S (.9)        2    .0090
               Stays (.9)     L (.1)        2    .0090
                              S (.9)        1    .0810
Stays (.9)     Leaves (.1)    L (.1)        2    .0090
                              S (.9)        1    .0810
               Stays (.9)     L (.1)        1    .0810
                              S (.9)        0    .7290
Binomial Probabilities and Cumulative Probabilities
• Statisticians have developed tables that give probabilities and
cumulative probabilities for a binomial random variable.
• These tables can be found in some statistics textbooks.
• With modern calculators and the capability of statistical software
packages, such tables are almost unnecessary.
27
Binomial Probability Distribution
• Using Tables of Binomial Probabilities
p
n x .05 .10 .15 .20 .25 .30 .35 .40 .45 .50
3 0 .8574 .7290 .6141 .5120 .4219 .3430 .2746 .2160 .1664 .1250
1 .1354 .2430 .3251 .3840 .4219 .4410 .4436 .4320 .4084 .3750
2 .0071 .0270 .0574 .0960 .1406 .1890 .2389 .2880 .3341 .3750
3 .0001 .0010 .0034 .0080 .0156 .0270 .0429 .0640 .0911 .1250
28
Binomial Probability Distribution
• Expected Value
$E(x) = \mu = np$
• Variance
$\mathrm{Var}(x) = \sigma^2 = np(1 - p)$
• Standard Deviation
$\sigma = \sqrt{np(1 - p)}$
29
Binomial Probability Distribution
• Example: Actis Hospital
• Expected Value
E(x) = np = 3(.1) = .3 employees out of 3
• Variance
Var(x) = np(1 – p) = 3(.1)(.9) = .27
• Standard Deviation
$\sigma = \sqrt{3(.1)(.9)} = .52$ employees
30
• A continuous random variable can assume any value in an interval on the real line or in a collection of intervals.
• It is not possible to talk about the probability of the random variable assuming a particular value.
• Instead, we talk about the probability of the random variable assuming a value within a given interval.
[Figure: probability density curves f(x), including the uniform and normal distributions, each with an interval from x1 to x2 marked on the x axis.]
• The area under the graph of f(x) and probability are identical.
34
Normal Probability Distribution
• Normal Probability Density Function
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-(x-\mu)^2 / 2\sigma^2}$$
where: μ = mean
σ = standard deviation
π = 3.14159
e = 2.71828
Normal Probability Distribution
• Characteristics
The distribution is symmetric; its skewness measure is zero.
36
Normal Probability Distribution
• Characteristics
The entire family of normal probability distributions is defined by its mean μ
and its standard deviation σ.
[Figure: normal curve with the mean μ marked on the x axis and the standard deviation σ indicated.]
Normal Probability Distribution
• Characteristics
The highest point on the normal curve is at the mean, which is also the
median and mode.
38
Normal Probability Distribution
• Characteristics
The mean can be any numerical value: negative, zero, or positive.
x
-10 0 25
39
Normal Probability Distribution
• Characteristics
The standard deviation determines the width of the
curve: larger values result in wider, flatter curves.
[Figure: two normal curves, one with σ = 15 and one with σ = 25.]
Normal Probability Distribution
• Characteristics
Probabilities for the normal random variable are given by areas under the
curve. The total area under the curve is 1 (.5 to the left of the mean and
.5 to the right).
.5 .5
x
41
Normal Probability Distribution
• Characteristics (basis for the empirical rule)
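68.26% of the values of a normal random variable are within μ ± 1σ.
95.44% of the values of a normal random variable are within μ ± 2σ.
99.72% of the values of a normal random variable are within μ ± 3σ.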
Standard Normal Probability Distribution
• Characteristics
A random variable having a normal distribution with a mean of 0 and a
standard deviation of 1 is said to have a standard normal probability
distribution.
44
Standard Normal Probability Distribution
• Characteristics
The letter z is used to designate the standard normal random variable.
[Figure: standard normal curve centered at z = 0 with σ = 1.]
Standard Normal Probability Distribution
• Converting to the Standard Normal Distribution
$$z = \frac{x - \mu}{\sigma}$$
46
Standard Normal Probability Distribution
• Example: Car Zone
Car Zone sells auto parts and supplies including a popular multi-grade
motor oil. When the stock of this oil drops to 20 gallons, a replenishment
order is placed.
The store manager is concerned that sales are being lost due to
stockouts while waiting for a replenishment order.
47
Standard Normal Probability Distribution
• Example: Car Zone
It has been determined that demand during replenishment lead-time is
normally distributed with a mean of 15 gallons and a standard deviation
of 6 gallons.
The manager would like to know the probability of a stockout during
replenishment lead-time. In other words, what is the probability that
demand during lead-time will exceed 20 gallons?
P(x > 20) = ?
48
Standard Normal Probability Distribution
• Solving for the Stockout Probability
Step 1: Convert x to the standard normal distribution.
z = (x − μ)/σ
= (20 − 15)/6
= .83
• Excel has two functions for computing cumulative probabilities and x values for any normal distribution:
• NORM.DIST is used to compute the cumulative probability given an x value.
• NORM.INV is used to compute the x value given a cumulative probability.
Standard Normal Probability Distribution
• Solving for the Stockout Probability
z
0 .83
52
If the manager of Car Zone wants the probability of a stockout during
replenishment lead-time to be no more than .05, what should the reorder
point be?
Area = .9500
Area = .0500
z
0 z.05
54
Standard Normal Probability Distribution
• Solving for the Reorder Point
Step 2: Convert z.05 to the corresponding value of x.
x = μ + z.05 σ
= 15 + 1.645(6)
= 24.87 or 25
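Both Car Zone questions can be answered without tables. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) in place of the Excel functions; the variable names are illustrative assumptions.

```python
from statistics import NormalDist

# Demand during replenishment lead-time: mean 15 gallons, std. dev. 6 gallons.
demand = NormalDist(mu=15, sigma=6)

# Probability of a stockout with a reorder point of 20 gallons: P(x > 20).
p_stockout = 1 - demand.cdf(20)
print(round(p_stockout, 3))

# Reorder point that keeps the stockout probability at .05:
# the x value with cumulative probability .95.
reorder_point = demand.inv_cdf(0.95)
print(round(reorder_point, 2))   # 24.87
```

The stockout probability computed this way uses the unrounded z = 5/6, so it can differ in the third decimal place from a table lookup at z = .83. Both give a value of about .20.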
55
Normal Probability Distribution
• Solving for the Reorder Point
Probability of no Probability of a
stockout during stockout during
replenishment replenishment
lead-time = .95 lead-time = .05
x
15 24.87
56
Standard Normal Probability Distribution
• Solving for the Reorder Point
By raising the reorder point from 20 gallons to 25 gallons on hand, the probability of a stockout decreases from about .20 to .05.
This is a significant decrease in the chance that Car Zone will be out of stock and unable to meet a customer’s desire to make a purchase.
Learnings?
Sampling Theory &
Estimation
Dr. Nilakantan Narasinganallur Ph.D.
Review – Bayes’ theorem
1. The prior probabilities for events A1, A2, and A3 are P(A1) = .20, P(A2) = .50, and P(A3) = .30. The conditional probabilities of event B given A1, A2, and A3 are P(B | A1) = .50, P(B | A2) = .40, and P(B | A3) = .30.
• a. Compute P(B ∩ A1), P(B ∩ A2), and P(B ∩ A3).
• b. Apply Bayes’ theorem, equation (4.19), to compute the posterior probability P(A2 | B).
• c. Use the tabular approach to applying Bayes’ theorem to compute P(A1 | B), P(A2 | B), and P(A3 | B).
Ans: a. .10, .20, .09, b. .51, c. .26, .51, .23
2. A local bank reviewed its credit card policy with the intention of recalling some of its credit cards. In the past approximately 5% of cardholders defaulted, leaving the bank unable to collect the outstanding balance. Hence, management established a prior probability of .05 that any particular cardholder will default. The bank also found that the probability of missing a monthly payment is .20 for customers who do not default. Of course, the probability of missing a monthly payment for those who default is 1.
• a. Given that a customer missed one or more monthly payments, compute the posterior probability that the customer will default.
• b. The bank would like to recall its card if the probability that a customer will default is greater than .20. Should the bank recall its card if the customer misses a monthly payment? Why or why not?
Binomial distribution
Consider a binomial experiment with two trials and p = .4.
• a. Compute the probability of one success, f(1).
• b. Compute f(0).
• c. Compute f(2).
• d. Compute the probability of at least one success.
• e. Compute the expected value, variance, and standard deviation.
• Ans: f(1) = 0.48, f(0) = 0.36, f(2) = 0.16, P(x ≥ 1) = 0.64, E(x) = 0.8, V(x) = 0.48, s.d. = 0.6928.
• A Harris Interactive survey for InterContinental Hotels & Resorts asked respondents, “When traveling internationally, do you generally venture out on your own to experience culture, or stick with your tour group and itineraries?” The survey found that 23% of the respondents stick with their tour group (USA Today, January 21, 2004).
• a. In a sample of six international travelers, what is the probability that two will stick with their tour group?
• b. In a sample of six international travelers, what is the probability that at least two will stick with their tour group?
• c. In a sample of 10 international travelers, what is the probability that none will stick with the tour group?
• Ans: a. .2789, b. .4181, c. .0733
Normal Distribution
Given that z is a standard normal random variable, compute the following probabilities.
• a. P(−1.98 ≤ z ≤ .49)
• b. P(.52 ≤ z ≤ 1.22)
• c. P(−1.75 ≤ z ≤ −1.04)
• Ans: a. P(−1.98 ≤ z ≤ .49) = P(z ≤ .49) − P(z < −1.98) = .6879 − .0239 = .6640
• b. P(.52 ≤ z ≤ 1.22) = P(z ≤ 1.22) − P(z < .52) = .8888 − .6985 = .1903
• c. P(−1.75 ≤ z ≤ −1.04) = P(z ≤ −1.04) − P(z < −1.75) = .1492 − .0401 = .1091
Given that z is a standard normal random variable, find z for each situation.
• a. The area to the left of z is .9750.
• b. The area between 0 and z is .4750.
• c. The area to the left of z is .7291.
• d. The area to the right of z is .1314.
• e. The area to the left of z is .6700.
• f. The area to the right of z is .3300.
• Ans: a. z = 1.96, b. z = 1.96, c. z = .61, d. z = 1.12, e. z = .44, f. z = .44
Trading volume on the New York Stock Exchange is heaviest during the first half hour (early morning) and last half hour (late afternoon) of the trading day. The early morning trading volumes (millions of shares) for 13 days in January and February are shown here (Barron’s, January 23, 2006; February 13, 2006; and February 27, 2006).
• 214 163 265 194 180 202 198 212 201 174 171 211 211
• The probability distribution of trading volume is approximately normal.
• a. Compute the mean and standard deviation to use as estimates of the population mean and standard deviation.
• b. What is the probability that, on a randomly selected day, the early morning trading volume will be less than 180 million shares?
• c. What is the probability that, on a randomly selected day, the early morning trading volume will exceed 230 million shares?
• d. How many shares would have to be traded for the early morning trading volume on a particular day to be among the busiest 5% of days?
• Ans: a. 200, 26.04, b. .2206, c. .1251, d. 242.84 million
Sampling and Sampling Distributions
• Selecting a Sample
• Point Estimation
• Introduction to Sampling Distributions
• Sampling Distribution of x̄
• Sampling Distribution of p̄
• Other Sampling Methods
Introduction
• An element is the entity on which data are collected.
• A population is a collection of all the elements of interest.
Introduction
• The reason we select a sample is to collect data to answer a research question
about a population.
• The sample results provide only estimates of the values of the population
characteristics.
• The reason is simply that the sample contains only a portion of the
population.
• With proper sampling methods, the sample results can provide “good”
estimates of the population characteristics.
7
Selecting a Sample
• Sampling from a Finite Population
Sampling from a Finite Population
• Finite populations are often defined by lists such as:
• Organization membership roster
• Credit card account numbers
• Inventory product numbers
• A simple random sample of size n from a finite population of size N is a sample
selected such that each possible sample of size n has the same probability of
being selected.
9
Sampling from a Finite Population
• Replacing each sampled element before selecting subsequent elements is called sampling with replacement.
• Sampling without replacement is the procedure used most often.
Sampling from a Finite Population
• Example: Symbioses College
Symbioses College received 900 applications for admission in the upcoming year from prospective students. The applicants were numbered, from 1 to 900, as their applications arrived. The Director of Admissions would like to select a simple random sample of 30 applicants.
Sampling from a Finite Population
• Example: Symbioses College
Step 1: Assign a random number to each of the 900 applicants.
The random numbers generated by Excel’s RAND function follow a uniform probability distribution between 0 and 1.
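The random-number approach can be sketched in Python instead of Excel. Step 1 follows the slide; the second step (keeping the applicants with the 30 smallest random numbers) is an assumption about how the sample is then chosen, since that step is not shown here. The function name and seed are illustrative.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Select a simple random sample of applicant numbers without replacement."""
    rng = random.Random(seed)
    # Step 1: assign a uniform(0, 1) random number to each applicant.
    tagged = [(rng.random(), applicant)
              for applicant in range(1, population_size + 1)]
    # Assumed next step: keep the applicants with the smallest random numbers.
    tagged.sort()
    return [applicant for _, applicant in tagged[:sample_size]]

sample = simple_random_sample(900, 30, seed=42)
print(sorted(sample))
```

Because every applicant receives an independent random number, each possible set of 30 applicants is equally likely to be selected, which is the definition of a simple random sample.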
13
Sampling from an Infinite Population
• Populations are often generated by an ongoing process where there is no upper
limit on the number of units that can be generated.
• Some examples of on-going processes, with infinite populations, are:
• parts being manufactured on a production line
• transactions occurring at a bank
• telephone calls arriving at a technical help desk
• customers entering a store
14
Sampling from an Infinite Population
• In the case of an infinite population, we must select a random sample in order
to make valid statistical inferences about the population from which the
sample is taken.
• A random sample from an infinite population is a sample selected such that
the following conditions are satisfied.
• Each element selected comes from the population of interest.
• Each element is selected independently.
15
Practice - Sampling from a finite population
1. Assume a finite population has 350 elements. Using the last three digits of each of the following five-digit random
numbers (e.g., 601, 022, 448, . . . ), determine the first four elements that will be selected for the simple random
sample.
• 98601 73022 83448 02147 34229 27553 84147 93289 14209
• Ans: 22, 147, 229, 289
2. Indicate which of the following situations involve sampling from a finite population and which involve sampling
from an infinite population. In cases where the sampled population is finite, describe how you would construct a
frame.
• a. Obtain a sample of licensed drivers in the state of New York.
• b. Obtain a sample of boxes of cereal produced by the Breakfast Choice company.
• c. Obtain a sample of cars crossing the Golden Gate Bridge on a typical weekday.
• d. Obtain a sample of students in a statistics course at Indiana University.
• e. Obtain a sample of the orders that are processed by a mail-order firm.
Ans: a. finite; b. infinite; c. infinite; d. finite; e. infinite
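Practice question 1 can be checked with a short script. The sketch below assumes the usual convention that values outside 1 to 350 and repeated values are skipped; the function name is hypothetical.

```python
def select_elements(random_numbers, population_size, sample_size):
    """Pick elements using the last three digits of each random number."""
    selected = []
    for number in random_numbers:
        element = int(number[-3:])
        # Skip values outside the population and values already chosen.
        if 1 <= element <= population_size and element not in selected:
            selected.append(element)
        if len(selected) == sample_size:
            break
    return selected

numbers = "98601 73022 83448 02147 34229 27553 84147 93289 14209".split()
first_four = select_elements(numbers, population_size=350, sample_size=4)
print(first_four)  # [22, 147, 229, 289]
```

The random numbers are kept as strings so that the leading zero in 02147 is not lost.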
Point Estimation
• Point estimation is a form of statistical inference.
• In point estimation we use the data from the sample to compute a value of a sample
statistic that serves as an estimate of a population parameter.
• We refer to x̄ as the point estimator of the population mean μ.
• s is the point estimator of the population standard deviation σ.
• p̄ is the point estimator of the population proportion p.
17
Point Estimation
• Example: Symbioses College
Recall that Symbioses College received 900 applications from prospective
students. The application form contains a variety of information including the
individual’s Scholastic Aptitude Test (SAT) score and whether or not the
individual desires on-campus housing.
At a meeting in a few hours, the Director of Admissions would like to
announce the average SAT score and the proportion of applicants that want to
live on campus, for the population of 900 applicants.
18
Point Estimation
• Example: Symbioses College
However, the necessary data on the applicants have not yet been
entered in the college’s computerized database. So, the Director decides to
estimate the values of the population parameters of interest based on sample
statistics. The sample of 30 applicants is selected using computer-generated
random numbers.
19
Point Estimation
• x̄ as Point Estimator of μ
$$\bar{x} = \frac{\sum x_i}{30} = \frac{50{,}520}{30} = 1684$$
• s as Point Estimator of σ
20
Point Estimation
• Once all the data for the 900 applicants were entered in the college’s database,
the values of the population parameters of interest were calculated.
• Population Mean SAT Score
$$\mu = \frac{\sum x_i}{900} = 1697$$
• Population Standard Deviation for SAT Score
$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{900}} = 87.4$$
• Population Proportion Wanting On-Campus Housing
$$p = \frac{648}{900} = .72$$
21
Summary of Point Estimates
Obtained from a Simple Random Sample
Population Parameter                                 Parameter Value   Point Estimator                                   Point Estimate
μ = Population mean SAT score                        1697              x̄ = Sample mean SAT score                         1684
σ = Population std. deviation for SAT score          87.4              s = Sample standard deviation for SAT score       85.2
p = Population proportion wanting campus housing     .72               p̄ = Sample proportion wanting campus housing      .67
22
Practical Advice
• The target population is the population we want to make inferences about.
• The sampled population is the population from which the sample is actually
taken.
• Whenever a sample is used to make inferences about a population, we
should make sure that the targeted population and the sampled population
are in close agreement.
23
Practice – Point Estimation
25
Sampling Distribution of 𝑥ҧ
• The sampling distribution of x̄ is the probability distribution of all possible
values of the sample mean x̄.
• Expected Value of x̄
E(x̄) = μ
where: μ = the population mean
• When the expected value of the point estimator equals the population
parameter, we say the point estimator is unbiased.
26
Sampling Distribution of 𝑥ҧ
• We will use the following notation to define the standard deviation of the
sampling distribution of x̄.
27
Sampling Distribution of 𝑥ҧ
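• Standard Deviation of x̄
Finite Population:
$$\sigma_{\bar{x}} = \sqrt{\frac{N-n}{N-1}}\left(\frac{\sigma}{\sqrt{n}}\right)$$
Infinite Population:
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$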
28
Sampling Distribution of 𝑥ҧ
• When the population has a normal distribution, the sampling distribution
of 𝑥ҧ is normally distributed for any sample size.
• In most applications, the sampling distribution of 𝑥ҧ can be approximated
by a normal distribution whenever the sample is size 30 or more.
• In cases where the population is highly skewed or outliers are present,
samples of size 50 may be needed.
29
Sampling Distribution of 𝑥ҧ
30
Central Limit Theorem
• When the population from which we are selecting a random sample does
not have a normal distribution, the central limit theorem is helpful in
identifying the shape of the sampling distribution of x̄.
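A small simulation can make the central limit theorem concrete. The sketch below is not from the slides: it assumes a hypothetical, strongly skewed population (exponential with mean 1 and standard deviation 1) and draws many samples of size 30 to see how the sample means behave.

```python
import random
import statistics

def sample_means(draw, sample_size, replications, seed=0):
    """Draw `replications` samples and return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(replications)
    ]

# Hypothetical skewed population: exponential, mean 1, standard deviation 1.
means = sample_means(lambda rng: rng.expovariate(1.0),
                     sample_size=30, replications=2000)

print(round(statistics.fmean(means), 3))  # close to the population mean, 1
print(round(statistics.stdev(means), 3))  # close to 1 / sqrt(30), about 0.183
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the population itself is far from normal.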
31
Sampling Distribution of 𝑥ҧ
• Example: Symbioses College
[Figure: sampling distribution of x̄ for SAT scores, centered at E(x̄) = 1697]
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{87.4}{\sqrt{30}} = 15.96$$
32
Sampling Distribution of 𝑥ҧ
• Example: Symbioses College
• What is the probability that a simple random sample of 30 applicants will
provide an estimate of the population mean SAT score that is within +/-10
of the actual population mean ?
• In other words, what is the probability that 𝑥ҧ will be between 1687 and
1707?
33
Sampling Distribution of 𝑥ҧ
• Example: Symbioses College
Step 1: Calculate the z-value at the upper endpoint of the interval.
z = (1707 - 1697)/15.96 = .63
Step 2: Find the area under the curve to the left of the upper endpoint.
P(z < .63) = .7357
34
Sampling Distribution of 𝑥ҧ
• Example: Symbioses College
[Figure: sampling distribution of x̄ for SAT scores, σ_x̄ = 15.96; area = .7357 to the left of x̄ = 1707]
35
Sampling Distribution of 𝑥ҧ
• Example: Symbioses College
Step 3: Calculate the z-value at the lower endpoint of the interval.
z = (1687 - 1697)/15.96 = - .63
Step 4: Find the area under the curve to the left of the lower endpoint.
P(z < -.63) = .2643
36
Sampling Distribution of 𝑥ҧ for SAT Scores
• Example: Symbioses College
[Figure: sampling distribution of x̄ for SAT scores, σ_x̄ = 15.96; area = .2643 to the left of x̄ = 1687]
37
Sampling Distribution of 𝑥ҧ for SAT Scores
• Example: Symbioses College
Step 5: Calculate the area under the curve between
the lower and upper endpoints of the interval.
P(-.63 < z < .63) = P(z < .63) - P(z < -.63)
= .7357 - .2643
= .4714
The probability that the sample mean SAT
score will be between 1687 and 1707 is:
P(1687 < x̄ < 1707) = .4714
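The five steps can be reproduced with Python's standard library. This is a sketch of the same calculation, with z rounded to two decimals to mimic a printed standard normal table; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 1697, 87.4, 30
standard_error = sigma / sqrt(n)                   # about 15.96

# Round z to two decimals, as when using a standard normal table.
z_upper = round((1707 - mu) / standard_error, 2)   # 0.63
z_lower = round((1687 - mu) / standard_error, 2)   # -0.63

std_normal = NormalDist()
probability = std_normal.cdf(z_upper) - std_normal.cdf(z_lower)
print(round(standard_error, 2), z_upper, z_lower, round(probability, 3))
```

The result agrees with the table-based value of about .47; small differences in the fourth decimal come from rounding the two tail areas separately.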
38
Sampling Distribution of 𝑥ҧ for SAT Scores
• Example: Symbioses College
[Figure: sampling distribution of x̄ for SAT scores, σ_x̄ = 15.96; area = .4714 between x̄ = 1687 and x̄ = 1707, centered at 1697]
39
Relationship Between the Sample Size
and the Sampling Distribution of 𝑥ҧ
• Example: Symbioses College
• Suppose we select a simple random sample of 100 applicants instead of the
30 originally considered.
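• E(x̄) = μ regardless of the sample size. In our example, E(x̄) remains at
1697.
• Whenever the sample size is increased, the standard error of the mean σ_x̄
is decreased. With the increase in the sample size to n = 100, the sample is more
than 5% of the population (n/N = 100/900), so the finite population correction
factor is applied, and the standard error of the mean is decreased from 15.96 to:
$$\sigma_{\bar{x}} = \sqrt{\frac{N-n}{N-1}}\left(\frac{\sigma}{\sqrt{n}}\right) = \sqrt{\frac{900-100}{900-1}}\left(\frac{87.4}{\sqrt{100}}\right) = .9433(8.74) = 8.2$$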
40
Relationship Between the Sample Size
and the Sampling Distribution of 𝑥ҧ
• Example: Symbioses College
[Figure: two sampling distributions of x̄, both centered at E(x̄) = 1697; with n = 100, σ_x̄ = 8.2; with n = 30, σ_x̄ = 15.96]
41
Relationship Between the Sample Size
and the Sampling Distribution of 𝑥ҧ
• Example: Symbioses College
• Recall that when n = 30, P(1687 < 𝑥ҧ < 1707) = .4714.
• We follow the same steps to solve for P(1687 < 𝑥ҧ < 1707) when n = 100 as
we showed earlier when n = 30.
• Now, with n = 100, P(1687 < 𝑥ҧ < 1707) = .7776.
• Because the sampling distribution with n = 100 has a smaller standard error,
the values of 𝑥ҧ have less variability and tend to be closer to the population
mean than the values of 𝑥ҧ with n = 30.
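The effect of the sample size can be compared side by side. The sketch below is illustrative (the helper names are hypothetical) and, following the slides, applies the finite population correction only for n = 100. It uses unrounded z values, so the probabilities differ slightly from the table-based .4714 and .7776.

```python
from math import sqrt
from statistics import NormalDist

def standard_error(sigma, n, population_size=None):
    """Standard error of the mean, with an optional finite population correction."""
    se = sigma / sqrt(n)
    if population_size is not None:
        se *= sqrt((population_size - n) / (population_size - 1))
    return se

def prob_within(margin, se):
    """P(sample mean falls within +/- margin of the population mean)."""
    return 2 * NormalDist().cdf(margin / se) - 1

se_30 = standard_error(87.4, 30)
se_100 = standard_error(87.4, 100, population_size=900)
print(round(se_30, 2), round(prob_within(10, se_30), 2))
print(round(se_100, 2), round(prob_within(10, se_100), 2))
```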
42
Relationship Between the Sample Size
and the Sampling Distribution of 𝑥ҧ
• Example: Symbioses College
[Figure: sampling distribution of x̄ for SAT scores with n = 100, σ_x̄ = 8.2; area = .7776 between x̄ = 1687 and x̄ = 1707, centered at 1697]
43
Illustration of CLT
Notes and comments
• 1. While discussing the sampling distribution of the mean for the Symbioses College
problem, we used the values of the population mean μ = 1697 and the population
standard deviation σ = 87.4, which were known. However, usually the values of
the population mean μ and the population standard deviation σ that are needed to
determine the sampling distribution of x̄ will be unknown. Later, we will study how
the sample mean x̄ and the sample standard deviation s are used when μ and σ are
unknown.
• 2. The theoretical proof of the central limit theorem requires independent
observations in the sample. This condition is met for infinite populations and for
finite populations where sampling is done with replacement. Although the central
limit theorem does not directly address sampling without replacement from finite
populations, general statistical practice applies the findings of the central limit
theorem when the population size is large.
Other Sampling Methods
• Stratified Random Sampling
• Cluster Sampling
• Systematic Sampling
• Convenience Sampling
• Judgment Sampling
46
Stratified Random Sampling
• The population is first divided into groups of elements called strata.
• Each element in the population belongs to one and only one stratum.
47
Stratified Random Sampling
48
Cluster Sampling
• The population is first divided into separate groups of elements called clusters.
50
Systematic Sampling
• If a sample size of n is desired from a population containing N elements, we might
sample one element for every N/n elements in the population.
• We randomly select one of the first N/n elements from the population list.
• We then select every N/nth element that follows in the population list.
51
Systematic Sampling
• This method has the properties of a simple random sample, especially if the
list of the population elements is a random ordering.
• Advantage: The sample usually will be easier to identify than it would be if
simple random sampling were used.
• Example: Selecting every 100th listing in a telephone book after the first
randomly selected listing
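A systematic sample is easy to express in code. The sketch below assumes a hypothetical list of 5,000 listings and a sample of 50, so the sampling interval N/n is 100, as in the telephone book example; the function name and seed are illustrative.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Select every (N/n)th element after a random start."""
    step = len(population) // sample_size          # N/n, rounded down
    start = random.Random(seed).randrange(step)    # random start within the first N/n
    return [population[start + i * step] for i in range(sample_size)]

listings = list(range(1, 5001))                    # hypothetical list of 5,000 listings
sample = systematic_sample(listings, sample_size=50, seed=7)
print(sample[:5])
```

Only the starting point is random; every later element is fixed by the interval, which is why the method is easier to carry out than simple random sampling.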
52
Convenience Sampling
53
Convenience Sampling
54
Judgment Sampling
55
Judgment Sampling
56
Recommendation
• It is recommended that probability sampling methods (simple random, stratified,
cluster, or systematic) be used.
• Population Proportion
2
Margin of Error and the Interval Estimate
• A point estimator cannot be expected to provide the exact value of the
population parameter.
• An interval estimate can be computed by adding and subtracting a margin of
error to the point estimate.
Point Estimate +/- Margin of Error
• The purpose of an interval estimate is to provide information about how
close the point estimate is to the value of the parameter.
3
Margin of Error and the Interval Estimate
4
Interval Estimate of a Population Mean: σ Known
• In order to develop an interval estimate of a population mean, the margin of
error must be computed using either:
• the population standard deviation σ, or
• the sample standard deviation s
• σ is rarely known exactly, but often a good estimate can be obtained based on
historical data or other information.
• We refer to such cases as the σ known case.
5
Interval Estimate of a Population Mean: σ Known
There is a 1 - α probability that the value of a sample
mean will provide a margin of error of z_{α/2} σ_x̄ or less.
[Figure: sampling distribution of x̄; 1 - α of all x̄ values lie within z_{α/2} σ_x̄ of the center, with area α/2 in each tail]
6
Interval Estimate of a Population Mean: σ Known
[Figure: sampling distribution of x̄ with area α/2 in each tail and 1 - α of all x̄ values in the middle; one interval x̄ ± z_{α/2} σ_x̄ is marked "interval includes μ" and another is marked "interval does not include μ"]
7
Interval Estimate of a Population Mean: σ Known
• Interval Estimate of μ
$$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
8
Interval Estimate of a Population Mean: σ Known
• Values of z_{α/2} for the Most Commonly Used Confidence Levels
Confidence Level   α     α/2    Table Look-up Area   z_{α/2}
90%                .10   .05    .9500                1.645
95%                .05   .025   .9750                1.960
99%                .01   .005   .9950                2.576
9
Meaning of Confidence
• Because 90% of all the intervals constructed using x̄ ± 1.645σ_x̄ will contain the
population mean, we say we are 90% confident that the interval x̄ ± 1.645σ_x̄
includes the population mean μ.
• Margin of error at 95% confidence:
$$z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 1.96\left(\frac{4{,}500}{\sqrt{36}}\right) = 1{,}470$$
12
Interval Estimate of a Population Mean: σ Known
• Example: Discount Sounds
Interval estimate of μ is:
$41,100 ± $1,470
or
$39,630 to $42,570
We are 95% confident that the interval contains the population mean.
13
Interval Estimate of a Population Mean: σ Known
• Example: Discount Sounds
Confidence Level   Margin of Error   Interval Estimate
90%                1,234             39,866 to 42,334
95%                1,470             39,630 to 42,570
99%                1,932             39,168 to 43,032
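The three rows of the Discount Sounds table can be recomputed directly. The sketch below uses the values shown on the slides (x̄ = 41,100, σ = 4,500, n = 36); the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(sample_mean, sigma, n, confidence):
    """Return (margin, lower, upper) for the sigma-known interval estimate."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / sqrt(n)
    return margin, sample_mean - margin, sample_mean + margin

for confidence in (0.90, 0.95, 0.99):
    margin, lower, upper = z_interval(41_100, 4_500, 36, confidence)
    print(f"{confidence:.0%}: margin {margin:,.0f}, "
          f"interval {lower:,.0f} to {upper:,.0f}")
```

A higher confidence level uses a larger z value, so the margin of error grows and the interval widens.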
14
Interval Estimate of a Population Mean: σ Known
• Adequate Sample Size
• In most applications, a sample size of n = 30 is adequate.
• If the population distribution is highly skewed or contains outliers, a sample size
of 50 or more is recommended.
• Adequate Sample Size (continued)
• If the population is not normally distributed but is roughly symmetric, a sample
size as small as 15 will suffice.
• If the population is believed to be at least approximately normal, a sample size of
less than 15 can be used.
Practice – interval estimation of μ, σ known
1. A simple random sample of 50 items from a population with σ = 6 resulted in a sample mean of 32.
• a. Provide a 90% confidence interval for the population mean.
• b. Provide a 95% confidence interval for the population mean.
• c. Provide a 99% confidence interval for the population mean.
• Ans: a. 30.6 to 33.4, b. 30.34 to 33.66, c. 29.81 to 34.19.
2. A 95% confidence interval for a population mean was reported to be 152 to 160. If σ = 15, what sample size
was used in this study?
• Ans: 54.
Interval Estimate of a Population Mean: σ Unknown
• If an estimate of the population standard deviation σ cannot be developed prior to
sampling, we use the sample standard deviation s to estimate σ.
• This is the σ unknown case.
• In this case, the interval estimate for μ is based on the t distribution.
19
t Distribution
• The t distribution is a family of similar probability distributions.
• A specific t distribution depends on a parameter known as the degrees of
freedom.
• Degrees of freedom refer to the number of independent pieces of
information that go into the computation of s.
20
t Distribution
t distribution
(10 degrees
of freedom)
z, t
0
22
t Distribution
• For more than 100 degrees of freedom, the standard normal z value
provides a good approximation to the t value.
• The standard normal z values can be found in the infinite degrees (∞ ) row
of the t distribution table.
23
t Distribution
Degrees Area in Upper Tail
of Freedom .20 .10 .05 .025 .01 .005
. . . . . . .
50 .849 1.299 1.676 2.009 2.403 2.678
60 .848 1.296 1.671 2.000 2.390 2.660
80 .846 1.292 1.664 1.990 2.374 2.639
100 .845 1.290 1.660 1.984 2.364 2.626
∞ .842 1.282 1.645 1.960 2.326 2.576
(bottom row is standard normal z values)
24
Interval Estimate of a Population Mean: σ Unknown
• Interval Estimate
$$\bar{x} \pm t_{\alpha/2}\frac{s}{\sqrt{n}}$$
25
Interval Estimate of a Population Mean: σ Unknown
• Example: Apartment Rents
A reporter for a student newspaper is writing an article on the
cost of off-campus housing. A sample of 16 one-bedroom
apartments within a half-mile of campus resulted in a sample
mean of $750 per month and a sample standard deviation of $55.
Let us provide a 95% confidence interval estimate of the mean rent per
month for the population of one-bedroom apartments within a half-mile of
campus. We will assume this population to be normally distributed.
26
Interval Estimate of a Population Mean: σ Unknown
• At 95% confidence, α = .05, and α/2 = .025.
• t.025 is based on n - 1 = 16 - 1 = 15 degrees of freedom.
27
Interval Estimate of a Population Mean: σ Unknown
• Interval Estimate
$$\bar{x} \pm t_{.025}\frac{s}{\sqrt{n}}$$
$$750 \pm 2.131\left(\frac{55}{\sqrt{16}}\right) = 750 \pm 29.30$$
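The apartment rent interval can be verified with a few lines of Python. The standard library has no t distribution, so this sketch simply takes the table value t.025 = 2.131 for 15 degrees of freedom from the slide as a constant.

```python
from math import sqrt

sample_mean, s, n = 750, 55, 16
t_025 = 2.131   # t table value: 15 degrees of freedom, upper-tail area .025

margin = t_025 * s / sqrt(n)
lower, upper = sample_mean - margin, sample_mean + margin
print(f"{sample_mean} +/- {margin:.2f}: {lower:.2f} to {upper:.2f}")
```

So the 95% interval estimate of the mean monthly rent runs from about $720.70 to $779.30.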
28
Interval Estimate of a Population Mean: σ Unknown
• Adequate Sample Size
• Usually, a sample size of n = 30 is adequate when using the
expression x̄ ± t_{α/2} s/√n to develop an interval estimate of a
population mean.
• If the population distribution is highly skewed or contains outliers,
a sample size of 50 or more is recommended.
29
Interval Estimate of a Population Mean: σ Unknown
30
Practice – interval estimation of μ, σ
unknown
3. The following sample data are from a normal population: 10, 8, 12, 15, 13, 11, 6, 5.
• c. With 95% confidence, what is the margin of error for the estimation of the population mean?
4. A simple random sample with n = 54 provided a sample mean of 22.5 and a sample standard deviation of 4.4.
• d. What happens to the margin of error and the confidence interval as the confidence level is increased?
• Ans: a. 21.5 to 23.5, b. 21.3 to 23.7, c. 20.9 to 24.1, d. a larger margin of error and wider interval.
Summary of Interval Estimation Procedures
for a Population Mean
Can the population standard deviation σ be assumed known?
• Yes: σ Known Case. Use
$$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
• No: Use the sample standard deviation s to estimate σ (σ Unknown Case). Use
$$\bar{x} \pm t_{\alpha/2}\frac{s}{\sqrt{n}}$$
32
Sample Size for an Interval Estimate of a Population Mean
• Let E = the desired margin of error.
• E is the amount added to and subtracted from the point estimate to obtain
an interval estimate.
• If a desired margin of error is selected prior to sampling, the sample size
necessary to satisfy the margin of error can be determined.
33
Sample Size for an Interval Estimate of a Population Mean
• Margin of Error
$$E = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
34
Sample Size for an Interval Estimate of a Population Mean
• The Necessary Sample Size equation requires a value for the population
standard deviation σ.
• If σ is unknown, a preliminary or planning value for σ can be used in the
equation.
1. Use the estimate of the population standard deviation computed in a
previous study.
2. Use a pilot study to select a preliminary sample and use the sample
standard deviation from that sample.
3. Use judgment or a "best guess" for the value of σ.
35
Sample Size for an Interval Estimate of a Population Mean
• Example: Discount Sounds
Recall that Discount Sounds is evaluating a potential location
for a new retail outlet, based in part, on the mean annual income
of the individuals in the marketing area of the new location.
Suppose that Discount Sounds’ management team wants an estimate of
the population mean such that there is a .95 probability that the sampling
error is $500 or less.
How large a sample size is needed to meet the required precision?
36
Sample Size for an Interval Estimate of a Population Mean
$$E = z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 500$$
37
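Solving the margin of error expression for n gives n = (z_{α/2})² σ² / E². The sketch below applies it to Discount Sounds, assuming σ = 4,500 from the earlier interval example as the planning value and rounding up to the next whole unit; the function name is illustrative.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(sigma, margin_of_error, confidence):
    """Smallest n whose margin of error does not exceed the target."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / margin_of_error) ** 2)

n = required_sample_size(sigma=4_500, margin_of_error=500, confidence=0.95)
print(n)  # 312
```

The unrounded result is about 311.2, so a sample of 312 is needed to meet the $500 precision requirement.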
Practice – sample size estimation
3
Hypothesis Testing
• Hypothesis testing can be used to determine whether a statement about the value of a
population parameter should or should not be rejected.
• The null hypothesis, denoted by H0, is a tentative assumption about a population
parameter.
• The alternative hypothesis, denoted by Ha, is the opposite of what is stated in the
null hypothesis.
• The hypothesis testing procedure uses data from a sample to test the two competing
statements indicated by H0 and Ha.
4
Developing Null and Alternative Hypotheses
• It is not always obvious how the null and alternative hypotheses should be formulated.
6
Developing Null and Alternative Hypotheses
• Alternative Hypothesis as a Research Hypothesis
• Example:
A new teaching method is developed that is believed to be better than the current
method.
7
Developing Null and Alternative Hypotheses
• Alternative Hypothesis as a Research Hypothesis
• Example:
A new sales force bonus plan is developed in an attempt to increase sales.
8
Developing Null and Alternative Hypotheses
• Alternative Hypothesis as a Research Hypothesis
• Example:
A new drug is developed with the goal of lowering blood pressure more than the
existing drug.
• Alternative Hypothesis:
The new drug lowers blood pressure more than the existing drug.
• Null Hypothesis:
The new drug does not lower blood pressure more than the existing drug.
9
Developing Null and Alternative Hypotheses
• Null Hypothesis as an Assumption to be Challenged
• We might begin with a belief or assumption that a statement about the
value of a population parameter is true.
10
Developing Null and Alternative Hypotheses
• Null Hypothesis as an Assumption to be Challenged
• Example:
The label on a soft drink bottle states that it contains 67.6 fluid ounces.
11
Summary of Forms for Null and Alternative Hypotheses
about a Population Mean
• The equality part of the hypotheses always appears in the null hypothesis.
• In general, a hypothesis test about the value of a population mean μ must
take one of the following three forms (where μ0 is the hypothesized value of
the population mean).
H0: μ ≥ μ0     H0: μ ≤ μ0     H0: μ = μ0
Ha: μ < μ0     Ha: μ > μ0     Ha: μ ≠ μ0
12
Null and Alternative Hypotheses
• Example: Metro EMS
A major west coast city provides one of the most comprehensive
emergency medical services in the world. Operating in a multiple hospital
system with approximately 20 mobile medical units, the service goal is to
respond to medical emergencies with a mean time of 12 minutes or less.
The director of medical services wants to formulate a hypothesis test that
could use a sample of emergency response times to determine whether or
not the service goal of 12 minutes or less is being achieved.
13
Null and Alternative Hypotheses
H0: μ ≤ 12 The emergency service is meeting the response goal;
no follow-up action is necessary.
14
Type I Error
• Because hypothesis tests are based on sample data, we must allow for the
possibility of errors.
• A Type I error is rejecting H0 when it is true.
16
Type I and Type II Errors
                                 Population Condition
Conclusion                       H0 True (μ ≤ 12)     H0 False (μ > 12)
Accept H0 (Conclude μ ≤ 12)      Correct Decision     Type II Error
Reject H0 (Conclude μ > 12)      Type I Error         Correct Decision
17
p-Value Approach to One-Tailed Hypothesis Testing
• The p-value is the probability, computed using the test statistic, that measures the
support (or lack of support) provided by the sample for the null hypothesis.
[Figure: lower-tailed test; sampling distribution of z = (x̄ - μ0)/(σ/√n) with α = .10 at z_α = -1.28 and p-value = .0721 at z = -1.46; p-value < α, so reject H0]
20
Upper-Tailed Test About a Population Mean: σ Known
• p-Value Approach
[Figure: sampling distribution of z = (x̄ - μ0)/(σ/√n) with α = .04 at z_α = 1.75 and p-value = .011 at z = 2.29; p-value < α, so reject H0]
21
Critical Value Approach to One-Tailed Hypothesis Testing
• The test statistic z has a standard normal probability distribution.
• We can use the standard normal probability distribution table to find the z-value
with an area of α in the lower (or upper) tail of the distribution.
• The value of the test statistic that established the boundary of the rejection
region is called the critical value for the test.
• The rejection rule is:
• Lower tail: Reject H0 if z < -z_α
• Upper tail: Reject H0 if z > z_α
22
Lower-Tailed Test About a Population Mean: σ Known
• Critical Value Approach
[Figure: sampling distribution of z = (x̄ - μ0)/(σ/√n) with α = .10; reject H0 to the left of -z_α = -1.28, do not reject H0 otherwise]
23
Upper-Tailed Test About a Population Mean: σ Known
• Critical Value Approach
[Figure: sampling distribution of z = (x̄ - μ0)/(σ/√n) with α = .05; reject H0 to the right of z_α = 1.645, do not reject H0 otherwise]
24
Steps of Hypothesis Testing
Step 1. Develop the null and alternative hypotheses.
Step 2. Specify the level of significance α.
Step 3. Collect the sample data and compute the value of the test statistic.
p-Value Approach
Step 4. Use the value of the test statistic to compute the p-value.
Step 5. Reject H0 if p-value < α.
25
Steps of Hypothesis Testing
Critical Value Approach
Step 4. Use the level of significance to determine the critical value and the
rejection rule.
Step 5. Use the value of the test statistic and the rejection rule to determine
whether to reject H0.
26
One-Tailed Tests About a Population Mean: σ Known
• Example: Metro EMS
The response times for a random sample of 40 medical emergencies were
tabulated. The sample mean is 13.25 minutes. The population standard
deviation is believed to be 3.2 minutes.
The EMS director wants to perform a hypothesis test, with a .05 level of
significance, to determine whether the service goal of 12 minutes or less is
being achieved.
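The Metro EMS test can be worked through in Python under both approaches. The sketch below assumes the hypotheses H0: μ ≤ 12 and Ha: μ > 12 (an upper-tailed test); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

sample_mean, mu_0, sigma, n, alpha = 13.25, 12, 3.2, 40, 0.05
std_normal = NormalDist()

# Test statistic for an upper-tailed test with sigma known.
z = (sample_mean - mu_0) / (sigma / sqrt(n))
p_value = 1 - std_normal.cdf(z)                 # upper-tail area
critical_value = std_normal.inv_cdf(1 - alpha)  # about 1.645

print(round(z, 2), round(p_value, 4), round(critical_value, 3))
print("reject H0" if p_value < alpha else "do not reject H0")
```

Both approaches lead to the same decision: z is about 2.47, which exceeds 1.645, and the p-value of roughly .007 is below .05, so the sample suggests the 12-minute goal is not being met.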
27
One-Tailed Tests About a Population Mean: σ Known
• p-Value and Critical Value Approaches
One-Tailed Tests About a Population Mean: σ Known
• p-Value Approach
[Figure: sampling distribution of z = (x̄ - μ0)/(σ/√n) with α = .05 at z_α = 1.645 and the test statistic at z = 2.47; p-value < α, so reject H0]
30
One-Tailed Tests About a Population Mean: σ Known
• Critical Value Approach
31
p-Value Approach to Two-Tailed Hypothesis Testing
• Compute the p-value using the following three steps:
1. Compute the value of the test statistic z.
2. If z is in the upper tail (z > 0), compute the probability that z is greater than or
equal to the value of the test statistic. If z is in the lower tail (z < 0), compute the
probability that z is less than or equal to the value of the test statistic.
3. Double the tail area obtained in step 2 to obtain the p-value.
• The rejection rule: Reject H0 if the p-value < α.
32
Critical Value Approach to Two-Tailed Hypothesis Testing
• The critical values will occur in both the lower and upper tails of
the standard normal curve.
• Use the standard normal probability distribution table to find z_{α/2} (the z-value
with an area of α/2 in the upper tail of the distribution).
• The rejection rule is: Reject H0 if z < -z_{α/2} or z > z_{α/2}.
33
Two-Tailed Tests About a Population Mean: σ Known
• Example: Glow Toothpaste
The production line for Glow toothpaste is designed to fill tubes with a
mean weight of 6 oz. Periodically, a sample of 30 tubes will be selected in
order to check the filling process.
Quality assurance procedures call for the continuation of the filling
process if the sample results are consistent with the assumption that the
mean filling weight for the population of toothpaste tubes is 6 oz.; otherwise
the process will be adjusted.
34
Two-Tailed Tests About a Population Mean: σ Known
• Example: Glow Toothpaste
Assume that a sample of 30 toothpaste tubes provides a sample mean of
6.1 oz. The population standard deviation is believed to be 0.2 oz.
Perform a hypothesis test, at the .03 level of significance, to help
determine whether the filling process should continue operating or be
stopped and corrected.
35
Two-Tailed Tests About a Population Mean: σ Known
• p-Value and Critical Value Approaches
Two-Tailed Tests About a Population Mean: σ Known
• p-Value Approach
4. Compute the p-value.
The test statistic is z = (6.1 - 6)/(.2/√30) = 2.74.
For z = 2.74, cumulative probability = .9969
p-value = 2(1 - .9969) = .0062
Because p-value = .0062 < α = .03, we reject H0.
37
Two-Tailed Tests About a Population Mean: σ Known
• Critical Value Approach
[Figure: sampling distribution of z = (x̄ - μ0)/(σ/√n) with α/2 = .015 in each tail; reject H0 for z < -2.17 or z > 2.17, do not reject H0 otherwise]
39
Confidence Interval Approach to Two-Tailed Tests About a Population Mean
• Select a simple random sample from the population and use the value of the sample
mean x̄ to develop the confidence interval for the population mean μ. (Confidence
intervals are covered in Chapter 8.)
• If the confidence interval contains the hypothesized value μ0, do not reject H0.
Otherwise, reject H0. (Actually, H0 should be rejected if μ0 happens to be equal
to one of the end points of the confidence interval.)
40
Confidence Interval Approach to
Two-Tailed Tests About a Population Mean
• The 97% confidence interval for μ is
$$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 6.1 \pm 2.17\left(\frac{.2}{\sqrt{30}}\right) = 6.1 \pm .07924$$
or 6.02076 to 6.17924
• Because the hypothesized value for the population mean, μ0 = 6, is not in
this interval, the hypothesis-testing conclusion is that the null hypothesis,
H0: μ = 6, can be rejected.
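All three routes to the Glow Toothpaste decision (p-value, critical value, confidence interval) can be checked together. The sketch below uses the sample values from the example (x̄ = 6.1, σ = 0.2, n = 30, α = .03); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

sample_mean, mu_0, sigma, n, alpha = 6.1, 6.0, 0.2, 30, 0.03
std_normal = NormalDist()

z = (sample_mean - mu_0) / (sigma / sqrt(n))   # about 2.74
p_value = 2 * (1 - std_normal.cdf(abs(z)))     # double the tail area
z_crit = std_normal.inv_cdf(1 - alpha / 2)     # about 2.17
margin = z_crit * sigma / sqrt(n)
lower, upper = sample_mean - margin, sample_mean + margin

reject_by_p = p_value < alpha
reject_by_critical = abs(z) > z_crit
reject_by_interval = not (lower <= mu_0 <= upper)
print(round(z, 2), round(p_value, 4), round(lower, 4), round(upper, 4))
print(reject_by_p, reject_by_critical, reject_by_interval)
```

The hypothesized value 6 lies below the lower limit of about 6.02, the p-value is about .006, and z exceeds 2.17, so all three approaches reject H0: μ = 6.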
41
Tests About a Population Mean: σ Unknown
• Test Statistic:
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
• This test statistic has a t distribution with n - 1 degrees of freedom.
42
Tests About a Population Mean: σ Unknown
• Rejection Rule: p-Value Approach
Reject H0 if p-value < α
• Rejection Rule: Critical Value Approach
H0: μ ≥ μ0     Reject H0 if t < -t_α
H0: μ ≤ μ0     Reject H0 if t > t_α
H0: μ = μ0     Reject H0 if t < -t_{α/2} or t > t_{α/2}
43
p -Values and the t Distribution
• The format of the t distribution table provided in most statistics textbooks does
not have sufficient detail to determine the exact p-value for a hypothesis test.
• However, we can still use the t distribution table to identify a range for the p-
value.
• An advantage of computer software packages is that the computer output will
provide the p-value for the t distribution.
44
Example: Highway Patrol
• One-Tailed Test About a Population Mean: σ Unknown
A State Highway Patrol periodically samples vehicle speeds at various
locations on a particular roadway. The sample of vehicle speeds is used to
test the hypothesis H0: μ ≤ 65.
The locations where H0 is rejected are deemed the best locations for radar
traps. At Location F, a sample of 64 vehicles shows a mean speed of 66.2 mph
with a standard deviation of 4.2 mph. Use α = .05 to test the hypothesis.
45
One-Tailed Test About a Population Mean: σ Unknown
• p-Value and Critical Value Approaches
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{66.2 - 65}{4.2/\sqrt{64}} = 2.286$$
46
One-Tailed Test About a Population Mean: σ Unknown
• p –Value Approach
4. Compute the p –value.
For t = 2.286, the p-value must be less than .025
(for t = 1.998) and greater than .01 (for t = 2.387).
.01 < p–value < .025
47
One-Tailed Test About a Population Mean: σ Unknown
• Critical Value Approach
4. Determine the critical value and rejection rule.
For α = .05 and d.f. = 64 - 1 = 63, t.05 = 1.669
Reject H0 if t > 1.669
5. Determine whether to reject H0.
Because 2.286 > 1.669, we reject H0.
We are at least 95% confident that the mean speed of vehicles at
Location F is greater than 65 mph. Location F is a good candidate
for a radar trap.
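The Highway Patrol test statistic and decision can be reproduced as follows. Python's standard library has no t distribution, so this sketch takes the critical value t.05 = 1.669 for 63 degrees of freedom from the slide as a constant.

```python
from math import sqrt

sample_mean, mu_0, s, n = 66.2, 65, 4.2, 64
t_05 = 1.669   # t table value: 63 degrees of freedom, upper-tail area .05

# Upper-tailed test of H0: mu <= 65 with sigma unknown.
t = (sample_mean - mu_0) / (s / sqrt(n))
print(round(t, 3))
print("reject H0" if t > t_05 else "do not reject H0")
```

The computed t of about 2.286 exceeds the critical value 1.669, so H0 is rejected at the .05 level.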
48
One-Tailed Test About a Population Mean: σ Unknown
[Figure: t distribution with α = .05; reject H0 to the right of t_α = 1.669; the test statistic t = 2.286 falls in the rejection region]
49
Inference about two
populations
Dr. Nilakantan Narasinganallur Ph.D.
Inference About Means and Proportions with Two
Populations
• Inferences About the Difference Between Two Population Means:
σ1 and σ2 Known
• Inferences About the Difference Between Two Population Means:
σ1 and σ2 Unknown
• Inferences About the Difference Between Two Population Means:
Matched Samples
2
Inferences About the Difference Between
Two Population Means: σ1 and σ2 Known
3
Estimating the Difference Between Two Population Means
• Let μ1 equal the mean of population 1 and μ2 equal the mean of population 2.
• The difference between the two population means is μ1 - μ2.
• To estimate μ1 - μ2, we will select a simple random sample of size n1 from
population 1 and a simple random sample of size n2 from population 2.
• Let x̄1 equal the mean of sample 1 and x̄2 equal the mean of sample 2.
• The point estimator of the difference between the means of the populations 1
and 2 is x̄1 - x̄2.
4
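Sampling Distribution of x̄1 - x̄2
• Expected Value
$$E(\bar{x}_1 - \bar{x}_2) = \mu_1 - \mu_2$$
• Standard Deviation (Standard Error)
$$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$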
5
Interval Estimation of μ1 - μ2: σ1 and σ2 Known
• Interval Estimate
$$\bar{x}_1 - \bar{x}_2 \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
where:
1 - α is the confidence coefficient
6
Interval Estimation of μ1 - μ2: σ1 and σ2 Known
• Example: Par, Inc.
Par, Inc. is a manufacturer of golf equipment and has developed a new
golf ball that has been designed to provide “extra distance.”
In a test of driving distance using a mechanical driving device, a sample of
Par golf balls was compared with a sample of golf balls made by Rap, Ltd., a
competitor. The sample statistics appear on the next slide.
7
Interval Estimation of μ1 - μ2: σ1 and σ2 Known
• Example: Par, Inc.
Sample #1 Sample #2
Par, Inc. Rap, Ltd.
8
Interval Estimation of μ1 - μ2: σ1 and σ2 Known
• Example: Par, Inc.
9
Estimating the Difference Between Two Population Means
Population 1: Par, Inc. Golf Balls
μ1 = mean driving distance of Par golf balls
Population 2: Rap, Ltd. Golf Balls
μ2 = mean driving distance of Rap golf balls
μ1 - μ2 = difference between the mean distances
10
Point Estimate of $\mu_1 - \mu_2$
Point estimate of $\mu_1 - \mu_2$ = $\bar{x}_1 - \bar{x}_2$ = 295 - 278 = 17 yards
where:
$\mu_1$ = mean distance for the population of Par, Inc. golf balls
$\mu_2$ = mean distance for the population of Rap, Ltd. golf balls
11
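Interval Estimation of $\mu_1 - \mu_2$: $\sigma_1$ and $\sigma_2$ Known
$\bar{x}_1 - \bar{x}_2 \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = 17 \pm 1.96\sqrt{\frac{(15)^2}{120} + \frac{(20)^2}{80}}$
$= 17 \pm 5.14$, or 11.86 yards to 22.14 yards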
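The interval for the Par, Inc. example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `two_mean_z_interval` is illustrative and not part of the slides.

```python
from math import sqrt
from statistics import NormalDist


def two_mean_z_interval(xbar1, xbar2, sigma1, sigma2, n1, n2, confidence=0.95):
    """Interval estimate of mu1 - mu2 when both population sigmas are known."""
    point = xbar1 - xbar2                                # point estimate
    std_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)    # sigma of xbar1 - xbar2
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # z_{alpha/2}
    margin = z * std_error
    return point - margin, point + margin


# Par, Inc. (sample 1) versus Rap, Ltd. (sample 2), values from the slides
low, high = two_mean_z_interval(295, 278, 15, 20, 120, 80)
print(f"{low:.2f} to {high:.2f} yards")
```

With the slide values this should give roughly 11.86 to 22.14 yards.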
12
Hypothesis Tests About $\mu_1 - \mu_2$: $\sigma_1$ and $\sigma_2$ Known
• Hypotheses
• Test Statistic
$z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
13
Hypothesis Tests About $\mu_1 - \mu_2$: $\sigma_1$ and $\sigma_2$ Known
• Example: Par, Inc.
Can we conclude, using $\alpha$ = .01, that the mean driving distance of Par, Inc. golf balls is greater than the mean driving distance of Rap, Ltd. golf balls?
14
Hypothesis Tests About $\mu_1 - \mu_2$: $\sigma_1$ and $\sigma_2$ Known
• p-Value and Critical Value Approaches
15
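Hypothesis Tests About $\mu_1 - \mu_2$: $\sigma_1$ and $\sigma_2$ Known
• p-Value and Critical Value Approaches
3. Compute the value of the test statistic.
$z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} = \frac{(295 - 278) - 0}{\sqrt{\frac{(15)^2}{120} + \frac{(20)^2}{80}}} = \frac{17}{2.62} = 6.49$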
16
Hypothesis Tests About $\mu_1 - \mu_2$: $\sigma_1$ and $\sigma_2$ Known
• p-Value Approach
17
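Hypothesis Tests About $\mu_1 - \mu_2$: $\sigma_1$ and $\sigma_2$ Known
• Critical Value Approach
4. Determine the critical value and rejection rule.
For $\alpha$ = .01, $z_{.01}$ = 2.33
Reject H0 if z ≥ 2.33
5. Determine whether to reject H0.
Because z = 6.49 ≥ 2.33, we reject H0.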
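The right-tailed z test above can be sketched in Python as follows. The helper name `two_mean_z_test` is hypothetical, and the upper-tail p-value is computed from the standard normal distribution in the standard library.

```python
from math import sqrt
from statistics import NormalDist


def two_mean_z_test(xbar1, xbar2, sigma1, sigma2, n1, n2, d0=0.0):
    """Return (z, upper-tail p-value) for H0: mu1 - mu2 <= d0."""
    std_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    z = (xbar1 - xbar2 - d0) / std_error
    p_upper = 1 - NormalDist().cdf(z)
    return z, p_upper


z, p_value = two_mean_z_test(295, 278, 15, 20, 120, 80)
z_crit = NormalDist().inv_cdf(1 - 0.01)   # critical value for alpha = .01
reject_h0 = z >= z_crit
print(f"z = {z:.2f}, critical value = {z_crit:.2f}, reject H0: {reject_h0}")
```

Without intermediate rounding the statistic is about 6.48; the 6.49 on the slide comes from rounding the standard error to 2.62 first. The conclusion is the same either way.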
18
Inferences About the Difference Between Two Population Means: $\sigma_1$ and $\sigma_2$ Unknown
• Interval Estimation of $\mu_1 - \mu_2$
• Hypothesis Tests About $\mu_1 - \mu_2$
19
Interval Estimation of $\mu_1 - \mu_2$: $\sigma_1$ and $\sigma_2$ Unknown
When $\sigma_1$ and $\sigma_2$ are unknown, we will:
• use the sample standard deviations $s_1$ and $s_2$ as estimates of $\sigma_1$ and $\sigma_2$, and
• replace $z_{\alpha/2}$ with $t_{\alpha/2}$.
20
Interval Estimation of $\mu_1 - \mu_2$: $\sigma_1$ and $\sigma_2$ Unknown
• Interval Estimate
$\bar{x}_1 - \bar{x}_2 \pm t_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
21
Difference Between Two Population Means: $\sigma_1$ and $\sigma_2$ Unknown
• Example: Specific Motors
Specific Motors of Detroit has developed a new automobile known as the M car. 24 M cars and 28 J cars (from Japan) were road tested to compare miles-per-gallon (mpg) performance. The sample statistics are shown on the next slide.
22
Difference Between Two Population Means: $\sigma_1$ and $\sigma_2$ Unknown
• Example: Specific Motors
                     Sample #1    Sample #2
                     M Cars       J Cars
Sample Size          24 cars      28 cars
Sample Mean          29.8 mpg     27.3 mpg
Sample Std. Dev.     2.56 mpg     1.81 mpg
23
Difference Between Two Population Means: $\sigma_1$ and $\sigma_2$ Unknown
• Example: Specific Motors
Let us develop a 90% confidence interval estimate of the difference between
the mpg performances of the two models of automobile.
24
Point Estimate of $\mu_1 - \mu_2$
Point estimate of $\mu_1 - \mu_2$ = $\bar{x}_1 - \bar{x}_2$ = 29.8 - 27.3 = 2.5 mpg
where:
$\mu_1$ = mean miles-per-gallon for the population of M cars
$\mu_2$ = mean miles-per-gallon for the population of J cars
25
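Interval Estimation of $\mu_1 - \mu_2$: $\sigma_1$ and $\sigma_2$ Unknown
The degrees of freedom for $t_{\alpha/2}$ are:
$df = \frac{\left(\frac{(2.56)^2}{24} + \frac{(1.81)^2}{28}\right)^2}{\frac{1}{24 - 1}\left(\frac{(2.56)^2}{24}\right)^2 + \frac{1}{28 - 1}\left(\frac{(1.81)^2}{28}\right)^2} = 40.59 \approx 41$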
26
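Interval Estimation of $\mu_1 - \mu_2$: $\sigma_1$ and $\sigma_2$ Unknown
$\bar{x}_1 - \bar{x}_2 \pm t_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
$29.8 - 27.3 \pm 1.683\sqrt{\frac{(2.56)^2}{24} + \frac{(1.81)^2}{28}}$
$= 2.5 \pm 1.051$, or 1.449 to 3.551 mpg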
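The degrees-of-freedom formula and the 90% interval for the Specific Motors example can be reproduced as below. This is a sketch under one stated assumption: the standard library has no t distribution, so the critical value 1.683 is copied from the slide rather than computed. The name `welch_df` is illustrative.

```python
from math import sqrt


def welch_df(s1, n1, s2, n2):
    """Degrees of freedom for the two-sample t procedure with unknown sigmas."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))


s1, n1 = 2.56, 24   # M cars
s2, n2 = 1.81, 28   # J cars
df = welch_df(s1, n1, s2, n2)

t_crit = 1.683      # t_{.05} with 41 df, taken from the slide (assumption)
point = 29.8 - 27.3
margin = t_crit * sqrt(s1**2 / n1 + s2**2 / n2)
low, high = point - margin, point + margin
print(f"df = {df:.2f}, interval = {low:.3f} to {high:.3f} mpg")
```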
27
Hypothesis Tests About $\mu_1 - \mu_2$: $\sigma_1$ and $\sigma_2$ Unknown
• Hypotheses
28
Hypothesis Tests About $\mu_1 - \mu_2$: $\sigma_1$ and $\sigma_2$ Unknown
• Example: Specific Motors
Can we conclude, using a .05 level of significance, that the miles-per-gallon
(mpg) performance of M cars is greater than the miles-per-gallon performance
of J cars?
29
Hypothesis Tests About $\mu_1 - \mu_2$: $\sigma_1$ and $\sigma_2$ Unknown
• p-Value and Critical Value Approaches
1. Develop the hypotheses.
H0: $\mu_1 - \mu_2 \le 0$
Ha: $\mu_1 - \mu_2 > 0$ (right-tailed test)
where:
$\mu_1$ = mean mpg for the population of M cars
$\mu_2$ = mean mpg for the population of J cars
30
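Hypothesis Tests About $\mu_1 - \mu_2$: $\sigma_1$ and $\sigma_2$ Unknown
• p-Value and Critical Value Approaches
2. Specify the level of significance. $\alpha$ = .05
3. Compute the value of the test statistic.
$t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{(29.8 - 27.3) - 0}{\sqrt{\frac{(2.56)^2}{24} + \frac{(1.81)^2}{28}}} = 4.003$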
31
Hypothesis Tests About $\mu_1 - \mu_2$: $\sigma_1$ and $\sigma_2$ Unknown
• p-Value Approach
4. Compute the p-value.
The degrees of freedom for t are:
$df = \frac{\left(\frac{(2.56)^2}{24} + \frac{(1.81)^2}{28}\right)^2}{\frac{1}{24 - 1}\left(\frac{(2.56)^2}{24}\right)^2 + \frac{1}{28 - 1}\left(\frac{(1.81)^2}{28}\right)^2} = 40.59 \approx 41$
Because t = 4.003 > $t_{.05}$ = 1.683, the p-value < .05.
(In fact, the p-value < .005.)
32
Hypothesis Tests About $\mu_1 - \mu_2$: $\sigma_1$ and $\sigma_2$ Unknown
• p-Value Approach
5. Determine whether to reject H0.
Because p-value < $\alpha$ = .05, we reject H0.
We are at least 95% confident that the miles-per-gallon (mpg)
performance of M cars is greater than the miles-per-gallon
performance of J cars.
33
Hypothesis Tests About $\mu_1 - \mu_2$: $\sigma_1$ and $\sigma_2$ Unknown
• Critical Value Approach
4. Determine the critical value and rejection rule.
For $\alpha$ = .05 and df = 41, $t_{.05}$ = 1.683
Reject H0 if t ≥ 1.683
34
Inferences About the Difference Between Two Population Means:
Matched Samples
• With a matched-sample design each sampled item provides a pair of data
values.
• This design often leads to a smaller sampling error than the independent-
sample design because variation between sampled items is eliminated as a
source of sampling error.
35
Inferences About the Difference Between Two Population Means:
Matched Samples
• Example: Express Deliveries
A Chicago-based firm has documents that must be quickly distributed to
district offices throughout the U.S. The firm must decide between two
delivery services, UPX (United Parcel Express) and INTEX (International
Express), to transport its documents.
36
Inferences About the Difference Between Two Population Means:
Matched Samples
• Example: Express Deliveries
In testing the delivery times of the two services, the firm sent two reports
to a random sample of its district offices with one report carried by UPX and
the other report carried by INTEX. Do the data on the next slide indicate a
difference in mean delivery times for the two services? Use a .05 level of
significance.
37
Inferences About the Difference Between Two Population Means:
Matched Samples
Delivery Time (Hours)
District Office UPX INTEX Difference
Seattle 32 25 7
Los Angeles 30 24 6
Boston 19 15 4
Cleveland 16 15 1
New York 15 13 2
Houston 18 15 3
Atlanta 14 15 -1
St. Louis 10 8 2
Milwaukee 7 9 -2
Denver 16 11 5
38
Inferences About the Difference Between Two Population Means:
Matched Samples
• p –Value and Critical Value Approaches
1. Develop the hypotheses.
H0: $\mu_d = 0$
Ha: $\mu_d \ne 0$
Let $\mu_d$ = the mean of the difference values for the two delivery services for the population of district offices
39
Inferences About the Difference Between Two Population Means:
Matched Samples
• p –Value and Critical Value Approaches
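2. Specify the level of significance. $\alpha$ = .05
3. Compute the value of the test statistic.
$\bar{d} = \frac{\sum d_i}{n} = \frac{7 + 6 + \cdots + 5}{10} = 2.7$
$s_d = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}} = \sqrt{\frac{76.1}{9}} = 2.9$
$t = \frac{\bar{d} - \mu_d}{s_d / \sqrt{n}} = \frac{2.7 - 0}{2.9 / \sqrt{10}} = 2.94$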
40
Inferences About the Difference Between Two Population Means:
Matched Samples
• p –Value Approach
4. Compute the p –value.
For t = 2.94 and df = 9, the p–value is between .02 and .01.
(This is a two-tailed test, so we double the upper-tail areas of
.01 and .005.)
5. Determine whether to reject H0.
Because p-value < $\alpha$ = .05, we reject H0.
We are at least 95% confident that there is a difference in
mean delivery times for the two services.
41
Inferences About the Difference Between Two Population Means:
Matched Samples
• Critical Value Approach
4. Determine the critical value and rejection rule.
For $\alpha$ = .05 and df = 9, $t_{.025}$ = 2.262.
Reject H0 if t ≤ -2.262 or t ≥ 2.262 (two-tailed test)
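The matched-sample calculation can be reproduced directly from the delivery-time table. The sketch below uses the standard library; the critical value 2.262 is copied from the slide, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Delivery times (hours) by district office, from the table in the slides
upx   = [32, 30, 19, 16, 15, 18, 14, 10, 7, 16]
intex = [25, 24, 15, 15, 13, 15, 15, 8, 9, 11]

d = [u - i for u, i in zip(upx, intex)]   # paired differences
n = len(d)
d_bar = mean(d)                           # mean difference
s_d = stdev(d)                            # sample std. dev. (n - 1 divisor)
t = (d_bar - 0) / (s_d / sqrt(n))         # test statistic under H0: mu_d = 0

t_crit = 2.262                            # t_{.025} with 9 df, from the slide
reject_h0 = abs(t) >= t_crit
print(f"d_bar = {d_bar:.1f}, s_d = {s_d:.2f}, t = {t:.2f}, reject H0: {reject_h0}")
```

Because the test is computed on the differences, each office acts as its own control, which is the source of the smaller sampling error mentioned earlier.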
42
Learnings?
Chi-square & Cross tabulation
Dr. Nilakantan Narasinganallur Ph.D.
Tests of Goodness of Fit, Independence,
and Multiple Proportions
• Testing For Equality of Three or More Population Proportions
• Goodness of Fit Test
• Test of Independence
2
Tests of Goodness of Fit, Independence,
and Multiple Proportions
• We introduce three additional hypothesis-testing procedures.
• The test statistic and the distribution used are based on the chi-square ($\chi^2$) distribution.
3
Testing the Equality of Population Proportions
for Three or More Populations
Using the notation
p1 = population proportion for population 1
p2 = population proportion for population 2
pk = population proportion for population k
The hypotheses for the equality of population proportions for $k \ge 3$ populations are as follows:
H0: p1 = p2 = . . . = pk
Ha: Not all population proportions are equal
4
Testing the Equality of Population Proportions
for Three or More Populations
• If H0 cannot be rejected, we cannot detect a difference among the k population proportions.
• If H0 can be rejected, we can conclude that not all k population proportions are equal.
• Further analyses can be done to conclude which population proportions are significantly different from
others.
5
Testing the Equality of Population Proportions
for Three or More Populations
• Example: Finger Lakes Homes
Finger Lakes Homes manufactures three models of prefabricated homes, a two-story colonial, a log
cabin, and an A-frame. To help in product-line planning, management would like to compare the customer
satisfaction with the three home styles.
6
Testing the Equality of Population Proportions
for Three or More Populations
• We begin by taking a sample of owners from each of the three populations.
• Each sample contains categorical data indicating whether the respondents are likely or not likely to
repurchase the home.
7
Testing the Equality of Population Proportions
for Three or More Populations
• Observed Frequencies (sample results)
                           Home Owner
Likely to Repurchase   Colonial   Log   A-Frame   Total
Yes                        97      83      80      260
No                         38      18      44      100
Total                     135     101     124      360
8
Testing the Equality of Population Proportions
for Three or More Populations
• Next, we determine the expected frequencies under the assumption H0 is correct.
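Expected Frequencies Under the Assumption H0 is True
$e_{ij} = \frac{(\text{Row } i \text{ Total})(\text{Column } j \text{ Total})}{\text{Total Sample Size}}$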
• If a significant difference exists between the observed and expected frequencies, H0 can be rejected.
9
Testing the Equality of Population Proportions
for Three or More Populations
• Expected Frequencies (computed)
                           Home Owner
Likely to Repurchase   Colonial    Log     A-Frame   Total
Yes                      97.50    72.94     89.56     260
No                       37.50    28.06     34.44     100
Total                   135      101       124        360
10
Testing the Equality of Population Proportions
for Three or More Populations
• Next, compute the value of the chi-square test statistic.
$\chi^2 = \sum_i \sum_j \frac{(f_{ij} - e_{ij})^2}{e_{ij}}$
where: $f_{ij}$ = observed frequency for the cell in row i and column j
$e_{ij}$ = expected frequency for the cell in row i and column j under the assumption H0 is true
Note: The test statistic has a chi-square distribution with k – 1 degrees of freedom, provided the expected frequency is 5 or more for each cell.
11
Testing the Equality of Population Proportions
for Three or More Populations
• Computation of the Chi-Square Test Statistic.
Likely to    Home       Obs. Freq.   Exp. Freq.   Diff.          Sqd. Diff.       Sqd. Diff. / Exp. Freq.
Repurchase   Owner      f_ij         e_ij         (f_ij - e_ij)  (f_ij - e_ij)^2  (f_ij - e_ij)^2 / e_ij
Yes          Colonial    97           97.50        -0.50           0.2500          0.0026
Yes          Log Cab.    83           72.94        10.06         101.1142          1.3862
Yes          A-Frame     80           89.56        -9.56          91.3086          1.0196
No           Colonial    38           37.50         0.50           0.2500          0.0067
No           Log Cab.    18           28.06       -10.06         101.1142          3.6041
No           A-Frame     44           34.44         9.56          91.3086          2.6509
Total                   360          360                                   $\chi^2$ = 8.6700
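The expected frequencies and the chi-square statistic in the table can be recomputed from the observed counts alone. This is a minimal sketch; the variable names are illustrative.

```python
# Observed frequencies: rows = likely to repurchase (Yes, No),
# columns = home style (Colonial, Log, A-Frame)
observed = [
    [97, 83, 80],
    [38, 18, 44],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency under H0: (row total)(column total) / sample size
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (f - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for f, e in zip(obs_row, exp_row)
)
df = len(col_totals) - 1   # k - 1 populations

print([[round(e, 2) for e in row] for row in expected])
print(f"chi-square = {chi2:.3f} with {df} degrees of freedom")
```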
12
Testing the Equality of Population Proportions
for Three or More Populations
• Rejection Rule
13
Testing the Equality of Population Proportions
for Three or More Populations
• Rejection Rule (using $\alpha$ = .05)
With k - 1 = 3 - 1 = 2 degrees of freedom, reject H0 if p-value ≤ .05 or $\chi^2 \ge 5.991$
14
Testing the Equality of Population Proportions
for Three or More Populations
• Conclusion Using the p-Value Approach
Because $\chi^2$ = 8.670 is between 7.378 and 9.210, the area in the upper tail of the distribution is between .025 and .01.
15
Testing the Equality of Population Proportions
for Three or More Populations
• We have concluded that the population proportions for the three populations of home owners are
not equal.
• To identify where the differences between population proportions exist, we will rely on a multiple
comparisons procedure.
16
Multiple Comparisons Procedure
• We begin by computing the three sample proportions.
17
Multiple Comparisons Procedure
• Marascuilo Procedure
We compute the absolute value of the pairwise difference between sample proportions.
18
Multiple Comparisons Procedure
• Critical Values for the Marascuilo Pairwise Comparison
$CV_{ij} = \sqrt{\chi^2_{\alpha,\,k-1}}\,\sqrt{\frac{\bar{p}_i(1 - \bar{p}_i)}{n_i} + \frac{\bar{p}_j(1 - \bar{p}_j)}{n_j}}$
19
Multiple Comparisons Procedure
• Pairwise Comparison Tests
Pairwise Comparison   $|\bar{p}_i - \bar{p}_j|$   $CV_{ij}$   Significant if $|\bar{p}_i - \bar{p}_j| > CV_{ij}$
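The pairwise comparisons can be worked out from the observed frequencies with a short script. This sketch assumes the $\chi^2$ critical value 5.991 from the rejection-rule slide ($\alpha$ = .05, 2 degrees of freedom); the dictionary names are illustrative.

```python
from itertools import combinations
from math import sqrt

# "Yes" counts and sample sizes per home style, from the observed-frequency table
yes = {"Colonial": 97, "Log": 83, "A-Frame": 80}
size = {"Colonial": 135, "Log": 101, "A-Frame": 124}
chi2_crit = 5.991   # chi-square critical value, alpha = .05, k - 1 = 2 df

p = {style: yes[style] / size[style] for style in yes}   # sample proportions

results = {}
for a, b in combinations(p, 2):
    diff = abs(p[a] - p[b])
    cv = sqrt(chi2_crit) * sqrt(
        p[a] * (1 - p[a]) / size[a] + p[b] * (1 - p[b]) / size[b]
    )
    results[(a, b)] = (diff, cv, diff > cv)
    print(f"{a} vs. {b}: |diff| = {diff:.4f}, CV = {cv:.4f}, significant: {diff > cv}")
```

With these data only the Log versus A-Frame comparison exceeds its critical value.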
20
Goodness of Fit Test:
Multinomial Probability Distribution
1. State the null and alternative hypotheses.
H0: The population follows a multinomial distribution with specified probabilities for each of
the k categories
Ha: The population does not follow a multinomial distribution with specified probabilities for
each of the k categories
21
Goodness of Fit Test:
Multinomial Probability Distribution
2. Select a random sample and record the observed frequency, fi , for each of the k categories.
22
Goodness of Fit Test:
Multinomial Probability Distribution
4. Compute the value of the test statistic.
$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$
24
Multinomial Distribution Goodness of Fit Test
• Example: Finger Lakes Homes (A)
Finger Lakes Homes manufactures four models of prefabricated homes, a two-story colonial, a log
cabin, a split-level, and an A-frame. To help in production planning, management would like to
determine if previous customer purchases indicate that there is a preference in the style selected.
25
Multinomial Distribution Goodness of Fit Test
• Example: Finger Lakes Homes (A)
The number of homes sold of each model for 100 sales over the past two years is shown below.
Split- A-
Model Colonial Log Level Frame
# Sold 30 20 35 15
26
Multinomial Distribution Goodness of Fit Test
• Hypotheses
H0: pC = pL = pS = pA = .25
where:
pC = population proportion that purchase a colonial
pL = population proportion that purchase a log cabin
pS = population proportion that purchase a split-level
pA = population proportion that purchase an A-frame
27
Multinomial Distribution Goodness of Fit Test
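• Rejection Rule
With $\alpha$ = .05 and k - 1 = 4 - 1 = 3 degrees of freedom, reject H0 if p-value ≤ .05 or $\chi^2 \ge 7.815$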
28
Multinomial Distribution Goodness of Fit Test
• Expected Frequencies
e1 = .25(100) = 25 e2 = .25(100) = 25
e3 = .25(100) = 25 e4 = .25(100) = 25
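• Test Statistic
$\chi^2 = \frac{(30 - 25)^2}{25} + \frac{(20 - 25)^2}{25} + \frac{(35 - 25)^2}{25} + \frac{(15 - 25)^2}{25} = 1 + 1 + 4 + 4 = 10$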
29
Multinomial Distribution Goodness of Fit Test
• Conclusion Using the p-Value Approach
Because $\chi^2$ = 10 is between 9.348 and 11.345, the area in the upper tail of the distribution is between .025 and .01.
30
Multinomial Distribution Goodness of Fit Test
• Conclusion Using the Critical Value Approach
Because $\chi^2$ = 10 > 7.815, we reject H0 at the .05 level of significance.
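The goodness of fit calculation is short enough to verify by hand, but a sketch in Python makes the steps explicit. The critical value 7.815 is taken from the slides; the names are illustrative.

```python
# Homes sold by model over the past two years (100 sales)
observed = {"Colonial": 30, "Log": 20, "Split-Level": 35, "A-Frame": 15}
hypothesized_p = 0.25   # H0: every model has the same share

n = sum(observed.values())
expected = {model: hypothesized_p * n for model in observed}

chi2 = sum((observed[m] - expected[m]) ** 2 / expected[m] for m in observed)
chi2_crit = 7.815       # alpha = .05 with k - 1 = 3 df, from the slide
reject_h0 = chi2 >= chi2_crit
print(f"chi-square = {chi2:.1f}, reject H0: {reject_h0}")
```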
31
Test of Independence
1. Set up the null and alternative hypotheses.
2. Select a random sample and record the observed frequency, fij , for each cell of the contingency table.
32
Test of Independence
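4. Compute the test statistic.
$\chi^2 = \sum_i \sum_j \frac{(f_{ij} - e_{ij})^2}{e_{ij}}$
5. Determine the rejection rule.
Reject H0 if p-value ≤ $\alpha$ or $\chi^2 \ge \chi^2_\alpha$
where $\alpha$ is the significance level and, with n rows and m columns, there are (n - 1)(m - 1) degrees of freedom.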
33
Test of Independence
• Example: Finger Lakes Homes (B)
Each home sold by Finger Lakes Homes can be classified according to price and to style. Finger
Lakes’ manager would like to determine if the price of the home and the style of the home are
independent variables.
34
Test of Independence
• Example: Finger Lakes Homes (B)
The number of homes sold for each model and price for the past two years is shown below. For
convenience, the price of the home is listed as either less than $200,000 or more than or equal to
$200,000.
Price         Colonial   Log   Split-Level   A-Frame
< $200,000       18        6       19           12
≥ $200,000       12       14       16            3
35
Test of Independence
• Hypotheses
H0: Price of the home is independent of the style of the home that is purchased
Ha: Price of the home is not independent of the style of the home that is purchased
36
Test of Independence
• Expected Frequencies
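Observed frequencies with row and column totals. The expected frequency for each cell is (row total)(column total)/(sample size); for example, $e_{11}$ = (55)(30)/100 = 16.5.
Price      Colonial   Log   Split-Level   A-Frame   Total
< $200K       18        6       19           12       55
≥ $200K       12       14       16            3       45
Total         30       20       35           15      100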
37
Test of Independence
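• Rejection Rule
With $\alpha$ = .05 and (2 - 1)(4 - 1) = 3 degrees of freedom, reject H0 if p-value ≤ .05 or $\chi^2 \ge 7.815$
• Test Statistic
$\chi^2 = \frac{(18 - 16.5)^2}{16.5} + \frac{(6 - 11)^2}{11} + \cdots + \frac{(3 - 6.75)^2}{6.75}$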
38
Test of Independence
• Conclusion Using the p-Value Approach
Because $\chi^2$ = 9.149 is between 7.815 and 9.348, the area in the upper tail of the distribution is between .05 and .025.
39
Test of Independence
• Conclusion Using the Critical Value Approach
We reject, at the .05 level of significance, the assumption that the price of the home is
independent of the style of home that is purchased.
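The test of independence for the price-by-style table can be sketched as a small reusable function. The function name `chi_square_independence` is hypothetical; the critical value 7.815 is taken from the slides.

```python
def chi_square_independence(table):
    """Return (chi2, df, expected) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (f - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for f, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, expected


# Rows: price below $200,000, price $200,000 or more
# Columns: Colonial, Log, Split-Level, A-Frame
homes = [
    [18, 6, 19, 12],
    [12, 14, 16, 3],
]
chi2, df, expected = chi_square_independence(homes)
reject_h0 = chi2 >= 7.815   # alpha = .05 with 3 df
print(f"chi-square = {chi2:.3f}, df = {df}, reject H0: {reject_h0}")
```

Carrying full precision through all eight cells gives a statistic near 9.15, which leads to the same rejection decision.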
40
Simple Regression
Dr. Nilakantan Narasinganallur Ph.D.
Simple Linear Regression
2
Simple Linear Regression
• Managerial decisions often are based on the relationship between two or
more variables.
• Regression analysis can be used to develop an equation showing how the
variables are related.
• The variable being predicted is called the dependent variable and is denoted by y. It is also called the predicted variable or response variable.
• The variables being used to predict the value of the dependent variable are called the independent variables and are denoted by x. They are also called predictor variables or explanatory variables.
3
Simple Linear Regression
• Simple linear regression involves one independent variable and one
dependent variable.
• The relationship between the two variables is approximated by a
straight line.
• Regression analysis involving two or more independent variables is
called multiple regression. Note: dependent variable is one and
independent variables can be multiple.
• While independent variables can be of categorical or quantitative type,
dependent variable is required to be quantitative in linear regression.
4
Simple Linear Regression Model
• The equation that describes how y is related to x and an error term is called
the regression model.
• The simple linear regression model is:
$y = \beta_0 + \beta_1 x + \epsilon$
where:
$\beta_0$ and $\beta_1$ are called parameters of the model,
$\epsilon$ is a random variable called the error term.
5
Simple Linear Regression Equation
• The simple linear regression equation is:
$E(y) = \beta_0 + \beta_1 x$
6
Simple Linear Regression Equation
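• Positive Linear Relationship
[Figure: regression line of E(y) against x with intercept $\beta_0$; slope $\beta_1$ is positive]
7
Simple Linear Regression Equation
• Negative Linear Relationship
[Figure: regression line of E(y) against x with intercept $\beta_0$; slope $\beta_1$ is negative]
8
Simple Linear Regression Equation
• No Relationship
[Figure: horizontal regression line of E(y) against x; slope $\beta_1$ is 0]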
9
Estimated Simple Linear Regression Equation
• The estimated simple linear regression equation
• Based on sample data
$\hat{y} = b_0 + b_1 x$
10
Estimation Process
• Regression model: $y = \beta_0 + \beta_1 x + \epsilon$
• Regression equation: $E(y) = \beta_0 + \beta_1 x$
• Unknown parameters: $\beta_0$, $\beta_1$
• Sample data: pairs $(x_1, y_1), \ldots, (x_n, y_n)$
• Estimated regression equation: $\hat{y} = b_0 + b_1 x$
• Sample statistics: $b_0$, $b_1$
• $b_0$ and $b_1$ provide estimates of $\beta_0$ and $\beta_1$
11
Least squares method - diagram
Least Squares Method
• Least Squares Criterion
$\min \sum (y_i - \hat{y}_i)^2$
where:
$y_i$ = observed value of the dependent variable for the i th observation
$\hat{y}_i$ = estimated value of the dependent variable for the i th observation
13
Least Squares Method
• Slope for the Estimated Regression Equation
$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
where:
$x_i$ = value of independent variable for i th observation
$y_i$ = value of dependent variable for i th observation
$\bar{x}$ = mean value for independent variable
$\bar{y}$ = mean value for dependent variable
14
Least Squares Method
• y-Intercept for the Estimated Regression Equation
$b_0 = \bar{y} - b_1 \bar{x}$
15
Simple Linear Regression
• Example: Reed Auto Sales
Reed Auto periodically has a special week-long sale. As part of
the advertising campaign Reed runs one or more television
commercials during the weekend preceding the sale. Data from a
sample of 5 previous sales are shown on the next slide.
16
Simple Linear Regression
• Example: Reed Auto Sales
Number of TV Ads (x)   Number of Cars Sold (y)
        1                      14
        3                      24
        2                      18
        1                      17
        3                      27
$\sum x = 10$              $\sum y = 100$
$\bar{x} = 2$              $\bar{y} = 20$
17
Estimated Regression Equation
• Slope for the Estimated Regression Equation
$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{20}{4} = 5$
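The slope and intercept formulas can be applied to the Reed Auto data in a few lines. This is a minimal sketch with illustrative variable names.

```python
# Reed Auto Sales: number of TV ads (x) and cars sold (y)
x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))   # numerator
s_xx = sum((xi - x_bar) ** 2 for xi in x)                          # denominator

b1 = s_xy / s_xx            # slope
b0 = y_bar - b1 * x_bar     # y-intercept
print(f"y-hat = {b0:g} + {b1:g}x")
```

The result agrees with the trendline equation y = 5x + 10 reported by Excel in the chart that follows.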
18
Using Excel’s Chart Tools for
Scatter Diagram & Estimated Regression Equation
[Figure: Reed Auto Sales Estimated Regression Line. Scatter diagram of Cars Sold (vertical axis) against TV Ads (horizontal axis) with the fitted line y = 5x + 10]
19
Coefficient of Determination
• Relationship Among SST, SSR, SSE
SST = SSR + SSE
$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
20
Coefficient of Determination
• The coefficient of determination is:
$r^2$ = SSR/SST
where:
SSR = sum of squares due to regression
SST = total sum of squares
21
Coefficient of Determination
$r^2$ = SSR/SST = 100/114 = .8772
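The three sums of squares behind this ratio can be computed from the data and the fitted line $\hat{y} = 10 + 5x$. The sketch below also takes the square root to get the sample correlation coefficient; variable names are illustrative.

```python
from math import sqrt

x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]
b0, b1 = 10, 5                        # estimated regression equation from the slides

y_bar = sum(y) / len(y)
y_hat = [b0 + b1 * xi for xi in x]    # fitted values

sst = sum((yi - y_bar) ** 2 for yi in y)                 # total sum of squares
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # due to regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # due to error

r2 = ssr / sst
r_xy = sqrt(r2) if b1 > 0 else -sqrt(r2)   # sign follows the slope
print(f"SST = {sst}, SSR = {ssr}, SSE = {sse}, r2 = {r2:.4f}, r = {r_xy:.4f}")
```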
22
Using Excel to Compute the Coefficient of Determination
• Adding r 2 Value to Scatter Diagram
[Figure: Reed Auto Sales Estimated Regression Line. Scatter diagram of Cars Sold against TV Ads with the fitted line y = 5x + 10 and $R^2$ = 0.8772]
23
Sample Correlation Coefficient
$r_{xy} = (\text{sign of } b_1)\sqrt{r^2}$
where:
$b_1$ = the slope of the estimated regression equation $\hat{y} = b_0 + b_1 x$
24
Sample Correlation Coefficient
$r_{xy} = (\text{sign of } b_1)\sqrt{r^2}$
$r_{xy} = +\sqrt{.8772} = +.9366$
25
Assumptions About the Error Term $\epsilon$
1. The error $\epsilon$ is a random variable with mean of zero.
2. The variance of $\epsilon$, denoted by $\sigma^2$, is the same for all values of the independent variable.
3. The values of $\epsilon$ are independent.
4. The error $\epsilon$ is a normally distributed random variable.
26
Testing for Significance
• To test for a significant regression relationship, we must conduct a hypothesis test to determine whether the value of $\beta_1$ is zero.
• Two tests are commonly used: the t test and the F test.
• Both the t test and F test require an estimate of $\sigma^2$, the variance of $\epsilon$ in the regression model.
27
Testing for Significance
• An Estimate of $\sigma^2$
The mean square error (MSE) provides the estimate of $\sigma^2$, and the notation $s^2$ is also used.
$s^2 = \text{MSE} = \text{SSE}/(n - 2)$
where:
$\text{SSE} = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - b_0 - b_1 x_i)^2$
28
Testing for Significance
• An Estimate of $\sigma$
• To estimate $\sigma$, we take the square root of $s^2$.
• The resulting s is called the standard error of the estimate.
$s = \sqrt{\text{MSE}} = \sqrt{\frac{\text{SSE}}{n - 2}}$
29
Testing for Significance: t Test
• Hypotheses
H0: $\beta_1 = 0$
Ha: $\beta_1 \ne 0$
• Test Statistic
$t = \frac{b_1}{s_{b_1}}$ where $s_{b_1} = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$
30
Testing for Significance: t Test
• Rejection Rule
Reject H0 if p-value ≤ $\alpha$ or t ≤ $-t_{\alpha/2}$ or t ≥ $t_{\alpha/2}$
where:
$t_{\alpha/2}$ is based on a t distribution with n - 2 degrees of freedom
31
Testing for Significance: t Test
1. Determine the hypotheses. H0: $\beta_1 = 0$; Ha: $\beta_1 \ne 0$
3. Select the test statistic. $t = \frac{b_1}{s_{b_1}}$
32
Testing for Significance: t Test
5. Compute the value of the test statistic. $t = \frac{b_1}{s_{b_1}} = \frac{5}{1.08} = 4.63$
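The standard error of the slope, the t statistic, and the corresponding 95% confidence interval for the slope can be traced step by step. This sketch copies the critical value $t_{.025}$ = 3.1824 (3 degrees of freedom) from the slides rather than computing it; the variable names are illustrative.

```python
from math import sqrt

x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]
b0, b1 = 10, 5                      # estimated regression equation

n = len(x)
x_bar = sum(x) / n
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = sqrt(sse / (n - 2))             # standard error of the estimate
s_b1 = s / sqrt(sum((xi - x_bar) ** 2 for xi in x))

t = b1 / s_b1                       # test statistic for H0: beta1 = 0
f = t ** 2                          # in simple regression, F = MSR/MSE equals t squared

t_crit = 3.1824                     # t_{.025} with n - 2 = 3 df, from the slides
ci_low, ci_high = b1 - t_crit * s_b1, b1 + t_crit * s_b1
print(f"s = {s:.5f}, s_b1 = {s_b1:.5f}, t = {t:.4f}, F = {f:.4f}")
print(f"95% CI for the slope: {ci_low:.4f} to {ci_high:.4f}")
```

The interval excludes 0, which matches the rejection of H0 by the t test.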
33
Confidence Interval for $\beta_1$
• We can use a 95% confidence interval for $\beta_1$ to test the hypotheses just used in the t test.
• H0 is rejected if the hypothesized value of $\beta_1$ is not included in the confidence interval for $\beta_1$.
34
Confidence Interval for $\beta_1$
• The form of a confidence interval for $\beta_1$ is:
$b_1 \pm t_{\alpha/2}\, s_{b_1}$
where
$b_1$ is the point estimator,
$t_{\alpha/2}\, s_{b_1}$ is the margin of error, and
$t_{\alpha/2}$ is the t value providing an area of $\alpha/2$ in the upper tail of a t distribution with n - 2 degrees of freedom
35
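Confidence Interval for $\beta_1$
• Rejection Rule
Reject H0 if 0 is not included in the confidence interval for $\beta_1$.
• 95% Confidence Interval for $\beta_1$
$b_1 \pm t_{\alpha/2}\, s_{b_1}$ = 5 ± 3.182(1.08) = 5 ± 3.44, or 1.56 to 8.44
• Conclusion
0 is not included in the confidence interval. Reject H0.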
36
Testing for Significance: F Test
• Hypotheses
H0: $\beta_1 = 0$
Ha: $\beta_1 \ne 0$
• Test Statistic
F = MSR/MSE
37
Testing for Significance: F Test
• Rejection Rule
Reject H0 if p-value ≤ $\alpha$ or F ≥ $F_\alpha$
where:
$F_\alpha$ is based on an F distribution with 1 degree of freedom in the numerator and n - 2 degrees of freedom in the denominator
38
Testing for Significance: F Test
1. Determine the hypotheses. H0: $\beta_1 = 0$; Ha: $\beta_1 \ne 0$
39
Testing for Significance: F Test
5. Compute the value of the test statistic.
F = MSR/MSE = 100/4.667 = 21.43
F = 17.44 provides an area of .025 in the upper tail. Thus, the p-value corresponding to F = 21.43 is less than .025. Hence, we reject H0.
The statistical evidence is sufficient to conclude that we have a significant
relationship between the number of TV ads aired and the number of cars
sold.
40
Some Cautions about the Interpretation of
Significance Tests
• Rejecting H0: $\beta_1 = 0$ and concluding that the relationship between x and y
is significant does not enable us to conclude that a cause-and-effect
relationship is present between x and y.
41
Excel Practice – example 1 – Anscombe’s regression
• Anscombe’s quartet comprises four data sets that have nearly identical
simple descriptive statistics, yet have very different distributions and
appear very different when graphed.
— Wikipedia
• The data sets are given in the Excel file “Anscombe’s regression template”.
• Interpretation of results
• Data set 1 – linear regression (LR) fits well.
• Data set 2 – linear regression does not fit, as the scatter plot shows a non-linear relationship.
• Data set 3 – shows outliers, which LR cannot handle.
• Data set 4 – shows outliers, which LR cannot handle.
• Conclusion – Anscombe’s data sets, though specifically created, lead us to conclude that it is necessary to visualize the data well before creating a linear regression or any other model.
LR – example 2 – Pete’s Pizza
• Data were collected from a sample of 10 Pete’s Pizza Parlor restaurants located near college campuses.
• Our objective is to identify if there is a relationship between student population and pizza sales in the outlets.
• The scatter diagram enables us to observe the data graphically and to draw preliminary conclusions about the possible relationship between the variables.
• What preliminary conclusions can be drawn from the scatter diagram?
Restaurant   Student population (1000s)   Quarterly sales ($1000s)
    4                  8                         118
    5                 12                         117
    6                 16                         137
    7                 20                         157
    8                 20                         169
Regression Statistics
Multiple R            0.950123
R Square              0.902734
Adjusted R Square     0.890575
Standard Error        13.82932
Observations          10
ANOVA
              df    SS       MS       F          Significance F
Regression     1    14200    14200    74.24837   2.55E-05
Residual       8     1530    191.25
Total          9    15730
• Is the LR model significant?
• Yes. This is concluded from the F value and Significance F.
• Are the coefficients of regression significant?
47
Using the Estimated Regression Equation
for Estimation and Prediction
• A confidence interval is an interval estimate of the mean value of y for a given
value of x.
• A prediction interval is used whenever we want to predict an individual value of
y for a new observation corresponding to a given value of x.
• The margin of error is larger for a prediction interval.
48
Using the Estimated Regression Equation
for Estimation and Prediction
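• Confidence Interval Estimate of E(y*)
$\hat{y}^* \pm t_{\alpha/2}\, s_{\hat{y}^*}$
• Prediction Interval Estimate of y*
$\hat{y}^* \pm t_{\alpha/2}\, s_{\text{pred}}$
where:
the confidence coefficient is $1 - \alpha$ and $t_{\alpha/2}$ is based on a t distribution with n - 2 degrees of freedom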
49
Point Estimation
If 3 TV ads are run prior to a sale, we expect the mean number of cars sold to be:
$\hat{y} = 10 + 5(3) = 25$ cars
50
Confidence Interval for E(y*)
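• Estimate of the Standard Deviation of $\hat{y}^*$
$s_{\hat{y}^*} = s\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
$s_{\hat{y}^*} = 2.16025\sqrt{\frac{1}{5} + \frac{(3 - 2)^2}{(1 - 2)^2 + (3 - 2)^2 + \cdots + (3 - 2)^2}}$
$s_{\hat{y}^*} = 2.16025\sqrt{\frac{1}{5} + \frac{1}{4}} = 1.4491$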
51
Confidence Interval for E(y*)
The 95% confidence interval estimate of the mean number of cars sold when 3 TV ads are run is:
$\hat{y}^* \pm t_{\alpha/2}\, s_{\hat{y}^*}$
25 ± 3.1824(1.4491)
25 ± 4.61, or 20.39 to 29.61 cars
52
Prediction Interval for y*
• Estimate of the Standard Deviation of an Individual Value of y*
$s_{\text{pred}} = s\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
$s_{\text{pred}} = 2.16025\sqrt{1 + \frac{1}{5} + \frac{1}{4}} = 2.6013$
53
Prediction Interval for y*
The 95% prediction interval estimate of the number of cars sold in one particular week when 3 TV ads are run is:
$\hat{y}^* \pm t_{\alpha/2}\, s_{\text{pred}}$
25 ± 3.1824(2.6013)
25 ± 8.28, or 16.72 to 33.28 cars
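Both intervals for x* = 3 can be computed side by side, which makes the extra "1 +" term in the prediction interval easy to see. This is a sketch; the critical value 3.1824 is taken from the slides and the variable names are illustrative.

```python
from math import sqrt

x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]
b0, b1 = 10, 5
x_star = 3                         # number of TV ads for the new estimate
t_crit = 3.1824                    # t_{.025} with n - 2 = 3 df, from the slides

n = len(x)
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = sqrt(sse / (n - 2))

y_star = b0 + b1 * x_star                       # point estimate
leverage = 1 / n + (x_star - x_bar) ** 2 / s_xx
s_mean = s * sqrt(leverage)                     # std. dev. of y-hat*
s_pred = s * sqrt(1 + leverage)                 # std. dev. of an individual y*

ci_margin = t_crit * s_mean
pi_margin = t_crit * s_pred
print(f"confidence interval: {y_star} +/- {ci_margin:.2f}")
print(f"prediction interval: {y_star} +/- {pi_margin:.2f}")
```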
54
Using Excel’s Regression Tool
• Up to this point, you have seen how Excel can be used for various parts of a
regression analysis.
• Excel also has a comprehensive tool in its Data Analysis package called
Regression.
• The Regression tool can be used to perform a complete regression analysis.
55
Using Excel’s Regression Tool
• Excel Output (top portion)
A B C
9
10 Regression Statistics
11 Multiple R 0.936585812
12 R Square 0.877192982
13 Adjusted R Square 0.83625731
14 Standard Error 2.160246899
15 Observations 5
16
56
Using Excel’s Regression Tool
• Excel Output (middle portion)
A B C D E F
16
17 ANOVA
18 df SS MS F Significance F
19 Regression 1 100 100 21.4286 0.018986231
20 Residual 3 14 4.66667
21 Total 4 114
22
57
Using Excel’s Regression Tool
• Excel Output (bottom-left portion)
A B C D E
22
23 Coeffic. Std. Err. t Stat P-value
24 Intercept 10 2.36643 4.2258 0.02424
25 TV Ads 5 1.08012 4.6291 0.01899
26
Note: Columns F-I are not shown.
58
Using Excel’s Regression Tool
• Excel Output (bottom-right portion)
A B F G H I
22
23 Coeffic. Low. 95% Up. 95% Low. 95.0% Up. 95.0%
24 Intercept 10 2.46895 17.53105 2.46895044 17.5310496
25 TV Ads 5 1.562562 8.437438 1.56256189 8.43743811
26
Note: Columns C-E are hidden.
59
Residual Analysis
• If the assumptions about the error term $\epsilon$ appear questionable, the hypothesis tests about the significance of the regression relationship and the interval estimation results may not be valid.
• The residuals provide the best information about $\epsilon$.
• Residual for observation i
$y_i - \hat{y}_i$
60
Residual Plot Against x
• If the assumption that the variance of $\epsilon$ is the same for all values of x is valid, and the assumed regression model is an adequate representation of the relationship between the variables, then the residual plot should give an overall impression of a horizontal band of points.
61
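Residual Plot Against x
[Figure: residual $y - \hat{y}$ plotted against x; Good Pattern]
62
Residual Plot Against x
[Figure: residual $y - \hat{y}$ plotted against x; Nonconstant Variance]
63
Residual Plot Against x
[Figure: residual $y - \hat{y}$ plotted against x; Model Form Not Adequate]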
64
Residual Plot Against x
• Residuals
Observation Predicted Cars Sold Residuals
1 15 -1
2 25 -1
3 20 -2
4 15 2
5 25 2
65
Residual Plot Against x
[Figure: TV Ads Residual Plot. Residuals (vertical axis, -3 to 3) plotted against TV Ads (horizontal axis, 0 to 4)]
66
Standardized Residuals
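• Standardized Residual for Observation i
$\frac{y_i - \hat{y}_i}{s_{y_i - \hat{y}_i}}$
where:
$s_{y_i - \hat{y}_i} = s\sqrt{1 - h_i}$
$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2}$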
67
Standardized Residual Plot
• The standardized residual plot can provide insight about the assumption that
the error term e has a normal distribution.
• If this assumption is satisfied, the distribution of the standardized residuals
should appear to come from a standard normal probability distribution.
68
Standardized Residual Plot
• Standardized Residuals
Standardized
Observation Predicted y Residual Residual
1 15 -1 -0.5345
2 25 -1 -0.5345
3 20 -2 -1.0690
4 15 2 1.0690
5 25 2 1.0690
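The values in this table can be reproduced in Python. One observation worth flagging: the tabulated figures appear to match Excel's "Standard Residuals" convention, which divides each residual by the sample standard deviation of the residuals, $\sqrt{\text{SSE}/(n - 1)}$, rather than by the leverage-based $s\sqrt{1 - h_i}$. The sketch below computes both so they can be compared; the variable names are illustrative.

```python
from math import sqrt

x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]
b0, b1 = 10, 5

n = len(x)
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(r ** 2 for r in residuals)

# Convention that matches the tabulated values: divide by sqrt(SSE / (n - 1))
excel_std = [r / sqrt(sse / (n - 1)) for r in residuals]

# Leverage-based convention: divide by s * sqrt(1 - h_i)
s = sqrt(sse / (n - 2))
h = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]
leverage_std = [r / (s * sqrt(1 - hi)) for r, hi in zip(residuals, h)]

for i, (r, e, l) in enumerate(zip(residuals, excel_std, leverage_std), start=1):
    print(f"obs {i}: residual = {r:+d}, table-style = {e:+.4f}, leverage-based = {l:+.4f}")
```

Either convention leaves every value well inside the range of -2 to +2 for these data.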
69
Standardized Residual Plot
• Standardized Residual Plot
[Figure: Standard Residuals (vertical axis, -1.5 to 1.5) plotted against Cars Sold (horizontal axis, 0 to 30)]
RESIDUAL OUTPUT
Observation   Predicted Y   Residuals   Standard Residuals
     1            15           -1          -0.534522
     2            25           -1          -0.534522
     3            20           -2          -1.069045
     4            15            2           1.069045
     5            25            2           1.069045
70
Standardized Residual Plot
• All of the standardized residuals are between –1.5 and +1.5, indicating that there is no reason to question the assumption that $\epsilon$ has a normal distribution.
71
Outliers and Influential Observations
• Detecting Outliers
• An outlier is an observation that is unusual in comparison with the other
data.
• Minitab classifies an observation as an outlier if its standardized residual
value is < -2 or > +2.
• This standardized residual rule sometimes fails to identify an unusually large
observation as being an outlier.
• This rule’s shortcoming can be circumvented by using studentized deleted
residuals.
• The |i th studentized deleted residual| will be larger than the |i th
standardized residual|.
72
Next topic
• Multiple Regression
Multiple Regression
Dr. Nilakantan Narasinganallur Ph.D.
Multiple Regression
• Multiple Regression Model
• Least Squares Method
• Multiple Coefficient of Determination
• Model Assumptions
• Testing for Significance
• Using the Estimated Regression Equation for Estimation and Prediction
• Categorical Independent Variables
• Residual Analysis
• Logistic Regression
2
Multiple Regression
• In this chapter we continue our study of regression analysis by considering
situations involving two or more independent variables.
• This subject area, called multiple regression analysis, enables us to consider
more factors and thus obtain better estimates than are possible with simple
linear regression.
3
Multiple Regression Model
• Multiple Regression Model
The equation that describes how the dependent variable y is related
to the independent variables x1, x2, . . . xp and an error term is:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$
where:
$\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ are the parameters, and
$\epsilon$ is a random variable called the error term
4
Multiple Regression Equation
• Multiple Regression Equation
The equation that describes how the mean value of y is related to x1,
x2, . . . xp is:
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$
5
Estimated Multiple Regression Equation
• Estimated Multiple Regression Equation
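A simple random sample is used to compute sample statistics $b_0, b_1, b_2, \ldots, b_p$ that are used as the point estimators of the parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$.
The estimated multiple regression equation is:
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$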
Estimation Process
• Multiple Regression Model: y = β0 + β1x1 + β2x2 + . . . + βpxp + ε
• Multiple Regression Equation: E(y) = β0 + β1x1 + β2x2 + . . . + βpxp
• The unknown parameters are β0, β1, β2, . . . , βp.
• Sample Data: observed values of x1, x2, . . . , xp and y
• Estimated Multiple Regression Equation: ŷ = b0 + b1x1 + b2x2 + . . . + bpxp
• The sample statistics are b0, b1, b2, . . . , bp.
• b0, b1, b2, . . . , bp provide estimates of β0, β1, β2, . . . , βp.
Least Squares Method
• Least Squares Criterion
$$\min \sum (y_i - \hat{y}_i)^2$$
Multiple Regression Model
• Example: Programmer Salary Survey
A software firm collected data for a sample of 20 computer programmers.
A suggestion was made that regression analysis could be used to determine if
salary was related to the years of experience and the score on the firm’s
programmer aptitude test.
The years of experience, score on the aptitude test, and corresponding
annual salary ($1000s) for a sample of 20 programmers are shown on the next
slide.
Multiple Regression Model
Exper.   Test    Salary      Exper.   Test    Salary
(Yrs.)   Score   ($1000s)    (Yrs.)   Score   ($1000s)
   4       78     24.0          9       88     38.0
   7      100     43.0          2       73     26.6
   1       86     23.7         10       75     36.2
   5       82     34.3          5       81     31.6
   8       86     35.8          6       74     29.0
  10       84     38.0          8       87     34.0
   0       75     22.2          4       79     30.1
   1       80     23.1          6       94     33.9
   6       83     30.0          3       70     28.2
   6       91     33.0          3       89     30.0
Multiple Regression Model
Suppose we believe that salary (y) is related to the years of experience
(x1) and the score on the programmer aptitude test (x2) by the following
regression model:
y = β0 + β1x1 + β2x2 + ε
where
y = annual salary ($1000s)
x1 = years of experience
x2 = score on programmer aptitude test
Solving for the Estimates of β0, β1, β2
• Input Data: the sample values of x1, x2, and y
x1    x2    y
 4    78    24
 7   100    43
 .     .     .
 .     .     .
 3    89    30
• Least Squares: Computer Package for Solving Multiple Regression Problems
• Output: values of b0, b1, b2, R², etc.
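As a rough stand-in for the "computer package" step, the sketch below fits the two-variable model to the 20 sampled programmers with NumPy's `numpy.linalg.lstsq`. The choice of NumPy and the variable names are assumptions made for illustration; any regression package should give the same estimates.

```python
import numpy as np

# Programmer salary survey: (years of experience, aptitude test score, salary in $1000s)
data = np.array([
    [4, 78, 24.0], [7, 100, 43.0], [1, 86, 23.7], [5, 82, 34.3], [8, 86, 35.8],
    [10, 84, 38.0], [0, 75, 22.2], [1, 80, 23.1], [6, 83, 30.0], [6, 91, 33.0],
    [9, 88, 38.0], [2, 73, 26.6], [10, 75, 36.2], [5, 81, 31.6], [6, 74, 29.0],
    [8, 87, 34.0], [4, 79, 30.1], [6, 94, 33.9], [3, 70, 28.2], [3, 89, 30.0],
])

# Design matrix: a column of ones for the intercept, then x1 and x2
X = np.column_stack([np.ones(len(data)), data[:, 0], data[:, 1]])
y = data[:, 2]

# Least squares: choose b to minimize sum((y - X @ b) ** 2)
b, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ b
sse = float(np.sum((y - y_hat) ** 2))     # sum of squares due to error
sst = float(np.sum((y - y.mean()) ** 2))  # total sum of squares
r2 = 1 - sse / sst

print("b0, b1, b2 =", np.round(b, 4))
print("R^2 =", round(r2, 4))

# Point estimate for a hypothetical programmer: 5 years of experience, test score 80
print("predicted salary ($1000s):", round(float(b @ [1, 5, 80]), 2))
```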
Solving for the Estimates of β0, β1, β2
• Regression Equation Output
Estimated Regression Equation
Interpreting the Coefficients
• In multiple regression analysis, we interpret each regression coefficient as
follows:
bi represents an estimate of the change in y corresponding to one
unit increase in xi when all other independent variables are held
constant.
Interpreting the Coefficients
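b1 = 1.404: salary is estimated to increase by $1,404 for each additional year of
experience, with the aptitude test score held constant.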
Interpreting the Coefficients
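b2 = 0.251: salary is estimated to increase by $251 for each additional point on the
aptitude test, with years of experience held constant.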
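The "held constant" reading can be checked with a short sketch. It uses the two slope coefficients quoted on the slides; the intercept is passed in as a placeholder because it cancels out of any difference between two predictions. The function name and the example values (5 years, score 80) are hypothetical.

```python
B1 = 1.404  # coefficient of years of experience (from the slides)
B2 = 0.251  # coefficient of aptitude test score (from the slides)

def predicted_salary(b0, x1, x2):
    """Estimated salary in $1000s for x1 years of experience and test score x2."""
    return b0 + B1 * x1 + B2 * x2

b0 = 0.0  # placeholder intercept: any value gives the same differences
base = predicted_salary(b0, x1=5, x2=80)
one_more_year = predicted_salary(b0, x1=6, x2=80) - base   # x2 held constant
one_more_point = predicted_salary(b0, x1=5, x2=81) - base  # x1 held constant

print(f"+1 year of experience: ${one_more_year * 1000:,.0f}")
print(f"+1 test-score point:   ${one_more_point * 1000:,.0f}")
```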
Multiple Coefficient of Determination
• Relationship Among SST, SSR, SSE
SST = SSR + SSE
$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Multiple Coefficient of Determination
• ANOVA Output
Analysis of Variance
SOURCE            DF   SS          MS        F       P
Regression         2   500.3285    250.164   42.76   0.000
Residual Error    17    99.45697     5.850
Total             19   599.7855
Multiple Coefficient of Determination
R² = SSR/SST
R² = 500.3285/599.7855 = .83418
Adjusted Multiple Coefficient of Determination
• Adding independent variables, even ones that are not statistically significant,
causes the prediction errors to become smaller, thus reducing the sum of
squares due to error, SSE.
• Because SSR = SST – SSE, when SSE becomes smaller, SSR becomes larger,
causing R² = SSR/SST to increase.
• The adjusted multiple coefficient of determination compensates for the
number of independent variables in the model.
Adjusted Multiple Coefficient of Determination
$$R_a^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$$
$$R_a^2 = 1 - (1 - .834179)\frac{20 - 1}{20 - 2 - 1} = .814671$$
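The two coefficients can be reproduced from the ANOVA sums of squares with a few lines of arithmetic. The function names below are illustrative; the inputs (SSR = 500.3285, SST = 599.7855, n = 20, p = 2) are the values reported on the slides.

```python
def r_squared(ssr, sst):
    """Multiple coefficient of determination: share of SST explained by the regression."""
    return ssr / sst

def adjusted_r_squared(r2, n, p):
    """Adjust R^2 for n observations and p independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

r2 = r_squared(ssr=500.3285, sst=599.7855)
r2_adj = adjusted_r_squared(r2, n=20, p=2)

print(f"R^2          = {r2:.6f}")
print(f"adjusted R^2 = {r2_adj:.6f}")
```

Because the factor (n - 1)/(n - p - 1) grows with p, adding a variable raises the adjusted value only if the gain in R² outweighs the lost degree of freedom.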
Assumptions About the Error Term ε
• The error ε is a random variable with mean of zero.
• The variance of ε, denoted by σ², is the same for all values of the
independent variables.
• The values of ε are independent.
• The error ε is a normally distributed random variable reflecting the deviation
between the y value and the expected value of y given by
β0 + β1x1 + β2x2 + . . . + βpxp.
Testing for Significance
• In simple linear regression, the F and t tests provide the same conclusion.
• In multiple regression, the F and t tests have different purposes.
Testing for Significance: F Test
• The F test is used to determine whether a significant relationship exists
between the dependent variable and the set of all the independent variables.
• The F test is referred to as the test for overall significance.
Testing for Significance: t Test
• If the F test shows overall significance, the t test is used to determine
whether each of the individual independent variables is significant.
• A separate t test is conducted for each of the independent variables in the
model.
• We refer to each of these t tests as a test for individual significance.
Testing for Significance: F Test
Hypotheses H0: β1 = β2 = . . . = βp = 0
Ha: One or more of the parameters is not equal to zero
F Test for Overall Significance
Hypotheses H0: β1 = β2 = 0
Ha: One or both of the parameters is not equal to zero.
F Test for Overall Significance
• ANOVA Output
Analysis of Variance
SOURCE            DF   SS          MS        F       P
Regression         2   500.3285    250.164   42.76   0.000
Residual Error    17    99.45697     5.850
Total             19   599.7855
F Test for Overall Significance
Test Statistic F = MSR/MSE
= 250.16/5.85 = 42.76
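The F statistic follows directly from the ANOVA sums of squares and degrees of freedom, as the sketch below shows. The critical value 3.59 is an assumption taken from a standard F table (upper 5% point with 2 and 17 degrees of freedom); the slides themselves report only the p-value of 0.000.

```python
def overall_f(ssr, sse, n, p):
    """Return (MSR, MSE, F) for a regression with n observations and p independent variables."""
    msr = ssr / p            # mean square due to regression
    mse = sse / (n - p - 1)  # mean square due to error
    return msr, mse, msr / mse

msr, mse, f_stat = overall_f(ssr=500.3285, sse=99.45697, n=20, p=2)

F_CRITICAL = 3.59  # assumed table value: F(.05) with (2, 17) degrees of freedom
reject_h0 = f_stat > F_CRITICAL

print(f"MSR = {msr:.3f}, MSE = {mse:.3f}, F = {f_stat:.2f}")
print("Reject H0: overall relationship is significant" if reject_h0 else "Do not reject H0")
```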
Testing for Significance: t Test
Hypotheses H0: βi = 0
Ha: βi ≠ 0
Test Statistic $$t = \frac{b_i}{s_{b_i}}$$
t Test for Significance of Individual Parameters
Hypotheses H0: βi = 0
Ha: βi ≠ 0
t Test for Significance of Individual Parameters
• Regression Equation Output
t Test for Significance of Individual Parameters
• Regression Equation Output
t Test for Significance of Individual Parameters
Test Statistics $$t = \frac{b_1}{s_{b_1}} = \frac{1.4039}{.1986} = 7.07$$
$$t = \frac{b_2}{s_{b_2}} = \frac{.25089}{.07735} = 3.24$$
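The individual t tests can be sketched as below, using the coefficient estimates and standard errors quoted on the slides. The critical value 2.110 is an assumption taken from a standard t table (two-tailed 5% level with n - p - 1 = 17 degrees of freedom); the dictionary keys are illustrative labels.

```python
T_CRITICAL = 2.110  # assumed table value: t(.025) with 17 degrees of freedom

# (estimate b_i, standard error s_bi) as reported on the slides
estimates = {
    "experience": (1.4039, 0.1986),
    "test score": (0.25089, 0.07735),
}

t_stats = {name: b / s_b for name, (b, s_b) in estimates.items()}
significant = {name: abs(t) > T_CRITICAL for name, t in t_stats.items()}

for name, t in t_stats.items():
    verdict = "reject H0" if significant[name] else "do not reject H0"
    print(f"{name:11s} t = {t:.2f}  ({verdict})")
```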
Testing for Significance: Multicollinearity
• The term multicollinearity refers to the correlation among the independent
variables.
• When the independent variables are highly correlated (say, |r |> .7), it is not
possible to determine the separate effect of any particular independent
variable on the dependent variable.
Testing for Significance: Multicollinearity
• If the estimated regression equation is to be used only for predictive
purposes, multicollinearity is usually not a serious problem.
• Every attempt should be made to avoid including independent variables that
are highly correlated.
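A quick screen for multicollinearity is the sample correlation between each pair of independent variables. The sketch below applies the |r| > .7 rule of thumb to the two independent variables of the programmer salary example; the helper name `pearson_r` is illustrative.

```python
import math

# Independent variables from the programmer salary survey
experience = [4, 7, 1, 5, 8, 10, 0, 1, 6, 6, 9, 2, 10, 5, 6, 8, 4, 6, 3, 3]
test_score = [78, 100, 86, 82, 86, 84, 75, 80, 83, 91, 88, 73, 75, 81, 74, 87, 79, 94, 70, 89]

def pearson_r(a, b):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

r = pearson_r(experience, test_score)
high_correlation = abs(r) > 0.7  # rule of thumb quoted above

print(f"r(experience, test score) = {r:.3f}")
print("possible multicollinearity" if high_correlation else "below the .7 rule of thumb")
```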
Using the Estimated Regression Equation
for Estimation and Prediction
• The procedures for estimating the mean value of y and predicting an
individual value of y in multiple regression are similar to those in simple
regression.
• We substitute the given values of x1, x2, . . . , xp into the estimated regression
equation and use the corresponding value of ŷ as the point estimate.
Using the Estimated Regression Equation
for Estimation and Prediction
• The formulas required to develop interval estimates for the mean value of y
and for an individual value of y are beyond the scope of the textbook.
• Software packages for multiple regression will often provide these interval
estimates.
Residual Analysis
• For simple linear regression the residual plot against ŷ and the residual plot
against x provide the same information.
• In multiple regression analysis it is preferable to use the residual plot against ŷ
to determine if the model assumptions are satisfied.
Standardized Residual Plot Against ŷ
• Standardized residuals are frequently used in residual plots for purposes of:
• Identifying outliers (typically, standardized residuals < -2 or > +2)
• Providing insight about the assumption that the error term ε has a normal
distribution
• The computation of the standardized residuals in multiple regression analysis is
too complex to be done by hand.
• Excel’s Regression tool can be used.
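For readers without Excel, the sketch below computes standardized residuals for the programmer salary model with NumPy, using the hat-matrix (leverage) formula. This is one common definition and is an assumption here; the values a particular package reports as "standard residuals" may be scaled differently.

```python
import numpy as np

# Programmer salary survey: (years of experience, aptitude test score, salary in $1000s)
data = np.array([
    [4, 78, 24.0], [7, 100, 43.0], [1, 86, 23.7], [5, 82, 34.3], [8, 86, 35.8],
    [10, 84, 38.0], [0, 75, 22.2], [1, 80, 23.1], [6, 83, 30.0], [6, 91, 33.0],
    [9, 88, 38.0], [2, 73, 26.6], [10, 75, 36.2], [5, 81, 31.6], [6, 74, 29.0],
    [8, 87, 34.0], [4, 79, 30.1], [6, 94, 33.9], [3, 70, 28.2], [3, 89, 30.0],
])
X = np.column_stack([np.ones(len(data)), data[:, 0], data[:, 1]])
y = data[:, 2]
n, k = X.shape  # k = p + 1 columns, including the intercept

# Hat matrix H maps observed y to fitted y; its diagonal holds the leverages
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)
y_hat = H @ y
residuals = y - y_hat

s = np.sqrt(np.sum(residuals ** 2) / (n - k))  # standard error of the estimate
standardized = residuals / (s * np.sqrt(1 - leverage))

# Observations outside the +/-2 band would be flagged as possible outliers
flagged = np.where(np.abs(standardized) > 2)[0]
for i in range(n):
    print(f"obs {i + 1:2d}  predicted = {y_hat[i]:6.2f}  standardized residual = {standardized[i]:6.2f}")
print("flagged observations (1-based):", (flagged + 1).tolist())
```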
Standardized Residual Plot Against ŷ
• Residual Output
Standardized Residual Plot Against ŷ
[Figure: Standardized Residual Plot. Vertical axis: Standard Residuals (-3 to 3); horizontal axis: Predicted Salary (0 to 50).]
Categorical Independent Variables
• In many situations we must work with categorical independent variables such
as gender (male, female), method of payment (cash, check, credit card), etc.
• For example, x2 might represent gender where x2 = 0 indicates male and x2 = 1
indicates female.
• In this case, x2 is called a dummy or indicator variable.
Categorical Independent Variables
• Example: Programmer Salary Survey
As an extension of the problem involving the computer programmer salary
survey, suppose that management also believes that the annual salary is
related to whether the individual has a graduate degree in computer science
or information systems.
The years of experience, the score on the programmer aptitude test,
whether the individual has a relevant graduate degree, and the annual
salary ($1000s) for each of the 20 sampled programmers are shown on
the next slide.
Categorical Independent Variables
Exper.   Test            Salary      Exper.   Test            Salary
(Yrs.)   Score   Degr.   ($1000)     (Yrs.)   Score   Degr.   ($1000)
   4       78     No      24.0          9       88     Yes     38.0
   7      100     Yes     43.0          2       73     No      26.6
   1       86     No      23.7         10       75     Yes     36.2
   5       82     Yes     34.3          5       81     No      31.6
   8       86     Yes     35.8          6       74     No      29.0
  10       84     Yes     38.0          8       87     Yes     34.0
   0       75     No      22.2          4       79     No      30.1
   1       80     No      23.1          6       94     Yes     33.9
   6       83     No      30.0          3       70     No      28.2
   6       91     Yes     33.0          3       89     No      30.0
Categorical Independent Variables
• Regression Equation Output
where:
ŷ = annual salary ($1000)
x1 = years of experience
x2 = score on programmer aptitude test
x3 = 0 if individual does not have a graduate degree,
     1 if individual does have a graduate degree
(x3 is a dummy variable)
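The dummy-variable coding can be sketched as follows: "Yes"/"No" is mapped to 1/0 and added as a third column before fitting. NumPy's `lstsq` and the helper name `fit` are assumptions made for illustration, not the software used to produce the slide output.

```python
import numpy as np

# (years of experience, test score, graduate degree, salary in $1000s)
rows = [
    (4, 78, "No", 24.0), (7, 100, "Yes", 43.0), (1, 86, "No", 23.7), (5, 82, "Yes", 34.3),
    (8, 86, "Yes", 35.8), (10, 84, "Yes", 38.0), (0, 75, "No", 22.2), (1, 80, "No", 23.1),
    (6, 83, "No", 30.0), (6, 91, "Yes", 33.0), (9, 88, "Yes", 38.0), (2, 73, "No", 26.6),
    (10, 75, "Yes", 36.2), (5, 81, "No", 31.6), (6, 74, "No", 29.0), (8, 87, "Yes", 34.0),
    (4, 79, "No", 30.1), (6, 94, "Yes", 33.9), (3, 70, "No", 28.2), (3, 89, "No", 30.0),
]

x1 = np.array([r[0] for r in rows], dtype=float)
x2 = np.array([r[1] for r in rows], dtype=float)
x3 = np.array([1.0 if r[2] == "Yes" else 0.0 for r in rows])  # dummy variable
y = np.array([r[3] for r in rows])

def fit(columns, y):
    """Least squares fit with an intercept; returns (coefficients, R^2)."""
    X = np.column_stack([np.ones(len(y))] + columns)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ b) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return b, float(1 - sse / sst)

b_without, r2_without = fit([x1, x2], y)
b_with, r2_with = fit([x1, x2, x3], y)

print("without degree: R^2 =", round(r2_without, 4))
print("with degree:    R^2 =", round(r2_with, 4), " b3 =", round(float(b_with[3]), 3))
```

The coefficient b3 estimates the salary difference (in $1000s) between programmers with and without a graduate degree when experience and test score are held constant.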
Categorical Independent Variables
• ANOVA Output
Analysis of Variance
SOURCE            DF   SS          MS        F       P
Regression         3   507.8960    169.299   29.48   0.000
Residual Error    16    91.8895      5.743
Total             19   599.7855
$$R_a^2 = 1 - (1 - .8468)\frac{20 - 1}{20 - 3 - 1} = .8181$$
Previously, adjusted R² = .815
Categorical Independent Variables
• Regression Equation Output
Not significant
More Complex Categorical Variables
• If a categorical variable has k levels, k - 1 dummy variables are required, with
each dummy variable being coded as 0 or 1.
• For example, a variable with levels A, B, and C could be represented by x1 and
x2 values of (0, 0) for A, (1, 0) for B, and (0, 1) for C.
• Care must be taken in defining and interpreting the dummy variables.
More Complex Categorical Variables
• For example, a variable indicating level of education could be represented by x1
and x2 values as follows:
Highest Degree   x1   x2
Bachelor’s        0    0
Master’s          1    0
Ph.D.             0    1
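The k - 1 coding rule can be written as a small helper. The function name `dummy_code` and the convention that the first listed level is the baseline (all zeros) are assumptions chosen to match the table above.

```python
def dummy_code(value, levels):
    """Code a categorical value as k - 1 dummy variables.

    The first entry of `levels` is the baseline and is coded as all zeros;
    every other level gets a 1 in its own position.
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return tuple(1 if value == level else 0 for level in levels[1:])

levels = ["Bachelor's", "Master's", "Ph.D."]
codes = {level: dummy_code(level, levels) for level in levels}

for level, (x1, x2) in codes.items():
    print(f"{level:11s} x1 = {x1}, x2 = {x2}")
```

With this coding, the coefficient of each dummy variable is interpreted relative to the baseline level, which is why care is needed when choosing and reporting the baseline.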
Modeling Curvilinear Relationships
• Example: Sales of Laboratory Scales
A manufacturer of laboratory scales wants to investigate the relationship
between the length of employment of their salespeople and the number of
scales sold.
The table on the next slide gives the number of months each salesperson
has been employed by the firm (x) and the number of scales sold (y) by 15
randomly selected salespersons.
Modeling Curvilinear Relationships
• Example: Sales of Laboratory Scales
Months         Scales       Months         Scales
Employed (x)   Sold (y)     Employed (x)   Sold (y)
    41           275            40           189
   106           296            51           235
    76           317             9            83
   104           376            12           112
    22           162             6            67
    12           150            56           325
    85           367            19           189
   111           308
Modeling Curvilinear Relationships
• Excel’s Chart tools can be used to develop a scatter diagram and fit a straight
line to bivariate data.
• The estimated regression equation and the coefficient of determination for
simple linear regression can also be developed.
• The results of using Excel’s Chart tools to fit a line to the data are shown on
the next slide.
Modeling Curvilinear Relationships
• Chart Tools Output
Modeling Curvilinear Relationships
• The scatter diagram indicates a possible curvilinear relationship between the
length of time employed and the number of scales sold.
• So, we develop a multiple regression model with two independent variables: x
and x².
y = β0 + β1x + β2x² + ε
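Outside Excel, the same second-order model can be fitted by treating x² as a second column of the design matrix. The sketch below uses NumPy and the 15 salesperson observations; the helper name `fit_r2` is illustrative, and the code compares the straight-line and quadratic fits rather than reproducing the Excel output.

```python
import numpy as np

# Laboratory scales example: months employed (x) and scales sold (y)
months = np.array([41, 106, 76, 104, 22, 12, 85, 111, 40, 51, 9, 12, 6, 56, 19], dtype=float)
scales = np.array([275, 296, 317, 376, 162, 150, 367, 308, 189, 235, 83, 112, 67, 325, 189], dtype=float)

def fit_r2(X, y):
    """Least squares coefficients and R^2 for design matrix X (including the intercept column)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ b) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return b, float(1 - sse / sst)

ones = np.ones_like(months)
b_lin, r2_lin = fit_r2(np.column_stack([ones, months]), scales)

# Second-order model: the squared term is simply another independent variable
month_sq = months ** 2
b_quad, r2_quad = fit_r2(np.column_stack([ones, months, month_sq]), scales)

print("linear:    b =", np.round(b_lin, 4), " R^2 =", round(r2_lin, 4))
print("quadratic: b =", np.round(b_quad, 4), " R^2 =", round(r2_quad, 4))
```

A negative coefficient on the squared term indicates that sales rise with length of employment at a decreasing rate, which matches the curvature seen in the scatter diagram.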
Modeling Curvilinear Relationships
• Excel’s Chart tools can be used to fit a polynomial curve to the data. (Dialog
box is on next slide.)
• To get the dialog box, position the mouse pointer over any data point in the
scatter diagram and right-click.
• The estimated multiple regression equation and multiple coefficient of
determination for this second-order model are also obtained.
Modeling Curvilinear Relationships
• Chart Tools
Dialog Box
Modeling Curvilinear Relationships
• Chart Tools Output
Modeling Curvilinear Relationships
• Excel’s Chart tools output does not provide any means for testing the
significance of the results, so we need to use Excel’s Regression tool.
• We will treat the values of x² as a second independent variable (called
MonthSq on the next slide).
Modeling Curvilinear Relationships
• Second Independent Variable (MonthSq) Added
Modeling Curvilinear Relationships
• Excel’s Regression Tool Output
Modeling Curvilinear Relationships
• Excel’s Regression Tool Output
Modeling Curvilinear Relationships
• Excel’s Regression Tool Output