SB (Final-Note)

This document discusses sampling distributions and estimation, emphasizing the importance of sample size in achieving accurate population estimates. It covers concepts such as sampling error, bias, and the central limit theorem, which helps approximate the sampling distribution of sample means. Additionally, it outlines the process of hypothesis testing, including the formulation of null and alternative hypotheses and the steps involved in testing them.



Chapter 8. Sampling distributions and estimation

Sampling and estimation

● Some samples represent the population well (i.e. are similar to the population)


● However, some samples differ greatly from the population (particularly if the sample size is small)
→ Sampling variation

→ With larger samples, x̄ (the sample mean) tends to be closer to µ (the population mean)

→ Statistical estimation
Example: P = {1, 2, 3, 4} → samples of size 2
S1 = {1, 2} → x̄₁ = 1.5 → a point estimate of µ
S2 = {1, 4} → x̄₂ = 2.5; computed similarly for {1, 3}, {2, 3}, {2, 4}, {3, 4}
→ X̄(S1) = 1.5, where X̄ is the sample mean
µ_X̄ = (1.5 + 2 + 2.5 + 2.5 + 3 + 3.5)/6 = 2.5 = µ

Practice. P(X̄ = 1.5) = P({1, 2}) = 1/C(4,2) = 1/6
⇒ C(4,2) counts the ways to choose 2 numbers from the 4 numbers above

X̄:  1.5   2    2.5   3    3.5
P:   1/6  1/6   1/3  1/6  1/6

Sampling error: S1 = {1, 2} → x̄₁ = 1.5, µ = 2.5 ⇒ x̄₁ − µ = −1

Get samples from the population. Make inferences about population from sample.
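The enumeration above can be verified with a short script using only the standard library:

```python
from itertools import combinations
from statistics import mean

population = [1, 2, 3, 4]
samples = list(combinations(population, 2))   # all C(4,2) = 6 samples of size 2
sample_means = [mean(s) for s in samples]

# The mean of the sampling distribution equals the population mean
print(mean(sample_means))                     # 2.5
# P(xbar = 1.5): only the sample {1, 2} gives 1.5, so 1 out of 6
print(sample_means.count(1.5), "of", len(samples))
```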

*NOTE:
Uncontrollable: sampling variation; population variation
Controllable: sample size; desired confidence in the estimate
Estimators
● Estimator is a statistic derived from a sample to infer the value of a population parameter

Population parameters (µ, σ, π) vs. sample estimators (x̄, s, p):

1. Mean: µ (population mean) - x̄ (sample mean)
2. Standard deviation: σ (population SD) - s (sample SD)
3. Proportion: π (population proportion) - p (sample proportion)

Sample mean: x̄ = (1/n)·Σᵢ₌₁ⁿ xᵢ
Sample SD: s = √( Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1) )
Sample proportion: p = x/n
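The three estimators can be sketched with Python's statistics module (the data values below are hypothetical, chosen only for illustration):

```python
from statistics import mean, stdev

x = [21.2, 20.8, 21.5, 20.9, 21.1]   # hypothetical measurements
xbar = mean(x)                        # sample mean, (1/n)·Σxi
s = stdev(x)                          # sample SD, divides by n − 1
p = 3 / 10                            # sample proportion: 3 successes in n = 10
print(xbar, round(s, 4), p)           # 21.1 0.2739 0.3
```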

Sampling error: the difference between an estimate and the corresponding population parameter
sampling error = x̄ − µ
Bias: the difference between the expected value of the estimator and the true parameter
Bias = E(X̄) − µ

Central limit theorem


If n is large enough: X̄ ∼ N(µ_X̄, σ_X̄), where µ_X̄ = µ and σ_X̄ = σ/√n
● Sampling distribution: the probability distribution of all possible values the statistic may assume when a random sample of size n is taken
*NOTE: the sample mean X̄ is used to estimate the population mean µ
Explanation: because the population is too large, we cannot compute the population mean directly, so we take a sample and use the sample mean to generalize and make inferences about the population mean. The sample mean stands in for the population mean; the same logic applies to the SD and the proportion.
E(X̄) = µ (expected value of the mean)
σ_X̄ = σ/√n (standard error of the mean)

● Central limit theorem: allows us to approximate the shape of the sampling distribution of X̄
Range of sample means
● Expected range of sample means: µ ± z·σ/√n
● We use the familiar z-values for the standard normal distribution. If we know µ and σ, the CLT allows us to predict the range of sample means for samples of size n:

90% interval: µ ± 1.645·σ/√n
95% interval: µ ± 1.960·σ/√n
99% interval: µ ± 2.576·σ/√n
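For example, with hypothetical values µ = 100, σ = 15 and n = 25, the three intervals work out as:

```python
import math

mu, sigma, n = 100.0, 15.0, 25       # hypothetical population values
se = sigma / math.sqrt(n)            # standard error = 15/5 = 3
for z, level in [(1.645, "90%"), (1.960, "95%"), (2.576, "99%")]:
    print(f"{level}: {mu - z * se:.2f} to {mu + z * se:.2f}")
# 95% line: 94.12 to 105.88
```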

Sample size and standard error


The standard error decreases as n increases
Example. When n = 4, the standard error is halved. To halve it again requires n = 16, and to halve it again
requires n = 64. To halve the standard error, you must quadruple the sample size
Sample size    Standard error
n = 4          σ_x̄ = σ/√4 = σ/2
n = 16         σ_x̄ = σ/√16 = σ/4
n = 64         σ_x̄ = σ/√64 = σ/8
Exercise. Consider a discrete uniform distribution consisting of the integers {0, 1, 2, 3}. The population parameters are µ = 1.5 and σ = 1.118

µ = (1/N)·Σᵢ₌₁ᴺ xᵢ = (0 + 1 + 2 + 3)/4 = 1.5
σ = √( Σᵢ₌₁ᴺ (xᵢ − µ)²/N ) = √( ((0 − 1.5)² + (1 − 1.5)² + (2 − 1.5)² + (3 − 1.5)²)/4 ) = 1.118

With n = 2, draw any 2 numbers (with replacement) from {0, 1, 2, 3} → 16 possible samples: (0, 0); (0, 1); (0, 2); ...
Each sample mean: x̄ = (x₁ + x₂)/2, e.g. (0 + 0)/2 = 0 for the sample (0, 0)
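The 16 samples can be enumerated to check both population parameters and the mean of the sampling distribution:

```python
import math
from itertools import product
from statistics import mean

pop = [0, 1, 2, 3]
mu = mean(pop)                                        # 1.5
sigma = math.sqrt(sum((x - mu) ** 2 for x in pop) / len(pop))
samples = list(product(pop, repeat=2))                # 16 samples with replacement
xbars = [mean(s) for s in samples]
print(round(sigma, 3), mean(xbars))                   # 1.118 1.5
```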

Confidence interval for a mean (µ) with known σ


Confidence interval
● Point estimate: a sample mean x̄ calculated from a random sample x₁, x₂, ..., xₙ
● Confidence interval: x̄ ± z_{α/2}·σ/√n (where σ/√n is the standard error of the mean)
*NOTE: if samples are drawn from a normal population and σ is known, the margin of error is calculated using the standard normal distribution

The value of z_{α/2} depends on the level of confidence desired
α = area in the tails; (1 − α) = confidence level for a mean, σ known

As can be seen from the information above: σ = 1.25 ↔ σ is known

Sample: x̄ = 21.0, n = 10
Q: 95% confidence interval for µ ⇒ x̄ ± z_{α/2}·σ/√n
1 − α = 95% → α = 5% ⇒ α/2 = 0.025
z_{α/2} = qnorm(0.025) = −1.96
The margin of error E = |z_{α/2}|·σ/√n = 1.96 × 1.25/√10 = 0.775
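The same computation in code, using the values from the example above:

```python
import math

xbar, sigma, n = 21.0, 1.25, 10
z = 1.960                                    # z_{alpha/2} for 95% confidence
E = z * sigma / math.sqrt(n)                 # margin of error
print(round(E, 3))                           # 0.775
print(f"{xbar - E:.3f} to {xbar + E:.3f}")   # 20.225 to 21.775
```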
● What if σ is unknown?

Example. If the chosen confidence level is 90% → 1 − α = 0.9 ⇒ α = 0.1
→ we would use z_{α/2} = z_{0.1/2} = z_{0.05} = 1.645

Choosing a confidence level


● Confidence is not free - there is a trade-off that must be made.
● A higher confidence level leads to a wider confidence interval
● In order to gain confidence, we must accept a wider range of possible values for µ. Greater confidence implies a loss of precision (i.e. a greater margin of error)
● A 95% confidence level is often used b/c it's a reasonable compromise btw confidence & precision
Interpretation
P(X̄ − z_{α/2}·σ/√n < µ < X̄ + z_{α/2}·σ/√n) = 1 − α

→ This is a statement about the random variable X̄

Confidence interval for a mean (µ) with unknown σ


Student’s t distribution
● This should be used instead of normal z distribution when the population is normal but its SD is unknown
● This is particularly important when the sample size is small
x̄ ± t_{α/2}·s/√n (where s/√n is the estimated standard error of the mean)
Interpretation
The interpretation of the confidence interval is the same as when σ is known; however, the confidence intervals will be wider b/c t_{α/2} is always greater than z_{α/2}

Degrees of freedom
d.f. = n − 1 (degrees of freedom for a confidence interval for µ)
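A minimal sketch of a t-based interval; the data are hypothetical, and the critical value t_{0.025} with d.f. = 9 is taken from a t table:

```python
import math
from statistics import mean, stdev

x = [20.3, 21.1, 20.8, 21.4, 20.6, 21.0, 20.9, 21.2, 20.7, 21.0]  # hypothetical, n = 10
t = 2.262                          # t_{0.025}, d.f. = 9, from a t table
xbar, s, n = mean(x), stdev(x), len(x)
E = t * s / math.sqrt(n)           # estimated margin of error
print(f"{xbar - E:.3f} to {xbar + E:.3f}")   # 20.674 to 21.126
```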

Confidence interval for a proportion (π)


For a proportion, the CLT says that the distribution of a sample proportion p = x/n tends toward normality as n increases

● As n increases, the range of the sample proportion p = x/n narrows b/c n appears in the denominator of the standard error:
σ_p = √( π(1 − π)/n )
→ therefore, the sampling variation can be reduced by increasing the sample size

*NOTE: the sample proportion p = x/n may be assumed normal if both nπ ≥ 10 and n(1 − π) ≥ 10

Confidence interval for π: p ± z_{α/2}·√( p(1 − p)/n )
The width of the confidence interval for π depends on:
● Sample size
● Confidence level
● Sample proportion p
● If we want a narrower interval, we could either increase the sample size or reduce the confidence
level (e.g. from 95% to 90%)
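A sketch of the proportion interval with made-up counts (40 successes out of 200 trials):

```python
import math

x, n = 40, 200                             # hypothetical: 40 successes in 200 trials
p = x / n
assert n * p >= 10 and n * (1 - p) >= 10   # normality rule of thumb holds
z = 1.960
E = z * math.sqrt(p * (1 - p) / n)
print(f"{p - E:.4f} to {p + E:.4f}")       # 0.1446 to 0.2554
```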

Chapter 9. One-sample hypothesis tests

Logic of hypothesis testing


● The process of hypothesis testing can be an iterative process (one that repeats over and over)

● All business managers need at least a basic understanding of hypothesis testing b/c managers often
interact with specialists, read technical reports,...
Steps

1 State the hypothesis to be tested


One statement or the other must be true, but they cannot both be true
● H₀: null hypothesis
● H₁: alternative hypothesis
E.g. Criminal trial: the hypotheses are:
1. H₀: the defendant is innocent
2. H₁: the defendant is guilty
→ a defendant is innocent unless the evidence gathered by the prosecutor is sufficient to
reject this assumption

2 Specify the decision rule

3 Collect data & calculate necessary statistics to test the hypothesis


4 Make a decision. Should the hypothesis be rejected or not?

5 Take action based on the decision

Type I and type II error


We have 2 possible choices concerning the null hypothesis: we either reject H₀ or fail to reject H₀

★ Rejecting the null hypothesis when it is true → Type I error (false positive)
★ Failure to reject the null hypothesis when it is false → Type II error (false negative)

Probability of Type I and Type II errors

Relationship between α and β


The proper balance between α and β can be elusive
E.g. a doctor who is conservative about admitting patients with symptoms of heart attack (reduced β) will admit
more patients with no heart attack (increased α)
*NOTE: both α and β can be reduced simultaneously only by increasing the sample size, which is not always
feasible and cost-effective

Decision rules and critical values

Statistical hypothesis: a statement about the value of a population parameter


Hypothesis test: a decision btw 2 competing, mutually exclusive, and collectively exhaustive hypotheses about
the value of parameter
*NOTE:
● For a mean or proportion, the value of µ₀ (or π₀) is a benchmark based on past experience, an industry standard, a target or a product specification
● µ₀ (or π₀) does not come from a sample

One-tailed and two-tailed tests


There are 3 possible alternative hypotheses:

Left-tailed test:  H₀: µ ≥ µ₀  vs.  H₁: µ < µ₀
Two-tailed test:   H₀: µ = µ₀  vs.  H₁: µ ≠ µ₀
Right-tailed test: H₀: µ ≤ µ₀  vs.  H₁: µ > µ₀

Decision rule
● Extreme outcomes occurring in the left tail would cause us to reject the null hypothesis in a left-tailed test, and likewise for the right tail in a right-tailed test
● Rejection region: the area under the sampling distribution curve that defines an extreme outcome
● Test statistic: measures the difference between the sample statistic and the hypothesised parameter

Testing a mean: known population variance


Test statistic
● Test statistic measures the difference between a given sample mean x̄ and a benchmark µ₀ in terms of the standard error of the mean
● Test statistic: the "standardised score" of the sample statistic
● z_calc: refers to the calculated value of the test statistic

Steps

1 State the hypotheses


The question indicates a right-tailed test; the hypotheses would be:
● H₀: µ ≤ 216 mm (product mean does not exceed the specification)
● H₁: µ > 216 mm (product mean has risen above the specification)
● µ₀ = 216 mm (product specification)

2 Specify the decision rule


Reject H₀ if z_calc > 1.645; otherwise do not reject H₀

3 Collect sample data and calculate the test statistic


If H₀ is true, the test statistic should be near 0. The value of the test statistic:
z_calc = (x̄ − µ₀) / (σ/√n)

4 Make the decision

5 Take action

P-value method
For a right-tailed test, the decision rule using the p-value approach is stated as:
reject H₀ if P(Z > z_calc) < α, otherwise fail to reject H₀
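A sketch of the p-value rule for a right-tailed z test; the sample numbers (x̄ = 216.9, σ = 2.0, n = 25) are hypothetical values for the 216 mm example, and the normal CDF is built from math.erf:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

xbar, mu0, sigma, n = 216.9, 216.0, 2.0, 25   # hypothetical sample values
z_calc = (xbar - mu0) / (sigma / math.sqrt(n))
p_value = 1 - phi(z_calc)                     # right-tail area
print(round(z_calc, 2), round(p_value, 4))    # 2.25 and a p-value below 0.05
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```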

Two-tailed test

Steps

1 State the hypotheses


For a two-tailed test, the hypotheses are:
● H₀: µ = 216 mm (product mean is what it is supposed to be)
● H₁: µ ≠ 216 mm (product mean is not what it is supposed to be)

2 Specify the decision rule


Reject H₀ if z_calc > +1.96 or if z_calc < −1.96; otherwise do not reject H₀

3 Calculate the test statistic


The test statistic is unaffected by the hypotheses or the level of significance
z_calc = (x̄ − µ₀) / (σ/√n)

4 Make the decision

5 Take action
P-value approach
In a two-tailed test, the decision rule using the p-value method is the same as for a one-tailed test:
reject H₀ if p-value < α; otherwise, do not reject H₀

Testing a mean: unknown population variance


Using student’s t

Testing a proportion
Our rule is to assume normality if nπ₀ ≥ 10 and n(1 − π₀) ≥ 10
If we can assume a normal sampling distribution, then the test statistic is the z-score. The sample proportion is:
p = x/n = (number of successes)/(sample size)
Test statistic for a proportion:
z_calc = (p − π₀)/σ_p = (p − π₀)/√( π₀(1 − π₀)/n )

*NOTE: the value of π₀ we are testing is a benchmark
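A minimal sketch of the proportion test with hypothetical counts (18 successes in n = 100 against a benchmark π₀ = 0.12):

```python
import math

x, n, pi0 = 18, 100, 0.12                        # hypothetical data and benchmark
assert n * pi0 >= 10 and n * (1 - pi0) >= 10     # normality rule of thumb
p = x / n
z_calc = (p - pi0) / math.sqrt(pi0 * (1 - pi0) / n)
print(round(z_calc, 3))                          # 1.846
```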

Chapter 10. Two-sample hypothesis tests

Two-sample tests

● Two-sample tests: compare 2 sample estimates with each other, whereas one-sample tests compare a sample estimate with a non-sample benchmark or target

E.g. manufacturer A’s sample mean was 510.5 with a SD of 147.2 in 18 tests, compare with manufacturer B’s
mean of 628.9 with a SD of 237.9 in 17 tests

Basics of two-sample tests


You can think of many situations where 2 groups are to be compared:
➢ Before versus after
➢ Old versus new
➢ Experimental versus control
*The logic of two-sample tests is based on the fact that 2 samples drawn from the same population may
yield different estimates of a parameter due to chance

Test procedure
Larger samples are always desirable because they permit us to reduce the chance of making either a Type I or a Type II error; however, large samples take time and cost money, so we often must work with the available data.

Comparing 2 means: independent samples


Format of hypotheses - find the distance between µ₁ and µ₂
*Find the distance between the two means, then compare it to decide which of the cases below it belongs to.

D₀ = 0 is the case we need to focus on

Test statistic
The sample statistic used to test the parameter µ₁ − µ₂ is X̄₁ − X̄₂, where both X̄₁ and X̄₂ are calculated from independent random samples taken from normal populations
The formula for the test statistic is determined by the sampling distribution of the sample statistic. There are 3 cases to consider:
1. Case 1: 2 known variances
z_calc = ( (x̄₁ − x̄₂) − (µ₁ − µ₂) ) / √( σ₁²/n₁ + σ₂²/n₂ )

2. Case 2: unknown variances, assumed equal

3. Case 3: unknown variances, assumed unequal
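As an illustration of Case 1, here are the manufacturer figures from the example above, treating the reported SDs as known σ's (an assumption made only for this sketch) and testing D₀ = 0:

```python
import math

x1, s1, n1 = 510.5, 147.2, 18        # manufacturer A
x2, s2, n2 = 628.9, 237.9, 17        # manufacturer B
# Case 1 formula with D0 = 0, treating s1 and s2 as known sigmas
z_calc = (x1 - x2) / math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
print(round(z_calc, 3))              # about -1.759
```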



For the common situation of testing for a zero difference (D₀ = 0) in 2 population means, the possible pairs of null and alternative hypotheses are:

Large samples
z_calc = (x̄₁ − x̄₂) / √( s₁²/n₁ + s₂²/n₂ )

Confidence interval for the difference of 2 means, µ₁ − µ₂

If the confidence interval for the difference of 2 means includes zero, we could conclude that there is no significant difference in means
● Equal variances:

● Unequal variances:

Comparing 2 means: paired samples


Paired data
If the same individuals are observed twice but under different circumstances, we have a paired comparison.
● Paired data typically come from a before-after experiment, but not always

Paired t test
In the paired t test we define a new variable d = X₁ − X₂ as the difference between X₁ and X₂
d̄ = (Σᵢ₌₁ⁿ dᵢ)/n (mean of the n differences)
s_d = √( Σᵢ₌₁ⁿ (dᵢ − d̄)²/(n − 1) ) (std. dev. of the n differences)
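The two paired-sample statistics can be sketched with hypothetical before/after data; the resulting t_calc would be compared with t_{α/2} at d.f. = n − 1:

```python
import math
from statistics import mean, stdev

before = [210, 205, 193, 188, 200]           # hypothetical paired observations
after_ = [200, 198, 191, 183, 197]
d = [b - a for b, a in zip(before, after_)]  # differences d = X1 − X2
dbar, sd, n = mean(d), stdev(d), len(d)      # stdev divides by n − 1
t_calc = dbar / (sd / math.sqrt(n))
print(dbar, round(sd, 4), round(t_calc, 3))  # 5.4 3.2094 3.762
```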

Analogy to confidence interval

A two-tailed test for a zero difference is equivalent to asking whether the confidence interval for the true mean difference µ_d includes zero
d̄ ± t_{α/2}·s_d/√n (confidence interval for the difference of paired means)

Comparing two proportions

Testing for a zero difference: π₁ − π₂ = 0
3 possible pairs of hypotheses are:

Sample proportions

Pooled proportion
If H₀ is true, there is no difference between π₁ and π₂.
p_c = (x₁ + x₂)/(n₁ + n₂) = (# successes in combined samples)/(combined sample sizes) (pooled proportion)
Test statistic
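A sketch of the pooled two-proportion z test with made-up counts (the standard z_calc = (p₁ − p₂)/√( p_c(1 − p_c)(1/n₁ + 1/n₂) ) form):

```python
import math

x1, n1 = 60, 200                     # hypothetical counts, sample 1
x2, n2 = 45, 180                     # hypothetical counts, sample 2
p1, p2 = x1 / n1, x2 / n2
pc = (x1 + x2) / (n1 + n2)           # pooled proportion
se = math.sqrt(pc * (1 - pc) * (1 / n1 + 1 / n2))
z_calc = (p1 - p2) / se
print(round(pc, 4), round(z_calc, 3))
```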

Confidence interval for the difference of 2 proportions, π₁ − π₂

A confidence interval for the difference of 2 population proportions, π₁ − π₂, is given by:
(p₁ − p₂) ± z_{α/2}·√( p₁(1 − p₁)/n₁ + p₂(1 − p₂)/n₂ )

The rule of thumb for assuming normality is that np ≥ 10 and n(1 − p) ≥ 10 for each sample
Comparing 2 variances
Format of hypotheses

An equivalent way to state these hypotheses is to look at the ratio of the 2 variances. A ratio near 1 would indicate equal variances.
The F test

If the null hypothesis of equal variances is true, this ratio should be near 1:
F_calc ≈ 1 (if H₀ is true)
If the test statistic F is much less than 1 or much greater than 1, we would reject the hypothesis of equal population variances.
● The numerator s₁² has degrees of freedom df₁ = n₁ − 1, while the denominator s₂² has degrees of freedom df₂ = n₂ − 1
● F can't be negative, since s₁² and s₂² can't be negative
Two-tailed F test
The critical values for the F test are denoted F_L (left tail) and F_R (right tail)
Notice that the rejection regions are asymmetric

A right-tail critical value F_R may be found from Appendix F using df₁ and df₂ degrees of freedom:
F_R = F_{df₁, df₂} (right-tail critical F)
F_L = 1/F_{df₂, df₁} (left-tail critical F, with df₁ and df₂ reversed)

Steps

1 State the hypotheses


For a two-tailed test for equality of variances, the hypotheses are:

2 Specify the decision rule


❖ Numerator: df₁ = n₁ − 1
❖ Denominator: df₂ = n₂ − 1

3 Calculate the test statistic


F_calc = s₁²/s₂²

4 Make the decision

Folded F test
The test statistic for the folded F test is:
F_calc = s²_max / s²_min; reject H₀ if F_calc > F_{α/2}
*NOTE: the larger variance goes in the numerator and the smaller variance in the denominator
➢ 'Larger' refers to the variance
➢ But the hypotheses are the same as for the two-tailed test:
H₀: σ₁²/σ₂² = 1
H₁: σ₁²/σ₂² ≠ 1
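A minimal sketch of the folded F statistic on two hypothetical samples; putting the larger variance on top guarantees F_calc ≥ 1:

```python
from statistics import variance

a = [5.2, 4.8, 6.1, 5.5, 4.9, 5.7]      # hypothetical sample 1
b = [4.1, 6.8, 3.9, 7.2, 5.0, 6.3]      # hypothetical sample 2 (more spread out)
s2a, s2b = variance(a), variance(b)     # sample variances (n − 1 denominator)
F_calc = max(s2a, s2b) / min(s2a, s2b)  # larger variance in the numerator
print(round(F_calc, 3))
```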
One-tailed F test
Suppose that the firm was interested in knowing whether the new bumper had reduced the variance in collision
damage cost. We would then perform a left-tailed test.

Steps

1 State the hypotheses

2 Specify the decision rule


Degrees of freedom for the F test are the same as for a two-tailed test:
❖ Numerator: df₁ = n₁ − 1
❖ Denominator: df₂ = n₂ − 1

3 Calculate the test statistic


The test statistic is the same as for a two-tailed test
F_calc = s₁²/s₂²

4 Make the decision

Chapter 11. Analysis of variance (ANOVA)

In this chapter, you will learn how to compare more than 2 means simultaneously and how to trace sources of variation to potential explanatory factors by using analysis of variance (ANOVA)

Variation in the response variable about its mean either is explained by one or more categorical independent
variables (the factors) or is unexplained (random error)
𝑣𝑎𝑟𝑖𝑎𝑡𝑖𝑜𝑛 𝑖𝑛 𝑌 = 𝑒𝑥𝑝𝑙𝑎𝑖𝑛𝑒𝑑 𝑣𝑎𝑟𝑖𝑎𝑡𝑖𝑜𝑛 + 𝑢𝑛𝑒𝑥𝑝𝑙𝑎𝑖𝑛𝑒𝑑 𝑣𝑎𝑟𝑖𝑎𝑡𝑖𝑜𝑛
● Each possible value of a factor or combination of factors is a treatment

A simple way to state the one-factor ANOVA hypothesis:


● H₀: µ₁ = µ₂ = µ₃ = µ₄
● H₁: not all means are equal (at least one differs from the others)

One-factor ANOVA (completely randomised model)

If we are only interested in comparing the means of C groups, we have a one-factor ANOVA
If subjects (individuals) are assigned randomly to treatments, we call this completely randomised model
(most common ANOVA model)
The total number of observations:
n = n₁ + n₂ + n₃ + ... + n_c

One-factor ANOVA as a linear model


This is an equivalent way to express the one-factor ANOVA model

If we are interested only in what happens to the response for the particular levels of the factor that were selected (fixed-effects model), the hypotheses to be tested are:
H₀: T₁ = T₂ = ... = T_C = 0
H₁: not all T_j are zero

If the null hypothesis is:

True → the fact that observation x came from treatment j does not explain the variation in Y; the ANOVA model collapses to:
y_ij = µ + ε_ij

False → at least some of the T_j must be nonzero (a negative T_j lies below µ)

Group means
ȳ_j = (1/n_j)·Σᵢ₌₁^{n_j} y_ij (mean of each group)
ȳ = (1/n)·Σⱼ₌₁^c Σᵢ₌₁^{n_j} y_ij = (1/n)·Σⱼ₌₁^c n_j·ȳ_j (overall sample mean)

Partitioned sum of squares

(y_ij − ȳ) = (ȳ_j − ȳ) + (y_ij − ȳ_j)
Σⱼ₌₁^c Σᵢ₌₁^{n_j} (y_ij − ȳ)² = Σⱼ₌₁^c n_j·(ȳ_j − ȳ)² + Σⱼ₌₁^c Σᵢ₌₁^{n_j} (y_ij − ȳ_j)²
This important relationship may be simply expressed as: SST = SSB + SSE

The sums SSB and SSE may be used to test the hypothesis that the treatment means differ from the grand mean.
● The F test statistic is the ratio of the resulting mean squares.

Test statistic

The test statistic F = MSB/MSE cannot be negative ⇒ the F test for equal treatment means is always a right-tailed test
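The partition and the F ratio can be computed by hand on hypothetical data (three treatment groups of three observations each):

```python
from statistics import mean

groups = [[18, 20, 19], [22, 24, 23], [30, 28, 29]]   # hypothetical treatments
n = sum(len(g) for g in groups)
c = len(groups)
grand = mean([y for g in groups for y in g])
SSB = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)   # between treatments
SSE = sum((y - mean(g)) ** 2 for g in groups for y in g)     # within treatments
MSB, MSE = SSB / (c - 1), SSE / (n - c)
F_calc = MSB / MSE                                           # right-tailed test
print(round(SSB, 1), round(SSE, 1), round(F_calc, 1))        # 152.0 6.0 76.0
```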

Decision rule

Steps

1 State the hypotheses


➢ 𝐻0: µ1 = µ2 = µ3 = µ4
➢ 𝐻1: 𝑛𝑜𝑡 𝑎𝑙𝑙 𝑡ℎ𝑒 𝑚𝑒𝑎𝑛𝑠 𝑎𝑟𝑒 𝑒𝑞𝑢𝑎𝑙 (𝑎𝑡 𝑙𝑒𝑎𝑠𝑡 1 𝑚𝑒𝑎𝑛 𝑖𝑠 𝑑𝑖𝑓𝑓𝑒𝑟𝑒𝑛𝑡)

2 State the decision rule


Degrees of freedom for the F test are:
➢ Numerator: df₁ = c − 1
➢ Denominator: df₂ = n − c

3 Perform the calculations


4 Make the decision

5 Take action

Multiple comparisons
To maintain the desired overall probability of Type I error, we need to create a simultaneous confidence interval for the difference of means based on the pooled variances for all c groups
For all c groups, there are c(c − 1)/2 distinct pairs of means to be compared

Tukey’s studentized range test - HSD


● It has good power and is widely used
● This test is available in most statistical packages
● Two-tailed test for equality of paired means from c groups

The hypotheses to compare group j with group k are:
H₀: µ_j = µ_k
H₁: µ_j ≠ µ_k

Tukey's test statistic is: T_calc = |ȳ_j − ȳ_k| / √( MSE·(1/n_j + 1/n_k) )

We would reject H₀ if T_calc > T_{c,n−c}, where T_{c,n−c} is a critical value for the desired level of significance
The decision rule for any pair of means (here with the critical value 2.86 from the example) is:
Reject H₀ if T_calc = |ȳ_j − ȳ_k| / √( MSE·(1/n_j + 1/n_k) ) > 2.86

Tests for homogeneity of variances

Hartley's test
The hypotheses are:
H₀: σ₁² = σ₂² = ... = σ_c²
H₁: the σ_j² are not all equal
Hartley's test statistic is the ratio of the largest sample variance to the smallest sample variance:
H_calc = s²_max / s²_min

Two-factor ANOVA without replication (randomised block model)

In this two-factor ANOVA without replication (or non-repeated measures design), each factor combination is observed exactly once

Two-factor ANOVA model


𝑦𝑗𝑘 = µ + 𝐴𝑗 + 𝐵𝑘 + ε𝑗𝑘

The random error is assumed to be normally distributed with zero mean and the same variance for all treatments

ANOVA table
Total SS = between SS + within SS
Between df = SS between / MS between = 12471.6 / 2078.6 = 6

Chapter 12. Simple regression
