Introduction to Survey Methodology and Sampling Techniques
Eurostat
Textbooks
Training learning outcomes
• In this course, we’ll cover the basic methods of sampling
and estimation and then explore selected topics and
recent developments including:
• simple random sampling with associated estimation and
confidence interval methods,
• computing sample sizes,
• estimating proportions,
• unequal probability sampling,
• ratio and regression estimation,
• stratified sampling,
• cluster and systematic sampling,
• multistage designs.
• One important point to consider as we move forward is
that the estimation procedure will depend on the sample
design.
• Being able to identify what to use under different
sampling designs is one of the things that you will learn in
this course.
Day 1
1. Introduction
1.1 An overview of sampling
1.2 Estimating population mean and total under simple
random sampling
1.3 Confidence intervals and the central limit theorem
1.4 Domain estimation
2. Confidence intervals and sample size
2.1 Selecting sample size for estimating population mean and
total
Day 2
Day 3
6. Stratified sampling
6.2 [...]
6.3 Post-stratification
6.4 Further topics on stratification
7. Cluster sampling and systematic sampling
7.1 Introduction
7.2 Estimators for cluster sampling when primary units are
selected by simple random sampling
7.3 Estimators for cluster sampling when primary units are
selected by pps
Unit learning outcomes
Subsection 1
An overview of sampling
Why do we take samples?
• In this case, you have a certain goal in mind.
• What steps can we take to understand the population
better?
• What we can do is to take a sample!
• And the major objective in statistics that now arises is
inference.
• One important objective of statistics is to make inferences
about a population from the information contained in a
sample.
• We should always keep in mind that we perform sampling
because we want to make this inference.
• Because of this inference we begin to talk about things
like confidence intervals and hypothesis testing.
• A good picture to represent this situation follows.
Sampling
Population and sample
Examples of sampling
• Geologic: we might want to estimate the total pyrite
content of the rocks at a specific construction site.
• Marketing research: we might want to estimate the
total market size for electrical cars.
• Engineering: we might want to estimate the failure rate
of a certain electronic component.
• To deal with all of these problems one thing we have to
decide is:
How are we going to select a sample?
• There are many ways to take a sample.
Sampling design
Target population and sampling frame
Probabilistic sampling
Non-probabilistic sampling
• How can you ensure that the sample that you have
selected is indeed representative?
• If you are subjective when it comes to the individuals
sampled, then this is an example of quota sampling.
• Let’s illustrate this point a bit more.
Sample illustration
• For example, you could sample every third person who
walks through the door of the building, regardless of who
they are.
• The main difference between these two approaches is that
probability sampling removes the human subjectivity.
• This is an important distinction that you need to be able
to make.
Illustration
Quota Sample Probability Sample Actual result
Final remarks
Basic idea of sampling and estimation
Properties of estimators
Note: MSE measures how far the estimate is from the parameter of interest,
whereas variance measures how far the estimate is from the mean of that
estimate. Thus, when an estimator is unbiased, its MSE is the same as
its variance.
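The relationship described in this note is the standard bias-variance decomposition. It is written out here as a short worked equation; the symbol \hat\theta for a generic estimator of a parameter \theta is an assumption made for this illustration.

```latex
\mathrm{MSE}(\hat\theta)
  = E\big[(\hat\theta - \theta)^2\big]
  = \mathrm{Var}(\hat\theta) + \big[\mathrm{Bias}(\hat\theta)\big]^2,
\qquad
\mathrm{Bias}(\hat\theta) = E(\hat\theta) - \theta .
```

When the bias is zero, the second term vanishes and the MSE reduces to the variance.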
Sampling and non-sampling error
Coffee break!
Subsection 2
Simple Random Sampling
Example 1
Example 2
Take a simple random sample of eight units and count the
number of beetles in these eight units.
Unit    Number of beetles
9       234
66      256
81      128
11      245
92      211
54      240
6       202
23      267
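Selecting the eight units can be sketched in code. The example below assumes the region is divided into N = 100 units labelled 1 to 100 (the value of N used in the calculations for this example); the function name `draw_srs` and the seed are hypothetical choices for illustration only.

```python
import random


def draw_srs(population_size, sample_size, seed=None):
    """Draw a simple random sample, without replacement, of unit labels 1..N."""
    rng = random.Random(seed)
    labels = range(1, population_size + 1)
    # random.sample selects without replacement, so every subset of the
    # requested size is equally likely.
    return sorted(rng.sample(labels, sample_size))


# Hypothetical draw: 8 of 100 units; a different seed gives a different sample.
units = draw_srs(100, 8, seed=2024)
print(units)
```

Fixing the seed only makes the draw reproducible; in a real survey the selection would be documented rather than tuned.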
Notation
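Definition: finite population variance
• The finite population variance is:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N - 1}$$
• $\sigma^2$ can be estimated by the sample variance $s^2$:
$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} = \frac{(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \dots + (y_n - \bar{y})^2}{n - 1}$$
• Sample standard deviation: $s = \sqrt{s^2}$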
The beetle example
Estimate for the population total is:
$$\hat{\tau} = N \times \bar{y} = 100 \times 222.875 = 22{,}287.5$$
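The point estimates for the beetle example can be reproduced from the eight listed counts. This is a minimal sketch using only the standard library; the variable names are illustrative. Note that a direct computation from the listed counts gives a sample variance of about 1932.70, marginally different from the 1932.657 quoted on the slides, which does not change any conclusion.

```python
from statistics import mean, variance

# Beetle counts for the eight sampled units (from the table above).
beetles = [234, 256, 128, 245, 211, 240, 202, 267]
N = 100                  # number of units in the population
n = len(beetles)         # sample size

y_bar = mean(beetles)    # sample mean, estimates the population mean
s2 = variance(beetles)   # sample variance, divisor n - 1
tau_hat = N * y_bar      # estimate of the population total

print(f"y_bar = {y_bar:.3f}, s2 = {s2:.3f}, tau_hat = {tau_hat:.1f}")
```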
Properties of ȳ (SRS)
Unbiased:
$$E(\bar{y}) = E\left(\frac{y_1 + y_2 + \dots + y_n}{n}\right) = \frac{E(y_1) + E(y_2) + \dots + E(y_n)}{n} = \frac{\mu + \mu + \dots + \mu}{n} = \frac{n\mu}{n} = \mu$$
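Unbiasedness can be checked by brute force on a tiny population: averaging the sample mean over every possible simple random sample should return the population mean exactly. The population values below are hypothetical and chosen only to keep the enumeration small.

```python
from itertools import combinations

# Hypothetical finite population of N = 5 values.
population = [3, 7, 8, 12, 20]
n = 2

mu = sum(population) / len(population)

# Under SRS every subset of size n is equally likely, so E(y_bar) is the
# plain average of the sample means over all possible samples.
sample_means = [sum(s) / n for s in combinations(population, n)]
expected_y_bar = sum(sample_means) / len(sample_means)

print(mu, expected_y_bar)
```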
• Under simple random sampling, the variance of ȳ is:
$$\mathrm{Var}(\bar{y}) = \frac{N - n}{N} \cdot \frac{\sigma^2}{n}$$
• Note that $\frac{N - n}{N} = 1 - \frac{n}{N}$ is called the finite population
correction fraction.
• Remark 1: when the sampling is done with replacement,
the fraction disappears.
• Remark 2: when the sample size is very small compared
to the population size, the fraction is close to 1 and can
be ignored.
• Remark 3: $\frac{n}{N}$ is sometimes referred to as the sampling rate.
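• If one wants to estimate Var(ȳ) from a single sample, one needs to
replace $\sigma^2$ by $s^2$ in the formula.
• The estimate for Var(ȳ) is denoted $\widehat{\mathrm{Var}}(\bar{y})$ and
$$\widehat{\mathrm{Var}}(\bar{y}) = \frac{N - n}{N} \cdot \frac{s^2}{n}.$$
• For the beetle example:
$$\widehat{\mathrm{Var}}(\bar{y}) = \frac{N - n}{N} \cdot \frac{s^2}{n} = \frac{100 - 8}{100} \cdot \frac{1932.657}{8} = 222.256$$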
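The variance estimate is easy to wrap in a small helper. The sketch below uses the summary values quoted on the slide; the function name `var_mean_srs` is a hypothetical name chosen for this illustration.

```python
def var_mean_srs(s2, n, N):
    """Estimated variance of the sample mean under SRS without replacement."""
    fpc = (N - n) / N          # finite population correction
    return fpc * s2 / n


# Values quoted for the beetle example: s^2 = 1932.657, n = 8, N = 100.
var_y_bar = var_mean_srs(1932.657, 8, 100)

# The total is N times the mean, so its variance is N^2 times larger,
# which equals N (N - n) s^2 / n.
var_tau_hat = 100 ** 2 * var_y_bar

print(round(var_y_bar, 3), round(var_tau_hat, 1))
```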
Properties of τ̂ (SRS)
It is unbiased:
$$E(\hat{\tau}) = E(N \times \bar{y}) = N \times \mu = \tau$$
Its variance, Var(τ̂), is:
$$\mathrm{Var}(\hat{\tau}) = N^2 \, \mathrm{Var}(\bar{y}) = N(N - n) \cdot \frac{\sigma^2}{n}$$
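• The estimate for Var(τ̂) is thus:
$$\widehat{\mathrm{Var}}(\hat{\tau}) = N(N - n) \cdot \frac{s^2}{n}.$$
• For the beetle example:
$$\widehat{\mathrm{Var}}(\hat{\tau}) = 100 \times (100 - 8) \times \frac{1932.657}{8} \approx 2{,}222{,}560$$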
Subsection 3
Confidence intervals
• A confidence interval, defined before the sample is
selected, is the interval which has a pre-specified
probability of containing the parameter.
• To obtain this confidence interval you need to know the
sampling distribution of the estimate.
• Once we know the distribution, a confidence interval
might be defined.
• So the type of statement that we want to make will look
like this:
P(|θ̂ − θ| < d) = 1 − α
• Thus, we need to know the distribution of θ̂.
• In certain cases the distribution of θ̂ can be stated easily.
• However, there are many different types of distributions.
• The normal distribution is easy to use as an example
because it does not bring with it too much complexity.
• When we talk about the Central Limit Theorem for the
sample mean, what are we talking about?
• The finite population Central Limit Theorem for the
sample mean:
What happens when n (sample size), gets large?
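• ȳ, the sample mean, has mean µ and a standard deviation of $\sigma / \sqrt{n}$
(the second argument below is a standard deviation):
$$\bar{y} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right).$$
• Since we do not know σ, we will use s to estimate σ.
• We can thus estimate the standard deviation of ȳ to be:
$$\frac{s}{\sqrt{n}}.$$
• Thus, approximately,
$$\bar{y} \sim N\left(\mu, \frac{s}{\sqrt{n}}\right).$$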
• The value n in the denominator helps us because as n is
getting larger the standard deviation of ȳ is getting
smaller.
• The distribution of ȳ is very complicated when the sample
size is small.
• When the sample size is larger there is more regularity
and it is easier to see the distribution.
• This is not the case when the sample size is small.
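The effect of the sample size can be explored with a small simulation. The sketch below builds a hypothetical, strongly skewed finite population, draws repeated simple random samples, and compares the empirical standard deviation of ȳ with the value given by the SRS variance formula (including the finite population correction). The population, the seed, and the number of replications are all assumptions made for illustration.

```python
import random
from statistics import mean, pstdev, variance

rng = random.Random(0)           # fixed seed so the run is repeatable
N = 1000
# Hypothetical right-skewed finite population.
population = [rng.expovariate(1.0) for _ in range(N)]
mu = mean(population)
sigma2 = variance(population)    # finite population variance, divisor N - 1


def simulate(n, reps=2000):
    """Return (empirical SD of y_bar, theoretical SD, share within 1.96 SDs of mu)."""
    theory_sd = ((N - n) / N * sigma2 / n) ** 0.5
    means = [mean(rng.sample(population, n)) for _ in range(reps)]
    inside = sum(abs(m - mu) < 1.96 * theory_sd for m in means) / reps
    return pstdev(means), theory_sd, inside


results = {n: simulate(n) for n in (5, 30, 100)}
for n, (emp_sd, theory_sd, inside) in results.items():
    print(n, round(emp_sd, 3), round(theory_sd, 3), round(inside, 3))
```

If the normal approximation is reasonable, the last column should settle near 0.95 as n grows.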
Confidence interval for µ
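• Now, we can compute the confidence interval as:
$$P\left(\frac{|\bar{y} - \mu|}{\sqrt{\widehat{\mathrm{Var}}(\bar{y})}} < d\right) = 1 - \alpha$$
$$P\left(\frac{|\bar{y} - \mu|}{\sqrt{\widehat{\mathrm{Var}}(\bar{y})}} < z_{1-\alpha/2}\right) = 1 - \alpha$$
$$P\left(\bar{y} - z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\bar{y})} < \mu < \bar{y} + z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\bar{y})}\right) = 1 - \alpha$$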
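• Thus, an approximate 100(1 − α)% confidence interval for µ is:
$$\bar{y} \pm z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\bar{y})} = \bar{y} \pm z_{1-\alpha/2}\sqrt{\frac{N - n}{N} \cdot \frac{s^2}{n}}$$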
• A 100(1 − α)% confidence interval for τ is given by:
$$\hat{\tau} \pm z_{1-\alpha/2}\sqrt{N(N - n)\frac{s^2}{n}}$$
• Be careful now: when can we use these?
• In what situations are these confidence intervals
applicable?
• The approximate intervals above are good when n is
large (because of the Central Limit Theorem), or when
the observations $y_1, y_2, \dots, y_n$ are normally distributed.
Confidence intervals and sample size
• When the sample size is 8 to 29, we would usually use a
normal probability plot to see whether the data come
from a normal distribution. If the plot does not contradict the
normality assumption, then we can go ahead and use the interval.
• However, when the sample size is 7 or less, a normal
probability plot may fail to reject normality simply
because the sample is too small.
• Remark: In the examples of this training we typically use
small sample sizes for illustration purposes only.
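• For the beetle example in the text, an approximate 95%
CI for µ is:
$$\bar{y} \pm z_{1-\alpha/2}\sqrt{\frac{N - n}{N} \cdot \frac{s^2}{n}}$$
• Note that for 95% confidence α = 0.05, so 1 − α/2 = 0.975. The
corresponding z-value can be found in the following table:
Confidence   α      1 − α/2   z_{1−α/2}
90%          0.10   0.950     1.645
95%          0.05   0.975     1.960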
• sample mean: ȳ = 222.875
• sample variance: s² = 1932.657
$$\bar{y} \pm z_{1-\alpha/2}\sqrt{\frac{N - n}{N} \cdot \frac{s^2}{n}} = 222.875 \pm 1.96\sqrt{222.256} = 222.875 \pm 1.96 \times 14.908 = 222.875 \pm 29.220$$
• And an approximate 95% CI for τ is then:
$$\hat{\tau} \pm z_{1-\alpha/2}\sqrt{N(N - n)\frac{s^2}{n}} = 22{,}287.5 \pm 1.96\sqrt{2{,}222{,}560} = 22{,}287.5 \pm 2{,}922.018$$
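Both intervals can be computed together. The following is a sketch under the normal approximation discussed above; the function name `srs_confidence_intervals` is hypothetical, and the z-value is taken from the standard library rather than rounded to 1.96, so the last digits can differ slightly from the hand calculation.

```python
from math import sqrt
from statistics import NormalDist


def srs_confidence_intervals(y_bar, s2, n, N, conf=0.95):
    """Approximate CIs for the population mean and total under SRS."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)    # z_{1 - alpha/2}
    half_mean = z * sqrt((N - n) / N * s2 / n)      # margin of error for the mean
    tau_hat = N * y_bar
    half_total = N * half_mean                      # same as z * sqrt(N (N - n) s2 / n)
    ci_for_mean = (y_bar - half_mean, y_bar + half_mean)
    ci_for_total = (tau_hat - half_total, tau_hat + half_total)
    return ci_for_mean, ci_for_total


# Summary values quoted for the beetle example.
ci_mean, ci_total = srs_confidence_intervals(222.875, 1932.657, n=8, N=100)
print([round(x, 2) for x in ci_mean], [round(x, 1) for x in ci_total])
```

With only eight observations the normal approximation is questionable, which is why the slides recommend checking normality for small samples.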
Questions?
Lunch break!
Subsection 4
Domain estimation
• Therefore, we wish to estimate the parameters of a
subpopulation (domain) of the population represented in
the frame.
• Main issue: the size of the domain (subpopulation) is
usually unknown.
Notation
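• An unbiased estimator of $\mu_d$, the subpopulation mean, is:
$$\bar{y}_d = \frac{1}{n_d}\sum_{i=1}^{n_d} y_{di}.$$
• Its variance is estimated by:
$$\widehat{\mathrm{Var}}(\bar{y}_d) = \frac{N_d - n_d}{N_d} \cdot \frac{s_d^2}{n_d}, \quad \text{where } s_d^2 = \frac{\sum_{i=1}^{n_d} (y_{di} - \bar{y}_d)^2}{n_d - 1}.$$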
• Usually we do not know $N_d$, so we will estimate the finite
population correction factor $\frac{N_d - n_d}{N_d}$ by $\frac{N - n}{N}$.
Example: variable food cost
• What is the average food cost for married students in
that college?
• ANSWER:
• The average food cost for married students is:
ȳm = 135.3.
• Provide an estimate for the standard deviation of the
estimate.
• ANSWER:
• The estimated variance and standard deviation of the estimate are:
$$\widehat{\mathrm{Var}}(\bar{y}_m) = \frac{80 - 15}{80} \cdot \frac{44.4^2}{10} = 160.173.$$
$$\widehat{\mathrm{SD}}(\bar{y}_m) = \sqrt{160.173} = 12.656.$$
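The domain calculation can be reproduced from the summary figures in the answer. This sketch assumes, reading the formula above, that N = 80 is the population size, n = 15 the overall sample size, n_m = 10 the number of sampled married students, and s_m = 44.4 their sample standard deviation; the function name is hypothetical.

```python
from math import sqrt


def domain_mean_variance(s_d, n_d, n, N):
    """Estimated variance of a domain mean when the domain size N_d is unknown."""
    # The unknown (N_d - n_d) / N_d is replaced by the overall (N - n) / N.
    fpc = (N - n) / N
    return fpc * s_d ** 2 / n_d


var_married = domain_mean_variance(s_d=44.4, n_d=10, n=15, N=80)
sd_married = sqrt(var_married)
print(round(var_married, 3), round(sd_married, 3))
```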
Confidence intervals and sample size
Unit learning outcomes
Sample size for mean and total
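$$\frac{\hat{\theta} - \theta}{\sqrt{\mathrm{Var}(\hat{\theta})}} \sim N(0, 1).$$
Then
$$P\left(\frac{|\hat{\theta} - \theta|}{\sqrt{\mathrm{Var}(\hat{\theta})}} < z_{1-\alpha/2}\right) = 1 - \alpha$$
$$P\left(|\hat{\theta} - \theta| < z_{1-\alpha/2} \cdot \sqrt{\mathrm{Var}(\hat{\theta})}\right) = 1 - \alpha$$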
• And, if we specify this α we can then try to find out the
sample size large enough to achieve the goal of your
experiment.
• So, we need to ask, "What is the goal of your
experiment?"
• This is perhaps the most important question to be asked
as a part of your experiment.
• What if we were interested in estimating the average
weight of ESTAT male collaborators?
• How many observations should we plan on taking for
estimating the mean weight of ESTAT male collaborators?
• What do we need to consider?
• First: how precise do you want this estimate to be?
• You thus need to specify the margin of error.
• We should also take into account:
1. The variability of the data. The variability of the measure
that you are estimating is your first concern, because it
directly affects the sample size.
2. The type of conclusion that you would like to report. That is,
you need to specify the 1 − α value, the confidence
level, that you are happy with.
• Now, if we specify the confidence level 1 − α and the margin of
error d (which can also be viewed as the half-width of the
(1 − α)100% CI), we can solve for the sample size such
that the CI has the specified margin of error.
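• For estimating the population mean, the equation becomes:
$$P\left(|\bar{y} - \mu| < z_{1-\alpha/2} \cdot \sqrt{\frac{N - n}{N} \cdot \frac{\sigma^2}{n}}\right) = 1 - \alpha$$
$$z_{1-\alpha/2} \cdot \sqrt{\frac{N - n}{N} \cdot \frac{\sigma^2}{n}} = d$$
$$n = \frac{1}{\dfrac{d^2}{z_{1-\alpha/2}^2 \cdot \sigma^2} + \dfrac{1}{N}}$$
• Can we now use this formula to compute the sample size?
• Not exactly!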
• The weak point is the population variance used.
• We do not know the value of σ 2 .
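• Similarly, for estimating the population total τ, here is the
formula:
$$P\left(|\hat{\tau} - \tau| < z_{1-\alpha/2} \cdot \sqrt{N(N - n)\frac{\sigma^2}{n}}\right) = 1 - \alpha$$
$$z_{1-\alpha/2} \cdot \sqrt{N(N - n)\frac{\sigma^2}{n}} = d$$
$$n = \frac{1}{\dfrac{d^2}{N^2 \cdot z_{1-\alpha/2}^2 \cdot \sigma^2} + \dfrac{1}{N}}$$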
The beetle example
Unit    Number of beetles
9       234
66      256
81      128
11      245
92      211
54      240
6       202
23      267
• Now, let’s begin plugging what we know into the formula.
• We know N = 100, α = 0.05 and d = 1000.
• Do we know $\sigma^2$?
• No, but we can estimate $\sigma^2$ by
$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} = 1932.657.$$
• Let’s calculate this:
$$n = \frac{1}{\dfrac{d^2}{N^2 \cdot z_{1-\alpha/2}^2 \cdot \sigma^2} + \dfrac{1}{N}} = \frac{1}{\dfrac{(1000)^2}{(100)^2 \cdot (1.96)^2 \cdot 1932.657} + \dfrac{1}{100}} = 42.610$$
• Remark: If we ignore the finite population correction
adjustment, then
$$n = \frac{N^2 \cdot z_{1-\alpha/2}^2 \cdot \sigma^2}{d^2} = \frac{(100)^2 \cdot (1.96)^2 \cdot 1932.657}{(1000)^2} = 74.245$$
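The two sample-size calculations can be compared side by side. This is a minimal sketch with hypothetical function names; it uses z = 1.96 and the variance estimate s² = 1932.657 quoted above in place of the unknown σ².

```python
from math import ceil


def n_for_total(d, z, sigma2, N):
    """Sample size so the CI for the total has half-width d (with fpc)."""
    return 1 / (d ** 2 / (N ** 2 * z ** 2 * sigma2) + 1 / N)


def n_for_total_no_fpc(d, z, sigma2, N):
    """Same target, ignoring the finite population correction."""
    return N ** 2 * z ** 2 * sigma2 / d ** 2


n_with_fpc = n_for_total(d=1000, z=1.96, sigma2=1932.657, N=100)
n_without_fpc = n_for_total_no_fpc(d=1000, z=1.96, sigma2=1932.657, N=100)

# A sample size must be a whole number of units, so round up.
print(round(n_with_fpc, 3), ceil(n_with_fpc))
print(round(n_without_fpc, 3), ceil(n_without_fpc))
```

Ignoring the correction is conservative here: it asks for many more units than the finite population of 100 actually requires.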
Think about it!
• In the beetle example, there are data to estimate σ 2 .
• What can one do if there is no pilot data?
• How can we get some rough idea about what σ 2 is?
Example
• There is no pilot data here.
• We don’t have the time to select out some pigs in order
to get an estimate for σ 2 , the variance of the weight gain.
• Question: How do we get a rough estimate of σ?
• What would be a reasonable way to help this farmer
estimate the standard deviation of the weight gain?
• One thing we can do is rely on the information that we
already have, i.e., find some historical data that exists
on this topic.
• But what if this historical data does not exist?
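• For certain variables we can make a reasonable guess for σ.
• Here is a formula for this rough estimate:
$$\sigma \approx \frac{\text{Range}}{4}$$
(for roughly bell-shaped data, about 95% of the values lie within 2σ of the
mean, so the range spans about 4σ).
• The range is relatively easy to have some idea about.
• This is an important point.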
• Even though perhaps none of us has raised pigs, we can
still come up with a sensible guess.
• So, for this case we will make a sensible guess of the
range of weight gain and intuitively estimate this to be
from a minimum of 10 lbs to a maximum of 50 lbs within
this 3-week period.
• σ can now be roughly estimated to be:
$$\frac{\text{Range}}{4} = \frac{50 - 10}{4} = 10 \text{ lbs}$$
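• Now we can use the sample size formula for estimating the mean, µ.
• Then, with d = 2, N = 1000 and 90% confidence ($z_{1-\alpha/2} = 1.645$),
$$n = \frac{1}{\dfrac{d^2}{z_{1-\alpha/2}^2 \cdot \sigma^2} + \dfrac{1}{N}} = \frac{1}{\dfrac{2^2}{(1.645)^2 \cdot (10)^2} + \dfrac{1}{1000}} = 63.36$$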
• The value 63.36 should be rounded up to 64.
• We will need to sample 64 pigs in order to estimate the
average weight gain in 3 weeks to within 2 lbs with 90%
confidence.
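The whole pig calculation, from the guessed range to the rounded sample size, fits in a few lines. The sketch below uses the figures given above (range 10 to 50 lbs, d = 2 lbs, N = 1000, z = 1.645); the function name is a hypothetical choice for illustration.

```python
from math import ceil


def n_for_mean(d, z, sigma, N):
    """Sample size so the CI for the mean has half-width d (with fpc)."""
    return 1 / (d ** 2 / (z ** 2 * sigma ** 2) + 1 / N)


# Rough guess for sigma from the plausible range of weight gain.
low, high = 10, 50
sigma_guess = (high - low) / 4      # range / 4 rule of thumb

n_raw = n_for_mean(d=2, z=1.645, sigma=sigma_guess, N=1000)
n_pigs = ceil(n_raw)                # always round a sample size up
print(round(n_raw, 2), n_pigs)
```

Because the guess for σ is rough, it can be worth rerunning the calculation with a wider range to see how sensitive the answer is.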
Coffee break!
Subsection 2
Estimating proportions
• Poll surveys: most are based on telephone interviews with
a significant portion based on interviews conducted in
person from home visits.
• Usually the sample size is at least 1000, sometimes even
1500.
• Let us see in what ways the proportion problem is related
to the mean problem...
• Question: Do you approve of President Junker’s job
performance?
• Answer: $y_i = 0$ if no, $y_i = 1$ if yes, where the population units are
i = 1, 2, ..., N.
• The variable of interest: $y_1, y_2, \dots, y_N$
• Population proportion: $p = \frac{1}{N}\sum_{i=1}^{N} y_i$, which is the
population mean, µ, of Y.
118/468
Eurostat
• If we take a simple random sample of size n, then
$$\hat{p} = \sum_{i=1}^{n} \frac{y_i}{n} = \bar{y}$$
• This specific definition of yi gives it a variance
that is related to its mean.
• To find the finite population variance for y1 , y2 , ... , yN ,
we know that the population mean is:
$$\mu = \frac{1}{N} \sum_{i=1}^{N} y_i = p.$$
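By definition the variance is then:
$$
\begin{aligned}
\sigma^2 &= \frac{\sum_{i=1}^{N} (y_i - p)^2}{N - 1} \\
&= \frac{\sum_{i=1}^{N} (y_i^2 - 2 p y_i + p^2)}{N - 1} \\
&= \frac{\sum_{i=1}^{N} y_i^2 - 2p \sum_{i=1}^{N} y_i + N p^2}{N - 1}
\end{aligned}
$$
Then, since $y_i^2 = y_i$:
$$
\begin{aligned}
\sigma^2 &= \frac{\sum_{i=1}^{N} y_i - 2p \sum_{i=1}^{N} y_i + N p^2}{N - 1} \\
&= \frac{Np - 2p(Np) + Np^2}{N - 1} \\
&= \frac{Np - Np^2}{N - 1} = \frac{N}{N - 1}\, p (1 - p)
\end{aligned}
$$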
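The identity σ² = N/(N − 1) · p(1 − p) can be checked numerically. The sketch below uses a made-up 0/1 population (30 "yes" answers out of N = 100, an assumption chosen only for illustration) and compares the directly computed finite population variance with the closed form.

```python
from statistics import variance

N = 100
y = [1] * 30 + [0] * (N - 30)  # hypothetical population: 30 "yes", 70 "no"
p = sum(y) / N                 # population proportion

sigma2_direct = variance(y)                 # statistics.variance divides by N - 1
sigma2_formula = N / (N - 1) * p * (1 - p)  # closed form derived above
print(sigma2_direct, sigma2_formula)
```

Both numbers agree up to floating-point error, which is exactly what the derivation predicts for any 0/1 variable.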
• How will we estimate this?
• We can estimate this by:
$$\hat{\sigma}^2 = s^2 = \frac{n}{n - 1}\, \hat{p} \cdot (1 - \hat{p}).$$
• What we want is to see how p̂ behaves; therefore, we
want to know its distribution.
• First, we find its mean, then its variance.
• Since p̂ is ȳ , we get E(p̂) = µ = p.
• Then, we proceed to find its variance.
$$
\begin{aligned}
\mathrm{Var}(\hat{p}) &= \left(1 - \frac{n}{N}\right) \cdot \frac{\sigma^2}{n} \\
&= \frac{N - n}{N} \cdot \frac{N \cdot p \cdot (1 - p)}{(N - 1) \cdot n} \\
&= \frac{N - n}{N - 1} \cdot \frac{p \cdot (1 - p)}{n}
\end{aligned}
$$
• How will we estimate the variance of p̂?
• There are many answers for how to do this.
• One method would be to use maximum likelihood,
another would be to find the unbiased estimator.
• An unbiased estimator of the variance is:
$$\widehat{\mathrm{Var}}(\hat{p}) = \frac{N - n}{N} \cdot \frac{\hat{p} \cdot (1 - \hat{p})}{n - 1}$$
• This is one reasonable answer for determining an estimate
of the variance.
• The answer will not be very different from what one
would get using other methods.
• What about confidence intervals?
• For this we need to know the distribution of p̂.
• When the sample size is large, p̂ is approximately
normally distributed by the central limit theorem.
• Therefore, we can use the usual interval:
$$\hat{p} \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{p})}$$
• How large is large enough?
Answer: if n · p̂ ≥ 5, n · (1 − p̂) ≥ 5.
Back to example
• The 22% is a sample proportion.
• What is the true population proportion?
• ANSWER:
• A 95% confidence interval for p is:
$$0.22 \pm 1.96 \sqrt{0.0001545} = 0.22 \pm 0.0244$$
where
$$\widehat{\mathrm{Var}}(\hat{p}) = \frac{N - n}{N} \cdot \frac{\hat{p} \cdot (1 - \hat{p})}{n - 1} = 1 \cdot \frac{0.22 \times 0.78}{1112 - 1} = 0.0001545$$
• Remark: because N is large compared to n we ignore the
finite population correction.
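The interval in this example can be reproduced with a short Python sketch. The function `proportion_ci` and its arguments are hypothetical names introduced here; passing `N=None` is an assumed convention for ignoring the finite population correction, as the remark above does.

```python
import math
from statistics import NormalDist


def proportion_ci(p_hat, n, N=None, conf_level=0.95):
    """Large-sample CI for a proportion under SRS (illustrative helper)."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    fpc = 1.0 if N is None else (N - n) / N       # finite population correction
    var_hat = fpc * p_hat * (1 - p_hat) / (n - 1)  # unbiased variance estimate
    half_width = z * math.sqrt(var_hat)
    return p_hat - half_width, p_hat + half_width, var_hat


# Poll example: p_hat = 0.22 from n = 1112 respondents, N treated as very large
lower, upper, var_hat = proportion_ci(p_hat=0.22, n=1112)
print(round(var_hat, 7), round(lower, 4), round(upper, 4))
```

Before relying on the normal approximation, one would also check the rule of thumb n·p̂ ≥ 5 and n·(1 − p̂) ≥ 5, which is easily satisfied here.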
Subsection 3
Sample size for estimating proportion
$$n = \frac{1}{\dfrac{d^2}{z_{1-\alpha/2}^2 \cdot \sigma^2} + \dfrac{1}{N}}.$$
• Now, substituting $\sigma^2 = \frac{N}{N-1} \cdot p \cdot (1 - p)$, we get:
$$n = \frac{N \cdot p \cdot (1 - p)}{(N - 1) \dfrac{d^2}{z_{1-\alpha/2}^2} + p \cdot (1 - p)}.$$
• When the finite population correction can be ignored, the
formula is:
$$n \approx \frac{z_{1-\alpha/2}^2 \cdot p \cdot (1 - p)}{d^2}.$$
• Now, for finding sample sizes for a proportion, in addition
to using an educated guess to estimate p, we can also
find a conservative sample size which guarantees that the
margin of error is small enough at a specified α.
A. Educated guess (estimate p by p̂):
$$n = \frac{N \cdot \hat{p} \cdot (1 - \hat{p})}{(N - 1) \dfrac{d^2}{z_{1-\alpha/2}^2} + \hat{p} \cdot (1 - \hat{p})}.$$
1. Note, p̂ may be different from the true proportion.
2. The sample size may not be large enough in some cases
(i.e., the margin of error may not be as small as specified).
B. Conservative sample size:
$$n = \frac{N \cdot 1/4}{(N - 1) \dfrac{d^2}{z_{1-\alpha/2}^2} + 1/4}.$$
1. This works because p(1 − p) attains its maximum, 1/4, at p = 1/2.
Example
$$n = \frac{\hat{p} \cdot (1 - \hat{p}) \cdot z_{1-\alpha/2}^2}{d^2} = \frac{0.22 \cdot 0.78 \cdot 1.96^2}{0.03^2} = 732.47$$
1. Round up to 733.
To estimate the president’s final approval rating, how many
people should be sampled so that the margin of error is 3% (a
popular choice), with 95% confidence?
1. Round up to 1068.
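Both sample sizes quoted here (733 with the educated guess p̂ = 0.22, and 1068 with the conservative choice p = 1/2) can be reproduced with the sketch below. The function name `sample_size_for_proportion` and its defaults are assumptions made for illustration; leaving `N` unset corresponds to ignoring the finite population correction.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(d, conf_level, p_guess=0.5, N=None):
    """Sample size for a proportion; p_guess=0.5 is the conservative choice."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    pq = p_guess * (1 - p_guess)
    if N is None:
        n_raw = z**2 * pq / d**2                       # no finite population correction
    else:
        n_raw = N * pq / ((N - 1) * d**2 / z**2 + pq)  # with correction
    return math.ceil(n_raw)


n_educated = sample_size_for_proportion(d=0.03, conf_level=0.95, p_guess=0.22)
n_conservative = sample_size_for_proportion(d=0.03, conf_level=0.95)
print(n_educated, n_conservative)
```

The conservative size is larger by construction, since p(1 − p) is maximised at p = 1/2; which one to use depends on how costly it would be to return for more units.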
What to choose?
Example
• Would you use the educated guess or the conservative
approach?
• ANSWER:
• We should use an educated guess because it is not costly
to set up the testing procedure again.
• On the other hand, the cost of sampling extra
units is high due to the nature of the test.
• Get a ship out to the Bering Sea to sample the proportion
of fish that have a mercury level within a specified level.
• Last year the proportion was 0.9.
• We want to estimate the proportion to within 0.01 with 95%
confidence.
• Would you use the educated guess or the conservative
approach?
• ANSWER:
• We should use a conservative approach because it is too
expensive to send a ship out again if needed.
Unequal probability sampling
Unit learning outcomes
• In simple random sampling, the probability that each unit
will be sampled is the same.
• But sometimes, estimates can be improved by varying the
probabilities with which units are sampled.
• For example, we want to estimate the number of job
openings in a city by sampling firms in that city.
• Many of the firms in the city are small firms.
• If one uses s.r.s., the size of a firm is not taken into
consideration and a typical sample will consist of mostly
small firms.
• However, the number of job openings is heavily influenced
by large firms.
• Thus, we should be able to improve the estimate of the
number of job openings by giving the large firms a greater
chance to appear in the sample, for example, with
probability proportional to size or proportional to some
other relevant aspect.
Selection probabilities
• If the selection probabilities are unequal, the sample mean
is not unbiased for the population mean and the sample total is
not unbiased for the population total.
• Example: if larger firms are sampled with higher
probability, the sample mean for job openings will be
biased upward.
Questions?
See you tomorrow!
Subsection 2
Sampling with replacement
• For this section, let us consider sampling with
replacement.
• Let $p_i$, i = 1, ..., N, denote the probability that
population unit i is selected on a given draw.
• The Hansen-Hurwitz estimator for τ is:
$$\hat{\tau}_p = \frac{1}{n} \sum_{i=1}^{n} \frac{y_i}{p_i}.$$
Since
$$E\left(\frac{y_i}{p_i}\right) = \sum_{i=1}^{N} \frac{y_i}{p_i}\, p_i = \sum_{i=1}^{N} y_i = \tau,$$
where $\tau = \sum_{i=1}^{N} y_i$ is the population total.
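Thus,
$$
\begin{aligned}
E(\hat{\tau}_p) &= E\left(\frac{1}{n} \sum_{i=1}^{n} \frac{y_i}{p_i}\right) \\
&= \frac{1}{n} \sum_{i=1}^{n} E\left(\frac{y_i}{p_i}\right) \\
&= \frac{1}{n} \sum_{i=1}^{n} \tau \\
&= \frac{1}{n}\, n \tau = \tau
\end{aligned}
$$
which means τ̂p is an unbiased estimator for τ .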
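Since $\mathrm{Var}\left(\frac{y_i}{p_i}\right) = \sum_{i=1}^{N} p_i \left(\frac{y_i}{p_i} - \tau\right)^2$,
$$\mathrm{Var}(\hat{\tau}_p) = \frac{1}{n} \sum_{i=1}^{N} p_i \left(\frac{y_i}{p_i} - \tau\right)^2$$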
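• An unbiased estimator for Var(τ̂p ) is:
$$\widehat{\mathrm{Var}}(\hat{\tau}_p) = \frac{1}{n} \cdot \frac{\sum_{i=1}^{n} \left(\frac{y_i}{p_i} - \hat{\tau}_p\right)^2}{n - 1}$$
and an approximate (1 − α)100% confidence interval for
τ is:
$$\hat{\tau}_p \pm z_{1-\alpha/2} \cdot \sqrt{\widehat{\mathrm{Var}}(\hat{\tau}_p)}.$$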
• For the population mean, $\mu = \frac{\tau}{N}$, one uses:
$$\hat{\mu}_p = \frac{1}{N} \cdot \left(\frac{1}{n} \sum_{i=1}^{n} \frac{y_i}{p_i}\right) = \frac{\hat{\tau}_p}{N}$$
$$E(\hat{\mu}_p) = \frac{\tau}{N} = \mu$$
$$\widehat{\mathrm{Var}}(\hat{\mu}_p) = \frac{1}{N^2} \cdot \widehat{\mathrm{Var}}(\hat{\tau}_p)$$
• How do we perform unequal probability sampling
according to given pi ?
Example 1
Division   # employees
1          1000
2          650
3          2100
4          860
5          2840
6          1910
7          390
8          3200
9          1500
10         1200
Total      15650
A. How do we practically implement unequal probability
sampling according to the given pi ’s?
B. With the divisions selected by probability proportional to
size, how do we construct the Hansen-Hurwitz estimator
for τ ?
Example: Answer to A
Division   # employees   pi
1          1000          1000/15650
2          650           650/15650
3          2100          2100/15650
4          860           860/15650
5          2840          2840/15650
6          1910          1910/15650
7          390           390/15650
8          3200          3200/15650
9          1500          1500/15650
10         1200          1200/15650
Total      15650         1
Division # employees pi Assigned numbers
• For division 2, y1 : the number of requests is 420
• For division 5, y2 : the number of requests is 1785
• For division 8, y3 : the number of requests is 2198
• We will need to compute the Hansen-Hurwitz estimator
as follows:
• The Hansen-Hurwitz estimator for τ is
$$
\begin{aligned}
\hat{\tau}_p &= \frac{1}{n} \sum_{i=1}^{n} \frac{y_i}{p_i} \\
&= \frac{1}{3} \left(420 \cdot \frac{15650}{650} + 1785 \cdot \frac{15650}{2840} + 2198 \cdot \frac{15650}{3200}\right) \\
&= \frac{1}{3} (10112.31 + 9836.36 + 10749.59) \\
&= 10232.75
\end{aligned}
$$
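• The values 10112.31, 9836.36, and 10749.59
look fairly stable, so it looks like the variance will not be
too large.
$$
\begin{aligned}
\widehat{\mathrm{Var}}(\hat{\tau}_p) &= \frac{1}{3} \cdot \frac{\sum_{i=1}^{3} \left(\frac{y_i}{p_i} - \hat{\tau}_p\right)^2}{3 - 1} \\
&= \frac{1}{3} \cdot \frac{1}{2} \left[(10112.31 - 10232.75)^2 + (9836.36 - 10232.75)^2 + (10749.59 - 10232.75)^2\right] \\
&= 73125.74
\end{aligned}
$$
and
$$\widehat{\mathrm{SD}}(\hat{\tau}_p) = 270.418$$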
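The Hansen-Hurwitz computation for the three sampled divisions can be sketched as follows. The function name `hansen_hurwitz` is an illustrative choice; the data are the division sizes and request counts from this example. Because the code keeps full precision instead of rounding each y_i/p_i to two decimals, the variance estimate differs from the slide value in the last digits.

```python
import math


def hansen_hurwitz(y, p):
    """Hansen-Hurwitz estimate of the total and its estimated variance.

    y, p: sampled values and their per-draw selection probabilities,
    listed once per draw (repeats allowed, since sampling is with replacement).
    """
    n = len(y)
    ratios = [yi / pi for yi, pi in zip(y, p)]
    tau_hat = sum(ratios) / n
    var_hat = sum((r - tau_hat) ** 2 for r in ratios) / (n * (n - 1))
    return tau_hat, var_hat


total_employees = 15650
sizes = [650, 2840, 3200]   # divisions 2, 5 and 8
y = [420, 1785, 2198]       # requests observed in those divisions
p = [s / total_employees for s in sizes]

tau_hat, var_hat = hansen_hurwitz(y, p)
sd_hat = math.sqrt(var_hat)
print(round(tau_hat, 2), round(var_hat, 1), round(sd_hat, 2))
```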
Hansen-Hurwitz estimator
• What if we were sampling from ESTAT
departments?
• They are of very different sizes; some are very large and
others are very small.
• Would we automatically choose to use p.p.s.?
• The idea is that the thing that you are interested in has
to be related to the size.
• If the thing that you are interested in is related to size,
then you would want to use p.p.s.
• However, if what you are interested in has nothing to do
with the size of the department, then there is no reason
to use p.p.s.
• Now, let us address the ’why’.
• By definition,
$$\tau = \sum_{i=1}^{N} y_i \quad \text{and} \quad \mathrm{Var}(\hat{\tau}_p) = \frac{1}{n} \sum_{i=1}^{N} p_i \left(\frac{y_i}{p_i} - \tau\right)^2.$$
• For the special and unrealistic case $\frac{y_i}{p_i}$ = constant, the
constant will be τ and Var(τ̂p ) will be zero.
• Therefore, you want $\frac{y_i}{p_i}$ to be close to a constant.
• However, in reality, prior to sampling, the yi are unknown
and we cannot choose pi proportional to yi .
• If we know yi is approximately proportional to a known
variable such as xi , then we can choose pi proportional
to xi .
• Then τ̂p will have a low variance.
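The effect of the choice of p_i on Var(τ̂p) can be illustrated numerically. The population values `y` and the size measure `x` below are made up for this sketch, and `hh_variance` is a hypothetical helper that evaluates the variance formula above for a fully known population.

```python
def hh_variance(y, p, n):
    """True variance of the Hansen-Hurwitz estimator for a known population."""
    tau = sum(y)
    return sum(pi * (yi / pi - tau) ** 2 for yi, pi in zip(y, p)) / n


y = [10.0, 20.0, 30.0, 40.0]  # hypothetical population values
x = [12.0, 18.0, 33.0, 37.0]  # hypothetical size measure, roughly proportional to y
n = 2

p_ideal = [yi / sum(y) for yi in y]  # p_i proportional to y_i (unrealistic)
p_size = [xi / sum(x) for xi in x]   # p_i proportional to x_i (p.p.s.)
p_equal = [1 / len(y)] * len(y)      # equal selection probabilities

var_ideal = hh_variance(y, p_ideal, n)
var_size = hh_variance(y, p_size, n)
var_equal = hh_variance(y, p_equal, n)
print(var_ideal, var_size, var_equal)
```

With these made-up numbers the variance is essentially zero for the ideal choice, small for p.p.s. on x, and much larger for equal probabilities, which is the pattern the argument above describes.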
Example: palm trees
• We know the sizes of the islands (e.g., the size
of island 1 is 1 square mile, the size of island 29 is 5 square
miles and the size of island 36 is 2 square miles).
• The total size of these 100 islands is 100 square miles.
• We find that p1 , ..., pN are:
• Answer:
• Assign an interval of width pi to the i-th unit
• Generate 4 random numbers from a uniform distribution
on (0,1)
• Choose the units that correspond to the intervals
containing the random numbers.
• In this example, we use the uniform distribution and get: 0.335257,
0.0065551, 0.401869, 0.318977
• The units selected are the islands 29, 1, 36, and 29
(since 0.335257 falls between 0.31 and 0.36, 0.0065551
falls between 0 and 0.01, 0.401869 falls between 0.40 and
0.42, and 0.318977 falls between 0.31 and 0.36).
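The interval method described above can be sketched in Python. Since the full list of island sizes is not reproduced here, the sketch uses the ten division sizes from Example 1 instead; the function names `pps_draw` and `pps_sample_with_replacement` and the seed are illustrative assumptions.

```python
import bisect
import itertools
import random


def pps_draw(p, u):
    """Return the 0-based index of the unit whose interval contains u in [0, 1)."""
    cum = list(itertools.accumulate(p))  # right endpoints of the intervals
    # min(...) guards against u exceeding the last endpoint due to rounding
    return min(bisect.bisect_right(cum, u), len(p) - 1)


def pps_sample_with_replacement(p, n, rng):
    """Draw n units with replacement, unit i having probability p[i] per draw."""
    return [pps_draw(p, rng.random()) for _ in range(n)]


sizes = [1000, 650, 2100, 860, 2840, 1910, 390, 3200, 1500, 1200]
p = [s / sum(sizes) for s in sizes]

rng = random.Random(2024)  # fixed seed so the draw is reproducible
sample = [i + 1 for i in pps_sample_with_replacement(p, 3, rng)]  # 1-based labels
print(sample)
```

Each unit owns the half-open interval between consecutive cumulative probabilities, so a uniform random number lands in unit i's interval with probability p_i.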
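The measurements (yi ) are:
i    Size   pi     yi
1    1      0.01   14
29   5      0.05   50
29   5      0.05   50
36   2      0.02   25
With n = 4 draws, the Hansen-Hurwitz estimate is
$$\hat{\tau}_p = \frac{1}{4} \left(\frac{14}{0.01} + \frac{50}{0.05} + \frac{50}{0.05} + \frac{25}{0.02}\right) = \frac{1}{4} (1400 + 1000 + 1000 + 1250) = 1162.5$$
and the estimated variance is
$$
\begin{aligned}
\widehat{\mathrm{Var}}(\hat{\tau}_p) &= \frac{1}{n(n - 1)} \sum_{i=1}^{n} \left(\frac{y_i}{p_i} - \hat{\tau}_p\right)^2 \\
&= \frac{1}{4 \cdot 3} \left[(1400 - 1162.5)^2 + (1000 - 1162.5)^2 + (1000 - 1162.5)^2 + (1250 - 1162.5)^2\right] \\
&= 9739.58
\end{aligned}
$$
$$\widehat{\mathrm{SD}}(\hat{\tau}_p) = 98.69.$$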
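• If we are interested in the mean number of trees per
island in that population, then
$$\hat{\mu}_p = \frac{\hat{\tau}_p}{N} = \frac{1162.5}{100} = 11.625.$$
$$\widehat{\mathrm{Var}}(\hat{\mu}_p) = \frac{1}{N^2} \cdot \widehat{\mathrm{Var}}(\hat{\tau}_p) = \frac{1}{(100)^2} \cdot 9739.58 = 0.973958$$
$$\widehat{\mathrm{SD}}(\hat{\mu}_p) = 0.987$$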
Subsection 3
The Horvitz-Thompson estimator
• The Horvitz-Thompson estimator is:
$$\hat{\tau}_\pi = \sum_{i=1}^{\nu} \frac{y_i}{\pi_i}$$
where ν is the number of distinct units in the sample and
πi is the probability that unit i is included in the sample.
• The Horvitz-Thompson estimator does not depend on the
number of times a unit may be selected.
• Each distinct unit of the sample is utilized only once.
• Note that the estimator is unbiased:
$$E(\hat{\tau}_\pi) = \tau$$
• Its variance is given by
$$\mathrm{Var}(\hat{\tau}_\pi) = \sum_{i=1}^{N} \frac{1 - \pi_i}{\pi_i}\, y_i^2 + \sum_{i=1}^{N} \sum_{j \neq i} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_i \pi_j}\, y_i y_j$$
where πij > 0 denotes the probability that both unit i and
unit j are included in the sample.
An approximate (1 − α)100% CI for τ is:
$$\hat{\tau}_\pi \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\tau}_\pi)}.$$
Palm trees with Horvitz-Thompson estimator
• For sampling with replacement, the inclusion probability of
unit i after n draws is $\pi_i = 1 - (1 - p_i)^n$.
• Recall: Units 1, 29 and 36 are selected.
• Since p1 = 0.01, $\pi_1 = 1 - (1 - 0.01)^4 = 0.0394$, and similarly
$\pi_{29} = 1 - (1 - 0.05)^4 = 0.1855$ and $\pi_{36} = 1 - (1 - 0.02)^4 = 0.0776$.
• Next, we need to compute the estimated variance,
$\widehat{\mathrm{Var}}(\hat{\tau}_\pi)$.
• For this, we need to compute πij .
• Since the probability that neither unit i nor unit j is
selected in n draws is $(1 - p_i - p_j)^n$, we get:
$$\pi_{ij} = \pi_i + \pi_j - [1 - (1 - p_i - p_j)^n]$$
• This means that we have to run through each of the unit
pairs: (1, 29), (1, 36) and (29, 36).
• Plugging in the values in
$$\widehat{\mathrm{Var}}(\hat{\tau}_\pi) = \sum_{i=1}^{\nu} \frac{1 - \pi_i}{\pi_i^2}\, y_i^2 + \sum_{i=1}^{\nu} \sum_{j \neq i} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_i \pi_j} \cdot \frac{1}{\pi_{ij}}\, y_i y_j,$$
we obtain:
$$\widehat{\mathrm{Var}}(\hat{\tau}_\pi) = 92692.9$$
• Thus, $\widehat{\mathrm{SD}}(\hat{\tau}_\pi) = \sqrt{92692.9} = 304.455$
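The steps of the palm-tree calculation can be sketched in Python. The function names are illustrative assumptions; the formulas are the inclusion probability π_i = 1 − (1 − p_i)^n, the joint inclusion probability π_ij given above, and the variance estimator just stated. The cross terms subtract nearly equal small quantities, so the variance estimate is sensitive to how the π values are rounded; a full-precision run need not agree with a hand calculation based on rounded intermediate values.

```python
from itertools import combinations


def inclusion_prob(p_i, n):
    """P(unit is selected at least once in n draws with replacement)."""
    return 1.0 - (1.0 - p_i) ** n


def joint_inclusion_prob(p_i, p_j, n):
    """P(both units are selected at least once in n draws with replacement)."""
    return (inclusion_prob(p_i, n) + inclusion_prob(p_j, n)
            - (1.0 - (1.0 - p_i - p_j) ** n))


def horvitz_thompson(y, p, n):
    """y, p: values and per-draw probabilities of the DISTINCT sampled units."""
    pi = [inclusion_prob(pk, n) for pk in p]
    tau_hat = sum(yk / pik for yk, pik in zip(y, pi))
    var_hat = sum((1 - pik) / pik**2 * yk**2 for yk, pik in zip(y, pi))
    for i, j in combinations(range(len(y)), 2):
        pij = joint_inclusion_prob(p[i], p[j], n)
        # factor 2: each unordered pair appears twice in the sum over i != j
        var_hat += 2 * (pij - pi[i] * pi[j]) / (pi[i] * pi[j] * pij) * y[i] * y[j]
    return tau_hat, var_hat, pi


# Distinct sampled islands 1, 29 and 36, from n = 4 draws
tau_hat, var_hat, pi = horvitz_thompson(y=[14, 50, 25], p=[0.01, 0.05, 0.02], n=4)
print([round(v, 4) for v in pi], round(tau_hat, 2), round(var_hat, 1))
```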
• Is there some popular estimator that can be derived as a
Horvitz-Thompson estimator?
• Yes, under simple random sampling (without
replacement), the inclusion probability of the i-th
unit is:
$$
\begin{aligned}
\pi_i &= P(\text{unit } i \text{ is included in the sample}) \\
&= \frac{\#\text{ of samples including unit } i}{\#\text{ of samples}} \\
&= \frac{\binom{N-1}{n-1}}{\binom{N}{n}} = \frac{\dfrac{(N-1)!}{(N-n)!\,(n-1)!}}{\dfrac{N!}{(N-n)!\,n!}} = \frac{\dfrac{(N-1)!}{(N-n)!\,(n-1)!}}{\dfrac{N (N-1)!}{(N-n)!\,n\,(n-1)!}} \\
&= \frac{n}{N}
\end{aligned}
$$
$$\hat{\tau}_\pi = \sum_{i=1}^{n} \frac{y_i}{\pi_i} = \sum_{i=1}^{n} \frac{y_i}{n} \cdot N = N \bar{y}$$
Coffee break!
Subsection 4
Wheat production
unit (Farm) i 1 2 3
s p(s) Sample
• Question: Compute the Hansen-Hurwitz estimator.
• Answer: When (1,1) is sampled, the Hansen-Hurwitz
estimator is:
$$\hat{\tau}_p = \frac{1}{2} \left(\frac{y_1}{p_1} + \frac{y_1}{p_1}\right) = \frac{1}{2} \left(\frac{11}{0.3} + \frac{11}{0.3}\right) = 36.67.$$
• When (1,2) is sampled, the Hansen-Hurwitz estimator is:
$$\hat{\tau}_p = \frac{1}{2} \left(\frac{y_1}{p_1} + \frac{y_2}{p_2}\right) = \frac{1}{2} \left(\frac{11}{0.3} + \frac{6}{0.2}\right) = 33.33.$$
Similarly, we can fill out the table and get the Hansen-Hurwitz
estimators as shown:
• Question: Compute the Horvitz-Thompson estimator.
• Answer:
π1 = 0.09 + 0.06 + 0.06 + 0.15 + 0.15 = 0.51,
π2 = 0.04 + 0.06 + 0.06 + 0.10 + 0.10 = 0.36,
π3 = 0.25 + 0.15 + 0.15 + 0.10 + 0.10 = 0.75.
• When (1,1) is sampled, the Horvitz-Thompson estimator
is:
$$\hat{\tau}_\pi = \frac{11}{0.51} = 21.57.$$
• When (1,2) is sampled, the Horvitz-Thompson estimator
is:
$$\hat{\tau}_\pi = \frac{11}{0.51} + \frac{6}{0.36} = 38.24.$$
Similarly, we can fill out the table and get the
Horvitz-Thompson estimators as shown below:
• From the table above we can see that both τ̂p and τ̂π are
unbiased.
• This example is a small population example to illustrate
conceptually the properties of these estimators.
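Unbiasedness can be verified by enumerating every ordered sample of size 2 drawn with replacement. In the sketch below, the selection probabilities and the values y1 = 11 and y2 = 6 come from the example; the value used for y3 is an ASSUMPTION made only so the code runs, and the unbiasedness check holds for any choice of it. The helper names are illustrative.

```python
from itertools import product

p = {1: 0.3, 2: 0.2, 3: 0.5}  # per-draw selection probabilities
y = {1: 11, 2: 6, 3: 25}      # y[3] = 25 is an assumed value for illustration
n = 2
tau = sum(y.values())

# Inclusion probability: total probability of the samples that contain unit i
pi = {i: sum(p[a] * p[b] for a, b in product(p, repeat=n) if i in (a, b))
      for i in p}


def hh_estimate(s):
    """Hansen-Hurwitz estimate for an ordered sample s (repeats count)."""
    return sum(y[i] / p[i] for i in s) / len(s)


def ht_estimate(s):
    """Horvitz-Thompson estimate for sample s (distinct units only)."""
    return sum(y[i] / pi[i] for i in set(s))


# Expected value of each estimator over all 9 ordered samples
e_hh = sum(p[a] * p[b] * hh_estimate((a, b)) for a, b in product(p, repeat=n))
e_ht = sum(p[a] * p[b] * ht_estimate((a, b)) for a, b in product(p, repeat=n))
print(pi, e_hh, e_ht, tau)
```

Both expectations equal the population total up to floating-point error, even though the two estimators give different values on individual samples such as (1,2).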
Remark 1
• What we know is:
unit                    1     2     3
Selection probability   0.3   0.2   0.5
• We draw a sample.
• If the sample we draw is (1,2) then τ̂p = 33.33 and
τ̂π = 38.24.
• We will not be able to find the real population total or
the real variance of the estimator.
• However, we will be able to estimate them.
Remark 2
Auxiliary data and ratio estimation
Unit learning outcomes
Subsection 1
Using auxiliary information
• For example consider: a national park is partitioned into
N units.
• yi = the number of animals in unit i
• xi = the size of unit i
• Another example might be where a certain city has N
bookstores.
• yi = the sales of a given book title at bookstore i
• xi = the size of the bookstore i
• A third example would be a forest that has N trees.
• yi = the volume of the tree
• xi = the diameter of the tree
Ratio estimators
• If $\tau_y = \sum_{i=1}^{N} y_i$ and $\tau_x = \sum_{i=1}^{N} x_i$, then $\frac{\tau_y}{\tau_x} = \frac{\mu_y}{\mu_x}$ and
$\tau_y = \frac{\mu_y}{\mu_x} \cdot \tau_x$.
• The ratio estimator, denoted as τ̂r , is $\hat{\tau}_r = \frac{\bar{y}}{\bar{x}} \cdot \tau_x$
• The estimator is useful in the following situations:
A. When X and Y are highly linearly correlated through the
origin, then:
Historical use
• In this case for Laplace, n = 30, and the total number of
inhabitants in these communities was 2,037,615.
• What type of information did the government already
have?
• Laplace looked for auxiliary information to help him and found
good records of the number of registered births.
• It turns out that the total number of registered births for
the 30 communities that he had selected was 71,866.33.
• Dividing 2,037,615 by 71,866.33, he estimated that there
is one registered birth for every 28.35 persons.
• Therefore, he estimated the total population by the total
number of annual births × 28.35
• Rationale: Communities with larger populations are
likely to have a larger number of registered births.
• This is an example of an early use of ratio estimation.
Example 1: apple juice from apples
• For a juice company, the price they are paid for apples in
large shipments is based on the amount of apple juice
from the load.
217/468
Eurostat
Example 1: apple juice from apples
• For a juice company, the price they are paid for apples in
large shipments is based on the amount of apple juice
from the load.
• Therefore, we need to determine the amount of apple
juice in the whole load prior to extraction.
217/468
Eurostat
Example 1: apple juice from apples
• For a juice company, the price they are paid for apples in
large shipments is based on the amount of apple juice
from the load.
• Therefore, we need to determine the amount of apple
juice in the whole load prior to extraction.
• We can sample n apples and find y1 , ..., yn , the amount
of apple juice in those apples.
217/468
Eurostat
Example 1: apple juice from apples
• For a juice company, the price they are paid for apples in
large shipments is based on the amount of apple juice
from the load.
• Therefore, we need to determine the amount of apple
juice in the whole load prior to extraction.
• We can sample n apples and find y1 , ..., yn , the amount
of apple juice in those apples.
• N ȳ is hard to get in this case because N is hard to count.
• How could we measure this?
• The total weight would be a good idea and easy to get.
• We will use the relationship between weight of the load
and the weight of the apple juice one obtains.
• The juice weight y is related to x, the weight of each apple in the
sample, and the total weight τx is easy to get for the entire
shipment.
Ratio estimator for τ
$$\hat{\tau}_r = \frac{\bar{y}}{\bar{x}} \cdot \tau_x$$
• For this example, N is unknown and we cannot use N ȳ.
• One can see that if the condition for using the ratio
estimator is satisfied and N is known, this ratio estimator
may actually work better than N ȳ.
Ratio estimator for µ
$$\hat{\mu}_r = \frac{\bar{y}}{\bar{x}} \cdot \mu_x.$$
• It turns out that this estimator is not unbiased.
• Note that τ̂r is not unbiased for τy and µ̂r is not unbiased
for µy, but they are approximately unbiased for large
samples under simple random sampling.
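Properties
$$\mathrm{Var}(\hat{\mu}_r) \approx \frac{N-n}{N} \cdot \frac{\sigma_r^2}{n}.$$
• How can we compute σr²? It is defined as
$$\sigma_r^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left( y_i - \frac{\tau_y}{\tau_x} \cdot x_i \right)^2.$$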
• When we want to estimate σr², we use this formula:
$$s_r^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( y_i - \frac{\bar{y}}{\bar{x}} \cdot x_i \right)^2.$$
• Given all of this, when do we know that the estimate µ̂r is
good?
• We can compare it to:
$$\mathrm{Var}(\bar{y}) = \frac{N-n}{N} \cdot \frac{\sigma^2}{n}.$$
• µ̂r will perform better if σr² < σ².
• That is the case for populations in which the y's and x's are
highly correlated, with a roughly linear relationship
through the origin.
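• An approximate 100(1 − α)% CI for µy is
$$\hat{\mu}_r \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\mu}_r)}.$$
• For τy,
$$\hat{\tau}_r = N \hat{\mu}_r = \frac{\bar{y}}{\bar{x}} \cdot \tau_x,$$
and
$$\widehat{\mathrm{Var}}(\hat{\tau}_r) = N \cdot (N-n) \cdot \frac{s_r^2}{n}.$$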
Back to apple juice example
Sum of (yi − rxi)²: 0.004536
• ID identifies the sampled apple.
• yi is the weight of the apple's juice in lbs.
• xi is the weight of the apple in lbs.
• yi − rxi is the observed y value minus the estimated y value, and
• (yi − rxi)² is that difference squared.
• Total apple juice weight is 2.85 lbs. (mean = 0.19 lbs.)
• Total apple weight is 4.32 lbs. (mean = 0.288 lbs.)
• Is it appropriate to use the ratio estimate?
• The scatter plot of the data shows a linear relationship
between y and x variables.
[Figure: scatter plot of y against x]
Moreover, the regression analysis is consistent with a regression
line through the origin: the intercept is not significantly different
from zero (p-value of constant = 0.659 > 0.05). Therefore, it appears
appropriate to use the ratio estimate.
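• The ratio estimate of the total juice weight is
$$\hat{\tau}_r = r\,\tau_x = \frac{0.190}{0.288} \times 2000 = 1319.44.$$
• The residual variance is estimated by
$$s_r^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - r x_i)^2 = \frac{1}{14}\left[(0.16 - 0.6597 \times 0.22)^2 + \dots + (0.22 - 0.6597 \times 0.35)^2\right] = \frac{0.004536}{14} \approx 0.000324.$$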
• An approximate 95% CI for τ is then:
$$1319.44 \pm z_{1-\alpha/2}\,\widehat{\mathrm{SD}}(\hat{\tau}_r)$$
Estimation for ratio
• For example, sociologists are interested in ratios such as
the monthly food budget compared to the monthly
income per family.
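• The sample ratio is the estimate for R, and:
$$r = \frac{\bar{y}}{\bar{x}}$$
$$\mathrm{Var}(r) \approx \frac{N-n}{N\mu_x^2} \cdot \frac{\sigma_r^2}{n}$$
$$\widehat{\mathrm{Var}}(r) \approx \frac{N-n}{N\mu_x^2} \cdot \frac{s_r^2}{n}$$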
Subsection 2
• The goal is to estimate the average number of trees per
acre on a 1000-acre plantation
• The investigator samples 10 one-acre plots by simple
random sampling and counts the number of trees (y ) on
each plot.
• He also has aerial photographs of the plantation from
which he can estimate the number of trees (x) on each
plot of the entire plantation.
• Hence, he knows µx = 19.7 and since the two counts are
approximately proportional through the origin, he uses a
ratio estimate to estimate µy .
Plot   yi      xi (aerial estimate)   yi − rxi
1      25      23                      0.5625
2      15      14                      0.1250
3      22      20                      0.7500
4      24      25                     −2.5625
5      13      12                      0.2500
6      18      18                     −1.1250
7      35      30                      3.1250
8      30      27                      1.3125
9      10       8                      1.5000
10     29      31                     −3.9375
Mean   22.10   20.80
Here r = ȳ/x̄ = 22.10/20.80 = 1.0625.
Here is a scatterplot of this data:
[Figure: scatter plot of actual tree count y against aerial estimate x]
And, here is the R output for regression:
• The scatter plot of the data shows a linear relationship
between y and x.
• Moreover, the regression analysis is consistent with a
regression line through the origin: the intercept is not
significantly different from zero (p-value of constant = 0.554 > 0.05).
• Therefore, it may be appropriate to use the ratio estimate.
• Estimating the number of trees per acre
• N = 1000 (plantation size)
• n = 10 (taken by s.r.s.)
• yi = the actual count of trees in the 1 acre plots,
i = 1, 2, ..., 10.
• xi = the aerial estimate for each plot
• ȳ = 22.10
• x̄ = 20.80
• µx is given to be 19.70
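$$\hat{\mu}_r = \frac{\bar{y}}{\bar{x}} \cdot \mu_x = \frac{22.10}{20.80} \cdot 19.70 = 20.93,$$
$$s_r^2 = \frac{1}{10-1}\sum_{i=1}^{10}\left(y_i - \frac{22.10}{20.80}\,x_i\right)^2 = 4.2,$$
$$\widehat{\mathrm{Var}}(\hat{\mu}_r) = \frac{N-n}{N} \cdot \frac{s_r^2}{n} = \frac{1000-10}{1000} \cdot \frac{4.2}{10} = 0.4158,$$
$$\widehat{\mathrm{SD}}(\hat{\mu}_r) = \sqrt{0.4158} = 0.6448$$
The approximate 95% confidence interval for µy is:
$$\hat{\mu}_r \pm z_{0.975} \cdot \widehat{\mathrm{SD}}(\hat{\mu}_r) = 20.93 \pm 1.96 \cdot 0.6448 = 20.93 \pm 1.26$$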
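The plantation calculation can be reproduced with a few lines of Python. This is only a sketch of the formulas above applied to the ten sampled plots; the variable names are illustrative and not part of the course material.

```python
import math

# Tree counts from the plantation table: y = field count, x = aerial estimate.
y = [25, 15, 22, 24, 13, 18, 35, 30, 10, 29]
x = [23, 14, 20, 25, 12, 18, 30, 27, 8, 31]
N = 1000      # number of one-acre plots on the plantation
mu_x = 19.7   # known population mean of the aerial estimates

n = len(y)
y_bar = sum(y) / n
x_bar = sum(x) / n
r = y_bar / x_bar                        # sample ratio

mu_r = r * mu_x                          # ratio estimate of the mean
# Residual variance around the line through the origin with slope r.
s_r2 = sum((yi - r * xi) ** 2 for yi, xi in zip(y, x)) / (n - 1)
var_mu_r = (N - n) / N * s_r2 / n        # estimated variance of mu_r
half_width = 1.96 * math.sqrt(var_mu_r)  # 95% margin of error

print(f"r = {r:.4f}, mu_r = {mu_r:.2f}, s_r^2 = {s_r2:.2f}")
print(f"95% CI: {mu_r:.2f} +/- {half_width:.2f}")
```

Because the code keeps the unrounded value of s_r² (about 4.23 rather than 4.2), the margin of error comes out near 1.27 instead of the 1.26 obtained with the rounded figure.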
• To find the sample size needed to estimate µy when the
ratio estimator is used:
• Let d denote the margin of error of the 100(1 − α)%
confidence interval for µy.
• Then we know that:
$$z_{1-\alpha/2} \cdot \sqrt{\frac{N-n}{N} \cdot \frac{s_r^2}{n}} = d.$$
• Thus, the formula to compute the required sample size is:
$$n = \frac{N \cdot z_{1-\alpha/2}^2 \cdot s_r^2}{z_{1-\alpha/2}^2 \cdot s_r^2 + N d^2}$$
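As a rough illustration, the sample-size formula can be wrapped in a small helper. The function name and the target margin d = 1.0 are hypothetical choices; N = 1000 and s_r² = 4.2 are taken from the plantation example.

```python
import math

def ratio_sample_size(N, s_r2, d, z=1.96):
    """Smallest n for which z * sqrt((N - n)/N * s_r2/n) is at most d."""
    n = N * z**2 * s_r2 / (z**2 * s_r2 + N * d**2)
    return math.ceil(n)  # round up to a whole number of units

# Hypothetical target: margin of error of 1 tree per acre at 95% confidence.
n_needed = ratio_sample_size(N=1000, s_r2=4.2, d=1.0)
print(n_needed)
```

Rounding up rather than to the nearest integer keeps the achieved margin of error at or below d.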
• This is an artificial small-population example that we will
use to demonstrate how to compute the bias and MSE of
the ratio estimator.
Site i        1     2     3     4
Nets, xi      4     5     8     5
Fishes, yi    200   300   500   400
• τx = 22, τy = 1400.
• Samples (s.r.s.): n = 2.
Samples   τ̂r = (ȳ/x̄) · τx
(1,2)     $\hat{\tau}_r = \frac{(200+300)/2}{(4+5)/2} \cdot 22 = 1222$
(1,3)     $\hat{\tau}_r = \frac{(200+500)/2}{(4+8)/2} \cdot 22 = 1283$
(1,4)     1467
(2,3)     1354
(2,4)     1540
(3,4)     1523
$$E(\hat{\tau}_r) = \frac{1}{6} \cdot 1222 + \frac{1}{6} \cdot 1283 + \frac{1}{6} \cdot 1467 + \frac{1}{6} \cdot 1354 + \frac{1}{6} \cdot 1540 + \frac{1}{6} \cdot 1523 = 1398.17 \neq \tau_y = 1400$$
Thus, there is a very slight bias.
$$\mathrm{MSE}(\hat{\tau}_r) = \sum_{s=1}^{6} (\hat{\tau}_{r,s} - \tau)^2 \cdot P(s)$$
$$= \frac{1}{6}\left[(1222-1400)^2 + (1283-1400)^2 + (1467-1400)^2 + (1354-1400)^2 + (1540-1400)^2 + (1523-1400)^2\right] = 14{,}451.2$$
For the estimator τ̂ = N ȳ, which does not use x:
$$E(\hat{\tau}) = \frac{1}{6} \cdot 1000 + \frac{1}{6} \cdot 1400 + \frac{1}{6} \cdot 1200 + \frac{1}{6} \cdot 1600 + \frac{1}{6} \cdot 1400 + \frac{1}{6} \cdot 1800 = 1400, \text{ unbiased.}$$
$$\mathrm{MSE}(\hat{\tau}) = \sum_{s=1}^{6} (\hat{\tau}_s - \tau)^2 \cdot P(s)$$
$$= \frac{1}{6}\left[(1000-1400)^2 + (1400-1400)^2 + (1200-1400)^2 + (1600-1400)^2 + (1400-1400)^2 + (1800-1400)^2\right] = 66{,}667$$
66,667 is much larger than the MSE of τ̂r.
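The bias and MSE comparison can be checked by enumerating all six possible samples. The sketch below assumes only the four-site population above; helper names such as `expectation` and `mse` are illustrative.

```python
from itertools import combinations

x = [4, 5, 8, 5]            # nets per site
y = [200, 300, 500, 400]    # fish per site
N, n = len(y), 2
tau_x, tau_y = sum(x), sum(y)

samples = list(combinations(range(N), n))  # all 6 equally likely samples
p = 1 / len(samples)                       # P(s) under s.r.s.

# Ratio estimator (y-bar / x-bar) * tau_x and expansion estimator N * y-bar.
ratio_est = [sum(y[i] for i in s) / sum(x[i] for i in s) * tau_x for s in samples]
expansion_est = [N * sum(y[i] for i in s) / n for s in samples]

def expectation(values):
    return sum(v * p for v in values)

def mse(values, target):
    return sum((v - target) ** 2 * p for v in values)

e_ratio, e_exp = expectation(ratio_est), expectation(expansion_est)
mse_ratio, mse_exp = mse(ratio_est, tau_y), mse(expansion_est, tau_y)

print(f"ratio:     E = {e_ratio:.2f}, MSE = {mse_ratio:.1f}")
print(f"expansion: E = {e_exp:.2f}, MSE = {mse_exp:.1f}")
```

The code works with unrounded estimates, so its MSE for the ratio estimator (roughly 14,420) differs slightly from the hand calculation, which rounds each estimate to an integer first. The conclusion is unchanged: the ratio estimator is slightly biased but has a much smaller MSE.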
Auxiliary data and regression
estimation
Unit learning outcomes
Subsection 1
The idea behind regression estimation
• When the auxiliary variable x is linearly related to y but
the regression line does not pass through the origin, a linear
regression estimator would be appropriate.
• In addition, if multiple auxiliary variables have a linear
relationship with y , multiple regression estimates may be
appropriate.
• To estimate the mean and total of y -values, denoted as µ
and τ , one can use the linear relationship between y and
known x-values.
• Let us start with a simple example:
ŷ = a + bx,
is our basic regression equation.
• Then,
$$b = \frac{s_{xy}}{s_x^2} \quad \text{and} \quad a = \bar{y} - b\bar{x}.$$
• Then to estimate the mean for y , µ̂L , substitute as
follows, x = µx , a = ȳ − bx̄, then
ŷ = a + bx
µ̂L = a + bµx
µ̂L = (ȳ − bx̄) + bµx
µ̂L = ȳ + b(µx − x̄)
• Thus, the mean square error of µ̂L is roughly estimated
by:
$$\widehat{\mathrm{Var}}(\hat{\mu}_L) = \frac{N-n}{N \times n} \cdot \frac{\sum_{i=1}^{n}(y_i - a - b x_i)^2}{n-2} = \frac{N-n}{N \times n} \cdot \mathrm{MSE}$$
where MSE is the MSE of the linear regression model of
y on x.
• Therefore, an approximate (1 − α)100% CI for µ is:
$$\hat{\mu}_L \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\mu}_L)}$$
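• It follows that:
$$\widehat{\mathrm{Var}}(\hat{\tau}_L) = N^2\,\widehat{\mathrm{Var}}(\hat{\mu}_L) = \frac{N \times (N-n)}{n} \cdot \mathrm{MSE}$$
• And an approximate (1 − α)100% CI for τ is:
$$\hat{\tau}_L \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\tau}_L)}$$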
Example
[Figure: scatter plot of calculus score (Y) against test score]
Student Test score (xi ) Calculus score (yi )
1 39 65
2 43 78
3 21 52
4 64 82
5 57 92
6 47 89
7 28 73
8 75 98
9 34 56
10 52 75
Mean 46 76
• Using the results from the R output here, what do you get
for the regression estimate?
• ANSWER:
µ̂L = ȳ + b(µx − x̄)
= 76 + 0.766 × (52 − 46)
= 80.6
• The R output provides us with p-values for the constant
and the coefficient of X .
• We can see that both terms are significant.
• The ratio estimate is not appropriate since the constant term
is significantly different from zero.
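The regression estimate and its margin of error can be recomputed directly from the ten student records. The sketch below assumes N = 486 and µx = 52, the values used in the worked calculations of this example; variable names are illustrative.

```python
import math

x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]   # test scores
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]   # calculus scores
N, mu_x = 486, 52    # population size and known mean test score

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares slope b = s_xy / s_x^2 and intercept a = y-bar - b * x-bar.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
b = s_xy / s_x2
a = y_bar - b * x_bar

mu_L = y_bar + b * (mu_x - x_bar)              # regression estimate of the mean
mse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
var_mu_L = (N - n) / (N * n) * mse
half_width = 1.96 * math.sqrt(var_mu_L)

print(f"b = {b:.3f}, mu_L = {mu_L:.1f}, 95% CI: +/- {half_width:.2f}")
```

The residual mean square is computed by hand here instead of being read off regression output, so the example has no dependency beyond the standard library.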
Subsection 2
Comparison of estimators
• To compare the regression estimate to the estimate ȳ
(which does not use the auxiliary variable x), we see that:
$$\widehat{\mathrm{Var}}(\bar{y}) = \frac{N-n}{N} \cdot \frac{s^2}{n}.$$
• s² for the y values is (15.11)².
• What is the estimated Var(ȳ)?
$$\widehat{\mathrm{Var}}(\bar{y}) = \frac{486-10}{486 \times 10} \cdot (15.11)^2 = 22.36$$
• Next, what is an approximate 95% CI for µ?
$$\bar{y} \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\bar{y})} = 76 \pm 1.96 \times \sqrt{22.36} = 76 \pm 9.27.$$
• Recall: the 95% confidence interval using the regression
estimate is 80.6 ± 5.34, a much shorter confidence
interval.
• This regression estimate is more precise than ȳ.
• Additionally, we have another estimator that we can look
at: µ̂r .
• Compare µ̂L to the ratio estimator µ̂r
• Next table contains the mean and standard deviation for
X and Y .
Student Test score (xi ) Calculus score (yi ) yi − rxi
1 39 65 0.565
2 43 78 6.957
3 21 52 17.304
4 64 82 -23.739
5 57 92 -2.174
6 47 89 11.348
7 28 73 26.739
8 75 98 -25.913
9 34 56 -0.174
10 52 75 -10.913
Mean 46 76
Std. deviation 16.58 15.11
sr2 283.42
• The ratio estimate is inappropriate for this example.
• However, as a counterexample, we can compute the
variance of the ratio estimate using the data in the
previous table and compare it to that of the regression
estimate.
Note
• Next, we need to figure out the variance, and for this we
need the MSE under the ratio estimate. From the
previous table,
$$s_r^2 = \frac{1}{10-1}\sum_{i=1}^{10}(y_i - r x_i)^2 = 283.42,$$
which is huge!
• Now we can compute the variance:
$$\widehat{\mathrm{Var}}(\hat{\mu}_r) = \frac{N-n}{N} \cdot \frac{s_r^2}{n} = \frac{486-10}{486} \cdot \frac{283.42}{10} = 27.75$$
• Now we can compute a 95% confidence interval for µ:
$$\hat{\mu}_r \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\mu}_r)} = 85.91 \pm 1.96 \times \sqrt{27.75} = 85.91 \pm 10.32$$
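To see the comparison side by side, the sample mean and the ratio estimator can be computed from the same ten records. This is a sketch assuming N = 486 and µx = 52 as in the worked example; the names are illustrative.

```python
import math

x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]   # test scores
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]   # calculus scores
N, mu_x = 486, 52

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
fpc = (N - n) / N                               # finite population correction

# Sample mean: ignores the auxiliary variable x.
s2 = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
var_ybar = fpc * s2 / n

# Ratio estimator: forces the fitted line through the origin.
r = y_bar / x_bar
mu_r = r * mu_x
s_r2 = sum((yi - r * xi) ** 2 for xi, yi in zip(x, y)) / (n - 1)
var_mu_r = fpc * s_r2 / n

for name, est, var in [("y-bar", y_bar, var_ybar), ("ratio", mu_r, var_mu_r)]:
    print(f"{name}: {est:.2f} +/- {1.96 * math.sqrt(var):.2f}")
```

Here the ratio estimator has a larger estimated variance than the plain sample mean, which illustrates why it should not be used when the regression line clearly misses the origin.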
Stratified sampling
Some important information on this unit
Introduction
• For example, geographical regions can be stratified into
similar regions by means of some known variable such as
habitat type, elevation or soil type.
• Another example might be to determine the proportions
of defective products being assembled in a factory. In this
case sampling may be stratified by production lines,
factory, etc.
• Can you think of a couple additional examples where
stratified sampling would make sense?
• The principal reasons for using stratified random sampling
rather than simple random sampling include:
1. Stratification may produce a smaller error of estimation
than would be produced by a simple random sample of
the same size. This result is particularly true if
measurements within strata are very homogeneous.
2. The cost per observation in the survey may be reduced
by stratification of the population elements into
convenient groupings.
3. Estimates of population parameters may be desired for
subgroups of the population. These subgroups should
then be identified.
Example
• There are 155 households in town A, 62 in town B and 93
in the rural area, C.
• The firm decides to select 20 households from Town A, 8
households from Town B and 12 households from the
rural area.
• The data are given in the following table:
Town A 35,43,36,39,28,28,29,25,38,27
26,32,29,40,35,41,37,31,45,34
Town B 27,15,4,41,49,25,10,30
• Usually a sample is selected by some probability design
from each of the L strata in the population, with
selections in different strata independent of each other.
• The special case where from each stratum a simple
random sample is drawn is called a stratified random
sample.
• Does it make sense to use a stratified random sample for
this problem?
• Why or why not?
• Yes, for all three reasons listed above.
• Notation
• L: the number of strata
• Nh: the number of units in stratum h
• nh: the number of units sampled from stratum h
• N: the total number of units in the population, i.e.,
N = N1 + N2 + ... + NL
• For our “Watching TV” example, the values are:
L = 3, N1 = 155, N2 = 62, N3 = 93,
N = 155 + 62 + 93 = 310.
Some results are given in the following table:
Estimating the population total
$$\hat{\tau}_{st} = \sum_{h=1}^{L} \hat{\tau}_h.$$
• The formulas are computed differently according to the
sampling scheme within each stratum.
• For stratified random sampling, i.e., taking a simple random
sample within each stratum:
$$\hat{\tau}_h = N_h \bar{y}_h,$$
$$\widehat{\mathrm{Var}}(\hat{\tau}_{st}) = \sum_{h=1}^{L} N_h \cdot (N_h - n_h) \cdot \frac{s_h^2}{n_h},$$
$$s_h^2 = \frac{1}{n_h - 1} \sum_{i=1}^{n_h} (y_{hi} - \bar{y}_h)^2.$$
• You can see that this turns out to be pretty easy to remember,
and one can easily obtain the estimates for the population
mean:
$$\hat{\mu}_{st} = \frac{\hat{\tau}_{st}}{N},$$
$$\widehat{\mathrm{Var}}(\hat{\mu}_{st}) = \frac{1}{N^2}\,\widehat{\mathrm{Var}}(\hat{\tau}_{st}).$$
Estimating the population mean
Example: estimating the mean
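The overall variance of the estimator of the mean for this example
is:
$$\widehat{\mathrm{Var}}(\bar{y}_{st}) = \sum_{h=1}^{3} \left(\frac{N_h}{N}\right)^2 \frac{N_h - n_h}{N_h} \cdot \frac{s_h^2}{n_h}$$
$$= \frac{1}{(310)^2}\left[(155)^2 \cdot \frac{155-20}{155} \cdot \frac{(5.95)^2}{20} + (62)^2 \cdot \frac{62-8}{62} \cdot \frac{(15.25)^2}{8} + (93)^2 \cdot \frac{93-12}{93} \cdot \frac{(9.36)^2}{12}\right]$$
$$= 1.97$$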
Example: estimating the population total
$$\widehat{\mathrm{Var}}(\hat{\tau}_{st}) = N^2\,\widehat{\mathrm{Var}}(\bar{y}_{st})$$
Example: confidence intervals
• What are the degrees of freedom for the t quantile used in this
formula for the confidence interval?
• Intuitively we would want this to be
(n1 − 1) + (n2 − 1) + ... + (nL − 1), and this is correct
when the variances of all strata are the same.
• But when this is not the case and we cannot pool the
degrees of freedom, we need to use the Satterthwaite
approximation for the degrees of freedom, as follows:
$$d = \left(\sum_{h=1}^{L} a_h s_h^2\right)^2 \Big/ \sum_{h=1}^{L} \frac{(a_h s_h^2)^2}{n_h - 1}, \quad \text{where } a_h = \frac{N_h (N_h - n_h)}{n_h}.$$
• In particular, when the Nh are all equal, the nh are all equal and
the sh² are all equal, the d.f. = n − L.
For the TV example:
$$d = \frac{(a_1 s_1^2 + a_2 s_2^2 + a_3 s_3^2)^2}{\dfrac{(a_1 s_1^2)^2}{n_1 - 1} + \dfrac{(a_2 s_2^2)^2}{n_2 - 1} + \dfrac{(a_3 s_3^2)^2}{n_3 - 1}}$$
$$= \frac{\left(1046.25 \cdot (5.95)^2 + 418.5 \cdot (15.25)^2 + 627.75 \cdot (9.36)^2\right)^2}{\dfrac{(1046.25 \cdot (5.95)^2)^2}{20 - 1} + \dfrac{(418.5 \cdot (15.25)^2)^2}{8 - 1} + \dfrac{(627.75 \cdot (9.36)^2)^2}{12 - 1}}$$
$$= 21.09$$
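The stratified variance and the Satterthwaite degrees of freedom can both be computed from the stratum sizes, sample sizes and standard deviations quoted for the TV example. The following is a sketch with illustrative variable names.

```python
N_h = [155, 62, 93]          # stratum sizes: Town A, Town B, rural area
n_h = [20, 8, 12]            # stratum sample sizes
s_h = [5.95, 15.25, 9.36]    # stratum sample standard deviations
N = sum(N_h)

# a_h = N_h (N_h - n_h) / n_h, so Var-hat(tau_st) = sum of a_h * s_h^2.
a_h = [Nh * (Nh - nh) / nh for Nh, nh in zip(N_h, n_h)]
terms = [a * s ** 2 for a, s in zip(a_h, s_h)]

var_tau = sum(terms)         # estimated variance of the stratified total
var_mean = var_tau / N ** 2  # estimated variance of the stratified mean

# Satterthwaite approximation for the degrees of freedom.
df = sum(terms) ** 2 / sum(t ** 2 / (nh - 1) for t, nh in zip(terms, n_h))

print(f"a_h = {a_h}")
print(f"Var(mean) = {var_mean:.2f}, Var(total) = {var_tau:.0f}, d.f. = {df:.2f}")
```

Small differences from hand-calculated figures can arise because the standard deviations are rounded to two decimals.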
• Provide a 95% CI for µ and also a 95% CI for τ.
• ANSWER:
• We will use t with df = 21; hence a 95% CI for µ is:
$$\bar{y}_{st} \pm t_{(21;\,1-\alpha/2)} \sqrt{\widehat{\mathrm{Var}}(\bar{y}_{st})} = 27.7 \pm 2.08 \times \sqrt{1.97} = 27.7 \pm 2.91$$
Similarly, a 95% CI for τ is:
$$\hat{\tau}_{st} \pm t_{(21;\,1-\alpha/2)} \sqrt{\widehat{\mathrm{Var}}(\hat{\tau}_{st})} = 8587 \pm 2.08 \times \sqrt{189278.56} = 8587 \pm 902.32$$
Subsection 2
Stratification principle
• For example, to estimate the average starting income for
recent young workers, it would make sense to stratify by
age group since the starting income for young workers of
the same age would be similar.
• Check the stratification principle in the following slides
Example: stratification principle
• The population variance, σ², can be decomposed as:
$$\sigma^2 = \sigma^2_{within} + \sigma^2_{between}$$
where
• $$\sigma^2_{within} = \sum_{h=1}^{L} \frac{N_h}{N}\,\sigma_h^2$$
• $$\sigma^2_{between} = \sum_{h=1}^{L} \frac{N_h}{N}\,(\mu_h - \mu)^2$$
• In the first stratification scheme (U):
  • σ²_within = 0.57 (4% of σ²)
  • σ²_between = 13.86 (96% of σ²)
• In the second stratification scheme (U∗):
  • σ²_within = 13.87 (96% of σ²)
  • σ²_between = 0.56 (4% of σ²)
• When a population is stratified, the total variance (σ²) is
decomposed into a variance component within strata
(σ²_within) and a component between strata (σ²_between).
• These examples show that, although the total variance in
the population is a fixed value, different stratification
schemes result in different decompositions into σ²_within and
σ²_between.
• An indicator of how the total variance is split is the
correlation ratio
$$\eta^2 = \frac{\sigma^2_{between}}{\sigma^2}.$$
• Hence, in the first stratification scheme, η² = 0.96 shows
that the variance between strata is 96% of the total
variance of the population.
• The variance within strata is small. This means that the
strata are very homogeneous.
• In the second stratification scheme, η² = 0.04. In this case
the variance between strata represents only 4% of the
total variance.
• The variance within strata represents the remaining 96%.
• These strata are much more heterogeneous (within) and
more similar to each other.
• We can conclude that the first stratification scheme is
better, since the estimation accuracy is higher when
strata are more homogeneous (within).
• In fact, the closer the correlation ratio is to 1, the more
homogeneous the strata are and the more accurate the
estimation is.
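The decomposition and the correlation ratio are easy to verify numerically. The sketch below uses a made-up population of nine values (hypothetical, not the data behind schemes U and U∗) split into strata in two different ways. It assumes that σ² and σh² are population variances with divisors N and Nh, in which case the decomposition holds exactly.

```python
from statistics import fmean, pvariance

def decompose(strata):
    """Return (total, within, between) variance for a list of strata."""
    values = [v for stratum in strata for v in stratum]
    N, mu = len(values), fmean(values)
    within = sum(len(s) / N * pvariance(s) for s in strata)
    between = sum(len(s) / N * (fmean(s) - mu) ** 2 for s in strata)
    return pvariance(values), within, between

# Hypothetical population 1, 2, 3, 7, 8, 9, 13, 14, 15 stratified two ways.
homogeneous = [[1, 2, 3], [7, 8, 9], [13, 14, 15]]   # similar values together
mixed = [[1, 8, 15], [2, 7, 14], [3, 9, 13]]         # each stratum spans the range

for name, strata in [("homogeneous", homogeneous), ("mixed", mixed)]:
    total, within, between = decompose(strata)
    eta2 = between / total                           # correlation ratio
    print(f"{name}: within = {within:.2f}, between = {between:.2f}, eta^2 = {eta2:.3f}")
```

Both schemes share the same total variance. Only the split between the within and between components changes, which is what the correlation ratio summarises.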
Allocation in stratified random sampling
• If we don’t have all this information, but we know the total number, we can use a simplistic allocation.
• This is a proportional allocation that will maintain a steady sampling fraction throughout the population:
$$n_h = n \cdot \frac{N_h}{N}.$$
• This does not take into consideration the variability within each stratum and is not the optimal choice.
• If the cost of sampling from each stratum is the same, then the optimal allocation (the allocation with the lowest variance) is:
$$n_h = n \cdot \frac{N_h \sigma_h}{\sum_{k=1}^{L} N_k \sigma_k}.$$
• However, if the cost of sampling differs from stratum to stratum and the total cost is
$$c = c_0 + c_1 n_1 + c_2 n_2 + \dots + c_L n_L,$$
then the optimal allocation is
$$n_h = n \cdot \frac{N_h \sigma_h / \sqrt{c_h}}{\sum_{k=1}^{L} N_k \sigma_k / \sqrt{c_k}}.$$
• Remarks:
• The sample size is directly proportional to $N_h$ and $\sigma_h$, i.e., allocate a larger sample size to the larger and more variable stratum.
• The sample size is inversely proportional to $\sqrt{c_h}$, i.e., this allocates smaller sample sizes to the more expensive stratum.
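The two remarks can be illustrated with a short sketch. It assumes allocation weights of the form $N_h \sigma_h / \sqrt{c_h}$; the function name `optimal_allocation` and the two-stratum inputs are hypothetical.

```python
from math import sqrt

def optimal_allocation(n, sizes, sigmas, costs=None):
    """Split n across strata with n_h proportional to N_h * sigma_h / sqrt(c_h)."""
    if costs is None:
        costs = [1.0] * len(sizes)  # equal costs: weights reduce to N_h * sigma_h
    weights = [N * s / sqrt(c) for N, s, c in zip(sizes, sigmas, costs)]
    total = sum(weights)
    return [n * w / total for w in weights]

# Hypothetical strata: same size and spread, but stratum 2 costs four times as much.
alloc = optimal_allocation(30, sizes=[100, 100], sigmas=[10, 10], costs=[1, 4])
```

With these made-up inputs the more expensive stratum receives half as many units as the cheaper one, because the weight falls with the square root of the cost.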
• In order to use the optimal allocation, one must be able to estimate $\sigma_h$.
• Let’s take a look at this in the context of the TV example...
Back to TV example
• Optimal allocation:
$$n_h = n \cdot \frac{N_h \sigma_h}{\sum_{k=1}^{L} N_k \sigma_k},$$
where
• $N_1 \sim 155$, $\sigma_1 \sim 5$
• $N_2 \sim 62$, $\sigma_2 \sim 15$
• $N_3 \sim 93$, $\sigma_3 \sim 10$
• Then,
$$n_1 = \frac{40 \times 155 \times 5}{155 \times 5 + 62 \times 15 + 93 \times 10} = 11.7647,$$
$$n_2 = \frac{40 \times 62 \times 15}{155 \times 5 + 62 \times 15 + 93 \times 10} = 14.1176,$$
$$n_3 = \frac{40 \times 93 \times 10}{155 \times 5 + 62 \times 15 + 93 \times 10} = 14.1176.$$
• Thus we will choose $n_1 = 12$, $n_2 = 14$ and $n_3 = 14$.
• Remember, it is important that $n_1 + n_2 + n_3 = 40$ in this case.
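The arithmetic of the TV example can be reproduced in a few lines. This is a minimal sketch using the stratum sizes and standard deviations quoted above; the variable names are illustrative, and simple rounding is assumed to be acceptable because the rounded values happen to add up to 40 here.

```python
sizes = [155, 62, 93]    # N_h for the three strata
sigmas = [5, 15, 10]     # approximate sigma_h for the three strata
n = 40                   # total sample size

# Optimal allocation with equal costs: n_h proportional to N_h * sigma_h.
weights = [N * s for N, s in zip(sizes, sigmas)]
exact = [n * w / sum(weights) for w in weights]
rounded = [round(x) for x in exact]

# Proportional allocation for comparison: n_h proportional to N_h only.
proportional = [n * N / sum(sizes) for N in sizes]
```

In general, rounding each stratum separately does not guarantee that the sizes sum to n, so the total should always be checked, as the slide reminds us.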
Subsection 3
Post-stratification
• Sometimes, we would like to stratify on a key variable but cannot place the units into their correct strata until the units are sampled.
• For instance, in a telephone interview the respondents cannot be placed into a male or female stratum until after the respondent is contacted.
• Post-stratification, i.e., stratification after the selection of a sample, is often appropriate when a simple random sample does not properly reflect the representation of the strata in the population.
• Here is an example.
Example
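• This is obviously not balanced with respect to gender and is likely an underestimate due to the underrepresentation of males in the data.
• How can we account for this?
• In the population, $N_1/N = 0.5$ and $N_2/N = 0.5$.
• Thus, the post-stratified estimate weights each stratum's sample mean by its population share: $\bar{y}_{st} = 0.5\,\bar{y}_1 + 0.5\,\bar{y}_2$.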
Post-stratification estimator variance
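More specifically,
$$\mathrm{Var}(\text{post-stratified } \bar{y}) \approx \frac{N-n}{nN} \sum_{h=1}^{L} \frac{N_h}{N}\,\sigma_h^2 + \frac{1}{n^2}\,\frac{N-n}{N-1} \sum_{h=1}^{L} \frac{N-N_h}{N}\,\sigma_h^2.$$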
Example
$n_1 = 70$, $n_2 = 30$
$\bar{y}_1 = 520$, $\bar{y}_2 = 280$
$s_1 = 210$, $s_2 = 90$
• Compute the post-stratified mean.
• ANSWER:
$$\bar{y}_{st} = \frac{N_1}{N}\,\bar{y}_1 + \frac{N_2}{N}\,\bar{y}_2 = 0.4 \times 520 + 0.6 \times 280 = 376$$
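• Compute the variance of the post-stratified mean.
• ANSWER:
$$\widehat{\mathrm{Var}}(\text{post-stratified } \bar{y}) \approx \frac{1}{n}\left[\frac{N_1}{N}\,s_1^2 + \frac{N_2}{N}\,s_2^2\right] + \frac{1}{n^2}\left[\left(1-\frac{N_1}{N}\right)s_1^2 + \left(1-\frac{N_2}{N}\right)s_2^2\right]$$
$$= \frac{1}{100}\left[0.4 \times (210)^2 + 0.6 \times (90)^2\right] + \frac{1}{100^2}\left[0.6 \times (210)^2 + 0.4 \times (90)^2\right]$$
$$= 225 + 2.97 = 227.97$$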
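The two answers above can be reproduced with a small sketch. It uses the simplified variance formula of this example (finite population corrections ignored); the function names are illustrative, not part of any library.

```python
def post_stratified_mean(weights, means):
    """Weight each stratum sample mean by its population share N_h / N."""
    return sum(w * m for w, m in zip(weights, means))

def post_stratified_variance(n, weights, sds):
    """Approximate variance: (1/n) sum W_h s_h^2 + (1/n^2) sum (1 - W_h) s_h^2."""
    first = sum(w * s ** 2 for w, s in zip(weights, sds)) / n
    second = sum((1 - w) * s ** 2 for w, s in zip(weights, sds)) / n ** 2
    return first + second

weights = [0.4, 0.6]   # population shares N_1/N and N_2/N
mean = post_stratified_mean(weights, [520, 280])
var = post_stratified_variance(100, weights, [210, 90])
```

The second term is small relative to the first, which is why post-stratification behaves almost like proportional stratified sampling for moderate n.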
Subsection 4
Further topics on stratification
Estimator properties
Example
• The data (in lbs.) are given in the following table:
Class     Weight of student (in lbs.)
Class 1   94, 90, 102, 110
Class 2   91, 99, 93, 105, 111, 101
Class 3   108, 96, 100, 93, 93
Class 4   92, 110, 94, 91, 113
Here is a table that describes the data from each stratum:
Class     N_h   n_h   ȳ_h   s_h
Class 1   24    4     99    8.87
Class 2   36    6     100   7.46
Class 3   30    5     98    6.28
Class 4   30    5     100   10.61
• Calculate the stratified estimator $\bar{y}_{st}$.
• ANSWER:
To estimate the average weight of the 7th grade boys:
$$\bar{y}_{st} = \sum_{h=1}^{L} \frac{N_h}{N}\,\bar{y}_h = 99.3.$$
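• Calculate the variance of $\bar{y}_{st}$.
• ANSWER:
$$\widehat{\mathrm{Var}}(\bar{y}_{st}) = \frac{1}{N^2}\sum_{h=1}^{4} N_h^2 \cdot \frac{N_h - n_h}{N_h} \cdot \frac{s_h^2}{n_h}$$
$$= \frac{1}{120^2}\left[(24)^2 \cdot \frac{5}{6} \cdot \frac{(8.87)^2}{4} + (36)^2 \cdot \frac{5}{6} \cdot \frac{(7.46)^2}{6} + (30)^2 \cdot \frac{5}{6} \cdot \frac{(6.28)^2}{5} + (30)^2 \cdot \frac{5}{6} \cdot \frac{(10.61)^2}{5}\right]$$
$$= 2.93$$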
For a 95% CI, we need Satterthwaite's formula to get the degrees of freedom:
$$d = \frac{\left(\sum_{h=1}^{L} a_h s_h^2\right)^2}{\sum_{h=1}^{L} \dfrac{(a_h s_h^2)^2}{n_h - 1}}, \qquad a_h = \frac{N_h (N_h - n_h)}{n_h},$$
$$a_1 = \frac{24(24-4)}{4} = 120, \qquad a_2 = \frac{36(36-6)}{6} = 180,$$
$$a_3 = \frac{30(30-5)}{5} = 150, \qquad a_4 = \frac{30(30-5)}{5} = 150.$$
• Plug in the formula and we get that $d = 13.7576$.
• Round it down to 13, to be more conservative, and use df = 13.
• Then, an approximate 95% CI is:
$$99.3 \pm 2.160\sqrt{2.93} = 99.3 \pm 3.697$$
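The whole worked example (stratified mean, its variance, and the Satterthwaite degrees of freedom) can be recomputed from the raw weights. The sketch below assumes stratum sizes of 24, 36, 30 and 30, as used in the formulas above; it works with unrounded stratum variances, so the last digits may differ slightly from the slide values, and the critical value 2.160 is taken from the slide rather than computed.

```python
from math import sqrt
from statistics import mean, variance

N_h = [24, 36, 30, 30]                      # stratum (class) sizes
data = [
    [94, 90, 102, 110],
    [91, 99, 93, 105, 111, 101],
    [108, 96, 100, 93, 93],
    [92, 110, 94, 91, 113],
]
N = sum(N_h)

# Stratified mean: stratum means weighted by N_h / N.
y_st = sum(Nh * mean(y) for Nh, y in zip(N_h, data)) / N

# a_h = N_h (N_h - n_h) / n_h, so that Var(y_st) = sum(a_h * s_h^2) / N^2.
a = [Nh * (Nh - len(y)) / len(y) for Nh, y in zip(N_h, data)]
terms = [ah * variance(y) for ah, y in zip(a, data)]
var_st = sum(terms) / N ** 2

# Satterthwaite approximation to the degrees of freedom.
df = sum(terms) ** 2 / sum(t ** 2 / (len(y) - 1) for t, y in zip(terms, data))

t_crit = 2.160                              # t quantile for df = 13, from the slide
margin = t_crit * sqrt(var_st)
```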
• Looking back at the data, if we had used simple random sampling, would our CI have been tighter or looser?
• ANSWER:
$$\widehat{\mathrm{Var}}(\bar{y}) = \frac{N-n}{N} \cdot \frac{s^2}{n} = \frac{120-20}{120} \cdot \frac{(7.73)^2}{20} = 2.49$$
• Then an approximate 95% CI (df = 19) is:
$$99.3 \pm 2.093\sqrt{2.49} = 99.3 \pm 3.30$$
• Stratified random sampling will usually perform better overall, because we usually use it when the strata are more homogeneous.
• Here, there is no reason for the classes to be more homogeneous in weight, and therefore there is no reason why this stratified random sampling would be any better than simple random sampling.
• Since the data were collected by stratified sampling, the above method, which treats them as a simple random sample, is the wrong way to compute the variance for this problem.
• How the variance is computed depends on the method by which the sample was taken.
• We did the computation just to show that if, hypothetically, the data had been collected by s.r.s. and had turned out as shown (for illustration’s sake), then the margin of error would be smaller.
Moral of this example
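Stratified sampling and proportions
$$\hat{p}_{st} = \frac{1}{N}\sum_{h=1}^{L} N_h\,\hat{p}_h.$$
$$\widehat{\mathrm{Var}}(\hat{p}_{st}) = \frac{1}{N^2}\sum_{h=1}^{L} N_h^2\,\widehat{\mathrm{Var}}(\hat{p}_h) = \frac{1}{N^2}\sum_{h=1}^{L} N_h^2 \cdot \frac{N_h - n_h}{N_h} \cdot \frac{\hat{p}_h(1-\hat{p}_h)}{n_h - 1}$$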
Example
Town A         $n_1 = 20$   $\hat{p}_1 = 16/20 = 0.80$
Town B         $n_2 = 8$    $\hat{p}_2 = 2/8 = 0.25$
Rural area C   $n_3 = 12$   $\hat{p}_3 = 6/12 = 0.50$
• We plug in the values and we can get the following:
$$\hat{p}_{st} = \frac{1}{N}\sum_{h=1}^{L} N_h\,\hat{p}_h = \frac{155}{310} \cdot 0.8 + \frac{62}{310} \cdot 0.25 + \frac{93}{310} \cdot 0.5 = 0.6$$
The following displays the estimated variance for each stratum:
$$\widehat{\mathrm{Var}}(\hat{p}_1) = \frac{N_1 - n_1}{N_1} \cdot \frac{\hat{p}_1(1-\hat{p}_1)}{n_1 - 1} = \frac{155-20}{155} \cdot \frac{0.8(0.2)}{19} = 0.007$$
$$\widehat{\mathrm{Var}}(\hat{p}_2) = \frac{N_2 - n_2}{N_2} \cdot \frac{\hat{p}_2(1-\hat{p}_2)}{n_2 - 1} = \frac{62-8}{62} \cdot \frac{0.25(0.75)}{7} = 0.023$$
$$\widehat{\mathrm{Var}}(\hat{p}_3) = \frac{N_3 - n_3}{N_3} \cdot \frac{\hat{p}_3(1-\hat{p}_3)}{n_3 - 1} = \frac{93-12}{93} \cdot \frac{0.5(0.5)}{11} = 0.02$$
• Compute the estimated variance of the stratified proportion.
• ANSWER:
$$\widehat{\mathrm{Var}}(\hat{p}_{st}) = \frac{1}{(310)^2}\left[(155)^2(0.007) + (62)^2(0.023) + (93)^2(0.02)\right] = 0.0045$$
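The stratified proportion and its estimated variance can be recomputed directly from the stratum sizes and sample counts of the example. This is a minimal sketch with illustrative variable names; it keeps full precision in the stratum variances, so the intermediate values are not rounded as on the slides.

```python
N_h = [155, 62, 93]          # stratum sizes: Town A, Town B, Rural area C
n_h = [20, 8, 12]            # stratum sample sizes
p_h = [16 / 20, 2 / 8, 6 / 12]  # stratum sample proportions
N = sum(N_h)

# Stratified proportion: stratum proportions weighted by N_h / N.
p_st = sum(Nh * p for Nh, p in zip(N_h, p_h)) / N

# Per-stratum variance with finite population correction and n_h - 1 divisor.
var_h = [(Nh - nh) / Nh * p * (1 - p) / (nh - 1)
         for Nh, nh, p in zip(N_h, n_h, p_h)]

var_st = sum(Nh ** 2 * v for Nh, v in zip(N_h, var_h)) / N ** 2
```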
Cluster sampling and systematic sampling
Unit learning outcomes
Subsection 1
Introduction
Cluster versus systematic sampling
• Example: a one-in-three systematic sample, where we randomly pick one of the first three units and then choose every third unit from there on.
• It is not uncommon to have a systematic sample of size 1,
such as the above 1 in 3 systematic sample. We just
sample 1 primary unit.
• In the following two graphs, we provide examples for two
configurations of primary units:
The above figure has 50 primary units (PSU) (the colored
rectangle is an example of a primary unit)
• Primary units (PSU) may be different from observation units.
• One can view systematic sampling as a sampling of primary units.
• Once the primary units are selected, a cluster of secondary units is also selected.
Advantages of systematic sampling
Cluster sampling
For the figure below, $N = 50$, $n = 10$, $M_i = 8$.
• Thus, the population total is:
$$\tau = \sum_{i=1}^{N}\sum_{j=1}^{M_i} y_{ij} = \sum_{i=1}^{N} \tau_i.$$
• The population mean per primary unit is
$$\mu_\tau = \frac{\tau}{N}.$$
• The population mean per secondary unit is
$$\mu = \frac{\tau}{M}.$$
Subsection 2
Estimators for cluster sampling when primary units are selected by simple random sampling
• When the primary units are selected by simple random
sampling, frequently used estimators among many
possible estimators are:
• Unbiased estimator
• Ratio estimator
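Unbiased estimator
$$\hat{\tau} = N \cdot \hat{\mu}_\tau = N \cdot \frac{\sum_{i=1}^{n} \tau_i}{n},$$
recall that $\tau_i$ is the total of $y$-values in the $i$-th primary unit.
$$\widehat{\mathrm{Var}}(\hat{\tau}) = N(N-n) \cdot \frac{s_u^2}{n},$$
where $s_u^2 = \frac{1}{n-1}\sum_{i=1}^{n} (\tau_i - \hat{\mu}_\tau)^2$.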
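• To estimate the mean per primary unit, $\tau / N$, one will use:
$$\hat{\mu}_\tau = \frac{\hat{\tau}}{N}, \qquad \mathrm{Var}(\hat{\mu}_\tau) = \frac{1}{N^2}\,\mathrm{Var}(\hat{\tau}).$$
• To estimate the mean per secondary unit,
$$\hat{\mu} = \frac{\hat{\tau}}{M}, \qquad \mathrm{Var}(\hat{\mu}) = \frac{1}{M^2}\,\mathrm{Var}(\hat{\tau}).$$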
Ratio estimator
$$\hat{\tau}_r = r \cdot M, \qquad M = \sum_{i=1}^{N} M_i,$$
where
$$r = \frac{\sum_{i=1}^{n} \tau_i}{\sum_{i=1}^{n} M_i}, \qquad \widehat{\mathrm{Var}}(\hat{\tau}_r) = \frac{N(N-n)}{n(n-1)}\sum_{i=1}^{n} (\tau_i - rM_i)^2.$$
The basic principle
$$\widehat{\mathrm{Var}}(\hat{\tau}) = N(N-n) \cdot \frac{s_u^2}{n},$$
where $s_u^2 = \frac{1}{n-1}\sum_{i=1}^{n} (\tau_i - \hat{\mu}_\tau)^2$.
• Thus, to obtain estimators of low variance,
1. Clusters should be formed so that one cluster is similar to another cluster. (Note: this is very different from saying that units within a cluster are similar.)
2. Each cluster should contain the full diversity of the population and thus be "representative".
• With natural populations of spatially distributed plants, animals, or minerals, and human populations, the above condition is typically satisfied by systematic sampling, where each cluster contains units that are far apart.
• Cluster sampling is more often than not carried out for reasons of convenience or practicality rather than to obtain the lowest variance.
• Why or when do we use cluster sampling?
• Will it give us a more precise estimator?
• The answer is no for most cases.
• We do use cluster sampling out of necessity even though
it will give us a larger variance.
If the objective of sampling is to obtain a specified amount of information about a population parameter at minimum cost, cluster sampling sometimes gives more information per unit cost than simple random sampling, stratified sampling and systematic sampling, because the cost of sampling units within a cluster may be much lower.
Cluster sampling is an effective design in two different
scenarios:
Example using a ratio estimator
Cluster   Households ($M_i$)   Total budget ($\tau_i$)   |   Cluster   Households ($M_i$)   Total budget ($\tau_i$)
1         7                    12,000                    |   13        8                    12,340
2         9                    15,000                    |   14        4                    5,000
3         5                    8,000                     |   15        6                    8,900
4         8                    13,000                    |   16        9                    14,000
5         12                   18,000                    |   17        3                    4,000
6         5                    7,000                     |   18        10                   11,400
7         4                    6,000                     |   19        4                    5,000
8         8                    13,000                    |   20        7                    13,000
9         14                   22,000                    |   21        6                    8,900
10        6                    9,800                     |   22        5                    8,700
11        3                    7,000                     |   23        7                    10,000
12        13                   18,000                    |   24        6                    9,200
Here is a plot of this data so that we can see if the cluster size is proportional to the total for the cluster.
[Figure: scatter plot of the total of each cluster (vertical axis) against cluster size (horizontal axis).]
• The ratio estimator for cluster sample (ratio-to-size):
• If primary unit total $\tau_i$ is highly correlated with cluster size $M_i$, a ratio estimator based on size may be efficient.
• The ratio estimator of the population total is:
$$\hat{\tau}_r = r \cdot M \quad \text{where} \quad r = \frac{\sum_{i=1}^{n} \tau_i}{\sum_{i=1}^{n} M_i}.$$
• The ratio estimator is biased but the bias is small when the sample size is large.
• Here is the variance:
$$\widehat{\mathrm{Var}}(\hat{\tau}_r) = \frac{N(N-n)}{n(n-1)}\sum_{i=1}^{n} (\tau_i - rM_i)^2.$$
• The ratio estimator for the mean is:
$$\hat{\mu}_r = \frac{\hat{\tau}_r}{M} = r.$$
$$\widehat{\mathrm{Var}}(\hat{\mu}_r) = \frac{N(N-n)}{n(n-1)} \cdot \frac{1}{M^2}\sum_{i=1}^{n} (\tau_i - rM_i)^2.$$
• Back to the example.
• To estimate the average yearly vacation budget for each household we will use:
$$\hat{\mu}_r = r = \frac{\sum_{i=1}^{n} \tau_i}{\sum_{i=1}^{n} M_i}.$$
• In this example we see that $N = 400$, the total number of blocks, and $n = 24$.
• $M$ in this case is as follows:
$$M = \sum_{i=1}^{N} M_i = 3{,}100.$$
The ratio estimator for the average yearly vacation budget for each household in that city is:
$$\hat{\mu}_r = \frac{\sum_{i=1}^{n} \tau_i}{\sum_{i=1}^{n} M_i} = \frac{259{,}240}{169} = 1{,}533.96.$$
$$\widehat{\mathrm{Var}}(\hat{\mu}_r) = \frac{N(N-n)}{n(n-1)} \cdot \frac{1}{M^2}\sum_{i=1}^{n} (\tau_i - rM_i)^2.$$
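• For this example, $M = 3100$, $N = 400$, $n = 24$:
$$\frac{1}{n-1}\sum_{i=1}^{n} (\tau_i - rM_i)^2 = [\text{st.dev. of } (\tau - rM)]^2 = (1{,}325)^2$$
• The estimated variance for the ratio estimator:
$$\widehat{\mathrm{Var}}(\hat{\mu}_r) = \frac{400(400-24)}{(3{,}100)^2 \cdot 24}\,(1{,}325)^2 \approx 1{,}144.84$$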
• If we used the unbiased estimator would our variance be
larger or smaller?
• For this example, we also want to compute the unbiased
estimator for comparison purposes.
• The unbiased estimator for the average yearly vacation budget for each household in that city is:
$$\hat{\mu} = N \cdot \frac{\sum_{i=1}^{n} \tau_i}{n} \cdot \frac{1}{M} = 400 \cdot \frac{259{,}240}{24} \cdot \frac{1}{3{,}100} = 400 \cdot 10{,}802 \cdot \frac{1}{3{,}100} = 1{,}393.81$$
• The estimated variance for the unbiased estimator:
$$\widehat{\mathrm{Var}}(\hat{\mu}) = \frac{N(N-n)}{M^2 \cdot n} \cdot \frac{1}{n-1}\sum_{i=1}^{n} (\tau_i - \hat{\mu}_\tau)^2 = \frac{400(400-24)}{(3{,}100)^2 \cdot 24}\,(\text{st.dev. of } \tau)^2$$
$$= \frac{400(400-24)}{(3{,}100)^2 \cdot 24}\,(4{,}495)^2 = 13{,}175.67$$
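The comparison between the ratio estimator and the unbiased estimator can be recomputed from the 24 sampled clusters. The sketch below is one way to set it up, with illustrative variable names; it uses N = 400 and M = 3,100 as given in the example and does not round intermediate values, so results may differ from the slides in the last digits.

```python
from statistics import mean, stdev

# Households (M_i) and total vacation budget (tau_i) for the 24 sampled clusters.
sizes = [7, 9, 5, 8, 12, 5, 4, 8, 14, 6, 3, 13,
         8, 4, 6, 9, 3, 10, 4, 7, 6, 5, 7, 6]
totals = [12000, 15000, 8000, 13000, 18000, 7000, 6000, 13000, 22000, 9800,
          7000, 18000, 12340, 5000, 8900, 14000, 4000, 11400, 5000, 13000,
          8900, 8700, 10000, 9200]
N, M, n = 400, 3100, len(sizes)

# Ratio estimator of the mean per household and its estimated variance.
r = sum(totals) / sum(sizes)
residuals = [t - r * m for t, m in zip(totals, sizes)]
var_ratio = N * (N - n) / (n * (n - 1) * M ** 2) * sum(e ** 2 for e in residuals)

# Unbiased estimator of the mean per household and its estimated variance.
mu_unbiased = N * mean(totals) / M
var_unbiased = N * (N - n) / (n * M ** 2) * stdev(totals) ** 2
```

Because the cluster totals track cluster size closely, the residuals around r * M_i vary far less than the totals themselves, which is why the ratio estimator has the smaller estimated variance here.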
Remark 1
Remark 2
Subsection 3
Estimators for cluster sampling when primary units are selected by pps
Estimators
$$p_i = \frac{M_i}{M}.$$
• The Hansen-Hurwitz (pps) estimator is:
$$\hat{\tau}_p = \frac{M}{n}\sum_{i=1}^{n} \frac{\tau_i}{M_i}.$$
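• Denote $\mu_i = \tau_i / M_i$:
$$\widehat{\mathrm{Var}}(\hat{\tau}_p) = \frac{M^2}{n(n-1)}\sum_{i=1}^{n} (\mu_i - \hat{\mu}_p)^2$$
$\hat{\mu}_p = \hat{\tau}_p / M$ is unbiased for $\mu$.
• Thus we also see that:
$$\widehat{\mathrm{Var}}(\hat{\mu}_p) = \frac{1}{n(n-1)}\sum_{i=1}^{n} (\mu_i - \hat{\mu}_p)^2$$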
Example
Cluster   Size ($M_i$)
1         1000
2         650
3         2100
4         860
5         2840
6         1910
7         390
8         3200
9         1500
10        1200
Total     15650
• A sample of $n = 3$ of the 10 clusters is drawn with replacement. Clusters 2, 5 and 8 are selected.
• The data are:
Cluster   Size ($M_i$)   Total ($\tau_i$)
2         650            420
5         2840           1785
8         3200           2198
• Find the Hansen-Hurwitz estimator for the population mean.
• ANSWER:
$$\hat{\mu}_p = \frac{1}{n}\sum_{i=1}^{n} \frac{\tau_i}{M_i} = \frac{1}{3} \times \left(\frac{420}{650} + \frac{1785}{2840} + \frac{2198}{3200}\right) = 0.6538$$
• Find its variance.
• ANSWER:
$$\widehat{\mathrm{Var}}(\hat{\mu}_p) = \frac{1}{n(n-1)}\sum_{i=1}^{n} (\mu_i - \hat{\mu}_p)^2$$
$$= \frac{1}{3 \times 2}\left[(0.6462 - 0.6538)^2 + (0.6285 - 0.6538)^2 + (0.6869 - 0.6538)^2\right] = 0.000299$$
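The Hansen-Hurwitz computations above can be reproduced with a few lines. This is a minimal sketch using the three selected clusters and the population total M = 15650 from the table; the dictionary layout and variable names are illustrative.

```python
M = 15650                                  # total number of secondary units
sizes = {2: 650, 5: 2840, 8: 3200}         # M_i of the sampled clusters
totals = {2: 420, 5: 1785, 8: 2198}        # tau_i of the sampled clusters

# Per-cluster means mu_i = tau_i / M_i for the clusters drawn with pps.
ratios = [totals[i] / sizes[i] for i in (2, 5, 8)]
n = len(ratios)

mu_p = sum(ratios) / n                     # Hansen-Hurwitz estimate of the mean
var_mu_p = sum((m - mu_p) ** 2 for m in ratios) / (n * (n - 1))
tau_p = M * mu_p                           # corresponding estimate of the total
```

Under pps sampling with replacement the estimator is simply the average of the sampled cluster means, so no cluster sizes appear in the final formula for the mean.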
Subsection 4
Systematic sample
• In the previous section, we introduced systematic sampling and stated why it may be a challenge to estimate the variance when only one primary unit is taken.
• Then repeated systematic sampling is introduced so that the variance can be estimated.
• We then provide an example of repeated systematic sampling.
• In this section, the variance for cluster and systematic sampling is decomposed into between-cluster and within-cluster variances.
• We then provide an estimate for the relative efficiency of simple random sampling versus simple random cluster sampling.
• An example is provided to compare the variances for these two sampling methods.
• One should note that it is not uncommon to see examples in which cluster sampling is much less efficient than simple random sampling, as illustrated in this example.
Systematic sample
• To sample systematically from a field, the following is one
example:
• There are four primary units: (1, 3, 9, 11), (2, 4, 10, 12),
(5, 7, 13, 15), (6, 8, 14, 16).
• How do we draw a 1 in k systematic sample?
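One common way to draw a 1-in-k systematic sample from a list frame is sketched below. It assumes a one-dimensional ordered frame (unlike the two-dimensional field layout above); the function name `systematic_sample` and the 16-unit frame are hypothetical.

```python
import random

def systematic_sample(units, k, rng=random):
    """Draw a 1-in-k systematic sample: random start among the first k units,
    then every k-th unit after it."""
    start = rng.randrange(k)        # 0-based index of the randomly chosen start
    return units[start::k]

population = list(range(1, 17))     # hypothetical frame of 16 labelled units
sample = systematic_sample(population, k=4, rng=random.Random(1))
```

Each possible start defines one primary unit, so the whole sample corresponds to a single randomly selected cluster of units spaced k apart.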
Example
• We can pick a starting point randomly from 1 to 600 and sample every 7-th student from there on until we have reached 1200 samples.
• How do we estimate the variance of this single systematic sample?
• We cannot use the formula
$$s_u^2 = \frac{1}{n-1}\sum_{i=1}^{n} (\tau_i - \hat{\mu}_\tau)^2$$
since $n = 1$.
• Only one primary unit is selected.
• If the population is randomly ordered, then there is no problem.
• We can estimate the variance $\sigma^2$ by:
$$s^2 = \frac{\sum_{j=1}^{M_1} (y_{1j} - \bar{y}_1)^2}{M_1 - 1}$$
• However, when the population is ordered, systematic sampling is usually better than simple random sampling and the above formula will overestimate the variance.
• When the population is periodic, systematic sampling may be worse than simple random sampling and the above formula will underestimate the variance: if the period k is chosen poorly, the elements sampled may be too similar to each other.
Subsection 5
• For simplicity, suppose that each of $N$ primary units has an equal number $M$ of secondary units.
• To simplify the variance computations and to explore the relationship between cluster and simple random sampling, we note the identity:
$$\underbrace{\sum_{i=1}^{N}\sum_{j=1}^{M} (y_{ij} - \mu)^2}_{SST} = \underbrace{\sum_{i=1}^{N}\sum_{j=1}^{M} (y_{ij} - \mu_i)^2}_{SSW} + \underbrace{M\sum_{i=1}^{N} (\mu_i - \mu)^2}_{SSB}$$
where $\mu_i = \sum_{j=1}^{M} y_{ij} / M$.
SST = SSW + SSB
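The decomposition can be verified numerically on any set of equal-sized clusters. The sketch below uses a small hypothetical population of three clusters with three units each; the function name `sums_of_squares` is illustrative.

```python
def sums_of_squares(clusters):
    """Return (SST, SSW, SSB) for equal-sized clusters of secondary units."""
    M = len(clusters[0])                        # secondary units per cluster
    values = [y for c in clusters for y in c]
    mu = sum(values) / len(values)              # overall mean
    means = [sum(c) / M for c in clusters]      # cluster means mu_i
    sst = sum((y - mu) ** 2 for y in values)
    ssw = sum((y - m) ** 2 for c, m in zip(clusters, means) for y in c)
    ssb = M * sum((m - mu) ** 2 for m in means)
    return sst, ssw, ssb

# Hypothetical population: N = 3 clusters, M = 3 units each.
sst, ssw, ssb = sums_of_squares([[1, 2, 3], [4, 6, 8], [5, 5, 5]])
```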
• The within-primary-unit variance is:
$$\sigma_w^2 = \sum_{i=1}^{N}\sum_{j=1}^{M} (y_{ij} - \mu_i)^2 \big/ [N(M-1)]$$
• The identity can be rewritten as:
• Since the data was obtained by cluster sampling, we cannot use $s^2$ to estimate $\sigma^2$ but we can use $\hat{\sigma}^2$ to estimate $\sigma^2$.
• The relative efficiency of simple random sampling versus simple random cluster sampling is:
$$\frac{\mathrm{Var}(\bar{y}_{srs})}{\mathrm{Var}(\hat{\mu})} = \frac{M\sigma^2}{\sigma_u^2}.$$
• Recall: $\mathrm{Var}(\bar{y}_{srs}) = \frac{N-n}{N} \cdot \frac{\sigma^2}{nM}$ and $\mathrm{Var}(\hat{\mu}) = \frac{N-n}{N} \cdot \frac{\sigma_u^2}{nM^2}$, where $\sigma_u^2$ is the finite population variance of $\tau_i$.
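• It can be estimated by:
$$\frac{\widehat{\mathrm{Var}}(\bar{y}_{srs})}{\widehat{\mathrm{Var}}(\hat{\mu})} = \frac{M\hat{\sigma}^2}{s_u^2}.$$
• Note: writing $\bar{y}_i = \tau_i / M$ for the mean of the $i$-th sampled primary unit, so that $\hat{\mu}_\tau = M\hat{\mu}$,
$$s_u^2 = \frac{1}{n-1}\sum_{i=1}^{n} (\tau_i - \hat{\mu}_\tau)^2 = \frac{1}{n-1}\sum_{i=1}^{n} (M\bar{y}_i - M\hat{\mu})^2 = M^2\,\frac{\sum_{i=1}^{n} (\bar{y}_i - \hat{\mu})^2}{n-1} = M^2 s_b^2.$$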
Example
Cluster   Number of cell phones           Total
1         3  5  6  4  5  6  3  2  4  5    43
2         2  0  2  1  1  0  1  1  0  1    9
3         3  2  3  2  4  2  2  1  2  2    23
4         5  2  3  2  1  1  2  2  4  1    23
• Let’s find the relative efficiency of simple random
sampling versus cluster sampling for the data in this
example.
• In this example, N = 400, n = 4, and M = 10.
• We need to find s_b² and s_w².
• Note the identity for the population:
\[ (NM-1)\sigma^2 = N(M-1)\sigma_w^2 + (N-1)M\sigma_b^2 \]
• The identity for the sample is:
\[ (nM-1)s^2 = n(M-1)s_w^2 + (n-1)M s_b^2 \]
SS total = SS within + SS between
SS between = 58.7
SS within = 43.2
• Since SS between = (n − 1) M s_b² with (n − 1) M = 3 × 10 = 30:
\[ s_b^2 = \frac{58.70}{30} = 1.957 \]
• We can find s_w² from SS within = n(M − 1) s_w²:
\[ s_w^2 = \frac{43.2}{4 \times 9} = 1.20. \]
• Compute σ̂², the estimate of σ².
• ANSWER:
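One way to obtain the value is sketched here. It assumes that the population identity has the same form as the sample identity, with N in place of n, and that σ_w² and σ_b² are estimated by s_w² and s_b²:

```latex
\[
\hat{\sigma}^2
= \frac{N(M-1)\,s_w^2 + (N-1)M\,s_b^2}{NM-1}
= \frac{400 \times 9 \times 1.20 + 399 \times 10 \times 1.957}{3999}
= \frac{4320 + 7808.4}{3999}
\approx 3.03.
\]
```

This agrees with the value σ̂² = 3.03 used in the relative-efficiency calculation for this example.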
• And now we can determine the relative efficiency of
simple random sampling versus cluster sampling by
plugging the values into the formula:
\[ s_u^2 = M^2 s_b^2 = 100 \times 1.957 = 195.7. \]
• Compute the relative efficiency of simple random
sampling versus cluster sampling.
• ANSWER:
\[ \frac{\widehat{Var}(\bar{y}_{srs})}{\widehat{Var}(\hat{\mu})} = \frac{M\hat{\sigma}^2}{s_u^2} = \frac{10 \times 3.03}{195.7} = 0.155. \]
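The whole calculation can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the variable names are illustrative, and the formula for `sigma2_hat` assumes the population identity mirrors the sample identity with N in place of n.

```python
# Cell-phone counts for the n = 4 sampled clusters (M = 10 secondary units each).
clusters = [
    [3, 5, 6, 4, 5, 6, 3, 2, 4, 5],
    [2, 0, 2, 1, 1, 0, 1, 1, 0, 1],
    [3, 2, 3, 2, 4, 2, 2, 1, 2, 2],
    [5, 2, 3, 2, 1, 1, 2, 2, 4, 1],
]
N = 400                  # clusters in the population (given in the example)
n = len(clusters)        # sampled clusters
M = len(clusters[0])     # secondary units per cluster

means = [sum(c) / M for c in clusters]
grand_mean = sum(means) / n

# Sums of squares between and within clusters.
ss_between = M * sum((m - grand_mean) ** 2 for m in means)
ss_within = sum((y - m) ** 2 for c, m in zip(clusters, means) for y in c)

s_b2 = ss_between / ((n - 1) * M)
s_w2 = ss_within / (n * (M - 1))

# Assumption: population analogue of the sample identity.
sigma2_hat = (N * (M - 1) * s_w2 + (N - 1) * M * s_b2) / (N * M - 1)

s_u2 = M ** 2 * s_b2                 # variance among cluster totals
rel_eff = M * sigma2_hat / s_u2      # estimated Var(srs) / Var(cluster)

print(round(s_b2, 3), round(s_w2, 2), round(sigma2_hat, 2), round(rel_eff, 3))
```

The printed values can be compared directly with s_b², s_w², σ̂² and the relative efficiency reported in the example.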
• What is this telling us?
• ANSWER:
• The variance under simple random sampling is just 15.5% of
the variance under cluster sampling when the same number of
secondary units is sampled. In this example, simple random
sampling is more efficient if only variance is considered.
• Remark: It is a serious mistake to analyze a cluster sample
as if it were a simple random sample. The reported standard
error will often be much smaller than it should be, so the
results will look far more precise than they really are.
Multi-stage designs
Unit learning outcomes
Subsection 1
• We have learned about cluster sampling, where one selects
primary units and then observes all of the secondary units
within each selected primary unit.
• With multi-stage sampling, we select only some of the
secondary units within each selected primary unit.
• For example, in two-stage sampling:
• The 1st stage samples n primary units.
• The 2nd stage selects, for the i-th sampled primary unit,
m_i (not all) of its secondary units.
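The two stages described above can be sketched in Python. This is an illustrative sketch only: the function name `two_stage_sample`, the sampling fraction `frac`, and the toy population are all hypothetical, and simple random sampling without replacement is assumed at both stages.

```python
import random


def two_stage_sample(population, n, frac, seed=0):
    """Select n primary units by SRS, then an SRS of secondary units in each.

    population: dict mapping a primary-unit id to its list of secondary units.
    frac: hypothetical second-stage sampling fraction, so m_i is about frac * M_i.
    """
    rng = random.Random(seed)
    primary = rng.sample(sorted(population), n)      # stage 1: n primary units
    sample = {}
    for i in primary:
        units = population[i]
        m_i = max(1, round(frac * len(units)))       # second-stage sample size
        sample[i] = rng.sample(units, m_i)           # stage 2: m_i of M_i units
    return sample


# Toy population: 20 primary units with M_i = 10, 11, ..., 29 secondary units.
population = {i: [f"unit-{i}-{j}" for j in range(10 + i)] for i in range(20)}
sample = two_stage_sample(population, n=5, frac=0.2)
for i, chosen in sample.items():
    print(i, len(population[i]), len(chosen))
```

Setting `frac=1.0` selects every secondary unit in the sampled primary units, which is the one-stage cluster sampling case.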
• Multistage designs are used in many practical cases.
• These are just a few:
• Large surveys involving the sampling of housing units:
Statistics Portugal selects geographical areas within each
NUTS II region and then selects housing units (dwellings)
within each selected geographical area.
• Practical quality control problems often involve two (or
more) stages of sampling. For example, Volkswagen
wants to inspect the quality of a supplier of air filters.
It first samples some cartons and then inspects some
air filters inside the selected cartons.
• Polls often sample election districts at the first stage
and then select households within the sampled districts at
the second stage.
• Notation:
• N: number of primary units in the population
• M_i: number of secondary units in the i-th primary unit
• Total of the i-th primary unit:
\[ y_i = \sum_{j=1}^{M_i} y_{ij} \]
• Population total:
\[ \tau = \sum_{i=1}^{N}\sum_{j=1}^{M_i} y_{ij} \]
• Population mean:
\[ \mu = \frac{\tau}{M} \quad\text{where}\quad M = \sum_{i=1}^{N} M_i. \]
• n: number of primary units selected in the first stage
• m_i: number of secondary units selected from the i-th
primary unit in the second stage
• Two-stage sampling includes both one-stage cluster
sampling and stratified random sampling as special cases.
When does two-stage sampling reduce to cluster
sampling? When does two-stage sampling reduce to
stratified random sampling?
• ANSWER:
1. If m_i = M_i (all secondary units are selected), it reduces
to cluster sampling.
2. If n = N (all primary units are selected), it reduces to
stratified random sampling.
Multistage design
Two-stage sample of 10 primary units and four secondary units
per primary unit.
Another example of a two-stage sample: 20 primary units and
two secondary units per primary unit.
Simple random sampling at each stage
• Now we have the expansion estimators from each stage.
The next thing we need is the variance.
• The estimated variance of τ̂ is:
\[ \widehat{Var}(\hat{\tau}) = N(N-n)\frac{s_u^2}{n} + \frac{N}{n}\sum_{i=1}^{n} M_i(M_i-m_i)\frac{s_i^2}{m_i} \]
where
• s_u² is the sample variance among the estimated primary unit totals
\[ s_u^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(\hat{\tau}_i - \frac{1}{n}\sum_{k=1}^{n}\hat{\tau}_k\right)^2. \]
• s_i² is the sample variance within the i-th primary unit
\[ s_i^2 = \frac{1}{m_i-1}\sum_{j=1}^{m_i}(y_{ij}-\bar{y}_i)^2. \]
• To estimate the population mean µ = τ/M, the estimator
and its estimated variance are:
\[ \hat{\mu} = \frac{N}{M}\cdot\frac{\sum_{i=1}^{n}\hat{\tau}_i}{n}, \quad\text{and}\quad \widehat{Var}(\hat{\mu}) = \frac{1}{M^2}\,\widehat{Var}(\hat{\tau}). \]
• Let’s take a look at an example where we can compute
both the estimates and their variances.
Example: employee satisfaction
 i   M_i  m_i  Scores y_ij                                 ȳ_i    s_i    M_i ȳ_i
 1   54   10   5, 7, 6, 5, 4, 7, 6, 6, 4, 5                5.5    1.08   297
 2   48   10   7, 7, 7, 6, 5, 4, 7, 7, 6, 6                6.2    1.03   297.6
 3   68   14   5, 6, 5, 6, 4, 5, 6, 5, 4, 5, 4, 6, 5, 6    5.14   0.77   349.52
 4   70   14   6, 5, 7, 6, 7, 6, 5, 7, 5, 7, 6, 5, 7, 6    6.07   0.83   424.9
 5   52   10   4, 5, 4, 5, 5, 6, 5, 4, 4, 4                4.6    0.7    239.2
 6   62   12   5, 7, 6, 7, 4, 3, 1, 5, 4, 6, 4, 5          4.75   1.71   294.5
 7   41    8   7, 6, 7, 7, 6, 6, 5, 7                      6.38   0.74   261.58
 8   53   11   6, 6, 5, 4, 6, 7, 5, 5, 7, 6, 5             5.64   0.92   298.92
 9   64   12   7, 6, 5, 4, 6, 5, 7, 4, 3, 6, 5, 7          5.42   1.31   346.88
10   43    9   7, 6, 6, 5, 7, 3, 5, 4, 5                   5.33   1.32   229.19
• Find the unbiased estimator for the mean employee
satisfaction score.
• ANSWER:
• The unbiased estimator of the total is:
\[ \hat{\tau} = N\cdot\frac{\sum_{i=1}^{n} M_i\bar{y}_i}{n} = 120\cdot\frac{(54 \times 5.50) + (48 \times 6.20) + \ldots + (43 \times 5.33)}{10} = 36,471.5 \]
• This might be thought of as the total satisfaction score.
• If we divide this by the total number of employees, we
get the average score.
• If M is given to be 6,860 then
\[ \hat{\mu} = \frac{36,471.5}{6,860} = 5.32. \]
• The estimated variance of the unbiased estimator is then:
\[ \widehat{Var}(\hat{\tau}) = N(N-n)\frac{s_u^2}{n} + \frac{N}{n}\sum_{i=1}^{n} M_i(M_i-m_i)\frac{s_i^2}{m_i} \]
where
• s_u² is the sample variance of τ̂_1, τ̂_2, ..., τ̂_10. From the
previous table, s_u² = (58.1)² = 3375.61
• s_i² is the sample variance within the i-th primary unit.
\[ s_i^2 = \frac{1}{m_i-1}\sum_{j=1}^{m_i}(y_{ij}-\bar{y}_i)^2. \]
• Find the estimated variance of the unbiased estimator for
the mean employee satisfaction score.
• ANSWER:
\[ \widehat{Var}(\hat{\tau}) = 120 \times (120-10) \times \frac{3375.61}{10} + \frac{120}{10} \times \left[ 54(54-10)\frac{1.08^2}{10} + \ldots + 43(43-9)\frac{1.32^2}{9} \right] \]
\[ = 4,455,805.2 + 32,451.6 = 4,488,256.8 \]
\[ \widehat{Var}(\hat{\mu}) = \frac{4,488,256.8}{6860^2} = 0.095. \]
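The estimate and its variance can be reproduced from the summary columns of the table. This is a minimal sketch: the tuple layout and variable names are illustrative, and the inputs are the rounded ȳ_i and s_i values shown in the table rather than the raw scores.

```python
from statistics import variance

# Summary statistics copied from the example table: (M_i, m_i, ybar_i, s_i).
units = [
    (54, 10, 5.50, 1.08), (48, 10, 6.20, 1.03), (68, 14, 5.14, 0.77),
    (70, 14, 6.07, 0.83), (52, 10, 4.60, 0.70), (62, 12, 4.75, 1.71),
    (41, 8, 6.38, 0.74), (53, 11, 5.64, 0.92), (64, 12, 5.42, 1.31),
    (43, 9, 5.33, 1.32),
]
N, M = 120, 6860      # primary units and secondary units in the population
n = len(units)        # sampled primary units

# Estimated primary unit totals and the expansion estimators.
tau_i = [M_i * ybar for M_i, _, ybar, _ in units]
tau_hat = N * sum(tau_i) / n
mu_hat = tau_hat / M

# Between-unit and within-unit components of the estimated variance.
s_u2 = variance(tau_i)
within = sum(M_i * (M_i - m_i) * s ** 2 / m_i for M_i, m_i, _, s in units)
var_tau = N * (N - n) * s_u2 / n + (N / n) * within
var_mu = var_tau / M ** 2

print(round(tau_hat, 1), round(mu_hat, 2), round(var_mu, 3))
```

Because `s_u2` is computed here from the unrounded spread of the totals rather than from the rounded value 58.1², `var_tau` differs slightly from the figure in the worked answer, while the variance of the mean still agrees to three decimals.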
Ratio estimator
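• For the population total, the ratio estimator and its
estimated variance are:
\[ \hat{\tau}_r = \frac{\sum_{i=1}^{n}\hat{y}_i}{\sum_{i=1}^{n} M_i}\cdot M = \hat{r}M. \]
\[ \widehat{Var}(\hat{\tau}_r) = \frac{N(N-n)}{n}\cdot\frac{1}{n-1}\sum_{i=1}^{n}(\hat{y}_i - M_i\hat{r})^2 + \frac{N}{n}\sum_{i=1}^{n} M_i(M_i-m_i)\frac{s_i^2}{m_i}. \]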
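• A similar question can be asked of the population mean.
• Therefore, for the population mean, the ratio estimator
and its estimated variance are:
\[ \hat{\mu}_r = \frac{\sum_{i=1}^{n}\hat{y}_i}{\sum_{i=1}^{n} M_i} = \frac{\sum_{i=1}^{n} M_i\bar{y}_i}{\sum_{i=1}^{n} M_i} = \hat{r}. \]
\[ \widehat{Var}(\hat{\mu}_r) = \frac{1}{M^2}\,\widehat{Var}(\hat{\tau}_r). \]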
Illustration
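• And the estimated variance of the ratio estimator.
• ANSWER:
\[ \widehat{Var}(\hat{\mu}_r) = \frac{1}{M^2}\left[\frac{N(N-n)}{n}\cdot\frac{1}{n-1}\sum_{i=1}^{n}(\hat{y}_i - M_i\hat{r})^2 + \frac{N}{n}\sum_{i=1}^{n} M_i(M_i-m_i)\frac{s_i^2}{m_i}\right] \]
\[ = \frac{1}{6860^2}\left[\frac{120(120-10)}{10}\cdot\frac{1}{9}\left((54 \times 5.50 - 54 \times 5.48)^2 + \ldots + (43 \times 5.33 - 43 \times 5.48)^2\right) + 32451.6\right] \]
\[ = 0.029 \]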
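The ratio estimate and its estimated variance can be checked the same way. This is a minimal sketch with illustrative names, using the rounded summary values from the employee satisfaction table and the given N = 120 and M = 6,860.

```python
# Summary statistics copied from the example table: (M_i, m_i, ybar_i, s_i).
units = [
    (54, 10, 5.50, 1.08), (48, 10, 6.20, 1.03), (68, 14, 5.14, 0.77),
    (70, 14, 6.07, 0.83), (52, 10, 4.60, 0.70), (62, 12, 4.75, 1.71),
    (41, 8, 6.38, 0.74), (53, 11, 5.64, 0.92), (64, 12, 5.42, 1.31),
    (43, 9, 5.33, 1.32),
]
N, M = 120, 6860
n = len(units)

sizes = [u[0] for u in units]                 # M_i for the sampled units
y_hat = [u[0] * u[2] for u in units]          # estimated totals M_i * ybar_i

# Ratio estimator of the mean: total of estimated totals over total of sizes.
r_hat = sum(y_hat) / sum(sizes)

# First-stage term uses residuals from the ratio line y_hat_i - M_i * r_hat.
between = sum((y - M_i * r_hat) ** 2 for y, M_i in zip(y_hat, sizes)) / (n - 1)
within = sum(M_i * (M_i - m_i) * s ** 2 / m_i for M_i, m_i, _, s in units)

var_tau_r = N * (N - n) / n * between + (N / n) * within
var_mu_r = var_tau_r / M ** 2

print(round(r_hat, 2), round(var_mu_r, 3))
```

The ratio estimator has a much smaller estimated variance than the unbiased estimator here because the estimated totals are roughly proportional to the unit sizes M_i.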
• Remark: If M is unknown, one can use µ̂_r and estimate
M by:
\[ \frac{\sum_{i=1}^{n} M_i}{n} \times N. \]
• Recall:
\[ M = \sum_{i=1}^{N} M_i. \]
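As a worked illustration with the employee satisfaction data, where the ten sampled M_i sum to 555 and N = 120 (the symbol \hat{M} is shorthand introduced here, not notation from the slides):

```latex
\[
\hat{M} = \frac{\sum_{i=1}^{n} M_i}{n} \times N
        = \frac{555}{10} \times 120
        = 6660.
\]
```

This is reasonably close to the value M = 6,860 given in the example. The ratio estimator µ̂_r itself does not require M.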
Coffee break!
Subsection 2
Multi-stage design with primary units
selected with p.p.s. and secondary units
selected with srs.
Example
Department Mi mi Textbook expenses in $ for last semester ȳi
• Estimate the population mean using the probability
proportional to size estimator (Hansen-Hurwitz).
• ANSWER:
\[ \hat{\mu}_p = \frac{\sum_{i=1}^{n}\bar{y}_i}{n} = \frac{398 + 371.3 + 451.3 + 427.5}{4} = 412.025. \]
• Estimate the variance of that estimator.
• ANSWER:
\[ \widehat{Var}(\hat{\mu}_p) = \frac{1}{n(n-1)}\sum_{i=1}^{n}(\bar{y}_i-\hat{\mu}_p)^2 \]
\[ = \frac{1}{4 \times 3}\left[(398-412.025)^2 + (371.3-412.025)^2 + (451.3-412.025)^2 + (427.5-412.025)^2\right] \]
\[ = 303.12 \]
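Both answers can be reproduced in a few lines. This is a minimal sketch with illustrative names; it uses only the four rounded department means that appear in the answer, since the underlying expense data are not reproduced here.

```python
# Sample means of the n = 4 departments selected with p.p.s.
ybar = [398, 371.3, 451.3, 427.5]
n = len(ybar)

# Hansen-Hurwitz estimator of the mean: the plain average of the unit means.
mu_p = sum(ybar) / n

# Estimated variance: spread of the unit means around mu_p, over n(n - 1).
var_mu_p = sum((y - mu_p) ** 2 for y in ybar) / (n * (n - 1))

print(round(mu_p, 3), round(var_mu_p, 2))
```

With the rounded means used as input, the variance comes out within rounding of the 303.12 reported in the answer.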
Topics covered in other courses
Feedback and evaluation