
Introduction To Survey Methodology and Sampling Techniques (PDFDrive)

This document provides an overview of an upcoming training on survey methodology and sampling techniques. The training will be conducted over three days from March 14-16, 2016 and will be led by Jorge M. Mendes. It will cover topics such as simple random sampling, confidence intervals, sample size calculations, stratified sampling, cluster sampling, and multistage designs. The goal is for participants to understand how to perform basic sampling methods and make inferences about populations from sample data. Recommended textbooks are also listed.


Introduction to Survey Methodology and Sampling Techniques
14-16 March 2016
Jorge M. Mendes <[email protected]>

CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE EUROPEAN COMMISSION


Overview
Trainer and schedule

• Trainer: Jorge M. Mendes ([email protected])


• Training schedule:
• Morning: 9:00-12:30 (15-minute break at 11:00);
• Afternoon: 14:00-17:00 (15-minute break at 15:30)

3/468
Eurostat
Textbooks

• Scheaffer, R. L., Mendenhall, W. III, Ott, L. and Gerow, K. G. (2011). Elementary Survey Sampling, 7th ed., Brooks/Cole Cengage Learning.
• Cochran, W. G. (1977). Sampling Techniques, 3rd ed., John Wiley & Sons.
• Kish, L. (1965). Survey Sampling, New York, Wiley.
• Särndal, C.-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. New York, Springer-Verlag.

Training learning outcomes

• This course covers sampling design and analysis methods useful for research and management in many fields.
• A well-designed sampling procedure ensures that we can summarize and analyze the data with a minimum of assumptions or complications.
• In this course, we’ll cover the basic methods of sampling and estimation and then explore selected topics and recent developments, including:
  • simple random sampling with associated estimation and confidence interval methods,
  • computing sample sizes,
  • estimating proportions,
  • unequal probability sampling,
  • ratio and regression estimation,
  • stratified sampling,
  • cluster and systematic sampling,
  • multistage designs.
• One important point to consider as we move forward is that the estimation procedure will depend on the sample design.
• Being able to identify what to use under different sampling designs is one of the things that you will learn in this course.

Day 1

1. Introduction
   1.1 An overview of sampling
   1.2 Estimating population mean and total under simple random sampling
   1.3 Confidence intervals and the central limit theorem
   1.4 Domain estimation
2. Confidence intervals and sample size
   2.1 Selecting sample size for estimating population mean and total
   2.2 Confidence intervals for population proportion
   2.3 Sample size needed for estimating proportions
3. Unequal probability sampling
   3.1 Unequal probability sampling

Day 2

3. Unequal probability sampling
   3.1 Unequal probability sampling (continued)
   3.2 The Hansen-Hurwitz estimator
   3.3 The Horvitz-Thompson estimator
   3.4 Small population illustration
4. Auxiliary data and ratio estimation
   4.1 Auxiliary data, ratio estimator and its computation
   4.2 Sample size and small population example for ratio estimation
5. Auxiliary data and regression estimation
   5.1 Linear regression estimator
   5.2 Comparison of estimators
6. Stratified sampling
   6.1 How to use stratified sampling
   6.2 The stratification principle

Day 3

6. Stratified sampling
   6.2 The stratification principle (continued)
   6.3 Post-stratification
   6.4 Further topics on stratification
7. Cluster sampling and systematic sampling
   7.1 Introduction
   7.2 Estimators for cluster sampling when primary units are selected by simple random sampling
   7.3 Estimators for cluster sampling when primary units are selected by pps
   7.4 Systematic sampling
   7.5 Variance and cost in cluster and systematic sampling versus srs
8. Multistage designs
   8.1 Multi-stage sampling: two stages with srs at each stage
   8.2 Primary units selected by pps and secondary units selected with srs
9. Topics covered in other courses

Introduction
Unit learning outcomes

• Upon successful completion of this lesson, you will be able to:
  • know that estimation procedures depend on the sample design
  • distinguish between quota sampling and probability sampling
  • know the desirable properties of estimates
  • distinguish between sampling error and non-sampling errors
  • perform simple random sampling
  • provide a point estimate of the population mean and estimate the variance of that estimate
  • provide a point estimate of the population total and estimate the variance of that estimate

Subsection 1

An overview of sampling

Why do we take samples?

• You want to understand certain things and have some objective in mind.
• In each case there is a target population.
• The goal of many research projects is to know more about your objective, i.e., your population. This is what you are interested in.
• For instance, if you were a conservation officer you might be interested in the number of polar bears in the Arctic.

• In this case, you have a certain goal in mind.
• What steps can we take to understand the population better?
• What we can do is take a sample!
• The major objective in statistics that now arises is inference.
• One important objective of statistics is to make inferences about a population from the information contained in a sample.

• We should always keep in mind that we perform sampling because we want to make this inference.
• Because of this inference we begin to talk about things like confidence intervals and hypothesis testing.
• A good picture to represent this situation follows.

[Figure: Sampling]

Population and sample

• We can draw a sample from the population.
• How do we do this?
• What type of scheme do we use to draw a sample?

Examples of sampling

• Sampling is useful in many different fields; however, different sampling problems can arise in each of the following areas.
• Economic: we might want to estimate the average household income in a country.
• Geologic: we might want to estimate the total pyrite content of the rocks at a specific construction site.
• Marketing research: we might want to estimate the total market size for electric cars.
• Engineering: we might want to estimate the failure rate of a certain electronic component.
• To deal with all of these problems, one thing we have to decide is: how are we going to select a sample?
• There are many ways to take a sample.

Sampling design

• Sampling design is the procedure by which the sample is selected.
• There are two very broad categories of sampling designs:
  • Probabilistic sampling
  • Non-probabilistic sampling

Target population and sampling frame

• Target population: the finite set of elements whose characteristics we want to study.
• Sampling frame: a list, map or any other registry in which the population units (to be sampled) are recorded.
  • Ideally, the list should be exhaustive and without duplications.
  • It is the list of units in the study population.

Probabilistic sampling

• All designs we will discuss in detail fall into this type.
• When we use probability sampling, randomness is built into the sampling design so that properties of the estimators can be assessed probabilistically. Examples: simple random sampling, stratified sampling, cluster sampling, systematic sampling, network sampling, etc.

Non-probabilistic sampling

• This is what people used to do before 1948.
• Sampling here is based upon quotas.
• For instance, each interviewer samples according to quotas that are meant to be representative of the population, while the selection of respondents is left to the subjective judgment of the interviewer.

• How can you ensure that the sample that you have selected is indeed representative?
• If you are subjective when it comes to the individuals sampled, then this is an example of quota sampling.
• Let’s illustrate this point a bit more.

Sample illustration

• Suppose you were going to select and interview people who visit ESTAT premises.
• If you just select people by walking around and picking them to interview from among those you meet, or who happen to walk by, the selection involves human subjectivity.
• Interviewers in probability sampling are given specific sampling procedures to follow, or names and addresses already selected by a randomization scheme, without human subjectivity.

• For example, you might sample every third person who walks through the door of the building, regardless of who they are.
• The main difference between these two approaches is that probability sampling removes the human subjectivity.
• This is an important distinction that you need to be able to make.

Illustration

Let’s compare quota and probability sample results for the 1948 US Washington State presidential poll.

Candidate        Quota sample    Probability sample    Actual result
Dewey (rep)      52.0%           46.0%                 42.7%
Truman (dem)     45.3%           50.5%                 52.6%

• Using quota sampling, Dewey had 52% of the votes and Truman had 45.3% of the votes.
• The Gallup poll pioneered probability sampling.
• Their results gave 46% of the votes to Dewey and 50.5% of the votes to Truman.
• Note that in this case the quota sampling approach was off by quite a bit.
• From this time on, probability sampling became the norm.

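To make "off by quite a bit" concrete, the gap between each poll and the actual result can be computed directly from the table. The sketch below is only illustrative: the dictionary layout and the names `polls` and `poll_errors` are assumptions introduced here, not part of the original slides.

```python
# Poll shares (%) copied from the table above. The data structure and the
# function name are illustrative assumptions, not part of the course material.
polls = {
    "Dewey": {"quota": 52.0, "probability": 46.0, "actual": 42.7},
    "Truman": {"quota": 45.3, "probability": 50.5, "actual": 52.6},
}


def poll_errors(shares):
    """Return poll-minus-actual differences in percentage points."""
    return {
        method: round(shares[method] - shares["actual"], 1)
        for method in ("quota", "probability")
    }


errors = {candidate: poll_errors(shares) for candidate, shares in polls.items()}

for candidate, err in errors.items():
    print(f"{candidate}: quota {err['quota']:+.1f}, "
          f"probability {err['probability']:+.1f}")
```

For Dewey, the quota sample overstates the actual share by 9.3 percentage points, against 3.3 points for the probability sample.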
Final remarks

• When you choose your respondents, use objective criteria.
• The major reason for poor results from quota sampling is the subjectivity involved in the selection of subjects.
• As soon as we introduce this type of bias, we introduce problems with our data, some of which we cannot get rid of even by acquiring additional samples.

Basic idea of sampling and estimation

• One interesting and important fact to note is that in most useful sampling schemes, variability from sample to sample can be estimated using the single sample selected.
• Using the sample we collect, we can construct estimates for the parameter of the population that we are interested in.
• Usually, there are many ways to construct estimates.
• Thus, we need some guidelines to determine which estimates are desirable.

Properties of estimators

• Some desirable properties for estimators are:
  • Unbiased or nearly unbiased.
  • Have a low MSE (mean square error), or a low variance when the estimator is unbiased.¹
  • Robust, so that your answer does not fluctuate too much with respect to extreme values.

¹ MSE measures how far the estimate is from the parameter of interest, whereas variance measures how far the estimate is from the mean of that estimate. Thus, when an estimator is unbiased, its MSE is the same as its variance.

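The relationship between MSE and variance described in the footnote can be written as a decomposition. The slides do not give the formula; the notation below, with an estimator $\hat{\theta}$ of a parameter $\theta$, is assumed here, and the identity itself is the standard one:

```latex
\[
\mathrm{MSE}(\hat{\theta})
  = E\big[(\hat{\theta} - \theta)^2\big]
  = \mathrm{Var}(\hat{\theta}) + \big[E(\hat{\theta}) - \theta\big]^2
\]
```

The second term is the squared bias, so it vanishes for an unbiased estimator and the MSE reduces to the variance.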
Sampling and non-sampling error

• Sampling error: error due to observing only a fraction of the population (the sample) instead of the whole population.
• Non-sampling error: non-response, variables measured with error, etc.

Coffee break!

Subsection 2

Simple random sampling and its estimators

Simple Random Sampling

In simple random sampling (without replacement) every possible sample of n units has the same probability of selection.

• It is often referred to as equal probability sampling because all the units in the population have the same probability of selection, $\dfrac{n}{N}$, where n is the sample size and N is the population size.

Example 1

• A hospital has 1,125 patient records.
• How can one randomly select 120 records to review?
• ANSWER:
  • Assign a number from 1 to 1,125 to each record and randomly select 120 numbers from 1 to 1,125 without replacement.

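The selection described in Example 1 can be sketched in a few lines of Python. This is a minimal illustration, not a procedure prescribed by the course: the function name `draw_srs` and the seed value are assumptions made here, and `random.sample` from the standard library is used because it draws without replacement.

```python
import random

N_RECORDS = 1125   # population size from Example 1
SAMPLE_SIZE = 120  # number of records to review


def draw_srs(population_size, sample_size, seed=None):
    """Return sorted 1-based record numbers drawn by simple random
    sampling without replacement (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


selected = draw_srs(N_RECORDS, SAMPLE_SIZE, seed=2016)
print(selected[:10])  # first ten selected record numbers
```

Recording the seed alongside the sample is one way to document the draw so that it can be audited later.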
Example 2

• How to estimate the total number of beetles in an agricultural field?
• ANSWER:
  • To estimate the total number of beetles in an agricultural field, subdivide the field into 100 equally sized units.
  • Take a simple random sample of eight units and count the number of beetles in these eight units.

Unit    # beetles
9       234
66      256
81      128
11      245
92      211
54      240
6       202
23      267

Population size: N = 100; sample size: n = 8.

Notation

• Let $y_i$ denote the number of beetles in the i-th unit.
• N units in the population.
• Variable of interest: $y_1, \dots, y_N$
• The population mean: $\mu = \dfrac{y_1 + y_2 + \dots + y_N}{N}$
• The population total: $\tau = y_1 + y_2 + \dots + y_N = N \times \mu$
• Sample mean: $\bar{y} = \hat{\mu} = \dfrac{y_1 + y_2 + \dots + y_n}{n}$
• Estimate for population total: $\hat{\tau} = N \times \bar{y}$

Definition: finite population variance

• $\sigma^2 = \displaystyle\sum_{i=1}^{N} \frac{(y_i - \mu)^2}{N - 1}$
• $\sigma^2$ can be estimated by the sample variance $s^2$:
  $$s^2 = \sum_{i=1}^{n} \frac{(y_i - \bar{y})^2}{n - 1} = \frac{(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \dots + (y_n - \bar{y})^2}{n - 1}$$
• Sample standard deviation: $s = \sqrt{s^2}$

The beetle example

• For the beetle example, the observed counts in the eight sampled units are 234, 256, 128, 245, 211, 240, 202, 267.
• Sample mean:
  $$\bar{y} = \hat{\mu} = \frac{y_1 + y_2 + \dots + y_8}{8} = \frac{234 + \dots + 267}{8} = 222.875$$
• Sample variance:
  $$s^2 = \sum_{i=1}^{n=8} \frac{(y_i - \bar{y})^2}{n - 1} = \sum_{i=1}^{8} \frac{(y_i - \bar{y})^2}{7} = 1932.657$$
• Sample standard deviation:
  $$s = \sqrt{s^2} = \sqrt{1932.657} = 43.962$$

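The summary statistics for the beetle sample can be reproduced with Python's standard `statistics` module, whose `variance` and `stdev` functions use the n − 1 divisor, matching the definition of $s^2$ above. The variable names are assumptions made for this sketch. Computed directly from the eight counts, $s^2$ comes out at about 1932.70, marginally different from the 1932.657 quoted on the slide; the standard deviation agrees to three decimals.

```python
import statistics

# Beetle counts in the eight sampled units (from the table in Example 2).
beetles = [234, 256, 128, 245, 211, 240, 202, 267]

n = len(beetles)
y_bar = statistics.mean(beetles)   # sample mean
s2 = statistics.variance(beetles)  # sample variance, divisor n - 1
s = statistics.stdev(beetles)      # sample standard deviation

print(f"n = {n}, mean = {y_bar:.3f}, s^2 = {s2:.3f}, s = {s:.3f}")
```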
Estimate for the population total is:

$$\hat{\tau} = N \times \bar{y} = 100 \times 222.875 = 22{,}287.5$$

Properties of ȳ (SRS)

Unbiased:

$$
\begin{aligned}
E(\bar{y}) &= E\left(\frac{y_1 + y_2 + \dots + y_n}{n}\right) \\
           &= \frac{E(y_1) + E(y_2) + \dots + E(y_n)}{n} \\
           &= \frac{\mu + \mu + \dots + \mu}{n} = \frac{n\mu}{n} \\
           &= \mu
\end{aligned}
$$

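Unbiasedness can be checked exactly on a toy population by listing every possible simple random sample and averaging the sample means. The five population values below are hypothetical, chosen only to keep the enumeration small; exact fractions are used so that the comparison is not blurred by rounding.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy population (not from the course material).
population = [3, 7, 8, 12, 20]
n = 2  # sample size

mu = Fraction(sum(population), len(population))  # population mean

# Under SRS without replacement every subset of size n is equally likely,
# so E(y_bar) is the plain average of the sample means over all subsets.
samples = list(combinations(population, n))
sample_means = [Fraction(sum(sample), n) for sample in samples]
expected_y_bar = sum(sample_means) / len(sample_means)

print(f"number of samples = {len(samples)}")
print(f"mu = {mu}, E(y_bar) = {expected_y_bar}")
```

Individual sample means range from 5 to 16 here, yet their average over all ten samples equals the population mean exactly.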
• Under simple random sampling, the variance of $\bar{y}$ is:
  $$\mathrm{Var}(\bar{y}) = \frac{N - n}{N} \cdot \frac{\sigma^2}{n}$$
• Note that $\dfrac{N - n}{N} = 1 - \dfrac{n}{N}$ is called the finite population correction fraction.
• Remark 1: when the sampling is done with replacement, the fraction disappears.
• Remark 2: when the sample size is very small compared to the population size, the fraction is close to 1 and can be dropped.
• Remark 3: $\dfrac{n}{N}$ is sometimes referred to as the sampling rate.

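The variance formula, including the finite population correction, can be verified by brute force on a small population: enumerate all samples, compute the variance of the sample means, and compare it with $\frac{N-n}{N}\cdot\frac{\sigma^2}{n}$. The population values are hypothetical and the variable names are assumptions of this sketch; $\sigma^2$ uses the N − 1 divisor, as in the definition given earlier.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy population (not from the course material).
population = [3, 7, 8, 12, 20]
N = len(population)
n = 2

mu = Fraction(sum(population), N)
# Finite population variance with divisor N - 1.
sigma2 = sum((y - mu) ** 2 for y in population) / (N - 1)

# Variance of the sample mean according to the formula.
fpc = Fraction(N - n, N)  # finite population correction
var_formula = fpc * sigma2 / n

# Variance of the sample mean by enumerating all equally likely samples.
sample_means = [Fraction(sum(s), n) for s in combinations(population, n)]
var_enumerated = sum((m - mu) ** 2 for m in sample_means) / len(sample_means)

print(f"formula: {float(var_formula)}, enumeration: {float(var_enumerated)}")
```

Both routes give 12.45. Without the correction, $\sigma^2/n$ would be 20.75, which overstates the variability when 40% of the population is sampled.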
• If one wants to estimate $\mathrm{Var}(\bar{y})$, one needs to replace $\sigma^2$ by its estimate $s^2$ in the formula.
• The estimate for $\mathrm{Var}(\bar{y})$ is denoted $\widehat{\mathrm{Var}}(\bar{y})$:
  $$\widehat{\mathrm{Var}}(\bar{y}) = \frac{N - n}{N} \cdot \frac{s^2}{n}$$
• For the beetle example:
  $$\widehat{\mathrm{Var}}(\bar{y}) = \frac{N - n}{N} \cdot \frac{s^2}{n} = \frac{100 - 8}{100} \cdot \frac{1932.657}{8} = 222.256$$

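The estimated variance of the sample mean for the beetle data can be computed as follows. This is a sketch with assumed variable names; the standard error line is an addition for convenience and is simply the square root of the estimated variance. Because $s^2$ is computed directly from the counts, the result is about 222.26, slightly above the 222.256 obtained from the rounded value on the slide.

```python
import math
import statistics

beetles = [234, 256, 128, 245, 211, 240, 202, 267]
N = 100           # number of units the field was divided into
n = len(beetles)  # sample size

s2 = statistics.variance(beetles)  # sample variance, divisor n - 1
fpc = (N - n) / N                  # finite population correction
var_hat_y_bar = fpc * s2 / n       # estimated Var(y_bar)
se_y_bar = math.sqrt(var_hat_y_bar)

print(f"fpc = {fpc:.2f}, Var_hat(y_bar) = {var_hat_y_bar:.3f}, "
      f"SE = {se_y_bar:.3f}")
```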
Properties of τ̂ (SRS)

It is unbiased:

$$E(\hat{\tau}) = E(N \times \bar{y}) = N \times \mu = \tau$$

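Its variance, $\mathrm{Var}(\hat{\tau})$, is:

$$
\begin{aligned}
\mathrm{Var}(\hat{\tau}) &= \mathrm{Var}(N \times \bar{y}) = N^2 \cdot \mathrm{Var}(\bar{y}) \\
                         &= N^2 \cdot \frac{N - n}{N} \cdot \frac{\sigma^2}{n} \\
                         &= N \cdot (N - n) \cdot \frac{\sigma^2}{n}
\end{aligned}
$$
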
• The estimate for $\mathrm{Var}(\hat{\tau})$ is thus:
  $$\widehat{\mathrm{Var}}(\hat{\tau}) = N (N - n) \cdot \frac{s^2}{n}$$
• For the beetle example:
  $$\widehat{\mathrm{Var}}(\hat{\tau}) = 100 \cdot (100 - 8) \cdot \frac{1932.657}{8} = N^2 \cdot \widehat{\mathrm{Var}}(\bar{y}) \approx 2{,}222{,}560$$

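The estimate of the population total and its estimated variance can be wrapped in one small function. The function name `srs_total_estimate` is hypothetical, introduced only for this sketch. With $s^2$ computed directly from the data, the variance estimate is about 2,222,601, close to the slide's 2,222,560, which is based on rounded intermediate values.

```python
import statistics


def srs_total_estimate(sample, population_size):
    """Estimate a population total under SRS without replacement.

    Returns (tau_hat, var_hat_tau). Hypothetical helper for illustration.
    """
    n = len(sample)
    y_bar = statistics.fmean(sample)
    s2 = statistics.variance(sample)  # divisor n - 1
    tau_hat = population_size * y_bar
    var_hat_tau = population_size * (population_size - n) * s2 / n
    return tau_hat, var_hat_tau


beetles = [234, 256, 128, 245, 211, 240, 202, 267]
tau_hat, var_hat_tau = srs_total_estimate(beetles, population_size=100)

print(f"tau_hat = {tau_hat:.1f}, Var_hat(tau_hat) = {var_hat_tau:.1f}")
```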
Subsection 3

Confidence intervals and the central limit theorem

Confidence intervals

• The idea behind confidence intervals is that it is not enough to use the sample mean alone to estimate the population mean.
• The sample mean by itself is a single point.
• This does not give people any idea of how good your estimate of the population mean is.
• If we want to assess the accuracy of this estimate, we use confidence intervals, which provide information on how good our estimate is.

• A confidence interval, defined before the sample is selected, is an interval that has a pre-specified probability of containing the parameter.
• To obtain this confidence interval you need to know the sampling distribution of the estimate.
• Once we know the distribution, a confidence interval can be defined.

• So the type of statement that we want to make looks like this:

\[ P(|\hat{\theta} - \theta| < d) = 1 - \alpha \]

• Thus, we need to know the distribution of θ̂.
• In certain cases the distribution of θ̂ can be stated easily.
• However, there are many different types of distributions.
• The normal distribution is easy to use as an example because it does not bring too much complexity with it.
• What do we mean when we talk about the Central Limit Theorem for the sample mean?
• The finite population Central Limit Theorem for the sample mean describes what happens when the sample size n gets large.
• ȳ, the sample mean, has mean µ (the population mean) and standard deviation σ/√n:

\[ \bar{y} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right). \]

• Since we do not know σ, we use s to estimate it.
• We can thus estimate the standard deviation of ȳ by

\[ \frac{s}{\sqrt{n}}. \]

• Thus, approximately,

\[ \bar{y} \sim N\left(\mu, \frac{s}{\sqrt{n}}\right). \]
• The n in the denominator helps us: as n gets larger, the standard deviation of ȳ gets smaller.
• The distribution of ȳ is very complicated when the sample size is small.
• When the sample size is larger there is more regularity, and the distribution is easier to see.
• This is not the case when the sample size is small.
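The effect of n on the spread of ȳ can be checked with a small simulation. The following is a minimal sketch and not part of the original training material: the skewed population, the seeds, and the function name `simulate_sample_means` are hypothetical choices, and the theoretical standard deviation includes the finite population correction (N − n)/N, which the slides above leave out.

```python
import math
import random
import statistics

def simulate_sample_means(population, n, reps, seed=0):
    """Draw `reps` simple random samples (without replacement) of size n
    and return the mean and standard deviation of the sample means."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(reps)]
    return statistics.fmean(means), statistics.pstdev(means)

# Hypothetical skewed population of N = 100 units (exponential draws).
pop_rng = random.Random(42)
population = [pop_rng.expovariate(1 / 200) for _ in range(100)]
N = len(population)
mu = statistics.fmean(population)
sigma2 = statistics.variance(population)  # finite population variance, divisor N - 1

for n in (5, 30):
    mean_of_means, sd_of_means = simulate_sample_means(population, n, reps=4000)
    theory_sd = math.sqrt((N - n) / N * sigma2 / n)
    print(f"n={n}: mean of means={mean_of_means:.1f} (mu={mu:.1f}), "
          f"sd of means={sd_of_means:.2f}, theory={theory_sd:.2f}")
```

The simulated standard deviation of the sample means should shrink as n grows and stay close to the theoretical value.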
Confidence interval for µ

• If we go about picking samples, we can compute ȳ for each one and from it construct an interval around the mean.
• Thus, a 100(1 − α)% confidence interval for µ can be derived as follows:

\[ \frac{\bar{y} - \mu}{\sqrt{\mathrm{Var}(\bar{y})}} \sim N(0, 1) \quad \text{and, approximately,} \quad \frac{\bar{y} - \mu}{\sqrt{\widehat{\mathrm{Var}}(\bar{y})}} \sim N(0, 1). \]
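• Now, we can compute the confidence interval as:

\[
\begin{aligned}
P\left( \left| \frac{\bar{y} - \mu}{\sqrt{\widehat{\mathrm{Var}}(\bar{y})}} \right| < d \right) &= 1 - \alpha \\
P\left( \left| \frac{\bar{y} - \mu}{\sqrt{\widehat{\mathrm{Var}}(\bar{y})}} \right| < z_{1-\alpha/2} \right) &= 1 - \alpha \\
P\left( \bar{y} - z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\bar{y})} < \mu < \bar{y} + z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\bar{y})} \right) &= 1 - \alpha
\end{aligned}
\]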
• Thus,

\[ \bar{y} \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\bar{y})} = \bar{y} \pm z_{1-\alpha/2} \sqrt{\left(\frac{N-n}{N}\right)\frac{s^2}{n}} \]

• This is the confidence interval for µ.
• The confidence interval for τ is given below.
• A 100(1 − α)% confidence interval for τ is given by:

\[ \hat{\tau} \pm z_{1-\alpha/2} \sqrt{N(N-n)\frac{s^2}{n}} \]

• Be careful: when can we use these intervals?
• In what situations are they applicable?
• These approximate intervals are good when n is large (because of the Central Limit Theorem), or when the observations y1, y2, ..., yn are normal.
Confidence intervals and sample size

• When the sample size is 30 or more, we consider it large and, by the Central Limit Theorem, ȳ will be approximately normal even if the sample does not come from a normal distribution.
• Thus, when the sample size is 30 or more, there is no need to check whether the sample comes from a normal distribution.
• When the sample size is 8 to 29, we would usually use a normal probability plot to see whether the data come from a normal distribution.²
• However, when the sample size is 7 or less, a normal probability plot may fail to reveal departures from normality simply because there are too few observations.
• Remark: In the examples of this training we typically use small sample sizes for illustration purposes only.

² If the data do not violate the normality assumption, we can go ahead and use the interval.
• For the beetle example in the text, an approximate 95% CI for µ is:

\[ \bar{y} \pm z_{1-\alpha/2} \sqrt{\left(\frac{N-n}{N}\right)\frac{s^2}{n}} \]

• Note that the z-value for α = 0.05 (that is, 1 − α/2 = 0.975) can be found in the following table:

Confidence   α      1 − α/2   z1−α/2
90%          0.10   0.950     1.64
95%          0.05   0.975     1.96
99%          0.01   0.995     2.58
• For the beetle example, the sample statistics are:
• sample mean: ȳ = 222.875
• sample variance: s² = 1932.657

\[
\begin{aligned}
\bar{y} \pm z_{1-\alpha/2} \sqrt{\left(\frac{N-n}{N}\right)\frac{s^2}{n}}
&= 222.875 \pm 1.96 \sqrt{222.256} \\
&= 222.875 \pm 1.96 \times 14.908 \\
&= 222.875 \pm 29.220
\end{aligned}
\]
• And an approximate 95% CI for τ is then:

\[
\begin{aligned}
\hat{\tau} \pm z_{1-\alpha/2} \sqrt{N(N-n)\frac{s^2}{n}}
&= 22{,}287.5 \pm 1.96 \sqrt{2{,}222{,}560} \\
&= 22{,}287.5 \pm 2{,}922.018
\end{aligned}
\]
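The two beetle intervals can be reproduced with a few lines of code. This is a minimal sketch under the assumptions of simple random sampling without replacement; the function names `srs_mean_ci` and `srs_total_ci` are hypothetical, and the inputs are the summary statistics quoted above (ȳ = 222.875, s² = 1932.657, with N = 100 and n = 8 as stated in the later beetle slide).

```python
import math

def srs_mean_ci(ybar, s2, n, N, z=1.96):
    """Approximate CI for the population mean under simple random sampling."""
    var_hat = (N - n) / N * s2 / n          # estimated Var(ybar) with fpc
    half_width = z * math.sqrt(var_hat)
    return ybar - half_width, ybar + half_width

def srs_total_ci(ybar, s2, n, N, z=1.96):
    """Approximate CI for the population total tau = N * mu."""
    tau_hat = N * ybar
    var_hat = N * (N - n) * s2 / n          # estimated Var(tau_hat)
    half_width = z * math.sqrt(var_hat)
    return tau_hat - half_width, tau_hat + half_width

mean_lo, mean_hi = srs_mean_ci(222.875, 1932.657, n=8, N=100)
total_lo, total_hi = srs_total_ci(222.875, 1932.657, n=8, N=100)
print(f"95% CI for mu:  ({mean_lo:.3f}, {mean_hi:.3f})")
print(f"95% CI for tau: ({total_lo:.1f}, {total_hi:.1f})")
```

The half-widths should agree with the hand calculation (about 29.22 for µ and about 2,922 for τ) up to rounding.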
Subsection 4

Domain estimation

• Quite often, it is impossible to obtain a frame that lists only those elements of the population that one is interested in.
• For example, you want to sample households with children, but the best frame available is a list of all households.
• Check visually the type of problem.
• Therefore, we wish to estimate the parameters of a subpopulation (domain) of the population represented in the frame.
• Main issue: the size of the domain (subpopulation) is unknown.
Notation

• N: the number of elements in the population
• N_d: the number of elements in the domain (subpopulation)
• n: the sample size from the population
• n_d: the number of sampled elements that fall in the domain (subpopulation)
• y_di: the i-th sampled observation that falls in the domain (subpopulation)
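• An unbiased estimator of µ_d, the subpopulation mean, is:

\[ \bar{y}_d = \frac{1}{n_d} \sum_{i=1}^{n_d} y_{di}. \]

• Its variance is estimated by:

\[ \widehat{\mathrm{Var}}(\bar{y}_d) = \left(\frac{N_d - n_d}{N_d}\right) \frac{s_d^2}{n_d}, \quad \text{where } s_d^2 = \frac{\sum_{i=1}^{n_d} (y_{di} - \bar{y}_d)^2}{n_d - 1}. \]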
• Usually we do not know N_d, so we estimate the finite population correction factor

\[ \frac{N_d - n_d}{N_d} \quad \text{by} \quad \frac{N - n}{N}. \]
Example: variable food cost

• Let’s say we want to estimate the average weekly amount spent on food by married graduate students in a certain college.
• There are N = 80 graduate students in the college.
• n = 15 are sampled, of whom n_m = 10 are married.
• A summary of the data follows:

Marital status   n    Mean    Std. deviation
married          10   135.3   44.4
single           5    87.6    21.6
• What is the average food cost for married students in that college?
• ANSWER: The estimated average food cost for married students is

\[ \bar{y}_m = 135.3. \]
• Provide an estimate of the standard deviation of this estimate.
• ANSWER: The estimated variance and standard deviation of the estimate are

\[ \widehat{\mathrm{Var}}(\bar{y}_m) = \frac{80 - 15}{80} \cdot \frac{44.4^2}{10} = 160.173, \qquad \widehat{\mathrm{SD}}(\bar{y}_m) = \sqrt{160.173} = 12.656. \]
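The domain calculation can be written as a short function. This is a minimal sketch: the function name `domain_mean_se` is hypothetical, and it assumes, as the slides do, that the unknown domain correction (N_d − n_d)/N_d is replaced by the overall (N − n)/N.

```python
import math

def domain_mean_se(sd_d, n_d, n, N):
    """Estimated variance and standard deviation of a domain mean,
    using (N - n) / N in place of the unknown (N_d - n_d) / N_d."""
    fpc = (N - n) / N
    var_hat = fpc * sd_d ** 2 / n_d
    return var_hat, math.sqrt(var_hat)

# Married graduate students: n_m = 10 of n = 15 sampled, N = 80, s_m = 44.4.
var_m, sd_m = domain_mean_se(sd_d=44.4, n_d=10, n=15, N=80)
print(f"Var-hat = {var_m:.3f}, SD-hat = {sd_m:.3f}")
```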
Confidence intervals and sample size
Unit learning outcomes

• Upon successful completion of this lesson, you will be able to:
  • find the sample size needed for estimating a population mean and a population total
  • compute the confidence interval for a population proportion
  • find the sample size needed for estimating a population proportion by both the educated guess method and the conservative method
  • know when to use the educated guess method and when to use the conservative method
Subsection 1

Calculating sample size

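Sample size for mean and total

• How large should a sample be to estimate the population mean with a specified accuracy?
• If θ̂ is an unbiased, normally distributed estimator of θ, then

\[ \frac{\hat{\theta} - \theta}{\sqrt{\mathrm{Var}(\hat{\theta})}} \sim N(0, 1). \]

Then

\[
\begin{aligned}
P\left( \frac{|\hat{\theta} - \theta|}{\sqrt{\mathrm{Var}(\hat{\theta})}} < z_{1-\alpha/2} \right) &= 1 - \alpha \\
P\left( |\hat{\theta} - \theta| < z_{1-\alpha/2} \cdot \sqrt{\mathrm{Var}(\hat{\theta})} \right) &= 1 - \alpha
\end{aligned}
\]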
• If we specify α, we can then look for a sample size large enough to achieve the goal of the experiment.
• So we need to ask, "What is the goal of your experiment?"
• This is perhaps the most important question to ask when planning it.
• What if we were interested in estimating the average weight of ESTAT male collaborators?
• How many observations should we plan on taking to estimate this mean weight?
• What do we need to consider?
• First: how precise do you want this estimate to be?
• You thus need to specify the margin of error.
• We should also take into account:
1. The variability of the quantity you are estimating. This is your first concern, and it directly affects the sample size.
2. The type of conclusion that you would like to report. That is, you need to specify 1 − α, the confidence level that you are happy with.
• Now, if we specify the confidence level 1 − α and the margin of error d (which can also be viewed as the half-width of the 100(1 − α)% CI), we can solve for the sample size such that the CI has the specified margin of error.
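• For estimating the population mean, the equation becomes:

\[
\begin{aligned}
P\left( |\bar{y} - \mu| < z_{1-\alpha/2} \cdot \sqrt{\frac{N-n}{N} \cdot \frac{\sigma^2}{n}} \right) &= 1 - \alpha \\
z_{1-\alpha/2} \cdot \sqrt{\frac{N-n}{N} \cdot \frac{\sigma^2}{n}} &= d \\
n &= \frac{1}{\dfrac{d^2}{z_{1-\alpha/2}^2 \cdot \sigma^2} + \dfrac{1}{N}}
\end{aligned}
\]

• Can we now use this formula to compute the sample size?
• Not exactly!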
• The weak point is the population variance in the formula.
• We do not know the value of σ².
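• Similarly, for estimating the population total τ, here is the formula:

\[
\begin{aligned}
P\left( |\hat{\tau} - \tau| < z_{1-\alpha/2} \cdot \sqrt{N(N-n)\frac{\sigma^2}{n}} \right) &= 1 - \alpha \\
z_{1-\alpha/2} \cdot \sqrt{N(N-n)\frac{\sigma^2}{n}} &= d \\
n &= \frac{1}{\dfrac{d^2}{N^2 \cdot z_{1-\alpha/2}^2 \cdot \sigma^2} + \dfrac{1}{N}}
\end{aligned}
\]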
The beetle example

• What sample size is needed to estimate the population total of beetles, τ, to within d = 1000 with a 95% CI?

Unit   # beetles
9      234
66     256
81     128
11     245
92     211
54     240
6      202
23     267

Sample mean (ȳ): 222.875
Sample variance (s²): 1932.657

Population size: N = 100; sample size: n = 8.
• Now, let’s begin plugging what we know into the formula.
• We know N = 100, α = 0.05 and d = 1000.
• Do we know σ²?
• No, but we can estimate σ² by

\[ s^2 = \sum_{i=1}^{n} \frac{(y_i - \bar{y})^2}{n - 1} = 1932.657. \]

• How many units should we sample?
• Let’s calculate this out:

\[
n = \frac{1}{\dfrac{d^2}{N^2 \cdot z_{1-\alpha/2}^2 \cdot \sigma^2} + \dfrac{1}{N}}
  = \frac{1}{\dfrac{(1000)^2}{(100)^2 \cdot (1.96)^2 \cdot 1932.657} + \dfrac{1}{100}}
  = 42.610
\]

• We always round up, so we will sample 43 of the 100 plots.
• Remark: If we ignore the finite population correction, then

\[
n = \frac{N^2 \cdot z_{1-\alpha/2}^2 \cdot \sigma^2}{d^2}
  = \frac{(100)^2 \cdot (1.96)^2 \cdot 1932.657}{(1000)^2}
  = 74.245,
\]

which rounds up to 75.

• This value is much larger than 43.
Think about it!

• What is the major point that was just illustrated in the previous example?
• ANSWER: In this first example, N = 100 is not very large compared to n, so one should not ignore the finite population correction!
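The comparison with and without the finite population correction is easy to reproduce. The following is a minimal sketch of the sample size formula for a total; the function name `n_for_total` and its `use_fpc` switch are hypothetical, and s² = 1932.657 stands in for the unknown σ² as in the slides.

```python
import math

def n_for_total(d, sigma2, N, z=1.96, use_fpc=True):
    """Sample size so that the CI for the total has half-width d."""
    n0 = N ** 2 * z ** 2 * sigma2 / d ** 2   # size when the fpc is ignored
    if not use_fpc:
        return n0
    return 1 / (1 / n0 + 1 / N)              # same as 1 / (d^2/(N^2 z^2 sigma^2) + 1/N)

n_with_fpc = n_for_total(d=1000, sigma2=1932.657, N=100)
n_without_fpc = n_for_total(d=1000, sigma2=1932.657, N=100, use_fpc=False)
print(math.ceil(n_with_fpc), math.ceil(n_without_fpc))  # 43 75
```

Writing the corrected size as 1 / (1/n0 + 1/N) makes the role of the correction visible: it only matters when n0 is not small relative to N.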
• In the beetle example, there are data to estimate σ².
• What can one do if there are no pilot data?
• How can we get a rough idea of what σ² is?
Example

• A farm has 1000 young pigs with an initial weight of about 50 lbs.
• The pigs are put on a new diet for 3 weeks, and the farm wants to know how many pigs to sample in order to estimate the average weight gain.
• The estimate should be within 2 lbs of the true value with 90% confidence.
• There are no pilot data here.
• We do not have the time to select some pigs in order to get an estimate of σ², the variance of the weight gain.
• Question: How do we get a rough estimate of σ?
• What would be a reasonable way to give this farmer some guidance on estimating the standard deviation of the weight gain?
• One thing we can do is rely on the information that we already have, i.e., find historical data on this topic.
• But what if such historical data do not exist?
• For certain variables we can make a reasonable guess at σ.
• Here is a formula for this rough estimate:

\[ \sigma \approx \frac{\text{Range}}{4} \]

• The range is relatively easy to have some idea about.
• This is an important point.
• Even though perhaps none of us has raised pigs, we can still come up with a sensible guess.
• For this case we guess that the weight gain within this 3-week period ranges from a minimum of 10 lbs to a maximum of 50 lbs.
• σ can now be roughly estimated as:

\[ \frac{\text{Range}}{4} = \frac{50 - 10}{4} = 10 \text{ lbs} \]
• Now we can use the sample size formula for estimating the mean, µ.
• Then,

\[
n = \frac{1}{\dfrac{d^2}{z_{1-\alpha/2}^2 \cdot \sigma^2} + \dfrac{1}{N}}
  = \frac{1}{\dfrac{2^2}{(1.645)^2 \cdot (10)^2} + \dfrac{1}{1000}}
  = 63.36
\]
• The value 63.36 should be rounded up to 64.
• We need to sample 64 pigs in order to estimate the average weight gain in 3 weeks to within 2 lbs with 90% confidence.
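The pig calculation combines the range rule for σ with the sample size formula for a mean. This is a minimal sketch; the function names `sigma_from_range` and `n_for_mean` are hypothetical, and the 10 to 50 lbs range is the guess made in the slides, not measured data.

```python
import math

def sigma_from_range(low, high):
    """Rough guess at the standard deviation: range / 4."""
    return (high - low) / 4

def n_for_mean(d, sigma, N, z):
    """Sample size so that the CI for the mean has half-width d (with fpc)."""
    return 1 / (d ** 2 / (z ** 2 * sigma ** 2) + 1 / N)

sigma_guess = sigma_from_range(10, 50)                 # 10.0 lbs
n_raw = n_for_mean(d=2, sigma=sigma_guess, N=1000, z=1.645)
n_pigs = math.ceil(n_raw)                              # always round up
print(f"sigma = {sigma_guess}, n = {n_raw:.2f}, sample {n_pigs} pigs")
```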
Subsection 2

Confidence intervals for population proportion
Estimating proportions

• Estimating a population proportion can be seen as a particular case of estimating the population mean.
• We want to estimate the proportion of units in the population that have some attribute.
• For example, a question might be, "What proportion of ESTAT workers are smokers?"
• Poll surveys: most are based on telephone interviews, with a significant portion based on interviews conducted in person during home visits.
• Usually the sample size is at least 1000, sometimes even 1500.
• Let us see in what ways the proportion problem is related to the mean problem...
• Question: Do you approve of President Juncker’s job performance?
• Answer: for population units i = 1, 2, ..., N,

\[ y_i = \begin{cases} 0, & \text{no} \\ 1, & \text{yes} \end{cases} \]

• The variable of interest: y1, y2, ..., yN
• Population proportion:

\[ p = \frac{1}{N} \sum_{i=1}^{N} y_i, \]

which is the population mean, µ, of Y.
• If we take a simple random sample of size n, then

\[ \hat{p} = \sum_{i=1}^{n} \frac{y_i}{n} = \bar{y} \]

• Because yi takes only the values 0 and 1, its variance is determined by its mean.
• To find the finite population variance of y1, y2, ..., yN, recall that the population mean is:

\[ \mu = \frac{1}{N} \sum_{i=1}^{N} y_i = p. \]
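By definition the variance is then:

\[
\begin{aligned}
\sigma^2 &= \frac{\sum_{i=1}^{N} (y_i - p)^2}{N - 1} \\
&= \frac{\sum_{i=1}^{N} (y_i^2 - 2 p y_i + p^2)}{N - 1} \\
&= \frac{\sum_{i=1}^{N} y_i^2 - 2p \sum_{i=1}^{N} y_i + N p^2}{N - 1}
\end{aligned}
\]

Then, since y_i² = y_i:

\[
\begin{aligned}
\sigma^2 &= \frac{\sum_{i=1}^{N} y_i - 2p \sum_{i=1}^{N} y_i + N p^2}{N - 1} \\
&= \frac{Np - 2p(Np) + N p^2}{N - 1} \\
&= \frac{Np - N p^2}{N - 1} = \frac{N}{N - 1}\, p (1 - p)
\end{aligned}
\]

Theoretically this is the variance.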
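The closed form can be checked numerically against the definition. This is a small sketch on a made-up 0/1 population (22 ones among N = 100 units, chosen only for illustration); the function name `finite_population_variance` is hypothetical.

```python
def finite_population_variance(values):
    """Finite population variance with divisor N - 1."""
    N = len(values)
    mean = sum(values) / N
    return sum((v - mean) ** 2 for v in values) / (N - 1)

# Hypothetical population: 22 "yes" (1) and 78 "no" (0) answers.
y = [1] * 22 + [0] * 78
N = len(y)
p = sum(y) / N

direct = finite_population_variance(y)      # from the definition
closed_form = N / (N - 1) * p * (1 - p)     # N/(N-1) * p * (1 - p)
print(direct, closed_form)
```

Both numbers should agree up to floating-point error, which confirms that for a 0/1 variable the variance is fixed once p is known.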
• How will we estimate this?
• We can estimate it by:

\[ \hat{\sigma}^2 = s^2 = \frac{n}{n - 1}\, \hat{p} \cdot (1 - \hat{p}). \]

• We want to see how p̂ behaves; therefore, we want to know its distribution.
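• First, we find its mean, then its variance.
• Since p̂ is ȳ, we get E(p̂) = µ = p.
• Then, we proceed to find its variance:

\[
\begin{aligned}
\mathrm{Var}(\hat{p}) &= \left(1 - \frac{n}{N}\right) \cdot \frac{\sigma^2}{n} \\
&= \frac{N - n}{N} \cdot \frac{N \cdot p \cdot (1 - p)}{(N - 1) \cdot n} \\
&= \left(\frac{N - n}{N - 1}\right) \cdot \frac{p \cdot (1 - p)}{n}
\end{aligned}
\]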
• How will we estimate the variance of p̂?
• There are many ways to do this.
• One method would be to use maximum likelihood; another would be to find the unbiased estimator.
• An unbiased estimator of the variance is:

\[ \widehat{\mathrm{Var}}(\hat{p}) = \left(\frac{N - n}{N}\right) \cdot \frac{\hat{p} \cdot (1 - \hat{p})}{n - 1} \]

• This is one reasonable way of estimating the variance.
• The answer will not be very different from what one would get using other methods.
• What about confidence intervals?
• For this we need to know the distribution of p̂.
• When the sample size is large, p̂ is approximately normally distributed by the Central Limit Theorem.
• Therefore, we can use the usual interval:

\[ \hat{p} \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{p})} \]
• How large is large enough?
• Answer: the sample is large enough if n · p̂ ≥ 5 and n · (1 − p̂) ≥ 5.
Back to example

• Imagine President Juncker’s final approval rating is 22% (based on a sample of 1112 interviews)!
• From this statistic, we can provide a 95% CI for the true proportion.
• The 22% is a sample proportion.
• What is the true population proportion?
• ANSWER: A 95% confidence interval for p is:

\[ 0.22 \pm 1.96 \sqrt{0.0001545} = 0.22 \pm 0.0244 \]

where

\[ \widehat{\mathrm{Var}}(\hat{p}) = \left(\frac{N - n}{N}\right) \cdot \frac{\hat{p} \cdot (1 - \hat{p})}{n - 1} \approx 1 \cdot \frac{0.22 \times 0.78}{1112 - 1} = 0.0001545 \]

• Remark: because N is large compared to n, we ignore the finite population correction.
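The approval-rating interval can be reproduced as follows. This is a minimal sketch; the function name `proportion_ci` is hypothetical, and passing `N=None` is simply this sketch's way of saying that the finite population correction is taken as 1.

```python
import math

def proportion_ci(p_hat, n, N=None, z=1.96):
    """Approximate CI for a population proportion under simple random sampling."""
    # Large-sample rule of thumb from the slides.
    if n * p_hat < 5 or n * (1 - p_hat) < 5:
        raise ValueError("sample too small for the normal approximation")
    fpc = 1.0 if N is None else (N - n) / N
    var_hat = fpc * p_hat * (1 - p_hat) / (n - 1)
    half_width = z * math.sqrt(var_hat)
    return p_hat - half_width, p_hat + half_width, var_hat

p_lo, p_hi, var_p = proportion_ci(0.22, n=1112)
print(f"Var-hat = {var_p:.7f}, 95% CI = ({p_lo:.4f}, {p_hi:.4f})")
```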
Subsection 3

Sample size for estimating proportions

• Using the formula for the sample size needed to estimate the mean, we have:

\[ n = \frac{1}{\dfrac{d^2}{z_{1-\alpha/2}^2 \cdot \sigma^2} + \dfrac{1}{N}}. \]

• Now, substituting σ² = N/(N − 1) · p · (1 − p), we get:

\[ n = \frac{N \cdot p \cdot (1 - p)}{(N - 1) \dfrac{d^2}{z_{1-\alpha/2}^2} + p \cdot (1 - p)}. \]
• When the finite population correction can be ignored, the formula is:

\[ n \approx \frac{z_{1-\alpha/2}^2 \cdot p \cdot (1 - p)}{d^2}. \]

• For proportions, in addition to using an educated guess for p, we can also find a conservative sample size that guarantees the margin of error is small enough at the specified α.
A. Educated guess (estimate p by p̂):

\[ n = \frac{N \cdot \hat{p} \cdot (1 - \hat{p})}{(N - 1) \dfrac{d^2}{z_{1-\alpha/2}^2} + \hat{p} \cdot (1 - \hat{p})}. \]

1. Note that p̂ may differ from the true proportion.
2. The sample size may not be large enough in some cases (i.e., the margin of error may not be as small as specified).
B. Conservative sample size:
$$n = \frac{N\cdot 1/4}{(N-1)\dfrac{d^2}{z_{1-\alpha/2}^2} + 1/4}.$$
1. This works because p(1 − p) attains its maximum, 1/4, at p = 1/2.
Example

To estimate the president's final approval rating, how many people should be sampled so that the absolute margin of error is 3% (a popular choice), with 95% confidence?
A. Use an educated guess: Junker's p̂ = 0.22.
Since N is very large compared to n, the finite population correction is not needed.
$$n = \frac{\hat{p}\cdot(1-\hat{p})\cdot z_{1-\alpha/2}^2}{d^2} = \frac{0.22\cdot 0.78\cdot 1.96^2}{0.03^2} = 732.47$$
1. Round up to 733.
To estimate the president's final approval rating, how many people should be sampled so that the margin of error is 3% (a popular choice), with 95% confidence?

B. Use the conservative approach.
$$n = \frac{0.5\cdot 0.5\cdot 1.96^2}{0.03^2} = 1067.11$$
1. Round up to 1068.
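Both sample-size calculations (educated guess and conservative) can be sketched in one small Python function. The name `sample_size_proportion` and its arguments are hypothetical, chosen for illustration; the z-value is taken from the standard library's normal distribution rather than the rounded 1.96, which does not change the rounded-up results here.

```python
import math
from statistics import NormalDist

def sample_size_proportion(d, p=0.5, confidence=0.95, N=None):
    """Sample size needed so that the margin of error is at most d.

    p=0.5 gives the conservative size, since p(1-p) is largest there.
    N=None ignores the finite population correction.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    pq = p * (1 - p)
    if N is None:
        n = z**2 * pq / d**2
    else:
        n = N * pq / ((N - 1) * d**2 / z**2 + pq)
    return math.ceil(n)  # always round up

n_guess = sample_size_proportion(0.03, p=0.22)   # educated guess
n_conservative = sample_size_proportion(0.03)    # conservative, p = 1/2
```

Passing a finite `N` switches to the formula with the finite population correction given earlier in this subsection.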
What to choose?

• How do we choose between the educated guess and the conservative approach?
• One should weigh the cost of sampling extra units against the set-up cost of running the sampling process once more.
• If the set-up cost of running the sampling procedure once more (which may be needed if an educated guess is used) is high compared to the cost of sampling extra units, then one will prefer the conservative approach.
Example

• Find the proportion of CD players in a shipment that have a lifetime longer than 2000 hours.
• The proportion from the last shipment was 0.9. It is not costly to set up the testing procedure again if needed, whereas sampling each unit is expensive.
• We want to estimate the proportion to within 0.01 with 95% confidence.
• Would you use the educated guess or the conservative
approach?
• ANSWER:
• We should use an educated guess because it is not costly
to set up the testing procedure again.
• On the other hand, the cost of the sampling of extra
units is high due to the nature of the test.

• Send a ship out to the Bering Sea to sample the proportion of fish whose mercury level is within a specified limit.
• Last year the proportion was 0.9.
• We want to estimate the proportion to within 0.01 with 95% confidence.
• Would you use the educated guess or the conservative
approach?
• ANSWER:
• We should use a conservative approach because it is too
expensive to send a ship out again if needed.

Unequal probability sampling
Unit learning outcomes

• Upon successful completion of this lesson, you will be able to:
• explain why and when to use unequal probability sampling,
• perform unequal probability sampling,
• compute the Hansen-Hurwitz estimator and its estimated variance,
• compute the Horvitz-Thompson estimator and its estimated variance, and
• verify the unbiasedness of these two estimators through an artificial small population example.
Subsection 1

Unequal probability sampling

• In simple random sampling, the probability that each unit
will be sampled is the same.
• But sometimes, estimates can be improved by varying the
probabilities with which units are sampled.
• For example, we want to estimate the number of job
openings in a city by sampling firms in that city.
• Many of the firms in the city are small firms.

• If one uses simple random sampling (s.r.s.), the size of a firm is not taken into consideration and a typical sample will consist mostly of small firms.
• However, the number of job openings is heavily influenced by large firms.
• Thus, we should be able to improve the estimate of the number of job openings by giving the large firms a greater chance to appear in the sample, for example, with probability proportional to size or proportional to some other relevant aspect.
Selection probabilities

• On each draw, the probability that a given population unit will be selected is denoted as $p_i$, i = 1, 2, 3, ..., N.
• If sampling is with replacement, the probability of selecting the i-th unit in the population is $p_i$ on every draw.
• If the selection probabilities are unequal, the sample mean
is not unbiased for population mean and sample total is
not unbiased for population total.
• Example: if larger firms are sampled with higher
probability, the sample mean for job openings will be
biased upward.

Subsection 2

The Hansen-Hurwitz estimator

Sampling with replacement

• When sampling with replacement, the variances tend to be larger.
• However, the formulae for sampling with replacement are simpler and easier to derive.
• When the sample size is small compared to N, sampling with and without replacement are not too different.
• We often use the sampling with replacement formulae (easier to handle) to approximate sampling without replacement.
• For this section, let's assume sampling is with replacement.
• Let $p_i$, i = 1, ..., N, denote the probability that a given population unit will be selected.
• The Hansen-Hurwitz estimator for τ is:
$$\hat{\tau}_p = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{p_i}.$$
Since
$$E\left(\frac{y_i}{p_i}\right) = \sum_{i=1}^{N}\frac{y_i}{p_i}\,p_i = \sum_{i=1}^{N} y_i = \tau,$$
where $\tau = \sum_{i=1}^{N} y_i$ is the population total.
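Thus,
$$E(\hat{\tau}_p) = E\left(\frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{p_i}\right) = \frac{1}{n}\sum_{i=1}^{n}E\left(\frac{y_i}{p_i}\right) = \frac{1}{n}\sum_{i=1}^{n}\tau = \frac{1}{n}\,n\tau = \tau,$$
which means $\hat{\tau}_p$ is an unbiased estimator for τ.
Since $\mathrm{Var}\left(\frac{y_i}{p_i}\right) = \sum_{i=1}^{N} p_i\left(\frac{y_i}{p_i}-\tau\right)^2$,
$$\mathrm{Var}(\hat{\tau}_p) = \frac{1}{n}\sum_{i=1}^{N} p_i\left(\frac{y_i}{p_i}-\tau\right)^2.$$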
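• An unbiased estimator for $\mathrm{Var}(\hat{\tau}_p)$ is:
$$\widehat{\mathrm{Var}}(\hat{\tau}_p) = \frac{1}{n}\cdot\frac{\sum_{i=1}^{n}\left(\frac{y_i}{p_i}-\hat{\tau}_p\right)^2}{n-1}$$
and an approximate (1 − α)100% confidence interval for τ is:
$$\hat{\tau}_p \pm z_{1-\alpha/2}\cdot\sqrt{\widehat{\mathrm{Var}}(\hat{\tau}_p)}.$$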
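• For the population mean, $\mu = \tau/N$, one uses:
$$\hat{\mu}_p = \frac{1}{N}\cdot\left(\frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{p_i}\right) = \frac{\hat{\tau}_p}{N}$$
$$E(\hat{\mu}_p) = \frac{\tau}{N} = \mu$$
$$\widehat{\mathrm{Var}}(\hat{\mu}_p) = \frac{1}{N^2}\cdot\widehat{\mathrm{Var}}(\hat{\tau}_p)$$
• How do we perform unequal probability sampling according to given $p_i$?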
Example 1

• The director of the computer support department plans to sample 3 divisions of a large firm that has 10 divisions, with varying numbers of employees per division.
• Since the number of computer support requests within each division should be highly correlated with the number of employees in that division, the director decides to use unequal probability sampling with replacement, with $p_i$ proportional to the number of employees in that division.
Division # employees

1 1000
2 650
3 2100
4 860
5 2840
6 1910
7 390
8 3200
9 1500
10 1200
Total 15650

A. How do we practically implement unequal probability
sampling according to the given pi ’s?
B. With the divisions selected by probability proportional to
size, how do we construct the Hansen-Hurwitz estimator
for τ ?

Example: Answer to A

Division # employees pi

1 1000 1000/15650
2 650 650/15650
3 2100 2100/15650
4 860 860/15650
5 2840 2840/15650
6 1910 1910/15650
7 390 390/15650
8 3200 3200/15650
9 1500 1500/15650
10 1200 1200/15650
Total 15650 1

Division   # employees   p_i          Assigned numbers
1          1000          1000/15650   1-1000
2          650           650/15650    1001-1650
3          2100          2100/15650   1651-3750
4          860           860/15650    3751-4610
5          2840          2840/15650   4611-7450
6          1910          1910/15650   7451-9360
7          390           390/15650    9361-9750
8          3200          3200/15650   9751-12950
9          1500          1500/15650   12951-14450
10         1200          1200/15650   14451-15650
Total      15650         1

• Sample with replacement 3 numbers between 1 and 15650.
• They are 1085, 6261 and 9787.
• These numbers fall into division 2, division 5 and division 8.
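The assigned-number method can be sketched in Python as a lookup on cumulative sizes. The helper name `pps_select` is hypothetical; the employee counts and the three drawn numbers are those of the example, while the seeded generator at the end only illustrates how a fresh sample might be drawn.

```python
import bisect
import itertools
import random

def pps_select(sizes, draws):
    """Map each drawn number in 1..sum(sizes) to a 1-based unit index.

    Unit i owns the range that ends at the cumulative sum of sizes[0..i].
    """
    upper = list(itertools.accumulate(sizes))  # upper end of each range
    return [bisect.bisect_left(upper, r) + 1 for r in draws]

employees = [1000, 650, 2100, 860, 2840, 1910, 390, 3200, 1500, 1200]

# The three numbers drawn in the example.
selected = pps_select(employees, [1085, 6261, 9787])

# Drawing a new sample of 3 with replacement (seed chosen arbitrarily).
rng = random.Random(0)
new_draws = [rng.randint(1, sum(employees)) for _ in range(3)]
new_sample = pps_select(employees, new_draws)
```

Because the draws are independent, the same division can appear more than once in `new_sample`, which is what sampling with replacement allows.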
• For division 2, $y_1$: the number of requests is 420
• For division 5, $y_2$: the number of requests is 1785
• For division 8, $y_3$: the number of requests is 2198
• We will need to compute the Hansen-Hurwitz estimator as follows:
• The Hansen-Hurwitz estimator for τ is
$$\hat{\tau}_p = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{p_i} = \frac{1}{3}\left(420\cdot\frac{15650}{650} + 1785\cdot\frac{15650}{2840} + 2198\cdot\frac{15650}{3200}\right)$$
$$= \frac{1}{3}\,(10112.31 + 9836.36 + 10749.59) = 10232.75$$
• The three values, 10112.31, 9836.36 and 10749.59, are fairly close to one another, so the variance should not be too large.
$$\widehat{\mathrm{Var}}(\hat{\tau}_p) = \frac{1}{3}\cdot\frac{\sum_{i=1}^{3}\left(\frac{y_i}{p_i}-\hat{\tau}_p\right)^2}{3-1}$$
$$= \frac{1}{3}\cdot\frac{1}{2}\cdot\left((10112.31-10232.75)^2 + (9836.36-10232.75)^2 + (10749.59-10232.75)^2\right) = 73125.74$$
and
$$\widehat{\mathrm{SD}}(\hat{\tau}_p) = 270.418$$
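The estimator and its estimated variance can be computed with a short function. This is a sketch: the name `hansen_hurwitz` is an illustrative choice, and because the code keeps unrounded intermediate values, the variance may differ in the last digits from the hand calculation above, which uses values rounded to two decimals.

```python
import math

def hansen_hurwitz(y, p):
    """Hansen-Hurwitz estimate of the total and its estimated variance.

    y: observed values, one per draw (repeats allowed)
    p: per-draw selection probabilities of the drawn units
    """
    n = len(y)
    ratios = [yi / pi for yi, pi in zip(y, p)]
    tau_hat = sum(ratios) / n
    var_hat = sum((r - tau_hat) ** 2 for r in ratios) / (n * (n - 1))
    return tau_hat, var_hat

# Example 1: divisions 2, 5 and 8 were drawn.
total_employees = 15650
y = [420, 1785, 2198]
p = [650 / total_employees, 2840 / total_employees, 3200 / total_employees]

tau_hat, var_hat = hansen_hurwitz(y, p)
sd_hat = math.sqrt(var_hat)
```

Dividing `tau_hat` by N and `var_hat` by N squared gives the corresponding estimates for the population mean.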
Hansen-Hurwitz estimator

• In the example, the $p_i$ are chosen proportional to the values of a known positive auxiliary variable such as size, $p_i = x_i / \sum_{j=1}^{N} x_j$. In this case the Hansen-Hurwitz estimator is also called the p.p.s. (probability proportional to size) estimator.
• Now, we need to ask ourselves, when and why would we need to use unequal probability sampling?
• Let's think about the 'when' first.
• When would we elect to use p.p.s.?
• What if we were sampling from ESTAT departments?
• They are of very different sizes: some are very large and others are very small.
• Would we automatically choose to use p.p.s.?
• The idea is that the quantity you are interested in has to be related to size.
• If the quantity you are interested in is related to size, then you would want to use p.p.s.
• However, if what you are interested in has nothing to do with the size of the department, then there is no reason to use p.p.s.
• Now, let us address the 'why'.
• By definition,
$$\tau = \sum_{i=1}^{N} y_i \quad\text{and}\quad \mathrm{Var}(\hat{\tau}_p) = \frac{1}{n}\sum_{i=1}^{N} p_i\left(\frac{y_i}{p_i}-\tau\right)^2.$$
• For the special and unrealistic case $\frac{y_i}{p_i}$ = constant, the constant will be τ and $\mathrm{Var}(\hat{\tau}_p)$ will be zero.
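The zero-variance claim can be checked in one line. As a sketch, write the constant as c (a label introduced here only for illustration) and use the fact that the selection probabilities sum to one:

```latex
y_i = c\,p_i \ \text{for all } i
\;\Longrightarrow\;
\tau = \sum_{i=1}^{N} y_i = c\sum_{i=1}^{N} p_i = c
\;\Longrightarrow\;
\mathrm{Var}(\hat{\tau}_p)
  = \frac{1}{n}\sum_{i=1}^{N} p_i\left(\frac{y_i}{p_i}-\tau\right)^2
  = \frac{1}{n}\sum_{i=1}^{N} p_i\,(c-c)^2 = 0 .
```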
• Therefore, you want $\frac{y_i}{p_i}$ to be close to a constant.
• However, in reality, prior to sampling, the $y_i$ are unknown and we cannot choose $p_i$ proportional to $y_i$.
• If we know $y_i$ is approximately proportional to a known variable such as $x_i$, then we can choose $p_i$ proportional to $x_i$.
• $\hat{\tau}_p$ will then have low variance.
Example: palm trees

• We want to estimate the total number of palm trees on


100 islands in a tropical paradise.
• The area of each island is known and it is reasonable to
think that the number of palm trees on each island is
approximately proportional to the size of the island.

• The sizes of the islands are known (e.g., the size of island 1 is 1 square mile, the size of island 29 is 5 square miles and the size of island 36 is 2 square miles).
• The total size of these 100 islands is 100 square miles.
• The $p_1, ..., p_N$ are the island sizes divided by the total size, e.g. $p_1 = 0.01$, $p_{29} = 0.05$ and $p_{36} = 0.02$.

• How can we sample 4 islands with probabilities $p_1, ..., p_{100}$?
• Answer:
• Assign an interval of width $p_i$ to the i-th unit.
• Generate 4 random numbers from a uniform distribution on (0,1).
• Choose the units whose intervals contain the random numbers.
• In this example, the uniform random numbers drawn are 0.335257, 0.0065551, 0.401869 and 0.318977.
• The units selected are islands 29, 1, 36 and 29 (since 0.335257 falls between 0.31 and 0.36, 0.0065551 falls between 0 and 0.01, 0.401869 falls between 0.40 and 0.42, and 0.318977 falls between 0.31 and 0.36).
The measurements ($y_i$) are:

i     Size   p_i    y_i
1     1      0.01   14
29    5      0.05   50
29    5      0.05   50
36    2      0.02   25

Given these results, we can now estimate the total number of palm trees on all of the islands put together:
$$\hat{\tau}_p = \frac{1}{4}\left(\frac{14}{0.01}+\frac{50}{0.05}+\frac{50}{0.05}+\frac{25}{0.02}\right) = \frac{1}{4}\,(1400+1000+1000+1250) = 1162.5$$
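Example: palm trees

$$\widehat{\mathrm{Var}}(\hat{\tau}_p) = \frac{1}{n(n-1)}\sum_{i=1}^{n}\left(\frac{y_i}{p_i}-\hat{\tau}_p\right)^2$$
$$= \frac{1}{4\cdot 3}\left[(1400-1162.5)^2 + (1000-1162.5)^2 + (1000-1162.5)^2 + (1250-1162.5)^2\right] = 9739.58$$
$$\widehat{\mathrm{SD}}(\hat{\tau}_p) = 98.69.$$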
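• If we are interested in the mean number of trees per island in that population, then
$$\hat{\mu}_p = \frac{\hat{\tau}_p}{N} = \frac{1162.5}{100} = 11.625.$$
$$\widehat{\mathrm{Var}}(\hat{\mu}_p) = \frac{1}{N^2}\cdot\widehat{\mathrm{Var}}(\hat{\tau}_p) = \frac{1}{(100)^2}\cdot 9739.58 = 0.973958$$
$$\widehat{\mathrm{SD}}(\hat{\mu}_p) = 0.987$$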
Subsection 3

The Horvitz-Thompson Estimator

The Horvitz-Thompson estimator

• Horvitz and Thompson (1952) introduced an unbiased estimator for τ for any design, with or without replacement.
• Definition: $\pi_i$, i = 1, ..., N, are given positive numbers that represent the probability that unit i is included in the sample under a given sampling scheme.
• The Horvitz-Thompson estimator is:
$$\hat{\tau}_\pi = \sum_{i=1}^{\nu}\frac{y_i}{\pi_i}$$
where ν is the number of distinct units in the sample.
• The Horvitz-Thompson estimator does not depend on the
number of times a unit may be selected.
• Each distinct unit of the sample is utilized only once.
• Note that the estimator is unbiased:

E(τ̂π ) = τ

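• Its variance is given by
$$\mathrm{Var}(\hat{\tau}_\pi) = \sum_{i=1}^{N}\left(\frac{1-\pi_i}{\pi_i}\right)y_i^2 + \sum_{i=1}^{N}\sum_{j\neq i}\left(\frac{\pi_{ij}-\pi_i\pi_j}{\pi_i\pi_j}\right)y_i y_j$$
• It can be estimated by:
$$\widehat{\mathrm{Var}}(\hat{\tau}_\pi) = \sum_{i=1}^{\nu}\left(\frac{1-\pi_i}{\pi_i^2}\right)y_i^2 + \sum_{i=1}^{\nu}\sum_{j\neq i}\left(\frac{\pi_{ij}-\pi_i\pi_j}{\pi_i\pi_j}\right)\frac{1}{\pi_{ij}}\,y_i y_j$$
where $\pi_{ij} > 0$ denotes the probability that both unit i and unit j are included in the sample.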
An approximate (1 − α)100% CI for τ is:
$$\hat{\tau}_\pi \pm z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat{\tau}_\pi)}.$$
Palm trees with Horvitz-Thompson estimator

• Compute the Horvitz-Thompson estimator of the total number of palm trees.
• ANSWER:
• Since, for that example, the sampling is with replacement, the n draws are independent.
• It is relatively easy to compute the π's.
• For sampling with replacement, we compute:
$$\pi_i = P(\text{unit } i \text{ is included in the sample}) = 1 - P(\text{unit } i \text{ is not included in the sample}) = 1-(1-p_i)^n$$
• Recall: units 1, 29 and 36 are selected.
• Labelling the three distinct sampled islands (1, 29 and 36) as i = 1, 2, 3:
$$p_1 = 0.01,\quad \pi_1 = 1-(1-0.01)^4 = 0.0394,$$
$$p_2 = 0.05,\quad \pi_2 = 1-(1-0.05)^4 = 0.1855,$$
$$p_3 = 0.02,\quad \pi_3 = 1-(1-0.02)^4 = 0.0776$$
• Therefore,
$$\hat{\tau}_\pi = \sum_{i=1}^{\nu}\frac{y_i}{\pi_i} = \frac{14}{0.0394}+\frac{50}{0.1855}+\frac{25}{0.0776} = 947.037$$
• Next, we need to compute the estimated variance, $\widehat{\mathrm{Var}}(\hat{\tau}_\pi)$.
• For this, we need to compute $\pi_{ij}$.
• Since
$$P(A\cap B) = P(A)+P(B)-P(A\cup B) = P(A)+P(B)-[1-P(A^c\cap B^c)]$$
• Then we get:
$$\pi_{ij} = \pi_i+\pi_j-[1-(1-p_i-p_j)^n]$$
• This means that we have to run through each of the unit pairs:
$$\pi_{12} = 0.0394+0.1855-[1-(1-0.01-0.05)^4] = 0.00565$$
$$\pi_{13} = 0.0394+0.0776-[1-(1-0.01-0.02)^4] = 0.00229$$
$$\pi_{23} = 0.1855+0.0776-[1-(1-0.05-0.02)^4] = 0.01115$$
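• Plugging the values into
$$\widehat{\mathrm{Var}}(\hat{\tau}_\pi) = \sum_{i=1}^{\nu}\left(\frac{1-\pi_i}{\pi_i^2}\right)y_i^2 + \sum_{i=1}^{\nu}\sum_{j\neq i}\left(\frac{\pi_{ij}-\pi_i\pi_j}{\pi_i\pi_j}\right)\frac{1}{\pi_{ij}}\,y_i y_j,$$
we obtain:
$$\widehat{\mathrm{Var}}(\hat{\tau}_\pi) = 92692.9$$
• Thus, $\widehat{\mathrm{SD}}(\hat{\tau}_\pi) = \sqrt{92692.9} = 304.455$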
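The Horvitz-Thompson calculation for the palm trees can be sketched as follows. All function names here are illustrative. The two helper functions implement the with-replacement formulas for the inclusion probabilities, while the estimate itself is fed the rounded values used on the slides so that it reproduces the hand calculation; unrounded probabilities give a noticeably different variance estimate, because the cross terms are sensitive to rounding.

```python
def inclusion_prob(p_i, n):
    """P(unit i appears at least once in n independent draws)."""
    return 1 - (1 - p_i) ** n

def joint_inclusion_prob(p_i, p_j, n):
    """P(units i and j both appear at least once in n independent draws)."""
    return (inclusion_prob(p_i, n) + inclusion_prob(p_j, n)
            - (1 - (1 - p_i - p_j) ** n))

def horvitz_thompson(y, pi, pi_joint):
    """HT estimate of the total and its estimated variance.

    y, pi: values and inclusion probabilities of the distinct sampled units.
    pi_joint[(i, j)] with i < j: joint inclusion probabilities (0-based).
    """
    v = len(y)
    tau_hat = sum(y[i] / pi[i] for i in range(v))
    var_hat = sum((1 - pi[i]) / pi[i] ** 2 * y[i] ** 2 for i in range(v))
    for i in range(v):
        for j in range(v):
            if i != j:
                pij = pi_joint[(min(i, j), max(i, j))]
                var_hat += ((pij - pi[i] * pi[j]) / (pi[i] * pi[j] * pij)
                            * y[i] * y[j])
    return tau_hat, var_hat

# Distinct sampled islands 1, 29, 36 with the rounded slide values.
y = [14, 50, 25]
pi = [0.0394, 0.1855, 0.0776]
pi_joint = {(0, 1): 0.00565, (0, 2): 0.00229, (1, 2): 0.01115}
tau_hat, var_hat = horvitz_thompson(y, pi, pi_joint)
```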
• Is there some popular estimator that can be derived as a Horvitz-Thompson estimator?
• Yes. Under simple random sampling (without replacement), the inclusion probability of the i-th unit is:
$$\pi_i = P(\text{unit } i \text{ is included in the sample}) = \frac{\text{number of samples including unit } i}{\text{number of samples}}$$
$$= \frac{\binom{N-1}{n-1}}{\binom{N}{n}} = \frac{\dfrac{(N-1)!}{(N-n)!\,(n-1)!}}{\dfrac{N!}{(N-n)!\,n!}} = \frac{\dfrac{(N-1)!}{(N-n)!\,(n-1)!}}{\dfrac{N\,(N-1)!}{(N-n)!\,n\,(n-1)!}} = \frac{n}{N}$$
$$\hat{\tau}_\pi = \sum_{i=1}^{n}\frac{y_i}{\pi_i} = \sum_{i=1}^{n}\frac{y_i}{n}\cdot N = N\bar{y}$$

This is the popular estimator we use! It is also called the expansion estimator.
Subsection 4

Small population illustration

Wheat production

Unit (farm) i      1     2     3
p_i              0.3   0.2   0.5
Wheat produced    11     6    25

N = 3 farms; n = 2 farms; sample with replacement.
s p(s) Sample

(1,1) 0.3(0.3)=0.09 (11,11)


(2,2) 0.2(0.2)=0.04 (6,6)
(3,3) 0.5(0.5)=0.25 (25,25)
(1,2) 0.3(0.2)=0.06 (11,6)
(2,1) 0.2(0.3)=0.06 (6,11)
(1,3) 0.3(0.5)=0.15 (11,25)
(3,1) 0.5(0.3)=0.15 (25,11)
(2,3) 0.2(0.5)=0.10 (6,25)
(3,2) 0.5(0.2)=0.10 (25,6)

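• Question: Compute the Hansen-Hurwitz estimator.
• Answer: When (1,1) is sampled, the Hansen-Hurwitz estimator is:
$$\hat{\tau}_p = \frac{1}{2}\left(\frac{y_1}{p_1}+\frac{y_1}{p_1}\right) = \frac{1}{2}\left(\frac{11}{0.3}+\frac{11}{0.3}\right) = 36.67.$$
• When (1,2) is sampled, the Hansen-Hurwitz estimator is:
$$\hat{\tau}_p = \frac{1}{2}\left(\frac{y_1}{p_1}+\frac{y_2}{p_2}\right) = \frac{1}{2}\left(\frac{11}{0.3}+\frac{6}{0.2}\right) = 33.33.$$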
Similarly, we can fill out the table and get the Hansen-Hurwitz
estimators as shown:

s p(s) Sample τ̂p

(1,1) 0.3(0.3)=0.09 (11,11) 36.670


(2,2) 0.2(0.2)=0.04 (6,6) 30.000
(3,3) 0.5(0.5)=0.25 (25,25) 50.000
(1,2) 0.3(0.2)=0.06 (11,6) 33.330
(2,1) 0.2(0.3)=0.06 (6,11) 33.330
(1,3) 0.3(0.5)=0.15 (11,25) 43.330
(3,1) 0.5(0.3)=0.15 (25,11) 43.330
(2,3) 0.2(0.5)=0.10 (6,25) 40.000
(3,2) 0.5(0.2)=0.10 (25,6) 40.000

• Question: Compute the Horvitz-Thompson estimator.
• Answer:
$\pi_1$ = 0.09 + 0.06 + 0.06 + 0.15 + 0.15 = 0.51,
$\pi_2$ = 0.04 + 0.06 + 0.06 + 0.10 + 0.10 = 0.36,
$\pi_3$ = 0.25 + 0.15 + 0.15 + 0.10 + 0.10 = 0.75.
• When (1,1) is sampled, the Horvitz-Thompson estimator is:
$$\hat{\tau}_\pi = \frac{11}{0.51} = 21.57.$$
• When (1,2) is sampled, the Horvitz-Thompson estimator is:
$$\hat{\tau}_\pi = \frac{11}{0.51}+\frac{6}{0.36} = 38.24.$$
Similarly, we can fill out the table and get the Horvitz-Thompson estimators as shown below:

s          p(s)            y_s       τ̂p      τ̂π
(1,1)      0.3(0.3)=0.09   (11,11)   36.67   21.57
(2,2)      0.2(0.2)=0.04   (6,6)     30.00   16.67
(3,3)      0.5(0.5)=0.25   (25,25)   50.00   33.33
(1,2)      0.3(0.2)=0.06   (11,6)    33.33   38.24
(2,1)      0.2(0.3)=0.06   (6,11)    33.33   38.24
(1,3)      0.3(0.5)=0.15   (11,25)   43.33   54.90
(3,1)      0.5(0.3)=0.15   (25,11)   43.33   54.90
(2,3)      0.2(0.5)=0.10   (6,25)    40.00   50.00
(3,2)      0.5(0.2)=0.10   (25,6)    40.00   50.00
Mean                                 42.00   42.00
Variance                             34.67   146.44
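The table can be reproduced by enumerating every ordered sample and weighting each estimate by its probability p(s). The sketch below does this for the three-farm population; the helper names `hh`, `ht` and `mean_and_variance` are illustrative, not from the course material.

```python
from itertools import product
from math import prod

y = {1: 11, 2: 6, 3: 25}       # wheat produced on each farm
p = {1: 0.3, 2: 0.2, 3: 0.5}   # per-draw selection probabilities
n = 2                          # draws, with replacement

# Inclusion probabilities for sampling with replacement.
pi = {i: 1 - (1 - p[i]) ** n for i in y}

def hh(sample):
    """Hansen-Hurwitz estimate: every draw counts, repeats included."""
    return sum(y[i] / p[i] for i in sample) / len(sample)

def ht(sample):
    """Horvitz-Thompson estimate: each distinct unit counts once."""
    return sum(y[i] / pi[i] for i in set(sample))

def mean_and_variance(estimator):
    """Exact expectation and variance over all ordered samples."""
    samples = list(product(y, repeat=n))
    prob = {s: prod(p[i] for i in s) for s in samples}
    mean = sum(prob[s] * estimator(s) for s in samples)
    var = sum(prob[s] * (estimator(s) - mean) ** 2 for s in samples)
    return mean, var

mean_hh, var_hh = mean_and_variance(hh)
mean_ht, var_ht = mean_and_variance(ht)
```

Both means equal the true total of 42, which is the unbiasedness shown in the table; the variance of the Horvitz-Thompson estimate may differ from the tabulated value in the second decimal because the table works from rounded estimates.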
• From the table above we can see that both τ̂p and τ̂π are
unbiased.
• This example is a small population example to illustrate
conceptually the properties of these estimators.

Remark 1

• The above demonstration is just a teaching tool.


• In reality we will not know the population and will not
come across small population problems like this.

• What we know are:
unit 1 2 3
Selection probability 0.3 0.2 0.5

• We draw a sample.
• If the sample we draw is (1,2) then τ̂p = 33.33 and
τ̂π = 38.24.
• We will not be able to find the real population total nor
the real variance of the estimator.
• However, we will be able to estimate them.

Remark 2

• Now, should we use τ̂p or should we use τ̂π ?


• There are no clear answers.
• Both estimators are acceptable when yi and pi are
proportional.

Auxiliary data and ratio estimation
Unit learning outcomes

• Upon successful completion of this unit, you will be able to:
• explain why and when to use ratio estimates
• check the condition that determines whether the ratio estimate can be used
• compute the ratio estimate and its estimated variance
• compute confidence intervals based on ratio estimates
• compute the sample size needed when the ratio estimate is used
• describe the bias of the ratio estimate via a small population example
• see that the ratio estimate performs better than the expansion estimate when the condition for using the ratio estimate is satisfied
Subsection 1

Auxiliary data, ratio estimator and its computation
Using auxiliary information

• The auxiliary information about the population may


include a known variable to which the variable of interest
is approximately related.

210/468
Eurostat
Using auxiliary information

• The auxiliary information about the population may


include a known variable to which the variable of interest
is approximately related.
• The auxiliary information typically is easy to measure,
whereas the variable of interest may be expensive to
measure.

210/468
Eurostat
Using auxiliary information

• The auxiliary information about the population may


include a known variable to which the variable of interest
is approximately related.
• The auxiliary information typically is easy to measure,
whereas the variable of interest may be expensive to
measure.
• Population units: 1, 2, ..., N

210/468
Eurostat
Using auxiliary information

• The auxiliary information about the population may


include a known variable to which the variable of interest
is approximately related.
• The auxiliary information typically is easy to measure,
whereas the variable of interest may be expensive to
measure.
• Population units: 1, 2, ..., N
• Variable of interest: y1 , y2 , ..., yN (expensive or costly to
measure)
210/468
Eurostat
Using auxiliary information

• The auxiliary information about the population may


include a known variable to which the variable of interest
is approximately related.
• The auxiliary information typically is easy to measure,
whereas the variable of interest may be expensive to
measure.
• Population units: 1, 2, ..., N
• Variable of interest: y1 , y2 , ..., yN (expensive or costly to
measure)
• Auxiliary variable : x1 , x2 , ..., xN (known) 210/468
Eurostat
• For example, consider a national park that is partitioned into N units.
• yi = the number of animals in unit i
• xi = the size of unit i
• Another example: a certain city has N bookstores.
• yi = the sales of a given book title at bookstore i
• xi = the size of bookstore i
• A third example: a forest has N trees.
• yi = the volume of tree i
• xi = the diameter of tree i

Ratio estimators

• If $\tau_y = \sum_{i=1}^{N} y_i$ and $\tau_x = \sum_{i=1}^{N} x_i$, then $\frac{\tau_y}{\tau_x} = \frac{\mu_y}{\mu_x}$ and
$$\tau_y = \frac{\mu_y}{\mu_x} \cdot \tau_x .$$
• The ratio estimator, denoted as $\hat{\tau}_r$, is
$$\hat{\tau}_r = \frac{\bar{y}}{\bar{x}} \cdot \tau_x .$$

• The estimator is useful in the following situations:
A. When x and y are highly correlated, with a roughly linear relationship through the origin. In that case Var(τ̂r) is less than Var(N ȳ).
B. When N is unknown. The ratio estimator still provides a way to estimate τy, whereas N ȳ cannot be computed without N.

Historical use

• When was this type of estimator used historically?
• Probably the first instance of its use occurred in France in 1802.
• At that time there was no population census and Laplace wanted to estimate the total population of France.
• He did not have the resources to count every individual, so he sampled 30 communities in France.
• For Laplace, n = 30, and the total number of inhabitants in these communities was 2,037,615.
• What type of information did the government already have?
• Laplace looked for auxiliary information and found good records of the number of registered births.
• The number of registered births per year in the 30 selected communities was 71,866.33 (an annual average, which is why the figure is not a whole number).
• Dividing 2,037,615 by 71,866.33, he estimated that there is one registered birth per year for every 28.35 persons.
• Therefore, he estimated the total population as the total number of annual births in France × 28.35.
• Rationale: communities with larger populations are likely to have larger numbers of registered births.
• This is an early example of ratio estimation.

Example 1: apple juice from apples

• For a juice company, the price paid for apples in large shipments is based on the amount of apple juice obtained from the load.
• Therefore, we need to determine the amount of apple juice in the whole load prior to extraction.
• We can sample n apples and find y1, ..., yn, the amount of apple juice in those apples.
• N ȳ is hard to get in this case because N, the number of apples in the load, is hard to count.

• What could we measure instead?
• The total weight of the load would be a good choice and is easy to get.
• We will use the relationship between the weight of the load and the weight of the apple juice one obtains.
• y is related to x, the weight of each apple in the sample, and the total weight τx is easy to get for the entire shipment.

Ratio estimator for τ

• We can thus estimate the total apple juice by:
$$\hat{\tau}_r = \frac{\bar{y}}{\bar{x}} \cdot \tau_x$$
• For this example, N is unknown and we cannot use N ȳ.
• If the condition for using the ratio estimator is satisfied and N is known, the ratio estimator may actually work better than N ȳ.

Ratio estimator for µ

• Similarly, to estimate µy, we can use
$$\hat{\mu}_r = \frac{\bar{y}}{\bar{x}} \cdot \mu_x .$$
• It turns out that this estimate is not unbiased.
• Note that τ̂r is not unbiased for τy and µ̂r is not unbiased for µy, but both are approximately unbiased for large samples under simple random sampling.

Properties

• The approximate MSE of µ̂r is Var(µ̂r), given by:
$$\mathrm{Var}(\hat{\mu}_r) \approx \left(\frac{N-n}{N}\right) \cdot \frac{\sigma_r^2}{n} .$$
• Here σr² is computed as:
$$\sigma_r^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left( y_i - \frac{\tau_y}{\tau_x} \cdot x_i \right)^2 .$$

• When we want to estimate σr², we use this formula:
$$s_r^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( y_i - \frac{\bar{y}}{\bar{x}} \cdot x_i \right)^2 .$$
• Given all of this, when do we know that the estimate µ̂r is good?

• We can compare it to:
$$\mathrm{Var}(\bar{y}) = \left(\frac{N-n}{N}\right) \cdot \frac{\sigma^2}{n} .$$
• µ̂r will perform better if σr² < σ².
• That is the case for populations in which the y's and x's are highly correlated, with a roughly linear relationship through the origin.

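• An approximate 100(1 − α)% CI for µy is
$$\hat{\mu}_r \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\mu}_r)} .$$
• For τy,
$$\hat{\tau}_r = N \hat{\mu}_r = \frac{\bar{y}}{\bar{x}} \cdot \tau_x ,$$
and
$$\widehat{\mathrm{Var}}(\hat{\tau}_r) = N \cdot (N - n) \, \frac{s_r^2}{n} .$$
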
Back to apple juice example

• Back to the context for this example...
• In this example, 15 apples selected by simple random sampling were weighed and then juiced.
• The total weight of the apple shipment was found to be 2000 pounds.
• Given the table of results below, we need a point estimate of the total weight of the juice for the shipment of apples, together with a 95% confidence interval.

Here is the data:

ID    yi     xi     yi − r·xi     (yi − r·xi)²
1     0.16   0.22    0.0148611    0.0002209
2     0.15   0.26   -0.0215278    0.0004634
3     0.20   0.31   -0.0045139    0.0000204
4     0.25   0.37    0.0059028    0.0000348
5     0.16   0.28   -0.0247222    0.0006112
6     0.27   0.38    0.0193056    0.0003727
7     0.28   0.40    0.0161111    0.0002596
8     0.16   0.21    0.0214583    0.0004605
9     0.11   0.18   -0.0087500    0.0000766
10    0.16   0.29   -0.0313194    0.0009809
11    0.17   0.26   -0.0015278    0.0000023
12    0.24   0.32    0.0288889    0.0008346
13    0.21   0.33   -0.0077083    0.0000594
14    0.11   0.16    0.0044444    0.0000198
15    0.22   0.35   -0.0109028    0.0001189
Mean  0.190  0.288
Sum                                0.004536

• ID identifies the sampled apple.
• yi is the weight of the apple's juice in lbs.
• xi is the weight of the apple in lbs.
• yi − r·xi is the observed y value minus the estimated y value, where r = ȳ/x̄, and
• (yi − r·xi)² is that difference squared.
• The total apple juice weight in the sample is 2.85 lbs. (mean = 0.19 lbs.)
• The total apple weight in the sample is 4.32 lbs. (mean = 0.288 lbs.)

• Is it appropriate to use the ratio estimate?
• The scatter plot of the data shows a linear relationship between the y and x variables.

[Figure: scatter plot of y (juice weight) against x (apple weight)]

• Moreover, the regression analysis does not reject a zero intercept (p-value of constant = 0.659 > 0.05), so a regression line through the origin is plausible. Therefore, it appears appropriate to use the ratio estimate.

• The ratio estimate of the total weight is
$$\hat{\tau}_r = r \, \tau_x = \frac{0.190}{0.288} \times 2000 = 1319.44 .$$
• The sample estimate of σr² is
$$s_r^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - r x_i)^2 = \frac{1}{14} \left[ (0.16 - 0.6597 \times 0.22)^2 + \dots + (0.22 - 0.6597 \times 0.35)^2 \right] = \frac{0.004536}{14} = 0.000324 .$$
• How accurate is this result?

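Example 1: apple juice from apples

Let's compute a confidence interval; for this we need the variance. Since N is unknown, it is estimated by N̂ = τx / x̄.
$$\widehat{\mathrm{Var}}(\hat{\tau}_r) = \hat{N} \cdot (\hat{N} - n) \, \frac{s_r^2}{n} = \frac{\tau_x}{\bar{x}} \left( \frac{\tau_x}{\bar{x}} - n \right) \frac{s_r^2}{n}$$
$$= \frac{2000}{0.288} \left( \frac{2000}{0.288} - 15 \right) \frac{\frac{1}{n-1} \sum_{i=1}^{15} (y_i - r x_i)^2}{n}$$
$$= 6944.444 \cdot 6929.444 \cdot \frac{\frac{1}{14} \cdot 0.004536}{15} = 1039.42$$
$$\widehat{\mathrm{SD}}(\hat{\tau}_r) = 32.24$$
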
• An approximate 95% CI for τ is then:
$$\hat{\tau}_r \pm z_{1-\alpha/2} \, \widehat{\mathrm{SD}}(\hat{\tau}_r) = 1319.44 \pm 1.96 \times 32.24 = 1319.44 \pm 63.19$$
• In this case the estimate does reduce the variance by using information contained in x about y.

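The apple juice calculation can be retraced in a few lines of Python. This is only a sketch: the function name `ratio_total_estimate` and the returned dictionary keys are illustrative choices, not part of the course material, and N is replaced by τx / x̄ as in the slides because the number of apples is unknown.

```python
import math

# Sample data from the table: juice weight (y) and apple weight (x), in lbs.
y = [0.16, 0.15, 0.20, 0.25, 0.16, 0.27, 0.28, 0.16,
     0.11, 0.16, 0.17, 0.24, 0.21, 0.11, 0.22]
x = [0.22, 0.26, 0.31, 0.37, 0.28, 0.38, 0.40, 0.21,
     0.18, 0.29, 0.26, 0.32, 0.33, 0.16, 0.35]
tau_x = 2000.0  # known total weight of the shipment (lbs)


def ratio_total_estimate(y, x, tau_x, z=1.96):
    """Ratio estimate of the total of y when N is unknown (hypothetical helper)."""
    n = len(y)
    x_bar = sum(x) / n
    r = sum(y) / sum(x)                      # sample ratio y_bar / x_bar
    tau_hat = r * tau_x                      # ratio estimate of the total
    s_r2 = sum((yi - r * xi) ** 2 for yi, xi in zip(y, x)) / (n - 1)
    n_hat = tau_x / x_bar                    # estimated population size
    var_hat = n_hat * (n_hat - n) * s_r2 / n
    sd_hat = math.sqrt(var_hat)
    return {
        "r": r,
        "tau_hat": tau_hat,
        "s_r2": s_r2,
        "var": var_hat,
        "sd": sd_hat,
        "ci": (tau_hat - z * sd_hat, tau_hat + z * sd_hat),
    }


apple = ratio_total_estimate(y, x, tau_x)
print(apple)
```

Up to rounding, the output should agree with the figures on the slides (r ≈ 0.6597, τ̂r ≈ 1319.44, standard deviation ≈ 32.24).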
Estimation for ratio

• In some cases we are interested in estimating the ratio itself:
$$R = \frac{\tau_y}{\tau_x} \quad \left( \text{also } \frac{\mu_y}{\mu_x} \right) .$$
• For example, sociologists are interested in ratios such as the monthly food budget compared to the monthly income per family.
• The sample ratio r is the estimate for R, with:
$$r = \frac{\bar{y}}{\bar{x}}$$
$$\mathrm{Var}(r) \approx \left( \frac{N-n}{N \mu_x^2} \right) \frac{\sigma_r^2}{n}$$
$$\widehat{\mathrm{Var}}(r) \approx \left( \frac{N-n}{N \mu_x^2} \right) \frac{s_r^2}{n}$$

Subsection 2

Sample size and small population example for ratio estimation

• The goal is to estimate the average number of trees per acre on a 1000-acre plantation.
• The investigator samples 10 one-acre plots by simple random sampling and counts the number of trees (y) on each plot.
• He also has aerial photographs of the plantation, from which he can estimate the number of trees (x) on each plot of the entire plantation.
• Hence he knows µx = 19.7, and since the two counts are approximately proportional, he uses a ratio estimate to estimate µy.

Plot   yi      xi (aerial estimate)   yi − r·xi
1      25      23                      0.5625
2      15      14                      0.1250
3      22      20                      0.7500
4      24      25                     -2.5625
5      13      12                      0.2500
6      18      18                     -1.1250
7      35      30                      3.1250
8      30      27                      1.3125
9      10      8                       1.5000
10     29      31                     -3.9375
Mean   22.10   20.80

Here r = ȳ/x̄ = 22.10/20.80 = 1.0625.

Here is a scatterplot of this data:

[Figure: scatter plot of the actual tree count y against the aerial estimate x]

And, here is the R output for regression:

[Figure: R regression output for y on x]

• The scatter plot of the data shows a linear relationship between y and x.
• Moreover, the regression analysis does not reject a zero intercept (p-value of constant = 0.554 > 0.05), so a regression line through the origin is plausible.
• Therefore, it may be appropriate to use the ratio estimate.

• Estimating the number of trees per acre
• N = 1000 (plantation size)
• n = 10 (taken by s.r.s.)
• yi = the actual count of trees in the 1 acre plots, i = 1, 2, ..., 10.
• xi = the aerial estimate for each plot
• ȳ = 22.10
• x̄ = 20.80
• µx is given to be 19.70

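$$\hat{\mu}_r = \frac{\bar{y}}{\bar{x}} \cdot \mu_x = \frac{22.10}{20.80} \cdot 19.70 = 20.93,$$
$$s_r^2 = \frac{1}{10-1} \sum_{i=1}^{10} \left( y_i - \frac{22.10}{20.80} x_i \right)^2 = 4.2,$$
$$\widehat{\mathrm{Var}}(\hat{\mu}_r) = \frac{N-n}{N} \cdot \frac{s_r^2}{n} = \frac{1000-10}{1000} \cdot \frac{4.2}{10} = 0.4158,$$
$$\widehat{\mathrm{SD}}(\hat{\mu}_r) = \sqrt{0.4158} = 0.6448$$

The approximate 95% confidence interval for µy is:
$$\hat{\mu}_r \pm z_{0.975} \cdot \widehat{\mathrm{SD}}(\hat{\mu}_r) = 20.93 \pm 1.96 \cdot 0.6448 = 20.93 \pm 1.26$$
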
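As a cross-check, the tree example can be recomputed from the raw plot counts. The helper name `ratio_mean_estimate` is an illustrative assumption. Note that the code keeps the unrounded s_r² (about 4.23), so the margin of error comes out marginally wider (about 1.27) than the slide value of 1.26, which is based on s_r² rounded to 4.2.

```python
import math

# Plot data: actual tree counts (y) and aerial estimates (x).
y = [25, 15, 22, 24, 13, 18, 35, 30, 10, 29]
x = [23, 14, 20, 25, 12, 18, 30, 27, 8, 31]
N = 1000      # number of one-acre plots on the plantation
mu_x = 19.7   # known population mean of the aerial estimates


def ratio_mean_estimate(y, x, mu_x, N, z=1.96):
    """Ratio estimate of the mean of y under s.r.s. (hypothetical helper)."""
    n = len(y)
    r = sum(y) / sum(x)                      # y_bar / x_bar
    mu_hat = r * mu_x
    s_r2 = sum((yi - r * xi) ** 2 for yi, xi in zip(y, x)) / (n - 1)
    var_hat = (N - n) / N * s_r2 / n         # includes the finite population correction
    half_width = z * math.sqrt(var_hat)
    return mu_hat, s_r2, var_hat, half_width


mu_hat, s_r2, var_hat, half_width = ratio_mean_estimate(y, x, mu_x, N)
print(f"mu_hat = {mu_hat:.2f}, s_r2 = {s_r2:.4f}, "
      f"var = {var_hat:.4f}, CI = {mu_hat:.2f} +/- {half_width:.2f}")
```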
• We now find the sample size needed to estimate µy when the ratio estimator is used.
• Let d denote the margin of error of the 100(1 − α)% confidence interval for µy.
• Then we know that:
$$z_{1-\alpha/2} \cdot \sqrt{\frac{N-n}{N} \cdot \frac{s_r^2}{n}} = d .$$
• Solving for n, the formula to compute the required sample size is:
$$n = \frac{N \cdot z_{1-\alpha/2}^2 \cdot s_r^2}{z_{1-\alpha/2}^2 \cdot s_r^2 + N d^2}$$

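The sample size formula translates directly into code. In the sketch below the function name `ratio_sample_size` and the target margin d = 1.0 tree per acre are illustrative assumptions; N = 1000 and s_r² = 4.2 are taken from the tree example, and the result is rounded up because n must be an integer.

```python
import math


def ratio_sample_size(N, s_r2, d, z=1.96):
    """Smallest n whose CI margin of error is at most d (hypothetical helper)."""
    n = N * z ** 2 * s_r2 / (z ** 2 * s_r2 + N * d ** 2)
    return math.ceil(n)


# Illustrative target: margin of error of 1 tree per acre at 95% confidence.
n_needed = ratio_sample_size(N=1000, s_r2=4.2, d=1.0)
print(n_needed)
```

In practice s_r² is not known before sampling, so a value from a pilot study or an earlier survey would have to be plugged in.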
• This is an artificial small population example that we will use to demonstrate how to compute the bias and MSE of the ratio estimator.

site i       1     2     3     4
Nets, xi     4     5     8     5
Fishes, yi   200   300   500   400

• τx = 22, τy = 1400.
• Samples (s.r.s.): n = 2.


Samples   τ̂r = (ȳ/x̄) · τx
(1,2)     [(200 + 300)/2] / [(4 + 5)/2] · 22 = 1222
(1,3)     [(200 + 500)/2] / [(4 + 8)/2] · 22 = 1283
(1,4)     1467
(2,3)     1354
(2,4)     1540
(3,4)     1523

$$E(\hat{\tau}_r) = \tfrac{1}{6} \cdot 1222 + \tfrac{1}{6} \cdot 1283 + \tfrac{1}{6} \cdot 1467 + \tfrac{1}{6} \cdot 1354 + \tfrac{1}{6} \cdot 1540 + \tfrac{1}{6} \cdot 1523 = 1398.17 \neq \tau_y = 1400$$

Thus, there is a very slight bias.

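$$\mathrm{MSE}(\hat{\tau}_r) = \sum_{s} (\hat{\tau}_{r,s} - \tau_y)^2 \cdot P(s)$$
$$= (1222 - 1400)^2 \cdot \tfrac{1}{6} + (1283 - 1400)^2 \cdot \tfrac{1}{6} + (1467 - 1400)^2 \cdot \tfrac{1}{6}$$
$$+ (1354 - 1400)^2 \cdot \tfrac{1}{6} + (1540 - 1400)^2 \cdot \tfrac{1}{6} + (1523 - 1400)^2 \cdot \tfrac{1}{6}$$
$$= 14{,}451.2$$

Because the estimator is slightly biased, MSE ≠ Var (MSE = Var + bias²).
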
On the other hand, if one uses τ̂ = N · ȳ:

Samples   τ̂ = N · ȳ
(1,2)     4 × (200 + 300)/2 = 1000
(1,3)     4 × (200 + 500)/2 = 1400
(1,4)     4 × (200 + 400)/2 = 1200
(2,3)     4 × (300 + 500)/2 = 1600
(2,4)     4 × (300 + 400)/2 = 1400
(3,4)     4 × (500 + 400)/2 = 1800

$$E(\hat{\tau}) = \tfrac{1}{6} \cdot 1000 + \tfrac{1}{6} \cdot 1400 + \tfrac{1}{6} \cdot 1200 + \tfrac{1}{6} \cdot 1600 + \tfrac{1}{6} \cdot 1400 + \tfrac{1}{6} \cdot 1800 = 1400, \text{ unbiased.}$$

$$\mathrm{MSE}(\hat{\tau}) = \sum_{s} (\hat{\tau}_s - \tau_y)^2 \cdot P(s)$$
$$= (1000 - 1400)^2 \cdot \tfrac{1}{6} + (1400 - 1400)^2 \cdot \tfrac{1}{6} + (1200 - 1400)^2 \cdot \tfrac{1}{6}$$
$$+ (1600 - 1400)^2 \cdot \tfrac{1}{6} + (1400 - 1400)^2 \cdot \tfrac{1}{6} + (1800 - 1400)^2 \cdot \tfrac{1}{6}$$
$$= 66{,}667$$

66,667 is much larger than the MSE of τ̂r.

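Because the population has only four sites, the sampling distribution of both estimators can be enumerated exhaustively. The sketch below does this with `itertools.combinations`; the helper names are illustrative. It works with unrounded estimates, so the ratio estimator's expectation and MSE come out at about 1398.19 and 14,423 rather than the 1398.17 and 14,451.2 obtained on the slides from estimates rounded to whole numbers.

```python
from itertools import combinations

# Artificial population of four sites.
x = [4, 5, 8, 5]            # nets
y = [200, 300, 500, 400]    # fish
N, n = len(x), 2
tau_x, tau_y = sum(x), sum(y)

# Under s.r.s. every sample of size n is equally likely.
samples = list(combinations(range(N), n))
p = 1 / len(samples)


def ratio_est(s):
    """Ratio estimate of tau_y from sample s (a tuple of site indices)."""
    return sum(y[i] for i in s) / sum(x[i] for i in s) * tau_x


def expansion_est(s):
    """Expansion estimate N * y_bar from sample s."""
    return N * sum(y[i] for i in s) / n


def mean_and_mse(estimator):
    """Exact expectation and MSE of an estimator over all possible samples."""
    values = [estimator(s) for s in samples]
    expected = sum(v * p for v in values)
    mse = sum((v - tau_y) ** 2 * p for v in values)
    return expected, mse


e_ratio, mse_ratio = mean_and_mse(ratio_est)
e_exp, mse_exp = mean_and_mse(expansion_est)
print(f"ratio:     E = {e_ratio:.2f}, bias = {e_ratio - tau_y:.2f}, MSE = {mse_ratio:.1f}")
print(f"expansion: E = {e_exp:.2f}, bias = {e_exp - tau_y:.2f}, MSE = {mse_exp:.1f}")
```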
Auxiliary data and regression estimation
Unit learning outcomes

• Upon successful completion of this unit, you will be able to:
• know why and when to use regression estimates
• know how to check the condition to see whether one can use the regression estimate
• compute the regression estimate and its estimated variance
• compute a confidence interval based on the regression estimate
• see that the regression estimate does perform better than the expansion estimate when auxiliary data is useful
• see that the regression estimate does perform better than the ratio estimate when the condition for using the ratio estimate is not satisfied

Subsection 1

Linear regression estimator

The idea behind regression estimation

• Looking at the data, how do we decide which estimator will work, or which model we should use?
• These are key questions.
• The variance of the estimators will be an important indicator.

• When the auxiliary variable x is linearly related to y but the relationship does not pass through the origin, a linear regression estimator would be appropriate.
• In addition, if multiple auxiliary variables have a linear relationship with y, multiple regression estimates may be appropriate.

• To estimate the mean and total of the y-values, denoted as µ and τ, one can use the linear relationship between y and the known x-values.
• Let us start with the basic regression equation:
$$\hat{y} = a + b x .$$
• Then $b = \frac{s_{xy}}{s_x^2}$ and $a = \bar{y} - b \bar{x}$.

• Then, to estimate the mean of y by µ̂L, substitute x = µx and a = ȳ − b x̄:
$$\hat{y} = a + b x$$
$$\hat{\mu}_L = a + b \mu_x$$
$$\hat{\mu}_L = (\bar{y} - b \bar{x}) + b \mu_x$$
$$\hat{\mu}_L = \bar{y} + b (\mu_x - \bar{x})$$
• Note that even though µ̂L is not unbiased under simple random sampling, it is approximately unbiased (asymptotically unbiased) for large samples.

• Thus, the mean square error of µ̂L is roughly estimated by:
$$\widehat{\mathrm{Var}}(\hat{\mu}_L) = \frac{N-n}{N \times n} \cdot \frac{\sum_{i=1}^{n} (y_i - a - b x_i)^2}{n-2} = \frac{N-n}{N \times n} \cdot \mathrm{MSE}$$
where MSE is the MSE of the linear regression model of y on x.
• Therefore, an approximate (1 − α)100% CI for µ is:
$$\hat{\mu}_L \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\mu}_L)}$$
• It follows that:
$$\hat{\tau}_L = N \cdot \hat{\mu}_L = N \bar{y} + b (\tau_x - N \bar{x})$$
$$\widehat{\mathrm{Var}}(\hat{\tau}_L) = N^2 \, \widehat{\mathrm{Var}}(\hat{\mu}_L) = \frac{N \times (N-n)}{n} \cdot \mathrm{MSE}$$
• And, an approximate (1 − α)100% CI for τ is:
$$\hat{\tau}_L \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\tau}_L)}$$

Example

• A mathematics achievement test was given to 486 students prior to entering a certain college; these students then took a calculus class.
• A simple random sample of 10 students is selected and their calculus scores are recorded.
• It is known that the average achievement test score for the 486 students was 52.
• The scatterplot of the 10 sampled students is given below and the data follow.



[Figure: scatter plot of calculus score Y against achievement test score X]

Student   Test score (xi)   Calculus score (yi)
1         39                65
2         43                78
3         21                52
4         64                82
5         57                92
6         47                89
7         28                73
8         75                98
9         34                56
10        52                75
Mean      46                76

[Figure: R regression output for calculus score on achievement test score]

• Using the results from the R output here, what do you get for the regression estimate?
• ANSWER:
$$\hat{\mu}_L = \bar{y} + b (\mu_x - \bar{x}) = 76 + 0.766 \times (52 - 46) = 80.6$$
• The R output provides us with p-values for the constant and the coefficient of X.
• We can see that both terms are significant.
• The ratio estimate is not appropriate since the constant term is significantly different from zero.

Example

• Now we can compute the variance and the confidence interval.
• What is the variance of the regression estimate?
• ANSWER:
$$\widehat{\mathrm{Var}}(\hat{\mu}_L) = \frac{N-n}{N \times n} \cdot \mathrm{MSE} = \frac{486 - 10}{486 \times 10} \times 8.704^2 = 7.42$$
• What, then, is an approximate 95% CI for µ?
• ANSWER:
$$\hat{\mu}_L \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\mu}_L)} = 80.6 \pm 1.96 \times \sqrt{7.42} = 80.6 \pm 5.34$$

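The regression estimate for the calculus example can be reproduced without statistical software, using only the closed-form least squares formulas. The helper name `regression_mean_estimate` is an illustrative assumption; the data, N = 486 and µx = 52 come from the example.

```python
import math

# Sampled students: achievement test score (x) and calculus score (y).
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
N = 486     # number of students in the population
mu_x = 52   # known population mean of the achievement test score


def regression_mean_estimate(y, x, mu_x, N, z=1.96):
    """Linear regression estimate of the mean of y under s.r.s. (hypothetical helper)."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    b = s_xy / s_x2                          # least squares slope
    a = y_bar - b * x_bar                    # least squares intercept
    mse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    mu_hat = y_bar + b * (mu_x - x_bar)
    var_hat = (N - n) / (N * n) * mse
    half_width = z * math.sqrt(var_hat)
    return b, mu_hat, mse, var_hat, half_width


b, mu_hat, mse, var_hat, half_width = regression_mean_estimate(y, x, mu_x, N)
print(f"b = {b:.3f}, mu_hat = {mu_hat:.1f}, sqrt(MSE) = {math.sqrt(mse):.3f}, "
      f"var = {var_hat:.2f}, CI = {mu_hat:.1f} +/- {half_width:.2f}")
```

The square root of the regression MSE is about 8.704, which is the residual standard error that enters the variance formula squared.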
Subsection 2

Comparison of estimators

• To compare the regression estimate to the estimate ȳ (which does not use the auxiliary variable x), we see that:
$$\widehat{\mathrm{Var}}(\bar{y}) = \frac{N-n}{N} \cdot \frac{s^2}{n} .$$
• s² for the y values is (15.11)².
• What is the estimated Var(ȳ)?
$$\widehat{\mathrm{Var}}(\bar{y}) = \frac{486 - 10}{486 \times 10} \cdot (15.11)^2 = 22.36$$

• Next, what is an approximate 95% CI for µ?
$$\bar{y} \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\bar{y})} = 76 \pm 1.96 \times \sqrt{22.36} = 76 \pm 9.27$$
• Recall: the 95% confidence interval using the regression estimate is 80.6 ± 5.34, a much shorter confidence interval.
• This regression estimate is more precise than ȳ.

• Additionally, we have another estimator that we can look at: µ̂r.
• Compare µ̂L to the ratio estimator µ̂r.
• The next table contains the data, the mean and standard deviation of X and Y, and the ratio residuals yi − r·xi.

Student     Test score (xi)   Calculus score (yi)   yi − r·xi

1           39                65                      0.565
2           43                78                      6.957
3           21                52                     17.304
4           64                82                    −23.739
5           57                92                     −2.174
6           47                89                     11.348
7           28                73                     26.739
8           75                98                    −25.913
9           34                56                     −0.174
10          52                75                    −10.913

Mean        46                76
Std. dev.   16.58             15.11
sr²                                                  283.42

• The ratio estimate is inappropriate for this example.
• However, as a counterexample, we can compute the variance of the ratio estimate using the data in the previous table and compare it with that of the regression estimate.

Note

• For the Calculus Scores example we should not use the ratio estimator µ̂r, because the p-value for the constant term is 0.002.
• This implies that the regression line does not go through the origin, and for this reason the ratio estimate is not appropriate.
• But for the purposes of a counterexample we will work it out here anyway:

\[
\hat{\mu}_r = r\,\mu_x = \frac{\bar{y}}{\bar{x}} \cdot \mu_x = \frac{76}{46} \cdot 52 = 85.91.
\]

• Next, we need the variance, and for this we need the MSE under the ratio estimate. From the previous table,

\[
s_r^2 = \frac{1}{10 - 1}\sum_{i=1}^{10}(y_i - r x_i)^2 = 283.42,
\]

which is huge!
• Now we can compute the variance:

\[
\widehat{\mathrm{Var}}(\hat{\mu}_r) = \frac{N - n}{N} \cdot \frac{s_r^2}{n} = \frac{486 - 10}{486} \cdot \frac{283.42}{10} = 27.75
\]

• Now we can compute a 95% confidence interval for µ:

\[
\hat{\mu}_r \pm z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat{\mu}_r)} = 85.91 \pm 1.96 \times \sqrt{27.75} = 85.91 \pm 10.32
\]

• We can see that the ratio estimate is even worse than µ̂ = ȳ when it is used in an inappropriate situation.
• The interval is wider than the one for the regression estimate.
• The moral of this story is: "Use the right model!"

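The ratio-estimator counterexample can be reproduced directly from the ten (x, y) pairs in the table. The sketch below is an illustration in Python, assuming N = 486 and a known population mean µx = 52 as used on the slides; the variable names are our own.

```python
import math

x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]   # test scores
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]   # calculus scores
N, mu_x = 486, 52                               # population size and known mean of x

n = len(x)
r = sum(y) / sum(x)                             # sample ratio ybar / xbar
mu_r = r * mu_x                                 # ratio estimate of the mean of y

# Residual variance about the line through the origin, y = r * x
s_r2 = sum((yi - r * xi) ** 2 for xi, yi in zip(x, y)) / (n - 1)
var_mu_r = (N - n) / N * s_r2 / n
half = 1.96 * math.sqrt(var_mu_r)

print(f"mu_r = {mu_r:.2f}, s_r^2 = {s_r2:.2f}, var = {var_mu_r:.2f}, half-width = {half:.2f}")
```

Because the code carries full precision, the last digit of the variance and half-width may differ slightly from the hand-rounded slide values.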
Stratified sampling
Some important information on this unit

• Upon successful completion of this lesson, you will be able to:
  • know why and when to use stratified sampling
  • estimate the mean and total when stratified sampling is used
  • compute confidence intervals for these estimates
  • determine the optimal allocation of sample sizes
  • compute estimates when post-stratification is used
  • compute the variance of the estimates when post-stratification is used
  • provide estimates of a proportion from a stratified sample

Subsection 1

How to use stratified sampling

Introduction

In stratified sampling, the population is partitioned into non-overlapping groups, called strata, and a sample is selected by some design within each stratum.

• For example, geographical regions can be stratified into similar regions by means of some known variable such as habitat type, elevation or soil type.
• Another example might be to determine the proportion of defective products being assembled in a factory. In this case sampling may be stratified by production line, factory, etc.
• Can you think of a couple of additional examples where stratified sampling would make sense?

• The principal reasons for using stratified random sampling rather than simple random sampling include:
1. Stratification may produce a smaller error of estimation than would be produced by a simple random sample of the same size. This is particularly true if measurements within strata are very homogeneous.
2. The cost per observation in the survey may be reduced by stratification of the population elements into convenient groupings.
3. Estimates of population parameters may be desired for subgroups of the population. These subgroups should then be identified as strata.

Example

• An advertising firm, interested in determining how much to emphasize television advertising in a certain county, decides to conduct a sample survey to estimate the average number of hours each week that households within that county watch television.
• The county has two towns, A and B, and a rural area C.
• Town A is built around a factory, and most households contain factory workers with school-aged children.
• Town B contains mainly retirees, and the households in rural area C are mainly farmers.

• There are 155 households in town A, 62 in town B and 93 in the rural area, C.
• The firm decides to select 20 households from town A, 8 households from town B and 12 households from the rural area.
• The data (hours of television watched per week) are given in the following table:

Town A         35, 43, 36, 39, 28, 28, 29, 25, 38, 27, 26, 32, 29, 40, 35, 41, 37, 31, 45, 34
Town B         27, 15, 4, 41, 49, 25, 10, 30
Rural area C   8, 14, 12, 15, 30, 32, 21, 20, 34, 7, 11, 24

• Usually a sample is selected by some probability design from each of the L strata in the population, with selections in different strata independent of each other.
• The special case where a simple random sample is drawn from each stratum is called a stratified random sample.
• Does it make sense to use a stratified random sample for this problem?
• Why or why not?
• Yes, for all three reasons listed above.

• Notation
  • L: the number of strata
  • Nh: the number of units in stratum h
  • nh: the number of units sampled from stratum h
  • N: the total number of units in the population, i.e., N = N1 + N2 + ... + NL
• For our "Watching TV" example the values are:

L = 3, N1 = 155, N2 = 62, N3 = 93,

N = 155 + 62 + 93 = 310.

Some results are given in the following table:

Town A         N1 = 155   n1 = 20   Mean = 33.90   sd = 5.95
Town B         N2 = 62    n2 = 8    Mean = 25.12   sd = 15.25
Rural area C   N3 = 93    n3 = 12   Mean = 19.00   sd = 9.36

Estimating the population total

\[
\hat{\tau}_{st} = \sum_{h=1}^{L} \hat{\tau}_h.
\]

• The estimated total is the sum of the estimated stratum totals, where τ̂h is an unbiased estimator of τh.
• Since selections in different strata are independent, the variance and its estimator are:

\[
\mathrm{Var}(\hat{\tau}_{st}) = \sum_{h=1}^{L} \mathrm{Var}(\hat{\tau}_h)
\quad\text{and}\quad
\widehat{\mathrm{Var}}(\hat{\tau}_{st}) = \sum_{h=1}^{L} \widehat{\mathrm{Var}}(\hat{\tau}_h).
\]

• The formulas are computed differently according to the sampling scheme within each stratum.
• For stratified random sampling, i.e., taking a simple random sample within each stratum:

\[
\hat{\tau}_h = N_h \bar{y}_h,
\]
\[
\widehat{\mathrm{Var}}(\hat{\tau}_{st}) = \sum_{h=1}^{L} N_h (N_h - n_h) \frac{s_h^2}{n_h},
\]
\[
s_h^2 = \frac{1}{n_h - 1}\sum_{i=1}^{n_h}(y_{hi} - \bar{y}_h)^2.
\]

• This turns out to be easy to remember, and from it one can easily obtain the estimates for the population mean:

\[
\hat{\mu}_{st} = \frac{\hat{\tau}_{st}}{N},
\qquad
\widehat{\mathrm{Var}}(\hat{\mu}_{st}) = \frac{1}{N^2}\,\widehat{\mathrm{Var}}(\hat{\tau}_{st}).
\]

Estimating the population mean

• For stratified random sampling:

\[
\bar{y}_{st} = \frac{1}{N}\sum_{h=1}^{L} N_h \bar{y}_h,
\]
\[
\widehat{\mathrm{Var}}(\bar{y}_{st}) = \sum_{h=1}^{L}\left(\frac{N_h}{N}\right)^2 \frac{N_h - n_h}{N_h} \cdot \frac{s_h^2}{n_h}.
\]

• sh is the sample standard deviation of stratum h, as defined above.

Example: estimating the mean

• Consider the TV Watching example.
• The overall mean for this example is:

\[
\bar{y}_{st} = \frac{1}{N}(N_1\bar{y}_1 + N_2\bar{y}_2 + N_3\bar{y}_3)
= \frac{1}{155 + 62 + 93}\left[(155 \times 33.9) + (62 \times 25.12) + (93 \times 19.0)\right] = 27.7
\]

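The overall variance of the estimator of the mean for this example is:

\[
\widehat{\mathrm{Var}}(\bar{y}_{st}) = \sum_{h=1}^{3}\left(\frac{N_h}{N}\right)^2 \frac{N_h - n_h}{N_h} \cdot \frac{s_h^2}{n_h}
\]
\[
= \frac{1}{(310)^2}\left[(155)^2 \cdot \frac{155 - 20}{155} \cdot \frac{(5.95)^2}{20}
+ (62)^2 \cdot \frac{62 - 8}{62} \cdot \frac{(15.25)^2}{8}
+ (93)^2 \cdot \frac{93 - 12}{93} \cdot \frac{(9.36)^2}{12}\right] = 1.97
\]
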
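The stratified mean and its estimated variance can also be computed straight from the raw TV data. This is a minimal Python sketch under the stratified random sampling formulas above; the dictionary layout and variable names are illustrative choices, not part of the course material.

```python
from statistics import mean, variance

# stratum name -> (N_h, sampled weekly TV hours)
strata = {
    "Town A": (155, [35, 43, 36, 39, 28, 28, 29, 25, 38, 27,
                     26, 32, 29, 40, 35, 41, 37, 31, 45, 34]),
    "Town B": (62, [27, 15, 4, 41, 49, 25, 10, 30]),
    "Rural area C": (93, [8, 14, 12, 15, 30, 32, 21, 20, 34, 7, 11, 24]),
}

N = sum(N_h for N_h, _ in strata.values())

# Stratified mean: weighted average of stratum means with weights N_h / N
y_st = sum(N_h * mean(ys) for N_h, ys in strata.values()) / N

# Estimated variance: sum of (N_h/N)^2 * (N_h - n_h)/N_h * s_h^2 / n_h
# (statistics.variance is the sample variance with divisor n_h - 1)
var_y_st = sum(
    (N_h / N) ** 2 * (N_h - len(ys)) / N_h * variance(ys) / len(ys)
    for N_h, ys in strata.values()
)

tau_st = N * y_st              # estimated population total
var_tau_st = N ** 2 * var_y_st

print(y_st, var_y_st, tau_st, var_tau_st)
```

Since the code uses unrounded stratum means and variances, its results can differ in the last digits from the hand calculation, which works from rounded summary statistics.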
Example: estimating the population total

For the total hours watching TV example:

\[
\hat{\tau}_{st} = N \cdot \bar{y}_{st} = 310 \times 27.7 = 8587,
\]
\[
\widehat{\mathrm{Var}}(\hat{\tau}_{st}) = N^2\,\widehat{\mathrm{Var}}(\bar{y}_{st}) = (310)^2 \times 1.97 = 189317.
\]

Example: confidence intervals

• When all of the stratum sample sizes are large, an approximate 100(1 − α)% CI for τ is:

\[
\hat{\tau}_{st} \pm z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat{\tau}_{st})}.
\]

• However, when the stratum sample sizes are smaller than 30, an interval based on the t distribution should be used instead.

• What are the degrees of freedom for the t used in this confidence interval?
• Intuitively we would want this to be (n1 − 1) + (n2 − 1) + ... + (nL − 1), and this is correct when the variances of all strata are the same.

• But when this is not the case and we cannot pool the degrees of freedom, we need to use the Satterthwaite approximation for the degrees of freedom:

\[
d = \left(\sum_{h=1}^{L} a_h s_h^2\right)^2 \Bigg/ \sum_{h=1}^{L}\frac{(a_h s_h^2)^2}{n_h - 1},
\qquad\text{where } a_h = \frac{N_h (N_h - n_h)}{n_h}.
\]

• In particular, when the Nh are all equal, the nh are all equal and the sh² are all equal, d = n − L.

For the TV example:

\[
a_1 = \frac{N_1 (N_1 - n_1)}{n_1} = \frac{155(155 - 20)}{20} = 1046.25,
\]
\[
a_2 = \frac{N_2 (N_2 - n_2)}{n_2} = \frac{62(62 - 8)}{8} = 418.5,
\]
\[
a_3 = \frac{N_3 (N_3 - n_3)}{n_3} = \frac{93(93 - 12)}{12} = 627.75.
\]

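\[
d = \frac{(a_1 s_1^2 + a_2 s_2^2 + a_3 s_3^2)^2}
{\dfrac{(a_1 s_1^2)^2}{n_1 - 1} + \dfrac{(a_2 s_2^2)^2}{n_2 - 1} + \dfrac{(a_3 s_3^2)^2}{n_3 - 1}}
\]
\[
= \frac{\left(1046.25 \cdot (5.95)^2 + 418.5 \cdot (15.25)^2 + 627.75 \cdot (9.36)^2\right)^2}
{\dfrac{(1046.25 \cdot (5.95)^2)^2}{20 - 1} + \dfrac{(418.5 \cdot (15.25)^2)^2}{8 - 1} + \dfrac{(627.75 \cdot (9.36)^2)^2}{12 - 1}}
= 21.09
\]
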
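The Satterthwaite degrees of freedom are tedious to compute by hand, so a short check is useful. The following Python sketch implements the formula for d given above, using the summary statistics of the TV example; the function name `satterthwaite_df` is a hypothetical helper introduced here for illustration.

```python
def satterthwaite_df(N_h, n_h, s_h):
    """Approximate degrees of freedom for a stratified random sample.

    N_h: stratum sizes, n_h: stratum sample sizes, s_h: stratum sample sds.
    """
    a = [N * (N - n) / n for N, n in zip(N_h, n_h)]       # a_h = N_h (N_h - n_h) / n_h
    terms = [a_i * s ** 2 for a_i, s in zip(a, s_h)]      # a_h * s_h^2
    return sum(terms) ** 2 / sum(t ** 2 / (n - 1) for t, n in zip(terms, n_h))

d = satterthwaite_df([155, 62, 93], [20, 8, 12], [5.95, 15.25, 9.36])
print(round(d, 2))

# Special case: equal N_h, n_h and s_h give d = n - L (here 20 - 2 = 18)
print(satterthwaite_df([100, 100], [10, 10], [3, 3]))
```

The second call illustrates the special case noted above: with identical strata the approximation reduces to the pooled degrees of freedom n − L.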
• Provide a 95% CI for µ and also a 95% CI for τ.
• ANSWER:
• We will use t with df = 21; hence a 95% CI for µ is:

\[
\bar{y}_{st} \pm t_{(21;\,1-\alpha/2)}\sqrt{\widehat{\mathrm{Var}}(\bar{y}_{st})}
= 27.7 \pm 2.08 \times \sqrt{1.97} = 27.7 \pm 2.92
\]

Similarly, a 95% CI for τ is:

\[
\hat{\tau}_{st} \pm t_{(21;\,1-\alpha/2)}\sqrt{\widehat{\mathrm{Var}}(\hat{\tau}_{st})}
= 8587 \pm 2.08 \times \sqrt{189317} = 8587 \pm 905.0
\]

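The half-widths of the two intervals can be checked with a few lines of Python. This sketch assumes the rounded slide values (ȳst = 27.7, estimated variance 1.97, N = 310) and takes the t quantile 2.08 for 21 degrees of freedom from the slides rather than computing it; all names are illustrative.

```python
import math

t_21 = 2.08              # 97.5th percentile of t with 21 df, as quoted on the slide
N = 310
y_st, var_y_st = 27.7, 1.97

half_mean = t_21 * math.sqrt(var_y_st)             # half-width of the CI for the mean
half_total = t_21 * math.sqrt(N ** 2 * var_y_st)   # half-width of the CI for the total

print(f"mean:  {y_st} +/- {half_mean:.2f}")
print(f"total: {N * y_st:.0f} +/- {half_total:.1f}")
```

Note that the interval for the total is simply the interval for the mean scaled by N, so the two half-widths differ by exactly a factor of 310.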
Subsection 2

The stratification principle

Stratification principle

• If the only objective of stratification is to produce estimators with small variances, then we want to stratify so that, within each stratum, the units are as similar as possible.
• In a survey of a human population, stratification may be based on socioeconomic factors or geographic regions.

• For example, to estimate the average starting income of recently hired young workers, it would make sense to stratify by age group, since the starting income of young workers of the same age would be similar.
• The following slides illustrate the stratification principle.

Example: stratification principle

• The population is defined by the dots in the figure.
• Population values: 1, 2, 2, 3, 5, 6, 7, 8, 9, 9, 10, 11, 12, 13
• N = 14, µ = 7, σ² = 14.43.

Population: U

Strata   1      2      3      4
Data     1      5      8      11
         2      6      9      12
         2      7      9      13
         3             10
Nh       4      3      4      3
µh       2      6      9      12
σh²      1/2    2/3    1/2    2/3

Population: U∗

Strata   1       2       3       4
Data     2       3       2       1
         9       9       6       5
         10      13      7       8
                         12      11
Nh       3       3       4       4
µh       7       8.33    6.75    6.25
σh²      12.67   16.89   12.69   13.69

• The population variance, σ², can be decomposed as:

\[
\sigma^2 = \sigma^2_{\text{within}} + \sigma^2_{\text{between}},
\]

where

\[
\sigma^2_{\text{within}} = \sum_{h=1}^{L}\frac{N_h}{N}\,\sigma_h^2,
\qquad
\sigma^2_{\text{between}} = \sum_{h=1}^{L}\frac{N_h}{N}\,(\mu_h - \mu)^2.
\]

• In the first stratification scheme (U):
  • σ²within = 0.57 (4% of σ²)
  • σ²between = 13.86 (96% of σ²)
• In the second stratification scheme (U∗):
  • σ²within = 13.87 (96% of σ²)
  • σ²between = 0.56 (4% of σ²)

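The within/between decomposition for the two stratification schemes can be verified with a small script. The Python sketch below assumes the strata memberships shown in the two tables (U and U∗) and uses population variances (divisor N); the function name `decompose` is our own.

```python
def decompose(strata):
    """Return (within, between, total) population variance for a list of strata."""
    values = [v for stratum in strata for v in stratum]
    N = len(values)
    mu = sum(values) / N
    total = sum((v - mu) ** 2 for v in values) / N
    within = between = 0.0
    for stratum in strata:
        N_h = len(stratum)
        mu_h = sum(stratum) / N_h
        var_h = sum((v - mu_h) ** 2 for v in stratum) / N_h
        within += N_h / N * var_h              # weighted stratum variance
        between += N_h / N * (mu_h - mu) ** 2  # weighted squared distance of stratum mean
    return within, between, total

U = [[1, 2, 2, 3], [5, 6, 7], [8, 9, 9, 10], [11, 12, 13]]
U_star = [[2, 9, 10], [3, 9, 13], [2, 6, 7, 12], [1, 5, 8, 11]]

for name, scheme in (("U", U), ("U*", U_star)):
    within, between, total = decompose(scheme)
    print(name, round(within, 2), round(between, 2), round(between / total, 2))
```

The last printed number for each scheme is the ratio of between-strata variance to total variance, which shows at a glance how much of the variability the stratification captures.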
• When a population is stratified, the total variance (σ²) is decomposed into a variance component within strata (σ²within) and a component between strata (σ²between).
• These examples show that, although the total variance in the population is a fixed value, different stratification schemes result in different splits between σ²within and σ²between.

• An indicator of how the total variance is split is the correlation ratio, η² = σ²between / σ².
• Hence, in the first stratification scheme, η² = 0.96 shows that the variance between strata is 96% of the total variance of the population.
• The variance within strata is small. This means that the strata are very homogeneous.

• In the second stratification scheme, η² = 0.04. In this case the variance between strata represents only 4% of the total variance.
• The variance within strata represents the remaining 96%.
• These strata are much more heterogeneous (within) and more similar to each other.
• We can conclude that the first stratification scheme is better, since the estimation accuracy is higher when strata are more homogeneous (within).
• In fact, the closer the correlation ratio is to 1, the more homogeneous the strata are and the more accurate the estimation is.

Allocation in stratified random sampling

• The question is: given a total sample size of n, how do we allocate it among the L strata?
• The best allocation scheme is affected by the following three factors:
1. the total number of elements in each stratum,
2. the variability of the measurements within each stratum, and
3. the cost associated with obtaining an observation from each stratum.

• If we don't have all this information, but we know the total number of elements in each stratum, we can use a simple allocation.
• This is proportional allocation, which maintains a constant sampling fraction throughout the population:

\[
n_h = n \cdot \frac{N_h}{N}.
\]

• This does not take into consideration the variability within each stratum and is, in general, not the optimal choice.
• If the cost of sampling from each stratum is the same, then the optimal allocation (the allocation with the lowest variance) is:

\[
n_h = n \cdot \frac{N_h \sigma_h}{\sum_{k=1}^{L} N_k \sigma_k}.
\]

• However, suppose the cost of sampling differs from stratum to stratum and the total cost is

\[
c = c_0 + c_1 n_1 + c_2 n_2 + \dots + c_L n_L,
\]

where c0 is the overhead cost and ch is the cost per unit in stratum h.
• The optimal allocation is then:

\[
n_h = \frac{(c - c_0)\, N_h \sigma_h / \sqrt{c_h}}{\sum_{k=1}^{L} N_k \sigma_k \sqrt{c_k}}.
\]

• Remarks:
  • The sample size is directly proportional to Nh and σh, i.e., allocate a larger sample to the larger and more variable strata.
  • The sample size is inversely proportional to the square root of ch, i.e., allocate smaller samples to the more expensive strata.

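To see how unequal costs shift the allocation, the cost-constrained formula can be sketched in Python. The stratum sizes and standard deviations below are those of the TV example, but the per-unit costs, the budget and the overhead are hypothetical numbers chosen purely for illustration, as is the function name.

```python
import math

def optimal_allocation(budget, overhead, N_h, sigma_h, c_h):
    """Cost-constrained optimal allocation:
    n_h = (c - c0) * (N_h * sigma_h / sqrt(c_h)) / sum_k(N_k * sigma_k * sqrt(c_k))."""
    denom = sum(N * s * math.sqrt(c) for N, s, c in zip(N_h, sigma_h, c_h))
    return [(budget - overhead) * N * s / math.sqrt(c) / denom
            for N, s, c in zip(N_h, sigma_h, c_h)]

N_h = [155, 62, 93]
sigma_h = [5, 15, 10]
c_h = [9, 9, 16]             # cost per interview in each stratum (assumed values)
budget, overhead = 500, 50   # total budget c and fixed cost c0 (assumed values)

n_h = optimal_allocation(budget, overhead, N_h, sigma_h, c_h)
spent = overhead + sum(c * n for c, n in zip(c_h, n_h))

print([round(n, 2) for n in n_h])
print(round(spent, 2))       # the allocation exhausts the budget by construction
```

In practice the resulting nh are not integers and must be rounded, after which the total cost should be rechecked against the budget.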
• In order to use the optimal allocation, one must be able to estimate σh.
• Let's take a look at this in the context of the TV example.

Back to TV example

• For the TV example, suppose that, before conducting the survey, the advertising firm has already estimated that σ1 ≈ 5, σ2 ≈ 15, σ3 ≈ 10.
• Now, if the cost of obtaining an observation is about the same for the three areas (e.g., telephone interviews), what is the optimal allocation if they want to sample 40 households?

• Optimal allocation:

\[
n_h = n \cdot \frac{N_h \sigma_h}{\sum_{k=1}^{L} N_k \sigma_k},
\]

where
  • N1 = 155, σ1 ≈ 5
  • N2 = 62, σ2 ≈ 15
  • N3 = 93, σ3 ≈ 10

• Then,

\[
n_1 = \frac{40 \times 155 \times 5}{155 \times 5 + 62 \times 15 + 93 \times 10} = 11.7647,
\]
\[
n_2 = \frac{40 \times 62 \times 15}{155 \times 5 + 62 \times 15 + 93 \times 10} = 14.1176,
\]
\[
n_3 = \frac{40 \times 93 \times 10}{155 \times 5 + 62 \times 15 + 93 \times 10} = 14.1176.
\]

• Thus we will choose n1 = 12, n2 = 14 and n3 = 14.
• Remember, it is important that n1 + n2 + n3 = 40 in this case.

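The equal-cost optimal allocation for the TV example can be reproduced as follows. This is a minimal Python sketch; the function name `neyman_allocation` is our own label for the formula above, and plain rounding is used only because it happens to preserve the total here.

```python
def neyman_allocation(n, N_h, sigma_h):
    """Optimal allocation for equal per-unit costs:
    n_h = n * N_h * sigma_h / sum_k(N_k * sigma_k)."""
    weights = [N * s for N, s in zip(N_h, sigma_h)]
    total = sum(weights)
    return [n * w / total for w in weights]

exact = neyman_allocation(40, [155, 62, 93], [5, 15, 10])
rounded = [round(v) for v in exact]

print([round(v, 4) for v in exact])
print(rounded, sum(rounded))
```

Simple rounding does not guarantee that the rounded sizes add up to n in general, so the sum should always be checked and adjusted if necessary.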
Subsection 3

Post-stratification

• Sometimes we would like to stratify on a key variable but cannot place the units into their correct strata until the units are sampled.
• For instance, in a telephone interview the respondents cannot be placed into a male or female stratum until after they are contacted.
• Post-stratification, i.e., stratification after the selection of a sample, is often appropriate when a simple random sample is not properly balanced with respect to such a variable.
• Here is an example.

Example

• We want to estimate the average weight and take a simple random sample of 100 people.
• Here is what was obtained:

Male             Female
n1 = 20          n2 = 80
ȳ1 = 180 lbs.    ȳ2 = 120 lbs.

ȳ, the overall sample mean, is 132 lbs.

• This is obviously not balanced with respect to gender and is likely an underestimate due to the under-representation of males in the sample.
• How can we account for this?
• In the population, N1/N = 0.5 and N2/N = 0.5.
• Thus,

\[
\bar{y}_{st} = \frac{N_1}{N}\bar{y}_1 + \frac{N_2}{N}\bar{y}_2 = 0.5 \cdot 180 + 0.5 \cdot 120 = 150.
\]

• The algebraic form is the same as that of the stratified sample mean.

Post-stratification estimator variance

• But the post-stratification estimator ȳst will not have the same variance as the stratified sample mean, since the sample sizes nh are random.
• Thus, the variance of the post-stratified ȳst is the sum of the variance of the stratified mean under proportional allocation, nh = n · Nh/N, and a term that shows the increase one expects from post- rather than pre-stratification.

More specifically,

\[
\mathrm{Var}(\bar{y}_{st}) \approx \frac{N - n}{nN}\sum_{h=1}^{L}\frac{N_h}{N}\,\sigma_h^2
+ \frac{1}{n^2} \cdot \frac{N - n}{N - 1}\sum_{h=1}^{L}\frac{N - N_h}{N}\,\sigma_h^2.
\]

Example

• A firm knows that 40% of its accounts receivable are wholesale and 60% are retail.
• However, it is difficult to identify the type of an account without pulling its file and looking at it.
• An auditor randomly sampled 100 accounts without replacement. Here are the results of the sampling:

Wholesale    Retail
n1 = 70      n2 = 30
ȳ1 = 520     ȳ2 = 280
s1 = 210     s2 = 90

• Compute the post-stratified mean.
• ANSWER:

\[
\bar{y}_{st} = \frac{N_1}{N}\bar{y}_1 + \frac{N_2}{N}\bar{y}_2 = 0.4 \times 520 + 0.6 \times 280 = 376
\]

• Compute the variance of the post-stratified mean.
• ANSWER:

\[
\widehat{\mathrm{Var}}(\text{post-stratified } \bar{y}) \approx \frac{1}{n}\left[\frac{N_1}{N}s_1^2 + \frac{N_2}{N}s_2^2\right]
+ \frac{1}{n^2}\left[\left(1 - \frac{N_1}{N}\right)s_1^2 + \left(1 - \frac{N_2}{N}\right)s_2^2\right]
\]
\[
= \frac{1}{100}\left[0.4 \times (210)^2 + 0.6 \times (90)^2\right]
+ \frac{1}{100^2}\left[0.6 \times (210)^2 + 0.4 \times (90)^2\right]
= 225 + 2.97 = 227.97
\]

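The auditor example can be checked with a short script. The Python sketch below follows the worked calculation above, which drops the finite population correction terms (the population size N is not given); the function name and argument layout are illustrative assumptions.

```python
def post_stratified(n, W, ybar, s):
    """Post-stratified mean and approximate variance (finite population correction ignored).

    n: total sample size; W: known population proportions N_h/N;
    ybar, s: stratum sample means and standard deviations.
    """
    mean = sum(w * yb for w, yb in zip(W, ybar))
    var = (sum(w * sd ** 2 for w, sd in zip(W, s)) / n
           + sum((1 - w) * sd ** 2 for w, sd in zip(W, s)) / n ** 2)
    return mean, var

mean, var = post_stratified(100, [0.4, 0.6], [520, 280], [210, 90])
print(round(mean, 2), round(var, 2))
```

The second term of the variance is of order 1/n², so for a sample of 100 it adds only a small penalty for not having fixed the stratum sample sizes in advance.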
Subsection 4

Further topics on stratification

Estimator properties

• It is not true that stratified random sampling always produces an estimator with a smaller variance than that from simple random sampling. Let's look at an example.
• The dean of a school for boys wants to estimate the average weight of the 7th-grade boys in the school.
• There are 4 classes: 24 students in class 1, 36 in class 2, 30 in class 3 and 30 in class 4.
• For administrative ease, he decides to use stratified sampling with each class as a stratum.

Example

• The principal has enough time and money to obtain data for 20 students, and because the cost of sampling is the same in each stratum, he decides to use proportional allocation.
• Sample allocation is n1 = 4, n2 = 6, n3 = 5, and n4 = 5.

• The data (in lbs.) is given in the following table:

Class      Weight of student (in lbs.)
Class 1    94, 90, 102, 110
Class 2    91, 99, 93, 105, 111, 101
Class 3    108, 96, 100, 93, 93
Class 4    92, 110, 94, 91, 113

Here is a table that describes the data from each stratum:

Class      Stratum size    Sample size    Mean      sd
Class 1    N1 = 24         n1 = 4         99.00     8.87
Class 2    N2 = 36         n2 = 6         100.00    7.46
Class 3    N3 = 30         n3 = 5         98.00     6.28
Class 4    N4 = 30         n4 = 5         100.00    10.61
All        N = 120         n = 20         99.30     7.73

• Calculate the stratified estimator ȳst.
• ANSWER: To estimate the average weight of the 7th grade boys:

$$
\bar{y}_{st} = \sum_{h=1}^{L} \frac{N_h}{N}\,\bar{y}_h = 99.3 .
$$

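• Calculate the variance of ȳst.
• ANSWER:

$$
\begin{aligned}
\widehat{\operatorname{Var}}(\bar{y}_{st}) &= \frac{1}{N^2} \sum_{h=1}^{4} N_h^2 \left(\frac{N_h - n_h}{N_h}\right) \frac{s_h^2}{n_h} \\
&= \frac{1}{120^2}\left[(24)^2 \cdot \frac{5}{6} \cdot \frac{(8.87)^2}{4} + (36)^2 \cdot \frac{5}{6} \cdot \frac{(7.46)^2}{6} + (30)^2 \cdot \frac{5}{6} \cdot \frac{(6.28)^2}{5} + (30)^2 \cdot \frac{5}{6} \cdot \frac{(10.61)^2}{5}\right] \\
&= 2.93
\end{aligned}
$$
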
For a 95% CI, we need Satterthwaite's formula to get the degrees of freedom:

$$
d = \frac{\left(\sum_{h=1}^{L} a_h s_h^2\right)^2}{\sum_{h=1}^{L} \dfrac{(a_h s_h^2)^2}{n_h - 1}}, \qquad a_h = \frac{N_h (N_h - n_h)}{n_h},
$$

$$
a_1 = \frac{24(24-4)}{4} = 120, \quad a_2 = \frac{36(36-6)}{6} = 180, \quad a_3 = \frac{30(30-5)}{5} = 150, \quad a_4 = \frac{30(30-5)}{5} = 150 .
$$

• Plug in the formula and we get that d = 13.7576.
• Round it down to 13, to be more conservative, and use df = 13.
• Then, an approximate 95% CI is:

$$
99.3 \pm 2.160\sqrt{2.93} = 99.3 \pm 3.697
$$

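The stratified calculations above (mean, variance, Satterthwaite degrees of freedom and margin of error) can be reproduced from the raw weights. The sketch below uses only the Python standard library; the variable names are illustrative, and the t quantile 2.160 is copied from the slide rather than computed. Because the slides round the standard deviations before plugging them in, the last digits can differ slightly from the printed values.

```python
import math
from statistics import mean, variance

# Stratum sizes and sampled weights (lbs.) copied from the tables above.
N_h = {1: 24, 2: 36, 3: 30, 4: 30}
weights = {
    1: [94, 90, 102, 110],
    2: [91, 99, 93, 105, 111, 101],
    3: [108, 96, 100, 93, 93],
    4: [92, 110, 94, 91, 113],
}
N = sum(N_h.values())

# Stratified mean: sum of (N_h / N) * ybar_h.
y_st = sum(N_h[h] * mean(weights[h]) for h in N_h) / N

# Estimated variance with the finite population correction in each stratum.
var_st = sum(
    N_h[h] ** 2 * (N_h[h] - len(y)) / N_h[h] * variance(y) / len(y)
    for h, y in weights.items()
) / N ** 2

# Satterthwaite approximation: a_h * s_h^2 for each stratum, then the ratio.
a_s2 = {h: N_h[h] * (N_h[h] - len(y)) / len(y) * variance(y) for h, y in weights.items()}
df = sum(a_s2.values()) ** 2 / sum(v ** 2 / (len(weights[h]) - 1) for h, v in a_s2.items())

T_13 = 2.160  # t quantile for 13 df, taken from the slide rather than computed
margin = T_13 * math.sqrt(var_st)

print(round(y_st, 2), round(var_st, 2), round(df, 2), round(margin, 2))
```
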
• Looking back at the data, if we had used simple random sampling, would our CI have been tighter or looser?
• ANSWER:

$$
\widehat{\operatorname{Var}}(\bar{y}) = \left(\frac{N-n}{N}\right) \frac{s^2}{n} = \left(\frac{120-20}{120}\right) \frac{(7.73)^2}{20} = 2.49
$$

• Then an approximate 95% CI, with df = 19, is:

$$
99.3 \pm 2.093\sqrt{2.49} = 99.3 \pm 3.30
$$

Thus the margin of error is smaller and the confidence interval narrower.

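The hypothetical simple-random-sampling comparison can be sketched the same way, pooling the 20 weights as if they had come from one SRS of the 120 boys. The names are illustrative; the t quantile 2.093 and the stratified margin 3.697 are copied from the slides.

```python
import math
from statistics import variance

# All 20 sampled weights pooled, treated hypothetically as one SRS from N = 120.
sample = [94, 90, 102, 110, 91, 99, 93, 105, 111, 101,
          108, 96, 100, 93, 93, 92, 110, 94, 91, 113]
N, n = 120, len(sample)

# SRS variance of the sample mean with the finite population correction.
var_srs = (N - n) / N * variance(sample) / n

T_19 = 2.093  # t quantile for 19 df, copied from the slide
margin_srs = T_19 * math.sqrt(var_srs)

margin_strat = 3.697  # stratified margin of error reported above
print(round(var_srs, 2), round(margin_srs, 2), margin_srs < margin_strat)
```
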
• Usually stratified random sampling will perform better overall, because we usually use it when the strata are more homogeneous than the population as a whole.
• There is no reason to expect the classes to be more homogeneous in weight, and therefore there is no reason why this stratified random sample should be any better than a simple random sample.

• Since the data had been collected by stratified sampling, the above method, which treats them as an s.r.s., is the wrong way to compute the variance for this problem.
• How the variance is computed depends on the method by which the sample was taken.
• We did the computation just to show that if, hypothetically, the data had been collected by s.r.s. and had turned out as shown (for illustration's sake), then the margin of error would be smaller.

Moral of this example

• Stratifying on class, which is not related to weight, does not result in smaller variances within the strata.
• On the other hand, if stratification had other purposes such as to estimate the parameters of each subgroup, it still makes sense to stratify, though the purpose is not to get estimates with smaller variance.
• For this particular example, the stratification to estimate the average weight for each class may be relevant.

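Stratified sampling and proportions

$$
\hat{p}_{st} = \frac{1}{N} \sum_{h=1}^{L} N_h\,\hat{p}_h .
$$

$$
\widehat{\operatorname{Var}}(\hat{p}_{st}) = \frac{1}{N^2} \sum_{h=1}^{L} N_h^2\,\widehat{\operatorname{Var}}(\hat{p}_h) = \frac{1}{N^2} \sum_{h=1}^{L} N_h^2 \left(\frac{N_h - n_h}{N_h}\right) \frac{\hat{p}_h (1 - \hat{p}_h)}{n_h - 1}
$$
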
Example

• The advertising firm wants to estimate the proportion of households in the county that view the television show "American Idol".
• N1 = 155, N2 = 62, N3 = 93.
• As before, we stratify by town and the sample results are:

Stratum         Sample size    p̂h
Town A          n1 = 20        16/20 = 0.80
Town B          n2 = 8         2/8 = 0.25
Rural area C    n3 = 12        6/12 = 0.50

• We plug in the values and we can get the following:

$$
\hat{p}_{st} = \frac{1}{N} \sum_{h=1}^{L} N_h\,\hat{p}_h = \frac{155}{310} \cdot 0.8 + \frac{62}{310} \cdot 0.25 + \frac{93}{310} \cdot 0.5 = 0.6
$$

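The following displays the estimated variance for each stratum:

$$
\widehat{\operatorname{Var}}(\hat{p}_1) = \left(\frac{N_1 - n_1}{N_1}\right) \frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1 - 1} = \left(\frac{155 - 20}{155}\right) \frac{0.8(0.2)}{19} = 0.007
$$

$$
\widehat{\operatorname{Var}}(\hat{p}_2) = \left(\frac{N_2 - n_2}{N_2}\right) \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2 - 1} = \left(\frac{62 - 8}{62}\right) \frac{0.25(0.75)}{7} = 0.024
$$

$$
\widehat{\operatorname{Var}}(\hat{p}_3) = \left(\frac{N_3 - n_3}{N_3}\right) \frac{\hat{p}_3 (1 - \hat{p}_3)}{n_3 - 1} = \left(\frac{93 - 12}{93}\right) \frac{0.5(0.5)}{11} = 0.02
$$
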
• Compute the estimated variance of the stratified proportion.
• ANSWER:

$$
\widehat{\operatorname{Var}}(\hat{p}_{st}) = \frac{1}{(310)^2}\left[(155)^2 (0.007) + (62)^2 (0.024) + (93)^2 (0.02)\right] = 0.0045
$$

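As a check on the arithmetic, the stratified proportion and its estimated variance can be computed directly from the counts. This is a sketch with illustrative names; it works with unrounded stratum variances, so intermediate values differ slightly from the rounded 0.007, 0.024 and 0.02 shown above.

```python
# Stratified estimate of a proportion for the three strata above.
N_h = {"Town A": 155, "Town B": 62, "Rural area C": 93}
n_h = {"Town A": 20, "Town B": 8, "Rural area C": 12}
hits = {"Town A": 16, "Town B": 2, "Rural area C": 6}  # sampled households that view the show
N = sum(N_h.values())

# Stratum proportions and the stratified proportion.
p_h = {h: hits[h] / n_h[h] for h in N_h}
p_st = sum(N_h[h] * p_h[h] for h in N_h) / N

# Per-stratum variance with finite population correction and n_h - 1 in the denominator.
var_h = {
    h: (N_h[h] - n_h[h]) / N_h[h] * p_h[h] * (1 - p_h[h]) / (n_h[h] - 1)
    for h in N_h
}
var_st = sum(N_h[h] ** 2 * var_h[h] for h in N_h) / N ** 2

print(round(p_st, 2), {h: round(v, 3) for h, v in var_h.items()}, round(var_st, 4))
```
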
Cluster sampling and systematic sampling

Unit learning outcomes

• Upon successful completion of this lesson, you will be able to:
• know why and when to use cluster sampling
• know the notation for cluster and systematic sampling
• know what are primary units and what are secondary units
• compute the unbiased estimator for cluster samples when primary units are selected by srs
• compute the ratio estimator for cluster samples when primary units are selected by srs
• compute the Hansen-Hurwitz estimator for cluster samples when primary units are selected by pps

Subsection 1

Introduction

Cluster versus systematic sampling

• On the surface, systematic and cluster sampling are very different.
• In fact, the two designs share the same structure: the population is partitioned into primary units, each primary unit being composed of secondary units.
• Whenever a primary unit is included in the sample, the y-values of every secondary unit within it are observed.

• Example: a one-in-three systematic sample, where we randomly pick one of the first three units and then take every third unit from there on.
• Randomly pick a value from {1, 2, 3}.
• For example, if 2 is chosen, then we will pick {2, 5, 8, 11, 14}, the x's.
• The set {2, 5, 8, 11, 14} is an example of a primary unit.

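The one-in-three example can be sketched in Python. The population size of 15 is an assumption inferred from the listed set {2, 5, 8, 11, 14}, and the function name is hypothetical.

```python
import random

def systematic_primary_units(population_size, k):
    """Partition units 1..population_size into the k primary units of a 1-in-k design."""
    return [list(range(start, population_size + 1, k)) for start in range(1, k + 1)]

units = systematic_primary_units(15, 3)  # population of 15 units is an assumption
start = random.randint(1, 3)             # random start from {1, 2, 3}
sample = units[start - 1]                # the single primary unit that is observed

print(units[1])  # the primary unit obtained when the random start is 2
```

Drawing one random start is the same as drawing one primary unit by simple random sampling from the three available, which is why the design behaves like a cluster sample of size 1.
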
• It is not uncommon to have a systematic sample of size 1, such as the above 1-in-3 systematic sample. We just sample 1 primary unit.
• In the following two graphs, we provide examples for two configurations of primary units:

The above figure has 50 primary units (PSU) (the colored rectangle is an example of a primary unit).

• Primary units (PSU) may be different from observation units.
• One can view systematic sampling as a sampling of primary units.
• Once the primary units are selected, a cluster of secondary units is also selected.

Advantages of systematic sampling

• Easier to perform in the field, especially if a good frame is not available.
• Frequently provides more information per unit cost than simple random sampling, in the sense of smaller variances.
• For example, a systematic sample was drawn from a batch of produced computer chips.
• The first 400 chips are fine but, due to a fault of the machine, the last 300 chips are defective.
• Systematic sampling will select uniformly over the defective and non-defective items and would give a very accurate estimate of the fraction of defective items.

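The computer-chip example can be illustrated with a small simulation. The sampling interval k = 10 is a hypothetical choice made for this sketch; the slides do not state one.

```python
# Batch of 700 chips: positions 1..400 are fine (0), 401..700 are defective (1).
batch = [0] * 400 + [1] * 300
true_fraction = sum(batch) / len(batch)

k = 10  # sampling interval; a hypothetical choice for illustration
estimates = []
for start in range(k):        # every possible random start
    sample = batch[start::k]  # 1-in-k systematic sample of 70 chips
    estimates.append(sum(sample) / len(sample))

print(true_fraction, min(estimates), max(estimates))
```

With this interval every possible start yields the same estimate as the true fraction, because each systematic sample spreads evenly over the good and the defective runs. A simple random sample of 70 chips would vary from draw to draw.
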
Cluster sampling

• A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements.
• Notation for cluster and systematic sampling:
• N: the number of primary units in the population
• n: the number of primary units in the sample
• Mi: the number of secondary units in the i-th primary unit
• $M = \sum_{i=1}^{N} M_i$: the total number of secondary units in the population
• yij: the value of the variable of interest of the j-th secondary unit in the i-th primary unit
• $\tau_i = \sum_{j=1}^{M_i} y_{ij}$: the total of y-values in the i-th primary unit

For the figure below, N = 50, n = 10, Mi = 8.

• Thus, the population total is:

$$
\tau = \sum_{i=1}^{N} \sum_{j=1}^{M_i} y_{ij} = \sum_{i=1}^{N} \tau_i .
$$

• The population mean per primary unit is:

$$
\mu_\tau = \frac{\tau}{N} .
$$

• The population mean per secondary unit is:

$$
\mu = \frac{\tau}{M} .
$$

Coffee break!

Subsection 2

Estimators for cluster sampling when primary units are selected by simple random sampling

• When the primary units are selected by simple random sampling, frequently used estimators among many possible estimators are:
• Unbiased estimator
• Ratio estimator

Unbiased estimator

$$
\hat{\tau} = N \cdot \hat{\mu}_\tau = N \cdot \frac{\sum_{i=1}^{n} \tau_i}{n},
$$

recall that τi is the total of y-values in the i-th primary unit.

$$
\widehat{\operatorname{Var}}(\hat{\tau}) = N (N - n)\,\frac{s_u^2}{n}, \qquad \text{where } s_u^2 = \frac{1}{n-1} \sum_{i=1}^{n} (\tau_i - \hat{\mu}_\tau)^2 .
$$

• To estimate the mean per primary unit, τ/N, one will use:

$$
\hat{\mu}_\tau = \frac{\hat{\tau}}{N}, \qquad \operatorname{Var}(\hat{\mu}_\tau) = \frac{1}{N^2} \operatorname{Var}(\hat{\tau}) .
$$

• To estimate the mean per secondary unit:

$$
\hat{\mu} = \frac{\hat{\tau}}{M}, \qquad \operatorname{Var}(\hat{\mu}) = \frac{1}{M^2} \operatorname{Var}(\hat{\tau}) .
$$

Ratio estimator

If the primary unit total is highly correlated with the primary unit size Mi, a ratio estimator based on size may be efficient.

$$
\hat{\tau}_r = r \cdot M, \qquad M = \sum_{i=1}^{N} M_i,
$$

$$
\text{where } r = \frac{\sum_{i=1}^{n} \tau_i}{\sum_{i=1}^{n} M_i}, \qquad \widehat{\operatorname{Var}}(\hat{\tau}_r) = \frac{N (N - n)}{n (n - 1)} \sum_{i=1}^{n} (\tau_i - r M_i)^2 .
$$

The basic principle

• Since every secondary unit is observed within a selected primary unit, the within primary unit variance does not enter into the variances of the estimators.
• For example,

$$
\widehat{\operatorname{Var}}(\hat{\tau}) = N (N - n) \cdot \frac{s_u^2}{n}, \qquad \text{where } s_u^2 = \frac{1}{n-1} \sum_{i=1}^{n} (\tau_i - \hat{\mu}_\tau)^2 .
$$

• Thus, to obtain estimators of low variance:
1. Clusters should be formed so that one cluster is similar to another cluster. (Note: this is very different from saying that units in the cluster are similar.)
2. Each cluster should contain the full diversity of the population and thus be 'representative'.
• With natural populations of spatially distributed plants, animals, or minerals, and human populations, the above conditions are typically satisfied by systematic sampling, where each cluster contains units that are far apart.
• Cluster sampling is more often than not carried out for reasons of convenience or practicality rather than to obtain the lowest variances.

• Why or when do we use cluster sampling?
• Will it give us a more precise estimator?
• The answer is no for most cases.
• We do use cluster sampling out of necessity even though it will give us a larger variance.

If the objective of sampling is to obtain a specified amount of information about a population parameter at minimum cost, cluster sampling sometimes gives more information per unit cost than simple random sampling, stratified sampling and systematic sampling, because the cost of sampling units within a cluster may be much lower.

Cluster sampling is an effective design in two different scenarios:

1. A good frame listing the population elements either is not available or is very costly to obtain, whereas a frame listing clusters is easily obtained.
2. The cost of obtaining observations increases as the distance separating the elements increases.

Example using a ratio estimator

• A sociologist wants to estimate the average yearly vacation budget for each household in a certain city.
• It is given that there are 3,100 households in the city.
• The sociologist marked off the city into 400 blocks and treated them as 400 clusters.
• He then randomly sampled 24 clusters, interviewing every household living in each sampled cluster.
• The data are given in the table.

Cluster    Number of households (Mi)    Total budget per cluster (τi)
1          7                            12,000
2          9                            15,000
3          5                            8,000
4          8                            13,000
5          12                           18,000
6          5                            7,000
7          4                            6,000
8          8                            13,000
9          14                           22,000
10         6                            9,800
11         3                            7,000
12         13                           18,000
13         8                            12,340
14         4                            5,000
15         6                            8,900
16         9                            14,000
17         3                            4,000
18         10                           11,400
19         4                            5,000
20         7                            13,000
21         6                            8,900
22         5                            8,700
23         7                            10,000
24         6                            9,200
Total      169                          259,240

Here is a plot of this data so that we can see if the cluster size is proportional to the total for the cluster.

[Figure: scatter plot of total of cluster (vertical axis) against cluster size (horizontal axis).]

• The ratio estimator for cluster sample (ratio-to-size):
• If primary unit total τi is highly correlated with cluster size Mi, a ratio estimator based on size may be efficient.
• The ratio estimator of the population total is:

$$
\hat{\tau}_r = r \cdot M, \qquad \text{where } r = \frac{\sum_{i=1}^{n} \tau_i}{\sum_{i=1}^{n} M_i} .
$$

• The ratio estimator is biased but the bias is small when the sample size is large.
• Here is the variance:

$$
\widehat{\operatorname{Var}}(\hat{\tau}_r) = \frac{N (N - n)}{n (n - 1)} \sum_{i=1}^{n} (\tau_i - r M_i)^2 .
$$

• To estimate the population mean per secondary unit we have:

$$
\mu = \frac{\tau}{M} .
$$

• The ratio estimator for the mean is:

$$
\hat{\mu}_r = \frac{\hat{\tau}_r}{M} = r, \qquad \widehat{\operatorname{Var}}(\hat{\mu}_r) = \frac{N (N - n)}{n (n - 1)} \cdot \frac{1}{M^2} \sum_{i=1}^{n} (\tau_i - r M_i)^2 .
$$

• Back to the example.

• To estimate the average yearly vacation budget for each household we will use:

$$
\hat{\mu}_r = r = \frac{\sum_{i=1}^{n} \tau_i}{\sum_{i=1}^{n} M_i} .
$$

• In this example we see that N = 400, the total number of blocks, and n = 24.
• M in this case is as follows:

$$
M = \sum_{i=1}^{N} M_i = 3{,}100 .
$$

The ratio estimator for the average yearly vacation budget for each household in that city is:

$$
\hat{\mu}_r = \frac{\sum_{i=1}^{n} \tau_i}{\sum_{i=1}^{n} M_i} = \frac{259{,}240}{169} = 1{,}533.96 .
$$

$$
\widehat{\operatorname{Var}}(\hat{\mu}_r) = \frac{N (N - n)}{n (n - 1)} \cdot \frac{1}{M^2} \sum_{i=1}^{n} (\tau_i - r M_i)^2 .
$$

• For this example, M = 3100, N = 400, n = 24:

$$
\frac{1}{n-1} \sum_{i=1}^{n} (\tau_i - r M_i)^2 = [\text{st.dev. of } (\tau - r M)]^2 = (1{,}325)^2
$$

• The estimated variance for the ratio estimator:

$$
\widehat{\operatorname{Var}}(\hat{\mu}_r) = \frac{400 (400 - 24)}{24\,(3100)^2} \cdot (1325)^2 = 1{,}144.84
$$

• If we used the unbiased estimator would our variance be larger or smaller?
• For this example, we also want to compute the unbiased estimator for comparison purposes.

• The unbiased estimator for the average yearly vacation budget for each household in that city is:

$$
\hat{\mu} = N \cdot \frac{\sum_{i=1}^{n} \tau_i}{n} \cdot \frac{1}{M} = 400 \cdot \frac{259{,}240}{24} \cdot \frac{1}{3{,}100} = 400 \cdot 10{,}802 \cdot \frac{1}{3{,}100} = 1{,}393.81
$$

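• The estimated variance for the unbiased estimator:

$$
\begin{aligned}
\widehat{\operatorname{Var}}(\hat{\mu}) &= \frac{N (N - n)}{M^2 \cdot n} \cdot \frac{1}{n-1} \sum_{i=1}^{n} (\tau_i - \hat{\mu}_\tau)^2 \\
&= \frac{400 (400 - 24)}{(3{,}100)^2 \cdot 24}\,(\text{st.dev. of } \tau)^2 \\
&= \frac{400 (400 - 24)}{(3{,}100)^2 \cdot 24}\,(4{,}495)^2 \\
&= 13{,}175.67
\end{aligned}
$$
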
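Both estimators for the vacation-budget example can be recomputed from the 24 sampled blocks. The sketch below uses only the standard library and illustrative variable names. Since the sample residuals τi − r·Mi sum to zero, their sample standard deviation squared equals the sum of squares divided by n − 1 used in the ratio variance. Results differ slightly from the slides, which round the standard deviations to 1,325 and 4,495 and the mean cluster total to 10,802.

```python
from statistics import stdev

# (M_i, tau_i): households and total vacation budget for the 24 sampled blocks.
clusters = [
    (7, 12000), (9, 15000), (5, 8000), (8, 13000), (12, 18000), (5, 7000),
    (4, 6000), (8, 13000), (14, 22000), (6, 9800), (3, 7000), (13, 18000),
    (8, 12340), (4, 5000), (6, 8900), (9, 14000), (3, 4000), (10, 11400),
    (4, 5000), (7, 13000), (6, 8900), (5, 8700), (7, 10000), (6, 9200),
]
N, M, n = 400, 3100, len(clusters)  # blocks in city, households in city, sampled blocks

sizes = [m for m, _ in clusters]
totals = [t for _, t in clusters]
scale = N * (N - n) / (n * M ** 2)  # common factor in both variance formulas

# Ratio-to-size estimator of the mean per household and its estimated variance.
r = sum(totals) / sum(sizes)
residuals = [t - r * m for m, t in clusters]
var_ratio = scale * stdev(residuals) ** 2

# Unbiased estimator of the mean per household and its estimated variance.
mu_unbiased = N * (sum(totals) / n) / M
var_unbiased = scale * stdev(totals) ** 2

print(round(r, 2), round(var_ratio, 1), round(mu_unbiased, 2), round(var_unbiased, 1))
```

The estimated variance of the unbiased estimator comes out more than ten times that of the ratio estimator, which matches the comparison drawn in the slides.
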
Remark 1

• This variance is huge and we should be very unhappy using the unbiased estimator.
• We can thus see that when the cluster total is proportional to cluster size, it is better to use the ratio estimator than the unbiased estimator.

Remark 2

• Can we use the simple random sampling formula to compute the variances?
• Unfortunately, no!
• We would have to have collected this data via simple random sampling in order to calculate the variance by the formula corresponding to simple random sampling.
• Note: it is a big mistake if you do not compute the variance according to its sampling scheme!

Subsection 3

Estimators for cluster sampling when primary units are selected by p.p.s.

Estimators

• The primary units are selected with probabilities proportional to size:

$$
p_i = \frac{M_i}{M} .
$$

• The Hansen-Hurwitz (pps) estimator is:

$$
\hat{\tau}_p = \frac{M}{n} \sum_{i=1}^{n} \frac{\tau_i}{M_i} .
$$

• Denote μi = τi/Mi. Then:

$$
\widehat{\operatorname{Var}}(\hat{\tau}_p) = \frac{M^2}{n (n - 1)} \sum_{i=1}^{n} (\mu_i - \hat{\mu}_p)^2, \qquad \hat{\mu}_p = \frac{\hat{\tau}_p}{M} \text{ is unbiased for } \mu .
$$

• Thus we also see that:

$$
\widehat{\operatorname{Var}}(\hat{\mu}_p) = \frac{1}{n (n - 1)} \sum_{i=1}^{n} (\mu_i - \hat{\mu}_p)^2
$$

Example

• Recall the "Total number of computer help requests" example.
• The director of the computer support department plans to sample 3 divisions of a large firm that has 10 divisions, with varying numbers of employees per division.
• Since the number of computer support requests within each division should be highly correlated with the number of employees in that division, the director decides to use unequal probability sampling with replacement, with pi proportional to the number of employees in that division.

Division    # employees
1           1000
2           650
3           2100
4           860
5           2840
6           1910
7           390
8           3200
9           1500
10          1200
Total       15650

• A sample of 3 of the 10 clusters is drawn (n = 3) with replacement. Clusters 2, 5 and 8 are selected.
• The data are:

τ1 = 420, τ2 = 1785, τ3 = 2198,

and

M1 = 650, M2 = 2840, M3 = 3200.

• Find the Hansen-Hurwitz estimator for the population mean.
• ANSWER:

$$
\hat{\mu}_p = \frac{1}{n} \sum_{i=1}^{n} \frac{\tau_i}{M_i} = \frac{1}{3} \times \left(\frac{420}{650} + \frac{1785}{2840} + \frac{2198}{3200}\right) = 0.6538
$$

• Find its variance.
• ANSWER:

$$
\begin{aligned}
\widehat{\operatorname{Var}}(\hat{\mu}_p) &= \frac{1}{n (n - 1)} \sum_{i=1}^{n} (\mu_i - \hat{\mu}_p)^2 \\
&= \frac{1}{3 \times 2}\left[(0.6462 - 0.6538)^2 + (0.6285 - 0.6538)^2 + (0.6869 - 0.6538)^2\right] \\
&= 0.000299
\end{aligned}
$$

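The Hansen-Hurwitz calculation for the three sampled divisions can be sketched as follows. Variable names are illustrative; the implied estimate of the total, M times the estimated mean, is not computed in the slides and is added here only to show how the two estimators relate.

```python
# Sampled divisions 2, 5 and 8: (employees M_i, help requests tau_i).
sampled = [(650, 420), (2840, 1785), (3200, 2198)]
n = len(sampled)
M = 15650  # total employees across all 10 divisions

# Requests per employee in each sampled division.
mu_i = [tau / m for m, tau in sampled]

# Hansen-Hurwitz estimate of the mean per employee and its estimated variance.
mu_p = sum(mu_i) / n
var_mu_p = sum((x - mu_p) ** 2 for x in mu_i) / (n * (n - 1))

# Implied estimate of the total number of requests (not shown in the slides).
tau_p = M * mu_p

print(round(mu_p, 4), round(var_mu_p, 6), round(tau_p, 1))
```
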
Subsection 4

Systematic sample

• In the previous section, we introduced systematic sampling and stated why it may be a challenge to estimate the variance when only one primary unit is taken.
• Repeated systematic sampling is then introduced so that the variance can be estimated.
• We then provide an example of repeated systematic sampling.

• In this section, the variance for cluster and systematic sampling is decomposed in terms of between-cluster and within-cluster variances.
• We then provide an estimate for the relative efficiency of simple random sampling versus simple random cluster sampling.
• An example is provided to compare the variances for these two sampling methods.
• One should note that it is not uncommon to see examples in which cluster sampling is much less efficient than simple random sampling, as illustrated in this example.

Systematic sample

• Suppose you have a number of students lined up in a row:


1 2 3 4 5 6 7 8 9 10 11 12

408/468
Eurostat
Systematic sample

• Suppose you have a number of students lined up in a row:


1 2 3 4 5 6 7 8 9 10 11 12
• Here we might take a sample every 4 elements, or 1 in 4
elements from the population: (1, 5, 9) or (2, 6, 10), etc.

408/468
Eurostat
Systematic sample

• Suppose you have a number of students lined up in a row:


1 2 3 4 5 6 7 8 9 10 11 12
• Here we might take a sample every 4 elements, or 1 in 4
elements from the population: (1, 5, 9) or (2, 6, 10), etc.
• There are four primary units: (1, 5, 9), (2, 6, 10), (3, 7,
11), (4, 8, 12).
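
The four primary units listed above can be enumerated mechanically. The following is a minimal sketch, assuming the students are simply labelled 1 to 12; the function name `systematic_primary_units` is illustrative and does not come from the course material.

```python
def systematic_primary_units(population, k):
    """Return the k possible 1-in-k systematic samples (primary units)."""
    # Each start position 0..k-1 yields one primary unit: every k-th element.
    return [population[start::k] for start in range(k)]


students = list(range(1, 13))  # the 12 students lined up in a row
units = systematic_primary_units(students, 4)
for unit in units:
    print(unit)
```

Choosing one of these lists at random is the same as choosing a random start between 1 and 4 and then taking every 4th student.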

• To sample systematically from a field, the following is one
example:

• There are four primary units: (1, 3, 9, 11), (2, 4, 10, 12),
(5, 7, 13, 15), (6, 8, 14, 16).
• How do we draw a 1 in k systematic sample?
Example

• Suppose our population is 9,000 students and we want to
sample 1,200 students.
• How do we sample these students systematically?
• Since 9000/1200 = 7.5, we can perform a 1-in-7
systematic sample.
• That is, we sample every 7th student.

• We can pick a starting point randomly from 1 to 600 and
sample every 7th student from there on until we have
reached 1,200 students.
• How do we estimate the variance of this single systematic
sample?
• We cannot use the formula
\[ s_u^2 = \frac{1}{n-1} \sum_{i=1}^{n} (\tau_i - \hat{\mu}_\tau)^2 \]
since n = 1.
• Only one primary unit is selected.
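
The drawing procedure for the 9,000-student example can be sketched as follows. This is only an illustration: the function name, the seed, and the representation of students by their positions 1 to 9,000 are assumptions, not part of the course material.

```python
import random


def draw_systematic_sample(n, k, max_start, rng):
    """Pick a random start in 1..max_start, then take every k-th unit until n are drawn."""
    start = rng.randint(1, max_start)  # randint includes both endpoints
    return [start + k * j for j in range(n)]


N, n, k = 9000, 1200, 7
# The slides use starts from 1 to 600; any start up to N - k*(n - 1) = 607
# would still leave room for 1,200 students.
rng = random.Random(2016)  # arbitrary seed, only for reproducibility
sample = draw_systematic_sample(n, k, max_start=600, rng=rng)

print(sample[:5], "...", sample[-1])
```

Whatever the start, the whole sample is a single primary unit, which is why the between-unit variance formula cannot be applied.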
• If the population is randomly ordered, then there is no
problem.
• We can estimate the variance $\sigma^2$ by:
\[ s^2 = \frac{\sum_{j=1}^{M_1} (y_{1j} - \bar{y}_1)^2}{M_1 - 1} \]

• However, when the population is ordered, systematic
sampling is usually better than simple random sampling,
and the above formula will overestimate the variance.
• When the population is periodic, systematic sampling
may be worse than simple random sampling, and the
above formula will underestimate the variance: if the
sampling interval k is chosen poorly (for example, equal to
the period), the sampled elements may be too similar to
each other.
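
Both effects can be checked exactly on small populations. The sketch below uses two hypothetical populations of 12 units (one ordered, one periodic with period 4), neither of which comes from the course data. For each it compares the true variance of the 1-in-4 systematic sample mean, the variance of the mean under simple random sampling of the same size, and the average value of the simple-random-sampling formula applied to each systematic sample.

```python
from statistics import mean, variance


def systematic_vs_srs(y, k):
    """Compare 1-in-k systematic sampling with SRS of the same size on population y."""
    N = len(y)
    n = N // k
    samples = [y[start::k] for start in range(k)]  # the k possible systematic samples
    mu = mean(y)
    # True variance of the systematic-sample mean: the k starts are equally likely.
    true_var_sys = sum((mean(s) - mu) ** 2 for s in samples) / k
    # Variance of the sample mean under SRS without replacement (variance() uses N - 1).
    var_srs = (1 - n / N) * variance(y) / n
    # Average of the SRS variance formula when applied to each systematic sample.
    avg_formula = mean((1 - n / N) * variance(s) / n for s in samples)
    return true_var_sys, var_srs, avg_formula


ordered = list(range(1, 13))            # hypothetical ordered population
periodic = [t % 4 for t in range(12)]   # hypothetical periodic population, period 4

print("ordered :", systematic_vs_srs(ordered, 4))
print("periodic:", systematic_vs_srs(periodic, 4))
```

In the ordered case the formula gives a larger value than the true systematic variance, and in the periodic case every systematic sample contains identical values, so the formula returns zero although the true variance is not zero.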

Questions?

Lunch break!

Subsection 5

Variance and cost in cluster and systematic sampling versus srs

• For simplicity, suppose that each of N primary units has
an equal number M of secondary units.
• To simplify the variance computations and to explore the
relationship between cluster and simple random sampling,
we note the identity:
\[ \underbrace{\sum_{i=1}^{N}\sum_{j=1}^{M} (y_{ij} - \mu)^2}_{SST} = \underbrace{\sum_{i=1}^{N}\sum_{j=1}^{M} (y_{ij} - \mu_i)^2}_{SSW} + \underbrace{M \sum_{i=1}^{N} (\mu_i - \mu)^2}_{SSB} \]
where $\mu_i = \sum_{j=1}^{M} y_{ij} / M$.
SST = SSW + SSB

• SST: the total sum of squares


• SSW: within-cluster sum of squares (within-primary units)
• SSB: between-cluster sum of squares (between-primary
units)

• The within-primary-unit variance is:
\[ \sigma_w^2 = \frac{\sum_{i=1}^{N}\sum_{j=1}^{M} (y_{ij} - \mu_i)^2}{N(M-1)} \]
• The between-primary-unit variance is:
\[ \sigma_b^2 = \frac{\sum_{i=1}^{N} (\mu_i - \mu)^2}{N-1} \]

• The identity can be rewritten as:
\[ (NM - 1)\sigma^2 = N(M-1)\sigma_w^2 + (N-1)M\sigma_b^2 . \]
• Thus, an unbiased estimator of $\sigma^2$ from a simple random
cluster sample is:
\[ \hat{\sigma}^2 = \frac{N(M-1)s_w^2 + (N-1)M s_b^2}{NM - 1} . \]

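• Since the data were obtained by cluster sampling, we
cannot use $s^2$ to estimate $\sigma^2$, but we can use $\hat{\sigma}^2$.
• The relative efficiency of simple random sampling versus
simple random cluster sampling is:
\[ \frac{\mathrm{Var}(\bar{y}_{srs})}{\mathrm{Var}(\hat{\mu})} = \frac{M\sigma^2}{\sigma_u^2} . \]
• Recall:
\[ \mathrm{Var}(\bar{y}_{srs}) = \frac{N-n}{N} \cdot \frac{\sigma^2}{nM} \quad\text{and}\quad \mathrm{Var}(\hat{\mu}) = \frac{N-n}{N} \cdot \frac{\sigma_u^2}{nM^2} , \]
where $\sigma_u^2$ is the finite population variance of the $\tau_i$.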
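• It can be estimated by:
\[ \frac{\widehat{\mathrm{Var}}(\bar{y}_{srs})}{\widehat{\mathrm{Var}}(\hat{\mu})} = \frac{M\hat{\sigma}^2}{s_u^2} . \]
• Note: writing $\bar{y}_i = \tau_i / M$ for the mean of the i-th sampled
cluster, so that $\hat{\mu}_\tau = M\hat{\mu}$,
\[ s_u^2 = \frac{1}{n-1}\sum_{i=1}^{n} (\tau_i - \hat{\mu}_\tau)^2 = \frac{1}{n-1}\sum_{i=1}^{n} (M\bar{y}_i - M\hat{\mu})^2 = M^2 \, \frac{\sum_{i=1}^{n} (\bar{y}_i - \hat{\mu})^2}{n-1} = M^2 s_b^2 . \]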
Example

• The marketing research department of a communication
company wishes to estimate the average number of cell
phones purchased per household in a given community.
• The 4,000 households in the community are
listed in 400 geographical clusters of 10 households each,
and a simple random sample of 4 clusters is selected to
reduce the travel cost of interviewing the households.
• The data are given in the following table.

Cluster   Number of cell phones (one value per household)   Total
1         3  5  6  4  5  6  3  2  4  5                      43
2         2  0  2  1  1  0  1  1  0  1                       9
3         3  2  3  2  4  2  2  1  2  2                      23
4         5  2  3  2  1  1  2  2  4  1                      23

• Let’s find the relative efficiency of simple random
sampling versus cluster sampling for the data in this
example.
• In this example, N = 400, n = 4, and M = 10.
• We need to find $s_b^2$ and $s_w^2$.
• Note the identity for the population:
\[ (NM - 1)\sigma^2 = N(M-1)\sigma_w^2 + (N-1)M\sigma_b^2 . \]

• The identity for the sample is:
\[ (nM - 1)s^2 = n(M-1)s_w^2 + (n-1)M s_b^2 , \]
that is, SS total = SS within + SS between, with

Source        Sum of squares
SS between    58.7
SS within     43.2

• From the table above, we can find $s_b^2$ by:
\[ SS_{between} = (4-1) \times 10 \times s_b^2 = 58.70, \qquad s_b^2 = \frac{58.70}{30} = 1.957 . \]

• We can find $s_w^2$ by:
\[ SS_{within} = 4(10-1)s_w^2 = 43.2, \qquad s_w^2 = 1.20 . \]

• Compute $\hat{\sigma}^2$.
• ANSWER:
\[ \hat{\sigma}^2 = \frac{N(M-1)s_w^2 + (N-1)M s_b^2}{NM-1} = \frac{(400 \times 9 \times 1.2) + [(400-1) \times 10 \times 1.957]}{400 \times 10 - 1} = 3.03 \]

• And now we can determine the relative efficiency of
simple random sampling versus cluster sampling by
plugging the values into the formula:
\[ s_u^2 = M^2 s_b^2 = 100 \times 1.957 = 195.7 . \]
• Compute the relative efficiency of simple random
sampling versus cluster sampling.
• ANSWER:
\[ \frac{\widehat{\mathrm{Var}}(\bar{y}_{srs})}{\widehat{\mathrm{Var}}(\hat{\mu})} = \frac{M\hat{\sigma}^2}{s_u^2} = \frac{10 \times 3.03}{195.7} = 0.155 . \]
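
The whole cell-phone calculation can be reproduced from the raw table. The following is a minimal sketch using only the Python standard library; the variable names are illustrative, and small differences in later decimals relative to the slides come from the slides rounding $s_b^2$ to 1.957 before continuing.

```python
from statistics import mean, variance

# Cell phones per household in the 4 sampled clusters (from the table above).
clusters = [
    [3, 5, 6, 4, 5, 6, 3, 2, 4, 5],
    [2, 0, 2, 1, 1, 0, 1, 1, 0, 1],
    [3, 2, 3, 2, 4, 2, 2, 1, 2, 2],
    [5, 2, 3, 2, 1, 1, 2, 2, 4, 1],
]
N, n, M = 400, len(clusters), 10

cluster_means = [mean(c) for c in clusters]
grand_mean = mean(y for c in clusters for y in c)

# Sums of squares between and within the sampled clusters.
ss_between = M * sum((m - grand_mean) ** 2 for m in cluster_means)
ss_within = sum((y - m) ** 2 for c, m in zip(clusters, cluster_means) for y in c)

s_b2 = ss_between / ((n - 1) * M)   # from SS between = (n - 1) M s_b^2
s_w2 = ss_within / (n * (M - 1))    # from SS within  = n (M - 1) s_w^2

# Estimator of sigma^2 and sample variance of the cluster totals.
sigma2_hat = (N * (M - 1) * s_w2 + (N - 1) * M * s_b2) / (N * M - 1)
s_u2 = variance([sum(c) for c in clusters])

rel_eff = M * sigma2_hat / s_u2
print(round(ss_between, 1), round(ss_within, 1), round(sigma2_hat, 2), round(rel_eff, 3))
```

The sample variance of the cluster totals computed directly agrees with $M^2 s_b^2$, which is the identity noted earlier.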

• What is this telling us?
• ANSWER:
• The variance under simple random sampling is just 15.5% of
that under cluster sampling when the same sample size is
used. In this example, simple random sampling is more
efficient if only variance is considered.
• Remark: It is a serious mistake to analyze a cluster sample
as if it were a simple random sample. The reported
standard error is then often much smaller than it should
be, so the results look far more precise than they are.
Multi-stage designs
Unit learning outcomes

• Upon successful completion of this unit, you will be able to:
• know why and when to use multi-stage sampling
• compute the unbiased estimator and its estimated variance
for the two-stage design when srs is used at each stage
• compute the ratio estimator and its estimated variance for
the two-stage design when srs is used at each stage
• compute the Hansen-Hurwitz estimator and its estimated
variance when primary units are selected with probability
proportional to size and secondary units are selected with srs

Subsection 1

Multi-stage sampling: two stages with srs at each stage

• We have learned about cluster sampling, where one selects
primary units and then observes all of the secondary units
within each selected primary unit.
• With multi-stage sampling, we select only some of the
secondary units within each selected primary unit.
• For example, in two-stage sampling:
• 1st stage samples n primary units
• 2nd stage, for the i-th primary unit, selects $m_i$ (not all)
secondary units

• Multistage designs are used in many practical cases.
• These are just a few:
• Large surveys involving the sampling of housing units:
Statistics Portugal selects geographical areas within each
NUTS II region and then selects housing units (dwellings)
within each selected geographical area.
• Practical quality control problems often involve two (or
more) stages of sampling. For example, Volkswagen
wants to inspect the quality of a supplier of air filters.
They first sample some cartons and then inspect some
air filters inside these selected cartons.
• Polls sample election districts at the first stage and
select households at the second stage.
• Notation:
• N: number of primary units in the population
• $M_i$: number of secondary units in the i-th primary unit
• $y_i = \sum_{j=1}^{M_i} y_{ij}$: total of the i-th primary unit
• Population total: $\tau = \sum_{i=1}^{N} \sum_{j=1}^{M_i} y_{ij}$
• $\mu = \tau / M$, where $M = \sum_{i=1}^{N} M_i$
• n: number of primary units selected in the first stage
• $m_i$: number of secondary units selected from the i-th
primary unit in the second stage

• Two-stage sampling includes both one-stage cluster
sampling and stratified random sampling as special cases.
When does two-stage sampling reduce to cluster
sampling? When does two-stage sampling reduce to
stratified random sampling?
• ANSWER:
1. If $m_i = M_i$ (all secondary units are selected), it reduces
to cluster sampling.
2. If n = N (all primary units are selected), it reduces to
stratified random sampling.

Multistage design

• This is something that arises in practice quite often.
• As a result, we need to be able to figure out how this
type of sampling design is implemented.
• Most of the time, this involves two stages of sampling
with simple random sampling at each stage.
• Let’s take a look at two graphs as a means of
understanding how this type of sampling design plays out.
N = 50 for both graphs.

Two-stage sample of 10 primary units and four secondary units
per primary unit.

Here is another graph for another example of two-stage
sample. Two-stage sample of 20 primary units and two
secondary units per primary unit.

Simple random sampling at each stage

• We will discuss two possible estimators for this sampling
design: unbiased estimator and ratio estimator.
A. Unbiased Estimator
A.1 Since simple random sampling is used in the second
stage, an unbiased estimator of the total y-value for the
i-th primary unit is:
\[ \hat{\tau}_i = M_i \, \frac{\sum_{j=1}^{m_i} y_{ij}}{m_i} = M_i \bar{y}_i , \quad\text{where } \bar{y}_i = \frac{\sum_{j=1}^{m_i} y_{ij}}{m_i} \]
A.2 The first part of this formula is also known as the
expansion estimator.
A.3 Also, since simple random sampling is used in the first
stage, an unbiased estimator for the population total is:
\[ \hat{\tau} = N \cdot \frac{\sum_{i=1}^{n} \hat{\tau}_i}{n} = N \cdot \frac{\sum_{i=1}^{n} M_i \bar{y}_i}{n} \]

• Now we have the expansion estimators from each stage.
The next thing we need is the variance.
• The estimated variance of $\hat{\tau}$ is:
\[ \widehat{\mathrm{Var}}(\hat{\tau}) = N(N-n)\frac{s_u^2}{n} + \frac{N}{n}\sum_{i=1}^{n} M_i (M_i - m_i) \frac{s_i^2}{m_i} \]
where
• $s_u^2$ is the sample variance among the primary unit totals
\[ s_u^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left( \hat{\tau}_i - \frac{\sum_{k=1}^{n} \hat{\tau}_k}{n} \right)^2 . \]
• $s_i^2$ is the sample variance within the i-th primary unit
\[ s_i^2 = \frac{1}{m_i - 1}\sum_{j=1}^{m_i} (y_{ij} - \bar{y}_i)^2 . \]

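• To estimate the population mean $\mu = \tau / M$, the estimator
and its estimated variance are:
\[ \hat{\mu} = \frac{N}{M} \cdot \frac{\sum_{i=1}^{n} \hat{\tau}_i}{n} , \quad\text{and}\quad \widehat{\mathrm{Var}}(\hat{\mu}) = \frac{1}{M^2}\, \widehat{\mathrm{Var}}(\hat{\tau}) . \]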
• Let’s take a look at an example where we can compute
both the estimates and their variances.

Example: employee satisfaction

• A restaurant chain wants to estimate the average
employee satisfaction with their job (the scale is from 1
to 7).
• The chain has 120 restaurants, and the total number of
employees in the chain is 6,860.
• They use simple random sampling to sample 10
restaurants.
• They then use simple random sampling to sample and
interview about 20% of the employees in those
restaurants.
The data are given as follows.

Restaurant  M_i  m_i  Employee satisfaction scores                  ȳ_i   s_i   τ̂_i = M_i · ȳ_i
1           54   10   5, 7, 6, 5, 4, 7, 6, 6, 4, 5                  5.50  1.08  297
2           48   10   7, 7, 7, 6, 5, 4, 7, 7, 6, 6                  6.20  1.03  297.6
3           68   14   5, 6, 5, 6, 4, 5, 6, 5, 4, 5, 4, 6, 5, 6      5.14  0.77  349.52
4           70   14   6, 5, 7, 6, 7, 6, 5, 7, 5, 7, 6, 5, 7, 6      6.07  0.83  424.9
5           52   10   4, 5, 4, 5, 5, 6, 5, 4, 4, 4                  4.60  0.70  239.2
6           62   12   5, 7, 6, 7, 4, 3, 1, 5, 4, 6, 4, 5            4.75  1.71  294.5
7           41   8    7, 6, 7, 7, 6, 6, 5, 7                        6.38  0.74  261.58
8           53   11   6, 6, 5, 4, 6, 7, 5, 5, 7, 6, 5               5.64  0.92  298.92
9           64   12   7, 6, 5, 4, 6, 5, 7, 4, 3, 6, 5, 7            5.42  1.31  346.88
10          43   9    7, 6, 6, 5, 7, 3, 5, 4, 5                     5.33  1.32  229.19

St. Dev. of the τ̂_i: 58.1

• Find the unbiased estimator for the mean employee
satisfaction score.
• ANSWER:
• The unbiased estimator of the population total is:
\[ \hat{\tau} = N \cdot \frac{\sum_{i=1}^{n} M_i \bar{y}_i}{n} = 120 \cdot \frac{(54 \times 5.50) + (48 \times 6.20) + \ldots + (43 \times 5.33)}{10} = 36{,}471.5 \]
• This might be thought of as the total satisfaction score.
• If we divide this by the total number of employees, we
get the average score.
• Since M is given to be 6,860,
\[ \hat{\mu} = \frac{36{,}471.5}{6{,}860} = 5.32 . \]

• The estimated variance of the unbiased estimator is then:
\[ \widehat{\mathrm{Var}}(\hat{\tau}) = N(N-n)\frac{s_u^2}{n} + \frac{N}{n}\sum_{i=1}^{n} M_i (M_i - m_i)\frac{s_i^2}{m_i} \]
where
• $s_u^2$ is the sample variance of $\hat{\tau}_1, \hat{\tau}_2, \ldots, \hat{\tau}_{10}$. From the
previous table, $s_u^2 = (58.1)^2 = 3375.61$.
• $s_i^2$ is the sample variance within the i-th primary unit:
\[ s_i^2 = \frac{1}{m_i - 1}\sum_{j=1}^{m_i} (y_{ij} - \bar{y}_i)^2 . \]
• $s_i$ has been computed and given in the table.


• Find the estimated variance of the unbiased estimator for
the mean employee satisfaction score.
• ANSWER:
\[ \widehat{\mathrm{Var}}(\hat{\tau}) = 120 \times (120-10) \times \frac{3375.61}{10} + \frac{120}{10} \times \left[ 54(54-10)\frac{1.08^2}{10} + \ldots + 43(43-9)\frac{1.32^2}{9} \right] = 4{,}455{,}805.2 + 32{,}451.6 = 4{,}488{,}256.8 \]
\[ \widehat{\mathrm{Var}}(\hat{\mu}) = \frac{4{,}488{,}256.8}{6860^2} = 0.095 . \]
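
The unbiased two-stage estimate and its estimated variance can be recomputed from the summary columns of the restaurant table. This is a sketch under stated assumptions: it starts from the rounded $\bar{y}_i$ and $s_i$ values printed in the table rather than the raw scores, and it computes $s_u^2$ from the $\hat{\tau}_i$ directly instead of squaring the rounded 58.1, so later digits differ slightly from the slides. Variable names are illustrative.

```python
from statistics import variance

N, M = 120, 6860  # restaurants and employees in the whole chain

# (M_i, m_i, ybar_i, s_i) for the 10 sampled restaurants, as printed in the table.
restaurants = [
    (54, 10, 5.50, 1.08), (48, 10, 6.20, 1.03), (68, 14, 5.14, 0.77),
    (70, 14, 6.07, 0.83), (52, 10, 4.60, 0.70), (62, 12, 4.75, 1.71),
    (41, 8, 6.38, 0.74), (53, 11, 5.64, 0.92), (64, 12, 5.42, 1.31),
    (43, 9, 5.33, 1.32),
]
n = len(restaurants)

# Stage 2: expansion estimate of each sampled restaurant's total.
tau_i = [Mi * ybar for Mi, _, ybar, _ in restaurants]

# Stage 1: expansion estimate of the population total, then the mean.
tau_hat = N * sum(tau_i) / n
mu_hat = tau_hat / M

# Estimated variance: between-restaurant part plus within-restaurant part.
s_u2 = variance(tau_i)
between = N * (N - n) * s_u2 / n
within = (N / n) * sum(Mi * (Mi - mi) * si ** 2 / mi for Mi, mi, _, si in restaurants)
var_tau = between + within
var_mu = var_tau / M ** 2

print(round(tau_hat, 1), round(mu_hat, 2), round(var_mu, 3))
```

The between-restaurant term dominates the total, which reflects that only 10 of the 120 restaurants were sampled.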
Ratio estimator

• Remark: If M is unknown, we cannot use the unbiased
estimator $\hat{\mu}$.
• If the cluster total is proportional to the cluster size, then
the ratio estimate is appropriate.

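• For the population total, the ratio estimator and its
estimated variance are:
\[ \hat{\tau}_r = \frac{\sum_{i=1}^{n} \hat{y}_i}{\sum_{i=1}^{n} M_i} \cdot M = \hat{r} M . \]
\[ \widehat{\mathrm{Var}}(\hat{\tau}_r) = \frac{N(N-n)}{n} \cdot \frac{1}{n-1} \sum_{i=1}^{n} (\hat{y}_i - M_i \hat{r})^2 + \frac{N}{n} \sum_{i=1}^{n} M_i (M_i - m_i) \frac{s_i^2}{m_i} . \]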

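• A similar question can be asked of the population mean.
• Therefore, for the population mean, the ratio estimator
and its estimated variance are:
\[ \hat{\mu}_r = \frac{\sum_{i=1}^{n} \hat{y}_i}{\sum_{i=1}^{n} M_i} = \frac{\sum_{i=1}^{n} M_i \bar{y}_i}{\sum_{i=1}^{n} M_i} = \hat{r} . \]
\[ \widehat{\mathrm{Var}}(\hat{\mu}_r) = \frac{1}{M^2}\, \widehat{\mathrm{Var}}(\hat{\tau}_r) . \]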

Illustration

• For the example using the restaurant employees’
satisfaction data above, find the ratio estimator for the
population mean.
• ANSWER:
\[ \hat{\mu}_r = \frac{\sum_{i=1}^{n} M_i \bar{y}_i}{\sum_{i=1}^{n} M_i} = \frac{54 \times 5.50 + \ldots + 43 \times 5.33}{54 + 48 + \ldots + 43} = \frac{3039.3}{555} = 5.48 \]

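• Now find the estimated variance of the ratio estimator.
• ANSWER:
\[ \widehat{\mathrm{Var}}(\hat{\mu}_r) = \frac{1}{M^2} \left[ \frac{N(N-n)}{n} \cdot \frac{1}{n-1} \sum_{i=1}^{n} (\hat{y}_i - M_i \hat{r})^2 + \frac{N}{n} \sum_{i=1}^{n} M_i (M_i - m_i) \frac{s_i^2}{m_i} \right] \]
\[ = \frac{1}{6860^2} \left[ \frac{120(120-10)}{10} \cdot \frac{1}{9} \left( (54 \times 5.50 - 54 \times 5.48)^2 + \ldots + (43 \times 5.33 - 43 \times 5.48)^2 \right) + 32451.6 \right] = 0.029 \]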

• Remark: If M is unknown, one can use $\hat{\mu}_r$ and estimate
M by:
\[ \frac{\sum_{i=1}^{n} M_i}{n} \times N . \]
• Recall: $M = \sum_{i=1}^{N} M_i$.
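
The ratio-estimator illustration and the remark on estimating M can be sketched together for the restaurant data. As before, the inputs are the rounded table summaries and the variable names are illustrative; the code uses the unrounded ratio rather than 5.48, so intermediate values differ slightly from the slides.

```python
N, M = 120, 6860  # restaurants and employees in the whole chain

# (M_i, m_i, ybar_i, s_i) for the 10 sampled restaurants, as printed in the table.
restaurants = [
    (54, 10, 5.50, 1.08), (48, 10, 6.20, 1.03), (68, 14, 5.14, 0.77),
    (70, 14, 6.07, 0.83), (52, 10, 4.60, 0.70), (62, 12, 4.75, 1.71),
    (41, 8, 6.38, 0.74), (53, 11, 5.64, 0.92), (64, 12, 5.42, 1.31),
    (43, 9, 5.33, 1.32),
]
n = len(restaurants)

sizes = [Mi for Mi, _, _, _ in restaurants]
y_hat = [Mi * ybar for Mi, _, ybar, _ in restaurants]  # estimated restaurant totals

# Ratio estimator of the mean: estimated totals over sampled sizes.
r_hat = sum(y_hat) / sum(sizes)

# Estimated variance: residuals around the ratio line plus the within-restaurant part.
resid_ss = sum((yi - Mi * r_hat) ** 2 for yi, Mi in zip(y_hat, sizes))
between = N * (N - n) / n * resid_ss / (n - 1)
within = (N / n) * sum(Mi * (Mi - mi) * si ** 2 / mi for Mi, mi, _, si in restaurants)
var_mu_r = (between + within) / M ** 2

# If M were unknown, it could be estimated from the sampled restaurant sizes.
M_hat = N * sum(sizes) / n

print(round(r_hat, 2), round(var_mu_r, 3), M_hat)
```

With these data the ratio estimator has a much smaller estimated variance than the unbiased estimator, because restaurant totals are roughly proportional to restaurant sizes.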

Coffee break!

Subsection 2

Primary units selected by pps and secondary units selected with srs

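Multi-stage design with primary units selected with p.p.s. and secondary units selected with srs.

• Using the Hansen-Hurwitz estimator, we get the following:
\[ \hat{\tau}_p = \frac{M}{n} \sum_{i=1}^{n} \frac{\hat{\tau}_i}{M_i} = M \, \frac{\sum_{i=1}^{n} \bar{y}_i}{n} , \quad\text{where } \bar{y}_i = \frac{\hat{y}_i}{M_i} , \]
\[ \widehat{\mathrm{Var}}(\hat{\tau}_p) = \frac{M^2}{n(n-1)} \sum_{i=1}^{n} (\bar{y}_i - \hat{\mu}_p)^2 . \]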
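• To estimate the population mean:
\[ \hat{\mu}_p = \frac{\sum_{i=1}^{n} \bar{y}_i}{n} , \]
since it is $\hat{\tau}_p / M$, which becomes $\sum_{i=1}^{n} \bar{y}_i / n$. Its estimated variance is:
\[ \widehat{\mathrm{Var}}(\hat{\mu}_p) = \frac{1}{n(n-1)} \sum_{i=1}^{n} (\bar{y}_i - \hat{\mu}_p)^2 . \]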


• There are 36 departments in a small liberal arts college.


• One wants to estimate the average amount of money the
students spent on textbooks last semester.
• Since the size of each department varies very much, a
two-stage cluster sampling using probability proportional
to size for the primary unit is carried out.
• The results are listed in the following table.

Department  M_i  m_i  Textbook expenses in $ for last semester                       ȳ_i
1           10   4    326, 400, 423, 443                                             398.0
2           20   8    278, 312, 450, 350, 227, 438, 512, 403                         371.3
3           30   12   512, 256, 332, 402, 512, 309, 411, 610, 422, 630, 550, 470     451.3
4           15   6    426, 312, 512, 440, 342, 533                                   427.5

• Estimate the population mean using the probability
proportional to size estimator (Hansen-Hurwitz).
• ANSWER:
\[ \hat{\mu}_p = \frac{\sum_{i=1}^{n} \bar{y}_i}{n} = \frac{398 + 371.3 + 451.3 + 427.5}{4} = 412.025 . \]

• Estimate the variance of that estimator.
• ANSWER:
\[ \widehat{\mathrm{Var}}(\hat{\mu}_p) = \frac{1}{n(n-1)} \sum_{i=1}^{n} (\bar{y}_i - \hat{\mu}_p)^2 = \frac{1}{4 \times 3} \left[ (398 - 412.025)^2 + (371.3 - 412.025)^2 + (451.3 - 412.025)^2 + (427.5 - 412.025)^2 \right] = 303.10 \]
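
The Hansen-Hurwitz calculation for the textbook example needs only the four department means. The sketch below takes the rounded means used in the worked answer as inputs (an assumption made so the arithmetic matches the slides; starting from the raw expenses would change later digits), and the variable names are illustrative.

```python
# Department sample means (in $), rounded to one decimal as in the worked answer.
ybar = [398.0, 371.3, 451.3, 427.5]
n = len(ybar)

# With pps selection of primary units and srs inside them, the Hansen-Hurwitz
# estimator of the population mean is the plain average of the sampled means.
mu_p = sum(ybar) / n

# Its estimated variance uses the spread of the sampled means around mu_p.
var_mu_p = sum((y - mu_p) ** 2 for y in ybar) / (n * (n - 1))
se_mu_p = var_mu_p ** 0.5

print(round(mu_p, 3), round(var_mu_p, 2), round(se_mu_p, 2))
```

Note that the department sizes $M_i$ do not appear in the calculation: under probability-proportional-to-size selection they are already accounted for by the selection probabilities.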

Topics covered in other courses

• Using auxiliary information


• Small area estimation
• Non-response error
• Accuracy assessment/variance estimation

Feedback and evaluation
