This document provides an overview of an upcoming training on survey methodology and sampling techniques. The training will be conducted over three days from March 14-16, 2016 and will be led by Jorge M. Mendes. It will cover topics such as simple random sampling, confidence intervals, sample size calculations, stratified sampling, cluster sampling, and multistage designs. The goal is for participants to understand how to perform basic sampling methods and make inferences about populations from sample data. Recommended textbooks are also listed.


Introduction to Survey Methodology and Sampling Techniques
14-16 March 2016
Jorge M. Mendes <[email protected]>

CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE EUROPEAN COMMISSION

Overview
Trainer and schedule

• Trainer: Jorge M. Mendes ([email protected])

• Training schedule:
• Morning: 9:00-12:30 (15-minute break at 11:00);
• Afternoon: 14:00-17:00 (15-minute break at 15:30)

Eurostat
Textbooks

• Scheaffer, R. L., Mendenhall, W. III, Ott, L. and Gerow, K. G. (2011). Elementary Survey Sampling, 7th ed., Brooks/Cole Cengage Learning.
• Cochran, W. G. (1977). Sampling Techniques, 3rd ed., John Wiley & Sons.
• Kish, L. (1965). Survey Sampling, New York, Wiley.
• Särndal, C.-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. New York, Springer-Verlag.

Training learning outcomes

• This course covers sampling design and analysis methods useful for research and management in many fields.
• A well-designed sampling procedure ensures that we can summarize and analyze the data with a minimum of assumptions or complications.

• In this course, we’ll cover the basic methods of sampling
and estimation and then explore selected topics and
recent developments including:
• simple random sampling with associated estimation and
confidence interval methods,
• computing sample sizes,
• estimating proportions,
• unequal probability sampling,
• ratio and regression estimation,
• stratified sampling,
• cluster and systematic sampling,
• multistage designs.

• One important point to consider as we move forward is
that the estimation procedure will depend on the sample
design.
• Being able to identify what to use under different
sampling designs is one of the things that you will learn in
this course.

Day 1

1. Introduction
1.1 An overview of sampling
1.2 Estimating population mean and total under simple random sampling
1.3 Confidence intervals and the central limit theorem
1.4 Domain estimation
2. Confidence intervals and sample size
2.1 Selecting sample size for estimating population mean and total
2.2 Confidence intervals for population proportion
2.3 Sample size needed for estimating proportions
3. Unequal probability sampling
3.1 Unequal probability sampling

Day 2

3. Unequal probability sampling (continued)
3.2 The Hansen-Hurwitz estimator
3.3 The Horvitz-Thompson estimator
3.4 Small population illustration
4. Auxiliary data and ratio estimation
4.1 Auxiliary data, ratio estimator and its computation
4.2 Sample size and small population example for ratio estimation
5. Auxiliary data and regression estimation
5.1 Linear regression estimator
5.2 Comparison of estimators
6. Stratified sampling
6.1 How to use stratified sampling
6.2 The stratification principle

Day 3

6. Stratified sampling (continued)
6.3 Post-stratification
6.4 Further topics on stratification
7. Cluster sampling and systematic sampling
7.1 Introduction
7.2 Estimators for cluster sampling when primary units are selected by simple random sampling
7.3 Estimators for cluster sampling when primary units are selected by PPS
7.4 Systematic sampling
7.5 Variance and cost in cluster and systematic sampling versus SRS
8. Multistage designs
8.1 Multi-stage sampling: two stages with SRS at each stage
8.2 Primary units selected by PPS and secondary units selected with SRS
9. Topics covered in other courses

Introduction
Unit learning outcomes

• Upon successful completion of this lesson, you will be able to:
• explain why estimation procedures depend on the sample design
• distinguish between quota sampling and probability sampling
• name the desirable properties of estimates
• distinguish between sampling error and non-sampling errors
• perform simple random sampling
• provide a point estimate of the population mean and estimate the variance of that estimate
• provide a point estimate of the population total and estimate the variance of that estimate

Subsection 1

An overview of sampling

Why do we take samples?

• You want to understand certain things and have some objective in mind.
• In each case there is a target population.
• The goal for many research projects is to know more about your objective, i.e., your population. This is what you are interested in.
• For instance, if you were a conservation officer you might be interested in the number of polar bears in the Arctic.

• In this case, you have a certain goal in mind.
• What steps can we take to understand the population
better?
• What we can do is to take a sample!
• And the major objective in statistics that now arises is
inference.
• One important objective of statistics is to make inferences
about a population from the information contained in a
sample.

• We should always keep in mind that we perform sampling
because we want to make this inference.
• Because of this inference we begin to talk about things
like confidence intervals and hypothesis testing.

Sampling

Population and sample

• We can draw a sample from the population.

• How do we do this?
• What type of scheme do we use to draw a sample?

Examples of sampling

• Sampling is useful in many different fields; however, different sampling problems can arise in each of the following areas.
• Economic: we might want to estimate the average household income in a country.

• Geologic: we might want to estimate the total pyrite
content of the rocks at a specific construction site.
• Marketing research: we might want to estimate the total market size for electric cars.
• Engineering: we might want to estimate the failure rate
of a certain electronic component.

• To deal with all of these problems one thing we have to
decide is:
How are we going to select a sample?
• There are many ways to take a sample.

Sampling design

• Sampling design is the procedure by which the sample is selected.
• There are two very broad categories of sampling designs.
• Probabilistic sampling
• Non-probabilistic sampling

Target population and sampling frame

• Target population: a finite set of elements whose characteristics we want to study.
• Sampling frame: a list, map or any other registry in which the population units (to be sampled) are recorded.
• Ideally, the list should be exhaustive and without duplications.
• It is the list of units in the study population.

Probabilistic sampling

• All designs we will discuss in detail fall into this type.

• When we use probability sampling, randomness is built into the sampling design so that the properties of the estimators can be assessed probabilistically. Examples include simple random sampling, stratified sampling, cluster sampling, systematic sampling and network sampling.

Non-probabilistic sampling

• This is what people used to do before 1948.
• Sampling here is based upon quotas.
• For instance, each interviewer is given quotas that mirror the composition of the population, but the selection of individual respondents is left to the subjective judgment of the interviewer.

• How can you ensure that the sample that you have
selected is indeed representative?
• If you are subjective when it comes to the individuals
sampled, then this is an example of quota sampling.

Sample illustration

• Suppose you were going to select and interview people that visit ESTAT premises.
• If you select people by walking around and picking whom to interview from among those you meet or who happen to walk by, the selection involves human subjectivity.
• Interviewers in probability sampling are given specific sampling procedures to follow, or names and addresses already selected by a randomization scheme, without human subjectivity.
• For example, you could sample every third person who walks through the door of the building, regardless of who they are.
• The main difference between these two approaches is that probability sampling removes the human subjectivity.
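
The every-third-person rule can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `every_kth` and the list of visitor labels are hypothetical, and the rule always starts at the third arrival rather than at a randomly chosen starting point.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def every_kth(arrivals: Sequence[T], k: int = 3) -> List[T]:
    """Return every k-th arrival (the 3rd, 6th, 9th, ... when k = 3).

    The interviewer has no say in who is picked: the position in the
    arrival order alone decides the selection.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    return list(arrivals[k - 1::k])


# Hypothetical arrival log of ten visitors, in order of entry.
visitors = [f"visitor_{i}" for i in range(1, 11)]
selected = every_kth(visitors, k=3)
print(selected)
```

The point of the sketch is that the selection depends only on the arrival order, never on the interviewer's impression of the person.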

Illustration

Let’s compare quota and probability sample results for the 1948 US presidential poll in Washington State.

               Quota sample   Probability sample   Actual result
Dewey (rep)    52.0%          46.0%                42.7%
Truman (dem)   45.3%          50.5%                52.6%

Using quota sampling Dewey had 52% of the votes and Truman had 45.3% of the votes.

• The Gallup poll pioneered probability sampling.
• Their results gave 46% of the votes to Dewey and 50.5% of the votes to Truman.
• In this case the quota sampling approach was off by quite a bit.
• From this time on probability sampling became the norm.

Final remarks

• When you choose your respondents, use objective criteria.
• The major reason for poor results from quota sampling is the subjectivity involved in the selection of subjects.
• As soon as we introduce this type of bias, we introduce problems with our data, some of which we cannot get rid of even by acquiring additional samples.

Basic idea of sampling and estimation

• One interesting and important fact to note is that in most useful sampling schemes, variability from sample to sample can be estimated using the single sample selected.
• Using the sample we collect, we can construct estimates for the parameter of the population that we are interested in.
• Usually, there are many ways to construct estimates.

Properties of estimators

• Some desirable properties for estimators are:
• Unbiased or nearly unbiased.
• Have a low MSE (mean square error), or a low variance when the estimator is unbiased.¹
• Robust, so that your answer does not fluctuate too much with respect to extreme values.

¹ MSE measures how far the estimate is from the parameter of interest, whereas variance measures how far the estimate is from the mean of that estimate. Thus, when an estimator is unbiased, its MSE is the same as its variance.

Sampling and non-sampling error

• Sampling error: error that arises because only a fraction of the population (the sample) is observed instead of the whole population.
• Non-sampling error: non-response, variables measured with error, etc.

Coffee break!

Subsection 2

Simple random sampling and its estimators

Simple Random Sampling

In simple random sampling (without replacement) every possible sample of n units has the same probability of selection.

• It is often referred to as equal probability sampling because all the units in the population have the same probability of selection, $n/N$, where n is the sample size and N is the population size.

Example 1

• A hospital has 1,125 patient records.

• How can one randomly select 120 records to review?
• ANSWER:
• Assign a number from 1 to 1,125 to each record and
select randomly 120 numbers from 1 to 1,125 without
replacement.
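
The selection described above can be sketched in Python with the standard library. This is a minimal sketch, assuming the records are simply numbered 1 to 1,125; the function name `draw_srs` and the seed value are illustrative assumptions, not part of the course material.

```python
import random

N_RECORDS = 1125   # number of patient records in the frame
SAMPLE_SIZE = 120  # number of records to review


def draw_srs(population_size, sample_size, seed=None):
    """Draw a simple random sample of record numbers without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    # random.sample never returns the same element twice, which is
    # exactly the "without replacement" requirement.
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


selected_records = draw_srs(N_RECORDS, SAMPLE_SIZE, seed=2016)
print(selected_records[:10])  # first few selected record numbers
```

Recording the seed alongside the sample is a simple way to make the selection auditable.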

Example 2

• How to estimate the total number of beetles in an agricultural field?
• ANSWER:
• To estimate the total number of beetles in an agricultural field, subdivide the field into 100 equally sized units.
• Take a simple random sample of eight units and count the number of beetles in these eight units.

Unit # beetles

9 234
66 256
81 128
11 245
92 211
54 240
6 202
23 267

Population size: N = 100; sample size n = 8.

Notation

• Let $y_i$ denote the number of beetles in the i-th unit.
• N units in the population.
• Variable of interest: $y_1, \ldots, y_N$
• The population mean: $\mu = \dfrac{y_1 + y_2 + \cdots + y_N}{N}$
• The population total: $\tau = y_1 + y_2 + \cdots + y_N = N \times \mu$
• Sample mean: $\bar{y} = \hat{\mu} = \dfrac{y_1 + y_2 + \cdots + y_n}{n}$

Definition: finite population variance

• $\sigma^2 = \sum_{i=1}^{N} \dfrac{(y_i - \mu)^2}{N - 1}$
• $\sigma^2$ can be estimated by the sample variance $s^2$:
$$s^2 = \sum_{i=1}^{n} \frac{(y_i - \bar{y})^2}{n - 1} = \frac{(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}{n - 1}$$
• Sample standard deviation: $s = \sqrt{s^2}$

The beetle example

• For the beetle example, the observed counts in the eight sampled units are 234, 256, 128, 245, 211, 240, 202, 267.
• sample mean:
$$\bar{y} = \hat{\mu} = \frac{y_1 + y_2 + \cdots + y_8}{8} = \frac{234 + \cdots + 267}{8} = 222.875$$
• sample variance:
$$s^2 = \sum_{i=1}^{n} \frac{(y_i - \bar{y})^2}{n - 1} = \sum_{i=1}^{8} \frac{(y_i - \bar{y})^2}{7} = 1932.657$$

Estimate for the population total is:

$$\hat{\tau} = N \times \bar{y} = 100 \times 222.875 = 22{,}287.5$$
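
The point estimates for the beetle example can be reproduced with Python's `statistics` module. This is a sketch under the assumption that the eight counts are exactly those listed in the table; the variable names are illustrative. Recomputing the sample variance from these counts gives roughly 1932.70 rather than the 1932.657 shown on the slide; the difference is small and does not change the standard error at three decimals.

```python
from statistics import mean, variance

# Beetle counts observed in the eight sampled units (from the table).
beetle_counts = [234, 256, 128, 245, 211, 240, 202, 267]

N = 100                 # number of units in the field (population size)
n = len(beetle_counts)  # sample size

y_bar = mean(beetle_counts)   # sample mean, estimates the population mean
s2 = variance(beetle_counts)  # sample variance, uses the n - 1 denominator
tau_hat = N * y_bar           # estimated population total

print(f"n = {n}, sample mean = {y_bar}, total estimate = {tau_hat}")
print(f"sample variance = {s2:.3f}")
```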

Properties of ȳ (SRS)

Unbiased

$$E(\bar{y}) = E\left(\frac{y_1 + y_2 + \cdots + y_n}{n}\right) = \frac{E(y_1) + E(y_2) + \cdots + E(y_n)}{n} = \frac{\mu + \mu + \cdots + \mu}{n} = \frac{n\mu}{n} = \mu$$

• Under simple random sampling, the variance of ȳ is:
$$\mathrm{Var}(\bar{y}) = \frac{N - n}{N} \cdot \frac{\sigma^2}{n}$$
• Note that $\frac{N - n}{N} = 1 - \frac{n}{N}$ is called the finite population correction fraction.
• Remark 1: when the sampling is done with replacement, the fraction disappears.
• Remark 3: $\frac{n}{N}$ is sometimes referred to as the sampling rate.
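
Both properties of ȳ (unbiasedness and the variance formula with the finite population correction) can be checked exactly on a tiny population by listing every possible sample. The sketch below is illustrative only: the five population values are made up, and exact fractions are used so that the comparison does not depend on floating-point rounding.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical population of N = 5 values (not from the course material).
population = [1, 3, 5, 9, 12]
N = len(population)
n = 2  # sample size

mu = Fraction(sum(population), N)
# Finite population variance with the N - 1 denominator, as defined above.
sigma2 = sum((Fraction(y) - mu) ** 2 for y in population) / (N - 1)

# Under SRS without replacement every subset of size n is equally likely,
# so averaging over all subsets gives exact expectations.
sample_means = [Fraction(sum(s), n) for s in combinations(population, n)]
expected_mean = sum(sample_means) / len(sample_means)
variance_of_mean = sum((m - mu) ** 2 for m in sample_means) / len(sample_means)

formula = Fraction(N - n, N) * sigma2 / n

print(expected_mean == mu)          # unbiasedness: E(y_bar) equals mu
print(variance_of_mean == formula)  # Var(y_bar) = (N - n)/N * sigma^2 / n
```

Enumerating all samples is only feasible for very small N, but it makes the role of the correction fraction concrete.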

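• If one wants to estimate Var(ȳ) from a single sample, one needs to replace σ² by its estimate s² in the formula.
• The estimate for Var(ȳ) is denoted $\widehat{\mathrm{Var}}(\bar{y})$ and
$$\widehat{\mathrm{Var}}(\bar{y}) = \frac{N - n}{N} \cdot \frac{s^2}{n}.$$
• For the beetle example
$$\widehat{\mathrm{Var}}(\bar{y}) = \frac{N - n}{N} \cdot \frac{s^2}{n} = \frac{100 - 8}{100} \cdot \frac{1932.657}{8} = 222.256$$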
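
As a sketch, the estimated variance of the sample mean can be wrapped in a small helper. The function name `estimated_variance_of_mean` is an assumption made for this example; the inputs are the slide values for the beetle data.

```python
def estimated_variance_of_mean(s2, n, N):
    """Estimated Var(y_bar) under SRS without replacement.

    s2: sample variance, n: sample size, N: population size.
    """
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    fpc = (N - n) / N  # finite population correction, 1 - n/N
    return fpc * s2 / n


# Beetle example with the sample variance reported on the slide.
var_hat_mean = estimated_variance_of_mean(s2=1932.657, n=8, N=100)
print(round(var_hat_mean, 3))
```

When n equals N (a census) the correction is zero and so is the estimated variance, which matches the intuition that a full enumeration has no sampling error.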

Properties of τ̂ (SRS)

It is unbiased

$$E(\hat{\tau}) = E(N \times \bar{y}) = N \times \mu = \tau$$

Its variance, Var(τ̂), is:

$$\mathrm{Var}(\hat{\tau}) = \mathrm{Var}(N \times \bar{y}) = N^2 \cdot \mathrm{Var}(\bar{y}) = N^2 \cdot \frac{N - n}{N} \cdot \frac{\sigma^2}{n} = N \cdot (N - n) \cdot \frac{\sigma^2}{n}$$

• The estimate for Var(τ̂) is thus:
$$\widehat{\mathrm{Var}}(\hat{\tau}) = N(N - n) \cdot \frac{s^2}{n}.$$
• For the beetle example
$$\widehat{\mathrm{Var}}(\hat{\tau}) = 100 \cdot (100 - 8) \cdot \frac{1932.657}{8} = N^2 \cdot \widehat{\mathrm{Var}}(\bar{y}) \approx 2{,}222{,}560$$
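
The same calculation for the total can be sketched as follows; the helper name `estimated_variance_of_total` is hypothetical. Evaluating the product directly gives about 2,222,555.6, while the slide value 2,222,560 corresponds to N² times the rounded figure 222.256.

```python
def estimated_variance_of_total(s2, n, N):
    """Estimated Var(tau_hat) = N (N - n) s^2 / n under SRS without replacement."""
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    return N * (N - n) * s2 / n


# Beetle example with the sample variance reported on the slide.
var_hat_total = estimated_variance_of_total(s2=1932.657, n=8, N=100)

# Equivalent route: N^2 times the estimated variance of the mean.
var_hat_mean = (100 - 8) / 100 * 1932.657 / 8
print(round(var_hat_total, 1), round(100**2 * var_hat_mean, 1))
```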

Subsection 3

Confidence intervals and the central limit theorem

Confidence intervals

• The idea behind confidence intervals is that it is not enough to use the sample mean alone to estimate the population mean.
• The sample mean by itself is a single point.
• This does not give people any idea of how good your estimate of the population mean is.
• If we want to assess the accuracy of this estimate we will use confidence intervals, which provide us with information on how good our estimate is.
• A confidence interval, defined before the sample is
selected, is the interval which has a pre-specified
probability of containing the parameter.
• To obtain this confidence interval you need to know the
sampling distribution of the estimate.

• So the type of statement that we want to make will look like this:
$$P(|\hat{\theta} - \theta| < d) = 1 - \alpha$$
• Thus, we need to know the distribution of $\hat{\theta}$.
• In certain cases the distribution of $\hat{\theta}$ can be stated easily.
• However, there are many different types of distributions.

• The normal distribution is easy to use as an example
because it does not bring with it too much complexity.
• When we talk about the Central Limit Theorem for the
sample mean, what are we talking about?
• The finite population Central Limit Theorem for the
sample mean:
What happens when n (sample size), gets large?

• ȳ, the sample mean, has mean µ and a standard deviation of $\sigma/\sqrt{n}$:
$$\bar{y} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right).$$
• Since we do not know σ, we will use s to estimate σ.
• We can thus estimate the standard deviation of ȳ to be $s/\sqrt{n}$.
• Thus, approximately,
$$\bar{y} \sim N\left(\mu, \frac{s}{\sqrt{n}}\right).$$
• The value n in the denominator helps us because as n is
getting larger the standard deviation of ȳ is getting
smaller.
• The distribution of ȳ is very complicated when the sample
size is small.
• When the sample size is larger there is more regularity
and it is easier to see the distribution.

Confidence interval for µ

• If we go about picking samples we can determine a ȳ and from here we can construct an interval around the mean.
• Thus, a 100(1 − α)% confidence interval for µ can be derived as follows:
$$\frac{\bar{y} - \mu}{\sqrt{\mathrm{Var}(\bar{y})}} \sim N(0, 1) \quad \text{and, approximately,} \quad \frac{\bar{y} - \mu}{\sqrt{\widehat{\mathrm{Var}}(\bar{y})}} \sim N(0, 1)$$

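• Now, we can compute the confidence interval as:
$$P\left(\left|\frac{\bar{y} - \mu}{\sqrt{\widehat{\mathrm{Var}}(\bar{y})}}\right| < d\right) = 1 - \alpha$$
$$P\left(\left|\frac{\bar{y} - \mu}{\sqrt{\widehat{\mathrm{Var}}(\bar{y})}}\right| < z_{1-\alpha/2}\right) = 1 - \alpha$$
$$P\left(\bar{y} - z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\bar{y})} < \mu < \bar{y} + z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\bar{y})}\right) = 1 - \alpha$$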

Confidence interval for µ

• Thus,
$$\bar{y} \pm z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\bar{y})}$$
$$\bar{y} \pm z_{1-\alpha/2}\sqrt{\frac{N - n}{N} \cdot \frac{s^2}{n}}$$
• What you now have above is the confidence interval for µ.
• The confidence interval for τ is given below.

• A 100(1 − α)% confidence interval for τ is given by:
$$\hat{\tau} \pm z_{1-\alpha/2}\sqrt{N(N - n)\frac{s^2}{n}}$$
• Be careful now, when can we use these?
• In what situation are these confidence intervals applicable?
• These approximate intervals are good when n is large (because of the Central Limit Theorem), or when the observations $y_1, y_2, \ldots, y_n$ are normally distributed.

Confidence intervals and sample size

• When the sample size is 30 or more, we consider the sample size to be large and, by the Central Limit Theorem, ȳ will be approximately normal even if the sample does not come from a normal distribution.
• Thus, when the sample size is 30 or more, there is no need to check whether the sample comes from a normal distribution.

• When the sample size is 8 to 29, we would usually use a normal probability plot to see whether the data come from a normal distribution.²
• However, when the sample size is 7 or less, a normal probability plot may fail to reject normality simply because the sample is too small.
• Remark: In the examples of this training we typically use small sample sizes for illustration purposes only.

² If it does not violate the normal assumption then we can go ahead and use the interval.
• For the beetle example, an approximate 95% CI for µ has the form:
$$\bar{y} \pm z_{1-\alpha/2}\sqrt{\frac{N - n}{N} \cdot \frac{s^2}{n}}$$
• Note that for a 95% interval α = 0.05, so α/2 = 0.025, and the z-value $z_{1-\alpha/2} = z_{0.975}$ can be found in the following table:

Confidence   α      1 − α/2   z_{1−α/2}
90%          0.10   0.95      1.64
95%          0.05   0.975     1.96
99%          0.01   0.995     2.58

74/468
Eurostat
• For the beetle example in the text, the approximate 95% CI for µ is
computed as follows:
• sample mean: ȳ = 222.875
• sample variance: $s^2$ = 1932.657
\[
\begin{aligned}
\bar{y} \pm z_{1-\alpha/2} \sqrt{\frac{N-n}{N} \cdot \frac{s^2}{n}}
  &= 222.875 \pm 1.96 \sqrt{222.256} \\
  &= 222.875 \pm 1.96 \times 14.908 \\
  &= 222.875 \pm 29.220
\end{aligned}
\]
• And an approximate 95% CI for τ is then:
\[
\begin{aligned}
\hat{\tau} \pm z_{1-\alpha/2} \sqrt{N(N-n)\,\frac{s^2}{n}}
  &= 22{,}287.5 \pm 1.96 \sqrt{2{,}222{,}560} \\
  &= 22{,}287.5 \pm 2{,}922.018
\end{aligned}
\]
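The two intervals for the beetle example can be reproduced numerically. The following is a minimal Python sketch and not part of the original slides: the function name `srs_ci` is an illustrative assumption, and the inputs are the summary statistics quoted for the example (ȳ = 222.875, s² = 1932.657, n = 8, N = 100).

```python
import math


def srs_ci(ybar, s2, n, N, z=1.96):
    """Half-widths of the approximate CIs for the mean and the total
    under simple random sampling without replacement."""
    # CI for the mean: ybar +/- z * sqrt((N - n)/N * s^2/n)
    half_mean = z * math.sqrt((N - n) / N * s2 / n)
    # CI for the total: N*ybar +/- z * sqrt(N (N - n) s^2/n)
    tau_hat = N * ybar
    half_total = z * math.sqrt(N * (N - n) * s2 / n)
    return half_mean, tau_hat, half_total


half_mean, tau_hat, half_total = srs_ci(222.875, 1932.657, n=8, N=100)
print(f"mean : 222.875 +/- {half_mean:.3f}")
print(f"total: {tau_hat:.1f} +/- {half_total:.3f}")
```

The half-width for the total is simply N times the half-width for the mean, which is why the two intervals carry the same relative precision.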
Subsection 4

Domain estimation

• Quite often, it is impossible to obtain a frame that lists only those
elements of the population that one is interested in.
• For example, you want to sample households with children; however,
the best frame available is a list of all households.
• Check visually the type of problem.
• Therefore, we wish to estimate the parameters of a subpopulation
(domain) of the population represented in the frame.
• Main issue: we do not know the size of the domain (subpopulation).
Notation

• N: the number of elements in the population
• $N_d$: the number of elements in the domain (subpopulation)
• n: the sample size from the population
• $n_d$: the number of sampled elements from the domain (subpopulation)
• $y_{di}$: the i-th sampled observation that falls in the subpopulation
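• An unbiased estimator of $\mu_d$, the subpopulation mean, is:
\[ \bar{y}_d = \frac{1}{n_d} \sum_{i=1}^{n_d} y_{di}. \]
• Its variance is estimated by:
\[
\widehat{\mathrm{Var}}(\bar{y}_d) = \frac{N_d - n_d}{N_d} \cdot \frac{s_d^2}{n_d},
\quad \text{where} \quad
s_d^2 = \frac{\sum_{i=1}^{n_d} (y_{di} - \bar{y}_d)^2}{n_d - 1}.
\]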
• Usually we do not know $N_d$, so we estimate the finite population
correction factor
\[ \frac{N_d - n_d}{N_d} \quad \text{by} \quad \frac{N - n}{N}. \]
Example: variable food cost

• Let’s say we want to estimate the average weekly amount spent on food
by married graduate students in a certain college.
• There are 80 graduate students in the college.
• n = 15 are sampled and $n_m$ = 10 are married.
• A summary of the data follows:

Marital status   Sample size   Mean    Std. deviation
married          10            135.3   44.4
single           5             87.6    21.6
• What is the average food cost for married students in that college?
• ANSWER:
• The average food cost for married students is:
\[ \bar{y}_m = 135.3. \]
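• Provide an estimate of the standard deviation of this estimate.
• ANSWER:
• Since the number of married students in the college is unknown, we
use (N − n)/N as the finite population correction. The estimated
variance and standard deviation of the estimate are:
\[
\widehat{\mathrm{Var}}(\bar{y}_m) = \frac{80 - 15}{80} \cdot \frac{44.4^2}{10} = 160.173,
\qquad
\widehat{\mathrm{SD}}(\bar{y}_m) = \sqrt{160.173} = 12.656.
\]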
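The food cost answer can be checked with a few lines of Python. This is only a sketch under the assumptions stated on the slides (the domain size is unknown, so (N − n)/N stands in for the domain-level correction); the function name `domain_mean_variance` is hypothetical.

```python
import math


def domain_mean_variance(s_d, n_d, n, N):
    """Estimated variance of a domain mean when the domain size N_d is
    unknown: (N - n)/N replaces (N_d - n_d)/N_d."""
    fpc = (N - n) / N
    return fpc * s_d ** 2 / n_d


# Married graduate students: s_d = 44.4, n_d = 10, n = 15, N = 80.
var_hat = domain_mean_variance(s_d=44.4, n_d=10, n=15, N=80)
sd_hat = math.sqrt(var_hat)
print(f"Var = {var_hat:.3f}, SD = {sd_hat:.3f}")
```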
Confidence intervals and sample size

Unit learning outcomes

• Upon successful completion of this lesson, you will be able to:
• find the sample size needed for estimating the population mean and
the population total
• compute the confidence interval for a population proportion
• find the sample size needed for estimating a population proportion by
both the educated guess method and the conservative method
• know when to use the educated guess method and when to use the
conservative method
Subsection 1

Calculating sample size

Sample size for mean and total

• How large should a sample be to estimate the population mean with a
specified accuracy?
• If θ̂ is an unbiased, normally distributed estimator of θ, then
\[ \frac{\hat{\theta} - \theta}{\sqrt{\mathrm{Var}(\hat{\theta})}} \sim N(0, 1). \]
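Then
\[
\begin{aligned}
P\left( \frac{|\hat{\theta} - \theta|}{\sqrt{\mathrm{Var}(\hat{\theta})}} < z_{1-\alpha/2} \right) &= 1 - \alpha \\
P\left( |\hat{\theta} - \theta| < z_{1-\alpha/2} \cdot \sqrt{\mathrm{Var}(\hat{\theta})} \right) &= 1 - \alpha
\end{aligned}
\]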
• If we specify this α, we can then find the sample size large enough
to achieve the goal of the experiment.
• So, we need to ask, "What is the goal of your experiment?"
• This is perhaps the most important question to be asked as a part of
your experiment.
• What if we were interested in estimating the average weight of ESTAT
male collaborators?
• How many observations should we plan on taking to estimate the mean
weight of ESTAT male collaborators?
• What do we need to consider?
• In the first place: how precise do you want this estimate to be?
• You thus need to specify the margin of error.
• We should also take into account:
1. The variability of the measure that you are estimating is your
first concern. This directly affects the sample size.
2. The second thing that you need to think about is the type of
conclusion that you would like to report. That is, you need to specify
the 1 − α value, the confidence level, that you are happy with.
• Now, if we specify the confidence level 1 − α and the margin of error
d (which can also be viewed as the half-width of the (1 − α)100% CI),
we can solve for the sample size such that the CI has the specified
margin of error.
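• For estimating the population mean, the equation becomes:
\[
P\left( |\bar{y} - \mu| < z_{1-\alpha/2} \cdot \sqrt{\frac{N-n}{N} \cdot \frac{\sigma^2}{n}} \right) = 1 - \alpha
\]
\[
z_{1-\alpha/2} \cdot \sqrt{\frac{N-n}{N} \cdot \frac{\sigma^2}{n}} = d
\]
\[
n = \frac{1}{\dfrac{d^2}{z_{1-\alpha/2}^2 \cdot \sigma^2} + \dfrac{1}{N}}
\]
• Can we now use this formula to estimate the sample size?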
• The weak point is the population variance used.
• We do not know the value of $\sigma^2$.
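• Similarly, for estimating the population total τ, here is the
formula:
\[
P\left( |\hat{\tau} - \tau| < z_{1-\alpha/2} \cdot \sqrt{N(N-n)\,\frac{\sigma^2}{n}} \right) = 1 - \alpha
\]
\[
z_{1-\alpha/2} \cdot \sqrt{N(N-n)\,\frac{\sigma^2}{n}} = d
\]
\[
n = \frac{1}{\dfrac{d^2}{N^2 \cdot z_{1-\alpha/2}^2 \cdot \sigma^2} + \dfrac{1}{N}}
\]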
The beetle example

• What sample size is needed to estimate the population total of
beetles, τ, to within d = 1000 with a 95% CI?

Unit   # beetles
9      234
66     256
81     128
11     245
92     211
54     240
6      202
23     267

Sample mean (ȳ): 222.875
Sample variance ($s^2$): 1932.657

Population size: N = 100; sample size: n = 8.
• Now, let’s begin plugging what we know into the formula.
• We know N = 100, α = 0.05 and d = 1000.
• Do we know $\sigma^2$?
• No, but we can estimate $\sigma^2$ by
\[ s^2 = \sum_{i=1}^{n} \frac{(y_i - \bar{y})^2}{n - 1} = 1932.657. \]
• How many units should we sample?
• Let’s calculate this out:
\[
\begin{aligned}
n &= \frac{1}{\dfrac{d^2}{N^2 \cdot z_{1-\alpha/2}^2 \cdot \sigma^2} + \dfrac{1}{N}} \\
  &= \frac{1}{\dfrac{(1000)^2}{(100)^2 \cdot (1.96)^2 \cdot 1932.657} + \dfrac{1}{100}} = 42.610
\end{aligned}
\]
• We always round this up; therefore, we will sample 43 of the 100
plots.
• Remark: If we ignore the finite population correction, then
\[
n = \frac{N^2 \cdot z_{1-\alpha/2}^2 \cdot \sigma^2}{d^2}
  = \frac{(100)^2 \cdot (1.96)^2 \cdot 1932.657}{(1000)^2}
  = 74.245,
\]
which rounds up to 75.
• This value is much larger than 43.
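The effect of the finite population correction on the required sample size is easy to see numerically. Below is a minimal Python sketch, not taken from the slides; `sample_size_total` and its `use_fpc` flag are illustrative names, and $\sigma^2$ is replaced by the pilot estimate $s^2$ = 1932.657.

```python
import math


def sample_size_total(d, sigma2, N, z=1.96, use_fpc=True):
    """Sample size needed to estimate a population total to within d."""
    # Without the correction: n0 = N^2 z^2 sigma^2 / d^2
    n0 = N ** 2 * z ** 2 * sigma2 / d ** 2
    if not use_fpc:
        return n0
    # With the correction: 1/n = 1/n0 + 1/N
    return 1 / (1 / n0 + 1 / N)


n_fpc = sample_size_total(d=1000, sigma2=1932.657, N=100)
n_no_fpc = sample_size_total(d=1000, sigma2=1932.657, N=100, use_fpc=False)
print(math.ceil(n_fpc), math.ceil(n_no_fpc))
```

Writing the corrected size as 1/n = 1/n0 + 1/N makes the relationship explicit: the correction matters whenever n0 is not small relative to N.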
Think about it!

• What is the major point that was just illustrated in the previous
example?
• ANSWER:
• In this example, N = 100 is not very large compared to n, so one
should not ignore the finite population correction!
• In the beetle example, there are data to estimate $\sigma^2$.
• What can one do if there is no pilot data?
• How can we get some rough idea about what $\sigma^2$ is?
Example

• A farm has 1000 young pigs with an initial weight of about 50 lbs.
• They put them on a new diet for 3 weeks and want to know how many
pigs to sample so that they can estimate the average weight gain.
• They want the answer to be within 2 lbs with 90% confidence.
• There is no pilot data here.
• We don’t have the time to set aside some pigs in order to get an
estimate of $\sigma^2$, the variance of the weight gain.
• Question: How do we get a rough estimate of σ?
• What would give this farmer some guidance on how to estimate the
standard deviation of the weight gain?
• One thing we can do is rely on the information that we already have,
i.e., find some historical data that exist on this topic.
• For certain variables we can make reasonable guesses for an estimate
of σ.
• Here is a formula for this rough estimate:
\[ \sigma \approx \frac{\text{Range}}{4} \]
• The range is relatively easy to have some idea about.
• This is an important point.
• Even though perhaps none of us has raised pigs, we can still come up
with a sensible guess.
• So, for this case we guess that the weight gain within this 3-week
period ranges from a minimum of 10 lbs to a maximum of 50 lbs:
\[ \sigma \approx \frac{\text{Range}}{4} = \frac{50 - 10}{4} = 10 \text{ lbs} \]
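• Now we can use the formula for estimating the mean, µ.
• Then, with $z_{1-\alpha/2}$ = 1.645 for 90% confidence,
\[
\begin{aligned}
n &= \frac{1}{\dfrac{d^2}{z_{1-\alpha/2}^2 \cdot \sigma^2} + \dfrac{1}{N}} \\
  &= \frac{1}{\dfrac{2^2}{(1.645)^2 \cdot (10)^2} + \dfrac{1}{1000}} \\
  &= 63.36
\end{aligned}
\]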
• The value 63.36 should be rounded up to 64.
• We will need to sample 64 pigs in order to estimate the average
weight gain in 3 weeks to within 2 lbs with 90% confidence.
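The pig example combines the Range/4 rule with the sample size formula for the mean. The following Python sketch reproduces the arithmetic; the helper names `sigma_from_range` and `sample_size_mean` are assumptions made for illustration, and the 10 to 50 lbs range is the guess made on the slides.

```python
import math


def sigma_from_range(low, high):
    """Rough guess of the standard deviation: Range / 4."""
    return (high - low) / 4


def sample_size_mean(d, sigma, N, z):
    """Sample size needed to estimate a population mean to within d,
    with the finite population correction."""
    return 1 / (d ** 2 / (z ** 2 * sigma ** 2) + 1 / N)


sigma = sigma_from_range(10, 50)                       # guessed range, in lbs
n = sample_size_mean(d=2, sigma=sigma, N=1000, z=1.645)
print(f"sigma = {sigma}, n = {n:.2f}, sample {math.ceil(n)} pigs")
```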
Subsection 2

Confidence intervals for population proportion
Estimating proportions

• Estimating population proportions can be seen as a particular case of
estimating the population mean.
• We want to estimate the proportion of units in the population having
some attribute.
• For example, a question might be, "What would be the proportion of
ESTAT workers who are smokers?"
• Poll surveys: most are based on telephone interviews with
a significant portion based on interviews conducted in
person from home visits.
• Usually the sample size is at least 1000, sometimes even
1500.

• Question: Do you approve of President Junker’s job performance?
• Answer:
\[ y_i = \begin{cases} 0, & \text{no} \\ 1, & \text{yes} \end{cases} \]
for the population units i = 1, 2, ..., N.
• The variable of interest: $y_1, y_2, \ldots, y_N$
• Population proportion: $p = \frac{1}{N} \sum_{i=1}^{N} y_i$, which is
the population mean, µ, of Y.
• If we take a simple random sample of size n, then
\[ \hat{p} = \sum_{i=1}^{n} \frac{y_i}{n} = \bar{y} \]
• With this 0/1 definition, the variance of $y_i$ is tied to its mean.
• To find the finite population variance for $y_1, y_2, \ldots, y_N$,
we use the fact that the population mean is:
\[ \mu = \frac{1}{N} \sum_{i=1}^{N} y_i = p. \]
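By definition the variance is then:
\[
\begin{aligned}
\sigma^2 &= \frac{\sum_{i=1}^{N} (y_i - p)^2}{N - 1} \\
         &= \frac{\sum_{i=1}^{N} (y_i^2 - 2 p y_i + p^2)}{N - 1} \\
         &= \frac{\sum_{i=1}^{N} y_i^2 - 2p \sum_{i=1}^{N} y_i + N p^2}{N - 1}
\end{aligned}
\]
Then, since $y_i^2 = y_i$:
\[
\begin{aligned}
\sigma^2 &= \frac{\sum_{i=1}^{N} y_i - 2p \sum_{i=1}^{N} y_i + N p^2}{N - 1} \\
         &= \frac{Np - 2p(Np) + N p^2}{N - 1} \\
         &= \frac{Np - N p^2}{N - 1} = \frac{N}{N - 1}\, p (1 - p)
\end{aligned}
\]
Theoretically this is the variance.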
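The identity for the variance of a 0/1 population can be checked exactly on a small example. The Python sketch below uses exact fractions; the ten-unit population is hypothetical and only serves to illustrate the algebra.

```python
from fractions import Fraction


def binary_population_variance(y):
    """Return the finite population variance of a 0/1 variable computed
    two ways: from the definition and from N/(N-1) * p * (1-p)."""
    N = len(y)
    p = Fraction(sum(y), N)
    by_definition = sum((yi - p) ** 2 for yi in y) / (N - 1)
    closed_form = Fraction(N, N - 1) * p * (1 - p)
    return by_definition, closed_form


# Hypothetical population of N = 10 units, 4 of which answer "yes".
population = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
by_definition, closed_form = binary_population_variance(population)
print(by_definition, closed_form)
```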
• How will we estimate this?
• We can estimate this by:
\[ \hat{\sigma}^2 = s^2 = \frac{n}{n - 1}\, \hat{p} \cdot (1 - \hat{p}). \]
• What we want is to see how p̂ behaves; therefore, we want to know its
distribution.
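• First, we find its mean, then its variance.
• Since p̂ is ȳ, we can get E(p̂) = µ = p.
• Then, we proceed to find its variance.
\[
\begin{aligned}
\mathrm{Var}(\hat{p}) &= \left( 1 - \frac{n}{N} \right) \cdot \frac{\sigma^2}{n} \\
                      &= \frac{N - n}{N} \cdot \frac{N \cdot p \cdot (1 - p)}{(N - 1) \cdot n} \\
                      &= \frac{N - n}{N - 1} \cdot \frac{p \cdot (1 - p)}{n}
\end{aligned}
\]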
• How will we estimate the variance of p̂?
• There are many answers for how to do this.
• One method would be to use maximum likelihood, another would be to
find the unbiased estimator.
• An unbiased estimator of the variance is:
\[ \widehat{\mathrm{Var}}(\hat{p}) = \frac{N - n}{N} \cdot \frac{\hat{p} \cdot (1 - \hat{p})}{n - 1} \]
• The answer will not be very different from what one would get using
other methods.
• What about confidence intervals?
• For this we need to know the distribution of p̂.
• When the sample size is large, p̂ is approximately normally
distributed by the Central Limit Theorem.
• How large is large enough?
Answer: a common rule of thumb is n · p̂ ≥ 5 and n · (1 − p̂) ≥ 5.
Back to example

• Imagine President Junker’s final approval rating is 22% (based upon a
sample of 1112 interviews)!
• After looking at this statistic, we can provide a 95% CI for the true
proportion.
• The 22% is a sample proportion.
• What is the true population proportion?
• ANSWER:
• A 95% confidence interval for p is:
\[ 0.22 \pm 1.96 \sqrt{0.0001545} = 0.22 \pm 0.0244, \]
where
\[
\widehat{\mathrm{Var}}(\hat{p}) = \frac{N - n}{N} \cdot \frac{\hat{p} \cdot (1 - \hat{p})}{n - 1}
  = 1 \cdot \frac{0.22 \times 0.78}{1112 - 1} = 0.0001545.
\]
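The approval rating interval can be reproduced with the short Python sketch below. It is an illustration rather than part of the slides: `proportion_ci` is a hypothetical helper, and passing `N=None` encodes the assumption used above that the population is so large that (N − n)/N ≈ 1.

```python
import math


def proportion_ci(p_hat, n, N=None, z=1.96):
    """Estimated variance of p_hat and the CI half-width."""
    fpc = 1.0 if N is None else (N - n) / N
    var_hat = fpc * p_hat * (1 - p_hat) / (n - 1)
    return var_hat, z * math.sqrt(var_hat)


p_hat, n = 0.22, 1112
# Rule-of-thumb check for the normal approximation.
large_enough = n * p_hat >= 5 and n * (1 - p_hat) >= 5
var_hat, half_width = proportion_ci(p_hat, n)
print(f"Var = {var_hat:.7f}, CI = {p_hat} +/- {half_width:.4f}")
```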
Sample size for estimating proportion

• Using the formula to find the sample size for estimating the mean, we
have:
\[ n = \frac{1}{\dfrac{d^2}{z_{1-\alpha/2}^2 \cdot \sigma^2} + \dfrac{1}{N}}. \]
• Now, $\sigma^2 = \frac{N}{N-1} \cdot p \cdot (1 - p)$ substitutes in
and we get:
\[ n = \frac{N \cdot p \cdot (1 - p)}{(N - 1) \dfrac{d^2}{z_{1-\alpha/2}^2} + p \cdot (1 - p)}. \]
• When the finite population correction can be ignored, the formula is:
\[ n \approx \frac{z_{1-\alpha/2}^2 \cdot p \cdot (1 - p)}{d^2}. \]
• Now, for finding sample sizes for a proportion, in addition to using
an educated guess to estimate p, we can also find a conservative
sample size which guarantees that the margin of error is small enough
at the specified confidence level.
A. Educated guess (estimate p by p̂):
\[ n = \frac{N \cdot \hat{p} \cdot (1 - \hat{p})}{(N - 1) \dfrac{d^2}{z_{1-\alpha/2}^2} + \hat{p} \cdot (1 - \hat{p})}. \]
1. Note, p̂ may be different from the true proportion.
2. The sample size may not be large enough in some cases (i.e., the
margin of error may not be as small as specified).
B. Conservative sample size:
\[ n = \frac{N \cdot 1/4}{(N - 1) \dfrac{d^2}{z_{1-\alpha/2}^2} + 1/4}. \]
1. This works because p(1 − p) attains its maximum, 1/4, at p = 1/2.
Example

To estimate the president’s final approval rating, how many people
should be sampled so that the absolute margin of error is 3% (a popular
choice), with 95% confidence?
A. Use educated guess: p̂ = 0.22 (the approval rating from the earlier
example).
Since N is very large compared to n, the finite population correction
is not needed.
\[
n = \frac{\hat{p} \cdot (1 - \hat{p}) \cdot z_{1-\alpha/2}^2}{d^2}
  = \frac{0.22 \cdot 0.78 \cdot 1.96^2}{0.03^2} = 732.47
\]
1. Round up to 733.
B. Use the conservative approach (same 3% margin of error and 95%
confidence):
\[ n = \frac{0.5 \cdot 0.5 \cdot 1.96^2}{0.03^2} = 1067.11 \]
1. Round up to 1068.
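Both approaches use the same formula and differ only in the value plugged in for p. A minimal Python sketch follows; the function name `sample_size_proportion` is an assumption, `p=0.5` encodes the conservative choice, and `N=None` means the finite population correction is ignored.

```python
import math


def sample_size_proportion(d, p=0.5, N=None, z=1.96):
    """Sample size needed to estimate a proportion to within d."""
    pq = p * (1 - p)
    if N is None:
        # Finite population correction ignored: n = z^2 p (1 - p) / d^2
        return z ** 2 * pq / d ** 2
    return N * pq / ((N - 1) * d ** 2 / z ** 2 + pq)


n_guess = sample_size_proportion(d=0.03, p=0.22)   # educated guess
n_conservative = sample_size_proportion(d=0.03)    # p = 1/2
print(math.ceil(n_guess), math.ceil(n_conservative))
```

The conservative size is always at least as large as the educated-guess size, because p(1 − p) is largest at p = 1/2.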
What to choose?

• How do we choose between the educated guess and the conservative
approach?
• Compare the cost of sampling extra units with the set-up cost of
running the sampling process once more.
• If the cost of setting up the sampling procedure once more (which may
be needed if an educated guess is used) is high compared to the cost
of sampling extra units, then one will prefer the conservative
approach.
Example

• Find the proportion of CD players in a shipment that have a lifetime
longer than 2000 hours.
• The proportion from the last shipment was 0.9. It is not costly to
set up the testing procedure again if needed, whereas the sampling
cost of each unit is high.
• We want to estimate the proportion to within 0.01 with 95%
confidence.
• Would you use the educated guess or the conservative approach?
• ANSWER:
• We should use an educated guess because it is not costly to set up
the testing procedure again.
• On the other hand, the cost of sampling extra units is high due to
the nature of the test.
• A ship is sent out to the Bering Sea to sample fish and estimate the
proportion that have a mercury level within a specified limit.
• Last year the proportion was 0.9.
• We want to estimate the proportion to within 0.01 with 95%
confidence.
• Would you use the educated guess or the conservative
approach?
• ANSWER:
• We should use a conservative approach because it is too
expensive to send a ship out again if needed.

Unequal probability sampling

Unit learning outcomes

• Upon successful completion of this lesson, you will:
• know why and when to use unequal probability sampling,
• know how to perform unequal probability sampling,
• know how to compute the Hansen-Hurwitz estimator and its estimated
variance,
• know how to compute the Horvitz-Thompson estimator and its estimated
variance, and
• have learned about the unbiasedness of these two estimators through
an artificial small population example.
Subsection 1

Unequal probability sampling

• In simple random sampling, the probability that each unit
will be sampled is the same.
• But sometimes, estimates can be improved by varying the
probabilities with which units are sampled.
• For example, we want to estimate the number of job
openings in a city by sampling firms in that city.

• If one uses simple random sampling (s.r.s.), the size of a firm is
not taken into consideration and a typical sample will consist mostly
of small firms.
• However, the number of job openings is heavily influenced by large
firms.
• Thus, we should be able to improve the estimate of the number of job
openings by giving the large firms a greater chance to appear in the
sample, for example, with probability proportional to size or
proportional to some other relevant aspect.
Selection probabilities

• On each draw, the probability that a given population unit will be
selected is denoted as $p_i$, i = 1, 2, 3, ..., N.
• Suppose that sampling is with replacement; then the probability of
selecting the i-th unit of the population is $p_i$ on every draw.
• If the selection probabilities are unequal, the sample mean
is not unbiased for population mean and sample total is
not unbiased for population total.
• Example: if larger firms are sampled with higher
probability, the sample mean for job openings will be
biased upward.
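
The upward bias can be made concrete with a toy calculation. In the Python sketch below the three firms, their sizes and their job openings are entirely hypothetical; it computes the expected value of a single draw (and hence of the sample mean under sampling with replacement) when selection is proportional to size, and compares it with the population mean.

```python
from fractions import Fraction

# Hypothetical population of three firms.
sizes = [10, 20, 70]       # number of employees
openings = [1, 2, 9]       # number of job openings

# Selection probabilities proportional to firm size.
p = [Fraction(s, sum(sizes)) for s in sizes]

# Population mean of job openings per firm.
mu = Fraction(sum(openings), len(openings))

# Expected value of one draw, which is also the expected sample mean
# when sampling with replacement with probabilities p.
expected_sample_mean = sum(pi * yi for pi, yi in zip(p, openings))

print(f"population mean = {float(mu)}, "
      f"E(sample mean) = {float(expected_sample_mean)}")
```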

Subsection 2

The Hansen-Hurwitz estimator

Sampling with replacement

• When sampling with replacement, the variances tend to be larger.
• However, the formulae for sampling with replacement are simpler and
easier to derive.
• When the sample size is small compared to N, sampling with and
without replacement are not too different.
• We often use the sampling with replacement formulae (easier to
handle) to approximate sampling without replacement.
• For this section, let’s consider sampling with replacement.
• Let $p_i$, i = 1, ..., N, denote the probability that a given
population unit will be selected.
• The Hansen-Hurwitz estimator for τ is:
\[ \hat{\tau}_p = \frac{1}{n} \sum_{i=1}^{n} \frac{y_i}{p_i}. \]
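Since
\[
E\left( \frac{y_i}{p_i} \right) = \sum_{i=1}^{N} \frac{y_i}{p_i}\, p_i = \sum_{i=1}^{N} y_i = \tau,
\]
where $\tau = \sum_{i=1}^{N} y_i$ is the population total, we have
\[
\begin{aligned}
E(\hat{\tau}_p) &= E\left( \frac{1}{n} \sum_{i=1}^{n} \frac{y_i}{p_i} \right) \\
                &= \frac{1}{n} \sum_{i=1}^{n} E\left( \frac{y_i}{p_i} \right) \\
                &= \frac{1}{n} \sum_{i=1}^{n} \tau \\
                &= \frac{1}{n}\, n \tau = \tau,
\end{aligned}
\]
which means $\hat{\tau}_p$ is an unbiased estimator for τ.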
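Since
\[
\mathrm{Var}\left( \frac{y_i}{p_i} \right) = \sum_{i=1}^{N} p_i \left( \frac{y_i}{p_i} - \tau \right)^2,
\]
we have
\[
\mathrm{Var}(\hat{\tau}_p) = \frac{1}{n} \sum_{i=1}^{N} p_i \left( \frac{y_i}{p_i} - \tau \right)^2.
\]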
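Both the unbiasedness of the Hansen-Hurwitz estimator and its variance formula can be verified exactly on a tiny population by enumerating every possible sample. The Python sketch below does this for n = 2 draws with replacement; the population values and selection probabilities are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population: values y_i and selection probabilities p_i.
y = [1, 2, 9]
p = [Fraction(1, 10), Fraction(2, 10), Fraction(7, 10)]
n = 2
tau = sum(y)


def hansen_hurwitz(sample):
    """Hansen-Hurwitz estimate for a sample given as unit indices."""
    return Fraction(1, len(sample)) * sum(Fraction(y[i]) / p[i] for i in sample)


# Enumerate all ordered samples of size n drawn with replacement.
expected = Fraction(0)
second_moment = Fraction(0)
for sample in product(range(len(y)), repeat=n):
    prob = Fraction(1)
    for i in sample:
        prob *= p[i]
    estimate = hansen_hurwitz(sample)
    expected += prob * estimate
    second_moment += prob * estimate ** 2

variance = second_moment - expected ** 2
formula = Fraction(1, n) * sum(
    pi * (Fraction(yi) / pi - tau) ** 2 for yi, pi in zip(y, p)
)
print(expected, variance, formula)
```

Exact fractions are used so that the comparison with the formula does not depend on floating-point rounding.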
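• An unbiased estimator for $\mathrm{Var}(\hat{\tau}_p)$ is:
\[
\widehat{\mathrm{Var}}(\hat{\tau}_p) = \frac{1}{n} \cdot \frac{\sum_{i=1}^{n} \left( \dfrac{y_i}{p_i} - \hat{\tau}_p \right)^2}{n - 1}
\]
and an approximate (1 − α)100% confidence interval for τ is:
\[ \hat{\tau}_p \pm z_{1-\alpha/2} \cdot \sqrt{\widehat{\mathrm{Var}}(\hat{\tau}_p)}. \]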
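• For the population mean, $\mu = \frac{\tau}{N}$, one uses:
\[
\hat{\mu}_p = \frac{1}{N} \cdot \left( \frac{1}{n} \sum_{i=1}^{n} \frac{y_i}{p_i} \right) = \frac{\hat{\tau}_p}{N}
\]
\[ E(\hat{\mu}_p) = \frac{\tau}{N} = \mu \]
\[ \widehat{\mathrm{Var}}(\hat{\mu}_p) = \frac{1}{N^2} \cdot \widehat{\mathrm{Var}}(\hat{\tau}_p) \]
• How do we perform unequal probability sampling according to given
$p_i$?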
Example 1

• The director of the computer support department plans to sample 3
divisions of a large firm that has 10 divisions, with varying numbers
of employees per division.
• Since the number of computer support requests within each division
should be highly correlated with the number of employees in that
division, the director decides to use unequal probability sampling
with replacement, with $p_i$ proportional to the number of employees
in that division.
Division   # employees
1          1000
2          650
3          2100
4          860
5          2840
6          1910
7          390
8          3200
9          1500
10         1200
Total      15650
A. How do we practically implement unequal probability
sampling according to the given pi ’s?
B. With the divisions selected by probability proportional to
size, how do we construct the Hansen-Hurwitz estimator
for τ ?

Example: Answer to A

Division   # employees   pi           Assigned numbers
1          1000          1000/15650   1-1000
2          650           650/15650    1001-1650
3          2100          2100/15650   1651-3750
4          860           860/15650    3751-4610
5          2840          2840/15650   4611-7450
6          1910          1910/15650   7451-9360
7          390           390/15650    9361-9750
8          3200          3200/15650   9751-12950
9          1500          1500/15650   12951-14450
10         1200          1200/15650   14451-15650
Total      15650         1

• Sample 3 numbers between 1 and 15650, with replacement.
• They are 1085, 6261 and 9787.
• These numbers fall into division 2, division 5 and division 8.
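The cumulative-size method in the table can be sketched in a few lines of Python. This is only an illustration: the names `employees`, `upper_bounds` and `division_for` are made up for this sketch, and the three draws are the ones reported on the slide rather than fresh random numbers.

```python
from bisect import bisect_left
from itertools import accumulate

# Employees per division (divisions 1..10), as listed in the table.
employees = [1000, 650, 2100, 860, 2840, 1910, 390, 3200, 1500, 1200]

# Upper end of each division's block of assigned numbers: 1000, 1650, 3750, ...
upper_bounds = list(accumulate(employees))


def division_for(number):
    """Return the 1-based division whose block of numbers contains `number`."""
    if not 1 <= number <= upper_bounds[-1]:
        raise ValueError("number must lie between 1 and the total size")
    # Index of the first block whose upper end is >= number.
    return bisect_left(upper_bounds, number) + 1


# The three random numbers reported in the example (drawn with replacement).
draws = [1085, 6261, 9787]
selected = [division_for(d) for d in draws]
print(selected)  # [2, 5, 8]
```

To draw a new sample, each number could be generated with `random.randint(1, upper_bounds[-1])`, which includes both end points, so each division is hit with probability proportional to its size.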
• For division 2, y1: the number of requests is 420
• For division 5, y2: the number of requests is 1785
• For division 8, y3: the number of requests is 2198

• We will need to compute the Hansen-Hurwitz estimator as follows:
• The Hansen-Hurwitz estimator for τ is

$$
\begin{aligned}
\hat{\tau}_p &= \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{p_i} \\
&= \frac{1}{3}\left(420\cdot\frac{15650}{650} + 1785\cdot\frac{15650}{2840} + 2198\cdot\frac{15650}{3200}\right) \\
&= \frac{1}{3}\,(10112.31 + 9836.36 + 10749.59) \\
&= 10232.75
\end{aligned}
$$

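• The three values 10112.31, 9836.36 and 10749.59 are fairly close to one another, so the variance should not be too large.

$$
\begin{aligned}
\widehat{\mathrm{Var}}(\hat{\tau}_p) &= \frac{1}{3}\cdot\frac{\sum_{i=1}^{3}\left(\frac{y_i}{p_i}-\hat{\tau}_p\right)^2}{3-1} \\
&= \frac{1}{3}\cdot\frac{1}{2}\left[(10112.31-10232.75)^2 + (9836.36-10232.75)^2 + (10749.59-10232.75)^2\right] \\
&= 73125.74
\end{aligned}
$$

and $\widehat{\mathrm{SD}}(\hat{\tau}_p) = 270.418$.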
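As a check on the arithmetic, the Hansen-Hurwitz estimate and its estimated variance for the three sampled divisions can be recomputed with a short Python sketch. The function name `hansen_hurwitz` is illustrative only, and the 95% interval uses z = 1.96 as in the confidence interval formula given earlier. Because the code does not round the intermediate ratios, the variance differs from the slide value in the last digits.

```python
from math import sqrt


def hansen_hurwitz(y, p):
    """Return (estimate of the total, estimated variance) for draws with replacement."""
    n = len(y)
    ratios = [yi / pi for yi, pi in zip(y, p)]
    tau_hat = sum(ratios) / n
    var_hat = sum((r - tau_hat) ** 2 for r in ratios) / (n * (n - 1))
    return tau_hat, var_hat


total_employees = 15650
# Requests observed in the sampled divisions 2, 5 and 8.
y = [420, 1785, 2198]
# Draw probabilities proportional to the number of employees.
p = [650 / total_employees, 2840 / total_employees, 3200 / total_employees]

tau_hat, var_hat = hansen_hurwitz(y, p)
sd_hat = sqrt(var_hat)

# Approximate 95% confidence interval for the total number of requests.
z = 1.96
ci = (tau_hat - z * sd_hat, tau_hat + z * sd_hat)
print(round(tau_hat, 2), round(sd_hat, 1), ci)
```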
Hansen-Hurwitz estimator

• When, as in the example, the pi are chosen proportional to the values of a known positive auxiliary variable such as size, $p_i = x_i / \sum_{i=1}^{N} x_i$, the Hansen-Hurwitz estimator is also called the p.p.s. (probability proportional to size) estimator.
• Now, we need to ask ourselves: when and why would we need to use unequal probability sampling?
• Let's think about the 'when' first.

• What about if we were sampling from ESTAT
departments?
• They are of very different sizes, some are very large and
others are very small.
• Would we automatically choose to use p.p.s.?

• If the thing that you are interested in is related to size,
then you would want to use p.p.s.
• However, if what you are interested in has nothing to do
with the size of the department, then there is no reason
to use p.p.s.

• By definition,

$$
\tau = \sum_{i=1}^{N} y_i
\quad\text{and}\quad
\mathrm{Var}(\hat{\tau}_p) = \frac{1}{n}\sum_{i=1}^{N} p_i\left(\frac{y_i}{p_i}-\tau\right)^2 .
$$

• For the special and unrealistic case $y_i/p_i$ = constant, the constant will be τ and Var(τ̂p) will be zero.
• Therefore, you want $y_i/p_i$ to be close to a constant.
• However, in reality, prior to sampling, the yi are unknown and we cannot choose pi proportional to yi.
• If we know yi is approximately proportional to a known variable such as xi, then we can choose pi proportional to xi.

Example: palm trees

• We want to estimate the total number of palm trees on 100 islands in a tropical paradise.
• The area of each island is known and it is reasonable to think that the number of palm trees on each island is approximately proportional to the size of the island.

• The sizes of the islands are given (e.g., the size of island 1 is 1 square mile, the size of island 29 is 5 square miles and the size of island 36 is 2 square miles).
• The total size of these 100 islands is 100 square miles.
• How can we sample 4 islands by probabilities p1, ..., p100?

• Answer:
• Assign an interval of width pi to the i-th unit.
• Generate 4 random numbers from a uniform distribution on (0,1).
• Choose the units whose intervals contain the random numbers.
• In this example, the uniform random numbers are: 0.335257, 0.0065551, 0.401869, 0.318977.
• The units selected are islands 29, 1, 36 and 29 (since 0.335257 falls between 0.31 and 0.36, 0.0065551 falls between 0 and 0.01, 0.401869 falls between 0.40 and 0.42, and 0.318977 falls between 0.31 and 0.36).

The measurements (yi) are:

i     Size   pi     yi
1     1      0.01   14
29    5      0.05   50
29    5      0.05   50
36    2      0.02   25

Given these results, we can now estimate the total number of palm trees on all of the islands put together:

$$
\begin{aligned}
\hat{\tau}_p &= \frac{1}{4}\left(\frac{14}{0.01} + \frac{50}{0.05} + \frac{50}{0.05} + \frac{25}{0.02}\right) \\
&= \frac{1}{4}\,(1400 + 1000 + 1000 + 1250) \\
&= 1162.5
\end{aligned}
$$

$$
\begin{aligned}
\widehat{\mathrm{Var}}(\hat{\tau}_p) &= \frac{1}{n(n-1)}\sum_{i=1}^{n}\left(\frac{y_i}{p_i}-\hat{\tau}_p\right)^2 \\
&= \frac{1}{4\cdot 3}\left[(1400-1162.5)^2 + (1000-1162.5)^2 + (1000-1162.5)^2 + (1250-1162.5)^2\right] \\
&= 9739.58
\end{aligned}
$$

$$
\widehat{\mathrm{SD}}(\hat{\tau}_p) = 98.69.
$$

• If we are interested in the mean number of trees per island in that population, then

$$
\hat{\mu}_p = \frac{\hat{\tau}_p}{N} = \frac{1162.5}{100} = 11.625,
$$

$$
\widehat{\mathrm{Var}}(\hat{\mu}_p) = \frac{1}{N^2}\cdot\widehat{\mathrm{Var}}(\hat{\tau}_p) = \frac{1}{(100)^2}\cdot 9739.58 = 0.973958,
\qquad
\widehat{\mathrm{SD}}(\hat{\mu}_p) = 0.987
$$
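The palm tree calculation can be reproduced with the following Python sketch. One point worth noting is that island 29 enters twice, once per draw, because the Hansen-Hurwitz estimator averages over draws rather than over distinct units. The variable names are illustrative only.

```python
from math import sqrt

N = 100  # number of islands in the population

# One entry per draw: (island, draw probability p_i, palm trees y_i).
# Island 29 was drawn twice, so it is listed twice.
draws = [(29, 0.05, 50), (1, 0.01, 14), (36, 0.02, 25), (29, 0.05, 50)]

n = len(draws)
ratios = [y / p for _, p, y in draws]

# Estimated total and its estimated variance.
tau_hat = sum(ratios) / n
var_tau = sum((r - tau_hat) ** 2 for r in ratios) / (n * (n - 1))

# Estimated mean per island: divide the total by N and the variance by N^2.
mu_hat = tau_hat / N
var_mu = var_tau / N ** 2

print(round(tau_hat, 2), round(sqrt(var_tau), 2))
print(round(mu_hat, 3), round(sqrt(var_mu), 3))
```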
Subsection 3

The Horvitz-Thompson Estimator

The Horvitz-Thompson estimator

• Horvitz and Thompson (1952) introduced an unbiased estimator for τ for any design, with or without replacement.
• Definition: πi, i = 1, ..., N, are given positive numbers that represent the probability that unit i is included in the sample under a given sampling scheme.

• The Horvitz-Thompson estimator is:

$$
\hat{\tau}_\pi = \sum_{i=1}^{\nu}\frac{y_i}{\pi_i}
$$

where ν is the number of distinct units in the sample.

• The Horvitz-Thompson estimator does not depend on the
number of times a unit may be selected.
• Each distinct unit of the sample is utilized only once.
• Note that the estimator is unbiased:

E(τ̂π ) = τ

• Its variance is given by

$$
\mathrm{Var}(\hat{\tau}_\pi) = \sum_{i=1}^{N}\frac{1-\pi_i}{\pi_i}\,y_i^2 + \sum_{i=1}^{N}\sum_{j\neq i}\frac{\pi_{ij}-\pi_i\pi_j}{\pi_i\pi_j}\,y_i y_j
$$

• It can be estimated by:

$$
\widehat{\mathrm{Var}}(\hat{\tau}_\pi) = \sum_{i=1}^{\nu}\frac{1-\pi_i}{\pi_i^2}\,y_i^2 + \sum_{i=1}^{\nu}\sum_{j\neq i}\frac{\pi_{ij}-\pi_i\pi_j}{\pi_i\pi_j}\cdot\frac{1}{\pi_{ij}}\,y_i y_j
$$

where πij > 0 denotes the probability that both unit i and unit j are included in the sample.

An approximate (1 − α)100% CI for τ is:

$$
\hat{\tau}_\pi \pm z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat{\tau}_\pi)}.
$$

Palm trees with Horvitz-Thompson estimator

• Compute the Horvitz-Thompson estimator of the total number of palm trees.
• ANSWER:
• Since the sample in that example is drawn with replacement, the n draws are independent.
• It is relatively easy to compute the π's.

• For sampling with replacement, we compute:

$$
\begin{aligned}
\pi_i &= P(\text{unit } i \text{ is included in the sample}) \\
&= 1 - P(\text{unit } i \text{ is not included in the sample}) \\
&= 1 - (1-p_i)^n
\end{aligned}
$$

• Recall: units 1, 29 and 36 are selected. Below they are labelled 1, 2 and 3.
• Since p1 = 0.01, π1 = 1 − (1 − 0.01)^4 = 0.0394,
  p2 = 0.05, π2 = 1 − (1 − 0.05)^4 = 0.1855,
  p3 = 0.02, π3 = 1 − (1 − 0.02)^4 = 0.0776
• Therefore,

$$
\hat{\tau}_\pi = \sum_{i=1}^{\nu}\frac{y_i}{\pi_i} = \frac{14}{0.0394} + \frac{50}{0.1855} + \frac{25}{0.0776} = 947.037
$$

• Next, we need to compute the estimated variance, $\widehat{\mathrm{Var}}(\hat{\tau}_\pi)$.
• For this, we need to compute πij.
• Since

$$
P(A\cap B) = P(A) + P(B) - P(A\cup B) = P(A) + P(B) - \left[1 - P(A^c\cap B^c)\right],
$$

• we get:

$$
\pi_{ij} = \pi_i + \pi_j - \left[1 - (1-p_i-p_j)^n\right]
$$

• This means that we have to run through each of the unit pairs:

$$
\begin{aligned}
\pi_{12} &= 0.0394 + 0.1855 - \left[1-(1-0.01-0.05)^4\right] = 0.00565 \\
\pi_{13} &= 0.0394 + 0.0776 - \left[1-(1-0.01-0.02)^4\right] = 0.00229 \\
\pi_{23} &= 0.1855 + 0.0776 - \left[1-(1-0.05-0.02)^4\right] = 0.01115
\end{aligned}
$$

• Plugging the values into

$$
\widehat{\mathrm{Var}}(\hat{\tau}_\pi) = \sum_{i=1}^{\nu}\frac{1-\pi_i}{\pi_i^2}\,y_i^2 + \sum_{i=1}^{\nu}\sum_{j\neq i}\frac{\pi_{ij}-\pi_i\pi_j}{\pi_i\pi_j}\cdot\frac{1}{\pi_{ij}}\,y_i y_j ,
$$

we obtain:

$$
\widehat{\mathrm{Var}}(\hat{\tau}_\pi) = 92692.9
$$

• Thus, $\widehat{\mathrm{SD}}(\hat{\tau}_\pi) = \sqrt{92692.9} = 304.455$
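The Horvitz-Thompson calculation for the palm trees can be retraced with the Python sketch below. It is written under one explicit assumption: the inclusion probabilities are rounded to four decimals and the joint inclusion probabilities to five, as on the slides. The variance estimate is sensitive to this rounding, so unrounded probabilities give a different value. Names such as `incl` and `joint_incl` are illustrative.

```python
from math import sqrt

n = 4  # number of draws with replacement

# Distinct sampled islands: draw probability p_i and palm tree count y_i.
p = {1: 0.01, 29: 0.05, 36: 0.02}
y = {1: 14, 29: 50, 36: 25}
units = list(p)

# First-order inclusion probabilities, rounded to 4 decimals as on the slides.
incl = {i: round(1 - (1 - p[i]) ** n, 4) for i in units}


def joint_incl(i, j):
    """pi_ij = pi_i + pi_j - [1 - (1 - p_i - p_j)^n], rounded to 5 decimals."""
    return round(incl[i] + incl[j] - (1 - (1 - p[i] - p[j]) ** n), 5)


# Horvitz-Thompson estimate: each distinct unit is used once.
tau_hat = sum(y[i] / incl[i] for i in units)

# Estimated variance: single-unit terms plus terms for every ordered pair i != j.
var_hat = sum((1 - incl[i]) / incl[i] ** 2 * y[i] ** 2 for i in units)
for i in units:
    for j in units:
        if i != j:
            pij = joint_incl(i, j)
            var_hat += (pij - incl[i] * incl[j]) / (incl[i] * incl[j] * pij) * y[i] * y[j]

sd_hat = sqrt(var_hat)
print(round(tau_hat, 3), round(var_hat, 1), round(sd_hat, 2))
```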
• Is there some popular estimator that can be derived as a Horvitz-Thompson estimator?
• Yes. Under simple random sampling (without replacement), the inclusion probability of the i-th unit is:

$$
\begin{aligned}
\pi_i &= P(\text{unit } i \text{ is included in the sample}) \\
&= \frac{\text{number of samples including unit } i}{\text{number of samples}} \\
&= \frac{\binom{N-1}{n-1}}{\binom{N}{n}}
 = \frac{\dfrac{(N-1)!}{(N-n)!\,(n-1)!}}{\dfrac{N!}{(N-n)!\,n!}}
 = \frac{n}{N}
\end{aligned}
$$

so that

$$
\hat{\tau}_\pi = \sum_{i=1}^{n}\frac{y_i}{\pi_i} = \sum_{i=1}^{n}\frac{y_i}{n}\cdot N = N\bar{y},
$$

which is the popular estimator we use! This is also called the expansion estimator.

Subsection 4

Small population illustration

Wheat production

Unit (farm) i    1     2     3
pi               0.3   0.2   0.5
Wheat produced   11    6     25

N = 3 farms; n = 2 farms; sample with replacement.

s p(s) Sample

(1,1) 0.3(0.3)=0.09 (11,11)

(2,2) 0.2(0.2)=0.04 (6,6)
(3,3) 0.5(0.5)=0.25 (25,25)
(1,2) 0.3(0.2)=0.06 (11,6)
(2,1) 0.2(0.3)=0.06 (6,11)
(1,3) 0.3(0.5)=0.15 (11,25)
(3,1) 0.5(0.3)=0.15 (25,11)
(2,3) 0.2(0.5)=0.10 (6,25)
(3,2) 0.5(0.2)=0.10 (25,6)

• Question: Compute the Hansen-Hurwitz estimator.
• Answer: When (1,1) is sampled, the Hansen-Hurwitz estimator is:

$$
\hat{\tau}_p = \frac{1}{2}\left(\frac{y_1}{p_1} + \frac{y_1}{p_1}\right) = \frac{1}{2}\left(\frac{11}{0.3} + \frac{11}{0.3}\right) = 36.67.
$$

• When (1,2) is sampled, the Hansen-Hurwitz estimator is:

$$
\hat{\tau}_p = \frac{1}{2}\left(\frac{y_1}{p_1} + \frac{y_2}{p_2}\right) = \frac{1}{2}\left(\frac{11}{0.3} + \frac{6}{0.2}\right) = 33.33.
$$

Similarly, we can fill out the table and get the Hansen-Hurwitz
estimators as shown:

s p(s) Sample τ̂p

(1,1) 0.3(0.3)=0.09 (11,11) 36.670

(2,2) 0.2(0.2)=0.04 (6,6) 30.000
(3,3) 0.5(0.5)=0.25 (25,25) 50.000
(1,2) 0.3(0.2)=0.06 (11,6) 33.330
(2,1) 0.2(0.3)=0.06 (6,11) 33.330
(1,3) 0.3(0.5)=0.15 (11,25) 43.330
(3,1) 0.5(0.3)=0.15 (25,11) 43.330
(2,3) 0.2(0.5)=0.10 (6,25) 40.000
(3,2) 0.5(0.2)=0.10 (25,6) 40.000

• Question: Compute the Horvitz-Thompson estimator.
• Answer:
π1 = 0.09 + 0.06 + 0.06 + 0.15 + 0.15 = 0.51,
π2 = 0.04 + 0.06 + 0.06 + 0.10 + 0.10 = 0.36,
π3 = 0.25 + 0.15 + 0.15 + 0.10 + 0.10 = 0.75.

• When (1,2) is sampled, the Horvitz-Thompson estimator is:

$$
\hat{\tau}_\pi = \frac{11}{0.51} + \frac{6}{0.36} = 38.24.
$$

Similarly, we can fill out the table and get the Horvitz-Thompson estimators as shown below:

s          p(s)            ys        τ̂p      τ̂π
(1,1)      0.3(0.3)=0.09   (11,11)   36.67   21.57
(2,2)      0.2(0.2)=0.04   (6,6)     30.00   16.67
(3,3)      0.5(0.5)=0.25   (25,25)   50.00   33.33
(1,2)      0.3(0.2)=0.06   (11,6)    33.33   38.24
(2,1)      0.2(0.3)=0.06   (6,11)    33.33   38.24
(1,3)      0.3(0.5)=0.15   (11,25)   43.33   54.90
(3,1)      0.5(0.3)=0.15   (25,11)   43.33   54.90
(2,3)      0.2(0.5)=0.10   (6,25)    40.00   50.00
(3,2)      0.5(0.2)=0.10   (25,6)    40.00   50.00
Mean                                 42.00   42.00
Variance                             34.67   146.44

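Because the population has only three farms, the whole sampling distribution can be enumerated. The Python sketch below lists all nine ordered samples of size 2, derives the inclusion probabilities from the sample probabilities, and computes the mean and variance of both estimators. Function and variable names are illustrative only.

```python
from itertools import product

p = {1: 0.3, 2: 0.2, 3: 0.5}   # draw probabilities
y = {1: 11, 2: 6, 3: 25}       # wheat produced
n = 2                          # draws with replacement

# All ordered samples (i, j): nine in total.
samples = list(product(p, repeat=n))


def prob(s):
    """Probability of an ordered sample under independent draws."""
    out = 1.0
    for i in s:
        out *= p[i]
    return out


# Inclusion probability of a unit: total probability of the samples containing it.
incl = {i: sum(prob(s) for s in samples if i in s) for i in p}


def hansen_hurwitz(s):
    return sum(y[i] / p[i] for i in s) / n


def horvitz_thompson(s):
    # Each distinct unit contributes once, hence set(s).
    return sum(y[i] / incl[i] for i in set(s))


def mean_and_variance(estimator):
    mean = sum(prob(s) * estimator(s) for s in samples)
    var = sum(prob(s) * (estimator(s) - mean) ** 2 for s in samples)
    return mean, var


hh_mean, hh_var = mean_and_variance(hansen_hurwitz)
ht_mean, ht_var = mean_and_variance(horvitz_thompson)
print(incl)
print(round(hh_mean, 2), round(hh_var, 2))
print(round(ht_mean, 2), round(ht_var, 2))
```

Both means equal the population total 11 + 6 + 25 = 42. The enumeration uses unrounded estimator values, so the Horvitz-Thompson variance can differ from the tabulated one in the second decimal.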
• From the table above we can see that both τ̂p and τ̂π are unbiased: each has expected value 42, which equals the population total τ = 11 + 6 + 25 = 42.
• This is a small population example that illustrates conceptually the properties of these estimators.

Remark 1

• The above demonstration is just a teaching tool.

• In reality we will not know the population and will not
come across small population problems like this.

• What we know are:
unit 1 2 3
Selection probability 0.3 0.2 0.5

• We draw a sample.
• If the sample we draw is (1,2) then τ̂p = 33.33 and
τ̂π = 38.24.
• We will not be able to find the real population total nor
the real variance of the estimator.
• However, we will be able to estimate them.

Remark 2

• Now, should we use τ̂p or should we use τ̂π ?

• There are no clear answers.
• Both estimators are acceptable when yi and pi are
proportional.

Auxiliary data and ratio estimation
Unit learning outcomes

• Upon successful completion of this unit, you will be able to:
• know why and when to use ratio estimates
• check the condition to see whether one can use the ratio estimate
• compute the ratio estimate and its estimated variance
• compute a confidence interval based on ratio estimates
• compute the sample size needed when the ratio estimate is used
• learn about the bias of the ratio estimate via a small population example
• see that the ratio estimate does perform better than the expansion estimate when the condition for using the ratio estimate is satisfied

Subsection 1

Auxiliary data, ratio estimator and its computation

Using auxiliary information

• The auxiliary information about the population may include a known variable to which the variable of interest is approximately related.
• The auxiliary information typically is easy to measure, whereas the variable of interest may be expensive to measure.
• Population units: 1, 2, ..., N
• Variable of interest: y1, y2, ..., yN (expensive or costly to measure)

• For example consider: a national park is partitioned into
N units.
• yi = the number of animals in unit i
• xi = the size of unit i
• Another example might be where a certain city has N
bookstores.
• yi = the sales of a given book title at bookstore i
• xi = the size of the bookstore i

Ratio estimators

• If $\tau_y = \sum_{i=1}^{N} y_i$ and $\tau_x = \sum_{i=1}^{N} x_i$, then $\dfrac{\tau_y}{\tau_x} = \dfrac{\mu_y}{\mu_x}$ and $\tau_y = \dfrac{\mu_y}{\mu_x}\cdot\tau_x$.
• The ratio estimator, denoted as $\hat{\tau}_r$, is $\hat{\tau}_r = \dfrac{\bar{y}}{\bar{x}}\cdot\tau_x$

• The estimator is useful in the following situations:
A. When X and Y are highly linearly correlated through the origin, Var(τ̂r) is less than Var(N ȳ).
B. When N is unknown, it provides a way to estimate τy, since one cannot then use N ȳ.

Historical use

• When was this type of estimator used historically?

• Probably the first instance of its use occurred in France in
1802.
• At this time there was no population census and Laplace
wanted to estimate the total population of France.
• He did not have the resources to count every individual so
he sampled 30 communities in France.

• In this case for Laplace, n = 30, and the total number of inhabitants in these communities was 2,037,615.
• What type of information did the government already have?
• Laplace looked for auxiliary information to help him and found good records of the number of registered births.
• Dividing 2,037,615 by 71,866.33, the average annual number of registered births in the sampled communities, he estimated that there was one registered birth for every 28.35 persons.
• Therefore, he estimated the total population as the total number of annual births × 28.35.
• Rationale: communities with larger populations are likely to have a larger number of registered births.

Example 1: apple juice from apples

• For a juice company, the price they are paid for apples in large shipments is based on the amount of apple juice from the load.
• Therefore, we need to determine the amount of apple juice in the whole load prior to extraction.
• We can sample n apples and find y1, ..., yn, the amount of apple juice in those apples.

• How could we measure this?
• The total weight would be a good idea and easy to get.
• We will use the relationship between the weight of the load and the weight of the apple juice one obtains.
• The juice y of each sampled apple is related to x, the weight of that apple, and the total weight is easy to get for the entire shipment.

Ratio estimator for τ

• We can thus estimate the total apple juice by:

$$
\hat{\tau}_r = \frac{\bar{y}}{\bar{x}}\cdot\tau_x
$$

• For this example, N is unknown and we cannot use N ȳ.
• If the condition for using the ratio estimator is satisfied and N is known, the ratio estimator may actually work better than N ȳ.

Ratio estimator for µ

• Similarly, to estimate µy, we can use

$$
\hat{\mu}_r = \frac{\bar{y}}{\bar{x}}\cdot\mu_x .
$$

• It turns out that this estimate is not unbiased.
• Note that τ̂r is not unbiased for τy and µ̂r is not unbiased for µy, but they are approximately unbiased for large samples when the sampling is simple random sampling.

Properties

• The approximate MSE of µ̂r is Var(µ̂r), given by:

$$
\mathrm{Var}(\hat{\mu}_r) \approx \frac{N-n}{N}\cdot\frac{\sigma_r^2}{n},
\qquad\text{where}\qquad
\sigma_r^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(y_i - \frac{\tau_y}{\tau_x}\cdot x_i\right)^2 .
$$

• We estimate σr² from the sample using this formula:

$$
s_r^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(y_i - \frac{\bar{y}}{\bar{x}}\cdot x_i\right)^2 .
$$

• Given all of this, when do we know that the estimate µ̂r is good?
• We can compare it to:

$$
\mathrm{Var}(\bar{y}) = \frac{N-n}{N}\cdot\frac{\sigma^2}{n} .
$$

• µ̂r will perform better if σr² < σ².
• That is the case for populations for which the y's and x's are highly correlated, with a roughly linear relationship through the origin.

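• An approximate 100(1 − α)% CI for µy is

$$
\hat{\mu}_r \pm z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat{\mu}_r)},
\qquad\text{with}\qquad
\widehat{\mathrm{Var}}(\hat{\mu}_r) = \frac{N-n}{N}\cdot\frac{s_r^2}{n} .
$$

• For τy,

$$
\hat{\tau}_r = N\hat{\mu}_r = \frac{\bar{y}}{\bar{x}}\cdot\tau_x,
\qquad\text{and}\qquad
\widehat{\mathrm{Var}}(\hat{\tau}_r) = N\cdot(N-n)\,\frac{s_r^2}{n} .
$$
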
Back to apple juice example

• Back to the context for this example...
• As it turns out in this example, 15 apples selected by simple random sampling were weighed and also juiced.
• The total weight of the apple shipment was found to be 2000 pounds.
• What we need to do, given the table of results below, is to get a point estimate of the total weight of the juice for the shipment of apples and provide a 95% confidence interval.

Here is the data:
ID yi xi yi − rxi (yi − rxi )2

1 0.16 0.22 0.0148611 0.0002209

2 0.15 0.26 -0.0215278 0.0004634
3 0.2 0.31 -0.0045139 0.0000204
4 0.25 0.37 0.0059028 0.0000348
5 0.16 0.28 -0.0247222 0.0006112
6 0.27 0.38 0.0193056 0.0003727
7 0.28 0.4 0.0161111 0.0002596
8 0.16 0.21 0.0214583 0.0004605
9 0.11 0.18 -0.0087500 0.0000766
10 0.16 0.29 -0.0313194 0.0009809
11 0.17 0.26 -0.0015278 0.0000023
12 0.24 0.32 0.0288889 0.0008346
13 0.21 0.33 -0.0077083 0.0000594
14 0.11 0.16 0.0044444 0.0000198
15 0.22 0.35 -0.0109028 0.0001189

Mean 0.190 0.288

Sum 0.004536

• ID identifies the sampled apple
• yi is the weight of the apple's juice in lbs.
• xi is the weight of the apple in lbs.
• yi − rxi is the (observed y value − estimated y value), and
• (yi − rxi)² is the (observed y value − estimated y value) squared.
• Total apple juice weight in the sample is 2.85 lbs. (mean = 0.19 lbs.)
• Total apple weight in the sample is 4.32 lbs. (mean = 0.288 lbs.)

• Is it appropriate to use the ratio estimate?
• The scatter plot of the data shows a linear relationship between the y and x variables.

[Figure: scatter plot of juice weight y against apple weight x]

Moreover, the regression analysis suggests that the regression line goes through the origin (p-value of constant = 0.659 > 0.05). Therefore, it appears appropriate to use the ratio estimate.

• The ratio estimate of the total weight is

$$
\hat{\tau}_r = r\,\tau_x = \frac{0.190}{0.288}\times 2000 = 1319.44 .
$$

$$
\begin{aligned}
s_r^2 &= \frac{1}{n-1}\sum_{i=1}^{n}(y_i - r x_i)^2 \\
&= \frac{1}{14}\left[(0.16 - 0.6597\times 0.22)^2 + \dots + (0.22 - 0.6597\times 0.35)^2\right] \\
&= \frac{0.004536}{14} = 0.000324
\end{aligned}
$$

• How accurate is this result?

Example 1: apple juice from apples

Let's compute a confidence interval; for this we need the variance.

$$
\begin{aligned}
\widehat{\mathrm{Var}}(\hat{\tau}_r) &= \hat{N}\cdot(\hat{N}-n)\,\frac{s_r^2}{n}
 = \frac{\tau_x}{\bar{x}}\left(\frac{\tau_x}{\bar{x}} - n\right)\frac{s_r^2}{n} \\
&= \frac{2000}{0.288}\left(\frac{2000}{0.288} - 15\right)\frac{\frac{1}{n-1}\sum_{i=1}^{15}(y_i - r x_i)^2}{n} \\
&= 6944.444\cdot 6929.444\cdot\frac{\frac{1}{14}\cdot 0.004536}{15} = 1039.42
\end{aligned}
$$

$$
\widehat{\mathrm{SD}}(\hat{\tau}_r) = 32.24
$$

• An approximate 95% CI for τ is then:

$$
\begin{aligned}
\hat{\tau}_r \pm z_{1-\alpha/2}\,\widehat{\mathrm{SD}}(\hat{\tau}_r) &= 1319.44 \pm 1.96\times 32.24 \\
&= 1319.44 \pm 63.19
\end{aligned}
$$

• In this case the estimate does reduce the variance by using information contained in x about y.

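The apple juice calculation can be reproduced from the 15 data rows with the Python sketch below. It follows the slides in estimating the unknown N by τx / x̄. Variable names are illustrative, and z = 1.96 is used for the 95% interval.

```python
from math import sqrt

# Juice weight y_i and apple weight x_i (lbs) for the 15 sampled apples.
y = [0.16, 0.15, 0.20, 0.25, 0.16, 0.27, 0.28, 0.16,
     0.11, 0.16, 0.17, 0.24, 0.21, 0.11, 0.22]
x = [0.22, 0.26, 0.31, 0.37, 0.28, 0.38, 0.40, 0.21,
     0.18, 0.29, 0.26, 0.32, 0.33, 0.16, 0.35]
tau_x = 2000.0  # total weight of the shipment (lbs)

n = len(y)
y_bar = sum(y) / n
x_bar = sum(x) / n

# Sample ratio and ratio estimate of the total juice weight.
r = y_bar / x_bar
tau_r = r * tau_x

# Residual variance around the line through the origin with slope r.
s_r2 = sum((yi - r * xi) ** 2 for yi, xi in zip(y, x)) / (n - 1)

# N is unknown, so it is estimated by tau_x / x_bar.
N_hat = tau_x / x_bar
var_tau_r = N_hat * (N_hat - n) * s_r2 / n
sd_tau_r = sqrt(var_tau_r)

half_width = 1.96 * sd_tau_r
print(round(tau_r, 2), round(sd_tau_r, 2), round(half_width, 2))
```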
Estimation for ratio

• In some cases we are interested in estimating the ratio

$$
R = \frac{\tau_y}{\tau_x} = \frac{\mu_y}{\mu_x} .
$$

• For example, sociologists are interested in ratios such as the monthly food budget compared to the monthly income per family.
• The sample ratio r is the estimate for R, and:

$$
r = \frac{\bar{y}}{\bar{x}},
\qquad
\mathrm{Var}(r) \approx \frac{N-n}{N\mu_x^2}\cdot\frac{\sigma_r^2}{n},
\qquad
\widehat{\mathrm{Var}}(r) \approx \frac{N-n}{N\mu_x^2}\cdot\frac{s_r^2}{n}
$$

Subsection 2

Sample size and small population example for ratio estimation

• The goal is to estimate the average number of trees per acre on a 1000-acre plantation.
• The investigator samples 10 one-acre plots by simple random sampling and counts the number of trees (y) on each plot.
• He also has aerial photographs of the plantation from which he can estimate the number of trees (x) on each plot of the entire plantation.
• Hence, he knows µx = 19.7 and, since the two counts are approximately proportional through the origin, he uses a ratio estimate to estimate µy.

Plot   yi      xi (aerial estimate)   yi − rxi
1      25      23                     0.5625
2      15      14                     0.1250
3      22      20                     0.7500
4      24      25                     -2.5625
5      13      12                     0.2500
6      18      18                     -1.1250
7      35      30                     3.1250
8      30      27                     1.3125
9      10      8                      1.5000
10     29      31                     -3.9375
Mean   22.10   20.80

Here r = ȳ/x̄ = 22.10/20.80 = 1.0625.

Here is a scatterplot of this data:

[Figure: scatter plot of the actual tree count y against the aerial estimate x]

And, here is the R output for regression:

[Figure: R regression output]

• The scatter plot of the data shows a linear relationship between y and x.
• Moreover, the regression analysis suggests that the regression line goes through the origin (p-value of constant = 0.554 > 0.05).

• Estimating the number of trees per acre
• N = 1000 (plantation size)
• n = 10 (taken by s.r.s.)
• yi = the actual count of trees in the 1 acre plots,
i = 1, 2, ..., 10.
• xi = the aerial estimate for each plot

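$$
\hat{\mu}_r = \frac{\bar{y}}{\bar{x}}\cdot\mu_x = \frac{22.10}{20.80}\cdot 19.70 = 20.93,
$$

$$
s_r^2 = \frac{1}{10-1}\sum_{i=1}^{10}\left(y_i - \frac{22.10}{20.80}\,x_i\right)^2 = 4.2,
$$

$$
\widehat{\mathrm{Var}}(\hat{\mu}_r) = \frac{N-n}{N}\cdot\frac{s_r^2}{n} = \frac{1000-10}{1000}\cdot\frac{4.2}{10} = 0.4158,
$$

$$
\widehat{\mathrm{SD}}(\hat{\mu}_r) = \sqrt{0.4158} = 0.6448
$$

The approximate 95% confidence interval for µy is:

$$
\hat{\mu}_r \pm z_{0.975}\cdot\widehat{\mathrm{SD}}(\hat{\mu}_r) = 20.93 \pm 1.96\cdot 0.6448 = 20.93 \pm 1.26
$$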
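The tree example can be recomputed from the ten plots with the Python sketch below, which also lists the residuals yi − rxi. Variable names are illustrative. The code keeps s_r² unrounded (about 4.23), whereas the slides round it to 4.2, so the standard deviation differs slightly in the third decimal.

```python
from math import sqrt

# Actual tree counts y_i and aerial estimates x_i for the 10 sampled plots.
y = [25, 15, 22, 24, 13, 18, 35, 30, 10, 29]
x = [23, 14, 20, 25, 12, 18, 30, 27, 8, 31]
N = 1000      # number of one-acre plots on the plantation
mu_x = 19.7   # known mean of the aerial estimates over all plots

n = len(y)
r = sum(y) / sum(x)   # same as y_bar / x_bar
mu_r = r * mu_x       # ratio estimate of the mean number of trees per acre

residuals = [yi - r * xi for yi, xi in zip(y, x)]
s_r2 = sum(e ** 2 for e in residuals) / (n - 1)

var_mu_r = (N - n) / N * s_r2 / n
sd_mu_r = sqrt(var_mu_r)
ci = (mu_r - 1.96 * sd_mu_r, mu_r + 1.96 * sd_mu_r)

print(r, round(mu_r, 2), round(s_r2, 2), round(sd_mu_r, 3))
print([round(e, 4) for e in residuals])
```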
• To find the sample size needed to estimate µy when the
ratio estimator is used.

247/468
Eurostat
• To find the sample size needed to estimate µy when the
ratio estimator is used.
• Let d denote the margin of error of the 100(1 − α)%
confidence interval for µy .

247/468
Eurostat
• To find the sample size needed to estimate µy when the
ratio estimator is used:
• Let d denote the margin of error of the 100(1 − α)%
confidence interval for µy .
• Then we know that:
$$
z_{1-\alpha/2} \cdot \sqrt{\frac{N - n}{N} \cdot \frac{s_r^2}{n}} = d.
$$
• Thus, the formula to compute the required sample size is:
$$
n = \frac{N \cdot z_{1-\alpha/2}^2 \cdot s_r^2}{z_{1-\alpha/2}^2 \cdot s_r^2 + N d^2}
$$
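As a minimal sketch, the sample size formula can be wrapped in a small Python function. The function name `ratio_sample_size` and the margin of error d = 1 used in the example call are assumptions chosen for illustration; s_r² = 4.2 and N = 1000 come from the tree example.

```python
import math


def ratio_sample_size(N, s_r2, d, z=1.96):
    """Smallest n whose CI margin of error is at most d.

    Hypothetical helper: n = N z^2 s_r^2 / (z^2 s_r^2 + N d^2), rounded up.
    """
    n = N * z ** 2 * s_r2 / (z ** 2 * s_r2 + N * d ** 2)
    return math.ceil(n)


# Tree example with an assumed margin of error of one tree per acre.
n_required = ratio_sample_size(N=1000, s_r2=4.2, d=1.0)
print(n_required)
```

Rounding up rather than to the nearest integer keeps the achieved margin of error at or below d.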
• This is an artificial small population example that we will
use to demonstrate how to compute the bias and MSE of the
ratio estimator.

Site i          1      2      3      4
Nets, x_i       4      5      8      5
Fish, y_i     200    300    500    400

• τx = 22, τy = 1400.
• Samples (s.r.s.): n = 2.

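Ratio estimates $\hat{\tau}_r = \frac{\bar{y}}{\bar{x}} \cdot \tau_x$ for the six possible samples:

Sample    τ̂_r
(1,2)     [(200 + 300)/2] / [(4 + 5)/2] · 22 = 1222
(1,3)     [(200 + 500)/2] / [(4 + 8)/2] · 22 = 1283
(1,4)     1467
(2,3)     1354
(2,4)     1540
(3,4)     1523

$$
E(\hat{\tau}_r) = \frac{1}{6} \cdot 1222 + \frac{1}{6} \cdot 1283 + \frac{1}{6} \cdot 1467 + \frac{1}{6} \cdot 1354 + \frac{1}{6} \cdot 1540 + \frac{1}{6} \cdot 1523 = 1398.17 \neq \tau_y = 1400
$$

Thus, there is a very slight bias.

$$
\mathrm{MSE} = \sum_{s=1}^{6} (\hat{\tau}_{r,s} - \tau_y)^2 \cdot P(s)
$$
$$
= (1222 - 1400)^2 \cdot \frac{1}{6} + (1283 - 1400)^2 \cdot \frac{1}{6} + (1467 - 1400)^2 \cdot \frac{1}{6} + (1354 - 1400)^2 \cdot \frac{1}{6} + (1540 - 1400)^2 \cdot \frac{1}{6} + (1523 - 1400)^2 \cdot \frac{1}{6} = 14{,}451.2
$$

Because the estimator is slightly biased, MSE ≠ Var: the MSE is the variance plus the squared bias.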
On the other hand, if one uses the expansion estimator $\hat{\tau} = N \cdot \bar{y}$:

Sample    τ̂ = N · ȳ
(1,2)     4 × (200 + 300)/2 = 1000
(1,3)     4 × (200 + 500)/2 = 1400
(1,4)     4 × (200 + 400)/2 = 1200
(2,3)     4 × (300 + 500)/2 = 1600
(2,4)     4 × (300 + 400)/2 = 1400
(3,4)     4 × (500 + 400)/2 = 1800

$$
E(\hat{\tau}) = \frac{1}{6} \cdot 1000 + \frac{1}{6} \cdot 1400 + \frac{1}{6} \cdot 1200 + \frac{1}{6} \cdot 1600 + \frac{1}{6} \cdot 1400 + \frac{1}{6} \cdot 1800 = 1400, \text{ unbiased.}
$$

$$
\mathrm{MSE} = \sum_{s=1}^{6} (\hat{\tau}_s - \tau_y)^2 \cdot P(s)
$$
$$
= (1000 - 1400)^2 \cdot \frac{1}{6} + (1400 - 1400)^2 \cdot \frac{1}{6} + (1200 - 1400)^2 \cdot \frac{1}{6} + (1600 - 1400)^2 \cdot \frac{1}{6} + (1400 - 1400)^2 \cdot \frac{1}{6} + (1800 - 1400)^2 \cdot \frac{1}{6} = 66{,}667
$$

66,667 is much larger than the MSE of τ̂r .
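The enumeration over all possible samples can be checked with a few lines of Python. This is a sketch under the assumptions of the example (four sites, simple random samples of size 2, each with probability 1/6); the helper name `summarize` is hypothetical. The slides round each ratio estimate to an integer before averaging, so the unrounded expectation and MSE printed here differ slightly from 1398.17 and 14,451.2.

```python
from itertools import combinations

x = [4, 5, 8, 5]            # nets per site
y = [200, 300, 500, 400]    # fish per site
N, n = len(y), 2
tau_x, tau_y = sum(x), sum(y)

# Every simple random sample of size n is equally likely.
samples = list(combinations(range(N), n))
p = 1 / len(samples)


def summarize(estimates):
    """Return (expected value, bias, MSE) over all possible samples."""
    expected = sum(p * e for e in estimates)
    mse = sum(p * (e - tau_y) ** 2 for e in estimates)
    return expected, expected - tau_y, mse


# Ratio estimator: (ybar / xbar) * tau_x, which equals (sum y / sum x) * tau_x.
ratio = [sum(y[i] for i in s) / sum(x[i] for i in s) * tau_x for s in samples]
# Expansion estimator: N * ybar.
expansion = [N * sum(y[i] for i in s) / n for s in samples]

for name, est in (("ratio", ratio), ("expansion", expansion)):
    expected, bias, mse = summarize(est)
    print(f"{name:9s} E = {expected:8.2f}  bias = {bias:6.2f}  MSE = {mse:9.1f}")
```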
Auxiliary data and regression
estimation
Unit learning outcomes

• Upon successful completion of this unit, you will be able to:

• know why and when to use regression estimates
• know how to check the condition to see whether one can
use the regression estimate
• compute the regression estimate and its estimated
variance
• compute a confidence interval based on the regression estimate
• see that the regression estimate performs better than
the expansion estimate when auxiliary data is useful
• see that the regression estimate performs better than
the ratio estimate when the condition for using the ratio
estimate is not satisfied
Subsection 1

Linear regression estimator

The idea behind regression estimation

• Looking at the data, how will we find things that will
work, or which model should we use?
• These are key questions.
• The variance of the estimators will be an important
indicator.

• When y is linearly related to the auxiliary variable x but
the line does not pass through the origin, a linear regression
estimator is appropriate.
• In addition, if multiple auxiliary variables have a linear
relationship with y , multiple regression estimates may be
appropriate.

• To estimate the mean and total of y -values, denoted as µ
and τ , one can use the linear relationship between y and
known x-values.
• Let us start with the basic regression equation:
$$
\hat{y} = a + bx.
$$
• Then
$$
b = \frac{s_{xy}}{s_x^2} \quad \text{and} \quad a = \bar{y} - b\bar{x}.
$$

• Then, to estimate the mean of y with $\hat{\mu}_L$, substitute
$x = \mu_x$ and $a = \bar{y} - b\bar{x}$:
$$
\hat{y} = a + bx
$$
$$
\hat{\mu}_L = a + b\mu_x = (\bar{y} - b\bar{x}) + b\mu_x = \bar{y} + b(\mu_x - \bar{x})
$$
• Note that even though $\hat{\mu}_L$ is not unbiased under simple
random sampling, it is roughly so (asymptotically
unbiased) for large samples.

• Thus, the mean square error of $\hat{\mu}_L$ is roughly estimated
by:
$$
\widehat{\mathrm{Var}}(\hat{\mu}_L) = \frac{N - n}{N \times n} \cdot \frac{\sum_{i=1}^{n} (y_i - a - b x_i)^2}{n - 2} = \frac{N - n}{N \times n} \cdot \mathrm{MSE}
$$
where MSE is the mean squared error of the linear regression of
y on x.

• Therefore, an approximate (1 − α)100% CI for µ is:
$$
\hat{\mu}_L \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\mu}_L)}
$$

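• It follows that:
$$
\hat{\tau}_L = N \cdot \hat{\mu}_L = N\bar{y} + b(\tau_x - N\bar{x})
$$
$$
\widehat{\mathrm{Var}}(\hat{\tau}_L) = N^2\, \widehat{\mathrm{Var}}(\hat{\mu}_L) = \frac{N \times (N - n)}{n} \cdot \mathrm{MSE}
$$
• And, an approximate (1 − α)100% CI for τ is:
$$
\hat{\tau}_L \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\tau}_L)}
$$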

Example

• A mathematics achievement test was given to 486
students prior to entering a certain college; they then took
a calculus class.
• A simple random sample of 10 students is selected
and their calculus scores are recorded.
• It is known that the average achievement test score for
the 486 students was 52.

[Figure: scatterplot of calculus score Y against achievement test score X]
Student    Test score (x_i)    Calculus score (y_i)

1          39                  65
2          43                  78
3          21                  52
4          64                  82
5          57                  92
6          47                  89
7          28                  73
8          75                  98
9          34                  56
10         52                  75

Mean       46                  76

• Using the results from the R output here, what do you get
for the regression estimate?
• ANSWER:
$$
\hat{\mu}_L = \bar{y} + b(\mu_x - \bar{x}) = 76 + 0.766 \times (52 - 46) = 80.6
$$
• The R output provides us with p-values for the constant
and the coefficient of X .
• We can see that both terms are significant.


• Now we can compute the variance and the confidence
interval.
• What is the variance of the regression estimate?
• ANSWER:
$$
\widehat{\mathrm{Var}}(\hat{\mu}_L) = \frac{N - n}{N \times n} \cdot \mathrm{MSE} = \frac{486 - 10}{486 \times 10} \times (8.704)^2 = 7.42
$$

• What is, then, an approximate 95% CI for µ?

• ANSWER:
$$
\hat{\mu}_L \pm z_{0.975} \sqrt{\widehat{\mathrm{Var}}(\hat{\mu}_L)} = 80.6 \pm 1.96 \times \sqrt{7.42} = 80.6 \pm 5.34
$$
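The regression estimate, its variance and the confidence interval can be recomputed directly from the ten sampled scores. The following Python sketch fits the line by least squares instead of reading the coefficients from the R output (which is not reproduced in these slides); the variable names are illustrative.

```python
import math

x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]   # achievement test scores
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]   # calculus scores
N, mu_x = 486, 52                              # population size, known mean of x

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares slope and intercept: b = s_xy / s_x^2, a = ybar - b * xbar.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
b = s_xy / s_x2
a = y_bar - b * x_bar

# Regression estimate of the mean and its estimated variance.
mu_L = y_bar + b * (mu_x - x_bar)
mse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
var_mu_L = (N - n) / (N * n) * mse
half_width = 1.96 * math.sqrt(var_mu_L)

print(f"b = {b:.3f}, mu_L = {mu_L:.1f}, Var = {var_mu_L:.2f}, "
      f"95% CI = {mu_L:.1f} +/- {half_width:.2f}")
```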

Subsection 2

Comparison of estimators

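• To compare the regression estimate with the sample mean ȳ
(which does not use the auxiliary information x), recall that:
$$
\widehat{\mathrm{Var}}(\bar{y}) = \frac{N - n}{N} \cdot \frac{s^2}{n}.
$$
• $s^2$ for the y values is $(15.11)^2$.
• What is $\widehat{\mathrm{Var}}(\bar{y})$?
$$
\widehat{\mathrm{Var}}(\bar{y}) = \frac{486 - 10}{486 \times 10} \cdot (15.11)^2 = 22.36
$$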

• Next, what is an approximate 95% CI for µ?
$$
\bar{y} \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\bar{y})} = 76 \pm 1.96 \times \sqrt{22.36} = 76 \pm 9.27
$$
• Recall: the 95% confidence interval using the regression
estimate is 80.6 ± 5.34, a much shorter confidence
interval.
• The regression estimate is more precise than ȳ .

• Additionally, we have another estimator that we can look
at: µ̂r .
• Compare µ̂L to the ratio estimator µ̂r .
• The next table contains the mean and standard deviation of
X and Y , together with the residuals yi − rxi and sr2 .

Student           Test score (x_i)    Calculus score (y_i)    y_i − r·x_i

1                 39                  65                      0.565
2                 43                  78                      6.957
3                 21                  52                      17.304
4                 64                  82                      -23.739
5                 57                  92                      -2.174
6                 47                  89                      11.348
7                 28                  73                      26.739
8                 75                  98                      -25.913
9                 34                  56                      -0.174
10                52                  75                      -10.913

Mean              46                  76
Std. deviation    16.58               15.11
s_r²                                                          283.42

• The ratio estimate is inappropriate for this example.
• However, just to show a counter example, we can
compute the variance of the ratio estimate using the
previous table data and compare this to the regression
estimate.

Note

• For the Calculus Scores example we should not use the
ratio estimator µ̂r because the p-value for the constant
term is 0.002.
• This implies that the regression line does not go through the
origin, and for this reason the ratio estimate is not appropriate.
• But for the purposes of a counter example we will work it
out here anyway:
$$
\hat{\mu}_r = r \mu_x = \frac{\bar{y}}{\bar{x}} \cdot \mu_x = \frac{76}{46} \cdot 52 = 85.91.
$$
• Next, we need to figure out the variance, and for this we
need the MSE under the ratio estimate. From the
previous table,
$$
s_r^2 = \frac{1}{10 - 1} \sum_{i=1}^{10} (y_i - r x_i)^2 = 283.42,
$$
which is huge!
• Now we can compute the variance:
$$
\widehat{\mathrm{Var}}(\hat{\mu}_r) = \frac{N - n}{N} \cdot \frac{s_r^2}{n} = \frac{486 - 10}{486} \cdot \frac{283.42}{10} = 27.75
$$

• Now we can compute a 95% confidence interval for µ:
$$
\hat{\mu}_r \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\mu}_r)} = 85.91 \pm 1.96 \times \sqrt{27.75} = 85.91 \pm 10.32
$$
• We can see that the ratio estimate is even worse than
µ̂ = ȳ when it is used in an inappropriate situation.
• The width of the interval is larger than the one for the
regression estimate.
• The moral of this story is, "Use the right model!".
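To see the comparison side by side, the following Python sketch recomputes the estimated variances of the sample mean and of the (inappropriate) ratio estimate from the same ten observations. The names are illustrative; small differences from the slide values come from rounding s = 15.11 on the slides.

```python
import math

x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]   # achievement test scores
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]   # calculus scores
N = 486
n = len(y)
fpc = (N - n) / N                              # finite population correction

# Sample mean (no auxiliary information).
y_bar = sum(y) / n
s2 = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
var_mean = fpc * s2 / n

# Ratio estimate (assumes a line through the origin, violated here).
r = sum(y) / sum(x)
s_r2 = sum((yi - r * xi) ** 2 for xi, yi in zip(x, y)) / (n - 1)
var_ratio = fpc * s_r2 / n

for name, var in (("sample mean", var_mean), ("ratio", var_ratio)):
    print(f"{name:12s} Var = {var:6.2f}  95% half-width = {1.96 * math.sqrt(var):5.2f}")
```

Both half-widths exceed the 5.34 obtained for the regression estimate, which matches the ordering shown on the slides.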

Stratified sampling
Some important information on this unit

• Upon successful completion of this lesson, you will be able
to:
• know why and when to use stratified sampling
• know how to estimate the mean and total when stratified
sampling is used
• compute confidence intervals for these estimates
• determine the optimal allocation of sample sizes
• compute estimates when post-stratification is used
• compute the variance for the estimates when
post-stratification is used
• provide estimates of a proportion from a stratified sample
Subsection 1

How to use stratified sampling

Introduction

In stratified sampling, the population is partitioned into
non-overlapping groups, called strata, and a sample is selected
by some design within each stratum.

• For example, geographical regions can be stratified into
similar regions by means of some known variable such as
habitat type, elevation or soil type.
• Another example might be to determine the proportions
of defective products being assembled in a factory. In this
case sampling may be stratified by production lines,
factory, etc.

• The principal reasons for using stratified random sampling
rather than simple random sampling include:
1. Stratification may produce a smaller error of estimation
than would be produced by a simple random sample of
the same size. This result is particularly true if
measurements within strata are very homogeneous.
2. The cost per observation in the survey may be reduced
by stratification of the population elements into
convenient groupings.

Example

• An advertising firm, interested in determining how much
to emphasize television advertising in a certain country,
decides to conduct a sample survey to estimate the
average number of hours each week that households
within that country watch television.
• The country has two towns, A and B, and a rural area C.
• Town A is built around a factory and most households
contain factory workers with school-aged children.
• Town B contains mainly retirees and the rural area C contains
mainly farmers.
• There are 155 households in town A, 62 in town B and 93
in the rural area, C.
• The firm decides to select 20 households from Town A, 8
households from Town B and 12 households from the
rural area.
• The data (hours of television per week) are given in the following table:

Town A          35, 43, 36, 39, 28, 28, 29, 25, 38, 27,
                26, 32, 29, 40, 35, 41, 37, 31, 45, 34
Town B          27, 15, 4, 41, 49, 25, 10, 30
Rural area C    8, 14, 12, 15, 30, 32, 21, 20, 34, 7, 11, 24

• Usually a sample is selected by some probability design
from each of the L strata in the population, with
selections in different strata independent of each other.
• The special case where from each stratum a simple
random sample is drawn is called a stratified random
sample.

• Does it make sense to use a stratified random sample for
this problem?
• Why or why not?
• Yes, for the reasons listed above.

• Notation
• L: the number of strata
• Nh : the number of units in stratum h
• nh : the number of units sampled from stratum h
• N: the total number of units in the population, i.e.,
N1 + N2 + ... + NL

For the TV example: L = 3, N1 = 155, N2 = 62, N3 = 93,
N = 155 + 62 + 93 = 310.

Some results are given in the following table:

Stratum         N_h         n_h        Mean     SD
Town A          N1 = 155    n1 = 20    33.90    5.95
Town B          N2 = 62     n2 = 8     25.12    15.25
Rural area C    N3 = 93     n3 = 12    19.00    9.36

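Estimating the population total

$$
\hat{\tau}_{st} = \sum_{h=1}^{L} \hat{\tau}_h.
$$

• The estimated total is the sum of the stratum estimates, where
$\hat{\tau}_h$ is an unbiased estimator for $\tau_h$.
• Since selections in different strata are independent, the
variance is:
$$
\mathrm{Var}(\hat{\tau}_{st}) = \sum_{h=1}^{L} \mathrm{Var}(\hat{\tau}_h) \quad \text{and} \quad \widehat{\mathrm{Var}}(\hat{\tau}_{st}) = \sum_{h=1}^{L} \widehat{\mathrm{Var}}(\hat{\tau}_h)
$$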
• The formulas are computed differently according to the
sampling scheme within each stratum.
• For stratified random sampling, i.e., a simple random
sample taken within each stratum:
$$
\hat{\tau}_h = N_h \bar{y}_h,
$$
$$
\widehat{\mathrm{Var}}(\hat{\tau}_{st}) = \sum_{h=1}^{L} N_h \cdot (N_h - n_h) \cdot \frac{s_h^2}{n_h},
$$
$$
s_h^2 = \frac{1}{n_h - 1} \sum_{i=1}^{n_h} (y_{hi} - \bar{y}_h)^2.
$$

• This turns out to be easy to remember,
and one can easily obtain the estimates for the population
mean:
$$
\hat{\mu}_{st} = \frac{\hat{\tau}_{st}}{N}, \qquad \widehat{\mathrm{Var}}(\hat{\mu}_{st}) = \frac{1}{N^2}\, \widehat{\mathrm{Var}}(\hat{\tau}_{st}).
$$

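Estimating the population mean

• For stratified random sampling:
$$
\bar{y}_{st} = \frac{1}{N} \sum_{h=1}^{L} N_h \bar{y}_h,
$$
$$
\widehat{\mathrm{Var}}(\bar{y}_{st}) = \sum_{h=1}^{L} \left( \frac{N_h}{N} \right)^2 \left( \frac{N_h - n_h}{N_h} \right) \frac{s_h^2}{n_h}.
$$
• $s_h$ is the sample standard deviation of stratum h, as defined
above.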
Example: estimating the mean

• Consider the TV Watching example.

• The overall mean for this example is:
$$
\bar{y}_{st} = \frac{1}{N} (N_1 \bar{y}_1 + N_2 \bar{y}_2 + N_3 \bar{y}_3) = \frac{1}{155 + 62 + 93} \left[ (155 \times 33.9) + (62 \times 25.12) + (93 \times 19.0) \right] = 27.7
$$

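The overall variance of the estimator of the mean for this example
is:
$$
\widehat{\mathrm{Var}}(\bar{y}_{st}) = \sum_{h=1}^{3} \left( \frac{N_h}{N} \right)^2 \left( \frac{N_h - n_h}{N_h} \right) \frac{s_h^2}{n_h}
$$
$$
= \frac{1}{(310)^2} \left[ (155)^2 \cdot \frac{155 - 20}{155} \cdot \frac{(5.95)^2}{20} + (62)^2 \cdot \frac{62 - 8}{62} \cdot \frac{(15.25)^2}{8} + (93)^2 \cdot \frac{93 - 12}{93} \cdot \frac{(9.36)^2}{12} \right] = 1.97
$$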
Example: estimating the population total

For the total hours watching TV example:
$$
\hat{\tau}_{st} = N \cdot \bar{y}_{st} = 310 \times 27.7 = 8587,
$$
$$
\widehat{\mathrm{Var}}(\hat{\tau}_{st}) = N^2\, \widehat{\mathrm{Var}}(\bar{y}_{st}) = (310)^2 \times 1.97 = 189317.
$$
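The stratified estimates can be recomputed from the raw household data with a short Python sketch. The dictionary layout and variable names are assumptions made for illustration. Since the slides round the stratum means and the variance before multiplying, the unrounded results (for example a total of about 8579 instead of 8587) differ slightly.

```python
import statistics

# Stratum size N_h and the sampled weekly TV hours for each stratum.
strata = {
    "Town A": (155, [35, 43, 36, 39, 28, 28, 29, 25, 38, 27,
                     26, 32, 29, 40, 35, 41, 37, 31, 45, 34]),
    "Town B": (62, [27, 15, 4, 41, 49, 25, 10, 30]),
    "Rural area C": (93, [8, 14, 12, 15, 30, 32, 21, 20, 34, 7, 11, 24]),
}

N = sum(N_h for N_h, _ in strata.values())

tau_st = 0.0       # estimated population total
var_tau_st = 0.0   # estimated variance of the total
for N_h, sample in strata.values():
    n_h = len(sample)
    tau_st += N_h * statistics.mean(sample)
    # statistics.variance uses the n_h - 1 denominator, i.e. s_h^2.
    var_tau_st += N_h * (N_h - n_h) * statistics.variance(sample) / n_h

y_st = tau_st / N                 # stratified mean
var_y_st = var_tau_st / N ** 2    # its estimated variance

print(f"mean = {y_st:.2f}, Var(mean) = {var_y_st:.2f}")
print(f"total = {tau_st:.1f}, Var(total) = {var_tau_st:.0f}")
```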

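Example: confidence intervals

• When all of the stratum sample sizes are large (at least 30), an
approximate 100(1 − α)% CI for τ is:
$$
\hat{\tau}_{st} \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\tau}_{st})}.
$$
• However, when the stratum sample sizes are smaller than
30, the interval should be based on the t distribution instead:
$$
\hat{\tau}_{st} \pm t_{(d;\,1-\alpha/2)} \sqrt{\widehat{\mathrm{Var}}(\hat{\tau}_{st})}.
$$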

• What are the degrees of freedom d for the t quantile used in this
formula for the confidence interval?
• Intuitively we would want this to be
(n1 − 1) + (n2 − 1) + ... + (nL − 1), and this is correct
when the variances of all strata are the same.

• But when this is not the case and we cannot pool the
degrees of freedom, we will need to use the Satterthwaite
approximation for the degrees of freedom as follows:
$$
d = \left( \sum_{h=1}^{L} a_h s_h^2 \right)^2 \Bigg/ \sum_{h=1}^{L} \frac{(a_h s_h^2)^2}{n_h - 1}, \qquad \text{where } a_h = \frac{N_h (N_h - n_h)}{n_h}.
$$
• In particular, when the Nh are all equal, the nh are all equal and
the sh2 are all equal, the d.f. = n − L.

For the TV example:
$$
a_1 = \frac{N_1 (N_1 - n_1)}{n_1} = \frac{155(155 - 20)}{20} = 1046.25,
$$
$$
a_2 = \frac{N_2 (N_2 - n_2)}{n_2} = \frac{62(62 - 8)}{8} = 418.5,
$$
$$
a_3 = \frac{N_3 (N_3 - n_3)}{n_3} = \frac{93(93 - 12)}{12} = 627.75.
$$
$$
d = \frac{(a_1 s_1^2 + a_2 s_2^2 + a_3 s_3^2)^2}{\dfrac{(a_1 s_1^2)^2}{n_1 - 1} + \dfrac{(a_2 s_2^2)^2}{n_2 - 1} + \dfrac{(a_3 s_3^2)^2}{n_3 - 1}}
$$
$$
= \frac{\left( 1046.25 \cdot (5.95)^2 + 418.5 \cdot (15.25)^2 + 627.75 \cdot (9.36)^2 \right)^2}{\dfrac{(1046.25 \cdot (5.95)^2)^2}{20 - 1} + \dfrac{(418.5 \cdot (15.25)^2)^2}{8 - 1} + \dfrac{(627.75 \cdot (9.36)^2)^2}{12 - 1}} = 21.09
$$
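The Satterthwaite degrees of freedom are tedious by hand, so a small Python sketch may help to verify them. The function name `satterthwaite_df` is a hypothetical helper; the inputs are the stratum sizes, sample sizes and standard deviations quoted in the TV example.

```python
def satterthwaite_df(N_h, n_h, s_h):
    """Satterthwaite approximation to the degrees of freedom.

    Uses a_h = N_h (N_h - n_h) / n_h and
    d = (sum a_h s_h^2)^2 / sum((a_h s_h^2)^2 / (n_h - 1)).
    """
    a_h = [N * (N - n) / n for N, n in zip(N_h, n_h)]
    terms = [a * s ** 2 for a, s in zip(a_h, s_h)]
    return sum(terms) ** 2 / sum(t ** 2 / (n - 1) for t, n in zip(terms, n_h))


d = satterthwaite_df(N_h=[155, 62, 93], n_h=[20, 8, 12], s_h=[5.95, 15.25, 9.36])
print(f"d = {d:.2f}")
```

When all strata share the same N_h, n_h and s_h, the function returns n − L, in line with the remark above.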

• Provide a 95% CI for µ and also a 95% CI for τ .
• ANSWER:
• We will use t with df = 21, hence a 95% CI for µ is:
$$
\bar{y}_{st} \pm t_{(21;\,1-\alpha/2)} \sqrt{\widehat{\mathrm{Var}}(\bar{y}_{st})} = 27.7 \pm 2.08 \times \sqrt{1.97} = 27.7 \pm 2.92
$$
Similarly, a 95% CI for τ is:
$$
\hat{\tau}_{st} \pm t_{(21;\,1-\alpha/2)} \sqrt{\widehat{\mathrm{Var}}(\hat{\tau}_{st})} = 8587 \pm 2.08 \times \sqrt{189317} = 8587 \pm 905.0
$$

306/468
Eurostat
Subsection 2

The stratification principle

Stratification principle

• If the only objective of stratification is to produce
estimators with small variances, then we want to stratify
such that, within each stratum, the units are as similar as
possible.
• In a survey of a human population, stratification may be
based on socioeconomic factors or geographic regions.

• For example, to estimate the average starting income for
recent young workers, it would make sense to stratify by
age group since the starting income for young workers of
the same age would be similar.
• Check the stratification principle in the following slides

Example: stratification principle

• Population is defined by dots in the figure
• Population values: 1, 2, 2, 3, 5, 6, 7, 8, 9, 9, 10, 11, 12,
13
• N = 14, µ = 7, σ² = 14.43.
Population: U

Strata    1      2      3      4
Data      1      5      8      11
          2      6      9      12
          2      7      9      13
          3             10
N_h       4      3      4      3
µ_h       2      6      9      12
σ_h²      1/2    2/3    1/2    2/3
Population: U∗

Strata    1        2        3        4
Data      2        3        2        1
          9        9        6        5
          10       13       7        8
                            12       11
N_h       3        3        4        4
µ_h       7        8.33     6.75     6.25
σ_h²      12.67    16.89    12.69    13.69

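• The population variance, σ², can be decomposed as:
$$
\sigma^2 = \sigma^2_{within} + \sigma^2_{between}
$$
where
$$
\sigma^2_{within} = \sum_{h=1}^{L} \frac{N_h}{N}\, \sigma_h^2, \qquad \sigma^2_{between} = \sum_{h=1}^{L} \frac{N_h}{N}\, (\mu_h - \mu)^2
$$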

• In the first stratification scheme (U):
• $\sigma^2_{within}$ = 0.57 (4% of σ²)
• $\sigma^2_{between}$ = 13.86 (96% of σ²)
• In the second stratification scheme (U∗):
• $\sigma^2_{within}$ = 13.87 (96% of σ²)
• $\sigma^2_{between}$ = 0.56 (4% of σ²)
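A minimal Python sketch of the decomposition, applied to both stratification schemes, is shown below. The function name `decompose` is hypothetical; variances are population variances (denominator N_h), as in the tables above.

```python
def decompose(strata):
    """Split the population variance into within- and between-strata parts."""
    values = [v for stratum in strata for v in stratum]
    N = len(values)
    mu = sum(values) / N
    within = between = 0.0
    for stratum in strata:
        N_h = len(stratum)
        mu_h = sum(stratum) / N_h
        var_h = sum((v - mu_h) ** 2 for v in stratum) / N_h  # population variance
        within += N_h / N * var_h
        between += N_h / N * (mu_h - mu) ** 2
    return within, between


U = [[1, 2, 2, 3], [5, 6, 7], [8, 9, 9, 10], [11, 12, 13]]
U_star = [[2, 9, 10], [3, 9, 13], [2, 6, 7, 12], [1, 5, 8, 11]]

for name, scheme in (("U", U), ("U*", U_star)):
    within, between = decompose(scheme)
    eta2 = between / (within + between)   # share of variance between strata
    print(f"{name:2s} within = {within:5.2f}  between = {between:5.2f}  eta^2 = {eta2:.2f}")
```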

• When a population is stratified, the total variance (σ²) is
decomposed into a variance component within strata
($\sigma^2_{within}$) and a component between strata ($\sigma^2_{between}$).
• These examples show that, although the total variance in
the population is a fixed value, different stratification
schemes split it differently between $\sigma^2_{within}$ and
$\sigma^2_{between}$.

• An indicator of how the total variance is split is the
correlation ratio, $\eta^2 = \sigma^2_{between} / \sigma^2$.
• Hence, in the first stratification scheme, η² = 0.96 shows
that the variance between strata is 96% of the total
variance of the population.

• In the second stratification scheme η 2 = 0.04. In this case
the variance between strata only represents 4% of the
total variance.
• The variance within strata represents the remaining 96%.
• These strata are much more heterogeneous (within) and
more similar to each other.
• We can conclude that the first stratification scheme is
better, since the estimation accuracy is higher when
strata are more homogeneous (within).

Allocation in stratified random sampling

• The question is, given a total sample size of n, how do we
allocate it among the L strata?
• The best allocation scheme is affected by the following
three factors:
1. the total number of elements in each stratum,
2. the variability of the measurements within each stratum,
and
3. the cost associated with obtaining an observation from
each stratum.
• If we don’t have all this information, but we know the
number of elements in each stratum, we can use a simple allocation.
• This is a proportional allocation that will maintain a
steady sampling fraction throughout the population:
$$
n_h = n \cdot \frac{N_h}{N}.
$$
• This does not take into consideration the variability
within each stratum and is not the optimal choice.
• If the cost of sampling from each stratum is the same,
then the optimal allocation (the allocation with the
lowest variance) is:
$$
n_h = n \cdot \frac{N_h \sigma_h}{\sum_{k=1}^{L} N_k \sigma_k}
$$
• However, suppose the cost of sampling differs from stratum to
stratum and the total cost is:
$$
c = c_0 + c_1 n_1 + c_2 n_2 + \dots + c_L n_L,
$$
where $c_0$ is the overhead cost and $c_h$ is the cost per unit for
stratum h.
• The optimal allocation is then:
$$
n_h = \frac{(c - c_0)\, N_h \sigma_h / \sqrt{c_h}}{\sum_{k=1}^{L} N_k \sigma_k \sqrt{c_k}}.
$$

• Remarks:
• The sample size is directly proportional to Nh and σh ,
i.e., allocate a larger sample size to the larger and more
variable stratum.
• The sample size is inversely proportional to $\sqrt{c_h}$, i.e., this
allocates smaller sample sizes to the more expensive
stratum.

• In order to use the optimal allocation, one must be able
to estimate σh
• Let’s take a look at this in the context of the TV
Example...

Back to TV example

• For the TV Example, suppose that before conducting the survey
the advertising firm has already estimated that
σ1 ≈ 5, σ2 ≈ 15, σ3 ≈ 10.
• Now, if the cost of obtaining an observation is about the
same for the three areas (e.g., telephone interview),
then what is the optimal allocation if they want to sample
40 households?

• Optimal allocation:
$$
n_h = n \cdot \frac{N_h \sigma_h}{\sum_{k=1}^{L} N_k \sigma_k},
$$
where,
• N1 = 155, σ1 ≈ 5
• N2 = 62, σ2 ≈ 15
• N3 = 93, σ3 ≈ 10

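• Then,
$$
n_1 = \frac{40 \times 155 \times 5}{155 \times 5 + 62 \times 15 + 93 \times 10} = 11.7647,
$$
$$
n_2 = \frac{40 \times 62 \times 15}{155 \times 5 + 62 \times 15 + 93 \times 10} = 14.1176,
$$
$$
n_3 = \frac{40 \times 93 \times 10}{155 \times 5 + 62 \times 15 + 93 \times 10} = 14.1176.
$$
• Thus we will choose n1 = 12, n2 = 14 and n3 = 14.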
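The three allocation rules can be compared with a short Python sketch. The function names are hypothetical, and the budget and unit costs in the last call are assumptions chosen so that the cost-based rule reduces to the equal-cost case.

```python
import math


def proportional_allocation(n, N_h):
    """n_h = n * N_h / N."""
    N = sum(N_h)
    return [n * Nh / N for Nh in N_h]


def optimal_allocation(n, N_h, sigma_h):
    """Equal-cost optimum: n_h = n * N_h sigma_h / sum(N_k sigma_k)."""
    weights = [Nh * s for Nh, s in zip(N_h, sigma_h)]
    total = sum(weights)
    return [n * w / total for w in weights]


def cost_allocation(c, c0, N_h, sigma_h, c_h):
    """Cost-based optimum: n_h = (c - c0) N_h sigma_h / sqrt(c_h) / sum(N_k sigma_k sqrt(c_k))."""
    denom = sum(Nk * sk * math.sqrt(ck) for Nk, sk, ck in zip(N_h, sigma_h, c_h))
    return [(c - c0) * Nh * sh / math.sqrt(ch) / denom
            for Nh, sh, ch in zip(N_h, sigma_h, c_h)]


N_h, sigma_h = [155, 62, 93], [5, 15, 10]
prop = proportional_allocation(40, N_h)
opt = optimal_allocation(40, N_h, sigma_h)
# Assumed costs: no overhead, one cost unit per interview, budget of 40 units.
cost = cost_allocation(c=40, c0=0, N_h=N_h, sigma_h=sigma_h, c_h=[1, 1, 1])

print("proportional:", [round(v, 2) for v in prop])
print("optimal:     ", [round(v, 4) for v in opt])
print("cost-based:  ", [round(v, 4) for v in cost])
```

Proportional allocation reproduces the 20, 8 and 12 households the firm originally selected, whereas the optimal allocation shifts sample from the homogeneous Town A to the more variable Town B.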

Subsection 3

Post-stratification

• Sometimes, we would like to stratify on a key variable but
cannot place the units into their correct strata until the
units are sampled.
• For instance, in a telephone interview the respondents
cannot be placed into a male or female stratum until after
the respondent is contacted.
• Post-stratification (stratification after the selection of a
sample) is often appropriate when a simple random
sample does not represent the strata in their population proportions.

Example

• We want to estimate the average weight and take a simple random sample of 100 people.
• Here is what was obtained:

                 Male                      Female
  Sample size    $n_1 = 20$                $n_2 = 80$
  Sample mean    $\bar{y}_1 = 180$ lbs     $\bar{y}_2 = 120$ lbs

  The overall sample mean is $\bar{y} = (20 \times 180 + 80 \times 120)/100 = 132$ lbs.

• This sample is obviously not balanced with respect to gender, and the overall mean of 132 lbs is likely an underestimate because males are under-represented in the data.
• How can we account for this?
• In the population, $N_1/N = 0.5$ and $N_2/N = 0.5$, so
$$ \bar{y}_{st} = \frac{N_1}{N}\bar{y}_1 + \frac{N_2}{N}\bar{y}_2 = 0.5 \cdot 180 + 0.5 \cdot 120 = 150. $$
• The algebraic form is the same as that of the stratified sample mean.

Post-stratification estimator variance

• But the post-stratification estimator $\bar{y}_{st}$ will not have the same variance as the stratified sample mean, since the sample sizes $n_h$ are random.
• Thus, the variance of the post-stratified $\bar{y}_{st}$ is the sum of the variance of the stratified sample mean under proportional allocation, $n_h = n\,N_h/N$, and a term that shows the increase one expects from post- rather than pre-stratification.

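More specifically,
$$ \mathrm{Var}(\bar{y}_{st}) \approx \frac{N-n}{nN} \sum_{h=1}^{L} \frac{N_h}{N}\,\sigma_h^2 + \frac{1}{n^2}\,\frac{N-n}{N-1} \sum_{h=1}^{L} \frac{N-N_h}{N}\,\sigma_h^2 . $$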

Example

• A firm knows that 40% of its accounts receivable are wholesale and 60% are retail.
• However, it is difficult to identify the type of an account without pulling its file and looking at it.
• An auditor randomly sampled 100 accounts without replacement. Here are the results of the sampling:

  Wholesale              Retail
  $n_1 = 70$             $n_2 = 30$
  $\bar{y}_1 = 520$      $\bar{y}_2 = 280$
  $s_1 = 210$            $s_2 = 90$

• Compute the post-stratified mean.
• ANSWER:
$$ \bar{y}_{st} = \frac{N_1}{N}\bar{y}_1 + \frac{N_2}{N}\bar{y}_2 = 0.4 \times 520 + 0.6 \times 280 = 376. $$

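• Compute the variance of the post-stratified mean.
• ANSWER: the number of accounts $N$ is not given, so the finite population corrections are taken to be approximately 1:
$$ \widehat{\mathrm{Var}}(\bar{y}_{st}) \approx \frac{1}{n}\left(\frac{N_1}{N}s_1^2 + \frac{N_2}{N}s_2^2\right) + \frac{1}{n^2}\left[\left(1-\frac{N_1}{N}\right)s_1^2 + \left(1-\frac{N_2}{N}\right)s_2^2\right] $$
$$ = \frac{1}{100}\left[0.4 \times (210)^2 + 0.6 \times (90)^2\right] + \frac{1}{100^2}\left[0.6 \times (210)^2 + 0.4 \times (90)^2\right] $$
$$ = 225 + 2.97 = 227.97. $$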
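
The two answers for the accounts receivable example can be reproduced as follows. This is a minimal sketch: the function names are illustrative, and the variance uses the same approximation as the worked answer, with finite population corrections ignored.

```python
def post_stratified_mean(weights, means):
    """Weighted mean using the known population shares W_h = N_h / N."""
    return sum(W * y for W, y in zip(weights, means))


def post_stratified_variance(n, weights, sds):
    """Approximate variance of the post-stratified mean.

    First term: stratified variance under proportional allocation.
    Second term: the increase due to post- rather than pre-stratification.
    Finite population corrections are ignored (assumption).
    """
    first = sum(W * s**2 for W, s in zip(weights, sds)) / n
    second = sum((1 - W) * s**2 for W, s in zip(weights, sds)) / n**2
    return first + second


weights = [0.4, 0.6]  # wholesale, retail shares in the population
y_post = post_stratified_mean(weights, [520, 280])
v_post = post_stratified_variance(100, weights, [210, 90])

print(f"{y_post:.0f}")  # 376
print(f"{v_post:.2f}")  # 227.97
```

Note that the sample shares (70% wholesale, 30% retail) play no role in the weights; only the known population shares do.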

Subsection 4

Further topics on stratification

Estimator properties

• It is not true that stratified random sampling always produces an estimator with a smaller variance than that from simple random sampling. Let's look at an example.
• The principal of a school for boys wants to estimate the average weight of the 7th grade boys in the school.
• There are 4 classes: 24 students in class 1, 36 in class 2, 30 in class 3, and 30 in class 4.

Example

• The principal has enough time and money to obtain data for 20 students, and because the cost of sampling is the same in each stratum, he decides to use proportional allocation.
• The sample allocation is $n_1 = 4$, $n_2 = 6$, $n_3 = 5$, and $n_4 = 5$.

• The data (in lbs) are given in the following table:

  Class      Weights of sampled students (lbs)
  Class 1    94, 90, 102, 110
  Class 2    91, 99, 93, 105, 111, 101
  Class 3    108, 96, 100, 93, 93
  Class 4    92, 110, 94, 91, 113

Here is a table that summarizes the data from each stratum:

  Stratum    $N_h$    $n_h$    Mean      SD
  Class 1    24       4        99.00     8.87
  Class 2    36       6        100.00    7.46
  Class 3    30       5        98.00     6.28
  Class 4    30       5        100.00    10.61
  All        120      20       99.30     7.73

• Calculate the stratified estimator $\bar{y}_{st}$.
• ANSWER: To estimate the average weight of the 7th grade boys:
$$ \bar{y}_{st} = \sum_{h=1}^{L} \frac{N_h}{N}\,\bar{y}_h = 99.3. $$

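• Calculate the variance of $\bar{y}_{st}$.
• ANSWER:
$$ \widehat{\mathrm{Var}}(\bar{y}_{st}) = \frac{1}{N^2} \sum_{h=1}^{4} N_h^2 \left(\frac{N_h - n_h}{N_h}\right) \frac{s_h^2}{n_h} $$
$$ = \frac{1}{120^2} \left[ (24)^2 \cdot \frac{5}{6} \cdot \frac{(8.87)^2}{4} + (36)^2 \cdot \frac{5}{6} \cdot \frac{(7.46)^2}{6} + (30)^2 \cdot \frac{5}{6} \cdot \frac{(6.28)^2}{5} + (30)^2 \cdot \frac{5}{6} \cdot \frac{(10.61)^2}{5} \right] $$
$$ = 2.93 $$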

For a 95% CI, we need Satterthwaite's formula to get the degrees of freedom:
$$ d = \frac{\left( \sum_{h=1}^{L} a_h s_h^2 \right)^2}{\sum_{h=1}^{L} \dfrac{(a_h s_h^2)^2}{n_h - 1}}, \qquad a_h = \frac{N_h (N_h - n_h)}{n_h}, $$
$$ a_1 = \frac{24(24 - 4)}{4} = 120, \qquad a_2 = \frac{36(36 - 6)}{6} = 180, $$
$$ a_3 = \frac{30(30 - 5)}{5} = 150, \qquad a_4 = \frac{30(30 - 5)}{5} = 150. $$
• Plugging into the formula, we get $d = 13.7576$.
• Round it down to 13, to be more conservative, and use df = 13.
• Then, an approximate 95% CI is:
$$ 99.3 \pm 2.160 \sqrt{2.93} = 99.3 \pm 3.697 $$
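
The whole chain of computations (stratified mean, estimated variance, Satterthwaite degrees of freedom, confidence interval) can be sketched from the raw data as follows. The variable names are illustrative, and the t quantile 2.160 is simply copied from the slide rather than computed.

```python
import math
from statistics import mean, variance

# Stratum size N_h and the sampled weights (lbs) of each class.
strata = {
    "Class 1": (24, [94, 90, 102, 110]),
    "Class 2": (36, [91, 99, 93, 105, 111, 101]),
    "Class 3": (30, [108, 96, 100, 93, 93]),
    "Class 4": (30, [92, 110, 94, 91, 113]),
}

N = sum(N_h for N_h, _ in strata.values())

# Stratified mean: sum over strata of (N_h / N) * ybar_h.
y_st = sum(N_h * mean(y) for N_h, y in strata.values()) / N

# Estimated variance: (1/N^2) * sum of N_h^2 * (N_h - n_h)/N_h * s_h^2 / n_h.
var_st = sum(
    N_h**2 * (N_h - len(y)) / N_h * variance(y) / len(y)
    for N_h, y in strata.values()
) / N**2

# Satterthwaite degrees of freedom, with a_h = N_h * (N_h - n_h) / n_h.
terms = [
    (N_h * (N_h - len(y)) / len(y) * variance(y), len(y))
    for N_h, y in strata.values()
]
d = sum(t for t, _ in terms) ** 2 / sum(t**2 / (n_h - 1) for t, n_h in terms)
df = math.floor(d)  # round down to be conservative

T_CRIT = 2.160  # t quantile for a 95% CI with 13 df, taken from the slide
half_width = T_CRIT * math.sqrt(var_st)

print(f"y_st = {y_st:.1f}, var = {var_st:.2f}, d = {d:.2f}, df = {df}")
print(f"95% CI: {y_st:.1f} +/- {half_width:.2f}")
```

Because the script uses unrounded stratum variances, its values for d and for the margin of error differ slightly in the later decimals from the slide figures, which are based on standard deviations rounded to two decimals; the degrees of freedom still round down to 13.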

• Looking back at the data, if we had used simple random sampling, would our CI have been tighter or looser?
• ANSWER:
$$ \widehat{\mathrm{Var}}(\bar{y}) = \left(\frac{N - n}{N}\right) \frac{s^2}{n} = \left(\frac{120 - 20}{120}\right) \frac{(7.73)^2}{20} = 2.49 $$

• Then, with df = 19, an approximate 95% CI is:
$$ 99.3 \pm 2.093 \sqrt{2.49} = 99.3 \pm 3.30 $$
Thus the margin of error is smaller and the confidence interval narrower than under the stratified analysis.
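
For comparison, the hypothetical simple random sampling analysis pools the 20 observations and ignores the classes. A minimal sketch, again with the t quantile (2.093 for 19 df) copied from the slide:

```python
import math
from statistics import mean, variance

# The same 20 weights (lbs), pooled as if they came from one simple random sample.
weights = [94, 90, 102, 110,
           91, 99, 93, 105, 111, 101,
           108, 96, 100, 93, 93,
           92, 110, 94, 91, 113]

N, n = 120, len(weights)

# SRS variance of the sample mean, with finite population correction.
var_srs = (N - n) / N * variance(weights) / n

T_CRIT = 2.093  # t quantile for a 95% CI with n - 1 = 19 df, taken from the slide
half_width = T_CRIT * math.sqrt(var_srs)

print(f"mean = {mean(weights):.1f}, var = {var_srs:.2f}")
print(f"95% CI: {mean(weights):.1f} +/- {half_width:.2f}")
```

The margin of error comes out at about 3.30, against about 3.70 for the stratified analysis of the same numbers.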

• Stratified random sampling usually performs better overall, because we usually stratify when the units within each stratum are more homogeneous than the population as a whole.
• Here, there is no reason to expect the classes to be more homogeneous in weight, and therefore no reason why this stratified random sample should be any better than a simple random sample.

• Since the data were collected by stratified sampling, the computation above, which treats them as a simple random sample (SRS), is the wrong way to compute the variance for this problem.
• How the variance is computed depends on the method by which the sample was taken.
• We did the computation only to illustrate that if, hypothetically, the same data had been obtained by SRS, the margin of error would be smaller.

Moral of this example

• Stratifying on class, which is not related to weight, does not result in smaller variances within the strata.
• On the other hand, if the stratification has other purposes, such as estimating the parameters of each subgroup, it still makes sense to stratify, even though the purpose is not to obtain estimates with smaller variance.
• In this particular example, stratification may be relevant if we also want to estimate the average weight for each class.

Stratified sampling and proportions

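$$ \hat{p}_{st} = \frac{1}{N} \sum_{h=1}^{L} N_h \hat{p}_h . $$

$$ \widehat{\mathrm{Var}}(\hat{p}_{st}) = \frac{1}{N^2} \sum_{h=1}^{L} N_h^2 \, \widehat{\mathrm{Var}}(\hat{p}_h) = \frac{1}{N^2} \sum_{h=1}^{L} N_h^2 \left(\frac{N_h - n_h}{N_h}\right) \frac{\hat{p}_h (1 - \hat{p}_h)}{n_h - 1} $$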

Example

• The advertising firm wants to estimate the proportion of households in the county that view the television show "American Idol".
• $N_1 = 155$, $N_2 = 62$, $N_3 = 93$.
• As before, we stratify by town, and the sample results are:

  Stratum         Sample size    $\hat{p}_h$
  Town A          $n_1 = 20$     16/20 = 0.80
  Town B          $n_2 = 8$      2/8 = 0.25
  Rural area C    $n_3 = 12$     6/12 = 0.50

• Plugging in the values, we get:
$$ \hat{p}_{st} = \frac{1}{N} \sum_{h=1}^{L} N_h \hat{p}_h = \frac{155}{310} \cdot 0.8 + \frac{62}{310} \cdot 0.25 + \frac{93}{310} \cdot 0.5 = 0.6 $$

The estimated variances for the three strata are:
$$ \widehat{\mathrm{Var}}(\hat{p}_1) = \frac{N_1 - n_1}{N_1} \cdot \frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1 - 1} = \frac{155 - 20}{155} \cdot \frac{0.8(0.2)}{19} = 0.007 $$
$$ \widehat{\mathrm{Var}}(\hat{p}_2) = \frac{N_2 - n_2}{N_2} \cdot \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2 - 1} = \frac{62 - 8}{62} \cdot \frac{0.25(0.75)}{7} = 0.023 $$
$$ \widehat{\mathrm{Var}}(\hat{p}_3) = \frac{N_3 - n_3}{N_3} \cdot \frac{\hat{p}_3 (1 - \hat{p}_3)}{n_3 - 1} = \frac{93 - 12}{93} \cdot \frac{0.5(0.5)}{11} = 0.020 $$
• Compute the estimated variance of the stratified proportion.
• ANSWER:
$$ \widehat{\mathrm{Var}}(\hat{p}_{st}) = \frac{1}{(310)^2} \left[ (155)^2 (0.007) + (62)^2 (0.023) + (93)^2 (0.020) \right] = 0.0045 $$
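
The stratified proportion and its estimated variance can be computed in a few lines. This is a sketch under the formulas given above; the function name `stratified_proportion` is only illustrative.

```python
def stratified_proportion(sizes, sample_sizes, p_hats):
    """Return the stratified proportion estimate and its estimated variance."""
    N = sum(sizes)
    p_st = sum(N_h * p for N_h, p in zip(sizes, p_hats)) / N
    var = sum(
        # N_h^2 * finite population correction * p(1 - p) / (n_h - 1)
        N_h**2 * (N_h - n_h) / N_h * p * (1 - p) / (n_h - 1)
        for N_h, n_h, p in zip(sizes, sample_sizes, p_hats)
    ) / N**2
    return p_st, var


# Town A, Town B, rural area C.
p_st, var_p = stratified_proportion(
    sizes=[155, 62, 93],
    sample_sizes=[20, 8, 12],
    p_hats=[16 / 20, 2 / 8, 6 / 12],
)

print(f"p_st = {p_st:.2f}")          # p_st = 0.60
print(f"variance = {var_p:.4f}")     # variance = 0.0045
```

The script keeps the stratum variances unrounded, so the final figure agrees with the hand calculation only to the four decimals shown.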

Cluster sampling and systematic sampling
Unit learning outcomes

• Upon successful completion of this lesson, you will be able to:
• know why and when to use cluster sampling
• know the notation for cluster and systematic sampling
• know what primary units and secondary units are
• compute the unbiased estimator for cluster samples when primary units are selected by simple random sampling (SRS)
• compute the ratio estimator for cluster samples when primary units are selected by SRS
• compute the Hansen-Hurwitz estimator for cluster samples when primary units are selected with probability proportional to size (PPS)

Subsection 1

Introduction

Cluster versus systematic sampling

• On the surface, systematic and cluster sampling are very different.
• In fact, the two designs share the same structure: the population is partitioned into primary units, each primary unit being composed of secondary units.
• Whenever a primary unit is included in the sample, the y-values of every secondary unit within it are observed.

• Example: a one-in-three systematic sample, where we randomly pick one of the first three units and then select every third unit after it.
• Randomly pick a value from {1, 2, 3}.
• For example, if 2 is chosen, then we pick units {2, 5, 8, 11, 14} (the units marked with x's).
• The set {2, 5, 8, 11, 14} is an example of a primary unit.
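
The selection step above can be sketched as follows. The population size of 15 units is an assumption made for illustration (the slide only lists units up to 14), and the function name `systematic_sample` is hypothetical.

```python
import random


def systematic_sample(N, k, start=None):
    """Return the labels (1..N) of a 1-in-k systematic sample.

    If no start is given, it is drawn at random from 1..k.
    """
    if start is None:
        start = random.randint(1, k)
    return list(range(start, N + 1, k))


N, k = 15, 3  # assumed population size; 1-in-3 sampling

print(systematic_sample(N, k, start=2))  # [2, 5, 8, 11, 14]

# The k possible samples are exactly the primary units: they partition the population.
primary_units = [systematic_sample(N, k, start=s) for s in range(1, k + 1)]
print(primary_units)
```

Drawing the random start is therefore the same as selecting one primary unit at random and observing every secondary unit in it.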

• It is not uncommon to have a systematic sample of size 1, such as the 1-in-3 systematic sample above: we sample just one primary unit.
• [Figures: two example configurations of primary units. The configuration described here has 50 primary units (PSUs); the colored rectangle is an example of a primary unit.]

• Primary units (PSUs) may be different from observation units.
• One can view systematic sampling as a sampling of primary units.
• Once a primary unit is selected, the whole cluster of secondary units within it is selected as well.

Advantages of systematic sampling

• Easier to perform in the field, especially if a good frame is not available.
• Frequently provides more information per unit cost than simple random sampling, in the sense of smaller variances.

• For example, suppose a systematic sample is drawn from a batch of computer chips, taken in order of production.
• The first 400 chips are fine but, due to a fault of the machine, the last 300 chips are defective.
• Systematic sampling selects uniformly over the defective and non-defective items and gives a very accurate estimate of the fraction of defective items.
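
The gain in the chip example can be made concrete by enumerating every possible systematic sample. This is only a sketch: the batch size of 700 follows from the example, but the sampling interval k = 7 (100 chips per sample) is an assumption chosen for illustration.

```python
N, k = 700, 7   # batch size from the example; k = 7 is an assumed interval
n = N // k      # 100 chips in each systematic sample

# Production order: 400 good chips (0) followed by 300 defective chips (1).
defective = [0] * 400 + [1] * 300
p_true = sum(defective) / N

# One estimate per possible random start, i.e. per primary unit.
estimates = [sum(defective[s::k]) / n for s in range(k)]

# Design variance of the systematic estimator: each start has probability 1/k.
var_sys = sum((p - p_true) ** 2 for p in estimates) / k

# Variance of the sample proportion under SRS without replacement, same n.
var_srs = (N - n) / N * (N / (N - 1)) * p_true * (1 - p_true) / n

print(sorted(set(estimates)))                        # [0.42, 0.43]
print(f"true fraction defective = {p_true:.4f}")
print(f"systematic: {var_sys:.2e}, SRS: {var_srs:.2e}")
```

Every possible systematic sample gives 0.42 or 0.43, close to the true fraction 3/7, so the variance of the systematic estimator is more than a hundred times smaller than the SRS variance for the same sample size in this ordered population.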

Cluster sampling