Model Assisted
Survey Sampling

Springer Series in Statistics
Andersen/Borgan/Gill/Keiding: Statistical Models Based on Counting Processes.
Atkinson/Riani: Robust Diagnostic Regression Analysis.
Berger: Statistical Decision Theory and Bayesian Analysis, 2nd edition.
Borg/Groenen: Modern Multidimensional Scaling: Theory and Applications.
Brockwell/Davis: Time Series: Theory and Methods, 2nd edition.
Chan/Tong: Chaos: A Statistical Perspective.
Chen/Shao/Ibrahim: Monte Carlo Methods in Bayesian Computation.
David/Edwards: Annotated Readings in the History of Statistics.
Devroye/Lugosi: Combinatorial Methods in Density Estimation.
Efromovich: Nonparametric Curve Estimation: Methods, Theory, and Applications.
Eggermont/LaRiccia: Maximum Penalized Likelihood Estimation, Volume I:
Density Estimation.
Fahrmeir/Tutz: Multivariate Statistical Modelling Based on Generalized Linear
Models, 2nd edition.
Fan/Yao: Nonlinear Time Series: Nonparametric and Parametric Methods.
Farebrother: Fitting Linear Relationships: A History of the Calculus of Observations
1750-1900.
Federer: Statistical Design and Analysis for Intercropping Experiments, Volume I:
Two Crops.
Federer: Statistical Design and Analysis for Intercropping Experiments, Volume II:
Three or More Crops.
Ghosh/Ramamoorthi: Bayesian Nonparametrics.
Glaz/Naus/Wallenstein: Scan Statistics.
Good: Permutation Tests: A Practical Guide to Resampling Methods for Testing
Hypotheses, 2nd edition.
Gouriéroux: ARCH Models and Financial Applications.
Gu: Smoothing Spline ANOVA Models.
Györfi/Kohler/Krzyżak/Walk: A Distribution-Free Theory of Nonparametric
Regression.
Haberman: Advanced Statistics, Volume I: Description of Populations.
Hall: The Bootstrap and Edgeworth Expansion.
Härdle: Smoothing Techniques: With Implementation in S.
Harrell: Regression Modeling Strategies: With Applications to Linear Models,
Logistic Regression, and Survival Analysis.
Hart: Nonparametric Smoothing and Lack-of-Fit Tests.
Hastie/Tibshirani/Friedman: The Elements of Statistical Learning: Data Mining,
Inference, and Prediction.
Hedayat/Sloane/Stufken: Orthogonal Arrays: Theory and Applications.
Heyde: Quasi-Likelihood and its Application: A General Approach to Optimal
Parameter Estimation.
Huet/Bouvier/Poursat/Jolivet: Statistical Tools for Nonlinear Regression: A Practical
Guide with S-PLUS and R Examples, 2nd edition.
Ibrahim/Chen/Sinha: Bayesian Survival Analysis.
Jolliffe: Principal Component Analysis, 2nd edition.
(continued after index)
Carl-Erik Särndal    Bengt Swensson
Jan Wretman
Model Assisted
Survey Sampling
Springer
Carl-Erik Särndal                    Bengt Swensson
Statistics Sweden                    Department of Data Analysis
Klostergatan 23                      University of Örebro
SE-701 89 Örebro                     701 30 Örebro
Sweden                               Sweden

Jan Wretman
Department of Statistics
Stockholm University
106 91 Stockholm
Sweden

The work on this book was supported in part by Statistics Sweden.
With 5 illustrations.
Mathematics Subject Classification: 62D05
Library of Congress Cataloging-in-Publication Data
Särndal, Carl-Erik, 1937-
  Model assisted survey sampling / Carl-Erik Särndal, Bengt
  Swensson, Jan Wretman.
    p. cm. (Springer series in statistics)
  Includes bibliographical references and indexes.
  1. Sampling (Statistics). I. Swensson, Bengt. II. Wretman, Jan
  Håkan, 1939- . III. Title. IV. Series.
  QA276.6837 1991
  001.4'222 dc20          91-7854
ISBN 0-387-40620-4    Printed on acid-free paper.
First softcover printing, 2003.
© 1992 Springer-Verlag New York, Inc.
All rights reserved. This work may not be translated or copied in whole or in part without the written
permission of the publisher (Springer-Verlag New York, Inc., 175 Fifth Avenue, New York, NY
10010, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in
connection with any form of information storage and retrieval, electronic adaptation, computer
software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they
are not identified as such, is not to be taken as an expression of opinion as to whether or not they are
subject to proprietary rights.
Printed in the United States of America.
9 8 7 6 5 4 3 2 1    SPIN 10947609
www.springer-ny.com
Springer-Verlag New York Berlin Heidelberg
A member of BertelsmannSpringer Science+Business Media GmbH
Preface
This text on survey sampling contains both basic and advanced material. The
main theme is estimation in surveys. Other books of this kind exist, but most
were written before the recent rapid advances. This book has four important
objectives:
1. To develop the central ideas in survey sampling from the unified perspec-
tive of unequal probability sampling. In a majority of surveys, the sam-
pling units have different probabilities of selection, and these probabilities
play a crucial role in estimation.
2. To write a basic sampling text that, unlike its predecessors, is guided by
statistical modeling in the derivation of estimators. The model assisted
approach in this book clarifies and systematizes the use of auxiliary vari-
ables, which is an important feature of survey design.
3. To cover significant recent developments in special areas such as analysis
of survey data, domain estimation, variance estimation, methods for non-
response, and measurement error models.
4. To provide opportunity for students to practice sampling techniques on
real data. We provide numerous exercises concerning estimation for real
(albeit small) populations described in the appendices.
This book grew in part out of our work as leaders of survey methodology
development projects at Statistics Sweden. In supervising younger colleagues,
we repeatedly found it more fruitful to stress a few important general prin-
ciples than to consider every selection scheme and every estimator formula as
a separate estimation problem. We emphasize a general approach.
This book will be useful in teaching basic, as well as more advanced, univer-
sity courses in survey sampling. Our suggestions for structuring such courses
are given below. The material has been tested in our own courses in Montréal,
Örebro, and Stockholm. Also, this book will provide a good source of in-
formation for persons engaged in practical survey work or in survey meth-
odology research.
The theory and methods discussed in this book have their primary field of
application in the large surveys typically carried out by government statistical
agencies and major survey institutes. Such organizations have the resources
to collect data for the large probability samples drawn by the complex survey
designs that are often required. However, the issues and the methods dis-
cussed in the book are also relevant to smaller surveys.
Statistical modeling has strongly influenced survey sampling theory in re-
cent years. In this book, sampling theory is assisted by modeling. It becomes
simple to explain how the auxiliary information available in a given survey
will lead to a particular estimation technique. The teaching of sampling and
the style of presentation in journal articles on sampling have changed a great
deal by this new emphasis. Readers of this book will become familiar with this
new style.
We use the randomization theory, or design-based, point of view. This is
the traditional mode of inference in surveys, ever since the sampling theory
breakthroughs in the 1930s and 1940s. The reasoning is familiar to survey
statisticians in government and elsewhere.
A body of basic and more advanced knowledge is defined that is useful for
the survey sampling methodologist, and a broad range of topics is covered,
without going into great detail on any one subject. Some topics to which this
book devotes a single chapter have been the subject of specialized treatises.
The material should be accessible to a wide audience. The presentation is
rich in statistical concepts, but the mathematical level is not advanced. We
assume that readers of this book have been exposed to at least one course in
statistical inference, covering principles of point estimation and confidence
intervals. Some familiarity with statistical linear models, in particular regres-
sion analysis, is also desirable. A previous exposure to finite population sam-
pling theory is not required for otherwise well-prepared readers. Some prior
knowledge of sampling techniques is, of course, an advantage.
A collection of exercises is placed at the end of each chapter. The under-
standing of sampling theory is facilitated by analyzing data. Some of the
exercises involve sampling and analysis of data from three populations of
Swedish municipalities, named MU284, MU281, and MU200, and one popu-
jation of countries named CO124. These populations are necessarily small in
comparison to populations in real-world surveys, but the issues invoked by
the exercises are real. Appendices B, C, and D present the populations. Other
exercises are theoretical; some of them ask the reader to derive an expression
or verify an assertion in the text.
There are various ways in which the book can be used for teaching courses
in survey sampling. A first (one-semester or one-quarter) course can be based
on the following chapters: 1, 2, 3, parts of 4, 5, 6, 7, and 8, and, at the instruc-
tor’s discretion, a selection of material from later chapters. To mention at
least a few of the issues in chapters 10, 14, and 15 seems particularly impor-
tant in a first course. A second course (one-semester or one-quarter) may use
topics from chapters 4 to 7 not covered in the first course, followed by a
selection of material from chapters 8 to 17.
Certain sections usually placed toward the end of a chapter provide further
detail on ideas or derivations presented earlier. Examples are sections 3.8,
5.12, and 6.8. Such sections are not essential for the main flow of the argu-
ment, and they may be omitted in a first as well as in a second course.
The Monte Carlo simulations and other computation for this book were
carried out on an Apple Macintosh II, using MicroAPL's APL.68000-
Macintosh II 68020 + 68881 version 7.20A.
Statistics Sweden generously supported this project. We are truly grate-
ful for their cooperation as well as for support from University of Örebro,
Stockholm University, Université de Montréal, Svenska Handelsbanken, and
the Natural Sciences and Engineering Research Council of Canada.
Many individuals helped and encouraged us during the work on this book.
In particular, we are indebted to Christer Arvas and Lars Lyberg for their
supportive view of the project, to Jean-Claude Deville, Eva Elvers, Michael
Hidiroglou, Klaus Krickeberg, and Ingrid Lyberg for critical appraisal of
some of the chapters, to Kerstin Averbick, Bibi Thunell and the late Sivan
Rosén for typing, and to Patricia Dean and Nathalie Gaudet for editorial
assistance. We have benefitted from discussions with survey statisticians at
Statistics Canada and Statistics Sweden.
Montréal      Carl-Erik Särndal
Örebro        Bengt Swensson
Stockholm     Jan Wretman

Contents
Preface
PART I
Principles of Estimation for Finite Populations and Important
Sampling Designs
CHAPTER 1
Survey Sampling in Theory and Practice
1.1 Surveys in Society
1.2 Skeleton Outline of a Survey
1.3 Probability Sampling
1.4 Sampling Frame
1.5 Area Frames and Similar Devices
1.6 Target Population and Frame Population
1.7 Survey Operations and Associated Sources of Error
1.8 Planning a Survey and the Need for Total Survey Design
1.9 Total Survey Design
1.10 The Role of Statistical Theory in Survey Sampling
Exercises
CHAPTER 2
Basic Ideas in Estimation from Probability Samples
2.1 Introduction
2.2 Population, Sample, and Sample Selection
2.3 Sampling Design
2.4 Inclusion Probabilities
2.5 The Notion of a Statistic
2.6 The Sample Membership Indicators
2.7 Estimators and Their Basic Statistical Properties
2.8 The π Estimator and Its Properties
2.9 With-Replacement Sampling
2.10 The Design Effect
2.11 Confidence Intervals
Exercises
CHAPTER 3
Unbiased Estimation for Element Sampling Designs
3.1 Introduction
3.2 Bernoulli Sampling
3.3 Simple Random Sampling
3.3.1 Simple Random Sampling without Replacement
3.3.2 Simple Random Sampling with Replacement
3.4 Systematic Sampling
3.4.1 Definitions and Main Result
3.4.2 Controlling the Sample Size
3.4.3 The Efficiency of Systematic Sampling
3.4.4 Estimating the Variance
3.5 Poisson Sampling
3.6 Probability Proportional-to-Size Sampling
3.6.1 Introduction
3.6.2 πps Sampling
3.6.3 pps Sampling
3.6.4 Selection from Randomly Formed Groups
3.7 Stratified Sampling
3.7.1 Introduction
3.7.2 Notation, Definitions, and Estimation
3.7.3 Optimum Sample Allocation
3.7.4 Alternative Allocations under STSI Sampling
3.8 Sampling without Replacement versus Sampling with Replacement
3.8.1 Alternative Estimators for Simple Random Sampling with
Replacement
3.8.2 The Design Effect of Simple Random Sampling with Replacement
Exercises
CHAPTER 4
Unbiased Estimation for Cluster Sampling and Sampling in Two
or More Stages
4.1 Introduction
4.2 Single-Stage Cluster Sampling
4.2.1 Introduction
4.2.2 Simple Random Cluster Sampling
4.3 Two-Stage Sampling
4.3.1 Introduction
4.3.2 Two-Stage Element Sampling
4.4 Multistage Sampling
4.4.1 Introduction and a General Result
4.4.2 Three-Stage Element Sampling
4.5 With-Replacement Sampling of PSUs
4.6 Comparing Simplified Variance Estimators in Multistage Sampling
Exercises
CHAPTER 5
Introduction to More Complex Estimation Problems
5.1 Introduction
5.2 The Effect of Bias on Confidence Statements
5.3 Consistency and Asymptotic Unbiasedness
5.4 π Estimators for Several Variables of Study
5.5 The Taylor Linearization Technique for Variance Estimation
5.6 Estimation of a Ratio
5.7 Estimation of a Population Mean
5.8 Estimation of a Domain Mean
5.9 Estimation of Variances and Covariances in a Finite Population
5.10 Estimation of Regression Coefficients
5.10.1 The Parameters of Interest
5.10.2 Estimation of the Regression Coefficients
5.11 Estimation of a Population Median
5.12 Demonstration of Result 5.10.1
Exercises
PART II
Estimation through Linear Modeling, Using Auxiliary Variables
CHAPTER 6
The Regression Estimator
6.1 Introduction
6.2 Auxiliary Variables
6.3 The Difference Estimator
6.4 Introducing the Regression Estimator
6.5 Alternative Expressions for the Regression Estimator
6.6 The Variance of the Regression Estimator
6.7 Comments on the Role of the Model
6.8 Optimal Coefficients for the Difference Estimator
Exercises
CHAPTER 7
Regression Estimators for Element Sampling Designs
7.1 Introduction
7.2 Preliminary Considerations
7.3 The Common Ratio Model and the Ratio Estimator
7.3.1 The Ratio Estimator under SI Sampling
7.3.2 The Ratio Estimator under Other Designs
7.3.3 Optimal Sampling Design for the π Weighted Ratio Estimator
7.3.4 Alternative Ratio Models
7.4 The Common Mean Model
7.5 Models Involving Population Groups
7.6 The Group Mean Model and the Poststratified Estimator
7.7 The Group Ratio Model and the Separate Ratio Estimator
7.8 Simple Regression Models and Simple Regression Estimators
7.9 Estimators Based on Multiple Regression Models
7.9.1 Multiple Regression Models
7.9.2 Analysis of Variance Models
7.10 Conditional Confidence Intervals
7.10.1 Conditional Analysis for BE Sampling
7.10.2 Conditional Analysis for the Poststratification Estimator
7.11 Regression Estimators for Variable-Size Sampling Designs
7.12 A Class of Regression Estimators
7.13 Regression Estimation of a Ratio of Population Totals
Exercises
CHAPTER 8
Regression Estimators for Cluster Sampling and Two-Stage Sampling
8.1 Introduction
8.2 The Nature of the Auxiliary Information When Clusters of Elements Are Selected
8.3 Comments on Variance and Variance Estimation in Two-Stage Sampling
8.4 Regression Estimators Arising Out of Modeling at the Cluster Level
8.5 The Common Ratio Model for Cluster Totals
8.6 Estimation of the Population Mean When Clusters Are Sampled
8.7 Design Effects for Single-Stage Cluster Sampling
8.8 Stratified Clusters and Poststratified Clusters
8.9 Regression Estimators Arising Out of Modeling at the Element Level
8.10 Ratio Models for Elements
8.11 The Group Ratio Model for Elements
8.12 The Ratio Model Applied within a Single PSU
Exercises
PART III
Further Questions in Design and Analysis of Surveys
CHAPTER 9
Two-Phase Sampling
9.1 Introduction
9.2 Notation and Choice of Estimator
9.3 The π* Estimator
9.4 Two-Phase Sampling for Stratification
9.5 Auxiliary Variables for Selection in Two Phases
9.6 Difference Estimators
9.7 Regression Estimators for Two-Phase Sampling
9.8 Stratified Bernoulli Sampling in Phase Two
9.9 Sampling on Two Occasions
9.9.1 Estimating the Current Total
9.9.2 Estimating the Previous Total
9.9.3 Estimating the Absolute Change and the Sum of the Totals
Exercises
CHAPTER 10
Estimation for Domains
10.1 Introduction
10.2 The Background for Domain Estimation
10.3 The Basic Estimation Methods for Domains
10.4 Conditioning on the Domain Sample Size
10.5 Regression Estimators for Domains
10.6 A Ratio Model for Each Domain
10.7 Group Models for Domains
10.8 Problems Arising for Small Domains; Synthetic Estimation
10.9 More on the Comparison of Two Domains
Exercises
CHAPTER 11
Variance Estimation
11.1 Introduction
11.2 A Simplified Variance Estimator under Sampling without Replacement
11.3 The Random Groups Technique
11.3.1 Independent Random Groups
11.3.2 Dependent Random Groups
11.4 Balanced Half-Samples
11.5 The Jackknife Technique
11.6 The Bootstrap
11.7 Concluding Remarks
Exercises
CHAPTER 12
Searching for Optimal Sampling Designs
12.1 Introduction
12.2 Model-Based Optimal Design for the General Regression Estimator
12.3 Model-Based Optimal Design for the Group Mean Model
12.4 Model-Based Stratified Sampling
12.5 Applications of Model-Based Stratification
12.6 Other Approaches to Efficient Stratification
12.7 Allocation Problems in Stratified Random Sampling
12.8 Allocation Problems in Two-Stage Sampling
12.8.1 The π Estimator of the Population Total
12.8.2 Estimation of the Population Mean
12.9 Allocation in Two-Phase Sampling for Stratification
12.10 A Further Comment on Mathematical Programming
12.11 Sampling Design and Experimental Design
Exercises
CHAPTER 13
Further Statistical Techniques for Survey Data
13.1 Introduction
13.2 Finite Population Parameters in Multivariate Regression and
Correlation Analysis
13.3 The Effect of Sampling Design on a Statistical Analysis
13.4 Variances and Estimated Variances for Complex Analyses
13.5 Analysis of Categorical Data for Finite Populations
13.5.1 Test of Homogeneity for Two Populations
13.5.2 Testing Homogeneity for More than Two Finite Populations
13.5.3 Discussion of Categorical Data Tests for Finite Populations
13.6 Types of Inference When a Finite Population Is Sampled
Exercises
PART IV
A Broader View of Errors in Surveys
CHAPTER 14
Nonsampling Errors and Extensions of Probability Sampling Theory
14.1 Introduction
14.2 Historic Notes: The Evolution of the Probability Sampling Approach
14.3 Measurable Sampling Designs
14.4 Some Nonprobability Sampling Methods
14.5 Model-Based Inference from Survey Samples
14.6 Imperfections in the Survey Operations
14.6.1 Ideal Conditions for the Probability Sampling Approach
14.6.2 Extension of the Probability Sampling Approach
14.7 Sampling Frames
14.7.1 Frame Imperfections
14.7.2 Estimation in the Presence of Frame Imperfections
14.7.3 Multiple Frames
14.7.4 Frame Construction and Maintenance
14.8 Measurement and Data Collection
14.9 Data Processing
14.10 Nonresponse
Exercises
CHAPTER 15
Nonresponse
15.1 Introduction
15.2 Characteristics of Nonresponse
15.2.1 Definition of Nonresponse
15.2.2 Response Sets
15.2.3 Lack of Unbiased Estimators
15.3 Measuring Nonresponse
15.4 Dealing with Nonresponse
15.4.1 Planning of the Survey
15.4.2 Callbacks and Follow-Ups
15.4.3 Subsampling of Nonrespondents
15.4.4 Randomized Response
15.5 Perspectives on Nonresponse
15.6 Estimation in the Presence of Unit Nonresponse
15.6.1 Response Modeling
15.6.2 A Useful Response Model
15.6.3 Estimators That Use Weighting Only
15.6.4 Estimators That Use Weighting as Well as Auxiliary Variables
15.7 Imputation
Exercises
CHAPTER 16
Measurement Errors
16.1 Introduction
16.2 On the Nature of Measurement Errors
16.3 The Simple Measurement Model
16.4 Decomposition of the Mean Square Error
16.5 The Risk of Underestimating the Total Variance
16.6 Repeated Measurements as a Tool in Variance Estimation
16.7 Measurement Models Taking Interviewer Effects into Account
16.8 Deterministic Assignment of Interviewers
16.9 Random Assignment of Interviewers to Groups
16.10 Interpenetrating Subsamples
16.11 A Measurement Model with Sample-Dependent Moments
Exercises
CHAPTER 17
Quality Declarations for Survey Data
17.1 Introduction
17.2 Policies Concerning Information on Data Quality
17.3 Statistics Canada's Policy on Informing Users of Data Quality and
Methodology
Exercise
APPENDIX A
Principles of Notation
APPENDIX B
The MU284 Population
APPENDIX C
The Clustered MU284 Population
APPENDIX D
The CO124 Population
References
Answers to Selected Exercises
Author Index
Subject Index
PART I
Principles of Estimation
for Finite Populations and
Important Sampling Designs

CHAPTER 1
Survey Sampling in Theory and Practice
1.1. Surveys in Society
The need for statistical information seems endless in modern society. In par-
ticular, data are regularly collected to satisfy the need for information about
specified sets of elements, called finite populations. For example, our objective
might be to obtain information about the households in a city and their
spending patterns, the business enterprises in an industry and their profits, the
individuals in a country and their participation in the work force, or the
farms in a region and their production of cereal.
One of the most important modes of data collection for satisfying such
needs is a sample survey, that is, a partial investigation of the finite popula-
tion. A sample survey costs less than a complete enumeration, is usually less
time consuming, and may even be more accurate than the complete enumera-
tion. This book is an up to date account of statistical theory and methods for
sample surveys. The emphasis is on new developments.
Over the last few decades, survey sampling has evolved into an extensive
body of theory, methods, and operations used daily all over the world. As
Rossi, Wright and Anderson (1983) point out, it is appropriate today to speak
of a worldwide survey industry with different sectors: a government sector,
an academic sector, a private and mass media sector, and a residual sector
consisting of ad hoc and in-house surveys.
In many countries, a central statistical office is mandated by law to provide
statistical information about the state of the nation, and surveys are an im-
portant part of this activity. For example, in Canada, the 1971 Statistics Act
mandates Statistics Canada to “collect, compile, analyze, abstract, and pub-
lish statistical information relating to the commercial, industrial, financial,
social, economic, and general activities and condition of the people."
Thus, national statistical offices regularly produce statistics on important
national characteristics and activities, including demography (age and sex
distribution, fertility and mortality), agriculture (crop distribution), labor
force (employment), health and living conditions, and industry and trade. We
owe much of the essential theory of survey sampling to individuals who are
or were associated with government statistical offices.
In the academic sector, survey sampling is extensively used, especially in
sociology and public opinion research, but also in economics, political science,
psychology, and social psychology. Many academically affiliated survey insti-
tutes are heavily engaged in survey sampling activity. In the private and mass
media sectors, we find television audience surveys, readership surveys, polls,
and marketing surveys. The contents of ad hoc and in-house surveys vary
greatly. Examples include payroll surveys and surveys for auditing purposes.
Survey sampling has thus grown into a universally accepted approach for
information gathering. Extensive resources are devoted every year to surveys.
We do not have accurate figures to illustrate the scope of the industry. How-
ever, as an example from the United States, the National Research Council
in 1981 estimated that the survey industry in the country conducted roughly
100 million interviews in one year. If we assume a cost of $20 to $40 per
interview, interviewing alone (which is only one component of the total sur-
vey operation) represents 2 to 4 billion dollars per year.
News media provide the public with the results of new or recurring
surveys. It is widely accepted that a sample of fairly modest size is sufficient
to give an accurate picture of a much larger universe; for example, a well-
selected sample of a few thousand individuals can portray with great accuracy
a total population of millions. However data gathering is costly. Therefore, it
makes a great difference if a major national survey uses 20,000 observations,
when 15,000 or even 10,000 observations might suffice. For reasons of cost
effectiveness, it is imperative to use the best methods available for sampling
design and estimation, to profit from auxiliary information, and so on.
Here statistical knowledge and insight become highly important. The ex-
pert survey statistician must have a good grasp of statistical concepts in gen-
eral, as well as the particular reasoning used in survey sampling. A good
measure of practical experience is also necessary. In this book, we present a
basic body of knowledge concerning survey sampling theory and methods.
1.2. Skeleton Outline of a Survey
To start, we need a skeleton outline of a survey and some basic terminology.
The terms “survey” and “sample survey” are used to denote statistical investi-
gations with the following methodologic features (key words are italicized):
i. A survey concerns a finite set of elements called a finite population. An
enumeration rule exists that unequivocally defines the elements belong-
ing to the population. The goal of a survey is to provide information
about the finite population in question or about subpopulations of special
interest, for example, “men” and “women” as two subpopulations of “all
persons.” Such subpopulations are called domains of study or just domains.
ii. A value of one or more variables of study is associated with each popula-
tion element. The goal of a survey is to get information about unknown
population characteristics or parameters. Parameters are functions of the
study variable values. They are unknown, quantitative measures of interest
to the investigator, for example, total revenue, mean revenue, total yield,
number of unemployed, for the entire population or for specified domains.
iii. In most surveys, access to and observation of individual population ele-
ments is established through a sampling frame, a device that associates the
elements of the population with the sampling units in the frame.
iv. From the population, a sample (that is, a subset) of elements is selected.
This can be done by selecting sampling units in the frame. A sample is a
probability sample if realized by a chance mechanism, respecting the basic
rules discussed in Section 1.3.
v. The sample elements are observed, that is, for each element in the sample,
the variables of study are measured and the values recorded. The measure-
ment conforms to a well-defined measurement plan, specified in terms of
measurement instruments, one or more measurement operations, the
order between these, and the conditions under which they are carried out.
vi. The recorded variable values are used to calculate (point) estimates of the
finite population parameters of interest (totals, means, medians, ratios,
regression coefficients, etc.). Estimates of the precision of the estimates are
also calculated. The estimates are finally published.
In a sample survey, observation is limited to a subset of the population.
The special type of survey where the whole population is observed is called a
census or a complete enumeration.
EXAMPLE 1.2.1. Labor force surveys are conducted in many countries. Such a
survey aims at answering questions of the following type: How many persons
are currently in the labor force in the country as a whole and in various
regions of the country? What proportion of these are unemployed? In this
case, some of the key concepts may be as follows. Population: All persons in
the country with certain exceptions (such as infants, people in institutions).
Domains of interest: age/sex groups of the population, occupational groups in
the population, and regions of the country. Variables: Each person can be
described at the time of the survey as (a) belonging to the labor force or
not, and (b) employed or not. Correspondingly, there is a variable of interest
that takes the value “one” for a person in the labor force, “zero” for a person
not in the labor force. To measure unemployment, a second variable of inter-
est is defined as taking the value “one” if a person is unemployed, “zero”
otherwise. Precise definitions are essential. If the purpose is to estimate un-
employment in a given month, and if an interviewed person states that
he worked one week during that month, but that he is unemployed the
day of the interview, there must be a clear rule stating whether he is to
be recorded as unemployed or not. Population characteristics of interest:
Number of persons in the labor force. Number of unemployed persons in the
labor force. Proportion of unemployed persons in the labor force. Sample: A
sample of persons is selected from the population in an efficient manner given
existing devices for observational access to the persons in the country. Obser-
vations: Each person included in the sample is visited by a trained interviewer
who asks questions following a standardized questionnaire and records the
answers. Data processing and estimation: The recorded data are edited, that
is, prepared for the estimation phase; rules for handling of nonresponse are
observed; estimates of the population characteristics are calculated. Indica-
tors of the uncertainty of the estimates (variance estimates) are calculated.
The results are published.
EXAMPLE 1.2.2. Consider a household survey whose aim is to obtain informa-
tion about planned household expenditures in the coming year for specified
durable goods. Here, some of the basic concepts may be as follows. Popula-
tion: All households in the country. Variables: Planned expenditure amounts
for specified goods, such as automobiles, refrigerators, etc. Population char-
acteristics of interest: Total of planned household expenditures for the speci-
fied durable goods. Sample: A sample of households is obtained by initially
selecting a sample of geographic areas, then by subsampling of households in
selected areas. Observations: Each household in the sample receives a self-
administered questionnaire. For a majority of households, the questions are
answered and the questionnaire returned. Households not returning the ques-
tionnaire are followed up by telephone or visited by a trained interviewer
to obtain the desired information. Data processing and estimation: Data are
edited. The calculation of point estimates and precision takes into account the
two-stage design of the survey.
The methodologic features (i) to (vi) identified above prompt a few
comments.
1. The complexity of a survey can vary greatly, depending on the size of
the population and the means of accessing that population. To survey
the members of a professional society, the hospitals in a region, or the
residents in a small municipality may be a relatively simple matter. At the
other extreme are complex nationwide surveys, with a population of many
millions spread over a large territory; such surveys are typically carried out
by government statistical agencies and require extensive administrative
and financial resources.
2. Although a survey involves observations on individual population ele-
ments, the purpose of a survey is not to use such data for decision-making
about individual elements, but to obtain summary statistics for the popu-
lation or for specific subgroups.
3. In the same survey, there are often many variables of study and many
domains of interest. The number of characteristics to estimate may be
large, in the hundreds or even in the thousands.
4. Finite population parameters are quantitative measures of various aspects
of the population. Prior to a survey they are unknown. In this book, we
examine the estimation of different types of parameters: the total of a vari-
able of study, the mean of the variable, the median of the variable, the
correlation coefficient between two variables, and so on. The exact value
of a finite population parameter can be obtained in a special case, namely,
if we observe the complete population (i.e., the survey is a census), and
there are no measurement errors and no nonresponse. A census does not
automatically mean “estimation without error.”
5. Most people are aware of the term "census" in a particular sense, namely,
as a fact finding operation about a country’s population, in particular
about such sociodemographic characteristics as the age distribution, edu-
cation levels, special skills, mother tongue, housing conditions, household
structure, migration patterns. In these situations there is often a “census
proper,” done through a “short form” (a questionnaire with few questions)
going out to all individuals, while a “long form” may be administered to a
20% sample with a request for more extensive information.
6. A sample is any subset of the population. It may or may not be drawn
by a probability mechanism. A simple example of a probability sampling
scheme is one that gives every sample of a fixed size the same probability
of selection. This is simple random sampling without replacement. In prac-
tice, selection schemes are usually more complex. Probability sampling has
over the years proved to be a highly accurate instrument and is the focus
of this book. The reasons for probability sampling are discussed later in
this chapter and in Chapter 14. An example of a nonprobability sample is
one that is designated by an expert as representative of the population.
Only in fortunate circumstances will nonprobabilistic selection yield accu-
rate estimates.
7. To correctly measure and record the desired information for all sampled
elements may be difficult or impossible. False responses may be obtained.
For some elements designated for the survey, measurements may be miss-
ing because of, for example, impossibility to contact or refusal to respond.
These nonsampling errors may be considerable.
8. Advances in computer technology have made it possible to produce a great
deal of official statistics (for example, in the economic sector) from admin-
istrative data files. Several files may be used. For example, elements are
matched in two complete population registers, and the information com-
bined. The matched files give a more extensive base for the production of
statistics. (For populations of individuals, matching may conflict with pri-
vacy considerations.) Information from a sample survey may also be com-
bined with the information from one or more complete administrative
registers. The administrative data may then serve as auxiliary information
to strengthen the survey estimates.
1.3. Probability Sampling
Probability sampling is an approach to sample selection that satisfies certain
conditions, which, for the case of selecting elements directly from the popula-
tion, are described as follows:
1. We can define the set of samples, S = {s_1, s_2, ..., s_M}, that are possible to
obtain with the sampling procedure.
2. A known probability of selection p(s) is associated with each possible
sample s.
3. The procedure gives every element in the population a nonzero probability
of selection.
4. We select one sample by a random mechanism under which each possible
s receives exactly the probability p(s).
A sample realized under these four requirements is called a probability
sample.
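
As a small computational illustration, the following Python sketch spells out requirements 1 to 4 for the simplest case, equal-probability selection of a fixed-size subset; the population {1, ..., 6} and the sample size n = 2 are arbitrary choices made only for the example.

    import random
    from itertools import combinations

    U = [1, 2, 3, 4, 5, 6]     # toy population (illustrative values only)
    n = 2                      # fixed sample size (illustrative value only)

    # 1. The set of possible samples: all subsets of U of size n.
    samples = list(combinations(U, n))
    M = len(samples)

    # 2. A known selection probability p(s) for every possible sample; giving
    #    each sample the same probability yields simple random sampling
    #    without replacement.
    p = {s: 1.0 / M for s in samples}

    # 3. Every element receives a nonzero inclusion probability:
    #    pi_k = sum of p(s) over all samples s that contain k.
    pi = {k: sum(p[s] for s in samples if k in s) for k in U}
    # here every pi_k equals n / len(U) = 1/3

    # 4. One sample is selected by a random mechanism that respects p(s).
    realized = random.choices(samples, weights=[p[s] for s in samples], k=1)[0]
    print(realized, pi)

Under this particular design each inclusion probability equals n/N; other probability sampling designs differ only in the choice of p(·).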
If the survey functions without disturbance, we can go out and measure
each element in the realized sample and obtain true observed values for the
study variables. We assume a formula exists for computing an estimate of
each parameter of interest. The sample data are inserted in the formula, yield-
ing, for every possible sample, a unique estimate.
The function p(·) defines a probability distribution on S = {s_1, s_2, ..., s_M}.
It is called a sampling design, or just design; a more rigorous definition is
given in Section 2.3.
The probability referred to in point 3 is called the inclusion probability of
the element. Under a probability sampling design, every population element
has a strictly positive inclusion probability. This is a strong requirement, but
one that plays an important role in the probability sampling approach. In
practice there are sometimes compelling reasons for not adhering strictly to
the requirement. Cut-off sampling (see Section 14.4) is an occasionally used
technique in which certain elements are deliberately excluded from selection.
In that case, valid conclusions are limited to the part of the population that
can be sampled.
The randomization referred to in point 4 is usually carried out by an easily
implemented algorithm. A common type of algorithm is one in which a
randomized experiment is performed for each element listed in the frame,
leading either to inclusion or noninclusion. Different algorithms are discussed
in Chapters 2 and 3.
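
A sketch of such a list-sequential algorithm, with an invented frame of 20 units and an invented constant inclusion probability of 0.3 (unequal probabilities would be handled the same way, one value per listed unit):

    import random

    frame = list(range(1, 21))         # hypothetical frame listing
    pi = {k: 0.3 for k in frame}       # hypothetical inclusion probabilities

    # One randomized experiment per listed element: include element k
    # if a uniform random number falls below its inclusion probability.
    sample = [k for k in frame if random.random() < pi[k]]
    print(sample)   # realized sample; its size is random under this scheme

With a common probability for all elements this is the Bernoulli sampling scheme of Section 3.2; with element-specific probabilities it is the Poisson sampling scheme of Section 3.5.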
Sampling is often carried out in two or more stages. Clusters of elements
are selected in an initial stage. This may be followed by one or more sub-
sampling stages; the elements themselves are sampled at the ultimate stage.
To have a probability sampling design in that case, conditions 1 to 4 above
must apply to each stage. The procedure as a whole must give every popula-
tion element a strictly positive inclusion probability.
Probability sampling has developed into an important scientific approach.
Two important reasons for randomized selection are (1) the elimination of
selection biases, and (2) randomly selected samples are “objective” and ac-
ceptable to the public. Some milestones in the development of the approach
are noted in Chapter 14. Most of the arguments and methods in this book are
based on the probability sampling philosophy.
1.4. Sampling Frame
The frame or the sampling frame is any material or device used to obtain
observational access to the finite population of interest. It must be possible
with the aid of the frame to (1) identify and select a sample in a way that
respects a given probability sampling design and (2) establish contact with
selected elements (by telephone, visit at home, mailed questionnaire, etc.). The
following more detailed definition is from Lessler (1982):
Frame. The materials or devices which delimit, identify, and allow access to
the elements of the target population. In a sample survey, the units of the
frame are the units to which the probability sampling scheme is applied. The
frame also includes any auxiliary information (measures of size, demographic
information) that is used for (1) special sampling techniques, such as stratifica-
tion and probability proportional-to-size sample selections, or for (2) special
estimation techniques, such as ratio or regression estimation.
As in the preceding definition, we call elements the entities that make up
the population, and units or sampling units the entities of the frame. The latter
term stresses that the sample is actually selected from the frame.
EXAMPLE 1.4.1. The Swedish Register of the Total Population is a large sam-
pling frame that lists some eight million individuals. This frame gives, for each
individual, information on variables such as date of birth, sex, marital status,
address, and taxable income. It is a reasonably correct frame. There are few
omissions, and few are listed in the frame who do not rightfully belong in it.
An attractive feature of this frame is that it gives direct access to the country's
entire population. Stratified sampling from this frame is often used for Swed-
ish surveys. Sampled elements (individuals) can be contacted with relative
ease.
EXAMPLE 1.4.2. The Central Frame Data Base is a sampling frame compiled
by Statistics Canada for use in business surveys. It is a fairly complex frame,
consists of several parts, and is based on information from several sources.
The construction of this frame is based on two types of Canadian tax returns:
corporate and individual, the latter for the self-employed. The frame has two
main components: an "integrated" component (containing all of the large
business establishments) and a “nonintegrated” component, which is further
divided into two separate parts using information from Revenue Canada
Taxation. Business firms reporting small total operating revenue are con-
sidered “out-of-scope” for business surveys. Continuous updating is required
to register “births” (starting of new business activity), “deaths” (termination of
business activity), and changes in classification based on geography, industry,
or size.
We use the term direct element sampling to denote sample selection from a
frame that directly identifies the individual elements of the population of
interest. That is, the units in the frame are objects of the same kind as those
that we wish to measure and observe. A selection of elements can take place
directly from the frame. Ideally, the set of elements identified in the frame is
equal to the set of elements in the population of interest.
For example, if the population of interest is Swedish individuals, we can
carry out direct element sampling from the frame “Register of the Total Popu-
lation” mentioned in Example 1.4.1. Here, unit equals element which equals
individual. (The two sets are actually not exactly equal, but differences are
minor so the frame is nearly perfect.) The frame in Example 1.4.2 can be used
for direct element sampling with the objective of studying the population of
business establishments in Canada; in that case, unit equals element which
equals business establishment.
The following is a list of properties that a frame for direct element sampling
should have. Minimum requirements are that:
1. The units in the frame are identified, for example, through an identifier k
running from 1 to N_F, where N_F is the number of sampling units.
2. All units can be found, if selected in a sample. That is, the address, tele-
phone number, location on map, or other device for making contact is
specified in the frame or can be made available.
It simplifies many sample selection procedures if the following feature is
present:
3. The frame is organized in a systematic fashion, for example, the units are
ordered by geography or by size.
Other information is sometimes available in the frame and will often improve
estimates. The following is desirable:
4. The frame contains a vector of additional information for each unit; such
information may be used for efficiency improvement such as stratification
or to construct estimators that involve auxiliary variables.
5. When estimation is required for domains (subpopulations), the frame
specifies the domain to which each unit belongs.
Other desirable properties involve the relationship between the units in the
frame and the population elements:
6. Every element in the population of interest is present only once in the
frame.
7. No element not in the population of interest is present in the frame.
The preceding two features will simplify many selection and estimation
procedures.
8. Every element in the population of interest is present in the frame.
The last mentioned property is particularly important, because in its absence,
the frame does not give access to the whole population of interest. Then not
even observation of all elements in the frame will make it possible to calculate
the true value of a finite population parameter of interest.
In practice, a frame is often in the form of a computer data file. At
a minimum, it is a file with an element identifier k running from 1 to N_F. It
may contain other information, as mentioned in points 4 and 5. We can state
all that is available in the frame about the kth element as a vector
x_k = (x_k1, x_k2, ..., x_kj, ..., x_kq). Here, x_kj is the value of the jth x-variable for the
kth element. The value x_kj may be quantitative (for example, date of birth or
salary for individual k) or qualitative (for example, address for individual k).
The frame can be seen as a matrix arrangement with N_F rows (records), and
with each row is associated q + 1 data entries (fields); one entry for the identi-
fier and q entries for the components of the row vector x_k, as follows:
Identifier      Known vector
1               x_1
2               x_2
...             ...
k               x_k
...             ...
N_F             x_N_F
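
A sketch of such a frame file as a simple data structure; the identifiers, variable names, and values below are invented purely for illustration:

    # Hypothetical frame records: an identifier k and a known vector x_k whose
    # components may be quantitative (e.g., an income figure) or qualitative
    # (e.g., a region code that could later be used for stratification).
    frame = [
        {"k": 1, "x": {"region": "north", "income": 210.0}},
        {"k": 2, "x": {"region": "south", "income": 180.0}},
        {"k": 3, "x": {"region": "north", "income": 350.0}},
    ]

    N_F = len(frame)    # number of records (sampling units) in the frame

    # The auxiliary vector can be exploited before sampling, for example
    # to group the units into strata by the qualitative component.
    strata = {}
    for record in frame:
        strata.setdefault(record["x"]["region"], []).append(record["k"])
    print(N_F, strata)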
It is important to construct a good sampling frame. As Jessen (1978) puts
it: “In many practical situations the frame is a matter of choice to the survey
planner, and sometimes a critical one, since frames not only should provide
us with a clear specification of what collection of objects or phenomena we
are undertaking to study, but they should also help us carry out our study
efficiently and with determinable reliability. Some very worthwhile investiga-
tions are not undertaken at all because of the lack of an apparent frame;
others, because of faulty frames, have ended in a disaster or in a cloud of
doubt.” Guidelines for frame construction and maintenance are discussed in
Section 14.7.
1.5. Area Frames and Similar Devices
The following distinction is important:
i. A frame as a direct listing or identification of elements.
ii. A frame as a listing or identification of (smaller or larger) sets of elements.
In case (i), direct element sampling can be carried out. In case (ii), access
to the elements is more roundabout, namely, by selecting sets of elements and
by observing all or some of the elements in the selected sets. In many situa-
tions, case (ii) is the only option, since it may not be possible to find or
construct (without prohibitive cost) a direct list of elements. The total number
of population elements is often unknown in case (ii). For example, let us think
of the population of households in a large metropolitan area. In many cities,
there is nothing that comes close to a complete register of the households.
Other sampling units than individual households must be considered. One
way is to define sampling units as city blocks on a map and select a sample of
such units. With relative ease, we may then gain access to the households in
(a modest number of) selected blocks. A variation of the same idea occurs
when segments are identified on a forest map, and a sample of segments is
drawn with the objective of observing individual trees in selected segments.
The concept of area frame is defined as follows:
Area Frame. A geographic frame consisting of area units; every population
element belongs to an area unit, and it can be identified after inspection of the
area unit. The area units may vary in size and in the number of elements that
they contain.
Area sampling entails sampling from an area frame, such as a city map, a
forest map, or an aerial photograph. The sets of elements drawn with the aid
of an area frame are often called clusters. In a secondary selection step, the
selected clusters may be subsampled. A sample of still smaller areas could be
defined and sampled, and so on, until the elements themselves are finally
sampled in the ultimate step.
Maps are, of course, not always used when sets (clusters) of elements are
sampled; a succession of lists may be used instead. A frame for studying a
population of high school students may in the first step consist of a list of
school districts, then a list of all schools in each selected district, then a list of
all classes in each selected school, then in the fourth and final step one would
gain access to students. “Frame” here refers to a device with four successive
layers. School districts are the first stage sampling units, schools the second
stage sampling units, classes the third stage units. The individual elements
(the students) are the sampling units in the fourth and final stage of sampling.
In a selection consisting of several stages, each stage has its own type of
sampling unit.
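
A sketch of selection through such successive lists; the nested structure below (two districts with a few schools, classes, and students) is invented, and for simplicity one unit is drawn with equal probability at each of the first three stages and two students at the last:

    import random

    # Hypothetical four-layer frame: districts -> schools -> classes -> students.
    frame = {
        "district1": {"schoolA": {"class1": ["s1", "s2"], "class2": ["s3", "s4"]},
                      "schoolB": {"class3": ["s5", "s6"]}},
        "district2": {"schoolC": {"class4": ["s7", "s8", "s9"]}},
    }

    def draw(units, k=1):
        # equal-probability draw without replacement at one stage
        return random.sample(list(units), k=min(k, len(units)))

    students = []
    for d in draw(frame):                      # stage 1: districts
        for sch in draw(frame[d]):             # stage 2: schools within the district
            for cls in draw(frame[d][sch]):    # stage 3: classes within the school
                students += draw(frame[d][sch][cls], k=2)   # stage 4: students
    print(students)

At each stage the draw could of course be made with unequal probabilities; the sketch only illustrates how each stage has its own type of sampling unit.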
A finite population is made up of elements. They are sometimes also called
units of analysis, which underscores that they are entities that are measured
and for which values are recorded. For example, if one is interested in estimat-
ing the population total of the variable “household income,” the element (or
unit of analysis) is naturally the household. The frame is an instrument for
gaining access (more or less directly) to these elements. One way is to first
select city blocks and then observe households in selected blocks.
Our examples so far have perhaps left the impression that in practice "ele-
ment” is always something “smaller than or at most equal to” a “sampling
unit.” This is not necessarily so, as the following example shows.
EXAMPLE 1.5.1. The Swedish household survey HINK is a national survey of
household income. In the absence of a good complete list of households, it
has proved convenient to use the Swedish Register of the Total Population
described in Example 1.4.1 as a sampling frame. This register is a list of in-
dividuals. A probability sample (ordinarily, a stratified random sample) of
individuals is selected. The households to which the selected individuals
belong are identified, and income-related variables are measured for these
households. Here, the sampling unit is the individual, and the element (unit
of analysis) is the household. That is, an element comprises one or more
sampling units. This is a particular case of “network sampling” (see Sirken
1972, Levy 1977, and Rosén 1987).
1.6. Target Population and Frame Population
It becomes necessary at this point to distinguish target population from frame
population. The target population is the set of elements about which informa-
tion is wanted and parameter estimates are required. The frame population is
the set of all elements that are either listed directly as units in the frame or can
be identified through a more complex frame concept, such as a frame for
selection in several stages.
Frame quality can be studied through the relations that exist between the
target population, denoted U, and the frame population denoted U,. If each
element in the frame population corresponds to one and only one element in
the target population, then U = U_F, and the frame is perfect for the target
population. In all other cases, there is frame imperfection. Frame imperfec-
tions are discussed in more detail in Section 14.7. At this stage it suffices to
note that points 6, 7, and 8 in Section 1.4 point out three common frame
imperfections, namely, undercoverage, overcoverage, and duplicate listings.
The frame has undercoverage when some target population elements are
not in the frame population. The frame has overcoverage when elements not
in the target population are in the frame population. A duplicate listing occurs
when a target population element is listed in the frame more than once.
For example, if unit equals element which equals business enterprise, it may
be that a unit in the frame is a firm that recently went out of business. Since it is
no longer involved in business activity, this firm is not in the target popula-
tion, but still exists as a unit in the frame. Another frame unit may be a firm
that exists, but in a different industry than the one defined by the target
population. Both firms are part of the overcoverage. Newly established firms
not yet listed in the frame are part of the undercoverage.
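
A sketch, with invented element labels, of how the three imperfections can be described when the target population U and the frame listing are written out as small sets; in a real survey U is of course not available in this explicit form, which is precisely why these imperfections are hard to detect:

    # Hypothetical target population and frame listing (one duplicate, one
    # out-of-scope unit, two missing elements).
    U = {"a", "b", "c", "d", "e"}                  # target population
    frame_listing = ["a", "b", "b", "c", "f"]      # units listed in the frame
    U_F = set(frame_listing)                       # frame population

    undercoverage = U - U_F        # target elements the frame cannot reach: {"d", "e"}
    overcoverage = U_F - U         # frame units outside the target population: {"f"}
    duplicates = {k for k in U_F if frame_listing.count(k) > 1}   # {"b"}
    print(undercoverage, overcoverage, duplicates)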
If a probability sample is selected from the frame population, valid statisti-
cal inference can be made about the frame population. If the frame population
is different from the target population, valid inference about the target popu-
lation may be impossible, so the goal of the survey may be missed. The prob-
lem is particularly serious if the frame gives access only to parts of the target
population.
To come up with a perfect sampling frame is not always possible in prac-
tice. Minor imperfections are often tolerated, since a perfect frame may not
be obtained without excessive cost. However, it is imperative that frame im-
perfections be minor.
To construct a high-quality frame for the target population is an important
aspect of survey planning, and adequate resources must be set aside for this
activity. Viewed somewhat differently, when the target population is defined,
a realistic goal should be set. There is no sense in fixing a target population
for which a good frame cannot realistically be obtained within budget restric-
tions. Grossly invalid conclusions may result from samples drawn from faulty
frames. Inexpensive, easy-to-come-by frames should be avoided if they give
only fragmentary access to the target population.
Frame construction and maintenance are discussed in Chapter 14. Sample
selection is sometimes carried out with the aid of several partially overlapping
frames; such multiple frame sampling is also considered in Chapter 14.
1.7. Survey Operations and Associated
Sources of Error
A survey consists of a number of survey operations. Especially in a large
survey, the operations may extend over a considerable period of time, from
the planning stage to the ultimate publication of results. The operations affect
the quality of survey estimates. We distinguish five phases of survey opera-
tions, as follows. With each phase we can associate sources of errors in the
estimates.
i. Sample Selection
This phase consists of the execution of a preconceived sampling design. The
sample size necessary to obtain the desired precision is determined. A sample
of elements is drawn with the given sampling design using a suitable sampling
frame, which may be an already existing frame or one that is constructed
specifically for the survey. Errors in estimates associated with this phase are
(1) frame errors, of which undercoverage is particularly serious, and (2)
sampling error, which arises because a sample, not the whole population, is
observed.
ii. Data Collection
There is a preconceived measurement plan with a specified mode of data
collection (personal interview, telephone interview, mail questionnaire, or
other). The field work is organized; interviewers are selected and interviewer
assignments are determined. Data are collected according to the measure-
ment plan, for the elements in the sample. The data are recorded and trans-
mitted to the central statistical office. Errors in estimates resulting from this
phase include (1) measurement errors, for instance, the respondent gives (in-
tentionally or unintentionally) incorrect answers, the interviewer understands
or records incorrectly, the interviewer influences the responses, the question-
naire is misinterpreted, (2) error due to nonresponse (missing observations).
iii. Data Processing
This phase serves to prepare collected data for estimation and analysis and
includes the following elements:
Coding and data entry (transcription of information from questionnaire to
a medium suited for estimation and data analysis).
Editing (consistency checks to see if observed values conform to a set of
logical rules, handling of values that do not pass the edit check, outlier detec-
tion and treatment).
Renewed contact with respondents to get clarification if necessary.
Imputation (substitution of good artificial values for missing values).
Errors in estimates associated with this phase include transcription errors
(keying errors), coding errors, errors in imputed values, errors introduced by
or not corrected by editing.
iv. Estimation and Analysis
This phase entails the calculation of survey estimates according to the speci-
fied point estimator formula, with appropriate use of auxiliary information
and adjustment for nonresponse, as well as a calculation of measures of preci-
sion in the estimates (variance estimate, coefficient of variation of the esti-
mate, confidence interval). Statistical analyses may be carried out, such as
comparison of subgroups of the population, correlation and regression
analyses, etc. All errors from the phases (i) to (iii) above will affect the point
estimates and they should ideally be accounted for in the calculation of the
measures of precision.
v. Dissemination of Results and Postsurvey Evaluation
This phase includes the publication of the survey results, including a general
declaration of the conditions surrounding the survey. This declaration often
follows a set of specified guidelines for quality declaration.
Errors in survey estimates are traditionally divided into two major cate-
gories: sampling error and nonsampling error. The sampling error is, as
mentioned, the error caused by observing a sample instead of the whole
population. The sampling error is subject to sample-to-sample variation. The
nonsampling errors include all other errors. The two principal categories of
nonsampling errors are:
a. Errors due to nonobservation. Failure to obtain data from parts of the
target population.
b. Errors in observations. This kind of error occurs when an element is se-
lected and observed, but the finally recorded value for that element (the
value that goes into the estimation and analysis phase) differs from the true
value. Two major types are (b1) measurement error (error arising in the
data collection phase) and (b2) processing error (error arising in the data
processing phase).
There are two principal types of nonobservation, namely, (1) undercov-
erage, that is, failure of the frame to give access to all elements that belong to
the target population (such elements will obviously not be selected, much less
observed, and they have zero inclusion probability), and (2) nonresponse,
that is, some elements actually selected for the sample turn out to be non-
observations because of refusal or incapacity to answer, not being at home,
and so on. Nonobservation generally results in biased estimates.
Measurement error can be traced to four principal sources:
1. The interviewer.
2. The respondent.
3. The questionnaire.
4. The mode of the interview, that is, whether telephone, personal interview,
self-administered questionnaire, or other medium is used.
Processing error comprises the errors arising from coding, transcription,
imputation, editing, outlier treatment, and other types of preestimation data
handling. In modern computer assisted data collection methods (CATI and
CAPI; see Section 14.9), the data collection and data processing phases tend
to be merged, and, as a consequence, processing errors may be reduced.
The basic estimation theory, taking into account the sampling error, is
presented in Part I of this book (Chapters 2 to 5). Various extensions are given
in Parts II and III (Chapters 6 to 13). Estimation in the presence of non-
sampling errors is treated in Part IV (Chapters 14 to 17). In particular, Chap-
ter 14 contains a more detailed discussion of nonsampling errors and their
impact on the survey estimates.
1.8. Planning a Survey and the Need for
Total Survey Design
A survey usually has its background in some practical problem. Someone—a
member of parliament, a researcher, an administrator, a decisionmaker—
formulates a question in the course of his work, a problem to which no answer
is readily available. A study is needed. It can be an experiment, a survey, or
some other form of fact finding. In either case, it is imperative that the prob-
lem be clearly stated. If a survey is the instrument chosen for the study, the
survey statistician needs to work with a clearly stated objective. What exactly
is the problem? Exactly what information is wanted?
For example, suppose a survey of housing conditions for the elderly is
proposed. This description is vague and too general. The concepts involved
must be given clear definitions. Precisely what is the population of aged in-
dividuals of interest? What age groups are to be included? Should we look
separately at single-person households, two-person households, households
where the elderly cohabit with younger persons, etc.? What is the more
complete specification of housing conditions? Do we mean age of dwelling
or some other quality measure of the dwelling? What time period is to be
studied? Should one distinguish between an urban elderly population and a
rural elderly population? As answers are obtained to these questions, the
survey statistician begins to work toward a reformulation of the original
question into one stated in precise terms that can be answered by a survey.
The statistician’s formulation must be unequivocal on the following:
i. The finite population and the subpopulations for which information is
required.
ii. The kinds of information needed for this population, that is, the variables
to be measured and the parameters to be estimated.
Once the operational definitions are clearly stated, the survey statistician
can work toward the specification of a suitable survey design, including sam-
pling frame, data collection method, staff required, sample selection, estima-
tion method, and determination of the sample size required to obtain the
desired precision in the survey results. Before going ahead with the survey, the
statistician will make sure that his “translation” corresponds fully to what the
problem originator had in mind: Will the survey give at least an approximate
solution to the right problem? In the words of W. E. Deming (1950, p. 3), “The
requirement of a plain statement of what is wanted (the specification of the
survey) is perhaps one of the greatest contributions of modern theoretical
statistics”.
Some important aspects of survey planning are as follows:
Specifying the objective of the survey.
Translation of the subject-matter problem into a survey problem.
Specification of target population, known variables (auxiliary variables),
study variables, population parameters to be estimated.
Construction of sampling frame, if none is available.
Inventory of resources available in terms of budget, staff, data processing,
and other equipment.
Specifications of requirements to be met, for instance, time schedule and
accuracy of estimates.
Specification of data collection method, including questionnaire
construction.
Specification of sampling design, sample selection mechanism, and sample
size determination.
Specification of data processing methods, including edit and imputation.
Specification of formulas for point estimator and measures of precision
(variance estimator).
Training of personnel, organization of field work.
Allocation of resources to different survey operations.
Allocation of resources to control and evaluation.
The survey planning should ideally lead to a decision for each of the survey
operations. Statistical theory can guide us to important conclusions about
some of these decisions, in particular with regard to sample selection, choice
of estimator, different sources of error and their associated components of
variance, methods for assessment of the accuracy of the estimates, and statisti-
cal analysis of survey data.
The planning process should try to foresee difficulties that may arise.
Resources should be set aside and back-up procedures should be identified to
deal with perceived difficulties. For example, some nonresponse can almost
certainly be expected, and the survey planning should take this into account;
the nonresponse should be kept at low levels. Procedures for follow up and
renewed contact with nonrespondents should be identified and included in
budget considerations. Estimator formulas that attempt to adjust for non-
response should be identified.
Ideally, survey planning should lead to an optimal specification for the
survey as a whole. The goal is to obtain the best possible accuracy under a
fixed budget. In a major survey, the decision problem is, however, so complex
that an optimum, in the sense of a mathematical solution to a closed-form
problem, is inconceivable. There are too many interrelated decisions and too
many variables to take into account. The concept of total survey design,
discussed in the next section, can be seen as a tool in the search for an overall
optimization of a survey.
1.9. Total Survey Design
The term total survey design has come to be used for planning processes that
aim at overall optimization in a survey. The concept arose out of a desire for
an overall control of all sources of errors in a survey. Instrumental in this
regard were efforts at the United States Bureau of the Census. Key references
are Hansen et al. (1951), Hansen, Hurwitz, and Bershad (1962), Hansen,
Hurwitz, and Pritzker (1964). A detailed discussion of total survey design is
found in Dalenius (1974).
Total survey design is concerned with obtaining the best possible precision
in the survey estimates while striking an overall economic balance between
sampling and nonsampling errors. For a view of total survey design, it is
helpful to consider a survey from three perspectives:
1. The requirements.
2. The survey specifications.
3. The survey operations.
By the requirements are meant the needs for information about the finite
population, usually originating in some subject-matter problem. Correspond-
ing to these requirements is a conceptual survey which will achieve the ideal
goal, if carried out under the best possible circumstances.
The survey specifications are a set of rules and operations, which together
constitute a defined goal of the survey. Because of actual conditions, the de-
fined goal may be somewhat different from the ideal goal. The defined goal
specifies key elements of the survey, such as population, sampling design,
measurement procedures, estimators, and auxiliary variables.
Several survey designs usually exist for realizing the defined goal. The sur-
vey statistician selects from a set of operationally feasible survey designs one
that comes as close as possible to realizing the defined goal. The selected
design gives rise to a series of survey operations. The essential ones are those
that we identified in Section 1.7. The survey is finally carried out executing
these operations as carefully as possible; this constitutes the survey operations
proper (see Figure 1.1).
Following Dalenius (1974) we can summarize the total survey design pro-
cess in a diagram as shown in Figure 1.1.
Figure 1.1. The Total Survey Design Process.
1.10. The Role of Statistical Theory
in Survey Sampling
There is no unified theory covering all survey operations simultaneously. The
current state of the art offers partial solutions, obtained sometimes under
restrictive assumptions and idealized conditions. This book covers a crucial
aspect of the total survey picture, namely, the part where statistical theory,
especially estimation theory, is used to obtain answers.
Estimation theory is the organized study of variability in estimates. In
surveys, the variability is tied to various errors identified in Section 1.7.
Suppose that the probability distributions of the various errors can be given,
if not a complete specification, at least some general features. A stochastic
structure is thereby defined, and we can work toward obtaining probability
statements about the total error in an estimate. This is the traditional goal of
statistical inference.
To study the variability of survey estimates, we need to specify as accurate-
ly as possible the stochastic features of the various errors. Let us see how this
can be done.
i. Stochastic Structure Relative to Sampling Error
How was the sample selected? In the probability sampling approach, we
know the answer. The sample is selected according to the given probability
sampling design. We know or can determine the probabilities given to the
different possible samples. This is the key to describing the sample-to-sample
variation of a proposed estimator. The statistical properties of the estimator
can be worked out: its expected value, its bias (if any), its variance, and so on.
We can also obtain an estimate of the variance and a confidence interval.
These notions are explained in detail in Chapter 2.
We mentioned that two reasons for randomized sample selection are pro-
tection against selection bias and the fact that randomly selected samples are
viewed as objective. In this book, the probability distribution associated
with the randomized sample selection has another very important function.
It provides the basis for probabilistic conclusions (inferences) about the target
population. For instance, a 95% confidence interval will have the property of
covering the unknown population quantity in 95% of all samples that can be
drawn with the given probability sampling design. The term used in the litera-
ture for this kind of conclusion is design-based inference or randomization
inference. Design-based inference is objective; nobody can challenge that
the sample was really selected according to the given sampling design. The
probability distribution associated with the design is “real,” not modeled or
assumed.
ii. Stochastic Structure Relative to Nonsampling Errors
How do measurement errors and data processing errors arise? In most situa-
tions, we do not know. Any answer will involve hypothetical assumptions,
called model assumptions, about these errors. Models for errors in observa-
tions are discussed in Chapter 16.
How does nonresponse arise? What is the mechanism that generates re-
sponse from elements in the sample? Again, in most situations we do not
know, and any answer will have to be in the form of a model. Models of this
kind are called response mechanism models; see Chapter 15.22 1. Survey Sampling in Theory and Practice
iii. Stochastic Structure Relative to the Origin of Finite Population
Variable Values
How were the population variable values generated? An attempt at answering
this question is to say that the N population values $y_1, \ldots, y_N$ of the study
variable y are generated from a superpopulation. If the idea of a super-
population is accepted, we must concede that the properties of this popula-
tion are unknown or at least partially unknown. Superpopulation modeling is
the activity whereby one specifies or assumes certain features of the mecha-
nism thought to have generated $y_1, \ldots, y_N$. Superpopulation modeling leads
to model-based inference, which is discussed in Section 14.5.
A number of important distinctions have been made in this chapter. They
help in getting a perspective on how the material is organized in the chapters
to come. Chapters 2 to 12 are based on the idea of probability sampling from
a high quality sampling frame in the absence of nonsampling errors. Chapters
2 and 3 present elementary estimation theory for finite populations. Methods
for estimation of totals and means are presented for important sampling
designs. In Chapter 5, we look at the estimation of other parameters of
interest, such as medians, variances, correlation, and regression coefficients.
Estimation with auxiliary information is an important theme in Chapters 6
to 8. Chapters 4 and 8 deal with sampling in two stages. Chapters 9 to 13
are devoted to significant special topics, such as estimation in two phases,
estimation for domains, variance estimation techniques, questions of optimal
sampling design, and analysis of survey data. In Chapter 14, we assess the
limitations and extensions of the probability sampling approach, and the
nature of nonsampling errors is examined. Complete chapters are devoted
to two important types of nonsampling error, namely, nonresponse (Chapter
15) and measurement error (Chapter 16). Finally, the importance of quality
declaration of survey data is stressed (Chapter 17).
Exercises
1.1. Consider an agriculture yield survey aimed at obtaining information about the
yield of different crops (wheat, rye, etc.) in a given country and in a given year.
Specify definitions that might be appropriate in such a survey for key concepts
such as population, variable(s) of study, parameter(s) of interest. Try to be as
specific as possible in your definitions.
1.2. For a survey of the type indicated in Exercise 1.1, discuss what device(s) reason-
ably may be used to constitute a sampling frame. Discuss possible methods for
accurate data collection, with consideration given to avoidance of measurement
error.
1.3. Consider a retail sales survey in a given country, that is, a survey aimed at finding
out about the amount of sales in retail stores. Specify definitions that might be
realistic in this case for key concepts such as population, variable(s) of study, and
parameter(s) of interest. Try to be as specific as possible in your definitions.
1.4. For a survey of the type indicated in Exercise 1.3, discuss what device(s) reason-
ably may be used to constitute a sampling frame. Discuss possible alternatives for
data collection, with consideration given to avoidance of measurement error.
1.5. Contrast a “simple survey” (say, a survey of approximately 500 members of a
professional society) with a “complex survey” (say, a survey of approximately 10
million inhabitants of a country). Think through the list of survey operations and
specify where you foresee that great differences in complexity may arise.
CHAPTER 2
Basic Ideas in Estimation from
Probability Samples
2.1. Introduction
This chapter introduces some basic notions in survey sampling, such as sam-
pling design and sampling scheme, estimation by π-expanded sums, design
variance, design effect, and design-based confidence intervals. Mastery of
these concepts is essential for an understanding of the subsequent chapters.
Basic designs, such as simple random sampling without replacement and
Bernoulli sampling, are discussed.
An estimate's error is the deviation of the estimate from the unknown value
of the parameter that we wish to estimate. In this chapter, we focus on the
sampling error and ignore any errors caused by faulty measurement, non-
response, or other nonsampling sources. The sampling error is caused by
calculating the estimate from data for a subset of the population only. There
is no sampling error when the estimate is based on data for the entire popula-
tion, as in a census.
2.2. Population, Sample, and Sample Selection
Let us consider a population consisting of N elements labeled k = 1, ..., N:
$$\{u_1, \ldots, u_k, \ldots, u_N\}$$
For simplicity, we let the kth element be represented by its label k. Thus, we
denote the finite population as
$$U = \{1, \ldots, k, \ldots, N\}$$
In this chapter, the population size N is treated as known at the outset. But
for many populations encountered in practice, N is unknown, and methods
of estimation for this case are developed in later chapters.
Let y denote a variable, and let $y_k$ be the value of y for the kth population
element. For instance, if U is a population of households and y is the varia-
ble "disposable income," then $y_k$ quantifies the disposable income of the kth
household. We assume that the values $y_k$, $k \in U$, are unknown at the outset.
Suppose that an estimate is needed for the population total of y,
$$t = \sum_U y_k$$
or for the population mean of y,
$$\bar{y}_U = N^{-1} \sum_U y_k$$
In these expressions, $\sum_U y_k$ is abbreviated notation for $\sum_{k \in U} y_k$.
2. Attach the appropriate probability p(s) to each sample $s \in \mathcal{S}$.
3. Select one sample $s \in \mathcal{S}$ by a random mechanism obeying the probability
distribution p(·), using a random number table or a computer-assisted
randomized selection procedure.
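For a very small population these steps can be carried out literally. The following Python sketch (ours; the values N = 5 and n = 2 are arbitrary) lists every sample of fixed size n, attaches equal probabilities to all samples as in the SI design, and selects one sample accordingly.

```python
import itertools
import random
from math import comb

# Toy population U = {1, ..., N} and a fixed sample size n (arbitrary values).
N, n = 5, 2
U = range(1, N + 1)

# List every possible sample s of size n.
samples = list(itertools.combinations(U, n))

# Attach the probability p(s) to each sample; under the SI design every
# sample of size n receives the same probability 1 / C(N, n).
p = {s: 1 / comb(N, n) for s in samples}

# Select one sample by a random mechanism obeying p(.).
selected = random.choices(samples, weights=[p[s] for s in samples], k=1)[0]
print(selected)
```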
This algorithm is almost never a practical possibility for sample selection. The
simple reason is that in virtually all situations met in practice, the number of
samples is too large. It is often "astronomically large." If, for example, N =
1,000, and $\mathcal{S}$ is the set of all samples of the fixed size n = 40, then the number
of samples is
$$\binom{1000}{40} \approx 5.6 \times 10^{71}$$
If N = 5,000 and n = 200 (thus a sampling fraction, as in the previous case,
of 200/5,000 = 4%), the number of samples increases to
$$\binom{5000}{200} \approx 1.4 \times 10^{363}$$
Clearly the enumeration of all these samples is a hopeless task, even for a
computer.
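The counts themselves, however, are easy to obtain with exact integer arithmetic. The short sketch below (ours) reproduces the two orders of magnitude quoted above.

```python
from math import comb

# Exact counts of the fixed-size samples discussed in the text.
for N, n in [(1000, 40), (5000, 200)]:
    digits = str(comb(N, n))
    # Report the count as d.dd x 10^k without float conversion
    # (the second number is far too large for a Python float).
    print(f"C({N},{n}) is about {digits[0]}.{digits[1:3]} x 10^{len(digits) - 1}")
```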
Remark 2.3.1. It has become standard in modern literature to call the proba-
bility distribution p(·) the sampling design. Some statisticians use the similar
term sample design with a different meaning. For example, Hansen, Hurwitz
and Madow (1953, vol. II, page 7) say: "The sample design will consist of the
sampling plan and the method of estimation." "Sampling plan" corresponds
roughly to our concept sampling design. Used in this way, sample design is
thus a broad concept that includes the choice of sampling design, the actual
selection of a sample, as well as the estimation.
By way of terminology, the design stage of a survey is sometimes used to
designate the period during which the sample selection procedure (and there-
by the sampling design) is decided and the sample selected. By contrast, the
estimation stage refers to the period when the data are already collected and
the required estimates are calculated.
It is important to note that several different sample selection schemes may
conform to one and the same sampling design p(·). If the sampling design p(·)
is taken as a starting point, one has to specify a suitable sample selection30 2. Estimation from Probability Samples
scheme that implements the design. The scheme should be efficient in terms
of computer execution, cost, and other practical aspects.
EXAMPLE 2.3.3. The design SI given in Example 2.3.1 can be implemented by
the following draw-sequential scheme, an alternative to the scheme in Exam-
ple 2.2.1. Select with equal probability 1/N, a first element from the N ele-
ments in the population. Replace the element obtained. Repeat this step until
n distinct elements are obtained. If v denotes the number of draws required,
we have v ≥ n with probability one, since already drawn elements may be
selected again. Here, v is a random variable with a fairly complex probability
distribution. It can be shown that the selection scheme just stated conforms
to the conditions of the design SI. Other possible executions of SI exist. An
often used list sequential scheme is stated in Section 3.3.1.
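By way of illustration, a minimal Python sketch of this draw-sequential scheme might read as follows (the function name and the use of Python's random module are our own choices).

```python
import random

def si_draw_sequential(N, n, rng=random):
    """Draw-sequential scheme for the SI design: draw one element at a
    time with equal probability 1/N, with replacement, until n distinct
    elements have been obtained; the number of draws v satisfies v >= n."""
    selected = set()
    v = 0
    while len(selected) < n:
        selected.add(rng.randint(1, N))  # label of the element just drawn
        v += 1
    return sorted(selected), v

sample, v = si_draw_sequential(N=1000, n=40)
```

Selecting an already-drawn element again merely increases the number of draws v; the final set always contains exactly n distinct elements.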
The objective of a survey is ordinarily to estimate one or more population
parameters. Two of the many important choices that must be made in a
survey are as follows:
1. The choice of a sampling design and a sample selection scheme to imple-
ment the design.
2. The choice of a formula (an estimator) by which to calculate an estimate
of a given parameter of interest.
The two choices are not independent of each other. For example, the choice
of estimator will usually depend on the choice of sampling design.
A strategy is the combination of a sampling design and an estimator. For
a given parameter, the general aim is to find the best possible strategy, that
is, one that estimates the parameter as accurately as possible.
2.4. Inclusion Probabilities
An interesting feature of a finite population of N labeled elements is that the
elements can be given different probabilities of inclusion in the sample. The
sampling statistician often takes advantage of the identifiability of the ele-
ments by deliberately attaching different inclusion probabilities to the various
elements. This is one way to obtain more accurate estimates.
Suppose that a certain sampling design has been fixed. That is, p(s), the
probability of selecting s, has a given mathematical form. The inclusion of a
given element k in a sample is a random event indicated by the random vari-
able $I_k$, defined as
$$I_k = \begin{cases} 1 & \text{if } k \in S \\ 0 & \text{if not} \end{cases} \qquad (2.4.1)$$
Note that $I_k = I_k(S)$ is a function of the random variable S. We call $I_k$ the
sample membership indicator of element k.
The probability that element k will be included in a sample, denoted $\pi_k$, is
obtained from the given design p(·) as follows:
$$\pi_k = \Pr(k \in S) = \Pr(I_k = 1) = \sum_{s \ni k} p(s) \qquad (2.4.2)$$
Here, $s \ni k$ denotes that the sum is over those samples s that contain the given
k. The probability that both of the elements k and l will be included is denoted
$\pi_{kl}$ and is obtained from the given p(·) as follows:
$$\pi_{kl} = \Pr(k \,\&\, l \in S) = \Pr(I_k I_l = 1) = \sum_{s \ni k \& l} p(s) \qquad (2.4.3)$$
We have $\pi_{kl} = \pi_{lk}$ for all k and l.
Note that (2.4.3) applies also when k = l, for in that case
$$\pi_{kk} = \Pr(I_k^2 = 1) = \Pr(I_k = 1) = \pi_k$$
That is, in the following, $\pi_{kk}$ should be interpreted as identical to $\pi_k$, for k =
1, ..., N.
Remark 2.4.1. The writing "$k \in S$" in, for example, equation (2.4.2) should be
interpreted as the random event "$S \ni k$," which is the event "a sample con-
taining k is realized."
With a given design p(·) are associated the N quantities
$$\pi_1, \ldots, \pi_k, \ldots, \pi_N$$
They constitute the set of first-order inclusion probabilities. Moreover, with
p(·) are associated the N(N − 1)/2 quantities
$$\pi_{12}, \pi_{13}, \ldots, \pi_{kl}, \ldots, \pi_{N-1,N}$$
which are called the set of second-order inclusion probabilities. Inclusion
probabilities of higher order can be defined and calculated for a given design
p(·). However, they play a less important role and will not be discussed in this
book.
EXAMPLE 2.4.1. Consider the SI design defined in Example 2.3.1. Let us calcu-
late the inclusion probabilities of first and second orders. There are exactly
$\binom{N-1}{n-1}$ samples s that include the element k, and exactly $\binom{N-2}{n-2}$ samples s
that include the elements k and l (k ≠ l). Since all samples of size n have the
same probability, $1\big/\binom{N}{n}$, we obtain from (2.4.2) and (2.4.3)
$$\pi_k = \sum_{s \ni k} p(s) = \binom{N-1}{n-1}\Big/\binom{N}{n} = \frac{n}{N}; \qquad k = 1, \ldots, N \qquad (2.4.4)$$
and
$$\pi_{kl} = \sum_{s \ni k \& l} p(s) = \binom{N-2}{n-2}\Big/\binom{N}{n} = \frac{n(n-1)}{N(N-1)}; \qquad k \neq l = 1, \ldots, N \qquad (2.4.5)$$
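For a small population, (2.4.4) and (2.4.5) are easily verified by brute-force enumeration of all samples. The sketch below (ours, with the arbitrary choice N = 6, n = 3) sums p(s) over the samples containing k, respectively both k and l.

```python
from itertools import combinations
from math import comb

N, n = 6, 3
U = range(1, N + 1)
samples = list(combinations(U, n))
p = 1 / comb(N, n)                      # equal probability of every SI sample

# Sum p(s) over the samples that contain k, respectively both k and l.
pi_k = {k: sum(p for s in samples if k in s) for k in U}
pi_kl = {(k, l): sum(p for s in samples if k in s and l in s)
         for k in U for l in U if k < l}

assert all(abs(v - n / N) < 1e-12 for v in pi_k.values())
assert all(abs(v - n * (n - 1) / (N * (N - 1))) < 1e-12
           for v in pi_kl.values())
```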
EXAMPLE 2.4.2. The BE sampling design (see Example 2.3.2) can be character-
ized as a design in which the indicators $I_k$ are independently and identically
distributed, each $I_k$ obeying the Bernoulli distribution with parameter π. All
N elements have the same first-order inclusion probability,
$$\pi_k = \pi; \qquad k = 1, \ldots, N$$
moreover, since the $I_k$ are independent,
$$\pi_{kl} = E(I_k I_l) = E(I_k)E(I_l) = \pi^2$$
for any k ≠ l.
A sampling design is often chosen to yield certain desired first- and second-
order inclusion probabilities. Although p(·) may be complicated, we can at-
tain one of the primary goals, namely, to determine expected value and vari-
ance of certain quantities calculated from the sample, from a knowledge of
the $\pi_k$ and the $\pi_{kl}$ alone.
Unless otherwise stated, we assume that the sampling design is such that
all first-order inclusion probabilities $\pi_k$ are strictly positive, that is,
$$\pi_k > 0 \quad \text{for all } k \in U \qquad (2.4.6)$$
This requirement ensures that every element has a chance to appear in the
sample. In order that a sampling design be called a probability sampling de-
sign, the $\pi_k$ must satisfy (2.4.6) (cf. condition 3 in Section 1.3). A sample s
realized by such a design is called a probability sample.
Remark 2.4.2. In practice, one nevertheless sometimes uses designs where the
requirement $\pi_k > 0$ for all $k \in U$ is not met. One example is cut-off sampling.
For instance, in a population of business enterprises, the smallest firms are
sometimes cut off, that is, given a zero inclusion probability, because their
contribution to the whole is deemed trivial, and the cost of constructing and
maintaining a frame that lists the numerous small enterprises may be too
high. Cut-off sampling leads to some bias in the estimates and must be used
only with great caution. For further discussion of cut-off sampling and other
nonprobability sampling designs, the reader is referred to Section 14.4.
Remark 2.4.3. In direct element sampling, all $\pi_k$, k = 1, ..., N, are ordinarily
known prior to sampling. However, in more complex design procedures
(notably multistage and multiphase sampling, see Chapters 4 and 7), the
sampling is often carried out in such a way that $\pi_k$ cannot be calculated at the
outset for all k. In multistage sampling, for example, the inclusion proba-
bilities are known a priori for the sampling units in each stage, which does
not imply that they are known a priori for all elements $k \in U$.
Another important property of a design occurs when the condition
$$\pi_{kl} > 0 \quad \text{for all } k \neq l \in U \qquad (2.4.7)$$
holds. A sampling design is said to be measurable if (2.4.6) and (2.4.7) are
satisfied. A measurable design allows the calculations of valid variance esti-
mates and valid confidence intervals based on the observed survey data. The
notion of measurability is further discussed in Section 14.3.
The N indicators can be summarized in vector form as
$$\mathbf{I} = (I_1, \ldots, I_k, \ldots, I_N)'$$
The event S = s is clearly equivalent to the event $\mathbf{I} = \mathbf{i}_s$, where
$$\mathbf{i}_s = (i_{1s}, \ldots, i_{ks}, \ldots, i_{Ns})'$$
with $i_{ks} = 1$ if $k \in s$ and $i_{ks} = 0$ if not. Then the probability distribution p(·)
introduced in Section 2.3 can be written in terms of the random vector $\mathbf{I}$ as
$$\Pr(\mathbf{I} = \mathbf{i}_s) = \Pr(S = s) = p(s) \quad \text{for } s \in \mathcal{S}$$
Remark 2.4.4. Unless otherwise stated, the designs considered in this book
are such that the probability
$$\Pr(S = s) = \Pr(\mathbf{I} = \mathbf{i}_s)$$
does not depend explicitly on the values of the study variables, y, z, and so
on. Such designs are called noninformative. The probability Pr(S = s) may
depend on auxiliary variable values, that is, on other variable values known
beforehand for the population elements.
An example of a design that is informative is the following. Two elements
are drawn sequentially, the first by giving equal selection probability to every
element, the second by assigning to the various elements selection probabili-
ties that depend on the value $y_{k_1}$ of the element $k_1$ obtained in the first draw.
In practical survey work, the noninformative designs dominate.
2.5. The Notion of a Statistic
The general theory of statistics uses the term "statistic" to refer to a real-
valued function whose value may vary with the different outcomes of a certain
experiment. Moreover, an essential requirement for a statistic is that it must
be computable for any given outcome of the experiment. The same general
idea of a statistic will serve the purposes of this book. We want to examine
how a statistic varies from one realization s to another of the random set S.
In other words, it is sample-to-sample variation that is of interest.
If Q(S) is a real-valued function of the random set S, we call any such
function a statistic, provided that the value Q(s) can be calculated once s, an
outcome of S, has been specified and the data for the elements of s have been
collected.
A simple but important statistic is $I_k(S)$, the random variable defined by
(2.4.1). It indicates membership or nonmembership in the sample s of the kth
element.
The sample size (that is, the cardinality of S), defined as
$$n_S = \sum_U I_k(S)$$
is another simple example of a statistic.
Other examples of statistics are $\sum_S y_k$, the sample total of the variable y,
and $\sum_S y_k \big/ \sum_S z_k$, the ratio of the sample totals of two variables y and z. By
contrast, $\sum_S y_k \big/ \sum_U z_k$ is not a statistic, unless the population total of z hap-
pens to be known from other sources.
When a sample is drawn in practice, exactly one realization s of the ran-
dom set S occurs. Once s has been realized, we assume that it is possible to
observe and measure certain variables of interest, for example, y and z, for
each element k in s. Thus, in the case of the statistic $Q(S) = \sum_S y_k \big/ \sum_S z_k$, for
example, we can, after measurement, calculate the realized value of the statis-
tic, namely, $Q(s) = \sum_s y_k \big/ \sum_s z_k$.
Note in this example that y and z are variables in the sense of taking
possibly different values $y_k$ and $z_k$ for the various elements k. However, y and
z are not treated as random variables. The random nature of a statistic Q(S)
stems solely from the fact that the set S is random.
Remark 2.5.1. If we have two variables of study, y and z, the example $Q(S) =
\sum_S y_k \big/ \sum_S z_k$ shows that a more telling (but too cumbersome) notation for a
statistic would be
$$Q(S) = Q[S; (y_k, z_k): k \in S]$$
Expressed in words, Q(S) is a function of S, $\mathbf{y} = (y_1, \ldots, y_N)$ and $\mathbf{z} = (z_1, \ldots, z_N)$
that depends on y and z only through those values $y_k$, $z_k$ for which $k \in S$.
The realized value of Q(S) is computed from the set of pairs $(y_k, z_k)$ associated
with the elements k in the realized sample s. For simplicity, we write simply
Q(S) for the statistic and Q(s) for the realized value.
Because a statistic Q(S) is a random variable in the sense just described, it
has various statistical properties, such as an expected value and a variance.
These concepts are detailed in the following definition.
Definition 2.5.1. The expectation and the variance of a statistic Q = Q(S) are
defined, respectively, by
$$E(Q) = \sum_{s \in \mathcal{S}} p(s)\,Q(s)$$
$$V(Q) = E\{[Q - E(Q)]^2\} = \sum_{s \in \mathcal{S}} p(s)\,[Q(s) - E(Q)]^2$$
The covariance between two statistics $Q_1 = Q_1(S)$ and $Q_2 = Q_2(S)$ is defined
by
$$C(Q_1, Q_2) = E\{[Q_1 - E(Q_1)][Q_2 - E(Q_2)]\} = \sum_{s \in \mathcal{S}} p(s)\,[Q_1(s) - E(Q_1)][Q_2(s) - E(Q_2)]$$
Note that these definitions refer to the variation over all possible samples
that can be obtained under the given sampling design, p(s). To emphasize this,
the terms design expectation, design variance, and design covariance are often
used in the literature. Here, we generally suppress the word “design” in these
and similar terms, as there is no risk of misinterpretation. The design expecta-
tion E(Q), for example, is the weighted average of the values Q(s), the weight
of Q(s) being the probability p(s) with which s is chosen.
When estimators are examined and compared, it is often of interest to
determine the value of an expectation E(Q), a variance V(Q), or a covariance
$C(Q_1, Q_2)$.
For simple statistics Q(S), the expected value and the variance can often be
evaluated easily as closed form analytic expressions. This is true in particular
for the linear statistics that we examine in detail later. Whether Q(S) is simple
or not, there exists a “long run frequency interpretation” of the expected value
E(Q) and the variance V(Q). Suppose we let a computer draw 10,000 indepen-
dent samples, each of size n, with the SI design, from a population of size N.
Once drawn, a sample is replaced into the population, the next sample is
drawn, and so on. For each sample, the value of 0(S) is computed.
At the end of the run, we can let the computer calculate the average and
the variance of the 10,000 obtained values of Q(S). The values thus obtained
will closely approximate the expectation E[Q(S)] and the variance V[Q(S)],
respectively. This method of approximating the value of quantities that may
be hard to calculate by analytic means is known as a Monte Carlo simulation.
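The following sketch (ours; the population values, the SI design, and the ratio statistic Q(S) = Σ_S y_k / Σ_S z_k are all illustrative choices) carries out such a simulation with 10,000 repeated samples.

```python
import random

rng = random.Random(1)
N, n, R = 200, 20, 10_000

# An artificial population with two study variables y and z.
y = [rng.uniform(10, 50) for _ in range(N)]
z = [rng.uniform(1, 5) for _ in range(N)]

# Draw R independent SI samples of size n and compute the ratio
# statistic Q(S) for each realized sample.
values = []
for _ in range(R):
    s = rng.sample(range(N), n)
    values.append(sum(y[k] for k in s) / sum(z[k] for k in s))

# Monte Carlo approximations of E[Q(S)] and V[Q(S)].
e_q = sum(values) / R
v_q = sum((q - e_q) ** 2 for q in values) / R
print(e_q, v_q)
```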
In survey sampling, we have the following frequency interpretation of the
expected value and the variance of a statistic: In a long run of repeated sam-
ples drawn from the finite population with a given sampling design p(·), the
average of the values Q(s) and the variance of the values Q(s) will closely
approximate their theoretical counterparts.
The number of realized samples is the crucial factor in determining the
accuracy of these approximations. For instance, if samples of size n = 6 are
drawn from a population of size N = 20, the number of possible samples is
$\binom{20}{6} = 38{,}760$, and the 10,000 samples realized in the simulation represent at
most 10,000/38,760 = 26% of the possible samples. By contrast, if samples of
size n = 200 are drawn from a population of N = 5,000, the number of possi-
ble samples is of the order of $10^{363}$. Close approximations to the true expected
value and the true variance would, however, generally be obtained in this case
also with 10,000 repeated samples, although they now represent a vanishingly
small fraction of all the possible samples.36 2. Estimation from Probability Samapies
The massive calculations required in a Monte Carlo simulation are not
part of an actual survey. In a survey, there is ordinarily one and only one
sample from which conclusions are drawn about the finite population. Monte
Carlo simulation is, however, a highly useful tool in the evaluation of the
statistical properties of complicated estimators. More detail on this subject is
given in Section 7.9.1.
2.6. The Sample Membership Indicators
The estimators that we are interested in examining can be expressed as func-
tions of the sample membership indicators defined by (2.4.1). It is therefore
important to describe the basic properties of the statistics $I_k = I_k(S)$, for k =
1,..., N, as in the following result.
Result 2.6.1. For an arbitrary sampling design p(s), and for $k, l \in U$,
$$E(I_k) = \pi_k$$
$$V(I_k) = \pi_k(1 - \pi_k)$$
$$C(I_k, I_l) = \pi_{kl} - \pi_k \pi_l$$
Proof. Note that $I_k = I_k(S)$ is a Bernoulli random variable. Thus, $E(I_k) =
\Pr(I_k = 1) = \pi_k$, using (2.4.2). Because $E(I_k^2) = E(I_k) = \pi_k$, it follows that
$V(I_k) = E(I_k^2) - \pi_k^2 = \pi_k(1 - \pi_k)$. Moreover, $I_k I_l = 1$ if and only if both k and
l are members of s. Thus $E(I_k I_l) = \Pr(I_k I_l = 1) = \pi_{kl}$ by equation (2.4.3), so
that
$$C(I_k, I_l) = E(I_k I_l) - E(I_k)E(I_l) = \pi_{kl} - \pi_k \pi_l \qquad \square$$
Note that for k = l, the covariance equals the variance, that is,
$$C(I_k, I_k) = V(I_k)$$
Depending on the design, the covariances $C(I_k, I_l)$ can be zero, positive, or
negative.
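Because Result 2.6.1 holds for an arbitrary design, it can be checked numerically for any toy design whose probabilities p(s) are written down explicitly. The sketch below (ours, with freely chosen probabilities) does so by direct enumeration.

```python
from itertools import combinations

# An arbitrary toy design on U = {1, 2, 3, 4}; the probabilities p(s)
# are freely chosen illustrative values summing to one.
U = [1, 2, 3, 4]
design = {(1, 2): 0.1, (1, 3): 0.2, (2, 4): 0.3,
          (3, 4): 0.25, (1, 2, 3): 0.15}

pi = {k: sum(p for s, p in design.items() if k in s) for k in U}
pi2 = {(k, l): sum(p for s, p in design.items() if k in s and l in s)
       for k, l in combinations(U, 2)}

for k in U:
    # Moments of the indicator I_k taken directly over all samples.
    e_ik = sum(p * (k in s) for s, p in design.items())
    v_ik = sum(p * ((k in s) - e_ik) ** 2 for s, p in design.items())
    assert abs(e_ik - pi[k]) < 1e-12                  # E(I_k) = pi_k
    assert abs(v_ik - pi[k] * (1 - pi[k])) < 1e-12    # V(I_k) = pi_k(1 - pi_k)

for k, l in combinations(U, 2):
    c_kl = sum(p * ((k in s) - pi[k]) * ((l in s) - pi[l])
               for s, p in design.items())
    assert abs(c_kl - (pi2[(k, l)] - pi[k] * pi[l])) < 1e-12
```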
Remark 2.6.1. It will save space to have a special symbol for the variances and
covariances of $I_k$. We use the symbol Δ. That is, for any $k, l \in U$, we set
$$C(I_k, I_l) = \pi_{kl} - \pi_k \pi_l = \Delta_{kl} \qquad (2.6.1)$$
When k = l, this implies (because $\pi_{kk} = \pi_k$) that
$$C(I_k, I_k) = V(I_k) = \pi_k(1 - \pi_k) = \Delta_{kk} \qquad (2.6.2)$$
Double sums over a set of elements will now be needed. Let $a_{kl}$ be any
quantity associated with k and l, where $k, l \in U$. Let A be any set of elements,
$A \subseteq U$. For example, A may be the whole population U, or A may be a
sample s. The double sum notation
$$\mathop{\sum\sum}_{A}\, a_{kl}$$
will be our shorthand for
$$\sum_{k \in A} \sum_{l \in A} a_{kl} = \sum_{k \in A} a_{kk} + \sum_{k \in A} \sum_{\substack{l \in A \\ l \neq k}} a_{kl}$$
The double sum on the right-hand side of this expression we denote more
concisely as
$$\mathop{\sum\sum}_{k \neq l \in A} a_{kl}$$
Thus we have
$$\mathop{\sum\sum}_{A}\, a_{kl} = \sum_{A} a_{kk} + \mathop{\sum\sum}_{k \neq l \in A} a_{kl}$$
Another simple statistic is the sample size $n_S$. It can be expressed in terms of
the indicators $I_k$ as
$$n_S = \sum_U I_k$$
The first two moments of the statistic $n_S$ follow easily from Result 2.6.1. We
have
$$E(n_S) = \sum_U \pi_k \qquad (2.6.3)$$
and
$$V(n_S) = \sum_U \pi_k(1 - \pi_k) + \mathop{\sum\sum}_{k \neq l \in U} (\pi_{kl} - \pi_k \pi_l)
= \sum_U \pi_k - \Big(\sum_U \pi_k\Big)^2 + \mathop{\sum\sum}_{k \neq l \in U} \pi_{kl} \qquad (2.6.4)$$
EXAMPLE 2.6.1. We return to the design BE considered in Examples 2.2.2, 2.3.2,
and 2.4.2. The sample size $n_S$ is a binomially distributed random variable with
parameters N and π. It follows that
$$E_{BE}(n_S) = N\pi$$
and
$$V_{BE}(n_S) = N\pi(1 - \pi)$$
The results can be obtained alternatively from equations (2.6.3) and (2.6.4).
For example, from (2.6.4), using $\pi_k = \pi$ for all k and $\pi_{kl} = \pi^2$ for all k ≠ l,
$$V_{BE}(n_S) = N\pi - (N\pi)^2 + N(N - 1)\pi^2 = N\pi(1 - \pi)$$
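A short simulation (ours; the values N = 1,000 and π = 0.1 are arbitrary) illustrates the BE mechanism and the binomial behavior of the realized sample size $n_S$.

```python
import random

rng = random.Random(2)
N, pi, R = 1000, 0.1, 5000

# BE design: each element is included independently with probability pi,
# so the realized sample size n_S is Binomial(N, pi).
sizes = []
for _ in range(R):
    sample = [k for k in range(1, N + 1) if rng.random() < pi]
    sizes.append(len(sample))

mean_ns = sum(sizes) / R
var_ns = sum((m - mean_ns) ** 2 for m in sizes) / R
print(mean_ns, N * pi)               # close to E(n_S) = N * pi = 100
print(var_ns, N * pi * (1 - pi))     # close to V(n_S) = N * pi * (1 - pi) = 90
```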
Another example of random sample size occurs in single-stage cluster sam-
pling, as described in Chapter 4. A cluster is a set of elements, for example,
the households in a city block or the students in a class. If clusters are selected
and if the sample consists of all elements in the selected clusters, then the
sample size, that is, the number of observed elements, will be variable if the
clusters are of unequal sizes.
Practitioners avoid designs in which the sample size varies extensively.
One reason is that variable sample size will cause an increase in variance for
certain types of estimators. More importantly, survey statisticians dislike be-
ing in a situation where the number of observations is highly unpredictable
when the survey is planned.
A fixed (sample) size design is such that whenever p(s) > 0, the sample s
will contain a fixed number of elements, say n. That is, a sample is realizable
under a fixed size design only if its size is exactly n. But all samples of size n
need not be realizable to have a fixed size design.
In the case of a fixed size design the inclusion probabilities obey some
simple relations, which are stated in the following result.
Result 2.6.2. If the design p(s) has the fixed size n, then
$$\sum_U \pi_k = n$$
$$\mathop{\sum\sum}_{k \neq l \in U} \pi_{kl} = n(n - 1)$$
$$\sum_{\substack{l \in U \\ l \neq k}} \pi_{kl} = (n - 1)\pi_k$$
Proof. If p(s) is of the fixed size n, then $n_S = n$ with probability one. Thus
$E(n_S) = n$ and $V(n_S) = 0$. Using (2.6.3) and (2.6.4), we obtain the first two
results of the theorem. The third result follows from the derivation
$$\sum_{\substack{l \in U \\ l \neq k}} \pi_{kl} = \sum_{\substack{l \in U \\ l \neq k}} E(I_k I_l) = E\Big[I_k \Big(\sum_{\substack{l \in U \\ l \neq k}} I_l\Big)\Big] = E[I_k(n_S - I_k)]
= nE(I_k) - E(I_k^2) = (n - 1)\pi_k \qquad \square$$
The SI design is an example of a fixed size design. That the three parts of
Result 2.6.2 hold for this design in particular is easily checked by use of the
inclusion probabilities $\pi_k$ and $\pi_{kl}$ calculated in Example 2.4.1.
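The check is immediate; the small sketch below (ours) inserts the SI inclusion probabilities π_k = n/N and π_kl = n(n − 1)/[N(N − 1)] into the three relations of Result 2.6.2.

```python
N, n = 100, 10
pi_k = n / N                                # first-order, from (2.4.4)
pi_kl = n * (n - 1) / (N * (N - 1))         # second-order, from (2.4.5)

# The three relations of Result 2.6.2, evaluated for the SI design.
assert abs(N * pi_k - n) < 1e-12                       # sum over U of pi_k
assert abs(N * (N - 1) * pi_kl - n * (n - 1)) < 1e-12  # double sum, k != l
assert abs((N - 1) * pi_kl - (n - 1) * pi_k) < 1e-12   # sum over l != k
```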
2.7. Estimators and Their Basic Statistical Properties
Most of the statistics that we examine in this book are different kinds of
estimators. An estimator is a statistic thought to produce values that, for
most samples, lie near the unknown population quantity that one wishes to
estimate. Such quantities are called parameters. The general notation for a
parameter will be θ.
If there is only one study variable, y, we can think of θ as a function of
$y_1, \ldots, y_N$, the N values of y. Thus,
$$\theta = \theta(y_1, \ldots, y_N)$$
Examples include the population total,
$$\theta = t = \sum_U y_k$$
the population mean,
$$\theta = \bar{y}_U = \sum_U y_k \big/ N$$
and the population variance,
$$\theta = S_{yU}^2 = \sum_U (y_k - \bar{y}_U)^2 \big/ (N - 1) = \sum_U y_k^2 \big/ (N - 1) - \Big(\sum_U y_k\Big)^2 \big/ N(N - 1)$$
A parameter can be a function of the values of two or more variables of study,
as in the case of the ratio of the population totals of y and z,
$$\theta = \frac{\sum_U y_k}{\sum_U z_k} \qquad (2.7.1)$$
We denote an estimator of θ by
$$\hat{\theta} = \hat{\theta}(S)$$
If s is a realization of the random set S, we assume it is possible to calculate
$\hat{\theta}(s)$ from the study variable values $y_k, z_k, \ldots$ associated with the elements $k \in s$.
For example, under the SI design,
$$\hat{\theta} = N\,\frac{\sum_s y_k}{n}$$
is an often-used estimator of the parameter $\theta = \sum_U y_k$, and
$$\hat{\theta} = \frac{\sum_s y_k}{\sum_s z_k}$$
is an often-used estimator of the parameter θ given by (2.7.1).
It is of considerable interest to describe the sample-to-sample variations of
a proposed estimator $\hat{\theta}$. An estimator that varies little around the unknown
value θ is on intuitive grounds "better" than one that varies a great deal.
By the sampling distribution of an estimator $\hat{\theta}$ we mean a specification of
all the possible values of $\hat{\theta}$, together with the probability that $\hat{\theta}$ attains each
of these values under the sampling design in use, p(s). That is, the exact state-
ment of the sampling distribution of $\hat{\theta}$ requires, for each possible value c of $\hat{\theta}$,
a specification of the probability
$$\Pr(\hat{\theta} = c) = \sum_{s \in \mathcal{S}_c} p(s) \qquad (2.7.2)$$
where $\mathcal{S}_c$ is the set of samples s for which $\hat{\theta}(s) = c$.
For a given population of y-values $y_1, \ldots, y_N$, a given design, and a given
estimator, it is possible in theory to produce the exact sampling distribution
of a statistic $\hat{\theta}$. But, computationally, it would be a formidable task for most
finite populations of practical interest. Ordinarily, both the number of possi-
ble samples and the number of possible values of $\hat{\theta}$ are extremely large. How-
ever, summary measures that describe important aspects of the sampling dis-
tribution of an estimator $\hat{\theta}$ are needed, for example, when comparisons are
made with competing estimators. These summary measures are usually un-
known, theoretical quantities. The following summary measures are derived
from Definition 2.5.1. The expectation of $\hat{\theta}$ is given by
$$E(\hat{\theta}) = \sum_{s \in \mathcal{S}} p(s)\,\hat{\theta}(s)$$
It is a weighted average of the possible values $\hat{\theta}(s)$ of $\hat{\theta}$, with the probabilities
p(s) as weights. The variance of $\hat{\theta}$ is given by
$$V(\hat{\theta}) = \sum_{s \in \mathcal{S}} p(s)\,[\hat{\theta}(s) - E(\hat{\theta})]^2$$
Two important measures of the quality of an estimator $\hat{\theta}$ are the bias and
the mean square error. The bias of $\hat{\theta}$ is defined as
$$B(\hat{\theta}) = E(\hat{\theta}) - \theta$$
An estimator $\hat{\theta}$ is said to be unbiased for θ if
$$B(\hat{\theta}) = 0 \quad \text{for all } \mathbf{y} = (y_1, \ldots, y_N)' \in \mathbb{R}^N$$
The mean square error of $\hat{\theta}$ is defined as
$$\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = \sum_{s \in \mathcal{S}} p(s)\,[\hat{\theta}(s) - \theta]^2$$
An easily verified result is that
$$\text{MSE}(\hat{\theta}) = V(\hat{\theta}) + [B(\hat{\theta})]^2 \qquad (2.7.3)$$
If $\hat{\theta}$ is unbiased for θ, it follows from equation (2.7.3) that $\text{MSE}(\hat{\theta}) = V(\hat{\theta})$.
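For a very small population the entire sampling distribution of an estimator can be written down and the summary measures computed exactly. The sketch below (ours, with made-up y-values) does this for the estimator $\hat{\theta} = N \sum_s y_k / n$ under the SI design, confirming that it is unbiased, so that its MSE equals its variance.

```python
from itertools import combinations
from math import comb

# A tiny artificial population and an SI design of fixed size n.
y = [3.0, 5.0, 8.0, 1.0, 6.0]           # made-up y-values, N = 5
N, n = len(y), 2
theta = sum(y)                           # parameter: the population total

p = 1 / comb(N, n)                       # p(s) under SI
estimates = [N * sum(y[k] for k in s) / n
             for s in combinations(range(N), n)]

# Exact summary measures of the sampling distribution of theta-hat.
e_theta = sum(p * t for t in estimates)
v_theta = sum(p * (t - e_theta) ** 2 for t in estimates)
bias = e_theta - theta
mse = v_theta + bias ** 2
print(e_theta, theta)    # equal: the estimator is unbiased under SI
print(v_theta, mse)      # zero bias, so the MSE equals the variance
```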
Remark 2.7.1. Note the distinction between an estimate and an estimator. By
the estimate produced by the estimator $\hat{\theta} = \hat{\theta}(S)$ is meant the number $\hat{\theta}(s)$ that
can be calculated after a specific outcome s of the random set S has been
observed and the study variable values $y_k, z_k, \ldots$ have been recorded for the
elements $k \in s$. For example, for an SI sample of n elements, the random
variable
$$\hat{\theta}(S) = N\,\frac{\sum_S y_k}{n}$$
is an estimator of $\theta = \sum_U y_k$; the estimate obtained for a particular outcome
s is the number
$$\hat{\theta}(s) = N\,\frac{\sum_s y_k}{n}$$
In the following, we ignore the typographic distinction between S, the random
set, and s, a realization of S. For simplicity, we use the lower case character
to designate both the random set and its realization. There is little risk of
misunderstanding.
An estimator is unbiased if its weighted average (over all possible samples
using the probabilities p(s) as weights) is equal to the unknown parameter
value. The most important estimators in survey sampling are unbiased or
approximately unbiased. It is characteristic of an approximately unbiased
estimator that the bias is unimportant in large samples. For most of the
approximately unbiased estimators that we consider, the bias is actually very
small, even for modest sample sizes.
Remark 2.7.2. The statement that an estimator $\hat{\theta}$ is unbiased is a statement
of average performance, namely, over all possible samples. The probability-
weighted average of the deviations $\hat{\theta} - \theta$ is nil. However, to say that an esti-
mate is biased is strictly speaking incorrect. An estimate is a constant value
obtained for a particular sample realization. This value can be off the mark,
in the sense of deviating from the unknown parameter value θ. Because an
estimate is a number, it has no variation and no bias. The term biased estimate
is nevertheless used occasionally. The only way that the term makes sense is
if it is interpreted as "an estimate calculated from an estimator that is biased."
Although unbiasedness or approximate unbiasedness are desirable proper-
ties, it is clear that these properties say nothing about another important
aspect of the sampling distribution, namely, how widely dispersed the various
values of the estimator are. The variance is a measure of this dispersion.
When choosing between several possible estimators $\hat{\theta}_1, \hat{\theta}_2, \ldots$ for one and
the same parameter θ, the statistician will normally want to single out one for
which the sampling distribution is narrowly concentrated around the un-
known value θ. This suggests using the criterion of "small mean square error"
to select an estimator, because, if $\text{MSE}(\hat{\theta}) = V(\hat{\theta}) + [B(\hat{\theta})]^2$ is small, there is
strong reason to believe that the sample actually drawn in a survey will have
produced an estimate near the true value. However, even if the sampling
distribution is tightly concentrated around θ, there is always a small possibili-
ty that our particular sample was "bad," so that the estimate falls in one of
the tails of the distribution, rather far removed from θ. The statistician must
live with this possibility.
The survey statistician should avoid estimators that are considerably
biased, because valid confidence intervals cannot be obtained if the bias is
substantial (for a further explanation, see Section 5.2). Therefore, typically
the statistician will seek among estimators that are at least approximately
unbiased and choose one that has a small variance.
Remark 2.7.3. The square root of the variance, $[V(\hat{\theta})]^{1/2}$, is called the standard
error of the estimator $\hat{\theta}$. The ratio of the standard error of the estimator to