Model Assisted Survey Sampling

Carl-Erik Särndal, Bengt Swensson, Jan Wretman

Springer Series in Statistics

Carl-Erik Särndal                    Bengt Swensson
Statistics Sweden                    Department of Data Analysis
Klostergatan 23                      University of Örebro
SE-70189 Örebro                      701 30 Örebro
Sweden                               Sweden

Jan Wretman
Department of Statistics
Stockholm University
106 91 Stockholm
Sweden

The work on this book was supported in part by Statistics Sweden.

Mathematics Subject Classification: 62D05

Library of Congress Cataloging-in-Publication Data
Särndal, Carl-Erik, 1937-
Model assisted survey sampling / Carl-Erik Särndal, Bengt Swensson, Jan Wretman.
p. cm. (Springer series in statistics)
Includes bibliographical references and indexes.
1. Sampling (Statistics). I. Swensson, Bengt. II. Wretman, Jan Håkan, 1939- . III. Title. IV. Series.
QA276.6837 1991    001.4'222 dc20    91-7854

ISBN 0-387-40620-4    Printed on acid-free paper.

First softcover printing, 2003.
© 1992 Springer-Verlag New York, Inc.
All rights reserved.
This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer-Verlag New York, Inc., 175 Fifth Avenue, New York, NY 10010, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed in the United States of America.
9 8 7 6 5 4 3 2 1    SPIN 10947609

www.springer-ny.com

Springer-Verlag New York Berlin Heidelberg
A member of BertelsmannSpringer Science+Business Media GmbH

Preface

This text on survey sampling contains both basic and advanced material. The main theme is estimation in surveys. Other books of this kind exist, but most were written before the recent rapid advances. This book has four important objectives:

1. To develop the central ideas in survey sampling from the unified perspective of unequal probability sampling. In a majority of surveys, the sampling units have different probabilities of selection, and these probabilities play a crucial role in estimation.
2. To write a basic sampling text that, unlike its predecessors, is guided by statistical modeling in the derivation of estimators. The model assisted approach in this book clarifies and systematizes the use of auxiliary variables, which is an important feature of survey design.
3. To cover significant recent developments in special areas such as analysis of survey data, domain estimation, variance estimation, methods for nonresponse, and measurement error models.
4. To provide opportunity for students to practice sampling techniques on real data. We provide numerous exercises concerning estimation for real (albeit small) populations described in the appendices.

This book grew in part out of our work as leaders of survey methodology development projects at Statistics Sweden. In supervising younger colleagues, we repeatedly found it more fruitful to stress a few important general principles than to consider every selection scheme and every estimator formula as a separate estimation problem. We emphasize a general approach.
This book will be useful in teaching basic, as well as more advanced, university courses in survey sampling. Our suggestions for structuring such courses are given below. The material has been tested in our own courses in Montréal, Örebro, and Stockholm. Also, this book will provide a good source of information for persons engaged in practical survey work or in survey methodology research.
The theory and methods discussed in this book have their primary field of application in the large surveys typically carried out by government statistical agencies and major survey institutes. Such organizations have the resources to collect data for the large probability samples drawn by the complex survey designs that are often required. However, the issues and the methods discussed in the book are also relevant to smaller surveys.
Statistical modeling has strongly influenced survey sampling theory in recent years. In this book, sampling theory is assisted by modeling.
It becomes simple to explain how the auxiliary information available in a given survey will lead to a particular estimation technique. The teaching of sampling and the style of presentation in journal articles on sampling have changed a great deal as a result of this new emphasis. Readers of this book will become familiar with this new style.
We use the randomization theory, or design-based, point of view. This is the traditional mode of inference in surveys, ever since the sampling theory breakthroughs in the 1930s and 1940s. The reasoning is familiar to survey statisticians in government and elsewhere.
A body of basic and more advanced knowledge is defined that is useful for the survey sampling methodologist, and a broad range of topics is covered, without going into great detail on any one subject. Some topics to which this book devotes a single chapter have been the subject of specialized treatises.
The material should be accessible to a wide audience. The presentation is rich in statistical concepts, but the mathematical level is not advanced. We assume that readers of this book have been exposed to at least one course in statistical inference, covering principles of point estimation and confidence intervals. Some familiarity with statistical linear models, in particular regression analysis, is also desirable. A previous exposure to finite population sampling theory is not required for otherwise well-prepared readers. Some prior knowledge of sampling techniques is, of course, an advantage.
A collection of exercises is placed at the end of each chapter. The understanding of sampling theory is facilitated by analyzing data. Some of the exercises involve sampling and analysis of data from three populations of Swedish municipalities, named MU284, MU281, and MU200, and one population of countries named CO124. These populations are necessarily small in comparison to populations in real-world surveys, but the issues invoked by the exercises are real. Appendices B, C, and D present the populations. Other exercises are theoretical; some of them ask the reader to derive an expression or verify an assertion in the text.
There are various ways in which the book can be used for teaching courses in survey sampling. A first (one-semester or one-quarter) course can be based on the following chapters: 1, 2, 3, parts of 4, 5, 6, 7, and 8, and, at the instructor's discretion, a selection of material from later chapters. To mention at least a few of the issues in Chapters 10, 14, and 15 seems particularly important in a first course. A second course (one-semester or one-quarter) may use topics from Chapters 4 to 7 not covered in the first course, followed by a selection of material from Chapters 8 to 17.
Certain sections usually placed toward the end of a chapter provide further detail on ideas or derivations presented earlier. Examples are Sections 3.8, 5.12, and 6.8. Such sections are not essential for the main flow of the argument, and they may be omitted in a first as well as in a second course.
The Monte Carlo simulations and other computation for this book were carried out on an Apple Macintosh II, using MicroAPL's APL.68000, Macintosh II 68020 + 68881 version 7.20A.
Statistics Sweden generously supported this project. We are truly grateful for their cooperation as well as for support from the University of Örebro, Stockholm University, Université de Montréal, Svenska Handelsbanken, and the Natural Sciences and Engineering Research Council of Canada.
Many individuals helped and encouraged us during the work on this book. In particular, we are indebted to Christer Arvas and Lars Lyberg for their supportive view of the project, to Jean-Claude Deville, Eva Elvers, Michael Hidiroglou, Klaus Krickeberg, and Ingrid Lyberg for critical appraisal of some of the chapters, to Kerstin Averbick, Bibi Thunell, and the late Sivan Rosén for typing, and to Patricia Dean and Nathalie Gaudet for editorial assistance. We have benefited from discussions with survey statisticians at Statistics Canada and Statistics Sweden.

Montréal       Carl-Erik Särndal
Örebro         Bengt Swensson
Stockholm      Jan Wretman

Contents

Preface

PART I
Principles of Estimation for Finite Populations and Important Sampling Designs

CHAPTER 1
Survey Sampling in Theory and Practice
1.1 Surveys in Society
1.2 Skeleton Outline of a Survey
1.3 Probability Sampling
1.4 Sampling Frame
1.5 Area Frames and Similar Devices
1.6 Target Population and Frame Population
1.7 Survey Operations and Associated Sources of Error
1.8 Planning a Survey and the Need for Total Survey Design
1.9 Total Survey Design
1.10 The Role of Statistical Theory in Survey Sampling
Exercises

CHAPTER 2
Basic Ideas in Estimation from Probability Samples
2.1 Introduction
2.2 Population, Sample, and Sample Selection
2.3 Sampling Design
2.4 Inclusion Probabilities
2.5 The Notion of a Statistic
2.6 The Sample Membership Indicators
2.7 Estimators and Their Basic Statistical Properties
2.8 The π Estimator and Its Properties
2.9 With-Replacement Sampling
2.10 The Design Effect
2.11 Confidence Intervals
Exercises

CHAPTER 3
Unbiased Estimation for Element Sampling Designs
3.1 Introduction
3.2 Bernoulli Sampling
3.3 Simple Random Sampling
3.3.1 Simple Random Sampling without Replacement
3.3.2 Simple Random Sampling with Replacement
3.4 Systematic Sampling
3.4.1 Definitions and Main Result
3.4.2 Controlling the Sample Size
3.4.3 The Efficiency of Systematic Sampling
3.4.4 Estimating the Variance
3.5 Poisson Sampling
3.6 Probability Proportional-to-Size Sampling
3.6.1 Introduction
3.6.2 πps Sampling
3.6.3 pps Sampling
3.6.4 Selection from Randomly Formed Groups
3.7 Stratified Sampling
3.7.1 Introduction
3.7.2 Notation, Definitions, and Estimation
3.7.3 Optimum Sample Allocation
3.7.4 Alternative Allocations under STSI Sampling
3.8 Sampling without Replacement versus Sampling with Replacement
3.8.1 Alternative Estimators for Simple Random Sampling with Replacement
3.8.2 The Design Effect of Simple Random Sampling with Replacement
Exercises

CHAPTER 4
Unbiased Estimation for Cluster Sampling and Sampling in Two or More Stages
4.1 Introduction
4.2 Single-Stage Cluster Sampling
4.2.1 Introduction
4.2.2 Simple Random Cluster Sampling
4.3 Two-Stage Sampling
4.3.1 Introduction
4.3.2 Two-Stage Element Sampling
4.4 Multistage Sampling
4.4.1 Introduction and a General Result
4.4.2 Three-Stage Element Sampling
4.5 With-Replacement Sampling of PSUs
4.6 Comparing Simplified Variance Estimators in Multistage Sampling
Exercises

CHAPTER 5
Introduction to More Complex Estimation Problems
5.1 Introduction
5.2 The Effect of Bias on Confidence Statements
5.3 Consistency and Asymptotic Unbiasedness
5.4 π Estimators for Several Variables of Study
5.5 The Taylor Linearization Technique for Variance Estimation
5.6 Estimation of a Ratio
5.7 Estimation of a Population Mean
5.8 Estimation of a Domain Mean
5.9 Estimation of Variances and Covariances in a Finite Population
5.10 Estimation of Regression Coefficients
5.10.1 The Parameters of Interest
5.10.2 Estimation of the Regression Coefficients
5.11 Estimation of a Population Median
5.12 Demonstration of Result 5.10.1
Exercises

PART II
Estimation through Linear Modeling, Using Auxiliary Variables

CHAPTER 6
The Regression Estimator
6.1 Introduction
6.2 Auxiliary Variables
6.3 The Difference Estimator
6.4 Introducing the Regression Estimator
6.5 Alternative Expressions for the Regression Estimator
6.6 The Variance of the Regression Estimator
6.7 Comments on the Role of the Model
6.8 Optimal Coefficients for the Difference Estimator
Exercises

CHAPTER 7
Regression Estimators for Element Sampling Designs
7.1 Introduction
7.2 Preliminary Considerations
7.3 The Common Ratio Model and the Ratio Estimator
7.3.1 The Ratio Estimator under SI Sampling
7.3.2 The Ratio Estimator under Other Designs
7.3.3 Optimal Sampling Design for the π Weighted Ratio Estimator
7.3.4 Alternative Ratio Models
7.4 The Common Mean Model
7.5 Models Involving Population Groups
7.6 The Group Mean Model and the Poststratified Estimator
7.7 The Group Ratio Model and the Separate Ratio Estimator
7.8 Simple Regression Models and Simple Regression Estimators
7.9 Estimators Based on Multiple Regression Models
7.9.1 Multiple Regression Models
7.9.2 Analysis of Variance Models
7.10 Conditional Confidence Intervals
7.10.1 Conditional Analysis for BE Sampling
7.10.2 Conditional Analysis for the Poststratification Estimator
7.11 Regression Estimators for Variable-Size Sampling Designs
7.12 A Class of Regression Estimators
7.13 Regression Estimation of a Ratio of Population Totals
Exercises

CHAPTER 8
Regression Estimators for Cluster Sampling and Two-Stage Sampling
8.1 Introduction
8.2 The Nature of the Auxiliary Information When Clusters of Elements Are Selected
8.3 Comments on Variance and Variance Estimation in Two-Stage Sampling
8.4 Regression Estimators Arising Out of Modeling at the Cluster Level
8.5 The Common Ratio Model for Cluster Totals
8.6 Estimation of the Population Mean When Clusters Are Sampled
8.7 Design Effects for Single-Stage Cluster Sampling
8.8 Stratified Clusters and Poststratified Clusters
8.9 Regression Estimators Arising Out of Modeling at the Element Level
8.10 Ratio Models for Elements
8.11 The Group Ratio Model for Elements
8.12 The Ratio Model Applied within a Single PSU
Exercises

PART III
Further Questions in Design and Analysis of Surveys

CHAPTER 9
Two-Phase Sampling
9.1 Introduction
9.2 Notation and Choice of Estimator
9.3 The π* Estimator
9.4 Two-Phase Sampling for Stratification
9.5 Auxiliary Variables for Selection in Two Phases
9.6 Difference Estimators
9.7 Regression Estimators for Two-Phase Sampling
9.8 Stratified Bernoulli Sampling in Phase Two
9.9 Sampling on Two Occasions
9.9.1 Estimating the Current Total
9.9.2 Estimating the Previous Total
9.9.3 Estimating the Absolute Change and the Sum of the Totals
Exercises
CHAPTER 10
Estimation for Domains
10.1 Introduction
10.2 The Background for Domain Estimation
10.3 The Basic Estimation Methods for Domains
10.4 Conditioning on the Domain Sample Size
10.5 Regression Estimators for Domains
10.6 A Ratio Model for Each Domain
10.7 Group Models for Domains
10.8 Problems Arising for Small Domains; Synthetic Estimation
10.9 More on the Comparison of Two Domains
Exercises

CHAPTER 11
Variance Estimation
11.1 Introduction
11.2 A Simplified Variance Estimator under Sampling without Replacement
11.3 The Random Groups Technique
11.3.1 Independent Random Groups
11.3.2 Dependent Random Groups
11.4 Balanced Half-Samples
11.5 The Jackknife Technique
11.6 The Bootstrap
11.7 Concluding Remarks
Exercises

CHAPTER 12
Searching for Optimal Sampling Designs
12.1 Introduction
12.2 Model-Based Optimal Design for the General Regression Estimator
12.3 Model-Based Optimal Design for the Group Mean Model
12.4 Model-Based Stratified Sampling
12.5 Applications of Model-Based Stratification
12.6 Other Approaches to Efficient Stratification
12.7 Allocation Problems in Stratified Random Sampling
12.8 Allocation Problems in Two-Stage Sampling
12.8.1 The π Estimator of the Population Total
12.8.2 Estimation of the Population Mean
12.9 Allocation in Two-Phase Sampling for Stratification
12.10 A Further Comment on Mathematical Programming
12.11 Sampling Design and Experimental Design
Exercises

CHAPTER 13
Further Statistical Techniques for Survey Data
13.1 Introduction
13.2 Finite Population Parameters in Multivariate Regression and Correlation Analysis
13.3 The Effect of Sampling Design on a Statistical Analysis
13.4 Variances and Estimated Variances for Complex Analyses
13.5 Analysis of Categorical Data for Finite Populations
13.5.1 Test of Homogeneity for Two Populations
13.5.2 Testing Homogeneity for More than Two Finite Populations
13.5.3 Discussion of Categorical Data Tests for Finite Populations
13.6 Types of Inference When a Finite Population Is Sampled
Exercises

PART IV
A Broader View of Errors in Surveys

CHAPTER 14
Nonsampling Errors and Extensions of Probability Sampling Theory
14.1 Introduction
14.2 Historic Notes: The Evolution of the Probability Sampling Approach
14.3 Measurable Sampling Designs
14.4 Some Nonprobability Sampling Methods
14.5 Model-Based Inference from Survey Samples
14.6 Imperfections in the Survey Operations
14.6.1 Ideal Conditions for the Probability Sampling Approach
14.6.2 Extension of the Probability Sampling Approach
14.7 Sampling Frames
14.7.1 Frame Imperfections
14.7.2 Estimation in the Presence of Frame Imperfections
14.7.3 Multiple Frames
14.7.4 Frame Construction and Maintenance
14.8 Measurement and Data Collection
14.9 Data Processing
14.10 Nonresponse
Exercises

CHAPTER 15
Nonresponse
15.1 Introduction
15.2 Characteristics of Nonresponse
15.2.1 Definition of Nonresponse
15.2.2 Response Sets
15.2.3 Lack of Unbiased Estimators
15.3 Measuring Nonresponse
15.4 Dealing with Nonresponse
15.4.1 Planning of the Survey
15.4.2 Callbacks and Follow-Ups
15.4.3 Subsampling of Nonrespondents
15.4.4 Randomized Response
15.5 Perspectives on Nonresponse
15.6 Estimation in the Presence of Unit Nonresponse
15.6.1 Response Modeling
15.6.2 A Useful Response Model
15.6.3 Estimators That Use Weighting Only
15.6.4 Estimators That Use Weighting as Well as Auxiliary Variables
15.7 Imputation
Exercises

CHAPTER 16
Measurement Errors
16.1 Introduction
16.2 On the Nature of Measurement Errors
16.3 The Simple Measurement Model
16.4 Decomposition of the Mean Square Error
16.5 The Risk of Underestimating the Total Variance
16.6 Repeated Measurements as a Tool in Variance Estimation
16.7 Measurement Models Taking Interviewer Effects into Account
16.8 Deterministic Assignment of Interviewers
16.9 Random Assignment of Interviewers to Groups
16.10 Interpenetrating Subsamples
16.11 A Measurement Model with Sample-Dependent Moments
Exercises

CHAPTER 17
Quality Declarations for Survey Data
17.1 Introduction
17.2 Policies Concerning Information on Data Quality
17.3 Statistics Canada's Policy on Informing Users of Data Quality and Methodology
Exercise

APPENDIX A
Principles of Notation

APPENDIX B
The MU284 Population

APPENDIX C
The Clustered MU284 Population

APPENDIX D
The CO124 Population

References
Answers to Selected Exercises
Author Index
Subject Index

PART I
Principles of Estimation for Finite Populations and Important Sampling Designs

CHAPTER 1
Survey Sampling in Theory and Practice

1.1. Surveys in Society

The need for statistical information seems endless in modern society. In particular, data are regularly collected to satisfy the need for information about specified sets of elements, called finite populations. For example, our objective might be to obtain information about the households in a city and their spending patterns, the business enterprises in an industry and their profits, the individuals in a country and their participation in the work force, or the farms in a region and their production of cereal.
One of the most important modes of data collection for satisfying such needs is a sample survey, that is, a partial investigation of the finite population. A sample survey costs less than a complete enumeration, is usually less time consuming, and may even be more accurate than the complete enumeration. This book is an up-to-date account of statistical theory and methods for sample surveys. The emphasis is on new developments.
Over the last few decades, survey sampling has evolved into an extensive body of theory, methods, and operations used daily all over the world. As Rossi, Wright and Anderson (1983) point out, it is appropriate today to speak of a worldwide survey industry with different sectors: a government sector, an academic sector, a private and mass media sector, and a residual sector consisting of ad hoc and in-house surveys.
In many countries, a central statistical office is mandated by law to provide statistical information about the state of the nation, and surveys are an important part of this activity.
For example, in Canada, the 1971 Statistics Act mandates Statistics Canada to "collect, compile, analyze, abstract, and publish statistical information relating to the commercial, industrial, financial, social, economic, and general activities and condition of the people."
Thus, national statistical offices regularly produce statistics on important national characteristics and activities, including demography (age and sex distribution, fertility and mortality), agriculture (crop distribution), labor force (employment), health and living conditions, and industry and trade. We owe much of the essential theory of survey sampling to individuals who are or were associated with government statistical offices.
In the academic sector, survey sampling is extensively used, especially in sociology and public opinion research, but also in economics, political science, psychology, and social psychology. Many academically affiliated survey institutes are heavily engaged in survey sampling activity.
In the private and mass media sectors, we find television audience surveys, readership surveys, polls, and marketing surveys. The contents of ad hoc and in-house surveys vary greatly. Examples include payroll surveys and surveys for auditing purposes.
Survey sampling has thus grown into a universally accepted approach for information gathering. Extensive resources are devoted every year to surveys. We do not have accurate figures to illustrate the scope of the industry. However, as an example from the United States, the National Research Council in 1981 estimated that the survey industry in the country conducted roughly 100 million interviews in one year. If we assume a cost of $20 to $40 per interview, interviewing alone (which is only one component of the total survey operation) represents 2 to 4 billion dollars per year.
News media provide the public with the results of new or recurring surveys. It is widely accepted that a sample of fairly modest size is sufficient to give an accurate picture of a much larger universe; for example, a well-selected sample of a few thousand individuals can portray with great accuracy a total population of millions. However, data gathering is costly. Therefore, it makes a great difference if a major national survey uses 20,000 observations when 15,000 or even 10,000 observations might suffice. For reasons of cost effectiveness, it is imperative to use the best methods available for sampling design and estimation, to profit from auxiliary information, and so on.
Here statistical knowledge and insight become highly important. The expert survey statistician must have a good grasp of statistical concepts in general, as well as the particular reasoning used in survey sampling. A good measure of practical experience is also necessary. In this book, we present a basic body of knowledge concerning survey sampling theory and methods.

1.2. Skeleton Outline of a Survey

To start, we need a skeleton outline of a survey and some basic terminology. The terms "survey" and "sample survey" are used to denote statistical investigations with the following methodologic features (key words are italicized):
The goal of a survey is to provide information about the finite population in question or about subpopulations of special interest, for example, “men” and “women” as two subpopulations of “all persons.” Such subpopulations are called domains of study or just domains. ii. A value of one or more variables of study is associated with each popula- tion element. The goal of a survey is to get information about unknown population characteristics or parameters, Parameters are functions of the study variable values. They are unknown, quantitative measures of interest to the investigator, for example, total revenue, mean revenue, total yield, number of unemployed, for the entire population or for specified domains. iii, In most surveys, access to and observation of individual population ele- ments is established through a sampling frame, a device that associates the elements of the population with the sampling units in the frame. iv. From the population, a sample (that is, a subset) of elements is selected. This can be done by selecting sampling units in the frame. A sample is a probability sample if realized by a chance mechanism, respecting the basic tules discussed in Section 1.3. y. The sample elements are observed, that is, for each clement in the sample, the variables of study are measured and the values recorded. The measure- ment conforms to a well-defined measurement plan, specified in terms of measurement instruments, one or more measurement operations, the order between these, and the conditions under which they are carried out. vi. The recorded variable values are used to calculate (point) estimates of the finite population parameters of interest (totals, means, medians, ratios, regression coefficients, etc.), Estimates of the precision of the estimates are also calculated. The estimates are finally published. Ina sample survey, observation is limited to a subset of the population. The special type of survey where the whole population is observed is called a census or a complete enumeration. EXAMPLE 1.2.1. Labor force surveys are conducted in many countries. Such a survey aims at answering questions of the following type: How many persons are currently in the labor force in the country as a whole and in various regions of the country? What proportion of these are unemployed? In this case, some of the key concepts may be as follows. Population: All persons in the country with certain exceptions (such as infants, people in institutions). Domains of interest: age/sex groups of the population, occupational groups in the population, and regions of the country. Variables: Each person can be described at the time of the survey as (a) belonging to the labor force or not, and (b) employed or not. Correspondingly, there is a variable of interest that takes the value “one” for a person in the labor force, “zero” for a person not in the labor force. To measure unemployment, a second variable of inter- est is defined as taking the value “one” if a person is unemployed, “zero” otherwise, Precise definitions are essential. If the purpose is to estimate un- 6 1. Survey Sampling in Theory and Practice employment in a given month, and if an interviewed person states that he worked one week during that month, but that he is unemployed the day of the interview, there must be a clear rule stating whether he is to be recorded as unemployed or not. Population characteristics of interest: Number of persons in the labor force. Number of unemployed persons in the labor force. 
Population characteristics of interest: Number of persons in the labor force. Number of unemployed persons in the labor force. Proportion of unemployed persons in the labor force. Sample: A sample of persons is selected from the population in an efficient manner, given existing devices for observational access to the persons in the country. Observations: Each person included in the sample is visited by a trained interviewer who asks questions following a standardized questionnaire and records the answers. Data processing and estimation: The recorded data are edited, that is, prepared for the estimation phase; rules for handling of nonresponse are observed; estimates of the population characteristics are calculated. Indicators of the uncertainty of the estimates (variance estimates) are calculated. The results are published.

Example 1.2.2. Consider a household survey whose aim is to obtain information about planned household expenditures in the coming year for specified durable goods. Here, some of the basic concepts may be as follows. Population: All households in the country. Variables: Planned expenditure amounts for specified goods, such as automobiles, refrigerators, etc. Population characteristics of interest: Total of planned household expenditures for the specified durable goods. Sample: A sample of households is obtained by initially selecting a sample of geographic areas, then by subsampling of households in selected areas. Observations: Each household in the sample receives a self-administered questionnaire. For a majority of households, the questions are answered and the questionnaire returned. Households not returning the questionnaire are followed up by telephone or visited by a trained interviewer to obtain the desired information. Data processing and estimation: Data are edited. The calculation of point estimates and precision takes into account the two-stage design of the survey.

The methodologic features (i) to (vi) identified above prompt a few comments.

1. The complexity of a survey can vary greatly, depending on the size of the population and the means of accessing that population. To survey the members of a professional society, the hospitals in a region, or the residents in a small municipality may be a relatively simple matter. At the other extreme are complex nationwide surveys, with a population of many millions spread over a large territory; such surveys are typically carried out by government statistical agencies and require extensive administrative and financial resources.
2. Although a survey involves observations on individual population elements, the purpose of a survey is not to use such data for decision-making about individual elements, but to obtain summary statistics for the population or for specific subgroups.
3. In the same survey, there are often many variables of study and many domains of interest. The number of characteristics to estimate may be large, in the hundreds or even in the thousands.
4. Finite population parameters are quantitative measures of various aspects of the population. Prior to a survey they are unknown. In this book, we examine the estimation of different types of parameters: the total of a variable of study, the mean of the variable, the median of the variable, the correlation coefficient between two variables, and so on. The exact value of a finite population parameter can be obtained in a special case, namely, if we observe the complete population (i.e., the survey is a census), and there are no measurement errors and no nonresponse.
5. A census does not automatically mean "estimation without error." Most people are aware of the term "census" in a particular sense, namely, as a fact finding operation about a country's population, in particular about such sociodemographic characteristics as the age distribution, education levels, special skills, mother tongue, housing conditions, household structure, and migration patterns. In these situations there is often a "census proper," done through a "short form" (a questionnaire with few questions) going out to all individuals, while a "long form" may be administered to a 20% sample with a request for more extensive information.
6. A sample is any subset of the population. It may or may not be drawn by a probability mechanism. A simple example of a probability sampling scheme is one that gives every sample of a fixed size the same probability of selection. This is simple random sampling without replacement. In practice, selection schemes are usually more complex. Probability sampling has over the years proved to be a highly accurate instrument and is the focus of this book. The reasons for probability sampling are discussed later in this chapter and in Chapter 14. An example of a nonprobability sample is one that is designated by an expert as representative of the population. Only in fortunate circumstances will nonprobabilistic selection yield accurate estimates.
7. To correctly measure and record the desired information for all sampled elements may be difficult or impossible. False responses may be obtained. For some elements designated for the survey, measurements may be missing because of, for example, impossibility to contact or refusal to respond. These nonsampling errors may be considerable.
8. Advances in computer technology have made it possible to produce a great deal of official statistics (for example, in the economic sector) from administrative data files. Several files may be used. For example, elements are matched in two complete population registers, and the information combined. The matched files give a more extensive base for the production of statistics. (For populations of individuals, matching may conflict with privacy considerations.) Information from a sample survey may also be combined with the information from one or more complete administrative registers. The administrative data may then serve as auxiliary information to strengthen the survey estimates.

1.3. Probability Sampling

Probability sampling is an approach to sample selection that satisfies certain conditions, which, for the case of selecting elements directly from the population, are described as follows:

1. We can define the set of samples, S = {s_1, s_2, ..., s_M}, that are possible to obtain with the sampling procedure.
2. A known probability of selection p(s) is associated with each possible sample s.
3. The procedure gives every element in the population a nonzero probability of selection.
4. We select one sample by a random mechanism under which each possible s receives exactly the probability p(s).

A sample realized under these four requirements is called a probability sample. If the survey functions without disturbance, we can go out and measure each element in the realized sample and obtain true observed values for the study variables. We assume a formula exists for computing an estimate of each parameter of interest. The sample data are inserted in the formula, yielding, for every possible sample, a unique estimate.
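Before these ideas are formalized, it may help to see the four requirements in miniature. The following Python sketch is an illustration added here, not part of the book's own development (the formal treatment begins in Chapters 2 and 3): it enumerates the possible samples under simple random sampling without replacement from a toy population of four elements, attaches the known probability p(s) to each, verifies that every element has a nonzero inclusion probability, and draws one sample by the corresponding random mechanism. The population, sample size, and variable names are illustrative assumptions.

# A minimal sketch (not from the book) of the four probability sampling
# requirements, using simple random sampling without replacement from a
# toy population of N = 4 elements.
import itertools
import random
from fractions import Fraction

U = [1, 2, 3, 4]          # the finite population (element labels)
n = 2                     # fixed sample size

# 1. The set of possible samples S = {s_1, ..., s_M}: all subsets of size n.
possible_samples = [set(s) for s in itertools.combinations(U, n)]
M = len(possible_samples)                       # here M = C(4, 2) = 6

# 2. A known selection probability p(s) for every possible sample.
p = {frozenset(s): Fraction(1, M) for s in possible_samples}

# 3. Every element has a nonzero inclusion probability:
#    pi_k = sum of p(s) over all samples s that contain element k.
pi = {k: sum(p[frozenset(s)] for s in possible_samples if k in s) for k in U}
assert all(pi_k > 0 for pi_k in pi.values())
print(pi)                                       # pi_k = n/N = 1/2 for every k

# 4. One sample is drawn by a random mechanism that realizes exactly p(s).
weights = [float(p[frozenset(s)]) for s in possible_samples]
realized = random.choices(possible_samples, weights=weights, k=1)[0]
print("realized sample:", realized)

Under this toy design every element receives inclusion probability n/N = 1/2, the simplest instance of the inclusion probabilities discussed next.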
The function p(·) defines a probability distribution on S = {s_1, s_2, ..., s_M}. It is called a sampling design, or just design; a more rigorous definition is given in Section 2.3. The probability referred to in point 3 is called the inclusion probability of the element. Under a probability sampling design, every population element has a strictly positive inclusion probability. This is a strong requirement, but one that plays an important role in the probability sampling approach. In practice there are sometimes compelling reasons for not adhering strictly to the requirement. Cut-off sampling (see Section 14.4) is an occasionally used technique in which certain elements are deliberately excluded from selection. In that case, valid conclusions are limited to the part of the population that can be sampled.
The randomization referred to in point 4 is usually carried out by an easily implemented algorithm. A common type of algorithm is one in which a randomized experiment is performed for each element listed in the frame, leading either to inclusion or noninclusion. Different algorithms are discussed in Chapters 2 and 3.
Sampling is often carried out in two or more stages. Clusters of elements are selected in an initial stage. This may be followed by one or more subsampling stages; the elements themselves are sampled at the ultimate stage. To have a probability sampling design in that case, conditions 1 to 4 above must apply to each stage. The procedure as a whole must give every population element a strictly positive inclusion probability.
Probability sampling has developed into an important scientific approach. Two important reasons for randomized selection are (1) the elimination of selection biases, and (2) randomly selected samples are "objective" and acceptable to the public. Some milestones in the development of the approach are noted in Chapter 14. Most of the arguments and methods in this book are based on the probability sampling philosophy.

1.4. Sampling Frame

The frame or the sampling frame is any material or device used to obtain observational access to the finite population of interest. It must be possible with the aid of the frame to (1) identify and select a sample in a way that respects a given probability sampling design and (2) establish contact with selected elements (by telephone, visit at home, mailed questionnaire, etc.). The following more detailed definition is from Lessler (1982):

Frame. The materials or devices which delimit, identify, and allow access to the elements of the target population. In a sample survey, the units of the frame are the units to which the probability sampling scheme is applied. The frame also includes any auxiliary information (measures of size, demographic information) that is used for (1) special sampling techniques, such as stratification and probability proportional-to-size sample selections, or for (2) special estimation techniques, such as ratio or regression estimation.

As in the preceding definition, we call elements the entities that make up the population, and units or sampling units the entities of the frame. The latter term stresses that the sample is actually selected from the frame.
Example 1.4.1. The Swedish Register of the Total Population is a large sampling frame that lists some eight million individuals. This frame gives, for each individual, information on variables such as date of birth, sex, marital status, address, and taxable income. It is a reasonably correct frame. There are few omissions, and few are listed in the frame who do not rightfully belong in it. An attractive feature of this frame is that it gives direct access to the country's entire population. Stratified sampling from this frame is often used for Swedish surveys. Sampled elements (individuals) can be contacted with relative ease.

Example 1.4.2. The Central Frame Data Base is a sampling frame compiled by Statistics Canada for use in business surveys. It is a fairly complex frame, consists of several parts, and is based on information from several sources. The construction of this frame is based on two types of Canadian tax returns: corporate and individual, the latter for the self-employed. The frame has two components: an "integrated" component (containing all of the large business establishments) and a "nonintegrated" component, which is further divided into two separate parts using information from Revenue Canada Taxation. Business firms reporting small total operating revenue are considered "out-of-scope" for business surveys. Continuous updating is required to register "births" (starting of new business activity), "deaths" (termination of business activity), and changes in classification based on geography, industry, or size.

We use the term direct element sampling to denote sample selection from a frame that directly identifies the individual elements of the population of interest. That is, the units in the frame are objects of the same kind as those that we wish to measure and observe. A selection of elements can take place directly from the frame. Ideally, the set of elements identified in the frame is equal to the set of elements in the population of interest.
For example, if the population of interest is Swedish individuals, we can carry out direct element sampling from the frame "Register of the Total Population" mentioned in Example 1.4.1. Here, unit equals element, which equals individual. (The two sets are actually not exactly equal, but differences are minor, so the frame is nearly perfect.) The frame in Example 1.4.2 can be used for direct element sampling with the objective of studying the population of business establishments in Canada; in that case, unit equals element, which equals business establishment.
The following is a list of properties that a frame for direct element sampling should have. Minimum requirements are that:

1. The units in the frame are identified, for example, through an identifier k running from 1 to N_F, where N_F is the number of sampling units.
2. All units can be found, if selected in a sample. That is, the address, telephone number, location on map, or other device for making contact is specified in the frame or can be made available.

It simplifies many sample selection procedures if the following feature is present:

3. The frame is organized in a systematic fashion, for example, the units are ordered by geography or by size.

Other information is sometimes available in the frame and will often improve estimates. The following is desirable:

4. The frame contains a vector of additional information for each unit; such information may be used for efficiency improvement such as stratification or to construct estimators that involve auxiliary variables.
5. When estimation is required for domains (subpopulations), the frame specifies the domain to which each unit belongs.

Other desirable properties involve the relationship between the units in the frame and the population elements:
6. Every element in the population of interest is present only once in the frame.
7. No element not in the population of interest is present in the frame.

The preceding two features will simplify many selection and estimation procedures.

8. Every element in the population of interest is present in the frame.

The last mentioned property is particularly important, because in its absence the frame does not give access to the whole population of interest. Then not even observation of all elements in the frame will make it possible to calculate the true value of a finite population parameter of interest.
In practice, a frame is often in the form of a computer data file. At a minimum, it is a file with an element identifier k running from 1 to N_F. It may contain other information, as mentioned in points 4 and 5. We can state all that is available in the frame about the kth element as a vector x_k = (x_1k, x_2k, ..., x_jk, ..., x_qk). Here, x_jk is the value of the jth x-variable for the kth element. The value x_jk may be quantitative (for example, date of birth or salary for individual k) or qualitative (for example, address for individual k).
The frame can be seen as a matrix arrangement with N_F rows (records), and with each row is associated q + 1 data entries (fields): one entry for the identifier and q entries for the components of the row vector x_k, as follows:

Identifier    Known vector
1             x_1
2             x_2
.             .
k             x_k
.             .
N_F           x_{N_F}
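As a purely schematic illustration of such a frame file (added here, not taken from the book), the following Python sketch stores each record as an identifier k together with an auxiliary vector x_k, and shows two typical uses of the auxiliary information mentioned above: grouping units by a qualitative variable for stratification, and ordering them by a size measure. The field names and values are invented for illustration.

# A schematic sketch (not from the book) of a sampling frame stored as a
# computer data file: one record per unit, with an identifier k and an
# auxiliary vector x_k. Field names and values are invented.
from dataclasses import dataclass

@dataclass
class FrameRecord:
    k: int          # identifier, k = 1, ..., N_F
    x: tuple        # auxiliary vector x_k (here: region code, size measure)

frame = [
    FrameRecord(k=1, x=("north", 120)),
    FrameRecord(k=2, x=("north", 45)),
    FrameRecord(k=3, x=("south", 310)),
    FrameRecord(k=4, x=("south", 80)),
]
N_F = len(frame)

# (a) Stratification: group unit identifiers by a qualitative x-variable (region).
strata = {}
for rec in frame:
    strata.setdefault(rec.x[0], []).append(rec.k)
print(strata)                         # {'north': [1, 2], 'south': [3, 4]}

# (b) Ordering by a size measure, e.g. prior to systematic or size-based selection.
by_size = sorted(frame, key=lambda rec: rec.x[1], reverse=True)
print([rec.k for rec in by_size])     # [3, 1, 4, 2]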
A geographic frame consisting of area units; every population element belongs to an area unit, and it can be identified after inspection of the area unit. The area units may vary in size and in the number of elements that they contain. Area sampling entails sampling from an area frame, such as a city map, a forest map, or an aerial photograph. The sets of elements drawn with the aid of an area frame are often called clusters. In a secondary selection step, the selected clusters may be subsampled. A sample of still smaller areas could be defined and sampled, and so on, until the elements themselves are finally sampled in the ultimate step. Maps are, of course, not always used when sets (clusters) of elements are sampled; a succession of lists may be used instead. A frame for studying a population of high school students may in the first step consist of a list of school districts, then a list of all schools in each selected district, then a list of all classes in each selected school, then in the fourth and final step one would gain access to students. “Frame” here refers to a device with four successive layers. Schoo! districts are the first stage sampling units, schools the second 1.6, Target Population and Frame Population 13 stage sampling units, classes the third stage units. The individual elements (the students) are the sampling units in the fourth and final stage of sampling. In a selection consisting of several stages, cach stage has its own type of sampling unit. A finite population is made up of elements. They are sometimes also called units of analysis, which underscores that they are entities that are measured and for which values are recorded. For example, if one is interested in estimat- ing the population total of the variable “household income,” the element (or unit of analysis) is naturally the household. The frame is an instrument for gaining access (more or less directly) to these elements. One way is to first select city blocks and then observe households in selected blocks. Our examples so far have perhaps left the impression that in practice “cle- ment” is always something “smaller than or at most equal to” a “sampling unit.” This is not necessarily so, as the following example shows. EXAMPLE 1.5.1. The Swedish household survey HINK is a national survey of household income. In the absence of a good complete list of households, it has proved convenient to use the Swedish Register of the Total Population described in Example 1.4.1 as a sampling frame. This register is a list of in- dividuals. A probability sample (ordinarily, a stratified random sample) of individuals is selected. The households to which the selected individuals belong are identified, and income-related variables are measured for these households. Here, the sampling unit is the individual, and the element (unit of analysis) is the household. That is, an element comprises one or more sampling units. This is a particular case of “network sampling” (see Sirken 1972, Levy 1977, and Rosén 1987). 1.6. Target Population and Frame Population It becomes necessary at this point to distinguish target population from frame population. The target population is the set of elements about which informa- tion is wanted and parameter estimates are required. The frame population is the set of all elements that are either listed directly as units in the frame or can be identified through a more complex frame concept, such as a frame for selection in several stages. 
Frame quality can be studied through the relations that exist between the target population, denoted U, and the frame population denoted U,. If each element in the frame population corresponds to one and only one element in the target population, then U = U,, and the frame is perfect for the target population. In all other cases, there is frame imperfection. Frame imperfec- tions are discussed in more detail in Section 14.7. At this stage it suffices to note that points 6, 7, and 8 in Section 1.4 point out three common frame imperfections, namely, undercoverage, overcoverage, and duplicate listings. 4 1. Survey Sampling in Theory and Practice The frame has undercoverage when some target population elements are not in the frame population. The frame has overcoverage when elements not in the target population are in the frame population. A duplicate listing occurs when a target population element is listed in the frame more than once. For example, if unit equals element which equals business enterprise, it may be that a unit in the frame is a firm that recently went out of business. Since no longer involved in business activity, this firm is not in the target popula- tion, but still exists as a unit in the frame. Another frame unit may be a firm that exists, but in a different industry than the one defined by the target population. Both firms are part of the overcoverage. Newly established firms not yet listed in the frame are part of the undercoverage. Ifa probability sample is selected from the frame population, valid statisti- cal inference can be made about the frame population. If the frame population is different from the target population, valid inference about the target popu- lation may be impossible, so the goal of the survey may be missed. The prob- lem is particularly serious if the frame gives access only to parts of the target population. To come up with a perfect sampling frame is not always possible in prac- tice. Minor imperfections are often tolerated, since a perfect frame may not be obtained without excessive cost. However, it is imperative that frame im- perfections be minor. To construct a high-quality frame for the target population is an important aspect of survey planning, and adequate resources must be set aside for this activity. Viewed somewhat differently, when the target population is defined, a realistic goal should be set. There is no sense in fixing a target population for which a good frame cannot realistically be obtained within budget restric- tions. Grossly invalid conclusions may result from samples drawn from faulty frames. Inexpensive, easy-to-come-by frames should be avoided if they give only fragmentary access to the target population. Frame construction and maintenance are discussed in Chapter 14. Sample selection is sometimes carried out with the aid of several partially overlapping frames; such multiple frame sampling is also considered in Chapter 14. 1.7. Survey Operations and Associated Sources of Error A survey consists of a number of survey operations. Especially in a large survey, the operations may extend over a considerable period of time, from the planning stage to the ultimate publication of results. The operations affect the quality of survey estimates. We distinguish five phases of survey opera- tions, as follows. With each phase we can associate sources of errors in the estimates, 1.7. Survey Operations and Associated Sources of Error 15 i. Sample Selection This phase consists of the execution of a preconceived sampling design. 
The sample size necessary to obtain the desired precision is determined. A sample of elements is drawn with the given sampling design using a suitable sampling frame, which may be an already existing frame or one that is constructed specifically for the survey. Errors in estimates associated with this phase are (1) frame errors, of which undercoverage is particularly serious, and (2) sampling error, which arises because a sample, not the whole population, is observed. ii, Data Collection There is a preconceived measurement plan with a specified mode of data collection (personal interview, telephone interview, mail questionnaire, or other). The field work is organized; interviewers are selected and interviewer assignments are determined. Data are collected according to the measure- ment plan, for the elements in the sample. The data are recorded and trans- mitted to the central statistical office. Errors in estimates resulting from this phase include (1) measurement errors, for instance, the respondent gives (in- tentionally or unintentionally) incorrect answers, the interviewer understands or records incorrectly, the interviewer influences the responses, the question- naire is misinterpreted, (2) error due to nonresponse (missing observations). iii, Data Processing This phase serves to prepare collected data for estimation and analysis and includes the following elements: Coding and data entry (transcription of information from questionnaire to a medium suited for estimation and data analysis). Editing (consistency checks to see if observed values conform to a set of logical rules, handling of values that do not pass the edit check, outlier detec- tion and treatment). Renewed contact with respondents to get clarification if necessary. Imputation (substitution of good artificial values for missing values). Errors in estimates associated with this phase include transcription errors (keying errors), coding errors, errors in imputed values, errors introduced by or not corrected by edit iv. Estimation and Analysis This phase entails the calculation of survey estimates according to the speci- fied point estimator formula, with appropriate use of auxiliary information 16 1. Survey Sampling in Theory and Practice and adjustment for nonresponse, as well as a calculation of measures of preci- sion in the estimates (variance estimate, coefficient of variation of the esti- mate, confidence interval). Statistical analyses may he carried out, such as comparison of subgroups of the population, correlation and regression analyses, etc. All errors from the phases (i) to (ii) above will affect the point estimates and they should ideally be accounted for in the calculation of the measures of precision. v. Dissemination of Results and Postsurvey Evaluation This phase includes the publication of the survey results, including a general declaration of the conditions surrounding the survey. This declaration often follows a set of specified guidelines for quality declaration. Errors in survey estimates are traditionally divided into two major cate- gories: sampling error and nonsampling error. The sampling error is, as mentioned, the error caused by observing a sample instead of the whole population, The sampling error is subject to sample-to-sample variation. The nonsampling errors include all other errors. The two principal categories of nonsampling errors are: a. Errors due to nonobservation. Failure to obtain data from parts of the target population. b. Errors in observations. 
This kind of error occurs when an element is selected and observed, but the finally recorded value for that element (the value that goes into the estimation and analysis phase) differs from the true value. Two major types are (b1) measurement error (error arising in the data collection phase) and (b2) processing error (error arising in the data processing phase).

There are two principal types of nonobservation, namely, (1) undercoverage, that is, failure of the frame to give access to all elements that belong to the target population (such elements will obviously not be selected, much less observed, and they have zero inclusion probability), and (2) nonresponse, that is, some elements actually selected for the sample turn out to be nonobservations because of refusal or incapacity to answer, not being at home, and so on. Nonobservation generally results in biased estimates.

Measurement error can be traced to four principal sources:

1. The interviewer.
2. The respondent.
3. The questionnaire.
4. The mode of the interview, that is, whether telephone, personal interview, self-administered questionnaire, or other medium is used.

Processing error comprises the errors arising from coding, transcription, imputation, editing, outlier treatment, and other types of preestimation data handling. In modern computer assisted data collection methods (CATI and CAPI; see Section 14.9), the data collection and data processing phases tend to be merged, and, as a consequence, processing errors may be reduced.

The basic estimation theory, taking into account the sampling error, is presented in Part I of this book (Chapters 2 to 5). Various extensions are given in Parts II and III (Chapters 6 to 13). Estimation in the presence of nonsampling errors is treated in Part IV (Chapters 14 to 17). In particular, Chapter 14 contains a more detailed discussion of nonsampling errors and their impact on the survey estimates.

1.8. Planning a Survey and the Need for Total Survey Design

A survey usually has its background in some practical problem. Someone—a member of parliament, a researcher, an administrator, a decision maker—formulates a question in the course of his work, a problem to which no answer is readily available. A study is needed. It can be an experiment, a survey, or some other form of fact finding. In either case, it is imperative that the problem be clearly stated. If a survey is the instrument chosen for the study, the survey statistician needs to work with a clearly stated objective. What exactly is the problem? Exactly what information is wanted?

For example, suppose a survey of housing conditions for the elderly is proposed. This description is vague and too general. The concepts involved must be given clear definitions. Precisely what is the population of aged individuals of interest? What age groups are to be included? Should we look separately at single-person households, two-person households, households where the elderly cohabit with younger persons, etc.? What is the more complete specification of housing conditions? Do we mean age of dwelling or some other quality measure of the dwelling? What time period is to be studied? Should one distinguish between an urban elderly population and a rural elderly population? As answers are obtained to these questions, the survey statistician begins to work toward a reformulation of the original question into one stated in precise terms that can be answered by a survey.
The statistician's formulation must be unequivocal on the following:

i. The finite population and the subpopulations for which information is required.
ii. The kinds of information needed for this population, that is, the variables to be measured and the parameters to be estimated.

Once the operational definitions are clearly stated, the survey statistician can work toward the specification of a suitable survey design, including sampling frame, data collection method, staff required, sample selection, estimation method, and determination of the sample size required to obtain the desired precision in the survey results. Before going ahead with the survey, the statistician will make sure that his "translation" corresponds fully to what the problem originator had in mind: Will the survey give at least an approximate solution to the right problem? In the words of W. E. Deming (1950, p. 3), "The requirement of a plain statement of what is wanted (the specification of the survey) is perhaps one of the greatest contributions of modern theoretical statistics."

Some important aspects of survey planning are as follows:

Specifying the objective of the survey.
Translation of the subject-matter problem into a survey problem.
Specification of target population, known variables (auxiliary variables), study variables, and population parameters to be estimated.
Construction of a sampling frame, if none is available.
Inventory of resources available in terms of budget, staff, data processing, and other equipment.
Specification of requirements to be met, for instance, time schedule and accuracy of estimates.
Specification of data collection method, including questionnaire construction.
Specification of sampling design, sample selection mechanism, and sample size determination.
Specification of data processing methods, including editing and imputation.
Specification of formulas for point estimator and measures of precision (variance estimator).
Training of personnel, organization of field work.
Allocation of resources to different survey operations.
Allocation of resources to control and evaluation.

The survey planning should ideally lead to a decision for each of the survey operations. Statistical theory can guide us to important conclusions about some of these decisions, in particular with regard to sample selection, choice of estimator, different sources of error and their associated components of variance, methods for assessment of the accuracy of the estimates, and statistical analysis of survey data.

The planning process should try to foresee difficulties that may arise. Resources should be set aside and back-up procedures should be identified to deal with perceived difficulties. For example, some nonresponse can almost certainly be expected, and the survey planning should take this into account; the nonresponse should be kept at low levels. Procedures for follow-up and renewed contact with nonrespondents should be identified and included in budget considerations. Estimator formulas that attempt to adjust for nonresponse should be identified.

Ideally, survey planning should lead to an optimal specification for the survey as a whole. The goal is to obtain the best possible accuracy under a fixed budget. In a major survey, the decision problem is, however, so complex that an optimum, in the sense of a mathematical solution to a closed-form problem, is inconceivable.
There are too many interrelated decisions and too many variables to take into account. The concept of total survey design, discussed in the next section, can be seen as a tool in the search for an overall optimization of a survey.

1.9. Total Survey Design

The term total survey design has come to be used for planning processes that aim at overall optimization in a survey. The concept arose out of a desire for an overall control of all sources of errors in a survey. Instrumental in this regard were efforts at the United States Bureau of the Census. Key references are Hansen et al. (1951), Hansen, Hurwitz, and Bershad (1962), and Hansen, Hurwitz, and Pritzker (1964). A detailed discussion of total survey design is found in Dalenius (1974).

Total survey design is concerned with obtaining the best possible precision in the survey estimates while striking an overall economic balance between sampling and nonsampling errors. For a view of total survey design, it is helpful to consider a survey from three perspectives:

1. The requirements.
2. The survey specifications.
3. The survey operations.

By the requirements are meant the needs for information about the finite population, usually originating in some subject-matter problem. Corresponding to these requirements is a conceptual survey which will achieve the ideal goal, if carried out under the best possible circumstances.

The survey specifications are a set of rules and operations, which together constitute a defined goal of the survey. Because of actual conditions, the defined goal may be somewhat different from the ideal goal. The defined goal specifies key elements of the survey, such as population, sampling design, measurement procedures, estimators, and auxiliary variables.

Several survey designs usually exist for realizing the defined goal. The survey statistician selects from a set of operationally feasible survey designs one that comes as close as possible to realizing the defined goal. The selected design gives rise to a series of survey operations. The essential ones are those that we identified in Section 1.7. The survey is finally carried out executing these operations as carefully as possible; this constitutes the survey operations proper (see Figure 1.1).

Following Dalenius (1974), we can summarize the total survey design process in a diagram as shown in Figure 1.1.

Figure 1.1. The Total Survey Design Process.

1.10. The Role of Statistical Theory in Survey Sampling

There is no unified theory covering all survey operations simultaneously. The current state of the art offers partial solutions, obtained sometimes under restrictive assumptions and idealized conditions. This book covers a crucial aspect of the total survey picture, namely, the part where statistical theory, especially estimation theory, is used to obtain answers. Estimation theory is the organized study of variability in estimates. In surveys, the variability is tied to the various errors identified in Section 1.7.

Suppose that the probability distributions of the various errors can be given, if not a complete specification, at least some general features. A stochastic structure is thereby defined, and we can work toward obtaining probability statements about the total error in an estimate. This is the traditional goal of statistical inference.
To study the variability of survey estimates, we need to specify as accurately as possible the stochastic features of the various errors. Let us see how this can be done.

i. Stochastic Structure Relative to Sampling Error

How was the sample selected? In the probability sampling approach, we know the answer. The sample is selected according to the given probability sampling design. We know or can determine the probabilities given to the different possible samples. This is the key to describing the sample-to-sample variation of a proposed estimator. The statistical properties of the estimator can be worked out: its expected value, its bias (if any), its variance, and so on. We can also obtain an estimate of the variance and a confidence interval. These notions are explained in detail in Chapter 2.

We mentioned that two reasons for randomized sample selection are protection against selection bias and the fact that randomly selected samples are viewed as objective. In this book, the probability distribution associated with the randomized sample selection has another very important function. It provides the basis for probabilistic conclusions (inferences) about the target population. For instance, a 95% confidence interval will have the property of covering the unknown population quantity in 95% of all samples that can be drawn with the given probability sampling design. The term used in the literature for this kind of conclusion is design-based inference or randomization inference. Design-based inference is objective; nobody can challenge that the sample was really selected according to the given sampling design. The probability distribution associated with the design is "real," not modeled or assumed.

ii. Stochastic Structure Relative to Nonsampling Errors

How do measurement errors and data processing errors arise? In most situations, we do not know. Any answer will involve hypothetical assumptions, called model assumptions, about these errors. Models for errors in observations are discussed in Chapter 16.

How does nonresponse arise? What is the mechanism that generates response from elements in the sample? Again, in most situations we do not know, and any answer will have to be in the form of a model. Models of this kind are called response mechanism models; see Chapter 15.

iii. Stochastic Structure Relative to the Origin of Finite Population Variable Values

How were the population variable values generated? An attempt at answering this question is to say that the N population values $y_1, \ldots, y_N$ of the study variable y are generated from a superpopulation. If the idea of a superpopulation is accepted, we must concede that the properties of this population are unknown or at least partially unknown. Superpopulation modeling is the activity whereby one specifies or assumes certain features of the mechanism thought to have generated $y_1, \ldots, y_N$. Superpopulation modeling leads to model-based inference, which is discussed in Section 14.5.

A number of important distinctions have been made in this chapter. They help in getting a perspective on how the material is organized in the chapters to come. Chapters 2 to 12 are based on the idea of probability sampling from a high quality sampling frame in the absence of nonsampling errors. Chapters 2 and 3 present elementary estimation theory for finite populations. Methods for estimation of totals and means are presented for important sampling designs.
In Chapter 5, we look at the estimation of other parameters of interest, such as medians, variances, correlation, and regression coefficients. Estimation with auxiliary information is an important theme in Chapters 6 to 8. Chapters 4 and 8 deal with sampling in two stages. Chapters 9 to 13 are devoted to significant special topics, such as estimation in two phases, estimation for domains, variance estimation techniques, questions of optimal sampling design, and analysis of survey data. In Chapter 14, we assess the limitations and extensions of the probability sampling approach, and the nature of nonsampling errors is examined. Complete chapters are devoted to two important types of nonsampling error, namely, nonresponse (Chapter 15) and measurement error (Chapter 16). Finally, the importance of quality declaration of survey data is stressed (Chapter 17).

Exercises

1.1. Consider an agricultural yield survey aimed at obtaining information about the yield of different crops (wheat, rye, etc.) in a given country and in a given year. Specify definitions that might be appropriate in such a survey for key concepts such as population, variable(s) of study, and parameter(s) of interest. Try to be as specific as possible in your definitions.

1.2. For a survey of the type indicated in Exercise 1.1, discuss what device(s) reasonably may be used to constitute a sampling frame. Discuss possible methods for accurate data collection, with consideration given to avoidance of measurement error.

1.3. Consider a retail sales survey in a given country, that is, a survey aimed at finding out about the amount of sales in retail stores. Specify definitions that might be realistic in this case for key concepts such as population, variable(s) of study, and parameter(s) of interest. Try to be as specific as possible in your definitions.

1.4. For a survey of the type indicated in Exercise 1.3, discuss what device(s) reasonably may be used to constitute a sampling frame. Discuss possible alternatives for data collection, with consideration given to avoidance of measurement error.

1.5. Contrast a "simple survey" (say, a survey of approximately 500 members of a professional society) with a "complex survey" (say, a survey of approximately 10 million inhabitants of a country). Think through the list of survey operations and specify where you foresee that great differences in complexity may arise.

CHAPTER 2
Basic Ideas in Estimation from Probability Samples

2.1. Introduction

This chapter introduces some basic notions in survey sampling, such as sampling design and sampling scheme, estimation by $\pi$-expanded sums, design variance, design effect, and design-based confidence intervals. Mastery of these concepts is essential for an understanding of the subsequent chapters. Basic designs, such as simple random sampling without replacement and Bernoulli sampling, are discussed.

An estimate's error is the deviation of the estimate from the unknown value of the parameter that we wish to estimate. In this chapter, we focus on the sampling error and ignore any errors caused by faulty measurement, nonresponse, or other nonsampling sources. The sampling error is caused by calculating the estimate from data for a subset of the population only. There is no sampling error when the estimate is based on data for the entire population, as in a census.

2.2. Population, Sample, and Sample Selection

Let us consider a population consisting of N elements labeled $k = 1, \ldots, N$:
$$\{u_1, \ldots, u_k, \ldots, u_N\}.$$

For simplicity, we let the kth element be represented by its label k. Thus, we denote the finite population as

$$U = \{1, \ldots, k, \ldots, N\}.$$

In this chapter, the population size N is treated as known at the outset. But for many populations encountered in practice, N is unknown, and methods of estimation for this case are developed in later chapters.

Let y denote a variable, and let $y_k$ be the value of y for the kth population element. For instance, if U is a population of households and y is the variable "disposable income," then $y_k$ quantifies the disposable income of the kth household. We assume that the values $y_k$, $k \in U$, are unknown at the outset. Suppose that an estimate is needed for the population total of y,

$$t = \sum_U y_k,$$

or for the population mean of y,

$$\bar{y}_U = N^{-1} \sum_U y_k.$$

In these expressions, $\sum_U y_k$ is abbreviated notation for $\sum_{k \in U} y_k$.

2. Attach the appropriate probability p(s) to each sample $s \in \mathcal{S}$.
3. Select one sample $s \in \mathcal{S}$ by a random mechanism obeying the probability distribution $p(\cdot)$, using a random number table or a computer-assisted randomized selection procedure.

This algorithm is almost never a practical possibility for sample selection. The simple reason is that in virtually all situations met in practice, the number of samples is too large. It is often "astronomically large." If, for example, N = 1,000 and $\mathcal{S}$ is the set of all samples of the fixed size n = 40, then the number of samples is

$$\binom{1{,}000}{40} \approx 5.6 \times 10^{71}.$$

If N = 5,000 and n = 200 (thus a sampling fraction, as in the previous case, of 200/5,000 = 4%), the number of samples increases to

$$\binom{5{,}000}{200} \approx 1.4 \times 10^{363}.$$

Clearly the enumeration of all these samples is a hopeless task, even for a computer.

Remark 2.3.1. It has become standard in modern literature to call the probability distribution $p(\cdot)$ the sampling design. Some statisticians use the similar term sample design with a different meaning. For example, Hansen, Hurwitz and Madow (1953, vol. II, page 7) say: "The sample design will consist of the sampling plan and the method of estimation." "Sampling plan" corresponds roughly to our concept sampling design. Used in this way, sample design is thus a broad concept that includes the choice of sampling design, the actual selection of a sample, as well as the estimation.

By way of terminology, the design stage of a survey is sometimes used to designate the period during which the sample selection procedure (and thereby the sampling design) is decided and the sample selected. By contrast, the estimation stage refers to the period when the data are already collected and the required estimates are calculated.

It is important to note that several different sample selection schemes may conform to one and the same sampling design $p(\cdot)$. If the sampling design $p(\cdot)$ is taken as a starting point, one has to specify a suitable sample selection scheme that implements the design. The scheme should be efficient in terms of computer execution, cost, and other practical aspects.

EXAMPLE 2.3.3. The design SI given in Example 2.3.1 can be implemented by the following draw-sequential scheme, an alternative to the scheme in Example 2.2.1. Select, with equal probability 1/N, a first element from the N elements in the population. Replace the element obtained. Repeat this step until n distinct elements are obtained. If v denotes the number of draws required, we have $v \geq n$ with probability one, since already drawn elements may be selected again. Here, v is a random variable with a fairly complex probability distribution. It can be shown that the selection scheme just stated conforms to the conditions of the design SI. Other possible executions of SI exist. An often used list-sequential scheme is stated in Section 3.3.1.
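To make the draw-sequential scheme concrete, a short sketch in Python follows. The function name, the choice of pseudo-random generator, and the particular numbers used are illustrative only; the sketch simply mirrors the verbal description of the scheme.

```python
import random

def si_draw_sequential(N, n, rng=None):
    """Draw-sequential scheme for the SI design (Example 2.3.3):
    draw one element at a time with equal probability 1/N, with
    replacement, until n distinct elements have been obtained."""
    rng = rng or random.Random()
    s = set()
    v = 0                        # number of draws required; v >= n always
    while len(s) < n:
        k = rng.randint(1, N)    # element labels 1, ..., N
        s.add(k)                 # a repeated label leaves s unchanged
        v += 1
    return sorted(s), v

# One realized sample s of fixed size n = 5 from U = {1, ..., 50}:
sample, v = si_draw_sequential(N=50, n=5, rng=random.Random(1))
print(sample, "number of draws v =", v)
```

The realized set s always contains exactly n distinct elements, whereas the number of draws v varies from run to run, in line with the remark that v is a random variable.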
The objective of a survey is ordinarily to estimate one or more population parameters. Two of the many important choices that must be made in a survey are as follows:

1. The choice of a sampling design and a sample selection scheme to implement the design.
2. The choice of a formula (an estimator) by which to calculate an estimate of a given parameter of interest.

The two choices are not independent of each other. For example, the choice of estimator will usually depend on the choice of sampling design. A strategy is the combination of a sampling design and an estimator. For a given parameter, the general aim is to find the best possible strategy, that is, one that estimates the parameter as accurately as possible.

2.4. Inclusion Probabilities

An interesting feature of a finite population of N labeled elements is that the elements can be given different probabilities of inclusion in the sample. The sampling statistician often takes advantage of the identifiability of the elements by deliberately attaching different inclusion probabilities to the various elements. This is one way to obtain more accurate estimates.

Suppose that a certain sampling design has been fixed. That is, p(s), the probability of selecting s, has a given mathematical form. The inclusion of a given element k in a sample is a random event indicated by the random variable $I_k$, defined as

$$I_k = \begin{cases} 1 & \text{if } k \in S \\ 0 & \text{if not} \end{cases} \qquad (2.4.1)$$

Note that $I_k = I_k(S)$ is a function of the random variable S. We call $I_k$ the sample membership indicator of element k.

The probability that element k will be included in a sample, denoted $\pi_k$, is obtained from the given design $p(\cdot)$ as follows:

$$\pi_k = \Pr(k \in S) = \Pr(I_k = 1) = \sum_{s \ni k} p(s) \qquad (2.4.2)$$

Here, $s \ni k$ denotes that the sum is over those samples s that contain the given k. The probability that both of the elements k and l will be included is denoted $\pi_{kl}$ and is obtained from the given $p(\cdot)$ as follows:

$$\pi_{kl} = \Pr(k \ \& \ l \in S) = \Pr(I_k I_l = 1) = \sum_{s \ni k \,\&\, l} p(s) \qquad (2.4.3)$$

We have $\pi_{kl} = \pi_{lk}$ for all k, l. Note that (2.4.3) applies also when k = l, for in that case

$$\pi_{kk} = \Pr(I_k^2 = 1) = \Pr(I_k = 1) = \pi_k.$$

That is, in the following, $\pi_{kk}$ should be interpreted as identical to $\pi_k$, for $k = 1, \ldots, N$.

Remark 2.4.1. The writing "$k \in S$" in, for example, equation (2.4.2) should be interpreted as the random event "$S \ni k$," which is the event "a sample containing k is realized."

With a given design $p(\cdot)$ are associated the N quantities

$$\pi_1, \ldots, \pi_k, \ldots, \pi_N.$$

They constitute the set of first-order inclusion probabilities. Moreover, with $p(\cdot)$ are associated the $N(N-1)/2$ quantities

$$\pi_{12}, \pi_{13}, \ldots, \pi_{kl}, \ldots, \pi_{N-1,N},$$

which are called the set of second-order inclusion probabilities. Inclusion probabilities of higher order can be defined and calculated for a given design $p(\cdot)$. However, they play a less important role and will not be discussed in this book.

EXAMPLE 2.4.1. Consider the SI design defined in Example 2.3.1. Let us calculate the inclusion probabilities of first and second orders. There are exactly $\binom{N-1}{n-1}$ samples s that include the element k, and exactly $\binom{N-2}{n-2}$ samples s that include the elements k and l ($k \neq l$). Since all samples of size n have the same probability, $1/\binom{N}{n}$, we obtain from (2.4.2) and (2.4.3)

$$\pi_k = \sum_{s \ni k} p(s) = \binom{N-1}{n-1} \bigg/ \binom{N}{n} = \frac{n}{N}, \qquad k = 1, \ldots, N \qquad (2.4.4)$$

and
$$\pi_{kl} = \sum_{s \ni k \,\&\, l} p(s) = \binom{N-2}{n-2} \bigg/ \binom{N}{n} = \frac{n(n-1)}{N(N-1)}, \qquad k \neq l = 1, \ldots, N. \qquad (2.4.5)$$

EXAMPLE 2.4.2. The BE sampling design (see Example 2.3.2) can be characterized as a design in which the indicators $I_k$ are independently and identically distributed, each $I_k$ obeying the Bernoulli distribution with parameter $\pi$. All N elements have the same first-order inclusion probability,

$$\pi_k = \pi, \qquad k = 1, \ldots, N;$$

moreover, since the $I_k$ are independent,

$$\pi_{kl} = E(I_k I_l) = E(I_k)E(I_l) = \pi^2$$

for any $k \neq l$.

A sampling design is often chosen to yield certain desired first- and second-order inclusion probabilities. Although $p(\cdot)$ may be complicated, we can attain one of the primary goals, namely, to determine the expected value and variance of certain quantities calculated from the sample, from a knowledge of the $\pi_k$ and the $\pi_{kl}$ alone.

Unless otherwise stated, we assume that the sampling design is such that all first-order inclusion probabilities $\pi_k$ are strictly positive, that is,

$$\pi_k > 0 \quad \text{for all } k \in U. \qquad (2.4.6)$$

This requirement ensures that every element has a chance to appear in the sample. In order that a sampling design be called a probability sampling design, the $\pi_k$ must satisfy (2.4.6) (cf. condition 3 in Section 1.3). A sample s realized by such a design is called a probability sample.

Remark 2.4.2. In practice, one nevertheless sometimes uses designs where the requirement $\pi_k > 0$ for all $k \in U$ is not met. One example is cut-off sampling. For instance, in a population of business enterprises, the smallest firms are sometimes cut off, that is, given a zero inclusion probability, because their contribution to the whole is deemed trivial, and the cost of constructing and maintaining a frame that lists the numerous small enterprises may be too high. Cut-off sampling leads to some bias in the estimates and must be used only with great caution. For further discussion of cut-off sampling and other nonprobability sampling designs, the reader is referred to Section 14.4.

Remark 2.4.3. In direct element sampling, all $\pi_k$, $k = 1, \ldots, N$, are ordinarily known prior to sampling. However, in more complex design procedures (notably multistage and multiphase sampling, see Chapters 4 and 7), the sampling is often carried out in such a way that $\pi_k$ cannot be calculated at the outset for all k. In multistage sampling, for example, the inclusion probabilities are known a priori for the sampling units in each stage, which does not imply that they are known a priori for all elements $k \in U$.

Another important property of a design occurs when the condition

$$\pi_{kl} > 0 \quad \text{for all } k \neq l \in U \qquad (2.4.7)$$

holds. A sampling design is said to be measurable if (2.4.6) and (2.4.7) are satisfied. A measurable design allows the calculation of valid variance estimates and valid confidence intervals based on the observed survey data. The notion of measurability is further discussed in Section 14.3.

The N indicators can be summarized in vector form as

$$\mathbf{I} = (I_1, \ldots, I_k, \ldots, I_N)'.$$

The event S = s is clearly equivalent to the event $\mathbf{I} = \mathbf{i}_s$, where $\mathbf{i}_s = (i_{1s}, \ldots, i_{ks}, \ldots, i_{Ns})'$ with $i_{ks} = 1$ if $k \in s$ and $i_{ks} = 0$ if not. Then the probability distribution $p(\cdot)$ introduced in Section 2.3 can be written in terms of the random vector $\mathbf{I}$ as

$$\Pr(\mathbf{I} = \mathbf{i}_s) = \Pr(S = s) = p(s) \quad \text{for } s \in \mathcal{S}.$$
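The definitions (2.4.2) and (2.4.3), and the closed-form results (2.4.4) and (2.4.5), can be checked by brute-force enumeration for a population small enough that every sample can be listed. The Python sketch below is only an illustration; the tiny example with N = 6 and n = 3 is chosen arbitrarily so that all 20 samples of the SI design can be enumerated.

```python
from itertools import combinations
from math import comb

N, n = 6, 3
U = range(1, N + 1)
samples = list(combinations(U, n))   # all C(N, n) = 20 samples of size n
p = 1 / comb(N, n)                   # SI design: p(s) = 1 / C(N, n) for every s

# First-order inclusion probabilities via (2.4.2):
# sum of p(s) over the samples s that contain k.
pi = {k: sum(p for s in samples if k in s) for k in U}

# Second-order inclusion probabilities via (2.4.3):
# sum of p(s) over the samples s that contain both k and l.
pi_kl = {(k, l): sum(p for s in samples if k in s and l in s)
         for k in U for l in U if k < l}

# Compare with the closed forms (2.4.4) and (2.4.5).
print(all(abs(pi[k] - n / N) < 1e-12 for k in U))
print(all(abs(v - n * (n - 1) / (N * (N - 1))) < 1e-12 for v in pi_kl.values()))
```

Both checks print True: in this small example every element has $\pi_k = n/N = 0.5$ and every pair has $\pi_{kl} = n(n-1)/[N(N-1)] = 0.2$. An analogous check for the BE design of Example 2.4.2 would enumerate all $2^N$ subsets, each with probability $\pi^{n_s}(1-\pi)^{N-n_s}$.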
Remark 2.4.4. Unless otherwise stated, the designs considered in this book are such that the probability $\Pr(S = s) = \Pr(\mathbf{I} = \mathbf{i}_s)$ does not depend explicitly on the values of the study variables, y, z, and so on. Such designs are called noninformative. The probability $\Pr(S = s)$ may depend on auxiliary variable values, that is, on other variable values known beforehand for the population elements. An example of a design that is informative is the following. Two elements are drawn sequentially, the first by giving equal selection probability to every element, the second by assigning to the various elements selection probabilities that depend on the value $y_{k_1}$ of the element, $k_1$, obtained in the first draw. In practical survey work, the noninformative designs dominate.

2.5. The Notion of a Statistic

The general theory of statistics uses the term "statistic" to refer to a real-valued function whose value may vary with the different outcomes of a certain experiment. Moreover, an essential requirement for a statistic is that it must be computable for any given outcome of the experiment. The same general idea of a statistic will serve the purposes of this book. We want to examine how a statistic varies from one realization s to another of the random set S. In other words, it is sample-to-sample variation that is of interest. If Q(S) is a real-valued function of the random set S, we call any such function a statistic, provided that the value Q(s) can be calculated once s, an outcome of S, has been specified and the data for the elements of s have been collected.

A simple but important statistic is $I_k(S)$, the random variable defined by (2.4.1). It indicates membership or nonmembership in the sample s of the kth element. The sample size (that is, the cardinal of S), defined as

$$n_S = \sum_U I_k(S),$$

is another simple example of a statistic. Other examples of statistics are $\sum_S y_k$, the sample total of the variable y, and $\sum_S y_k / \sum_S z_k$, the ratio of the sample totals of two variables y and z. By contrast, $\sum_S y_k / \sum_U z_k$ is not a statistic, unless the population total of z happens to be known from other sources.

When a sample is drawn in practice, exactly one realization s of the random set S occurs. Once s has been realized, we assume that it is possible to observe and measure certain variables of interest, for example, y and z, for each element k in s. Thus, in the case of the statistic $Q(S) = \sum_S y_k / \sum_S z_k$, for example, we can, after measurement, calculate the realized value of the statistic, namely,

$$Q(s) = \sum_s y_k \Big/ \sum_s z_k.$$

Note in this example that y and z are variables in the sense of taking possibly different values $y_k$ and $z_k$ for the various elements k. However, y and z are not treated as random variables. The random nature of a statistic Q(S) stems solely from the fact that the set S is random.

Remark 2.5.1. If we have two variables of study, y and z, the example $Q(S) = \sum_S y_k / \sum_S z_k$ shows that a more telling (but too cumbersome) notation for a statistic would be

$$Q(S) = Q[S, (y_k, z_k): k \in S].$$

Expressed in words, Q(S) is a function of S, $\mathbf{y} = (y_1, \ldots, y_N)$, and $\mathbf{z} = (z_1, \ldots, z_N)$ that depends on y and z only through those values $y_k$, $z_k$ for which $k \in S$. The realized value of Q(S) is computed from the set of pairs $(y_k, z_k)$ associated with the elements k in the realized sample s. For simplicity, we write simply Q(S) for the statistic and Q(s) for the realized value.

Because a statistic Q(S) is a random variable in the sense just described, it has various statistical properties, such as an expected value and a variance. These concepts are detailed in the following definition.

Definition 2.5.1.
The expectation and the variance of a statistic Q = Q(S) are defined, respectively, by

$$E(Q) = \sum_{s \in \mathcal{S}} p(s)\, Q(s)$$

and

$$V(Q) = E\{[Q - E(Q)]^2\} = \sum_{s \in \mathcal{S}} p(s)[Q(s) - E(Q)]^2.$$

The covariance between two statistics $Q_1 = Q_1(S)$ and $Q_2 = Q_2(S)$ is defined by

$$C(Q_1, Q_2) = E\{[Q_1 - E(Q_1)][Q_2 - E(Q_2)]\} = \sum_{s \in \mathcal{S}} p(s)[Q_1(s) - E(Q_1)][Q_2(s) - E(Q_2)].$$

Note that these definitions refer to the variation over all possible samples that can be obtained under the given sampling design, p(s). To emphasize this, the terms design expectation, design variance, and design covariance are often used in the literature. Here, we generally suppress the word "design" in these and similar terms, as there is no risk of misinterpretation. The design expectation E(Q), for example, is the weighted average of the values Q(s), the weight of Q(s) being the probability p(s) with which s is chosen.

When estimators are examined and compared, it is often of interest to determine the value of an expectation E(Q), a variance V(Q), or a covariance $C(Q_1, Q_2)$. For simple statistics Q(S), the expected value and the variance can often be evaluated easily as closed-form analytic expressions. This is true in particular for the linear statistics that we examine in detail later. Whether Q(S) is simple or not, there exists a "long run frequency interpretation" of the expected value E(Q) and the variance V(Q). Suppose we let a computer draw 10,000 independent samples, each of size n, with the SI design, from a population of size N. Once drawn, a sample is replaced into the population, the next sample is drawn, and so on. For each sample, the value of Q(S) is computed. At the end of the run, we can let the computer calculate the average and the variance of the 10,000 obtained values of Q(S). The values thus obtained will closely approximate the expectation E[Q(S)] and the variance V[Q(S)], respectively. This method of approximating the value of quantities that may be hard to calculate by analytic means is known as a Monte Carlo simulation.

In survey sampling, we have the following frequency interpretation of the expected value and the variance of a statistic: In a long run of repeated samples drawn from the finite population with a given sampling design $p(\cdot)$, the average of the values Q(s) and the variance of the values Q(s) will closely approximate their theoretical counterparts. The number of realized samples is the crucial factor in determining the accuracy of these approximations. For instance, if samples of size n = 6 are drawn from a population of size N = 20, the number of possible samples is $\binom{20}{6} = 38{,}760$, and the 10,000 samples realized in the simulation represent at most 10,000/38,760 = 26% of the possible samples. By contrast, if samples of size n = 200 are drawn from a population of N = 5,000, the number of possible samples is of the order of $10^{363}$. Close approximations to the true expected value and the true variance would, however, generally be obtained in this case also with 10,000 repeated samples, although they now represent a vanishingly small fraction of all the possible samples.

The massive calculations required in a Monte Carlo simulation are not part of an actual survey. In a survey, there is ordinarily one and only one sample from which conclusions are drawn about the finite population. Monte Carlo simulation is, however, a highly useful tool in the evaluation of the statistical properties of complicated estimators. More detail on this subject is given in Section 7.9.1.
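The Monte Carlo procedure just described is easy to script. The following Python sketch is an illustration only: the population values are artificial, the statistic chosen is $Q(S) = \sum_S y_k$ (the sample total of y), and 10,000 repeated SI samples are drawn as in the description above.

```python
import random
from statistics import fmean, pvariance

rng = random.Random(123)
N, n, runs = 1_000, 40, 10_000

# An artificial finite population of y-values (any fixed numbers would do).
y = [rng.uniform(10, 100) for _ in range(N)]

q_values = []
for _ in range(runs):
    s = rng.sample(range(N), n)              # one SI sample of fixed size n
    q_values.append(sum(y[k] for k in s))    # Q(S) = sample total of y over s

# Monte Carlo approximations of the design expectation and design variance:
print("approximate E(Q):", round(fmean(q_values), 1))
print("approximate V(Q):", round(pvariance(q_values), 1))

# For this particular statistic, E(Q) = (n/N) * sum_U y_k under SI, so the
# first approximation can be compared with an exactly known value.
print("exact       E(Q):", round(n / N * sum(y), 1))
```

With 10,000 repetitions the empirical mean will normally lie very close to the exact expectation, which illustrates the long-run frequency interpretation of E(Q) and V(Q).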
2.6. The Sample Membership Indicators

The estimators that we are interested in examining can be expressed as functions of the sample membership indicators defined by (2.4.1). It is therefore important to describe the basic properties of the statistics $I_k = I_k(S)$, for $k = 1, \ldots, N$, as in the following result.

Result 2.6.1. For an arbitrary sampling design p(s), and for k, l ∈ U,

$$E(I_k) = \pi_k, \qquad V(I_k) = \pi_k(1 - \pi_k), \qquad C(I_k, I_l) = \pi_{kl} - \pi_k \pi_l.$$

PROOF. Note that $I_k = I_k(S)$ is a Bernoulli random variable. Thus, $E(I_k) = \Pr(I_k = 1) = \pi_k$, using (2.4.2). Because $E(I_k^2) = E(I_k) = \pi_k$, it follows that $V(I_k) = E(I_k^2) - \pi_k^2 = \pi_k(1 - \pi_k)$. Moreover, $I_k I_l = 1$ if and only if both k and l are members of s. Thus $E(I_k I_l) = \Pr(I_k I_l = 1) = \pi_{kl}$ by equation (2.4.3), so that

$$C(I_k, I_l) = E(I_k I_l) - E(I_k)E(I_l) = \pi_{kl} - \pi_k \pi_l. \qquad \square$$

Note that for k = l, the covariance equals the variance, that is, $C(I_k, I_k) = V(I_k)$. Depending on the design, the covariances $C(I_k, I_l)$ can be zero, positive, or negative.

Remark 2.6.1. It will save space to have a special symbol for the variances and covariances of the $I_k$. We use the symbol $\Delta$. That is, for any k, l ∈ U, we set

$$C(I_k, I_l) = \pi_{kl} - \pi_k \pi_l = \Delta_{kl}. \qquad (2.6.1)$$

When k = l, this implies (because $\pi_{kk} = \pi_k$) that

$$C(I_k, I_k) = V(I_k) = \pi_k(1 - \pi_k) = \Delta_{kk}. \qquad (2.6.2)$$

Double sums over a set of elements will now be needed. Let $a_{kl}$ be any quantity associated with k and l, where k, l ∈ U. Let A be any set of elements, A ⊆ U. For example, A may be the whole population U, or A may be a sample s. The double sum notation

$$\sum\sum_{k, l \in A} a_{kl}$$

will be our shorthand for

$$\sum_{k \in A} \sum_{l \in A} a_{kl} = \sum_{k \in A} a_{kk} + \sum_{k \in A} \sum_{\substack{l \in A \\ l \neq k}} a_{kl}.$$

The double sum on the right-hand side of this expression we denote more concisely as $\sum\sum_{k \neq l \in A} a_{kl}$. Thus we have

$$\sum\sum_{k, l \in A} a_{kl} = \sum_{A} a_{kk} + \sum\sum_{k \neq l \in A} a_{kl}.$$

Another simple statistic is the sample size $n_S$. It can be expressed in terms of the indicators $I_k$ as

$$n_S = \sum_U I_k.$$

The first two moments of the statistic $n_S$ follow easily from Result 2.6.1. We have

$$E(n_S) = \sum_U \pi_k \qquad (2.6.3)$$

and

$$V(n_S) = \sum_U \pi_k(1 - \pi_k) + \sum\sum_{k \neq l \in U} (\pi_{kl} - \pi_k \pi_l) = \sum_U \pi_k - \Big(\sum_U \pi_k\Big)^2 + \sum\sum_{k \neq l \in U} \pi_{kl}. \qquad (2.6.4)$$

EXAMPLE 2.6.1. We return to the design BE considered in Examples 2.2.2, 2.3.2, and 2.4.2. The sample size $n_S$ is a binomially distributed random variable with parameters N and $\pi$. It follows that

$$E_{BE}(n_S) = N\pi \quad \text{and} \quad V_{BE}(n_S) = N\pi(1 - \pi).$$

The results can be obtained alternatively from equations (2.6.3) and (2.6.4). For example, from (2.6.4), using $\pi_k = \pi$ for all k and $\pi_{kl} = \pi^2$ for all $k \neq l$,

$$V_{BE}(n_S) = N\pi - (N\pi)^2 + N(N - 1)\pi^2 = N\pi(1 - \pi).$$

Another example of random sample size occurs in single-stage cluster sampling, as described in Chapter 4. A cluster is a set of elements, for example, the households in a city block or the students in a class. If clusters are selected and if the sample consists of all elements in the selected clusters, then the sample size, that is, the number of observed elements, will be variable if the clusters are of unequal sizes.

Practitioners avoid designs in which the sample size varies extensively. One reason is that variable sample size will cause an increase in variance for certain types of estimators. More importantly, survey statisticians dislike being in a situation where the number of observations is highly unpredictable when the survey is planned. A fixed (sample) size design is such that whenever p(s) > 0, the sample s will contain a fixed number of elements, say n. That is, a sample is realizable under a fixed size design only if its size is exactly n. But all samples of size n need not be realizable to have a fixed size design.
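The contrast between a design with random sample size and one with fixed sample size is easily checked numerically. In the Python sketch below, an illustration with arbitrarily chosen N, n, and $\pi$, repeated samples are drawn under BE and under SI; the empirical mean and variance of $n_S$ under BE are compared with $N\pi$ and $N\pi(1 - \pi)$ from Example 2.6.1, while under SI the sample size never varies.

```python
import random
from statistics import fmean, pvariance

rng = random.Random(2023)
N, n, pi, runs = 200, 50, 0.25, 20_000

be_sizes, si_sizes = [], []
for _ in range(runs):
    # BE design: I_1, ..., I_N are independent Bernoulli(pi) indicators,
    # so n_S = sum of the indicators is a random variable.
    be_sizes.append(sum(rng.random() < pi for _ in range(N)))
    # SI design: every realized sample has exactly n elements.
    si_sizes.append(len(rng.sample(range(N), n)))

print("BE: empirical E(n_S) =", round(fmean(be_sizes), 2), "  N*pi =", N * pi)
print("BE: empirical V(n_S) =", round(pvariance(be_sizes), 2),
      "  N*pi*(1 - pi) =", N * pi * (1 - pi))
print("SI: min and max of n_S =", min(si_sizes), max(si_sizes))  # both equal n
```

The empirical BE values settle near 50 and 37.5, reflecting the binomial distribution of $n_S$, whereas under SI the sample size is identically n in every run.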
In the case of a fixed size design, the inclusion probabilities obey some simple relations, which are stated in the following result.

Result 2.6.2. If the design p(s) has the fixed size n, then

$$\sum_U \pi_k = n,$$
$$\sum\sum_{k \neq l \in U} \pi_{kl} = n(n - 1),$$
$$\sum_{\substack{l \in U \\ l \neq k}} \pi_{kl} = (n - 1)\pi_k \quad \text{for all } k \in U.$$

PROOF. If p(s) is of the fixed size n, then $n_S = n$ with probability one. Thus $E(n_S) = n$ and $V(n_S) = 0$. Using (2.6.3) and (2.6.4), we obtain the first two results of the theorem. The third result follows from the derivation

$$\sum_{\substack{l \in U \\ l \neq k}} \pi_{kl} = \sum_{l \neq k} E(I_k I_l) = E\Big(I_k \sum_{l \neq k} I_l\Big) = E[I_k(n_S - I_k)] = nE(I_k) - E(I_k^2) = (n - 1)\pi_k. \qquad \square$$

The SI design is an example of a fixed size design. That the three parts of Result 2.6.2 hold for this design in particular is easily checked by use of the inclusion probabilities $\pi_k$ and $\pi_{kl}$ calculated in Example 2.4.1.

2.7. Estimators and Their Basic Statistical Properties

Most of the statistics that we examine in this book are different kinds of estimators. An estimator is a statistic thought to produce values that, for most samples, lie near the unknown population quantity that one wishes to estimate. Such quantities are called parameters. The general notation for a parameter will be $\theta$. If there is only one study variable, y, we can think of $\theta$ as a function of $y_1, \ldots, y_N$, the N values of y. Thus,

$$\theta = \theta(y_1, \ldots, y_N).$$

Examples include the population total,

$$\theta = t = \sum_U y_k,$$

the population mean,

$$\theta = \bar{y}_U = \sum_U y_k / N,$$

and the population variance,

$$\theta = S^2_{yU} = \frac{\sum_U (y_k - \bar{y}_U)^2}{N - 1} = \frac{1}{N - 1}\Big[\sum_U y_k^2 - \frac{(\sum_U y_k)^2}{N}\Big].$$

A parameter can be a function of the values of two or more variables of study, as in the case of the ratio of the population totals of y and z,

$$\theta = \frac{\sum_U y_k}{\sum_U z_k}. \qquad (2.7.1)$$

We denote an estimator of $\theta$ by

$$\hat{\theta} = \hat{\theta}(S).$$

If s is a realization of the random set S, we assume it is possible to calculate $\hat{\theta}(s)$ from the study variable values $y_k, z_k, \ldots$ associated with the elements $k \in s$. For example, under the SI design,

$$\hat{\theta} = N \frac{\sum_s y_k}{n}$$

is an often-used estimator of the parameter $\theta = \sum_U y_k$, and

$$\hat{\theta} = \frac{\sum_s y_k}{\sum_s z_k}$$

is an often-used estimator of the parameter $\theta$ given by (2.7.1).

It is of considerable interest to describe the sample-to-sample variation of a proposed estimator $\hat{\theta}$. An estimator that varies little around the unknown value $\theta$ is on intuitive grounds "better" than one that varies a great deal. By the sampling distribution of an estimator $\hat{\theta}$ we mean a specification of all the possible values of $\hat{\theta}$, together with the probability that $\hat{\theta}$ attains each of these values under the sampling design in use, p(s). That is, the exact statement of the sampling distribution of $\hat{\theta}$ requires, for each possible value c of $\hat{\theta}$, a specification of the probability

$$\Pr(\hat{\theta} = c) = \sum_{s:\, \hat{\theta}(s) = c} p(s), \qquad (2.7.2)$$

where the sum is over the set of samples s for which $\hat{\theta}(s) = c$.

For a given population of y-values, $y_1, \ldots, y_N$, a given design, and a given estimator, it is possible in theory to produce the exact sampling distribution of a statistic $\hat{\theta}$. But, computationally, it would be a formidable task for most finite populations of practical interest. Ordinarily, both the number of possible samples and the number of possible values of $\hat{\theta}$ are extremely large. However, summary measures that describe important aspects of the sampling distribution of an estimator $\hat{\theta}$ are needed, for example, when comparisons are made with competing estimators. These summary measures are usually unknown, theoretical quantities. The following summary measures are derived from Definition 2.5.1.

The expectation of $\hat{\theta}$ is given by

$$E(\hat{\theta}) = \sum_{s \in \mathcal{S}} p(s)\, \hat{\theta}(s).$$

It is a weighted average of the possible values $\hat{\theta}(s)$ of $\hat{\theta}$, with the probabilities p(s) as weights.
The variance of $\hat{\theta}$ is given by

$$V(\hat{\theta}) = \sum_{s \in \mathcal{S}} p(s)[\hat{\theta}(s) - E(\hat{\theta})]^2.$$

Two important measures of the quality of an estimator $\hat{\theta}$ are the bias and the mean square error. The bias of $\hat{\theta}$ is defined as

$$B(\hat{\theta}) = E(\hat{\theta}) - \theta.$$

An estimator $\hat{\theta}$ is said to be unbiased for $\theta$ if

$$B(\hat{\theta}) = 0 \quad \text{for all } \mathbf{y} = (y_1, \ldots, y_N)' \in R^N.$$

The mean square error of $\hat{\theta}$ is defined as

$$\mathrm{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = \sum_{s \in \mathcal{S}} p(s)[\hat{\theta}(s) - \theta]^2.$$

An easily verified result is that

$$\mathrm{MSE}(\hat{\theta}) = V(\hat{\theta}) + [B(\hat{\theta})]^2. \qquad (2.7.3)$$

If $\hat{\theta}$ is unbiased for $\theta$, it follows from equation (2.7.3) that $\mathrm{MSE}(\hat{\theta}) = V(\hat{\theta})$.

Remark 2.7.1. Note the distinction between an estimate and an estimator. By the estimate produced by the estimator $\hat{\theta} = \hat{\theta}(S)$ is meant the number $\hat{\theta}(s)$ that can be calculated after a specific outcome s of the random set S has been observed and the study variable values $y_k, z_k, \ldots$ have been recorded for the elements $k \in s$. For example, for an SI sample of n elements, the random variable

$$\hat{\theta}(S) = N \frac{\sum_{k \in S} y_k}{n}$$

is an estimator of $\theta = \sum_U y_k$; the estimate obtained for a particular outcome s is the number

$$\hat{\theta}(s) = N \frac{\sum_{k \in s} y_k}{n}.$$

In the following, we ignore the typographic distinction between S, the random set, and s, a realization of S. For simplicity, we use the lower case character to designate both the random set and its realization. There is little risk of misunderstanding.

An estimator is unbiased if its weighted average (over all possible samples, using the probabilities p(s) as weights) is equal to the unknown parameter value. The most important estimators in survey sampling are unbiased or approximately unbiased. It is characteristic of an approximately unbiased estimator that the bias is unimportant in large samples. For most of the approximately unbiased estimators that we consider, the bias is actually very small, even for modest sample sizes.

Remark 2.7.2. The statement that an estimator $\hat{\theta}$ is unbiased is a statement of average performance, namely, over all possible samples. The probability-weighted average of the deviations $\hat{\theta} - \theta$ is nil. However, to say that an estimate is biased is strictly speaking incorrect. An estimate is a constant value obtained for a particular sample realization. This value can be off the mark, in the sense of deviating from the unknown parameter value $\theta$. Because an estimate is a number, it has no variation and no bias. The term biased estimate is nevertheless used occasionally. The only way that the term makes sense is if it is interpreted as "an estimate calculated from an estimator that is biased."

Although unbiasedness or approximate unbiasedness are desirable properties, it is clear that these properties say nothing about another important aspect of the sampling distribution, namely, how widely dispersed the various values of the estimator are. The variance is a measure of this dispersion. When choosing between several possible estimators $\hat{\theta}_1, \hat{\theta}_2, \ldots$ for one and the same parameter $\theta$, the statistician will normally want to single out one for which the sampling distribution is narrowly concentrated around the unknown value $\theta$. This suggests using the criterion of "small mean square error" to select an estimator, because, if $\mathrm{MSE}(\hat{\theta}) = V(\hat{\theta}) + [B(\hat{\theta})]^2$ is small, there is strong reason to believe that the sample actually drawn in a survey will have produced an estimate near the true value. However, even if the sampling distribution is tightly concentrated around $\theta$, there is always a small possibility that our particular sample was "bad," so that the estimate falls in one of the tails of the distribution, rather far removed from $\theta$. The statistician must live with this possibility.
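For a population small enough that every sample can be listed, the summary measures just defined can be computed exactly rather than approximated. The Python sketch below uses invented y-values with N = 6 and the SI design with n = 2; it evaluates the estimator $\hat{\theta} = N \sum_s y_k / n$ of the total $\theta = \sum_U y_k$ over all $\binom{6}{2} = 15$ samples and confirms that the bias is zero, so that $\mathrm{MSE}(\hat{\theta}) = V(\hat{\theta})$ in agreement with (2.7.3).

```python
from itertools import combinations
from math import comb

y = [3.0, 7.0, 11.0, 2.0, 9.0, 5.0]      # invented y-values; N = 6
N, n = len(y), 2
theta = sum(y)                            # the parameter: population total t
p = 1 / comb(N, n)                        # SI design: p(s) = 1 / C(N, n)

# Evaluate theta_hat(s) = N * (sample mean of y) for every possible sample s.
theta_hat = [N * sum(y[k] for k in s) / n for s in combinations(range(N), n)]

E   = sum(p * th for th in theta_hat)                  # E(theta_hat)
V   = sum(p * (th - E) ** 2 for th in theta_hat)       # V(theta_hat)
B   = E - theta                                        # bias B(theta_hat)
MSE = sum(p * (th - theta) ** 2 for th in theta_hat)   # mean square error

print("E =", round(E, 4), "  theta =", theta, "  bias =", round(B, 10))
print("V =", round(V, 4), "  MSE =", round(MSE, 4),
      "  V + B^2 =", round(V + B ** 2, 4))
```

The output shows $E(\hat{\theta}) = \theta = 37$ and MSE equal to the variance: the exact sampling-distribution counterpart of the statement that this estimator has zero bias under the SI design.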
The survey statistician should avoid estimators that are considerably biased, because valid confidence intervals cannot be obtained if the bias is substantial (for a further explanation, see Section 5.2). Therefore, typically the statistician will seek among estimators that are at least approximately unbiased and choose one that has a small variance.

Remark 2.7.3. The square root of the variance, $[V(\hat{\theta})]^{1/2}$, is called the standard error of the estimator $\hat{\theta}$. The ratio of the standard error of the estimator to
