Sampling and Estimation
SAMPLING
1. All the students in your school constitute a population. What are some samples of
this population? (all students in their class, all male students; all female students; all students
with glasses; every tenth student as they walked up the stairs this morning, etc.)
2. Why would anyone want to take a sample rather than studying the whole population?
(Usually because it is cheaper than measuring the entire population. Or because measuring the
whole population is impossible.)
PROBABILITY SAMPLING
There are several different ways in which a probability sample can be selected. The method
chosen depends on a number of factors, such as the available sampling frame, how spread out
the population is, how costly it is to survey members of the population and how users will
analyse the data. When choosing a probability sample design, your goal should be to
minimize the sampling error of the estimates for the most important survey variables, while
simultaneously minimizing the time and cost of conducting the survey.
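The most common probability sampling methods, discussed in turn below, are:
• simple random sampling
• systematic sampling
• sampling with probability proportional to size
• stratified sampling
• cluster sampling
• multi-stage sampling
• multi-phase sampling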
Example 1: To draw a simple random sample from a telephone book, each entry would need
to be numbered sequentially. If there were 10,000 entries in the telephone book and
if the sample size were 2,000, then 2,000 numbers between 1 and 10,000 would need
to be randomly generated by a computer. Each number will have the same chance of
being generated by the computer (in order to fill the simple random sampling
requirement of an equal chance for every unit). The 2,000 telephone entries
corresponding to the 2,000 computer-generated random numbers would make up the sample.
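The draw described in Example 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name draw_simple_random_sample and the fixed seed are introduced here for the sketch and are not part of the original example.

```python
import random

def draw_simple_random_sample(population_size, sample_size, seed=None):
    """Return sample_size distinct entry numbers from 1..population_size."""
    rng = random.Random(seed)  # the seed is optional; fixed below only for repeatability
    # random.sample draws without replacement, so every entry has the same
    # chance of selection and no entry can appear twice.
    return rng.sample(range(1, population_size + 1), sample_size)

# Telephone-book example: 10,000 numbered entries, sample of 2,000.
selected_entries = draw_simple_random_sample(10_000, 2_000, seed=1)
print(len(selected_entries), min(selected_entries), max(selected_entries))
```

The telephone entries whose numbers appear in selected_entries would make up the sample.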
Simple random sampling is the easiest method of sampling and it is the most commonly used.
Advantages of this technique are that it does not require any additional information on
the frame (such as geographic areas) other than the complete list of members of the
survey population along with information for contact. Also, since simple random
sampling is a simple method and the theory behind it is well established, standard
formulas exist to determine the sample size, the estimates and so on, and these formulas are
easy to use.
On the other hand, this technique makes no use of auxiliary information present on the frame
(e.g., the number of employees in each business) that could make the design of the sample more
efficient. And although it is easy to apply simple random sampling to small populations, it
can be expensive and unfeasible for large populations because all elements must be identified
and labeled prior to sampling. It can also be expensive if personal interviewers are required
since the sample may be geographically spread out across the population.
A lottery draw is a good example of simple random sampling. For example, when a sample of
6 numbers is randomly generated from a population of 49, each number has an equal chance
of being selected and each combination of 6 numbers has the same chance of being
the winning combination. Even though people tend to avoid combinations such as 1-2-3-4-5-6,
it has the same chance of being the winning set of numbers as the combination 8-15-21-28-32-40.
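The lottery claim can be checked with a short calculation. The sketch below counts the possible 6-number combinations and derives the chance of any single one; the variable names are chosen here for illustration.

```python
import math

# Number of distinct 6-number combinations that can be drawn from 49 numbers.
combinations = math.comb(49, 6)

# Under simple random sampling every combination is equally likely, so
# 1-2-3-4-5-6 and 8-15-21-28-32-40 share exactly this probability.
probability_of_any_one_combination = 1 / combinations

# Each individual number is included in the draw with probability n / N.
probability_number_is_drawn = 6 / 49

print(combinations, probability_of_any_one_combination, probability_number_is_drawn)
```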
Example 2: Suppose your school has 500 students and you need to conduct a short survey on
the quality of the food served in the cafeteria. You decide that a sample of 10 students should
be sufficient for your purposes. In order to get your sample, you assign a number from 1 to
500 to each student in your school. To select the sample, you use a table of randomly
generated numbers. All you have to do is pick a starting point in the table (a row and column
number) and look at the random numbers that appear there. In this case, since the data run into
three digits, the random numbers would need to contain three digits as well. Ignore all random
numbers greater than 500, as well as 000, because they do not correspond to any of the students in the school.
Remember that the sample is without replacement, so if a number recurs, skip over it and use
the next random number. The first 10 different numbers between 001 and 500 make up your
sample.
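The table-reading procedure in Example 2 can be sketched as follows. A printed random number table is not available here, so the sketch assumes a list of generated three-digit numbers as a stand-in; the function name sample_from_random_digits is likewise hypothetical.

```python
import random

def sample_from_random_digits(three_digit_numbers, population_size, sample_size):
    """Walk through a stream of three-digit random numbers (000 to 999),
    skipping values outside 1..population_size and values already chosen."""
    chosen = []
    for number in three_digit_numbers:
        if number < 1 or number > population_size:
            continue  # no student has this number
        if number in chosen:
            continue  # sampling without replacement: skip repeats
        chosen.append(number)
        if len(chosen) == sample_size:
            break
    return chosen

# Stand-in for a printed random number table (an assumption for this sketch).
rng = random.Random(42)
table = [rng.randint(0, 999) for _ in range(200)]
students = sample_from_random_digits(table, population_size=500, sample_size=10)
print(students)
```

Because about half of all three-digit numbers fall outside 001 to 500, roughly twice as many table entries as sample units will usually need to be read.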
Example 3: Imagine that you own a movie theatre and you are offering a special
horror movie film festival next month. To decide which horror movies to show, you
survey moviegoers asking them which of the listed movies are their favourites. To create the
list of movies needed for your survey, you decide to sample 100 of the 1,000 best horror
movies of all time. The horror movie population is divided evenly into classic movies (those
filmed in or before 1969) and modern movies (those filmed in or later than 1970). One way
of getting a sample would be to write out all of the movie titles on slips of paper and place
them in an empty box. Then, draw out 100 titles and you will have your sample. By using this
approach, you will have ensured that each movie had an equal chance of selection.
You can also calculate the probability of a given movie being selected. Since we know the
sample size (n) and the total population (N), calculating the probability of being included in
the sample becomes a simple matter of division:
Probability of selection = n ÷ N = 100 ÷ 1,000 = 0.10 (or 10%)
You can see that one disadvantage of simple random sampling (not the
only disadvantage, but an important one) is that even if you know that the population is made
up of 500 classic movies and 500 modern movies and you know each movie's release date from the
sampling frame, no use is made of this information. This sample might contain 77
classic movies and 23 modern movies, which would not be representative of the whole horror
movie population.
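How likely an unbalanced sample is can be worked out directly, because the number of classic movies in a simple random sample of 100 from 500 classic and 500 modern titles follows a hypergeometric distribution. The sketch below is an illustration only; the function name prob_classic_count and the 60/40 threshold are choices made here, not part of the original example.

```python
from math import comb

N_CLASSIC, N_MODERN, SAMPLE_SIZE = 500, 500, 100

def prob_classic_count(k):
    """Probability that a simple random sample of 100 titles holds exactly k classics."""
    ways_with_k_classics = comb(N_CLASSIC, k) * comb(N_MODERN, SAMPLE_SIZE - k)
    return ways_with_k_classics / comb(N_CLASSIC + N_MODERN, SAMPLE_SIZE)

# Chance of a perfectly balanced 50/50 sample.
p_balanced = prob_classic_count(50)

# Chance that the split is at least as lopsided as 60/40 in either direction.
p_lopsided = sum(prob_classic_count(k) for k in range(0, 41)) + sum(
    prob_classic_count(k) for k in range(60, 101)
)

print(round(p_balanced, 4), round(p_lopsided, 4), prob_classic_count(77))
```

An extreme split such as 77 to 23 is possible but far less likely than a split near 50 to 50; simple random sampling makes such splits unlikely, not impossible.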
There are ways to overcome this problem (these will be briefly discussed in the Estimation
section), but there are also ways to account for this information. (This will also be discussed
later, under the section on Stratified sampling.)
Systematic sampling
Sometimes called interval sampling, systematic sampling means that there is a gap, or
interval, between each selected unit in the sample. In order to select a systematic sample, you
need to follow these steps:
1. Number the units on your frame from 1 to N (where N is the total population size).
2. Determine the sampling interval (K) by dividing the number of units in
the population by the desired sample size. For example, to select a sample of 100 from
a population of 400, you would need a sampling interval of 400 ÷ 100 = 4. Therefore,
K = 4. You will need to select one unit out of every four units to end up with a total
of 100 units in your sample.
3. Select a number between one and K at random. This number is called the
random start and would be the first number included in your sample. Using the example
above, you would select a number between 1 and 4 from a table of random numbers.
If you choose 3, the third unit on your frame would be the first unit included in your
sample; if you choose 2, your sample would start with the second unit on your frame.
4. Select every Kth (in this case, every fourth) unit after that first number. For example,
the sample might consist of the following units to make up a sample of 100: 3 (the
random start), 7, 11, 15, 19...395, 399 (up to N, which is 400 in this case).
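Using the example above, you can see that with a systematic sample approach there are only
four possible samples that can be selected, corresponding to the four possible random starts:
Random start 1: 1, 5, 9, 13...393, 397
Random start 2: 2, 6, 10, 14...394, 398
Random start 3: 3, 7, 11, 15...395, 399
Random start 4: 4, 8, 12, 16...396, 400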
Each member of the population belongs to only one of the four samples and each sample has
the same chance of being selected. From that, we can see that each unit has a one in
four chance of being selected in the sample. This is the same probability as if a simple
random sampling of 100 units was selected. The main difference is that with simple random
sampling, any combination of 100 units would have a chance of making up the sample,
while with systematic sampling, there are only four possible samples. From that, we can see
how precise systematic sampling is compared with simple random sampling. The population's
order on the frame will determine the possible samples for systematic sampling. If the
population is randomly distributed on the frame, then systematic sampling should yield
results that are similar to simple random sampling.
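The selection steps, and the fact that only K distinct samples exist, can be sketched in Python. This is a minimal illustration that assumes N is an exact multiple of n; the function name systematic_sample is introduced here for the sketch.

```python
import random

def systematic_sample(population_size, sample_size, random_start=None, seed=None):
    """Select every K-th unit from a frame numbered 1..population_size.

    Assumes population_size is an exact multiple of sample_size.
    """
    interval = population_size // sample_size  # sampling interval K
    if random_start is None:
        random_start = random.Random(seed).randint(1, interval)  # 1..K inclusive
    return list(range(random_start, population_size + 1, interval))

# The four possible samples for N = 400 and n = 100 (K = 4).
possible_samples = {start: systematic_sample(400, 100, random_start=start)
                    for start in range(1, 5)}
print(possible_samples[3][:5], possible_samples[3][-2:])
```

Every unit from 1 to 400 appears in exactly one of the four samples, which is why each unit has a one in four chance of selection.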
This method is often used in industry, where an item is selected for testing from a production
line to ensure that machines and equipment are of a standard quality. For example, a tester in
a manufacturing plant might perform a quality check on every 20th product in an assembly
line. The tester might choose a random start between the numbers 1 and 20. This will
determine the first product to be tested; every 20th product will be tested thereafter.
Interviewers can use this sampling technique when questioning people for a sample survey.
The market researcher might select, for example, every 10th person who enters a particular
store, after selecting the first person at random. The surveyor may interview the occupants of
every fifth house on a street, after randomly selecting one of the first five houses.
Example 4: Imagine you have to conduct a survey on student housing for your university or
college. Your school has an enrolment of 10,000 students and you want to take a systematic
sample of 500 students. In order to do this, you must first determine what your
sampling interval (K) would be:
N÷n=K
K= 10,000 ÷ 500
K= 20
To begin this systematic sample, all students would have to be assigned sequential numbers.
The starting point would be chosen by selecting a random number between 1 and 20. If this
number were 9, then the 9th student on the list would be selected along with every
20th student thereafter. The sample of students would be those corresponding to student numbers
9, 29, 49, 69...9,929, 9,949, 9,969 and 9,989.
In the examples used thus far, the sampling interval K was always a whole number, but this
is not always the case. For example, if you want a sample of 30 from a population of 740, your
sampling interval (or K) will be 24.7. In these cases, there are a few options to make
the number easier to work with. You can round the number—either round it up to the
nearest whole number or round it down. Rounding down will ensure that you select at
least the number of units you originally wanted (and you can then delete some units to get the
exact sample size you wanted). Techniques exist to adapt systematic sampling to the case
where N (total population) is not a multiple of n (sample size) but still yield a sample of
exactly n units. These techniques will not be discussed here.
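The effect of rounding the interval can be checked by counting the units selected for every possible random start. The sketch below uses the figures from the text (N = 740, n = 30); the function name systematic_sample_sizes is an assumption made for illustration.

```python
def systematic_sample_sizes(population_size, interval):
    """Return the set of sample sizes obtained over every possible random start."""
    sizes = set()
    for start in range(1, interval + 1):
        # Units selected: start, start + K, start + 2K, ... up to population_size.
        sizes.add(len(range(start, population_size + 1, interval)))
    return sizes

# N = 740 and a desired n of 30 give N / n of about 24.7.
sizes_rounded_down = systematic_sample_sizes(740, 24)  # K = 24
sizes_rounded_up = systematic_sample_sizes(740, 25)    # K = 25
print(sizes_rounded_down, sizes_rounded_up)
```

With K = 24 every random start gives 30 or 31 units, so the target of 30 is always met; with K = 25 some starts give only 29.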
The advantages of systematic sampling are that the sample selection cannot be easier
(you only get one random number—the random start—and the rest of the sample
automatically follows) and that the sample is distributed evenly over the listed
population. The biggest drawback of the systematic sampling method is that if there is
some cycle in the way the population is arranged on a list and if that cycle coincides
in some way with the sampling interval, the possible samples may not be representative of
the population. This can be seen in the following example:
Example 5: Suppose you run a large grocery store and have a list of the employees in each
section. The grocery store is divided into the following 10 sections: deli counter,
bakery, cashiers, stock, meat counter, produce, pharmacy, photo shop, flower shop and dry
cleaning. Each section has 10 employees, including a manager (making 100 employees in
total). Your list is ordered by section, with the manager listed first and then the other
employees by descending order of seniority.
If you wanted to survey your employees about their thoughts on their work environment, you
might choose a small sample to answer your questions. If you use a systematic
sampling approach and your sampling interval is 10, then you could end up selecting only
managers or the newest employees in each section. This type of sample would not give you a
complete or appropriate picture of your employees' thoughts.
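The problem in Example 5 is easy to reproduce. The sketch below builds the 100-employee list in the order described and applies an interval of 10; the role labels ("manager", "employee 1" to "employee 9") are placeholders invented for this illustration.

```python
SECTIONS = ["deli counter", "bakery", "cashiers", "stock", "meat counter",
            "produce", "pharmacy", "photo shop", "flower shop", "dry cleaning"]

# Frame ordered by section: manager first, then nine employees by descending
# seniority, so "employee 9" is the newest employee in each section.
frame = []
for section in SECTIONS:
    frame.append((section, "manager"))
    for rank in range(1, 10):
        frame.append((section, f"employee {rank}"))

def systematic_sample(units, interval, random_start):
    """Take the unit at position random_start (1-based) and every interval-th unit after it."""
    return units[random_start - 1::interval]

managers_only = systematic_sample(frame, interval=10, random_start=1)
newest_only = systematic_sample(frame, interval=10, random_start=10)
print({role for _, role in managers_only}, {role for _, role in newest_only})
```

Because the list repeats every 10 positions and the interval is also 10, each random start picks the same position in every section.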
Probability sampling requires that each member of the survey population have a chance
of being included in the sample, but it does not require that this chance be the same
for everyone. If there is information available on the frame about the size of each unit
(e.g., number of employees for each business) and if those units vary in size, this information
can be used in the sampling selection in order to increase the efficiency. This is known
as sampling with probability proportional to size (PPS). With this method, the bigger the size
of the unit, the higher the chance it has of being included in the sample. For this method to
bring increased efficiency, the measure of size needs to be accurate. This is a more
complex sampling method that will not be discussed in further detail here.
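Although the method is not developed further here, the basic idea can be illustrated with a rough sketch. It assumes the simplest variant, drawing with replacement, and a made-up frame of five businesses; neither comes from the text, and practical PPS designs are more involved.

```python
import random

# Hypothetical frame: business name -> number of employees (the size measure).
businesses = {"A": 500, "B": 50, "C": 200, "D": 10, "E": 240}

def pps_sample_with_replacement(frame, sample_size, seed=None):
    """Draw units with probability proportional to their size measure.

    Sampling is with replacement, so a unit can be drawn more than once.
    """
    rng = random.Random(seed)
    names = list(frame)
    sizes = [frame[name] for name in names]
    return rng.choices(names, weights=sizes, k=sample_size)

# Selection probability of each unit on a single draw: size / total size.
total_size = sum(businesses.values())
single_draw_probability = {name: size / total_size for name, size in businesses.items()}

draws = pps_sample_with_replacement(businesses, sample_size=3, seed=7)
print(single_draw_probability, draws)
```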
Stratified sampling
Why do we need to create strata? There are many reasons, the main one being that
it can make the sampling strategy more efficient. It was mentioned earlier that you
need a larger sample to get a more accurate estimation of a characteristic that varies greatly
from one unit to the other than for a characteristic that does not. For example, if every
person in a population had the same salary, then a sample of one individual would be
enough to get a precise estimate of the average salary.
This is the idea behind the efficiency gain obtained with stratification. If you create
strata within which units share similar characteristics (e.g., income) and are considerably
different from units in other strata (e.g., occupation, type of dwelling) then you would
only need a small sample from each stratum to get a precise estimate of total income
for that stratum. Then you could combine these estimates to get a precise estimate of
total income for the whole population. If you were to use a simple random sampling
approach in the whole population without stratification, the sample would need to be
larger than the total of all stratum samples to get an estimate of total income with the same
level of precision.
Stratified sampling ensures an adequate sample size for sub-groups in the population
of interest. When a population is stratified, each stratum becomes an independent population
and you will need to decide the sample size for each stratum.
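As a sketch of how stratification uses frame information, the horror-movie population from Example 3 (500 classic and 500 modern titles) can be split into two strata and sampled separately. The title labels, the function name stratified_sample and the choice of proportional allocation (50 from each stratum) are assumptions made for this illustration.

```python
import random

def stratified_sample(strata, sample_sizes, seed=None):
    """Draw a simple random sample of the requested size from each stratum.

    strata: dict mapping stratum name -> list of units
    sample_sizes: dict mapping stratum name -> sample size for that stratum
    """
    rng = random.Random(seed)
    return {name: rng.sample(units, sample_sizes[name])
            for name, units in strata.items()}

# Placeholder titles; real titles would come from the sampling frame.
strata = {
    "classic": [f"classic-{i}" for i in range(1, 501)],
    "modern": [f"modern-{i}" for i in range(1, 501)],
}
# Proportional allocation of the 100-title sample: 50 from each stratum.
allocation = {"classic": 50, "modern": 50}
sample = stratified_sample(strata, allocation, seed=3)
print({name: len(titles) for name, titles in sample.items()})
```

Unlike a simple random sample of 100 titles, this design cannot produce a split such as 77 classic and 23 modern movies.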
Example 6: Suppose you want to estimate how many high school students have part-time
jobs at the national level and also in each province. If you were to select a simple random
sample of 25,000 people from a list of all high school students in Canada (assuming such a
list was available for selection), you would end up on average with just a little over
100 people from Prince Edward Island, since they account for less than half of a percent of
the whole Canadian population. This sample would probably not be large enough for the kind
of detailed analysis you had in mind. Stratifying your list by province, again assuming that
this information is available, and then selecting a sample size for each province would allow
you to decide on the exact sample size needed for that specific province. Thus, in
order to get good representation of Prince Edward Island, you would use a larger
sample than the one allotted to it by the simple random sampling approach.
Example 7: An Ontario school board wanted to assess student opinion on dropping Grade 13
from the secondary school program. They decided to survey students from Elmsview
High School. To ensure a representative sample of students from all grade levels, the school
board used a stratified sampling technique.
In this case, the strata were the five grade levels (grades 9 to 13). The school board
then selected a sample within each stratum. The students selected in this sample were
extracted using simple random or systematic sampling, making up a total sample of 100
students.
Cluster sampling
Cluster sampling divides the population into groups or clusters. A number of clusters
are selected randomly to represent the total population, and then all units within selected
clusters are included in the sample. No units from non-selected clusters are included in the
sample— they are represented by those from selected clusters. This differs from stratified
sampling, where some units are selected from each group.
Examples of clusters are factories, schools and geographic areas such as electoral
subdivisions. The selected clusters are used to represent the population.
Example 8: Suppose you are a representative from an athletic organization wishing to find
out which sports Grade 11 students are participating in across Canada. It would be too costly
and lengthy to survey every Canadian in Grade 11, or even a couple of students from every
Grade 11 class in Canada. Instead, 100 schools are randomly selected from all over Canada.
These schools provide clusters of samples. Then every Grade 11 student in all 100 clusters is
surveyed. In effect, the students in these clusters represent all Grade 11 students in Canada.
Example 9: Imagine that the municipal council of a small city wants to investigate the use of
health care services by residents.
First, the council requests from Statistics Canada electoral subdivision maps that identify and
label each city block. From these maps, the council creates a list of all city blocks. This list
will serve as the sampling frame.
Every household in that city belongs to a city block, and each city block represents a cluster
of households. The council randomly picks a number of city blocks using the simple random
sample approach, then creates a list of all households in the selected city blocks;
these households make up the survey sample.
As mentioned, cost reduction is a reason for using cluster sampling. It creates 'pockets'
of sampled units instead of spreading the sample over the whole territory. Another reason is
that sometimes a list of all units in the population (a requirement when conducting simple
random sampling, systematic sampling or sampling with probability proportional to size) is not
available, while a list of all clusters is either available or easy to create.
In most cases, the main drawback is a loss of efficiency when compared with simple random
sampling. It is usually better to survey a large number of small clusters instead of a
small number of large clusters. This is because neighbouring units tend to be more alike,
resulting in a sample that does not represent the whole spectrum of opinions or situations
present in the overall population. In the two previous examples, students in the same
school tend to participate in the same types of sports (depending on the facilities available at
their school); similarly, elderly people have a tendency to live in specific neighbourhoods and
to be heavy users of health services.
Another drawback to cluster sampling is that you do not have total control over the
final sample size. Since not all schools have the same number of Grade 11 students and city
blocks do not all have the same number of households, and you must interview every
student or household in your sample, the final size may be larger or smaller than you expected.
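The lack of control over the final sample size shows up clearly in a small sketch. The frame below, 20 schools with between 40 and 120 Grade 11 students each, is invented for illustration, as is the function name cluster_sample.

```python
import random

def cluster_sample(clusters, number_of_clusters, seed=None):
    """Randomly select whole clusters and include every unit in each one."""
    rng = random.Random(seed)
    selected = rng.sample(sorted(clusters), number_of_clusters)
    units = [unit for name in selected for unit in clusters[name]]
    return selected, units

# Hypothetical frame: 20 schools of varying size.
size_rng = random.Random(0)
schools = {
    f"school-{i:02d}": [f"school-{i:02d}/student-{j}"
                        for j in range(size_rng.randint(40, 120))]
    for i in range(1, 21)
}

selected_schools, students = cluster_sample(schools, number_of_clusters=5, seed=11)
# The final sample size is known only after the clusters have been drawn.
print(selected_schools, len(students))
```

Running the sketch with different seeds selects different schools and therefore gives different numbers of students.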
Multi-stage sampling
Multi-stage sampling is like the cluster method, except that it involves picking a sample from
within each chosen cluster, rather than including all units in the cluster. This type of sampling
requires at least two stages. In the first stage, large groups or clusters are identified
and selected. These clusters contain more population units than are needed for the final
sample.
In the second stage, population units are picked from within the selected clusters (using any
of the possible probability sampling methods) for a final sample. If more than two stages are
used, the process of choosing population units within clusters continues until there is a final
sample.
Example 10: In Example 8, a cluster sample would choose 100 schools and then interview
every Grade 11 student from those schools. Instead, in multi-stage sampling, you could select
more schools, get a list of all Grade 11 students from these selected schools and select
a random sample (e.g., simple random sample) of students from each school. This would be a
two-stage sampling design.
You could also get a list of all Grade 11 classes in the selected schools, pick a random sample
of classes from each of those schools, get a list of all the students in the selected classes and
finally select a random sample of students from each class. This would be a three-stage
sampling design. Each time we add a stage, the process becomes more complex.
Now imagine that each school has on average 80 Grade 11 students. Cluster sampling would
then give your organization a sample of about 8,000 students (100 schools x 80 Grade
11 students). If you wanted a bigger sample, you could select schools with more students; and
for a smaller sample you could select schools with fewer students.
One way to control the sample size would be to stratify the schools into large, medium and
small sizes (in terms of the number of Grade 11 students) and select a sample of schools from
each stratum. This is called stratified cluster sampling.
With a three-stage design, you could select a sample of 400 schools, then select two Grade 11
classes per school (assuming that there are two or more Grade 11 classes per school). Finally,
you could select 10 students per class. This way, you still end up with a sample of about
8,000 students (400 schools x 2 classes x 10 students), but the sample will be more spread out.
You can see from this example that with multi-stage sampling, you still have the benefit of a
more concentrated sample for cost reduction. However, the sample is not as concentrated as
in cluster sampling, and the sample size is still bigger than for a simple random sample. Also,
you do not need to have a list of all of the students in the population. All you need is a list of
the classes from the 400 schools and a list of the students from the 800 classes. Admittedly,
more information is needed in this type of sample than what is required in cluster sampling.
However, multi-stage sampling still saves a great amount of time and effort by not having to
create a list of all of the units in a population.
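The three-stage design described above can be sketched as nested random draws. The frame here is invented (1,000 schools, each with 3 Grade 11 classes of 25 students) and the function name three_stage_sample is hypothetical; only the sample sizes at each stage come from the text.

```python
import random

def three_stage_sample(schools, n_schools, n_classes, n_students, seed=None):
    """Select schools, then classes within each school, then students within each class.

    schools: dict mapping school -> dict mapping class -> list of students
    """
    rng = random.Random(seed)
    sample = []
    for school in rng.sample(sorted(schools), n_schools):               # stage 1
        classes = schools[school]
        for class_name in rng.sample(sorted(classes), n_classes):       # stage 2
            sample.extend(rng.sample(classes[class_name], n_students))  # stage 3
    return sample

# Hypothetical frame; each student is a (school, class, student) tuple.
frame = {
    f"school-{s}": {
        f"class-{c}": [(s, c, k) for k in range(25)] for c in range(3)
    }
    for s in range(1000)
}

sample = three_stage_sample(frame, n_schools=400, n_classes=2, n_students=10, seed=5)
print(len(sample))  # 400 schools x 2 classes x 10 students
```

Note that student lists are only consulted for the classes actually selected, which is the practical saving the text describes.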
Multi-phase sampling
A multi-phase sample collects basic information from a large sample of units and then, for a
subsample of these units, collects more detailed information. The most common form
of multi-phase sampling is two-phase sampling (or double sampling), but three or more phases
are also possible.
Multi-phase sampling is quite different from multi-stage sampling, despite the similarities in
name. Although multi-phase sampling also involves taking two or more samples, all samples
are drawn from the same frame and at each phase the units are structurally the same.
However, as with multi-stage sampling, the more phases used, the more complex the sample
design and estimation will become.
Multi-phase sampling is useful when the frame lacks auxiliary information that could be used
to stratify the population or to screen out part of the population.
Example 11: Suppose that an organization needs information about cattle farmers in Alberta,
but the survey frame lists all types of farms—cattle, dairy, grain, hog, poultry and produce.
To complicate matters, the survey frame does not provide any auxiliary information for the
farms listed there.
A simple survey could be conducted whose only question is "Is part or all of your
farm devoted to cattle farming?" With only one question, this survey should have a low cost
per interview (especially if done by telephone) and, consequently, the organization should be
able to draw a large sample. Once the first sample has been drawn, a second, smaller sample
can be extracted from among the cattle farmers and more detailed questions asked of
these farmers. Using this method, the organization avoids the expense of surveying units that
are not in this specific scope (i.e., non-cattle farmers).
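The two phases of Example 11 can be sketched as follows. The frame of 5,000 farms, the sample sizes and the function name two_phase_sample are assumptions for illustration; the farm type is stored in the sketch only so that the answer to the screening question can be simulated.

```python
import random

FARM_TYPES = ["cattle", "dairy", "grain", "hog", "poultry", "produce"]

# Hypothetical frame of 5,000 farms. The farm type is NOT known from the
# frame; it is kept here only to simulate the screening answer.
frame_rng = random.Random(0)
farms = {farm_id: frame_rng.choice(FARM_TYPES) for farm_id in range(1, 5001)}

def two_phase_sample(frame, phase1_size, phase2_size, seed=None):
    rng = random.Random(seed)
    # Phase 1: large, cheap sample asked only the screening question.
    phase1 = rng.sample(sorted(frame), phase1_size)
    cattle_farmers = [farm_id for farm_id in phase1 if frame[farm_id] == "cattle"]
    # Phase 2: smaller subsample of in-scope units for the detailed questions.
    phase2 = rng.sample(cattle_farmers, min(phase2_size, len(cattle_farmers)))
    return phase1, phase2

phase1, phase2 = two_phase_sample(farms, phase1_size=1000, phase2_size=50, seed=2)
print(len(phase1), len(phase2))
```

Both samples are drawn from the same frame of farms, which is what distinguishes this design from multi-stage sampling.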
Example 12: A health survey asks participants some basic questions about their diet,
smoking habits, exercise routines and alcohol consumption. In addition, the survey requires
that respondents subject themselves to some direct physical tests, such as running on
a treadmill or having their blood pressure and cholesterol levels measured.
The difference between probability and non-probability sampling has to do with a basic
assumption about the nature of the population under study. In probability sampling,
every item has a chance of being selected. In non-probability sampling, there is an assumption
that there is an even distribution of characteristics within the population. This is what makes
the researcher believe that any sample would be representative and because of that, results
will be accurate. For probability sampling, randomization is a feature of the selection process,
rather than an assumption about the structure of the population.
NON-PROBABILITY SAMPLING
In non-probability sampling, since elements are chosen arbitrarily, there is no way to estimate
the probability of any one element being included in the sample. Also, no assurance is given
that each item has a chance of being included, making it impossible either to estimate
sampling variability or to identify possible bias.
Statisticians are reluctant to use these methods because there is no way to measure the
precision of the resulting sample.
Despite these drawbacks, non-probability sampling methods can be useful when descriptive
comments about the sample itself are desired. Secondly, they are quick, inexpensive
and convenient. There are also other circumstances, such as in applied social research, when
it is unfeasible or impractical to conduct probability sampling. Statistics Canada uses
probability sampling for almost all of its surveys, but uses non-probability sampling for
questionnaire testing and some preliminary studies during the development stage of a survey.
Most non-probability sampling methods require some effort and organization to complete, but others, like
convenience sampling, are done casually and do not need a formal plan of action.
There are times when the average person uses convenience sampling. A food critic,
for example, may try several appetizers or entrees to judge the quality and variety of a menu.
And television reporters often seek so-called 'people-on-the-street interviews' to find out
how people view an issue. In both these examples, the sample is chosen haphazardly, without
use of a specific survey method.
The obvious advantage is that the method is easy to use, but that advantage is greatly offset
by the presence of bias. Although useful applications of the technique are limited, it
can deliver accurate results when the population is homogeneous.
For example, a scientist could use this method to determine whether a lake is polluted.
Assuming that the lake water is well-mixed, any sample would yield similar information. A
scientist could safely draw water anywhere on the lake without fretting about whether or not
the sample is representative.
Volunteer sampling
As the term implies, this type of sampling occurs when people volunteer their services for the
study. In psychological experiments or pharmaceutical trials (drug testing), for example,
it would be difficult and unethical to enlist random participants from the general public. In
these instances, the sample is taken from a group of volunteers. Sometimes, the researcher
offers payment to entice respondents. In exchange, the volunteers accept the possibility of a
lengthy, demanding or sometimes unpleasant process.
Sampling voluntary participants as opposed to the general population may introduce strong
biases. Often in opinion polling, only the people who care strongly enough about the subject
one way or another tend to respond. The silent majority does not typically respond, resulting
in large selection bias.
Television and radio media often use call-in polls to informally query an audience on their
views. The Much Music television channel uses this kind of survey in their CombatZone
program. The program asks viewers to cast a vote for one of two music videos by telephone,
e-mail or through their online website.
Oftentimes, there is no limit imposed on the frequency or number of calls one respondent can
make. So, unfortunately, a person might be able to vote repeatedly. It should also be noted
that the people who contribute to these surveys might have different views than those who do
not.
Judgement sampling
This approach is used when a sample is taken based on certain judgements about the overall
population. The underlying assumption is that the investigator will select units that are
characteristic of the population. The critical issue here is objectivity: how much can judgement
be relied upon to arrive at a typical sample? Judgement sampling is subject to the researcher's
biases and is perhaps even more biased than haphazard sampling. Since any preconceptions
the researcher may have are reflected in the sample, large biases can be introduced if these
preconceptions are inaccurate.
Statisticians often use this method in exploratory studies like pre-testing of questionnaires and
focus groups. They also prefer to use this method in laboratory settings where the choice of
experimental subjects (e.g., animal, human, vegetable) reflects the investigator's pre-existing
beliefs about the population.
One advantage of judgement sampling is the reduced cost and time involved in acquiring the
sample.
Quota sampling
This is one of the most common forms of non-probability sampling. Sampling is done until a
specific number of units (quotas) for various sub-populations have been selected. Since there
are no rules as to how these quotas are to be filled, quota sampling is really a means
for satisfying sample size objectives for certain sub-populations.
The quotas may be based on population proportions. For example, if there are 100 men and
100 women in a population and a sample of 20 are to be drawn to participate in a cola taste
challenge, you may want to divide the sample evenly between the sexes—10 men and
10 women. Quota sampling can be considered preferable to other forms of non-
probability sampling (e.g., judgement sampling) because it forces the inclusion of members
of different sub-populations.
Quota sampling is somewhat similar to stratified sampling in that similar units are grouped
together. However, it differs in how the units are selected. In probability sampling, the units
are selected randomly while in quota sampling it is usually left up to the interviewer to decide
who is sampled. This can result in selection bias. Nevertheless, quota sampling is often used by market
researchers (particularly for telephone surveys) instead of stratified sampling, because
it is relatively inexpensive and easy to administer and has the desirable property of
satisfying population proportions. However, it disguises potentially significant bias.
As with all other non-probability sampling methods, in order to make inferences about
the population, it is necessary to assume that persons selected are similar to those not selected.
Such strong assumptions are rarely valid.
Example 1: The student council at Cedar Valley Public School wants to gauge student
opinion on the quality of their extracurricular activities. They decide to survey 100 of 1,000
students using the grade levels (7 to 12) as the sub-populations.
The table below gives the number of students in each grade level.
Grade level   Number of students   Percentage of students (%)   Quota of students in sample of 100
7             150                  15                           15
8             220                  22                           22
9             160                  16                           16
10            150                  15                           15
11            200                  20                           20
12            120                  12                           12
Total         1,000                100                          100
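The student council wants to make sure that the percentage of students in each grade level
is reflected in the sample. The formula is:
Percentage of students in Grade 10 = (number of students in Grade 10 ÷ total number of students) × 100%
= (150 ÷ 1,000) × 100% = 15%
Since 15% of the school population is in Grade 10, 15% of the sample should consist of Grade
10 students. Therefore, use the following formula to calculate the number of Grade
10 students that should be included in the sample:
Number of Grade 10 students in the sample = 15% × sample size = 0.15 × 100 = 15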
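The same calculation can be repeated for every grade level. The sketch below takes the enrolment figures from the table above; the variable names are chosen here for illustration.

```python
# Enrolment by grade level, taken from the table above.
enrolment = {7: 150, 8: 220, 9: 160, 10: 150, 11: 200, 12: 120}
SAMPLE_SIZE = 100

total_students = sum(enrolment.values())

# Percentage of the school population in each grade.
percentages = {grade: 100 * count / total_students
               for grade, count in enrolment.items()}

# Quota for each grade: the same percentage applied to the sample size.
quotas = {grade: round(SAMPLE_SIZE * count / total_students)
          for grade, count in enrolment.items()}

print(percentages[10], quotas[10], sum(quotas.values()))
```

Here the quotas come out as whole numbers that add up to 100. With other enrolment figures, rounding may leave the total slightly above or below the intended sample size, and one or two quotas would need adjusting by hand.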
The main difference between stratified sampling and quota sampling is that stratified
sampling would select the students using a probability sampling method such as simple
random sampling or systematic sampling. In quota sampling, no such technique is used. The
15 students might be selected by choosing the first 15 Grade 10 students to enter school on a
certain day, or by choosing 15 students from the first two rows of a particular
classroom. Keep in mind that those students who arrive late or sit at the back of the
class may hold different opinions from those who arrived earlier or sat in the front.
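The proportional quotas in the Cedar Valley example can be reproduced with a few lines of Python. This is only a minimal sketch: the variable names are illustrative, and the rounding step is an assumption that works here because every quota comes out as a whole number. With other enrolment figures, rounded quotas might not add up exactly to the sample size.

```python
# Enrolment by grade level, taken from the Cedar Valley table.
enrolment = {7: 150, 8: 220, 9: 160, 10: 150, 11: 200, 12: 120}
sample_size = 100  # total number of students to be surveyed

total_students = sum(enrolment.values())

# Each grade's quota is its share of the school multiplied by the sample size.
quotas = {
    grade: round(count * sample_size / total_students)
    for grade, count in enrolment.items()
}

for grade, quota in quotas.items():
    share = 100 * enrolment[grade] / total_students
    print(f"Grade {grade}: {share:.0f}% of students, quota of {quota}")
```

The code only fixes how many students to take from each grade. How those students are then picked (randomly, or at the interviewer's convenience) is what separates stratified sampling from quota sampling.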
The main argument against quota sampling is that it does not meet the basic requirement of
randomness. Some units may have no chance of selection or the chance of selection may be
unknown. Therefore, the sample may be biased.
It is common, but not necessary, for quota samples to use random selection procedures at the
beginning stages, much in the same way as probability sampling does. For instance, the first
step in multi-stage sampling would be randomly selecting the geographic areas.
The difference is in the selection of the units in the final stages of the process.
In multi-stage sampling, the final units come from up-to-date lists for the selected areas and a sample is
selected according to a random process. In quota sampling, by contrast, each interviewer is
instructed on how many of the respondents should be men and how many should be women,
as well as how many people should represent the various age groups. The quotas are therefore
calculated from available data for the population, so that the sexes, age groups or
other demographic variables are represented in the correct proportions. But within each
quota, interviewers may fail to secure a representative sample of respondents. For example,
suppose that an organization wants to find out information about the occupations of men aged
20 to 25. An interviewer goes to a university campus and selects the first 50 men aged 20 to 25 that
she comes across and who agree to participate in her organization's survey. However,
filling the quota does not mean that these 50 men are representative of all men aged 20 to 25.
Quota sampling is generally less expensive than random sampling. It is also easy to
administer, especially considering that the tasks of listing the whole population, randomly
selecting the sample and following up on non-respondents can be omitted from the procedure.
Quota sampling is an effective sampling method when information is urgently required and
can be carried out independent of existing sampling frames. In many cases where the
population has no suitable frame, quota sampling may be the only appropriate sampling
method.
ESTIMATION
Self-weighting designs
Adjusting the weights
Other estimation methods
Estimating the sampling error
Examples of estimation using a simple random sampling design
Estimation of the population mean
Estimation of the population total
As we now know, the goal of conducting surveys is to obtain information about a particular
population. When the sample has been selected and the information collected (see the Data
collection chapter) and processed (see the Data processing chapter), there still remains
the task of linking the information gathered from the sample back to the overall population.
Estimation is the process of determining a likely value for a variable in the survey population,
based on information collected from the sample. Researchers are usually interested in looking
at estimates of many statistics—totals, averages and proportions being the most frequent—for
different variables. For example, a sample survey could be used to produce any of the
following statistics: estimates for the proportion of smokers among all people aged 15 to 24
in the population; the average earnings of men and women with a university degree; or the
total number of cars possessed by the whole survey population.
Underpinning the estimation process is the sampling weight of a unit, which indicates
the number of units in the population (including the sampled unit itself) that are represented by
this sampled unit. The sampling weight is the inverse of the unit's probability of selection.
• Example 1: Suppose that the city of Winnipeg has decided to award bus travellers
with free one-year bus passes as a way of promoting its services. A simple random
sample of 10 people is selected from the 30 passengers on a city bus. Since simple
random sampling gives equal probability of selection to every member of the
population (in this case, all passengers on the bus), each passenger had one chance out
of three of being selected. This translates into a sampling weight of three for
every selected unit. This means that each person in the sample represents three
persons in the population: himself or herself, plus two other persons.
One way to use this sampling weight would be to take the survey information for the 10
selected passengers and copy it three times to create an artificial population of
30. Totals, averages or proportions for the real population could then be estimated by
the corresponding statistics computed using the artificial population. Instead of
doing this, however, survey statisticians attach a sampling weight to each unit in the
sample and take this weight into account when estimating.
If one person in a sample (with a sampling weight of 18) had blue eyes and brown
hair, then it is as if a total of 18 people in the population had blue eyes and brown hair.
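The equivalence between copying the sample and attaching weights can be checked with a short Python sketch based on the bus example. The yes/no answers below are hypothetical (the source gives no survey answers for the passengers); only the population size of 30 and the sample size of 10 come from the example.

```python
from fractions import Fraction

population_size = 30  # passengers on the bus
sample_size = 10      # passengers selected by simple random sampling

# Every passenger has the same chance of selection, here 10/30 = 1/3.
selection_probability = Fraction(sample_size, population_size)

# The sampling weight is the inverse of the selection probability.
sampling_weight = 1 / selection_probability

# Hypothetical answers, for illustration only: 1 = "yes", 0 = "no".
answers = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]

# Approach 1: copy every answer three times to build an artificial population.
artificial_population = [a for a in answers for _ in range(int(sampling_weight))]
total_by_copying = sum(artificial_population)

# Approach 2: keep the 10 answers and multiply each by its sampling weight.
total_by_weighting = sum(sampling_weight * a for a in answers)

print(sampling_weight, total_by_copying, total_by_weighting)
```

Both approaches give the same estimated total. The weighted version avoids building the artificial population and also works when a weight is not a whole number.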
Example 2: You are conducting a survey to determine the total number of
people living on your street and the average number of cars owned by each household.
You decide to select a systematic sample of 5 households from the 20 on your street and
intend to use that sample to estimate the totals you are looking for. The
following table summarizes the information that you gathered during your interviews
with the sampled households:
Household number   Number of persons   Number of cars   Selection probability   Sampling weight
1                  1                   0                1/4                     4
2                  4                   2                1/4                     4
3                  2                   1                1/4                     4
4                  2                   1                1/4                     4
5                  3                   2                1/4                     4
• The selection probability of 1 in 4 comes from the fact that systematic sampling gives
an equal chance of being selected to each household on your street: 5 of the 20
households are sampled. The sampling weight of 4 is just the inverse of that
probability. When estimating, you have to look at the characteristics of each sampled
household. In this case, each sampled household is treated as standing for 4
households from the population of 20 on your street that share its characteristics.
In order to estimate the total number of persons living on your street, you
have to multiply the number of persons in each sampled household by that household's
sampling weight (the number of households it represents), then add up all the final
numbers. For example, there are 4 one-person households (represented by Household
number 1), 4 four-person households (represented by Household number 2),
8 two-person households (four households represented by Household number 3 and
four households represented by Household number 4) and 4 three-person households
(represented by Household number 5).
The estimation of the total number of persons would then be:
Estimated total number of persons
= (4 x 1) + (4 x 4) + (8 x 2) + (4 x 3)
= 48 persons
To estimate the average number of cars per household, you proceed in the
same manner. Get an estimate of the total number of cars owned by households
on your street and then divide the estimate by the actual number of households on
the street. For example, there are 4 households without a car (represented by
Household number 1), 8 households with two cars (represented by Household number 2
and Household number 5) and 8 households with one car each (represented by
Household number 3 and Household number 4).
Estimated total number of cars
= (4 x 0) + (8 x 2) + (8 x 1)
= 24 cars
Estimated average
= 24 ÷ 20
= 1.2 cars per household
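The weighted calculations for the street survey can be sketched in Python as follows. The figures are those of the five sampled households described in the text; the variable names are illustrative.

```python
# (household number, number of persons, number of cars) for the 5 sampled households
sampled_households = [
    (1, 1, 0),
    (2, 4, 2),
    (3, 2, 1),
    (4, 2, 1),
    (5, 3, 2),
]

households_on_street = 20

# 5 of the 20 households are sampled, so each sampled household has weight 20 / 5 = 4.
sampling_weight = households_on_street / len(sampled_households)

# Weighted totals: multiply each observed value by the weight, then add up.
estimated_persons = sum(sampling_weight * persons for _, persons, _ in sampled_households)
estimated_cars = sum(sampling_weight * cars for _, _, cars in sampled_households)

# The average divides the estimated total by the actual number of households.
estimated_cars_per_household = estimated_cars / households_on_street

print(estimated_persons, estimated_cars, round(estimated_cars_per_household, 2))
```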
Self-weighting designs
It is not always the case that all sampled units have the same sampling weight. Some designs
give unequal probability of selection to units, resulting in units within the same sample
having different sampling weights. Answers from one household or business could represent
the answers for 200 units of the population, while the answers from another could represent
only 50 units in the population.
When every unit in the sample has the same sampling weight, the sampling design is said to
be self-weighted. This kind of design is time-saving and operationally convenient,
particularly for large samples. Because every unit has the same weight, those weights can be
ignored when estimating averages and proportions. The average for the sample gives
an appropriate estimate of the average for the whole population.
Simple random sampling and systematic sampling are examples of self-weighted designs. In
that sense, calculations could have been made easier in Example 2. For instance, to estimate
the average number of cars per household in the population, we could simply have used the
sample average. The 5 sampled households own a total of 6 cars,
an average of 1.2 cars per household. This is the same result as that obtained using the
sampling weight procedure.
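A short Python sketch shows why weights can be ignored for averages in a self-weighting design, and why they cannot when they differ. The car counts come from Example 2. The unequal weights are hypothetical, loosely echoing the 200 and 50 mentioned above, and `weighted_mean` is an illustrative helper, not a function from any library.

```python
def weighted_mean(values, weights):
    """Weighted average: sum of weight x value, divided by the sum of the weights."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

cars = [0, 2, 1, 1, 2]  # cars owned by the 5 sampled households in Example 2

# Self-weighting design: every sampled household has the same weight of 4.
equal_weights = [4, 4, 4, 4, 4]
plain_sample_mean = sum(cars) / len(cars)
mean_with_equal_weights = weighted_mean(cars, equal_weights)

# Hypothetical unequal weights: the first household represents 200 units,
# each of the others represents 50.
unequal_weights = [200, 50, 50, 50, 50]
mean_with_unequal_weights = weighted_mean(cars, unequal_weights)

print(plain_sample_mean, mean_with_equal_weights, mean_with_unequal_weights)
```

With equal weights, the weight cancels out of the numerator and denominator, so the weighted mean equals the plain sample mean. With unequal weights the two differ, and the plain sample mean would no longer be an appropriate estimate.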
Adjusting the weights
Sometimes, the sampling weights are adjusted prior to estimation. There are basically
two reasons for weight adjustment:
• Adjusting for non-response: Using sampling weights for estimation works fine when
you have been able to interview all selected units. In Example 2, if two of the five
sampled households refused to answer or were unavailable at the time of the survey,
you would only have answers for three households, thus representing only 12 of the
20 households on the street. The two non-responding units represent four households
each. This means that we have no information on the number of persons or cars for 8
households on your street. In order to adjust for that, survey statisticians usually
increase the weights of responding units to account for the loss of representativeness
caused by non-response. The goal would be to use only the 3 units for which we have
information, but still represent the 20 households on the street.
• Adjusting for external information: Sometimes, we know the actual total for one or
more variables measured in the sample. In Example 3 of the Probability
sampling section, a population of the 1,000 best horror movies was equally divided
into 500 classic movies and 500 modern movies. Even though you knew this
prior to sampling, you decided to select a simple random sample of 100 movies and
ended up with 77 classic movies and 23 modern movies. Each of these movies has a
weight of 10 (because you selected 1 movie out of every 10 titles). Using the answers
from the survey and the sampling weight, your sample would represent a population of
770 classic movies and 230 modern movies. This could lead to inaccurate estimates. One
solution would be to decrease the weight of every sampled classic and increase the
weight of every sampled modern movie so that your sample gives an estimate of 500
classics and 500 modern films in the population. This should reduce the
distortion caused by a 'bad' sample.
Of course, stratifying by release date prior to sampling would have solved this
problem. However, in a lot of cases, we have totals at the population level, but we
don't know the attribute of each unit on the sampling frame. For example, from the
Census of Population, we know how many men and women there are in a
specific city, but all we have for sampling is a list of households. Thus,
stratifying our population by sex would not be possible. Demographic projections by
age and sex for each province are often used in social surveys to adjust sampling
weights.
The weights adjusted for non-response and/or external counts are used for estimation, in the
same way as the sampling weight was used in Example 1.
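Both kinds of adjustment can be sketched in Python using the figures from the two cases above. This is one simple way of doing it, under stated assumptions: the non-response adjustment spreads the weight of the non-respondents evenly over the respondents, which assumes the non-respondents resemble them, and the adjustment to known totals is the approach often called post-stratification. The variable names are illustrative.

```python
# Adjusting for non-response (street survey, Example 2)
households_on_street = 20
sampled = 5
respondents = 3

design_weight = households_on_street / sampled  # 4 households per sampled unit
nonresponse_factor = sampled / respondents      # inflate respondents' weights by 5/3
adjusted_weight = design_weight * nonresponse_factor

# The 3 responding households now represent all 20 households again.
represented_households = adjusted_weight * respondents

# Adjusting for external information (horror movie example)
sample_counts = {"classic": 77, "modern": 23}   # movies in the sample of 100
known_totals = {"classic": 500, "modern": 500}  # known population counts
movie_design_weight = 1000 / 100                # 10 titles per sampled movie

# Before adjustment the sample "represents" 770 classics and 230 modern films.
unadjusted_totals = {k: movie_design_weight * c for k, c in sample_counts.items()}

# After adjustment each group's weights add up to the known population total.
adjusted_movie_weights = {k: known_totals[k] / sample_counts[k] for k in sample_counts}

print(round(adjusted_weight, 2), round(represented_households, 2))
print(unadjusted_totals)
print({k: round(w, 2) for k, w in adjusted_movie_weights.items()})
```

In this sketch the weight of each classic movie drops from 10 to about 6.5 and the weight of each modern movie rises from 10 to about 21.7, so that the weighted sample gives 500 films in each group.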
Other estimation methods
Using the weights to inflate the sample results is not the only estimation method that exists,
but it is the simplest one and the only one that we will cover. Nevertheless, it is important to
know that there exist some other methods that could lead to more precise estimates
(e.g., using auxiliary information). The estimation process has to take into account the
sampling design that was used. Otherwise, the resulting estimates could be severely biased.
Estimating the sampling error
As mentioned before, any estimates derived from samples are subject to what is called
the sampling error. This comes from the fact that only a part of the population was
observed, instead of the whole. A different sample could have come up with different
results. The amount of variation that exists among the estimates from the different
possible samples is what constitutes the sampling error. (There are roughly 14 million
different combinations of 6 numbers from 1 to 49, so imagine how many ways there
are to select a sample of 25,000 Canadian households!) Of course, this sampling error is
unknown, since we would need to know the answer for each unit of the population in order
to calculate it. Nevertheless, it can be estimated by using the survey data. The extent of
the sampling error depends on many things, including the sampling method, the
estimation method, the sample size and the variability of the estimated characteristic.
This is why each sample estimate has its own sampling error. This error should thus be
approximated for each estimated total, average, proportion, etc. produced by the survey.
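The idea that the sampling error reflects the variation among all possible samples can be illustrated in Python on a population small enough to enumerate. The five-value population below is hypothetical and chosen only to keep the list of samples short. In a real survey the full population is unknown, so this enumeration cannot be carried out and the error has to be estimated from the sample itself.

```python
from itertools import combinations
from math import comb
from statistics import mean, pstdev

# Number of different combinations of 6 numbers out of 49.
print(comb(49, 6))  # 13983816, i.e. roughly 14 million

# Hypothetical population of 5 values (for example, household sizes).
population = [1, 4, 2, 2, 3]
sample_size = 2

# Every possible simple random sample of size 2 and the mean it would produce.
all_samples = list(combinations(population, sample_size))
estimates = [mean(sample) for sample in all_samples]

true_mean = mean(population)
average_of_estimates = mean(estimates)

# Spread of the estimates across all possible samples.
spread_of_estimates = pstdev(estimates)

print(len(all_samples), true_mean, average_of_estimates, round(spread_of_estimates, 3))
```

Individual samples give estimates both below and above the true mean, but their average equals the true mean. The spread of these estimates is the quantity that a sampling error measure tries to capture.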
Examples of estimation using a simple random sampling design
Simple random sampling is the simplest of all sampling methods. Estimation using the simple
random sampling method has been studied extensively. There are simple formulas to estimate
the sampling error for many statistics when simple random sampling is used, especially since
it is a self-weighting design. We present here the most common estimators for a population
average (mean) and total, under simple random sampling.
Estimation of the population mean
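In a simple random sample, the estimate of the population mean is identical to the mean of
the sample:
$$\bar{x} = \frac{\sum x}{n}$$
where
x = an observed value
n = the number of observations in the sample
$\bar{x}$ = estimate of the population mean
$\sum x$ = sum of all observed x values in the sample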
Note: Lowercase x and n should be used if you are referring to a sample survey and upper
case X and N should be used when referring to a population.
If the sample results have been summarized in a frequency table, then the estimate
for the population mean is still the sample mean, now computed from the table. Thus,
$$\bar{x} = \frac{\sum xf}{n}$$
where
x = an observed value
f = the frequency of the value (the number of times that this value has been observed in
the sample)
n = the number of observations in the sample (the sum of all the frequencies f)
$\bar{x}$ = estimate of the population mean
$\sum xf$ = sum of all observed xf values (the product of each observed value and its
frequency) in the sample
Example 2: A farmer randomly selects 10 eggs from a gross of 12 dozen eggs (144 eggs)
he finds in his hen house. He carefully weighs each egg.
The following weights were recorded in grams:
0.75, 0.70, 0.55, 0.50, 0.60, 0.65, 0.75, 0.65, 0.75, 0.50
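Using the above formula, we can estimate the mean weight of all of the eggs:
Estimated mean weight
= (0.75 + 0.70 + 0.55 + 0.50 + 0.60 + 0.65 + 0.75 + 0.65 + 0.75 + 0.50) ÷ 10
= 6.40 ÷ 10
= 0.64 grams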
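The egg calculation can be checked with a short Python sketch that computes the sample mean both directly and from a frequency table. The frequency table is built here with `collections.Counter` purely for illustration; the source lists only the ten raw weights.

```python
from collections import Counter

# Weights of the 10 sampled eggs, as recorded in the example.
egg_weights = [0.75, 0.70, 0.55, 0.50, 0.60, 0.65, 0.75, 0.65, 0.75, 0.50]
n = len(egg_weights)

# Direct formula: sum of the observed values divided by the sample size.
mean_direct = sum(egg_weights) / n

# Frequency-table formula: sum of (value x frequency) divided by the sample size.
frequency_table = Counter(egg_weights)  # maps each distinct weight to its frequency
mean_from_table = sum(x * f for x, f in frequency_table.items()) / n

print(round(mean_direct, 2), round(mean_from_table, 2))
```

Both forms give the same estimate, because the frequency table is just a more compact way of writing the same ten observations.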
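Estimation of the population total
For a simple random sample, the estimation formula of a total for the population
is
$$\hat{X} = \frac{N}{n} \sum x$$
where
x = an observed value
n = the number of observations in the sample
N = the number of units in the population
$\hat{X}$ = estimated population total
It is just the estimate for the mean value multiplied by the number of units in the population.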
In the previous example, the mean weight of an egg is 0.64 grams, so it is logical to think that
the total weight of the 144 eggs would be 92.16 grams (144 x 0.64 = 92.16 grams).
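If sample results have been summarized in a frequency table, then the estimation formula for
the population total is
$$\hat{X} = \frac{N}{n} \sum xf$$
where
x = an observed value
f = the frequency of the value in the sample
n = the number of observations in the sample (the sum of all the frequencies f)
N = the number of units in the population
$\hat{X}$ = estimated population total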
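As a check on the egg example, the estimated total can be computed in Python in both ways. The sketch assumes the ten recorded weights and the population of 144 eggs given in the example; the variable names are illustrative.

```python
from collections import Counter

egg_weights = [0.75, 0.70, 0.55, 0.50, 0.60, 0.65, 0.75, 0.65, 0.75, 0.50]
n = len(egg_weights)  # sample size
N = 144               # eggs in the population (a gross)

# Estimated total: (N / n) times the sum of the observed values,
# which is the same as N times the sample mean.
total_direct = N / n * sum(egg_weights)

# Same estimate from a frequency table of the observed weights.
frequency_table = Counter(egg_weights)
total_from_table = N / n * sum(x * f for x, f in frequency_table.items())

print(round(total_direct, 2), round(total_from_table, 2))
```

Here N / n = 14.4 is the sampling weight of each egg, so the estimated total is again the weighted sum of the sample values.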