ch09 SamplingDistributions

The document discusses different sampling techniques used in statistical analysis including simple random sampling, systematic sampling, cluster sampling, and stratified sampling. It defines what a population and sample are, explains the difference between descriptive and inferential statistics, and the importance of sample representativeness. Examples are provided for each sampling technique.

Part 4

Confidence Intervals
CHAPTER 9

Sampling

1. Introduction

This chapter marks the start of the inferential statistics section of the course. Hence, before proceeding,
let’s refresh some of the content introduced in earlier lectures. Firstly, what is the difference
between a population and a sample for a specific phenomenon under investigation? A (or, better, the)
population is the totality of data for that specific phenomenon. In contrast, a sample is a portion of the
population. Moreover, while there is only one population, there are as many samples as can possibly be
drawn from it. An example of a population and a possible sample drawn from it is displayed
in Figure 9.1.

Figure 9.1. Population vs. Sample

Apart from a graphical display, examples of populations and corresponding samples are as follows:

• Population: Students enrolled on an MSc programme at UCD; Sample: Students enrolled on
the MSc Business Analytics programme at UCD;
• Population: Households in the Republic of Ireland; Sample: Households in Dublin;
• Population: People suffering from Covid-19 around the world; Sample: People suffering from
Covid-19 in Europe.

Finally, let’s just refresh the difference between descriptive and inferential statistics. The former
comprises everything that we have studied so far; the latter starts from this point onwards and relies
heavily on the concepts related to the normal distribution.
Descriptive statistics can be applied to either a population or a sample and allows us to organise, gather,
display, and present information through graphs and charts (e.g., histograms and pie charts) as well as
indicators (e.g., measures of centrality, location, and dispersion).
By contrast, inferential statistics comprises all of those tools, such as confidence intervals and hypothesis
testing (see later chapters), that allow us to draw reliable conclusions about a population starting from
one (or more) of its samples. Hence, inferential statistics is applied only to samples, so as to gain some
insights into the related population.

2. Sampling Representativeness

The set we are studying (the population) may be too large to capture in full — or the expense of doing
this might be prohibitive. Thus, we may choose a restricted subset of the population to survey and
study: a sample. However, we need to take care over how we choose the items of the sample — we do
not want the items in the sample to have very different characteristics from those of the population as
a whole. That is, we need the sample to be representative.
88 MIS2008L

Example 9.1. The 1948 US presidential election saw Truman vs. Dewey. All of the polls were in
favour of Dewey; however, Truman was the “unexpected” winner of the election. Polls, which are still
carried out today, are used to get a feeling for the sentiment of the population based on a restricted
portion of it (it is impossible to interview all the citizens who can vote in a specific State!).
Hence, somebody may question the validity of inferential statistics based on the 1948 US presidential
election. . . }

However, it seems the issue at the time was that the sample was not representative of the population.
Phone interviews were conducted and, at that time, a telephone was only for wealthy households; hence,
other opinions could not be captured through that interview method. Conclusion: sample representa-
tiveness is crucial!
Unless care is taken when conducting a survey, the results may be unintentionally skewed toward a
certain section of the population: this is called coverage error or selection bias. We also need to ensure
that questions are clear and unambiguous so that participants are not confused or misled; similarly, it
would be unethical to use ‘leading’ questions to elicit a particular response from participants. We need to
ensure the quality and integrity of the data we use in our analysis; otherwise, our results and conclusions
are worthless.
Typically, the number of people that respond to a survey is between 2 and 4 out of every hundred. When
did you last agree to stop on the street and participate in a survey or take a phone call to ascertain your
music listening habits? This lack of willingness is called survey fatigue and gives rise to non-response
error. Many companies offer an incentive or a donation to a charitable cause to encourage you to
participate in their survey (e.g., gift voucher, prize draw).

3. Sampling Techniques

3.1. Simple Random Sampling. We are only interested in probability samples in which survey
subjects are chosen on the basis of known probabilities and so are representative of the population as
a whole. Thus, we are not interested in nonprobability samples where items are chosen for inclusion
without regard to their probability of occurrence.
An example of a nonprobability sample is a self-selecting web survey. Such samples are said to be biased
and cannot be said to be truly representative of the population from which they are drawn.
We are going to focus on four sampling techniques:
• simple random sampling;
• systematic sampling;
• cluster sampling; and
• stratified sampling.
In a random sample, on each selection from the population, every item remaining in the population has
an equal chance of being selected. This is a special kind of probability sample.
Selection may be with replacement or without replacement:
• if we sample with replacement, we return to the population the item chosen on this particular
selection and thus, it could be chosen again on a later selection (i.e., an item can be picked
more than once);
• if we sample without replacement, we do not return to the population the item chosen on this
particular selection and thus, it has no chance of being chosen again on a later selection (i.e.,
an item cannot be picked more than once).
Sampling without replacement guarantees that all items in the sample will be different and so gives a
better “look” at the population. Samples can be obtained using tables of random numbers or computer
random number generators.
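The two selection schemes can be sketched with Python’s standard random module (an illustrative sketch, not part of the course materials; the population of 100 numbered items is invented for the example):

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

population = list(range(1, 101))  # a toy population of 100 numbered items

# Sampling WITHOUT replacement: random.sample never picks the same item twice.
sample_without = random.sample(population, k=10)

# Sampling WITH replacement: random.choices may pick the same item more than once.
sample_with = random.choices(population, k=10)

print(sample_without)  # 10 distinct items
print(sample_with)     # 10 items, repeats possible
```

Note that random.sample implements exactly the “without replacement” guarantee above: all 10 items are necessarily different.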
Pros of Simple Random Sampling: Very easy.
Cons of Simple Random Sampling:
Data Analysis for Decision Makers 89

• not enough information on the presence of sub-populations (if any);
• not really practical when items of the population are widely spread geographically (e.g., possible
issue with sample representativeness).
An example of Simple Random Sampling is displayed in Figure 9.2.

Figure 9.2. Simple Random Sampling

3.2. Other Types of Sampling: Systematic, Stratified, and Cluster. In a systematic sample,
where we have a population (or frame) of size N and we want a sample of size n, we (systematically)
choose one individual from each of n intervals of width m, where m = N/n (rounded down to the nearest integer).
Specifically:
• calculate the ratio between the population size N and the sample size n and round it down to
the nearest whole number, m;
• use a random-number generator to obtain a number k such that 1 ≤ k ≤ m; and
• select as sample members those items indexed k, k + m, k + 2m, and so on, up to sample size n.
Pros of Systematic Sampling: easier than Simple Random Sampling.
Cons of Systematic Sampling: not recommended when the population features cyclical patterns.
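The three steps above can be sketched as a small Python helper (a hypothetical function, not from the course materials):

```python
import random

def systematic_sample(population, n):
    """Draw a systematic sample of size n from an ordered population."""
    m = len(population) // n         # step 1: m = N/n, rounded down
    k = random.randint(1, m)         # step 2: random start with 1 <= k <= m
    # step 3: take the items indexed k, k+m, k+2m, ... (1-based), n in total
    return [population[k - 1 + i * m] for i in range(n)]

random.seed(1)
sample = systematic_sample(list(range(1, 101)), n=10)
print(sample)  # every pair of neighbours is exactly m = 10 apart
```

The fixed spacing m between selected items is what makes the method vulnerable to cyclical patterns in the population: if the data repeat with period m, the sample sees only one phase of the cycle.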
An example of Systematic Sampling is displayed in Figure 9.3.
A cluster sample is the result of a group-based procedure where:
• the population is divided into groups named clusters;
• a simple random sample of the clusters is obtained; and
• all the items belonging to the selected clusters constitute the final sample.
Pros of Cluster Sampling: recommended when population items are widely spread geographically.
Cons of Cluster Sampling: each cluster should represent parts of the population; however, clusters may
not be diverse enough.
A stratified sample results from the following procedure:
• divide the population into sub-populations (or strata);
• for each stratum, obtain a simple random sample whose size equals the rounded value n × (S/N), where
S is the stratum size, n is the overall sample size, and N is the population size; and
• pool the items simple-randomly sampled from each stratum to build the final sample.
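The two group-based schemes can be sketched side by side (the three named groups and all sizes below are invented for the illustration):

```python
import random

random.seed(7)

# A toy population of 100 items split into three groups, used below
# both as clusters and as strata.
groups = {
    "A": list(range(0, 30)),
    "B": list(range(30, 70)),
    "C": list(range(70, 100)),
}
N = sum(len(members) for members in groups.values())

# Cluster sampling: randomly select whole groups and keep ALL their items.
chosen = random.sample(sorted(groups), k=2)
cluster_sample = [item for name in chosen for item in groups[name]]

# Stratified sampling: from EACH group of size S, simple-randomly draw
# round(n * S / N) items, then pool them into the final sample.
n = 10
stratified_sample = []
for members in groups.values():
    size = round(n * len(members) / N)
    stratified_sample.extend(random.sample(members, size))

print(len(cluster_sample), len(stratified_sample))
```

The contrast in the code mirrors the pros and cons above: the cluster sample touches only two of the three groups, while the stratified sample is guaranteed to contain items from every group in proportion to its size.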

Figure 9.3. Systematic Sampling

                     Population   Sample
Mean                 µ            x̄
Standard deviation   σ            s
Size                 N            n
Proportion           p            ps
Table 9.1. Notations for population parameters and sample statistics

Pros of Stratified Sampling: more reliable than Cluster Sampling.


Example 9.2. In Brightspace, you can find an Excel file named SamplingSolution. What does the file
provide?
• a simple random sample of size 10;
• a systematic sample of size 10, given a cyclical pattern of 3;
• a cluster sample of size 9; and
• a stratified sample of size 9.
}

4. Sampling Probability Distributions

A sampling distribution is a distribution of all the possible values of a statistic for a given sample size.
Very often, it is practical to analyse a sample from a population in an attempt to infer the population
parameters. Some reasons for this include saving time, money, resources, etc.
So far we have talked about mean and standard deviation, for both sample and population. Another
important statistic we will study is the following:
Definition 9.3. Given some criterion of interest, some members of the population (or sample) may
satisfy that criterion, while some may not. The proportion is the fraction or percentage of the population
(or sample) that satisfies the criterion. It is written as p for the population and ps for a sample.

It is important at this stage to be comfortable with the different notations for the population parameters
and the sample statistics: see Table 9.1.
As different samples are taken from the population, the sample statistics vary from sample to
sample. The probability distribution of the sample statistic gives rise to the sampling distribution.

We focus on which probability distributions describe the sample mean and sample proportion. We will
be dealing with two different sampling distributions: see Figure 9.4.

Figure 9.4. Sampling distributions that we study

4.1. Sampling Distribution of the Sample Mean. We take as an example a population of size
N = 4, and a random variable X, whose value is the number of chat messages sent by the four members
of the population. Let’s suppose these are 18, 20, 22 and 24. Thus, X takes values in {18, 20, 22, 24}.
As we have not said one population member is more likely than another, all four values of X are equally
likely (i.e., each has probability 0.25 of being observed) so X is a uniformly distributed random variable
(see Figure 9.5).

Figure 9.5. Probability mass function of X: a uniform distribution, shown as a bar chart

Confirm that the mean µ = 21 texts and the standard deviation σ = 2.236 texts for X.
Now assume that we sample with replacement, and count how many samples of size 2 we can take from
this population, i.e., n = 2. Since we allow replacement, there are 4 choices for the first sample element,
and also 4 choices for the second: so, by the multiplication principle, there are 4 × 4 = 16 possible
samples.
Different samples of the same size from the same population will yield different sample means. For each
of the 16 possible samples we calculate the sample mean; then we plot the distribution, which is not
uniform (see Figure 9.6):
Calculating the mean of all the sample means, we find µX̄ = 21. Calculating the standard deviation of
all the possible sample means, we find σX̄ = 1.58 = 2.236/√2.
It is no coincidence that µX̄ = µ and σX̄ = σ/√n. Figure 9.7 shows the original distribution and the
sampling distribution of the mean.
A measure of the variability in the mean from sample to sample is given by σX̄, which is also called the
standard error of the mean.
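The 16-sample calculation above can be checked by brute force, enumerating every possible sample (a sketch using only the Python standard library):

```python
from itertools import product
from statistics import mean, pstdev

population = [18, 20, 22, 24]

mu = mean(population)        # 21
sigma = pstdev(population)   # sqrt(5), about 2.236

# Every possible sample of size n = 2 drawn WITH replacement: 4 x 4 = 16.
sample_means = [mean(pair) for pair in product(population, repeat=2)]

mu_xbar = mean(sample_means)    # equals mu
se_xbar = pstdev(sample_means)  # the standard error, sigma / sqrt(2)

print(mu_xbar, round(se_xbar, 2))
```

Running the enumeration confirms both identities from the text: the mean of the sample means equals µ = 21, and their standard deviation equals σ/√n = 2.236/√2 ≈ 1.58.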

Figure 9.6. Sampling distribution of the sample mean

Figure 9.7. Original distribution and the sampling distribution of the sample mean

4.1.1. Normally distributed population. If a population is normally distributed with mean µ and
standard deviation σ, the sampling distribution of the mean is also normally distributed and can be
standardised with the transformation formula:

Z = (X̄ − µX̄)/σX̄ = (X̄ − µ)/(σ/√n).
Figure 9.8 shows that a normal distribution has a normal sampling distribution of the mean, and they
both have the same mean, µ.
4.1.2. Non-normally distributed population: the Central Limit Theorem. Even if a population is not
normally distributed, we can still apply the Central Limit Theorem (or CLT):
Theorem 9.4 (Central Limit Theorem). If the sample size is large enough, even if the population is not
normally distributed, sample means from the population will be approximately normally distributed, and
the approximation improves as the sample size increases.

The CLT says we can assume that the manner in which the possible sample means are distributed will
be roughly a normal distribution, provided we take random samples of sufficient size (usually, n ≥ 30).
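A quick simulation illustrates the theorem (an invented example, not from the course materials, using an exponential population, which is strongly right-skewed and so clearly not normal):

```python
import random
from statistics import mean, pstdev

random.seed(0)

# Population: exponential with mean 1 -- clearly not normal.
def sample_mean(n):
    return mean(random.expovariate(1.0) for _ in range(n))

# Draw 5000 sample means with n = 30; by the CLT they pile up in a
# roughly normal shape around the population mean 1, with standard
# error sigma / sqrt(n) = 1 / sqrt(30), about 0.18.
means = [sample_mean(30) for _ in range(5000)]

print(round(mean(means), 2), round(pstdev(means), 2))
```

A histogram of the 5000 simulated means would show the familiar bell shape, even though the underlying population is heavily skewed.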

4.2. Sampling Distribution of the Proportion. We also need to look at the distribution of
the sample proportion, ps. We define p as the population proportion (the proportion of the population
satisfying some given criterion): thus 0 ≤ p ≤ 1. Then ps approximately follows a normal distribution as
long as np ≥ 5 and n(1 − p) ≥ 5. Since ps can be approximated by a normal distribution, it is possible
to calculate probabilities once we have standardised the values.
We define the standard error of the proportion as
σps = √(p(1 − p)/n)

Figure 9.8. A normal distribution gives a normal sampling distribution of the mean,
with the same mean

which lets us standardise with the following formula


Z = (ps − p)/σps = (ps − p)/√(p(1 − p)/n).
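Putting the standard error and the standardisation formula together (the figures 0.45, 0.40 and 100 are invented purely for the illustration):

```python
import math

def proportion_z(ps, p, n):
    """Standardise a sample proportion ps against a population proportion p."""
    # The normal approximation needs n*p >= 5 and n*(1 - p) >= 5.
    assert n * p >= 5 and n * (1 - p) >= 5
    se = math.sqrt(p * (1 - p) / n)  # standard error of the proportion
    return (ps - p) / se

# E.g., 45% of a sample of 100 satisfy the criterion when p = 0.40.
z = proportion_z(ps=0.45, p=0.40, n=100)
print(round(z, 2))  # about 1.02
```

Once the Z value is in hand, probabilities can be read from the standard normal distribution as in earlier chapters.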
