Observational studies and sampling strategies
Populations and Samples
Research Question: Can people
become better, more efficient
runners on their own, merely by
running?
Population of Interest: All people
http://well.blogs.nytimes.com/2012/08/29/finding-your-ideal-running-form
Sample: Group of adult women who recently joined a running group
Population to which results can be generalized: Adult women, if the
data are randomly sampled
Census
● Wouldn't it be better to just include everyone and "sample" the
entire population?
○ This is called a census.
● There are problems with taking a census:
○ It can be difficult to complete a census: there always seem to
be some individuals who are hard to locate or hard to measure.
And these difficult-to-find people may have certain
characteristics that distinguish them from the rest of the
population.
○ Populations rarely stand still. Even if you could take a census,
the population changes constantly, so it's never possible to get
a perfect measure.
○ Taking a census may be more complex than sampling.
Exploratory analysis to inference
● Sampling is natural.
● Think about sampling something you are cooking - you taste (examine)
a small part of what you're cooking to get an idea about the dish as a
whole.
● When you taste a spoonful of soup and decide the spoonful you tasted
isn't salty enough, that's exploratory analysis.
● If you generalize and conclude that your entire soup needs salt, that's
an inference.
● For your inference to be valid, the spoonful you tasted (the sample)
needs to be representative of the entire pot (the population).
○ If your spoonful comes only from the surface and the salt is
collected at the bottom of the pot, what you tasted is probably not
representative of the whole pot.
○ If you first stir the soup thoroughly before you taste, your spoonful
will more likely be representative of the whole pot.
Sampling bias
● Non-response: If only a small fraction of the randomly sampled people
choose to respond to a survey, the sample may no longer be
representative of the population.
● Voluntary response: Occurs when the sample consists of people who
volunteer to respond because they have strong opinions on the issue.
Such a sample will also not be representative of the population.
● Convenience sample: Individuals who are easily accessible are more
likely to be included in the sample.
Sampling bias example: Landon vs. FDR
A historical example of a biased sample yielding misleading results
In 1936, Landon sought the Republican presidential nomination
opposing the re-election of FDR.
The Literary Digest Poll
● The Literary Digest polled about 10
million Americans, and got
responses from about 2.4 million.
● The poll showed that Landon would
likely be the overwhelming winner
and FDR would get only 43% of the
votes.
● Election result: FDR won, with 62%
of the votes.
● The magazine was completely
discredited because of the poll, and
was soon discontinued.
The Literary Digest Poll - what went wrong?
● The magazine had surveyed
○ its own readers,
○ registered automobile owners, and
○ registered telephone users.
● These groups had incomes well above the national average
of the day (remember, this was the Great Depression era), which
resulted in lists of voters far more likely to support
Republicans than a truly typical voter of the time. In other
words, the sample was not representative of the American
population at the time.
Large samples are preferable, but...
● The Literary Digest election poll was based on a sample
size of 2.4 million, which is huge, but since the sample was
biased, the sample did not yield an accurate prediction.
● Back to the soup analogy: If the soup is not well stirred, it
doesn't matter how large a spoon you have, it will still not
taste right. If the soup is well stirred, a small spoon will
suffice to test the soup.
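The point about sample size versus bias can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the electorate, the 25% higher-income share, and the 35% / 71% support rates are hypothetical numbers chosen for illustration, not figures from the 1936 election, and `simulate_polls` is a hypothetical helper name.

```python
import random

def simulate_polls(seed=0, population_size=100_000):
    """Compare a large biased sample with a small random sample."""
    rng = random.Random(seed)

    # Hypothetical electorate: 25% higher-income voters who support the
    # incumbent at a rate of 35%, and 75% other voters who support at 71%.
    population = []
    for _ in range(population_size):
        higher_income = rng.random() < 0.25
        supports = rng.random() < (0.35 if higher_income else 0.71)
        population.append((higher_income, supports))
    true_share = sum(s for _, s in population) / population_size

    # Large but biased sample: drawn only from higher-income voters,
    # loosely mimicking lists of car and telephone owners.
    reachable = [voter for voter in population if voter[0]]
    biased = rng.sample(reachable, 10_000)
    biased_estimate = sum(s for _, s in biased) / len(biased)

    # Small simple random sample from the whole electorate.
    small = rng.sample(population, 1_000)
    random_estimate = sum(s for _, s in small) / len(small)

    return true_share, biased_estimate, random_estimate

true_share, biased_estimate, random_estimate = simulate_polls()
print(f"true support:              {true_share:.3f}")
print(f"biased sample (n=10,000):  {biased_estimate:.3f}")
print(f"random sample (n=1,000):   {random_estimate:.3f}")
```

Under these assumptions the biased sample is ten times larger yet lands far from the true support, while the small random sample lands close to it.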
Practice
A school district is considering whether it will no longer allow high school students
to park at school after two recent accidents where students were severely injured.
As a first step, they survey parents by mail, asking them whether or not the parents
would object to this policy change. Of 6,000 surveys that go out, 1,200 are returned.
Of these 1,200 surveys that were completed, 960 agreed with the policy change
and 240 disagreed. Which of the following statements are true?
I. Some of the mailings may have never reached the parents.
II. The school district has strong support from parents to move forward with the
policy approval.
III. It is possible that a majority of the parents of high school students disagree
with the policy change.
IV. The survey results are unlikely to be biased because all parents were mailed a
survey.
(a) Only I (b) I and II (c) I and III (d) III and IV (e) Only IV
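The arithmetic behind this question can be checked with a few lines. The sketch below uses only the counts given in the problem. The two bounds are hypothetical extremes, assuming every non-respondent would have disagreed or every one would have agreed; the variable names are illustrative.

```python
# Counts stated in the problem
mailed = 6000
returned = 1200
agreed = 960
disagreed = 240

response_rate = returned / mailed               # share of surveys returned
agree_among_respondents = agreed / returned     # support among those who replied
non_respondents = mailed - returned             # parents whose opinion is unknown

# Extreme cases for support among ALL mailed parents:
# every non-respondent disagrees, or every non-respondent agrees.
min_agree_share = agreed / mailed
max_agree_share = (agreed + non_respondents) / mailed

print(f"response rate:            {response_rate:.0%}")
print(f"agree among respondents:  {agree_among_respondents:.0%}")
print(f"possible overall support: {min_agree_share:.0%} to {max_agree_share:.0%}")
```

Because only 20% of the surveys came back, the data are consistent with overall support anywhere between 16% and 96%, so the 80% figure among respondents says little about parents as a whole.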
Observational studies
● Researchers collect data in a way that does not directly interfere with
how the data arise.
● Results of an observational study can generally be used to establish
an association between the explanatory and response variables.
Prospective vs. Retrospective Studies
A prospective study identifies individuals and collects information
as events unfold.
● Example: The Nurses' Health Study has been recruiting
registered nurses and then collecting data from them using
questionnaires since 1976.
Retrospective studies collect data after events have taken place.
● Example: Researchers reviewing past events in medical
records.
Obtaining Good Samples
● Almost all statistical methods are based on the notion of
implied randomness.
● If observational data are not collected in a random framework
from a population, these statistical methods -- the estimates
and errors associated with the estimates -- are not reliable.
● The most commonly used random sampling techniques are
simple random, stratified, and cluster sampling.
Simple Random Sample
Randomly select cases from the population, so that every case has
the same chance of being selected and there is no implied
connection between the cases that are selected.
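A minimal sketch of a simple random sample, using Python's standard `random` module. The function name `simple_random_sample` and the household IDs are hypothetical and only for illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct cases, each equally likely to be chosen."""
    rng = random.Random(seed)  # seeded generator for reproducibility
    return rng.sample(list(population), n)

# Hypothetical sampling frame: household IDs 1..500
households = range(1, 501)
srs = simple_random_sample(households, 10, seed=42)
print(srs)
```

Sampling is done without replacement, so no household can appear twice.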
Stratified Sample
Strata are made up of similar observations. We take a simple
random sample from each stratum.
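Stratified sampling can be sketched as "group by stratum, then take a simple random sample inside each group". The helper `stratified_sample`, the housing types, and the equal per-stratum sample size are assumptions made for this illustration; proportional allocation is another common choice.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, n_per_stratum, seed=None):
    """Take a simple random sample of n_per_stratum units from each stratum."""
    rng = random.Random(seed)

    # Group units by their stratum label
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)

    # Sample within each stratum (all members if the stratum is small)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        k = min(n_per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical households tagged by housing type
households = [(i, "apartment" if i % 3 == 0 else "house") for i in range(1, 61)]
strat = stratified_sample(households, stratum_of=lambda h: h[1],
                          n_per_stratum=5, seed=1)
print(strat)
```

Every stratum is guaranteed to be represented, which a simple random sample of the same total size does not guarantee.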
Cluster Sample
Clusters are usually not made up of homogeneous observations.
We take a simple random sample of clusters, and then sample
all observations in each sampled cluster. This approach is
usually preferred for economic reasons.
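A sketch of cluster sampling: the random selection happens at the cluster level, and every observation in a chosen cluster is kept. The function `cluster_sample` and the neighborhood data are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick n_clusters clusters and keep ALL observations in each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical neighborhoods, each listing its household IDs
neighborhoods = {
    "north": [1, 2, 3, 4],
    "south": [5, 6, 7],
    "east": [8, 9, 10, 11, 12],
    "west": [13, 14],
}
clustered = cluster_sample(neighborhoods, n_clusters=2, seed=3)
print(clustered)
```

Only two neighborhoods need to be visited here, which is where the cost savings come from.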
Multistage Sample
Clusters are usually not made up of homogeneous observations.
We take a simple random sample of clusters, and then take a
simple random sample of observations from the sampled
clusters.
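Multistage sampling adds a second random step to cluster sampling. In the sketch below, `multistage_sample` is a hypothetical helper: it first samples clusters, then takes a simple random sample of observations within each sampled cluster. The neighborhood data are made up for illustration.

```python
import random

def multistage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: sample clusters. Stage 2: sample observations within each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    result = {}
    for name in chosen:
        members = list(clusters[name])
        k = min(n_per_cluster, len(members))  # small clusters are taken whole
        result[name] = rng.sample(members, k)
    return result

# Hypothetical neighborhoods, each listing its household IDs
neighborhoods = {
    "north": [1, 2, 3, 4],
    "south": [5, 6, 7],
    "east": [8, 9, 10, 11, 12],
    "west": [13, 14],
}
staged = multistage_sample(neighborhoods, n_clusters=2, n_per_cluster=2, seed=3)
print(staged)
```

Compared with the cluster sample, fewer households per neighborhood are surveyed, at the price of a second source of sampling variability.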
Practice
A city council has requested a household survey be conducted in
a suburban area of their city. The area is broken into many
distinct and unique neighborhoods, some including large homes,
some with only apartments. Which approach would likely be the
least effective?
(a) Simple random sampling
(b) Cluster sampling
(c) Stratified sampling
(d) Blocked sampling