Session 2
Sampling Distribution and
Central Limit Theorem
• Statistical Inference
• Random Sampling
• Distribution of Sample Means
• Central Limit Theorem
Population:
The complete set of N items (people, objects,
transactions, or events) under investigation.
Example: All 1 million households in Hyderabad.
Sample:
The portion (or subset) of the population selected
for analysis.
Example: 1000 selected households in
Hyderabad.
There are many different ways to select a sample from
a population. The simplest way…
Random Sample:
A sample of size n drawn from N items in such a
way that each of the N items has the same chance
of being selected.
Variable:
A numerical characteristic of an item (in the
population or sample).
Example: Annual household income.
Population Parameter:
A numerical measure, computed from a
population, which describes an aspect of the
whole population.
Example: Average annual household income.
Sample Statistic:
A numerical measure, computed from a sample,
which describes an aspect of the sample.
Example: Average annual household income in the sample.
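The distinction between a population parameter and a sample statistic can be illustrated with a minimal Python sketch. The income values below are made up (drawn from an arbitrarily chosen lognormal distribution), and the names `population`, `sample`, `population_mean`, and `sample_mean` are hypothetical; only the sizes (1 million households, 1000 sampled) come from the example above.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1,000,000 household incomes (made-up lognormal values).
N = 1_000_000
population = [random.lognormvariate(13.0, 0.6) for _ in range(N)]

# Simple random sample: each household has the same chance of being selected.
n = 1000
sample = random.sample(population, n)

population_mean = statistics.fmean(population)  # parameter (unknown in practice)
sample_mean = statistics.fmean(sample)          # statistic (computed from the sample)

print(f"population mean: {population_mean:,.0f}")
print(f"sample mean:     {sample_mean:,.0f}")
```

In practice only the sample mean can be computed; the population mean is shown here only because the population was simulated.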
Selecting the Sampling Frame
• Sampling frame is simply a list of items from
which to draw a sample
• Does the sampling frame represent the
population?
– e.g. Literary Digest vs. George Gallup polls
• The available list may differ from desired list
– e.g. we do not have list of customers who did not buy
from a store
• Sometimes, no comprehensive sampling frame
exists
– e.g., when forecasting for the future
Typical Pitfalls in Sampling
• Collecting data only from volunteers (voluntary response
sample)
– e.g. online reviews (yelp.com, maps.google.com,
tripadvisor.com)
• Picking easily available respondents (convenience sample)
– e.g. choosing to survey in In-Orbit mall
• A high rate of non-response (more than 70%)
– e.g. CEO / CIO surveys on some industry trends
Notation
We think of a random variable and its probability
distribution as representing a population. In the prior
chapters, we used $\mu$ and $\sigma$ to describe parameters of
normal and binomial random variables.
To distinguish the average in a sample from the
average in a population, we use different notation.
Likewise, with the standard deviation. We summarize
this in a table:

                     Population    Sample
Mean                 $\mu$         $\bar{X}$
Standard deviation   $\sigma$      $s$
Estimation
In a large population, we will not know the parameters
$\mu$ and $\sigma$. We will need to take a sample from the
population and then compute $\bar{X}$ and $s$.
$\bar{X}$ is an estimator of the unknown $\mu$.
$s$ is an estimator of the unknown $\sigma$.
For larger sample sizes, $\bar{X}$ and $s$ will tend to give
better estimates of the parameters.
Simulation:
We can demonstrate this with simulation. We specify
known values, say $\mu = 100$ and $\sigma = 15$ (as in IQ). We take a
sample of people, one by one, and update the $\bar{X}$ and $s$
values after each new data point. So, if the first three people
sampled have IQs of 102, 110, 97, then $\bar{X}$ will be 102, 106,
103. Similarly for $s$. As we sample more people, $\bar{X}$
converges to 100 and $s$ converges to 15.
Here are 2 possible simulations, each using 400 people. (Be
sure you use the left axis for $\bar{X}$ and the right axis for $s$.)
[Figure: Cumulative Mean and Std Deviation, two simulation runs. Each panel plots $\bar{X}$ (left axis, 60 to 120) and $s$ (right axis, 0 to 30) against Observation Number (0 to 400).]
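The running-estimate simulation described above can be sketched in Python as follows. This is an illustrative sketch, not the code behind the slides: it assumes normally distributed IQ scores and uses Welford's update for the running standard deviation (a swapped-in technique chosen for convenience). The names `running_mean_and_sd`, `first_three`, and `history` are hypothetical.

```python
import math
import random

random.seed(1)  # fixed seed; another seed gives another possible run
MU, SIGMA, N_PEOPLE = 100.0, 15.0, 400

def running_mean_and_sd(values):
    """Yield (n, mean, sd) after each new value, using Welford's update."""
    mean, m2 = 0.0, 0.0
    for n, x in enumerate(values, start=1):
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        # The sample standard deviation needs at least two observations.
        sd = math.sqrt(m2 / (n - 1)) if n > 1 else float("nan")
        yield n, mean, sd

# The three IQ values from the text give running means 102, 106, 103.
first_three = [round(m, 6) for _, m, _ in running_mean_and_sd([102, 110, 97])]
print(first_three)

# Assumption: IQ scores are drawn from a normal distribution.
iqs = [random.gauss(MU, SIGMA) for _ in range(N_PEOPLE)]
history = list(running_mean_and_sd(iqs))
for n, mean, sd in history:
    if n in (10, 100, 400):
        print(f"n={n:3d}  mean={mean:6.2f}  s={sd:5.2f}")
```

Plotting the `mean` and `sd` columns of `history` against `n` would give a picture of the same kind as the two panels above.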
Tossing a Single Die
If we toss dice (or flip a coin) we consider the result a
sample from a population. Consider a single die,
tossed once. We know the probability distribution is:
[Figure: Throw of one die. Probability (vertical axis) against Result, 1 to 6 (horizontal axis).]
Sum of Two Dice
The sum of two dice is a random variable with 11
possible values between 2 and 12. Not all of the 11
results have the same probability:
[Figure: Sum of 2 Dice. Probability (vertical axis) against Sum, 2 to 12 (horizontal axis).]
$\bar{X}$ is a random variable:
What is the probability distribution of the average of
the 2 dice rolls instead of the sum?
$$\bar{X} = \frac{X_1 + X_2}{2} = \frac{\text{Sum}}{2}$$
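One way to answer the question is to enumerate all 36 equally likely outcomes of two dice. The sketch below assumes fair, independent dice; the names `sum_dist` and `avg_dist` are hypothetical. It shows that the average takes the values 1, 1.5, ..., 6 with exactly the same probabilities as the corresponding sums.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Count how many of the 36 equally likely (die 1, die 2) outcomes give each sum.
counts = Counter(a + b for a, b in product(faces, repeat=2))

# Exact probability of each sum, and of each average (average = sum / 2).
sum_dist = {s: Fraction(c, 36) for s, c in sorted(counts.items())}
avg_dist = {Fraction(s, 2): p for s, p in sum_dist.items()}

for s, p in sum_dist.items():
    print(f"sum={s:2d}  average={s / 2:4.1f}  probability={p} (~{float(p):.3f})")
```

Dividing by 2 only relabels the horizontal axis, so the shape of the distribution is unchanged.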
Example:
In general, we can throw any number of dice:
$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
Let’s do a simulation of the sum or average of
simultaneous dice rolls. This will enable us to see the
probability distribution of $\bar{X}$.
Go to:
http://onlinestatbook.com/stat_sim/sampling_dist/index.html
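For readers without access to the applet, a rough stand-in can be written in Python. This sketch is not the applet's implementation; it assumes fair, independent dice, and the function names `average_of_dice` and `simulate`, the trial count, and the chosen values of n are all illustrative.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

def average_of_dice(n_dice):
    """Average of n_dice independent fair-die rolls."""
    return sum(random.randint(1, 6) for _ in range(n_dice)) / n_dice

def simulate(n_dice, n_trials=20_000):
    """Return n_trials simulated values of the average."""
    return [average_of_dice(n_dice) for _ in range(n_trials)]

# Simulated sampling distribution of the average for several numbers of dice.
results = {n: simulate(n) for n in (1, 2, 5, 30)}
for n, averages in results.items():
    print(f"n={n:2d}  mean={statistics.fmean(averages):.3f}  "
          f"sd={statistics.stdev(averages):.3f}")
```

The simulated means should stay near 3.5 while the simulated standard deviations shrink as n grows; a histogram of each list in `results` would show the shape of the distribution.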
The Central Limit Theorem in Pictures
It turns out we have simple formulas to determine the
mean and standard deviation of $\bar{X}$ in terms of $n$ and the $\mu$ and $\sigma$
that we computed for a single die:
The mean of $\bar{X}$ is the mean of each individual roll:
$$\mu_{\bar{X}} = \mu$$
The standard deviation of $\bar{X}$ is smaller than the
standard deviation of each individual roll:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
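A short derivation shows where these two formulas come from. It assumes the rolls $X_1, \dots, X_n$ are independent and each has mean $\mu$ and variance $\sigma^2$; the numbers in the last two lines are for a single fair die.

```latex
\begin{align*}
E[\bar{X}] &= \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\, n\mu = \mu \\
\operatorname{Var}(\bar{X}) &= \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}
  \quad\Longrightarrow\quad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \\
\text{One fair die:}\quad \mu &= \frac{1+2+\cdots+6}{6} = 3.5, \qquad
  \sigma^2 = \frac{1^2+2^2+\cdots+6^2}{6} - 3.5^2 = \frac{35}{12}, \qquad
  \sigma \approx 1.708 \\
\text{Two dice } (n=2):\quad \mu_{\bar{X}} &= 3.5, \qquad
  \sigma_{\bar{X}} = \frac{1.708}{\sqrt{2}} \approx 1.208
\end{align*}
```

The variance step is the only place where independence is used: without it, the variance of the sum would also contain covariance terms.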
In addition to the mean and standard deviation, we can also
say something about the probability distribution of $\bar{X}$ for
large values of $n$ …
Statement of the Central Limit Theorem (C.L.T.):
$\bar{X}$ behaves more and more like a normally distributed
random variable as $n$ increases.
This is very important because it says we can use the
z-table for problems that start with any distribution.
Note that the mean of $\bar{X}$ stays the same (the dotted
line) but the density function gets narrower as $n$
increases. This is also obvious from the formulas:
$$\mu_{\bar{X}} = \mu, \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
Note from the picture that for $n = 30$ the distribution of
$\bar{X}$ looks like a normal (last row). That depends on
what kind of distribution we start with. If it is very
skewed (asymmetric), then we might need a larger $n$.
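As an illustration of using the z-table for a non-normal starting distribution, the sketch below estimates the probability that the average of 30 dice exceeds 4, once with the normal approximation and once by simulation. The question itself, the threshold of 4, and the variable names are illustrative assumptions; `NormalDist().cdf` from the Python standard library plays the role of the z-table lookup.

```python
import math
import random
from statistics import NormalDist

random.seed(3)  # fixed seed so the sketch is reproducible

MU = 3.5                    # mean of one fair die
SIGMA = math.sqrt(35 / 12)  # standard deviation of one fair die
n = 30
threshold = 4.0

# Normal approximation: standardize the threshold, then look up the tail area.
standard_error = SIGMA / math.sqrt(n)
z = (threshold - MU) / standard_error
normal_approx = 1 - NormalDist().cdf(z)

# Simulation check: fraction of trials in which the average exceeds the threshold.
trials = 20_000
hits = sum(
    sum(random.randint(1, 6) for _ in range(n)) / n > threshold
    for _ in range(trials)
)
simulated = hits / trials

print(f"z = {z:.2f}")
print(f"normal approximation: {normal_approx:.4f}")
print(f"simulated frequency:  {simulated:.4f}")
```

The two numbers should be close but not identical: the average of dice is discrete, and the sketch applies no continuity correction.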
CLT is Valid When…
• Each data point in the sample is independent of
the others.
• The sample size is large enough.
• A sample size of 30 is usually considered large
enough to make $\bar{X}$ approximately normal, but there are more
precise conditions:
– $n > 10\,K_3^2$, where $K_3$ is the sample skewness, and
– $n > 10\,|K_4|$, where $K_4$ is the sample kurtosis
• Adequate sample size depends on the
distribution of the data.
• If the data are quite symmetric and have few outliers,
even smaller samples are fine. Otherwise, we
need larger samples.
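The two sample-size conditions above can be checked numerically. The sketch below is one possible reading of them: it assumes $K_3$ is the moment-based sample skewness and $K_4$ is the excess kurtosis (kurtosis minus 3, so that a normal distribution gives 0). The slides say only "sample kurtosis", so that interpretation is an assumption. The function names and both example data sets are hypothetical.

```python
import random
import statistics

random.seed(4)  # fixed seed so the sketch is reproducible

def skewness_and_excess_kurtosis(data):
    """Moment-based sample skewness (K3) and excess kurtosis (K4, assumed)."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    z = [(x - mean) / sd for x in data]
    k3 = statistics.fmean([v ** 3 for v in z])
    k4 = statistics.fmean([v ** 4 for v in z]) - 3.0  # subtract 3: excess kurtosis
    return k3, k4

def minimum_n(data):
    """Smallest n suggested by the rules n > 10*K3^2 and n > 10*|K4|."""
    k3, k4 = skewness_and_excess_kurtosis(data)
    return max(10 * k3 ** 2, 10 * abs(k4))

# Symmetric data (die rolls) versus strongly right-skewed data (exponential).
symmetric = [random.randint(1, 6) for _ in range(5000)]
skewed = [random.expovariate(1.0) for _ in range(5000)]

for name, data in (("die rolls", symmetric), ("exponential", skewed)):
    k3, k4 = skewness_and_excess_kurtosis(data)
    print(f"{name:12s} K3={k3:6.2f}  K4={k4:6.2f}  need n > {minimum_n(data):.0f}")
```

Under these assumptions the skewed data call for a noticeably larger sample than the symmetric data, in line with the last bullet.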
Summary of Session 2
What is statistical inference?
• Statistical inference is the process of making probabilistic inferences about
population parameters based on sample statistics
How to (and how not to) choose a sample?
• You want a simple random sample (SRS). To do so, you require a sampling frame
that represents the population and a randomization device
What are sample statistics and their properties?
• Sample statistics are random variables because they vary across samples drawn
from the same population. They can be used as point estimates of the population
parameters
What is the central limit theorem and how is it useful?
• The central limit theorem implies that, no matter what the population distribution is,
the sample mean ($\bar{X}$) is approximately normally distributed with mean $\mu$ and standard error
$\sigma/\sqrt{n}$, provided the sample size $n$ is large enough.