Sampling Distribution and Central Limit Theorem: Session 2

This document discusses sampling distributions and the central limit theorem. It defines key concepts like populations, samples, random sampling, and how sample means are distributed. The central limit theorem states that as sample size increases, the distribution of sample means approaches a normal distribution, regardless of the shape of the population. This allows inferring properties of populations from sample statistics using normal distribution probabilities.

Uploaded by

Anyone Someone

Session 2

Sampling Distribution and Central Limit Theorem

• Statistical Inference

• Random Sampling

• Distribution of Sample Means

• Central Limit Theorem



Population:
The complete set of N items (people, objects,
transactions, or events) under investigation.

Example: All 1 million households in Hyderabad.

Sample:
The portion (or subset) of the population selected
for analysis.

Example: 1000 selected households in Hyderabad.

There are many different ways to select a sample from
a population. The simplest way…

Random Sample:
A sample of size n drawn from N items in such a
way that each of the N items has the same chance
of being selected.

Variable:
A numerical characteristic of an item (in the
population or sample).

Example: Annual household income.

Population Parameter:
A numerical measure, computed from a
population, which describes an aspect of the
whole population.

Example: Average annual household income of all 1 million households in Hyderabad.

Sample Statistic:
A numerical measure, computed from a sample,
which describes an aspect of the sample.

Example: Average annual household income of the 1000 sampled households.
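
As a rough illustration of the difference between a population parameter and a sample statistic, the sketch below builds a hypothetical population of household incomes, draws a simple random sample of 1000 with Python's random.sample, and compares the two averages. The lognormal shape, its parameters, and the scaled-down population size of 100,000 are assumptions made only for this example; they are not data about Hyderabad.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population of annual household incomes (illustrative only).
N = 100_000
population = [random.lognormvariate(13.0, 0.5) for _ in range(N)]

# Simple random sample without replacement: every household has the
# same chance of being selected.
n = 1_000
sample = random.sample(population, n)

mu = statistics.fmean(population)   # population parameter
x_bar = statistics.fmean(sample)    # sample statistic

print(f"population mean: {mu:,.0f}")
print(f"sample mean:     {x_bar:,.0f}")
```

In practice the population mean is unknown; it is computed here only because the population was generated artificially, which lets us see how close the sample statistic comes.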



Selecting the Sampling Frame

• A sampling frame is simply a list of items from
which to draw a sample

• Does the sampling frame represent the
population?
– e.g. Literary Digest vs. George Gallup polls

• The available list may differ from the desired list
– e.g. we do not have a list of customers who did not buy
from a store

• Sometimes, no comprehensive sampling frame
exists
– e.g., when forecasting for the future

Typical Pitfalls in Sampling


• Collecting data only from volunteers (voluntary response
sample)
– e.g. online reviews (yelp.com, maps.google.com,
tripadvisor.com)

• Picking easily available respondents (convenience sample)
– e.g. choosing to survey in In-Orbit mall

• A high rate of non-response (more than 70%)
– e.g. CEO / CIO surveys on some industry trends

Notation

We think of a random variable and its probability
distribution as representing a population. In the prior
chapters, we used $\mu$ and $\sigma$ to describe parameters of
normal and binomial random variables.

To distinguish the average in a sample from the
average in a population, we use different notation.
Likewise with the standard deviation. We summarize
this in a table:

                      Population    Sample
Mean                  $\mu$         $\bar{X}$
Standard deviation    $\sigma$      $s$

Estimation

In a large population, we will not know the parameters
$\mu$ and $\sigma$. We will need to take a sample from the
population and then compute $\bar{X}$ and $s$.

$\bar{X}$ is an estimator of the unknown $\mu$

$s$ is an estimator of the unknown $\sigma$

For larger sample sizes, $\bar{X}$ and $s$ will tend to give
better estimates of the parameters.

Simulation:
We can demonstrate this with simulation. We specify
known values $\mu = 100$ and $\sigma = 15$ (as in IQ). We take a
sample of people, one by one, and update the $\bar{X}$ and $s$
values after each new data point. So, if the first three people
sampled have IQs of 102, 110, 97, then $\bar{X}$ will be 102, 106,
103. Similarly for $s$. As we sample more people, $\bar{X}$
converges to 100 and $s$ converges to 15.
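
The running update described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the slides: the function name running_mean_sd is hypothetical, the seed is arbitrary, and the running standard deviation uses Welford's one-pass update, which is one convenient choice among several.

```python
import math
import random

def running_mean_sd(values):
    """Yield (n, mean, sd) after each new observation (Welford's update)."""
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for n, x in enumerate(values, start=1):
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        # The sample standard deviation needs at least two observations.
        sd = math.sqrt(m2 / (n - 1)) if n > 1 else float("nan")
        yield n, mean, sd

# The three IQ scores from the text: the running means are 102, 106, 103.
first_three = [round(m, 6) for _, m, _ in running_mean_sd([102, 110, 97])]
print(first_three)

# One simulated run of 400 people with mu = 100 and sigma = 15.
random.seed(1)
iqs = [random.gauss(100, 15) for _ in range(400)]
history = list(running_mean_sd(iqs))
for n, mean, sd in history[49::50]:   # report every 50th observation
    print(f"n={n:3d}  mean={mean:6.2f}  s={sd:5.2f}")
```

Changing the seed gives a different path, but each run should settle near 100 and 15 as n grows.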

Here are 2 possible simulations, each using 400 people. (Be
sure you use the left axis for $\bar{X}$ and the right axis for $s$.)

[Figure: Cumulative Mean and Std Deviation, two simulation runs. Horizontal axis: Observation Number (0 to 400). Left axis: $\bar{X}$ (60 to 120). Right axis: $s$ (0 to 30).]

Tossing a Single Die

If we toss dice (or flip a coin) we consider the result a
sample from a population. Consider a single die,
tossed once. We know the probability distribution is:

[Figure: Throw of one die. Bar chart of Probability (vertical axis, 0.00 to 0.18) against Result (1 to 6); for a fair die each result has probability 1/6.]

Sum of Two Dice

The sum of two dice is a random variable with 11
possible values between 2 and 12. Not all of the 11
results have the same probability:

[Figure: Sum of 2 Dice. Bar chart of Probability (vertical axis, 0.00 to 0.18) against Sum (2 to 12).]

$\bar{X}$ is a random variable:
What is the probability distribution of the Average of
the 2 dice rolls instead of the Sum?
$\bar{X} = (X_1 + X_2)/2 = \text{Sum}/2$
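
One way to answer this question exactly is to enumerate the 36 equally likely outcomes of two fair dice. The sketch below assumes fair, independent dice; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Count how many of the 36 equally likely (die 1, die 2) outcomes give each sum.
counts = Counter(a + b for a, b in product(faces, repeat=2))
sum_dist = {s: Fraction(c, 36) for s, c in sorted(counts.items())}

# The average is the sum divided by 2, so it takes the values
# 1, 1.5, ..., 6 with the same probabilities as the sums 2, 3, ..., 12.
avg_dist = {Fraction(s, 2): p for s, p in sum_dist.items()}

for s, p in sum_dist.items():
    print(f"sum={s:2d}  average={s / 2:3.1f}  P={p}  ({float(p):.4f})")
```

The average has the same triangular shape as the sum; only the horizontal scale changes.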

Example:
In general, we can throw any number of dice:
$\bar{X} = \dfrac{X_1 + X_2 + \dots + X_n}{n}$

Let’s do a simulation of the sum or average of
simultaneous dice rolls. This will enable us to see the
probability distribution of $\bar{X}$.
Go to:
http://onlinestatbook.com/stat_sim/sampling_dist/index.html

The Central Limit Theorem in Pictures



It turns out we have simple formulas to determine the
mean and standard deviation of $\bar{X}$ in terms of $n$, $\mu$, $\sigma$
that we computed for a single die:

The mean of $\bar{X}$ is the mean of each individual roll:

$\mu_{\bar{X}} = \mu$

The standard deviation of $\bar{X}$ is smaller than the
standard deviation of each individual roll:

$\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$
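
These two formulas can be checked numerically. The sketch below assumes fair dice, for which $\mu = 3.5$ and $\sigma = \sqrt{35/12} \approx 1.708$, and compares the theoretical values with a simulation. The helper simulate_means, the seed, and the number of repetitions are arbitrary choices made for illustration.

```python
import math
import random
import statistics

faces = [1, 2, 3, 4, 5, 6]
mu = statistics.fmean(faces)        # 3.5 for a fair die
sigma = statistics.pstdev(faces)    # sqrt(35/12), about 1.708

def simulate_means(n, reps, rng):
    """Return `reps` simulated averages of n fair dice."""
    return [statistics.fmean(rng.choices(faces, k=n)) for _ in range(reps)]

rng = random.Random(42)
results = {}
for n in (1, 2, 5, 30):
    means = simulate_means(n, reps=20_000, rng=rng)
    results[n] = (statistics.fmean(means), statistics.pstdev(means))
    print(f"n={n:2d}  simulated mean={results[n][0]:.3f} (theory {mu:.3f})  "
          f"simulated sd={results[n][1]:.3f} (theory {sigma / math.sqrt(n):.3f})")
```

The simulated mean should stay near 3.5 for every n, while the simulated standard deviation should shrink roughly in proportion to $1/\sqrt{n}$.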

In addition to the mean and standard deviation, we can also
say something about the probability distribution of $\bar{X}$ for
large values of n …

Statement of the Central Limit Theorem (C.L.T.):

$\bar{X}$ behaves more and more like a normally distributed
random variable as n increases.

******************************************************************
This is very important because it says we can use the
z-table for problems about $\bar{X}$ even when the underlying
distribution is not normal, provided n is large enough.

Note that the mean of $\bar{X}$ stays the same (the dotted
line) but the density function gets narrower as n
increases. This is also obvious from the formulas:

$\mu_{\bar{X}} = \mu$, $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$

Note from the picture that for n = 30 the distribution of
$\bar{X}$ looks approximately normal (last row). Whether n = 30 is
enough depends on what kind of distribution we start with. If it is very
skewed (asymmetric), then we might need a larger n.
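
To show how the z-table enters, here is a small sketch for a hypothetical question that is not in the slides: what is the probability that the average of 30 fair dice exceeds 4? It uses the standard library's statistics.NormalDist in place of a printed z-table and checks the answer against a simulation of the actual dice. The threshold, seed, and repetition count are assumptions chosen for illustration.

```python
import math
import random
import statistics
from statistics import NormalDist

faces = [1, 2, 3, 4, 5, 6]
mu = statistics.fmean(faces)
sigma = statistics.pstdev(faces)

n = 30            # dice per sample
threshold = 4.0   # hypothetical question: P(average of 30 dice > 4)?

# CLT approximation: X-bar is roughly Normal(mu, sigma / sqrt(n)).
x_bar_dist = NormalDist(mu, sigma / math.sqrt(n))
z = (threshold - mu) / x_bar_dist.stdev
p_normal = 1 - x_bar_dist.cdf(threshold)

# Monte Carlo check using the actual (discrete, non-normal) dice.
rng = random.Random(7)
reps = 20_000
hits = sum(statistics.fmean(rng.choices(faces, k=n)) > threshold
           for _ in range(reps))
p_sim = hits / reps

print(f"z = {z:.2f}")
print(f"normal approximation: {p_normal:.4f}")
print(f"simulation:           {p_sim:.4f}")
```

The two probabilities will not match exactly, partly because a dice average is discrete while the normal curve is continuous, but they should be close for n = 30.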

CLT is Valid When…

• Each data point in the sample is independent of
the others.

• The sample size is large enough.

• A sample size of 30 is usually considered large
enough to make $\bar{X}$ approximately normal, but there are more
precise conditions:
– $n > 10 K_3^2$, where $K_3$ is the sample skewness, and
– $n > 10 |K_4|$, where $K_4$ is the sample kurtosis

• Adequate sample size depends on the
distribution of the data.

• If the data are quite symmetric and have few outliers,
even smaller samples are fine. Otherwise, we
need larger samples.
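
The two sample-size conditions above can be checked mechanically. The slides do not define $K_3$ and $K_4$, so the sketch below assumes a common convention: $K_3$ is the average cubed z-score and $K_4$ is the average fourth power of the z-scores minus 3 (excess kurtosis, which is zero for a normal distribution), with z-scores computed from the sample standard deviation. The function names and the two toy data sets are hypothetical.

```python
import statistics

def skewness_kurtosis(data):
    """Return (K3, K4) under the assumed z-score definitions."""
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)          # sample standard deviation s
    z = [(x - mean) / sd for x in data]
    k3 = statistics.fmean([v ** 3 for v in z])
    k4 = statistics.fmean([v ** 4 for v in z]) - 3
    return k3, k4

def clt_sample_size_ok(data):
    """Check n > 10 * K3^2 and n > 10 * |K4| for the given sample."""
    n = len(data)
    k3, k4 = skewness_kurtosis(data)
    return n > 10 * k3 ** 2 and n > 10 * abs(k4)

symmetric = [1, 2, 3, 4, 5, 6] * 5      # 30 values, no skew, no outliers
skewed = [1] * 29 + [100]               # 30 values with one extreme outlier

for name, data in [("symmetric", symmetric), ("skewed", skewed)]:
    k3, k4 = skewness_kurtosis(data)
    print(f"{name:9s}  n={len(data)}  K3={k3:6.2f}  K4={k4:6.2f}  "
          f"large enough: {clt_sample_size_ok(data)}")
```

With these definitions the symmetric sample of 30 passes both checks, while the sample dominated by a single outlier does not, which matches the advice that skewed data or data with outliers call for larger samples.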

Summary of Session 2

What is statistical inference?
• Statistical inference is the process of making probabilistic inferences about
population parameters based on sample statistics

How to (and how not to) choose a sample?
• You want a simple random sample (SRS). To do so, you require a sampling frame
that represents the population and a randomization device

What are sample statistics and their properties?
• Sample statistics are random variables because they vary across samples drawn
from the same population. They can be used as point estimates of the population
parameters

What is the central limit theorem and how is it useful?
• The central limit theorem implies that, no matter what the population distribution is,
the sample mean ($\bar{X}$) is approximately normally distributed with mean $\mu$ and
standard error $\sigma/\sqrt{n}$ when n is large enough.
