Chapter 7 Systematic Sampling
Simple random sampling and stratified sampling require detailed work. Systematic sampling has
some advantages, namely:
1) Systematic sampling is easy to do. Eg street corner, supermarket, production line.
Identify possible problems if a systematic sample is obtained on a street corner or in a
supermarket.
Only certain types of people will be on any particular street. Eg in a business district
downtown we’ll get mostly business people.
In a supermarket, we’ll get people who live in the area rather than a cross-section of people
of all income levels.
Production line: If we sample from only one production line, we’ll miss problems in other
production lines.
2) Systematic sampling usually provides more information per unit cost than a simple
random sample partly because a systematic sample is spread evenly over the population.
A systematic random sample is obtained by randomly selecting one element/individual from the
first k in the frame and then every k-th element/individual thereafter. This is called a 1-in-k
systematic sample.
Note: k = (population size)/(sample size) = step size.
Example
Suppose N= 100,000 and we need a systematic sample of 150.
Divide the 100,000 into 150 chunks of size k = 100,000/150 ≈ 667 (666.67 rounded to the nearest whole number).
Choose one number randomly from the first chunk of k = 667. Use graphing calculator.
Math>prb>RandInt(1, 667, 1) My calculator gave me 353, so that will be the starting point.
Then add 667 successively until we reach the 150th chunk:
353+667= 1020
1020+667 =1687
1687+667 = 2354
Etc
The individuals in the systematic sample are those numbered 353, 1020, 1687, 2354, ….
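The selection above can be sketched in a few lines of Python. This is only an illustration: the function name `systematic_sample` is made up for these notes, and the start 353 is fixed to match the example (a fresh start could be drawn with `random.randint(1, 667)`, which plays the role of RandInt on the calculator).

```python
def systematic_sample(population_size, step, start):
    """Return the unit numbers (1-based) of a 1-in-`step` systematic sample.

    `start` must be a number from the first block, 1..step.
    """
    if not 1 <= start <= step:
        raise ValueError("start must lie in the first block, 1..step")
    # Take `start`, then every `step`-th unit until the frame runs out.
    return list(range(start, population_size + 1, step))


# Values from the example: N = 100,000, k = 667, random start 353.
sample = systematic_sample(100_000, 667, 353)
print(sample[:4])   # the first four selected unit numbers
print(len(sample))  # how many units were selected
```

Because k was rounded up to 667, a start close to 667 would run past unit 100,000 before the 150th selection, so the sample size can be 149 for some starts.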
To do now: Suppose that we have a population of 100 individuals and we want a sample of 5.
What is the step size (k)? k = 100/5 = 20
Choose a random number between 1 and k = 20 (We need a volunteer 😊 for this)
Math>prb>RandInt(1, 20, 1)=11
Which individuals are in the sample? Ans: 11, 31, 51, 71, 91
Note: If N is not known, guess or use a convenient step size. Eg Opinion of customer towards
new product: probably every 20th customer for a supermarket, but maybe alternate customers for
a small store.
There are three types of populations (pics p 222, 223):
1) Random. Random populations have no order or patterns in the variable of interest. eg
variable of interest = time spent waiting for service at cashier in supermarket between 6
and 7pm. The list of customers has no particular order. There are no particular patterns in
the times spent waiting.
2) Ordered. There are patterns in the order of individuals that are connected with the
variable of interest.
eg Variable = daily price of share X, list is arranged in the order of the calendar.
eg variable = shoe size, list is alphabetic list of students
Alphabetic lists are ordered because certain languages prefer certain letters, and
languages are connected with ethnicity, as is shoe size.
3) Periodic. There are repeating patterns in the list. eg variable = daily turnover at pizza
store. The population consists of successive days. The pattern may be: high turnover
every Friday and Saturday, low turnover every Monday, etc. eg monthly sales at a car
dealership. (every January the sales are low, etc.)
Extreme example:
Suppose we want to estimate the average income per day at a pizza restaurant. Suppose that this
pizza shop sells a lot of pizzas over week-ends and not many on Mondays.
Define the parameter and describe the estimate of the parameter, called $\bar{y}_{sy}$.
The parameter is $\mu$ = the average income per day at a pizza restaurant for all days on which the
restaurant exists.
$\bar{y}_{sy}$ = the average income per day at a pizza restaurant for all days in the sample.
What value for k would be the most misleading? Ans: k=7
Why? Because we’d always record sales on the same day eg all Mondays, or all Saturdays
Notice that we have too little variability in the sample because the readings in the sample are
highly correlated.
Suggest a better value for k. k=3 or 10. Not any multiple of 7!!
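A small sketch of why k = 7 is so misleading. The weekly income figures below are invented purely for illustration (they are not from the text), and the pattern is assumed to repeat exactly every week.

```python
# Hypothetical daily income, Monday first: low on Mondays, high at week-ends.
WEEKLY_INCOME = [200, 400, 400, 500, 900, 1000, 800]  # made-up numbers


def income_on_day(day):
    """Income on day number `day` (1-based); the weekly pattern repeats exactly."""
    return WEEKLY_INCOME[(day - 1) % 7]


def systematic_mean(n_days, k, start):
    """Average income over the 1-in-k systematic sample of days beginning at `start`."""
    values = [income_on_day(d) for d in range(start, n_days + 1, k)]
    return sum(values) / len(values)


N_DAYS = 364  # 52 whole weeks
true_mean = sum(income_on_day(d) for d in range(1, N_DAYS + 1)) / N_DAYS

# Every possible 1-in-7 sample and every possible 1-in-3 sample.
means_k7 = [systematic_mean(N_DAYS, 7, s) for s in range(1, 8)]
means_k3 = [systematic_mean(N_DAYS, 3, s) for s in range(1, 4)]

print("true mean:", true_mean)
print("k = 7 sample means:", means_k7)
print("k = 3 sample means:", means_k3)
```

With k = 7 each sample contains a single weekday, so the sample average just reproduces that weekday's income. With k = 3 the sample cycles through all seven weekdays and every possible sample average lands close to the true mean.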
Estimating μ from a systematic sample
$\hat{\mu} = \bar{y}_{sy} = \frac{\sum y_i}{n}$. This is the same as for SRS.
If the population is random, use $\hat{V}(\bar{y}_{sy}) = \left(1 - \frac{n}{N}\right)\frac{s^2}{n}$. This is the same as for SRS.
If the population is not random, this estimator of $V(\bar{y}_{sy})$ is biased. We need repeated systematic
samples to obtain an unbiased estimator of $V(\bar{y}_{sy})$, or we can use the method of successive
differences to get an estimate of $V(\bar{y}_{sy})$.
Note: For SRS, $\bar{y}$ = average of SRS. $V(\bar{y}) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)$. This is a theoretical quantity. Not to be
used for estimation. For estimation, use $\hat{V}(\bar{y}) = \left(1 - \frac{n}{N}\right)\frac{s^2}{n}$.
For systematic sample, $\bar{y}_{sy}$ = average of systematic sample. $V(\bar{y}_{sy}) = \frac{\sigma^2}{n}\left(1 + (n-1)\rho\right)$,
where ρ is a measure of the correlation between pairs of elements in the systematic sample, with
$-\frac{1}{n-1} < \rho < 1$. This is a theoretical quantity. Compare it with $V(\bar{y}) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)$.
In summary:
If pairs of elements are similar, then ρ will be close to 1, so
$V(\bar{y}_{sy}) = \frac{\sigma^2}{n}\left(1 + (n-1)\rho\right) \approx \sigma^2 = V(y)$. Compare this with $V(\bar{y}) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)$. We can see that
$V(\bar{y}) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)$ is a lot smaller than $V(\bar{y}_{sy}) = \frac{\sigma^2}{n}\left(1 + (n-1)\rho\right) \approx \sigma^2$, so
the error bound for systematic sampling will be larger than the error bound for SRS. What we are
saying is that if pairs of elements in the sample are correlated then systematic sampling is not
good. It’s better to use SRS.
Eg pizza restaurant with k = 7. Successive outcomes on any particular day, say Mondays, are
positively correlated, so ρ will be close to 1. $V(\bar{y}_{sy}) \approx \sigma^2 = V(y)$. $\bar{y}_{sy}$ is a really bad estimator
because its variation is so high!
In general, if ρ > 0 then $V(\bar{y}_{sy}) > V(\bar{y})$ because
$V(\bar{y}_{sy}) = \frac{\sigma^2}{n}\left(1 + (n-1)\rho\right) = \frac{\sigma^2}{n} + \text{a positive amount} \approx V(\bar{y}) + \text{a positive amount}$
If pairs of elements are unrelated then ρ = 0 and $V(\bar{y}_{sy}) = \frac{\sigma^2}{n}\left(1 + (n-1)\rho\right) = \frac{\sigma^2}{n} \approx V(\bar{y})$ for
SRS. So if the population is random then the sample average $\bar{y}_{sy}$ from a systematic sample is
as variable as the sample average from a SRS.
If pairs of elements are negatively correlated, then ρ < 0. Then $V(\bar{y}_{sy}) < V(\bar{y})$. That is,
systematic sampling is better than SRS if successive elements in the sample are
negatively related. Eg if successive elements are low, high, low, high… (eg summer,
winter, summer, winter…) then ρ < 0.
Example: Suppose N large, n = 50, ρ = −0.01, ie there is a small negative correlation between
successive elements in the population. (There’s a slight tendency for large elements to be followed by
small elements.) Compare $V(\bar{y}_{sy})$ and $V(\bar{y})$.
Solution: $V(\bar{y}_{sy}) = \frac{\sigma^2}{n}\left(1 + (n-1)\rho\right) = \frac{\sigma^2}{n}\left(1 + (50-1)(-0.01)\right) = 0.51\,\frac{\sigma^2}{n}$
Compare this with $V(\bar{y}) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right) \approx \frac{\sigma^2}{n}$
We see that $V(\bar{y}_{sy})$ is approx half of $V(\bar{y})$ so the average from the systematic sample varies only
half as much as the average from the SRS!
Notice that $V(\bar{y}_{sy}) \ll V(\bar{y})$ even though the correlation is very small.
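To see how strongly ρ drives the comparison, the factor 1 + (n − 1)ρ can be tabulated for a few values of ρ. This is a rough sketch that assumes N is large, so that V(ȳ) for SRS is approximately σ²/n. The helper name `variance_ratio` is made up for these notes.

```python
def variance_ratio(n, rho):
    """Approximate V(ybar_sy) / V(ybar) for large N: the factor 1 + (n - 1) * rho."""
    return 1 + (n - 1) * rho


n = 50
for rho in (-0.01, 0.0, 0.01, 0.5, 1.0):
    # A ratio below 1 means the systematic sample mean is less variable than the SRS mean.
    print(f"rho = {rho:5.2f}  ratio = {variance_ratio(n, rho):6.2f}")

ratio_example = variance_ratio(50, -0.01)  # the worked example: n = 50, rho = -0.01
```

The same tiny correlation with a positive sign (ρ = 0.01) inflates the variance by about half, which is why the choice of k matters so much for periodic populations.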
How to get negative correlation? -eg sales of winter coats. Take readings every October and
every April because we expect lots in October and hardly any in April. The sample has to have
maximum spread.
If you don’t know the pattern of seasonality/order then choose a random starting point within
each 10th block, say. That is, start over every 10 steps.
Example: n = 150, k= 100 1-in-100 systematic sample
Suppose the random starting point is 57. The first 9 readings are numbered 57, 157, 257, 357, …
857. The 10th block is 901….1000. Choose randomly from 901…1000. Suppose it’s 978. 10th
reading is 978, then 1078, 1178, …etc
The systematic sample is 57, 157, 257,….857, 978, 1078, 1178, etc
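One possible way to code the restart idea is sketched below. The function name, the `restart_every` parameter and the `draw` hook are all inventions for this illustration. The rule assumed here is that the position inside the block is redrawn at the 1st reading and again at every 10th reading.

```python
import random


def restarted_systematic_sample(n, k, restart_every=10, draw=None):
    """Return n unit numbers, one from each successive block of k units.

    The position inside the block (1..k) is redrawn for the first reading
    and again for every `restart_every`-th reading.
    """
    if draw is None:
        draw = lambda: random.randint(1, k)  # random position within a block
    sample = []
    offset = None
    for reading in range(1, n + 1):
        if reading == 1 or reading % restart_every == 0:
            offset = draw()
        # Reading number `reading` comes from block (reading - 1)*k + 1 .. reading*k.
        sample.append((reading - 1) * k + offset)
    return sample


# Reproduce the example by supplying the two draws by hand:
# 57 in the first block, and 978 - 900 = 78 within the 10th block.
draws = iter([57, 78])
first_12 = restarted_systematic_sample(12, 100, draw=lambda: next(draws))
print(first_12)
```

Calling `restarted_systematic_sample(150, 100)` without the `draw` argument would generate the full sample of 150 with genuinely random restarts.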
Note: Always plot the sample in sequence to look for patterns.
Estimating τ or p from a systematic sample: If the population is randomly ordered, use the
formulae for SRS. If the population is not randomly ordered, we need repeated systematic samples to
obtain unbiased estimates of $V(\hat{\tau}_{sy})$ and $V(\hat{p}_{sy})$.
Selecting the sample size for given margin of error: Use the formulae for SRS. If the
population is ordered then the sample may be a bit big (the margin of error will be smaller than
anticipated.)
If the population is periodic and k is chosen so that ρ>0 (ie successive readings are positively
correlated) then the sample will be a bit small.
Repeated systematic sampling is one way to obtain unbiased estimates of $V(\bar{y}_{sy})$, $V(\hat{\tau}_{sy})$ or
$V(\hat{p}_{sy})$.
“In most cases, systematic sampling is NOT equivalent to SRS” because most lists are NOT
randomly ordered with respect to the variable of interest.
Eg alphabetic lists – have ethnic bias. Eg Suppose that the variable of interest is electricity
consumption. People whose last names begin with O’ or Mc (ie Irish) may tend to have large
families so they have higher electricity consumption. People whose last names begin with M or S
(English) may tend to have small families.
List by account number – may be connected with age. Customers who have been with the
company for a long time tend to have lower account numbers and tend to be older.
List by phone number – depends on income. Phone numbers depend on area code. Income and
area code are connected.
List by last 4 digits of phone number – this is OK
Repeated systematic sampling entails choosing a number of systematic samples. The
population is traversed several times.
Eg N = 40, n = 8, so k = 40/8 = 5, so we will have chunks of size 5 and jumps of size 5.
This will yield two values of $\bar{y}$, so we can get some idea of the variation of $\bar{y}$. Usually more than
two samples are used!
Eg Instead of a 1-in-5 sample (every 5th) containing 60 individuals, obtain ten 1-in-50 (every
50th) samples, each containing 6 individuals. Since we then have 10 values of $\bar{y}_{sy}$, we obtain
$\hat{V}(\bar{y}_{sy})$ by calculating the variance of $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_{n_s}$.
Note: $n_s$ (= number of samples) is often taken to be 10.
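n = total number of readings (all $n_s$ samples combined)
$\hat{\mu} = \frac{1}{n_s}\sum_{i=1}^{n_s} \bar{y}_i$ = average of the $n_s$ averages $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_{n_s}$
$\hat{V}(\hat{\mu}) = \left(1 - \frac{n}{N}\right)\left(\frac{1}{n_s}\right)\left[\frac{1}{n_s - 1}\sum_{i=1}^{n_s}(\bar{y}_i - \hat{\mu})^2\right]$
Notice that $\frac{1}{n_s - 1}\sum_{i=1}^{n_s}(\bar{y}_i - \hat{\mu})^2$ is the sample variance of the $n_s$ averages $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_{n_s}$.
To estimate a total:
$\hat{\tau} = N\hat{\mu}$ and $\hat{V}(\hat{\tau}) = N^2\,\hat{V}(\hat{\mu})$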
Example: Obtain 10 repeated systematic samples of 5 individuals from a population of 2,000
individuals.
Solution: We have n s=10=number of samples
Step size = k = 2000/5 = 400
Choose 10 random starting points between 1 and 400. (How?) Graphing calculator
Math>PRB>RandInt(1, 400, 10) or Minitab patterned data to generate the numbers 1, 2, 3…400.
Then Calc> sample from columns>
Successively add 400 to each of these 10 starting numbers to get the numbers of the individuals
in the 10 samples.
Eg suppose my 10 starting numbers are 87, 313, 229, 71, 173, 58, 21, 311, 78, 301
First sample is individuals numbered 87, 487, 887, 1287, 1687
Second sample is individuals numbered 313, 713, 1113, 1513, 1913
Third sample is individuals numbered 229, 629, 1029, 1429, 1829
Etc
Tenth sample is individuals numbered 301, 701, 1101, 1501, 1901
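The ten samples can also be listed with a short Python sketch. The helper name `repeated_systematic_samples` is made up for these notes. The starting numbers are the ones used in the example; drawing them afresh with `random.sample(range(1, 401), 10)` would give ten distinct starts, which is an assumption about what is wanted (RandInt on the calculator can repeat a value).

```python
def repeated_systematic_samples(population_size, step, starts):
    """Return one 1-in-`step` systematic sample (a list of unit numbers) per start."""
    return [list(range(s, population_size + 1, step)) for s in starts]


# Starting numbers from the example; N = 2000 and k = 400.
starts = [87, 313, 229, 71, 173, 58, 21, 311, 78, 301]
samples = repeated_systematic_samples(2000, 400, starts)

for number, units in enumerate(samples, start=1):
    print(f"Sample {number}: {units}")
```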
Example: From Exercise 7.6 on p244. The quality control section of an industrial firm uses
systematic sampling to estimate the average amount of fill in 12-ounce cans coming off an
assembly line. Suppose that the data comprise 6 systematic samples, one sample per row,
across the page. (This is different from what the exercise mentions.)
Estimate the mean fill in all 12-ounce cans produced on this day, place a bound on the error of
estimation and interpret the resulting interval. Assume that the daily production is 1800 cans.
Solution: Response y = amount of liquid in a randomly selected 12-ounce can.
Let’s plot the data to check for patterns:
[Figure: time series plot of Sample 1 against Index (1 to 6)]
The graph above shows that Sample 1 doesn’t have any patterns
[Figure: time series plot of Sample 2 against Index (1 to 6)]
From the graph above: Sample 2 doesn’t appear to have any patterns
[Figure: time series plot of Sample 3 against Index (1 to 6)]
From the graph above: Sample 3 doesn’t appear to have any patterns.
[Figure: time series plot of Sample 4 against Index (1 to 6)]
From the graph above: Sample 4 doesn’t appear to have any patterns.
[Figure: time series plot of Sample 5 against Index (1 to 6)]
From the graph above: Sample 5 doesn’t appear to have any patterns.
[Figure: time series plot of Sample 6 against Index (1 to 6)]
From the graph above: Sample 6 doesn’t appear to have any patterns.
$\bar{y}_1 = 11.97$, $\bar{y}_2 = 11.955$, $\bar{y}_3 = 11.918$, $\bar{y}_4 = 11.932$, $\bar{y}_5 = 11.93$, $\bar{y}_6 = 11.968$
$N = 1800$, $n = 36$, $n_s = 6$
Type the averages into Minitab: (see the column of sample means)
Now request sample mean and variance of the sample averages:
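$\hat{\mu} = \frac{1}{n_s}\sum_{i=1}^{n_s} \bar{y}_i = \frac{1}{6}\left(\bar{y}_1 + \bar{y}_2 + \bar{y}_3 + \bar{y}_4 + \bar{y}_5 + \bar{y}_6\right)$
$= \frac{1}{6}\left(11.97 + 11.955 + 11.9183 + 11.9316 + 11.93 + 11.9683\right)$
$= \frac{1}{6}(71.673) = 11.946$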
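$\hat{V}(\hat{\mu}) = \left(1 - \frac{n}{N}\right)\left(\frac{1}{n_s}\right)\left[\frac{1}{n_s - 1}\sum (\bar{y}_i - \hat{\mu})^2\right] = \left(1 - \frac{36}{1800}\right)\left(\frac{1}{6}\right)\left[\text{sample variance of } \bar{y}_1, \ldots, \bar{y}_6\right]$
$= \left(1 - \frac{36}{1800}\right)\left(\frac{1}{6}\right)[0.00047673] = 0.0000778659$
$\therefore B = 2\sqrt{0.0000778659} = 0.0176$
The CI is (11.9284, 11.9636)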
Interpret: I estimate that the average amount of liquid in all 12-ounce cans produced on this day
is somewhere between 11.9284 ounces and 11.9636 ounces.
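As a cross-check on the Minitab arithmetic, the same calculation can be sketched in Python with the standard library. The function name `repeated_systematic_estimate` is made up for these notes. The six sample averages are typed in to four decimals, as they appear in the sum above.

```python
from math import sqrt
from statistics import mean, variance


def repeated_systematic_estimate(sample_means, n, N):
    """Estimate mu from the averages of n_s repeated systematic samples.

    n is the total number of readings and N the population size.
    Returns (mu_hat, var_hat, bound), where bound = 2 * sqrt(var_hat).
    """
    n_s = len(sample_means)
    mu_hat = mean(sample_means)
    # statistics.variance uses divisor n_s - 1, i.e. the sample variance.
    var_hat = (1 - n / N) * variance(sample_means) / n_s
    return mu_hat, var_hat, 2 * sqrt(var_hat)


ybars = [11.97, 11.955, 11.9183, 11.9316, 11.93, 11.9683]
mu_hat, var_hat, bound = repeated_systematic_estimate(ybars, n=36, N=1800)

print(f"mu_hat = {mu_hat:.3f}")
print(f"V_hat  = {var_hat:.10f}")
print(f"B      = {bound:.4f}")
print(f"CI     = ({mu_hat - bound:.4f}, {mu_hat + bound:.4f})")
```

The interval printed here can differ from the hand calculation in the last decimal place because the hand calculation rounds the mean to 11.946 before adding and subtracting B.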
Note: These samples would be 1-in-300 samples: k = 1800/6 = 300. (The denominator is
6 because each sample has 6 readings.)
Finding $\hat{V}(\bar{y}_{sy})$ using the method of successive differences.
This is especially useful when there are linear trends in the population. Use this method
whenever the list is suspected to be ordered. Eg alphabetic
Let $Y_1, Y_2, \ldots, Y_n$ be a random sample. Also, $E(Y_i) = \mu$ and $V(Y_i) = \sigma^2$. Suppose $\mu \neq 0$.
Consider the $n-1$ successive differences:
$D_i = Y_{i+1} - Y_i$, $i = 1, 2, \ldots, n-1$
Then $E(D_i) = E(Y_{i+1} - Y_i) = E(Y_{i+1}) - E(Y_i) = \mu - \mu = 0$
Also: $V(D_i) = V(Y_{i+1} - Y_i) = V(Y_{i+1}) + V(Y_i) = 2\sigma^2$. ($Y_{i+1}$ and $Y_i$ are independent because
the sample is random.)
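Estimate $V(D_i) = 2\sigma^2$ by the (almost) sample variance of $D_1, D_2, \ldots, D_{n-1}$:
$2\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n-1}\left(D_i - E(D_i)\right)^2 = \frac{1}{n-1}\sum_{i=1}^{n-1} D_i^2$ because $E(D_i) = 0$. (The real sample variance of
$D_1, D_2, \ldots, D_{n-1}$ would have divisor $n-2$.)
So the estimator of $\sigma^2$ is $\hat{\sigma}^2 = \frac{1}{2(n-1)}\sum_{i=1}^{n-1} D_i^2$.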
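The estimator of $V(\bar{y}_{sy}) = \left(1 - \frac{n}{N}\right)\frac{\sigma^2}{n}$ is
$\hat{V}_d(\bar{y}_{sy}) = \left(1 - \frac{n}{N}\right)\frac{1}{2n(n-1)}\sum_{i=1}^{n-1} D_i^2$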
Use this estimator for V ( y sy ) whenever you suspect that the population is not random. Use the
formulae for SRS only if you know that the population is random.
Example 7.21 p248 The following table shows the number of divorces (in thousands) in the
United States for a systematic sample of years between 1950 and 1990. Estimate the total
number of divorces for this period and find an appropriate variance approximation for your
estimate.
Year                   1950  1955  1960  1965  1970  1975  1980  1985  1990
Divorces (thousands)    385   377   393   479   708  1036  1189  1190  1175
SOURCE: U.S. Bureau of the Census, Statistical Abstract of the United States, 1993–94, 113th ed., Washington,
D.C., 1994.
(a) First, make a time series plot of the data. Is the population random?
(b) Would the mean number of divorces over this period be a good predictor of the number
of divorces for 1995? Explain.
(c) Estimate the total number of divorces 1950-1990
Note: This is a 1-in-5 systematic sample
Solution: variable of interest = y = Number of divorces per year
N = number of years in popln = 41 , n = 9
(a) To make a time series plot: Graph>Time series plot. Enter the divorces as the variable of
interest. Click on the Time/scale button. Select stamp. Enter Year.
Here is the plot:
The population is definitely not random (scattered all over) and it is not periodic (repeating
pattern). It is ordered because it systematically changes.
(b) No, the average number of divorces per year (770 thousand) will be somewhere in the
middle so it will not be a good estimate of the number of divorces in 1995. According to the
graph, we expect that the number of divorces in 1995 was approx. the same as the number in 1990.
So we expect that there were about 1175 thousand divorces in 1995.
(c) N = # of years in the population = 41
$\hat{\tau} = N\hat{\mu} = 41\left(\frac{6932}{9}\right) = 31{,}579$ thousand
Can we use the formula for SRS to estimate $V(\hat{\tau})$? No, because the population is not random.
We’ll use the method of successive differences.
Do in Minitab: $D_i = Y_{i+1} - Y_i$, $i = 1, 2, \ldots, 8$
The differences $D_i$ are: -8 16 86 229 328 153 1 -15
$\sum D_i^2 = 191376$ from Minitab (see below)
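$\hat{V}_d(\hat{\tau}) = \hat{V}_d(N\bar{y}_{sy}) = N^2\,\hat{V}_d(\bar{y}_{sy}) = N^2\left(1 - \frac{n}{N}\right)\frac{1}{2n(n-1)}\sum_{i=1}^{n-1} D_i^2$
$= 41^2\left(1 - \frac{9}{41}\right)\left(\frac{1}{2 \cdot 9 \cdot (9-1)}\right)(191376)$
$= 1743648$
$\therefore B = 2\sqrt{1743648} = 2640.95$
The CI is (28,938 ; 34,220)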
Interpret: We estimate that the total number of divorces between 1950 and 1990 inclusive was
somewhere between 28,938 and 34,220 thousand. Ie between 28.938 million and 34.220 million.
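The successive-difference calculation can be reproduced with a short Python sketch instead of Minitab. The function name `successive_difference_variance` is made up for these notes. It computes $(1 - n/N)\sum D_i^2 / (2n(n-1))$ for the sample mean, and the total is then scaled by $N^2$.

```python
from math import sqrt


def successive_difference_variance(y, N):
    """Successive-difference estimate of V(ybar_sy) for readings y taken in order."""
    n = len(y)
    # D_i = y_{i+1} - y_i for i = 1, ..., n - 1
    differences = [later - earlier for earlier, later in zip(y, y[1:])]
    sum_d2 = sum(d * d for d in differences)
    return (1 - n / N) * sum_d2 / (2 * n * (n - 1))


# Divorces in thousands for 1950, 1955, ..., 1990 (from the table above).
divorces = [385, 377, 393, 479, 708, 1036, 1189, 1190, 1175]
N = 41  # years 1950 to 1990 inclusive

tau_hat = N * sum(divorces) / len(divorces)
var_tau = N**2 * successive_difference_variance(divorces, N)
bound = 2 * sqrt(var_tau)

print(f"tau_hat = {tau_hat:.0f} thousand")
print(f"V_d_hat(tau_hat) = {var_tau:.0f}")
print(f"B = {bound:.2f}")
print(f"CI = ({tau_hat - bound:.0f}, {tau_hat + bound:.0f}) thousand")
```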
HW Exercises 7.1, 7.4 (answer: $.66 \pm .0637$), 7.5, 7.7, 7.9, 7.11, 7.13, 7.15, 7.17, 7.27 (calculate
$\hat{V}$ and $\hat{V}_d$ for parts (d) and (e)), 7.29.