Sta15w1 Notes
Chapter 1 – Terminology
1.1 Definitions
Data/Data set – Set of values collected or obtained when gathering information on some
issue of interest.
Examples
4 The yields of a certain crop obtained after applying different types of fertilizer.
Statistics – Collection of methods for planning experiments, obtaining data, and then
organizing, summarizing, presenting, analyzing, interpreting the data, and drawing
conclusions from it.
Statistics in the above sense refers to the methodology used in drawing meaningful
information from a data set. This use of the term should not be confused with statistics
(referring to a set of numerical values) or statistics (referring to measures of description
obtained from a data set).
Population – The collection of all objects or measurements of interest in a particular study.
Examples
2 The collection of all cars of a certain type manufactured during a particular month.
Census – A study which includes every element of the population.
Examples
1 Study of the entire population carried out by the government every 10 years.
A census is usually very costly and time consuming. It is therefore not carried out very often.
A study of a population is therefore usually confined to a subgroup (sample) of the population.
Discrete variables – Variables which can assume a finite or countable number of possible
values. Such variables are usually obtained by counting.
Examples
Continuous variables – Variables which can assume an infinite number of possible values.
Such variables are usually obtained by measurement.
Some continuous variables, e.g. age, can become discrete when they are rounded.
Measurement scales
Examples
Nominal scale – Level of measurement which classifies data into categories in which no
order or ranking can be imposed on the data.
A variable can be treated as nominal when its values represent categories with no intrinsic
ranking. For example, the department of the company in which an employee works.
Examples of nominal variables include region, postal code, or religious affiliation.
Ordinal scale – Level of measurement which classifies data into categories that can be
ordered or ranked. Precise differences between the ranks do not exist.
A variable can be treated as ordinal when its values represent categories with some intrinsic
order or ranking.
Examples
3 A person’s response (agree, not agree) to a statement. A one (1) is recorded when the
person agrees with the statement, a zero (0) is recorded when a person does not agree.
4 Likert scale responses to statements (strongly agree, agree, neutral, disagree, strongly
disagree).
Examples
Interval scale – Level of measurement which classifies data that can be ordered and ranked
and where differences are meaningful. However, there is no meaningful zero and ratios are
meaningless.
Examples
1 The difference between a temperature of 100 degrees and 90 degrees is the same
difference as that between 90 degrees and 80 degrees. Taking ratios in such a case does not
make sense.
Ratio scale – Level of measurement where differences and ratios are meaningful and there is
a natural zero. This is the “highest” level of measurement in terms of possible operations that
can be performed on the data.
Examples
Variables like height, weight, mark (in test) and speed are ratio variables. These variables
have a natural zero and ratios make sense when doing calculations e.g., a weight of 80
kilograms is twice as heavy as one of 40 kilograms.
Consider a study of 4 fuel additives on the reduction in oxides of nitrogen. You may have 4
drivers and 4 cars at your disposal. You are not particularly interested in any effects of cars or
drivers on the resultant oxide reduction. However, you do not want the results for the fuel
additives to be influenced by the driver or car. An appropriate design of the experiment (way
of performing the experiment) will allow you to estimate effects of all factors of interest
without these outside factors influencing the results.
2.1 Collecting data that compares reckless driving of female and male drivers.
2.2 Collecting data on smoking and lung cancer.
Examples
Sampling frame (synonyms: "sample frame", "survey frame") – This is the actual set of units
from which a sample is drawn
Example
Consider a survey aimed at establishing the number of potential customers for a new service
in a certain city. The research team has drawn 1000 numbers at random from a telephone
directory for the city, made 200 calls each day from Monday to Friday from 8am to 5pm and
asked some questions.
In this example, the population of interest is all the inhabitants in the city. The sampling
frame includes only those city dwellers that satisfy all the following conditions:
1 They have a telephone;
2 Their number is listed in the telephone directory;
3 They are likely to be at home from 8am to 5pm from Monday to Friday;
The sampling frame in this case differs from the population. For example, it under-represents
the categories which either do not have a telephone (e.g. the poorest), have an unlisted
number, were not at home at the time of the calls (e.g. employed people), or do not like
to participate in telephone interviews (e.g. busier and more active people). Such differences
between the sampling frame and the population of interest are a main cause of bias when
drawing conclusions based on the sample.
Probability samples – Samples drawn according to the laws of chance. These include simple
random sampling, systematic sampling and stratified random sampling.
Simple random sampling – Sampling in which each sample of a given size that can be
drawn will have the same chance of being drawn. Most of the theory in statistical inference is
based on random sampling being used.
Examples
1 The 6 winning numbers (drawn from 49 numbers) in a Lotto draw. Each potential sample
of 6 winning numbers has the same chance of being drawn.
2 Each name in a telephone directory could be numbered sequentially. If the sample size
was to include 2 000 people, then 2 000 numbers could be randomly generated by computer
or numbers could be picked out of a hat. These numbers could then be matched to names in
the telephone directory, thereby providing a list of 2 000 people.
A random sample can be selected by using a table of random numbers (see table at the back).
Examples
The first 6 random numbers in the table of random numbers are 10480, 22368, 24130, 42167,
37570, 77921. Use these numbers to select the 6 winning numbers in a Lotto draw.
The 49 numbers from which the draw is made all involve 2 digits i.e. 01, 02, . . . , 49.
Putting the above numbers from the table of random numbers next to each other in a string of
digits gives 10 48 02 23 68 24 13 04 21 67 37 57 07 79 21 .
The winning numbers can be selected by either taking all pairs of digits between 01 and 49
(discarding any numbers outside this range or repeats) by working from left to right or right
to left in the above string.
By working from left to right the winning numbers are 10, 48, 2, 23, 24 and 13.
By working from right to left the winning numbers are 21, 7, 37, 4, 13 and 24 (the second
occurrence of 21 in the string is a repeat and is discarded).
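The pair-extraction procedure above can be sketched in Python (an illustrative sketch, not part of the original notes; the function name is our own). Out-of-range pairs and repeats, such as a second occurrence of 21, are discarded:

```python
def lotto_from_digits(digits, low=1, high=49, count=6, reverse=False):
    """Select `count` distinct numbers by reading two-digit pairs from a
    digit string, discarding out-of-range values and repeats."""
    if reverse:
        # Read the two-digit pairs from right to left.
        pairs = [digits[i - 2:i] for i in range(len(digits), 1, -2)]
    else:
        pairs = [digits[i:i + 2] for i in range(0, len(digits) - 1, 2)]
    chosen = []
    for p in pairs:
        n = int(p)
        if low <= n <= high and n not in chosen:
            chosen.append(n)
        if len(chosen) == count:
            break
    return chosen

# The digit string from the table of random numbers above.
digits = "104802236824130421673757077921"
print(lotto_from_digits(digits))                # [10, 48, 2, 23, 24, 13]
print(lotto_from_digits(digits, reverse=True))  # [21, 7, 37, 4, 13, 24]
```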
Example 2
1 Open an Excel sheet and type the numbers 1, 2, . . . , 49 in column A.
2 In cell B1 type =RAND() and drag the cursor down to cell B49. These cells are filled
with random numbers between 0 and 1.
3 Highlight the entries in column B, right click, select Copy, and then Paste Special →
Values. This ensures that these numbers do not change.
4 Highlight all the entries in cells A1 to B49 and select Data → Sort, sorting the column B
numbers from smallest to largest. The first 6 numbers in column A then form a random draw.
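The Excel procedure above is a shuffle: pair each number with a random key and sort by the key. A Python sketch of the same idea (illustrative, not from the notes):

```python
import random

numbers = list(range(1, 50))               # column A: the numbers 1..49
keys = [random.random() for _ in numbers]  # column B: the =RAND() values

# Sorting the (key, number) pairs by key shuffles the numbers;
# the first 6 after sorting are the winning numbers.
shuffled = [n for _, n in sorted(zip(keys, numbers))]
winning = shuffled[:6]
print(winning)
```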
The advantage of simple random sampling is that it is simple and easy to apply when small
populations are involved. However, because every person or item in a population has to be
listed before the corresponding random numbers can be read, this method is very
cumbersome to use for large populations and cannot be used if no list of the population items
is available. It can also be very time consuming to try and locate every person included in the
sample. There is also a possibility that some of the persons in the sample cannot be contacted
at all.
Systematic sampling – Sampling in which data is obtained by selecting every kth object,
where k is approximately N/n.
Examples
1 A manufacturer might decide to select every 20th item on a production line to test for
defects and quality. This technique requires the first item to be selected at random as a
starting point for testing and, thereafter, every 20th item is chosen.
2 A market researcher might select every 10th person who enters a particular store, after
selecting a person at random as a starting point, or interview occupants of every 5th house in
a street, after selecting a house at random as a starting point.
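The rule "random start, then every kth object" can be sketched in Python (an illustration; the function name is our own):

```python
import random

def systematic_sample(population, n):
    """Select every k-th item, with k approximately N/n,
    after a random starting point within the first k items."""
    N = len(population)
    k = max(N // n, 1)
    start = random.randrange(k)      # random starting point in the first k items
    return population[start::k][:n]

pop = list(range(1, 101))            # e.g. items numbered 1..100
print(systematic_sample(pop, 10))    # 10 items, every 10th from a random start
```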
Stratified random sampling – Sampling in which the population is divided into groups
(called strata) according to some characteristic. Each of these strata is then sampled using
random sampling.
Example
A general problem with random sampling is that you could, by chance, miss out a particular
group in the sample. However, if you subdivide the population into groups, and sample from
each group, you can make sure the sample is representative. Some examples of strata
commonly used are those according to province, age, and gender. Other strata may be
according to religion, academic ability, or marital status.
Example
In a study investigating the expenditure pattern of consumers, they were divided into low,
medium and high-income groups.
income group  % of population
medium        45
high          15
When sampling is proportional to size (an income group comprises the same percentage of
the sample as of the population) the sample sizes for the strata should be calculated as
follows.
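Proportional allocation can be sketched in Python. The percentages and the total sample size below are hypothetical illustration values, not taken from the notes:

```python
# Proportional allocation: each stratum gets the same share of the sample
# as it has of the population. These shares are hypothetical fill-ins.
strata = {"low": 0.40, "medium": 0.45, "high": 0.15}
n = 200  # total sample size (also hypothetical)

sizes = {group: round(n * share) for group, share in strata.items()}
print(sizes)  # {'low': 80, 'medium': 90, 'high': 30}
```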
Convenience sampling – Sampling in which data that are readily available are used, e.g.,
surveys done on the internet. These methods include quota sampling.
Quota sampling
Stage 1: Decide on the characteristics (e.g., age, gender) on which the sample is to be based.
Stage 2: Decide on the categories to be sampled from. These categories are determined by
cross-classification according to the characteristics chosen at stage 1.
Stage 3: Decide on the overall number (quota) and numbers (sub-quotas) to be sampled
from each of the categories specified in stage 2.
Stage 4: Collect the information required until all the numbers (quotas) are obtained.
Example
A company is marketing a new product and needs to know how potential customers might
react to the product.
Stage 1: It is decided that age (the 3 groups under 20, 20-40, over 40) and gender (male,
female) are the characteristics that will determine the sample.
Stage 2: The 6 categories to be sampled from are (male under 20), (male 20-40), (male over
40), (female under 20), (female 20-40) and (female over 40).
Stage 3: Decide on the sub-quotas (required sample sizes) for the different subgroups.
Example
Category         Sub-quota
male under 20       40
male 20-40          60
male over 40        25
female under 20     35
female 20-40        65
female over 40      30
Total              255
The total quota is the total of all the sub-quotas i.e., 255.
Stage 4: Visit a place where individuals to be interviewed are readily available e.g., a large
shopping center and interview people until all the quotas are filled.
Quota sampling is a cheap and convenient way of obtaining a sample in a short space of time.
However, this method of sampling is not based on the laws of chance and cannot guarantee a
sample that is representative of the population from which it is drawn.
When obtaining a quota sample, interviewers often choose who they like (within criteria
specifications) and may therefore select those who are easiest to interview. Therefore,
sampling bias (uncontrolled factors that result in the sample being not representative of the
population) can result. It is also impossible to estimate the accuracy of quota sampling
(because the sampling is not random).
The Excel software has a facility with which a random sample of a specific size can be
selected from a given population. Below is the output of a random sample of size 5 selected
from a population consisting of 10 items.
Population  Sample
13          16
27          27
14          12
12          12
15          13
9
10
12
16
9
Line graph
A line graph is a graph used to present some characteristic recorded over time.
Example:
[Line graph: a person's weight recorded over time]
The graph above shows how a person's weight varied from the beginning of 1991 to the
beginning of 1995.
Bar charts
A bar chart or bar graph is a chart consisting of rectangular bars with heights proportional to
the values that they represent. Bar charts are used for comparing two or more values that are
taken over time or under different conditions.
In a simple bar chart, the figures used to make comparisons are represented by bars. These
are either drawn vertically or horizontally. Only totals are represented. The height or length
of the bar is drawn in proportion to the size of the figure being presented. An example is
shown below.
When you want to draw a bar chart to illustrate your data, it is often the case that the totals of
the figures can be broken down into parts or components.
You start by drawing a simple bar chart with the total figures as shown above. The columns
or bars (depending on whether you draw the chart vertically or horizontally) are then divided
into the component parts.
You may find that your data allows you to make comparisons of the component figures
themselves. If so, you will want to create a multiple (compound) bar chart. Each component
is represented by a separate bar with all the components relating to a particular case (e.g., a
year) next to each other. This type of chart enables you to trace the trends of each individual
component, as well as making comparisons between the components.
Pareto chart
This is a special bar chart where the frequencies (presented by bars) are arranged in
decreasing order of magnitude (largest to smallest).
Example
The table below shows the occurrence of diseases taken from a citrus orchard.
Citrus disease  frequency
Anthraknose        467
Canker             598
Melanose           532
Scab               503
Leaf miner         427
Sooty mold         568
Pest hole          415
Total             3510
[Pareto chart: frequency bars in decreasing order – Canker (598), Sooty mold (568),
Melanose (532), Scab (503), Anthraknose (467), Leaf miner (427), Pest hole (415)]
From this chart the seriousness of the diseases (most to least) can be seen.
Dot Plot
This is diagram where a line is drawn according to a scale that is appropriate for the data set
and the values (in the data set) plotted at their positions on the scale. If the same value occurs
more than once, the multiple values are plotted on top of each other at the same point on the
scale. For small data sets (few values) this plot can provide useful information regarding data
patterns.
Example
Imagine that a medium-sized retailer, thinking of expanding into a new region, identifies a
business that it considers as being ready for takeover. It finds the following annual profit
figures (in tens of thousands of pounds) for the target retailer's last ten years of trading:
9 9 7 7 7 6 5 4 3 3
To draw a dot plot, we can begin by drawing a horizontal line across the page to represent the
range of values of all the numbers (scale). Then we can mark an 'x' above the appropriate
value along the line as follows:
[Dot plot: x's stacked above the values 3, 4, 5, 6, 7 and 9 on the scale]
Pie Chart
A pie chart is a diagram that shows the subdivision of some entity/total into subgroups. The
diagram is in the form of a circle which is divided into slices, with each slice having an area
proportional to the share that it makes up of the total.
Example
The pie chart below shows the ingredients used to make a sausage and mushroom pizza.
The degrees needed for each slice are found by calculating the appropriate percentage of 360,
e.g., for sausage the degrees are 0.125*360 = 45, for cheese 0.25*360 = 90, etc. The complete
calculations are shown in the table below.
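The slice-degree calculation can be sketched in Python. Only the sausage (12.5%) and cheese (25%) shares come from the text; the remaining ingredients and their shares are hypothetical fill-ins for illustration:

```python
# Degrees for each pie slice = proportion of the total * 360.
# Sausage and cheese shares are from the text; the rest are hypothetical.
proportions = {"sausage": 0.125, "cheese": 0.25, "mushroom": 0.25,
               "tomato": 0.25, "crust": 0.125}

degrees = {name: p * 360 for name, p in proportions.items()}
print(degrees)  # sausage -> 45.0, cheese -> 90.0, ...
```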
Stem-and-leaf plot
Examples
To construct a stem-and-leaf plot, the values must first be sorted in ascending order. Here is
the sorted set of data values that will be used in the example:
44 46 47 49 63 64 66 68 68 72 72 75 76 81 84 88 106
Next, it must be determined what the stems will represent and what the leaves will represent.
Typically, the leaf contains the last digit of the number, and the stem contains the other
digits. In the case of very large or very small numbers, the data values may be rounded to a
particular place value (such as the hundredths place) that will be used for the leaves. The
remaining digits to the left of the rounded place value are used as the stems.
In this example, the leaf represents the ones place and the stem the rest of the number (tens
place or higher).
The stem-and-leaf plot is drawn with two columns separated by a vertical line. The stems are
listed to the left and the leaves to the right of the vertical line. It is important that each stem is
listed only once and that no numbers are skipped, even if it means that some stems have no
leaves. The leaves are listed in increasing order in a row to the right of each stem.
 4 | 4679
 5 |
 6 | 34688
 7 | 2256
 8 | 148
 9 |
10 | 6
key: 5|4=54
leaf unit: 1.0
stem unit: 10.0
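The construction described above can be sketched in Python (an illustration; the function name is our own). The leaf is the ones digit and the stem is the rest of the number:

```python
def stem_and_leaf(values):
    """Group values into stem -> sorted leaves, leaf = last (ones) digit.
    Every stem between the smallest and largest is listed, even if empty."""
    stems = {}
    lo, hi = min(values) // 10, max(values) // 10
    for s in range(lo, hi + 1):
        stems[s] = []                    # no stems are skipped
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return stems

data = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106]
for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem:2d} | {''.join(map(str, leaves))}")
```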
A stem-and-leaf plot enables the researcher to see patterns of clustered values, e.g., 12 of the
17 values are greater than or equal to 63 and less than or equal to 88.
As an example, suppose the fat contents (in grams) of English breakfasts and cold meat
sandwiches are to be compared. The fat contents are shown below.
Sandwiches: 6, 7, 12, 13, 17, 18, 20, 21, 21, 24, 26, 28, 30, 34
Breakfasts: 12, 14, 15, 16, 18, 23, 25, 25, 36, 36, 38, 41, 44, 45
Breakfasts Sandwiches
|0| 6 7
2 4 5 6 8 |1| 2 3 7 8
3 5 5 |2| 0 1 1 4 6 8
6 6 8 |3| 0 4
1 4 5 |4|
key: 2|4=24
leaf unit: 1.0
Conclusion: The fat content in breakfasts appears to be higher than that in sandwiches.
The symbol sigma ∑ (Capital S in Greek alphabet) is used to denote “the sum of” values.
Suppose the symbol x is used to denote some variable of interest in a study. To distinguish
between values of this variable, subscripts are used: x1 denotes the first value, x2 the second,
and so on up to xn, the last of n values. The sum of these values is written
∑xi (the sum taken over i = 1, 2, . . . , n).
If it is understood that the range of subscript indices over which the summation is taken
involves all the x values, the summation can be written as just
x1 + x2 + . . . + xn = ∑x.
Example 1: Suppose x1=70, x2=74, x3=66, x4=68, x5=71.
Then ∑xi (i = 1 to 5) = x1 + x2 + . . . + x5 = 70+74+66+68+71 = 349.
Similarly, ∑xi² (i = 1 to n) = x1² + x2² + ⋯ + xn², or ∑x² for short.
Example 2: For the data set in example 1, ∑xi² = 70²+74²+66²+68²+71² = 24397.
Note that ∑xi² ≠ (∑xi)², e.g., for the abovementioned data ∑xi² = 24397 while
(∑xi)² = 349² = 121801.
The summation notation can also be used to write the sum of products of corresponding
values for 2 different sets of values.
∑xiyi (i = 1 to n) = x1y1 + x2y2 + ⋯ + xnyn.
i      1   2   3   4   5   6
xi    11  13   7  12  10   8
yi     8   5   7   6   9  11
xiyi  88  65  49  72  90  88
For this data ∑xiyi = 11*8+13*5+7*7+12*6+10*9+8*11 = 88+65+49+72+90+88 = 452.
Note that ∑xiyi ≠ (∑xi)(∑yi), e.g., for the abovementioned data ∑xi = 61 and
∑yi = 46, so that (∑xi)(∑yi) = 61*46 = 2806 ≠ 452 = ∑xiyi.
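The summation results above can be verified with a short Python snippet (illustrative, not part of the notes):

```python
x = [11, 13, 7, 12, 10, 8]
y = [8, 5, 7, 6, 9, 11]

sum_xy = sum(a * b for a, b in zip(x, y))   # sum of products
print(sum_xy)                                # 452
print(sum(x), sum(y))                        # 61 46
print(sum(x) * sum(y))                       # 2806, not equal to 452

# The earlier example: sum of squares versus square of the sum.
v = [70, 74, 66, 68, 71]
print(sum(t * t for t in v))                 # 24397
print(sum(v) ** 2)                           # 121801
```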
Frequency distribution
A frequency distribution is a table in which data are grouped into classes and the number of
values (frequencies) which fall in each class recorded.
The main purpose of constructing a frequency distribution is to get an insight into the
distribution pattern of the frequencies over the classes. Hence, the name frequency
distribution is used to refer to this pattern.
Examples
class  tally         f
1      ||||| ||       7
2      ||||| |||||   10
3      ||||| |||      8
4      ||||| |        6
5      ||||           4
6      ||             2
Total                40
Note: The sum of the frequencies = sample size i.e., ∑ f =n.
Example 2 Consider the following data of low temperatures (in degrees Fahrenheit to the
nearest degree) for 50 days. The highest temperature is 64 and the lowest temperature is 39.
The classes into which the above values can be sorted can be found by following the steps
shown below.
1 Find the maximum (= 64) and minimum (= 39) values and calculate the range:
range = maximum − minimum = 64 − 39 = 25.
2 Decide on the number of classes. Use Sturges’ rule, which states that
k = 1 + 3.322 log10(n); for n = 50 this gives k ≈ 6.6, rounded to k = 7.
3 Calculate the class width such that (number of classes)*(class width) > range, i.e.
class width = 4 (since 7*4 = 28 > 25).
4 Find the lower value that defines the first class. This is usually a value just below the
minimum value in the data set. Since the minimum value for this data set is 39, the lowest
class can have a minimum value one below this i.e., 38.
5 Find the lower values that define each of the classes that follow by successively adding the
class width to the lower value of the previous class.
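The steps above can be sketched in Python (illustrative; the notes state the result k = 7, and the code uses the standard form of Sturges' rule to reproduce it):

```python
import math

data_min, data_max, n = 39, 64, 50
data_range = data_max - data_min             # 64 - 39 = 25

# Sturges' rule for the number of classes.
k = round(1 + 3.322 * math.log10(n))         # 1 + 3.322*log10(50) ≈ 6.6 -> 7

# Smallest integer class width with k * width > range.
width = data_range // k + 1                  # 25 // 7 + 1 = 4

lower = data_min - 1                         # 38, just below the minimum
lower_limits = [lower + i * width for i in range(k)]
print(k, width, lower_limits)                # 7 4 [38, 42, 46, 50, 54, 58, 62]
```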
The frequency distribution below shows the data values sorted into the classes.
The table below shows the classes, their frequencies, relative frequencies, and cumulative
frequencies for the temperature data set.
class limits  class boundaries  f   relative frequency  cumulative frequency
38-41 37.5-41.5 4 0.08 4
42-45 41.5-45.5 10 0.2 14
46-49 45.5-49.5 8 0.16 22
50-53 49.5-53.5 15 0.3 37
54-57 53.5-57.5 9 0.18 46
58-61 57.5-61.5 3 0.06 49
62-65 61.5-65.5 1 0.02 50
Total 50
Class limits – The values that define the classes of the frequency distribution in terms of the
rounded values in the data set.
lower class limit – minimum rounded value that defines a class of the frequency distribution.
upper class limit – maximum rounded value that defines a class of the frequency
distribution.
class boundaries – The values that define the classes of the frequency distribution in terms
of the actual values in the data set.
lower class boundary – minimum actual value that defines a class of the frequency
distribution.
upper class boundary – maximum actual value that defines a class of the frequency
distribution.
The first class in the above table has a lower class limit of 38 and a lower class boundary of
37.5 (since this is the smallest actual value that can be rounded up to 38).
The first class in the above table has an upper class limit of 41 and an upper class boundary of
41.5 (since this is the largest actual value that can be rounded down to 41).
The lower and upper class boundaries of a particular class can be calculated by using the
following formulae:
lower class boundary of class i = (upper class limit of class (i−1) + lower class limit of class i)/2
upper class boundary of class i = (upper class limit of class i + lower class limit of class (i+1))/2
For the second class (i = 2) in the above frequency distribution, class (i−1) is the first class
(since i−1 = 1). Hence
lower class boundary = (41 + 42)/2 = 41.5 and upper class boundary = (45 + 46)/2 = 45.5.
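Because the class limits of adjacent classes here differ by one unit, each boundary lies half a unit beyond the corresponding limit. A Python sketch of the boundary calculation (illustrative):

```python
# Class limits of the temperature frequency distribution.
limits = [(38, 41), (42, 45), (46, 49), (50, 53), (54, 57), (58, 61), (62, 65)]

# The boundary between two adjacent classes is the midpoint of the gap
# between one class's upper limit and the next class's lower limit;
# with integer limits one apart, that is the limit plus or minus 0.5.
boundaries = [(lo - 0.5, hi + 0.5) for lo, hi in limits]
print(boundaries[1])   # (41.5, 45.5) for the second class
```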
1 Each class is defined by a single value and not by a range of values like in example 2.
Therefore, upper class limit (boundary) = lower class limit (boundary).
2 The values are accurately recorded (no rounding). Therefore, the limits and boundaries are
identical values.
In general, for accurately recorded (not rounded) data, the lower-class limit = lower class
boundary and upper-class limit = upper-class boundary.
Example 3
7.21741   7.8989    6.85461   10.31167  8.48253   5.17069
5.09063   8.16412   5.67094   7.7394    7.87423   5.41634
9.37265   10.14436  7.15675   10.31107  8.86571   10.1734
5.99276   6.5738    7.06965   8.82439   7.47467   9.50018
4.90014   5.50273   8.12516   5.51933   7.43641   10.95599
5.87188   9.36936   9.83773   10.18893  5.12028   9.60018
8.56534   9.27719   8.37107   7.03318   10.78344  9.08941
6.85749   7.7887    9.68159   6.75009   8.0521    8.19638
10.17312  7.51527   11.31383  8.5765    7.48021   8.39881
7.37565   7.28159   8.81773   5.53182   5.98515   7.71778
classes f
4.5-5.5 5
5.5-6.5 7
6.5-7.5 13
7.5-8.5 13
8.5-9.5 9
9.5-10.5 10
10.5-11.5 3
Total 60
For this distribution the lower (upper) class limit = lower (upper) class boundary for each of
the classes.
A value that falls on the boundary of 2 classes is allocated to the higher of the two classes
e.g., 5.50000 is allocated to the class 5.5-6.5 (not 4.5 to 5.5).
Class midpoints
The class midpoint of a class is the value halfway between its lower and upper class
boundaries (or limits), i.e., midpoint = (lower class boundary + upper class boundary)/2.
Examples
1 For the frequency distribution in example 2, the class midpoints are given below.
class limits  class boundaries  midpoint
38-41         37.5-41.5         39.5
42-45         41.5-45.5         43.5
46-49         45.5-49.5         47.5
50-53         49.5-53.5         51.5
54-57         53.5-57.5         55.5
58-61         57.5-61.5         59.5
62-65         61.5-65.5         63.5
2 For the frequency distribution in example 3 the class midpoints are given below.
classes midpoints
4.5-5.5 5
5.5-6.5 6
6.5-7.5 7
7.5-8.5 8
8.5-9.5 9
9.5-10.5 10
10.5-11.5 11
Cumulative frequencies
The “less than” cumulative frequency of a class is the number of values in the sample that are
less than or equal to the upper-class boundary of the class.
Examples
2 For the frequency distribution in example 3 the “less than” cumulative frequencies are
calculated as shown below.
classes    f   cumulative frequency
4.5-5.5     5    5
5.5-6.5     7   12
6.5-7.5    13   25
7.5-8.5    13   38
8.5-9.5     9   47
9.5-10.5   10   57
10.5-11.5   3   60
Relative frequencies
Relative frequency = frequency/sample size, i.e., Rf = f/n.
Examples
2 For the frequency distribution in example 3 the relative frequencies are calculated as
shown below.
classes    f   relative frequency
4.5-5.5 5 0.083
5.5-6.5 7 0.117
6.5-7.5 13 0.217
7.5-8.5 13 0.217
8.5-9.5 9 0.15
9.5-10.5 10 0.167
10.5-11.5 3 0.05
Total 60 1
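The relative and cumulative frequencies in the tables above can be recomputed with a short Python sketch (illustrative, not from the notes):

```python
f = [5, 7, 13, 13, 9, 10, 3]       # class frequencies from example 3
n = sum(f)                         # sample size: 60

# Relative frequency of each class = f / n (rounded to 3 decimals).
rel = [round(fi / n, 3) for fi in f]

# "Less than" cumulative frequencies: running totals of the frequencies.
cum, total = [], 0
for fi in f:
    total += fi
    cum.append(total)

print(rel)   # [0.083, 0.117, 0.217, 0.217, 0.15, 0.167, 0.05]
print(cum)   # [5, 12, 25, 38, 47, 57, 60]
```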
Why is it necessary to distinguish between the definition of classes for accurately recorded
and rounded data? The following example explains why.
Example
Consider the data set (variable temperature) used in example 2. The classes (defined in terms
of class limits) suggested for grouping the values into are 38-41, 42-45, . . . , 62-65
(grouping A). Suppose instead the classes are defined as 38-42, 42-46, . . . , 62-66 (grouping B).
The first definition of classes allows for rounded values, but the second one does not allow
for it. Consider the following.
actual  rounded value      class value should be in   class value is put    class value is put
value   (nearest integer)  according to grouping B    in with grouping B    in with grouping A
41.5    42                 38-42                      42-46                 42-45
45.5    46                 42-46                      46-50                 46-49
61.5    62                 58-62                      62-66                 62-65
With grouping B all the above values are put in incorrect classes, while with grouping A they
are put in their correct classes.
Histogram
A histogram is a graphical representation of a frequency distribution in which each class is
represented by a rectangle with base covering the class (from lower to upper class boundary)
and height equal to the class frequency.
Example
[Histogram of the temperature data: frequency versus the class boundaries 37.5-41.5,
41.5-45.5, 45.5-49.5, 49.5-53.5, 53.5-57.5, 57.5-61.5, 61.5-65.5]
Frequency polygon
This is also a graphical representation of a frequency distribution. For each class the class
midpoint is plotted against the frequency and the plotted points joined by means of straight
lines.
midpoint  35.5  39.5  43.5  47.5  51.5  55.5  59.5  63.5  67.5
f          0     4    10     8    15     9     3     1     0
[Frequency polygon of the temperature data: frequency plotted against class midpoint, with
the plotted points joined by straight lines]
Note: The two plotted values at the lower and upper ends were added to anchor the graph to
the horizontal axis. The lower end value is a plot of 0 versus the midpoint of the class below
the first (lowest) class (35.5). This midpoint is obtained by subtracting the class width (4)
from the midpoint of the lowest class (39.5). The upper end value is a plot of 0 versus the
midpoint of the class above the last class (67.5). This midpoint is obtained by adding the class
width (4) to the midpoint of the last (highest) class (63.5).
The histogram and frequency polygon are equivalent graphical representations of the pattern
of the frequencies shown in the frequency distribution. It can be shown that the areas under
the histogram and frequency polygon are the same. The total area under the histogram
(frequency polygon) represents the total number of observations in the data set (n).
The ratio [area under the histogram (frequency polygon) between 2 values]/ sample size
is an estimate of the probability (chance) that a value drawn at random from the data set will
lie between these two values.
Examples
1 For the frequency distribution in example 2 the estimated chance that a randomly drawn
value will be between 45.5 and 57.5 is (8+15+9)/50 = 0.64.
2 For the frequency distribution in example 3 the estimated chance that a randomly drawn
value will be greater than 7.5 is (13+9+10+3)/60 = 0.583.
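Both probability estimates can be checked in Python (an illustrative sketch):

```python
f = [4, 10, 8, 15, 9, 3, 1]        # temperature frequencies, n = 50
n = sum(f)

# P(45.5 <= value <= 57.5): the three classes with boundaries 45.5-57.5.
p1 = (f[2] + f[3] + f[4]) / n
print(p1)                           # 0.64

g = [5, 7, 13, 13, 9, 10, 3]       # example 3 frequencies, n = 60
# P(value > 7.5): all classes above the boundary 7.5.
p2 = sum(g[3:]) / sum(g)
print(p2)                           # ≈ 0.583
```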
Ogive
This is the graph of the “less than” cumulative frequencies versus the upper class boundaries.
Example
class boundary        37.5  41.5  45.5  49.5  53.5  57.5  61.5  65.5
cumulative frequency   0     4    14    22    37    46    49    50
[Ogive of the temperature data: cumulative frequency (0 to 50) plotted against the upper
class boundaries]
Note: The plotted value at the lower end was added to anchor the graph to the horizontal axis.
The lower end value is a plot of 0 versus the upper-class boundary of the class below the first
(lowest) class (37.5). This upper-class boundary is obtained by subtracting the class width (4)
from the upper class boundary of the lowest class (41.5).
A percentage “less than” ogive can be plotted by just changing the vertical scale. In this
example the frequencies add up to 50. These frequencies can be converted to percentages, by
multiplying each frequency by 2. To draw the percentage ogive, each cumulative frequency
in the above table will have to be multiplied by 2. The resulting graph is shown below.
Values that have a given percentage of the observations in the data set less than it can be read
off from the ogive.
[Percentage ogive: % cumulative frequency (0 to 100) plotted against the class boundaries]
The main purpose of drawing a histogram is to describe the clustering pattern of the values in
the data set. For a large sample size, the histogram (frequency polygon) can be well
approximated by a smooth curve (called a frequency curve) that is fitted to the frequencies.
The following patterns of the shape of the frequency curve appear regularly in data sets.
Bell shape (symmetrical)
[Bell-shaped frequency curve: symmetrical about the center, with frequency highest at the
center and tapering off in both directions]
This shape is for data sets where many values are in the central portion of the scale, with
fewer and fewer values the further away from the center (in both directions). Many data
sets have this shape. The graph has a symmetrical appearance, i.e., the two halves on either
side of the center are identical. Examples are
Uniform shape
[Uniform frequency curve: approximately constant frequency across the range of values]
This shape occurs when all the values in the data set occur approximately the same number of
times. Examples are
3 Frequencies obtained when tossing an unbiased coin and recording 0 if tails come up and
1 if heads come up.
Bimodal shape
[Bimodal frequency curve: frequency versus body length (mm), showing two distinct peaks]
This pattern shows two distinct peaks (hence the name bimodal data). It appears when
there are two subgroups with different sets of values in the same data set.
Examples
1 Measuring the body lengths of ants when there are adults and juveniles together in the
same data set. The two peaks in the curve reflect the fact that juvenile ants have shorter body
lengths than adult ants.
2 Heights of a population of males and females. Since the females are shorter than the
males, the frequency curve will have two peaks. One peak will be located where the most
female heights are concentrated and one where the most male heights are concentrated.
Positively skewed (skewed to the right) shape
[Positively skewed frequency curve: frequencies highest at the lower end of the scale, with a
long tail towards the upper end]
This shape shows a high clustering of values at the lower end of the scale and less and less
clustering further away from the lower end towards the upper end.
Example
The time it takes to serve a customer at a supermarket. For most customers the service time is
quite short. The longer the service time, the less the number of customers.
Negatively skewed (skewed to the left) shape
[Negatively skewed frequency curve: frequencies highest at the upper end of the scale, with a
long tail towards the lower end]
This shape shows a high clustering of values at the upper end of the scale and less and less
clustering further away from the upper end towards the lower end.
Example
Marks in a test where most students did well, but a few performed poorly.
A measure of central tendency is a value that shows the location on the scale where a data
set is centrally located (most values are clustered around it).
In the calculations a distinction will be made between methods used when the data are in raw
form (values as collected) or grouped form (form of a frequency distribution).
For each of the measures discussed in sections 2.4 and 2.5 the formulas used will be based on
samples selected from the corresponding populations. The measures (statistics) are estimates
of the corresponding population parameters.
The mean (or average) of a set of data values is the sum of all of the data values in the set
divided by n, the number of data values. That is
mean = x̄ = (1/n)∑x = ∑x/n.
x̄ is pronounced “x bar”.
Example
The marks of seven students in a mathematics test with a maximum possible mark of 20 are
given below:
15 13 18 16 14 17 12
mean = x̄ = ∑x/n = (15+13+18+16+14+17+12)/7 = 105/7 = 15.
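The mean of the seven marks can be checked with a short Python snippet (illustrative):

```python
marks = [15, 13, 18, 16, 14, 17, 12]

# mean = (sum of the values) / (number of values)
mean = sum(marks) / len(marks)
print(mean)   # 105 / 7 = 15.0
```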
Median
The median is the value in the data set which is such that half of the values in the data set are
less than or equal to it and half are greater than or equal to it.
For an odd number of values in the data set, the median is the middle value of the data set
when it has been arranged in ascending order. That is, from the smallest value to the largest
value.
If the number of values in the data set is even, then the median is the average of the two
middle values.
Examples
1 The marks of nine students in a geography test that had a maximum possible mark of 50
are given below:
47 35 37 32 38 39 36 34 35
Arrange the data values in order from the lowest value to the highest value:
32 34 35 35 36 37 38 39 47
Here the number of values n = 9 is odd, so the median is the middle (5th) value, i.e.
median = 36.
2 Consider the above data set with the first value (47) omitted.
Arrange the data values in order from the lowest value to the highest value:
32 34 35 35 36 37 38 39
In this case the number of values n = 8, which is an even number. The two middle values in the data set are in positions n/2 = 4 and n/2 + 1 = 5, i.e., the values 35 and 36.

median = (35 + 36)/2 = 35.5.
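The odd/even rule for the median can be sketched as a small helper function (an illustration; the function name is our own):

```python
def median(values):
    """Middle value for odd n; average of the two middle values for even n."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median([47, 35, 37, 32, 38, 39, 36, 34, 35]))  # 36 (odd n)
print(median([35, 37, 32, 38, 39, 36, 34, 35]))      # 35.5 (even n)
```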
Mode
The mode of a set of data values is the value(s) that occurs most often.
Example
48 44 48 45 42 49 48

The mode is 48 (it occurs three times; every other value occurs only once).
Note
1 It is possible for a set of data values to have more than one mode.
2 If there are two data values that occur most frequently, we say that the set of data values is
bimodal e.g., the data set 2, 2, 4, 5, 5, 6 has two modes (2 and 5).
3 If no value in the data set occurs more than once, it has no mode e.g. the data set 4, 5, 7,
9 has no mode.
4 For continuous data, it is possible to have a large data set where no value occurs more than once. For such data, see remark 3 below.
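The mode cases in the notes above (single mode, bimodal, no mode) can be found with a frequency count; this sketch uses Python's collections.Counter:

```python
from collections import Counter

def modes(values):
    """All values sharing the highest frequency; empty list when nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no value occurs more than once: no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([48, 44, 48, 45, 42, 49, 48]))  # [48]
print(modes([2, 2, 4, 5, 5, 6]))            # [2, 5] (bimodal)
print(modes([4, 5, 7, 9]))                  # [] (no mode)
```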
1 The mean is used as a measure of central tendency for symmetrical, bell-shaped data that
do not have extreme values (extreme values are called outliers). Outliers are unusually small
or large numbers.
2 The median may be more useful than the mean when there are extreme values in the data
set as it is not affected by the extreme values.
3 The mode is useful when the most common item, characteristic or value of a data set is
required. In such case a smooth function is fitted to the histogram of the data and the mode
defined as the value on the horizontal axis corresponding to the maximum point of the fitted
curve.
Examples
1 The amounts (thousands) for which each of 7 properties were sold are shown below.
For this data set mean = x̄ = 772.86. This value of the mean is not a central value for the data
set (it is greater than all the values except the largest one). The reason for this is that the last
value (2350) has a considerable influence on the value of the mean.
The median = 555 is a value that is more centrally located than the mean. Unlike the mean, the median is not influenced by large values in the data set.
2 For qualitative (non-numerical) data only the mode can be calculated e.g., suppose 10
different rate payers are asked whether they think the percentage increase in rates is
reasonable. They can either agree (A), disagree (D) or be neutral (N) on the issue. Their
responses are shown below.
A, A, D, N, D, A, D, D, N, N.
For this data set the modal response is D (since D occurs more times than the other
responses). It is not possible to calculate a median or a mean for this data set.
When calculating the mean for raw data, it is usually assumed that all the values in the data
set are equally important. If the values are not all considered equally important, the weighted
mean ( x̄ w ) is calculated according to the formula below.
x̄w = (Σ xᵢwᵢ) / (Σ wᵢ), with both sums taken over i = 1, . . . , r.
In the formula x1, x2, . . ., xr are the values and w1, w2, . . . ,wr their respective weights.
Example
The final mark (percentage) in a certain course is based on an assignment mark (which counts
for 10% of the final mark), a test mark (which counts for 30% of the final mark) and an exam
mark (which counts for 60% of the final mark). Calculate the final mark of a student who gets
a 65% assignment mark, a 70% test mark, and a 55% exam mark.
The above formula is applied with x1= 65, x2= 70 x3= 55, w1= 10, w2= 30 w3= 60.
x̄w = (65∗10 + 70∗30 + 55∗60)/(10 + 30 + 60) = 6050/100 = 60.5.
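The weighted mean formula translates directly into code; this sketch reproduces the final-mark example:

```python
def weighted_mean(values, weights):
    # x̄_w = Σ x_i w_i / Σ w_i
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

# assignment counts 10%, test 30%, exam 60%
print(weighted_mean([65, 70, 55], [10, 30, 60]))  # 60.5
```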
For grouped data the mean is calculated from the formula below.
x̄ = (Σ xmid(i) fi)/n, with the sum taken over i = 1, . . . , k, where
xmid(i) is the midpoint of the ith class, k the number of classes and n the sample size.
This formula is a special case of the weighted mean formula with wi = fi and Σ wi = n.
Example
For the frequency distribution of temperatures (example 2 of the frequency distributions), the
mean can be calculated as shown below.
Class boundaries	xmid(i)	fi	xmid(i) fi
37.5-41.5	39.5	4	158
41.5-45.5	43.5	10	435
45.5-49.5	47.5	8	380
49.5-53.5	51.5	15	772.5
53.5-57.5	55.5	9	499.5
57.5-61.5	59.5	3	178.5
61.5-65.5	63.5	1	63.5
Total		50	2487

mean = 2487/50 = 49.74.
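A sketch of the grouped-mean calculation, with the midpoints and frequencies copied from the temperature table:

```python
midpoints = [39.5, 43.5, 47.5, 51.5, 55.5, 59.5, 63.5]
freqs = [4, 10, 8, 15, 9, 3, 1]

n = sum(freqs)  # 50
# x̄ = Σ x_mid(i) f_i / n
mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
print(mean)  # 49.74
```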
Mode

For grouped data the mode is estimated from the modal class (the class with the highest frequency) using

mode = L + [(fm − fm−1) / ((fm − fm−1) + (fm − fm+1))] ∗ c,

where L is the lower boundary of the modal class, fm the frequency of the modal class, fm−1 and fm+1 the frequencies of the classes before and after it, and c the class width.
Example

For the temperature distribution the modal class is 49.5-53.5 (fm = 15, fm−1 = 8, fm+1 = 9, c = 4), so

mode = 49.5 + [(15 − 8)/((15 − 8) + (15 − 9))] ∗ 4 = 49.5 + (7/13)∗4 = 51.65.
Variability refers to the extent to which the values in a data set vary around (differ from) the
associated measure of central tendency.
Example
The performance of 2 different stocks is monitored over a period of 8 days. Their values are
shown in the table below.
day 1 2 3 4 5 6 7 8
A 100 120 110 108 130 106 120 112
B 112 97 88 123 153 84 146 110
The dot plot that follows shows the performance of each stock.
The mean values for the two stocks are the same (=113.25), but they differ in variability
(extent of spread around the mean). Stock B has a far wider spread around the mean than
stock A.
Range

The range = largest value − smallest value in the data set.

For the stocks data sets the range = 130 − 100 = 30 (for stock A data set) and 153 − 84 = 69 (for stock B data set).

The larger (wider) spread in stock B values is reflected in the larger range (more than twice that of stock A).
A measure of deviation is based on the differences between the values in the data set and the mean (x̄). Since Σ(xᵢ − x̄) = Σxᵢ − n x̄ = 0 for any data set, a measure of deviation is based on the sum of the squares of these differences. The (sample) variance is

S² = Σ(xᵢ − x̄)²/(n − 1) = [Σxᵢ² − (Σxᵢ)²/n]/(n − 1).
Note: 1 Division in the above formula is by n − 1 and not by n. The reason is that the variance estimator with division by n − 1 has some superior properties for small values of n (to be discussed in a second year module).

2 For large values of n, the answers when dividing by n and by n − 1 differ very little. For such values of n the variance is sometimes calculated by dividing by n.

Unless told otherwise, the variance will be estimated by using the formula where division is by n − 1.
The variance is expressed in the squared units of the data. The standard deviation S = √S², which is the positive square root of the variance, is expressed in the same units as the data.
Example
For the stock A data the variance and standard deviation can be calculated as shown below.

x (stock A score)	x²
100	10000
120	14400
110	12100
108	11664
130	16900
106	11236
120	14400
112	12544
sum	906	103244

variance = S² = [103244 − 906²/8]/7 = 91.357.

standard deviation = S = √91.357 = 9.558.

For stock B the standard deviation is 25.385 (check this using STATMODE).
Interpretation: The stock A values differ (on average) from the mean by 9.558, while stock
B values differ (on average) from the mean by almost 3 times this amount.
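The shortcut variance formula used for the stock data can be sketched as follows:

```python
from math import sqrt

def sample_variance(values):
    # S² = [Σx² − (Σx)²/n] / (n − 1)
    n = len(values)
    sx = sum(values)
    sx2 = sum(x * x for x in values)
    return (sx2 - sx * sx / n) / (n - 1)

stock_a = [100, 120, 110, 108, 130, 106, 120, 112]
s2 = sample_variance(stock_a)
print(round(s2, 3), round(sqrt(s2), 3))  # 91.357 9.558
```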
For grouped data, the raw data formulae for the variance and standard deviation can be
slightly modified.
S² = Σ(xmid(i) − x̄)² fi/(n − 1) = [Σ x²mid(i) fi − (Σ xmid(i) fi)²/n]/(n − 1), with all sums taken over i = 1, . . . , k.
Example
For the frequency distribution of temperatures (example 2 of the frequency distributions), the
variance and standard deviation can be calculated as shown below.
Class boundaries	xmid(i)	fi	xmid(i) fi	x²mid(i) fi
37.5-41.5	39.5	4	158	6241
41.5-45.5	43.5	10	435	18922.5
45.5-49.5	47.5	8	380	18050
49.5-53.5	51.5	15	772.5	39783.75
53.5-57.5	55.5	9	499.5	27722.25
57.5-61.5	59.5	3	178.5	10620.75
61.5-65.5	63.5	1	63.5	4032.25
Total		50	2487	125372.5

variance = S² = [125372.5 − 2487²/50]/49 = 34.064.

standard deviation = S = √34.064 = 5.836.
A measure of variation can also be based on the absolute differences between the values and the mean, i.e., |xᵢ − x̄|. The mean deviation is

MD = (1/n) Σ |xᵢ − x̄|.
Example
For the stock A data, the mean deviation can be calculated as shown below.
x	|x − 113.25|
100	13.25
120	6.75
110	3.25
108	5.25
130	16.75
106	7.25
120	6.75
112	1.25
mean = 113.25	Total = 60.5

MD = 60.5/8 = 7.5625.
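The mean deviation calculation corresponds to this sketch:

```python
def mean_deviation(values):
    # MD = (1/n) Σ |x_i − x̄|
    n = len(values)
    xbar = sum(values) / n
    return sum(abs(x - xbar) for x in values) / n

stock_a = [100, 120, 110, 108, 130, 106, 120, 112]
print(mean_deviation(stock_a))  # 7.5625 (= 60.5 / 8)
```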
The standard deviations of 2 data sets that are expressed in different units cannot be
compared directly. Such a comparison can be done by calculating the
coefficient of variation = CV = S ∗ 100 / x̄ (expressed as a percentage).
Example:
For the expenditure data (see example 3 of the frequency distributions) x̄= 7.93333 and
S = 1.65567.
Since the two standard deviations that were calculated above are in different units, they
cannot be compared directly.
For the temperature data CV = 5.836 ∗ 100 / 49.74 = 11.733 %.

For the expenditure data CV = 1.65567 ∗ 100 / 7.9333 = 20.87 %.
The coefficient of variation calculations show that, in relative terms, the variability of the expenditure data set is greater than that of the temperature data set.
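The comparison can be sketched in code (the standard deviations and means are the values quoted above):

```python
def cv(std_dev, mean):
    # coefficient of variation as a percentage
    return std_dev * 100 / mean

print(round(cv(5.836, 49.74), 3))     # temperature data: 11.733
print(round(cv(1.65567, 7.9333), 2))  # expenditure data: 20.87
```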
Chebychev’s theorem states that for any data set a proportion of at least 1 − 1/d² of the values lie within d standard deviations of the mean (d > 1).
Examples

1 Proportion of values that lie within 2 standard deviations of the mean ≥ 1 − 1/2² = 0.75.

2 Proportion of values that lie within 3 standard deviations of the mean ≥ 1 − 1/3² = 0.889.
3 A coffee maker is regulated so that it takes an average of 5.8 min to brew a cup of coffee with a standard deviation of 0.6 min. What proportion of the time will it take
(a) between 4.8 and 6.8 minutes
(b) less than or equal to 4.8 minutes or greater than or equal to 6.8 minutes?
Solution

4.8 − 5.8 = −1 and 6.8 − 5.8 = 1, so each endpoint is 1 minute from the mean. In standard deviation units this is

d = 1/0.6 = 1.667 standard deviations.

(a) proportion of time between 4.8 and 6.8 minutes ≥ 1 − 1/1.667² = 1 − 0.36 = 0.64.

(b) From the answer to (a) and the fact that

proportion(between 4.8 and 6.8 minutes) + proportion(≤ 4.8 minutes or ≥ 6.8 minutes) = 1,

the required proportion is at most 1 − 0.64 = 0.36.
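The Chebychev bound can be sketched as a one-line function, reproducing the examples above:

```python
def chebyshev_lower_bound(d):
    # at least 1 − 1/d² of the values lie within d standard deviations of the mean
    return 1 - 1 / d ** 2

print(chebyshev_lower_bound(2))  # 0.75

# coffee maker: 4.8 and 6.8 are each 1 minute = 1/0.6 standard deviations from 5.8
d = 1 / 0.6
print(round(chebyshev_lower_bound(d), 2))  # 0.64
```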
If it is known that the data set of interest has a bell-shaped clustering pattern of the values,
results that are better than that of Chebychev’s theorem can be obtained. For data with such a
shape
(i) Approximately 68% of data values are within 1 standard deviation of the mean.
(ii) Approximately 95% of data values are within 2 standard deviations of the mean.
(iii) Approximately 99.7% of data values are within 3 standard deviations of the
mean.
Example: Men’s Heights have a bell-shaped distribution with a mean of 69.2 inches and a
standard deviation of 2.9 inches.
Approximately 68% of data values are within 69.2 ± 2.9 = (66.3, 72.1).
Approximately 95% of data values are within 69.2 ± 5.8 = (63.4, 75.0).
Approximately 99.7% of data values are within 69.2 ± 8.7 = (60.5, 77.9).
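The three empirical-rule intervals can be generated in a short loop:

```python
mean, sd = 69.2, 2.9  # men's heights (inches)

# 68-95-99.7 rule for bell-shaped data
for k, pct in [(1, 68), (2, 95), (3, 99.7)]:
    lo, hi = mean - k * sd, mean + k * sd
    print(f"about {pct}% of heights lie in ({lo:.1f}, {hi:.1f})")
```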
2.9.1 Definitions
The ith percentile, Pi, is the value that has i% of the values in a data set less than or equal to it (0 < i ≤ 100).
Examples
4 The 9 deciles D1, D2, . . . , D9 are the values that have 10%, 20%, . . . , 90% respectively
of the values in the data set less or equal to them.
The pth quantile (not to be confused with a quartile), denoted by Qtp, is the point that divides a data set into two groups with 100p% of the data values below it and 100(1 − p)% above it.
Examples
Qt 0.5 = median, Qt 0.25 = first quartile, Qt 0.60 = 6 th decile, Qt 0.85= 85th percentile.
There are many methods (15 according to the article found at the website address below) that can be used to calculate the first and third quartiles (Q1 and Q3).
http://jse.amstat.org/v14n3/langford.html
For raw data the calculations of the first and third quartiles are based on the same principles
as that of the median.
Steps to be followed in calculating the first and third quartiles for raw data.

1 Arrange the data values in ascending order.

2 Find the median.

3 Divide the data set into 2 portions of equal numbers of values – set 1 consists of those values less than or equal to the median and set 2 of those values greater than or equal to the median. When the data set has an odd number of values, the median is excluded from the division of the data set into 2 portions.

4 The first quartile (Q1) is the median of set 1 and the third quartile (Q3) is the median of set 2.
Examples
The distance from home to work (kilometers) of 11 employees at a certain company are
shown below. Calculate Q1 and Q3.
Example 1 Ordered data set: 6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49

2 Median = 40. After this step the median is deleted from the data set.

3 Set 1 – 5 values less than the median i.e., 6, 7, 15, 36, 39.

Set 2 – 5 values greater than the median i.e., 41, 42, 43, 47, 49.

4 Q1 = median of set 1 = 15, Q3 = median of set 2 = 43.
Example 2 Suppose the data set consists of the above values and 56 (12 values).
1 Ordered Data Set: 6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 56
2 median = (40 + 41)/2 = 40.5. Unlike what was done in example 1, no values are deleted from the data set.

3 Set 1 – 6 values less than or equal to the median i.e., 6, 7, 15, 36, 39, 40

Set 2 – 6 values greater than or equal to the median i.e., 41, 42, 43, 47, 49, 56.

4 Q1 = median of set 1 = (15 + 36)/2 = 25.5, Q3 = median of set 2 = (43 + 47)/2 = 45.
Note: Another approach when n is odd is to include the median in both halves when calculating the two quartiles. Using this approach Q1 will be the median of 6, 7, 15, 36, 39, 40, which is (15 + 36)/2 = 25.5, and Q3 will be the median of 40, 41, 42, 43, 47, 49, which is (42 + 43)/2 = 42.5. This gives slightly different answers to the approach where the median is excluded. This approach for calculating the quartiles is also known as the Tukey hinges method.
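Both quartile conventions can be sketched in one helper (an illustration; the function and flag names are our own):

```python
def quartiles(values, tukey=False):
    """Q1 and Q3 by splitting at the median; tukey=True includes the
    median in both halves when n is odd (Tukey hinges)."""
    s = sorted(values)
    n = len(s)

    def med(v):
        m = len(v) // 2
        return v[m] if len(v) % 2 else (v[m - 1] + v[m]) / 2

    half = n // 2
    if n % 2 == 0:
        lower, upper = s[:half], s[half:]
    elif tukey:
        lower, upper = s[:half + 1], s[half:]
    else:
        lower, upper = s[:half], s[half + 1:]
    return med(lower), med(upper)

data = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
print(quartiles(data))              # (15, 43): median excluded
print(quartiles(data, tukey=True))  # (25.5, 42.5): Tukey hinges
```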
The quartile deviation = Q = (Q3 − Q1)/2 can also be used as a measure of variability.

For the data set in example 1, quartile deviation = Q = (43 − 15)/2 = 14.
The quartile deviation value shows the extent to which the values in the data set deviate from
the median. For a skew data set (heavy clustering at lower or upper end of the scale) the
quartile deviation is a more appropriate measure of variability than the standard deviation
(which is more suitable as a measure of variability for symmetric data sets).
A formula for calculating the ith percentile Pi for grouped data is shown below.

Pi = Li + c(n∗i/100 − Fless)/fi , i = 1, 2, . . . , 100, where

Li – lower boundary of the class containing Pi
fi – frequency of the class containing Pi
Fless – cumulative frequency of the classes below the one containing Pi
n – sample size
c – class width.
Examples
class
boundaries f cumulative frequency
37.5-41.5 4 4
41.5-45.5 10 14
45.5-49.5 8 22
49.5-53.5 15 37
53.5-57.5 9 46
57.5-61.5 3 49
61.5-65.5 1 50
Total 50
1 Median (= P50).

Step 1: Calculate position of median = i∗n/100 = 50∗50/100 = 25.

Step 2: Median class (class that contains the 25th observation) is the class 49.5-53.5.

Median = 49.5 + (25 − 22)∗4/15 = 50.3.
First quartile

Step 1: Calculate position of first quartile = i∗n/100 = 25∗50/100 = 12.5.

Step 2: First quartile class (class that contains the 12.5th observation) is the class 41.5-45.5.

Q1 = 41.5 + (12.5 − 4)∗4/10 = 44.9.
Third quartile

Step 1: Calculate position of third quartile = i∗n/100 = 75∗50/100 = 37.5.

Step 2: Third quartile class (class that contains the 37.5th observation) is the class 53.5-57.5.

Q3 = 53.5 + (37.5 − 37)∗4/9 = 53.72.
Fourth decile (= P40)

Step 1: Calculate position of 4th decile = i∗n/100 = 40∗50/100 = 20.

Step 2: 4th decile class (class that contains the 20th observation) is the class 45.5-49.5.

D4 = 45.5 + (20 − 14)∗4/8 = 48.5.
65th Percentile

Step 1: Calculate position of 65th percentile = i∗n/100 = 65∗50/100 = 32.5.

Step 2: 65th percentile class (class that contains the 32.5th observation) is the class 49.5-53.5.

P65 = 49.5 + (32.5 − 22)∗4/15 = 52.3.
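The grouped-percentile formula can be sketched as a single function, checked against the temperature distribution (the function name and argument layout are our own):

```python
def grouped_percentile(i, boundaries, freqs):
    """P_i = L + c*(n*i/100 − F_less)/f.
    boundaries: class lower boundaries plus the final upper boundary."""
    n = sum(freqs)
    target = n * i / 100  # position of the percentile
    cum = 0  # cumulative frequency below the current class
    for k, f in enumerate(freqs):
        if cum + f >= target:
            L = boundaries[k]
            c = boundaries[k + 1] - boundaries[k]
            return L + c * (target - cum) / f
        cum += f

bounds = [37.5, 41.5, 45.5, 49.5, 53.5, 57.5, 61.5, 65.5]
freqs = [4, 10, 8, 15, 9, 3, 1]
print(round(grouped_percentile(50, bounds, freqs), 1))  # 50.3 (median)
print(round(grouped_percentile(65, bounds, freqs), 1))  # 52.3 (P65)
```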
Example
The following cumulative frequency graph shows the distribution of marks scored by a class
of 40 students in a test.
A five number summary of a data set is a summary using the minimum, 1st quartile, median,
3rd quartile and maximum as summary measures.
Example

92, 104, 93, 98, 112, 145, 88, 90, 104, 119, 101, 95, 154

type	value(s)
minimum	88
1st quartile	93
median	101
3rd quartile	112
maximum	154

In the above the 1st and 3rd quartiles were calculated according to the Tukey hinges definitions.
The difference 2Q = Q3 − Q1 is known as the interquartile range and Q (above) as the semi-interquartile range.
Box-and-Whisker plot
The “box” portion of this graph has the 1st and 3rd quartiles defining its lower and upper
limits and the median plotted at its position in between these limits.
Step 1: Calculate Q* = Q1 − 3Q and Q** = Q3 + 3Q.

Step 2: Draw the lower whisker from the box to the smallest data value greater than or equal to Q*, and the upper whisker to the largest data value less than or equal to Q**.

Step 3: Plot data values smaller than Q* or larger than Q** as separate points (outliers) beyond the whiskers.
Example
For the IQ data set (see previous example) Q1 = 93, median = 101 and Q3 = 112 define the
“box” portion.
Q = (112 − 93)/2 = 9.5, Q* = 93 − 3∗9.5 = 64.5, Q** = 112 + 3∗9.5 = 140.5.
Since Q** = 140.5 < maximum = 154 and Q** = 140.5 < 2nd largest = 145, there are 2
outliers and upper whisker = 119 (largest value less or equal than 140.5).
In the plot the maximum = 154 and 2nd largest = 145 are shown as separate points (outliers)
above the upper whisker.
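The outlier rule used in this plot (Q* = Q1 − 3Q, Q** = Q3 + 3Q) can be sketched as follows, using Tukey hinges for the quartiles as in the IQ example:

```python
def fences_and_outliers(values):
    """Return (Q*, Q**) and the values falling outside them.
    Quartiles are Tukey hinges (median included in both halves for odd n)."""
    s = sorted(values)
    n = len(s)
    half = n // 2

    def med(v):
        m = len(v) // 2
        return v[m] if len(v) % 2 else (v[m - 1] + v[m]) / 2

    lower = s[:half + (n % 2)]  # lower half, median included when n is odd
    upper = s[half:]
    q1, q3 = med(lower), med(upper)
    q = (q3 - q1) / 2           # quartile deviation
    lo, hi = q1 - 3 * q, q3 + 3 * q
    return lo, hi, [x for x in s if x < lo or x > hi]

iq = [92, 104, 93, 98, 112, 145, 88, 90, 104, 119, 101, 95, 154]
print(fences_and_outliers(iq))  # (64.5, 140.5, [145, 154])
```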
Unusually large or small values (such as this maximum) are called outliers.
A Box-and-Whisker plot can also be used to assess the skewness (departure from symmetry)
of a variable. For positively skewed data most of the values are at the lower end of the scale
(mean > median, “box” section of the plot towards the lower end of the scale) and for
negatively skewed data most of the values are at the upper end of the scale (mean < median,
“box” section of the plot towards the upper end of the scale). In the previous example the data
set is positively skew.
When several data sets are to be compared, several Box-and-Whisker plots can be plotted
side-by-side.
Example
The Box-and-Whisker plot shown below enables one to compare delays in departing flights
(in minutes) for certain days in December (16th to the 26th).
For all the days the data sets are positively skewed (data sets all have the “box” section closer
to the lower end of the scale with a long upper whisker). This means that there are short
delays in flight departures on all the days. The long upper whiskers that are visible show that
there were some quite late departures on 16, 17, 21, 22, 23, 24 and 25 December.
2.11 Skewness
x1 =0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9,
10
x2 = 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8,
9, 10
x3 = 0, 1, 2, 2, 3, 4, 5, 5, 5, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10,
10,10
(Boxplots and histograms of x1, x2 and x3 omitted.)
1 The following is the output of computer calculations (using Excel) of the various measures of description for the data set in example 3 in section 2.3.

Mean	7.939874
Standard Error	0.217106
Median	7.886565
Standard Deviation	1.681697
Sample Variance	2.828106
Skewness	-0.02094
Range	6.41369
Minimum	4.90014
Maximum	11.31383
Sum	476.3924
Count	60
2 The following is a stem-and-leaf plot for the data (rounded to 1 decimal place) in example
3 in section 2.3 obtained from SPSS. The pattern of the plot is bimodal.
1.00 4 . 9
11.00 5 . 01145556899
4.00 6 . 5788
15.00 7 . 001223444577788
12.00 8 . 011133455888
8.00 9 . 02335668
8.00 10 . 11113379
1.00 11 . 3
Chapter 3 – Probability
3.1 Terminology
Probability (Chance)
A probability is the chance that something of interest (called an event) will happen.
A probability is usually expressed as a proportion (range of values from 0 to 1) but can also
be expressed as a percentage (range of values from 0 to 100).
Examples
2 The probability of winning the Lotto is 1/13983816 (0.00000715 %).
Random experiment
This is an experiment that gives different outcomes when repeated under similar conditions.
The outcome that will occur when the experiment is performed depends on chance.
Examples
4 Drawing a card from a deck of cards (possible outcomes: 13 hearts, 13 clubs, 13 spades,
13 diamonds).
Set
Sample space
The sample space is the set of all possible outcomes of a random experiment.
A sample space is usually denoted by the symbol S and the collection of elements contained
in S enclosed in curly brackets { }.
Sample point
Examples
5 Drawing a card from a deck of cards. The elements in the sample space are listed below.
Event
An event is a subset of a sample space i.e., a collection of sample points taken from a sample
space.
Impossible event
Certain event
Simple events are events that involve only one outcome at a time.
Examples
1 Let E denote the event “an odd number is obtained when tossing a single die”.

2 Let H denote the event “at least one head appears when tossing two coins”.

3 Let C denote the event “at most one head appears when tossing two coins” and D the event “at least one head appears when tossing two coins”.
4 Let B denote the event “obtaining a club and a heart in a single draw from a deck of
cards”. The event B is impossible. The set of outcomes of B is an empty set denoted by
B = { } = φ.
5 Let A denote the event “obtaining a 1, 2, 3, 4, 5 or 6 when tossing a single die”. The event
A is a certain event i.e., one of the outcomes belonging to the set describing the event must
happen. This is denoted by A = S, where S is the sample space.
Venn diagrams
A Venn diagram is a drawing, in which circular areas represent groups of items usually
sharing common properties.
The drawing consists of two or more circles, each representing a specific group or set,
contained within a square that represents the sample space. Venn diagrams are often used as a
visual display when referring to sample spaces, events and operations involving events.
Compound events are events that involve more than one event. Such events can be obtained
by performing various operations involving two or more events.
Some of the operations that can be performed are described in the sections that follow.
Complementary events
The complementary event Ā of an event A is all the outcomes in S that are not in A (purple part of the diagram below).
Examples
1 Consider the experiment of tossing a single die. S = {1, 2, 3, 4, 5, 6}. The complement of
the event A = “obtaining a 3 or less” = {1, 2, 3} is Ā = “obtaining a 4 or more” = {4, 5, 6}.
2 Consider the experiment of tossing two coins. S = {hh, ht, th, tt}. The complement of the
event H = “at least one head”= {hh, ht, th} is H̄= “no heads” = {tt}.
The union of two events A and B denoted by A ∪B is the set of outcomes that are in A or
in B or in both A and B i.e., the event “either A or B or both A and B occur”. The event
A ∪B can also be interpreted as the event “at least one of A or B occurs”.
The intersection of two events A and B denoted by A ∩B is the set of outcomes that are in
both A and B i.e., the event “both A and B occur”.
These definitions involving two events can be extended to ones involving 3 or more events e.g., for the 3 events A1, A2 and A3 the event A1 ∪ A2 ∪ A3 is the event “at least one of A1, A2 or A3 occurs” and A1 ∩ A2 ∩ A3 the event “A1 and A2 and A3 all occur”.
Examples
Two events A and B are mutually exclusive (disjoint) if they have no elements (outcomes) in
common. This also means that these events cannot occur together.
Examples
1 Let B be the event “drawing a black card from a deck of cards” and R the event “drawing a red card from a deck of cards”.

The events B and R have no outcomes in common i.e., B ∩ R = φ (empty set). Hence B and R are mutually exclusive.
2 Let E be the event “an even number with a single throw of a die” and O the event “an odd
number with a single throw of a die”.
E and O have no outcomes in common i.e., E∩O=φ and are therefore mutually exclusive.
If there are n equally likely outcomes in total, of which m are favourable to an event A, then the probability of occurrence of A, denoted by P(A), is given by

P(A) = N(A)/N(S) = m/n,

where N(A) = m is the number of outcomes favourable to the event A and N(S) = n the number of outcomes in the sample space S i.e., the total number of outcomes.
Examples
2 Two dice are rolled. Find the probability that a sum of 7 will occur.

Solution: The number of sample points in S is 36 (see example 3 under sample space). The outcomes favourable to a sum of 7 are (1,6), (2,5), (3,4), (4,3), (5,2) and (6,1), so P(sum of 7) = 6/36 = 1/6.
The classical definition of probability requires the assumption that all the outcomes in the
sample space are equally likely. If this assumption is not met, this formula cannot be used.
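For a small equally likely sample space the classical formula can be checked by enumeration; a sketch for the two-dice example:

```python
from itertools import product

# the 36 equally likely (die1, die2) outcomes
outcomes = list(product(range(1, 7), repeat=2))
favourable = [o for o in outcomes if sum(o) == 7]

# classical formula: P(A) = N(A)/N(S)
print(len(favourable), len(outcomes))   # 6 36
print(len(favourable) / len(outcomes))  # ≈ 0.1667 (= 1/6)
```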
Example
The possible temperatures (degrees Celsius) at a certain location on a particular day are 21, 22, 23, 24, 25, 26 and 27. Using the classical formula,

P(temperature = 22) = 1/7

would be incorrect if all the temperature values are not equally likely e.g., suppose that over the past year these temperatures occurred the following numbers of times.

Temperature	21	22	23	24	25	26	27	Total
Number of days	15	16	20	25	19	21	14	130
Estimated Prob.	0.115385	0.123077	0.153846	0.192308	0.146154	0.161538	0.107692	1
If an experiment is repeated n times and an event A is observed f times, then the estimated
probability of occurrence (empirical probability) of an event A is given by
P(A) = f/n.
Note: This formula differs from the classical formula in the sense that the classical formula
uses all the outcomes in the sample space as the total number of outcomes, while the relative
frequency formula uses the number of repetitions (n) of the experiment as the total number of
outcomes. In the classical formula the number of outcomes in the sample space is fixed,
while the number of repetitions of an experiment (n) can vary. It can be shown that the
empirical probability is a good approximation of the true probability when n is sufficiently
large (Law of large numbers).
Examples
1 A bent coin is tossed 1000 times with heads coming up 692 times.

An estimate of P(heads) is 692/1000 = 0.692.
mark f
less than 30 6
30-39 26
40-49 45
50-59 64
60-69 82
70-79 37
80-89 22
90-99 8
Total 290
From the table (using the empirical formula) the following probabilities can be estimated.
(a) P(mark less than 40) = (6 + 26)/290 = 0.110.

(c) P(above 80) = (22 + 8)/290 = 0.103.
Probabilities involving the occurrence of single events are called marginal probabilities. Probabilities involving the occurrence of two or more events are called joint probabilities e.g., P(A ∩ B) is the joint probability of both A and B occurring.
Example
The preference probabilities according to gender for 2 different brands of a certain product
are summarized in the table below.
P(brand 2) = 0.40.
The gender marginal probabilities are obtained by summing the joint probabilities over the
brands. The brand marginal probabilities are obtained by summing the joint probabilities over
the genders.
The computation of probabilities using the classical definition involves counting the number
of outcomes favourable to the event of interest (say event A) and the total number of possible
outcomes in the sample space. The following formulae can be used to count numbers of
outcomes to be used in the classical definition formula.
Addition formula: If an experiment can be performed in n ways, and another experiment can
be performed in m ways then either of the two experiments can be performed in (m+n) ways.
This rule can be extended to any finite number of experiments. If one experiment can be done
in n1 ways, a second one in n2 ways, . . . , a kth one in nk ways, then one of the k
experiments can be done in n1 + n2 +. . . + nk ways.
Example: Suppose there are 3 doors in a room, 2 on one side and 1 on the other side. A person wants to go out of the room. Obviously, he/she has 3 options to go out. He/she can go out by any one of the 2 + 1 = 3 doors.
Multiplication formula: If one experiment can be performed in n ways and a second experiment in m ways, then the two experiments together can be performed in n∗m ways.

This rule can be extended to any finite number of experiments. If one experiment can be done in n1 ways, a second one in n2 ways, . . . , a kth one in nk ways, then the k experiments together can be done in n1∗n2∗ . . . ∗nk ways.
Example 1: A basic meal consists of soup, a sandwich and a beverage. If a person having this meal has 3 choices of soup, 4 choices of sandwiches and a choice of coffee or tea as a beverage, how many such meals are possible?

Number of meals = 3∗4∗2 = 24.
Example 2: A PIN to be used at an ATM can be formed by selecting 4 digits from the digits 0, 1, 2, . . . , 9. How many choices of PIN are there if

(a) digits may be repeated? Number of PINs = 10∗10∗10∗10 = 10 000.

(b) digits may not be repeated? Number of PINs = 10∗9∗8∗7 = 5 040.
Factorial notation

n! (“n factorial”) = n∗(n−1)∗ . . . ∗2∗1 counts the number of ways in which n distinct items can be arranged in order e.g., 2 items can be arranged in 2 x 1 = 2 ways and 3 items in 3 x 2 x 1 = 6 ways.

Note: 1! = 1, 0! = 1.

Examples

1 Arranging 7 items in order: no of ways = 7 x 6 x 5 x . . . x 2 x 1 = 7! = 5040.

2 Arranging 5 items in order: no of ways = 5 x 4 x 3 x 2 x 1 = 5! = 120.
A permutation is an arrangement of a group of items where order matters. The number of permutations of r items selected from n items is

nPr = P(n, r) = n!/(n − r)!.
A combination is the number of different selections of a group of items where order does
not matter.
nCr = C(n, r) = n!/[(n − r)! r!].
Examples: 1 Four people (A, B, C, D) serve on a board of directors. A chairman and vice-
chairman are to be chosen from these 4 people. In how many ways can this be done?
Chairman	Vice-chairman
A B
B A
A C
C A
A D
D A
B C
C B
B D
D B
C D
D C

Number of ways = 12.
2 Four people (A, B, C, D) serve on a board of directors. Two people are to be chosen from
them as members of a committee that will investigate fraud allegations. In how many ways
can this be done?
Number of ways = 6.
In both these examples a choice of 2 people from 4 people is made. In example 1 the order of
choice of the 2 people matters (since the one person chosen is chairman and the other one
vice-chairman). In example 2 the order does not matter. The only interest is in who serves on
the committee.
Application of formulae.

In question 1 the permutations formula applies with n = 4, r = 2.

Number of ways = P(4, 2) = 4!/(4 − 2)! = 12.
In question 2 the combinations formula applies with n = 4, r =2.
Number of ways = C(4, 2) = 4!/[2!(4 − 2)!] = 6.
3 Find the number of ways to take 4 people and place them in groups of 3 at a time where
order does not matter.
Solution:
Since order does not matter, use the combination formula.
C(4, 3) = 4!/[3!(4 − 3)!] = 24/6 = 4.
4 Find the number of ways to arrange 6 items in groups of 4 at a time where order matters.
Solution:
P(6, 4) = 6!/(6 − 4)! = 720/2 = 360.
There are 360 ways to arrange 6 items taken 4 at a time when order matters.
5 Find the number of ways to take 20 objects and arrange them in groups of 5 at a time
where order does not matter.
Solution:
C(20, 5) = 20!/[5!(20 − 5)!] = (20∗19∗18∗17∗16)/(1∗2∗3∗4∗5) = 15504.
There are 15 504 ways to arrange 20 objects taken 5 at a time when order does not matter.
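Python's math module (3.8+) provides these counts directly; a quick check of the worked answers above:

```python
from math import comb, perm

print(perm(4, 2))   # 12: ordered choice of chairman and vice-chairman
print(comb(4, 2))   # 6: unordered committee of 2
print(perm(6, 4))   # 360
print(comb(20, 5))  # 15504
```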
6 Determine the total number of five-card hands that can be drawn from a deck of 52 cards.

Solution:

When a hand of cards is dealt, the order of the cards does not matter. Thus, the combinations formula is used. There are 52 cards in a deck, and we want to know in how many ways we can draw them in groups of five at a time when order does not matter. Using the combination formula gives

C(52, 5) = 52!/[5!(52 − 5)!] = 2 598 960.

9 In how many ways can the 6 winning numbers in a Lotto draw be selected?

Solution: The order of selection does not matter, so the number of ways = C(49, 6) = 13 983 816 (6 winning numbers selected from the 49 available numbers).
Therefore, there are 24 different ways in which to deal the desired hand.
11 How many different 5-card hands include 4 of a kind and one other card?
Solution:
We have 13 different ways to choose 4 of a kind: 2's, 3's, 4's, … Queens, Kings and Aces.
Once a set of 4 of a kind has been removed from the deck, 48 cards are left.
Remember OR means add.
The possible situations that will satisfy the above requirement are:
4 Aces and one other card C(4,4)*C(48,1) = 48.
or 4 Kings and one other card C(4,4)*C(48,1) = 48.
or 4 Queens and one other card C(4,4)*C(48,1) = 48.
.
.
.
or 4 twos and one other card C(4,4)*C(48,1) = 48.
Total of 48*13 = 624 ways.
12 A local delivery company has three packages to deliver to three different houses. If the packages are delivered at random to the three houses, in how many ways can at least one house get the wrong package?
Solution

The first package can be delivered to any of the 3 houses. Given that the first package was delivered, the second package can be delivered to either of the 2 remaining houses. Given that the first two packages were delivered, the third package can be delivered to only one house. The total number of ways of delivering the packages is therefore 3∗2∗1 = 6.

There is only one way in which all 3 packages can be delivered to the correct house.
The event “at least one house gets the wrong package” is the complement of the event “all 3 packages are delivered to the correct house” (why?). The number of ways at least one house gets the wrong package is therefore 6 − 1 = 5.
Complementary events

P(Ā) = 1 – P(A).

Addition formula

P(A ∪ B) = P(A) + P(B) for mutually exclusive events
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) for events that are not mutually exclusive.

Proof: P(A ∪ B) = P(Ā ∩ B) + P(A ∩ B) + P(A ∩ B̄)
= P(B) − P(A ∩ B) + P(A ∩ B) + P(A) − P(A ∩ B)
= P(A) + P(B) − P(A ∩ B).
These formulae can be extended to probabilities involving more than two events e.g.,
for 3 events A, B and C defined on some sample space
This formula can easily be verified with the aid of the Venn diagram shown below.
From the above diagram the following sets can be written down.
The result for P( A∪B∪C ) can also be theoretically proved by applying the result for
P( A∪B ) (for non mutually exclusive events) more than once.
De Morgan’s Laws

1 P(Ā ∩ B̄) = 1 − P(A ∪ B), since Ā ∩ B̄ is the complement of A ∪ B.

2 P(Ā ∪ B̄) = 1 − P(A ∩ B), since Ā ∪ B̄ is the complement of A ∩ B.
These formulae can be verified from the Venn diagram shown below.
The above formulae can be extended to probabilities involving more than two events.
Examples
1 There are two telephone lines A and B. Line A is engaged 50% of the time and line B is
engaged 60% of the time. Both lines are engaged 30% of the time. Calculate the probability
that
Solution
Let E1 denote the event “line A is engaged” and E2 the event “line B is engaged”.
(a) P(at least one of the lines is engaged) = P(E1 ∪ E2) = P(E1) + P(E2) – P(E1 ∩ E2) = 0.5 + 0.6 − 0.3 = 0.8.

(b) P(none of the lines is engaged) = 1 – P(at least one of the lines is engaged) = 1 − 0.8 = 0.2.

(d) The event “line A is engaged, but line B is not engaged” can be written in symbols as E1 ∩ Ē2.

P(E1 ∩ Ē2) = P(E1) – P(E1 ∩ E2) = 0.5 − 0.3 = 0.2. (Using the total probability formula)
69
(e) P(only one line is engaged) = P(line A is engaged, but line B is not engaged) +
P(line B is engaged, but line A is not engaged)
= P( E 1 ∩ Ē 2 )+ P( Ē 1∩E 2 )
P( Ē1 ∩E2 ) = P(E2) - P(E1 ¿ E2) = 0.6-0.3 = 0.3. (Using the total probability
formula)
2 Let O be the event that a certain lecturer will be in his/her office on a particular afternoon
and L the event that he/she will be at a lecture. Suppose P(O) = 0.48 and P(L) = 0.27.
Solution
(a) Ō is the event that the lecturer will not be in his/her office on a particular afternoon.
Ō∩ L̄ is the event that the lecturer will not be in his/her office and that the lecturer will not
be at a lecture i.e., that the lecturer will be neither in his/her office nor at a lecture.
(b) P( Ō∩ L̄ ) = P(complement of O∪L ) = 1 − P( O∪L ) (using De Morgan’s law and the
complementary probability formula).
3 A batch of 20 computers contains 3 that are faulty. Four (4) computers are selected at
random without replacement from this batch. Calculate the probability that
Solution
There are C(20,4) = 4845 [why not P(20,4)?] ways of selecting the 4 computers from the
batch of 20. Since random selection is used, all 4845 selections are equally likely. Let A
denote the event “none of the 4 computers selected is faulty” and B the event “at least 2 of
the computers selected are faulty”.
(a) P(A) = N(A)/N(S) = C(17,4)/C(20,4) = 2380/4845 = 0.4912.
(b) P(B) = N(B)/N(S) = [N(2 faulty) + N(3 faulty)]/N(S) = [C(17,2)×C(3,2) + C(17,1)×C(3,3)]/4845
= (136×3 + 17×1)/4845 = 425/4845 = 0.0877.
Conditional probability
The conditional probability of an event A occurring given that another event B has occurred
is given by
P(A | B) = P( A∩B ) / P(B) , where P(B) > 0.
Also, P(B | A) = P( A∩B ) / P(A) , where P(A) > 0.
Examples
Example 1
Five hundred (500) TV viewers consisting of 300 males and 200 females were asked whether
they were satisfied with the news coverage on a certain TV channel. Their replies are
summarized in the table below.
            satisfied   not satisfied   Total
male           180           120         300
female          90           110         200
Total          270           230         500

P(satisfied | male) = 180/300 = 0.6.
P(satisfied | female) = 90/200 = 0.45.
P(not satisfied | male) = 120/300 = 1 − 180/300 = 0.4.
P(not satisfied | female) = 110/200 = 1 − 90/200 = 0.55.
P(satisfied) = 270/500 = 0.54 and P(not satisfied) = 1 – 0.54 = 0.46.
Note
2 The probability of a person being satisfied depends on the gender of the person being
interviewed. In this case females are less satisfied than males with the news coverage.
Example 2
At a certain university the probability of passing accounting is 0.68, the probability of passing
statistics 0.65 and the probability of passing both statistics and accounting is 0.57. Calculate
the probability that a student
(c) passes statistics when it is known that he/she did not pass accounting.
Solution
Let A denote the event “a student passes accounting” and B the event “a student passes
statistics”. Then Ā is the event “a student did not pass accounting”, A ∩B the event “a
student passes both statistics and accounting” and Ā∩B the event “a student passes
statistics, but not accounting”.
(a) P(B|A) = P( A∩B ) / P(A) = 0.57/0.68 = 0.838.
(b) P(A|B) = P( A∩B ) / P(B) = 0.57/0.65 = 0.877.
(c) P(B| Ā ) = P( Ā∩B ) / P( Ā ) = (0.65 − 0.57)/(1 − 0.68) = 0.08/0.32 = 0.25.
Multiplication rule
From the conditional probability formulae it follows that P( A∩B ) = P(A) P(B|A) = P(B) P(A|B).
Examples
1 A box has 12 bulbs, 3 of which are defective. If two bulbs are selected at random without
replacement, then what is the probability that both are defective?
Solution
Let d1 denote the event “the first bulb is defective” and d2 the event “the second bulb is
defective”.
Then P(d1) = 3/12 and P(d2|d1) = 2/11. Using the above-mentioned multiplication formula,
P( d1∩d2 ) = P(d1) P(d2|d1) = (3/12)(2/11) = 0.045.
2 Two cards are drawn at random from a deck of playing cards. What is the probability
that both these cards are aces?
Solution
Since there are 4 aces in a deck of 52 cards, the probability of drawing one ace is 4/52.
Having removed one ace and not replacing it reduces the probabilities of drawing another ace
on the second draw. The 51 cards remaining contain 3 aces and therefore the probability of
drawing an ace on the second draw is 3/51. Multiplying these probabilities gives the
probability of drawing two aces: (4/52)(3/51) = 1/221 = 0.0045.
The multiplication rule can be extended to involve more than 2 events e.g., for 3 events A1,
A2 and A3 defined on the same sample space,
P( A1∩A2∩A3 ) = P(A1) P(A2|A1) P(A3| A1∩A2 ).
3 Three cards are drawn at random from a deck of playing cards. What is the
probability that all 3 of these cards are aces?
Solution
P(all 3 cards are aces) = (4/52)(3/51)(2/50) = 24/132600 = 0.00018.
Independent events
Two events A and B are said to be independent if P(A|B) = P(A) or P(B|A) = P(B).
This means that the occurrence of B does not affect the probability that A occurs.
Substituting the above results into the multiplication formula for two events gives P( A∩B ) = P(A) P(B).
Example
1 The probability that person A will be alive in 20 years is 0.7 and the probability that
person B will be alive in 20 years is 0.5, while the probability that they will both be alive in
20 years is 0.45. Are the events E1 “A is alive in 20 years” and E2 “B is alive in 20 years”
independent?
Solution
Since P(E1) P(E2) = 0.7 × 0.5 = 0.35 ≠ 0.45 = P( E1∩E2 ), the events E1 and E2 are not independent.
2 Two coins are tossed. Assuming that both coins are unbiased, P(1st coin is heads) = P(2nd coin is heads) = ½.
Since P(1st coin is heads) × P(2nd coin is heads) = ½ × ½ = ¼ = P(both coins show heads), the
events “heads on the first coin” and “heads on the second coin” are independent.
The multiplication rule for independent events can be extended to involve more than 2
events. In general, if the events A1, A2, . . . , An are independent then
P( A1∩A2∩ . . . ∩An ) = P(A1) P(A2) . . . P(An).
Examples
1 A coin is tossed and a single 6-sided die is rolled. Find the probability of getting “heads”
on the coin and rolling a 3 with the die.
Since the results of the coin and the die are independent,
P(heads and 3) = P(heads) P(3) = (1/2)(1/6) = 1/12.
2 A school survey found that 9 out of 10 students like pizza. If three students are chosen at
random with replacement, what is the probability that all three students like pizza?
P(student 1 likes pizza) = 9/10 = P(student 2 likes pizza) = P(student 3 likes pizza).
P(student 1 likes pizza and student 2 likes pizza and student 3 likes pizza)
= P(student 1 likes pizza) × P(student 2 likes pizza) × P(student 3 likes pizza) = (9/10)³ = 0.729.
3 It is known that 8% of all cars of a certain make that are sold encounter engine
overheating problems within 50 000 kilometers of travel. During the past week 4 such cars
were sold. Suppose that engine overheating problems for the 4 cars are encountered
independently. What is the probability that
(a) all 4 (b) none (c) at least one of these cars sold encounter engine overheating problems
within 50 000 kilometers of travel?
Solution
Let A denote the event “a car encounters engine overheating problems within 50 000 kilometers
of travel”. Then P(A) = 0.08 for each car.
(a) P(all 4) = 0.08⁴ = 0.000041.
(b) P(none) = 0.92⁴ = 0.7164.
(c) P(at least one) = 1 − P(none) = 1 − 0.7164 = 0.2836.
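Since the overheating events are assumed independent, the three answers reduce to products of 0.08 and 0.92. A minimal sketch of that calculation:

```python
# Independent overheating events: p = 0.08 for each of the 4 cars sold
p = 0.08
q = 1 - p

p_all = p ** 4                 # (a) all 4 overheat: p*p*p*p
p_none = q ** 4                # (b) none overheat
p_at_least_one = 1 - p_none    # (c) complement of "none"

print(round(p_all, 6), round(p_none, 4), round(p_at_least_one, 4))
```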
Bayes’ theorem
To calculate
P(A|B) = P( A∩B ) / P(B) ,
values for P( A∩B ) and P(B) are needed.
Suppose that only the values for P(A), P(B|A) and P(B| Ā ) are available.
In this case the probabilities [ P( A∩B ) and P(B) ] required for calculating P(A|B) can be
calculated from
P( A∩B ) = P(A) P(B|A)
and
P(B) = P( A∩B ) + P( Ā∩B ) = P(A) P(B|A) + P( Ā ) P(B| Ā )
(using the total probability formula and the conditional probability multiplication formula).
Substituting these probabilities into the first conditional probability formula gives
P(A|B) = P(A) P(B|A) / [ P(A) P(B|A) + P( Ā ) P(B| Ā ) ].
This result is known as Bayes’ theorem after the person who proposed the method.
Example
When testing a person for a certain disease, the test can show either a positive result
(indicating that the person has the disease) or a negative result (indicating that the person
does not have the disease).
When a person has the disease, the test shows positive 99% of the time. When the person
does not have the disease, the test shows negative 95% of the time. Suppose it is known that
only 0.1% of the people in the population have the disease.
(a) If a test turns out to be positive, what is the probability that the person has the disease?
(b) If the test turns out to be negative, what is the probability that the person does not have
the disease?
Solution
Let A be the event “the person has the disease”, and B be the event “the test returns a positive
result”.
Then Ā is the event “the person does not have the disease”, B|A is the event “the test is
positive given the person has the disease”, B| Ā the event “the test is positive given the
person does not have the disease” and B̄| Ā the event “the test is negative given the person
does not have the disease”.
(a) P(A) = 0.001 (given) , P( Ā ) = 1- P(A) = 0.999, P(B|A) = 0.99 (given), P( B̄| Ā ) =0.95
(given), P(B| Ā ) = 1- P( B̄| Ā ) = 0.05.
P(B) = P( A∩B )+P( Ā∩B ) = P(A) P(B|A) + P( Ā ) P(B| Ā ) = 0.001 x 0.99 + 0.999 x 0.05
= 0.00099 + 0.04995
= 0.05094
unconditional probability   conditional probability   product
0.001                   ×   0.99                  =   0.00099
0.999                   ×   0.05                  =   0.04995
                                              sum     0.05094
P(A|B) = P( A∩B ) / P(B) = 0.00099/0.05094 = 0.0194.
(b) P( Ā | B̄ ) = P( Ā∩B̄ ) / P( B̄ ) = (0.999 × 0.95)/(1 − 0.05094) = 0.94905/0.94906 = 0.99999.
From the above it follows that a negative result of the test is very reliable (it will be wrong
only about 105 times in 10 million cases). On the other hand, the chance that a person will have
the disease when the result of the test is positive is only 194 in 10 000.
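Bayes' theorem for this example can be sketched in a few lines; the numbers below are the prevalence, sensitivity and specificity quoted in the text.

```python
# Bayes' theorem for the disease-testing example
p_disease = 0.001             # P(A): prevalence of the disease
p_pos_given_disease = 0.99    # P(B|A): test positive when diseased
p_neg_given_healthy = 0.95    # test negative when not diseased

p_pos_given_healthy = 1 - p_neg_given_healthy            # 0.05

# Total probability: P(B) = P(A)P(B|A) + P(not A)P(B|not A)
p_pos = p_disease * p_pos_given_disease + (1 - p_disease) * p_pos_given_healthy

# Posterior: P(A|B) = P(A)P(B|A) / P(B)
p_disease_given_pos = p_disease * p_pos_given_disease / p_pos

print(round(p_pos, 5), round(p_disease_given_pos, 4))    # 0.05094 0.0194
```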
Odds
Let a to b be the odds in favour of some event A. Suppose P(A) = p. Then P( Ā ) = 1 − p. The
odds in favour of A are then defined by
a/b = p/(1 − p).
From the above it can be shown that p = a/(a+b) and 1 − p = b/(a+b).
The odds against A are b to a i.e.,
b/a = (1 − p)/p.
Examples
1 A pair of balanced dice is tossed. What are the odds in favour of getting a sum of 6?
Possible ways of getting a sum of 6 : (1, 5), (2, 4), (3, 3), (4, 2), (5, 1).
P(sum of 6) = 5/36, so the odds in favour of a sum of 6 are 5 to 31.
2 Suppose the odds in favour of the ball in a roulette game landing on a red number are 18 to 19. Then
P(red number) = 18/(18 + 19) = 18/37 = 0.486.
3 The table below shows data that were collected from 781 middle aged female patients at a
certain hospital.

                heart problems
                yes     no      Total
smoker  yes     172     173     345
        no       90     346     436
Total           262     519     781

For smokers the odds in favour of heart problems are 172 to 173 or 1 to 1.0058.
For non-smokers the odds in favour of heart problems are 90 to 346 or 1 to 3.8444.
From this it follows that smokers are much more at risk for heart problems than non-smokers.
The following is an output of permutation [P(n,r)], combination [C(n,r)] and factorial (n!)
values calculated by using Excel.
n    r    C(n,r)    P(n,r)
5    3    10        60
6    4    15        360
7    4    35        840
10   6    210       151200
15   10   3003      10897286400
20   12   125970    6.03398E+13
25   13   5200300   3.23824E+16
n n!
3 6
4 24
5 120
6 720
7 5040
8 40320
9 362880
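The same permutation, combination and factorial values can be reproduced with Python's standard library (`math.comb`, `math.perm` and `math.factorial`, available from Python 3.8); this is a quick way to verify spreadsheet output.

```python
from math import comb, perm, factorial

# Reproduce some of the C(n,r) and P(n,r) values from the table above
for n, r in [(5, 3), (6, 4), (7, 4), (10, 6)]:
    print(n, r, comb(n, r), perm(n, r))

print(factorial(7))   # 5040
```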
2 The following is the computer output of a cross classification of the cards in a deck of
cards according to colour and type of card (number, picture, ace).
colour
black red Total
type number card 18 18 36
picture card 6 6 12
ace 2 2 4
Total 26 26 52
From the above table various probabilities can easily be calculated e.g.,
P(number card) = 36/52 = 9/13 , P(picture card and red) = 6/52 = 3/26 ,
P(not number card) = 16/52 = 4/13.
Examples:
1 T = the number of tails (t) obtained when a coin is flipped three times.
2 X = the sum of the values (x) showing when two dice are rolled.
Discrete Random Variables – Variables that have a finite or countable number of possible
values. These variables usually occur in counting experiments.
Continuous Random Variables – Variables that can take on any value in some interval i.e.,
they can take an infinite number of possible values. These variables usually occur in
experiments where measurements are taken.
Examples:
1 The variables T and X from the above examples are discrete random variables.
2 The variables H and V from the above examples are continuous random variables.
A discrete probability distribution is a list of the possible distinct values of the random
variable together with their corresponding probabilities. The probability of the random
variable X assuming a particular value x is denoted by P(X=x) = P(x). This probability,
which is a function of x, is referred to as the probability mass function.
Examples:
1 As above, let T be the random variable that represents the number of tails obtained when a
coin is flipped three times. Then T has 4 possible values 0, 1, 2, and 3. The outcomes of the
experiment and the values of T are summarized in the next table.
Outcomes T
hhh 0
hht, hth, thh 1
tth, tht, htt 2
ttt 3
Assuming that the outcomes are all equally likely, the probability distribution for T is given
in the following table.
t 0 1 2 3 Total
P(t) 1/8 3/8 3/8 1/8 1
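The probability distribution of T can be built by brute-force enumeration of the 8 equally likely outcomes, which is a useful check on the table above:

```python
from itertools import product
from collections import Counter

# Enumerate all 8 equally likely outcomes of flipping a coin 3 times
outcomes = list(product("ht", repeat=3))
tails = Counter(o.count("t") for o in outcomes)

# Probability mass function of T, the number of tails
pmf = {t: n / len(outcomes) for t, n in sorted(tails.items())}
print(pmf)   # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
```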
2 Let Y denote the number of tosses of a coin until heads appear first. Then
y      1    2     3     . . .   Total
P(y)   ½    (½)²  (½)³  . . .   1
3 A pair of dice is tossed. Let X denote the sum of the digits. The probability distribution of
X can be found from the following table. The entry in a particular cell is the sum of row and
column values
1st/2nd 1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 9 10
5 6 7 8 9 10 11
6 7 8 9 10 11 12
x        2     3     4     5     6     7     8     9     10    11    12
P(X=x)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
Note: For any discrete random variable X the values x that it can assume are such that
0 ≤ P(x) ≤ 1 and ∑ P(x) = 1, where the sum is taken over all possible values x.
The cumulative distribution function of X is defined as
F(x) = P(X ≤ x) = ∑ P(r) , where the sum is taken over all values r ≤ x.
Examples
1 For the probability mass function in example 1 the cumulative distribution function is
x      0     1     2     3
F(x)   1/8   1/2   7/8   1
2 For the probability mass function in example 3 the cumulative distribution function is
x      2     3     4     5      6      7      8      9      10     11     12
F(x)   1/36  3/36  6/36  10/36  15/36  21/36  26/36  30/36  33/36  35/36  1
3 Consider a discrete random variable with probability mass function given below.
x        1     2     3     4
P(X=x)   0.1   0.3   0.4   0.2
The graphs above are plots of the probability mass function (graph on the right) and
cumulative distribution function (graph on the left).
A random variable can only take on one value at a time i.e., the events X=x1 and X=x2 for x1
≠ x2 are mutually exclusive. The probability of the variable taking on any number of
different values can be found by simply adding the appropriate probabilities.
Examples
1 Find the probability of getting 2 or more tails when a coin is flipped 3 times.
P(2 or more tails) = P(2) + P(3) = 3/8 + 1/8 = 4/8 = ½.
2 Find the probability of getting at least one tail when a coin is flipped 3 times.
P(at least 1) = P(1) + P(2) + P(3) = 3/8 + 3/8 +1/8 = 7/8 = 1 – P(0) = 1 – 1/8.
3 Find the probability of needing at most 3 tosses of a coin to get the first heads.
P(Y ≤ 3) = P(1) + P(2) + P(3) = ½ + (½)² + (½)³ = 7/8.
4 Find the probability of getting a sum of (a) 7 (b) at least 4 when tossing a pair of dice.
(a) P(7) = 6/36 = 1/6.
(b) P(at least 4) = P(4) + P(5) + . . . + P(12) = 1 − [P(2) + P(3)] = 1 − 3/36 = 33/36 = 11/12.
4.3 Mean (expected value), variance and standard deviation of a discrete random
variable
The mean or expected value of a random variable X is the average value that we would
expect for X when performing the random experiment many times.
E(X) = μ = ∑ x P(x).
Examples
1 E(T) = ∑ t P(t) = 0×(1/8) + 1×(3/8) + 2×(3/8) + 3×(1/8) = 3/2 = 1.5.
Thus if 3 coins are flipped many times, we should expect the average number of tails (per 3
flips) to be about 1.5. Since the number of tails is an integer value, it will never actually
assume the mean value of 1.5. This mean value rather reflects the fact that the extreme values
(0 and 3) each occur the same proportion of the time (1/8) and the middle values (1 and 2)
each occur the same proportion of the time (3/8).
2 The score S obtained in a certain quiz is a random variable with probability distribution
given below.
s 0 1 2 3 4 5
P(S=s) 0.12 0.04 0.16 0.32 0.24 0.12
s        0      1      2      3      4      5      sum
P(S=s)   0.12   0.04   0.16   0.32   0.24   0.12   1
s·P(s)   0      0.04   0.32   0.96   0.96   0.60   2.88
μ = E(S) = 2.88
For a random variable X, the variance, denoted by σ², can be calculated by using the formula
σ² = ∑ x² P(x) − μ².
The standard deviation of X, denoted by σ, is the positive square root of σ². This is a
measure of the extent to which the values are spread around the mean.
The calculation of the standard deviation for a random variable is similar to that of the
calculation of the standard deviation for grouped data.
Example
t         0     1     2      3     sum
P(t)      1/8   3/8   3/8    1/8   1
t·P(t)    0     3/8   6/8    3/8   1.5
t²·P(t)   0     3/8   12/8   9/8   3

σ² = 3 − 1.5² = 0.75 and σ = √0.75 = 0.866.
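The mean and variance computations in the table above can be sketched directly from the probability mass function:

```python
# Mean, variance and standard deviation of T (number of tails in 3 flips)
pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}

mean = sum(t * p for t, p in pmf.items())       # E(T) = sum of t*P(t)
ex2 = sum(t * t * p for t, p in pmf.items())    # E(T^2) = sum of t^2*P(t)
var = ex2 - mean ** 2                           # sigma^2 = E(T^2) - mu^2
std = var ** 0.5

print(mean, var, round(std, 3))   # 1.5 0.75 0.866
```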
4.4 Binomial distribution
A binomial experiment is based on the following assumptions.
1 The experiment is repeated a fixed number of times. Each repetition is called a trial. The
number of trials is denoted by n.
2 The trials are independent i.e., the outcome of one trial does not affect the outcome of
any other trial.
3 The outcome for each trial of the experiment can be one of two complementary outcomes,
one (s) labeled “success” and the other (f) labeled “failure”. A single such a trial is called a
Bernoulli trial.
4 The probability of success P(s) has a constant value of p for each trial.
5 The random variable X counts the number of successes that have occurred in n trials.
Examples:
1 Consider the experiment of flipping a coin 5 times. If we let the event of getting “tails” on
a flip be labeled “success” and “heads” failure, and if the random variable T represents the
number of tails obtained, then T is binomially distributed with n = 5 and p = ½.
3 Fourteen percent of flights from a certain airport are delayed. If 20 flights are chosen at
random, then we can consider each flight to be an independent Bernoulli trial. If we define a
successful trial to be one where a flight takes off on time, then the random variable Z
representing the number of on-time flights will be binomially distributed with n = 20,
p = 0.86 and q = 0.14.
Tree diagram
The number of possible outcomes in a binomial experiment can be written down from a
diagram such as the one below. This diagram called a tree diagram enables one to write down
all the outcomes when this experiment is performed 3 times.
s f s f s f s f 3rd
s f s f 2nd
s f 1st
start
The following outcomes and their respective number of successes (x) can be written down
from the above tree diagram.
Outcomes          x
fff               0
ffs, fsf, sff     1
fss, sfs, ssf     2
sss               3
A formula for the binomial probability mass function for the case n = 3 can be written down
from the above table by noting the following.
1 Each outcome is a sequence of s (success) and f (failure) values e.g., fff, ffs, ssf etc.
3 Since the trials are independent, the probability of a particular sequence of s’s and f’s is
given by a product of p (the probability of success) and q (the probability of failure) values,
where p’s occur x times and q’s (3-x) times e.g., P(fff) = q3, P(ffs) = pq2, P(ssf) = p2q etc.
4 The number of outcomes where there are x success and (3-x) failure outcomes can be
counted by using the formula C(3, x)= 3Cx .
By using the above, the binomial formula for n = 3 can be written down as
P(x) = C(3, x) p^x q^(3−x) , x = 0, 1, 2, 3.
To write down the general formula, the same reasoning as explained above applies to
sequences with n outcomes consisting of s (x of these) and f (n−x of these) values. In the
formula the number 3 is just replaced by n i.e.,
P(x) = C(n, x) p^x q^(n−x) , x = 0, 1, . . . , n.
A shorthand way of referring to a binomially distributed random variable X, based on n trials
with probability of success p, is X ~ B(n,p).
Examples
1 As in the previous examples, let T be the random variable representing the number of tails
when a coin is flipped 3 times. Using the formula above with n=3 and p = ½ , we can
calculate the probability of exactly 2 tails as:
P(2) = 3C2 (1/2)² (1/2)¹ = 3/8 = 0.375.
2 Let the random variable X represent the number of correct answers in the multiple-choice
test described above. Then the probability of a student guessing 3 answers correctly is
mean = E(X) = μ = np , var(X) = σ² = npq and standard deviation of X = √(npq).
Example
E(T) = np = 3 × ½ = 3/2 = 1.5 and σ = √(3 × 0.5 × 0.5) = √0.75 = 0.866.
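The general binomial formula P(x) = C(n, x) p^x q^(n−x) can be sketched as a small function, and its mean and variance checked against np and npq:

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p): C(n,x) * p^x * (1-p)^(n-x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Tails in 3 coin flips: T ~ B(3, 0.5)
print(binom_pmf(2, 3, 0.5))   # 0.375

# Mean and variance computed from the pmf agree with np and npq
n, p = 3, 0.5
mean = sum(x * binom_pmf(x, n, p) for x in range(n + 1))
var = sum(x**2 * binom_pmf(x, n, p) for x in range(n + 1)) - mean**2
print(mean, var)   # 1.5 0.75
```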
[Plots of binomial probability mass functions P(x) against x (x = 0 to 25) for different values of n and p.]
The following experimental model is sometimes associated with the binomial distribution.
Consider a bowl with N marbles of which Np are blue and Nq red, where p + q = 1. If
sampling is done with replacement and drawing a blue marble is labeled “success” (red marble
labeled “failure”), then P(success) = Np/N = p and P(failure) = Nq/N = q. If P(x blue marbles
in n draws) is required and sampling is with replacement, the binomial formula will still
apply. If sampling is without replacement, P(success) is no longer constant (assumption 4 of
the binomial experiment is violated) and the binomial formula will no longer apply for
calculating the abovementioned probability. In such a case the hypergeometric formula
P(X = x) = [ C(Np, x) × C(Nq, n−x) ] / C(N, n)
applies.
Example
A bowl contains 10 blue and 7 red marbles. Four (4) marbles are drawn at random from the
bowl. Calculate the probability of
(a) two (b) at least 3 blue marbles drawn when sampling is done
1 with replacement.
2 without replacement.
1(a) p = 10/17 , q = 7/17 , x = 2.
P(X=2) = 4C2 (10/17)² (7/17)² = 0.352.
(b) P(X≥3) = P(X=3) + P(X=4) = 4C3 (10/17)³ (7/17)¹ + 4C4 (10/17)⁴ (7/17)⁰ = 0.335 + 0.120 = 0.455.
2(a) P(X=2) = [10C2 × 7C2] / 17C4 = (45 × 21)/2380 = 0.397.
(b) P(X≥3) = P(X=3) + P(X=4) = [10C3 × 7C1] / 17C4 + [10C4 × 7C0] / 17C4 = (840 + 210)/2380 = 0.441.
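The with-replacement (binomial) and without-replacement (hypergeometric) answers for part (a) can be compared in a short sketch, using the 10 blue / 7 red bowl from the example:

```python
from math import comb

# Bowl with 10 blue and 7 red marbles; draw n = 4
n_blue, n_red, n = 10, 7, 4
N = n_blue + n_red

# With replacement: binomial with p = 10/17
p = n_blue / N
binom_2 = comb(n, 2) * p**2 * (1 - p)**2

# Without replacement: hypergeometric
hyper_2 = comb(n_blue, 2) * comb(n_red, 2) / comb(N, n)

print(round(binom_2, 3), round(hyper_2, 3))   # 0.352 0.397
```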
Bernoulli distribution
An important special case of the binomial distribution is the case where n = 1. Then the
probability formula becomes
P(x) = p^x q^(1−x) , x = 0, 1.
The mean and variance (standard deviation) of the Bernoulli distribution can be written down
as special cases (n = 1) of the corresponding binomial formulae:
mean = E(X) = μ = p , var(X) = σ² = pq and standard deviation of X = √(pq).
4.5 Poisson distribution
A Poisson random variable (X) is one that counts the number of events that occur at random
in an interval of time or space. The average number of events that occur in the time/space
interval is denoted by μ.
Examples
P(X=x) = P(x) = μ^x e^(−μ) / x! , for x = 0, 1, 2, . . . (μ > 0).
A shorthand way of referring to a Poisson distributed random variable X with average
(mean) rate of occurrence μ is X ~ Po(μ).
Examples
1 A bank receives on average μ = 6 bad cheques per day. Calculate the probability of the
bank receiving (a) exactly 4 (b) at least 3 bad cheques on a given day.
Solution
(a) Substituting μ = 6 and x = 4 into the above formula gives
P(4) = 6⁴ e^(−6) / 4! = 0.134.
(b) P(X ≥ 3) = 1 − P(X ≤ 2) = 1 − [ e^(−6) + 6e^(−6)/1! + 6²e^(−6)/2! ] = 1 − 0.062 = 0.938.
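The Poisson formula translates directly into a small function; the sketch below reproduces both parts of the bad-cheques example.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """P(X = x) for X ~ Po(mu): mu^x * e^(-mu) / x!."""
    return mu**x * exp(-mu) / factorial(x)

mu = 6  # average bad cheques per day

# (a) exactly 4 bad cheques
p4 = poisson_pmf(4, mu)

# (b) at least 3: complement of "at most 2"
p_at_least_3 = 1 - sum(poisson_pmf(x, mu) for x in range(3))

print(round(p4, 3), round(p_at_least_3, 3))   # 0.134 0.938
```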
2 A secretary claims an average mistake rate of 1 per page. A sample page is selected at
random, and 5 mistakes found. What is the probability of her making 5 or more mistakes if
her claim of 1 mistake per page on average is correct?
In this case μ = 1 is claimed and X, the number of mistakes, is ≥ 5. If the claim is true,
P(X ≥ 5) = 1 − P(X ≤ 4) = 1 − e^(−1) [ 1 + 1 + 1/2! + 1/3! + 1/4! ] = 1 − 0.9963 = 0.0037.
The above calculation shows that if the claim of 1 mistake per page on average is true, there
is only a 37 in 10 000 chance of getting 5 or more mistakes on a page. This remote chance
casts doubt on whether the claim of 1 mistake per page on average is in fact true.
The Poisson random variable can also be seen as an approximation to a binomial random
variable with n the number of trials large and p the probability of success small such that the
mean μ = np is of moderate size. This approximation is good when n≥20 and p ≤ 0.05 or
n≥100 and np< 10.
Example
A life insurance company has found that the probability is 0.000015 that a person aged 40-50
will die from a certain rare disease. If the company has 100 000 policy holders in this age
group, what is the probability that this company will have to pay out 4 claims or more
because of death from this disease?
For the following reasons a binomial distribution with n = 100 000 and p = 0.000015 is
reasonable in this case.
1 There is a fixed number (n = 100 000) of trials (policy holders).
2 For each policy holder there are two possible outcomes (death from this disease or not),
with the same probability p = 0.000015 of death for each.
3 The death or not from this disease of one person does not affect that of another
person.
The Poisson distribution with µ = 100 000*(0.000015) = 1.5 can be used to approximate this
probability.
P(X ≥ 4) = 1 − P(X ≤ 3) = 1 − [ e^(−1.5) + 1.5e^(−1.5) + 1.5² e^(−1.5)/2! + 1.5³ e^(−1.5)/3! ]
= 1 − 0.9344 = 0.0656.
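The quality of the approximation can be checked numerically. The sketch below compares the exact binomial tail with the Poisson tail for this insurance example (stdlib only; the big integers from `math.comb` are handled exactly before the float conversion).

```python
from math import comb, exp, factorial

n, p = 100_000, 0.000015
mu = n * p   # 1.5 expected deaths

# Exact binomial P(X <= 3) and its Poisson approximation
binom_le3 = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(4))
poisson_le3 = sum(mu**x * exp(-mu) / factorial(x) for x in range(4))

# The two tail probabilities agree to about four decimal places
print(round(1 - binom_le3, 4), round(1 - poisson_le3, 4))   # 0.0656 0.0656
```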
The mean and variance of the Poisson distribution are given by E(X) = µ and var(X) = µ.
In the case of the Poisson approximation to the binomial distribution, μ = np.
If the average rate of occurrence of µ is given for a particular time/space interval length/size,
probability calculations can also be carried out for an interval length/size which is different to
the one given.
Example
Calls arrive at a switchboard at an average rate of 1 every 15 seconds. What is the probability
of not more than 5 calls arriving during a particular minute?
A mean rate of 1 every 15 seconds is equivalent to a mean rate of 4 per minute. Since the
question concerns an interval of 1 minute, μ = 4 (not μ = 1).
P(X ≤ 5) = e^(−4) [ 1 + 4 + 4²/2! + 4³/3! + 4⁴/4! + 4⁵/5! ] = 0.785.
The output of binomial and Poisson probability calculations done in Excel is shown below.
n=10 p
x 0.2 0.4 0.6 0.8
0 0.107 0.006 0 0
1 0.268 0.04 0.002 0
2 0.302 0.121 0.011 0
3 0.201 0.215 0.042 0.001
4 0.088 0.251 0.111 0.006
5 0.026 0.201 0.201 0.026
6 0.006 0.111 0.251 0.088
7 0.001 0.042 0.215 0.201
8 0 0.011 0.121 0.302
9 0 0.002 0.04 0.268
10 0 0 0.006 0.107
n=15 p
x      0.2    0.4    0.6    0.8
0 0.035 0 0 0
1 0.132 0.005 0 0
2 0.231 0.022 0 0
3 0.25 0.063 0.002 0
4 0.188 0.127 0.007 0
5 0.103 0.186 0.024 0
6 0.043 0.207 0.061 0.001
7 0.014 0.177 0.118 0.003
8 0.003 0.118 0.177 0.014
9 0.001 0.061 0.207 0.043
10 0 0.024 0.186 0.103
11 0 0.007 0.127 0.188
12 0 0.002 0.063 0.25
13 0 0 0.022 0.231
14 0 0 0.005 0.132
15 0 0 0 0.035
A random variable X is called continuous if it can assume any of the possible values in some
interval i.e., the number of possible values is infinite. In this case the definition of a discrete
random variable (list of possible values with their corresponding probabilities) cannot be used
(since there are an infinite number of possible values it is not possible to draw up a list of
possible values). For this reason, probabilities associated with individual values of a
continuous random variable X are taken as 0.
The clustering pattern of the values of X over the possible values in the interval is described
by a mathematical function f(x) called the probability density function. A high (low)
clustering of values will result in high (low) values of this function. For a continuous random
variable X, only probabilities associated with ranges of values (e.g., an interval of values
from a to b) will be calculated. The probability that the value of X will fall between the values
a and b is given by the area between a and b under the curve describing the probability
density function f(x). For any probability density function the total area under the graph of
f(x) is 1.
A random variable X follows a normal distribution with parameters μ and σ if its probability
density function is given by
f(x) = [ 1/(σ√(2π)) ] e^( −(x−μ)²/(2σ²) ) , −∞ < x < ∞.
The constants μ and σ can be shown to be the mean and standard deviation respectively of
X. These constants completely specify the density function. A graph of the curve describing
the probability function (known as the normal curve) for the case μ = 0 and σ = 1 is shown
below.
[Plot of the standard normal density p(z) against z for −4 ≤ z ≤ 4.]
The graph of the function defined above has a symmetric, bell-shaped appearance. The mean
µ is located on the horizontal axis where the graph reaches its maximum value. At the two
ends of the scale the curve describing the function gets closer and closer to the horizontal axis
without touching it. Many quantities measured in everyday life have a distribution which
closely matches that of a normal random variable (e.g., marks in an exam, weights of
products, heights of a male population). The parameter µ shows where the distribution is
centrally located and σ the spread of the values around µ. A shorthand way of referring to a
random variable X which follows a normal distribution with mean µ and variance σ2 is by
writing X ~ N(µ, σ2). The next diagram shows graphs of normal distributions for various
values of μ and σ2.
An increase (decrease) in the mean µ results in a shift of the graph to the right (left) e.g. the
curve of the distribution with a mean of -2 is moved 2 units to the left. An increase
(decrease) in the standard deviation σ results in the graph becoming more (less) spread out
e.g. compare the curves of the distributions with σ2 = 0.2, 0.5, 1 and 5.
[Histogram of examination marks: frequency against mark, with bars from about 15 to 90.]
The histogram of the marks has an appearance that can be described by a normal curve i.e., it
has a symmetric, bell-shaped appearance. The mean of the marks is 51.95 and the standard
deviation 10.
A normally distributed random variable X with mean μ and standard deviation σ can be
transformed to standard form by using the formula
Z = (X − μ)/σ .
It can be shown that the transformed random variable Z ~ N(0, 1). The random variable Z can
be transformed back to X by using the formula
X= μ+Zσ .
The normal distribution with mean µ = 0 and standard deviation σ = 1 is called the standard
normal distribution. The symbol Z is reserved for a random variable with this distribution.
The graph of the standard normal distribution appears below.
Various areas under the above normal curve are shown. The standard normal table gives the
area under the curve to the left of the value z. The area to the right of z and the area between
two z values (z1 and z2) can be found by subtraction of appropriate areas as shown in the next
examples.
The first few lines of the standard normal table are shown below.
Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
97
-3.7 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
-3.6 0.0002 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
. . .
0.0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
. . .
3.7 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
When looking up a particular value of z, the first two digits (units, tenths) can be found in
the appropriate row on the left. The column entry is the second decimal digit (hundredths).
The areas shown in the table are those under the standard normal curve to the left of the value
of z looked up i.e., P(Z ≤ z) e.g., P(Z ≤ 0.14) = 0.5557.
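Table values can also be computed without a printed table, since the standard normal CDF can be written in terms of the error function available in Python's `math` module. The sketch below reproduces two of the values quoted in this section.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function P(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Reproduce table lookups from the text
print(round(phi(0.14), 4))                 # 0.5557
print(round(phi(1.35) - phi(-0.47), 4))    # 0.5923
```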
Note
1 For values of z less than the minimum value (−3.79) in the table, the probabilities
are taken as 0 i.e., P(Z ≤ z) ≈ 0 for z < −3.79.
2 For values of z greater than the maximum value (3.79) in the table, the
probabilities are taken as 1 i.e., P(Z ≤ z) ≈ 1 for z > 3.79.
3 P(-0.47 < Z < 1.35) = P(Z < 1.35) – P(Z < -0.47) = 0.9115-0.3192 = 0.5923
In all the above examples an area was found for a given value of z. It is also possible to find a
value of z when an area to its left is given. This can be written as P(Z ≤ zα) = α (α is the
greek letter for a and is pronounced “alpha”). In this case zα has to be found such that α is the
area to its left.
Examples
1 Find the value of z that has an area of 0.0344 to its left.
Search the body of the table for the required area (0.0344) and then read off the value of z
corresponding to this area. In this case z0.0344 = -1.82.
2 Find the value of z that has an area of 0.975 to its left.
Finding 0.975 in the body of the table and reading off the z value gives z0.975 = 1.96.
3 Find the values of z that have areas of 0.95 and 0.05 to their left.
When searching the body of the table for 0.95 this value is not found. The z value
corresponding to 0.95 can be estimated from the following information obtained from the
table.
z area to left
1.64 0.9495
? 0.95
1.65 0.9505
Since the required area (0.95) is halfway between the 2 areas obtained from the table, the
required z can be taken as the value halfway between the two z values that were obtained
1. 64+1 . 65
=1. 645 .
from the table i.e., z = 2
Exercise: Using the same approach as above, verify that the z value corresponding to an area
of 0.05 to its left is -1.645.
At the bottom of the standard normal table selected percentiles zα are given for different
values of α. This means that the area under the normal curve to the left of zα is α.
The standard normal distribution is symmetric with respect to the mean 0. From this it
follows that the area under the normal curve to the right of a positive value z is the same as
the area to the left of the corresponding negative value −z i.e., P(Z > z) = P(Z ≤ −z).
Let X be a N(μ, σ²) random variable and Z a N(0, 1) random variable. Then
P(X ≤ x) = P( Z ≤ (x − μ)/σ ).
Examples:
1 The heights X (inches) of women in a certain population are normally distributed with
mean 63.5 and standard deviation 2.8. What proportion of women are less than 63 inches tall?
P(X < 63) = P( Z < (63 − 63.5)/2.8 ) = P(Z < −0.18) = 0.4286.
This means that 42.86% (a proportion of 0.4286) of women are less than 63 inches tall.
2 The length X (inches) of sardines is a N(4.62, 0.0529) random variable. What proportion
of sardines is
(a) longer than 5 inches? (b) between 4.35 and 4.85 inches?
(a) P(X > 5) = P( Z > (5 − 4.62)/0.23 ) = P(Z > 1.65) = 1 − P(Z ≤ 1.65) = 1 − 0.9505 = 0.0495.
(b) P(4.35 < X < 4.85) = P( (4.35 − 4.62)/0.23 < Z < (4.85 − 4.62)/0.23 )
= P(−1.17 < Z < 1.00) = 0.8413 − 0.1210 = 0.7203.
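The sardine calculations can be sketched with the error-function version of the standard normal CDF; small differences from the table answers arise only because the table lookups round z to two decimals.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function P(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 4.62, 0.23   # sardine lengths: X ~ N(4.62, 0.0529)

p_longer = 1 - phi((5 - mu) / sigma)                              # (a)
p_between = phi((4.85 - mu) / sigma) - phi((4.35 - mu) / sigma)   # (b)

print(round(p_longer, 4), round(p_between, 4))
```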
The standard normal table can be used to find percentiles for random variables which are
normally distributed.
Example
The scores S obtained in a mathematics entrance examination are normally distributed with
given mean μ and standard deviation σ. Find the score which marks the 80th percentile.
From the standard normal table, the z-score which is closest to an entry of 0.80 in the body
of the table is 0.84 (the actual area to its left is 0.7995). The score which corresponds to a
z-score of 0.84 can be found from S = μ + 0.84σ.
1 Find .
2 If a person scores in the top 5% of test scores, what is the minimum score they could have
received?
3 If a person scores in the bottom 10% of test scores, what is the maximum score they could
have received?
1 When a value is included in a lower limit of a probability calculation, subtract 0.5 from
the limit. When a value is excluded in a lower limit of a probability calculation, add 0.5 to
the limit.
2 When a value is excluded in an upper limit of a probability calculation, subtract 0.5 from
the limit. When a value is included in an upper limit of a probability calculation, add 0.5 to
the limit.
Examples
Note: A continuity correction is only needed when using a continuous distribution (e.g.,
normal distribution) to approximate a discrete one and not when you calculate probabilities
involving a continuous random variable or a discrete random variable. Before using a
continuity correction, note the type of variable associated with the probability you are
approximating and the type of variable you are using to approximate it.
Example
The number X of hamburgers sold daily at a certain outlet is approximately normally
distributed with mean 65 and standard deviation 6. Use the normal approximation (with
continuity correction) to find
1.1 P(X > 50) 1.2 P(X ≥ 50) 1.3 P(58 < X ≤ 62) 1.4 P(69 ≤ X < 73) 1.5 P(X ≤ 75)
2 The 2.1 10th 2.2 88th 2.3 95th percentiles of the daily hamburger sales.
Solution
1.1 P(X > 50) = P(X ≥ 50.5) = P(Z ≥ (50.5 − 65)/6) = P(Z ≥ −2.42) = 0.99224.
1.2 P(X ≥ 50) = P(X ≥ 49.5) = P(Z ≥ (49.5 − 65)/6) = P(Z ≥ −2.58) = 0.99506.
1.3 P(58 < X ≤ 62) = P(58.5 ≤ X ≤ 62.5) = P(Z ≤ −0.42) − P(Z ≤ −1.08)
= 0.33724 − 0.14007 = 0.19717.
1.4 P(69 ≤ X < 73) = P(68.5 ≤ X ≤ 72.5) = P(0.58 ≤ Z ≤ 1.25)
= 0.89435 − 0.71904 = 0.17531.
1.5 P(X ≤ 75) = P(X ≤ 75.5) = P(Z ≤ (75.5 − 65)/6) = P(Z ≤ 1.75) = 0.95994.
2.1 P(Z ≤ z0.10) = 0.10. From the normal tables P(Z ≤ −1.28) = 0.10. Solving for x from
(x − 65)/6 = −1.28 gives x = 65 − 1.28 × 6 = 57.32. The answer is 57.32 rounded to the nearest
integer, i.e., 57.
2.2 P(Z ≤ z0.88) = 0.88. From the normal tables P(Z ≤ 1.175) = 0.88. Solving for x from
(x − 65)/6 = 1.175 gives x = 65 + 1.175 × 6 = 72.05. The answer is 72.05 rounded to the nearest
integer, i.e., 72.
2.3 P(Z ≤ z0.95) = 0.95. From the normal tables P(Z ≤ 1.645) = 0.95. Solving for x from
(x − 65)/6 = 1.645 gives x = 65 + 1.645 × 6 = 74.87. The answer is 74.87 rounded to the nearest
integer, i.e., 75.
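The continuity-corrected calculations above can be checked with a short script. A sketch using Python's statistics module (small differences from the table answers arise because the tables round z to two decimals):

```python
from statistics import NormalDist

Z = NormalDist()      # standard normal
mu, sigma = 65, 6     # daily hamburger sales (normal approximation)

def z(x):
    return (x - mu) / sigma

p_12 = 1 - Z.cdf(z(49.5))               # 1.2  P(X >= 50) via limit 49.5
p_13 = Z.cdf(z(62.5)) - Z.cdf(z(58.5))  # 1.3  P(58 < X <= 62) via 58.5 and 62.5
x_95 = mu + Z.inv_cdf(0.95) * sigma     # 2.3  95th percentile

print(round(p_12, 5), round(p_13, 5), round(x_95, 2))
```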
Excel has a built-in function (NORMSDIST, with inverse NORMSINV) that can be used to find the
area under the standard normal curve to the left of a given z-score, or to calculate the z-score
that has a given area under the standard normal curve to its left.
1 The table below shows areas under the standard normal curve to the left of various z-
scores.
z-score area
-2.5 0.0062
-2 0.0228
-1.5 0.0668
-1 0.1587
-0.5 0.3085
0 0.5
0.5 0.6915
1 0.8413
1.5 0.9332
2 0.9772
2.5 0.9938
2 Using the inverse function NORMSINV, the table below shows z-scores for certain areas
under the standard normal curve to their left.
area z-score
0.005 -2.5758
0.01 -2.3263
0.025 -1.96
0.05 -1.6449
0.1 -1.2816
0.2 -0.8416
0.8 0.8416
0.9 1.2816
0.95 1.6449
0.975 1.96
0.99 2.3263
0.995 2.5758
Chapter 6 – Sampling distributions
6.1 Definitions
A sampling distribution arises when repeated samples are drawn from a particular
population (distribution) and a statistic (numerical measure of description of sample data) is
calculated for each sample. The interest is then focused on the probability distribution
(called the sampling distribution) of the statistic.
Sampling distributions arise in the context of statistical inference i.e., when statements are
made about a population by drawing random samples from it.
Example
Suppose all possible samples of size 2 are drawn with replacement from a population with
sample space S = {2, 4, 6, 8} and the mean calculated for each sample.
The different values that can be obtained and their corresponding means are shown in the
table below.

        2    4    6    8
  2     2    3    4    5
  4     3    4    5    6
  6     4    5    6    7
  8     5    6    7    8

In the above table the row and column entries indicate the two values in the sample (16
possibilities when combining rows and columns). The mean is located in the cell
corresponding to these entries, e.g., 1st value = 4, 2nd value = 6 has a mean entry of
(4 + 6)/2 = 5.
Assuming that random sampling is used, all the mean values in the above table are equally
likely. Under this assumption the following distribution can be constructed for these mean
values.
x̄           2     3     4     5     6     7     8     sum
count        1     2     3     4     3     2     1     16
P(X̄ = x̄)   1/16  1/8   3/16  1/4   3/16  1/8   1/16  1
The above distribution is referred to as the sampling distribution of the mean for random
samples of size 2 drawn from this distribution. For this sampling distribution the population
size N = 4 and the sample size n = 2.
The mean of the population from which these samples are drawn is µ = 5 and the variance is
σ² = ((2−5)² + (4−5)² + (6−5)² + (8−5)²)/4 = 5.
Note that μX̄ = 5 = µ and that σ²X̄ = σ²/2 = 5/2 = 2.5.
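The whole sampling-distribution table above can be generated by brute-force enumeration. A sketch in Python (an illustration, not part of the notes):

```python
from itertools import product
from fractions import Fraction

population = [2, 4, 6, 8]

# All 16 ordered samples of size 2 drawn with replacement
samples = list(product(population, repeat=2))
means = [(a + b) / 2 for a, b in samples]

# Sampling distribution of the mean: P(Xbar = xbar) = count / 16
dist = {m: Fraction(means.count(m), len(means)) for m in sorted(set(means))}
print(dist)

# Mean and variance of the sampling distribution
mu_xbar = sum(m * p for m, p in dist.items())
var_xbar = sum((m - mu_xbar) ** 2 * p for m, p in dist.items())
print(mu_xbar, var_xbar)  # 5 and 5/2, matching mu and sigma^2 / n
```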
Consider a population with mean µ and variance σ2. It can be shown that the mean and
variance of the sampling distribution of the mean, based on a random sample of size n, are
given by μX̄ = μ and σ²X̄ = σ²/n.
σX̄ = σ/√n is known as the standard error.
Sampling distributions can involve different statistics (e.g., sample mean, sample proportion,
sample variance) calculated from different sample sizes drawn from different distributions.
Some of the important results from statistical theory concerning sampling distributions are
summarized in the sections that follow.
Let X1, X2, . . . , Xn be a random sample of size n drawn from a distribution with mean µ
and variance σ² (σ² should be finite). Then for sufficiently large n the mean
X̄ = (X1 + X2 + . . . + Xn)/n is approximately normally distributed with mean μX̄ = μ and
variance σ²X̄ = σ²/n.
2 The value of n for which this theorem is valid depends on the distribution from which the
sample is drawn. If the sample is drawn from a normal population, the theorem is valid for all
n. If the distribution from which the sample is drawn is close to being normal, a value of n >
30 will suffice for the theorem to be valid. If the distribution from which the sample is drawn
is substantially different from a normal distribution e.g., positively, or negatively skewed, a
value of n much larger than 30 will be needed for the theorem to be valid.
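The effect described in note 2 can be seen in a small simulation: sample means drawn from a strongly skewed population already look approximately normal for moderate n. A sketch (the exponential population and the sample sizes are illustrative assumptions, not from the notes):

```python
import random
import statistics

random.seed(1)

# A right-skewed population: exponential with mean 1 and variance 1
def draw_mean(n):
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

# Means of 2000 samples of size n = 50
sample_means = [draw_mean(50) for _ in range(2000)]

print(round(statistics.fmean(sample_means), 2))  # close to mu = 1
print(round(statistics.stdev(sample_means), 2))  # close to sigma/sqrt(n) = 0.141
```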
3 There are various versions of the central limit theorem. The only other central limit
theorem result that will be used here is the following one.
If the population from which the sample is drawn is a Bernoulli distribution (consists of only
values of 0 or 1 with probability p of drawing a 1 and probability of q = 1-p of drawing a 0),
then S = X1 + X2 + . . . + Xn follows a binomial distribution with mean µS = np and variance
σ²S = npq.
According to the central limit theorem, P̂ = S/n follows a normal distribution with mean
µ(P̂) = µS/n = np/n = p and variance σ²(P̂) = σ²S/n² = npq/n² = pq/n when n is sufficiently
large. P̂ is the proportion of 1's in the sample and can be seen as an estimate of p, the
proportion of 1's in the population (distribution from which the sample is drawn).

Z = (P̂ − µ(P̂))/σ(P̂) = (P̂ − p)/√(pq/n) ~ N(0, 1).
Example:
An electric firm manufactures light bulbs whose lifetime (in hours) follows a normal
distribution with mean 800 and variance 1600. A random sample of 10 light bulbs is drawn
and the lifetime recorded for each light bulb. Calculate the probability that the mean of this
sample
(a) differs from the actual mean lifetime of 800 by not more than 16 hours.
(b) differs from the actual mean lifetime of 800 by more than 16 hours.
(c) is greater than 820 hours.
(d) is less than 785 hours.
Solution
(a) P(−16 ≤ X̄ − 800 ≤ 16) = P(|X̄ − 800| ≤ 16) = P(|Z| ≤ 16/√(1600/10)) = P(|Z| ≤ 1.265)
= P(Z ≤ 1.265) − P(Z ≤ −1.265) = 0.8971 − 0.1029 = 0.7942.
(b) P(|X̄ − 800| > 16) = 1 − P(|X̄ − 800| ≤ 16) = 1 − 0.7942 = 0.2058.
(c) P(X̄ > 820) = P(Z > (820 − 800)/√(1600/10)) = P(Z > 1.58) = 1 − 0.9429 = 0.0571.
(d) P(X̄ < 785) = P(Z < (785 − 800)/√(1600/10)) = P(Z < −1.19) = 0.117.
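These probabilities can be checked exactly with the standard library (the hand calculations round z to two or three decimals):

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()
mu, var, n = 800, 1600, 10
se = sqrt(var / n)   # standard error of the mean = sqrt(160), about 12.65

p_a = Z.cdf(16 / se) - Z.cdf(-16 / se)   # (a) P(|Xbar - 800| <= 16)
p_b = 1 - p_a                            # (b) complement of (a)
p_c = 1 - Z.cdf((820 - mu) / se)         # (c) P(Xbar > 820)

print(round(p_a, 4), round(p_b, 4), round(p_c, 4))
```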
The central limit theorem states that the statistic Z = (X̄ − μ)/(σ/√n) follows a standard normal
distribution. If σ is not known, it would be logical to replace σ (in the formula for Z) by its
sample estimate S. For small values of the sample size n, the statistic t = (X̄ − μ)/(S/√n) does not
follow a normal distribution. If it is assumed that sampling is done from a population that is
approximately a normal population, the statistic t follows a t-distribution.
This distribution changes with the degrees of freedom = df = n−1, i.e., for each value of degrees
of freedom a different distribution is defined.
The t-distribution was first proposed in a 1908 paper by William Gosset, who wrote the
paper under the pseudonym "Student". The t-distribution has the following properties.
1. The Student t-distribution is symmetric and bell-shaped, but for smaller sample sizes it
shows increased variability when compared to the standard normal distribution (its curve
has a flatter appearance than that of the standard normal distribution). In other words, the
distribution is less peaked than a standard normal distribution and with thicker tails. As
the sample size increases, the distribution approaches a standard normal distribution. For
n > 30, the differences are small.
2. The mean is zero (like the standard normal distribution).
3. The distribution is symmetrical about the mean.
4. The variance is greater than one but approaches one from above as the sample size
increases (σ2=1 for the standard normal distribution).
The graph below shows how the t-distribution changes for different values of r (the degrees
of freedom).
The row entry is the degrees of freedom (df) and the column entry (α) the area under the t-
curve to the right of the value that appears in the table at the intersection of the row and
column entry.
When a t-value that has an area less than 0.5 to its left is to be looked up, the fact that the t-
distribution is symmetrical around 0 is used i.e.,
P(t ≤ tα) = P(t ≤ -t1-α) = P(t ≥ t1-α) for α ≤ 0.5 (Using symmetry).
Examples
1 For df = 2 and α = 0.005 the entry is 9.925. This means that for the t-distribution with 2
degrees of freedom P(t ≥ 9.925) = 0.005.
2 For df = ∞ and α = 0.95 the entry is 1.645. This means that for the t-distribution with ∞
degrees of freedom
P(t ≤ 1.645) = 0.95. This is the same as P(Z ≤ 1.645) , where Z ~ N(0,1).
3 For df = ν = 10 and α = 0.10 the value of t0.10 such that P(t ≤ t0.10) = 0.10 is found from
t0.10 = −t1−0.10 = −t0.90 = −1.372.
Note that the percentile values in the last row of the t-distribution are identical to the
corresponding percentile entries in the standard normal table. Since the t-distribution for large
samples (degrees of freedom) is the same as the standard normal distribution, their percentiles
should be the same.
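The t-table entries quoted above can be reproduced in software. A sketch assuming scipy is available (scipy is an assumption; the notes themselves use printed tables):

```python
# Reproduce t-table entries with scipy.stats.
from scipy.stats import t, norm

# df = 2: value with area 0.005 to its right (table entry 9.925)
print(round(t.ppf(1 - 0.005, df=2), 3))

# df = 10: value with area 0.10 to its left (hand result -1.372)
print(round(t.ppf(0.10, df=10), 3))

# Large df: t percentiles approach standard normal percentiles
print(round(t.ppf(0.95, df=10**6), 3), round(norm.ppf(0.95), 3))
```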
The chi-square distribution arises in several sampling situations. These include the ones
described below.
1 When a random sample of size n is drawn from a normal population with variance σ² and S²
denotes the sample variance, the quantity
χ² = (n−1)S²/σ² follows a chi-square distribution with degrees of freedom = n−1.
2 When comparing sequences of observed and expected frequencies as shown in the table
below. The observed frequencies (referring to the number of times values of some variable of
interest occur) are obtained from an experiment, while the expected ones arise from some
pattern believed to be true.
observed frequency f1 f2 .. fk
expected frequency e1 e2 .. ek
The quantity χ² = Σᵢ₌₁ᵏ (fᵢ − eᵢ)²/eᵢ can be shown to follow a chi-square distribution with k−1
degrees of freedom. The purpose of calculating this χ² is to make an assessment as to how
well the observed and expected frequencies correspond.
The chi-square curve is different for each value of degrees of freedom. The graph below
shows how the chi-square distribution changes for different values of ν (the degrees of
freedom).
Unlike the normal and t-distributions the chi-square distribution is only defined for positive
values and is not a symmetrical distribution. As the degrees of freedom increase, the chi-
square distribution becomes more and more symmetrical. For a sufficiently large value of
degrees of freedom the chi-square distribution approaches the normal distribution.
The row entry is the degrees of freedom (df) and the column entry (α) the area under the chi-
square curve to the right of the value that appears in the table at the intersection of the row
and column entry.
Examples:
1 For df = 30 and α = 0.99 the entry is 14.95. This means that for the chi-square distribution
with 30 degrees of freedom P(χ² ≥ 14.95) = 0.99.
2 For df = 30 and α = 0.005 the entry is 53.67. This means that for the chi-square distribution
with 30 degrees of freedom P(χ² ≥ 53.67) = 0.005.
3 For df = 6 and α = 0.05 the entry is 12.59. This means that for the chi-square distribution
with 6 degrees of freedom P(χ² ≤ 12.59) = 0.95 or P(χ² > 12.59) = 0.05.
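The chi-square table entries above can also be reproduced in software. A sketch assuming scipy is available (an assumption; the notes use printed tables):

```python
# Reproduce chi-square table entries with scipy.stats.
from scipy.stats import chi2

# df = 30, upper-tail area 0.99  -> table entry 14.95
print(round(chi2.ppf(1 - 0.99, df=30), 2))
# df = 30, upper-tail area 0.005 -> table entry 53.67
print(round(chi2.ppf(1 - 0.005, df=30), 2))
# df = 6, upper-tail area 0.05   -> table entry 12.59
print(round(chi2.ppf(1 - 0.05, df=6), 2))
```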
Random samples of sizes n1 and n2 are drawn from normally distributed populations that are
labeled 1 and 2 respectively. Denote the variances calculated from these samples by S1² and
S2² respectively and their corresponding population variances by σ1² and σ2² respectively.
The ratio
F = (S1²/σ1²)/(S2²/σ2²)
is distributed according to an F-distribution (named after the famous statistician R. A. Fisher).
The F-distribution is positively skewed, and the F-values can only be positive. The graph
below shows plots for three F-distributions (F-curves) with σ1² = σ2². These plots are referred
to by F(df1, df2), e.g., F(33, 10) refers to an F-distribution with 33 degrees of freedom
associated with the numerator and 10 degrees of freedom associated with the denominator.
For each combination of df1 and df2 there is a different F-distribution. Three other important
distributions are related to special cases of the F-distribution: the square of a N(0, 1) variable
follows an F(1, ∞) distribution, the square of a t(n2) variable follows an F(1, n2) distribution,
and a chi-square variable with n1 degrees of freedom divided by n1 follows an F(n1, ∞)
distribution.
df2/df1      1        2      ...     ∞
1          161.5    199.5    ...   254.3
2           18.51    19.00   ...    19.50
...
∞            3.85     3.00   ...     1.01
Examples
1 F(3, 26) = 2.98 has an area (under the F(3, 26) curve) of α = 0.05 to its right (see graph
below).
2 F(4, 32) = 2.67 has an area (under the F(4, 32) curve) of α = 0.05 to its right (see graph
below).
For each different value of α a different F-table is used to read off a value that has an area of
α to its right i.e. a percentage of 100(1-α ) to its left. The F-tables that are used and their α
and 100(1-α ) values are summarized in the table below.
The first entry in the above table refers to the percentage of the area under the F-curve to the
left of the F-value read off and the second entry to the proportion under the F-curve to the
right of this F-value.
Examples:
1 For df1 = 7, df2 = 5 the value read from the 95% F-distribution table is 4.88. This means
that for this F-distribution 95% of the area under the F-curve is to the left of 4.88 (a
proportion of 0.05 to the right of 4.88), i.e.,
P(F ≤ 4.88) = 0.95 and P(F > 4.88) = 0.05.
2 For df1 = 7, df2 = 5 the value read from the 97.5% F-distribution table is 6.85. This
means that for this F-distribution 97.5% of the area under the F-curve is to the left of 6.85 (a
proportion of 0.025 to the right of 6.85), i.e.,
P(F ≤ 6.85) = 0.975 and P(F > 6.85) = 0.025.
3 For df1 = 10, df2 = 17 the value read from the 99% F-distribution table is 3.59. This means
that for this F-distribution 99% of the area under the F-curve is to the left of 3.59 (a
proportion of 0.01 to the right of 3.59), i.e.,
P(F ≤ 3.59) = 0.99 and P(F > 3.59) = 0.01.
Only upper tail values (areas of 5%, 2.5% and 1% above) can be read off from the F-tables.
Lower tail values can be calculated from the formula
F(df1, df2; α) = 1/F(df2, df1; 1−α),
i.e., 1/(F value with an area 1−α under the F-curve to its left, with numerator and denominator
degrees of freedom interchanged).
Examples
1 Find the value such that 2.5% of the area under the F(7, 5) curve is to the left of it.
F(7, 5; 0.025) = 1/F(5, 7; 0.975) = 1/5.29 = 0.189.
2 Find the value such that 1% of the area under the F(10, 17) curve is to the left of it.
F(10, 17; 0.01) = 1/F(17, 10; 0.99) = 1/4.49 = 0.223.
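The reciprocal formula can be verified directly. A sketch assuming scipy is available (an assumption; the notes use printed F-tables):

```python
# Verify the lower-tail reciprocal formula for the F-distribution.
from scipy.stats import f

# Value with area 0.975 to its left for F(5, 7) -- the table lookup 5.29
upper = f.ppf(0.975, dfn=5, dfd=7)
print(round(upper, 2))

# Lower-tail value for F(7, 5) via the reciprocal formula
print(round(1 / upper, 3))

# Direct calculation agrees
print(round(f.ppf(0.025, dfn=7, dfd=5), 3))
```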
6.6 Computer output
In Excel, values from the t, chi-square and F-distributions that have a given area under the
curve above them can be found by using the TINV(area, df), CHIINV(area, df) and FINV(area,
df1, df2) functions respectively.
Examples
1 TINV(0.05, 15) = 2.13145. The area under the t(15) curve to the right of 2.13145 is 0.025
and to the left of -2.13145 is 0.025. Thus, the total tail area is 0.05.
2 CHIINV(0.01, 14) = 29.14124. The area under the chi-square (14) curve to the right of
29.14124 is 0.01.
3 FINV(0.05,10,8) = 3.347163. The area under the F (10, 8) curve to the right of 3.347163
is 0.05.
Chapter 7 – Estimation
Statistical inference (inferential statistics) refers to the methodology used to draw conclusions
(expressed in the language of probability) about population parameters by selecting random
samples from the population.
Examples
1 The government of a country wants to estimate the proportion of voters ( p ) in the country
that approve of their economic policies.
2 A manufacturer of car batteries wishes to estimate the average lifetime (µ) of their
batteries.
3 A paint company is interested in estimating the variability (as measured by the variance,
σ 2 ) in the drying time of their paints.
2
The quantities p , µ and σ that are to be estimated are called population parameters.
A sample estimate of a population parameter is called a statistic. The table below gives
examples of some commonly used parameters together with their statistics.
Parameter Statistic
p ^p
µ x̄
σ2 S2
Examples
Suppose the mean time it takes to serve customers at a supermarket checkout counter is to be
estimated.
1 The mean service time of 100 customers of (say) x̄= 2.283 minutes is an example of a
point estimate of the parameter µ.
2 If it is stated that the probability is 0.95 (95% chance) that the mean service time will be
from 1.637 minutes to 4.009 minutes, the interval of values (1.637, 4.009) is an interval
estimate of the parameter µ.
The estimation approaches discussed will focus mainly on the interval estimate approach.
A confidence interval is a range of values from L (lower value) to U (upper value) that
estimate a population parameter θ with 100(1-α )% confidence.
θ - pronounced “theta”.
1-α is called the confidence coefficient. It is the probability that the confidence interval will
contain θ the parameter that is being estimated.
Example
L = 1.637, U = 4.009
α =0.05
In the sections that follow the determination of L and U when estimating the parameters µ, p
and σ2 will be discussed.
7.4 Confidence interval for the population mean (population variance known)
The determination of the confidence limits is based on the central limit theorem (discussed in
the previous chapter). This theorem states that for sufficiently large samples the sample mean
X̄ ~ N(µ, σ²/n) and hence that Z = (X̄ − μ)/(σ/√n) ~ N(0, 1).
Formulae for the lower and upper confidence limits can be constructed in the following way.
Since Z ~ N(0, 1),
P(−1.96 ≤ (X̄ − μ)/(σ/√n) ≤ 1.96) = 0.95.
By a few steps of mathematical manipulation (not shown here), the above part in brackets can
be changed to have only the parameter µ between the inequality signs. This will give
P(X̄ − 1.96σ/√n ≤ µ ≤ X̄ + 1.96σ/√n) = 0.95.
Let L = X̄ − 1.96σ/√n and U = X̄ + 1.96σ/√n. Then the above formula can be written as
P(L ≤ µ ≤ U) = 0.95.
Since both L and U are determined by the sample values (which determine X̄), they (and the
confidence interval) will change for different samples. Since the parameter µ that is being
estimated remains constant, these intervals will either include or exclude µ. By construction,
such intervals will include the parameter µ with probability 0.95 (95 out of 100 times).
In a practical situation the confidence interval will not be determined by many samples, but
by only one sample. Therefore, the confidence interval that is calculated in a practical
situation will involve replacing the random variable X̄ by the sample value x̄ . Then the
above formulae for a 95% confidence interval for the population mean µ becomes
(x̄ − 1.96σ/√n, x̄ + 1.96σ/√n) or x̄ ± 1.96σ/√n.
The percentage of confidence associated with the interval is determined by the value (called
the z – multiplier) obtained from the standard normal distribution. In the above formula a z-
multiplier of 1.96 determines a 95% confidence interval.
confidence percentage 99 95 90
z-multiplier 2.576 1.96 1.645
α 0.01 0.05 0.10
Example
The actual content of cool drink in a 500 milliliter bottle is known to vary. The standard
deviation is known to be 5 milliliters. Thirty (30) of these 500 milliliter bottles were selected
at random and their mean content found to be 498.5 milliliters. Calculate 95% and 99% confidence
intervals for the population mean content of these bottles.
Solution
95%: 498.5 ± 1.96 × 5/√30 = (496.71, 500.29).
99%: 498.5 ± 2.576 × 5/√30 = (496.15, 500.85).
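These two intervals can be computed directly; a sketch with a hypothetical helper function (the function name is illustrative, not from the notes):

```python
from math import sqrt
from statistics import NormalDist

def mean_ci_known_sigma(xbar, sigma, n, conf):
    """Confidence interval for mu when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z-multiplier
    e = z * sigma / sqrt(n)
    return (round(xbar - e, 2), round(xbar + e, 2))

print(mean_ci_known_sigma(498.5, 5, 30, 0.95))  # (496.71, 500.29)
print(mean_ci_known_sigma(498.5, 5, 30, 0.99))  # (496.15, 500.85)
```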
7.5 Confidence interval for the population mean (population variance not known)
When the population variance (σ2) is not known, it is replaced by the sample variance (S2) in
the formula for Z mentioned in the previous section. In such a case the quantity
t = (X̄ − μ)/(S/√n) follows a t-distribution with degrees of freedom = df = n−1.
The confidence interval formula used in the previous section is modified by replacing the z-
multiplier by the t-multiplier that is looked up from the t-distribution.
Example
The time (in seconds) taken to complete a simple task was recorded for each of 15 randomly
selected employees at a certain company. The values are given below.
38.2 43.9 38.4 26.2 41.3 42.3 37.5 37.2 41.2 42.3 50.1 37.3 36.7 31.8 31.0
Calculate 95% and 99% confidence intervals for the population mean time it takes to
complete this task.
Solution
Looking up the t-multiplier involves a row entry (df = n − 1 = 14) and a column entry in the
t-table.
Substituting x̄ = 38.36, n = 15, S = 5.78, t = 2.145 into the above formula gives
38.36 ± 2.145 × 5.78/√15 = (35.16, 41.56).
Substituting x̄ = 38.36, n = 15, S = 5.78, t = 2.977 into the above formula gives
38.36 ± 2.977 × 5.78/√15 = (33.92, 42.80).
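The interval can be reproduced from the raw data values. A sketch assuming scipy is available for the t-multiplier (an assumption; everything else is standard library):

```python
from math import sqrt
from statistics import fmean, stdev
from scipy.stats import t

times = [38.2, 43.9, 38.4, 26.2, 41.3, 42.3, 37.5, 37.2,
         41.2, 42.3, 50.1, 37.3, 36.7, 31.8, 31.0]

n = len(times)
xbar = fmean(times)   # 38.36
s = stdev(times)      # about 5.78 (sample standard deviation)

for conf in (0.95, 0.99):
    mult = t.ppf(1 - (1 - conf) / 2, df=n - 1)   # t-multiplier, df = 14
    e = mult * s / sqrt(n)
    print(round(xbar - e, 2), round(xbar + e, 2))
```

This reproduces the hand-calculated intervals (35.16, 41.56) and (33.92, 42.80).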
Interpretation of confidence interval
The confidence limits depend on the sample values and will therefore change as the sample
changes. Suppose it is known that μ = 9, σ² = 2. The plot below shows 95% confidence
intervals calculated from 100 different data sets of size n = 24 simulated from a N(9, 2)
distribution. Most of the confidence intervals will include the true mean of 9. The expression
"with 95% confidence" is interpreted as "95% of the simulated confidence intervals will
include the true mean of 9".
7.6 Confidence interval for the population variance
The formula for the confidence interval of the population variance σ² follows from the fact
that (n−1)S²/σ² follows a chi-square distribution with (n−1) degrees of freedom. Let χ²(1−α/2)
and χ²(α/2) denote the 100(1−α/2) and 100(α/2) percentile points of the chi-square distribution
with (n−1) degrees of freedom. These points are shown in the graph below.
P[χ²(α/2) ≤ (n−1)S²/σ² ≤ χ²(1−α/2)] = 1−α.
By a few steps of mathematical manipulation (not shown here), the above part in brackets can
be changed to have only the parameter σ² between the inequality signs. This will give
P[(n−1)S²/upper ≤ σ² ≤ (n−1)S²/lower] = 1−α,
where upper = χ²(1−α/2), the larger of the 2 percentile points, and
lower = χ²(α/2), the smaller of the 2 percentile points.
Example
Calculate 90% and 95% confidence intervals for the population variance of the time taken to
complete the simple task (see previous example).
Solution
Here n = 15 and S² = 33.381, so (n−1)S² = 14 × 33.381 = 467.34.
For the 90% interval:
upper = χ²(1−α/2) = χ²(0.95) = 23.68 for ν = 14.
lower = χ²(α/2) = χ²(0.05) = 6.57 for ν = 14.
The confidence interval is (467.34/23.68, 467.34/6.57) = (19.74, 71.13).
For the 95% interval:
upper = χ²(1−α/2) = χ²(0.975) = 26.12 for ν = 14.
lower = χ²(α/2) = χ²(0.025) = 5.63 for ν = 14.
The confidence interval is (467.34/26.12, 467.34/5.63) = (17.89, 83.01).
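The same intervals can be computed with exact chi-square percentiles. A sketch assuming scipy is available (the results agree with the table-based answers to about two decimals):

```python
# Variance confidence intervals using exact chi-square percentiles.
from scipy.stats import chi2

n, s2 = 15, 33.381   # sample size and sample variance from the example
df = n - 1

for conf in (0.90, 0.95):
    alpha = 1 - conf
    upper = chi2.ppf(1 - alpha / 2, df)   # larger percentile point
    lower = chi2.ppf(alpha / 2, df)       # smaller percentile point
    print(round(df * s2 / upper, 2), round(df * s2 / lower, 2))
```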
7.7 Confidence interval for the population proportion
In some experiments the interest is in whether items possess a certain characteristic of interest
(e.g., whether a patient improves or not after treatment, whether an item manufactured is
acceptable or not, whether an answer to a question is correct or incorrect). The population
proportion of items labeled “success” in such an experiment (e.g., patient improves, item is
acceptable, answer is correct) is estimated by calculating the sample proportion of “success”
items.
The determination of the confidence limits for the population proportion of items labeled
"success" is based on the central limit theorem for the sample proportion P̂ = X/n, where X is
the number of items in the sample labeled "success". For sufficiently large samples,
the sample proportion of "success" items P̂ ~ N(p, pq/n) and hence
Z = (P̂ − µ(P̂))/σ(P̂) = (P̂ − p)/√(pq/n) ~ N(0, 1).
Formulae for the lower and upper confidence limits can be constructed in the following way.
Since Z ~ N(0,1),
P(−1.96 ≤ (P̂ − p)/√(pq/n) ≤ 1.96) = 0.95.
By a few steps of mathematical manipulation (not shown here), the above part in brackets can
be changed to have the parameter p (in the numerator) between the inequality signs. This will
give
P(P̂ − 1.96√(pq/n) ≤ p ≤ P̂ + 1.96√(pq/n)) = 0.95.
Since the confidence interval formula is based on a single sample, the random variable
P̂ = X/n is replaced by its sample estimate p̂ = x/n, and the parameters p and q = 1−p by their
respective sample estimates p̂ and q̂ = 1 − p̂. The 95% confidence interval is then calculated
as p̂ ± 1.96√(p̂q̂/n).
confidence percentage 99 95 90
z-multiplier 2.576 1.96 1.645
α 0.01 0.05 0.10
Example
During a marketing campaign for a new product, 176 out of the 200 potential users of this
product that were contacted indicated that they would use it. Calculate a 90% confidence
interval for the proportion of potential users who would use this product.
Solution
p̂ = 176/200 = 0.88, q̂ = 1 − p̂ = 0.12.
z-multiplier = 1.645 (from above table)
Confidence interval is 0.88 ± 1.645√(0.88 × 0.12/200) = 0.88 ± 0.0378 = (0.842, 0.918).
The confidence interval for the proportion of successes calculated above is called the
binomial confidence interval (based on the binomial distribution formulae).
Other approaches used to calculate such a confidence interval are the Wilson, Clopper-
Pearson, Jeffreys, Agresti-Coull and Arcsine.
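The binomial (Wald) interval from the example above can be computed as follows; the helper function name is illustrative, not from the notes:

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(x, n, conf):
    """Large-sample (binomial/Wald) confidence interval for p."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    e = z * sqrt(p_hat * (1 - p_hat) / n)
    return (round(p_hat - e, 3), round(p_hat + e, 3))

print(proportion_ci(176, 200, 0.90))  # (0.842, 0.918)
```

The alternatives mentioned (Wilson, Clopper-Pearson, etc.) replace the √(p̂q̂/n) standard error step with different constructions but are used the same way.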
Consider the formula for the confidence interval of the mean (µ) when σ² is known:
x̄ ± z-multiplier × σ/√n.
The quantity z-multiplier × σ/√n is known as the error (denoted by E).
The smaller the error, the more accurately the parameter μ is estimated.
Suppose the size of the error is specified in advance and the sample size n is determined to
achieve this accuracy. This can be done by solving for n from the equation
E = z-multiplier × σ/√n, which gives
n = (z-multiplier × σ/E)².
Example
Consider the example on the interval estimation of the mean content of 500 millilitre cool
drink bottles. The standard deviation σ is known to be 5. Suppose it is desired to estimate the
mean with 95% confidence and an error that is not greater than 0.8. What sample size is
needed to achieve this accuracy?
Solution
n = (1.96 × 5/0.8)² = 150.0625 = 151 (n is always rounded up).
The approach used in determining the sample size for the estimation of the population
proportion is much the same as that used when estimating the population mean.
E = z-multiplier × √(pq)/√n, which gives
n = pq × (z-multiplier/E)².
A practical problem encountered when using this formula is that values for the parameters p
and q=1-p are needed. Since the purpose of this technique is to estimate p, these values of p
and q are obviously not known.
If no information on p is available, the value of p that will give the maximum value of pq,
namely p = ½, is used. Then
max pq = ¼ and
max n = ¼ × (z-multiplier/E)².
If more accurate information on the value of p is known (e.g., some range of values), it
should be used in the above formula.
Example
Consider the problem (discussed earlier) of estimating the proportion of potential users who
would use a new product. Suppose this proportion is to be estimated with 99% confidence
and an error not exceeding 2% (proportion of 0.02) is required. What sample size is needed to
achieve this?
Solution
n = ¼ × (2.576/0.02)² = 4147.36 = 4148 (rounded up).
Suppose it is known that the value of p is between 0.8 and 0.9, i.e., 0.8 < p ≤ 0.9. Over this
range pq attains its maximum value of 0.8 × 0.2 = 0.16 at p = 0.8. By using this information,
the value of n can be calculated as
n = 0.16 × (2.576/0.02)² = 2654.31 = 2655 (rounded up).
The additional information on possible values for p reduces the sample size by 36%.
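The two sample-size formulas can be wrapped in small helper functions (the names are illustrative); using the rounded z-multipliers from the table reproduces the answers above:

```python
from math import ceil

def sample_size_for_mean(z, sigma, e):
    """n = (z * sigma / e)^2, rounded up."""
    return ceil((z * sigma / e) ** 2)

def sample_size_for_proportion(z, e, pq=0.25):
    """n = pq * (z / e)^2, rounded up; pq = 1/4 when p is unknown."""
    return ceil(pq * (z / e) ** 2)

print(sample_size_for_mean(1.96, 5, 0.8))             # 151
print(sample_size_for_proportion(2.576, 0.02))        # 4148
print(sample_size_for_proportion(2.576, 0.02, 0.16))  # 2655
```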
1 Confidence interval for the mean (σ² known). For the data in the example in section 7.4,
the information can be typed on an Excel sheet and the confidence interval calculated as
follows.

mean          498.5
sigma         5
n             30
z-multiplier  1.959964

Confidence interval
lower         496.71
upper         500.29
2 Confidence interval for the mean (σ² not known). For the data in the example in section
7.5, the information can be typed on an Excel sheet and the confidence interval calculated as
follows.

mean          38.36
stand.dev     5.777642
n             15
t-multiplier  2.144787

Confidence interval
lower         35.16
upper         41.56
3 Confidence interval for the variance. For the data in the example in section 7.6, the
information can be typed on an Excel sheet and the confidence interval calculated as follows.

variance            33.38114
n                   15
degrees of freedom  14
lower chisq.        5.628726
upper chisq.        26.11895

Confidence interval
lower               17.89
upper               83.03
4 Confidence interval for the proportion of successes. For the data in the example in section
7.7, the information can be typed on an Excel sheet and the confidence interval calculated as
follows.

n             200
x             176
z-multiplier  1.644854
st.error      0.022978

Confidence interval
lower         0.842
upper         0.918
Chapter 8 – Testing of hypotheses
The purpose of testing of hypotheses is to determine whether a claim that is made could be
true. The conclusion about the truth of such a claim is not stated with absolute certainty, but
rather in terms of the language of probability.
Examples
1 A supermarket receives complaints that the mean content of “1 kilogram” sugar bags that
are sold by them is less than 1 kilogram.
2 The variability in the drying time of a certain paint (as measured by the variance) has until
recently been 65 minutes. It is suspected that the variability has now increased.
3 A construction company suspects that the proportion of jobs they complete behind
schedule is 0.20 (20%). They want to test whether this is indeed the case.
The null hypothesis (H0) is a statement concerning the value of the parameter of interest (θ)
in a claim that is made. This is formulated as
H0: θ = θ0 (the statement that the parameter θ is equal to the hypothetical value θ0).
The alternative hypothesis (H1) is a statement about the possible values of the parameter
θ that are believed to be true if H0 is not true. One of the alternative hypotheses shown
below will apply.
H1a: θ < θ0 or H1b: θ > θ0 or H1c: θ ≠ θ0.
Examples
1 In the first example (above) the parameter of interest is the population mean µ and the
hypotheses to be tested are
H0: µ = 1 versus H1: µ < 1.
2 In the second example (above) the parameter of interest is the population variance σ² and
the hypotheses to be tested are
H0: σ² = 65 versus H1: σ² > 65.
3 In the third example (above) the parameter of interest is the population proportion, p, of
job completions behind schedule and the hypotheses to be tested are
H0: p = 0.20 versus H1: p ≠ 0.20.
A one-sided alternative hypothesis is one that specifies the alternative values (to the null
hypothesis) in a direction that is either below or above that specified by the null hypothesis.
Example
The alternative hypothesis H1a (see example 1 above) is the alternative that the value of the
parameter is less than that stated under the null hypothesis and the alternative H1b (see
example 2 above) is the alternative that the value of the parameter is greater than that stated
under the null hypothesis.
A two-sided alternative hypothesis is one that specifies the alternative values (to the null
hypothesis) in directions that can be either below or above that specified by the null
hypothesis.
Example
The alternative hypothesis H1c (see example 3 above) is the alternative that the value of the
parameter is either greater than that stated under the null hypothesis or less than that stated
under the null hypothesis.
8.2 Testing of hypotheses for one sample: Terminology and summary of procedure
The testing procedure and terminology will be explained for the test for the population mean
μ with population variance σ2 known.
H0: µ = µ0
versus one of the alternatives H1: µ < µ0, H1: µ > µ0 or H1: µ ≠ µ0.
The data set that is needed to perform the test is x1, x2, . . . , xn ,
a random sample of size n drawn from the population for which the mean is tested. The test is
performed to see whether the sample data are consistent with what is stated by the null
hypothesis. The instrument that is used to perform the test is called a test statistic.
When testing for the population mean, the test statistic used is
z0 = (x̄ − μ0)/(σ/√n).
If the difference between x̄ and µ0 (and therefore the value of z0) is reasonably small, H0 will
not be rejected. In this case the sample mean is consistent with the value of the population
mean that is being tested. If this difference (and therefore the value of z0) is sufficiently large,
H0 will be rejected. In this case the sample mean is not consistent with the value of the
population mean that is being tested. To decide how large this difference between x̄ and μ0
(and therefore the value of z0) should be before H0 is rejected, the following should be
considered.
Type I error
A type I error is committed when the null hypothesis is rejected when, in fact, it is true,
i.e., H0 is wrongly rejected.
In this test, a type I error is committed when it is decided that the statement H 0: µ = μ0 should
be rejected when, in fact, it is true.
Type II error
A type II error is committed when the null hypothesis is not rejected when, in fact, it is
false, i.e., a decision not to reject H0 is wrong.
In this test, a type II error is committed when it is decided that the statement H 0: µ = μ0
should not be rejected when, in fact, it is false.
The power of a statistical test is the probability that H0 is rejected when, in fact, it is false.
The following table gives a summary of possible conclusions and their correctness when
performing a test of hypotheses.

                      H0 true             H0 false
Reject H0             type I error        correct decision
Do not reject H0      correct decision    type II error
A type I error is often considered to be more serious, and therefore more important to avoid,
than a type II error. The hypothesis testing procedure is therefore designed so that there is a
guaranteed small probability of rejecting the null hypothesis wrongly. This probability is
never 0 (why?). Mathematically the probability of a type I error can be stated as
α = P(type I error) = P(reject H0 when H0 is true).
Probabilities of type I and type II errors work in opposite directions. The more reluctant you
are to reject H0, the higher the risk of accepting it when, in fact, it is false. The easier you
make it to reject H0, the lower the risk of accepting it when, in fact, it is false.
When taking power into account, the sample size to be used in a (one-sided) test is
n = (Z1-α + Z1-β)²/ES², where ES = effect size = |μ1 − μ0|/σ,
α is the level of significance, 1−β is the power and μ0 and μ1 are the means under H0 and H1 respectively.
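As an illustration, the sample-size formula can be evaluated numerically. The sketch below is not part of the notes' notation: the function name sample_size is chosen here for clarity and the scipy library is assumed to be available.

```python
# Sample size for a one-sided z-test, n = (Z_{1-alpha} + Z_{1-beta})^2 / ES^2.
# Illustrative sketch; assumes the scipy library is installed.
import math
from scipy.stats import norm

def sample_size(mu0, mu1, sigma, alpha=0.05, power=0.95):
    es = abs(mu1 - mu0) / sigma        # effect size ES = |mu1 - mu0| / sigma
    z_a = norm.ppf(1 - alpha)          # Z_{1-alpha}
    z_b = norm.ppf(power)              # Z_{1-beta}, where power = 1 - beta
    return math.ceil((z_a + z_b) ** 2 / es ** 2)  # round up to a whole sample
```

For example, detecting a shift of half a standard deviation (ES = 0.5) at α = 0.05 with power 0.80 requires a sample of size 25.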
The critical (cut-off) value (s) for tests of hypotheses is a value(s) with which the test
statistic is compared to determine whether or not the null hypothesis should be rejected.
The critical value is determined according to the specified value of α, the probability of a type
I error.
For the test of the population mean the critical value is determined in the following way.
Assuming that H0 is true, the test statistic Z0 = (X̄ − μ0)/(σ/√n) ~ N(0, 1).
(i) When testing H0 versus the alternative hypothesis H1a (µ < µ0), the critical value is the
value Zα which is such that the area under the standard normal curve to the left of Zα is
α i.e., P(Z0 < Zα) = α.
The graph below illustrates the case α = 0.05 i.e., P(Z0 < -1.645) = 0.05.
(ii) When testing H0 versus the alternative hypothesis H1b (µ > µ0), the critical value is the
value Z1-α which is such that the area under the standard normal curve to the right of Z1-α is
α i.e., P(Z0 > Z1-α) = α.
The graph below illustrates the case α = 0.05 i.e. P(Z0 > 1.645) = 0.05.
(iii) When testing H0 versus the alternative hypothesis H1c (µ ≠ µ0), the critical values are the
values Z1-α/2 and Zα/2 which are such that the area under the standard normal curve to the right
of Z1-α/2 is α/2 and the area under the standard normal curve to the left of Zα/2 is α/2. i.e. P(Z0
> Z1-α/2) = α/2 and P(Z0 < Zα/2) = α/2.
The area under the normal curve between these two critical values is 1-α. The graph below
illustrates the case α = 0.05 i.e. P(Z0 <-1.96 or Z0> 1.96) = 0.05.
The critical region CR, or rejection region R, is the set of values of the test statistic for
which the null hypothesis is rejected.
(i) When testing H0 versus the alternative hypothesis H1a , the rejection region is
{ z0 | z0 < Zα }.
(ii) When testing H0 versus the alternative hypothesis H1b , the rejection region is
{ z0 | z0 > Z1-α }.
(iii) When testing H0 versus the alternative hypothesis H1c , the rejection region is
{ z0 | z0 > Z1-α/2 or z0 < Zα/2 }.
H0 is rejected when there is a sufficiently large difference between the sample mean x̄ and
the mean (μ0) under H0 . Such a large difference is called a significant difference (the result of
the test is significant). The value of α is called the level of significance. It specifies the level
beyond which this difference (between x̄ and μ0) is sufficiently large for H0 to be rejected.
The value of α is specified prior to performing the test and is usually taken as either 0.05 (5%
level of significance) or 0.01 (1% level of significance).
When H0 is rejected, it does not necessarily mean that it is not true. It means that according to
the sample evidence available it appears not to be true. Similarly, when H0 is not rejected it
does not necessarily mean that it is true. It means that there is not sufficient sample evidence
to disprove H0.
Critical values for tests based on the standard normal distribution can be found from the
selected percentiles listed at the bottom of the pages of the standard normal table.
The p-value is defined as the probability of getting a value more extreme than that of the test
statistic. Suppose the hypothesis H0: µ = µ0 is being tested and the test statistic value is
found to be z0.
(i) When testing H0 versus H1a: μ < μ0, p-value = P(Z < z0).
(ii) When testing H0 versus H1b: μ > μ0, p-value = P(Z > z0).
(iii) When testing H0 versus H1c: μ ≠ μ0, p-value = 2P(Z > |z0|).
To reach a decision to reject or not reject H0, the p-value is compared to α, the level of
significance. If p-value < α, reject H0. Otherwise, do not reject H0.
To see that the above-mentioned decision rule based on a p-value will lead to the same
conclusion as one based on a critical region determined by the level of significance α,
consider the test against the alternative H1a. If p-value = P(Z < z0) ≥ α = P(Z < Zα),
then z0 ≥ Zα and H0 is not rejected.
In the above definitions, the calculation of the p-value is based on the standard normal
distribution (of Z) and the test statistic value z0. When performing a different test of
hypotheses, the p-value calculation will be based on the test statistic applicable to the test and
its associated sampling distribution.
Test for μ when σ² is known
1 State null and alternative hypotheses.
H0: μ = μ0 versus H1a: μ < μ0 or H1b: μ > μ0 or H1c: μ ≠ μ0.
2 Calculate the test statistic z0 = (x̄ − μ0)/(σ/√n).
3 State the level of significance α and determine the critical value(s) and critical region.
(i) For alternative H1a the critical region is R = {z0 | z0 < Zα }.
(ii) For alternative H1b the critical region is R = {z0 | z0 > Z1-α }.
(iii) For alternative H1c the critical region is R = {z0 | z0 > Z1-α/2 or z0 < Zα/2 }.
4 If z0 lies in the critical region, reject H0, otherwise do not reject H0.
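The four steps above can be collected into a small routine. This is an illustrative sketch only: the function name z_test is not part of the notes and the scipy library is assumed to be available.

```python
# One-sample z-test for the mean with sigma known (summary data).
# Illustrative sketch; assumes the scipy library is installed.
import math
from scipy.stats import norm

def z_test(xbar, mu0, sigma, n, alternative="two-sided"):
    z0 = (xbar - mu0) / (sigma / math.sqrt(n))   # test statistic
    if alternative == "less":        # H1a: mu < mu0
        p = norm.cdf(z0)
    elif alternative == "greater":   # H1b: mu > mu0
        p = norm.sf(z0)
    else:                            # H1c: mu != mu0
        p = 2 * norm.sf(abs(z0))
    return z0, p                     # reject H0 when p-value < alpha
```

For the sugar-bag data of example 1 below, z_test(0.987, 1, 0.025, 40, "less") gives z0 ≈ −3.289 with a p-value well below 0.05.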
Examples
1 A supermarket receives complaints that the mean content of “1 kilogram” sugar bags that
are sold by them is less than 1 kilogram. A random sample of 40 sugar bags is selected from
the shelves and the mean found to be 0.987 kilograms. The standard deviation of the contents of
these bags is known to be 0.025 kilograms. Test, at the 5% level of significance, whether this
complaint is justified.
Solution
H0: µ = 1 (mean content is 1 kilogram)
H1: µ < 1 (mean content is less than 1 kilogram)
Test statistic: z0 = (0.987 − 1)/(0.025/√40) = −3.289.
α = 0.05. Critical region R = {z0 | z0 < Z0.05 = −1.645}.
Since z0 = −3.289 < −1.645, H0 is rejected. The complaint appears to be justified.
Note on power of the test: The critical region can also be expressed in terms of values of x̄.
In the above example H0 will be rejected if z0 = (x̄ − 1)/(0.025/√40) < −1.645. This means that
values of x̄ < 1 − 1.645 × 0.025/√40 = 0.9935 will lead to the rejection of H0.
Suppose the true mean is μ = 0.98. The power of the test against the alternative H1: μ = 0.98 is then
P(x̄ < 0.9935 | μ = 0.98) = P(z < (0.9935 − 0.98)/(0.025/√40)) = P(z < 3.415) = 0.9997.
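The power calculation above can be reproduced numerically. The sketch below is illustrative only (the function name is not from the notes; scipy is assumed to be available).

```python
# Power of the lower-tailed z-test of H0: mu = mu0 at level alpha,
# evaluated at a specific alternative mu1 < mu0.
# Illustrative sketch; assumes the scipy library is installed.
import math
from scipy.stats import norm

def power_lower_tail(mu0, mu1, sigma, n, alpha=0.05):
    se = sigma / math.sqrt(n)
    cutoff = mu0 + norm.ppf(alpha) * se      # values of xbar below this reject H0
    return norm.cdf((cutoff - mu1) / se)     # P(reject H0 | mu = mu1)
```

For the sugar-bag example, power_lower_tail(1, 0.98, 0.025, 40) reproduces the power of 0.9997.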
2 A supermarket manager suspects that the machine filling “500 gram” containers of coffee
is overfilling them, i.e., the actual content of these containers is more than 500 grams. A
random sample of 30 of these containers is selected from the shelves and the mean found to
be 501.8 grams. The variance of the contents of these containers is known to be 60 grams². Test at the
5% level of significance whether the manager’s suspicion is justified.
Solution
H0: µ = 500 (mean content is 500 grams)
H1: µ > 500 (machine is overfilling)
Test statistic: z0 = (501.8 − 500)/√(60/30) = 1.273.
Since z0 = 1.273 < 1.645, H0 is not rejected. [p-value = P(Z > 1.273) = 0.102 > 0.05.]
3 During a quality control exercise the manager of a factory that fills cans of frozen shrimp
wants to check whether the mean weights of the cans conform to specifications i.e., the mean
of these cans should be 600 grams as stated on the label of the can. He/she wants to guard
against either over or under filling the cans. A random sample of 50 of these cans is selected
and the mean found to be 595 grams. The standard deviation of the contents of these cans is
known to be 20 grams. Test, at the 5% level of significance, whether the weights conform to
specifications. Repeat the test at the 10% level of significance.
Solution
H0: µ = 600 (mean weight conforms to specification)
H1: µ ≠ 600 (mean weight does not conform to specification)
Test statistic: z0 = (595 − 600)/(20/√50) = −1.768.
α = 0.05. Critical region R = {z0 < Z0.025 = -1.96 or z0 > Z0.975 = 1.96}.
Since -1.96 < z0 = -1.768 < 1.96, H0 is not rejected.
Suppose the test is performed at the 10% level of significance. In such a case
α = 0.10. Critical region R = {z0 < Z0.05 = -1.645 or z0 > Z0.95 = 1.645}.
Since z0 = -1.768 < -1.645, H0 is rejected.
The p-value =2 P( Z>1.768)=0.0771 is greater than 0.05 but less than 0.10.
Thus, being less strict about controlling a type I error (changing α from 0.05 to 0.10) results
in a different conclusion about H0 (reject instead of do not reject).
Note
1 In example 1 the alternative hypothesis H1a was used, in example 2 the alternative H1b and
in example 3 the alternative H1c.
2 Alternatives H1a and H1b [one-sided (tailed) alternatives] are used when there is a
particular direction attached to the range of mean values that could be true if H 0 is not true.
3 Alternative H1c [two-sided (tailed) alternative] is used when there is no direction attached
to the range of mean values that could be true if H0 is not true.
4 If, in the above examples, the level of significance had been changed to 1%, the critical
values used would have been Z0.01 = -2.326 (in example 1), Z0.99 = 2.326 (in example 2)
and Z0.005 = -2.576, Z0.995 = 2.576 (in example 3).
8.4 Test for the population mean (population variance not known): t-test
When performing the test for the population mean for the case where the population variance
is not known, the following modifications are made to the procedure.
1 In the test statistic formula the population standard deviation σ is replaced by the sample
standard deviation S.
2 The test statistic t0 = (x̄ − μ0)/(S/√n) is used to perform the test. It follows a
t-distribution with n-1 degrees of freedom. Critical values are looked up in the t-tables.
Test for μ when σ² is not known (t-test)
1 State null and alternative hypotheses.
H0: μ = μ0 versus H1a: μ < μ0 or H1b: μ > μ0 or H1c: μ ≠ μ0.
2 Calculate the test statistic t0 = (x̄ − μ0)/(S/√n).
3 State the level of significance α and determine the critical value(s) and critical region.
(i) For alternative H1a the critical region is R = {t0 | t0 < tα}.
(ii) For alternative H1b the critical region is R = {t0 | t0 > t1-α}.
(iii) For alternative H1c the critical region is R = {t0 | t0 > t1-α/2 or t0 < tα/2}.
4 If t0 lies in the critical region, reject H0, otherwise do not reject H0.
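A sketch of the t-test procedure from summary data follows. It is illustrative only (scipy is assumed; with raw data the library routine scipy.stats.ttest_1samp can be used instead).

```python
# One-sample t-test for the mean with sigma unknown (summary data).
# Illustrative sketch; assumes the scipy library is installed.
import math
from scipy.stats import t

def t_test(xbar, s, n, mu0, alternative="two-sided"):
    t0 = (xbar - mu0) / (s / math.sqrt(n))   # test statistic
    df = n - 1                               # degrees of freedom
    if alternative == "less":        # H1a: mu < mu0
        p = t.cdf(t0, df)
    elif alternative == "greater":   # H1b: mu > mu0
        p = t.sf(t0, df)
    else:                            # H1c: mu != mu0
        p = 2 * t.sf(abs(t0), df)
    return t0, p
```

For the paint-drying data below, t_test(124.1, 9.65674, 20, 120, "greater") reproduces t0 ≈ 1.899 and the p-value 0.0364.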
Examples
A paint manufacturer claims that the average drying time for a new paint is 2 hours (120
minutes). The drying times for 20 randomly selected cans of paint were obtained. The results
are shown below.
(a) Test whether the population mean drying time is greater than 2 hours (120 minutes).
(b) Test, at the 5% level of significance, whether the population mean drying time could be 2
hours (120 minutes).
Solution
(a) From the data, x̄ = 124.1 and S = 9.65674.
H0: µ = 120 versus H1: µ > 120.
Test statistic: t0 = (124.1 − 120)/(9.65674/√20) = 1.899.
If α = 0.05, the critical value from the t-distribution table with 19 degrees of freedom is t0.95 = 1.729.
Since 1.899 > 1.729, H0 is rejected. [p-value = P(t > 1.899) = 0.0364 < 0.05.]
If α = 0.01, the critical value is t0.99 = 2.539 and, since 1.899 < 2.539, H0 is not rejected.
Thus, being stricter about controlling a type I error (changing α from 0.05 to 0.01) results in
a different conclusion about H0 (Do not reject instead of reject).
(b) Test statistic: t0 = (124.1 − 120)/(9.65674/√20) = 1.899 (as calculated in part (a)).
If α = 0.05, α/2 = 0.025, 1-α/2 = 0.975. From the t-distribution table with 19 degrees of
freedom, t0.975 = 2.093.
Critical region R = {t0 > 2.093 or t0 < -2.093}.
Since -2.093 < 1.899 < 2.093, H0 is not rejected. [p-value = 2P(t > 1.899) = 0.0729 > 0.05.]
Note: Despite the fact that the same data were used in the above examples, the conclusions
were different. In the first test H0 was rejected, but in the next 2 tests H0 was not rejected.
1 In the first test the probability of a type I error was set at 5%, while in the second test this
was changed to 1%. To achieve this, the critical value was moved from 1.729 to 2.539, resulting in
the test statistic value (1.899) being less than (instead of greater than) the critical value.
2 In the third test (which has a two-sided alternative hypothesis), the upper critical value
was increased to 2.093 (to have an area of 0.025 under the t-curve to its right). Again, this
resulted in the test statistic value (1.899) being less than (instead of greater than) the critical
value.
8.5 Test for the population variance (chi-square test)
The test for the population variance is based on χ²0 = (n−1)S²/σ0² following a chi-square
distribution with n-1 degrees of freedom. The critical values are therefore obtained from the
chi-square tables.
1 State null and alternative hypotheses.
H0: σ² = σ0² versus H1a: σ² < σ0² or H1b: σ² > σ0² or H1c: σ² ≠ σ0².
2 Calculate the test statistic χ²0 = (n−1)S²/σ0².
3 State the level of significance α and determine the critical value(s) and critical region.
(i) For alternative H1a the critical region is R = {χ²0 | χ²0 < χ²α }.
(ii) For alternative H1b the critical region is R = {χ²0 | χ²0 > χ²1-α }.
(iii) For alternative H1c the critical region is R = {χ²0 | χ²0 > χ²1-α/2 or χ²0 < χ²α/2 }.
4 If χ²0 lies in the critical region, reject H0 , otherwise do not reject H0.
For a one-sided test with alternative hypothesis H1b the rejection region (highlighted area) is
shown in the graph below.
For a two-sided test with alternative hypothesis H1c the rejection region (highlighted area) is
shown in the graph below.
Example
1 Consider the example on the drying time of the paint discussed in the previous section.
Until recently it was believed that the variance of the drying time is 65 minutes². Suppose it is
suspected that this variance has increased. Test this assertion at the 5% level of significance.
Solution
H0: σ² = 65 versus H1: σ² > 65.
n = 20, S² = 93.2526 (calculated from the data).
Test statistic: χ²0 = 19 × 93.2526/65 = 27.258.
α = 0.05. From the chi-square distribution table with degrees of freedom ν = n-1 = 19, χ²0.95 = 30.14.
Critical region R = {χ²0 > χ²0.95 = 30.14}.
Since 27.258 < 30.14, H0 is not rejected. [p-value = P(χ² > 27.258) = 0.0988 > 0.05.]
2 A manufacturer of car batteries guarantees that their batteries will last, on average, 3 years
with a standard deviation of 1 year. Ten of the batteries have lifetimes of
1.2, 2.5, 3, 3.5, 2.8, 4, 4.3, 1.9, 0.7 and 4.3 years.
Test at the 5% level of significance whether the variability guarantee is still valid.
Solution
H0 : σ² = 1 (Guarantee is valid)
H1 : σ² ≠ 1 (Guarantee is not valid)
n = 10, σ0² = 1 (given), S = 1.26209702, S² = 1.592889 (calculated from the data).
Test statistic: χ²0 = 9 × 1.592889/1 = 14.336.
α = 0.05, α/2 = 0.025, 1-α/2 = 0.975. From the chi-square distribution table with
degrees of freedom ν = n-1 = 9, χ²0.025 = 2.70, χ²0.975 = 19.02.
Critical region R = {χ²0 < χ²0.025 = 2.70 or χ²0 > χ²0.975 = 19.02}.
Since 2.70 < 14.336 < 19.02, H0 is not rejected. [p-value = P(χ² > 14.336) = 0.110865 > 0.05.]
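The chi-square test for a variance can be sketched in the same style. This is an illustration only (scipy assumed); for the two-sided case the p-value is taken here as twice the smaller tail area, one common convention.

```python
# Chi-square test for H0: sigma^2 = sigma2_0 (summary data).
# Illustrative sketch; assumes the scipy library is installed.
from scipy.stats import chi2

def var_test(s_sq, n, sigma2_0, alternative="two-sided"):
    stat = (n - 1) * s_sq / sigma2_0       # test statistic chi2_0
    df = n - 1
    if alternative == "less":              # H1a: sigma^2 < sigma2_0
        p = chi2.cdf(stat, df)
    elif alternative == "greater":         # H1b: sigma^2 > sigma2_0
        p = chi2.sf(stat, df)
    else:                                  # H1c: twice the smaller tail area
        p = 2 * min(chi2.cdf(stat, df), chi2.sf(stat, df))
    return stat, p
```

For the paint-drying example above, var_test(93.2526, 20, 65, "greater") reproduces χ²0 ≈ 27.258 and the p-value 0.0988.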
The test for the population proportion (p) is based on the fact that the sample proportion
P̂ = X/n ~ N(p, pq/n), where n is the sample size and X the number of items labeled
“success” in the sample. From this result it follows that Z = (P̂ − p)/√(pq/n) ~ N(0, 1).
When testing H0: p = p0, the test statistic used is z0 = (p̂ − p0)/√(p0(1 − p0)/n).
For this reason, the critical value(s) and critical region are the same as that for the test for the
population mean (both based on the standard normal distribution).
(i) For alternative H1a the critical region is R = { z0 | z0 < Zα }.
(ii) For alternative H1b the critical region is R = { z0 | z0 > Z1-α }.
(iii) For alternative H1c the critical region is R = { z0 | z0 > Z1-α/2 or z0 < Zα/2 }.
4 If z0 lies in the critical region, reject H0, otherwise do not reject H0.
Examples
1 A construction company suspects that the proportion of jobs they complete behind
schedule is 0.20 (20%). Of their 80 most recent jobs 22 were completed behind schedule.
Test at the 5% level of significance whether this information confirms their suspicion.
Solution
H0: p = 0.20 (the proportion of jobs completed behind schedule is 0.20)
H1: p ≠ 0.20 (the proportion is not 0.20)
n = 80, x = 22 (given), p̂ = 22/80 = 0.275, p0 = 0.20.
Test statistic: z0 = (0.275 − 0.20)/√(0.20 × 0.80/80) = 1.677.
α = 0.05. Critical region R = {z0 < -1.96 or z0 > 1.96}.
Since -1.96 < z0 = 1.677 < 1.96, H0 is not rejected. The information is consistent with their suspicion.
2 During a marketing campaign for a new product 176 out of the 200 potential users of this
product that were contacted indicated that they would use it. Is this evidence that more than
85% of all the potential customers will use the product? Use α = 0.01.
Solution
H0 : p = 0.85 (85% of all potential users will use the product)
H1 : p > 0.85 (More than 85% of all potential users will use the product)
n = 200, x = 176, p0 = 0.85 (given), p̂ = 176/200 = 0.88.
Test statistic: z0 = (0.88 − 0.85)/√(0.85 × 0.15/200) = 1.188.
Since z0 = 1.188 < 2.326, H0 is not rejected. [p-value = P(Z > 1.188) = 0.1174 > 0.01.]
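The proportion test follows the same pattern as the z-test for the mean. The sketch below is illustrative (function name chosen here; scipy assumed).

```python
# Large-sample z-test for H0: p = p0.
# Illustrative sketch; assumes the scipy library is installed.
import math
from scipy.stats import norm

def prop_test(x, n, p0, alternative="two-sided"):
    phat = x / n
    z0 = (phat - p0) / math.sqrt(p0 * (1 - p0) / n)  # standard error uses p0
    if alternative == "less":        # H1a: p < p0
        p = norm.cdf(z0)
    elif alternative == "greater":   # H1b: p > p0
        p = norm.sf(z0)
    else:                            # H1c: p != p0
        p = 2 * norm.sf(abs(z0))
    return z0, p
```

For example 2, prop_test(176, 200, 0.85, "greater") reproduces z0 ≈ 1.188 and the p-value 0.1174.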
1 The output shown below was obtained when the test for the population mean, for the data in
example 1 in section 8.4, was performed using Excel.
t-Test: Mean
Mean 129.1
Variance 93.25263158
Observations 20
Hypothesized Mean 120
df 19
t Stat 1.898752271
P(T<=t) one-tail 0.036445557
t Critical one-tail 1.729132792
The value of the test statistic is t0 = 1.90 (2 decimal places). From the table, the one-tail
probability P(T ≥ 1.9) = 0.036. This probability is known as the p-value (the probability of
getting a t value more remote than the test statistic). When testing at the 5% level of
significance, a p-value below 0.05 will cause the null hypothesis to be rejected.
2 The output shown below was obtained when the test for the population variance in example 1 in
section 8.5 (the data in example 1 in section 8.4) was performed using Excel.
The values of the test statistic and critical value are the same as in the example in section 8.5.
The p-value is 0.098775 (2nd to last entry in the 2nd column in the table above). Since
0.098775 > 0.05, the null hypothesis cannot be rejected at the 5% level of significance.
Chapter 9 – Tests of hypotheses for two samples
The tests discussed in the previous chapter involve hypotheses concerning parameters of a
single population and were based on a random sample drawn from a single population of
interest. Often the interest is in tests concerning parameters of two different populations
(labeled populations 1 and 2) where two random samples (one from each population) are
drawn.
Examples
1 Are the mean salaries the same for males and females with the same educational
qualifications and work experience?
3 Are the variances in drying times for two different types of paints different?
1 The test for equality of two variances. As an example, see example 3 above.
2 The test for equality of two means (independent samples). As an example, see example 1
above.
3 The test for equality of two means (paired samples). As an example, see example 4 above.
4 The test for equality of two proportions. As an example, see example 2 above.
The parameters to be used, when testing the hypotheses, are summarized in the table below.
The following null and alternative hypotheses (as defined in section 8.1) also apply in the
two-sample case.
H0: θ = θ0 (The statement that the parameter θ is equal to the hypothetical value θ0).
H1a: θ < θ0 or H1b: θ > θ0 or H1c: θ ≠ θ0.
Examples
1 When testing for equality of variances from 2 different populations labeled 1 and 2 the
hypotheses are
H0: σ1² = σ2²
H1a: σ1² < σ2² or H1b: σ1² > σ2² or H1c: σ1² ≠ σ2².
These hypotheses can also be stated as
H0: σ1²/σ2² = 1
H1a: σ1²/σ2² < 1 or H1b: σ1²/σ2² > 1 or H1c: σ1²/σ2² ≠ 1.
In terms of the general notation stated above θ = σ1²/σ2² and θ0 = 1.
2 When testing for equality of means from 2 different populations labeled 1 and 2 the
hypotheses are
H0: μ1 = μ2
H1a: μ1 < μ2 or H1b: μ1 > μ2 or H1c: μ1 ≠ μ2.
These hypotheses can also be stated as
H0: μ1 − μ2 = 0
H1a: μ1 − μ2 < 0 or H1b: μ1 − μ2 > 0 or H1c: μ1 − μ2 ≠ 0.
3 When testing for equality of proportions from 2 different populations labeled 1 and 2 the
hypotheses are
H0: p1 = p2
H1a: p1 < p2 or H1b: p1 > p2 or H1c: p1 ≠ p2.
These hypotheses can also be stated as
H0: p1 − p2 = 0
H1a: p1 − p2 < 0 or H1b: p1 − p2 > 0 or H1c: p1 − p2 ≠ 0.
Notation
The following notation will be used in the description of the two-sample tests.

                                         Population 1        Population 2
sample size                              n                   m
sample                                   x1, x2, …, xn       x1, x2, …, xm
sample mean                              x̄1                  x̄2
sample variance (standard deviation)     S1² (S1)            S2² (S2)
sample proportion                        p̂1 = xn*/n          p̂2 = xm*/m

xn* and xm* are the numbers of “success” items in the samples from populations 1 and 2
respectively.
Sample difference (θ̂)   Condition                                     Standard error SE(θ̂)
X̄1 − X̄2                 population variances not equal                (σ1²/n + σ2²/m)^(1/2)
X̄1 − X̄2                 population variances equal (σ1² = σ2² = σ²)   σ(1/n + 1/m)^(1/2)
P̂1 − P̂2                 population proportions not equal              [p1(1−p1)/n + p2(1−p2)/m]^(1/2)
P̂1 − P̂2                 population proportions equal (p1 = p2 = p)    [p(1−p)(1/n + 1/m)]^(1/2)
The above-mentioned formulae can also be used to calculate 100(1−α)% confidence intervals
for θ. These are of the form θ̂ ± Z1-α/2 SE(θ̂) or θ̂ ± t1-α/2 SE(θ̂), depending on whether the
normal or t distribution applies.
Two-sample sampling distribution results for differences between means and differences
between proportions
1 For sufficiently large random samples (both n, m > 30) drawn from populations (with
known variances) that are not too different from a normal population, the statistic
Z = [X̄1 − X̄2 − (μ1 − μ2)] / (σ1²/n + σ2²/m)^(1/2)
follows a N(0,1) distribution. Here the population variances σ1² and σ2² are assumed to be
known but are not equal.
2 When σ1² = σ2² = σ² the above-mentioned result still holds, but with σ(1/n + 1/m)^(1/2) in the
denominator in the formula for Z.
3 When the population variances σ1², σ2² and σ², referred to in the two above-mentioned
results, are not known they may be replaced by their sample estimates S1², S2² and
S² = [(n−1)S1² + (m−1)S2²]/(n + m − 2) respectively in the above formula for Z.
In both the results that follow, it is assumed that the samples drawn are independent samples
from normally distributed populations.
(i) When the population variances σ1² and σ2² are not known but equal, it can be shown that
t = [X̄1 − X̄2 − (μ1 − μ2)] / [S(1/n + 1/m)^(1/2)]
follows a t-distribution with n + m − 2 degrees of freedom.
(ii) When the population variances σ1² and σ2² are not known and not equal, it can be shown
that
t = [X̄1 − X̄2 − (μ1 − μ2)] / (S1²/n + S2²/m)^(1/2)
follows a t-distribution with degrees of freedom = the integer part of
v = (S1²/n + S2²/m)² / [(S1²/n)²/(n−1) + (S2²/m)²/(m−1)].
4 For sufficiently large random samples the statistic
Z = [P̂1 − P̂2 − (p1 − p2)] / [p1(1−p1)/n + p2(1−p2)/m]^(1/2)
follows a N(0,1) distribution.
5 When p1 = p2 = p the above-mentioned result still holds but with [p(1−p)(1/n + 1/m)]^(1/2) in
the denominator.
6 Provided the sample sizes are sufficiently large, the two above-mentioned results will still
be valid with p1, p2 and p in the denominator replaced by p̂1 = xn*/n, p̂2 = xm*/m and
p̂ = (xn* + xm*)/(n + m) respectively.
9.2 Test for equality of population variances (F-test) and confidence interval for σ1²/σ2²
Test for σ1² = σ2²
Step 1: State null and alternative hypotheses
H0: σ1² = σ2² versus H1a: σ1² < σ2² or H1b: σ1² > σ2² or H1c: σ1² ≠ σ2²
Step 2: Calculate the test statistic F0 = max(S1², S2²)/min(S1², S2²).
Step 3: State the level of significance α and determine the critical value(s) and critical
region.
The degrees of freedom are
df1 = (sample size for the numerator sample variance) − 1 and
df2 = (sample size for the denominator sample variance) − 1.
Step 4: If F0 lies in the critical region, reject H0, otherwise do not reject H0.
Confidence interval for σ1²/σ2²
Step 1: Calculate S1² and S2². Values of n, m and the confidence percentage are given.
Step 2: Determine the upper and lower F-distribution values for the given confidence
percentage, df1 and df2.
Step 3: The confidence interval is (S1²/S2² × lower, S1²/S2² × upper).
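The F-test and the confidence interval can be combined in one routine. This is a sketch under the parameterization used in these notes (larger sample variance in the numerator of F0; the CI pivot has S2² in the numerator); the function name is illustrative and scipy is assumed.

```python
# Two-sided F-test of H0: sigma1^2 = sigma2^2 and a 100(1-alpha)%
# confidence interval for sigma1^2/sigma2^2 (summary data).
# Illustrative sketch; assumes the scipy library is installed.
from scipy.stats import f

def f_test_and_ci(s1_sq, n, s2_sq, m, alpha=0.05):
    f0 = max(s1_sq, s2_sq) / min(s1_sq, s2_sq)   # larger variance on top
    if s1_sq >= s2_sq:
        df1, df2 = n - 1, m - 1
    else:
        df1, df2 = m - 1, n - 1
    p = min(1.0, 2 * f.sf(f0, df1, df2))         # two-sided p-value
    ratio = s1_sq / s2_sq
    lower = ratio * f.ppf(alpha / 2, m - 1, n - 1)
    upper = ratio * f.ppf(1 - alpha / 2, m - 1, n - 1)
    return f0, p, (lower, upper)
```

For the travel-expenses data in example 1 below, f_test_and_ci(9593.6, 6, 15884, 7) reproduces F0 ≈ 1.656 and the 95% interval (0.101, 4.216).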
Examples
1 The following sample information about the daily travel expenses of the sales (population
1) and audit (population 2) staff at a certain company was collected.
(a) Test at the 10% level of significance whether the population variances could be the
same.
(b) Calculate a 95% confidence interval for σ1²/σ2².
(a) H0: σ1² = σ2²
H1: σ1² ≠ σ2²
From the data, S1² = 9593.6 (n = 6) and S2² = 15884 (m = 7).
Test statistic F0 = 15884/9593.6 = 1.656.
For df1 = 6, df2 = 5, F0.95 = 4.95.
Critical region R = {F0 > 4.95}.
Since F0 = 1.656 < 4.95, H0 is not rejected. [p-value = P(F > 1.656) = 0.2982.]
(b) P(F0.025 < (S2²/σ2²)/(S1²/σ1²) < F0.975) = 0.95 or
P((S1²/S2²)F0.025 < σ1²/σ2² < (S1²/S2²)F0.975) = 0.95.
In the above expression S2² is in the numerator and S1² in the denominator. Hence
df1 = 6, df2 = 5 and upper = F0.975 = 6.98. lower = F0.025 is found from F0.975 with
df1 = 5, df2 = 6 i.e. lower = 1/5.99 = 0.1669.
Substituting S1²/S2² = 0.604, F0.025 = 0.1669 and F0.975 = 6.98 into the above gives a confidence
interval of (0.604 × 0.1669, 0.604 × 6.98) = (0.101, 4.216).
2 The waiting times (minutes) for minor treatments were recorded at two different medical
centres. Below is a summary of the calculations made from the samples.
Test at the 5% level of significance whether the population 1 variance is less than that for
population 2.
H0: σ1² = σ2²
H1: σ1² < σ2²
From the above table n = 12, m = 10, S1² = 7.200 and S2² = 22.017.
Test statistic F0 = 22.017/7.200 = 3.058.
df1 = 10 − 1 = 9, df2 = 12 − 1 = 11, α = 0.05.
For df1 = 9, df2 = 11, F0.95 = 2.90.
Critical region R = {F0 > 2.90}.
Since F0 = 3.058 > 2.90, H0 is rejected. [p-value = P(F > 3.058) = 0.0422 < 0.05.]
Conclusion: The variance for population 1 is probably less than that for population 2.
(i) For independent large samples (both sample sizes n, m > 30) and population
variances known
Test for μ1 − μ2 = 0 (large samples, population variances known)
Step 1: State null and alternative hypotheses
H0: μ1 − μ2 = 0
H1a: μ1 − μ2 < 0 or H1b: μ1 − μ2 > 0 or H1c: μ1 − μ2 ≠ 0
Step 2: Calculate the test statistic z0 = (x̄1 − x̄2)/(σ1²/n + σ2²/m)^(1/2).
Step 3: State the level of significance α and determine the critical value(s) and critical
region.
(i) For alternative H1a the critical region is R = {z0 | z0 < Zα}.
(ii) For alternative H1b the critical region is R = {z0 | z0 > Z1-α}.
(iii) For alternative H1c the critical region is R = {z0 | z0 > Z1-α/2 or z0 < Zα/2}.
Step 4: If z0 lies in the critical region, reject H0, otherwise do not reject H0.
A 100(1−α)% confidence interval for μ1 − μ2 is given by
x̄1 − x̄2 ± Z1-α/2 (σ1²/n + σ2²/m)^(1/2).
If the population variances σ1² and σ2² are not known, they can be replaced in the above
formulae by their sample estimates S1² and S2² respectively with the testing procedure
unchanged.
Examples:
1 Data were collected on the length of short term stay of patients at hospitals. Independent
random samples of n = 40 male patients (population 1) and m = 35 female patients
(population 2) were selected. The sample mean stays for male and female patients were
x̄1 = 9 days and x̄2 = 7.2 days respectively. The population variances are known to be
σ1² = 55 and σ2² = 47.
(a) Test at the 5% level of significance whether male patients stay longer on average than
female patients.
(b) Calculate a 95% confidence interval for the mean difference (in staying time) between
males and females.
(a) H0: μ1 − μ2 = 0 (mean staying times for males and females the same)
H1: μ1 − μ2 > 0 (mean staying time for males greater than for females)
Test statistic: z0 = (x̄1 − x̄2)/(σ1²/n + σ2²/m)^(1/2) = (9 − 7.2)/(55/40 + 47/35)^(1/2)
= 1.8/1.6486 = 1.09184.
Since z0 = 1.09184 < 1.645, H0 cannot be rejected. [p-value = P(Z > 1.09184) = 0.1375 > 0.05.]
Conclusion: The mean staying times for males and females are probably the same.
(b) x̄1 − x̄2 = 1.8, (σ1²/n + σ2²/m)^(1/2) = 1.6486 (denominator value when calculating the test
statistic), 1−α = 0.95, α = 0.05, α/2 = 0.025, Z1-α/2 = Z0.975 = 1.96.
x̄1 − x̄2 − Z1-α/2 (σ1²/n + σ2²/m)^(1/2) = 1.8 − 1.96 × 1.6486 = −1.431
x̄1 − x̄2 + Z1-α/2 (σ1²/n + σ2²/m)^(1/2) = 1.8 + 1.96 × 1.6486 = 5.031
The 95% confidence interval is (−1.431, 5.031).
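Parts (a) and (b) can be reproduced with a short routine (an illustrative sketch; the function name is chosen here and scipy is assumed).

```python
# Two-sample z-test of H0: mu1 - mu2 = 0 with known variances,
# plus a 100(1-alpha)% confidence interval for mu1 - mu2.
# Illustrative sketch; assumes the scipy library is installed.
import math
from scipy.stats import norm

def two_sample_z(x1, var1, n, x2, var2, m, alpha=0.05):
    se = math.sqrt(var1 / n + var2 / m)       # standard error of x1 - x2
    z0 = (x1 - x2) / se                       # test statistic
    half = norm.ppf(1 - alpha / 2) * se       # confidence-interval half-width
    return z0, (x1 - x2 - half, x1 - x2 + half)
```

For the hospital-stay example, two_sample_z(9, 55, 40, 7.2, 47, 35) reproduces z0 ≈ 1.0918 and the interval (−1.431, 5.031).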
2 Researchers in obesity want to test the effectiveness of dieting with exercise against
dieting without exercise. Seventy-three patients who were on the same diet were randomly
divided into “exercise” (n =37 patients) and “no exercise” groups (m =36 patients). The
results of the weight losses (in kilograms) of the patients after 2 months are summarized in
the table below.
Test at the 5% level of significance whether there is a difference in weight loss between the 2
groups.
H0: μ1 − μ2 = 0 (No difference in weight loss)
H1: μ1 − μ2 ≠ 0 (There is a difference in weight loss)
Test statistic: z0 = (x̄1 − x̄2)/(S1²/n + S2²/m)^(1/2) = (7.6 − 6.7)/(2.53/37 + 5.59/36)^(1/2)
= 0.9/0.473 = 1.903.
α = 0.05. Critical region R = {z0 < -1.96 or z0 > 1.96}.
Since -1.96 < z0 = 1.903 < 1.96, H0 is not rejected.
Conclusion: There is not sufficient evidence to suggest a difference in weight loss between
the 2 groups.
(ii) For independent samples from normal populations with variances unknown
The test to be performed in this case will be preceded by a test for equality of population
variances (σ1² = σ2² = σ²), i.e., the F-test discussed in section 9.2. If the hypothesis of equal
variances cannot be rejected, the test described below should be performed. If this hypothesis
is rejected, the Welch-Aspin test (see below) should be performed. If, in this case, the
assumption of samples from normal populations does not hold, a nonparametric test like the
Wilcoxon-Mann-Whitney test (to be discussed in a later chapter) should be used.
Test for μ1 − μ2 = 0 (population variances unknown but equal)
Step 1: State null and alternative hypotheses
H0: μ1 − μ2 = 0
H1a: μ1 − μ2 < 0 or H1b: μ1 − μ2 > 0 or H1c: μ1 − μ2 ≠ 0
Step 2: Calculate the test statistic t0 = (x̄1 − x̄2)/[S(1/n + 1/m)^(1/2)] with
S² = [(n−1)S1² + (m−1)S2²]/(n + m − 2).
Step 3: State the level of significance α and determine the critical value(s) and critical
region.
(i) For alternative H1a the critical region is R = { t0 | t0 < tα }.
(ii) For alternative H1b the critical region is R = { t0 | t0 > t1-α }.
(iii) For alternative H1c the critical region is R = { t0 | t0 > t1-α/2 or t0 < tα/2 }.
Here the t-distribution with n + m − 2 degrees of freedom is used.
Step 4: If t0 lies in the critical region, reject H0, otherwise do not reject H0.
A 100(1−α)% confidence interval for μ1 − μ2 is given by
x̄1 − x̄2 ± t(n+m−2, 1−α/2) S(1/n + 1/m)^(1/2).
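A sketch of the pooled t-test from summary data follows (illustrative only; scipy assumed — with raw data the library routine scipy.stats.ttest_ind performs the same test).

```python
# Pooled two-sample t-test of H0: mu1 - mu2 = 0
# (population variances unknown but assumed equal; summary data).
# Illustrative sketch; assumes the scipy library is installed.
import math
from scipy.stats import t

def pooled_t(x1, s1_sq, n, x2, s2_sq, m, alternative="two-sided"):
    s_sq = ((n - 1) * s1_sq + (m - 1) * s2_sq) / (n + m - 2)  # pooled variance
    t0 = (x1 - x2) / math.sqrt(s_sq * (1 / n + 1 / m))        # test statistic
    df = n + m - 2
    if alternative == "less":
        p = t.cdf(t0, df)
    elif alternative == "greater":
        p = t.sf(t0, df)
    else:
        p = 2 * t.sf(abs(t0), df)
    return t0, p
```

For the travel-expenses data in example 1 below, pooled_t(1140, 9593.6, 6, 1042, 15884, 7) reproduces t0 ≈ 1.543; for the response-time data in example 2, pooled_t(5.6, 0.25**2, 18, 5.3, 0.21**2, 13, "greater") reproduces t0 ≈ 3.518.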
Examples
1 Consider the above example on the comparison of the travel expenses for the sales and
audit staff (see section 9.2, example 1 for the F-test).
(a) Test, at the 5% level of significance, whether the mean expenses for the two types of
staff could be the same.
(b) Calculate a 95% confidence interval for the difference between the mean expenses for the
two types of staff.
(a) Since the hypothesis of equal population variances was not rejected, the test described
above can be performed. From the data given x̄1 = 1140, x̄2 = 1042, S1² = 9593.6 and
S2² = 15884.
H0: μ1 − μ2 = 0 (Mean travel expenses for sales and audit staff the same)
H1: μ1 − μ2 ≠ 0 (Mean travel expenses for sales and audit staff not the same)
S² = (5 × 9593.6 + 6 × 15884)/11 = 13024.7, S = 114.126.
Test statistic t0 = (1140 − 1042)/[114.126 × (1/6 + 1/7)^(1/2)] = 1.543.
Critical region = R = {t0 > t0.975 = 2.201 or t0 < -2.201} (11 degrees of freedom).
Since -2.201 < t0 = 1.543 < 2.201, H0 is not rejected.
Conclusion: Mean travel expenses for sales and audit staff are probably the same.
(b) A 95% confidence interval for the difference between the sales and audit staff means is
1140 − 1042 ± 2.201 × 114.126 × (1/6 + 1/7)^(1/2) i.e., (−41.75, 237.75).
2 A certain hospital has been getting complaints that the response to calls from senior
citizens is slower (takes longer time on average) than that to calls from other patients. To test
this claim, a pilot study was carried out. The results are shown below.
Patient type      Sample mean response time   Sample standard deviation   Sample size
Senior citizens   5.60 minutes                0.25 minutes                18
Others            5.30 minutes                0.21 minutes                13
Label the “senior citizens” and “others” populations as 1 and 2 and their population mean
response times as μ1 and μ2 respectively.
H0: μ1 − μ2 = 0 (Mean response times the same)
H1: μ1 − μ2 > 0 (Mean response time for senior citizens longer than for others)
The hypothesis that the population variances are equal cannot be rejected (perform the F-test
to check this). Hence equal variances for the 2 populations can be assumed.
S² = (17 × 0.25² + 12 × 0.21²)/29 = 0.0549, S = 0.2343.
Test statistic: t0 = (5.6 − 5.3)/[0.2343 × (1/18 + 1/13)^(1/2)] = 3.518.
Critical region = R = {t0 > t0.99 = 2.462} (29 degrees of freedom). [p-value = P(t > 3.518) = 0.00073.]
Since t0 = 3.518 > 2.462, H0 is rejected.
Conclusion: The claim is justified i.e., the mean response time for senior citizens is longer
than that for others.
(iii) For independent samples from normal populations with population variances not
known and not equal (Welch–Aspin test)
Test for μ1 − μ2 = 0 (population variances not known and not equal)
Step 1: State null and alternative hypotheses
H0: μ1 − μ2 = 0
H1a: μ1 − μ2 < 0 or H1b: μ1 − μ2 > 0 or H1c: μ1 − μ2 ≠ 0
Step 2: Calculate the test statistic t0 = (x̄1 − x̄2) / (S1²/n + S2²/m)^(1/2).
Step 3: State the level of significance α and determine the critical value(s) and critical
region using the degrees of freedom defined below.
(i) For alternative H1a the critical region is R = { t0 | t0 < tα }.
(ii) For alternative H1b the critical region is R = { t0 | t0 > t1-α }.
(iii) For alternative H1c the critical region is R = { t0 | t0 > t1-α/2 or t0 < tα/2 }.
Step 4: If t0 lies in the critical region, reject H0, otherwise do not reject H0.
A 100(1−α)% confidence interval for μ1 − μ2 is given by
x̄1 − x̄2 ± t1-α/2 × (S1²/n + S2²/m)^(1/2).
In the above, the integer part of
v = (S1²/n + S2²/m)² / [ (S1²/n)²/(n−1) + (S2²/m)²/(m−1) ]
is used as degrees of freedom to determine the value from the t-tables.
Example
The waiting times (minutes) for minor treatments were recorded at two different medical
centres. Below is a summary of the calculations made from the samples.
Centre 1: x̄1 = 25.69, S1² = 7.2, n = 12
Centre 2: x̄2 = 27.66, S2² = 22.017, m = 10
(a) Test at the 5% level of significance whether the population means for the 2 centres could
be equal.
(b) Calculate a 95% confidence interval for the difference between the means of centres 1
and 2.
(a) H0:
μ1 −μ2 =0
H1:
μ1 −μ2 ≠0
An F-test for equality of variances (see example 2 of tests for equality of population
variances) shows that the two population variances are probably not equal. Therefore, the test
statistic to be used is
t0 = (x̄1 − x̄2) / (S1²/n + S2²/m)^(1/2) = (25.69 − 27.66) / (7.2/12 + 22.017/10)^(1/2) = −1.177.
The degrees of freedom are integer(v) = 13. From the t-distribution table with 13 degrees
of freedom, t0.025 = −2.1604.
Critical region: R = { t0 < t0.025 = −2.1604 }.
Since t0 = −1.177 > −2.1604, H0 is not rejected. [p-value = P(t < −1.177) = 0.1301]
Conclusion: The population means for the 2 centres could be equal.
(b) A 95% confidence interval for the difference between the means of centres 1 and 2 is
x̄1 − x̄2 ± t13;0.975 × (S1²/n + S2²/m)^(1/2) = (25.69 − 27.66) ± 2.1604 × (7.2/12 + 22.017/10)^(1/2)
= −1.97 ± 3.616 = (−5.586, 1.646).
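The Welch statistic and its degrees of freedom for this example can be verified with a short Python sketch (a check only, using the summary values above; variable names are mine):

```python
import math

# Summary data from the waiting-times example (centre 1 and centre 2)
n, m = 12, 10
xbar1, xbar2 = 25.69, 27.66
s1_sq, s2_sq = 7.2, 22.017

# Welch t statistic (variances not assumed equal)
se = math.sqrt(s1_sq / n + s2_sq / m)
t0 = (xbar1 - xbar2) / se

# Welch–Aspin degrees of freedom; the integer part is used with the t-tables
v = (s1_sq / n + s2_sq / m) ** 2 / (
    (s1_sq / n) ** 2 / (n - 1) + (s2_sq / m) ** 2 / (m - 1)
)
df = int(v)
```

This reproduces t0 ≈ −1.177 with 13 degrees of freedom, as in the worked example.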
9.4 Test for difference between means for paired (matched) samples
The tests for the difference between means in the previous section assumed independent
samples. In certain situations, this assumption is not met.
Examples
1 A group of patients going on a diet is weighed before going on the diet and again after
having been on the diet for one month. A test to determine whether the diet has reduced their
weight is to be performed.
2 The aptitudes of boys and girls for mathematics are to be compared. To eliminate the
effect of social factors, pairs of brothers and sisters are used in the comparison. Each (brother,
sister) pair is given the same test and the mean marks of boys and girls compared.
In each of these situations the two samples cannot be regarded as independent. In the first
example two readings (before and after readings) are made on the same subject. In the second
example the two samples are matched via a common factor (family connection).
The data layout for the experiments described above is shown below.
sample 1:     x1            x2            ...   xn
sample 2:     y1            y2            ...   yn
difference:   d1 = x1 − y1  d2 = x2 − y2  ...   dn = xn − yn
The mean of the paired differences of the (x, y) values of the two populations is defined as
μd. Under the assumption that the differences are sampled from a normal population,
hypotheses concerning the mean of the differences μd can be tested by performing a one
sample t-test (described in the previous chapter) with the observed differences d1, d2, ..., dn
as the sample. The mean and standard deviation of these sample differences will be denoted
by d̄ and Sd respectively.
Test for μd = 0 (paired samples)
Step 1: State null and alternative hypotheses
H0: μd = 0
H1a: μd < 0 or H1b: μd > 0 or H1c: μd ≠ 0
Step 2: Calculate the test statistic t0 = d̄ / (Sd/√n).
Step 3: State the level of significance α and determine the critical value(s) and critical
region.
(i) For alternative H1a the critical region is R = { t0 | t0 < tα }.
(ii) For alternative H1b the critical region is R = { t0 | t0 > t1-α }.
(iii) For alternative H1c the critical region is R = { t0 | t0 > t1-α/2 or t0 < tα/2 }.
Step 4: If t0 lies in the critical region, reject H0, otherwise do not reject H0.
A 100(1−α)% confidence interval for μd is given by d̄ ± t-multiplier × Sd/√n, where the
t-multiplier is obtained from the t-tables with n−1 degrees of freedom with an area 1−α/2
under the t-curve below it.
Examples
1 A bank is considering loan applications for buying each of 10 homes. Two different
companies (company 1 and company 2) are asked to do an evaluation of each of these 10
homes. The evaluations (thousands of Rand) for these homes are shown in the table below.
Home        1    2     3     4     5     6    7     8    9    10
company 1   750  990   1025  1285  1300  875  1240  880  700  1315
company 2   810  1000  1020  1320  1290  915  1250  910  650  1290
(a) At the 5% level of significance, is there a difference in the mean evaluations for the 2
companies?
(b) Calculate a 95% confidence interval for the difference between the mean evaluations for
companies 1 and 2.
(a) H0: μd = 0 (No difference in mean evaluations)
H1: μd ≠ 0 (There is a difference in mean evaluations)
Test statistic: t0 = 9.5 / (33.12015/√10) = 0.907, where the differences are taken as
company 2 minus company 1.
α=0.05,α /2=0.025,1−α /2=0.975 . From the t-tables with ν=n - 1 = 9 degrees of
freedom,
t 0. 975 = 2.262.
Critical region: R = { t0 > 2.262 }.
Since t0 = 0.907 < 2.262, H0 is not rejected i.e., no difference in mean evaluations.
(b) A 95% confidence interval is given by 9.5 ± 2.262 × 33.12015/√10 = (−14.19, 33.19).
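The paired t calculation for the home evaluations can be reproduced in a few lines of Python (a sketch, with the differences taken as company 2 minus company 1 so that d̄ = 9.5; variable names are mine):

```python
import math
import statistics

# Home evaluations (thousands of Rand) from the example
company1 = [750, 990, 1025, 1285, 1300, 875, 1240, 880, 700, 1315]
company2 = [810, 1000, 1020, 1320, 1290, 915, 1250, 910, 650, 1290]
d = [b - a for a, b in zip(company1, company2)]  # paired differences

n = len(d)
d_bar = statistics.mean(d)   # 9.5
s_d = statistics.stdev(d)    # ≈ 33.12015

# Paired t statistic
t0 = d_bar / (s_d / math.sqrt(n))

# 95% confidence interval; 2.262 is t(9; 0.975) from the t-tables
half_width = 2.262 * s_d / math.sqrt(n)
ci = (d_bar - half_width, d_bar + half_width)
```

This gives t0 ≈ 0.907 and an interval of about (−14.19, 33.19), as above.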
2 Each of 15 people going on a diet was weighed before going on the diet and again after
having been on the diet for one month. The weights (in kilograms) are shown in the table
below.
Person       1   2    3    4    5    6   7   8   9    10   11   12  13   14   15
before       90  110  124  116  105  88  86  92  101  112  138  96  102  111  82
after        85  105  126  118  94   84  87  87  99   105  130  93  95   102  83
difference   −5  −5   2    2    −11  −4  1   −5  −2   −7   −8   −3  −7   −9   1
Test, at the 1% level of significance, whether the mean weight after one month on the diet is
less than that before going on the diet.
Let μd denote the mean difference between the weight after having been on the diet for one
month and before going on the diet.
H0: μd = 0 (No difference in mean weights)
H1: μd < 0 (Mean weight after one month on diet less than before going on diet)
Test statistic: t0 = −4 / (4.1231/√15) = −3.757.
Critical region: R = { t0 < −2.624 }.
Since t0 = −3.757 < −2.624, H0 is rejected.
Conclusion: The mean weight after one month on the diet is less than before going on the diet.
9.5 Test for the difference between proportions for independent samples
When testing for the difference between the proportions of two different populations, the test
is based on the sampling distribution results 4-6 described in the first section of this chapter.
Test for p1 − p2 = 0
Step 1: State null and alternative hypotheses
H0: p1 − p2 = 0
H1a: p1 − p2 < 0 or H1b: p1 − p2 > 0 or H1c: p1 − p2 ≠ 0
Step 2: Calculate the test statistic z0 = (p̂1 − p̂2) / [p̂(1 − p̂)(1/n + 1/m)]^(1/2) with
p̂ = (xn + xm)/(n + m), where xn and xm denote the numbers in the two samples possessing
the characteristic of interest.
Step 3: State the level of significance α and determine the critical value(s) and critical
region.
(i) For alternative H1a the critical region is R = { z0 | z0 < Zα }.
(ii) For alternative H1b the critical region is R = { z0 | z0 > Z1-α }.
(iii) For alternative H1c the critical region is R = { z0 | z0 > Z1-α/2 or z0 < Zα/2 }.
Step 4: If z0 lies in the critical region, reject H0, otherwise do not reject H0.
Example
A perfume company is planning to market a new fragrance. To test the popularity of the
fragrance, 120 young women and 150 older women were selected at random and asked
whether they liked the new fragrance. The results of the survey: 72 of the 150 older women
and 48 of the 120 young women liked the fragrance.
(a) Test, at the 5% level of significance, whether older women like the new fragrance more
than young women.
(b) Calculate a 95% confidence interval for the difference between the proportions of older
and young women who like the fragrance.
(a) Let the older and younger women populations be labeled 1 and 2 respectively and p1 and
p2 the respective population proportions that like the fragrance.
H0: p1 − p2 = 0
H1: p1 − p2 > 0
From the above, n = 150, m = 120, xn = 72, xm = 48.
p̂ = (72 + 48)/(150 + 120) = 4/9
Test statistic: z0 = (72/150 − 48/120) / [(4/9) × (5/9) × (1/150 + 1/120)]^(1/2)
= 0.08/0.060858 = 1.3145.
Since z0 = 1.3145 < Z0.95 = 1.645, H0 cannot be rejected.
Conclusion: There is not sufficient evidence to suggest that older women like the new
fragrance more than young women.
(b) p̂1 − p̂2 = 0.08 [numerator of z0 in part (a)], Z0.975 = 1.96
p̂1 − p̂2 ± Z1-α/2 × [p̂1(1 − p̂1)/n + p̂2(1 − p̂2)/m]^(1/2)
= 0.08 ± 1.96 × [0.48 × 0.52/150 + 0.40 × 0.60/120]^(1/2)
= 0.08 ± 0.11864 = (−0.03864, 0.19864)
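The two-proportion calculation can be sketched in Python as a check (the pooled proportion is used for the test, the unpooled standard error for the interval, as in the notes; variable names are mine):

```python
import math

# Fragrance survey: 72 of 150 older women and 48 of 120 young women liked it
x1, n = 72, 150
x2, m = 48, 120

p1_hat = x1 / n              # 0.48
p2_hat = x2 / m              # 0.40
p_hat = (x1 + x2) / (n + m)  # 4/9, pooled proportion under H0

# Test statistic for H0: p1 - p2 = 0
z0 = (p1_hat - p2_hat) / math.sqrt(p_hat * (1 - p_hat) * (1 / n + 1 / m))

# 95% confidence interval (unpooled standard error, Z multiplier 1.96)
se = math.sqrt(p1_hat * (1 - p1_hat) / n + p2_hat * (1 - p2_hat) / m)
ci = ((p1_hat - p2_hat) - 1.96 * se, (p1_hat - p2_hat) + 1.96 * se)
```

This reproduces z0 ≈ 1.3145 and the interval (−0.03864, 0.19864).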
1 The test for the difference between population means in example 1 in section 9.3(ii) (the
data in example 1 in section 9.2) can be performed by using Excel. What follows is the
output.
The p-value is 0.150984 > 0.05. At the 5% level of significance the null hypothesis cannot be
rejected.
2 The output shown below is when the test for equality of population variances for the data
in example 1 in section 9.2 is performed by using Excel.
The value of the test statistic shown in the above table is s1²/s2² = 9593.6/15884 = 0.603979.
The critical value (last entry under variable 1 in the above table) is
F5,6;0.025 = 1/F6,5;0.975 = 1/6.98 = 0.143266
and the p-value (second to last entry under variable 1 in the above table) is 0.701718. Since
0.701718 > 0.025, the null hypothesis cannot be rejected.
Often two variables are measured simultaneously and relationships between these variables
explored. Data sets involving two variables are known as bivariate data sets.
The first step in the exploration of bivariate data is to plot the variables on a graph. From
such a graph, which is known as a scatter diagram (scatter plot, scatter graph), an idea can
be formed about the nature of the relationship.
Examples
1 The number of copies sold (y) of a new book is dependent on the advertising budget (x)
the publisher commits in a pre-publication campaign. The values of x and y for 12 recently
published books are shown below.
x (thousands of y
rands) (thousands)
8 12.5
9.5 18.6
7.2 25.3
6.5 24.8
10 35.7
12 45.4
11.5 44.4
14.8 45.8
17.3 65.3
27 75.7
30 72.3
25 79.2
Scatter diagram: copies sold (y, thousands) versus advertising budget (x, thousands of rands).
2 In a study of the relationship between the amount of daily rainfall (x) and the quantity of
air pollution removed (y), the following data were collected.
Scatter diagram: quantity of air pollution removed (y) versus daily rainfall (x).
1 In both cases the relationship can be well described by means of a straight line i.e., both
these relationships are linear relationships.
4 In both the examples changes in the values of y are affected by changes in the values of x
(not the other way round). The variable x is known as the explanatory (independent,
predictor) variable and the variable y the response (dependent) variable.
In this section only linear relationships between 2 variables will be explored. The issues to be
explored are
1 Measuring the strength of the linear relationship between the 2 variables (the linear
correlation problem).
2 Finding the equation of the straight line that will best describe the relationship between
the 2 variables (the linear regression problem). Once this line is determined, it can be used
to estimate a value of y for given value of x (linear estimation).
The calculation of the coefficient of correlation (r) is based on the closeness of the plotted
points (in the scatter diagram) to the line fitted to them. It can be shown that
-1 ≤ r ≤ 1.
If the plotted points are closely clustered around this line, r will lie close to either 1 or -1
(depending on whether the linear relationship is positive or negative). The further the plotted
points are away from the line, the closer the value of r will be to 0. Consider the scatter
diagrams below.
No pattern (r close to 0)
For a sample of n pairs of values (x1, y1) , (x2, y2), . . . , (xn, yn) , the coefficient of
correlation can be calculated from the formula
r = [n∑xy − ∑x∑y] / {[n∑x² − (∑x)²][n∑y² − (∑y)²]}^(1/2).
Example
Consider the data on the advertising budget (x) and the number of copies sold (y) considered
earlier. For this data r can be calculated in the following way.
x y xy x2 y2
8 12.5 100 64 156.25
9.5 18.6 176.7 90.25 345.96
7.2 25.3 182.16 51.84 640.09
6.5 24.8 161.2 42.25 615.04
10 35.7 357 100 1274.49
12 45.4 544.8 144 2061.16
11.5 44.4 510.6 132.25 1971.36
14.8 45.8 677.84 219.04 2097.64
17.3 65.3 1129.69 299.29 4264.09
27 75.7 2043.9 729 5730.49
30 72.3 2169 900 5227.29
25 79.2 1980 625 6272.64
sum   178.8   545   10032.89   3396.92   30656.5
Substituting these sums into the formula gives
r = [12 × 10032.89 − 178.8 × 545] / {[12 × 3396.92 − 178.8²][12 × 30656.5 − 545²]}^(1/2)
= 22948.68/24961.03 = 0.9194.
Comment: Strong positive correlation i.e., the increase in the number of copies sold is closely
linked with an increase in advertising budget.
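As a check on the hand calculation, r can be computed from the same computational formula in Python (a sketch; the data are the twelve (x, y) pairs from the table above and the variable names are mine):

```python
import math

# Advertising budget (x, thousands of rands) and copies sold (y, thousands)
x = [8, 9.5, 7.2, 6.5, 10, 12, 11.5, 14.8, 17.3, 27, 30, 25]
y = [12.5, 18.6, 25.3, 24.8, 35.7, 45.4, 44.4, 45.8, 65.3, 75.7, 72.3, 79.2]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sx2 = sum(a * a for a in x)
sy2 = sum(b * b for b in y)

# Correlation coefficient from the computational formula
r = (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx**2) * (n * sy2 - sy**2))
```

This gives r ≈ 0.9194 and r² ≈ 0.8453, the coefficient of determination discussed next.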
Coefficient of determination
The strength of the linear relationship between 2 variables can also be measured by the
square of the correlation coefficient (r²). This quantity, called the coefficient of
determination, is the proportion of variability in the y variable that is accounted for by its
linear relationship with the x variable.
Example
In the above example on copies sold (y) and advertising budget (x), r² = 0.9194² = 0.8453.
This means that 84.53% of the variability in copies sold is explained by its relationship with
advertising budget.
Finding the equation of the line that best fits the (x, y) points is based on the least squares
principle. This principle can best be explained by considering the scatter diagram below.
The scatter diagram is a plot of the DBH (diameter at breast height) versus the age for 12 oak
trees. The data are shown in the table below.
Age x (years)   97    93    88   81   75    57   52    45   28   15   12   11
DBH y (inch)    12.5  12.5  8    9.5  16.5  11   10.5  9    6    1.5  1    1
According to the least squares principle, the line that “best” fits the plotted points is the one
that minimizes the sum of the squares of the vertical deviations (see vertical lines in the
above graph) between the plotted y and estimated y (values on the line). For this reason the
line fitted according to this principle is called the least squares line.
ŷ = a + bx,
where ŷ is the fitted y value (the y value on the line, which is different from the observed y
value), a is the y-intercept and b the slope of the line.
It can be shown that the coefficients that define the least squares line can be calculated from
b = [n∑xy − ∑x∑y] / [n∑x² − (∑x)²] and a = ȳ − b x̄.
Example
For the above data on age (x) and DBH (y) the least squares line can calculated as shown
below.
Substituting n = 12, ∑x = 654, ∑y = 99, ∑xy = 6877.5 and ∑x² = 47240 into the above
equations gives
b = [12 × 6877.5 − 654 × 99] / [12 × 47240 − 654²] = 17784/139164 = 0.12779 and
a = 99/12 − 0.12779 × 654/12 = 1.285.
Therefore, the equation of the y on x least squares line that can be used to estimate values of
y (DBH) based on x (age) is
^y = 1.285 + 0.12779 x.
Suppose the DBH of a tree aged 90 years is to be estimated. This can be done by substituting
the value of x = 90 into the above equation. Then ŷ = 1.285 + 0.12779 × 90 = 12.79 inches.
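The least squares coefficients for the age/DBH data can be verified with a short Python sketch (a check only; the notes do this by hand and in Excel, and the variable names are mine):

```python
# Age (x, years) and DBH (y, inches) for the 12 oak trees
x = [97, 93, 88, 81, 75, 57, 52, 45, 28, 15, 12, 11]
y = [12.5, 12.5, 8, 9.5, 16.5, 11, 10.5, 9, 6, 1.5, 1, 1]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sx2 = sum(a * a for a in x)

# Least squares slope and intercept
b = (n * sxy - sx * sy) / (n * sx2 - sx**2)
a = sy / n - b * sx / n

# Estimated DBH of a 90-year-old tree
y_hat_90 = a + b * 90
```

This reproduces b ≈ 0.12779, a ≈ 1.285 and an estimated DBH of about 12.79 inches at age 90.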
1 The linear relationship between y and x is often only valid for values of x within a certain
range e.g., when estimating the DBH using age as explanatory variable, it should be taken
into account that at some age the tree will stop growing. Assuming a linear relationship
between age and DBH for values beyond the age where the tree stops growing would be
incorrect.
2 Only relationships between variables that could be related in a practical sense are explored
e.g., it would be pointless to explore the relationship between the number of vehicles in New
York and the number of divorces in South Africa. Even if data collected on such variables
might suggest a relationship, it cannot be of any practical value.
3 If variables are not linearly related, it does not necessarily mean that they are not related.
There are many situations where the relationships between variables are non-linear.
Example
A plot of the banana consumption (y) versus the price (x) is shown in the graph below. A
straight line will not describe this relationship very well, but the non-linear curve shown
below will describe it well.
Scatter diagram: banana consumption (y) versus price (x), with a fitted non-linear curve.
This sequence shows how a nonlinear regression model may be fitted. It uses the banana
consumption example in the first sequence.
The true regression equation that describes the relationship between y and x can be written
as y=α+βx + error. The coefficients a and b that were calculated in the previous section
are the least squares estimates of α and β respectively. A hypothesis that is often of interest
when exploring a linear relationship between x and y , is whether they are indeed linearly
related. When testing this hypothesis, it is assumed that the error term in the above formula is
normally distributed.
Step 1: State the null and alternative hypotheses
H0: β = 0
H1a: β < 0 or H1b: β > 0 or H1c: β ≠ 0
Step 2: Calculate the test statistic t0 = b / (S/√Sxx), where S = [SSE/(n−2)]^(1/2),
SSE = Syy − Sxy²/Sxx and
Sxx = ∑x² − (∑x)²/n, Syy = ∑y² − (∑y)²/n and Sxy = ∑xy − (∑x)(∑y)/n.
Step 3: State the level of significance α and determine the critical value(s) and critical
region. This is based on t0 ~ tn−2.
Step 4: If t0 lies in the critical region, reject H0, otherwise do not reject H0.
A 100(1−α)% confidence interval for β is given by b ± t1-α/2 × S/√Sxx. The degrees of
freedom used to find the t-multiplier is n−2.
The test described above is also a test for zero correlation between y and x in a linear
relationship.
Examples
1 (a)
H0: β = 0
H1: β ≠ 0
The following sums are needed in the calculation of the denominator of the test statistic.
x y xy x2 y2
97 12.5 1212.5 9409 156.25
93 12.5 1162.5 8649 156.25
88 8 704 7744 64
81 9.5 769.5 6561 90.25
75 16.5 1237.5 5625 272.25
57 11 627 3249 121
52 10.5 546 2704 110.25
45 9 405 2025 81
28 6 168 784 36
15 1.5 22.5 225 2.25
12 1 12 144 1
11 1 11 121 1
sum 654 99 6877.5 47240 1091.5
n = 12, b = 0.12779,
Sxx = ∑x² − (∑x)²/n = 47240 − 654²/12 = 11597,
Syy = ∑y² − (∑y)²/n = 1091.5 − 99²/12 = 274.75,
Sxy = ∑xy − (∑x)(∑y)/n = 6877.5 − 654 × 99/12 = 1482,
SSE = Syy − Sxy²/Sxx = 274.75 − 1482²/11597 = 85.36274,
S = [SSE/(n−2)]^(1/2) = (85.36274/10)^(1/2) = (8.536274)^(1/2) = 2.92169.
Test statistic: t0 = b / (S/√Sxx) = 0.12779 / (2.92169/√11597) = 4.710.
Using n − 2 = 10 degrees of freedom, t0.975 = 2.228. Since t0 = 4.710 > 2.228, H0 is
rejected i.e., the slope is significantly different from zero.
Confidence interval: 0.12779 ± 2.228 × 2.92169/√11597 = 0.12779 ± 0.06045
= (0.06734, 0.18824).
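The slope test can be retraced in Python from the sums in the table (a sketch only; the equivalent slope formula b = Sxy/Sxx is used, and the variable names are mine):

```python
import math

# Sums from the age (x) / DBH (y) example
n = 12
sx, sy = 654, 99
sxy, sx2, sy2 = 6877.5, 47240, 1091.5

s_xx = sx2 - sx**2 / n    # 11597
s_yy = sy2 - sy**2 / n    # 274.75
s_xy = sxy - sx * sy / n  # 1482

b = s_xy / s_xx                 # slope, ≈ 0.12779 (equivalent formula)
sse = s_yy - s_xy**2 / s_xx     # ≈ 85.36274
s = math.sqrt(sse / (n - 2))    # ≈ 2.92169

# t statistic for H0: beta = 0
t0 = b / (s / math.sqrt(s_xx))
```

This gives t0 ≈ 4.71, matching the t Stat in the Excel output below.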
2 Suppose that in the above question the hypotheses H0: β = 0 versus H1: β > 0 were
tested at the 1% level of significance. How will the testing procedure change?
Consider the data on age (x variable) and DBH (y variable). The output when performing a
straight line regression on this data in Excel is shown below.
SUMMARY OUTPUT
Regression Statistics
R Square 0.689307572
ANOVA
             df    SS            MS         F         Significance F
Regression   1     189.3872553   189.3873   22.1862   0.000828626
Residual     10    85.36274468   8.536274
Total        11    274.75

             Coefficients   Standard Error   t Stat     P-value
Intercept    1.285353971    1.702259153      0.755087   0.46761
X Variable   0.12779167     0.027130722      4.71022    0.00083
3 The third of the tables in the summary output shows the intercept and slope values of the
line. These are the first two entries under Coefficients. The remaining columns to the right of
the Coefficients column concern the performance of tests for zero intercept and slope. From
the intercept and slope p-values (0.46761 and 0.00083 respectively) it follows that the
intercept is not significantly different from zero at the 5% level of significance
(0.46761 > 0.05) but that the slope is significantly different from zero at the 5% or 1% levels
of significance (0.00083 < 0.01 < 0.05).
When the correlation coefficient is calculated for the above-mentioned data by using Excel,
the output is as shown below.
           Column 1   Column 2
Column 1   1
Column 2   0.83025    1
The above table shows that the correlation between x and y is 0.83025.
Consider an experiment where samples are drawn from k different populations that are
labelled 1, 2, ..., k. A test for the equality of the k different means μ1, μ2, ..., μk is to be
performed. A random sample of size n1 is drawn from population 1, one of size n2 from
population 2, ..., one of size nk from population k. The k different populations can also
be seen as k different treatments. The data layout and some of the important calculations are
shown below.
Treatment 1:  x11, x12, ..., x1n1, with mean x̄1 and ∑j (x1j − x̄1)² = (n1 − 1)S1²
Treatment 2:  x21, x22, ..., x2n2, with mean x̄2 and ∑j (x2j − x̄2)² = (n2 − 1)S2²
. . .
Treatment k:  xk1, xk2, ..., xknk, with mean x̄k and ∑j (xkj − x̄k)² = (nk − 1)Sk²
The experiment described above can also be seen as randomly assigning n experimental
units to k different treatments such that n1 units are assigned to treatment 1, n2 to treatment 2,
..., nk to treatment k with n1 + n2 + ⋯ + nk = n. For this reason, this design is referred to as
a completely randomized design.
Example
The sound distortion on 4 different types of coatings (A, B, C, D) on sound tapes is
measured. The data collected are shown below.
A: 10, 15, 8, 12, 15
B: 14, 18, 21, 15
C: 17, 16, 14, 15, 17, 15, 18
D: 12, 15, 17, 15, 16, 15
In this example k = 4, n1 = 5, n2 = 4, n3 = 7, n4 = 6, n = 22,
x̄ = (12×5 + 17×4 + 16×7 + 15×6)/22 = 15.
H0: μ1 = μ2 = ⋯ = μk = μ
H1: Not all means are equal
or
H0: α1 = α2 = ⋯ = αk−1 = 0
H1: Not all α's equal to 0
SST = ∑∑(xij − x̄)² = ∑∑(xij − x̄i + x̄i − x̄)²
= ∑∑(xij − x̄i)² + 2∑∑(xij − x̄i)(x̄i − x̄) + ∑∑(x̄i − x̄)².
The cross-product term is zero. Therefore
SST = ∑∑(xij − x̄)² = ∑∑(xij − x̄i)² + ∑∑(x̄i − x̄)² = SSE + SSTr
(Error) (Treatment)
In a similar fashion the degrees of freedom associated with each of the above sums of squares
can be partitioned as n−1=(n−k )+( k−1 ) .
The above results can be summarized in the form of an Analysis of Variance (ANOVA)
table as shown below.
Source of variation   df      SS     MS                   F
Treatments            k − 1   SSTr   MSTr = SSTr/(k−1)    F0 = MSTr/MSE
Error                 n − k   SSE    MSE = SSE/(n−k)
Total                 n − 1   SST
When H0 is true and the y observations are assumed to be normally distributed with equal
variances, the F statistic follows an F distribution with k −1 and n−k degrees of freedom.
Since this test is concerned with testing the effect of a single factor, the testing procedure
described above is also known as one-way Analysis of Variance (ANOVA).
SST = ∑∑(xij − x̄)² = ∑∑xij² − T²/n, where T = ∑∑xij is the grand total.
SSTr = ∑∑(x̄i − x̄)² = ∑ Ti²/ni − T²/n, where Ti = ∑j xij.
Step 2: Calculate SST, SSTr and SSE and use these quantities to calculate F0 as explained
in the above ANOVA table.
Step 3: State the level of significance α and determine the critical value(s) and critical
region. This is based on
F 0 ~ F k−1 ,n−k .
Using k − 1 and n − k degrees of freedom, the critical region is R = { F0 | F0 > F1−α }.
Step 4: If F0 lies in the critical region, reject H0, otherwise do not reject H0.
Example
By using the data on the sound distortion on the 4 different types of coatings, test at the 5%
level of significance whether their population means could be equal.
H0: μ1 = μ2 = μ3 = μ4 = μ
or
H0: α1 = α2 = α3 = 0
H1: Not all α's equal to 0, where αi = μi − μ, i = 1, 2, ..., 4.
SST = ∑∑xij² − T²/n
= (10² + 15² + 8² + 12² + 15² + 14² + 18² + 21² + 15² + 17² + 16² + 14² + 15² + 17² + 15² + 18² + 12² + 15² + 17² + 15² + 16² + 15²) − 330²/22
= 5112 − 4950 = 162.
SSTr = ∑ Ti²/ni − T²/n, where T1 = 60, T2 = 68, T3 = 112, T4 = 90
= 60²/5 + 68²/4 + 112²/7 + 90²/6 − 4950 = (720 + 1156 + 1792 + 1350) − 4950
= 5018 − 4950 = 68.
ANOVA table
Source       df   SS    MS       F
Treatments   3    68    22.667   4.34
Error        18   94    5.222
Total        21   162
Test statistic F0 = 22.667/5.222 = 4.34.
Using 3 and 18 degrees of freedom, the critical region is R = { F0 | F0 > 3.16 }.
Since F0 = 4.34 > 3.16, H0 is rejected, and it is concluded that not all means are equal.
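The sums of squares and the F statistic can be reproduced from the raw coating data with a short Python sketch (a check only, using the computational formulas above; variable names are mine):

```python
# Sound distortion data for the four coatings
groups = {
    "A": [10, 15, 8, 12, 15],
    "B": [14, 18, 21, 15],
    "C": [17, 16, 14, 15, 17, 15, 18],
    "D": [12, 15, 17, 15, 16, 15],
}

values = [v for g in groups.values() for v in g]
n, k = len(values), len(groups)
grand_total = sum(values)  # T = 330

# Computational formulas for the sums of squares
sst = sum(v * v for v in values) - grand_total**2 / n
sstr = sum(sum(g) ** 2 / len(g) for g in groups.values()) - grand_total**2 / n
sse = sst - sstr

# F statistic: (SSTr/(k-1)) / (SSE/(n-k))
f0 = (sstr / (k - 1)) / (sse / (n - k))
```

This reproduces SST = 162, SSTr = 68, SSE = 94 and F0 ≈ 4.34, as in the ANOVA table.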
A graph showing side-by-side box plots of the y observations from the different ANOVA
groups can reveal some useful additional information about the validity of the test
assumptions and conclusions made about the test results. As an example, the side-by-side box
plots for the sound distortion data shown below suggest that
1 The mean sound distortion for coating A appears to be less than that for coatings B ,C and
D.
2 The “box” part of the box plot for coatings A and B is greater than that for coatings C and
D . This suggests that the assumption of equal variances of the y observations is probably not
valid.
The traditional one-way ANOVA above was performed under the assumption of
equal population variances. The big differences between the 4 sample variances (9.5, 10, 2,
2.8) create doubts about the validity of this assumption. An alternative test that allows
unequal variances is the Welch one-way ANOVA (not discussed here).
When H0: μ1 = μ2 = ⋯ = μk is rejected, follow up tests to determine which means are
different from each other are needed. This is done by testing for the equality of all possible
pairs of means selected from the k population means. In general, there are c = k(k − 1)/2
pairs that are tested. When c such tests are each performed at the 100α% level of significance,
the probability of a type I error associated with a conclusion based on the results of all these
tests is αE = 1 − (1 − α)^c. It can be shown that αE ≤ cα. This means that if the overall
probability of a type I error is to be no more than αE, it is specified that α = αE/c. This
adjustment of α for the tests concerning the individual pairs of means is called the
Bonferroni adjustment. Each test is then performed by comparing the absolute difference
between the sample means for groups i and j i.e., |x̄i − x̄j|, with the Least Significant
Difference
LSD = tn−k;α/2 × [MSE × (1/ni + 1/nj)]^(1/2),
where ni and nj are the sample sizes of the groups that are being compared and MSE the
mean square error that is obtained from the ANOVA table. If |x̄i − x̄j| > LSD the hypothesis
of equal means is rejected, otherwise it is not rejected.
These post hoc tests can also be carried out by using other similar testing procedures (not
discussed here).
Example
For the sound distortion data (discussed above) k = 4. Then c = (4 × 3)/2 = 6 and when
α = 0.05, αE = 1 − (1 − 0.05)⁶ = 0.265. This means that the overall probability of a type I
error (0.265) for a conclusion based on the 6 tests for equality of means is more than 5 times
the type I error for each individual test (0.05).
When testing for the equality of all possible pairs of means in the sound distortion example
and specifying αE = 0.05, the probability of a type I error for the individual tests should be
α = αE/c = 0.05/6 = 0.0083.
The 4 sample means are x̄1 = 12, x̄2 = 17, x̄3 = 16, x̄4 = 15, sample sizes n1 = 5, n2 = 4,
n3 = 7, n4 = 6 and MSE = 5.22 (from the ANOVA table). Using n − k = 22 − 4 = 18 degrees
of freedom, t18;0.05/(2×6) = 2.9627. Then
LSD = 2.9627 × [5.22 × (1/ni + 1/nj)]^(1/2) = 6.77 × (1/ni + 1/nj)^(1/2).
Tests for differences between means
H0: μ1 = μ2
|x̄1 − x̄2| = |12 − 17| = 5, LSD = 6.77 × (1/5 + 1/4)^(1/2) = 4.54
Conclusion: Since 5 > 4.54, μ1 and μ2 are different.
H0: μ1 = μ3
|x̄1 − x̄3| = |12 − 16| = 4, LSD = 6.77 × (1/5 + 1/7)^(1/2) = 3.96
Conclusion: Since 4 > 3.96, μ1 and μ3 are different.
H0: μ1 = μ4
|x̄1 − x̄4| = |12 − 15| = 3, LSD = 6.77 × (1/5 + 1/6)^(1/2) = 4.10
Conclusion: Since 3 < 4.10, μ1 and μ4 are not different.
H0: μ2 = μ3
|x̄2 − x̄3| = |17 − 16| = 1, LSD = 6.77 × (1/4 + 1/7)^(1/2) = 4.24
Conclusion: Since 1 < 4.24, μ2 and μ3 are not different.
H0: μ2 = μ4
|x̄2 − x̄4| = |17 − 15| = 2, LSD = 6.77 × (1/4 + 1/6)^(1/2) = 4.37
Conclusion: Since 2 < 4.37, μ2 and μ4 are not different.
H0: μ3 = μ4
|x̄3 − x̄4| = |16 − 15| = 1, LSD = 6.77 × (1/7 + 1/6)^(1/2) = 3.77
Conclusion: Since 1 < 3.77, μ3 and μ4 are not different.
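The six Bonferroni comparisons can be run in a loop; the sketch below reproduces them in Python (the t-multiplier 2.9627 is the table value t(18; 0.05/12) used in the notes, hardcoded here; variable names are mine):

```python
import math
from itertools import combinations

# Sample means and sizes for the four coatings, and MSE from the ANOVA table
means = {"A": 12, "B": 17, "C": 16, "D": 15}
sizes = {"A": 5, "B": 4, "C": 7, "D": 6}
mse = 94 / 18  # ≈ 5.22

# Bonferroni-adjusted t multiplier t(18; 0.05/12), read from the t-tables
t_mult = 2.9627

different = set()
for i, j in combinations(means, 2):
    lsd = t_mult * math.sqrt(mse * (1 / sizes[i] + 1 / sizes[j]))
    if abs(means[i] - means[j]) > lsd:
        different.add((i, j))
```

Only the pairs (A, B) and (A, C) exceed their LSD, agreeing with the six hand tests above.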
Overall conclusion: Population 1 (coating A) has a mean sound distortion that is less than
that of populations 2 and 3 (coating B and C). This is a confirmation of the conclusion
suggested by the side-by-side box plot.
An important assumption in ANOVA is that the data are normally distributed. This can be
checked by drawing a histogram of the data or a Q-Q plot of the error (residual) values
obtained from fitting the model to the data. The code below (written in R) fits an ANOVA to
the data, draws a histogram and a Q-Q plot.
data=read.table("clipboard",header=T)
attach(data)
data
y coating
1 10 A
2 15 A
3 8 A
4 12 A
5 15 A
6 14 B
7 18 B
8 21 B
9 15 B
10 17 C
11 16 C
12 14 C
13 15 C
14 17 C
15 15 C
16 18 C
17 12 D
18 15 D
19 17 D
20 15 D
21 16 D
22 15 D
model <- aov(y ~ coating, data = data)
summary(model)
Df Sum Sq Mean Sq F value Pr(>F)
coating 3 68 22.667 4.34 0.0181 *
Residuals 18 94 5.222
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Histogram
hist(y)
The histogram has a bell-shaped appearance, which suggests that the data could be normally
distributed.
# Draw a Q-Q plot
qqnorm(model$residuals, pch=16,cex=0.5)
qqline(model$residuals)
For the error (residual) terms to be normally distributed, the Q-Q (Quantile-Quantile) plot
should show a straight-line pattern. In the above plot, the points do show a straight-line
pattern. This indicates that the error terms are normally distributed.
Data that are measured on the nominal or ordinal scales are usually summarized in the form
of tables of counts. In such tables the observations are allocated to categories/combinations of
categories and the number of observations in each category/combination of categories
determined. When observations are allocated to two or more combinations of categories, the
resulting table of counts is referred to as a cross-classification table (cross tab) or contingency
table.
Examples
Table 12.1 – Marital status of a group of people
Status Frequency
Married 200
Widowed 47
Divorced 84
Separated 59
Never 110
married
Total 500
Table 12.2 – Educational status of a group of people that are 21 years or older
Status Frequency
Primary school or less 53
Less than grade 12 105
Grade 12 239
Diploma/Certificate 114
Degree 152
Post graduate 37
Total 700
Table 12.4 – Newspaper read by occupation
               newspaper
occupation     G&M   Post   Star   Sun   Total
Blue collar    27    18     38     37    120
White collar   29    43     21     15    108
Professional   33    51     22     20    126
Total          89    112    81     72    354
1 In table 12.1 the factor “marital status” is measured on the nominal scale of measurement.
2 In table 12.2 the factor “educational status” is measured on the ordinal scale of
measurement.
3 In table 12.3 the row factor (diet) is measured on the nominal scale and the column factor
(health) on the ordinal scale.
4 In table 12.4 both the row and column factors are measured on the nominal scale.
5 Both tables 12.3 and 12.4 are cross tabs but they differ in the sense that in table 12.3 the
row totals (margins) are fixed in advance (80 and 70), while in table 12.4 neither the row nor
the column totals are fixed in advance.
Category         1    2    ...   k
Count (number)   n1   n2   ...   nk
In the above table n1 + n2 + ⋯ + nk = ∑ni = n. Let p1, p2, ..., pk denote the probabilities
associated with the different cells. The hypothesis that these probabilities follow a particular
pattern is to be tested.
Step 2: Calculate the test statistic χ0² = ∑i (ni − ei)²/ei, where ei is the expected count in
cell i under H0.
Step 3: State the level of significance α and determine the critical value(s) and critical
region. This is based on χ0² ~ χ² with k − 1 degrees of freedom.
Using k − 1 df., the critical region is R = { χ0² | χ0² > χ²1−α }.
Step 4: If χ0² lies in the critical region, reject H0, otherwise do not reject H0.
Example
The number of births (per 10 000) by day of the week in the USA in 1971 is given below.
Test whether all 7 days of the week are equally likely for childbirth.
H0: p1 = p2 = ⋯ = p7 = 1/7
H1: Not all pi's equal to 1/7
                                                                                  Total
ni              52.09     54.46     52.68     51.68     53.83     47.21     44.36     356.31
ei              50.90143  50.90143  50.90143  50.90143  50.90143  50.90143  50.90143  356.31
(ni − ei)²/ei   0.027754  0.248783  0.062146  0.011909  0.168493  0.267707  0.84065   1.627441
From the above table χ0² = 1.627441.
Using k − 1 = 6 df. the critical region is R = { χ0² | χ0² > χ²0.95 = 12.5916 }.
Since χ0² = 1.627441 < 12.5916, H0 is not rejected. [p-value = P(χ² > 1.627441) = 0.9506]
Conclusion: No reason to believe that all 7 days of the week are not equally likely for
childbirth.
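The goodness-of-fit statistic for the births data can be checked with a few lines of Python (a sketch using the counts from the table; variable names are mine):

```python
# Births (per 10 000) by day of the week
counts = [52.09, 54.46, 52.68, 51.68, 53.83, 47.21, 44.36]
k = len(counts)

# Expected count per day under H0: all days equally likely
e = sum(counts) / k  # ≈ 50.90143

# Chi-square goodness-of-fit statistic
chi2_0 = sum((n_i - e) ** 2 / e for n_i in counts)
```

This reproduces χ0² ≈ 1.6274, well below the critical value 12.5916.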
Consider the following contingency table with the row totals decided before hand (fixed).
Let p11, p12, ⋯, p1c and p21, p22, ⋯, p2c denote the probabilities associated with the cells in
the first and second rows respectively. The test for homogeneity is a test that the probability
patterns in the two rows are the same, i.e. H0: p11 = p21, p12 = p22, ⋯, p1c = p2c.
e21 = r2 p̂1 = r2c1/n, e22 = r2 p̂2 = r2c2/n, ⋯, e2c = r2 p̂c = r2cc/n. Therefore, the general
formula for the expected frequencies assuming H0 to be true is
eij = ri cj / n, i = 1, 2; j = 1, 2, ⋯, c.
Test of homogeneity (one margin fixed) –Test for equality of k proportions
Step 1: State the null and alternative hypotheses
H0: p11 = p21, p12 = p22, ⋯, p1c = p2c
H1: Not all the probabilities in the two rows are equal
Step 2: Calculate the test statistic χ0² = Σ(i=1 to 2) Σ(j=1 to c) (nij − eij)²/eij.
Step 3: State the level of significance α and determine the critical value(s) and critical
This is based on χ0² having a χ² distribution with c − 1 df when H0 is true.
Using c − 1 df, the critical region is R = {χ0² | χ0² > χ²1−α}.
Step 4: If χ0² lies in the critical region, reject H0, otherwise do not reject H0.
Example
          health 1   health 2   health 3   Total
Diet A       37         24         19        80
Diet B       17         33         20        70
Total        54         57         39       150
Test at the 5% level of significance whether the health proportions are the same for diets A
and B.
Refer to the diets A and B by i=1 , 2 and the health states by j=1,2,3 .
H0: p11 =p 21 , p12= p22 , p13= p23 (Diet proportions are the same)
H1: Not all proportions in the two rows are equal (Diet proportions are not all the same)
The table of expected frequencies is shown below.

          health 1   health 2   health 3   Total
Diet A      28.8       30.4       20.8       80
Diet B      25.2       26.6       18.2       70
Total       54         57         39        150

Each cell entry in the above table is cell row total × cell column total / 150, e.g.,
28.8 = 80×54/150, 30.4 = 80×57/150 etc.
χ0² = Σ(i=1 to 2) Σ(j=1 to 3) (nij − eij)²/eij
    = (37 − 28.8)²/28.8 + (24 − 30.4)²/30.4 + (19 − 20.8)²/20.8
      + (17 − 25.2)²/25.2 + (33 − 26.6)²/26.6 + (20 − 18.2)²/18.2
    = 8.2240.
Using c − 1 = 2 df, the critical region is R = {χ0² | χ0² > χ²0.95 = 5.9915}.
Since χ0² = 8.2240 > 5.9915, H0 is rejected, and it is concluded that the health proportions
are not the same for diets A and B.
The multiple bar chart below shows that following diet A leads to a better health than
following diet B.
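As a check (an addition to these notes; it assumes scipy is installed), the same statistic can be obtained with scipy. The diet A counts (37, 24, 19) appear in the chi-square terms above, and the diet B counts follow from the fixed row total 70 and the column totals 54, 57 and 39:

```python
from scipy.stats import chi2_contingency

# Observed counts: rows = diets A and B, columns = health states 1..3
table = [[37, 24, 19],
         [17, 33, 20]]

# For tables larger than 2x2 no continuity correction is applied,
# and chi2_contingency uses the same expected frequencies eij = ri*cj/n
stat, p, df, expected = chi2_contingency(table)
print(round(stat, 4), df, round(p, 4))
```

The statistic is about 8.224 with 2 df, so H0 is rejected at the 5% level, in line with the bar chart.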
12.4 Test of independence of row and column factors (neither margin fixed)
As for the test described in the previous section, this test is also based on a two-factor table of
counts, but in the construction of the table neither the row nor the column totals are fixed. The
table can have any number of rows (say r) and any number of columns (say c) and has the
following appearance.
The hypotheses to be tested are H0: Row and column factor are independent
H1: Row and column factor are not independent
Step 4: If χ0² lies in the critical region, reject H0, otherwise do not reject H0.
Example
By using the counts in table 12.4 (see below), test at the 1% level of significance whether
occupation and newspaper are independent.
                          newspaper
occupation      G&M    Post    Star    Sun     Total
Blue collar      27     18      38      37     r1 = 120
White collar     29     43      21      15     r2 = 108
Professional     33     51      22      20     r3 = 126
Total          c1 = 89 c2 = 112 c3 = 81 c4 = 72  n = 354
Using eij = ri cj / 354, i = 1, 2, 3; j = 1, 2, ⋯, 4, the following table of expected frequencies is
obtained.
From the above tables of observed and expected frequencies it follows that
χ0² = Σ(i=1 to 3) Σ(j=1 to 4) (nij − eij)²/eij = 32.5726.
Since χ0² = 32.5726 > χ²0.99 = 16.8119, H0 is rejected, and it is concluded that occupation and
newspaper are not independent.
[p-value = P(χ² > 32.5726) = 1.267×10⁻⁵]
The multiple bar chart below shows that the Star and Sun are mostly read by Blue Collar
workers, while the Post is mostly read by White Collar and Professional workers and G&M
by all 3 groups.
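The test of independence above can also be verified with scipy (an addition to these notes, assuming scipy is installed):

```python
from scipy.stats import chi2_contingency

# Observed counts from table 12.4: rows = occupations, columns = newspapers
table = [[27, 18, 38, 37],   # Blue collar
         [29, 43, 21, 15],   # White collar
         [33, 51, 22, 20]]   # Professional

# df = (r - 1)(c - 1) = 2 x 3 = 6; expected counts are eij = ri*cj/n
stat, p, df, expected = chi2_contingency(table)
print(round(stat, 4), df)
```

The statistic is about 32.57 with 6 df and a p-value well below 0.01, confirming the rejection of independence.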
Parametric tests are based on assumptions about the distributions from which the sample(s),
used in the testing procedure, are drawn. For example, the one-sample t test is based on a
sample being drawn from a population which is normally distributed. Because the
distribution from which the sample is taken is specified by the values of two parameters, μ
and σ², the t test is a parametric procedure.
When using nonparametric tests, no assumptions concerning the distributions from which the
samples are drawn, or their parameters are made. For this reason, the term “Distribution
Free” is sometimes also used to refer to these tests. These tests will be used when there is
some doubt about the validity of the assumptions that are made for parametric tests.
This test is the nonparametric equivalent of the one sample t test or the t test for data
involving matched pairs.
The critical region can be written down by specifying the level of significance α and looking
up critical values from the tables of the signed rank distribution.
(i) For alternatives H1a and H1b the critical region is R = {T0 | T0 < Tα}.
(ii) For alternative H1c the critical region is R = {T0 | T0 < Tα/2}.
Step 4: If the test statistic is in the critical region reject H0 , otherwise do not reject H0.
Step 5: State conclusion in terms of the original problem.
Comment: Since T is a discrete random variable it might not always be possible to find the
critical value Tα such that P(T < Tα) = α. Therefore, the critical value Tc is defined as the
maximum value such that P(T < Tc) ≤ α. This applies to all tests where the critical values
are determined from discrete probability distributions.
Example
The following are the scores ( x) obtained by 12 randomly selected people in a standard
aptitude test: 107 113 108 128 146 103 109 118 111 119 155 140
(a) Represent this data in the form of a dot plot. Comment on its appearance.
(b) By performing the Wilcoxon signed rank test, test at the 5% level of significance
whether the median could be 120.
(a) The dot plot below suggests that the distribution of scores is positively skewed. For this
reason, there is some doubt about the assumption that the scores are normally distributed.
Therefore, a nonparametric test is preferred to a t test.
H0: m=120
H1: m≠120
x     x − 120   |x − 120|   rank   sign
107     -13        13         8     -
113      -7         7         3     -
108     -12        12         7     -
128       8         8         4     +
146      26        26        11     +
103     -17        17         9     -
109     -11        11         6     -
118      -2         2         2     -
111      -9         9         5     -
119      -1         1         1     -
155      35        35        12     +
140      20        20        10     +
From the above table T+ = 4+11+12+10 = 37, T− = 8+3+7+9+6+5+2+1 = 41 and
T0 = min(37, 41) = 37.
From the Wilcoxon signed rank table with n = 12, α = 0.05 (two-tailed) it follows that
T0.025 = 14. Since T0 = 37 > 14, H0 is not rejected, and it is concluded that the median
could be 120.
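The test can be reproduced with scipy (an addition to these notes; for a two-sided test, scipy's wilcoxon reports min(T+, T−) as its statistic, and with n = 12 and no ties it uses the exact distribution):

```python
from scipy.stats import wilcoxon

scores = [107, 113, 108, 128, 146, 103, 109, 118, 111, 119, 155, 140]
d = [x - 120 for x in scores]   # differences from the hypothesised median 120

# Default two-sided test; statistic = min(T+, T-) = 37
stat, p = wilcoxon(d)
print(stat, round(p, 3))
```

The statistic is 37 and the p-value is far above 0.05, so H0 is not rejected, as in the table-based test.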
13.2.2 Wilcoxon signed rank test for the difference between paired (matched) samples
The data consists of pairs of observations ( x i , y i ),i=1, 2 ,⋯, n where the two samples are
matched via some common factor (e.g., family connection, 2 different observations on the
same subject).
The data layout for the experiments described above is shown below.
sample 1      x1             x2             ....    xn
sample 2      y1             y2             ....    yn
difference    d1 = x1 − y1   d2 = x2 − y2   ....    dn = xn − yn
The median of the paired differences of the (x, y) values of the two populations is defined as
md. The signed rank test can be used to test H0: md = d0. When performing this test, no
assumption is made about the population from which the sample is drawn. The test statistic is
calculated in much the same way as that for the signed rank test for the median. The absolute
values of the differences d1 − d0, d2 − d0, ⋯, dn − d0 are ranked from smallest to largest and
the ranks originating from positive and negative differences identified. The test statistic is
T0 = min(T+, T−), where
T+ = sum of ranks from positive differences and
T− = sum of ranks from negative differences.
The critical region can be written down by specifying the level of significance α and looking
up critical values from the tables of the signed rank distribution.
Wilcoxon signed rank test for the difference between paired (matched) samples
Step 1: State null and alternative hypotheses.
H0: md = d0
H1a: md < d0 or H1b: md > d0 or H1c: md ≠ d0.
Step 2: Calculate the test statistic T0.
Step 3: State the level of significance α and determine the critical value(s) and critical
region.
(i) For alternatives H1a and H1b the critical region is R = {T0 | T0 < Tα}.
(ii) For alternative H1c the critical region is R = {T0 | T0 < Tα/2}.
Step 4: If the test statistic is in the critical region reject H0, otherwise do not reject H0.
Step 5: State conclusion in terms of the original problem.
Example
In order to compare the effectiveness of two methods (A and B) for teaching mathematics, 10
randomly selected pairs of twins of school going age were tested. For each of these pairs of
twins one twin was selected at random and assigned to method A, while the other twin was
assigned to method B. After being taught mathematics for 2 months according to the chosen
methods all the twins were given identical mathematics tests. The test scores are shown in the
table below.
Twin        1   2   3   4   5   6   7   8   9   10
Method A   67  80  65  70  86  50  63  81  86  60
Method B   39  75  73  55  74  52  56  72  89  47
Test, at the 5% level of significance, whether method A has higher scores than method B.
H0: md = 0
H1: md > 0
From the above table T+ = 10+3+9+7+4+6+8 = 47, T− = 5+1+2 = 8 and
T0 = min(47, 8) = 8.
From the Wilcoxon signed rank table with n = 10, α = 0.05 (one-tailed) it is found that
T0.05 = 11. Since T0 = 8 < 11, H0 is rejected, and it is concluded that method A leads to
higher scores than method B.
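The paired test can be reproduced with scipy (an addition to these notes; in recent scipy versions the statistic reported for a one-sided alternative is T+, the sum of the positive ranks):

```python
from scipy.stats import wilcoxon

a = [67, 80, 65, 70, 86, 50, 63, 81, 86, 60]   # method A scores
b = [39, 75, 73, 55, 74, 52, 56, 72, 89, 47]   # method B scores

# One-sided test of H1: median difference A - B > 0
stat, p = wilcoxon(a, b, alternative='greater')
print(stat, round(p, 3))
```

The reported statistic is T+ = 47 and the p-value is below 0.05, matching the rejection of H0 above.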
For a sufficiently large sample size (n > 25), the normal approximation to the Wilcoxon
signed rank test can be used. The hypotheses to be tested are the same as those formulated in
the previous two sections. The test statistic used is z0 = (T0 − μT)/σT, where T0 is calculated
as explained in the previous two sections and
μT = n(n+1)/4, σT = [n(n+1)(2n+1)/24]^(1/2).
The critical region is determined from the fact that Z0 is approximately normally distributed.
Step 3: State the level of significance α and determine the critical value(s) and critical
region.
(i) For alternative H1a the critical region is R = {z0 | z0 < Zα}.
(ii) For alternative H1b the critical region is R = {z0 | z0 > Z1-α}.
(iii) For alternative H1c the critical region is R = {z0 | z0 > Z1-α/2 or z0 < Zα/2}.
Step 4: If z0 lies in the critical region, reject H0, otherwise do not reject H0.
Example
Each of 25 subjects was asked to perform a certain task under normal and stress conditions.
Blood pressure readings that were taken under both conditions are shown in the table below.
Do the data present sufficient evidence to indicate higher blood pressure readings during
conditions of stress? Perform the test at the 1% level of significance.
Let di = normali − stressi, i = 1, ⋯, 25 denote the difference between normal and stress blood
pressure readings and md the median of the random variable d.
H0: md = 0
H1: md < 0
person   normal   stress   normal−stress   |normal−stress|   rank   sign
1 115.96 120.55 -4.59 4.59 18 -
2 110.86 107.95 2.91 2.91 11 +
3 122.89 119.45 3.44 3.44 15 +
4 120.38 123.71 -3.33 3.33 13 -
5 117.75 120.45 -2.7 2.7 10 -
6 120.1 119.63 0.47 0.47 1 +
7 123.16 127.95 -4.79 4.79 19 -
8 125.14 125.69 -0.55 0.55 2 -
9 116.31 127.23 -10.92 10.92 25 -
10 121.14 123.23 -2.09 2.09 7 -
11 122.16 123.93 -1.77 1.77 6 -
12 116.71 126.78 -10.07 10.07 24 -
13 126.43 127.72 -1.29 1.29 5 -
14 115.84 117.95 -2.11 2.11 8 -
15 119.23 115.48 3.75 3.75 16 +
16 125.26 129.06 -3.8 3.8 17 -
17 123.43 126.63 -3.2 3.2 12 -
18 119.48 124.91 -5.43 5.43 20 -
19 120.7 122.93 -2.23 2.23 9 -
20 120.65 129.16 -8.51 8.51 22 -
21 127.42 128.15 -0.73 0.73 3 -
22 126.53 123.15 3.38 3.38 14 +
23 120.47 121.22 -0.75 0.75 4 -
24 116.39 123.73 -7.34 7.34 21 -
25 116.59 125.92 -9.33 9.33 23 -
From the above table T+ = 1+11+14+15+16 = 57,
T− = 2+3+4+5+6+7+8+9+10+12+13+17+18+19+20+21+22+23+24+25 = 268 and
T0 = min(57, 268) = 57.
Since n = 25, μT = n(n+1)/4 = 25×26/4 = 162.5,
σT = [n(n+1)(2n+1)/24]^(1/2) = (25×26×51/24)^(1/2) = 37.16517 and
z0 = (T0 − μT)/σT = (57 − 162.5)/37.16517 = −2.839.
The critical region is R = {z0 | z0 < Z0.01 = −2.326}.
Since z0 = −2.839 < −2.326, H0 is rejected, and it is concluded that the blood pressure
readings are higher during conditions of stress.
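The large-sample procedure can be sketched as a small function (an illustration added to these notes; the name signed_rank_z is chosen here, and the differences are the normal − stress values from the table):

```python
import numpy as np
from scipy.stats import rankdata

def signed_rank_z(d):
    """z0 = (T0 - muT)/sigmaT for the Wilcoxon signed rank test
    (normal approximation, no tie correction)."""
    d = np.asarray(d, dtype=float)
    d = d[d != 0]                        # zero differences are dropped
    n = len(d)
    ranks = rankdata(np.abs(d))          # ranks of |d|, smallest to largest
    t0 = min(ranks[d > 0].sum(), ranks[d < 0].sum())
    mu = n * (n + 1) / 4
    sigma = np.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (t0 - mu) / sigma

# Differences (normal - stress) from the blood pressure table
d = [-4.59, 2.91, 3.44, -3.33, -2.70, 0.47, -4.79, -0.55, -10.92, -2.09,
     -1.77, -10.07, -1.29, -2.11, 3.75, -3.80, -3.20, -5.43, -2.23, -8.51,
     -0.73, 3.38, -0.75, -7.34, -9.33]
print(round(signed_rank_z(d), 3))  # -2.839
```

This reproduces z0 = −2.839 obtained by hand above.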
13.2.3 The normal approximation to the Wilcoxon signed rank test with tied ranks
When pairs of observations are tied (have the same value) the differences between them will
be 0. In such cases the 0 values are removed from the data and the absolute values of the
remaining values ranked. When tied values are encountered, average ranks are allocated to
the tied values. The formula for σT² is modified to
σ²tied = n(n+1)(2n+1)/24 − (1/48) Σ(i=1 to g) (ti³ − ti),
where g is the number of groups of tied ranks and ti is the number of tied values in group i.
Example
A B difference
2.5 2 0.5
3.5 1.5 2
2 2 0
1.5 4 -2.5
4 3.5 0.5
3.5 4 -0.5
3 3 0
2.5 2 0.5
4 3.5 0.5
3.5 2.5 1
3.5 3.5 0
2.5 1.5 1
2 2 0
3 3 0
1.5 2.5 -1
1.5 1.5 0
1.5 1.5 0
2 2.5 -0.5
3.5 2.5 1
1.5 1.5 0
3 2 1
3.3 2.8 0.5
1.9 1.4 0.5
2.6 2.6 0
1.7 1.5 0.2
Perform a signed rank test for the difference between the medians for the paired data. Use
α = 0.05.
H 0 :md=0 , H 1 : md ≠0
The 9 tied pairs are deleted from the data. The n = 16 non-zero paired differences, the ranks
of their absolute differences and the signs of the ranks (+ or −) are shown in the above table.
The tied ranks and their group sizes are shown below.
rank         5.5   12
group size     8    5

From the above table t1 = 8, t2 = 5 and
Σ(i=1 to 2) (ti³ − ti) = (8³ − 8) + (5³ − 5) = 624.
μT = 16×17/4 = 68 and σ²tied = 16×17×33/24 − 624/48 = 361.
From the ranked differences T+ = 97 and T− = 39, so T0 = min(97, 39) = 39 and
z0 = (39 − 68)/√361 = −1.526. Since z0 lies between z0.025 = −1.96 and z0.975 = 1.96, H0
cannot be rejected and it is concluded that there is no difference between the medians. The p-
value is 2 × P(Z ≤ −1.526) = 0.127.
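The tie-corrected calculation can be sketched in Python (an illustration added to these notes; signed_rank_z_tied is a name chosen here, and the differences are the A − B values from the table, zeros included):

```python
import numpy as np
from scipy.stats import rankdata

def signed_rank_z_tied(d):
    """Signed rank z with the tie-corrected variance
    sigma_tied^2 = n(n+1)(2n+1)/24 - sum(t^3 - t)/48."""
    d = np.asarray(d, dtype=float)
    d = d[d != 0]                          # remove zero differences first
    n = len(d)
    r = rankdata(np.abs(d))                # average ranks for tied |d|
    t0 = min(r[d > 0].sum(), r[d < 0].sum())
    _, t = np.unique(np.abs(d), return_counts=True)   # tie group sizes
    var = n * (n + 1) * (2 * n + 1) / 24 - ((t**3 - t).sum()) / 48
    return (t0 - n * (n + 1) / 4) / np.sqrt(var)

# Paired differences (A - B) from the table, zeros included
d = [0.5, 2, 0, -2.5, 0.5, -0.5, 0, 0.5, 0.5, 1, 0, 1, 0, 0, -1, 0, 0,
     -0.5, 1, 0, 1, 0.5, 0.5, 0, 0.2]
print(round(signed_rank_z_tied(d), 3))  # -1.526
```

Groups of size 1 contribute t³ − t = 0, so summing over all unique values gives the same correction as summing over the tied groups only.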
This test is the nonparametric equivalent of the two sample tests based on independent
samples. Random samples of sizes n1 and n2 are drawn from populations 1 and 2
respectively, where the populations are labelled such that n1 ≤ n2. If n1 = n2 the labelling of
the populations does not matter. It is assumed that the two populations are identical except
for a difference in location. The hypothesis to be tested is H0: m1 = m2, where m1 and m2 are
the medians for populations 1 and 2 respectively.
1 Pool (join) the two samples and allocate ranks to the observations in the pooled sample.
2 Let R01 denote the sum of the ranks associated with the observations drawn from
population 1 and R02 the sum of the ranks associated with the observations drawn from
population 2. Let w01 = R01 − n1(n1+1)/2 and w02 = R02 − n2(n2+1)/2. The test statistic is
w0 = min(w01, w02).
The critical region can be written down by specifying the level of significance α and looking
up critical values from the tables of the rank sum distribution.
(i) For alternatives H1a and H1b the critical region is R = {w0 | w0 < Wα}.
(ii) For alternative H1c the critical region is R = {w0 | w0 < Wα/2}.
Step 4: If the test statistic is in the critical region reject H0, otherwise do not reject H0.
Step 5: State conclusion in terms of the original problem.
Example
The strengths of two types of papers are to be compared. The one type of paper is made using
a standard process and the other by treating it with a chemical substance. A random sample of
size 10 is selected from each of the two types of paper and the strengths measured. The data
are shown below.
strength   Type
1.21       Standard
1.43       Standard
1.35       Standard
1.51       Standard
1.39       Standard
1.17       Standard
1.48       Standard
1.42       Standard
1.28       Standard
1.4        Standard
1.49       Treated
1.38       Treated
1.67       Treated
1.5        Treated
1.31       Treated
1.29       Treated
1.52       Treated
1.37       Treated
1.44       Treated
1.53       Treated
Test, at the 5% level of significance, whether the treated paper has greater strength than the
standard paper.
Since n1 = n2 = 10, either population can be labelled as population 1. The “standard process”
paper will be labelled population 1 and the “treated process” paper population 2.
H0: m1 = m2
H1: m1 < m2
strength   rank   Type
1.21        2     Standard
1.43       12     Standard
1.35        6     Standard
1.51       17     Standard
1.39        9     Standard
1.17        1     Standard
1.48       14     Standard
1.42       11     Standard
1.28        3     Standard
1.4        10     Standard
1.49       15     Treated
1.38        8     Treated
1.67       20     Treated
1.5        16     Treated
1.31        5     Treated
1.29        4     Treated
1.52       18     Treated
1.37        7     Treated
1.44       13     Treated
1.53       19     Treated
R01 = 2+12+6+17+9+1+14+11+3+10 = 85; w01 = R01 − n1(n1+1)/2 = 85 − 10×11/2 = 30.
R02 = 15+8+20+16+5+4+18+7+13+19 = 125; w02 = R02 − n2(n2+1)/2 = 125 − 10×11/2 = 70.
Test statistic: w0 = min(w01, w02) = min(30, 70) = 30.
From the Wilcoxon rank sum table with n1 = n2 = 10, α = 0.05 (one-tailed) it is found that
W0.05 = 27.
Since w0 = 30 > 27, H0 is not rejected, and it is concluded that there is no evidence that the
treated paper has greater strength than the standard paper.
It should be noted here that H0 is close to being rejected and that when larger or different
samples are used the conclusion might change.
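The rank sum test corresponds to the Mann-Whitney U test, and w01 is exactly the U statistic for the first sample. A check with scipy (an addition to these notes, assuming scipy is installed):

```python
from scipy.stats import mannwhitneyu

standard = [1.21, 1.43, 1.35, 1.51, 1.39, 1.17, 1.48, 1.42, 1.28, 1.40]
treated  = [1.49, 1.38, 1.67, 1.50, 1.31, 1.29, 1.52, 1.37, 1.44, 1.53]

# U for the first sample equals w01 = R01 - n1(n1+1)/2 = 30
stat, p = mannwhitneyu(standard, treated, alternative='less')
print(stat, round(p, 3))
```

The p-value is slightly above 0.05, consistent with w0 = 30 being just above the critical value 27.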
When the sample sizes are both greater than 10, the normal approximation to the rank sum
test can be used. The hypotheses to be tested are the same as those formulated in the previous
section. The test statistic used is z0 = (w0 − μW)/σW, where w0 is calculated as explained in
the previous section and
μW = n1n2/2, σW = [n1n2(n1+n2+1)/12]^(1/2).
The critical region is determined from the fact that Z0 is approximately normally distributed.
(i) For alternatives H1a and H1b the critical region is R = {z0 | z0 < Zα}.
(ii) For alternative H1c the critical region is R = {z0 | z0 < Zα/2}.
Step 4: If z0 lies in the critical region, reject H0, otherwise do not reject H0.
Example
Fifteen experimental batteries were selected at random from a lot at pilot plant A, and 15
standard batteries were selected at random from production at plant B. All 30 batteries were
simultaneously placed under an electrical load of the same magnitude. The first battery to fail
was an A, the second a B, the third a B, and so on. The following sequence shows the order
of failure for the 30 batteries:
ABBBABAABBBBABABBBBAABAAABAAAA
Using the large-sample theory for the rank sum test, determine whether there is sufficient
evidence to conclude that the lengths of life for the experimental batteries tend to be greater
than the lengths of life for the standard batteries. Use α = .05.
Denote the plant A batteries as population 1 and the plant B ones as population 2. Since the 2
samples are of equal size, n1 = n2 = 15.
H0: m1 = m2
H1: m1 > m2
The rank of each battery is its position in the failure sequence, so that
R01 = 1+5+7+8+13+15+20+21+23+24+25+27+28+29+30 = 276 and
R02 = 30×31/2 − 276 = 189.
w01 = R01 − n1(n1+1)/2 = 276 − 15×16/2 = 276 − 120 = 156
w02 = R02 − n2(n2+1)/2 = 189 − 15×16/2 = 189 − 120 = 69
w0 = min(w01, w02) = min(156, 69) = 69
Since n1 = n2 = 15, μW = n1n2/2 = 15×15/2 = 112.5,
σW = [n1n2(n1+n2+1)/12]^(1/2) = (15×15×31/12)^(1/2) = 24.10913 and
z0 = (w0 − μW)/σW = (69 − 112.5)/24.10913 = −1.8043.
Since z0 = −1.8043 < Z0.05 = −1.645, H0 is rejected, and it is concluded that the lengths of
life for the experimental batteries tend to be greater than those of the standard batteries.
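The whole calculation can be carried out directly from the failure sequence (a sketch added to these notes; variable names are chosen here):

```python
import math

seq = "ABBBABAABBBBABABBBBAABAAABAAAA"   # order of failure of the 30 batteries
n1 = n2 = 15

# The rank of each battery is its position in the failure order
r01 = sum(i + 1 for i, c in enumerate(seq) if c == "A")
r02 = sum(i + 1 for i, c in enumerate(seq) if c == "B")

w01 = r01 - n1 * (n1 + 1) / 2
w02 = r02 - n2 * (n2 + 1) / 2
w0 = min(w01, w02)

mu = n1 * n2 / 2
sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z0 = (w0 - mu) / sigma
print(w01, w02, round(z0, 4))  # 156.0 69.0 -1.8043
```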
When tied observations are present in the two samples the following modification to the
testing procedure is needed: σW is replaced by σties, where
σ²ties = n1n2(n+1)/12 − n1n2 Σ(k=1 to K) (tk³ − tk) / (12 n(n−1)).
In the above formula n = n1 + n2, K is the number of groups of unique ranks with ties and tk is
the number of tied ranks in group k.
Example
The earthquake magnitudes in Chile according to the location (ocean or land) are shown in
the table below.
Test whether the magnitudes of ocean earthquakes are greater than those of land ones. Use
α =0.10 .
magnitude   location   rank
41 land 3.5
43 land 7
43 land 7
43 land 7
44 land 11
44 land 11
45 land 13.5
46 land 15
50 land 17
51 land 18.5
51 land 18.5
39 ocean 1
40 ocean 2
41 ocean 3.5
43 ocean 7
43 ocean 7
44 ocean 11
45 ocean 13.5
48 ocean 16
54 ocean 20
63 ocean 21
68 ocean 22.5
68 ocean 22.5
R01 = 3.5+7+7+7+11+11+13.5+15+17+18.5+18.5 = 129 and R02 = 23×24/2 − 129 = 147.
n1 = 11, n2 = 12.
w01 = 129 − 11×12/2 = 63, w02 = 147 − 12×13/2 = 69, so w0 = min(63, 69) = 63.
μW = n1n2/2 = 11×12/2 = 66.
The tied ranks and their group sizes are shown below.
rank         3.5    7    11   13.5   18.5   22.5
group size    2     5     3     2      2      2
From the above table Σ(k=1 to K) (tk³ − tk) = 6 + 120 + 24 + 6 + 6 + 6 = 168, so that
σ²ties = n1n2(n+1)/12 − n1n2 Σ(tk³ − tk)/(12 n(n−1))
       = 11×12×24/12 − 11×12×168/(12×23×22) = 260.3478.
z0 = (w0 − μW)/σties = (63 − 66)/√260.3478 = −0.186.
Since z 0=−0.186 > z 0.10=−1.282 , H 0 cannot be rejected and it is concluded that the medians
are probably equal. p-value = P ( Z ≤−0.186 )=0.426 > 0.10.
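The tie-corrected rank sum calculation can be sketched as a function (an illustration added to these notes; rank_sum_z_tied is a name chosen here, and x1 should be the smaller sample):

```python
import numpy as np
from scipy.stats import rankdata

def rank_sum_z_tied(x1, x2):
    """Rank sum z with the tie-corrected variance
    sigma_ties^2 = n1*n2*(n+1)/12 - n1*n2*sum(t^3 - t)/(12*n*(n-1))."""
    n1, n2 = len(x1), len(x2)
    n = n1 + n2
    pooled = np.concatenate([x1, x2])
    r = rankdata(pooled)                     # average ranks for ties
    w1 = r[:n1].sum() - n1 * (n1 + 1) / 2
    w2 = r[n1:].sum() - n2 * (n2 + 1) / 2
    w0 = min(w1, w2)
    _, t = np.unique(pooled, return_counts=True)   # tie group sizes
    var = (n1 * n2 * (n + 1) / 12
           - n1 * n2 * (t**3 - t).sum() / (12 * n * (n - 1)))
    return (w0 - n1 * n2 / 2) / np.sqrt(var)

land  = [41, 43, 43, 43, 44, 44, 45, 46, 50, 51, 51]
ocean = [39, 40, 41, 43, 43, 44, 45, 48, 54, 63, 68, 68]
print(round(rank_sum_z_tied(land, ocean), 3))  # -0.186
```

As in the signed rank case, groups of size 1 contribute t³ − t = 0, so summing over all unique pooled values gives the same correction as summing over the tied groups only.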
Additional reading:
https://data.library.virginia.edu/the-wilcoxon-rank-sum-test/
Exercises
The data to be sorted should be in a single column. If the data values are not in a single
column, move the values into a single column (see explanation A2 above).
Suppose the data are in cells A1 to A80.
1 Type =countif($A$1:$A1,A1) in cell B1.
2 Position the cursor in the bottom right hand corner of cell B1 and drag it down to cell B80.
3 Highlight the data in A1 to B80 and select Insert – Charts – Scatter (top of excel sheet)
and click on the Scatter icon to produce the dot plot.
A5 To do a Pie chart.
1 Type the names of the components in the chart in column A and their corresponding sizes
in column B in an excel sheet.
2 Highlight the data in columns A and B in the excel sheet.
3 Go to the top of the excel sheet and select Insert. Then click on the pie chart (round) icon.
A6 Q-Q plot (See notes)
B Excel add-in Data Analysis
See notes for making the add-in available in excel.
B1 Descriptive Statistics
Highlight the data in excel.
Data – Data Analysis – Descriptive Statistics: In the window that appears, specify input range
(cells with data) and first cell of output. Select Summary Statistics, Confidence level for
mean, kth largest and kth smallest.
B2 Data Analysis (Analysis Tools) – sampling
Data to be sampled put in a single column
Data – Data Analysis – Sampling. Specify input range (data to be sampled), number of
samples and output range (cell containing first value of output).
B3 Data Analysis (Analysis Tools) (t-test – paired two sample, two sample equal variances,
two sample unequal variances).
B4 Data Analysis (Analysis Tools) – Correlation, Regression. Data in two columns (x and y
columns).
B5 Data Analysis (Analysis Tools) – Anova: Single Factor. Data in 2 or more columns.
C Excel functions
=max(cells with data) – maximum value.
=min(cells with data) – minimum value.
=frequency(cells with data, cells with boundaries) – frequency distribution counts.
=sum(cells with data)
=sumsq(cells with data) – sum of squares.
=sumproduct(cells with data 1, cells with data 2) – sum of products of two data sets.
=log(value) gives logarithm with base 10.
=ln(value) gives logarithm with base e (natural logarithm).
=average(cells with data) – mean.
=stdev(cells with data) – standard deviation.
=var(cells with data) – variance.
=percentile(cells with data, percent/100) – percentile value for a given percentage. Special
cases are the 1st decile (percent = 10), 2nd decile (percent = 20), 1st quartile (percent = 25),
3rd decile (percent = 30), 4th decile (percent = 40), the median (percent = 50), 6th decile
(percent = 60), 7th decile (percent = 70), 3rd quartile (percent = 75), 8th decile (percent = 80)
and 9th decile (percent = 90).
multiplication (*), division (/), exponentiation (^), sqrt (square root)
=BINOM.DIST(x,n, p, cumulative =TRUE or cumulative =FALSE)
=combin(n, i) where n – number selected from and i – number selected
=fact(n) – factorial, n! = n(n−1)(n−2)⋯2·1.